# PROBLEM STATEMENT

## Instructions
- Do not delete the cells that contain default code.
- Make sure to run the last cell to save your answers.

- The dataset contains house sale prices for King County, USA.
- It covers homes sold between May 2014 and May 2015.
- Data Source: https://www.kaggle.com/harlfoxem/housesalesprediction

- Columns:
    - id: Unique identifier for a house
    - date: Date house was sold
    - price: Price is prediction target
    - bedrooms: Number of Bedrooms/House
    - bathrooms: Number of bathrooms/House
    - sqft_living: square footage of the home
    - sqft_lot: square footage of the lot
    - floors: Total floors (levels) in house
    - waterfront: Whether the house has a view of a waterfront
    - view: Rating of how good the view from the property is (0 to 4)
    - condition: How good the overall condition is
    - grade: overall grade given to the housing unit, based on King County grading system
    - sqft_above: square footage of the house apart from the basement
    - sqft_basement: square footage of the basement
    - yr_built: Built Year
    - yr_renovated: Year when house was renovated
    - zipcode: zip
    - lat: Latitude coordinate
    - long: Longitude coordinate
    - sqft_living15: Living area (square footage) of the nearest 15 neighbouring houses
    - sqft_lot15: Lot size (square footage) of the nearest 15 neighbouring houses

# STEP #0: LIBRARIES IMPORT


In [67]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# IMPORT DATASET

## Question 1


- Import the dataset 'kc_house_data.csv' and store it in variable `house_df`.
- Note: Use the `ISO-8859-1` encoding when reading the file.
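
As a minimal sketch of how an explicit encoding is passed to `pandas.read_csv`, the snippet below writes a small throwaway CSV and reads it back. The file name `toy.csv`, its contents, and the variable `toy_df` are made up for illustration and are not part of the assignment data.

```python
import os
import tempfile

import pandas as pd

# Write a tiny CSV in ISO-8859-1 so there is something to read back.
# 'toy.csv' and its contents are hypothetical, not the assignment file.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    with open(csv_path, "w", encoding="ISO-8859-1") as f:
        f.write("name,price\ncafé,100\nhouse,200\n")

    # The encoding argument tells pandas how to decode the file's bytes.
    toy_df = pd.read_csv(csv_path, encoding="ISO-8859-1")

print(toy_df.shape)
```

The same `encoding` argument should work for any CSV path, assuming the file really was saved in that encoding.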

In [2]:
house_df=_

## Question 2
Describe the dataset and store the mean values of `bedrooms`, `view`, and `condition`, in that order and rounded to two decimal places, in a list named `desc`.
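
One possible way to pull rounded means out of a `describe()` summary is sketched below on a three-row toy frame. The frame `toy_df`, its values, and the name `toy_means` are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data; the real values come from the assignment dataset.
toy_df = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "view": [0, 0, 1],
    "condition": [3, 4, 5],
})

# describe() returns a dataframe whose 'mean' row holds the column means.
summary = toy_df.describe()

# Collect the means in a fixed column order, rounded to two decimals.
columns_in_order = ["bedrooms", "view", "condition"]
toy_means = [round(float(summary.loc["mean", col]), 2) for col in columns_in_order]

print(toy_means)  # [3.0, 0.33, 4.0]
```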

In [3]:
desc=_

# VISUALIZE DATASET

- Run the code cells below to visualise the dataset and explore its details.

In [None]:
sns.scatterplot(x = 'bedrooms', y = 'price', data = house_df)


In [None]:
sns.scatterplot(x = 'sqft_living', y = 'price', data = house_df)

In [None]:
sns.scatterplot(x = 'sqft_lot', y = 'price', data = house_df)

In [None]:
house_df.hist(bins=20,figsize=(20,20), color = 'r')

In [None]:
f, ax = plt.subplots(figsize=(20, 20))
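# numeric_only=True (pandas >= 1.5) skips non-numeric columns such as `date`,
# which would otherwise raise an error in corr() on pandas >= 2.0
sns.heatmap(house_df.corr(numeric_only=True), annot = True)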

# CREATE TESTING AND TRAINING DATASET/DATA CLEANING

## Question 3

In the given dataset, the `price` column is the label to be predicted.

- Select the following features and store them in the `X` dataframe.
    1. 'bedrooms'
    2. 'bathrooms'
    3. 'sqft_living'
    4. 'sqft_lot'
    5. 'floors'
    6. 'sqft_above'
    7. 'sqft_basement'
    8. 'waterfront'
    9. 'view'
    10. 'condition'
    11. 'grade'
    12. 'sqft_above'
    13. 'yr_built'
    14. 'yr_renovated'
    15. 'zipcode'
    16. 'lat'
    17. 'long'
    18. 'sqft_living15'
    19. 'sqft_lot15'
    
    
- Store the `price` column in `y` (lowercase `y`).
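
A rough sketch of separating feature columns from a label column is shown below. The toy frame and the names `toy_features` and `toy_target` are illustrative assumptions, and only two feature columns are used to keep it short.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy_df = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "sqft_living": [900, 1500, 2100],
    "price": [200000, 350000, 500000],
})

# A list of column names selects several columns and returns a dataframe.
feature_columns = ["bedrooms", "sqft_living"]
toy_features = toy_df[feature_columns]

# A single column name returns a one-dimensional Series.
toy_target = toy_df["price"]

print(toy_features.shape, toy_target.shape)  # (3, 2) (3,)
```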

In [4]:
X=_
y=_

# SPLITTING THE DATASET

## Question 4
- Split the `X` and `y` data generated in the previous step into training and testing sets, with a test size of 20%.
- Variables to be used: `X_train`, `X_test`, `y_train`, `y_test`.
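
As an illustration of an 80/20 split, the sketch below applies scikit-learn's `train_test_split` to ten toy rows. The arrays and the `toy_` names are assumptions, and `random_state=0` is added only to make this example repeatable; the question does not ask for it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 holds out 20% of the rows for testing.
X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

print(X_toy_train.shape, X_toy_test.shape)  # (8, 2) (2, 2)
```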

In [None]:
X_train, X_test, y_train, y_test = _


# TRAINING THE MODEL WITH LINEAR REGRESSION

## Question 5
- Create a Linear Regression model and store it in variable `regressor`.
- Fit the model on the training dataset.
- Try printing the coefficients and the intercept after fitting.
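
To show where the fitted coefficients and intercept live, here is a small sketch on noiseless synthetic data where the true relationship is known. The data and the names `toy_model`, `X_toy`, and `y_toy` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 3*x0 - 2*x1 + 7 with no noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + 7

# fit() estimates the parameters and returns the fitted estimator.
toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ holds the bias term.
print("coefficients:", toy_model.coef_)
print("intercept:", toy_model.intercept_)
```

Because the toy data has no noise, the printed values should land very close to 3, -2, and 7.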


In [None]:
regressor=_

# EVALUATE MODEL

## Question 6
- Use `regressor` to predict the labels of `X_test`.
- Store the predicted values in variable `y_predict`.

In [None]:

y_predict=_

### Run the cells below to visualise how well your Linear Regression model has predicted, along with summary statistics.

In [None]:
plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlim(0, 3000000)
plt.ylim(0, 3000000)

# y_test is plotted on the x-axis and y_predict on the y-axis
plt.xlabel("True Value (Ground Truth)")
plt.ylabel("Model Predictions")
plt.title('Linear Regression Predictions')
plt.show()

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

k = X_test.shape[1]
n = len(X_test)
RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 
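
The adjusted R² line in the cell above can be read as a small standalone function, sketched here. The function name `adjusted_r2` and the example numbers are assumptions for illustration; the formula is the one already used in the cell.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Penalise R2 for the number of predictors used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical example: R2 of 0.70 on 101 samples with 10 features.
example_value = adjusted_r2(0.70, 101, 10)
print(round(example_value, 3))  # 0.667
```

With these numbers the adjustment lowers 0.70 to roughly 0.667, and the gap shrinks as the number of samples grows relative to the number of features.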


# TRAINING AND EVALUATING THE MODEL WITH RIDGE REGRESSION


## Question 7

**Training**
- Create a Ridge Regression model with `alpha` set to 50 and store it in variable `regressor_ridge`.
- Fit the model on the training dataset.
- Try printing the coefficients and the intercept after fitting.

**Evaluation**
- Use `regressor_ridge` to predict the labels of `X_test`.
- Store the predicted values in variable `y_predict`.
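
To give a feel for what the `alpha` penalty does, the sketch below compares ordinary least squares with ridge regression on synthetic data. The data and the `toy`-style names are assumptions, and the comparison is only illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Hypothetical synthetic data: 40 samples, 3 features, some noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 3))
y_toy = X_toy @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=40)

# Fit an unpenalised model and a ridge model on the same data.
ols = LinearRegression().fit(X_toy, y_toy)
ridge = Ridge(alpha=50).fit(X_toy, y_toy)

# The L2 penalty pulls the coefficient vector towards zero.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", ols_norm)
print("Ridge coefficient norm:", ridge_norm)

# Predictions come from predict(), as with any scikit-learn regressor.
toy_predictions = ridge.predict(X_toy[:5])
```

The ridge coefficient norm should come out smaller than the least-squares one; how much smaller depends on `alpha` and on the scale of the features.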


In [None]:
regressor_ridge=_


y_predict=_

### Run the cells below to visualise how well your Ridge Regression model has predicted, along with summary statistics.

In [None]:

plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlim(0, 3000000)
plt.ylim(0, 3000000)

# y_test is plotted on the x-axis and y_predict on the y-axis
plt.xlabel("True Value (Ground Truth)")
plt.ylabel("Model Predictions")
plt.title('Ridge Regression Predictions')
plt.show()

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


# TRAINING AND EVALUATING THE MODEL WITH LASSO REGRESSION

## Question 8

**Training**
- Create a Lasso Regression model with `alpha` set to 500 and store it in variable `regressor_lasso`.
- Fit the model on the training dataset.
- Try printing the coefficients and the intercept after fitting.

**Evaluation**
- Use `regressor_lasso` to predict the labels of `X_test`.
- Store the predicted values in variable `y_predict`.
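
Lasso's L1 penalty can drive some coefficients exactly to zero, which is worth looking for when printing them. The sketch below illustrates this on synthetic data where only the first feature matters; the data, the names, and `alpha=1.0` are assumptions chosen for the scale of this toy example, not for the assignment.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: only the first of three features drives y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 5 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L1 penalty.
lasso = Lasso(alpha=1.0).fit(X_toy, y_toy)

# Irrelevant features are expected to end up with a coefficient of zero.
print("coefficients:", lasso.coef_)
print("intercept:", lasso.intercept_)

toy_predictions = lasso.predict(X_toy[:5])
```

The first coefficient stays positive but is shrunk below its true value of 5, while the other two should be exactly zero.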


In [None]:
regressor_lasso=_

y_predict=_

### Run the cells below to visualise how well your Lasso Regression model has predicted, along with summary statistics.

In [None]:

plt.plot(y_test, y_predict, "^", color = 'r')
plt.xlim(0, 3000000)
plt.ylim(0, 3000000)

# y_test is plotted on the x-axis and y_predict on the y-axis
plt.xlabel("True Value (Ground Truth)")
plt.ylabel("Model Predictions")
plt.title('Lasso Regression Predictions')
plt.show()

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from math import sqrt

RMSE = float(format(np.sqrt(mean_squared_error(y_test, y_predict)),'.3f'))
MSE = mean_squared_error(y_test, y_predict)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)
adj_r2 = 1-(1-r2)*(n-1)/(n-k-1)

print('RMSE =',RMSE, '\nMSE =',MSE, '\nMAE =',MAE, '\nR2 =', r2, '\nAdjusted R2 =', adj_r2) 


# SAVE YOUR ANSWERS

# RUN THE BELOW CELLS TO SAVE YOUR ANSWERS

In [101]:
import pickle

def pickle1(file_name, obj):
    # Serialise `obj` to `file_name` in binary mode
    with open(file_name, 'wb') as f:
        pickle.dump(obj, f)

def pickling():
    # Save one summary value per answer; report any variable that is missing or invalid
    try:
        pickle1('house_df.pickle', house_df.shape[0])
    except Exception:
        print('house_df variable is not defined. Please check the variable')
    try:
        pickle1('desc.pickle', desc)
    except Exception:
        print('desc variable is not defined. Please check the variable')
    try:
        pickle1('Xshape.pickle', X.shape[1])
    except Exception:
        print('X variable is not defined. Please check the variable')
    try:
        pickle1('yshape.pickle', y.shape[0])
    except Exception:
        print('y variable is not defined. Please check the variable')
    try:
        pickle1('Xtrain.pickle', X_train.shape[0])
    except Exception:
        print('X_train variable is not defined. Please check the variable')
    try:
        pickle1('Xtest.pickle', X_test.shape[0])
    except Exception:
        print('X_test variable is not defined. Please check the variable')


pickling()