<a id="toc_section"></a>
## Contents of this notebook

[**Raghav Rastogi**](https://www.kaggle.com/raghavrastogi75) 


* [Data import](#1)
* [Checking NULL values](#2)
* [Exploratory Data Analysis](#3)
* [Excluding the test set before further analysis](#4)
* [Continuing the EDA](#5)
    - [Visualising the Geographical data](#6)   
    - [Correlation of the data](#7)
    - [Scatter plot for highly correlated attributes](#8)
* [Data Preparation](#9)
* [Machine Learning model selection and training](#10)
* [Evaluation and RMSE](#11)
* [Conclusion](#12)
   

I followed the book 'Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow' by Aurélien Géron and applied its approach to this dataset. I highly recommend it if you are a beginner. If you have a question or feedback, do not hesitate to write, and if you find this kernel helpful, please <b><font color="orange">do not forget to </font><font color="green">UPVOTE </font></b> 🙂

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <span id="1"></span> Data import
Let's import the data and have a look at some rows and data types

In [None]:
housing = pd.read_csv('/kaggle/input/housesalesprediction/kc_house_data.csv')
print(housing.info())
#housing.head()

# <span id="2"></span> Checking Null Values
Let's check whether there are any missing values in the data.

In [None]:
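import seaborn as sns
# isna() flags missing cells; sum() counts them per column
# (count() would only return the number of rows)
housing.isna().sum()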

There are no NULL values in the data set. There are, however, many zeros, which is fine. Had there been NULL values, there would be three ways to handle them:

1. Remove the column itself if it is not important
2. Remove the rows with missing values if there are very few of them
3. Replace the NULL values with the median or mean or '0' in case of numerical columns.
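
The three options above can be sketched on a small toy frame. The column names and values here are hypothetical and are not taken from the housing data; this is only an illustration of the pandas calls involved.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns (not the King County data).
df = pd.DataFrame({
    "sqft": [1000.0, 1500.0, np.nan, 2000.0],
    "notes": [np.nan, np.nan, np.nan, 1.0],
    "price": [200.0, 300.0, 250.0, 400.0],
})

# Count missing values per column.
missing_per_column = df.isna().sum()

# Option 1: drop a column that is mostly empty and not important.
dropped_column = df.drop(columns=["notes"])

# Option 2: drop the few rows that miss a value in a key column.
dropped_rows = df.dropna(subset=["sqft"])

# Option 3: fill missing numeric values with the column median.
filled = df.fillna({"sqft": df["sqft"].median()})

print(missing_per_column)
```

Which option fits best depends on how many values are missing and how much the column matters for the prediction.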

The columns 'yr_renovated', 'sqft_basement', 'view' and 'waterfront' contain a large number of zeros. We should check how important they are using correlation.

In [None]:
#removing id and date as they are not important for prediction
housing = housing.drop(columns = ['id','date'])
#print(housing.shape)
                        

# <span id="3"></span> Exploratory Data Analysis
Let's take a look at the data to learn more about it and gain some insights before trying to predict the price.

In [None]:
#looking at the data types
housing.dtypes


All of the datatypes are either integers or floats

In [None]:
housing.describe()

Let's plot a histogram to get a better feel of the data

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

We first observe that some attributes are categorical and some are continuous. For example, 'view' and 'condition' are categorical. Also, many histograms are tail-heavy: they extend much farther to the right of the median than to the left. This can make it harder for ML algorithms to detect patterns. Standardisation, used later, puts the features on a common scale but does not make them symmetrical; a log transform is the usual remedy for a tail-heavy attribute, and one is applied to the price further down.
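
A quick check on synthetic data suggests how scaling and skew interact: standardising a tail-heavy sample leaves its skewness unchanged, while a log transform reduces it. The sample below is generated for illustration only and is not one of the housing columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for a tail-heavy feature.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))

# Standardisation: subtract the mean, divide by the standard deviation.
standardised = (values - values.mean()) / values.std()
# Log transform: compresses the long right tail.
logged = np.log1p(values)

skew_raw = values.skew()
skew_standardised = standardised.skew()
skew_logged = logged.skew()
print(skew_raw, skew_standardised, skew_logged)
```

Skewness is unaffected by shifting and positive rescaling, which is why the first two numbers agree.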

# <span id="4"></span> Excluding the test set before further analysis

I am doing this to ensure that I have no insight whatsoever about the test set and this will help in getting completely unbiased results in the end.

We use a stratified shuffle split so that the training set and test set have the same distribution of a chosen attribute as the full data. For example, if the data were 60% male and 40% female, both sets would keep that 60%/40% ratio. This makes the test set much more representative of the full data.
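
As a minimal sketch of this idea, the toy example below splits 1,000 rows with a 60%/40% group ratio and checks the ratio in the test set. The `group` and `value` columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 60% of rows in group "a", 40% in group "b" (hypothetical labels).
toy = pd.DataFrame({
    "group": ["a"] * 600 + ["b"] * 400,
    "value": np.arange(1000),
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, toy["group"]))

# split() yields positional indices, so iloc is the matching accessor.
toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# Share of each group in the test set.
test_share = toy_test["group"].value_counts(normalize=True)
print(test_share)
```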


Let's choose 'yr_built' as the criterion for the stratified shuffle split, as it looks like a well-distributed attribute. Plotting it gives a view of the distribution.

In [None]:
housing['yr_built'].hist()


Bucketing the years and looking at the number of houses in each bucket.

In [None]:
housing["year_cat"] = pd.cut(housing["yr_built"],bins=[0, 1920, 1940, 1960, 1980,2000, np.inf], labels=[1, 2, 3, 4, 5,6])
housing["year_cat"].value_counts()

As we can see, we have enough values in each bucket. This makes sure that we are not using an uneven distribution as our test set.
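
To make the bucketing explicit, here is a small sketch of how `pd.cut` assigns years to the same bin edges. The sample years are made up; note that the bins are right-inclusive by default, so 1920 falls in bucket 1 and 1921 in bucket 2.

```python
import numpy as np
import pandas as pd

# Hypothetical build years chosen to sit on and around the bin edges.
years = pd.Series([1900, 1920, 1921, 1960, 2000, 2015])

buckets = pd.cut(
    years,
    bins=[0, 1920, 1940, 1960, 1980, 2000, np.inf],
    labels=[1, 2, 3, 4, 5, 6],
)
bucket_list = buckets.tolist()
print(bucket_list)
```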

Separating the test set from the data

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits = 1, test_size = 0.2, random_state = 42)
for train_index,test_index in split.split(housing, housing["year_cat"]):
    # split() yields positional indices, so iloc is the matching accessor
    strat_train_set = housing.iloc[train_index]
    strat_test_set = housing.iloc[test_index]
# looking at the percentage wise distribution of bucket of years
strat_test_set["year_cat"].value_counts() / len(strat_test_set)
    

# <span id="5"></span> Continuing the EDA
Now that we are done separating the train and test sets, let's explore the train set further.

In [None]:
housing = strat_train_set.copy()


## <span id="6"></span> Visualising the geographical data

In [None]:
# alpha must lie between 0 and 1; a low value makes dense areas stand out
housing.plot(kind = 'scatter', x = 'long', y = 'lat', alpha = 0.2, figsize = (15,15),c = 'price',colorbar = True,cmap=plt.get_cmap("cool"))


We can roughly see the boundaries of the region and where the density of houses is higher. The colour scale also shows which areas have higher prices.

## <span id="7"></span> Correlation of the data


In [None]:
corr_matrix = housing.corr()
plt.figure(figsize = (10,10))
s = corr_matrix['price'].sort_values(ascending = False)
print(s)
s.plot.bar()


We observe that 'sqft_living' has the highest correlation with the price of the house, which seems natural. It is followed by 'grade' and 'sqft_above'.
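
One thing to keep in mind when reading these numbers is that the default correlation coefficient (Pearson) only measures linear association. The sketch below, on synthetic data with hypothetical column names, computes the coefficient by hand and shows that a perfectly quadratic relationship can still have a correlation of about zero.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 501)  # symmetric around zero

toy = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + rng.normal(scale=0.5, size=x.size),
    "quadratic": x ** 2,
})

# Pearson correlation of every column with x.
corr_with_x = toy.corr()["x"]

# The same coefficient by hand: covariance divided by both standard deviations.
manual = np.cov(toy["x"], toy["linear"])[0, 1] / (toy["x"].std() * toy["linear"].std())

print(corr_with_x)
```

A low correlation with the price therefore does not by itself prove that an attribute is useless.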

## <span id="8"></span> Scatter plot for highly correlated attributes
Let's now have a look at the scatter plot with the price and other highly correlated attributes

In [None]:
import seaborn as sns
attributes = ['price','sqft_living','grade','sqft_above','sqft_living15']
housing_at = housing.loc[:,attributes]
#print(housing_at)
sns.pairplot(housing_at)
plt.show()




We can now clearly observe the linear relationship between these attributes and the price.

## Looking at the most correlated attribute even closer
Let's plot the price against 'sqft_living', the most correlated attribute, on its own.

In [None]:
plt.figure(figsize = (15,15))
sns.scatterplot(x = housing['price'], y = housing['sqft_living'])

We do have some outliers but they are along the same trend. So we are good to keep these. 

# <span id="9"></span> Data Preparation
Let's now prepare the data to perform Machine Learning.

In [None]:
housing = strat_train_set.drop("price", axis=1)
housing_labels = strat_train_set["price"].copy()

Let's now scale the data using StandardScaler, which subtracts the mean so that each feature has a mean of 0 and then divides by the standard deviation so that it has unit variance. We use a pipeline for its implementation.
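
The transformation can be checked by hand on a tiny array. This is only a sketch with made-up numbers; it relies on the fact that StandardScaler uses the population standard deviation, which is NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 600.0]])

scaled = StandardScaler().fit_transform(X)

# Manual version: z = (x - mean) / std, computed per column.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
```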

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([('std_scaler', StandardScaler())])

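# fit the scaler on the training features only, then keep the column names and index
housing_std_train = num_pipeline.fit_transform(housing)
housing_prepared = pd.DataFrame(housing_std_train, columns=housing.columns, index=housing.index)
# log-transform the tail-heavy target; predictions are therefore on the log1p(price) scale
housing_labels = np.log1p(housing_labels)
housing_prepared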

# <span id="10"></span> Machine learning model selection and training
Now it's time to finally select and train a model

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
param_grid = [
{'n_estimators': [25,50], 'max_features': [8 ,10, 15]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,scoring='neg_mean_squared_error',return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

Looking at the best parameters

In [None]:
grid_search.best_params_

We can keep adjusting max_features and n_estimators in the grid to look for a better result.
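
Before widening the grid, it helps to know how much work a search involves. As a rough sketch, scikit-learn's `ParameterGrid` can count the candidate combinations in the grid used above; with 5-fold cross-validation each candidate is trained five times, plus one final refit of the best one.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above.
param_grid = [
    {"n_estimators": [25, 50], "max_features": [8, 10, 15]},
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

n_candidates = len(ParameterGrid(param_grid))  # 2*3 + 1*2*3 combinations
n_fits = n_candidates * 5                      # cv=5 folds per candidate
print(n_candidates, n_fits)
```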

In [None]:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
#print(strat_train_set.columns.tolist())
sorted(zip(feature_importances,housing.columns.tolist()),reverse = True)

As expected we get 'sqft_living' as the best attribute to predict the price of the house.

# <span id="11"></span> Evaluation and RMSE
Evaluating the model on test set

In [None]:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("price", axis=1)
y_test = strat_test_set["price"].copy()
X_test_prepared = num_pipeline.transform(X_test)
y_test_prep = np.log1p(y_test)
final_predictions = final_model.predict(X_test_prepared)
# RMSE on the log1p(price) scale; np.sqrt avoids the deprecated `squared` argument
final_rmse = np.sqrt(mean_squared_error(y_test_prep, final_predictions))
print(final_rmse)

We get a reasonably low RMSE. Note that it is measured on the log1p(price) scale, not in dollars. We can try different models to reduce it further.
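
Because the model is trained on log-transformed prices, an RMSE computed on its raw predictions is in log units. The sketch below uses three hypothetical prices and predictions to show how the same errors look on the log scale and, after `np.expm1`, in the original price units.

```python
import numpy as np

# Hypothetical true prices and model outputs (outputs are on the log1p scale).
y_true = np.array([300_000.0, 450_000.0, 1_200_000.0])
y_pred_log = np.log1p(np.array([320_000.0, 430_000.0, 1_000_000.0]))

# RMSE on the log scale, as reported by the evaluation cell.
rmse_log = np.sqrt(np.mean((y_pred_log - np.log1p(y_true)) ** 2))

# RMSE in price units: undo the log transform before comparing.
rmse_dollars = np.sqrt(np.mean((np.expm1(y_pred_log) - y_true) ** 2))

print(rmse_log, rmse_dollars)
```

The two numbers are not directly comparable, so it is worth stating which scale a reported RMSE refers to.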

## 95% confidence range
Finally, let's compute a 95% confidence interval for the RMSE, this time in the original price units.

In [None]:
from scipy import stats
confidence = 0.95
squared_errors = (np.expm1(final_predictions) - np.expm1(y_test_prep)) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,loc=squared_errors.mean(),scale=stats.sem(squared_errors)))
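
The same calculation can be sketched step by step on synthetic errors. The error values below are randomly generated for illustration and are not results from the model; the interval is built around the mean squared error and then square-rooted to give bounds on the RMSE.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic prediction errors in price units (illustration only).
errors = rng.normal(loc=0.0, scale=50_000.0, size=1_000)
squared_errors = errors ** 2

confidence = 0.95
mean_se = squared_errors.mean()

# t-interval around the mean squared error, using its standard error.
low, high = stats.t.interval(
    confidence,
    len(squared_errors) - 1,
    loc=mean_se,
    scale=stats.sem(squared_errors),
)

rmse = np.sqrt(mean_se)
rmse_low, rmse_high = np.sqrt(low), np.sqrt(high)
print(rmse_low, rmse, rmse_high)
```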

# <span id="12"></span> Conclusion

This is my first attempt at end-to-end EDA and machine learning prediction. Please let me know how I can improve this code further.