# Step By Step : From Data Exploration To Model Building

<img src='https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png' alt='houses'>
<p>
    Welcome all 👋<br><br>
    In this **notebook** we will go through the **House Prices data** step by step. This notebook is divided into the following parts 👇 <br>
    <ol>
        <li><b>Data Preprocessing</b></li>
        <li><b>Feature Selection</b></li>
        <li><b>Data Scaling</b></li>
        <li><b>Model Building</b></li>
        <li><b>Model evaluation</b></li>
        <li><b>References</b></li>
    </ol>
</p>

**First, we will import the libraries and load the data 👇**

In [None]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from fancyimpute import KNN
from sklearn.impute import SimpleImputer  #Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDRegressor 
from sklearn.linear_model import Ridge 
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error 
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

print("libraries loaded successfully")

In [None]:
#load data
data_train  = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
print("Data loaded successfully")

# 1. Data Preprocessing

In this section I especially want to thank [Pedro Marcelino](https://www.kaggle.com/pmarcelino),<br>because I learned great things from his kernel [Comprehensive data exploration with Python](https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python#5.-Getting-hard-core).<br><br>Now we will start **Data Preprocessing** by exploring our data.

## 1.1 Data exploration 👇

In [None]:
#explore data
print("data shape : ",data_train.shape)
data_train.describe()

From the table above ☝ we notice that some columns have missing values, such as **LotFrontage** and **MasVnrArea**, because their count is less than 1460.<br>
The columns also have different ranges of values: **MSSubClass** and **LotFrontage** have approximately the same range, but other columns like **LotArea** have a much larger range.<br>

This is only **Big picture** of our data. Now let's deal with **missing data**

## 1.2 Missing data 👇

In [None]:
#count the missing values in each column
total = data_train.isnull().sum().sort_values(ascending=False)

#fraction of missing values in each column (missing count / number of rows)
percent = (data_train.isnull().sum()/data_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(25)

From the table above ☝ we notice that some columns have many missing values: about 0.99 of **PoolQC** is missing<br> and about 0.96 of **MiscFeature** is missing. Other columns such as **Alley** and **Fence** also have a large number of missing values.<br>

### How to Handle Missing Data ? 🙄
One of the most common problems I have faced in data cleaning and exploratory analysis is handling missing values.<br>
This picture gives us a guide for dealing with missing data 👇 <br>
<img src='https://miro.medium.com/max/1528/1*_RA3mCS30Pr0vUxbp25Yxw.png' width="550px" style='float:left;'>
<div style='clear:both'></div>
<br>
As we see in the picture above, there are many ways to deal with missing data. In this **kernel** I will use one technique from each branch.<br><br>
In **Deletion** I will use the **Deleting Columns** technique.<br>

We can drop a variable if its data is missing for more than 60% of the observations, because a column that is mostly empty carries little usable information for the model.
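
To make the 60% rule concrete, here is a minimal sketch on a toy DataFrame. The frame `toy`, its column names and the `threshold` variable are my own assumptions for illustration, they are not part of the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data, not the competition file).
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, 4.0, 5.0],            # 20% missing
    "B": [np.nan, np.nan, np.nan, np.nan, 1.0],   # 80% missing
    "C": ["x", "y", None, "x", "y"],              # 20% missing
})

threshold = 0.60  # assumed cut-off, matching the 60% rule of thumb

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# Columns whose missing fraction exceeds the threshold.
cols_to_drop = missing_ratio[missing_ratio > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(toy_reduced.shape)
```

Selecting the columns from the ratio instead of typing their names keeps the rule and the code in sync if the threshold changes.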

In **Imputation**, because our problem is a general problem, I will use a predictive model to impute the missing data.
Our data contains both **categorical** and **continuous** features, so I will use **KNN (K Nearest Neighbors)** to impute the data
because it can work with both feature types.<br>

I will delete the columns **PoolQC, MiscFeature, Alley, Fence** because more than **60%** of their observations are missing.

In [None]:
#drop PoolQC, MiscFeature, Alley, Fence columns
data_train = data_train.drop(['PoolQC','MiscFeature','Alley','Fence'], axis=1)

#after dropping these columns the data shape will be (1460, 77) instead of (1460, 81)
print("data shape : ",data_train.shape)

Now we will use **KNN (K Nearest Neighbors)** with number of neighbors = 5 to impute the missing values.<br>

The distance metric varies according to the type of data:<br>
1. **Continuous Data:** The commonly used distance metrics for continuous data are Euclidean, Manhattan and Cosine.
2. **Categorical Data:** Hamming distance is generally used in this case.
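
As a small illustration of these distance metrics, here is a minimal sketch with NumPy. The vectors `a`, `b`, `u` and `v` are made-up examples, and this is not the code that the KNN imputer runs internally.

```python
import numpy as np

# Two continuous observations (hypothetical values).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Euclidean distance: square root of the sum of squared differences.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute differences.
manhattan = np.sum(np.abs(a - b))

# Cosine distance: 1 - cosine similarity (0 when the vectors point the same way).
cosine = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two categorical observations (hypothetical category values).
u = np.array(["RL", "Pave", "Reg"])
v = np.array(["RL", "Grvl", "Reg"])

# Hamming distance: fraction of positions where the categories differ.
hamming = np.mean(u != v)

print(euclidean, manhattan, cosine, hamming)
```

Here `b` is exactly twice `a`, so the cosine distance is (numerically) zero even though the Euclidean and Manhattan distances are not.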

In [None]:
#get continuous features
colnames_numerics_only = data_train.iloc[:,1:-1].select_dtypes(include=np.number).columns.tolist()
print('numerical features')
print(colnames_numerics_only)

print("----------------------------------------")

print("number of numerics features = ",len(colnames_numerics_only))

In [None]:
#impute missing values of continuous features using KNN
data_train[colnames_numerics_only] = KNN(k=5).fit_transform(data_train[colnames_numerics_only])
print('missing values of continuous features imputed successfully')

In [None]:
#get categorical features
colnames_categorical_only = data_train.iloc[:,1:-1].select_dtypes(include='object').columns.tolist()
print('categorical features')
print(colnames_categorical_only)

print("----------------------------------------")

print("number of categorical features = ",len(colnames_categorical_only))

For the categorical features I did not find a way to impute them using KNN, so if anyone knows a way to do that I will be thankful 😊<br>
For now I will use my own simple custom imputer. It acts like the simple **sklearn** imputer with **strategy = most_frequent**, but on categorical data. This may not be the best choice.

In [None]:
#custom imputer: fill missing values of each categorical column with its most frequent value
for categorical_col in colnames_categorical_only:
    #value_counts() ignores missing values, so idxmax() returns the most frequent category
    most_frequent = data_train[categorical_col].value_counts().idxmax()

    #replace the missing entries with the most frequent category
    data_train[categorical_col] = data_train[categorical_col].fillna(most_frequent)

print('missing values of categorical features imputed successfully')

**Moment of truth** 😧<br>
Now let's check whether our data still contains any missing values

In [None]:
#print max count number of null values
print('Number of missing values = ',data_train.isnull().sum().max())

**Congratulations 👏 now we don't have any missing values in our data**<br>

Now let's talk about **anomaly detection**, or in other words **outliers**

## 1.3 Outliers 👇
In statistics, an outlier is an observation point that is distant from other observations.<br>
<img src='https://miro.medium.com/max/869/1*N_C1Mhiz8hzZkKrUfjez3A.jpeg' width='300px' style='float:left;'>
<div style='clear:both'></div>
<br>
In the image above ☝ all numbers are in the 30s except the number 3, which is the outlier.<br>

But **why must we detect outliers?** The image below 👇 shows the effect of outliers on the fitted line.<br>
<img src='https://i.imgur.com/1YBK3E1.png' width='300px' style='float:left;'>
<div style='clear:both'></div>
<br>
As we see, the <span style='color:blue'>blue</span> line is the fitted line when the data does not include outliers, while the <span style='color:orange'>orange</span> line is the fitted line when outliers are present. **The difference is clear** 😎

### How to discover outliers ? 👀

We can discover outliers with **visualization tools** or **numerical methods**.<br>

**Discover outliers with visualization tools** 📈<br>

We can use visualization tools to detect outliers, such as
<ul>
    <li><b>Box plot</b></li>
    <li><b>Scatter plot</b></li>
</ul>

**Discover outliers with numerical methods**<br>

We can use numerical methods to detect outliers, such as
<ul>
    <li><b>IQR score</b></li>
    <li><b>Z-Score</b></li>
</ul>

In this kernel I will use **Box plot** and **IQR score** to detect outliers.
#### Box plot 👇
**Wikipedia Definition,**
> **In descriptive statistics,** a box plot is a method for graphically depicting groups of numerical data through their quartiles.<br>**Outliers** may be plotted as individual points. 

Now we loop over some numeric features and draw **Box plot** for each one.

In [None]:
#box plot
cols = ['MSSubClass','LotFrontage','LotArea','OverallQual']
for col in cols:
    plt.figure()
    ax = sns.boxplot(x=data_train[col])

As we see, some of the boxplots above ☝ include outliers, like **MSSubClass, LotFrontage, and OverallQual**. Now we will use the **IQR score** to discover outliers.<br>

#### IQR score
The box plot uses the IQR method to display the data and its outliers.<br>

**Wikipedia Definition,**
> The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles, <code>IQR = Q3 − Q1.</code>

If <code> value < Q1 - 1.5 x IQR </code> or <code> value > Q3 + 1.5 x IQR </code>, the value is considered an **outlier**.<br>
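
As a small worked example of this rule, here is a minimal sketch on a toy series where every number is in the 30s except one. The series `values` is made up for illustration.

```python
import pandas as pd

# Toy data (hypothetical): all values are in the 30s except the number 3.
values = pd.Series([30, 31, 32, 33, 34, 35, 36, 37, 38, 3])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds of the 1.5 x IQR rule.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside the bounds are flagged as outliers.
outliers = values[(values < lower_bound) | (values > upper_bound)]
print(lower_bound, upper_bound)
print(outliers.tolist())
```

Only the number 3 falls outside the bounds, so it is the only value flagged.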
    
Now we calculate the **IQR** of every numeric feature and flag the values that fall outside these bounds.


In [None]:
Q1 = data_train[colnames_numerics_only].quantile(0.25)
Q3 = data_train[colnames_numerics_only].quantile(0.75)
IQR = Q3 - Q1

hasOutlier = (data_train[colnames_numerics_only] < (Q1 - 1.5 * IQR)) | (data_train[colnames_numerics_only] > (Q3 + 1.5 * IQR))
hasOutlier

Now we will clean our data from outliers <span style='font-size:25px;font-weight:bold;'>🗑</span>

In [None]:
#drop every row that has an outlier in at least one numeric feature
data_train = data_train[~hasOutlier.any(axis=1)]

In [None]:
#after dropping the rows which contain outliers, the data will have fewer than 1460 rows
print("data rows number : ",data_train.shape[0])

**Big Congratulations 👏👏 now we don't have any missing values or outliers in our data**<br>

# 2. Feature Selection
Before we go on with feature selection, we will drop the **Id** column first

In [None]:
data_train = data_train.drop('Id', axis=1)
print('Id column deleted successfully')

We can choose the best features if we know the correlation between each feature and the target variable (SalePrice). To do this we can use 👇 <br>
* Correlation matrix
* Scatter Plot

### Correlation matrix 👇 

A correlation matrix is a table showing the correlation coefficients between variables. Each cell has a value between <code>-1 and 1</code>.
<br>
A cell value close to <code>1</code> means a strong positive correlation, a value close to <code>-1</code> means a strong negative correlation, and a value close to <code>0</code> means little or no linear correlation.
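
As a minimal sketch of the two extremes, here is a toy DataFrame where one feature rises exactly with the target and another falls exactly with it. The column names `area`, `age` and `price` are hypothetical and do not come from the House Prices data.

```python
import pandas as pd

# Toy data (hypothetical): price grows linearly with area and shrinks linearly with age.
toy = pd.DataFrame({
    "area":  [50, 60, 70, 80, 90],
    "age":   [40, 30, 20, 10, 0],
    "price": [100, 120, 140, 160, 180],
})

# Pearson correlation matrix of all numeric columns.
corr = toy.corr()

# Correlation of every feature with the target, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)
print(price_corr)
```

Sorting the target column of the matrix like this is one way to rank features before looking at the heatmap.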

In [None]:
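#correlation matrix (numeric columns only, because the categorical columns are still strings at this point)
corrmat = data_train.select_dtypes(include=np.number).corr()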
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)

As we see from the correlation matrix above ☝, the following features are more related to the target variable **(SalePrice)**
<table style='float:left;'>
    <tr>
        <th rowspan='2'>Features Names</th>
        <td style='text-align:center;'>LotFrontage</td>
        <td style='text-align:center;'>LotArea</td>
        <td style='text-align:center;'>OverallQual</td>
        <td style='text-align:center;'>YearBuilt</td>
        <td style='text-align:center;'>YearRemodAdd</td>
        <td style='text-align:center;'>MasVnrArea</td>
        <td style='text-align:center;'>TotalBsmtSF</td>
        <td style='text-align:center;'>1stFlrSF</td>
    </tr>
    <tr>
        <td style='text-align:center;'>GrLivArea</td>
        <td style='text-align:center;'>FullBath</td>
        <td style='text-align:center;'>TotRmsAbvGrd</td>
        <td style='text-align:center;'>GarageYrBlt</td>
        <td style='text-align:center;'>GarageCars</td>
        <td style='text-align:center;'>GarageArea</td>
        <td style='text-align:center;'>WoodDeckSF</td>
        <td style='text-align:center;'>OpenPorchSF</td>
    </tr>
</table>
<div style="clear:both"></div>    
Let's draw scatter plots of these features 📈
    
### Scatter plot 👇
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. **With it** we can see the relationship between a **feature** and the **target variable**.<br>

Now we will draw scatter plots of these features. We will divide them into two groups to keep the plots readable.

In [None]:
#scatterplot
sns.set()
cols = ['LotFrontage','LotArea','OverallQual','YearBuilt','YearRemodAdd','MasVnrArea','TotalBsmtSF','1stFlrSF',
        'GrLivArea','FullBath','TotRmsAbvGrd','GarageYrBlt','GarageCars','GarageArea','WoodDeckSF','OpenPorchSF']

group_1 = cols[0:8]
group_1.insert(0, "SalePrice")

#draw scatter plot of first group
#the 'size' argument of pairplot was renamed to 'height' in seaborn 0.9
sns.pairplot(data_train[group_1], height=2)
plt.show()

In [None]:
group_2 = cols[8:]
group_2.insert(0, "SalePrice")

#draw scatter plot of second group
sns.pairplot(data_train[group_2], height=2)
plt.show()

At this point we have used the **correlation matrix** and **scatter plots** to choose the best features for our model. But these methods don't give any feedback on whether our choices are good or not; they depend only on simple statistical techniques.<br>

So we need a method that tells us: **were we successful in our selection of features?**

For this we can use the **exhaustive feature selection algorithm**, also called the **brute-force feature selection algorithm**.<br>It works the way **grid search** works when choosing the best model parameters. This exhaustive feature selection algorithm is a wrapper approach for brute-force evaluation of feature subsets; the best subset is selected by optimizing a specified performance metric given an arbitrary regressor or classifier.<br>

**For example,**

If we have the features **0, 1, 2** <code>(with min_features=1 and max_features=3)</code>, the combinations will be <br>

1. {0}
2. {1}
3. {2}
4. {0, 1} 
5. {0, 2} 
6. {1, 2}
7. {0, 1, 2}
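
The list above can be reproduced with a minimal sketch based on `itertools.combinations`. This only enumerates the subsets; it is not a full feature selector, and the variable names are my own.

```python
from itertools import combinations

features = [0, 1, 2]
min_features, max_features = 1, 3

# Enumerate every subset whose size is between min_features and max_features.
subsets = [
    set(combo)
    for size in range(min_features, max_features + 1)
    for combo in combinations(features, size)
]

for number, subset in enumerate(subsets, start=1):
    print(number, subset)

# With n features and no size limit there are 2**n - 1 non-empty subsets,
# e.g. a hypothetical set of 16 candidate features gives 65535 subsets.
n_subsets_16 = 2 ** 16 - 1
print(n_subsets_16)
```

Each subset would need its own model fit (times the number of cross-validation folds), which is where the cost comes from.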

The main disadvantage of this algorithm is that it is **time consuming** because of the large number of combinations. For this reason we will not use it in this notebook, since we have a large number of features. Now we will drop the numerical features that we don't use in our model.

In [None]:
# all numerical features in our data
allNumericalFeatures = colnames_numerics_only

# numerical features that we use in our model
selectedNumericalFeatures = cols

# numerical features that we will drop
deletedFeatures =  list(set(allNumericalFeatures) - set(selectedNumericalFeatures))

print("data shape before delete features = ",data_train.shape)

# delete unwanted features
data_train = data_train.drop(deletedFeatures, axis=1)

print("data shape after delete features = ",data_train.shape)

print("unwanted features deleted successfully")

Now let's deal with **Categorical** columns 🛒<br>

We have 3 approaches to preprocess categorical columns 👇
* Drop Categorical Variables
* Label Encoding
* One-Hot Encoding

We will use **Label Encoding** to preprocess our categorical columns.<br>
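
As a small illustration of the difference between the last two approaches, here is a minimal sketch on a toy column. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column (hypothetical values).
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave", "Pave"]})

# Label encoding: every category becomes one integer.
# LabelEncoder sorts the classes, so here Grvl -> 0 and Pave -> 1.
encoder = LabelEncoder()
labels = encoder.fit_transform(toy["Street"])

# One-hot encoding: every category becomes its own indicator column.
one_hot = pd.get_dummies(toy["Street"], prefix="Street")

print(labels)
print(one_hot)
```

Label encoding keeps a single column but introduces an artificial order between categories, while one-hot encoding avoids the order at the cost of more columns.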

Let's start 😊
### LabelEncoder 👇

In [None]:
#convert categorical variables into labels
labelEncoder = LabelEncoder()

for categorical_col in colnames_categorical_only:
    data_train[categorical_col] =  labelEncoder.fit_transform(data_train[categorical_col])
    
print("categorical columns converted successfully")

In [None]:
print(colnames_categorical_only)

# 3. Data Scaling
Now we will scale our data using standardization, which rescales each feature to zero mean and unit variance.
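
As a minimal sketch of what standardization does, the manual formula `(x - mean) / std` is compared below with `StandardScaler` on a toy column. The array `x` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature column (hypothetical values).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Manual standardization: subtract the mean, divide by the (population) standard deviation.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

# The same transformation with scikit-learn.
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(), scaled.std())
```

After the transformation the column has mean 0 and standard deviation 1, up to floating point error.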

Let's go 😊

In [None]:
#data scaling
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
data_train[selectedNumericalFeatures] = scaler.fit_transform(data_train[selectedNumericalFeatures])

print("data scaling successfully")
data_train.describe()

# 4. Model Building
Now we will build our models. We will start with the **Stochastic Gradient Descent Regressor**, which is used for **regression** problems. But before using it we will split our data into train and test sets.

In [None]:
X = data_train.drop('SalePrice', axis=1)
y = data_train['SalePrice']

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.40, random_state=55, shuffle =True)
print('data splitting successfully')

Now we will use **GridSearchCV** to choose the best parameters for our models. We will start with the **stochastic gradient descent** model 💪 
## 1. Stochastic Gradient Descent

In [None]:
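#model building
#the default loss of SGDRegressor is the squared error, so we do not pass it explicitly
#(its name changed from 'squared_loss' to 'squared_error' in scikit-learn 1.0)
SGDRRegModel = SGDRegressor(random_state=55)
SelectedParameters = {
                      'alpha':[0.1,0.5,0.01,0.05,0.001,0.005],
                      'max_iter':[100,500,1000,5000,10000],
                      'tol':[0.0001,0.00001,0.000001],
                      'penalty':['l1','l2',None,'elasticnet']
                      }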

GridSearchModel = GridSearchCV(SGDRRegModel,SelectedParameters, cv = 5,return_train_score=True)
GridSearchModel.fit(X_train,y_train)

SGDRRegModel = GridSearchModel.best_estimator_
SGDRRegModel.fit(X_train,y_train)

print("stochastic gradient model run successfully")

## 2. Ridge Regression

In [None]:
RidgeRegModel = Ridge(random_state= 55, copy_X=True)
SelectedParameters = {
                      'alpha':[0.1,0.5,0.01,0.05,0.001,0.005],
                      #'normalize' was removed from Ridge in scikit-learn 1.2; our features are already standardized
                      'max_iter':[100,500,1000,5000,10000],
                      'tol':[0.0001,0.00001,0.000001],
                      'solver':['auto', 'svd', 'cholesky', 'lsqr', 'sparse_cg']
                      }

GridSearchModel = GridSearchCV(RidgeRegModel,SelectedParameters, cv = 5,return_train_score=True)
GridSearchModel.fit(X_train,y_train)

RidgeRegModel = GridSearchModel.best_estimator_
RidgeRegModel.fit(X_train,y_train)

print("Ridge model run successfully")

## 3. Lasso

In [None]:
LassoRegModel = Lasso(random_state= 55 ,copy_X=True)
SelectedParameters = {
                      'alpha':[0.1,0.5,0.01,0.05,0.001,0.005],
                      #'normalize' was removed from Lasso in scikit-learn 1.2; our features are already standardized
                      'tol':[0.0001,0.00001,0.000001],
                      }

GridSearchModel = GridSearchCV(LassoRegModel,SelectedParameters, cv = 5,return_train_score=True)
GridSearchModel.fit(X_train,y_train)

LassoRegModel = GridSearchModel.best_estimator_
LassoRegModel.fit(X_train,y_train)

print("lasso model run successfully")

## 4. Linear Regression

In [None]:
linearRegModel = LinearRegression(copy_X=True)
linearRegModel.fit(X_train,y_train)
print("Linear regression model run successfully")

## 5. Decision Tree Regressor

In [None]:
decisionTreeModel = DecisionTreeRegressor(random_state=55)

SelectedParameters = {
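                      #'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0
                      'criterion': ['squared_error','friedman_mse','absolute_error'] ,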
                      'max_depth': [None,2,3,4,5,6,7,8,9,10],
                      'splitter' : ['best','random'],
                      'min_samples_split':[2,3,4,5,6,7,8,9,10],
                      }

GridSearchModel = GridSearchCV(decisionTreeModel,SelectedParameters, cv = 5,return_train_score=True)
GridSearchModel.fit(X_train,y_train)

decisionTreeModel = GridSearchModel.best_estimator_
decisionTreeModel.fit(X_train,y_train)

print("decision Tree Regressor model run successfully")

## 6. Xgboost Regressor

In [None]:
XGBRModel = XGBRegressor(n_jobs = 4)

SelectedParameters = {
                      'n_estimators': [100,1000,10000] ,
                      'learning_rate': [0.1,0.5,0.01,0.05],
                      }

GridSearchModel = GridSearchCV(XGBRModel,SelectedParameters, cv = 5,return_train_score=True)
GridSearchModel.fit(X_train,y_train)

XGBRModel = GridSearchModel.best_estimator_
XGBRModel.fit(X_train,y_train)

print("Xgboost Regressor model run successfully")

# 5. Model evaluation
Now we will evaluate our models using the following metrics 👇
1. Model score (the R² value returned by the `score` method of a regressor)
2. Mean absolute error (MAE)
3. Mean squared error (MSE)
4. Root mean squared error (RMSE)
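
To show how these four metrics relate to each other, here is a minimal sketch on a toy set of predictions. The arrays `y_true` and `y_hat` are made up for illustration and are not results from the models above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions (hypothetical values).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_hat)   # mean of |error|
mse = mean_squared_error(y_true, y_hat)    # mean of error squared
rmse = np.sqrt(mse)                        # same unit as the target
r2 = r2_score(y_true, y_hat)               # what regressor.score() reports

print(mae, mse, rmse, r2)
```

RMSE penalizes large errors more than MAE does, so comparing the two gives a hint about how much a few big mistakes dominate the error.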

let's go 😊

In [None]:
#evaluation Details
models = [SGDRRegModel, RidgeRegModel, LassoRegModel, linearRegModel, decisionTreeModel,XGBRModel]

for model in models:
    print(type(model).__name__,' Train Score is   : ' ,model.score(X_train, y_train))
    print(type(model).__name__,' Test Score is    : ' ,model.score(X_test, y_test))
    print('--------------------------------------------------------------------------')

In [None]:
#predict
for model in models:
    print(type(model).__name__," error metrics")
    print('---------------------------------------------------------')
    y_pred = model.predict(X_test)

    MAE = mean_absolute_error(y_test,y_pred)
    print("mean absolute error = ",MAE)

    MSE = mean_squared_error(y_test,y_pred)
    print("mean squared error = ",MSE) 

    RMSE = np.sqrt(mean_squared_error(y_test,y_pred))
    print("root mean squared error = ",RMSE) 
    print()

<br>
**Based on the results above, we will use XGBRegressor for the submission task**

# 6. References 
These are some references I used in this kernel 👇

1. [Ways to Detect and Remove the Outliers](https://towardsdatascience.com/ways-to-detect-and-remove-the-outliers-404d16608dba)
2. [Handling Missing Values in Machine Learning: Part 2](https://towardsdatascience.com/handling-missing-values-in-machine-learning-part-2-222154b4b58e)
3. [How to Handle Missing Data](https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4)

<p style='font-size:25px;font-weight:bold'>If you find this kernel useful, please upvote it to help others see it 😊</p>