<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Kaggle Challenges House Prices (Regression)



### Group 5 team members:

- Raghad Alharbi
- Fatimah Aljohani
- Hessah Hamed Alkhattabi

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png">


## Problem Statement

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. However, far more attributes influence price negotiations than the number of bedrooms or a white picket fence.
Using 79 explanatory variables that describe (almost) every aspect of residential homes in Ames, Iowa, we built a machine learning model that predicts the final sale price of each home.

## Executive Summary

As the second project in our Data Science Immersive course with General Assembly and MiSK Academy, we were asked to complete the "House Prices" competition on Kaggle. We applied several data cleaning methods and carried out exploratory data analysis (EDA), including a good number of visualizations, to get to know the data well. Finally, we applied multiple machine learning methods to predict the sale price of the houses in the test set. We achieved the score below, which we are very proud of.
 
 
 Root-Mean-Squared-Error (RMSE)  = 0.11890
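
For context, the competition scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed price. The sketch below shows that metric on made-up numbers; the helper name `rmse_log` and the sample prices are illustrative assumptions, not part of the project code.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Made-up prices, for illustration only
actual = [200000, 150000, 320000]
perfect = rmse_log(actual, actual)                           # identical predictions
ten_pct_high = rmse_log(actual, [p * 1.10 for p in actual])  # every prediction 10% too high
print(perfect, ten_pct_high)
```

Because the error is measured on the log scale, a prediction that is 10% too high costs about the same for a cheap house as for an expensive one.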

### Contents:
- [Datasets Description](#Datasets-Description)
- [Data Import & Cleaning](#Data-Import-and-Cleaning)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Data Visualization](#Visualize-the-data)
- [Descriptive and Inferential Statistics](#Descriptive-and-Inferential-Statistics)
- [Outside Research](#Outside-Research)
- [Conclusions and Recommendations](#Conclusions-and-Recommendations)

## Datasets Description

#### We were provided with four files to complete this challenge:
- train.csv - the training set (81 columns, 1460 rows)
- test.csv - the test set (80 columns, 1459 rows)
- data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here
- sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms

## Data Cleaning and Exploratory Data Analysis

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# all the used libraries in this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap
from scipy import stats

from sklearn import datasets, metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn import decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_squared_log_error, mean_absolute_error
from math import sqrt
from sklearn.tree import DecisionTreeRegressor
from sklearn import neighbors
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from xgboost import XGBRegressor # just for fun :) 
from scipy.special import boxcox, inv_boxcox
 
plt.style.use('ggplot')
sns.set(font_scale = 1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

# Palettes used for visualizations
color= "Spectral"
color_plt = ListedColormap(sns.color_palette(color).as_hex())
color_hist = 'teal'

#### 1. Read CSV file

In [None]:
# Both train and test files
df_full = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
df = df_full

test_df['SalePrice'] = 0
# saving the IDs for the first and last data point in the test set 
# because we will be merging both train and test
test_first_id = test_df['Id'].iloc[0]
test_last_id = test_df['Id'].iloc[-1]

In [None]:
pd.set_option('display.max_columns', None)

#### 2. Display data

Display the first rows of each dataframe (`head()` shows five rows by default).

In [None]:
df.head()

In [None]:
test_df.head()

In [None]:
# Combining both train and test datasets
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
df = pd.concat([df, test_df], ignore_index = True)

In [None]:
#Making sure that test set was appended to the main df
df.tail()

#### 3. Briefly describe the data

Note things about what the columns might mean, and the general information that is conveyed in the dataframe.

In [None]:
df.shape

In [None]:
df.info()

#### 4a. How complete is the data, and are there any issues?

In [None]:
df.isna().sum()[df.isnull().sum() > 0].sort_values(ascending = False)

In [None]:
#Finding missing data and the percentage of it in each column
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis = 1, keys = ['Total_NaN', 'Percent_Nan'])
missing_data.head(20)

In [None]:
#visualize the missing data 
plt.figure(figsize = (19, 10))
sns.heatmap(data = df.isnull())

In [None]:
df.columns

In [None]:
df.describe()

#### 5. What are your data types? 

In [None]:
df.dtypes

#### 6. Fill null values

In [None]:
# In these categorical columns, NaN means the feature is absent, so fill with an explicit label
cat_bsmt_col = ['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2']
cat_multi_col = ['PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu']
cat_Garage_col = ['GarageType', 'GarageCond', 'GarageFinish', 'GarageQual']

df[cat_bsmt_col] = df[cat_bsmt_col].fillna('No_Basement')
df[cat_multi_col] = df[cat_multi_col].fillna('No')
df[cat_Garage_col] = df[cat_Garage_col].fillna('No_Garage')
df['MasVnrType'] = df['MasVnrType'].fillna('No_MasVnr')

# Remaining columns: assign the result back instead of calling fillna(inplace = True) on a column
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode().iloc[0]) # categorical: use the mode
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median()) # right skewed: use the median
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(df['YearBuilt']) # fall back to the year the house was built
df['MasVnrArea'] = df['MasVnrArea'].fillna(df['MasVnrArea'].median())

In [None]:
# check if all columns are  filled 
df.isna().sum().sum()

In [None]:
# These are the null in test
df.isna().sum()[df.isnull().sum() > 0].sort_values(ascending = False)

In [None]:
# fill missing data with mode as there are very few missing data
# we do not need to use other complex methods for filling data
df.fillna(df.mode().iloc[0], inplace = True)
df.isna().sum().sum()

In [None]:
# head of categorical columns
df[df.select_dtypes('object').columns].head()

In [None]:
# head of numerical columns
df[df.select_dtypes('number').columns].head()

## Converting categorical values that represent rankings to numerical values

### Workflow steps:
1. Convert all ranking columns to numbers.
2. Check model performance (it got worse).
3. Find each column's correlation with the sale price.
4. Record the correlation values for reference.
5. Convert only the columns that have a high correlation with the target.
6. Check model performance (it improved).

### Note:
We followed these steps after building and testing all the models. Their scores were not optimal, so we tried to improve the results by converting ordinal categorical columns, whose values imply a ranking, to numerical values.
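
A minimal sketch of this kind of ordinal encoding on toy data. The column name mirrors the dataset, but the values and the `_rank` suffix are made up for illustration.

```python
import pandas as pd

# Toy data; the values are made up
toy = pd.DataFrame({"KitchenQual": ["Ex", "TA", None, "Gd", "Fa"]})

# Ordered quality scale: a larger number means better quality
qual_scale = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Labels that are not in the scale (here: None) become NaN after map(),
# so they are filled with 0 to stand for "feature absent" before casting to int
toy["KitchenQual_rank"] = toy["KitchenQual"].map(qual_scale).fillna(0).astype(int)
print(toy)
```

Unlike dummy variables, this keeps the order of the categories, which a linear model can then use as a single numeric feature.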

In [None]:
####################################################### Mapping all quality columns to ranking numbers
ranking_columns = ['ExterQual', 'BsmtQual', 'HeatingQC', 'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond']
# Removed: 'PoolQC', 'ExterCond', 'BsmtCond',

qual_dict = {"NA": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# The placeholder labels filled in earlier ('No', 'No_Basement', 'No_Garage') are not keys of the
# mapping dictionaries, so map() returns NaN for them; fillna(0) ranks "feature absent" as 0.
for column in ranking_columns:
    df[column] = df[column].map(qual_dict).fillna(0).astype(np.int16)
    
#Corr with Saleprice: 
#ExterQual    =     0.686756
#KitchenQual  =     0.662236
#BsmtQual     =     0.586674
#FireplaceQu  =     0.521144
#HeatingQC    =     0.428024
#GarageQual   =     0.273898
#GarageCond   =     0.263249
#BsmtCond     =     0.212632
#ExterCond    =     0.018865
#PoolQC       =     0.124084

####################################################### Mapping basement quality columns to ranking numbers  
#BsmtExposure    = 0.376309
#'BsmtFinType1'  = 0.305372
#'BsmtFinType2'  = -0.011422

basement_columns = ['BsmtFinType1']
basement_dict = {'NA':0, 'Unf':1, 'LwQ':2, 'Rec':3, 'BLQ':4, 'ALQ':5, 'GLQ':6 }    
#Removed: , 'BsmtFinType2'
for column in basement_columns:
    df[column] = df[column].map(basement_dict).fillna(0).astype(np.int16)
    
# BsmtExposure   =   0.376309 
df['BsmtExposure'] = df['BsmtExposure'].map({'NA':0, 'No':1, 'Mn':2, 'Av':3, 'Gd':4}).fillna(0).astype(np.int16)

####################################################### Mapping garage quality columns to ranking numbers  
# GarageFinish  = 0.550255
df['GarageFinish'] = df['GarageFinish'].map({'NA':0, 'Unf':1, 'RFn':2, 'Fin':3 }).fillna(0).astype(np.int16)

#CentralAir =    0.251328
df['CentralAir'] = df['CentralAir'].map({'N':0, 'Y':1}).astype(int) 

#PavedDrive =    0.233281
df['PavedDrive'] = df['PavedDrive'].map({'N':0, 'P':1, 'Y':3}).astype(int) 

#LandSlope =  -0.051779
df['LandSlope'] = df['LandSlope'].map({'Sev':0, 'Mod':1, 'Gtl':3}).astype(int)

In [None]:
# Just to make sure again no Null values after conversion 
df.isna().sum().sum()

#### 7. Data Visualization

In [None]:
# new dataframe with only the train set, to visualize relationships and correlations
visual_df = df.iloc[0:(df[df['Id'] == test_first_id].index[0]), :] # train set
visual_df.head()

In [None]:
# Finding correlation of all numerical columns with target
visual_df.select_dtypes('number').corr()['SalePrice'].sort_values(ascending = False)

In [None]:
# Just to check if any of the categorical columns has a high correlation with SalePrice
cat_columns = visual_df[visual_df.select_dtypes('object').columns].copy() # copy, so the codes are not written into a slice
for column in cat_columns:
    cat_columns[column] = visual_df[column].astype('category').cat.codes
    
cat_columns['SalePrice'] = visual_df['SalePrice']
cat_columns.corr()['SalePrice'].sort_values(ascending = False)

### Checking Normality for SalePrice

In [None]:
fig, ax = plt.subplots( figsize=(15, 6))
ax.hist(visual_df['SalePrice'], bins = 300, color = color_hist)

ax.set_xlabel('SalePrice')
ax.set_ylabel('Frequency')
fig.suptitle('The Distribution of Sale Price Before Transformation', fontsize = 20)

ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

plt.show()

The graph above shows that most sale prices lie between 100k and 250k. It also shows a lot of outliers on the right side.

In [None]:
fig, ax = plt.subplots(figsize = (14, 6))
res = stats.probplot(visual_df['SalePrice'], plot = plt)
fig.suptitle('Probability Plot of Sale Price Before Transformation', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

### Applying a log transform to the target to make it closer to normally distributed

In [None]:
fig, ax = plt.subplots( figsize = (15, 6))
ax.hist(np.log(visual_df['SalePrice']), bins = 300, color = color_hist)

ax.set_xlabel('SalePrice')
ax.set_ylabel('Frequency')

fig.suptitle('The Distribution of Sale Price After Log Transformation', fontsize = 20)

ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))

plt.show()

In [None]:
fig, ax = plt.subplots(figsize = (14, 6))
res = stats.probplot(np.log(visual_df['SalePrice']), plot = plt) # np.log, matching the transform applied to the target

fig.suptitle('Probability Plot of Sale Price After log Transformation', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

**Why are we transforming 'SalePrice' with log?**

The distribution of the target is right-skewed, and we cannot simply drop all the outliers that affect it. After the log transform, the target is much closer to a normal distribution, which makes it easier to fit models and predict accurate values. Therefore, we convert the target to its log now and exponentiate the predictions afterwards.

From a very useful Medium article about the log transform:
> It is useful if and only if the distribution of the target variable is right-skewed, which can be observed by a simple histogram plot. This occurs when there are outliers that can’t be filtered out as they are important to the model.

Source: (https://medium.com/towards-artificial-intelligence/when-and-why-to-use-log-transformation-in-regression-6a326d6259e6)
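
A small sketch of the idea on synthetic data (a log-normal sample standing in for prices, not the competition data): the log reduces the skew, and exponentiating recovers the original scale.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); not the competition data
rng = np.random.default_rng(0)
prices = pd.Series(np.exp(rng.normal(loc=12.0, scale=0.4, size=2000)))

log_prices = np.log(prices)      # transform applied to the target before fitting
recovered = np.exp(log_prices)   # inverse applied to predictions afterwards

skew_before = prices.skew()
skew_after = log_prices.skew()
print(round(skew_before, 2), round(skew_after, 2))
```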

In [None]:
# applying log to target
df['SalePrice'] = np.log(df['SalePrice'])

## Detecting outliers

In [None]:
# getting the columns with the highest correlation with SalePrice, to clean them of outliers
high_corr = visual_df.select_dtypes('number').corr()['SalePrice'].sort_values(ascending = False).head(10)
high_corr.index.to_list()

In [None]:
# overview of all plots
sns.pairplot(visual_df[high_corr.index.to_list()])

### Checking Columns with high Correlation with the target individually to visualize distribution and outliers 

In [None]:
fig, ax = plt.subplots( figsize = (12, 8))
ax = sns.scatterplot(x = 'ExterQual', 
                     y = 'SalePrice', 
                     data = visual_df, 
                     marker = 'o', s = 200, palette = color)

ax.set_ylabel('Sale Price')
ax.set_xlabel('The quality of the material on the exterior')
fig.suptitle('The Quality of the Material on the Exterior vs. Sales Price', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

The distribution looks reasonable, with two potential outliers where the price is higher than 700,000.

In [None]:
fig, ax = plt.subplots( figsize = (12, 8))
ax = sns.scatterplot(x = 'OverallQual', 
                     y = 'SalePrice', 
                     data = visual_df, 
                     marker = 'o', s = 200, palette = color)

ax.set_ylabel('Sale Price')
ax.set_xlabel('Overall Quality')
fig.suptitle('Overall Quality vs. Sales Price', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

The distribution looks reasonable, with the same two potential outliers where the price is higher than 700,000.

In [None]:
fig, ax = plt.subplots( figsize = (12, 8))
ax = sns.scatterplot(x = 'GrLivArea', 
                     y = 'SalePrice', 
                     data = visual_df, 
                     marker = 'o', s = 200, palette = color)

ax.set_ylabel('Sale Price')
ax.set_xlabel('Ground living area')
fig.suptitle('Ground living Area vs. Sales Price', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

There is a positive correlation between above-ground living area and sale price, but there are clearly two outliers where the price is lower than 200,000 and the living area is higher than 4000. We need to drop these two outliers.
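
A sketch of dropping such points with a boolean mask. The rows below are made up; only the column names and thresholds follow the text.

```python
import pandas as pd

# Made-up rows for illustration
toy = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4700, 5600, 4300],
    "SalePrice": [180000, 250000, 185000, 160000, 750000],
})

# Large living area but unusually low price: treated as outliers in this sketch
is_outlier = (toy["GrLivArea"] > 4000) & (toy["SalePrice"] < 200000)
cleaned = toy.loc[~is_outlier].reset_index(drop=True)
print(cleaned)
```

The large but expensive house (4300 sq ft, 750,000) is kept, because it follows the overall trend.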

In [None]:
fig, ax = plt.subplots( figsize = (12, 8))
ax = sns.scatterplot(x = 'TotalBsmtSF', 
                     y = 'SalePrice', 
                     data = visual_df, 
                     marker = 'o', s = 200, palette = color)

ax.set_ylabel('Sale Price')
ax.set_xlabel('TotalBsmtSF')
fig.suptitle('Total square feet of basement area vs. Sales Price', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

We did not treat outliers in TotalBsmtSF, because the column is dropped later due to its multicollinearity with 1stFlrSF. The plot does show one outlier, where TotalBsmtSF is higher than 6000.



In [None]:
fig, ax = plt.subplots( figsize = (12, 8))
ax = sns.scatterplot(x = '1stFlrSF', 
                     y = 'SalePrice', 
                     data = visual_df, 
                     marker = 'o', s = 200, palette = color)

ax.set_ylabel('Sale Price')
ax.set_xlabel('1stFlrSF')
fig.suptitle('First Floor Square Feet vs. Sales Price', fontsize = 20)

ax.get_yaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
ax.get_xaxis().set_major_formatter(plt.FuncFormatter(lambda x, loc: "{:,}".format(int(x))))
plt.show()

There is one outlier that we will drop, which is when 1stFlrSF is higher than 4000

In [None]:
cat_visual_df = visual_df[visual_df.select_dtypes('object').columns]
cat_visual_df.columns

#### 8. Create a data dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|Id|int|df|Id of the house|
|MSSubClass|int|df|Identifies the type of dwelling involved in the sale|
|MSZoning|object|df|Identifies the general zoning classification of the sale|
|LotFrontage|float|df|Linear feet of street connected to property|
|LotArea|int|df|Lot size in square feet|
|Street|object|df|Type of road access to property| 
|Alley|object|df|Type of alley access to property| 
|LotShape|object|df|General shape of property|
|LandContour|object|df|Flatness of the property|
|Utilities|object|df|Type of utilities available|
|LotConfig|object|df|Lot configuration|
|LandSlope|object|df|Slope of property|
|Neighborhood|object|df|Physical locations within Ames city limits|
|Condition1|object|df|Proximity to various conditions|
|Condition2|object|df|Proximity to various conditions (if more than one is present)|
|BldgType|object|df|Type of dwelling|
|HouseStyle|object|df|Style of dwelling|
|OverallQual|int|df|Rates the overall material and finish of the house|
|OverallCond|int|df|Rates the overall condition of the house|
|YearBuilt|int|df|Original construction date|
|YearRemodAdd|int|df|Remodel date (same as construction date if no remodeling or additions)|
|RoofStyle|object|df|Type of roof|
|RoofMatl|object|df|Roof material|
|Exterior1st|object|df|Exterior covering on house|
|Exterior2nd|object|df|Exterior covering on house (if more than one material)| 
|MasVnrType|object|df|Masonry veneer type|
|MasVnrArea|float|df|Masonry veneer area in square feet|
|ExterQual|object|df|Evaluates the quality of the material on the exterior|
|ExterCond|object|df|Evaluates the present condition of the material on the exterior|
|Foundation|object|df|Type of foundation|
|BsmtQual|object|df|Evaluates the height of the basement|
|BsmtCond|object|df|Evaluates the general condition of the basement|
|BsmtExposure|object|df|Refers to walkout or garden level walls|
|BsmtFinType1|object|df|Rating of basement finished area|
|BsmtFinSF1|int|df|Type 1 finished square feet|
|BsmtFinType2|object|df|Rating of basement finished area (if multiple types)|
|BsmtFinSF2|int|df|Type 2 finished square feet|
|BsmtUnfSF|int|df|Unfinished square feet of basement area|
|TotalBsmtSF|int|df|Total square feet of basement area|
|Heating|object|df|Type of heating|
|HeatingQC|object|df|Heating quality and condition|
|CentralAir|object|df|Central air conditioning|
|Electrical|object|df|Electrical system|
|1stFlrSF|int|df|First Floor square feet|
|2ndFlrSF|int|df|Second floor square feet|
|LowQualFinSF|int|df|Low quality finished square feet (all floors)|
|GrLivArea|int|df|Above grade (ground) living area square feet|
|BsmtFullBath|int|df|Basement full bathrooms|
|BsmtHalfBath|int|df|Basement half bathrooms|
|FullBath|int|df|Full bathrooms above grade|
|HalfBath|int|df|Half baths above grade|
|BedroomAbvGr|int|df|Bedrooms above grade (does NOT include basement bedrooms)|
|KitchenAbvGr|int|df|Kitchens above grade|
|KitchenQual|object|df|Kitchen quality|
|TotRmsAbvGrd|int|df|Total rooms above grade (does not include bathrooms)|
|Functional|object|df|Home functionality (Assume typical unless deductions are warranted)|
|Fireplaces|int|df|Number of fireplaces|
|FireplaceQu|object|df|Fireplace quality|
|GarageType|object|df|Garage location|
|GarageYrBlt|float|df|Year garage was built|
|GarageFinish|object|df|Interior finish of the garage|
|GarageCars|int|df|Size of garage in car capacity|
|GarageArea|int|df|Size of garage in square feet|
|GarageQual|object|df|Garage quality|
|GarageCond|object|df|Garage condition|
|PavedDrive|object|df|Paved driveway|
|WoodDeckSF|int|df|Wood deck area in square feet|
|OpenPorchSF|int|df|Open porch area in square feet|
|EnclosedPorch|int|df|Enclosed porch area in square feet|
|3SsnPorch|int|df|Three season porch area in square feet|
|ScreenPorch|int|df|Screen porch area in square feet|
|PoolArea|int|df|Pool area in square feet|
|PoolQC|object|df|Pool quality|
|Fence|object|df|Fence quality|
|MiscFeature|object|df|Miscellaneous feature not covered in other categories|
|MiscVal|int|df|$Value of miscellaneous feature|
|MoSold|int|df|Month Sold (MM)|
|YrSold|int|df|Year Sold (YYYY)|
|SaleType|object|df|Type of sale|
|SaleCondition|object|df|Condition of sale|

## Finding correlation between columns and visualize it

#### Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features

In [None]:
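# find correlations between all numeric columns and the target, using the training rows only
# (the appended test rows carry a placeholder SalePrice)
df[df['Id'] < test_first_id].select_dtypes('number').corr()['SalePrice'].sort_values(ascending = False)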

In [None]:
fig, axs = plt.subplots(figsize = (16, 14)) 
corr_all = visual_df.select_dtypes('number').corr() # numeric columns only
mask = np.triu(np.ones_like(corr_all, dtype = bool)) # np.bool was removed from NumPy; the builtin bool works
g = sns.heatmap(corr_all, ax = axs, mask=mask, cmap = sns.diverging_palette(180, 10, as_cmap = True), square = True)

plt.title('Correlation between Features')

# fix for mpl bug that cuts off top/bottom of seaborn viz
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()

From the heatmap above, we can see that several features are highly correlated with each other, which causes multicollinearity. For each such pair, we need to drop one of the two features.

- YearBuilt and GarageYrBlt: this is reasonable, since the two years are often the same. Drop GarageYrBlt.
- GrLivArea and TotRmsAbvGrd: drop TotRmsAbvGrd.
- 1stFlrSF and TotalBsmtSF: drop TotalBsmtSF.
- GarageCars and GarageArea: we tried dropping GarageCars, but performance got worse, so we keep both.
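
Pairs like these can also be listed programmatically instead of read off the heatmap. The sketch below is one way to do it; the helper name `correlated_pairs`, the 0.8 threshold, and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, corr) for numeric column pairs with |corr| above threshold."""
    corr = frame.select_dtypes("number").corr()
    # Keep only the upper triangle so that each pair is reported once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, round(float(r), 3)) for (a, b), r in pairs.items()]

# Synthetic example: "GarageArea" is built to track "GarageCars" closely
rng = np.random.default_rng(1)
cars = rng.integers(0, 4, size=300)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 40, size=300),
    "LotArea": rng.normal(10000, 2000, size=300),
})
pairs = correlated_pairs(toy)
print(pairs)
```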

In [None]:
df = df.drop(['GarageYrBlt', 'TotRmsAbvGrd', 'TotalBsmtSF'], axis = 1)

## Visualizing only columns with high correlation with the target

In [None]:
corr_matrix = visual_df.select_dtypes('number').corr() # numeric columns only
top_corr_features = corr_matrix.index[abs(corr_matrix['SalePrice']) > 0.5]

fig, axs = plt.subplots(figsize = (13, 8)) 
mask = np.triu(np.ones_like(visual_df[top_corr_features].corr(), dtype = bool)) # np.bool was removed from NumPy
sns.heatmap(visual_df[top_corr_features].corr(), ax = axs, annot = True, mask = mask, cmap = sns.diverging_palette(180, 10, as_cmap = True))
plt.title('Correlations among Columns Highly Correlated with Sale Price')

# fix for mpl bug that cuts off top/bottom of seaborn viz
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()

## The distribution of Numerical Columns in the Dataframe

In [None]:
# a function that takes a dataframe, drops the non-numerical columns, and standardizes the rest
def to_standard (df):
    
    num_df = df[df.select_dtypes(include = np.number).columns.tolist()]
    
    ss = StandardScaler()
    std = ss.fit_transform(num_df)
    
    std_df = pd.DataFrame(std, index = num_df.index, columns = num_df.columns)
    return std_df

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (18, 18))
plt.title('The Distribution of All Numeric Variables in the Dataframe', fontsize = 20)

sns.boxplot(y = "variable", x = "value", data = pd.melt(to_standard(visual_df)), palette = color)
plt.xlabel('Range after Standardization', size = 16)
plt.ylabel('Attribute', size = 16)


# fix for mpl bug that cuts off top/bottom of seaborn viz
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values

plt.show()

In [None]:
fig, ax = plt.subplots(1, 1, figsize = (18, 8))
plt.title('The Distribution of the Numeric Variables Highly Correlated with Sale Price', fontsize = 20)

sns.boxplot(y = "variable", x = "value", data = pd.melt(to_standard(visual_df[top_corr_features])), palette = color)
plt.xlabel('Range after Standardization', size = 16)
plt.ylabel('Attribute', size = 16)


# fix for mpl bug that cuts off top/bottom of seaborn viz
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values

plt.show()

## Checking skewness of all numerical columns

### Workflow Steps:
1. We built and tested models without any transformation: good results, but not the best.
2. We applied the Box Cox transformation to all columns and got better results.
3. We reduced the number of transformed columns by choosing only those with high skewness (>4.00) and got even better results.
4. As a final step, we also transformed the columns that are highly correlated with the target and have some skewness, which gave the best results.

### Note
We applied this transformation after building and testing the models; the RMSE score improved drastically after the Box Cox transformation of some columns.
> A Box Cox transformation is a way to transform non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn’t normal, applying a Box-Cox means that you are able to run a broader number of tests.

Source: (https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/)
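
For a fixed lambda, the Box Cox transform is (x^lambda - 1) / lambda. The sketch below applies it with `scipy.special.boxcox` to a synthetic right-skewed feature (a stand-in for a column such as LotArea, not the real data) and checks that `inv_boxcox` undoes it.

```python
import numpy as np
from scipy import stats
from scipy.special import boxcox, inv_boxcox

# Synthetic, strongly right-skewed positive feature; not the competition data
rng = np.random.default_rng(2)
feature = rng.lognormal(mean=9.0, sigma=0.8, size=2000)

lam = 0.15  # the fixed lambda used in this notebook
transformed = boxcox(feature, lam)        # (x**lam - 1) / lam
restored = inv_boxcox(transformed, lam)   # inverse transform

skew_before = stats.skew(feature)
skew_after = stats.skew(transformed)
print(round(skew_before, 2), round(skew_after, 2))
```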

In [None]:
numeric_feats = visual_df.dtypes[visual_df.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = visual_df[numeric_feats.tolist()].apply(lambda x:stats.skew(x.dropna())).sort_values(ascending = False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew':skewed_feats})
skewness.head()


In [None]:
skewed_features =['MiscVal', 'PoolArea', 'LotArea', '3SsnPorch', 'LowQualFinSF', 
                  'KitchenAbvGr','BsmtFinSF2', 'ScreenPorch', 'GrLivArea', 'ExterQual',
                  'BsmtHalfBath']

skewness = skewness[skewness['Skew'].abs() > 0.75] # keep only the rows whose absolute skew exceeds 0.75
print("There are {} numerical features with absolute skew above 0.75".format(skewness.shape[0]))


#skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    df[feat] = boxcox(df[feat], lam)

## Getting dummies for all categorical columns 

In [None]:
# changing all Categorical columns to dummies (0,1)
df = pd.get_dummies(df, columns = df.select_dtypes('object').columns, drop_first = True)

In [None]:
# split the data from test_first_id to the end to form the test set
test_df = df.iloc[df[df['Id'] == test_first_id].index[0]:, :]
test_df.head()

In [None]:
# deleting the test dataset from the main df
df = df.iloc[0:(df[df['Id'] == test_first_id].index[0]), :]
df.tail()

## Outlier removal

In [None]:
# removing GrLivArea outliers, also dropping SalePrice > 700,000 by default, which is good.
print(df.shape)
df.drop(df.index[[523, 1298]], inplace = True)
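# GrLivArea was Box Cox transformed and SalePrice log transformed above, so the thresholds
# (4000 sq ft, 300,000) are converted to the same scales before comparing
df = df.drop(df[(df['GrLivArea'] > boxcox(4000, lam)) & (df['SalePrice'] < np.log(300000))].index)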
df = df.drop(df[df['1stFlrSF'] >= 3000].index)
print(df.shape)

In [None]:
#No need for the ID column
df = df.drop('Id', axis = 1)

test_df = test_df.drop(['Id','SalePrice'] , axis = 1)

## Kaggle Submission File

In [None]:
# a function that gets the predictions and saves them into a csv file with the correct format
def submission_file (test_pred):
    for_id = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

    my_submission = pd.DataFrame({'Id':for_id.Id, 'SalePrice':test_pred.reshape(1459)})
    my_submission.to_csv('submission.csv', index = False) # dropping the index column before saving it

## Applying Machine learning models for predictions

In [None]:
BOLD = '\033[1m'
END = '\033[0m'

In [None]:
y = df['SalePrice']
X = df.drop('SalePrice', axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .10, shuffle = True, random_state = 42)

In [None]:
# a function that takes the datasets and a model, fits the model, prints the metrics, and returns the test predictions
def model_metrics(model, kfold, X_train, X_test, y_train, y_test, test_df):

    model.fit(X_train, y_train)

    #metrics -> R squared
    results = cross_val_score(model, X_train, y_train, cv = kfold, scoring = 'r2')
    print("CV scores: ", results); print("CV Standard Deviation: ", results.std()); print();
    print('CV Mean score: ', results.mean()); 
    print('Train score:   ', model.score(X_train, y_train))
    print('Test score:    ', model.score(X_test, y_test))
      
    MSE = -(cross_val_score(model, X_train, y_train, cv = kfold, scoring = 'neg_mean_squared_error').mean())
    print("CV MSE:        ",MSE)
    
    RMSE = sqrt(MSE)
    print("CV RMSE:       ",RMSE)
    
    test_pred = model.predict(test_df)
    test_pred_exp = np.exp(test_pred)
    
    return test_pred_exp

## Multiple models

Several initial models, fitted to establish a baseline.
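
The comparison pattern, reduced to a minimal sketch on synthetic data. The three models, the data, and the dictionary name `cv_rmse` are assumptions for illustration; the project code evaluates more models on the real features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic, linearly generated data; not the competition data
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(scale=0.3, size=300)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(random_state=0),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-neg_mse.mean()))

print(cv_rmse)
```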

In [None]:
def Multi_models (X_train, X_test, y_train, y_test, test_df):
    kfold = 5
#     # Create an scaler object
#     ss = StandardScaler()
#     X_train = ss.fit_transform(X_train)
#     X_test = ss.transform(X_test)
#     test_df = ss.transform(test_df)
#     y_train = ss.transform(y_train)
#     y_test = ss.transform(y_test)
###################################################################################################### Linear Regression model
    print(BOLD + 'Linear Regression model:' + END)
    
    lr = LinearRegression()
    lr_pred= model_metrics(lr, kfold, X_train, X_test, y_train, y_test, test_df)
    
######################################################################################################  Lasso model
    print(); print(BOLD + 'Lasso model:' + END)
    
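    # np.arange(0, 3, 200) yields only the single value 0, so no alpha grid was actually searched.
    # The log-spaced grid below is an assumed replacement; LassoCV picks the best alpha from it.
    lasso_alpha_values = np.logspace(-4, 0, 50)
    lasso = LassoCV(alphas = lasso_alpha_values, max_iter = 50000, cv = 10)
    lasso_pred = model_metrics(lasso, kfold, X_train, X_test, y_train, y_test, test_df)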

######################################################################################################  Ridge model
    print(); print(BOLD + 'Ridge model:' + END)
    
    ridge_alpha_values = np.logspace(0, 5, 200)
    ridgecv_optimal = RidgeCV(alphas = ridge_alpha_values, cv = 10)
    ridge_pred = model_metrics(ridgecv_optimal, kfold, X_train, X_test, y_train, y_test, test_df)
    
######################################################################################################  Elastic Net model
    print(); print(BOLD + 'Elastic Net model:' + END)
    
    elasticnet = ElasticNet(alpha = 0.01)
    elasticnet_pred = model_metrics(elasticnet, kfold, X_train, X_test, y_train, y_test, test_df)
    
######################################################################################################  Decision Tree Regressor model
    print(); print(BOLD + 'Decision Tree Regressor model:' + END)
    
    dtr = DecisionTreeRegressor()
    dtr_pred = model_metrics(dtr, kfold, X_train, X_test, y_train, y_test, test_df)
    
######################################################################################################  K Neighbors Regressor model
    print(); print(BOLD + 'K Neighbors Regressor model:' + END)
    
    KNN = neighbors.KNeighborsRegressor()
    KNN_pred = model_metrics(KNN, kfold, X_train, X_test, y_train, y_test, test_df)
    
######################################################################################################  Random Forest Regressor model   
    print(); print(BOLD + 'Random Forest Regressor model:' + END)

    rfr = RandomForestRegressor(n_estimators = 100, oob_score = True, random_state = 42)
    rfr_pred = model_metrics(rfr, kfold, X_train, X_test, y_train, y_test, test_df)
    
    #submission_file (dtr_pred)
    

In [None]:
Multi_models (X_train, X_test, y_train, y_test, test_df)

## Ridge Model

The machine learning model that gave us the best score was Ridge, using RidgeCV to find the optimal alpha.

In [None]:
def ridge__optimizer(X_train, X_test, y_train, y_test, test_df):
    print(); print(BOLD + 'Ridge model (best so far with 0.12100 kaggle score):' + END)
    kfold = 5
    
    
    ridge_alpha_values = np.logspace(0, 5, 200)

    ridgecv_optimal = RidgeCV(alphas = ridge_alpha_values, cv = 10)
    ridgecv_optimal.fit(X_train, y_train)

    print('Optimal Alpha:   ' , ridgecv_optimal.alpha_)
    
    # Create a Ridge regression object (L2 penalty) using the optimal alpha
    ridge = Ridge(alpha = ridgecv_optimal.alpha_)

    
    ridge_opt_pred = model_metrics(ridge, kfold, X_train, X_test, y_train, y_test, test_df)
    print(ridge_opt_pred)
    
    #submission_file (ridge_opt_pred) 
    #The line for saving the predictions for submission gives an error in Kaggle, therefore we commented it out here

In [None]:
ridge__optimizer(X_train, X_test, y_train, y_test, test_df)

**Trying to find the best alpha for Ridge**

In [None]:
rmse = []
# Check 200 evenly spaced alpha values for Ridge regression
alpha = np.linspace(0.0001, 10, 200)

for alph in alpha:
    ridge = Ridge(alpha = alph, copy_X = True, fit_intercept = True)
    ridge.fit(X_train, y_train)
    # Evaluate on the held-out split rather than on data the model was fitted on
    predict = ridge.predict(X_test)
    rmse.append(np.sqrt(mean_squared_error(y_test, predict)))
print(rmse)
plt.scatter(alpha, rmse)
rmse = pd.Series(rmse, index = alpha)
# idxmin() returns the alpha (index label) with the lowest RMSE
print(rmse.idxmin())
print(rmse.min())

## KNN Model

In [None]:
# Trying a KNN regression model with a standard scaler and a grid search over multiple k values
def KNN_opt_model (X_train, X_test, y_train, y_test, test_df):
    kfold = 5
    
    print(); print(BOLD + 'K Neighbors Regressor model:' + END)

    # Create a scaler object
    ss = StandardScaler()

    # Create a K-nearest-neighbors regressor object
    KNN = neighbors.KNeighborsRegressor()

    # Create a pipeline of two steps. First, standardize the data.
    # Second, fit a K-nearest-neighbors regressor on the scaled data.
    pipe = Pipeline(steps = [('ss', ss),
                           ('KNN', KNN)])
    
    # Create lists of parameter for KNeighborsRegressor()
    n_neighbors = [5, 10, 15]
    algorithm = ['auto', 'ball_tree', 'kd_tree', 'brute']

    # Create a dictionary of all the parameter options 
    # Note that you can access the parameters of a pipeline step by using '__'
    parameters = dict(KNN__n_neighbors = n_neighbors,
                      KNN__algorithm = algorithm)

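    # Conduct parameter optimization with the pipeline
    # Create a grid search object
    clf = GridSearchCV(pipe, parameters)
    
    # Evaluate the grid-searched pipeline (not the bare KNN object) so scaling and tuning are applied
    KNN_pred = model_metrics(clf, kfold, X_train, X_test, y_train, y_test, test_df)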
    #submission_file (KNN_pred)
   

In [None]:
KNN_opt_model(X_train, X_test, y_train, y_test, test_df)
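
As a self-contained illustration of the `step__parameter` naming used in the grid above, here is a minimal sketch on synthetic data. The data, the reduced grid, and the variable names `X_demo`, `y_demo`, and `search` are assumptions for illustration and are not part of the project:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the Ames housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Two-step pipeline: scale the features, then fit a KNN regressor
pipe = Pipeline(steps=[('ss', StandardScaler()),
                       ('KNN', KNeighborsRegressor())])

# 'KNN__n_neighbors' targets the n_neighbors parameter of the step named 'KNN'
param_grid = {'KNN__n_neighbors': [5, 10, 15]}

# The scaler is re-fitted inside every CV fold, so no information leaks between folds
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search` behaves like an ordinary estimator: `search.predict` uses the best pipeline found, refitted on all the data passed to `fit`.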

## Lasso Model

In [None]:
def lasso_optimizer(X_train, X_test, y_train, y_test, test_df, X, y):
    print(); print(BOLD + 'Optimized Lasso model:' + END)
    kfold = 5
    optimal_lasso = LassoCV(n_alphas = 500, cv = 10, verbose = 1)
    optimal_lasso.fit(X_train, y_train)
    print('optimal_lasso:    ', optimal_lasso.alpha_)
    
    # Create a Lasso regression object (L1 penalty) using the optimal alpha
    lasso = Lasso(alpha = optimal_lasso.alpha_)

    lasso_pred = model_metrics(lasso, kfold,  X_train, X_test, y_train, y_test, test_df)
    submission_file(lasso_pred)
    
    lasso.fit(X, y)

    lasso_coefs = pd.DataFrame()
    lasso_coefs['Column_name'] = X.columns
    lasso_coefs['coefficient'] = lasso.coef_
    lasso_coefs['absolute_coefficient'] = np.abs(lasso.coef_)

    lasso_coefs = lasso_coefs.sort_values('absolute_coefficient', ascending = False)
    # Share of coefficients shrunk to exactly zero, relative to the number of features
    print('Percent variables zeroed out:', 100 * np.sum(lasso.coef_ == 0) / X.shape[1])

    lasso_coefs.head(15)
    return lasso_coefs

In [None]:
lasso_coefs = lasso_optimizer(X_train, X_test, y_train, y_test, test_df, X, y);

In [None]:
lasso_coefs.head(10)

## Elastic Net Model

In [None]:
def ElasticNet__optimizer(X_train, X_test, y_train, y_test, test_df, X, y):
    
    kfold = 5
    
    l1_ratios = np.linspace(0.01, 1.0, 25)

    optimal_enet = ElasticNetCV(l1_ratio = l1_ratios, n_alphas = 100, cv = 10, verbose = 1)
    # Tune on the training split only, so the test split stays unseen
    optimal_enet.fit(X_train, y_train)

    print(); print(BOLD + 'Elastic Net model:' + END)
    print('Optimal alpha:       ', optimal_enet.alpha_)
    print('Optimal l1 ratio  :  ', optimal_enet.l1_ratio_)
    
    enet = ElasticNet(alpha = optimal_enet.alpha_, l1_ratio = optimal_enet.l1_ratio_)
    

    enet_pred = model_metrics(enet, kfold,  X_train, X_test, y_train, y_test, test_df)
    #submission_file (enet_pred)

In [None]:
ElasticNet__optimizer(X_train, X_test, y_train, y_test, test_df, X, y)

## Decision Tree Regressor Model

In [None]:
#Applying DecisionTreeRegressor Model 

print(BOLD + 'Decision Tree Regressor model:' + END)


decision_tree = DecisionTreeRegressor( max_depth = 10, random_state = 33)
decision_tree.fit(X_train, y_train)

#Calculating Training & Testing Scores
print('Train Score: ', decision_tree.score(X_train, y_train))
print('Test Score is : ', decision_tree.score(X_test, y_test))
print('----------------------------------------------------')

#Calculating Prediction
y_pred = decision_tree.predict(X_test)

#----------------------------------------------------
#Calculating MAE
MAE_value = mean_absolute_error(y_test, y_pred, multioutput = 'uniform_average') 
print('MAE Score: ', MAE_value)

#----------------------------------------------------
#Calculating MSE
MSE_value = mean_squared_error(y_test, y_pred, multioutput = 'uniform_average') 
print('MSE Score: ', MSE_value)
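
The train and test scores printed above can be compared directly to flag overfitting. A minimal sketch on synthetic data (an assumption, not the housing data) shows the pattern to look for: an unrestricted tree memorizes the training split, so its train R² is close to 1 while its test R² is noticeably lower:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (illustrative only)
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=0)

# No depth limit: the tree can grow until every training point is fitted exactly
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)
test_r2 = tree.score(X_te, y_te)
gap = train_r2 - test_r2

print('Train R^2:', train_r2)
print('Test R^2: ', test_r2)
print('Gap:      ', gap)
```

A large gap between the two scores is the warning sign. Limiting `max_depth`, as done above, is one common way to narrow it.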



## Evaluation and Conceptual Understanding

After evaluating all of the applied models, the best performer was Ridge with the optimal alpha value. Lasso and ElasticNet performed slightly worse than Ridge, even after finding the optimal alpha. Other models like KNN and linear regression did not perform well. Decision tree and random forest overfit, as their train scores were much higher than their test scores. The scores below are for the best-performing model (Ridge), which got us an RMSE of 0.11890.

* Ridge model (best so far with 0.12100 kaggle score):
* Optimal Alpha:    12.216773489967919
* CV scores:  [0.89327908 0.89926145 0.90971806 0.93280416 0.9311133 ]
* CV Standard Deviation:  0.016176855310963398
* CV Mean score:  0.913235211102004
* Train score:    0.9375894112878308
* Test score:     0.9321056236353693
* CV MSE:         0.013403207809351957
* CV RMSE:        0.11577222382485342
* Predicted sale prices: [117563.4133125  155207.48419969 181581.10791461 ... 173173.10705228 115697.95594533 223288.03456509]
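
The summary figures in this list can be reproduced from the raw values. The sketch below assumes the standard deviation is NumPy's default population form (`ddof=0`), which is consistent with the reported number, and that CV RMSE is the square root of CV MSE:

```python
import numpy as np

# Values copied from the list above
cv_scores = np.array([0.89327908, 0.89926145, 0.90971806, 0.93280416, 0.9311133])
cv_mse = 0.013403207809351957

cv_mean = cv_scores.mean()   # average score across the 5 folds
cv_std = cv_scores.std()     # population standard deviation (ddof=0)
cv_rmse = np.sqrt(cv_mse)    # RMSE is the square root of MSE

print('CV Mean score:         ', cv_mean)
print('CV Standard Deviation: ', cv_std)
print('CV RMSE:               ', cv_rmse)
```

The small standard deviation relative to the mean indicates that the Ridge model's performance is stable across folds.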

## Conclusion


As the second project in our Data Science Immersive course with General Assembly and MiSK Academy, we were asked to complete the "House Prices" competition on Kaggle. We cleaned the data with several methods, explored it with EDA and a good number of visualizations to get to know it well, and preprocessed it with some transformation methods. Finally, we applied multiple machine learning methods to predict the sale price of the houses in the test data set. The regression models used were Linear, Ridge, Lasso, Elastic Net, KNN, Decision Tree, Random Forest and a few more. We achieved an amazing score in the Kaggle competition that we are very proud of: we ranked 607th out of 4,675 teams.
 
 Root-Mean-Squared-Error (RMSE)  = 0.11890
 
Thank you very much,
Raghad, Fatimah, Hessah
 