![width = 500](https://ctp-media.imigino.com/image/1/process/nullxnull?source=https://d3cx3ub94vxukq.cloudfront.net/wp-content/uploads/sites/30/2018/06/The-Tembisan-Gauteng-property-market-showing-signs-of-early-recovery.jpeg)

# House Price Prediction using different regression methods
* EDA 
* Data preparation and Feature Engineering 
    * outliers
    * missing data
    * categorical data
* Linear regression
* Polynomial regression
* L1 regression 
* L2 regression
* Elastic Net
* Conclusion

# 📊 Data Gathering and EDA

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
df_train.info()

In [None]:
df_train.describe()

In [None]:
df_train.head(10)

In [None]:
df_train.drop("Id", axis = 1, inplace = True)
df_test.drop("Id", axis = 1, inplace = True)
df_train.head()

In [None]:
df_train.columns

## First, let's take a look at what we're predicting: SalePrice. So let's see what we've got. :)

In [None]:
sns.histplot(df_train['SalePrice'])

In [None]:
from scipy import stats
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()

The target variable is right skewed. To bring it closer to a normal distribution we apply a log transform to it, because linear models in particular tend to work better with normally distributed targets.

In [None]:
df_train["SalePrice"] = np.log1p(df_train["SalePrice"])
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()

The skew seems now corrected and the data appears more normally distributed.
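To put a number on "more normally distributed", one option is to compare the skewness before and after the transform. The sketch below uses synthetic log-normal values as a stand-in for SalePrice (the sample and its parameters are assumptions, not the competition data) and also shows that `np.expm1` undoes `np.log1p`, which is what we would need to turn predictions back into prices.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "prices" (log-normal); a stand-in for SalePrice
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

# Skewness near 0 suggests a roughly symmetric distribution
skew_raw = stats.skew(prices)
log_prices = np.log1p(prices)
skew_log = stats.skew(log_prices)
print(f"skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")

# np.expm1 is the inverse of np.log1p
restored = np.expm1(log_prices)
```

Because the models below are trained on the log-transformed target, their errors are also measured on the log scale.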

In [None]:
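# Correlate numeric columns only (numeric_only needs pandas >= 1.5);
# newer pandas versions raise an error on text columns otherwise
df_train_cor = df_train.corr(numeric_only=True)
df_train_cor[df_train_cor['PoolArea']>0.7]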

In [None]:
#df_train.GarageYrBlt.astype('float64')
sns.scatterplot(data=df_train, x ='YearBuilt', y='GarageYrBlt')

Now let's take a look at our categorical features.

In [None]:
# Box plot of (log) SalePrice for each OverallQual level
f, ax = plt.subplots(figsize=(8, 6))
sns.boxplot(x='OverallQual', y='SalePrice', data=df_train, ax=ax)

Up to now we have just followed our gut to find correlations! We can look at all correlations with df.corr, but I prefer to find them with a heatmap because it's graphical: instead of reading numbers we can spot them by color.

In [None]:
corrmat = df_train.corr(numeric_only=True)  # numeric columns only (pandas >= 1.5)
f, ax = plt.subplots(figsize=(13, 9))
sns.heatmap(corrmat, vmax=.9, square=True,cmap="YlGnBu");

Find the dark blues. They show us strong correlations. The relation our intuition told us about was true: look at SalePrice on the X axis and OverallQual on the Y axis. Yep, that's dark dark blue. Besides that, two other things got my attention. The first is the relation between GrLivArea and TotRmsAbvGrd. The second interesting relation is between GarageYrBlt and YearBuilt.

To be sure about our correlations we'll use scatterplots. Because we have a lot of features, we will choose the non-categorical features that we think may be correlated with one another.
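Besides reading colors off the heatmap, the features could also be ranked by their correlation with the target. This is only a sketch on a made-up frame (`toy` and the way its columns are generated are assumptions for illustration); on the real data the same `corr()[...]` chain would be applied to the numeric columns of `df_train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
quality = rng.integers(1, 11, size=n)       # 1..10, like OverallQual
area = rng.normal(1500, 400, size=n)        # like GrLivArea

# Toy frame: the price depends strongly on quality, less on area
toy = pd.DataFrame({
    "OverallQual": quality,
    "GrLivArea": area,
    "Unrelated": rng.normal(size=n),
    "SalePrice": 20000 * quality + 50 * area + rng.normal(0, 10000, size=n),
})

# Absolute correlation of every feature with the target, strongest first
ranking = (toy.corr()["SalePrice"]
              .drop("SalePrice")
              .abs()
              .sort_values(ascending=False))
print(ranking)
```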

In [None]:
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
# pairplot creates its own figure; `height` replaces the old `size` argument
sns.pairplot(df_train[cols], height=2.5)
plt.show()

We have a linear relationship between GrLivArea and TotalBsmtSF, and we can see a nice positive correlation between SalePrice and GrLivArea. The relation between SalePrice and YearBuilt also gives us something to think about (it looks kind of exponential).

# Feature Engineering and Data Preparation
We have three issues to deal with:
1. Outliers
2. Missing Data
3. Categorical Data 


# Outliers 

In [None]:
sns.scatterplot(x = 'OverallQual', y= 'SalePrice', data = df_train)

In [None]:
sns.scatterplot(x = 'GrLivArea', y= 'SalePrice', data = df_train)

In the first plot we can see two points at OverallQual 10 that were sold at low prices, so these two points are probably outliers. In the second plot we see two points with a very large GrLivArea whose prices are far off the trend. We can assume that the two points in the first and second plots are the same.

In [None]:
df_train[(df_train['SalePrice']<12.5) & (df_train['OverallQual'] > 8) & (df_train['GrLivArea']>4000)]

We found them. These are the two points that would really screw up our regression. Obviously we cannot check all the features one by one to find outliers. The correlations helped us pick the features most correlated with the target, which are OverallQual and GrLivArea, and by checking them we found two points that would really screw up our predictions.

In [None]:
# Rows matching the outlier condition found above
dropouts = df_train[(df_train['SalePrice']<12.5) & (df_train['OverallQual'] > 8) & (df_train['GrLivArea']>4000)]
df_train = df_train.drop(dropouts.index)
sns.scatterplot(x = 'GrLivArea', y= 'SalePrice', data = df_train)

In [None]:
print(f'There are {df_train.isnull().sum().sum()} missing values')
df_train.isnull().sum().sort_values(ascending=False)

Data preparation steps such as handling missing values must be applied to both the train and the test data. Here we first separate the target from the training features and then carry out the preparation on those features.

In [None]:
y = df_train["SalePrice"]
# drop() returns a new frame, so df_train itself is left untouched
df = df_train.drop(columns=["SalePrice"])
df.head()
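If the test set is to go through exactly the same cleaning, one common pattern is to stack both feature frames, clean them once, and split them again afterwards. The snippet below is a sketch on two tiny made-up frames (`train`, `test` and their values are assumptions), not a cell from this notebook.

```python
import pandas as pd

# Tiny stand-ins for the train and test frames
train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                      "SalePrice": [12.2, 12.1, 12.3]})
test = pd.DataFrame({"LotArea": [11622, 14267]})

n_train = len(train)
target = train["SalePrice"]

# Stack the feature columns of both sets into one frame
all_data = pd.concat([train.drop(columns="SalePrice"), test],
                     ignore_index=True)

# ... shared cleaning steps (fillna, get_dummies, ...) would go here ...

# Split back using the remembered number of training rows
train_prepared = all_data.iloc[:n_train]
test_prepared = all_data.iloc[n_train:]
print(train_prepared.shape, test_prepared.shape)
```

Encoding the stacked frame also has the side effect that both sets end up with the same dummy columns.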

# Missing Data

In [None]:
df.isnull().sum().sort_values(ascending = False).head(20)

* PoolQC : data description says NA means "No Pool". It is therefore reasonable to fill NA with "None", because the majority of houses have no pool.

In [None]:
df["PoolQC"] = df["PoolQC"].fillna("None")

* MiscFeature : data description says NA means "no misc feature"

In [None]:
df["MiscFeature"] = df["MiscFeature"].fillna("None")

* Alley : data description says NA means "no alley access"

In [None]:
df["Alley"] = df["Alley"].fillna("None")

* Fence : data description says NA means "no fence"


In [None]:
df["Fence"] = df["Fence"].fillna("None")

* FireplaceQu : data description says NA means "no fireplace"

In [None]:
df["FireplaceQu"] = df["FireplaceQu"].fillna("None")

* LotFrontage : I actually didn't know what to do with this column, and since it has hundreds of missing values we can't just drop those rows. So I got help from the https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard notebook to find a way. Since the street frontage of a property is most likely similar to that of other houses in its neighborhood, we can fill in missing values with the median LotFrontage of the neighborhood.

In [None]:
# transform keeps the original row index, so the result aligns with df
df["LotFrontage"] = df.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

Notice that the median is different from the mean: it is the middle value of the sorted numbers, so it is much less affected by a few extreme values.
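A tiny made-up example may make both points concrete: how one extreme value pulls the mean but not the median, and how the per-neighborhood fill behaves. The frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 65.0, 200.0, np.nan, 30.0, np.nan, 40.0],
})

# One extreme value (200) pulls the mean up but leaves the median alone
group_a = toy.loc[toy["Neighborhood"] == "A", "LotFrontage"]
mean_a = group_a.mean()
median_a = group_a.median()
print(f"mean: {mean_a:.1f}, median: {median_a:.1f}")

# Fill each gap with the median of its own neighborhood
filled = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median()))
print(filled.tolist())
```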

* GarageType, GarageFinish, GarageQual and GarageCond : Replacing missing data with None

In [None]:
df['GarageType'] = df['GarageType'].fillna('None')
df['GarageFinish'] = df['GarageFinish'].fillna('None')
df['GarageQual'] = df['GarageQual'].fillna('None')
df['GarageCond'] = df['GarageCond'].fillna('None')

* GarageYrBlt, GarageArea and GarageCars : Replacing missing data with 0 (since no garage means no cars in it).

In [None]:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    df[col] = df[col].fillna(0)

* BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath : missing values are likely zero for having no basement

In [None]:
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    df[col] = df[col].fillna(0)

* BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2 : For all these categorical basement-related features, NaN means that there is no basement.

In [None]:
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    df[col] = df[col].fillna('None')

* MasVnrArea and MasVnrType : NA most likely means no masonry veneer for these houses. We can fill 0 for the area and None for the type.

In [None]:
df["MasVnrType"] = df["MasVnrType"].fillna("None")
df["MasVnrArea"] = df["MasVnrArea"].fillna(0)

* MSZoning : only 4 values are missing. We can either drop those rows or fill them with the most common value in this column.

In [None]:
df['MSZoning'] = df['MSZoning'].fillna(df['MSZoning'].mode()[0])

* Utilities : For this categorical feature all records are "AllPub", except for one "NoSeWa" and 2 NA. Since the house with "NoSeWa" is in the training set, this feature won't help in predictive modelling. We can then safely remove it.

In [None]:
df = df.drop(['Utilities'], axis=1)
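Instead of spotting such columns by eye, we could look at the share of the most frequent value in each column. This is a sketch on an invented frame; the column contents and the 0.95 threshold are assumptions chosen only for the example.

```python
import pandas as pd

# Made-up frame: one nearly constant column and one mixed column
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 99 + ["NoSeWa"],
    "Street": ["Pave"] * 60 + ["Grvl"] * 40,
})

# Share of the most frequent value per column (NaN counted as a value)
top_share = toy.apply(
    lambda col: col.value_counts(normalize=True, dropna=False).iloc[0])

# Columns dominated by a single value carry almost no information
near_constant = top_share[top_share > 0.95].index.tolist()
print(near_constant)
```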

* Functional : data description says NA means typical

In [None]:
df["Functional"] = df["Functional"].fillna("Typ")

* Electrical : we will repeat the same approach we used for MSZoning. It has one NA value. Since this feature is mostly 'SBrkr', we can set that for the missing value.

In [None]:
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])

* KitchenQual: Only one NA value, and same as Electrical, we set 'TA' (which is the most frequent) for the missing value in KitchenQual.

In [None]:
df['KitchenQual'] = df['KitchenQual'].fillna(df['KitchenQual'].mode()[0])

* Exterior1st and Exterior2nd : Again, both Exterior 1 & 2 have only one missing value. We will just substitute the most common string.

In [None]:
df['Exterior1st'] = df['Exterior1st'].fillna(df['Exterior1st'].mode()[0])
df['Exterior2nd'] = df['Exterior2nd'].fillna(df['Exterior2nd'].mode()[0])

* SaleType : Fill in again with most frequent which is "WD"

In [None]:
df['SaleType'] = df['SaleType'].fillna(df['SaleType'].mode()[0])

* MSSubClass : NA most likely means no building class. We can replace missing values with None.

In [None]:
df['MSSubClass'] = df['MSSubClass'].fillna("None")

In [None]:
df.isnull().sum().sort_values(ascending=False)

OK, it seems we've got rid of the missing data successfully. Let's move on to categorical data.

# Categorical Data

In [None]:
categorical_data = df.dtypes[(df.dtypes == "object")].index
categorical_data

In [None]:
df[categorical_data].head()

In [None]:
df_numerical = df.select_dtypes(exclude='object')
df_categorical = df.select_dtypes(include='object')
dff = pd.get_dummies(df_categorical, drop_first=True)
df= pd.concat([df_numerical, dff], axis=1)
df.head()

**This part was kinda tricky and honestly I chose the easiest way to deal with the categorical data. Another way would be to check each category's relationship with SalePrice and assign it a number accordingly. The more domain knowledge we use in this part, the more precise the data we extract will be.**

Finally, it's time to build our model and start predicting.
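As a rough idea of what a more domain-driven encoding could look like, the quality columns use ordered grades that can be mapped to integers instead of dummies. The mapping `quality_order` below is my assumption of a sensible order for the grades in the data description (with "None" for a missing feature); it is a sketch, not the method used in this notebook.

```python
import pandas as pd

# Assumed order of the quality grades, worst to best
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Small made-up sample of two quality columns
toy = pd.DataFrame({
    "KitchenQual": ["TA", "Gd", "Ex", "Fa"],
    "FireplaceQu": ["None", "Gd", "TA", "Po"],
})

# Map every grade to its rank; unknown labels would become NaN
encoded = toy.apply(lambda col: col.map(quality_order))
print(encoded)
```

Compared with `get_dummies`, this keeps one column per feature and preserves the order of the grades, at the cost of assuming the steps between grades are equal.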

# 📈 Training a Regression Model

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3, random_state=101)
X_train.head()

# 🟢 Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn import metrics

linearModel = LinearRegression()
linearModel.fit(X_train, y_train)
linearPred = linearModel.predict(X_test)

MAE_linear = metrics.mean_absolute_error(y_test, linearPred)
MSE_linear = metrics.mean_squared_error(y_test, linearPred)
RMSE_linear = np.sqrt(MSE_linear)

results_df = pd.DataFrame(data=[["Linear Regression", MAE_linear, MSE_linear, RMSE_linear]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE'])
results_df
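The same three metrics are computed again for every model in the sections that follow. A small helper could cut down on the repetition; `evaluate` is a hypothetical name, and the numbers are toy values rather than results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

def evaluate(name, y_true, y_pred):
    """Return one row of the results table for a fitted model."""
    mse = metrics.mean_squared_error(y_true, y_pred)
    return {"Model": name,
            "MAE": metrics.mean_absolute_error(y_true, y_pred),
            "MSE": mse,
            "RMSE": np.sqrt(mse)}

# Toy values on the log scale, just to show the call
y_true = np.array([12.0, 12.5, 11.8])
y_pred = np.array([12.1, 12.3, 11.8])

rows = [evaluate("Toy model", y_true, y_pred)]
results = pd.DataFrame(rows, columns=["Model", "MAE", "MSE", "RMSE"])
print(results)
```

Each new model would then only need one more `rows.append(evaluate(...))` call.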

# 🟢 Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_convertor = PolynomialFeatures(degree = 2, include_bias = False)
polynomial_convertor.fit(df)
poly_features = polynomial_convertor.transform(df)
poly_features.shape
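The shape above grows quickly with the number of input columns. For degree 2 without the bias column, every feature, every square and every pairwise product becomes a column. The helper `n_poly_features` is a hypothetical function that only counts them; the input sizes are examples, not the size of our frame.

```python
from math import comb

def n_poly_features(n_features, degree, include_bias=False):
    """Number of monomials of total degree <= `degree` in n_features variables."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

for n in (3, 10, 200):
    print(n, "->", n_poly_features(n, 2))
```

With far more columns than rows, an unregularized linear model can fit the training data almost perfectly, so the polynomial score on the held-out part should be read with that in mind.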

In [None]:
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(poly_features, y, test_size=0.3, random_state=101)
polyLinearModel = LinearRegression()
polyLinearModel.fit(X_train_poly, y_train_poly)
polyLinearPred = polyLinearModel.predict(X_test_poly)

MAE_poly_linear = metrics.mean_absolute_error(y_test_poly, polyLinearPred)
MSE_poly_linear = metrics.mean_squared_error(y_test_poly, polyLinearPred)
RMSE_poly_linear = np.sqrt(MSE_poly_linear)
results_df = pd.DataFrame(data=[["Linear Regression", MAE_linear, MSE_linear, RMSE_linear],
                                ["Polynomial Regression", MAE_poly_linear, MSE_poly_linear, RMSE_poly_linear ]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE'])
results_df

# 🟢 Ridge Regression

In [None]:
from sklearn.linear_model import Ridge, RidgeCV
ridgeModel = Ridge(alpha = 10)
ridgeModel.fit(X_train, y_train)
ridgePred = ridgeModel.predict(X_test)

MAE_ridge = metrics.mean_absolute_error(y_test, ridgePred)
MSE_ridge = metrics.mean_squared_error(y_test, ridgePred)
RMSE_ridge = np.sqrt(MSE_ridge)

results_df = pd.DataFrame(data=[["Linear Regression", MAE_linear, MSE_linear, RMSE_linear],
                                ["Polynomial Regression", MAE_poly_linear, MSE_poly_linear, RMSE_poly_linear],
                                ["Ridge Regression", MAE_ridge, MSE_ridge, RMSE_ridge]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE'])
results_df

In [None]:
ridgeCvModel = RidgeCV(alphas=(0.1,1.0,10.0,100.0))
ridgeCvModel.fit(X_train, y_train)
ridgeCvModel.alpha_

# 🟢 LASSO Regression (Least Absolute Shrinkage and Selection Operator)

In [None]:
from sklearn.linear_model import LassoCV
lassoCvModel = LassoCV(eps=0.1, n_alphas=10000,cv=5)
lassoCvModel.fit(X_train,y_train)
lassoCvModel.alpha_

In [None]:
lassoPred = lassoCvModel.predict(X_test)
MAE_lasso = metrics.mean_absolute_error(y_test, lassoPred)
MSE_lasso = metrics.mean_squared_error(y_test, lassoPred)
RMSE_lasso = np.sqrt(MSE_lasso)

results_df = pd.DataFrame(data=[["Linear Regression", MAE_linear, MSE_linear, RMSE_linear],
                                ["Polynomial Regression", MAE_poly_linear, MSE_poly_linear, RMSE_poly_linear],
                                ["Ridge Regression", MAE_ridge, MSE_ridge, RMSE_ridge],
                                ["LASSO Regression", MAE_lasso, MSE_lasso, RMSE_lasso]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE'])
results_df

In [None]:
lassoCvModel.coef_

As you saw in the table, LASSO regression didn't do a good job predicting the data. That's because the L1 penalty can shrink coefficients to exactly zero: here all coefficients are zero except two, which means only two of our features are used to predict the label. Part of the reason is probably the search grid, since with `eps=0.1` the smallest alpha tried is one tenth of the largest, so only strong penalties are considered.
* **Ok then why do we even use this model?!!!**
    * In some cases the trade-off in results may be worth it in exchange for using fewer features. Considering only two features makes our job much easier because we only have to worry about those two, although we should weigh that against the error it has given us.
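The effect of the penalty strength on the number of surviving coefficients can be seen on a small synthetic problem. The data, the true coefficients and the two alpha values below are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Same model, weak versus strong L1 penalty
weak = Lasso(alpha=0.01).fit(X, y)
strong = Lasso(alpha=1.0).fit(X, y)

n_weak = int(np.count_nonzero(weak.coef_))
n_strong = int(np.count_nonzero(strong.coef_))
print("non-zero coefficients:", n_weak, "vs", n_strong)
```

The stronger penalty keeps fewer features, which is the feature selection behaviour described above.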

# 🟢 Elastic Net

In [None]:
from sklearn.linear_model import ElasticNetCV
elasticModel = ElasticNetCV(l1_ratio=[.1,.5,.7,.9,.95,.99,1],
                           eps = 0.001, n_alphas=100,max_iter=1000000)
elasticModel.fit(X_train, y_train)

In [None]:
elasticPred = elasticModel.predict(X_test)
print(elasticModel.l1_ratio_)

Since the selected l1_ratio is 1 (this is the mixing parameter between the two penalties, called alpha in some versions of the formula), Elastic Net only used the LASSO (L1) penalty.
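For reference, the objective that scikit-learn's Elastic Net minimizes can be written as follows, where $n$ is the number of samples, $\alpha$ is the overall penalty strength and $\rho$ stands for `l1_ratio` (the symbol $\rho$ is my notation, not something from the notebook):

$$
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 \;+\; \alpha\,\rho\,\lVert w \rVert_1 \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
$$

With $\rho = 1$ the last term vanishes and only the L1 (LASSO) penalty is left; with $\rho = 0$ only the L2 (Ridge-style) penalty remains.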

In [None]:
# Score the Elastic Net predictions (not the LASSO ones)
MAE_elastic = metrics.mean_absolute_error(y_test, elasticPred)
MSE_elastic = metrics.mean_squared_error(y_test, elasticPred)
RMSE_elastic = np.sqrt(MSE_elastic)
results_df = pd.DataFrame(data=[["Linear Regression", MAE_linear, MSE_linear, RMSE_linear],
                                ["Polynomial Regression", MAE_poly_linear, MSE_poly_linear, RMSE_poly_linear],
                                ["Ridge Regression", MAE_ridge, MSE_ridge, RMSE_ridge],
                                ["LASSO Regression", MAE_lasso, MSE_lasso, RMSE_lasso],
                                ["Elastic Net", MAE_elastic, MSE_elastic, RMSE_elastic]], 
                          columns=['Model', 'MAE', 'MSE', 'RMSE'])
results_df

Because the selected l1_ratio is 1, Elastic Net here is a pure LASSO model. Its scores do not have to match the LASSO row exactly, though, since the two searches cover different alpha ranges (`eps=0.1` for LassoCV versus `eps=0.001` for ElasticNetCV).

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(10,7)
regressors = ["Linear", "Polynomial", "Ridge", "LASSO", "Elastic Net"]
rmses = [RMSE_linear, RMSE_poly_linear, RMSE_ridge, RMSE_lasso, RMSE_elastic]
sns.barplot(x=regressors, y=rmses, ax= ax)
plt.ylabel('RMSE')
plt.show()