# <b><u>Advanced Regression:</u></b>
## US-based housing company :
## <b><u>Aim:</u></b>
### <ol>1. To build a regression model using regularisation to predict the actual value of prospective properties.</ol>
### <ol>2. Based on the model, to decide whether to invest in the purchase of these prospective properties or not.</ol>

## <b><u>Problem solving steps:</u></b>
### <ol>1. Initial analysis and data cleanup.</ol>
### <ol>2. Data visualization.</ol>
### <ol>3. Training and testing data creation.</ol>
### <ol>4. Model building using Ridge regression.</ol>
### <ol>5. Model building using Lasso Regression.</ol>


## Step 1: <b><u>Initial analysis, data cleanup and new feature creation:</u></b>

In [None]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors
import warnings
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from datetime import date
warnings.filterwarnings('ignore')
# Use the full option names; abbreviated names rely on pattern matching
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100000)

In [None]:
df_housing = pd.read_csv('train.csv')

In [None]:
df_housing.head()

In [None]:
df_housing.shape

In [None]:
df_housing.info()

In [None]:
df_housing.describe()

### <b>Observations:</b>

### 1.  The features below are either irrelevant for determining the sale price of the housing properties, or redundant because similar features describing the same attribute are present:
##### <li>Id</li>
##### <li>LotFrontage</li>
##### <li>MasVnrArea</li>
##### <li>ExterQual</li>
##### <li>BsmtExposure</li>
##### <li>BsmtFinSF1</li>
##### <li>BsmtFinSF2</li>
##### <li>BsmtUnfSF</li>
##### <li>Electrical</li>
##### <li>1stFlrSF</li>
##### <li>2ndFlrSF</li>
##### <li>LowQualFinSF</li>
##### <li>GarageArea</li>
##### <li>GarageQual</li>
##### <li>MiscVal</li>
##### <li>LandSlope</li>
##### <li>PavedDrive</li>

### 2.  The below features can be combined into a single new feature, and then dropped:
##### <li>Condition1, Condition2</li>
##### <li>Exterior1st, Exterior2nd</li>
##### <li>BsmtFullBath, FullBath</li>
##### <li>HalfBath, BsmtHalfBath</li>
##### <li>WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch</li>
##### <li>BsmtFinType1, BsmtFinType2</li>

### 3.  A new feature to indicate age can be derived for each of the below features, and these features can then be dropped:
##### <li>YearBuilt</li>
##### <li>YearRemodAdd</li>
##### <li>GarageYrBlt</li>

### 4.  Features with a large number of null values in the rows must be removed.

### 5.  Categorical features must be analysed for consistency of data, data type, etc. Inconsistencies must be fixed, and only features that, based on intuition, strongly influence the property's sale price must be retained.

### 6.  Numeric data type features must also be checked for consistency.

### 1. Removing the above columns listed in point 1:


In [None]:
columns_to_drop = ['Id','LotFrontage','MasVnrArea','ExterQual','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','BsmtExposure','Electrical','1stFlrSF','2ndFlrSF','LowQualFinSF','GarageArea','GarageQual','PavedDrive','MiscVal','LandSlope']
df_housing.drop(columns=columns_to_drop,inplace=True)
df_housing.columns

### 2. Combining the related features listed in point 2 above into new features, after fixing/cleaning up the original features, and then dropping the original features:

##### <li>Condition1, Condition2</li>

In [None]:
df_housing.Condition1.value_counts()

In [None]:
df_housing.Condition2.value_counts()

In [None]:
def combineValues(x ,y):
    if str(x).upper()==str(y).upper():
        return str(x)
    else:
        return str(x+'_'+y)


df_housing['Proximity'] = df_housing.apply(lambda x: combineValues(x['Condition1'],x['Condition2']), axis=1)

In [None]:
df_housing.Proximity.value_counts()

In [None]:
df_housing.drop(columns=['Condition1','Condition2'],inplace=True)
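The row-wise `apply` used for `Proximity` can also be expressed with column operations. The following is a minimal sketch on a made-up three-row frame (the values are illustrative, only the column names come from the notebook), assuming both columns hold strings with no missing values:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; only the column names come from the notebook.
toy = pd.DataFrame({
    'Condition1': ['Norm', 'Feedr', 'Artery'],
    'Condition2': ['Norm', 'Norm', 'artery'],
})

# Case-insensitive comparison, mirroring combineValues.
same = toy['Condition1'].str.upper() == toy['Condition2'].str.upper()

# Keep the first value when both agree, otherwise join them with '_'.
toy['Proximity'] = np.where(
    same,
    toy['Condition1'],
    toy['Condition1'] + '_' + toy['Condition2'],
)
print(toy['Proximity'].tolist())
```

On a data set of this size the speed difference is negligible, so the choice is mainly one of readability.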

##### <li>Exterior1st, Exterior2nd</li>

In [None]:
df_housing.Exterior1st.value_counts().keys()

In [None]:
df_housing.Exterior2nd.value_counts()

##### Fixing the erroneous values of the Exterior2nd feature (CmentBd, Wd Shng, Brk Cmn):

In [None]:
def fixExterior2nd(x):
    if x == 'CmentBd':
        return 'CemntBd'
    elif x=='Wd Shng':
        return 'WdShing'
    elif x=='Brk Cmn':
        return 'BrkComm'
    else:
        return x    

df_housing['Exterior2nd'] = df_housing.Exterior2nd.apply(fixExterior2nd)


In [None]:
df_housing.Exterior2nd.value_counts()
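The same spelling fixes could be written as a lookup table passed to `Series.replace`. This is a sketch on a small made-up Series, not the notebook's data; the mapping itself is taken from `fixExterior2nd`:

```python
import pandas as pd

# Made-up sample values; 'VinylSd' stands in for any value that needs no fix.
exterior2nd = pd.Series(['CmentBd', 'Wd Shng', 'Brk Cmn', 'VinylSd'])

# Mapping from the spelling found in Exterior2nd to the one used in Exterior1st.
spelling_fixes = {
    'CmentBd': 'CemntBd',
    'Wd Shng': 'WdShing',
    'Brk Cmn': 'BrkComm',
}

# Values that are not keys of the mapping are left unchanged.
fixed = exterior2nd.replace(spelling_fixes)
print(fixed.tolist())
```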

In [None]:
df_housing['Exterior'] =  df_housing.apply(lambda x: combineValues(x['Exterior1st'],x['Exterior2nd']), axis=1)
df_housing.Exterior.value_counts()

In [None]:
df_housing.drop(columns=['Exterior1st','Exterior2nd'],inplace=True)

##### <li>BsmtFullBath, FullBath</li>

In [None]:
df_housing.BsmtFullBath.value_counts()

In [None]:
df_housing.FullBath.value_counts()

In [None]:
def addValues(x,y):
    return int(x) + int(y)

In [None]:
df_housing['Bath'] = df_housing.apply(lambda x: addValues(x['BsmtFullBath'],x['FullBath']), axis=1)
df_housing.Bath.value_counts()

In [None]:
df_housing[['BsmtFullBath','FullBath','Bath']].head()

In [None]:
df_housing.drop(columns=['BsmtFullBath','FullBath'],inplace=True)

##### <li>HalfBath, BsmtHalfBath</li>

In [None]:
df_housing.HalfBath.value_counts()

In [None]:
df_housing.BsmtHalfBath.value_counts()

In [None]:
df_housing['SemiBath'] = df_housing.apply(lambda x:addValues(x['BsmtHalfBath'],x['HalfBath']),axis=1)
df_housing.SemiBath.value_counts()

In [None]:
df_housing[['BsmtHalfBath','HalfBath','SemiBath']].head()

In [None]:
df_housing.drop(columns=['BsmtHalfBath','HalfBath'],inplace=True)

##### <li>WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch</li>

In [None]:
def addMultipleValues(a,b,c,d,e):
    return a + b + c + d + e

In [None]:
df_housing['TotalPorchArea'] = df_housing.apply(lambda x:addMultipleValues(x['WoodDeckSF'],x['OpenPorchSF'],x['EnclosedPorch'],x['3SsnPorch'],x['ScreenPorch']),axis=1)
df_housing.TotalPorchArea.value_counts()

In [None]:
df_housing[['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch','TotalPorchArea']].tail()

In [None]:
df_housing.drop(columns=['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'],inplace=True)
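Summing several numeric columns does not need a helper function; a row-wise `sum(axis=1)` over the selected columns gives the same total. A minimal sketch with made-up areas:

```python
import pandas as pd

porch_cols = ['WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']

# Made-up areas in square feet, one row per property.
toy = pd.DataFrame(
    [[0, 61, 0, 0, 0],
     [298, 0, 0, 0, 0],
     [0, 35, 272, 0, 0]],
    columns=porch_cols,
)

# Row-wise sum across the five porch/deck columns.
toy['TotalPorchArea'] = toy[porch_cols].sum(axis=1)
print(toy['TotalPorchArea'].tolist())
```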

##### <li>BsmtFinType1, BsmtFinType2</li>

In [None]:
df_housing.BsmtFinType1.value_counts().sum()

In [None]:
df_housing.BsmtFinType2.value_counts().sum()

In [None]:
df_housing.dtypes

In [None]:
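def combineFinTypes(x,y):
    # Identical finish types collapse to a single value
    if x == y:
        return str(x)
    # Only reached if missing values were not filled with 'NA' beforehand
    elif str(y) == 'nan':
        return str(x)
    # Otherwise join the two finish types with '_'
    else:
        return str(x) + '_' + str(y)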

In [None]:
df_housing['BsmtFinType1'] = df_housing['BsmtFinType1'].fillna('NA')
df_housing['BsmtFinType2'] = df_housing['BsmtFinType2'].fillna('NA')
df_housing['BsmtFinType'] = df_housing.apply(lambda x:combineFinTypes(x['BsmtFinType1'],x['BsmtFinType2']), axis=1)


In [None]:
df_housing.BsmtFinType.value_counts()

In [None]:
df_housing.drop(columns=['BsmtFinType1','BsmtFinType2'],inplace=True)

### 3. Creating 'Age' features for the features listed in point 3 above, before dropping the original features:


In [None]:
def calculateAge(x):
    if x:
        current_year = date.today().year
        return current_year - x
    else:
        return 0    
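Note that `calculateAge` measures age relative to the year in which the notebook is run, so the derived values shift every year. One possible alternative, which is not what this notebook does, is to measure age relative to the year of sale (`YrSold`). A sketch on made-up rows, assuming a missing garage year should map to an age of 0 as in the notebook:

```python
import numpy as np
import pandas as pd

# Made-up rows; the second property has no garage.
toy = pd.DataFrame({
    'YearBuilt': [2003, 1976],
    'GarageYrBlt': [2003.0, np.nan],
    'YrSold': [2008, 2007],
})

# Age of the property at the time of sale.
toy['PropertyAge'] = toy['YrSold'] - toy['YearBuilt']

# Age of the garage at the time of sale; missing years become 0.
toy['GarageAge'] = (toy['YrSold'] - toy['GarageYrBlt']).fillna(0).astype(int)
print(toy[['PropertyAge', 'GarageAge']])
```

Either definition differs from the other only by a per-row offset, but the sale-year version is reproducible regardless of when the analysis is rerun.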

##### <li>YearBuilt</li>


In [None]:
df_housing['PropertyAge'] = df_housing.YearBuilt.apply(calculateAge)
df_housing.PropertyAge.value_counts()

In [None]:
df_housing[['YearBuilt','PropertyAge']].tail()

In [None]:
df_housing.drop(columns=['YearBuilt'], inplace=True)

##### <li>YearRemodAdd</li>


In [None]:
df_housing['YearsSinceRemodelling'] = df_housing.YearRemodAdd.apply(calculateAge)
df_housing.YearsSinceRemodelling.value_counts()

In [None]:
df_housing[['YearRemodAdd','YearsSinceRemodelling']].tail()

In [None]:
df_housing.drop(columns=['YearRemodAdd'], inplace=True)

##### <li>GarageYrBlt</li>

In [None]:
df_housing.GarageYrBlt = df_housing.GarageYrBlt.fillna(0)
df_housing['GarageAge'] = df_housing.GarageYrBlt.apply(calculateAge)
df_housing.GarageAge.value_counts()

In [None]:
df_housing[['GarageYrBlt','GarageAge']].tail()

In [None]:
df_housing.drop(columns=['GarageYrBlt'],inplace=True)

In [None]:
df_housing.shape

### 4. Eliminating features with a large number of null/zero values:


In [None]:
df_housing.isnull().sum()

### Removing the below columns, which have a large number of null values:
#### <li>Alley</li>
#### <li>PoolQC</li>
#### <li>MiscFeature</li>
#### <li>Fence</li>

In [None]:
columns_isnull = ['Alley','PoolQC','MiscFeature','Fence']
df_housing.drop(columns=columns_isnull,inplace=True)
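The columns above were picked by inspecting the null counts. A threshold on the share of nulls gives the same kind of selection programmatically. The sketch below uses a made-up frame and a hypothetical cut-off of 75%, which is an assumption and not a value taken from the notebook:

```python
import pandas as pd

# Made-up frame: 'Alley' and 'Fence' are mostly missing, 'LotArea' is complete.
toy = pd.DataFrame({
    'Alley': [None, None, None, None, 'Grvl'],
    'LotArea': [8450, 9600, 11250, 9550, 14260],
    'Fence': [None, 'MnPrv', None, None, None],
})

# Fraction of missing values per column.
null_share = toy.isnull().mean()

# Hypothetical threshold: treat a column as "mostly null" above 75% missing.
NULL_THRESHOLD = 0.75
mostly_null = null_share[null_share > NULL_THRESHOLD].index.tolist()

cleaned = toy.drop(columns=mostly_null)
print(mostly_null, list(cleaned.columns))
```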

In [None]:
df_housing.shape

### 5. Checking and cleaning the categorical features in detail:


#### The remaining categorical features are as follows:
##### <li>MSSubClass</li>
##### <li>MSZoning</li>
##### <li>Street</li>
##### <li>LotShape</li>
##### <li>LandContour</li>
##### <li>Utilities</li>
##### <li>LotConfig</li>
##### <li>Neighborhood</li>
##### <li>Proximity</li>
##### <li>BldgType</li>
##### <li>HouseStyle</li>
##### <li>OverallQual</li>
##### <li>OverallCond</li>
##### <li>RoofStyle</li>
##### <li>RoofMatl</li>
##### <li>Exterior</li>
##### <li>MasVnrType</li>
##### <li>ExterCond</li>
##### <li>Foundation</li>
##### <li>BsmtQual</li>
##### <li>BsmtCond</li>
##### <li>BsmtFinType</li>
##### <li>Heating</li>
##### <li>HeatingQC</li>
##### <li>CentralAir</li>
##### <li>KitchenQual</li>
##### <li>Functional</li>
##### <li>FireplaceQu</li>
##### <li>GarageType</li>
##### <li>GarageCond</li>
##### <li>GarageFinish</li>
##### <li>SaleType</li>
##### <li>SaleCondition</li>

##### (a) MSSubClass

In [None]:
df_housing.MSSubClass = df_housing.MSSubClass.astype('object')

In [None]:
df_housing.MSSubClass.value_counts().sum()

##### The data type of MSSubClass was changed to type 'object'.

##### (b) MSZoning

In [None]:
df_housing.MSZoning.value_counts()

##### Some of the rows have a value of 'C (all)' instead of 'C' for the feature 'MSZoning'. Hence fixing the same:

In [None]:
def fixMSZoning(x):
    if x == 'C (all)':
        return 'C'
    else:
        return x

df_housing.MSZoning = df_housing.MSZoning.apply(fixMSZoning)

##### Changing the data type to 'object':

In [None]:
df_housing.MSZoning = df_housing.MSZoning.astype('object')
df_housing.MSZoning.value_counts()

##### (c) Street

##### Changing the data type to 'object'

In [None]:
df_housing.Street = df_housing.Street.astype('object')
df_housing.Street.value_counts()

##### (d) LotShape

##### Changing the data type to 'object'

In [None]:
df_housing.LotShape = df_housing.LotShape.astype('object')
df_housing.LotShape.value_counts()

##### (e) LandContour

##### Changing the data type to 'object':

In [None]:
df_housing.LandContour = df_housing.LandContour.astype('object')
df_housing.LandContour.value_counts()

##### (f) Utilities

##### Changing the data type to 'object':

In [None]:
df_housing.Utilities = df_housing.Utilities.astype('object')
df_housing.Utilities.value_counts()

##### (g) LotConfig

##### Changing the data type to 'object':

In [None]:
df_housing.LotConfig = df_housing.LotConfig.astype('object')
df_housing.LotConfig.value_counts()

##### (h) Neighborhood

In [None]:
df_housing.Neighborhood.value_counts()

##### Updating the value of 'NAmes' to 'Names' and changing the data type to 'object':

In [None]:
def fixNeighborhood(x):
    if x == 'NAmes':
        return 'Names'
    else:
        return x

df_housing.Neighborhood = df_housing.Neighborhood.apply(fixNeighborhood)

In [None]:
df_housing.Neighborhood = df_housing.Neighborhood.astype('object')
df_housing.Neighborhood.value_counts()

##### (i) Proximity

##### Changing the data type to 'object':

In [None]:
df_housing.Proximity = df_housing.Proximity.astype('object')
df_housing.Proximity.value_counts()

##### (j) BldgType:

In [None]:
df_housing.BldgType.value_counts()

##### Fixing the incorrect values for BldgType: 'Duplex','Twnhs' and '2fmCon' and changing the data type of the feature to 'object':

In [None]:
def fixBldgType(x):
    if x == 'Duplex':
        return 'Duplx'
    elif x == 'Twnhs':
        return 'TwnhsI'
    elif x == '2fmCon':
        return '2FmCon'
    else:
        return x

df_housing.BldgType = df_housing.BldgType.apply(fixBldgType)
df_housing.BldgType = df_housing.BldgType.astype('object')
df_housing.BldgType.value_counts()

##### (k) HouseStyle

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.HouseStyle.value_counts()

In [None]:
df_housing.HouseStyle = df_housing.HouseStyle.astype('object')
df_housing.HouseStyle.value_counts()

##### (l) OverallQual

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.OverallQual = df_housing.OverallQual.astype('object')
df_housing.OverallQual.value_counts()

##### (m) OverallCond

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.OverallCond = df_housing.OverallCond.astype('object')
df_housing.OverallCond.value_counts()

##### (n) RoofStyle

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.RoofStyle.value_counts()

In [None]:
df_housing.RoofStyle = df_housing.RoofStyle.astype('object') 
df_housing.RoofStyle.value_counts()

##### (o) RoofMatl

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.RoofMatl = df_housing.RoofMatl.astype('object')
df_housing.RoofMatl.value_counts()

##### (p) Exterior

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.Exterior = df_housing.Exterior.astype('object')
df_housing.Exterior.value_counts()

##### (q) MasVnrType

##### Changing the data type of the feature to 'object' and imputing the blank values with the mode value i.e. 'None':

In [None]:
df_housing.MasVnrType.value_counts()

In [None]:
df_housing['MasVnrType'] = df_housing['MasVnrType'].fillna('None')
df_housing['MasVnrType'] = df_housing['MasVnrType'].astype('object')
df_housing.MasVnrType.value_counts()

##### (r) ExterCond

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.ExterCond = df_housing.ExterCond.astype('object')
df_housing.ExterCond.value_counts()

#### (s) Foundation

##### Changing the data type of the feature to 'object':

In [None]:
df_housing.Foundation = df_housing.Foundation.astype('object')
df_housing.Foundation.value_counts()

##### (t) BsmtQual
##### Changing the data type of the feature to 'object' and imputing the missing values with the mode value i.e. 'TA':

In [None]:
df_housing.BsmtQual = df_housing.BsmtQual.fillna('TA')
df_housing.BsmtQual = df_housing.BsmtQual.astype('object')
df_housing.BsmtQual.value_counts()

#### (u) BsmtCond
##### Changing the data type of the feature to 'object' and imputing the missing values with the mode value i.e. 'TA':

In [None]:
df_housing.BsmtCond = df_housing.BsmtCond.fillna('TA')
df_housing.BsmtCond = df_housing.BsmtCond.astype('object')
df_housing.BsmtCond.value_counts()

#### (v) BsmtFinType
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.BsmtFinType = df_housing.BsmtFinType.astype('object')
df_housing.BsmtFinType.value_counts()

#### (w) Heating
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.Heating = df_housing.Heating.astype('object')
df_housing.Heating.value_counts()

#### (x) HeatingQC
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.HeatingQC = df_housing.HeatingQC.astype('object')
df_housing.HeatingQC.value_counts()

#### (y) CentralAir
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.CentralAir = df_housing.CentralAir.astype('object')
df_housing.CentralAir.value_counts()

#### (z) KitchenQual
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.KitchenQual.value_counts()

In [None]:
df_housing.KitchenQual = df_housing.KitchenQual.astype('object')
df_housing.KitchenQual.value_counts()

#### (a.a) Functional
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.Functional = df_housing.Functional.astype('object')
df_housing.Functional.value_counts()

#### (a.b) FireplaceQu
##### Changing the data type of the feature to 'object' and setting the values of FireplaceQu for records with no fireplaces to 'NA':

In [None]:
df_housing[df_housing.FireplaceQu.isnull()].Fireplaces.value_counts()


In [None]:
df_housing['FireplaceQu'] = df_housing['FireplaceQu'].fillna('NA')

df_housing.FireplaceQu = df_housing.FireplaceQu.astype('object')

df_housing.FireplaceQu.value_counts()

#### (a.c) GarageType
##### Changing the data type of the feature to 'object' and replacing blank values with 'NA':

In [None]:
df_housing.GarageType = df_housing.GarageType.fillna('NA')
df_housing.GarageType = df_housing.GarageType.astype('object')
df_housing.GarageType.value_counts()

#### (a.d) GarageCond
##### Changing the data type of the feature to 'object' and replacing blank values with 'NA':

In [None]:
df_housing.GarageCond = df_housing.GarageCond.fillna('NA')
df_housing.GarageCond = df_housing.GarageCond.astype('object')
df_housing.GarageCond.value_counts()

#### (a.e) GarageFinish
##### Changing the data type of the feature to 'object' and replacing blank values with 'NA':

In [None]:
df_housing.GarageFinish = df_housing.GarageFinish.fillna('NA')
df_housing.GarageFinish = df_housing.GarageFinish.astype('object')
df_housing.GarageFinish.value_counts()

#### (a.f) SaleType
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.SaleType = df_housing.SaleType.astype('object')
df_housing.SaleType.value_counts()

In [None]:
df_housing.SaleType.value_counts().sum()

#### (a.g) SaleCondition
##### Changing the data type of the feature to 'object':

In [None]:
df_housing.SaleCondition = df_housing.SaleCondition.astype('object')
df_housing.SaleCondition.value_counts()
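The per-feature cells in this section repeat the same `astype('object')` call. When no value fixing is needed, the conversion could be done for a whole list of columns at once. A minimal sketch on a made-up frame, where `categorical_cols` is an illustrative subset rather than the full list used in the notebook:

```python
import pandas as pd

# Made-up rows; MSSubClass and OverallQual are numeric codes, not quantities.
toy = pd.DataFrame({
    'MSSubClass': [60, 20, 70],
    'OverallQual': [7, 6, 7],
    'LotArea': [8450, 9600, 11250],
})

# Illustrative subset of the coded features to treat as categorical.
categorical_cols = ['MSSubClass', 'OverallQual']

# astype accepts a column -> dtype mapping, leaving other columns untouched.
toy = toy.astype({col: 'object' for col in categorical_cols})
print(toy.dtypes)
```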

### 6. Checking the numeric data type features for consistency:
#### The following will be checked:
#### <li>Spread of values</li>
#### <li>Presence of '0' or null values, and dropping features with a high percentage of null or '0' values</li>
#### NOTE: We will not delete rows, since we have only 1460 records for the purpose of building a model.

In [None]:
df_housing[df_housing.select_dtypes(include=['int64','float64']).columns].describe()

#### Checking the above properties for zero values:

In [None]:
(df_housing.LotArea == 0).sum()


In [None]:
(df_housing.TotalBsmtSF == 0).sum()


In [None]:
(df_housing.GrLivArea == 0).sum()


In [None]:
(df_housing.BedroomAbvGr == 0).sum()


In [None]:
(df_housing.KitchenAbvGr == 0).sum()


In [None]:
(df_housing.TotRmsAbvGrd == 0).sum()


In [None]:
(df_housing.Fireplaces == 0).sum()


In [None]:
(df_housing.GarageCars == 0).sum()


In [None]:
(df_housing.PoolArea == 0).sum()


In [None]:
(df_housing.MoSold == 0).sum()


In [None]:
(df_housing.YrSold == 0).sum()


In [None]:
(df_housing.SalePrice == 0).sum()


In [None]:
(df_housing.Bath == 0).sum()


In [None]:
(df_housing.SemiBath == 0).sum()


In [None]:
(df_housing.TotalPorchArea == 0).sum()


In [None]:
(df_housing.PropertyAge == 0).sum()


In [None]:
(df_housing.YearsSinceRemodelling == 0).sum()

In [None]:
(df_housing.GarageAge == 0).sum()
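The individual zero checks can be condensed into one table giving the share of zeros per numeric column. A sketch on made-up data, in which `PoolArea` is deliberately mostly zero:

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    'PoolArea': [0, 0, 0, 512],
    'GarageCars': [2, 0, 1, 3],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Restrict to numeric columns, then compute the fraction of zeros in each.
numeric = toy.select_dtypes(include='number')
zero_share = (numeric == 0).mean().sort_values(ascending=False)
print(zero_share)
```

Sorting in descending order puts candidates for removal, such as `PoolArea`, at the top.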

#### Dropping 'PoolArea' feature since most of the records have a value of '0' for this column:

In [None]:
df_housing.drop(columns=['PoolArea'],inplace=True)
df_housing.shape

## Step 2: Data Visualization
### The following data visualizations will be performed:
### <li>Initial visualization of feature correlations.</li>
### <li>Univariate analysis of some of the important features.</li>
### <li>Bi-variate analysis of some of the important features.</li>


### 1. Initial visualization of feature correlations (numeric-type features):

In [None]:
plt.figure(figsize=(16,8))
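# Restrict to numeric columns; newer pandas versions raise an error if corr() meets object columns
sns.heatmap(df_housing.select_dtypes(include='number').corr(), cmap="YlGnBu", annot=True)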
plt.show()

In [None]:
df_housing.select_dtypes(include=['int64','float64']).columns

In [None]:
df_housing.YrSold = df_housing.YrSold.astype('object')
df_housing.MoSold = df_housing.MoSold.astype('object')

#### Observations:
#### The below features have a significant correlation with 'SalePrice':
#### <li>TotalBsmtSF</li>
#### <li>GrLivArea</li>
#### <li>TotRmsAbvGrd</li>
#### <li>Fireplaces</li>
#### <li>GarageCars</li>
#### <li>Bath</li>
#### <li>PropertyAge</li>
#### <li>YearsSinceRemodelling</li>
#### <li>GarageAge</li>
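Besides reading the heatmap, the correlations with `SalePrice` can be ranked directly. The sketch below uses made-up numbers, so the resulting values say nothing about the real data; it only illustrates the mechanics:

```python
import pandas as pd

# Made-up values; GrLivArea is constructed to be perfectly correlated with SalePrice.
toy = pd.DataFrame({
    'SalePrice': [100, 200, 300, 400],
    'GrLivArea': [1000, 2000, 3000, 4000],
    'PropertyAge': [40, 10, 30, 5],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Pearson correlation of every numeric feature with the target.
corr_with_price = (
    toy.select_dtypes(include='number')
       .corr()['SalePrice']
       .drop('SalePrice')
)

# Rank by absolute strength so that strong negative correlations are not missed.
ranked = corr_with_price.abs().sort_values(ascending=False)
print(ranked)
```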

### 2. Univariate analysis of some of the important features:
#### Performing univariate analysis on the below important features:
##### <li>Utilities</li>
##### <li>Neighborhood</li>
##### <li>Proximity</li>
##### <li>HouseStyle</li>
##### <li>OverallQual</li>
##### <li>OverallCond</li>
##### <li>ExterCond</li>
##### <li>Foundation</li>
##### <li>BsmtQual</li>
##### <li>HeatingQC</li>
##### <li>GarageCond</li>
##### <li>SaleType</li>
##### <li>SaleCondition</li>
##### <li>MoSold</li>
##### <li>YrSold</li>
##### <li>TotalBsmtSF</li>
##### <li>GrLivArea</li>
##### <li>TotRmsAbvGrd</li>
##### <li>Fireplaces</li>
##### <li>GarageCars</li>
##### <li>Bath</li>
##### <li>PropertyAge</li>
##### <li>YearsSinceRemodelling</li>
##### <li>GarageAge</li>



In [None]:
univariate_cat_cols = ['Utilities','Neighborhood','Proximity','HouseStyle','OverallQual','OverallCond','ExterCond','Foundation','BsmtQual','HeatingQC','GarageCond','SaleType','SaleCondition','MoSold','YrSold','Functional']
for column in univariate_cat_cols:
    plt.figure(figsize=(20,10))
    plt.title(column)
    sns.histplot(data=df_housing,y=column)
    plt.show()

In [None]:
univariate_num_cols = ['TotalBsmtSF',
'GrLivArea',
'TotRmsAbvGrd',
'Fireplaces',
'GarageCars',
'Bath',
'PropertyAge',
'YearsSinceRemodelling',
'GarageAge']
for column in univariate_num_cols:
    plt.figure(figsize=(20,10))
    plt.title(column)
    sns.histplot(data=df_housing,x=column)
    plt.show()

#### Univariate analysis observations:

##### Property sales were high for properties:
##### <li>having all public utilities</li>
##### <li>within the neighbourhoods College Creek and Northwest Ames.</li>
##### <li>with normal proximity.</li>
##### <li>of dwelling type one story and two story.</li>
##### <li>whose overall material and finish were rated as 'Average', 'Above Average' and 'Good'.</li>
##### <li>whose overall condition is 'Average'.</li>
##### <li>with an 'Average/Typical' exterior material condition.</li>
##### <li>with foundation type 'Poured Concrete' and 'Cinder Block'.</li>
##### <li>whose basement heights are 'Typical (80-89 inches)' and 'Good (90-99 inches)'.</li>
##### <li>with heating quality rated as 'Excellent'.</li>
##### <li>with a 'Typical/Average' garage condition.</li>
##### <li>with sale type 'Warranty Deed - Conventional'.</li>
##### <li>with sale condition 'Normal'.</li>
##### <li>during the months of June and July.</li>
##### <li>during the years 2006 to 2009, with a significant decrease in sales during the year 2010.</li>
##### <li>with a total basement area between 800 to 1200 square feet.</li>
##### <li>with a ground living area between 1000 to 1800 square feet.</li>
##### <li>with the total rooms above grade between 6 to 7 in number.</li>
##### <li>with the number of fireplaces between 0 to 1.</li>
##### <li>with a garage having space for 2 cars.</li>
##### <li>with 2 full bathrooms.</li>
##### <li>whose age ranges from 0 to 20 and 45 to 75.</li>
##### <li>that were remodelled 12 to 20 years ago and 65 to 70 years ago.</li>
##### <li>whose garage is less than 20 years old.</li>
##### <li>with functional type 'Typ' (typical)</li>

### 3. Bivariate analysis of some of the important features:
#### Performing a limited bivariate analysis between the below important features:
##### <li>SalePrice vs Neighborhood</li>
##### <li>SalePrice vs HouseStyle</li>
##### <li>SalePrice vs OverallQual</li>
##### <li>SalePrice vs YrSold</li>
##### <li>SalePrice vs SaleCondition</li>
##### <li>SalePrice vs SaleType</li>

##### <li>SalePrice vs GrLivArea</li>
##### <li>SalePrice vs TotRmsAbvGrd</li>
##### <li>SalePrice vs PropertyAge</li>
##### <li>SalePrice vs TotalPorchArea</li>
##### <li>SalePrice vs TotalBsmtSF</li>
##### <li>SalePrice vs LotArea</li>

##### <li>Neighborhood vs OverallCond</li>
##### <li>Neighborhood vs BldgType</li>
##### <li>Exterior vs ExterCond</li>
##### <li>SaleCondition vs OverallCond</li>
##### <li>SaleType vs OverallQual</li>
##### <li>SaleType vs HouseStyle</li>
##### <li>PropertyAge vs OverallCond</li>
##### <li>PropertyAge vs Foundation</li>

In [None]:
df_housing.columns

In [None]:
cont_vs_cat_cols = [['SalePrice',  'Neighborhood'],
['SalePrice',  'HouseStyle'],
['SalePrice', 'OverallQual'],
['SalePrice', 'YrSold'],
['SalePrice','SaleCondition'],
['SalePrice' , 'SaleType'],
['PropertyAge','OverallCond'],
['PropertyAge','Foundation']]
cont_vs_cont_cols = [['SalePrice','GrLivArea'],['SalePrice','TotRmsAbvGrd'],['SalePrice','PropertyAge'],['SalePrice','TotalPorchArea'],['SalePrice','TotalBsmtSF']]
cat_vs_cat_cols = [['Neighborhood','OverallCond'],['Neighborhood','BldgType'],['SaleCondition','OverallCond'],['SaleType','OverallQual'],['SaleType','HouseStyle']]

In [None]:
for row in cat_vs_cat_cols:
    plt.figure(figsize=(20,10))
    sns.countplot(y=row[0],hue=row[1],data=df_housing)
    plt.title(row[0]+ ' vs ' + row[1] )
    plt.show()

In [None]:
for row in cont_vs_cat_cols:
    plt.figure(figsize=(20,10))
    plt.title(row[1]+ ' vs ' + row[0] )
    sns.boxplot(y=row[0],x=row[1],data=df_housing)
    plt.show() 
    

In [None]:
for row in cont_vs_cont_cols:
    plt.figure(figsize=(20,10))
    plt.title(row[1]+ ' vs ' + row[0] )
    sns.scatterplot(y=row[0],x=row[1],data=df_housing)
    plt.show() 
    

#### Bivariate analysis observations:

##### <li> A majority of the neighbourhoods have properties whose overall condition is 'Average' with the exception of Crawford, Old Town, Meadow Village and Bluestem.</li>

##### <li> Most of the neighbourhoods have a 'Single-family Detached' type of dwelling.</li>

##### <li>There are more properties that are in an 'Average' overall condition across the various condition of sale values, with the 'Normal' sale condition having the largest number of properties.</li>


##### <li>Properties with 'Warranty Deed - Conventional' sale type have mostly 'Average', 'Above Average' and 'Good' overall quality.</li>

##### <li>Properties with 'Home just constructed and sold' sale type have mostly 'Good' and 'Very Good' overall quality.</li>


##### <li>A majority of the properties with 'Warranty Deed - Conventional' and 'Home just constructed and sold' sale types are of types 'One story' and 'Two story'.</li>

##### <li>Sale prices are higher for the neighbourhoods 'Northridge', 'Northridge Heights' and 'Stone Brook', while the neighbourhoods 'Stone Brook', 'Veenker', 'Northridge Heights' and 'Timberland' have a larger spread (range) of sale price values.</li>

##### <li>The house styles 'Two and one-half story: 2nd level finished', 'Two story' and 'One story' have higher sale prices, and the same have a larger range of sale prices compared to the other house styles.</li>

##### <li>The sale price also increases steadily with an increase in the rating of the overall material and finish of the house.</li>

##### <li>The property sale prices and the spread of the sale prices have remained steady across all the years from 2006 to 2010.</li>


##### <li>Properties with sale condition type 'Partial' (Home was not completed when last assessed (associated with New Homes)) have the highest sale prices and the largest range of sale price values in comparison to other sale condition types.</li>


##### <li>Properties whose sale types are 'New' (Home just constructed and sold) and 'Con' (Contract 15% Down payment regular terms) have higher sales prices compared to other sale types.</li>

##### <li>Only properties which are between 20 to 50 years old have an 'Average' overall condition, and the properties greater than 50 years of age tend to fall in any of the other overall condition state values. </li>


##### <li>Properties with a foundation of either 'Stone' or 'Brick & Tile' are older compared to the others.</li>

##### <li>Properties made of 'PConc' (Poured Concrete) are younger in age compared to the other properties.</li>

##### <li>There are very few properties whose foundation is made of wood.</li>

##### <li>The sale price of the properties increases with an increase in 'GrLivArea' (living area above ground in square feet).</li>

##### <li>The maximum sale price does not show any definitive pattern against the number of rooms, but properties with more rooms tend to have a higher sale price.</li>

##### <li>The sale price values for older properties tend to be more stable and lower in value when compared to newer properties that are costlier.</li>

##### <li>No definitive trend is visible for sale price vs total porch area.</li>

##### <li>The property sale prices tend to increase with an increase in the total basement square feet value.</li>

## Step 3. Training and testing data creation.

#### Creating the dummy variables for the categorical features:

In [None]:
df_housing_categorical = df_housing.select_dtypes(include=['object'])
df_housing_dummies = pd.get_dummies(df_housing_categorical,drop_first=True)
df_housing_dummies.head()

##### Dropping the categorical variables, and concatenating the dummy variables created with the housing data frame:


In [None]:
df_housing.drop(df_housing_categorical.columns,axis=1, inplace=True)
df_housing = pd.concat([df_housing,df_housing_dummies], axis = 1)
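To make the effect of `drop_first=True` concrete, the sketch below encodes two made-up categorical columns. For each feature the first category in sorted order is dropped and acts as the baseline, which avoids perfectly collinear dummy columns:

```python
import pandas as pd

# Made-up rows; Street has two levels, LotShape has three.
toy = pd.DataFrame({
    'Street': ['Pave', 'Grvl', 'Pave'],
    'LotShape': ['Reg', 'IR1', 'IR2'],
})

# One column per category, minus the first category of each feature.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
```

A row whose dummies are all zero for a feature therefore belongs to that feature's baseline category ('Grvl' and 'IR1' here).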


#### Splitting the data into test and train data:

In [None]:
np.random.seed(100)
df_train, df_test = train_test_split(df_housing, train_size = 0.7, test_size = 0.3,random_state=100)

#### Scaling the numeric data type features using min-max scaling (dummy variables excluded):

In [None]:
df_train.head()

In [None]:
minMaxScaler = MinMaxScaler()
housing_numeric_cols = df_housing.select_dtypes(include=['int64','float64']).columns
df_train[housing_numeric_cols] = minMaxScaler.fit_transform(df_train[housing_numeric_cols])
df_train.head()

#### Dividing the data into X (independent features) and Y (dependent feature):

In [None]:
df_train_y = df_train.pop('SalePrice')
df_train_x = df_train

In [None]:
df_train_x.head()

## Step 4: Model building using Ridge regression.

#### Initial recursive feature elimination:

In [None]:
ridge = Ridge()
ridge.fit(df_train_x,df_train_y)
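# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(ridge, n_features_to_select=15)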
rfe = rfe.fit(df_train_x, df_train_y)

In [None]:
selected_features = df_train_x.columns[rfe.support_]

#### Retaining only the final list of columns after recursive feature elimination:

In [None]:
df_train_x_rfe = df_train_x[selected_features]
df_train_x_rfe.columns
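Recursive feature elimination repeatedly fits the estimator and discards the feature with the smallest coefficient magnitude until the requested number remains. The sketch below runs it on synthetic data from `make_regression`; the sample sizes and the choice of 3 features are illustrative assumptions, not the notebook's settings:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

# Synthetic regression problem: 8 features, of which 3 carry signal.
X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                       noise=1.0, random_state=0)

# Keep 3 features; the count is passed by keyword.
rfe = RFE(estimator=Ridge(), n_features_to_select=3)
rfe.fit(X, y)

# support_ is a boolean mask over the columns; ranking_ is 1 for kept features.
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected, rfe.ranking_)
```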

#### Determination of the optimal value of alpha:

In [None]:
alpha_values = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 
 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 
 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

# cross validation
folds = 5
best_cv = GridSearchCV(estimator = ridge, 
                        param_grid = alpha_values, 
                        scoring= 'neg_mean_absolute_error',  
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
best_cv.fit(df_train_x_rfe, df_train_y) 
print(best_cv.best_params_)
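Beyond `best_params_`, the `cv_results_` attribute holds the mean train and validation scores for every alpha, which helps to judge how sensitive the model is to the penalty. A sketch on synthetic data with a shortened, illustrative alpha grid:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the training set.
X, y = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

# Shortened grid for illustration only.
grid = {'alpha': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(Ridge(), param_grid=grid,
                      scoring='neg_mean_absolute_error',
                      cv=5, return_train_score=True)
search.fit(X, y)

# One row per alpha; scores are negated errors, so values closer to 0 are better.
results = pd.DataFrame(search.cv_results_)[
    ['param_alpha', 'mean_train_score', 'mean_test_score']
]
print(results)
print(search.best_params_)
```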

#### Using 5-fold cross-validation, the best value of the hyperparameter alpha selected by the grid search is 0.1.
#### Proceeding with predicting the outcome and calculating the residuals for the training data:

In [None]:
ridge1 = Ridge(alpha=best_cv.best_params_['alpha'])
ridge1.fit(df_train_x_rfe,df_train_y)
df_pred_train = ridge1.predict(df_train_x_rfe)
df_res_train = df_train_y - df_pred_train

#### Evaluating the model using residual analysis:

#### Checking the distribution of the error terms:

In [None]:
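# sns.distplot is deprecated; histplot with a KDE overlay is its replacement
sns.histplot(df_res_train, bins=20, kde=True, stat='density')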
plt.title('Distribution of training data residuals:')
plt.xlabel('Residuals')
plt.show()

#### Checking the independence and homoscedasticity of the error terms:

In [None]:
## plotting residual errors in training data
sns.scatterplot(data=df_res_train)
plt.title('Scatterplot of the residuals:')
plt.ylabel('Residuals')
plt.show()

#### Residual analysis observations:
##### <li>The error terms are approximately normally distributed around 0.</li>
##### <li>The error terms are mostly homoscedastic and appear to be independent of each other.</li>
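The visual checks above can be complemented with a few numbers. The following is only a sketch on simulated residuals (`toy_res` and `toy_pred` are hypothetical stand-ins for `df_res_train` and `df_pred_train`). It computes the Durbin-Watson statistic by hand (values near 2 suggest no first-order autocorrelation), compares the residual spread for low and high predictions as a rough homoscedasticity check, and reports the skewness.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical predictions and well-behaved residuals
toy_pred = rng.uniform(0.0, 1.0, size=500)
toy_res = pd.Series(rng.normal(0.0, 0.05, size=500))

# Durbin-Watson: sum of squared successive differences over sum of squares
res_values = toy_res.to_numpy()
dw = np.sum(np.diff(res_values) ** 2) / np.sum(res_values ** 2)

# Compare residual spread for the lower and upper half of the predictions
order = np.argsort(toy_pred)
low_std = toy_res.iloc[order[:250]].std()
high_std = toy_res.iloc[order[250:]].std()

# Skewness near 0 is consistent with a symmetric distribution
skewness = toy_res.skew()

print('Durbin-Watson:', round(dw, 3))
print('Spread ratio (low/high):', round(low_std / high_std, 3))
print('Skewness:', round(skewness, 3))
```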

#### Evaluating the model using test data:


#### Scaling test data first:

In [None]:
housing_numeric_cols = df_housing.select_dtypes(include=['int64','float64']).columns
df_test[housing_numeric_cols] = minMaxScaler.transform(df_test[housing_numeric_cols])
df_test.head()

#### Predicting the sale price for the test data:

In [None]:
df_test_y = df_test.pop('SalePrice')
df_test_x_rfe = df_test[selected_features]
df_pred_test = ridge1.predict(df_test_x_rfe)


#### Plotting the actual vs predicted sale price (scaling applied) for the test data:

In [None]:
plt.scatter(df_test_y, df_pred_test)
plt.title('Actual vs Predicted Sale Price (Ridge regression)') 
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

#### The above scatter plot shows a roughly linear relationship between the actual and predicted values, through which a line of best fit can be drawn; this suggests the model fits well.

#### Model Evaluation using the R-squared test on both the train and test data:

In [None]:
print("R2 score for training data: ",r2_score(df_train_y,df_pred_train))
print("R2 score for test data: ",r2_score(df_test_y,df_pred_test))
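Since 'SalePrice' was min-max scaled along with the other numeric columns, errors computed on it are in scaled units. As a hedged sketch, a fitted `MinMaxScaler` exposes `data_min_` and `data_range_`, which can be used to map a single column back to its original units. The toy frame and the constant prediction error below are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data standing in for the numeric housing columns
toy = pd.DataFrame({'GrLivArea': [800.0, 1200.0, 2000.0, 1500.0],
                    'SalePrice': [100000.0, 150000.0, 260000.0, 180000.0]})
cols = list(toy.columns)

toy_scaler = MinMaxScaler()
scaled = pd.DataFrame(toy_scaler.fit_transform(toy), columns=cols)

# Locate the SalePrice column in the fitted scaler
idx = cols.index('SalePrice')
price_min = toy_scaler.data_min_[idx]
price_range = toy_scaler.data_range_[idx]

# Undo the scaling: x = x_scaled * (max - min) + min
recovered = scaled['SalePrice'] * price_range + price_min

# Pretend every scaled prediction is off by 0.01, then express RMSE in dollars
toy_pred_scaled = scaled['SalePrice'] + 0.01
rmse_scaled = np.sqrt(mean_squared_error(scaled['SalePrice'], toy_pred_scaled))
rmse_dollars = rmse_scaled * price_range
print(rmse_scaled, rmse_dollars)
```

Note that R2 is unaffected by this linear rescaling; only error metrics expressed in units, such as RMSE or MAE, change.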

#### <li>As per its R2 score, the model explains approximately 84% of the variance of the target feature 'SalePrice' in the training data and approximately 79% in the test data.</li>
#### <li>Since the gap between the training and test R2 scores is small (about 5 percentage points), and given the observations from the residual analysis and the actual vs predicted scatter plot, we can reasonably conclude that the model, with regularization applied (alpha = 0.1), generalizes well.</li>
#### NOTE: The number of features was initially reduced to 15 using recursive feature elimination (RFE).

#### Summarizing the most important features for the created model using ridge regression, along with their co-efficient values:

In [None]:
model_ridge = pd.DataFrame()
model_ridge['Features'] = df_train_x_rfe.columns
model_ridge['Beta Values(co-efficents)'] = ridge1.coef_
model_ridge1 = model_ridge[model_ridge['Beta Values(co-efficents)']>0].sort_values(by=['Beta Values(co-efficents)'],ascending=False)
model_ridge2 = model_ridge[model_ridge['Beta Values(co-efficents)']<0].sort_values(by=['Beta Values(co-efficents)'],ascending=True)
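# Stack positive then negative coefficients; concat preserves the sorted order,
# whereas an outer merge re-sorts the rows by the join keys
model_ridge = pd.concat([model_ridge1, model_ridge2], ignore_index=True)
print("The 15 RFE-selected features that influence SalePrice (using Ridge regression):")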
print(model_ridge)

#### The top 5 most important independent features that influence the dependent feature 'SalePrice', using a ridge regression model are:
##### <li><b>Proximity_PosN</b> (Proximity condition: Near positive off-site feature--park, greenbelt, etc.)</li>
##### <li><b>GrLivArea</b> (Above grade (ground) living area square feet)</li>
##### <li><b>OverallQual_10</b> (Rating of the overall material and finish of the house: 10 i.e. 'Very Excellent')</li>
##### <li><b>OverallQual_9</b> (Rating of the overall material and finish of the house: 9 i.e. 'Excellent')</li>
##### <li><b>Exterior_Wd Sdng_ImStucc</b> (The exterior covering combination of 'Wood Siding'+'Imitation Stucco')</li>
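When the features are on a comparable scale, the magnitude of a coefficient is a rough indicator of influence regardless of its sign. A minimal sketch of ranking by absolute value, using made-up coefficients (`toy_coefs` and its feature names are hypothetical):

```python
import pandas as pd

# Hypothetical coefficient values, not taken from the fitted model
toy_coefs = pd.Series({'feat_a': 0.30, 'feat_b': -0.45,
                       'feat_c': 0.05, 'feat_d': -0.10})

# Order features by absolute coefficient size while keeping the original signs
ranked = toy_coefs.reindex(toy_coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

With the fitted model, the same idea would apply to a Series built from `ridge1.coef_` indexed by `df_train_x_rfe.columns`.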

## Step 5: Model building using Lasso regression.

#### Determining the optimal value of alpha for lasso regression, using the cross validation technique:

In [None]:
lasso = Lasso()

best_cv = GridSearchCV(estimator = lasso, 
                        param_grid = alpha_values, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)

best_cv.fit(df_train_x,df_train_y)   
print(best_cv.best_params_)             

#### The best value of alpha found using cross-validation was 0.0001.
#### Fitting the training data using this value of alpha and predicting the output for the train and test data
#### NOTE: scaling of the train and test data is already done during model building with ridge regression.

In [None]:
lasso1 = Lasso(alpha=best_cv.best_params_['alpha'])
lasso1.fit(df_train_x,df_train_y)

df_pred_train_lasso = lasso1.predict(df_train_x)
df_res_train_lasso = df_train_y - df_pred_train_lasso
df_test_x = df_test
df_pred_test_lasso = lasso1.predict(df_test_x)
df_res_test_lasso = df_test_y - df_pred_test_lasso
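Unlike ridge, the lasso penalty can shrink coefficients exactly to zero, which is why it is fitted here on all features without a prior RFE step. The sketch below shows this on synthetic data (the data, the feature names `f0`..`f5` and the value `alpha=0.1` are assumptions for illustration): only the three features that actually drive the target keep non-zero coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Hypothetical features: only f0, f2 and f4 influence the target
toy_x = pd.DataFrame(rng.normal(size=(200, 6)),
                     columns=[f'f{i}' for i in range(6)])
toy_y = (5 * toy_x['f0'] - 4 * toy_x['f2'] + 3 * toy_x['f4']
         + rng.normal(scale=0.1, size=200))

toy_lasso = Lasso(alpha=0.1)
toy_lasso.fit(toy_x, toy_y)

# Coefficients set exactly to zero correspond to features the model dropped
toy_coef = pd.Series(toy_lasso.coef_, index=toy_x.columns)
toy_kept = toy_coef[toy_coef != 0]
print(toy_kept)
print('Features kept:', len(toy_kept), 'of', len(toy_coef))
```

The same count on the fitted model, `(lasso1.coef_ != 0).sum()`, would show how many of the housing features survive at the selected alpha.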

#### Evaluation of the residuals:

#### Checking the distribution of the error terms:

In [None]:
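# sns.distplot is deprecated; histplot with a KDE overlay is its replacement
sns.histplot(df_res_train_lasso, bins=20, kde=True, stat='density')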
plt.title('Distribution of training data residuals (Lasso regression):')
plt.xlabel('Residuals')
plt.show()

#### Checking the independence and homoscedasticity of the error terms:

In [None]:
sns.scatterplot(data=df_res_train_lasso)
plt.title('Scatterplot of the residuals(Lasso regression):')
plt.ylabel('Residuals')
plt.show()

#### Residual analysis observations:
##### <li>The residuals of the training data are approximately normally distributed.</li>
##### <li>The residuals appear independent of each other and, barring a few values, are mostly homoscedastic (constant variance).</li>

#### Plotting the actual vs predicted sale price values, for the test data:

In [None]:
plt.scatter(df_test_y, df_pred_test_lasso)
plt.title('Actual vs Predicted Sale Price (Lasso regression)') 
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

#### Observation:
##### <li>There is a roughly linear relationship between the actual and predicted output values, through which a line of best fit can be drawn.</li>
##### <li>This is an indication of a good model.</li>

#### Checking the R2 scores of the model for both the train and test data:

In [None]:
print("R2 score for training data: ",r2_score(df_train_y,df_pred_train_lasso))
print("R2 score for test data: ",r2_score(df_test_y,df_pred_test_lasso))

#### Observations:
##### <li>The model is able to explain approximately 90% of the variance in the output of the training data and 85% of the variance in the output of the test data.</li>
##### <li>Since the difference in the R2 scores of the train and test data sets is about 5 percentage points, we can say with reasonable confidence that the created model, with alpha = 0.0001, generalizes well.</li>

#### Summarizing the most important independent features and their co-efficient (beta) values:

In [None]:
model_lasso = pd.DataFrame()
model_lasso['Features'] = df_train_x.columns
model_lasso['Beta Values(co-efficents)'] = lasso1.coef_
model_lasso1 = model_lasso[model_lasso['Beta Values(co-efficents)']>0].sort_values(by=['Beta Values(co-efficents)'],ascending=False)
model_lasso2 = model_lasso[model_lasso['Beta Values(co-efficents)']<0].sort_values(by=['Beta Values(co-efficents)'],ascending=True)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
model_lasso_final = pd.concat([model_lasso1, model_lasso2], ignore_index=True)
print("List of the features that influence SalePrice (using Lasso regression):")
print(model_lasso_final)

#### The top 5 most important independent features that influence the dependent feature 'SalePrice', using a lasso regression model are:
##### <li><b>GrLivArea</b> (Above grade (ground) living area square feet)</li>
##### <li><b>Proximity_PosN</b> (Proximity condition: Near positive off-site feature--park, greenbelt, etc.)</li>
##### <li><b>OverallQual_10</b> (Rating of the overall material and finish of the house: 10 i.e. 'Very Excellent')</li>
##### <li><b>RoofMatl_WdShngl</b> (Roof material: 'Wood Shingles')</li>
##### <li><b>Exterior_Wd Sdng_ImStucc</b> (The exterior covering combination of 'Wood Siding'+'Imitation Stucco')</li>