## House Price Prediction using Linear, Ridge and Lasso Regression
---
The solution is divided into the following sections: 
- Data understanding
- Data cleaning
- Data Exploration  
    1. Univariate Analysis
    2. Bivariate Analysis
    3. Outliers treatment
- Data preparation
- Model building and evaluation


In [None]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model, metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

import os

# hide warnings
import warnings
warnings.filterwarnings('ignore')

# Data Understanding
---

In [None]:
# Read the file
house = pd.read_csv('train.csv',na_values='NA')
house.head()

In [None]:
# Shape of the data
house.shape

In [None]:
# Data type of the columns
house.info()

In [None]:
# Start of data cleaning
# Check for null columns: loop over all the columns and collect only those with more than 50 nulls

null_col_dict={}

for i in house.columns:
    if house[i].isna().sum()>50:
        null_col_dict.update({i:house[i].isna().sum()})
null_col_dict
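The loop above can also be written without iterating explicitly. The following is a minimal sketch on a toy frame (the column names `a`, `b`, `c` and the threshold of 1 are made up for illustration; the notebook uses 50 on the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `house`; column names are hypothetical.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

null_counts = toy.isna().sum()   # null count per column, computed once
threshold = 1                    # assumption for the toy data
high_null = null_counts[null_counts > threshold].to_dict()
print(high_null)
```

Computing `isna().sum()` once per frame avoids evaluating it twice per column, as the loop does.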

# Data Cleaning
---

In [None]:
# For now we simply drop these high-null columns from our data set
house.drop(columns=list(null_col_dict.keys()), inplace=True)
house.info()

In [None]:
# Finding duplicate rows

duplicates = house.duplicated().sum()
duplicates

# So no duplicate rows are found


In [None]:
# Checking for empty rows
house[house.isnull().all(axis=1)]


In [None]:
# we should drop the ID column
house.drop(columns=['Id'],axis=1,inplace=True)


In [None]:
house.select_dtypes(include= ['float64','int64']).columns

### Change the below fields to object data type as they carry  categorical data
>   1. MSSubClass
>   2. OverallQual
>   3. OverallCond

### Change the below fields to numeric data type as they carry continuous data
>   1. MasVnrArea

### Delete the below columns as the data variation is negligible
>   1. Utilities

In [None]:
# Convert int to object

house[ ['MSSubClass','OverallQual','OverallCond'] ] = house[ ['MSSubClass','OverallQual','OverallCond'] ].astype('object')
house.info()

In [None]:
# Convert MasVnrArea to a numeric type (values that cannot be parsed become NaN)

house['MasVnrArea'] = pd.to_numeric(house['MasVnrArea'], errors='coerce')
house.info()

In [None]:
# Delete column Utilities: only one row has a different category, all the rest share the same one
house.drop(columns=['Utilities'],axis=1,inplace=True)

In [None]:
## Statistical description of our data set
house.describe()

### Imputing the remaining null values
   - For continuous fields: nulls can be replaced by the median
   - For categorical fields 
       > If the mode is predominant, replace nulls with the mode.
       > If not, replace nulls with a new category 'Unknown'.
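The rules above can be sketched as a small helper. This is only an illustration: the function name `impute_column` and the `dominance=0.8` cut-off for "predominant" are assumptions, since the notebook decides this by eye from the value counts and plots.

```python
import numpy as np
import pandas as pd

def impute_column(s, dominance=0.8):
    """Median for numeric columns; mode if dominant, else 'Unknown', for categorical ones."""
    if pd.api.types.is_numeric_dtype(s):
        return s.fillna(s.median())
    share = s.value_counts(normalize=True)   # shares among non-null values, largest first
    if not share.empty and share.iloc[0] >= dominance:
        return s.fillna(share.index[0])
    return s.fillna("Unknown")

# Toy data; the values are made up.
toy = pd.DataFrame({
    "area": [10.0, np.nan, 30.0, 1000.0, 20.0, 40.0],
    "cond": ["TA", "TA", "TA", "TA", "TA", None],
    "qual": ["Gd", "TA", "Ex", "Gd", "TA", None],
})
imputed = pd.DataFrame({c: impute_column(toy[c]) for c in toy.columns})
print(imputed)
```

The median is used for the numeric column because it is not pulled around by extreme values such as the 1000.0 in the toy data.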

In [None]:
null_col_dict={}

for i in house.columns:
    if house[i].isna().sum()>0:
        null_col_dict.update({i:house[i].isna().sum()})
null_col_dict

In [None]:
# MasVnrArea is a continuous variable; we impute with the median because the describe() step shows the presence of outliers
house.MasVnrArea = house.MasVnrArea.fillna(house.MasVnrArea.median())


In [None]:
# Categorical field Null handling - BsmtQual
print(house.BsmtQual.value_counts())
print(house.BsmtQual.mode()[0])

# With such a distribution we can't impute with the mode, so we introduce 'Unknown' for the missing values
house.BsmtQual = house.BsmtQual.fillna('Unknown')

In [None]:
#  Categorical field Null handling - BsmtCond
print(house.BsmtCond.value_counts())
print(house.BsmtCond.mode()[0])

# The mode 'TA' is far more frequent than any other value, so we impute with the mode here
house.BsmtCond = house.BsmtCond.fillna(house.BsmtCond.mode()[0])

In [None]:
null_col_dict={}

for i in house.columns:
    if house[i].isna().sum()>0:
        null_col_dict.update({i:house[i].isna().sum()})
null_col_dict

In [None]:
#  Categorical field Null handling - BsmtExposure
sns.displot(house['BsmtExposure'])

# we can impute nulls with mode as the mode is significantly higher than the rest
house.BsmtExposure = house.BsmtExposure.fillna(house.BsmtExposure.mode()[0])


In [None]:
#  Categorical field Null handling - BsmtFinType1
sns.displot(house['BsmtFinType1'])

# With such a distribution we can't impute with the mode, so we introduce 'Unknown' for the missing values
house.BsmtFinType1 = house.BsmtFinType1.fillna('Unknown')


In [None]:
#  Categorical field Null handling - BsmtFinType2
sns.displot(house['BsmtFinType2'])

# we can impute nulls with mode as the mode is significantly higher than the rest
house.BsmtFinType2 = house.BsmtFinType2.fillna(house.BsmtFinType2.mode()[0])

In [None]:
#  Categorical field Null handling - Electrical
sns.displot(house['Electrical'])

# we can impute nulls with mode as the mode is significantly higher than the rest
house.Electrical = house.Electrical.fillna(house.Electrical.mode()[0])

# Data Exploration
---

In [None]:
# Univariate Analysis - Target Variable SalePrice
house.SalePrice.describe()
sns.boxplot(house.SalePrice)

sns.displot(house.SalePrice,kind='kde')


### Because of certain outliers the target variable is right-skewed. From my own reading, right skewness can be handled in several ways, for example:
- Log Transformation
- Square Root Transformation    

In [None]:
# Lets check the degree of skew
house.SalePrice.skew()

In [None]:
# Let's check the square root transformation first
x = np.sqrt(house.SalePrice).skew()
print(x)
# Now let's check the log transformation
y = np.log(house.SalePrice).skew()
print(y)


### A skew value close to 0 indicates a more symmetric distribution. Hence we select the log transformation.
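The comparison can be reproduced on synthetic data. The sketch below assumes log-normally distributed prices (an assumption for illustration, not a claim about the real `SalePrice` column) and picks the transformation whose skew is closest to 0:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; the parameters are arbitrary.
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

skews = {
    "raw": prices.skew(),
    "sqrt": np.sqrt(prices).skew(),
    "log": np.log(prices).skew(),
}
# Choose the transformation with the smallest absolute skew.
best = min(skews, key=lambda k: abs(skews[k]))
print(skews, best)
```

Comparing absolute values matters because a transformation can overshoot and produce a negative (left) skew.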

In [None]:
house.SalePrice = np.log(house.SalePrice)
sns.displot(house.SalePrice,kind='kde')

# Now we see our data is well distributed


### Univariate analysis of continuous variables

In [None]:
house_cont = house.select_dtypes(include=['int64', 'float64'])
house_cont.columns

In [None]:
for col in house_cont.columns:
    plt.figure(figsize=(15,5))
    
    plt.subplot(1,2,1)
    plt.title(col)
    sns.histplot(house_cont[col],kde=True)
    plt.subplot(1,2,2)
    sns.boxplot(house_cont[col])
    plt.show()    

#### We can see outliers and non-normal distributions for most of the independent numerical variables. Normality of the independent variables is not a prerequisite for linear regression, but extreme values can distort the fit, so outlier handling is required here too.

In [None]:
# IQR-based elimination of outliers removed almost 50% of the data, so instead we cap
# the outliers at the 5th and 95th percentiles

for col in house_cont.columns:
    if col != 'SalePrice':
        lower, upper = house[col].quantile(0.05), house[col].quantile(0.95)
        # clip() avoids chained assignment, which may not modify `house` at all
        house[col] = house[col].clip(lower=lower, upper=upper)
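To see why row-wise IQR elimination can discard so much data while percentile capping keeps every row, here is a small sketch on synthetic heavy-tailed data (the eight log-normal columns `x0`..`x7` are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.lognormal(size=(1000, 8)),
                   columns=[f"x{i}" for i in range(8)])

# IQR rule: a row survives only if every column lies inside its fences.
q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
inside = (toy >= q1 - 1.5 * iqr) & (toy <= q3 + 1.5 * iqr)
iqr_kept = toy[inside.all(axis=1)]

# Percentile capping: values are pulled to the 5th/95th percentile, no row is dropped.
capped = toy.clip(lower=toy.quantile(0.05), upper=toy.quantile(0.95), axis=1)

print(len(toy), len(iqr_kept), len(capped))
```

Each column only flags a modest share of rows, but the losses compound across columns because a single flagged value removes the whole row.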

In [None]:
house.describe()

In [None]:
# Plot the data once more after capping the outliers

for col in house_cont.columns:
    plt.figure(figsize=(15,6))
    plt.subplot(1,2,1)
    plt.title(col)
    sns.histplot(house[col],kde=True)
    plt.subplot(1,2,2)
    sns.boxplot(house[col])

# The boxplots now show far fewer outliers after the capping.

### Univariate analysis of categorical variables

In [None]:
house_cat = house.select_dtypes('object')
house_cat.columns

In [None]:
# Let's check the data distribution for each of the variables

for col in house_cat.columns:
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,2)
    plt.title(col)
    plt.xticks(rotation=90)
    sns.histplot(house[col])

### From the above histograms we find that certain categorical variables are dominated by a single category and therefore show very little variation. Such fields are unlikely to be good predictor variables, so we can delete them.

In [None]:
house_col_del = []
house_cat = house.select_dtypes('object')
for col in house_cat.columns:
    if (house_cat[col].value_counts()/house_cat.shape[0] >=.95).any():
        house_col_del.append(col)
print(house_col_del)    

In [None]:
# We will drop the columns with more than 95% of the data in one category
house.drop(columns=house_col_del,axis=1,inplace=True)
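The dominance check used above can be wrapped in a small function. This is a sketch on toy data; the helper name `dominant_columns` and the column values are hypothetical:

```python
import pandas as pd

def dominant_columns(df, threshold=0.95):
    """Return columns where one category covers at least `threshold` of all rows."""
    cols = []
    for col in df.columns:
        counts = df[col].value_counts()          # most frequent category first
        if len(counts) and counts.iloc[0] / len(df) >= threshold:
            cols.append(col)
    return cols

toy = pd.DataFrame({
    "street": ["Pave"] * 97 + ["Grvl"] * 3,   # 97% in one category
    "zoning": ["RL"] * 60 + ["RM"] * 40,      # 60% in one category
})
print(dominant_columns(toy))
```

Dividing by `len(df)` mirrors the notebook's use of `shape[0]`, so missing values count against the dominant share.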


## Bivariate Analysis of the continuous variables

In [None]:
plt.figure(figsize=(15,6))
sns.heatmap(house_cont.corr(), annot=True, fmt='.1f', cmap='coolwarm')
plt.tight_layout()

#### We can find some highly correlated numerical fields, such as
>   1. `GarageCars` is highly correlated with `GarageArea` (.9)
>   2. `TotalBsmtSF` is highly correlated with `1stFlrSF` (.8)
>   3. `GrLivArea` is highly correlated with `TotRmsAbvGrd` (.8)

#### We will drop the columns `GarageArea`, `1stFlrSF` and `TotRmsAbvGrd` as their correlation with another field is about .8 or higher. The rest can be evaluated by Lasso and VIF


In [None]:
# Drop columns because of high collinearity
house.drop(columns=['GarageArea','TotRmsAbvGrd','1stFlrSF'],axis=1,inplace=True)
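Reading correlated pairs off a large heatmap is error-prone, so the pairs can also be listed programmatically. A minimal sketch on synthetic data (the helper `high_corr_pairs` and the toy column names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
area = rng.normal(size=500)
toy = pd.DataFrame({
    "garage_area": area,
    "garage_cars": area + rng.normal(scale=0.3, size=500),  # built to correlate strongly
    "lot": rng.normal(size=500),                            # independent of the others
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Which member of each pair to drop remains a judgment call; the function only surfaces the candidates.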

### Let's plot SalePrice against the other numerical variables. This should give us an indication of how linear the relationships are.

In [None]:
house_cont = house.select_dtypes(include=['int64', 'float64'])
for col in house_cont.columns:
    plt.figure(figsize=(10,3))
    
    plt.subplot(1,2,1)
    plt.title(col)
    sns.scatterplot(x=house[col], y=house['SalePrice'])
    plt.show()    

### We find that the below columns show little or no variation (most collapse to a single value), hence we can drop them
>   1. BsmtFinSF2
>   2. LowQualFinSF
>   3. BsmtHalfBath
>   4. KitchenAbvGr
>   5. EnclosedPorch
>   6. 3SsnPorch
>   7. ScreenPorch
>   8. PoolArea
>   9. MiscVal
In [None]:
# Drop columns because of low variance
house.drop(columns=['BsmtFinSF2','LowQualFinSF','BsmtHalfBath','KitchenAbvGr','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal'],axis=1,inplace=True)


### Bivariate analysis of the categorical variables

In [None]:
house_cat = house.select_dtypes(['object'])
for col in house_cat.columns:
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,2)
    plt.title(col)
    plt.xticks(rotation=90)
    sns.boxplot(x=house[col], y=house['SalePrice'])
    plt.show()

In [None]:
# Once more check for any null values
null_col_dict={}
for i in house.columns:
    if house[i].isna().sum()>0:
        null_col_dict.update({i:house[i].isna().sum()})
null_col_dict

In [None]:
house.info()

## Data Preparation - Here we will go through the categorical variables and start encoding them

In [None]:
# Label encoding for the ordinal columns

house['LotShape'] = house['LotShape'].map({'IR1':0,'IR2':1,'IR3':2,'Reg':3})
house['LandSlope'] = house['LandSlope'].map({'Gtl':0,'Mod':1,'Sev':2})
house['HouseStyle'] = house['HouseStyle'].map({'1Story':0, '1.5Unf':1, '1.5Fin':2,  '2Story' :3, '2.5Unf':4, '2.5Fin':5, 'SFoyer':6, 'SLvl':7})
house['ExterQual'] = house['ExterQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
house['ExterCond'] = house['ExterCond'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
house['BsmtQual'] = house['BsmtQual'].map({'Unknown':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
house['BsmtCond'] = house['BsmtCond'].map({'Unknown':0,'Po':1,'Fa':2,'TA':3,'Gd':4,'Ex':5})
house['BsmtExposure'] = house['BsmtExposure'].map({'Unknown':0,'No':1,'Mn':2,'Av':3,'Gd':4})
house['BsmtFinType1'] = house['BsmtFinType1'].map({'Unknown':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
house['BsmtFinType2'] = house['BsmtFinType2'].map({'Unknown':0,'Unf':1,'LwQ':2,'Rec':3,'BLQ':4,'ALQ':5,'GLQ':6})
house['HeatingQC'] = house['HeatingQC'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
house['CentralAir'] = house['CentralAir'].map({'N':0,'Y':1})
house['KitchenQual'] = house['KitchenQual'].map({'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4})
house['Functional'] = house['Functional'].map({'Typ':0, 'Min1':1, 'Min2':2, 'Mod':3, 'Maj1':4, 'Maj2':5, 'Sev':6, 'Sal':7})
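One caveat with `Series.map` and a dictionary is that any category missing from the dictionary silently becomes NaN. The sketch below guards against that; the helper name `map_ordinal` is hypothetical and the quality scale is the one used for `ExterQual` above:

```python
import pandas as pd

quality_scale = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}

def map_ordinal(s, mapping):
    """Map categories to integers, failing loudly on categories the mapping lacks."""
    unseen = set(s.dropna().unique()) - set(mapping)
    if unseen:
        raise ValueError(f"{s.name}: unmapped categories {sorted(unseen)}")
    return s.map(mapping)

toy = pd.Series(["TA", "Gd", "Ex", "TA"], name="KitchenQual")
encoded = map_ordinal(toy, quality_scale)
print(encoded.tolist())
```

Failing early is cheaper than discovering the NaN values later, which is what the "once more check for any null values" cell is otherwise relied on to catch.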


In [None]:
house.info()

In [None]:
# Once more check for any null values
null_col_dict={}
for i in house.columns:
    if house[i].isna().sum()>0:
        null_col_dict.update({i:house[i].isna().sum()})
null_col_dict

In [None]:
# One hot encoding of the nominal fields
house_cat_nom =  ['MSSubClass','MSZoning',  'LandContour', 'LotConfig', 'Neighborhood', 'Condition1' ,'BldgType', 'RoofStyle',  'Exterior1st', 'Exterior2nd', 'Foundation','Electrical','PavedDrive', 'SaleType','SaleCondition']
house_dummy = pd.get_dummies(house[house_cat_nom], drop_first=True,dtype=int)
house_dummy.shape

In [None]:
# Concat the dummy variables to the original data set
house = pd.concat([house,house_dummy],axis=1)
house.shape

In [None]:
# Dropping the redundant columns
house.drop(house_cat_nom,axis=1,inplace=True)
house.shape


In [None]:
house.SalePrice

## Model Building
---
### Our EDA and data preparation are complete, so we will now start building models. We will create a Linear Regression model first, followed by Ridge and Lasso. We will also use cross validation to make the models more robust.

### Train Test Split


In [None]:
house_train,house_test = train_test_split(house,train_size=0.7,random_state=100)
print(house_train.shape,house_test.shape)

In [None]:
# Divide into X and y for train
y_train = house_train.pop('SalePrice')
X_train = house_train

# Divide into X and y for test
y_test = house_test.pop('SalePrice')
X_test = house_test

print(y_train.shape,X_train.shape,y_test.shape,X_test.shape)

### Feature Scaling

In [None]:
# Instantiate a scaler object and fit-transform the train data
scaler=StandardScaler()
house_cont = X_train.dtypes[X_train.dtypes != "object"].index
X_train[house_cont]=scaler.fit_transform(X_train[house_cont])

X_train.shape

In [None]:
# See the glimpse of the scaled data
X_train.head()

In [None]:
# Perform scaling on the test data using the same scaler object
X_test[house_cont]=scaler.transform(X_test[house_cont])
X_test.head()

### Feature selection with RFE and cross validation

In [None]:
# Let's start with a linear regression model and use RFE to keep 90 features
from sklearn.feature_selection import RFE

lm = LinearRegression()

# Apply RFE to get the top 90 features (RFE fits its own copy of the estimator)
rfe = RFE(lm, n_features_to_select=90)
rfe = rfe.fit(X_train, y_train)


In [None]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [None]:
# Features that RFE suggests excluding from the model
X_train.columns[~rfe.support_]

In [None]:
# Select the rfe supported columns only for both train and test
X_train_rfe1 = X_train[X_train.columns[rfe.support_]]
X_test_rfe1 = X_test[X_test.columns[rfe.support_]]
print(X_train_rfe1.shape,X_test_rfe1.shape)

In [None]:
# Evaluate model performance
lm_rfe_1 = lm.fit(X_train_rfe1, y_train)
y_test_pred = lm_rfe_1.predict(X_test_rfe1)

# Check r2 score
round(r2_score(y_test,y_test_pred),3)


In [None]:
import statsmodels.api as sm  
X_train_rfe1 = sm.add_constant(X_train_rfe1) #Adding Constant
X_train_rfe1.head()

In [None]:
# Ensure that X_train_rfe1 and y_train contain only numeric data
X_train_rfe1 = X_train_rfe1.apply(pd.to_numeric, errors='coerce')
y_train = pd.to_numeric(y_train, errors='coerce')

# Now fit the model
lm1 = sm.OLS(y_train, X_train_rfe1).fit()   
print(lm1.summary())

### K-fold cross validation through the Grid Search method

In [None]:
# We will perform k-fold cv with all 90 feature variables
from sklearn.model_selection import cross_val_score



In [None]:
cv_model = LinearRegression()
# Set the CV scheme and the metric
cv_scores = cross_val_score(cv_model, X_train_rfe1, y_train, cv=5, scoring='r2')
cv_scores

### We have 90 features; the number of features to keep is a hyperparameter, which we now tune using k-fold cross validation

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [None]:
# Step 1 : Create a cross validation scheme
kfold = KFold(n_splits=5, random_state=42, shuffle=True)

# Step 2 : Specify the hyperparameters to be tuned
hyper_params = [{'n_features_to_select' : list(range(1,91))}]

# Step 3 : perform grid search

#3.1 : Create a model object
lm = LinearRegression()
lm.fit(X_train_rfe1, y_train)
rfe = RFE(lm)
    
#3.2 : Create a grid search object
model_cv = GridSearchCV(estimator=rfe,scoring='r2',return_train_score=True,param_grid=hyper_params,cv=kfold,verbose=1,n_jobs=-1)
# 3.3 : fit the model
model_cv.fit(X_train_rfe1,y_train)

In [None]:
# Check the result in table form
cv_results = pd.DataFrame(model_cv.cv_results_)
cv_results

In [None]:
# Plotting CV results
plt.figure(figsize=(16,6))
plt.plot(cv_results["param_n_features_to_select"], cv_results["mean_train_score"])
plt.plot(cv_results['param_n_features_to_select'] , cv_results['mean_test_score'])
plt.xlabel('number of features')
plt.ylabel('r-squared')
plt.title("Optimal Number of Features")
plt.legend(['train score', 'test score'], loc='upper left')
plt.show()

### Ridge regression for regularization


In [None]:
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}
ridge = Ridge()
# cross validation
ridgeCV = GridSearchCV(ridge, param_grid=params, scoring='neg_mean_absolute_error', cv=kfold, verbose=1, n_jobs=-1,return_train_score=True)
ridgeCV.fit(X_train, y_train)


In [None]:
ridgeCV.best_params_

In [None]:
# Cross Validation results
ridgeCV_results = pd.DataFrame(ridgeCV.cv_results_)
ridgeCV_results

In [None]:
# Now let's build another Ridge model with the best alpha value found above (100)
ridge = Ridge(alpha=100)
ridge.fit(X_train, y_train)
ridge.coef_
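Instead of copying the best alpha by hand, it can be read from the fitted search object. A minimal sketch on synthetic data (the data, the shortened alpha grid and the names `search` and `ridge_best` are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -2.0]) + rng.normal(scale=0.1, size=200)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
# With the default refit=True, best_estimator_ is already refit on all of X, y.
ridge_best = search.best_estimator_
print(best_alpha, ridge_best.coef_.round(2))
```

This keeps the final model in sync with the search if the grid, the data or the scoring changes later.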

In [None]:
# Now if we make a prediction with this Ridge model
y_train_pred = ridge.predict(X_train)
y_test_pred = ridge.predict(X_test)

In [None]:

# Show metrics
print('Train R2:',r2_score(y_train, y_train_pred))
print('Test R2:',r2_score(y_test, y_test_pred))
print('--------------------')
print('Train RMSE:',np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:',np.sqrt(mean_squared_error(y_test, y_test_pred)))
print('--------------------')

In [None]:
# Most important predictors as per Ridge
ridge_coeff_df = pd.DataFrame({'column':X_train.columns,'coeff':ridge.coef_})
ridge_coeff_df['coeff_abs'] = np.abs(ridge_coeff_df['coeff'])
ridge_coeff_df = ridge_coeff_df.sort_values('coeff_abs',ascending=False)
ridge_coeff_df.head(10)

## Lasso Regression


In [None]:
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}

lasso  =Lasso() # Create a Lasso object


In [None]:
# Perform grid search cross validation on the alpha values for Lasso
lassoCV = GridSearchCV(estimator=lasso,param_grid=params,scoring='neg_mean_absolute_error',cv=kfold,verbose=1,n_jobs=-1,return_train_score=True)

# Fit the model
lassoCV.fit(X_train, y_train)


In [None]:
# Get the best hyper parameter value
lassoCV.best_params_

In [None]:
# Cross validation results from Lasso
lassoCV_df = pd.DataFrame(lassoCV.cv_results_)
lassoCV_df

In [None]:
# Build another lasso model using the best alpha value
alpha_optimal = 0.001
lasso = Lasso(alpha=alpha_optimal)
lasso.fit(X_train, y_train)
lasso.coef_
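Unlike Ridge, Lasso can shrink coefficients exactly to zero, which is why it doubles as a feature selector. A small sketch on synthetic data where only two of six features matter (the data and `alpha=0.1` are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 3 influence the target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=300)

lasso_toy = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(lasso_toy.coef_)   # indices of features with non-zero coefficients
print(kept, lasso_toy.coef_.round(2))
```

On the real data the same check, `np.flatnonzero(lasso.coef_)`, shows how many of the encoded features the model actually retains.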

In [None]:
# Predictions using Lasso
y_train_pred = lasso.predict(X_train)
y_test_pred = lasso.predict(X_test)

In [None]:
# Metrics using Lasso
print('Train R2:',r2_score(y_train, y_train_pred))
print('Test R2:',r2_score(y_test, y_test_pred))
print('--------------------')
print('Train RMSE:',np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:',np.sqrt(mean_squared_error(y_test, y_test_pred)))


In [None]:
lassoCV_df= pd.DataFrame(lassoCV.cv_results_)

In [None]:
# Plot the mean train and test CV scores (negative MAE, as set in `scoring`) against alpha for Lasso
plt.figure(figsize=(10,6))
plt.plot(lassoCV_df['param_alpha'], lassoCV_df['mean_train_score'])
plt.plot(lassoCV_df['param_alpha'], lassoCV_df['mean_test_score'])

In [None]:
# Most important predictors 

coeef_df = pd.DataFrame(list(zip(X_train.columns,lasso.coef_)),columns=['Feature','Coef'])

# Sort this data frame based on the absolute values of the coefficients
coeef_df['Coef_abs'] = np.abs(coeef_df['Coef'])
coeef_df.sort_values(by='Coef_abs',ascending=False,inplace=True)

# select top 10 predictors based on the coefficients
coeef_df.head(10)

###  `The variables significant in predicting the price of a house are:`

So the above listed Features can be considered as the most important factors to determine the price of a house.

### `How well do those variables describe the price of a house?`

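1.   GrLivArea : Above grade (ground) living area in square feet. Because the target is log(SalePrice) and the features are standardized, a coefficient of .12 means that a one-standard-deviation increase in GrLivArea is associated with an increase of about .12 in log price, i.e. roughly 13% in price.
2.  MSZoning_RL : Residential Low Density. The coefficient of .08 means that a one-standard-deviation increase in this scaled dummy is associated with an increase of about .08 in log price, i.e. roughly 8% in price.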
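Since the model predicts log(SalePrice), a coefficient b translates into a multiplicative effect of exp(b) on the price per unit change of the (standardized) feature. The arithmetic for the two coefficients quoted above, taken as rounded values, can be sketched as:

```python
import math

# Rounded coefficients quoted in the text; features are standardized,
# so "one unit" here means one standard deviation of the feature.
coefs = {"GrLivArea": 0.12, "MSZoning_RL": 0.08}

# exp(b) - 1 is the relative change in price for a one-unit increase.
pct_change = {name: (math.exp(b) - 1) * 100 for name, b in coefs.items()}
for name, pct in pct_change.items():
    print(f"{name}: about {pct:.1f}% change in price per standard deviation")
```

For small coefficients exp(b) - 1 is close to b itself, which is why .12 and .08 read approximately as 12% and 8%.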


#### Question :
After building the model, you realised that the five most important predictor variables in the lasso model are not available in the incoming data. You will now have to create another model excluding the five most important predictor variables. Which are the five most important predictor variables now?

In [None]:
# Get the top 5 variables from Lasso
lasso_top_5 = coeef_df.head(5)['Feature']

# Drop top 5 lasso var from X_train
X_train_del_top5 = X_train.drop(lasso_top_5,axis =1)



In [None]:
# Drop top 5 from test data
X_test_del_top5 = X_test.drop(lasso_top_5,axis =1)

### We now create another Lasso model on the dataset left after deleting the top 5 feature variables, following the same sequence of steps as for the first Lasso model

In [95]:
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}
# Create lasso object
lasso = Lasso()

In [96]:
# Perform GridSearch cross validation to get the best hyper parameter value
lasso_new_CV = GridSearchCV(estimator=lasso,param_grid=params,scoring='neg_mean_absolute_error',cv=kfold,verbose=1,n_jobs=-1,return_train_score=True)
lasso_new_CV.fit(X_train_del_top5, y_train)

# Get the best hyper parameter value
lasso_new_CV.best_params_   

Fitting 5 folds for each of 27 candidates, totalling 135 fits


{'alpha': 0.001}

In [None]:
# Build another lasso model using the best alpha value
alpha_optimal = 0.001
lasso = Lasso(alpha=alpha_optimal)
lasso.fit(X_train_del_top5, y_train)
list(zip(X_train_del_top5.columns,lasso.coef_))


In [99]:
# Predict using this Lasso model and evaluate metrics
y_test_pred = lasso_new_CV.predict(X_test_del_top5)
y_train_pred = lasso_new_CV.predict(X_train_del_top5)

# Metrics
print('Train R2:',r2_score(y_train, y_train_pred))
print('Test R2:',r2_score(y_test, y_test_pred))
print('--------------------')
print('Train RMSE:',np.sqrt(mean_squared_error(y_train, y_train_pred)))
print('Test RMSE:',np.sqrt(mean_squared_error(y_test, y_test_pred)))



Train R2: 0.9072058977142287
Test R2: 0.8672552951576801
--------------------
Train RMSE: 0.12075624339915147
Test RMSE: 0.14791311219948153


In [102]:
# Top 5 predictors as per this new Lasso model
lasso_new_top5 = pd.DataFrame(list(zip(X_train_del_top5.columns,lasso.coef_)),columns=['Feature','Coef'])

# Sort this data frame based on the absolute values of the coefficients
lasso_new_top5['Coef_abs'] = np.abs(lasso_new_top5['Coef'])
lasso_new_top5.sort_values(by='Coef_abs',ascending=False,inplace=True)
lasso_new_top5.head()

       Feature      Coef  Coef_abs
14  BsmtFinSF1  0.104461  0.104461
16   BsmtUnfSF   0.08071   0.08071
19    2ndFlrSF  0.070871  0.070871
0      LotArea  0.045425  0.045425
5    YearBuilt  0.045377  0.045377
