# Problem Statement

# Training Dataset

Housing is a basic need for people everywhere. The housing market is very large, and many companies operate in it. Data science is an important tool for solving problems in this domain and helping companies increase their overall revenue and profits.

A US-based housing company named Surprise Housing has decided to enter the Australian market. The company is looking for prospective properties to buy as its entry into the market. It uses data analytics to purchase houses at a price below their actual value and flip them at a higher price.

The purpose is to build a Machine Learning model that predicts the actual value of the prospective properties so the company can decide whether to invest in them. The requirement is to model the price of houses using the available independent variables.

The data is divided into two sets: train and test. The model is built on the train data set and then used on the test data set to predict house prices. The company collected the data from the sale of houses in Australia and provided it in CSV format.

In [None]:
#importing the libraries

import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import csv

In [None]:
#importing the dataset

df = pd.read_csv(r'C:\Users\HP-15\housing_train.csv')

In [None]:
df

Initially, the dataset consists of 1168 records and 81 attributes.

The "Id" column has no relationship with the dependent/target variable because it is only a unique identifier for each record. Hence, we drop the column.



In [None]:
df.columns

In [None]:
#Dropping "Id" column from the dataset because it has no relation with the target variable
df.drop("Id",axis=1,inplace=True)

In [None]:
#Checking that the Id column is dropped by viewing the first rows of the data

df.head()

The "Id" column has clearly been removed from the data set. Because the data set has many columns, pandas truncates the display by default, so we raise the display limit to see every column.

In [None]:
#Displaying all columns for a better understanding of the data

pd.set_option('display.max_columns',None)
df.head()

We can now see all the columns being displayed in the dataframe.

In [None]:
#Checking size of present dataset
df.shape

In [None]:
#Displaying maximum rows to visualize data type of all attributes
pd.set_option('display.max_rows',None)

The dataset now consists of 1168 rows and 80 columns.

In [None]:
#Checking data types of the attributes
df.dtypes

It is important to check for null values and clean the data before proceeding, because most of the estimators used later, such as LinearRegression, cannot be fitted on data that contains null values.

# Checking Null Values

In [None]:
df.isnull().sum()

The null (NaN) values in the columns "PoolQC", "Fence", "MiscFeature" & "Alley" make up more than 70% of the records in the data set. With so few observed values, these variables are unlikely to help in analysing the sale price, so we remove them from the data set.

The remaining null values will be filled using SimpleImputer: the most frequent value for categorical columns and the median for numeric columns.
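The 70% rule above can be checked directly instead of reading it off the null counts. The following is a minimal sketch on a made-up toy frame; the names `toy`, `missing_share` and `threshold` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, np.nan, 11250],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Assumed cut-off: drop columns that are more than 70% empty.
threshold = 0.70
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print("dropped:", to_drop)
```

Computing the share with `isnull().mean()` keeps the decision tied to a single threshold value, so the same rule can be reapplied to the test data.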

In [None]:
#Dropping the variables whose values are mostly null (more than 70% of the records)
df.drop(["PoolQC","Fence","MiscFeature","Alley"],axis=1,inplace=True)

In [None]:
#Checking number of variables left in the dataset
df.shape

We can see that 4 variables have been dropped from the data set.

Now we will treat the remaining null values in the data set.

In [None]:
#Filling NaN values in categorical columns using the 'most_frequent' strategy of SimpleImputer
from sklearn.impute import SimpleImputer

imp=SimpleImputer(strategy='most_frequent')

cat_null_cols=['GarageQual','GarageCond','GarageFinish','GarageType','FireplaceQu',
               'BsmtFinType2','BsmtFinType1','BsmtExposure','BsmtCond','BsmtQual','MasVnrType']
for col in cat_null_cols:
    #fit_transform returns a 2-D array, so flatten it before assigning back to the column
    df[col]=imp.fit_transform(df[[col]]).ravel()

df.head()

We used the "most_frequent" strategy of SimpleImputer to fill the null values of the categorical data.

In [None]:
#Checking Null values
df.isnull().sum()

The NaN values of the categorical data are now filled. Next we deal with the null values in the numeric data, which we fill using the median.
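As a compact illustration of the two strategies described here, the sketch below imputes numeric columns with the median and categorical columns with the most frequent value, selecting the columns by dtype. The toy values and the names `num_cols` and `cat_cols` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value in each column (values are made up).
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 100.0],
    "MasVnrArea": [0.0, 0.0, np.nan, 200.0],
    "GarageType": ["Attchd", np.nan, "Attchd", "Detchd"],
})

# Split the columns by dtype so each group gets a suitable strategy.
num_cols = toy.select_dtypes(include="number").columns
cat_cols = toy.select_dtypes(exclude="number").columns

toy[num_cols] = SimpleImputer(strategy="median").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

Selecting by dtype avoids listing every column by hand and prevents a median imputer from ever being applied to a string column.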

In [None]:
#Importing Libraries to treat Null values using median
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

In [None]:
#Treating the remaining null values in the numeric columns using the median imputer
for i in df.select_dtypes(include='number').columns:
    if df[i].isnull().sum()!=0:
        df[i]=imputer.fit_transform(df[[i]]).ravel()

In [None]:
#Checking null values again
df.isnull().sum()

We now observe that our null values have been treated. Let's visualize and check if we still have any Null Values.

In [None]:
#Visualizing null values through heatmap
import seaborn as sns
sns.heatmap(df.isnull())

We cannot see any Null Values in our dataset. Hence, we can proceed forward with visualizing our data.

# Making DataFrame for the Nominal Data

In [None]:
#Copying nominal variables into a new dataframe
df_nominal=df[['MSZoning','Street','LotShape', 'LandContour', 'Utilities', 'LotConfig','LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType','HouseStyle','RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType','ExterQual', 'ExterCond', 'Foundation', 'BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1','BsmtFinType2', 'Heating','HeatingQC', 'CentralAir', 'Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive','SaleType','SaleCondition']].copy()

We create a new separate dataframe to store all categorical data and analyse it separately.

In [None]:
#Checking columns of new nominal dataframe created
df_nominal.columns

In [None]:
#Checking shape of new nominal dataframe
df_nominal.shape

We have 1168 rows and 39 columns in the nominal DataFrame.

# Visualization of Data

For the nominal/categorical data we use countplot, as it gives the frequency of each category in a column.

In [None]:
#Importing Libraries for Visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#Visualizing different columns using countplot
nrow,ncol=10,4   #grid of 10 rows x 4 columns holds the 39 nominal columns
cols=df_nominal.columns.values
plt.figure(figsize=(20,40))
for index,i in enumerate(cols):
    plt.subplot(nrow,ncol,index+1)
    sns.countplot(x=df_nominal[i])   #keyword x= keeps the call valid across seaborn versions
    plt.title(f"plot {i}")
plt.tight_layout()
plt.show()

# Observation:

All the records in the data set have "All public Utilities (E,G,W,& S)" in the 'Utilities' variable. Since a column with a single value cannot help distinguish between records, we delete the column.

In the MSZoning (general zoning classification of the sale) variable, most values belong to the "Residential Low Density" category, followed by "Residential Medium Density", and the fewest belong to the "C(all)" category.

Type of road access to property ('Street') is "Paved" for most of the records.

General shape of property ('LotShape') is "Regular" in most cases, followed by "Slightly irregular"; very few properties have an "Irregular" shape.

Flatness of the property ('LandContour') is "Near Flat/Level" for most records, and the fewest records have "Depression".

Lot configuration ('LotConfig') is "Inside lot" for most of the properties in the data set, followed by "Corner lot", and the fewest belong to the "Frontage on 3 sides of property" category.

Most records have "Gentle slope" and very few have "Severe Slope" for the slope of property ('LandSlope').

Proximity to various conditions ('Condition1') is "Normal" in most cases and least often "Within 200' of East-West Railroad".

Proximity to various conditions ('Condition2') is "Normal" for nearly all records; very few properties have any other condition.

'BldgType' is "1Fam" i.e. "Single-family Detached" for nearly 1000 records.

Style of dwelling ('HouseStyle') is most often "One story" (nearly 580 records), followed by "Two story" (nearly 360 records) and "One and one-half story" (around 110 properties).

Type of roof ('RoofStyle') is "Gable" for more than 900 values and around 200 values for "Hip".

Roof material is "Standard (Composite) Shingle" for more than 110 properties in the dataset.

The value for 'Masonry veneer type' is "None" for 700 records, "Brick Face" for nearly 350 records, "Stone" for nearly 100 records and least for "Brick Common".

Quality of the material on the exterior ('ExterQual') is "Average/Typical" for more than 700 properties, "Good" for nearly 400 properties and least are "Excellent".

Present condition of the material on the exterior ('ExterCond') is "Average/Typical" for more than 1000 properties.

Type of foundation ('Foundation') is most often "Cinder Block" or "Poured Concrete", with around 500 values each, and least often "Wood".

Height of the basement ('BsmtQual') is "Typical (80-89 inches)" or "Good (90-99 inches)" for most of the records in the data set.

General condition of the basement ('BsmtCond') is "Average/Typical" for more than 1000 properties, and very few properties are rated "Poor".

'BsmtExposure' (refers to walkout or garden level walls) is "No Exposure" for nearly 800 properties.

'BsmtFinType1' (rating of basement finished area) is most often "Unfinished" (about 350 values), followed by "Good Living Quarters", and least often "Low Quality".

'BsmtFinType2' (Rating of basement finished area) is maximum for "Unfinished" i.e more than 1000 records.

Type of heating ('Heating') has maximum values for "GasA - Gas forced warm air furnace" category which is more than 1100 records.

Central air conditioning ('CentralAir') is available ("Yes") for around 1100 properties.

'Electrical' is "Standard Circuit Breakers & Romex" for more than 1000 properties.

Kitchen quality ('KitchenQual') is "Typical/Average" & "Good" for most records in the data set.

Home functionality ('Functional') is "Typical Functionality" for more than 1000 properties.

Fireplace quality ('FireplaceQu') is "Good" for more than 800 properties and "Average" for around 250 properties; together these account for most of the records.

'GarageType' is "Attchd" for most records, followed by "Detchd".

'GarageFinish' is most often "Unfinished", followed by "Rough Finished", and least often "Finished", which has fewer than 300 properties.

'GarageQual' is "Typical/Average" for around 1100 properties, and very few properties fall in the other categories.

'GarageCond' is also "Typical/Average" for around 1100 properties, and very few properties fall in the other categories.

Paved driveway ('PavedDrive') is "Paved" for more than 1000 records.

'SaleType' is "Warranty Deed - Conventional" for around 1000 properties.

'SaleCondition' is "Normal Sale" for nearly 950 records.
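The 'Utilities' observation above, a column in which every record has the same value, can also be detected programmatically rather than by reading the plots. This is a small sketch on made-up data; `constant_cols` and `top_share` are hypothetical names.

```python
import pandas as pd

# Toy frame: "Utilities" holds a single value, the other columns vary.
toy = pd.DataFrame({
    "Utilities": ["AllPub", "AllPub", "AllPub", "AllPub"],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Columns with at most one distinct value carry no information for a model.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Share of the most common category, useful for spotting near-constant columns.
top_share = toy["Street"].value_counts(normalize=True).iloc[0]

print("constant columns:", constant_cols)
print("share of most common Street value:", top_share)
```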

In [None]:
#Dropping the column because all of its values are the same, so it carries no information
df.drop("Utilities",axis=1,inplace=True)

In [None]:
#Visualizing data which was not clear in the above analysis
ab=sns.countplot(x='Neighborhood', data=df_nominal)
print(df_nominal['Neighborhood'].value_counts())

Most properties are in the "NAmes: North Ames" and "College Creek" neighborhoods.

In [None]:
ab=sns.countplot(x='Exterior1st', data=df_nominal)
print(df_nominal['Exterior1st'].value_counts())

The exterior covering on the house ('Exterior1st') is most often "Vinyl Siding"; "Hard Board", "Metal Siding" and "Wood Siding" are also common.

In [None]:
ab=sns.countplot(x='Exterior2nd', data=df_nominal)
print(df_nominal['Exterior2nd'].value_counts())

'Exterior2nd' has maximum records for "Vinyl Siding".

# Making DataFrame of the Continuous type of Values

In [None]:
#Copying our continuous data into a new dataframe
df_continuous=df[['MSSubClass','LotFrontage','LotArea','OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd','MasVnrArea','BsmtFinSF1','BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','GarageCars', 'GarageArea','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold','SalePrice']].copy()

We create a new separate dataframe to store all numeric data and analyse it separately.

In [None]:
#Verifying columns
df_continuous.columns

In [None]:
#Checking shape of new numeric dataframe
df_continuous.shape

# Visualizing Continuous Data

In [None]:
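#Visualizing different columns using histograms with a density curve
nrow,ncol=10,4   #grid of 10 rows x 4 columns holds the 37 numeric columns
cols=df_continuous.columns.values
plt.figure(figsize=(20,40))
for index,i in enumerate(cols):
    plt.subplot(nrow,ncol,index+1)
    sns.histplot(x=df_continuous[i],kde=True)   #histplot replaces the deprecated distplot
    plt.title(f"plot {i}")
plt.tight_layout()
plt.show()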

# Observation:

The type of dwelling involved in the sale ('MSSubClass') is "20 : 1-STORY 1946 & NEWER ALL STYLES" for most of the cases.

Linear feet of street connected to property ('LotFrontage') mostly lies between 50 and 100.

Lot size in square feet ('LotArea') has most of its records between 0 and 25000.

Rates the overall material and finish of the house ('OverallQual') has maximum records for values 5, 6 and 7.

Rates the overall condition of the house ('OverallCond') has maximum record for value 5.

Maximum records have Original construction date ('YearBuilt') as 2000.

Maximum records for Remodel date (same as construction date if no remodeling or additions) ['YearRemodAdd'] has values between 2000 and 2010.

The values of most of the properties for Masonry veneer area in square feet ('MasVnrArea') is 0 but the values range between 0 to 1750.

The values of most of the properties for Type 1 finished square feet ('BsmtFinSF1') is 0 but the values range between 0 to 6000 having most of them ranging from 0 to 2000.

The values of most of the properties for Type 2 finished square feet ('BsmtFinSF2') is 0 but the values range between 0 to 1500.

The values for Unfinished square feet of basement area ('BsmtUnfSF') lies max between 0 to 1000.

Total square feet of basement area ('TotalBsmtSF') has max records ranging between values 500 to 2000.

First Floor square feet ('1stFlrSF') has maximum records ranging between values 500 to 1500.

Second floor square feet ('2ndFlrSF') has maximum records at value 0 and also few between values 500 & 1500.

Above grade (ground) living area square feet ('GrLivArea') has maximum records ranging between values 900 to 3000.

Basement full bathrooms ('BsmtFullBath') has more records for value 0 than value 1.

Variables such as 'LowQualFinSF', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea' & 'MiscVal' contains maximum records at value 0.

In [None]:
#Using multivariate analysis to analyse continuous data
sns.pairplot(df_continuous)

Since the data contains many variables, it is difficult to analyse all the pairwise plots together. We can still observe that some independent variables have a positive relationship with the target variable. We will examine these relationships using correlations.

Converting the nominal data into a numeric representation is necessary because correlation and the models used later work only on numbers. Hence, we use an encoding technique to convert the string values into floating-point numbers.

# Encoding of DataFrame

In [None]:
#Importing library for encoding and creating instance for the same 
from sklearn.preprocessing import OrdinalEncoder
enc=OrdinalEncoder()

In [None]:
#Converting object datatype into float values
for i in df.columns:
    if df[i].dtypes=='object':
        #fit_transform returns a 2-D array, so flatten it before assigning back to the column
        df[i]=enc.fit_transform(df[[i]]).ravel()

In [None]:
#Verifying Conversion 
df.head()

We can observe that the object-type columns have now been converted into float values.
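One point worth illustrating is that an encoder fitted on the training data can be reused on new data, so the same string always maps to the same number. The sketch below assumes scikit-learn 0.24 or later for the `handle_unknown` option; the frames `train` and `test` and the unseen value "Maybe" are made up for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                      "CentralAir": ["Y", "N", "Y"]})
# "Maybe" never appears in train, so it has no learned code.
test = pd.DataFrame({"Street": ["Grvl", "Pave"],
                     "CentralAir": ["Y", "Maybe"]})

# Unseen categories are mapped to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Learn the category-to-number mapping from train only, then reuse it.
train_enc = pd.DataFrame(enc.fit_transform(train), columns=train.columns)
test_enc = pd.DataFrame(enc.transform(test), columns=test.columns)

print(enc.categories_)
print(test_enc)
```

Categories are numbered in sorted order, so the codes reflect alphabetical position rather than any quality ranking.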

# Describe Data

In [None]:
#Describing present columns in the dataset
df.columns

In [None]:
#Defining Shape
df.shape

In [None]:
#Describing mean, median, min, max values of data
df.describe()

In [None]:
#Visualizing data description
import matplotlib.pyplot as plt
plt.figure(figsize=(22,7))
sns.heatmap(df.describe(),annot=True,linewidths=0.1,linecolor='blue',fmt='0.2f')

# Observation:

The standard deviation of a few columns in the dataset is large, which means that the values in these columns are widely scattered around their mean. The standard deviation of the other columns is much smaller, but this alone does not tell us whether those columns are normally distributed, so skewness is checked separately later.

The wide gap between the min and max values in many columns suggests that the data may contain outliers and skewness.

The columns with the largest values in the data set are the target variable 'SalePrice' and 'LotArea'.

The data does not contain negative values in any of its variables.

# Correlation of Columns with the Target Variable

In [None]:
#Plotting correlation of input features with the target Variable
plt.figure(figsize=(22,7))
sns.heatmap(df.corr(),annot=True,linewidths=0.1,linecolor='black',fmt='0.2f')

# Observation:

Many independent variables are positively correlated with the 'SalePrice' target variable.

Since the data set has a large number of variables, the individual correlations cannot be read clearly from the above diagram.

Hence, we sort the correlations of the independent variables with the target variable in descending order to see which variables are most and least correlated.

In [None]:
#Sorting correlation in order with the Target Variable
corr_matrix=df.corr()
corr_matrix['SalePrice'].sort_values(ascending=False)

In [None]:
#Plotting the sorted correlations with the target variable to see which variables are more correlated
plt.figure(figsize=(22,7))
corr_matrix['SalePrice'].sort_values(ascending=False).drop(['SalePrice']).plot(kind='bar',color='y')
plt.xlabel('Feature',fontsize=14)
plt.ylabel('Correlation with SalePrice',fontsize=14)
plt.title('correlation',fontsize=18)
plt.show()

23 variables out of 76 are negatively correlated with the target variable; all the others are positively correlated.

The most negatively correlated variables are 'BsmtQual' & 'ExterQual'.

There are many high positive correlations in the data.

The variables 'BsmtFinSF2', 'BsmtHalfBath' & 'MiscVal' have a correlation of about -0.01 with the target variable, which is practically no linear correlation, so these columns are candidates for deletion.

The most positively correlated columns are 'OverallQual', 'GrLivArea' and 'GarageCars'.
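The ranking of features by their correlation with the target, and the idea of flagging near-zero correlations, can be sketched as follows. The toy values and the 0.05 cut-off are assumptions chosen for the example, not values from the housing data.

```python
import pandas as pd

# Toy frame (made-up values): one strong positive, one strong negative
# and one nearly uncorrelated feature.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9],
    "Age": [50, 40, 30, 20, 10],
    "MoSold": [3, 7, 1, 7, 3],
    "SalePrice": [120000, 150000, 185000, 230000, 300000],
})

# Correlation of every feature with the target, without the target itself.
corr_with_target = toy.corr()["SalePrice"].drop("SalePrice")

# Rank by absolute strength so strong negative correlations are not overlooked.
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.loc[order]

# Assumed threshold for "practically uncorrelated".
weak = corr_with_target[corr_with_target.abs() < 0.05].index.tolist()

print(ranked)
print("weak features:", weak)
```

Sorting by absolute value is a design choice: a feature with a correlation of -0.6 is as informative to a linear model as one with +0.6.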

# Checking skewness

In [None]:
#checking skewness to see if the data is normally distributed
df.skew()

Taking +/-0.5 as the acceptable range for skewness, we can see high skewness in several variables such as 'LotArea', 'LowQualFinSF', '3SsnPorch', 'PoolArea' and 'MiscVal'. The skewness in these variables needs to be reduced before modelling. Before dealing with the skewness, we will check the data for outliers.
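The +/-0.5 rule can be applied in code to list the skewed columns instead of scanning the output by eye. A minimal sketch on made-up values, where `high_skew` is a hypothetical name:

```python
import pandas as pd

# Toy frame: "LotArea" has one very large value, "OverallQual" is symmetric.
toy = pd.DataFrame({
    "LotArea": [8000, 9000, 9500, 10000, 150000],
    "OverallQual": [4, 5, 6, 7, 8],
})

skewness = toy.skew()

# Keep only the columns whose skewness falls outside the +/-0.5 range.
high_skew = skewness[skewness.abs() > 0.5].index.tolist()

print(skewness)
print("columns outside +/-0.5:", high_skew)
```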

We'll first visualize the skewness before checking for outliers.

In [None]:
#Visualizing skewness on density graph
#Example of multi variate analysis
df.plot(kind='density',subplots=True,layout=(8,10),legend=False,sharex=False,figsize=(60,40))
plt.show()

From the above graph, many numeric variables have their peak at 0.

Hence, we can observe high skewness in some of the variables.

Next, we check for outliers in the data.

# Checking Outliers


In [None]:
#Visualizing outliers of different variables using boxplots
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
collist=df.columns.values
ncol=6
nrows=14
plt.figure(figsize=(ncol,5*ncol))
for i,col in enumerate(collist):
    plt.subplot(nrows,ncol,i+1)   #start at the first column so none is skipped
    sns.boxplot(y=df[col],color='orange')
plt.tight_layout()
plt.show()

We can see many outliers in the dataset. Removing them was tried, but it caused a data loss of 58.73%, which is too much, so removal is not preferred. Instead of removing outliers, a power transform is applied to make the distributions closer to normal.
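The notebook does not show how the data loss was measured. As an illustration only, the sketch below uses the interquartile-range (IQR) rule, which is an assumed method and may differ from the one that produced the 58.73% figure; the toy values are made up.

```python
import pandas as pd

# Toy frame with one extreme value in each column (values are made up).
toy = pd.DataFrame({
    "LotArea": [8000, 8500, 9000, 9500, 10000, 10500, 11000, 200000],
    "GrLivArea": [1200, 1300, 1400, 1500, 1600, 1700, 5000, 1800],
})

# IQR fences per column: values beyond 1.5 * IQR from the quartiles are outliers.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is kept only if every one of its values lies inside the fences.
inside = (toy.ge(lower, axis="columns") & toy.le(upper, axis="columns")).all(axis=1)
kept = toy[inside]

loss_pct = 100 * (len(toy) - len(kept)) / len(toy)
print(f"rows kept: {len(kept)}, data loss: {loss_pct:.2f}%")
```

Because a row is dropped when any single column is out of range, the loss grows quickly with the number of columns, which is consistent with the large loss reported above.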

# Separating the Column into x & y

In [None]:
#Creating x & y columns 
x=df.drop('SalePrice',axis=1)
y=df['SalePrice']

We separate the columns into x & y as the input features and the target variable respectively.

# Resolving Skewness using Power Transform

In [None]:
#Using power transform to remove skewness
from sklearn.preprocessing import power_transform
import warnings
warnings.filterwarnings('ignore')
hf=power_transform(x,method='yeo-johnson')

hf=pd.DataFrame(hf,columns=x.columns)

In [None]:
#Verifying skewness
hf.skew()

Skewness is resolved in some features but can still be seen in a few variables even after applying the transformation.

We keep some variables even though their skewness is still high, because they are well correlated with the target variable. A few others hold categorical data, for which skewness and outliers are not meaningful.

Hence, we will proceed with scaling the data using Standard Scaler.

In [None]:
#Visualizing skewness after applying transformations on density graph
hf.plot(kind='density',subplots=True,layout=(8,10),legend=False,sharex=False,figsize=(60,40))
plt.show()

Many of the distributions look closer to normal after applying the transformation.
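The `power_transform` function used above transforms the data in one call but does not keep the fitted parameters. When the same transformation has to be applied to new data, the `PowerTransformer` class from scikit-learn can be fitted once and reused. The sketch below uses synthetic log-normal data, which is an assumption made only to have a skewed input.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data standing in for the training and new features.
rng = np.random.default_rng(0)
train_X = rng.lognormal(mean=0.0, sigma=1.0, size=(200, 2))
new_X = rng.lognormal(mean=0.0, sigma=1.0, size=(50, 2))

# Yeo-Johnson transform; standardize=True (the default) also scales the output.
pt = PowerTransformer(method="yeo-johnson")
train_t = pt.fit_transform(train_X)   # learn lambdas from the training data
new_t = pt.transform(new_X)           # reuse the same lambdas on new data

skew_before = pd.DataFrame(train_X).skew().abs().max()
skew_after = pd.DataFrame(train_t).skew().abs().max()
print("fitted lambdas:", pt.lambdas_)
print(f"max |skew| before: {skew_before:.2f}, after: {skew_after:.2f}")
```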

# Scaling Data Using Standard Scaler

In [None]:
#importing library for scaling
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Scaling our data to improve model performance
sc=StandardScaler()
x=sc.fit_transform(hf)
x

Hence, the data is scaled.

Since our target variable has continuous values, this is a regression problem, and we start with the Linear Regression algorithm.

# Model Building

In [None]:
#Importing libraries to build linear model
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error,mean_absolute_error
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Finding the random state that gives the highest r2 score
maxAccu=0
maxRS=0
for i in range(1,200):
    x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.22,random_state=i)
    lr=LinearRegression()
    lr.fit(x_train,y_train)
    pred=lr.predict(x_test)
    acc=r2_score(y_test,pred)
    if acc>maxAccu:
        maxAccu=acc
        maxRS=i
print('Best r2 score is ',maxAccu," on Random_state",maxRS)

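The highest r2 score, 85.90%, is obtained at random state 140. Hence, we select 140 as the random state. Because this split was chosen for giving the highest test score, the score it reports is likely to be optimistic compared with a typical split.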
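To see how much the score depends on the split, the r2 scores over many random states can be summarized by their mean and spread rather than only their maximum. This is a sketch on synthetic data from `make_regression`, not on the housing data, and the names `X_demo`, `y_demo` and `scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem used only for illustration.
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=25.0, random_state=0)

scores = []
for seed in range(1, 51):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))
scores = np.array(scores)

# The maximum is the lucky split; the mean and std describe a typical split.
print(f"max r2 : {scores.max():.4f}")
print(f"mean r2: {scores.mean():.4f} +/- {scores.std():.4f}")
```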

In [None]:
#Splitting the data into 80% training and 20% testing
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.20,random_state=140)

We split our data into Training and Testing sets, giving 80% of the data for Training and 20% for Testing at the selected random state 140. We use Linear Regression to train our model because the Target Variable 'SalePrice' is continuous.

In [None]:
#Code for Linear Regression
lm=LinearRegression()
lm.fit(x_train,y_train)
pred=lm.predict(x_test)

In [None]:
#displaying predicted and actual values
print("Predicted SalePrice: ",pred)
print('Actual SalePrice: ',y_test)

Let's check the training r2 score to see how well the model fits the training data.

In [None]:
#training score
lm.score(x_train,y_train)

We obtain a training r2 score of 82.41% for our model. Next, let's measure the prediction errors on the test set.

In [None]:
#Finding Errors
print('error:')
print('Mean absolute error:',mean_absolute_error(y_test,pred))
print('Mean squared error:',mean_squared_error(y_test,pred))

print('Root Mean squared error:',np.sqrt(mean_squared_error(y_test,pred)))

The model has noticeable errors, which we can try to reduce using regularization and ensemble techniques. Let's check the r2 score to evaluate the performance of our model.
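For reference, the three error measures printed above can be computed by hand from their definitions. The sketch below uses four made-up price pairs so that each value is easy to verify.

```python
import numpy as np

# Made-up actual and predicted prices.
actual = np.array([200000.0, 150000.0, 320000.0, 180000.0])
predicted = np.array([210000.0, 140000.0, 300000.0, 180000.0])

errors = actual - predicted          # -10000, 10000, 20000, 0

mae = np.mean(np.abs(errors))        # average size of the error
mse = np.mean(errors ** 2)           # squaring penalizes large errors more
rmse = np.sqrt(mse)                  # back in the units of the target

print("Mean absolute error:", mae)
print("Mean squared error:", mse)
print("Root Mean squared error:", rmse)
```

MAE and RMSE are both in the same unit as 'SalePrice', which makes them easier to interpret than MSE.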

In [None]:
#r2 score of the model
print(r2_score(y_test,pred))

The test r2 score is 87.03%, which is higher than the training score of 82.41%. A single train/test split can give a misleading picture of overfitting or underfitting, so we use cross validation to check whether the model performs consistently.
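The r2 score compares the model's squared error with that of a baseline that always predicts the mean of the target. A minimal sketch of that definition, with made-up values and a hypothetical helper `r2_manual`:

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """r2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up actual and predicted prices.
actual = np.array([200000.0, 150000.0, 320000.0, 180000.0])
predicted = np.array([210000.0, 140000.0, 300000.0, 180000.0])

r2_model = r2_manual(actual, predicted)
# Predicting the mean for every record gives an r2 of exactly 0.
r2_mean_only = r2_manual(actual, np.full_like(actual, actual.mean()))

print("r2 of the predictions:", r2_model)
print("r2 of the mean-only baseline:", r2_mean_only)
```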

# Cross Validation of The Model

In [None]:
#Comparing the cv score at different numbers of folds with the train and test r2 scores
pred_train=lm.predict(x_train)   #use lm, the model fitted on the final split
pred_test=lm.predict(x_test)
train_accuracy=r2_score(y_train,pred_train)
test_accuracy=r2_score(y_test,pred_test)

from sklearn.model_selection import cross_val_score
for j in range(2,10):
    cv_score=cross_val_score(lm,x,y,cv=j)
    cv_mean=cv_score.mean()
    print(f"At cross fold {j} the cv score is {cv_mean} and r2 score for training is {train_accuracy} and r2 score for the testing is {test_accuracy}")
    print("\n")

Here we have checked for overfitting and underfitting by comparing the training and testing scores with the cv score at each number of folds. Next, let's visualize how close the predictions are to the actual values.
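The cross-validation step can be written with an explicit, shuffled `KFold` splitter so that the folds do not depend on the order of the rows. The sketch below runs on synthetic data, and the choice of 5 folds is an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=20.0, random_state=0)

# Shuffled folds with a fixed seed make the result reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=cv, scoring="r2")

print("r2 per fold:", np.round(fold_scores, 4))
print(f"mean r2: {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```

A cv mean close to the train and test scores suggests that the model is neither overfitting nor underfitting badly.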

In [None]:
#Plotting the Linear Regression predictions against the actual target values
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.scatter(x=y_test,y=pred_test,color='r')
plt.plot(y_test,y_test,color='b')   #line of perfect prediction
plt.xlabel('Actual SalePrice',fontsize=14)
plt.ylabel('Predicted SalePrice',fontsize=14)
plt.title('Linear Regression',fontsize=18)
plt.show()

Many data points lie close to the line of perfect prediction, which shows quite a good fit of our model.

Now, we will try different techniques and compare them to achieve the best performance for the model, deal with its overfitting and underfitting problems, and then save the best-performing model. We use hyperparameter tuning to do so.

# Regularization

##### Using ElasticNet Regression

In [None]:
#Importing Libraries for hyper parameter tuning 
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Using ElasticNet Regression for our Model to solve overfitting and underfitting
from sklearn.linear_model import ElasticNet

parameters={'alpha':[.0001,.001,.01,.1,1,10],'random_state':list(range(0,10)),'selection':['cyclic','random']}
en=ElasticNet()
clf=GridSearchCV(en,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
#Code for ElasticNet Regression
en=ElasticNet(alpha=1,random_state=3,selection='random')
en.fit(x_train,y_train)
en.score(x_train,y_train)
pred_en=en.predict(x_test)

enn=r2_score(y_test,pred_en)
enn

The r2 score for ElasticNet Regression is 87.49%. 

##### Using Ridge Regression

In [None]:
#Using Ridge Regression for our Model to solve overfitting and underfitting
from sklearn.linear_model import Ridge

parameters={'alpha':[.0001,.001,.01,.1,1,10],'random_state':list(range(0,10)),'solver':['auto','svd','cholesky']}
rd=Ridge()
clf=GridSearchCV(rd,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
#Code for Ridge Regression
rd=Ridge(alpha=10,random_state=0,solver='svd')
rd.fit(x_train,y_train)
rd.score(x_train,y_train)
pred_rd=rd.predict(x_test)

rdd=r2_score(y_test,pred_rd)
rdd

The r2 score for Ridge Regression is 87.16%

##### Using Lasso Regression

In [None]:
#Using Lasso Regression for our Model to solve overfitting and underfitting
from sklearn.linear_model import Lasso

parameters={'alpha':[.0001,.001,.01,.1,1,10],'random_state':list(range(0,10)),'selection':['cyclic','random']}
ls=Lasso()
clf=GridSearchCV(ls,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
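#Code for Lasso Regression
ls=Lasso(alpha=10,random_state=1)
ls.fit(x_train,y_train)
ls.score(x_train,y_train)
pred_ls=ls.predict(x_test)

lss=r2_score(y_test,pred_ls)
lss

The r2 score recorded for this step is 87.16%, identical to Ridge Regression. That value was measured with a Ridge(alpha=10) model in this cell, so the Lasso score printed above may differ slightly.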

Hence, comparing all the three regularization techniques for regression i.e. ElasticNet, Ridge and Lasso, the model performs best for ElasticNet Regularization Regression technique at r2 score 87.49%. 
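As an alternative way to set up the search, the scaler and the regularized model can be placed in one pipeline and tuned only over the parameters that affect the fit, such as `alpha` and `l1_ratio`. The sketch below uses synthetic data and an assumed parameter grid; it is not the search that produced the scores above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem used only for illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=10,
                                 noise=15.0, random_state=0)

# Scaling inside the pipeline is refitted on each training fold.
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))

# Assumed grid; step parameters are addressed as <step name>__<parameter>.
grid = {
    "elasticnet__alpha": [0.001, 0.01, 0.1, 1, 10],
    "elasticnet__l1_ratio": [0.2, 0.5, 0.8],
}
search = GridSearchCV(pipe, grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best cv r2: {search.best_score_:.4f}")
```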

# Ensemble Techniques

##### AdaBoostRegressor

In [None]:
#Finding best Parameters for Ada Boost Regressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostRegressor

parameters={'random_state':list(range(0,10))}
ab=AdaBoostRegressor()
clf=GridSearchCV(ab,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
#Code for Ada Boost Regressor
ab=AdaBoostRegressor(random_state=4, n_estimators=100)
ab.fit(x_train,y_train)
ab.score(x_train,y_train)
pred_y=ab.predict(x_test)

ab_r2=r2_score(y_test,pred_y)   #avoid the name abs, which would shadow the built-in function
print('R2 score: ',ab_r2*100)

In [None]:
#Finding the Cv score of the model
abscore=cross_val_score(ab,x,y,cv=8)
abc=abscore.mean()
print("Cross Val Score: ",abc*100)

The r2 score for Ada Boost Regressor is 81.32% and its cv score is 78.9%, which is lower than the linear models above.

Hence, we will try the Random Forest Regressor algorithm for better performance.

##### RandomForestRegressor

In [None]:
#Finding best Parameters for RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

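#'mse' and 'mae' were renamed to 'squared_error' and 'absolute_error' in scikit-learn 1.0
parameters={'random_state':list(range(0,10)),'criterion':['squared_error','absolute_error']}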
rfg=RandomForestRegressor()
clf=GridSearchCV(rfg,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)

In [None]:
#Code for RandomForestRegressor
rfg=RandomForestRegressor(random_state=9, n_estimators=100, criterion='squared_error')   #named 'mse' before scikit-learn 1.0
rfg.fit(x_train,y_train)
rfg.score(x_train,y_train)
pred_y=rfg.predict(x_test)

rf_r2=r2_score(y_test,pred_y)   #avoid the name abs, which would shadow the built-in function
print('R2 score: ',rf_r2*100)

In [None]:
#Finding the Cv score of the model
abscore1=cross_val_score(rfg,x,y,cv=8)
abc1=abscore1.mean()
print("Cross Val Score: ",abc1*100)

The r2 score for RandomForestRegressor is 87.56% and its cv score is 85.13%, the best performance among all the algorithms tested. ElasticNet also performs well at 87.49%, only slightly below Random Forest Regressor. Hence, we use the RandomForestRegressor ensemble technique for this model and save it as our best model.

# Saving The Best Model

In [None]:
#Saving the best model
import pickle
filename='housing.pkl'
with open(filename,'wb') as f:   #the with block closes the file after writing
    pickle.dump(rfg,f)
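A saved model is only useful if it can be loaded again and gives the same predictions. The round trip can be sketched as follows; the small linear model, the temporary folder and the file name `demo_model.pkl` are assumptions used so the example does not touch `housing.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model fitted on made-up data (y = 2 * x).
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

# Save to a temporary location; the with blocks close the files.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)

with open(path, "rb") as fh:
    restored = pickle.load(fh)

# The restored model should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("predictions match after reload:", same)
```

Pickle files should be loaded with the same scikit-learn version that created them, and only from a trusted source.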

# Testing Dataset


# Data Description
The test data consists of the same variables as the train data set except the target variable, and it has fewer records than the train data set. This data set is used to predict the values of houses using the independent variables present in the data.


In [None]:
# Importing Testing Dataset
#The test file name below is assumed; the path must point to the test CSV, not housing_train.csv
df1=pd.read_csv(r'C:\Users\HP-15\housing_test.csv')
#Viewing the test dataset
df1

In [None]:
#Initial variables in the test dataset (before data pre-processing)
df1.columns

In [None]:
#Dropping "Id" column from the test dataset just like we did in the train data set
df1.drop("Id",axis=1,inplace=True)

In [None]:
#Checking size of present dataset
df1.shape

The dataset now consists of 292 rows and 79 columns.

In [None]:
#Checking data types of all the attributes within the data set
df1.dtypes


The data set consists of a combination of string, int and float data types.

Null values must be found and cleaned before predicting, so we first check for them and then resolve them using SimpleImputer.

# Checking Null Values

In [None]:
#Checking Null values
df1.isnull().sum()

The null (NaN) values in the columns "PoolQC", "Fence", "MiscFeature" and "Alley" make up more than 70% of the records in the data set. These variables are therefore of little use in analysing the sale price, and we remove them from the data set.

The remaining null values will be treated using SimpleImputer. Null values in the data set need to be treated before any analysis can be made on the data.

These are the same steps we performed on the "train" data set. We have to follow the same procedure in order to predict values for the "test" data set.
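The 70% rule described above can be checked programmatically rather than read off the null counts. The sketch below assumes a small toy frame in place of `df1`; the column names are borrowed from the data set and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1 (assumption): "PoolQC" is mostly missing.
demo = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, np.nan, 11250],
    "Street": ["Pave", "Pave", "Pave", "Grvl"],
})

null_share = demo.isnull().mean()      # fraction of missing values per column
threshold = 0.70                       # cut-off used in the text
to_drop = null_share[null_share > threshold].index.tolist()
cleaned = demo.drop(columns=to_drop)   # drop every column above the cut-off
print(null_share)
print(to_drop)
```

Whichever columns are selected this way, the list dropped from the test data has to match the list dropped from the train data, because the saved model expects the same features.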

In [None]:
#Dropping variables with maximum null values i.e. more than 70% of data are null values
df1.drop(["PoolQC","Fence","MiscFeature","Alley"],axis=1,inplace=True)

In [None]:
#Checking number of variables left in the dataset
df1.shape

We can see that 4 variables have been dropped from the data set.

Now we will treat the null values of the test data.

In [None]:
#Filling NaN values using the 'most_frequent' strategy of SimpleImputer for categorical data
from sklearn.impute import SimpleImputer

imp=SimpleImputer(strategy='most_frequent')

cat_null_cols=['GarageQual','GarageCond','GarageFinish','GarageType','FireplaceQu',
               'BsmtFinType2','BsmtFinType1','BsmtExposure','BsmtCond','BsmtQual','MasVnrType']
for col in cat_null_cols:
    #ravel() flattens the (n,1) imputer output back into a single column
    df1[col]=imp.fit_transform(df1[col].values.reshape(-1,1)).ravel()

df1.head()

We used "most_frequent" strategy of Simple Imputer to fill null values of categorical data.
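The cells in this section fit the imputer on the test data itself. A common alternative is to fit it once on the train data and only apply it to the test data, so that both sets are filled with the same value. The sketch below is a minimal illustration with two hypothetical toy frames, `train_demo` and `test_demo`; it is not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the train and test data.
train_demo = pd.DataFrame({"GarageQual": ["TA", "TA", "Fa", np.nan]}, dtype=object)
test_demo = pd.DataFrame({"GarageQual": [np.nan, "Gd", np.nan]}, dtype=object)

imp_demo = SimpleImputer(strategy="most_frequent")
imp_demo.fit(train_demo[["GarageQual"]])   # learn the most frequent value from train only

# Apply the train-derived fill value to the test column.
test_demo["GarageQual"] = imp_demo.transform(test_demo[["GarageQual"]]).ravel()
print(test_demo)
```

Here the missing test entries are filled with "TA", the most frequent value in the toy train frame, even though "TA" never occurs in the toy test frame.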

Now again we'll check the null values present in data.

In [None]:
#Checking Null values
df1.isnull().sum()

In [None]:
#Creating a median imputer for the remaining (numeric) null values
#SimpleImputer and warnings were already imported above
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

In [None]:
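#treating null values using Simple Imputer
for i in df1.columns:
    if df1[i].isnull().sum()!=0:
        #median imputer for numeric columns; most_frequent as a fallback for text columns
        filler=imputer if pd.api.types.is_numeric_dtype(df1[i]) else imp
        df1[i]=filler.fit_transform(df1[i].values.reshape(-1,1)).ravel()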

In [None]:
#Checking null values again
df1.isnull().sum()

We now observe that our null values have been treated. Let's visualize and check if we still have any Null Values.

In [None]:
# Visualizing null values through heatmap
import seaborn as sns
sns.heatmap(df1.isnull())

We cannot see any Null Values in our dataset. Hence, we can proceed forward with visualizing our data.

# Making DataFrame for the Nominal Data

In [None]:
#Copying nominal variables into a new dataframe
df_nominal1=df1[['MSZoning','Street','LotShape', 'LandContour', 'Utilities', 'LotConfig','LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType','HouseStyle','RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType','ExterQual', 'ExterCond', 'Foundation', 'BsmtQual','BsmtCond', 'BsmtExposure', 'BsmtFinType1','BsmtFinType2', 'Heating','HeatingQC', 'CentralAir', 'Electrical','KitchenQual','Functional','FireplaceQu','GarageType','GarageFinish','GarageQual','GarageCond','PavedDrive','SaleType','SaleCondition']].copy()

We create a separate dataframe to store all the categorical data and analyse it separately. This is the same procedure we followed for the training data set.

In [None]:
#Checking columns of new nominal dataframe created
df_nominal1.columns


In [None]:
#Checking shape of new nominal dataframe
df_nominal1.shape

We have 292 rows and 39 columns in the nominal DataFrame.

# Visualization of Data

For the nominal/categorical data we will use countplot, as it shows the frequency of each category in a column. Since we have already imported the required libraries earlier, we'll not import them again.

In [None]:
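#Visualizing different columns using countplot
nrow,ncol=10,4   #grid of 10 rows x 4 columns of subplots
cols=df_nominal1.columns.values
plt.figure(figsize=(20,40))
for index,i in enumerate(cols):
    plt.subplot(nrow,ncol,index+1)
    sns.countplot(x=df1[i])   #pass the column by keyword; newer seaborn versions require it
    plt.title(f"plot {i}")
plt.tight_layout()   #adjust spacing once, after all subplots are drawn
plt.show()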

In [None]:
#Visualizing Utilities column separately to get exact figure of values for records
ab=sns.countplot(x='Utilities', data=df_nominal1)
print(df_nominal1['Utilities'].value_counts())

Observation:

We deleted the 'Utilities' variable from the training data set because it contained the same value ("AllPub") for all records. In the above graph, 291 records have "AllPub" as the value and only 1 record has a different value, "NoSeWa". Hence, we conclude that the column is of little importance and delete it from the test data set as well.

'MSZoning' has the most records for value "RL", and the second highest is "RM".

Most records in 'Street' belong to value "Pave".

The 'LotShape' for most properties is "Reg", and the second most common is "IR1".

'LandContour' is "Lvl" for most records.

The most records in 'LotConfig' are for value "Inside".

'LandSlope' has the most records for "Gtl" (gentle slope).

Variables such as 'Neighborhood', 'Exterior1st' and 'Exterior2nd' are hard to read here because they contain too many categories; they have been explained separately for the train data set.

'Condition1' and 'Condition2' are "Normal" for most of the properties.

'BldgType' is "1Fam" (single-family detached) for around 240 properties.

'HouseStyle' has values "1Story" and "2Story" for 140+ and 80 records respectively.

'RoofStyle' is "Gable" for more than 200 properties.

'RoofMatl', 'BsmtCond', 'BsmtFinType2' and 'Heating' each hold 250+ records for the values "Standard (Composite) Shingle", "Typical - slight dampness allowed", "Unfinished" and "GasA" respectively.

More than 250 properties have 'CentralAir'.

Variables 'Electrical', 'Functional' and 'FireplaceQu' have 250+ records for values "Standard Circuit Breakers & Romex", "Typ" and "Gd" respectively.

In [None]:
#Dropping "Utilities": all values in the column are the same except 1 record, so it is of little importance in both train and test data sets
df1.drop("Utilities",axis=1,inplace=True)

# Making DataFrame for the Continuous Data

In [None]:
#Copying our continuous data into a new dataframe
df_continuous1=df1[['MSSubClass','LotFrontage','LotArea','OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd','MasVnrArea','BsmtFinSF1','BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF','1stFlrSF', '2ndFlrSF', 'LowQualFinSF','GrLivArea','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr','TotRmsAbvGrd','Fireplaces','GarageYrBlt','GarageCars', 'GarageArea','WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch','ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']].copy()

In [None]:
#Verifying columns
df_continuous1.columns

In [None]:
#Checking shape of new numeric dataframe
df_continuous1.shape

# Visualizing Continuous Data

In [None]:
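#Visualizing the distribution of each column (histplot with kde replaces the deprecated distplot)
nrow,ncol=10,4   #grid of 10 rows x 4 columns of subplots
cols=df_continuous1.columns.values
plt.figure(figsize=(20,40))
for index,i in enumerate(cols):
    plt.subplot(nrow,ncol,index+1)
    sns.histplot(df1[i],kde=True)
    plt.title(f"plot {i}")
plt.tight_layout()   #adjust spacing once, after all subplots are drawn
plt.show()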

Observation:

The type of dwelling involved in the sale ('MSSubClass') is "20 : 1-STORY 1946 & NEWER ALL STYLES" for most of the cases.

Linear feet of street connected to the property ('LotFrontage') lies mostly in the range 50 to 75.

Lot size in square feet ('LotArea') has most records between 0 and 25000.

The rating of the overall material and finish of the house ('OverallQual') has the most records for values 5, 6 and 7.

The rating of the overall condition of the house ('OverallCond') has the most records for value 5.

Most records have an original construction date ('YearBuilt') around 2000.

Most records for the remodel date ('YearRemodAdd', same as construction date if no remodeling or additions) have values between 2000 and 2010.

Masonry veneer area in square feet ('MasVnrArea') is 0 for most properties, with values ranging from 0 to 1200.

Type 1 finished square feet ('BsmtFinSF1') is 0 for most properties, with values ranging from 0 to 2000 and most of them between 0 and 1000.

Type 2 finished square feet ('BsmtFinSF2') is 0 for most properties, with values ranging from 0 to 1200.

Unfinished square feet of basement area ('BsmtUnfSF') lies mostly between 0 and 1000.

Total square feet of basement area ('TotalBsmtSF') has most records between 500 and 2000.

First floor square feet ('1stFlrSF') has most records between 500 and 1500.

Second floor square feet ('2ndFlrSF') has most records at value 0, and a few between 500 and 1000.

Above grade (ground) living area square feet ('GrLivArea') has most records between 900 and 3000.

Basement full bathrooms ('BsmtFullBath') has more records for value 0 than for value 1.

Variables such as 'LowQualFinSF', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea' and 'MiscVal' have most records at value 0.

In [None]:
#using multivariate analysis to analyse continuous type of data
sns.pairplot(df_continuous1)

Since the data is large and contains many variables, a cumulative analysis is difficult to make from the pair plot. We can still observe some independent variables that are positively correlated with one another; their correlation with the target variable was studied on the train data set.

Converting Nominal data into numeric data representation is very important in order to understand data appropriately. Hence, we will use data encoding techniques to convert string values into floating numbers.

# Encoding of DataFrame

In [None]:
#Importing library for encoding and creating instance for the same 
from sklearn.preprocessing import OrdinalEncoder
enc=OrdinalEncoder()

In [None]:
#Converting object datatype into float values
for i in df1.columns:
    if df1[i].dtypes=='object':
        df1[i]=enc.fit_transform(df1[i].values.reshape(-1,1))

In [None]:
#Verifying Conversion 
df1.head()

We can observe that our object datatype has been now converted into float values.
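The encoding cell above fits a fresh `OrdinalEncoder` on each test column, so the same category can receive a different number than it did in the train data. A hedged sketch of one way to avoid that is shown below: fit the encoder on train categories and reuse it on test, mapping unseen categories to a sentinel. The toy frames and the sentinel value `-1` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; "FV" appears only in the test data.
train_cat = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"]}, dtype=object)
test_cat = pd.DataFrame({"MSZoning": ["RM", "FV"]}, dtype=object)

# Categories unseen during fit are encoded as -1 instead of raising an error.
enc_demo = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc_demo.fit(train_cat)                 # learn the category-to-number mapping from train
encoded = enc_demo.transform(test_cat)  # reuse the same mapping on test
print(encoded.ravel())
```

With this mapping "RL" is 0 and "RM" is 1 in both data sets, and the unseen "FV" is flagged as -1.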

# Describe Data

In [None]:
#Describing present columns in the dataset
df1.columns

In [None]:
#Checking the shape of the dataset
df1.shape

We now have 292 rows and 74 columns in the dataset.

In [None]:
#Describing mean, median, min, max values of data
df1.describe()

In [None]:
#Visualizing data description
import matplotlib.pyplot as plt
plt.figure(figsize=(22,7))
sns.heatmap(df1.describe(),annot=True,linewidths=0.1,linecolor='blue',fmt='0.2f')

Observation:

The standard deviation of a few columns in the dataset is large, which means the values in these columns are widely scattered and lie far from their mean. The standard deviation of the other columns is not too high, which suggests their values are less spread out; skewness is checked separately in a later step.

Since the min and max values in every column are far apart, we assume there may be a few outliers and some skewness in the data.

The only column to have high values in the test data set is the independent variable 'LotArea'.

The data does not contain negative values in any of its variables.

# Correlation of Columns with the Target Variable
Since we don't have the target variable "SalePrice" in the test data set, we cannot find the correlation of the independent variables with the dependent/target variable. We have already found the correlation between the independent variables and the target variable from the train data set.

The target variable is not available in the test data set because we are going to predict the values of "SalePrice" of test data set with the algorithm trained using the train data set. Hence, no correlations can be found for this data set.

The test data set is going to be processed just as the train data set since the data information and pattern is similar so same pre-processing steps will also be applied.

# Checking Skewness

In [None]:
#checking skewness to see if the data is normally distributed
df1.skew()

Keeping threshold +/-0.5 as the range for skewness, we can see high skewness in few variables such as 'LotArea', 'LowQualFinSF', 'Condition2', '3SsnPorch', 'RoofMatl', 'Street' and 'MiscVal'. The skewness in these variables needs to be reduced in order to process data for analysis. The skewed data needs to be normalized. Before dealing with the skewness, we'll first check outliers in data.
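The +/-0.5 threshold can be applied in code instead of reading the skewness table by eye. The sketch below uses a small toy frame in place of `df1`; the column names are borrowed from the data set and the values are made up.

```python
import pandas as pd

# Toy frame standing in for df1 (assumption): one skewed and one symmetric column.
demo = pd.DataFrame({
    "LotArea": [1, 1, 1, 2, 50],
    "OverallQual": [4, 5, 6, 7, 8],
})

skew = demo.skew()
# Keep the columns whose absolute skewness exceeds the 0.5 threshold.
skewed_cols = skew[skew.abs() > 0.5].index.tolist()
print(skewed_cols)
```

Only 'LotArea' is reported here, since the toy 'OverallQual' values are symmetric around their mean.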

# Checking Outliers

In [None]:
#Visualizing outliers of different variables
collist=df1.columns.values
ncol=9
nrows=9
plt.figure(figsize=(ncol,5*ncol))
for i in range(len(collist)):   #start at 0 so that the first column is plotted as well
    plt.subplot(nrows,ncol,i+1)
    sns.boxplot(y=df1[collist[i]],color='orange')   #y= draws a vertical box
plt.tight_layout()   #adjust spacing once, after all subplots are drawn

We can see many outliers in the dataset. Removing them was tried, but it caused an information loss of 58.73%, which is too much data to lose, so removal is not preferred. Instead of removing outliers, the data is transformed using a power transform to normalize it.
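The notebook does not show how the information loss was measured. One common way, assumed here purely for illustration, is to drop every row that has an absolute z-score of 3 or more in any column and compare the row counts. The sketch uses random data with one injected outlier, so its percentage is unrelated to the 58.73% reported above.

```python
import numpy as np
import pandas as pd

# Random stand-in data (assumption) with one obvious outlier injected.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
demo.iloc[0, 0] = 25.0

# Absolute z-score of every value, computed column by column.
z = np.abs((demo - demo.mean()) / demo.std(ddof=0))
keep = (z < 3).all(axis=1)            # a row survives only if all its z-scores are below 3
loss_pct = 100 * (1 - keep.mean())    # share of rows that would be removed
print(f"rows removed: {loss_pct:.2f}%")
```

With many columns, a row is discarded as soon as one of its values is extreme, which is how the loss can grow to more than half of the data.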

There's no need of separating x and y since we already don't have y variable existing in the test data set. So, we skip the step.


# Resolving Skewness using Power Transform

In [None]:
#Using power transform to remove skewness
from sklearn.preprocessing import power_transform
hf1=power_transform(df1,method='yeo-johnson')

#Converting the transformed array back into a dataframe with the original column names
hf1=pd.DataFrame(hf1,columns=df1.columns)

In [None]:
#Verifying skewness
hf1.skew()

Skewness in some features can still be seen even after applying transformation methods.

For a few variables the skewness is still high, but we cannot remove them from the data set because they had good correlation with the target variable in the train data set. A few others hold categorical data, for which we do not consider outliers or skewness.

Also, we cannot drop columns from the test data set that were not dropped from the train data set; otherwise we cannot predict "SalePrice", since both data sets must have the same variables.

Hence, we will proceed with scaling data using Standard Scaler.

# Scaling Data Using Standard Scaler

In [None]:
#Scaling our data to improve model performance
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x1=sc.fit_transform(hf1)
x1
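The power transform and the scaler above are refit on the test data, so their parameters differ from the ones the model saw during training. A minimal sketch of one way to keep them identical is to bundle the preprocessing and the model in a single scikit-learn pipeline and pickle that pipeline. The synthetic data and the smaller forest below are assumptions made to keep the example short; this is not the model saved in `housing.pkl`.

```python
import pickle
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

# Synthetic stand-ins for the train features/target and the unlabeled test features.
X_tr, y_tr = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)
X_new, _ = make_regression(n_samples=10, n_features=5, noise=5.0, random_state=1)

# Transform, scale and model are fitted together on the train data only.
pipe = make_pipeline(
    PowerTransformer(method="yeo-johnson"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=9),
)
pipe.fit(X_tr, y_tr)

# Pickling the pipeline stores the fitted transform parameters with the model.
restored = pickle.loads(pickle.dumps(pipe))
preds = restored.predict(X_new)   # new data passes through the train-fitted steps
print(preds[:3])
```

Loading such a pipeline would replace the separate transform, scaling and model-loading cells for the test data with a single `predict` call.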

# Loading the Best Saved Model

In [None]:
#Loading model
import pickle
#the with-block closes the file once the model has been read
with open('housing.pkl','rb') as f:
    fitted_model=pickle.load(f)

In [None]:
fitted_model

# Predicting The Test File using the Pre Saved Model

In [None]:
#Making Prediction of "SalePrice" on test data using algorithm trained on Train data
predictions=fitted_model.predict(x1)

In [None]:
#Prediction of "SalePrice" values on test data 
predictions

In [None]:
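# R2 score of the saved model on the hold-out split (x_test, y_test) of the training data
with open('housing.pkl','rb') as f:
    fitted_model=pickle.load(f)
result=fitted_model.score(x_test,y_test)
print(result)

The saved model reaches an R2 score of 87.56% on the hold-out split of the training data. Since the test file has no "SalePrice" column, the predictions made for it cannot be scored directly; this hold-out score is our estimate of how well they will perform.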