# Problem Statement
Predict the sale price of a property from its characteristics, in order to support the property valuation process.

# Approach

1.   First we will be analysing the data via Exploratory Data Analysis to derive useful insights. This will help us later in the data preprocessing/cleaning stage. We will do the following:

*   Check the statistics of the features
*   Check the outliers in the numerical features
*   Check the distribution of numerical features
*   Check the distribution and cardinality of categorical features
*   Check the presence of NaN values



2.   Next we will perform the preprocessing/cleaning steps needed to clean the data and convert it into a format the Machine Learning models can train on.
This is necessary because most Machine Learning models only accept numeric input without missing values, so the raw data must be converted into that format first. We will perform the following steps:

*   Impute the Missing Values
*   Encode the categorical features
*   Scale the data



3.   Now we will train the Machine Learning models to predict the SalePrice of a property from its input features. We will train 6 Machine Learning models and see which one gives the best results.

*   Linear Regression
*   K Nearest Neighbour
*   DecisionTree
*   Random Forest
*   XGBoost
*   Support Vector Machine



4.   After training the model, we will tune the Hyperparameters of the models in order to further improve the results.



5.   Feature Selection/Engineering phase, where we will remove noisy features and/or combine multiple features into one in order to improve the results. We will apply the following techniques:


*   Chisquare pair plot
*   Variance Threshold
*   NaN value threshold
*   Random Forest feature importance

MSSubClass: Identifies the type of dwelling involved in the sale.

    20    1-STORY 1946 & NEWER ALL STYLES
    30    1-STORY 1945 & OLDER
    40    1-STORY W/FINISHED ATTIC ALL AGES
    45    1-1/2 STORY - UNFINISHED ALL AGES
    50    1-1/2 STORY FINISHED ALL AGES
    60    2-STORY 1946 & NEWER
    70    2-STORY 1945 & OLDER
    75    2-1/2 STORY ALL AGES
    80    SPLIT OR MULTI-LEVEL
    85    SPLIT FOYER
    90    DUPLEX - ALL STYLES AND AGES
    120    1-STORY PUD (Planned Unit Development) - 1946 & NEWER
    150    1-1/2 STORY PUD - ALL AGES
    160    2-STORY PUD - 1946 & NEWER
    180    PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
    190    2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

    A    Agriculture
    C    Commercial
    FV    Floating Village Residential
    I    Industrial
    RH    Residential High Density
    RL    Residential Low Density
    RP    Residential Low Density Park 
    RM    Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

    Grvl    Gravel  
    Pave    Paved

Alley: Type of alley access to property

    Grvl    Gravel
    Pave    Paved
    NA     No alley access

LotShape: General shape of property

    Reg    Regular 
    IR1    Slightly irregular
    IR2    Moderately Irregular
    IR3    Irregular

LandContour: Flatness of the property

    Lvl    Near Flat/Level 
    Bnk    Banked - Quick and significant rise from street grade to building
    HLS    Hillside - Significant slope from side to side
    Low    Depression

Utilities: Type of utilities available

    AllPub    All public Utilities (E,G,W,& S)    
    NoSewr    Electricity, Gas, and Water (Septic Tank)
    NoSeWa    Electricity and Gas Only
    ELO    Electricity only    

LotConfig: Lot configuration

    Inside    Inside lot
    Corner    Corner lot
    CulDSac    Cul-de-sac
    FR2    Frontage on 2 sides of property
    FR3    Frontage on 3 sides of property

LandSlope: Slope of property

    Gtl    Gentle slope
    Mod    Moderate Slope  
    Sev    Severe Slope

Neighborhood: Physical locations within Ames city limits

    Blmngtn    Bloomington Heights
    Blueste    Bluestem
    BrDale    Briardale
    BrkSide    Brookside
    ClearCr    Clear Creek
    CollgCr    College Creek
    Crawfor    Crawford
    Edwards    Edwards
    Gilbert    Gilbert
    IDOTRR    Iowa DOT and Rail Road
    MeadowV    Meadow Village
    Mitchel    Mitchell
    Names    North Ames
    NoRidge    Northridge
    NPkVill    Northpark Villa
    NridgHt    Northridge Heights
    NWAmes    Northwest Ames
    OldTown    Old Town
    SWISU    South & West of Iowa State University
    Sawyer    Sawyer
    SawyerW    Sawyer West
    Somerst    Somerset
    StoneBr    Stone Brook
    Timber    Timberland
    Veenker    Veenker

Condition1: Proximity to various conditions

    Artery    Adjacent to arterial street
    Feedr    Adjacent to feeder street   
    Norm    Normal  
    RRNn    Within 200' of North-South Railroad
    RRAn    Adjacent to North-South Railroad
    PosN    Near positive off-site feature--park, greenbelt, etc.
    PosA    Adjacent to positive off-site feature
    RRNe    Within 200' of East-West Railroad
    RRAe    Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

    Artery    Adjacent to arterial street
    Feedr    Adjacent to feeder street   
    Norm    Normal  
    RRNn    Within 200' of North-South Railroad
    RRAn    Adjacent to North-South Railroad
    PosN    Near positive off-site feature--park, greenbelt, etc.
    PosA    Adjacent to positive off-site feature
    RRNe    Within 200' of East-West Railroad
    RRAe    Adjacent to East-West Railroad

BldgType: Type of dwelling

    1Fam    Single-family Detached  
    2FmCon    Two-family Conversion; originally built as one-family dwelling
    Duplx    Duplex
    TwnhsE    Townhouse End Unit
    TwnhsI    Townhouse Inside Unit

HouseStyle: Style of dwelling

    1Story    One story
    1.5Fin    One and one-half story: 2nd level finished
    1.5Unf    One and one-half story: 2nd level unfinished
    2Story    Two story
    2.5Fin    Two and one-half story: 2nd level finished
    2.5Unf    Two and one-half story: 2nd level unfinished
    SFoyer    Split Foyer
    SLvl    Split Level

OverallQual: Rates the overall material and finish of the house

    10    Very Excellent
    9    Excellent
    8    Very Good
    7    Good
    6    Above Average
    5    Average
    4    Below Average
    3    Fair
    2    Poor
    1    Very Poor

OverallCond: Rates the overall condition of the house

    10    Very Excellent
    9    Excellent
    8    Very Good
    7    Good
    6    Above Average   
    5    Average
    4    Below Average   
    3    Fair
    2    Poor
    1    Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

    Flat    Flat
    Gable    Gable
    Gambrel    Gambrel (Barn)
    Hip    Hip
    Mansard    Mansard
    Shed    Shed

RoofMatl: Roof material

    ClyTile    Clay or Tile
    CompShg    Standard (Composite) Shingle
    Membran    Membrane
    Metal    Metal
    Roll    Roll
    Tar&Grv    Gravel & Tar
    WdShake    Wood Shakes
    WdShngl    Wood Shingles

Exterior1st: Exterior covering on house

    AsbShng    Asbestos Shingles
    AsphShn    Asphalt Shingles
    BrkComm    Brick Common
    BrkFace    Brick Face
    CBlock    Cinder Block
    CemntBd    Cement Board
    HdBoard    Hard Board
    ImStucc    Imitation Stucco
    MetalSd    Metal Siding
    Other    Other
    Plywood    Plywood
    PreCast    PreCast 
    Stone    Stone
    Stucco    Stucco
    VinylSd    Vinyl Siding
    Wd Sdng    Wood Siding
    WdShing    Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

    AsbShng    Asbestos Shingles
    AsphShn    Asphalt Shingles
    BrkComm    Brick Common
    BrkFace    Brick Face
    CBlock    Cinder Block
    CemntBd    Cement Board
    HdBoard    Hard Board
    ImStucc    Imitation Stucco
    MetalSd    Metal Siding
    Other    Other
    Plywood    Plywood
    PreCast    PreCast
    Stone    Stone
    Stucco    Stucco
    VinylSd    Vinyl Siding
    Wd Sdng    Wood Siding
    WdShing    Wood Shingles

MasVnrType: Masonry veneer type

    BrkCmn    Brick Common
    BrkFace    Brick Face
    CBlock    Cinder Block
    None    None
    Stone    Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

    Ex    Excellent
    Gd    Good
    TA    Average/Typical
    Fa    Fair
    Po    Poor

ExterCond: Evaluates the present condition of the material on the exterior

    Ex    Excellent
    Gd    Good
    TA    Average/Typical
    Fa    Fair
    Po    Poor

Foundation: Type of foundation

    BrkTil    Brick & Tile
    CBlock    Cinder Block
    PConc    Poured Concrete
    Slab    Slab
    Stone    Stone
    Wood    Wood

BsmtQual: Evaluates the height of the basement

    Ex    Excellent (100+ inches) 
    Gd    Good (90-99 inches)
    TA    Typical (80-89 inches)
    Fa    Fair (70-79 inches)
    Po    Poor (<70 inches)
    NA    No Basement

BsmtCond: Evaluates the general condition of the basement

    Ex    Excellent
    Gd    Good
    TA    Typical - slight dampness allowed
    Fa    Fair - dampness or some cracking or settling
    Po    Poor - Severe cracking, settling, or wetness
    NA    No Basement

BsmtExposure: Refers to walkout or garden level walls

    Gd    Good Exposure
    Av    Average Exposure (split levels or foyers typically score average or above)  
    Mn    Minimum Exposure
    No    No Exposure
    NA    No Basement

BsmtFinType1: Rating of basement finished area

    GLQ    Good Living Quarters
    ALQ    Average Living Quarters
    BLQ    Below Average Living Quarters   
    Rec    Average Rec Room
    LwQ    Low Quality
    Unf    Unfinished
    NA    No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

    GLQ    Good Living Quarters
    ALQ    Average Living Quarters
    BLQ    Below Average Living Quarters   
    Rec    Average Rec Room
    LwQ    Low Quality
    Unf    Unfinished
    NA    No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

    Floor    Floor Furnace
    GasA    Gas forced warm air furnace
    GasW    Gas hot water or steam heat
    Grav    Gravity furnace 
    OthW    Hot water or steam heat other than gas
    Wall    Wall furnace

HeatingQC: Heating quality and condition

    Ex    Excellent
    Gd    Good
    TA    Average/Typical
    Fa    Fair
    Po    Poor

CentralAir: Central air conditioning

    N    No
    Y    Yes

Electrical: Electrical system

    SBrkr    Standard Circuit Breakers & Romex
    FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
    FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
    FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
    Mix    Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

    Ex    Excellent
    Gd    Good
    TA    Typical/Average
    Fa    Fair
    Po    Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

    Typ    Typical Functionality
    Min1    Minor Deductions 1
    Min2    Minor Deductions 2
    Mod    Moderate Deductions
    Maj1    Major Deductions 1
    Maj2    Major Deductions 2
    Sev    Severely Damaged
    Sal    Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

    Ex    Excellent - Exceptional Masonry Fireplace
    Gd    Good - Masonry Fireplace in main level
    TA    Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
    Fa    Fair - Prefabricated Fireplace in basement
    Po    Poor - Ben Franklin Stove
    NA    No Fireplace

GarageType: Garage location

    2Types    More than one type of garage
    Attchd    Attached to home
    Basment    Basement Garage
    BuiltIn    Built-In (Garage part of house - typically has room above garage)
    CarPort    Car Port
    Detchd    Detached from home
    NA    No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

    Fin    Finished
    RFn    Rough Finished  
    Unf    Unfinished
    NA    No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

    Ex    Excellent
    Gd    Good
    TA    Typical/Average
    Fa    Fair
    Po    Poor
    NA    No Garage

GarageCond: Garage condition

    Ex    Excellent
    Gd    Good
    TA    Typical/Average
    Fa    Fair
    Po    Poor
    NA    No Garage

PavedDrive: Paved driveway

    Y    Paved 
    P    Partial Pavement
    N    Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

    Ex    Excellent
    Gd    Good
    TA    Average/Typical
    Fa    Fair
    NA    No Pool

Fence: Fence quality

    GdPrv    Good Privacy
    MnPrv    Minimum Privacy
    GdWo    Good Wood
    MnWw    Minimum Wood/Wire
    NA    No Fence

MiscFeature: Miscellaneous feature not covered in other categories

    Elev    Elevator
    Gar2    2nd Garage (if not described in garage section)
    Othr    Other
    Shed    Shed (over 100 SF)
    TenC    Tennis Court
    NA    None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

    WD     Warranty Deed - Conventional
    CWD    Warranty Deed - Cash
    VWD    Warranty Deed - VA Loan
    New    Home just constructed and sold
    COD    Court Officer Deed/Estate
    Con    Contract 15% Down payment regular terms
    ConLw    Contract Low Down payment and low interest
    ConLI    Contract Low Interest
    ConLD    Contract Low Down
    Oth    Other

SaleCondition: Condition of sale

    Normal    Normal Sale
    Abnorml    Abnormal Sale -  trade, foreclosure, short sale
    AdjLand    Adjoining Land Purchase
    Alloca    Allocation - two linked properties with separate deeds, typically condo with a garage unit  
    Family    Sale between family members
    Partial    Home was not completed when last assessed (associated with New Homes)

***IMPORTING REQUIRED LIBRARIES***

In [None]:
import os
import warnings
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline
warnings.filterwarnings('ignore')

***Reading the data into a Dataframe. index_col is used to set the index column***

In [None]:
data = pd.read_csv('train1.csv', index_col = 0)  # the first column (Id) becomes the index

***Taking a deep copy of the data so as to not modify the original data***

In [None]:
data1 = data.copy(deep = True)
data2 = data.copy(deep = True)

***Display the first 5 rows of the dataframe***

In [None]:
data1.head()

***Display the last 5 rows of the dataframe***

In [None]:
data1.tail()

***Display all the columns of the dataframe***

In [None]:
data1.columns

***Display the selected column only***

In [None]:
data1['MSSubClass']#.head()

***Display the data of multiple selected columns***

In [None]:
data1[['MSSubClass', 'SaleType', 'Utilities']]

***Display the Index of the Dataframe***

In [None]:
data1.index

***Create a new column based on existing columns. A derived column can replace the columns it was built from, which reduces the dimensionality of the dataframe once the originals are dropped.***

***This can also be used to create new features during the feature engineering stage.***

In [None]:
data1['Age'] = data1['YrSold'] - data1['YearBuilt']

In [None]:
data1['Age']

***Select particular rows by index label using the "loc" indexer***

In [None]:
data1.loc[3]

***Select row with Id "3" and column name "LotArea" and "Street"***

In [None]:
data1.loc[3, ['LotArea', 'Street']]

***Select a single value by integer position using the "iloc" indexer (third row, third column)***

In [None]:
data1.iloc[2, 2]

In [None]:
data1.head(6)
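
***The difference between label-based and position-based selection is easy to miss when the index starts at 1, as the Id index does here. The minimal sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) to show that `loc` follows the labels while `iloc` follows the row order, which matters as soon as rows are filtered out.***

```python
import pandas as pd

# Hypothetical toy frame whose index mimics the 1-based Id column.
toy = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250, 9550], "Street": ["Pave", "Pave", "Grvl", "Pave"]},
    index=pd.Index([1, 2, 3, 4], name="Id"),
)

by_label = toy.loc[3, "LotArea"]       # row whose index label is 3
by_position = toy.iloc[2, 0]           # third row, first column (0-based positions)

# After filtering, the labels are kept but the positions shift.
filtered = toy[toy["Street"] == "Pave"]
label_after_filter = filtered.loc[4, "LotArea"]    # label 4 still exists
position_after_filter = filtered.iloc[2, 0]        # the third remaining row is Id 4

print(by_label, by_position, label_after_filter, position_after_filter)
```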

***Conditional selection of rows.***

In [None]:
data1[data1['LotArea'] >= 5000]

In [None]:
data1[data1['LotShape'] == 'IR1']

***Multiple conditional selection. Conditional selection can help weed out outliers or incorrect data later in the data cleaning stage.***

In [None]:
data1[(data1['LotShape'] == 'IR1') & (data1['LotArea'] >= 50000)]

In [None]:
data1[data1['Age'] > 100]

***Get statistics of the dataframe i.e. mean, median, mode, correlation, variance, standard deviation and more***

In [None]:
data1.mean(numeric_only = True)

In [None]:
data1.median(numeric_only =True)

In [None]:
data1.mode()

In [None]:
data1['MSSubClass'].mode()

In [None]:
data1.std(numeric_only = True)

In [None]:
data1.var(numeric_only = True)

In [None]:
data1.skew(numeric_only = True)

In [None]:
data1.dtypes

***The correlation matrix helps us identify features that are linearly related to the target feature and those that are likely just noise.***

In [None]:
data1.corr(numeric_only = True)
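
***The full correlation matrix is hard to read with this many columns. One possible way to focus on the target is to keep only the SalePrice column of the matrix and sort it by absolute value. This is a sketch on synthetic data: the frame `toy`, the `Noise` column and the coefficients are assumptions made for illustration.***

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
area = rng.uniform(500, 3000, n)

# Synthetic data: the price depends on the area, the Noise column is unrelated.
toy = pd.DataFrame({
    "GrLivArea": area,
    "Noise": rng.normal(size=n),
    "SalePrice": 100 * area + rng.normal(scale=10_000, size=n),
})

# Absolute Pearson correlation of every numeric feature with the target.
target_corr = (
    toy.corr(numeric_only=True)["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

***Pearson correlation only captures linear relationships, so a low value here does not prove that a feature is useless.***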

***Sort the dataframe in ascending or descending order based on a particular column.***

In [None]:
low_to_high_price = data1.sort_values('SalePrice', ascending = True)

In [None]:
low_to_high_price.head(15)

***Group the data by specific columns and compute summary statistics for each group.***

In [None]:
data1.groupby(['LotShape']).mean(numeric_only = True)#median/sum

In [None]:
data1.groupby(['LotShape', 'Street']).mean(numeric_only = True)#median/sum

***Get general info of the dataframe i.e. data types, non null values, statistical measures and more with just a couple lines of code!***

In [None]:
data1.info()

In [None]:
data1.describe()

In [None]:
data1.isnull().sum()#to_dict()

***Separating dataframe into numerical and categorical columns for Visualization***

In [None]:
num = data1.select_dtypes(include = 'number')
cat = data1.select_dtypes(include = 'object')

***Find the outliers of numerical features using a boxplot. We can see that most of the numerical features contain outliers, so their distributions are likely to be skewed.***

***A boxplot also helps us understand the distribution of a feature by exposing its summary statistics. The bottom edge of the box is the 25th percentile, the middle line is the 50th percentile (the median) and the top edge of the box is the 75th percentile. The whiskers extend to the lowest and highest data points that lie within 1.5 × IQR of the box, and points beyond the whiskers are drawn individually as outliers.***

In [None]:
num.boxplot(column = 'GarageYrBlt')

In [None]:
for i in num:
    num.boxplot(column = i, patch_artist = True, notch = True)
    plt.ylabel(i)
    plt.show()
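
***To make the parts of a boxplot concrete, the numbers it draws can be computed by hand. The helper `box_stats` below is a hypothetical name introduced for this sketch, and it assumes the usual 1.5 × IQR whisker rule with linearly interpolated quartiles.***

```python
import pandas as pd

def box_stats(values: pd.Series) -> dict:
    """Return the numbers a standard (1.5 x IQR) boxplot draws."""
    values = values.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": inside.min(),     # lowest point inside the fences
        "whisker_high": inside.max(),    # highest point inside the fences
        "outliers": values[(values < low_fence) | (values > high_fence)].tolist(),
    }

stats = box_stats(pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
print(stats)
```

***In this toy series the upper whisker stops at 9 and the value 100 is reported as an outlier, because it lies beyond the upper fence.***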

***From the boxplots, we can see there are a lot of outliers present in the dataset. To remove them we will use the IQR method: it sets a lower limit of Q1 - 1.5 × IQR and an upper limit of Q3 + 1.5 × IQR (where Q1 and Q3 are the 25th and 75th percentiles and IQR = Q3 - Q1) and removes the values outside this range.***

In [None]:
data1.shape

***Calculating the 25th and 75th percentile values for a column.***

In [None]:
percentile25 = data1['MSSubClass'].quantile(0.25)
percentile75 = data1['MSSubClass'].quantile(0.75)

***Calculating the IQR value.***

In [None]:
iqr=percentile75 - percentile25

***Calculating the upper limit and the lower limit in order to define a range of good data points.***

In [None]:
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr

In [None]:
upper_limit

In [None]:
lower_limit

***Data points that lie outside the IQR range.***

In [None]:
upper_data_points = data1[data1['MSSubClass'] > upper_limit]
lower_data_points = data1[data1['MSSubClass'] < lower_limit]

In [None]:
upper_data_points.shape

In [None]:
lower_data_points.shape

***Removing the outliers(the data points that lie outside the IQR range).***

In [None]:
data1 = data1[(data1['MSSubClass'] <= upper_limit) & (data1['MSSubClass'] >= lower_limit)]

In [None]:
data1.shape
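
***The same steps can be wrapped in a small function so that they can be applied to any column. This is only a sketch: `remove_iqr_outliers` is a hypothetical helper name, the multiplier `k = 1.5` is the conventional default, and the toy values are made up.***

```python
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends; rows with NaN in `column` are dropped too.
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

toy = pd.DataFrame({"LotArea": [8000, 8500, 9000, 9500, 10000, 215000]})
cleaned = remove_iqr_outliers(toy, "LotArea")
print(len(toy), "->", len(cleaned))
```

***Applying such a filter to many columns in sequence can discard a large share of the rows, so it is worth checking the shape after every step.***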

***Check the distribution of numerical features using histplot with a KDE overlay. This helps in checking the skewness of numerical features. As seen above in the boxplots, most of the numerical features have outliers and hence the distribution is skewed, thus validating our previous insight.***

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(data['MoSold'], kde = True)

In [None]:
sns.set_style('whitegrid')
for j in num:
    sns.histplot(data1[j], kde = True, color = 'red')
    plt.show()

***Scatter plot to check the correlation of input feature to the target feature. This also helps in uncovering useful and actionable insights from the data.***

***One can also get the outliers from the scatterplots.***

In [None]:
sns.scatterplot(x = data1['GarageArea'], y = data1['SalePrice'], palette='pastel')

***Houses sold in 2008 fetched the lowest prices. This may be attributable to the 2008 financial crisis (the collapse of Lehman Brothers).***

In [None]:
sns.scatterplot(x = data1['YrSold'], y = data1['SalePrice'], palette='pastel')

In [None]:
for i in num:
  sns.scatterplot(x = num[i], y = num['SalePrice'], palette='pastel')
  plt.show()

***Countplot to get the frequency of various classes of categorical features. This also helps us in detecting the minority classes.***

In [None]:
sns.countplot(x = data1['MSZoning'])

In [None]:
for i in cat:
  sns.countplot(x = data1[i], palette = "Spectral")
  plt.show()

***Properties with LotShape IR1 are the most expensive, followed by IR3; properties with Reg LotShape are the cheapest.***

In [None]:
sns.barplot(x = data1['LotShape'], y = data1['SalePrice'], ci = 0)

***1Fam (single-family detached) properties are the most expensive along with townhouses, whereas 2FmCon (two-family conversion) properties are the least expensive.***

In [None]:
sns.barplot(x = data1['BldgType'], y = data1['SalePrice'], ci = 0)

In [None]:
for i in cat:
  sns.barplot(x = data1[i], y = data1['SalePrice'], ci = 0)
  plt.show()

***Display and remove the duplicate rows in the Dataframe. Duplicate rows add training time without adding information, and when copies of the same row land in both the train and the test set they make the evaluation look better than it really is.***

In [None]:
data1[data1.duplicated()]

In [None]:
data1.drop_duplicates(keep = 'first', inplace = True)

In [None]:
data1[data1.duplicated()]

***Data Preprocessing Stage. In this phase, we will be performing the following steps:-***

***1.   Split the data into train and test. This is necessary in order to check how well the model performs on unseen data before shipping it into production. Any preprocessing that learns from the data (imputation values, encoder categories, scaling parameters) must be fitted after the split, on the train set only, so that no information from the test set leaks into training.***

***2.   Dealing with Missing values. This is another critical aspect of data preprocessing as Machine Learning models cannot deal with Missing values. So we have to either remove them or impute them.***

***3.   Remove the outliers/nonsensical values observed in the EDA phase.***

***4.   Encode the categorical features, as most Machine Learning models require numeric input.***

***5.   Scaling the features. This is necessary as we will have features of different scales (think outliers!) and units. So standardizing/normalizing them will help the model learn better.***

In [None]:
x = data1.drop(['SalePrice'], axis = 1)
y = data1['SalePrice']

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state = 69)

***Separating numerical and categorical features from train and test set as both will require separate treatment.***

In [None]:
train_num = train_x.select_dtypes(include = 'number')
train_cat = train_x.select_dtypes(include = 'object')

test_num = test_x.select_dtypes(include = 'number')
test_cat = test_x.select_dtypes(include = 'object')

***Missing value imputation. We could also drop the rows with missing values, but that would lead to data loss. So we will impute them using the mean, median or mode. For numerical features we can use the mean or median, and for categorical features we will use the mode. The imputation values are computed on the train set and reused for the test set.***

In [None]:
print('Missing values before imputation \n', train_cat.isnull().sum())
train_cat.fillna(train_cat.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', train_cat.isnull().sum())

In [None]:
print('Missing values before imputation \n', train_num.isnull().sum())
train_num.fillna(train_num.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', train_num.isnull().sum())

In [None]:
print('Missing values before imputation \n', test_cat.isnull().sum())
test_cat.fillna(train_cat.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', test_cat.isnull().sum())

In [None]:
print('Missing values before imputation \n', test_num.isnull().sum())
test_num.fillna(train_num.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', test_num.isnull().sum())
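
***The same fit-on-train, apply-to-test pattern can also be written with scikit-learn's `SimpleImputer`, swapped in here for the `fillna` calls above. This is a minimal sketch with made-up values: the frames `train` and `test` are illustrative stand-ins for the real splits.***

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0],
                      "Alley": ["Grvl", np.nan, "Grvl", "Pave"]})
test = pd.DataFrame({"LotFrontage": [np.nan, 500.0],
                     "Alley": [np.nan, "Pave"]})

num_imputer = SimpleImputer(strategy="median")
cat_imputer = SimpleImputer(strategy="most_frequent")

# The statistics are learned from the training rows only ...
num_imputer.fit(train[["LotFrontage"]])
cat_imputer.fit(train[["Alley"]])

# ... and then reused unchanged on the test rows.
test_filled = test.copy()
test_filled["LotFrontage"] = num_imputer.transform(test[["LotFrontage"]]).ravel()
test_filled["Alley"] = cat_imputer.transform(test[["Alley"]]).ravel()
print(test_filled)
```

***The missing test value is filled with the train median, so the extreme test value of 500 has no influence on the imputation.***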

***Encoding the categorical features. Machine Learning models cannot work with characters/text directly. They operate on numbers.***

***Hence we first need to convert the characters and text into numbers, using the concept of Categorical Encoding, and then proceed.***

***There are many encoders we can use. We are going to use OneHotEncoder.***

***OneHotEncoder encodes each category into a new column with values either 0 or 1.***

In [None]:
from sklearn.preprocessing import OneHotEncoder

# sparse_output requires scikit-learn >= 1.2; older versions use sparse = False instead
encoder = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')
encoder.fit(train_cat)
train_cat = pd.DataFrame(encoder.transform(train_cat), columns = encoder.get_feature_names_out())
test_cat = pd.DataFrame(encoder.transform(test_cat), columns = encoder.get_feature_names_out())

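***After encoding the categorical values, we can now concat the numerical and categorical features together into one single dataframe for both the train and test sets.***

***But before that we need to reset the index of the train and test sets. The numerical dataframes still carry the original Id labels while the encoded dataframes start from 0, and concat aligns rows on the index, so mismatched labels would produce extra rows filled with null values.***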

In [None]:
train_num.reset_index(inplace = True, drop = True)
train_cat.reset_index(inplace = True, drop = True)
test_num.reset_index(inplace = True, drop = True)
test_cat.reset_index(inplace = True, drop = True)

***Concatenating the numerical and categorical dataframes together.***

In [None]:
train_x1 = pd.concat([train_num, train_cat], axis = 1)
test_x1 = pd.concat([test_num, test_cat], axis = 1) 
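
***Why the index reset matters can be reproduced in a few lines. The sketch below uses two made-up frames (`num_part` and `cat_part` are illustrative names) to compare a concat on mismatched indexes with a concat after `reset_index`.***

```python
import pandas as pd

# The numeric part keeps the original (shuffled) Id labels after the split ...
num_part = pd.DataFrame({"LotArea": [8450, 9600]}, index=[17, 42])
# ... while a frame built from the encoder output gets a fresh 0..n-1 index.
cat_part = pd.DataFrame({"Street_Pave": [1.0, 0.0]})

misaligned = pd.concat([num_part, cat_part], axis=1)
aligned = pd.concat([num_part.reset_index(drop=True), cat_part], axis=1)

print(misaligned.shape, int(misaligned.isnull().sum().sum()))
print(aligned.shape, int(aligned.isnull().sum().sum()))
```

***Without the reset the result has twice as many rows, half of each column being null. With the reset the rows line up one to one.***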

***Feature Scaling. Putting the values in the same range/scale so that no feature dominates the others purely because of its magnitude.***

***This is needed as the dataset might contain features with different units and ranges. So the data must be normalized/scaled before proceeding further in order to avoid incorrect results.***

***StandardScaler will transform the data such that the transformed data will have mean 0 and standard deviation 1.*** 

***This scaler works best when the features are roughly normally distributed.***

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# scaler = StandardScaler()
# scaler.fit(train_x1)
# train_x1 = pd.DataFrame(scaler.transform(train_x1), columns = train_x1.columns)
# test_x1 = pd.DataFrame(scaler.transform(test_x1), columns = test_x1.columns)

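***MinMaxScaler rescales each feature so that its values in the train set fall within the range [0, 1]. Values in the test set can fall outside this range.***

***This scaler preserves the shape of each feature's distribution, but because it relies on the minimum and maximum it is sensitive to outliers.***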

In [None]:
scaler = MinMaxScaler()
scaler.fit(train_x1)
train_x1 = pd.DataFrame(scaler.transform(train_x1), columns = train_x1.columns)
test_x1 = pd.DataFrame(scaler.transform(test_x1), columns = test_x1.columns)

***RobustScaler is used when your data contains a lot of outliers.***

***When your dataset is heavily skewed, it is advised to use this scaler.***

In [None]:
# scaler = RobustScaler()
# scaler.fit(train_x1)
# train_x1 = pd.DataFrame(scaler.transform(train_x1), columns = train_x1.columns)
# test_x1 = pd.DataFrame(scaler.transform(test_x1), columns = test_x1.columns)
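
***The three scalers can be compared side by side on a single feature that contains one extreme value. This is a sketch with made-up numbers: the arrays `train` and `test` are illustrative and not taken from the dataset.***

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme value (illustrative numbers).
train = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
test = np.array([[200.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaler.fit(train)                       # statistics come from the training data only
    scaled[name] = (scaler.transform(train).ravel(), scaler.transform(test).ravel())
    print(name, np.round(scaled[name][0], 2), np.round(scaled[name][1], 2))
```

***With MinMaxScaler the four ordinary values are squeezed close to 0 by the outlier and the test value lands above 1. RobustScaler centres on the median and divides by the IQR, so the ordinary values keep a usable spread.***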

***Now we are ready to train our Machine Learning models! We will be training 6 different models (linear, tree, forest, distance, boosting and kernel based models) in order to see which performs the best.***

In [None]:
!pip install xgboost
import xgboost as xgb

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

model1 = LinearRegression()
model2 = DecisionTreeRegressor(random_state = 69)
model3 = RandomForestRegressor(random_state = 69)
model4 = KNeighborsRegressor()
model5 = xgb.XGBRegressor()
model6 = SVR()

***Fitting the Linear Regression model.***

In [None]:
model1.fit(train_x1, train_y)

***Predicting on the test set using the Linear Regression model.***

In [None]:
pred1 = model1.predict(test_x1)

In [None]:
pred1

***Calculating the error for the Linear Regression model.***

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mae1 = mean_absolute_error(test_y, pred1)

In [None]:
mae1
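
***MAE is the only metric computed in this notebook, although `mean_squared_error` and `r2_score` are imported as well. The sketch below shows how the three relate on a handful of made-up prices (the arrays `actual` and `predicted` are illustrative).***

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

actual = np.array([200_000, 150_000, 320_000, 180_000])
predicted = np.array([210_000, 140_000, 300_000, 185_000])

mae = mean_absolute_error(actual, predicted)             # mean of |error|
rmse = np.sqrt(mean_squared_error(actual, predicted))    # penalises large errors more
r2 = r2_score(actual, predicted)                         # 1.0 is a perfect fit

print(f"MAE={mae:.0f}  RMSE={rmse:.0f}  R2={r2:.3f}")
```

***MAE and RMSE are expressed in the same unit as SalePrice, which makes them easy to interpret. RMSE is never smaller than MAE, and a large gap between the two points to a few very large errors.***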

***DecisionTreeRegressor***

In [None]:
model2.fit(train_x1, train_y)
pred2 = model2.predict(test_x1)

mae2 = mean_absolute_error(test_y, pred2)

mae2

***RandomForestRegressor***

In [None]:
model3.fit(train_x1, train_y)
pred3 = model3.predict(test_x1)

mae3 = mean_absolute_error(test_y, pred3)

mae3

***KNNRegressor***

In [None]:
model4.fit(train_x1, train_y)
pred4 = model4.predict(test_x1)

mae4 = mean_absolute_error(test_y, pred4)

mae4

***XGBRegressor***

In [None]:
model5.fit(train_x1, train_y)
pred5 = model5.predict(test_x1)

mae5 = mean_absolute_error(test_y, pred5)

mae5

***SVMRegressor***

In [None]:
model6.fit(train_x1, train_y)
pred6 = model6.predict(test_x1)

mae6 = mean_absolute_error(test_y, pred6)

mae6
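
***The fit, predict and score steps repeated above can also be written as one loop over a dictionary of models. This is a sketch under stated assumptions: it runs on synthetic data from `make_regression` instead of `train_x1` and `test_x1`, and it leaves out XGBoost and SVR to keep the example light on dependencies.***

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for train_x1 / train_y / test_x1 / test_y.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "KNN": KNeighborsRegressor(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

# Lowest MAE first.
for name, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:<18}{mae:.2f}")
```

***The ranking on this synthetic, purely linear data says nothing about the ranking on the house price data. The point of the sketch is only the loop structure.***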

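***The base model scores (mean absolute error on the test set, lower is better) are 2030517400638875.8 for Linear Regression, 28315.71 for Decision Tree Regressor, 18916.22 for Random Forest Regressor, 26654.95 for KNN Regressor, 19490.79 for XGBRegressor and 54643.78 for SVMRegressor. The enormous Linear Regression error is most likely a symptom of numerically unstable coefficients on the collinear one-hot encoded columns rather than a meaningful comparison point. Now we will implement Hyperparameter Tuning using GridSearchCV in order to improve the scores!***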

***Hyperparameter Tuning using GridSearchCV.***

In [None]:
DTR_params = [{'max_depth': [3, 6, 9, 12], 'max_features': [3, 6, 9]}]

dtr = GridSearchCV(model2, DTR_params, cv = 5, scoring='neg_mean_absolute_error')
dtr.fit(train_x1, train_y)

print(dtr.best_params_)
print(-(dtr.best_score_))

In [None]:
RFR_params = [{'n_estimators': [5, 10, 15, 20], 'max_depth': [3, 6, 9, 12]}]

rfr = GridSearchCV(model3, RFR_params, cv = 5, scoring='neg_mean_absolute_error')
rfr.fit(train_x1, train_y)

print(rfr.best_params_)
print(-(rfr.best_score_))

In [None]:
KNN_params = [{'n_neighbors': [3, 6, 9, 12], 'weights': ['uniform', 'distance']}]

knn = GridSearchCV(model4, KNN_params, cv = 5, scoring='neg_mean_absolute_error')
knn.fit(train_x1, train_y)

print(knn.best_params_)
print(-(knn.best_score_))

In [None]:
xgb_params = [{'eta': [0.1, 0.2, 0.3], 'max_depth': [3, 6, 9]}]

gb = GridSearchCV(model5, xgb_params, cv = 5, scoring='neg_mean_absolute_error')
gb.fit(train_x1, train_y)

print(gb.best_params_)
print(-(gb.best_score_))

In [None]:
svm_params = [{'kernel': ['linear', 'rbf', 'poly'], 'C': [0.1, 0.2, 0.3]}]

svm = GridSearchCV(model6, svm_params, cv = 5, scoring='neg_mean_absolute_error')
svm.fit(train_x1, train_y)

print(svm.best_params_)
print(-(svm.best_score_))
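
***The scores printed above are cross-validation scores computed on the train set, so they are not directly comparable with the base model scores measured on the test set. One way to close that gap is to evaluate `best_estimator_` on the test set. The sketch below assumes synthetic data from `make_regression` as a stand-in for the real splits.***

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [3, 6, 9]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

cv_mae = -search.best_score_           # averaged over the 5 validation folds
best_model = search.best_estimator_    # refitted on all of X_train (refit=True is the default)
test_mae = mean_absolute_error(y_test, best_model.predict(X_test))

print(search.best_params_, round(cv_mae, 2), round(test_mae, 2))
```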

***Now we have trained our models, evaluated them on the test set and tuned their hyperparameters with cross-validation. How can we further improve the results?***

***One way to achieve this by removing noisy/unneccessary features from the set and keeping only the relevent features. This process is known as Feature Selection. There are many ways through which we can do this.***

***One way is to manually look and find features which are noise features. Although widely used, it required manual effort and domain knowledge which not everyone has.***

***Another approach is an automated appproach. This approach is preferred and widely used. There are many techniques which can be used to find the important features.***

***Feature Selection (Removing high NaN value features)***

***Remove features with a high percentage of NaN values, as they do not contain enough information for the model to learn from.***

In [None]:
nan_feat = [cname for cname in data2.columns if data2[cname].isnull().sum() >= 1]

# Fraction of NaN values per feature (0.25 means 25% missing)
for i in nan_feat:
    print(i, np.round(data2[i].isnull().mean(), 2))
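
***A minimal sketch of how the printed fractions could be turned into a drop list. The toy frame `demo` and the 50% cut-off `nan_threshold` are assumptions chosen for illustration; a suitable threshold depends on the data.***

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data2 (assumption); column names echo the dataset.
demo = pd.DataFrame({
    'PoolQC': [np.nan, np.nan, np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 68.0, 60.0, 84.0],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})

nan_threshold = 0.5  # hypothetical cut-off: drop columns with more than 50% NaN

# Fraction of missing values per column.
nan_share = demo.isnull().mean()

# Columns whose missing fraction exceeds the cut-off.
high_nan_cols = nan_share[nan_share > nan_threshold].index.tolist()
demo_reduced = demo.drop(columns=high_nan_cols)

print(high_nan_cols)
```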

***Feature Selection (Variance Threshold)***

***In this method, we remove features with low variance, as they contribute little to the prediction and only increase the dimensionality of the data, which increases execution time.***

In [None]:
cat_data = data2.select_dtypes(include = 'object')
pd.crosstab(cat_data['Street'], columns = 'counts', normalize = True)

In [None]:
list1 = []
for i in cat_data.columns:
    list1.append((i, pd.crosstab(cat_data[i], columns = 'counts', normalize = True)))

In [None]:
list1
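
***Reading every crosstab by eye is slow. As a sketch, the share of the most frequent category can be computed per column and compared against a cut-off. The toy frame `demo_cat` and the 95% value of `dominance_threshold` are assumptions for illustration.***

```python
import pandas as pd

# Toy categorical frame standing in for cat_data (assumption).
demo_cat = pd.DataFrame({
    'Street': ['Pave'] * 49 + ['Grvl'],
    'LotShape': ['Reg'] * 30 + ['IR1'] * 15 + ['IR2'] * 5,
})

dominance_threshold = 0.95  # hypothetical cut-off for "almost constant"

# value_counts sorts by frequency, so the first entry is the dominant category's share.
top_share = demo_cat.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Columns where a single category covers at least the cut-off share of rows.
near_constant = top_share[top_share >= dominance_threshold].index.tolist()

print(top_share)
print(near_constant)
```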

***Feature Selection (Correlation based)***

***Check the linear correlation of the features to detect important ones.***

In [None]:
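# Correlation is only defined for numerical columns, so select them explicitly
# (recent pandas versions raise an error if object columns are passed to corr()).
corr = data2.select_dtypes(include = 'number').corr()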
corr.style.background_gradient(cmap='coolwarm')
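
***The full matrix is large, so it can help to look only at each feature's correlation with the target. The sketch below does this on synthetic data. The frame `demo` and its `RandomNoise` column are assumptions for illustration.***

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(69)
n = 200
area = rng.normal(1500, 300, n)

# Synthetic stand-in for data2 (assumption): one informative and one noise column.
demo = pd.DataFrame({
    'GrLivArea': area,
    'RandomNoise': rng.normal(0, 1, n),
    'SalePrice': 100 * area + rng.normal(0, 5000, n),
})

# Absolute linear correlation of every numerical feature with the target, strongest first.
target_corr = (demo.select_dtypes(include='number').corr()['SalePrice']
               .drop('SalePrice')
               .abs()
               .sort_values(ascending=False))

print(target_corr)
```

***Note that this only captures linear relationships; a feature with a strong non-linear effect can still show a low correlation.***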

***Select the best features with a univariate scoring function such as chi2 (for non-negative features and a categorical target), f_regression or mutual information.***

***We only need to specify the scoring function and the number of top features to keep.***

In [None]:
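from sklearn.feature_selection import SelectKBest, f_regression

# chi2 needs non-negative features and a categorical target, so it does not suit
# scaled inputs and a continuous SalePrice; f_regression scores linear dependence instead.
selector = SelectKBest(score_func = f_regression, k = 10)
selector.fit(train_x1, train_y)
train_x1.columns[selector.get_support()]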
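
***Mutual information is mentioned above as an alternative scoring function. A minimal sketch with `mutual_info_regression` follows, on synthetic data where one feature affects the target non-linearly. The frame `demo_x`, the target `demo_y` and the column names are assumptions for illustration.***

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(69)
n = 500

# Synthetic features (assumption): two informative columns and two pure-noise columns.
demo_x = pd.DataFrame({
    'signal_a': rng.normal(size=n),
    'signal_b': rng.normal(size=n),
    'noise_a': rng.normal(size=n),
    'noise_b': rng.normal(size=n),
})
# signal_b enters quadratically, so its linear correlation with the target is close to zero.
demo_y = demo_x['signal_a'] + 2 * demo_x['signal_b'] ** 2 + rng.normal(scale=0.1, size=n)

# Fix the random_state so the nearest-neighbour based estimate is reproducible.
score_func = partial(mutual_info_regression, random_state=69)

mi_selector = SelectKBest(score_func=score_func, k=2)
mi_selector.fit(demo_x, demo_y)
selected = list(demo_x.columns[mi_selector.get_support()])

print(selected)
```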

***Feature Importance using Random Forest model***

In [None]:
rf_model = RandomForestRegressor(random_state=69)

rf_model.fit(train_x1, train_y)

In [None]:
feature_scores = pd.Series(rf_model.feature_importances_, index=train_x1.columns).sort_values(ascending=False)

In [None]:
feature_scores[:10]

In [None]:
feature_scores.index[:10]

In [None]:
# Creating a seaborn bar plot

f, ax = plt.subplots(figsize=(30, 24))
ax = sns.barplot(x=feature_scores[:10], y=feature_scores.index[:10])
ax.set_title("Visualize feature scores of the features")
ax.set_yticklabels(feature_scores.index[:10])
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()
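
***Instead of always taking a fixed top 10, the importances can be compared against a cut-off. The sketch below keeps features whose importance is at least the mean importance. The synthetic data, the `demo_` names and the choice of the mean as cut-off are assumptions for illustration.***

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(69)
n = 300

# Synthetic features (assumption): one strong, one weak and three noise columns.
demo_x = pd.DataFrame(rng.normal(size=(n, 5)),
                      columns=['strong', 'weak', 'noise_1', 'noise_2', 'noise_3'])
demo_y = 10 * demo_x['strong'] + 2 * demo_x['weak'] + rng.normal(scale=0.5, size=n)

demo_rf = RandomForestRegressor(n_estimators=50, random_state=69)
demo_rf.fit(demo_x, demo_y)

# Impurity-based importances sum to 1 across all features.
importances = pd.Series(demo_rf.feature_importances_, index=demo_x.columns)

# Keep features whose importance is at least the average importance.
kept = importances[importances >= importances.mean()].index.tolist()

print(importances.sort_values(ascending=False))
print(kept)
```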

***Dropping the unwanted features***

In [None]:
low_var_list = ['Alley', 'YrSold', 'PoolQC', 'MiscFeature', 'MiscVal', 'GarageYrBlt', 'YearBuilt', 'MoSold', 
            '1stFlrSF', '2ndFlrSF', 'LotArea', 'YearRemodAdd', 'Street', 'Utilities', 'LandSlope', 
            'Condition2', 'RoofMatl', 'Heating', 'GarageCond']
data2.drop(low_var_list, axis = 1, inplace = True)

***Now we repeat the steps performed above on the reduced feature set: splitting into train and test sets, imputing NaN values, encoding, scaling and then training.***

In [None]:
x1 = data2.drop(['SalePrice'], axis = 1)
y1 = data2['SalePrice']

In [None]:
train_x4, test_x4, train_y4, test_y4 = train_test_split(x1, y1, random_state = 69)

In [None]:
train_num1 = train_x4.select_dtypes(include = 'number')
train_cat1 = train_x4.select_dtypes(include = 'object')

test_num1 = test_x4.select_dtypes(include = 'number')
test_cat1 = test_x4.select_dtypes(include = 'object')

In [None]:
print('Missing values before imputation \n', train_num1.isnull().sum())
train_num1.fillna(train_num1.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', train_num1.isnull().sum())

In [None]:
print('Missing values before imputation \n', train_cat1.isnull().sum())
train_cat1.fillna(train_cat1.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', train_cat1.isnull().sum())

In [None]:
print('Missing values before imputation \n', test_cat1.isnull().sum())
test_cat1.fillna(train_cat1.mode().loc[0], inplace = True)
print('\n')
print('Missing values after imputation \n', test_cat1.isnull().sum())


In [None]:
print('Missing values before imputation \n', test_num1.isnull().sum())
test_num1.fillna(train_num1.median(), inplace = True)
print('\n')
print('Missing values after imputation \n', test_num1.isnull().sum())

In [None]:
train_num1.reset_index(inplace = True, drop = True)
train_cat1.reset_index(inplace = True, drop = True)
test_num1.reset_index(inplace = True, drop = True)
test_cat1.reset_index(inplace = True, drop = True)

In [None]:
train_x5 = pd.concat([train_num1, train_cat1], axis = 1)
test_x5 = pd.concat([test_num1, test_cat1], axis = 1) 

In [None]:
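# One-hot encode only the categorical columns; numerical columns are kept as they are.
cat_cols = train_cat1.columns
encoder1 = OneHotEncoder(sparse_output = False, handle_unknown = 'ignore')  # use sparse = False on scikit-learn < 1.2
encoder1.fit(train_x5[cat_cols])
train_enc = pd.DataFrame(encoder1.transform(train_x5[cat_cols]), columns = encoder1.get_feature_names_out())
test_enc = pd.DataFrame(encoder1.transform(test_x5[cat_cols]), columns = encoder1.get_feature_names_out())
train_x5 = pd.concat([train_x5.drop(columns = cat_cols), train_enc], axis = 1)
test_x5 = pd.concat([test_x5.drop(columns = cat_cols), test_enc], axis = 1)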

In [None]:
scaler1 = StandardScaler()
scaler1.fit(train_x5)
train_x5 = pd.DataFrame(scaler1.transform(train_x5), columns = train_x5.columns)
test_x5 = pd.DataFrame(scaler1.transform(test_x5), columns = test_x5.columns)
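
***The imputation, encoding and scaling steps above can also be bundled so that every statistic is learned from the training set only and then reused on the test set. The sketch below does this with `Pipeline` and `ColumnTransformer` on a toy frame. The `demo_` data and step names are assumptions, and unlike the cells above it scales only the numerical columns.***

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/test frames (assumption); 'Dirt' never appears in training.
demo_train = pd.DataFrame({
    'GrLivArea': [1710.0, 1262.0, np.nan, 1717.0],
    'Street': ['Pave', 'Pave', np.nan, 'Grvl'],
})
demo_test = pd.DataFrame({
    'GrLivArea': [np.nan, 2198.0],
    'Street': ['Pave', 'Dirt'],
})

num_cols = ['GrLivArea']
cat_cols = ['Street']

# Numerical columns: median imputation, then standardisation.
num_steps = Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())])

# Categorical columns: mode imputation, then one-hot encoding.
# handle_unknown='ignore' encodes unseen categories as an all-zero row.
cat_steps = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('encode', OneHotEncoder(handle_unknown='ignore'))])

# sparse_threshold=0 forces a dense array as output.
preprocess = ColumnTransformer([('num', num_steps, num_cols),
                                ('cat', cat_steps, cat_cols)],
                               sparse_threshold=0)

train_ready = preprocess.fit_transform(demo_train)  # statistics learned here
test_ready = preprocess.transform(demo_test)        # and only applied here

print(train_ready.shape, test_ready.shape)
```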

In [None]:
model7 = LinearRegression()
model8 = DecisionTreeRegressor(random_state = 69)
model9 = RandomForestRegressor(random_state = 69)
model10 = KNeighborsRegressor()
model11 = xgb.XGBRegressor()

***Linear Regression***

In [None]:
model7.fit(train_x5, train_y4)
pred7 = model7.predict(test_x5)

In [None]:
mae7 = mean_absolute_error(test_y4, pred7)

In [None]:
mae7

***Decision Tree Regressor***

In [None]:
model8.fit(train_x5, train_y4)
pred8 = model8.predict(test_x5)

mae8 = mean_absolute_error(test_y4, pred8)

mae8

***Random Forest Regressor***

In [None]:
model9.fit(train_x5, train_y4)
pred9 = model9.predict(test_x5)

mae9 = mean_absolute_error(test_y4, pred9)

mae9

***KNN Regressor***

In [None]:
model10.fit(train_x5, train_y4)
pred10 = model10.predict(test_x5)

mae10 = mean_absolute_error(test_y4, pred10)

mae10

***XGBRegressor***

In [None]:
model11.fit(train_x5, train_y4)
pred11 = model11.predict(test_x5)

mae11 = mean_absolute_error(test_y4, pred11)

mae11