## Project Name: House Prices: Advanced Regression Techniques

##### Problem Statement

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement
ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that
much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, 
this competition challenges you to predict the final price of each home.


**The aim of this project is to predict the sale price of each house from the features explored in the sections below.**

#### The dataset can be downloaded from the link below
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

In [1]:
## Importing libraries
import pandas as pd ## data loading and preprocessing
import numpy as np  ## mathematical calculations
import matplotlib.pyplot as plt 
import seaborn as sns

# pd.pandas.set_option('display.max_columns' , None)

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
df_train = pd.concat([train, test])
df_train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500.0
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500.0
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500.0
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000.0
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000.0


In [3]:
df_train.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2006,WD,Normal,
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2006,WD,Abnorml,
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,9,2006,WD,Abnorml,
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,Shed,700,7,2006,WD,Normal,
1458,2919,60,RL,74.0,9627,Pave,,Reg,Lvl,AllPub,...,0,,,,0,11,2006,WD,Normal,


In [4]:
df_train.shape

(2919, 81)

In [5]:
train.shape

(1460, 81)

In [6]:
test.shape

(1459, 80)

## EDA and Feature Engineering

In [7]:
duplicate = df_train.duplicated()
print(duplicate)

0       False
1       False
2       False
3       False
4       False
        ...  
1454    False
1455    False
1456    False
1457    False
1458    False
Length: 2919, dtype: bool
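
Printing the full boolean Series makes it hard to tell whether any row is actually duplicated. A minimal sketch, on a small made-up frame (`toy` is hypothetical and not part of the dataset), of counting the duplicated rows instead:

```python
import pandas as pd

# Hypothetical toy frame; the third row repeats the first one.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450],
    "Street": ["Pave", "Pave", "Pave"],
})

# duplicated() flags every repeat of an earlier row; sum() counts the True flags.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates()
print(deduplicated.shape)
```

The same `duplicated().sum()` call on `df_train` would give a single number to check.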


In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 0 to 1458
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallC

In [9]:
df_train.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,2919.0,2919.0,2433.0,2919.0,2919.0,2919.0,2919.0,2919.0,2896.0,2918.0,...,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,1460.0
mean,1460.0,57.137718,69.305795,10168.11408,6.089072,5.564577,1971.312778,1984.264474,102.201312,441.423235,...,93.709832,47.486811,23.098321,2.602261,16.06235,2.251799,50.825968,6.213087,2007.792737,180921.19589
std,842.787043,42.517628,23.344905,7886.996359,1.409947,1.113131,30.291442,20.894344,179.334253,455.610826,...,126.526589,67.575493,64.244246,25.188169,56.184365,35.663946,567.402211,2.714762,1.314964,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,730.5,20.0,59.0,7478.0,5.0,5.0,1953.5,1965.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0,129975.0
50%,1460.0,50.0,68.0,9453.0,6.0,5.0,1973.0,1993.0,0.0,368.5,...,0.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,2189.5,70.0,80.0,11570.0,7.0,6.0,2001.0,2004.0,164.0,733.0,...,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,2919.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0,755000.0


### Handling numerical missing values

**1. For continuous features**

In [10]:
# missing_values_continious = [feature for feature in df_train.columns if df_train[feature].dtype != "O" and len(df_train[feature].unique()) >20 and df_train[feature].isnull().sum()>0]
# missing_values_continious

In [11]:
len(df_train["LotFrontage"].unique())

129

In [12]:
missing_values_continious = []
for feature in df_train.columns:
    if df_train[feature].dtype != "object" and len(df_train[feature].unique())>20:
        missing_values_continious.append(feature)
missing_values_continious    

['Id',
 'LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'GrLivArea',
 'GarageYrBlt',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'MiscVal',
 'SalePrice']

In [13]:
df_train.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            4
LotFrontage       486
LotArea             0
                 ... 
MoSold              0
YrSold              0
SaleType            1
SaleCondition       0
SalePrice        1459
Length: 81, dtype: int64

In [14]:
for feature in missing_values_continious:
    print(feature, round(df_train[feature].isnull().mean() , 2) * 100)

Id 0.0
LotFrontage 17.0
LotArea 0.0
YearBuilt 0.0
YearRemodAdd 0.0
MasVnrArea 1.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 0.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 0.0
GarageYrBlt 5.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
MiscVal 0.0
SalePrice 50.0


In [15]:
median_value = df_train["GarageYrBlt"].median()

In [16]:
median_value

1979.0

In [17]:
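for feature in missing_values_continious:
    if feature != "SalePrice":  ## the target is missing only for the test rows
        median_value = df_train[feature].median()
        ## assign back instead of calling fillna(inplace=True) on a column selection
        df_train[feature] = df_train[feature].fillna(median_value)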

In [18]:
for feature in missing_values_continious:
    print(feature, round(df_train[feature].isnull().mean(),4)*100)

Id 0.0
LotFrontage 0.0
LotArea 0.0
YearBuilt 0.0
YearRemodAdd 0.0
MasVnrArea 0.0
BsmtFinSF1 0.0
BsmtFinSF2 0.0
BsmtUnfSF 0.0
TotalBsmtSF 0.0
1stFlrSF 0.0
2ndFlrSF 0.0
LowQualFinSF 0.0
GrLivArea 0.0
GarageYrBlt 0.0
GarageArea 0.0
WoodDeckSF 0.0
OpenPorchSF 0.0
EnclosedPorch 0.0
3SsnPorch 0.0
ScreenPorch 0.0
MiscVal 0.0
SalePrice 49.980000000000004
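
The median fill used for the continuous features can be sketched on toy data as follows. The frame and its values are made up for illustration. Assigning the filled column back, rather than filling a column selection in place, is one way to avoid the chained-assignment warnings of recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one continuous feature with gaps, plus the target.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan],
    "SalePrice": [208500.0, 181500.0, np.nan, 140000.0, np.nan],
})

for feature in ["LotFrontage", "SalePrice"]:
    if feature == "SalePrice":
        continue  # the target is missing only for test rows, so leave it alone
    median_value = toy[feature].median()  # median() skips NaN by default
    toy[feature] = toy[feature].fillna(median_value)

print(toy)
```

The median is less sensitive to a few very large lots or areas than the mean, which is presumably why it is used here.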


In [19]:
df_train.drop("Id" , inplace=True , axis = 1)

**2. For discrete features**

In [20]:
#missing_values_descrete = [feature for feature in df_train.columns if df_train[feature].dtype != "O" and len(df_train[feature].unique()) <20 and df_train[feature].isnull().sum()>0]
#missing_values_descrete

In [21]:
missing_values_descrete = []
for feature in df_train.columns:
    if df_train[feature].dtype != "object" and len(df_train[feature].unique()) <=20:
        missing_values_descrete.append(feature)
len(missing_values_descrete)        

15

In [22]:
for feature in missing_values_descrete:
    print(feature, round(df_train[feature].isnull().mean(),4)*100)

MSSubClass 0.0
OverallQual 0.0
OverallCond 0.0
BsmtFullBath 0.06999999999999999
BsmtHalfBath 0.06999999999999999
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 0.0
Fireplaces 0.0
GarageCars 0.03
PoolArea 0.0
MoSold 0.0
YrSold 0.0


#### Discrete features are filled with the most frequent value, e.g. `df_train["GarageCars"].mode()[0]`

In [23]:
for feature in missing_values_descrete:
    mode_value = df_train[feature].mode()[0]
    ## assign back instead of calling fillna(inplace=True) on a column selection
    df_train[feature] = df_train[feature].fillna(mode_value)

In [24]:
for feature in missing_values_descrete:
    print(feature, round(df_train[feature].isnull().mean(),4)*100)

MSSubClass 0.0
OverallQual 0.0
OverallCond 0.0
BsmtFullBath 0.0
BsmtHalfBath 0.0
FullBath 0.0
HalfBath 0.0
BedroomAbvGr 0.0
KitchenAbvGr 0.0
TotRmsAbvGrd 0.0
Fireplaces 0.0
GarageCars 0.0
PoolArea 0.0
MoSold 0.0
YrSold 0.0
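
A short sketch of why the mode fill indexes with `[0]`: `mode()` returns a Series, because several values can tie for most frequent. The values below are hypothetical.

```python
import numpy as np
import pandas as pd

garage_cars = pd.Series([2.0, 2.0, 1.0, np.nan, 3.0])  # hypothetical values

# mode() returns a Series of the most frequent values; [0] takes the first one.
mode_value = garage_cars.mode()[0]
filled = garage_cars.fillna(mode_value)
print(mode_value, filled.tolist())

# With a tie, mode() lists the tied values in sorted order, so [0] is the smallest.
tied = pd.Series([1.0, 1.0, 2.0, 2.0, np.nan])
print(tied.mode().tolist())
```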


### Handling categorical missing values

In [25]:
#missing_values_c = [feature for feature in df_train.columns if df_train[feature].dtype == "O" and df_train[feature].isnull().sum()>0]
#missing_values_c

In [26]:
missing_values_c = []
for feature in df_train.columns:
    if df_train[feature].dtype == "O" and df_train[feature].isnull().sum()>0:
        missing_values_c.append(feature)
len(missing_values_c)        

23

In [27]:
for feature in missing_values_c:
    print(feature, round(df_train[feature].isnull().mean(),5)*100)

MSZoning 0.13699999999999998
Alley 93.217
Utilities 0.06899999999999999
Exterior1st 0.034
Exterior2nd 0.034
MasVnrType 0.822
BsmtQual 2.775
BsmtCond 2.809
BsmtExposure 2.809
BsmtFinType1 2.706
BsmtFinType2 2.741
Electrical 0.034
KitchenQual 0.034
Functional 0.06899999999999999
FireplaceQu 48.647
GarageType 5.379
GarageFinish 5.447
GarageQual 5.447
GarageCond 5.447
PoolQC 99.657
Fence 80.43900000000001
MiscFeature 96.403
SaleType 0.034


In [28]:
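for feature in missing_values_c:
    mode_value = df_train[feature].mode()[0]
    ## assign back instead of calling fillna(inplace=True) on a column selection
    df_train[feature] = df_train[feature].fillna(mode_value)
## drop the features where a large share of the values is missing
df_train.drop(["Alley" ,"PoolQC", "Fence" , "MiscFeature"  , "FireplaceQu" ] , axis = 1 , inplace = True)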

In [29]:
df_train.isnull().sum()

MSSubClass          0
MSZoning            0
LotFrontage         0
LotArea             0
Street              0
                 ... 
MoSold              0
YrSold              0
SaleType            0
SaleCondition       0
SalePrice        1459
Length: 75, dtype: int64
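
The five dropped columns are the ones with the largest share of missing values. A minimal sketch of selecting such columns by a threshold instead of by name; the 0.4 cut-off and the toy frame are assumptions for illustration, not values taken from this notebook:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Alley is mostly missing, MSZoning has a single gap.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM", np.nan, "RL", "RL"],
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
})

threshold = 0.4  # assumed cut-off on the share of missing values
missing_share = toy.isnull().mean()
too_sparse = missing_share[missing_share > threshold].index.tolist()
toy = toy.drop(columns=too_sparse)

# Fill the remaining categorical gaps with the most frequent label.
for feature in toy.columns:
    if toy[feature].isnull().any():
        toy[feature] = toy[feature].fillna(toy[feature].mode()[0])

print(too_sparse, toy["MSZoning"].tolist())
```

Dropping the sparse columns before imputing also avoids filling columns that are discarded right afterwards.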

In [None]:
df_train.shape

In [None]:
df_train.head()

### Handling year feature

In [None]:
#year = [feature for feature in df_train.columns if "Yr" in feature or "Year" in feature] 
#year

In [None]:
year = []
for feature in df_train.columns:
    if "Yr" in feature or "Year" in feature:
        year.append(feature)
year

In [None]:
for feature in year:
    print(feature, len(df_train[feature].unique()) , df_train[feature].dtype)

In [None]:
df_train["YrSold"].value_counts()

In [None]:
df_train.groupby('YrSold')['SalePrice'].median().plot()   
plt.xlabel('Year Sold')
plt.ylabel('Median House Price')
plt.title("House Price vs YearSold")

In [None]:
for feature in year:
    if feature != "YrSold":  ## keep YrSold intact until every age has been computed
        df_train[feature] = df_train['YrSold']-df_train[feature]
df_train.drop("YrSold", axis = 1 , inplace = True)
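
The year handling replaces each year column by the number of years elapsed at the time of sale. A minimal sketch on made-up rows (the frame `toy` is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1990],
    "YrSold": [2008, 2007],
})

# Replace each year column with "years elapsed at the time of sale".
for feature in ["YearBuilt", "YearRemodAdd"]:
    toy[feature] = toy["YrSold"] - toy[feature]

# YrSold is dropped last, only after every difference has been computed.
toy = toy.drop(columns="YrSold")
print(toy)
```

Subtracting `YrSold` from itself before the other columns are processed would turn every age into a negative year, so the order of operations matters.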

In [None]:
df_train.shape

In [None]:
df_train.head()

### Handling continuous values

In [None]:
#continious = [feature for feature in df_train.columns if len(df_train[feature].unique())>20 and df_train[feature].dtype != "O" and feature not in year]
#continious

In [None]:
continious = []
for feature in df_train.columns:
     if df_train[feature].dtype != "O" and len(df_train[feature].unique())>20  and feature not in year:
            continious.append(feature)
continious         

In [None]:
df_train["LotFrontage"].skew()

In [None]:
## Plot the distribution of each continuous feature with its skewness,
## to judge which features would benefit from a logarithmic transformation
for feature in continious:
    ## histplot replaces the deprecated sns.distplot
    ax = sns.histplot(df_train[feature].dropna(), kde=True)
    ax.set_xlabel(feature)
    ax.set_ylabel('Count')
    ax.set_title("{} (skewness : {:0.3f})".format(feature, df_train[feature].skew()))
    plt.show()

In [None]:
#skewed = [feature for feature in continious if data[feature].skew()<1]
#skewed

In [None]:
skewed = []
for feature in continious:
    if abs(df_train[feature].skew())>1:
        skewed.append(feature)
skewed        

In [None]:
abs(-5)

In [None]:
## Apply log1p to every continuous feature except the target
for feature in continious:
    if feature != "SalePrice":
        df_train[feature] = np.log1p(df_train[feature])
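
A small sketch of what `np.log1p` does to a right-skewed feature. The values are invented to mimic a lot-area column with one very large lot:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed feature: mostly ordinary lots plus one very large one.
lot_area = pd.Series([1300.0, 7500.0, 9500.0, 11500.0, 215000.0])

skew_before = lot_area.skew()
transformed = np.log1p(lot_area)  # log(1 + x), which is also defined at x = 0
skew_after = transformed.skew()
print(round(skew_before, 3), round(skew_after, 3))

# np.expm1 is the inverse of np.log1p and recovers the original scale.
restored = np.expm1(transformed)
```

`log1p` is preferred over a plain `log` here because many area columns contain zeros.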

In [None]:
df_train.shape

In [None]:
# correlation heatmap
plt.figure(figsize=(25,25))
ax = sns.heatmap(df_train[continious].corr(), cmap = "coolwarm", annot=True, linewidth=2)

# to fix the bug "first and last row cut in half of heatmap plot"
# bottom, top = ax.get_ylim()
# ax.set_ylim(bottom + 0.5, top - 0.5)

In [None]:
# features whose absolute correlation with SalePrice is below 0.10
low_corr = df_train[continious].corr()
low_corr_features = low_corr.index[low_corr["SalePrice"].abs() < 0.10]
low_corr_features

In [None]:
df_train.drop(low_corr_features , axis = 1 , inplace = True)
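
A signed threshold such as `corr < 0.10` also flags features with a strong negative correlation, which are informative for a regression. A minimal sketch on made-up data comparing the signed filter with a filter on the absolute value:

```python
import pandas as pd

# Hypothetical data: GrLivArea rises with price, Age falls with it, Noise is unrelated.
toy = pd.DataFrame({
    "GrLivArea": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Age": [5.0, 4.0, 3.0, 2.0, 1.0],
    "Noise": [1.0, -1.0, 0.0, -1.0, 1.0],
    "SalePrice": [10.0, 20.0, 30.0, 40.0, 50.0],
})

corr_with_price = toy.corr()["SalePrice"].drop("SalePrice")

# The signed filter treats the strongly negative Age like an uninformative feature.
signed_low = corr_with_price[corr_with_price < 0.10].index.tolist()

# The absolute filter keeps Age and flags only Noise.
abs_low = corr_with_price[corr_with_price.abs() < 0.10].index.tolist()
print(signed_low, abs_low)
```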

In [None]:
df_train.shape

### Handling categorical variables

In [None]:
#categorical = [feature for feature in df_train.columns if df_train[feature].dtype == "O"]
#len(categorical)

In [None]:
categorical = []
for feature in df_train.columns:
    if df_train[feature].dtype == "object":
        categorical.append(feature)
len(categorical)

In [None]:
for feature in categorical:
    #df_train.groupby(feature)['SalePrice'].median().plot.bar()
    sns.barplot(x = df_train[feature] , y = df_train["SalePrice"])
    plt.xlabel(feature)
    plt.ylabel('SalePrice')
    plt.title(feature)
    plt.show()

**ORDINAL**

In [None]:
from pandas.api.types import CategoricalDtype

In [None]:
df_train['BsmtCond'].unique() ## cardinality of categorical variables

In [None]:
df_train['BsmtCond'].value_counts()

In [None]:
df_train['BsmtCond'] = df_train['BsmtCond'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd'], ordered = True)).cat.codes  ## categories listed from worst to best

In [None]:
df_train['BsmtCond'].value_counts()

In [None]:
df_train['BsmtExposure'] = df_train['BsmtExposure'].astype(CategoricalDtype(categories=['NA', 'No', 'Mn', 'Av', 'Gd'], ordered = True)).cat.codes
df_train['BsmtFinType1'] = df_train['BsmtFinType1'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df_train['BsmtFinType2'] = df_train['BsmtFinType2'].astype(CategoricalDtype(categories=['NA', 'Unf', 'LwQ', 'Rec', 'BLQ','ALQ', 'GLQ'], ordered = True)).cat.codes
df_train['BsmtQual'] = df_train['BsmtQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['ExterQual'] = df_train['ExterQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['ExterCond'] = df_train['ExterCond'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['Functional'] = df_train['Functional'].astype(CategoricalDtype(categories=['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod','Min2','Min1', 'Typ'], ordered = True)).cat.codes
df_train['GarageCond'] = df_train['GarageCond'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['GarageQual'] = df_train['GarageQual'].astype(CategoricalDtype(categories=['NA', 'Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['GarageFinish'] = df_train['GarageFinish'].astype(CategoricalDtype(categories=['NA', 'Unf', 'RFn', 'Fin'], ordered = True)).cat.codes
df_train['HeatingQC'] = df_train['HeatingQC'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['KitchenQual'] = df_train['KitchenQual'].astype(CategoricalDtype(categories=['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered = True)).cat.codes
df_train['PavedDrive'] = df_train['PavedDrive'].astype(CategoricalDtype(categories=['N', 'P', 'Y'], ordered = True)).cat.codes
df_train['Utilities'] = df_train['Utilities'].astype(CategoricalDtype(categories=['ELO', 'NoSeWa', 'NoSewr', 'AllPub'], ordered = True)).cat.codes

In [None]:
ordinal = ["BsmtCond" , "BsmtExposure" , "BsmtFinType1" , "BsmtFinType2" , "BsmtQual" , "ExterQual" , "ExterCond" , "Functional",
          "GarageCond" , "GarageQual" , "GarageFinish" , "HeatingQC" , "KitchenQual" , "PavedDrive" , "Utilities"]

In [None]:
len(ordinal)
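
A minimal sketch of how an ordered `CategoricalDtype` turns labels into integer codes. The codes follow the order of `categories`, so the list has to run from worst to best. Labels that are absent from `categories` are silently encoded as -1, which is worth checking after each conversion. The Series below is made up and contains one deliberately misspelled label.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical quality column; "Excellent" is a deliberate mismatch for "Ex".
quality = pd.Series(["TA", "Gd", "Fa", "Po", "Excellent"])

quality_dtype = CategoricalDtype(categories=["Po", "Fa", "TA", "Gd", "Ex"], ordered=True)
codes = quality.astype(quality_dtype).cat.codes
print(codes.tolist())

# Any label missing from the category list shows up with code -1.
unmatched = quality[codes == -1].tolist()
print(unmatched)
```

Running the same `codes == -1` check on each encoded column of `df_train` would reveal category lists that do not match the data.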

In [None]:
df_train.shape

In [None]:
df_train.head()

**Nominal**

* **One hot encoding**

In [None]:
#nominal = [feature for feature in categorical if feature not in ordinal ]

In [None]:
nominal = []
for feature in categorical:
    if feature not in ordinal:
        nominal.append(feature)
len(nominal)

In [None]:
#nominal = [feature for feature in categorical if feature not in ordinal]
for feature in nominal:
    print(feature , len(df_train[feature].unique()))

In [None]:
new_nominal = ["Neighborhood" , "Exterior1st" , "Exterior2nd"]
#nominal1 = [feature for feature in nominal if feature not in new_nominal]

In [None]:
nominal1 = []
for feature in nominal:
    if feature not in new_nominal:
        nominal1.append(feature)
nominal1       

In [None]:
len(nominal1)

In [None]:
len(nominal)

In [None]:
nominal_variable = pd.get_dummies(df_train[nominal1], drop_first=True)
#nominal_variable.drop(new_nominal , axis = 1 , inplace = True)

In [None]:
nominal_variable.shape

In [None]:
nominal_variable.head()
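
A short sketch of what `pd.get_dummies(..., drop_first=True)` produces, using a made-up two-column frame. The first label of each feature (in sorted order) is dropped, since it is implied when all remaining indicator columns are zero.

```python
import pandas as pd

toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotShape": ["Reg", "IR1", "IR2"],
})

# One indicator column per label, minus the first label of each feature.
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies.columns.tolist())
print(dummies["Street_Pave"].astype(int).tolist())
```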

* **One hot encoding with many variables**

In [None]:
df_train["Neighborhood"].value_counts()

In [None]:
def top_ten(feature):
    ## value_counts() already sorts by frequency in descending order
    return feature.value_counts().head(10).index.tolist()

In [None]:
top_ten(df_train["Neighborhood"])

In [None]:
top_10_Neighborhood = top_ten(df_train["Neighborhood"])
top_10_Exterior1st =  top_ten(df_train["Exterior1st"])
top_10_Exterior2nd =  top_ten(df_train["Exterior2nd"])

In [None]:
df_train["Exterior1st"].unique()

In [None]:
df_train["Exterior2nd"].unique()

In [None]:
for label in top_10_Neighborhood:
    print(label)

In [None]:
#top_10_Neighborhood = [x for x in df_train.Neighborhood.value_counts().sort_values(ascending=False).head(10).index]
#top_10_Exterior1st = [x for x in df_train.Exterior1st.value_counts().sort_values(ascending=False).head(10).index]
#top_10_Exterior2nd = [x for x in df_train.Exterior2nd.value_counts().sort_values(ascending=False).head(10).index]


for i in top_10_Neighborhood:
    df_train[i]= np.where(df_train["Neighborhood"]== i,1,0) 
    
for label in top_10_Exterior1st:
    df_train[label]= np.where(df_train["Exterior1st"]==label,1,0)
    
#for label in top_10_Exterior2nd:
    #df_train[label]= np.where(df_train["Exterior2nd"]==label,1,0)
    
#df_train[top_10_Exterior2nd].head()
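
The top-ten encoding names each new column after the bare label, so two features that share labels (such as `Exterior1st` and `Exterior2nd`) would write to the same columns. A minimal sketch of a variant that prefixes the column name; `top_k_one_hot` is a hypothetical helper, not a function used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

def top_k_one_hot(frame, column, k):
    """Return indicator columns for the k most frequent labels of `column`."""
    top_labels = frame[column].value_counts().head(k).index.tolist()
    encoded = pd.DataFrame(index=frame.index)
    for label in top_labels:
        # Prefixing with the feature name keeps columns from different features apart.
        encoded[f"{column}_{label}"] = np.where(frame[column] == label, 1, 0)
    return encoded

# Hypothetical toy data with three labels of decreasing frequency.
toy = pd.DataFrame({
    "Exterior1st": ["VinylSd", "VinylSd", "MetalSd", "Wd Sdng", "VinylSd", "MetalSd"],
})
encoded = top_k_one_hot(toy, "Exterior1st", k=2)
print(encoded)
```

Rows whose label is outside the top `k` get zeros in every indicator column, which groups all rare labels together.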

In [None]:
df_train[top_10_Exterior1st].head()

In [None]:
df_train.head()

In [None]:
df_train.drop(nominal , axis = 1 , inplace = True)

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
train = pd.concat([nominal_variable , df_train] , axis = 1)

In [None]:
train.shape

In [None]:
train.head()

In [None]:
train.columns.duplicated()

In [None]:
#preview the df
#train =  train.loc[:,~ train.columns.duplicated()]
#train.shape

In [None]:
train.isnull().sum().sum()

### Split data into train and test

In [None]:
train_df = train.iloc[:1460, :]
test1 = train.iloc[1460: , :]

print(train_df.shape)
print(test1.shape)
#print(len(y_train))
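
Slicing by position relies on the training rows coming first and on there being exactly 1460 of them. A sketch of an alternative, under the assumption that `SalePrice` is missing exactly for the test rows (the toy frame is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical combined frame: two labelled rows followed by two unlabelled rows.
combined = pd.DataFrame({
    "GrLivArea": [1710, 1262, 896, 1329],
    "SalePrice": [208500.0, 181500.0, np.nan, np.nan],
})

# Rows with a known target form the training part.
labelled = combined[combined["SalePrice"].notna()]

# Rows without a target form the test part; the empty target column is dropped.
unlabelled = combined[combined["SalePrice"].isna()].drop(columns="SalePrice")

print(labelled.shape, unlabelled.shape)
```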

In [None]:
test1['SalePrice']

In [None]:
test = test1.drop("SalePrice" , axis = 1)

In [None]:
X = train_df.drop("SalePrice" , axis = 1)
y = train_df["SalePrice"]

In [None]:
X.shape

In [None]:
y

## Feature Selection

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)

In [None]:
print(model.feature_importances_)

In [None]:
X.columns

In [None]:
plt.figure(figsize = (16 , 7))
ranked_features =  pd.Series(model.feature_importances_, index = X.columns)
ranked_features.nlargest(40).plot(kind='barh')
plt.show()

In [None]:
features = ranked_features.nlargest(23)

In [None]:
features.index

In [None]:
X = train_df[features.index]

In [None]:
X.shape

In [None]:
X.head()

## Model Building

In [None]:
# split dataset into train and test
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 5)

### Robust scaler

In [None]:
test1 = test[features.index]

In [None]:
test1.shape

In [None]:
# scaling dataset with robust scaler
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
test1 = scaler.transform(test1)
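
With its default settings, `RobustScaler` centers each feature on its median and divides by the interquartile range, both learned from the data passed to `fit`. A minimal NumPy sketch of that arithmetic on invented values, including one outlier:

```python
import numpy as np

# Hypothetical training values for one feature; 100.0 is an outlier.
train_values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Statistics are computed on the training data only.
median = np.median(train_values)
q1, q3 = np.percentile(train_values, [25, 75])
iqr = q3 - q1

scaled_train = (train_values - median) / iqr
print(scaled_train)

# New data reuses the training median and IQR, mirroring scaler.transform().
new_values = np.array([3.0, 5.0])
scaled_new = (new_values - median) / iqr
print(scaled_new)
```

The outlier barely affects the median and the IQR, so the other values keep a sensible scale. That is the reason to prefer this scaler over mean and standard deviation when a feature has extreme values.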

In [None]:
X_train

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error # for calculating mean_squared error
from sklearn.metrics import r2_score # for measuring the goodness of fit (R^2)

reg = LinearRegression()
reg.fit(X_train , y_train)

y_pred = reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test , y_pred))

score=r2_score(y_test,y_pred)
print(f"value of R^2 is {score}")
print(f"rmse value is {rmse}")
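
The two reported metrics can be computed by hand, which makes their meaning explicit. A minimal sketch on invented prices follows. The last lines also compute the RMSE on log prices, which is the metric the Kaggle competition page describes for its leaderboard.

```python
import numpy as np

# Hypothetical true and predicted prices.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE: square root of the mean squared residual, in the units of the target.
residuals = y_true - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE on the log scale penalizes relative rather than absolute errors.
log_rmse = np.sqrt(np.mean((np.log(y_true) - np.log(y_hat)) ** 2))
print(rmse, r2, log_rmse)
```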

In [None]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor()
model.fit(X_train , y_train)

prediction = model.predict(X_test)

score = r2_score(y_test , prediction)
print(score)

### Random Forest

In [None]:
# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor()
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
score_rf = r2_score(y_test,y_pred_rf)
rmse = np.sqrt(mean_squared_error(y_test , y_pred_rf))


print(f"value of R^2 is {score_rf}")
print(f"rmse value is {rmse}")

In [None]:
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator =rf, X = X_train,y = y_train, cv = 10)
print("Cross validation R^2 scores of random forest model = ", cross_validation)
print("\nCross validation mean R^2 of random forest model = ", cross_validation.mean())
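
For a regressor, `cross_val_score` reports R^2 by default rather than an accuracy. A sketch, on synthetic data, of requesting RMSE per fold through the `scoring` argument. The data, the number of trees and the seed are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data: a linear signal in two of three features plus noise.
rng = np.random.default_rng(5)
X_toy = rng.normal(size=(120, 3))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=120)

forest = RandomForestRegressor(n_estimators=20, random_state=5)

# Default scoring for a regressor is R^2, one value per fold.
r2_scores = cross_val_score(forest, X_toy, y_toy, cv=5)

# scikit-learn maximizes scores, so RMSE is exposed with a negative sign.
neg_rmse = cross_val_score(forest, X_toy, y_toy, cv=5,
                           scoring="neg_root_mean_squared_error")
rmse_scores = -neg_rmse
print(r2_scores.mean(), rmse_scores.mean())
```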

### XGBoost

In [None]:
import xgboost
xgb_model = xgboost.XGBRegressor()
xgb_model.fit(X_train,y_train)


y_pred_xg = xgb_model.predict(X_test)
score_xg=r2_score(y_test,y_pred_xg)
rmse = np.sqrt(mean_squared_error(y_test , y_pred_xg))


print(f"value of R^2 is {score_xg}")
print(f"rmse value is {rmse}")

In [None]:
from sklearn.model_selection import cross_val_score
cross_validation = cross_val_score(estimator = xgb_model, X = X_train,y = y_train, cv = 10)
print("Cross validation R^2 scores of xgboost model = ", cross_validation)
print("\nCross validation mean R^2 of xgboost model = ", cross_validation.mean())

In [None]:
y_pred_hyper = xgb_model.predict(test1)
y_pred_hyper

In [None]:
df = pd.read_csv("test.csv" , usecols = ["Id"])

In [None]:
df.head()

In [None]:
submit_test1 = pd.concat([df["Id"], pd.DataFrame(y_pred_hyper)], axis=1)
submit_test1.columns=['Id', 'SalePrice']

In [None]:
submit_test1.head(20)

In [None]:
submit_test1.info()

In [None]:
#submit_test1 = submit_test1.astype({'Id': 'int', 'SalePrice': 'float'})

In [None]:
submit_test1.to_csv('sample_submission.csv', index=False)

In [None]:
#df = pd.read_csv("sample_submission.csv")
#df

### Hyperparameter tuning

In [None]:
## Hyperparameter optimization using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV , GridSearchCV

In [None]:
#Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

#criterion = ["squared_error" , "absolute_error" , "poisson"]

# Number of features to consider at every split
# (1.0 uses all features; the string 'auto' is no longer accepted by recent scikit-learn releases)
max_features = [1.0, 'sqrt', "log2"]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

In [None]:
# Create the random grid

random_grid = {'n_estimators': n_estimators,
               #'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

In [None]:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, cv = 5, verbose=2, n_jobs = -1 , random_state = 5)

In [None]:
rf_random.fit(X_train,y_train)

In [None]:
rf_random.best_params_
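
A quick calculation shows how little of the grid a randomized search with `n_iter = 10` visits. The sketch below rebuilds the value lists of the random forest grid as plain Python lists and counts the combinations:

```python
import math

# Same number of options per hyperparameter as in the random forest grid.
search_space = {
    "n_estimators": list(range(100, 1201, 100)),   # 12 values
    "max_features": [1.0, "sqrt", "log2"],         # 3 values
    "max_depth": list(range(5, 31, 5)),            # 6 values
    "min_samples_split": [2, 5, 10, 15, 100],      # 5 values
    "min_samples_leaf": [1, 2, 5, 10],             # 4 values
}

n_combinations = math.prod(len(values) for values in search_space.values())

# RandomizedSearchCV samples n_iter settings and fits each one once per fold.
n_iter, cv = 10, 5
n_fits = n_iter * cv
coverage = n_iter / n_combinations
print(n_combinations, n_fits, round(coverage, 4))
```

An exhaustive grid search would need 4320 settings times 5 folds, so sampling 10 settings trades coverage for run time. Raising `n_iter` is the simplest way to explore more of the space.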

In [None]:
prediction = rf_random.predict(X_test)
score_rf=r2_score(y_test,prediction)


print(f"value of R^2 is {score_rf}")
print('RMSE:', np.sqrt(mean_squared_error(y_test, prediction)))

In [None]:
y_pred_hyper = rf_random.predict(test1)
y_pred_hyper

### Hyperparameter tuning with XGBoost

In [None]:
params = {
 "learning_rate"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
 "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
 "min_child_weight" : [ 1, 3, 5, 7 ],
 "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
 "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
    
}

In [None]:
xgb = RandomizedSearchCV(xgb_model,param_distributions=params,n_iter=10,scoring='r2',n_jobs=-1,cv=5,verbose=3)

In [None]:
xgb.fit(X_train , y_train)
y_pred = xgb.predict(test1)
y_pred

In [None]:
prediction = xgb.predict(X_test)
score_xgb=r2_score(y_test,prediction)


print(f"value of R^2 is {score_xgb}")
print('RMSE:', np.sqrt(mean_squared_error(y_test, prediction)))

In [None]:
df = pd.read_csv("test.csv" , usecols = ["Id"])
submit_test1 = pd.concat([df["Id"], pd.DataFrame(y_pred)], axis=1)
submit_test1.columns=['Id', 'SalePrice']

In [None]:
# submit_test1 = submit_test1.astype({'Id': 'int', 'SalePrice': 'float'})
submit_test1.to_csv('sample_submission.csv', index=False)

In [None]:
submit_test1