**In this notebook we try to predict house prices using XGBoost, after some exploratory data analysis (EDA) and data cleaning.**

**[The competition data we are using](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) consists of 79 columns describing (almost) every aspect of residential homes in Ames, Iowa.**

**Here's a brief version of what you'll find in the data description file:**
* SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* Bedroom: Number of bedrooms above basement level
* Kitchen: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale


# **Importing libraries.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

# **Importing the dataset.**

In [2]:
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
test_df_origin= pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [3]:
test_df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [4]:
train_df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [5]:
test_df['SalePrice'] = np.arange(0,1459)  # placeholder target so that train and test share the same columns

In [6]:
test_df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,6,2010,WD,Normal,0
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,,,Gar2,12500,6,2010,WD,Normal,1
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,,MnPrv,,0,3,2010,WD,Normal,2
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,,,,0,6,2010,WD,Normal,3
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,0,,,,0,1,2010,WD,Normal,4


In [7]:
train_df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [8]:
df = pd.concat([train_df,test_df])
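
The concatenation above relies on a placeholder `SalePrice` column in the test frame and, later, on hard-coded row counts to split the data again. A minimal sketch of one alternative, assuming small stand-in frames (`toy_train` and `toy_test` are hypothetical, not the competition data): passing `keys` to `pd.concat` labels each part so it can be recovered by name.

```python
import pandas as pd

# Hypothetical stand-ins for train_df and test_df.
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [11622]})

# keys= adds an outer index level, so no placeholder SalePrice is needed;
# the missing target in the test part simply becomes NaN.
combined = pd.concat([toy_train, toy_test], keys=["train", "test"])

# Split the parts back out by label instead of by row count.
train_part = combined.loc["train"]
test_part = combined.loc["test"].drop(columns="SalePrice")
```

With this layout, any cleaning applied to `combined` still reaches both parts, which is the reason for concatenating in the first place.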

In [9]:
df.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [10]:
df.drop('Id',axis=1,inplace=True)


# **Cleaning the data.**

In [11]:
df.isnull().sum().sort_values(ascending=False)[:40]

PoolQC          2909
MiscFeature     2814
Alley           2721
Fence           2348
FireplaceQu     1420
LotFrontage      486
GarageFinish     159
GarageQual       159
GarageCond       159
GarageYrBlt      159
GarageType       157
BsmtCond          82
BsmtExposure      82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
Functional         2
Utilities          2
BsmtHalfBath       2
BsmtFullBath       2
Exterior2nd        1
Exterior1st        1
BsmtUnfSF          1
TotalBsmtSF        1
GarageArea         1
KitchenQual        1
BsmtFinSF2         1
GarageCars         1
BsmtFinSF1         1
SaleType           1
Electrical         1
RoofStyle          0
RoofMatl           0
SalePrice          0
YearRemodAdd       0
YearBuilt          0
OverallCond        0
dtype: int64

In [12]:
df['PoolQC'].fillna('Np',inplace=True)

**NA values here mean there is no pool, so we fill them with the string 'Np'.**

In [13]:
df['MiscFeature'].fillna('No plus features',inplace=True)

**NA values here mean there are no miscellaneous features, so we fill them with the string 'No plus features'.**

In [14]:
df['Alley'].fillna('Naacc',inplace=True)

**NA values here mean there is no alley access, so we fill them with the string 'Naacc'.**

In [15]:
df['Fence'].fillna('Nf',inplace=True)

**NA values here mean there is no fence, so we fill them with the string 'Nf'.**

In [16]:
df['FireplaceQu'].fillna('Nofire',inplace=True)

**NA values here mean there is no fireplace, so we fill them with the string 'Nofire'.**

In [17]:
df['GarageType'].fillna('Nogar',inplace=True)

In [18]:
df['GarageFinish'].fillna('Nogar',inplace=True)

In [19]:
df['GarageQual'].fillna('Nogar',inplace=True)

In [20]:
df['GarageCond'].fillna('Nogar',inplace=True)

**NA values in these four garage columns mean there is no garage, so we fill them with the string 'Nogar'.**

In [21]:
df['BsmtExposure'].fillna('No',inplace=True)

**NA values here mean there is no basement. We fill them with the string 'No', which is also the existing category for "no exposure", so the two cases are merged.**

In [22]:
df['BsmtFinType2'].fillna('Nobsmnt',inplace=True)
df['BsmtFinType1'].fillna('Nobsmnt',inplace=True)
df['BsmtCond'].fillna('Nobsmnt',inplace=True)
df['BsmtQual'].fillna('Nobsmnt',inplace=True)

**NA values here mean there is no basement, so we fill them with the string 'Nobsmnt'.**

In [23]:
df['MasVnrType'].fillna('Novnr',inplace=True)

**NA values here mean there is no masonry veneer, so we fill them with the string 'Novnr'.**

In [24]:
df['MasVnrArea'].fillna(0,inplace=True)

**NA values here mean there is no masonry veneer, so we fill them with 0.**
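
The fills above all follow the same pattern: a missing value means the feature is absent. A minimal sketch of how such fills could be grouped into a single call, using a hypothetical toy frame (`toy`) and a hypothetical mapping name (`absent_fill`); the fill values mirror the ones used above. It also avoids calling `fillna(..., inplace=True)` on a selected column, which recent pandas versions may warn about.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `df`.
toy = pd.DataFrame({
    "PoolQC": [np.nan, "Gd"],
    "Fence": ["MnPrv", np.nan],
    "MasVnrArea": [np.nan, 120.0],
})

# Column -> fill value for "NA means the feature is absent".
absent_fill = {"PoolQC": "Np", "Fence": "Nf", "MasVnrArea": 0}

# DataFrame.fillna accepts a dict keyed by column name and returns a new frame.
toy = toy.fillna(value=absent_fill)
```

Keeping the mapping in one place makes it easier to see at a glance which columns are treated as "absent" rather than imputed.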

In [25]:
df['Electrical'].unique()

array(['SBrkr', 'FuseF', 'FuseA', 'FuseP', 'Mix', nan], dtype=object)

In [26]:
df['Electrical'].fillna('SBrkr',inplace=True)

The most frequent Electrical system is Standard Circuit Breakers & Romex.

In [27]:
df['LotFrontage'].mean()

69.30579531442663

In [28]:
df['LotFrontage'].fillna(70,inplace=True)

**The mean LotFrontage is about 69.3 ft, so we fill the missing values with the rounded value of 70.**

In [29]:
df['GarageYrBlt'].min()

1895.0

In [30]:
df['GarageYrBlt'].fillna(0.0,inplace=True)

**We fill the NA values with 0, since these houses have no garage.**

In [31]:
df['MSZoning'].mode()

0    RL
dtype: object

In [32]:
df['MSZoning'].fillna('RL',inplace=True)

Residential Low Density is the most frequent general zoning class.

In [33]:
df['Functional'].mode()

0    Typ
dtype: object

In [34]:
df['Functional'].fillna('Typ',inplace=True)

Typical Functionality is the most frequent Home functionality type.

In [35]:
df['Utilities'].mode()

0    AllPub
dtype: object

In [36]:
df['Utilities'].fillna('AllPub',inplace=True)

AllPub is the most frequent Type of utilities available.

In [37]:
df['BsmtFullBath'].mode()

0    0.0
dtype: float64

In [38]:
df['BsmtFullBath'].fillna(0,inplace=True)

The most frequent basement full bathrooms number is 0.

In [39]:
df['BsmtHalfBath'].mode()

0    0.0
dtype: float64

In [40]:
df['BsmtHalfBath'].fillna(0,inplace=True)

The most frequent basement half bathrooms number is 0.

In [41]:
df[df['BsmtUnfSF'].isna()]['BsmtFinType1']

660    Nobsmnt
Name: BsmtFinType1, dtype: object

In [42]:
df[df['BsmtUnfSF'].isna()]['BsmtFinType2']

660    Nobsmnt
Name: BsmtFinType2, dtype: object

In [43]:
df['BsmtUnfSF'].fillna(0,inplace=True)
df['TotalBsmtSF'].fillna(0,inplace=True)
df['BsmtFinSF1'].fillna(0,inplace=True)
df['BsmtFinSF2'].fillna(0,inplace=True)

The row with the missing values has no basement, so we fill the basement areas with 0.

In [44]:
df['SaleType'].mode()

0    WD
dtype: object

In [45]:
df['SaleType'].fillna('WD',inplace=True)

Warranty Deed - Conventional is the most frequent Type of sale.

In [46]:
df[df['GarageCars'].isna()]['GarageFinish']

1116    Nogar
Name: GarageFinish, dtype: object

In [47]:
df[df['GarageArea'].isna()]['GarageFinish']

1116    Nogar
Name: GarageFinish, dtype: object

In [48]:
df['GarageCars'].fillna(0,inplace=True)
df['GarageArea'].fillna(0,inplace=True)

The row with the missing values has no garage, so we fill the garage size columns with 0.

In [49]:
df['Exterior1st'].mode()

0    VinylSd
dtype: object

In [50]:
df['Exterior2nd'].mode()

0    VinylSd
dtype: object

In [51]:
df['Exterior1st'].fillna('VinylSd',inplace=True)
df['Exterior2nd'].fillna('VinylSd',inplace=True)

Vinyl Siding is the most frequent Exterior covering.

In [52]:
df['KitchenQual'].mode()

0    TA
dtype: object

In [53]:
df['KitchenQual'].fillna('TA',inplace=True)

Typical/Average is the most frequent kitchen quality.
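
The mode-based fills above repeat the same two steps for each column: look up the most frequent value, then fill with it. A minimal sketch of a helper that does both, assuming a hypothetical function name (`fill_with_mode`) and a toy frame rather than the notebook's `df`:

```python
import numpy as np
import pandas as pd

def fill_with_mode(frame, columns):
    """Return a copy of `frame` with NaNs in `columns` replaced by each column's mode."""
    out = frame.copy()
    for col in columns:
        # Series.mode() ignores NaN by default; take the first value if there are ties.
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", np.nan],
    "KitchenQual": ["TA", np.nan, "Gd", "TA"],
})
filled = fill_with_mode(toy, ["MSZoning", "KitchenQual"])
```

Because the helper works on a copy, the original frame is left untouched, which makes it easy to compare before and after.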

In [54]:
df.isnull().sum().sort_values(ascending=False)[:40]

SalePrice        0
SaleCondition    0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
MasVnrArea       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinSF1       0
BsmtFinType2     0
BsmtFinSF2       0
BsmtUnfSF        0
TotalBsmtSF      0
RoofStyle        0
YearRemodAdd     0
YearBuilt        0
Utilities        0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
Alley            0
LotShape         0
LandContour      0
LotConfig        0
OverallCond      0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
OverallQual      0
Heating          0
dtype: int64

# **One hot encoding**

In [55]:
to_be_dummed = ['MSSubClass','MSZoning','Street','Alley','LotShape','LandContour','Utilities','LotConfig','LandSlope',
                'Neighborhood','Condition1','Condition2','BldgType','HouseStyle','OverallQual','OverallCond',
                'RoofStyle','RoofMatl','Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond',
                'Foundation','BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','Heating',
                'HeatingQC','CentralAir','Electrical','KitchenQual','Functional','FireplaceQu','GarageType',
                'GarageFinish','GarageQual','GarageCond','PavedDrive','PoolQC','Fence','MiscFeature','SaleType',
                'SaleCondition']

In [56]:
df_add = pd.get_dummies(df[to_be_dummed],drop_first=True)

In [57]:
df.drop(to_be_dummed,axis=1,inplace=True)

In [58]:
df = pd.concat([df,df_add],axis=1)
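
One detail worth checking in the encoding step: when `columns` is not passed, `pd.get_dummies` only encodes object, string, and category columns. Columns that hold integer codes (MSSubClass appears as integers in the `head()` output above) would then pass through as plain numbers. A minimal sketch on a hypothetical toy frame, showing the default behaviour and one possible workaround (casting to `str` first):

```python
import pandas as pd

# Hypothetical toy frame: one integer-coded category, one string category.
toy = pd.DataFrame({
    "MSSubClass": [20, 60, 20],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Default: only the string column is encoded; MSSubClass stays numeric.
default_encoded = pd.get_dummies(toy, drop_first=True)

# Casting to str first makes get_dummies treat the codes as categories.
as_str = toy.astype({"MSSubClass": str})
fully_encoded = pd.get_dummies(as_str, drop_first=True)
```

Whether integer-coded columns such as OverallQual should be one-hot encoded or kept as ordered numbers is a modelling choice; the sketch only shows how to make that choice explicit.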

In [59]:
df.isna().sum().sort_values()

LotFrontage              0
BsmtQual_Gd              0
BsmtQual_Nobsmnt         0
BsmtQual_TA              0
BsmtCond_Gd              0
                        ..
Condition2_PosN          0
Condition2_RRAe          0
Condition2_RRAn          0
BldgType_2fmCon          0
SaleCondition_Partial    0
Length: 260, dtype: int64

# **Making the predictions.**

In [60]:
X_train = df.drop('SalePrice',axis=1).iloc[:1460,:]
y_train = df['SalePrice'][:1460]
X_test = df.drop('SalePrice',axis=1).iloc[1460:,:]

In [61]:
X_train.shape

(1460, 259)

In [62]:
X_test.shape

(1459, 259)

In [63]:
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBRegressor

In [64]:
mms  = MinMaxScaler((-1,1))

**Feature scaling mainly helps gradient-based models converge; tree-based models such as XGBoost are largely insensitive to it. Note that min-max scaling does not reduce the effect of outliers, since the minimum and maximum are themselves set by the extreme values.**
**MinMaxScaler scales and translates each feature individually so that it lies in the given range on the training set (here, between -1 and 1).**
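
To make the scaling step concrete, here is a minimal NumPy sketch of min-max scaling to the range (-1, 1). It is a re-implementation of the idea for illustration, not scikit-learn's code, and the names `fit_min_max`, `X_tr`, and `X_te` are hypothetical. The minimum and maximum are learned on the training data only and then reused for the test data.

```python
import numpy as np

def fit_min_max(X_train, feature_range=(-1.0, 1.0)):
    """Learn per-column min and span on the training data and return a transform."""
    lo, hi = feature_range
    col_min = X_train.min(axis=0)
    col_span = X_train.max(axis=0) - col_min
    col_span[col_span == 0] = 1.0  # avoid dividing by zero for constant columns

    def transform(X):
        return (X - col_min) / col_span * (hi - lo) + lo

    return transform

# Hypothetical toy data: two features, three training rows, one test row.
X_tr = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_te = np.array([[20.0, 15.0]])  # first value lies beyond the training maximum

scale = fit_min_max(X_tr)
X_tr_scaled = scale(X_tr)
X_te_scaled = scale(X_te)
```

Test values outside the training range end up outside (-1, 1), as the first test feature does here. This is expected when the scaler is fitted on the training set only.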

In [65]:
X_train = mms.fit_transform(X_train)

In [66]:
boost = XGBRegressor()

In [67]:
boost.fit(X_train,y_train)

XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
             validate_parameters=False, verbosity=None)

In [68]:
X_test = mms.transform(X_test)

In [69]:
X_test.shape

(1459, 259)

In [70]:
X_train.shape

(1460, 259)

In [71]:
predictions = boost.predict(X_test)

In [72]:
predictions.shape

(1459,)

In [73]:
test_df_origin['SalePrice'] = predictions

In [74]:
test_df_origin[['Id','SalePrice']].to_csv('submission.csv',index=False)

**Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price.**
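
The metric described above can be written in a few lines of NumPy. This is a minimal sketch with a hypothetical function name (`rmse_of_logs`) and made-up prices, useful for scoring predictions on a held-out part of the training data before submitting:

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(predicted price) and log(observed price)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices: a perfect prediction, and one that is 10% too high everywhere.
score_perfect = rmse_of_logs([100000, 200000], [100000, 200000])
score_ratio = rmse_of_logs([100000, 200000], [110000, 220000])
```

Because the error is taken on logarithms, overestimating by 10% costs the same for a cheap house as for an expensive one. Training on the logarithm of `SalePrice` is a common way to align the model with this metric, although the notebook above trains on raw prices.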

**The best score I got in this competition is 0.14116. The top scores that others achieved are around 0.1.**