<img class="thumb_g_article" data-org-src="https://i.gifer.com/AwdJ.gif" data-org-width="680" dmcf-mid="cCD3JaWEk0" dmcf-mtype="image" height="auto" src="https://i.gifer.com/AwdJ.gif" width="8000">

# Introduction📝

###  Goal: 
The focus of this workbook is EDA and trying out multiple models to compare their performance. 🎯

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
This playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

**Evaluation Metric:**

Root Mean Squared Error (RMSE)

![RMSE](https://programmerah.com/wp-content/uploads/2020/11/20190714113817886.png)

where:

* $y_i$: original value
* $\hat{y}_i$: predicted value
* $n$: number of rows in the test data
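
As a quick illustration of the metric, here is a minimal sketch of RMSE written with NumPy. The helper name `rmse` and the toy prices are assumptions made for this example; they are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # square the errors, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only (hypothetical prices)
example_rmse = rmse([200000, 150000, 320000], [210000, 140000, 300000])
print(example_rmse)
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.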

# Import Libraries 📚

In [None]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
#import some necessary libraries

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #ignore annoying warning (from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #for some statistics


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Limiting floats output to 3 decimal points


from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #check the files available in the directory

### Loading dataset! 

In [None]:
#Now let's load the train and test datasets into pandas DataFrames

train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
X = train.drop("SalePrice", axis=1)
y = train[['SalePrice']]

#Dropping the 'Id' column as it doesn't contribute to prediction
X.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

# A notebook cell only displays its last expression, so print both shapes explicitly
print(X.shape)
print(y.shape)

# Data Preprocessing 🔮

In [None]:
plt.figure(figsize=(7,7))
sns.boxplot(data=y)

In [None]:
# Filter columns based on data type. First get num columns then find the columns that are not numbers
num_cols = [col for col in X.columns if X[col].dtype in ['int64', 'float64']]
cat_cols = [col for col in X.columns if X[col].dtype in ['object','str']]

In [None]:
discrete_numeric_columns = ['OverallQual','OverallCond','BsmtFullBath','BsmtHalfBath','FullBath','HalfBath',
                'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'MoSold', 'YrSold']

# Every numeric column that is not in the discrete list is treated as continuous
continuous_numeric_column = [col for col in num_cols if col not in discrete_numeric_columns]

In [None]:
X[continuous_numeric_column]

In [None]:
fig=plt.figure(figsize=(20,30))

for index, col in enumerate(continuous_numeric_column):
    plt.subplot(6,4,index+1)
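    # distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the closest replacement
    sns.histplot(X[col].dropna(), kde=True, stat="density")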
fig.tight_layout(pad=2.0)


## Bivariate Analysis 

In [None]:

# pairplot creates its own figure, so no separate plt.figure() is needed
sns.pairplot(train, y_vars=['SalePrice'], x_vars=continuous_numeric_column)

For several of these numeric features, the plots show that an increase in the feature goes together with an increase in sale price. These relationships are worth a closer look.
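
To put a number on what the scatter plots suggest, one option is to rank features by their Pearson correlation with `SalePrice`. The sketch below uses a small made-up frame called `toy` (the values are hypothetical, not Ames rows); on the real data the same expression could be applied to `train` restricted to its numeric columns.

```python
import pandas as pd

# Toy stand-in for a few Ames columns; the values are made up for illustration
toy = pd.DataFrame({
    "GrLivArea": [850, 1200, 1500, 1800, 2400],
    "LotArea": [9000, 7000, 11000, 8000, 10000],
    "SalePrice": [100000, 140000, 175000, 210000, 300000],
})

# Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["SalePrice"]
    .drop("SalePrice")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low value does not rule out a curved relationship that is visible in the plot.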

In [None]:
def bar_plot(variable):

    # get feature
    var = X[variable]
    # count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))
    
    

# bar_plot opens its own figure for each column, so no outer figure is needed
for cols in cat_cols:
    bar_plot(cols)


These bar plots show that some categorical features have high cardinality, while others have low cardinality but a strong class imbalance. It would be interesting to expand the feature space by encoding these categorical variables and then apply dimensionality reduction, preferably PCA, or try feature selection using regularization. Let's see how correlated these features are!
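
The cardinality and imbalance seen in the bar plots can also be summarized in a small table. This is a minimal sketch on a made-up frame `toy_cat` (hypothetical rows, not the Ames data); `n_unique` and `top_share` are names chosen for this example.

```python
import pandas as pd

# Made-up categorical columns standing in for the Ames categorical features
toy_cat = pd.DataFrame({
    "Street": ["Pave"] * 9 + ["Grvl"],
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "Edwards", "Somerst",
                     "Gilbert", "NridgHt", "Sawyer", "NAmes", "CollgCr"],
})

summary = pd.DataFrame({
    # number of distinct categories per column
    "n_unique": toy_cat.nunique(),
    # share of rows taken by the most frequent category
    "top_share": toy_cat.apply(lambda s: s.value_counts(normalize=True).iloc[0]),
})
print(summary)
```

A column with few categories and a `top_share` close to 1 carries little information, while a column with many categories adds many dummy columns after one-hot encoding.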


## Correlation Heatmap 

In [None]:
# Show only strongly correlated pairs: cells with a correlation below 0.7 are masked out
plt.figure(figsize=(18,15))
correlation = X[num_cols].corr()
sns.heatmap(correlation, mask = correlation <0.7, linewidths=0.5, annot = True, fmt = ".2f", cmap='Accent')

In [None]:
X.shape

In [None]:
sns.scatterplot(x=X['LotFrontage'], y=y['SalePrice'], hue=X['LotShape'])

# Checking the Cardinality of Categorical Columns 🧩

In [None]:
# Assumption: a column counts as low-cardinality when it has at most 10 distinct values
low_cardinal_categorical_cols = [col for col in cat_cols if X[col].nunique() <= 10]

In [None]:
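# Note: num_cols and cat_cols were built before these columns exist, so the encoding step
# that relies on those lists does not include the engineered features unless the lists are rebuilt.
X['Topography'] = X['LotConfig'] + X['LandContour']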
X['Geometry'] = X['LotArea'] / X['LotFrontage']
X['TotalIndoorSqFt'] = X['TotalBsmtSF'] + X['1stFlrSF'] + X['2ndFlrSF'] + X['GarageArea']
X['HouseToYardRatio'] = X['TotalIndoorSqFt'] / X['LotArea']
X['HouseToPoolRatio'] = X['TotalIndoorSqFt'] / (X['PoolArea'] + 1)
X['Value'] = X['OverallCond'] * X['OverallQual']
X['Condition'] = X['Condition1'] + X['ExterCond']
X['YardToSeatingAreaRatio'] =  (X['WoodDeckSF'] + X['OpenPorchSF'] + 1) / X['LotArea']
X['Meh'] = X['Fireplaces'] * X['TotRmsAbvGrd']

In [None]:
X[(X['Utilities']=='NoSeWa') &  (X['ScreenPorch']==203) ]

In [None]:
dummies = pd.get_dummies(X[cat_cols]).rename(columns=lambda x: 'Category_' + str(x))
dummies.head()
dummies.shape
X[num_cols].shape

In [None]:
X_encoded=pd.concat([ X[num_cols], dummies], axis=1)
X_encoded.isnull().sum().sort_values(ascending=False)
X_encoded=X_encoded.fillna(X_encoded.mean())

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

In [None]:
data_dmatrix = xgb.DMatrix(data=X_encoded,label=y)

# Train Test Split ⚔️

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=123)

In [None]:
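# 'reg:squarederror' is the current name of the squared-error objective; 'reg:linear' is deprecated
model = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 10, alpha = 10, n_estimators = 400)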

In [None]:
model.fit(X_train,y_train)

preds = model.predict(X_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

# Best model by Randomized Search CV 👓

In [None]:
# from sklearn.model_selection import RandomizedSearchCV



# parameters = {
#     'max_depth': [3, 5, 10, None],
#      'n_estimators': [100, 200, 300, 400, 500],
#      'learning_rate': [0.01, 0.1, 0.5],
#      'booster' : ['gbtree', 'gblinear', 'dart']
#  }

# rv = RandomizedSearchCV(model,
#                         param_distributions=parameters,
#                          n_iter=25,
#                          cv=5,
#                          n_jobs=-1,
#                          random_state=42)

# rv.fit(X_train, y_train)

# rv.best_params_, rv.best_score_

In [None]:
#rv.best_params_, rv.best_score_

The parameters obtained from the search are used to create the model below.

In [None]:
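# 'reg:squarederror' is the current name of the squared-error objective; 'reg:linear' is deprecated
model = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 3, alpha = 10, n_estimators = 500,booster= 'gbtree')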

In [None]:
#fitting best model

model.fit(X_train,y_train)


# Preprocessing the Validation set ⚙️

In [None]:
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
print('test_head', test.head())


print('test_shape', test.shape)
print('column_list:', test.columns)

In [None]:
test['Topography'] = test['LotConfig'] + test['LandContour']
test['Geometry'] = test['LotArea'] / test['LotFrontage']
test['TotalIndoorSqFt'] = test['TotalBsmtSF'] + test['1stFlrSF'] + test['2ndFlrSF'] + test['GarageArea']
test['HouseToYardRatio'] = test['TotalIndoorSqFt'] / test['LotArea']
test['HouseToPoolRatio'] = test['TotalIndoorSqFt'] / (test['PoolArea'] + 1)
test['Value'] = test['OverallCond'] * test['OverallQual']
test['Condition'] = test['Condition1'] + test['ExterCond']
test['YardToSeatingAreaRatio'] =  (test['WoodDeckSF'] + test['OpenPorchSF'] + 1) / test['LotArea']
test['Meh'] = test['Fireplaces'] * test['TotRmsAbvGrd']
test.head()

In [None]:
dummies = pd.get_dummies(test[cat_cols]).rename(columns=lambda x: 'Category_' + str(x))
dummies.head()

In [None]:
test_encoded=pd.concat([ test[num_cols], dummies], axis=1)

In [None]:
unessential_col=set(X_encoded.columns)-set(test_encoded.columns)
print("X Train Shape",X_train.shape)
print("test_encoded",test_encoded.shape)
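
The set difference above lists dummy columns that exist for the training data but not for the test data. One common way to handle that mismatch, sketched here on made-up frames (`toy_train` and `toy_test` are hypothetical, not the Ames data), is to reindex the test dummies to the training column layout and fill categories that never occur in the test rows with 0.

```python
import pandas as pd

# Made-up frames; "Grvl" appears only in the training rows
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"]})

train_dummies = pd.get_dummies(toy_train)
test_dummies = pd.get_dummies(toy_test)

# Columns the training encoding has but the test encoding lacks
missing_in_test = sorted(set(train_dummies.columns) - set(test_dummies.columns))

# Reindex the test dummies to the training layout, filling absent categories with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(missing_in_test)
print(test_aligned.columns.tolist())
```

This keeps every training column available to the model. The notebook takes the other route and restricts the training data to the columns present in the test encoding; both approaches give the model identical column layouts at fit and predict time.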

## Running on Validation Set 🤌

In [None]:
model.fit(X_train[test_encoded.columns],y_train)

preds = model.predict(X_test[test_encoded.columns])
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

In [None]:
test_encoded=test_encoded.fillna(test_encoded.mean())
test_pred = model.predict(test_encoded)

# Submission 📝

In [None]:
submission = pd.DataFrame({'Id': test.Id, 'SalePrice': test_pred})
submission.to_csv('submission.csv', index=False)


# This workbook is purely exploratory, and I will use it to try out other models, including regularized models and XGBoost.
# Please upvote/follow if you want to follow along with the progress.