<center><h1 style='color:green'>Please upvote if you find it valuable!</h1></center>

# - Contents:

### 1. Include Libraries
### 2. Import Dataset
### 3. Handle Missing Values
### 4. Feature Engineering by OneHotEncoding
### 5. PCA (Principal Component Analysis)
### 6. Hyperparameter Tuning
### 7. Train Random Forest Regressor
### 8. Train XGBoost Regressor


# Include Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
warnings.filterwarnings('ignore')
plt.style.use('ggplot')

# Import dataset

In [None]:
train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
plt.figure(figsize=(15,5))
plt.scatter(train.index,train.SalePrice.sort_values().reset_index(drop=True))
plt.title('Distribution Plot for Sales Prices')
plt.ylabel('Sales Price');

# Handle Missing Values

Let's look at the train and test sets and find the missing values.

In [None]:
sns.heatmap(train.isnull(),yticklabels=False, cmap='plasma')

In [None]:
train.isnull().sum().sort_values(ascending=False)[0:19]

In [None]:
test.isnull().sum().sort_values(ascending=False)[0:33]

Now let's handle the missing values feature by feature.

**- LotFrontage**

In [None]:
train.LotFrontage.head()

In [None]:
train.LotFrontage.isnull().sum()

In [None]:
train['LotFrontage'] = train['LotFrontage'].fillna(train.LotFrontage.mean())

In [None]:
test.LotFrontage.isnull().sum()

In [None]:
test['LotFrontage'] = test['LotFrontage'].fillna(test.LotFrontage.mean())
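
The cells above fill each set with its own mean. A common alternative is to compute the statistic on the training data only and reuse it for the test set, so both are imputed with the same value. Here is a minimal sketch of that idea; `toy_train`, `toy_test` and their values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up).
toy_train = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0, 100.0]})
toy_test = pd.DataFrame({'LotFrontage': [np.nan, 50.0, np.nan]})

# Compute the statistic once, on the training data only.
train_mean = toy_train['LotFrontage'].mean()

# Reuse the same value for both frames so they are imputed consistently.
toy_train['LotFrontage'] = toy_train['LotFrontage'].fillna(train_mean)
toy_test['LotFrontage'] = toy_test['LotFrontage'].fillna(train_mean)

print(train_mean)
print(toy_test['LotFrontage'].tolist())
```

Whether this changes the final score on the real data is not something this sketch shows; it only illustrates the pattern.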

**- Alley**

In [None]:
train.Alley.value_counts(dropna=False)

In [None]:
train.drop(columns=['Alley'], inplace=True)

In [None]:
test.Alley.value_counts(dropna=False)

In [None]:
test.drop(columns=['Alley'], inplace=True)

**- BsmtCond, BsmtQual, FireplaceQu, GarageType, GarageCond, GarageFinish, GarageQual**

In [None]:
train.BsmtCond.value_counts(dropna=False)

In [None]:
train['BsmtCond'] = train['BsmtCond'].fillna(train.BsmtCond.mode()[0])

In [None]:
test['BsmtCond'] = test['BsmtCond'].fillna(test.BsmtCond.mode()[0])

Fill every similar feature with its mode, as was done for BsmtCond.

In [None]:
list1 = ['BsmtQual', 'FireplaceQu', 'GarageType', 'GarageCond', 'GarageFinish', 'GarageQual', 'MasVnrType', 'MasVnrArea',
         'BsmtExposure','BsmtFinType2']

for item in list1:
    train[item] = train[item].fillna(train[item].mode()[0])
    test[item] = test[item].fillna(test[item].mode()[0])
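
Mode imputation is one option. According to the competition's data description file, a missing value in columns such as GarageType or FireplaceQu means the house has no garage or no fireplace, so another option is to keep that information as its own category. The sketch below shows this on a made-up toy frame; the label `'NoFeature'` is a hypothetical name chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows; the column names match the dataset.
toy = pd.DataFrame({
    'GarageType': ['Attchd', np.nan, 'Detchd', np.nan],
    'FireplaceQu': [np.nan, 'Gd', 'TA', np.nan],
})

# 'NoFeature' is a hypothetical label meaning "the house lacks this feature".
absent_cols = ['GarageType', 'FireplaceQu']
toy[absent_cols] = toy[absent_cols].fillna('NoFeature')

print(toy['GarageType'].value_counts().to_dict())
```

After one-hot encoding, the extra category becomes an indicator column that a tree model can split on.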

**- GarageYrBlt, PoolQC, Fence, MiscFeature**

In [None]:
list1 = ['GarageYrBlt', 'PoolQC', 'Fence', 'MiscFeature']

for item in list1:
    train.drop(columns=item, inplace=True)
    test.drop(columns=item, inplace=True)

**Handle Remaining missing values**

In [None]:
train.isnull().sum().sort_values(ascending=False)

In [None]:
train.dropna(inplace=True)

In [None]:
train.drop(columns=['Id'], inplace=True)

In [None]:
train.shape

In [None]:
test.isnull().sum().sort_values(ascending=False)[0:17]

In [None]:
test['MSZoning']=test['MSZoning'].fillna(test['MSZoning'].mode()[0])

In [None]:
columns = ['BsmtFinType1', 'Utilities','BsmtFullBath', 'BsmtHalfBath', 'Functional', 'SaleType', 'Exterior2nd', 
           'Exterior1st', 'KitchenQual']
columns1 = ['GarageCars', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',  'TotalBsmtSF', 'GarageArea']

for item in columns:
    test[item] = test[item].fillna(test[item].mode()[0])
for item in columns1:
    test[item] = test[item].fillna(test[item].mean())

In [None]:
test.drop(columns=['Id'], inplace=True)

In [None]:
test.shape

### Check for any remaining missing values

In [None]:
train.isnull().any().any()

In [None]:
test.isnull().any().any()

## Feature Engineering by OneHotEncoding

Create the list of categorical features that need to be converted to binary indicator columns.

In [None]:
columns = ['MSZoning', 'Street',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical',
       'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish',
       'GarageQual', 'GarageCond', 'PavedDrive', 'SaleType', 'SaleCondition']

In [None]:
len(columns)

In [None]:
final_df = pd.concat([train, test], axis=0)

In [None]:
final_df.shape

In [None]:
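def One_hot_encoding(columns):
    # Encode each categorical column of the global final_df and drop the
    # original column (in place) once its dummies are built.
    dummies = []
    for field in columns:
        # No prefix is used, so labels shared across columns (e.g. 'Gd')
        # produce duplicate column names.
        dummies.append(pd.get_dummies(final_df[field], drop_first=True))
        final_df.drop([field], axis=1, inplace=True)

    # Remaining columns first, then all dummy columns in encoding order.
    return pd.concat([final_df] + dummies, axis=1)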

In [None]:
final_df = One_hot_encoding(columns)

In [None]:
final_df.shape

In [None]:
final_df =final_df.loc[:,~final_df.columns.duplicated()]
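
The duplicates removed above arise because several categorical columns share the same labels (for example `Gd` and `TA`), so their unprefixed dummy columns collide and all but the first are discarded. One way to avoid losing them, sketched here on a made-up toy frame, is to pass `columns=` to `pd.get_dummies`, which prefixes each dummy with its source column name. This is an alternative to the approach in this notebook, not what the cells above do.

```python
import pandas as pd

# Toy frame with made-up rows; both quality columns share 'Gd' and 'TA'.
toy = pd.DataFrame({
    'ExterQual': ['Gd', 'TA', 'Ex', 'TA'],
    'KitchenQual': ['TA', 'Gd', 'Gd', 'Ex'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Passing columns= makes pandas prefix each dummy with its source column.
encoded = pd.get_dummies(toy, columns=['ExterQual', 'KitchenQual'], drop_first=True)

print(encoded.columns.tolist())
print(encoded.columns.duplicated().any())
```

With prefixed names, the deduplication step is no longer needed and each column keeps its own indicators.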

In [None]:
final_df.shape

In [None]:
n_train = len(train)  # training rows in final_df (1422 in the original run)
df_Train=final_df.iloc[:n_train,:]
df_Test=final_df.iloc[n_train:,:]

In [None]:
# Reassign instead of dropping in place: df_Test is a slice of final_df
df_Test = df_Test.drop(columns=['SalePrice'])

In [None]:
X_train=df_Train.drop(['SalePrice'],axis=1)
y_train=df_Train['SalePrice']

# PCA (Principal Component Analysis)
Let's visualize the final dataset by projecting it onto its first two principal components and plotting the result.

In [None]:
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X_train)

my_columns = X_train.columns
new_df = pd.DataFrame(X_std, columns=my_columns)

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
df_pca = pca.fit_transform(new_df)

In [None]:
plt.figure(figsize =(8, 6))
plt.scatter(df_pca[:, 0], df_pca[:, 1], c = y_train, cmap ='plasma')
# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component');
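
A two-component scatter plot is only as informative as the share of variance those two components capture. The sketch below shows how to read that share from `explained_variance_ratio_`; it uses synthetic data (`X_demo`) as a stand-in for `X_train`, so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train: 200 rows, 6 features, two of them correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Standardize, then fit a two-component PCA as in the cells above.
X_demo_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_std)

# Fraction of total variance captured by each component, and their sum.
print(pca_demo.explained_variance_ratio_)
print(pca_demo.explained_variance_ratio_.sum())
```

On the real data, the same attribute is available on the fitted `pca` object.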

**Let's train our models.**

In [None]:
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor()

# Hyperparameter Tuning

**Do not trust the defaults! Let's search over several values for each parameter.**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

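n_estimators = [100, 500, 900]
# scikit-learn 1.0 renamed these to 'squared_error' and 'absolute_error'
criterion = ['mse', 'mae']
depth = [3,5,10,15]
min_split=[2,3,4]
min_leaf=[2,3,4]
bootstrap = [True, False]  # booleans; the strings 'True'/'False' are both truthy
verbose = [5]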

hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':depth,
    'criterion':criterion,
    'bootstrap':bootstrap,
    'verbose':verbose,
    'min_samples_split':min_split,
    'min_samples_leaf':min_leaf
    }

random_cv = RandomizedSearchCV(estimator=regressor,
                               param_distributions=hyperparameter_grid,
                               cv=5, 
                               scoring = 'neg_mean_absolute_error',
                               n_jobs = 4, 
                               return_train_score = True,
                               random_state=42)

In [None]:
random_cv.fit(X_train,y_train)

In [None]:
random_cv.best_estimator_
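
Besides `best_estimator_`, the fitted search object exposes `best_params_` and `best_score_`, which are often easier to read. The following is a small sketch on synthetic data (`X_demo`, `y_demo`, and the reduced `param_demo` grid are assumptions made for speed, not the grid used above).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem standing in for X_train / y_train.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(80, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=80)

# A deliberately tiny grid; bootstrap takes real booleans.
param_demo = {
    'n_estimators': [10, 20],
    'max_depth': [3, 5],
    'bootstrap': [True, False],
}

search_demo = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_demo,
    n_iter=4,  # number of sampled combinations (scikit-learn's default is 10)
    cv=3,
    scoring='neg_mean_absolute_error',
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_params_)
print(-search_demo.best_score_)  # mean cross-validated MAE of the best candidate
```

Because the scoring is a negated error, `best_score_` is at most zero and flipping its sign gives the mean absolute error.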

# Train Random Forest Regressor

### Caution: Use RandomForestRegressor, not RandomForestClassifier, because this is a regression problem. I also tried a random forest classifier; the scores after training were as follows:

- random forest classifier - 0.21
- random forest regressor - 0.15

The random forest regressor scores markedly better (assuming these are error scores, where lower is better).
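
For context, the competition evaluates submissions by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. Assuming the 0.21 and 0.15 above are scores of that kind, the sketch below shows how such a value can be computed locally; `rmse_of_logs` is a hypothetical helper name and the prices are made up.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """Root mean squared error between log(y_true) and log(y_pred)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Made-up prices for illustration only.
actual = [200000, 150000, 320000]
perfect = [200000, 150000, 320000]
off_by_10pct = [220000, 165000, 352000]

print(rmse_of_logs(actual, perfect))       # identical predictions give zero error
print(rmse_of_logs(actual, off_by_10pct))  # every prediction 10% too high
```

Working in log space means a 10% miss on a cheap house counts the same as a 10% miss on an expensive one.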

In [None]:
regressor = RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mae',
                      max_depth=15, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=2,
                      min_samples_split=4, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=5, warm_start=False)

In [None]:
regressor.fit(X_train,y_train)

In [None]:
y_pred = regressor.predict(df_Test)

In [None]:
y_pred

In [None]:
pred=pd.DataFrame(y_pred)
samp = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
sub = pd.concat([samp['Id'],pred], axis=1)
sub.columns=['Id','SalePrice']

In [None]:
sub

In [None]:
#sub.to_csv('My_sub.csv',index=False)

# Train XGBoost Regressor

In [None]:
import xgboost

In [None]:
regressor=xgboost.XGBRegressor()

In [None]:
n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster=['gbtree','gblinear']
learning_rate=[0.05,0.1,0.15,0.20]
min_child_weight=[1,2,3,4]
base_score=[0.25,0.5,0.75,1]

# Define the grid of hyperparameters to search
hyperparameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    'booster':booster,
    'base_score':base_score
    }
random_cv = RandomizedSearchCV(estimator=regressor,
            param_distributions=hyperparameter_grid,
            cv=5, n_iter=50,
            scoring = 'neg_mean_absolute_error',n_jobs = 4,
            verbose = 5, 
            return_train_score = True,
            random_state=42)

In [None]:
random_cv.fit(X_train,y_train)

In [None]:
random_cv.best_estimator_

In [None]:
regressor = xgboost.XGBRegressor(base_score=0.25, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=2,
             min_child_weight=1, missing=None, monotone_constraints='()',
             n_estimators=900, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)

In [None]:
regressor.fit(X_train,y_train)

In [None]:
y_pred = regressor.predict(df_Test)

In [None]:
y_pred

In [None]:
pred=pd.DataFrame(y_pred)
samp = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
sub = pd.concat([samp['Id'],pred], axis=1)
sub.columns=['Id','SalePrice']

In [None]:
sub

**Uncomment the code below to generate the CSV file.**

In [None]:
#sub.to_csv('My_sub1.csv',index=False)