The link for the Advanced regression dataset comes from:
https://www.kaggle.com/c/house-prices-advanced-regression-techniques

In [1]:
#Let's read the dataset 

import pandas as pd

df=pd.read_csv('train.csv')

#let's see the contents of dataset (will only view first 5 rows)

df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [2]:
# Column Id is redundant , remove it!

df=df.drop(['Id'],axis=1)
df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


1. Now let's split the dataset based on numerical, categorical features
2. before doing that, let's separate the dependent/target variable i.e. SalePrice , let's name it as train_y


In [3]:
train_y=df['SalePrice']

df_num=df.select_dtypes(include=['int','float']).copy() #numerical features

df_cat=df.select_dtypes(include=['object']).copy() #categorical features


#lets drop the target variable from numerical features set

df_num.drop(['SalePrice'],axis=1)

#now you can see that SalePrice feature isn't showing up in df_num. ok cool.

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,548,0,61,0,0,0,0,0,2,2008
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,460,298,0,0,0,0,0,0,5,2007
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,...,608,0,42,0,0,0,0,0,9,2008
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,...,642,0,35,272,0,0,0,0,2,2006
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,...,836,192,84,0,0,0,0,0,12,2008
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,60,62.0,7917,6,5,1999,2000,0.0,0,0,...,460,0,40,0,0,0,0,0,8,2007
1456,20,85.0,13175,6,6,1978,1988,119.0,790,163,...,500,349,0,0,0,0,0,0,2,2010
1457,70,66.0,9042,7,9,1941,2006,0.0,275,0,...,252,0,60,0,0,0,0,2500,5,2010
1458,20,68.0,9717,5,6,1950,1996,0.0,49,1029,...,240,366,0,112,0,0,0,0,4,2010


let's check number of numerical features and number of categorical features.!

In [4]:
print("number of numerical features are: ",df_num.shape[1])
print("number of categorical features are: ",df_cat.shape[1])

number of numerical features are:  37
number of categorical features are:  43


Let's check if we have any missing values in the columns of our dataset !

In [5]:
def missing_values(df):
    nan_values=df_num.isna()
    nan_columns=nan_values.any() #It will tell if any of the columns have 

    # just uncomment next 2 lines i.e. remove the # from the lines and see if there are any missing values or not
    # print(nan_values)
    # print(nan_columns)

    #next 2 lines will let you see the name of the columns with missing values

    columns_with_nan = df_num.columns[nan_columns].tolist()
    if len(columns_with_nan)>=1:
        print("you have missing values")
    else:print("you don't have missing values")
    return columns_with_nan

In [6]:
#Let's see if any of the numerical features have missing values or not
misslist=missing_values(df_num)
print(misslist)

you have missing values
['LotFrontage', 'MasVnrArea', 'GarageYrBlt']


In [7]:
#Now let's deal with missing values..
#Lets fill up the missing positions by the average of the values of that particular features

nacol=misslist
for i in range(len(nacol)):
    m=df_num[nacol[i]].mean() # took the mean of feature 'LotFrontage' when i=0, 'ManVnrArea' when i=1 and so for 2.
    df_num[nacol[i]].fillna(m,inplace=True)
    
#we have filled missing values, but let's check to make sure. ok?

misslist=missing_values(df_num)


you don't have missing values


..
Now, we shouldn't take all the features for regression, why? if we do use all features, it might affect negatively ! that is, reduce the model efficacy. 

Let's begin some feature engineering.!


We will use P-test on features to choose good features. The p value will help us figure out the promising features. we will only take those features who give p values less than 0.05 while working with target SalePrice i.e. train_y. Let's do it. !


In [13]:
def feature_selection(df_num):
    from sklearn.datasets import load_boston
    import pandas as pd
    import numpy as np
    import matplotlib
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm


    #df_num.drop('SalePrice',axis=1,inplace=True)

    X_1 = sm.add_constant(df_num)
    #Fitting sm.OLS model
    model = sm.OLS(train_y,X_1.astype(float)).fit()
    #print(model.pvalues)

    #Backward Elimination
    cols = list(df_num.columns)
    pmax = 1
    while (len(cols)>0):
        p= []
        X_1 = df_num[cols]
        X_1 = sm.add_constant(X_1)
        model = sm.OLS(train_y,X_1.astype(float)).fit()
        p = pd.Series(model.pvalues.values[1:],index = cols)      
        pmax = max(p)
        feature_with_p_max = p.idxmax()
        if(pmax>0.05):
            cols.remove(feature_with_p_max)
        else:
            break
    selected_features_BE = cols
    print("The selected features are:")
    print(selected_features_BE)


    df_n=df_num[selected_features_BE]

    return df_n


**Section F**

In the above, we made a function, which will make a list of important features.
Ok cool.

Now, we are gonna train our machine learning ...ummm. regression model..!! 

We will invoke scikit learn library to use their regression models !

And at first, we will use only numerical features to train and validate our model i.e. we will use df_num.

In [15]:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_selection import f_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE

#from sklearn import feature_selection
df_new=df_num
#df_new['LandContour']=df_cat['LandContour']
#df_new=df_new.drop(['SalePrice'],axis=1)
print(df_new.shape)

df_new=feature_selection(df_new)

#print(df_new.shape)



from sklearn.preprocessing import OrdinalEncoder


################
#X_new2 = SelectKBest(chi2, k=220).fit_transform(df_new,train_y)
#print(X_new2.shape)
#############
X_train = df_new[:-120]
X_test = df_new[-120:]

# Split the targets into training/testing sets
y_train = train_y[:-120]
y_test = train_y[-120:]


poly = PolynomialFeatures(degree=2)
X_ = poly.fit_transform(X_train)

X_test_ = poly.fit_transform(X_test)

from sklearn.linear_model import LinearRegression
lg1 = LinearRegression()

lg1.fit(X_, y_train)
y_pred=lg1.predict(X_test_)
print('Mean squared error: %.2f'
      % mean_squared_error(y_test, y_pred))
lg1.score(X_test_, y_test)

(1460, 37)


  return ptp(axis=axis, out=out, **kwargs)


The selected features are:
['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'TotalBsmtSF', 'GrLivArea', 'BsmtFullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'WoodDeckSF', 'ScreenPorch', 'LandContour']
Mean squared error: 554329220.81


0.8960459586862444

So, we got score 86.12% by using numerical features and doing feature engineering on them. What if we didn't use the feature engineering?

Let's check it. just write df_new=df_num in line number 15 above and run it. 

what's the accuracy?
It's -8.71, it's terrible! The MSE is 11 digit !

So, now it's obvious that we should do feature engineering.

**Section G**
<br>
Now, let's add a categorical value to the train data,but before it, we have to convert it into numerical values. 

In [None]:
#We are taking the feature LandContour to convert it's string values to numerical values
#we will set, Bnk=1, Lvl=2, Low=3, HLS=4
cleanup_nums = {"LandContour": {"Lvl":2, "Bnk": 1, "HLS": 4, "Low":3}}
                
df_cat.replace(cleanup_nums, inplace=True)

#Now we will add it to the training set.

df_new['LandContour']=df_cat['LandContour']


copy line 9 and paste in line 12 of section F.
now run section F, and check & compare the score!

The score is 89.6% !
! Just adding a single categorical feature increased the score by 3% !! 

Now the question is, how I assigned the numerical values to the categorical values Lvl,Bnk,HLS,Low ? 