**Boston House Price Prediction**

There are total 80 input variables used to predict the output "SalePrice"! More unfortunately, I can barely identify the all details of those inputs so that I must take the quantitative methods to do feature selections. 

The whole of machine learning obtains two steps, one is data-processing and the another one is model-building. In general, 70% time and effort will be taken by the former.


In [1]:
# # Basical API
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import math

# Sklearn support
from sklearn import preprocessing # dataset preprocess
from sklearn.feature_selection import SelectKBest,mutual_info_regression # feature selection
from sklearn.model_selection import StratifiedKFold # k fold cross-validation
from sklearn.model_selection import cross_val_score 
from sklearn.linear_model import LinearRegression # Linear
from sklearn.linear_model import Ridge # Ridge
from sklearn.linear_model import Lasso # Lasso
from sklearn.svm import SVR # SVR
from sklearn.neighbors import KNeighborsRegressor # KNR
from sklearn.ensemble import RandomForestRegressor # RFR

# Import "Boston House Price" datasets
train_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')
pred_df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

In [2]:
# Missing Values
"""
In train_df and pred_df, the features "Alley","FireplaceQu","PoolQC","Fence","MiscFeature" 
perform so terrible in missing values problems that we need to drop them in both train_df and pred_df. 
Besides, the featuer "ID" is meaningless and useless. For other featuers with missing values, we can fill them by some methods.
"""

missing_train = train_df.isnull().sum()[train_df.isnull().sum()>0]
print("missing values of train_df:"+"\n",missing_train/train_df.shape[0],"\n")

missing_pred = pred_df.isnull().sum()[pred_df.isnull().sum()>0]
print("missing values of pred_df:"+"\n",missing_pred/pred_df.shape[0],"\n")

# drop features
train_df = train_df.drop(columns = ['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'])
pred_df = pred_df.drop(columns = ['Id', 'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'FireplaceQu'])

# fill features
length = train_df.shape[1]
for i in range(length-1):
    if train_df.iloc[:,i].isnull().sum()>0:
        if train_df.iloc[:,i].dtype=="object":
            train_df.iloc[:,i] = train_df.iloc[:,i].fillna(train_df.iloc[:,i].mode()[0])
        else:
            train_df.iloc[:,i] = train_df.iloc[:,i].fillna(train_df.iloc[:,i].mean())
    
    if pred_df.iloc[:,i].isnull().sum()>0:
        if pred_df.iloc[:,i].dtype=="object":
            pred_df.iloc[:,i] = pred_df.iloc[:,i].fillna(pred_df.iloc[:,i].mode()[0])
        else:
            pred_df.iloc[:,i] = pred_df.iloc[:,i].fillna(pred_df.iloc[:,i].mean())


missing values of train_df:
 LotFrontage     0.177397
Alley           0.937671
MasVnrType      0.005479
MasVnrArea      0.005479
BsmtQual        0.025342
BsmtCond        0.025342
BsmtExposure    0.026027
BsmtFinType1    0.025342
BsmtFinType2    0.026027
Electrical      0.000685
FireplaceQu     0.472603
GarageType      0.055479
GarageYrBlt     0.055479
GarageFinish    0.055479
GarageQual      0.055479
GarageCond      0.055479
PoolQC          0.995205
Fence           0.807534
MiscFeature     0.963014
dtype: float64 

missing values of pred_df:
 MSZoning        0.002742
LotFrontage     0.155586
Alley           0.926662
Utilities       0.001371
Exterior1st     0.000685
Exterior2nd     0.000685
MasVnrType      0.010966
MasVnrArea      0.010281
BsmtQual        0.030158
BsmtCond        0.030843
BsmtExposure    0.030158
BsmtFinType1    0.028787
BsmtFinSF1      0.000685
BsmtFinType2    0.028787
BsmtFinSF2      0.000685
BsmtUnfSF       0.000685
TotalBsmtSF     0.000685
BsmtFullBath    0.001371
B

In [3]:
# Categories Encoding
"""
1 Target Encoding recodes categorical features by computing the probability of different categories or types resoectively.

2 Buiding the model of target encoding based on train_df, and then using it to transform test_df or pred_df, 
which requests same distribution or structure of train_df and pred_df.
"""
num_features = [] # numerical features
cat_features = [] # categorical features
columns = train_df.columns[:-1] # feature names except "SalePrice"
for col in columns:
    if train_df[col].dtype == "object":
        cat_features.append(col)
    else:
        num_features.append(col)

def target_encoding(target_feature):
    if len(cat_features) == 0:
        return 'All features are encoded'
    target_std = train_df[target_feature].std()
    for f in cat_features:
        train_df[f] = train_df.groupby(f)[target_feature].transform('mean')
        pred_df[f] = pred_df.groupby(f)[target_feature].transform('mean')
            
for f in num_features:
    target_encoding(f)

In [4]:
# Standardization
std_scale = preprocessing.StandardScaler()
columns = train_df.columns[:-1]
for col in columns:
    train_df[col] = std_scale.fit_transform(train_df[col].values.reshape(-1,1))
    pred_df[col] = std_scale.fit_transform(pred_df[col].values.reshape(-1,1))

In [5]:
# Features Selection By "mutual_info_regression"
X_df = train_df.iloc[:,:-1] # input X
Y_df = train_df.iloc[:,-1] # output Y
skb = SelectKBest(mutual_info_regression,k=30).fit(X_df,Y_df) # save 30 features with better performance
Index = skb.get_support(indices=True)
train_df = train_df.iloc[:,np.append(Index,train_df.shape[1]-1)]
pred_df = pred_df.iloc[:,Index]

In [6]:
# Models Building and Selection
Model = [LinearRegression(), Ridge(), Lasso(), SVR(), KNeighborsRegressor(), RandomForestRegressor()]
RMSE = []
X_df = train_df.iloc[:,:-1] # dataset of features 
Y_df = train_df.iloc[:,-1] # dataset of label 

for model in Model:
    kfold = StratifiedKFold(n_splits = 10, random_state = 1, shuffle=True)
    mse_results = cross_val_score(model, X_df, Y_df, cv = kfold, scoring = 'neg_mean_squared_error')
    rmse = np.sqrt(-mse_results)
    RMSE.append(round(rmse.mean(),2))

print("RMSE for models:"+"\n"+  
"LinearRegression: "+str(RMSE[0])+"\n"+
"Ridge: "+str(RMSE[1])+"\n"+
"Lasso: "+str(RMSE[2])+"\n"+
"SVR: "+str(RMSE[3])+"\n"+
"KNeighborsRegression: "+str(RMSE[4])+"\n"+
"RandomForestRegression: "+str(RMSE[5])+"\n"
)

  coef_, l1_reg, l2_reg, X, y, max_iter, tol, rng, random, positive


RMSE for models:
LinearRegression: 34818.53
Ridge: 34802.69
Lasso: 34815.95
SVR: 80963.78
KNeighborsRegression: 34883.23
RandomForestRegression: 27859.19



RMSE for models:

LinearRegression: 34668.13

Ridge: 34651.71

Lasso: 34665.2

SVR: 80963.91

KNeighborsRegression: 35698.96

RandomForestRegression: 28343.31

In [7]:
# Prediction
RFR = RandomForestRegressor().fit(X_df, Y_df)
pred_y = RFR.predict(pred_df)

print(pred_y)

df = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')
submission = pd.DataFrame({
        "Id": df["Id"],
        "SalePrice": pred_y
    })
submission.to_csv('./submission.csv', index=False)

[132161.58 161813.5  177714.27 ... 167702.   117433.87 230111.3 ]
