# Predictions on Different Data <a name = 'beginning'></a>

In this notebook I experiment with different versions of the data to make the models' predictions more accurate.   

The test data comes from the Kaggle competition, so the actual house sale prices are not available. Each attempt is therefore evaluated through a competition submission and accompanied by a screenshot of the result. 

### Steps  

- [Scoring Method](#scoring)   
- [Prediction on test data processed in EDA notebook](#processed_EDA)   
- [Prediction on test data with different number of features](#diff_num)
- [Prediction on normalized test data](#norm_data)     
- Try to add some features to the test data



In [27]:
%store -r models
%store -r models_results
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score 
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [28]:
models

|  | linear_regression | ridge_regression | lasso_regression | elastic_net | decision_tree_regressor | random_forest_regressor | xgb_regressor |
|---|---|---|---|---|---|---|---|
| 0 | LinearRegression() | Ridge(alpha=2.641025641025641, random_state=0) | Lasso(alpha=41, random_state=0) | ElasticNet(alpha=41, l1_ratio=1.0, random_stat... | DecisionTreeRegressor(max_depth=6, max_feature... | RandomForestRegressor(max_depth=27, max_featur... | XGBRegressor(base_score=None, booster=None, ca... |


In [29]:
models_results.sort_values(ascending = False)

linear_regression          0.121499
random_forest_regressor   -0.111158
xgb_regressor             -0.112387
elastic_net               -0.117499
lasso_regression          -0.117499
ridge_regression          -0.118524
decision_tree_regressor   -0.137118
dtype: float64

# Scoring Method <a name = 'scoring'></a>   

[Table of Contents](#beginning)

In [30]:
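# Metric for cross_val_score: a callable with the (estimator, X, y) signature
# that returns the RMSE between log1p-transformed targets and predictions
def rmse_log(model, X, y):
    y_pred = model.predict(X)
    return np.sqrt(mean_squared_error(np.log1p(y), np.log1p(y_pred)))

# Metric for GridSearchCV: takes (y_true, y_pred) and uses log rather than log1p
def rmse_log_grid(y_true, y_pred):
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))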

custom_scorer = make_scorer(rmse_log_grid, greater_is_better=False)
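
The two metrics above plug into scikit-learn in different ways, and that affects the sign of the reported score. The sketch below uses synthetic data (`X_demo` and `y_demo` are made up for illustration) to show that a callable passed directly as `scoring` returns the positive error, while a scorer built with `greater_is_better=False` returns its negation. This may be why most values in `models_results` are negative, but that is an assumption, since the evaluation code lives in another notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import cross_val_score

def rmse_log(model, X, y):
    # Callable scorer: receives the fitted estimator and the held-out fold
    return np.sqrt(mean_squared_error(np.log1p(y), np.log1p(model.predict(X))))

def rmse_log_grid(y_true, y_pred):
    # Plain metric: must be wrapped with make_scorer before use
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))

# Synthetic, strictly positive target (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 10, size=(100, 3))
y_demo = 50_000 + X_demo @ np.array([10_000.0, 5_000.0, 2_000.0]) + rng.normal(0, 1_000, 100)

neg_scorer = make_scorer(rmse_log_grid, greater_is_better=False)

# Positive values: the callable's return value is used as-is
scores_callable = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring=rmse_log)
# Negative values: make_scorer flips the sign so that "greater is better" holds
scores_scorer = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring=neg_scorer)

print(scores_callable.mean(), scores_scorer.mean())
```

For prices of this magnitude, `log` and `log1p` differ only marginally, so the two scores agree up to sign.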

# Data Loading

In [31]:
train_data = pd.read_csv('../../Data/train_data_processed.csv', index_col = 'Id')
X = train_data.copy()
y = X['SalePrice']
X = X.drop(['SalePrice'], axis = 1)

test_data = pd.read_csv('../../Data/test_data_processed.csv', index_col = 'Id')

# Prediction on test data processed in EDA notebook  <a name = 'processed_EDA'></a>  

[Table of Contents](#beginning)

In [32]:
# Get the models from the model storage created in the FittingAndEvaluation notebook
linear_regression = models['linear_regression'][0]
ridge_regression = models['ridge_regression'][0]
lasso_regression = models['lasso_regression'][0]
decision_tree = models['decision_tree_regressor'][0]
random_forest = models['random_forest_regressor'][0]
xgb_regressor = models['xgb_regressor'][0]

# Create a list of these models
try_models = [linear_regression, ridge_regression, lasso_regression, decision_tree, random_forest, xgb_regressor]

In [33]:
# Get predictions of my best models on test data processed in EDA notebook
for model in try_models:
   
    # fit / predict
    model.fit(X, y)
    prediction = model.predict(test_data)
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_processed_EDA.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)
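
The loop above builds each file name by cutting `str(model)` at the first parenthesis. A minimal alternative sketch, assuming only that the models are ordinary Python objects, reads the class name directly. The helper `submission_filename` is hypothetical and not part of the original notebook.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

def submission_filename(model, suffix):
    """Hypothetical helper: build '<ClassName><suffix>.csv' from an estimator."""
    return f"{type(model).__name__}{suffix}.csv"

# Example estimators with arbitrary hyperparameters (illustrative only)
demo_models = [Ridge(alpha=2.0), RandomForestRegressor(max_depth=3)]
names = [submission_filename(m, '_processed_EDA') for m in demo_models]
print(names)
```

This avoids depending on how the estimator's string representation is formatted.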

# Prediction on test data with different number of features <a name = 'diff_num'></a>

[Table of Contents](#beginning)

Here I fit each model on different numbers of features. The features were selected using mutual information in the EDA notebook.
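
The selection itself happens in the EDA notebook and is not shown here. As a rough sketch of how a mutual-information ranking can be produced, the example below scores synthetic features with `mutual_info_regression` and keeps the top ones. The column names and data are invented for illustration, and the EDA notebook may have done this differently.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: one strongly informative, one weakly informative, one noise column
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    'strong': rng.normal(size=n),
    'weak': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
y_demo = 5 * X_demo['strong'] + 0.5 * X_demo['weak'] + rng.normal(scale=0.1, size=n)

# Estimate mutual information between each feature and the target
mi = pd.Series(
    mutual_info_regression(X_demo, y_demo, random_state=0),
    index=X_demo.columns,
)

# Order columns from most to least informative, then keep the first k
ranked_columns = mi.sort_values(ascending=False).index.tolist()
top_k = ranked_columns[:2]
print(ranked_columns)
```

With columns ordered this way, slicing the first `k` names gives the `k` most informative features, which is the pattern the cells below rely on.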

In [34]:
# Get a list of columns
columns_list = list(X.columns)

In [35]:
# Print the positional index of each feature
for i, column in enumerate(columns_list):
    print(i, ' - ', column)

0  -  OverallQual
1  -  GrLivArea
2  -  YearBuilt
3  -  GarageArea
4  -  TotalBsmtSF
5  -  FullBath
6  -  YearRemodAdd
7  -  2ndFlrSF
8  -  LotArea
9  -  Fireplaces
10  -  OpenPorchSF
11  -  Neighborhood_Blmngtn
12  -  Neighborhood_Blueste
13  -  Neighborhood_BrDale
14  -  Neighborhood_BrkSide
15  -  Neighborhood_ClearCr
16  -  Neighborhood_CollgCr
17  -  Neighborhood_Crawfor
18  -  Neighborhood_Edwards
19  -  Neighborhood_Gilbert
20  -  Neighborhood_IDOTRR
21  -  Neighborhood_MeadowV
22  -  Neighborhood_Mitchel
23  -  Neighborhood_NAmes
24  -  Neighborhood_NPkVill
25  -  Neighborhood_NWAmes
26  -  Neighborhood_NoRidge
27  -  Neighborhood_NridgHt
28  -  Neighborhood_OldTown
29  -  Neighborhood_SWISU
30  -  Neighborhood_Sawyer
31  -  Neighborhood_SawyerW
32  -  Neighborhood_Somerst
33  -  Neighborhood_StoneBr
34  -  Neighborhood_Timber
35  -  Neighborhood_Veenker
36  -  Foundation_BrkTil
37  -  Foundation_CBlock
38  -  Foundation_PConc
39  -  Foundation_Stone
40  -  Foundation_Wood
41  

In [36]:
# Create a DataFrame where we will store the results of the models' performance
scores_diff_num_features = pd.DataFrame({'model': [], 'num_of_features': [], 'score': []})

In [37]:
# Try each of the best models
for model in try_models:
    # On different numbers of leading columns, from 5 to 65 (each one-hot encoded column counts separately)
    for number_of_columns in [5, 6, 7, 8, 9, 10, 11, 36, 41, 55, 65]:
        scores = cross_val_score(model, 
                                 X[columns_list[:number_of_columns]], 
                                 y, 
                                 cv = 5, 
                                 scoring = rmse_log)
        
        # Store the results; num_of_features records number_of_columns - 1, the index of the last column used
        scores_diff_num_features.loc[len(scores_diff_num_features)] = str(model), number_of_columns - 1, scores.mean()
    
    

Check the result.

In [45]:
pd.set_option('display.max_rows', None)
scores_diff_num_features['model'] = scores_diff_num_features['model'].astype(str)

In [47]:
min_score_rows = scores_diff_num_features.loc[scores_diff_num_features.groupby('model')['score'].idxmin()]
min_score_rows

|  | model | num_of_features | score |
|---|---|---|---|
| 41 | DecisionTreeRegressor(max_depth=6, max_feature... | 40.0 | 0.145225 |
| 32 | Lasso(alpha=41, random_state=0) | 64.0 | 0.118355 |
| 7 | LinearRegression() | 35.0 | 0.1192 |
| 51 | RandomForestRegressor(max_depth=27, max_featur... | 35.0 | 0.110095 |
| 20 | Ridge(alpha=2.641025641025641, random_state=0) | 54.0 | 0.119384 |
| 62 | XGBRegressor(base_score=None, booster=None, ca... | 35.0 | 0.112119 |


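The table shows, for each model, the feature subset with the lowest cross-validated error. Because `num_of_features` stores the column count minus one, the slice sizes used below are one larger than the values in the table. 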
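
The slice sizes can also be derived from the results table instead of being typed in by hand. The sketch below does this on a small made-up table (`scores_demo` and the model labels `A` and `B` are illustrative), assuming, as in the loop above, that the stored count is the number of columns minus one.

```python
import pandas as pd

# Toy results table with the same columns as scores_diff_num_features
scores_demo = pd.DataFrame({
    'model': ['A', 'A', 'A', 'B', 'B', 'B'],
    'num_of_features': [4, 35, 64, 4, 35, 64],
    'score': [0.15, 0.12, 0.13, 0.16, 0.14, 0.11],
})

# Row with the lowest error for each model
best_rows = scores_demo.loc[scores_demo.groupby('model')['score'].idxmin()]

# Add 1 back to recover the slice size used in columns_list[:n]
best_cut = (best_rows.set_index('model')['num_of_features'] + 1).astype(int).to_dict()
print(best_cut)
```

Keying the result by model makes it harder to pair a model with the wrong count than a positional list would.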

In [52]:
# Number of leading columns for each model, in the same order as try_models
cut_list = [36, 55, 65, 41, 36, 36]

for i in range(len(try_models)):
    model = try_models[i]
    num_features = cut_list[i] 
    
    X_short = X.copy()
    X_short = X_short[columns_list[:num_features]]
    model.fit(X_short, y)
    
    prediction_short = model.predict(test_data[columns_list[:num_features]])
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_shorted.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction_short})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)
    

LinearRegression
36
LinearRegression_shorted.csv 

Ridge
55
Ridge_shorted.csv 

Lasso
65
Lasso_shorted.csv 

DecisionTreeRegressor
41
DecisionTreeRegressor_shorted.csv 

RandomForestRegressor
36
RandomForestRegressor_shorted.csv 

XGBRegressor
36
XGBRegressor_shorted.csv 



# Prediction on normalized test data <a name = 'norm_data'></a>

[Table of Contents](#beginning)

In [55]:
train_data_stats = pd.read_csv('../../Data/train_data_stats.csv')
test_data_stats = pd.read_csv('../../Data/test_data_stats.csv')

X_stats = train_data_stats.copy()
y_stats = X_stats['SalePrice']
X_stats = X_stats.drop(['SalePrice'], axis = 1)

In [56]:
X_stats.head()

|  | Id | OverallQual | GrLivArea | YearBuilt | GarageArea | TotalBsmtSF | FullBath | YearRemodAdd | 2ndFlrSF | Fireplaces | ... | Exterior1st_AsbShng | Exterior1st_BrkFace | Exterior1st_CemntBd | Exterior1st_HdBoard | Exterior1st_MetalSd | Exterior1st_Plywood | Exterior1st_Stucco | Exterior1st_VinylSd | Exterior1st_Wd Sdng | Exterior1st_WdShing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 7.444249 | 2003 | 548 | 6.75227 | 2 | 2003 | 854 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 2 | 6 | 7.140453 | 1976 | 460 | 7.140453 | 2 | 1976 | 0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 3 | 7 | 7.487734 | 2001 | 608 | 6.824374 | 2 | 2002 | 866 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 4 | 7 | 7.448334 | 1915 | 642 | 6.628041 | 1 | 1970 | 756 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 5 | 8 | 7.695303 | 2000 | 836 | 7.04316 | 2 | 2000 | 1053 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [59]:
try_linear_models = try_models[:3]
try_linear_models

[LinearRegression(),
 Ridge(alpha=2.641025641025641, random_state=0),
 Lasso(alpha=41, random_state=0)]

In [60]:
cut_list_norm = [36, 65, 65]

for i in range(len(try_linear_models)):
    model = try_linear_models[i]
    num_features = cut_list_norm[i] 
    
    X_norm = X_stats.copy()
    X_norm = X_norm[columns_list[:num_features]]
    model.fit(X_norm, y_stats)
    
    prediction_norm = np.exp(model.predict(test_data_stats[columns_list[:num_features]]))
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_norm.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction_norm})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)
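
The loop above passes predictions through `np.exp` before saving. That suggests the target in `train_data_stats.csv` is stored on a log scale, which is an assumption, since the file's preparation is not shown here. The sketch below, using a few made-up prices, shows why the inverse has to match the forward transform: `np.exp` undoes `np.log`, `np.expm1` undoes `np.log1p`, and mixing them shifts every prediction by exactly 1.

```python
import numpy as np

# Illustrative sale prices (not taken from the dataset)
prices = np.array([100_000.0, 208_500.0, 755_000.0])

log_prices = np.log(prices)
log1p_prices = np.log1p(prices)

# Matching pairs recover the original values
restored_log = np.exp(log_prices)
restored_log1p = np.expm1(log1p_prices)

# Mismatched pair: exp(log1p(p)) equals p + 1
mismatch = np.exp(log1p_prices)

print(restored_log, restored_log1p, mismatch - prices)
```

At house-price magnitudes a shift of 1 is negligible for the leaderboard score, but keeping the pair consistent avoids a silent bias if the same code is reused on small targets.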