# Predictions on Different Data <a name = 'beginning'></a>

In this notebook I will experiment with data in order to get more accurate predictions of the models defined in FittingAndEvaluation notebook.   

The test data was taken from the Kaggle competition, so there is no direct access to house sale prices. Therefore, the competition submissions will be used to evaluate all attempts. Each attempt will be accompanied by a screenshot of the result. 

### Steps  
  
- [Scoring Method](#scoring)   
- [Data Loading](#loading)     
- [Prediction on test data processed in EDA notebook](#processed_EDA)   
- [Prediction on test data with different number of features](#diff_num)
- [Prediction on normalized test data](#norm_data)     
- [Prediction on scaled test data](#scaled)  



In [24]:
%store -r models
%store -r models_results
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

In [25]:
models

[LinearRegression(),
 Ridge(alpha=2.5384615384615383, random_state=0),
 Lasso(alpha=41, random_state=0),
 RandomForestRegressor(max_depth=27, max_features=12, min_samples_split=3,
                       n_estimators=1150, n_jobs=-1, random_state=0),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None, colsample_bytree=1,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.25, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=3,
              max_leaves=None, min_child_weight=6, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=17,
              n_jobs=None, num_parallel_tree=None, random_state=0, ...)]

Since the elastic net equals Lasso regression, let's exclude it from the list of models.

In [26]:
models.pop(3)
models

[LinearRegression(),
 Ridge(alpha=2.5384615384615383, random_state=0),
 Lasso(alpha=41, random_state=0),
 XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None, colsample_bytree=1,
              device=None, early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=0.25, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=3,
              max_leaves=None, min_child_weight=6, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=17,
              n_jobs=None, num_parallel_tree=None, random_state=0, ...)]

# Scoring Method <a name = 'scoring'></a>  

[Table of Content](#beginning)

In [27]:
# My metric for cross validation
def rmse_log(model, X, y):
    y_pred = model.predict(X)
    return np.sqrt(mean_squared_error(np.log1p(y), np.log1p(y_pred)))

# Data Loading <a name = 'loading'></a>  

[Table of Content](#beginning)

In [28]:
train_data = pd.read_csv('../../Data/train_data_processed.csv', index_col = 'Id')
X = train_data.copy()
y = X['SalePrice']
X = X.drop(['SalePrice'], axis = 1)

test_data = pd.read_csv('../../Data/test_data_processed.csv', index_col = 'Id')

# Prediction on test data processed in EDA notebook  <a name = 'processed_EDA'></a>  

[Table of Content](#beginning)

Let this be the starting point. Real predictions on real test data using models trained on data processed in an EDA notebook.

In [29]:
# Get predictions of my best models on test data processed in EDA notebook
for model in models:
     
    # fit / predict
    model.fit(X, y)
    prediction = model.predict(test_data)
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_processed_EDA.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)

![EDA](pics/EDA_result.png)

Lasso regression gives the best results so far.

# Prediction on test data with different number of features <a name = 'diff_num'></a>

[Table of Content](#beginning)

Here I will try to fit my model to a different number of features that I selected using mutual information in the EDA notebook.

In [30]:
# Get a list of columns
columns_list = list(X.columns)

In [31]:
# Get a number for each feature
for i in range(0, len(columns_list)):
    print(i, ' - ', columns_list[i])

0  -  OverallQual
1  -  GrLivArea
2  -  YearBuilt
3  -  GarageArea
4  -  TotalBsmtSF
5  -  FullBath
6  -  YearRemodAdd
7  -  2ndFlrSF
8  -  LotArea
9  -  Fireplaces
10  -  OpenPorchSF
11  -  Neighborhood_Blmngtn
12  -  Neighborhood_Blueste
13  -  Neighborhood_BrDale
14  -  Neighborhood_BrkSide
15  -  Neighborhood_ClearCr
16  -  Neighborhood_CollgCr
17  -  Neighborhood_Crawfor
18  -  Neighborhood_Edwards
19  -  Neighborhood_Gilbert
20  -  Neighborhood_IDOTRR
21  -  Neighborhood_MeadowV
22  -  Neighborhood_Mitchel
23  -  Neighborhood_NAmes
24  -  Neighborhood_NPkVill
25  -  Neighborhood_NWAmes
26  -  Neighborhood_NoRidge
27  -  Neighborhood_NridgHt
28  -  Neighborhood_OldTown
29  -  Neighborhood_SWISU
30  -  Neighborhood_Sawyer
31  -  Neighborhood_SawyerW
32  -  Neighborhood_Somerst
33  -  Neighborhood_StoneBr
34  -  Neighborhood_Timber
35  -  Neighborhood_Veenker
36  -  Foundation_BrkTil
37  -  Foundation_CBlock
38  -  Foundation_PConc
39  -  Foundation_Stone
40  -  Foundation_Wood
41  

In [32]:
# Create a DataFrame where we will store the results of models` performance
scores_diff_num_features = pd.DataFrame({'model': [], 'num_of_features': [], 'score': []})

In [33]:
# Try each of best models
for model in models:
    # On different number of features from 5 to 15 (considering One-Hot encoding)
    for number_of_columns in [5, 6, 7, 8, 9, 10, 11, 36, 41, 55, 65]:
        scores = cross_val_score(model, 
                                 X[columns_list[:number_of_columns]], 
                                 y, 
                                 cv = 5, 
                                 scoring = rmse_log)
        
        # Store the results
        scores_diff_num_features.loc[len(scores_diff_num_features)] = str(model), number_of_columns - 1, scores.mean()
    
    

Let's print the best result for every model.

In [34]:
min_score_rows = scores_diff_num_features.loc[scores_diff_num_features.groupby('model')['score'].idxmin()]
min_score_rows

Unnamed: 0,model,num_of_features,score
32,"Lasso(alpha=41, random_state=0)",64.0,0.118273
7,LinearRegression(),35.0,0.119104
20,"Ridge(alpha=2.5384615384615383, random_state=0)",54.0,0.119264
41,"XGBRegressor(base_score=None, booster=None, ca...",40.0,0.115584


We found the optimal number of features to train the models. Let's make a prediction.

In [35]:
# Create a list with best number of features for every model
cut_list = [36, 55, 65, 41, 36, 36]


for i in range(len(models)):
    # Get next model and corresponding num of features from cut_list
    model = models[i]
    num_features = cut_list[i] 
    
    # Shrinking the training data and fitting the model
    X_short = X.copy()
    X_short = X_short[columns_list[:num_features]]
    model.fit(X_short, y)
    
    # Making prediction on cutted test data
    prediction_short = model.predict(test_data[columns_list[:num_features]])
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_shorted.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)
    

![shorted result](pics/shorted_result.png)

There is no raise of accuracy.

# Prediction on normalized test data <a name = 'norm_data'></a>

[Table of Content](#beginning)

In the EDA notebook, we found that some features can be successfully normalized using the log function. Let's try linear models that have the greatest potential to benefit from normalized data.

In [36]:
# Load normalized data
train_data_stats = pd.read_csv('../../Data/train_data_stats.csv', index_col = 'Id')
test_data_stats = pd.read_csv('../../Data/test_data_stats.csv', index_col = 'Id')

# Prepare training data
X_stats = train_data_stats.copy()
y_stats = X_stats['SalePrice']
X_stats = X_stats.drop(['SalePrice'], axis = 1)

In [37]:
X_stats.head()

Unnamed: 0_level_0,OverallQual,GrLivArea,YearBuilt,GarageArea,TotalBsmtSF,FullBath,YearRemodAdd,2ndFlrSF,LotArea,Fireplaces,...,Exterior1st_BrkFace,Exterior1st_CemntBd,Exterior1st_HdBoard,Exterior1st_ImStucc,Exterior1st_MetalSd,Exterior1st_Plywood,Exterior1st_Stucco,Exterior1st_VinylSd,Exterior1st_Wd Sdng,Exterior1st_WdShing
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,7,7.444249,2003,548,6.75227,2,2003,854,8450,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,6,7.140453,1976,460,7.140453,2,1976,0,9600,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,7,7.487734,2001,608,6.824374,2,2002,866,11250,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,7,7.448334,1915,642,6.628041,1,1970,756,9550,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5,8,7.695303,2000,836,7.04316,2,2000,1053,14260,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


As mentioned only linear models will be applied.

In [38]:
linear_models = models[:3]
linear_models

[LinearRegression(),
 Ridge(alpha=2.5384615384615383, random_state=0),
 Lasso(alpha=41, random_state=0)]

In [39]:
cut_list_norm = [36, 55, 65]

for i in range(len(linear_models)):
    # Get next model and corresponding num of features from cut_list_norm
    model = linear_models[i]
    num_features = cut_list_norm[i] 
    
    # Shrink normalized train data and fit the model
    X_norm = X_stats.copy()
    X_norm = X_norm[columns_list[:num_features]]
    model.fit(X_norm, y_stats)
    
    # The target value in the training data is also normalized, 
    # so it is necessary to take the exponent to get the correct prediction
    prediction_norm = np.exp(model.predict(test_data_stats[columns_list[:num_features]]))
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_norm.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction_norm})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)

![norm result](pics/norm_result.png)

Lasso regression fitted on data processed in EDA notebook is still the best.

# Prediction on scaled test data <a name = 'scaled'></a>  

[Table of Content](#beginning)

The last option I'm going to try is data scaling. Feature scaling is a good thing. It improves model performance, prevents features from dominating, and increases the generalizability of the model. I expect good results.

In [40]:
scaler = StandardScaler()

# Scale data
X_train_scaled = scaler.fit_transform(X)
test_data_scaled = scaler.transform(test_data)

In [41]:
for i in range(len(models)):
    model = models[i]
    model.fit(X_train_scaled, y)
    
    prediction_scaled = model.predict(test_data_scaled)
    
    # Get submission name
    cut_model_name_after = str(model).find('(')
    model_name = str(model)[:cut_model_name_after]
    submission_name = model_name + '_scaled.csv'
    
    # Fit prediction to the submission format and save .csv
    submission_dataframe = pd.DataFrame({'Id': test_data.index, 'SalePrice': prediction_scaled})
    submission_dataframe.to_csv(f"Submissions/{submission_name}", index = False)

![scaled result](pics/scaled_result.png)

By a small margin, a new champion has emerged!    

**Conclusion:** Well, the model with the best predictive power is Lasso regression trained on scaled data.