
...... continued from previous Jupyter Notebook 1 project 2

### Contents:
- [IMPORTING OF RELEVANT LIBRARIES](#IMPORTING_OF_RELEVANT_LIBRARIES)
- 5. CLEANING DATASET (continued)
    - [F. GETTING THE DATA FROM CSV FORMAT AND MERGING TO GET_DUMMIES](#GETTING_THE_DATA_FROM_CSV_FORMAT )
    - [G. USING 'GET_DUMMIES' FUNCTION TO CONVERT NOMINAL DATA TO BINARY OUTPUT](#USING_'GET_DUMMIES'_FUNCTION)
    - [H. SAVING FINAL DATASETS IN CSV FORMAT](#SAVING_FINAL_'TRAIN'_DATASET_DATAFRAME_IN_CSV_FORMAT)
- [6. IDENTIFYING FEATURES FOR DESIGNING THE MODEL](#IDENTIFYING_FEATURES_FOR_DESIGNING_THE_MODEL)
    - [1. Filtering by correlation between the features and the target variable](#Filtering_by_correlation)
    - [2. Recursive Feature Elimination (RFE) to identify optimum number of features](#Recursive_Feature_Engineering)  
- [7. MODEL BUILDING](#MODEL_BUILDING)
    - [A. MODEL PREPARATION](#MODEL_PREPARATION)
    - [B. Model 1 - correlated_features 1](#Model_1_-_corr_features1)
    - [C. Model 2 - correlated features 2](#Model_2_-_corr_features2)
    - [D. Model 3 - rfe features](#Model_3_-_rfe_features)
- [8. EVALUATION OF THE MODELS](#EVALUATION_OF_THE_MODELS)
- [9. SUBMISSION TO KAGGLE](#SUBMISSION_TO_KAGGLE)
- [10. CONCLUSION](#CONCLUSION)

 

## IMPORTING OF RELEVANT LIBRARIES

In [None]:
import pandas as pd

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
from sklearn import metrics

%matplotlib inline

In [None]:
# Setting options to view entire dataset

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## F. GETTING THE DATA FROM CSV FORMAT - 'df_train_alt' and 'df_test_alt'

#### In the previous Jupyter notebook, 'train' dataset was cleaned for missing and null values and ordinal values replaced with integers. The final 'train' dataset was saved as csv file 'df_train_alt_aft_ord.csv'. Similarly the 'test' dataset was also cleaned and ordinal values replaced with integers. The final 'test' dataset was saved as 'df_test_alt_aft_ord.csv'. 

#### In this Jupyter notebook, the 2 dataframes will be merged and the dummy columns included for nominal values.

#### Model design starts with feature selection using two methods: filtering by correlation and Recursive Feature Elimination (RFE). Linear Regression and its regularised variants, Lasso and Ridge, are then fitted to study which gives the best fit.

#### The final predictions of the selected model are uploaded to Kaggle to get the Kaggle score. 

#### The Kaggle score can be used as one indicator of the robustness of the model, in addition to other measures such as the R-squared score and the Root Mean Square Error (RMSE).
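As a small worked illustration of the two measures named above, the sketch below computes the R-squared score and the RMSE on four made-up values. The arrays `y_true` and `y_hat` are hypothetical and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals
mse = mean_squared_error(y_true, y_hat)

# Root mean square error: same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
r2 = r2_score(y_true, y_hat)

print("MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

A lower RMSE and a higher R-squared both indicate a closer fit; RMSE is expressed in the units of 'SalePrice', which makes it easier to compare with the Kaggle score.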

In [None]:
# getting data from csv files - training data ('df_train_alt_aft_ord.csv')

df_train_alt = pd.read_csv('../datasets/df_train_alt_aft_ord.csv')
df_train_alt.drop(['Unnamed: 0'], axis=1, inplace=True)
df_train_alt.head()

In [None]:
# getting data from csv files - test data (df_test_alt_aft_ord.csv)

df_test_alt = pd.read_csv('../datasets/df_test_alt_aft_ord.csv') 
df_test_alt.drop(['Unnamed: 0'], axis=1, inplace=True)
df_test_alt.head()

In [None]:
df_train_alt.shape, df_test_alt.shape  

## G. MERGING TRAIN AND TEST DATASETS TO PROCEED TO GET_DUMMIES FOR NOMINAL  DATA

In [None]:
# merging train and test data 

df_merged = pd.concat([df_train_alt, df_test_alt], sort=False)
df_merged.shape

In [None]:
## creating dummy columns for nominal columns - df_merged_dummies

columns_to_getdummies =['MS SubClass','MS Zoning', 'Alley', 'Land Contour', 'Lot Config', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Foundation', 'Heating', 'Garage Type', 'Misc Feature', 'Sale Type']
df_merged_dummies = pd.get_dummies(df_merged, columns=(columns_to_getdummies), prefix=(columns_to_getdummies), prefix_sep='')
df_merged_dummies.head()

In [None]:
df_merged_dummies.shape

### Splitting of dataset back to train and test set after addition of dummy columns for nominal data

In [None]:
#splitting of merged dataframe back to train and test sets

n_train = len(df_train_alt)   # number of rows contributed by the train set
df_train = df_merged_dummies.iloc[:n_train, :]
df_test = df_merged_dummies.iloc[n_train:, :]

In [None]:
df_train.shape, df_test.shape

In [None]:
df_test = df_test.loc[:, df_test.columns != 'SalePrice']
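The reason for merging before calling `get_dummies` can be seen on a toy example. This is a minimal sketch with hypothetical frames (`toy_train`, `toy_test`) and a made-up 'Zone' column: encoding each set separately produces different dummy columns, while encoding the merged frame and splitting it back by row count keeps the columns aligned.

```python
import pandas as pd

# Hypothetical toy frames; 'Zone' and its categories are assumptions for illustration
toy_train = pd.DataFrame({'Zone': ['RL', 'RM', 'RL'], 'SalePrice': [100, 120, 110]})
toy_test = pd.DataFrame({'Zone': ['RL', 'FV']})

# Encoding separately: each frame only gets columns for the categories it contains
sep_train = pd.get_dummies(toy_train, columns=['Zone'])
sep_test = pd.get_dummies(toy_test, columns=['Zone'])

# Encoding after merging: both parts share the same dummy columns
toy_merged = pd.concat([toy_train, toy_test], sort=False, ignore_index=True)
toy_dummies = pd.get_dummies(toy_merged, columns=['Zone'])

# Splitting back by the number of train rows, so no row is skipped
n_toy_train = len(toy_train)
enc_train = toy_dummies.iloc[:n_toy_train]
enc_test = toy_dummies.iloc[n_toy_train:].drop(columns='SalePrice')

print(sorted(sep_train.columns), sorted(sep_test.columns))
print(list(enc_train.columns), list(enc_test.columns))
```

Splitting on `len(toy_train)` rather than on a hard-coded row number keeps the split correct if the number of cleaned rows changes.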


## H. SAVING 'TRAIN' AND 'TEST' DATASET DATAFRAME IN CSV FORMAT

In [None]:
# saving final dataframes in csv format

df_train.to_csv('../datasets/df_train_final.csv')
df_test.to_csv('../datasets/df_test_final.csv')

## 6. IDENTIFYING FEATURES FOR DESIGNING THE MODEL

#### To identify features from the 239 columns and find out which are most correlated to 'SalePrice', two methods will be used

#### 1. Filtering using correlation (Pearson's Correlation Co-efficient) to identify features with highest correlation to SalePrice

#### 2. Recursive Feature Elimination (RFE) to identify optimum number of features


### 1. Filtering by correlation between the features and the target variable 

#### The correlation between each feature and the target variable 'SalePrice' is computed, and features with an absolute correlation above 0.5 are selected. After identifying these features, it is necessary to check for correlation among them, since Linear Regression assumes that the independent features are not highly correlated with one another.

In [None]:
y = df_train['SalePrice']
X_all = df_train.loc[:, df_train.columns != 'SalePrice']

In [None]:
# filtering using correlation

cor = df_train.corr()

#Correlation with target variable
cor_target = abs(cor["SalePrice"])

#Selecting highly correlated features
relevant_features = cor_target[cor_target>0.5]
relevant_features.sort_values(ascending=False)

In [None]:
# checking for correlation between features identified through filtering using correlation matrix

corr_features = ['Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 
                 'Total Bsmt SF', '1st Flr SF', 'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 
                 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'TotRms AbvGrd', 'Mas Vnr Area']

# Checking for correlation among the features
plt.figure(figsize=(10,10))
mask = np.zeros_like(X_all[corr_features].corr())
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.heatmap(X_all[corr_features].corr(), mask=mask, annot=True)


#### The 17 features listed below have been identified as having an absolute correlation value above 0.5. 

#### features = 'Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 'Total Bsmt SF', '1st Flr SF',         'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'TotRms AbvGrd', 'Mas Vnr Area'.   
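Besides reading the heatmap, strongly correlated feature pairs can be listed programmatically. The sketch below is one possible way to do this; the helper `high_corr_pairs` and the synthetic columns 'a', 'b' and 'c' are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})

pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

Applied to `X_all[corr_features]`, a call of this kind would list the pairs that are candidates for dropping one member.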


### 2. Recursive Feature Elimination (RFE) to identify optimum number of features


In [None]:
# using RFE to identify optimum number of features

#no of features
nof_list=np.arange(1,20)            
high_score=0

#Variable to store the optimum features
nof=0           
score_list =[]
for n in range(len(nof_list)): 
    X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size = 0.3, random_state = 0)
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))



In [None]:
cols = list(X_all.columns)
model = LinearRegression()                   # base estimator used by RFE
rfe = RFE(model, n_features_to_select=13)    # initialising RFE to keep 13 features
X_all_rfe = rfe.fit_transform(X_all, y)      # fitting RFE and keeping the selected columns
model.fit(X_all_rfe, y)                      # fitting the model on the selected features
temp = pd.Series(rfe.support_, index=cols)
selected_features_rfe = temp[temp].index
print(selected_features_rfe)


In [None]:
# looking at correlation between features selected by RFE - to drop features that have a high level of correlation

rfe_features = ['MS SubClass60', 'MS SubClass75', 'MS SubClass120', 'MS SubClass150',
       'Bldg TypeTwnhs', 'Roof StyleGable', 'Roof StyleGambrel',
       'Roof StyleMansard', 'Garage Type2Types', 'Garage TypeAttchd',
       'Garage TypeBuiltIn', 'Misc FeatureElev', 'Misc FeatureGar2']

plt.figure(figsize=(10,10))
mask = np.zeros_like(X_all[rfe_features].corr())
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.heatmap(X_all[rfe_features].corr(), mask=mask, annot=True)


### The features selected by RFE are listed below:
#### rfe_features = 'MS SubClass60', 'MS SubClass75', 'MS SubClass120', 'MS SubClass150', 'Bldg TypeTwnhs', 'Roof StyleGable', 'Roof StyleGambrel', 'Roof StyleMansard', 'Garage Type2Types', 'Garage TypeAttchd', 'Garage TypeBuiltIn', 'Misc FeatureElev', 'Misc FeatureGar2'
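With a linear estimator, RFE ranks features by the size of their coefficients, so the scale of each feature influences which ones are eliminated. The sketch below, on synthetic data from `make_regression`, shows one way to standardise the features before running RFE. It is an illustration under that assumption, not the procedure used above, and all names in it are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 8 features, 3 of them informative
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=5.0, random_state=0)

# Standardise so coefficient magnitudes are comparable across features
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Recursively drop the feature with the smallest coefficient until 3 remain
rfe_toy = RFE(LinearRegression(), n_features_to_select=3)
rfe_toy.fit(X_toy_scaled, y_toy)

# Column positions of the retained features
selected_idx = np.flatnonzero(rfe_toy.support_)
print("Selected column indices:", selected_idx)
```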




## 7. MODEL BUILDING

### Model 1: After identifying the features, it is necessary to check for correlation among them, since Linear Regression assumes that the independent features are not highly correlated with one another. Among the selected features, only 'Garage Cars' (paired with 'Garage Area') has a correlation above 0.85, making it the candidate for dropping. Note that the corr_features1 list used in the code below still contains 'Garage Cars', so Model 1 is fitted on all 17 features. Features identified in the earlier heatmap of integer data are included in the list of selected features. These include those with high positive correlation seen earlier, such as 'Overall Qual', 'Gr Liv Area', 'Garage Cars', 'Garage Area', 'Total Bsmt SF' and '1st Flr SF'.
#### corr_features1 = 'Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 'Total Bsmt SF', '1st Flr SF',  'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'TotRms AbvGrd', 'Mas Vnr Area'.  



### Model 2: Another pair of features with strong correlation is 'Gr Liv Area' and 'TotRms AbvGrd'. A second model is tried with 'TotRms AbvGrd' dropped.
#### corr_features2 = 'Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 'Total Bsmt SF', '1st Flr SF',  'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 'Fireplace Qu', 'Full Bath', 'FoundationPConc',  'Mas Vnr Area'.  



### Model 3: Model designed from optimum features as identified by RFE. 
#### rfe_features = 'MS SubClass60', 'MS SubClass75', 'MS SubClass120', 'MS SubClass150', 'Bldg TypeTwnhs', 'Roof StyleGable', 'Roof StyleGambrel', 'Roof StyleMansard', 'Garage Type2Types', 'Garage TypeAttchd', 'Garage TypeBuiltIn', 'Misc FeatureElev', 'Misc FeatureGar2'
 

### A. MODEL PREPARATION 

#### Train-test-split:

#### The train data is divided into a training portion and a testing portion to find a suitable Linear Regression model. The model is fitted on the training portion and then scored on the testing portion, and the difference between the predicted and actual values is measured using metrics.

#### To reduce the dependence on a single split, a technique called cross-validation is used. The data is divided into the required number of folds, and each fold in turn is held out as the testing data while the model is trained on the rest.

#### For our calculations, 30% of the data is held out as test data for designing the model, and 10-fold cross-validation is applied to the training portion.
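The split and cross-validation described above can be sketched on synthetic data as follows. The data from `make_regression` and the names `X_demo`, `y_demo`, `X_tr`, `X_te` are placeholders for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the housing features and target
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0,
                                 random_state=0)

# Hold out 30% of the rows as the testing portion
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=42)

# 10-fold cross-validation on the training portion: one R-squared per fold
cv_scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=10)

print("Rows in train/test:", X_tr.shape[0], X_te.shape[0])
print("Mean cross-validated R2:", cv_scores.mean())
```

The mean of the ten fold scores corresponds to the 'Mean score' row collected in `df_outcomes`, while the score on the held-out 30% corresponds to the test score.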

#### Standard Scaler:
#### The features are standardised (zero mean, unit variance) so that they are on a comparable scale. Otherwise, features measured in large units would be treated differently by the Lasso and Ridge penalties than features measured in small units.
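A minimal sketch of the scaling step on made-up numbers: the scaler is fitted on the training portion only, and the same mean and standard deviation are then applied to the testing portion. The arrays `train_part` and `test_part` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
train_part = np.array([[1.0, 100.0],
                       [2.0, 200.0],
                       [3.0, 300.0]])
test_part = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training portion only
scaler = StandardScaler().fit(train_part)

# Apply the same transformation, z = (x - mean) / std, to both portions
train_scaled = scaler.transform(train_part)
test_scaled = scaler.transform(test_part)

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training portion alone keeps information from the testing portion out of the model.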

#### Initiating empty dataframe to collect outcomes

In [None]:
#initiating a dataframe to collect outcomes of evaluations

outcomes ={'Model':('Mean score', 'Variance score', 'Mean Square Error', 'Root Mean Square Error')} 
df_outcomes = pd.DataFrame(outcomes)
df_outcomes

## B. Model 1 - corr_features1

In [None]:
# model prep
corr_features1 = ['Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 
                 'Total Bsmt SF', '1st Flr SF', 'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 
                 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'TotRms AbvGrd', 'Mas Vnr Area']
X= X_all[corr_features1]
y = df_train['SalePrice']

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size= 0.3, random_state=42)    

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
    

### Model 1 - i. linear model (baseline)

In [None]:
# function to do linear model - for a given X_train, X_test, y_train, y_test, name of regression model

def lin_reg(X_train, X_test, y_train, y_test, name):
        
    # instantiate linear regression model
    lr = LinearRegression()
    
    # training model and getting score
    lr_scores_train = cross_val_score(lr, X_train, y_train, cv=10)
    print ("Mean score:", np.mean(lr_scores_train))
        
    # Fitting the model 
    lr.fit(X_train, y_train)
    
    # getting co-eff values
    coeff_df = pd.DataFrame(lr.coef_ , X.columns, columns=['Coefficient'])  
   
    # to get predicted values
    pred = lr.predict(X_test)

    # Score for test data
    print( "Test score:",lr.score(X_test, y_test))
    
    #Metrics of model
    print("Mean Square Error:", metrics.mean_squared_error(y_test, pred))
    print("Root Mean Square Error:", np.sqrt(metrics.mean_squared_error(y_test, pred)))
    
    perform_score=[round(np.mean(lr_scores_train),3), 
                        round(lr.score(X_test, y_test),3),
                        round(metrics.mean_squared_error(y_test, pred ),3),
                        round(np.sqrt(metrics.mean_squared_error(y_test, pred)),3)
                       ]
    # storing values in df_outcomes
    df_outcomes[name]= pd.Series(data=perform_score)

    return coeff_df, pred


In [None]:
coeff_df, pred = lin_reg(X_train, X_test, y_train, y_test, 'Linear/corr_features1')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Linear Regression', loc='left');

### Model 1 - ii. Lasso regression
#### Calculation of optimal alpha and then fitting Lasso regression

In [None]:
# Lasso regression for given X_train, X_test, y_train, y_test, name of regression model

def lasso_reg(X_train, X_test, y_train, y_test, name):
    
    # Calculation of optimal alpha
    optimal_lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
    optimal_lasso.fit(X_train, y_train)

    # instantiate lasso model with optimal alpha
    lasso = Lasso(alpha=optimal_lasso.alpha_)

    # Training the model
    lasso.fit(X_train, y_train)
    lasso_scores_train = cross_val_score(lasso, X_train, y_train, cv=10)
    print ("Mean score:", np.mean(lasso_scores_train))
    
    # getting co-eff values
    coeff_df = pd.DataFrame(lasso.coef_ , X.columns, columns=['Coefficient'])  
    
    # Scoring on test data and getting predicted values
    print('Test score:', lasso.score(X_test, y_test))
    pred = lasso.predict(X_test)

    # Metrics
    print("Mean Square Error:", metrics.mean_squared_error(y_test, pred))
    print("Root Mean Square Error:", np.sqrt(metrics.mean_squared_error(y_test, pred)))

    perform_score=[round(np.mean(lasso_scores_train), 4), 
                            round(lasso.score(X_test, y_test), 4), 
                            round(metrics.mean_squared_error(y_test, pred ),4), 
                            round(np.sqrt(metrics.mean_squared_error(y_test, pred)), 4)]

    # storing values in df_outcomes
    df_outcomes[name]= pd.Series(data=perform_score) 
    
    return coeff_df, pred



In [None]:
coeff_df, pred = lasso_reg(X_train, X_test, y_train, y_test, 'Lasso/corr_features1')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Lasso Reg', loc='left');

### Model 1 - iii. Ridge regression
#### Calculation of optimal alpha and then fitting Ridge regression

In [None]:
# Ridge regression for given X_train, X_test, y_train, y_test, name of regression model

def ridge_reg(X_train, X_test, y_train, y_test, name):

    # Calculating optimal alpha
    ridge_alphas = np.logspace(0, 5, 200)

    optimal_ridge = RidgeCV(alphas=ridge_alphas, cv=10)
    optimal_ridge.fit(X_train, y_train)

    # Cross-validating the model with the optimal alpha
    ridge = Ridge(alpha=optimal_ridge.alpha_)
    ridge_scores_train = cross_val_score(ridge, X_train, y_train, cv=10)

    # Mean Score
    print("Mean Ridge Score:", np.mean(ridge_scores_train))

    # Fitting the model
    ridge.fit(X_train, y_train)

    # getting co-eff values (available only after fitting)
    coeff_df = pd.DataFrame(ridge.coef_, X.columns, columns=['Coefficient'])

    # Predicting and scoring on test data
    pred = ridge.predict(X_test)
    print("Test score:", ridge.score(X_test, y_test))

    # Metrics
    print("Mean Square Error:", metrics.mean_squared_error(y_test, pred))
    print("Root Mean Square Error:", np.sqrt(metrics.mean_squared_error(y_test, pred)))
    
    perform_score=[round(np.mean(ridge_scores_train),3), 
                        round(ridge.score(X_test, y_test),3),
                        round(metrics.mean_squared_error(y_test, pred ),3),
                        round(np.sqrt(metrics.mean_squared_error(y_test, pred)),3)
                       ]
    # storing values in df_outcomes
    df_outcomes[name]= pd.Series(data=perform_score)

    return coeff_df, pred

In [None]:
coeff_df, pred = ridge_reg(X_train, X_test, y_train, y_test, 'Ridge/corr_features1')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Ridge Reg', loc='left');

## C. Model 2 - corr_features2

In [None]:
# model prep
corr_features2 = ['Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 
                 'Total Bsmt SF', '1st Flr SF', 'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 
                 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'Mas Vnr Area']
X= X_all[corr_features2]
y = df_train['SalePrice']

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size= 0.3, random_state=42)    

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
    

In [None]:
coeff_df, pred = lin_reg(X_train, X_test, y_train, y_test, 'Linear/corr_features2')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Linear Regression', loc='left');

In [None]:
coeff_df, pred = lasso_reg(X_train, X_test, y_train, y_test, 'Lasso/corr_features2')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient', ascending=False)
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Lasso Reg', loc='left');

In [None]:
coeff_df, pred = ridge_reg(X_train, X_test, y_train, y_test, 'Ridge/corr_features2')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Ridge Reg', loc='left');

## D. Model 3 - rfe_features

In [None]:
# model prep
rfe_features = ['MS SubClass60', 'MS SubClass75', 'MS SubClass120', 'MS SubClass150',
                               'Bldg TypeTwnhs', 'Roof StyleGable', 'Roof StyleGambrel',
                                'Roof StyleMansard', 'Garage Type2Types', 'Garage TypeAttchd',
                                'Garage TypeBuiltIn', 'Misc FeatureElev', 'Misc FeatureGar2']
X = X_all[rfe_features]
y = df_train['SalePrice']

# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                        test_size= 0.3, random_state=42)    

# standard scaling
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_test = ss.transform(X_test)
    

In [None]:
coeff_df, pred = lin_reg(X_train, X_test, y_train, y_test, 'Linear/rfe_features')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Linear Reg', loc='left');

In [None]:
coeff_df, pred = lasso_reg(X_train, X_test, y_train, y_test, 'Lasso/rfe_features')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Lasso Reg', loc='left');

In [None]:
coeff_df, pred = ridge_reg(X_train, X_test, y_train, y_test, 'Ridge/rfe_features')

In [None]:
# Plotting coefficients
coeff_sorted = coeff_df.sort_values(by='Coefficient')
sns.barplot(x=coeff_sorted['Coefficient'], y=coeff_sorted.index)

#plotting of predictions vs Actuals
sns.jointplot(x=y_test, y=pred)
plt.title( 'Predictions vs Actual SalePrice, Ridge Reg', loc='left');

## 8. EVALUATION OF THE MODELS

### Comparing the scores and metrics of the individual models

In [None]:
df_outcomes

#### Comparing the scores and metrics, the Ridge model with the correlated features of Model 1 is preferred. This model has a high mean (train data) score of 0.792 and a small gap between the train score (0.792) and the test score (0.855).


## 9. SUBMISSION TO KAGGLE

### Ridge regression /corr_features1

In [None]:
#submission based on Ridge model

corr_features1 =['Overall Qual', 'Exter Qual', 'Gr Liv Area', 'Kitchen Qual', 'Garage Area', 'Garage Cars', 
                 'Total Bsmt SF', '1st Flr SF', 'Bsmt Qual', 'Year Built', 'Garage Finish', 'Year Remod/Add', 
                 'Fireplace Qu', 'Full Bath', 'FoundationPConc', 'TotRms AbvGrd', 'Mas Vnr Area']


X = X_all[corr_features1]
y = df_train['SalePrice']

X_test_final = df_test[corr_features1]

In [None]:
X.shape, y.shape, X_test_final.shape

In [None]:
# Scaling of features

ss = StandardScaler()
X = ss.fit_transform(X)
X_test_final = ss.transform(X_test_final)

In [None]:
# Calculation of optimal alpha

ridge_alphas = np.logspace(0, 5, 200)

optimal_ridge = RidgeCV(alphas=ridge_alphas, cv=10)
optimal_ridge.fit(X, y)

# Cross-validating the model with the optimal alpha

ridge = Ridge(alpha=optimal_ridge.alpha_)
ridge_scores = cross_val_score(ridge, X, y, cv=10)
print("Mean Ridge Score:", np.mean(ridge_scores))

# Fitting the model (coefficients are available only after fitting)
ridge.fit(X, y)
coeff_df = pd.DataFrame(ridge.coef_, corr_features1)
pred_final = ridge.predict(X_test_final)

# Converting to prescribed format for submission to Kaggle
df_pred_final = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': pred_final})
df_pred_final

In [None]:
ridge_coeff_df = pd.DataFrame(ridge.coef_, corr_features1, columns=['Coefficient'])
print(ridge_coeff_df)
ridge_y_intercept = ridge.intercept_
print(ridge_y_intercept)

In [None]:
# conversion to csv file for submission to Kaggle
df_pred_final.to_csv('../datasets/submission_ridge1rev.csv', index=False)

## Submission for Kaggle - Linear regression /corr_features1

In [None]:
#submission based on Linear Reg model

X = X_all[corr_features1]
y = df_train['SalePrice']

X_test_final = df_test[corr_features1]

lr= LinearRegression()

# Training scores
lr_scores = cross_val_score(lr, X, y, cv=10)
print ("Mean R2:", np.mean(lr_scores))

# Fitting model and calculating predicted values
lr.fit(X, y)
pred_final = lr.predict(X_test_final)

# Predicted values in prescribed format
df_pred_final = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': pred_final})

df_pred_final

In [None]:
# Conversion to csv format for uploading to Kaggle

df_pred_final.to_csv('../datasets/submission_lr1rev.csv', index=False)

## Submission for Kaggle - Lasso regression /corr_features1

In [None]:
#submission based on Lasso model

X = X_all[corr_features1]
y = df_train['SalePrice']

X_test_final = df_test[corr_features1]

# Calculation of optimal alpha
lasso = LassoCV(n_alphas=500, cv=10, verbose=1)
lasso.fit(X, y)

# Training the model
lasso_scores = cross_val_score(lasso, X, y, cv=10)

print("Mean Lasso Score:", np.mean(lasso_scores))

# Fitting the model
lasso.fit(X, y)
pred_final = lasso.predict(X_test_final)

# Predicted values in format prescribed
df_pred_final = pd.DataFrame({'Id': df_test['Id'], 'SalePrice': pred_final})
df_pred_final

In [None]:
# Conversion to csv file for uploading
df_pred_final.to_csv('../datasets/submission_lasso1rev.csv', index=False)


## 10. CONCLUSION

#### The datasets 'train.csv' and 'test.csv' were cleaned by checking for null values and converting categorical values to integers (ordinal and binary).

#### After cleaning of the datasets ('train.csv' and 'test.csv'), the various features were studied for design of the model. The heatmap was plotted to get a visual representation of the correlation between the features.

#### The features for the models were shortlisted by considering the correlation of the target variable ('SalePrice') with all the features and taking an absolute correlation of 0.5 as the threshold. Features with an absolute correlation of more than 0.5 were selected. This method is also called filtering of features. The list of features shortlisted from the correlation and heatmap is the one given in Section 6.

#### The 17 individual features identified by the above step were also checked for high intercorrelation to avoid multicollinearity. Two pairs of features were found to have high collinearity:
    - 'Garage Cars' and 'Garage Area'  (correlation: 0.89)
    - 'Gr Liv Area' and 'TotRms AbvGrd'(correlation: 0.81)
        
#### The decision to drop 'Garage Cars' for Model 1 was made since the correlation was very high. The remaining features listed above, after dropping 'Garage Cars', were considered in the design.

#### For Model 2, the other pair of features with high correlation was considered, and 'TotRms AbvGrd' was also dropped from the features.

#### Two other variations were also tried. A polynomial model of degree 2 gave Kaggle scores of 37656 (Public) and 35359 (Private).
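The code for the polynomial variation is not shown in this notebook. The sketch below shows one way such a model could be set up, assuming a pipeline of `PolynomialFeatures`, `StandardScaler` and `Ridge` on synthetic data. The pipeline, the alpha value and all names are assumptions and do not reproduce the submission that gave the scores above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with 3 features
X_poly_demo, y_poly_demo = make_regression(n_samples=100, n_features=3,
                                           noise=1.0, random_state=0)

# Degree-2 expansion (squares and pairwise products), then scaling, then Ridge
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           StandardScaler(),
                           Ridge(alpha=1.0))
poly_model.fit(X_poly_demo, y_poly_demo)

# Number of columns after the polynomial expansion
n_expanded = poly_model.named_steps['polynomialfeatures'].n_output_features_
print("Expanded feature count:", n_expanded)
```

The feature count grows quickly with degree, which is one reason regularisation is usually paired with polynomial terms.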

#### Model 3 was tried with optimal features identified through the Recursive Feature Elimination.

#### After generating the model predictions, the predicted values were uploaded to Kaggle. Model 1 was found to have the lowest Kaggle scores, for both Ridge and Linear regression. The gap between the train and test scores was also not too high, and RMSE was found to be lower in the case of the Linear regression model. The two models are quite similar.




### Kaggle scores for the 2 best models were as follows

####  Model 1 Ridge model submission with 17 correlated features
####  Private: 35054.17459
#### Public: 33901.40970

#### Model 1 (17 features) Linear Regression model submission 
#### Private: 35501.38125
#### Public: 35028.56812


                 
#### Hence the Ridge Regression model is selected to be the preferred model to predict the target variable, based on the lower Kaggle score.

#### Point to note while using the model:
#### 1. There can be no model which is a perfect fit, however from the various models considered, this will give the least variation

#### The model may be further refined by:
    - Including other features 
    - Feature Engineering 
    - Polynomial with degree of 3
    - Try other known models of regression


### Details of Model 1 (17 features) Ridge Regression 

#### Below are values for the beta coefficients and y_intercept for the individual features in this model

In [None]:
# Values of the beta coefficients for the selected features 


print('Beta Co-efficients:')
print(ridge_coeff_df)
print("Beta_zero /y_intercept:", ridge_y_intercept)
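To show how the coefficients and intercept are used, the sketch below rebuilds a Ridge prediction by hand as intercept plus the sum of coefficient times scaled feature value. The data is synthetic and the names (`scaler_demo`, `ridge_demo`, `x_new`) are hypothetical. Because the model above was fitted on standardised features, a new observation has to be scaled with the same scaler before the coefficients are applied.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data with 3 features and a known linear relationship plus noise
rng = np.random.default_rng(1)
X_raw = rng.normal(loc=50, scale=10, size=(40, 3))
y_vals = 2.0 * X_raw[:, 0] - X_raw[:, 1] + rng.normal(size=40)

# Fit the scaler and the Ridge model on the scaled features
scaler_demo = StandardScaler().fit(X_raw)
ridge_demo = Ridge(alpha=1.0).fit(scaler_demo.transform(X_raw), y_vals)

# A new observation must be scaled with the same scaler first
x_new = np.array([[55.0, 45.0, 50.0]])
z_new = scaler_demo.transform(x_new)

# Prediction by hand: beta_zero + sum(beta_i * z_i)
manual = ridge_demo.intercept_ + z_new @ ridge_demo.coef_

print("Manual:", manual, "predict():", ridge_demo.predict(z_new))
```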