# <font color='firebrick'> Retirement Income Predictor - Advanced Regression Techniques  </font>

## <font color='goldenrod'>Introduction </font>

### <font color='black'> Project Description </font>

Predict the ideal income a person can draw from retirement savings, using personal, investment, and retirement data.
Ideal income is defined as the maximum amount that will last until death with a certain confidence level.

### <font color='black'> Data Description </font>

The data consists of personal, investment, and retirement-specific data at the individual level, together with the eventual ideal income in retirement. The repo contains the training, validation, and testing sets used by our team. Please keep them unchanged: use the training and validation sets for parameter optimization, and the testing set for metric reporting and analysis of generalization capacity.

### <font color='black'> Goal </font>

With 39 explanatory variables describing (almost) every aspect of an individual's circumstances, this project aims to predict the amount a person will receive as retirement income.

### <font color='black'> Step-by-Step Procedure </font>

<font color='darkblue'> To tackle this project efficiently, we will follow the steps below, as suggested by [A. Qua](https://www.kaggle.com/adibouayjan). </font>
    

## <font color='goldenrod'> 1. Exploratory Data Analysis </font>

These are the core parts of this section, according to [Pedro Marcelino, Ph.D.](https://www.kaggle.com/code/pmarcelino/comprehensive-data-exploration-with-python/notebook): 
    
* Understand the problem. We'll look at each variable and do a philosophical analysis about their meaning and importance for this problem.
    
* Univariable study. We'll just focus on the dependent variable ('') and try to know a little bit more about it.
    
* Multivariate study. We'll try to understand how the dependent variable and independent variables relate.
    
* Basic cleaning. We'll clean the dataset and handle the missing data, outliers and categorical variables.
    
* Test assumptions. We'll check if our data meets the assumptions required by most multivariate techniques.


### <font color='forestgreen'>  1.1. General Exploration </font>

#### 1.1.1. Loading libraries

In [15]:
import pandas as pd 

Pandas is a Python library for data analysis. Started by Wes McKinney in 2008 out of a need for a powerful and flexible quantitative analysis tool, pandas has grown into one of the most popular Python libraries. It has an extremely active community of contributors. 
Pandas features: 

* Time series analysis
* Split-Apply-Combine
* Data visualisation
* Pivot Table
    
See [Pandas](https://mode.com/python-tutorial/libraries/pandas/) for more details. 

In [16]:
import matplotlib.pyplot as plt

Matplotlib is a Python library for creating 2D graphs and plots from Python scripts. Its pyplot module simplifies plotting by providing features to control line styles, font properties, axis formatting, and so on. It supports a wide variety of graphs and plots, including histograms, bar charts, power spectra, and error charts.

See [Matplotlib](https://matplotlib.org) for more details. 

In [17]:
import seaborn as sns

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

See [Seaborn](https://seaborn.pydata.org) for more details. 

In [18]:
import numpy as np

NumPy brings the computational power of languages like C and Fortran to Python. With this power comes simplicity: a solution in NumPy is often clear and elegant.

See [NumPy](https://numpy.org) for more details. 

In [19]:
import warnings

Warnings alert the developer to situations that aren’t necessarily exceptions. Typically, a warning is issued when a programming element, such as a keyword, function, or class, is obsolete or deprecated. A warning is distinct from an error: a Python program terminates if an error is raised and not handled.

Conversely, a warning is not critical. It shows a message, but the program keeps running. The `warn()` function defined in the `warnings` module is used to issue warning messages. The warning categories themselves derive from the built-in `Warning` class, which is a subclass of `Exception`.

See [Warning](https://docs.python.org/3/library/warnings.html) for more details. 
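To make the difference between a warning and an error concrete, here is a minimal sketch. The function `legacy_rate` is a hypothetical helper invented for this illustration; only the `warnings` calls come from the standard library.

```python
import warnings


def legacy_rate(percent):
    """Hypothetical helper that still works but signals it is deprecated."""
    warnings.warn("legacy_rate is deprecated", DeprecationWarning)
    return percent / 100


# Record warnings instead of printing them, so we can inspect what was issued.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not filtered out
    result = legacy_rate(5)          # execution continues after the warning

print(result)
print(caught[0].category.__name__)
# Warning categories are classes derived from Exception.
print(issubclass(DeprecationWarning, Exception))
```

The call returns normally even though a warning was issued, which is the behavior that `warnings.filterwarnings("ignore")` later silences for the whole notebook.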

In [20]:
import statsmodels.api as sm #Cross-sectional models and methods.
import statsmodels.formula.api as smf #A convenience interface for specifying models using formula strings and DataFrames.

Statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

See [Statsmodels](https://www.statsmodels.org/stable/index.html) for more details. 

In [21]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

The variance inflation factor is a measure for the increase of the variance of the parameter estimates if an additional variable, given by exog_idx is added to the linear regression. It is a measure for multicollinearity of the design matrix, exog.

One recommendation is that if VIF is greater than 5, then the explanatory variable given by exog_idx is highly collinear with the other explanatory variables, and the parameter estimates will have large standard errors because of this.

In [22]:
import sklearn

Scikit-learn (sklearn) is a widely used library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface. The library is largely written in Python and builds on NumPy, SciPy, and (for plotting) Matplotlib.

See [Scikit-learn](https://www.tutorialspoint.com/scikit_learn/index.htm) for more details. 

In [23]:
# Function to deal with missing values via imputation
from sklearn.impute import SimpleImputer

In [24]:
# Encoders that convert categorical values into numerical values (ordinal or one-hot), plus a scaler that standardizes numerical features
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [55]:
# Data splitting, cross-validation, grid search, feature selection, regression models, and evaluation metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import *
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import random
import math

In [26]:
# Statistics functions
from scipy.stats import norm
from scipy import stats
from scipy.stats import chi2_contingency

SciPy is a scientific computation library built on NumPy. SciPy stands for Scientific Python. It provides additional utility functions for optimization, statistics, and signal processing. Like NumPy, SciPy is open source, so we can use it freely. Travis Oliphant, the primary creator of NumPy, was one of SciPy's original authors.

See [SciPy](https://scipy.org) for further details. 

In [27]:
# Suppress all warnings
warnings.filterwarnings("ignore") 

# It is a magic function that renders the figure in the notebook
%matplotlib inline 

# Changing the figure size of a seaborn axes 
sns.set(rc={"figure.figsize": (20, 15)})

# The style parameters control properties like the color of the background and whether a grid is enabled by default.
sns.set_style("whitegrid")

#### 1.1.2. Loading the Data sets 

In [42]:
df_train = pd.read_csv('nedgroup_training_data.csv', index_col='Unnamed: 0')

In [43]:
df_validation = pd.read_csv('nedgroup_validation_data.csv', index_col='Unnamed: 0')

In [92]:
df_test = pd.read_csv('nedgroup_testing_data.csv', index_col='Unnamed: 0')

In [44]:
df_train.head()

Unnamed: 0,GENDER,RETIREMENT_AGE,RETIREMENT_FUND_VALUE,DEPT_VALUE,CURRENT_NET_MONTHLY_INCOME,SPARE_CASH_VALUE,FINANCIALLY_SUPPORT_PARTNER,FINANCIALLY_SUPPORT_CHILDREN,YEARS_SUPPORTING_CHILD,CHILD_MONTHLY_SUPPORTING_VALUE,...,INTERNATIONAL_CASH_UNIT_TRUST,SA_EQUITY_LAP,SA_BOND_LAP,SA_CASH_LAP,INTERNATIONAL_EQUITY_LAP,INTERNATIONAL_BOND_LAP,INTERNATIONAL_CASH_LAP,LAP_EAC_PA_INCL_VAT,LA_EAC_PA_INCL_VAT,UNIT_TRUST_EAC_PA_INCL_VAT
1,Female,74,9019817,108988,207,248324,No,No,0,0,...,9.0,36.75,19.75,16.0,18.85,6.9,1.75,1.22,1.22,1.22
5,Female,71,5194248,51928,2013,164644,Yes,Yes,4,4900,...,9.9,24.6,34.4,16.0,17.6,4.4,3.0,1.22,1.22,1.22
9,Female,70,6796815,41684,3047,240103,No,No,0,0,...,4.95,49.0,15.0,6.0,23.0,6.0,1.0,2.945625,2.945625,2.785625
10,Female,78,7238811,147184,1725,20071,No,No,0,0,...,0.0,49.0,15.0,6.0,23.0,6.0,1.0,0.336955,0.336955,0.350264
12,Male,87,6173619,229014,2261,82670,Yes,Yes,3,294,...,2.5,0.0,0.0,1.0,98.0,0.0,1.0,1.46,1.46,1.46


In [45]:
df_validation.head()

Unnamed: 0,GENDER,RETIREMENT_AGE,RETIREMENT_FUND_VALUE,DEPT_VALUE,CURRENT_NET_MONTHLY_INCOME,SPARE_CASH_VALUE,FINANCIALLY_SUPPORT_PARTNER,FINANCIALLY_SUPPORT_CHILDREN,YEARS_SUPPORTING_CHILD,CHILD_MONTHLY_SUPPORTING_VALUE,...,INTERNATIONAL_CASH_UNIT_TRUST,SA_EQUITY_LAP,SA_BOND_LAP,SA_CASH_LAP,INTERNATIONAL_EQUITY_LAP,INTERNATIONAL_BOND_LAP,INTERNATIONAL_CASH_LAP,LAP_EAC_PA_INCL_VAT,LA_EAC_PA_INCL_VAT,UNIT_TRUST_EAC_PA_INCL_VAT
43729,Female,67,9000142,15573,496,81693,No,Yes,1,2389,...,0.5,0.0,15.0,85.0,0.0,0.0,0.0,1.753037,1.753037,1.753037
43732,Female,62,6027373,174387,7463,185622,No,Yes,4,953,...,1.0,0.0,0.0,1.0,59.4,29.7,9.9,2.012777,2.012777,1.902777
43733,Female,92,4376257,114480,1343,223781,No,Yes,4,1009,...,2.0,0.0,0.0,1.0,90.0,0.0,9.0,0.704934,0.704934,0.704934
43734,Female,83,5023733,103967,1532,226560,Yes,No,0,0,...,2.5,0.0,0.0,1.0,59.4,29.7,9.9,1.46,1.46,1.46
43735,Female,84,5486899,45344,6143,234655,Yes,No,0,0,...,0.0,24.5,24.5,26.0,14.7,7.8,2.5,0.42,0.42,0.3


In [93]:
df_test.head()

Unnamed: 0,GENDER,RETIREMENT_AGE,RETIREMENT_FUND_VALUE,DEPT_VALUE,CURRENT_NET_MONTHLY_INCOME,SPARE_CASH_VALUE,FINANCIALLY_SUPPORT_PARTNER,FINANCIALLY_SUPPORT_CHILDREN,YEARS_SUPPORTING_CHILD,CHILD_MONTHLY_SUPPORTING_VALUE,...,INTERNATIONAL_CASH_UNIT_TRUST,SA_EQUITY_LAP,SA_BOND_LAP,SA_CASH_LAP,INTERNATIONAL_EQUITY_LAP,INTERNATIONAL_BOND_LAP,INTERNATIONAL_CASH_LAP,LAP_EAC_PA_INCL_VAT,LA_EAC_PA_INCL_VAT,UNIT_TRUST_EAC_PA_INCL_VAT
56495,Male,91,4251825,94130,4146,41574,No,No,0,0,...,3.38,14.57,16.48,29.75,31.848,5.357,1.995,0.922,0.922,0.854
56496,Male,76,6716316,141328,1265,200076,No,No,0,0,...,3.38,1.5,26.0,15.5,49.6,5.4,2.0,2.739954,2.739954,2.671954
56499,Male,62,508676,4805,5751,212957,Yes,No,0,0,...,2.715,31.4,16.5,10.3,28.04,10.38,3.38,1.181,1.181,1.137
56501,Female,63,10542288,211999,2541,8995,Yes,Yes,3,3317,...,2.465,14.57,16.48,29.75,31.848,5.357,1.995,2.772379,2.772379,2.391027
56503,Male,82,4949178,455,2029,240760,Yes,No,0,0,...,3.19,24.565,17.435,29.85,19.223,7.207,1.72,1.208244,1.208244,1.454422


In [46]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23944 entries, 1 to 43728
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   23944 non-null  object 
 1   RETIREMENT_AGE                           23944 non-null  int64  
 2   RETIREMENT_FUND_VALUE                    23944 non-null  int64  
 3   DEPT_VALUE                               23944 non-null  int64  
 4   CURRENT_NET_MONTHLY_INCOME               23944 non-null  int64  
 5   SPARE_CASH_VALUE                         23944 non-null  int64  
 6   FINANCIALLY_SUPPORT_PARTNER              23944 non-null  object 
 7   FINANCIALLY_SUPPORT_CHILDREN             23944 non-null  object 
 8   YEARS_SUPPORTING_CHILD                   23944 non-null  int64  
 9   CHILD_MONTHLY_SUPPORTING_VALUE           23944 non-null  int64  
 10  YEARS_SUPPORTING_SOMEONE_ELSE            23944

In [47]:
df_validation.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6841 entries, 43729 to 56494
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   6841 non-null   object 
 1   RETIREMENT_AGE                           6841 non-null   int64  
 2   RETIREMENT_FUND_VALUE                    6841 non-null   int64  
 3   DEPT_VALUE                               6841 non-null   int64  
 4   CURRENT_NET_MONTHLY_INCOME               6841 non-null   int64  
 5   SPARE_CASH_VALUE                         6841 non-null   int64  
 6   FINANCIALLY_SUPPORT_PARTNER              6841 non-null   object 
 7   FINANCIALLY_SUPPORT_CHILDREN             6841 non-null   object 
 8   YEARS_SUPPORTING_CHILD                   6841 non-null   int64  
 9   CHILD_MONTHLY_SUPPORTING_VALUE           6841 non-null   int64  
 10  YEARS_SUPPORTING_SOMEONE_ELSE            68

In [94]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3420 entries, 56495 to 62844
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   3420 non-null   object 
 1   RETIREMENT_AGE                           3420 non-null   int64  
 2   RETIREMENT_FUND_VALUE                    3420 non-null   int64  
 3   DEPT_VALUE                               3420 non-null   int64  
 4   CURRENT_NET_MONTHLY_INCOME               3420 non-null   int64  
 5   SPARE_CASH_VALUE                         3420 non-null   int64  
 6   FINANCIALLY_SUPPORT_PARTNER              3420 non-null   object 
 7   FINANCIALLY_SUPPORT_CHILDREN             3420 non-null   object 
 8   YEARS_SUPPORTING_CHILD                   3420 non-null   int64  
 9   CHILD_MONTHLY_SUPPORTING_VALUE           3420 non-null   int64  
 10  YEARS_SUPPORTING_SOMEONE_ELSE            34

#### 1.1.3. Checking for missing data

In [48]:
df_train.isnull().sum()

GENDER                                         0
RETIREMENT_AGE                                 0
RETIREMENT_FUND_VALUE                          0
DEPT_VALUE                                     0
CURRENT_NET_MONTHLY_INCOME                     0
SPARE_CASH_VALUE                               0
FINANCIALLY_SUPPORT_PARTNER                    0
FINANCIALLY_SUPPORT_CHILDREN                   0
YEARS_SUPPORTING_CHILD                         0
CHILD_MONTHLY_SUPPORTING_VALUE                 0
YEARS_SUPPORTING_SOMEONE_ELSE                  0
OTHER_MONTHLY_SUPPORTING_VALUE                 0
HAS_EMERGENCY_SAVINGS                          0
CRITICAL_ILLNESS                               0
ONGOING_COACHING_FEE                           0
CONFIDENCE_LEVEL                               0
INITIAL_PLANNER_FEE_INCL_VAT_UT                0
INITIAL_PLANNER_FEE_INCL_VAT_LA_AND_LAP        0
ONGOING_PLANNER_FEE_INCL_VAT_UT                0
ONGOING_PLANNER_FEE_INCL_VAT_LA_AND_LAP        0
TARGET_MONTHLY_INCOM

In [49]:
df_validation.isnull().sum()

GENDER                                        0
RETIREMENT_AGE                                0
RETIREMENT_FUND_VALUE                         0
DEPT_VALUE                                    0
CURRENT_NET_MONTHLY_INCOME                    0
SPARE_CASH_VALUE                              0
FINANCIALLY_SUPPORT_PARTNER                   0
FINANCIALLY_SUPPORT_CHILDREN                  0
YEARS_SUPPORTING_CHILD                        0
CHILD_MONTHLY_SUPPORTING_VALUE                0
YEARS_SUPPORTING_SOMEONE_ELSE                 0
OTHER_MONTHLY_SUPPORTING_VALUE                0
HAS_EMERGENCY_SAVINGS                         0
CRITICAL_ILLNESS                              0
ONGOING_COACHING_FEE                          0
CONFIDENCE_LEVEL                              0
INITIAL_PLANNER_FEE_INCL_VAT_UT               0
INITIAL_PLANNER_FEE_INCL_VAT_LA_AND_LAP       0
ONGOING_PLANNER_FEE_INCL_VAT_UT               0
ONGOING_PLANNER_FEE_INCL_VAT_LA_AND_LAP       0
TARGET_MONTHLY_INCOME                   

In [95]:
df_test.isnull().sum()

GENDER                                        0
RETIREMENT_AGE                                0
RETIREMENT_FUND_VALUE                         0
DEPT_VALUE                                    0
CURRENT_NET_MONTHLY_INCOME                    0
SPARE_CASH_VALUE                              0
FINANCIALLY_SUPPORT_PARTNER                   0
FINANCIALLY_SUPPORT_CHILDREN                  0
YEARS_SUPPORTING_CHILD                        0
CHILD_MONTHLY_SUPPORTING_VALUE                0
YEARS_SUPPORTING_SOMEONE_ELSE                 0
OTHER_MONTHLY_SUPPORTING_VALUE                0
HAS_EMERGENCY_SAVINGS                         0
CRITICAL_ILLNESS                              0
ONGOING_COACHING_FEE                          0
CONFIDENCE_LEVEL                              0
INITIAL_PLANNER_FEE_INCL_VAT_UT               0
INITIAL_PLANNER_FEE_INCL_VAT_LA_AND_LAP       0
ONGOING_PLANNER_FEE_INCL_VAT_UT               0
ONGOING_PLANNER_FEE_INCL_VAT_LA_AND_LAP       0
TARGET_MONTHLY_INCOME                   

### 1.2. Numerical Features

#### 1.2.1. Extracting Numerical Features

In [50]:
# To make use of all our features, we encode the categorical (object) columns as numbers.
# The same step is repeated for the validation and test data in the cells that follow.
objList = df_train.select_dtypes(include="object").columns
print(objList)

# Label encoding for object-to-numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for feat in objList:
    df_train[feat] = le.fit_transform(df_train[feat].astype(str))

df_train.info()

Index(['GENDER', 'FINANCIALLY_SUPPORT_PARTNER', 'FINANCIALLY_SUPPORT_CHILDREN',
       'HAS_EMERGENCY_SAVINGS', 'CRITICAL_ILLNESS', 'SPOUSE_GENDER',
       'SPOUSE_DATE_OF_BIRTH'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23944 entries, 1 to 43728
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   23944 non-null  int32  
 1   RETIREMENT_AGE                           23944 non-null  int64  
 2   RETIREMENT_FUND_VALUE                    23944 non-null  int64  
 3   DEPT_VALUE                               23944 non-null  int64  
 4   CURRENT_NET_MONTHLY_INCOME               23944 non-null  int64  
 5   SPARE_CASH_VALUE                         23944 non-null  int64  
 6   FINANCIALLY_SUPPORT_PARTNER              23944 non-null  int32  
 7   FINANCIALLY_SUPPORT_CHILDREN             23944 non-nu

In [51]:
objList = df_validation.select_dtypes(include = "object").columns
print (objList)

for feat in objList:
    df_validation[feat] = le.fit_transform(df_validation[feat].astype(str))

df_validation.info()

Index(['GENDER', 'FINANCIALLY_SUPPORT_PARTNER', 'FINANCIALLY_SUPPORT_CHILDREN',
       'HAS_EMERGENCY_SAVINGS', 'CRITICAL_ILLNESS', 'SPOUSE_GENDER',
       'SPOUSE_DATE_OF_BIRTH'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6841 entries, 43729 to 56494
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   6841 non-null   int32  
 1   RETIREMENT_AGE                           6841 non-null   int64  
 2   RETIREMENT_FUND_VALUE                    6841 non-null   int64  
 3   DEPT_VALUE                               6841 non-null   int64  
 4   CURRENT_NET_MONTHLY_INCOME               6841 non-null   int64  
 5   SPARE_CASH_VALUE                         6841 non-null   int64  
 6   FINANCIALLY_SUPPORT_PARTNER              6841 non-null   int32  
 7   FINANCIALLY_SUPPORT_CHILDREN             6841 non-

In [96]:
objList = df_test.select_dtypes(include = "object").columns
print (objList)

for feat in objList:
    df_test[feat] = le.fit_transform(df_test[feat].astype(str))

df_test.info()

Index(['GENDER', 'FINANCIALLY_SUPPORT_PARTNER', 'FINANCIALLY_SUPPORT_CHILDREN',
       'HAS_EMERGENCY_SAVINGS', 'CRITICAL_ILLNESS', 'SPOUSE_GENDER',
       'SPOUSE_DATE_OF_BIRTH'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3420 entries, 56495 to 62844
Data columns (total 40 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   GENDER                                   3420 non-null   int32  
 1   RETIREMENT_AGE                           3420 non-null   int64  
 2   RETIREMENT_FUND_VALUE                    3420 non-null   int64  
 3   DEPT_VALUE                               3420 non-null   int64  
 4   CURRENT_NET_MONTHLY_INCOME               3420 non-null   int64  
 5   SPARE_CASH_VALUE                         3420 non-null   int64  
 6   FINANCIALLY_SUPPORT_PARTNER              3420 non-null   int32  
 7   FINANCIALLY_SUPPORT_CHILDREN             3420 non-

In [52]:
df_train = df_train.fillna(0)

In [53]:
df_validation = df_validation.fillna(0)

In [98]:
df_test = df_test.fillna(0)
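The encoding cells above call `fit_transform` separately on the training, validation, and test frames, so the same category is not guaranteed to receive the same integer in each set. One possible alternative, sketched here on toy frames, is to fit a single encoder on the training data and reuse it elsewhere. The frames `train_demo` and `test_demo` and their values are assumptions for illustration, and `handle_unknown="use_encoded_value"` assumes scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames standing in for df_train and df_test (hypothetical values).
train_demo = pd.DataFrame({
    "GENDER": ["Female", "Male", "Female"],
    "CRITICAL_ILLNESS": ["No", "Yes", "No"],
})
test_demo = pd.DataFrame({
    "GENDER": ["Male", "Male", "Other"],  # "Other" never appears in training
    "CRITICAL_ILLNESS": ["Yes", "Yes", "No"],
})
cat_cols = ["GENDER", "CRITICAL_ILLNESS"]

# Fit once on the training data; unseen categories are mapped to -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = pd.DataFrame(
    encoder.fit_transform(train_demo[cat_cols]),
    columns=cat_cols,
    index=train_demo.index,
)

# Reuse the fitted mapping on the test data (transform, not fit_transform).
test_enc = pd.DataFrame(
    encoder.transform(test_demo[cat_cols]),
    columns=cat_cols,
    index=test_demo.index,
)
print(test_enc)
```

Here `Male` is encoded as 1 in both frames. Refitting an encoder on `test_demo` alone would instead assign `Male` the value 0, which is the kind of mismatch a shared encoder avoids.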

## <font color='goldenrod'> 2. Feature Engineering </font>

### 2.1. Features distribution

### 2.2. Dealing With Quasi-Constant Variables

#### 2.2.1. Feature Selection By Variance Threshold
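A quasi-constant feature takes the same value in almost every row, so it carries little information for a regression model. The following is a minimal sketch of filtering such features with `VarianceThreshold` (imported earlier). The toy frame and the threshold of `0.01` are assumptions chosen for illustration, not values tuned for this project.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n = 200
rng = np.random.default_rng(0)

# Toy data (assumption): one constant column, one quasi-constant column
# (a single 1 among 199 zeros), and one column with real spread.
quasi = np.zeros(n)
quasi[0] = 1
toy = pd.DataFrame({
    "constant": np.ones(n),
    "quasi_constant": quasi,
    "informative": rng.normal(size=n),
})

# Drop every feature whose variance is not above the threshold.
selector = VarianceThreshold(threshold=0.01)
selector.fit(toy)

kept = toy.columns[selector.get_support()].tolist()
dropped = [col for col in toy.columns if col not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

On the project data, the selector would be fitted on the training features only and the same column selection applied to the validation and test sets.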

#### 2.2.2. Feature Selection By Correlation

A correlation matrix is simply a table which displays the correlation coefficients for different variables. The matrix depicts the correlation between all the possible pairs of values in a table. 

##### Pearson's Correlation

It is a powerful tool to summarize a large dataset and to gauge how far the independent variables are linearly related to the target variable. So let's check this out.
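As a sketch of this check, the example below builds a Pearson correlation matrix with pandas and ranks features by the absolute value of their correlation with the target. The columns `feature_a`, `feature_b`, and `target` are synthetic stand-ins assumed for illustration; in the notebook the frame would be `df_train`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 300

# Synthetic data (assumption): the target depends linearly on feature_a only.
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
target = 3 * feature_a + rng.normal(size=n)
toy = pd.DataFrame({"feature_a": feature_a, "feature_b": feature_b, "target": target})

# Full Pearson correlation matrix between all pairs of columns.
corr = toy.corr(method="pearson")

# Absolute correlation of each feature with the target, strongest first.
target_corr = corr["target"].drop("target").abs().sort_values(ascending=False)
print(corr.round(2))
print(target_corr)
```

Pearson's coefficient only captures linear association, so a feature with a low value here can still be useful to a non-linear model.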

##  <font color='goldenrod'>3. Preparing The Data For Modeling </font>

### 3.1. Splitting the training set into independent and dependent variables

In [65]:
X = df_train.drop(['RETIREMENT_FUND_VALUE'], axis=1) # independent variables: every column except the target
y = df_train['RETIREMENT_FUND_VALUE'] # dependent variable (target)

### 3.2. Split Data Into Train & Test

In [56]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

## <font color='goldenrod'>4. Modeling </font>
Next, we bring in the regression models we wish to build and assign each one a variable name so we can easily call on it later.

### 4.1. Fitting Our Models
In the cells below, we fit the model to the training split and predict on the held-out split.

In [82]:
linear=LinearRegression()

In [83]:
linear.fit(X_train, y_train) # fitting our base model on the splits from the training data
linear_prediction = linear.predict(X_test) # making our predictions on the test set

### 4.2. Evaluating The Fitted Model
In the cells below, we inspect the intercept and coefficients of the fitted linear model.

In [84]:
# extract MLR model intercept
intercept_linear = float(linear.intercept_) # for the training data


In [85]:
print("Intercept of Linear Model From The Training Data:", intercept_linear)


Intercept of Linear Model From The Training Data: 8404839.579997778


In [86]:
# extract model coeffs
print("Coefficients of Linear Model For Training Data:", linear.coef_)


Coefficients of Linear Model For Training Data: [-8.91961597e+05 -1.85709383e+05  1.89570042e+00  2.54015300e+00
  1.89430301e+00 -1.62082843e+06  6.36492391e+04 -8.54513497e+03
 -5.16266897e-01  7.18619805e+03  1.38808836e+01 -4.42044966e+02
 -1.97929601e+06  3.62441743e+03  4.42046104e+03 -3.63030266e+03
  8.61758390e+03  8.61820914e+02 -4.84949344e+03  1.58290956e+02
 -9.52325532e+04  3.17137298e+04  3.92211405e+03  7.38680990e+04
  8.88610556e+02 -8.24926311e+01  1.38452608e+03  3.40851597e+02
 -1.62560497e+03 -9.05890739e+02 -1.42006294e+04  4.07444953e+03
  6.29932627e+03 -7.86661942e+03 -3.49924319e+03  1.51927160e+04
  1.08839151e+05  1.08839151e+05  1.47805254e+05]


### 4.3. Choosing The Best Model

#### 4.3.1. Calculate The R2 Score
Next, we calculate the `r2 score` for each regression model. We are looking for the model whose r2 is closest to `1`; the cell below shows how that plays out. Note that the score is computed on the hold-out split (`y_test`) taken from the training data.

In [88]:
print(f'R2 For Hold-Out Split Of Training Data:  {r2_score(y_test, linear_prediction)}')


R2 For Hold-Out Split Of Training Data:  0.911345824127714
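For reference, the r2 score is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below reproduces `r2_score` by hand on a tiny made-up example; the arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 means the predictions match exactly, 0 means the model does no better than predicting the mean, and negative values are possible when it does worse.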


Next, we plot a bar plot to visualize the r2 score of each regression model.

RMSE for the hold-out split of our Training Set

In [89]:
def rmse(y_test, y_predict):
    # Root mean squared error: square root of the mean squared residual
    return np.sqrt(mean_squared_error(y_test, y_predict))

print(f'Linear Model RMSE For Hold-Out Split Of Training Data:  {rmse(y_test, linear_prediction)}')


Linear Model RMSE For Hold-Out Split Of Training Data:  1144481.9530941774


RMSE for our Test Set

Difference between the Training Set RMSE and The Test Set RMSE 
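One way to compute this difference is sketched below on synthetic data: fit on one split, evaluate RMSE on both the fitted split and the held-out split, and subtract. The data, the names `X_demo` and `y_demo`, and the noise level are assumptions for illustration; in the notebook the same pattern would use `linear`, `X_train`, `X_test`, `y_train`, and `y_test`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (assumption): linear signal plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=500)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_tr, y_tr)


def rmse(y_true, y_pred):
    # Root mean squared error, in the same units as the target.
    return np.sqrt(mean_squared_error(y_true, y_pred))


rmse_train = rmse(y_tr, model.predict(X_tr))      # error on data the model saw
rmse_holdout = rmse(y_ho, model.predict(X_ho))    # error on unseen data
gap = rmse_holdout - rmse_train
print(rmse_train, rmse_holdout, gap)
```

A hold-out RMSE much larger than the training RMSE suggests overfitting or a mismatch between the two sets; a small gap suggests the model generalizes about as well as it fits.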

Now we want to create a visualization for our chosen model, using a scatter plot and a line that shows how well the model fits our test data.
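A minimal sketch of such a plot follows, using synthetic actual and predicted values (an assumption for illustration; in the notebook these would be `y_test` and `linear_prediction`). The diagonal line marks where a perfect prediction would fall.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for actual and predicted values (assumption).
rng = np.random.default_rng(0)
actual = rng.normal(5_000_000, 1_500_000, size=200)
predicted = actual + rng.normal(0, 500_000, size=200)

fig, ax = plt.subplots(figsize=(6, 6))

# Each point is one observation: x = actual value, y = model prediction.
ax.scatter(actual, predicted, alpha=0.5, label="predictions")

# Reference line y = x: points on it are predicted perfectly.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, color="red", label="perfect fit (y = x)")

ax.set_xlabel("Actual value")
ax.set_ylabel("Predicted value")
ax.legend()
```

With `%matplotlib inline` active, the figure renders at the end of the cell. The tighter the points cluster around the diagonal, the better the fit.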

### 4.4. Prediction Of Income For The Test Data

In [101]:
df_test1 = df_test.copy()

In [103]:
df_test1 = df_test1.drop(['RETIREMENT_FUND_VALUE'], axis=1)

Predict the values of our test data using the linear regression model fitted above.

In [104]:
prediction =  linear.predict(df_test1)

Now we print the predicted values of `RETIREMENT_FUND_VALUE` for the test data.

In [105]:
daf = pd.DataFrame(prediction, columns = ['RETIREMENT_FUND_VALUE'])
daf.head()

Unnamed: 0,RETIREMENT_FUND_VALUE
0,3993852.0
1,6352572.0
2,4383573.0
3,9440058.0
4,5440604.0


In [107]:
print(f'R2 For Test Data:  {r2_score(df_test.RETIREMENT_FUND_VALUE, prediction)}')


R2 For Test Data:  0.6083208463652969
