<a href="https://colab.research.google.com/github/Jonny-T87/Dojo-Work/blob/main/First_Model_(Practice).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Jonny Tesfahun
- 07/01/22

For this exercise, you will create, fit, and evaluate the performance of a linear regression model.  The machine learning question is: 

How well can the additional charges be predicted based on the age, sex, BMI, number of children, smoking habit, and region of the patient?  

This is the dataset you will be using: insurance.csv

For this task, you will need to:

- Create a preprocessing object, such as a column transformer or pipeline, that will:
  - Ordinal encode any ordinal features
  - One-hot encode any nominal features
  - Scale any numeric features
- Instantiate a linear regression model
- Create a model pipeline with your preprocessor first and the linear regression model last
- Fit the modeling pipeline on the training data
- Evaluate the model performance on both the training set and the test set using the R-squared score

In [16]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import plot_tree
from sklearn import set_config
set_config(display='diagram')

In [20]:
# Create a function to take the true and predicted labels and print MAE, MSE, RMSE, and R2 metrics
def evaluate_regression(y_true, y_pred):
  """Takes true target and predicted target and prints MAE, MSE, RMSE and R2"""
  
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = np.sqrt(mse)
  r2 = r2_score(y_true, y_pred)

  print(f'scores: \nMAE: {mae:,.2f} \nMSE: {mse:,.2f} \nRMSE: {rmse:,.2f} \nR2: {r2:.2f}')
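
To make the four metrics concrete, here is a small sketch that computes them by hand with NumPy. The arrays are made-up values chosen for easy arithmetic, not rows from insurance.csv, and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical true and predicted values, chosen only for illustration
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true_demo - y_pred_demo       # residuals: [1, 0, -1, -2]

mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error, in target units

# R2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.2f}  MSE: {mse:.2f}  RMSE: {rmse:.2f}  R2: {r2:.2f}')
```

RMSE is usually the easiest of the error metrics to read because it is expressed in the same units as the target, which here would be dollars of charges.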

In [2]:
df = pd.read_csv('/content/drive/MyDrive/DojoBootCamp/Project Files/insurance.csv')

In [3]:
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   int64  
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(3), object(2)
memory usage: 73.3+ KB


In [4]:
df['smoker'].value_counts()

no     1064
yes     274
Name: smoker, dtype: int64

In [5]:
# smoker is a binary nominal feature, so a replacement dictionary maps 'no' to 0 and 'yes' to 1.
# The replacement also converts the column from object to integer.
replacement_dictionary = {'no':0, 'yes':1}

In [7]:
# Assign the result back; inplace=True on a selected column may not update df in newer pandas
df['smoker'] = df['smoker'].replace(replacement_dictionary)
df['smoker'].value_counts()

0    1064
1     274
Name: smoker, dtype: int64

In [8]:
# Split the data into features (X) and target (y)
X = df.drop('charges', axis=1)
y = df['charges']
# Train/test split the data to prepare for machine learning
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
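
As a quick check of what the split does, the sketch below runs `train_test_split` on a small made-up array. With no `test_size` given, scikit-learn holds out 25% of the rows for testing, and a fixed `random_state` makes the split repeatable. The `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default test_size is 0.25 when neither test_size nor train_size is given
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=42)
same_split = np.array_equal(y_te, y_te2)
print(same_split)
```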

In [9]:
# Check the training features
X_train

,age,sex,bmi,children,smoker,region
693,24,male,23.655,0,0,northwest
1297,28,female,26.510,2,0,southeast
634,51,male,39.700,1,0,southwest
1022,47,male,36.080,1,1,southeast
178,46,female,28.900,2,0,southwest
...,...,...,...,...,...,...
1095,18,female,31.350,4,0,northeast
1130,39,female,23.870,5,0,southeast
1294,58,male,25.175,0,0,northeast
860,37,female,47.600,2,1,southwest


In [12]:
# Make a column selector for the object (categorical) columns
cat_selector = make_column_selector(dtype_include='object')

In [13]:
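# One-hot encoder for the object columns
# Note: scikit-learn 1.2+ names this argument sparse_output; older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')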

In [14]:
# Pair the one-hot encoder with the categorical selector
ohe_tuple = (ohe, cat_selector)

In [15]:
# let the numeric columns pass through unchanged
preprocessor = make_column_transformer(ohe_tuple, remainder='passthrough')
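
The assignment also asks for numeric features to be scaled, while the preprocessor above passes them through unchanged. A minimal sketch of one way to add scaling follows, using a tiny made-up DataFrame. The names `demo`, `num_selector_demo`, `cat_selector_demo`, and `scaling_preprocessor` are hypothetical, and selecting categorical columns with `dtype_exclude='number'` is a choice made for this sketch rather than what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the insurance features
demo = pd.DataFrame({
    'age': [19, 18, 28, 33],
    'bmi': [27.9, 33.77, 33.0, 22.705],
    'sex': ['female', 'male', 'male', 'male'],
})

# Numeric columns get scaled; everything else gets one-hot encoded
num_selector_demo = make_column_selector(dtype_include='number')
cat_selector_demo = make_column_selector(dtype_exclude='number')

scaling_preprocessor = make_column_transformer(
    (StandardScaler(), num_selector_demo),
    (OneHotEncoder(handle_unknown='ignore'), cat_selector_demo),
    sparse_threshold=0,  # always return a dense array
)

processed = scaling_preprocessor.fit_transform(demo)
print(processed.shape)                 # 2 scaled columns + 2 one-hot columns
print(processed[:, :2].mean(axis=0))   # scaled columns are centered near 0
```

For ordinary least squares, scaling the inputs should not change the predictions or the R-squared score, but it puts the coefficients on a comparable footing and matters for regularized or distance-based models.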

In [17]:
# Instantiate a linear regression model
lin_reg = LinearRegression()

In [18]:
# put the model in a pipeline with the preprocessor
lin_reg_pipe = make_pipeline(preprocessor, lin_reg)

In [19]:
# Fit the linear regression pipeline on the training data
lin_reg_pipe.fit(X_train, y_train)

In [21]:
#Evaluate the model performance on both the training set and the test set using the R-squared score.
print('Training')
evaluate_regression(y_train, lin_reg_pipe.predict(X_train))
print('Testing')
evaluate_regression(y_test, lin_reg_pipe.predict(X_test))

Training
scores: 
MAE: 4,183.15 
MSE: 37,004,502.18 
RMSE: 6,083.13 
R2: 0.74
Testing
scores: 
MAE: 4,243.65 
MSE: 35,117,755.74 
RMSE: 5,926.02 
R2: 0.77
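
One way to put an R-squared score in context is to compare it with a baseline that always predicts the training mean. The sketch below does this with `DummyRegressor` on synthetic data. The data, seed, and variable names are assumptions for illustration, and the numbers it prints say nothing about the insurance results above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a linear signal plus noise (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

# The baseline ignores the features and predicts the mean of y
baseline = DummyRegressor(strategy='mean').fit(X_demo, y_demo)
linear = LinearRegression().fit(X_demo, y_demo)

baseline_r2 = r2_score(y_demo, baseline.predict(X_demo))
linear_r2 = r2_score(y_demo, linear.predict(X_demo))

print(f'Baseline R2: {baseline_r2:.2f}')
print(f'Linear   R2: {linear_r2:.2f}')
```

On the data it was fit to, the mean baseline scores an R-squared of zero by construction, so any positive score from the real model is the share of variance it explains beyond that baseline.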
