<a href="https://colab.research.google.com/github/Neoneto/CodingDojo_Week5/blob/main/Intro_to_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Machine Learning
Submitted by Kenneth Alaba


# Pre-Processing Exercise (Part 1)

## Loading the Data

In [1]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


In [2]:
# Load the data

## Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

## Opening the file
filename = '/content/drive/My Drive/Coding Dojo/05 Week 5: Intro to Machine Learning/insurance.csv'

## Storing the data in df
df = pd.read_csv(filename)

# display first few rows
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


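| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |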


Looking at the columns and their contents, sex, smoker, and region hold categorical values with no inherent order, so they are nominal features. pandas stores these columns with the object dtype. The remaining columns (age, bmi, children, and charges) are numerical.
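The same split can be checked programmatically with `select_dtypes`. The sketch below does not read the real file; it uses a small hand-made frame (`sample`, a hypothetical name) built from the first three rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for df: the first three rows displayed above
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Text columns are everything that is not numeric
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

Running the same two `select_dtypes` calls on `df` should list the nominal and numerical columns described above.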

In [3]:
# show df information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


Since we want to predict charges from the other patient information, the charges column is the target (y) and the remaining columns are the features (X). All the nominal columns are features, so only X needs to be one-hot encoded later; the target needs no encoding.

## Train Test Split

In [4]:
# split the df into features and targets
y = df[['charges']]
X = df.drop('charges', axis = 1)

In [5]:
#import sklearn
from sklearn.model_selection import train_test_split

In [6]:
# split the features and targets into train and test sets
X_train, X_test, y_train, y_test  = train_test_split(X, y, random_state=42)
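When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the rows for testing, and a fixed `random_state` makes the shuffle reproducible. The sketch below illustrates both points on toy arrays (`X_demo` and `y_demo` are hypothetical, not the insurance data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=42)
print(np.array_equal(X_te, X_te2))
```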

## OneHotEncode

In [7]:
# import additional libraries
from sklearn.compose import make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

In [8]:
# create categorical/nominal selector
cat_selector = make_column_selector(dtype_include='object')

In [10]:
# select categorical columns in X_train
cat_Xtrain = X_train[cat_selector(X_train)]

# OneHotEncode, encode the categories
# (sparse_output replaces the older `sparse` argument; requires scikit-learn >= 1.2)
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_encoder.fit(cat_Xtrain)
cat_ohe = ohe_encoder.transform(cat_Xtrain) # returns an array

# convert the array into a dataframe
# (get_feature_names_out replaces the removed get_feature_names)
cat_Xtrain = pd.DataFrame(cat_ohe, columns=ohe_encoder.get_feature_names_out(cat_Xtrain.columns))
cat_Xtrain.head()



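| | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |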


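This notebook does not include an explicit check for inconsistent category labels, but the encoded column names above show only the expected categories (two for sex, two for smoker, and four for region), so the categorical columns contain no inconsistencies.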
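An explicit check could look like the sketch below. It uses a deliberately messy, made-up column (`demo` is hypothetical) to show what inconsistent labels look like in `value_counts()` and one way to normalize them; the insurance data does not need this cleanup.

```python
import pandas as pd

# Hypothetical column with inconsistent labels (not from insurance.csv)
demo = pd.DataFrame({"sex": ["female", "Female", "male", "M", "male "]})

# value_counts() exposes every distinct spelling
print(demo["sex"].value_counts())

# Normalize whitespace and case, then map the stray abbreviation
cleaned = demo["sex"].str.strip().str.lower().replace({"m": "male"})
print(sorted(cleaned.unique()))
```

In the notebook itself, calling `value_counts()` on each of sex, smoker, and region would serve the same purpose.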

In [11]:
# Do the same for X_test

# select categorical columns in X_test
cat_Xtest = X_test[cat_selector(X_test)]

# OneHotEncode with the encoder already fitted on X_train (do not refit on the test set)
cat_ohe = ohe_encoder.transform(cat_Xtest) # returns an array

# convert the array into a dataframe
cat_Xtest = pd.DataFrame(cat_ohe, columns=ohe_encoder.get_feature_names_out(cat_Xtest.columns))
cat_Xtest.head()



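| | sex_female | sex_male | smoker_no | smoker_yes | region_northeast | region_northwest | region_southeast | region_southwest |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |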
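The usual convention is to fit the encoder on the training data only and reuse it for the test data, so both sets end up with the same columns in the same order. The sketch below shows that pattern on made-up frames (`train_demo` and `test_demo` are hypothetical) and what `handle_unknown='ignore'` does with a category that never appeared in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames: "northeast" appears only in the test data
train_demo = pd.DataFrame({"region": ["northwest", "southeast", "southwest"]})
test_demo = pd.DataFrame({"region": ["southeast", "northeast"]})

# Fit on the training data only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_demo)

# Transform the test data with the already-fitted encoder;
# the default output is a sparse matrix, so convert it to a dense array
encoded_test = encoder.transform(test_demo).toarray()

print(encoder.categories_[0])
print(encoded_test)
```

The unseen category is encoded as a row of zeros rather than raising an error, and the output keeps the three columns learned from the training data.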


# Linear Regression (Part 2)


## Transform the columns

In [12]:
# replace categorical columns with the encoded ones
X_train_processed = pd.concat([X_train.drop(cat_selector(X_train), axis=1).reset_index(drop=True),
                               cat_Xtrain], axis=1)

X_test_processed = pd.concat([X_test.drop(cat_selector(X_test), axis=1).reset_index(drop=True),
                              cat_Xtest], axis=1)
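As an alternative to dropping and concatenating columns by hand, scikit-learn's `make_column_transformer` can encode the categorical columns and pass the numeric ones through in a single step. This is a sketch of that alternative, not the approach used in this notebook; the frames `X_tr_demo` and `X_te_demo` are hypothetical, and the selector uses `dtype_exclude="number"` as an assumption to pick out the non-numeric columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature train and test feature frames
X_tr_demo = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [27.9, 33.77, 33.0, 22.705],
    "smoker": ["yes", "no", "no", "no"],
})
X_te_demo = pd.DataFrame({
    "age": [32, 31],
    "bmi": [28.88, 25.74],
    "smoker": ["no", "yes"],
})

# Encode non-numeric columns, pass numeric columns through unchanged.
# sparse_threshold=0 forces a dense array as the combined output.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude="number")),
    remainder="passthrough",
    sparse_threshold=0,
)

# Fit on the training data, then reuse the fitted transformer on the test data
X_tr_out = preprocessor.fit_transform(X_tr_demo)
X_te_out = preprocessor.transform(X_te_demo)

print(X_tr_out.shape, X_te_out.shape)
print(X_te_out)
```

Note that the encoded columns come first in the output, followed by the passed-through numeric columns.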

## Training the Model

In [15]:
# import additional library
from sklearn.linear_model import LinearRegression

In [16]:
# instantiate a linear regression model
reg = LinearRegression()

In [17]:
# fit the train set / train the model
reg.fit(X_train_processed, y_train) # no processing was needed for the target

LinearRegression()

## Evaluate the Performance of the Model

In [26]:
# Get the R^2 value of the model on the train set
train_score = reg.score(X_train_processed, y_train)
print('Train R^2 value:', round(train_score, 3))  # built-in round() also works when score is a plain float

Train R^2 value: 0.745


In [28]:
# Get the R^2 value of the model on the test set
test_score = reg.score(X_test_processed, y_test)
print('Test R^2 value:', round(test_score, 3))  # built-in round() also works when score is a plain float

Test R^2 value: 0.767


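The $R^2$ value for the test set (0.767) is close to that of the train set (0.745), which suggests that the model does not overfit. An $R^2$ of about 0.77 means the model explains roughly 77% of the variance in charges on the test set, so it is a reasonably good representation of the data, although part of the variance remains unexplained.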
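For reference, $R^2$ is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the target. The sketch below computes it by hand on made-up numbers (`y_true` and `y_pred` are hypothetical, not the model's predictions) and compares the result with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 would mean perfect predictions, while 0 would mean the model does no better than predicting the mean of the target.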