Project-6 Shuzo Katayama, October 25 2020

This project takes personal data from a variety of American health insurance beneficiaries, including their age, sex, BMI, number of children, and whether they smoke, along with the individual medical costs billed by their health insurance. The task of the models is to predict an individual's insurance cost from the aforementioned factors.

Data from: https://www.kaggle.com/mirichoi0218/insurance

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import numpy as np
import pandas as pd

In [5]:
df = pd.read_csv('insurance.csv')
df

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [6]:
a = df.to_numpy()

Here, I will clean the data: the values under 'sex' become 0 for male and 1 for female, and the values under 'smoker' become 0 for no and 1 for yes.

In [7]:
# Convert 'male' to 0, 'female' to 1 
for item in a:
    if item[1] == "male":
        item[1] = 0
    else:
        item[1] = 1

In [10]:
# Convert 'no' to 0, 'yes' to 1 
for item in a:
    if item[4] == "no":
        item[4] = 0
    else:
        item[4] = 1
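
The two loops above can also be written directly against the DataFrame. The following is a minimal sketch, assuming the same column names as the table above; the three-row `sample` frame is made up purely for illustration and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical three-row sample using the same column names as insurance.csv
sample = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Map each category to an integer code; any value missing from the dict becomes NaN
sample["sex"] = sample["sex"].map({"male": 0, "female": 1})
sample["smoker"] = sample["smoker"].map({"no": 0, "yes": 1})

print(sample)
```

One difference from the loops: an unexpected value (a typo such as "Male", say) shows up as NaN here rather than silently falling into the `else` branch.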

Here, I will split the data into training and testing sets using train_test_split, and standardise the data using StandardScaler.

In [12]:
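# Splitting into features (X) and targets (y)
# Columns 0-4: age, sex, bmi, children, smoker; 'region' (column 5) is left out
X = a[:, [0,1,2,3,4]]
X = X.astype('float64')
# Column 6: charges
y = a[:, 6]
y = y.astype('float64')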

In [15]:
# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

In [16]:
# Standard scaler code from regression example
# Scale the features using statistics from the training set only
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Scale the targets with a separate scaler so the feature scaler is not overwritten
sc_y = StandardScaler()
sc_y.fit(y_train[:, np.newaxis])
y_train_std = sc_y.transform(y_train[:, np.newaxis]).flatten()
y_test_std = sc_y.transform(y_test[:, np.newaxis]).flatten()
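
To make the standardisation step concrete, here is a small sketch of the arithmetic that `StandardScaler` performs, written with plain NumPy. The numbers are invented for illustration, and the names `y_train_demo`, `y_test_demo`, `mu` and `sigma` are not part of the notebook.

```python
import numpy as np

# Made-up charges purely for illustration (not taken from the dataset)
y_train_demo = np.array([1000.0, 2000.0, 3000.0, 6000.0])
y_test_demo = np.array([2500.0, 5000.0])

# The statistics come from the training targets only
mu = y_train_demo.mean()
sigma = y_train_demo.std()  # population standard deviation (ddof=0)

y_train_z = (y_train_demo - mu) / sigma
y_test_z = (y_test_demo - mu) / sigma  # test data reuses the training statistics

# Undo the scaling to express standardised values in the original units
y_test_back = y_test_z * sigma + mu

print(y_train_z.mean(), y_train_z.std())
print(y_test_back)
```

Because the targets are standardised, the MSE values reported later are in units of squared standard deviations rather than squared dollars; multiplying a standardised prediction by `sigma` and adding `mu` recovers a dollar amount.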

The first model trained on this data is the Random Forest Regressor, built into scikit-learn. This model trains a large number of small and 'sloppy' trees and aggregates their results into a prediction. Instead of taking a majority vote as a random forest classifier does, the regressor averages the outputs, since it produces a continuous answer.
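
The averaging behaviour can be checked on a small example. This is a sketch on synthetic data (not the insurance dataset); `X_demo`, `y_demo` and `forest` are illustrative names, and the forest is kept deliberately small so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data purely for illustration
rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)
y_demo = 3.0 * X_demo[:, 0] + np.sin(6.0 * X_demo[:, 1]) + 0.1 * rng.randn(200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Predict with every individual tree, then average across the trees
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The manual average should agree with the forest's own prediction
print(np.allclose(manual_average, forest.predict(X_demo)))
```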

In [19]:
from sklearn.ensemble import RandomForestRegressor

In [24]:
# Training the model
# criterion is left at its default (mean squared error); newer scikit-learn releases reject the 'mse' spelling
est = RandomForestRegressor(n_estimators=1000, random_state=1, n_jobs=-1)
est.fit(X_train_std, y_train_std)

RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=1)

In [25]:
# Testing the model
y_train_pred = est.predict(X_train_std)
y_test_pred = est.predict(X_test_std)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train_std, y_train_pred),
        mean_squared_error(y_test_std, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train_std, y_train_pred),
        r2_score(y_test_std, y_test_pred)))

MSE train: 0.023, test: 0.174
R^2 train: 0.977, test: 0.820


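The Random Forest Regression model has an R<sup>2</sup> of 0.977 on the training set, which means that the model fit the training data very well. On the test set the R<sup>2</sup> drops to 0.820, so the model overfits to some degree but still generalises reasonably well.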
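
For reference, both printed metrics can be computed by hand: MSE is the mean of the squared residuals, and R<sup>2</sup> is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers rather than the model's predictions.

```python
import numpy as np

# Made-up values purely for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)

ss_res = np.sum(residuals ** 2)                 # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(mse, r2)
```

Since the training targets were standardised to unit variance, the training MSE and training R<sup>2</sup> sum to one (0.023 + 0.977). The same does not hold exactly on the test set, whose variance after scaling is not exactly one.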

The next model I will train is the symbolic regressor, using gplearn. The symbolic regressor uses a genetic algorithm to evolve a mathematical function from the primordial ooze it is given (a variety of simple mathematical functions). Through this genetic algorithm, the symbolic regressor creates a function that fits the data.
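
To give a feel for what the genetic algorithm manipulates, here is a toy sketch of a single candidate program represented as an expression tree, together with a fitness score. It is not gplearn's implementation: the tuple representation, the `evaluate` helper and the reduced function table are all assumptions made for illustration. The fitness uses mean absolute error, which is gplearn's default metric as far as I know, and `protected_div` imitates the way gplearn guards division against near-zero denominators.

```python
import math

def protected_div(a, b):
    # Simplified guard: return 1.0 when the denominator is close to zero
    return a / b if abs(b) > 0.001 else 1.0

# Reduced, illustrative function table (the notebook uses a larger set)
FUNCTIONS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": protected_div,
    "neg": lambda a: -a,
    "abs": abs,
    "sin": math.sin,
}

def evaluate(node, x):
    """Recursively evaluate an expression tree for one sample x (a list of features)."""
    if isinstance(node, tuple):   # ("name", child, ...) is a function node
        name, *children = node
        return FUNCTIONS[name](*(evaluate(child, x) for child in children))
    if isinstance(node, str):     # "X0", "X1", ... refers to a feature
        return x[int(node[1:])]
    return node                   # anything else is a numeric constant

def mean_absolute_error(program, X, y):
    """Fitness of a program: average absolute difference to the targets."""
    return sum(abs(evaluate(program, xi) - yi) for xi, yi in zip(X, y)) / len(y)

# Hypothetical candidate program: add(mul(X0, 0.5), X1)
candidate = ("add", ("mul", "X0", 0.5), "X1")
X_demo = [[2.0, 1.0], [4.0, 0.0], [0.0, 3.0]]
y_demo = [2.0, 2.5, 3.0]

print(evaluate(candidate, X_demo[0]))
print(mean_absolute_error(candidate, X_demo, y_demo))
```

A genetic algorithm keeps a population of such trees, scores each one with the fitness function, and builds the next generation by recombining and mutating subtrees of the better-scoring candidates.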

In [26]:
from gplearn.genetic import SymbolicRegressor

In [27]:
est = SymbolicRegressor(population_size=1000,
                        init_depth=(4,6),
                        generations=100, stopping_criteria=0.01,
                        p_crossover=0.3, p_subtree_mutation=0.35,
                        p_hoist_mutation=0.0, p_point_mutation=0.35,
                        max_samples=1.0, verbose=1,
                        #const_range=None,
                        const_range=(-1.0,1.0),
                        tournament_size=5,
                        function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log', 
                                      'abs', 'neg', 'inv', 'max','min', 'sin', 'cos', 'tan'),
                        parsimony_coefficient=0.0001, random_state=0)
est.fit(X_train_std, y_train_std)

    |   Population Average    |             Best Individual              |
---- ------------------------- ------------------------------------------ ----------
 Gen   Length          Fitness   Length          Fitness      OOB Fitness  Time Left
   0    14.29          4.58055        3         0.455755              N/A     51.59s
   1    11.51          1.23798        1         0.455755              N/A     52.09s
   2     9.83          1.19294        3         0.455755              N/A     49.87s
   3     9.24          1.17491       11         0.431811              N/A     48.09s
   4     7.47         0.931876        8          0.43673              N/A     46.33s
   5     5.98          2.35363        8          0.43673              N/A     45.52s
   6     4.75          1.02175        8          0.43673              N/A     44.75s
   7     3.41         0.934759       16         0.407669              N/A     42.08s
   8     3.20          0.90123       19         0.407669              N/A  

  94   114.42         0.465636      103         0.208658              N/A      6.19s
  95   117.62         0.452941      125         0.208525              N/A      5.00s
  96   118.55         0.511143      199         0.208658              N/A      3.80s
  97   121.99         0.533269      179         0.204973              N/A      2.55s
  98   124.05         0.488259      165         0.205755              N/A      1.30s
  99   127.01         0.469231      160         0.205755              N/A      0.00s


SymbolicRegressor(function_set=('add', 'sub', 'mul', 'div', 'sqrt', 'log',
                                'abs', 'neg', 'inv', 'max', 'min', 'sin', 'cos',
                                'tan'),
                  generations=100, init_depth=(4, 6), p_crossover=0.3,
                  p_hoist_mutation=0.0, p_point_mutation=0.35,
                  p_subtree_mutation=0.35, parsimony_coefficient=0.0001,
                  random_state=0, stopping_criteria=0.01, tournament_size=5,
                  verbose=1)

In [28]:
y_train_pred = est.predict(X_train_std)
y_test_pred = est.predict(X_test_std)

print('MSE train: %.3f, test: %.3f' % (
        mean_squared_error(y_train_std, y_train_pred),
        mean_squared_error(y_test_std, y_test_pred)))
print('R^2 train: %.3f, test: %.3f' % (
        r2_score(y_train_std, y_train_pred),
        r2_score(y_test_std, y_test_pred)))

MSE train: 0.157, test: 0.173
R^2 train: 0.843, test: 0.821


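The symbolic regressor model has an R<sup>2</sup> of 0.843 on the training set, which means that it fit the training data worse than the Random Forest regression model (0.977). On the test set, however, the two models are nearly identical (0.821 versus 0.820).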

Ultimately, both models were able to model the data reasonably well: the random forest regressor had a very good coefficient of determination on the training set, and the symbolic regressor had a lower training score but a nearly identical score on the test set. The data presented, therefore, has a good correlation between its factors (the age, sex, bmi, 