# Improving a Model

In this tutorial we will improve the logistic regression model we built in the previous tutorial. The techniques we will use to improve the model are:
1. Using a subset of the features
2. Tuning the model parameter C

We will measure improvement using precision, recall and f1-score, all of which can be read from the classification report.
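
As a reminder of what these three measures mean, here is a minimal sketch that computes them for a single class from raw counts. The function name `precision_recall_f1` and the counts are hypothetical and chosen only for illustration; they are not taken from the claims data.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    # Precision: of everything predicted as this class, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that truly is this class, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 20 false positives, 10 false negatives
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these counts the sketch prints `precision=0.80 recall=0.89 f1=0.84`. The classification report performs the same calculation once per class.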

## Using a subset of the features

This is the same technique we used in the linear regression sprint when we compared various models. If some features in the model are redundant, dropping them can lead to better results on the testing data. We will build three different models:
* Full model (all features)
* Basic info model (age, sex, region)
* Lifestyle model (steps, BMI, children, smoking status)

In [1]:
# Import all the libraries we will need
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [2]:
# Read claims data in and view first few entries
df = pd.read_csv('claims_data.csv')
df.head()

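| Unnamed: 0 | age | sex | bmi | steps | children | smoker | region | insurance_claim | claim_amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 3009 | 0 | yes | southwest | yes | 16884.924 |
| 1 | 18 | male | 33.77 | 3008 | 1 | no | southeast | yes | 1725.5523 |
| 2 | 28 | male | 33.0 | 3009 | 3 | no | southeast | no | 0.0 |
| 3 | 33 | male | 22.705 | 10009 | 0 | no | northwest | no | 0.0 |
| 4 | 32 | male | 28.88 | 8010 | 0 | no | northwest | yes | 3866.8552 |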


### Pre-Processing

Once again, we perform the following pre-processing steps for each of the three models:
* Splitting the data into features and labels
* Transforming the features 
* Splitting the data into training and testing data
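
The three steps above can be sketched on a tiny made-up table before applying them to the real data. The frame `toy` and its values are assumptions for illustration only; the calls are the same ones used in the cells that follow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only; the values are made up
toy = pd.DataFrame({
    'age': [19, 18, 28, 33, 32, 45],
    'sex': ['female', 'male', 'male', 'male', 'female', 'female'],
    'insurance_claim': ['yes', 'yes', 'no', 'no', 'yes', 'no'],
})

# 1. Split into features and labels
X_toy = toy.drop('insurance_claim', axis=1)
y_toy = toy['insurance_claim']

# 2. One-hot encode the text columns. drop_first=True drops the first
#    category of each column (here 'female'), since it is implied by the rest
X_toy_transformed = pd.get_dummies(X_toy, drop_first=True)
print(list(X_toy_transformed.columns))  # ['age', 'sex_male']

# 3. Hold out two of the six rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy_transformed, y_toy, test_size=2, random_state=50)
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible, which is why the cells below can split three different feature sets and still end up with the same rows in `y_train` and `y_test`.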

#### Labels

In [3]:
y = df['insurance_claim']

#### Full Model

In [4]:
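# Features: every column except the label. Note that this also keeps
# claim_amount and the 'Unnamed: 0' index column.
X_full = df.drop('insurance_claim', axis=1)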

# Transforming the Features
X_full_transformed = pd.get_dummies(X_full, drop_first=True)

# Train/test split
X_train_full, X_test_full, y_train, y_test = train_test_split(X_full_transformed, y, test_size=0.2, random_state=50)

#### Basic Info Model

In [5]:
# Features
X_info = df[['age','sex','region']]

# Transforming the Features
X_info_transformed = pd.get_dummies(X_info, drop_first=True)

# Train/test split
X_train_info, X_test_info, y_train, y_test = train_test_split(X_info_transformed, y, test_size=0.2, random_state=50)

#### Lifestyle Model

In [6]:
# Features
X_life = df[['bmi','steps','children','smoker']]

# Transforming the Features
X_life_transformed = pd.get_dummies(X_life, drop_first=True)

# Train/test split
X_train_life, X_test_life, y_train, y_test = train_test_split(X_life_transformed, y, test_size=0.2, random_state=50)

Now our data is ready. Let's train the logistic regression models.

### Training

We create three instances of the `LogisticRegression` class, one per feature set. Then we fit each model using its `fit()` method.

In [7]:
full = LogisticRegression()
info = LogisticRegression()
life = LogisticRegression()

In [8]:
full.fit(X_train_full,y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [9]:
info.fit(X_train_info, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [10]:
life.fit(X_train_life, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### Predicting

Next we make predictions on the testing data using the `predict()` method, so that we can compare them with the actual labels.

In [11]:
pred_full = full.predict(X_test_full)
pred_info = info.predict(X_test_info)
pred_life = life.predict(X_test_life)

### Testing

For testing the results we will look at precision, recall and f1-score in the **classification report**.

In [12]:
print('Full Model')
print(classification_report(y_test, pred_full, target_names=['No claim', 'Claim']))

print('Info Model')
print(classification_report(y_test, pred_info, target_names=['No claim', 'Claim']))

print('Lifestyle Model')
print(classification_report(y_test, pred_life, target_names=['No claim', 'Claim']))

Full Model
              precision    recall  f1-score   support

    No claim       1.00      1.00      1.00       116
       Claim       1.00      1.00      1.00       152

   micro avg       1.00      1.00      1.00       268
   macro avg       1.00      1.00      1.00       268
weighted avg       1.00      1.00      1.00       268

Info Model
              precision    recall  f1-score   support

    No claim       0.46      0.20      0.28       116
       Claim       0.57      0.82      0.68       152

   micro avg       0.55      0.55      0.55       268
   macro avg       0.52      0.51      0.48       268
weighted avg       0.52      0.55      0.50       268

Lifestyle Model
              precision    recall  f1-score   support

    No claim       0.88      0.84      0.86       116
       Claim       0.88      0.91      0.90       152

   micro avg       0.88      0.88      0.88       268
   macro avg       0.88      0.88      0.88       268
weighted avg       0.88      0.88   

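As you can see, the full model scored a perfect 1.00 on every measure, while the lifestyle model scored around 0.88. A perfect score deserves suspicion: `X_full` still contains `claim_amount`, which in the rows shown earlier is non-zero exactly when a claim was made, so the full model is probably reading the answer straight from its features rather than learning anything useful. Setting that aside, the lifestyle model comes close to the full model with only four features, which suggests that the basic information adds little.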
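
A quick way to test whether `claim_amount` gives the label away is to compare "a claim was made" with "the claim amount is positive". The sketch below does this on the five rows printed by `df.head()` earlier; the frame name `sample` is an assumption, and the same check would need to be run on the full `df` before drawing a conclusion about the whole dataset.

```python
import pandas as pd

# The five rows shown by df.head() earlier in this tutorial
sample = pd.DataFrame({
    'insurance_claim': ['yes', 'yes', 'no', 'no', 'yes'],
    'claim_amount': [16884.924, 1725.5523, 0.0, 0.0, 3866.8552],
})

# Cross-tabulate the label against "claim amount is positive"
has_amount = sample['claim_amount'] > 0
print(pd.crosstab(sample['insurance_claim'], has_amount))

# True if a positive amount coincides exactly with a 'yes' label
leaks = (has_amount == (sample['insurance_claim'] == 'yes')).all()
print(leaks)
```

If the check also holds on the full data, a fairer full model would drop `claim_amount` (and the `Unnamed: 0` index column) before training.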

The info model on its own produced poor results: its accuracy of 0.55 is roughly what we would expect from guessing.
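
To put a number on "guessing", one common baseline is to always predict the majority class. The support counts in the reports above (116 "No claim", 152 "Claim") are enough to work it out; the variable names are chosen for illustration.

```python
# Support counts taken from the classification reports above
n_no_claim = 116
n_claim = 152
n_total = n_no_claim + n_claim

# Accuracy of always predicting the majority class ("Claim")
majority_baseline = max(n_no_claim, n_claim) / n_total
print(round(majority_baseline, 2))  # 0.57
```

By this yardstick the info model, at 0.55, does slightly worse than always predicting "Claim".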

See if you can find a subset of the features that will outperform the lifestyle model!
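
One way to approach this exercise is to try every combination of candidate features and keep the best-scoring one. The sketch below does that on synthetic stand-in data: the frame `demo`, the labelling rule and the choice of weighted f1 as the score are all assumptions made for illustration, so the numbers it prints say nothing about the claims data.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up); in the tutorial you would use df instead
rng = np.random.default_rng(50)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(18, 65, n),
    'bmi': rng.normal(30, 6, n),
    'children': rng.integers(0, 5, n),
    'smoker': rng.choice(['yes', 'no'], n),
})
# Made-up labelling rule so that some features carry signal
demo_y = np.where((demo['bmi'] > 30) | (demo['smoker'] == 'yes'), 'yes', 'no')

candidates = ['age', 'bmi', 'children', 'smoker']
scores = {}
for size in range(1, len(candidates) + 1):
    for subset in combinations(candidates, size):
        # Same pre-processing as in the tutorial, applied to this subset
        X = pd.get_dummies(demo[list(subset)], drop_first=True)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, demo_y, test_size=0.2, random_state=50)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[subset] = f1_score(y_te, model.predict(X_te),
                                  average='weighted')

# Report the best-scoring subset
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Picking the subset by its test-set score tends to flatter the winner, so a separate validation set or cross-validation would be a safer basis for the choice.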

## Tuning the model parameter

Models have tuning parameters that can be changed to alter the fit of the model. 

For the logistic regression model we have a parameter **C**, which is the inverse of the regularization strength. Regularization penalizes large coefficients, so features that contribute little to the fit end up with weights close to zero while more important features keep greater weight. The smaller the value of **C**, the stronger the penalty. **C** must be greater than 0.
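
To see what the penalty does in practice, the sketch below fits the same model with three values of C and prints the overall size of the learned coefficients. It uses synthetic data from `make_classification` rather than the claims data, and the names `X_demo`, `y_demo` and `norms` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only (not the claims data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    random_state=50)

norms = {}
for C in [0.01, 1, 100]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # Overall size of the learned coefficients under this penalty strength
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C}: coefficient norm = {norms[C]:.3f}")
```

The coefficient norm should grow as C increases, because a larger C means a weaker penalty on the weights.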

Let's build three models (on the basic info features) with different values of **C** and see how they compare.

### Training

We create instances of the `LogisticRegression` class, specifying the value of the parameter C for each. Then we fit the models.

In [19]:
# C=0.1
model_1 = LogisticRegression(C=0.1)

# C=1 (same as original)
model_2 = LogisticRegression(C=1)

# C=10
model_3 = LogisticRegression(C=10)

In [20]:
model_1.fit(X_train_info,y_train)



LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [21]:
model_2.fit(X_train_info,y_train)



LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [22]:
model_3.fit(X_train_info,y_train)



LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### Predicting

The models have been trained. Let's make some predictions.

In [23]:
pred_1 = model_1.predict(X_test_info)
pred_2 = model_2.predict(X_test_info)
pred_3 = model_3.predict(X_test_info)

### Testing

We will once again use the classification report to compare the model results. 

In [24]:
print('C = 0.1')
print(classification_report(y_test, pred_1, target_names=['No claim', 'Claim']))

print('C = 1')
print(classification_report(y_test, pred_2, target_names=['No claim', 'Claim']))

print('C = 10')
print(classification_report(y_test, pred_3, target_names=['No claim', 'Claim']))

C = 0.1
              precision    recall  f1-score   support

    No claim       0.51      0.18      0.27       116
       Claim       0.58      0.87      0.70       152

   micro avg       0.57      0.57      0.57       268
   macro avg       0.55      0.52      0.48       268
weighted avg       0.55      0.57      0.51       268

C = 1
              precision    recall  f1-score   support

    No claim       0.46      0.20      0.28       116
       Claim       0.57      0.82      0.68       152

   micro avg       0.55      0.55      0.55       268
   macro avg       0.52      0.51      0.48       268
weighted avg       0.52      0.55      0.50       268

C = 10
              precision    recall  f1-score   support

    No claim       0.46      0.20      0.28       116
       Claim       0.57      0.82      0.68       152

   micro avg       0.55      0.55      0.55       268
   macro avg       0.52      0.51      0.48       268
weighted avg       0.52      0.55      0.50       268
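
Reports like the three above can also be returned as dictionaries, which makes it easier to pull out individual numbers and compare models side by side. The sketch below uses short made-up label lists in place of `y_test` and the predictions; `output_dict=True` is an existing option of `classification_report`.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions for illustration only
y_true = ['yes', 'yes', 'yes', 'no', 'no', 'yes']
y_pred = ['yes', 'yes', 'no', 'no', 'yes', 'yes']

# Labels are sorted alphabetically ('no', 'yes'), matching target_names
report = classification_report(
    y_true, y_pred, target_names=['No claim', 'Claim'], output_dict=True)

# Pick out single values instead of reading the printed table
print(report['Claim']['recall'])
print(report['weighted avg']['f1-score'])
```

Collecting one such dictionary per model would let you line up, say, the weighted f1-score for each value of C in a single table.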

It looks like C = 0.1 gave marginally better results (a weighted f1-score of 0.51 against 0.50), while C values of 1 and 10 produced identical reports. See if you can find a value of C that leads to even better results. And if you are up for the challenge, see if you can combine using a subset of the features and tuning C to find the optimal logistic regression model for this problem!
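
Rather than trying values of C by hand, one option is to let scikit-learn search a grid of candidates with cross-validation. The sketch below is a starting point under stated assumptions: it runs on synthetic data from `make_classification`, and the grid of C values and the `f1_weighted` scoring choice are illustrative rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only (not the claims data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    random_state=50)

# Candidate values of C to try (an arbitrary illustrative grid)
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# 5-fold cross-validation, scoring each candidate by weighted f1
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid,
    scoring='f1_weighted', cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

To apply this to the tutorial, you would pass a training set such as `X_train_life` and `y_train` to `search.fit()` and keep the test set aside for a final check.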