**Classification model to predict whether an insurance policyholder will make a claim within the upcoming year.**

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report

In [2]:
# Read data in and view first few entries
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/classification_sprint/claims_data.csv')
df.head()

Unnamed: 0,age,sex,bmi,steps,children,smoker,region,insurance_claim,claim_amount
0,19,female,27.9,3009,0,yes,southwest,yes,16884.924
1,18,male,33.77,3008,1,no,southeast,yes,1725.5523
2,28,male,33.0,3009,3,no,southeast,no,0.0
3,33,male,22.705,10009,0,no,northwest,no,0.0
4,32,male,28.88,8010,0,no,northwest,yes,3866.8552


### Pre-processing

In [3]:
# labels
y = df['insurance_claim']

# features
X = df.drop('insurance_claim', axis=1)

In [4]:
# One-hot encode the categorical features; drop_first removes one redundant dummy column per feature
X_transformed = pd.get_dummies(X, drop_first=True)
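To illustrate what the encoding step does, here is a minimal sketch on a hypothetical three-row frame. The frame `toy` and its values are made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame reusing column names from the dataset
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Each categorical column becomes indicator columns; drop_first removes the
# first category (in sorted order) so that it acts as the baseline
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(toy_encoded.columns.tolist())
```

The dropped categories (`female` and `no`) become the baseline against which `sex_male` and `smoker_yes` are measured.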

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.2, random_state=50)

### Training

In [7]:
from sklearn.linear_model import LogisticRegression

In [8]:
lr = LogisticRegression()

In [9]:
lr.fit(X_train, y_train)

The intercept, β0, is the log odds of an observation belonging to the positive class (here `yes`, the second entry of `lr.classes_`) when all predictor variables are equal to zero.

We can exponentiate this value, i.e. raise the natural number e to it, to convert the log odds into ordinary odds.

In [11]:
lr.intercept_[0]

-9.770115053602655e-05
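As a worked check of the exponentiation step, the sketch below converts the intercept printed above into odds and then into a probability. It uses only the standard library, and the variable names are illustrative rather than part of the original notebook.

```python
import math

# Intercept value copied from the output of lr.intercept_[0] above
intercept = -9.770115053602655e-05

# Exponentiating the log odds gives the odds of the positive class
baseline_odds = math.exp(intercept)

# Odds convert to a probability via p = odds / (1 + odds)
baseline_prob = baseline_odds / (1 + baseline_odds)

print(f"odds: {baseline_odds:.6f}")
print(f"probability: {baseline_prob:.6f}")
```

An intercept this close to zero corresponds to odds close to 1, i.e. roughly even chances when every predictor is zero.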

For binary categorical variables, like smoker and sex, the coefficient is the log odds ratio comparing the class implied by a one for the variable (i.e. smoker) with the class implied by a zero (i.e. non-smoker).

Effectively, each coefficient measures the change in the log odds of belonging to the positive class for a one-unit increase in the variable, holding the other variables constant.

In [12]:
coeff_df = pd.DataFrame(lr.coef_.T, X_transformed.columns, columns=['Coefficient'])
coeff_df

Unnamed: 0,Coefficient
age,-0.002559
bmi,-0.001818
steps,-0.004556
children,-0.000208
claim_amount,0.038042
sex_male,-7e-06
smoker_yes,-1.3e-05
region_northwest,-2.3e-05
region_southeast,-1.4e-05
region_southwest,5e-06
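Because coefficients are on the log odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below does this for the values printed above; the Series is rebuilt by hand so the block runs on its own, and `coefficients` and `odds_ratios` are illustrative names.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the coeff_df output above
coefficients = pd.Series({
    "age": -0.002559,
    "bmi": -0.001818,
    "steps": -0.004556,
    "children": -0.000208,
    "claim_amount": 0.038042,
    "sex_male": -7e-06,
    "smoker_yes": -1.3e-05,
    "region_northwest": -2.3e-05,
    "region_southeast": -1.4e-05,
    "region_southwest": 5e-06,
}, name="Coefficient")

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that variable
odds_ratios = np.exp(coefficients).rename("Odds ratio")
print(odds_ratios)
```

Values above 1 mean a one-unit increase raises the odds of a claim; values below 1 mean it lowers them. Since the features are not scaled, the magnitudes are not directly comparable across variables.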


In [18]:
# Calculate and print the mean accuracy on the training dataset
train_score = lr.score(X_train, y_train)
print("Training Score:", train_score)

# Calculate and print the mean accuracy on the test dataset
test_score = lr.score(X_test, y_test)
print("Test Score:", test_score)

Training Score: 1.0
Test Score: 1.0
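A perfect score on both sets is worth checking for target leakage. In the rows displayed by `df.head()`, `claim_amount` is zero exactly when `insurance_claim` is `no`, so that feature may reveal the label directly. The sketch below runs the check on the five displayed rows only. Whether it holds for the full dataset is an assumption to verify, and `head_rows` and `matches` are illustrative names.

```python
import pandas as pd

# The five rows shown in the df.head() output above (relevant columns only)
head_rows = pd.DataFrame({
    "insurance_claim": ["yes", "yes", "no", "no", "yes"],
    "claim_amount": [16884.924, 1725.5523, 0.0, 0.0, 3866.8552],
})

# Does a non-zero claim amount coincide with a 'yes' label on every row?
matches = (head_rows["claim_amount"] > 0) == (head_rows["insurance_claim"] == "yes")
print(matches.all())

# If the same holds on the full data, one option is to exclude the column:
# X = df.drop(['insurance_claim', 'claim_amount'], axis=1)
```

Running the same comparison on the full `df` would show whether `claim_amount` alone determines the label, in which case the scores above say little about the other features.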


### Prediction

In [17]:
y_pred = lr.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

          no       1.00      1.00      1.00       116
         yes       1.00      1.00      1.00       152

    accuracy                           1.00       268
   macro avg       1.00      1.00      1.00       268
weighted avg       1.00      1.00      1.00       268
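To make the report concrete, the sketch below recomputes precision, recall and F1 for the `yes` class from confusion counts. The counts are inferred from the report (perfect scores with supports of 116 and 152 imply no misclassifications), and the function name is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 'yes' class: all 152 positives predicted correctly,
# and none of the 116 negatives mislabelled as 'yes'
yes_metrics = precision_recall_f1(tp=152, fp=0, fn=0)
print(yes_metrics)
```

Any false positive or false negative would pull the corresponding metric below 1.00, which is what a report on a harder problem would typically show.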

