### Logistic Regression with scikit-learn

- Build a simple logistic regression model
- Handle categorical variables


In [1]:
import pandas as pd
df = pd.read_csv("../data/car_insurance.csv")

In [2]:
df.head(2)

| | policy_id | policy_tenure | age_of_car | age_of_policyholder | area_cluster | population_density | make | segment | model | fuel_type | ... | is_brake_assist | is_power_door_locks | is_central_locking | is_power_steering | is_driver_seat_height_adjustable | is_day_night_rear_view_mirror | is_ecw | is_speed_alert | ncap_rating | is_claim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ID00001 | 0.515874 | 0.05 | 0.644231 | C1 | 4990 | 1 | A | M1 | CNG | ... | No | No | No | Yes | No | No | No | Yes | 0 | 0 |
| 1 | ID00002 | 0.672619 | 0.02 | 0.375 | C2 | 27003 | 1 | A | M1 | CNG | ... | No | No | No | Yes | No | No | No | Yes | 0 | 0 |


**Predict if a claim will happen or not based on some numeric predictors**

In [3]:
df.isnull().sum()

policy_id                           0
policy_tenure                       0
age_of_car                          0
age_of_policyholder                 0
area_cluster                        0
population_density                  0
make                                0
segment                             0
model                               0
fuel_type                           0
max_torque                          0
max_power                           0
engine_type                         0
airbags                             0
is_esc                              0
is_adjustable_steering              0
is_tpms                             0
is_parking_sensors                  0
is_parking_camera                   0
rear_brakes_type                    0
displacement                        0
cylinder                            0
transmission_type                   0
gear_box                            0
steering_type                       0
turning_radius                      0
length      

In [5]:
### Assuming that an EDA has already been done, we can select the following predictor columns
pred_cols = ['policy_tenure','age_of_car','age_of_policyholder']
X = df[pred_cols].values
y = df['is_claim'].values

In [6]:
### We can divide the data into train and test sets and build a model
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [7]:
### Build a model
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression()

In [8]:
m = reg.fit(X_train,y_train)
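
The cells here create a test split but never score the model on it. The sketch below shows one way that could be done. It runs on synthetic stand-in data from `make_classification` rather than the insurance file, and the names `X_demo`, `y_demo`, `demo_model`, along with the `random_state` and `stratify` settings, are assumptions added for illustration, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three numeric predictors and the binary target
# (assumption: this is NOT the car insurance data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# random_state makes the split repeatable; stratify keeps the class ratio
# the same in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

demo_model = LogisticRegression().fit(X_tr, y_tr)

# Hard 0/1 predictions scored against the held-out labels
test_accuracy = accuracy_score(y_te, demo_model.predict(X_te))

# Column 1 of predict_proba is the predicted probability of class 1
claim_proba = demo_model.predict_proba(X_te)[:, 1]

print(f"test accuracy: {test_accuracy:.3f}")
print("first five predicted probabilities:", claim_proba[:5].round(3))
```

With the real data, the same calls would be applied to `X_test` and `y_test`. If claims are rare, accuracy alone can look good for a model that always predicts "no claim", so the predicted probabilities are often more informative than the hard labels.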

In [9]:
m.intercept_

array([-3.1832556])

In [10]:
m.coef_

array([[ 0.82917821, -2.94791754,  0.30480135]])

### Interpretation

- policy_tenure -> 0.82917821 (the log odds of filing a claim increase by about 0.83 for every unit increase in policy_tenure, holding the other predictors constant)
- age_of_car -> -2.94791754 (the log odds of filing a claim decrease by about 2.95 for every unit increase in age_of_car, holding the other predictors constant)
- age_of_policyholder -> 0.30480135 (the log odds of filing a claim increase by about 0.30 for every unit increase in age_of_policyholder, holding the other predictors constant)
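
Log odds are hard to read directly, so a common follow-up is to exponentiate each coefficient to get an odds ratio. The sketch below does that with the coefficient values printed above, copied in by hand; the dictionary name `coefficients` is only illustrative.

```python
import math

# Coefficients copied from the m.coef_ output above
coefficients = {
    "policy_tenure": 0.82917821,
    "age_of_car": -2.94791754,
    "age_of_policyholder": 0.30480135,
}

# exp(coefficient) is the factor by which the odds of a claim are multiplied
# for a one-unit increase in that predictor, other predictors held fixed
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio above 1 means the odds of a claim rise with the predictor, and a ratio below 1 means they fall. For example, the policy_tenure ratio is about 2.3, so a one-unit increase in policy_tenure roughly doubles the odds of a claim.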

## Handle categorical variables

In [11]:
pred_vars = ['policy_tenure','age_of_car','age_of_policyholder','is_speed_alert']
df[pred_vars].head(2)

| | policy_tenure | age_of_car | age_of_policyholder | is_speed_alert |
|---|---|---|---|---|
| 0 | 0.515874 | 0.05 | 0.644231 | Yes |
| 1 | 0.672619 | 0.02 | 0.375 | Yes |


In [12]:
X = pd.get_dummies(df[pred_vars])
X.head(2)

| | policy_tenure | age_of_car | age_of_policyholder | is_speed_alert_No | is_speed_alert_Yes |
|---|---|---|---|---|---|
| 0 | 0.515874 | 0.05 | 0.644231 | 0 | 1 |
| 1 | 0.672619 | 0.02 | 0.375 | 0 | 1 |


In [13]:
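### Keep is_speed_alert_No and drop is_speed_alert_Yes: the two dummy columns always sum to 1,
### so one of them is redundant and "Yes" becomes the reference category
X = X[['policy_tenure','age_of_car','age_of_policyholder','is_speed_alert_No']].values
y = df['is_claim'].values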
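
As an alternative to selecting the dummy columns by hand, `pd.get_dummies` can drop one level per categorical column itself via `drop_first=True`. The sketch below uses a small made-up frame named `toy` (the first two rows echo the output above, the third is hypothetical). Note that `drop_first` removes the first level in sorted order, here `No`, so the reference category becomes `No` rather than the `Yes` used in this notebook.

```python
import pandas as pd

# Toy frame for illustration only; the third row is hypothetical
toy = pd.DataFrame({
    "policy_tenure": [0.515874, 0.672619, 0.3],
    "is_speed_alert": ["Yes", "Yes", "No"],
})

# drop_first=True keeps k-1 dummy columns for a k-level categorical column;
# dtype=int gives 0/1 values instead of booleans
encoded = pd.get_dummies(
    toy, columns=["is_speed_alert"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded)
```

Either choice of reference category gives an equivalent model: the dummy coefficient simply flips sign and the intercept shifts to compensate.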

In [14]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)

In [15]:
reg = LogisticRegression()
m = reg.fit(X_train,y_train)

In [16]:
m.intercept_

array([-3.15521259])

In [17]:
m.coef_

array([[ 0.81369547, -2.86745535,  0.25959433, -0.39353955]])

### Interpretation

- policy_tenure: 0.81, for a unit increase in policy_tenure, the log odds of filing a claim increase by 0.81, holding the other predictors constant
- age_of_car: -2.87, for a unit increase in age_of_car, the log odds of filing a claim decrease by 2.87, holding the other predictors constant
- age_of_policyholder: 0.26, for a unit increase in age_of_policyholder, the log odds of filing a claim increase by 0.26, holding the other predictors constant
- is_speed_alert_No: -0.39, compared with the reference category is_speed_alert_Yes, a car without a speed alert has log odds of filing a claim that are lower by 0.39, holding the other predictors constant
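
To see how the intercept and coefficients combine into a prediction, the sketch below computes the log odds and the predicted claim probability by hand for the first row shown earlier. The values are copied from the outputs above; the variable names are illustrative, and the result is specific to the one fitted model whose coefficients were printed.

```python
import math

# Values copied from the m.intercept_ and m.coef_ outputs above
intercept = -3.15521259
coefs = [0.81369547, -2.86745535, 0.25959433, -0.39353955]

# First row of the data: policy_tenure, age_of_car, age_of_policyholder,
# is_speed_alert_No (0 because this car has a speed alert)
row = [0.515874, 0.05, 0.644231, 0]

# Linear predictor: intercept plus the coefficient-weighted sum of the features
log_odds = intercept + sum(beta * x for beta, x in zip(coefs, row))

# The logistic (sigmoid) function maps log odds to a probability in (0, 1)
probability = 1.0 / (1.0 + math.exp(-log_odds))

print(f"log odds: {log_odds:.4f}")
print(f"predicted probability of a claim: {probability:.4f}")
```

For a binary model this is the same calculation that `m.predict_proba` performs for the positive class, so the hand computation can serve as a sanity check on the interpretation of each coefficient.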