In [1]:
import pandas as pd

# load dataset
df_raw = pd.read_csv('aug_train.csv')

# show columns in dataset
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382154 entries, 0 to 382153
Data columns (total 12 columns):
id                      382154 non-null int64
Gender                  382154 non-null object
Age                     382154 non-null int64
Driving_License         382154 non-null int64
Region_Code             382154 non-null float64
Previously_Insured      382154 non-null int64
Vehicle_Age             382154 non-null object
Vehicle_Damage          382154 non-null object
Annual_Premium          382154 non-null float64
Policy_Sales_Channel    382154 non-null float64
Vintage                 382154 non-null int64
Response                382154 non-null int64
dtypes: float64(3), int64(6), object(3)
memory usage: 35.0+ MB


The target label is “Response”.

Prior to any model building, we need to preprocess the raw data.

In [2]:
# remove missing values
df = df_raw.dropna()

# remove id column
df = df.drop('id', axis=1)

# identify categorical variables
cat_var = [column for column in df.columns if df[column].dtype == 'object']

# one hot encode categorical variables in a single call;
# each new column is prefixed with its source column (e.g. Gender_Male)
df = pd.get_dummies(df, columns=cat_var)

# create input and output data
target = 'Response'
X = df.drop(target, axis=1)
y = df[target]

A quick count shows us the number of records in each class.

In [3]:
from collections import Counter

# count the frequency of each class
count = Counter(y)
print(count)

Counter({0: 319553, 1: 62601})


The majority class is about five times the size of the minority class (319,553 vs. 62,601 records), a clear sign of class imbalance.
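As a quick check on that figure, the ratio can be computed directly from the counts. This is a minimal sketch that hard-codes the counts printed above rather than reading them from `y`; the variable names `imbalance_ratio` and `minority_share` are illustrative and not part of the original notebook.

```python
from collections import Counter

# Class counts copied from the Counter output above.
count = Counter({0: 319553, 1: 62601})

majority = max(count.values())
minority = min(count.values())
total = sum(count.values())

# Majority records per minority record.
imbalance_ratio = majority / minority
# Fraction of all records that belong to the minority class.
minority_share = minority / total

print(f"imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"minority share: {minority_share:.1%}")
```

In the notebook itself, the hard-coded `Counter` would simply be replaced by `Counter(y)`.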



First, let’s build a random forest model without SMOTE to predict customer interest in vehicle insurance.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# normalize the data 
mms = MinMaxScaler()
X_train = mms.fit_transform(X_train)
X_test = mms.transform(X_test)

# generate predictions with the random forest classifier
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)

We’ll evaluate this model with the F1 score.
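The F1 score is the harmonic mean of precision and recall for the positive class, which makes it more informative than accuracy when one class dominates. The sketch below computes it by hand on a small set of made-up labels; `f1_from_labels` and the toy lists are hypothetical and only illustrate the formula, while the notebook itself relies on scikit-learn's `f1_score`.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute the binary F1 score for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels (invented for illustration): 6 negatives, 4 positives.
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

toy_f1 = f1_from_labels(y_true_toy, y_pred_toy)

# A model that always predicts the majority class is 60% accurate here,
# yet its F1 score is zero because it never finds a positive record.
always_zero_f1 = f1_from_labels(y_true_toy, [0] * len(y_true_toy))

print(round(toy_f1, 2), always_zero_f1)
```

On the toy labels, precision is 2/3 and recall is 1/2, so the F1 score is 4/7.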

In [5]:
from sklearn.metrics import f1_score
import numpy as np

# compute the F1 score
f1 = np.round(f1_score(y_test, y_pred), 2)
print('F1 score of model without SMOTE: {}'.format(f1))

F1 score of model without SMOTE: 0.44


Next, we will repeat the same procedure, but after adding artificial data to the training set. We can do this in Python with the SMOTE class from the imbalanced-learn package (imported as imblearn).

In [7]:
from imblearn.over_sampling import SMOTE

# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# create artificial data with SMOTE (seeded so the resampling is reproducible)
oversample = SMOTE(random_state=42)
X_train_smote, y_train_smote = oversample.fit_resample(X_train, y_train)

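Note that SMOTE is applied only to the training set. As mentioned previously, it is very important not to generate artificial data for the testing set: the testing set should contain only real records, so that the score reflects how the model performs on genuine data.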
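To make "artificial data" concrete: SMOTE creates each new minority record by picking a minority record, choosing one of its nearest minority-class neighbors, and placing a new point somewhere on the line segment between them. The sketch below shows only that interpolation step on two made-up points. It is a simplified illustration, not imblearn's implementation, and `smote_like_sample` is a hypothetical helper.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Two invented minority-class records with two features each.
minority_point = [0.20, 0.50]
minority_neighbor = [0.40, 0.90]

synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because every synthetic record is an interpolation rather than an observation, including such records in the testing set would mean scoring the model on data that never existed.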

In [8]:
# count number of records in each class
count = Counter(y_train_smote)
print(count)

Counter({1: 239759, 0: 239759})


The data imbalance in the training data has now been addressed. So, let’s build a random forest classifier with this data and see how it performs on the testing set.

In [9]:
# normalize data
mms = MinMaxScaler()
X_train_smote = mms.fit_transform(X_train_smote)
X_test = mms.transform(X_test)

# create random forest model and generate predictions
rfc = RandomForestClassifier(random_state=42)
rfc.fit(X_train_smote, y_train_smote)
y_pred_smote = rfc.predict(X_test)

# compute the F1 score
f1_smote = round(f1_score(y_test, y_pred_smote), 2)
print('F1 score of model with SMOTE: {}'.format(f1_smote))

F1 score of model with SMOTE: 0.52


The model registers an F1 score of 0.52, up from 0.44 for the model trained without artificial data. On this test split, training with SMOTE therefore yields a better predictor of customer interest in vehicle insurance.