# Problem statement :

- In this classification problem, you need to address the following using data science techniques:
    1. Address the class imbalance issue and select the best technique.
    2. Create a model to predict “Coverage Code” & “Accident Source”.
    3. Design a GUI that takes the dataset file as input and has a “Run” button to execute the model you created.
    4. After executing the model, the GUI summarizes the evaluation results on the screen and stores the Excel file in a folder.

# Dataset Description :
The dataset contains 190,000+ claim records with a single feature, Claim Description. The
target columns are Coverage Code and Accident Source.

In [179]:
# import necessary libraries

# silence library warnings before the remaining imports
import warnings
warnings.filterwarnings("ignore")

# standard library
import pickle
import re
import string
import time
from string import digits

# third party
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

In [180]:
# read data from excel format


data = pd.read_excel('./data/Dataset_Public.xlsx')

In [181]:
# glimpse of data

data.head(2)

Unnamed: 0,Claim Description,Coverage Code,Accident Source
0,THE IV WAS MAKING A LEFT TURN ON A GREEN ARROW...,AN,"Struck pedestrian, bicycle"
1,CLAIMANT ALLEGES SHE SUFFERED INJURIES IN AN E...,GB,Elevator/Escalator


# Analysis of data:

In [182]:
print("Number of rows in dataset : ",data.shape[0])

Number of rows in dataset :  191690


### feature : Claim Description

In [183]:
# general idea about features

data['Claim Description'].iloc[0]

'THE IV WAS MAKING A LEFT TURN ON A GREEN ARROW WHEN A PEDESTRIAN  RAN INTO THE P/S OF IV. THE IV WAS DENTED ON THE FRONT P/S OF THE HOOD, THEN FELL BACKWARDS TO THE PAVEMENT. THE PEDESTRIAN         EXPERIENCED A LUMP ON HIS HEAD.'

Claim Description is a free-text feature that describes how the loss occurred.

In [184]:
# find duplicates in feature

data['Claim Description'].duplicated().sum()

27699

In [185]:
# get the value counts for same values

data['Claim Description'].value_counts()

CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DAMAGE.                                                                                                                                            1057
ROCK FROM ROAD - NO ONE AT FAULT                                                                                                                                                                598
CLAIMANT DROVE OVER A POTHOLE CAUSING VEHICLE DAMAGE.                                                                                                                                           432
PLAINTIFF ALLEGES INJURY CAUSED BY EMBEDDED BROKEN ARM OF PARAGARDIUD.                                                                                                                          311
WATER DAMAGE                                                                                                                                                                                    291
                    

In [186]:
# lets check data with one of the duplicate value

data[data['Claim Description']=='CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DAMAGE.']

Unnamed: 0,Claim Description,Coverage Code,Accident Source
296,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
319,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
589,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
667,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
763,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
...,...,...,...
191123,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
191265,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
191523,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole
191550,CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DA...,GD,Pothole


In [187]:
# drop duplicates using claim description features

data.drop_duplicates(subset = ['Claim Description'],inplace = True)

In [188]:
# number of rows after removing duplicate values

print("Number of rows in dataset : ",data.shape[0])

Number of rows in dataset :  163991
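
`drop_duplicates` keeps the first row for each description by default, so if an identical description ever appears with different labels, the later labels are discarded silently. The following is a minimal, dependency-free sketch of a check one could run before dropping duplicates. `find_label_conflicts` is a hypothetical helper that is not part of this notebook, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return descriptions that occur with more than one (coverage, source) pair.

    `rows` is any iterable of (description, coverage_code, accident_source) tuples.
    """
    labels_by_text = defaultdict(set)
    for description, coverage_code, accident_source in rows:
        labels_by_text[description].add((coverage_code, accident_source))
    # keep only descriptions whose duplicates disagree on the labels
    return {text: sorted(labels) for text, labels in labels_by_text.items() if len(labels) > 1}


# illustrative rows (made up); with the real data the tuples could come from
# data[['Claim Description', 'Coverage Code', 'Accident Source']].itertuples(index=False, name=None)
sample_rows = [
    ("WATER DAMAGE", "GD", "Leak"),
    ("WATER DAMAGE", "PA", "Leak"),
    ("CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DAMAGE.", "GD", "Pothole"),
    ("CLAIMANT DROVE OVER POTHOLE CAUSING VEHICLE DAMAGE.", "GD", "Pothole"),
]
conflicts = find_label_conflicts(sample_rows)
print(conflicts)
```

An empty result would mean that dropping duplicates on the description alone loses no label information.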


In [189]:
# check null values in claim description

data[data['Claim Description'].isna()]

Unnamed: 0,Claim Description,Coverage Code,Accident Source
576,,AD,Struck vehicle in rear


In [190]:
# remove null value from dataset

final_data = data[~data['Claim Description'].isna()].copy()

final_data.shape

(163990, 3)

## Data pre-processing:

In [191]:
# contraction patterns, applied in order (specific cases first)
CONTRACTIONS = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]
# translation table that deletes all punctuation (including quotes) and digits
DROP_CHARS = str.maketrans('', '', string.punctuation + digits)


def clean_text(text):
    # Lowercase all characters
    text = text.lower()
    # Expand contractions before the apostrophes are stripped
    for pattern, replacement in CONTRACTIONS:
        text = re.sub(pattern, replacement, text)
    # Remove all special characters and numbers
    text = text.translate(DROP_CHARS)
    # Remove leading/trailing whitespace and collapse repeated spaces
    return re.sub(" +", " ", text.strip())


def pre_processing(final_data):
    # work on a copy so the caller's DataFrame is left unchanged
    final_data = final_data.copy()
    final_data['Claim Description'] = final_data['Claim Description'].apply(clean_text)
    return final_data

In [192]:
final_data = pre_processing(final_data)

In [193]:
print('Row before pre-processing:')
print('*'*100)

print(data['Claim Description'].iloc[0])
print('='*100)
print('Row after pre-processing:')
print('*'*100)

print(final_data['Claim Description'].iloc[0])

Row before pre-processing:
****************************************************************************************************
THE IV WAS MAKING A LEFT TURN ON A GREEN ARROW WHEN A PEDESTRIAN  RAN INTO THE P/S OF IV. THE IV WAS DENTED ON THE FRONT P/S OF THE HOOD, THEN FELL BACKWARDS TO THE PAVEMENT. THE PEDESTRIAN         EXPERIENCED A LUMP ON HIS HEAD.
Row after pre-processing:
****************************************************************************************************
the iv was making a left turn on a green arrow when a pedestrian ran into the ps of iv the iv was dented on the front ps of the hood then fell backwards to the pavement the pedestrian experienced a lump on his head


## class_label 1 : Coverage Code

In [194]:
# find unique classes

print('Number of classes in Coverage code :',data['Coverage Code'].nunique())

Number of classes in Coverage code : 43


In [195]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# # plot countplot of body_type and add count on each bar
# plt.figure(figsize=(30, 25))
# sns.countplot(y='Coverage Code', data=data)
# for p in plt.gca().patches:
#     plt.gca().annotate('{:.0f}'.format(p.get_height()), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# plt.show()

In [196]:
final_data['Coverage Code'].value_counts()

AD    43070
GB    32193
GD    22399
AP    18894
AL     9983
AB     9397
PA     8030
PB     5089
RB     4553
NS     2297
PM     1359
AU     1273
EL      855
GK      632
AN      558
RC      514
PI      384
GO      337
PL      317
DC      261
GL      237
AM      224
EP      184
IK      183
OM      138
EO      137
PP      112
TE       85
BM       66
LL       57
IM       40
BL       36
CM       30
FB       18
OI       14
RF       12
BR        6
EB        4
RQ        3
FF        3
LS        3
EI        2
PC        1
Name: Coverage Code, dtype: int64

## Observation :

- There are 43 classes in Coverage Code.
- The classes are heavily imbalanced: the largest class (AD) has 43,070 rows, while the smallest (PC) has 1.
- Seven classes have fewer than 10 data points each.
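
One standard way to counter this kind of imbalance, which this notebook does not apply, is to weight each class inversely to its frequency. scikit-learn's `class_weight='balanced'` option uses `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python for four of the counts printed above, computed over just those four classes for illustration. `balanced_class_weights` is a hypothetical helper name.

```python
def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# four of the Coverage Code counts shown above; a real run would pass all 43
subset_counts = {"AD": 43070, "GB": 32193, "EI": 2, "PC": 1}
weights = balanced_class_weights(subset_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With scikit-learn, passing `class_weight='balanced'` to `LogisticRegression` or `RandomForestClassifier` applies the same weighting during training; `MultinomialNB` has no such parameter.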

### Approach 1 :

- Identify the classes with very few data points and build a lookup dictionary from claim description to labels for them.
- Merge the medium-sized classes into a single class (`med`).
- Train a first model on the resulting classes.
- Train a second model on the medium-sized classes only.
- At prediction time, use the second model whenever the first model predicts `med`.
- Finally, override the prediction if the claim description is in the dictionary of very rare classes.
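
The first two bullets amount to splitting the classes into buckets by frequency. A minimal sketch follows, assuming the thresholds used later in this notebook (fewer than 10 rows for rare, fewer than 1000 for medium). `bucket_classes` is a hypothetical helper, and unlike the notebook's `medium_classes` list it keeps the rare and medium buckets disjoint.

```python
from collections import Counter


def bucket_classes(labels, rare_below=10, medium_below=1000):
    """Split class labels into rare / medium / large buckets by row count."""
    counts = Counter(labels)
    buckets = {"rare": [], "medium": [], "large": []}
    for label, count in counts.most_common():
        if count < rare_below:
            buckets["rare"].append(label)
        elif count < medium_below:
            buckets["medium"].append(label)
        else:
            buckets["large"].append(label)
    return buckets


# toy labels with small thresholds so the example stays readable
toy_labels = ["AD"] * 5 + ["EL"] * 3 + ["PC"] * 1
buckets = bucket_classes(toy_labels, rare_below=2, medium_below=4)
print(buckets)
```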

#### Step 1 : Create a dictionary for the very rare classes:

In [197]:
# check count of classes

coverage_code_classes = pd.DataFrame(final_data['Coverage Code'].value_counts())

In [198]:
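# the counts are in the first column, whose name depends on the pandas version
class_counts = coverage_code_classes.iloc[:, 0]
# note: medium_classes (< 1000 rows) also contains the very rare classes (< 10 rows)
very_few_classes = class_counts[class_counts < 10].index.to_list()
medium_classes = class_counts[class_counts < 1000].index.to_list()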

In [199]:
# check the data with very few classes

fewer_data = final_data[final_data['Coverage Code'].isin(very_few_classes)]

fewer_data.head(2)

Unnamed: 0,Claim Description,Coverage Code,Accident Source
33833,the caller states that the builder called and ...,BR,"Burglary, Robbery, Theft - Property"
63005,oil discovered in monitoring tanks,EI,Fuel


In [200]:
# for the very rare classes we can directly build a lookup dictionary:
# claim description -> [coverage code, accident source]


fewer_class_dictionary = dict()
for idx,row in fewer_data.iterrows():
    
    fewer_class_dictionary[row['Claim Description']] = [row['Coverage Code'],row['Accident Source']]

fewer_class_dictionary

{'the caller states that the builder called and stated the items from the property that were being used to build the home were stolen and police report was created': ['BR',
  'Burglary, Robbery, Theft - Property'],
 'oil discovered in monitoring tanks': ['EI', 'Fuel'],
 'eeoc charge of discrimination filed by current st john is university employee charles fortmann': ['EB',
  'Alleged Discrimination'],
 'allegations of violation of wage and hour violations limited assignment for payment of legal fees': ['EB',
  'Alleged Negligent Act'],
 'graffiti painted on wall': ['RQ', 'Other - comprehensive'],
 'sewage release': ['EI', 'Backup Sewer, Seepage'],
 'cyber breach': ['FF', 'Not Otherwise Classified'],
 'claimant suffered defamation and loss of business': ['LS',
  'Loss of Use, Business Interruption'],
 'property theft and damage': ['BR', 'Burglary, Robbery, Theft - Property'],
 'tools were stolen from a job site at francis howell north': ['BR',
  'Burglary, Robbery, Theft - Property'],
 

In [201]:
fewer_class_dictionary[fewer_data['Claim Description'].iloc[0]]

['BR', 'Burglary, Robbery, Theft - Property']

#### Step 2 : Convert medium range classes into a single class

In [202]:
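# keep a copy with the original labels for the second-stage model
data_step_2 = final_data.copy()

def convert_class(feature):
    # map every class in medium_classes (fewer than 1000 rows) to a single 'med' class
    if feature in medium_classes:
        return 'med'
    else:
        return feature
    
final_data['Coverage Code'] = final_data['Coverage Code'].apply(convert_class)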

In [203]:
final_data['Coverage Code'].value_counts()

AD     43070
GB     32193
GD     22399
AP     18894
AL      9983
AB      9397
PA      8030
med     5453
PB      5089
RB      4553
NS      2297
PM      1359
AU      1273
Name: Coverage Code, dtype: int64

#### Step 3 : Develop a model on given classes :

##### vectorization of feature

In [204]:
# create features and class label

X = final_data['Claim Description']

y = final_data['Coverage Code']


In [205]:
# split the dataset



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [206]:
# convert data into vector form

count_vector_step_1 = CountVectorizer()

X_train = count_vector_step_1.fit_transform(X_train)
X_test = count_vector_step_1.transform(X_test)

In [207]:
# # using tfidf vectorizer

# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train) #tfidf runs on non-tokenized sentences unlike word2vec
# # Only transform x_test (not fit and transform)
# X_val_vectors_tfidf = tfidf_vectorizer.transform(X_test)

In [208]:
X_train.shape

(131192, 77557)
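
The shape `(131192, 77557)` means one row per training claim and one column per distinct token. The sketch below mimics, in plain Python, what `CountVectorizer` does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). It is an illustration only: `tokenize`, `fit_vocabulary` and `transform_counts` are hypothetical names, and it returns dense lists where the real vectorizer returns a sparse matrix.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # CountVectorizer's default token pattern


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({token for doc in documents for token in tokenize(doc)})
    return {token: index for index, token in enumerate(tokens)}


def transform_counts(documents, vocabulary):
    """Return one dense count row per document; unknown tokens are skipped."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(token, 0) for token in vocabulary])
    return rows


train_docs = ["pothole damage", "water damage damage"]
vocabulary = fit_vocabulary(train_docs)
train_matrix = transform_counts(train_docs, vocabulary)
test_matrix = transform_counts(["rock damage"], vocabulary)  # 'rock' is unseen
print(vocabulary)
print(train_matrix, test_matrix)
```

Tokens not seen during fitting are ignored at transform time, which is why the vectorizer is fitted on the training split only and merely applied to the test split.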

##### Model 1 - naive bayes

In [209]:


# train
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/naive_bayes_step_1.pickle"
pickle.dump(naive_bayes, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = naive_bayes.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for naive bayes :',np.round(accuracy,2))

print("*"*100)

target_names = naive_bayes.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/naive_bayes_step_1.pickle
Accuracy for naive bayes : 0.69
****************************************************************************************************
              precision    recall  f1-score   support

          AB       0.60      0.19      0.29      1825
          AD       0.62      0.83      0.71      8630
          AL       0.79      0.90      0.84      1923
          AP       0.58      0.35      0.44      3777
          AU       0.47      0.48      0.47       259
          GB       0.79      0.92      0.85      6438
          GD       0.65      0.76      0.70      4566
          NS       0.99      0.35      0.52       469
          PA       0.85      0.87      0.86      1609
          PB       0.90      0.49      0.64      1049
          PM       1.00      0.06      0.11       277
          RB       0.74      0.31      0.44       899
         med       0.70      0.17      0.28      1077

    accuracy                           0.69   

##### Model 2 - logistic regression

In [210]:
from sklearn.linear_model import LogisticRegression

lr_basemodel =LogisticRegression()

lr_basemodel.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/lr_step_1.pickle"
pickle.dump(lr_basemodel, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = lr_basemodel.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for logistic regression :',np.round(accuracy,2))

print("*"*100)

target_names = lr_basemodel.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/lr_step_1.pickle
Accuracy for logistic regression : 0.73
****************************************************************************************************
              precision    recall  f1-score   support

          AB       0.52      0.31      0.38      1825
          AD       0.68      0.80      0.73      8630
          AL       0.93      0.91      0.92      1923
          AP       0.59      0.48      0.53      3777
          AU       0.55      0.44      0.49       259
          GB       0.82      0.90      0.86      6438
          GD       0.73      0.77      0.75      4566
          NS       0.84      0.75      0.79       469
          PA       0.86      0.84      0.85      1609
          PB       0.77      0.71      0.74      1049
          PM       0.52      0.14      0.22       277
          RB       0.66      0.61      0.63       899
         med       0.52      0.27      0.36      1077

    accuracy                           0.73    

In [109]:
# # using tfidf vectorizer

# lr_basemodel =LogisticRegression()

# lr_basemodel.fit(X_train_vectors_tfidf, y_train)
# print("Model trained")

# # save model
# path = "results/lr_step_1.pickle"
# pickle.dump(lr_basemodel, open(path, "wb"))
# print("Model saved to", path)

# # predict
# y_pred = lr_basemodel.predict(X_val_vectors_tfidf)

# accuracy = accuracy_score(y_test, y_pred)
# print('Accuracy for logistic regression :',np.round(accuracy,2))

# print("*"*100)

# target_names = lr_basemodel.classes_
# print(classification_report(y_test, y_pred, target_names=target_names))
# print('*'*100)

##### Model 3 - hyper parameter tuning for random forest

In [36]:


#### model ####
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=500, num=5)]
# Number of features to consider at every split
max_features = ["auto", "sqrt"]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [int(x) for x in np.linspace(10, 110, num=5)]
# Minimum number of samples required at each leaf node
min_samples_leaf = [int(x) for x in np.linspace(10, 110, num=5)]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid

random_grid = {
    "n_estimators": n_estimators,
    "max_features": max_features,
    "max_depth": max_depth,
    "min_samples_split": min_samples_split,
    "min_samples_leaf": min_samples_leaf,
    "bootstrap": bootstrap,
}
print(random_grid)

{'n_estimators': [200, 275, 350, 425, 500], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 35, 60, 85, 110, None], 'min_samples_split': [10, 35, 60, 85, 110], 'min_samples_leaf': [10, 35, 60, 85, 110], 'bootstrap': [True, False]}


In [37]:


rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation;
# n_iter is left at its default of 10, so 10 random combinations are tried, using all available cores
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_grid,
    cv=3,
    verbose=3,
    random_state=42,
    n_jobs=-1,
)

In [38]:
# fit random search

print("RF ..")
search_time_start = time.time()
rf_random.fit(X_train, y_train)
print("RF  time:", time.time() - search_time_start)

RF ..
Fitting 3 folds for each of 10 candidates, totalling 30 fits
RF  time: 549.3658676147461


In [39]:
# get best hyper parameter

rf_random.best_estimator_

RandomForestClassifier(max_depth=110, max_features='sqrt', min_samples_leaf=10,
                       min_samples_split=35, n_estimators=200)

In [40]:
# train random forest model

clf_random = RandomForestClassifier(max_depth=110, max_features='sqrt', min_samples_leaf=10, min_samples_split=35, n_estimators=200)
clf_random.fit(X_train,y_train)

RandomForestClassifier(max_depth=110, max_features='sqrt', min_samples_leaf=10,
                       min_samples_split=35, n_estimators=200)

In [41]:
# save model
path = "results/rf_basemodel.pickle"
pickle.dump(clf_random, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = clf_random.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for random forest model :',np.round(accuracy,2))

print("*"*100)

target_names = clf_random.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model saved to results/rf_basemodel.pickle
Accuracy for random forest model : 0.59
****************************************************************************************************
              precision    recall  f1-score   support

          AB       1.00      0.04      0.07      1851
          AD       0.51      0.97      0.67      8579
          AL       0.96      0.81      0.88      1992
          AP       0.83      0.04      0.07      3860
          AU       0.00      0.00      0.00       262
          GB       0.62      0.95      0.75      6408
          GD       0.59      0.45      0.51      4408
          NS       1.00      0.01      0.02       464
          PA       0.92      0.61      0.73      1670
          PB       0.98      0.17      0.29      1068
          PM       1.00      0.04      0.07       262
          RB       0.72      0.05      0.09       895
         med       0.95      0.02      0.04      1079

    accuracy                           0.59     32798
   m

#### Observation :

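- Simple models such as naive Bayes (accuracy 0.69) and logistic regression (accuracy 0.73) perform reasonably well.
- The ensemble model, random forest, performs worse on this dataset (accuracy 0.59): it predicts mostly AD and GB, and its recall is below 0.10 for AB, AP, AU, NS, PM, RB and med.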
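
Accuracy alone can hide the weak classes in the reports above, because large classes such as AD and GB dominate it. Macro-averaged F1 gives every class equal weight. A minimal plain-Python sketch follows (in scikit-learn the equivalent is `f1_score(y_true, y_pred, average='macro')`); the toy labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # F1 = 2*TP / (2*TP + FP + FN)
        denominator = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# a classifier that always predicts the majority class
y_true = ["AD", "AD", "AD", "PM"]
y_pred = ["AD", "AD", "AD", "AD"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))
```

In this toy case the accuracy looks acceptable while the macro F1 is much lower, because the minority class PM is never predicted.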



#### Step 4 : Create a dataset of the medium range classes and train a model

In [211]:
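# medium_classes is ordered by descending count, so [:-5] drops the five rarest codes (RQ, FF, LS, EI, PC)
med_data = data_step_2[data_step_2['Coverage Code'].isin(medium_classes[:-5])]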

med_data.head(2)

Unnamed: 0,Claim Description,Coverage Code,Accident Source
0,the iv was making a left turn on a green arrow...,AN,"Struck pedestrian, bicycle"
9,claimant suffered bodily injury due to picked ...,EL,Fall - other falls


In [212]:
med_data.shape

(5441, 3)

In [213]:
med_data['Coverage Code'].value_counts()

EL    855
GK    632
AN    558
RC    514
PI    384
GO    337
PL    317
DC    261
GL    237
AM    224
EP    184
IK    183
OM    138
EO    137
PP    112
TE     85
BM     66
LL     57
IM     40
BL     36
CM     30
FB     18
OI     14
RF     12
BR      6
EB      4
Name: Coverage Code, dtype: int64

In [214]:
# create features and class label

X = med_data['Claim Description']

y = med_data['Coverage Code']


In [215]:
X_test.shape

(32798, 77557)

In [216]:
# split the dataset



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20,stratify = y)

In [217]:
# convert data into vector form

count_vector_step_2 = CountVectorizer()

X_train = count_vector_step_2.fit_transform(X_train)
X_test = count_vector_step_2.transform(X_test)

In [218]:


# train
naive_bayes_2 = MultinomialNB()
naive_bayes_2.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/naive_bayes_step_2.pickle"
pickle.dump(naive_bayes_2, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = naive_bayes_2.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for naive bayes :',np.round(accuracy,2))

print("*"*100)

target_names = naive_bayes_2.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/naive_bayes_step_2.pickle
Accuracy for naive bayes : 0.67
****************************************************************************************************
              precision    recall  f1-score   support

          AM       0.80      0.09      0.16        45
          AN       0.54      0.84      0.66       112
          BL       0.00      0.00      0.00         7
          BM       0.00      0.00      0.00        13
          BR       0.00      0.00      0.00         1
          CM       0.00      0.00      0.00         6
          DC       1.00      0.58      0.73        52
          EB       0.00      0.00      0.00         1
          EL       0.81      0.97      0.88       171
          EO       0.65      0.41      0.50        27
          EP       0.62      0.57      0.59        37
          FB       0.00      0.00      0.00         4
          GK       0.50      0.94      0.65       127
          GL       0.85      0.70      0.77    

In [219]:
from sklearn.linear_model import LogisticRegression

lr_basemodel =LogisticRegression()

lr_basemodel.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/lr_basemodel_step_2.pickle"
pickle.dump(lr_basemodel, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = lr_basemodel.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for logistic regression :',np.round(accuracy,2))

print("*"*100)

target_names = lr_basemodel.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/lr_basemodel_step_2.pickle
Accuracy for logistic regression : 0.72
****************************************************************************************************
              precision    recall  f1-score   support

          AM       0.34      0.27      0.30        45
          AN       0.62      0.71      0.67       112
          BL       0.50      0.29      0.36         7
          BM       1.00      0.38      0.56        13
          BR       0.00      0.00      0.00         1
          CM       1.00      0.17      0.29         6
          DC       0.82      0.81      0.82        52
          EB       0.00      0.00      0.00         1
          EL       0.86      0.95      0.91       171
          EO       0.62      0.56      0.59        27
          EP       0.56      0.62      0.59        37
          FB       1.00      0.25      0.40         4
          GK       0.76      0.81      0.78       127
          GL       0.76      0.74     

#### Step 5 : Predict final output :

In [220]:
# load the different models from path

step_1_model_path = "results/lr_step_1.pickle"
step_2_model_path = "results/lr_basemodel_step_2.pickle"

step_1_model = pickle.load(open(step_1_model_path, 'rb'))
step_2_model = pickle.load(open(step_2_model_path, 'rb'))
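
The pickles loaded above contain only the classifiers; the fitted `count_vector_step_1` and `count_vector_step_2` exist only in notebook memory, so a separate process such as the GUI could not vectorize new text with them. One possible remedy, sketched here on made-up toy data rather than on the notebook's models, is to wrap vectorizer and classifier in a scikit-learn `Pipeline` and pickle that single object.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy training data, made up for illustration
texts = [
    "claimant drove over pothole causing vehicle damage",
    "rock from road hit windshield",
    "claimant slipped and fell on wet floor",
    "customer fell on stairs and was injured",
]
labels = ["GD", "GD", "GB", "GB"]

# the pipeline fits the vectorizer and the classifier together
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)

# one pickle now carries both the vocabulary and the classifier weights
blob = pickle.dumps(model)
restored = pickle.loads(blob)

query = ["vehicle damage from pothole"]
print(restored.predict(query))
```

The restored pipeline accepts raw (pre-processed) strings directly, so no separate `transform` call is needed at prediction time.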
                                                            

In [221]:
# create predict function for predicting class label

def predict(model_1,model_2,final_dict,query):
    # an exact match in the rare-class dictionary overrides both models
    if query in final_dict:
        return final_dict[query][0]  # [coverage code, accident source] -> coverage code

    # stage 1: model over the large classes plus the merged 'med' class
    y_pred = model_1.predict(count_vector_step_1.transform([query]))
    if y_pred[0] == 'med':
        # stage 2: model over the medium range classes
        y_pred = model_2.predict(count_vector_step_2.transform([query]))

    return y_pred[0]

In [222]:
# check predictions

final_result= []

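# data still contains the row with a missing description, so exclude it before sampling
test = data.dropna(subset=['Claim Description']).sample(150)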
test_data = pre_processing(test)
x_test = test_data['Claim Description']
for query in x_test:
    
    result = predict(step_1_model,step_2_model,fewer_class_dictionary,query)
    final_result.append(result)

In [139]:
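# check the accuracy and other metrics

y_test_ = test['Coverage Code'].tolist()

# score the combined predictions on the sampled rows, not the leftover y_test / y_pred from Step 4
accuracy = accuracy_score(y_test_, final_result)
print('Accuracy for logistic regression :',np.round(accuracy,2))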

print("*"*100)


print(classification_report(y_test_, final_result))
print('*'*100)

Accuracy for logistic regression : 0.71
****************************************************************************************************
              precision    recall  f1-score   support

          AB       0.57      0.36      0.44        11
          AD       0.74      0.84      0.79        38
          AL       1.00      1.00      1.00         8
          AN       0.00      0.00      0.00         2
          AP       0.60      0.56      0.58        16
          AU       0.00      0.00      0.00         2
          EL       0.00      0.00      0.00         1
          EO       1.00      1.00      1.00         1
          GB       0.83      0.97      0.89        30
          GD       0.62      0.75      0.68        20
          GO       0.00      0.00      0.00         0
          PA       1.00      1.00      1.00         3
          PB       1.00      0.50      0.67         8
          PI       0.00      0.00      0.00         0
          RB       0.80      0.57      0.67     

## class_label 2 : Accident Source

In [55]:
# find unique classes

print('Number of classes in Accident Source :',data['Accident Source'].nunique())

Number of classes in Accident Source : 312


In [56]:
# import seaborn as sns
# import matplotlib.pyplot as plt

# # plot countplot of body_type and add count on each bar
# plt.figure(figsize=(30, 25))
# sns.countplot(y='Coverage Code', data=data)
# for p in plt.gca().patches:
#     plt.gca().annotate('{:.0f}'.format(p.get_height()), (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# plt.show()

In [57]:
final_data['Accident Source'].value_counts()

Alleged Negligent Act            20254
Sideswipe or lane change         15414
Struck animal or object           7720
Struck vehicle in rear            7400
Not Otherwise Classified          7377
                                 ...  
Machine - point of operation        14
Insured Lost Control                14
VEHICLE                             13
Defective Pipework                  11
Boiler, pressure vessel, etc.       10
Name: Accident Source, Length: 312, dtype: int64

## Observation :

- There are 312 classes in Accident Source.
- Class imbalance is observed: the largest class (Alleged Negligent Act) has 20,254 rows.
- Many classes have very few data points.

In [66]:
# split the dataset



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [67]:
# convert data into vector form

count_vector_step_1 = CountVectorizer()

X_train = count_vector_step_1.fit_transform(X_train)
X_test = count_vector_step_1.transform(X_test)

In [68]:
X_train.shape

(131192, 77721)

##### vectorization of feature

In [69]:
# create features and class label

X = final_data['Claim Description']

y = final_data['Accident Source']


In [70]:
# split the dataset



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [71]:
# convert data into vector form

count_vector_step_1 = CountVectorizer()

X_train = count_vector_step_1.fit_transform(X_train)
X_test = count_vector_step_1.transform(X_test)

In [72]:
# # using tfidf vectorizer

# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train) #tfidf runs on non-tokenized sentences unlike word2vec
# # Only transform x_test (not fit and transform)
# X_val_vectors_tfidf = tfidf_vectorizer.transform(X_test)

In [73]:
X_train.shape

(131192, 77556)

##### Model 1 - naive bayes

In [74]:


# train
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/naive_bayes_as.pickle"
pickle.dump(naive_bayes, open(path, "wb"))
print("Model saved to", path)

# predict
y_pred = naive_bayes.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for naive bayes :',np.round(accuracy,2))

print("*"*100)

target_names = naive_bayes.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/naive_bayes_as.pickle
Accuracy for naive bayes : 0.34
****************************************************************************************************
                                            precision    recall  f1-score   support

    Accidental disposal of property/object       0.00      0.00      0.00        89
                   Administrative Decision       0.00      0.00      0.00        21
                                  Aircraft       0.00      0.00      0.00         5
                    Alleged Discrimination       0.00      0.00      0.00        18
                      Alleged False Arrest       0.00      0.00      0.00         8
                          Alleged Incident       0.00      0.00      0.00        24
      Alleged Misconduct - Personal Injury       0.00      0.00      0.00        21
                     Alleged Negligent Act       0.25      0.85      0.39      4021
         Alleged Violation of Civil Rights       0.

In [165]:
# check count of classes

accident_source_classes = pd.DataFrame(data_step_2['Accident Source'].value_counts())

In [166]:
accident_source_classes

Unnamed: 0,Accident Source
Alleged Negligent Act,20254
Sideswipe or lane change,15414
Struck animal or object,7720
Struck vehicle in rear,7400
Not Otherwise Classified,7377
...,...
Machine - point of operation,14
Insured Lost Control,14
VEHICLE,13
Defective Pipework,11


In [167]:
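# classes with fewer than 20 rows
very_few_classes_as = accident_source_classes[accident_source_classes['Accident Source'] < 20].index.to_list()
# classes with fewer than 100 rows (this list also contains the classes above)
medium_classes_as = accident_source_classes[accident_source_classes['Accident Source'] < 100].index.to_list()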

In [168]:
# check the data with very few classes

fewer_data_as = data_step_2[data_step_2['Accident Source'].isin(very_few_classes_as)]

fewer_data_as.head(2)

Unnamed: 0,Claim Description,Coverage Code,Accident Source
485,claimant suffer property damage due to leakage,GD,Leak/Flood - Defective Materials
545,a customer was buying bleach at aco around pm ...,GD,Container - Packaging Etc


In [169]:
# for the rarest classes we can directly build a lookup dictionary from claim description to labels


fewer_class_dictionary_as = dict()
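for idx, row in fewer_data_as.iterrows():
    # map each claim description to its [coverage code, accident source] pair
    fewer_class_dictionary_as[row['Claim Description']] = [row['Coverage Code'], row['Accident Source']]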

fewer_class_dictionary_as

{'claimant suffer property damage due to leakage': ['GD',
  'Leak/Flood - Defective Materials'],
 'a customer was buying bleach at aco around pm and the lid was not sealed properly and leaked on his pant while he was scanning products': ['GD',
  'Container - Packaging Etc'],
 'claimant was driving on the road and vehicle hit the speed bump causing vehicle damage': ['GD',
  'Speed bump'],
 'claimant suffered property damage due to flood': ['GD',
  'Leak/Flood - Defective Workmanship'],
 'devil is pool hot tub tracey got a rash on torso and back along with chills and body aches antibiotics were required': ['GB',
  'Infectious, parasitic agents, NOC - Liab.'],
 'claimant suffered property damaged due to dampness': ['GD',
  'Leak/Flood - Defective Workmanship'],
 'accident entre voiture derri re le magasin trop glissant alors ne pouvais pas freiner': ['AD',
  'Frost/Ice/Snow'],
 'claimant fall due to kerb and suffer the injuries': ['GB',
  'Trip - kerb stone'],
 'claimant suffered property

In [170]:
fewer_class_dictionary_as[fewer_data_as['Claim Description'].iloc[0]]

['GD', 'Leak/Flood - Defective Materials']
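
The lookup above only helps when a new claim description matches a stored one exactly. A minimal sketch of using it as a first step before a model could look like this; `predict_with_lookup` and `dummy_model` are hypothetical names introduced for illustration, and the fallback stands in for whatever trained classifier is used.

```python
# Hypothetical excerpt of the rare-class lookup built above.
rare_class_lookup = {
    "claimant suffer property damage due to leakage": [
        "GD",
        "Leak/Flood - Defective Materials",
    ],
}


def predict_with_lookup(description, lookup, fallback):
    """Return [coverage_code, accident_source], preferring an exact-match lookup."""
    if description in lookup:
        return lookup[description]
    # No exact match: defer to the model.
    return fallback(description)


def dummy_model(description):
    # Placeholder for a trained classifier (assumption, not the notebook's model).
    return ["GB", "Not Otherwise Classified"]


known = predict_with_lookup(
    "claimant suffer property damage due to leakage", rare_class_lookup, dummy_model
)
unknown = predict_with_lookup("claimant slipped on ice", rare_class_lookup, dummy_model)
print(known, unknown)
```

Because the match is on the full string, descriptions must be preprocessed the same way as the dictionary keys before the lookup.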

In [171]:
data_as_step_2 = data_step_2.copy()

def convert_class(feature):
    if feature in medium_classes_as:
        return 'med'
    else:
        return feature
    
data_as_step_2['Accident Source'] = data_as_step_2['Accident Source'].apply(convert_class)

In [172]:
data_as_step_2['Accident Source'].value_counts()

Alleged Negligent Act                    20254
Sideswipe or lane change                 15414
Struck animal or object                   7720
Struck vehicle in rear                    7400
Not Otherwise Classified                  7377
                                         ...  
Insured vehicle causes other accident      103
Excavation, trench, tunnel                 102
Alleged Discrimination                     101
Door - Office                              101
Malicious Mischief                         101
Name: Accident Source, Length: 150, dtype: int64
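
After this step every class with fewer than 100 rows, including the ones with fewer than 20, is merged into the single label `med`. The same collapsing can be sketched without a row-wise `apply`, here on a toy label series with a small threshold (the data and `min_count` value are assumptions chosen for illustration).

```python
import pandas as pd

# Toy label column: two frequent classes and two rare ones (hypothetical data).
labels = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"], name="Accident Source")
min_count = 3  # illustrative threshold; the notebook uses 100

counts = labels.value_counts()
rare = counts[counts < min_count].index

# Keep labels that are not rare and replace the rest with the shared bucket.
collapsed = labels.where(~labels.isin(rare), "med")
print(collapsed.value_counts())
```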

#### Step 3 : Develop a model on given classes :

##### vectorization of feature

In [223]:
# create features and class label

X = data_as_step_2['Claim Description']

y = data_as_step_2['Accident Source']


In [224]:
# split the dataset



X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [225]:
# convert data into vector form

count_vector_as_step_1 = CountVectorizer()

X_train = count_vector_as_step_1.fit_transform(X_train)
X_test = count_vector_as_step_1.transform(X_test)

In [226]:
# # using tfidf vectorizer

# from sklearn.feature_extraction.text import TfidfVectorizer

# tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train) #tfidf runs on non-tokenized sentences unlike word2vec
# # Only transform x_test (not fit and transform)
# X_val_vectors_tfidf = tfidf_vectorizer.transform(X_test)

In [227]:
X_train.shape

(131192, 77503)

##### Model 1 - naive bayes

In [228]:


# train
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/naive_bayes_as_step_1.pickle"
# use a context manager so the file handle is always closed
with open(path, "wb") as model_file:
    pickle.dump(naive_bayes, model_file)
print("Model saved to", path)

# predict
y_pred = naive_bayes.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for naive bayes :',np.round(accuracy,2))

print("*"*100)

target_names = naive_bayes.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/naive_bayes_as_step_1.pickle
Accuracy for naive bayes : 0.35
****************************************************************************************************
                                            precision    recall  f1-score   support

    Accidental disposal of property/object       0.00      0.00      0.00        85
                   Administrative Decision       0.00      0.00      0.00        23
                    Alleged Discrimination       0.00      0.00      0.00        17
                          Alleged Incident       0.00      0.00      0.00        19
      Alleged Misconduct - Personal Injury       0.00      0.00      0.00        21
                     Alleged Negligent Act       0.26      0.83      0.40      4102
                   Alleged concrete defect       0.00      0.00      0.00        30
         Alleged contamination or spoilage       0.54      0.48      0.50       410
      Alleged damage to property of others  
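
Many rows of this report show 0.00 for rare classes while one frequent class has high recall, which is the pattern accuracy alone tends to hide. A small sketch on made-up labels shows how macro-averaged F1 exposes it (the toy labels are assumptions, not results from this dataset).

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy case: 8 majority rows, 2 minority rows, and a model that
# always predicts the majority class (hypothetical data).
y_true = ["major"] * 8 + ["minor"] * 2
y_hat = ["major"] * 10

toy_accuracy = accuracy_score(y_true, y_hat)
# Macro F1 averages per-class F1 equally, so the missed class counts fully.
toy_macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)
# Weighted F1 weights each class by its support, so it stays closer to accuracy.
toy_weighted_f1 = f1_score(y_true, y_hat, average="weighted", zero_division=0)

print(round(toy_accuracy, 2), round(toy_macro_f1, 2), round(toy_weighted_f1, 2))
```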

##### Model 2 - logistic regression

#### Using BERT for embedding :

In [234]:
# pip install -U sentence-transformers

In [231]:
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [232]:
X = final_data['Claim Description'].values
y = final_data['Accident Source']

In [233]:
X_bert_enc = model.encode(X, batch_size=500, show_progress_bar=True)
print(X_bert_enc.shape)

(163990, 384)


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_bert_enc, y, test_size=0.20)
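
The split above is random, so the share of each class in the test set can drift from its share in the full data. A hedged alternative is a stratified split with a fixed seed; the sketch below uses synthetic features and labels (all names and values here are assumptions for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the embedding matrix and an imbalanced label column.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))
y_toy = np.array(["common"] * 90 + ["rare"] * 10)

# stratify keeps class proportions equal in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

print(len(y_te), (y_te == "rare").sum())
```

Stratification requires at least two rows per class, which is one reason to merge or separately handle the rarest classes first.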

In [178]:
from sklearn.linear_model import LogisticRegression

lr_basemodel = LogisticRegression()

lr_basemodel.fit(X_train, y_train)
print("Model trained")

# save model
path = "results/lr_as_step_1.pickle"
# use a context manager so the file handle is always closed
with open(path, "wb") as model_file:
    pickle.dump(lr_basemodel, model_file)
print("Model saved to", path)

# predict
y_pred = lr_basemodel.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy for logistic regression :',np.round(accuracy,2))

print("*"*100)

target_names = lr_basemodel.classes_
print(classification_report(y_test, y_pred, target_names=target_names))
print('*'*100)

Model trained
Model saved to results/lr_as_step_1.pickle
Accuracy for logistic regression : 0.45
****************************************************************************************************
                                            precision    recall  f1-score   support

    Accidental disposal of property/object       0.18      0.10      0.12       105
                   Administrative Decision       0.67      0.31      0.42        26
                    Alleged Discrimination       0.25      0.12      0.17        24
                          Alleged Incident       0.00      0.00      0.00        18
      Alleged Misconduct - Personal Injury       0.69      0.36      0.47        25
                     Alleged Negligent Act       0.37      0.61      0.46      4021
                   Alleged concrete defect       0.00      0.00      0.00        20
         Alleged contamination or spoilage       0.55      0.54      0.54       396
      Alleged damage to property of others   
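
Several rare classes still score 0.00 here. One option that was not tried in this notebook is `class_weight="balanced"`, which reweights the loss inversely to class frequency. The sketch below compares it with the default on synthetic imbalanced data; it only illustrates the mechanism and makes no claim about the effect on this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem with roughly a 95/5 class ratio (hypothetical data).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.20, stratify=y_syn, random_state=0
)

# Same model twice; the only difference is the class weighting.
plain = LogisticRegression(max_iter=1000).fit(X_a, y_a)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_a, y_a)

# Macro recall treats both classes equally regardless of their size.
plain_recall = recall_score(y_b, plain.predict(X_b), average="macro")
balanced_recall = recall_score(y_b, balanced.predict(X_b), average="macro")
print(round(plain_recall, 2), round(balanced_recall, 2))
```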

In [229]:
# pickle.dump(count_vector_step_1, open("./vectorizer/count_vector_step_1.pickel","wb"))
# pickle.dump(count_vector_step_2, open("./vectorizer/count_vector_step_2.pickel","wb"))
# pickle.dump(count_vector_as_step_1, open("./vectorizer/count_vector_as_step_1.pickel","wb"))

### Conclusion :

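- Coverage code :
    - For coverage code we use two models for prediction.
    - We also check whether the claim description is already in the lookup dictionary built for the rarest classes.
    
- Accident source :
    - We encode each claim description with the `all-MiniLM-L6-v2` sentence-transformers model (a BERT-style embedding).
    - We then train a simple logistic regression model on top of these embeddings, which reaches 0.45 accuracy compared with 0.35 for naive Bayes on count vectors.
    
- We deployed this model as a Streamlit app: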

https://huggingface.co/spaces/mahesh3394/gallegher_insurance_app