## Machine Learning Analysis USA:
#### This Jupyter Notebook builds several Machine Learning models on the file Clean_Insurance_USA.csv and evaluates them.

In [167]:
import pandas as pd
import numpy as np

In [168]:
usa = pd.read_csv('../Data/Clean_data/Clean_Insurance_USA.csv', index_col=0) #dataframe saved in usa

In [169]:
usa.columns #Inspecting columns

Index(['Customer', 'State', 'Coverage', 'Education', 'Job_Status', 'Gender',
       'Income', 'Location', 'Civil_Status', 'Monthly_Price',
       'Months_LastClaim', 'Months_SinceActivation', 'Number_Open_Complaints',
       'Number_Insurances', 'Policy_Type', 'Sales_Channel', 'Car_Type',
       'Car_Size'],
      dtype='object')

Categorical data is transformed into boolean (dummy) variables so that I can apply algorithms to it. I have decided to work with Civil_Status, Location, Policy_Type and Education because I think they are the traits that may influence having an accident.

In [170]:
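#Dummy variables for Civil_Status, Location, Policy_Type and Education
#get_dummies returns the untouched columns plus the new indicator columns, so no concat with usa is needed
usa_dummy = pd.get_dummies(usa, columns = ['Civil_Status','Location', 'Policy_Type', 'Education'], drop_first = True)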
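As a small illustration of what the encoding step produces, here is a sketch on a toy frame. The values below are made up for the example and are not taken from Clean_Insurance_USA.csv.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only)
toy = pd.DataFrame({
    "Income": [30000, 52000, 41000],
    "Location": ["Rural", "Urban", "Suburban"],
})

# drop_first=True drops the first category in sorted order ("Rural"),
# which then becomes the reference level encoded as all zeros.
encoded = pd.get_dummies(toy, columns=["Location"], drop_first=True)

print(list(encoded.columns))
```

Note that the result keeps the untouched columns (here `Income`) next to the new indicator columns, and the original `Location` column is gone.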

Number_Open_Complaints is the feature that we want to predict. It records how many accidents a customer had in the last year, but for simplicity I will translate it into whether a customer had one or more accidents (1) or none (0).

In [171]:
usa['Number_Open_Complaints'] = usa.Number_Open_Complaints.apply(lambda x: 0 if x==0 else 1)
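The same binarisation can also be written without `apply`, and it is worth checking the class balance right after it. This is a sketch on a made-up Series, not on the real column.

```python
import pandas as pd

# Illustrative counts only; the real values live in usa['Number_Open_Complaints']
complaints = pd.Series([0, 2, 0, 1, 0, 0, 3, 0, 0, 0], name="Number_Open_Complaints")

# Vectorised equivalent of: apply(lambda x: 0 if x == 0 else 1)
had_any = (complaints > 0).astype(int)

# Share of each class, useful to spot imbalance before modelling
share = had_any.value_counts(normalize=True).sort_index()
print(share)
```

A strongly unbalanced share is a hint that accuracy alone will not be enough to judge the models.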

Split the data into train and test sets, to check whether our predictions are right or not.

In [172]:
from sklearn.model_selection import train_test_split #importing library

In [173]:
#test size of 0.2 and decided to use stratify to make sure the proportion of 0 and 1 in train and test is
#the same, to avoid bias on splitting the dataset.
X_train, X_test, y_train, y_test = train_test_split(usa_dummy[['Civil_Status_Married', 'Civil_Status_Single',
                                                               'Location_Suburban', 'Location_Urban', 
                                                               'Policy_Type_Personal Auto',
                                                               'Policy_Type_Special Auto', 'Education_College',
                                                               'Education_Doctor','Education_High School or Below', 
                                                               'Education_Master']], 
                                                    usa[['Number_Open_Complaints']],
                                                    test_size=0.2, stratify = usa[['Number_Open_Complaints']])
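To see what `stratify` does, here is a minimal sketch on synthetic data. The array sizes, the 20% positive rate and `random_state=0` are assumptions made for the example (the split above does not fix a seed).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 20% of them labelled 1
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)

# stratify=y keeps the share of 1s the same in both parts;
# random_state makes the split reproducible between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("share of 1s in train:", y_tr.mean())
print("share of 1s in test: ", y_te.mean())
```

Adding a fixed `random_state` to the notebook's own split would make the confusion matrices below reproducible between runs.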

#### Building Supervised Learning algorithms.
This is a case of supervised learning, since we are trying to predict an outcome that is present in the dataset. Hence, I will build models that maximize the number of True Positives (people that might have an accident), while always keeping an eye on the accuracy of the model.

##### Decision Tree Model

In [174]:
from sklearn import tree
from sklearn.metrics import confusion_matrix
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [175]:
prediction = clf.predict(X_test)

In [176]:
confusion_matrix(y_test, prediction) #Rows are actual classes, columns are predictions: almost everything lands in the left column (predicted 0, no accident).

array([[1444,    7],
       [ 375,    1]])

In [177]:
clf.score(X_test, y_test)*100

79.09140667761358

I have decided to dismiss this model, since the confusion matrix shows that it does not predict well when there are accidents (only 1 of the 376 is detected).
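The 79% score looks high, so it helps to compare it with a trivial rule that always predicts "no accident". The sketch below only uses the numbers from the confusion matrix above; the variable names are my own.

```python
# Decision tree confusion matrix as reported above
# (rows = actual class, columns = predicted class)
tn, fp = 1444, 7   # customers without accidents
fn, tp = 375, 1    # customers with accidents

total = tn + fp + fn + tp

# Accuracy of the tree: correct predictions over all predictions
accuracy = (tn + tp) / total

# Accuracy of a rule that always answers 0: it gets every actual 0 right
baseline = (tn + fp) / total

print(f"tree accuracy:     {accuracy:.4f}")
print(f"always-0 accuracy: {baseline:.4f}")
print(f"accidents caught:  {tp} of {tp + fn}")
```

The tree scores slightly below the always-0 rule, which is why accuracy on its own is not a reason to keep it.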

##### K-nearest Neighbours

In [178]:
from sklearn.neighbors import KNeighborsClassifier

In [179]:
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train.values.reshape(-1,))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')
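The `reshape(-1,)` in the fit call is there because `y_train` is a one-column DataFrame, while scikit-learn estimators expect a 1-D target. A small sketch of the difference, on a made-up frame:

```python
import numpy as np
import pandas as pd

# One-column DataFrame, like usa[['Number_Open_Complaints']] (toy values)
y_frame = pd.DataFrame({"Number_Open_Complaints": [0, 1, 0, 0]})
print(y_frame.shape)   # (4, 1): two-dimensional

# Flatten to one dimension before passing it to fit()
y_flat = y_frame.values.reshape(-1,)
print(y_flat.shape)    # (4,)

# ravel() and selecting the column as a Series give the same 1-D values
same_as_ravel = np.array_equal(y_flat, y_frame.values.ravel())
same_as_column = np.array_equal(y_flat, y_frame["Number_Open_Complaints"].to_numpy())
```

Selecting the target with single brackets (a Series) when splitting would avoid the reshape altogether.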

In [180]:
y_pred = knn.predict(X_test)

In [181]:
confusion_matrix(y_test, y_pred) #Predictions are spread over both classes, at the cost of lower accuracy.

array([[1371,   80],
       [ 349,   27]])

In [182]:
knn.score(X_test, y_test)*100 #76.5% accuracy

76.51888341543514

In [183]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred,average='weighted') #Weighted F1 score of about 0.71

0.7097611261045154
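The weighted F1 can be rebuilt by hand from the KNN confusion matrix above, which shows how much each class contributes to it. This is a sketch with my own helper name `f1`; only the four counts come from the notebook.

```python
def f1(tp, fp, fn):
    """F1 for one class: harmonic mean of precision and recall."""
    return 2 * tp / (2 * tp + fp + fn)

# KNN (5 neighbours) confusion matrix as reported above
tn, fp, fn, tp = 1371, 80, 349, 27

f1_accident = f1(tp, fp, fn)      # class 1 (accident)
f1_no_accident = f1(tn, fn, fp)   # class 0: the roles of fp and fn swap

# average='weighted' weights each class F1 by its number of true samples
n_no_accident, n_accident = tn + fp, fn + tp
weighted = (f1_no_accident * n_no_accident + f1_accident * n_accident) / (
    n_no_accident + n_accident
)

print(f"F1 class 0: {f1_no_accident:.3f}")
print(f"F1 class 1: {f1_accident:.3f}")
print(f"weighted:   {weighted:.4f}")
```

The weighted value is dominated by class 0, so the F1 of class 1 on its own is the more honest number for the goal of detecting accidents.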

Then I changed the number of neighbours to see how the accuracy and the confusion matrix change.

In [184]:
knn = KNeighborsClassifier(n_neighbors=6, metric='euclidean')
knn.fit(X_train, y_train.values.reshape(-1,))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='euclidean',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')

In [185]:
y_pred = knn.predict(X_test)

In [187]:
confusion_matrix(y_test, y_pred) #Worse confusion matrix: only 2 accidents are detected.

array([[1444,    7],
       [ 374,    2]])

In [188]:
knn.score(X_test, y_test)*100 #Score of 79%

79.14614121510674

I increased the number of neighbours to 7 and the score decreased to 75%.
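Trying 5, 6 and 7 neighbours by hand can be written as a loop that records accuracy and true positives for each value. The sketch below runs on synthetic data from `make_classification` (an assumption, so that it is self-contained); on the notebook it would use the existing `X_train`, `X_test`, `y_train` and `y_test` instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, unbalanced stand-in for the insurance data (about 20% positives)
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

results = {}
for k in (5, 6, 7):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X_tr, y_tr)
    y_hat = knn.predict(X_te)
    # ravel() flattens the 2x2 matrix into tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
    results[k] = {"accuracy": (tn + tp) / len(y_te), "true_positives": int(tp)}
    print(k, results[k])
```

One thing to keep in mind when reading such a table: with two classes, an even number of neighbours can produce tied votes, so odd values are usually easier to interpret.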

###### Conclusion
I have also checked the other supervised learning models shown below, but KNN is the only one that detects a meaningful number of customers with accidents while keeping an accuracy close to the others. So that's the model I will use.

##### Linear SVC

In [105]:
from sklearn.svm import LinearSVC

In [106]:
svc = LinearSVC()

In [107]:
svc.fit(X_train, y_train.values.reshape(-1,))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [108]:
y_predi = svc.predict(X_test)
confusion_matrix(y_test, y_predi)

array([[1451,    0],
       [ 376,    0]])

I have decided to dismiss this model, since the confusion matrix shows that it does not predict well when there are accidents (none of the 376 is detected).

##### SVC

In [129]:
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train.values.reshape(-1,)) #1-D target, as in the KNN and Linear SVC cells
y_pred = clf.predict(X_test)

In [130]:
confusion_matrix(y_test, y_pred) #It doesn't predict well in case there are accidents.

array([[1451,    0],
       [ 376,    0]])

I have decided to dismiss this model, since the confusion matrix shows that it does not predict well when there are accidents (none of the 376 is detected).

##### Logistic Regression

In [109]:
from sklearn.linear_model import LogisticRegression

ks_model = LogisticRegression().fit(X_train, y_train.values.reshape(-1,)) #1-D target avoids the column_or_1d warning

In [110]:
y_pred_test = ks_model.predict(X_test)
#y_pred_test
confusion_matrix(y_test, y_pred_test)

array([[1451,    0],
       [ 376,    0]])

I have decided to dismiss this model, since the confusion matrix shows that it does not predict well when there are accidents (none of the 376 is detected).

##### PCA with Logistic Regression

In [138]:
from sklearn.decomposition import PCA 
  
pca = PCA(n_components = 5) 
  
X_train = pca.fit_transform(X_train) 
X_test = pca.transform(X_test) 
  
explained_variance = pca.explained_variance_ratio_ 

In [139]:
explained_variance #With these 5 components, about 84% of the variance is explained.

array([0.26967405, 0.18841037, 0.16586087, 0.12555203, 0.0874939 ])
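The share of variance covered by the first components is the running sum of these ratios. A short sketch using the values printed above (the variable names are my own):

```python
import numpy as np

# Explained variance ratios as printed above
ratios = np.array([0.26967405, 0.18841037, 0.16586087, 0.12555203, 0.0874939])

# Running total: entry i is the variance explained by the first i+1 components
cumulative = np.cumsum(ratios)

print(cumulative.round(3))
print(f"total for 5 components: {cumulative[-1]:.3f}")
```

The same running sum can be used to pick the number of components for a target share of variance, instead of fixing `n_components = 5` in advance.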

In [164]:
from sklearn.linear_model import LogisticRegression   
  
classifier = LogisticRegression(random_state = 0) 
classifier.fit(X_train, y_train.values.reshape(-1,)) #1-D target avoids the column_or_1d warning

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [165]:
y_pred = classifier.predict(X_test) 

In [166]:
confusion_matrix(y_test, y_pred)

array([[1451,    0],
       [ 376,    0]])

I have decided to dismiss this model, since the confusion matrix shows that it does not predict well when there are accidents (none of the 376 is detected).

### CONCLUSION:
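The model I have decided to choose is k-nearest neighbours with 5 neighbours: it detects the most customers with accidents (27 of the 376 in the test set) while keeping an accuracy of about 76.5%, roughly three points below the models that almost never predict an accident.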
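To put the models side by side, the sketch below recomputes accuracy and recall on the accident class from the confusion matrices reported in this notebook. The dictionary and its labels are my own; only the counts come from the outputs above.

```python
# Confusion matrices as reported above: ((tn, fp), (fn, tp))
matrices = {
    "Decision tree": ((1444, 7), (375, 1)),
    "KNN (k=5)": ((1371, 80), (349, 27)),
    "KNN (k=6)": ((1444, 7), (374, 2)),
    "Linear SVC": ((1451, 0), (376, 0)),
    "SVC": ((1451, 0), (376, 0)),
    "Logistic regression": ((1451, 0), (376, 0)),
    "PCA + logistic regression": ((1451, 0), (376, 0)),
}

summary = {}
for name, ((tn, fp), (fn, tp)) in matrices.items():
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    recall = tp / (tp + fn)   # share of actual accidents that are detected
    summary[name] = (accuracy, recall)
    print(f"{name:<26} accuracy={accuracy:.3f}  recall={recall:.3f}")

best_recall = max(summary, key=lambda name: summary[name][1])
print("highest recall:", best_recall)
```

Even for the chosen model the recall stays low, so this comparison ranks the models tried here and says nothing about whether the features are sufficient.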