# Intro

This is my first data science project. Some of the approaches might not be the best ones, but I am still learning and always open to any feedback.

We start by analysing the features in the dataset and the relationships between them, to get a better idea of which features should and which shouldn't be used to predict the dependent variable. We also run statistical tests to compare different subsets of the data.
Then we split the data into a training set and a test set, fit machine learning models on the training data, and try to:
* predict the insurance charges on the test data;
* identify clusters;
* predict the region of the customer (this non-binary variable was chosen just for training purposes)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# allow multi-outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
input = pd.read_csv("../input/insurance/insurance.csv")

# Data analysis

### Let's peek into the data

In [None]:
input.head(5)
input.info()
input.describe()

We have 7 columns and 1338 observations. Three of the columns are categorical (sex, smoker and region), and the children column takes integer values between 0 and 5.

One of our first tasks would be to observe the distribution of each numeric variable.

## Distributions

In [None]:
plt.style.use('ggplot')

In [None]:
# sns.distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable chart
for col in ['age', 'bmi', 'children', 'charges']:
    sns.histplot(input[col], kde=True);
    plt.show();

The distribution looks roughly normal for 'bmi', but non-normal for 'age', 'children' and 'charges'.

This means that **for 'age' and 'charges' we should look at the median instead of the mean**. The mean can be pulled away from the median by skew and outliers, which is most visible for 'charges'. This is well illustrated by the example below.

In [None]:
input.groupby('children').charges.agg(['median','mean']).plot(kind='bar', title='Charges by number of children - mean vs median');

*Note: 'Children' is a numeric variable in the source data, but we will treat it as a categorical one, due to it being discrete rather than continuous (it is countable, accepting only integer values between 0 and 5).*
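
To see why the median is the safer summary for skewed data, here is a minimal sketch with made-up numbers (they are not taken from the dataset): a single very large value moves the mean a lot, while the median barely changes.

```python
from statistics import mean, median

# Hypothetical charges, for illustration only (not taken from the dataset)
typical = [2000, 3000, 4000, 5000, 6000]
with_outlier = typical + [60000]

# Without the outlier, mean and median agree
print(mean(typical), median(typical))

# With the outlier, the mean jumps while the median moves only slightly
print(mean(with_outlier), median(with_outlier))
```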

## How do the charges relate to the other categorical variables?

In [None]:
plt.figure(figsize=(8,5))
plt.title('Charges by smoker status')
sns.violinplot(data=input, x='smoker', y='charges');
plt.show()

plt.figure(figsize=(8,5))
sns.scatterplot(x=input['bmi'], y=input['charges'], hue=input['smoker']);
plt.title('Charges by BMI');
plt.show();

It's no surprise that smokers pay more than non-smokers.
##### However, for smokers the charges go up as their BMI increases. Interestingly, no such tendency is observed for non-smokers.

## Region-wise analysis

In [None]:
plt.figure(figsize=(8,8))
plt.title('Charges by region')
sns.swarmplot(x=input['region'], y=input['charges']);
plt.show();

plt.figure(figsize=(8,8))
plt.title('Charges by region')
sns.boxplot(data=input, x='region', y='charges');
sns.swarmplot(data=input, x='region', y='charges', size=2, color=".3")
plt.show();

input.groupby('region').charges.agg(['mean','median']).sort_values(by='region', ascending=False).plot(kind="bar", title='Mean and median charges by region');

From the charts above it seems that charges in the Southeast region are a bit higher than in the other regions... But wait, this holds only for the mean. If we look at the median values, the picture is different: the charges are highest in the Northeast. Two conclusions:
* The charges are typically highest in the Northeast region
* In specific, outlying cases, the charges in the Southeast tend to go much higher than in the other regions

This could be a result of specific demographic characteristics in these regions. For example, let's observe the smoker status by region.

In [None]:
cnt_smoker_byRegion = input.groupby(['region', 'smoker']).agg({'smoker':'count'})
cnt_byRegion = input.groupby('region').agg({'smoker':'count'})
cnt_smoker_byRegion.div(cnt_byRegion, level='region')

##### The proportion of smokers is highest in the Southeast (25%). In the other regions it is noticeably lower, between 17.9% and 20.7%.
This is a plausible explanation for why the charges in the Southeast are high. However, it doesn't explain anything about the Northeast. As we said, charges are non-normally distributed, so we should be more interested in the median than in the mean.
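
The same per-region shares can also be computed with `pd.crosstab`. The sketch below uses a small hypothetical frame (the rows are invented; only the column names match the dataset), so the printed numbers are not the dataset's.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the dataset
toy = pd.DataFrame({
    'region': ['southeast'] * 4 + ['northeast'] * 5,
    'smoker': ['yes', 'no', 'no', 'no', 'yes', 'no', 'no', 'no', 'no'],
})

# normalize='index' divides each row by its total, giving shares per region
smoker_share = pd.crosstab(toy['region'], toy['smoker'], normalize='index')
print(smoker_share)
```

On the real data, replacing `toy` with `input` should give the same proportions as the groupby/div approach above.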

We can additionally explore the relation between region and BMI:

In [None]:
plt.figure(figsize=(8,8))
sns.boxplot(data=input, x='region', y='bmi', hue='smoker');
plt.title("BMI by region and smoker status");

The boxplot for BMI gives one more finding:
##### the Southeast region has the highest median BMI for both smokers and non-smokers amongst all regions! It still doesn't say anything about the Northeast, though!

This leaves us little choice but to see how the comparison looks once we exclude the outliers.

In [None]:
sns.boxplot(data=input.loc[(((input.region=='southeast') & (input.charges<42000)) | ((input.region=='northeast') & (input.charges<35000)))],
            x='smoker', y='charges', hue='region');
plt.show();

sns.boxplot(data=input.loc[(((input.region=='southeast') & (input.charges<42000)) | ((input.region=='northeast') & (input.charges<35000)))],
            x='sex', y='charges', hue='region');
plt.show();

sns.boxplot(data=input.loc[(((input.region=='southeast') & (input.charges<42000)) | ((input.region=='northeast') & (input.charges<35000)))],
            x='children', y='charges', hue='region')
plt.show();

In [None]:
input.loc[(((input.region=='southeast') & (input.charges<42000)) | ((input.region=='northeast') & (input.charges<35000)))].groupby(['region','children']).charges.median().plot(kind="bar");

Let's run the Kruskal-Wallis test to compare the median charges by region.

In [None]:
from scipy.stats import kruskal

In [None]:
sw = input.loc[input.region=='southwest','charges']
se = input.loc[input.region=='southeast','charges']
ne = input.loc[input.region=='northeast','charges']
nw = input.loc[input.region=='northwest','charges']

kruskal(sw, se, ne, nw)

Although we can see some differences, they are not statistically significant at the 5% level, because the p-value is greater than 0.05.
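
For intuition about what the test measures, here is a minimal sketch of the Kruskal-Wallis H statistic for samples without ties. The helper `kruskal_h` and the sample values are hypothetical; `scipy.stats.kruskal` additionally applies a tie correction and converts H into a p-value using the chi-squared distribution.

```python
def kruskal_h(*groups):
    """H statistic for tie-free samples (no tie correction applied)."""
    pooled = sorted(value for group in groups for value in group)
    # rank 1 = smallest value; this lookup is valid only when there are no ties
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    n_total = len(pooled)
    # sum over groups of (rank sum squared) / (group size)
    rank_term = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Fully separated groups give a large H ...
separated = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
# ... while interleaved groups give an H close to zero
interleaved = kruskal_h([1, 4, 7], [2, 5, 8], [3, 6, 9])
print(separated, interleaved)
```

The larger H is, the less plausible it is that all groups come from the same distribution.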

In [None]:
fg = sns.FacetGrid(data=input, row='region', col='children');
fg.map(plt.scatter, 'bmi', 'charges');
fg.add_legend();

No other findings here. We can humbly leave this with the conclusion that:
##### The charges are typically higher in the Northeast, although for some specific cases (e.g. smoking, high BMI, many children) the charges tend to go higher in the Southeast. However, according to the Kruskal-Wallis test, these differences are not statistically significant.

## Gender (in)equality?

Below we analyse the charges by gender. We'll also look at the other characteristics of each gender, to check whether any difference in charges is due to gender alone!

We'll first notice a small difference in charges: the median is very slightly higher for females than for males (as a reminder, we are working with the median):

In [None]:
input.groupby('sex').charges.agg(['mean','median'])

In [None]:
plt.figure(figsize=(8,8))
plt.title('Charges by gender')
sns.boxplot(data=input, x='sex', y='charges');
plt.show();

Although the average charges are very similar for both genders, the averages are strongly affected by outliers. It seems that there are more outliers for females, which have dragged their average charges up. Below we re-create the boxplot with the outliers excluded (charges above 30k for females and above 40k for males):

In [None]:
plt.figure(figsize=(8,8))
plt.title('Charges by gender')
sns.boxplot(data=input.loc[(((input.sex == 'female') & (input.charges < 30000)) | ((input.sex == 'male') & (input.charges < 40000))),:], x='sex', y='charges');
plt.show();

In [None]:
input.loc[(((input.sex == 'female') & (input.charges < 30000)) | ((input.sex == 'male') & (input.charges < 40000))),:].groupby('sex').charges.agg(['mean','median'])

Both the boxplot and the mean/median table show that, once we exclude the outlying cases, male customers have been charged noticeably more than females.
However, does this have anything to do with smoking and BMI? Shouldn't we look at how these two features are distributed by gender?

In [None]:
sns.boxplot(data=input, x='smoker', y='charges', hue='sex')
plt.title("Charges by gender and smoker status")
plt.show();

input.groupby(['sex','smoker']).charges.median().sort_values().plot(title='Median charges by gender and smoker status');

Now, that's an interesting finding:
* On the one hand, female non-smokers have been charged about 9% more (on median) than male non-smokers
* On the other hand, female smokers have been charged about 20% less (on median) than male smokers

How is that possible? Well, let's see how the BMI will fit into the picture.

In [None]:
input.groupby(['sex', 'smoker']).bmi.mean().sort_values().plot(title='Average BMI by gender and smoker status');

sns.lmplot(data=input, x='bmi', y='charges', hue='sex');
plt.title('Regression line for BMI/Charges, gender-wise')
plt.show();

##### Here's a possible explanation: female smokers have been charged noticeably less than male smokers. This might be because female smokers have a lower average BMI than male smokers.

The steeper line for males on the second chart shows how much faster their charges go up as BMI increases, compared to females.

The last step here is to run a Mann-Whitney test to compare the charges for males and females.

In [None]:
from scipy.stats import mannwhitneyu

In [None]:
mannwhitneyu(input.loc[input.sex=='female','charges'].values,input.loc[input.sex=='male','charges'].values)

The p-value is greater than 0.05, meaning that the test finds no statistically significant difference between males' and females' charges. Hence, this data gives no evidence of gender inequality in charges.
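
As a rough illustration of what the Mann-Whitney test is based on, the U statistic simply counts, over all pairs, how often a value from one sample exceeds a value from the other. The helper `mann_whitney_u` and the numbers below are hypothetical; the p-value calculation is left to `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(x, y):
    """Number of (x, y) pairs where x is larger; ties count as one half."""
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)

# Hypothetical samples, for illustration only
x = [3, 5, 8]
y = [1, 4, 9, 10]

u_x = mann_whitney_u(x, y)
u_y = mann_whitney_u(y, x)
print(u_x, u_y)

# Share of pairs in which the x value is the larger one (0.5 means no tendency)
print(u_x / (len(x) * len(y)))
```

The two counts always add up to the number of pairs, so a share close to 0.5 indicates that neither group tends to have larger values.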

## Does the number of children matter?

In [None]:
plt.figure(figsize=(8,8))
plt.title('Charges by number of children')
sns.boxplot(data=input, x='children', y='charges');
plt.show();

The charges tend to increase with the number of children, but decrease for 5 children. However, the charges for people with no children are also high: they are comparable to the charges for people with 4 children.

We'll now run a Kruskal-Wallis test to compare the median charges.

In [None]:
# split the charges by number of children
children = []
for i in range(0,6):
    children.append(input.loc[input.children==i,'charges'])

In [None]:
kruskal(children[0], children[1], children[2], children[3], children[4], children[5])
kruskal(children[0], children[4])
kruskal(children[2], children[5])
kruskal(children[1], children[5])

The Kruskal-Wallis test across all six groups indicates that the number of children does matter: there is a significant difference in charges between the groups. The pairwise tests find no significant difference between 'no children' and 4 children, or between 2 children and 5 children, which is consistent with (but does not prove) similar charges in those groups.

However, the boxplots above show many outliers for 0-3 children, so any conclusions here should be treated with caution.

# Machine Learning Models

## Data Preprocessing

In [None]:
X = input.iloc[:,0:6].copy()  # copy, so the encoding below does not write into a slice of `input`
y = input.iloc[:,6]

In [None]:
X.head()

##### Encode categorical variables

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [None]:
LabelEncoder_X1 = LabelEncoder()
LabelEncoder_X4 = LabelEncoder()
LabelEncoder_X5 = LabelEncoder()
X.iloc[:,1] = LabelEncoder_X1.fit_transform(X.iloc[:,1])
X.iloc[:,4] = LabelEncoder_X4.fit_transform(X.iloc[:,4])
X.iloc[:,5] = LabelEncoder_X5.fit_transform(X.iloc[:,5])

In [None]:
from sklearn.compose import ColumnTransformer

ct = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'), [1,4,5])], remainder='passthrough')
X = ct.fit_transform(X)

##### Escape the dummy variable trap

In [None]:
# keep one dummy fewer per categorical variable: drop column 0 (sex), 3 (smoker) and 7 (region)
X = X[:,[1,2,4,5,6,8,9,10]]
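
As an alternative sketch (not the approach used in this notebook), pandas can drop one dummy per variable by itself, which avoids selecting column positions by hand. The toy rows below are hypothetical; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical toy rows with the dataset's categorical columns
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male'],
    'smoker': ['no', 'yes', 'no'],
    'region': ['northeast', 'southwest', 'southeast'],
})

# drop_first=True removes the first category of each variable,
# so k categories become k-1 dummy columns
encoded = pd.get_dummies(toy, columns=['sex', 'smoker', 'region'], drop_first=True)
print(encoded.columns.tolist())
```

Named columns also make it easier to check which category was dropped for each variable.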

##### Split into test and training set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
from sklearn.metrics import mean_absolute_error

## Train regression models

### Multiple Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

regressor1 = LinearRegression()
regressor1.fit(X_train, y_train)

y_pred1 = regressor1.predict(X_test)

##### Measure the result

In [None]:
MAE1 = mean_absolute_error(y_test, y_pred1)
MAE1
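
For reference, the mean absolute error is just the average size of the prediction errors, in the same units as the charges. A minimal sketch with hypothetical values (the helper name `mean_absolute_error_manual` is ours, not part of scikit-learn):

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted charges
actual = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1800.0, 3300.0]

mae_example = mean_absolute_error_manual(actual, predicted)
print(mae_example)
```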

### Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

In [None]:
# PolynomialFeatures is a transformer, not a regressor: it only expands the feature matrix
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X_train)

linreg = LinearRegression()
linreg.fit(X_poly, y_train)

# reuse the transformer fitted on the training data
y_pred2 = linreg.predict(poly.transform(X_test))

##### Measure the result

In [None]:
MAE2 = mean_absolute_error(y_test, y_pred2)
MAE2

### Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor

regressor3 = DecisionTreeRegressor(random_state = 0)
regressor3.fit(X_train, y_train)

y_pred3 = regressor3.predict(X_test)

MAE3 = mean_absolute_error(y_test, y_pred3)
MAE3

### Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor

#### Apply Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
parameters = {'n_estimators': [2, 3, 5, 10, 15, 20, 30, 50, 75, 100, 500, 1000],
              'max_leaf_nodes': [5, 10, 20, 35, 50, 100],
              'random_state': [0]}
grid_search = GridSearchCV(estimator = RandomForestRegressor(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
print(f"Best MAE: {grid_search.best_score_ * (-1)}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
regressor4 = RandomForestRegressor(n_estimators=100, max_leaf_nodes=35, random_state=0)
regressor4.fit(X_train, y_train)

y_pred4 = regressor4.predict(X_test)

MAE4 = mean_absolute_error(y_test, y_pred4)
MAE4

### Support Vector Regression

In [None]:
from sklearn.svm import SVR

regressor5 = SVR(kernel = 'rbf')
regressor5.fit(X_train, y_train)

y_pred5 = regressor5.predict(X_test)

MAE5 = mean_absolute_error(y_test, y_pred5)
MAE5

#### This MAE is suspiciously high. We forgot that the SVR model requires feature scaling before fitting!

In [None]:
from sklearn.preprocessing import StandardScaler

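sc_X = StandardScaler()
sc_y = StandardScaler()
sc_X_train = sc_X.fit_transform(X_train)
# SVR expects a 1-D target, so flatten the scaled column
sc_y_train = sc_y.fit_transform(y_train.values.reshape(-1,1)).ravel()
# reuse the training-set scaling parameters on the test set instead of refitting
sc_X_test = sc_X.transform(X_test)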
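
A scaler is normally fitted on the training data only, and the same mean and standard deviation are then applied to the test data. The sketch below shows the idea in plain Python with hypothetical numbers; like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical feature values, for illustration only
train = [20.0, 30.0, 40.0]
test = [50.0]

# The scaling parameters are learned from the training data only
mu, sigma = mean(train), pstdev(train)

train_scaled = [(v - mu) / sigma for v in train]
# The test data reuses the training parameters, so it is not centred on zero
test_scaled = [(v - mu) / sigma for v in test]
print(train_scaled, test_scaled)
```

Refitting the scaler on the test set would map the two sets onto different scales, and the model would see test values that mean something different from what it was trained on.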

In [None]:
parameters = {'C': [1, 5, 10, 20, 50, 100],
              'kernel': ['rbf', 'linear', 'poly'],
              'degree': [2, 3, 4]}
grid_search = GridSearchCV(estimator = SVR(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(sc_X_train, sc_y_train)
print(f"Best MAE (in standardized units of the target): {grid_search.best_score_ * (-1)}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
regressor5 = SVR(kernel = 'rbf', C = 1)
regressor5.fit(sc_X_train, sc_y_train)

y_pred5 = regressor5.predict(sc_X_test)
# inverse_transform expects a 2-D array; flatten back to 1-D afterwards
y_pred5 = sc_y.inverse_transform(y_pred5.reshape(-1,1)).ravel()

MAE5 = mean_absolute_error(y_test, y_pred5)
MAE5

### XGBoost

In [None]:
from xgboost import XGBRegressor

In [None]:
parameters = {'base_score': [0.1, 0.3, 0.5, 0.7, 1, 1.5, 2, 5, 10, 20],
              'learning_rate': [0.001, 0.005, 0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5],
              #'booster': ['gbtree', 'gblinear', 'dart'],
              'n_estimators': [50, 100, 150, 200, 250, 300, 500, 750, 1000]}
              #'max_depth': [3, 5]}
grid_search = GridSearchCV(estimator = XGBRegressor(),
                           param_grid = parameters,
                           scoring = 'neg_mean_absolute_error',
                           cv = 2,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
print(f"Best MAE: {grid_search.best_score_ * (-1)}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
regressor6 = XGBRegressor(learning_rate=0.01, n_estimators=300)
regressor6.fit(X_train, y_train)
y_pred6 = regressor6.predict(X_test)
MAE6 = mean_absolute_error(y_test, y_pred6)
MAE6

## Regression Models Summary

#### Summarize mean absolute error and R-squared

In [None]:
summary = {'Multiple Linear': MAE1, 'Polynomial': MAE2, 'Decision Tree': MAE3,
           'Random Forest': MAE4, 'SVR': MAE5, 'XGBoost': MAE6}

from sklearn.metrics import r2_score

# same model labels as in `summary`, so both charts line up
summary_R2 = {'Multiple Linear': r2_score(y_test,y_pred1), 'Polynomial': r2_score(y_test,y_pred2),
             'Decision Tree': r2_score(y_test,y_pred3), 'Random Forest': r2_score(y_test,y_pred4),
             'SVR': r2_score(y_test,y_pred5), 'XGBoost': r2_score(y_test,y_pred6)}
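
As a reminder of what R-squared measures, it compares the model's squared errors with those of a baseline that always predicts the mean. A minimal sketch with hypothetical values (the helper `r2_manual` is ours, not part of scikit-learn):

```python
def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical actual and predicted values
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2_example = r2_manual(actual, predicted)
# Always predicting the mean gives an R-squared of exactly 0
r2_baseline = r2_manual(actual, [2.5] * 4)
print(r2_example, r2_baseline)
```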

In [None]:
f = plt.figure(figsize=(15,5))

ax = f.add_subplot(121)
ax.bar(list(summary.keys()), list(summary.values()), color='green');
ax.set_title("Mean absolute error by model (the lower the better)")

ax = f.add_subplot(122)
ax.plot(list(summary_R2.keys()), list(summary_R2.values()), color='cyan');
ax.set_title("R-Squared coefficient by model (the higher the better)")
ax.set_ylim([0.5,1])
plt.show();

In [None]:
# compare MAE to the average value of the dependent variable
round(100*MAE6/np.mean(y_test),2)
round(100*MAE6/input.charges.mean(),2)

#### Conclusion: the best regression model that we've created for this dataset is:
### XGBoost
##### It returned a mean absolute error of 2219, which is a deviation of about 16% from the average value of the dependent variable. It also has a high R-squared value of 0.90.
##### Support Vector Regression also did a very good job and was only narrowly outperformed. Random Forest Regression also did relatively well and completes the top 3.

# Clustering

In [None]:
from sklearn.cluster import KMeans
inertia = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 0)
    kmeans.fit(X)
    inertia.append(kmeans.inertia_)
plt.plot(range(1, 11), inertia);
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show();

Judging by the elbow in the chart above, the optimal number of clusters is 3.
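
The inertia plotted above is the sum of squared distances from each point to its nearest cluster centre. The sketch below computes it for hypothetical one-dimensional points with hand-picked centres (the helper `inertia_of` and all values are ours), to show why the curve flattens once the natural groups are covered.

```python
def inertia_of(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points)

# Hypothetical points forming two obvious groups
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

one_cluster = inertia_of(points, [6.5])
two_clusters = inertia_of(points, [2.0, 11.0])
three_clusters = inertia_of(points, [1.5, 3.0, 11.0])
print(one_cluster, two_clusters, three_clusters)
```

Going from one to two centres removes most of the inertia, while a third centre adds very little: that bend is the "elbow".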

In [None]:
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 0)
y_kmeans = kmeans.fit_predict(X)

# Classification

We need to look at the data and split it into a training set and a test set again, but this time the dependent variable will be a categorical one: we will try to predict the region of the customer.

In [None]:
input.head()

In [None]:
XClass = input.iloc[:,[0,1,2,3,4,6]].copy()  # copy, so the encoding below does not write into a slice of `input`
yClass = input.iloc[:,5]

##### Encode categorical variables

In [None]:
LabelEncoder_XClass1 = LabelEncoder()
LabelEncoder_XClass4 = LabelEncoder()
LabelEncoder_yClass = LabelEncoder()
XClass.iloc[:,1] = LabelEncoder_XClass1.fit_transform(XClass.iloc[:,1])
XClass.iloc[:,4] = LabelEncoder_XClass4.fit_transform(XClass.iloc[:,4])
yClass = LabelEncoder_yClass.fit_transform(yClass)

In [None]:
ct_XClass = ColumnTransformer([('one_hot_encoder', OneHotEncoder(categories='auto'), [1,4])], remainder='passthrough')
XClass = ct_XClass.fit_transform(XClass)

In [None]:
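# escape the dummy variable trap: drop one dummy per variable (column 0 for sex, column 2 for smoker)
# and keep age, bmi, children and charges (columns 4-7)
XClass = XClass[:,[1,3,4,5,6,7]]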

In [None]:
XClass_train, XClass_test, yClass_train, yClass_test = train_test_split(XClass, yClass, test_size=0.2, random_state=1)

In [None]:
sc_XClass = StandardScaler()
XClass_train = sc_XClass.fit_transform(XClass_train)
XClass_test = sc_XClass.transform(XClass_test)

## Train Classification models

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import cohen_kappa_score

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

classifier1 = DecisionTreeClassifier(criterion = 'entropy', random_state = 1)
classifier1.fit(XClass_train, yClass_train)
yClass_pred1 = classifier1.predict(XClass_test)

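#### Many classification metrics (for example precision, recall and ROC AUC) are defined for binary problems and need an averaging scheme for multi-class targets. One option that works directly on multi-class labels is Cohen's kappa score, which measures the agreement between predictions and true labels, corrected for chance.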

In [None]:
kappa1 = cohen_kappa_score(yClass_test, yClass_pred1)
kappa1
acc1 = accuracy_score(yClass_test, yClass_pred1)
acc1
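
To make the kappa score less of a black box, here is a minimal sketch of how it can be computed by hand: the observed agreement is compared with the agreement expected if predictions were made at random with the same label frequencies. The helper `cohen_kappa_manual` and the labels are hypothetical.

```python
from collections import Counter

def cohen_kappa_manual(y_true, y_pred):
    """(observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # chance agreement: product of the label frequencies, summed over labels
    expected = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical true and predicted regions
true_regions = ['ne', 'ne', 'nw', 'nw', 'se', 'se', 'sw', 'sw']
pred_regions = ['ne', 'nw', 'nw', 'nw', 'se', 'sw', 'sw', 'ne']

kappa_example = cohen_kappa_manual(true_regions, pred_regions)
print(kappa_example)
```

A kappa of 0 means the predictions are no better than chance, and 1 means perfect agreement.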

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
parameters = {'n_estimators': [50, 100, 150, 200, 250, 300, 500, 750, 1000], 
              'max_leaf_nodes': [5, 10, 20, 30, 50, 100, 300, 600, 800, 1000],
              'criterion': ['gini', 'entropy'],
              'max_depth': [3, 4, 6, 8, 9],
              'random_state': [0]}
grid_search = GridSearchCV(estimator = RandomForestClassifier(),
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 2,
                           n_jobs = -1)
grid_search = grid_search.fit(XClass_train, yClass_train)
print(f"Best Accuracy: {grid_search.best_score_}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
classifier2 = RandomForestClassifier(n_estimators = 100, random_state = 0)  # fixed seed, so the result is reproducible
classifier2.fit(XClass_train, yClass_train)
yClass_pred2 = classifier2.predict(XClass_test)
kappa2 = cohen_kappa_score(yClass_test, yClass_pred2)
kappa2
acc2 = accuracy_score(yClass_test, yClass_pred2)
acc2

### SVM (Kernel)

In [None]:
from sklearn.svm import SVC

In [None]:
parameters = {'kernel': ['rbf'], 
              'C': [1, 3, 5, 9, 10, 20, 25, 30, 40, 50, 75, 100, 200, 500, 1000],
              'random_state': [0]}
grid_search = GridSearchCV(estimator = SVC(),
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(XClass_train, yClass_train)
print(f"Best Accuracy: {grid_search.best_score_}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
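# note: the grid search above covered only the 'rbf' kernel; 'poly' is used here without having been tuned
classifier3 = SVC(kernel = 'poly', C = 1, random_state = 0)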
classifier3.fit(XClass_train, yClass_train)
yClass_pred3 = classifier3.predict(XClass_test)
kappa3 = cohen_kappa_score(yClass_test, yClass_pred3)
kappa3
acc3 = accuracy_score(yClass_test, yClass_pred3)
acc3

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier4 = LogisticRegression(random_state = 0)
classifier4.fit(XClass_train, yClass_train)

# Predicting the Test set results
yClass_pred4 = classifier4.predict(XClass_test)
kappa4 = cohen_kappa_score(yClass_test, yClass_pred4)
kappa4
acc4 = accuracy_score(yClass_test, yClass_pred4)
acc4

### Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier5 = GaussianNB()
classifier5.fit(XClass_train, yClass_train)
yClass_pred5 = classifier5.predict(XClass_test)
kappa5 = cohen_kappa_score(yClass_test, yClass_pred5)
kappa5
acc5 = accuracy_score(yClass_test, yClass_pred5)
acc5

### K-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
parameters = {'n_neighbors': [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], 
              'p': [1, 2, 3, 5, 10, 20, 30, 50, 70, 90, 120, 150, 200]}
grid_search = GridSearchCV(estimator = KNeighborsClassifier(),
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(XClass_train, yClass_train)
print(f"Best Accuracy: {grid_search.best_score_}")
print(f"Best parameters: {grid_search.best_params_}")

In [None]:
classifier6 = KNeighborsClassifier(n_neighbors = 4, metric = 'minkowski', p = 120)
classifier6.fit(XClass_train, yClass_train)
yClass_pred6 = classifier6.predict(XClass_test)
kappa6 = cohen_kappa_score(yClass_test, yClass_pred6)
kappa6
acc6 = accuracy_score(yClass_test, yClass_pred6)
acc6

### XGBoost

In [None]:
from xgboost import XGBClassifier

In [None]:
classifier7 = XGBClassifier(base_score=0.1, n_estimators=2600, max_depth=2)
classifier7.fit(XClass_train, yClass_train)
yClass_pred7 = classifier7.predict(XClass_test)
kappa7 = cohen_kappa_score(yClass_test, yClass_pred7)
kappa7
acc7 = accuracy_score(yClass_test, yClass_pred7)
acc7

## Summary

In [None]:
summaryClass = {'Decision Tree': kappa1, 'Random Forest': kappa2, 'Kernel SVM': kappa3,
               'Logistic Regression': kappa4, 'Naive Bayes': kappa5, 'KNN': kappa6, 'XGBoost': kappa7}

# model names, in the same order as the accuracies below
classmodels = list(summaryClass.keys())

In [None]:
accuracies = [acc1, acc2, acc3, acc4, acc5, acc6, acc7]

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x=classmodels, y=accuracies);
plt.title('Model Accuracy (the higher the better)')
plt.show();

#### The best performing model is:
## XGBoost Classification
##### It achieved the highest accuracy, much higher than the other models. However, it is still too low, only 44%. Therefore, none of the models is good enough to make reliable predictions of the customers' region. This section was created just for training purposes, as this is my first data science project.