# Classification Problem - Predicting Customer Churn 🛫

------------

## Quick reminder on `Jupyter Notebook` and `Python` basics 🚴‍♀️

Notebook consists of two main parts.

1. Text instructions like this one - these are made using a text formatting language called [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

2. Code cells like the one below:

In [None]:
1 + 1 * 2

1. To run a code cell, click into it with your mouse and press the `► Run` button in the navbar at the top of the notebook. 
2. You can also use the shortcut `Shift + Enter` to run a cell!
3. A cell that has been run will get a `In [number]` next to it
4. An output (returned value) of a cell will be displayed below with a `Out[number]` next to it
5. If you want to add another code cell - look for the `➕` button in the navbar.

In **Python** we have **built-in data types** to help us work with different kinds of data:

In [None]:
"ML like a pro" # 👈 Strings (str) like this one; Note the quotes around the text!
42 # 👈 Integers (int) like this one
3.14 # 👈 Floats (float) like this one

We have **variables** to help store data:

In [None]:
name = "Alan Turing"
age = 42
new_hire = [[0, 30, 3, 7.1, 12]]

...and **re-use** it later!:

In [None]:
"Hi, my name is " + name

In [None]:
# getting one year older :(
age = age + 1
age

And we have **methods** to perform actions on data:

In [None]:
name.upper()

In [None]:
number_of_n = name.count('n') # creating a new variable as a result of the method call
number_of_n

----------

# A new model - K-Nearest Neighbors 🤝

[K-Nearest Neighbors (or KNN)](https://scikit-learn.org/0.24/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) is a distance-based model that can be used both for regression (predict a number) or classification (predict a category).

**And the best part?** The steps are exactly the same for KNN model as for the Linear Regression we just did! 🙌

1. Let's start by importing the necessary Python libraries again

In [5]:
import pandas as pd
import numpy as np
import seaborn as sns

2. Let's get the data from `CSV` into a nice `DataFrame`

In [7]:
churn = pd.read_csv('clean_churn.csv')
churn

Unnamed: 0,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,608,1,41,1,83807.86,1,0,1,112542.58,0
1,502,1,42,8,159660.80,3,1,0,113931.57,1
2,850,1,43,2,125510.82,1,1,1,79084.10,0
3,645,0,44,8,113755.78,2,1,0,149756.71,1
4,376,1,29,4,115046.74,4,1,0,119346.88,1
...,...,...,...,...,...,...,...,...,...,...
6378,597,1,53,4,88381.21,1,1,0,69384.71,1
6379,644,0,28,7,155060.41,1,1,0,29179.52,0
6380,516,0,35,10,57369.61,1,1,1,101699.77,0
6381,772,0,42,3,75075.31,2,1,0,92888.52,1


3. Always a good idea to do some **visual exploration** first 📊

In this case, we care about whether the customer has `Exited` or **churned**. So we can try to plot the different features (inputs) keeping `Exited` as the color differentiator (hue).

### Your turn! 🚀

Experiment with the `scatterplot` below by changing which columns will be `x` axis and `y` axis. Make sure to type the column names correctly! 👀

In [None]:
sns.scatterplot(data=churn, x='PICK A COLUMN', y='PICK ANOTHER COLUMN', hue='Exited')

----------

4. Time for Machine Learning! Import the model from Scikit-learn:

In [20]:
from sklearn.neighbors import KNeighborsClassifier

5. Initialize the model and **pick a number of neighbors** to match against:

In [21]:
classifier = KNeighborsClassifier(n_neighbors=3)

6. Create our `inputs` and `output`. This time let's call them `x` and `y`:

In [14]:
x = churn.drop(['Exited'], axis='columns')
y = churn.Exited

Feel free to check your `x` and `y` below 👇

In [12]:
# your code here

### 7. Your turn! 🚀

The **training** and **scoring** part is exactly the same as with the first model - try to solve it!

In [62]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
classifier.fit(x,y) # this trains the model
classifier.score(x,y) # this scores the model
</pre>
</details>

This time score metric is **accuracy** - how many predictions did the model get right in our dataset.

How is your score? Pretty great for 10 minutes of work, right?! 🤩Well...

----------

## We've been cheating! 😳

We've been scoring the model on the same data as it is trained on - too easy! That is called **data leakage**.

Scikit-learn library saves us again - let's import and use the [`train_test_split` method](https://scikit-learn.org/0.24/modules/generated/sklearn.model_selection.train_test_split.html) 

In [24]:
from sklearn.model_selection import train_test_split

Below line might look a little crazy. Don't worry.

The `train_test_split` method gives us all the datasets we need - inputs and outputs for both training and testing. So we simultaneously create four new variables to store that.

In [40]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

### 7. (without cheating) Your turn! 🚀

Now that we have training and testing datasets, you should **initialize** a new model, **train** and **score** it again with the right datasets.

In [61]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
classifier = KNeighborsClassifier(n_neighbors=3)

classifier.fit(xtrain, ytrain) # using only the training data
classifier.score(xtest, ytest) # using unseen testing data to score
</pre>
</details>

----------

### 8. Your turn to predict! 🚀

The prediction also works exactly as the previous model, so we let you do that. We created an example customer which you can tune.

*Note: the feature order is:* 

`['CreditScore', 'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']`

In [60]:
customer = [[608.0, 1.0, 31.0, 20.0, 81207.86, 2.0, 1.0, 0.0, 111142.58]]
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
classifier.predict(customer)
</pre>
</details>

### 8.1. Your turn! 🚀Adding probability

To understand the model's decision a bit more, we can use the `predict_proba()` method. It works exactly the same as the `predict()` method you used above. Can you write the code?

In [None]:
# your code here

<details>
    <summary>Reveal Solution 🙈</summary>

<p> 
<pre>
classifier.predict_proba(customer)
</pre>
</details>

### Your turn! 🚀

Try to tune the features inside `customer` variable to get different results. What insights can you see? 🔍

In [63]:
# your code here

----------

### Feature selection on KNN using permutation?

In [273]:
from sklearn.inspection import permutation_importance

permutation_score = permutation_importance(model, xtrain, ytrain, n_repeats=50)

np.vstack((inputs.columns, permutation_score.importances_mean)).T

array([['CreditScore', 1.3428827215755668e-05],
       ['Gender', 0.0],
       ['Age', 0.0],
       ['Tenure', 0.0],
       ['Balance', 0.009006266786034008],
       ['NumOfProducts', 0.0],
       ['HasCrCard', 0.0],
       ['IsActiveMember', 0.0],
       ['EstimatedSalary', 0.007967770814682175]], dtype=object)

In [None]:
# permutation shows that most important are Balance, EstimatedSalary, CreditScore
# Gender, Age, Tenure, NumOfProducts, HasCrCard, isActiveMember show 0 - consider removing for simplicity?

In [191]:
model = KNeighborsClassifier(n_neighbors=10)

In [193]:
small_x = inputs[['Balance', 'CreditScore', 'EstimatedSalary']]

In [194]:
xtr, xtt, ytr, ytt = train_test_split(small_x, output, test_size=0.3)

In [195]:
model.fit(xtr, ytr)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='uniform')

In [196]:
model.score(xtt, ytt)

0.7441253263707572

In [197]:
small_x.head()

Unnamed: 0,Balance,CreditScore,EstimatedSalary
1,83807.86,608,112542.58
2,159660.8,502,113931.57
4,125510.82,850,79084.1
5,113755.78,645,149756.71
7,115046.74,376,119346.88


In [259]:
pavel = [[11723200, 15, 723476]]

In [260]:
print(model.predict(pavel))
model.predict_proba(pavel)

[1]


array([[0.4, 0.6]])

## Forcing only categorical columns!

In [275]:
categories = churn[['NumOfProducts', 'HasCrCard', 'IsActiveMember', 'Gender']]

In [276]:
knn_cat = KNeighborsClassifier(n_neighbors=5)

In [279]:
cattrain, cattest, outtrain, outtest = train_test_split(categories, output, test_size=0.3)

In [280]:
knn_cat.fit(cattrain, outtrain)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform')

In [282]:
knn_cat.score(cattest,outtest)

0.7326370757180156

In [285]:
permutation_score = permutation_importance(knn_cat, cattrain, outtrain, n_repeats=50)

np.vstack((categories.columns, permutation_score.importances_mean)).T

array([['NumOfProducts', 0.056490599820948936],
       ['HasCrCard', 0.003393017009847781],
       ['IsActiveMember', 0.02487914055505816],
       ['Gender', -0.009413607878245333]], dtype=object)

In [291]:
cattest.head(1)

Unnamed: 0,NumOfProducts,HasCrCard,IsActiveMember,Gender
8705,1,1,1,1


In [301]:
connie = [[3,1,0,1]]
print(knn_cat.predict(connie))
knn_cat.predict_proba(connie)

[1]


array([[0.2, 0.8]])