<a href="https://colab.research.google.com/github/Enayar478/ML-Notebooks/blob/main/KNN_Customer_Churn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification Problem - Predicting Customer Churn 🛫

------------

# A new model - K-Nearest Neighbors 🤜 🤛

[K-Nearest Neighbors (or KNN)](https://scikit-learn.org/0.24/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) is a distance-based model that can be used for both regression (predicting a number) and classification (predicting a category).
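
To make "distance-based" a bit more concrete, here is a minimal from-scratch sketch of the idea (it is not how Scikit-learn implements it internally): measure the distance from a new point to every training point, keep the `k` closest ones, and let them vote. The function name `knn_predict` and the tiny `[age, tenure]` dataset are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): [age, tenure] -> exited (1) or stayed (0)
train_points = [[25, 2], [30, 3], [28, 1], [55, 8], [60, 9], [58, 7]]
train_labels = [0, 0, 0, 1, 1, 1]

young_customer = knn_predict(train_points, train_labels, [27, 2])
older_customer = knn_predict(train_points, train_labels, [57, 8])
print(young_customer, older_customer)
```

Each made-up customer ends up in the class of the three training points closest to it, which is all that "nearest neighbors" means here.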

**And the best part?** The steps are exactly the same for a KNN model as for the Linear Regression we just did! 🙌

1. Let's start by importing the necessary Python libraries again

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

2. Let's get the data from `CSV` into a nice `DataFrame`.
You can download the data from [here](https://drive.google.com/file/d/1O8WQNn_yrqBCJDnI0KFPLbUbKBXcP0-V/view?usp=share_link) and then upload it to Google Colaboratory as usual!

In [None]:
churn = pd.read_csv('https://drive.google.com/uc?export=download&id=1O8WQNn_yrqBCJDnI0KFPLbUbKBXcP0-V')
churn.head()

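| Unnamed: 0 | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 608 | 1 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 1 | 502 | 1 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 2 | 850 | 1 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
| 3 | 645 | 0 | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 4 | 376 | 1 | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |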


3. Always a good idea to do some **visual exploration** first 📊

In this case, we care about whether the customer has `Exited` or **churned**. So we can try to plot the different features (inputs) keeping `Exited` as the color differentiator (hue).

### Visualize the data! 📊

Experiment with the `scatter` plot below by changing which columns go on the `x` axis and the `y` axis. Make sure to type the column names correctly! 👀

In [None]:
px.scatter(churn, x='Age', y='Balance', color='Exited')

----------

4. Time for Machine Learning! Import the model from Scikit-learn:

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
from sklearn.neighbors import KNeighborsClassifier
</pre>
</details>

5. Initialize the model and **pick a number of neighbors** `n_neighbors` to match against:

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model = KNeighborsClassifier(n_neighbors=3)
</pre>
</details>

6. Create our `inputs` and `output`. This time let's call them `X` and `y`:

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
X = churn.drop(["Exited"], axis="columns") # dropping the output column to create the inputs (features)
y = churn["Exited"]
</pre>
</details>

Feel free to check your `X` and `y` below 👇

### 1. Your turn! 🚀

The **training** and **scoring** part is exactly the same as with the first model - try to solve it!

**💡 Tip:** try adjusting the number of neighbors - `n_neighbors` - above until you get the best result.

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model.fit(X,y) # this trains the model
model.score(X,y) # this scores the model
</pre>
</details>

This time the score metric is **accuracy**: the share of predictions the model got right in our dataset.
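
As a quick illustration of what accuracy measures, here is a small sketch that computes it by hand. The helper name `accuracy` and the five labels are made up for this example; for classifiers, `model.score` reports the same kind of number (mean accuracy).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for five customers: 1 = exited, 0 = stayed
y_true = [0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)  # 4 of the 5 predictions match the true labels
```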

How is your score? Pretty great for 10 minutes of work, right?! 🤩 Well...

----------

## We've been cheating! 😳

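We've been scoring the model on the same data it was trained on - too easy! The model has already seen the answers, so the score looks better than it would on new customers. This is often described as a form of **data leakage**.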
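
To see why this matters, here is a small from-scratch sketch under made-up assumptions: the points are random and the labels are assigned at random, so there is nothing real to learn. A nearest-neighbor model with `k=1` still scores perfectly on its own training data, because every training point is its own closest neighbor. The helper names `nearest_label` and `score` are hypothetical.

```python
import math
import random

def nearest_label(train_points, train_labels, query):
    """Return the label of the single closest training point (KNN with k=1)."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], query))
    return train_labels[best]

rng = random.Random(0)  # fixed seed so the run is repeatable

# 200 made-up points with labels assigned at random: no pattern to learn
points = [[rng.random(), rng.random()] for _ in range(200)]
labels = [rng.randint(0, 1) for _ in range(200)]

train_points, train_labels = points[:150], labels[:150]
test_points, test_labels = points[150:], labels[150:]

def score(eval_points, eval_labels):
    """Accuracy of the k=1 model on the given points."""
    hits = sum(nearest_label(train_points, train_labels, p) == y
               for p, y in zip(eval_points, eval_labels))
    return hits / len(eval_labels)

train_score = score(train_points, train_labels)  # scored on data the model has seen
test_score = score(test_points, test_labels)     # scored on unseen data
print(train_score, test_score)
```

The training score is perfect even though the labels are noise, while the score on unseen points should sit close to a coin flip. Only the second number says anything about new customers.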

The Scikit-learn library saves us again - let's import and use the [`train_test_split` function](https://scikit-learn.org/0.24/modules/generated/sklearn.model_selection.train_test_split.html).

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
from sklearn.model_selection import train_test_split
</pre>
</details>

The line below might look a little crazy. Don't worry.

The `train_test_split` function returns all the datasets we need: inputs and outputs for both training and testing. So we create four new variables at once to store them.

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
</pre>
</details>
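
By default the split is random, so every run gives slightly different rows and a slightly different score. The sketch below shows two optional arguments of `train_test_split` that can help with that: `random_state` fixes the shuffle, and `stratify` keeps the share of each class the same in both parts. The arrays `X_demo` and `y_demo` are made up for illustration, and the values `42` and `0.2` are just example choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 20% of the rows labelled 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,     # hold out 20% of the rows for testing
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the share of each class equal in both parts
)

print(X_tr.shape, X_te.shape)    # sizes of the training and testing inputs
print(y_tr.mean(), y_te.mean())  # share of label 1 in each part
```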

### 2. Your turn - this time no cheating! 🚀

Now that we have training and testing datasets, you should **initialize** a new model, **train** and **score** it again with the right datasets.

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model = KNeighborsClassifier(n_neighbors=3)

model.fit(X_train, y_train) # using only the training data
model.score(X_test, y_test) # using unseen testing data to score
</pre>
</details>

----------

### 3.1. Your turn - prediction! 🚀

The prediction also works exactly as with the previous model, so we'll let you do that. We created an example customer which you can tune.

In [None]:
customer = pd.DataFrame({
    'CreditScore': [608],
    'Gender': [1],
    'Age': [31],
    'Tenure': [20],
    'Balance': [81207.86],
    'NumOfProducts': [2],
    'HasCrCard': [1],
    'IsActiveMember': [0],
    'EstimatedSalary': [11142.58]
})
# predict here

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model.predict(customer)
</pre>
</details>

### 3.2. Your turn! 🚀 Adding probability

To understand the model's decision a bit more, we can use the `predict_proba()` method. It works exactly the same as the `predict()` method you used above. Can you write the code?

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model.predict_proba(customer)
</pre>
</details>
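
Where do these probabilities come from? With the default uniform weights, a KNN classifier reports the share of the nearest neighbors that belong to each class, so with `n_neighbors=3` the only possible values are 0, 1/3, 2/3 and 1. Here is a from-scratch sketch of that idea; the function `knn_vote_fractions` and the toy `[age, tenure]` data are made up for illustration.

```python
import math
from collections import Counter

def knn_vote_fractions(train_points, train_labels, query, k=3):
    """Share of the k nearest neighbors that belong to each class."""
    # Indices of the training points, closest to the query first
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    classes = sorted(set(train_labels))
    return {c: votes[c] / k for c in classes}

# Toy data (made up): [age, tenure] -> exited (1) or stayed (0)
train_points = [[25, 2], [30, 3], [41, 5], [44, 6], [58, 9]]
train_labels = [0, 0, 0, 1, 1]

fractions = knn_vote_fractions(train_points, train_labels, [45, 6], k=3)
print(fractions)
```

Two of the three closest neighbors exited, so the sketch gives class 1 a share of 2/3 and class 0 a share of 1/3.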

### 4. Your turn! 🚀

Try to tune the features inside the `customer2` variable below to get different results. What insights can you see? 🔍

In [None]:
customer2 = pd.DataFrame({
    'CreditScore': [608],
    'Gender': [1],
    'Age': [51],
    'Tenure': [20],
    'Balance': [60000.86],
    'NumOfProducts': [2],
    'HasCrCard': [1],
    'IsActiveMember': [0],
    'EstimatedSalary': [11142.58]
})
# predict here

----------

### Explainability? Already harder with KNN model 😓

We need to use the [permutation importance](https://scikit-learn.org/0.24/modules/generated/sklearn.inspection.permutation_importance.html) method provided by `Scikit-learn`.

This method scores the model many times, shuffling (*permuting*) the values of one feature at a time, to see which shuffle makes the model's score drop the most. The bigger the drop, the more the model relies on that feature.

In [None]:
from sklearn.inspection import permutation_importance

permutation_score = permutation_importance(model, X_train, y_train, n_repeats=10)

np.vstack((X.columns, permutation_score.importances_mean)).T
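
The idea behind permutation importance can be sketched from scratch in a few lines. This is a simplified illustration under made-up assumptions, not Scikit-learn's implementation: `toy_model` is a hypothetical stand-in for a trained model, and the data is built so that only the first column matters.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(labels)

def permutation_importance_sketch(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one column at a time is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Copy the rows, replacing only this column with shuffled values
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Made-up data: column 0 decides the label, column 1 is pure noise
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(row[0] > 0.5) for row in rows]

# Hypothetical stand-in for a trained model: it only looks at column 0
def toy_model(row):
    return int(row[0] > 0.5)

importances = permutation_importance_sketch(toy_model, rows, labels)
print(importances)
```

Shuffling the column the model relies on hurts the score a lot, while shuffling the noise column changes nothing. That is the same reading you apply to the `importances_mean` values above.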

# Congratulations on your ML model! 🦸‍♀️🦸‍♂️

-----

## 🕵️‍♀️ Going further? (Optional Challenge) KNN only with scaled numerical values!

You probably noticed that the **categorical columns** - such as `HasCrCard` or `NumOfProducts` - have virtually no influence on the above model.

It might just be that they are strongly outweighed by the other columns, like `Balance` or `EstimatedSalary`.
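
A quick back-of-the-envelope check shows how this can happen. The two customers below are made up: they differ by 1000 in `Balance` and by one step (0 vs 1) in `HasCrCard`. Since KNN compares customers by distance, the column with the larger numbers decides almost everything.

```python
import math

# Two made-up customers described by [Balance, HasCrCard]
customer_a = [83807.86, 0]
customer_b = [84807.86, 1]

# Raw distance: the gap in Balance swamps the 0-vs-1 gap in HasCrCard
raw_distance = math.dist(customer_a, customer_b)
balance_only = abs(customer_a[0] - customer_b[0])
print(raw_distance, balance_only)

# Share of the squared distance that comes from HasCrCard
card_share = (customer_a[1] - customer_b[1]) ** 2 / raw_distance ** 2
print(card_share)
```

The distance is practically the `Balance` gap alone, and `HasCrCard` contributes only a tiny fraction of it, even though the two customers sit at opposite ends of that column.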

The remedy? Let's scale our numerical features!

1. Import the scaler, instantiate it and transform `X_train` and `X_test`. Save the new scaled values to `X_train_scaled` and `X_test_scaled`.

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)</pre>
</details>
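
Why `fit_transform` on the training data but only `transform` on the testing data? The scaler should learn its statistics from the training data alone and then reuse them everywhere else. Here is a from-scratch sketch of that idea for a single column, using the usual standardization formula z = (x - mean) / std. The helper names `fit_scaler` and `transform` and the `Balance` values are made up for illustration.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and (population) standard deviation of one column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Apply already-learned statistics: z = (x - mean) / std."""
    return [(x - mean) / std for x in column]

# Made-up Balance values
balance_train = [0.0, 50000.0, 100000.0, 150000.0]
balance_test = [75000.0, 200000.0]

# The statistics come from the training data only
mean, std = fit_scaler(balance_train)
train_scaled = transform(balance_train, mean, std)  # plays the role of fit_transform
test_scaled = transform(balance_test, mean, std)    # plays the role of transform
print(mean, std)
print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1. The testing values are expressed on that same scale, so they stay comparable to what the model saw during training.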

2. Initialize a KNN model and fit it on the scaled training data. Pick the number of neighbors `n_neighbors` you like 🙂

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model2 = KNeighborsClassifier(n_neighbors=3) # n_neighbors=3 is just an example, pick your own

model2.fit(X_train_scaled, y_train) # training on the scaled training data</pre>
</details>

3. Score the model on the testing data. How is the accuracy? Is it comparable to the first model?

<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
model2.score(X_test_scaled, y_test) # using unseen, scaled testing data to score the model trained on scaled data</pre>
</details>

4. Make predictions and check the prediction probability! 🔮

We left a `customer3` sample below, but **make sure to scale the numbers using the same scaler** that we created before (only use `.transform`, not `.fit_transform`).

After that is done, predict using your model!

In [None]:
customer3 = pd.DataFrame({
    'CreditScore': [608],
    'Gender': [1],
    'Age': [51],
    'Tenure': [20],
    'Balance': [60000.86],
    'NumOfProducts': [2],
    'HasCrCard': [1],
    'IsActiveMember': [0],
    'EstimatedSalary': [11142.58]
})


<details>
    <summary>Reveal Solution 🙈</summary>

<p>
<pre>
customer3 = pd.DataFrame({
    'CreditScore': [608],
    'Gender': [1],
    'Age': [51],
    'Tenure': [20],
    'Balance': [60000.86],
    'NumOfProducts': [2],
    'HasCrCard': [1],
    'IsActiveMember': [0],
    'EstimatedSalary': [11142.58]
})
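customer3_scaled = scaler.transform(customer3) # reuse the already fitted scaler, no re-fitting
model2.predict(customer3_scaled) # predicted class
model2.predict_proba(customer3_scaled) # probability of each class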
</pre>
</details>

5. **Explainability** - let's check the importance of each input with `permutation_importance` once again. What do you find? 🔍

⚠️ We left the variables that **you need to replace with your variables** in `UPCASED_LETTERS` 👇

In [None]:
permutation_score = permutation_importance(YOUR_MODEL, SCALED_X_TRAIN, YOUR_Y, n_repeats=10)

np.vstack((NONSCALED_X_TRAIN.columns, permutation_score.importances_mean)).T

# Congratulations on completing the optional challenge! 🏋️‍♀️🏋️‍♂️