Please find my comments below - **I kindly ask that you do not move, modify, or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done succesfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. Work can't be accepted with the red comments.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

<div class="alert alert-block alert-warning">
<b>Overall reviewer's comment</b> <a class="tocSkip"></a>

Hello Deborah,
    
You’ve submitted another project—great work! Your commitment to pushing through the challenges of this program is admirable.
After reviewing your submission, I’ve returned it with some feedback to help you make the necessary improvements.

What Was Great:
 - Outstanding data analysis!
 - Good project structure
 - Deep conclusions

Areas to Improve:
 - We do not need display() in Jupyter for dataframes
 - We need to split data into train, val, test if we use hyperparameters tuning (to avoid overfitting)
 - Our recommendation is to rerun whole notebook before submitting to avoid unexpected bugs =)
    
    
Please check my comments below.
    
Keep in mind that revisions are a normal and valuable part of the learning process. Use this feedback to refine your work and resubmit when you’re ready. I know you’re capable of great things, and I’m here to support you every step of the way. Keep going—you’re doing a great job! 🏄

</div>

<div class="alert alert-block alert-warning">
<b>Overall reviewer's comment v2</b> <a class="tocSkip"></a>

Thank you for correcting train val test split!
    
However, there are still some errors. Please check my comments.
    
Also, we need to check our best model on test dataset (we need to get F1 > 0.59)
</div>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
<b>Thanks for reviewing my project.</b>
<br>
<br>
- I used display, instead of print, because if there are many columns, print will put some columns on the next line, and it makes it hard to read.
<br>
<br>
- I have split the data in 3 sets for this next version.
<br>
<br>
- Also, I did rerun the whole notebook, and then save, before submitting. I did not have any cells with errors. Did you?
<a class="tocSkip"></a>

<div class="alert alert-block alert-info">
<b>Deb's comment #2:</b>
- The TT site did not save out some of my fixes, nor my ver 1 comments to you. So, I'm cutting-n-pasting comments and fixes back in, from my ver 2 local copy.<a class="tocSkip"></a>

<div class="alert alert-block alert-warning">
<b>Overall reviewer's comment v3</b> <a class="tocSkip"></a>

Could you please reload correct notebook one more time? This version has an error in 108 cell (NameError: name 'target_pred_constant' is not defined).
    
Also, we need to check our best model on **test** dataset (F1 metric).
</div>

<div class="alert alert-block alert-info">
<b>Deb's comment #3:</b>
- Yes. <a class="tocSkip"></a>

# Beta Bank Churn, by Deborah Thomas

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Introduction</div>

- I will develop a model, that will analyze bank customers' data, to predict whether a customer will leave Beta Bank soon.
- I will use evaluation metrics to test the model.

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Import Libraries</div>

In [30]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier


from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score, roc_curve

from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

import plotly.express as px

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Download the Data</div>

In [33]:
try:
    churn = pd.read_csv('/datasets/Churn.csv')  # Attempt to read from the server path
except FileNotFoundError:
    churn = pd.read_csv('../datasets/Churn.csv')  # Fallback to the local path


display(churn.head(20))

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


## <div style="color: red; border: 2px solid yellow; display: inline-block;">Clean the Data</div>

 ### <span style="color:red">Rename columns to shorter names</span>

In [37]:
churn.rename(columns={'RowNumber': 'Row#', 'CustomerId': 'CustID', 'Surname': 'Name', 'CreditScore': 'CredScore', 'Geography': 'Geo', 'NumOfProducts': 'NumProds', 'IsActiveMember': 'Active', 'EstimatedSalary': 'Salary', 'Exited': 'Churned'}, inplace=True)

In [39]:
display(churn.head(3))

Unnamed: 0,Row#,CustID,Name,CredScore,Geo,Gender,Age,Tenure,Balance,NumProds,HasCrCard,Active,Salary,Churned
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1


 ### <span style="color:red">Active / Churned</span>

In [42]:
print(churn[['Active', 'Churned']].head(5))

   Active  Churned
0       1        1
1       1        0
2       0        1
3       0        0
4       1        0


#### This is interesting. There are some customers who have already churned, but are still active.

### <span style="color:red">Name</span>

In [46]:
# Filter names that start with "Ch'ien" or contain a single apostrophe, due to an error that I had fitting the model.
filtered_names = churn[churn['Name'].str.startswith('Chi\'en') | churn['Name'].str.contains("'")]
display(filtered_names.head(15))

Unnamed: 0,Row#,CustID,Name,CredScore,Geo,Gender,Age,Tenure,Balance,NumProds,HasCrCard,Active,Salary,Churned
52,53,15683553,O'Brien,788,France,Female,33,5.0,0.0,2,0,0,116978.19,0
58,59,15623944,T'ien,511,Spain,Female,66,4.0,0.0,1,1,0,1643.11,1
109,110,15744689,T'ang,479,Germany,Male,35,9.0,92833.89,1,1,0,99449.86,1
183,184,15810845,T'ang,636,France,Male,42,2.0,0.0,2,1,1,55470.78,0
186,187,15771977,T'ao,730,France,Female,39,1.0,99010.67,1,1,0,194945.8,0
226,227,15774393,Ch'ien,694,France,Female,30,9.0,0.0,2,1,1,26960.31,0
228,229,15637753,O'Sullivan,751,Germany,Male,50,2.0,96888.39,1,1,0,77206.25,1
265,266,15813163,Ch'iu,531,Spain,Female,36,9.0,99240.51,1,1,0,123137.01,0
279,280,15782210,K'ung,714,France,Male,46,1.0,0.0,1,1,0,152167.79,1
283,284,15699389,Ch'ien,807,France,Male,42,7.0,118274.71,1,1,1,25885.72,0


- Just looking to see if there are names with single apostrophes because I had an error fitting the data.
- I think it's best to drop the Name column when I split the data, later on, as this column will not give meaningful information when predicting target data. 

In [49]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Row#       10000 non-null  int64  
 1   CustID     10000 non-null  int64  
 2   Name       10000 non-null  object 
 3   CredScore  10000 non-null  int64  
 4   Geo        10000 non-null  object 
 5   Gender     10000 non-null  object 
 6   Age        10000 non-null  int64  
 7   Tenure     9091 non-null   float64
 8   Balance    10000 non-null  float64
 9   NumProds   10000 non-null  int64  
 10  HasCrCard  10000 non-null  int64  
 11  Active     10000 non-null  int64  
 12  Salary     10000 non-null  float64
 13  Churned    10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


 ### <span style="color:red">Look for missing data:</span>

#### 'Tenure' has some missing data. 909 rows have invalid data.

In [53]:
# Find and display the rows where 'Tenure' is missing
missing_tenure_rows = churn[churn['Tenure'].isna()]

print("\nRows with missing 'tenure' values:")
display(missing_tenure_rows.head(9))


Rows with missing 'tenure' values:


Unnamed: 0,Row#,CustID,Name,CredScore,Geo,Gender,Age,Tenure,Balance,NumProds,HasCrCard,Active,Salary,Churned
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.0,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.0,1,0,0,84509.57,0
82,83,15641732,Mills,543,France,Female,36,,0.0,2,0,0,26019.59,0
85,86,15805254,Ndukaku,652,Spain,Female,75,,0.0,2,1,1,114675.75,0
94,95,15676966,Capon,730,Spain,Male,42,,0.0,2,0,1,85982.47,0
99,100,15633059,Fanucci,413,France,Male,34,,0.0,2,0,0,6534.18,0


In [55]:
# Fill NaN values
churn['Tenure'].fillna(churn['Tenure'].median(), inplace=True)

#### I am not going to drop these rows, that have NaN 'Tenure', but I will fill them with the median value of the column, as leaving in the NaN will create an error when fitting the model, later on.

In [58]:
churn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Row#       10000 non-null  int64  
 1   CustID     10000 non-null  int64  
 2   Name       10000 non-null  object 
 3   CredScore  10000 non-null  int64  
 4   Geo        10000 non-null  object 
 5   Gender     10000 non-null  object 
 6   Age        10000 non-null  int64  
 7   Tenure     10000 non-null  float64
 8   Balance    10000 non-null  float64
 9   NumProds   10000 non-null  int64  
 10  HasCrCard  10000 non-null  int64  
 11  Active     10000 non-null  int64  
 12  Salary     10000 non-null  float64
 13  Churned    10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


 ### <span style="color:red">Data types:</span>

#### Changing Tenure to int, because none of the values will change if whole numbers are used.

In [None]:
churn['Tenure'] = churn['Tenure'].astype(int)

In [None]:
churn.info()

 ### <span style="color:red">Encoding</span>

- <span style="color:red">Encode Geography column in to 2 new columns(regions), and dropping first column(France):</span>

In [None]:
churn['Geo'].value_counts()

In [None]:
# Apply One-Hot Encoding to Geo column.
churn_ohe = pd.get_dummies(churn, columns=['Geo'], drop_first=True)

print("\nOne-Hot Encoded DataFrame:")
display(churn_ohe.head(3))

- <span style="color:red">Encode Gender in to 2 new columns</span>

In [None]:
# Apply One-Hot Encoding to Gender column.
churn_ohe = pd.get_dummies(churn_ohe, columns=['Gender'], drop_first=False)

In [None]:
print("\nOne-Hot Encoded DataFrame:")
display(churn_ohe.head(3))

- <span style="color:red">Encode Number of Products in to 4 new columns:</span>

In [None]:
# Apply One-Hot Encoding to Gender column.
churn_ohe = pd.get_dummies(churn_ohe, columns=['NumProds'], drop_first=False)

print("\nOne-Hot Encoded DataFrame:")
display(churn_ohe.head(3))

- <span style="color:red">Encode HasCrCard column in to 2 new columns:</span>

In [None]:
# Apply One-Hot Encoding to Gender column.
churn_ohe = pd.get_dummies(churn_ohe, columns=['HasCrCard'], drop_first=False)

print("\nOne-Hot Encoded DataFrame:")
display(churn_ohe.head(3))

In [None]:
churn_ohe.info()

#### The new columns are of type uint8, which takes up the least amount of disk space. This is good.

### Data is clean now

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Explore the Data</div>

### <span style="color:red">Credit Score</span>

In [None]:
churn['CredScore'].value_counts()

### What is the comparison of credit score to churned customers?

In [None]:
churn['CredScore'].min()

In [None]:
churn['CredScore'].max()

In [None]:
# break the 'CredScore' column into different levels,  based on the distribution of credit scores.
# Define custom bins and labels
bins = [349, 425, 500, 575, 650, 725, float('inf')]
labels = ['very poor', 'poor', 'okay', 'good', 'very good', 'excellent']

In [None]:
# Bin the 'CredScore' into defined levels and assign labels
churn['CredScoreLevel'] = pd.cut(churn['CredScore'], bins=bins, labels=labels)

In [None]:
# Display the updated DataFrame
print(churn[['CredScore', 'CredScoreLevel']].head(15))

In [None]:
# Group by 'CredScoreLevel' and 'Churned', then count occurrences
credscore_churn_counts = churn.groupby(['CredScoreLevel', 'Churned']).size().reset_index(name='Count')

In [None]:
churn['CredScoreLevel'].value_counts()

In [None]:
fig = px.bar(
    credscore_churn_counts,
    x='CredScoreLevel',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Credit Score Level',
    labels={'CredScoreLevel': 'Credit Score Level', 'Count': 'Number of Customers'}
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Credit Score Level",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Credit Score Level',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

# Adjust the font size of the tick labels
fig.update_xaxes(tickfont={'size': 14})
fig.update_yaxes(tickfont={'size': 14})

fig.show()

### Credit Score breakdown:
The credit scores, on this dataset, go from 350-850.

- very poor ... 349-424
- poor      ... 425-499
- okay      ... 500-574
- good      ... 575-649
- very good ... 650-724
- excellent ... 725-850


In [None]:
print("Churn rate for customers with very poor credit:")
33 / (33 + 37) * 100

In [None]:
print("Churn rate for customers with poor credit:")
119 / (119 + 454) * 100

In [None]:
print("Churn rate for customers with okay credit:")
343/ (343 + 1256) * 100

In [None]:
print("Churn rate for customers with good credit:")
562 / (562 + 2133) * 100

In [None]:
print("Churn rate for customers with very good credit:")
518 / (518 + 2275) * 100

In [None]:
print("Churn rate for customers with excellent credit:")
462 / (462 + 1808) * 100

- The majority of customers have very good credit. 
- All levels, except 'very poor' credit, have a churn rate approximately between 18\% - 21\%.
- Customers with 'very poor' credit have a much higher churn rate than other credit levels, at 47\%.

### <span style="color:red">Geography</span>

In [None]:
churn.Geo.value_counts()

#### The dataset only includes data from the regions of France, Germany, and Spain.

### What is the churn rate for customers from these 3 regions?

In [None]:
# Group by 'Geo' and 'Churned', then count occurrences
geo_churn_counts = churn.groupby(['Geo', 'Churned']).size().reset_index(name='Count')

In [None]:
fig = px.bar(
    geo_churn_counts,
    x='Geo',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Geographical Region',
    labels={'Geo': 'Geographical Region', 'Count': 'Number of Customers'}
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Geographical Region",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Geographical Region',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)
# Adjust the font size of the tick labels
fig.update_xaxes(tickfont={'size': 14})
fig.update_yaxes(tickfont={'size': 14})


fig.show()


In [None]:
print("Churn rate for customers from France:")
810 / 4204

In [None]:
print("Churn rate for customers from Germany:")
814 / 1695

In [None]:
print("Churn rate for customers from Spain:")
413 / 2064

- The churn rate for customers from France and Spain are about the same, approximately 20\%
- The churn rate for customers from Germany is more than double that of France or Spain, at 48\%.

### <span style="color:red">Gender</span>

In [None]:
churn.Gender.value_counts()

#### Most customers are male.

### What is the churn rate of customers, by gender?

In [None]:
# Group by 'Churned' and 'Gender', then count occurrences
gender_churned_counts = churn.groupby(['Churned', 'Gender']).size().reset_index(name='Count')

# Filter for rows where 'Churned' is True
true_gender_churned_counts = gender_churned_counts[gender_churned_counts['Churned'] == True]

print(true_gender_churned_counts)

In [None]:
print("Churn rate for females:")

In [None]:
1139 / 4543 * 100

In [None]:
print("Churn rate for males:")

In [None]:
898 /  5457 * 100

- The churn rate for females is 25\%.
- The churn rate for males is 16\%.

In [None]:
# Create bar graph using Plotly Express
fig = px.bar(
    true_gender_churned_counts,
    x='Gender',
    y='Count',
    title='Gender Distribution for Customers Who Have Churned',
    labels={'Count': 'Number of Churned Customers', 'Gender': 'Gender'},
    color='Gender',
    text='Count'
)
# Center the title and make it bigger
fig.update_layout(
    title={
        'text': "Gender Distribution for Customers Who Have Churned",
        'y':0.9,  # Adjust the vertical position of the title
        'x':0.5,  # Center the title
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Gender',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Churned Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)
# Make the tick labels on both axes bigger
fig.update_xaxes(tickfont={'size': 14})
fig.update_yaxes(tickfont={'size': 14})

fig.show()

#### More female customers churn than male customers.

### <span style="color:red">Age</span>

In [None]:
churn['Age'].min()

In [None]:
churn['Age'].max()

In [None]:
# Group by 'Age' and 'Churned', then count occurrences
age_churn_counts = churn.groupby(['Age', 'Churned']).size().reset_index(name='Count')

In [None]:
# Map the 0 and 1 values to 'False' and 'True'
age_churn_counts['Churned'] = age_churn_counts['Churned'].map({0: 'False', 1: 'True'})

In [None]:
fig = px.line(
    age_churn_counts,
    x='Age',
    y='Count',
    color='Churned',
    title='Churn Rate by Age',
    labels={'Age': 'Age', 'Count': 'Number of Customers'}
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Age",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Age',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

# Adjust the font size of the tick labels
fig.update_xaxes(tickfont={'size': 14})
fig.update_yaxes(tickfont={'size': 14})


fig.show()


- Ages 23-48 have an extremely low churn rate than other ages, with the lowest churn ages being 35-37. 

### <span style="color:red">Tenure</span>

In [None]:
churn['Tenure'].value_counts()

In [None]:
# Group by 'Age' and 'Churned', then count occurrences
tenure_churn_counts = churn.groupby(['Tenure', 'Churned']).size().reset_index(name='Count')

In [None]:
fig = px.bar(
    tenure_churn_counts,
    x='Tenure',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Tenure',
    labels={'Tenure': 'Tenure (years)', 'Count': 'Number of Customers', 'Churned': 'Churned'},
    text='Count'
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Tenure",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Tenure (years)',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)
# Adjust the text position for better readability
fig.update_traces(texttemplate='%{text}', textposition='outside')

# Set the x-axis ticks to display every tenure value
tenure_range = list(range(int(tenure_churn_counts['Tenure'].min()), int(tenure_churn_counts['Tenure'].max()) + 1))
fig.update_xaxes(tickvals=tenure_range)

fig.show()

#### There is a higher percentage of customers who have been tenured for 5 years.

### What are the percentage rates of churned customers, by Tenure ?

In [None]:
# Group by 'Tenure' and 'Churned', then count occurrences
tenure_churn_counts = churn.groupby(['Tenure', 'Churned']).size().reset_index(name='Count')

# Pivot the data to get counts for each 'Tenure' and 'Churned' status
pivot_df = tenure_churn_counts.pivot(index='Tenure', columns='Churned', values='Count').fillna(0)

In [None]:
# Rename columns for clarity
pivot_df.columns = ['Not_Churned', 'Churned']

# Calculate total customers per 'Tenure'
pivot_df['Total'] = pivot_df.sum(axis=1)

In [None]:
# Calculate the churn rate percentage
pivot_df['Churn_Rate (%)'] = (pivot_df['Churned'] / pivot_df['Total']) * 100

# Reset index for printing
pivot_df = pivot_df.reset_index()

In [None]:
# Print the result
print(pivot_df[['Tenure', 'Churn_Rate (%)']])

- The churn rate, by Tenure, ranges from 17.2\% - 23.5\%.  
- The lowest churn rate is for those customers with a Tenure of 7 and 8 years. 
- The highest churn rate is for those customers with a Tenure of 1 year or less.

### <span style="color:red">Bank Balance</span>

In [None]:
churn['Balance'].min()

In [None]:
churn['Balance'].max()

In [None]:
churn['Balance'].mean()

#### The average bank balance is \$76,485.

In [None]:
# Create a copy of the DataFrame for visualization purposes
churn_viz = churn.copy()

In [None]:
# Map the 0 and 1 values to 'False' and 'True' in the copy
churn_viz['Churned'] = churn_viz['Churned'].map({0: 'False', 1: 'True'})

In [None]:
fig = px.box(churn_viz, x='Churned', y='Balance', color='Churned',
             title='Balance Distribution by Churn Status',
             labels={'Churned': 'Churned', 'Balance': 'Balance'})

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Balance Distribution by Churn Status",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Churned',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Balance',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

fig.show()

#### The median churn rate, based on bank balance, is \$109,349

### <span style="color:red">Number of Products</span>

In [None]:
display(churn.head(3))

In [None]:
churn.NumProds.value_counts()

In [None]:
# Group by 'NumProds' and 'Churned', then count occurrences
numprods_churn_counts = churn.groupby(['NumProds', 'Churned']).size().reset_index(name='Count')

In [None]:
fig = px.bar(
    numprods_churn_counts,
    x='NumProds',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Number of Products',
    labels={'NumProds': 'Number of Products', 'Count': 'Number of Customers'}
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Number of Products",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Number of Products',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

# Adjust the font size of the tick labels
fig.update_xaxes(tickfont={'size': 14})
fig.update_yaxes(tickfont={'size': 14})

fig.show()

In [None]:
print("Percentage of customers churned who have 1 product:")
1409 / (1409 + 3675) * 100

In [None]:
print("Percentage of customers churned who have 2 products:")
348 / (348 + 4242) * 100

In [None]:
print("Percentage of customers churned who have 3 products:")
220 / (220 + 46) * 100

In [None]:
print("Percentage of customers churned who have 4 products:")
60 / (60 + 0) * 100

- Customers with 3 products have a very high churn rate, at 82\%
- Customers with 4 products have the highest churn rate, at 100\%.

In [None]:
churn.Churned.value_counts()

In [None]:
# Filter the DataFrame for customers with Churned == 0 and NumProds == 4
filtered_numProd_customers = churn[(churn['Churned'] == False) & (churn['NumProds'] == 4)]
num_customers = filtered_numProd_customers.shape[0]
print(num_customers)

#### There are zero customers, who have 4 products, who have not churned.

### <span style="color:red">Credit Cards</span>

### How many customers, who have credit cards, have Churned ?

In [None]:
# Group by 'Churned' and 'HasCrCard', then count occurrences
HasCrCard_churned_counts = churn.groupby(['Churned', 'HasCrCard']).size().reset_index(name='Count')

# Filter for rows where 'Churned' is True
HasCrCard_churnedTrue_counts = HasCrCard_churned_counts[HasCrCard_churned_counts['Churned'] == True]

print(HasCrCard_churnedTrue_counts)

In [None]:
# Extract counts of churned customers with and without credit cards
churned_with_credit_cards = HasCrCard_churnedTrue_counts[HasCrCard_churnedTrue_counts['HasCrCard'] == 1]['Count'].values[0]
total_churned_customers = HasCrCard_churnedTrue_counts['Count'].sum()

# Calculate the proportion and format it as a percentage
cc_churn_proportion = (churned_with_credit_cards / total_churned_customers) * 100

# Print the formatted string
print(f"{cc_churn_proportion:.2f}% of churned customers have credit cards.")

In [None]:
fig = px.bar(
    HasCrCard_churnedTrue_counts,
    x='HasCrCard',
    y='Count',
    title='Distribution of Churned Customers with Credit Cards',
    labels={'HasCrCard': 'Has Credit Card', 'Count': 'Number of Churned Customers'},
    text='Count'
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Distribution of Churned Customers with Credit Cards",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Has Credit Card',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Churned Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

# Adjust the font size of the tick labels
fig.update_xaxes(
    tickfont={'size': 14},
    tickvals=[0, 1],  # Set the tick values to only 0 and 1
    ticktext=['0', '1'],  # Set the tick text to "0" and "1"
    type='category'  # Treat the x-axis as categorical
)

fig.update_yaxes(tickfont={'size': 14})

# Customize the plot to display the counts on the bars
fig.update_traces(texttemplate='%{text}', textposition='outside')

fig.show()

#### There is a much higher percentage of churned customers who have credit cards. The churn rate is 69.9\%.

### <span style="color:red">Is Active</span>

In [None]:
churn['Active'].value_counts()

In [None]:
# Group by 'Active' and 'Churned', then count occurrences
active_churn_counts = churn.groupby(['Active', 'Churned']).size().reset_index(name='Count')

In [None]:
fig = px.bar(
    active_churn_counts,
    x='Active',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Active Status',
    labels={'Active': 'Active Status', 'Count': 'Number of Customers', 'Churned': 'Churned'}
)

# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Active Status",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Active Status',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)

fig.show()

#### There are 735 customers who have churned, but are still active.

In [None]:
print(735 / len(churn))

#### 7\% of customers have churned, but are still active.

### <span style="color:red">Salary</span>

In [None]:
churn['Salary'].min()

In [None]:
churn['Salary'].max()

In [None]:
# break the 'Salary' column into different levels.
# Define custom bins and labels
bins = [0, 20000, 50000, 90000, 130000, 175000, float('inf')]
labels = ['very poor', 'poor', 'okay', 'good', 'very good', 'excellent']

In [None]:
# Bin the 'Salary' into defined levels and assign labels
churn['SalaryLevel'] = pd.cut(churn['Salary'], bins=bins, labels=labels)

In [None]:
# Display the updated DataFrame
display(churn.head(15))

In [None]:
# Group by 'CredScoreLevel' and 'Churned', then count occurrences
salaryLevel_churn_counts = churn.groupby(['SalaryLevel', 'Churned']).size().reset_index(name='Count')

In [None]:
churn['SalaryLevel'].value_counts()

In [None]:
fig = px.bar(
    salaryLevel_churn_counts,
    x='SalaryLevel',
    y='Count',
    color='Churned',
    barmode='group',
    title='Churn Rate by Salary Level',
    labels={'SalaryLevel': 'Salary Level', 'Count': 'Number of Customers', 'Churned': 'Churned'}
)
# Customize the title and axis labels
fig.update_layout(
    title={
        'text': "Churn Rate by Salary Level",
        'y': 0.9,
        'x': 0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Adjust the font size of the title
    },
    xaxis_title={
        'text': 'Salary Level',
        'font': {'size': 18}  # Adjust the x-axis title font size
    },
    yaxis_title={
        'text': 'Number of Customers',
        'font': {'size': 18}  # Adjust the y-axis title font size
    }
)
# Show the plot
fig.show()

In [None]:
print("Churn rate for customers with a 'very poor' salary:")
198 / (198 + 788) * 100

In [None]:
print("Churn rate for customers with a 'poor' salary:")
407 / (407 + 1176) * 100

In [None]:
print("Churn rate for customers with an 'okay' salary:")
291 / (291 + 1638) * 100

In [None]:
print("Churn rate for customers with a 'good' salary:")
402 / (402 + 1622) * 100

In [None]:
print("Churn rate for customers with a 'very good' salary:")
466 / (466 + 1752) * 100

In [None]:
print("Churn rate for customers with an 'excellent' salary:")
273 / (273 + 987) * 100

#### Salary range explanation:
- 'very poor'  ... `$0-$19,999`
- 'poor'       ... `$20,000-$49,999`
- 'okay'       ... `$50,000-$89,999`
- 'good'       ... `$90,000-$130,000`
- 'very good'  ... `$130,000-$175,000`
- 'excellent'  ... `$175,001 and above`

- Churn rate, by salary, ranges from 15\% - 21.6\%.
- The salary group with the lowest churn rate is the 'okay' salary range.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Excellent analysis!
</div>

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Supervised Learning</div>

### This next section will use the OHE dataframe.

In [None]:
churn_ohe.info()

In [None]:
display(churn_ohe.head(15))

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>

### <span style="color:red">Split the Data</span>

- Dropping the 'Name' column, as this does not add any valuable information for predicting the target.

In [None]:
target = churn_ohe['Churned']
features = churn_ohe.drop(['Churned', 'Name'], axis=1)

<div class="alert alert-block alert-info">
<b>Old method. 80 / 20 split. Commenting out this next cell:</b> <a class="tocSkip"></a>

In [None]:
#features_train, features_valid, target_train, target_valid = train_test_split(
    #features, target, test_size=0.25, random_state=12345)

<div class="alert alert-block alert-info">
<b>New method. This next cell does a 60/40 split. This includes a validation set:</b> <a class="tocSkip"></a>

In [None]:
features_train, features_remaining, target_train, target_remaining = train_test_split(
    features, target, test_size=0.40, random_state=12345
)

<div class="alert alert-block alert-info">
<b>This next cell takes the remaining 40&#37; and splits it evenly into validation (20&#37;) and test (20&#37;).</b> <a class="tocSkip"></a>
</div>

In [None]:
#includes a validation set...
features_valid, features_test, target_valid, target_test = train_test_split(
    features_remaining, target_remaining, test_size=0.50, random_state=12345
)

<div class="alert alert-block alert-info">
<b>Summary:</b> 
<br>
- Training Set (60%): Used to train the model.
<br>
- Validation Set (20%): Used for hyperparameter tuning.
<br>
- Test Set (20%): Used for final evaluation of the model.  <a class="tocSkip"></a> 
</div>






<div class="alert alert-block alert-info">
<b>When a 3-way(Train, Validation, Test) split is necessary :</b> 
<br>
- Hyperparameter Tuning
<br>
- Avoiding Overfitting<a class="tocSkip"></a> 
</div>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Something needs to be changed, but don't worry, you've got this.
    
<s>We need to split data into train, val and test datasets if we want use hyperparameters tuning. Train + val is used for tuning and test for final metric. This scheme helps us avoid overfitting.
</div>

<div class="alert alert-block alert-info">
<b>Thank you for explaining that.</b> <a class="tocSkip"></a>

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Well done!
</div>

### <span style="color:red">Create the model</span>

- <span style="color:red">Logistic Regression model ... unbalanced:</span>

In [None]:
Logmodel = LogisticRegression(random_state=12345, solver='liblinear')

In [None]:
Logmodel.fit(features_train, target_train)

### Predict and evalute the Logistic Regression unbalanced model:

In [None]:
predicted_valid = Logmodel.predict(features_valid)

In [None]:
accuracy_valid = accuracy_score(target_valid, predicted_valid)

print(accuracy_valid)

#### 79\% is not great. 

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
- Using F1 score, instead of Accuracy. <a class="tocSkip"></a>

In [None]:
# Calculate the F1 Score for the validation set
print('F1 Score:', f1_score(target_valid, predicted_valid, zero_division=1))

#### 0\% is not good. 

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

<s>Our main metric is F1
</div>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
- The F1 score, which is the harmonic mean of precision and recall, provides a better measure of a model's accuracy on imbalanced datasets than the simple accuracy metric. <a class="tocSkip"></a>

### Sanity Check

In [None]:
target_pred_constant = pd.Series(0, index=target.index)

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
- I had to add in this next cell, because now that the data is split in 3, I was getting an error. However, now, both the dummy and the non-dummy give the same accuracy score.<a class="tocSkip"></a>

In [None]:
# Correct the length of target_pred_constant to match target_valid
target_pred_constant = target_pred_constant.loc[target_valid.index]

In [None]:
# Dummy Evaluation metrics
print('Dummy Accuracy:', accuracy_score(target_valid, target_pred_constant))
print('Dummy Precision:', precision_score(target_valid, target_pred_constant, zero_division=1))
print('Dummy Recall:', recall_score(target_valid, target_pred_constant, zero_division=1))
print('Dummy F1 Score:', f1_score(target_valid, target_pred_constant, zero_division=1))

print('\n')
print('Dummy Confusion Matrix:\n', confusion_matrix(target_valid, target_pred_constant))

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 2):</b>
<br>
- The 2 cells above have changed.  <a class="tocSkip"></a>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 3):</b>
<br>
-  So, this is the 3rd version where TT did not save the correct code. I have just added back in my ver1 notes, above, and added in the line of code that the TT site deleted. <a class="tocSkip"></a>

The results indicate that the output of the accuracy_score for the model (0.791) is not better than the accuracy score for a dummy classifier that always predicts the majority class. This implies that the dataset is imbalanced, with one class being significantly more frequent than the other. In this case, accuracy is not a reliable metric for model performance.

- The model did not predict any instances of the minority class. 
- The confusion matrix shows no true positives or false positives for class 1.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good job making sanity check!
</div>

### Things to consider when choosing which threshold:

<b>Recall:</b>
- If missing a positive case is very costly (e.g., medical diagnoses, fraud detection), prioritize recall.

<b>Precision: </b>
- If falsely identifying a positive is costly (e.g., spam detection), prioritize precision.

<b>F1 Score: </b>
- If you need a balance between precision and recall, prioritize the F1 score.

<b>Accuracy: </b>
- General cases with balanced datasets where both classes are equally important.


#### Considering this, I think that F1 would be the best to prioritize.

- <span style="color:red">Logistic Regression model ... balanced:</span>

In [None]:
Logmodel2 = LogisticRegression(solver='liblinear', class_weight='balanced', random_state=12345)
Logmodel2.fit(features_train, target_train)
predicted_valid = Logmodel2.predict(features_valid)

In [None]:
probabilities_valid = Logmodel2.predict_proba(features_valid)[:, 1]
print(probabilities_valid)

#### Try new thresholds

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
- Trying new thresholds, using a loop. <a class="tocSkip"></a>

In [None]:
# Define a range of thresholds to evaluate
thresholds = np.arange(0.1, 0.9, 0.1)

In [None]:
for threshold in thresholds:
    predicted_valid_threshold = (probabilities_valid >= threshold).astype(int)
    
    # ANSI escape sequences for blue text and reset
    BLUE = '\033[94m'
    RESET = '\033[0m'

    print(f'{BLUE}Threshold: {threshold}{RESET}')
    print('\n')
    
    print('Accuracy:', accuracy_score(target_valid, predicted_valid_threshold))
    print('Precision:', precision_score(target_valid, predicted_valid_threshold, zero_division=1))
    print('Recall:', recall_score(target_valid, predicted_valid_threshold, zero_division=1))
    print('F1 Score:', f1_score(target_valid, predicted_valid_threshold, zero_division=1))
    
    print('\n')
    print('Confusion Matrix:\n', confusion_matrix(target_valid, predicted_valid_threshold))
    print('\n')
    print('\n')

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
- Scroll to see the results. <a class="tocSkip"></a>

### As threshold goes up, False Negatives go up (Confusion Matrix: low left corner.)

#### I don't think that Logistic Regression is the best model to use to make predictions for this dataset. I cannot get a good F1 score. I have tried different thresholds.

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 3):</b>
<br>
- Above loop section was fixed 2 versions ago. But, TT site did not save my code nor did it save my import numpy as np, nor did it save my ver1 comments in the section just above. I am pasting my comments back in, and pasting in the corrected code again, again. <a class="tocSkip"></a>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Pro tip: we can check more threshold using loop (for range ...)
</div>

- <span style="color:red">Decision Tree model ... unbalanced</span>

In [None]:
Treemodel = DecisionTreeClassifier(random_state=12345)

In [None]:
Treemodel.fit(features_train, target_train)

In [None]:
predicted_valid = Treemodel.predict(features_valid)

In [None]:
print(confusion_matrix(target_valid, predicted_valid))

### Explanation of Confusion Matrix:

- TN in the upper-left corner
- TP in the lower right corner

- FP in the upper right corner
- FN in the lower-left corner

#### TP, in lower right corner of the confusion matrix, at 258, is very low.

### Balance the data

- <span style="color:red">Inspect the Target ('Churned' column)</span>

In [None]:
class_counts = churn_ohe['Churned'].value_counts()
class_proportions = class_counts / len(churn_ohe)
print(class_proportions)

#### It is imbalanced. 80\% remained as customers. 20\% have churned.

- <span style="color:red">Decision Tree model ... balanced</span>

#### Try different tree depths

In [None]:
for depth in range(1, 7):
    Treemodel2 = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    Treemodel2.fit(features_train, target_train)
    predictions_valid = Treemodel2.predict(features_valid)
    
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))
    
    precision = precision_score(target_valid, predictions_valid, zero_division=1)
    recall = recall_score(target_valid, predictions_valid, zero_division=1)
    f1 = f1_score(target_valid, predictions_valid, zero_division=1)
    conf_matrix = confusion_matrix(target_valid, predictions_valid)


In [None]:
    print("Precision:")
    print(precision)
    print('\n')
    
    print("Recall:")
    print(recall)
    print('\n')
    
    print("F1:")
    print(f1)
    print('\n')
    
    print("Confusion matrix:")
    print(conf_matrix)

#### F1 is still low.

- <span style="color:red">Random Forest model ... unbalanced</span>

In [None]:
best_score = 0
best_est = 0

In [None]:
for est in range(1, 11):
    RFmodel = RandomForestClassifier(random_state=54321, n_estimators=est)
    RFmodel.fit(features_train, target_train)
    
    predictions_valid = RFmodel.predict(features_valid)
    
    f1 = f1_score(target_valid, predictions_valid, zero_division=1)
    
    if f1 > best_score:
        best_score = f1
        best_est = est

In [None]:
print(f'Best F1 Score on Validation Set: {best_score}')
print(f'Best n_estimators: {best_est}')

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 3):</b>
<br>
- Again, TT site deleted my saved code. I keep doing File / Save and Checkpoint, every few minutes, and I see the message that says, "Save at timestamp", but it does not save. I'm pasting in the corrected code again, again. <a class="tocSkip"></a>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
Now using only features_valid for this line (above):
  predictions_valid = RFmodel.predict(features_valid) <a class="tocSkip"></a>

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

We need to use train+val for tuning hyperparameters (est here) and check best model on test dataset.
</div>

- <span style="color:red">Random Forest model ... balanced</span>

In [None]:
# Combine the feature and target training sets into a single DataFrame
train_data = pd.concat([features_train, target_train], axis=1)

In [None]:
# Separate the majority and minority classes
majority_class = train_data[train_data['Churned'] == 0]
minority_class = train_data[train_data['Churned'] == 1]

In [None]:
# Downsample the majority class
majority_downsampled = resample(majority_class,
                                replace=False,    
                                n_samples=len(minority_class),  # match minority class size
                                random_state=123)  

In [None]:
# Combine the minority class with the downsampled majority class
downsampled_train_data = pd.concat([majority_downsampled, minority_class])

<div class="alert alert-block alert-danger">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Something needs to be fixed =)
</div>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
I think that there is a bug, on the TT site, where some cells did not save. I have just done a cut-n-paste, from my local ver 1, and error is gone. <a class="tocSkip"></a>

### balance the data using downsampling

In [None]:
# Separate features and target variable from the downsampled training set
features_resampled = downsampled_train_data.drop('Churned', axis=1)
target_resampled = downsampled_train_data['Churned']

In [None]:
# Check balanced class distribution
print("Class distribution after downsampling:")
print(target_resampled.value_counts())

#### Data looks balanced now.

### <span style="color:red">Final model</span>

### Build a new Random Forest model.

In [None]:
# Train and evaluate the Random Forest model
rf_clf = RandomForestClassifier(random_state=54321, n_estimators=100)

In [None]:
# Train the model on the resampled dataset
rf_clf.fit(features_resampled, target_resampled)

In [None]:
# Predict probabilities on the validation set
probabilities_valid = rf_clf.predict_proba(features_valid)[:, 1]

#### Use different threshold.

In [None]:
# Predict using a new threshold
new_threshold = 0.6
predicted_valid_threshold = (probabilities_valid >= new_threshold).astype(int)

In [None]:
# Evaluate the RF model
print('Accuracy:', accuracy_score(target_valid, predicted_valid_threshold))
print('Precision:', precision_score(target_valid, predicted_valid_threshold, zero_division=1))
print('Recall:', recall_score(target_valid, predicted_valid_threshold, zero_division=1))
print('F1 Score:', f1_score(target_valid, predicted_valid_threshold, zero_division=1))
print('\n')
print('Confusion Matrix:\n', confusion_matrix(target_valid, predicted_valid_threshold))

In [None]:
#### F1 score is now 60.1\%. Accuracy is now 81.8\%.

- <span style="color:red">Find AUC-ROC:</span>

In [None]:
# Calculate the AUC-ROC score
auc_roc = roc_auc_score(target_valid, probabilities_valid)

print('AUC-ROC Score:', auc_roc)

#### AUC ROC of .855 is very good.

### Explanation of Auc-Roc score:
- 0.5 indicates a model that makes random predictions.
- 1.0 indicates a perfect model.
- Greater than 0.8 is often considered indicative of a strong model.
- Greater than 0.9 is excellent.

In [None]:
# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(target_valid, probabilities_valid)
fig = px.area(
    x=fpr, y=tpr,
    title=f'Receiver Operating Characteristic (ROC) Curve (AUC = {auc_roc:.2f})',
    labels=dict(x='False Positive Rate', y='True Positive Rate'),
    width=800, height=600
)

fig.add_shape(
    type='line',
    line=dict(dash='dash'),
    x0=0, x1=1, y0=0, y1=1
)

fig.update_layout(
    title={
        'text': f'Receiver Operating Characteristic (ROC) Curve (AUC = {auc_roc:.2f})',
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 20}  # Adjust the size as needed
    }
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()

## <div style="color: red; border: 2px solid yellow; display: inline-block;">Conclusion:</div>

## Exploring the Data:

<b>Credit Score</b>
- The highest churn rate is for customers with 'very poor' credit (349-424), at 47.1\% churn rate.
- The lowest churn rate is for customers with 'very good' credit (650-724), at 18.5\% churn rate.

<b>Geography</b>
- There are 3 regions in this dataset: France, Germany, Spain.
- Most customers are from France.
- The highest churn rate is from Germany, at 48\%.

<b>Gender</b>
- Most customers are male. The churn rate, for males, is 16.45\%.
- Most customers who churn are female, at 25\%

<b>Age</b>
- The minimum age of customers is 18. The maximum is 92.
- The lowest churn rate comes from the 23-48 age group, with the very lowest churn rate being ages 35-37 years old.

<b>Tenure</b>
- Customers tenure goes from 0-11 years.
- Most customers have a tenure of 5 years.
- Customers who have a tenure of 1 year, or less, have the highest churn rate.
- Customers who have a tenure of 7 years have the lowest churn rate.

<b>Bank Balance</b>
- Bank balances range from \$0 - 250,898.
- The average bank balance is \$76,485.
- The median churn rate, based on balance, is \$109,349.

<b>Number of Products</b>
- Most customers have only 1 product.
- The lowest churn rate is for customers with 2 products.
- The churn rate for customers with 3 products is 82\%.
- The highest churn rate is for customers with 4 products, at 100\%.

<b>Has Credit Card</b>
- 69.91\% of customers who have churned have a credit card.

<b>Is Active</b>
- 7\% of customers who have churned are still active.

<b>Salary</b>
- Salaries range from \$12 - 200,000.
- The majority of customers have a very good salary, in the range of \$130,000 - 175,000.
- The lowest churn rate is for customers with an "okay" salary of \$50,000 - 89,000.
- The highest churn rate is for customers with a "poor" salary of \$ 20,000 - 49,000.

<b>Supervised Learning</b>
- The best model was the Random Forest model, using data that was downsampled to balance the data.
- f1 was .616
- Accuracy was 82.1\%.

<b>AUC ROC</b>
- The AUC ROC scored 86\% which is very good.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Great final conclusion!
</div>

<div class="alert alert-block alert-info">
<b>Deb's comments (ver 1):</b>
<br>
TT site is not saving my code. I have pasted in my ver1 notes again, and all the missing code. So, to recap, nothing changed between this version and the last version, except pasting in my code again that the TT site just repeatedly keeps deleting.<a class="tocSkip"></a>