# Classification of a Beta Bank customer data set to predict whether customers stay with the service or leave

# Content <a id='back'></a>

* [Introduction](#intro)
* [step 1. Data review.](#data_review)
    * [First impressions](#data_review_conclusions)
* [Step 2. Data preprocessing](#data_preprocessing)
    * [2.1 Duplicate values and fill missing values](#duplicate_values)
* [Step 3. Data Analysis](#data_analysis)
    * [3.1 Segmentation of the source data into a training set, a validation set and a test set.](#segmentation)
    * [3.2 Logistic Regression whith imbalance data](#logistic)
    * [3.3 Data balancing using upsampled and downsampled](#balance)
    * [3.4 DecisionTreeClassifier with corrected data](#decisiontree)
    * [3.5 LogisticRegression with corrected data](#logistic_corrected)
    * [3.6 RandomForestClassifier with corrected data](#randomf)
    * [3.7 Test of the best model](#test_bestmodel)
    * [3.8 Extra analysis: Analyzed only with training and validation data](#extra_analysis)
* [Step 4. Step 4. Final Test](#final_test)
* [Conclusion](#end)

# Introduction <a id='intro'></a>

Beta Bank customers are leaving, little by little, every month. Bankers discovered that it is cheaper to save existing customers than to attract new ones.

We need to predict whether a customer will leave the bank soon. You have the data on the past behavior of clients and the termination of contracts with the bank.

## Step 1. Data review. <a id='data_review'></a>

In [6]:
# All libraries are loaded

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.utils import shuffle
from sklearn.metrics import roc_curve

### First impressions <a id='data_review_conclusions'></a>

In [9]:
# Import data

df = pd.read_csv('Churn.csv')

In [11]:
# The data frame information and a sample of the data are printed

display(df.head())
df.info()
print(df.isnull().sum())
df.describe()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Ge

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


**Observations**
1. There is only missing data in the variable 'Tenure' (909), due to the characteristics of the data they will be filled with the mean of 'Tenure'.
2. The data type is correct for the data set.
3. It seems everything is correct and it is posible to continue with the other steps.
4. The values returned by the describe() method are consistent

## Step 2. Data preprocessing <a id='data_preprocessing'></a>

### Duplicate values and fill missing values <a id='duplicate_values'></a>

In [16]:
# Verify missing values

df['Tenure'] = df['Tenure'].fillna(df['Tenure'].mean())

print(df.isnull().sum())

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64


In [18]:
# Verify duplicated data

print('Duplicated values in df:')
print(df[df.duplicated()])

Duplicated values in df:
Empty DataFrame
Columns: [RowNumber, CustomerId, Surname, CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary, Exited]
Index: []


**Observations**

1. There are no duplicate values in "df", the consistency of the data was ensured so we can continue with the next steps.
2. The missing values were filled
3. Although it has not been requested in this project, it would be good to standardize the data to have less differences and noise, this could surely improve the accuracy in the classification.

## Step 3. Data Analysis <a id='data_analysis'></a>

### Segmentation of the source data into a training set, a validation set and a test set. <a id='segmentation'></a>

In [23]:
# Tranform object variables to categorical variables avoinding the dummy problem

df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True)
numeric = ['CustomerId','CreditScore', 'Age', 'Tenure', 'Balance', 'EstimatedSalary']

In [25]:
# Split between features and objective

features = df.drop(['Exited', 'RowNumber', 'Surname', 'CustomerId'], axis=1)
target = df['Exited']

In [27]:
# Split the dataset in train, validation and test set (70% train, 15% validation, 15% test)
X_train, X_temp, y_train, y_temp = train_test_split(features, target, test_size=0.3, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=12345)

In [29]:
# Data scaling

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_valid = sc.fit_transform(X_valid)
X_test =sc.fit_transform(X_test)

### Logistic Regression whith imbalance data. <a id='logistic'></a>


In [32]:
# Initial training with logistic regression without imbalance correction

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(X_train, y_train)

# Evaluate on validation set

y_pred = model.predict(X_valid)
f1 = f1_score(y_valid, y_pred)
roc_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])

print("F1 Score (sin corrección):", f1)
print("AUC-ROC (sin corrección):", roc_auc)

F1 Score (sin corrección): 0.2995169082125604
AUC-ROC (sin corrección): 0.7757985643725204


**Observations**

1. The model was trained with logistic regression due to its benefits, it is shown that it is far from reaching the objective of 59% in the F1 score, it is necessary to apply methods to balance the classes.
2. Downsampled and upsampled methods will be used.
3. If the data were not standardized the scores would be a little lower.
4. 3 characteristics are eliminated from the data set due to the type of data and their null contribution to the analysis.
5. The AUC-ROC score is above 50%, this does not seem so bad at first glance; However, looking at the F1 score, it is concluded that more processing of the data is required.
6. It was separated into a proportion of 70% for training data, 15% data for validation and 15% for test data since they showed the best results and the best relationship between the amount of data to train, validate and test (it was tested with 60 /20/20 and with 80/10/10).

### Data balancing using upsampled and downsampled. <a id='balance'></a>


In [36]:
# A function is created to balance the data and create the a dataset downsampled and upsampled

def downsampled_or_upsampled(features, target, rep_or_frac):
    # class vectors are created for objective and features
    features = pd.DataFrame(features)
    target = pd.Series(target).reset_index(drop=True)
    # Class 0 and 1 are distinguished
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    # If the value is greater than 1, it is oversampling. 
    if rep_or_frac > 1:
        features_upsampled = pd.concat([features_zeros] + [features_ones] * rep_or_frac)
        target_upsampled = pd.concat([target_zeros] + [target_ones] * rep_or_frac)

        features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
        
        return features_upsampled, target_upsampled
    # Less than 1, it is downsampling.
    elif rep_or_frac < 1:
        features_downsampled = pd.concat([features_zeros.sample(frac=rep_or_frac, random_state=12345)]+ [features_ones])
        target_downsampled = pd.concat([target_zeros.sample(frac=rep_or_frac, random_state=12345)]+ [target_ones])

        features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

        return features_downsampled, target_downsampled
# the function is called
features_upsampled, target_upsampled = downsampled_or_upsampled(X_train, y_train, 3)
features_downsampled, target_downsampled = downsampled_or_upsampled(X_train, y_train, 0.3)
# Print the shape of the new datasets
print("The oversampled features have dimensions:", features_upsampled.shape)
print("The oversampled target has dimensions:", target_upsampled.shape)
print("The subsampled characteristics have dimensions:", features_downsampled.shape)
print("The subsampled target has dimensions:", target_downsampled.shape)

The oversampled features have dimensions: (9822, 11)
The oversampled target has dimensions: (9822,)
The subsampled characteristics have dimensions: (3088, 11)
The subsampled target has dimensions: (3088,)


### DecisionTreeClassifier with corrected data. <a id='decisiontree'></a>


In [39]:
# For loop to obtain the best hyperparameter of the decision tree
best_score = 0
best_est = 0
for depth in range(1, 30): 
    model = DecisionTreeClassifier(random_state=12345, max_depth = depth) 
    model.fit(features_upsampled, target_upsampled) 
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score = score
        best_depth = depth

print("The accuracy of the best model on the validation set (max_depth = {}): {}".format(best_est, best_score))

The accuracy of the best model on the validation set (max_depth = 0): 0.8313333333333334


In [40]:
# DecisionTreeClassifier with upsampled data

model_upsampled_decisiontree = DecisionTreeClassifier(random_state=12345, max_depth = 19)
model_upsampled_decisiontree.fit(features_upsampled, target_upsampled)

# DecisionTreeClassifier with downsampled data

model_downsampled_decisiontree = DecisionTreeClassifier(random_state=12345, max_depth = 19)
model_downsampled_decisiontree.fit(features_upsampled, target_upsampled)

In [41]:
# Evaluation on the validation set with upsampled data
y_pred_upsampled = model_upsampled_decisiontree.predict(X_valid)
f1_upsampled_decisiontree = f1_score(y_valid, y_pred_upsampled)
roc_auc_upsampled_decisiontree = roc_auc_score(y_valid, model_upsampled_decisiontree.predict_proba(X_valid)[:, 1])
print("Scores for DecisionTree with upsampled data")
print("")
print("F1 Score (upsampled):", f1_upsampled_decisiontree)
print("AUC-ROC (upsampled):", roc_auc_upsampled_decisiontree)

# Evaluation on the validation set with downsampled data
y_pred_downsampled = model_downsampled_decisiontree.predict(X_valid)
f1_downsampled_decisiontree = f1_score(y_valid, y_pred_downsampled)
roc_auc_downsampled_decisiontree = roc_auc_score(y_valid, model_downsampled_decisiontree.predict_proba(X_valid)[:, 1])
print("")
print("Scores for DecisionTree with downsampled data")
print("")
print("F1 Score (downsampled):", f1_downsampled_decisiontree)
print("AUC-ROC (downsampled):", roc_auc_downsampled_decisiontree)

Scores for DecisionTree with upsampled data

F1 Score (upsampled): 0.477124183006536
AUC-ROC (upsampled): 0.6726029416984527

Scores for DecisionTree with downsampled data

F1 Score (downsampled): 0.477124183006536
AUC-ROC (downsampled): 0.6726029416984527


### LogisticRegression with corrected data. <a id='logistic_corrected'></a>


In [43]:
# LogisticRegression with upsampled data

model_upsampled_logisticregression = LogisticRegression(random_state=12345)
model_upsampled_logisticregression.fit(features_upsampled, target_upsampled)

# LogisticRegression with downsampled data

model_downsampled_logisticregression = LogisticRegression(random_state=12345)
model_downsampled_logisticregression.fit(features_downsampled, target_downsampled)

# Evaluation on the validation set with upsampled data
y_pred_upsampled = model_upsampled_logisticregression.predict(X_valid)
f1_upsampled_logisticregression = f1_score(y_valid, y_pred_upsampled)
roc_auc_upsampled_logisticregression = roc_auc_score(y_valid, model_upsampled_logisticregression.predict_proba(X_valid)[:, 1])
print("Scores for LogisticRegression with upsampled data")
print("")
print("F1 Score (upsampled):", f1_upsampled_logisticregression)
print("AUC-ROC (upsampled):", roc_auc_upsampled_logisticregression)

# Evaluation on the validation set with downsampled data
y_pred_downsampled = model_downsampled_logisticregression.predict(X_valid)
f1_downsampled_logisticregression = f1_score(y_valid, y_pred_downsampled)
roc_auc_downsampled_logisticregression = roc_auc_score(y_valid, model_downsampled_logisticregression.predict_proba(X_valid)[:, 1])
print("")
print("Scores for LogisticRegression with downsampled data")
print("")
print("F1 Score (downsampled):", f1_downsampled_logisticregression)
print("AUC-ROC (downsampled):", roc_auc_downsampled_logisticregression)

Scores for LogisticRegression with upsampled data

F1 Score (upsampled): 0.5101763907734057
AUC-ROC (upsampled): 0.7792306369129367

Scores for LogisticRegression with downsampled data

F1 Score (downsampled): 0.5113924050632911
AUC-ROC (downsampled): 0.7802545249023212


### RandomForestClassifier with corrected data. <a id='randomf'></a>


In [None]:
# For loop to obtain the best hyperparameter of the RandomForestClassifier with upsampled
best_score = 0
best_est = 0
for est in range(1, 20): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model.fit(features_upsampled, target_upsampled)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("The accuracy of the best model on the validation set with upsampled (n_estimators = {}): {}".format(best_est, best_score))

# For loop to obtain the best hyperparameter of the RandomForestClassifier with downsampled
best_score = 0
best_est = 0
for est in range(1, 20): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model.fit(features_downsampled, target_downsampled) 
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score = score
        best_est = est

print("The accuracy of the best model on the validation set with downsampled (n_estimators = {}): {}".format(best_est, best_score))

In [None]:
# RandomForest with upsampled data
model_upsampled_RandomForest = RandomForestClassifier(random_state=12345, n_estimators=18, max_depth = 12)
model_upsampled_RandomForest.fit(features_upsampled, target_upsampled)

# RandomForest with downsampled data

model_downsampled_RandomForest = RandomForestClassifier(random_state=12345, n_estimators=12, max_depth = 19)
model_downsampled_RandomForest.fit(features_downsampled, target_downsampled)

# Evaluation on the validation set with upsampled data
y_pred_upsampled = model_upsampled_RandomForest.predict(X_valid)
f1_upsampled_RandomForest = f1_score(y_valid, y_pred_upsampled)
roc_auc_upsampled_RandomForest = roc_auc_score(y_valid, model_upsampled_RandomForest.predict_proba(X_valid)[:, 1])
print("Scores for RandomForest with upsampled data")
print("")
print("F1 Score (upsampled):", f1_upsampled_RandomForest)
print("AUC-ROC (upsampled):", roc_auc_upsampled_RandomForest)

# Evaluation on the validation set with downsampled data
y_pred_downsampled = model_downsampled_RandomForest.predict(X_valid)
f1_downsampled_RandomForest = f1_score(y_valid, y_pred_downsampled)
roc_auc_downsampled_RandomForest = roc_auc_score(y_valid, model_downsampled_RandomForest.predict_proba(X_valid)[:, 1])
print("")
print("Scores for RandomForest with downsampled data")
print("")
print("F1 Score (downsampled):", f1_downsampled_RandomForest)
print("AUC-ROC (downsampled):", roc_auc_downsampled_RandomForest)

### Test of the best model. <a id='test_bestmodel'></a>


In [None]:
# Evaluate the best model on the test set (we use RandomForest upsampling in this case)
y_pred_test = model_upsampled_RandomForest.predict(X_test)
f1_test = f1_score(y_test, y_pred_test)
roc_auc_test = roc_auc_score(y_test, model_upsampled_RandomForest.predict_proba(X_test)[:, 1])

print("F1 Score (test set):", f1_test)
print("AUC-ROC (test set):", roc_auc_test)

In [None]:
# We calculate the values for the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, model_upsampled_RandomForest.predict_proba(X_test)[:, 1])

plt.figure()

plt.plot(fpr, tpr)

# ROC curve for random model (looks like a straight line)
plt.plot([0, 1], [0, 1], linestyle='--')

plt.ylim([0.0, 1.0])
plt.xlim([0.0, 1.0])

plt.xlabel('False positive rate')
plt.ylabel('True positive rate')

plt.title('ROC curve')

plt.show()

**Observations**

1. The best model for this analysis was random forests with results of 60.01% for the data test in the F1 score and 84.55% in the region under the AUC-ROC curve. This result was achieved with n_estimators set to 18 and max_depth set to 12.
2. 3 decision tree models, logistic regression and random forests were trained.
3. The technique that worked best to balance the classes in a random forest was oversampling, for decision trees both techniques worked the same and for logistic regression subsampling was the one that gave the best results.
4. The oversampling was repeated 3 times for class 1 and divided by 3 for the undersampling in class 0.
5. For loops were used to get an idea of the values of hyperparameters in random forests and decision trees, it is not easy to find the optimal values, by testing the optimal values did not give the best results in the test, by making changes it was possible to find values that they went a little over the target.

### Extra analysis: Analyzed only with training and validation data. <a id='extra_analysis'></a>


In [25]:
# Split the dataset in train, validation and test set validation Dividir el dataset en conjuntos de entrenamiento, validación y prueba (75% entrenamiento, 25% validación)

X_train2, X_valid2, y_train2, y_valid2 = train_test_split(features, target, test_size=0.25, random_state=12345)

In [26]:
# Data scaling

X_train2 = sc.fit_transform(X_train2)
X_valid2 = sc.fit_transform(X_valid2)

In [27]:
# A function is created to balance the data and create the a dataset downsampled and upsampled

def downsampled_or_upsampled(features, target, rep_or_frac):
    features = pd.DataFrame(features)
    target = pd.Series(target).reset_index(drop=True)
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    
    if rep_or_frac > 1:
        features_upsampled = pd.concat([features_zeros] + [features_ones] * rep_or_frac)
        target_upsampled = pd.concat([target_zeros] + [target_ones] * rep_or_frac)

        features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
        
        return features_upsampled, target_upsampled
    elif rep_or_frac < 1:
        features_downsampled = pd.concat([features_zeros.sample(frac=rep_or_frac, random_state=12345)]+ [features_ones])
        target_downsampled = pd.concat([target_zeros.sample(frac=rep_or_frac, random_state=12345)]+ [target_ones])

        features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)

        return features_downsampled, target_downsampled

features_upsampled2, target_upsampled2 = downsampled_or_upsampled(X_train2, y_train2, 3)
features_downsampled2, target_downsampled2 = downsampled_or_upsampled(X_train2, y_train2, 0.3)

In [28]:
#For loop to obtain the best hyperparameter of the RandomForest
best_score = 0
best_est = 0
for est in range(1, 20): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model.fit(features_upsampled2, target_upsampled2) 
    score = model.score(X_valid2, y_valid2)
    if score > best_score:
        best_score = score
        best_est = est

print("The accuracy of the best model on the validation set with upFor loop to obtain the best hyperparameter of the decision treesampled (n_estimators = {}): {}".format(best_est, best_score))

best_score = 0
best_est = 0
for est in range(1, 20): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model.fit(features_downsampled2, target_downsampled2) 
    score = model.score(X_valid2, y_valid2)
    if score > best_score:
        best_score = score
        best_est = est

print("The accuracy of the best model on the validation set with downsampled (n_estimators = {}): {}".format(best_est, best_score))

The accuracy of the best model on the validation set with upFor loop to obtain the best hyperparameter of the decision treesampled (n_estimators = 16): 0.8496
The accuracy of the best model on the validation set with downsampled (n_estimators = 18): 0.8004


In [30]:
# RandomForest with downsampled data
model_upsampled_RandomForest2 = RandomForestClassifier(random_state=12345, n_estimators=18, max_depth = 12)
model_upsampled_RandomForest2.fit(features_upsampled2, target_upsampled2)

# RandomForest with downsampled data

model_downsampled_RandomForest2 = RandomForestClassifier(random_state=12345, n_estimators=12, max_depth = 19)
model_downsampled_RandomForest2.fit(features_downsampled2, target_downsampled2)

# Evaluación en el conjunto de validación with upsampled data
y_pred_upsampled2 = model_upsampled_RandomForest2.predict(X_valid2)
f1_upsampled_RandomForest2 = f1_score(y_valid2, y_pred_upsampled2)
roc_auc_upsampled_RandomForest2 = roc_auc_score(y_valid2, model_upsampled_RandomForest2.predict_proba(X_valid2)[:, 1])
print("Scores for RandomForest with upsampled data")
print("")
print("F1 Score (upsampled):", f1_upsampled_RandomForest2)
print("AUC-ROC (upsampled):", roc_auc_upsampled_RandomForest2)

# Evaluación en el conjunto de validación with downsampled data
y_pred_downsampled2 = model_downsampled_RandomForest2.predict(X_valid2)
f1_downsampled_RandomForest2 = f1_score(y_valid2, y_pred_downsampled2)
roc_auc_downsampled_RandomForest2 = roc_auc_score(y_valid2, model_downsampled_RandomForest2.predict_proba(X_valid2)[:, 1])
print("")
print("Scores for RandomForest with downsampled data")
print("")
print("F1 Score (downsampled):", f1_downsampled_RandomForest2)
print("AUC-ROC (downsampled):", roc_auc_downsampled_RandomForest2)

Scores for RandomForest with upsampled data

F1 Score (upsampled): 0.6268081002892959
AUC-ROC (upsampled): 0.8487655465981785

Scores for RandomForest with downsampled data

F1 Score (downsampled): 0.5826513911620295
AUC-ROC (downsampled): 0.8334493828922023


**Observations**

This extra analysis was carried out to see if in this data set the performance would be better by increasing the training and validation data, it can be seen that although the validation result is improved it is only by less than 3%. This shows how by increasing the training data, the model acquires better performance.


## Step 4. Final Test. <a id='final_test'></a>


In [32]:
# Create a constant model that always predicts the majority class (0)
y_pred_constant = pd.Series([0] * len(y_test))

# Evaluate the constant model
f1_constant = f1_score(y_test, y_pred_constant)
roc_auc_constant = roc_auc_score(y_test, y_pred_constant)

print("F1 Score (constant model):", f1_constant)
print("AUC-ROC (constant model):", roc_auc_constant)

F1 Score (constant model): 0.0
AUC-ROC (constant model): 0.5


## Conclusion. <a id='end'></a>


The model used in this project that showed the best performance was random forests, with an upsampled AUC score of 84.87%. This allows us to efficiently predict whether the customer is at risk of leaving Beta Bank. Decision trees also perform well, not far behind random forests, so they could also be an option.
It is better to use a sampling method on data with these characteristics.