# Bank Marketing Subscription Prediction Project

### Business Understanding

#### **Problem Statement** 
The goal of this project is to build a predictive model that estimates the likelihood of a client subscribing to a term deposit based on various customer features. The bank's marketing campaigns rely heavily on targeting customers who are more likely to subscribe to a term deposit. By applying machine learning techniques, the project seeks to improve the efficiency of these campaigns and increase the conversion rate.

#### **Stakeholders:**
- Executive Officers (CEO)
- Marketing Officers (CMO)
- Data Administrators (CDA)


The background information supplied with the data is shown in the output of the code below:

In [1]:
# Load names file
with open(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-names.txt", "r") as file:
    lines = file.readlines()
    for line in lines:
        print(line.strip())

1. Relevant Information:

The data is related with direct marketing campaigns of a banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required,
in order to access if the product (bank term deposit) would be (or not) subscribed.

There are two datasets:
1) bank-full.csv with all examples, ordered by date (from May 2008 to November 2010).
2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
The smallest dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).

2. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

3. Number of Attributes: 16 + output attribute.

4. Attribute information:

Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3 - mari

- Beyond what the names file above covers, there is important information it leaves out. There are four datasets in total:
  - bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed.
  - bank-additional.csv with 10% of the examples (4119), randomly selected from bank-additional-full.csv, and 20 inputs.
  - bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
  - bank.csv with 10% of the examples and 17 inputs, randomly selected from bank-full.csv (older version of this dataset with fewer inputs).
- This is a classification problem, which informs the flow of the project: the feature engineering, the evaluation metrics, and the models that are likely to work best.
- As the project progresses, **key insights** are noted below each task performed for better project flow.

#### **Key Metrics and Success Criteria**
1. Accuracy - The model should reach an accuracy score of 85% on balanced data. Good models are expected to have an accuracy score above 0.80 (80%).
2. Precision and recall threshold - The model should achieve precision and recall of at least 80%, so that its positive predictions are reliable and most subscribers are found.
3. Minimum F1 score - The F1 score should be at least 0.75. This balances the trade-off between precision and recall and indicates whether the model performs well even when the class distribution is imbalanced.
4. AUC-ROC score - This should be at least 0.85. A high AUC-ROC score indicates that the model is effective at distinguishing subscribers from non-subscribers.
5. Confusion matrix - The number of false negatives (FN) should be kept low so that most subscription cases are identified.
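
The success criteria above can be checked mechanically once a model has produced a confusion matrix. The sketch below is a minimal, library-free illustration: the function name `check_success_criteria`, the `THRESHOLDS` dictionary, and the example counts are all hypothetical, and AUC-ROC is left out because it needs predicted probabilities rather than counts.

```python
# Hypothetical thresholds taken from the success criteria listed above.
THRESHOLDS = {"accuracy": 0.80, "precision": 0.80, "recall": 0.80, "f1": 0.75}


def check_success_criteria(tp, fp, fn, tn, thresholds=THRESHOLDS):
    """Compute metrics for the positive class ("yes") from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    metrics = {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
    # Pair each metric with whether it meets its threshold.
    return {name: (value, value >= thresholds[name]) for name, value in metrics.items()}


# Made-up counts for illustration only (not results from this project).
report = check_success_criteria(tp=80, fp=20, fn=10, tn=890)
for name, (value, passed) in report.items():
    print(f"{name:<9} {value:.3f} {'meets' if passed else 'misses'} threshold")
```

Keeping `fn` as an explicit input makes the fifth criterion visible: lowering the false negatives raises recall directly, which is the quantity that matters when missed subscribers are costly.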
     
    
#### **Hypothesis**

#### Null Hypothesis

#### Alternative Hypothesis

#### Analytical Questions
    

### **Data Understanding**

#### **Importations**

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### **Load Datasets**

In [3]:
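# Load csv data
# NOTE: the last two names are swapped relative to the files: `bank` holds bank-full.csv (45211 rows)
# and `bank_full` holds bank.csv (4521 rows), as the shapes printed further down confirm.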
bank_additional_full = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-additional-full.csv", delimiter=";")
bank_additional = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-additional.csv", delimiter=";")
bank = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank-full.csv", delimiter=";")
bank_full = pd.read_csv(r"C:\Users\Admin\OneDrive\Desktop\Bank-Marketing-Subscription-Predictor\data\bank.csv", delimiter=";")

In [4]:
bank_additional_full.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
bank_additional.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,30,blue-collar,married,basic.9y,no,yes,no,cellular,may,fri,...,2,999,0,nonexistent,-1.8,92.893,-46.2,1.313,5099.1,no
1,39,services,single,high.school,no,no,no,telephone,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.855,5191.0,no
2,25,services,married,high.school,no,yes,no,telephone,jun,wed,...,1,999,0,nonexistent,1.4,94.465,-41.8,4.962,5228.1,no
3,38,services,married,basic.9y,no,unknown,unknown,telephone,jun,fri,...,3,999,0,nonexistent,1.4,94.465,-41.8,4.959,5228.1,no
4,47,admin.,married,university.degree,no,yes,no,cellular,nov,mon,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.191,5195.8,no


In [6]:
bank.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no


In [7]:
bank_full.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no


#### **Data Exploration**

In [8]:
shapes = f"""
Bank Additional Full:
{bank_additional_full.shape}
-----------------------------------------------------------------------------------------
Bank Additional:
{bank_additional.shape}
-----------------------------------------------------------------------------------------
Bank:
{bank.shape}
----------------------------------------------------------------------------------------- 
Bank Full:
{bank_full.shape}
-----------------------------------------------------------------------------------------
"""
print(shapes)



Bank Additional Full:
(41188, 21)
-----------------------------------------------------------------------------------------
Bank Additional:
(4119, 21)
-----------------------------------------------------------------------------------------
Bank:
(45211, 17)
----------------------------------------------------------------------------------------- 
Bank Full:
(4521, 17)
-----------------------------------------------------------------------------------------



In [9]:
# DataFrame.info() prints its report and returns None, so call it directly
# instead of interpolating it into an f-string.
separator = "-" * 89
for name, df in [
    ("Bank Additional Full", bank_additional_full),
    ("Bank Additional", bank_additional),
    ("Bank Full", bank_full),
    ("Bank", bank),
]:
    print(f"{name}:")
    df.info()
    print(separator)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [10]:
columns = f"""
Bank Additional Full:
{bank_additional_full.columns}
-----------------------------------------------------------------------------------------
Bank Additional:
{bank_additional.columns}
-----------------------------------------------------------------------------------------
Bank Full:
{bank_full.columns}
----------------------------------------------------------------------------------------- 
Bank :
{bank.columns}
-----------------------------------------------------------------------------------------
"""
print(columns)


Bank Additional Full:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
-----------------------------------------------------------------------------------------
Bank Additional:
Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')
-----------------------------------------------------------------------------------------
Bank Full:
Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutco

#### **Merge the Train Datasets**

In [11]:
# Combine DataFrames
Bank_anaysis_data = pd.concat([**, **], ignore_index=True)

***.head()

SyntaxError: invalid syntax (260746689.py, line 2)

In [None]:
# data consistency and analysis

***.replace(True, 'Yes', inplace=True)
***.replace(False, 'No', inplace=True)

***.head()


In [None]:
# Change *** to float 

***['column**'] = pd.to_numeric(***['column**'], errors='coerce')

In [None]:
# Checking for duplicates 
****.duplicated().sum() 

In [None]:
# Missing values with their percentages 
***.isnull().sum().to_frame('Null Count').assign(Percentage=lambda x: (x['Null Count'] / len(churn_prime)) * 100)

In [None]:
# Unique values in each column

for column in ***.columns:
    print(f'{column}')
    print(f'There are {***[column].unique().size} unique values')
    print(f'These are {***[column].unique()}')
    print('=' * 50)

In [None]:
#Statistical  Analysis of numeric values

***.describe().T

In [None]:
# Overview Analysis of categorical columns 

***.describe(include= 'object').T

#### **EDA**  (Exploratory Data Analysis)

#### **1. Numerical Columns EDA**

#### **Univariate Analysis**

In [None]:
# Distribution of Numerical Feature
***.hist(figsize= (14,10),grid=False, color='skyblue')
plt.show()

In [None]:
plt.figure(figsize=(14,10))
sns.kdeplot(***['column**'])
plt.grid(False)
plt.show()

#### Checking for Outliers

In [None]:
# Create a box plot for multiple columns
plt.figure(figsize=(10, 6))
sns.boxplot(data=churn_prime[['tenure', 'MonthlyCharges']], whis=1.5)

# Add titles and labels
plt.title('Box Plot of tenure, Monthly Charges')
plt.xlabel('Variables')
plt.ylabel('Distribution')

plt.grid(False)

# Display the plot
plt.show()


#### **Bivariate Analysis**

In [None]:
plt.figure(figsize=(10, 6))
sns.violinplot(x='Contract', y='MonthlyCharges', data=churn_prime, palette='muted')
plt.title('Monthly Charges by Contract Type')
plt.xlabel('Contract Type')
plt.ylabel('Monthly Charges')
plt.show()


In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Churn', y='MonthlyCharges', data=churn_prime, palette='muted')
plt.title('Monthly Charges by Churn Customers')
plt.xlabel('Churn')
plt.ylabel('Monthly Charges')
plt.show()


#### **Multivariate Analysis**

In [None]:
numerical_columns = ['SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges']

numeric_df = churn_prime.select_dtypes(include=[np.number])
corr_matrix = numeric_df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:
sns.pairplot(churn_prime[['MonthlyCharges', 'TotalCharges', 'tenure', 'Churn']], hue='Churn')
plt.show()

#### **2. Categorical Columns EDA**

#### **Distribution and Counts for Categorical variables**

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a count plot for the 'Churn' column
plt.figure(figsize=(10, 6))
sns.countplot(data=churn_prime, x='Churn', order=churn_prime['Churn'].value_counts().index)

# Add titles and labels
plt.title('Churn Count')
plt.xlabel('Churn')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

- The Churn column is imbalanced, with more 'No' than 'Yes' values. This can bias a model towards the majority class during training. Consider techniques such as SMOTE (Synthetic Minority Over-sampling Technique) or class weights to balance the dataset during model training.
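
As an illustration of the class-weight option mentioned above, the sketch below computes "balanced" weights by hand using the heuristic `n_samples / (n_classes * class_count)`, which is the formula scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the 75/25 label split are made up for the example and are not taken from this dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label distribution for illustration only.
labels = ["No"] * 75 + ["Yes"] * 25
weights = balanced_class_weights(labels)
print(weights)
```

The minority class receives the larger weight, so each of its misclassifications costs more during training. Unlike SMOTE, this approach adds no synthetic rows to the training set.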

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a bar chart for the 'Contract' column
plt.figure(figsize=(10, 6))
sns.countplot(data=churn_prime, x='Contract', order=churn_prime['Contract'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of Contract Types')
plt.xlabel('Contract Type')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

* For the Contract column, which is the focus of our hypothesis, the bar plot shows that most customers are on a month-to-month contract.

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a count plot for the 'InternetService' column
plt.figure(figsize=(10, 6))
sns.countplot(data=churn_prime, x='InternetService', order=churn_prime['InternetService'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of InternetService Distribution')
plt.xlabel('InternetService')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

In [None]:
# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create a count plot for the 'PaymentMethod' column
plt.figure(figsize=(10, 6))
sns.countplot(data=churn_prime, x='PaymentMethod', order=churn_prime['PaymentMethod'].value_counts().index)

# Add titles and labels
plt.title('Bar Chart of PaymentMethod Distribution')
plt.xlabel('PaymentMethod')
plt.ylabel('Frequency')

plt.grid(False)

# Display the plot
plt.show()

#### **Hypothesis Testing**

In [None]:
from scipy.stats import chi2_contingency
from sklearn.impute import SimpleImputer

# Create a copy of the original DataFrame
df_train_chi = churn_prime.copy()

# Drop the row with the unknown value from the Churn column
df_train_chi.drop(index=2988, inplace=True)
df_train_chi.reset_index(drop=True, inplace=True)

# Drop 'customerID' column as it is not needed for analysis
df_train_chi.drop(columns=['customerID'], inplace=True)

# Convert Churn to binary
df_train_chi['Churn'] = df_train_chi['Churn'].map({'Yes': 1, 'No': 0})

# Replace invalid TotalCharges with NaN
df_train_chi['TotalCharges'] = pd.to_numeric(df_train_chi['TotalCharges'], errors='coerce')

# Define numerical and categorical columns
num_columns = df_train_chi.select_dtypes(include=['number']).columns
cat_columns = df_train_chi.select_dtypes(include=['object']).columns

# Impute missing values for numerical columns
imputer_num = SimpleImputer(strategy='median')
df_train_chi[num_columns] = imputer_num.fit_transform(df_train_chi[num_columns])

# Impute missing values for categorical columns
imputer_cat = SimpleImputer(strategy='most_frequent')
df_train_chi[cat_columns] = imputer_cat.fit_transform(df_train_chi[cat_columns])

# Create contingency table for Churn and Contract
contingency_table = pd.crosstab(df_train_chi['Churn'], df_train_chi['Contract'])

# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output the results
print("Chi-Square Test")
print("----------------")
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")
print(f"Degrees of Freedom: {dof}")

# Interpret the result based on the p-value
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis: This means there is a significant difference in churn rates among customers with different contract types.")
else:
    print("Fail to reject the null hypothesis: This means there is no significant difference in churn rates among customers with different contract types.")


- The chi-square test was utilized to examine whether there are significant variations in churn rates based on different contract types within the Telco dataset
- With a chosen significance level (alpha) of 0.05, the extremely low p-value (3.62e-192) obtained from the test indicates a robust rejection of the null hypothesis.
- Consequently, we reject the null hypothesis that there is no significant difference in churn rates across various contract types.
- This statistical finding provides compelling evidence that contract type plays a critical role in influencing churn rates among Telco customers.
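
To make the mechanics of the chi-square test of independence concrete, here is a small library-free sketch on a made-up 2 x 3 table. The function name `chi_square_independence` and the counts are hypothetical and unrelated to the result quoted above. The closed-form p-value is valid only because this table has exactly 2 degrees of freedom; in general a routine such as `scipy.stats.chi2_contingency` would be used.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Made-up counts: rows are churn (No, Yes), columns are three contract types.
toy_table = [[30, 40, 50],
             [30, 20, 10]]
statistic, dof = chi_square_independence(toy_table)

# For exactly 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-statistic / 2) if dof == 2 else None
print(statistic, dof, p_value)
```

A small p-value only says that churn and contract type are associated. It does not measure how strong the association is, so an effect size is a useful companion when reporting the result.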

#### **Answering Analytical Questions**

### **Data Preparation**

#### Handling missing values

In [None]:
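# Assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas
churn_prime['TotalCharges'] = churn_prime['TotalCharges'].fillna(churn_prime['TotalCharges'].median())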


In [None]:
# Categorical columns with missing values
miss_categ = ['MultipleLines', 'OnlineSecurity', 'OnlineBackup',
              'DeviceProtection', 'TechSupport', 'StreamingTV',
              'StreamingMovies', 'Churn']

# Fill each column with its most frequent value (mode)
for col in miss_categ:
    mode_val = churn_prime[col].mode()[0]
    churn_prime[col] = churn_prime[col].fillna(mode_val)

In [None]:
# Export churn_prime to CSV for Power BI visualisation before further modeling

churn_prime.to_csv('churn_prime.csv', index= False)

##### Drop the customerID column: it has no statistical or computational significance, and its many unique values would affect the encoding process

In [None]:
churn_prime = churn_prime.drop('customerID', axis=1)

In [None]:
churn_prime.isnull().sum()

#### Split data into X and y (input and output variables)

In [None]:
# Input variables

X= churn_prime.drop ('Churn', axis= 1)
X.head()

In [None]:
# Output variable / target variable 
y= churn_prime['Churn']
y.value_counts()

- The dataset exhibits a significant class imbalance: instances labeled "No" (non-churn) outnumber instances labeled "Yes" (churn) by a considerable margin. Addressing this imbalance is crucial, as it can hinder the model's ability to predict the minority class, which in this case is "Yes" (churn).

In [None]:
(X.shape, y.shape) 

#### Split data to categorical and numerical columns

In [None]:
numerical_columns= X.select_dtypes('number').columns
numerical_columns 

In [None]:
categorical_columns= X.select_dtypes('object').columns
categorical_columns

In [None]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Print the shapes of the resulting datasets
(X_train.shape, y_train.shape), (X_test.shape, y_test.shape)

In [None]:
from sklearn.preprocessing import LabelEncoder

# Use a label encoder for y because it is a 1-dimensional array of class labels

encoder = LabelEncoder()

# Fit the encoder to the target variable
y_train_encoded= encoder.fit_transform(y_train)
y_test_encoded= encoder.transform(y_test)


In [None]:
# Check skewness to determine which scaler to use 
X.select_dtypes('number').skew()

- Decision
  - Standard scaler is ruled out because our data is nowhere near bell-shaped (symmetrically distributed).
  - MinMax scaler is ruled out because our data has outliers.
  - We use Robust Scaler because it is less sensitive to the skew and outliers in X train.

In [None]:
X.describe().T

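- We use Quantile Transformer to reshape the skewed features. With `output_distribution='normal'` it maps the data to an approximately bell-shaped distribution, which is symmetric so that mean, median and mode coincide; with the default settings the output is uniform instead.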
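
The idea behind a quantile transformation to a normal shape can be sketched with the standard library alone: rank the values, turn ranks into probabilities, and pass them through the inverse normal CDF. This is a simplified stand-in, not scikit-learn's `QuantileTransformer` (which estimates quantiles and interpolates between them). The helper names `skewness` and `rank_to_normal` and the data are made up, and ties are not handled.

```python
from statistics import NormalDist, fmean


def skewness(values):
    """Population skewness: third central moment divided by the cubed standard deviation."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])
    m3 = fmean([(v - mean) ** 3 for v in values])
    return m3 / m2 ** 1.5


def rank_to_normal(values):
    """Map each value to a standard-normal quantile based on its rank (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    transformed = [0.0] * n
    for rank, index in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly between 0 and 1.
        transformed[index] = NormalDist().inv_cdf((rank + 0.5) / n)
    return transformed


# Made-up, strongly right-skewed feature for illustration only.
raw = [2 ** k for k in range(10)]
transformed = rank_to_normal(raw)
print(f"skew before: {skewness(raw):.2f}, after: {skewness(transformed):.2f}")
```

Because the mapping depends only on ranks, extreme values are pulled in towards the rest of the data, which is why this kind of transform is robust to outliers.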

#### *Pipeline*

In [None]:
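from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer, RobustScaler

numeric_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', RobustScaler()),
    # NOTE: the default output_distribution is 'uniform'; pass output_distribution='normal' for a bell-shaped output
    ('QuantileTransformation', QuantileTransformer()),
])

categorical_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' avoids errors on categories not seen during fit
    ('encoder', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(transformers=[
    ('num_pipeline', numeric_pipeline, numerical_columns),
    ('cat_pipeline', categorical_pipeline, categorical_columns),
])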

In [None]:
preprocessor

#### **Modeling & Evaluation**

#### Train on unbalanced data 

In [None]:
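from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Define the models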
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC(probability=True, random_state=42)),
    ('GBM', GradientBoostingClassifier(random_state=42)),
    ('Neural Network', MLPClassifier(random_state=42))
]


# Arrays to store individual model predictions and their probabilities
model_predictions = {}
model_probabilities = {}

# Store confusion matrices for each model
confusion_matrices = {}

for model_name, classifier in models:
    # Define the pipeline with the classifier
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', classifier)
    ])

    # Fit the pipeline on training data
    pipeline.fit(X_train, y_train_encoded)

    # Predict on test data
    y_pred = pipeline.predict(X_test)
    y_prob = pipeline.predict_proba(X_test)[:, 1]

    # Store predictions and probabilities
    model_predictions[model_name] = y_pred
    model_probabilities[model_name] = y_prob

    # Store confusion matrix
    cm = confusion_matrix(y_test_encoded, y_pred)
    confusion_matrices[model_name] = cm

    # Evaluate model performance with classification report
    print(model_name)
    print(classification_report(y_test_encoded, y_pred, target_names=['No', 'Yes']))  # Add target_names for class labels
    print('=' * 50)

    # Calculate ROC AUC score
    roc_auc = roc_auc_score(y_test_encoded, y_prob)

    # Print ROC AUC score
    print(f'ROC AUC Score: {roc_auc:.4f}')
    print('=' * 50)

 

- From the models' performance we can deduce that class imbalance is skewing the performance metrics towards the majority class ("No").
- Moving forward, we will address class imbalance with techniques like SMOTE and fine-tune the models to improve F1-scores, particularly for predicting churn instances.

#### Visualising the confusion matrix and AUC ROC Curve for our Imbalanced dataset

In [None]:
# Convert confusion matrices to DataFrame
df_scores = pd.DataFrame.from_dict({model_name: [conf_matrix] for model_name, conf_matrix in confusion_matrices.items()}, orient='index', columns=['confusion_matrix'])
df_scores 

In [None]:
def plot_confusion_matrices(df_scores, figsize=(15, 8), ncols=3):
    nrows = int(np.ceil(len(df_scores) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize)
    axes = axes.flatten()
    
    for i, (model_name, row) in enumerate(df_scores.iterrows()):
        conf_matrix = row['confusion_matrix']
        ax = axes[i]
        sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                    xticklabels=['Not Churn', 'Churn'], yticklabels=['Not Churn', 'Churn'], ax=ax)
        ax.set_xlabel('Predicted labels')
        ax.set_ylabel('True labels')
        ax.set_title(f'Confusion Matrix - {model_name}')
    
    plt.tight_layout()
    plt.show()

plot_confusion_matrices(df_scores)

In [None]:
from sklearn.metrics import auc, roc_curve

# Plot ROC AUC curve for all models
plt.figure(figsize=(10, 8))

# Iterate over each model's probabilities and plot ROC curve
for model_name, y_prob in model_probabilities.items():
    # Compute ROC curve
    fpr, tpr, _ = roc_curve(y_test_encoded, y_prob)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.2f})')

# Plot random guessing line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

# Set plot properties
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(False)
plt.show()

#### Train on balanced data 

In [None]:
# Define the models
models = [
    ('Logistic Regression', LogisticRegression(random_state=42)),
    ('Random Forest', RandomForestClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('SVM', SVC(probability=True, random_state=42)),
    ('GBM', GradientBoostingClassifier(random_state=42)),
    ('Neural Network', MLPClassifier(random_state=42))
]

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

balanced_table = pd.DataFrame(columns=['Model', 'Accuracy', 'Precision', 'Recall', 'F1_Score'])
balanced_pipeline = {}
 
for model_name, classifier in models:
   
    pipeline = imbPipeline(steps=[
        ('preprocessor', preprocessor),
        ('OverSampler', SMOTE(random_state=42)),
        ('classifier', classifier)
    ])
 
    pipeline.fit(X_train,y_train_encoded)
   
    balanced_pipeline [model_name]= pipeline
 
    y_pred = pipeline.predict(X_test)
 
   
    balanced_metrics= classification_report(y_test_encoded, y_pred, output_dict=True)
 
    accuracy= balanced_metrics['accuracy']
    precision = balanced_metrics['weighted avg']['precision']
    recall = balanced_metrics['weighted avg']['recall']
    f1 = balanced_metrics['weighted avg']['f1-score']
 
    balanced_table.loc[len(balanced_table)]= [model_name, accuracy, precision, recall,f1]
 
balanced_table.sort_values(by='F1_Score')

- After balancing the data we notice an improvement in the F1 scores of the models, with all models meeting the threshold criterion of 0.75 apart from KNN.
- The best performing are Random Forest and GBM, with F1 scores of 0.791 and 0.790 respectively.
- Recall matches accuracy for every model because the table reports weighted-average recall, which is always equal to accuracy. It therefore says nothing about how well SMOTE balanced the classes; the per-class recall for "Yes" in the classification report is the figure to check.
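
One point about the table above is worth verifying: it reports the weighted-average recall, and that quantity equals accuracy by construction, whatever the class balance. The sketch below checks this on made-up labels; the function name `accuracy_and_weighted_recall` and the data are hypothetical.

```python
def accuracy_and_weighted_recall(y_true, y_pred):
    """Return (accuracy, support-weighted average of per-class recall)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    weighted_recall = 0.0
    for cls in set(y_true):
        support = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        # Class recall (hits / support) weighted by its share of samples (support / n).
        weighted_recall += (support / n) * (hits / support)
    return accuracy, weighted_recall


# Made-up imbalanced labels for illustration only.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
accuracy, weighted_recall = accuracy_and_weighted_recall(y_true, y_pred)
print(accuracy, weighted_recall)
```

The support terms cancel, leaving total correct predictions divided by total samples. To judge how well the minority class is recovered, the recall of the positive class alone is the more informative number.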

In [None]:
# View balanced data pipelines 
balanced_pipeline ['Random Forest']

In [None]:
def plot_confusion_matrices(balanced_pipeline, X_test, y_test_encoded, figsize=(15, 8), ncols=3):
    nrows = int(np.ceil(len(balanced_pipeline) / ncols))
    fig, axes = plt.subplots(nrows, ncols, figsize=figsize)
    axes = axes.flatten()
    
    for i, (model_name, pipeline) in enumerate(balanced_pipeline.items()):
        y_pred = pipeline.predict(X_test)
        conf_matrix = confusion_matrix(y_test_encoded, y_pred)
        
        ax = axes[i]
        sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
                    xticklabels=['Not Churn', 'Churn'], yticklabels=['Not Churn', 'Churn'], ax=ax)
        ax.set_xlabel('Predicted labels')
        ax.set_ylabel('True labels')
        ax.set_title(f'Confusion Matrix - {model_name}')
    
    # Remove any unused subplots
    for j in range(i+1, len(axes)):
        fig.delaxes(axes[j])
    
    plt.tight_layout()
    plt.show()

plot_confusion_matrices(balanced_pipeline, X_test, y_test_encoded)

In [None]:
# Plot ROC AUC Curve for balanced pipeline
def plot_roc_auc_curves(balanced_pipeline, X_test, y_test_encoded, figsize=(10, 8)):
    plt.figure(figsize=figsize)

    # Iterate over each model in the balanced pipeline
    for model_name, pipeline in balanced_pipeline.items():
        # Get predicted probabilities
        y_prob = pipeline.predict_proba(X_test)[:, 1]
        
        # Compute ROC curve
        fpr, tpr, _ = roc_curve(y_test_encoded, y_prob)
        roc_auc = auc(fpr, tpr)
        
        # Plot ROC curve
        plt.plot(fpr, tpr, lw=2, label=f'{model_name} (AUC = {roc_auc:.2f})')

    # Plot random guessing line
    plt.plot([0, 1], [0, 1], color='gray', linestyle='--')

    # Set plot properties
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic (ROC) Curve')
    plt.legend(loc="lower right")
    plt.grid(False)
    plt.show()

plot_roc_auc_curves(balanced_pipeline, X_test, y_test_encoded)

**Summary of Findings:**

**Model Performance Metrics:**

- Accuracy: the models achieve accuracy scores ranging from 0.686 to 0.791 on balanced data.
  - The highest performing is Random Forest with 0.790, but our target was >0.80.

- Precision: measures how many of the predicted positive instances (churn) are actually positive. The values reported here are weighted averages across both classes.
  - Logistic Regression and GBM show the highest precision scores, around 0.808 and 0.802 respectively, meeting the set threshold of >0.80.

- Recall: reflects how many of the actual positive instances (churn) were predicted correctly. The values reported here are weighted averages across both classes.
  - Random Forest achieves the highest recall score at 0.790, followed by GBM at 0.784. Neither meets the threshold of >0.80.

- F1-score: balances the trade-off between precision and recall, providing a single metric to evaluate model performance.
  - Random Forest achieves the highest F1-score of 0.791, closely followed by GBM at 0.790, both meeting the threshold of at least 0.75.

**Model Comparison and Recommendations**

Random Forest and GBM consistently perform well across all metrics (accuracy, precision, recall, and F1-score). They are particularly robust in maintaining high F1-scores, suggesting effective balance between identifying churn cases and minimizing false positives.

Logistic Regression and SVM also demonstrate strong performance with high precision scores, making them reliable choices for applications where precision in predicting churn is critical.

Neural Network shows competitive performance but slightly lower precision compared to other models, indicating potential for further optimization or tuning.

KNN exhibits the lowest recall among the models, which suggests it may struggle more with correctly identifying churn cases, especially in situations where recall is crucial.


**Conclusion:**

The ensemble methods (Random Forest and GBM) stand out for their balanced performance across all metrics on balanced data. They are recommended where F1-score is the primary consideration, as it is in this project. Moving forward, we will fine-tune these two models and use the better performer on our test dataset.

Logistic Regression and SVM offer strong precision and are suitable for scenarios prioritizing precision in subscription prediction.

Neural Network shows promise but may benefit from further fine-tuning to improve precision and overall performance.

**We save our models before hyperparameter tuning because, as we found, the scores dropped after tuning. In any project, the pre-tuning models can be used if hyperparameter tuning underperforms.**

In [None]:
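import os
import joblib
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

# Caution: these names are bound to the estimator classes, not to fitted models,
# so the files written below contain no learned parameters. To persist trained
# models, assign the fitted estimators (or pipelines) from the training step instead.
rf_model = RandomForestClassifier
gbm_model = GradientBoostingClassifier
svm_model = SVC

# Create a folder named 'models' if it doesn't exist
os.makedirs('models', exist_ok=True)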

# Save the Random Forest model
joblib.dump(rf_model, 'models/random_forest_model.joblib')

# Save the GBM model
joblib.dump(gbm_model, 'models/gbm_model.joblib')

# Save the SVM model
joblib.dump(svm_model, 'models/svm_model.joblib')


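For a saved file to be useful later, the object passed to joblib.dump needs to be a fitted estimator. The sketch below shows a save-and-reload round trip under stated assumptions: it uses synthetic data from make_classification and a temporary directory rather than the project's data and 'models' folder, and the small forest size is chosen only to keep it fast.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the bank marketing dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Fit first: only a fitted estimator carries the learned trees
rf_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Write to a temporary 'models' folder so the sketch leaves no files behind in the project
model_dir = os.path.join(tempfile.mkdtemp(), "models")
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, "random_forest_model.joblib")
joblib.dump(rf_model, model_path)

# Reload and confirm the restored model predicts exactly like the original
restored = joblib.load(model_path)
same_predictions = bool((restored.predict(X) == rf_model.predict(X)).all())
print("Round trip preserved predictions:", same_predictions)
```

The same pattern should apply to a full fitted pipeline (preprocessor, oversampler and classifier), which keeps preprocessing and model together in one file.
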
#### **Perform Hyperparameter Tuning** 

- We selected the two top-performing models, GBM and Random Forest, and tuned each to see which performs best after hyperparameter tuning.

- GBM HYPERPARAMETER TUNING

In [None]:
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'max_depth': trial.suggest_int('max_depth', 1, 30),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
        # suggest_float replaces the deprecated suggest_uniform
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        # 'auto' is rejected by recent scikit-learn releases and makes those trials fail
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2'])
    }
    
    model = GradientBoostingClassifier(**params, random_state=42)
    
    pipeline = imbPipeline([
        ('preprocessor', preprocessor),
        ('oversampler', SMOTE(random_state=42)),
        ('classifier', model)
    ])
    
    try:
        score = cross_val_score(pipeline, X_train, y_train_encoded, cv=5, scoring='f1_weighted', error_score='raise').mean()
    except Exception as e:
        print(f"Error in trial with parameters: {params}")
        print(f"Error message: {str(e)}")
        return float('-inf')  # Return a very low score to indicate failure
    
    return score

# Create and run the study
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

best_params = study.best_params
best_f1_score = study.best_value

print("Best Parameters:", best_params)
print("Best F1-Score:", best_f1_score)


Observations 
- The errors raised in some trials were later traced to the 'max_features' parameter being set to 'auto', which GradientBoostingClassifier no longer accepts in recent scikit-learn versions. The error comes from the estimator itself, not from cross_val_score. We will correct this going forward.

- Optimization progress: Despite the errors in some trials, the study ran to completion for all 100 trials as specified (n_trials=100), because a failed trial returns -inf instead of stopping the study.

- The best F1-score observed during the study was 0.794, achieved in Trial 63, a slight improvement over the pre-tuning score of 0.790.

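To see which 'max_features' values the installed scikit-learn accepts for GradientBoostingClassifier, a quick check along these lines may help. This is a sketch on synthetic data from make_classification, not the project data, and the small n_estimators is chosen only to keep it fast. Whether 'auto' is accepted depends on the installed version, so no result is assumed for it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (assumption: not the bank marketing dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

accepted = {}
for option in ["auto", "sqrt", "log2", None]:
    model = GradientBoostingClassifier(
        n_estimators=10, max_features=option, random_state=42
    )
    try:
        model.fit(X, y)
        accepted[option] = True
    except ValueError:
        # Invalid parameter values surface as a ValueError subclass at fit time
        accepted[option] = False

# Options that can safely go into the search space on this installation
valid_options = [opt for opt, ok in accepted.items() if ok]
print("Accepted max_features options:", valid_options)
```

Restricting the search space to the accepted options should avoid spending trials on configurations that can never fit.
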
- RANDOM FOREST HYPERPARAMETER TUNING

In [None]:
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'max_depth': trial.suggest_int('max_depth', 1, 30),
    }
    model = RandomForestClassifier(**params)
    pipeline = imbPipeline([
        ('preprocessor', preprocessor),
        ('oversampler', SMOTE()),
        ('classifier', model)
    ])
    # Note: 'f1' scores the positive class only, unlike 'f1_weighted' used in the GBM study
    score = cross_val_score(pipeline, X_train, y_train_encoded, cv=5, scoring='f1').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

best_params = study.best_params
best_f1_score = study.best_value

print("Best Parameters:", best_params)
print("Best F1-Score:", best_f1_score)


Observation
- Trial Number: Trial 99 is the final trial of the study (Optuna numbers trials from 0, so it is the 100th trial).
- Trial Result: The F1-score observed for this trial was 0.635314351632384.
- Trial Parameters: The hyperparameters tested during this trial were {'n_estimators': 335, 'max_depth': 9}.
- Best Trial: The best F1-score observed across all trials was 0.6424022851349124, achieved in Trial 40.

- The F1-score fell from 0.791 before tuning to 0.642 after tuning. Note that this study scores with 'f1' (positive class only), whereas the GBM study used 'f1_weighted', so the two figures may not be directly comparable. We will adjust the hyperparameter search to improve the tuning results.

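The GBM study scores with 'f1_weighted' while the Random Forest study scores with 'f1', and the two can differ substantially on imbalanced labels. The toy illustration below uses made-up labels (not project data) and scikit-learn's f1_score; it does not establish which averaging the earlier 0.791 baseline used, only that the choice matters.

```python
from sklearn.metrics import f1_score

# Toy imbalanced labels (hypothetical): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# One false positive (index 7) and one missed positive (index 9)
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# scoring='f1' corresponds to the positive-class F1
f1_binary = f1_score(y_true, y_pred, average="binary")

# scoring='f1_weighted' averages per-class F1, weighted by class support
f1_weighted = f1_score(y_true, y_pred, average="weighted")

print("Positive-class F1:", f1_binary)
print("Weighted F1:", f1_weighted)
```

Here the same predictions give 0.5 for the positive class and 0.8 when weighted by support, so comparing tuned and untuned models is only meaningful when both use the same scorer.
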
#### **Challenges and Moving Forward**

- The only major challenge was the hyperparameter tuning of our two best-performing models. We will work to find better hyperparameters so that we can move on to selecting the best model for our test data.
- The other challenges were learning opportunities.
- We will also export the core machine learning components for future use in other projects.


**Exporting the Models for Use in Our Multipage Web-Based App**
- The two models will be:
  - GBM
  - Random Forest