<a href="https://colab.research.google.com/github/23f1000642/Predicting-Bank-Telemarketing-Success/blob/main/23f1000642_notebook_t32024_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# IMPORTANT: SOME KAGGLE DATA SOURCES ARE PRIVATE
# RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES.
import kagglehub
kagglehub.login()


In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

predict_the_success_of_bank_telemarketing_path = kagglehub.competition_download('predict-the-success-of-bank-telemarketing')

print('Data source import complete.')


# Project Title: Predicting Bank Telemarketing Success Using Machine Learning


### :::::::::: :::::::::: :::::::::: :::::::::: ::::::::::  Process of The Project  :::::::::: :::::::::: :::::::::: :::::::::: ::::::::::

                                - 1 . Python Implementation
                                - 2 . Domain Analysis ,  Basic Checks & find Insights
                                - 3 . EDA (univariate , Bi-Variate ) and Find Insights
                                - 4 . Data Preprocessing / Feature engineering
                
                                                        - ( i ) . Find Missing values & impute them
                                                        - ( ii ) .  Convert Categorical variable to Numerical
                                                       
                                        
                                - 5 . Feature Selection
                                                         - ( i ) . Check correlation
                                                         - ( ii ) . use pipeline
                                                         - ( iii ) . Create Dependent(y) & Independent(X) Variable
                                                    
                                - 6 . Model Creation  & Evaluation
                                        
                                                          - ( i ) . Split Data into Train & Test
                                                          - ( ii ) . Import Metrics for evaluation
                                                          
                                                                        - ( a ) . Logistic regression
                                                                        - ( b ) . K Nearest Neighbour
                                                                        - ( c ) . Lightgbm
                                                                        - ( d ) . Random Forest
                                                                        - ( e ) . XG - Boost
                                                                        
                                - 7 . Create Model Comparison Report
### ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

## ***Objective:***
### The goal of the project is to predict whether a customer will subscribe to a bank term deposit (yes or no) based on features such as contact details, demographics, and campaign-related data from previous telemarketing calls.

# 1 . Python Implementation

## Importing Libraries


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import  f1_score, precision_score, recall_score, precision_recall_curve
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import uniform, randint
from sklearn.ensemble import GradientBoostingClassifier  # RandomForestClassifier is already imported above



# Render matplotlib figures inline in the notebook
%matplotlib inline

# Suppress library warnings (e.g. FutureWarning) to keep the output readable
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the Dataset
data = pd.read_csv('/kaggle/input/predict-the-success-of-bank-telemarketing/train.csv')

test = pd.read_csv('/kaggle/input/predict-the-success-of-bank-telemarketing/test.csv')


# 2 . Domain Analysis and Basic Checks

In [None]:
# to see first five records
data.head()

In [None]:
# to see all columns
pd.set_option('display.max_columns',None)
data.tail() # to see last 5 records


In [None]:
# to know about data type & null values
data.info()

### Input Variables:
1. **last contact date**: Last contact date.  
2. **age**: Age of the client (numeric).  
3. **job**: Type of job.  
4. **marital**: Marital status (categorical: `"married"`, `"divorced"`, `"single"`; note: `"divorced"` includes divorced or widowed).  
5. **education**: Level of education (categorical: `"unknown"`, `"secondary"`, `"primary"`, `"tertiary"`).  
6. **default**: Does the client have credit in default? (binary: `"yes"`, `"no"`).  
7. **balance**: Average yearly balance in euros (numeric).  
8. **housing**: Does the client have a housing loan? (binary: `"yes"`, `"no"`).  
9. **loan**: Does the client have a personal loan? (binary: `"yes"`, `"no"`).  
10. **contact**: Contact communication type (categorical: `"unknown"`, `"telephone"`, `"cellular"`).  
11. **duration**: Last contact duration in seconds (numeric).  
12. **campaign**: Number of contacts performed during this campaign and for this client (numeric, includes the last contact).  
13. **pdays**: Number of days since the client was last contacted in a previous campaign (numeric, `-1` means the client was not previously contacted).  
14. **previous**: Number of contacts performed before this campaign for this client (numeric).  
15. **poutcome**: Outcome of the previous marketing campaign (categorical: `"unknown"`, `"other"`, `"failure"`, `"success"`).  

### Output Variable (Desired Target):
16. **target**: Has the client subscribed to a term deposit? (binary: `"yes"`, `"no"`).
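
The `-1` sentinel in `pdays` is easy to misread as a real number of days. A minimal sketch of how it could be made explicit, assuming a toy frame with made-up values; the column name `contacted_before` is hypothetical and is not used elsewhere in this notebook:

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from the dataset.
toy = pd.DataFrame({"pdays": [-1, 5, 180, -1]})

# Hypothetical flag: 1 if the client was contacted in a previous campaign, else 0.
toy["contacted_before"] = (toy["pdays"] != -1).astype(int)
print(toy)
```

Whether such a flag helps the models here is untested; it only separates "never contacted" from a genuine day count.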


In [None]:
# To get Statistical information about numerical columns
data.describe()

### Insights for Numerical  Datatype : --
* **Age**: Average age is 42 years, ranging from 18 to 95 years.
* **Balance**: Mean account balance is €5441, with high variability (up to €102,127).
* **Call Duration**: Average call duration is 439 seconds, with a maximum of 4918 seconds.
* **Campaign Contacts**: Median contacts per person is 2, but some were contacted up to 63 times.
* **Previous Campaigns**: Most clients were not previously contacted (pdays = -1 for many).
* **Target Variable**: Only 14.86% of clients subscribed, showing class imbalance.
* **Temporal Data**: Most campaign activity occurred in 2009, with June being the busiest month.



In [None]:
# To know statistical information about Categorical data
data.describe(include='O')

- **Count**: Non-null values in each column; `job`, `education`, `contact` have missing data.
- **Unique**: Indicates diversity in values; e.g., `job` has 11 roles, `poutcome` has 3 outcomes.
- **Top**: Most frequent value per column (e.g., "married" in `marital`).
- **Frequency**: Frequency of top values, showing class imbalance like "no" in `target`.
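
The class imbalance read off the `top`/`freq` rows can also be computed directly. A small sketch on a made-up target column (the real notebook would use `data['target']`):

```python
import pandas as pd

# Toy target column for illustration; the proportions are made up.
target = pd.Series(["no", "no", "no", "yes", "no", "no", "yes", "no", "no", "no"], name="target")

# normalize=True returns the share of each class instead of raw counts.
class_share = target.value_counts(normalize=True)
print(class_share)
```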

# 3 . EDA : Exploratory Data Analysis

In [None]:
data.info()

In [None]:
data['last contact date'].head()


In [None]:
data['last contact date'] = pd.to_datetime(data['last contact date'])
data['p_year'] = data['last contact date'].dt.year
data['p_month'] = data['last contact date'].dt.month
data['p_weekday'] = data['last contact date'].dt.weekday


test['last contact date'] = pd.to_datetime(test['last contact date'])
test['p_year'] = test['last contact date'].dt.year
test['p_month'] = test['last contact date'].dt.month
test['p_weekday'] = test['last contact date'].dt.weekday

*Yearly Trends*: Spot long-term shifts in customer behavior or campaign effectiveness by analyzing engagement patterns over years.

*Monthly Patterns*: Reveal seasonal effects, potentially indicating which months drive higher engagement.

*Weekday Variations*: Understand customer preferences for specific days, aiding in scheduling more impactful marketing outreach.

These extracted features enrich the data, allowing the model to leverage temporal patterns for better prediction accuracy.
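
Because the same three columns are derived for both `data` and `test`, the extraction could be wrapped in one function. A minimal sketch, assuming the column is named `last contact date` as above; the helper name `add_date_parts` is hypothetical:

```python
import pandas as pd

def add_date_parts(df, column="last contact date"):
    """Return a copy of df with p_year, p_month and p_weekday derived from `column`."""
    out = df.copy()
    dates = pd.to_datetime(out[column])
    out["p_year"] = dates.dt.year
    out["p_month"] = dates.dt.month
    out["p_weekday"] = dates.dt.weekday  # Monday=0 ... Sunday=6
    return out

# Toy dates for illustration only.
toy = pd.DataFrame({"last contact date": ["2009-06-01", "2010-03-14"]})
toy_parts = add_date_parts(toy)
print(toy_parts)
```

Applying one function to both frames keeps the train and test features in step if the extraction logic changes later.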

In [None]:
print(data.head())
print('-----------------------------------------------------------------------')
print(test.head())

In [None]:
data = data.drop(['last contact date'],axis=1)
test = test.drop(['last contact date'],axis=1)

Dropping the `last contact date` column was necessary after extracting `year`, `month`, and `weekday` features. This removed redundant information, streamlined the dataset, and reduced potential noise, focusing the model on the more relevant, time-based features for improved accuracy.

In [None]:
# Use a for loop to collect the columns whose dtype is not object (the numerical columns)
data1 = []                         # create a empty List
for i in data.columns:             # use for loop in data.columns
    if data[i].dtype!='O':         # use if condition
        data1.append(i)            # Append that columns which satisfy the if condition
print(data1)                       # print the List

In [None]:
data1 = data[data1]                          # DataFrame holding the numerical columns
data2 = data.drop(columns=data1.columns)     # DataFrame holding the remaining (categorical) columns
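
The numeric/categorical split can also be written without an explicit loop. A sketch using `DataFrame.select_dtypes` on a toy frame; note that it selects on "number" rather than "not object", so boolean or datetime columns would land in the non-numeric part:

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    "age": [30, 45],
    "balance": [100.0, -20.5],
    "job": ["admin", "technician"],
    "target": ["no", "yes"],
})

numeric_part = toy.select_dtypes(include="number")      # int and float columns
categorical_part = toy.select_dtypes(exclude="number")  # everything else
print(list(numeric_part.columns), list(categorical_part.columns))
```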

### ( i ) . Univariate Analysis

In [None]:
## univariate analysis for Numerical Columns
plt.figure(figsize=(10,10))                                         # set the canvas size
plotnumber = 1                                                      # create a variable for plotting numbers
for i in data1 :                                                    # use for loop to iterate the loop
    plt.subplot(5,2,plotnumber)                                     # use subplot to plot figure in rows & columns
    sns.histplot(x = data1[i],color='blue',kde=True)                # plot Histogram
    plotnumber = plotnumber + 1
plt.tight_layout()                                                  # to avoid the overlapping

- **Purpose**: Conduct univariate analysis on numerical columns to understand data distribution.
- **Result**: Visual distribution of each numeric column, highlighting patterns and skewness.

### Insights of Numerical data  :--

1. **Age**: The age distribution is right-skewed, with a majority of clients around 30–40 years. Older age groups have fewer clients.

2. **Balance**: The balance feature is highly skewed, with most clients having a low balance. A few have exceptionally high balances.

3. **Duration**: Duration shows a right-skewed distribution, indicating that most calls were relatively short, while a few were significantly longer.

4. **Campaign**: Most clients were contacted only a few times in a single campaign, as indicated by the right-skewed distribution.

5. **Pdays**: Most values sit at `-1`, the sentinel for clients who were not contacted in a previous campaign.

6. **Previous**: Similar to `pdays`, most clients had no prior contacts, with very few having multiple contacts.

7. **P_year**: Peaks around specific years suggest that certain years had higher client activity or data collection intensity.

8. **P_month**: Client contacts show seasonal peaks, with higher counts in specific months, indicating targeted campaigns during certain times of the year.

9. **P_weekday**: Client contacts are spread consistently across weekdays, showing no significant preference for any particular day.
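
The skewness seen in the histograms can be backed by a number. A minimal sketch with `DataFrame.skew()` on made-up values: a positive result indicates a right tail, and a value near zero indicates a roughly symmetric distribution.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "balance": [0, 10, 20, 30, 40, 5000],  # one very large value gives a long right tail
    "p_weekday": [0, 1, 2, 3, 4, 5],       # evenly spread, so no skew
})

skewness = toy.skew()
print(skewness)
```

In the notebook the same call could be made on `data1`, the frame of numerical columns.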



In [None]:
## univariate analysis for Categorical Columns
plt.figure(figsize=(10,18))                                         # set the  size
plotnumber = 1                                                      # create a variable for plotting numbers
for i in data2 :                                                    # use for loop to iterate the loop
    plt.subplot(6,2,plotnumber)                                     # use subplot to plot figure in rows & columns
    sns.countplot(x = data2[i])                                     # plot countplot
    plotnumber = plotnumber + 1

    plt.xticks(rotation=90)
plt.tight_layout()                                                  # to avoid the overlapping

### Insights Categorical Data :--


**Specific Insights:**

* **Job:** The most common job types are "blue-collar" and "management."
* **Marital:** The majority of individuals are "married," followed by "single" and "divorced."
* **Education:** The most common education level is "secondary," followed by "tertiary" and "primary."
* **Default:** Most individuals have not defaulted on their credit.
* **Housing:** The majority of individuals have a housing loan.
* **Loan:** A significant number of individuals have a personal loan.
* **Contact:** The most common mode of contact is "cellular," followed by "telephone."
* **Poutcome:** The most common outcome of the previous marketing campaign is "failure."

  

**Overall:**

* The dataset appears to be imbalanced, with a higher proportion of individuals who did not subscribe to a term deposit ("no").



### ( ii ) . Bivariate Analysis

In [None]:
## Bivariate analysis for Numerical Columns
plt.figure(figsize=(10,18))                                         # set the canvas size
plotnumber = 1                                                      # create a variable for plotting numbers
for i in data1 :                                                    # use for loop to iterate the loop
    plt.subplot(6,2,plotnumber)                                     # use subplot to plot figure in rows & columns
    sns.histplot(x = data1[i],hue=data.target,palette=['red', 'green'])       # plot Histogram with Target
    plotnumber = plotnumber + 1
    plt.xticks(rotation=90)
plt.tight_layout()                                                  # to avoid the overlapping

### Insights of Numerical Data with Target:


1. **`age`:** Subscription rates are higher for individuals in their 30s and decline with age.  
2. **`balance`:** Positive balances slightly increase subscription rates; most balances are near zero.  
3. **`duration`:** Longer contact durations strongly correlate with higher subscriptions.  
4. **`campaign`:** Multiple calls (above 10) do not significantly improve subscription rates.  
5. **`pdays`:** Values other than `-1` (i.e. a prior contact) positively influence subscriptions.  
6. **`previous`:** More contacts before this campaign slightly increase subscription likelihood.  
7. **`p_year`:** Campaigns in 2009 had better subscription rates compared to 2008 and 2010.  
8. **`p_month`:** Subscriptions are higher in March, September, and December, showing seasonality.  
9. **`p_weekday`:** Subscription rates are consistent across weekdays.
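
Statements such as "longer durations correlate with higher subscriptions" can be checked by computing the subscription rate per bucket. A sketch on toy data; the bin edges and labels are arbitrary assumptions chosen for illustration:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "duration": [30, 60, 90, 200, 400, 650, 700, 900],
    "target": ["no", "no", "no", "no", "yes", "no", "yes", "yes"],
})
toy["subscribed"] = (toy["target"] == "yes").astype(int)

# Bucket call duration (seconds) and take the mean of the 0/1 flag per bucket.
toy["duration_bin"] = pd.cut(
    toy["duration"], bins=[0, 120, 600, 1200], labels=["<=2 min", "2-10 min", ">10 min"]
)
rate_by_bin = toy.groupby("duration_bin", observed=True)["subscribed"].mean()
print(rate_by_bin)
```

The same pattern works for `age`, `campaign`, or `p_month` by changing the column and the bins.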




In [None]:
## Bivariate analysis for Categorical Columns
plt.figure(figsize=(10,18))                                         # set the canvas size
plotnumber = 1                                                      # create a variable for plotting numbers
for i in data2.drop('target',axis=1) :                                   # use for loop to iterate the loop
    plt.subplot(6,2,plotnumber)                                     # use subplot to plot figure in rows & columns
    sns.countplot(x = data2[i],hue=data.target,palette=['r', 'g'])       # plot Countplot with Target class
    plotnumber = plotnumber + 1
    plt.xticks(rotation=90)
plt.tight_layout()                                                  # to avoid the overlapping

### Insights of Categorical Data with Target:

* **Job:** The most common job types are "blue-collar" and "management."
* **Marital:** The majority of individuals are "married," followed by "single" and "divorced."
* **Education:** The most common education level is "secondary," followed by "tertiary" and "primary."
* **Default:** Most individuals have not defaulted on their credit.
* **Housing:** The majority of individuals have a housing loan.
* **Loan:** A significant number of individuals have a personal loan.
* **Contact:** The most common mode of contact is "cellular," followed by "telephone."
* **Poutcome:** The most common outcome of the previous marketing campaign is "failure."




# 4 . Data Preprocessing :

### ( i ) . Handle missing values :-

In [None]:
# see number of missing values present in each Features
print(data.isnull().sum())
print("--------------------------------------------------------------")

print(test.isnull().sum())

### Poutcome

In [None]:
# Check the distribution of target across poutcome categories
# Plot the distribution of target with respect to poutcome categories
sns.countplot(data=data, x='poutcome', hue='target')
plt.title("Distribution of 'target' across 'poutcome' categories")
plt.xticks(rotation=45)
plt.show()


In [None]:
# Calculate missing percentage for 'poutcome'
poutcome_missing_percentage = data['poutcome'].isnull().mean() * 100
print(f"'poutcome' missing percentage: {poutcome_missing_percentage:.2f}%")


In [None]:
# Drop 'poutcome' if it has limited value for prediction
data.drop(['poutcome'], axis=1, inplace=True)
test.drop(['poutcome'], axis=1, inplace=True)
print("Dropped 'poutcome' from both train and test datasets.")


### Handling missing values

In [None]:

# 1. Check for null values
print(data.isnull().sum())

# 2. Handle missing values
# Assign the result back instead of calling fillna(..., inplace=True) on a single column:
# recent pandas versions warn about that chained in-place form.

# Numerical columns: fill with the mean or the median
data['age'] = data['age'].fillna(data['age'].mean())                # Fill with mean
data['balance'] = data['balance'].fillna(data['balance'].median())  # Fill with median

# Categorical columns: fill with the mode or a custom value such as 'Unknown'
data['job'] = data['job'].fillna(data['job'].mode()[0])  # Fill with mode
data['education'] = data['education'].fillna('Unknown')

# 'contact': fill with 'Unknown' (alternatively, drop the column if it is not useful)
data['contact'] = data['contact'].fillna('Unknown')

print('--------------------------------------------------------------------------')

# Test data: same strategy, computed from the test set's own statistics
test['age'] = test['age'].fillna(test['age'].mean())                # Fill with mean
test['balance'] = test['balance'].fillna(test['balance'].median())  # Fill with median

# Categorical columns: fill with the mode or a custom value such as 'Unknown'
test['job'] = test['job'].fillna(test['job'].mode()[0])  # Fill with mode
test['education'] = test['education'].fillna('Unknown')

# 'contact': fill with 'Unknown' (alternatively, drop the column if it is not important)
test['contact'] = test['contact'].fillna('Unknown')
print(test.isnull().sum())





##  Insights:

**1. Checking for Null Values:**

* **Purpose:** Identifies columns with missing values.
* **Method:** The `isnull().sum()` method calculates the number of null values in each column.

**2. Handling Missing Values:**

* **Numerical Columns:**
   - **Mean Imputation:** Replaces missing values with the mean of the column. This is suitable when the distribution is roughly symmetric.
   - **Median Imputation:** Replaces missing values with the median of the column. This is more robust to outliers.
* **Categorical Columns:**
   - **Mode Imputation:** Replaces missing values with the most frequent category.
   - **Custom Value:** Assigns a specific value like 'Unknown' to indicate that the original value was missing.
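
One alternative to filling each frame with its own statistics is to learn the fill value on the training data and reuse it on the test data. This is a sketch of that idea with `SimpleImputer` (already imported at the top of the notebook), not the approach used in the cell above; the toy values are made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames for illustration only.
train_toy = pd.DataFrame({"balance": [100.0, 200.0, np.nan, 5000.0]})
test_toy = pd.DataFrame({"balance": [np.nan, 50.0]})

# Learn the median from the training column only ...
imputer = SimpleImputer(strategy="median")
imputer.fit(train_toy[["balance"]])

# ... and apply that same value to the test column.
test_filled = imputer.transform(test_toy[["balance"]])
print(imputer.statistics_, test_filled.ravel())
```

The median is used here because it is less affected by the single large balance than the mean would be.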


### ( ii ) . Handle Categorical Variable :-

In [None]:
# Print the value counts of every column (numerical and categorical) using a for loop.
for i in data.keys() :
    print("★"*5,i,"★"*5,"\n")
    print(data[i].value_counts(),'\n')

#### Insights:
1. Displays unique values and counts for each categorical feature.  
2. Helps identify imbalances or rare categories.  
3. Useful for preprocessing and feature engineering.  
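
To turn the value counts into a list of rare categories, the shares can be filtered against a cutoff. A minimal sketch on a toy `job` column; the 15% cutoff is an arbitrary assumption:

```python
import pandas as pd

# Toy column for illustration only.
job = pd.Series(["admin"] * 6 + ["technician"] * 3 + ["student"], name="job")

share = job.value_counts(normalize=True)   # fraction of rows per category
rare = share[share < 0.15].index.tolist()  # categories below the assumed cutoff
print(rare)
```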

In [None]:
data.info()

 # Split Data into Train & Test

In [None]:
# 1. Split the data into features and target
X = data.drop(columns=['target'])  # Features
y = data['target']  # Target

## Column Transform with scaling and one-hot encoding

In [None]:
# 2. Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Define numerical and categorical columns
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'p_year', 'p_month', 'p_weekday']
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact']

# 4. Create a ColumnTransformer with scaling and one-hot encoding
Transformer = ColumnTransformer([
    ('scale', StandardScaler(), numerical_cols),
    ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)  # ignore categories unseen during fit
], remainder='passthrough')

# 5. Fit the transformer on X_train and transform both training and validation data
X_train_transformed = pd.DataFrame(Transformer.fit_transform(X_train), columns=Transformer.get_feature_names_out())
X_val_transformed = pd.DataFrame(Transformer.transform(X_val), columns=Transformer.get_feature_names_out())
test_data_transformed = pd.DataFrame(Transformer.transform(test), columns=Transformer.get_feature_names_out())

# 6. Verify the transformed data
print(X_train_transformed.head())
print(X_val_transformed.head())

## Insights :


1. **Split Data**: Separates features (`X`) and target (`y`) for training.

2. **Train-Validation Split**: Divides data into 80% training and 20% validation for model evaluation.

3. **Define Columns**:
   - `numerical_cols`: Continuous values for scaling.
   - `categorical_cols`: Categorical values for encoding.

4. **ColumnTransformer Setup**:
   - **StandardScaler** for numerical columns.
   - **OneHotEncoder** for categorical columns.
   - Keeps other columns unchanged (`remainder='passthrough'`).

5. **Transform Data**: Applies transformations to training, validation, and test data for consistent feature scaling and encoding.

6. **Data Verification**: Prints transformed training and validation data to confirm the transformation before model training.

In [None]:
# 1. Convert the target variable 'target' to numeric values
data['target'] = data['target'].map({'yes': 1, 'no': 0})  # 'yes' -> 1, 'no' -> 0

# 2. Split the data into features and target
X = data.drop(columns=['target'])  # Features
y = data['target']  # Target

# 3. Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Verify the transformed data and check for any remaining non-numeric values
print(X_train_transformed.dtypes)
print("******************************")
print(X_val_transformed.dtypes)
print("******************************")

print(test_data_transformed.dtypes)

# Model Training

### Pipeline Training

In [None]:

# Define numerical and categorical columns
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'p_year', 'p_month', 'p_weekday']
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact']

# Create ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('scale', StandardScaler(), numerical_cols),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)

# Define the models you want to evaluate
models = [
    ('LogisticRegression', LogisticRegression(random_state=42)),
    ('RandomForest', RandomForestClassifier(random_state=42)),
    ('KNN', KNeighborsClassifier()),
    ('XGBClassifier', XGBClassifier(random_state=42)),
    ('LGBMClassifier', LGBMClassifier(random_state=42,n_jobs=-1))
]

# Create a function to train and evaluate models
def train_and_evaluate(models, X_train, y_train, X_val, y_val):
    best_models = {}

    for model_name, model in models:
        # Set up the pipeline for each model
        pipeline = Pipeline(steps=[
            ('preprocessing', preprocessor),
            ('pca', PCA(n_components=5)),         # You can adjust n_components later
            ('svd', TruncatedSVD(n_components=5)), # Adjust as necessary
            ('classifier', model)                 # The current model in the loop
        ])

        # Train the model
        pipeline.fit(X_train, y_train)

        # Predict on the validation set
        y_val_pred = pipeline.predict(X_val)

        # Calculate F1 score (macro)
        f1 = f1_score(y_val, y_val_pred, average='macro')
        print(f"Validation F1 Score (Macro) for {model_name}: {f1:.4f}")

        # Store each fitted pipeline so it can be reused for prediction
        best_models[model_name] = pipeline

    return best_models

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate models
best_models = train_and_evaluate(models, X_train, y_train, X_val, y_val)

# Optionally: Predict on test data with the best models
for model_name, model in best_models.items():
    test_predictions = model.predict(test)
    test_predictions_df = pd.DataFrame(test_predictions, columns=['target'])
    print(f"Test Predictions for {model_name}:")
    print(test_predictions_df.head())


### Insights:

1. **Preprocessing and Dimensionality Reduction**:  
   - The `ColumnTransformer` applies scaling and one-hot encoding for numerical and categorical features, respectively.  
   - PCA and Truncated SVD reduce the feature space to 5 components, optimizing computational efficiency.

2. **Model Performance**:  
   - Multiple classifiers are evaluated, with `XGBClassifier` achieving the best F1 Macro score on the validation set (0.6095).  
   - `KNN` and `LGBMClassifier` also show competitive performance, highlighting model variation.

3. **Predictions and Output**:  
   - Test predictions for all models reveal most outputs as class `0`, likely reflecting data imbalances.  
   - Model performance could benefit from advanced techniques like oversampling/undersampling or threshold tuning for class balance.
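
With an imbalanced target, one simple safeguard is to make sure the training and validation sets contain the same share of positives. A sketch of a stratified split on synthetic labels; the `stratify` argument is not used in the cells above, so this is an assumption about what might help rather than a description of the notebook's split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 15 positives out of 100.
y_toy = np.array([1] * 15 + [0] * 85)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)
print(y_tr.mean(), y_va.mean())
```

Switching the notebook's split to a stratified one would change the reported validation scores, so the numbers above would need to be regenerated.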

# HyperParameter tuning

## Logistic Regression

In [None]:

# Initialize Logistic Regression model
logreg = LogisticRegression(random_state=42)

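# Define the parameter distribution for Logistic Regression.
# Not every penalty/solver pair is valid (e.g. 'lbfgs' does not support 'l1');
# with the default error_score, RandomizedSearchCV scores failed fits as NaN.
param_dist_logreg = {
    'classifier__penalty': ['l1', 'l2', 'elasticnet', 'none'],    # Regularization type ('none' is rejected by newer scikit-learn, which expects None)
    'classifier__C': uniform(0.01, 10),                           # Inverse of regularization strength, sampled from [0.01, 10.01]
    'classifier__solver': ['newton-cg', 'lbfgs', 'liblinear', 'saga'],  # Solver algorithm
    'classifier__max_iter': [100, 200, 300, 400, 500]             # Maximum number of iterations
}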

# Create a pipeline with the preprocessor and Logistic Regression
pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', logreg)
])

# Use RandomizedSearchCV for hyperparameter tuning
random_search_logreg = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_dist_logreg,
    n_iter=30,                   # Number of parameter settings sampled
    cv=5,                        # 5-fold cross-validation for better robustness
    scoring='f1_macro',          # Metric for evaluation
    n_jobs=-1,                   # Use all available cores
    random_state=42              # Ensures reproducibility
)

# Fit the RandomizedSearchCV on the training data
random_search_logreg.fit(X_train, y_train)

# Get the best model and print the best parameters
best_model_logreg = random_search_logreg.best_estimator_
print("Best Parameters for Logistic Regression:", random_search_logreg.best_params_)

# Predict probabilities on validation set
y_val_prob_logreg = best_model_logreg.predict_proba(X_val)[:, 1]  # Probabilities for the positive class

# Fine-tune threshold to maximize F1 Macro Score
thresholds_logreg = np.arange(0.3, 0.6, 0.01)
best_f1_macro = 0
best_threshold = 0.5  # Default starting threshold

for threshold in thresholds_logreg:
    y_val_pred_threshold_logreg = (y_val_prob_logreg >= threshold).astype(int)
    f1_macro_logreg_threshold = f1_score(y_val, y_val_pred_threshold_logreg, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_logreg_threshold:.4f}")

    # Update the best F1 macro score and best threshold if a better score is found
    if f1_macro_logreg_threshold > best_f1_macro:
        best_f1_macro = f1_macro_logreg_threshold
        best_threshold = threshold

# Print the best F1 macro score and threshold found through tuning
print(f"\nBest F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")


### Insights:

1. **Optimal Parameters for Logistic Regression**:  
   - The model achieved its best configuration with specific hyperparameters: regularization strength (`C`) of 3.35, `l2` penalty, and `newton-cg` solver, ensuring a robust performance.

2. **Threshold Tuning**:  
   - F1 Macro Score peaks at 0.6436 when the threshold is set to 0.30 and declines as the threshold increases. Because 0.30 is the lowest value in the searched range, the best threshold may lie below it.

3. **Significance of Results**:  
   - The optimal threshold (0.30) significantly boosts the F1 score, highlighting the importance of tuning thresholds in imbalanced datasets for achieving better class representation.
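
The threshold search can be packaged as a small reusable function and run over a wider grid. A minimal sketch on toy probabilities; the function name `best_macro_f1_threshold` and the 0.05 to 0.95 grid are assumptions, not part of the notebook:

```python
import numpy as np
from sklearn.metrics import f1_score

def best_macro_f1_threshold(y_true, proba, thresholds):
    """Return (threshold, score) with the highest macro F1 among the candidates."""
    scores = [
        f1_score(y_true, (proba >= t).astype(int), average="macro", zero_division=0)
        for t in thresholds
    ]
    best = int(np.argmax(scores))  # first maximum if several thresholds tie
    return float(thresholds[best]), float(scores[best])

# Toy labels and predicted probabilities for illustration only.
y_toy = np.array([0, 0, 0, 0, 1, 1])
proba_toy = np.array([0.05, 0.10, 0.20, 0.32, 0.60, 0.90])

grid = np.linspace(0.05, 0.95, 19)
thr, score = best_macro_f1_threshold(y_toy, proba_toy, grid)
print(thr, score)
```

A threshold chosen on the validation set is tuned to that set, so the resulting score is an optimistic estimate of performance on new data.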

## Random Forest Classifier

In [None]:

# Initialize RandomForest model
model_rf = RandomForestClassifier(random_state=0)

# Parameter distribution for RandomForest (reduced ranges); randint(a, b) samples integers from [a, b)
param_dist_rf = {
    'classifier__n_estimators': randint(100, 500),      #  number of trees
    'classifier__max_depth': randint(5, 15),            #  max depth
    'classifier__min_samples_split': randint(2, 10),    #  min samples to split a node
    'classifier__min_samples_leaf': randint(1, 10),     #  min samples in a leaf node
    'classifier__max_features': ['auto', 'sqrt'],       #  features considered per split ('auto' is accepted only by older scikit-learn, where it equals 'sqrt')
    'classifier__bootstrap': [True],                    # Bootstrap sampling
    'classifier__class_weight': ['balanced']            # Class weighting
}

# Create a pipeline with the preprocessor and RandomForest classifier
pipeline_rf = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', model_rf)
])

# RandomizedSearchCV for hyperparameter tuning
random_search_rf = RandomizedSearchCV(
    estimator=pipeline_rf,
    param_distributions=param_dist_rf,
    n_iter=15,                  # Reduced number of parameter settings sampled
    cv=2,                       # 2-fold cross-validation
    scoring='f1_macro',         # Metric for evaluation
    n_jobs=-1,                  # Use all available cores
    random_state=42             # Ensures reproducibility
)

# Fit the RandomizedSearchCV on the training data
random_search_rf.fit(X_train, y_train)

# Get the best model and print the best parameters
best_model_rf = random_search_rf.best_estimator_
print("Best Parameters for RandomForest:", random_search_rf.best_params_)

# Predict on validation set and evaluate
y_val_pred_rf = best_model_rf.predict(X_val)
print("\nRandomForestClassifier Classification Report:")
print(classification_report(y_val, y_val_pred_rf))

# Calculate F1 macro score
f1_macro_rf = f1_score(y_val, y_val_pred_rf, average='macro')
print(f"F1 Macro Score on Validation Set: {f1_macro_rf:.4f}")

# Threshold Tuning for RandomForest with finer steps
y_proba_rf = best_model_rf.predict_proba(X_val)[:, 1]

# Initialize best F1 macro score and best threshold
best_f1_macro = f1_macro_rf  # Start with the initial F1 score
best_threshold = 0.5         # Default threshold

# Loop through a range of thresholds to find the best one
thresholds_rf = np.arange(0.3, 0.6, 0.01)  # Finer steps for threshold
for threshold in thresholds_rf:
    y_pred_threshold_rf = (y_proba_rf >= threshold).astype(int)
    f1_macro_rf_threshold = f1_score(y_val, y_pred_threshold_rf, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_rf_threshold:.4f}")

    # Update best F1 macro score and best threshold if a better score is found
    if f1_macro_rf_threshold > best_f1_macro:
        best_f1_macro = f1_macro_rf_threshold
        best_threshold = threshold

# Print the best F1 macro score and threshold
print(f"Best F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")


### Insights:

1. **Optimal Random Forest Hyperparameters**:  
   - The model performs best with a depth of 14, balanced class weights, 154 estimators, `sqrt` for `max_features`, a `min_samples_split` of 5, and a `min_samples_leaf` of 3. The depth cap and leaf-size constraints limit overfitting, while balanced class weights offset the class imbalance.

2. **Classification Report**:  
   - Class 0 (majority class) achieves high precision (96%) but lower recall (85%): about 15% of true non-subscribers are predicted as subscribers, which is consistent with the use of balanced class weights.  
   - Class 1 (minority class) shows moderate recall (79%) but lower precision (49%), reflecting a need for further optimization on imbalanced data.

3. **Threshold Tuning Insights**:  
   - The F1 Macro Score peaks at **0.7619** for a threshold of **0.55**, showing that a slightly higher threshold improves the balance between precision and recall for both classes.
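
The threshold search described above can be sketched as a small reusable helper. This is a minimal illustration, not the notebook's own code: the function name `tune_threshold` and the toy arrays are assumptions made up for the example, and the default grid simply mirrors the 0.30 to 0.60 range used in the cells.

```python
import numpy as np
from sklearn.metrics import f1_score


def tune_threshold(y_true, y_proba, thresholds=None):
    """Return (best_threshold, best_f1_macro) over a grid of candidate thresholds."""
    if thresholds is None:
        thresholds = np.arange(0.30, 0.60, 0.01)
    y_proba = np.asarray(y_proba)
    best_threshold, best_score = 0.5, -1.0
    for t in thresholds:
        # Predict the positive class whenever its probability reaches the threshold
        y_pred = (y_proba >= t).astype(int)
        score = f1_score(y_true, y_pred, average="macro")
        if score > best_score:
            best_threshold, best_score = float(t), score
    return best_threshold, best_score


# Toy data (hypothetical, not taken from the competition dataset)
y_true_demo = np.array([0, 0, 0, 0, 1, 1])
y_proba_demo = np.array([0.10, 0.20, 0.35, 0.45, 0.40, 0.90])

demo_threshold, demo_f1 = tune_threshold(y_true_demo, y_proba_demo)
print(f"Best threshold: {demo_threshold:.2f}, F1 macro: {demo_f1:.4f}")
```

With a helper like this, each model's cell would only need to pass its own `predict_proba(X_val)[:, 1]` output, which keeps the search range identical across models.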

## K-Nearest Neighbors (KNN)

In [None]:

# parameter grid for KNN
param_grid_knn = {
    'classifier__n_neighbors': [5, 7, 10, 15],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__metric': ['euclidean', 'manhattan']
}

# KNN pipeline without dimensionality reduction
pipeline_knn = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', KNeighborsClassifier())
])

# GridSearchCV with the pipeline
grid_search_knn = GridSearchCV(
    pipeline_knn, param_grid_knn, cv=5, scoring='f1_macro', n_jobs=-1
)
grid_search_knn.fit(X_train, y_train)

# Best model evaluation
best_knn = grid_search_knn.best_estimator_
y_val_pred_knn = best_knn.predict(X_val)
print("\nKNN Classification Report:")
print(classification_report(y_val, y_val_pred_knn))

# Calculate F1 macro score
f1_knn = f1_score(y_val, y_val_pred_knn, average='macro')
print(f"Best KNN F1 Score (Macro) on Validation Set: {f1_knn:.4f}")
print("Best Parameters for KNN:", grid_search_knn.best_params_)

# Threshold Tuning for KNN
y_proba_knn = best_knn.predict_proba(X_val)[:, 1]  # Get probability estimates for the positive class

# Initialize best F1 macro score and best threshold
best_f1_macro = f1_knn  # Start with the initial F1 score from default threshold
best_threshold = 0.5    # Default threshold

# Loop through a finer range of thresholds to find the best one
thresholds_knn = np.arange(0.3, 0.6, 0.01)  # Finer steps for threshold tuning
for threshold in thresholds_knn:
    y_pred_threshold_knn = (y_proba_knn >= threshold).astype(int)
    f1_macro_knn_threshold = f1_score(y_val, y_pred_threshold_knn, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_knn_threshold:.4f}")

    # Update best F1 macro score and best threshold if a better score is found
    if f1_macro_knn_threshold > best_f1_macro:
        best_f1_macro = f1_macro_knn_threshold
        best_threshold = threshold

# Print the best F1 macro score and threshold
print(f"Best F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")


### Insights:

1. **KNN Model Performance**:  
   - Class 0 (majority class) achieves a high precision (88%) and recall (95%), indicating reliable predictions for the dominant class.  
   - Class 1 (minority class) suffers from low recall (27%) and moderate precision (50%), highlighting challenges with imbalanced data.

2. **Optimal Parameters and Validation Score**:  
   - The best parameters for KNN are `euclidean` distance, `5` neighbors, and `distance` weighting. These settings achieve a validation F1 Macro Score of **0.6312**, showing moderate overall performance.

3. **Threshold Tuning Improvement**:  
   - Adjusting the threshold improves the F1 Macro Score, peaking at **0.6816** for a threshold of **0.34**, demonstrating that fine-tuning thresholds significantly benefits the balance between precision and recall.
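
To make the link between the classification report and the macro score explicit, the sketch below recomputes the macro F1 from the per-class precision and recall quoted above. The helper name `f1_from_precision_recall` is an assumption for illustration. Because the report values are rounded to two decimals, the result only approximates the reported **0.6312**.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class values as quoted in the KNN insights (rounded to two decimals)
f1_class0 = f1_from_precision_recall(0.88, 0.95)
f1_class1 = f1_from_precision_recall(0.50, 0.27)

# Macro averaging weights both classes equally, regardless of class size
f1_macro_from_report = (f1_class0 + f1_class1) / 2

print(f"Class 0 F1: {f1_class0:.4f}")
print(f"Class 1 F1: {f1_class1:.4f}")
print(f"Macro F1:   {f1_macro_from_report:.4f}")
```

Since both classes count equally, the weak Class 1 score pulls the macro average down sharply even though Class 0 is predicted well.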

## XGBoost

In [None]:


# Initialize XGBoost model
model_xgb = XGBClassifier(random_state=0, scale_pos_weight=2)  # use_label_encoder is deprecated in recent XGBoost releases

# Expanded parameter distribution for XGBoost
param_dist_xgb = {
    'classifier__learning_rate': uniform(0.01, 0.3),      # uniform(loc, scale): learning rate in [0.01, 0.31]
    'classifier__max_depth': randint(3, 15),              # Tree depth from 3 to 14
    'classifier__n_estimators': randint(100, 1000),       # Number of boosting rounds from 100 to 999
    'classifier__subsample': uniform(0.5, 0.5),           # Row fraction in [0.5, 1.0]; values above 1 are invalid
    'classifier__colsample_bytree': uniform(0.5, 0.5),    # Feature fraction in [0.5, 1.0]; values above 1 are invalid
    'classifier__gamma': uniform(0, 0.5),                 # Minimum split loss in [0, 0.5]
    'classifier__reg_alpha': uniform(0, 1),               # L1 regularization in [0, 1]
    'classifier__reg_lambda': uniform(0.5, 1.5)           # L2 regularization in [0.5, 2.0]
}

# Create a pipeline with the preprocessor and XGBoost classifier
pipeline_xgb = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('classifier', model_xgb)
])

# RandomizedSearchCV for hyperparameter tuning
random_search_xgb = RandomizedSearchCV(
    estimator=pipeline_xgb,
    param_distributions=param_dist_xgb,
    n_iter=50,                   # Increased number of parameter settings sampled
    cv=5,                        # Increased cross-validation folds
    scoring='f1_macro',          # Metric for evaluation
    n_jobs=-1,                   # Use all available cores
    random_state=42              # Ensures reproducibility
)

# Fit the RandomizedSearchCV on the training data
random_search_xgb.fit(X_train, y_train)

# Get the best model and print the best parameters
best_model_xgb = random_search_xgb.best_estimator_
print("Best Parameters for XGBoost:", random_search_xgb.best_params_)

# Predict on validation set and evaluate
y_val_pred_xgb = best_model_xgb.predict(X_val)
print("\nXGBoostClassifier Classification Report:")
print(classification_report(y_val, y_val_pred_xgb))

# Calculate F1 macro score
f1_macro_xgb = f1_score(y_val, y_val_pred_xgb, average='macro')
print(f"F1 Macro Score on Validation Set: {f1_macro_xgb:.4f}")

# Threshold Tuning with finer steps for XGBoost
y_proba_xgb = best_model_xgb.predict_proba(X_val)[:, 1]

# Initialize best F1 macro score and best threshold
best_f1_macro = f1_macro_xgb  # Starting with the current F1 score
best_threshold = 0.5          # Initial threshold

# Loop through a finer range of thresholds to find the best one
thresholds_xgb = np.arange(0.3, 0.6, 0.01)  # More granular threshold steps
for threshold in thresholds_xgb:
    y_pred_threshold_xgb = (y_proba_xgb >= threshold).astype(int)
    f1_macro_xgb_threshold = f1_score(y_val, y_pred_threshold_xgb, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_xgb_threshold:.4f}")

    # Update best F1 macro score and best threshold if a better score is found
    if f1_macro_xgb_threshold > best_f1_macro:
        best_f1_macro = f1_macro_xgb_threshold
        best_threshold = threshold

# Print the best F1 macro score and threshold
print(f"Best F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")


### Insights:

1. **XGBoost Model Performance**:  
   - **Class 0** (majority class) performs exceptionally well with a high precision of **94%** and recall of **89%**, ensuring accurate identification of non-subscribers.  
   - **Class 1** (minority class) shows moderate performance with a precision of **54%** and a recall of **71%**, indicating a focus on capturing subscribers at the cost of some false positives.  
   - The overall validation F1 Macro Score is **0.7664**, reflecting a well-balanced model for an imbalanced dataset.

2. **Optimal Hyperparameters**:  
   - The selected parameters (`learning_rate=0.025`, `max_depth=5`, `n_estimators=570`, plus the `reg_alpha` and `reg_lambda` regularization terms) pair a low learning rate with many shallow trees, a combination that generally limits overfitting.

3. **Threshold Tuning Improvement**:  
   - Fine-tuning the threshold leads to an optimal F1 Macro Score of **0.7723** at a threshold of **0.40**. This demonstrates the impact of threshold adjustments in improving the balance between precision and recall for both classes.

4. **Model Strengths and Application**:  
   - XGBoost effectively handles imbalanced datasets, capturing meaningful patterns and achieving the highest F1 Macro Score among all models evaluated.  
   - The ability to tune thresholds allows for flexible optimization depending on the business requirement, e.g., prioritizing fewer false negatives or false positives.
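
One detail matters when defining search ranges like the ones used for XGBoost: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, so the second argument is a width, not an upper bound. XGBoost requires `subsample` and `colsample_bytree` to lie in `(0, 1]`. The sketch below checks the bounds of two candidate distributions; the helper name `bounds` is an assumption for illustration.

```python
from scipy.stats import uniform


def bounds(dist):
    """Lower and upper bound of a frozen continuous distribution."""
    return float(dist.ppf(0)), float(dist.ppf(1))


# Second argument is the width of the interval, not its upper end
too_wide = bounds(uniform(0.5, 0.9))   # upper bound is loc + scale, roughly 1.4
valid = bounds(uniform(0.5, 0.5))      # upper bound is loc + scale = 1.0

print("uniform(0.5, 0.9) spans", too_wide)
print("uniform(0.5, 0.5) spans", valid)
```

Candidates drawn outside the valid range make the corresponding fits fail, so `RandomizedSearchCV` effectively evaluates fewer usable settings than `n_iter` suggests.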

## LightGBM

In [None]:


# Define preprocessing for numeric and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), numerical_cols),
        ('onehot', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ]
)


# Define the full pipeline with preprocessing and the LightGBM classifier
pipeline_lgbm = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(random_state=42, verbose=-1))
])

# Define parameter grid for hyperparameter tuning
param_dist_lgbm = {
    'classifier__n_estimators': [50, 100, 150],
    'classifier__learning_rate': [0.01, 0.1, 0.2],
    'classifier__max_depth': [5, 10, 20],
    'classifier__num_leaves': [31, 50, 100],
    'classifier__min_data_in_leaf': [10, 20, 30]
}

# Perform RandomizedSearchCV
random_search_lgbm = RandomizedSearchCV(
    pipeline_lgbm, param_distributions=param_dist_lgbm, n_iter=10, cv=5,
    scoring='f1_macro', n_jobs=-1, random_state=42
)

# Fit the model with training data
random_search_lgbm.fit(X_train, y_train)

# Get the best model from RandomizedSearchCV
best_lgbm = random_search_lgbm.best_estimator_

# Predict on the validation set
y_val_pred_lgbm = best_lgbm.predict(X_val)
f1_lgbm = f1_score(y_val, y_val_pred_lgbm, average='macro')
print(f"\nBest LightGBM F1 Score (Macro): {f1_lgbm:.4f}")
print("Best Parameters for LightGBM:", random_search_lgbm.best_params_)

# Print classification report for initial threshold of 0.5
print("\nLightGBM Classifier Classification Report (Threshold=0.5):")
print(classification_report(y_val, y_val_pred_lgbm))

# Threshold Tuning
y_proba_lgbm = best_lgbm.predict_proba(X_val)[:, 1]  # Probability for the positive class

# Initialize variables to find the optimal threshold
best_f1_macro = f1_lgbm  # Start with initial F1 score
best_threshold = 0.5     # Default threshold

# Fine-tune threshold to maximize F1 macro score
thresholds_lgbm = np.arange(0.3, 0.6, 0.01)
for threshold in thresholds_lgbm:
    y_pred_threshold_lgbm = (y_proba_lgbm >= threshold).astype(int)
    f1_macro_lgbm_threshold = f1_score(y_val, y_pred_threshold_lgbm, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_lgbm_threshold:.4f}")

    # Update the best F1 macro score and best threshold if a better score is found
    if f1_macro_lgbm_threshold > best_f1_macro:
        best_f1_macro = f1_macro_lgbm_threshold
        best_threshold = threshold

# Print the best F1 macro score and threshold found through tuning
print(f"\nBest F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")


### Insights:

1. **LightGBM Model Performance (Default Threshold of 0.5)**:
   - **Class 0** (non-subscribers) shows strong performance with **89% precision** and **95% recall**, ensuring accurate identification.
   - **Class 1** (subscribers) has moderate precision of **57%** and lower recall at **37%**, indicating more missed detections of true subscribers.
   - The macro-average F1 Score is **0.6855**, reflecting moderate performance on an imbalanced dataset.

2. **Optimal Hyperparameters**:
   - The model's parameters include `num_leaves=100`, `n_estimators=100`, `min_data_in_leaf=10`, `max_depth=10`, and a learning rate of **0.1**, which balance complexity and training efficiency.

3. **Threshold Tuning Impact**:
   - Lowering the threshold to **0.30** improves the F1 Macro Score to **0.7413**, highlighting LightGBM's potential to better balance precision and recall when prioritizing Class 1 (subscribers).
   - However, as the threshold increases, the F1 score decreases, indicating diminished recall for Class 1. Because 0.30 is the lower edge of the searched range, the true optimum may lie below it.

4. **Comparison to Other Models**:
   - LightGBM's best F1 Macro Score after tuning (**0.7413**) is lower than XGBoost (**0.7723**) and Random Forest (**0.7619**) but outperforms KNN (**0.6816**).
   - It may still be a viable choice when training time and interpretability are priorities, and it may improve with further feature engineering techniques.
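
One property of the LightGBM search grid is worth checking: a binary tree limited to depth `d` can have at most `2**d` leaves, so some `num_leaves` values cannot be reached at small `max_depth`. The sketch below enumerates the grid values from the cell above and lists the combinations where `num_leaves` exceeds that cap. It assumes LightGBM's leaf-wise growth respects the depth limit, and the variable names are illustrative.

```python
from itertools import product

# Grid values copied from the LightGBM search space
max_depths = [5, 10, 20]
num_leaves_options = [31, 50, 100]

redundant = []
for max_depth, num_leaves in product(max_depths, num_leaves_options):
    leaf_cap = 2 ** max_depth  # most leaves a tree of this depth can hold
    if num_leaves > leaf_cap:
        redundant.append((max_depth, num_leaves))

print("Combinations where num_leaves exceeds 2**max_depth:", redundant)
```

Combinations flagged here behave like their capped counterparts, so they spend search iterations without exploring a genuinely different model.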

# Comparison of all Models:

In [None]:


# Model names and F1 macro scores
models = ['Logistic Regression', 'Random Forest', 'KNN', 'XGBoost', 'LightGBM']
f1_scores = [0.6436, 0.7619, 0.6816, 0.7723, 0.7334]

# Plotting
plt.figure(figsize=(10, 6))
bars = plt.bar(models, f1_scores, color=['blue', 'green', 'orange', 'red', 'purple'])

# Adding labels and title
plt.xlabel('Models')
plt.ylabel('Best F1 Macro Score After Threshold Tuning')
plt.title('Comparison of F1 Macro Scores for Different Models')

# Adding numbers on top of the bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2, yval + 0.01, f'{yval:.4f}', ha='center', va='bottom', fontsize=12)

plt.ylim(0.6, 0.8)
plt.show()


This bar chart illustrates the **comparison of F1 Macro Scores** achieved by different models after threshold tuning. The models included are:

1. **Logistic Regression**: Achieved the lowest score of **0.6436**.
2. **Random Forest**: Performed well with an F1 Macro Score of **0.7619**, making it one of the top-performing models.
3. **K-Nearest Neighbors (KNN)**: Scored moderately with **0.6816**, better than Logistic Regression but lagging behind tree-based methods.
4. **XGBoost**: Achieved the highest score of **0.7723**, indicating its superior performance among the evaluated models.
5. **LightGBM**: Scored **0.7334**, slightly behind Random Forest and XGBoost but significantly better than KNN and Logistic Regression.

The chart highlights that **XGBoost** is the best-performing model in terms of F1 Macro Score, followed by **Random Forest** and **LightGBM**, while **Logistic Regression** and **KNN** perform less effectively. This analysis emphasizes the dominance of tree-based ensemble methods (XGBoost and Random Forest) for this classification task.
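
The ranking described above can also be derived programmatically instead of being read off the chart. This is a minimal sketch using the same scores as the plotting cell; the dictionary and variable names are assumptions for illustration.

```python
# Best F1 macro scores after threshold tuning, as used in the bar chart
scores = {
    "Logistic Regression": 0.6436,
    "Random Forest": 0.7619,
    "KNN": 0.6816,
    "XGBoost": 0.7723,
    "LightGBM": 0.7334,
}

# Sort from best to worst score
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_score = ranking[0][1]

for rank, (name, score) in enumerate(ranking, start=1):
    gap = best_score - score  # distance to the top model
    print(f"{rank}. {name:<20} {score:.4f} (gap to best: {gap:.4f})")
```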

In [None]:
# Submission of Best Model

In [None]:
# Step 1: Fit the RandomizedSearchCV on the training data
random_search_xgb.fit(X_train, y_train)

# Step 2: Get the best model and print the best parameters
best_model_xgb = random_search_xgb.best_estimator_
print("Best Parameters for XGBoost:", random_search_xgb.best_params_)

# Step 3: Predict probabilities on the validation set
y_proba_xgb = best_model_xgb.predict_proba(X_val)[:, 1]

# Step 4: Threshold tuning
best_threshold = 0.5
best_f1_macro = 0
thresholds_xgb = np.arange(0.3, 0.6, 0.01)  # Fine-grained thresholds from 0.3 to 0.6

for threshold in thresholds_xgb:
    y_pred_threshold_xgb = (y_proba_xgb >= threshold).astype(int)
    f1_macro_xgb_threshold = f1_score(y_val, y_pred_threshold_xgb, average='macro')
    print(f"Threshold: {threshold:.2f}, F1 Macro Score: {f1_macro_xgb_threshold:.4f}")
    if f1_macro_xgb_threshold > best_f1_macro:
        best_f1_macro = f1_macro_xgb_threshold
        best_threshold = threshold

print(f"\nBest F1 Macro Score after Threshold Tuning: {best_f1_macro:.4f} at Threshold: {best_threshold:.2f}")

# Step 5: Predict probabilities on the test set using the best threshold
y_test_proba = best_model_xgb.predict_proba(test)[:, 1]
y_test_pred = (y_test_proba >= best_threshold).astype(int)

# Step 6: Map predictions to "yes" or "no"
y_pred_mapped = np.where(y_test_pred == 1, 'yes', 'no')

# Step 7: Prepare the submission file
submission = pd.DataFrame({
    'id': range(len(y_pred_mapped)),  # If 'id' column is unavailable, generate index-based IDs
    'target': y_pred_mapped
})

# Step 8: Save the submission file
submission.to_csv('submission.csv', index=False, header=True)

# Step 9: Preview the submission file
print("Submission File Preview:")
print(submission.head())
print(submission.shape)
print(submission['target'].value_counts())

print("\nFinal submission file created successfully with the tuned XGBoost model!")
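

The submission steps can be wrapped in a helper that reuses an existing `id` column when the test set has one and falls back to index-based IDs otherwise. This is a sketch under assumptions: the function name `build_submission`, the `id_col` parameter, and the toy data frame are hypothetical, and whether the competition's test file actually contains an `id` column is not confirmed here.

```python
import numpy as np
import pandas as pd


def build_submission(test_df, positive_proba, threshold, id_col="id"):
    """Map positive-class probabilities to 'yes'/'no' labels and attach IDs."""
    labels = np.where(np.asarray(positive_proba) >= threshold, "yes", "no")
    if id_col in test_df.columns:
        ids = test_df[id_col].to_numpy()   # reuse the provided IDs
    else:
        ids = np.arange(len(test_df))      # fall back to row positions
    return pd.DataFrame({"id": ids, "target": labels})


# Toy example (hypothetical data, no 'id' column present)
demo_test = pd.DataFrame({"age": [30, 45, 52]})
demo_submission = build_submission(demo_test, [0.20, 0.41, 0.90], threshold=0.40)
print(demo_submission)
```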


## Conclusion

In this project, I used machine learning techniques to predict the success of bank telemarketing campaigns. After performing data preprocessing and exploratory data analysis (EDA), I experimented with multiple models including Logistic Regression, Random Forest, KNN, LightGBM, and XGBoost.

**Key Points:**
- **XGBoost** achieved the best performance with a final **F1 Macro Score of 0.7723** after fine-tuning the decision threshold.
- The project demonstrated the importance of feature engineering, model selection, and threshold adjustments to improve model performance.

This model can help businesses optimize marketing campaigns, enhancing customer targeting and resource allocation.
