# **Success of Bank telemarketing**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing train dataset
df=pd.read_csv("/kaggle/input/predict-the-success-of-bank-telemarketing/train.csv")

In [None]:
df.head() # Checking 5 rows of train dataset

In [None]:
df.info() # Information of train dataset

In [None]:
df.shape # Dimensions of train dataset

In [None]:
df.describe() # Description of train dataset

# Exploratory Data Analysis (EDA)

## Missing Values

In [None]:
df.isnull().sum()

All columns with missing values are of object dtype:
* job: 229 missing (0.58%)
* education: 1467 missing (3.74%)
* contact: 10336 missing (26.36%)
* poutcome: 29451 missing (75.11%)
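
The counts and percentages above can be produced in one step rather than by hand. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
        "dtype": frame.dtypes.astype(str),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for the training data (values are made up)
toy = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "job": ["admin.", np.nan, "student", "retired"],
    "poutcome": [np.nan, np.nan, np.nan, "success"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper would be called as `missing_summary(df)`.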

## Number of Unique Values in each Column

In [None]:
for column in df.columns:
    print(f'{column}         {df[column].nunique()}')

* Excluding target, the columns default, housing, loan and contact are **binary**, and all have object dtype.
* marital, education and poutcome each have **three** unique values and are also of object dtype.

## Exploring Categorical Features

In [None]:
cat_=df.select_dtypes(include=["object"]).drop(columns=["target"]).columns


In [None]:
print(f'no of cat columns are {len(cat_)} which are: \n ')
for feature in cat_:
    print(feature)

In [None]:
for feature in cat_:
    print('In column "{}" number of unique categories are {}. '.format(feature,len(df[feature].unique())))

## Categorical Feature Distribution

In [None]:
fcat_ = [feature for feature in cat_ if feature != 'last contact date'] #Excluding last_contact_date

In [None]:
# Initialize an empty dictionary to store the results
feature_stats = {}

# Total number of samples
total_samples = len(df)

for feature in fcat_:
    # Count occurrences of each category
    counts = df[feature].value_counts()
    
    # Calculate the ratio of each category to the total samples
    ratios = counts / total_samples
    
    # Store results in the dictionary
    feature_stats[feature] = pd.DataFrame({
        'Count': counts,
        'Ratio': ratios
    })
    
    # Print the results
    print(f"Feature: {feature}")
    print(feature_stats[feature])
    print("\n")

# Example to access a specific feature's stats
# print(feature_stats['specific_feature_name'])


## Pyplot
**Matplotlib's Pyplot** is a Python library module used for creating 2D visualizations like line plots, bar charts, histograms, and scatter plots. It provides a MATLAB-like interface with functions for plotting, labeling, and styling graphs. Pyplot is simple, versatile, and widely used for exploratory data analysis and presentation.

In [None]:
plt.figure(figsize=(15,80))
plotnumber=1
for feature in fcat_:
    ax=plt.subplot(12,3,plotnumber)
    sns.countplot(y=feature,data=df)
    plt.xlabel(feature)
    plt.title(feature)
    plotnumber+=1
plt.show()

* blue-collar (19.83%) is the most frequent job category, followed by management (19.03%).
* In the marital column, married (57.86%) is the largest group; single (28.69%) is about half of that, and divorced (13.44%) is about half of single.
* In the education column, secondary (49.95%) is the most common category and primary (16.69%) the least.
* In the default column, 'no' (94.24%) covers almost all clients, compared with 'yes' (5.76%).
* In the housing column, 'yes' (55.23%) is more common than 'no' (44.76%).
* In the loan column, most clients have no personal loan: 'no' 81.15%, 'yes' 18.85%.
* In the contact column, 'cellular' (63.83%) is the most common category; 'telephone' accounts for 9.80%.
* poutcome, the outcome of the previous marketing campaign, has 'failure' (12.62%) as its most common category, followed by 'other' (6.53%) and 'success' (5.74%).

## Relationship between Categorical features and label

In [None]:
for feature in fcat_:
    sns.catplot(x='target',col=feature,kind='count',data=df)

In [None]:
for feature in fcat_:
    print(f"Feature: {feature}")
    
    # Group by target and feature, then count occurrences
    counts = df.groupby([feature, 'target']).size().unstack(fill_value=0)
    
    # Calculate ratio of 'yes' to 'no' (assuming 'yes' and 'no' are the target labels)
    counts['yes/no ratio'] = counts.get('yes', 0) / counts.get('no', 1)
    
    print(counts)
    print("\n")


* In 'job', 'student' has the highest conversion rate, whereas 'blue-collar', the most frequent category, has the lowest.
* In 'marital', 'divorced' has the highest conversion rate, whereas 'married', the most frequent category, has the lowest.
* In 'education', 'tertiary' has the highest chance of a positive response, whereas 'secondary', the most frequent category, has the lowest.
* Clients with credit in default are rare, as noted above, but have a high conversion rate.
* In 'contact', 'telephone' has a high conversion rate.
* In 'poutcome', which has many missing values, 'success' has a yes/no ratio of 1.417, the only ratio above 1 across all feature categories.
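
The yes/no ratio used in the table is not the same quantity as a conversion rate. If r is the yes/no ratio, the conversion rate is r / (1 + r), so a ratio of 1.417 corresponds to roughly 58.6% of contacts ending in 'yes'. A minimal sketch of computing the rate directly is given here; the helper name `conversion_rate` and the toy rows are assumptions for illustration.

```python
import pandas as pd


def conversion_rate(frame, feature, target="target", positive="yes"):
    """Share of rows with a positive target within each category of `feature`."""
    is_positive = frame[target].eq(positive)
    rates = is_positive.groupby(frame[feature]).agg(["mean", "count"])
    rates.columns = ["conversion_rate", "n"]
    return rates.sort_values("conversion_rate", ascending=False)


# Toy rows (made up) standing in for the training data
toy = pd.DataFrame({
    "job": ["student", "student", "blue-collar",
            "blue-collar", "blue-collar", "blue-collar"],
    "target": ["yes", "no", "no", "no", "yes", "no"],
})
rates = conversion_rate(toy, "job")
print(rates)

# Converting a yes/no ratio into a conversion rate
ratio = 1.417
rate_from_ratio = ratio / (1 + ratio)
print(round(rate_from_ratio, 3))
```

Reporting `n` next to the rate helps when a category is rare, since a high rate on a handful of rows is weak evidence.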

## Exploring Numerical features

In [None]:
num_ = df.select_dtypes(include=["number"]).drop(columns=["target"], errors='ignore').columns

In [None]:
num_

## Checking for Discrete Numerical Features

In [None]:
discrete_feature=[feature for feature in num_ if len(df[feature].unique())<20]


In [None]:
len(discrete_feature)

* None of the numerical columns in num_ has fewer than 20 unique values, so there are no discrete features.
* Hence all numerical features are treated as continuous.

## Distribution of Numerical Features

In [None]:
plt.figure(figsize=(20, 60))  # Set the overall figure size
plotnumber = 1  # Initialize subplot counter

for feature in num_:  # Loop through numerical features
    ax = plt.subplot(12, 3, plotnumber)  # Create subplots
    sns.histplot(df[feature], kde=True)  # Histogram with KDE overlay
    plt.xlabel(feature)  # Set x-axis label
    plt.title(f'Distribution of {feature}')  # Set title
    plotnumber += 1  # Increment subplot counter

plt.tight_layout()  # Adjust layout to prevent overlap
plt.show()  # Display the plots


* Age is roughly normally distributed.
* The remaining numerical columns (balance, duration, campaign, pdays, previous) are right-skewed and may contain outliers.
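
The visual impression of skew can be backed by a number. The sketch below uses synthetic columns (assumed stand-ins, not the real features) to show `DataFrame.skew()` and the effect of a `log1p` transform. Note that `log1p` only applies to values greater than -1, so it would not work as-is on columns with negative values such as balance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: one roughly symmetric column, one with a long right tail
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=40, scale=10, size=5000),
    "right_skewed": rng.exponential(scale=250, size=5000),
})

# Sample skewness: near 0 for symmetric data, clearly positive for a right tail
skewness = toy.skew()
print(skewness.round(2))

# log1p compresses the long right tail of a non-negative feature
log_skew = np.log1p(toy["right_skewed"]).skew()
print(round(log_skew, 2))
```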

## Relationship between Numerical Features and label

In [None]:
plt.figure(figsize=(20,60))
plotnumber=1
for feature in num_:
    ax=plt.subplot(12,3,plotnumber)
    sns.boxplot(x='target',y=df[feature],data=df)
    plt.xlabel(feature)
    plotnumber+=1
plt.show()

* Older clients tend to say yes.
* Higher balances are more often associated with yes outcomes; extreme outliers are present in both the yes and no groups.
* The yes group has higher values for duration, with a higher median and a wider spread. Outliers are present.
* Outliers are present in campaign, pdays and previous.
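
The points drawn beyond the whiskers of a boxplot follow the 1.5 × IQR rule by default. A minimal sketch of that rule is shown here so the outliers can be counted instead of eyeballed; the helper name `iqr_outlier_mask` and the sample values are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up campaign counts with one extreme value
campaign = pd.Series([1, 1, 2, 2, 2, 3, 3, 4, 5, 40], name="campaign")
mask = iqr_outlier_mask(campaign)
outliers = campaign[mask].tolist()
print(outliers)
```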

## Exploring Correlation between numerical features.

In [None]:
cor_mat=df[num_].corr()
fig=plt.figure(figsize=(16,8))
sns.heatmap(cor_mat,annot=True)

* previous and balance have the highest correlation (0.72).
* balance is strongly correlated with campaign (0.67) and duration (0.67), and moderately with pdays (0.56).
* duration is strongly correlated with campaign (0.63).
* Every numerical feature except age is substantially correlated with the others, with coefficients ranging from 0.52 to 0.72.
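
Reading pairs off a heatmap is error-prone, so it can help to list the strongest pairs programmatically. This is a small sketch on synthetic columns; `top_correlations` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n feature pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),            # independent noise
})
top = top_correlations(toy, n=1)
print(top)
```

On the notebook's data the call would be `top_correlations(df[num_])`.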

## Checking Imbalance in Dataset based on Target

In [None]:
sns.countplot(x='target',data=df)
plt.show()

In [None]:
df['target'].groupby(df['target']).count()

In [None]:
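(5827/(5827+33384))*100  # share of 'yes' among all 39211 rows


* The data is highly imbalanced with respect to the target: only 14.86% of rows are 'yes' (equivalently, there are about 17.45 'yes' rows per 100 'no' rows).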
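
The class share can also be read directly from `value_counts(normalize=True)`, which avoids mixing up "yes as a share of all rows" with "yes relative to no". The sketch below rebuilds a target column from the two counts used in the cell above, assuming 33384 is the 'no' count and 5827 the 'yes' count.

```python
import pandas as pd

# Assumed class counts, taken from the numbers used in the cell above
n_no, n_yes = 33384, 5827
target = pd.Series(["no"] * n_no + ["yes"] * n_yes, name="target")

# Share of each class among all rows, in percent
share = target.value_counts(normalize=True) * 100
print(share.round(2))

# Ratio of 'yes' to 'no' rows, in percent (a different quantity)
yes_to_no = n_yes / n_no * 100
print(round(yes_to_no, 2))
```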

# Feature Engineering

It is evident from the EDA that:
* job, education, contact and poutcome have missing values; all of these are categorical features.
* All numerical features have some outliers.
* All numerical features except age are correlated with one another, with coefficients ranging from 0.52 to 0.72.
* The dataset is highly imbalanced with respect to the target label.

In [None]:
# copying data into df2
df2=df.copy()

In [None]:
df2.shape

In [None]:
df2.groupby(['target','default']).size()

In [None]:
df2.groupby(['target','pdays']).size()

In [None]:
29446/39211


* About 75% of the values in pdays are -1.

In [None]:
df2.groupby(['target','balance']).size()

In [None]:
df2.groupby(['target','duration']).size()

In [None]:
#pd.set_option('display.max_rows', None)

In [None]:
df2.groupby(['target','campaign'],sort=True)['campaign'].count()

* For both yes and no target labels, most of the counts are concentrated at the lower campaign values (e.g., 1, 2, 3).
* The higher campaign values  have very low counts relative to the most frequent values (1, 2, 3).
* Hence higher campaign values, say greater than 30, may be considered outliers and are candidates for removal.
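
Two common ways of acting on this observation are dropping the extreme rows or capping the values. Both are sketched here on made-up rows; the name `CAMPAIGN_CAP` and the threshold of 30 follow the suggestion in the text and are assumptions, not tuned values. Rows should only ever be dropped from the training data, since every test row still needs a prediction.

```python
import pandas as pd

CAMPAIGN_CAP = 30  # threshold suggested in the text; an assumption, not a tuned value

# Made-up rows standing in for the training data
toy = pd.DataFrame({
    "campaign": [1, 2, 3, 31, 2, 58],
    "target": ["no", "yes", "no", "no", "yes", "no"],
})

# Option 1: drop rows above the cap (training data only)
trimmed = toy[toy["campaign"] <= CAMPAIGN_CAP].reset_index(drop=True)

# Option 2: keep every row but clip the value at the cap
clipped = toy.assign(campaign=toy["campaign"].clip(upper=CAMPAIGN_CAP))

print(len(trimmed), clipped["campaign"].max())
```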

In [None]:
df2.groupby(['target','previous'],sort=True)['previous'].count()

In [None]:
#pd.reset_option('all')

In [None]:
df2.head()

In [None]:
df2.isna().sum()

## Importing important libraries

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE


## Preprocessing on train dataset

**Preprocessing in machine learning** involves preparing raw data for modeling. Key steps include handling missing values, encoding categorical data, scaling/normalizing features, feature selection, and data transformation. Preprocessing ensures data quality, consistency, and suitability for algorithms, improving model accuracy, performance, and generalization to unseen data.
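
One way to keep these steps consistent between training and test data is to bundle them in a scikit-learn `ColumnTransformer`, which is fitted once on the training rows and then only applied to new rows. The following is a minimal sketch under assumed column names and made-up values; it is not the preprocessing this notebook ends up using.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column subsets for the illustration
numeric_cols = ["age", "balance"]
categorical_cols = ["job", "contact"]

preprocessor = ColumnTransformer([
    # Scale numeric columns to zero mean and unit variance
    ("num", StandardScaler(), numeric_cols),
    # Fill missing categories with an explicit label, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="unknown")),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

train = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "balance": [100.0, -20.0, 560.0, 1200.0],
    "job": ["admin.", np.nan, "student", "retired"],
    "contact": ["cellular", "telephone", np.nan, "cellular"],
})
test = pd.DataFrame({
    "age": [38],
    "balance": [300.0],
    "job": ["entrepreneur"],   # category not seen during fit
    "contact": [np.nan],
})

# Statistics and category lists are learned from the training rows only
X_train_prepared = preprocessor.fit_transform(train)
X_test_prepared = preprocessor.transform(test)
print(X_train_prepared.shape, X_test_prepared.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test data is encoded as all zeros instead of raising an error.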

In [None]:
df3 = df2.copy()

In [None]:

# Convert the last contact date to datetime format
df3['last contact date'] = pd.to_datetime(df3['last contact date'])

# Extract the month as full names
df3['month'] = df3['last contact date'].dt.month_name()

# Extract the numeric day
df3['day'] = df3['last contact date'].dt.day

# Drop the original date column if no longer needed
df3 = df3.drop(columns=['last contact date'])

### Removing null values

In [None]:
# class DataFrameMaker(BaseEstimator, TransformerMixin):
#     def __init__(self, transformer):
#         self.transformer = transformer

#     def fit(self, X, y=None):
#         self.transformer.fit(X)
#         self.columns = [i.split("__")[-1] for i in self.transformer.get_feature_names_out()]
#         return self

#     def transform(self, X):
#         index = X.index
#         X = self.transformer.transform(X)
#         X = pd.DataFrame(X, index=index, columns=self.columns)
#         for col in self.columns:
#             try:
#                 X[col] = pd.to_numeric(X[col])
#             except ValueError:
#                 pass
#         return X

#     def get_feature_names_out(self, names):
#         return names

In [None]:
# # preprocessor=DataFrameMaker(ColumnTransformer([
# #    ('scale',StandardScaler(),["age","day"]),
    
#  #   ('corr_feat',Pipeline([
#  #   ('scaler', StandardScaler()),  
#     ('pca', PCA(n_components=5))  
#      ]),num_cols),
    
#     ('ohe',OneHotEncoder(),["marital","default","housing","loan","month"]),
    
#     ('job-edu',Pipeline([
#         ('impute',SimpleImputer(strategy="most_frequent")),
#         ('ohe',OrdinalEncoder())
#     ]),["education",'job',]),
    
#     ('poutc',Pipeline([
#         ('impute',SimpleImputer(strategy="most_frequent",fill_value="unknown")),
#         ('oe',OrdinalEncoder())
#     ]),["poutcome"]),

#     ('contact',Pipeline([
#         ('impute',SimpleImputer(strategy="most_frequent")),
#         ('ohe',OneHotEncoder())
#     ]),["contact"])
    
    
# ],remainder='passthrough'))

In [None]:
#Encoding categorical variables
Categorial_Data = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome','month']
for cat in Categorial_Data:
    encoder = LabelEncoder()
    df3[cat] = encoder.fit_transform(df3[cat].astype(str))

In [None]:
# Scaling numerical variables
scaler = StandardScaler()
Numerical_Data = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous','day']
df3[Numerical_Data] = scaler.fit_transform(df3[Numerical_Data])

In [None]:
# Drop target column and make it separate data
X=df3.drop(columns=["target"])
y=df3["target"]

In [None]:
# Splitting train data into train and validation sets
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2,random_state=42,stratify=y)

In [None]:
X_train.shape,y_train.shape,X_val.shape,y_val.shape

## Performing SMOTE

In [None]:
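smote = SMOTE(random_state=42)
# Oversample the training split only
X_train1, y_train1 = smote.fit_resample(X_train, y_train)
# Keep the validation split unresampled so it reflects the real class balance
X_val1, y_val1 = X_val.copy(), y_val.copy()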

In [None]:
# Make copies of data
X_train=X_train1.copy()
y_train=y_train1.copy()
X_val=X_val1.copy()
y_val=y_val1.copy()

## Feature engineering on train dataset

**Feature engineering** in machine learning is the process of transforming raw data into meaningful features that improve model performance. It involves creating, modifying, or selecting features through techniques like encoding, scaling, binning, and extraction. Effective feature engineering enhances predictive power, reduces noise, and helps algorithms better understand patterns in the data.

In [None]:
# Add polynomial features
poly = PolynomialFeatures(degree=3, interaction_only=False, include_bias=True)
X_train1 = poly.fit_transform(X_train1)
X_val1 = poly.transform(X_val1)


In [None]:
X_train1.shape
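
The number of columns produced by a full polynomial expansion of degree d on n input features, including the bias term, is C(n + d, d). The sketch below checks this against `PolynomialFeatures`, assuming n = 16 input features (the 9 encoded categorical plus 7 scaled numerical columns listed earlier).

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

n_features = 16  # assumed: 9 encoded categorical + 7 scaled numerical columns
degree = 3

# Number of monomials of total degree <= 3 in 16 variables, bias included
expected = comb(n_features + degree, degree)

# Fitting only inspects the number of input columns, so dummy data is enough
poly_demo = PolynomialFeatures(degree=degree, interaction_only=False, include_bias=True)
poly_demo.fit(np.zeros((2, n_features)))

print(expected, poly_demo.n_output_features_)
```

The expansion grows quickly with both n and d, which is worth keeping in mind for memory use and for solvers that are sensitive to the number of features.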

# Model Selection and Training

## Libraries importing

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV,cross_val_score
from sklearn.metrics import make_scorer, f1_score
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score

def evaluate_models(models, X, y, cv=5):
    """
    Evaluates multiple models using cross-validation with F1-macro scoring.
    
    Parameters:
    - models (dict): A dictionary where keys are model names and values are model objects.
    - X (array-like): Feature matrix.
    - y (array-like): Target vector.
    - cv (int): Number of cross-validation folds (default=5).
    
    Returns:
    - results (dict): A dictionary with model names as keys and mean F1-macro scores as values.
    """

    #y_numeric = y.map({'no': 0, 'yes': 1})
    
    # Define F1-macro scoring
    scorer = make_scorer(f1_score, average='macro')
    
    # Dictionary to store results
    results = {}
    
    for name, model in models.items():
        print(f"Evaluating model: {name}")
        
        # Perform cross-validation
        scores = cross_val_score(model, X, y, cv=cv, scoring=scorer)
        
        # Store mean F1-macro score
        results[name] = scores.mean()
        
        print(f"Scores for {name}: {scores}")
        print(f"Mean F1-macro score for {name}: {scores.mean():.4f}\n")
    
    return results


In [None]:
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(random_state=42),
    "Logistic Regression with Hyperparameter Tuning": LogisticRegression(random_state=42, max_iter=50, solver='sag', C=10, class_weight='balanced', penalty='l2'),
    
}

## Baseline Models on Train Set

In [None]:
#evaluate_models(models,X_train1,y_train1)

OUTPUT (mean F1-macro per model):

* 'Logistic Regression': 0.8283268043772323
* 'Random Forest': 0.907300228894979
* 'Decision Tree': 0.858584536054299
* 'Logistic Regression with Hyperparameter Tuning': 0.831304074805654

## Baseline Models on Validation set

In [None]:
#evaluate_models(models,X_val1, y_val1)

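* Scores for Logistic Regression with Hyperparameter Tuning: [0.7888673  0.83608037 0.84105354 0.84461018 0.84590898]
* Mean F1-macro score for Logistic Regression with Hyperparameter Tuning: 0.8313

Mean F1-macro per model:

* 'Logistic Regression': 0.8283268043772323
* 'Random Forest': 0.907300228894979
* 'Decision Tree': 0.858584536054299
* 'Logistic Regression with Hyperparameter Tuning': 0.831304074805654

These values match the training-set scores digit for digit, so the recorded run was evidently evaluated on data resampled from the training split rather than on an independent hold-out set.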

## HyperParameter tuning

**Hyperparameter tuning** is the process of optimizing a model's settings (hyperparameters) to improve performance. Unlike parameters learned during training (e.g., weights), hyperparameters (e.g., learning rate, depth, or number of layers) are set before training. Techniques like Grid Search, Random Search, and Bayesian Optimization are commonly used for tuning.
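
A minimal grid search sketch on synthetic data follows. It scores candidates with F1-macro, the metric used for the baselines, and restricts the grid to settings that are valid for the default `lbfgs` solver, which supports the l2 penalty but not l1. The data, grid values and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem standing in for the real data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Small grid: regularization strength and class weighting
param_grid = {
    "C": [0.1, 1, 10],
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid,
    scoring="f1_macro",  # same metric as the baseline comparison
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Passing `scoring` explicitly matters here: without it, `GridSearchCV` falls back to the estimator's default score, which is accuracy for classifiers and can be misleading on imbalanced targets.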

In [None]:
# # Add polynomial features
# poly = PolynomialFeatures(degree=3, interaction_only=False, include_bias=True)
# X_train_pt = poly.fit_transform(X_train1)
# X_val_pt = poly.transform(X_val1)

# # Hyperparameter tuning
# param_grid = {'penalty':['l1','l2'],
#               'max_iter':[1, 10, 50, 100],
#                 'C': [0.01, 0.1, 1, 10],
#                 'solver': ['sag', 'lbfgs', 'liblinear', 'saga'],
#                 'class_weight': [None, 'balanced']
# }
# pt_log_reg = GridSearchCV(LogisticRegression(random_state=42), param_grid, cv=5)
# pt_log_reg.fit(X_train_pt, y_train1)

# # Best model
# best_pt_model = pt_log_reg.best_estimator_
# print("Best Parameters:", pt_log_reg.best_params_)

# # Predict and evaluate
# y_pt_pred = best_pt_model.predict(X_val_pt)
# print("Score with parameter tuning:", accuracy_score(y_val1, y_pt_pred))

## This search takes too long to run, so the code above is commented out

* Best Params : max_iter=50, solver='sag', C=10, class_weight='balanced', penalty='l2'

In [None]:
model_param = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier(),
        'param': {
            'n_estimators': [50, 100],
            'criterion': ['gini', 'entropy'],
            'max_depth': range(2, 4, 1),
        }
    },
    'DecisionTreeClassifier': {
        'model': DecisionTreeClassifier(),
        'param': {
            'max_depth': range(2, 4, 1),
            'criterion': ['gini', 'entropy'],
        }
    }
}



In [None]:
scores = []

for model_name, mp in model_param.items():
    model_selection = GridSearchCV(
        estimator=mp['model'], 
        param_grid=mp['param'], 
        cv=5, 
        return_train_score=False
    )
    model_selection.fit(X_train1, y_train1)  # tune on the training split, not the validation split
    scores.append({
        'model': model_name,
        'best_score': model_selection.best_score_,
        'best_params_': model_selection.best_params_
    })



In [None]:
scores

* RandomForestClassifier: best_score ≈ 0.915, best_params_ = {'criterion': 'gini', 'max_depth': 3, 'n_estimators': 50}
* DecisionTreeClassifier: best_score ≈ 0.869, best_params_ = {'criterion': 'gini', 'max_depth': 3}

In [None]:
# Importing test dataset
test=pd.read_csv("/kaggle/input/predict-the-success-of-bank-telemarketing/test.csv")

In [None]:
# Copy test dataset
test1=test.copy()

## Preprocessing on test dataset


In [None]:
# Convert the last contact date to datetime format
test1['last contact date'] = pd.to_datetime(test1['last contact date'])

# Extract the month as full names
test1['month'] = test1['last contact date'].dt.month_name()

# Extract the numeric day
test1['day'] = test1['last contact date'].dt.day

# Drop the original date column if no longer needed
test1 = test1.drop(columns=['last contact date'])

In [None]:
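# Encoding categorical variables with encoders fitted on the training data,
# so every category gets the same integer code in train and test
train_cats = df2.assign(month=pd.to_datetime(df2['last contact date']).dt.month_name())
Categorial_Data = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'poutcome','month']
for cat in Categorial_Data:
    encoder = LabelEncoder()
    encoder.fit(train_cats[cat].astype(str))
    test1[cat] = encoder.transform(test1[cat].astype(str))  # raises if test has an unseen category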

In [None]:
# Scaling numerical variables with the scaler already fitted on the training data
Numerical_Data = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous','day']
test1[Numerical_Data] = scaler.transform(test1[Numerical_Data])

## Feature engineering on test dataset


In [None]:
# Add polynomial features, reusing the transformer already fitted on the training data
test1 = poly.transform(test1)


In [None]:
test1.shape

In [None]:
# Random Forest Classifier Model
model1=RandomForestClassifier(criterion='gini', max_depth= 3, n_estimators=50)
model1.fit(X_train1,y_train1)
model1_pred=model1.predict(test1)

In [None]:
# Decision Tree Classifier Model
model2=DecisionTreeClassifier(criterion='gini', max_depth= 3)
model2.fit(X_train1,y_train1)
model2_pred=model2.predict(test1)

In [None]:
# Logistic Regression Model
final_model = LogisticRegression(random_state=42, max_iter=50, solver='sag', C=10, class_weight='balanced', penalty='l2')
final_model.fit(X_train1,y_train1)
final_pred=final_model.predict(test1)

In [None]:
final_pred

# Submission

In [None]:
submission=pd.DataFrame(columns=['id', 'target'])
submission['id'] = [i for i in range(0,len(final_pred))]
#submission["target"] = ['yes' if x > 0.5 else 'no' for x in final_pred]
submission['target']= final_pred
submission.to_csv('submission.csv', index=False, encoding='utf-8')

In [None]:
sub=pd.read_csv("submission.csv")

In [None]:
sub.head()

In [None]:
import os
print(os.listdir())