**INTRODUCTION**


**Problem Statement:**
We want to create a system that helps us decide which customers to target in our marketing campaigns for term deposits. This system will predict if a customer is likely to say “yes” or “no” to a term deposit offer.

**OBJECTIVE**

  * Build a system that predicts whether a customer will subscribe to a term deposit.
  * Use this system to select the right customers for marketing campaigns.
  * Improve the efficiency of our marketing efforts and save costs.

**DATA AVAILABILITY**

* This is the classic bank marketing dataset, originally uploaded to the UCI Machine Learning Repository.
* The dataset contains information about a marketing campaign run by a financial institution.
* The task is to analyze it and find strategies for improving the bank's future marketing campaigns.



**DESCRIPTION OF DATASET**


1-age: Represents the age of the individual.

2-job: Describes the occupation or job of the person.

3-marital: Indicates the marital status of the person (e.g., married, single, divorced).

4-education: Represents the educational level of the person (e.g., primary, secondary, tertiary).

5-default: Indicates whether the person has credit in default ('yes', 'no', or 'unknown').

6-balance: Represents the person's average yearly account balance (in euros).

7-housing: Shows whether the person has a housing loan ('yes', 'no', or 'unknown').

8-loan: Indicates whether the person has a personal loan ('yes', 'no', or 'unknown').

9-contact: Describes the method of communication used to contact the person (e.g., 'cellular', 'telephone').

10-day: Indicates the day of the month of the last contact.

11-month: Represents the month of the last contact.

12-duration: Represents the duration of the last contact in seconds.

13-campaign: Indicates the number of contacts made during this campaign.

14-pdays: Describes the number of days since the person was last contacted or -1 if they were not previously contacted.

15-previous: Represents the number of contacts made before this campaign.

16-poutcome: Indicates the outcome of the previous marketing campaign.

17-deposit: The target variable, indicating whether the person subscribed to a term deposit ('yes' or 'no').

# **Data Pre-processing**:
It is the process of preparing raw data and making it suitable for a machine learning model.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**> Importing Dataset**

In [None]:
df = pd.read_csv("/content/drive/MyDrive/LR _DATA SCIENCE/MACHINE LEARNING/projecccccct/bank.csv")
df

# **EDA**
It is a critical step in the data analysis process that involves examining and visualizing datasets to understand their characteristics, patterns, and relationships.
**1. Summarising data 2. Handling missing values 3. Visualization and insights**


In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
#used to generate descriptive statistics of a DataFrame
print(df.describe(include=['object']))
print("*"*100)
print(df.describe(include=['int64']))

We have 7 numerical features and 10 categorical features
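The split between numerical and categorical columns can also be derived from the DataFrame rather than counted by hand. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative assumptions, not rows from bank.csv):

```python
import pandas as pd

# Illustrative stand-in for the bank data (values are made up)
toy = pd.DataFrame({
    "age": [59, 56, 41],
    "job": ["admin.", "technician", "services"],
    "balance": [2343, 45, 1270],
    "deposit": ["yes", "no", "yes"],
})

# Numeric columns vs. everything else (text columns, including the target)
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(len(numeric_cols), "numerical:", numeric_cols)
print(len(categorical_cols), "categorical:", categorical_cols)
```

Applied to `df`, the same two calls should reproduce the counts stated above, with the target `deposit` counted among the categorical columns.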

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

There are no missing values in the dataset.

In [None]:
features=["age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome","deposit"]
for i in features:
 print(df[i].value_counts(),i)
 print('-'*100)

In [None]:
features=["age","job","marital","education","default","balance","housing","loan","contact","day","month","duration","campaign","pdays","previous","poutcome","deposit"]
for i in features:
 print(df[i].unique(),i)
 print('-'*100)

# **Visualisation and insights**

Data visualisation: the graphical representation of data.

Data insights: valuable information and observations gained from data.

In [None]:
# Numerical features
numerical_features = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
# Categorical features
categorical_features = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]
target_variable=['deposit']

# **Finding distribution of categorical feature**

In [None]:
plt.figure(figsize=(15,80))
plotnumber =1
for categorical_feature in categorical_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.countplot(y=categorical_feature,data=df)
    plt.xlabel(categorical_feature)
    plt.title(categorical_feature)
    plotnumber+=1
plt.show()

This code iterates through each categorical feature and creates a separate count plot for each.
Each plot shows how many rows fall into each category of the chosen feature.
The plots are placed in a fixed subplot grid of 12 rows and 3 columns, so only the first nine positions are used.

* The default feature does not seem to play an important role: the value 'no' heavily outnumbers 'yes', so the column can be dropped.

# **Relationship between Categorical Features and Label**

In [None]:
for feature in categorical_features:
    plt.figure(figsize=(10, 6))
    sns.countplot(x=feature, hue="deposit", data=df)
    plt.show()

* Retired clients show high interest in deposits.
* Clients who have a housing loan seem to be less interested in deposits.
* If the outcome of the previous campaign was a success (poutcome=success), there is a high chance that the client will show interest in a deposit.
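Observations like these are read off the count plots by eye; a row-normalised cross-tabulation puts numbers on them. This is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration only):

```python
import pandas as pd

# Made-up rows: housing-loan status and campaign outcome
toy = pd.DataFrame({
    "housing": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "deposit": ["no", "no", "no", "yes", "yes", "yes", "yes", "no"],
})

# Row-normalised crosstab: each row sums to 1, so the 'yes' column is the
# share of clients in that category who subscribed
rates = pd.crosstab(toy["housing"], toy["deposit"], normalize="index")
print(rates)
```

On the real data, replacing `toy` with `df` and `"housing"` with any entry of `categorical_features` would give the subscription rate per category.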

# **Finding distribution of numerical feature**

**HISTOGRAM:**
* It is a graphical representation of the distribution of a dataset.
* It provides frequency distribution of a set of continuous or discrete data

In [None]:
for i in numerical_features:
    plt.figure(figsize=(8, 6))
    plt.hist(df[i], bins=20)
    plt.title('Distribution of ' + i)
    plt.xlabel(i)
    plt.ylabel('Frequency')
    plt.grid(False)
    plt.show()


* The distribution of age appears to be approximately normal.
* The distributions of balance, duration, campaign, pdays, and previous are right-skewed: most values are concentrated at the lower end, and these variables seem to have some outliers on the higher side.
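The "concentrated at the lower end" pattern can be checked numerically as well as visually. A small sketch with made-up values (`toy_duration` is illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up values: most are small, a few are very large (long right tail)
toy_duration = pd.Series([30, 45, 60, 60, 75, 90, 120, 150, 900, 2000])

skewness = toy_duration.skew()                      # > 0 means a longer right tail
gap = toy_duration.mean() - toy_duration.median()   # mean pulled above the median

print("skewness:", round(skewness, 2))
print("mean - median:", gap)
```

A clearly positive skewness, together with a mean well above the median, is consistent with what the histograms show for these columns.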

In [None]:
for i in numerical_features:
    plt.figure(figsize=(8, 6))
    sns.barplot(x='deposit', y=i, data=df)
    plt.title(i + ' vs deposit')
    plt.show()

Clients who had longer conversations and who were contacted before this campaign show more interest in deposits.

**HeatMap**

In [None]:
numerical_df = df[numerical_features]
correlation_matrix = numerical_df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

No pair of numerical features is highly correlated.

**Box plot**

Box-and-whisker plot:

-- the median is shown by the line inside the box
-- the box spans the interquartile range (the middle 50% of the data)

Box plots are used to visualize the distribution of numerical features and to identify outliers.

In [None]:
for column in numerical_features:
    sns.boxplot(data=df, x=column)
    plt.xlabel(column)
    plt.title('Box Plot of '+column)
    plt.grid(False)
    plt.show()

age, balance, duration, campaign, pdays and previous have outliers.

**As per EDA**
* No missing values were found.
* There are 9 categorical features (which need to be encoded later).
* The default feature does not play an important role.
* age, balance, duration, campaign, pdays and previous have outliers.

# **Before dropping the features let us understand the count**

In [None]:
df['default'].groupby(df['default']).count()

In [None]:
df['pdays'].groupby(df['pdays']).count()

In [None]:
df.drop(['default'],axis=1, inplace=True)

In [None]:
df.drop(['pdays'],axis=1, inplace=True)

In [None]:
df.groupby(['deposit', 'balance'])['balance'].count()
#outliers should not be removed: as balance goes up, clients show more interest in deposits

In [None]:
df.groupby(['deposit', 'duration'])['duration'].count()
#outliers should not be removed: as duration goes up, clients show more interest in deposits

In [None]:
df.groupby(['deposit', 'campaign'])['campaign'].count()
#only a small number of rows have very high values, so outliers will be removed here

In [None]:
df.groupby(['deposit', 'previous'])['previous'].count()
#only a small number of rows have very high values, so outliers will be removed here

In [None]:
#function to remove outliers using IQR method
def remove_outliers(df, column):
    Q1=df[column].quantile(0.25)
    Q3=df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

#columns with outliers
outlier_columns=['campaign', 'previous']

In [None]:
#boxplot before and after removal of outliers
for col in outlier_columns:
    plt.figure(figsize=(8,8))

    #boxplot before removing outlier
    plt.subplot(1,2,1)
    df.boxplot(column=col)
    plt.title(col+'  before removing outliers')
    plt.grid(False)
    #filtered copy used only for the plot below; df itself is left unchanged
    data=remove_outliers(df,col)

    #boxplot after removing outlier
    plt.subplot(1,2,2)
    data.boxplot(column=col)
    plt.title(col+'  after removing outliers')
    plt.grid(False)
plt.show()
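The loop above only plots a filtered copy for comparison. If the intent is to actually drop the outlier rows, one possible approach is to filter column by column and carry the result forward. This is a sketch on made-up data, not the notebook's own procedure; `toy` and `cleaned` are illustrative names:

```python
import pandas as pd

def remove_outliers(frame, column):
    """Keep rows whose value in `column` lies within 1.5 * IQR of the quartiles."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    return frame[frame[column].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Made-up data with one extreme value in 'campaign'
toy = pd.DataFrame({
    "campaign": [1, 1, 2, 2, 2, 3, 3, 4, 40],
    "previous": [0, 0, 0, 1, 1, 1, 2, 2, 2],
})

# Filter column by column, carrying the result forward each time
cleaned = toy.copy()
for col in ["campaign", "previous"]:
    cleaned = remove_outliers(cleaned, col)

print(len(toy), "rows before,", len(cleaned), "rows after")
```

Note that the quartiles for the second column are computed on the rows that survived the first filter, so the order of the columns can affect which rows are kept.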

# **Feature Selection**
* Based on the insights gained from EDA and the counts checked before dropping features, a few columns (default and pdays) have been dropped.
* The rest of the features are selected for modelling.

In [None]:
features = ['age', 'job', 'marital', 'education', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'previous', 'poutcome']
X = df[features]
y = df['deposit']


In [None]:
X

In [None]:
y

# **Feature scaling and handling categorical data**

In [None]:
numerical_features=['age', 'balance', 'day', 'duration', 'campaign', 'previous']
categorical_features=['job', 'marital', 'education', 'housing', 'loan', 'contact', 'month', 'poutcome']

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])


In [None]:
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

In [None]:
# ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
    ])

In [None]:
X_preprocessed = preprocessor.fit_transform(X).toarray()

# Splitting the data
Split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y, test_size=0.25, random_state=42)


In [None]:
X_train.shape

In [None]:
X_train

In [None]:
X_test.shape
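In the cells above the preprocessor is fitted on all rows before the split, so the scaling statistics also see the test rows. A possible alternative is to split the raw rows first and let a `Pipeline` fit the preprocessing on the training part only. The sketch below uses made-up data and `LogisticRegression` purely as a placeholder model; the names `toy`, `X_tr`, `X_te` and `model` are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows standing in for the bank data
toy = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29, 60, 44, 35, 55, 41, 27],
    "housing": ["yes", "no", "yes", "no", "yes", "no"] * 2,
    "deposit": ["no", "yes", "no", "yes", "no", "yes"] * 2,
})
X_toy, y_toy = toy[["age", "housing"]], toy["deposit"]

# Split the raw rows first; stratify keeps the yes/no ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

model = Pipeline(steps=[
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["housing"]),
    ])),
    ("clf", LogisticRegression()),
])

# fit() learns the scaling and encoding from the training rows only
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))
```

With this layout, new raw rows can be passed straight to `model.predict` without a separate preprocessing call.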

# Model Selection

Performance evaluation
1. Confusion Matrix: It is a matrix of size 2×2 for binary classification with actual
values on one axis and predicted values on the other.
2. Accuracy Score: Accuracy is the measure of correct predictions made by our model.
It is equal to the number of correct predictions divided by the total number of
predictions made by the model.
3. Precision Score: It is defined as the ratio of true positives to the sum of true and
false positives. It is also known as Positive Predictive Value (PPV).
4. Recall Score: It is defined as the ratio of true positives to the sum of true positives
and false negatives. It is also called True Positive Rate (TPR) or sensitivity.
5. F1 score: It is the harmonic mean of precision and recall. The closer the
value of the F1 score is to 1.0, the better the expected performance of the model is.
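The definitions above can be worked through with a small example. The counts below are made up for illustration and are not results from this notebook:

```python
# Made-up confusion-matrix counts for a binary classifier
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct predictions / all predictions
precision = tp / (tp + fp)                   # how many predicted 'yes' were right
recall = tp / (tp + fn)                      # how many actual 'yes' were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Because the F1 score is a harmonic mean, it sits closer to the lower of precision and recall than a simple average would.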

# 1. SVC
Support vector machines are supervised learning models used for classification and regression
analysis. Here, the model learns from past input data and predicts the output.


In [None]:
from sklearn.svm import SVC
svm_model = SVC()
svm_model.fit(X_train, y_train)


In [None]:
y_pred = svm_model.predict(X_test)


In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
training_score= svm_model.score(X_train,y_train)
training_score

In [None]:
test_score= svm_model.score(X_test,y_test)
test_score

# 2. RANDOM FOREST CLASSIFIER
Random forests, or random decision forests, are an ensemble learning method for classification,
regression and other tasks that operates by constructing a multitude of decision trees at training
time. For classification tasks, the output of the random forest is the class selected by most trees.
In [None]:
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

In [None]:
y_pred1 = rf_model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred1))
print("Classification Report:\n", classification_report(y_test, y_pred1))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred1))

In [None]:
training_score= rf_model.score(X_train,y_train)
training_score

In [None]:
test_score= rf_model.score(X_test,y_test)
test_score

# 3. LOGISTIC REGRESSION
It is one of the most popular Machine Learning algorithms, which comes under the Supervised
Learning technique. It is used for predicting the categorical dependent variable using a given set
of independent variables.


In [None]:
from sklearn.linear_model import LogisticRegression
lcr_model = LogisticRegression()
lcr_model.fit(X_train, y_train)

In [None]:
y_pred2 = lcr_model.predict(X_test)
y_pred2

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred2))
print("Classification Report:\n", classification_report(y_test, y_pred2))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred2))

In [None]:
training_score= lcr_model.score(X_train,y_train)
training_score

In [None]:
test_score= lcr_model.score(X_test,y_test)
test_score

# 4. K-Nearest Neighbor (KNN)
It is used for classification and regression. In both cases, the input consists of the k closest
training examples in a data set. The output depends on whether k-NN is used for classification
or regression.
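For classification, the output is the majority class among the k nearest training points. The following is a from-scratch sketch of that idea on made-up 2-D points; `knn_predict` is a hypothetical helper written for illustration, not part of scikit-learn:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(p, query) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated made-up clusters
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(points, labels, query=(2.0, 2.0)))
print(knn_predict(points, labels, query=(6.0, 6.5)))
```

Since the method relies on distances, features on larger scales dominate the vote, which is one reason the numerical features are standardized before fitting.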

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(X_train, y_train)

In [None]:
y_pred3 = knn_model.predict(X_test)
y_pred3

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred3))
print("Classification Report:\n", classification_report(y_test, y_pred3))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred3))

In [None]:
training_score= knn_model.score(X_train,y_train)
training_score

In [None]:
test_score= knn_model.score(X_test,y_test)
test_score

# 5. Naive Bayes: GaussianNB
In statistics, Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on
applying Bayes' theorem with strong (naive) independence assumptions between the features.
They are among the simplest Bayesian network models, but coupled with kernel density
estimation, they can achieve high accuracy.

In [None]:
from sklearn.naive_bayes import GaussianNB
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

In [None]:
y_pred4 = nb_model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred4))
print("Classification Report:\n", classification_report(y_test, y_pred4))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred4))

In [None]:
training_score= nb_model.score(X_train,y_train)
training_score

In [None]:
test_score= nb_model.score(X_test,y_test)
test_score

# 5. Naive Bayes: BernoulliNB


In [None]:
from sklearn.naive_bayes import BernoulliNB
nb_model1=BernoulliNB()
nb_model1.fit(X_train, y_train)

In [None]:
y_pred5 = nb_model1.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred5))
print("Classification Report:\n", classification_report(y_test, y_pred5))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred5))

In [None]:
training_score= nb_model1.score(X_train,y_train)
training_score

In [None]:
test_score= nb_model1.score(X_test,y_test)
test_score

# 6. Decision Tree
A decision tree is one of the most powerful tools of supervised learning algorithms used for both
classification and regression tasks. It builds a flowchart-like tree structure where each internal
node denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf node (terminal node) holds a class label.
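The flowchart-like structure can be inspected directly with scikit-learn's `export_text`. This is a small sketch on made-up data; the two feature names are borrowed from the dataset, but the values are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: [duration in seconds, number of previous contacts]
X_toy = [[60, 0], [90, 0], [120, 1], [400, 0], [600, 2], [800, 1]]
y_toy = ["no", "no", "no", "yes", "yes", "yes"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Text rendering of the fitted tree: tests on internal nodes, classes at the leaves
rules = export_text(tree, feature_names=["duration", "previous"])
print(rules)
```

Each indented line in the printout is either a test on an attribute or a leaf holding a class label, which matches the description above.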

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt_model=DecisionTreeClassifier()
dt_model.fit(X_train, y_train)

In [None]:
y_pred6 = dt_model.predict(X_test)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred6))
print("Classification Report:\n", classification_report(y_test, y_pred6))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred6))

In [None]:
training_score= dt_model.score(X_train,y_train)
training_score

In [None]:
test_score= dt_model.score(X_test,y_test)
test_score

In [None]:
plt.figure(figsize=(18,5))
# test accuracies (in %) copied by hand from the model cells above
comparison = pd.DataFrame({'Model':['K-Nearest neighbor','Naive Bayes-GNB','Naive Bayes-BNB','Support Vector Machine','Decision Tree','Random Forest','Logistic Regression'],
                           'Accuracy':[71.38,75.63,65.83,71.29,77.42,82.57,74.42]})
sns.barplot(x='Model',y='Accuracy',data=comparison)


# XGBoost

In [None]:
from xgboost import XGBClassifier


In [None]:
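from sklearn.preprocessing import LabelEncoder

# Recent XGBoost versions expect integer class labels, so encode 'no'/'yes' as 0/1
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)

# Instantiate the XGBoost classifier
xgb_model = XGBClassifier()

# Fit the model on the training data (not yet balanced at this point)
xgb_model.fit(X_train, y_train_encoded)

# Make predictions on the test set and map them back to 'no'/'yes'
y_pred_xgb = label_encoder.inverse_transform(xgb_model.predict(X_test))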

# Evaluate the model
print("XGBoost Model:")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_xgb))


# *OBSERVATION*

Here, we can see that the Random Forest model has the highest accuracy score.


Next, we check the performance on a balanced dataset.

**OVERSAMPLING**

Oversampling is a technique used in machine learning to address class imbalance by increasing
the number of instances in the minority class (the less frequent class). This helps to balance the
class distribution, which can lead to better model performance.
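The simplest form of oversampling duplicates minority rows at random until the classes are the same size. The sketch below shows that variant on made-up data using `sklearn.utils.resample`; it is not SMOTE, which creates new synthetic points instead of copies:

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced data: 8 'no' vs 2 'yes'
toy = pd.DataFrame({
    "duration": [60, 80, 95, 110, 130, 150, 170, 200, 640, 820],
    "deposit": ["no"] * 8 + ["yes"] * 2,
})

majority = toy[toy["deposit"] == "no"]
minority = toy[toy["deposit"] == "yes"]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["deposit"].value_counts())
```

Whichever variant is used, oversampling is normally applied to the training set only, so that the test set keeps the original class distribution.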

In [None]:
y_train.value_counts()

SMOTE (Synthetic Minority Over-sampling Technique) is used for data balancing.


In [None]:
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_res, y_res = smote.fit_resample(X_train, y_train)

In [None]:
y_res.value_counts()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_res = scaler.fit_transform(X_res)
# reuse the statistics learned from the resampled training data (transform, not
# fit_transform) and keep X_test itself unchanged for the other models
X_test_bal = scaler.transform(X_test)

In [None]:
# model training for balanced data
rfc1 = RandomForestClassifier()

rfc1.fit(X_res,y_res)

In [None]:
y_pred7=rfc1.predict(X_test_bal)

In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred7))
print("Classification Report:\n", classification_report(y_test, y_pred7))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred7))

In [None]:
training_score= rfc1.score(X_res, y_res)
training_score

In [None]:
test_score= rfc1.score(X_test_bal,y_test)
test_score

Observation
* So we can choose Random Forest trained on the balanced data as the best model.

# HYPER PARAMETER TUNING

Hyperparameter tuning refers to the process of choosing the optimum set of hyperparameters
for a machine learning model. Here I am using RandomizedSearchCV to search for a good set of
hyperparameters.


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier()

# Define hyperparameters and their possible values for tuning
param_dist = {
    'n_estimators': [10, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 11],
    'min_samples_leaf': [1, 5],
}

# Create a randomized search object with 5-fold cross-validation
random_search = RandomizedSearchCV(rf_classifier,param_dist, cv=5, random_state=42)

# Fit the randomized search to the data
random_search.fit(X_train, y_train)

# Print the best parameters found by the randomized search
print("Best Parameters:", random_search.best_params_)

# Get the best model from the randomized search
best_rf_classifier = random_search.best_estimator_

# Make predictions on the test set
y_pred = best_rf_classifier.predict(X_test)


In [None]:
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
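`RandomizedSearchCV` samples a fixed number of combinations (`n_iter`, 10 by default) from the parameter lists, whereas `GridSearchCV` tries every combination. For comparison, here is a sketch of an exhaustive search over a deliberately tiny grid, run on synthetic data; the grid values are illustrative assumptions, not recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)

# Deliberately tiny grid: 2 x 2 = 4 combinations, each scored with 3-fold CV
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="accuracy"
)
grid_search.fit(X_toy, y_toy)

print("Best parameters:", grid_search.best_params_)
print("Best mean CV accuracy:", round(grid_search.best_score_, 3))
```

The cost of a grid search grows with the product of the list lengths, which is why a randomized search is often preferred when there are many hyperparameters.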

# Prediction

In [None]:
# Create a DataFrame with the input data
new_data = pd.DataFrame(
    {'age': [35],
     'job': ['management'],
     'marital': ['divorced'],
     'education': ['tertiary'],
     'balance': [3837],
     'housing': ['no'],
     'loan': ['yes'],
     'contact': ['unknown'],
     'day': [8],
     'month': ['may'],
     'duration': [1084],
     'campaign': [1],
     'previous': [0],  # number of earlier contacts; a count cannot be negative
     'poutcome': ['unknown']}
)

# Apply the same preprocessing steps to the new data
new_data_preprocessed = preprocessor.transform(new_data).toarray()

# Make predictions using the best_rf_classifier
prediction = best_rf_classifier.predict(new_data_preprocessed)

print("Predicted Output:", prediction)
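For campaign targeting, a yes/no label is often less useful than a score that lets customers be ranked. Classifiers such as random forests expose `predict_proba` for this. The sketch below uses synthetic data, where label 1 plays the role of 'yes'; `clf`, `ranking` and `top_10` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: rows are customers, label 1 plays the role of 'yes'
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(X_toy[:150], y_toy[:150])

# Column order of predict_proba follows clf.classes_
yes_index = list(clf.classes_).index(1)
scores = clf.predict_proba(X_toy[150:])[:, yes_index]

# Rank the 50 held-out customers by estimated probability of subscribing
ranking = pd.DataFrame({"customer": range(150, 200), "p_yes": scores})
top_10 = ranking.sort_values("p_yes", ascending=False).head(10)
print(top_10)
```

With the notebook's own model, the equivalent call would be `best_rf_classifier.predict_proba(new_data_preprocessed)`, reading the column that corresponds to 'yes' in `best_rf_classifier.classes_`.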


Conclusion:

* The Random Forest model, trained on a balanced dataset, demonstrated the highest accuracy and balanced performance across precision, recall, and F1-score.
* The final model can be used for predicting whether a customer is likely to subscribe to a term deposit, aiding in targeted marketing efforts.
* Continuous monitoring and periodic updates may be necessary as new data becomes available to maintain the model's relevance and accuracy.