
# Predicting Client Subscription to Term Deposits Using Bank Marketing Data

**Description of Dataset**

The dataset used in this study comes from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/222/bank%2Bmarketing). It contains information about a direct marketing campaign conducted by a Portuguese banking institution.

The goal of this study is to predict whether a client will subscribe to a term deposit (target variable: `y`).




## Explore Dataset

In [25]:
# Mount Google Drive so the dataset file can be read from it
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [26]:
import pandas as pd

In [27]:
# Define file path to dataset
file_path = "/content/drive/MyDrive/CM2604 Machine Learning/CW/bank-full.csv"

# Load the dataset
df = pd.read_csv(file_path, sep=';')
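
The `sep=';'` argument matters because the file is semicolon-delimited. The following is a minimal sketch of what happens with and without it. It uses a small in-memory sample (`sample_csv`, a hypothetical name) whose header and two rows are copied from the `head()` output of this notebook, and it assumes unquoted fields for simplicity, which may differ from the real file.

```python
import io
import pandas as pd

# Illustrative sample only: header and two rows taken from the head() output,
# written without quotes (an assumption; the real file may quote its strings).
sample_csv = (
    "age;job;marital;education;default;balance;housing;loan;contact;"
    "day;month;duration;campaign;pdays;previous;poutcome;y\n"
    "58;management;married;tertiary;no;2143;yes;no;unknown;5;may;261;1;-1;0;unknown;no\n"
    "44;technician;single;secondary;no;29;yes;no;unknown;5;may;151;1;-1;0;unknown;no\n"
)

# Default separator (comma): every line is read as a single field
wrong = pd.read_csv(io.StringIO(sample_csv))

# Semicolon separator: each field becomes its own column
right = pd.read_csv(io.StringIO(sample_csv), sep=";")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 17)
```

A column count of 1 right after loading is therefore a quick sign that the separator was not set correctly.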

In [28]:
# View first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

First 5 rows of the dataset:
   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  


In [29]:
# Check shape of dataset
print("Shape of dataset:")
print(f"Number of Rows: {df.shape[0]}")
print(f"Number of Columns: {df.shape[1]}")

Shape of dataset:
Number of Rows: 45211
Number of Columns: 17


In [30]:
# Display information about dataset
print("Dataset information:")
df.info()

Dataset information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [31]:
# Display summary statistics for numerical columns
print("Statistical summary of numerical columns:")
print(df.describe())

Statistical summary of numerical columns:
                age        balance           day      duration      campaign  \
count  45211.000000   45211.000000  45211.000000  45211.000000  45211.000000   
mean      40.936210    1362.272058     15.806419    258.163080      2.763841   
std       10.618762    3044.765829      8.322476    257.527812      3.098021   
min       18.000000   -8019.000000      1.000000      0.000000      1.000000   
25%       33.000000      72.000000      8.000000    103.000000      1.000000   
50%       39.000000     448.000000     16.000000    180.000000      2.000000   
75%       48.000000    1428.000000     21.000000    319.000000      3.000000   
max       95.000000  102127.000000     31.000000   4918.000000     63.000000   

              pdays      previous  
count  45211.000000  45211.000000  
mean      40.197828      0.580323  
std      100.128746      2.303441  
min       -1.000000      0.000000  
25%       -1.000000      0.000000  
50%       -1.000000  
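
In the summary above, the minimum, lower quartile and median of `pdays` are all -1. In this dataset -1 is a placeholder meaning the client was not previously contacted, so it is not a real number of days and it distorts the mean and standard deviation. A minimal sketch of how the placeholder could be separated from the real values is shown below; the `pdays` series here is made-up toy data, not the actual column.

```python
import pandas as pd

# Toy values for illustration only (not the real pdays column)
pdays = pd.Series([-1, -1, -1, 151, 92, -1], name="pdays")

# Boolean mask marking clients with the -1 placeholder
never_contacted = pdays.eq(-1)

# Fraction of rows carrying the placeholder
share = never_contacted.mean()

# Summary statistics restricted to real "days since last contact" values
contacted_summary = pdays[~never_contacted].describe()

print(f"Share of -1 values: {share:.2%}")
print(contacted_summary)
```

On the real data, the same mask could be applied to `df['pdays']` to see how much of the column is placeholder rather than measurement.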

In [32]:
# Identify categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns

# Strip spaces from all categorical columns
for col in categorical_columns:
    df[col] = df[col].str.strip()

In [33]:
# Display summary of unique values for categorical columns
print("Unique values in categorical columns:\n")
for column in categorical_columns:
    print(f"Column: {column}")
    print(df[column].unique())
    print()

Unique values in categorical columns:

Column: job
['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']

Column: marital
['married' 'single' 'divorced']

Column: education
['tertiary' 'secondary' 'unknown' 'primary']

Column: default
['no' 'yes']

Column: housing
['yes' 'no']

Column: loan
['no' 'yes']

Column: contact
['unknown' 'cellular' 'telephone']

Column: month
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']

Column: poutcome
['unknown' 'failure' 'other' 'success']

Column: y
['no' 'yes']
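
Because `y` is the prediction target, it can help to look at how often each class occurs, not only which values exist. The sketch below shows one way to do this with `value_counts`; the series `y` is a made-up toy example, so the proportions say nothing about the real dataset.

```python
import pandas as pd

# Toy target values for illustration only
y = pd.Series(["no", "no", "yes", "no", "no"], name="y")

# Absolute number of rows per class
counts = y.value_counts()

# Relative frequency of each class (sums to 1)
proportions = y.value_counts(normalize=True)

print(counts)
print(proportions)
```

Applied to the notebook's data, this would be `df['y'].value_counts(normalize=True)`.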



In [34]:
# Define target variable
target_variable = 'y'

# Define numerical and categorical features
# (select all numeric dtypes rather than only int64, so float columns would not be missed)
numerical_features = df.select_dtypes(include='number').columns.tolist()

categorical_features = df.select_dtypes(include=['object']).columns.tolist()

# Exclude target variable from categorical features
if target_variable in categorical_features:
    categorical_features.remove(target_variable)

# Display the results
print("Numerical Features:", numerical_features)
print("Categorical Features:", categorical_features)
print("Target Variable:", target_variable)

Numerical Features: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
Categorical Features: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
Target Variable: y
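
As a sanity check on a split like the one above, one could verify that the feature lists and the target together cover every column exactly once. The sketch below does this on a small toy frame (`toy`, a hypothetical name) built from a few values shown in the `head()` output. Here the categorical list is derived as "everything that is neither numeric nor the target", which is an alternative to selecting by `object` dtype and not necessarily how the notebook does it.

```python
import pandas as pd

# Toy frame for illustration: a few columns and values from the head() output
toy = pd.DataFrame({
    "age": [58, 44],
    "job": ["management", "technician"],
    "balance": [2143, 29],
    "y": ["no", "no"],
})
target = "y"

# Numeric columns are selected by dtype
numerical = toy.select_dtypes(include="number").columns.tolist()

# Remaining non-target columns are treated as categorical
categorical = [c for c in toy.columns if c not in numerical and c != target]

# Every column must fall into exactly one group
covered = set(numerical) | set(categorical) | {target}
overlap = set(numerical) & set(categorical)

print("Numerical:", numerical)
print("Categorical:", categorical)
print("All columns covered:", covered == set(toy.columns))
print("No overlap:", not overlap)
```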


## Exploratory Data Analysis

1. **Handle Missing Values**

In [35]:
# Check for missing values in the entire dataset
missing_values = df.isnull().sum()
print("Missing Values per Column:")
print(missing_values)
print("\n")

# Check if there are any missing values
if missing_values.sum() == 0:
    print("No missing values found.")
else:
    print("Missing values have been found and require handling.")


Missing Values per Column:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


No missing values found.
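
Note that `isnull()` only detects true null values. The unique-value listing earlier shows the string `'unknown'` in `job`, `education`, `contact` and `poutcome`, which may act as a placeholder for missing information and is not counted by the check above. A minimal sketch of how such placeholders could be counted follows; the `toy` frame is built from the first five rows shown by `head()`, so the counts are illustrative and not totals for the full dataset.

```python
import pandas as pd

# Toy frame for illustration: values from the first five rows of head()
toy = pd.DataFrame({
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "unknown"],
    "contact": ["unknown"] * 5,
    "marital": ["married", "single", "married", "married", "single"],
})

# Element-wise comparison against the placeholder string
is_unknown = toy.eq("unknown")

# Number and share of 'unknown' entries per column
unknown_counts = is_unknown.sum()
unknown_share = is_unknown.mean()

print(unknown_counts)
print(unknown_share)
```

Whether `'unknown'` should be kept as its own category or treated as missing is a modelling decision that this sketch does not settle.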
