Step 1: Data Collection
Goal: Gather data from various sources (databases, files, APIs, etc.).

Key Actions:

Identify and access relevant data sources. Ensure data formats are compatible with processing tools.

Output: Data collected from the given link, in structured CSV format.
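One way to keep file formats compatible with the processing tools is to route every source through a single loader that picks the pandas reader from the file extension. The sketch below is illustrative only: `load_table` is a hypothetical helper, not part of this notebook, and the small demo file is written to a temporary directory just to exercise it.

```python
import os
import tempfile

import pandas as pd


def load_table(path, sep=","):
    """Hypothetical helper: choose a pandas reader from the file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".csv":
        return pd.read_csv(path, sep=sep)
    if ext in (".xlsx", ".xls"):
        # Reading .xlsx files requires the openpyxl package.
        return pd.read_excel(path)
    raise ValueError(f"Unsupported file type: {ext}")


# Demo on a tiny semicolon-separated file (toy content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as fh:
        fh.write("age;job;y\n58;management;no\n44;technician;no\n")
    demo = load_table(demo_path, sep=";")

print(demo.shape)
```

Raising on an unknown extension, rather than guessing a reader, makes an incompatible source fail early.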

In [1]:
# Import required libraries
import pandas as pd  # data loading and manipulation
import numpy as np   # NumPy: numerical arrays
import random        # random number generation
import os            # file-system paths and working directory

In [2]:
print(os.getcwd())  # display the current working directory

c:\Users\jalpa\DataScience\Bank-Customer_Subscription\bank-subscription-propensity-model\notebooks


In [None]:
# openpyxl is needed by pd.read_excel to read .xlsx files
%pip install pandas openpyxl


In [3]:
# Read the full bank dataset (Excel file)
# Alternative if reading the CSV version instead:
#df = pd.read_csv('../Data/raw/bank-full.csv', sep=',', encoding='latin1')
df = pd.read_excel('../Data/raw/bank-full.xlsx')
print(df.head())
print(df.shape)

# Read the sampled subset of the full bank dataset (semicolon-separated CSV)
df_sub = pd.read_csv('../Data/raw/bank.csv', sep=';')
print(df_sub.head())
print(df_sub.shape)


   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  
(45211, 17)
   age          job  marital  education default  balance housing loan  \
0   30   unemployed  marri

The data relates to direct marketing campaigns of a Portuguese banking institution. 
   The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required 
   in order to assess whether the product (bank term deposit) would be subscribed to or not. 

   There are two datasets: 

      1) bank-full.csv with all examples (45211), ordered by date (from May 2008 to November 2010).
      2) bank.csv with 10% of the examples (4521), randomly selected from bank-full.csv.
      
   The smaller dataset is provided to test more computationally demanding machine learning algorithms (e.g. SVM).
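To illustrate what a 10% random selection looks like in pandas, here is a minimal sketch. It is not the procedure used to produce bank.csv, which is not documented here. The frame is a placeholder with one `row_id` column and the same number of rows as bank-full.csv, and `random_state=42` is an arbitrary seed chosen for reproducibility.

```python
import pandas as pd

# Placeholder frame with as many rows as bank-full.csv (45211).
full = pd.DataFrame({"row_id": range(45211)})

# Draw 10% of the rows at random, without replacement.
subset = full.sample(frac=0.1, random_state=42)

print(len(subset))
```

With 45211 rows, a 10% fraction rounds to 4521 rows, which matches the size reported for bank.csv.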

   The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

1. Number of Instances: 45211 for bank-full.csv (4521 for bank.csv)

2. Number of Attributes: 16 + output attribute.

3. Attribute information:

   Input variables:
   
**bank client data:**

   1 - age (numeric)

   2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
                                       "blue-collar","self-employed","retired","technician","services") 

   3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
   
   4 - education (categorical: "unknown","secondary","primary","tertiary")

   5 - default: has credit in default? (binary: "yes","no")

   6 - balance: average yearly balance, in euros (numeric)

   7 - housing: has housing loan? (binary: "yes","no")

   8 - loan: has personal loan? (binary: "yes","no")

   **related to the last contact of the current campaign:**

   9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

   10 - day: last contact day of the month (numeric)

   11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

   12 - duration: last contact duration, in seconds (numeric)

   **other attributes:**

   13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

   14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

   15 - previous: number of contacts performed before this campaign and for this client (numeric)

   16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
  

  Output variable (desired target):
   17 - y - has the client subscribed to a term deposit? (binary: "yes","no")

4. Missing Attribute Values: None
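Although no attribute values are missing, several categorical attributes listed above (job, education, contact, poutcome) include the literal category "unknown", which can act as a placeholder for missing information. A minimal sketch for counting it per column follows. The toy frame reuses a few columns from the first five rows printed earlier; on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy frame: a subset of columns from the first five rows shown in the output above.
sample = pd.DataFrame({
    "job": ["management", "technician", "entrepreneur", "blue-collar", "unknown"],
    "education": ["tertiary", "secondary", "secondary", "unknown", "unknown"],
    "contact": ["unknown"] * 5,
    "poutcome": ["unknown"] * 5,
    "y": ["no"] * 5,
})

# Count how often the literal string "unknown" appears in each column.
unknown_counts = (sample == "unknown").sum()
print(unknown_counts)
```

Whether "unknown" should be kept as its own category or treated as missing is a modelling decision that this sketch does not settle.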

In [4]:
# Exploring the data completeness and data type of our variables
df.info()


<class 'pandas.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  str  
 2   marital    45211 non-null  str  
 3   education  45211 non-null  str  
 4   default    45211 non-null  str  
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  str  
 7   loan       45211 non-null  str  
 8   contact    45211 non-null  str  
 9   day        45211 non-null  int64
 10  month      45211 non-null  str  
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  str  
 16  y          45211 non-null  str  
dtypes: int64(7), str(10)
memory usage: 5.9 MB


**Initial Observations:**

The datasets contain customer information, including demographics, account details, and outcomes of bank marketing campaigns.

All columns contain 45,211 non-null values, meaning there are no missing values in any column.
This ensures data completeness, which is a positive aspect since no immediate imputation or removal of rows is necessary.

Data types appear consistent, with a mix of integer and string (str) columns.
The target variable (y) indicates whether a customer subscribed to a term deposit.

7 columns are of data type int64 (integer).
These columns are numerical and represent continuous or discrete variables.

10 columns are of data type str (reported as object by older pandas versions).
These are categorical variables and need to be encoded into numerical values if we use them as features in our model.
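As a rough illustration of such encoding, the sketch below maps yes/no columns to 1/0 and one-hot encodes a multi-category column with `pd.get_dummies`. The toy frame and the choice of columns are assumptions made for the example; this is one common approach, not necessarily the encoding used later in this project.

```python
import pandas as pd

# Toy frame with one numeric, one multi-category and two binary columns.
sample = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "yes"],
    "y": ["no", "no", "no"],
})

encoded = sample.copy()

# Binary yes/no columns become 1/0.
for col in ["housing", "y"]:
    encoded[col] = encoded[col].map({"yes": 1, "no": 0})

# Multi-category columns become one indicator column per category.
encoded = pd.get_dummies(encoded, columns=["marital"], dtype=int)

print(encoded)
```

One-hot encoding avoids implying an order between categories, at the cost of adding one column per category.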



In [None]:
# Descriptive statistics for numerical columns.
# Looking at the "spread" of the data helps us identify potential outliers.
# The summary is transposed (.T) so that each variable is a row.
df.describe().T


            count         mean          std      min    25%    50%     75%       max
age       45211.0     40.93621    10.618762     18.0   33.0   39.0    48.0      95.0
balance   45211.0  1362.272058  3044.765829  -8019.0   72.0  448.0  1428.0  102127.0
day       45211.0    15.806419     8.322476      1.0    8.0   16.0    21.0      31.0
duration  45211.0    258.16308   257.527812      0.0  103.0  180.0   319.0    4918.0
campaign  45211.0     2.763841     3.098021      1.0    1.0    2.0     3.0      63.0
pdays     45211.0    40.197828   100.128746     -1.0   -1.0   -1.0    -1.0     871.0
previous  45211.0     0.580323     2.303441      0.0    0.0    0.0     0.0     275.0


**Insights from the Data (statistics output)**

Age: The average client age is approximately 41 years (mean 40.94), ranging from 18 (min) to 95 (max).

Balance: The average balance is €1,362, with large variability (std ≈ €3,045) and extreme values (possible outliers) of -€8,019 and €102,127.

Duration: The average call duration is 258 seconds (approx. 4.3 minutes); the median is 180 seconds (3 minutes).

Campaign: Clients were contacted 2.76 times on average (median 2), and at least 75% of clients were contacted between 1 and 3 times. The maximum of 63 contacts is an extreme value (possible outlier).

Previous: At least 75% of clients had no contacts before this campaign (the 75th percentile of previous is 0 and of pdays is -1). The maximum of 275 contacts is an extreme value (possible outlier).

Possible outliers (to investigate):

Balance: A minimum of -€8,019 and a maximum of €102,127 indicate a few extreme values.

Duration: Calls as short as 0 seconds and as long as 4,918 seconds (approx. 82 minutes) suggest potential outliers.

Campaign: Some clients were contacted up to 63 times, which is unusually high.

Previous: At least one client was contacted 275 times before this campaign.
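One common way to turn "possible outlier" into a concrete threshold is the 1.5 × IQR rule. The sketch below applies it to the quartiles from the summary table above. The `iqr_bounds` helper and the multiplier `k=1.5` are assumptions made for illustration, not a method prescribed by this analysis.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25% and 75% quartiles copied from the df.describe().T output above.
quartiles = {
    "balance": (72.0, 1428.0),
    "duration": (103.0, 319.0),
    "campaign": (1.0, 3.0),
}

bounds = {col: iqr_bounds(q1, q3) for col, (q1, q3) in quartiles.items()}

for col, (lower, upper) in bounds.items():
    print(f"{col}: values below {lower} or above {upper} are flagged")
```

Under this rule the balance fences are -1962 and 3462, so both the minimum (-8019) and the maximum (102127) fall outside them. For pdays and previous the 25% and 75% quartiles coincide, so the fences collapse to a single value and the rule would flag every client with any earlier contact; those columns need a different criterion.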
