

- Dataset Overview

The dataset contains 41,199 instances and 21 attributes (20 features plus the target). The attributes capture demographic, financial, contact-related, campaign-related, and macroeconomic information about clients who were contacted in a telemarketing campaign. The goal is to predict the target variable 'y', a binary indicator of whether the client subscribed to the product (a term deposit).

- Attribute Descriptions

+ Demographic:
    + age: Age of the client (numerical).
    + job: Type of job (categorical).
    + marital: Marital status (categorical).
    + education: Education level (categorical).
+ Financial:
    + default: Has credit in default? (categorical).
    + housing: Has housing loan? (categorical).
    + loan: Has personal loan? (categorical).
+ Contact-Related:
    + contact: Contact communication type (categorical).
    + month: Last contact month of the year (categorical).
    + day_of_week: Last contact day of the week (categorical).
    + duration: Last contact duration in seconds (numerical).
+ Campaign-Related:
    + campaign: Number of contacts performed during this campaign for this client (numerical).
    + pdays: Number of days since the client was last contacted in a previous campaign (numerical; 999 means the client was not previously contacted).
    + previous: Number of contacts performed before this campaign for this client (numerical).
    + poutcome: Outcome of the previous marketing campaign (categorical).
+ Economic Indicators:
    + emp_var_rate: Employment variation rate - quarterly indicator (numerical).
    + cons_price_idx: Consumer price index - monthly indicator (numerical).
    + cons_conf_idx: Consumer confidence index - monthly indicator (numerical).
    + euribor3m: Euribor 3-month rate - daily indicator (numerical).
    + nr_employed: Number of employees - quarterly indicator (numerical).
+ Target Variable:
    + y: Has the client subscribed to a term deposit? (binary, 1 for yes, 0 for no).

- Potential Insights

+ The dataset includes a mix of categorical and numerical features, requiring appropriate preprocessing steps for machine learning.
+ The economic indicators might provide valuable context for understanding how macroeconomic factors influence subscription behavior.
+ The contact-related and campaign-related features can help analyze the effectiveness of different outreach strategies.
+ The demographic and financial features can be used to segment customers and identify potentially high-value targets.

### 1. Dataset Loading

In [32]:
pwd

'/workspaces/Data-Science-with-Python-master/Chapter01/Activities'

In [1]:
import pandas as pd
import numpy as np

# Load the dataset, using the first CSV column as the index
df = pd.read_csv("../data/house_price.csv", index_col=[0])

In [2]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,44.0,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,0
1,53.0,technician,married,unknown,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195.8,0
2,28.0,management,single,university.degree,no,yes,no,cellular,jun,thu,...,3,6,2,success,-1.7,94.055,-39.8,0.729,4991.6,1
3,39.0,services,married,high.school,no,no,no,cellular,apr,fri,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,0
4,55.0,retired,married,basic.4y,no,yes,no,cellular,aug,fri,...,1,3,1,success,-2.9,92.201,-31.4,0.869,5076.2,1


### 2. Feature Engineering


In [35]:
# Dataframe Dimension 
df.shape

(41199, 21)

In [37]:
# DataFrame information: column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41199 entries, 0 to 41198
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41197 non-null  float64
 1   job             41199 non-null  object 
 2   marital         41199 non-null  object 
 3   education       41199 non-null  object 
 4   default         41199 non-null  object 
 5   housing         41199 non-null  object 
 6   loan            41199 non-null  object 
 7   contact         41193 non-null  object 
 8   month           41199 non-null  object 
 9   day_of_week     41199 non-null  object 
 10  duration        41192 non-null  float64
 11  campaign        41199 non-null  int64  
 12  pdays           41199 non-null  int64  
 13  previous        41199 non-null  int64  
 14  poutcome        41199 non-null  object 
 15  emp_var_rate    41199 non-null  float64
 16  cons_price_idx  41199 non-null  float64
 17  cons_conf_idx   41199 non-null  float64

In [36]:
# Statistics 
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,41197.0,40.023812,10.434966,1.0,32.0,38.0,47.0,104.0
duration,41192.0,258.274762,259.270089,0.0,102.0,180.0,319.0,4918.0
campaign,41199.0,2.567514,2.769719,1.0,1.0,2.0,3.0,56.0
pdays,41199.0,962.485206,186.886905,0.0,999.0,999.0,999.0,999.0
previous,41199.0,0.172941,0.494859,0.0,0.0,0.0,0.0,7.0
emp_var_rate,41199.0,0.0819,1.570971,-3.4,-1.8,1.1,1.4,1.4
cons_price_idx,41199.0,93.57565,0.578845,92.201,93.075,93.749,93.994,94.767
cons_conf_idx,41199.0,-40.502002,4.628524,-50.8,-42.7,-41.8,-36.4,-26.9
euribor3m,41199.0,3.621336,1.734431,0.634,1.344,4.857,4.961,5.045
nr_employed,41199.0,5167.036455,72.249592,4963.6,5099.1,5191.0,5228.1,5228.1
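
The summary above shows `age` ranging from 1 to 104 and `pdays` dominated by 999, the value the Bank Marketing data uses for clients who were not previously contacted. Below is a minimal sketch of how such values could be flagged, using a small made-up frame. The name `toy` and the age bounds of 18 and 100 are assumptions for illustration, not rules taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the patterns seen in df.describe(); the values are made up.
toy = pd.DataFrame({
    "age": [44.0, 1.0, 103.0, 28.0],
    "pdays": [999, 999, 6, 3],
})

# Assumed plausibility bounds for a client's age (hypothetical thresholds).
MIN_AGE, MAX_AGE = 18, 100
toy["age_suspect"] = ~toy["age"].between(MIN_AGE, MAX_AGE)

# Split the 999 sentinel into an explicit flag plus a numeric column with NaN.
toy["previously_contacted"] = toy["pdays"] != 999
toy["pdays_clean"] = toy["pdays"].where(toy["previously_contacted"], np.nan)

print(toy)
```

Keeping the flag and the cleaned column separate avoids treating 999 as a real number of days when computing means or scaling the feature.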


### 3. Detect Missing Values

In [38]:
# missing values
df.isna()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41194,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
41195,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
41196,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
41197,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [39]:
# count missing values
df.isna().sum()

age               2
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           6
month             0
day_of_week       0
duration          7
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64

In [47]:
# display rows having missing values 
df[df.isna().any(axis=1)]

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
41185,42.0,admin.,single,university.degree,unknown,yes,yes,telephone,may,wed,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0
41186,48.0,technician,married,professional.course,no,no,yes,telephone,oct,tue,...,2,999,0,nonexistent,-3.4,92.431,-26.9,0.742,5017.5,0
41188,25.0,student,single,high.school,no,no,no,,may,fri,...,4,999,0,nonexistent,1.1,93.994,-36.4,4.859,5191.0,0
41189,103.0,technician,married,high.school,no,no,n,,aug,fri,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.966,5228.1,1
41190,29.0,technician,single,Basic,no,yes,n,,may,mon,...,1,999,0,nonexistent,-1.8,92.893,-46.2,1.299,5099.1,0
41191,44.0,services,married,high.school,unknown,yes,y,,aug,fri,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.966,5228.1,0
41192,39.0,admin.,married,university.degree,no,no,y,cellular,nov,tue,...,2,999,0,nonexistent,-0.1,93.2,-42.0,4.153,5195.8,0
41193,1.0,admin.,married,high.school,no,yes,y,cellular,may,thu,...,4,999,1,failure,-1.8,92.893,-46.2,1.266,5099.1,0
41195,2.0,housemaid,married,Basic,unknown,no,n,,may,thu,...,2,999,0,nonexistent,1.1,93.994,-36.4,4.86,5191.0,0
41196,3.0,admin.,single,university.degree,unknown,yes,y,,may,wed,...,3,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,0


In [48]:
# Fraction of missing values per feature, largest first (multiply by 100 for percentages)
(df.isna().sum()/df.shape[0]).sort_values(ascending=False)

duration          0.000170
contact           0.000146
age               0.000049
job               0.000000
marital           0.000000
default           0.000000
education         0.000000
loan              0.000000
housing           0.000000
month             0.000000
day_of_week       0.000000
campaign          0.000000
pdays             0.000000
previous          0.000000
poutcome          0.000000
emp_var_rate      0.000000
cons_price_idx    0.000000
cons_conf_idx     0.000000
euribor3m         0.000000
nr_employed       0.000000
y                 0.000000
dtype: float64
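
The same per-column fraction can also be computed with `isna().mean()`, because the mean of a boolean column is the share of `True` values. Here is a small sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [44.0, np.nan, 28.0, 39.0],
    "contact": ["cellular", None, None, "telephone"],
    "y": [0, 0, 1, 0],
})

# Fraction of missing values per column, largest first (two equivalent forms).
by_sum = (toy.isna().sum() / toy.shape[0]).sort_values(ascending=False)
by_mean = toy.isna().mean().sort_values(ascending=False)

# Multiply by 100 to report percentages instead of fractions.
print((by_mean * 100).round(2))
```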

In [None]:
# Summarize, for each column, whether it has nulls, how many, and its dtype
null_ = df.isna().any()
dtypes = df.dtypes
sum_na_ = df.isna().sum()
info = pd.concat([null_, sum_na_, dtypes], axis=1, keys=['isNullExist', 'NullSum', 'type'])
info


Unnamed: 0,isNullExist,NullSum,type
age,True,2,float64
job,False,0,object
marital,False,0,object
education,False,0,object
default,False,0,object
housing,False,0,object
loan,False,0,object
contact,True,6,object
month,False,0,object
day_of_week,False,0,object


### 4. Drop NULL Values

In [49]:
# Drop rows having null values
df = df.dropna()


In [51]:

# check missing values 
df.isna().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp_var_rate      0
cons_price_idx    0
cons_conf_idx     0
euribor3m         0
nr_employed       0
y                 0
dtype: int64
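
Dropping rows is reasonable here because only a handful of rows are affected. Where more data would be lost, imputation is a common alternative. Below is a minimal sketch on made-up data, assuming median imputation for numeric columns and the most frequent value for categorical ones. Both choices are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [44.0, np.nan, 28.0, 39.0],
    "duration": [210.0, 138.0, np.nan, 185.0],
    "contact": ["cellular", "cellular", None, "telephone"],
})

filled = toy.copy()

# Numeric columns: replace missing values with the column median.
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

# Categorical columns: replace missing values with the most frequent category.
cat_cols = filled.select_dtypes(exclude="number").columns
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```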

### 5. Feature Value Counts

In [54]:
df.marital.unique()

array(['married', 'single', 'divorced', 'unknown'], dtype=object)

In [57]:
for col in df.select_dtypes(include="object"):
    print(f" Categories. {col}:  {df[col].unique()}")


 Categories. job:  ['blue-collar' 'technician' 'management' 'services' 'retired' 'admin.'
 'housemaid' 'unemployed' 'entrepreneur' 'self-employed' 'unknown'
 'student']
 Categories. marital:  ['married' 'single' 'divorced' 'unknown']
 Categories. education:  ['basic.4y' 'unknown' 'university.degree' 'high.school' 'basic.9y'
 'professional.course' 'basic.6y' 'illiterate']
 Categories. default:  ['unknown' 'no' 'yes']
 Categories. housing:  ['yes' 'no' 'unknown']
 Categories. loan:  ['no' 'yes' 'unknown']
 Categories. contact:  ['cellular' 'telephone']
 Categories. month:  ['aug' 'nov' 'jun' 'apr' 'jul' 'may' 'oct' 'mar' 'sep' 'dec']
 Categories. day_of_week:  ['thu' 'fri' 'tue' 'mon' 'wed']
 Categories. poutcome:  ['nonexistent' 'success' 'failure']


In [53]:
df.keys()

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

In [9]:
df.education.value_counts()

education
university.degree      12167
high.school             9516
basic.9y                6045
professional.course     5242
basic.4y                4176
basic.6y                2292
unknown                 1731
illiterate                18
Name: count, dtype: int64
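
Raw counts are often easier to compare as proportions. Here is a short sketch using `value_counts(normalize=True)` on a few made-up values (the series below is hypothetical):

```python
import pandas as pd

education = pd.Series(
    ["university.degree", "high.school", "university.degree", "unknown"],
    name="education",
)

# normalize=True returns each category's share of the non-missing rows.
shares = education.value_counts(normalize=True)
print(shares)
```

On the notebook's DataFrame the equivalent call would be `df.education.value_counts(normalize=True)`.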

### 6. Encode a Categorical Feature

In [58]:
df.housing.unique()  

array(['yes', 'no', 'unknown'], dtype=object)

In [59]:
# Label Encoding
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df['housing_encoded'] = encoder.fit_transform(df['housing'])

In [70]:
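# Encoding schema: classes_ is sorted, so each label's position is its code;
# unique() lists the codes in order of first appearance in the column
encoder.classes_, f" encoded to  {df.housing_encoded.unique()}"

(array(['no', 'unknown', 'yes'], dtype=object), ' encoded to  [2 0 1]')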

In [71]:
for index, value in enumerate(encoder.classes_):
    print(f"Index {index}: {value}")

Index 0: no
Index 1: unknown
Index 2: yes


In [80]:
# Display the original and the encoded feature side by side
df[["housing", "housing_encoded"]].sample(10)

Unnamed: 0,housing,housing_encoded
29674,yes,2
37354,yes,2
32025,yes,2
32003,yes,2
24264,no,0
28718,yes,2
9244,no,0
6254,yes,2
38249,yes,2
6281,yes,2


In [83]:
df[df.housing_encoded==1][["housing", "housing_encoded"]]

Unnamed: 0,housing,housing_encoded
90,unknown,1
101,unknown,1
102,unknown,1
157,unknown,1
160,unknown,1
...,...,...
41004,unknown,1
41102,unknown,1
41122,unknown,1
41132,unknown,1
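
`LabelEncoder` assigns codes in sorted order of the class labels and can map them back with `inverse_transform`. Below is a small self-contained sketch on made-up values (the `housing` series is hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

housing = pd.Series(["yes", "no", "unknown", "yes"])

encoder = LabelEncoder()
codes = encoder.fit_transform(housing)

# classes_ is sorted, so the position of each label is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The scikit-learn documentation describes `LabelEncoder` as intended for target labels. Integer codes on a nominal feature such as `housing` imply an order (no < unknown < yes) that some models may treat as meaningful, so one-hot encoding is often preferred for input features.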


### 7. Encode All Categorical DataFrame Features

In [94]:
# Drop the housing_encoded feature that was just created
df.drop(["housing_encoded"], axis=1, inplace=True)

In [95]:
df.keys()

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx',
       'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

In [96]:
# Categorical features
categorical_features = df.select_dtypes(exclude="number").columns

# Numerical features (this selection includes the target y)
numerical_features = df.select_dtypes(include="number").columns

In [97]:
categorical_features

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'day_of_week', 'poutcome'],
      dtype='object')

In [98]:
numerical_features

Index(['age', 'duration', 'campaign', 'pdays', 'previous', 'emp_var_rate',
       'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed', 'y'],
      dtype='object')

In [99]:
df_oneHotEncoding = pd.get_dummies(df[categorical_features])

In [100]:
df_oneHotEncoding.head()

Unnamed: 0,job_admin.,job_blue-collar,job_entrepreneur,job_housemaid,job_management,job_retired,job_self-employed,job_services,job_student,job_technician,...,month_oct,month_sep,day_of_week_fri,day_of_week_mon,day_of_week_thu,day_of_week_tue,day_of_week_wed,poutcome_failure,poutcome_nonexistent,poutcome_success
0,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,True,...,False,False,True,False,False,False,False,False,True,False
2,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,True
3,False,False,False,False,False,False,False,True,False,False,...,False,False,True,False,False,False,False,False,True,False
4,False,False,False,False,False,True,False,False,False,False,...,False,False,True,False,False,False,False,False,False,True


In [101]:
# Combine the numerical columns with the one-hot encoded categorical columns
df_result = pd.concat([df[numerical_features], df_oneHotEncoding], axis=1)

In [102]:
df_result.head()

In [103]:
df_result.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41187 entries, 0 to 41194
Data columns (total 64 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   age                            41187 non-null  float64
 1   duration                       41187 non-null  float64
 2   campaign                       41187 non-null  int64  
 3   pdays                          41187 non-null  int64  
 4   previous                       41187 non-null  int64  
 5   emp_var_rate                   41187 non-null  float64
 6   cons_price_idx                 41187 non-null  float64
 7   cons_conf_idx                  41187 non-null  float64
 8   euribor3m                      41187 non-null  float64
 9   nr_employed                    41187 non-null  float64
 10  y                              41187 non-null  int64  
 11  job_admin.                     41187 non-null  bool   
 12  job_blue-collar                41187 non-null  bool   
 13  job_entrepreneur               41187 non-null  bool   
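
`pd.get_dummies` can also take the full frame together with a `columns` list, in which case the listed columns are encoded and the remaining ones are passed through unchanged. Below is a sketch on made-up data. The frame `toy` is hypothetical, and `drop_first=True` and `dtype=int` are optional choices shown for illustration rather than settings from the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [44.0, 53.0, 28.0],
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["nonexistent", "success", "failure"],
    "y": [0, 0, 1],
})

# Encode only the listed categorical columns; the other columns are kept as they are.
encoded = pd.get_dummies(toy, columns=["contact", "poutcome"], dtype=int)
print(encoded.columns.tolist())

# drop_first=True drops one level per feature, which removes the redundant
# column that is fully determined by the others.
reduced = pd.get_dummies(
    toy, columns=["contact", "poutcome"], drop_first=True, dtype=int
)
print(reduced.columns.tolist())
```

Dropping the first level mainly matters for linear models, where perfectly collinear dummy columns can make coefficients hard to interpret; tree-based models are usually unaffected.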

### 8. Split the DataFrame into Training and Test Datasets

In [104]:
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp_var_rate,cons_price_idx,cons_conf_idx,euribor3m,nr_employed,y
0,44.0,blue-collar,married,basic.4y,unknown,yes,no,cellular,aug,thu,...,1,999,0,nonexistent,1.4,93.444,-36.1,4.963,5228.1,0
1,53.0,technician,married,unknown,no,no,no,cellular,nov,fri,...,1,999,0,nonexistent,-0.1,93.2,-42.0,4.021,5195.8,0
2,28.0,management,single,university.degree,no,yes,no,cellular,jun,thu,...,3,6,2,success,-1.7,94.055,-39.8,0.729,4991.6,1
3,39.0,services,married,high.school,no,no,no,cellular,apr,fri,...,2,999,0,nonexistent,-1.8,93.075,-47.1,1.405,5099.1,0
4,55.0,retired,married,basic.4y,no,yes,no,cellular,aug,fri,...,1,3,1,success,-2.9,92.201,-31.4,0.869,5076.2,1


In [111]:
df.iloc[:,-1]

0        0
1        0
2        1
3        0
4        1
        ..
41182    0
41183    0
41184    0
41187    0
41194    0
Name: y, Length: 41187, dtype: int64

In [115]:
# Get the independent variables and the target feature.
# The target is the y column, not the last dummy column of df_result.
X = df_result.loc[:, df_result.columns != "y"]
y = df["y"]

In [118]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [119]:
print("FULL Dataset X Shape: ", X.shape)
print("Train Dataset X Shape: ", X_train.shape)
print("Train Dataset y Shape: ", y_train.shape)
print("Test Dataset X Shape: ", X_test.shape)
print("Test Dataset y Shape: ", y_test.shape)

FULL Dataset X Shape:  (41187, 63)
Train Dataset X Shape:  (28830, 63)
Train Dataset y Shape:  (28830,)
Test Dataset X Shape:  (12357, 63)
Test Dataset y Shape:  (12357,)
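
Without a fixed seed the split differs on every run, and if the target is imbalanced the class proportions can drift between the training and test sets. Below is a sketch on synthetic data, assuming `random_state=42` and `stratify=y`. Both arguments are suggestions for illustration, not settings from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 10% positives (made up for this example).
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 10 + [0] * 90, name="y")

# random_state makes the split reproducible; stratify keeps the class ratio
# of y the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```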
