# Data Dictionary

1 - age (numeric)

2 - job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric) 

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

**Related to the last contact of the current campaign:**

9 - contact: contact communication type (categorical: "unknown","telephone","cellular") 

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

**Other attributes:**

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days since the client was last contacted in a previous campaign (numeric, -1 means the client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

**Output variable (desired target):**

17 - y: has the client subscribed to a term deposit? (binary: "yes","no")

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score, make_scorer, confusion_matrix, classification_report, fbeta_score

# Define seed for repeatability
SEED = 42
np.random.seed(SEED)

In [None]:
# Read in train data
train_df = pd.read_csv('/kaggle/input/banking-dataset-marketing-targets/train.csv', 
                       sep=";")
train_df.head()

In [None]:
# Read in test data
test_df = pd.read_csv('/kaggle/input/banking-dataset-marketing-targets/test.csv', 
                      sep=";")
test_df.head()

# Explore training data

In [None]:
train_df.info()

`info()` reports no null values. According to the data dictionary, however, missing values are recorded as the string `unknown`, so they do not show up as nulls.
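Because `unknown` is an ordinary string, it has to be counted explicitly. The sketch below shows one way to do that on a small made-up frame; the helper name `count_unknown` and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def count_unknown(df, token="unknown"):
    """Return, per non-numeric column, the share of rows equal to `token`."""
    text_cols = df.select_dtypes(exclude="number")
    return text_cols.eq(token).mean().sort_values(ascending=False)


# Hypothetical toy data standing in for train_df
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "student", "unknown"],
    "education": ["primary", "secondary", "unknown", "tertiary"],
    "age": [30, 41, 25, 58],
})

unknown_share = count_unknown(toy)
print(unknown_share)
```

Calling `count_unknown(train_df)` would give the same summary for the real data.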

In [None]:
train_df['age'].describe()

In [None]:
train_df['age'].plot.hist(bins=30, density=True)
plt.show()

In [None]:
train_df['job'].value_counts(normalize=True)

In [None]:
train_df['marital'].value_counts(normalize=True)

In [None]:
train_df['education'].value_counts(normalize=True)

In [None]:
train_df['default'].value_counts(normalize=True)

In [None]:
train_df['balance'].describe()

In [None]:
train_df['balance'].plot.hist(bins=50, density=True)
plt.show()

In [None]:
train_df['balance'].plot.box()
plt.show()

The balance column does contain outliers. There are a few ways to deal with them:
- Cap the values at a maximum, say the 90th percentile, and add a new column that flags whether the original balance exceeded the threshold.
- Leave the values as they are and hope the model is robust to the outliers.
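The first option can be sketched as follows. This is a minimal illustration on a made-up `balance` column; the helper name `clip_with_flag` and the `_outlier` column suffix are assumptions for this example.

```python
import numpy as np
import pandas as pd


def clip_with_flag(df, col, q=90):
    """Cap `col` at its q-th percentile and add a 0/1 `<col>_outlier` flag.

    Returns a modified copy and the threshold, leaving `df` untouched.
    """
    out = df.copy()
    thresh = np.percentile(out[col], q)
    # Record which rows exceeded the threshold before clipping them
    out[f"{col}_outlier"] = (out[col] > thresh).astype(int)
    out[col] = out[col].clip(upper=thresh)
    return out, thresh


# Hypothetical toy data with one extreme value
toy = pd.DataFrame({"balance": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 10000]})
clipped, balance_thresh = clip_with_flag(toy, "balance", q=90)
print(balance_thresh)
print(clipped.tail(3))
```

If this were applied to the real data, the threshold would be computed on the training set and then reused unchanged on the test set.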

In [None]:
train_df['housing'].value_counts(normalize=True)

In [None]:
train_df['loan'].value_counts(normalize=True)

In [None]:
train_df['contact'].value_counts(normalize=True)

In [None]:
train_df['day'].value_counts(normalize=True)

In [None]:
train_df['day'].value_counts(normalize=True).describe()

In [None]:
train_df['month'].value_counts(normalize=True)

Customers are rarely contacted in `December`. This makes sense: it is the holiday season, and customers would not like to be bothered at this time of year. But given that December is a festive month, are the customers who are contacted more likely to say yes?
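One way to approach that question is to compare the share of `yes` answers per month alongside the number of contacts. The sketch below uses invented rows, so its numbers say nothing about the real data; the helper name `yes_rate_by` is an assumption for illustration.

```python
import pandas as pd


def yes_rate_by(df, col, target="y"):
    """Return contact count and share of 'yes' answers for each value of `col`."""
    is_yes = df[target].eq("yes")
    return (
        is_yes.groupby(df[col])
        .agg(contacts="size", yes_rate="mean")
        .sort_values("yes_rate", ascending=False)
    )


# Hypothetical toy data standing in for train_df
toy = pd.DataFrame({
    "month": ["may", "may", "may", "may", "dec", "dec"],
    "y": ["no", "no", "yes", "no", "yes", "no"],
})

rates = yes_rate_by(toy, "month")
print(rates)
```

Keeping the contact count next to the rate matters here, because a high rate in a month with very few contacts is weak evidence.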

In [None]:
train_df['duration'].describe()

In [None]:
train_df['duration'].plot.hist(bins=50, density=True)
plt.show()

In [None]:
train_df['duration'].plot.box()
plt.show()

The distribution of duration is similar to that of balance. We can clip the outliers and create an additional column to record this information.

In [None]:
# Investigate duration values that are 0
train_df[train_df['duration'] == 0]

If the customer was contacted but never answered the phone, the `duration` value can be `0`.

In [None]:
train_df['campaign'].describe()

In [None]:
train_df['campaign'].plot.hist(bins=30, density=True)
plt.show()

In [None]:
train_df['pdays'].describe()

In [None]:
# Investigate first time contact
print(train_df.loc[train_df['pdays'] == -1, 'pdays'].count())
print(train_df.loc[train_df['pdays'] == -1, 'pdays'].count()/train_df.shape[0]*100)

81% of customers were contacted for the first time.

In [None]:
# Investigate repeat contacts
train_df.loc[train_df['pdays'] != -1, 'pdays'].describe()

In [None]:
train_df.loc[train_df['pdays'] != -1, 'pdays'].plot.hist(bins=20, density=True)
plt.show()

In [None]:
train_df['previous'].value_counts()

- The majority of the values are 0. This matches what we found with `pdays`.
- One customer was contacted 275 times. This looks like a data error. We will replace this value with the median of the non-zero values.

In [None]:
train_df.loc[train_df['previous'] == 275, 'previous'] = train_df.loc[train_df['previous'] != 0, 'previous'].median()
train_df['previous'].describe()

In [None]:
np.percentile(train_df.loc[train_df['previous'] != 0, 'previous'], q=95)

In [None]:
train_df.loc[train_df['previous'] >= 9, 'y'].value_counts()

In [None]:
train_df['poutcome'].value_counts(normalize=True)

In [None]:
train_df.groupby(['poutcome', 'y'])['y'].count()

We first build a machine learning model using the data as it is. Once this baseline is set, we can apply feature engineering techniques to try to improve the model's performance.

In [None]:
# Create train_x dataframe
train_x = train_df.iloc[:, :-1]
train_x.head()

In [None]:
# Create train_y dataframe (copy so that later edits do not act on a slice of train_df)
train_y = train_df[['y']].copy()
train_y.head()

In [None]:
train_x.info()

In [None]:
# Get a list of columns for one-hot encoding
ohe_cols = list(train_x.select_dtypes(include='object').columns.values)

# We want to label encode education
le_col = ['education']

# Remove education from the one-hot list; it is encoded separately below
ohe_cols.remove('education')
ohe_cols

In [None]:
train_x = pd.get_dummies(train_x, prefix=ohe_cols, columns=ohe_cols, drop_first=True)
train_x.head()

In [None]:
# Perform label encoding on education
ed_cat = {'unknown': 0, 
          'primary': 1,
          'secondary': 2,
          'tertiary': 3}
train_x['education'] = train_x['education'].replace(ed_cat)
train_x['education'].value_counts(normalize=True)

In [None]:
train_x.head()

In [None]:
# Encode target variable
y_cat = {'no': 0, 
         'yes': 1}
train_y['y'] = train_y['y'].replace(y_cat)
train_y['y'].value_counts(normalize=True)

In [None]:
# Create the test_x dataframe
test_x = test_df.iloc[:, :-1]

# Create the test_y dataframe (copy so that later edits do not act on a slice of test_df)
test_y = test_df[['y']].copy()

# One-hot encode columns
test_x = pd.get_dummies(test_x, prefix=ohe_cols, columns=ohe_cols, drop_first=True)

# Label encode education
test_x['education'] = test_x['education'].replace(ed_cat)

# Encode target variable
test_y['y'] = test_y['y'].replace(y_cat)

In [None]:
test_x.head()

In [None]:
test_y.head()
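Encoding the train and test sets with separate `pd.get_dummies` calls only yields matching columns if every category appears in both files. A minimal sketch of guarding against a mismatch is shown below, using made-up frames; the names `train_toy`, `test_toy`, and `test_aligned` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: the test set lacks the "retired" category
train_toy = pd.DataFrame({"age": [30, 45, 52],
                          "job": ["admin.", "student", "retired"]})
test_toy = pd.DataFrame({"age": [38, 27],
                         "job": ["admin.", "student"]})

# Train: drop the first level of each encoded column, as in the notebook
train_enc = pd.get_dummies(train_toy, columns=["job"], drop_first=True)

# Test: keep every level, then let the train columns decide what survives.
# Levels missing from the test set are added as all-zero columns, and
# levels that the train encoding dropped are discarded.
test_enc = pd.get_dummies(test_toy, columns=["job"])
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

Encoding the test set without `drop_first` before reindexing avoids the case where a different baseline category would be dropped in each file.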

# Decision Tree Classifier

In [None]:
# Define the model
dc = DecisionTreeClassifier(max_depth=30, min_samples_split=10, min_samples_leaf=10,
                            random_state=SEED, class_weight="balanced")

In [None]:
# Define a scorer
rs = make_scorer(recall_score)

# Cross validation
cv = cross_val_score(dc, train_x, train_y['y'], cv=10, n_jobs=-1, scoring=rs)
print("Cross validation scores: {}".format(cv))
print("%0.2f recall with a standard deviation of %0.2f" % (cv.mean(), cv.std()))

In [None]:
# Fit the model on the complete train dataset
dc.fit(train_x, train_y['y'])

In [None]:
# Get predictions from the train dataset
pred = dc.predict(train_x)
print("The train recall score is {}".format(np.round(recall_score(train_y, pred), 2)))

In [None]:
plt.title("Confusion matrix on Train set")
ax = sns.heatmap(confusion_matrix(train_y, pred), annot=True, fmt='d')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.yticks(rotation=0)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.show()
print(classification_report(train_y, pred))

In [None]:
# Get predictions from the test dataset
pred = dc.predict(test_x)
print("The test recall score is {}".format(np.round(recall_score(test_y, pred), 2)))

In [None]:
plt.title("Confusion matrix on Test set")
ax = sns.heatmap(confusion_matrix(test_y, pred), annot=True, fmt='d')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.yticks(rotation=0)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.show()
print(classification_report(test_y, pred))

# Random Forest Classifier

In [None]:
rf = RandomForestClassifier(n_jobs=-1, random_state=SEED, class_weight="balanced_subsample")

# Define a scorer
rs = make_scorer(recall_score)

# Cross validation
cv = cross_val_score(rf, train_x, train_y['y'], cv=10, n_jobs=-1, scoring=rs)
print("Cross validation scores: {}".format(cv))
print("%0.2f recall with a standard deviation of %0.2f" % (cv.mean(), cv.std()))

# Fit the model on the complete train dataset
rf.fit(train_x, train_y['y'])

# Get predictions from the train dataset
pred = rf.predict(train_x)
print("The train recall score is {}".format(np.round(recall_score(train_y, pred), 2)))

plt.title("Confusion matrix on Train set")
ax = sns.heatmap(confusion_matrix(train_y, pred), annot=True, fmt='d')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.yticks(rotation=0)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.show()
print(classification_report(train_y, pred))

# Get predictions from the test dataset
pred = rf.predict(test_x)
print("The test recall score is {}".format(np.round(recall_score(test_y, pred), 2)))

plt.title("Confusion matrix on Test set")
ax = sns.heatmap(confusion_matrix(test_y, pred), annot=True, fmt='d')
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.yticks(rotation=0)
plt.xlabel("Predicted Labels")
plt.ylabel("Actual Labels")
plt.show()
print(classification_report(test_y, pred))

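We achieved a perfect score on the test set using only basic one-hot and label encoding of the categorical variables, with no complex feature engineering. A perfect score is rare on real data, so it is worth confirming that the test rows are genuinely unseen before treating this as the model's true performance.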
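One quick sanity check for a result like this is to count how many test rows also occur in the training data. The sketch below does so with a merge on all shared columns, using made-up frames; the helper name `count_shared_rows` is an assumption for illustration.

```python
import pandas as pd


def count_shared_rows(train, test):
    """Count test rows that also occur, value for value, in train."""
    shared_cols = [c for c in test.columns if c in train.columns]
    # De-duplicate train so each test row matches at most once
    train_unique = train[shared_cols].drop_duplicates()
    merged = test[shared_cols].merge(
        train_unique, on=shared_cols, how="left", indicator=True
    )
    return int((merged["_merge"] == "both").sum())


# Hypothetical toy data: one of the two test rows is copied from train
train_toy = pd.DataFrame({"age": [30, 45, 52],
                          "job": ["admin.", "student", "retired"],
                          "y": ["no", "yes", "no"]})
test_toy = pd.DataFrame({"age": [45, 27],
                         "job": ["student", "student"],
                         "y": ["yes", "no"]})

n_shared = count_shared_rows(train_toy, test_toy)
print(f"{n_shared} of {len(test_toy)} test rows also appear in train")
```

Running `count_shared_rows(train_df, test_df)` on the raw frames, before any encoding, would show whether the test score is measured on rows the model has already seen.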

In [None]:
# # Determine the maximum permissible balance value
# balance_thresh = np.percentile(train_df['balance'], q=90)

# # Create a new column that indicates balance > threshold
# train_df['balance_outliers'] = np.where(train_df['balance'] > balance_thresh, 1, 0)
# train_df['balance_outliers'].value_counts(normalize=True)

In [None]:
# # Clip balance values with the threshold
# train_df['balance'] = train_df['balance'].clip(upper=balance_thresh)
# train_df['balance'].describe()

In [None]:
# # Check balance values that are negative
# train_df.loc[train_df.balance < 0, 'balance'].describe()

In [None]:
# # Determine the maximum permissible value for duration
# duration_thresh = np.percentile(train_df['duration'], 90)

# # Create a new column that indicates duration > threshold
# train_df['duration_outliers'] = np.where(train_df['duration'] > duration_thresh, 1, 0)
# train_df['duration_outliers'].value_counts(normalize=True)

In [None]:
# # Clip duration values with the threshold
# train_df['duration'] = train_df['duration'].clip(upper=duration_thresh)
# train_df['duration'].describe()

In [None]:
# # Determine the maximum permissible value for campaign
# campaign_thresh = np.percentile(train_df['campaign'], 90)

# # Create a new column that indicates campaign > threshold
# train_df['campaign_outliers'] = np.where(train_df['campaign'] > campaign_thresh, 1, 0)
# train_df['campaign_outliers'].value_counts(normalize=True)

In [None]:
# # Clip campaign values with the threshold
# train_df['campaign'] = train_df['campaign'].clip(upper=campaign_thresh)
# train_df['campaign'].describe()