# Bank Marketing
#### Abstract:
The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

#### Data Set Information:
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

### Attribute Information:
#### Bank client data:
* Age (numeric)
* Job : type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown')
* Marital : marital status (categorical: 'divorced', 'married', 'single', 'unknown' ; note: 'divorced' means divorced or widowed)
* Education (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown')
* Default: has credit in default? (categorical: 'no', 'yes', 'unknown')
* Housing: has housing loan? (categorical: 'no', 'yes', 'unknown')
* Loan: has personal loan? (categorical: 'no', 'yes', 'unknown')
#### Related to the last contact of the current campaign:
* Contact: contact communication type (categorical:'cellular','telephone')
* Month: last contact month of year (categorical: 'jan', 'feb', 'mar',…, 'nov', 'dec')
* Dayofweek: last contact day of the week (categorical:'mon','tue','wed','thu','fri')
* Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
#### Other Attributes:
* Campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
* Pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
* Previous: number of contacts performed before this campaign and for this client (numeric)
* Poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
* Emp.var.rate: employment variation rate - quarterly indicator (numeric)
* Cons.price.idx: consumer price index - monthly indicator (numeric)
* Cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* Euribor3m: euribor 3 month rate - daily indicator (numeric)
* Nr.employed: number of employees - quarterly indicator (numeric)
### Output variable (desired target):
* y - has the client subscribed a term deposit? (binary: 'yes', 'no')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
dataset = pd.read_csv('../input/bank-marketing/bank-additional-full.csv', sep = ';')

In [None]:
print(dataset.shape)

In [None]:
print(dataset.head())

In [None]:
print(dataset.info())

There are no null values in the dataset. On the other hand, there are some categorical variables.

# Cleaning The Dataset

In [None]:
dataset = dataset.rename(columns={'y': 'subscribed'})

In [None]:
print(dataset.duplicated().sum())
# There are 12 duplicated rows in the dataset.

In [None]:
print(dataset[dataset.duplicated(keep=False)].iloc[:,:7])

In [None]:
dataset = dataset.drop_duplicates()
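
The duplicate handling above can be sketched on a few made-up rows. The `toy` frame and its values are assumptions for illustration, not rows from the bank data; the sketch only shows how `duplicated()` with and without `keep=False` differs from `drop_duplicates()`.

```python
import pandas as pd

# Toy frame (hypothetical values) with one repeated row.
toy = pd.DataFrame({
    "age": [30, 45, 30, 52],
    "job": ["admin.", "services", "admin.", "retired"],
})

# duplicated() flags only the later copies; keep=False flags every copy.
n_extra = int(toy.duplicated().sum())
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()

print(n_extra, len(all_copies), deduped.shape)
```

This is why the count printed by `duplicated().sum()` can be smaller than the number of rows shown with `keep=False`.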

In [None]:
print(dataset.shape)

In [None]:
print('\033[1mNULL VALUES\033[0m\n'+ str(dataset.isnull().values.any()))

#There are no null values in the dataset.

# Exploratory Data Analysis

In [None]:
Subscribed = pd.DataFrame(dataset['subscribed'].value_counts())
print(Subscribed.T)
pd.DataFrame(dataset['subscribed'].value_counts()).plot(kind='bar', color='lightgreen')
plt.show()
# Most people do not subscribe to a term deposit.
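
A possible way to quantify the imbalance is to look at class proportions rather than raw counts. The sketch below uses a hypothetical `labels` series standing in for the `subscribed` column; the values are made up.

```python
import pandas as pd

# Hypothetical labels standing in for the 'subscribed' column.
labels = pd.Series(["no", "no", "no", "yes", "no", "no", "yes", "no"])

# normalize=True returns class proportions instead of raw counts.
proportions = labels.value_counts(normalize=True)
print(proportions)
```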

In [None]:
plt.figure(figsize=(16,4))

plt.subplot(1,4,1)
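# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(x=dataset['age'], kde=True, stat='density')
plt.title('Age Distribution')

plt.subplot(1,4,2)
# Pass the column by keyword: newer seaborn versions treat the first positional argument as `data`.
sns.countplot(x=dataset['job'])
plt.title('Job Distribution')
plt.xticks(rotation=90)

plt.subplot(1,4,3)
sns.countplot(x=dataset['marital'], color='pink')
plt.title('Marital Status')

plt.subplot(1,4,4)
sns.countplot(x=dataset['education'], color='lightgreen')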
plt.xticks(rotation=90)
plt.title('Education Level')

plt.show()

* Most people are between 20 and 40 years old. Few people are above 60 years old.
* Most people are administrators, technicians or blue-collar workers.
* Most people are married.
* Most people have a university degree.

In [None]:
plt.figure(figsize=(16,5))

plt.subplot(1,3,1)
# Pass the column by keyword: newer seaborn versions treat the first positional argument as `data`.
sns.countplot(x=dataset['default'], palette="Set3")
plt.title('Default Credit')

plt.subplot(1,3,2)
sns.countplot(x=dataset['housing'], palette="Set3")
plt.title('Housing Loan')

plt.subplot(1,3,3)
sns.countplot(x=dataset['loan'], palette="Set3")
plt.title('Loan')

plt.show()

* Most people have no credit in default, and almost none have credit in default.
* More people have a housing loan than do not.
* Most people have no personal loan.

In [None]:
plt.figure(figsize=(18,5))

plt.subplot(1,4,1)
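# Pass the column by keyword: newer seaborn versions treat the first positional argument as `data`.
sns.countplot(x=dataset['contact'], palette="vlag")
plt.title('Contact Type')

plt.subplot(1,4,2)
sns.countplot(x=dataset['month'], palette="vlag", order=['mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'])
plt.title('Month')
plt.xticks(rotation=90)

plt.subplot(1,4,3)
sns.countplot(x=dataset['day_of_week'], palette="vlag")
plt.title('Day Of Week')

plt.subplot(1,4,4)
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(x=dataset['duration'], kde=True, stat='density')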
plt.xticks(rotation=90)
plt.title('Duration of Calls')

plt.show()

* Customers were contacted on almost every day of the week. The variable does not convey extra information, which is why I will drop 'day_of_week' from the dataset.

* Most people are reached on cellular phones.
* Most calls are made in May.
* Call durations are generally between 0 and 1000 seconds.

In [None]:
dataset.drop('day_of_week', axis=1, inplace=True)

In [None]:
plt.figure(figsize=(16,5))

plt.subplot(1,2,1)
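# violinplot has no `kind` argument; x and y are passed by keyword.
sns.violinplot(x="contact", y="campaign", data=dataset)
plt.title('Number of Contacts vs Contact Type')

plt.subplot(1,2,2)
# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(x=dataset['campaign'], kde=True, stat='density')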
plt.title('Number of Contacts with Customers')

plt.show()

The number of contacts performed for a client during this campaign is higher when the contact type is telephone. In addition, the number of contacts mostly falls in the 0-10 range.

In [None]:
plt.figure(figsize=(16,5))

plt.subplot(1,3,1)
# Pass the column by keyword: newer seaborn versions treat the first positional argument as `data`.
sns.countplot(x=dataset['pdays'])
plt.xticks(rotation=90)
plt.title('Number of Days Passed Since Previous Campaign')

plt.subplot(1,3,2)
sns.countplot(x=dataset['previous'])
plt.title('Number of Previous Contacts')

plt.subplot(1,3,3)
sns.countplot(x=dataset['poutcome'])
plt.title('Previous Campaign Result')

plt.show()

* "pdays" shows the number of days that passed after the client was last contacted in a previous campaign (999 means the client was not previously contacted). The plot tells us that almost none of the customers were contacted in a previous campaign. Since most clients were not previously contacted, I will convert this variable into a binary flag: previously contacted or not.

* The plot of previous contacts shows that most people were not contacted previously. That is why previous campaign results do not exist for most customers.

In [None]:
dataset.loc[dataset['pdays'] < 999, 'pdays'] = 1
dataset.loc[dataset['pdays'] == 999, 'pdays'] = 0
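
The two assignments above only give the intended result in this order (values below 999 are mapped first, then 999 itself). A one-step equivalent can be sketched with `np.where`; the `toy` frame and its values are hypothetical, and the new column name is borrowed from the rename in the next cell.

```python
import numpy as np
import pandas as pd

# Toy 'pdays' values; 999 is the dataset's code for "not previously contacted".
toy = pd.DataFrame({"pdays": [999, 3, 999, 6, 0]})

# 1 if the client was contacted in a previous campaign, 0 otherwise.
toy["previouslycontacted"] = np.where(toy["pdays"] < 999, 1, 0)
print(toy)
```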

In [None]:
dataset = dataset.rename(columns={'pdays': 'previouslycontacted', 'previous':'previouscontacts'})

# Data Preparation

In [None]:
bins= [0,10,20,30,40,50,60,70,80,90,100]
labels = [0,1,2,3,4,5,6,7,8,9]
dataset.insert(1, 'agegroup', pd.cut(dataset['age'], bins=bins, labels=labels, right=False))
dataset = dataset.drop('age', axis=1)
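
To see how the bin edges behave, here is a small sketch with made-up ages (the `ages` series is an assumption for illustration). With `right=False` each bin is closed on the left and open on the right, so an age of exactly 20 falls into group 2 and an age of 30 into group 3.

```python
import pandas as pd

ages = pd.Series([17, 20, 29, 30, 59, 98])  # hypothetical ages
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = list(range(10))

# right=False makes each bin [lower, upper).
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())
```

Note that the result has a categorical dtype, so it is not picked up by `select_dtypes(include='object')`.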

In [None]:
#convert categorical variables to numerical variables with Label Encoder from Sklearn.
categorical_columns = dataset.select_dtypes(include='object').columns

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i in categorical_columns:
    dataset[i] = le.fit_transform(dataset[i]) 
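
Because the loop above refits a single encoder, only the mapping of the last column remains available afterwards. A possible variation, sketched here on a hypothetical `toy` frame, keeps one encoder per column so the integer codes can be translated back to the original labels. The `encoders` dictionary is a name introduced for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up values for two of the categorical columns.
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # codes follow sorted label order
    encoders[col] = enc

print(toy)
print(list(encoders["marital"].classes_))
```

The codes are assigned alphabetically, so they imply an ordering that nominal categories such as job or marital status do not really have.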

In [None]:
print(dataset.head())

In [None]:
print(dataset.shape)

### Checking Normal Distribution
Gaussian Naive Bayes assumes that the predictors are continuous rather than discrete, and that within each class their values are sampled from a Gaussian distribution. That is why I will check whether the features are normally distributed.
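
For reference, the assumption can be written out as follows. The notation is introduced here and is not from the original notebook: $x_i$ is the value of feature $i$, $c$ is a class, and $\mu_{c,i}$ and $\sigma_{c,i}^{2}$ are the mean and variance of feature $i$ within class $c$.

$$
P(x_i \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{c,i}^{2}}}\exp\left(-\frac{(x_i - \mu_{c,i})^{2}}{2\sigma_{c,i}^{2}}\right),
\qquad
\hat{y} = \arg\max_{c}\; P(y = c)\prod_{i} P(x_i \mid y = c)
$$

The product over features is where the independence assumption enters, and the Gaussian density is why the per-class feature distributions are inspected below.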

In [None]:
X_train = dataset.iloc[:, :-1].values.astype('float')
y_train = dataset['subscribed'].values

In [None]:
pd.DataFrame(X_train[y_train == 1]).plot(kind='density', ind=100, legend=False)
plt.title('Subscribed Likelihood Plots')

plt.show()

In [None]:
pd.DataFrame(X_train[y_train == 0]).plot(kind='density', ind=100, legend=False)
plt.title('Not Subscribed Likelihood Plots')

plt.show()

In [None]:
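#The data points are not normally distributed. Apply Standard Scaler to put the features on a common scale (zero mean, unit variance).
#Note: standardization is a linear rescaling, so it does not change the shape of each distribution.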
from sklearn.preprocessing import StandardScaler
X_train = pd.DataFrame(StandardScaler().fit_transform(X_train))

In [None]:
X_train[y_train == 1].plot(kind='density', ind=100, legend=False)
plt.title('Subscribed Likelihood Plot after Standardization')
plt.show()

In [None]:
X_train[y_train == 0].plot(kind='density', ind=100, legend=False)
plt.title('Not Subscribed Likelihood Plot after Standardization')
plt.show()
#Discrete features (values such as 1, 2, 3, ...) will not be normally distributed even after standardization.
#However, the features are now on a common scale, which makes the densities easier to compare.
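
One way to check how much standardization changes the shape of a feature is to compare its skewness before and after scaling. The sketch below uses synthetic right-skewed values, not the bank data, and the `skewness` helper is defined here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def skewness(a):
    """Sample skewness: the mean of the cubed z-scores."""
    a = np.asarray(a, dtype=float)
    z = (a - a.mean()) / a.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
raw = rng.exponential(scale=200.0, size=5000)  # synthetic right-skewed values

# StandardScaler expects a 2D array, hence the reshape.
scaled = StandardScaler().fit_transform(raw.reshape(-1, 1)).ravel()

print(round(skewness(raw), 3), round(skewness(scaled), 3))
```

The two skewness values agree up to floating-point error: the scaled feature has zero mean and unit variance, but it is exactly as skewed as before.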

### Checking Correlation Between Features
The relationships between the variables are not linear, which is why I will use the Spearman correlation method.
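
As a small illustration of the difference, the sketch below compares Pearson and Spearman correlation on a synthetic relationship that is monotonic but not linear. The `toy` frame is made up for this example.

```python
import numpy as np
import pandas as pd

x = np.linspace(0.0, 5.0, 50)
toy = pd.DataFrame({"x": x, "y": np.exp(x)})  # monotonic but non-linear

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method="pearson").loc["x", "y"]
spearman = toy.corr(method="spearman").loc["x", "y"]
print(round(pearson, 3), round(spearman, 3))
```

Spearman reports a perfect monotonic relationship here, while Pearson is noticeably lower because the relationship is curved.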

In [None]:
plt.figure(figsize=(12,10))
sns.heatmap(dataset.corr(method='spearman'), cbar=True, cmap="RdBu_r")
plt.title("Correlation Matrix", fontsize=16)
plt.show()

There are some highly correlated variables in the dataset.
Since Naive Bayes assumes that features are independent of each other, I will drop the highly correlated variables.

In [None]:
correlation = X_train.corr(method='spearman').abs()
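# np.bool was removed in NumPy 1.24; the built-in bool is the replacement.
upper = correlation.where(np.triu(np.ones(correlation.shape), k=1).astype(bool))
# Drop a column if it correlates above 0.40 with any earlier column.
to_drop = [column for column in upper.columns if any(upper[column] > 0.40)]
X_train.drop(columns=to_drop, inplace=True)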
print(X_train.shape)
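
The upper-triangle filter can be followed on a tiny made-up frame. The columns `a`, `b` and `c` are assumptions for illustration: `b` is a monotonic function of `a`, while `c` is only weakly related to both.

```python
import numpy as np
import pandas as pd

a = np.arange(10.0)
toy = pd.DataFrame({"a": a, "b": a ** 3, "c": a % 2})

corr = toy.corr(method="spearman").abs()

# Keep only entries strictly above the diagonal so each pair is checked once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates above 0.40 with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.40).any()]
reduced = toy.drop(columns=to_drop)
print(to_drop, reduced.columns.tolist())
```

With this rule the first column of each correlated pair is kept and the later one is dropped, so the result depends on column order.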

# Gaussian Naive Bayes Model


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.25, random_state=42)

In [None]:
from sklearn.naive_bayes import GaussianNB
gb = GaussianNB()
gb.fit(X_train, y_train)
pred = gb.predict(pd.DataFrame(X_test))

### Evaluation of the model

In [None]:
from sklearn.metrics import roc_curve, auc
# Score the held-out test set so the ROC curve reflects unseen data.
gbprob = gb.predict_proba(X_test)[:,1]
fpr, tpr, thr = roc_curve(y_test, gbprob)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Plot')
plt.show()
print(auc(fpr, tpr))

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
print('Accuracy score of Gaussian Naive Bayes:' + str(accuracy_score(y_test,pred)))
print('Confusion Matrix\n' + str(confusion_matrix(y_test, pred)))

The model produced 383 false positives: it predicted that 383 customers would subscribe to a term deposit, but they did not. This number is high. The model should not make this mistake, because otherwise the bank may not call these customers again and may lose them.
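
False positives and false negatives can be read directly from the confusion matrix and summarized as precision and recall. The sketch below uses hypothetical labels, not the notebook's predictions, to show how the cells map to these metrics.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels (1 = subscribed); not the notebook's results.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# For binary labels, ravel() returns tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```

Precision falls as false positives grow, while recall falls as actual subscribers are missed, so reporting both alongside accuracy gives a fuller picture on an imbalanced target.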