# **Portuguese Bank Marketing Solution**

Step 1 : Import packages 



```
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
import pandas as pd 
import seaborn as sns
```



Step 2 : Load dataset


```
url = "https://raw.githubusercontent.com/KaiSun19/LogisticRegression/data/bank_cleaned.csv"
dataset = pd.read_csv(url)
```

Step 3 : Create dummies for qualitative data 


```
job = pd.get_dummies(dataset["job"], drop_first=True)
marital = pd.get_dummies(dataset["marital"], drop_first=True)
education = pd.get_dummies(dataset["education"], drop_first=True)
default = pd.get_dummies(dataset["default"], drop_first=True)
housing = pd.get_dummies(dataset["housing"], drop_first=True)
loan = pd.get_dummies(dataset["loan"], drop_first=True)
```
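Because `default`, `housing`, and `loan` are all yes/no columns, each of their dummy frames ends up with a single column called `yes`, which makes them hard to tell apart once everything is concatenated. One possible way around this, sketched here on a small made-up frame (the `toy` data is an assumption, not the bank dataset), is to pass `prefix` to `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical toy data standing in for three yes/no columns
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
})

# Without a prefix, every dummy frame has one column named "yes"
plain = pd.concat(
    [pd.get_dummies(toy[col], drop_first=True) for col in toy.columns], axis=1
)

# With a prefix, each dummy column keeps the name of its source column
prefixed = pd.concat(
    [pd.get_dummies(toy[col], prefix=col, drop_first=True) for col in toy.columns],
    axis=1,
)

print(list(plain.columns))     # ['yes', 'yes', 'yes']
print(list(prefixed.columns))  # ['default_yes', 'housing_yes', 'loan_yes']
```

Adopting prefixed names in this notebook would also mean listing them in the feature list in place of `"yes"`.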



Step 4: Remove columns that are not customer information and concatenate the dummy variables with the quantitative data



```
dataset.drop(["job", "marital", "education", "default", "housing", "loan", "campaign","pdays","previous","poutcome","response", "duration","day","month", 'Unnamed: 0'] ,axis=1, inplace=True)
datasets = [job,marital,education,default,housing,loan,dataset]
data = pd.concat(datasets,axis=1)
```



Step 5: Analyze correlation matrix 

The full correlation matrix can be found at :

https://github.com/KaiSun19/LogisticRegression/blob/figures/banking_correlation_matrix.png

Only moderate correlations appear between some variables, e.g. "tertiary" and "secondary", or "management" and "tertiary". This is partly due to their role as dummy variables: categories of the same original column are mutually exclusive, so their dummies are correlated by construction. There is therefore no obvious multicollinearity problem in the data.
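A correlation matrix of this kind can be computed with `DataFrame.corr()`. The sketch below uses a small made-up education column (an assumption, not the real data) to illustrate why dummy variables derived from the same categorical column come out negatively correlated: a row can belong to at most one of them.

```python
import pandas as pd

# Hypothetical education levels; the real dataset is much larger
education = pd.Series(
    ["primary", "secondary", "tertiary", "secondary", "tertiary", "secondary"]
)
dummies = pd.get_dummies(education, drop_first=True).astype(int)

# Pearson correlation between the "secondary" and "tertiary" dummy columns
corr = dummies.corr()
print(corr.round(2))
```

On the full data, the same `corr()` call could be passed to `sns.heatmap` to draw a figure like the linked one.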

Step 6 : Identify features and target variables 


```
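# "yes" is listed once: it selects all three identically named dummy columns
# (default, housing and loan), so repeating it would duplicate those features
features = ["blue-collar", "entrepreneur", "housemaid", "management", "other", "retired", "self-employed", "services", "student",
            "technician", "unemployed", "married", "single", "secondary", "tertiary", "yes", "age", "balance"]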
target = "response_binary"
```


Step 7 : Split data into training and testing sets 


```
X = data[features]
y= data[target]
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=0)
```
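When the positive class is rare, a purely random split can leave the training and test sets with different shares of positives. A common option, shown here as a sketch on synthetic labels (the arrays are made up for illustration), is the `stratify` argument of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 negatives and 20 positives (made-up data)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the share of positives the same in both splits
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(yd_train.mean(), yd_test.mean())
```

Both splits keep the original 20% share of positives, which makes the test metrics easier to compare with the training data.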


Step 8: Fit model and test on test data


```
# multi_class is omitted: the target is binary, and the parameter is
# deprecated in recent scikit-learn releases
model = LogisticRegression(solver='liblinear', C=0.05, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```


Step 9 : Evaluate model 


```
cm = confusion_matrix(y_test, y_pred)
class_names = [0, 1]
# Rows are actual labels, columns are predicted labels
cm_df = pd.DataFrame(cm, index=class_names, columns=class_names)
sns.heatmap(cm_df, annot=True, cmap="YlGnBu", fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
```

![Confusion Matrix](https://raw.githubusercontent.com/KaiSun19/LogisticRegression/figures/banking_cm.png)

```
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print("Accuracy :", accuracy)
print("Precision :", precision)
print("Recall :", recall)
```

```
Accuracy : 0.8874742924297326
Precision : 0.0
Recall : 0.0
```
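Scores like these are what an imbalanced test set produces when a model never predicts the positive class. A minimal made-up illustration (the label counts are assumptions chosen for round numbers, not the real test set):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up test labels: 89 negatives, 11 positives
y_true = np.array([0] * 89 + [1] * 11)
# A "model" that predicts negative for everyone
y_all_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_all_negative)
# zero_division=0 returns 0 without a warning when nothing is predicted positive
prec = precision_score(y_true, y_all_negative, zero_division=0)
rec = recall_score(y_true, y_all_negative)

print(acc, prec, rec)
```

Accuracy here equals the share of negatives, while precision and recall are both zero, so accuracy alone says little about how well the positive class is handled.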

Precision and recall of 0 mean that the model produced no true positives: it did not correctly identify a single customer who subscribed to the term deposit. This points to insufficient data on positive outcomes in the raw dataset, so the target classes need to be rebalanced before the true performance of the algorithm can be judged. The accuracy of 0.89 shows that the model predicts negative outcomes well, but it largely reflects the share of negative outcomes in the test set.
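One way to rebalance without resampling is to weight the classes inversely to their frequency via `class_weight="balanced"` in `LogisticRegression`. The sketch below uses synthetic data from `make_classification` (an assumption, not the bank data) and only shows the mechanics; whether it helps on the real dataset would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem: roughly 90% negatives, 10% positives
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

candidates = {
    "plain": LogisticRegression(solver="liblinear", C=0.05, random_state=0),
    "balanced": LogisticRegression(
        solver="liblinear", C=0.05, class_weight="balanced", random_state=0
    ),
}

# Fit both variants and compare recall on the held-out positives
recalls = {}
for name, clf in candidates.items():
    clf.fit(Xs_train, ys_train)
    recalls[name] = recall_score(ys_test, clf.predict(Xs_test))
    print(name, "recall:", recalls[name])
```

Class weighting usually trades some precision for recall, so both metrics should be checked together. Resampling the training set is another option.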