<a href="https://colab.research.google.com/github/MattLeRoi/new_project/blob/main/index.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bank

Dataset information:
The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').

1 - age (numeric)

2 - job: type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student","blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

Related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

Other attributes:

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed after the client was last contacted in a previous campaign (numeric, -1 means the client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y: has the client subscribed to a term deposit? (binary: "yes","no")

Missing Attribute Values: None

In [None]:
# !pip install lightgbm
# !pip install catboost
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import xgboost as xgb
from xgboost import XGBClassifier

import lightgbm as lgbm
from lightgbm import LGBMClassifier

from catboost import CatBoostClassifier, Pool

import matplotlib.pyplot as plt

In [None]:
df=pd.read_csv('bank-full.csv', delimiter=';')
df

In [None]:
for col in df.columns:
    plt.figure()
    plt.title(col)
    plt.hist(df[col]);

In [None]:
df.y.value_counts()
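
The raw counts above are easier to judge as class shares, since a skewed target makes plain accuracy a weak metric. A minimal sketch on made-up labels (the `y_toy` series and its 8/2 split are illustrative assumptions, not the dataset's actual counts):

```python
import pandas as pd

# Made-up target column: 8 "no" and 2 "yes" (illustrative only)
y_toy = pd.Series(["no"] * 8 + ["yes"] * 2)

# normalize=True returns fractions instead of raw counts
share = y_toy.value_counts(normalize=True)
print(share)
```

On the real data the same call would be `df.y.value_counts(normalize=True)`.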

In [None]:
# Drop the target and pdays: the -1 sentinel in pdays ("never contacted") distorts its numeric scale,
# and previous == 0 already flags clients with no prior contact
X = df.drop(['y', 'pdays'], axis=1)
y = [1 if target_y_n == "yes" else 0 for target_y_n in df['y']]
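
The reasoning for dropping `pdays` can be illustrated on a small toy frame. This is only a sketch: the four rows below are made-up values that follow the dataset's conventions (`pdays == -1` for clients never contacted before), and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows following the pdays / previous conventions described above
toy = pd.DataFrame({"pdays": [-1, 92, -1, 180], "previous": [0, 2, 0, 1]})

# -1 is a sentinel, not a real day count, so it drags the mean down
naive_mean = toy["pdays"].mean()
real_mean = toy.loc[toy["pdays"] != -1, "pdays"].mean()
print(naive_mean, real_mean)

# In this toy frame, previous > 0 marks the same clients as pdays != -1
was_contacted = toy["previous"] > 0
print(was_contacted.tolist())
```

An alternative to dropping the column, not used here, would be to keep an explicit boolean flag like `was_contacted` alongside a cleaned `pdays`.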

In [None]:
categorical_features = ['job','marital','education','default','housing','loan','contact','month','poutcome']

X_encoded = pd.get_dummies(X, columns=categorical_features)
X_encoded
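
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame (the values are illustrative; only the column names `age` and `marital` come from the dataset):

```python
import pandas as pd

# Made-up rows: one numeric column and one categorical column
toy = pd.DataFrame({"age": [30, 45, 52],
                    "marital": ["married", "single", "married"]})

# Each category becomes its own indicator column; numeric columns pass through
encoded = pd.get_dummies(toy, columns=["marital"])
print(encoded.columns.tolist())
```

Every category gets its own column, so the indicator columns for one feature always sum to 1 per row. For linear models, `pd.get_dummies(..., drop_first=True)` is one way to remove that redundancy.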

In [None]:
X_all_training,X_test,y_all_training,y_test = train_test_split(X_encoded,y,random_state=42, test_size=.15) # 15% test set
X_train,X_val,y_train,y_val = train_test_split(X_all_training,y_all_training,random_state=42, test_size=.1/.85) # 10% validation set
X_val
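
The `test_size=.1/.85` in the second split looks odd at first. The arithmetic below is a quick check that it yields a 75/10/15 split of the full dataset (the variable names are only for illustration):

```python
# Fractions used in the two train_test_split calls above
test_frac = 0.15               # first split: 15% of all rows held out as test
val_frac_of_rest = 0.1 / 0.85  # second split: fraction of the remaining 85%

# Express the validation and training sets as fractions of the full dataset
val_frac = (1 - test_frac) * val_frac_of_rest
train_frac = 1 - test_frac - val_frac
print(round(train_frac, 4), round(val_frac, 4), test_frac)
```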

In [None]:
roc_results = pd.DataFrame(columns=['Model','Score'])

def add_score(roc_results, model, name=None):
    """Score a fitted model on the validation set and append its ROC-AUC to roc_results."""
    y_proba = model.predict_proba(X_val)[:, 1]  # predicted probability of the positive class
    score = roc_auc_score(y_val, y_proba)
    print("ROC-AUC:", score)
    # Store a readable name (default: the class name) rather than the estimator object itself
    roc_results.loc[len(roc_results)] = [name or type(model).__name__, score]
    return roc_results
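
For intuition about the metric recorded here: ROC-AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. `pairwise_auc` is a hypothetical helper written for illustration, not part of scikit-learn, and the labels and scores are made up.

```python
def pairwise_auc(y_true, y_score):
    """Brute-force ROC-AUC: share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a wrong order counts 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

Because only the ranking of scores matters, the metric does not depend on a classification threshold, which is why `predict_proba` rather than `predict` feeds `roc_auc_score`.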

### Logistic Regression

In [None]:
log_reg = LogisticRegression(max_iter=5000)
log_reg.fit(X_train, y_train)

add_score(roc_results,log_reg)

# y_pred = log_reg.predict(X_val)
# y_proba = log_reg.predict_proba(X_val)[:, 1]
# score = roc_auc_score(y_val, y_proba)
# print("ROC-AUC:", score)

In [None]:
# add_score already recorded this model's ROC-AUC; display the running results table
roc_results

In [None]:
coefs = pd.Series(log_reg.coef_[0], index=X_train.columns)
coefs.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh');

### Random Forest

In [None]:
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=None,
    random_state=42)

rf.fit(X_train, y_train)

y_pred = rf.predict(X_val)
y_proba = rf.predict_proba(X_val)[:, 1]

score = roc_auc_score(y_val, y_proba)
print("ROC-AUC:", score)

In [None]:
# Record the validation ROC-AUC in the shared results table
roc_results.loc[len(roc_results)] = ['RandomForest', score]

In [None]:
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
importances.sort_values(ascending=False).head(10).sort_values(ascending=True).plot(kind='barh');


### XGBoost

In [None]:
xgb_model = XGBClassifier(
#     n_estimators=200,
#     learning_rate=0.1,
#     max_depth=4,
#     random_state=42,
#     use_label_encoder=False,
#     eval_metric='logloss'
)
xgb_model.fit(X_train, y_train)

y_pred = xgb_model.predict(X_val)
y_proba = xgb_model.predict_proba(X_val)[:, 1]

score = roc_auc_score(y_val, y_proba)
print("ROC-AUC:", score)

In [None]:
# Record the validation ROC-AUC in the shared results table
roc_results.loc[len(roc_results)] = ['XGBoost', score]

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
xgb.plot_importance(xgb_model, ax=ax,max_num_features=10)
plt.show()


### LightGBM

In [None]:
lgbm_model = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=-1,
    random_state=42
)
lgbm_model.fit(X_train, y_train)

y_pred = lgbm_model.predict(X_val)
y_proba = lgbm_model.predict_proba(X_val)[:, 1]

score = roc_auc_score(y_val, y_proba)
print("ROC-AUC:", score)

In [None]:
# Record the validation ROC-AUC in the shared results table
roc_results.loc[len(roc_results)] = ['LightGBM', score]

In [None]:
lgbm.plot_importance(lgbm_model, max_num_features=10, figsize=(8, 6));

### CatBoost

In [None]:
cat = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    random_seed=42,
    verbose=0
)
cat.fit(X_train, y_train)

y_pred = cat.predict(X_val)
y_proba = cat.predict_proba(X_val)[:, 1]

score = roc_auc_score(y_val, y_proba)
print("ROC-AUC:", score)

In [None]:
# Record the validation ROC-AUC in the shared results table
roc_results.loc[len(roc_results)] = ['CatBoost', score]

In [None]:
importances = cat.get_feature_importance()
feature_names = X_train.columns

feat_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
feat_imp.head(10).sort_values(ascending=True).plot(kind='barh')
plt.title("CatBoost Feature Importance")
plt.show()

In [None]:
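# Assumption: result2 is the per-month conversion rate (in %), computed here from df
result2 = df.assign(y_binary=(df['y'] == 'yes') * 100).groupby('month', as_index=False)['y_binary'].mean()
plt.bar(result2['month'],result2['y_binary'])
plt.ylabel('% converted');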

In [None]:
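forplotting = df['month'].value_counts().reset_index()  # assumes bfu referred to the full dataset, df
forplotting.columns = ['month', 'count']  # explicit names, consistent across pandas versions
forplotting = forplotting.sort_values(by='count', ascending=False)  # keep the sorted result
plt.bar(forplotting['month'],forplotting['count']);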

Next steps: compare clients who were previously contacted with those who were not, and check how many clients fall into each group.
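
One possible way to set up that comparison is sketched below on a made-up six-row frame. The group labels and the `summary` layout are assumptions for illustration; on the real data the same steps would run on `df`, using `previous > 0` as the "previously contacted" flag.

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "previous": [0, 0, 3, 1, 0, 2],
    "y": ["no", "no", "yes", "no", "yes", "yes"],
})

# Label each client by whether any contact happened before this campaign
toy["group"] = (toy["previous"] > 0).map(
    {True: "previously_contacted", False: "first_contact"})
toy["y_binary"] = (toy["y"] == "yes").astype(int)

# Group size and conversion rate for each of the two sets
summary = toy.groupby("group")["y_binary"].agg(n="count", conversion_rate="mean")
print(summary)
```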