# Case Intro
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans to sell term deposits to their customers such as email marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns still remain one of the most effective way to reach out to people. However, they require huge investment as large call centers are hired to actually execute these campaigns. Hence, it is crucial to identify the customers most likely to convert beforehand so that they can be specifically targeted via call.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed by the customer or not. The data folder contains two datasets:-

Bank.csv: 45,211 rows and 18 columns ordered by date (from May 2008 to November 2010)

Detailed Column Descriptions
bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")
# related with the last contact of the current campaign:
9 - contact: contact communication type (categorical: "unknown","telephone","cellular")
10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y - has the client subscribed a term deposit? (binary: "yes","no")

Missing Attribute Values: None


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/Bank.csv',sep = ';')
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [2]:
print(df.shape)
df.info()
df.isnull().sum()

(45211, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [3]:
#For object check the data 
for cn in df.columns:
  if(df[cn].dtype==object):
    print(df[cn].value_counts())
  

blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64
married     27214
single      12790
divorced     5207
Name: marital, dtype: int64
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64
no     44396
yes      815
Name: default, dtype: int64
yes    25130
no     20081
Name: housing, dtype: int64
no     37967
yes     7244
Name: loan, dtype: int64
cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: month, dtype: int64
unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64
n

In [4]:
def Encoder(df):
          from sklearn import preprocessing
          columnsToEncode = list(df.select_dtypes(include=['category','object']))
          le = preprocessing.LabelEncoder()
          for feature in columnsToEncode:
              try:
                  df[feature] = le.fit_transform(df[feature])
              except:
                  print('Error encoding '+feature)
          return df


In [5]:
df=Encoder(df)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3,0
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3,0
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3,0
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3,0
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3,1
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3,1
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2,1
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3,0


In [6]:
y = df['y'] #Output
X = df.drop('y',axis=1)
X

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3


In [7]:
from sklearn.model_selection import  train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

 


Q1)Using  Random Forest,XGBoost, Light GBM and Gradient Boosting Classifier with default parameters (no parameter specifications except random_state) calculate Accuracy on Test data. Which method gives the best accuracy on test data

In [11]:
# Import the required libraries and the dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb

data = pd.read_csv("https://raw.githubusercontent.com/ogut77/DataScience/main/data/Bank.csv")

rfc = RandomForestClassifier(random_state=1)
xgb = xgb.XGBClassifier(random_state=1)
lgbm = lgb.LGBMClassifier(random_state=1)
gbc = GradientBoostingClassifier(random_state=1)

rfc.fit(X_train, y_train)
xgb.fit(X_train, y_train)
lgbm.fit(X_train, y_train)
gbc.fit(X_train, y_train)

rfc_preds = rfc.predict(X_test)
xgb_preds = xgb.predict(X_test)
lgbm_preds = lgbm.predict(X_test)
gbc_preds = gbc.predict(X_test)

rfc_acc = (rfc_preds == y_test).mean()
xgb_acc = (xgb_preds == y_test).mean()
lgbm_acc = (lgbm_preds == y_test).mean()
gbc_acc = (gbc_preds == y_test).mean()

print("Random Forest Classifier Accuracy: ", rfc_acc)
print("XGBoost Classifier Accuracy: ", xgb_acc)
print("Light GBM Classifier Accuracy: ", lgbm_acc)
print("Gradient Boosting Classifier Accuracy: ", gbc_acc)


Random Forest Classifier Accuracy:  0.9036538971954349
XGBoost Classifier Accuracy:  0.9043616738918872
Light GBM Classifier Accuracy:  0.9073697248518092
Gradient Boosting Classifier Accuracy:  0.9018844554543042


Based on the accuracy scores the Light GBM Classifier gives the best accuracy on test data with an accuracy score of 0.9073697248518092.


Q2) Using optuna hyperparmeter optimization technique and 100 trial

 a)find best methods with  parameters  using Cross validation (CV=3) technique for the range of   parameters below. What are the best parameters for the method with highest cross validation accuracy?

"max_depth": range(2, 16), "max_features": range(2, 16)

 b)Evaluate the performance of the  method with highest cross validation accuracy on test data.What is the accuracy value?


In [16]:

!pip install optuna



Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 16),
        "max_features": trial.suggest_int("max_features", 2, 16)
    }

    clf = RandomForestClassifier(random_state=42, **params)

    scores = cross_val_score(clf, X_train, y_train, cv=3)
    accuracy = scores.mean()

    return accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print("Best parameters: ", study.best_params)
print("Best accuracy: ", study.best_value)



[32m[I 2023-04-08 20:02:30,252][0m A new study created in memory with name: no-name-ffdd9587-d26b-4b71-b3a0-c4f3053bbed1[0m
[32m[I 2023-04-08 20:02:50,163][0m Trial 0 finished with value: 0.9033267206019188 and parameters: {'max_depth': 14, 'max_features': 11}. Best is trial 0 with value: 0.9033267206019188.[0m
[32m[I 2023-04-08 20:02:56,578][0m Trial 1 finished with value: 0.9009968330040535 and parameters: {'max_depth': 6, 'max_features': 5}. Best is trial 0 with value: 0.9033267206019188.[0m
[32m[I 2023-04-08 20:03:03,143][0m Trial 2 finished with value: 0.9035921446910912 and parameters: {'max_depth': 12, 'max_features': 3}. Best is trial 2 with value: 0.9035921446910912.[0m
[32m[I 2023-04-08 20:03:11,038][0m Trial 3 finished with value: 0.90338568372795 and parameters: {'max_depth': 7, 'max_features': 6}. Best is trial 2 with value: 0.9035921446910912.[0m
[32m[I 2023-04-08 20:03:26,677][0m Trial 4 finished with value: 0.9026483980704724 and parameters: {'max_depth

Best parameters:  {'max_depth': 14, 'max_features': 4}
Best accuracy:  0.9056270157400625


In [28]:
best_params = study.best_params
clf = RandomForestClassifier(random_state=42, **best_params)
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", test_accuracy)


Test accuracy: 0.9054233389365655


For Q3 and Q4 ,use the following data.

In [29]:
dr=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/diamond.csv')
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171
...,...,...,...,...,...,...,...,...
5995,1.03,Ideal,D,SI1,EX,EX,GIA,6250
5996,1.00,Very Good,D,SI1,VG,VG,GIA,5328
5997,1.02,Ideal,D,SI1,EX,EX,GIA,6157
5998,1.27,Signature-Ideal,G,VS1,EX,EX,GIA,11206


In [26]:
def Encoder(df):
          from sklearn import preprocessing
          columnsToEncode = list(df.select_dtypes(include=['category','object']))
          le = preprocessing.LabelEncoder()
          for feature in columnsToEncode:
              try:
                  df[feature] = le.fit_transform(df[feature])
              except:
                  print('Error encoding '+feature)
          return df


In [30]:
dr=Encoder(dr)
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,2,4,2,3,0,1,5169
1,0.83,2,4,3,2,2,0,3470
2,0.85,2,4,2,0,0,1,3183
3,0.91,2,1,2,3,3,1,4370
4,0.83,2,3,2,0,0,1,3171
...,...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1,6250
5996,1.00,4,0,2,3,3,1,5328
5997,1.02,2,0,2,0,0,1,6157
5998,1.27,3,3,3,0,0,1,11206


In [31]:
y = dr['Price'] #Output
X = dr.drop('Price',axis=1)
X

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report
0,1.10,2,4,2,3,0,1
1,0.83,2,4,3,2,2,0
2,0.85,2,4,2,0,0,1
3,0.91,2,1,2,3,3,1
4,0.83,2,3,2,0,0,1
...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1
5996,1.00,4,0,2,3,3,1
5997,1.02,2,0,2,0,0,1
5998,1.27,3,3,3,0,0,1


In [32]:
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

Q3)Using Linear Regression,Decison Tree Random Forest,XGBoost, Light GBM and Gradient Boosting Classifier with default parameters (no parameter specifications except random_state) calculate R2 statistics on test data. Which method gives the best accuracy on test data

1.   List item
2.   List item



In [36]:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, n_classes=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LogisticRegression(random_state=42)
dt_model = DecisionTreeClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)
xgb_model = XGBClassifier(random_state=42)
lgb_model = LGBMClassifier(random_state=42)
gb_model = GradientBoostingClassifier(random_state=42)

lr_model.fit(X_train, y_train)
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)
lgb_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)

lr_preds = lr_model.predict(X_test)
dt_preds = dt_model.predict(X_test)
rf_preds = rf_model.predict(X_test)
xgb_preds = xgb_model.predict(X_test)
lgb_preds = lgb_model.predict(X_test)
gb_preds = gb_model.predict(X_test)

lr_acc = accuracy_score(y_test, lr_preds)
dt_acc = accuracy_score(y_test, dt_preds)
rf_acc = accuracy_score(y_test, rf_preds)
xgb_acc = accuracy_score(y_test, xgb_preds)
lgb_acc = accuracy_score(y_test, lgb_preds)
gb_acc = accuracy_score(y_test, gb_preds)

print(f"Logistic Regression Accuracy Score: {lr_acc:.3f}")
print(f"Decision Tree Accuracy Score: {dt_acc:.3f}")
print(f"Random Forest Accuracy Score: {rf_acc:.3f}")
print(f"XGBoost Accuracy Score: {xgb_acc:.3f}")
print(f"Light GBM Accuracy Score: {lgb_acc:.3f}")
print(f"Gradient Boosting Accuracy Score: {gb_acc:.3f}")



Logistic Regression Accuracy Score: 0.820
Decision Tree Accuracy Score: 0.835
Random Forest Accuracy Score: 0.920
XGBoost Accuracy Score: 0.925
Light GBM Accuracy Score: 0.940
Gradient Boosting Accuracy Score: 0.930


The Light GBM method gives the best accuracy on the test data with a score of 0.940.

**Q4**) Using optuna hyperparmeter optimization technique (100 trial)  with Random Forest,XGBoost, Light GBM and Gradient Boosting Regressor

a)find best methods with  parameters  using Cross validation (CV=3) technique for the range of   parameters below. What are the best parameters for the method with highest cross validation R2?

"max_depth": range(2, 7), "max_features": range(2, 7)

 b)Evaluate the performance of the  method with highest cross validation R2 on test data. What is the R2 value?


In [88]:
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

def objective(trial):
    model_name = trial.suggest_categorical("model", ["RandomForest", "XGBoost", "LightGBM", "GradientBoosting"])
    
    if model_name == "RandomForest":
        model = RandomForestRegressor(
            n_estimators=100, 
            max_depth=trial.suggest_int("max_depth", 2, 7),
            max_features=trial.suggest_int("max_features", 2, 7),
            random_state=42,
            n_jobs=-1
        )
    elif model_name == "XGBoost":
        model = XGBRegressor(
            n_estimators=100,
            max_depth=trial.suggest_int("max_depth", 2, 7),
            learning_rate=trial.suggest_loguniform("learning_rate", 1e-4, 1),
            random_state=42,
            n_jobs=-1
        )
    elif model_name == "LightGBM":
        model = LGBMRegressor(
            n_estimators=100, 
            max_depth=trial.suggest_int("max_depth", 2, 7),
            learning_rate=trial.suggest_loguniform("learning_rate", 1e-4, 1),
            random_state=42,
            n_jobs=-1
        )
    else:
        model = GradientBoostingRegressor(
            n_estimators=100, 
            max_depth=trial.suggest_int("max_depth", 2, 7),
            max_features=trial.suggest_int("max_features", 2, 7),
            learning_rate=trial.suggest_loguniform("learning_rate", 1e-4, 1),
            random_state=42
        )
    
    cv_score = cross_val_score(model, X_train, y_train, cv=3, scoring="r2").mean()
    return cv_score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
  print("    {}: {}".format(key, value))


[32m[I 2023-04-08 21:34:48,847][0m A new study created in memory with name: no-name-7e722362-8433-4e1b-a5da-f6841f30841e[0m
[32m[I 2023-04-08 21:34:50,965][0m Trial 0 finished with value: 0.6831115520675644 and parameters: {'model': 'RandomForest', 'max_depth': 6, 'max_features': 6}. Best is trial 0 with value: 0.6831115520675644.[0m
  learning_rate=trial.suggest_loguniform("learning_rate", 1e-4, 1),
[32m[I 2023-04-08 21:34:51,083][0m Trial 1 finished with value: 0.8848191683140456 and parameters: {'model': 'LightGBM', 'max_depth': 3, 'learning_rate': 0.30107878112751707}. Best is trial 1 with value: 0.8848191683140456.[0m
[32m[I 2023-04-08 21:34:52,218][0m Trial 2 finished with value: 0.5229881601131247 and parameters: {'model': 'RandomForest', 'max_depth': 7, 'max_features': 2}. Best is trial 1 with value: 0.8848191683140456.[0m
  learning_rate=trial.suggest_loguniform("learning_rate", 1e-4, 1),
[32m[I 2023-04-08 21:34:52,301][0m Trial 3 finished with value: 0.871805041

Best trial:
  Value: 0.9271917593897193
  Params: 
    model: GradientBoosting
    max_depth: 2
    max_features: 7
    learning_rate: 0.30305183833555693


Best trial:
  Value: 0.9271917593897193
  Params: 
    model: GradientBoosting
    max_depth: 2
    max_features: 7
    learning_rate: 0.30305183833555693