<a href="https://colab.research.google.com/github/Nsi20/SCT_DS_03/blob/main/SCT_DS_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**TASK**

# **Build a decision tree classifier to predict whether a customer will purchase a product or service based on their demographic and behavioural data.**

**Installing the 'ucimlrepo' package for fetching UCI datasets**

In [1]:
!pip install ucimlrepo


Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [2]:
from ucimlrepo import fetch_ucirepo


In [3]:
bank_marketing = fetch_ucirepo(id=222)


In [4]:
X = bank_marketing.data.features
y = bank_marketing.data.targets

# Metadata
print(bank_marketing.metadata)

# Variable information
print(bank_marketing.variables)


{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the s

**Displaying the first few rows of features and target values to inspect the data.**

In [5]:
print(X.head())

print(y.head())


   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married        NaN      no     1506     yes   no   
4   33           NaN   single        NaN      no        1      no   no   

  contact  day_of_week month  duration  campaign  pdays  previous poutcome  
0     NaN            5   may       261         1     -1         0      NaN  
1     NaN            5   may       151         1     -1         0      NaN  
2     NaN            5   may        76         1     -1         0      NaN  
3     NaN            5   may        92         1     -1         0      NaN  
4     NaN            5   may       198         1     -1         0      NaN  
    y
0  no
1  no
2  no
3  no
4  no
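
The preview above already shows `NaN` in `job`, `education`, `contact`, and `poutcome`. A quick way to quantify that is to count missing values per column. The sketch below uses a small hypothetical frame (`preview`) that mirrors a few columns of the first five rows; on the real data the same two calls would be applied to `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in mirroring a few columns of the five-row preview.
preview = pd.DataFrame({
    "age": [58, 44, 33, 47, 33],
    "job": ["management", "technician", "entrepreneur", "blue-collar", np.nan],
    "education": ["tertiary", "secondary", "secondary", np.nan, np.nan],
    "contact": [np.nan] * 5,
})

# Absolute count and share of missing entries per column.
missing_counts = preview.isna().sum()
missing_share = preview.isna().mean()
print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```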


**Downloading the Bank Marketing archive from the UCI repository and loading `bank.csv` (the 4,521-row subset used in the rest of this notebook) with pandas.**

In [6]:
import pandas as pd

data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
!wget $data_url -O bank.zip
!unzip bank.zip -d ./bank_data

data = pd.read_csv("./bank_data/bank.csv", sep=";")

print(data.head())


--2024-12-01 23:48:35--  https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified
Saving to: ‘bank.zip’

bank.zip                [   <=>              ] 565.47K   746KB/s    in 0.8s    

2024-12-01 23:48:36 (746 KB/s) - ‘bank.zip’ saved [579043]

Archive:  bank.zip
  inflating: ./bank_data/bank-full.csv  
  inflating: ./bank_data/bank-names.txt  
  inflating: ./bank_data/bank.csv    
   age          job  marital  education default  balance housing loan  \
0   30   unemployed  married    primary      no     1787      no   no   
1   33     services  married  secondary      no     4789     yes  yes   
2   35   management   single   tertiary      no     1350     yes   no   
3   30   management  married   tertiary      no     1476     yes  yes   
4   59

# **Step 2: Handle Missing Values and Encode Categorical Variables using one-hot encoding.**

In [7]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

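# Categorical columns: fill gaps with the most frequent value, then one-hot encode below.
# Any column outside these two lists (for example 'default') is dropped when the frame is rebuilt.
categorical_cols = ['job', 'marital', 'education', 'housing', 'loan', 'contact', 'month', 'poutcome']
imputer_cat = SimpleImputer(strategy="most_frequent")
data[categorical_cols] = imputer_cat.fit_transform(data[categorical_cols])

# Numerical columns: fill gaps with the column mean (this also casts them to float).
numerical_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
imputer_num = SimpleImputer(strategy="mean")
data[numerical_cols] = imputer_num.fit_transform(data[numerical_cols])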

encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_features = pd.DataFrame(encoder.fit_transform(data[categorical_cols]), columns=encoder.get_feature_names_out(categorical_cols))

data = pd.concat([data[numerical_cols], encoded_features, data['y']], axis=1)

print(data.head())


    age  balance  duration  campaign  pdays  previous  job_blue-collar  \
0  30.0   1787.0      79.0       1.0   -1.0       0.0              0.0   
1  33.0   4789.0     220.0       1.0  339.0       4.0              0.0   
2  35.0   1350.0     185.0       1.0  330.0       1.0              0.0   
3  30.0   1476.0     199.0       4.0   -1.0       0.0              0.0   
4  59.0      0.0     226.0       1.0   -1.0       0.0              1.0   

   job_entrepreneur  job_housemaid  job_management  ...  month_jun  month_mar  \
0               0.0            0.0             0.0  ...        0.0        0.0   
1               0.0            0.0             0.0  ...        0.0        0.0   
2               0.0            0.0             1.0  ...        0.0        0.0   
3               0.0            0.0             1.0  ...        1.0        0.0   
4               0.0            0.0             0.0  ...        0.0        0.0   

   month_may  month_nov  month_oct  month_sep  poutcome_other  \
0  
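
One thing worth checking: the 40 feature columns reported later are consistent with missing categories in `bank.csv` being stored as the literal string `"unknown"` rather than `NaN`, in which case the imputer above leaves them untouched and they become their own one-hot columns. If they should be imputed instead, they have to be converted to `NaN` first. The sketch below assumes that `"unknown"` coding and uses a hypothetical toy frame; the pandas mode-based fill stands in for `SimpleImputer(strategy="most_frequent")`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "unknown" is assumed to be the missing-value marker.
toy = pd.DataFrame({
    "job": ["management", "unknown", "services", "management"],
    "education": ["tertiary", "secondary", "unknown", "secondary"],
})

# Turn the marker into real missing values so imputation can see them.
toy = toy.replace("unknown", np.nan)

# Fill each column with its most frequent value (first row of mode()).
filled = toy.fillna(toy.mode().iloc[0])
print(filled)
```

Whether to impute or to keep `"unknown"` as a category is a modelling choice; the original cell effectively does the latter.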

# **Step 3: Splitting the Data into Training and Testing Sets**

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X = data.drop(columns=['y'])
y = data['y']

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)  # 'yes' -> 1, 'no' -> 0

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set: X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Testing set: X_test: {X_test.shape}, y_test: {y_test.shape}")


Training set: X_train: (3616, 40), y_train: (3616,)
Testing set: X_test: (905, 40), y_test: (905,)
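
Because the positive class is rare, a plain random split can leave the train and test sets with slightly different class ratios. Passing `stratify` to `train_test_split` keeps the ratio the same in both. The sketch below is an illustration on synthetic data (`X_demo` and `y_demo` are made up, with 10% positives), not a rerun of the split above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced demo data: 900 negatives and 100 positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

# stratify=y_demo preserves the 90/10 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(f"train positive rate: {y_tr.mean():.2f}, test positive rate: {y_te.mean():.2f}")
```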


# **Step 4: Building and Training the Decision Tree Classifier**

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

dt_model = DecisionTreeClassifier(random_state=42)

dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.85

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.91      0.92       807
           1       0.35      0.42      0.38        98

    accuracy                           0.85       905
   macro avg       0.64      0.66      0.65       905
weighted avg       0.87      0.85      0.86       905


Confusion Matrix:
[[732  75]
 [ 57  41]]
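
The class-1 row of the report can be reproduced by hand from the confusion matrix, which makes it clearer why 85% accuracy hides weak performance on subscribers. The helper below (`binary_metrics` is a name introduced here for illustration) assumes scikit-learn's layout: rows are true labels, columns are predicted labels.

```python
def binary_metrics(cm):
    """Positive-class metrics from a 2x2 matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # of predicted "yes", how many were right
    recall = tp / (tp + fn)      # of actual "yes", how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tn + tp) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Confusion matrix printed above for the baseline decision tree.
baseline = binary_metrics([[732, 75], [57, 41]])
print({name: round(value, 2) for name, value in baseline.items()})
# {'precision': 0.35, 'recall': 0.42, 'f1': 0.38, 'accuracy': 0.85}
```

Only 41 of the 98 actual subscribers are found, while the 732 correctly rejected non-subscribers account for most of the accuracy figure.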


# **Applying Class Weighting**

In [10]:
dt_model_weighted = DecisionTreeClassifier(random_state=42, class_weight='balanced')

dt_model_weighted.fit(X_train, y_train)

y_pred_weighted = dt_model_weighted.predict(X_test)

accuracy_weighted = accuracy_score(y_test, y_pred_weighted)
print(f"Accuracy with Class Weighting: {accuracy_weighted:.2f}")

print("\nClassification Report with Class Weighting:")
print(classification_report(y_test, y_pred_weighted))

print("\nConfusion Matrix with Class Weighting:")
print(confusion_matrix(y_test, y_pred_weighted))


Accuracy with Class Weighting: 0.85

Classification Report with Class Weighting:
              precision    recall  f1-score   support

           0       0.92      0.91      0.92       807
           1       0.31      0.33      0.32        98

    accuracy                           0.85       905
   macro avg       0.62      0.62      0.62       905
weighted avg       0.85      0.85      0.85       905


Confusion Matrix with Class Weighting:
[[737  70]
 [ 66  32]]
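
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below works that out for this training set. The class counts are inferred rather than printed anywhere in the notebook: it assumes the SMOTE step later on (6,386 resampled rows) brought the minority class up to the majority size, which implies 3,193 majority and 423 minority rows among the 3,616 training samples.

```python
n_samples = 3616          # training rows reported by the split
resampled_total = 6386    # rows after SMOTE, reported later in the notebook
n_classes = 2

# Assumption: SMOTE equalised the classes, so each has resampled_total / 2 rows.
n_majority = resampled_total // 2
n_minority = n_samples - n_majority

weights = {
    0: n_samples / (n_classes * n_majority),
    1: n_samples / (n_classes * n_minority),
}
print(n_majority, n_minority, {label: round(w, 2) for label, w in weights.items()})
```

Under these assumptions each minority sample counts roughly 7.5 times as much as a majority sample. For the fully grown tree used here that did not help: minority recall fell from 0.42 to 0.33.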


# **Hyperparameter Tuning with GridSearchCV**

In [11]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'max_features': [None, 'sqrt', 'log2']
}

grid_search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42, class_weight='balanced'),
                           param_grid=param_grid,
                           scoring='accuracy',
                           cv=5,
                           n_jobs=-1)

grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")

best_dt_model = grid_search.best_estimator_

y_pred_best = best_dt_model.predict(X_test)

accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"\nAccuracy with Best Hyperparameters: {accuracy_best:.2f}")

print("\nClassification Report with Best Hyperparameters:")
print(classification_report(y_test, y_pred_best))

print("\nConfusion Matrix with Best Hyperparameters:")
print(confusion_matrix(y_test, y_pred_best))


Best Hyperparameters: {'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}

Accuracy with Best Hyperparameters: 0.85

Classification Report with Best Hyperparameters:
              precision    recall  f1-score   support

           0       0.92      0.91      0.92       807
           1       0.31      0.33      0.32        98

    accuracy                           0.85       905
   macro avg       0.62      0.62      0.62       905
weighted avg       0.85      0.85      0.85       905


Confusion Matrix with Best Hyperparameters:
[[737  70]
 [ 66  32]]
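
The search above optimises accuracy and ends up selecting the default, unpruned tree, so the results match the class-weighted model exactly. With roughly 11% positives in the test set (98 of 905), accuracy is dominated by the majority class. One option is to score the search on the positive-class F1 instead. The sketch below shows that change on synthetic data (`X_demo`, `y_demo`, and the reduced grid are assumptions made for brevity), so it says nothing about what the search would pick on the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the bank features.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42, class_weight="balanced"),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]},
    scoring="f1",  # F1 of the positive class instead of overall accuracy
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```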


# **Applying SMOTE to the training data**

In [12]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

print(f"Original Training Set: {X_train.shape}, {y_train.shape}")
print(f"Resampled Training Set: {X_train_resampled.shape}, {y_train_resampled.shape}")


Original Training Set: (3616, 40), (3616,)
Resampled Training Set: (6386, 40), (6386,)
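
SMOTE creates new minority rows by interpolating between an existing minority sample and one of its nearest minority neighbours: `x_new = x + lam * (neighbour - x)` with `lam` drawn from [0, 1]. The sketch below illustrates only that interpolation step; it is a simplified stand-in, not imblearn's implementation, and the two points (age, balance) are hypothetical.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)
minority_point = [30.0, 1787.0]      # hypothetical (age, balance)
minority_neighbor = [35.0, 1350.0]   # hypothetical nearest minority neighbour
synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Since the interpolation is applied to every column, one-hot features can take fractional values in the synthetic rows. imblearn also provides `SMOTENC` for mixed numerical and categorical data, which may be a better fit here.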


# **Training the Decision Tree Model with Resampled Data**

In [13]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Fit a fresh tree on the SMOTE-resampled training data
clf = DecisionTreeClassifier(random_state=42)

clf.fit(X_train_resampled, y_train_resampled)

# Evaluate on the original, untouched test set
y_pred_resampled = clf.predict(X_test)

accuracy_resampled = accuracy_score(y_test, y_pred_resampled)

class_report_resampled = classification_report(y_test, y_pred_resampled)

conf_matrix_resampled = confusion_matrix(y_test, y_pred_resampled)

print(f"Accuracy with SMOTE: {accuracy_resampled}")
print("\nClassification Report with SMOTE:")
print(class_report_resampled)
print("\nConfusion Matrix with SMOTE:")
print(conf_matrix_resampled)


Accuracy with SMOTE: 0.8651933701657458

Classification Report with SMOTE:
              precision    recall  f1-score   support

           0       0.93      0.91      0.92       807
           1       0.39      0.46      0.42        98

    accuracy                           0.87       905
   macro avg       0.66      0.69      0.67       905
weighted avg       0.87      0.87      0.87       905


Confusion Matrix with SMOTE:
[[738  69]
 [ 53  45]]


# **Reloading the Dataset for a Logistic Regression Comparison**

In [14]:
from ucimlrepo import fetch_ucirepo

bank_marketing = fetch_ucirepo(id=222)

X = bank_marketing.data.features
y = bank_marketing.data.targets

print(bank_marketing.metadata)

print(bank_marketing.variables)


{'uci_id': 222, 'name': 'Bank Marketing', 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing', 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv', 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).', 'area': 'Business', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 45211, 'num_features': 16, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'], 'target_col': ['y'], 'index_col': None, 'has_missing_values': 'yes', 'missing_values_symbol': 'NaN', 'year_of_dataset_creation': 2014, 'last_updated': 'Fri Aug 18 2023', 'dataset_doi': '10.24432/C5K306', 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'], 'intro_paper': {'ID': 277, 'type': 'NATIVE', 'title': 'A data-driven approach to predict the s

In [15]:

print(X.head())
print(y.head())


   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married        NaN      no     1506     yes   no   
4   33           NaN   single        NaN      no        1      no   no   

  contact  day_of_week month  duration  campaign  pdays  previous poutcome  
0     NaN            5   may       261         1     -1         0      NaN  
1     NaN            5   may       151         1     -1         0      NaN  
2     NaN            5   may        76         1     -1         0      NaN  
3     NaN            5   may        92         1     -1         0      NaN  
4     NaN            5   may       198         1     -1         0      NaN  
    y
0  no
1  no
2  no
3  no
4  no


# **Separating Features and Target**

In [16]:

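# Note: this reuses the preprocessed `data` frame from In [7],
# replacing the X and y fetched via ucimlrepo in In [14].
X = data.drop(columns=['y'])
y = data['y']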

print(X.shape, y.shape)



(4521, 40) (4521,)


In [17]:

y = y.map({'no': 0, 'yes': 1})

print(y.head())


0    0
1    0
2    0
3    0
4    0
Name: y, dtype: int64


In [18]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


(3616, 40) (905, 40) (3616,) (905,)


In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

print('Classification Report:')
print(classification_report(y_test, y_pred))

print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.8972375690607735
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94       807
           1       0.55      0.27      0.36        98

    accuracy                           0.90       905
   macro avg       0.73      0.62      0.65       905
weighted avg       0.88      0.90      0.88       905

Confusion Matrix:
[[786  21]
 [ 72  26]]


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [20]:
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
from sklearn.preprocessing import StandardScaler

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model with scaled data
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred_scaled = model.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred_scaled)
print(f'Accuracy: {accuracy}')
print('Classification Report:')
print(classification_report(y_test, y_pred_scaled))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred_scaled))


Accuracy: 0.8983425414364641
Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.98      0.94       807
           1       0.57      0.27      0.36        98

    accuracy                           0.90       905
   macro avg       0.74      0.62      0.65       905
weighted avg       0.88      0.90      0.88       905

Confusion Matrix:
[[787  20]
 [ 72  26]]
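
Scaling removes the convergence warning here, and the scaler is correctly fitted on the training set only. When cross-validation is added, the same guarantee is easiest to keep by bundling the scaler and the model in a pipeline, so each fold refits the scaler on its own training part. The sketch below shows the pattern on synthetic data (`X_demo` and `y_demo` are assumptions, not the bank features).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the bank features.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# The scaler is refitted inside every fold, so no test-fold statistics leak in.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1")
print(scores.round(3), round(scores.mean(), 3))
```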


In [22]:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Oversample the scaled training data only; the test set stays untouched.
# No random_state is set, so the resampled set (and the scores below) can vary between runs.
smote = SMOTE()
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

model_smote = LogisticRegression(max_iter=1000)
model_smote.fit(X_train_smote, y_train_smote)

y_pred_smote = model_smote.predict(X_test_scaled)

accuracy_smote = accuracy_score(y_test, y_pred_smote)
print(f'Accuracy with SMOTE: {accuracy_smote}')
print('Classification Report with SMOTE:')
print(classification_report(y_test, y_pred_smote))
print('Confusion Matrix with SMOTE:')
print(confusion_matrix(y_test, y_pred_smote))


Accuracy with SMOTE: 0.8419889502762431
Classification Report with SMOTE:
              precision    recall  f1-score   support

           0       0.97      0.85      0.91       807
           1       0.39      0.82      0.53        98

    accuracy                           0.84       905
   macro avg       0.68      0.83      0.72       905
weighted avg       0.91      0.84      0.86       905

Confusion Matrix with SMOTE:
[[682 125]
 [ 18  80]]
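
To compare the models on the class that matters, the positive-class precision and recall can be read off the six confusion matrices printed in this notebook. The sketch below only recomputes those two numbers from the reported matrices (the short model names are labels introduced here) and ranks the models by recall.

```python
# Confusion matrices as printed in the notebook: [[TN, FP], [FN, TP]].
reported = {
    "tree": [[732, 75], [57, 41]],
    "tree + class_weight": [[737, 70], [66, 32]],
    "tree + SMOTE": [[738, 69], [53, 45]],
    "logreg": [[786, 21], [72, 26]],
    "logreg scaled": [[787, 20], [72, 26]],
    "logreg scaled + SMOTE": [[682, 125], [18, 80]],
}

summary = {}
for name, ((tn, fp), (fn, tp)) in reported.items():
    summary[name] = {"precision": tp / (tp + fp), "recall": tp / (tp + fn)}

# Rank by how many actual subscribers each model finds.
for name, m in sorted(summary.items(), key=lambda kv: kv[1]["recall"], reverse=True):
    print(f"{name:<24} precision={m['precision']:.2f} recall={m['recall']:.2f}")

best_by_recall = max(summary, key=lambda name: summary[name]["recall"])
```

On these numbers, logistic regression with scaling and SMOTE finds 80 of the 98 subscribers at the cost of 125 false positives, while the unweighted logistic models are the most precise but find only 26.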


In [23]:
# y_pred still holds the predictions of the unscaled logistic regression from In [19]
print(y_pred)

[0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
 0 0 0 1 0 0 0 0 0 0 0 0 
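
A raw array of 905 zeros and ones is hard to read. Counting the predictions per label gives the same information at a glance. The sketch below uses a short hypothetical list (`y_pred_demo`) in place of the real `y_pred`, and maps 0 and 1 back to the original `no`/`yes` labels.

```python
from collections import Counter

# Hypothetical stand-in for the y_pred array printed above.
y_pred_demo = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
label_names = {0: "no", 1: "yes"}

counts = Counter(label_names[p] for p in y_pred_demo)
total = len(y_pred_demo)
for label in ("no", "yes"):
    print(f"{label}: {counts[label]} ({counts[label] / total:.1%})")
```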