# Data Mining

This notebook is dedicated to:

1. Data cleaning and preprocessing
2. Training and evaluating four data mining models to understand complex patterns within the collected data

As one of the critical components of a decision support system, the data mining (DM) models detect underlying patterns and relations within the data.

## Required Packages

In [1]:
import sys

sys.path.append(r"h:\Resume\Projects\DataScience\Banking Telemarketing Decision Support System\Project")

import pandas as pd
import numpy as np

import importlib

from src.data_preprocessing import process_dataset as prep

import src.data_ingestion as load
from src.models.model_selection import train_compare_models
from configs.config_repository import ConfigRepository

## Data Preparation

### Loading the data

In [2]:
df_train = load.load_csv_to_dataframe(file_path='../data/raw/bank-full.csv')
df_test = load.load_csv_to_dataframe(file_path='../data/raw/bank.csv')

Dataset loaded successfully with 45211 rows and 17 columns.
Dataset loaded successfully with 4521 rows and 17 columns.


In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


### Looking at the data composition 

Check the data for severe anomalies.

From the exploratory data analysis we know that the collected data does not show any abnormalities.

### Data Cleaning
* Remove duplicate rows.
* Handle missing values.
* Correct textual inconsistencies.
* Remove outliers.
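
The cleaning steps listed above can be sketched in plain pandas. This is a minimal illustration, not the project's `process_dataset`: the function names `clean_dataframe` and `drop_iqr_outliers`, the 1.5 × IQR rule, and the assumption that every non-numeric column holds text labels are all choices made for this example.

```python
import pandas as pd


def clean_dataframe(df: pd.DataFrame, missing_threshold: float = 0.5) -> pd.DataFrame:
    """Illustrative cleaning pass (hypothetical helper, not the project's code)."""
    out = df.drop_duplicates().copy()

    # Normalise text labels: strip surrounding whitespace and lower-case.
    # Assumption: every non-numeric column contains text, as in this dataset.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].str.strip().str.lower()

    # Drop columns whose share of missing values exceeds the threshold,
    # then drop rows that still contain a missing value.
    keep = out.columns[out.isna().mean() <= missing_threshold]
    out = out[keep].dropna()

    # Text normalisation can create new duplicates, so deduplicate again.
    return out.drop_duplicates().reset_index(drop=True)


def drop_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values lie within k * IQR of the quartiles (assumed rule)."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Toy rows for demonstration only; these are not taken from the bank dataset.
raw = pd.DataFrame(
    {
        "age": [30, 30, 41, 52, 30],
        "job": ["admin.", "admin.", " Admin. ", None, "student"],
        "y": ["no", "no", "yes", "no", "no"],
    }
)
cleaned = clean_dataframe(raw)
trimmed = drop_iqr_outliers(pd.DataFrame({"balance": [1, 2, 3, 4, 100]}), ["balance"])
```

Outlier removal is kept as a separate function so that it can be switched off, which mirrors the `outlier_removal=False` setting used later in this notebook.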

### Data Transformation

* Convert categorical variables into one-hot vectors.
* Normalize the data.

#### Loading configuration for preprocessing

In [5]:
cr = ConfigRepository(config_path="../configs/models_config.json")
fe_config = cr.get_config("fe_config")
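
For readers without access to `configs.config_repository`, the cell above can be approximated by a small JSON-backed lookup. This is a hypothetical stand-in: the class name `JsonConfigRepository`, the assumption that the JSON file maps section names to dictionaries, and the placeholder key `example_flag` are not taken from the project.

```python
import json
import tempfile
from pathlib import Path


class JsonConfigRepository:
    """Hypothetical stand-in; the project's ConfigRepository may behave differently."""

    def __init__(self, config_path):
        self._path = Path(config_path)
        with self._path.open(encoding="utf-8") as fh:
            # Assumption: the file is a JSON object keyed by section name.
            self._configs = json.load(fh)

    def get_config(self, name):
        """Return the named section, or raise a KeyError that names the file."""
        try:
            return self._configs[name]
        except KeyError:
            raise KeyError(f"No config section named {name!r} in {self._path}") from None


# Demo with a throwaway file; the content is a placeholder, not the real fe_config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "models_config.json"
    demo_path.write_text(json.dumps({"fe_config": {"example_flag": True}}), encoding="utf-8")
    demo_repo = JsonConfigRepository(demo_path)
    demo_fe_config = demo_repo.get_config("fe_config")
```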

In [6]:
# Preprocessing the data
data_train_prep = prep(
    df_=df_train,
    target_column='y',
    missing_method='drop',  # or 'fill'
    fill_value=None,
    fill_method=None,
    missing_threshold=0.5,
    outlier_removal=False,
    add_new_features=True,
    fe_config=fe_config,
    return_dataframe=True,
)
data_train_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44709 entries, 0 to 44708
Data columns (total 60 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   num__balance                   44709 non-null  float64
 1   num__day                       44709 non-null  float64
 2   num__duration                  44709 non-null  float64
 3   num__campaign                  44709 non-null  float64
 4   cat__job_admin.                44709 non-null  float64
 5   cat__job_blue-collar           44709 non-null  float64
 6   cat__job_entrepreneur          44709 non-null  float64
 7   cat__job_housemaid             44709 non-null  float64
 8   cat__job_management            44709 non-null  float64
 9   cat__job_retired               44709 non-null  float64
 10  cat__job_self-employed         44709 non-null  float64
 11  cat__job_services              44709 non-null  float64
 12  cat__job_student               44709 non-null 

In [7]:
data_test_prep = prep(
    df_=df_test,
    target_column='y',
    missing_method='drop',  # or 'fill'
    fill_value=None,
    fill_method=None,
    missing_threshold=0.5,
    outlier_removal=False,
    add_new_features=True,
    fe_config=fe_config,
    return_dataframe=True,
)

data_test_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4461 entries, 0 to 4460
Data columns (total 60 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   num__balance                   4461 non-null   float64
 1   num__day                       4461 non-null   float64
 2   num__duration                  4461 non-null   float64
 3   num__campaign                  4461 non-null   float64
 4   cat__job_admin.                4461 non-null   float64
 5   cat__job_blue-collar           4461 non-null   float64
 6   cat__job_entrepreneur          4461 non-null   float64
 7   cat__job_housemaid             4461 non-null   float64
 8   cat__job_management            4461 non-null   float64
 9   cat__job_retired               4461 non-null   float64
 10  cat__job_self-employed         4461 non-null   float64
 11  cat__job_services              4461 non-null   float64
 12  cat__job_student               4461 non-null   f

In [8]:
y_train = data_train_prep['y']
X_train = data_train_prep.drop(columns = ['y'])

In [9]:
X_test = data_test_prep.drop(columns = ['y'])

y_test = data_test_prep['y']
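
Because the training and test sets are preprocessed in separate calls, it may be worth confirming that both feature matrices expose the same columns in the same order before fitting. The helper below is a sketch of such a check. `align_to_training_columns` is a hypothetical name, and filling absent one-hot columns with `0.0` is an assumption that only makes sense for indicator features.

```python
import pandas as pd


def align_to_training_columns(X_train: pd.DataFrame, X_test: pd.DataFrame, fill_value: float = 0.0):
    """Reorder X_test to X_train's columns and report any mismatches (illustrative)."""
    missing = [c for c in X_train.columns if c not in X_test.columns]
    extra = [c for c in X_test.columns if c not in X_train.columns]
    # Columns absent from the test set are created and filled with fill_value;
    # columns unknown to the training set are dropped.
    aligned = X_test.reindex(columns=X_train.columns, fill_value=fill_value)
    return aligned, missing, extra


# Toy frames for demonstration only.
toy_train_X = pd.DataFrame({"a": [1.0], "b": [0.0], "c": [1.0]})
toy_test_X = pd.DataFrame({"c": [0.0], "a": [2.0], "d": [1.0]})
toy_aligned, toy_missing, toy_extra = align_to_training_columns(toy_train_X, toy_test_X)
print(toy_missing, toy_extra)
```

In this notebook both preprocessed frames report 60 columns, so the check would be expected to return empty `missing` and `extra` lists if the column names also agree.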

In [None]:
results = train_compare_models(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)

Models:   0%|          | 0/4 [00:00<?, ?model/s]2025-04-29 22:48:59 — INFO — Training LogisticRegression (1/4)…


n_iterations: 8
n_required_iterations: 8
n_possible_iterations: 8
min_resources_: 20
max_resources_: 44709
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 2235
n_resources: 20
Fitting 5 folds for each of 2235 candidates, totalling 11175 fits




----------
iter: 1
n_candidates: 745
n_resources: 60
Fitting 5 folds for each of 745 candidates, totalling 3725 fits




----------
iter: 2
n_candidates: 249
n_resources: 180
Fitting 5 folds for each of 249 candidates, totalling 1245 fits




----------
iter: 3
n_candidates: 83
n_resources: 540
Fitting 5 folds for each of 83 candidates, totalling 415 fits




----------
iter: 4
n_candidates: 28
n_resources: 1620
Fitting 5 folds for each of 28 candidates, totalling 140 fits




----------
iter: 5
n_candidates: 10
n_resources: 4860
Fitting 5 folds for each of 10 candidates, totalling 50 fits




----------
iter: 6
n_candidates: 4
n_resources: 14580
Fitting 5 folds for each of 4 candidates, totalling 20 fits




----------
iter: 7
n_candidates: 2
n_resources: 43740
Fitting 5 folds for each of 2 candidates, totalling 10 fits


2025/04/29 22:51:39 INFO mlflow.tracking.fluent: Experiment with name 'Bank_Marketing_Models' does not exist. Creating a new experiment.
Successfully registered model 'BankMarketing_LogReg'.
Created version '1' of model 'BankMarketing_LogReg'.
2025-04-29 22:51:52 — INFO — Evaluating LogisticRegression on test set…
Models:  25%|██▌       | 1/4 [02:52<08:38, 172.98s/model]2025-04-29 22:51:52 — INFO — Training RandomForest (2/4)…


Model registered: BankMarketing_LogReg
n_iterations: 8
n_required_iterations: 8
n_possible_iterations: 8
min_resources_: 20
max_resources_: 44709
aggressive_elimination: False
factor: 3
----------
iter: 0
n_candidates: 2235
n_resources: 20
Fitting 5 folds for each of 2235 candidates, totalling 11175 fits
