# Data Mining

This notebook is dedicated to:

1. Data cleaning and preprocessing
2. Training and evaluating four data mining models to understand complex patterns within the collected data

As one of the critical components of a decision support system, the DM models will detect underlying patterns and relations within the data.

## Required Packages

In [1]:
import sys

sys.path.append(r"h:\Resume\Projects\DataScience\UCI _ Bank Case Study\Project")

import pandas as pd
import numpy as np

import importlib


from src.data_preprocessing import process_dataset as prep

import src.data_ingestion as load

## Data Preparation

### Loading the data

In [2]:
df_train = load.load_csv_to_dataframe(file_path='../data/raw/bank-additional-full.csv')
df_test = load.load_csv_to_dataframe(file_path='../data/raw/bank-additional.csv')

Dataset loaded successfully with 41188 rows and 21 columns.
Dataset loaded successfully with 4119 rows and 21 columns.
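
The loader itself is not shown in this notebook. A minimal sketch of what such a helper might look like follows; `load_csv_sketch` is a hypothetical stand-in rather than the project's `load_csv_to_dataframe`, and the semicolon delimiter is an assumption based on how the UCI bank-additional files are distributed.

```python
import io

import pandas as pd


def load_csv_sketch(file_path, sep=";"):
    """Hypothetical stand-in for load.load_csv_to_dataframe (not the project's code)."""
    # Assumption: the file is semicolon-delimited.
    df = pd.read_csv(file_path, sep=sep)
    print(f"Dataset loaded successfully with {df.shape[0]} rows and {df.shape[1]} columns.")
    return df


# Tiny in-memory sample so the sketch runs without the real files.
sample = io.StringIO('age;job;y\n56;"housemaid";"no"\n37;"services";"yes"\n')
df_sample = load_csv_sketch(sample)
```

Passing the delimiter as a parameter keeps the helper usable for comma-separated files as well.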


### Looking at the data composition

Searching for any severe anomalies within the data.

From the exploratory data analysis, we know that the collected data doesn't show any abnormalities.

### Data Cleaning
* Remove duplicate rows.
* Handle missing values.
* Correct textual inconsistencies.
* Remove outliers.
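
The first three cleaning steps can be sketched with plain pandas as below. This is an illustration on made-up rows, not the project's `process_dataset` implementation; `clean_sketch` and its `text_columns` argument are hypothetical names. Outlier removal is left out here because the preprocessing call later in this notebook passes `outlier_removal=False`.

```python
import numpy as np
import pandas as pd

# Made-up rows: one duplicate (after text normalization) and one missing value.
raw = pd.DataFrame({
    "age": [30, 30, 45, 52, np.nan],
    "job": [" Admin.", "admin.", "services", "TECHNICIAN ", "retired"],
    "y": ["no", "no", "yes", "no", "yes"],
})


def clean_sketch(df, text_columns):
    """Hypothetical cleaning helper: normalize text, drop duplicates, drop missing rows."""
    df = df.copy()
    # Correct textual inconsistencies: trim whitespace and lower-case the labels.
    for col in text_columns:
        df[col] = df[col].str.strip().str.lower()
    # Remove duplicate rows (done after normalization so near-duplicates collapse).
    df = df.drop_duplicates()
    # Handle missing values by dropping incomplete rows (the 'drop' strategy).
    df = df.dropna()
    return df.reset_index(drop=True)


cleaned = clean_sketch(raw, text_columns=["job", "y"])
```

Normalizing text before de-duplicating matters: rows that differ only in casing or stray whitespace would otherwise survive as distinct records.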

### Data Transformation

* Convert categorical variables into one-hot vectors.
* Normalize the numerical features.
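
As a small illustration of these two transformations, the sketch below one-hot encodes a categorical column with `pd.get_dummies` and rescales a numeric column to zero mean and unit variance. The notebook does not state which scaler the project uses, so z-score standardization is an assumption here, and the toy values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "duration": [100.0, 200.0, 300.0],
    "contact": ["cellular", "telephone", "cellular"],
})

# One-hot encoding: one 0/1 column per category of 'contact'.
encoded = pd.get_dummies(toy, columns=["contact"], prefix="contact", dtype=float)

# Z-score standardization (assumed scaler), using the population standard deviation.
mean = encoded["duration"].mean()
std = encoded["duration"].std(ddof=0)
encoded["duration"] = (encoded["duration"] - mean) / std
```

After this step every column is numeric, which is what most model implementations expect as input.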

In [3]:
df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

In [4]:
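# Feature-engineering configuration: each key is a source column to discretize.
fe_config = config = {
    # Days since the previous contact; 999 is a sentinel handled separately.
    "pdays": {
        "new_name": "pdays_category",
        "special": 999,
        "special_label": "never contacted",
        "bins": [-1, 1, 14, 30],
        "labels": ['[0, 1]', '(1, 14]', '(14, 30]']
    },
    # Client age grouped into three bands.
    "age": {
        "new_name": "age_category",
        "bins": [0, 30, 50, 100],
        "labels": ['young', 'middle-aged', 'senior']
    },
    # Number of contacts before the current campaign.
    "previous": {
        "new_name": "previous_category",
        "bins": [-1, 0, 3, 6, np.inf],
        "labels": ['0', '[1,3]', '[4,6]', '>6']
    }
}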
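
How `process_dataset` consumes this configuration is not shown here. One plausible reading, sketched below with `pd.cut`, treats `bins` and `labels` as right-closed intervals and maps the `special` value to `special_label`. The helper `apply_binning_sketch` is hypothetical, and it keeps the original columns, which the project may instead drop.

```python
import numpy as np
import pandas as pd


def apply_binning_sketch(df, config):
    """Hypothetical helper: add one categorical column per entry in config."""
    out = df.copy()
    for column, spec in config.items():
        values = out[column]
        # pd.cut uses right-closed intervals by default, e.g. (-1, 1], (1, 14].
        binned = pd.cut(values, bins=spec["bins"], labels=spec["labels"]).astype(object)
        if "special" in spec:
            # Values equal to the sentinel fall outside the bins; label them explicitly.
            binned = binned.where(values != spec["special"], spec["special_label"])
        out[spec["new_name"]] = binned
    return out


demo_config = {
    "pdays": {
        "new_name": "pdays_category",
        "special": 999,
        "special_label": "never contacted",
        "bins": [-1, 1, 14, 30],
        "labels": ['[0, 1]', '(1, 14]', '(14, 30]'],
    },
    "previous": {
        "new_name": "previous_category",
        "bins": [-1, 0, 3, 6, np.inf],
        "labels": ['0', '[1,3]', '[4,6]', '>6'],
    },
}

demo = pd.DataFrame({"pdays": [0, 6, 999], "previous": [0, 2, 7]})
result = apply_binning_sketch(demo, demo_config)
```

Starting the bins at -1 is what lets a value of 0 land in the first interval, since the left edge of each interval is exclusive.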

In [5]:
# Preprocessing the data
data_train_prep = prep(
    df_=df_train,
    target_column='y',
    missing_method='drop',  # or 'fill'
    fill_value=None,
    fill_method=None,
    missing_threshold=0.5,
    outlier_removal=False,
    add_new_features=True,
    fe_config=fe_config,
    return_dataframe=True,
)

data_train_prep.info()  # info() prints its summary directly and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41176 entries, 0 to 41175
Data columns (total 72 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   num__duration                        41176 non-null  float64
 1   num__campaign                        41176 non-null  float64
 2   num__emp.var.rate                    41176 non-null  float64
 3   num__cons.price.idx                  41176 non-null  float64
 4   num__cons.conf.idx                   41176 non-null  float64
 5   num__euribor3m                       41176 non-null  float64
 6   num__nr.employed                     41176 non-null  float64
 7   num__y                               41176 non-null  float64
 8   cat__job_admin.                      41176 non-null  float64
 9   cat__job_blue-collar                 41176 non-null  float64
 10  cat__job_entrepreneur                41176 non-null  float64
 11  cat__job_housemaid          