# Data Preprocessing
This notebook implements a preprocessing pipeline for data stored in CSV format. The main steps are the encoding of categorical variables, the imputation of missing values, and the reduction of memory usage.

In [1]:
from time import time

In [2]:
import numpy as np
import pandas as pd

In [3]:
from sklearn.neighbors import KNeighborsClassifier

In [4]:
timer_start = time()

## Importing the Dataset
We begin the preprocessing pipeline by importing the dataset. Each row describes an individual profile through demographic descriptors (such as age, job, and marital status) and financial indicators (such as balance and existing loans). Entries recorded as "unknown" are read as missing values. To get a first impression of the data, we display the first rows of the DataFrame.

In [5]:
df = pd.read_csv(filepath_or_buffer="data.csv",
                 na_values="unknown",
                 encoding="utf-8")

df.index.names = ['id']
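
As a minimal sketch of what `na_values="unknown"` does, the example below parses a small in-memory CSV instead of `data.csv`. The three rows and the name `toy` are made up for illustration and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV; "unknown" marks a missing entry.
raw = io.StringIO(
    "age,job,education\n"
    "58,management,tertiary\n"
    "47,blue-collar,unknown\n"
    "33,unknown,unknown\n"
)

# Every cell equal to "unknown" is converted to NaN while parsing.
toy = pd.read_csv(raw, na_values="unknown")
toy.index.names = ["id"]

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Counting missing values per column this way is a quick check on which columns will need imputation later.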

In [6]:
df.head()

id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,y
0,58,management,married,tertiary,no,2143,yes,no,,5,may,261,1,no
1,44,technician,single,secondary,no,29,yes,no,,5,may,151,1,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,,5,may,76,1,no
3,47,blue-collar,married,,no,1506,yes,no,,5,may,92,1,no
4,33,,single,,no,1,no,no,,5,may,198,1,no


In [7]:
print(f"The DataFrame has been allocated a memory space of {df.memory_usage().sum() / 10**6: .2f}MB")

The DataFrame has been allocated a memory space of  4.48MB
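
A note on the measurement: by default, `memory_usage()` counts only the 8-byte references held in `object` columns, not the Python strings they point to. The sketch below illustrates the difference on made-up data, assuming the string column is stored with the `object` dtype; `toy`, `shallow`, and `deep` are illustrative names.

```python
import pandas as pd

# Made-up data with one string column explicitly stored as object dtype.
toy = pd.DataFrame({
    "job": pd.Series(["management", "technician", "entrepreneur"] * 1000,
                     dtype=object),
    "age": [58, 44, 33] * 1000,
})

# Shallow measurement: object columns contribute only their references.
shallow = toy.memory_usage().sum()

# Deep measurement: the size of every string object is included as well.
deep = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow / 10**6:.3f}MB, deep: {deep / 10**6:.3f}MB")
```

For a DataFrame that still contains string columns, the deep figure is the more faithful estimate of the actual footprint.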


To prevent **data leakage**, we remove the columns *contact*, *day*, *month*, *duration*, and *campaign*. These columns contain information about the last marketing communication, which is typically unknown until the final communication actually occurs.

In [8]:
df.drop(labels=["contact",
                "day",
                "month",
                "duration",
                "campaign"],
        axis=1,
        inplace=True)

## Encoding of Categorical Variables and Imputation of Missing Values
We one-hot encode the categorical columns that contain no missing values. With `drop_first=True`, the first category of each column is dropped and serves as the implicit baseline.

In [9]:
df = pd.get_dummies(data=df,
                    columns=["marital",
                             "default",
                             "housing",
                             "loan",
                             "y"],
                    drop_first=True)

df.rename(columns={"marital_married": "married",
                   "marital_single": "single", 
                   "default_yes": "default",
                   "housing_yes": "housing", 
                   "loan_yes": "loan", 
                   "y_yes": "y"},
          inplace=True)
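
The effect of `drop_first=True` can be sketched on a small made-up frame. The category `divorced` is assumed here only to give *marital* a third level; the rows are illustrative.

```python
import pandas as pd

# Made-up rows; "divorced" is an assumed third category for illustration.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "loan": ["no", "yes", "no", "no"],
})

# Categories are sorted alphabetically and the first one is dropped,
# so "divorced" and "no" become the implicit baselines.
encoded = pd.get_dummies(data=toy,
                         columns=["marital", "loan"],
                         drop_first=True)

# Shorten the generated names, mirroring the renaming step above.
encoded = encoded.rename(columns={"marital_married": "married",
                                  "marital_single": "single",
                                  "loan_yes": "loan"})
print(encoded)
```

A row whose indicator columns for *marital* are all false therefore belongs to the dropped baseline category.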

The function **impute_na** imputes the missing values of a given column. It separates the rows in which the column is missing from the rows in which it is known, fits a **K-Nearest Neighbors classifier** on the known rows using the remaining columns as features, and predicts the missing values. It returns the DataFrame with the imputed rows appended after the complete ones.

In [10]:
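def impute_na(df: pd.DataFrame,
              column: str
              ) -> pd.DataFrame:
    """
    Imputes missing values in a DataFrame using a K-Nearest Neighbors Classifier.
    :param df: DataFrame whose remaining columns serve as features.
    :param column: Name of the column containing missing values.
    :return: Imputed DataFrame, with the imputed rows appended at the end.
    """
    # Rows to impute: keep only the feature columns.
    df_na = df[df[column].isna()].drop(labels=column,
                                       axis=1)

    # Complete rows; a new object is created so the caller's DataFrame is not mutated.
    df_known = df.dropna(axis=0)

    x = df_known.drop(labels=column,
                      axis=1)
    y = df_known[column]

    # Fit on the complete rows, then predict the missing labels.
    imputer = KNeighborsClassifier()
    imputer.fit(x, y)

    df_na[column] = imputer.predict(df_na)

    # The original index is preserved, but imputed rows follow the complete ones.
    return pd.concat(objs=[df_known, df_na],
                     axis=0)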
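
The fit-on-known, predict-on-missing pattern used by the function can be sketched on a small made-up frame. The values, the name `toy`, and the choice `n_neighbors=3` are assumptions made for illustration, not settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Made-up data: two clearly separated groups and two rows with a missing job.
toy = pd.DataFrame({
    "age": [25, 27, 26, 60, 62, 61, 24, 63],
    "balance": [100, 120, 110, 900, 950, 920, 105, 930],
    "job": ["student", "student", "student",
            "retired", "retired", "retired", np.nan, np.nan],
})

# Split into rows with a known label and rows to impute.
known = toy[toy["job"].notna()]
missing = toy[toy["job"].isna()].drop(columns="job")

# Fit on the known rows; n_neighbors=3 is an arbitrary choice for this toy set.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(known.drop(columns="job"), known["job"])

# Predict the missing labels and append the imputed rows.
filled = missing.assign(job=knn.predict(missing))
toy_imputed = pd.concat(objs=[known, filled], axis=0)
print(toy_imputed)
```

Because K-Nearest Neighbors relies on distances, features with a large numeric range (here *balance*) dominate the neighbor search unless the features are scaled first.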

We use the **impute_na** function to fill the missing values of the *job* column, then one-hot encode it. The *education* and *y* columns are set aside during the imputation and re-attached afterwards.

In [11]:
job_impute = impute_na(df=df.drop(labels=["education",
                                          "y"], 
                                  axis=1),
                       column="job")

df = pd.concat(objs=[job_impute, 
                     df.education, 
                     df.y], 
               axis=1)

In [12]:
df = pd.get_dummies(data=df,
                    columns=["job"],
                    drop_first=True)

df.rename(columns={column: column.split("_")[1] 
                   for column in df.columns 
                   if "job_" in column},
          inplace=True)

We ordinal-encode the *education* column (primary = 1, secondary = 2, tertiary = 3) and then use the **impute_na** function to fill its missing values.

In [13]:
df.education = df.education.apply(func=lambda x: 3 if x == "tertiary" else
                                                 2 if x == "secondary" else
                                                 1 if x == "primary" else
                                                 np.nan)
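
The same ordinal encoding can be sketched with a dictionary lookup through `Series.map`, which leaves values absent from the dictionary (including missing ones) as `NaN`. The series below is made up for illustration, and `education_order` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up education levels, one of them missing.
toy = pd.Series(["tertiary", "secondary", np.nan, "primary"])

# Hypothetical mapping from level to rank, matching the encoding above.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}

# Values not present in the mapping, such as NaN, stay NaN.
encoded = toy.map(education_order)
print(encoded)
```

The result is a float column because `NaN` cannot be stored in a plain integer column; it can be cast to an integer type once the missing values have been imputed.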

In [14]:
education_impute = impute_na(df=df.drop(labels="y",
                                        axis=1),
                             column="education")

df = pd.concat(objs=[education_impute,
                     df.y], 
               axis=1)

## Optimizing Memory Allocation
The one-hot and ordinal encodings above have already converted the string columns into boolean or numerical data types. We now downcast the remaining numerical columns to smaller integer types to reduce memory usage.

In [15]:
df = df.astype(dtype={"age": np.int8,
                      "education": np.int8,
                      "balance": np.int16})
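
Downcasting is only safe when every value fits into the target type: `np.int8` covers -128 to 127 and `np.int16` covers -32768 to 32767. The following sketch checks this on made-up values before casting; the helper `fits_in` is hypothetical and the numbers are not taken from the dataset.

```python
import numpy as np
import pandas as pd


def fits_in(series: pd.Series, dtype) -> bool:
    """Hypothetical helper: True if all values lie within the integer dtype's range."""
    info = np.iinfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)


# Made-up values; the second balance deliberately exceeds the int16 range.
toy = pd.DataFrame({"age": [18, 95],
                    "balance": [-8000, 40000]})

target_dtypes = {"age": np.int8, "balance": np.int16}
report = {column: fits_in(toy[column], dtype)
          for column, dtype in target_dtypes.items()}
print(report)

# Alternatively, let pandas pick the smallest signed integer type that fits.
smallest = pd.to_numeric(toy["balance"], downcast="integer").dtype
print(smallest)
```

Running such a check before `astype` guards against values outside the target range being silently altered by the cast.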

In [16]:
df = df[df.dtypes.sort_values(ascending=False).index]

We export the DataFrame in binary (pickle) format, which preserves the optimized data types, and report the memory footprint of the reloaded copy.

In [17]:
df.to_pickle(path="dataframe.pkl")

In [18]:
print(f"Allocated memory for the DataFrame: {pd.read_pickle(filepath_or_buffer='dataframe.pkl').memory_usage().sum() / 10**6: .2f}MB")

Allocated memory for the DataFrame:  1.12MB
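
To illustrate why a binary format is used here, the sketch below round-trips a small made-up frame through both pickle and CSV in a temporary directory. The file names and the frame `toy` are illustrative assumptions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Made-up frame with an already downcast integer column.
toy = pd.DataFrame({"age": np.array([58, 44, 33], dtype=np.int8),
                    "married": [True, False, True]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "toy.pkl")
    csv_path = os.path.join(tmp, "toy.csv")

    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    # Pickle restores the stored dtypes; CSV dtypes are inferred again on load.
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

After the CSV round trip the *age* column is inferred as a 64-bit integer again, whereas the pickled copy keeps its 8-bit type.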


Finally, we display the first rows of the preprocessed DataFrame.

In [19]:
df.head()

id,balance,age,education,management,unemployed,technician,student,services,self-employed,retired,entrepreneur,housemaid,blue-collar,loan,housing,default,single,married,y
0,2143,58,3,True,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False
1,29,44,2,False,False,True,False,False,False,False,False,False,False,False,True,False,True,False,False
2,2,33,2,False,False,False,False,False,False,False,True,False,False,True,True,False,False,True,False
5,231,35,3,True,False,False,False,False,False,False,False,False,False,False,True,False,False,True,False
6,447,28,3,True,False,False,False,False,False,False,False,False,False,True,True,False,True,False,False


In [20]:
print(f"Total running time of the script: {time() - timer_start: .2f}s")

Total running time of the script:  1.34s
