# Introduction to Data Preprocessing

Data preprocessing prepares a dataset for analysis and modeling. This notebook transforms the cleaned census income data into a numeric, structured format that predictive models can use. The main steps are encoding categorical variables, scaling numerical features, and engineering features for the machine learning pipeline.

## Objectives of Preprocessing
  
- **Encoding Categorical Variables**: Convert categorical features into numerical formats using techniques such as one-hot encoding or label encoding.
  
- **Scaling and Normalization**: Apply scaling techniques to standardize features, ensuring that variables with different scales do not skew the model’s results.
  
- **Feature Engineering**: Create new features or modify existing ones to improve model accuracy.

## Importance of Preprocessing

Most machine learning algorithms expect numeric input on comparable scales. Preprocessing supplies that: encoding turns categories into numbers, and scaling keeps features with large ranges from dominating gradient-based or distance-based models. Done carefully, it can also improve model performance.


In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, MinMaxScaler

%run ../census_income/utils/pandas_missing_handler.py
%run ../census_income/utils/pandas_explorer.py

## Loading the Data and a First Look

In [124]:
path = '../census_income/data/cleaned_data.csv'
df = pd.read_csv(path).drop(columns='Unnamed: 0')
df.head()

| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |


## Encoding Nominal Categorical Variables Using LabelEncoder

In [125]:
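# Nominal (unordered) columns to convert to integer codes
cols_to_encode = ['workclass', 'marital_status', 'occupation', 'relationship', 'race', 'gender', 'native_country']

encoded_df = df.copy()
for col in cols_to_encode:
    # Fit a fresh encoder per column; codes follow the sorted order of the labels
    encoded_df[col] = LabelEncoder().fit_transform(df[col])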
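
`LabelEncoder` assigns integer codes in the sorted order of a column's unique values, and a single encoder refitted in a loop only remembers the last column it saw. The following is a minimal sketch, assuming a toy frame with made-up values, of keeping one fitted encoder per column so that codes can be decoded later. The `encoders` dictionary is a hypothetical helper, not part of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; the values are illustrative, not taken from the census file.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "gender": ["Male", "Female", "Female"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Codes follow the sorted order of each column's unique values.
workclass_codes = encoded["workclass"].tolist()
# Each stored encoder can map its codes back to the original labels.
decoded_gender = encoders["gender"].inverse_transform(encoded["gender"]).tolist()
```

The integer codes carry no meaningful order, so models that treat inputs as magnitudes (linear models, for example) may read structure into them that is not there; one-hot encoding is a common alternative for nominal columns.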

## Encoding Ordinal Categorical Variables Using OrdinalEncoder

In [126]:
education_levels = [
    ' Preschool',
    ' 1st-4th',
    ' 5th-6th',
    ' 7th-8th',
    ' 9th',
    ' 10th',
    ' 11th',
    ' 12th',
    ' HS-grad',
    ' Some-college',
    ' Assoc-voc',   # education_num ranks Assoc-voc (11) below Assoc-acdm (12)
    ' Assoc-acdm',
    ' Bachelors',
    ' Masters',
    ' Prof-school',
    ' Doctorate'
]
encoder = OrdinalEncoder(categories=[education_levels])
encoded_df['education'] = encoder.fit_transform(df['education'].to_frame())
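
With an explicit `categories` list, `OrdinalEncoder` assigns each label its position in that list, and by default it raises an error for any label that is missing from the list (a label without the expected leading space, for instance). The sketch below uses a shortened, illustrative list and an invented `"Unknown"` label to show how the `handle_unknown` and `unknown_value` parameters can map such labels to a sentinel instead.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened, illustrative category list, ordered lowest to highest.
levels = ["HS-grad", "Bachelors", "Masters"]
toy = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Unknown"]})

# Labels outside `levels` are encoded as -1 instead of raising at transform time.
enc = OrdinalEncoder(
    categories=[levels],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit(toy[["education"]].iloc[:3])  # fit on the rows with known labels
codes = enc.transform(toy[["education"]]).ravel().tolist()
```

Here each known label receives its index in `levels`, and the unseen label receives the sentinel.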

## Mapping the Dependent Variable (income) Using One-Hot Encoding

In [127]:
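# drop_first=True keeps a single indicator column for the binary target
encoded_df = pd.get_dummies(encoded_df, columns=['income'], drop_first=True)
# get_dummies may return booleans, so cast the indicator to 0/1 integers
encoded_df['income_ >50K'] = encoded_df['income_ >50K'].astype(int)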
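
For a two-level target, one-hot encoding with the first level dropped reduces to a single 0/1 indicator. As a rough illustration on toy data, the sketch below reproduces that route and compares it with a direct comparison that also avoids the space in the column name. The name `income_gt_50k` is hypothetical and not used elsewhere in this notebook.

```python
import pandas as pd

# Toy target column; the leading space mirrors the labels used above.
toy = pd.DataFrame({"income": [" <=50K", " >50K", " <=50K"]})

# Route used above: dummies with the first (sorted) level dropped.
dummies = pd.get_dummies(toy, columns=["income"], drop_first=True)
via_dummies = dummies["income_ >50K"].astype(int).tolist()

# Alternative: an explicit comparison, stored under a name without spaces.
toy["income_gt_50k"] = (toy["income"].str.strip() == ">50K").astype(int)
via_compare = toy["income_gt_50k"].tolist()
```

Both routes give the same indicator; the explicit comparison makes it visible which class is coded as 1.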

## Scaling the Numerical Values Using MinMaxScaler

In [128]:
scaler = MinMaxScaler()
cols_to_scale = [
    'age',
    'fnlwgt',
    'education_num',
    'capital_gain',
    'capital_loss',
    'hours_per_week'
]

encoded_df[cols_to_scale] = scaler.fit_transform(encoded_df[cols_to_scale])
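
Min-max scaling maps each value `x` in a column to `(x - min) / (max - min)`, so the column's smallest value becomes 0 and its largest becomes 1. A small sketch, assuming illustrative ages whose bounds of 17 and 90 are assumptions rather than values read from the file, checks the formula against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative ages; 17 and 90 are assumed bounds, not read from the data.
ages = np.array([[17.0], [50.0], [90.0]])

# Manual formula: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())
# Same result from scikit-learn
scaled = MinMaxScaler().fit_transform(ages)
```

If the data is later split into training and test sets, fitting the scaler on the training portion only keeps test-set ranges from leaking into the features.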

## Combining 'capital_gain' and 'capital_loss' into a New 'capital_balance' Column

In [129]:
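# Both inputs were already min-max scaled above, so this is the difference
# of two [0, 1] values rather than a net amount in the original units
encoded_df['capital_balance'] = encoded_df['capital_gain'] - encoded_df['capital_loss']
encoded_df = encoded_df.drop(columns=['capital_gain', 'capital_loss'])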
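
Because gain and loss are each scaled by their own range, subtracting the scaled columns is not the same as scaling the net amount. The sketch below, using invented amounts, compares the two orders; it is an illustration of the difference, not a change to the pipeline above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Invented amounts for illustration only.
toy = pd.DataFrame({
    "capital_gain": [0.0, 5000.0, 10000.0],
    "capital_loss": [0.0, 2000.0, 1000.0],
})

# Order A: net amount in original units, then scale the single new column.
net = (toy["capital_gain"] - toy["capital_loss"]).to_frame("capital_balance")
balance_then_scale = MinMaxScaler().fit_transform(net).ravel()

# Order B (as in the cell above): scale each column, then subtract.
scaled = MinMaxScaler().fit_transform(toy)
scale_then_balance = scaled[:, 0] - scaled[:, 1]
```

In this toy case the middle row has a positive net amount (5000 - 2000), yet order B yields a negative value for it because the loss column is stretched over a much smaller range.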

In [130]:
encoded_df.to_csv('../census_income/data/preprocessed_data.csv')
encoded_df.head()

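| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | gender | hours_per_week | native_country | income_ >50K | capital_balance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.452055 | 4 | 0.047277 | 12.0 | 0.8 | 2 | 3 | 0 | 4 | 1 | 0.122449 | 38 | 0 | 0.0 |
| 1 | 0.287671 | 2 | 0.137244 | 8.0 | 0.533333 | 0 | 5 | 1 | 4 | 1 | 0.397959 | 38 | 0 | 0.0 |
| 2 | 0.493151 | 2 | 0.150212 | 6.0 | 0.4 | 2 | 5 | 0 | 2 | 1 | 0.397959 | 38 | 0 | 0.0 |
| 3 | 0.150685 | 2 | 0.220703 | 12.0 | 0.8 | 2 | 9 | 5 | 2 | 0 | 0.397959 | 4 | 0 | 0.0 |
| 4 | 0.273973 | 2 | 0.184109 | 13.0 | 0.866667 | 2 | 3 | 5 | 4 | 0 | 0.397959 | 38 | 0 | 0.0 |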
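
By default `to_csv` writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` on reload and explains why the loading cell drops that column. A minimal sketch with an in-memory buffer and toy values shows the default behavior next to `index=False`:

```python
import io
import pandas as pd

# Toy frame; values are illustrative.
toy = pd.DataFrame({"age": [0.45, 0.29], "income_ >50K": [0, 1]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False limits the file to the actual data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
```

Whichever convention is chosen, the code that reads the file needs to match it, for example by dropping `Unnamed: 0` or by passing `index_col=0` to `read_csv`.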


# Conclusions

The preprocessing stage has produced a numeric, model-ready dataset for the subsequent modeling phase. Key points:

**Encoding Categorical Variables**:
   - Nominal columns were encoded with `LabelEncoder`, the ordinal `education` column with `OrdinalEncoder`, and the `income` target was converted to a single 0/1 indicator with `pd.get_dummies`. `OneHotEncoder` was imported but not used.
   - This transformation is needed because most machine learning algorithms require numeric input.


**Feature Scaling**:
   - Quantitative variables were scaled to the [0, 1] range using `MinMaxScaler`. Scaling mainly helps gradient-based models such as logistic regression converge and keeps features with large ranges from dominating; tree-based models are largely insensitive to it.

**Creation of New Features**:
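   - A new feature, 'capital_balance', was derived as 'capital_gain' minus 'capital_loss', and the two source columns were dropped. Because the subtraction was applied after min-max scaling, the feature is the difference of the scaled values rather than a net amount in the original units. It is intended as a compact proxy for net capital impact and a potentially useful predictor for the target variable.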

**Prepared Data for Modeling**:
   - The preprocessing steps have prepared the dataset for modeling with various algorithms, including logistic regression, decision trees, and random forests. This ensures that the models can be trained on a clean, well-structured dataset.

In summary, the preprocessing phase has been crucial in transforming the raw data into a structured format suitable for analysis. With these steps completed, the next phase will focus on training and evaluating predictive models to gain insights into income prediction.
