# Data Preparation
This notebook is intended to automatically process raw data into training-ready data with minimal need for customization. It assumes you've already done some Exploratory Data Analysis.

This notebook is the second step in a three-step process:
1) Exploratory Data Analysis
2) Data Preparation
3) Model Training and Evaluation

This notebook walks through these Data Preparation processes:
* Converting datatypes
* Winsorizing outliers
* Replacing missing values
* Filtering with specific criteria
* Normalizing numerical ranges
* Removing features that are too heavily correlated with other features
* Removing features that have too low correlation with the label
* Encoding categorical variables
* Converting strings to one-hot encoded columns
* Verifying balance
    * Calculating the proportion of each class
    * Addressing imbalance by upsampling underrepresented groups

Style guides:
* TODO: mostly meet [PEP-8](https://peps.python.org/pep-0008/)

#### Import dependencies

In [20]:
# Disable pylint checks that are not meaningful for notebook-style code
#   once the Jupyter Notebook is exported to a .py file
# pylint: disable=pointless-statement
# pylint: disable=fixme
# pylint: disable=expression-not-assigned
# pylint: disable=missing-module-docstring
# pylint: disable=invalid-name

import os
# import sys
from math import isnan

import pandas as pd
# import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats

sns.set_theme()

### Load the data into a Pandas dataframe

Define the path to the dataset file and read it into a dataframe. The name of the label column is determined in a later step.

In [21]:
rootdir = os.getcwd()
infile = os.path.join(rootdir,
                      'eCornell/CTECH462_Managing_Data_In_Machine_Learning/data',
                      'adult.data.full.asst')

df = pd.read_csv(infile)

#### Defining reused variables

In [22]:
RAW_NUM_SAMPLES           = df.shape[0]
RAW_COLUMN_NAMES          = df.columns.tolist()
RAW_NUMERIC_COLUMN_NAMES  = df.select_dtypes(include=np.number).columns.tolist()
RAW_OBJECT_COLUMN_NAMES   = df.select_dtypes(include=object).columns.tolist()

print(RAW_NUM_SAMPLES)
print(RAW_COLUMN_NAMES)
print(RAW_NUMERIC_COLUMN_NAMES)
print(RAW_OBJECT_COLUMN_NAMES)

32561
['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income_binary']
['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex_selfID', 'native-country', 'income_binary']


#### Locating or defining the column name for the label/target

Check whether a column named 'label' or 'target' already exists (compared case-insensitively). If so, use it as the label name. Otherwise, fall back to the last column.

In [23]:
label_name_list = ['label', 'target']

LABEL_COLUMN_NAME = None

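# Map lowercase names back to the actual column names so the match is
# case-insensitive but the stored name matches the dataframe exactly.
lowercase_to_actual = {name.lower(): name for name in RAW_COLUMN_NAMES}

for candidate in label_name_list:
    if candidate.lower() in lowercase_to_actual:
        LABEL_COLUMN_NAME = lowercase_to_actual[candidate.lower()]
        break

# Fall back to the last column when no conventional label name is found.
if LABEL_COLUMN_NAME is None:
    LABEL_COLUMN_NAME = RAW_COLUMN_NAMES[-1]
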
print(LABEL_COLUMN_NAME)

income_binary


## Detecting and converting dates and timestamps
Description TBD

In [None]:
# Code TBD

## Winsorizing outliers
Description TBD

In [None]:
# Code TBD
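
One possible approach, offered as a minimal sketch rather than this notebook's final implementation: winsorizing can be approximated by clipping each numeric column to a pair of its own quantiles. The helper name `winsorize_columns`, the quantile limits, and the demo values are all assumptions made for illustration.

```python
import pandas as pd


def winsorize_columns(frame, columns, lower_q=0.01, upper_q=0.99):
    """Clip each listed column to its own [lower_q, upper_q] quantile range.

    Hypothetical helper: the name and default limits are illustrative.
    """
    result = frame.copy()
    for col in columns:
        lower, upper = result[col].quantile([lower_q, upper_q])
        # Cast to float so the clipped bounds are not truncated in integer columns.
        result[col] = result[col].astype(float).clip(lower=lower, upper=upper)
    return result


# Illustrative values only; not taken from the dataset.
demo = pd.DataFrame(
    {'hours-per-week': [1, 38, 40, 40, 40, 42, 45, 50, 60, 99]}
)
winsorized = winsorize_columns(demo, ['hours-per-week'],
                               lower_q=0.1, upper_q=0.9)
print(winsorized['hours-per-week'].tolist())
```

Values inside the quantile range are left unchanged; only the tails are pulled in. Clipping at interpolated quantiles can differ slightly from rank-based winsorizing routines, so the limits should be checked against the distributions seen during Exploratory Data Analysis.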

## Replacing missing values
Description TBD

In [None]:
# Code TBD
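
As a hedged sketch of one common strategy, missing numeric values can be replaced with the column median and missing non-numeric values with the most frequent value. The helper name `fill_missing` and the demo data are assumptions; the sketch also assumes missing entries are already encoded as NaN/None and that every column has at least one observed value.

```python
import numpy as np
import pandas as pd


def fill_missing(frame):
    """Fill NaNs: median for numeric columns, mode for all other columns.

    Hypothetical helper; assumes each column has at least one observed value.
    """
    result = frame.copy()
    for col in result.columns:
        if not result[col].isna().any():
            continue
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].median())
        else:
            # mode() ignores NaN; take the first value if there is a tie.
            result[col] = result[col].fillna(result[col].mode().iloc[0])
    return result


# Illustrative values only; not taken from the dataset.
demo = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'workclass': ['Private', 'Private', None, 'State-gov'],
})
filled = fill_missing(demo)
print(filled)
```

If the raw file marks missing values with a placeholder string instead of NaN, those placeholders would need to be converted to NaN before a step like this has any effect.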

## Filtering with specific criteria
Description TBD

In [None]:
# Code TBD

## Normalizing numerical ranges
Description TBD

In [None]:
# Code TBD
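
One option for this step is min-max scaling, which maps each numeric column onto the range [0, 1]. The sketch below is an assumption about how it might be done here, not the notebook's settled method; `min_max_scale` and the demo values are hypothetical.

```python
import pandas as pd


def min_max_scale(frame, columns):
    """Rescale each listed column to the range [0, 1].

    Hypothetical helper; constant columns are mapped to 0.0.
    """
    result = frame.copy()
    for col in columns:
        col_min = result[col].min()
        col_range = result[col].max() - col_min
        if col_range == 0:
            # A constant column has zero range; avoid dividing by zero.
            result[col] = 0.0
        else:
            result[col] = (result[col] - col_min) / col_range
    return result


# Illustrative values only; not taken from the dataset.
demo = pd.DataFrame({'age': [20, 30, 40, 60],
                     'education-num': [9, 9, 9, 9]})
scaled = min_max_scale(demo, ['age', 'education-num'])
print(scaled)
```

Min-max scaling is sensitive to extreme values, which is one reason to run it after any outlier handling. Z-score standardization is a reasonable alternative when a bounded range is not required.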

## Encoding categorical variables
Description TBD

In [None]:
# Code TBD

## Converting strings to one-hot encoded columns
Description TBD

In [None]:
# Code TBD
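
A minimal sketch of one-hot encoding with `pd.get_dummies`, assuming the string columns to encode are listed explicitly. The demo frame is illustrative and much smaller than the real dataset.

```python
import pandas as pd

# Illustrative values only; not taken from the dataset.
demo = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': ['State-gov', 'Private', 'Private'],
})

# Each distinct string becomes its own 0/1 column named <column>_<value>;
# columns that are not listed (here 'age') pass through unchanged.
encoded = pd.get_dummies(demo, columns=['workclass'], dtype=int)
print(encoded.columns.tolist())
```

Columns with many distinct values produce one new column per value, so it can be worth checking cardinality before encoding everything.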

## Removing features that are too heavily correlated with other features
Description TBD

In [None]:
# Code TBD
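
One way this step could work, sketched under stated assumptions: compute the absolute pairwise correlation matrix, look only at its upper triangle so each pair is considered once, and drop the later column of any pair whose correlation exceeds a threshold. The helper name, the 0.9 threshold, and the demo data are hypothetical, and the label column is assumed to have been set aside before the call.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame, threshold=0.9):
    """Drop the later column of every pair with |correlation| > threshold.

    Hypothetical helper; returns the reduced frame and the dropped names.
    """
    corr = frame.corr(numeric_only=True).abs()
    # Mask everything except the strict upper triangle.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Illustrative values only; 'b' is exactly 2 * 'a'.
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced, dropped = drop_highly_correlated(demo, threshold=0.9)
print(dropped)
```

Which member of a correlated pair gets dropped depends on column order in this sketch, so a different rule (for example, keeping the feature more correlated with the label) may be preferable.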

## Removing features that have too low correlation with the label
Description TBD

In [None]:
# Code TBD
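
A possible sketch of this step, assuming the label has already been encoded as a number (for example 0/1): compute each numeric feature's absolute correlation with the label and drop the features that fall below a threshold. The helper name, the 0.05 threshold, and the demo values are assumptions.

```python
import pandas as pd


def drop_weakly_correlated(frame, label, threshold=0.05):
    """Drop numeric features whose |correlation with label| < threshold.

    Hypothetical helper; assumes the label column is numeric.
    """
    numeric = frame.select_dtypes(include='number')
    corr_with_label = numeric.corr()[label].drop(label).abs()
    to_drop = corr_with_label[corr_with_label < threshold].index.tolist()
    return frame.drop(columns=to_drop), to_drop


# Illustrative values only; 'noise' is built to be uncorrelated with the label.
demo = pd.DataFrame({
    'signal': [1, 2, 8, 9, 1, 7],
    'noise': [1, 2, 1, 2, 3, 3],
    'income_binary': [0, 0, 1, 1, 0, 1],
})
reduced, dropped = drop_weakly_correlated(demo, 'income_binary')
print(dropped)
```

Pearson correlation only captures linear relationships, so a feature dropped by this rule could still be useful to a non-linear model; the threshold is best treated as a tunable setting.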

## Correcting for bias and underrepresentation
Description TBD

In [None]:
# Code TBD
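
The two balance steps named in the introduction (calculating the proportion of each class, then upsampling underrepresented groups) could be sketched roughly as follows. The helper `upsample_minority`, the class values `'low'` and `'high'`, and the demo rows are hypothetical and chosen only for illustration.

```python
import pandas as pd


def upsample_minority(frame, label, random_state=0):
    """Resample smaller classes with replacement up to the largest class size.

    Hypothetical helper; returns a shuffled, re-indexed dataframe.
    """
    counts = frame[label].value_counts()
    target = counts.max()
    parts = []
    for value, count in counts.items():
        group = frame[frame[label] == value]
        if count < target:
            extra = group.sample(n=target - count, replace=True,
                                 random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    combined = pd.concat(parts)
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Illustrative values only; class names are placeholders.
demo = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29],
    'income_binary': ['low', 'low', 'low', 'low', 'high', 'high'],
})

# Proportion of each class before balancing.
proportions = demo['income_binary'].value_counts(normalize=True)
print(proportions)

balanced = upsample_minority(demo, 'income_binary')
print(balanced['income_binary'].value_counts())
```

Upsampling duplicates existing rows, so it is usually applied to the training split only; duplicating rows before the train/test split can leak copies of the same sample into the evaluation set.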