# Process the Data
Now that I have explored the data, I want to process it to prepare it for training.

---
## Imports

In [1]:
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

---
## Load the Data

In [2]:
dataset = pd.read_csv("../data/adult_raw.csv")

### Prepare Columns, Replace **?** with **NaN**

In [3]:
dataset.columns = [col.replace(".",'_') for col in dataset.columns]
dataset.replace('?', np.nan, inplace=True)
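
As an alternative to replacing after loading, the `?` placeholder can be turned into NaN at read time. This is a minimal sketch on a tiny in-memory CSV; the three rows and the name `raw_csv` are made up for illustration and are not taken from `adult_raw.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the raw file's layout.
raw_csv = io.StringIO(
    "age,workclass,native.country,income\n"
    "39,Private,United-States,<=50K\n"
    "50,?,United-States,>50K\n"
    "28,Private,?,<=50K\n"
)

# na_values tells read_csv to treat the string "?" as missing.
sample = pd.read_csv(raw_csv, na_values="?")
sample.columns = [col.replace(".", "_") for col in sample.columns]

# Count missing values per column to see where the gaps are.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

If a copy of the file stores the placeholder with a leading space, `skipinitialspace=True` may also be needed for the match to work.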

### Remove Records that Contain NaN

In [4]:
dataset.dropna(inplace=True)
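
Before dropping rows it can be worth checking how much data is lost. The sketch below does this on a hypothetical toy frame (the values are illustrative only); the same comparison could be run on `dataset`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 28, 41],
    "workclass": ["Private", np.nan, "Private", "State-gov"],
    "occupation": ["Sales", "Tech-support", np.nan, "Sales"],
})

rows_before = len(toy)
cleaned = toy.dropna()  # returns a new frame; toy itself is untouched
rows_after = len(cleaned)

# Fraction of records removed because they contain at least one NaN.
dropped_fraction = (rows_before - rows_after) / rows_before
print(f"dropped {rows_before - rows_after} of {rows_before} rows "
      f"({dropped_fraction:.0%})")
```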

### Divide Data Into Features and Target
I do this before encoding so that the target is not one-hot encoded along with the features.

In [5]:
features = dataset.drop('income', axis=1)
target = dataset['income']

### One-Hot Encode Categorical Features and Map the Target to 0/1

In [6]:
features = pd.get_dummies(features)
target = target.map({
    '<=50K': 0,
    '>50K': 1
})
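
To see what the encoding step produces, here is a small sketch on hypothetical toy data (the column names and values are assumptions for illustration). It shows that `pd.get_dummies` leaves numeric columns alone and expands string columns, and that `Series.map` turns any label missing from the mapping into NaN, which is worth checking for.

```python
import pandas as pd

# Hypothetical toy data; not taken from the real dataset.
toy_features = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Male", "Female", "Male"],
})
toy_target = pd.Series(["<=50K", ">50K", "<=50K"])

# Numeric columns pass through; each category becomes its own column.
encoded = pd.get_dummies(toy_features)
print(list(encoded.columns))

mapped = toy_target.map({"<=50K": 0, ">50K": 1})

# Labels that are not in the mapping become NaN, so verify none slipped through.
assert mapped.notna().all(), "found labels outside the mapping"
print(mapped.tolist())
```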

### Get the Train Test Split

In [7]:
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, stratify = target)
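
The `stratify` argument does not balance the classes; it keeps the class proportions of the target roughly equal in the training and testing sets. The sketch below checks this on hypothetical imbalanced data. The fixed `random_state` is an assumption added here for reproducibility and is not part of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 75% class 0, 25% class 1.
toy_features = pd.DataFrame({"x": np.arange(100)})
toy_target = pd.Series([0] * 75 + [1] * 25)

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,
    stratify=toy_target,
    random_state=42,  # assumption: fixed seed so the split is repeatable
)

# The share of class 1 should match the original 25% in both sets.
print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```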

### All Preprocessing Steps in One Function

In [8]:
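def train_test_from_raw_data(dataset):
    """Clean the raw dataset and return x_train, x_test, y_train, y_test."""
    # Work on a copy so the caller's DataFrame is not modified.
    dataset = dataset.copy()

    # Normalize column names and mark '?' placeholders as missing.
    dataset.columns = [col.replace('.', '_') for col in dataset.columns]
    dataset = dataset.replace('?', np.nan).dropna()

    # Split off the target before encoding so it is not one-hot encoded.
    features = dataset.drop('income', axis=1)
    target = dataset['income']

    # One-hot encode categorical features; map the target to 0/1.
    features = pd.get_dummies(features)
    target = target.map({
        '<=50K': 0,
        '>50K': 1
    })

    # Stratify so train and test keep the target's class proportions.
    return train_test_split(features, target, test_size=0.2, stratify=target)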

## Conclusions

The steps I take to process the data are the following:

- Replace **.** with **_** in the column names
- Replace **?** with **NaN**
- Drop records that contain **NaN**
- Divide the dataset into features and target
- One-hot encode the categorical features and map the target to 0/1
- Get the training and testing sets, stratifying on the target so that both sets keep its class distribution