# Preprocessing Code Example

### Part I: Setup
This part imports the required libraries and loads the dataset. Change the dataset path as needed.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('adult.data', header = None)

In [14]:
df.columns = ['age', 'workclass', 'fnlwgt', 'edu', 'edu-num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'cap-gain', 'cap-loss','hpw','native country','income']
df.head()

| | age | workclass | fnlwgt | edu | edu-num | marital | occupation | relationship | race | sex | cap-gain | cap-loss | hpw | native country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


### Part II: Basic Data Understanding

We can list every distinct value in a column, together with its count, by using the following code. This is useful for understanding all aspects of the dataset.

In [4]:
df['workclass'].value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64
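The output above shows two things worth handling before modelling: the category labels carry a leading space, and missing values are recorded as `?`. The snippet below is a minimal sketch of one way to clean such a column. It uses a small hypothetical sample rather than the real file, and the variable names (`sample`, `stripped`, `cleaned`, `n_missing`) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical mini-sample that mimics the raw 'workclass' strings
# (note the leading space in each value).
sample = pd.DataFrame({'workclass': [' Private', ' ?', ' State-gov', ' Private', ' ?']})

# Remove the surrounding whitespace from every label.
stripped = sample['workclass'].str.strip()

# Replace the '?' placeholder with a proper missing value (NaN).
cleaned = stripped.mask(stripped == '?')

# Count each category, keeping the missing values visible.
print(cleaned.value_counts(dropna=False))

n_missing = int(cleaned.isna().sum())
print('missing:', n_missing)
```

After this step the missing entries can be dropped or imputed with the usual pandas tools, depending on what the analysis needs.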

### Part III: Preprocessing

The following lines of code show how you can normalize the data. Note that scikit-learn scalers expect a 2D input (rows are samples, columns are features), so if you want to normalize only one feature, you need to pass it in that shape.

You can create a new one-column dataframe by using old_df[['feature_name']]

More info at https://scikit-learn.org/stable/modules/preprocessing.html

In [5]:
df2 = df[['age']]

from sklearn import preprocessing
import numpy as np

min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(df2)

array([[0.30136986],
       [0.45205479],
       [0.28767123],
       ...,
       [0.56164384],
       [0.06849315],
       [0.47945205]])
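For reference, min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. The sketch below checks this formula against `MinMaxScaler` on the first five ages from the table in Part I. Because it uses only five rows, the numbers differ from the output above, which was computed over the full column (the value 0.30136986 for age 39 is consistent with a column minimum of 17 and maximum of 90).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# First five ages from the dataset preview (a small illustrative subset).
ages = pd.DataFrame({'age': [39, 50, 38, 53, 28]})

# Scale with scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

# Scale by hand with the min-max formula.
col = ages['age'].to_numpy(dtype=float)
manual = (col - col.min()) / (col.max() - col.min())

# Both approaches should agree.
print(np.allclose(scaled.ravel(), manual))
```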

Alternatively, you can convert the Series (df['feature_name']) to a NumPy array by using np.array() and reshape it into a single column with reshape(-1, 1).

In [6]:
x_scaled2=min_max_scaler.fit_transform(np.array(df['age']).reshape(-1,1))
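The direction of the reshape matters, because the scaler works column by column. The following sketch, on a small illustrative array, contrasts the two shapes: `reshape(-1, 1)` produces one feature with many samples, while `reshape(1, -1)` produces a single sample with many features, in which case every column has only one value and nothing meaningful is scaled.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([39, 50, 38, 53, 28])  # illustrative values

as_column = ages.reshape(-1, 1)  # shape (5, 1): five samples, one feature
as_row = ages.reshape(1, -1)     # shape (1, 5): one sample, five features

col_scaled = MinMaxScaler().fit_transform(as_column)
row_scaled = MinMaxScaler().fit_transform(as_row)

print(as_column.shape, as_row.shape)
print(col_scaled.ravel())  # values spread between 0 and 1
print(row_scaled.ravel())  # each column holds a single value, so all entries collapse to 0
```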

To perform one-hot encoding on a categorical feature, you can use the following lines of code.

More info at https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [7]:
df3 = df[['edu']]
df4 = pd.get_dummies(df3)

In [8]:
df4.head()

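| | edu_ 10th | edu_ 11th | edu_ 12th | edu_ 1st-4th | edu_ 5th-6th | edu_ 7th-8th | edu_ 9th | edu_ Assoc-acdm | edu_ Assoc-voc | edu_ Bachelors | edu_ Doctorate | edu_ HS-grad | edu_ Masters | edu_ Preschool | edu_ Prof-school | edu_ Some-college |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |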
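As a small self-contained illustration of what `pd.get_dummies` does, the sketch below encodes a hypothetical four-row `edu` column. Passing `dtype=int` is a choice made here to get 0/1 integers, since recent pandas versions return booleans by default, and `drop_first=True` is shown as an optional way to drop one redundant column.

```python
import pandas as pd

# Hypothetical mini-sample of the 'edu' column.
edu = pd.DataFrame({'edu': ['Bachelors', 'HS-grad', '11th', 'Bachelors']})

# One column per category; dtype=int yields 0/1 instead of True/False.
dummies = pd.get_dummies(edu, dtype=int)
print(dummies)

# Each row has exactly one active category.
print(dummies.sum(axis=1).tolist())

# Optionally drop the first category, which is implied by the others.
reduced = pd.get_dummies(edu, dtype=int, drop_first=True)
print(reduced.columns.tolist())
```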


You can also perform feature selection on the whole dataframe.

More info at https://scikit-learn.org/stable/modules/feature_selection.html

In [9]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

x = df.iloc[:,:-1]    #Split only data
y = df.iloc[:,-1]     #Split the target out
x.head()

| | age | workclass | fnlwgt | edu | edu-num | marital | occupation | relationship | race | sex | cap-gain | cap-loss | hpw | native country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |


In [17]:
# One-hot encode the dataframe. get_dummies encodes only the categorical (non-numeric) columns automatically.
x = pd.get_dummies(x)
x.head()
# Create the feature selector. Here we score the features with the chi-squared test and keep the 4 best (k=4).
selector = SelectKBest(chi2, k=4)     # Create the selector
selector.fit(x, y)                    # Fit the selector: score every feature against the target


# Once the selector is fitted, the indices of the selected features are available. We can create a new dataframe from those indices.
col = selector.get_support(indices=True)   # Indices of the selected columns
x_new = x.iloc[:,col]

In [18]:
x_new.head()

| | age | fnlwgt | cap-gain | cap-loss |
|---|---|---|---|---|
| 0 | 39 | 77516 | 2174 | 0 |
| 1 | 50 | 83311 | 0 | 0 |
| 2 | 38 | 215646 | 0 | 0 |
| 3 | 53 | 234721 | 0 | 0 |
| 4 | 28 | 338409 | 0 | 0 |
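To see which scores drive the selection, it can help to inspect the fitted selector on a tiny example. The sketch below uses a hypothetical three-column dataset (the column names and values are invented for illustration) and reads the per-feature chi-squared scores from `scores_`. Note that `chi2` requires non-negative feature values.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: only 'informative' differs between the two classes.
X = pd.DataFrame({
    'informative': [0, 0, 1, 1, 5, 6, 7, 8],
    'constant':    [3, 3, 3, 3, 3, 3, 3, 3],
    'alternating': [1, 2, 1, 2, 1, 2, 1, 2],
})
y = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# Keep the single best feature according to the chi-squared score.
selector = SelectKBest(chi2, k=1).fit(X, y)

# Per-feature scores, labelled with the column names.
scores = pd.Series(selector.scores_, index=X.columns)
print(scores)

# Boolean mask of the selected columns, mapped back to their names.
selected = X.columns[selector.get_support()]
print(list(selected))
```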


You can also perform feature extraction, for example with principal component analysis (PCA).

More info at https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [12]:
from sklearn.decomposition import PCA

pca = PCA(n_components = 4)           # Create PCA transformer
x_pca = pca.fit_transform(x)          # Fit the PCA transformer and project the dataset onto its components

print(pca.explained_variance_ratio_)  # Fraction of the total variance explained by each component

[9.95113642e-01 4.87183949e-03 1.44878114e-05 1.66233329e-08]


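The array above shows the explained variance ratio of each component. The first element is 0.99511, which means that the first component alone accounts for 99.51% of the variance in the dataset. In that sense, we could reduce the original dataset to a single dimension and still retain 99.51% of its variance. Keep in mind that the features were not scaled before PCA, so this result largely reflects the feature with the largest numeric range (here fnlwgt) rather than the structure of the dataset as a whole.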
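Because PCA is sensitive to feature scale, it is common to standardize the features first. The sketch below illustrates the effect on hypothetical synthetic data with two independent features on very different scales; the data and variable names are assumptions made for this example, not part of the dataset above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features with very different spreads.
X = np.column_stack([
    rng.normal(0, 100_000, size=500),  # large-scale feature
    rng.normal(0, 10, size=500),       # small-scale feature
])

# PCA on the raw data: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing each feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

On the raw data nearly all the variance lands on the first component, while after standardization the two components share it roughly evenly, which is what one would expect for independent features.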

In [13]:
pd.DataFrame(x_pca).head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -112262.329664 | 1099.917213 | -89.796475 | -0.815179 |
| 1 | -106467.395741 | -1074.257792 | -93.353761 | 2.959664 |
| 2 | 25867.604154 | -1078.283626 | -88.157906 | -0.024219 |
| 3 | 44942.603983 | -1078.862069 | -87.382449 | 14.597705 |
| 4 | 148630.604092 | -1082.021023 | -83.374483 | -8.370771 |
