## preprocessing with scikit-learn

this notebook is intended to be a reference for using scikit-learn for preprocessing

In [32]:
# necessary imports

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

In [14]:
# this will give a 5x3 matrix with random integers from 0 to 199

dataset = np.random.randint(0,200,(5,3))

In [31]:
dataset

array([[113, 155,  27],
       [ 24, 161, 188],
       [ 90, 178, 138],
       [ 53,  98,  98],
       [ 83, 102, 116]])

## normalizing: MinMaxScaler()

In [25]:
# scale data: instantiate a MinMaxScaler object
my_scaler = MinMaxScaler()

type(my_scaler)

sklearn.preprocessing.data.MinMaxScaler

In [26]:
# fit to the data
# scikit-learn estimators follow the convention estimator.fit(data)

my_scaler.fit(dataset)



MinMaxScaler(copy=True, feature_range=(0, 1))
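to see what `fit()` actually learned, you can peek at the fitted attributes. this is a minimal sketch, assuming the same 5x3 array printed earlier (hard-coded here so the numbers are reproducible); `data_min_` and `data_max_` are attributes that MinMaxScaler sets during `fit()`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same values as the dataset printed earlier, hard-coded for reproducibility
dataset = np.array([[113, 155,  27],
                    [ 24, 161, 188],
                    [ 90, 178, 138],
                    [ 53,  98,  98],
                    [ 83, 102, 116]])

my_scaler = MinMaxScaler()
my_scaler.fit(dataset)

# fit() only records per-column statistics; it does not change the data
print(my_scaler.data_min_)  # per-column minimum seen during fit
print(my_scaler.data_max_)  # per-column maximum seen during fit
```

nothing is scaled yet at this point; the scaler has just memorized each column's min and max for later use in `transform()`.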

In [27]:
# transform the data

my_scaler.transform(dataset)

array([[1.        , 0.7125    , 0.        ],
       [0.        , 0.7875    , 1.        ],
       [0.74157303, 1.        , 0.68944099],
       [0.3258427 , 0.        , 0.44099379],
       [0.66292135, 0.05      , 0.55279503]])

data are transformed column by column, such that each column's minimum value becomes zero (0) and its maximum value becomes one (1).
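the arithmetic behind this is `(x - min) / (max - min)`, applied to each column. here is a small sketch that reproduces the scaler's output by hand, assuming the same 5x3 array printed earlier and the default `feature_range=(0, 1)`; the names `col_min`, `col_max`, and `manual` are just for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# same values as the dataset printed earlier
dataset = np.array([[113, 155,  27],
                    [ 24, 161, 188],
                    [ 90, 178, 138],
                    [ 53,  98,  98],
                    [ 83, 102, 116]])

# per-column min and max (axis=0 means "down the rows")
col_min = dataset.min(axis=0)
col_max = dataset.max(axis=0)

# min-max formula, broadcast across every row
manual = (dataset - col_min) / (col_max - col_min)

# compare with what scikit-learn produces
scaled = MinMaxScaler().fit_transform(dataset)
print(np.allclose(manual, scaled))
```

for example, the 90 in the first column becomes (90 - 24) / (113 - 24) = 66 / 89, which is roughly 0.7416.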
#### * there are other ways to normalize data, but this one is pretty solid for many machine learning purposes
#### * this can actually be done in one command, using the MinMaxScaler fit_transform() method:

In [30]:
# one line, same results using fit_transform() method

my_scaler.fit_transform(dataset)



array([[1.        , 0.7125    , 0.        ],
       [0.        , 0.7875    , 1.        ],
       [0.74157303, 1.        , 0.68944099],
       [0.3258427 , 0.        , 0.44099379],
       [0.66292135, 0.05      , 0.55279503]])

#### this one-step fit and transform works great here when we're using the entire dataset. BUT when we're constructing a model, we'd typically want to use a train-test-split for model validation, meaning
#### * we would've split the data up into training and testing sets
#### * we would want to keep the test dataset a secret from our model
#### * so fitting to the entire dataset then basing the normalization off of that fit is cheating! it's giving the model a sneak-peek at (or clues to) the data
#### * in real life, it's better to fit to the TRAINING data, THEN transform the training and testing data separately, using that same fitted scaler
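a tiny sketch of that idea, using made-up single-feature arrays (the values in `train` and `test` are purely illustrative, not from this notebook's data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy single-feature data, invented for illustration
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.0], [20.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training data only
test_scaled = scaler.transform(test)        # reuse the training min/max on test data

print(train_scaled.ravel())
print(test_scaled.ravel())
```

notice that a test value can land outside the 0 to 1 range (here 20 maps to 2.0, since the training max was 10). that's expected: the scaler never saw the test data, which is exactly the point.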

## train-test-split is your friend

In [34]:
# generate data using pandas dataframe
# this one gives a 50x5 dataframe of random integers from 0 to 200
# four columns of features plus one for labels

new_data = pd.DataFrame(data=np.random.randint(0,201,(50,5)), 
                        columns=['feature_1', 'feature_2', 'feature_3', 'feature_4', 
                                 'label'])

In [35]:
# split labels from data
# .drop() axes: for rows: use 0; for columns: use 1
# alternatively, select the feature columns directly: X = new_data[['feature_1', 'feature_2',...]] etc

X = new_data.drop('label', axis=1)

y = new_data['label']
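the two ways of building `X` mentioned in the comments should give the same result. a quick sketch to check that, assuming a dataframe shaped like `new_data` (regenerated here so the block runs on its own; `feature_cols`, `X_dropped`, and `X_selected` are illustrative names):

```python
import numpy as np
import pandas as pd

cols = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'label']
new_data = pd.DataFrame(data=np.random.randint(0, 201, (50, 5)), columns=cols)

# option 1: drop the label column (axis=1 means columns)
X_dropped = new_data.drop('label', axis=1)

# option 2: select the feature columns by name
feature_cols = ['feature_1', 'feature_2', 'feature_3', 'feature_4']
X_selected = new_data[feature_cols]

# both approaches yield the same dataframe
print(X_dropped.equals(X_selected))
```

dropping is handy when there are many feature columns; selecting by name is more explicit about what goes into the model.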

sklearn has a pretty cool model_selection suite! train_test_split is part of it:

In [36]:
from sklearn.model_selection import train_test_split

In [38]:
# use train_test_split to further split up the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)

In [39]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(40, 4) (10, 4) (40,) (10,)
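to tie the split back to scaling: here is a hedged sketch of fitting the scaler on the training features only, then transforming both sets. it assumes stand-in data shaped like `new_data` (generated with a seeded generator so it's reproducible), and the names `X_train_scaled` and `X_test_scaled` are just for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# stand-in data: 50 rows, four features plus a label, integers from 0 to 200
rng = np.random.default_rng(0)
cols = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'label']
new_data = pd.DataFrame(rng.integers(0, 201, (50, 5)), columns=cols)

X = new_data.drop('label', axis=1)
y = new_data['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=69)

scaler = MinMaxScaler()
scaler.fit(X_train)                        # learn min/max from the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # same scaler, no refitting on test data

print(X_train_scaled.shape, X_test_scaled.shape)
```

only the features are scaled here; the labels in `y_train` and `y_test` are left alone.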


### that's it (:

for more information, check out the scikit-learn model selection documentation at http://scikit-learn.org/stable/model_selection.html#model-selection