# Data Modeling
In addition to dimensionality concerns, there are other health factors for input data that we've discussed previously.  Scikit-learn (sklearn) comes with tools to help us normalize our data, encode categorical data, detect outliers, and much more.

## Feature Scaling
Due to the way floating point numbers work on digital systems, values closer to zero can be represented more precisely, because the gap between adjacent representable values grows with magnitude.  As values get larger, decimal precision is lost first, and for extremely large values even whole integers begin to get skipped.  Because of this, and because many estimators are sensitive to the scale of their inputs, we can often improve performance by scaling the data first.  Some estimators aren't sensitive to unscaled data, but in general models prefer input features that are standardized and roughly Gaussian.  Another problem that can exist in raw data is that two features on very different scales (imagine distance in miles vs inches) can confuse estimators, which may place more significance on the feature with the larger values.
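The precision claim above can be checked directly. The following is a minimal sketch using NumPy's 32-bit floats; the specific values (2**24, 1.0, 1e6) are chosen only for illustration.

```python
import numpy as np

# float32 has a 24-bit significand, so above 2**24 not every whole number
# can be represented and adding 1 is lost to rounding
big = np.float32(2**24)
skipped = (big + np.float32(1)) == big

# np.spacing returns the gap between a value and the next representable one
gap_near_one = np.spacing(np.float32(1.0))
gap_near_million = np.spacing(np.float32(1e6))

print(skipped, gap_near_one, gap_near_million)
```

The gap near one million is far larger than the gap near one, which is why features with very large raw values lose fine detail.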

There is a hidden danger we could be introducing here known as data leakage.  We will discuss what this is and how to prevent it another day.

* StandardScaler - 0 mean and unit variance
* MinMaxScaler - scale values to a specific range (0 to 1 by default)
* RobustScaler - linear scaling based on the median and interquartile range, good if you have many outliers
* PowerTransformer - non linear transformation (similar to log) that makes the data more Gaussian
* QuantileTransformer - a non linear transformer that maps values to a uniform or normal distribution, works even on bimodal or uniform distributions
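To make the differences between the linear scalers in the list above concrete, here is a small sketch on a made-up feature with one extreme value. The data is hypothetical and only meant to show how each scaler reacts to an outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# hypothetical single feature with one extreme value
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)                  # mean 0, variance 1
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # squeezed into [0, 1]
robust = RobustScaler().fit_transform(X)                      # centered on the median, scaled by IQR

for name, scaled in [('standard', standard), ('minmax', minmax), ('robust', robust)]:
    print(name, np.round(scaled.ravel(), 2))
```

With MinMaxScaler the four ordinary values end up crowded near 0 because the outlier defines the top of the range, while RobustScaler keeps them evenly spread around 0.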

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=100, random_state=100)

scaler = StandardScaler()
scaler.fit(X)

# transform applies the scaling learned by fit
# (call scaler.inverse_transform to undo the effect of the transformer)
X_scaled = scaler.transform(X)

print(X[0], X_scaled[0])

[-0.37690335] [-0.28124444]


### Target Engineering
We typically only transform the input features, however transforming the target might be necessary if the model is not fitting well.  A log transform is often used for this.  Don't forget to apply the inverse transform when evaluating predictions.
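One way to avoid forgetting the inverse step is sklearn's TransformedTargetRegressor, which applies a function to the target before fitting and the inverse function to every prediction. The sketch below assumes a synthetic, strictly positive target and a plain LinearRegression; both are illustrative choices, not part of the original material.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
# hypothetical target that grows exponentially with the feature (always > 0)
y = np.exp(0.8 * X[:, 0] + rng.normal(0, 0.1, size=200))

# the regressor is trained on log(y); predictions are mapped back with exp
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X, y)

pred = model.predict(X[:3])
score = model.score(X, y)  # R^2 measured on the original target scale
print(pred, y[:3], score)
```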

## Categorical Encoding
As we've discussed before, input data must be quantitative, so a data set with many categorical features can be very limiting.  Categorical labels must be converted to quantitative values before they can be used to train a model, for example converting hair color from 'brown' to an RGB value.  Another example is converting ordinal data to discrete codes (freshman = 1, sophomore = 2, junior = 3).  Even though these values are categorical, there is a specific ordering to them, so the codes carry meaning.  This can be done manually or with sklearn's OrdinalEncoder.  If there is no natural conversion like this, the values can still be used, but they need to be encoded with one of the methods below first.  Which method to use will depend on the model and the data, so try an approach and see how it affects the problem.
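As a rough sketch of the ordinal case, OrdinalEncoder can be given the category order explicitly. Note that it numbers the categories starting from 0 rather than 1; the column name `year` and the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'year': ['junior', 'freshman', 'sophomore', 'freshman']})

# pass the categories explicitly so the codes follow the real ordering;
# without this the encoder would sort the labels alphabetically
enc = OrdinalEncoder(categories=[['freshman', 'sophomore', 'junior']])
df['year_code'] = enc.fit_transform(df[['year']]).ravel()

print(df)
```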

### One Hot
One hot encoding transforms a feature with k possible values into k new binary features (0 or 1).  Care must be taken, however: if the converted feature has a large number of possible values, this could greatly increase the complexity of the model.  If this is the case, labels can be combined (light brown, brown, and chestnut being combined), or a different approach can be applied.
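Combining labels before encoding can be sketched with pandas alone. The hair colors and the grouping below are hypothetical; pd.get_dummies is used here as one convenient way to produce the one hot columns.

```python
import pandas as pd

hair = pd.Series(
    ['light brown', 'brown', 'chestnut', 'blonde', 'black', 'brown'],
    name='hair',
)

# hypothetical grouping: fold similar shades into a single label
groups = {'light brown': 'brown', 'chestnut': 'brown'}
hair_grouped = hair.replace(groups)

# one binary column per remaining label
one_hot = pd.get_dummies(hair_grouped, prefix='is', dtype=int)

print(hair.nunique(), '->', one_hot.shape[1])
print(one_hot)
```

Grouping first means the model sees three new features instead of five.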

In [None]:
from sklearn.preprocessing import LabelBinarizer
import pandas as pd


enc = LabelBinarizer()
vals = ['action', 'rpg', 'rpg', 'rpg', 'puzzle', 'puzzle']
df = pd.DataFrame(vals, columns=['cat'])

# convert the labels to a one hot array (LabelBinarizer expects a single 1-D
# column; for several input features OneHotEncoder is the usual choice)
one_hot = enc.fit_transform(df['cat'])

# combine for nice output, enc.classes_ holds the labels in column order
cols = ['is_' + c for c in enc.classes_]
df = pd.concat([df, pd.DataFrame(one_hot, columns=cols)], axis=1)
df

| | cat | is_action | is_puzzle | is_rpg |
|---|---|---|---|---|
| 0 | action | 1 | 0 | 0 |
| 1 | rpg | 0 | 0 | 1 |
| 2 | rpg | 0 | 0 | 1 |
| 3 | rpg | 0 | 0 | 1 |
| 4 | puzzle | 0 | 1 | 0 |
| 5 | puzzle | 0 | 1 | 0 |


### Frequency Encoding
This type of encoding takes the frequency with which a label occurs and uses that as the quantitative value.  Sometimes this value will have some relation to the output feature.  There is no built in method to do this, but it can be easily achieved with pandas.

In [3]:
df = pd.DataFrame(vals, columns=['cat'])

# get frequency of each label
fe = df.groupby('cat').size() / len(df)

# map labels to frequency
df['freq'] = df['cat'].map(fe)

df

| | cat | freq |
|---|---|---|
| 0 | action | 0.166667 |
| 1 | rpg | 0.5 |
| 2 | rpg | 0.5 |
| 3 | rpg | 0.5 |
| 4 | puzzle | 0.333333 |
| 5 | puzzle | 0.333333 |


### Feature Hashing
Feature hashing is a technique similar to one hot encoding, however instead of assigning each unique value its own feature, each value is hashed into one of a user specified number of buckets.  Due to this, the number of features generated can be directly controlled by the user, at the cost that different values can collide in the same bucket.  The return from this operation is a sparse array, so you will need to convert it to get it back into your dataframe.

In [8]:
from sklearn.feature_extraction import FeatureHasher
import pandas as pd

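# n_features is how many buckets (features) your hash should use
fh = FeatureHasher(n_features=2, input_type='string')
df = pd.DataFrame(vals, columns=['cat'])

print(df.cat)

# NOTE: with input_type='string' each sample should be an iterable of strings.
# Passing the raw column hashes every label character by character, which is
# what the output below reflects, and newer sklearn releases may reject it.
# Wrap each label in a list (e.g. [[v] for v in df.cat]) to hash it whole.
# toarray converts from a sparse matrix to dense which we can create a df with
res = fh.fit_transform(df.cat)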
fh_df = pd.DataFrame(res.toarray())

df = pd.concat([df, fh_df], axis=1)
df

0    action
1       rpg
2       rpg
3       rpg
4    puzzle
5    puzzle
Name: cat, dtype: object


| | cat | 0 | 1 |
|---|---|---|---|
| 0 | action | 0.0 | -2.0 |
| 1 | rpg | -2.0 | 1.0 |
| 2 | rpg | -2.0 | 1.0 |
| 3 | rpg | -2.0 | 1.0 |
| 4 | puzzle | 1.0 | -1.0 |
| 5 | puzzle | 1.0 | -1.0 |
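The values of 2 and -2 in the output above come from the hasher treating each label as a sequence of characters. A hedged sketch of hashing each label as a single token follows; the choice of 8 buckets is arbitrary and only meant to make collisions less likely in this tiny example.

```python
from sklearn.feature_extraction import FeatureHasher

vals = ['action', 'rpg', 'rpg', 'rpg', 'puzzle', 'puzzle']

# 8 buckets is an arbitrary choice for illustration
fh = FeatureHasher(n_features=8, input_type='string')

# wrap each label in a list so the whole label is hashed as one token
hashed = fh.transform([[v] for v in vals]).toarray()

print(hashed)
```

Each row now has a single non-zero entry (+1 or -1) in the bucket its label hashed to, and identical labels always produce identical rows. Two different labels can still share a bucket, which is the trade-off for controlling the number of features.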


## Outliers
There are two different types of actions we can take when dealing with outliers.  Novelty detection is when you already have a clean dataset and you are trying to determine if new data is an inlier or an outlier.  Outlier detection is when you have a potentially dirty dataset and you are trying to determine if any of the existing data is anomalous.  Outliers generally take one of the following forms.

* global - incredibly rare event (it snowed in Florida)
* contextual - abnormal due to time (hurricane in Florida during winter)
* collective - abnormal due to amount (five hurricanes hitting Orlando on the same day during late summer)

Something to keep in mind with outliers is that they are not necessarily bad, in fact they might be the most important samples in your data set.  If you detect outliers you will need to examine them and understand their relevance to the problem you are trying to solve.  It's possible I could have a 7' tall student in my class, and I wouldn't necessarily want my model to consider that sample.  It's also possible I could get a student who gets a 100 on every test / assignment, and I would probably want my model to try and detect this.

### Local Outlier Factor
Many applications of machine learning will require outliers to be removed in order to get good predictions, and this is one reason we look at the distribution of our features during EDA.  However, when we do this we are only looking at a single feature at a time (univariate).  For example, let's say it's not uncommon for somebody to be 6' tall, and it's also common for somebody else to weigh 120 lbs, however it would be very uncommon for one person to have both of those traits.  Local Outlier Factor compares the local density around each sample, computed with respect to all features (multivariate), with the density around its neighbors to determine if the sample is an inlier or an outlier.  Something to keep in mind is that when we call fit the argument name is still 'X', however we might pass in the output as well, or only parts of the input.  The y argument is ignored.
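The height and weight example can be sketched on synthetic data. Everything below is made up for illustration (the distributions, the linear relationship, the choice of 20 neighbors), and the features are standardized first so that neither unit dominates the distance calculation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# synthetic data: weight (lbs) rises with height (inches)
height = rng.normal(67, 3, size=300)
weight = 5 * height - 185 + rng.normal(0, 8, size=300)
X = np.column_stack([height, weight])

# a 6' (72 in) person weighing 120 lbs, appended as the last sample
odd = np.array([[72.0, 120.0]])
X_all = np.vstack([X, odd])

X_scaled = StandardScaler().fit_transform(X_all)
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_scaled)

# each value on its own falls inside the observed range (univariate view)
print(height.min() < 72 < height.max(), weight.min() < 120 < weight.max())
# the combination is what stands out: -1 marks an outlier, 1 an inlier
print(labels[-1])
```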

In [None]:
from sklearn.neighbors import LocalOutlierFactor
import pandas as pd

# forward slashes work on every platform and avoid backslash escape issues
df = pd.read_csv('assets/small_data.csv', index_col='rank')
df = df.drop('names', axis=1)

# fit_predict labels each sample: 1 for inliers, -1 for outliers
clf = LocalOutlierFactor(n_neighbors=5)
df['outlier'] = clf.fit_predict(df)

df.head(7)

| rank | avg rating | geek rating | num votes | year | outlier |
|---|---|---|---|---|---|
| 1 | 8.32 | 8.2 | 21003 | 2005 | -1 |
| 2 | 8.72 | 8.153 | 3646 | 2015 | -1 |
| 3 | 8.27 | 8.134 | 15846 | 2012 | 1 |
| 4 | 8.25 | 8.055 | 10507 | 2013 | 1 |
| 5 | 8.2 | 8.049 | 13715 | 2006 | 1 |
| 6 | 8.13 | 8.037 | 41009 | 2002 | -1 |
| 7 | 8.1 | 8.022 | 40962 | 2007 | -1 |
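The cell above performs outlier detection on a single, possibly dirty dataset. For the novelty detection case described at the start of this section, LocalOutlierFactor can be created with `novelty=True`, fit on data assumed to be clean, and then asked about new samples. The data here is synthetic and the two new points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# training data we assume to be clean
X_clean = rng.normal(0, 1, size=(200, 2))

# new samples: one typical point and one far from everything seen so far
X_new = np.array([[0.1, -0.2], [8.0, 8.0]])

# novelty=True enables predict() on unseen samples
clf = LocalOutlierFactor(n_neighbors=20, novelty=True)
clf.fit(X_clean)

preds = clf.predict(X_new)  # 1 for inliers, -1 for novelties
print(preds)
```

In this mode predict should only be used on new data, not on the samples the model was fit with.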


# Workflow
There is not a strict blueprint for how to complete a machine learning problem.  Because the data varies, the workflow for each problem can be drastically different.  This general workflow can help, but shouldn't be strictly followed.  As the course progresses we will uncover more and transition from naive approaches to more robust ones.

<td> <img src="images\ml_workflow01.png" alt="Drawing" style="width:850px;"/> </td>

# OK
One of the most important skills for an engineer working in this field is the ability to process data.  Often when working with models the best way to get better performance isn't to manipulate the model, it's to manipulate the data.  Which specific thing to try will depend on the model and data.  Through the course we will explore ways to tell which technique we should apply to the data.