In [None]:
!pip install category_encoders -Uq
!pip install spacy -q

from collections import Counter

import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import spacy
from scipy.stats import skewnorm
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder

from create_dataset import ds

# Feature Engineering

## References

Gabby Shklovsky - Random Forests Best Practices for the Business World - PyData NYC 2017 [youtube](https://www.youtube.com/watch?v=E7VLE-U07x0) - [slides](https://www.youtube.com/redirect?q=https%3A%2F%2Fwww.slideshare.net%2FPyData%2Frandom-forests-best-practices-for-the-business-world&redir_token=HgV_RBYb_uD_jYV6nYygn8RpyKR8MTU2OTkwODE2N0AxNTY5ODIxNzY3&v=E7VLE-U07x0&event=video_description)

Art of Feature Engineering for Data Science - Nabeel Sarwar - [youtube](https://youtu.be/leTyvBPhYzw)

Feature Engineering with H2O - Dmitry Larko, Senior Data Scientist, H2O.ai - [youtube](https://youtu.be/irkV4sYExX4)

[Building Machine Learning Powered Applications: Going from Idea to Product](https://www.amazon.com/Building-Machine-Learning-Powered-Applications/dp/149204511X)

[Datacamp categorical data tutorial](https://www.datacamp.com/community/tutorials/categorical-data)

[Smarter Ways to Encode Categorical Data for Machine Learning](https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159)

[Vincent Warmerdam: Winning with Simple, even Linear, Models | PyData London 2018](https://www.youtube.com/watch?v=68ABAU_V8qI)

[Why giving your algorithm ALL THE FEATURES does not always work - Thomas Huijskens](https://www.youtube.com/watch?v=JsArBz46_3s)

[Feature Engineering - Elite Data Science](https://elitedatascience.com/feature-engineering)

[Feature Engineering for ML](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)

# Feature engineering

Adding columns that transform existing columns
- this includes the target (target transformation)

## How can I generate features?

Domain knowledge
- use business insight -> features

Exploratory data analysis

Build features from features
- ratios
- polynomials
- log / exponentials / sigmoids
- Fourier / Laplace transforms

## Data leakage

Feature available during training that isn't available during testing / production time

Duplicates in the train & test set

Training / learning statistics from the test set
- common mistake to refit normalization / standardization

Feature engineering can often cause data leakage
- mean target encoding on the entire dataset, before splitting test & train
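
A minimal sketch of avoiding the normalization leak described above, assuming a made-up feature matrix and a simple 80/20 split (the names `train_mean` and `train_std` are illustrative only). The statistics are learned from the training rows and then reused, unchanged, on the test rows:

```python
import numpy as np

rng = np.random.default_rng(0)
#  made-up feature matrix for illustration
features = rng.normal(loc=50, scale=10, size=(100, 2))

#  split first, then learn statistics from the training rows only
train, test = features[:80], features[80:]
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

#  apply the *training* statistics to both splits - never refit on the test set
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
```

The test set will usually not come out with exactly zero mean and unit variance - that is expected, and is what the model will see in production.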

## Nominal vs ordinal

Nominal (no order) versus ordinal (ordered)

Some methods place order onto a nominal series during the encoding

## Log-transformations

https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution

Commonly done on the target in regression problems
- you need to invert the transformation at some point!
- you make predictions on the transformed target, then inverse transform them to get predictions in the original units
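
A small sketch of the round trip, assuming `np.log1p` / `np.expm1` as the transform pair (a choice made here because it also handles zeros - plain `np.log` / `np.exp` works the same way for strictly positive targets). The `log_predictions` line is a stand-in for a real model:

```python
import numpy as np

target = np.array([10.0, 100.0, 1000.0])

#  train on the log-transformed target
log_target = np.log1p(target)

#  stand-in for model predictions made in log space
log_predictions = log_target + 0.1

#  invert the transform to get predictions in the original units
predictions = np.expm1(log_predictions)
```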

Few reasons to do this
- multiplicative trends to additive
- makes the distribution more normal
- numerical stability (same reason you parameterize log(sigma) in VAEs)

Multiplicative trend to additive:

In [None]:
inflation = 1.5
nominal = [1.0]
for year in range(10):
    nominal.append(nominal[-1] * inflation)

plt.plot(nominal, color='blue', label='original')
plt.plot(np.log(nominal), color='red', label='log-transform')
plt.legend()

Makes a distribution less skewed (more normal)

In [None]:
data = skewnorm.rvs(10, loc=1, scale=1, size=1000)
f, a = plt.subplots(ncols=2, sharex=True)
_ = a[0].hist(data, color='blue')
a[0].set_title('original')
_ = a[1].hist(np.log(data), color='red')
a[1].set_title('log-transformed')

## Transform from continuous to categorical

On the target 
- moving from a regression to classification problem 
- key thing to ask - how do we use our prediction in our business problem

On features 
- removes noise (& maybe signal)

In [None]:
ds

In [None]:
pd.cut(ds.loc[:, 'contract-length'].fillna(0), bins=4)

## Dealing with categorical variables

Machines always think in numbers
- all categorical features need to be transformed into numbers at some point
- most ML algorithms get upset if you don't do it - some will do it on the fly

## One hot encoding

- adds one column per category, so high cardinality features quickly run into the curse of dimensionality
- lose the explicit relationship of the feature (model now just sees a lot of columns, and has to learn their relationship)

There are a few ways to do one-hot encoding in Python
- `sklearn.preprocessing.OneHotEncoder()`
- `pd.get_dummies()`

Recommend `sklearn.preprocessing.OneHotEncoder()`
- creates a stateful transformer that can be reused (i.e. on the test data)
- using `pd.get_dummies`, you are hoping that the columns will be encoded in the same way at test time

In [None]:
from sklearn.preprocessing import OneHotEncoder

#  note that I use a missing value token here
cat = ds.loc[:, 'customers-category'].to_frame().fillna('missing')

#  returns a sparse matrix by default - convert it to a dense array to inspect it
#  (the `sparse=False` argument was renamed to `sparse_output` in scikit-learn 1.2)
enc = OneHotEncoder()
enc.fit_transform(cat).toarray()
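
To see why relying on `pd.get_dummies` at test time is risky, here is a sketch with two made-up frames where the test data contains an unseen category and is missing others. The `reindex` workaround is one option, not the only one:

```python
import pandas as pd

train = pd.DataFrame({'colour': ['red', 'green', 'blue']})
test = pd.DataFrame({'colour': ['red', 'purple']})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

#  the two frames have different columns - a model fit on one can't use the other
print(list(train_dummies.columns))
print(list(test_dummies.columns))

#  one workaround - force the test frame onto the training columns
#  unseen categories (purple) become a row of zeros
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```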

## Label encoding

- `0, 1, 2, 3`
- is an ordinal encoding - even if feature is not ordinal, you are imposing this structure on the data

Decision trees sort the data based on the feature being split on
- this makes the decision boundary meaningful
- inherently ordinal

In [None]:
#  LabelEncoder expects a 1-D series, not a dataframe
enc = LabelEncoder()
enc.fit_transform(cat.loc[:, 'customers-category'])

## Mean encode

- replace each category with the training data average of the target for that category
- could also use other statistics like median, quantiles or variance

The code below will mean encode a categorical feature using the average of another column (often the target):

In [None]:
def mean_encode(data, col, on):
    #  average of `on` for each category in `col`
    means = data.groupby(col)[on].mean()
    data[col + '-original'] = data[col]

    #  map each category to its mean - missing categories become NaN
    encoded = data[col].map(means)
    #  fill those with the average of the encoded column
    data[col] = encoded.fillna(encoded.mean())
    return data


def test_mean_encoding():
    store1 = pd.DataFrame(
        {'store': ['A'] * 3,
         'Sales': [100, 200, 300],
         'noise': [0, 0, 0]}
    )

    store2 = pd.DataFrame(
        {'store': ['B'] * 3,
         'Sales': [10, 20, 30],
         'noise': [0, 0, 0]}
    )

    data = pd.concat([store1, store2], axis=0)
    data = mean_encode(data, col='store', on='Sales')
    np.testing.assert_array_equal(
        data.loc[:, 'store'], np.array([200, 200, 200, 20, 20, 20])
    )
    
test_mean_encoding()
data = mean_encode(ds.copy(), 'customers-category', 'contract-length')
data.drop(['contract-length', 'location'], axis=1)

## Mean & frequency encoding

See *Feature Engineering with H2O - Dmitry Larko, Senior Data Scientist, H2O.ai - [youtube](https://youtu.be/irkV4sYExX4)*

If we have a category with only a few samples, the mean we encode will have higher variance
- we can tell the model this by encoding a weighted average of the overall mean & the mean of this category
- we trust the mean encoding less if we have fewer samples
- requires a hyperparameter $\lambda$: a function that turns a category into a weight

$$ \lambda(\text{category}) \cdot \mu(\text{category}) + (1 - \lambda(\text{category})) \cdot \text{mean}(\text{data}) $$

A simple choice of $\lambda$ is the relative frequency of the category

$$ \lambda(\text{category}) = \frac{\text{freq}(\text{category})}{\text{freq}(\text{all categories})} $$

### Practical

Implement mean & frequency encoding
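
One possible sketch of the weighted average above, using the relative frequency as $\lambda$. The function name `mean_frequency_encode` and the toy `sales` frame are made up for illustration - treat it as a starting point rather than a reference solution:

```python
import pandas as pd


def mean_frequency_encode(data, col, on):
    overall_mean = data[on].mean()
    stats = data.groupby(col)[on].agg(['mean', 'count'])

    #  lambda(category) = freq(category) / freq(all categories)
    weight = stats['count'] / stats['count'].sum()

    #  weighted average of the category mean and the overall mean
    encoding = weight * stats['mean'] + (1 - weight) * overall_mean
    return data[col].map(encoding)


sales = pd.DataFrame({
    'store': ['A', 'A', 'A', 'B'],
    'sales': [100.0, 200.0, 300.0, 20.0],
})
encoded = mean_frequency_encode(sales, 'store', 'sales')
```

Store `B` has a single sample, so its encoding is pulled most of the way from its own mean (20) towards the overall mean (155).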

## Binary encoding

Encode each category as an integer, then write that integer using its binary (0110100 etc) representation, with one column per digit

In Python we can do this using the `bin()` builtin:

In [None]:
bin(4)

The `category_encoders` library can do this for us:

In [None]:
data = ['A', 'B', 'A', 'C', 'D', 'A', 'E']

enc = ce.BinaryEncoder()

print('{} classes'.format(len(set(data))))

pd.concat([enc.fit_transform(data), pd.DataFrame(data, columns=['data'])], axis=1)
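
To make the two steps explicit, here is a hand-rolled sketch of the idea in plain Python. It assumes integers are assigned from 1 in order of first appearance - that is a choice made here for illustration, and the exact integers `category_encoders` assigns may differ:

```python
import math

categories = ['A', 'B', 'A', 'C', 'D', 'A', 'E']

#  step 1 - give each category an integer, in order of first appearance
ordinal = {}
for cat in categories:
    ordinal.setdefault(cat, len(ordinal) + 1)

#  step 2 - write each integer in binary, one digit per column
n_bits = math.ceil(math.log2(len(ordinal) + 1))
binary = [
    [int(bit) for bit in format(ordinal[cat], '0{}b'.format(n_bits))]
    for cat in categories
]
```

Five categories fit into three columns, versus five columns for one-hot encoding.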

## Hashing encoding

Uses hashing (same operation as in Python dicts)
- some information loss due to collisions

In [None]:
enc = ce.HashingEncoder()
pd.concat([enc.fit_transform(data), pd.DataFrame(data, columns=['data'])], axis=1)
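
A rough sketch of where the collisions come from, using `hashlib` from the standard library. The helper `hash_bucket` and the choice of 4 buckets are assumptions for illustration, not how `ce.HashingEncoder` is implemented internally:

```python
import hashlib


def hash_bucket(category, n_buckets=4):
    #  md5 gives the same digest on every run (the builtin hash() of a str does not)
    digest = hashlib.md5(category.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets


categories = ['A', 'B', 'A', 'C', 'D', 'A', 'E']
buckets = [hash_bucket(cat) for cat in categories]
```

With five categories and four buckets, at least two categories must share a bucket - the model can no longer tell them apart.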

## Frequency encoding

Encode the categorical feature based on its relative frequency
- probability of seeing this category
- tell the model about rare categories
- but you can't distinguish categories with the same frequency (unlikely in very large datasets)

In [None]:
data = ['A', 'B', 'A', 'C', 'D', 'A', 'E']
counter = Counter(data)
freq_enc = [counter[x] / len(data) for x in data]

pd.DataFrame(
    {'data': data,
     'freq_enc': freq_enc}
)

## Interaction features

Also called **feature crossing**

Multiplying features together

Define interaction effects as new features directly
- even for trees - because trees are locally greedy

Grouping sparse classes

In [None]:
data = {
    'late': [1, 1, 0],
    'tired': [1, 0, 1]
}

d = pd.DataFrame(data)

d

We can add columns based on the combination of these features:

In [None]:
d.loc[:, 'late-and-tired'] = ((d.loc[:, 'late'] == 1) * (d.loc[:, 'tired'] == 1)).astype(int)

d

This kind of if/and relationship is something that a decision tree is good at learning:
- but you can always help your model out
- encode domain knowledge where you can!

This kind of feature engineering will be uncomfortable for those who are familiar with linear models, where introducing this kind of collinearity causes problems :)

## Indirect features

*Gabby Shklovsky - Random Forests Best Practices for the Business World - PyData NYC 2017 [youtube](https://www.youtube.com/watch?v=E7VLE-U07x0) - [slides](https://www.youtube.com/redirect?q=https%3A%2F%2Fwww.slideshare.net%2FPyData%2Frandom-forests-best-practices-for-the-business-world&redir_token=HgV_RBYb_uD_jYV6nYygn8RpyKR8MTU2OTkwODE2N0AxNTY5ODIxNzY3&v=E7VLE-U07x0&event=video_description)*

Non-predictive features don't hurt so much with trees
- **use multiple metrics that are proxies** for same concept as predictors
- two or three things that measure it indirectly
- absolute and relative differences
- percentage & absolute changes - sometimes % is predictive, sometimes $
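
A short sketch of giving the model both views of the same change, using a made-up `spend` frame (column names are illustrative):

```python
import pandas as pd

spend = pd.DataFrame({
    'last-year': [100.0, 1000.0, 50.0],
    'this-year': [150.0, 1050.0, 25.0],
})

#  two proxies for the same concept - how much did spend change?
spend['absolute-change'] = spend['this-year'] - spend['last-year']
spend['relative-change'] = spend['absolute-change'] / spend['last-year']
```

The first two rows have the same absolute change but very different relative changes, so the two columns carry different information.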

## Factor out linear relationships

*Gabby Shklovsky - Random Forests Best Practices for the Business World - PyData NYC 2017 [youtube](https://www.youtube.com/watch?v=E7VLE-U07x0) - [slides](https://www.youtube.com/redirect?q=https%3A%2F%2Fwww.slideshare.net%2FPyData%2Frandom-forests-best-practices-for-the-business-world&redir_token=HgV_RBYb_uD_jYV6nYygn8RpyKR8MTU2OTkwODE2N0AxNTY5ODIxNzY3&v=E7VLE-U07x0&event=video_description)*

- factor out linear relationships between predictor and response
- a strong linear relationship will overpower subtle non linearities (will dominate the tree splits)
- trees model a linear relationship in a non-linear way -> unlikely to pick up non linear effects
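
One way to sketch this, assuming synthetic data with a strong linear trend and a small sinusoidal effect: fit a straight line with `np.polyfit`, subtract it, and use the residual as the target for the tree model. This is an illustration of the idea, not the exact procedure from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
#  strong linear trend plus a subtle non-linear effect plus noise
y = 5 * x + np.sin(x) + rng.normal(scale=0.1, size=x.size)

#  fit the straight line and subtract it
slope, intercept = np.polyfit(x, y, deg=1)
residual = y - (slope * x + intercept)

#  `residual` becomes the target for the tree model
#  final prediction = linear prediction + tree prediction of the residual
```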

### Differencing

- difference = linear transformation
- **predict the difference** between last year and now (not the actual sales)
- last spend is still predictive

In [1]:
data = pd.DataFrame(
    {'sales': [10, 50, 100], 'year': [2012, 2013, 2014]}
)

data.loc[:, 'last-year-sales-feature'] = data.loc[:, 'sales'].shift()
data.loc[:, 'model-target'] = data.loc[:, 'sales'].diff()

data

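|   | sales | year | last-year-sales-feature | model-target |
|---|-------|------|-------------------------|--------------|
| 0 | 10 | 2012 | NaN | NaN |
| 1 | 50 | 2013 | 10.0 | 40.0 |
| 2 | 100 | 2014 | 50.0 | 50.0 |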


## Dimensionality reduction

### PCA & t-SNE

Both forms of dimensionality reduction, that can create new columns
- these can sit alongside or replace the high dimensional data

See [visualization.ipynb]

### Clustering

Can be used to create a new feature:

In [None]:
data = pd.DataFrame([np.random.uniform(0, x, 1000) for x in range(1, 10)]).T

mdl = KMeans(3)
mdl.fit(data)
data.loc[:, 'cluster'] = mdl.labels_

data.head()

## Datetime

### Month, year, day columns

These can then be either label or one-hot encoded

In [None]:
data = pd.DataFrame(
    {'date': pd.date_range('01-01-2018', '02-01-2018', freq='1h')}
)

data.loc[:, 'month'] = data.loc[:, 'date'].dt.month
data.loc[:, 'minute'] = data.loc[:, 'date'].dt.minute

data.head()

### Cyclical datetime features

Encoding cyclical continuous features - [blog post](https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/)

In [None]:
hours_in_day = 24

def transform_hourly(x):
    #  x needs a DatetimeIndex
    h = np.asarray(x.index.hour)

    sin = np.sin(2 * np.pi * h / hours_in_day)
    cos = np.cos(2 * np.pi * h / hours_in_day)

    return pd.DataFrame({'sin_h': sin, 'cos_h': cos}, index=x.index)


def transform_hh(x):
    #  fractional hours - 13:30 becomes 13.5
    hh = np.asarray(x.index.hour + (x.index.minute / 60))

    sin = np.sin(2 * np.pi * hh / hours_in_day)
    cos = np.cos(2 * np.pi * hh / hours_in_day)

    return pd.DataFrame({'sin_hh': sin, 'cos_hh': cos}, index=x.index)


def transform(x, max_value):
    #  map a cyclical value onto the unit circle, so that max_value sits next to 0
    x = np.asarray(x)
    sin = np.sin(2 * np.pi * x / max_value)
    cos = np.cos(2 * np.pi * x / max_value)

    return pd.DataFrame({'sin': sin, 'cos': cos})

rng = pd.DataFrame(index=pd.date_range('01-01-2020', '01-02-2020', freq='1h'))
df = transform(rng.index.hour, 24)
df.plot('sin', 'cos', kind='scatter')

## Time series decomposition

Seasonality, trend can become features

More important than other areas to ask the question
- what will I have available at test time?
- example of weather versus weather forecasts
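
A rough sketch of turning trend and seasonality into features, assuming a synthetic hourly `load` series and a one-week training period (both made up here). The trend only looks backwards, and the seasonal profile is learned from the training week only, so neither uses information that would be missing at test time:

```python
import numpy as np
import pandas as pd

index = pd.date_range('2020-01-01', periods=24 * 14, freq='1h')
hour_of_day = index.hour.to_numpy()
hours = np.arange(len(index))

#  synthetic series - a slow upward trend plus a daily cycle
load = pd.Series(
    0.05 * hours + 10 * np.sin(2 * np.pi * hour_of_day / 24),
    index=index,
)

features = pd.DataFrame({'load': load})
#  trend - mean of the previous 24 hours, shifted so the current value is not used
features['trend'] = load.rolling(24).mean().shift(1)

#  seasonality - average detrended value per hour of day, from the first week only
detrended = load - features['trend']
train = detrended.iloc[:24 * 7]
profile = train.groupby(train.index.hour).mean()
features['seasonal'] = profile.reindex(hour_of_day).to_numpy()
```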

### cesium

## NLP

Bag of words, TF-IDF

Tokenization
- harder than splitting on whitespace!
- ngrams
- NLTK, spaCy

Word / doc embeddings
- word2vec
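
A bare-bones sketch of a bag of words and bigrams in plain Python, using the naive whitespace split that the notes above warn about (fine for a toy sentence, not for real text):

```python
from collections import Counter

doc = 'the cat sat on the mat'

#  naive tokenization - real tokenizers also handle punctuation, casing, etc.
tokens = doc.lower().split()

#  bag of words - token counts, ignoring word order
bag_of_words = Counter(tokens)

#  bigrams (n-grams with n=2) keep some local word order
bigrams = list(zip(tokens, tokens[1:]))
```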

### Part of speech tagging with spaCy

A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

https://spacy.io/usage/linguistic-features

https://spacy.io/api/token

**Need to run**:
```bash
$ python -m spacy download en_core_web_sm
```

In [None]:
def find_ents(doc, verbose=False):
    if verbose:
        print(doc)
        print('---')
        
    doc = nlp(doc)
    ents = []
    for token in doc:
        
        if verbose:
            print({
                'text': token.text,
                'coarse POS': token.pos_,
                'fine POS': token.tag_,
                'syntactic dependency relation': token.dep_,
                'stop': token.is_stop
            })
        if token.pos_ == 'PROPN':
            ents.append(token)
            
    return doc, ents

In [None]:
nlp = spacy.load('en_core_web_sm') 
doc, ents = find_ents("Apple is looking at buying U.K. startup for $1 billion")
ents

It is possible for a single word to have a different part of speech tag in different sentences, depending on the context

- I love you vs. Let's make love

In [None]:
doc, ents = find_ents("I love you", verbose=True)

In [None]:
doc, ents = find_ents("Let's make love", verbose=True)

## Transfer learning

Clustering is valuable for any layer that works like an embedding. Image classifiers have a penultimate layer before the softmax which can be used as an embedding. Clustering these vectors produces interesting results

Can also do similarity (often cosine similarity) between the vectors formed at the end of a conv. net

Can also do similarity on word embeddings
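
Cosine similarity itself is only a few lines of NumPy. The vectors below are made up to stand in for embeddings - in practice they would come from a trained network or a word embedding model:

```python
import numpy as np


def cosine_similarity(a, b):
    #  cosine of the angle between the two vectors - 1 means same direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


#  made-up vectors standing in for embeddings
cat = np.array([0.9, 0.1, 0.3])
kitten = np.array([0.8, 0.2, 0.35])
car = np.array([-0.2, 0.9, -0.5])

print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, car))
```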

## Augmentation

Common in computer vision
- rotation etc

SMOTE
- synthetic data
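
The core idea behind SMOTE can be sketched as interpolating between a minority class sample and one of its nearest minority neighbours. This is a simplified illustration on made-up data with a hypothetical helper `synthetic_sample` - for real work a maintained implementation such as `imblearn.over_sampling.SMOTE` is a better choice:

```python
import numpy as np

rng = np.random.default_rng(0)
#  made-up minority class samples
minority = rng.normal(size=(20, 2))


def synthetic_sample(samples, k=5):
    #  pick a random minority sample
    point = samples[rng.integers(len(samples))]
    #  find its k nearest minority neighbours (position 0 is the point itself)
    distances = np.linalg.norm(samples - point, axis=1)
    neighbours = samples[np.argsort(distances)[1:k + 1]]
    neighbour = neighbours[rng.integers(len(neighbours))]
    #  interpolate a new point somewhere on the line between the two
    return point + rng.uniform() * (neighbour - point)


new_points = np.array([synthetic_sample(minority) for _ in range(10)])
```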