## Feature Engineering

![](../img/feature_engineer.png)

This section covers some libraries for feature engineering. 

### Split Data in a Stratified Fashion in scikit-learn

By default, scikit-learn's `train_test_split` splits the data at random, so the class proportions in the train and test sets can differ from those in the entire dataset.

In [16]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_iris(return_X_y=True)
np.bincount(y)

array([50, 50, 50])

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [18]:
# Get count of each class in the train set

np.bincount(y_train)

array([37, 34, 41])

In [19]:
# Get count of each class in the test set

np.bincount(y_test)

array([13, 16,  9])

If you want to keep the proportion of classes in the sample the same as the proportion of classes in the entire dataset, add `stratify=y`. 

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

In [21]:
np.bincount(y_train)

array([37, 37, 38])

In [22]:
np.bincount(y_test)

array([13, 13, 12])
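To compare the splits as proportions rather than raw counts, a small helper can be sketched as follows. The name `class_proportions` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)


def class_proportions(labels):
    """Return the share of each class in an integer label array."""
    counts = np.bincount(labels)
    return counts / counts.sum()


# Stratified split: each class keeps roughly the same share in both subsets
_, _, y_train_s, y_test_s = train_test_split(X, y, random_state=0, stratify=y)

print("full :", class_proportions(y))
print("train:", class_proportions(y_train_s))
print("test :", class_proportions(y_test_s))
```

The proportions cannot match exactly when the subset size is not divisible by the number of classes, so small differences are expected.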

### Strategy to Prevent Data Leakage in Time-correlated Datasets

If you randomly split a time-correlated dataset, the training set may contain transactions that happen after those in the test set. The model then learns from the future, which makes its evaluation scores overly optimistic.

To avoid data leakage in time-correlated datasets, split the data by time.

In [23]:
import pandas as pd
from datetime import datetime 

# Create the example dataset
data = {'customer_id': [1, 2, 3, 4, 5],
        'amount': [10.00, 20.00, 15.00, 25.00, 30.00],
        'date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05']}
df = pd.DataFrame(data)

# Convert the date column to datetime format
df['date'] = pd.to_datetime(df['date'])

In [24]:
from sklearn.model_selection import train_test_split

# Split the data randomly into training and test sets
train_data, test_data = train_test_split(df, test_size=0.3, random_state=42)

print(f'Train data:\n{train_data}')
print(f'Test data:\n{test_data}')

In [None]:
# Set the cutoff date
cutoff_date = datetime(2021, 1, 4)

# Split the data into training and test sets by time
train_data = df[df['date'] < cutoff_date]
test_data = df[df['date'] >= cutoff_date]

print(f'Train data:\n{train_data}')
print(f'Test data:\n{test_data}')
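The same idea extends to cross-validation. Below is a minimal sketch that uses scikit-learn's `TimeSeriesSplit` on a toy frame similar to the one above. The names `ts_df` and `folds` are illustrative, and the rows must be sorted by date first because the splitter only looks at row order.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy data, deliberately out of order
ts_df = pd.DataFrame(
    {
        "amount": [10.0, 20.0, 15.0, 25.0, 30.0],
        "date": pd.to_datetime(
            ["2021-01-03", "2021-01-01", "2021-01-05", "2021-01-02", "2021-01-04"]
        ),
    }
)

# TimeSeriesSplit relies on row order, so sort chronologically
ts_df = ts_df.sort_values("date").reset_index(drop=True)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(ts_df))

for train_idx, test_idx in folds:
    train_end = ts_df.loc[train_idx, "date"].max().date()
    test_start = ts_df.loc[test_idx, "date"].min().date()
    print(f"train up to {train_end}, test from {test_start}")
```

In every fold, the training rows come strictly before the test rows.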

### Drop Correlated Features

In [None]:
!pip install feature_engine 

To remove correlated variables from a DataFrame, use `DropCorrelatedFeatures` from `feature_engine.selection`.

In [None]:
import pandas as pd
from sklearn.datasets import make_classification
from feature_engine.selection import DropCorrelatedFeatures

# make dataframe with some correlated variables
X, y = make_classification(
        n_samples=1000,
        n_features=6,
        n_redundant=3,
        n_clusters_per_class=1,
        class_sep=2,
        random_state=0,
    )

# transform arrays into pandas df and series
colnames = ["var_" + str(i) for i in range(6)]
X = pd.DataFrame(X, columns=colnames)

In [None]:
X.columns

In [None]:
X[["var_0", "var_1", "var_2"]].corr()

Drop the variables with a correlation above 0.8. 

In [None]:
tr = DropCorrelatedFeatures(variables=None, method="pearson", threshold=0.8)

Xt = tr.fit_transform(X)

tr.correlated_feature_sets_

In [None]:
Xt.columns
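To build intuition for what dropping correlated features involves, here is a simplified pandas-only sketch. It is not feature-engine's implementation: the helper `drop_correlated` and the toy columns are assumptions made for illustration. For each pair whose absolute Pearson correlation exceeds the threshold, it keeps the first column and drops the later one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame(
    {
        "a": a,
        "b": 2 * a + 0.01 * rng.normal(size=200),  # nearly a copy of "a"
        "c": rng.normal(size=200),  # independent of "a"
    }
)


def drop_correlated(frame, threshold=0.8):
    """Drop every column that is highly correlated with an earlier column."""
    corr = frame.corr(method="pearson").abs()
    # Keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop)


reduced = drop_correlated(toy)
print(reduced.columns.tolist())
```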

[Link to feature-engine](https://feature-engine.readthedocs.io/en/1.1.x/).

### Encode Rare Labels with Feature-engine

When dealing with features with high cardinality, you might want to mark the rare categories as "Other". Feature-engine's `RareLabelEncoder` makes it easy for you to do so.

In [None]:
from sklearn.datasets import fetch_openml
from feature_engine.encoding import RareLabelEncoder

data = fetch_openml('dating_profile')['data']
data.head(10)

In [None]:
processed = data.dropna(subset=['education'])

In the code below, 
- `tol` specifies the minimum frequency below which a category is considered rare.
- `replace_with` specifies the value used to replace rare categories.
- `variables` specifies the list of categorical variables to encode.

In [None]:
encoder = RareLabelEncoder(tol=0.05, variables=["education"], replace_with="Other")
encoded = encoder.fit_transform(processed)


Now the rare categories in the column `education` are replaced with "Other".

In [None]:
encoded['education'].sample(10)
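The grouping logic can be approximated in plain pandas. The sketch below is an assumption-laden simplification, not feature-engine's code: `group_rare` is a hypothetical helper, the toy values are invented, and other `RareLabelEncoder` options are ignored.

```python
import pandas as pd

education = pd.Series(
    ["bachelor"] * 12 + ["master"] * 7 + ["phd"] * 1, name="education"
)


def group_rare(series, tol=0.1, replace_with="Other"):
    """Replace categories whose relative frequency is below `tol`."""
    freqs = series.value_counts(normalize=True)
    frequent = freqs[freqs >= tol].index
    # Keep frequent categories, replace everything else
    return series.where(series.isin(frequent), replace_with)


grouped = group_rare(education)
print(grouped.value_counts())
```

Here "phd" makes up 5% of the rows, which is below the 10% tolerance, so it is replaced with "Other".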

[Link to feature-engine](https://feature-engine.readthedocs.io/en/1.1.x/).

### Encode Categorical Data Using Frequency

In [None]:
!pip install feature-engine

Sometimes, count or frequency can be useful features for your model. If you want to replace categories by either the count or the percentage of observations per category, use feature_engine's `CountFrequencyEncoder`.

In [None]:
import seaborn as sns
from feature_engine.encoding import CountFrequencyEncoder
from sklearn.model_selection import train_test_split

data = sns.load_dataset("diamonds")

X_train, X_test, y_train, y_test = train_test_split(data, data["price"], random_state=0)
X_train

In the code below, I encode `color` and `clarity`. 

In [None]:
# initialize the encoder
encoder = CountFrequencyEncoder(
    encoding_method="frequency", variables=["color", "clarity"]
)

# fit the encoder
encoder.fit(X_train)

# process the data
p_train = encoder.transform(X_train)
p_test = encoder.transform(X_test)

In [None]:
p_test
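As a rough illustration of frequency encoding, the sketch below does the same thing by hand with pandas on invented toy data. The key point is that the mapping is learned from the training set only and then reused on the test set. The names `freq_map`, `train`, and `test` are illustrative.

```python
import pandas as pd

train = pd.DataFrame({"color": ["E", "E", "G", "H"]})
test = pd.DataFrame({"color": ["E", "G"]})

# Learn the share of each category from the training data only
freq_map = train["color"].value_counts(normalize=True)

# Apply the same mapping to both sets
train_encoded = train["color"].map(freq_map)
test_encoded = test["color"].map(freq_map)

print(test_encoded.tolist())
```

With this manual approach, a category that appears only in the test set maps to `NaN`, so unseen categories need separate handling.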

[Link to feature-engine](https://feature-engine.readthedocs.io/en/1.1.x/).

### Return a DataFrame When Using a scikit-learn Transformer

In [None]:
!pip install feature_engine 

By default, applying a scikit-learn transformer to a DataFrame returns a NumPy array, so the column names and index are lost.

In [None]:
import pandas as pd 
from sklearn.preprocessing import StandardScaler
from feature_engine.wrappers import SklearnTransformerWrapper

In [None]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
StandardScaler().fit_transform(df)

If you want to return a pandas DataFrame instead, wrap your scikit-learn transformer in feature-engine's `SklearnTransformerWrapper`.

In [None]:
scaler = SklearnTransformerWrapper(transformer=StandardScaler())
scaler.fit_transform(df)
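If feature-engine is not available, a manual alternative is to rebuild the DataFrame from the returned array. This is a minimal sketch with illustrative names (`frame`, `scaled_frame`), not a description of how the wrapper works internally.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

frame = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# The transformer returns a plain NumPy array
scaled_array = StandardScaler().fit_transform(frame)

# Reattach the original column names and index
scaled_frame = pd.DataFrame(scaled_array, columns=frame.columns, index=frame.index)
print(scaled_frame)
```

This only works directly for transformers that keep the number and order of columns unchanged.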

[Link to feature-engine](https://feature-engine.readthedocs.io/en/1.1.x/).

### Similarity Encoding for Dirty Categories Using dirty_cat

In [None]:
!pip install dirty-cat

To capture the similarities among dirty categories when encoding categorical variables, use dirty_cat’s `SimilarityEncoder`.

To understand how `SimilarityEncoder` works, let's start with the employee_salaries dataset.

In [None]:
from dirty_cat.datasets import fetch_employee_salaries
from dirty_cat import SimilarityEncoder

X = fetch_employee_salaries().X
X.head(10)

In [None]:
dirty_column = "employee_position_title"
X_dirty = X[dirty_column].values
X_dirty[:7]

We can see that titles such as 'Master Police Officer' and 'Police Officer III' are similar. We can use `SimilarityEncoder` to encode these categories while capturing their similarities.

In [None]:
enc = SimilarityEncoder(similarity="ngram")
X_enc = enc.fit_transform(X_dirty[:10].reshape(-1, 1))
X_enc

Cool! Let's create a heatmap to see how similar the encoded categories are to each other.

In [None]:
import seaborn as sns
import numpy as np
from sklearn.preprocessing import normalize
from IPython.core.pylabtools import figsize

def plot_similarity(labels, features):
  
    normalized_features = normalize(features)
    
    # Compute the cosine-similarity matrix of the normalized encodings
    corr = np.inner(normalized_features, normalized_features)
    
    # Plot
    figsize(10, 10)
    sns.set(font_scale=1.2)
    g = sns.heatmap(corr, xticklabels=labels, yticklabels=labels, vmin=0,
        vmax=1, cmap="YlOrRd", annot=True, annot_kws={"size": 10})
        
    g.set_xticklabels(labels, rotation=90)
    g.set_title("Similarity")


def encode_and_plot(labels):
  
    enc = SimilarityEncoder(similarity="ngram") # Encode
    X_enc = enc.fit_transform(labels.reshape(-1, 1))
    
    plot_similarity(labels, X_enc) # Plot

In [None]:
encode_and_plot(X_dirty[:10])

As we can see from the matrix above,
- The similarity between the same strings such as 'Office Services Coordinator' and 'Office Services Coordinator' is 1
- The similarity between somewhat similar strings such as 'Office Services Coordinator' and 'Master Police Officer' is 0.41
- The similarity between two very different strings such as 'Social Worker IV' and 'Police Aide' is 0.028
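To see why strings that share substrings score higher, here is a simplified sketch of an n-gram similarity: the Jaccard overlap of character 3-grams. This is an illustrative stand-in, not dirty_cat's exact formula, so the numbers will differ from the heatmap above. `char_ngrams` and `ngram_similarity` are hypothetical helpers.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in a lowercased string."""
    text = text.lower()
    return {text[i : i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard overlap between the n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have n-grams are treated as equal
    return len(grams_a & grams_b) / len(grams_a | grams_b)


close = ngram_similarity("Police Officer III", "Master Police Officer")
far = ngram_similarity("Social Worker IV", "Police Aide")
print(f"close pair: {close:.2f}, distant pair: {far:.2f}")
```

The two police titles share many 3-grams such as "pol" and "off", while the distant pair shares almost none.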


[Link to dirty-cat](https://dirty-cat.github.io/).

[Link to my full article about dirty-cat](https://towardsdatascience.com/similarity-encoding-for-dirty-categories-using-dirty-cat-d9f0b581a552).

### Snorkel — Programmatically Build Training Data in Python

In [None]:
!pip install snorkel

Imagine you are trying to determine whether a job posting is fake. You come up with some assumptions about fake job postings, such as:
* If a job posting has few to no descriptions about the requirements, it is likely to be fake.
* If a job posting does not include any company profile or logo, it is likely to be fake.
* If the job posting requires some sort of education or experience, it is likely to be real.

In [None]:
import pandas as pd 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


train_df = pd.read_pickle(
    "https://github.com/khuyentran1401/Data-science/blob/master/feature_engineering/snorkel_example/train_fake_jobs.pkl?raw=true"
)
train_df.head(5)


How do you test which of these features are the most accurate in predicting fraud?

That is when Snorkel comes in handy. Snorkel is an open-source Python library for programmatically building training datasets without manual labeling. 

To learn how Snorkel works, start by giving a meaningful name to each value:

In [None]:
from snorkel.labeling import labeling_function, PandasLFApplier, LFAnalysis

FAKE = 1
REAL = 0
ABSTAIN = -1

We assume that:
- Fake companies don’t have company profiles or logos
- Fake companies are found in a lot of fake job postings
- Real job postings often require a certain level of experience and education

Let’s test those assumptions using Snorkel’s `labeling_function` decorator. The `labeling_function` decorator allows us to quickly label instances in a dataset using functions.

In [None]:
@labeling_function()
def no_company_profile(x: pd.Series):
    return FAKE if x.company_profile == "" else ABSTAIN


@labeling_function()
def no_company_logo(x: pd.Series):
    return FAKE if x.has_company_logo == 0 else ABSTAIN


@labeling_function()
def required_experience(x: pd.Series):
    return REAL if x.required_experience else ABSTAIN


@labeling_function()
def required_education(x: pd.Series):
    return REAL if x.required_education else ABSTAIN

Returning `ABSTAIN` (`-1`) tells Snorkel that the labeling function draws no conclusion about an instance that doesn’t satisfy its condition.

Next, we will use each of these labeling functions to label our training dataset:

In [None]:
lfs = [
    no_company_profile,
    no_company_logo,
    required_experience,
    required_education,
]

applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=train_df)
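Conceptually, applying labeling functions yields a label matrix with one row per instance and one column per labeling function. The sketch below imitates that with plain Python functions and pandas. It is not Snorkel's implementation, and the toy rows and rule names are invented for illustration.

```python
import numpy as np
import pandas as pd

FAKE, REAL, ABSTAIN = 1, 0, -1


def no_profile(row):
    return FAKE if row.company_profile == "" else ABSTAIN


def has_experience(row):
    return REAL if row.required_experience else ABSTAIN


toy_jobs = pd.DataFrame(
    {
        "company_profile": ["", "We build rockets"],
        "required_experience": ["", "Mid-Senior level"],
    }
)

rules = [no_profile, has_experience]

# One row per job posting, one column per rule
label_matrix = np.array(
    [[rule(row) for rule in rules] for _, row in toy_jobs.iterrows()]
)
print(label_matrix)
```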

Now that we have created the labels using each labeling function, we can use `LFAnalysis` to determine the accuracy of these labels.

In [None]:
LFAnalysis(L=L_train, lfs=lfs).lf_summary(Y=train_df.fraudulent.values)

Details of the statistics in the table above:
* **Polarity**: The set of unique labels this LF outputs (excluding abstains)
* **Coverage**: The fraction of the dataset that is labeled
* **Overlaps**: The fraction of the dataset where this LF and at least one other LF both assign a label
* **Conflicts**: The fraction of the dataset where this LF and at least one other LF both assign a label and disagree
* **Correct**: The number of data points this LF labels correctly
* **Incorrect**: The number of data points this LF labels incorrectly
* **Empirical Accuracy**: The empirical accuracy of this LF
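Some of these statistics are easy to reproduce by hand. The NumPy sketch below computes coverage, correct and incorrect counts, and empirical accuracy for an invented label matrix. It assumes every labeling function labels at least one row. Otherwise the accuracy would involve a division by zero.

```python
import numpy as np

ABSTAIN = -1

# Rows are data points, columns are labeling functions (toy values)
L = np.array(
    [
        [1, ABSTAIN],
        [1, 0],
        [ABSTAIN, 0],
        [0, 0],
    ]
)
Y = np.array([1, 0, 1, 0])  # ground-truth labels

labeled = L != ABSTAIN
coverage = labeled.mean(axis=0)

# Count agreement with the ground truth only where the LF did not abstain
correct = ((L == Y[:, None]) & labeled).sum(axis=0)
incorrect = ((L != Y[:, None]) & labeled).sum(axis=0)
empirical_accuracy = correct / (correct + incorrect)

print(coverage, correct, incorrect, empirical_accuracy)
```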

[Link to Snorkel](https://www.snorkel.org/).

[My full article about Snorkel](https://towardsdatascience.com/snorkel-programmatically-build-training-data-in-python-712fc39649fe).

### sketch: AI Code-Writing Assistant That Understands Data Content

Wouldn't it be nice if you could get insights into your data by simply asking a question? Sketch allows you to do exactly that.

Sketch is an AI code-writing assistant for pandas users that understands the context of your data.

In [None]:
!pip install sketch

In [None]:
import pandas as pd  
import seaborn as sns 
import sketch

In [None]:
data = sns.load_dataset('taxis')
data.head(10)

In [None]:
data.sketch.ask(
    "Can you give me friendly names for each column?" 
    "(Output as an HTML list)"
)

In [None]:
data.sketch.ask(
    "Which payment is the most popular payment?"
)

In [None]:
data.sketch.howto("Create some features from the pickup column")

In [None]:

# Create a new column for the hour of the pickup
data['pickup_hour'] = data['pickup'].dt.hour

# Create a new column for the day of the week of the pickup
data['pickup_day'] = data['pickup'].dt.weekday

# Create a new column for the month of the pickup
data['pickup_month'] = data['pickup'].dt.month_name()


In [None]:
data.sketch.howto(
    "Create some features from the pickup_zone column"
)

In [None]:

# Create a new column called 'pickup_zone_count'
data['pickup_zone_count'] = data.groupby('pickup_zone')['pickup_zone'].transform('count')

# Create a new column called 'pickup_zone_fare'
data['pickup_zone_fare'] = data.groupby('pickup_zone')['fare'].transform('mean')

# Create a new column called 'pickup_zone_distance'
data['pickup_zone_distance'] = data.groupby('pickup_zone')['distance'].transform('mean')
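The `groupby(...).transform(...)` pattern used above returns one value per original row rather than one per group, which is why the result can be assigned directly as a new column. A toy check, with invented values and illustrative column names:

```python
import pandas as pd

trips = pd.DataFrame(
    {"pickup_zone": ["A", "A", "B"], "fare": [10.0, 20.0, 30.0]}
)

# Each row receives the statistic of the group it belongs to
trips["zone_count"] = trips.groupby("pickup_zone")["pickup_zone"].transform("count")
trips["zone_mean_fare"] = trips.groupby("pickup_zone")["fare"].transform("mean")

print(trips)
```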


In [None]:
data 

[Link to sketch](https://github.com/approximatelabs/sketch).