# Last Updated: 2023-05-29
# Completed: 2023-05-29

In [1]:
# 0. Importing Modules
import numpy as np
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from datetime import datetime, timedelta
from matplotlib.ticker import MaxNLocator
print("Setup Complete")



Setup Complete


# The Goal of Feature Engineering

We might perform feature engineering to:
- improve a model's predictive performance
- reduce computational or data needs
- improve interpretability of the results

# A Guiding Principle of Feature Engineering

- The key idea is that a transformation you apply to a feature becomes, in essence, part of the model itself.
- We try to transform each feature so that its relationship to the target we are predicting becomes linear.
- Example: a linear model fits poorly with only "X" as a feature if "X" and "Y" do not have a linear relationship.
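
A minimal sketch of this principle, assuming synthetic data where the target is roughly the square of "X". The helper `r_squared` and all variable names are made up for illustration. A straight-line fit on "X" explains almost nothing, while the same fit on the transformed feature "X squared" explains nearly everything.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # Y depends on X, but not linearly

def r_squared(feature, target):
    """R^2 of a straight-line least-squares fit of target on feature."""
    slope, intercept = np.polyfit(feature, target, deg=1)
    residuals = target - (slope * feature + intercept)
    return 1 - residuals.var() / target.var()

r2_raw = r_squared(x, y)          # linear model on the raw feature
r2_squared = r_squared(x**2, y)   # same linear model on the transformed feature
print(f"raw: {r2_raw:.3f}, squared: {r2_squared:.3f}")
```

The model is the same in both cases; only the feature changed, which is why the transform can be thought of as part of the model.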

# 1. Mutual Information

**A great first step is to construct a ranking with a feature utility metric, a function measuring associations between a feature and the target.** Then you can choose a smaller set of the most useful features to develop initially and have more confidence that your time will be well spent.

The metric we'll use is called "mutual information". Mutual information is a lot like correlation in that it measures a relationship between two quantities. 

Mutual information is a great general-purpose metric and especially useful at the start of feature development when you might not know what model you'd like to use yet. It is:

- easy to use and interpret,
- computationally efficient,
- theoretically well-founded,
- resistant to overfitting, and,
- able to detect any kind of relationship

## 1.1 Mutual Information and What it Measures

Mutual information describes relationships in terms of uncertainty. **The mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other.** If you knew the value of a feature, how much more confident would you be about the target?

Technical note: What we're calling uncertainty is measured using a quantity from information theory known as "entropy". The entropy of a variable means roughly: "how many yes-or-no questions you would need to describe an occurrence of that variable, on average." The more questions you have to ask, the more uncertain you must be about the variable. **Mutual information is how many questions you expect the feature to answer about the target.**

## 1.2 Interpreting Mutual Information Scores

**The least possible mutual information between quantities is 0.0.** When MI is zero, the quantities are independent: neither can tell you anything about the other. **Conversely, in theory there's no upper bound to what MI can be.** In practice though values above 2.0 or so are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)

Here are some things to remember when applying mutual information:

- MI can help you to understand **the relative potential of a feature as a predictor of the target, considered by itself**.

- It's possible for a feature to be very informative when interacting with other features, but not so informative all alone. **MI can't detect interactions between features.** It is a univariate metric.

- The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. Just because a feature has a high MI score doesn't mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
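
As a rough sketch, MI scores could be computed with scikit-learn's `mutual_info_regression`. The data here is synthetic and the column names ("informative", "noise") are made up for illustration. The target depends on one feature through a parabola, a relationship that linear correlation would largely miss.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "informative": rng.uniform(-3, 3, n),  # drives the target, non-linearly
    "noise": rng.uniform(-3, 3, n),        # unrelated to the target
})
y = X["informative"] ** 2 + rng.normal(scale=0.1, size=n)

# Both features are continuous, so none are flagged as discrete.
mi = mutual_info_regression(X, y, discrete_features=False, random_state=0)
mi_scores = pd.Series(mi, index=X.columns, name="MI Scores").sort_values(ascending=False)
print(mi_scores)
```

Each score is computed for one feature at a time, which is why the ranking says nothing about interactions between features.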

## Link to tutorial/exercise:

- https://www.kaggle.com/code/ryanholbrook/mutual-information
- https://www.kaggle.com/code/tsztungchau/exercise-mutual-information

# 2. Creating Features

## 2.1 Mathematical Transforms

Research the problem domain. Relationships among numerical features are often expressed as mathematical formulas, so apply a mathematical transform wherever it captures a useful relationship.

Visualization tools will sometimes be helpful for us to understand the mathematical relationship.
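
A minimal sketch of two common transforms, assuming a hypothetical `cars` table (the column names and values are invented for illustration): a ratio of two measurements and a logarithm to compress a skewed quantity.

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    "stroke": [2.7, 3.5],
    "bore": [3.5, 3.2],
    "price": [13495.0, 16500.0],
})

# A ratio combines two measurements into one feature.
cars["stroke_ratio"] = cars["stroke"] / cars["bore"]

# log1p computes log(1 + x), which stays defined when a value is 0.
cars["log_price"] = np.log1p(cars["price"])
print(cars)
```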

## 2.2 Counts

Features describing **the presence or absence of something** often come in sets, 
the set of risk factors for a disease, say. You can aggregate such features by **creating a count**.

**These features will be binary** (1 for Present, 0 for Absent) **or boolean** (True or False). 
In Python, booleans can be added up just as if they were integers.
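
For instance, a count of risk factors could be sketched as below. The `patients` table and its column names are hypothetical.

```python
import pandas as pd

patients = pd.DataFrame({
    "smoker":   [True, False, True],
    "high_bp":  [True, False, False],
    "diabetic": [False, False, True],
})
risk_factors = ["smoker", "high_bp", "diabetic"]

# True counts as 1 and False as 0, so a row-wise sum is a count of present factors.
patients["risk_factor_count"] = patients[risk_factors].sum(axis=1)
print(patients)
```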

In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
display(df)

# df.sum(axis=1) adds the values horizontally, across the columns of each row,
# and returns one total per row.
row_sum = df.sum(axis=1)
print(row_sum)

| | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |

0    5
1    7
2    9
dtype: int64

## 2.3 Building-Up and Breaking-Down Features

When information is packed into strings, we can use the **"str" accessor**, which lets you apply string methods.

Or we could also **join simple features (by python string operations)** into a composed feature if we had reason to believe there was some interaction in the combination.
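
Both directions can be sketched as follows, using a hypothetical `customers` table whose column names and values are invented for illustration.

```python
import pandas as pd

customers = pd.DataFrame({
    "policy": ["Corporate L3", "Personal L1", "Special L2"],
    "make": ["audi", "bmw", "audi"],
    "body_style": ["sedan", "wagon", "hatchback"],
})

# Breaking down: split one string column into two features.
# expand=True returns the pieces as separate columns.
customers[["policy_type", "policy_level"]] = (
    customers["policy"].str.split(" ", expand=True)
)

# Building up: join two simple features into one composed feature.
customers["make_and_style"] = customers["make"] + "_" + customers["body_style"]
print(customers)
```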

## 2.4 Group Transformations

**Group transforms** aggregate information across multiple rows grouped by some category.

With a group transform you can create features like: "the average income of a person's state of residence," 
or "the proportion of movies released on a weekday, by genre." 

If you have discovered a category interaction, a group transform over that category could be something good to investigate.

The **mean** function is a built-in dataframe method, which means we can pass it as a string to transform. 

Other handy methods include **max, min, median, var, std, and count**. 

Some other built-in pandas methods could also come in handy, such as
- .sample()
- .drop()
- .transform()
- .drop_duplicates()
- .merge()
- .groupby()
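
A minimal sketch of a group transform, assuming a hypothetical `customers` table with "state" and "income" columns. Passing the method name as a string to `transform` broadcasts each group's aggregate back onto that group's rows.

```python
import pandas as pd

customers = pd.DataFrame({
    "state": ["WA", "WA", "OR", "OR", "OR"],
    "income": [50000, 70000, 30000, 45000, 60000],
})

# "The average income of a person's state of residence".
customers["avg_state_income"] = (
    customers.groupby("state")["income"].transform("mean")
)

# A frequency encoding: the share of rows that belong to each state.
customers["state_freq"] = (
    customers.groupby("state")["state"].transform("count") / len(customers)
)
print(customers)
```

Unlike a plain `groupby(...).mean()`, the result has one value per original row, so it can be assigned directly as a new column.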

### Tips on Creating Features
It's good to keep in mind your model's own strengths and weaknesses when creating features. Here are some guidelines:
- Linear models learn sums and differences naturally, but can't learn anything more complex.

- Ratios seem to be difficult for most models to learn. Ratio combinations often lead to some easy performance gains.

- Linear models and neural nets generally do better with normalized features. Neural nets especially need features scaled to values not too far from 0. Tree-based models (like random forests and XGBoost) can sometimes benefit from normalization, but usually much less so.

- Tree models can learn to approximate almost any combination of features, but when a combination is especially important they can still benefit from having it explicitly created, especially when data is limited.

- Counts are especially helpful for tree models, since these models don't have a natural way of aggregating information across many features at once.

## Link to tutorial/exercise:

- https://www.kaggle.com/code/ryanholbrook/creating-features
- https://www.kaggle.com/code/tsztungchau/exercise-creating-features

# 3. Clustering With K-Means

**Clustering** simply means the assigning of data points to groups based upon how similar the points are to each other. A clustering algorithm makes "birds of a feather flock together," so to speak.

## 3.1 Cluster Labels as a Feature

It's important to remember that the cluster label is a categorical feature.

The motivating idea for adding cluster labels is that the clusters will break up complicated relationships across features into simpler chunks. 

Our model can then just learn the simpler chunks one-by-one instead of having to learn the complicated whole all at once. 

It's a "divide and conquer" strategy.



## 3.2 K-Means Clustering

K-means clustering is one of a great many clustering algorithms. It is intuitive and easy to apply in a feature engineering context.

**K-means** clustering measures similarity using ordinary straight-line distance (Euclidean distance, in other words). 

It creates clusters by placing a number of points, called **centroids**, inside the feature-space. 

Each point in the dataset is assigned to the cluster of whichever centroid it's closest to. 

The "k" in "k-means" is how many centroids (that is, clusters) it creates. You define the k yourself.

There are three important parameters from scikit-learn's implementation: "n_clusters", "max_iter", and "n_init".

It's a simple two-step process. The algorithm **starts by randomly initializing some predefined number (n_clusters) of centroids.** It then iterates over these two operations:

1. assign points to the nearest cluster centroid
2. move each centroid to minimize the distance to its points

It iterates over these two steps until the centroids aren't moving anymore, or until some maximum number of iterations has passed **(max_iter)**.
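
The two steps above can be sketched in plain NumPy. This is an illustrative toy, not scikit-learn's implementation. The function name `kmeans_sketch` and the sample points are made up, and the sketch assumes the starting centroids are drawn from the data points themselves.

```python
import numpy as np

def kmeans_sketch(X, n_clusters, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick n_clusters distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        # A centroid with no points keeps its previous position.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(n_clusters)
        ])
        # Stop once the centroids are no longer moving.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, n_clusters=2)
print(labels)
```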

It often happens that the initial random position of the centroids ends in a poor clustering. For this reason, the algorithm repeats a number of times **(n_init)** and returns the clustering that has the least total distance between each point and its centroid, that is, the best clustering among those runs.

You may need to increase the **max_iter** for a large number of clusters or **n_init** for a complex dataset. Ordinarily though the only parameter you'll need to choose yourself is **n_clusters** (k, that is). The best partitioning for a set of features depends on the model you're using and what you're trying to predict, so it's best to tune it like any hyperparameter (through cross-validation, say).
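
A minimal sketch of creating a cluster-label feature with scikit-learn's `KMeans`, using the three parameters discussed above. The data is synthetic (three blobs), and the column names "x1", "x2" and "Cluster" are chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: 50 points around each of three centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X = pd.DataFrame(points, columns=["x1", "x2"])

kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
X["Cluster"] = kmeans.fit_predict(X[["x1", "x2"]])

# The labels are arbitrary identifiers, so treat them as categorical.
X["Cluster"] = X["Cluster"].astype("category")
print(X["Cluster"].value_counts())
```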

## Link to tutorial/exercise:

- https://www.kaggle.com/code/ryanholbrook/clustering-with-k-means
- https://www.kaggle.com/code/tsztungchau/exercise-clustering-with-k-means

(reference)

- https://www.kaggle.com/code/dansbecker/using-categorical-data-with-one-hot-encoding/notebook

# 4. Principal Component Analysis (PCA)

## 4.1 Introduction

Just like clustering is **a partitioning of the dataset based on proximity**, you could think of PCA as **a partitioning of the variation in the data**. PCA is a great tool to help you discover important relationships in the data and can also be used to create more informative features.

(Technical note: PCA is typically applied to **standardized data**. With standardized data "variation" means "correlation". With unstandardized data "variation" means "covariance". All data in this course will be standardized before applying PCA.)

## 4.2 Principal Component Analysis

The whole idea of PCA: **instead of describing the data with the original features, we describe it with its axes of variation. The axes of variation become the new features.**

The new features PCA constructs are actually just linear combinations (weighted sums) of the original features.

These new features are called the **principal components** of the data. The weights themselves are called **loadings**. There will be as many principal components as there are features in the original dataset.

A component's loadings tell us what variation it expresses through signs and magnitudes.

PCA also tells us the amount of variation in each component. In the tutorial's figures, there is more variation in the data along the **Size** component than along the **Shape** component. PCA makes this precise through each component's **percent of explained variance**.
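
A rough sketch of how loadings and explained variance could be inspected with scikit-learn's `PCA`. The data is synthetic: two positively correlated columns named 'Height' and 'Diameter' stand in for real measurements, and the names "PC1" and "PC2" are chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
height = rng.normal(size=200)
diameter = 0.8 * height + 0.6 * rng.normal(size=200)  # correlated with height
X = pd.DataFrame({"Height": height, "Diameter": diameter})

# Standardize first, since PCA is sensitive to scale.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(X_scaled), columns=["PC1", "PC2"])

# Rows of components_ are components; transpose so each column holds one
# component's loadings on the original features.
loadings = pd.DataFrame(pca.components_.T, columns=X_pca.columns, index=X.columns)
explained = pca.explained_variance_ratio_
print(loadings)
print(explained)
```

With two positively correlated features, the loadings of the first component share a sign (a "Size"-like axis), while those of the second have opposite signs (a "Shape"-like contrast).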

## 4.3 PCA for Feature Engineering

There are two ways you could use PCA for feature engineering.

1. The first way is to **use it as a descriptive technique**. Since the components tell you about the variation, you could compute the MI scores for the components and see what kind of variation is most predictive of your target. That could give you ideas for kinds of features to create -- a product of 'Height' and 'Diameter' if 'Size' is important, say, or a ratio of 'Height' and 'Diameter' if 'Shape' is important. You could even try clustering on one or more of the high-scoring components.

2. The second way is to **use the components themselves as features**. Because the components expose the variational structure of the data directly, they can often be more informative than the original features. Here are some use-cases:

  - **Dimensionality reduction**: When your features are highly redundant (multicollinear, specifically), PCA will partition out the redundancy into one or more near-zero variance components, which you can then drop since they will contain little or no information.
  
  - **Anomaly detection**: Unusual variation, not apparent from the original features, will often show up in the low-variance components. These components could be highly informative in an anomaly or outlier detection task.
  
  - **Noise reduction**: A collection of sensor readings will often share some common background noise. PCA can sometimes collect the (informative) signal into a smaller number of features while leaving the noise alone, thus boosting the signal-to-noise ratio.
  
  - **Decorrelation**: Some ML algorithms struggle with highly-correlated features. PCA transforms correlated features into uncorrelated components, which could be easier for your algorithm to work with.
  
**PCA basically gives you direct access to the correlational structure of your data.**
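
The dimensionality reduction and decorrelation use-cases can be sketched together, assuming synthetic data in which a third column is exactly the sum of the other two. The column names and the `1e-8` cutoff are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
X = pd.DataFrame({"a": a, "b": b, "a_plus_b": a + b})  # third column is redundant
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA()
components = pca.fit_transform(X_scaled)

# Decorrelation: covariances between different components are ~0.
cov = np.cov(components, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max())

# Dimensionality reduction: the redundancy lands in a near-zero variance
# component, which can be dropped.
keep = pca.explained_variance_ratio_ > 1e-8
reduced = components[:, keep]
print(pca.explained_variance_ratio_, reduced.shape)
```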

Side notes:
- PCA only works with numeric features, like continuous quantities or counts.
- PCA is sensitive to scale. It's good practice to standardize your data before applying PCA, unless you know you have good reason not to.
- Consider removing or constraining outliers, since they can have an undue influence on the results.

## Link to tutorial/exercise:

- https://www.kaggle.com/code/ryanholbrook/principal-component-analysis
- https://www.kaggle.com/code/tsztungchau/exercise-principal-component-analysis

# 5. Target Encoding

## 5.1 Introduction

Unlike most of the techniques above, which are meant for numerical features, **target encoding** is meant *for categorical features*. It's a method of encoding categories as numbers, like one-hot or label encoding, with the difference that it also uses the target to create the encoding. This makes it what we call a **supervised** feature engineering technique.

## 5.2 Target Encoding

A **target encoding** is any kind of encoding that replaces a feature's categories with some number derived from the target.
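
The simplest case, a mean encoding, can be sketched with a group transform. The `autos` table, its columns and its values are hypothetical.

```python
import pandas as pd

autos = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "bmw"],
    "price": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0],
})

# Replace each category with the mean of the target within that category.
autos["make_encoded"] = autos.groupby("make")["price"].transform("mean")
print(autos)
```

In practice the encoding is usually computed on rows held out from model training, since encoding and training on the same rows can leak the target into the feature.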

## 5.3 Smoothing

Smoothing is usually used when:
1. there are unknown categories (ones absent from the data the encoding was computed on), which pandas automatically fills in with missing values
2. there are rare categories, whose statistics come from so few rows that they may not hold up if more data are provided in the future

The idea is to **blend the in-category average with the overall average**. 

Rare categories get less weight on their category average, while missing categories just get the overall average.

In pseudocode:

**encoding = weight * in_category + (1 - weight) * overall**

where weight is a value between 0 and 1 calculated from the category frequency.

An easy way to determine the value for weight is to compute an m-estimate:

**weight = n / (n + m)**

where **"n"** is the total number of times that category occurs in the data. The parameter **"m"** determines the "smoothing factor". Larger values of m put more weight on the overall estimate.


When choosing a value for "m", consider how noisy you expect the categories to be:
- the noisier the categories, the larger the value of "m" should be, and vice versa
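
The m-estimate above can be sketched directly in pandas. The function name `fit_m_estimate`, the toy data and the choice `m=2` are assumptions for illustration, not part of any library.

```python
import pandas as pd

def fit_m_estimate(categories, target, m):
    """Return (per-category encoding, overall mean) using weight = n / (n + m)."""
    overall = target.mean()
    stats = target.groupby(categories).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)
    encoding = weight * stats["mean"] + (1 - weight) * overall
    return encoding, overall

train = pd.DataFrame({
    "make": ["audi"] * 4 + ["saab"],            # "saab" is a rare category
    "price": [10.0, 20.0, 30.0, 40.0, 100.0],
})
encoding, overall = fit_m_estimate(train["make"], train["price"], m=2)
print(encoding)

# Unknown categories map to NaN, so they fall back to the overall average.
new_makes = pd.Series(["audi", "volvo"])
new_encoded = new_makes.map(encoding).fillna(overall)
print(new_encoded)
```

Here the overall mean is 40. For "audi" (n = 4, in-category mean 25) the weight is 4/6, giving 30. For "saab" (n = 1, in-category mean 100) the weight is 1/3, giving 60, so the rare category is pulled much further toward the overall mean.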

## 5.4 Use Cases for Target Encoding

Target encoding is great for:

- **High-cardinality features**: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.

- **Domain-motivated features**: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness.

## Link to tutorial/exercise:

- https://www.kaggle.com/code/ryanholbrook/target-encoding
- https://www.kaggle.com/code/tsztungchau/exercise-target-encoding
