# Feature Engineering and Selection

Feature engineering is the process of selecting, manipulating and transforming raw data into features that can be used in supervised learning. It leverages the existing data to create new variables that do not exist in the original dataset. Within an exploratory data analysis (EDA), the overall goal is to find patterns in the data. This step comes after understanding the data, handling missing values and preprocessing (topics covered in the notebook [Data Cleaning and Preprocess](DataCleaning_Preprocessing.ipynb)).

This notebook showcases several techniques: Feature Creation, Multicollinearity Detection, Feature Selection and Dimensionality Reduction, the last of which was already touched on in previous repositories.


**Libraries and Datasets to use:**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

df_titanic = sns.load_dataset("titanic")  # Categorical-heavy dataset
df_diamonds = sns.load_dataset("diamonds")  # Numerical-heavy dataset


## 1. Feature Creation

Feature creation transforms raw data into features that better represent the underlying problem to predictive models, which can improve model accuracy on unseen data. We can divide this process into a few groups:

- **Interaction feature creation:** We can apply polynomial transformations to existing features to create new ones (`Polynomial Features`), or combine features to capture the interactions between them (`Interaction Terms`);
- **Binning and Discretization:** Starting from continuous variables, we can group their values into discrete bins (`Binning`) or transform them into categorical variables (`Discretization`);
- **Time-Based features:** When dealing with dates and times, we can extract components such as day, month and year, or calculate elapsed times, i.e. time deltas ($\Delta t$);
- **Text Data:** From text data, we can: 
  - split it into individual words or tokens (`Tokenization`); 
  - transform it into numerical vectors based on Term Frequency-Inverse Document Frequency (`TF-IDF`);
  - represent words in a continuous vector space, using pre-trained models like Word2Vec or GloVe (`Word Embeddings`)

Below is an example of the interaction feature creation technique, generating new features on the **diamonds** example dataset ($a^2$, $ab$ and $b^2$, with $a$ = `carat` and $b$ = `depth`):

In [9]:
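# Degree-2 expansion without the constant (bias) column
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
diamond_features = df_diamonds[['carat', 'depth']]
# Output columns are ordered as: carat, depth, carat^2, carat*depth, depth^2
diamond_poly = poly.fit_transform(diamond_features)
print("Original Features:", diamond_features.head())
# Skip the first two columns (the original features) to show only the new ones
print("Polynomial Features:", diamond_poly[:5, 2:])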

Original Features:    carat  depth
0   0.23   61.5
1   0.21   59.8
2   0.23   56.9
3   0.29   62.4
4   0.31   63.3
Polynomial Features: [[5.29000e-02 1.41450e+01 3.78225e+03]
 [4.41000e-02 1.25580e+01 3.57604e+03]
 [5.29000e-02 1.30870e+01 3.23761e+03]
 [8.41000e-02 1.80960e+01 3.89376e+03]
 [9.61000e-02 1.96230e+01 4.00689e+03]]


## 5. Extra Resources

For further reading:
- [BuiltIn article](https://builtin.com/articles/feature-engineering)
- [Feature Engineering and Selection Book](http://www.feat.engineering/)
- [Scikit-learn: Feature Selection](https://scikit-learn.org/stable/modules/feature_selection.html)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Seaborn Visualization Guide](https://seaborn.pydata.org/)