# Feature Engineering
[Kaggle's Feature Engineering](https://www.kaggle.com/learn/feature-engineering)

## Baseline Model
[Kaggle's Baseline Model](https://www.kaggle.com/matleonard/baseline-model)

[Kaggle's Baseline Model: Exercise](https://www.kaggle.com/matleonard/baseline-model)

## Categorical Encodings
1. Label Encoding
2. One Hot Encoding
3. Count Encoding
4. Target Encoding
5. CatBoost Encoding
6. Singular Value Decomposition
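
The first four encodings in the list can be sketched with plain pandas. This is a minimal illustration on a hypothetical toy dataframe (the `country` and `outcome` columns are assumptions, not data from the course), not the course's own implementation. CatBoost encoding and SVD are not covered here.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "country": ["US", "GB", "US", "CA", "US", "GB"],
    "outcome": [1, 0, 1, 0, 0, 1],
})

# Label encoding: map each category to an integer code.
df["country_label"] = df["country"].astype("category").cat.codes

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["country"], prefix="country")

# Count encoding: replace each category with how often it occurs.
df["country_count"] = df["country"].map(df["country"].value_counts())

# Target encoding: replace each category with the mean target for that category.
# In practice the means should come from the training split only, to avoid leakage.
df["country_target"] = df["country"].map(df.groupby("country")["outcome"].mean())
```

One-hot encoding adds one column per category, so it tends to suit columns with few distinct values; the other three keep a single column per feature.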

[Kaggle's Categorical Encodings](https://www.kaggle.com/matleonard/categorical-encodings)

[Kaggle's Categorical Encodings: Exercise](https://www.kaggle.com/aubreyjohn/exercise-categorical-encodings/edit)

## Feature Generation
Creating new features from the raw data is one of the best ways to improve your model.

The features you create are different for every dataset, so it takes a bit of creativity and experimentation.

You may also have access to multiple tables with relevant data that you can use to create new features.

You can make new columns by:

- Combining categorical columns (the combined columns are called interactions).
- Making calculations based on numerical columns.
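
Both kinds of new column can be sketched in a few lines of pandas. The dataframe and its column names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "category": ["Music", "Film", "Music"],
    "country": ["US", "GB", "GB"],
    "goal": [1000.0, 5000.0, 2000.0],
    "pledged": [1500.0, 2500.0, 2000.0],
})

# Interaction: combine two categorical columns into one new categorical column.
df["category_country"] = df["category"] + "_" + df["country"]

# Calculation on numerical columns: fraction of the goal that was pledged.
df["pledged_ratio"] = df["pledged"] / df["goal"]
```

The interaction column is still categorical, so it needs one of the categorical encodings before most models can use it.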
 
Use **.rolling()** to count the number of items in a period, e.g. the last 7 days. It takes a series with timestamps as the index and the row indices as values.
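
A minimal sketch of a 7-day rolling count, assuming a handful of made-up launch timestamps. Subtracting 1 is a choice made here so that the current row is not counted among "the items in the last 7 days".

```python
import pandas as pd

# Hypothetical launch timestamps, for illustration only.
launched = pd.to_datetime(
    ["2021-01-01", "2021-01-03", "2021-01-05", "2021-01-09", "2021-01-20"]
)

# Series with timestamps as the index and row positions as values.
events = pd.Series(range(len(launched)), index=launched).sort_index()

# Count the rows in each trailing 7-day window, excluding the current row.
count_7_days = events.rolling("7D").count() - 1
```

The time-based window requires a sorted datetime index, which is why the series is sorted before `.rolling()` is called.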

If you want a window that always starts at the first row but grows as you move further through the data, use **.expanding()** instead.
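
As a small sketch of an expanding window, here is a running total and running mean over a made-up series (the values are illustrative only):

```python
import pandas as pd

# Hypothetical pledged amounts, in row order.
pledged = pd.Series([10, 20, 30, 40])

# Each result uses every row from the first one up to the current one.
running_total = pledged.expanding().sum()
running_mean = pledged.expanding().mean()
```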


Calculate the time since the last item in the same category. A handy way to perform operations within groups is **.groupby** followed by **.transform**. The **.transform** method takes a function and passes it a series or dataframe for each group. It returns a dataframe with the same index as the original dataframe. In our case, we group by "category" and use transform to calculate the time differences within each category.

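```python
def time_since_last_project(series):
    # Return the time since the previous launch, in hours
    return series.diff().dt.total_seconds() / 3600.

# ks is assumed to be a dataframe with a datetime 'launched' column
df = ks[['category', 'launched']].sort_values('launched')

# The first project in each category has no predecessor, so its value is NaN
timedeltas = df.groupby('category').transform(time_since_last_project)
```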
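
To see what the groupby/transform pattern produces, here is a self-contained run on a hypothetical toy dataframe. Filling the missing values with the median is one possible choice made for this sketch, not something the notes prescribe.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "category": ["art", "music", "art", "music", "art"],
    "launched": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 06:00", "2021-01-01 12:00",
        "2021-01-02 06:00", "2021-01-03 00:00",
    ]),
})

ordered = toy.sort_values("launched")

# Hours since the previous launch within the same category.
hours = ordered.groupby("category")["launched"].transform(
    lambda s: s.diff().dt.total_seconds() / 3600.0
)

# The first launch in each category has no predecessor (NaN); fill with the median.
hours_filled = hours.fillna(hours.median())

# Align the result with the original row order.
toy["hours_since_last"] = hours_filled.reindex(toy.index)
```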


Transforming numerical values. Some models work better when the features are normally distributed, so it can help to transform skewed values. Common choices are the **square root** and the **natural logarithm**.
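
A minimal sketch of both transformations with NumPy, using made-up, strongly skewed values:

```python
import numpy as np

# Hypothetical, heavily right-skewed values (e.g. funding goals).
goal = np.array([100.0, 1_000.0, 10_000.0, 1_000_000.0])

# The square root compresses large values moderately.
sqrt_goal = np.sqrt(goal)

# The natural logarithm compresses them much more strongly.
# It is undefined for zero and negative inputs; np.log1p is a common
# alternative when zeros are present.
log_goal = np.log(goal)
```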

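The **log transformation** won't help if your model is tree-based, since tree-based models are scale invariant: a monotonic transformation preserves the ordering of the values, so the same splits remain available. However, it should help if you have a **linear model** or **neural network**.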

Other transformations include **squares** and other **powers**, **exponentials**, etc. These might help the model discriminate, similar to the **kernel trick for SVMs**.
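
As a hedged illustration of why a power feature can help a linear model, the sketch below fits ordinary least squares to a synthetic target that depends on the square of `x`. The data and the helper `fit_residual` are hypothetical and exist only for this example.

```python
import numpy as np

# Synthetic data: the target is a non-linear (quadratic) function of x.
x = np.linspace(-3, 3, 50)
y = x ** 2

def fit_residual(features, target):
    # Ordinary least squares with an intercept column;
    # returns the sum of squared residuals of the fit.
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sum((design @ coef - target) ** 2))

# Linear model on the raw feature only.
raw_error = fit_residual(x, y)

# Same linear model with the squared feature added.
squared_error = fit_residual(np.column_stack([x, x ** 2]), y)
```

With only `x`, the linear model cannot follow the curve and leaves a large residual. Once `x ** 2` is available as a feature, the residual drops to essentially zero.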

Again, it takes a bit of experimentation to see what works.

[Kaggle's Feature Generation](https://www.kaggle.com/learn/feature-engineering)

[Kaggle's Feature Generation: Exercise](https://www.kaggle.com/aubreyjohn/exercise-feature-generation/edit)

One approach is to create many new features and later choose the best ones with **feature selection algorithms**.

## Feature Selection
[Kaggle's Baseline Model](https://www.kaggle.com/matleonard/baseline-model)

[Kaggle's Baseline Model: Exercise](https://www.kaggle.com/matleonard/baseline-model)