10.1 Data Enrichment and Feature Engineering

Introduction
Data enrichment and feature engineering add depth to a dataset, either by bringing in new information or by deriving new features from the data already present. Both aim to give machine learning models more meaningful inputs, which can improve their performance.

Definition
Data Enrichment: The process of augmenting existing data by adding new information from external sources or through calculated metrics.
Feature Engineering: The practice of transforming raw data into features that better represent the underlying problem to the predictive models.

Objective
The main objective is to create more informative features that can capture the relationships and patterns within the data, ultimately leading to better model performance.

Importance
A model can only learn from the inputs it is given, so the quality of its features often limits its accuracy and robustness. Well-chosen features make relevant patterns explicit instead of leaving the model to discover them from raw values.

10.2 Techniques List and Definitions
1. Creating Interaction Features: Combining two or more features to capture the interaction between them.
2. Adding External Data: Enriching the dataset with external sources, such as demographic information, weather data, or economic indicators.
3. Aggregating Features: Summarizing data through aggregations like mean, sum, or count based on different categories or time periods.
4. Creating Lag Features: Creating features based on previous time steps in time series data.
5. Calculating Rolling Statistics: Using moving averages, sums, or other statistics calculated over a rolling window.
6. Binning Continuous Variables: Converting continuous variables into categorical bins to capture non-linear relationships.
7. Polynomial Feature Creation: Generating polynomial features to capture non-linear relationships between variables.
8. Text Feature Extraction: Extracting meaningful features from text data, such as word counts or sentiment scores.
9. Encoding Cyclical Features: Encoding features whose values wrap around, such as time of day or longitude, so that the representation preserves their periodicity.
10. Dimensionality Reduction Techniques: Using techniques like PCA to reduce the number of features while retaining essential information.

10.2.1 Creating Interaction Features

Introduction
Interaction features capture the combined effect of two or more features, which may not be apparent when considering the features individually.

In [1]:
import pandas as pd

# Sample Data
data = {
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# Creating Interaction Feature
df['Feature1_Feature2_Interaction'] = df['Feature1'] * df['Feature2']

print(df)


   Feature1  Feature2  Feature1_Feature2_Interaction
0        10         1                             10
1        20         2                             40
2        30         3                             90
3        40         4                            160
4        50         5                            250


Explanation

Interaction Feature: The new feature, Feature1_Feature2_Interaction, is the element-wise product of Feature1 and Feature2.
Purpose: The product lets a model represent an effect that depends on both features jointly. This matters most for linear models, which otherwise treat each feature's contribution as independent of the others.
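
When a dataset has more than two numeric columns, the same multiplication can be applied to every pair of columns. The following is a minimal sketch, not a definitive implementation: the helper add_pairwise_interactions, the column Feature3, and the "_x_" naming scheme are assumptions made up for illustration and do not appear in the example above.

```python
from itertools import combinations

import pandas as pd


def add_pairwise_interactions(frame, columns):
    """Return a copy of frame with one product column per pair of columns."""
    result = frame.copy()
    for left, right in combinations(columns, 2):
        result[f"{left}_x_{right}"] = result[left] * result[right]
    return result


# Feature3 is made-up illustration data
df_example = pd.DataFrame({
    'Feature1': [10, 20, 30],
    'Feature2': [1, 2, 3],
    'Feature3': [2, 4, 6],
})

df_interactions = add_pairwise_interactions(
    df_example, ['Feature1', 'Feature2', 'Feature3']
)
print(df_interactions.columns.tolist())
```

With n input columns this produces n(n-1)/2 new columns, so in practice it is usually worth restricting the list to pairs where a joint effect is plausible.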

10.2.2 Adding External Data

Introduction
Adding external data can enrich the existing dataset with additional context, leading to better predictive performance.

In [2]:
import pandas as pd

# Sample Data
data = {
    'Product ID': [1, 2, 3, 4, 5],
    'Sales': [100, 150, 200, 130, 170]
}
df = pd.DataFrame(data)

# External Data - Adding weather information
external_data = {
    'Product ID': [1, 2, 3, 4, 5],
    'Average_Temperature': [30, 25, 27, 32, 28]
}
external_df = pd.DataFrame(external_data)

# Merging External Data
df = df.merge(external_df, on='Product ID', how='left')

print(df)

   Product ID  Sales  Average_Temperature
0           1    100                   30
1           2    150                   25
2           3    200                   27
3           4    130                   32
4           5    170                   28


Explanation

External Data: The dataset is enriched with weather information, which might be relevant for predicting sales.
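Merging: The merge method combines the existing dataset with the external data on the shared key, Product ID. With how='left', every row of the original dataset is kept, and rows without a match in the external data receive NaN in the new columns.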
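
Because a left merge leaves NaN wherever the external source has no matching key, it can help to check the match before modeling. The following is a minimal sketch under stated assumptions: the external table here deliberately lacks Product ID 3, and the data and variable names are made up for illustration. validate and indicator are existing parameters of merge.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Product ID': [1, 2, 3],
    'Sales': [100, 150, 200],
})

# Hypothetical external table with no row for Product ID 3
weather_df = pd.DataFrame({
    'Product ID': [1, 2],
    'Average_Temperature': [30, 25],
})

merged = sales_df.merge(
    weather_df,
    on='Product ID',
    how='left',
    validate='one_to_one',  # raises MergeError if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no partner in the external table
unmatched = merged[merged['_merge'] == 'left_only']

print(merged)
print("Rows without external data:", unmatched['Product ID'].tolist())
```

Rows flagged as left_only can then be imputed, dropped, or traced back to the external source, depending on the use case.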

10.2.3 Aggregating Features

Introduction
Aggregating features involves summarizing data through various statistical measures, such as mean, sum, or count, often based on groupings or time periods.

In [3]:
import pandas as pd

# Sample Data
data = {
    'Customer ID': [1, 1, 2, 2, 3],
    'Purchase Amount': [100, 200, 150, 250, 300]
}
df = pd.DataFrame(data)

# Aggregating Purchase Amount by Customer ID
df_aggregated = df.groupby('Customer ID')['Purchase Amount'].sum().reset_index()

print(df_aggregated)

   Customer ID  Purchase Amount
0            1              300
1            2              400
2            3              300


Explanation

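Aggregation: The groupby method groups the rows by Customer ID and sum adds up the Purchase Amount within each group, giving the total spending per customer. reset_index turns the group key back into a regular column.
Purpose: Aggregated features condense many rows into one summary per group, which can expose patterns, such as high-spending customers, that are hard to see in individual transactions.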
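
The grouped result has one row per customer. To use aggregates as model features alongside the original rows, one option is to broadcast them back with transform. The following is a minimal sketch that reuses the sample data above; the new column names (Total, Mean, Count, Customer_Total, Share_of_Total) are assumptions chosen for illustration.

```python
import pandas as pd

purchases = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2, 3],
    'Purchase Amount': [100, 200, 150, 250, 300],
})

grouped = purchases.groupby('Customer ID')['Purchase Amount']

# Several statistics at once, one row per customer
customer_summary = grouped.agg(Total='sum', Mean='mean', Count='count').reset_index()

# transform returns a result aligned to the original rows,
# so it can be assigned directly as a new column
purchases['Customer_Total'] = grouped.transform('sum')
purchases['Share_of_Total'] = purchases['Purchase Amount'] / purchases['Customer_Total']

print(customer_summary)
print(purchases)
```

The summary table suits customer-level models, while the transformed columns keep one row per purchase and add the customer-level context to each.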