# Feature Engineering for AI/ML

## Learning Objectives
- Understand feature engineering principles
- Learn encoding techniques for categorical data
- Master feature scaling and normalization
- Create new features from existing data

## What is Feature Engineering?
Feature engineering is the process of selecting, modifying, or creating features from raw data to improve machine learning model performance.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from datetime import datetime

plt.style.use('seaborn-v0_8')

## 1. Sample Dataset Creation

In [None]:
# Create a sample e-commerce dataset
np.random.seed(42)
n_samples = 1000

data = {
    'customer_id': range(1, n_samples + 1),
    'age': np.random.randint(18, 80, n_samples),
    'gender': np.random.choice(['Male', 'Female', 'Other'], n_samples, p=[0.45, 0.45, 0.1]),
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_samples),
    'purchase_amount': np.random.exponential(100, n_samples),
    'purchase_date': pd.date_range('2023-01-01', periods=n_samples, freq='h'),  # 'H' is deprecated since pandas 2.2
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Books', 'Home', 'Sports'], n_samples),
    'is_premium': np.random.choice([0, 1], n_samples, p=[0.7, 0.3]),
    'previous_purchases': np.random.poisson(5, n_samples)
}

df = pd.DataFrame(data)
print("Original Dataset:")
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")

## 2. Categorical Encoding

In [None]:
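# Label Encoding maps each category to an integer, assigned in alphabetical order.
# Note: gender is nominal, so these codes carry no real order. LabelEncoder is
# intended for target labels; OrdinalEncoder is the better fit for ordinal features.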
le = LabelEncoder()
df['gender_encoded'] = le.fit_transform(df['gender'])

print("Label Encoding for Gender:")
print(df[['gender', 'gender_encoded']].drop_duplicates().sort_values('gender_encoded'))

# One-Hot Encoding for nominal data
city_encoded = pd.get_dummies(df['city'], prefix='city')
category_encoded = pd.get_dummies(df['product_category'], prefix='category')

print("\nOne-Hot Encoding for Cities:")
print(city_encoded.head())

# Combine with original dataframe
df_encoded = pd.concat([df, city_encoded, category_encoded], axis=1)
print(f"\nDataset shape after encoding: {df_encoded.shape}")
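
The gender column above has no natural order, so the integer codes it receives are arbitrary. For a feature that really is ordinal, a sketch along the following lines may be closer to what is wanted. The `size` column and its category order are hypothetical and not part of the e-commerce dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature (not part of the e-commerce dataset above)
sizes = pd.DataFrame({'size': ['Medium', 'Small', 'Large', 'Small']})

# Spell out the order explicitly; otherwise categories are sorted alphabetically
size_encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
sizes['size_encoded'] = size_encoder.fit_transform(sizes[['size']]).ravel()

print(sizes)
```

Passing `categories` explicitly keeps the codes aligned with the intended ranking rather than with alphabetical order.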

## 3. Feature Scaling and Normalization

In [None]:
# Select numerical features for scaling
numerical_features = ['age', 'purchase_amount', 'previous_purchases']

# Standard Scaling (Z-score normalization)
scaler_standard = StandardScaler()
df_standard = df[numerical_features].copy()
df_standard[numerical_features] = scaler_standard.fit_transform(df_standard[numerical_features])

# Min-Max Scaling
scaler_minmax = MinMaxScaler()
df_minmax = df[numerical_features].copy()
df_minmax[numerical_features] = scaler_minmax.fit_transform(df_minmax[numerical_features])

# Compare scaling methods
fig, axes = plt.subplots(3, 3, figsize=(15, 12))

for i, feature in enumerate(numerical_features):
    # Original
    axes[i, 0].hist(df[feature], bins=30, alpha=0.7)
    axes[i, 0].set_title(f'Original {feature}')
    
    # Standard scaled
    axes[i, 1].hist(df_standard[feature], bins=30, alpha=0.7, color='orange')
    axes[i, 1].set_title(f'Standard Scaled {feature}')
    
    # Min-Max scaled
    axes[i, 2].hist(df_minmax[feature], bins=30, alpha=0.7, color='green')
    axes[i, 2].set_title(f'Min-Max Scaled {feature}')

plt.tight_layout()
plt.show()

print("Scaling Statistics:")
print("\nOriginal Data:")
print(df[numerical_features].describe())
print("\nStandard Scaled:")
print(df_standard.describe())
print("\nMin-Max Scaled:")
print(df_minmax.describe())
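
The cell above fits each scaler on the full dataset, which is fine for visual comparison. When a model is evaluated on held-out data, the scaling statistics are usually learned from the training split only. The sketch below shows that pattern on a synthetic feature; the variable names and the 80/20 split are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a skewed numeric feature (assumption, for illustration)
rng = np.random.default_rng(42)
X_demo = rng.exponential(100, size=(1000, 1))

X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=42)

# Learn mean and standard deviation from the training split only
demo_scaler = StandardScaler()
X_train_scaled = demo_scaler.fit_transform(X_train)

# Reuse the training statistics on the test split (transform, not fit_transform)
X_test_scaled = demo_scaler.transform(X_test)

print(f"Train mean after scaling: {X_train_scaled.mean():.3f}")
print(f"Test mean after scaling:  {X_test_scaled.mean():.3f}")
```

The test mean will generally not be exactly zero, which is expected: the test split is scaled with statistics it did not contribute to.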

## 4. Date/Time Feature Engineering

In [None]:
# Extract date/time features
df['year'] = df['purchase_date'].dt.year
df['month'] = df['purchase_date'].dt.month
df['day'] = df['purchase_date'].dt.day
df['hour'] = df['purchase_date'].dt.hour
df['day_of_week'] = df['purchase_date'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)

# Create cyclical features for time
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

print("Date/Time Features:")
time_features = ['purchase_date', 'year', 'month', 'day', 'hour', 'day_of_week', 'is_weekend']
print(df[time_features].head())

# Visualize cyclical features
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Hour cyclical encoding
axes[0].scatter(df['hour_cos'], df['hour_sin'], c=df['hour'], cmap='hsv', alpha=0.6)
axes[0].set_xlabel('Hour Cosine')
axes[0].set_ylabel('Hour Sine')
axes[0].set_title('Cyclical Encoding of Hour')

# Month cyclical encoding
axes[1].scatter(df['month_cos'], df['month_sin'], c=df['month'], cmap='hsv', alpha=0.6)
axes[1].set_xlabel('Month Cosine')
axes[1].set_ylabel('Month Sine')
axes[1].set_title('Cyclical Encoding of Month')

plt.tight_layout()
plt.show()
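
One way to see why the sine/cosine pair helps is to compare distances. As a raw integer, hour 23 is 23 units from hour 0; on the unit circle the two are neighbours. The helper `encode_hour` below is a hypothetical name introduced only for this check.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw hours: 23:00 and 00:00 look 23 units apart
raw_gap = abs(23 - 0)

# Encoded hours: 23:00 -> 00:00 is as close as 00:00 -> 01:00
gap_23_to_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
gap_0_to_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))

print(f"Raw gap between 23 and 0: {raw_gap}")
print(f"Encoded gap 23 -> 0: {gap_23_to_0:.4f}")
print(f"Encoded gap 0 -> 1:  {gap_0_to_1:.4f}")
```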

## 5. Creating Interaction Features

In [None]:
# Create interaction features
df['age_purchase_interaction'] = df['age'] * df['purchase_amount']
df['premium_age_interaction'] = df['is_premium'] * df['age']
df['purchase_per_previous'] = df['purchase_amount'] / (df['previous_purchases'] + 1)  # +1 to avoid division by zero

# Binning continuous variables
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 35, 50, 100], labels=['Young', 'Adult', 'Middle', 'Senior'])
df['purchase_tier'] = pd.qcut(df['purchase_amount'], q=4, labels=['Low', 'Medium', 'High', 'Premium'])

print("Interaction Features:")
interaction_features = ['age_purchase_interaction', 'premium_age_interaction', 'purchase_per_previous']
print(df[interaction_features].head())

print("\nBinned Features:")
print(df[['age', 'age_group', 'purchase_amount', 'purchase_tier']].head())

# Visualize binned features
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

df['age_group'].value_counts().plot(kind='bar', ax=axes[0])
axes[0].set_title('Age Group Distribution')
axes[0].set_ylabel('Count')

df['purchase_tier'].value_counts().plot(kind='bar', ax=axes[1])
axes[1].set_title('Purchase Tier Distribution')
axes[1].set_ylabel('Count')

plt.tight_layout()
plt.show()

## 6. Statistical Features

In [None]:
# Create statistical features by grouping
# Average purchase amount by city
city_avg_purchase = df.groupby('city')['purchase_amount'].mean().to_dict()
df['city_avg_purchase'] = df['city'].map(city_avg_purchase)

# Purchase amount relative to city average
df['purchase_vs_city_avg'] = df['purchase_amount'] / df['city_avg_purchase']

# Category frequency encoding
category_counts = df['product_category'].value_counts().to_dict()
df['category_frequency'] = df['product_category'].map(category_counts)

print("Statistical Features:")
stat_features = ['city_avg_purchase', 'purchase_vs_city_avg', 'category_frequency']
print(df[stat_features].head())

# Correlation heatmap of numerical features
numerical_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[numerical_cols].corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()
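
The group averages above are computed over every row. If `purchase_amount` is later used as a prediction target, aggregates of it are usually computed on training rows only and then mapped onto new rows. The sketch below illustrates this with small hypothetical `train` and `test` frames whose values are made up.

```python
import pandas as pd

# Hypothetical train/test frames; values are made up for illustration
train = pd.DataFrame({
    'city': ['Chicago', 'Chicago', 'Houston', 'Houston'],
    'purchase_amount': [100.0, 200.0, 50.0, 150.0],
})
test = pd.DataFrame({'city': ['Chicago', 'Phoenix']})

# Compute the aggregate from training rows only
train_city_mean = train.groupby('city')['purchase_amount'].mean()
global_mean = train['purchase_amount'].mean()

# Map onto unseen rows; cities absent from training fall back to the global mean
test['city_avg_purchase'] = test['city'].map(train_city_mean).fillna(global_mean)

print(test)
```

Falling back to the global mean is one possible choice for unseen categories; a domain-specific default may suit some problems better.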

## 7. Feature Selection Techniques

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression, mutual_info_regression
from sklearn.ensemble import RandomForestRegressor

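# Prepare features for selection (only numerical)
feature_cols = df.select_dtypes(include=[np.number]).columns.tolist()
feature_cols.remove('customer_id')  # Remove ID column

y = df['purchase_amount']  # Target variable

# Remove the target and every feature derived from it; keeping them would
# leak the target into X and inflate their scores
leaky_features = ['purchase_amount', 'age_purchase_interaction', 'purchase_per_previous',
                  'city_avg_purchase', 'purchase_vs_city_avg']
X = df[feature_cols].drop(columns=leaky_features)

# Drop constant columns (e.g. year), which carry no information for the F-test
X = X.loc[:, X.nunique() > 1]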

# Method 1: Statistical Feature Selection (F-test)
selector_f = SelectKBest(score_func=f_regression, k=10)
X_selected_f = selector_f.fit_transform(X, y)
selected_features_f = X.columns[selector_f.get_support()]

print("Top 10 features by F-test:")
feature_scores_f = pd.DataFrame({
    'feature': X.columns,
    'score': selector_f.scores_
}).sort_values('score', ascending=False)
print(feature_scores_f.head(10))

# Method 2: Mutual Information
selector_mi = SelectKBest(score_func=mutual_info_regression, k=10)
X_selected_mi = selector_mi.fit_transform(X, y)

print("\nTop 10 features by Mutual Information:")
feature_scores_mi = pd.DataFrame({
    'feature': X.columns,
    'score': selector_mi.scores_
}).sort_values('score', ascending=False)
print(feature_scores_mi.head(10))

# Method 3: Random Forest Feature Importance
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X, y)

print("\nTop 10 features by Random Forest Importance:")
feature_importance_rf = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)
print(feature_importance_rf.head(10))

# Visualize feature importance
plt.figure(figsize=(10, 6))
top_features = feature_importance_rf.head(10)
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance')
plt.title('Top 10 Features by Random Forest Importance')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
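
Impurity-based importances from a random forest are computed on the training data and can favour features with many distinct values. Permutation importance on a held-out split is a common cross-check. The sketch below uses synthetic data in which only the column named `signal` drives the target; all names and sizes are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only 'signal' drives the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'signal': rng.normal(size=500),
    'noise_a': rng.normal(size=500),
    'noise_b': rng.normal(size=500),
})
y_demo = 3 * X_demo['signal'] + rng.normal(scale=0.1, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column on held-out data and measure the drop in R^2
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=X_demo.columns)

print(perm_importance.sort_values(ascending=False))
```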

## 8. Feature Engineering Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

def create_feature_pipeline():
    """
    Create a comprehensive feature engineering pipeline
    """
    # Define column types
    numerical_features = ['age', 'previous_purchases']
    categorical_features = ['gender', 'city', 'product_category']
    
    # Create transformers
    numerical_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())
    ])
    
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(drop='first', sparse_output=False))
    ])
    
    # Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
    
    return preprocessor

# Create and test the pipeline
pipeline = create_feature_pipeline()

# Prepare sample data
sample_data = df[['age', 'previous_purchases', 'gender', 'city', 'product_category']].head()
print("Original sample data:")
print(sample_data)

# Transform the data
transformed_data = pipeline.fit_transform(sample_data)
print(f"\nTransformed data shape: {transformed_data.shape}")
print("Transformed data (first 5 rows):")
print(transformed_data[:5])
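
A preprocessor like the one above is typically chained with an estimator so that scaling statistics and category lists are learned from the training split only. The following is a minimal sketch under stated assumptions: the synthetic frame, the `Ridge` model, and the 75/25 split are illustrative choices, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small synthetic frame (assumption) with column names borrowed from above
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    'age': rng.integers(18, 80, n),
    'previous_purchases': rng.poisson(5, n),
    'city': rng.choice(['New York', 'Chicago', 'Houston'], n),
})
target = rng.exponential(100, n)

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'previous_purchases']),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
])

model_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', Ridge(alpha=1.0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    demo, target, test_size=0.25, random_state=42
)

# fit() learns scaling statistics and categories from the training split only
model_pipeline.fit(X_train, y_train)
predictions = model_pipeline.predict(X_test)

print(f"Predictions shape: {predictions.shape}")
```

Because the preprocessing lives inside the pipeline, the same object can be passed to cross-validation utilities without refitting the scaler on validation folds.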

## 🎯 Key Takeaways

1. **Categorical Encoding**:
   - Use ordinal encoding with an explicit category order for ordinal data (LabelEncoder is designed for target labels)
   - Use One-Hot Encoding for nominal data
   - Consider target encoding for high-cardinality features

2. **Feature Scaling**:
   - StandardScaler for roughly symmetric, approximately normal features
   - MinMaxScaler for bounded ranges
   - RobustScaler for data with outliers

3. **Time Features**:
   - Extract meaningful components (hour, day, month)
   - Use cyclical encoding for periodic features
   - Create business-relevant time features

4. **Feature Creation**:
   - Interaction features capture relationships
   - Binning can capture non-linear patterns
   - Statistical aggregations add context

5. **Feature Selection**:
   - Use multiple methods to validate importance
   - Consider domain knowledge alongside statistics
   - Balance between performance and interpretability
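
The scaling guidance above can be checked on a small example. The sketch below compares `StandardScaler` and `RobustScaler` on a hypothetical feature with a single extreme value; the numbers are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical feature: nine ordinary values and one extreme outlier
values = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 1000], dtype=float).reshape(-1, 1)

standard_scaled = StandardScaler().fit_transform(values)
robust_scaled = RobustScaler().fit_transform(values)  # centers on median, scales by IQR

# Spread (max minus min) of the nine ordinary values after each transform
standard_spread = np.ptp(standard_scaled[:-1])
robust_spread = np.ptp(robust_scaled[:-1])

print(f"StandardScaler spread of ordinary values: {standard_spread:.4f}")
print(f"RobustScaler spread of ordinary values:   {robust_spread:.4f}")
```

With `StandardScaler`, the outlier inflates the standard deviation and squeezes the ordinary values into a narrow band; `RobustScaler` keeps them spread out because the median and interquartile range are barely affected by one extreme value.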

## 📝 Exercises

1. Create polynomial features and test their impact
2. Implement target encoding with cross-validation
3. Build a feature engineering pipeline for text data
4. Compare different feature selection methods on a real dataset
5. Create domain-specific features for a business problem