# Feature Engineering - Complete Guide

## 📚 Learning Objectives
- Master feature creation techniques
- Learn feature transformation methods
- Implement feature selection strategies
- Handle different data types effectively
- Create domain-specific features
- Optimize feature sets for ML models

## 🎯 What is Feature Engineering?

**Feature Engineering** is the process of using domain knowledge to create features that make machine learning algorithms work better.

### Why is it Important?
> "Coming up with features is difficult, time-consuming, requires expert knowledge. 'Applied machine learning' is basically feature engineering." - Andrew Ng

### Impact on Model Performance:
- Good features can improve model accuracy substantially; the size of the gain depends on the dataset and the model
- Often more important than algorithm choice
- Can reduce training time when the feature set becomes smaller
- Can make models more interpretable

### Categories of Feature Engineering:
1. **Feature Creation**: Creating new features from existing ones
2. **Feature Transformation**: Changing the scale or distribution
3. **Feature Selection**: Choosing the most relevant features
4. **Feature Extraction**: Reducing dimensionality (PCA, etc.)
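
Feature extraction is the one category not demonstrated later in this notebook, so here is a minimal sketch of PCA. The data is synthetic and hypothetical (three noisy copies of one latent factor), and the choice of two components is an assumption made only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 columns that are noisy copies of one latent factor
latent = rng.normal(size=(200, 1))
X_demo = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# PCA is sensitive to scale, so standardize the columns first
X_scaled = StandardScaler().fit_transform(X_demo)

# Project the 3 correlated columns onto 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)
print(pca.explained_variance_ratio_)
```

Because the three columns share one underlying factor, most of the variance should land in the first component, which is the situation where PCA compresses features with little information loss.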

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    LabelEncoder, OneHotEncoder, OrdinalEncoder,
    PolynomialFeatures, PowerTransformer
)
from sklearn.feature_selection import (
    SelectKBest, f_regression, mutual_info_regression,
    RFE, SelectFromModel
)
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ Libraries imported successfully!")

## Part 1: Feature Creation
### 1️⃣ Load Dataset

In [None]:
# Load California Housing dataset
df = pd.read_csv('supervised Learning/01_Regression/Linear Regression/data/dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

### 2️⃣ Mathematical Features
Creating new features using mathematical operations

In [None]:
# Create a copy for feature engineering
df_fe = df.copy()

print("🔧 Creating Mathematical Features...\n")

# 1. Ratio Features
df_fe['rooms_per_household'] = df_fe['total_rooms'] / df_fe['households']
df_fe['bedrooms_per_room'] = df_fe['total_bedrooms'] / df_fe['total_rooms']
df_fe['population_per_household'] = df_fe['population'] / df_fe['households']

print("✅ Ratio features created:")
print("  - rooms_per_household")
print("  - bedrooms_per_room")
print("  - population_per_household")

# 2. Interaction Features
# Named "interaction" because the two columns are multiplied, not divided
df_fe['income_rooms_interaction'] = df_fe['median_income'] * df_fe['rooms_per_household']
df_fe['income_age_interaction'] = df_fe['median_income'] * df_fe['housing_median_age']

print("\n✅ Interaction features created:")
print("  - income_rooms_interaction")
print("  - income_age_interaction")

# 3. Aggregation Features
df_fe['total_rooms_bedrooms'] = df_fe['total_rooms'] + df_fe['total_bedrooms']
df_fe['avg_rooms_bedrooms'] = (df_fe['total_rooms'] + df_fe['total_bedrooms']) / 2

print("\n✅ Aggregation features created:")
print("  - total_rooms_bedrooms")
print("  - avg_rooms_bedrooms")

# 4. Polynomial Features (for specific columns)
df_fe['median_income_squared'] = df_fe['median_income'] ** 2
df_fe['median_income_cubed'] = df_fe['median_income'] ** 3

print("\n✅ Polynomial features created:")
print("  - median_income_squared")
print("  - median_income_cubed")

print(f"\n📊 New dataset shape: {df_fe.shape}")
print(f"Added {df_fe.shape[1] - df.shape[1]} new features!")

### 3️⃣ Binning/Discretization
Converting continuous variables into categorical bins

In [None]:
print("🔧 Creating Binned Features...\n")

# 1. Equal-width binning
df_fe['income_category'] = pd.cut(
    df_fe['median_income'],
    bins=5,
    labels=['Very Low', 'Low', 'Medium', 'High', 'Very High']
)

# 2. Quantile-based binning
df_fe['age_quartile'] = pd.qcut(
    df_fe['housing_median_age'],
    q=4,
    labels=['Q1', 'Q2', 'Q3', 'Q4']
)

# 3. Custom binning
def categorize_rooms(rooms):
    if rooms < 3:
        return 'Small'
    elif rooms < 6:
        return 'Medium'
    else:
        return 'Large'

df_fe['household_size_category'] = df_fe['rooms_per_household'].apply(categorize_rooms)

print("✅ Binned features created:")
print("  - income_category (5 bins)")
print("  - age_quartile (4 quartiles)")
print("  - household_size_category (custom)")

# Visualize binning
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

df_fe['income_category'].value_counts().plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Income Categories', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

df_fe['age_quartile'].value_counts().plot(kind='bar', ax=axes[1], color='lightcoral', edgecolor='black')
axes[1].set_title('Age Quartiles', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Count')

df_fe['household_size_category'].value_counts().plot(kind='bar', ax=axes[2], color='lightgreen', edgecolor='black')
axes[2].set_title('Household Size Categories', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Count')

plt.tight_layout()
plt.show()
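
The custom binning above applies a Python function row by row. A vectorized sketch of the same thresholds with `pd.cut` and explicit edges is shown here; the `rooms` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rooms-per-household values
rooms = pd.Series([1.5, 3.0, 4.2, 6.0, 9.8])

# right=False makes each bin closed on the left: [-inf, 3), [3, 6), [6, inf)
# which matches "rooms < 3", "rooms < 6", and "otherwise"
size_category = pd.cut(
    rooms,
    bins=[-np.inf, 3, 6, np.inf],
    labels=['Small', 'Medium', 'Large'],
    right=False,
)

print(size_category.tolist())
```

One behavioral difference is worth knowing: `pd.cut` leaves missing values as missing, while the `if`/`elif` function falls through to `'Large'` for `NaN` because both comparisons evaluate to `False`.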

### 4️⃣ Date/Time Features
Extracting features from datetime columns

In [None]:
# Create sample datetime data
print("🔧 Creating Date/Time Features...\n")

# Simulate a date column
np.random.seed(42)
df_fe['sale_date'] = pd.date_range(start='2020-01-01', periods=len(df_fe), freq='H')

# Extract datetime features
df_fe['year'] = df_fe['sale_date'].dt.year
df_fe['month'] = df_fe['sale_date'].dt.month
df_fe['day'] = df_fe['sale_date'].dt.day
df_fe['day_of_week'] = df_fe['sale_date'].dt.dayofweek
df_fe['hour'] = df_fe['sale_date'].dt.hour
df_fe['is_weekend'] = (df_fe['day_of_week'] >= 5).astype(int)
df_fe['quarter'] = df_fe['sale_date'].dt.quarter
df_fe['is_month_start'] = df_fe['sale_date'].dt.is_month_start.astype(int)
df_fe['is_month_end'] = df_fe['sale_date'].dt.is_month_end.astype(int)

print("✅ Date/Time features created:")
print("  - year, month, day")
print("  - day_of_week, hour")
print("  - is_weekend, quarter")
print("  - is_month_start, is_month_end")

# Cyclical encoding for periodic features
df_fe['month_sin'] = np.sin(2 * np.pi * df_fe['month'] / 12)
df_fe['month_cos'] = np.cos(2 * np.pi * df_fe['month'] / 12)
df_fe['hour_sin'] = np.sin(2 * np.pi * df_fe['hour'] / 24)
df_fe['hour_cos'] = np.cos(2 * np.pi * df_fe['hour'] / 24)

print("\n✅ Cyclical features created:")
print("  - month_sin, month_cos")
print("  - hour_sin, hour_cos")

print(f"\n📊 Total features now: {df_fe.shape[1]}")
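
To see why the sine/cosine pair is useful, the sketch below checks that December and January end up close together in the encoded space, while June and January end up far apart. The helper name `cyclical_encode` is hypothetical.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    values = np.asarray(values, dtype=float)
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

# January, June, December
months = np.array([1, 6, 12])
m_sin, m_cos = cyclical_encode(months, 12)
points = np.column_stack([m_sin, m_cos])

# Euclidean distances in the encoded space
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_jun_jan = np.linalg.norm(points[1] - points[0])

print(f"Dec-Jan distance: {dist_dec_jan:.3f}")
print(f"Jun-Jan distance: {dist_jun_jan:.3f}")
```

With the raw integer month, December (12) and January (1) are 11 units apart, which misleads distance-based and linear models; on the circle they are neighbors.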

## Part 2: Feature Transformation
### 5️⃣ Scaling Techniques

In [None]:
# Select numerical features for scaling
numerical_features = ['median_income', 'total_rooms', 'population']
X_sample = df[numerical_features].copy()

print("🔧 Demonstrating Different Scaling Techniques...\n")
print("Original data statistics:")
print(X_sample.describe())

# 1. StandardScaler (Z-score normalization)
scaler_standard = StandardScaler()
X_standard = pd.DataFrame(
    scaler_standard.fit_transform(X_sample),
    columns=[f"{col}_standard" for col in numerical_features]
)

# 2. MinMaxScaler (0-1 scaling)
scaler_minmax = MinMaxScaler()
X_minmax = pd.DataFrame(
    scaler_minmax.fit_transform(X_sample),
    columns=[f"{col}_minmax" for col in numerical_features]
)

# 3. RobustScaler (robust to outliers)
scaler_robust = RobustScaler()
X_robust = pd.DataFrame(
    scaler_robust.fit_transform(X_sample),
    columns=[f"{col}_robust" for col in numerical_features]
)

# Visualize scaling effects
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Original
X_sample['median_income'].hist(bins=50, ax=axes[0, 0], color='gray', edgecolor='black')
axes[0, 0].set_title('Original Data', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Median Income')

# StandardScaler
X_standard['median_income_standard'].hist(bins=50, ax=axes[0, 1], color='skyblue', edgecolor='black')
axes[0, 1].set_title('StandardScaler (Mean=0, Std=1)', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Scaled Value')

# MinMaxScaler
X_minmax['median_income_minmax'].hist(bins=50, ax=axes[1, 0], color='lightcoral', edgecolor='black')
axes[1, 0].set_title('MinMaxScaler (Range: 0-1)', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Scaled Value')

# RobustScaler
X_robust['median_income_robust'].hist(bins=50, ax=axes[1, 1], color='lightgreen', edgecolor='black')
axes[1, 1].set_title('RobustScaler (Robust to Outliers)', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('Scaled Value')

plt.tight_layout()
plt.show()

print("\n📊 Scaling Comparison:")
print("\nStandardScaler:")
print(X_standard.describe().loc[['mean', 'std']])
print("\nMinMaxScaler:")
print(X_minmax.describe().loc[['min', 'max']])
print("\nRobustScaler:")
print(X_robust.describe().loc[['50%']])  # Median should be close to 0
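
The cell above fits each scaler on the full dataset, which is fine for visual comparison. When a model is evaluated on held-out data, the scaler is usually fitted on the training split only and then applied to the test split. The following is a minimal sketch on synthetic, hypothetical data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical right-skewed data: 500 rows, 3 columns
X_demo = rng.lognormal(mean=1.0, sigma=0.5, size=(500, 3))

X_train, X_test = train_test_split(X_demo, test_size=0.2, random_state=42)

scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics on the test rows (no refitting)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3))
print(X_test_scaled.mean(axis=0).round(3))
```

The test-set means will generally be close to, but not exactly, zero. That small mismatch is expected and reflects that the test data played no part in fitting.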

### 6️⃣ Handling Skewed Data

In [None]:
print("🔧 Handling Skewed Distributions...\n")

# Check skewness
skewed_feature = 'total_rooms'
original_skew = df[skewed_feature].skew()
print(f"Original skewness of {skewed_feature}: {original_skew:.2f}")

# Different transformation techniques
transformations = {
    'Original': df[skewed_feature],
    'Log Transform': np.log1p(df[skewed_feature]),
    'Square Root': np.sqrt(df[skewed_feature]),
    'Box-Cox': stats.boxcox(df[skewed_feature] + 1)[0],
    'Yeo-Johnson': PowerTransformer(method='yeo-johnson').fit_transform(
        df[[skewed_feature]]
    ).ravel()
}

# Visualize transformations
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, (name, data) in enumerate(transformations.items()):
    axes[idx].hist(data, bins=50, edgecolor='black', alpha=0.7)
    axes[idx].set_title(f'{name}\nSkewness: {pd.Series(data).skew():.2f}', 
                       fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Value')
    axes[idx].set_ylabel('Frequency')
    axes[idx].grid(True, alpha=0.3)

# Remove extra subplot
fig.delaxes(axes[5])

plt.suptitle('Comparison of Transformation Techniques', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

print("\n✅ The best transformation brings skewness closest to 0")
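
A practical detail when using the log transform is that it can be undone exactly, which matters if a transformed column later needs to be reported in original units. The sketch below uses synthetic, hypothetical log-normal data to show the skewness drop and the `log1p`/`expm1` round trip.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed values (log-normal)
values = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# log1p computes log(1 + x), so zeros are handled safely
logged = np.log1p(values)

# expm1 is the exact inverse of log1p
restored = np.expm1(logged)

skew_before = values.skew()
skew_after = logged.skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
print(f"Round trip OK:   {np.allclose(restored, values)}")
```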

### 7️⃣ Encoding Categorical Variables

In [None]:
print("🔧 Encoding Categorical Variables...\n")

# Sample categorical data
if 'ocean_proximity' in df.columns:
    cat_feature = 'ocean_proximity'
    
    print(f"Unique values in {cat_feature}:")
    print(df[cat_feature].value_counts())
    
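    # 1. Label Encoding (integer codes assigned in alphabetical order)
    # The codes imply an order, so this suits ordinal data; ocean_proximity is
    # nominal and is encoded this way only for demonstration.
    # Note: scikit-learn intends LabelEncoder for targets; OrdinalEncoder is the
    # feature-side equivalent.
    le = LabelEncoder()
    df_fe['ocean_proximity_label'] = le.fit_transform(df[cat_feature])
    
    print("\n✅ Label Encoding:")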
    print(dict(zip(le.classes_, le.transform(le.classes_))))
    
    # 2. One-Hot Encoding (for nominal data)
    df_onehot = pd.get_dummies(df[cat_feature], prefix='ocean', drop_first=False)
    df_fe = pd.concat([df_fe, df_onehot], axis=1)
    
    print("\n✅ One-Hot Encoding created columns:")
    print(list(df_onehot.columns))
    
    # 3. Frequency Encoding
    freq_encoding = df[cat_feature].value_counts(normalize=True).to_dict()
    df_fe['ocean_proximity_freq'] = df[cat_feature].map(freq_encoding)
    
    print("\n✅ Frequency Encoding:")
    print(freq_encoding)
    
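    # 4. Target Encoding (mean of target for each category)
    # Note: the means are computed on the full dataset for illustration only.
    # In a real workflow, compute them on training data (or training folds)
    # to avoid leaking the target into the features.
    target_encoding = df.groupby(cat_feature)['median_house_value'].mean().to_dict()
    df_fe['ocean_proximity_target'] = df[cat_feature].map(target_encoding)
    
    print("\n✅ Target Encoding:")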
    print(target_encoding)

print(f"\n📊 Total features after encoding: {df_fe.shape[1]}")
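
Target encoding computed on all rows lets each row see its own target value. One common way to reduce that leakage is out-of-fold encoding, sketched below. The helper `oof_target_encode` and the six-row `demo` frame are hypothetical, and the fold count is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(categories, target, n_splits=5, random_state=42):
    """Encode each row with category means computed on the other folds."""
    original_index = categories.index
    categories = categories.reset_index(drop=True)
    target = target.reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(categories):
        fit_target = target.iloc[fit_idx]
        # Category means from the fitting folds only
        fold_means = fit_target.groupby(categories.iloc[fit_idx]).mean()
        mapped = categories.iloc[enc_idx].map(fold_means)
        # Categories unseen in the fitting folds fall back to the fold mean
        encoded.iloc[enc_idx] = mapped.fillna(fit_target.mean()).to_numpy()

    encoded.index = original_index
    return encoded

# Hypothetical toy data
demo = pd.DataFrame({
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'NEAR BAY', 'INLAND', 'NEAR BAY'],
    'median_house_value': [100.0, 300.0, 120.0, 280.0, 110.0, 310.0],
})
demo['ocean_proximity_target'] = oof_target_encode(
    demo['ocean_proximity'], demo['median_house_value'], n_splits=3
)

print(demo)
```

Each encoded value is built without that row's own target, so a downstream model cannot simply read the answer back from the feature.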

## Part 3: Feature Selection
### 8️⃣ Filter Methods

In [None]:
print("🔧 Feature Selection - Filter Methods...\n")

# Prepare data for feature selection
# Select only numerical features
numerical_cols = df_fe.select_dtypes(include=[np.number]).columns.tolist()
numerical_cols = [col for col in numerical_cols if col not in ['median_house_value', 'sale_date']]

X_fs = df_fe[numerical_cols].fillna(df_fe[numerical_cols].median())
y_fs = df_fe['median_house_value']

# 1. Correlation-based selection
correlations = X_fs.corrwith(y_fs).abs().sort_values(ascending=False)

print("Top 10 features by correlation with target:")
print(correlations.head(10))

# Visualize correlations
plt.figure(figsize=(12, 8))
correlations.head(15).plot(kind='barh', color='skyblue', edgecolor='black')
plt.xlabel('Absolute Correlation', fontsize=12)
plt.title('Top 15 Features by Correlation with Target', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

# 2. Mutual Information
mi_scores = mutual_info_regression(X_fs, y_fs, random_state=42)
mi_scores = pd.Series(mi_scores, index=X_fs.columns).sort_values(ascending=False)

print("\nTop 10 features by Mutual Information:")
print(mi_scores.head(10))

# 3. SelectKBest
selector = SelectKBest(f_regression, k=10)
X_selected = selector.fit_transform(X_fs, y_fs)
selected_features = X_fs.columns[selector.get_support()]

print("\n✅ Top 10 features selected by SelectKBest:")
print(list(selected_features))
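
The filter methods above rank features against the target but do not remove features that duplicate each other. A minimal sketch of a redundancy filter follows. The helper `drop_correlated`, the 0.95 threshold, and the synthetic `demo` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(X, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

rng = np.random.default_rng(0)
a = rng.normal(size=300)

# Hypothetical frame: 'a_copy' is almost a rescaled duplicate of 'a'
demo = pd.DataFrame({
    'a': a,
    'a_copy': 2 * a + 0.01 * rng.normal(size=300),
    'b': rng.normal(size=300),
})

reduced, dropped = drop_correlated(demo, threshold=0.95)

print("Dropped:", dropped)
print("Kept:", list(reduced.columns))
```

This is relevant here because several engineered columns in this notebook (for example the sum and the average of the same two columns) are perfectly correlated by construction.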

### 9️⃣ Wrapper Methods

In [None]:
print("🔧 Feature Selection - Wrapper Methods...\n")

# Use a subset for faster computation
X_subset = X_fs[correlations.head(20).index]

# 1. Recursive Feature Elimination (RFE)
estimator = RandomForestRegressor(n_estimators=50, random_state=42, n_jobs=-1)
rfe = RFE(estimator, n_features_to_select=10)

print("Running RFE... (this may take a minute)")
rfe.fit(X_subset, y_fs)

rfe_features = X_subset.columns[rfe.support_]
rfe_ranking = pd.Series(rfe.ranking_, index=X_subset.columns).sort_values()

print("\n✅ Features selected by RFE:")
print(list(rfe_features))

print("\nFeature Ranking:")
print(rfe_ranking.head(15))

# Visualize RFE ranking
plt.figure(figsize=(12, 6))
rfe_ranking.head(15).plot(kind='barh', color='lightcoral', edgecolor='black')
plt.xlabel('Ranking (1 = Best)', fontsize=12)
plt.title('RFE Feature Ranking', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
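
RFE above needs the number of features to keep (10) chosen up front. scikit-learn's `RFECV` picks that number by cross-validation instead. The sketch below runs it on synthetic data from `make_regression` with a `LinearRegression` estimator, both chosen here only to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Hypothetical data: 10 features, only 3 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=42
)

# Remove one feature per step and score each subset size with 5-fold CV
rfecv = RFECV(
    estimator=LinearRegression(),
    step=1,
    cv=5,
    scoring='r2',
    min_features_to_select=1,
)
rfecv.fit(X_demo, y_demo)

print(f"Number of features chosen: {rfecv.n_features_}")
print(f"Selection mask: {rfecv.support_}")
```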

### 🔟 Embedded Methods

In [None]:
print("🔧 Feature Selection - Embedded Methods...\n")

# 1. Feature Importance from Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_subset, y_fs)

feature_importance = pd.Series(
    rf.feature_importances_,
    index=X_subset.columns
).sort_values(ascending=False)

print("Top 10 features by Random Forest importance:")
print(feature_importance.head(10))

# Visualize feature importance
plt.figure(figsize=(12, 6))
feature_importance.head(15).plot(kind='barh', color='lightgreen', edgecolor='black')
plt.xlabel('Importance Score', fontsize=12)
plt.title('Random Forest Feature Importance', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

# 2. SelectFromModel (automatic threshold)
selector_model = SelectFromModel(rf, threshold='median')
selector_model.fit(X_subset, y_fs)

selected_features_model = X_subset.columns[selector_model.get_support()]

print("\n✅ Features selected by SelectFromModel:")
print(list(selected_features_model))
print(f"\nReduced from {X_subset.shape[1]} to {len(selected_features_model)} features")
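
Lasso is another embedded method: its L1 penalty drives some coefficients to exactly zero, and the surviving features form the selection. The following is a minimal sketch on synthetic data, where `LassoCV` chooses the penalty strength by cross-validation; the dataset and settings are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 10 features, only 3 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# The L1 penalty is scale-sensitive, so standardize first
X_scaled = StandardScaler().fit_transform(X_demo)

# SelectFromModel keeps the features whose Lasso coefficient is non-zero
selector = SelectFromModel(LassoCV(cv=5, random_state=0))
selector.fit(X_scaled, y_demo)

mask = selector.get_support()
X_selected = selector.transform(X_scaled)

print(f"Kept {mask.sum()} of {X_scaled.shape[1]} features")
print(f"Coefficients: {selector.estimator_.coef_.round(2)}")
```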

## Part 4: Impact Analysis
### 1️⃣1️⃣ Compare Model Performance

In [None]:
from sklearn.linear_model import LinearRegression

print("🔧 Comparing Model Performance with Different Feature Sets...\n")

# Prepare different feature sets
feature_sets = {
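    # .ffill() replaces the deprecated fillna(method='ffill')
    'Original Features': df[['longitude', 'latitude', 'housing_median_age', 
                            'total_rooms', 'total_bedrooms', 'population', 
                            'households', 'median_income']].ffill(),
    # X_subset holds the 20 columns most correlated with the target
    'Engineered Features': X_subset,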
    'Selected Features (RFE)': X_subset[rfe_features],
    'Selected Features (RF)': X_subset[selected_features_model]
}

results = {}

for name, X in feature_sets.items():
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y_fs, test_size=0.2, random_state=42
    )
    
    # Train model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {
        'Features': X.shape[1],
        'RMSE': rmse,
        'R²': r2
    }
    
    print(f"{name}:")
    print(f"  Features: {X.shape[1]}")
    print(f"  RMSE: ${rmse:,.2f}")
    print(f"  R²: {r2:.4f}\n")

# Visualize comparison
results_df = pd.DataFrame(results).T

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

results_df['Features'].plot(kind='bar', ax=axes[0], color='skyblue', edgecolor='black')
axes[0].set_title('Number of Features', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=45)

results_df['RMSE'].plot(kind='bar', ax=axes[1], color='lightcoral', edgecolor='black')
axes[1].set_title('RMSE Comparison', fontsize=14, fontweight='bold')
axes[1].set_ylabel('RMSE ($)')
axes[1].tick_params(axis='x', rotation=45)

results_df['R²'].plot(kind='bar', ax=axes[2], color='lightgreen', edgecolor='black')
axes[2].set_title('R² Score Comparison', fontsize=14, fontweight='bold')
axes[2].set_ylabel('R² Score')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()
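
The comparison above uses a single train/test split, and the feature selection was fitted on all rows before splitting. A more conservative sketch wraps scaling, selection, and the model in a `Pipeline` and scores it with cross-validation, so every step is refitted inside each training fold. The synthetic data and `k=5` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 15 features, 5 of which carry signal
X_demo, y_demo = make_regression(
    n_samples=400, n_features=15, n_informative=5, noise=10.0, random_state=42
)

pipe = Pipeline([
    ('scale', StandardScaler()),                 # fitted on each training fold
    ('select', SelectKBest(f_regression, k=5)),  # selection also happens per fold
    ('model', LinearRegression()),
])

# 5-fold cross-validated R^2 scores
scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2')

print(f"R^2 per fold: {scores.round(3)}")
print(f"Mean R^2: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds also shows whether a difference between two feature sets is larger than the split-to-split noise.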

## 📊 Key Takeaways

### Feature Engineering Best Practices:

#### 1. **Feature Creation**:
✅ **Domain Knowledge**: Use your understanding of the problem  
✅ **Ratios**: Often more informative than raw values  
✅ **Interactions**: Capture relationships between features  
✅ **Aggregations**: Summarize information  
✅ **Polynomial**: Capture non-linear relationships  

#### 2. **Feature Transformation**:
✅ **Scaling**: Essential for distance-based and gradient-based algorithms  
✅ **Power Transforms**: Reduce skew (log, square root, Box-Cox, Yeo-Johnson)  
✅ **Encoding**: Convert categorical to numerical  
✅ **Binning**: Can reduce noise, at the cost of some detail  

#### 3. **Feature Selection**:
✅ **Remove Redundant**: Eliminate highly correlated features  
✅ **Remove Irrelevant**: Use statistical tests  
✅ **Reduce Dimensionality**: Can speed up training and improve generalization  
✅ **Prevent Overfitting**: Fewer features = simpler model  

### Common Techniques Summary:

| Technique | When to Use | Pros | Cons |
|-----------|-------------|------|------|
| **StandardScaler** | Most cases | Zero mean, unit variance | Sensitive to outliers |
| **MinMaxScaler** | Bounded range needed | Simple interpretation | Sensitive to outliers |
| **RobustScaler** | Data with outliers | Robust to outliers | Output range is not bounded |
| **Log Transform** | Right-skewed data | Simple, effective | Needs positive values (`log`) or non-negative values (`log1p`) |
| **One-Hot Encoding** | Nominal categories | No ordinal assumption | High dimensionality |
| **Label Encoding** | Ordinal categories | Low dimensionality | Implies order |
| **Target Encoding** | High cardinality | Captures relationship | Risk of overfitting |

### Feature Selection Methods:

| Method | Type | Speed | Typical Selection Quality |
|--------|------|-------|---------------------------|
| **Correlation** | Filter | ⚡ Fast | Good for linear relationships |
| **Mutual Information** | Filter | ⚡ Fast | Also captures non-linear relationships |
| **RFE** | Wrapper | 🐌 Slow | Often the strongest, at higher cost |
| **Random Forest** | Embedded | ⚡ Fast | Good |
| **Lasso** | Embedded | ⚡ Fast | Good for linear models |

### Workflow Recommendation:

1. **Understand the Data**: EDA first!
2. **Handle Missing Values**: Impute or remove
3. **Create Features**: Domain-specific engineering
4. **Transform Features**: Scale and normalize
5. **Select Features**: Remove redundant/irrelevant
6. **Validate**: Check impact on model performance
7. **Iterate**: Continuously improve
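
The steps above can be sketched as a single scikit-learn pipeline, so imputation, scaling, and encoding are learned together with the model. The tiny `toy` frame and its values are hypothetical; the column names mirror the dataset used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical six-row sample with missing values and one categorical column
toy = pd.DataFrame({
    'median_income': [2.5, 4.0, np.nan, 6.1, 3.3, 5.2],
    'total_rooms': [800, 1500, 1200, np.nan, 950, 2000],
    'ocean_proximity': ['INLAND', 'NEAR BAY', 'INLAND', 'NEAR OCEAN', 'INLAND', 'NEAR BAY'],
})
target = pd.Series([120_000, 260_000, 150_000, 340_000, 140_000, 300_000])

numeric = ['median_income', 'total_rooms']
categorical = ['ocean_proximity']

# Numeric columns: impute missing values, then scale
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Route each column group to its own transformation
preprocess = ColumnTransformer([
    ('num', numeric_steps, numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

workflow = Pipeline([
    ('preprocess', preprocess),
    ('model', LinearRegression()),
])

workflow.fit(toy, target)
preds = workflow.predict(toy)

print(preds.round(0))
```

Keeping every step inside one estimator means the same object can be passed to `train_test_split`-based evaluation or cross-validation without refitting the preprocessing on test rows.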

### Common Pitfalls:
❌ **Data Leakage**: Fit scalers, encoders, and selectors on training data only; never use test data for feature engineering  
❌ **Overfitting**: Too many features can hurt generalization  
❌ **Ignoring Domain**: Generic features may not capture patterns  
❌ **Not Validating**: Always check if features improve performance  
❌ **Forgetting Interpretability**: Complex features are hard to explain  

### Next Steps:
1. Apply these techniques to your own datasets
2. Experiment with different combinations
3. Use automated feature engineering tools (Featuretools)
4. Learn advanced techniques (deep feature synthesis)
5. Practice on Kaggle competitions