# 3. Pre-Modeling Steps
Before feeding data into an algorithm, you must prepare the environment and refine the inputs. This involves transforming raw data into meaningful signals, filtering out noise, and setting up a rigorous testing environment.

## 1. Feature Engineering 🏗️
At its core, feature engineering is about transforming raw data into a format that makes it easier for the machine learning algorithm to understand the underlying patterns.

- **Concept**: Brainstorm and create highly predictive new variables from raw data.
- **Example**: Subtracting `Year_Built` from `Current_Year` to create a new `House_Age` feature for predicting house prices. The model no longer has to discover the subtraction on its own; it receives the house's age directly as an input.

In [None]:
import pandas as pd

# 1. Build a small example DataFrame (stands in for loading the raw data)
data = {'House_ID': [1, 2, 3], 
        'Year_Built': [1990, 2005, 2020], 
        'Current_Year': [2026, 2026, 2026]}
df = pd.DataFrame(data)

# 2. Engineer the new feature
df['House_Age'] = df['Current_Year'] - df['Year_Built']

print(df[['House_ID', 'Year_Built', 'House_Age']])

## 2. Feature Selection ✂️
Feeding a model too many irrelevant features introduces noise. The model might accidentally find fake patterns and memorize them (overfitting), which ruins its ability to predict new data. This step removes redundant or irrelevant features to reduce noise and compute time.

### Filter Methods (The Bouncer)
A fast, statistical check done before training. Each feature is scored against the target (e.g., with a correlation coefficient or a Chi-square test), and features that show little or no statistical relationship to the target are dropped immediately.
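
The filter idea can be sketched with a plain correlation check in pandas. This is a minimal illustration, not a recommended default: the column names (`size_sqm`, `random_noise`), the synthetic data, and the `0.3` cutoff are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Hypothetical toy data: one feature that drives the price, one that is pure noise
size = rng.normal(150, 30, n)
houses_df = pd.DataFrame({
    "size_sqm": size,
    "random_noise": rng.normal(0, 1, n),
})
price = pd.Series(2000 * size + rng.normal(0, 20000, n), name="price")

# Score every feature by its absolute Pearson correlation with the target
abs_corr = houses_df.corrwith(price).abs()

# Keep only the features that clear the (assumed) threshold
threshold = 0.3
kept = abs_corr[abs_corr >= threshold].index.tolist()

print(abs_corr)
print(f"Features kept: {kept}")
```

Pearson correlation only captures linear relationships, so a feature with a strong non-linear link to the target could still be dropped by this check.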

### Wrapper Methods (The Tryout)
A thorough, iterative process like **Recursive Feature Elimination (RFE)**. It trains a model with all features, drops the least important one (judged by the model's coefficients or feature importances), and repeats until only the desired number of features remains.
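
To make the loop concrete, here is a hand-rolled sketch of the elimination process. It assumes that the absolute size of a linear coefficient is a fair measure of importance, which is reasonable here only because `make_regression` produces features on the same scale. The names `X_demo`, `y_demo`, and `n_to_keep` are illustrative; in practice scikit-learn's `RFE` class does this for you.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Toy data: 6 features, of which only 2 actually influence the target
X_demo, y_demo = make_regression(
    n_samples=100, n_features=6, n_informative=2, noise=0.1, random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
n_to_keep = 2

while len(remaining) > n_to_keep:
    # Train on the features that are still in play
    lr = LinearRegression().fit(X_demo[:, remaining], y_demo)
    # The feature with the smallest absolute coefficient is the weakest
    weakest = int(np.argmin(np.abs(lr.coef_)))
    dropped = remaining.pop(weakest)
    print(f"Dropped feature {dropped}; remaining: {remaining}")

print(f"Final feature indices: {remaining}")
```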

In [None]:
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

# Generate dummy data: 100 samples, 10 features
X, y = make_regression(n_samples=100, n_features=10, noise=0.1, random_state=42)

# --- Filter Method: SelectKBest ---
# Keep only the 3 features with the highest univariate F-scores against the target
selector = SelectKBest(score_func=f_regression, k=3)
X_filtered = selector.fit_transform(X, y)

# --- Wrapper Method: Recursive Feature Elimination (RFE) ---
model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=3)
X_rfe = rfe.fit_transform(X, y)

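# rfe.support_ is a boolean mask; get_support(indices=True) returns the column indices
print(f"Selected feature indices (RFE): {rfe.get_support(indices=True)}")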

## 3. Dimensionality Reduction (PCA)
Sometimes features are highly related to each other (e.g., "Total Square Footage" and "Number of Rooms" in a house). Instead of dropping them, we compress them.

- **Principal Component Analysis (PCA)**: Mathematically combines correlated features into new, consolidated variables called "Principal Components." Each component is a weighted mix of the original features, so you lose human readability, but you keep most of the underlying variance (the important information) while shrinking the number of columns. PCA is sensitive to feature scale, so features are usually standardized first.
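
As a small illustration of the square-footage and rooms example, the sketch below compresses two strongly correlated columns into a single component. The synthetic numbers and the variable names (`sqft`, `rooms`, `house_size`) are assumptions invented for the demonstration, and standardizing with `StandardScaler` first is a common convention rather than a requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Hypothetical houses: square footage grows with the number of rooms
rooms = rng.integers(2, 9, size=n).astype(float)
sqft = 400 * rooms + rng.normal(0, 150, size=n)
houses = np.column_stack([sqft, rooms])

# Standardize so neither column dominates just because of its units
scaled = StandardScaler().fit_transform(houses)

# Compress the two correlated columns into one "size" component
pca_demo = PCA(n_components=1)
house_size = pca_demo.fit_transform(scaled)

print(f"Shape before: {houses.shape}, after: {house_size.shape}")
print(f"Variance kept by 1 component: {pca_demo.explained_variance_ratio_[0]:.3f}")
```

Because the two columns carry nearly the same information, one component retains almost all of the variance. With weakly correlated features the retained share would be much lower.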

In [None]:
from sklearn.decomposition import PCA

# Compress the 10 original features down to 2 principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print(f"PCA reduced shape: {X_pca.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

## 4. Data Splitting & K-Fold Cross-Validation
Once features are engineered and selected, you should not train on 100% of your data. Hold some of it back so you can test the model on examples it has never seen; otherwise you have no way to detect overfitting (memorizing the data instead of learning patterns).

### Train, Validation, and Test Split
- **Training Set (~70-80%)**: The "textbook". The algorithm uses this data to learn the relationships and mathematical weights.
- **Validation Set (~10-15%)**: The "practice quiz". Used to evaluate the model while you are still tweaking its settings (hyperparameter tuning). 
- **Test Set (~10-15%)**: The "final exam". Locked in a vault until the end. Gives you a final, unbiased estimate of how the model is likely to perform on unseen data in production.
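
One common way to get all three sets is to call `train_test_split` twice: once to carve off the test set, and again to split the remainder into training and validation. The sketch below assumes a 70/15/15 split on 100 placeholder rows; the array names are illustrative, and passing integer sizes is just a way to make the row counts explicit.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features; y doubles as a row id
X_all = np.arange(200).reshape(100, 2)
y_all = np.arange(100)

n_holdout = round(0.15 * len(X_all))  # 15 rows each for validation and test

# Step 1: lock the test set away
X_rest, X_te, y_rest, y_te = train_test_split(
    X_all, y_all, test_size=n_holdout, random_state=42
)

# Step 2: split what is left into training and validation
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=n_holdout, random_state=42
)

print(f"train={len(X_tr)}, validation={len(X_val)}, test={len(X_te)}")
```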

### K-Fold Cross-Validation
A single train/validation split might be "lucky" or "unlucky", depending on which rows happen to land in each set. K-Fold Cross-Validation reduces this risk by dividing the data into "K" roughly equal-sized chunks, called folds (e.g., K=5):
- The model trains on 4 folds, validates on the 1st fold.
- It resets, trains on a different combination of 4 folds, and validates on the 2nd fold.
- This repeats 5 times, so every data point is in the validation set exactly once.
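- Averaging the 5 scores gives a more reliable performance estimate than any single split, and a small spread between the scores suggests the model is stable and performs consistently.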
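
The rotation described above can be seen directly by iterating over `KFold.split`. This sketch uses 10 placeholder rows (the names `X_small` and `validation_counts` are illustrative) and only tracks which rows land in the validation fold, without training a model.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 10 rows, 2 features
X_small = np.arange(20).reshape(10, 2)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each row is used for validation
validation_counts = np.zeros(len(X_small), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X_small), start=1):
    validation_counts[val_idx] += 1
    print(f"Fold {fold}: train on {len(train_idx)} rows, validate on rows {val_idx}")

print(f"Times each row was validated: {validation_counts}")
```

Each row should appear in exactly one validation fold, which is what makes the averaged score use every data point once.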

In [None]:
from sklearn.model_selection import train_test_split, KFold, cross_val_score

# --- 1. Train/Test Split ---
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 2. K-Fold Cross-Validation ---
# Divide the training data into 5 distinct "folds"
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Train and evaluate the model 5 times (each time using a different fold for validation)
# `model` is the LinearRegression from the feature selection cell; its default score is R^2
cv_scores = cross_val_score(model, X_train, y_train, cv=kf)

print(f"Individual Fold Scores: {cv_scores}")
print(f"Average Model Performance (mean R^2): {cv_scores.mean():.4f}")