# Intro to ML: Preprocessing Data

## When and Why Preprocess?
Machine learning models work best when data is clean, consistent, and comparable. **Preprocessing** means preparing raw data so a model can learn from it: cleaning up errors and gaps, converting everything into numbers, and putting features on comparable scales.

Think about it like baking cookies—if your ingredients aren’t measured in the same units (cups vs. grams), the result might be… weird.

**Reasons to preprocess:**
- Fix messy or missing data.
- Put features on similar scales so no one feature dominates.
- Convert categories (like colors or fruit names) into numbers so computers can read them.
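
As a small illustration of the first reason, here is one simple way to fill in missing values: replace each gap with the median of the values you do have. This is only a sketch. The `sugar_grams` list is made up, and median filling is just one of several reasonable strategies.

```python
from statistics import median

# Hypothetical sugar measurements; None marks a missing value.
sugar_grams = [200, 150, None, 180, None, 170]

# Compute the median from the values that are present.
known = [g for g in sugar_grams if g is not None]
fill_value = median(known)

# Replace each missing entry with that median.
sugar_filled = [fill_value if g is None else g for g in sugar_grams]

print(sugar_filled)
```

The median is often preferred over the mean here because one unusually sugary recipe won't drag it upward.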

### 🍪 Example:
You’re building a model to predict which cookie recipe people like best. Your dataset includes:
- Sugar (grams)
- Baking time (minutes)
- Cookie type (chocolate chip, oatmeal, sugar)
- Rating (1–5 stars)

Before training your model, you’d want to:
1. Make sure sugar and baking time are in consistent units.
2. Convert cookie type into numbers.
3. Possibly scale features so that grams and minutes don’t overpower each other.
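
Step 1 can be sketched in a few lines. Assume, purely for illustration, that some baking times were logged in hours and others in minutes. The `raw_times` records and the `to_minutes` helper are hypothetical names invented for this example.

```python
# Hypothetical raw records: (value, unit) pairs logged in mixed units.
raw_times = [(12, "min"), (0.25, "h"), (10, "min")]

# How many minutes one of each unit is worth.
MINUTES_PER_UNIT = {"min": 1, "h": 60}

def to_minutes(value, unit):
    """Convert a baking time to minutes."""
    return value * MINUTES_PER_UNIT[unit]

baking_minutes = [to_minutes(value, unit) for value, unit in raw_times]

print(baking_minutes)
```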

### 💡 Question 1:
Why might a model get confused if 'baking time' is measured in minutes but 'sugar' is in teaspoons?

## Scaling
Scaling ensures that features measured in different units don’t unfairly influence the model.

**Example:**
Imagine a model comparing two features:
- Height (in centimeters, ~100–200)
- Age (in years, ~1–100)

Without scaling, some models (for example, ones that measure distances between data points) might treat height as more important, simply because its numbers are larger.

**Two common types of scaling:**
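- **Standard scaling**: rescales each feature to have a mean of 0 and a standard deviation of 1. For roughly bell-shaped data, about two-thirds of the values then fall between -1 and 1.
- **Min-max scaling**: rescales each feature so its smallest value becomes 0 and its largest becomes 1.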
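
Both formulas are short enough to write by hand. The sketch below uses a made-up list of heights and two helper functions (`standard_scale` and `min_max_scale`, names chosen for this example) to show what each type of scaling does to the same numbers.

```python
from statistics import mean, pstdev

# Made-up heights in centimeters.
heights_cm = [150, 160, 170, 180, 190]

def standard_scale(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

standardized = standard_scale(heights_cm)
normalized = min_max_scale(heights_cm)

print([round(z, 2) for z in standardized])
print(normalized)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Notice that the standardized values are not capped: the shortest and tallest heights land a bit beyond -1 and 1. In practice you would usually reach for a library instead of writing these by hand. With their default settings, scikit-learn's `StandardScaler` and `MinMaxScaler` apply these same formulas to each column.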

### 🏀 Fun Example:
If you’re analyzing NBA player stats (height in cm, points per game, salary in dollars), you’d want to scale them before predicting something like “All-Star potential.” Otherwise, the model might think money = talent.
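
To see the problem in numbers, here is a sketch with three invented players (the stats are made up for illustration, not real NBA data). It measures how "similar" players are using straight-line distance, first on the raw numbers and then after min-max scaling each column. Min-max scaling is used here only because it is short to write.

```python
from math import dist

# Invented stats: (height in cm, points per game, salary in dollars).
players = {
    "A": (198, 25.0, 30_000_000),
    "B": (201, 24.0, 2_000_000),   # plays like A, paid far less
    "C": (183, 8.0, 29_000_000),   # paid like A, plays very differently
}

def min_max_columns(rows):
    """Min-max scale each column of a list of equal-length tuples."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

raw = list(players.values())
scaled = min_max_columns(raw)

# Distances from player A to players B and C, before and after scaling.
raw_ab, raw_ac = dist(raw[0], raw[1]), dist(raw[0], raw[2])
scaled_ab, scaled_ac = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print("Raw:    A is closer to", "B" if raw_ab < raw_ac else "C")
print("Scaled: A is closer to", "B" if scaled_ab < scaled_ac else "C")
```

On the raw numbers, salary swamps everything, so A looks most like C. After scaling, height and scoring get a say, and A looks most like B.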

### 💡 Question 2:
Which would you use in this case, standard scaling (`StandardScaler` in scikit-learn) or min-max scaling (`MinMaxScaler`), and why?

## Encoding
Most machine learning models can’t work with words directly; they need numbers. **Encoding** is how we convert categories into numeric form.

**Two common encoding types:**
- **One-Hot Encoding**: Gives each category its own column of 0s and 1s (useful for non-ordered categories).
  - Example: 'cookie type' → chocolate chip = [1,0,0], oatmeal = [0,1,0], sugar = [0,0,1].
- **Ordinal Encoding**: Assigns numeric ranks to ordered categories.
  - Example: 'spiciness' → low = 1, medium = 2, high = 3.

**Analogy:** If One-Hot Encoding is like making separate scoreboards for each team, Ordinal Encoding is like ranking teams 1st, 2nd, and 3rd.
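
Both encodings can be sketched in plain Python. The lists below are made up to match the examples above, and the column order for one-hot encoding is simply alphabetical, which is an arbitrary choice made for this sketch.

```python
# Made-up data matching the examples above.
cookie_types = ["chocolate chip", "oatmeal", "sugar", "oatmeal"]
spiciness = ["low", "high", "medium", "low"]

# One-hot: one 0/1 column per category, in a fixed (alphabetical) order.
categories = sorted(set(cookie_types))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cookie_types]

# Ordinal: the order carries meaning, so we spell out the ranks ourselves.
rank = {"low": 1, "medium": 2, "high": 3}
ordinal = [rank[s] for s in spiciness]

print(categories)  # ['chocolate chip', 'oatmeal', 'sugar']
print(one_hot)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
print(ordinal)     # [1, 3, 2, 1]
```

Libraries offer ready-made versions of both, such as scikit-learn's `OneHotEncoder` and `OrdinalEncoder`.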

### 🍦 Example:
You’re training an ice cream shop model to predict flavor popularity. The dataset includes flavor (vanilla, chocolate, strawberry) and average weekly sales.
You’d need to encode these flavors as numbers before training.

### 💡 Question 3:
Why might 'vanilla=1, chocolate=2, strawberry=3' be misleading for a model?

## Principal Component Analysis (PCA)
PCA is a way to **simplify complex data** by combining the original features into a smaller number of new ones (called principal components), while keeping most of the important information.

**Imagine:** You have a 10-feature dataset of cookies (sugar, flour, butter, etc.). PCA can help compress those into 2–3 features that still describe most of the variation in recipes.

This makes it easier to visualize patterns and can speed up learning.
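
Here is a minimal sketch of the idea using NumPy, assuming a tiny made-up dataset with just three features (sugar, flour, butter) instead of ten. It centers the data, finds the principal components with a singular value decomposition, and keeps the first two.

```python
import numpy as np

# Made-up recipes: columns are sugar, flour, butter (all in grams).
X = np.array([
    [100.0, 200.0, 50.0],
    [120.0, 240.0, 60.0],
    [140.0, 280.0, 70.0],
    [160.0, 320.0, 80.0],
    [110.0, 230.0, 52.0],
    [150.0, 290.0, 78.0],
])

# 1. Center each feature so its mean is 0.
X_centered = X - X.mean(axis=0)

# 2. SVD: the rows of Vt are the principal directions,
#    ordered from most to least variation captured.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Share of the total variation captured by each component.
explained = S**2 / np.sum(S**2)

# 4. Project the data onto the first 2 components.
X_reduced = X_centered @ Vt[:2].T

print(X_reduced.shape)  # (6, 2)
print(explained.round(3))
```

In this toy data the three ingredients rise and fall together, so the first component alone captures nearly all of the variation. Real datasets are rarely this tidy. It is also common to scale features before PCA so that no single unit dominates, and scikit-learn's `PCA` class wraps these steps if you prefer a library.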

### 🎨 Fun Example:
Suppose you have 1,000 photos of sneakers with features like color intensity, texture, and brightness. PCA can reduce all that down to just two “summary” features that let you plot the shoes in 2D—red vs. blue shoes might form distinct clusters.

**However:** Reducing too much can mean losing key details (like brand logos or unique designs).

### 💡 Discussion Prompt:
If you reduced all the features of those 1,000 sneaker photos to just 2, what might the model lose in its ability to classify shoes?

## Key Takeaways
- **Preprocessing** cleans your data and makes it consistent before training.
- **Scaling** ensures features measured in different units don’t overpower one another.
- **Encoding** lets computers understand categorical data.
- **PCA** reduces complexity while keeping the main patterns.

**In short:** Before teaching a computer to learn, make sure the data speaks the same language!