---

# 📘 Feature Engineering

- Feature engineering is the process of transforming raw data into meaningful features that improve the performance of machine learning models.
- It's one of the most critical steps in data science because the quality of features often matters more than the choice of model.
---

## 1. Introduction

* Feature engineering is one of the most **fascinating aspects** of ML.

* The focus is not only on algorithms, but also on **data preparation**.

* Two broad ways to improve model performance:

  1. **Generating new features** → using external information.
      * Example: for a housing dataset, add location-based info (distance to city center, school rating, crime rate).

  2. **Transforming existing features** → extracting more signal from data you already have.
      * Example: in the Titanic dataset, FamilySize = SibSp + Parch + 1.

---

## 2. What is Feature Engineering?

Definition:
👉 Feature engineering means creating or modifying features from the data so that a machine learning model can learn from it more easily.

  * We don't always collect new data; often we transform or combine existing data.

  * It needs some domain knowledge: the more we understand the problem, the better the features we can design.
---

## 3. Types of Feature Engineering

1. **Feature Pre-processing**

   * Modify existing features to make them more useful.
   * Example: Scaling, Transformation.

2. **Feature Generation**

   * Create new features by combining existing ones to capture hidden patterns.
   * Example: Total_Sqft = Basement_Area + First_Floor_Area + Second_Floor_Area (overall house size).

---

🔹 Common Techniques by Feature Type

  1. Numerical Features

    - Scaling: Standardization, Normalization


  2. Categorical Features

    - Encoding: One-Hot Encoding, Label Encoding

---


## 4. Examples

* **House Pricing Data**

  * House age → HouseAge = CurrentYear - YearBuilt.
  * Rooms per household → AveRoomsPerHousehold = TotalRooms / Households.

* **Titanic Dataset**

  * IsAlone → IsAlone = 1 if FamilySize == 1 else 0 (whether the passenger traveled alone).
  * Title extracted from Name (Mr., Mrs., Miss, Master), which can help estimate a missing Age.
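
The Titanic features listed above can be sketched in pandas as follows. This is a minimal illustration, assuming Kaggle-style column names (`Name`, `SibSp`, `Parch`) and a few hand-typed rows; the regular expression for the title is one possible choice, not the only one.

```python
import pandas as pd

# A few illustrative rows shaped like the Kaggle Titanic data (assumed column names).
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# FamilySize counts siblings/spouses, parents/children, and the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone is 1 when the passenger has no family aboard.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

# Title is the word between the comma and the first period in Name.
df["Title"] = df["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(df[["FamilySize", "IsAlone", "Title"]])
```

A grouped median of Age per Title is one common way to use the extracted title when filling missing ages.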


---

## 5. Numerical Features (Titanic Examples)

### (a) Feature Transformation

* Apply math operations to smooth data:
  * **Examples**: log(x), √x, x², 1/x.

* **Why?**
  * Reduces skewness in features like **Fare** (many low fares, few very high).
  * Makes patterns easier for models to learn.
* ⚠️ *log(0) is undefined → use log(x + C), where C is a small positive constant (C = 1 is a common choice).*
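
A log transform with C = 1 can be sketched with NumPy's `log1p`. The fare values below are made up for illustration; the point is only that a zero fare stays finite and the skewness shrinks.

```python
import numpy as np
import pandas as pd

# Illustrative fares: mostly small values, one very large value, and a zero.
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.55, 71.28, 512.33], name="Fare")

# log1p computes log(1 + x), so a fare of 0 maps to 0 instead of -inf.
fare_log = np.log1p(fare)

print("skew before:", round(fare.skew(), 2))
print("skew after: ", round(fare_log.skew(), 2))
```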

---

### (b) Feature Scaling

* Features have **different ranges** (e.g., Age in years vs Fare in £).
* **Methods**:

1. **Min–Max Scaling**

   $$
   X' = \frac{X - X_{min}}{X_{max} - X_{min}}
   $$

   → Rescales Age and Fare to the **0–1 range**.

2. **Standardization (Z-score)**

   $$
   X' = \frac{X - \mu}{\sigma}
   $$

   → Rescales Age/Fare so that each has mean = 0 and standard deviation = 1.

---

## 6. Categorical Features (Titanic Examples)

### (a) One-Hot Encoding

* Turns categories into binary columns.
* Example: **Embarked = {C, Q, S}** →

  * Embarked\_C, Embarked\_Q, Embarked\_S.

**Limitations:**

1. Loses order (not an issue here since Embarked has no natural order).
2. Adds one column per category, so the column count grows quickly when a feature has many categories.

---

### (b) Label Encoding

* Assigns numbers to categories.
* Example: **Cabin Deck** (extracted from Cabin: A, B, C, D…) →

  * A=1, B=2, C=3, etc.
* Keeps the natural order if one exists; for unordered categories the numbers impose an order that a model may misread.
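
A minimal sketch of the deck example, assuming a `Cabin` column whose first character is the deck letter. The cabin values and the `deck_order` mapping are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical Cabin values; many Titanic passengers have no recorded cabin.
df = pd.DataFrame({"Cabin": ["C85", None, "E46", "B28", "A6"]})

# The deck is the first letter of the cabin code (missing cabins stay missing).
df["Deck"] = df["Cabin"].str[0]

# Explicit mapping so the numeric order follows the deck letters.
deck_order = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
df["Deck_encoded"] = df["Deck"].map(deck_order)

print(df)
```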

---

## 7. Key Takeaways

* **Understand the dataset** → e.g., Age, Fare, Embarked, Cabin.
* Apply transformations to **Fare** (log) and scale **Age/Fare** for fair comparison.
* Encode categorical features:

  * **One-Hot** for Embarked, Sex.
  * **Label** for ordered categories (like Deck).

---

✅ **One-line Summary (Titanic)**:
Feature engineering on Titanic = clean and transform **Age, Fare, Embarked, Cabin, Family size** → helps models predict survival more accurately.

---

# 📘 **Feature Scaling**

---

## ❓ **Key Questions**

1. What is **feature scaling** and why is it important?
2. How do we **bin data** and draw a **histogram** to understand distribution?
3. What are **Standardization** and **Normalization**?
4. How do we calculate **Standardization** and **Normalization** mathematically?
5. When should each scaling method be used?
6. Can we see a **worked example** with a real dataset?

---

## 1️⃣ What is Feature Scaling?

Feature scaling is the process of transforming numerical features to a common scale without distorting the relative differences between values. It's critical because:

* Many machine learning algorithms (like KNN, SVM) are sensitive to the scale of data.
* Features with larger numeric ranges can dominate those with smaller ranges, biasing the model.
* Scaling improves convergence speed in optimization algorithms.

---

## 2️⃣ Binning and Drawing Histograms

### What is Binning?

* Binning groups continuous numerical data into intervals called bins.
* Helps visualize data distribution by showing the frequency of values within bins.

### Example of Binning:

Suppose we have plant counts:

Counts: 7, 11, 13, 16, 18, 22, 24

🌿 Home Plants and Their Counts

| S.No | Plant Name         | Count |
| ---- | ------------------ | ----- |
| 1    | Money Plant        | 7     |
| 2    | Tulsi (Holy Basil) | 11    |
| 3    | Jade Plant         | 13    |
| 4    | Aloe Vera          | 16    |
| 5    | Areca Palm         | 18    |
| 6    | Peace Lily         | 22    |
| 7    | Pothos             | 24    |


We can create bins of width 5:

| Bin Range | Frequency (Values) |
| --------- | ------------------ |
| 5–9       | 1 (7)              |
| 10–14     | 2 (11, 13)         |
| 15–19     | 2 (16, 18)         |
| 20–24     | 2 (22, 24)         |

A histogram can be drawn with these bins on the x-axis and frequency on the y-axis, showing how values are distributed.
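
The same binning can be sketched with `pandas.cut`. The bin edges and labels below are chosen to match the integer ranges in the table; the plotting call is left as a comment because it needs matplotlib.

```python
import pandas as pd

counts = pd.Series([7, 11, 13, 16, 18, 22, 24])

# Left-closed bins [5, 10), [10, 15), [15, 20), [20, 25) cover 5-9, 10-14, 15-19, 20-24.
bins = [5, 10, 15, 20, 25]
labels = ["5-9", "10-14", "15-19", "20-24"]
binned = pd.cut(counts, bins=bins, right=False, labels=labels)

# Frequency of values per bin, in bin order.
freq = binned.value_counts().sort_index()
print(freq)

# counts.plot.hist(bins=bins) would draw the corresponding histogram.
```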

---

## 3️⃣ What are Standardization and Normalization?

* **Standardization (Z-score normalization):**
  Rescales data to have a mean of 0 and a standard deviation of 1.
  Works best when data is roughly bell-shaped; it does not bound values to a fixed range.

* **Normalization (Min-Max scaling):**
  Rescales data to a fixed range, usually \[0, 1].
  Useful when you want to bound the feature values, but sensitive to outliers because it depends on the minimum and maximum.

---

## 4️⃣ Step-by-Step Mathematical Calculation Using Example Dataset

---

### Dataset:

$$
X = \{7, 11, 13, 16, 18, 22, 24\}
$$

---

### Step 1: Calculate Mean (μ)

$$
\mu = \frac{7 + 11 + 13 + 16 + 18 + 22 + 24}{7} = \frac{111}{7} \approx 15.86
$$

---

### Step 2: Calculate Standard Deviation (σ)

| $X$     | $X - \mu$         | $(X - \mu)^2$ |
| ------- | ----------------- | ------------- |
| 7       | 7 - 15.86 = -8.86 | 78.50         |
| 11      | -4.86             | 23.62         |
| 13      | -2.86             | 8.18          |
| 16      | 0.14              | 0.02          |
| 18      | 2.14              | 4.58          |
| 22      | 6.14              | 37.70         |
| 24      | 8.14              | 66.26         |
| **Sum** |                   | **218.86**    |

This is the population standard deviation (dividing by $N = 7$, not $N - 1$):

$$
\sigma = \sqrt{\frac{218.86}{7}} = \sqrt{31.27} \approx 5.59
$$

---

### Step 3: Standardization (Z-score Normalization)

$$
Z = \frac{X - \mu}{\sigma}
$$

Calculate for each value:

| $X$ | Z-score $Z$                      |
| --- | -------------------------------- |
| 7   | $\frac{7 - 15.86}{5.59} = -1.58$ |
| 11  | $-0.87$                          |
| 13  | $-0.51$                          |
| 16  | $0.03$                           |
| 18  | $0.38$                           |
| 22  | $1.10$                           |
| 24  | $1.46$                           |

---

### Step 4: Normalization (Min-Max Scaling)

$$
X' = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

Where $X_{min} = 7$, $X_{max} = 24$:

Calculate range:

$$
24 - 7 = 17
$$

Calculate normalized values:

| $X$ | Normalized $X'$              |
| --- | ---------------------------- |
| 7   | $\frac{7 - 7}{17} = 0.00$    |
| 11  | $\frac{4}{17} \approx 0.24$  |
| 13  | $\frac{6}{17} \approx 0.35$  |
| 16  | $\frac{9}{17} \approx 0.53$  |
| 18  | $\frac{11}{17} \approx 0.65$ |
| 22  | $\frac{15}{17} \approx 0.88$ |
| 24  | $\frac{17}{17} = 1.00$       |

---

## 5️⃣ Final Comparison Table

| $X$ | Standardized $Z$ | Normalized $X'$ |
| --- | ---------------- | --------------- |
| 7   | -1.58            | 0.00            |
| 11  | -0.87            | 0.24            |
| 13  | -0.51            | 0.35            |
| 16  | 0.03             | 0.53            |
| 18  | 0.38             | 0.65            |
| 22  | 1.10             | 0.88            |
| 24  | 1.46             | 1.00            |
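
The hand calculation can be checked with a few lines of NumPy. This is a sketch of the two formulas applied to the example dataset; `x.std()` uses NumPy's default `ddof=0`, i.e. the population standard deviation.

```python
import numpy as np

x = np.array([7, 11, 13, 16, 18, 22, 24], dtype=float)

mu = x.mean()
sigma = x.std()  # ddof=0: divide by N, the population standard deviation

z = (x - mu) / sigma                          # standardization
x_norm = (x - x.min()) / (x.max() - x.min())  # min-max normalization

print("mean:", round(mu, 2), "std:", round(sigma, 2))
print("z-scores:  ", np.round(z, 2))
print("normalized:", np.round(x_norm, 2))
```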

---

## 6️⃣ When to Use Which?

| Method              | When to Use                                                                         | Notes                                                                |
| ------------------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| **Standardization** | Data is roughly normally distributed; algorithms like SVM, KNN, Logistic Regression | Centers data around zero with unit variance; values are not bounded  |
| **Normalization**   | Neural networks, algorithms needing bounded input, or non-normal distributions      | Scales data to a fixed range \[0, 1]; sensitive to outliers          |

---

# 🧠 **Summary**

* **Feature Scaling** is vital for many ML algorithms to perform optimally.
* Start by **visualizing data** with histograms (using binning) to understand distribution.
* **Standardize** if data is approximately normal or if the algorithm expects zero-mean unit variance input.
* **Normalize** if you want to bound values within a specific range or for algorithms like neural nets.
* Always calculate mean, std deviation, min, and max to apply the formulas correctly.



## 5. Example in Python

### New Feature: Feature Creation

`df["salary_per_age"] = df["salary"] / df["age"]`

### Explanation:

📌 What does it do?

This line creates a new feature (column) in your DataFrame called `salary_per_age`, which is calculated by dividing each person's salary by their age.

So, for each row (each person), it answers:

"How much salary does this person earn for each year of their age?"

---

💡 Why is this useful?

This new feature can reveal interesting insights like:

* Relative earning efficiency: is someone earning a high salary despite being young?
* It helps normalize salary by age (e.g., a 25-year-old earning ₹5L may be more impressive than a 50-year-old earning ₹7L).
* It can highlight outliers, such as someone who earns much more or less per year of age than others.
* It is useful in machine learning models as an engineered feature to capture relationships.

---

### 📊 Example:

Let's say your DataFrame (df) looks like this:

| name    | age | salary |
| ------- | --- | ------ |
| Alice   | 25  | 500000 |
| Bob     | 50  | 700000 |
| Charlie | 30  | 600000 |


After running:

`df["salary_per_age"] = df["salary"] / df["age"]`


Your updated DataFrame becomes:

| name    | age | salary | salary\_per\_age |
| ------- | --- | ------ | ---------------- |
| Alice   | 25  | 500000 | 20000.0          |
| Bob     | 50  | 700000 | 14000.0          |
| Charlie | 30  | 600000 | 20000.0          |

---

🧠 Interpretation:

* Alice and Charlie each earn ₹20,000 per year of their age.
* Bob, even though older and earning more in total, earns less per year of his age (₹14,000) → this could indicate lower earning efficiency, or be due to other factors (experience plateau, industry, etc.).

---

⚠️ Edge Cases to Watch:

* Make sure age is not zero: pandas does not raise an error on division by zero but returns `inf`, which can break later steps.
* Handle missing values in either column.

`df = df[df["age"] != 0]` drops rows with a zero age; `df.dropna(subset=["age", "salary"])` drops rows with missing values.
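
One way to handle both edge cases without dropping rows up front is sketched below. The data is illustrative, and replacing 0 with `NaN` before dividing is just one option among several.

```python
import numpy as np
import pandas as pd

# Illustrative data with one zero age and one missing salary.
df = pd.DataFrame({"age": [25, 0, 30], "salary": [500000, 700000, np.nan]})

# Replace 0 with NaN before dividing so the ratio is NaN rather than inf.
df["salary_per_age"] = df["salary"] / df["age"].replace(0, np.nan)

# Keep only rows where the new feature could actually be computed.
clean = df.dropna(subset=["salary_per_age"])
print(clean)
```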


In [1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample data: a small dataset with age, salary, city, and date
df = pd.DataFrame({
    "age": [25, 30, 35, None],  # Age of the person (missing value in the 4th row)
    "salary": [50000, 60000, 70000, 80000],  # Salary of the person
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],  # City of the person
    "date": pd.to_datetime(["2022-01-10", "2022-03-15", "2022-07-20", "2022-10-05"])  # Date of some event
})

# Step 1: Handle missing values
# The 'age' column has a missing value (None). Fill it with the median age of the dataset.
# Assigning the result back avoids the FutureWarning that inplace=True triggers on a selected column.
df["age"] = df["age"].fillna(df["age"].median())

# Step 2: Feature creation
# 2.1 Create a new feature 'salary_per_age' (Salary divided by age) to see how much salary someone earns per year of age.
df["salary_per_age"] = df["salary"] / df["age"]

# 2.2 Extract the year and month from the 'date' column, which can be useful for time-based analysis.
df["year"] = df["date"].dt.year  # Extracts the year part (e.g., 2022)
df["month"] = df["date"].dt.month  # Extracts the month part (e.g., 1 for January, 3 for March)

# Step 3: Encoding categorical variables
# The 'city' column is categorical, so we need to convert it into numerical values.
# We use one-hot encoding, which creates a new binary column for each category (city).
df = pd.get_dummies(df, columns=["city"])

# Step 4: Scaling numerical features
# Numerical features like 'age', 'salary', and 'salary_per_age' can have very different ranges.
# We scale them so that they are on the same scale (mean=0, standard deviation=1).
scaler = StandardScaler()  # Create a scaler object
df[["age", "salary", "salary_per_age"]] = scaler.fit_transform(df[["age", "salary", "salary_per_age"]])
# fit: learns each column's mean and standard deviation
# transform: applies z = (x - mean) / std to every value in that column
# Print the resulting DataFrame
print(df)


        age    salary       date  salary_per_age  year  month  city_Chennai  \
0 -1.414214 -1.341641 2022-01-10       -0.577350  2022      1         False   
1  0.000000 -0.447214 2022-03-15       -0.577350  2022      3         False   
2  1.414214  0.447214 2022-07-20       -0.577350  2022      7         False   
3  0.000000  1.341641 2022-10-05        1.732051  2022     10          True   

   city_Delhi  city_Mumbai  
0        True        False  
1       False         True  
2        True        False  
3       False        False  


Note: calling `df["age"].fillna(df["age"].median(), inplace=True)` raises a FutureWarning in recent pandas versions. The behavior will change in pandas 3.0: this inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

To fill the column reliably, use `df["age"] = df["age"].fillna(df["age"].median())` or `df.fillna({"age": df["age"].median()}, inplace=True)` instead.


What the Code Does:


1. **Handling Missing Values**:

   * The `age` column has a missing value (`None`). Instead of leaving it as `None`, we replace it with the **median** value of the `age` column. This ensures that we don't lose data and avoid errors in subsequent steps.

2. **Feature Creation**:

   * We create a new feature `salary_per_age`, which is simply the **salary** divided by the **age**. This can give us an idea of how much someone earns relative to their age.
   * We also extract the **year** and **month** from the `date` column, which can be useful for any time-based analysis (e.g., trends across months or years).

3. **Encoding Categorical Variables**:

   * The `city` column contains text data (categorical data). Machine learning models cannot directly work with text, so we need to convert it into numerical data.
   * We use **one-hot encoding** with `pd.get_dummies()`. This creates a separate column for each city (e.g., `city_Delhi`, `city_Mumbai`, `city_Chennai`) that marks whether the row belongs to that city. Recent pandas versions store these flags as `True`/`False` (as in the output above); pass `dtype=int` to get `1`/`0`. This is a common technique for handling categorical variables.

4. **Scaling Numerical Features**:

   * Features like `age`, `salary`, and `salary_per_age` may have very different scales. For example, salary can be in the tens of thousands, while age is in the twenties to thirties. If we don't scale them, the model might give more importance to features with larger values.
   * **StandardScaler** standardizes the features, so they all have a mean of 0 and a standard deviation of 1. This makes them comparable and helps many machine learning algorithms perform better.

### **Resulting DataFrame:**

After executing the code, the resulting DataFrame looks like this (the `date` column is omitted here for width, and the one-hot flags are shown as `1`/`0` rather than `True`/`False`):

```
        age    salary  salary_per_age  year  month  city_Chennai  city_Delhi  city_Mumbai
0 -1.414214 -1.341641       -0.577350  2022      1             0           1            0
1  0.000000 -0.447214       -0.577350  2022      3             0           0            1
2  1.414214  0.447214       -0.577350  2022      7             0           1            0
3  0.000000  1.341641        1.732051  2022     10             1           0            0
```

### **Explanation of Output**:

* The **missing age** value is replaced by the **median age** (30). Because 30 is also the mean of the filled column, it scales to exactly 0.
* The **city** columns are transformed into binary columns (`city_Chennai`, `city_Delhi`, and `city_Mumbai`), with `1` indicating the presence of that city for the respective row.
* The **salary\_per\_age** feature is calculated and scaled as well. The first three rows share the same ratio (2000), so they get the same scaled value; the fourth row (80000 / 30 ≈ 2667) stands out.


You **can** use **Label Encoding** for the `city` column instead of **One-Hot Encoding**, but there are important things to consider. Let's explain both methods in simple terms and discuss which one might be better for different situations.

### **Label Encoding (Using Numbers)**

**Label Encoding** is a technique where each unique category is assigned a **numerical value**. For example, in the case of the `city` column, the cities could be encoded as:

* Delhi → 0
* Mumbai → 1
* Chennai → 2

So, after Label Encoding, your `city` column might look like:

```
[0, 1, 0, 2]
```

#### **Why Use Label Encoding?**

* **Simple and efficient**: It's easier to implement because it just assigns a number to each category. This is helpful if you have many categories, and you want to keep the dataset compact (i.e., fewer columns).
* **Works well for ordinal data**: If your categories have a **natural order**, like **"Low" < "Medium" < "High"**, then Label Encoding is perfect. Here, "Low" could be 0, "Medium" could be 1, and "High" could be 2, meaning the number itself carries meaningful information.

#### **Why Not Use Label Encoding for Cities?**

* **No inherent order**: In the case of cities, there is **no natural order** between them. Delhi is not "greater" or "lesser" than Mumbai, and Mumbai is not "greater" or "lesser" than Chennai. These are just names of places with no meaningful ranking.
* **Misleading for models**: When you apply Label Encoding to non-ordinal data like cities, machine learning algorithms might incorrectly assume that one city is somehow "greater" or "less" than another. For example, if you assign Delhi as `0`, Mumbai as `1`, and Chennai as `2`, the model might think that Chennai is "greater" than Mumbai, which doesn't make sense.

### **One-Hot Encoding (Using Binaries)**

**One-Hot Encoding** turns each category into a **new column**, where each column represents a different category, and the value is either `1` or `0`. So, for the `city` column, instead of having one column for `city`, you create three columns:

* `city_Delhi`
* `city_Mumbai`
* `city_Chennai`

For each row, you put a `1` in the column that represents the city of that row and `0` in the other columns. It would look like this:

| age | city\_Delhi | city\_Mumbai | city\_Chennai |
| --- | ----------- | ------------ | ------------- |
| 25  | 1           | 0            | 0             |
| 30  | 0           | 1            | 0             |
| 35  | 1           | 0            | 0             |
| 40  | 0           | 0            | 1             |

#### **Why Use One-Hot Encoding?**

* **No incorrect assumptions**: Since we're creating separate columns for each city, the model doesn't assume any order or "greater than" relationship between them. It just sees them as distinct, equal categories.
* **Clear representation**: One-Hot Encoding ensures that every city is treated equally, with no implied hierarchy, which is important for categorical data like cities.

#### **Why Not Use One-Hot Encoding?**

* **More columns**: One-Hot Encoding can increase the size of the dataset significantly, especially if you have a lot of categories (e.g., if there were 100 cities, you'd create 100 columns!). This can be inefficient and may lead to higher computational costs.
* **Sparsity**: Since most of the values will be `0` (with only one `1` per row), your dataset will become sparse (i.e., filled mostly with zeroes), which can sometimes make processing less efficient.

---

### **Which One is Better?**

#### **When to Use Label Encoding**:

* **For Ordinal Data**: If the categories have a **natural order** (like "Small", "Medium", "Large"), Label Encoding is good. For example, if the data was about the **size of a product**, Label Encoding would make sense because the sizes have an order.
* **For a large number of categories**: If you have **lots of unique categories** (e.g., 100 different cities), Label Encoding is more memory-efficient since it adds just one column instead of many. For unordered categories, though, the artificial order it introduces remains a risk.

#### **When to Use One-Hot Encoding**:

* **For Nominal (Non-Ordered) Data**: Since cities don't have a natural order (Delhi isn't "greater" than Chennai, and Chennai isn't "less" than Mumbai), **One-Hot Encoding** is the right choice for cities.
* **When you want to avoid misleading the model**: One-Hot Encoding makes sure the model doesn't interpret one category as being "greater" or "less" than another.

### **Summary of Differences**:

| Feature                              | **Label Encoding**                      | **One-Hot Encoding**                                   |
| ------------------------------------ | --------------------------------------- | ------------------------------------------------------ |
| **Best for**                         | Ordinal data (like low, medium, high)   | Nominal data (like cities, colors, etc.)               |
| **Increases columns?**               | No, a single column for all categories  | Yes, one new column for each category                  |
| **Risk of model misinterpretation?** | Yes, it may treat categories as ordered | No, it treats categories equally                       |
| **Computational cost**               | Lower (fewer columns)                   | Higher (more columns, especially with many categories) |

---

### **Conclusion**:

* If you're working with things like **cities** or **colors**, where there's no **order**, **One-Hot Encoding** is the better choice.
* If you're dealing with things that **do have an order**, like **low, medium, and high**, then **Label Encoding** is fine.

So, for cities (Delhi, Mumbai, Chennai), **One-Hot Encoding** would be better to avoid confusing the model into thinking one city is "better" or "worse" than another.
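
Both encodings can be sketched side by side in pandas. The `priority` column is a hypothetical ordinal feature added for illustration. An explicit dictionary is used for the ordinal mapping because scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order, which would not give Low < Medium < High.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Chennai"],
    "priority": ["Low", "High", "Medium", "Low"],  # hypothetical ordinal column
})

# Ordinal data: an explicit mapping keeps Low < Medium < High.
priority_order = {"Low": 0, "Medium": 1, "High": 2}
df["priority_encoded"] = df["priority"].map(priority_order)

# Nominal data: one binary column per city, with no implied order.
df = pd.get_dummies(df, columns=["city"], dtype=int)

print(df)
```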


### Explanation of the Code:

```python
df[["age", "salary", "salary_per_age"]] = scaler.fit_transform(df[["age", "salary", "salary_per_age"]])
```

Let's break this down step by step.

---

### **1. Scaling with `StandardScaler`**

The **`StandardScaler`** is a preprocessing tool from `sklearn` that **standardizes** numerical features by scaling them so that they have a **mean of 0** and a **standard deviation of 1**.

Standardization is important because it **brings all the features to the same scale**, which makes the model less sensitive to different ranges of values. For instance, if you have **age** (which is in the range of 20-60 years) and **salary** (which might be in the range of 50,000 - 200,000), they have very different scales. **Standardization** ensures that both features contribute equally during model training.

---

### **2. `scaler.fit_transform()`**

The **`fit_transform()`** method does two things:

* **`fit()`**: Calculates the statistics (mean and standard deviation) of the data that will be used for scaling.
* **`transform()`**: Scales the data using the statistics calculated from `fit()`, applying the transformation.

In this case:

* **`scaler.fit_transform(df[["age", "salary", "salary_per_age"]])`**:

  * **`fit()`** will calculate the mean and standard deviation for the **age**, **salary**, and **salary\_per\_age** columns.
  * **`transform()`** will then apply the scaling to each of these columns using the formula:

    $$
    \text{scaled value} = \frac{\text{original value} - \text{mean}}{\text{standard deviation}}
    $$

---

### **3. Assignment to `df[["age", "salary", "salary_per_age"]]`**

The result of **`scaler.fit_transform()`** is a scaled version of the data (an array). This result is assigned back to the same columns **`age`, `salary`, and `salary_per_age`** in the original `DataFrame` (`df`).

* **`df[["age", "salary", "salary_per_age"]]`**: This refers to the three columns in the `DataFrame` that are being scaled.
* The **scaled values** are then placed back into those columns.

---

### **Step-by-Step Example:**

Let's assume we have the following data:

| age | salary | salary\_per\_age |
| --- | ------ | ---------------- |
| 25  | 50000  | 2000             |
| 30  | 60000  | 2000             |
| 35  | 70000  | 2000             |
| 40  | 80000  | 2000             |

Now, let's apply the scaling using `StandardScaler`:

1. **Calculate the mean and standard deviation for each column** (`StandardScaler` uses the population standard deviation, dividing by $N$):

   * `age`: Mean = 32.5, Standard Deviation ≈ 5.59
   * `salary`: Mean = 65000, Standard Deviation ≈ 11180
   * `salary_per_age`: Mean = 2000, Standard Deviation = 0

2. **Transform the values** using the formula:

   $$
   \text{scaled value} = \frac{\text{original value} - \text{mean}}{\text{standard deviation}}
   $$

   * For **age**, we scale each value like this:

     * For 25: $\frac{25 - 32.5}{5.59} = -1.34$
     * For 30: $\frac{30 - 32.5}{5.59} = -0.45$
     * For 35: $\frac{35 - 32.5}{5.59} = 0.45$
     * For 40: $\frac{40 - 32.5}{5.59} = 1.34$
   * For **salary**, we scale each value like this:

     * For 50,000: $\frac{50000 - 65000}{11180} = -1.34$
     * For 60,000: $\frac{60000 - 65000}{11180} = -0.45$
     * For 70,000: $\frac{70000 - 65000}{11180} = 0.45$
     * For 80,000: $\frac{80000 - 65000}{11180} = 1.34$
   * For **salary\_per\_age**, all the values are identical (2000), so the standard deviation is 0 and the formula would divide by zero. `StandardScaler` handles such constant columns by using a scale of 1, so every value becomes 0.

3. **After Scaling**, the result looks like this:

| age   | salary | salary\_per\_age |
| ----- | ------ | ---------------- |
| -1.34 | -1.34  | 0                |
| -0.45 | -0.45  | 0                |
| 0.45  | 0.45   | 0                |
| 1.34  | 1.34   | 0                |
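
The example can be run directly with scikit-learn to inspect what `fit()` learns and what `transform()` produces. This is a sketch using the four rows above; `mean_` and `scale_` are the fitted attributes of `StandardScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
})
df["salary_per_age"] = df["salary"] / df["age"]  # 2000.0 in every row

scaler = StandardScaler()
scaler.fit(df)                 # learns mean_ and scale_ for each column
scaled = scaler.transform(df)  # applies (x - mean_) / scale_

print("means: ", scaler.mean_)
print("scales:", scaler.scale_)  # the constant column gets a scale of 1.0
print(scaled.round(2))
```

Calling `fit()` and `transform()` separately, as here, is equivalent to `fit_transform()` on the same data.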

---

### **Summary of Key Points**:

* **StandardScaler** makes sure that the features (age, salary, etc.) have similar scales, making the model training more stable and faster.
* **`fit_transform()`**: First, it learns the scaling parameters (mean and standard deviation), and then it transforms the data based on these parameters.
* The scaled features will now have a **mean of 0** and a **standard deviation of 1**, which allows the model to treat each feature equally and prevent one feature from dominating the others.



🔹 1. Scaling Methods in Python

In [2]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Sample data
df = pd.DataFrame({"age": [18, 25, 40, 60], "salary": [20000, 50000, 80000, 120000]})

# Min-Max Scaling (0-1 range)
minmax = MinMaxScaler()
df_minmax = pd.DataFrame(minmax.fit_transform(df), columns=df.columns)

# Standardization (Z-score scaling)
standard = StandardScaler()
df_standard = pd.DataFrame(standard.fit_transform(df), columns=df.columns)


print("Original:\n", df)
print("\nMin-Max:\n", df_minmax)
print("\nStandardized:\n", df_standard)


Original:
    age  salary
0   18   20000
1   25   50000
2   40   80000
3   60  120000

Min-Max:
         age  salary
0  0.000000     0.0
1  0.166667     0.3
2  0.523810     0.6
3  1.000000     1.0

Standardized:
         age    salary
0 -1.102532 -1.283901
1 -0.667731 -0.473016
2  0.263987  0.337869
3  1.506277  1.419048


🔹 What is One-Hot Encoding?

Definition: Converts categorical variables into binary (0/1) columns.

Each category becomes a new column → 1 if the row belongs to that category, else 0.

Prevents models from assuming an ordinal relationship where none exists.

🔹 Example

Suppose you have a column City:

| ID | City    |
| -- | ------- |
| 1  | Delhi   |
| 2  | Mumbai  |
| 3  | Chennai |
| 4  | Delhi   |

➡️ After one-hot encoding:

| ID | City\_Chennai | City\_Delhi | City\_Mumbai |
| -- | ------------- | ----------- | ------------ |
| 1  | 0             | 1           | 0            |
| 2  | 0             | 0           | 1            |
| 3  | 1             | 0           | 0            |
| 4  | 0             | 1           | 0            |


🔹 Methods in Python

✅ Using pandas.get_dummies()

In [3]:
import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# One-hot encode
df_ohe = pd.get_dummies(df, columns=["City"], prefix="City")
print(df_ohe)


   City_Chennai  City_Delhi  City_Mumbai
0         False        True        False
1         False       False         True
2          True       False        False
3         False        True        False
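
An alternative is scikit-learn's `OneHotEncoder`, sketched below on the same data. Unlike `get_dummies()`, the fitted encoder remembers its categories and can be reused on new rows; the `new` DataFrame and the city "Kolkata" are made up to show what happens with an unseen category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Chennai", "Delhi"]})

# handle_unknown="ignore" encodes categories not seen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["City"]]).toarray()  # default output is sparse

# Build column names from the categories learned during fit.
columns = [f"City_{c}" for c in encoder.categories_[0]]
df_ohe = pd.DataFrame(encoded.astype(int), columns=columns)
print(df_ohe)

# Reuse the fitted encoder on new data, including a city it has never seen.
new = pd.DataFrame({"City": ["Mumbai", "Kolkata"]})
print(encoder.transform(new).toarray())
```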


---

# 🚢 Feature Scaling in the Titanic Dataset

### **1. What is Feature Scaling?**

* **Definition:**
  Feature scaling is the process of adjusting the values of numerical features so they lie within a common scale.

* **Important Point:**

  * Scaling is applied **column-wise** (feature-wise), not row-wise.
  * Example: Scaling all values of **Age** or **Fare**, but not across different passengers.

---

### **2. Why is Feature Scaling Important?**

* **Reason:**
  Machine learning algorithms that rely on **distance** (e.g., K-NN, K-means) or **gradient descent** (e.g., logistic regression, SVM) are sensitive to differences in feature scale.

  * **Example from Titanic:**

    * **Fare** ranges from **0 to 512.33**.
    * **Age** ranges roughly from **0.42 to 80**.
    * If we don't scale, the model may give much more weight to **Fare** than **Age**, even though both are important.

* **Solution:**
  Scaling ensures that all features contribute equally, improving fairness and accuracy.

---

### **3. Types of Feature Scaling**

#### **a. Normalization (Min-Max Scaling)**

* **Formula:**

  $$
  x' = \frac{x - \min(X)}{\max(X) - \min(X)}
  $$

* **Effect:** Brings values into the range **\[0, 1]**.

* **Titanic Example:**

  * Passenger with **Fare = 7.25** → scaled to near **0.01**.
  * Passenger with **Fare = 512.33** → scaled to **1.00**.

---

#### **b. Standardization (Z-score Scaling)**

* **Formula:**

  $$
  x' = \frac{x - \mu}{\sigma}
  $$

  where $\mu$ = mean, $\sigma$ = standard deviation.

* **Effect:** Creates a distribution with mean **0** and standard deviation **1**.

* **Titanic Example:**

  * If **mean Age = 30** and **σ = 14**, then

    * Passenger with **Age = 22** → standardized value = -0.57.
    * Passenger with **Age = 50** → standardized value = +1.43.

---

### **4. The Problem with Unscaled Titanic Features**

* **Issue:**

  * **Fare differences** (hundreds of pounds) overshadow **Age differences** (decades).
* **Result:**
  Without scaling, clustering or distance-based models may wrongly classify based on **Fare**, ignoring **Age**.
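
A small numeric sketch makes the problem concrete. The three passengers below are illustrative: B differs from A only in Age, C differs from A only in Fare. On raw values the fare gap dominates the Euclidean distance; after column-wise min-max scaling both gaps count equally.

```python
import numpy as np

# Illustrative passengers; columns are [Fare, Age].
X = np.array([
    [7.25, 22.0],   # A
    [7.25, 54.0],   # B: same fare as A, 32 years older
    [71.28, 22.0],  # C: same age as A, higher fare
])

def dist(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(u - v))

print("raw    A-B:", round(dist(X[0], X[1]), 2), " A-C:", round(dist(X[0], X[2]), 2))

# Min-max scale each column with its own min and max, then compare again.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print("scaled A-B:", round(dist(X_scaled[0], X_scaled[1]), 2),
      " A-C:", round(dist(X_scaled[0], X_scaled[2]), 2))
```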

---

### **5. Applying Normalization: Step-by-Step**

#### Example Subset:

| Passenger | Fare   | Age |
| --------- | ------ | --- |
| A         | 7.25   | 22  |
| B         | 71.28  | 38  |
| C         | 512.33 | 54  |

#### After Normalization (0‚Äì1 range):

| Passenger | Fare (scaled) | Age (scaled) |
| --------- | ------------- | ------------ |
| A         | 0.00          | 0.00         |
| B         | 0.13          | 0.50         |
| C         | 1.00          | 1.00         |

✅ Now both **Fare** and **Age** are on the same scale.
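
The three-passenger subset can be normalized with scikit-learn's `MinMaxScaler`, as in this minimal sketch. The variable names are assumptions; note that the minimum and maximum come from the subset itself, not from the full dataset.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# The example subset, indexed by passenger label.
subset = pd.DataFrame(
    {"Fare": [7.25, 71.28, 512.33], "Age": [22, 38, 54]},
    index=["A", "B", "C"],
)

# fit_transform learns each column's min and max, then rescales to [0, 1].
scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(subset),
    columns=subset.columns,
    index=subset.index,
)
print(scaled.round(2))
```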

---

### **6. When to Use Which Method?**

* **Normalization:**

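  * Best when a bounded output range such as [0, 1] is needed and the feature has no extreme outliers. Fare is only a partial fit: its range is bounded in this dataset, but it is heavily skewed, so most scaled values sit near 0.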
  * Useful in **neural networks, K-means, K-NN**.

* **Standardization:**

  * Best when features are roughly **normally distributed** (like Age) or have no natural bounds.
  * Useful in **logistic regression, SVM, PCA**.
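
One practical difference between the two methods is how they react to a single extreme value. The sketch below applies both to a short, skewed list of fares; the values and helper names are assumptions for illustration only.

```python
from statistics import fmean, pstdev

# A few small fares plus one very large one (illustrative values).
fares = [7.25, 7.925, 8.05, 53.1, 71.2833, 512.3292]


def min_max_scale(values):
    """Rescale values to [0, 1] using their own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


fares_minmax = min_max_scale(fares)
fares_z = standardize(fares)

print("min-max:", [round(v, 3) for v in fares_minmax])
print("z-score:", [round(v, 3) for v in fares_z])
```

With min-max scaling, the single large fare pins the top of the range and every other value is squeezed close to 0. Standardization does not bound the output, so the large fare simply appears as a value a couple of standard deviations above the mean.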

---

### **7. Key Takeaways**

* Feature scaling is essential in Titanic because **Fare** and **Age** are on very different scales.
* Both **Normalization** and **Standardization** prevent models from being biased toward high-range features.
* Always apply scaling before **clustering, distance-based models, or gradient-descent-based models**.

---

### **8. Summary**

* **Feature Scaling** can improve model accuracy by putting all features on a comparable scale.
* **Normalization** and **Standardization** are common techniques for scaling features.
* **Normalization** is useful for algorithms that require data in a fixed range, while **Standardization** is helpful for data that is normally distributed.

---



## Titanic

* Practical examples using the **Titanic dataset**

  * **Data encoding**
  * **Feature scaling**

---

# 🧠 Data Preprocessing (with Titanic Dataset)

---

## 🎯 Goal:

To understand and implement **essential preprocessing techniques** before feeding data into a machine learning model, using the **Titanic dataset** as an example.

---

In [4]:
# Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder


# Dataset: Titanic (Survival Prediction)
# You can load it via seaborn for simplicity:
titanic = sns.load_dataset('titanic')
print(titanic.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [5]:
#Data Encoding

#Why?
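# Most machine learning algorithms cannot handle strings directly, so categorical data must be converted to a numeric format.

# Label Encoding (fine for a binary feature such as sex; LabelEncoder assigns integers in alphabetical order, so truly ordinal features need an explicit category order)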
le = LabelEncoder()
titanic['sex_encoded'] = le.fit_transform(titanic['sex'])  # male: 1, female: 0
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_encoded
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,1
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,0
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,0
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,1
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,0
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,0
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1
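
As a hedged aside on the encoding above: `LabelEncoder` assigns integers in sorted (alphabetical) order, which is why `female` becomes 0 and `male` becomes 1. For a feature with a real order, such as passenger class, one option is `OrdinalEncoder` with an explicit category list. The small DataFrame below is made up for illustration and stands in for the Titanic columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Tiny stand-in for the Titanic 'sex' and 'class' columns (illustrative data).
df = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "class": ["Third", "First", "Third", "Second"],
})

# LabelEncoder sorts the unique labels, then numbers them 0, 1, ...
le = LabelEncoder()
df["sex_encoded"] = le.fit_transform(df["sex"])
mapping = {label: int(code) for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)  # {'female': 0, 'male': 1}

# OrdinalEncoder with an explicit order: First < Second < Third.
oe = OrdinalEncoder(categories=[["First", "Second", "Third"]])
df["class_encoded"] = oe.fit_transform(df[["class"]]).ravel()
print(df)
```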


In [6]:
# One-Hot Encoding (For nominal categorical features)

encoded = pd.get_dummies(titanic[['embarked', 'class']])
titanic_encoded = pd.concat([titanic, encoded], axis=1)

# Note: one-hot encoding increases dimensionality, so keep the curse of dimensionality in mind.
# encoded
titanic_encoded

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,...,embark_town,alive,alone,sex_encoded,embarked_C,embarked_Q,embarked_S,class_First,class_Second,class_Third
0,0,3,male,22.0,1,0,7.2500,S,Third,man,...,Southampton,no,False,1,False,False,True,False,False,True
1,1,1,female,38.0,1,0,71.2833,C,First,woman,...,Cherbourg,yes,False,0,True,False,False,True,False,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,...,Southampton,yes,True,0,False,False,True,False,False,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,...,Southampton,yes,False,0,False,False,True,True,False,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,...,Southampton,no,True,1,False,False,True,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,...,Southampton,no,True,1,False,False,True,False,True,False
887,1,1,female,19.0,0,0,30.0000,S,First,woman,...,Southampton,yes,True,0,False,False,True,True,False,False
888,0,3,female,,1,2,23.4500,S,Third,woman,...,Southampton,no,False,0,False,False,True,False,False,True
889,1,1,male,26.0,0,0,30.0000,C,First,man,...,Cherbourg,yes,True,1,True,False,False,True,False,False
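
The dimensionality note can be illustrated on a tiny made-up column. One common way to trim the extra columns is `drop_first=True`, which drops one dummy per feature; whether that is appropriate depends on the model, so treat this as a sketch rather than a recommendation.

```python
import pandas as pd

# Tiny stand-in for the 'embarked' column (illustrative data).
df = pd.DataFrame({"embarked": ["S", "C", "S", "Q"]})

# One column per category: 3 categories -> 3 new columns.
full = pd.get_dummies(df, columns=["embarked"], dtype=int)

# drop_first=True removes the first category's column: 3 categories -> 2 columns.
reduced = pd.get_dummies(df, columns=["embarked"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

The dropped category (`C` here) is still recoverable: it corresponds to rows where all remaining dummy columns are 0.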


In [7]:
# Feature Scaling (Revisited)

# Once you've selected and encoded your features, apply scaling before feeding them to algorithms like KNN, SVM, or neural networks.

#Standardization (Z-score)

scaler = StandardScaler()
titanic['age_scaled'] = scaler.fit_transform(titanic[['age']])

#Normalization (Min-Max)

minmax = MinMaxScaler()
titanic['fare_normalized'] = minmax.fit_transform(titanic[['fare']])


In [8]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,sex_encoded,age_scaled,fare_normalized
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False,1,-0.530377,0.014151
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,0,0.571831,0.139136
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True,0,-0.254825,0.015469
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False,0,0.365167,0.103644
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True,1,0.365167,0.015713
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True,1,-0.185937,0.025374
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True,0,-0.737041,0.058556
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False,0,,0.045771
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True,1,-0.254825,0.058556
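
Row 888 in the output shows a missing `age` staying missing after scaling. The sketch below reproduces that behavior on a small made-up frame and checks the properties each scaler is expected to give: `StandardScaler` ignores NaN when fitting and passes it through on transform, and `MinMaxScaler` maps the column minimum to 0 and the maximum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small stand-in for the Titanic 'age' and 'fare' columns (illustrative data).
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.2833, 23.45, 53.1],
})

# .ravel() flattens the (n, 1) result so it fits a single column.
df["age_scaled"] = StandardScaler().fit_transform(df[["age"]]).ravel()
df["fare_normalized"] = MinMaxScaler().fit_transform(df[["fare"]]).ravel()

print(df)
print("mean of age_scaled (ignoring NaN):", np.nanmean(df["age_scaled"]))
print("std of age_scaled (ignoring NaN): ", np.nanstd(df["age_scaled"]))
```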


In [9]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
import seaborn as sns

# Load Titanic dataset from seaborn
titanic = sns.load_dataset('titanic')

# Check available columns
print(titanic.columns)

# Select only the required columns (column names are case-sensitive)
required_cols = ['survived', 'pclass', 'sex', 'age', 'fare', 'embarked']

# Drop rows with missing values; .copy() avoids SettingWithCopyWarning on the assignments below
titanic = titanic[required_cols].dropna().copy()

# Encode categorical features
titanic['sex'] = LabelEncoder().fit_transform(titanic['sex'])
titanic = pd.get_dummies(titanic, columns=['embarked'])

# Scale numerical features
scaler = StandardScaler()
titanic[['age', 'fare']] = scaler.fit_transform(titanic[['age', 'fare']])

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')


In [10]:
titanic

Unnamed: 0,survived,pclass,sex,age,fare,embarked_C,embarked_Q,embarked_S
0,0,3,1,-0.527669,-0.516380,False,False,True
1,1,1,0,0.577094,0.694046,True,False,False
2,1,3,0,-0.251478,-0.503620,False,False,True
3,1,1,0,0.369951,0.350326,False,False,True
4,0,3,1,0.369951,-0.501257,False,False,True
...,...,...,...,...,...,...,...,...
885,0,3,0,0.646142,-0.102875,False,True,False
886,0,2,1,-0.182430,-0.407687,False,False,True
887,1,1,0,-0.734812,-0.086335,False,False,True
889,1,1,1,-0.251478,-0.086335,True,False,False


## ✅ Summary Table of Techniques

| Technique       | Purpose                             | Library / Function Used            |
| --------------- | ----------------------------------- | ---------------------------------- |
| Data Encoding   | Convert categorical text to numbers | `LabelEncoder`, `pd.get_dummies()` |
| Feature Scaling | Put features on a comparable scale  | `StandardScaler`, `MinMaxScaler`   |

---