
---

## **1. What is Scaling in Machine Learning?**

**Definition:**
Scaling is a **data preprocessing step** where numerical features are transformed so they share a common range or distribution.
The idea is to adjust the **magnitude** of different features so that they are **comparable** and don’t disproportionately influence model training.

**Role in Preprocessing:**

* Puts features on a **comparable** footing so that no feature dominates purely because of its units.
* Helps gradient-based models converge faster and often improves the performance of scale-sensitive models.
* Prevents bias towards features with larger numerical ranges.

Example:
If your dataset has:

* Feature A: Income in the range `₹20,000 – ₹200,000`
* Feature B: Age in the range `18 – 60`

Without scaling, Income might dominate the training simply because it has bigger numbers.
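To make the dominance concrete, here is a minimal sketch using only the Python standard library. The three people and the helper names (`min_max`, `scale`) are hypothetical; the ranges are the ones from the example above.

```python
import math

# Hypothetical people as (income in rupees, age in years).
anchor = (50_000, 25)
similar_income = (52_000, 60)   # close in income, 35 years older
similar_age = (80_000, 26)      # close in age, much higher income

def min_max(value, lo, hi):
    """Map value from [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

def scale(person):
    income, age = person
    # Ranges from the example: income 20,000-200,000 and age 18-60.
    return (min_max(income, 20_000, 200_000), min_max(age, 18, 60))

raw_to_similar_income = math.dist(anchor, similar_income)
raw_to_similar_age = math.dist(anchor, similar_age)
scaled_to_similar_income = math.dist(scale(anchor), scale(similar_income))
scaled_to_similar_age = math.dist(scale(anchor), scale(similar_age))

# On raw values the 35-year age gap barely registers next to income.
print(raw_to_similar_income < raw_to_similar_age)        # True
# After scaling, the person of similar age is the nearer one.
print(scaled_to_similar_income > scaled_to_similar_age)  # True
```

On the raw values, Euclidean distance is driven almost entirely by the income difference; after scaling, both features influence which point counts as "near".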

---

## **2. Why Scaling is Important**

### **a) Improves Model Performance**

Some algorithms **are sensitive to feature magnitude**.

* **Sensitive models:**

  * Support Vector Machines (SVM)
  * K-Nearest Neighbors (KNN)
  * K-Means Clustering
  * Principal Component Analysis (PCA)
  * Logistic Regression & Linear Regression (when regularization is used)

These models compute distances between points (KNN, K-Means, kernel SVM), measure variance (PCA), or penalize coefficient size (regularized regression), so features with large values can skew the result.

---

### **b) Speeds Up Gradient Descent**

* Gradient descent works by updating weights step-by-step.
* If features are on very different scales, convergence takes longer because the optimization path zigzags instead of moving smoothly.
* Scaling → **faster convergence**.
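The effect on convergence can be sketched with a tiny batch gradient descent for least squares, written in plain Python. Everything here is illustrative: the four data points, the helper names, the step-size rule (1 / trace of the Hessian, a conservative choice), and the 50,000-step cap are assumptions of this sketch, not a recommended training setup.

```python
from statistics import fmean, pstdev

def center(column):
    mean = fmean(column)
    return [v - mean for v in column]

def standardize(column):
    mean, std = fmean(column), pstdev(column)
    return [(v - mean) / std for v in column]

def gd_steps(columns, y, tol=1e-6, max_steps=50_000):
    """Batch gradient descent on mean squared error; returns the steps used."""
    rows = list(zip(*columns))
    n, d = len(rows), len(columns)
    # Conservative step size (assumption of this sketch): 1 / trace(X^T X / n).
    lr = 1.0 / sum(fmean(v * v for v in col) for col in columns)
    w = [0.0] * d
    for step in range(1, max_steps + 1):
        residuals = [sum(wj * xj for wj, xj in zip(w, row)) - yi
                     for row, yi in zip(rows, y)]
        grad = [sum(r * row[j] for r, row in zip(residuals, rows)) / n
                for j in range(d)]
        if max(abs(g) for g in grad) < tol:
            return step
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return max_steps  # did not converge within the cap

small = [1, 2, 3, 4]            # feature on a small scale
large = [100, 300, 200, 400]    # feature on a scale 100x larger
y = [2.0 * a + 0.01 * b for a, b in zip(center(small), center(large))]

raw_steps = gd_steps([center(small), center(large)], y)
scaled_steps = gd_steps([standardize(small), standardize(large)], y)
print(raw_steps, scaled_steps)
```

With the features only centered, the step size has to be tiny to stay stable along the large-scale feature, so progress along the small-scale feature is very slow and the run hits the step cap. With standardized features the same procedure converges in far fewer steps.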

---

### **c) Improves Interpretability**

* When features are on a similar scale, model coefficients (in linear models) become easier to compare in terms of importance.

---

## **3. Common Scaling Techniques**

---

### **3.1 Normalization (Min-Max Scaling)**

**Formula:**

$$
X' = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

Transforms data to the range **[0, 1]** (or any custom range).

**When to use:**

* When the distribution is unknown and you want all features within a fixed range.
* Good for **distance-based algorithms** (KNN, K-Means).

**Example:**
Income = ₹50,000, Min = ₹20,000, Max = ₹200,000 →

$$
X' = \frac{50000 - 20000}{200000 - 20000} = 0.1667
$$

**Limitations:**

* Sensitive to **outliers** — one extreme value can distort scaling.
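A minimal sketch of the formula in plain Python follows. The function name `min_max_scale` and the ₹2,000,000 outlier are made up for illustration; scikit-learn's `MinMaxScaler` applies the same transformation column by column.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

incomes = [20_000, 50_000, 200_000]
scaled = min_max_scale(incomes)
print(round(scaled[1], 4))  # 0.1667, matching the worked example

# One hypothetical extreme value squeezes everything else towards 0.
with_outlier = min_max_scale(incomes + [2_000_000])
print([round(v, 4) for v in with_outlier])
```

In the second call the original maximum, ₹200,000, lands below 0.1, which is the outlier sensitivity described above.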

---

### **3.2 Standardization (Z-score Normalization)**

**Formula:**

$$
X' = \frac{X - \mu}{\sigma}
$$

Rescales each feature so that it has mean 0 and standard deviation 1.

**When to use:**

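* When features should be centered at 0 with unit variance. Standardization shifts and rescales the data but does **not** change the shape of its distribution or make it normal.
* Works well with algorithms that are sensitive to feature variance or that use regularization or gradient-based optimization (Logistic Regression, Linear Regression, SVM, PCA).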

**Example:**
If Age = 40, mean = 30, std dev = 5 →

$$
X' = \frac{40 - 30}{5} = 2
$$

Meaning Age is 2 standard deviations above average.

**Advantages:**

* Not restricted to the [0, 1] range.
* Somewhat less distorted by outliers than Min-Max scaling, although the mean and standard deviation are still affected by extreme values.
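The z-score can be sketched with the standard library's `statistics` module. The `ages` list is hypothetical, and the sketch uses the population standard deviation (`pstdev`), which is also the convention scikit-learn's `StandardScaler` follows.

```python
from statistics import fmean, pstdev

def z_score(x, mean, std):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std

def standardize(values):
    mean, std = fmean(values), pstdev(values)
    return [z_score(v, mean, std) for v in values]

print(z_score(40, 30, 5))  # 2.0, matching the worked example

ages = [22, 25, 30, 35, 38]  # hypothetical sample with mean 30
standardized = standardize(ages)
print(round(abs(fmean(standardized)), 6), round(pstdev(standardized), 6))
```

After the transformation the sample has mean 0 and standard deviation 1 up to floating-point error, while the ordering and relative spacing of the values are unchanged.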

---

### **3.3 Robust Scaling**

**Formula:**

$$
X' = \frac{X - \text{median}}{\text{IQR}}
$$

Uses **median** and **interquartile range** instead of mean and std deviation.

**When to use:**

* Data with many **outliers**.
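A minimal sketch of robust scaling in plain Python is shown below. The data and the function name are illustrative, and the quartiles use the "inclusive" interpolation method of `statistics.quantiles`; other quartile conventions give slightly different IQR values.

```python
from statistics import quantiles

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

data = [1, 2, 3, 4, 100]  # hypothetical feature with one outlier
print(robust_scale(data))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier stays far away after scaling, but it no longer determines how the other four values are spread out, because the median and IQR ignore the extremes.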

---

## **4. Scenarios Where Scaling Makes a Big Difference**

* **K-Means Clustering**: Without scaling, clusters are biased towards features with larger ranges.
* **SVM**: Decision boundary gets distorted if features are not on the same scale.
* **PCA**: Principal components are influenced by variance; scaling ensures fair contribution.
* **Gradient Descent Models**: Scaled features → smoother convergence.

Example:
If you run K-Means on:

* Feature 1: Age (18–60)
* Feature 2: Income (₹20k–₹200k)

Without scaling → clusters will mostly be determined by income.
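The sketch below illustrates this with four hypothetical people. Instead of running the iterative K-Means algorithm, it brute-forces every split into two clusters and keeps the one with the lowest within-cluster sum of squares, which is the objective K-Means tries to minimize. That keeps the example deterministic and dependency-free; all names and values are assumptions made for illustration.

```python
from itertools import combinations

def min_max(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    dims = range(len(cluster[0]))
    centroid = [sum(p[j] for p in cluster) / len(cluster) for j in dims]
    return sum((p[j] - centroid[j]) ** 2 for p in cluster for j in dims)

def best_two_clusters(points):
    """Try every 2-way split and return the one with the lowest total SSE."""
    indices = range(len(points))
    best_cost, best_split = None, None
    for size in range(1, len(points) // 2 + 1):
        for group in combinations(indices, size):
            rest = [i for i in indices if i not in group]
            cost = sse([points[i] for i in group]) + sse([points[i] for i in rest])
            if best_cost is None or cost < best_cost:
                best_cost, best_split = cost, {frozenset(group), frozenset(rest)}
    return best_split

ages = [20, 22, 58, 60]                      # two clear age groups
incomes = [50_000, 52_000, 51_000, 49_000]   # no real grouping, but big numbers

raw_clusters = best_two_clusters(list(zip(ages, incomes)))
scaled_clusters = best_two_clusters(list(zip(min_max(ages), min_max(incomes))))
print(raw_clusters)     # people 0 and 3 together, 1 and 2 together
print(scaled_clusters)  # people 0 and 1 together, 2 and 3 together
```

On raw values the best split pairs a 20-year-old with a 60-year-old because their incomes happen to be closest. After min-max scaling, the split follows the two age groups.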

---

## **5. Best Practices for Scaling**

1. **Fit scaler only on training data**, then apply to both training and test sets to avoid data leakage.
2. Choose scaling method based on:

   * Model type (distance-based → normalization; regularized linear models, SVM, PCA → standardization).
   * Presence of outliers (use robust scaling if needed).
3. Scaling is **usually not needed** for tree-based models (Decision Trees, Random Forest, XGBoost) because they split on per-feature thresholds, which are unaffected by monotonic rescaling, rather than on distances.
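The first practice can be sketched as follows. `SimpleStandardizer` is a hypothetical stand-in written for this example, and the age values are made up; with scikit-learn the same pattern is `scaler.fit(X_train)` followed by `scaler.transform(X_train)` and `scaler.transform(X_test)`.

```python
from statistics import fmean, pstdev

class SimpleStandardizer:
    """Hypothetical minimal scaler: learns mean/std in fit, reuses them in transform."""

    def fit(self, values):
        self.mean_ = fmean(values)
        self.std_ = pstdev(values)
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

train_ages = [22, 25, 30, 35, 38]
test_ages = [40, 60]

# Statistics come from the training split only.
scaler = SimpleStandardizer().fit(train_ages)
train_scaled = scaler.transform(train_ages)
# The test split reuses the training mean and std; it is never refit.
test_scaled = scaler.transform(test_ages)

print(scaler.mean_)
print([round(v, 3) for v in test_scaled])
```

Because the test values never enter `fit`, nothing about the test distribution leaks into preprocessing. The scaled test set will generally not have mean 0 or standard deviation 1, and that is expected.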

---

## **6. Key Takeaways**

* Scaling makes features **comparable**, often improves the **accuracy** of scale-sensitive models, and speeds up **training** with gradient-based optimizers.
* **Normalization** → range-based scaling (good for distance-based models).
* **Standardization** → mean-centered, unit-variance scaling (good for regularized linear models, SVM, and PCA).
* **Robust scaling** → outlier-resistant.
* Always scale **before** feeding data into sensitive algorithms.
* Scaling choice should match **data characteristics** and **model requirements**.

---



## **1. What is Data Normalization?**

In the context of **databases**, **data normalization** is the process of organizing data into structured tables (relations) to reduce redundancy and improve data integrity.
It involves dividing a database into smaller, related tables and defining relationships between them, usually through primary and foreign keys.

In **machine learning**, the term “normalization” can also mean scaling numerical data into a specific range, but this section focuses on **database normalization**.

---

## **2. Objectives of Data Normalization**

* **Reduce data redundancy** – avoid storing the same piece of information in multiple places.
* **Improve data integrity** – each fact is stored once, so an update cannot leave stale copies elsewhere.
* **Optimize storage** – by eliminating duplicate data, storage space is used more efficiently.
* **Make maintenance easier** – changes are made in one location instead of multiple places.

---

## **3. Normal Forms and Their Characteristics**

Database normalization typically follows several stages (called **normal forms**). Each stage builds upon the previous one.

### **First Normal Form (1NF)**

**Rule:**

* Eliminate repeating groups in individual tables.
* Ensure each column contains atomic (indivisible) values.
* Each record must be unique (no duplicate rows).

**Example:**
Before 1NF:

| StudentID | Name | Courses       |
| --------- | ---- | ------------- |
| 1         | Ali  | Math, Physics |

After 1NF (split into separate rows):

| StudentID | Name | Course  |
| --------- | ---- | ------- |
| 1         | Ali  | Math    |
| 1         | Ali  | Physics |
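As a rough illustration, the split into atomic rows can be sketched in Python. The function name `to_first_normal_form` is hypothetical, and the sketch assumes courses are separated by commas as in the table above.

```python
def to_first_normal_form(rows):
    """Expand a comma-separated Courses column into one row per course."""
    result = []
    for row in rows:
        for course in row["Courses"].split(","):
            result.append({
                "StudentID": row["StudentID"],
                "Name": row["Name"],
                "Course": course.strip(),  # drop the space after the comma
            })
    return result

before = [{"StudentID": 1, "Name": "Ali", "Courses": "Math, Physics"}]
after = to_first_normal_form(before)
for row in after:
    print(row)
```

Each resulting row holds exactly one course, so the `Course` column is atomic.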

---

### **Second Normal Form (2NF)**

**Rule:**

* Must be in 1NF.
* Remove partial dependencies (i.e., non-key attributes must depend on the whole primary key, not part of it).

**Example:**
A table storing both student-course relationships and each student’s department, with the composite primary key (StudentID, Course):

| StudentID | Course | Department |
| --------- | ------ | ---------- |

Here, Department depends only on StudentID, not on Course.
To fix this, split into two tables:

**Students Table:**

| StudentID | Department |
| --------- | ---------- |

**StudentCourses Table:**

| StudentID | Course |
| --------- | ------ |
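The decomposition can be sketched with plain Python data structures. The sample rows (student IDs, courses, and department names) are invented for illustration.

```python
# Hypothetical rows of the combined table: (StudentID, Course, Department).
enrollments = [
    (1, "Math", "Science"),
    (1, "Physics", "Science"),  # the department for student 1 is repeated
    (2, "Math", "Arts"),
]

# Students table: Department depends on StudentID alone, so store it once per student.
students = {student_id: department for student_id, _, department in enrollments}

# StudentCourses table: only the key columns (StudentID, Course) remain.
student_courses = sorted({(student_id, course) for student_id, course, _ in enrollments})

print(students)         # {1: 'Science', 2: 'Arts'}
print(student_courses)  # [(1, 'Math'), (1, 'Physics'), (2, 'Math')]
```

After the split, changing a student's department means updating one entry instead of one row per enrolled course.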

---

### **Third Normal Form (3NF)**

**Rule:**

* Must be in 2NF.
* Remove transitive dependencies (non-key attributes should not depend on other non-key attributes).

**Example:**
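If a Student table has:

| StudentID | DepartmentID | DepartmentName |
| --------- | ------------ | -------------- |

Here, DepartmentName depends on DepartmentID, not on StudentID directly.
So, create a **Department table** and keep only DepartmentID in the Student table:

**Departments Table:**

| DepartmentID | DepartmentName |
| ------------ | -------------- |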
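A minimal sketch of the 3NF layout, using Python's built-in `sqlite3` module with an in-memory database, is shown below. The sample IDs and department names are made up; the table and column names follow the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (
    DepartmentID   INTEGER PRIMARY KEY,
    DepartmentName TEXT NOT NULL
);
CREATE TABLE Students (
    StudentID    INTEGER PRIMARY KEY,
    DepartmentID INTEGER NOT NULL REFERENCES Departments(DepartmentID)
);
INSERT INTO Departments VALUES (10, 'Physics');
INSERT INTO Students VALUES (1, 10), (2, 10);
""")

# The department name lives in one row, so a rename is a single update.
conn.execute(
    "UPDATE Departments SET DepartmentName = 'Applied Physics' WHERE DepartmentID = 10"
)

# A join reassembles the full picture when it is needed.
rows = conn.execute("""
    SELECT s.StudentID, d.DepartmentName
    FROM Students AS s
    JOIN Departments AS d ON d.DepartmentID = s.DepartmentID
    ORDER BY s.StudentID
""").fetchall()
print(rows)  # [(1, 'Applied Physics'), (2, 'Applied Physics')]
```

Both students see the new name after one update, which is the consistency benefit of removing the transitive dependency.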

---

## **4. Real-World Applications of Normalization**

* **Database Design in Organizations** – For a retail store, normalizing the sales and inventory databases ensures product details are stored once and referenced everywhere.
* **Data Migration Projects** – When moving data from legacy systems to modern ones, normalization ensures consistent and non-duplicated records.
* **Banking Systems** – Customer details are stored in a master table and linked via IDs to accounts, loans, and transactions, reducing redundancy.

---

## **5. Benefits of Data Normalization**

* **Improved Data Consistency** – No mismatched or outdated copies of the same information.
* **Easier Maintenance** – Update in one place instead of hunting multiple locations.
* **Better Query Performance** – Smaller, well-indexed tables can improve search speed.
* **Reduced Storage Costs** – Less duplication means less disk space needed.

---

## **6. Challenges or Limitations**

* **More Complex Queries** – Since data is split across multiple tables, joins are required to retrieve complete information.
* **Potential Performance Issues in Read-Heavy Systems** – If joins are excessive, query time may increase.
* **Over-Normalization** – Breaking data into too many tables can make the database harder to manage.

**How to Address These Issues:**

* Use **denormalization** selectively for performance (combine some tables for faster reads).
* Optimize indexing and caching.
* Balance normalization with the specific needs of the application.
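Selective denormalization can be sketched with `sqlite3` as well. The `StudentReport` table, the index name, and the sample data are hypothetical, and this is one possible approach rather than a general recommendation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER PRIMARY KEY, DepartmentName TEXT NOT NULL);
CREATE TABLE Students (
    StudentID    INTEGER PRIMARY KEY,
    DepartmentID INTEGER NOT NULL REFERENCES Departments(DepartmentID)
);
INSERT INTO Departments VALUES (10, 'Physics'), (20, 'History');
INSERT INTO Students VALUES (1, 10), (2, 10), (3, 20);

-- Denormalized copy for read-heavy reports: the join is paid once, up front.
CREATE TABLE StudentReport AS
SELECT s.StudentID, d.DepartmentName
FROM Students AS s
JOIN Departments AS d ON d.DepartmentID = s.DepartmentID;

CREATE INDEX idx_report_department ON StudentReport(DepartmentName);
""")

# Reads now hit a single indexed table with no join.
report = conn.execute(
    "SELECT StudentID FROM StudentReport WHERE DepartmentName = 'Physics' ORDER BY StudentID"
).fetchall()
print(report)  # [(1,), (2,)]
```

The trade-off is that `StudentReport` duplicates department names, so it has to be rebuilt or updated whenever the normalized tables change.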

---


In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [45]:
# Create an example dataset
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [5, 4, 3, 2, 1]
})

In [46]:
data.head()

|     | feature1 | feature2 |
| --- | -------- | -------- |
| 0   | 1        | 5        |
| 1   | 2        | 4        |
| 2   | 3        | 3        |
| 3   | 4        | 2        |
| 4   | 5        | 1        |


In [47]:
scaler = MinMaxScaler()

In [48]:
# fit_transform returns a NumPy array, so `data` is no longer a DataFrame after this line
data = scaler.fit_transform(data)

In [49]:
data

array([[0.  , 1.  ],
       [0.25, 0.75],
       [0.5 , 0.5 ],
       [0.75, 0.25],
       [1.  , 0.  ]])

In [50]:
# Load the Titanic dataset that ships with seaborn
data = sns.load_dataset("titanic")
data.head()

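|     | survived | pclass | sex    | age  | sibsp | parch | fare    | embarked | class | who   | adult_male | deck | embark_town | alive | alone |
| --- | -------- | ------ | ------ | ---- | ----- | ----- | ------- | -------- | ----- | ----- | ---------- | ---- | ----------- | ----- | ----- |
| 0   | 0        | 3      | male   | 22.0 | 1     | 0     | 7.25    | S        | Third | man   | True       | NaN  | Southampton | no    | False |
| 1   | 1        | 1      | female | 38.0 | 1     | 0     | 71.2833 | C        | First | woman | False      | C    | Cherbourg   | yes   | False |
| 2   | 1        | 3      | female | 26.0 | 0     | 0     | 7.925   | S        | Third | woman | False      | NaN  | Southampton | yes   | True  |
| 3   | 1        | 1      | female | 35.0 | 1     | 0     | 53.1    | S        | First | woman | False      | C    | Southampton | yes   | False |
| 4   | 0        | 3      | male   | 35.0 | 0     | 0     | 8.05    | S        | Third | man   | True       | NaN  | Southampton | no    | True  |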


In [51]:
data.shape

(891, 15)

In [52]:
scaler = MinMaxScaler()

In [53]:
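# Scale only the numeric columns; the scaler cannot handle strings or categories.
# Note: this also rescales `survived`, the target label, which a real pipeline would exclude.
numeric_data = data.select_dtypes(include=[np.number])
scaled_data = pd.DataFrame(scaler.fit_transform(numeric_data), columns=numeric_data.columns)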

In [54]:
scaled_data

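|     | survived | pclass | age      | sibsp | parch    | fare     |
| --- | -------- | ------ | -------- | ----- | -------- | -------- |
| 0   | 0.0      | 1.0    | 0.271174 | 0.125 | 0.000000 | 0.014151 |
| 1   | 1.0      | 0.0    | 0.472229 | 0.125 | 0.000000 | 0.139136 |
| 2   | 1.0      | 1.0    | 0.321438 | 0.000 | 0.000000 | 0.015469 |
| 3   | 1.0      | 0.0    | 0.434531 | 0.125 | 0.000000 | 0.103644 |
| 4   | 0.0      | 1.0    | 0.434531 | 0.000 | 0.000000 | 0.015713 |
| ... | ...      | ...    | ...      | ...   | ...      | ...      |
| 886 | 0.0      | 0.5    | 0.334004 | 0.000 | 0.000000 | 0.025374 |
| 887 | 1.0      | 0.0    | 0.233476 | 0.000 | 0.000000 | 0.058556 |
| 888 | 0.0      | 1.0    | NaN      | 0.125 | 0.333333 | 0.045771 |
| 889 | 1.0      | 0.0    | 0.321438 | 0.000 | 0.000000 | 0.058556 |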


In [55]:
st_scaler = StandardScaler()

In [56]:
# Standardize the numeric columns to mean 0 and standard deviation 1
numeric_data = data.select_dtypes(include=[np.number])
scaled_data = pd.DataFrame(st_scaler.fit_transform(numeric_data), columns=numeric_data.columns)

In [58]:
scaled_data

|     | survived  | pclass    | age       | sibsp     | parch     | fare      |
| --- | --------- | --------- | --------- | --------- | --------- | --------- |
| 0   | -0.789272 | 0.827377  | -0.530377 | 0.432793  | -0.473674 | -0.502445 |
| 1   | 1.266990  | -1.566107 | 0.571831  | 0.432793  | -0.473674 | 0.786845  |
| 2   | 1.266990  | 0.827377  | -0.254825 | -0.474545 | -0.473674 | -0.488854 |
| 3   | 1.266990  | -1.566107 | 0.365167  | 0.432793  | -0.473674 | 0.420730  |
| 4   | -0.789272 | 0.827377  | 0.365167  | -0.474545 | -0.473674 | -0.486337 |
| ... | ...       | ...       | ...       | ...       | ...       | ...       |
| 886 | -0.789272 | -0.369365 | -0.185937 | -0.474545 | -0.473674 | -0.386671 |
| 887 | 1.266990  | -1.566107 | -0.737041 | -0.474545 | -0.473674 | -0.044381 |
| 888 | -0.789272 | 0.827377  | NaN       | 0.432793  | 2.008933  | -0.176263 |
| 889 | 1.266990  | -1.566107 | -0.254825 | -0.474545 | -0.473674 | -0.044381 |
