# **7 Days Data Reduction Course**

# **Course**: Machine Learning

# **Day 1:** Introduction to Data Reduction

# **Student**: Muhammad Shafiq

-----------------------------------

## **Day 1 Objectives**

| Goal                              | Explanation                                                     |
| --------------------------------- | --------------------------------------------------------------- |
| Understand what Data Reduction is | The logic behind column, row, and dimension reduction           |
| Why it matters in ML projects     | Performance, accuracy, generalization                           |
| Types of Data Reduction           | Feature Selection, Dimensionality Reduction, Instance Reduction |
| Real-world scenarios              | Where and how to apply it                                       |
| Mini Coding Project               | Basic demo of when and how reduction affects model performance  |


### **Data Reduction**

**Data Reduction** means reducing the size or complexity of a dataset while keeping the key information that models need to learn.


### **Why is data reduction important?**

| Problem            | Without Data Reduction                    | With Data Reduction             |
| ------------------ | ----------------------------------------- | ------------------------------- |
|  Model Accuracy  | Overfitting due to noisy/unnecessary data | Better generalization           |
|  Speed           | Slow training and high memory usage       | Fast training and lower memory  |
|  Inference       | Laggy predictions on real-time systems    | Quick and efficient predictions |
|  Maintainability | Hard to debug models                      | Easier to interpret & monitor   |


### **Three types of Data Reduction**

| Type                              | Target                                                  | Example                                |
| --------------------------------- | ------------------------------------------------------- | -------------------------------------- |
| 🔹 **Feature (Column) Reduction** | Remove irrelevant or redundant features                 | Drop highly correlated columns         |
| 🔹 **Instance (Row) Reduction**   | Remove unnecessary rows                                 | Remove outliers or duplicates          |
| 🔹 **Dimensionality Reduction**   | Reduce complex numerical features into fewer dimensions | PCA on 1000 image pixels → 20 features |
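
The three types in the table can be sketched on a small synthetic table. This is a minimal illustration, not part of the course notebook: the `toy` DataFrame, its column names, and the choice of 2 PCA components are assumptions made only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 rows, 5 numeric columns
toy = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
toy["a_copy"] = toy["a"]                                   # a redundant column
toy = pd.concat([toy, toy.iloc[:10]], ignore_index=True)   # 10 duplicated rows

# 1. Feature (column) reduction: drop the redundant column
fewer_cols = toy.drop(columns=["a_copy"])

# 2. Instance (row) reduction: drop exact duplicate rows
fewer_rows = fewer_cols.drop_duplicates()

# 3. Dimensionality reduction: project 5 columns onto 2 principal components
reduced = PCA(n_components=2).fit_transform(fewer_rows)

print("original:      ", toy.shape)
print("fewer columns: ", fewer_cols.shape)
print("fewer rows:    ", fewer_rows.shape)
print("after PCA:     ", reduced.shape)
```

Note that the first two steps keep the original columns and rows readable, while PCA replaces the columns with new combined features that no longer have the original names.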


### **Real-World Examples**

| Use Case        | Dataset                      | Reduction Logic                                                 |
| --------------- | ---------------------------- | --------------------------------------------------------------- |
|  Healthcare   | X-ray, CT Scans              | Reduce image resolution, remove noisy samples                   |
|  Telecom      | Customer Churn               | Remove features like `Phone number`, reduce dimension using PCA |
|  Agriculture  | Crop yield dataset           | Remove outlier farms or low-variance columns                    |
|  Self-driving | Camera + Lidar sensor fusion | Reduce sensor noise, merge redundant frames                     |
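
As one possible sketch of the "low-variance columns" idea from the agriculture row, scikit-learn's `VarianceThreshold` can drop columns that never change. The `farms` table and its column names are invented for illustration and do not come from a real crop yield dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)

# Hypothetical crop-yield style table; all column names are assumptions
farms = pd.DataFrame({
    "rainfall_mm": rng.normal(800, 120, size=200),
    "soil_ph": rng.normal(6.5, 0.4, size=200),
    "country_code": np.ones(200),   # constant column: carries no information
})

# threshold=0.0 removes only columns whose variance is exactly zero
selector = VarianceThreshold(threshold=0.0)
selector.fit(farms)

kept_columns = list(farms.columns[selector.get_support()])
print("Columns kept:", kept_columns)
```

Variance depends on the scale of each column, so a threshold above zero is usually only meaningful after the features have been brought to comparable scales.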


### **Import libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt


### **Load dataset**

In [None]:
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
print("Original shape:", X.shape)


### **Add useless columns**


In [None]:
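np.random.seed(42)
# Pure noise: uniform random values unrelated to the target
X['random_noise'] = np.random.random(size=X.shape[0])
# Redundant: an exact copy of an existing feature
X['duplicate_mean'] = X['mean radius'].copy()

print("New shape with junk features:", X.shape)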


### **Train model without reduction**


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print("Accuracy with noise:", accuracy_score(y_test, preds))


### **Now reduce columns**

In [None]:
# Drop the two junk columns added earlier
X_clean = X.drop(columns=['random_noise', 'duplicate_mean'])
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_clean, y, test_size=0.2, random_state=42)

model2 = LogisticRegression(max_iter=1000)
model2.fit(X_train2, y_train2)
preds2 = model2.predict(X_test2)
print("Accuracy after column reduction:", accuracy_score(y_test2, preds2))
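
A single train/test split can make the two accuracy figures differ by chance, so the comparison may be more trustworthy with cross-validation. The sketch below is one way to do that and is not part of the original notebook: the names `X_full`, `X_junk`, and `cv_accuracy`, the 5 folds, and the added `StandardScaler` step are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_full = pd.DataFrame(data.data, columns=data.feature_names)
y_full = data.target

# Rebuild the same two junk columns used in the demo
rng = np.random.default_rng(42)
X_junk = X_full.copy()
X_junk["random_noise"] = rng.random(size=len(X_junk))
X_junk["duplicate_mean"] = X_junk["mean radius"]


def cv_accuracy(features, target):
    """Return per-fold accuracy from 5-fold cross-validation (hypothetical helper)."""
    # Scaling is fitted inside each fold, so no test data leaks into training
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return cross_val_score(pipe, features, target, cv=5, scoring="accuracy")


scores_junk = cv_accuracy(X_junk, y_full)
scores_clean = cv_accuracy(X_full, y_full)

print(f"With junk columns:    {scores_junk.mean():.4f} +/- {scores_junk.std():.4f}")
print(f"Without junk columns: {scores_clean.mean():.4f} +/- {scores_clean.std():.4f}")
```

If the gap between the two means is smaller than the spread across folds, the difference is probably not meaningful for this dataset.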


### **Visualizing Correlation**

In [None]:
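# X still includes the junk columns, so duplicate_mean shows a perfect correlation with mean radius
corr = X.corr()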
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', annot=False)
plt.title('Correlation Heatmap')
plt.show()
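
A heatmap of roughly 30 features is hard to read by eye, so the same correlation matrix can also be scanned programmatically. The following is a minimal sketch under stated assumptions: the 0.95 cut-off and the names `features`, `upper`, and `to_drop` are illustrative choices, not a fixed rule.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
features = pd.DataFrame(data.data, columns=data.feature_names)
features["duplicate_mean"] = features["mean radius"]   # same redundant column as in the demo

corr_abs = features.corr().abs()

# Keep only the part above the diagonal so each pair is checked once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(mask)

threshold = 0.95   # assumed cut-off for "highly correlated"
# A column is flagged if it correlates strongly with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

print("Candidate columns to drop:", to_drop)
reduced_features = features.drop(columns=to_drop)
print("Shape before:", features.shape, "after:", reduced_features.shape)
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is flagged; reordering the columns would change which one survives.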

### **📝 Mini Task (For You Today):**

1. Run the above code on your system ✅

2. Try replacing the dataset with:

     - Titanic dataset

     - Heart disease dataset
 
3. Identify at least 2 columns that are:

     - Irrelevant

     - Duplicate or highly correlated

4. Measure the accuracy before and after removing them.