# 📘 Day 10 – Machine Learning Journey

---

##  Decision Tree Pruning

### Cost Complexity Pruning

$$ \text{Cost}_\alpha(T) = \text{Error}(T) + \alpha \cdot \text{Complexity}(T) $$

* **Error(T)**: Misclassification or regression error of the tree

  * Regression → **MSE (Mean Squared Error)**
  * Classification → **Gini Index** (impurity of the leaves)
* **Complexity(T)**: Number of **leaf nodes** in the tree
* **α ≥ 0**: Penalty per leaf node; a larger α prunes the tree more aggressively

👉 Goal: Balance **model accuracy** and **model simplicity** to avoid overfitting.
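
As a rough illustration of the trade-off above, the sketch below uses scikit-learn's `ccp_alpha` parameter, which plays the role of α. The Iris dataset, the 70/30 split, and the variable names are assumptions chosen for the example; they are not part of these notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed example data: Iris, with a 70/30 train-test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Candidate alpha values derived from the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Guard against tiny negative values caused by floating-point rounding
alphas = np.clip(path.ccp_alphas, 0, None)

leaf_counts = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    leaf_counts.append(tree.get_n_leaves())
    print(
        f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
        f"test accuracy={tree.score(X_test, y_test):.3f}"
    )
```

Larger values of `ccp_alpha` leave fewer leaves. In practice the value is usually chosen by cross-validation rather than by reading test accuracy as done here.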

---

##  Phases of Machine Learning

### Phase 1: Data Preprocessing

* Handling missing values
* Outlier treatment
* Encoding categorical variables
* Scaling & normalization

### Phase 2: Mathematics for ML

Algorithms covered:

* KNN (K-Nearest Neighbors)
* Decision Trees (DT)
* Naive Bayes (NB)
* Linear Regression
* Logistic Regression
* Support Vector Machines (SVM)
* Ensemble Techniques

### Phase 3: Feature Engineering

* Feature creation
* Feature transformation
* **Feature Selection**

> *The success of a Machine Learning algorithm depends heavily on **Feature Engineering** and **Feature Selection***

---

##  Neural Networks & Deep Learning

* **Neural Networks (NN)** belong to **Deep Learning**
* NN = Feature Engineering + ML Algorithm
* Parametric ML examples:

  * Linear Regression
  * Logistic Regression

### Why Neural Networks?

* Neural Networks perform **automated feature engineering**
* Automation becomes useful when **data complexity is high**

---

##  Feature Engineering: Missing Value Imputation

### What are Missing Values?

* Missing values are represented as **Null / NaN / NA / '-'**

### Why do missing values occur?

* Data collector unable to obtain a value
* Respondent unwilling to share information
* System or manual data-entry errors

### Ideal Solution

* **Collect complete information** ❌ (costly and time-consuming)

### Practical Solution

* **Impute missing values** using appropriate techniques without compromising data quality

>  Feature Engineering is guided by **Domain Experts**. Data Scientists apply techniques based on domain suggestions.

---

##  Missing Value Awareness (Before Imputation)

### 40% Heuristic Rule

* If a feature has **≥ 40% missing values**:

  * **Fiscal / Financial feature** → Drop the feature
  * **Non-fiscal feature** → 📥 Collect or enrich the data

> **One-liner:** Features with ≥40% missing values are dropped if fiscal; otherwise, missing data should be collected or enriched.
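
A minimal sketch of how the 40% check might be computed with pandas. The DataFrame `df_demo` and its column names are hypothetical, and deciding whether a flagged feature is fiscal is still a domain judgment that the code does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only
df_demo = pd.DataFrame({
    "income": [50_000, np.nan, np.nan, 62_000, np.nan],    # 3 of 5 missing
    "age": [25, 31, np.nan, 40, 38],                       # 1 of 5 missing
    "city": ["Pune", "Delhi", "Mumbai", None, "Chennai"],  # 1 of 5 missing
})

# Percentage of missing values per feature
missing_pct = df_demo.isna().mean() * 100

# Features at or above the 40% threshold need a drop-or-collect decision
flagged = missing_pct[missing_pct >= 40].index.tolist()

print(missing_pct.round(1))
print("Features at or above the 40% threshold:", flagged)
```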

---

##  Missing Data Mechanisms

There are **three types** of missingness:

### 1️⃣ MCAR – Missing Completely At Random

* Missingness is **purely random**
* Not dependent on any feature

**Imputation:**

* Mean
* Median
* Mode

---

### 2️⃣ MAR – Missing At Random

* Missingness depends on **other observed features**

**Example:**

* Employees with low IQ do not share performance ratings

**Imputation:**

* **KNN Imputer** (uses similarity with other features)

---

### 3️⃣ MNAR – Missing Not At Random

* Missingness depends on the **value of the feature itself**

**Example:**

* Employees hide performance rating due to poor performance

**Solution:**

* ❌ Discard observations
* ✔️ Domain-driven decision

---

##  Example Dataset Scenario

* Dataset from an organization
* Features:

  * IQ
  * Performance Rating

Goal:

$$ \text{Performance} = f(\text{IQ}) $$

* MCAR → No dependency
* MAR → Dependency on another feature
* MNAR → Dependency on feature value itself
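
The three mechanisms can be simulated on the IQ and rating values used in the notebook at the end of these notes. This is only a sketch: the 30% random rate and the cut-offs `IQ < 92` and `Rating <= 8` are assumptions chosen for illustration, not rules stated in the notes.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "IQ": [78, 84, 84, 85, 87, 91, 92, 94, 94, 96,
           99, 105, 105, 106, 108, 112, 113, 115, 118, 134],
    "Rating": [9, 13, 10, 8, 7, 7, 9, 9, 11, 7,
               7, 10, 11, 15, 10, 8, 12, 14, 16, 12],
})

rng = np.random.default_rng(0)

# MCAR: each rating is hidden with the same probability (assumed 30%)
demo["MCAR"] = demo["Rating"].mask(rng.random(len(demo)) < 0.3)

# MAR: missingness depends on another observed feature (IQ)
demo["MAR"] = demo["Rating"].mask(demo["IQ"] < 92)

# MNAR: missingness depends on the rating value itself
demo["MNAR"] = demo["Rating"].mask(demo["Rating"] <= 8)

print(demo.isna().sum())
```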

---

##  Data Science Workflow for Handling Missing Values

### Step 1: Start with Dataset

* Load and understand the dataset
* Identify features, target, and data types

### Step 2: Detect Missing Values

* Count & percentage of missing values
* Identify missingness type (MCAR / MAR / MNAR)
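
One way to carry out this step with pandas is sketched below. The toy DataFrame is hypothetical. Comparing another feature between rows with and without a missing value can hint at MAR, but MNAR generally cannot be confirmed from the observed data alone, so treat the check as a rough diagnostic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ratings are missing for the lowest IQ values
demo = pd.DataFrame({
    "IQ": [78, 84, 85, 91, 94, 99, 105, 112],
    "Rating": [np.nan, np.nan, np.nan, 7, 9, 7, 10, 8],
})

# Count and percentage of missing values per feature
summary = pd.DataFrame({
    "missing_count": demo.isna().sum(),
    "missing_pct": demo.isna().mean() * 100,
})
print(summary)

# Rough MAR check: does IQ differ between missing and observed ratings?
is_missing = demo["Rating"].isna()
iq_when_missing = demo.loc[is_missing, "IQ"].mean()
iq_when_observed = demo.loc[~is_missing, "IQ"].mean()
print(f"Mean IQ, rating missing:  {iq_when_missing:.1f}")
print(f"Mean IQ, rating observed: {iq_when_observed:.1f}")
```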

### Step 3: Discard Observations (If Required)

Drop rows **only when**:

* Missingness is MCAR
* Missing rows are very few
* Target distribution is unaffected

Avoid dropping when data loss is significant.
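
The conditions above might be checked along these lines. The 10% tolerance `MAX_LOSS_PCT` and the toy data are assumptions for the example; the notes only say that missing rows should be "very few".

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 rows, one missing feature value
feature = np.arange(20, dtype=float)
feature[3] = np.nan
demo = pd.DataFrame({"feature": feature, "target": [0, 1] * 10})

MAX_LOSS_PCT = 10  # hypothetical tolerance for row loss

rows_lost_pct = demo["feature"].isna().mean() * 100
before = demo["target"].value_counts(normalize=True)

# Drop rows only if the loss stays within the tolerance
if rows_lost_pct <= MAX_LOSS_PCT:
    cleaned = demo.dropna(subset=["feature"])
else:
    cleaned = demo

after = cleaned["target"].value_counts(normalize=True)
print(f"Rows lost: {rows_lost_pct:.1f}%")
print("Target distribution before:", before.round(3).to_dict())
print("Target distribution after: ", after.round(3).to_dict())
```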

### Step 4: Build the Model

* Train-test split
* Apply imputation strategy
* Train the ML model

### Step 5: Evaluate Model Performance

**Classification:**

* Accuracy
* Precision
* Recall
* F1-score
* ROC-AUC

**Regression:**

* RMSE
* MAE
* R²

### Step 6: Compare Performance

* Before vs After missing value handling
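
A sketch of such a comparison, assuming synthetic regression data with about 10% of values removed at random. The dataset, the linear model, and the 5-fold setup are all illustrative choices. Note that the "drop rows" baseline is scored on fewer rows than the imputed variants, so the comparison is only approximate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with roughly 10% of the values removed at random
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

scores = {}

# Baseline: discard every row that has a missing value
complete = ~np.isnan(X_missing).any(axis=1)
scores["drop rows"] = cross_val_score(
    LinearRegression(), X_missing[complete], y[complete], cv=5, scoring="r2"
).mean()

# Imputation inside a pipeline, so each fold fits the imputer on its own training part
imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}
for name, imputer in imputers.items():
    model = make_pipeline(imputer, LinearRegression())
    scores[name] = cross_val_score(model, X_missing, y, cv=5, scoring="r2").mean()

for name, score in scores.items():
    print(f"{name:>10}: mean R2 = {score:.3f}")
```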

---

##  One-line Exam Summary

> Start with the dataset, identify missing values, discard observations when justified, build the model, and evaluate model performance.

---

##  Missing Value Imputation Techniques

### 1️⃣ Simple Imputer

* Mean
* Median
* Mode
* Constant value

### 2️⃣ KNN Imputer

* Uses similarity (K-nearest neighbors)
* Predicts missing values using neighbors

---

##  Avoiding Data Leakage (Very Important)

* **Fit imputer only on training data**
* **Transform test data using the same imputer**

❌ Never impute on full dataset before splitting
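
A minimal sketch of the leak-free order of operations, using the `IQ` and `MCAR` columns from the notebook below. The 75/25 split and `random_state=42` are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({
    "IQ": [78, 84, 84, 85, 87, 91, 92, 94, 94, 96,
           99, 105, 105, 106, 108, 112, 113, 115, 118, 134],
    "MCAR": [np.nan, 13, np.nan, 8, 7, 7, 9, 9, 11, np.nan,
             7, 10, np.nan, 15, 10, np.nan, 12, 14, 16, np.nan],
})

# 1. Split first
train, test = train_test_split(demo, test_size=0.25, random_state=42)

# 2. Fit the imputer on the training rows only
imputer = SimpleImputer(strategy="mean")
imputer.fit(train[["MCAR"]])

# 3. Apply the same fitted imputer to both parts
train_filled = imputer.transform(train[["MCAR"]])
test_filled = imputer.transform(test[["MCAR"]])

print("Mean learned from training data:", imputer.statistics_[0])
```

The test rows are filled with the training mean, so no information from the test set influences the imputation.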




```python
import numpy as np
import pandas as pd
```

```python
# IQ and ratings, plus three copies of the ratings with values removed
# under each missingness mechanism
data = {
    "IQ": [78, 84, 84, 85, 87, 91, 92, 94, 94, 96, 99, 105, 105, 106, 108, 112, 113, 115, 118, 134],
    "Job_Performance_Ratings": [9, 13, 10, 8, 7, 7, 9, 9, 11, 7, 7, 10, 11, 15, 10, 8, 12, 14, 16, 12],
    "MCAR": [np.nan, 13, np.nan, 8, 7, 7, 9, 9, 11, np.nan, 7, 10, np.nan, 15, 10, np.nan, 12, 14, 16, np.nan],
    "MAR": [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 9, 9, 11, 7, 7, 10, 11, 15, 10, 8, 12, 14, 16, 12],
    "MNAR": [9, 13, 10, np.nan, np.nan, np.nan, 9, 9, 11, np.nan, np.nan, 10, 11, 15, 10, np.nan, 12, 14, 16, 12]
}
# Create DataFrame
df = pd.DataFrame(data)
df
```

Output (first 10 rows):

|   | IQ | Job_Performance_Ratings | MCAR | MAR | MNAR |
|---|----|-------------------------|------|-----|------|
| 0 | 78 | 9 | NaN | NaN | 9.0 |
| 1 | 84 | 13 | 13.0 | NaN | 13.0 |
| 2 | 84 | 10 | NaN | NaN | 10.0 |
| 3 | 85 | 8 | 8.0 | NaN | NaN |
| 4 | 87 | 7 | 7.0 | NaN | NaN |
| 5 | 91 | 7 | 7.0 | NaN | NaN |
| 6 | 92 | 9 | 9.0 | 9.0 | 9.0 |
| 7 | 94 | 9 | 9.0 | 9.0 | 9.0 |
| 8 | 94 | 11 | 11.0 | 11.0 | 11.0 |
| 9 | 96 | 7 | NaN | 7.0 | NaN |


```python
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# IterativeImputer is experimental and must be enabled explicitly before import
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
```


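```python
# Mean imputation applied to every column
si = SimpleImputer(strategy='mean').set_output(transform='pandas')
si
```

```python
si.fit_transform(df)
```

Output (first 10 rows):

|   | IQ | Job_Performance_Ratings | MCAR | MAR | MNAR |
|---|----|-------------------------|------|-----|------|
| 0 | 78.0 | 9.0 | 10.571429 | 10.785714 | 9.0 |
| 1 | 84.0 | 13.0 | 13.0 | 10.785714 | 13.0 |
| 2 | 84.0 | 10.0 | 10.571429 | 10.785714 | 10.0 |
| 3 | 85.0 | 8.0 | 8.0 | 10.785714 | 11.5 |
| 4 | 87.0 | 7.0 | 7.0 | 10.785714 | 11.5 |
| 5 | 91.0 | 7.0 | 7.0 | 10.785714 | 11.5 |
| 6 | 92.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 7 | 94.0 | 9.0 | 9.0 | 9.0 | 9.0 |
| 8 | 94.0 | 11.0 | 11.0 | 11.0 | 11.0 |
| 9 | 96.0 | 7.0 | 10.571429 | 7.0 | 11.5 |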


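```python
from sklearn.compose import ColumnTransformer
```

```python
# Apply mean imputation to the MCAR column only; pass the other columns through
ct = ColumnTransformer([('Mean_Imputation', si, ['MCAR'])],
                       remainder='passthrough',
                       verbose_feature_names_out=False).set_output(transform='pandas')
ct
```

```python
ct.fit_transform(df)
```

Output (first 10 rows):

|   | MCAR | IQ | Job_Performance_Ratings | MAR | MNAR |
|---|------|----|-------------------------|-----|------|
| 0 | 10.571429 | 78 | 9 | NaN | 9.0 |
| 1 | 13.0 | 84 | 13 | NaN | 13.0 |
| 2 | 10.571429 | 84 | 10 | NaN | 10.0 |
| 3 | 8.0 | 85 | 8 | NaN | NaN |
| 4 | 7.0 | 87 | 7 | NaN | NaN |
| 5 | 7.0 | 91 | 7 | NaN | NaN |
| 6 | 9.0 | 92 | 9 | 9.0 | 9.0 |
| 7 | 9.0 | 94 | 9 | 9.0 | 9.0 |
| 8 | 11.0 | 94 | 11 | 11.0 | 11.0 |
| 9 | 10.571429 | 96 | 7 | 7.0 | NaN |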


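```python
si = SimpleImputer(strategy='median')
si
```

```python
# Apply median imputation to the MCAR column only
ct = ColumnTransformer([('Median_Imputation', si, ['MCAR'])],
                       remainder='passthrough',
                       verbose_feature_names_out=False).set_output(transform='pandas')
ct
```

```python
ct.fit_transform(df)
```

Output (first 10 rows):

|   | MCAR | IQ | Job_Performance_Ratings | MAR | MNAR |
|---|------|----|-------------------------|-----|------|
| 0 | 10.0 | 78 | 9 | NaN | 9.0 |
| 1 | 13.0 | 84 | 13 | NaN | 13.0 |
| 2 | 10.0 | 84 | 10 | NaN | 10.0 |
| 3 | 8.0 | 85 | 8 | NaN | NaN |
| 4 | 7.0 | 87 | 7 | NaN | NaN |
| 5 | 7.0 | 91 | 7 | NaN | NaN |
| 6 | 9.0 | 92 | 9 | 9.0 | 9.0 |
| 7 | 9.0 | 94 | 9 | 9.0 | 9.0 |
| 8 | 11.0 | 94 | 11 | 11.0 | 11.0 |
| 9 | 10.0 | 96 | 7 | 7.0 | NaN |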


```python
# Constant imputation: every missing value is replaced with 10
si = SimpleImputer(strategy='constant', fill_value=10)
si
```