- __Handling Missing Values__

1. Remove the rows that contain missing values
2. Fill in (impute) the missing values


__Imputation is of two types__

# 🧩 Types of Imputation (Missing Value Handling)

In **Machine Learning and Statistics**, imputation methods are **broadly classified into two main types**:

---

## 🔹 1. Univariate Imputation

### 📌 Definition
**Univariate imputation** fills missing values using **only the same column's data**.  
No information from other features is used.

---

### 🧠 Common Univariate Methods
- Mean Imputation
- Median Imputation
- Mode Imputation
- Constant Value Imputation
- Random Sample from the same column

---

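### 🧪 Example (Python)
```python
from sklearn.impute import SimpleImputer

# Replace missing 'Age' values with the column median
imputer = SimpleImputer(strategy='median')
# fit_transform returns a 2-D array, so flatten it before assigning to one column
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()
```
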
## 🔹 2. Multivariate Imputation

### 📌 Definition
**Multivariate imputation** fills missing values using **multiple features together**.  
The missing value is **predicted based on other columns**, capturing relationships between variables.

---

### 🧠 Common Multivariate Imputation Methods
- **KNN Imputation**
- **MICE (Multiple Imputation by Chained Equations)**
- **Regression-based Imputation**
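
As a rough illustration of the first method in the list, the sketch below uses scikit-learn's `KNNImputer` on a small made-up table. The column names, values, and `n_neighbors=2` are assumptions chosen for the example, not taken from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up toy data: Age and Salary are roughly related
df = pd.DataFrame({
    "Age":    [25, 27, np.nan, 40, 41, 42],
    "Salary": [50000, 52000, 51000, 90000, 88000, np.nan],
})

# Each missing value is replaced by the average of that column over the
# 2 nearest rows, where distance is measured on the features both rows have
imputer = KNNImputer(n_neighbors=2)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```

Because the method is distance-based, features on very different scales are usually standardised before imputing; that step is left out here to keep the sketch short.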


## 🧾 Complete Case Analysis (CCA)
*(Also known as Listwise Deletion)*
- If even one value is missing in any column, the entire row is removed.

---

### 📌 Definition
**Complete Case Analysis** is a method of handling missing data where **any row containing at least one missing value is removed** from the dataset.

Only rows with **no missing values** are kept for analysis or model training.

---

### 🧠 In Simple Words
> If a row has **even one missing value (NaN)**, the **entire row is dropped**.

---

### 🧪 Example

#### Original Data
| Age | Salary | City |
|----|--------|------|
| 25 | 50000 | Delhi |
| NaN | 60000 | Mumbai |
| 30 | NaN | Chennai |
| 28 | 55000 | Pune |

#### After Complete Case Analysis
| Age | Salary | City |
|----|--------|------|
| 25 | 50000 | Delhi |
| 28 | 55000 | Pune |

---

### 🧪 Python Example
```python
# Keep only the rows that have no missing values
df_complete = df.dropna()
```

## 📌 Assumptions of Complete Case Analysis (CCA)

For **Complete Case Analysis (CCA)** to give valid and unbiased results, the following assumptions must hold:

---

### 🔹 1. Missing Completely At Random (MCAR)
The most important assumption.

- The probability of missingness is **independent of both observed and unobserved data**
- Missing values occur **purely by chance**

✅ Example:
- Data missing due to accidental data entry error

❌ Violation:
- High-income people not reporting salary (missingness depends on the missing value itself)
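
One informal way to look for an MCAR violation is to compare an observed column between rows where another column is missing and rows where it is present. The sketch below does this on made-up data in which `Salary` is deliberately made to go missing more often for older people; the column names and the missingness mechanism are assumptions for illustration. A large gap between the groups is only a warning sign, not a formal test.

```python
import numpy as np
import pandas as pd

# Made-up data: Salary is missing more often for older people (not MCAR)
rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=1000)
salary = 1000.0 * age + rng.normal(0, 5000, size=1000)
df = pd.DataFrame({"Age": age, "Salary": salary})
df.loc[(df["Age"] > 45) & (rng.random(1000) < 0.5), "Salary"] = np.nan

# Label each row by whether Salary is missing, then compare mean Age.
# Under MCAR the two group means should be close to each other.
group = df["Salary"].isnull().map({True: "missing", False: "present"})
mean_age = df.groupby(group)["Age"].mean()

print(mean_age)
```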

---

### 🔹 2. Removed Rows Are Representative
- The rows that are dropped should be **similar to the rows that remain**
- No systematic difference between deleted and retained observations

If this is violated → **selection bias**

---

### 🔹 3. Sufficient Remaining Sample Size
- After removing incomplete rows, enough data must remain
- Otherwise:
  - Statistical power decreases
  - Model performance degrades

---

### 🔹 4. Missing Data Proportion Is Small
- CCA is generally considered acceptable when the share of missing data is **very low** (a common rule of thumb is < 5%)
- Large missing percentages → heavy data loss

---

### 🔹 5. No Informative Missingness
- Missingness itself should **not carry information**
- If missing values indicate something meaningful, CCA is inappropriate

Example:
- Missing medical test results because patients are too sick

---

## ⚠️ What Happens If Assumptions Fail?
- Biased estimates
- Loss of important patterns
- Poor model generalization

---

## 🎯 Exam / Interview Key Line
> **Complete Case Analysis assumes that data is Missing Completely At Random (MCAR) and that removing incomplete cases does not bias the results.**

---


## ✅ Advantages and ❌ Disadvantages of Complete Case Analysis (CCA)

---

## ✅ Advantages of CCA

### 🔹 1. Simple and Easy to Implement
- Requires only one command (`dropna()`)
- No modeling step is needed

---

### 🔹 2. No Artificial Data Introduced
- Does **not estimate or guess** missing values
- Original data integrity is preserved

---

### 🔹 3. No Imputation Bias
- Avoids bias caused by incorrect imputation methods
- Useful as a **baseline approach**

---

### 🔹 4. Fast and Computationally Efficient
- No extra computation
- Suitable for quick exploratory analysis

---

### 🔹 5. Easy to Explain (Exam & Interview Friendly)
- Conceptually straightforward
- Often preferred in theoretical explanations

---

## ❌ Disadvantages of CCA

### 🔹 1. Loss of Data
- Drops entire rows even if **only one value is missing**
- Can result in significant data loss

---

### 🔹 2. Reduced Sample Size
- Smaller dataset → lower statistical power
- Models may perform worse due to less data

---

### 🔹 3. Can Introduce Bias
- If data is **not MCAR**, results become biased
- Common in real-world datasets

---

### 🔹 4. Not Suitable for Small Datasets
- Data removal can make dataset unusable
- High risk when observations are limited

---

### 🔹 5. Ignores Missingness Information
- Missing values themselves may contain useful information
- CCA discards this signal completely
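
One way to keep that signal, instead of dropping rows, is to add a 0/1 indicator column before filling the gaps. The sketch below is a minimal version on made-up data; the `_was_missing` suffix and the choice of median filling are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "Age": [25, np.nan, 30, 28],
    "Salary": [50000, 60000, np.nan, 55000],
})

# Add one 0/1 flag per column, so a model can still see *that* a value
# was missing after the gap has been filled
for col in ["Age", "Salary"]:
    df[col + "_was_missing"] = df[col].isnull().astype(int)

# Fill the gaps with the column median (one possible choice among many)
df[["Age", "Salary"]] = df[["Age", "Salary"]].fillna(df[["Age", "Salary"]].median())

print(df)
```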

---

## 📊 Summary Table

| Aspect | Complete Case Analysis |
|------|------------------------|
| Implementation | Very easy |
| Data loss | High |
| Bias risk | High (if not MCAR) |
| Speed | Very fast |
| Real-world suitability | Limited |

---

## 🎯 Exam / Interview Key Line
> **Complete Case Analysis is simple and fast but can cause major data loss and bias if missingness is not completely at random.**

---


## 📌 When Complete Case Analysis (CCA) Is Used: Key Points

- When missing data is **very small** (roughly less than 5%)
- When data is **Missing Completely At Random (MCAR)**
- When the **dataset is large**
- During **quick exploratory data analysis (EDA)**
- When missingness is **not informative**
- If a column is missing most of its data (more than 95%), drop the column itself rather than the rows

> **CCA is used only when dropping rows will not introduce bias or reduce data quality.**
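
The column-level rule (drop a column when more than 95% of it is missing) can be sketched as follows. The data is made up, and the `threshold` variable is just a name introduced here for the 95% cut-off.

```python
import numpy as np
import pandas as pd

# Made-up data: 'notes' is almost entirely missing, 'age' only partly
df = pd.DataFrame({
    "age": [25, 30, np.nan, 28] * 25,
    "notes": [np.nan] * 99 + ["ok"],
})

threshold = 0.95  # assumed cut-off for "too much missing"

# Share of missing values per column, then keep only the columns
# whose share does not exceed the threshold
missing_share = df.isnull().mean()
df_reduced = df.loc[:, missing_share <= threshold]

print(list(df_reduced.columns))
```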


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
df = pd.read_csv(r"C:\Users\Lenovo\Downloads\data_science_job.csv")


In [19]:
df

| | enrollee_id | city | city_development_index | gender | relevent_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | training_hours | target |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 8949 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 20.0 | NaN | NaN | 36.0 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevent experience | no_enrollment | Graduate | STEM | 15.0 | 50-99 | Pvt Ltd | 47.0 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevent experience | Full time course | Graduate | STEM | 5.0 | NaN | NaN | 83.0 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevent experience | NaN | Graduate | Business Degree | 0.0 | NaN | Pvt Ltd | 52.0 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevent experience | no_enrollment | Masters | STEM | 20.0 | 50-99 | Funded Startup | 8.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19153 | 7386 | city_173 | 0.878 | Male | No relevent experience | no_enrollment | Graduate | Humanities | 14.0 | NaN | NaN | 42.0 | 1.0 |
| 19154 | 31398 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 14.0 | NaN | NaN | 52.0 | 1.0 |
| 19155 | 24576 | city_103 | 0.920 | Male | Has relevent experience | no_enrollment | Graduate | STEM | 20.0 | 50-99 | Pvt Ltd | 44.0 | 0.0 |
| 19156 | 5756 | city_65 | 0.802 | Male | Has relevent experience | no_enrollment | High School | NaN | 0.0 | 500-999 | Pvt Ltd | 97.0 | 0.0 |


In [20]:
df.isnull().sum()

enrollee_id                  0
city                         0
city_development_index     479
gender                    4508
relevent_experience          0
enrolled_university        386
education_level            460
major_discipline          2813
experience                  65
company_size              5938
company_type              6140
training_hours             766
target                       0
dtype: int64

In [23]:
df.isnull().mean()*100
# percentage of missing values in each column

# CCA is not applied to columns such as gender; it is used only on columns with less than 5% missing values

enrollee_id                0.000000
city                       0.000000
city_development_index     2.500261
gender                    23.530640
relevent_experience        0.000000
enrolled_university        2.014824
education_level            2.401086
major_discipline          14.683161
experience                 0.339284
company_size              30.994885
company_type              32.049274
training_hours             3.998330
target                     0.000000
dtype: float64

In [25]:
df.shape

(19158, 13)

In [27]:
# columns that have some missing values, but less than 5%
cols = [var for var in df.columns if 0 < df[var].isnull().mean() < 0.05]

In [28]:
cols

['city_development_index',
 'enrolled_university',
 'education_level',
 'experience',
 'training_hours']

In [31]:
df[cols].sample(23)

| | city_development_index | enrolled_university | education_level | experience | training_hours |
|----|----|----|----|----|----|
| 1634 | 0.754 | no_enrollment | Graduate | 7.0 | 100.0 |
| 15209 | 0.55 | no_enrollment | Primary School | 0.0 | 216.0 |
| 9725 | 0.925 | no_enrollment | Graduate | 20.0 | 34.0 |
| 659 | 0.92 | no_enrollment | Graduate | 20.0 | 40.0 |
| 9550 | 0.55 | no_enrollment | Masters | 5.0 | 9.0 |
| 4235 | NaN | no_enrollment | Phd | 7.0 | 7.0 |
| 582 | 0.92 | no_enrollment | Graduate | 20.0 | 53.0 |
| 14640 | 0.91 | Full time course | Graduate | 4.0 | 6.0 |
| 4192 | 0.91 | no_enrollment | Graduate | 7.0 | 105.0 |
| 4561 | 0.624 | no_enrollment | Graduate | 10.0 | 59.0 |


In [32]:
df[cols].isnull().sum()

city_development_index    479
enrolled_university       386
education_level           460
experience                 65
training_hours            766
dtype: int64

__Find how much data remains after dropping__

In [35]:
len(df[cols].dropna())/len(df)

0.8968577095730244

In [37]:
df_new = df[cols].dropna()
df_new

| | city_development_index | enrolled_university | education_level | experience | training_hours |
|----|----|----|----|----|----|
| 0 | 0.920 | no_enrollment | Graduate | 20.0 | 36.0 |
| 1 | 0.776 | no_enrollment | Graduate | 15.0 | 47.0 |
| 2 | 0.624 | Full time course | Graduate | 5.0 | 83.0 |
| 4 | 0.767 | no_enrollment | Masters | 20.0 | 8.0 |
| 5 | 0.764 | Part time course | Graduate | 11.0 | 24.0 |
| ... | ... | ... | ... | ... | ... |
| 19153 | 0.878 | no_enrollment | Graduate | 14.0 | 42.0 |
| 19154 | 0.920 | no_enrollment | Graduate | 14.0 | 52.0 |
| 19155 | 0.920 | no_enrollment | Graduate | 20.0 | 44.0 |
| 19156 | 0.802 | no_enrollment | High School | 0.0 | 97.0 |


The category proportions stay almost the same after dropping the incomplete rows:

In [38]:
df['education_level'].value_counts()

education_level
Graduate          11598
Masters            4361
High School        2017
Phd                 414
Primary School      308
Name: count, dtype: int64

In [39]:
df_new['education_level'].value_counts()

education_level
Graduate          10650
Masters            4022
High School        1845
Phd                 380
Primary School      285
Name: count, dtype: int64

In [45]:
Ratio = pd.concat([
    df['education_level'].value_counts()/len(df),
    df_new['education_level'].value_counts()/len(df_new),
],axis=1)

Ratio.columns=['Original','CCA']

Ratio

| education_level | Original | CCA |
|----|----|----|
| Graduate | 0.605387 | 0.619835 |
| Masters | 0.227633 | 0.234082 |
| High School | 0.105282 | 0.10738 |
| Phd | 0.02161 | 0.022116 |
| Primary School | 0.016077 | 0.016587 |


In [46]:
Ratio1 = pd.concat([
    df['enrolled_university'].value_counts()/len(df),
    df_new['enrolled_university'].value_counts()/len(df_new),
],axis=1)

Ratio1.columns=['Original','CCA']

Ratio1

| enrolled_university | Original | CCA |
|----|----|----|
| no_enrollment | 0.721213 | 0.735188 |
| Full time course | 0.196106 | 0.200733 |
| Part time course | 0.062533 | 0.064079 |


In [47]:
Ratio = pd.concat([
    df['training_hours'].value_counts()/len(df),
    df_new['training_hours'].value_counts()/len(df_new),
],axis=1)

Ratio.columns=['Original','CCA']

Ratio

| training_hours | Original | CCA |
|----|----|----|
| 28.0 | 0.016703 | 0.017809 |
| 18.0 | 0.014563 | 0.015074 |
| 12.0 | 0.014563 | 0.014958 |
| 22.0 | 0.014250 | 0.015074 |
| 50.0 | 0.014041 | 0.014841 |
| ... | ... | ... |
| 240.0 | 0.000261 | 0.000291 |
| 234.0 | 0.000261 | 0.000233 |
| 272.0 | 0.000261 | 0.000233 |
| 238.0 | 0.000209 | 0.000233 |
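
For a numeric column such as `training_hours`, per-value frequencies are hard to read because there are so many distinct values. One alternative, sketched below on made-up data, is to put the summary statistics before and after CCA side by side. The random data, the 4% missing rate, and the seed are assumptions for illustration and do not reproduce the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for two numeric columns of the dataset
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "training_hours": rng.integers(1, 337, size=2000).astype(float),
    "experience": rng.integers(0, 21, size=2000).astype(float),
})

# Knock out about 4% of each column completely at random
for col in df.columns:
    df.loc[rng.random(len(df)) < 0.04, col] = np.nan

df_new = df.dropna()

# Side-by-side summary statistics: similar values suggest that CCA
# did not change the shape of the distribution much
summary = pd.concat(
    [df["training_hours"].describe(), df_new["training_hours"].describe()],
    axis=1,
)
summary.columns = ["Original", "CCA"]

print(summary)
```

Overlaying two histograms with matplotlib is another common way to make the same comparison visually.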
