### **Identifying Missing Data**
Missing data in **Pandas** can arise for various reasons, such as incomplete records, data collection issues, or processing errors.
Pandas provides methods to detect, drop, and fill missing values.

---
##### ➡️ **Types of Missing Data in Pandas**
1. **`None`**: A Python singleton object used to represent the **absence of a value**. In numeric columns, Pandas converts it to `NaN`.
2. **`NaN` (Not a Number)**: A special floating-point value defined by the **IEEE 754 standard**, which Pandas uses to mark missing numerical data.
3. **`NaT` (Not a Time)**: Used for **missing datetime values**.

> Pandas treats **`None`**, **`NaN`**, and **`NaT`** as missing values in most operations.
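
The three markers can be checked side by side. The snippet below is a minimal sketch; the variable names (`values`, `numbers`, `dates`) are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# One marker of each kind, plus a valid value for contrast
values = [None, np.nan, pd.NaT, 1.5]
flags = [bool(pd.isna(v)) for v in values]
print(flags)  # [True, True, True, False]

# None in a numeric Series is converted to NaN, so the dtype stays float64
numbers = pd.Series([1.0, None, 3.0])
print(numbers.dtype, numbers.isna().tolist())

# A missing timestamp is stored as NaT
dates = pd.to_datetime(pd.Series(['2024-01-01', None]))
print(dates.isna().tolist())
```
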
---
##### ➡️ **Detecting Missing Data**
Pandas provides two key methods to identify missing data:
* **`isnull()` / `isna()`** → Detect missing values (returns `True` for missing entries).
* **`notnull()` / `notna()`** → Detect non-missing values (returns `True` for valid entries).
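
As a small sketch of how these methods relate (the DataFrame and variable names here are made up for illustration), `isnull()` and `isna()` return the same mask, `notna()` is its element-wise inverse, and either mask can be used to filter rows:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': ['x', None, 'z']})

# isnull() and isna() are two names for the same check
same_alias = df.isnull().equals(df.isna())

# notna() is the element-wise inverse of isna()
complement = df.notna().equals(~df.isna())
print(same_alias, complement)

# Keep only the rows where column A is present
rows_with_a = df[df['A'].notna()]
print(rows_with_a)

# Select the rows that contain at least one missing value
rows_with_gaps = df[df.isna().any(axis=1)]
print(rows_with_gaps)
```
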
---
##### ➡️ **Summary**
* **`isnull()`** and **`isna()`** → Identify missing values.
* **`notnull()`** and **`notna()`** → Identify valid (non-missing) values.
* Pandas handles **`None`**, **`NaN`**, and **`NaT`** consistently as missing data.

##### ➡️ **Example: Checking for Missing Data**

In [5]:
import pandas as pd
import numpy as np

# Create a DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['a', 'b', 'c', None]
})
# Boolean mask of missing values
print(f"# Boolean mask of missing values:\n{df.isnull()}\n")

# Count of missing values per column
print(f"Count of missing values per column:\n{df.isnull().sum()}")

# Boolean mask of missing values:
       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False
3  False  False   True

Count of missing values per column:
A    1
B    2
C    1
dtype: int64


➡️ **`Task:` Create a function that takes a DataFrame as input and returns the percentage of missing values in each column.**

In [8]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, np.nan, 8],
                   'C': ['a', 'b', 'c', None]})

# Function that returns the percentage of missing values in each column
def missing_percentage(frame):
    return frame.isnull().mean() * 100

percentage = missing_percentage(df)
print(f"Percentage of Missing values in each column:\n{percentage}")

Percentage of Missing values in each column:
A    25.0
B    50.0
C    25.0
dtype: float64


### **Dropping Missing Data**
In **Pandas**, missing data can be easily removed using the **`dropna()`** method.
This method helps you clean your dataset by removing rows or columns that contain **`NaN`**, **`None`**, or **`NaT`** values.

---
##### ➡️ **Key Concept**
The **`dropna()`** method removes rows or columns with missing data; its `axis`, `how`, `thresh`, and `subset` parameters control what is dropped.
> By default, it removes **rows** with any missing values.
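
The row and column behaviour can be compared with a short sketch. The DataFrame below is made up for illustration, with column `C` entirely missing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, np.nan, 6],
    'C': [np.nan, np.nan, np.nan],
})

# Default: drop every row that has at least one missing value.
# Column C is missing everywhere, so no row survives.
rows_default = df.dropna()

# axis='columns': drop every column that has at least one missing value
cols_any = df.dropna(axis='columns')

# how='all': drop a column only when all of its values are missing
cols_all = df.dropna(axis='columns', how='all')

print(len(rows_default), list(cols_any.columns), list(cols_all.columns))

# dropna() returns a new DataFrame; the original is unchanged
print(df.shape)
```
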
---



##### ➡️ **Important Notes**

* **`dropna(thresh=n)`** → Drops rows that have **fewer than `n` non-null values**.
* **`dropna(subset=['A', 'B'])`** → Drops rows where **either `'A'` or `'B'`** (or both) have missing values.
* **`dropna(subset=['A', 'B'], how='all')`** → Drops rows **only if all specified columns** are missing.
* **`subset`** parameter allows focusing on **specific columns**, ignoring others.
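
To see which rows each option keeps, here is a minimal sketch on a made-up DataFrame (not the one used elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, np.nan, 4],
    'B': [5, np.nan, 7, np.nan],
    'C': ['a', None, None, 'd'],
})
# Non-null counts per row: row 0 -> 3, row 1 -> 0, row 2 -> 1, row 3 -> 2

# thresh=2 keeps rows with at least 2 non-null values
kept_thresh = df.dropna(thresh=2).index.tolist()

# subset with the default how='any': drop a row if A or B is missing
kept_any = df.dropna(subset=['A', 'B']).index.tolist()

# subset with how='all': drop a row only if A and B are both missing
kept_all = df.dropna(subset=['A', 'B'], how='all').index.tolist()

print(kept_thresh, kept_any, kept_all)  # [0, 3] [0] [0, 2, 3]
```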

---

##### ➡️ **Summary**

| Method                                    | Description                                      |
| ----------------------------------------- | ------------------------------------------------ |
| `df.dropna()`                             | Drops rows with **any missing value**            |
| `df.dropna(how='all')`                    | Drops rows **only if all values are missing**    |
| `df.dropna(thresh=n)`                     | Drops rows with **fewer than n non-null values** |
| `df.dropna(subset=['A'])`                 | Drops rows where **column A** has missing values |
| `df.dropna(subset=['A', 'B'], how='any')` | Drops rows where **A or B** is missing           |

**`dropna()`** is a quick way to clean a dataset before analysis or modeling, but check how many rows it removes, since dropped rows are unavailable to every later step.


##### ➡️ **`Example:` Dropping Missing Data**

In [9]:
import pandas as pd
import numpy as np

# Sample DataFrame with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['a', 'b', 'c', None]
})

# Drop rows with any missing values
df_clean = df.dropna()

# Drop rows only if all values are missing
df_clean = df.dropna(how='all')

# Drop rows with fewer than 2 non-null values
df_clean = df.dropna(thresh=2)

# Drop rows where column 'A' has missing values
df_clean = df.dropna(subset=['A'])

# Drop rows where either 'A' or 'B' is missing
df_clean = df.dropna(subset=['A', 'B'], how='any')

➡️ **`Problem 1:` You are given a DataFrame. Perform the following operations.**
* Drop rows where **column A has missing values**
* Drop rows with **fewer than 2 non-null values**
* Drop rows where **both column A and column B have missing values**

In [10]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, None, 10],
    'C': ['a', 'b', 'c', None, 'e']
})
print(f"Original DataFrame:\n{df}\n")

# Drop rows where column A has missing values
clean_data_1 = df.dropna(subset=['A'])
print(f"Drop rows where column A has missing values:\n{clean_data_1}\n")

# Drop rows with fewer than 2 non-null values
clean_data_2 = df.dropna(thresh=2)
print(f"Drop rows with fewer than 2 non-null values:\n{clean_data_2}\n")

# Drop rows where both columns A and B have missing values
clean_data_3 = df.dropna(subset=['A', 'B'], how='all')
print(f"Drop rows where both columns A and B have missing values:\n{clean_data_3}")

Original DataFrame:
     A     B     C
0  1.0   5.0     a
1  2.0   NaN     b
2  NaN   NaN     c
3  4.0   NaN  None
4  5.0  10.0     e

Drop rows where column A has missing values:
     A     B     C
0  1.0   5.0     a
1  2.0   NaN     b
3  4.0   NaN  None
4  5.0  10.0     e

Drop rows with fewer than 2 non-null values:
     A     B  C
0  1.0   5.0  a
1  2.0   NaN  b
4  5.0  10.0  e

Drop rows where both columns A and B have missing values:
     A     B     C
0  1.0   5.0     a
1  2.0   NaN     b
3  4.0   NaN  None
4  5.0  10.0     e


### **Filling Missing Data in Pandas**
When cleaning data, we can **fill** missing values instead of removing the rows that contain them.
Pandas provides `fillna()`, `ffill()`, and `bfill()` for this purpose.

---
##### ➡️ **Explanation of Methods**

| Method                           | Description                                                              |
| -------------------------------- | ------------------------------------------------------------------------ |
| `fillna(0)`                      | Replaces all missing values with `0`.                                    |
| `fillna({'A': 0, 'B': 99})`      | Fills each column with custom values.                                    |
| `ffill()`                        | Forward fill – carries the **last valid value forward**.                 |
| `bfill()`                        | Backward fill – uses the **next valid value**.                           |
| `df['A'].fillna(df['A'].mean())` | Fills column `A` with its **mean** (median or mode work the same way).   |
---
##### ➡️ **Important Considerations**
* **Data Type:** The fill value should match the column’s data type; a mismatched value (for example, a string in a numeric column) can change the column’s dtype.
* **Bias:** Filling values may **introduce bias** if not done carefully.
* **Reason for Missingness:** Always understand **why** data is missing before deciding how to fill it.
* **Impact on Analysis:** The chosen fill method can **affect statistical results** and insights.
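
The point about bias can be illustrated with a small sketch on made-up numbers: filling with the mean leaves the mean unchanged but shrinks the spread, because every filled value sits exactly at the center.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries
s = pd.Series([1.0, 2.0, np.nan, np.nan, 9.0])
filled = s.fillna(s.mean())

# The mean is preserved by construction
print(s.mean(), filled.mean())

# The standard deviation drops, so the filled data looks less variable
# than the observed values actually were
print(round(s.std(), 3), round(filled.std(), 3))
```

Statistics that depend on variance, such as correlations or confidence intervals, may therefore look more certain after mean imputation than the observed data justifies.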

##### ➡️ **Common Methods to Fill Missing Data**

In [12]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': ['a', 'b', 'c', None]
})
# Fill all missing values with 0
df_filled = df.fillna(0)

# Fill different values for different columns
df_filled = df.fillna({'A': 0, 'B': 99, 'C': 'Unknown'})

# Forward fill - propagate last valid observation forward
df_ffilled = df.ffill()

# Backward fill - use next valid observation to fill the gap
df_bfilled = df.bfill()

# Fill missing values using statistical measures
df['A'] = df['A'].fillna(df['A'].mean())     # Fill with mean
df.fillna({'B': df['B'].median()}, inplace=True)   # Fill with median
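
Forward and backward fills have edge cases worth knowing. The sketch below uses a made-up Series to show that `ffill()` cannot fill a leading gap, `bfill()` cannot fill a trailing gap, and the `limit` parameter caps how many consecutive gaps are filled:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: the leading NaN has no earlier value to copy, so it stays
forward = s.ffill()

# Backward fill: the trailing NaN has no later value to copy, so it stays
backward = s.bfill()

# limit=1 fills at most one consecutive missing value per gap
limited = s.ffill(limit=1)

print(forward.tolist())
print(backward.tolist())
print(limited.tolist())
```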

➡️ **`Problem 2:` Perform the following operations on the given DataFrame.**
* **Replace null values** so that the result matches the expected output below: fill column A with `0`, column B with `99.0`, and column C with `'Unknown'`
* **Forward fill only column B**
* Create a function that **fills missing values in numeric columns with the mean** of the column and in **string columns with 'Unknown'.**

In [6]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, np.nan, None, 10],
    'C': ['a', 'b', 'c', None, 'e']
})
print(f"Original DataFrame:\n{df}\n")

# Replace null values per column: A -> 0, B -> 99.0, C -> 'Unknown'
replace = df.fillna({'A': 0, 'B': 99.0, 'C': "Unknown"})
print(f"Replace null values per column [A: 0, B: 99.0, C: 'Unknown']:\n{replace}\n")

# Forward fill only column B
df_copy = df.copy()  # Work on a copy so the original DataFrame stays unchanged

df_copy['B'] = df_copy['B'].ffill()
print(f"Forward Filling of Column B:\n{df_copy}\n")

# Function that fills missing values in numeric columns with the column mean and in all other columns with 'Unknown'
def fill_missing(frame):
    filled = frame.copy()  # Leave the caller's DataFrame unchanged
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].mean())
        else:
            filled[column] = filled[column].fillna("Unknown")
    return filled

df = fill_missing(df)
print(f"Filling missing values in Numeric cols with Mean & String cols with 'Unknown':\n{df}")

Original DataFrame:
     A     B     C
0  1.0   5.0     a
1  2.0   NaN     b
2  NaN   NaN     c
3  4.0   NaN  None
4  5.0  10.0     e

Replace null values per column [A: 0, B: 99.0, C: 'Unknown']:
     A     B        C
0  1.0   5.0        a
1  2.0  99.0        b
2  0.0  99.0        c
3  4.0  99.0  Unknown
4  5.0  10.0        e

Forward Filling of Column B:
     A     B     C
0  1.0   5.0     a
1  2.0   5.0     b
2  NaN   5.0     c
3  4.0   5.0  None
4  5.0  10.0     e

Filling missing values in Numeric cols with Mean & String cols with 'Unknown':
     A     B        C
0  1.0   5.0        a
1  2.0   7.5        b
2  3.0   7.5        c
3  4.0   7.5  Unknown
4  5.0  10.0        e
