# HANDLING MISSING VALUES


- When we work with data in machine learning, it’s pretty common to see some missing values: maybe someone didn’t answer a survey question, or a sensor didn’t record properly. The problem is that most machine learning models can’t handle empty cells; they need every value to be filled.

- That’s where **SimpleImputer** from scikit-learn comes in. 
> Think of it as a quick way to "fill in the blanks." You can choose how you want to fill them:

- If it’s numbers, you might use the **mean** (the average of the column) so the missing value fits in smoothly.

- Or, you could use the **median**, which is safer if you have extreme outliers.

- If it’s categories, like gender or product type, 
> you can use the **most_frequent** value (basically the most common answer).
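
To make the three options concrete, here is a minimal sketch on a made-up table. The `toy` DataFrame and its `score` and `color` columns are hypothetical and are not the dataset used in the cells below; only `SimpleImputer` and its `strategy` values come from scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns, not the dataset used later in these notes.
toy = pd.DataFrame({
    "score": [10.0, 20.0, np.nan, 30.0, 100.0],  # 100 acts as an extreme value
    "color": pd.Series(["red", "blue", "red", np.nan, "red"], dtype=object),
})

# Mean of the observed scores: (10 + 20 + 30 + 100) / 4 = 40
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy[["score"]])

# Median of the observed scores: halfway between 20 and 30 = 25
median_filled = SimpleImputer(strategy="median").fit_transform(toy[["score"]])

# Most common observed color: "red"
mode_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["color"]])

print(mean_filled.ravel())    # the gap becomes 40.0
print(median_filled.ravel())  # the gap becomes 25.0
print(mode_filled.ravel())    # the gap becomes 'red'
```

Notice how the single large score pulls the mean fill (40) well above the median fill (25); that is the reason the median is the safer choice when a column has extreme values.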

In [3]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer



In [7]:
data = {
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Age": [25, np.nan, 35, 40, np.nan, 30],
    "Gender": ["Male", "Female", np.nan, "Female", "Male", np.nan],
    "Income": [50000, 60000, np.nan, 80000, 75000, np.nan],
    "Purchased": ["Yes", "No", "Yes", np.nan, "No", "Yes"]
}


df = pd.DataFrame(data)

display(data)

{'CustomerID': [1, 2, 3, 4, 5, 6],
 'Age': [25, nan, 35, 40, nan, 30],
 'Gender': ['Male', 'Female', nan, 'Female', 'Male', nan],
 'Income': [50000, 60000, nan, 80000, 75000, nan],
 'Purchased': ['Yes', 'No', 'Yes', nan, 'No', 'Yes']}

In [10]:
df.isnull().sum()


CustomerID    0
Age           2
Gender        2
Income        2
Purchased     1
dtype: int64

In [33]:
# Categorical columns: fill each gap with the column's most common value
imputerCategorial = SimpleImputer(strategy='most_frequent')
df[["Gender"]] = imputerCategorial.fit_transform(df[["Gender"]])
df[["Purchased"]] = imputerCategorial.fit_transform(df[["Purchased"]])

# Numerical columns: fill each gap with the column's mean
imputeNumerical = SimpleImputer(strategy='mean')

df[["Income"]] = imputeNumerical.fit_transform(df[["Income"]])
df[["Age"]] = imputeNumerical.fit_transform(df[["Age"]])

print(df)



   CustomerID   Age  Gender   Income Purchased
0           1  25.0    Male  50000.0       Yes
1           2  32.5  Female  60000.0        No
2           3  35.0  Female  66250.0       Yes
3           4  40.0  Female  80000.0       Yes
4           5  32.5    Male  75000.0        No
5           6  30.0  Female  66250.0       Yes


In [34]:
nullValues = df.isnull().sum().sum()

if nullValues == 0:
    print("No Null Values")
else:
    print(nullValues)

No Null Values


# WHAT I LEARNED HERE

- Choosing the right strategy to handle null values is very important; you can’t just fill them randomly.
- For example, 
> if the missing data is numeric (like income, age, salary, a weight in kilograms, or a length in meters), you need to be careful about which method you use.

- The **Most Frequent** strategy works for both numbers and categories. It simply fills the missing value with the most common one in the column. If there’s a tie, it picks the smallest value (for text, that means the one that comes first in sorted order).

- The **Mean strategy** only works for numerical data. It replaces missing values with the average of the existing numbers, so the column’s average stays the same after filling. The downside is that a few extreme values can pull that average up or down.

- **Median** is for numerical data. It fills missing values with the middle value (when sorted) and works well when there are outliers since it’s less affected than the mean.
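
As a rough check of these three rules, the sketch below applies each strategy to the observed Income and Gender values from the table above. The variable names are mine, and the arrays are rebuilt by hand rather than taken from `df`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Observed Income values from the dataset above, plus one missing entry.
income = np.array([[50000.0], [60000.0], [80000.0], [75000.0], [np.nan]])

# Every observed income appears exactly once, so they all tie for
# "most frequent"; the smallest of the tied values is used.
mode_fill = SimpleImputer(strategy="most_frequent").fit_transform(income)[-1, 0]

# Mean: (50000 + 60000 + 80000 + 75000) / 4
mean_fill = SimpleImputer(strategy="mean").fit_transform(income)[-1, 0]

# Median: halfway between the two middle values, 60000 and 75000
median_fill = SimpleImputer(strategy="median").fit_transform(income)[-1, 0]

# Gender has two "Male" and two "Female", so this is a tie between strings.
gender = np.array(
    [["Male"], ["Female"], [np.nan], ["Female"], ["Male"], [np.nan]],
    dtype=object,
)
gender_fill = SimpleImputer(strategy="most_frequent").fit_transform(gender)[2, 0]

print(mode_fill, mean_fill, median_fill)  # 50000.0 66250.0 67500.0
print(gender_fill)                        # Female
```

The same column gets three different fills depending on the strategy, which is why the choice matters. The Gender tie resolves to "Female" because it sorts before "Male".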


# THERE ARE ALSO OUTLIERS
## So, what are outliers?
- Well, I realized they’re kinda like the **"black sheep"** in my data: the values that don’t fit in with the rest. They’re either way too high or way too low compared to most of the other numbers.

> For example, if most ages in my dataset are around 18–40, and suddenly I see 120, that’s definitely an outlier, the "black sheep".
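
One common way to flag values like that 120 is the 1.5 × IQR rule. This is an assumption on my part and not something covered in the notes above, and the `ages` array is made up for illustration. A minimal sketch:

```python
import numpy as np

# Hypothetical ages: most fall in the 18-40 range, plus one suspicious 120.
ages = np.array([18, 22, 25, 30, 35, 40, 120])

# Interquartile range: the spread of the middle 50% of the values.
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Anything further than 1.5 * IQR outside the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]

print(outliers)                      # [120]
print(ages.mean(), np.median(ages))  # the mean is pulled above 41, the median stays at 30
```

The last line shows why outliers matter for imputation: one extreme age drags the mean above every "normal" value in the column, while the median barely moves.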
