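Q1: Missing values in a dataset are values that were not observed or not recorded during data collection. They can reduce the quality and accuracy of data analysis and machine learning models, and many algorithms cannot be trained on them at all. Algorithms that can tolerate missing values include some implementations of **decision trees** and the ensembles built on them, such as **random forests** and gradient-boosted trees¹². **k-nearest neighbors** is often listed here as well, but a standard k-NN model needs complete feature vectors to compute distances, so it requires either imputation or a distance metric that ignores missing coordinates.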

Q2: Some techniques to handle missing data are:

- Deleting the missing values: This involves removing the rows or columns that contain missing values. This is only advisable if the missing values are few and not important for the analysis.
- Imputing the missing values: This involves replacing the missing values with some estimated values, such as the mean, median, mode, or a constant value. This can help preserve the size and structure of the dataset, but it may introduce bias or noise.
- Imputing the missing values for categorical features: This involves replacing the missing values with the most frequent category, a new category, or a value based on some logic or domain knowledge.
- Imputing the missing values using the scikit-learn library: This involves using imputer classes from `sklearn.impute`, such as `SimpleImputer`, `KNNImputer`, or `IterativeImputer`, to fill in the missing values based on different strategies and algorithms.
- Using "Missingness" as a feature: This involves creating a new binary feature that indicates whether a value is missing or not. This can help capture some information about the missingness pattern and its relation to the target variable.
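Two of the techniques above, deletion and using missingness as a feature, can be sketched with pandas alone. This is a minimal illustration, not a recommended pipeline: the dataframe `df_demo` and the column name `Age_missing` are hypothetical, and filling with the median is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df_demo = pd.DataFrame({"Age": [25, 30, 35, np.nan, 40],
                        "Salary": [50000, 60000, np.nan, 80000, 90000]})

# Deletion: drop every row that contains at least one missing value
df_dropped = df_demo.dropna()

# Missingness as a feature: record which rows were missing, then impute
df_flagged = df_demo.copy()
df_flagged["Age_missing"] = df_flagged["Age"].isna().astype(int)
df_flagged["Age"] = df_flagged["Age"].fillna(df_flagged["Age"].median())

print(df_dropped)
print(df_flagged)
```

Note that deletion removes two of the five rows here, which shows how quickly it can shrink a small dataset. The indicator column has to be created before imputing, otherwise the information about which rows were missing is lost.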

Here is an example of imputing the missing values using scikit-learn in Python:

```python
# Import libraries (numpy is needed for np.nan)
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Create a sample dataframe with missing values
df = pd.DataFrame({"Age": [25, 30, 35, np.nan, 40],
                   "Salary": [50000, 60000, np.nan, 80000, 90000],
                   "Gender": ["M", "F", np.nan, "M", "F"]})

# Create an imputer object with mean strategy for numeric features
imputer_num = SimpleImputer(strategy="mean")

# Fit and transform the numeric columns
df[["Age", "Salary"]] = imputer_num.fit_transform(df[["Age", "Salary"]])

# Create an imputer object with most_frequent strategy for categorical features
imputer_cat = SimpleImputer(strategy="most_frequent")

# Fit and transform the categorical column
df[["Gender"]] = imputer_cat.fit_transform(df[["Gender"]])

# Print the imputed dataframe
print(df)
```

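Output (the missing Age is filled with the mean 32.5 and the missing Salary with the mean 70000; "M" and "F" are tied for Gender, and `SimpleImputer` breaks ties by taking the smallest value, "F"):

```
    Age   Salary Gender
0  25.0  50000.0      M
1  30.0  60000.0      F
2  35.0  70000.0      F
3  32.5  80000.0      M
4  40.0  90000.0      F
```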
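The other scikit-learn imputers follow the same fit/transform pattern. As a minimal sketch, assuming the same numeric columns as in the example, `KNNImputer` fills each missing value with the average of that feature over the nearest rows that have it; the variable names `df_num` and `imputed` and the choice `n_neighbors=2` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Numeric columns only: KNNImputer does not handle string categories
df_num = pd.DataFrame({"Age": [25, 30, 35, np.nan, 40],
                       "Salary": [50000, 60000, np.nan, 80000, 90000]})

# Each missing value becomes the mean of that column over the 2 nearest rows,
# with distances computed on the features both rows have observed
knn_imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(knn_imputer.fit_transform(df_num),
                       columns=df_num.columns)

print(imputed)
```

Because the distances are computed on raw values, features on very different scales (such as Age and Salary) are usually scaled before using a neighbor-based imputer.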

