# Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

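## Missing values: Data points that are absent for one or more variables in an observation (typically stored as NaN, None, or blank). Handling them is essential because most algorithms cannot train on them, and dropping or ignoring them carelessly shrinks the dataset and can bias the analysis. Decision trees, random forests, and XGBoost are less affected, although native support depends on the implementation: XGBoost learns a default split direction for missing values, while some decision tree and random forest libraries still require imputation first.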
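
Before choosing a strategy it helps to measure how much is missing. The sketch below uses a small made-up table (the column names and values are assumptions, not part of the assignment) and counts gaps per column and per row with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; np.nan and None both mark absent entries.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

missing_per_column = df.isna().sum()              # number of missing cells per column
missing_fraction = df.isna().mean()               # share of missing cells per column
rows_with_missing = df.isna().any(axis=1).sum()   # rows with at least one gap

print(missing_per_column)
print(missing_fraction)
print(rows_with_missing)
```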


# Q2: List the techniques used to handle missing data. Give an example of each with Python code.

In [3]:
# Deletion: Remove rows or columns with missing values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6], 'C': [7, 8, 9]})
df_rows = df.dropna()        # Drop rows with any missing value; df itself is left unchanged
df_cols = df.dropna(axis=1)  # Drop columns with any missing value


In [4]:
# Imputation: Fill missing values with statistical measures.
df_mean = df.fillna(df.mean())  # Fill each column's gaps with that column's mean; returns a new DataFrame


In [5]:
# Interpolation: Estimate missing values based on neighboring values.
df_interp = df.interpolate()  # Linear interpolation down each column (the default); returns a new DataFrame


# Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

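## Imbalanced data: A dataset in which the classes are unequally represented, with one class making up the large majority of samples. If left unhandled, models become biased toward the majority class: overall accuracy can look high while the minority class, usually the one of interest, is rarely predicted correctly.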
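
A minimal sketch of why accuracy misleads here, assuming made-up labels with a 99:1 split: a "model" that always predicts the majority class scores 99% accuracy yet never finds a single minority case.

```python
import numpy as np

# Hypothetical labels: 990 negatives (0) and 10 positives (1).
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that ignores its input and always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
# prints: accuracy=0.99, minority recall=0.00
```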


# Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

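## **Up-sampling:** Increases the number of minority class instances, by duplicating existing samples or generating synthetic ones. Use when the minority class is too small for the model to learn from and the data is too scarce to discard.

## **Down-sampling:** Decreases the number of majority class instances by removing samples. Use when the majority class is large enough that discarding part of it loses little information, or when training on the full dataset is too costly.

## **Example:** Fraud detection with 99% non-fraudulent and 1% fraudulent transactions. Up-sampling the fraudulent transactions helps the model learn the minority pattern. If the dataset holds millions of non-fraudulent transactions, down-sampling them reaches a similar balance while cutting training time.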
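
Both operations can be sketched with plain pandas sampling. The table below is made up (95 legitimate and 5 fraudulent rows; the column names `amount` and `is_fraud` are assumptions), and the target of a perfect 1:1 balance is just one possible choice.

```python
import pandas as pd

# Hypothetical transactions: 95 legitimate (is_fraud=0), 5 fraudulent (is_fraud=1).
df = pd.DataFrame({
    "amount": range(100),
    "is_fraud": [0] * 95 + [1] * 5,
})
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until both classes match.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: keep a random subset of majority rows, without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_fraud"].value_counts())
print(downsampled["is_fraud"].value_counts())
```

In practice, resampling is applied to the training split only, so that the test set keeps the original class distribution.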


# Q5: What is data Augmentation? Explain SMOTE.

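## **Data Augmentation:** Creating new training samples from existing data by applying label-preserving transformations, often random ones (e.g. flipping, rotating, or cropping images, or adding noise). It increases the size and diversity of the training set, which improves model generalization.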

## **SMOTE (Synthetic Minority Over-sampling Technique):** Creates synthetic data points for the minority class in imbalanced datasets. It works by selecting a minority class instance, finding its k nearest minority class neighbors, and creating new data points at random positions along the line segments joining the instance and its neighbors.
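
The SMOTE steps described above can be sketched in plain NumPy. This is an illustrative simplification, not the reference implementation: the function name `smote_sketch`, its parameters, and the four sample points are all assumptions, and it requires `k` to be smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))   # pick a minority sample
        nb = rng.choice(neighbors[base])  # pick one of its k nearest neighbors
        gap = rng.random()                # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_sketch(X_minority, n_new=6)
print(X_new)
```

For real projects, the `SMOTE` class in the imbalanced-learn package (`imblearn.over_sampling`) is the usual choice.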


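# Q6: What are outliers in a dataset? Why is it essential to handle outliers?

## Outliers: Data points that differ significantly from the rest of the data. They can distort statistics such as the mean and variance, bias model parameters, and degrade model performance. It is essential to identify them and handle them appropriately (remove, cap, or transform) after checking whether they are errors or genuine extreme values.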
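
One common way to flag outliers is the interquartile range (IQR) rule, sketched below on made-up numbers. The 1.5 multiplier is a convention rather than a law, and capping is only one of several possible treatments.

```python
import numpy as np

# Hypothetical measurements; 95 looks suspicious next to the rest.
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 95])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# One option: cap extreme values at the fences instead of dropping them.
capped = np.clip(values, lower, upper)

print(outliers)
print(capped)
```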


# Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

## **Techniques to handle missing customer data:**

* **Deletion:** Remove rows or columns with missing values.
* **Imputation:** Fill missing values with statistical measures (mean, median, mode) or predictive models.
* **Categorization:** Create a new category for missing values.
* **Analysis:** Explore why the data is missing and how it affects the analysis before choosing a method.
* **Algorithm selection:** Choose algorithms that tolerate missing data, such as XGBoost or decision tree and random forest implementations with native missing-value support.
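
A minimal sketch of the imputation and categorization techniques above, on a made-up customer table (the column names and values are assumptions). It also records where values were missing, which keeps that information available to later analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with gaps in numeric and categorical columns.
customers = pd.DataFrame({
    "age": [34, np.nan, 45, 29, np.nan],
    "income": [52000, 61000, np.nan, 48000, 75000],
    "segment": ["retail", None, "business", "retail", None],
})

cleaned = customers.copy()

# Keep a record of where values were missing before filling them.
cleaned["age_was_missing"] = customers["age"].isna()

# Imputation: the median is less sensitive to extreme values than the mean.
cleaned["age"] = customers["age"].fillna(customers["age"].median())
cleaned["income"] = customers["income"].fillna(customers["income"].median())

# Categorization: treat "missing" as its own category.
cleaned["segment"] = customers["segment"].fillna("Unknown")

print(cleaned)
```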


# Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

## **Strategies to determine missing data pattern:**

* **Missing Completely At Random (MCAR):** The probability that a value is missing is unrelated to any variable, observed or unobserved.
* **Missing At Random (MAR):** Missingness is related to other observed variables, but not to the missing value itself.
* **Missing Not At Random (MNAR):** Missingness depends on the unobserved value itself, even after accounting for the observed data.

**Techniques:**

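* **Visualization:** Plot missingness as a matrix or heatmap to spot columns or rows that tend to be missing together.
* **Statistical tests:** Test whether missingness is related to other variables, e.g. Little's MCAR test, or a t-test or chi-square test comparing rows with and without missing values.
* **Domain knowledge:** Use expert knowledge to understand potential reasons for missing data. MNAR cannot be confirmed from the observed data alone, so this is often the deciding input.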
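
A simple first check is to build a missingness indicator and compare an observed variable across the two groups. The sketch below uses made-up survey data (the columns `age` and `income` are assumptions) in which income tends to be missing for older respondents. A clear difference argues against MCAR; it does not by itself separate MAR from MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing mostly for older respondents.
df = pd.DataFrame({
    "age":    [22, 25, 31, 38, 45, 52, 58, 63, 67, 71],
    "income": [30, 32, 40, 45, np.nan, 60, np.nan, np.nan, 70, np.nan],
})

# Indicator column: True where income is missing.
df["income_missing"] = df["income"].isna()

# Compare an observed variable between rows with and without the gap.
mean_age_missing = df.loc[df["income_missing"], "age"].mean()
mean_age_observed = df.loc[~df["income_missing"], "age"].mean()

# Correlation between the indicator and the observed variable.
corr = df["income_missing"].astype(int).corr(df["age"])

print(mean_age_missing, mean_age_observed, corr)
```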


# Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

## **Strategies for imbalanced dataset evaluation:**

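* **Precision, recall, F1-score:** Measure performance on the minority class, which accuracy alone hides.
* **ROC curve, AUC:** Plot true positive rate against false positive rate across thresholds and summarize it as the area under the curve. When positives are rare, the precision-recall curve is often more informative.
* **Confusion matrix:** Break predictions down into true/false positives and negatives to see which kinds of errors the model makes.
* **Class weighting:** Assign higher weights to the minority class during training.
* **Undersampling/oversampling:** Balance the training set before training; keep the test set at its original class distribution.
* **Cost-sensitive learning:** Assign different costs to misclassification errors, e.g. a higher cost for a missed diagnosis than for a false alarm.

The last three are training-time remedies rather than metrics; their effect should be judged with the measures listed first.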
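
The confusion matrix and the metrics derived from it can be computed directly. The labels and predictions below are made up for illustration, and the sketch assumes at least one actual and one predicted positive so that no division by zero occurs.

```python
import numpy as np

# Hypothetical ground truth (1 = has the condition) and model predictions.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Confusion matrix cells.
tp = int(((y_pred == 1) & (y_true == 1)).sum())  # correctly flagged patients
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false alarms
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # missed patients
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # correctly cleared patients

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of those flagged, how many truly have the condition
recall = tp / (tp + fn)      # of those with the condition, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is 0.80 while recall is only 0.50, which shows how accuracy overstates performance on the minority class. The `sklearn.metrics` module offers the same measures as `confusion_matrix`, `precision_score`, `recall_score`, and `f1_score`.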


# Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

## **Down-sampling majority class:**

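* **Random under-sampling:** Randomly remove instances from the majority class until the desired class ratio is reached.
* **Cluster-based under-sampling:** Cluster the majority class, then keep only representative instances (such as the cluster centroids) from each cluster.
* **NearMiss:** Keeps the majority class instances that lie closest to minority class instances and discards the rest.
* **Tomek links:** A Tomek link is a pair of instances from opposite classes that are each other's nearest neighbors. Removing the majority class member of each pair cleans up the class boundary.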
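
Random under-sampling and Tomek-link removal can be sketched in NumPy as below. The function names and the one-dimensional toy data are assumptions made for illustration, the random under-sampler assumes a binary problem, and ties between equally distant neighbors are broken by index order.

```python
import numpy as np

def random_undersample(X, y, majority_label, seed=0):
    """Randomly drop majority rows until both classes have equal counts (binary case)."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.sort(np.concatenate([keep_maj, min_idx]))
    return X[keep], y[keep]

def remove_tomek_links(X, y, majority_label):
    """Drop majority points that form a Tomek link with a point of another class."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a point is not its own neighbor
    nearest = dists.argmin(axis=1)           # nearest neighbor of every point
    idx = np.arange(len(X))
    mutual = nearest[nearest] == idx         # i and nearest[i] pick each other
    opposite = y != y[nearest]               # and belong to different classes
    drop = mutual & opposite & (y == majority_label)
    return X[~drop], y[~drop]

# Hypothetical 1-D data: label 1 = satisfied (majority), 0 = dissatisfied.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [4.4], [5.0], [9.0]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0])

X_rus, y_rus = random_undersample(X, y, majority_label=1)
X_tl, y_tl = remove_tomek_links(X, y, majority_label=1)

print(len(y_rus), int((y_rus == 1).sum()))  # rows kept, majority rows kept
print(X_tl.ravel())                         # the point at 4.0 has been removed
```

The imbalanced-learn package provides tested versions of all four methods in `imblearn.under_sampling` (`RandomUnderSampler`, `ClusterCentroids`, `NearMiss`, `TomekLinks`).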
