#### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Ans:
- Missing values are absent data values in a dataset.
- It's crucial to handle missing values as they can lead to biased or inaccurate model predictions. Ignoring missing values can also lead to incorrect conclusions, as it can skew the distribution of the remaining data.
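- No algorithm is completely unaffected, but some implementations can train on data that contains missing values. Tree-based gradient boosting libraries such as XGBoost and LightGBM learn a default split direction for missing values, and scikit-learn's histogram-based gradient boosting estimators accept `NaN` directly. Distance-based methods such as k-nearest neighbors (KNN) and Support Vector Machines (SVM) cannot compute distances or kernels over missing entries, so they need deletion or imputation first.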
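
Before choosing how to handle missing values, it helps to measure how many there are. The snippet below is a minimal sketch on a made-up table (the column names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; pandas treats both None and np.nan as missing.
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["Pune", None, "Delhi", None]})

missing_per_column = df.isna().sum()             # count of missing cells per column
missing_fraction = df.isna().mean()              # share of missing cells per column
rows_with_missing = df.isna().any(axis=1).sum()  # rows with at least one gap

print(missing_per_column)
print(missing_fraction)
print("Rows with at least one missing value:", rows_with_missing)
```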

#### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Ans: 
1. Deletion
2. Mean/median imputation
3. Forward/Backward filling
4. Interpolation

import pandas as pd
import numpy as np

# Deletion
df = pd.DataFrame({'A': [1, 2, None, 4, None, 6],
                   'B': [6, None, 8, None, 10, None]})
print(df, '\n')
print("Deletion:\n", df.dropna())


# Mean imputation (swap in df.median() for median imputation)
df = pd.DataFrame({'A': [1, 2, None, 4, None, 6],
                   'B': [6, None, 8, None, 10, None]})
print("\nMean/median imputation:\n", df.fillna(df.mean()))


# Forward/Backward filling
df = pd.DataFrame({'A': [1, 2, None, 4, None, 6],
                   'B': [6, None, 8, None, 10, None]})
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() give the same result
print("\nForward/Backward filling:\n", df.ffill().bfill())


# Interpolation
df = pd.DataFrame({'A': [1, 2, None, 4, None, 6],
                   'B': [6, None, 8, None, 10, None]})
print("\nInterpolation:\n", df.interpolate(method='linear'))

     A     B
0  1.0   6.0
1  2.0   NaN
2  NaN   8.0
3  4.0   NaN
4  NaN  10.0
5  6.0   NaN 

Deletion:
      A    B
0  1.0  6.0

Mean/median imputation:
       A     B
0  1.00   6.0
1  2.00   8.0
2  3.25   8.0
3  4.00   8.0
4  3.25  10.0
5  6.00   8.0

Forward/Backward filling:
      A     B
0  1.0   6.0
1  2.0   6.0
2  2.0   8.0
3  4.0   8.0
4  4.0  10.0
5  6.0  10.0

Interpolation:
      A     B
0  1.0   6.0
1  2.0   7.0
2  3.0   8.0
3  4.0   9.0
4  5.0  10.0
5  6.0  10.0


#### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Ans: **Imbalanced data** refers to a situation where the classes in a classification problem are not represented equally. 

If imbalanced data is not handled, a model trained on it tends to favor the majority class. It can reach high overall accuracy while predicting the minority class poorly, and the minority class is often the one of interest (fraud, disease, defects).
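
A small illustration of the problem, using made-up labels (95 negatives and 5 positives are an assumption chosen for the example): a "model" that always predicts the majority class looks accurate but never finds a positive.

```python
import numpy as np

# Hypothetical labels: 95 negatives (class 0) and 5 positives (class 1).
y_true = np.array([0] * 95 + [1] * 5)

# A baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
# Recall for the minority class: share of true positives that were found.
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy        = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```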

#### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Ans: Up-sampling and down-sampling are techniques used to address imbalanced data.

Up-sampling involves increasing the number of instances in the minority class, while down-sampling involves decreasing the number of instances in the majority class.

For example, in a binary classification problem with 100 instances of which only 10 are positive, up-sampling can add positive instances (by duplication or synthesis) until the classes are closer to balanced. This suits small datasets where no data can be spared. Down-sampling instead discards negative instances, which suits cases where the negative class is so large that dropping part of it loses little information and speeds up training.

Up-sampling and down-sampling are required when dealing with imbalanced data to ensure that the machine learning model is not biased towards the majority class and can accurately predict both classes.
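
Both ideas can be sketched with plain pandas sampling. The data below is invented for illustration (90 negatives, 10 positives), and random sampling is only one of several possible strategies:

```python
import pandas as pd

# Hypothetical dataset: 90 negatives (label 0) and 10 positives (label 1).
df = pd.DataFrame({"feature": range(100),
                   "label": [0] * 90 + [1] * 10})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: keep a random subset of majority rows, drawn without replacement.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Resampling is normally applied to the training split only, so that the test split keeps the real class distribution.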

#### Q5: What is data Augmentation? Explain SMOTE.

Ans: **Data augmentation** is a technique used to artificially increase the size of a dataset by generating new data points from the existing ones. It is commonly used in machine learning to address the problem of limited data.

**SMOTE (Synthetic Minority Over-sampling Technique)** is a data augmentation technique used to address the problem of imbalanced data. It works by creating synthetic examples of the minority class by interpolating between existing examples. SMOTE is particularly useful when the number of instances in the minority class is very small compared to the majority class. By creating synthetic examples, SMOTE can help balance the class distribution and prevent the machine learning model from being biased towards the majority class.
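
The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the toy points are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X_min))           # pick a minority sample
        j = rng.choice(neighbors[i])           # pick one of its k nearest neighbors
        gap = rng.random()                     # random position along the segment
        synthetic[s] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

# Hypothetical minority-class points in two dimensions.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
X_new = smote_sketch(X_min, n_new=6)
print(X_new.shape)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies. In practice, the `SMOTE` class from the imbalanced-learn package is a more complete option.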

#### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Ans: **Outliers** are data points in a dataset that are significantly different from other data points. They can be caused by errors in data collection or measurement, or they may represent valid but rare instances.

It is essential to handle outliers because they can have a significant impact on the results of machine learning algorithms. Outliers can skew statistical measures such as the mean and standard deviation, which can in turn affect the performance of machine learning models that rely on these measures. Outliers can also create noise in the data, which can lead to overfitting or poor generalization of the model.
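
The effect on the mean, and one common way to flag outliers (the 1.5 × IQR rule), can be sketched as follows. The values are invented for illustration, and the IQR rule is only one of several possible detection rules:

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

print("mean with outlier:  ", values.mean())
print("median with outlier:", np.median(values))

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]

print("flagged outliers:   ", values[is_outlier])
print("mean after removal: ", cleaned.mean())
```

The median barely moves when the extreme value is present, while the mean roughly doubles, which is why robust statistics are often preferred when outliers cannot be removed.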

#### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Ans: Some techniques that can be used to handle missing data in an analysis:

1. Deletion: This involves removing the rows or columns that contain missing values. It is reasonable when only a small share of the dataset is affected, or when a column is missing so often that it carries little information.

2. Imputation: This involves estimating the missing value based on the other available data. Mean, median, mode, regression, and k-nearest neighbor imputation are some common imputation techniques.

3. Prediction: This involves using a machine learning model to predict the missing values. This technique can be effective if there is enough data available to train the model.
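
A minimal sketch of the first two techniques on a made-up customer table. The column names, values, and the 50% threshold for dropping a column are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; column names are illustrative only.
customers = pd.DataFrame({
    "age":     [34, np.nan, 45, 29, np.nan, 52],
    "income":  [52000, 61000, np.nan, 48000, 75000, 58000],
    "segment": ["A", "B", None, "A", "A", "B"],
    "fax":     [None, None, None, None, "555-0100", None],
})

# Deletion: drop columns where more than half of the values are missing.
mostly_present = customers.loc[:, customers.isna().mean() <= 0.5]

# Imputation: median for numeric columns, mode for the categorical column.
imputed = mostly_present.copy()
for col in ["age", "income"]:
    imputed[col] = imputed[col].fillna(imputed[col].median())
imputed["segment"] = imputed["segment"].fillna(imputed["segment"].mode()[0])

print(imputed)
```

The median is used here instead of the mean because it is less sensitive to extreme incomes or ages.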

#### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Ans: Several strategies that can be used:

1. Visual inspection: This involves creating plots or charts of the data to visually identify any patterns or trends in the missing data.

2. Statistical tests: This involves conducting statistical tests to determine if there is a significant difference between the missing and non-missing data. For example, a t-test or chi-squared test can be used to determine if the missing data is related to a specific variable.

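3. Machine learning techniques: This involves creating an indicator column that marks whether a value is missing and training a classifier to predict that indicator from the other features. If the model predicts missingness clearly better than chance, the missingness depends on the observed data and is not completely at random. The missing values themselves cannot be compared with "actual" values, because the actual values are unknown.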
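
One way to probe for a pattern is to flag which rows are missing and compare an observed variable across the two groups. The sketch below uses simulated data in which missingness in `income` is deliberately tied to `age`; the variables, sizes, and the 40% masking rate are assumptions for the example:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
age = rng.integers(20, 70, size=n)
income = rng.normal(50000, 10000, size=n)

# Simulated pattern: income is hidden only for some people older than 50.
hide = (age > 50) & (rng.random(n) < 0.4)
df = pd.DataFrame({"age": age, "income": np.where(hide, np.nan, income)})

# Missingness indicator for the column under study.
df["income_missing"] = df["income"].isna()

# Compare an observed variable between rows with and without missing income.
mean_age = df.groupby("income_missing")["age"].mean()
# Correlation between the indicator and age; near zero would suggest no link.
corr = df["income_missing"].astype(int).corr(df["age"])

print(mean_age)
print("correlation with age:", round(corr, 3))
```

A clear gap between the two group means, or a correlation well away from zero, indicates that missingness depends on an observed variable. A formal test such as a t-test on the two groups would follow the same grouping.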

#### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Ans: Several strategies help when building and evaluating a machine learning model on an imbalanced dataset. The first two change how the model is trained, and the third changes how it is scored:

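1. Resampling techniques: This involves oversampling the minority class or undersampling the majority class to balance the dataset. This can improve the performance of the model on the minority class, but may also introduce bias. Resampling should be applied to the training data only, so that the evaluation set keeps the real class distribution.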

2. Cost-sensitive learning: This involves assigning different misclassification costs to the different classes to reflect the imbalance in the dataset. This can improve the performance of the model on the minority class.

3. Evaluation metrics: This involves using evaluation metrics that are specifically designed for imbalanced datasets, such as Area Under the ROC Curve (AUC-ROC), Area Under the Precision-Recall Curve (AUC-PR), and balanced accuracy.
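
Several of these metrics follow directly from the confusion matrix. The sketch below computes them by hand on invented predictions (90 patients without the condition, 10 with it); the labels are assumptions for illustration:

```python
import numpy as np

# Hypothetical test-set results: 1 = has the condition, 0 = does not.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 85 + [1] * 5 + [1] * 6 + [0] * 4)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))   # sick, predicted sick
tn = int(np.sum((y_true == 0) & (y_pred == 0)))   # healthy, predicted healthy
fp = int(np.sum((y_true == 0) & (y_pred == 1)))   # healthy, predicted sick
fn = int(np.sum((y_true == 1) & (y_pred == 0)))   # sick, predicted healthy

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)                  # also called sensitivity
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("balanced accuracy", balanced_accuracy),
                    ("F1", f1)]:
    print(f"{name:18s} {value:.3f}")
```

Accuracy stays high here because the healthy class dominates, while recall and F1 show that a sizeable share of true cases is missed.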

#### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Ans: Several methods can be used to down-sample the majority class and balance the dataset:

1. Random undersampling: randomly selecting a subset of the majority class to match the size of the minority class.

2. Cluster-based undersampling: identifying clusters in the majority class and selecting representative samples from each cluster.

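3. Tomek Links: finding pairs of samples from opposite classes that are each other's nearest neighbors (Tomek links) and removing the majority-class member of each pair, which cleans up the class boundary.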

4. NearMiss: selecting samples from the majority class that are closest to the minority class.
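
The Tomek links idea from the list above can be sketched with NumPy. The helper name `tomek_link_majority_indices` and the toy points are assumptions for illustration; a library such as imbalanced-learn would be the usual choice in practice.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that take part in a Tomek link."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # ignore self-distances
    nearest = dists.argmin(axis=1)             # nearest neighbor of every sample
    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i               # i and j are each other's nearest
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical data: label 0 is the majority class, label 1 the minority class.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.4],
              [1.0, 1.0], [1.1, 1.0], [3.0, 3.0]])
y = np.array([0, 0, 0, 0, 1, 1])

remove_idx = tomek_link_majority_indices(X, y, majority_label=0)
X_clean = np.delete(X, remove_idx, axis=0)
y_clean = np.delete(y, remove_idx)
print("removed majority indices:", remove_idx)
print("remaining class counts:", np.bincount(y_clean))
```

Only the majority point that sits right next to a minority point is removed, so this method cleans the boundary rather than fully balancing the classes.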

#### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

Ans: Several methods can be used to up-sample the minority class and balance the dataset:

1. Random oversampling: randomly duplicating samples from the minority class to match the size of the majority class.

2. Synthetic Minority Over-sampling Technique (SMOTE): generating synthetic samples for the minority class based on the nearest neighbors in feature space.

3. Adaptive Synthetic (ADASYN): generating synthetic samples for the minority class, with more of them created around minority samples that have many majority-class neighbors and are therefore harder to learn.

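4. Cluster Centroids: clustering the majority class and replacing each cluster with its centroid. Note that this is an under-sampling method, since it shrinks the majority class instead of adding minority samples, but it can be combined with the over-sampling methods above.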
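
The density-based weighting behind ADASYN (item 3) can be sketched as an allocation step: each minority sample receives a share of the new synthetic points in proportion to how many of its nearest neighbors belong to the other class. This is a simplified sketch under stated assumptions; the function name `adasyn_allocation`, its parameters, and the toy data are invented for the example, and the synthesis step itself (interpolation, as in SMOTE) is omitted.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label, n_new, k=3):
    """Decide how many synthetic samples each minority point should receive."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    minority_idx = np.flatnonzero(y == minority_label)
    # k nearest neighbors of each minority sample, searched over all samples.
    neighbors = np.argsort(dists[minority_idx], axis=1)[:, :k]
    # r_i: fraction of those neighbors that belong to another class.
    r = (y[neighbors] != minority_label).mean(axis=1)
    if r.sum() == 0:                            # no hard samples: spread evenly
        weights = np.full(len(r), 1 / len(r))
    else:
        weights = r / r.sum()
    counts = np.rint(weights * n_new).astype(int)
    return minority_idx, counts

# Hypothetical data: a majority blob near the origin, two minority points close
# to it, and a compact minority cluster far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.2],
              [-0.1, 0.0], [0.0, -0.1], [-0.1, -0.1],
              [0.3, 0.3], [0.6, 0.6],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1]])
y = np.array([0] * 8 + [1] * 6)

minority_idx, counts = adasyn_allocation(X, y, minority_label=1, n_new=10)
for idx, c in zip(minority_idx, counts):
    print(f"minority sample {idx} at {X[idx]}: {c} synthetic points")
```

The two minority points next to the majority blob receive all of the new samples, while the well-separated cluster receives none, which is the main difference from plain SMOTE.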