## Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

1. **Missing values**: Missing values in a dataset refer to the absence of data or no recorded value for a particular attribute or feature.
2. **Handling missing values**: It is essential to handle missing values because many model implementations cannot process them at all, and because dropping or naively filling them can shrink the sample or distort the feature distributions, leading to inaccurate or biased results.
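3. **Algorithms not affected by missing values**: Some algorithms can handle missing data without requiring imputation. Tree-based gradient boosting libraries such as XGBoost and LightGBM learn which branch to send missing values down, Random Forest can cope with them in implementations that support missing-value-aware splits, and Naive Bayes can skip a missing feature when computing class probabilities. k-Nearest Neighbors, by contrast, relies on distance computations and generally needs complete or imputed feature vectors.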


## Q2: List down techniques used to handle missing data. Give an example of each with python code.

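In [None]:
# Mean imputation: fill each numeric column's missing values with that column's mean
import pandas as pd
df = pd.read_csv('data.csv')
# numeric_only=True skips text columns, which have no mean
df = df.fillna(df.mean(numeric_only=True))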


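In [None]:
# Mode imputation: fill each column's missing values with its most frequent value
import pandas as pd
df = pd.read_csv('data.csv')
# df.mode() returns a DataFrame (ties add extra rows), so take the first row
df = df.fillna(df.mode().iloc[0])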


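In [None]:
# Median imputation: fill each numeric column's missing values with that column's median
import pandas as pd
df = pd.read_csv('data.csv')
# numeric_only=True skips text columns, which have no median
df = df.fillna(df.median(numeric_only=True))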
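
Since `data.csv` is not included here, the sketch below applies the same three imputations, plus listwise deletion, to a small in-memory DataFrame. The column names and values are hypothetical and chosen only so that the mean and median differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for data.csv
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 90.0],
    "city": ["Pune", "Delhi", None, "Delhi"],
})

# Mean and median imputation only touch the numeric column
mean_filled = df.fillna(df.mean(numeric_only=True))
median_filled = df.fillna(df.median(numeric_only=True))

# Mode imputation works for text columns as well
mode_filled = df.fillna(df.mode().iloc[0])

# Listwise deletion: drop every row that has at least one missing value
dropped = df.dropna()

print(mean_filled)
print(median_filled)
print(mode_filled)
print(dropped)
```

Mean and median imputation leave the missing `city` value in place, which is why mode imputation (or deletion) is usually paired with them for categorical columns.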


## Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the classes in the target variable are not represented equally. If imbalanced data is not handled, the machine learning model may be biased towards the majority class and may not perform well on the minority class.
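
The effect described above can be illustrated with a minimal sketch. The labels are made up (950 majority and 50 minority samples), and the "model" is just a stand-in that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 950 negatives (majority) and 50 positives (minority)
y_true = np.array([0] * 950 + [1] * 50)

# A stand-in "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

# Overall accuracy looks high because most samples belong to the majority class
accuracy = float((y_pred == y_true).mean())

# Recall on the minority class: share of actual positives that were found
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
minority_recall = true_positives / int((y_true == 1).sum())

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The accuracy is high even though not a single minority sample is detected, which is why accuracy alone can hide the problem.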

## Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are techniques used to balance imbalanced data in a dataset. Up-sampling involves increasing the number of samples in the minority class, while down-sampling involves decreasing the number of samples in the majority class.

For example, let's say we have a dataset with two classes: Class A and Class B. Class A has 1000 samples, while Class B has only 100 samples. This is an imbalanced dataset, as Class A has significantly more samples than Class B.

To balance this dataset, we could use up-sampling to increase the number of samples in Class B. This could be done by randomly duplicating samples from Class B until it has the same number of samples as Class A. Alternatively, we could use down-sampling to decrease the number of samples in Class A. This could be done by randomly removing samples from Class A until it has the same number of samples as Class B.

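Up-sampling and down-sampling are required when a model trained on the imbalanced data ignores the minority class. Up-sampling is the usual choice when data is scarce and no majority samples can be spared, while down-sampling suits cases where the majority class is large enough that discarding part of it loses little information. In both cases, only the training split should be resampled so that evaluation still reflects the real class distribution.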
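
The Class A / Class B example can be sketched with `resample` from `sklearn.utils`. The DataFrame, its `feature` column, and the `random_state` value are illustrative assumptions.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical dataset: 1000 samples of Class A and 100 samples of Class B
df = pd.DataFrame({
    "feature": range(1100),
    "label": ["A"] * 1000 + ["B"] * 100,
})
majority = df[df["label"] == "A"]
minority = df[df["label"] == "B"]

# Up-sampling: draw from Class B with replacement until it matches Class A
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw from Class A without replacement until it matches Class B
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

Up-sampling keeps every original row but repeats minority rows, whereas down-sampling throws away 900 of the Class A rows.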

## Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the amount of training data by creating new samples from the existing data. For image data, this is typically done by applying transformations such as rotation, scaling, and flipping to generate new samples that are similar but not identical to the originals.

SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation technique used to balance imbalanced datasets. It works by creating synthetic samples of the minority class by interpolating between existing minority class samples. This helps to increase the number of samples in the minority class and balance the dataset.
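
The interpolation idea behind SMOTE can be sketched in plain NumPy. This is a simplified illustration rather than the reference implementation: the function name `smote_like_samples`, its parameters, and the toy points are hypothetical, and it assumes the minority set has more than `k` rows with no duplicates.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        # Pick a random minority sample as the starting point
        base = X[rng.integers(len(X))]
        # Distances from the chosen point to every minority sample
        distances = np.linalg.norm(X - base, axis=1)
        # Skip the first sorted index, which is the point itself (distance 0)
        neighbor_idx = np.argsort(distances)[1:k + 1]
        neighbor = X[rng.choice(neighbor_idx)]
        # Pick a random position on the line segment between base and neighbor
        gap = rng.random()
        synthetic[i] = base + gap * (neighbor - base)
    return synthetic

# Hypothetical minority-class points with two features
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5], [3.0, 1.5]])
new_points = smote_like_samples(X_min, n_new=4)
print(new_points)
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies instead of being exact copies.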


## Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points in a dataset that are significantly different from the other data points. They can be caused by various factors, such as measurement errors, data entry errors, or natural variability in the data.

It is essential to handle outliers because they can have a significant impact on the analysis of the data. Outliers can strongly shift the mean and inflate the standard deviation, which can lead to inaccurate or misleading results; the median, by contrast, is largely robust to them. They can also affect the performance of machine learning models, as they can cause the model to fit the outlier data and perform poorly on new data.
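
A small sketch makes the effect on summary statistics concrete. The measurements are hypothetical; the two arrays differ only in the last value.

```python
import numpy as np

# Hypothetical measurements; the last value of the second array is an outlier
clean = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
with_outlier = np.array([10.0, 11.0, 12.0, 13.0, 140.0])

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(f"{name}: mean={data.mean():.1f}, "
          f"median={np.median(data):.1f}, std={data.std():.1f}")
```

A single extreme value roughly triples the mean and inflates the standard deviation, while the median stays at 12.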


## Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

 1) Mean Imputation
 
 2) Median Imputation
 
 3) Mode Imputation
 
 4) Listwise Deletion (if the dataset is large and only a small share of rows has missing values)

## Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

There are several strategies that can be used to determine if missing data is missing at random or if there is a pattern to the missing data. Some of these strategies include:

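1. **Visual inspection**: Plotting where the missing values occur, for example against time, row order, or other features, can reveal whether they cluster in particular groups or are scattered without a visible pattern.
2. **Statistical tests**: Splitting the rows by whether a value is missing and comparing the other variables between the two groups, for example with a t-test for numeric variables or a chi-squared test for categorical ones, can help determine if the missingness is related to the observed data.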
3. **Missing data mechanisms**: Understanding the mechanisms that cause data to be missing, such as data entry errors or survey non-response, can help determine if the missing data is missing at random or not.
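
As a rough first check along these lines, the sketch below builds a missingness indicator and compares another variable between the two groups. The customer data is hypothetical and deliberately constructed so that `income` is missing more often for younger customers.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: income is missing more often for younger customers
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 45, 50, 55, 60],
    "income": [np.nan, np.nan, 30000, np.nan, 52000, 58000, 61000, 65000],
})

# Share of missing values per column
missing_share = df.isna().mean()
print(missing_share)

# Compare age between rows where income is missing and rows where it is present
income_missing = df["income"].isna()
age_when_missing = df.loc[income_missing, "age"].mean()
age_when_present = df.loc[~income_missing, "age"].mean()
print(f"mean age, income missing: {age_when_missing:.1f}")
print(f"mean age, income present: {age_when_present:.1f}")
```

A clear gap between the two group means suggests the values are not missing completely at random; a formal test on the same two groups would be the natural next step.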


## Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

When working with an imbalanced dataset, such as in a medical diagnosis project where the majority of patients do not have the condition of interest, it is important to use appropriate evaluation metrics to assess the performance of the machine learning model. Some strategies that can be used include:

1. **Confusion matrix**: A confusion matrix can help visualize the performance of the model by showing the number of true positives, false positives, true negatives, and false negatives.
2. **Precision, recall, and F1 score**: Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of true positive predictions among all actual positives. The F1 score is the harmonic mean of precision and recall and provides a balanced measure of the model's performance.
3. **ROC curve and AUC**: The receiver operating characteristic (ROC) curve plots the true positive rate against the false positive rate at different classification thresholds, while the area under the curve (AUC) provides a measure of the model's ability to distinguish between the two classes.
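
The metrics above can be computed with `sklearn.metrics`, as in this minimal sketch. The ten patient labels, the predicted probabilities, and the 0.5 threshold are made-up values for illustration only.

```python
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical results for 10 patients: 1 = has the condition, 0 = does not
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
# Hypothetical predicted probabilities of having the condition
y_score = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.4, 0.8, 0.4, 0.9]
# Turn probabilities into class predictions with a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
# AUC uses the scores, not the thresholded predictions
auc = roc_auc_score(y_true, y_score)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, auc={auc:.2f}")
```

In a diagnosis setting, recall is often the metric to watch most closely, since a false negative means a patient with the condition goes undetected.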


## Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

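1) Random down-sampling (randomly drop majority-class rows, sampling without replacement, until the classes are balanced)

from sklearn.utils import resample

2) SMOTE (balances the dataset by synthesizing new minority-class samples; it is an over-sampling method and does not down-sample the majority class. The target class ratio is set through the sampling_strategy parameter. For down-sampling within imbalanced-learn, RandomUnderSampler from imblearn.under_sampling is the counterpart)

from imblearn.over_sampling import SMOTE

smote = SMOTE()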
 


## Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

1) SMOTE (up-samples the minority class by synthesizing new samples; the target class ratio is set through the sampling_strategy parameter)

from imblearn.over_sampling import SMOTE

smote = SMOTE()

2) Random up-sampling (duplicate minority-class rows by sampling with replacement)

from sklearn.utils import resample
 
