Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

* Missing values in a dataset refer to the absence of data points or entries in certain fields or features.
* It is essential to handle missing values because they can lead to biased estimates, reduce the statistical power of the analysis, and lead to incorrect conclusions or model predictions.
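* Few algorithms are truly unaffected by missing values, but some handle them without a separate imputation step:
  * Gradient-boosted tree libraries such as XGBoost and LightGBM handle missing values internally, typically by learning which side of each split the missing values should go to.
  * Decision Trees and Random Forests can handle missing values in some implementations (for example, CART with surrogate splits), but this depends on the library and version.
  * K-Nearest Neighbors (KNN) is not in this group: it relies on distance calculations, so missing values must be imputed or removed first.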
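
To make the bias mentioned above concrete, here is a minimal sketch with made-up numbers. The `income` values and the rule that the largest values go unrecorded are assumptions chosen purely for illustration. When missingness depends on the value itself, the mean of the observed entries drifts away from the true mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten income values, 1..10 (units are arbitrary)
true_income = pd.Series(np.arange(1, 11), dtype=float)

# Assumption for illustration: every value above 7 was not recorded
observed_income = true_income.where(true_income <= 7)

true_mean = true_income.mean()          # uses all ten values
observed_mean = observed_income.mean()  # pandas skips NaN by default

print(f"true mean: {true_mean}, observed mean: {observed_mean}")
print(f"missing entries: {observed_income.isna().sum()}")
```

Because pandas silently skips NaN in `mean()`, the underestimate would go unnoticed unless the missing entries are counted explicitly.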

Q2: List down techniques used to handle missing data. Give an example of each with python code.

In [2]:
#Removal: Remove rows or columns with missing values.


import pandas as pd
df = pd.DataFrame({'A': [1, 2, None], 'B': [4, None, 6]})
df.dropna()  # Removes rows with missing values
df.dropna(axis=1)  # Removes columns with missing values


Empty DataFrame
Columns: []
Index: [0, 1, 2]
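
Plain `dropna()` can discard a lot of data: in the small example above, every column contains a gap, so dropping columns leaves nothing. The sketch below shows some more selective options that pandas offers (`thresh` and `subset`), plus a column filter based on the share of missing entries. The table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer table; the column names are illustrative only
customers = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "income": [50_000, 62_000, np.nan, np.nan],
    "region": ["north", "south", "east", None],
    "fax": [np.nan, np.nan, "555-0100", np.nan],
})

# Keep rows that have at least 2 non-missing values
at_least_two = customers.dropna(thresh=2)

# Drop rows only when 'age' is missing; gaps in other columns are tolerated
age_known = customers.dropna(subset=["age"])

# Keep columns where at most half of the entries are missing
mostly_complete_cols = customers.loc[:, customers.isna().mean() <= 0.5]

print(at_least_two.index.tolist())
print(age_known.index.tolist())
print(mostly_complete_cols.columns.tolist())
```

The 50% cutoff is an arbitrary choice for this sketch; a suitable threshold depends on the dataset and the analysis.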


In [3]:
# Imputation: Replace missing values with a specific value (mean, median, mode, etc.).

df.fillna(df.mean())  # Replaces missing values with the mean


     A    B
0  1.0  4.0
1  2.0  5.0
2  1.5  6.0
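
The comment above also mentions median and mode imputation. A minimal sketch of both follows, using hypothetical `salary` and `plan` columns. The median is often preferred when a numeric column has extreme values, and the mode is a common choice for categorical columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one skewed numeric column and one categorical column
data = pd.DataFrame({
    "salary": [30.0, 32.0, np.nan, 35.0, 400.0],   # 400 is an extreme value
    "plan": ["basic", "basic", None, "premium", "basic"],
})

# Mean imputation is pulled upward by the extreme value
mean_filled = data["salary"].fillna(data["salary"].mean())

# Median imputation is robust to the extreme value
median_filled = data["salary"].fillna(data["salary"].median())

# mode() returns a Series (ties are possible), so take the first entry
mode_filled = data["plan"].fillna(data["plan"].mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[2])
```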


In [4]:
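# Model-based imputation: estimate each missing value from similar rows, e.g. KNN imputation.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
# fit_transform returns a NumPy array, so wrap it to keep the column names
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)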


Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

* Imbalanced data refers to a dataset where the classes are not represented equally, often seen in classification problems.
* If imbalanced data is not handled, it can lead to:
  * Bias: The model will be biased towards the majority class.
  * Poor Performance: The model may have poor performance metrics like precision, recall, and F1-score for the minority class.
  * Misleading Accuracy: High overall accuracy but poor detection of the minority class.
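
The "misleading accuracy" point can be illustrated with a small sketch. The labels are hypothetical (95 negatives, 5 positives), and the "model" is a deliberately trivial one that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that ignores its input and always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()

# Recall for the minority class: detected positives / actual positives
true_positives = np.sum((y_pred == 1) & (y_true == 1))
actual_positives = np.sum(y_true == 1)
minority_recall = true_positives / actual_positives

print(f"accuracy: {accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

The accuracy looks strong even though not a single minority-class sample is detected.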

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

In [None]:
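# Up-sampling: increase the number of minority-class samples by sampling them with replacement.
#   Typically used when the dataset is small and discarding majority rows would lose too much information.
# Down-sampling: reduce the number of majority-class samples by sampling them without replacement.
#   Typically used when the majority class is large enough that dropping rows is affordable.

import pandas as pd
from sklearn.utils import resample

# Illustrative data (assumed): a 'Class' column with 8 majority (0) and 2 minority (1) rows
df = pd.DataFrame({'feature': range(10), 'Class': [0] * 8 + [1] * 2})
df_majority = df[df.Class == 0]
df_minority = df[df.Class == 1]

# Up-sample minority class to the size of the majority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=42)

# Down-sample majority class to the size of the minority class
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=42)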


Q5: What is data Augmentation? Explain SMOTE.

* Data Augmentation: Techniques that increase the size and diversity of a dataset by adding slightly modified copies of existing data or newly created synthetic data.
* SMOTE (Synthetic Minority Over-sampling Technique): A method that creates synthetic minority-class samples by interpolating between a minority sample and one of its k nearest minority-class neighbors, rather than duplicating existing samples.

In [None]:
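from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Illustrative imbalanced data (assumed): roughly 90% class 0 and 10% class 1
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)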
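
To show the idea behind SMOTE, here is a hand-rolled NumPy sketch of the interpolation step. It is not the `imblearn` implementation; the function name `smote_like_samples` and its parameters are hypothetical, and it assumes the minority class has more than `k` points.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest points

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))       # pick a minority point at random
        neighbour = rng.choice(neighbours[base])   # pick one of its k neighbours
        gap = rng.random()                         # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

# Hypothetical minority class: four 2-D points
X_min = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
new_points = smote_like_samples(X_min, n_new=6)
print(new_points.shape)
```

Each synthetic point lies on the line segment between two existing minority points, which is why SMOTE adds variety instead of exact duplicates.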


Q6: What are outliers in a dataset? Why is it essential to handle outliers?

* Outliers are data points that differ significantly from other observations.
* Handling outliers is essential because they can:
  * Skew and mislead the statistical analysis.
  * Affect the performance of machine learning models.
  * Indicate variability in the data, potential errors, or novel insights.
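
One common way to flag outliers is the 1.5 × IQR rule. The sketch below applies it to hypothetical measurements and shows two possible treatments, dropping and capping. The 1.5 multiplier is a convention, not a requirement.

```python
import pandas as pd

# Hypothetical measurements with one suspicious entry
values = pd.Series([10, 12, 11, 13, 12, 11, 95], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside [lower, upper] is flagged as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
trimmed = values[(values >= lower) & (values <= upper)]

# One possible alternative to dropping: clip extreme values to the fences
capped = values.clip(lower=lower, upper=upper)

print(outliers.tolist(), values.mean(), trimmed.mean())
```

Whether to drop, cap, or keep a flagged point depends on whether it is an error or a genuine observation.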

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?


**Techniques:**
  * Imputation using mean, median, or mode.
  * Using algorithms like KNN or regression to predict and fill missing values.
  * Removing rows or columns with a significant number of missing values.
  * Using advanced imputation techniques like multiple imputation.
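
As an illustration of the regression-based option, here is a minimal sketch that fits a straight line on the complete rows and uses it to fill the gaps. The `visits` and `spend` columns and their values are hypothetical, and a single linear predictor is assumed for simplicity.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: 'spend' is missing for two customers
customers = pd.DataFrame({
    "visits": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "spend":  [12.0, 22.0, np.nan, 42.0, np.nan, 62.0],
})

known = customers.dropna(subset=["spend"])
missing = customers["spend"].isna()

# Fit spend = slope * visits + intercept on the complete rows
slope, intercept = np.polyfit(known["visits"], known["spend"], deg=1)

# Predict only where 'spend' is missing and fill those cells
customers.loc[missing, "spend"] = slope * customers.loc[missing, "visits"] + intercept

print(customers)
```

Filling each gap with a single predicted value understates the uncertainty of the imputed entries; multiple imputation addresses this by generating several plausible completed datasets and pooling the results.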

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

**Strategies:**
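  * Missing Completely at Random (MCAR): Missingness is unrelated to any variable, observed or not. Check it with visual inspection or a statistical test such as Little’s MCAR test.
  * Missing at Random (MAR): Missingness depends only on other observed variables. Check it by comparing those variables between rows with and without the missing value, or by modeling a missingness indicator from them.
  * Not Missing at Random (NMAR): Missingness depends on the unobserved value itself. This cannot be confirmed from the observed data alone, so it relies on domain knowledge and sensitivity analysis.
  * Visualizations: Use heatmaps of the missingness mask or scatter plots to identify patterns.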
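
A simple first diagnostic is to build a missingness indicator and compare observed variables across it. The sketch below does this on a hypothetical survey in which `income` is missing mostly for younger respondents; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: 'income' is missing mostly for younger respondents
survey = pd.DataFrame({
    "age":    [22, 25, 27, 30, 41, 45, 52, 58],
    "income": [np.nan, np.nan, 31.0, np.nan, 55.0, 60.0, 72.0, 80.0],
})

# 1. Share of missing values per column
missing_share = survey.isna().mean()

# 2. Indicator column: 1 where income is missing, 0 otherwise
survey["income_missing"] = survey["income"].isna().astype(int)

# 3. Compare an observed variable across the two groups
age_by_missingness = survey.groupby("income_missing")["age"].mean()

# 4. Correlation between the indicator and the observed variable
corr = survey["income_missing"].corr(survey["age"])

print(missing_share["income"])
print(age_by_missingness)
print(round(corr, 2))
```

A clear difference between the groups suggests the data are not MCAR. It does not by itself distinguish MAR from NMAR, since that depends on the values that were never observed.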

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?


**Strategies:**
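  * Use performance metrics such as precision, recall, F1-score, and ROC-AUC instead of accuracy, and inspect the confusion matrix for the minority class.
  * Use stratified cross-validation so that every fold preserves the class distribution.
  * Keep the test set at its original class distribution so that the reported metrics reflect real-world performance.
  * If up-sampling, down-sampling, or SMOTE is used to balance the classes, apply it only to the training data, never to the validation or test data.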
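
The stratification idea can be sketched by hand: shuffle the indices of each class separately and deal them round-robin into folds, so every fold keeps roughly the same class ratio. The helper `stratified_fold_ids` below is a hypothetical illustration; in practice scikit-learn's `StratifiedKFold` provides a tested implementation.

```python
import numpy as np

def stratified_fold_ids(y, n_folds=5, seed=0):
    """Assign each sample a fold id so every fold keeps roughly the same class ratio."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    fold_ids = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds
        fold_ids[idx] = np.arange(len(idx)) % n_folds
    return fold_ids

# Hypothetical diagnosis labels: 90 without the condition (0), 10 with it (1)
y = np.array([0] * 90 + [1] * 10)
folds = stratified_fold_ids(y, n_folds=5)

for fold in range(5):
    in_fold = folds == fold
    print(fold, "size:", in_fold.sum(), "positives:", y[in_fold].sum())
```

Without stratification, a random split of so few positives could leave some folds with no positive cases at all, which makes recall undefined for those folds.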

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?


In [None]:
# Down-sampling the majority class by randomly removing samples.

df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority))


In [None]:
# Generating synthetic data for the minority class using techniques like SMOTE.
X_resampled, y_resampled = smote.fit_resample(X, y)


Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?


**Methods:**

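* Up-sampling the minority class by duplicating existing samples (random over-sampling).
* Creating synthetic data for the minority class, for example with SMOTE.
* Combining up-sampling and down-sampling, so that the minority class is duplicated less heavily (which limits overfitting to repeated samples) and fewer majority samples are discarded.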


In [None]:
X_resampled, y_resampled = smote.fit_resample(X, y)
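
The combined approach from the list above can be sketched with pandas alone. The data, the column names, and the `target` size of 40 per class are assumptions for illustration; the idea is to meet in the middle rather than fully up-sampling or fully down-sampling.

```python
import pandas as pd

# Hypothetical event log: 90 non-events (0) and 10 rare events (1)
events = pd.DataFrame({
    "feature": range(100),
    "label": [0] * 90 + [1] * 10,
})
majority = events[events["label"] == 0]
minority = events[events["label"] == 1]

# Assumed target size per class, between the two original class sizes
target = 40

majority_down = majority.sample(n=target, replace=False, random_state=42)
minority_up = minority.sample(n=target, replace=True, random_state=42)

# Recombine and shuffle so the classes are not stored in blocks
balanced = (
    pd.concat([majority_down, minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["label"].value_counts())
```

As with any resampling, this would normally be applied to the training split only, so that evaluation still reflects the original class distribution.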
