### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

A1. **Missing values** occur when no value is recorded for a particular attribute of an observation. They can result from data-collection errors, incomplete responses, or data corruption.

Handling them is essential for the following reasons:
- **Accuracy:** Missing values can lead to inaccurate analysis and skewed results.
- **Model Performance:** Many algorithms require complete data and may fail or perform poorly with missing values.
- **Bias:** Ignoring missing values can introduce bias, affecting the validity of the analysis.
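
To make the accuracy point concrete, here is a minimal sketch using only the standard library. The `ages` list is made-up illustrative data. It shows that a single NaN silently turns a plain average into NaN rather than raising an error.

```python
import math

# Hypothetical sample with one missing entry (encoded as NaN)
ages = [25.0, 30.0, math.nan, 45.0]

# A naive mean propagates the NaN instead of failing loudly
naive_mean = sum(ages) / len(ages)

# Dropping the missing entry gives a usable estimate
observed = [a for a in ages if not math.isnan(a)]
clean_mean = sum(observed) / len(observed)

print(math.isnan(naive_mean))  # True
print(round(clean_mean, 2))    # 33.33
```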

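**Algorithms That Tolerate Missing Values:**
- **Decision Trees** (e.g., CART, which can route incomplete rows using surrogate splits)
- **Gradient-Boosted Trees** (e.g., XGBoost, LightGBM, and scikit-learn's `HistGradientBoostingClassifier`, which learn which branch rows with missing values should follow)
- **Naive Bayes** (features are treated independently, so a missing feature can be left out of the probability calculation)

Whether this works in practice depends on the implementation: many library versions of these models still reject NaN inputs. Standard k-Nearest Neighbors (k-NN) is affected by missing values, because a distance cannot be computed over a missing coordinate unless a NaN-aware metric is used.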

### Q2: List down techniques used to handle missing data. Give an example of each with Python code.

In [13]:
# A2.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Every technique below starts from this frame, so the NaN is still present each time
data = {'Feature1': [1, 2, np.nan, 4, 5]}
df = pd.DataFrame(data)

print('*************************************')
print('**1. Imputation (Mean/Median/Mode):**')
print('*************************************')

# Mean imputation (assign the result back instead of a chained inplace fillna)
df_mean = df.copy()
df_mean['Feature1'] = df_mean['Feature1'].fillna(df_mean['Feature1'].mean())
print('mean', df_mean)

# Median imputation
df_median = df.copy()
df_median['Feature1'] = df_median['Feature1'].fillna(df_median['Feature1'].median())
print('median', df_median)

# Mode imputation (all values are tied here, so mode()[0] is the smallest one)
df_mode = df.copy()
df_mode['Feature1'] = df_mode['Feature1'].fillna(df_mode['Feature1'].mode()[0])
print('mode', df_mode)

print('************************************')
print('**2. Forward Fill / Backward Fill:**')
print('************************************')

# Forward fill: carry the previous valid value forward
print(df.ffill())
# Backward fill: pull the next valid value backward
print(df.bfill())

print('*********************')
print('**3. Interpolation:**')
print('*********************')

# Linear interpolation between the neighboring valid values
print(df.interpolate())

print('*******************************')
print('**4. Dropping Missing Values:**')
print('*******************************')

# Drop rows with missing values
print(df.dropna())

print('*******************************************************************')
print('**5. Using a Machine Learning Model for Imputation (KNN Imputer):**')
print('*******************************************************************')

# KNN needs a second feature to measure distances; Feature2 is added for illustration
df_knn = pd.DataFrame({'Feature1': [1, 2, np.nan, 4, 5],
                       'Feature2': [10, 20, 30, 40, 50]})
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(df_knn))

*************************************
**1. Imputation (Mean/Median/Mode):**
*************************************
mean    Feature1
0       1.0
1       2.0
2       3.0
3       4.0
4       5.0
median    Feature1
0       1.0
1       2.0
2       3.0
3       4.0
4       5.0
mode    Feature1
0       1.0
1       2.0
2       1.0
3       4.0
4       5.0
************************************
**2. Forward Fill / Backward Fill:**
************************************
   Feature1
0       1.0
1       2.0
2       2.0
3       4.0
4       5.0
   Feature1
0       1.0
1       2.0
2       4.0
3       4.0
4       5.0
*********************
**3. Interpolation:**
*********************
   Feature1
0       1.0
1       2.0
2       3.0
3       4.0
4       5.0
*******************************
**4. Dropping Missing Values:**
*******************************
   Feature1
0       1.0
1       2.0
3       4.0
4       5.0
*******************************************************************
**5. Using a Machine Learning Model for Imputation (KNN Imputer):**
*******************************************************************
[[ 1. 10.]
 [ 2. 20.]
 [ 3. 30.]
 [ 4. 40.]
 [ 5. 50.]]

### Q3: Explain imbalanced data. What will happen if imbalanced data is not handled?

A3. **Imbalanced Data:**  
Imbalanced data occurs when the classes in the dataset are not represented equally. For example, in a binary classification problem, one class may have significantly more examples than the other.

**Consequences of Not Handling Imbalanced Data:**
- **Model Bias:** The model may be biased toward the majority class, leading to poor performance on the minority class.
- **Poor Generalization:** The model might not generalize well to new data, especially for the minority class.
- **Misleading Metrics:** Performance metrics such as accuracy may be misleading if the majority class dominates the dataset.
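
The misleading-metrics point can be seen with a small standard-library sketch. The labels below are made up, and the "model" is a deliberately trivial one that always predicts the majority class.

```python
# Hypothetical labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

# Accuracy looks strong because the majority class dominates
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class exposes the problem
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

The classifier never finds a single positive case, yet its accuracy is 95%.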

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

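A4. **Up-sampling:** involves increasing the number of instances in the minority class by duplicating existing samples or generating new ones. It is useful when the dataset is small and discarding majority-class rows would lose too much information, for example a fraud dataset with only a few hundred fraudulent transactions.

**Down-sampling:** involves reducing the number of instances in the majority class to match the number of instances in the minority class. It is useful when the majority class is large enough that a subset still represents it well, for example millions of legitimate transactions.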


In [7]:
import pandas as pd
from sklearn.utils import resample

# Sample data
data = {'Feature1': [1, 2, 3, 4, 5],
        'Class': ['A', 'A', 'B', 'B', 'B']}
df = pd.DataFrame(data)

# Separate majority and minority classes
df_majority = df[df['Class'] == 'B']
df_minority = df[df['Class'] == 'A']

# Up-sample minority class
df_minority_upsampled = resample(df_minority, replace=True, n_samples=len(df_majority), random_state=0)
df_balanced = pd.concat([df_majority, df_minority_upsampled])
print(df_balanced)

# Down-sample majority class
df_majority_downsampled = resample(df_majority, replace=False,  n_samples=len(df_minority), random_state=0)
df_balanced = pd.concat([df_majority_downsampled, df_minority])
print(df_balanced)

   Feature1 Class
2         3     B
3         4     B
4         5     B
0         1     A
1         2     A
1         2     A
   Feature1 Class
4         5     B
3         4     B
0         1     A
1         2     A




### Q5: What is data Augmentation? Explain SMOTE.

A5. **Data Augmentation:** is a technique used to artificially increase the size of a dataset by creating modified versions of existing data. This is commonly used in image data to create variations (e.g., rotations, translations) and enhance model performance.

**SMOTE (Synthetic Minority Over-sampling Technique):**  
SMOTE is a technique used to generate synthetic examples in the feature space for the minority class by interpolating between existing minority class samples.
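
The interpolation step can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's implementation: the helper name `smote_like_sample` and the sample values are hypothetical, and the neighbor is given directly, whereas SMOTE itself selects it from the k nearest minority-class neighbors.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its minority neighbors."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [xi + lam * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)
x = [1.0, 2.0]         # a minority sample (made-up values)
neighbor = [3.0, 6.0]  # one of its nearest minority neighbors (made-up values)

synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Every synthetic point lies on the line segment between the two existing minority samples, which is why SMOTE adds variety without leaving the region the minority class already occupies.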


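In [6]:
import numpy as np
from imblearn.over_sampling import SMOTE

# Sample data: 6 majority-class rows (label 0) and 3 minority-class rows (label 1)
X = [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [8, 9], [9, 10], [10, 11]]  # Feature matrix
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # Target variable

# Apply SMOTE; k_neighbors must be smaller than the minority class size (3 here)
smote = SMOTE(k_neighbors=2, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

# Synthetic rows vary with the seed, so print the class counts instead
print('before:', np.bincount(y))
print('after:', np.bincount(y_res))

before: [6 3]
after: [6 6]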




### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

A6. **Outliers:** are data points that differ significantly from the majority of the data. They are extreme values that lie outside the range of normal data.

**Importance of Handling Outliers:**
- **Model Accuracy:** Outliers can skew statistical measures and affect model accuracy.
- **Data Distribution:** They can distort the distribution of the data, affecting assumptions made by some algorithms.
- **Interpretability:** Outliers may affect the interpretability of the model's results.

**Handling Outliers:**
- **Remove:** Exclude outliers if they are errors or not representative.
- **Transform:** Apply transformations like log or square root to reduce the impact of outliers.
- **Impute:** Replace outliers with more central values or median values.
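
As a minimal sketch of detecting and then imputing outliers, the example below flags values outside the 1.5 × IQR fences and replaces them with the median. The 1.5 × IQR rule is a common convention assumed here, not something the answer above prescribes, and the `values` list is made up.

```python
import statistics

# Made-up measurements with one extreme value
values = [10, 12, 11, 13, 12, 11, 14, 95]

# Quartiles and the conventional 1.5 * IQR fences
q1, _, q3 = statistics.quantiles(values, n=4, method='inclusive')
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag anything outside the fences
outliers = [v for v in values if v < lower or v > upper]
print(outliers)  # [95]

# One handling option: replace flagged values with the median
median = statistics.median(values)
imputed = [median if v in outliers else v for v in values]
print(imputed)
```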

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

A7. **Techniques for Handling Missing Data:**
- **Imputation:** Fill missing values with mean, median, mode, or use advanced methods like KNN or regression imputation.
- **Removal:** Drop rows or columns with missing values if the percentage of missing data is small.
- **Prediction:** Use machine learning models to predict and fill in missing values based on other data.
- **Flag and Fill:** Create a new binary feature to indicate missingness and then impute the values.
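
The flag-and-fill approach can be sketched with pandas as follows. The `income` column and its values are hypothetical, and the median is just one reasonable choice of fill value.

```python
import numpy as np
import pandas as pd

# Made-up customer data with missing incomes
df = pd.DataFrame({'income': [40.0, np.nan, 55.0, np.nan, 70.0]})

# 1) Flag: record which rows were missing before filling
df['income_missing'] = df['income'].isna().astype(int)

# 2) Fill: impute with the median of the observed values
df['income'] = df['income'].fillna(df['income'].median())

print(df)
```

Keeping the indicator column lets a downstream model learn from the fact that a value was missing, which matters when the missingness itself carries information.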

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

A8. **Strategies to Determine Missing Data Patterns:**
- **Visual Inspection:** Use visualizations like heatmaps to check patterns of missing data.
- **Statistical Tests:** Conduct tests to determine if missing data is related to other variables (e.g., Little's MCAR test).
- **Correlation Analysis:** Analyze correlations between missingness and other features to detect patterns.
- **Missing Data Mechanism:** Assess if data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR).
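
A quick first check for a pattern is to compare another variable between rows where a value is missing and rows where it is observed. The sketch below uses made-up `age` and `income` columns in which income is missing only for younger customers. A clear difference between the groups suggests the data is not MCAR, though it cannot by itself separate MAR from MNAR.

```python
import numpy as np
import pandas as pd

# Made-up data: income is missing only for the youngest customers
df = pd.DataFrame({
    'age':    [22, 25, 31, 38, 45, 52, 60, 23],
    'income': [np.nan, np.nan, 48.0, 55.0, 61.0, 70.0, 66.0, np.nan],
})

# Share of missing values per column
print(df.isna().mean())

# Compare age between rows with missing and observed income
income_missing = df['income'].isna()
age_when_missing = df.loc[income_missing, 'age'].mean()
age_when_observed = df.loc[~income_missing, 'age'].mean()
print(age_when_missing, age_when_observed)

# Correlation between the missingness indicator and age
corr = income_missing.astype(int).corr(df['age'])
print(corr)
```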

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

A9. **Strategies for Evaluating Imbalanced Datasets:**
- **Use Metrics Suitable for Imbalance:** Utilize metrics such as precision, recall, F1-score, and ROC-AUC instead of accuracy.
- **Cross-Validation:** Use stratified k-fold cross-validation to ensure each fold has a representative proportion of classes.
- **Resampling Techniques:** Apply up-sampling or down-sampling to balance the classes in training data.
- **Adjust Class Weights:** Use algorithms that allow setting class weights to handle imbalance (e.g., class_weight parameter in scikit-learn).
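
The stratified cross-validation idea can be sketched with scikit-learn's `StratifiedKFold`. The labels are made up (16 patients without the condition, 4 with it) and the features are placeholders. The example shows that every test fold keeps the same share of positive cases.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 16 patients without the condition, 4 with it
y = np.array([0] * 16 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Stratification keeps the class ratio the same in every fold
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)  # [1, 1, 1, 1]
```

With plain k-fold splitting, some folds could contain no positive cases at all, which would make recall undefined for those folds.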

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

A10. **Balancing Dataset by Down-Sampling:**
- **Random Down-Sampling:** Randomly reduce the number of samples in the majority class to match the minority class.


In [5]:
from sklearn.utils import resample
import pandas as pd

# Most customers are satisfied, so 'Satisfied' is the majority class
data = {'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        'Satisfaction': ['Satisfied'] * 7 + ['Not Satisfied'] * 3}
df = pd.DataFrame(data)

df_majority = df[df['Satisfaction'] == 'Satisfied']
df_minority = df[df['Satisfaction'] == 'Not Satisfied']

# Down-sample without replacement so no majority row is duplicated
df_majority_downsampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=0)

df_balanced = pd.concat([df_majority_downsampled, df_minority])
# The sampled rows depend on the seed, so print the class sizes
for label in ['Satisfied', 'Not Satisfied']:
    print(label, (df_balanced['Satisfaction'] == label).sum())

Satisfied 3
Not Satisfied 3





### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

A11. **Balancing Dataset by Up-Sampling:**
- **Random Up-Sampling:** Increase the number of samples in the minority class by duplicating existing samples.

- **SMOTE:** Generate synthetic samples for the minority class to increase its representation.


In [2]:
import numpy as np
from imblearn.over_sampling import SMOTE

# Sample data
X = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]  # Features
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # Target (imbalanced: 7 vs 3)

# Apply SMOTE; k_neighbors must be smaller than the minority class size (3 here)
smote = SMOTE(k_neighbors=2, random_state=0)
X_res, y_res = smote.fit_resample(X, y)

# Synthetic rows vary with the seed, so print the class counts instead
print('before:', np.bincount(y))
print('after:', np.bincount(y_res))

before: [7 3]
after: [7 7]
