# **Feature Engineering-1**

## **Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.**



**Missing Values in a Dataset:**
Missing values in a dataset refer to the absence of information or data for certain observations or variables. These missing values can occur for various reasons, such as data entry errors, sensor malfunctions, or intentional non-responses.
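As a minimal sketch of how missing values show up in practice, the snippet below counts them per column with pandas. The column names and values are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; both np.nan and None mark missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 41, 33],
    "income": [52000, 61000, np.nan, np.nan],
    "city": ["Pune", None, "Delhi", "Mumbai"],
})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Fraction of rows that contain at least one missing value
rows_with_missing = df.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Rows with at least one missing value: {rows_with_missing:.0%}")
```

Checking these counts first helps decide whether deletion or imputation is the more reasonable option.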

**Importance of Handling Missing Values:**
Handling missing values is essential for several reasons:

* **Maintaining Data Integrity:** Missing values can lead to inaccurate and biased analyses if not appropriately addressed.
* **Algorithm Compatibility:** Many machine learning algorithms cannot handle missing values and may produce errors or suboptimal results.
* **Effective Analysis:** To ensure the accuracy and reliability of statistical analyses, imputation or removal of missing values is necessary.
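The effect on a simple statistic can be sketched as follows: a single missing value turns the plain NumPy mean into `NaN`, while the NaN-aware variant computes the mean of the observed values only. The array is a made-up example.

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 4.0])

# A single NaN propagates through the ordinary mean
plain_mean = np.mean(values)

# np.nanmean ignores NaN entries: (1 + 2 + 4) / 3
nan_aware_mean = np.nanmean(values)

print(plain_mean, nan_aware_mean)
```

pandas aggregations such as `Series.mean()` skip missing values by default, which avoids the `NaN` result but can hide how many observations actually contributed.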

**Algorithms Not Affected by Missing Values:**
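Few algorithms are truly unaffected by missing values; whether a model accepts them depends on both the algorithm and the implementation:

* **Decision Trees:** Some decision-tree algorithms handle missing values natively, for example CART through surrogate splits and C4.5 by distributing an instance fractionally across branches. Support varies by library, so check the implementation you use.

* **Random Forests:** As ensembles of decision trees, Random Forests inherit whatever missing-value handling the underlying tree implementation provides.

* **XGBoost:** XGBoost handles missing values natively: at each split it learns a default direction to which instances with a missing feature value are sent.

* **K-Nearest Neighbors (KNN):** A standard KNN model cannot compute distances over missing entries, so it is not robust to missing values by itself. KNN is, however, widely used as an imputation method, where a missing value is filled in from the values of neighboring instances.

* **Naive Bayes:** Because Naive Bayes treats features as conditionally independent given the class, a missing feature can in principle be left out of the probability product for that instance. Many library implementations still require complete inputs.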

***While these algorithms can work with missing values, it's crucial to consider the impact of imputation methods or the potential bias introduced during analysis. In practice, addressing missing values often involves imputation techniques or, in some cases, removing observations or variables with too many missing values.***

## **Q2: List down techniques used to handle missing data. Give an example of each with python code.**


Handling missing data is a critical step in the data preprocessing pipeline. Here are some common techniques along with Python examples:

### **Deletion of Missing Data:**
* Remove rows or columns with missing values.
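* This is suitable when values are missing completely at random and only a small fraction of the data is affected; otherwise deletion can bias the analysis or discard too much data.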

In [7]:
import pandas as pd
import numpy as np

# Example DataFrame
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned_rows = df.dropna()

# Remove columns with missing values
df_cleaned_cols = df.dropna(axis=1)

In [6]:
df_cleaned_cols

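Empty DataFrame with index [0, 1, 2, 3]: columns A and B each contain a missing value, so both are dropped.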


In [5]:
df_cleaned_rows

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 3 | 4.0 | 8.0 |
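Deletion does not have to be all-or-nothing. The sketch below uses the `thresh`, `subset`, and `how` parameters of `DataFrame.dropna` on a made-up DataFrame to drop rows more selectively.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 has one observed value, row 4 has none
df = pd.DataFrame({
    "A": [1, 2, np.nan, 4, np.nan],
    "B": [5, np.nan, 7, 8, np.nan],
    "C": [9, np.nan, 11, 12, np.nan],
})

# Keep rows that have at least 2 non-missing values
df_thresh = df.dropna(thresh=2)

# Drop rows only when column "A" is missing
df_subset = df.dropna(subset=["A"])

# Drop rows only when every value is missing
df_all_missing = df.dropna(how="all")

print(df_thresh.index.tolist())       # [0, 2, 3]
print(df_subset.index.tolist())       # [0, 1, 3]
print(df_all_missing.index.tolist())  # [0, 1, 2, 3]
```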


### **Imputation with Mean/Median/Mode:**
* Fill missing values with the mean, median, or mode of the respective column.

In [9]:
# Example DataFrame
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Impute missing values with mean
df_imputed_mean = df.fillna(df.mean())

# Impute missing values with median
df_imputed_median = df.fillna(df.median())

# Impute missing values with mode
df_imputed_mode = df.fillna(df.mode().iloc[0])

In [10]:
df_imputed_mean

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 1 | 2.0 | 6.666667 |
| 2 | 2.333333 | 7.0 |
| 3 | 4.0 | 8.0 |


In [11]:
df_imputed_median

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 1 | 2.0 | 7.0 |
| 2 | 2.0 | 7.0 |
| 3 | 4.0 | 8.0 |


In [12]:
df_imputed_mode

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 1 | 2.0 | 5.0 |
| 2 | 1.0 | 7.0 |
| 3 | 4.0 | 8.0 |
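In a modeling workflow, the imputation statistic is usually learned from the training data only and then reused on new data, so that no information leaks from the test set. The following is a minimal sketch using scikit-learn's `SimpleImputer`; the arrays are illustrative and not part of the examples above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative training and test arrays with missing entries
X_train = np.array([[1.0, 5.0], [2.0, np.nan], [np.nan, 7.0], [4.0, 8.0]])
X_test = np.array([[np.nan, 6.0], [3.0, np.nan]])

# Learn the column means from the training data only
imputer = SimpleImputer(strategy="mean")
X_train_imputed = imputer.fit_transform(X_train)

# Reuse the training means to fill the test data
X_test_imputed = imputer.transform(X_test)

print(imputer.statistics_)  # column means learned from X_train
print(X_test_imputed)
```

Passing `strategy="median"` or `strategy="most_frequent"` gives the median and mode variants.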


### **Imputation with Forward Fill (ffill) or Backward Fill (bfill):**
* Propagate the non-missing values forward or backward.

In [13]:
# Example DataFrame
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, 8]}
df = pd.DataFrame(data)

# Forward fill missing values
df_ffill = df.ffill()

# Backward fill missing values
df_bfill = df.bfill()

In [14]:
df_ffill

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 1 | 2.0 | 5.0 |
| 2 | 2.0 | 7.0 |
| 3 | 4.0 | 8.0 |


In [15]:
df_bfill

| | A | B |
|---|---|---|
| 0 | 1.0 | 5.0 |
| 1 | 2.0 | 7.0 |
| 2 | 4.0 | 7.0 |
| 3 | 4.0 | 8.0 |
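Forward and backward fill are most natural for ordered data such as time series, where a long gap should not necessarily be filled with a stale value. The sketch below uses a hypothetical series of daily readings and the `limit` parameter of `ffill` to cap how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with a three-day gap
readings = pd.Series(
    [20.0, np.nan, np.nan, np.nan, 24.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Unlimited forward fill carries the last reading across the whole gap
filled_all = readings.ffill()

# limit=1 fills at most one consecutive missing value, leaving the rest as NaN
filled_limited = readings.ffill(limit=1)

print(filled_all.tolist())
print(filled_limited.tolist())
```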


***These methods offer different trade-offs, and the choice depends on the dataset and the specific requirements of the analysis. It's essential to assess the impact of missing data handling on the results and choose the method that best suits the context.***

## **Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?**



Imbalanced data refers to a situation where the distribution of classes in a classification dataset is not equal. In other words, one class significantly outnumbers the other(s). This can lead to challenges when training machine learning models, particularly in binary classification problems where there are only two classes (positive and negative).

### **Challenges of Imbalanced Data:**

**1. Bias in Model Training:**
* The model tends to be biased towards the majority class since it has more instances to learn from.

**2. Poor Generalization:**
* The model may generalize poorly to the minority class, leading to misclassification.

**3. Model Evaluation Issues:**
* Traditional accuracy can be misleading as a metric. The model might achieve high accuracy by simply predicting the majority class, while performing poorly on the minority class.

**4. Insensitivity to Rare Events:**
* The model may struggle to detect and handle rare events represented by the minority class.
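The evaluation issue can be made concrete with a small sketch. Assume a hypothetical dataset with 95 negative and 5 positive instances and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial classifier that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Accuracy looks high because the majority class dominates
accuracy = (y_pred == y_true).mean()

# Recall on the minority class: true positives / actual positives
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall:.2f}")
```

The classifier scores 95% accuracy while detecting none of the positive instances, which is why metrics such as recall, precision, or F1 on the minority class are more informative here.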

### **Consequences if Imbalanced Data is Not Handled:**

**1. Misleading Performance Metrics:**
* Accuracy alone becomes an unreliable performance metric. A high accuracy may not reflect the model's actual ability to correctly classify instances from the minority class.

**2. Risk of Ignoring Minority Class:**
* In critical applications (e.g., fraud detection, disease diagnosis), ignoring the minority class can have severe consequences.

**3. Loss of Valuable Information:**
* Valuable patterns and information related to the minority class may be overlooked.

**4. Model Biases:**
* Models may become biased towards predicting the majority class, leading to suboptimal results for real-world applications.

## **Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.**



### **Up-sampling and Down-sampling in Imbalanced Data:**

**Up-sampling (Over-sampling):**

* **Definition:** Up-sampling involves increasing the number of instances in the minority class by generating synthetic samples or replicating existing ones.
* **Example Scenario:**
In a fraud detection dataset where fraudulent transactions are rare (minority class), up-sampling can be applied to create synthetic instances of fraudulent transactions. This helps balance the class distribution.

In [17]:
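# Example using Python with the imbalanced-learn library
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# X, y: features and labels (synthetic stand-in data with roughly 10% minority class)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Apply SMOTE for up-sampling (training split only, so the test set keeps the original distribution)
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)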
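Up-sampling can also be done by replicating existing minority instances, as mentioned in the definition. Below is a minimal sketch of random over-sampling with NumPy only; the toy arrays and the target of a fully balanced dataset are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 8 majority instances (label 0) and 2 minority instances (label 1)
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)

X_upsampled = np.vstack([X, X[extra_idx]])
y_upsampled = np.concatenate([y, y[extra_idx]])

print(np.bincount(y_upsampled))  # [8 8]
```

Unlike SMOTE, this adds exact copies rather than synthetic points, so it is simpler but more prone to overfitting on the repeated instances.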

**Down-sampling (Under-sampling):**

* **Definition:** Down-sampling involves reducing the number of instances in the majority class by randomly removing some of them.
* **Example Scenario:**
In a medical diagnosis dataset where the positive (disease) class is rare compared to the negative (healthy) class, down-sampling can be applied to reduce the number of instances from the negative class.

In [None]:
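# Example using Python with scikit-learn
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# X, y: features and labels (synthetic stand-in data: label 0 is the majority, label 1 the minority)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Boolean masks selecting each class in the training split
majority_mask = y_train == 0
minority_mask = y_train == 1

# Down-sample the majority class, without replacement, to the size of the minority class
X_majority_downsampled, y_majority_downsampled = resample(X_train[majority_mask],
                                                         y_train[majority_mask],
                                                         replace=False,
                                                         n_samples=int(minority_mask.sum()),
                                                         random_state=42)

# Combine the down-sampled majority class with the minority class
X_resampled = np.vstack((X_majority_downsampled, X_train[minority_mask]))
y_resampled = np.concatenate((y_majority_downsampled, y_train[minority_mask]))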
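When the data lives in a pandas DataFrame, the same idea can be sketched with a grouped sample. The column names `feature` and `label` and the class names are hypothetical.

```python
import pandas as pd

# Toy DataFrame: 8 majority rows and 2 minority rows
df = pd.DataFrame({
    "feature": range(10),
    "label": ["majority"] * 8 + ["minority"] * 2,
})

# Size of the smallest class
n_minority = int(df["label"].value_counts().min())

# Sample the same number of rows from every class, without replacement
df_balanced = df.groupby("label").sample(n=n_minority, random_state=42)

print(df_balanced["label"].value_counts())
```

Because every class is sampled down to the size of the smallest one, this also works unchanged for more than two classes.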

### **When to Use Up-sampling and Down-sampling:**

**1. Up-sampling:**
* Use when the minority class is under-represented and the dataset is too small to afford discarding majority-class instances.
* Appropriate when synthetic or replicated samples can help the model capture the patterns of the minority class; note that replicating samples can encourage overfitting to them.

**2. Down-sampling:**
* Use when the majority class overwhelms the minority class and you want to balance the class distribution.
* Suitable when the dataset is large enough that discarding majority-class instances loses little information; it also reduces training time.

***The choice between up-sampling and down-sampling depends on the specific characteristics of the dataset and the goals of the machine learning task.***

## **Q5: What is data Augmentation? Explain SMOTE.**


## **Q6: What are outliers in a dataset? Why is it essential to handle outliers?**


## **Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?**


## **Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?**


## **Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?**


## **Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?**


## **Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?**