In [None]:
##Feature Engineering-1 Assignment

In [None]:
##Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

##A1. Missing values are entries for which no value is recorded for a given observation and feature. For instance, in a dataset of student marks, the marks for English or maths may be missing for some students.
##It is important to handle missing values because many algorithms cannot process them at all, and those that can may learn distorted patterns from incomplete data, which biases the predictions and reduces the overall usability of the model.

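##Some algorithms can work with missing values without prior imputation, depending on the implementation. Tree-based models are the usual examples: decision trees and random forests in some libraries, and gradient-boosting implementations such as XGBoost and LightGBM, handle missing values natively. K-Nearest Neighbors (KNN), by contrast, relies on distance computations and generally requires imputation first.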

In [2]:
##Q2: List down techniques used to handle missing data. Give an example of each with python code.

##A2. Handling missing data is a crucial step in data preprocessing to ensure accurate model training and predictions. Here are some common techniques used to handle missing data, along with examples using Python:

##1. Deleting Rows or Columns with Missing Values

import pandas as pd
import numpy as np

# Sample DataFrame with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, 7, np.nan],
        'C': [np.nan, 10, 11, 12]}
df = pd.DataFrame(data)

# Drop rows with any missing values
df_dropna_rows = df.dropna()

# Drop columns with any missing values
df_dropna_cols = df.dropna(axis=1)

print("DataFrame after dropping rows with missing values:")
print(df_dropna_rows)

print("\nDataFrame after dropping columns with missing values:")
print(df_dropna_cols)

##2. Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the non-missing values in the column.

from sklearn.impute import SimpleImputer

# Mean imputation
imputer = SimpleImputer(strategy='mean')
df_mean_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print("DataFrame after mean imputation:")
print(df_mean_imputed)


DataFrame after dropping rows with missing values:
Empty DataFrame
Columns: [A, B, C]
Index: []

DataFrame after dropping columns with missing values:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
DataFrame after mean imputation:
          A    B     C
0  1.000000  5.0  11.0
1  2.000000  6.0  10.0
2  2.333333  7.0  11.0
3  4.000000  6.0  12.0
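
##The cell above only shows mean imputation. A sketch of the median and mode variants follows, using plain pandas on the same toy DataFrame; the variable names df_median_imputed and df_mode_imputed are illustrative.

```python
import numpy as np
import pandas as pd

# Same toy DataFrame as in the example above
df = pd.DataFrame({'A': [1, 2, np.nan, 4],
                   'B': [5, np.nan, 7, np.nan],
                   'C': [np.nan, 10, 11, 12]})

# Median imputation: each NaN becomes the median of its column
df_median_imputed = df.fillna(df.median())

# Mode imputation: mode() can return several rows on ties, so take the first
df_mode_imputed = df.fillna(df.mode().iloc[0])

print("DataFrame after median imputation:")
print(df_median_imputed)

print("\nDataFrame after mode imputation:")
print(df_mode_imputed)
```

##Every value in these toy columns is unique, so the mode falls back to the smallest value in each column. Mode imputation is usually more meaningful for categorical features.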


In [None]:
##Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

##A3. Imbalanced data refers to a situation in a classification problem where the classes are not represented equally. 
##Typically, one class (called the minority class) has significantly fewer instances than the other class (called the majority class). 
##For example, in a binary classification problem, if Class A has 95% of the instances and Class B has only 5%, then the data is imbalanced.

##If imbalanced data is not handled properly, several issues can arise:

##Biased Models: Machine learning models trained on imbalanced data tend to be biased towards the majority class. They often ignore the minority class because the overall accuracy metric can still be high if the majority class is predicted correctly, even if the minority class is not.

##Poor Generalization: Models trained on imbalanced data may not generalize well to new data, especially if the minority class is important but underrepresented. The model may perform poorly on predicting the minority class in unseen data.

##Incorrect Assumptions: Many algorithms minimize the overall error rate, which implicitly assumes roughly balanced classes or equal misclassification costs. Imbalanced data violates this assumption and can lead to suboptimal model performance.

##Misleading Evaluation Metrics: Accuracy alone is not a reliable metric for evaluating models on imbalanced data. Models can achieve high accuracy by predicting the majority class, but they may fail to correctly predict instances of the minority class.
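
##The misleading-accuracy point can be illustrated with a small sketch. The labels below are hypothetical and mirror the 95%/5% split mentioned earlier; the "model" simply predicts the majority class every time.

```python
# Hypothetical labels: 95 majority-class (0) and 5 minority-class (1) instances
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the minority class: detected positives / actual positives
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")                # accuracy = 0.95
print(f"minority recall = {minority_recall:.2f}")  # minority recall = 0.00
```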

In [None]:
##Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

##A4. Up-sampling and down-sampling are techniques used to address class imbalance in machine learning datasets:
##Up-sampling: Up-sampling involves increasing the number of instances in the minority class (less represented class) to match the number of instances in the majority class. This is typically done by randomly duplicating examples from the minority class or generating synthetic examples based on existing minority class data.

##Example Scenario for Up-sampling: Imagine a credit card fraud detection dataset where only 1% of transactions are fraudulent (minority class), while the rest are legitimate transactions (majority class). In this case, up-sampling would involve increasing the number of fraudulent transactions in the dataset to balance it with the number of legitimate transactions. This helps the model learn more effectively from the minority class examples and improves its ability to detect fraud accurately.

##Down-sampling: Down-sampling involves reducing the number of instances in the majority class to match the number of instances in the minority class. This is typically done by randomly removing examples from the majority class until the dataset is balanced.

##Example Scenario for Down-sampling: Consider a medical dataset where patients with a rare disease form the minority class (1%) and patients without it form the majority class (99%). In this case, down-sampling would involve reducing the number of instances without the disease to match the number of instances with the disease. This keeps the model from being biased towards the majority class, at the cost of discarding majority-class data, so it is most suitable when the dataset is large enough that the remaining samples are still representative.

In [None]:
##Q5: What is data Augmentation? Explain SMOTE.

##A5. Data Augmentation is a technique used primarily in machine learning and computer vision to artificially increase the size and diversity of training datasets by creating modified versions of existing data. 
##The goal of data augmentation is to improve the generalization and robustness of models, especially when the original dataset is limited in size or diversity.

##SMOTE (Synthetic Minority Over-sampling Technique):
##SMOTE is a specific technique used for up-sampling the minority class in imbalanced datasets, particularly in classification tasks. 
##It works by generating synthetic examples rather than duplicating existing ones. SMOTE creates new instances of the minority class by interpolating between existing minority class instances.
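
##The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the imbalanced-learn implementation: the function name smote_sketch, the sample points, and the choice of k=2 neighbours are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class samples with two features
X_minority = np.array([[1.0, 1.0],
                       [1.5, 2.0],
                       [2.0, 1.5],
                       [3.0, 3.0]])

def smote_sketch(X, n_new, k=2):
    """Generate n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distances from X[i] to every minority sample
        dists = np.linalg.norm(X - X[i], axis=1)
        # Position 0 of the sort is the point itself (distance 0), so skip it
        neighbours = np.argsort(dists)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synthetic)

X_synthetic = smote_sketch(X_minority, n_new=5)
print(X_synthetic)
```

##Each synthetic point lies on the line segment between two existing minority samples, which is why SMOTE adds variety that plain duplication does not.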

In [None]:
##Q6: What are outliers in a dataset? Why is it essential to handle outliers?

##A6. Outliers in a dataset are extreme values that lie far above or below most of the other values in the dataset.
##Outliers can arise due to various reasons such as measurement errors, experimental variability, or genuine but rare events in the data.

##It is essential to handle outliers because they affect both the descriptive and the inferential statistics computed on a dataset.

##Impact on Statistical Analysis: Outliers can skew statistical measures and metrics, such as mean and standard deviation, leading to misleading interpretations of data distribution and relationships.

##Effect on Machine Learning Models: Outliers can adversely affect the performance and accuracy of machine learning models. Models that are sensitive to outliers, such as linear regression and clustering algorithms, can be heavily influenced by these extreme values, resulting in biased or inefficient predictions.
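
##A small sketch of how a single outlier skews the mean while leaving the median almost unchanged. The measurements are hypothetical and only the standard library is used.

```python
import statistics

# Hypothetical measurements; the appended value 100 is an outlier
values = [10, 12, 11, 13, 12]
values_with_outlier = values + [100]

mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(values_with_outlier)
median_clean = statistics.median(values)
median_outlier = statistics.median(values_with_outlier)

print(f"mean: {mean_clean:.2f} -> {mean_outlier:.2f}")        # mean: 11.60 -> 26.33
print(f"median: {median_clean:.2f} -> {median_outlier:.2f}")  # median: 12.00 -> 12.00
```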

In [None]:
##Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

##A7. To handle missing data, we can use techniques such as:

##Removal of missing data points: Delete the rows (or columns) that contain missing data and keep only complete records. This is reasonable when only a small share of the data is affected.
##Imputation using measures of central tendency: Replace missing values with the mean, median, or mode of the column, depending on the type and distribution of the variable.
##Interpolation: Fill missing values in ordered data (for example, a time series) using methods such as linear or polynomial interpolation.
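
##A minimal sketch of the interpolation option using pandas. The series of purchase amounts is hypothetical and is assumed to be ordered in time, which is what makes interpolation between neighbours sensible.

```python
import numpy as np
import pandas as pd

# Hypothetical daily purchase amounts with two gaps
purchases = pd.Series([100.0, np.nan, 120.0, np.nan, 160.0])

# Linear interpolation fills each gap from its neighbouring values
purchases_filled = purchases.interpolate(method='linear')

print(purchases_filled.tolist())  # [100.0, 110.0, 120.0, 140.0, 160.0]
```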

In [None]:
##Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

##A8. Determining whether data is missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR) is crucial for choosing appropriate strategies to handle missing data effectively.
##Here are some strategies you can use to investigate the patterns of missing data:

##Missing Data Heatmap: Create a heatmap or summary table that shows the presence of missing values across different variables/features. This can help visualize if missingness is concentrated in specific columns or rows.

##Summary Statistics: Calculate summary statistics (e.g., mean, median) of the other variables separately for rows with and without missing values. Large differences between the two groups suggest that the missingness is related to those variables rather than completely random.

##Domain Knowledge: Utilize domain knowledge to understand plausible reasons for missingness. This can provide insights into whether missing data is related to specific conditions, events, or data collection processes.
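
##The first two strategies can be sketched with pandas as follows. The customer table, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data: income is missing for some rows
df = pd.DataFrame({
    'age':    [22, 25, 31, 45, 52, 58, 63, 67],
    'income': [30.0, 35.0, 42.0, 60.0, np.nan, 75.0, np.nan, np.nan],
})

# Share of missing values per column (a tabular alternative to a heatmap)
missing_share = df.isna().mean()

# Mean age for rows with income present (False) versus missing (True)
age_by_missingness = df.groupby(df['income'].isna())['age'].mean()

print(missing_share)
print(age_by_missingness)
```

##Here the rows with missing income are noticeably older on average, which hints that the missingness depends on age and is therefore not completely random. On real data a formal test or domain knowledge would be needed to confirm this.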

In [None]:
##Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

##A9. On an imbalanced dataset such as this one, accuracy alone is misleading, so the evaluation should focus on how well the minority class (patients with the condition) is detected.
##Up-sampling or down-sampling can be applied to the training data to help the model learn the rare occurrence, but the model should still be evaluated on data that keeps the original class distribution.

##Confusion Matrix: Evaluate the model using a confusion matrix to understand true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

##Precision, Recall, F1-score: Use metrics like precision, recall, and F1-score that are more informative on imbalanced datasets, focusing on the minority class (positive class).

##Class Weight Adjustment: Adjust class weights in the machine learning model to penalize misclassifications of the minority class more than the majority class. This helps in improving sensitivity (recall) for the minority class.
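
##A short sketch of the confusion matrix and the precision, recall, and F1 metrics using scikit-learn. The label vectors are hypothetical, with 1 standing for the minority class (patients who have the condition).

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hypothetical labels: 1 = has the condition (minority), 0 = does not
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# ravel() on a binary confusion matrix yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

##In this toy case 8 of 10 predictions are correct, yet only half of the actual positives are found, which is what recall makes visible.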

In [None]:
##Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

##A10. When dealing with an imbalanced dataset in customer satisfaction estimation where the majority of customers are reported to be satisfied, down-sampling the majority class is a common approach to balance the dataset. 
##Down-sampling involves reducing the number of instances in the majority class to match the number of instances in the minority class (unsatisfied customers in this case).

##Below are the steps that can be followed to deal with the problem:

##Separate Data by Class: Split the original dataset into two subsets: one for the majority class (satisfied customers) and one for the minority class (unsatisfied customers).
##Determine the Size of the Minority Class: Calculate the number of instances in the minority class to know how many samples to keep after down-sampling.
##Down-sample the Majority Class: Randomly sample a subset from the majority class to match the size of the minority class. 
##Combine Down-sampled Majority Class with Minority Class: Concatenate the down-sampled majority class with the original minority class to form a balanced dataset.
##Verify the Balanced Dataset: Check the class distribution to ensure the dataset is now balanced between satisfied and unsatisfied customers.
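
##The steps above can be sketched with pandas as follows. The survey table, the column names, and the 90/10 split are hypothetical.

```python
import pandas as pd

# Hypothetical survey data: 1 = satisfied (majority), 0 = unsatisfied (minority)
df = pd.DataFrame({
    'score': range(100),
    'satisfied': [1] * 90 + [0] * 10,
})

# Separate the data by class
df_majority = df[df['satisfied'] == 1]
df_minority = df[df['satisfied'] == 0]

# Randomly sample the majority class, without replacement, down to the minority size
df_majority_down = df_majority.sample(n=len(df_minority), random_state=42)

# Combine the down-sampled majority class with the untouched minority class
df_balanced = pd.concat([df_majority_down, df_minority])

# Verify the class distribution
print(df_balanced['satisfied'].value_counts())
```

##When a model is trained afterwards, the down-sampling is normally applied to the training split only, so that the test data keeps the original class distribution.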

In [None]:
##Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

##A11. Below are the steps that can be followed to deal with the problem:

##Separate Data by Class: Split the original dataset into two subsets: one for the majority class (common output) and one for the minority class (rare output).
##Determine the Size of the Majority Class: Calculate the number of instances in the majority class to know how many samples are needed after up-sampling.
##Up-sample the Minority Class: Randomly sample with replacement from the minority class to match the size of the majority class.
##Combine Up-sampled Minority Class with Majority Class: Concatenate the up-sampled minority class with the original majority class to form a balanced dataset.
##Verify the Balanced Dataset: Check the class distribution to ensure the dataset is now balanced between common and rare outputs.
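
##The steps above can be sketched with sklearn.utils.resample as follows. The event table, the column names, and the 95/5 split are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical data: 1 = rare event (minority), 0 = common outcome (majority)
df = pd.DataFrame({
    'feature': range(100),
    'event': [0] * 95 + [1] * 5,
})

# Separate the data by class
df_majority = df[df['event'] == 0]
df_minority = df[df['event'] == 1]

# Sample the minority class with replacement up to the majority size
df_minority_up = resample(df_minority,
                          replace=True,
                          n_samples=len(df_majority),
                          random_state=42)

# Combine the up-sampled minority class with the untouched majority class
df_balanced = pd.concat([df_majority, df_minority_up])

# Verify the class distribution
print(df_balanced['event'].value_counts())
```

##Because the minority rows are duplicated, up-sampling is normally applied to the training split only; otherwise copies of the same row can end up in both the training and the test data.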