# Assignment

## Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Ans: Missing values in a dataset refer to the absence of data for certain variables or observations. They can occur due to various reasons such as data entry errors, equipment malfunction, or non-response in surveys. Handling missing values is crucial because they can lead to biased or inefficient analysis and modeling results. Missing values can disrupt statistical calculations, reduce the representativeness of the dataset, and affect the performance of machine learning algorithms.
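As a minimal sketch of how missing values show up in practice, the snippet below builds a small hypothetical DataFrame (the column names and values are illustrative assumptions, not real data) and counts the missing entries with pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; NaN/None mark entries that were never recorded
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [52000, 61000, np.nan, np.nan],
    "city": ["Pune", None, "Delhi", "Mumbai"],
})

# Number of missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Fraction of rows that have at least one missing value
rows_with_missing = df.isna().any(axis=1).mean()
print(rows_with_missing)
```

Checking these counts first helps decide whether deletion or imputation is the more reasonable next step.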

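No algorithm is truly unaffected by missing values, but some can handle them natively, depending on the implementation:

1. Decision Trees: CART-style decision trees can handle missing values through surrogate splits, which fall back on other available predictors when the primary split variable is missing.

2. Random Forests: Support depends on the implementation. Breiman's original method, for example, fills missing values using column medians or modes, or using proximity-weighted values, rather than relying on surrogate splits.

3. Gradient Boosting Machines (GBMs): Implementations such as XGBoost and LightGBM have built-in handling of missing values: at each split they learn which branch the missing values should follow.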

## Q2: List down techniques used to handle missing data. Give an example of each with python code.

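Ans: Two common techniques are deletion and imputation.

1. Deleting Rows: This technique removes every row that contains a missing value. It should only be used when the amount of missing data is very small and the values are missing at random; otherwise it discards useful data and can bias the results.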

In [None]:
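# Example: Deleting rows with missing values
import numpy as np
import pandas as pd

# Illustrative DataFrame with missing values
df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['Pune', None, 'Delhi']})

# Drop every row that contains at least one missing value
df = df.dropna()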


2. Imputation: Imputation involves filling in missing values with estimated or calculated values. One common method is mean imputation.

In [None]:
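# Example: Mean imputation using pandas
import numpy as np
import pandas as pd

# Illustrative DataFrame with a missing value in 'column_name'
df = pd.DataFrame({'column_name': [10.0, np.nan, 30.0]})

# Replace missing entries with the column mean (assign back instead of chained inplace)
mean_value = df['column_name'].mean()
df['column_name'] = df['column_name'].fillna(mean_value)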


## Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Ans: Imbalanced data refers to a situation where the classes or categories in the target variable are not represented equally. It commonly occurs in classification problems when one class dominates the dataset, while the other class(es) have significantly fewer instances.

If imbalanced data is not handled properly, it can lead to biased model performance. Algorithms trained on imbalanced data tend to favor the majority class, resulting in poor predictions for the minority class. This can be problematic, especially in scenarios where the minority class is of particular interest, such as fraud detection or rare disease diagnosis.
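To make the bias concrete, here is a small sketch with hypothetical labels (95 legitimate and 5 fraudulent transactions, numbers chosen only for illustration). A "model" that always predicts the majority class looks accurate while missing every minority case.

```python
# Hypothetical labels: 95 legitimate transactions (0) and 5 fraudulent ones (1)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy looks high because the majority class dominates
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of actual frauds that were caught
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"minority recall = {minority_recall:.2f}")
```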

## Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Ans: Up-sampling and down-sampling are techniques used to address imbalanced data.

Up-sampling involves increasing the number of instances in the minority class to match the majority class. This can be done by randomly replicating samples from the minority class.

Down-sampling involves reducing the number of instances in the majority class to match the minority class. This can be done by randomly removing samples from the majority class.

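For example, if you have an imbalanced dataset with 100 instances of the majority class and 20 instances of the minority class, you can up-sample the minority class by randomly selecting and replicating instances until both classes have an equal number of instances (e.g., 100 instances each). Alternatively, you can down-sample the majority class by randomly keeping only 20 of its 100 instances. Up-sampling is usually preferred when the dataset is small and no data can be spared, while down-sampling suits cases where the majority class is large enough that discarding part of it loses little information.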
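The 100-versus-20 up-sampling case can be sketched with pandas as follows. The DataFrame, its column names, and the `random_state` value are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical dataset: 100 majority rows (label 0) and 20 minority rows (label 1)
df = pd.DataFrame({
    "feature": range(120),
    "label": [0] * 100 + [1] * 20,
})

majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Sample minority rows with replacement until they match the majority count
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are not grouped together
balanced = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
balanced = balanced.reset_index(drop=True)

print(balanced["label"].value_counts())
```

In practice, resampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.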

## Q5: What is data Augmentation? Explain SMOTE.

Ans: Data augmentation is a technique used to artificially increase the size of a dataset by creating variations or modifications of the existing data. It is commonly used in machine learning and computer vision tasks to improve model performance.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used specifically for imbalanced datasets. Instead of duplicating existing rows, it generates synthetic samples for the minority class by interpolating between a minority instance and one of its nearest minority-class neighbors. SMOTE helps balance the class distribution and provides additional training samples for the minority class.
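The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical. In real projects a maintained implementation, such as the `SMOTE` class in the imbalanced-learn library, is likely the better choice.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X_minority, dtype=float)
    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    # Indices of the k nearest minority neighbors of each sample
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X))        # pick a random minority sample
        nb = rng.choice(neighbors[base])   # pick one of its k nearest neighbors
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic[i] = X[base] + gap * (X[nb] - X[base])
    return synthetic

# Hypothetical 2-D minority class with five samples
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0], [2.5, 2.0]])
X_new = smote_sketch(X_min, n_new=10)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority samples, which is why the new points stay inside the region the minority class already occupies.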

## Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Ans: Outliers in a dataset are data points that significantly deviate from the majority of other data points. They can be caused by measurement errors, data corruption, or genuinely rare events. Handling outliers is essential because they can have a disproportionate influence on statistical analyses and modeling results, leading to biased conclusions or inaccurate predictions.

Outliers can affect statistical measures such as the mean and standard deviation, making them less representative of the overall data distribution. They can also impact the performance of machine learning algorithms, particularly those sensitive to the scale and distribution of the data.
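A short sketch of this effect, using hypothetical measurements where the last value is an obvious outlier. The 1.5 × IQR rule used to flag it is one common rule of thumb and is an assumption here, not something the answer above prescribes.

```python
import numpy as np

# Hypothetical measurements; the last value is an obvious outlier
values = np.array([10, 12, 11, 13, 12, 11, 10, 95])

# Compare summary statistics with and without the outlier
mean_with, mean_without = values.mean(), values[:-1].mean()
median_with, median_without = np.median(values), np.median(values[:-1])
print(f"mean: {mean_without:.2f} -> {mean_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")

# Flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
print(values[is_outlier])
```

The mean shifts substantially when the outlier is included, while the median barely moves, which is why robust statistics are often preferred when outliers are present.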

## Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Ans: When handling missing data in customer analysis, you can use techniques such as:

1. Deleting rows with missing values if the amount of missing data is small and random.
2. Imputing missing values using mean imputation, median imputation, or mode imputation based on the type of variable.
3. Using advanced imputation techniques like KNN imputation or regression imputation if the missingness has a complex pattern.
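As a minimal sketch of median and mode imputation, assuming a hypothetical `customers` DataFrame with one numeric and one categorical column (names and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical customer records with gaps in a numeric and a categorical column
customers = pd.DataFrame({
    "annual_spend": [1200.0, np.nan, 800.0, 15000.0, np.nan],
    "plan": ["basic", "premium", None, "basic", "basic"],
})

# Median imputation for a numeric column (less sensitive to extreme values than the mean)
spend_median = customers["annual_spend"].median()
customers["annual_spend"] = customers["annual_spend"].fillna(spend_median)

# Mode imputation for a categorical column (most frequent category)
plan_mode = customers["plan"].mode()[0]
customers["plan"] = customers["plan"].fillna(plan_mode)

print(customers)
```

For KNN imputation, scikit-learn's `KNNImputer` is one available option; it is not shown here to keep the example dependent on pandas only.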

## Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

Ans: To determine if the missing data is missing at random or if there is a pattern, you can employ the following strategies:

1. Missing Data Visualization: Plotting missingness patterns across variables and observations can help identify any patterns or correlations.
2. Statistical Tests: Conduct statistical tests to examine if the missingness is related to other variables in the dataset. For example, the Chi-square test or correlation tests can be used to check the association between missingness and other variables.
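The statistical-test idea can be sketched by turning missingness into an indicator column and checking whether it depends on another variable. The data, column names, and the choice of a plain (uncorrected) chi-square statistic computed by hand are assumptions for illustration; a library routine such as `scipy.stats.chi2_contingency` would also return a p-value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing more often in segment B than in segment A
df = pd.DataFrame({
    "segment": ["A"] * 50 + ["B"] * 50,
    "income": [np.nan] * 5 + [50000.0] * 45 + [np.nan] * 25 + [40000.0] * 25,
})

# Indicator column: True where income is missing
df["income_missing"] = df["income"].isna()

# Observed counts of present vs. missing income per segment
observed = pd.crosstab(df["segment"], df["income_missing"])
print(observed)

# Chi-square statistic: compare observed counts with those expected under independence
obs = observed.to_numpy()
expected = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
chi2 = ((obs - expected) ** 2 / expected).sum()

# 3.841 is the 5% critical value for 1 degree of freedom (2x2 table)
print(f"chi2 = {chi2:.2f}, related to segment: {chi2 > 3.841}")
```

A statistic well above the critical value suggests the missingness is related to `segment`, so the data is unlikely to be missing completely at random.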

## Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Ans: Strategies to evaluate the performance of a machine learning model on an imbalanced dataset with a majority of negative instances and a small percentage of positive instances include:

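1. Confusion Matrix: Build the confusion matrix and derive precision, recall, and F1-score for the positive (minority) class. Accuracy alone is misleading here, because a model that always predicts the majority class still scores high.
2. ROC Curve and AUC: Plot the Receiver Operating Characteristic (ROC) curve and calculate the Area Under the Curve (AUC) to assess the model's ability to discriminate between classes. Note that ROC AUC can look optimistic when negatives vastly outnumber positives.
3. Precision-Recall Curve: Plot the Precision-Recall curve to analyze the trade-off between precision and recall for different classification thresholds. It focuses on the positive class, so it is often more informative than the ROC curve when positives are rare.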
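The confusion-matrix metrics can be computed by hand, as in the sketch below. The labels and predictions are hypothetical (100 patients, 10 with the condition) and chosen only to show how accuracy and the minority-class metrics diverge.

```python
# Hypothetical results for 100 patients: 1 = has the condition, 0 = does not
y_true = [1] * 10 + [0] * 90
# The model finds 6 of the 10 positives and raises 4 false alarms
y_pred = [1] * 6 + [0] * 4 + [1] * 4 + [0] * 86

# Confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many are real
recall = tp / (tp + fn)      # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here accuracy is high even though the model misses 4 of the 10 patients with the condition, which recall makes visible.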


## Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

Ans: To balance a dataset dominated by satisfied customers, you can down-sample the majority class using methods such as:

1. Random Under-sampling: Randomly select a subset of instances from the majority class to match the number of instances in the minority class.
2. Cluster-Based Under-sampling: Use clustering techniques to identify representative instances from the majority class and remove redundant instances.
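Random under-sampling can be sketched with pandas as follows. The `survey` DataFrame, its 900-to-100 class split, and the `random_state` value are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical survey: 900 satisfied customers (1) and 100 dissatisfied ones (0)
survey = pd.DataFrame({
    "customer_id": range(1000),
    "satisfied": [1] * 900 + [0] * 100,
})

majority = survey[survey["satisfied"] == 1]
minority = survey[survey["satisfied"] == 0]

# Keep a random subset of the majority class, sampled without replacement
majority_downsampled = majority.sample(n=len(minority), random_state=42)

balanced = pd.concat([majority_downsampled, minority]).reset_index(drop=True)
print(balanced["satisfied"].value_counts())
```

The trade-off is that 800 majority rows are discarded, so this approach fits best when the majority class is large enough to spare them.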

## Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

Ans: To balance a dataset in which the rare event makes up only a small share of the observations, you can up-sample the minority class using methods such as:

1. Random Over-sampling: Randomly replicate instances from the minority class to increase its representation in the dataset.
2. Synthetic Minority Over-sampling Technique (SMOTE): Generate synthetic samples for the minority class by interpolating between neighboring instances to create new instances.