# Q.1.

Missing values in a dataset are the values that are not available for some variables or observations. There can be different reasons for missing values such as data entry errors, non-response, or missing data due to a technical issue.

Handling missing values is essential because they can lead to biased or inaccurate results and reduce the effectiveness of the analysis. Incomplete data can also lead to reduced statistical power and can make it difficult to draw meaningful conclusions from the analysis. Therefore, it is important to handle missing values to ensure the accuracy and completeness of the analysis.

Few algorithms are truly unaffected by missing values, but some tolerate them better than others, depending on the implementation:

1. Decision Trees: Some decision tree implementations handle missing values natively, for example by using surrogate splits or by sending a sample down each branch with a probability estimated from the available data.

2. Random Forest: Random forest is an ensemble learning algorithm that uses decision trees. Some implementations handle missing values internally, for example by imputing them with the most common value or the median from the available data.

3. K-Nearest Neighbors: K-Nearest Neighbors (KNN) is a non-parametric algorithm that relies on distance calculations, so it does not handle missing values on its own. In practice the missing values are first imputed, for example with the mean or median of the available data.

4. Naive Bayes: Naive Bayes is a probabilistic algorithm that treats features as conditionally independent, so a missing feature can simply be left out of the probability calculation for that sample. Alternatively, missing values can be imputed with the most common value or the mean from the available data.

5. Support Vector Machines: Support Vector Machines (SVM) do not handle missing values natively. Missing values are usually imputed with the mean or median of the available data, or the affected rows are dropped before training.

# Q.2.

# Methods for handling missing values


## Deleting rows with missing values

In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

In [2]:
df = sns.load_dataset('penguins')

In [3]:
# this shows how many missing values each column has
df.isnull().sum()      

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [4]:
df1 = df.dropna()

In [5]:
# now the number of missing values in each column is zero
df1.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    0
body_mass_g          0
sex                  0
dtype: int64

## Mean Imputation 

This technique involves replacing the missing values with the mean value of the available data.

In [6]:
# here we are filling the null values in the 'bill_length_mm' column with the column mean

df['bill_length_mm']=df['bill_length_mm'].fillna(df['bill_length_mm'].mean())

In [7]:
# as you can see number of missing values in 'bill_length_mm' column is now zero 

df.isnull().sum()

species               0
island                0
bill_length_mm        0
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

## Median Imputation

This technique involves replacing the missing values with the median value of the available data.

In [8]:
# here we are filling the null values in the 'bill_depth_mm' column with the column median

df['bill_depth_mm']=df['bill_depth_mm'].fillna(df['bill_depth_mm'].median())

In [9]:
# as you can see number of missing values in 'bill_depth_mm' column is now zero

df.isnull().sum()

species               0
island                0
bill_length_mm        0
bill_depth_mm         0
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

## Mode Imputation

This technique involves replacing the missing values with the mode (the most frequent value) of the available data. It is typically used for categorical data.

In [10]:
# here we are filling the null values in the 'sex' column with the most frequent value

df['sex']=df['sex'].fillna(df['sex'].mode()[0])

In [11]:
# as you can see number of missing values in 'sex' column is now zero

df.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    2
body_mass_g          2
sex                  0
dtype: int64

## Random Imputation

Random imputation is a technique that replaces missing values with values sampled at random from the observed data.

In [12]:
# here we are filling the missing values in the 'body_mass_g' column

number_missing = df['body_mass_g'].isnull().sum()
observed_values = df.loc[df['body_mass_g'].notnull(), 'body_mass_g']

In [13]:
df.loc[df['body_mass_g'].isnull(), 'body_mass_g'] = np.random.choice(observed_values, number_missing, replace = True)

In [14]:
df.isnull().sum()

species              0
island               0
bill_length_mm       0
bill_depth_mm        0
flipper_length_mm    2
body_mass_g          0
sex                  0
dtype: int64
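
The random imputation above draws different values on every run. The following is a minimal sketch of a reproducible variant, assuming a small hypothetical Series in place of the penguins column and an arbitrarily chosen seed; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['body_mass_g'].
s = pd.Series([3750.0, np.nan, 3250.0, 4200.0, np.nan, 3900.0])

# Fixed seed (an assumption) so the random draw is repeatable.
rng = np.random.default_rng(0)

missing = s.isnull()
observed = s[~missing].to_numpy()

# Sample with replacement from the observed values, one draw per missing entry.
filled = s.copy()
filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)

print(filled.isnull().sum())  # → 0
```

Because the imputed values come from the observed ones, the filled column keeps roughly the same spread as the original data, which mean or median imputation does not.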

# Q.3.

Imbalanced data refers to a situation where the distribution of the classes in a dataset is highly skewed, i.e., one class has significantly more samples than the other(s). In other words, the number of samples in one class is much smaller than the number of samples in the other class(es). For example, in a binary classification problem, if the positive class (class of interest) has only 5% of the samples, and the remaining 95% belong to the negative class, then the dataset is highly imbalanced.

Imbalanced data can lead to biased model performance because most machine learning algorithms are designed to optimize overall accuracy, which tends to favor the majority class. In other words, the model may not learn to distinguish between the minority and majority classes and may predict the majority class for most samples, resulting in poor performance for the minority class. This is especially problematic in applications where correctly identifying the minority class is crucial, such as detecting fraudulent transactions or rare diseases.
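
The effect described above can be illustrated with a small sketch. It assumes hypothetical labels with the same 5% / 95% split and a "model" that always predicts the majority class.

```python
import numpy as np

# Hypothetical labels: 5 positives (minority) and 95 negatives (majority).
y_true = np.array([1] * 5 + [0] * 95)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

# Overall accuracy looks good ...
accuracy = (y_pred == y_true).mean()

# ... but none of the minority samples are found.
recall_pos = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(f"accuracy = {accuracy:.2f}, minority recall = {recall_pos:.2f}")
# → accuracy = 0.95, minority recall = 0.00
```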

If imbalanced data is not handled, the resulting model may have poor performance for the minority class, leading to missed opportunities to identify the class of interest. In some cases, the model may even perform worse than a random guess on the minority class, which is highly undesirable. This can lead to negative consequences in real-world applications, such as financial losses, patient harm, or missed opportunities for intervention.

# Q.4.

Up-sampling and down-sampling are techniques used to handle imbalanced data by adjusting the number of samples in each class to create a more balanced dataset.

Up-sampling involves increasing the number of samples in the minority class by randomly duplicating existing samples or generating new synthetic samples. For example, suppose we have a binary classification problem where the positive class has only 10% of the samples, and the negative class has 90%. In that case, we can up-sample the positive class by randomly duplicating existing samples or generating new synthetic samples until we have a more balanced dataset.

Down-sampling, on the other hand, involves reducing the number of samples in the majority class by randomly removing samples. For example, if we have a binary classification problem where the positive class has only 10% of the samples, and the negative class has 90%, we can down-sample the negative class by randomly removing samples until we have a more balanced dataset.
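
Both techniques can be sketched with pandas. The example below assumes a hypothetical toy dataset with the 10% / 90% split described above; the column names and the `random_state` value are illustrative choices.

```python
import pandas as pd

# Hypothetical toy dataset: 10 positives, 90 negatives.
data = pd.DataFrame({
    "feature": range(100),
    "label": [1] * 10 + [0] * 90,
})
minority = data[data["label"] == 1]
majority = data[data["label"] == 0]

# Up-sampling: draw minority rows with replacement until they match the majority.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: keep a random subset of majority rows the size of the minority.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())    # 90 rows per class
print(downsampled["label"].value_counts().to_dict())  # 10 rows per class
```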

Up-sampling and down-sampling can be required in situations where the class distribution in the dataset is highly imbalanced, and the model's performance for the minority class is poor. This is often the case in applications where the cost of misclassifying the minority class is high, such as fraud detection or rare disease diagnosis. By up-sampling the minority class or down-sampling the majority class, we give the model a more balanced training set, so it is less likely to ignore the minority class. Note that the resampled data no longer reflects the true class proportions, so resampling should be applied to the training data only.

It's worth noting that both techniques have limitations: up-sampling by duplication can lead to overfitting on the repeated minority samples, while down-sampling discards potentially useful information from the majority class. Therefore, it's essential to carefully evaluate the performance of the model after applying these techniques and consider other approaches such as cost-sensitive learning or ensemble methods to handle imbalanced data.

# Q.5.

Data augmentation is a technique used to artificially increase the size of a dataset by creating new examples from the existing ones. It is commonly used in machine learning and deep learning applications to improve the performance of models by providing them with more training examples.

# SMOTE


SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used to address the problem of imbalanced datasets. It is designed to address the issue where the minority class in a dataset is under-represented compared to the majority class.

The SMOTE algorithm works by generating synthetic examples of the minority class by interpolating between pairs of existing examples. It works as follows:

1. For each example in the minority class, find its k nearest neighbors in the feature space.

2. Choose one of these k neighbors at random, and create a new example that is a linear interpolation between the original example and the chosen neighbor.

3. Repeat this process until the desired level of over-sampling is achieved.

The key idea behind SMOTE is to create synthetic examples that are similar to the minority class examples, but are not identical. By creating new examples in this way, the SMOTE algorithm can help balance the class distribution and improve the performance of machine learning models trained on imbalanced datasets.
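
The interpolation step can be sketched in NumPy. This is a simplified illustration rather than a reference implementation: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, and the base sample is picked at random instead of looping over every minority example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic points from minority samples X_min (needs k < len(X_min))."""
    rng = np.random.default_rng(seed)
    n = len(X_min)

    # Pairwise Euclidean distances between minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                 # pick a minority sample
        nb = neighbors[i, rng.integers(k)]  # pick one of its k neighbors
        gap = rng.random()                  # interpolation factor in [0, 1)
        synthetic[j] = X_min[i] + gap * (X_min[nb] - X_min[i])
    return synthetic

# Hypothetical minority-class samples with two features.
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_sketch(X_min, n_new=6)
print(new_points.shape)  # → (6, 2)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, which is why the new points resemble the existing ones without duplicating them.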



# Q.6.

Outliers are data points in a dataset that are significantly different from other data points in the same dataset. These are extreme values that can have a significant impact on the results of statistical analysis and machine learning models. Outliers can arise due to measurement errors, data entry errors, or they may represent genuine extreme values in the data.

It is important to handle outliers because they can have a significant impact on the results of data analysis and machine learning models. Outliers can skew statistical results, making it difficult to draw accurate conclusions from the data. In machine learning, outliers can cause models to overfit to the training data, leading to poor performance on new data.

Handling outliers typically involves either removing them from the dataset or transforming them in some way so that they no longer have a significant impact on the results. However, it is important to handle outliers with care, as removing too many or too few outliers can have a negative impact on the accuracy of the results.
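
Both options, removing and transforming, can be sketched as follows. The 1.5 × IQR rule used here to flag outliers is one common convention chosen for illustration, and the data is hypothetical.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0])

# Fences at 1.5 * IQR below the first and above the third quartile.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)

removed = values[~is_outlier]            # option 1: drop the outliers
clipped = np.clip(values, lower, upper)  # option 2: cap them at the fences

print(values[is_outlier])  # → [95.]
```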

# Q.7.

There are several techniques that can be used to handle missing data in a dataset:

Deletion: This involves removing any rows or columns that contain missing values. However, this method can result in loss of valuable information, especially if the missing values represent a large proportion of the data.

Imputation: This involves replacing missing values with estimated values. Some popular imputation techniques include mean imputation, median imputation, mode imputation, regression imputation, and k-nearest neighbor imputation.

Prediction modeling: This involves using machine learning algorithms to predict the missing values based on the available data. This technique can be more accurate than simple imputation methods, but it can also be computationally expensive.

Multiple imputation: This involves creating several imputed datasets, each with slightly different plausible values drawn for the missing entries, analyzing each dataset separately, and then pooling the results. Unlike single imputation, this reflects the uncertainty about the missing values.

Ignore: This involves ignoring the missing data and only analyzing the available data. However, this method can result in biased or incomplete results, especially if the missing values represent a large proportion of the data.
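
As an illustration of one of the imputation techniques listed above, here is a minimal sketch of regression imputation. It assumes a hypothetical two-column dataset in which `y` depends linearly on `x`; the names `df_toy`, `x`, and `y` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y is linear in x, with two missing y values.
df_toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],
})

known = df_toy["y"].notnull()

# Fit y = slope * x + intercept on the rows where y is observed.
slope, intercept = np.polyfit(df_toy.loc[known, "x"], df_toy.loc[known, "y"], deg=1)

# Predict y for the rows where it is missing.
df_toy.loc[~known, "y"] = slope * df_toy.loc[~known, "x"] + intercept

print(df_toy)
```

Unlike mean imputation, the filled values here follow the relationship between the two columns, at the cost of fitting a model first.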

# Q.8.

There are several strategies that can be used to determine if missing data is missing at random or if there is a pattern to the missing data:

Analyze the missing data pattern: One way to determine if there is a pattern to the missing data is to look at the missing data pattern itself. For example, you can create a visualization of the missing data pattern to see if there are any noticeable patterns or trends.

Look at correlations: You can also look at the correlations between the missing data and other variables in the dataset. If there is a strong correlation between the missing data and another variable, this may suggest that there is a pattern to the missing data.

Statistical tests: You can use statistical tests to check whether the missingness is related to the observed data. One common test is Little's MCAR (Missing Completely At Random) test, whose null hypothesis is that the data is missing completely at random. Rejecting it suggests that there is a pattern to the missing data.

Imputation techniques: Another, more indirect, way to look for a pattern is to compare imputation techniques. If a method that uses the other variables (such as regression or k-nearest neighbor imputation) performs clearly better than a simple one, this may suggest that the missing values are related to those variables.
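
The correlation check described above can be sketched with a missingness indicator. The data below is hypothetical and constructed so that `income` is missing only for the youngest rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which 'income' is missing only for the youngest rows.
toy = pd.DataFrame({
    "age":    [22, 25, 31, 38, 45, 52, 58, 63],
    "income": [np.nan, np.nan, 41.0, 48.0, 55.0, 60.0, 62.0, 66.0],
})

# 1 where income is missing, 0 where it is observed.
toy["income_missing"] = toy["income"].isnull().astype(int)

# Compare another variable across the two groups ...
group_means = toy.groupby("income_missing")["age"].mean()
print(group_means)

# ... and correlate the indicator with it.
corr = toy["income_missing"].corr(toy["age"])
print(round(corr, 2))
```

A clearly negative correlation, together with a much lower mean age in the missing group, would suggest that the values are not missing completely at random.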

# Q.9.

Here are some strategies that can be used to evaluate the performance of a machine learning model on an imbalanced dataset:

Confusion matrix: A confusion matrix gives a better picture of the performance of a model than accuracy alone. It shows the number of true positives, true negatives, false positives, and false negatives. From the confusion matrix, metrics such as precision, recall, and F1-score can be calculated.

Precision-Recall curve: The precision-recall (PR) curve is another way to evaluate the performance of a model on an imbalanced dataset. The PR curve shows the trade-off between precision and recall at different probability thresholds. The area under the PR curve (AUC-PR) is a single-number summary of the performance of the model.

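Resampling techniques: Resampling techniques such as oversampling the minority class, undersampling the majority class, or a combination of both can help to balance the class distribution and improve the performance of the model. They should be applied to the training data only, so that the model is still evaluated on data with the original class distribution.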
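
The metrics derived from the confusion matrix can be computed directly. The following is a minimal sketch, assuming hypothetical labels and predictions in which 1 marks the minority class.

```python
import numpy as np

# Hypothetical ground truth: 4 positives (minority) and 16 negatives.
y_true = np.array([1] * 4 + [0] * 16)
# Hypothetical predictions: 3 of 4 positives found, 2 false alarms.
y_pred = np.array([1, 1, 1, 0] + [1, 1] + [0] * 14)

# Entries of the confusion matrix.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(tp, fp, fn, tn)  # → 3 2 1 14
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# → precision=0.60 recall=0.75 f1=0.67 accuracy=0.85
```

Here accuracy is 0.85 even though only 60% of the positive predictions are correct, which is why precision, recall, and F1 are the more informative numbers on imbalanced data.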

# Q.10.

Here are some methods that can be employed to down-sample the majority class:

Random under-sampling: In this method, we randomly select a subset of samples from the majority class such that the number of samples in the majority class becomes equal to the number of samples in the minority class.

Cluster centroids: In this method, we cluster the majority class (for example with k-means) and replace its samples with the cluster centroids, so that a smaller number of representative points stands in for the original majority samples.

Tomek links: Tomek links are pairs of samples from different classes that are each other's nearest neighbors. In this method, we identify Tomek links between the majority and minority classes and remove the majority class sample from each pair.

NearMiss: NearMiss is an under-sampling method that selects samples from the majority class based on their distance to the minority class. There are three variations of this method - NearMiss-1, NearMiss-2, and NearMiss-3.
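
The Tomek links method can be sketched in NumPy. This is an illustrative brute-force version under stated assumptions: the function name `tomek_link_majority_indices` and the toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label=0):
    """Return indices of majority samples that are part of a Tomek link."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a point is not its own neighbor
    nearest = dist.argmin(axis=1)    # nearest neighbor of every sample

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i     # i and j are each other's nearest neighbor
        different = y[i] != y[j]     # and belong to different classes
        if mutual and different and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical 1-D data: the majority point at 2.1 sits next to the minority point at 2.0.
X = np.array([[0.0], [0.4], [1.0], [2.1], [2.0], [3.0]])
y = np.array([0, 0, 0, 0, 1, 1])

drop = tomek_link_majority_indices(X, y)
X_res, y_res = np.delete(X, drop, axis=0), np.delete(y, drop)
print(drop)  # → [3]
```

Only the majority sample on the class boundary is removed, so this method cleans up overlap between the classes rather than balancing their sizes.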

# Q.11.

To up-sample the minority class in an imbalanced dataset, the following methods can be employed:

Random over-sampling: In this method, samples are randomly duplicated from the minority class to balance the dataset. However, this method can lead to overfitting, as it does not introduce new information into the dataset.

Synthetic Minority Over-sampling Technique (SMOTE): SMOTE is a popular method for up-sampling minority classes. In this method, synthetic samples are generated from the minority class by interpolating between neighboring minority class samples. Because the new samples are not exact copies, SMOTE is less prone to overfitting than random over-sampling.

Adaptive Synthetic Sampling (ADASYN): ADASYN is an extension of SMOTE that generates more synthetic samples for minority class samples that are harder to learn, i.e., minority samples that are closer to the majority class samples. This method can improve the classification performance of the minority class.
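
The way ADASYN focuses on harder samples can be sketched as an allocation step: each minority sample receives a share of the synthetic samples proportional to the fraction of majority samples among its nearest neighbors. The function name `adasyn_allocation`, its parameters, and the toy data are hypothetical, and the sketch covers only the allocation, not the generation of the samples.

```python
import numpy as np

def adasyn_allocation(X, y, n_new, k=3, minority_label=1):
    """Split n_new synthetic samples across minority points by neighborhood difficulty."""
    min_idx = np.flatnonzero(y == minority_label)
    dist = np.linalg.norm(X[min_idx, None, :] - X[None, :, :], axis=2)
    dist[np.arange(len(min_idx)), min_idx] = np.inf   # exclude the point itself
    neighbors = np.argsort(dist, axis=1)[:, :k]       # k nearest in the full dataset

    # Fraction of each minority point's neighbors that belong to the majority class.
    ratio = (y[neighbors] != minority_label).mean(axis=1)
    if ratio.sum() == 0:
        weights = np.full(len(min_idx), 1.0 / len(min_idx))  # no hard points: split evenly
    else:
        weights = ratio / ratio.sum()
    return min_idx, np.rint(weights * n_new).astype(int)

# Hypothetical 1-D data: two minority points sit inside the majority region.
X = np.array([[0.0], [0.2], [0.4], [0.6], [1.0], [0.8], [1.3], [3.0], [3.2], [3.4]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

idx, counts = adasyn_allocation(X, y, n_new=10)
print(dict(zip(idx.tolist(), counts.tolist())))  # → {5: 6, 6: 4, 7: 0, 8: 0, 9: 0}
```

The minority points surrounded by majority samples (at 0.8 and 1.3) receive all of the new samples, while the well-separated ones receive none.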