### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data values in one or more columns of a dataset. This can occur due to various reasons such as data collection errors, data entry errors, data loss during transmission, or simply because the data does not exist.

It is essential to handle missing values in a dataset for several reasons. Firstly, the presence of missing values can adversely affect the quality of the analysis performed on the dataset. Secondly, some statistical analysis techniques such as regression analysis or factor analysis cannot be performed if there are missing values in the dataset. Finally, missing values can lead to bias in the analysis if not handled appropriately.

Few algorithms are completely unaffected by missing values, but some can work with them directly, while others need the gaps filled in first:

1. Decision Trees: Some Decision Tree implementations handle missing values natively, for example by using surrogate splits or by learning which branch missing values should follow at each split.

2. Random Forest: Random Forest is an ensemble of Decision Trees, so it inherits whatever missing-value handling the underlying tree implementation provides.

3. K-Nearest Neighbors: KNN relies on distances between complete feature vectors, so it does not handle missing values on its own. The missing entries are usually filled first with the most common value or the average value of the attribute.

4. Naive Bayes: Naive Bayes can handle missing values by ignoring the missing attribute when calculating probabilities, because each attribute contributes to the result independently.

5. Support Vector Machines (SVM): Like KNN, SVM needs complete feature vectors. In practice the missing values are replaced with the mean or median of the attribute before training.






### Q2: List down techniques used to handle missing data. Give an example of each with python code.

Here are some common techniques used to handle missing data in a dataset, along with examples in Python code:
### 1. Deletion:

This involves removing the rows or columns that contain missing values from the dataset. This technique is appropriate when the missing values are random and the amount of data loss is acceptable.

In [1]:
# Example of deleting rows with missing values
import pandas as pd

# create a sample dataset with missing values
data = {'A': [1, 2, 3, None, 5],
        'B': [6, None, 8, 9, 10],
        'C': [11, 12, None, 14, 15]}
df = pd.DataFrame(data)

# Print dataset before deletion
print('Data before deletion of missing rows : ')
print(df)

print('\n====================================\n')

# drop rows with missing values
df = df.dropna()

# Print dataset after deletion
print('Data after deletion of missing rows : ')
print(df)

Data before deletion of missing rows : 
     A     B     C
0  1.0   6.0  11.0
1  2.0   NaN  12.0
2  3.0   8.0   NaN
3  NaN   9.0  14.0
4  5.0  10.0  15.0


Data after deletion of missing rows : 
     A     B     C
0  1.0   6.0  11.0
4  5.0  10.0  15.0


In [2]:
# Example of deleting columns with missing values
import pandas as pd

# create a sample dataset with missing values
data = {'A': [1, 2, 3, 4, 5],
        'B': [6, 7 , 8, 9, 10],
        'C': [11, 12, None, None, 15]}
df = pd.DataFrame(data)

# Print dataset before deletion
print('Data before deletion of missing columns : ')
print(df)

print('\n====================================\n')

# drop columns with missing values
df = df.dropna(axis=1)

# Print dataset after deletion
print('Data after deletion of missing columns : ')
print(df)

Data before deletion of missing columns : 
   A   B     C
0  1   6  11.0
1  2   7  12.0
2  3   8   NaN
3  4   9   NaN
4  5  10  15.0


Data after deletion of missing columns : 
   A   B
0  1   6
1  2   7
2  3   8
3  4   9
4  5  10


### 2. Simple Imputation:
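This involves filling in the missing values with an estimated value based on the available data, such as the column mean, median, or mode. This technique is appropriate when the amount of missing data is relatively small, because filling many cells with the same value shrinks the column's variance and can distort relationships between variables.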

Mean Imputation: This should be used on numerical variables when there are no outliers, because the mean is sensitive to extreme values.

In [3]:
# Example of imputing missing values with mean value
import pandas as pd
from sklearn.impute import SimpleImputer

# create a sample dataset with missing values
data = {'A': [1, 2, 3, None, 5],
        'B': [6, None, 8, 9, 10],
        'C': [11, 12, None, 14, 15]}
df = pd.DataFrame(data)

# Print dataset before imputation
print('Data before mean imputation : ')
print(df)

print('\n====================================\n')

# impute missing values with mean value
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print('Data after mean imputation : ')
print(df_imputed)

Data before mean imputation : 
     A     B     C
0  1.0   6.0  11.0
1  2.0   NaN  12.0
2  3.0   8.0   NaN
3  NaN   9.0  14.0
4  5.0  10.0  15.0


Data after mean imputation : 
      A      B     C
0  1.00   6.00  11.0
1  2.00   8.25  12.0
2  3.00   8.00  13.0
3  2.75   9.00  14.0
4  5.00  10.00  15.0


Median Imputation: This should be used on numerical variables when outliers are present in the data, because the median is barely affected by extreme values.

In [4]:
# Example of imputing missing values with median value
import pandas as pd
from sklearn.impute import SimpleImputer

# create a sample dataset with missing values
data = {'A': [1, 2, 3, None, 5, 90],
        'B': [6, None, 8, 9, 10, 200],
        'C': [11, 12, None, 14, 15, 100]}
df = pd.DataFrame(data)

# Print dataset before imputation
print('Data before median imputation : ')
print(df)

print('\n====================================\n')

# impute missing values with median value
imputer = SimpleImputer(strategy='median')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print('Data after median imputation : ')
print(df_imputed)

Data before median imputation : 
      A      B      C
0   1.0    6.0   11.0
1   2.0    NaN   12.0
2   3.0    8.0    NaN
3   NaN    9.0   14.0
4   5.0   10.0   15.0
5  90.0  200.0  100.0


Data after median imputation : 
      A      B      C
0   1.0    6.0   11.0
1   2.0    9.0   12.0
2   3.0    8.0   14.0
3   3.0    9.0   14.0
4   5.0   10.0   15.0
5  90.0  200.0  100.0


Mode Imputation: This should be used to handle missing categorical data only.

In [5]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Create a sample dataframe with missing values
data = {'Gender': ['M', 'F', 'F', np.nan, 'M', 'F', 'M'],
        'City': ['NYC', 'LA', np.nan, 'LA', 'LA', 'NYC', np.nan],
        'Age': [30, 40, 25, np.nan, np.nan, 35, 28],
        'Income': [50000, np.nan, 80000, 60000, 70000, np.nan, 90000]}
df = pd.DataFrame(data)

# Print dataset before imputation
print('Data before mode imputation on categorical data : ')
print(df)

print('\n====================================\n')


# Create a SimpleImputer object with 'most_frequent' strategy
imputer = SimpleImputer(strategy='most_frequent')

# Impute the missing values in the categorical columns
df[['Gender', 'City']] = imputer.fit_transform(df[['Gender', 'City']])

# Display the dataframe after imputation
print('Data after mode imputation on categorical data : ')
print(df)

Data before mode imputation on categorical data : 
  Gender City   Age   Income
0      M  NYC  30.0  50000.0
1      F   LA  40.0      NaN
2      F  NaN  25.0  80000.0
3    NaN   LA   NaN  60000.0
4      M   LA   NaN  70000.0
5      F  NYC  35.0      NaN
6      M  NaN  28.0  90000.0


Data after mode imputation on categorical data : 
  Gender City   Age   Income
0      M  NYC  30.0  50000.0
1      F   LA  40.0      NaN
2      F   LA  25.0  80000.0
3      F   LA   NaN  60000.0
4      M   LA   NaN  70000.0
5      F  NYC  35.0      NaN
6      M   LA  28.0  90000.0


### 3. Iterative Imputer:
It is an advanced imputation technique used to handle missing data in a dataset. This technique uses machine learning algorithms to estimate the missing values based on the observed data. It models each feature that has missing values as a function of the other features in the dataset, and repeats this in rounds so that each round's estimates refine the previous ones. It should be used on numerical data only.

In [6]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# create a sample dataset with missing values
data = {'A': [1, 2, 3, None, 5],
        'B': [6, None, 8, 9, None],
        'C': [11, 12, None, 14, 15]}
df = pd.DataFrame(data)

# print Data before iterative imputation
print('Data before Iterative imputation : ')
print(df)

print('\n====================================\n')


# impute missing values with iterative imputer
imputer = IterativeImputer(max_iter=10, random_state=42)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print('Data after Iterative imputation : ')
print(df_imputed)

Data before Iterative imputation : 
     A    B     C
0  1.0  6.0  11.0
1  2.0  NaN  12.0
2  3.0  8.0   NaN
3  NaN  9.0  14.0
4  5.0  NaN  15.0


Data after Iterative imputation : 
     A         B     C
0  1.0  6.000000  11.0
1  2.0  7.000000  12.0
2  3.0  8.000000  13.0
3  4.0  9.000000  14.0
4  5.0  9.999999  15.0


### 4. K-nearest neighbor imputation:
This involves filling in each missing value using the values of the nearest neighbor(s) based on a distance metric; scikit-learn's `KNNImputer` averages the feature's values across the `n_neighbors` closest rows. This technique is appropriate when the feature with missing values is correlated with the other features, so that similar rows are likely to have similar values. It should be used on numerical variables only.

In [7]:
# Example of k-nearest neighbor imputation
import pandas as pd
from sklearn.impute import KNNImputer

# create a sample dataset with missing values
data = {'A': [1, 2, None, None, 5],
        'B': [6, None, None, 9, 10],
        'C': [None, 12, 13, 14, None]}
df = pd.DataFrame(data)

# Print dataset before imputation
print('Data before KNN imputation : ')
print(df)

print('\n====================================\n')

# impute missing values with k-nearest neighbor imputation
imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print('Data after KNN imputation : ')
print(df_imputed)

Data before KNN imputation : 
     A     B     C
0  1.0   6.0   NaN
1  2.0   NaN  12.0
2  NaN   NaN  13.0
3  NaN   9.0  14.0
4  5.0  10.0   NaN


Data after KNN imputation : 
          A          B     C
0  1.000000   6.000000  13.0
1  2.000000   8.333333  12.0
2  2.666667   8.333333  13.0
3  2.666667   9.000000  14.0
4  5.000000  10.000000  13.0


### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation in which the number of instances in one class is significantly higher or lower than the number of instances in the other class(es) in a classification problem. For instance, in a binary classification problem where the positive class is rare, the dataset may have only a few positive instances compared to the majority negative instances.

If imbalanced data is not handled, the model can suffer from several issues:

1. Bias towards the majority class: The model will be biased towards the majority class, and it will classify most of the instances as belonging to that class, even if they belong to the minority class. This is because the model optimizes for accuracy, and classifying all instances as the majority class would result in high accuracy.

2. Poor performance on the minority class: The model may have poor performance on the minority class, leading to low recall or sensitivity for that class. This is because the model does not have enough information about the minority class to learn the patterns and make accurate predictions.

3. Overfitting: The model may overfit to the majority class and ignore the minority class. Overfitting occurs when the model learns the noise in the data and makes decisions based on that noise rather than the underlying patterns in the data.

4. Incorrect evaluation: If the dataset is imbalanced, accuracy may not be a good evaluation metric as it can be misleading. For example, a model that classifies all instances as the majority class may have a high accuracy, but it will have poor performance on the minority class.

Therefore, it is essential to handle imbalanced data by applying techniques such as oversampling the minority class, undersampling the majority class, or using a combination of both. Other techniques include generating synthetic samples, using different evaluation metrics such as F1 score, ROC-AUC, or Precision-Recall curves, and using algorithms that are specifically designed to handle imbalanced data, such as the Cost-Sensitive Learning algorithms.
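The accuracy problem described above can be checked with a few lines of plain Python. This is a minimal sketch on made-up labels (95 negatives, 5 positives); the "model" is a stand-in that always predicts the majority class.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A stand-in "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy = {accuracy:.2f}")  # accuracy = 0.95
print(f"recall   = {recall:.2f}")    # recall   = 0.00
```

The accuracy looks excellent even though not a single minority instance is found, which is why recall or F1 is the more honest number here.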






### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two common techniques used to handle imbalanced data in a classification problem.

Up-sampling refers to the technique of adding more instances to the minority class, such that the number of instances in both classes is balanced. This can be done by randomly duplicating the existing instances in the minority class or by generating synthetic samples using algorithms like SMOTE (Synthetic Minority Over-sampling Technique).

Down-sampling, on the other hand, refers to the technique of removing instances from the majority class such that the number of instances in both classes is balanced. This can be done by randomly removing instances from the majority class or by selecting a subset of instances based on some criteria like distance.

An example where up-sampling and down-sampling may be required is in a medical diagnosis problem. Suppose a dataset contains 1000 instances of patients, out of which only 100 patients are diagnosed with a rare disease. In this case, the dataset is imbalanced, and the rare disease class is the minority class. If we train a classification model on this dataset, it is likely to have poor performance on the rare disease class.

To handle this problem, we can either up-sample the rare disease class by generating synthetic samples using SMOTE or down-sample the normal class by randomly removing instances. Up-sampling can be useful when we have limited instances in the minority class, and we want to create more instances to balance the dataset. Down-sampling can be useful when we have a large number of instances in the majority class, and we want to reduce the size of the majority class to balance the dataset.

Both up-sampling and down-sampling have their advantages and disadvantages, and the choice of which technique to use depends on the problem at hand and the characteristics of the dataset.
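As a rough illustration, random up-sampling and down-sampling for the 900 / 100 patient split above might look like this. The records are placeholders, and only the standard library is used; the variable names are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Placeholder records mirroring the example: 900 healthy, 100 with the disease.
majority = [("healthy", i) for i in range(900)]
minority = [("disease", i) for i in range(100)]

# Up-sampling: keep every minority record and add random duplicates
# (drawn with replacement) until both classes have the same size.
extra = random.choices(minority, k=len(majority) - len(minority))
upsampled = majority + minority + extra

# Down-sampling: keep a random subset of the majority class
# (drawn without replacement) that matches the minority size.
kept_majority = random.sample(majority, k=len(minority))
downsampled = kept_majority + minority

print(len(upsampled), len(downsampled))  # 1800 200
```

Up-sampling keeps all 1000 original records but repeats minority rows, while down-sampling discards 800 majority rows, which is the information loss mentioned above.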


### Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique used to increase the size of a dataset by creating new instances from the existing ones. The idea is to generate new instances that are similar to the original ones, but with some variations that make the dataset more diverse and representative of the real-world scenarios. This technique is widely used in computer vision, natural language processing, and other fields where large amounts of data are required to train machine learning models.

One popular data augmentation technique is Synthetic Minority Over-sampling Technique (SMOTE), which is used to handle imbalanced datasets. SMOTE works by generating synthetic samples for the minority class by interpolating between existing instances. The process involves selecting an instance from the minority class, selecting one of its k nearest neighbors from the minority class, and generating a new instance along the line connecting the two instances.

For example, suppose we have a dataset with two classes, class A and class B, where class A is the minority class. We apply SMOTE to up-sample the minority class by generating synthetic samples. We select an instance from class A and find its k nearest neighbors in the minority class. We then randomly select one of the neighbors and generate a new instance by interpolating between the two instances. We repeat this process until the number of instances in both classes is balanced.
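The interpolation step can be sketched in NumPy as follows. This is a simplified illustration of the idea, not the reference implementation from any library; the function name `smote_sketch`, its parameters, and the sample points are all assumptions made for the example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows from the minority samples in X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per row

    synthetic = np.empty((n_new, X_min.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(X_min))   # pick a minority sample
        j = rng.choice(neighbors[i])   # pick one of its k nearest neighbors
        gap = rng.random()             # random position in [0, 1)
        # New point on the line segment between sample i and neighbor j.
        synthetic[row] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synthetic

X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_rows = smote_sketch(X_minority, n_new=6)
print(new_rows.shape)  # (6, 2)
```

Because every synthetic row is a blend of two real minority rows, it always stays inside the region already covered by the minority class.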

SMOTE has several advantages over other up-sampling techniques. It generates synthetic samples that are representative of the minority class and are not exact copies of the existing instances. This helps to avoid overfitting and improves the generalization of the model. SMOTE also preserves the distribution of the minority class, unlike random oversampling, which can lead to overfitting and poor performance.

However, SMOTE also has some limitations. It can generate noisy samples if the classes overlap or the minority class itself contains a lot of noise, because interpolating between noisy points produces more noisy points. SMOTE can also push the model towards over-predicting the minority class if too many synthetic samples are generated. Therefore, SMOTE should be applied only to the training data (inside each cross-validation fold, not before the split), and the model should be evaluated on a test set that keeps the original class distribution.


### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that are significantly different from the other data points in a dataset. These are observations that lie far away from the other observations, either in terms of their magnitude or distribution. Outliers can occur due to errors in data collection, measurement errors, or natural variations in the data.

It is essential to handle outliers in a dataset for several reasons:

1. Outliers can affect the statistical properties of the dataset, such as the mean, variance, and correlation coefficients. Since these statistics are often used in machine learning models, outliers can result in biased estimates and poor performance.

2. Outliers can also affect the distribution of the data and the shape of the decision boundary. In classification problems, outliers can lead to misclassification and reduce the accuracy of the model.

3. Outliers can also lead to overfitting in machine learning models, where the model tries to fit the outlier points as well. This can result in poor generalization of the model to new data.

4. Outliers can also be indicative of errors or anomalies in the data. By removing the outliers, we can identify and correct these errors or anomalies in the data.

Handling outliers can involve several techniques such as removing the outliers, transforming the data, or treating them as missing values. The choice of the technique depends on the nature of the data and the problem at hand. In some cases, removing outliers may be appropriate if they are due to errors or anomalies. In other cases, transforming the data may be necessary to reduce the impact of outliers. It is important to be careful when handling outliers and to evaluate the effect of outlier removal on the performance of the machine learning model.
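One common way to flag outliers is the 1.5 × IQR rule. The rule is not named above and is used here only as an assumed, illustrative choice; the values are made up. The sketch also shows how strongly a single outlier pulls the mean.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 90], dtype=float)

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
print(values[is_outlier])  # [90.]

# Option 1: remove the flagged points.
cleaned = values[~is_outlier]
print(values.mean(), cleaned.mean())  # 17.5 3.0

# Option 2: keep every point but compress large values with a log transform.
transformed = np.log1p(values)
```

Removing the single value 90 moves the mean from 17.5 to 3.0, which shows why statistics such as the mean and variance are so sensitive to outliers.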






### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

Handling missing data is an essential step in data analysis as it can affect the accuracy and reliability of the results. There are several techniques that can be used to handle missing data:

1. Deletion: This technique involves removing the rows or columns that contain missing values. If the amount of missing data is small, deleting the rows or columns may not have a significant impact on the analysis. However, if the amount of missing data is large, deleting the data can lead to biased estimates and loss of information.

2. Imputation: This technique involves replacing the missing values with estimated values. There are several methods for imputing missing data, such as mean imputation, median imputation, and mode imputation. In mean imputation, the missing values are replaced with the mean value of the variable. In median imputation, the missing values are replaced with the median value of the variable. In mode imputation, the missing values are replaced with the most frequent value of the variable. Imputation can help to retain the information in the missing values and reduce the bias in the analysis.

3. Regression imputation: This technique involves predicting the missing values using a regression model. The regression model is trained on the non-missing values of the variable, and the predicted values are used to replace the missing values. Regression imputation can be useful when there is a strong correlation between the missing variable and other variables in the dataset.

4. Multiple imputation: This technique involves creating multiple imputed datasets, each with different imputed values, and combining the results of the analysis. Multiple imputation can help to account for the uncertainty in the imputed values and provide more accurate estimates of the results.

5. Using specialized models: In some cases, specialized models can be used to handle missing data. For example, decision trees and random forests can handle missing data without the need for imputation.

The choice of the technique depends on the nature of the data and the problem at hand. It is important to be careful when handling missing data and to evaluate the effect of the technique on the analysis.
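Regression imputation from the list above can be sketched with a one-variable linear fit. The columns `A` and `B` are made-up data, and `np.polyfit` is used here as a simple stand-in for whatever regression model suits the real dataset.

```python
import numpy as np

# Made-up columns: A is fully observed, B has two gaps.
A = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
B = np.array([6.0, np.nan, 8.0, 9.0, np.nan])

observed = ~np.isnan(B)

# Fit B = slope * A + intercept using only the rows where B is observed.
slope, intercept = np.polyfit(A[observed], B[observed], deg=1)

# Replace the gaps with the model's predictions.
B_imputed = B.copy()
B_imputed[~observed] = slope * A[~observed] + intercept

print(np.round(B_imputed, 2))
```

Here the observed points lie exactly on the line B = A + 5, so the gaps are filled with values close to 7 and 10. Real data is noisier, and plain regression imputation then understates the variability of the imputed column, which is one motivation for multiple imputation.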






### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

When dealing with missing data, there are several strategies to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are some of the most commonly used methods:

1. Analyze missingness patterns: You can start by examining the missingness patterns in the data. Plotting the distribution of missing values by variable or by record can help identify patterns of missingness. If the missingness patterns are random or similar across all variables, then it is likely that the missing data is missing at random. However, if there are patterns in the missingness, such as specific variables having higher rates of missing values or specific values within a variable being more likely to be missing, this suggests that the missing data may be non-random.

2. Correlation analysis: You can examine the correlation between the missingness of a variable and other variables in the dataset. If the missingness of a variable is not correlated with any other variable, then it is likely missing at random. However, if the missingness of a variable is correlated with other variables, it suggests that the missing data may be non-random.

3. Imputation and analysis: Impute the missing values using various techniques and compare the results. If the results are consistent across multiple imputation techniques, then it suggests that the missing data is missing at random. However, if the results vary significantly depending on the imputation technique used, it suggests that the missing data may be non-random.

4. Expert knowledge: Sometimes expert knowledge can help determine if the missing data is missing at random or not. For example, if you are studying the impact of a new medication, and patients who experience side effects are more likely to drop out of the study, then the missing data is likely not missing at random.

5. Statistical tests: You can use statistical tests such as Little’s MCAR (Missing Completely at Random) test to check whether the data is missing completely at random. The test helps determine if the pattern of missing data can be explained by chance or if there is a systematic reason for the missing data.

Overall, it's important to remember that determining the pattern of missing data is often a combination of these methods, and it may require some judgment to make a final determination.
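The first two checks can be sketched in pandas. The customer table below is invented for illustration, with income deliberately missing only for the oldest customers so that the pattern is visible.

```python
import numpy as np
import pandas as pd

# Invented data: income is missing only for the oldest customers.
df = pd.DataFrame({
    "age":    [22, 25, 31, 38, 45, 52, 60, 67],
    "income": [30, 35, 42, 50, 58, np.nan, np.nan, np.nan],
})

# Share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)

# Compare another variable across rows with and without a missing income.
income_missing = df["income"].isna()
age_by_missingness = df.groupby(income_missing)["age"].mean()
print(age_by_missingness)

# Correlation between the missingness indicator and age.
missing_age_corr = income_missing.astype(int).corr(df["age"])
print(round(missing_age_corr, 2))
```

In this toy table the average age is 32.2 where income is present and about 59.7 where it is missing, and the indicator correlates strongly with age. That kind of gap suggests the missingness depends on age rather than being completely random.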

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Dealing with imbalanced datasets is a common problem in machine learning, especially in medical diagnosis projects. Here are some strategies you can use to evaluate the performance of your machine learning model on an imbalanced dataset:

1. Confusion matrix: A confusion matrix is a table that summarizes the performance of a classification model. It shows the counts of true positives, false positives, true negatives, and false negatives. In the case of an imbalanced dataset, accuracy may not be a good metric to evaluate the model's performance. Instead, you can look at other metrics such as precision, recall, F1-score, and the area under the receiver operating characteristic (ROC) curve. These metrics focus on how well the minority class is detected, so they are far less misleading than accuracy under class imbalance. ROC-AUC can still look optimistic when the positive class is very rare, in which case a precision-recall curve is more informative.

2. Resampling techniques: Resampling techniques can be used to balance the dataset. You can either oversample the minority class or undersample the majority class. Oversampling involves adding copies of the minority class to the dataset, while undersampling involves removing examples from the majority class. However, both techniques have some drawbacks. Oversampling can lead to overfitting, while undersampling can lead to a loss of information. One common resampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic examples of the minority class.

3. Ensemble methods: Ensemble methods combine multiple models to improve their performance. One common ensemble method is the bagging method, which involves training multiple models on different subsets of the dataset and averaging their predictions. Another common ensemble method is the boosting method, which involves training multiple models sequentially, with each subsequent model focusing on the errors of the previous model. Ensemble methods can help improve the performance of the model on imbalanced datasets.

4. Cost-sensitive learning: Cost-sensitive learning involves assigning different costs to different types of errors. In the case of an imbalanced dataset, misclassifying a minority class example as a majority class example may be more costly than the opposite. By assigning different costs to different types of errors, the model can be trained to minimize the overall cost of errors rather than just the number of errors.

5. Domain knowledge: Finally, domain knowledge can be used to improve the model's performance on an imbalanced dataset. For example, if the dataset contains demographic information, you can use this information to stratify the dataset and ensure that both classes are represented equally in each stratum.

Overall, it's important to remember that there is no single best strategy for dealing with imbalanced datasets, and the best approach may depend on the specific dataset and problem at hand. It's often a combination of these techniques that leads to the best results.
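To make the metrics and the cost-sensitive view concrete, the sketch below compares two hypothetical models from their confusion-matrix counts. The counts and the cost weights (a missed case costing ten times a false alarm) are invented for illustration.

```python
def summarize(tp, fp, tn, fn, fn_cost=10, fp_cost=1):
    """Compute common metrics and a weighted error cost from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
        "cost": fn * fn_cost + fp * fp_cost,
    }

# 1000 hypothetical patients, 50 of whom have the condition.
model_a = summarize(tp=20, fp=5, tn=945, fn=30)   # cautious: few false alarms
model_b = summarize(tp=40, fp=30, tn=920, fn=10)  # sensitive: finds more cases

for name, metrics in (("A", model_a), ("B", model_b)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Model A has the slightly higher accuracy (0.965 against 0.96), yet it misses 30 of the 50 patients. Model B has the better recall and F1, and under the assumed costs its total error cost is 130 compared with 305 for Model A.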

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

There are several methods that can be employed to balance an unbalanced dataset and down-sample the majority class. Here are a few possible approaches:

1. Random under-sampling: This involves randomly removing instances from the majority class until the dataset is balanced. One potential drawback of this approach is that it may result in the loss of important information, particularly if the majority class contains important or rare examples that should be preserved.

2. Cluster-based under-sampling: This method involves clustering the majority class instances and then selecting representative instances from each cluster. This can help to preserve important information in the majority class, while also reducing the imbalance.

3. Tomek Links: This method is an under-sampling technique that identifies pairs of instances from different classes that are close to each other, and removes the majority class instance from each pair. By doing this, the Tomek Links method creates a clearer separation between the two classes.

4. Edited Nearest Neighbors (ENN): This method is also an under-sampling technique that removes noisy or mislabeled instances by checking the class of each instance's nearest neighbors. If an instance's nearest neighbors are mostly from a different class, then the instance is removed. ENN can be applied after other under-sampling or over-sampling techniques to further improve the balance of the dataset.

5. Ensemble-based methods: These methods involve training multiple models on different subsets of the data, and then combining the results to produce a final prediction. This can be particularly useful in cases where the dataset is highly imbalanced and standard methods may not be effective.

It is important to note that there is no one "best" method for balancing an unbalanced dataset, and the choice of method will depend on the specific characteristics of the dataset and the goals of the analysis. It is also important to evaluate the performance of the chosen method on a validation set to ensure that it does not introduce biases or negatively impact the accuracy of the model.
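The Tomek Links idea can be sketched in NumPy. The sketch assumes the usual definition of a Tomek link (two samples from different classes that are each other's nearest neighbor); the helper name and the one-dimensional data are made up for the example.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that belong to a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)  # nearest neighbor of every sample

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# Four majority samples (label 0) and two minority samples (label 1).
X = np.array([[0.0], [0.2], [0.5], [1.0], [1.1], [3.0]])
y = np.array([0, 0, 0, 0, 1, 1])

to_remove = tomek_link_majority_indices(X, y, majority_label=0)
print(to_remove)  # [3]

X_kept = np.delete(X, to_remove, axis=0)
y_kept = np.delete(y, to_remove)
```

Only the majority sample at 1.0 is dropped, because it sits right next to the minority sample at 1.1. Tomek Links therefore cleans the class boundary rather than fully balancing the classes.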

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

If you have an unbalanced dataset with a low percentage of occurrences of a rare event, you can employ various techniques to balance the dataset and up-sample the minority class. Here are a few possible approaches:

1. Random over-sampling: This involves randomly duplicating instances from the minority class until the dataset is balanced. One potential drawback of this approach is that it may result in overfitting and lower the overall accuracy of the model.

2. Synthetic minority over-sampling technique (SMOTE): This method involves creating synthetic instances of the minority class by interpolating between existing instances. SMOTE generates a new instance by taking the difference between the feature vector of one minority class instance and that of one of its k nearest minority-class neighbors, multiplying this difference by a random number between 0 and 1, and adding the result to the original instance. This places the new instance on the line segment between the two points, which helps to balance the dataset while roughly preserving the overall distribution of the minority class.

3. Adaptive Synthetic Sampling (ADASYN): This method is an extension of SMOTE that generates more synthetic instances in the minority class regions that are harder to learn by the classifier. The idea is to generate more synthetic samples where the density of the minority class is lower, thus focusing more on the difficult to learn samples.

4. SMOTE-Tomek: This method combines the SMOTE over-sampling technique with Tomek Links under-sampling. Tomek Links are pairs of instances from different classes that are close to each other and can be removed to increase the separation between the classes. SMOTE-Tomek first applies SMOTE over-sampling to the minority class instances, and then applies Tomek Links under-sampling to remove the majority class instances that form Tomek Links with minority class instances.

5. SMOTE-ENN: This method combines the SMOTE over-sampling technique with Edited Nearest Neighbors (ENN) under-sampling. ENN is a cleaning technique that removes noisy or mislabeled instances by checking the class of each instance's nearest neighbors. SMOTE-ENN first applies SMOTE over-sampling to the minority class instances, and then applies ENN under-sampling to remove instances that are misclassified by their nearest neighbors.

It is important to note that up-sampling the minority class can also lead to overfitting and reduced generalization performance. Therefore, it is important to evaluate the performance of the chosen method on a validation set to ensure that it does not introduce biases or negatively impact the accuracy of the model.
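The cleaning step that SMOTE-ENN runs after over-sampling can be sketched on its own. This is a simplified version of Edited Nearest Neighbors written for illustration; the function name, the choice of `k=3`, and the data are assumptions.

```python
import numpy as np

def edited_nearest_neighbors(X, y, k=3):
    """Drop samples whose k nearest neighbors mostly carry a different label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances; a sample is never its own neighbor.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]

    keep = []
    for i in range(len(X)):
        different = int((y[neighbors[i]] != y[i]).sum())
        if different <= k / 2:  # most neighbors agree with this sample's label
            keep.append(i)
    return X[keep], y[keep]

# The label-1 sample at 0.15 sits in the middle of the label-0 cluster.
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.15], [2.0], [2.1], [2.2]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_clean, y_clean = edited_nearest_neighbors(X, y)
print(len(X), len(X_clean))  # 8 7
```

Only the sample at 0.15 is removed, since all three of its nearest neighbors belong to the other class. The same filter applied after SMOTE removes synthetic points that landed inside the wrong class region.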