## Question 01 - What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

## Answer :-

Missing values are entries in a dataset for which no value was recorded. They can occur for various reasons, such as data corruption, human error during data entry, or data storage problems. Missing values are usually denoted by markers such as NA or NaN, or are simply left blank.

Handling missing values is essential because they can lead to biased and inaccurate models. If the number of missing values is significant, it can also affect the statistical power of the analysis. Hence, it is crucial to identify the missing values and handle them appropriately.
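
As a small illustration of the identification step, the sketch below counts missing entries per column with pandas. The column names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); pandas treats both np.nan and None as missing
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

missing_per_column = df.isna().sum()   # number of missing entries per column
missing_fraction = df.isna().mean()    # share of missing entries per column

print(missing_per_column)
print(missing_fraction)
```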

Some algorithms can tolerate missing values, although whether they do in practice depends on the implementation:

1. Decision Trees: Some decision tree implementations handle missing values natively, for example by using surrogate splits or by learning which branch rows with a missing value should follow at each split.

2. Random Forest: A Random Forest is an ensemble of Decision Trees, so it tolerates missing values whenever the underlying tree implementation does.

3. Naive Bayes: Naive Bayes treats features as conditionally independent, so a missing feature can simply be left out of the probability calculation for that sample.

By contrast, K-Nearest Neighbors and Support Vector Machines compute distances or dot products over all features, so a missing coordinate leaves the result undefined unless the values are imputed first or a distance that skips missing coordinates is used. Linear regression, logistic regression, and k-means clustering are sensitive to missing values for the same reason. Hence, it is crucial to handle missing values before applying these algorithms.

## Question 02 - List down techniques used to handle missing data. Give an example of each with python code.

## Answer :-

There are several techniques to handle missing data in a dataset. Some of the common techniques are:

- Deletion: Delete the rows or columns containing missing data.
- Imputation: Fill in the missing values with estimated values.

Here are examples of how to implement each technique using Python:

In [4]:
#Deletion:

#a. Listwise Deletion: Delete all rows that contain missing data.
import pandas as pd
import numpy as np

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [9, 10, 11, 12]})

In [6]:
df

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12


In [5]:
# drop rows with missing values
df_new = df.dropna()

# print the new dataframe
print(df_new)

     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12


In [7]:
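# b. Deletion on specific columns: drop only the rows that have missing values in selected features.
# (Strict pairwise deletion keeps every row and uses all available values for each individual
# calculation; dropping on a column subset is the closest row-level equivalent.)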

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [9, 10, 11, 12]})
df

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12


In [8]:
# drop rows with missing values for column 'B'
df_new = df.dropna(subset=['B'])

# print the new dataframe
print(df_new)


     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12


In [9]:
# Imputation:

# a. Mean Imputation: Replace missing values with the mean of the corresponding feature.
import pandas as pd
from sklearn.impute import SimpleImputer

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [9, 10, 11, 12]})
df

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12


In [10]:
# create an imputer object with mean strategy
imputer = SimpleImputer(strategy='mean')

# fit the imputer on the dataframe
imputer.fit(df)

# transform the dataframe by replacing missing values with mean
df_new = pd.DataFrame(imputer.transform(df), columns=df.columns)

# print the new dataframe
print(df_new)

          A    B     C
0  1.000000  5.0   9.0
1  2.000000  6.5  10.0
2  2.333333  6.5  11.0
3  4.000000  8.0  12.0


In [12]:
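# b. Median Imputation: Replace missing values with the median of the corresponding feature.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer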

# create a sample dataframe with missing values
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [9, 10, 11, 12]})

# create an imputer object with median strategy
imputer = SimpleImputer(strategy='median')

# fit the imputer on the dataframe
imputer.fit(df)

# transform the dataframe by replacing missing values with median
df_new = pd.DataFrame(imputer.transform(df), columns=df.columns)
print(df_new)

     A    B     C
0  1.0  5.0   9.0
1  2.0  6.5  10.0
2  2.0  6.5  11.0
3  4.0  8.0  12.0


## Question 03 - Explain the imbalanced data. What will happen if imbalanced data is not handled?

## Answer :-

Imbalanced data refers to a situation in which the classes in the target variable are not represented equally in a dataset, so that one class is significantly more frequent than the other(s). A typical example is a binary classification problem in which the positive class (the class of interest) has far fewer observations than the negative class.

If imbalanced data is not handled properly, it can lead to biased models that have a higher accuracy in predicting the majority class and a lower accuracy in predicting the minority class. This can be particularly problematic in applications where it is more important to correctly identify the minority class, such as detecting fraudulent transactions or medical diagnoses.

For example, if a dataset is 95% negative class and 5% positive class, a model that predicts every sample as negative will have an accuracy of 95%, which looks impressive but is useless for detecting positive samples. On the other hand, if the model is biased towards the minority class, it can produce many false positives, i.e., negative samples predicted as positive, which can also be detrimental in certain applications.

Therefore, it is crucial to handle imbalanced data, for example by rebalancing the training set or by weighting the classes, so that the model learns the minority class as well as the majority class.
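
The 95% / 5% example can be reproduced in a few lines. The sketch below uses made-up labels and a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives (0) and 5 positives (1)
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
recall = recall_score(y_true, y_pred)      # share of positives that were found

print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy comes out at 0.95 while recall on the positive class is 0, which is why accuracy alone is a poor guide on imbalanced data.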

## Question 04 - What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

## Answer :-

Up-sampling and down-sampling are techniques used in dealing with imbalanced datasets, where one class is significantly underrepresented compared to the other class.

- Up-sampling: Up-sampling involves increasing the number of instances in the minority class to match the number of instances in the majority class. This can be achieved by randomly duplicating the existing instances in the minority class or by generating new synthetic samples.

- Down-sampling: Down-sampling involves decreasing the number of instances in the majority class to match the number of instances in the minority class. This can be achieved by randomly selecting a subset of the instances in the majority class.

An example where up-sampling might be required is in the case of fraud detection, where the number of fraudulent transactions is much less than the number of non-fraudulent transactions. In this case, up-sampling the minority class (fraudulent transactions) can help the model learn more about the rare class and improve its performance.

An example where down-sampling might be required is in the case of a medical dataset where the number of healthy patients significantly exceeds the number of patients with a disease. In this case, down-sampling the majority class (healthy patients) can help to balance the dataset and prevent the model from being biased towards the majority class.
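
Both operations can be sketched with `resample` from `sklearn.utils`. The fraud-style DataFrame below is hypothetical and only meant to show the mechanics.

```python
import pandas as pd
from sklearn.utils import resample

# Toy dataset (hypothetical): 8 legitimate transactions (0) and 2 fraudulent ones (1)
df = pd.DataFrame({
    "amount": [10, 12, 9, 11, 13, 10, 14, 12, 250, 300],
    "is_fraud": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})
majority = df[df["is_fraud"] == 0]
minority = df[df["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
df_upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
df_downsampled = pd.concat([majority_down, minority])

print(df_upsampled["is_fraud"].value_counts())
print(df_downsampled["is_fraud"].value_counts())
```

Up-sampling keeps all the original information but repeats minority rows, whereas down-sampling discards majority rows, so it is usually reserved for cases where the majority class is large enough to spare them.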

## Question 05 - What is data Augmentation? Explain SMOTE.

## Answer :-

Data augmentation is a technique in machine learning that involves creating new training data from existing data to increase the size and diversity of the training set. This helps in improving the performance and generalization of the model. SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation method used for handling imbalanced datasets.

SMOTE creates synthetic samples of the minority class rather than duplicating existing ones. It selects a sample from the minority class and identifies its k nearest neighbors within that class. It then randomly selects one of these neighbors and creates a new sample at a random point on the line segment between the original sample and the selected neighbor.

For example, suppose we have a dataset with two classes, A and B, where class A is the minority class with only a few samples. For a sample x from class A and one of its neighbors x_nn, also from class A, SMOTE creates the new point x_new = x + lam * (x_nn - x), where lam is drawn uniformly from [0, 1]. The new sample is added to the dataset as a new data point for class A, and the process repeats until class A has the desired number of samples.
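
To make the interpolation step concrete, here is a minimal from-scratch sketch. It is not the imblearn implementation; the function name `smote_like_samples` and its parameters are invented for this illustration, and it assumes the minority points contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbour_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))             # pick a minority sample
        neighbour = rng.choice(neighbour_idx[base, 1:])  # pick one of its k neighbours
        gap = rng.random()                               # position along the segment, in [0, 1)
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic

X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_new = smote_like_samples(X_minority, n_new=4)
print(X_new)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.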

In Python, we can use the imblearn library to implement SMOTE. Here is an example code snippet:

In [14]:
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Create an imbalanced toy dataset (about 90% / 10%); replace this with your own data
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Instantiate the SMOTE object
sm = SMOTE(random_state=42)

# Fit the SMOTE object to the dataset and generate new samples
X_resampled, y_resampled = sm.fit_resample(X, y)

## Question 06 - What are outliers in a dataset? Why is it essential to handle outliers?

## Answer :-

Outliers are data points that lie far away from the majority of the data points in a dataset. They can be caused by various reasons, such as measurement errors, data entry errors, or natural variations in the data.

It is essential to handle outliers in a dataset because they can have a significant impact on the statistical analysis and machine learning algorithms. Outliers can skew the distribution of the data, affect the mean and variance, and reduce the accuracy of the model. Additionally, some machine learning algorithms are sensitive to outliers, which can result in poor performance.

Handling outliers involves identifying them and then deciding whether to remove or transform them.

There are various techniques to handle outliers, such as:

- Trimming: removing extreme values from the dataset
- Winsorizing: replacing extreme values with a less extreme value
- Transforming: applying a mathematical transformation to the data, such as a logarithmic or square root transformation
- Robust statistics: using statistical methods that are less sensitive to outliers, such as the median instead of the mean

In some cases, it may be appropriate to keep the outliers in the dataset, particularly if they reflect genuine variability in the data rather than measurement or data entry errors. In either case, it is crucial to carefully consider the impact of outliers on the analysis and on the machine learning algorithms.
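
The sketch below applies three of the listed techniques to a small made-up series. It flags outliers with the common 1.5 * IQR rule (one convention among several) and treats capping at those fences as a simple form of winsorizing.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 14.0, 95.0])

# Flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

trimmed = values[~is_outlier]                       # trimming: drop flagged points
winsorized = values.clip(lower=lower, upper=upper)  # winsorizing: cap at the fences
log_transformed = np.log1p(values)                  # transformation: compress the right tail

print(values[is_outlier])
print("mean:", values.mean(), "median:", values.median())
```

Comparing the mean and the median of the raw series also shows why robust statistics help: the single extreme value pulls the mean far more than the median.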

## Question 07 - You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

## Answer :-

There are several techniques to handle missing data in a dataset. Some of the most commonly used techniques are:

1. Deletion: In this method, the rows or columns containing missing values are removed from the dataset. It should be used only when the percentage of missing values is very low and the values are missing completely at random (MCAR); otherwise it can bias the analysis.

2. Mean/ Median/ Mode Imputation: In this method, the missing values are replaced with the mean, median, or mode of the corresponding feature. It is simple and works best when only a few values are missing and they are missing completely at random (MCAR), because it shrinks the variance of the feature and can distort its relationships with other features.

3. Forward/ Backward Fill: In this method, the missing values are replaced with the last known value (forward fill) or the next known value (backward fill). It is suitable for ordered data such as time series, where neighboring observations are likely to be similar.

4. Hot Deck Imputation: In this method, the missing values are replaced with a randomly selected value from a similar group of individuals. It is suitable when the values are missing at random (MAR), that is, when the missingness depends on other observed variables that can be used to define the similar group.

5. Machine Learning Methods: In this method, the missing values are predicted from the other features using machine learning algorithms such as k-Nearest Neighbors (k-NN), Decision Trees, Random Forests, etc. It is also suited to values that are missing at random (MAR). When values are missing not at random (MNAR), none of these methods removes the bias on its own, and the reason for the missingness has to be investigated or modeled explicitly.
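
A few of these techniques can be sketched on a small customer-style table. The column names and values are hypothetical, and `KNNImputer` stands in for the machine learning approach.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "visits": [3.0, np.nan, 5.0, 6.0, np.nan],
    "spend": [30.0, 42.0, np.nan, 61.0, 70.0],
})

# Forward fill: carry the last observed value forward (assumes the rows are ordered)
df_ffill = df.ffill()

# Mode imputation for a categorical feature
segment = pd.Series(["gold", "silver", None, "silver", "gold", "silver"])
segment_filled = segment.fillna(segment.mode()[0])

# k-NN imputation: fill each gap from the 2 most similar rows
knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(df_ffill)
print(segment_filled)
print(df_knn)
```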

## Question 08 - You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

## Answer :-

There are several strategies to determine if the missing data is missing at random or if there is a pattern to the missing data:

1. Visual inspection: One simple way to determine if there is a pattern to the missing data is to visualize the missing data using a heatmap. This will allow you to see if the missing data is concentrated in specific areas or if it is randomly distributed across the dataset.

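2. Statistical tests: Several statistical tests can help determine whether the data is missing at random. Little's MCAR test checks whether the data is missing completely at random. Alternatively, you can create a missing indicator for a variable (1 if the value is missing, 0 otherwise) and test whether it is associated with the other variables, for example with a t-test, a chi-square test, or a logistic regression. A significant association suggests that the missingness follows a pattern.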

3. Imputation methods: Another check is to impute the missing values from the other variables, for example with regression imputation, and compare the imputed values with the observed ones. If the imputed values differ markedly from the observed values of that variable, the rows with missing data differ systematically from the complete rows, which suggests that there may be a pattern to the missing data.

4. Domain knowledge: It is also essential to have domain knowledge about the dataset to determine if there is a pattern to the missing data. For example, if you are working with medical data and notice that there is a higher percentage of missing data for certain age groups, it may suggest that there is a pattern to the missing data.

By using these strategies, you can determine if the missing data is missing at random or if there is a pattern to the missing data, which will help you decide on the best approach to handle the missing data in your analysis.
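
The missing indicator idea can be sketched with pandas alone. The data below is invented so that income is missing only for the oldest customers; a formal test would replace the simple comparison of group means.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) in which income is missing only for the oldest customers
df = pd.DataFrame({
    "age": [22, 25, 31, 38, 45, 52, 61, 67],
    "income": [28, 32, 40, 45, 50, np.nan, np.nan, np.nan],
})

# Indicator column: 1 where income is missing, 0 where it is observed
df["income_missing"] = df["income"].isna().astype(int)

# Compare an observed variable between rows with and without missing income
mean_age_by_missingness = df.groupby("income_missing")["age"].mean()
print(mean_age_by_missingness)

# Share of missing values per column, a quick first look before plotting a heatmap
print(df[["age", "income"]].isna().mean())
```

A large gap between the two group means, as in this toy data, is a hint that the values are not missing completely at random.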

## Question 09 - Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

## Answer :-

When dealing with an imbalanced dataset in a medical diagnosis project, the following strategies can be used to evaluate the performance of a machine learning model:

1. Confusion Matrix: The confusion matrix provides a summary of the predicted results versus the actual results. It can be used to calculate metrics such as precision, recall, and F1 score, which are commonly used to evaluate the performance of classification models.

2. ROC Curve and AUC: The ROC curve is a graphical representation of the true positive rate (sensitivity) against the false positive rate (1-specificity) of a classification model for different classification thresholds. The area under the ROC curve (AUC) is a performance metric that provides a single number representing the overall quality of the classification model. A model with an AUC of 1.0 indicates perfect performance, while a model with an AUC of 0.5 indicates random guessing.

3. Stratified Sampling: Stratified sampling is a technique used to ensure that the imbalanced class distribution is preserved in both the training and testing sets. In this technique, the data is divided into several strata based on the class distribution, and then samples are drawn from each stratum in proportion to its size.

4. Resampling Techniques: Resampling techniques are used to balance the class distribution by oversampling the minority class or undersampling the majority class. Techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) are commonly used to oversample the minority class.

5. Cost-Sensitive Learning: Cost-sensitive learning assigns a higher misclassification cost to the minority class, so that the model is penalized more heavily for missing minority samples. This technique is useful in situations where the cost of false negatives (missed diagnoses) is higher than the cost of false positives (incorrect diagnoses).

6. Ensemble Methods: Ensemble methods such as bagging, boosting, and stacking can be used to improve the performance of a classification model on imbalanced datasets. These methods combine the predictions of multiple models to reduce the impact of individual model weaknesses and improve the overall performance of the model.
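
Several of these strategies can be combined in one short sketch. The data is synthetic (about 5% positives), logistic regression is just a placeholder model, and `class_weight="balanced"` is used as a simple form of cost-sensitive learning.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 5% positives
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

# stratify=y keeps the class ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" penalizes mistakes on the minority class more heavily
model = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]

cm = confusion_matrix(y_test, y_pred)
auc = roc_auc_score(y_test, y_score)

print(cm)
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("ROC AUC:", auc)
```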

## Question 10 - When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

## Answer :-

To down-sample the majority class in an imbalanced dataset, we can use various techniques, including:

1. Random under-sampling: This involves randomly removing samples from the majority class until the class distribution is balanced. This can be done using the resample function from the sklearn.utils module in Python. Here is an example:

import pandas as pd
from sklearn.utils import resample

# df is assumed to be a DataFrame with a 'Satisfaction' column
# holding the labels 'Satisfied' and 'Not Satisfied'

# Down-sample majority class
df_majority = df[df.Satisfaction == 'Satisfied']
df_minority = df[df.Satisfaction == 'Not Satisfied']
df_majority_downsampled = resample(df_majority,
                                   replace=False,  # sample without replacement
                                   n_samples=len(df_minority),  # match minority class
                                   random_state=42)  # reproducible results

# Combine minority class with down-sampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

2. Cluster-based under-sampling: This involves identifying clusters of samples in the majority class and removing samples from each cluster until the class distribution is balanced. This can be done using the KMeans algorithm from the sklearn.cluster module in Python.

3. Tomek links: A Tomek link is a pair of samples from different classes that are each other's nearest neighbours. This technique finds all such pairs and removes the majority class sample from each pair, which cleans up the boundary between the classes. This can be done using the TomekLinks class from the imblearn.under_sampling module in Python.

4. Neighbourhood cleaning rule: This involves removing samples from the majority class that are classified incorrectly by a k-nearest neighbour algorithm. This can be done using the NeighbourhoodCleaningRule class from the imblearn.under_sampling module in Python.
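
To show what a Tomek link looks like, here is a minimal from-scratch sketch built on scikit-learn's `NearestNeighbors`. It is not the imblearn implementation; the helper name `tomek_link_majority_indices` and the one-dimensional toy data are invented for the illustration, and the data is assumed to contain no duplicate points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority samples that are part of a Tomek link."""
    # Column 0 of the result is the point itself, column 1 its nearest neighbour
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)
    nearest = idx[:, 1]
    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# One feature; class 0 is the majority, class 1 the minority
X = np.array([[0.0], [1.0], [2.1], [3.0], [3.2], [6.0]])
y = np.array([0, 0, 0, 0, 1, 1])

removed = tomek_link_majority_indices(X, y, majority_label=0)
X_cleaned = np.delete(X, removed, axis=0)
y_cleaned = np.delete(y, removed)
print(removed, y_cleaned)
```

Only the majority sample at 3.0, which sits right next to the minority sample at 3.2, is removed. Tomek links therefore clean the class boundary rather than fully balancing the classes, so they are often combined with another sampling method.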

## Question 11 - You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

## Answer :-

In order to balance the dataset and up-sample the minority class, we can use various techniques. Some of the techniques are:

- Random over-sampling: Randomly selecting instances from the minority class and duplicating them until the classes are balanced.

- Synthetic Minority Over-sampling Technique (SMOTE): Generating synthetic samples of the minority class by interpolating between the minority class instances.

- Adaptive Synthetic (ADASYN): A variant of SMOTE that generates more synthetic samples for the minority class instances that are harder to learn, i.e., those surrounded by more majority class neighbors.

Here is an example of how to use the SMOTE technique for up-sampling the minority class in Python:

In [15]:
from imblearn.over_sampling import SMOTE

X_train_resampled, y_train_resampled = SMOTE().fit_resample(X_train, y_train)

In the above code, X_train and y_train are the feature and target variables of the training dataset, respectively. The fit_resample() method from the SMOTE class generates synthetic samples of the minority class to balance the dataset, and returns the up-sampled feature and target variables, X_train_resampled and y_train_resampled, respectively.