ASSIGNMENT: FE-1

1. What are missing values in a dataset? Why is it essential to handle missing values? Name some 
algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data for one or more variables in a given observation. Missing values can occur for several reasons, including data entry errors, data corruption, or non-response by participants. The presence of missing values can lead to biased or inaccurate results in statistical analysis, modeling, and machine learning, which makes it crucial to handle them appropriately.

It is essential to handle missing values because they can affect the accuracy and reliability of statistical models and machine learning algorithms. If missing values are not addressed, they can introduce bias, reduce the power of the analysis, and even lead to incorrect conclusions. Therefore, handling missing values is a critical step in the data preprocessing pipeline.

There are several ways to handle missing values, such as deleting the rows or columns with missing values, imputing the missing values with the mean, median, mode, or other statistical methods, or using machine learning algorithms that can handle missing values.

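Few algorithms are truly unaffected by missing values, but some can work with them without a separate imputation step. Examples include tree-based algorithms such as Random Forest and Gradient Boosting Machines in implementations that accept missing values natively (for example XGBoost and LightGBM), and Bayesian networks, which can marginalize over unobserved variables. Support depends on the library and version, so it should be checked before relying on it. Most other models, including linear and logistic regression, support vector machines, and standard deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), cannot consume missing values directly and need deletion or imputation first.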

2. List down techniques used to handle missing data. Give an example of each with Python code.

Deletion method:
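This method involves removing the rows or columns that contain missing values. It is suitable only when the amount of missing data is small and the values are missing completely at random (MCAR); otherwise the remaining rows may no longer be representative and the results can be biased.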

In [14]:
import numpy as np
import pandas as pd

# Generate random data
data = np.random.rand(5, 3)

# Set some values to NaN
data[0, 1] = np.nan
data[2, 0] = np.nan
data[3, 2] = np.nan

# Convert to Pandas DataFrame
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

# Print the DataFrame
print(df)



          A         B         C
0  0.922022       NaN  0.923820
1  0.253452  0.730530  0.336515
2       NaN  0.049607  0.039550
3  0.645468  0.275295       NaN
4  0.879351  0.144878  0.693796


In [15]:
df.isnull().sum()

A    1
B    1
C    1
dtype: int64

In [17]:
# Deletion method: drop every row that contains at least one missing value
df.dropna()

          A         B         C
1  0.253452  0.730530  0.336515
4  0.879351  0.144878  0.693796


Mean/Mode/Median imputation:

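This method involves filling the missing values with the mean, median, or mode of the observed values in the same column. The mean and median apply to numeric columns, while the mode also works for categorical columns. The approach is simple, but it shrinks the variance of the imputed column, so it is best suited to cases where only a small share of the values is missing and the data is missing completely at random (MCAR).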

In [18]:
from sklearn.impute import SimpleImputer


In [26]:
imputer = SimpleImputer(strategy='mean', add_indicator=False)   # replace each NaN with the mean of its column
df = imputer.fit_transform(df)   # returns a NumPy array, so the column names are lost

In [30]:
df = pd.DataFrame(df, columns=['A', 'B', 'C'])


In [31]:
df

          A         B         C
0  0.922022  0.300077  0.923820
1  0.253452  0.730530  0.336515
2  0.675073  0.049607  0.039550
3  0.645468  0.275295  0.498420
4  0.879351  0.144878  0.693796


In [38]:
from sklearn.impute import SimpleImputer
import pandas as pd

# Create a sample dataset with missing values
df = pd.DataFrame({
    'A': [1, 2, 3, None, 5],
    'B': [6, None, 8, 9, 10],
    'C': [11, 12, None, 14, 15]
})

# Create an instance of SimpleImputer and fit_transform the data with add_indicator=True
imputer = SimpleImputer(strategy='median', add_indicator=True)
df = imputer.fit_transform(df)

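# Convert the NumPy array back to a Pandas DataFrame.
# Columns 0-2 hold the imputed A, B, C; columns 3-5 are 0/1 indicators
# that mark where A, B, C were originally missing.
df = pd.DataFrame(df)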

# Print the DataFrame
print(df)


     0     1     2    3    4    5
0  1.0   6.0  11.0  0.0  0.0  0.0
1  2.0   8.5  12.0  0.0  1.0  0.0
2  3.0   8.0  13.0  0.0  0.0  1.0
3  2.5   9.0  14.0  1.0  0.0  0.0
4  5.0  10.0  15.0  0.0  0.0  0.0


In [16]:
import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'A': [1.0, 2.0, None, 4.0, 5.0],
                   'B': [6.0, 7.0, 8.0, None, 10.0],
                   'C': ['cat', 'dog', None, 'dog', 'cat']})

# Fill the categorical column C with its mode (most frequent value),
# and the numeric columns A and B with the mean and the median
df['C'] = df['C'].fillna(df['C'].mode()[0])
df['A'] = df['A'].fillna(df['A'].mean())
df['B'] = df['B'].fillna(df['B'].median())

# Print the resulting DataFrame
print(df)





     A     B    C
0  1.0   6.0  cat
1  2.0   7.0  dog
2  3.0   8.0  cat
3  4.0   7.5  dog
4  5.0  10.0  cat


3. Explain imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a dataset where the classes are not represented equally. For example, in a binary classification problem, if 90% of the data points belong to class A and only 10% belong to class B, the dataset is considered imbalanced.

If imbalanced data is not handled, it can lead to several problems:

Biased model: In imbalanced data, the model may learn to predict the majority class and ignore the minority class. In the 90/10 example above, a model that always predicts class A reaches 90% accuracy while never detecting class B.

Misclassification: The model may misclassify the minority class as the majority class, resulting in false negatives. This can be a critical problem in applications such as fraud detection, where missing a fraudulent transaction can lead to significant losses.

Overfitting: With only a few minority examples available, the model may memorize them instead of learning a general pattern, resulting in poor generalization performance on new data.

Therefore, it is essential to handle imbalanced data to build a fair and accurate model. There are several techniques to handle imbalanced data, such as:

Resampling: Resampling techniques can be used to balance the classes. This can be achieved by either oversampling the minority class or undersampling the majority class.

Cost-sensitive learning: Cost-sensitive learning involves assigning different misclassification costs to different classes. This can encourage the model to pay more attention to the minority class.

Ensemble methods: Ensemble methods such as boosting and bagging can be used to combine multiple models and improve the performance on the minority class.
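Cost-sensitive learning can be sketched with the class_weight option that many scikit-learn estimators provide. The toy data (900 versus 100 samples) and the choice of LogisticRegression are assumptions for illustration only; the 'balanced' setting weights each class by n_samples / (n_classes * class_count).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

rng = np.random.default_rng(0)

# Toy imbalanced data (hypothetical): 900 samples of class 0, 100 of class 1
X = np.vstack([rng.normal(0.0, 1.0, size=(900, 2)),
               rng.normal(1.5, 1.0, size=(100, 2))])
y = np.array([0] * 900 + [1] * 100)

# 'balanced' weights: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))

# The same weighting applied inside the model's loss function
model = LogisticRegression(class_weight="balanced").fit(X, y)
```

With these counts the minority class receives a weight of 5.0 and the majority class about 0.56, so an error on a minority sample costs roughly nine times as much as an error on a majority sample.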

4. What are up-sampling and down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two common techniques used to handle imbalanced data.

Down-sampling involves reducing the number of instances in the majority class, whereas up-sampling involves increasing the number of instances in the minority class.

For example, suppose we have a binary classification problem where the target variable has 100 positive instances and 10,000 negative instances. This is an imbalanced dataset because the number of negative instances far outweighs the number of positive instances.

In this case, we can use down-sampling to reduce the number of negative instances to make the dataset more balanced. We could randomly select 100 negative instances and keep all 100 positive instances, resulting in a dataset with 200 instances.

Alternatively, we could use up-sampling to increase the number of positive instances by generating new instances. We could use techniques like SMOTE (Synthetic Minority Over-sampling Technique) to generate new instances that are similar to the existing positive instances. This would result in a dataset with a higher number of positive instances and a more balanced distribution between the two classes.
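The 100 versus 10,000 example can be reproduced with sklearn.utils.resample. This is a sketch on made-up data: the column names feature and target are hypothetical, and the up-sampling shown here duplicates existing rows at random rather than generating synthetic ones as SMOTE does.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Toy data (hypothetical) matching the example: 100 positives, 10,000 negatives
df = pd.DataFrame({
    "feature": rng.normal(size=10_100),
    "target": [1] * 100 + [0] * 10_000,
})
minority = df[df["target"] == 1]
majority = df[df["target"] == 0]

# Down-sampling: draw 100 negatives without replacement
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_down = pd.concat([minority, majority_down])

# Up-sampling: draw 10,000 positives with replacement (duplicates rows)
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_up = pd.concat([majority, minority_up])

print(len(df_down), len(df_up))
```

Down-sampling leaves 200 rows and discards 9,900 negatives, while up-sampling yields 20,000 rows in which each positive appears about 100 times on average.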

5. What is data augmentation? Explain SMOTE.

Data augmentation is the process of generating new training samples by augmenting the existing data using various techniques such as rotation, flipping, zooming, adding noise, and more. The goal is to increase the diversity and quantity of the training data to improve the model's performance.

SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation technique used to address class imbalance in classification tasks. It creates synthetic examples of the minority class by interpolating between minority class examples. The basic steps of SMOTE are:

Step 1. Identify the minority class examples that need oversampling.
Step 2. For each of the minority examples, identify its k nearest neighbors from the same class.
Step 3. Randomly select one of the k nearest neighbors and create a synthetic example at a random point on the line segment between the selected neighbor and the original example.
Step 4. Repeat steps 2 and 3 until the desired number of new minority class examples has been generated.

For example, consider a binary classification problem where the positive class has only 10% of the total samples. This class imbalance can be addressed by applying SMOTE to the positive class to generate new synthetic examples. SMOTE can create synthetic samples that are slightly different from the original positive class samples but still belong to the positive class. This process can help balance the class distribution and improve the model's performance.
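The interpolation idea behind SMOTE can be sketched in plain NumPy. The function name smote_sketch and its parameters are hypothetical, and this is a simplified illustration rather than the reference implementation; in practice a maintained library such as imbalanced-learn would normally be used.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors.

    Assumes X_min holds only minority-class rows and that k < len(X_min).
    """
    rng = np.random.default_rng(seed)
    n = len(X_min)

    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per row

    base = rng.integers(0, n, size=n_new)         # pick an original example
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # pick one neighbor
    gap = rng.random((n_new, 1))                  # interpolation factor in [0, 1)

    # New point lies on the segment between the example and its neighbor
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Usage on toy data (hypothetical): 20 minority rows, 80 synthetic rows
X_min = np.random.default_rng(1).normal(size=(20, 2))
X_new = smote_sketch(X_min, n_new=80)
print(X_new.shape)
```

Because every synthetic row is a weighted average of two real minority rows, it always falls inside the range already covered by the minority class.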

6. What are outliers in a dataset? Why is it essential to handle outliers?

Outliers are data points that deviate significantly from the other data points in a dataset. They are observations that lie an abnormal distance away from other values in a random sample from a population. Outliers can be either too high or too low with respect to the other values in the dataset. They can occur due to various reasons such as measurement errors, natural variations, or data entry errors.

It is essential to handle outliers because they can significantly affect the results of data analysis. Outliers can affect the mean and standard deviation of the data, which can lead to biased statistical inferences. For instance, the mean of a dataset with outliers may not be a good representation of the central tendency of the data, leading to inaccurate conclusions. In machine learning, outliers can have a significant impact on the performance of the models. They can affect the accuracy of the predictions and lead to overfitting.

Handling outliers can involve several techniques, such as removing them from the dataset, replacing them with the mean or median values, or transforming them using mathematical functions. The choice of the method depends on the nature of the data and the objectives of the analysis. It is important to carefully consider the impact of handling outliers on the data and to choose the most appropriate method for the specific context.
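One common way to flag outliers is the 1.5 x IQR rule. The rule and the toy values below are assumptions made for illustration, since the text above does not prescribe a specific detection method. The sketch flags the extreme value and then shows two of the handling options: removal and capping.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): ten typical values and one extreme value
s = pd.Series([10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # fences of the IQR rule

is_outlier = (s < lower) | (s > upper)
print(s[is_outlier].tolist())    # [95]

removed = s[~is_outlier]         # option 1: drop the flagged values
capped = s.clip(lower, upper)    # option 2: cap them at the fences

# The mean is pulled upward by the single outlier
print(round(s.mean(), 2), round(removed.mean(), 2))   # 19.36 11.8
```

Here a single extreme value moves the mean from 11.8 to about 19.4, while the median stays at 12 in both cases.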

7. You are working on a project that requires analyzing customer data. However, you notice that some of 
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

There are several techniques that can be used to handle missing data in a dataset. Some of the commonly used techniques are:

Deletion: This technique involves removing the rows or columns that contain missing data. This can be done using the dropna() function in pandas. However, this technique can result in a loss of information, and it should only be used when the amount of missing data is small.

Imputation: This technique involves replacing missing values with estimated values. There are several imputation methods available, such as mean imputation, median imputation, mode imputation, and regression imputation. The choice of imputation method depends on the nature of the data and the amount of missing data.

Using advanced algorithms: Advanced algorithms such as k-nearest neighbors (KNN) and decision trees can be used to impute missing values. These algorithms use the patterns in the data to estimate the missing values.

Domain-specific knowledge: In some cases, domain-specific knowledge can be used to impute missing values. For example, if the missing data is related to a customer's income, their occupation, education level, and location can be used to estimate their income.
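As a sketch of imputation with k-nearest neighbors, the cell below uses scikit-learn's KNNImputer. The customer columns and values are hypothetical. Each missing entry is filled with the average of that column over the two most similar rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy customer data (hypothetical columns and values)
df = pd.DataFrame({
    "age":    [25, 32, 47, 51, 38, 29],
    "income": [30_000, 42_000, np.nan, 88_000, 55_000, np.nan],
    "visits": [3, 5, 8, np.nan, 6, 4],
})

# Each missing entry is replaced by the mean of that feature over the
# 2 nearest rows; distances use only the features both rows have observed
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

The features are left unscaled here to keep the example short. Because the distance is sensitive to feature scale, standardizing the columns before imputing is usually advisable.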

8. You are working with a large dataset and find that a small percentage of the data is missing. What are 
some strategies you can use to determine if the missing data is missing at random or if there is a pattern 
to the missing data?

There are several strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are a few techniques:

Missing Data Heatmap: A heatmap of missing values can provide insights into patterns of missing data. You can create a heatmap by visualizing the dataset with missing values, where missing values are colored in a different color.

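Correlation Matrix: A correlation matrix can be used to identify the relationship between the missing values and other variables in the dataset. To do this, create a 0/1 indicator for each variable that marks where its value is missing, then correlate these indicators with one another and with the observed variables. A strong correlation suggests that the missingness depends on other variables and is therefore not completely at random.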

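Statistical Tests: Statistical tests such as Little's MCAR test can be used to check whether data is missing completely at random. A significant result rejects MCAR and indicates a pattern in the missing data, although the test cannot tell whether the data is missing at random (MAR) or missing not at random (MNAR).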

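Data Visualization: Data visualization techniques such as scatter plots, box plots, and histograms can be used to compare the distribution of other variables between rows with and without a missing value. Clear differences between the two groups point to a pattern in the missing data.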
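A quick numeric check for patterns in missing data can be sketched with pandas alone. The toy data is hypothetical and is built so that income is more likely to be missing for older people. The cell computes the share of missing values, correlates a missingness indicator with an observed variable, and compares group means.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy data (hypothetical): income is more likely to be missing at higher ages
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})
p_missing = (df["age"] - 18) / 62 * 0.5      # 0% at age 18, rising toward 50%
df.loc[rng.random(n) < p_missing, "income"] = np.nan

# 1) Share of missing values per column
print(df.isna().mean())

# 2) Correlate a 0/1 missingness indicator with an observed variable
income_missing = df["income"].isna().astype(int).rename("income_missing")
print(income_missing.corr(df["age"]))

# 3) Compare the observed variable between rows with and without the value
print(df.groupby(income_missing)["age"].mean())
```

A correlation clearly different from zero, or a visible gap between the two group means, suggests that the data is not missing completely at random.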

9.  Suppose you are working on a medical diagnosis project and find that the majority of patients in the 
dataset do not have the condition of interest, while a small percentage do. What are some strategies you 
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

Here are some strategies that can be used to evaluate the performance of a machine learning model on an imbalanced dataset:

Confusion Matrix: A confusion matrix is a table that is often used to evaluate the performance of a classification model. It shows the actual class labels against the predicted class labels, and the number of correct and incorrect predictions. By analyzing the confusion matrix, we can calculate various performance metrics such as precision, recall, and F1 score, which are better suited for imbalanced datasets.

Precision, Recall, and F1 Score: Precision measures the proportion of true positives (i.e., the number of correct positive predictions) among all positive predictions. Recall measures the proportion of true positives among all actual positive cases. F1 score is the harmonic mean of precision and recall, which takes both metrics into account. These metrics are useful for imbalanced datasets as they focus on the performance of the minority class.

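ROC Curve and AUC: ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classifier system. It plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. AUC (Area Under the Curve) is a metric that summarizes the overall performance of the classifier: 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positive cases above negative ones. On heavily imbalanced data, ROC AUC can look optimistic because the false positive rate is computed over the large negative class, so it is best read alongside precision and recall.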

Stratified Sampling: In stratified sampling, we ensure that the proportions of the classes in the training and test sets (or in each cross-validation fold) are the same as those in the original dataset. This guarantees that the rare positive cases appear in every split, so the evaluation metrics are computed on a representative sample.

Resampling Techniques: Resampling techniques involve creating new samples by either oversampling the minority class or undersampling the majority class. Oversampling techniques such as SMOTE (Synthetic Minority Over-sampling Technique) can be used to create new synthetic samples for the minority class, while undersampling techniques such as random undersampling can be used to remove some of the majority class samples. These techniques can help balance the class distribution and improve the model's performance on the minority class. They should be applied to the training set only, so that the test set keeps the original class distribution and the evaluation metrics reflect real-world performance.
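The evaluation steps above can be sketched with scikit-learn as follows. The toy data (950 patients without the condition, 50 with it) and the choice of LogisticRegression are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy data (hypothetical): 950 patients without the condition, 50 with it
X = np.vstack([rng.normal(0.0, 1.0, size=(950, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 950 + [1] * 50)

# stratify=y keeps the 95/5 class ratio in both the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]   # probability of the positive class

cm = confusion_matrix(y_test, y_pred)         # rows: actual, columns: predicted
print(cm)
print(classification_report(y_test, y_pred, zero_division=0))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```

The classification report lists precision, recall, and F1 score per class, so the row for class 1 shows how well the rare condition is detected regardless of the overall accuracy.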

10. When attempting to estimate customer satisfaction for a project, you discover that the dataset is 
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to 
balance the dataset and down-sample the majority class?

In the scenario described, since the majority class represents satisfied customers, it may be important to carefully consider the trade-offs of down-sampling this group. One option could be to use a combination of down-sampling and SMOTE to balance the dataset and create new synthetic samples of the minority class. However, it is also important to evaluate the performance of the model on the imbalanced dataset and consider the potential impact of misclassification of the minority class. Additionally, it may be useful to explore other strategies to handle imbalanced datasets, such as cost-sensitive learning or using different evaluation metrics.

To balance the unbalanced dataset and down-sample the majority class, the following methods can be employed:

Random under-sampling: This involves randomly selecting a subset of the majority class to match the number of instances in the minority class. This can lead to a loss of information, and it may not be an effective method if the dataset is already small.

Cluster-based under-sampling: This method involves clustering the majority class into groups and then undersampling from each cluster. This can help retain the information and the pattern present in the data.

Synthetic minority over-sampling technique (SMOTE): Strictly speaking, this is an over-sampling method rather than a down-sampling one. It generates synthetic samples of the minority class by interpolating between existing minority instances. This balances the dataset without discarding majority samples, and it tends to carry less risk of overfitting than duplicating minority rows exactly.

Combined sampling: This involves combining different under-sampling and over-sampling techniques to achieve a balanced dataset.
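Cluster-based under-sampling can be sketched with KMeans. The toy data, the number of clusters, and the rule of drawing an equal number of rows from each cluster are assumptions for illustration; other variants keep the cluster centroids or sample in proportion to cluster size.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data (hypothetical): 500 satisfied customers, 50 dissatisfied
X_major = rng.normal(0.0, 1.0, size=(500, 2))
X_minor = rng.normal(2.5, 1.0, size=(50, 2))

# Group the majority class into a few clusters
n_clusters = 5
labels = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=0).fit_predict(X_major)

# Draw the same number of rows from every cluster so that each region of
# the majority class stays represented in the reduced set
per_cluster = len(X_minor) // n_clusters
keep = []
for c in range(n_clusters):
    members = np.flatnonzero(labels == c)
    take = min(per_cluster, len(members))
    keep.extend(rng.choice(members, size=take, replace=False).tolist())

X_major_down = X_major[keep]
X_balanced = np.vstack([X_major_down, X_minor])
y_balanced = np.array([0] * len(X_major_down) + [1] * len(X_minor))
print(X_major_down.shape, X_minor.shape)
```

Compared with purely random under-sampling, this keeps rows from every region of the majority class, including sparsely populated ones that a random draw could miss.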

11. You discover that the dataset is unbalanced with a low percentage of occurrences while working on a 
project that requires you to estimate the occurrence of a rare event. What methods can you employ to 
balance the dataset and up-sample the minority class?

When dealing with a dataset that has a low percentage of occurrences of a rare event, you can use up-sampling techniques to balance the dataset. Some of the commonly used up-sampling techniques include:

Random over-sampling: In this technique, the minority class samples are randomly duplicated until the number of samples in both the majority and minority classes is equal.

SMOTE (Synthetic Minority Over-sampling Technique): In this technique, synthetic samples are generated for the minority class by interpolating between an existing minority sample and one of its nearest minority-class neighbors, so the new samples are plausible members of the minority class rather than exact copies.

ADASYN (Adaptive Synthetic Sampling): This technique is a variant of SMOTE that generates more synthetic samples for the minority examples that are harder to learn, that is, those whose neighborhoods contain many majority-class samples. This shifts the classifier's attention toward the difficult regions near the class boundary.