## FEATURE ENGINEERING ASSIGNMENT

Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.

Missing values in a dataset refer to the absence of data for one or more variables in certain observations or records. These missing values can occur due to various reasons, such as data collection errors, data loss during transfer or storage, or intentional non-response by individuals.

Handling missing values is essential for several reasons:

Data integrity: Missing values can compromise the integrity and quality of the dataset. They can lead to biased or inaccurate analysis and modeling if not appropriately addressed.

Statistical analysis: Many statistical techniques and algorithms require complete data to provide valid and reliable results. Missing values can disrupt the assumptions of these methods and may lead to biased estimates or incorrect inferences.

Machine learning: Most machine learning algorithms cannot handle missing values directly. They usually expect complete data for training and prediction. Therefore, missing values need to be handled to ensure the successful application of machine learning models.

Data imputation: Records with missing values still carry useful information in their observed fields. Imputing the missing entries, rather than discarding the whole record, lets us keep that information and work with a more complete dataset.

Some algorithms that can tolerate missing values include:

Tree-based algorithms: Some implementations of decision trees, random forests, and gradient boosting machines can work with missing values directly, for example by using surrogate splits or by learning a default branch for missing entries at each split. This depends on the implementation, so check the library's documentation before relying on it.

Rule-based algorithms: Rule-based classifiers, such as association rule mining or rule induction algorithms, can handle missing values by treating missing as a separate category or by using default rules.

Distance-based algorithms: Some k-nearest neighbors (KNN) variants can cope with missing values by computing distances only over the features that are present in both instances, or by imputing them from neighboring instances first.

It's important to note that although these algorithms can handle missing values, imputing or handling missing values appropriately can still improve their performance and the quality of the results.

Q2: List down techniques used to handle missing data. Give an example of each with python code.

There are several techniques commonly used to handle missing data. Here are five techniques along with their corresponding Python code examples:

1. Removal of missing data (Listwise deletion):
This technique involves removing any observation that has missing values. It can be implemented using the dropna() function in pandas.

```python
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [6, None, 8, 9, 10]}
df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
```

```
     A     B
0  1.0   6.0
3  4.0   9.0
4  5.0  10.0
```


2. Mean/median imputation:
In this technique, missing values are replaced with the mean or median value of the variable. It can be implemented using the fillna() function in pandas.

```python
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [6, None, 8, 9, 10]}
df = pd.DataFrame(data)

# Fill missing values with mean
df_filled = df.fillna(df.mean())
print(df_filled)
```

```
     A      B
0  1.0   6.00
1  2.0   8.25
2  3.0   8.00
3  4.0   9.00
4  5.0  10.00
```


3. Mode imputation:
Mode imputation involves replacing missing values with the most frequent value (mode) of the variable. It can be implemented using the fillna() function in pandas.

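```python
import pandas as pd

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 2, 5],
        'B': [6, None, 8, 9, 10]}
df = pd.DataFrame(data)

# Fill missing values with mode
# mode() returns all tied values in sorted order and iloc[0] takes the first;
# column B has no repeated value, so its smallest value (6) is used
df_filled = df.fillna(df.mode().iloc[0])
print(df_filled)
```

```
     A     B
0  1.0   6.0
1  2.0   6.0
2  2.0   8.0
3  2.0   9.0
4  5.0  10.0
```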


4. Regression imputation:
Regression imputation involves predicting the missing values based on other variables in the dataset. It can be implemented using machine learning algorithms such as linear regression.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [6, None, 8, 9, 10]}
df = pd.DataFrame(data)

# Fit only on rows where both the feature (B) and the target (A) are observed;
# LinearRegression raises an error if it is given NaN values
complete = df.dropna()
model = LinearRegression()
model.fit(complete[['B']], complete['A'])

# Predict A where it is missing and B is available
# (the three complete rows lie on A = B - 5, so the prediction is about 3)
to_impute = df['A'].isnull() & df['B'].notnull()
df.loc[to_impute, 'A'] = model.predict(df.loc[to_impute, ['B']])
print(df)
```



5. Multiple imputation:
Multiple imputation involves creating multiple plausible values for the missing data, analyzing each completed dataset separately, and combining the results. The IterativeImputer class from the scikit-learn library models each feature with missing values as a function of the other features. A single call, as below, returns one completed dataset; repeated runs with sample_posterior=True and different random seeds can be used to obtain multiple imputations.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Create a DataFrame with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [6, None, 8, 9, 10]}
df = pd.DataFrame(data)

# Impute each column from the other (this yields a single completed dataset)
imputer = IterativeImputer()
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
```

```
          A          B
0  1.000000   6.000000
1  2.000000   7.000036
2  2.999873   8.000000
3  4.000000   9.000000
4  5.000000  10.000000
```
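
IterativeImputer produces one completed dataset per call. The following is a minimal sketch of how several imputations might be drawn and pooled, assuming `sample_posterior=True` with a different `random_state` per run. The number of imputations (5) and the pooled statistic (the mean of A) are illustrative choices, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': [6, None, 8, 9, 10]})

# Draw several completed datasets; sample_posterior=True adds random variation
# so that each seed yields a different plausible completion.
n_imputations = 5  # hypothetical choice
imputed_sets = []
for seed in range(n_imputations):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    imputed_sets.append(completed)

# Run the analysis of interest on each dataset (here: the mean of A) ...
estimates = [d['A'].mean() for d in imputed_sets]

# ... and pool the results; the spread reflects imputation uncertainty.
pooled_mean = np.mean(estimates)
between_var = np.var(estimates, ddof=1)
print(pooled_mean, between_var)
```

The variance between the per-dataset estimates is what a single imputation hides: it shows how much the conclusion depends on the values that were filled in.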


Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

Imbalanced data refers to a situation where the distribution of classes or categories in a dataset is heavily skewed, meaning that one class or category has significantly more instances than the others. This is a common issue in various domains, such as fraud detection, rare disease diagnosis, or anomaly detection, where the positive class (minority class) is relatively rare compared to the negative class (majority class).

If imbalanced data is not handled, several consequences can arise:

Biased model performance: Machine learning algorithms are often designed to optimize overall accuracy, which works well when classes are balanced. However, in the case of imbalanced data, the algorithms tend to be biased towards the majority class, leading to poor performance on the minority class. The resulting model may have high accuracy overall but fail to identify the minority class instances correctly.

Misleading evaluation metrics: Traditional evaluation metrics, such as accuracy, can be misleading when dealing with imbalanced data. For instance, in a dataset with 95% negative class instances and 5% positive class instances, a model that predicts all instances as negative would still achieve 95% accuracy. Therefore, evaluating model performance solely based on accuracy can be deceptive and fail to capture the true performance on the minority class.

Decision-making bias: If imbalanced data is not handled, decision-making processes based on the model's predictions may be biased towards the majority class. This can have real-world consequences, such as underestimating the risk of rare events or missing out on valuable opportunities.

Decreased model generalization: With only a few minority examples to learn from, the model may either memorize them or ignore them in favor of the majority class, because the algorithm learns to prioritize the class that dominates the training data. In both cases it generalizes poorly to unseen minority class instances.

To mitigate these issues, handling imbalanced data is crucial. Techniques such as oversampling the minority class, undersampling the majority class, generating synthetic samples, or using specialized algorithms like ensemble methods (e.g., Random Forest, Gradient Boosting) or anomaly detection techniques (e.g., One-Class SVM) can help address the challenges posed by imbalanced data. These techniques aim to provide the model with a more balanced view of the classes, enabling it to learn effectively from the minority class and improve performance on the underrepresented class.
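
The misleading-accuracy point above can be checked with a few lines of code. This is a minimal sketch using the 95% / 5% split from the example; the "model" is just a constant prediction of the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 95 negative (0) and 5 positive (1) labels, as in the example above.
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)  # recall for the positive class
print(accuracy)         # 0.95
print(minority_recall)  # 0.0
```

The accuracy looks strong, yet not a single positive case is found, which is why recall on the minority class is the more informative number here.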

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

Up-sampling and down-sampling are two techniques used to handle imbalanced data by adjusting the distribution of classes in a dataset. Here's an explanation of each technique along with an example scenario where they are required:

Up-sampling (Over-sampling):
Up-sampling involves increasing the number of instances in the minority class to match the number of instances in the majority class. This is typically achieved by randomly replicating existing instances from the minority class or generating synthetic data points based on the existing minority class instances.

Example scenario: Suppose you have a dataset for credit card fraud detection, where the positive class represents fraudulent transactions (minority class), and the negative class represents legitimate transactions (majority class). The dataset is heavily imbalanced, with only 1% of transactions being fraudulent. In this case, up-sampling can be applied by replicating or synthetically generating more instances of the fraudulent transactions to increase their representation in the dataset. This helps the model learn from the available fraudulent examples and improve its ability to detect fraud accurately.

Down-sampling (Under-sampling):
Down-sampling involves reducing the number of instances in the majority class to match the number of instances in the minority class. This is typically done by randomly removing instances from the majority class, which can lead to a smaller dataset size.

Example scenario: Consider a dataset for cancer diagnosis, where the positive class represents malignant cases (minority class), and the negative class represents benign cases (majority class). The dataset is imbalanced, with 80% of cases being benign. In this scenario, down-sampling can be applied by randomly removing instances from the majority class (benign cases) to achieve a balanced distribution between the two classes. By reducing the dominance of the majority class, the model can better learn the distinguishing characteristics of the minority class (malignant cases) and improve its ability to identify cancer accurately.

It's important to note that both up-sampling and down-sampling have their advantages and limitations. Up-sampling can increase the risk of overfitting since it duplicates or generates additional instances, potentially leading to the model memorizing the existing data. Down-sampling may discard potentially useful information from the majority class, resulting in a loss of data and potentially reducing the model's ability to generalize.

The choice between up-sampling and down-sampling depends on the specific dataset and problem at hand. It's often recommended to try multiple techniques, including a combination of up-sampling, down-sampling, and other advanced methods, to find the most effective approach for handling the imbalanced data.
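
Both techniques can be sketched with `sklearn.utils.resample`. The dataset below (90 legitimate and 10 fraudulent rows, with a placeholder `amount` column) is hypothetical and only serves to show the mechanics.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy dataset: 90 legitimate (0) and 10 fraudulent (1) rows.
df = pd.DataFrame({'amount': range(100),
                   'label': [0] * 90 + [1] * 10})
majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Up-sampling: draw minority rows with replacement until both classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_up = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to minority size.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
df_down = pd.concat([majority_down, minority])

print(df_up['label'].value_counts().to_dict())
print(df_down['label'].value_counts().to_dict())
```

After up-sampling each class has 90 rows (with many duplicated minority rows); after down-sampling each class has 10 rows, and 80 majority rows have been discarded.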

Q5: What is data Augmentation? Explain SMOTE.

Data augmentation is a technique in machine learning that involves creating new data points from existing data points. This can be done by applying a variety of transformations to the data, such as flipping, rotating, cropping, or translating images. Data augmentation can be used to improve the performance of machine learning models by reducing overfitting.

SMOTE (Synthetic Minority Oversampling Technique) is a specific type of data augmentation that is used to address imbalanced datasets. Imbalanced datasets are datasets where one class is much more represented than the others. This can make it difficult for machine learning models to learn to classify the minority class correctly. SMOTE works by creating synthetic data points for the minority class by interpolating between existing data points. This helps to balance the dataset and improve the performance of machine learning models.

Here is an example of how SMOTE works. Let's say we have a dataset with 100 data points, of which 90 are in the majority class and 10 are in the minority class. For each of the 10 minority data points, we identify its 5 nearest neighbors within the minority class. A synthetic data point is created by picking one of those neighbors and interpolating at a random position on the line segment between the two points. If we generate 5 synthetic points per minority point, we obtain 50 new data points, which brings the minority class up to 60. Fully balancing the classes at 90 each would require 80 synthetic points.

SMOTE is a powerful technique that can be used to improve the performance of machine learning models on imbalanced datasets. However, it is important to note that SMOTE can also introduce noise into the dataset. This is why it is important to carefully evaluate the performance of machine learning models on both the original dataset and the augmented dataset.
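
The interpolation step can be sketched in a few lines. This is a simplified illustration of the idea, not the reference SMOTE implementation; the function name `smote_sketch`, its parameters, and the random toy data are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by linear interpolation."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: each point is its own nearest neighbor
    # (assuming there are no duplicate points), so column 0 is skipped.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, neighbor_idx = nn.kneighbors(X_min)

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))                # pick a minority point
        neighbor = rng.choice(neighbor_idx[base, 1:])  # pick one of its k neighbors
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Hypothetical minority class: 10 points in two dimensions.
X_minority = np.random.default_rng(1).normal(size=(10, 2))
X_synthetic = smote_sketch(X_minority, n_new=50)
print(X_synthetic.shape)  # (50, 2)
```

Because every synthetic point lies on a segment between two existing minority points, the new points never leave the region already covered by the minority class.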

Here are some other examples of data augmentation techniques:

For images, you can use geometric transformations such as flipping, rotating, cropping, and translating.

For text, you can use techniques such as synonym replacement, word dropout, and back translation.

For time series data, you can use techniques such as time warping and shifting.

Data augmentation is a powerful technique that can be used to improve the performance of machine learning models. However, it is important to use it carefully to avoid introducing noise into the dataset.

Q6: What are outliers in a dataset? Why is it essential to handle outliers?

Outliers in a dataset are data points that significantly deviate from the majority of the data, either in terms of their values or their relationship with other data points. Outliers can be extreme values that are far larger or smaller than the typical values, or they can be data points that fall far away from the general pattern or trend of the data.

Handling outliers is essential for several reasons:

Impact on statistical measures: Outliers can heavily influence statistical measures such as the mean and standard deviation. Since these measures are sensitive to extreme values, the presence of outliers can distort their values and provide a misleading representation of the data.

Biased analysis and modeling: Outliers can lead to biased analysis and modeling results. They can skew the distribution, affect parameter estimates, and impact the assumptions of statistical tests and models. Ignoring or mishandling outliers can lead to inaccurate conclusions and predictions.

Model performance: Outliers can have a significant impact on the performance of machine learning models. Models can be overly influenced by outliers, resulting in poor generalization and suboptimal predictive capabilities. Outliers can lead to overfitting, where the model fits the noise or extreme values instead of the underlying pattern.

Data interpretation and understanding: Outliers can provide valuable insights or indicate potential data quality issues or anomalies. It is important to understand the nature and reasons behind outliers to gain a comprehensive understanding of the data and make informed decisions. Ignoring outliers may result in overlooking important aspects of the data or missing out on critical information.

Data normalization and scaling: Outliers can affect data normalization and scaling techniques, which are commonly used in various data preprocessing steps. Outliers can distort the scaling process, impacting the relative importance and contribution of other data points.

Handling outliers involves various techniques such as:

Identifying and removing outliers: Outliers can be detected using statistical methods such as the z-score, modified z-score, or the interquartile range (IQR). Once identified, outliers can be removed from the dataset if they are deemed to be data anomalies or errors.

Transforming the data: Data transformation techniques such as log transformation or Winsorization can be applied to handle outliers. These techniques modify the values of extreme observations to bring them closer to the rest of the data distribution.

Robust statistical measures: Instead of relying on mean and standard deviation, robust statistical measures such as the median and median absolute deviation (MAD) can be used. These measures are less affected by outliers and provide a more accurate representation of the central tendency and variability.

Advanced modeling techniques: Some advanced modeling techniques, such as robust regression or robust clustering algorithms, are designed to handle outliers more effectively by minimizing their influence on the model estimation or clustering process.

By appropriately handling outliers, the integrity of the data analysis, modeling, and interpretation can be improved, leading to more accurate and reliable results.
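
Here is a minimal sketch of the IQR rule together with two of the treatments listed above (removal and capping). The sample values are made up, and the 1.5 multiplier is the conventional choice rather than a fixed requirement.

```python
import pandas as pd

# Hypothetical sample with one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Option 1: drop the flagged points.
removed = values[~is_outlier]

# Option 2: cap them at the fences (a Winsorization-style treatment).
capped = values.clip(lower=lower, upper=upper)

# The mean is pulled up by the outlier, while the median barely moves.
print(values.mean(), values.median())
print(removed.mean(), removed.median())
```

Comparing the two printed lines shows why the median is called a robust measure: removing a single point changes the mean substantially but leaves the median almost unchanged.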

Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?

When dealing with missing data in customer data analysis, several techniques can be used to handle the missing values. Here are some commonly employed techniques:

Removal of missing data:
If the missing values are limited in number and do not significantly impact the analysis, you may choose to remove the observations with missing values. However, this approach should be used with caution as it may lead to a loss of valuable information if the missingness is not completely random.

Mean/median imputation:
This technique involves replacing missing values with the mean or median value of the respective variable. Imputing the mean preserves the overall mean of the data, while imputing the median is more robust to outliers.

Mode imputation:
Mode imputation involves replacing missing categorical values with the most frequent category (mode) of the variable. This approach is suitable for variables with discrete categories.

Regression imputation:
Regression imputation utilizes regression models to predict missing values based on other variables. A regression model is built using the observed data, and the missing values are then predicted using the model. This technique allows for the incorporation of relationships among variables.

Multiple imputation:
Multiple imputation involves creating multiple imputed datasets by imputing missing values multiple times using a specified imputation method. Each imputed dataset is analyzed separately, and the results are combined using specialized rules. This technique accounts for the uncertainty introduced by imputing missing values.

K-nearest neighbors imputation:
K-nearest neighbors (KNN) imputation imputes missing values based on similar observations. The missing values are replaced with values from the nearest neighbors, considering the K most similar instances based on other variables.

Advanced imputation methods:
There are more advanced imputation methods available, such as hot deck imputation, expectation-maximization (EM) algorithm, and multiple imputation by chained equations (MICE). These techniques offer more sophisticated imputation strategies for handling missing data.

The choice of technique depends on various factors, including the nature of the missing data, the analysis objectives, and the characteristics of the dataset. It is essential to consider the potential impact of imputation on the analysis results and to evaluate the performance and validity of the chosen imputation method.
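
As one example from the list above, K-nearest neighbors imputation could look like the following sketch, which uses scikit-learn's `KNNImputer`. The customer table, its column names, and the choice of 2 neighbors are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical customer table with gaps in age and spend.
customers = pd.DataFrame({
    'age':    [25, 27, np.nan, 52, 55],
    'income': [40, 42, 41, 90, 95],
    'spend':  [5, 6, 5.5, np.nan, 20],
})

# Each missing entry becomes the mean of that feature over the 2 nearest rows
# that have it observed; distances use only features present in both rows.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(customers), columns=customers.columns)
print(filled)
```

Because the method relies on distances, features on very different scales should usually be standardized first, otherwise the largest-valued feature dominates the choice of neighbors.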

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

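Missing data is usually classified as missing completely at random (MCAR), where the missingness is unrelated to any variable; missing at random (MAR), where it depends only on observed variables; or missing not at random (MNAR), where it depends on the unobserved values themselves. Determining which mechanism applies provides valuable insight into the biases the missingness may introduce. Here are some strategies you can use to assess the missing data patterns: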

1. Missing data visualization:
Visualizing the missing data patterns can help identify potential patterns or relationships between missing values and other variables. You can create a missingness matrix or use heatmaps to visualize the missingness patterns across different variables. If certain variables or combinations of variables exhibit higher missingness, it could indicate a pattern in the missing data.

2. Missing data correlation:
Create a binary indicator for whether a variable is missing and calculate its correlation with the other variables in the dataset. This can be done using measures such as the phi coefficient (when the other variable is also binary) or the point-biserial correlation (when the other variable is continuous). Significant correlations suggest that the missingness of one variable is related to the values of other variables.

3. Missing data mechanism tests:
Conduct statistical tests to determine if the missing data follows a specific mechanism. Two common tests include:

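a. Little's Missing Completely at Random (MCAR) test: This test assesses whether the pattern of missingness is independent of the observed values. Its null hypothesis is that the data is MCAR, so failing to reject it is consistent with MCAR, while rejecting it indicates that the missingness depends on the data.

b. Chi-square test of independence on missingness indicators: Cross-tabulate a missing/observed indicator against an observed categorical variable and test for independence. Rejecting the null hypothesis indicates that the missingness depends on that variable, so the data is not MCAR. Note that MNAR, where the missingness depends on the unobserved values themselves, cannot be confirmed or ruled out from the observed data alone.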

4. Pattern analysis by observed data:
Compare the observed variables between records that have a missing value and records that do not. If the two groups differ systematically, for example in their mean age or in their category proportions, it suggests that the missing data is not completely random.

5. Expert domain knowledge:
Consult with subject matter experts who are familiar with the data or the domain. They can provide insights into potential reasons or mechanisms behind the missing data patterns based on their knowledge and experience.

It's important to note that determining the missing data mechanism is challenging and can be subjective. The above strategies provide useful information but may not provide definitive answers. Handling missing data should consider the potential biases introduced by missingness, regardless of whether it is missing at random or not, to ensure the validity and reliability of the analysis.
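
A simple version of these checks can be sketched as follows. The data is simulated so that `income` is more likely to be missing for older customers; the column names, sample size, and missingness rates are all assumptions made for the illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: income is more likely to be missing for older customers.
rng = np.random.default_rng(0)
n = 1000
age = rng.integers(18, 80, size=n)
income = rng.normal(50, 10, size=n)
p_missing = np.where(age > 60, 0.4, 0.05)
income[rng.random(n) < p_missing] = np.nan
df = pd.DataFrame({'age': age, 'income': income})

# Share of missing values per column.
print(df.isna().mean())

# Compare an observed variable between rows with and without missing income.
income_missing = df['income'].isna()
age_when_missing = df.loc[income_missing, 'age']
age_when_observed = df.loc[~income_missing, 'age']
t_stat, p_value = stats.ttest_ind(age_when_missing, age_when_observed,
                                  equal_var=False)
print(t_stat, p_value)
```

A small p-value here is evidence against MCAR, since the missingness of `income` is related to `age`. It says nothing about whether the missingness also depends on the unobserved income values themselves.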

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?

When working with an imbalanced medical diagnosis dataset where the majority of patients do not have the condition of interest, it is crucial to employ appropriate strategies to evaluate the performance of your machine learning model effectively. Here are some strategies that can be used:

Confusion matrix and class imbalance-aware metrics:
Utilize a confusion matrix to examine the model's performance on both the majority (negative) and minority (positive) classes. Metrics such as accuracy, precision, recall (sensitivity), specificity, F1 score, and area under the ROC curve (AUC-ROC) can provide insights into the model's performance. However, be cautious with accuracy as it can be misleading due to the class imbalance. Focus on class-specific metrics like recall or F1 score to assess the model's ability to correctly identify the minority class.

Resampling techniques:
Implement resampling techniques to balance the training data, such as oversampling the minority class or undersampling the majority class. This can help provide the model with a more balanced training dataset and improve its performance on the minority class. Apply resampling only to the training data, and evaluate on a held-out set that keeps the original imbalanced distribution, so that the reported performance reflects real-world conditions.

Cross-validation:
Utilize appropriate cross-validation techniques, such as stratified k-fold cross-validation, to ensure that each fold maintains the class distribution proportionality. This helps in obtaining more reliable and representative performance estimates of the model.

Evaluation on different thresholds:
Adjust the classification threshold to find the optimal balance between precision and recall. This can be done by plotting a precision-recall curve or an ROC curve and selecting the threshold that maximizes the desired metric for the minority class (e.g., the F1 score, or recall subject to a minimum acceptable precision).

Cost-sensitive learning:
Assign different misclassification costs to the minority and majority classes to reflect the practical importance and consequences of misclassification. This encourages the model to pay more attention to the minority class during training.

Ensemble methods:
Employ ensemble methods, such as bagging or boosting techniques, to improve the model's performance. Ensemble methods combine multiple models to make predictions and can be beneficial in handling class imbalance and reducing bias towards the majority class.

Domain knowledge and expert feedback:
Consult with medical experts to gain insights into the critical factors or evaluation metrics that are relevant to the medical diagnosis problem. Their domain knowledge can provide valuable guidance in evaluating the model's performance effectively.

It's important to note that the choice of evaluation strategies should align with the specific characteristics of the medical diagnosis problem, the available data, and the evaluation objectives. Evaluating model performance on imbalanced datasets requires careful consideration to ensure that the model is effective in accurately identifying the minority class while maintaining a good balance between precision and recall.
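
Several of these strategies can be combined in a short evaluation script. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for real patient data, with roughly 5% positive cases; the model choice and all parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic stand-in for a diagnosis dataset: roughly 5% positive cases.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)

# class_weight='balanced' is a simple form of cost-sensitive learning.
model = LogisticRegression(class_weight='balanced', max_iter=1000)

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=cv)

cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted class
precision = precision_score(y, y_pred)
recall = recall_score(y, y_pred)
f1 = f1_score(y, y_pred)
print(cm)
print(precision, recall, f1)
```

Every prediction comes from a fold in which that sample was held out, so the confusion matrix reflects performance on data with the original class imbalance.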

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?

To balance an unbalanced dataset in which the majority class dominates customer satisfaction reports, you can employ down-sampling techniques to reduce the number of instances in the majority class. Here are some methods you can use:

Random under-sampling:
Randomly select a subset of instances from the majority class to match the number of instances in the minority class. This approach can be straightforward to implement and help balance the dataset. However, it may discard potentially useful data and result in information loss.

Cluster-based under-sampling:
Apply clustering algorithms, such as k-means or hierarchical clustering, to group instances from the majority class into clusters. Then, select representative instances from each cluster, ensuring a diverse representation of the majority class while reducing the number of instances.

Tomek links:
Identify Tomek links, which are pairs of instances from different classes that are nearest neighbors to each other. Remove the majority class instances from these pairs, as they are likely to be noise or outliers. This technique aims to create a clearer separation between the classes.

NearMiss:
NearMiss is a family of under-sampling techniques that select instances from the majority class based on their proximity to the minority class. The selection is done by considering the nearest neighbors of the minority class instances. NearMiss variants include NearMiss-1, NearMiss-2, and NearMiss-3, each with different selection criteria.

Condensed Nearest Neighbor (CNN):
CNN is an under-sampling method that builds a small subset (the store) of the training data. In its usual form for imbalanced data, it starts with all minority class instances plus one majority class instance, then iteratively adds each remaining majority class instance that a 1-nearest neighbor classifier fitted on the current store misclassifies. The aim is to retain only the subset of majority class instances that is necessary for accurate classification.

Edited Nearest Neighbor (ENN):
ENN is an under-sampling technique that removes instances from the majority class if their class label differs from the majority of their k-nearest neighbors. ENN aims to remove potentially mislabeled instances from the majority class.

It is important to note that when down-sampling the majority class, you should be cautious about potential information loss and the impact on the overall representation of the data. Consider evaluating the model's performance on both the down-sampled dataset and the original imbalanced dataset to ensure a comprehensive assessment. Additionally, combining down-sampling with other techniques like synthetic data generation or ensemble methods can help improve the performance and balance of the dataset.
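
To make the Tomek links idea concrete, here is a minimal sketch that finds mutual nearest neighbors with different labels and drops the majority class member of each pair. The helper name `tomek_link_majority_indices` and the one-dimensional toy data are hypothetical, and the sketch assumes the data contains no duplicate points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class points that belong to a Tomek link."""
    # Nearest neighbor of every point, skipping the point itself (column 0).
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    nearest = nn.kneighbors(X, return_distance=False)[:, 1]

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i   # i and j are each other's nearest neighbor
        different = y[i] != y[j]   # and they carry different labels
        if mutual and different and y[i] == majority_label:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Hypothetical 1-D data: the majority point at 2.0 sits next to a minority point at 2.1.
X = np.array([[0.0], [0.4], [1.0], [2.0], [2.1], [3.0]])
y = np.array([0, 0, 0, 0, 1, 1])
drop = tomek_link_majority_indices(X, y, majority_label=0)
X_clean, y_clean = np.delete(X, drop, axis=0), np.delete(y, drop)
print(drop)  # [3]
```

Only the majority point at 2.0 is removed: it forms a mutual nearest-neighbor pair with the minority point at 2.1, while the pair at 0.0 and 0.4 shares a label and is left alone.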

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?

To balance an unbalanced dataset with a low percentage of occurrences, specifically when dealing with a rare event, you can employ up-sampling techniques to increase the number of instances in the minority class. Here are some methods you can use:

Random over-sampling:
Randomly duplicate instances from the minority class to match the number of instances in the majority class. This simple approach increases the representation of the minority class but may lead to overfitting if not handled carefully.

SMOTE (Synthetic Minority Over-sampling Technique):
SMOTE generates synthetic samples for the minority class by interpolating feature values between existing instances. It creates new instances along the line segments connecting neighboring instances. SMOTE helps in creating a more diverse and balanced dataset for the rare event.

ADASYN (Adaptive Synthetic Sampling):
ADASYN is an extension of SMOTE that adapts the number of synthetic samples generated around each minority class instance. It generates more synthetic samples for minority instances that have a larger share of majority class neighbors, that is, for the instances that are more challenging to classify.

Borderline SMOTE:
Borderline SMOTE generates synthetic samples only near the decision boundary between classes. It selects minority class instances whose nearest neighbors are mostly from the majority class, since these are the instances most likely to be misclassified, and interpolates from them. This technique aims to improve the model's ability to learn from the difficult minority class instances.

SMOTE-ENN (SMOTE with Edited Nearest Neighbors):
SMOTE-ENN combines over-sampling using SMOTE with the removal of noisy or mislabeled instances using the Edited Nearest Neighbor (ENN) algorithm. After SMOTE has been applied, ENN removes instances whose class label differs from the majority of their k-nearest neighbors.

Synthetic Minority Over-sampling Technique for Nominal and Continuous (SMOTE-NC):
SMOTE-NC extends SMOTE to handle datasets with both continuous and categorical features. Continuous features are interpolated as in SMOTE, while each categorical feature of a synthetic sample takes the most frequent category among the nearest neighbors.

When up-sampling the minority class, be cautious about potential overfitting and the risk of synthetic samples being too similar to existing instances. Evaluating the model's performance on both the up-sampled dataset and the original imbalanced dataset is crucial. Additionally, combining up-sampling with other techniques such as down-sampling or ensemble methods can be beneficial in creating a balanced dataset and improving the performance on the rare event.