# **`Feature Engineering-1`**

`Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.`

**`Missing values`** in a dataset refer to the absence of a value in one or more variables for a particular observation. This can happen for several reasons, such as data entry errors, incomplete surveys or questionnaires, or missing data due to technical issues.

Handling missing values is essential because they can affect the accuracy and reliability of the statistical analysis and machine learning models. Missing values can lead to biased estimates of the parameters, reduce the statistical power of the analysis, and even lead to incorrect conclusions. Moreover, many machine learning algorithms cannot handle missing values and may produce errors or incorrect predictions.

Some common methods for handling missing values in a dataset include:

1. Removing observations with missing values: This method involves removing all observations that contain at least one missing value. However, this approach can reduce the sample size and may lead to biased estimates unless the data is missing completely at random.

2. Removing variables with missing values: This method involves removing all variables that contain missing values. This approach may be appropriate if the missing values are only present in a small number of variables or if the variables are not essential for the analysis.

3. Imputation: This method involves estimating the missing values based on the values of other variables or based on the average or median value of the variable. There are several methods of imputation, such as mean imputation, median imputation, regression imputation, and multiple imputation.

4. Treat missing values as a separate category: This method is suitable for categorical variables and involves creating a new category for the missing values.

The choice of the method for handling missing values depends on the amount and pattern of missing data, the type of analysis, and the specific requirements of the problem. Handling missing values appropriately can improve the quality and reliability of the analysis and machine learning models.

Strictly speaking, few algorithms are unaffected by missing values, but some can be trained on incomplete data, depending on the implementation:

1. `Decision trees`: Some tree algorithms handle missing values natively. CART can fall back on surrogate splits, and C4.5 passes an observation with a missing split value down every branch with fractional weights. Library support varies, so check whether the implementation you use accepts missing values.

2. `Random forests` and gradient-boosted trees: As ensembles of decision trees, they can inherit the same mechanisms. XGBoost, LightGBM, and scikit-learn's histogram-based gradient boosting, for example, learn a default direction for missing values at each split.

3. `K-nearest neighbors (KNN)`: KNN is a non-parametric algorithm that does not require any assumptions about the distribution of the data. Standard KNN needs complete feature vectors to compute distances, but it can be adapted to missing values by computing the distance over only the features that are observed in both points.

4. `Naive Bayes`: Naive Bayes is a probabilistic algorithm that treats features as conditionally independent given the class, so in principle a missing feature can be left out of the probability calculation. Many library implementations still require complete input.

5. `Support vector machines (SVM)`: A standard SVM needs complete feature vectors to evaluate its kernel, so it does not handle missing values by itself. It can be used only after imputation, or with a custom kernel or distance measure that skips the missing features.

6. `Principal Component Analysis (PCA)`: Standard PCA also requires complete data. Variants such as probabilistic PCA can estimate the missing values with an expectation-maximization (EM) algorithm while performing the dimensionality reduction.

Even when an algorithm accepts missing values, its performance may suffer if a large share of the values is missing or if the values are not missing at random. Therefore, it's still important to examine and handle missing values appropriately, regardless of the algorithm being used.

`Q2: List down techniques used to handle missing data. Give an example of each with python code.`

Here are some common techniques used to handle missing data:

1. Deletion: This method involves removing observations with missing values. It can be further divided into two subcategories:
 - Listwise deletion: This approach removes every observation that has a missing value in any variable.
 - Pairwise deletion: This approach keeps all observations and, for each calculation (for example, the correlation between two variables), uses every observation that has values for the variables involved.
2. Mean/Median/Mode Imputation: This method involves replacing the missing values with the mean/median/mode of the available values for that variable.

3. Regression Imputation: This method involves predicting the missing values of a variable using regression analysis.

4. K-nearest neighbor imputation: This method involves using the values of the k-nearest neighbors to estimate the missing value.

5. Multiple Imputation: This method involves imputing the missing values multiple times to create multiple datasets and then analyzing each dataset to obtain an average or pooled estimate.

6. Data augmentation: In this context, this means generating plausible replacement values from the available data, for example by interpolating between neighbouring observations, rather than adding new rows to the dataset.

7. Treat missing values as a separate category: This method is suitable for categorical variables and involves creating a new category for the missing values.

The choice of method for handling missing data depends on the type and amount of missing data, the distribution of the data, and the specific requirements of the problem. It's essential to handle missing data appropriately to avoid bias and improve the accuracy and reliability of the analysis.

In [6]:
#Examples:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.impute import SimpleImputer

#1. Deletion

#list wise deletion
df = sns.load_dataset("titanic")
print(df.shape)
df.dropna(inplace=True) # deletion
print(df.shape) 

(891, 15)
(182, 15)


In [2]:
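#deletion restricted to the variables used in the analysis
#(note: pairwise deletion proper is applied per statistic, e.g. df.corr() skips missing values pair by pair)
df = sns.load_dataset("titanic")
print(df[["age","embarked"]].shape)
df.dropna(subset=['age', 'embarked'], inplace=True) #drop only rows missing 'age' or 'embarked'
print(df[["age","embarked"]].shape)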

(891, 2)
(712, 2)
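Pairwise deletion is usually applied per statistic rather than per table. As a small sketch on a made-up DataFrame (the column names and values are assumptions), `DataFrame.corr()` computes each correlation from the rows where both columns are present, whereas calling `dropna()` first keeps only the fully complete rows.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
df_demo = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [5.0, 4.0, 3.0, np.nan, 1.0],
})

# Pairwise deletion: each pair of columns uses every row where both are observed
pairwise_corr = df_demo.corr()

# Listwise deletion: only rows complete in all columns are used
complete_rows = df_demo.dropna()
listwise_corr = complete_rows.corr()

# Number of rows available to each pair under pairwise deletion
observed = df_demo.notna().astype(int)
pair_counts = observed.T @ observed

print("Rows per pair (pairwise):")
print(pair_counts)
print("Rows used (listwise):", len(complete_rows))
```

Here every pair of columns keeps three rows under pairwise deletion, while listwise deletion leaves only two rows for all calculations.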


In [27]:
#2.mean/median/mode imputation
df = sns.load_dataset("titanic") #load data

#Mean/Median/Mode Imputation:
print("Before",df.embarked.isnull().sum()) #original

# Impute missing values with mode
imputer = SimpleImputer(strategy='most_frequent') # strategy can be changed to mean or median 
#respectively for mean or median imputation (numeric columns only)

# fit_transform returns a 2-D array, so flatten it before assigning to one column
df["embarked"] = imputer.fit_transform(df[["embarked"]]).ravel()
print("After",df.embarked.isnull().sum()) #final result


Before 2
After 0


In [29]:
# 3. Regression Imputation:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Load the dataset
df = sns.load_dataset("titanic") #load data

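#before imputation
print("before",df.age.isnull().sum())
# Impute missing values with regression: each column with missing values is
# predicted from the other numeric columns (default estimator: BayesianRidge).
# With a single column there is nothing to regress on, so several are passed.
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
imputer = IterativeImputer(random_state=0)
df[num_cols] = imputer.fit_transform(df[num_cols])
print("After",df.age.isnull().sum())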


before 177
After 0


In [32]:
#4. K-nearest neighbor imputation:

from sklearn.impute import KNNImputer

# Load the dataset
df = sns.load_dataset("titanic") #load data

#before imputation
print("before",df.age.isnull().sum())

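# Impute missing values with KNN; neighbours are found using the other numeric
# columns, so more than one column must be passed to the imputer
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
imputer = KNNImputer(n_neighbors=3)
df[num_cols] = imputer.fit_transform(df[num_cols])

print("after",df.age.isnull().sum())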


before 177
after 0


In [33]:
# 5. Multiple Imputation:

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Load the dataset
df = sns.load_dataset("titanic") #load data

#before imputation
print("before",df.age.isnull().sum())

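# Draw several imputations by sampling from the posterior with different seeds.
# In a full analysis each imputed dataset is analysed separately and the
# estimates are pooled; the draws are averaged here only to fill the column.
num_cols = ["age", "fare", "pclass", "sibsp", "parch"]
age_draws = []
for seed in range(5):
    imputer = IterativeImputer(max_iter=10, sample_posterior=True, random_state=seed)
    age_draws.append(imputer.fit_transform(df[num_cols])[:, 0])
df["age"] = np.mean(age_draws, axis=0)

print("after",df.age.isnull().sum())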

before 177
after 0


In [34]:
# 6. Data augmentation:

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})

# Fill each gap with a value derived from the neighbouring observations
# (linear interpolation) instead of hard-coding the replacement values
df_augmented = df.interpolate()

print(df_augmented)


     A    B
0  1.0  5.0
1  2.0  6.0
2  3.0  7.0
3  4.0  8.0


In [35]:
# 7. Treat missing values as a separate category:

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, 7, 8]})

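# Replace missing values with a new label "missing".
# This is meant for categorical columns; the numeric columns used here are
# cast to object first, so they no longer behave as numbers afterwards.
df['A'] = df['A'].astype(object).fillna('missing')
df['B'] = df['B'].astype(object).fillna('missing')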

print(df)

         A        B
0      1.0      5.0
1      2.0  missing
2  missing      7.0
3      4.0      8.0


`Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?`

`Imbalanced data` refers to a situation in which the distribution of classes in a classification dataset is not equal. In other words, one class may have significantly more instances than another class. For example, in a medical diagnosis dataset, the number of healthy patients may be much larger than the number of sick patients.

If imbalanced data is not handled, it can lead to biased and inaccurate machine learning models. The model may have a high accuracy rate but may be predicting only the majority class, ignoring the minority class. This can result in a high false negative rate, which means that the model fails to correctly identify the minority class, leading to potentially serious consequences.

For example, in a credit card fraud detection dataset, the majority of the transactions may be legitimate, and only a small percentage may be fraudulent. If the model is trained on this imbalanced dataset without any handling techniques, it may classify all transactions as legitimate, resulting in a high false negative rate and allowing fraudulent transactions to go undetected.

Therefore, handling imbalanced data is crucial in machine learning to ensure that the model performs well on all classes and avoids making biased predictions.
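The accuracy problem described above can be reproduced with a minimal sketch. The labels below are made up (990 negatives and 10 positives), and `DummyClassifier` stands in for a model that has learned to predict only the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 99% class 0, 1% class 1
y = np.array([0] * 990 + [1] * 10)
X = np.zeros((1000, 1))  # the features are irrelevant for this baseline

# A baseline that always predicts the most frequent class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)

print("Accuracy:", acc)               # high, because the majority class dominates
print("Recall (minority class):", rec)  # zero, no positive case is ever found
```

The baseline reaches 99% accuracy while missing every minority-class case, which is why accuracy alone is a poor guide on imbalanced data.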

`Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.`

`Up-sampling` and `down-sampling` are techniques used to `handle imbalanced data` in machine learning.

Up-sampling is the process of randomly replicating minority class samples to increase their number to match the majority class samples. This technique can be used when the minority class has too few samples to represent the true distribution of the data, and the model needs more data to learn from.

Down-sampling, on the other hand, is the process of randomly removing samples from the majority class to reduce its number to match the minority class samples. This technique can be used when the majority class has significantly more samples than the minority class, and the model is biased towards the majority class.

Here's an example to illustrate when up-sampling and down-sampling may be required:

Suppose we have a dataset for credit card fraud detection, where the majority class is legitimate transactions, and the minority class is fraudulent transactions. If the dataset contains 10,000 legitimate transactions and only 100 fraudulent transactions, the dataset is heavily imbalanced, with a class distribution of 99% legitimate transactions and 1% fraudulent transactions.

In this case, the model may not be able to accurately predict fraudulent transactions since it has too few samples to learn from. To handle this situation, we can use up-sampling to increase the number of fraudulent transactions by randomly replicating them until they match the 10,000 legitimate transactions (a smaller target such as 1,000 reduces the imbalance without removing it). This balances the class counts, although the replicated rows add no new information, so the model can overfit to them.

On the other hand, if the dataset contains 10,000 fraudulent transactions and only 100 legitimate transactions, the model may be biased towards the fraudulent transactions and predict all transactions as fraudulent. In this case, we can use down-sampling to reduce the number of fraudulent transactions to match the number of legitimate transactions, that is, to 100. This balances the dataset at the cost of discarding 9,900 observations.

Overall, up-sampling and down-sampling are useful techniques for handling imbalanced data. They should be applied to the training data only, so that the model is still evaluated on data with the original class distribution.
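The credit card example can be sketched with `sklearn.utils.resample`. The DataFrame below is made up (10,000 legitimate and 100 fraudulent rows), and the column names `amount` and `is_fraud` are assumptions for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up data: 10,000 legitimate (0) and 100 fraudulent (1) transactions
df_tx = pd.DataFrame({
    "amount": range(10_100),
    "is_fraud": [0] * 10_000 + [1] * 100,
})
majority = df_tx[df_tx["is_fraud"] == 0]
minority = df_tx[df_tx["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw majority rows without replacement down to the minority size
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_fraud"].value_counts())
print(downsampled["is_fraud"].value_counts())
```

Up-sampling keeps every original row but repeats the minority rows, while down-sampling keeps only 200 of the 10,100 rows.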

`Q5: What is data Augmentation? Explain SMOTE.`

Data augmentation is a technique used to increase the size of a dataset by generating new data from the existing data while preserving its label. It is a common technique used to address the problem of limited data and improve the performance of machine learning models.

SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used for handling imbalanced datasets. It is used to generate new synthetic samples of the minority class in the imbalanced dataset, thereby balancing the class distribution. The technique works by creating synthetic samples of the minority class by interpolating between existing minority class samples.

Here's a step-by-step explanation of how SMOTE works:

1. For each minority class sample, SMOTE selects k nearest neighbors from the minority class.
2. It then randomly selects one of the k nearest neighbors and generates a new synthetic sample by interpolating between the selected minority class sample and the chosen nearest neighbor.
3. The interpolation takes the difference between the two feature vectors, multiplies it by a random number between 0 and 1, and adds the result to the original sample, so the synthetic sample lies between the two originals in feature space. In the imbalanced-learn implementation, a single random number is drawn per synthetic sample, which places the new point on the line segment joining the two samples.
4. SMOTE repeats this process until the desired number of synthetic samples has been generated.

Here's an example of how SMOTE can be implemented in Python using the imbalanced-learn library:

In [None]:
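#example code
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# toy imbalanced data (about 90% / 10%) so that the example runs on its own
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# instantiate SMOTE with desired sampling strategy
sm = SMOTE(sampling_strategy='minority', random_state=42)

# fit and apply SMOTE to the training data only; the test set stays untouched
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)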

In this example, SMOTE is instantiated with a sampling strategy of 'minority', which means that the minority class will be over-sampled to match the number of samples in the majority class. The fit_resample method is then called on the training data, which applies SMOTE and returns the new resampled data, X_resampled and y_resampled.

SMOTE is a powerful technique for handling imbalanced datasets, but it should be used with caution as it can sometimes generate synthetic samples that are unrealistic or noisy. It is important to carefully tune the parameters of SMOTE and evaluate its impact on the performance of the machine learning model.
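To make the interpolation step concrete, here is a simplified NumPy sketch of the idea. It is not the imbalanced-learn implementation: the function name `smote_sketch` and its parameters are hypothetical, it draws one random number per synthetic sample, and it assumes the minority samples contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min (illustrative only)."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # pick a random base sample and one of its k neighbours for every new point
    base = rng.integers(0, len(X_min), size=n_new)
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]
    # one random factor in [0, 1) per synthetic sample
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neighbour] - X_min[base])

# Made-up minority class: 20 points in 2 dimensions
X_min = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote_sketch(X_min, n_new=30, k=5)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always stays inside the range spanned by the minority class.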

`Q6: What are outliers in a dataset? Why is it essential to handle outliers?`

Outliers are observations or data points in a dataset that are significantly different from the majority of the other observations or data points. Outliers can be caused by various factors, such as measurement errors, data entry errors, or true variations in the data.

It is essential to handle outliers in a dataset because they can have a significant impact on the results of data analysis and modeling. Outliers can lead to incorrect statistical analysis, biased estimates of model parameters, and reduced accuracy of predictive models. Outliers can also skew the distribution of data and affect the overall performance of the model.

Handling outliers is important to ensure that the model is accurate and reliable. There are several techniques for handling outliers, including:

1. Trimming: This involves removing a fixed percentage of the extreme values from the dataset. For example, the top and bottom 5% of values can be removed.

2. Winsorization: This involves replacing the extreme values with the nearest non-outlier values. For example, the top and bottom 5% of values can be replaced with the values at the 95th and 5th percentiles, respectively.

3. Z-score method: This involves identifying outliers based on their distance from the mean of the dataset. Observations that are more than a certain number of standard deviations away from the mean can be considered outliers.

4. Interquartile range (IQR) method: This involves identifying outliers based on their distance from the quartiles of the dataset. The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), and observations below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are commonly considered outliers.

Here's an example of how to remove outliers using the Z-score method in Python:

In [43]:
import numpy as np

# generate some random data with outliers
data = np.random.normal(0, 1, 100)
data[0] = 10
data[1] = -10

# compute the z-scores of the data
z_scores = (data - np.mean(data)) / np.std(data)

# identify outliers based on z-scores
outliers = np.abs(z_scores) > 3

# remove outliers from the data
cleaned_data = data[~outliers]

print("Original data:", data.shape)
print("Cleaned data:", cleaned_data.shape)

Original data: (100,)
Cleaned data: (98,)


Here's an example of how to remove outliers using the Interquartile range (IQR) method in Python:

In [44]:
import numpy as np

# generate some random data with outliers
data = np.random.normal(0, 1, 100)
data[0] = 10
data[1] = -10

# compute the IQR of the data
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# define the upper and lower bounds for outlier detection
upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# identify outliers based on the IQR method
outliers = (data < lower_bound) | (data > upper_bound)

# remove outliers from the data
cleaned_data = data[~outliers]

print("lower bound",lower_bound)
print("upper bound",upper_bound)
print("Original data:", data.shape)
print("Cleaned data:", cleaned_data.shape)

lower bound -3.15719984876644
upper bound 3.0847035295496186
Original data: (100,)
Cleaned data: (98,)
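Trimming and winsorization, the first two techniques listed above, can be sketched in the same way. The data is generated as in the previous cells, a fixed seed is added here for reproducibility, and the 5th and 95th percentiles are used as cut-offs, as in the description.

```python
import numpy as np

# Random data with two injected outliers (the seed is an assumption)
rng = np.random.default_rng(0)
data = rng.normal(0, 1, 100)
data[0] = 10
data[1] = -10

# Cut-offs at the 5th and 95th percentiles
low, high = np.percentile(data, [5, 95])

# Trimming: drop everything outside the cut-offs
trimmed = data[(data >= low) & (data <= high)]

# Winsorization: keep every observation but cap values at the cut-offs
winsorized = np.clip(data, low, high)

print("Trimmed data:", trimmed.shape)
print("Winsorized data:", winsorized.shape)
print("Winsorized range:", winsorized.min(), winsorized.max())
```

Trimming shrinks the sample, while winsorization keeps the sample size and only limits how extreme a value can be.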


`Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?`

There are several techniques that can be used to handle missing data in customer data analysis. Here are a few:

1. Deletion: This involves removing any rows or columns in the dataset that contain missing data. However, this approach should be used with caution as it can result in a loss of information and potentially biased results.

2. Mean/median imputation: This involves replacing missing values with the mean or median value of the available data. This approach is simple to implement, but it shrinks the variance of the variable and can result in biased estimates if the data is not missing completely at random.

3. Mode imputation: This involves replacing missing values with the mode (most frequent value) of the available data. This approach is also simple to implement but may not be appropriate for continuous data.

4. Regression imputation: This involves using regression models to predict the missing values based on other variables in the dataset. This approach can provide accurate estimates but requires additional modeling and may be sensitive to model assumptions.

5. Multiple imputation: This involves creating multiple imputed datasets, each with slightly different imputed values, and analyzing each dataset separately. This approach can provide more accurate estimates and a measure of uncertainty, but requires more computational resources.

6. Interpolation: This involves estimating the missing values based on the values of neighboring data points. This approach is often used for time series data.

Ultimately, the choice of which technique to use depends on the nature of the data, the extent of the missingness, and the specific analysis being conducted.
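Interpolation is the one technique in this list without an example elsewhere in the notebook. A minimal sketch on a made-up daily series (the dates, values, and variable names are assumptions) could look like this:

```python
import numpy as np
import pandas as pd

# Made-up daily series with a two-day gap
dates = pd.date_range("2024-01-01", periods=4, freq="D")
daily_sales = pd.Series([10.0, np.nan, np.nan, 16.0], index=dates)

# Estimate the missing values from the neighbouring points,
# weighting by the time elapsed between observations
filled = daily_sales.interpolate(method="time")

print(filled)
```

With evenly spaced dates, the gap is filled with 12.0 and 14.0, the values on the straight line between the two known points.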

`Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are
some strategies you can use to determine if the missing data is missing at random or if there is a pattern
to the missing data?`

There are several strategies that can be used to determine if missing data is missing at random or if there is a pattern to the missing data:

1. `Check for missing data patterns`: One approach is to visually inspect the data for patterns in the missing data. This can be done using heatmaps or other visualization tools to see if the missing data is clustered in certain areas or if it appears to be randomly distributed throughout the dataset.

2. `Test for missing data patterns`: Statistical tests can be used to check for patterns in the missing data. One approach is to create a missingness indicator (1 if the value is missing, 0 otherwise) for each variable and test whether it is related to the other variables, using correlation, a t-test or chi-square test, or logistic regression.

3. `Compare missing data patterns between groups`: If the dataset includes groups or subgroups, it can be useful to compare missing data patterns between the groups. If there are significant differences in the missing data patterns between the groups, this can suggest that the missing data is not missing at random.

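4. `Impute missing data`: Another approach is a sensitivity analysis: impute the missing data using different imputation methods and compare the results. Similar results suggest that the conclusions are robust to the missing-data assumptions, although this does not prove that the data is missing at random. Whether the missingness depends on the unobserved values themselves cannot be determined from the observed data alone and usually requires domain knowledge.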

Ultimately, the choice of strategy will depend on the nature of the dataset and the extent of the missing data. It is often recommended to use multiple strategies to cross-validate the results and increase confidence in the conclusions.
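The indicator-based checks above can be sketched on simulated data. In this made-up example (the column names, the sample size, and the missingness rates are assumptions), `income` is deliberately made to go missing more often for customers over 60, so the checks should reveal a pattern.

```python
import numpy as np
import pandas as pd

# Simulated customers: income is missing more often for older customers
rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 80, size=n)
income = rng.normal(50_000, 10_000, size=n)
p_missing = np.where(age > 60, 0.40, 0.05)
income[rng.random(n) < p_missing] = np.nan

df_demo = pd.DataFrame({"age": age, "income": income})

# Missingness indicator for the variable of interest
df_demo["income_missing"] = df_demo["income"].isna()

# Compare the missing rate between groups
missing_rate_by_group = df_demo.groupby(df_demo["age"] > 60)["income_missing"].mean()

# Correlation between the indicator and another observed variable
indicator_corr = df_demo["income_missing"].astype(int).corr(df_demo["age"])

print(missing_rate_by_group)
print("Correlation with age:", round(indicator_corr, 3))
```

A clearly higher missing rate in one group, or a non-zero correlation between the indicator and an observed variable, is evidence against the data being missing completely at random.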

`Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the
dataset do not have the condition of interest, while a small percentage do. What are some strategies you
can use to evaluate the performance of your machine learning model on this imbalanced dataset?`

When working with imbalanced datasets, where the majority of samples belong to one class, there are several strategies that can be used to evaluate the performance of a machine learning model. Here are a few approaches:

1. `Confusion matrix`: A confusion matrix is a table that shows the counts of true positives, true negatives, false positives, and false negatives among the model predictions. It shows which kinds of errors the model makes, and the other metrics below are derived from it.

2. `Precision and recall`: Precision and recall are two metrics that are often used to evaluate the performance of binary classification models on imbalanced datasets. Precision measures the accuracy of the positive predictions, while recall measures the ability of the model to identify all positive cases. Both metrics are important, but which one to focus on depends on the specific application.

3. `F1 score`: The F1 score is the harmonic mean of precision and recall and is a commonly used metric for evaluating the performance of binary classification models on imbalanced datasets. It provides a single score that balances both precision and recall.

4. `ROC curve and AUC`: The receiver operating characteristic (ROC) curve is a plot of the true positive rate against the false positive rate at various classification thresholds. The area under the curve (AUC) provides an overall measure of the model's ability to discriminate between positive and negative cases.

5. `Resampling techniques`: Resampling techniques such as oversampling, undersampling, or SMOTE (Synthetic Minority Over-sampling Technique) are not evaluation metrics, but they affect how the evaluation should be done. Apply them to the training data only (inside each cross-validation fold), and evaluate on a test set that keeps the original class distribution, ideally using stratified splits so that the minority class appears in every fold.

It is essential to choose the right evaluation metrics and strategies that are relevant to the specific problem and dataset. Moreover, it is always a good practice to cross-validate the results using different evaluation metrics and strategies to ensure that the model's performance is not biased towards a particular approach.
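The metrics above are all available in `sklearn.metrics`. The sketch below uses ten made-up labels and scores (eight negatives, two positives) and a threshold of 0.5, all of which are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Made-up ground truth and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7, 0.35])

# Hard predictions at a 0.5 threshold
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)    # uses the scores, not the hard predictions

print(cm)
print("precision:", precision, "recall:", recall, "f1:", f1, "ROC AUC:", auc)
```

In this toy case accuracy is 80%, yet precision and recall are both 0.5, which shows how the class-specific metrics expose errors that accuracy hides.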

`Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?`

To balance an unbalanced dataset and down-sample the majority class, we can use the following techniques:

1. `Random undersampling`: Randomly removing samples from the majority class until the dataset is balanced. This technique is simple to implement but can result in the loss of important information.

2. `Cluster centroids`: In this technique, the majority class is down-sampled by identifying clusters of the majority class and replacing them with their respective centroids. This method can preserve important information, but it may not be effective if there is significant overlap between the majority and minority classes.

3. `Tomek links`: A Tomek link is a pair of instances from different classes that are each other's nearest neighbors. In this technique, the majority class is down-sampled by removing the majority-class instance of each Tomek link. This removes noisy and borderline instances, but it usually removes only a few samples, so it cleans the class boundary rather than fully balancing the dataset.

4. `Synthetic Minority Over-sampling Technique (SMOTE)`: SMOTE is an oversampling technique rather than a down-sampling one. It selects a minority class data point and creates synthetic samples along the line segments that connect it to its nearest minority class neighbors. It can be combined with down-sampling so that fewer majority class samples have to be discarded.

In practice, random undersampling is the most direct way to reach a chosen class ratio, while cluster centroids and Tomek links can be used instead of it or alongside it. Whichever method is chosen, it should be applied to the training data only, so that the model is evaluated on the original class distribution.

`Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a
project that requires you to estimate the occurrence of a rare event. What methods can you employ to
balance the dataset and up-sample the minority class?`

To balance an unbalanced dataset and up-sample the minority class, we can use the following techniques:

1. `Random oversampling`: Randomly duplicating samples from the minority class until the dataset is balanced. This technique is simple to implement, but because it only repeats existing samples it adds no new information and can result in overfitting.

2. `SMOTE (Synthetic Minority Over-sampling Technique)`: SMOTE is a popular oversampling technique that generates synthetic data points from the minority class. The technique works by selecting a minority class data point and creating synthetic samples along the line segments that connect it to its nearest minority class neighbors. SMOTE can be used to balance the dataset by oversampling the minority class.

3. `ADASYN (Adaptive Synthetic Sampling)`: ADASYN is an extension of SMOTE that generates more synthetic data points for the minority class samples that are harder to learn, that is, those surrounded by many majority class samples. This can help when the classes overlap, but it can also amplify noisy samples.

4. `Synthetic minority augmentation (SMA)`: Used here to mean noise-based oversampling, which generates synthetic data points by adding small random noise to existing minority class samples (a smoothed bootstrap). Whether it works better than SMOTE depends on the dataset.

Whichever technique is chosen, the up-sampling should be applied to the training data only, after the train/test split, so that copies or synthetic variants of the same observation do not end up in both sets and the model is evaluated on the original class distribution.