## Feature Engineering 1

### Question 1
Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some
algorithms that are not affected by missing values.


__Answer__

Missing values in a dataset refer to instances where no data is stored for a particular observation or feature. 

__Why handle them?__

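It is essential to handle missing values because many machine learning implementations cannot process them at all, and those that can may still produce biased or misleading results if the missingness is ignored. Missing values also reduce the effective sample size available for training.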

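Some algorithms that can tolerate missing values include:

1. Decision Trees and tree ensembles: several implementations handle missing values natively (for example CART with surrogate splits, XGBoost, and LightGBM).
2. Naive Bayes: a missing feature can simply be skipped when the class probabilities are computed.

K-Nearest Neighbor is sometimes listed here as well, but it relies on distance calculations that are undefined when a feature is missing, so it needs imputation or a NaN-aware distance first. Whether missing values are accepted in practice depends on the specific library implementation.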

### Question 2

Q2: List down techniques used to handle missing data. Give an example of each with python code.



__Answer__

Several techniques can be used to handle missing data, including:

__Deletion method:__

This method involves deleting the rows or columns that contain missing values. Two common variants are listwise deletion and pairwise deletion.

1. __Listwise deletion:__

In this method, every row that contains at least one missing value is removed from the dataset.

2. __Pairwise deletion:__

In this method, no row is removed up front. Instead, each calculation (for example, the correlation between two columns) uses every row in which the variables involved in that calculation are present, so a row with a missing value in one column still contributes to analyses that do not use that column.
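
The difference between the two deletion variants can be sketched on a small toy table. The values below are hypothetical, and the example assumes the default behaviour of pandas `DataFrame.corr()`, which excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with gaps in different columns
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0],
    "b": [2.0, np.nan, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, np.nan, 1.0],
})

# Listwise deletion: keep only rows with no missing value at all
listwise = toy.dropna()
listwise_rows = len(listwise)

# Pairwise deletion: for each pair of columns, use every row where
# both columns are present. DataFrame.corr() does this by default.
pairwise_corr = toy.corr()

# Number of rows actually available for each pair of columns
present = toy.notna().astype(int)
pairwise_counts = present.T @ present

print(listwise_rows)
print(pairwise_counts)
```

Here listwise deletion leaves only two complete rows, while each pairwise correlation is still computed from three rows.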


__Imputation method:__

This method involves filling in the missing values with some substitute value. There are several ways to do this:

1. Mean imputation:
In this method, missing values are replaced with the mean of the column.

2. Median imputation:
In this method, missing values are replaced with the median of the column.

3. Mode imputation:
In this method, missing values are replaced with the mode (most frequent value) of the column.

4. Interpolation:
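In this method, missing values are estimated from neighbouring observations, for example by linear interpolation between the previous and next known values. It is most appropriate for ordered data such as time series.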


__Prediction method:__

In this method, the missing values are predicted based on other variables in the dataset using regression or classification models.


In [37]:
## listwise deletion
import pandas as pd
import seaborn as sns

df = sns.load_dataset("titanic")
# Dropping rows with missing values
df = df.dropna()
# df.head()
print(df.isnull().sum())
print(df.shape)

survived       0
pclass         0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
deck           0
embark_town    0
alive          0
alone          0
dtype: int64
(182, 15)


In [38]:
### column-wise deletion: drop every column that contains a missing value
### (note: this is not pairwise deletion, which keeps all columns)
import pandas as pd
import seaborn as sns


df = sns.load_dataset("titanic")
# Dropping missing values column-wise
df = df.dropna(axis=1)
# df.head()
print(df.isnull().sum())
print(df.shape)

survived      0
pclass        0
sex           0
sibsp         0
parch         0
fare          0
class         0
who           0
adult_male    0
alive         0
alone         0
dtype: int64
(891, 11)


In [67]:
### Mean imputation

import seaborn as sns

df = sns.load_dataset("titanic")

# # list(df) ## to list column names
# df.columns
# df.isnull().sum()

df["Age_meanInput"] = df["age"].fillna(df["age"].mean())


print(df.isnull().sum())
print(df.shape)
### note: the new Age_meanInput column has no missing values; the original age column is left unchanged

survived           0
pclass             0
sex                0
age              177
sibsp              0
parch              0
fare               0
embarked           2
class              0
who                0
adult_male         0
deck             688
embark_town        2
alive              0
alone              0
Age_meanInput      0
dtype: int64
(891, 16)


In [68]:
## median imputation

import seaborn as sns

df = sns.load_dataset("titanic")

# # list(df) ## to list column names
# df.columns
# df.isnull().sum()

df["Age_medianInput"] = df["age"].fillna(df["age"].median())

print(df.isnull().sum())
print(df.shape)
### note: the new Age_medianInput column has no missing values; the original age column is left unchanged

survived             0
pclass               0
sex                  0
age                177
sibsp                0
parch                0
fare                 0
embarked             2
class                0
who                  0
adult_male           0
deck               688
embark_town          2
alive                0
alone                0
Age_medianInput      0
dtype: int64
(891, 16)


In [73]:
## mode imputation, mostly used for categorical variables


import seaborn as sns

df = sns.load_dataset("titanic")


df["embark_modeInput"] = df["embark_town"].fillna(df["embark_town"].mode().iloc[0])

print(df.isnull().sum())
print(df.shape)

### Note: all missing embark_town values are filled with the mode, i.e. the most frequently occurring category


survived              0
pclass                0
sex                   0
age                 177
sibsp                 0
parch                 0
fare                  0
embarked              2
class                 0
who                   0
adult_male            0
deck                688
embark_town           2
alive                 0
alone                 0
embark_modeInput      0
dtype: int64
(891, 16)


In [98]:
# Linear interpolation
import seaborn as sns
import numpy as np

df = sns.load_dataset("titanic")

# # list(df) ## to list column names
# df.columns
# df.isnull().sum()

df["age_interp"] = df['age'].interpolate()
# list(df)


print(df.isnull().sum())
print(df.shape)

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
age_interp       0
dtype: int64
(891, 16)


In [133]:
### using regression to predict and fill na
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression

df = sns.load_dataset('titanic')

### keep only the numeric columns (drops categorical and boolean columns)
df = df.select_dtypes(include=np.number)

# Rows with a known age are used for training; rows with a missing age are the ones to predict
age_missing = df['age'].isnull()
X_train = df.loc[~age_missing].drop('age', axis=1)
y_train = df.loc[~age_missing, 'age']
X_missing = df.loc[age_missing].drop('age', axis=1)
# Training a linear regression model
model = LinearRegression().fit(X_train, y_train)
# Predicting the missing ages and writing them back
df.loc[age_missing, 'age'] = model.predict(X_missing)

# Note: only the numeric variables are used, and age is the only one of them with missing values

print(df.isnull().sum())
print(df.shape)

survived    0
pclass      0
age         0
sibsp       0
parch       0
fare        0
dtype: int64
(891, 6)


### Question 3

Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

__Answer__

Imbalanced data refers to a situation where the distribution of classes in a dataset is not equal. In other words, one class may have significantly more instances than another.

__Problems with not handling imbalanced data__

Imbalanced data can pose a challenge for machine learning algorithms as they tend to perform better on balanced datasets. This is because they are often optimized to minimize the overall error, which means they may prioritize the majority class at the expense of the minority class. As a result, the performance of the algorithm may be biased towards the majority class and the minority class may be incorrectly classified more often.
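
A minimal sketch of this effect, using hypothetical labels (950 negatives and 50 positives) and a stand-in "model" that always predicts the majority class. It is not a real classifier, only an illustration of why overall error alone can hide poor minority-class performance.

```python
# Hypothetical labels: 950 negatives (0) and 50 positives (1)
y_true = [0] * 950 + [1] * 50

# A stand-in "model" that always predicts the majority class
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # 0.95
print(f"recall   = {recall:.2f}")    # 0.00
```

The accuracy looks high even though not a single minority-class instance is identified.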

Dealing with imbalanced data is an important problem in machine learning, and there are several techniques that can be used to address it, such as:

1. undersampling
2. oversampling
3. using specialized algorithms designed to handle imbalanced data


### Question 4

Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

__Answer__



__Up-sampling__

Up-sampling involves increasing the number of instances in the minority class by either duplicating existing instances or generating new ones through techniques such as bootstrapping or data synthesis. This is done in order to balance the number of instances in each class, and thus improve the performance of machine learning algorithms that may be biased towards the majority class.

__Example__

In a dataset of 1000 transactions, only 50 of which are fraudulent, up-sampling may involve duplicating or generating new fraudulent transactions until their number reaches a desired level, say 950, matching the 950 non-fraudulent transactions. This balances the dataset and can improve how well the model learns the minority class.

__Down-sampling__

Down-sampling, on the other hand, involves reducing the number of instances in the majority class. This also balances the number of instances in each class, at the cost of discarding potentially useful majority-class data.


__Example__

In a dataset of 1000 transactions, only 50 of which are fraudulent, down-sampling may involve randomly selecting 50 of the 950 non-fraudulent transactions to match the number of fraudulent transactions. This balances the dataset and can improve performance on the minority class.

__Summary__

In general, up-sampling is used when the minority class has too few instances to provide sufficient information to the machine learning algorithm, while down-sampling is used when the majority class has too many instances and may dominate the training process. However, the choice between up-sampling and down-sampling depends on the specific problem and the performance of the machine learning algorithm on the dataset.
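
Both operations can be sketched with pandas on the fraud example above. The table, the column names `amount` and `is_fraud`, and the random seed are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical transactions: 950 legitimate (0) and 50 fraudulent (1)
transactions = pd.DataFrame({
    "amount": range(1000),
    "is_fraud": [0] * 950 + [1] * 50,
})
majority = transactions[transactions["is_fraud"] == 0]
minority = transactions[transactions["is_fraud"] == 1]

# Up-sampling: draw minority rows with replacement until they match the majority
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
upsampled = pd.concat([majority, minority_up])

# Down-sampling: draw a subset of majority rows without replacement
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["is_fraud"].value_counts().to_dict())
print(downsampled["is_fraud"].value_counts().to_dict())
```

In practice, resampling is usually applied to the training split only, so that the evaluation data keeps the original class distribution.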


### Question 5
Q5: What is data Augmentation? Explain SMOTE.


__Answer__

Data augmentation is a technique used in machine learning to increase the size of a dataset by creating new samples that are similar to the existing ones. This is typically done by applying random transformations to the existing data, such as:

1. flipping or rotating images
2. shifting images
3. adding noise or small perturbations to numerical data


SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique for imbalanced datasets. It creates new samples in the minority class by interpolating between existing samples of that class.

Here's how SMOTE works:

1. Choose a minority class instance randomly
2. Find k nearest minority class instances to the chosen instance (typically k=5)
3. Select one of these k instances randomly and create a new sample by interpolating between the chosen instance and the selected instance
4. Repeat steps 1-3 until the desired number of new samples has been generated


By interpolating between existing samples in the same class, SMOTE can generate new samples that are similar to the existing ones, but with some degree of variation that can help to improve the performance of machine learning algorithms on imbalanced datasets.
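
The four steps above can be sketched in NumPy as follows. This is a simplified illustration, not the reference implementation: the function name `smote_sketch`, its parameters, and the sample points are all hypothetical, and a library such as imbalanced-learn would normally be used instead.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from minority samples X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Pairwise Euclidean distances between minority samples
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)                  # step 1: random minority point
        nb = neighbours[base, rng.integers(k)]  # steps 2-3: one of its k neighbours
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

# Hypothetical minority-class points with two features
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                       [1.1, 1.3], [0.9, 0.7], [1.3, 1.0]])
new_points = smote_sketch(X_minority, n_new=10)
print(new_points.shape)
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region already covered by the minority class.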

__Example__

For example, suppose we have a dataset of 1000 transactions, of which only 50 are fraudulent. We can use SMOTE to generate new synthetic fraudulent transactions by interpolating between the existing fraudulent transactions. This would increase the size of the dataset and improve the performance of machine learning algorithms that may be biased towards the majority class.



### Question 6

Q6: What are outliers in a dataset? Why is it essential to handle outliers?


__Answer__


Outliers are data points in a dataset that are significantly different from the majority of the other data points. These can be caused by measurement or recording errors, or they can be genuine data points that represent extreme values in the underlying distribution of the data.

It is essential to handle outliers in data analysis and machine learning because they can have a significant impact on the results of the analysis or the performance of the machine learning algorithm.

The main reasons are:

1. Outliers can skew the distribution of the data: Outliers can distort the distribution of the data, making it difficult to understand the underlying patterns and relationships in the data. This can lead to inaccurate conclusions and predictions.

2. Outliers can influence statistical analysis: Outliers can have a significant impact on statistical measures such as mean, standard deviation, and correlation coefficient. This can lead to biased results and inaccurate conclusions.

3. Outliers can negatively impact machine learning algorithms: Outliers can have a disproportionate impact on the performance of machine learning algorithms, particularly those that are sensitive to the scale of the data or to the presence of extreme values. This can lead to overfitting or underfitting of the model, resulting in poor performance on new data.

Handling outliers typically involves identifying them and then removing them from the dataset, although in some cases it may be more appropriate to transform them or treat them as a separate group in the analysis. They can be identified using a variety of techniques, such as:

1. visual inspection
2. statistical tests
3. machine learning algorithms designed to detect and handle outliers.

Overall, handling outliers is an essential part of data analysis and machine learning, as it can help to ensure the accuracy and reliability of the results, and improve the performance of the models on new data.
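
One common statistical rule for flagging outliers is the interquartile range (IQR) rule, sketched below. The values are hypothetical, and the 1.5 multiplier is the conventional choice rather than something prescribed by this document.

```python
import numpy as np

# Hypothetical measurements with one extreme value
values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95], dtype=float)

# Fences at 1.5 * IQR below the first and above the third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

# Two common treatments: drop the flagged points, or cap them at the fences
cleaned = values[~is_outlier]
capped = np.clip(values, lower, upper)

print(outliers)
print(lower, upper)
```

Whether to drop or cap a flagged value depends on whether it is a recording error or a genuine extreme observation.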

### Question 7
Q7: You are working on a project that requires analyzing customer data. However, you notice that some of
the data is missing. What are some techniques you can use to handle the missing data in your analysis?


__Answer__

I would consider the following techniques (each is explained with examples in the answers above):

1. Delete the missing data (rows or columns)
2. Impute the missing data (mean, median, mode, or interpolation)
3. Predict the missing values with a machine learning model (e.g. linear regression)
4. Use a specialised package: several packages in Python and R provide ready-made imputation techniques, for example the `sklearn.impute` module in scikit-learn or the `mice` package in R

However, it's important to carefully consider which approach to use based on the nature of the data and the specific analysis being performed. It's also important to carefully document any missing data and the approach used to handle it, as this can affect the validity and reproducibility of the analysis.
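
As an illustration of the package-based option, the sketch below uses scikit-learn's `SimpleImputer` and `KNNImputer`. The customer features and their values are hypothetical, and the choice of `n_neighbors=2` is arbitrary.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical customer features: [age, monthly_spend]
X = np.array([
    [25.0, 120.0],
    [32.0, np.nan],
    [np.nan, 200.0],
    [41.0, 310.0],
    [38.0, 280.0],
])

# Replace each missing entry with the median of its column
median_filled = SimpleImputer(strategy="median").fit_transform(X)

# Replace each missing entry with the average of that feature over
# the 2 nearest rows (distances are computed on the observed features)
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(median_filled)
print(knn_filled)
```

Fitting the imputer on the training data and reusing it on new data keeps the fill values consistent between training and prediction.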

### Question 8

Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

__Answer__

I would use the following methods:

1. Visual inspection: One approach is to visually inspect the data to identify any patterns or trends in the missing data. For example, you could create a heatmap or correlation plot to identify if there are any systematic patterns in the missing values.

2. Statistical tests: Statistical tests can help determine the missingness mechanism: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). A common choice is Little's MCAR test, whose null hypothesis is that the data are missing completely at random; rejecting it indicates that the missingness is related to other variables in the dataset.

3. Imputation and comparison: Another approach is to impute the missing data with several different techniques and compare the results. If the conclusions are similar across techniques, the analysis is not very sensitive to how the missing data is handled, which is consistent with (but does not prove) the data being MCAR or MAR.

4. Correlation analysis: You can perform a correlation analysis to see if there are any relationships between the missing values and other variables in the dataset.

5. Expert knowledge: In some cases, it may be helpful to consult with subject matter experts or individuals who have a deeper understanding of the data to help identify any patterns or trends in the missing data.

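Overall, it's important to carefully consider the nature of the missing data and to use multiple strategies to judge whether it is MCAR, MAR, or MNAR. Because MNAR depends on the unobserved values themselves, it cannot be confirmed from the observed data alone, which is why expert knowledge matters. This helps ensure that appropriate techniques are used to handle the missing data and that the results of the analysis are valid and reliable.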
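
A simple way to start looking for a pattern is to turn missingness into an indicator column and compare it with the observed variables. The sketch below uses hypothetical customer data in which `income` is deliberately missing more often for younger customers.

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing more often for younger customers
customers = pd.DataFrame({
    "age":    [22, 25, 27, 30, 35, 41, 48, 52, 60, 65],
    "income": [np.nan, np.nan, 31.0, np.nan, 45.0, 52.0, 58.0, 61.0, 70.0, 66.0],
})

# 1. Share of missing values per column
missing_share = customers.isna().mean()

# 2. Indicator column: 1 where income is missing, 0 otherwise
customers["income_missing"] = customers["income"].isna().astype(int)

# 3. Compare an observed variable across the two groups
age_by_missing = customers.groupby("income_missing")["age"].mean()

# 4. Correlation between the indicator and the observed variable
corr_with_age = customers["income_missing"].corr(customers["age"])

print(missing_share)
print(age_by_missing)
print(round(corr_with_age, 2))
```

A clear difference between the groups, as here, suggests the data is not missing completely at random. It does not by itself distinguish MAR from MNAR.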



### Question 9

Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

__Answer__

Here are some strategies to evaluate the performance of your machine learning model on an imbalanced dataset:

1. Confusion matrix: A confusion matrix is a useful tool to evaluate the performance of a binary classification model on an imbalanced dataset. It displays the number of true positives, true negatives, false positives, and false negatives. From the confusion matrix, you can calculate metrics such as precision, recall, and F1-score. Accuracy can also be calculated, but it is misleading on imbalanced data because always predicting the majority class already scores highly.

2. ROC curve: A Receiver Operating Characteristic (ROC) curve is another useful tool for evaluating the performance of a binary classification model on an imbalanced dataset. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The area under the ROC curve (AUC-ROC) is a commonly used metric for evaluating the model's performance. A higher AUC-ROC value indicates better model performance.

3. Precision-Recall curve: A precision-recall (PR) curve is similar to the ROC curve, but it plots precision against recall at different classification thresholds. The PR curve is especially useful when working with imbalanced datasets because it focuses on the positive class. The area under the PR curve (AUC-PR) is a commonly used metric for evaluating the model's performance. A higher AUC-PR value indicates better model performance.

4. Resampling techniques: Resampling techniques such as oversampling the minority class, undersampling the majority class, or generating synthetic samples using techniques like SMOTE can be used to balance the training data. This can improve the performance of the model on the minority class. The test set should keep the original class distribution so that the evaluation reflects real-world performance.

5. Cost-sensitive learning: Cost-sensitive learning involves assigning different costs to misclassifications of different classes. This can be especially useful when the cost of misclassifying the minority class is higher than the cost of misclassifying the majority class.

Overall, evaluating the performance of a machine learning model on an imbalanced dataset requires a combination of techniques. It's important to consider the context of the problem and the relative costs of misclassification when selecting the most appropriate evaluation metric and approach.
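
The metrics derived from a confusion matrix can be computed directly from its four counts. The counts below are hypothetical: 1000 patients, 50 of whom have the condition.

```python
# Hypothetical confusion-matrix counts for 1000 patients, 50 with the condition
tp, fn = 30, 20   # patients with the condition: detected / missed
fp, tn = 40, 910  # patients without the condition: false alarms / correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)          # also called sensitivity or true positive rate
specificity = tn / (tn + fp)     # true negative rate
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy    = {accuracy:.3f}")
print(f"precision   = {precision:.3f}")
print(f"recall      = {recall:.3f}")
print(f"specificity = {specificity:.3f}")
print(f"f1          = {f1:.3f}")
```

With these counts the accuracy is 0.94 while the F1-score is only 0.5, which shows how accuracy can overstate performance on the minority class.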

### Question 10

Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is
unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to
balance the dataset and down-sample the majority class?


__Answer__

Some of the methods I would employ for down-sampling the majority class include:

1. Random under-sampling: This involves randomly selecting observations from the majority class and removing them from the dataset until the desired balance is achieved.

2. Tomek links: This method involves identifying pairs of observations from different classes that are close to each other (called Tomek links) and removing the observation from the majority class.

3. Neighborhood cleaning rule: This method involves removing observations from the majority class that are misclassified by a k-nearest neighbors classifier.

4. Cluster centroids: This method involves using a clustering algorithm to group observations from the majority class into clusters and then replacing each cluster with its centroid.

__Summary__

By using these methods, you can down-sample the majority class and balance the dataset to improve the performance of your machine learning model.
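
As an illustration of one of these methods, the sketch below finds Tomek links with plain NumPy and removes the majority-class member of each link. The helper name `tomek_link_majority_indices` and the one-dimensional data are hypothetical; libraries such as imbalanced-learn provide ready-made versions of these samplers.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority_label):
    """Return indices of majority-class points that are part of a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a point is never its own neighbour
    dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)

    to_remove = []
    for i, j in enumerate(nearest):
        mutual = nearest[j] == i  # i and j are each other's nearest neighbour
        if mutual and y[i] != y[j] and y[i] == majority_label:
            to_remove.append(i)
    return to_remove

# Hypothetical 1-D feature: satisfied (1) is the majority class
X = np.array([[0.0], [1.0], [2.0], [3.0], [3.2], [6.0]])
y = np.array([1, 1, 1, 1, 0, 0])

drop = tomek_link_majority_indices(X, y, majority_label=1)
X_clean = np.delete(X, drop, axis=0)
y_clean = np.delete(y, drop)

print(drop)
print(y_clean)
```

Only the majority point at 3.0 is removed, because it and the minority point at 3.2 are each other's nearest neighbours. Tomek links therefore clean the class boundary rather than fully balancing the classes, so they are often combined with random under-sampling.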

### Question 11

Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?


__Answer__

1. Oversampling: In oversampling, the minority class is up-sampled to balance the dataset. One way to oversample is to duplicate the minority class samples to match the size of the majority class. However, this can lead to overfitting and poor generalization. Another way is to use synthetic data generation techniques like SMOTE to create new minority class samples.

2. Hybrid methods: Hybrid methods combine both undersampling and oversampling techniques to balance the dataset. One approach is to use oversampling to generate synthetic minority class samples and then undersample the majority class to balance the dataset.

3. Ensemble methods: Ensemble methods combine multiple classifiers to improve the performance on imbalanced datasets. One approach is to use boosting, where multiple classifiers are trained on the original and synthetic samples from the minority class to improve their predictive power.

4. Cost-sensitive learning: Cost-sensitive learning assigns different misclassification costs to different classes, depending on their relative importance. This can help improve the performance of the classifier on the rare event by increasing the cost of misclassifying it.

5. Anomaly detection: Anomaly detection can be used to identify and remove outliers from the majority class. This can help improve the performance of the classifier by reducing the noise and improving the signal-to-noise ratio in the data.


Overall, the choice of method to balance the dataset and up-sample the minority class depends on the size of the dataset, the imbalance ratio, and the desired performance of the classifier. 

It's important to evaluate the performance of the classifier using appropriate metrics, such as precision, recall, F1-score, and AUC-ROC/AUC-PR, to ensure that the classifier is effective in estimating the occurrence of the rare event.
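
For the cost-sensitive option, one common heuristic is to weight each class inversely to its frequency. The sketch below uses hypothetical labels and the formula `n_samples / (n_classes * class_count)`, which is the one scikit-learn documents for `class_weight="balanced"`.

```python
from collections import Counter

# Hypothetical labels: 0 = event did not occur, 1 = rare event
labels = [0] * 950 + [1] * 50

counts = Counter(labels)
n_samples = len(labels)
n_classes = len(counts)

# "Balanced" weighting: weight = n_samples / (n_classes * class_count),
# so the rarer a class is, the more each of its errors costs
class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(class_weights)
```

With these labels, each rare-event sample is weighted 19 times more heavily than a majority-class sample, without adding or removing any rows.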

### The End