## Feature Engineering 1
**By Shahequa Modabbera**

### Q1: What are missing values in a dataset? Why is it essential to handle missing values? Name some algorithms that are not affected by missing values.

`Ans) Missing values are values that are not present for one or more variables in a dataset. They can occur for a variety of reasons, such as data entry errors, equipment malfunctions, or survey non-response.`

`Handling missing values is essential because they can affect the quality of the analysis and modeling process. Ignoring or mishandling missing values can lead to biased or inaccurate results, reduced statistical power, and increased variability. It can also affect the validity of statistical inferences and predictions.`

`The following algorithms can tolerate missing values, although the degree of support depends on the implementation:`

1. Decision Trees: Some decision tree implementations handle missing values directly, for example through surrogate splits or by sending missing values down a designated branch. Trees work by recursively splitting the data on the available features and choosing the best split according to a criterion such as the Gini index or entropy.

2. Random Forest: Random Forest is an ensemble learning algorithm that combines multiple decision trees, so it inherits their tolerance of missing values when the underlying tree implementation supports it.

3. Gradient Boosting: Gradient Boosting is another ensemble learning algorithm built from decision trees. Libraries such as XGBoost and LightGBM handle missing values natively by learning which branch missing values should follow at each split.

4. K-Nearest Neighbors (KNN): KNN is a non-parametric algorithm that finds the K nearest data points to a given test point and assigns the majority label of those K points. Standard KNN cannot compute distances across missing entries, so it is usually combined with imputation (for example, the mean or median of the feature) or with a distance measure that skips missing coordinates.

5. Naive Bayes: Naive Bayes is a probabilistic algorithm that uses Bayes' theorem to calculate the probability of each class given the available features. Because it treats features as conditionally independent, a missing feature can in principle be left out of the calculation for that instance; in practice, many implementations still expect the values to be imputed first.

`In conclusion, missing values are an important issue in data science and machine learning. The algorithms above can work with datasets that contain missing values, but missing values still need to be handled carefully, because they can significantly affect the results and performance of the models.`

### Q2: List down techniques used to handle missing data. Give an example of each with python code.

`Ans) Some techniques used to handle missing data and an example of each with Python code are as follows:`

1. Deletion: One simple way to handle missing data is to delete the rows or columns that contain missing values. This is easy to do, but it also discards the other information held in those rows or columns.

Example:

```python
import pandas as pd

# read the data
data = pd.read_csv('data.csv')

# delete rows with missing values
data.dropna(axis=0, inplace=True)
```

2. Mean/Median/Mode Imputation: This technique replaces the missing values with the mean, median, or mode of the available data in the same column. The mean and median apply to numeric columns, while the mode also works for categorical columns. This method is easy to implement and does not require complex calculations.

Example:

```python
import pandas as pd

# read the data
data = pd.read_csv('data.csv')

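# replace missing values in numeric columns with each column's mean
# (numeric_only=True avoids an error when the file also has text columns)
data.fillna(data.mean(numeric_only=True), inplace=True)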
```
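The same pattern extends to the median and the mode. The sketch below uses a small in-memory DataFrame instead of a CSV file; the column names and values are hypothetical and chosen only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0, 120.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Median imputation for a numeric column (less sensitive to the 120 than the mean).
df["age"] = df["age"].fillna(df["age"].median())

# Mode imputation for a categorical column; mode() returns a Series, so take the first entry.
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```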

3. Interpolation: This technique involves estimating the missing values by filling in the gaps between existing values. It works well for time series data and can be done using different methods such as linear interpolation or cubic interpolation.

Example:
```python
import pandas as pd

# read the data
data = pd.read_csv('data.csv')

# interpolate missing values using linear method
data.interpolate(method='linear', inplace=True)
```

4. K-Nearest Neighbors: This technique uses the K-Nearest Neighbors algorithm to estimate missing values. For each row with a missing value, it finds the K rows that are most similar on the observed features and fills the gap with the average or distance-weighted average of their values.

Example:
```python
import pandas as pd
from sklearn.impute import KNNImputer

# read the data
data = pd.read_csv('data.csv')

# initialize the KNN imputer
imputer = KNNImputer(n_neighbors=5)

# fit and transform the data (KNNImputer expects every column to be numeric)
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```

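5. Multiple Imputation: This technique creates several plausible imputations of the missing values, analyzes each completed dataset, and then combines the results. It can be done using methods such as Markov Chain Monte Carlo or other Bayesian approaches. The example below uses scikit-learn's IterativeImputer, which models each feature from the others; with its default settings it returns a single completed dataset, so a full multiple-imputation workflow would repeat it with sample_posterior=True and different random seeds.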

Example:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# read the data
data = pd.read_csv('data.csv')

# initialize the iterative imputer
imputer = IterativeImputer()

# fit and transform the data
data = pd.DataFrame(imputer.fit_transform(data), columns=data.columns)
```
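To get several imputations rather than one, a possible approach is to run the imputer repeatedly with posterior sampling switched on. The following is a minimal sketch under that assumption; the toy data, the number of imputations, and the pooled statistic are all made up for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy numeric data (made up): the second column is roughly twice the first.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)
X = np.column_stack([x1, x2])
X[::10, 1] = np.nan  # hide every tenth value of the second column

# Draw several completed datasets; sample_posterior=True makes each draw random.
n_imputations = 5
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(n_imputations)
]

# Pool a statistic across the completed datasets (here: the mean of column 2).
estimates = [data[:, 1].mean() for data in completed]
pooled_mean = float(np.mean(estimates))
print(pooled_mean)
```

The spread of the individual estimates gives a rough sense of how much uncertainty the missing values add, which a single imputation hides.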

`These are some of the common techniques used to handle missing data in a dataset. It's important to choose the appropriate technique based on the nature and characteristics of the dataset.`

### Q3: Explain the imbalanced data. What will happen if imbalanced data is not handled?

`Ans) Imbalanced data refers to a situation where the distribution of classes in the target variable is not uniform, resulting in one class having significantly more or fewer instances than the other class(es). For example, in a binary classification problem where the positive class represents only 10% of the data, while the negative class represents 90%, it is an imbalanced dataset.`

`Not handling imbalanced data can lead to several issues. First, the model trained on such data will be biased towards the majority class, resulting in poor performance on the minority class. For example, if the positive class represents a rare disease, an unbalanced dataset can lead to false negatives, where the model fails to identify the disease in patients who actually have it. Secondly, the evaluation metrics can be misleading. For example, accuracy can be high, but it will not provide a true picture of the model's performance since it will be dominated by the majority class.`

`Let’s assume we are going to predict disease from an existing dataset where for every 100 records only 5 patients are diagnosed with the disease. So, the majority class is 95% with no disease and the minority class is only 5% with the disease. Now, assume our model predicts that all 100 out of 100 patients have no disease.`

`When one class has far more records than the other, the classifier can become biased towards predicting the majority class. The confusion matrix shows how the model's predictions are distributed across the target classes, and accuracy is derived from it as the number of correct predictions divided by the total number of predictions. In the case above, accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 95) / (0 + 95 + 0 + 5) = 0.95, or 95%. The model fails to identify a single patient with the disease, yet its accuracy score is 95%.`
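The arithmetic above can be reproduced with scikit-learn's metric functions. This sketch simply encodes the 95/5 example as label lists; computing recall alongside accuracy is an addition here, to show what accuracy hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 95 patients without the disease (0) and 5 with it (1), as in the example above.
y_true = [0] * 95 + [1] * 5
# A model that predicts "no disease" for everyone.
y_pred = [0] * 100

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)  # TP / (TP + FN)
print(tn, fp, fn, tp)    # 95 0 5 0
print(accuracy, recall)  # 0.95 0.0
```

A recall of 0 on the disease class makes the failure visible even though accuracy looks strong.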

### Q4: What are Up-sampling and Down-sampling? Explain with an example when up-sampling and down-sampling are required.

`Ans) Upsampling and downsampling are two techniques used to balance imbalanced datasets.`

`Upsampling refers to the process of increasing the number of instances in the minority class to balance the dataset. This can be achieved by duplicating existing instances in the minority class or by generating new synthetic instances of the minority class. The goal is to ensure that the number of instances in both classes is approximately equal.`

`Downsampling, on the other hand, refers to the process of reducing the number of instances in the majority class to balance the dataset. This can be achieved by randomly selecting a subset of instances from the majority class. The goal is to ensure that the number of instances in both classes is approximately equal.`

`The choice between upsampling and downsampling depends on the specific problem and the dataset. Upsampling is generally preferred when the dataset is small or the minority class has very few instances, so that discarding majority-class data would lose too much information. For example, in a credit card fraud detection problem, the number of fraudulent transactions is usually very low compared to the number of legitimate transactions. In this case, upsampling the fraudulent transactions can help improve the model's ability to detect fraud.`

`Downsampling is generally preferred when the majority class is large enough that dropping part of it still leaves plenty of training data; it also reduces training time. For example, in a medical diagnosis problem, the number of healthy patients may be much higher than the number of patients with a particular disease. In this case, downsampling the healthy patients can help the model to be less biased towards the majority class and provide better performance on the minority class.`

`Both techniques have their advantages and disadvantages, and the choice depends on the specific problem and the dataset. It is recommended to try both techniques and see which one works better for the given problem.`
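Both techniques can be sketched with scikit-learn's resample utility. The DataFrame below is a made-up 90/10 example, and the column names are hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data (made up): 90 majority rows (label 0), 10 minority rows (label 1).
df = pd.DataFrame({"feature": range(100), "label": [0] * 90 + [1] * 10})
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Upsampling: draw minority rows with replacement until they match the majority count.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=42)
upsampled = pd.concat([majority, minority_up])

# Downsampling: draw majority rows without replacement down to the minority count.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
downsampled = pd.concat([majority_down, minority])

print(upsampled["label"].value_counts().to_dict())
print(downsampled["label"].value_counts().to_dict())
```

The upsampled frame has 180 rows and the downsampled one has 20, which shows the trade-off: upsampling repeats minority rows, while downsampling throws most of the majority rows away.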

### Q5: What is data Augmentation? Explain SMOTE.

`Ans) Data augmentation is a technique used to artificially increase the size of a dataset by creating new examples from the original data. This is done by applying a series of transformations to the original data, such as flipping, rotating, cropping, or adding noise, to generate new data points that are similar but not identical to the original data. The goal of data augmentation is to increase the diversity of the dataset and improve the model's ability to generalize to new, unseen data.`

`SMOTE (Synthetic Minority Over-sampling Technique) is a popular data augmentation technique used to address the issue of imbalanced datasets. SMOTE works by generating synthetic examples of the minority class by interpolating between existing minority class instances. Specifically, SMOTE selects a random minority class instance and its k nearest minority class neighbors. It then generates new synthetic instances along the line segments that connect the selected instance to its neighbors. The number of new synthetic instances generated is determined by the desired level of oversampling.`

`For example, let's say we have a dataset of credit card transactions with a highly imbalanced class distribution, where only 1% of the transactions are fraudulent. Using SMOTE, we can generate new synthetic instances of the fraudulent transactions to increase their representation in the dataset. Suppose we select a fraudulent transaction and its two nearest fraudulent transaction neighbors. We can then generate new synthetic instances by interpolating along the line segments connecting the three instances. The resulting synthetic instances will have feature values that are similar to the original instances but with some variation.`

`SMOTE is a powerful technique that can help improve the performance of models on imbalanced datasets. However, it is important to use it carefully, as generating too many synthetic instances can lead to overfitting and poor generalization performance on new, unseen data. It is also important to apply SMOTE to the training data only and to evaluate the model on a separate validation set that has not been resampled, so that the reported performance reflects the real class distribution.`
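The interpolation step described above can be written out by hand. The function below is a simplified sketch of the idea, not the imblearn implementation: the name smote_sketch, the random choice of base instances, and the toy points are all assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_synthetic, k=2, seed=0):
    """Generate synthetic minority points by interpolating towards a random neighbour."""
    rng = np.random.default_rng(seed)
    # kneighbors() with no argument excludes each point from its own neighbour list.
    neighbours = NearestNeighbors(n_neighbors=k).fit(X_minority).kneighbors(return_distance=False)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_minority))  # pick a minority instance
        j = rng.choice(neighbours[i])      # pick one of its k nearest minority neighbours
        gap = rng.random()                 # position along the segment, in [0, 1)
        synthetic.append(X_minority[i] + gap * (X_minority[j] - X_minority[i]))
    return np.array(synthetic)

# Hypothetical minority-class points with two features.
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8]])
X_new = smote_sketch(X_min, n_synthetic=6)
print(X_new.shape)
```

Each synthetic point lies on the line segment between two real minority points, which is why the new instances resemble the originals without duplicating them.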

### Q6: What are outliers in a dataset? Why is it essential to handle outliers?

`Ans) Outliers are data points that differ significantly from other observations in the dataset. In other words, they are data points that fall far outside the range of other values in the dataset. These data points can be caused by measurement or data entry errors, or they may represent genuine observations that are very different from other data points in the dataset.`

`It is essential to handle outliers because they can have a significant impact on the results of data analysis. Outliers can skew summary statistics, such as the mean and standard deviation, and can also affect the results of machine learning algorithms. If outliers are not handled, they can lead to incorrect conclusions being drawn from the data and can reduce the accuracy of predictive models.`

![glod6txi.bmp](attachment:0d5dd378-1d71-4690-b2e3-6d59ec99198c.bmp)
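A small numeric example makes the effect on summary statistics concrete. The values below are hypothetical, and the 1.5 × IQR rule used to flag the outlier is a common convention rather than something taken from the answer above.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
values = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)

# With the extreme value, the mean and standard deviation shift sharply;
# the median barely moves.
print(values.mean(), np.median(values), values.std())
print(values[:-1].mean(), np.median(values[:-1]), values[:-1].std())

# IQR rule (a common convention, assumed here for illustration):
# flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers)
```

Here a single value of 95 lifts the mean from about 11.3 to 21.75, while the median only moves from 11 to 11.5.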

### Q7: You are working on a project that requires analyzing customer data. However, you notice that some of the data is missing. What are some techniques you can use to handle the missing data in your analysis?

`Ans) There are several techniques that can be used to handle missing data in analysis. Here are a few:`

1. Deletion: In this technique, you simply delete the rows or columns with missing data. This method is straightforward but may result in a loss of information.

2. Mean/median/mode imputation: In this technique, the missing values are replaced with the mean, median or mode of the available data. This method is simple but may not be very accurate.

3. Regression imputation: In this technique, a regression model is used to predict the missing values based on the values of other variables. This method can be more accurate than mean imputation, but it requires more effort.

4. Multiple imputation: In this technique, the missing values are imputed multiple times, generating several datasets with imputed values. These datasets are then analyzed, and the results are combined to generate a final result. This method is more accurate than mean imputation and can handle missing data in a more sophisticated way.

5. K-nearest neighbor imputation: In this technique, the missing values are imputed based on the values of the nearest neighbors in the dataset. This method is useful when similar records tend to have similar values, as is often the case in time series or spatial data.

`The choice of technique depends on the nature of the data and the specific requirements of the analysis.`
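Regression imputation is the one technique in this list without code elsewhere in this notebook, so here is a minimal sketch. It assumes a linear relationship between two hypothetical columns, age and income; the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical customer data; column names are made up for illustration.
df = pd.DataFrame({
    "age":    [22, 30, 38, 45, 52, 60],
    "income": [30.0, 40.0, np.nan, 55.0, np.nan, 70.0],
})

observed = df["income"].notna()

# Fit the regression on rows where the target column is present.
model = LinearRegression().fit(df.loc[observed, ["age"]], df.loc[observed, "income"])

# Predict the target for rows where it is missing and fill the gaps.
df.loc[~observed, "income"] = model.predict(df.loc[~observed, ["age"]])
print(df)
```

Because every imputed value sits exactly on the fitted line, this approach tends to understate the natural variability of the data, which is one motivation for multiple imputation.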

### Q8: You are working with a large dataset and find that a small percentage of the data is missing. What are some strategies you can use to determine if the missing data is missing at random or if there is a pattern to the missing data?

`Ans) There are several strategies that can be used to determine if the missing data is missing at random or if there is a pattern to the missing data. Here are a few:`

1. Visualize the missing data: One way to determine if there is a pattern to the missing data is to visualize it. A bar chart of the percentage of missing values per variable shows where the gaps are concentrated, and a heatmap of the missing-value indicators shows whether certain variables tend to be missing together.

2. Conduct statistical tests: Another strategy is to test whether missingness is related to the observed data. For example, we can split the records into those where a variable is missing and those where it is present, then use a t-test or an ANOVA to compare the mean of another, fully observed variable between the two groups. If there is no significant difference, that is consistent with the data being missing completely at random; a significant difference suggests that the missingness depends on that variable.

3. Use machine learning algorithms: We can train a classifier to predict whether a value is missing, using the other variables as features. If the model predicts missingness much better than chance, the missing data follows a pattern that depends on the observed variables. If it cannot, that is consistent with the data being missing completely at random.

4. Consult subject matter experts: It may be helpful to consult subject matter experts to understand if there is a pattern to the missing data. They may be able to provide insight into why the data is missing and whether it is missing at random or not.
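The first two strategies can be sketched as follows. The data is synthetic and deliberately constructed so that income is missing only for older customers; the column names and the age cutoff of 50 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data (made up): income is hidden for every customer older than 50,
# so missingness depends on age rather than being completely random.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.normal(40, 10, 500), "income": rng.normal(50, 15, 500)})
df.loc[df["age"] > 50, "income"] = np.nan

# Share of missing values per column (the numbers behind a bar chart).
print(df.isna().mean())

# Compare an observed variable (age) between rows with and without income.
income_missing = df["income"].isna()
t_stat, p_value = stats.ttest_ind(
    df.loc[income_missing, "age"], df.loc[~income_missing, "age"], equal_var=False
)
print(p_value)
```

A very small p-value here indicates that age differs between the two groups, so the missing income values are not missing completely at random.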

### Q9: Suppose you are working on a medical diagnosis project and find that the majority of patients in the dataset do not have the condition of interest, while a small percentage do. What are some strategies you can use to evaluate the performance of your machine learning model on this imbalanced dataset?

`Ans) When dealing with an imbalanced dataset in a medical diagnosis project, there are several strategies we can use to evaluate the performance of our machine learning model. Some of these strategies are:`

1. Confusion matrix: The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. This matrix can help you evaluate the accuracy of your model and identify any potential issues with the predictions.

2. Precision, Recall, and F1-score: These are metrics that can help you evaluate the performance of your model on both the positive and negative classes. Precision measures the proportion of true positives among all positive predictions, recall measures the proportion of true positives among all actual positives, and F1-score is the harmonic mean of precision and recall.

3. ROC curve and AUC score: The Receiver Operating Characteristic (ROC) curve is a graphical representation of the performance of a binary classifier as the discrimination threshold is varied. The area under the ROC curve (AUC) is a metric that can help you evaluate the performance of your model. A higher AUC indicates a better model performance.

4. Stratified cross-validation: Cross-validation is a technique used to evaluate the performance of a machine learning model. Stratified cross-validation is a variation of cross-validation that ensures that each fold of the cross-validation process has a representative sample of both the positive and negative classes.

5. Resampling techniques: Resampling techniques such as oversampling and undersampling can be used to balance the class distribution before training. Oversampling increases the number of instances in the minority class, while undersampling decreases the number of instances in the majority class. Resampling should be applied to the training data only, so that the model is still evaluated on data with the original class distribution.

`By using these strategies, we can evaluate the performance of our machine learning model and determine if any modifications are necessary to improve the accuracy and reliability of our model.`
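The metrics and the stratified split described above can be tried on a small hand-made example. The labels and predictions below are hypothetical, chosen so that the counts are easy to verify by hand.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 95 healthy patients (0) and 5 with the condition (1).
y_true = np.array([0] * 95 + [1] * 5)
# Hypothetical predictions: 2 false positives, 3 true positives, 2 false negatives.
y_pred = np.array([1] * 2 + [0] * 93 + [1] * 3 + [0] * 2)

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 5
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
print(precision, recall, f1)

# Stratified folds keep the 95:5 ratio, so every test fold contains a positive case.
X = np.zeros((100, 1))  # placeholder features; only the labels matter for the split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positives_per_fold = [int(y_true[test_idx].sum()) for _, test_idx in skf.split(X, y_true)]
print(positives_per_fold)
```

Accuracy for these predictions would be 96%, yet precision and recall are both 0.6, which gives a more honest picture of how the model treats the rare class.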

### Q10: When attempting to estimate customer satisfaction for a project, you discover that the dataset is unbalanced, with the bulk of customers reporting being satisfied. What methods can you employ to balance the dataset and down-sample the majority class?

`Ans) To address an unbalanced dataset where the majority class is overrepresented, down-sampling the majority class can be a useful method. Here are some techniques that can be used to balance the dataset and down-sample the majority class:`

1. Random under-sampling: This technique randomly removes examples from the majority class until the classes are balanced. Because the removal is random, it may discard informative majority-class examples.

2. Cluster-based under-sampling: In this technique, the majority class is clustered, and the centroids of each cluster are used to represent the majority class. The examples that are closest to each centroid are kept, while the rest are removed.

3. Tomek Links: Tomek Links are pairs of examples from different classes that are each other's nearest neighbours. This technique removes the majority class examples that form Tomek Links, which cleans up the boundary between the classes.

4. Edited Nearest Neighbours: In this technique, examples from the majority class that are not well-classified by the nearest neighbours classifier are removed.

`Once the dataset is balanced, it can be used for further analysis or training machine learning models to estimate customer satisfaction.`
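To make the Tomek Links idea concrete, the sketch below finds the links by hand with scikit-learn's NearestNeighbors. It is a simplified illustration on made-up one-dimensional data, not the imblearn implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy 1-D data (made up): class 0 is the majority, class 1 the minority.
X = np.array([[0.0], [1.0], [2.5], [2.7], [5.0], [6.2], [9.0], [9.1]])
y = np.array([0, 0, 0, 1, 0, 0, 1, 1])

# Nearest neighbour of every point; kneighbors() without arguments excludes the point itself.
nearest = NearestNeighbors(n_neighbors=1).fit(X).kneighbors(return_distance=False).ravel()

# A Tomek link is a pair of mutual nearest neighbours with different labels.
idx = np.arange(len(X))
in_link = (nearest[nearest] == idx) & (y != y[nearest])

# Drop only the majority-class member of each link.
keep = ~(in_link & (y == 0))
X_cleaned, y_cleaned = X[keep], y[keep]
print(np.flatnonzero(~keep))
```

Only the majority point at 2.5, which sits right next to the minority point at 2.7, is removed. Tomek Links therefore clean the class boundary rather than fully balancing the classes, so the method is often combined with random under-sampling.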

### Q11: You discover that the dataset is unbalanced with a low percentage of occurrences while working on a project that requires you to estimate the occurrence of a rare event. What methods can you employ to balance the dataset and up-sample the minority class?

`Ans) When dealing with an imbalanced dataset where the minority class has low occurrences, one method to balance the dataset is through upsampling. Upsampling involves randomly duplicating observations from the minority class to increase its representation in the dataset.`

`There are several techniques to up-sample the minority class, some of them are:`

1. Random oversampling: In this technique, observations from the minority class are randomly duplicated until the minority class reaches the same size as the majority class.

2. SMOTE (Synthetic Minority Over-sampling Technique): SMOTE creates synthetic samples of the minority class by selecting pairs of similar observations from the minority class and creating new observations along the line connecting them. 

`Here's an example of how to use the SMOTE technique in Python using the imblearn library:`

```python
from imblearn.over_sampling import SMOTE

X = ... # input features
y = ... # target variable

# Instantiate SMOTE and apply it to the dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
```

`After applying the SMOTE technique, the X_resampled and y_resampled arrays will contain the up-sampled data that can be used to train machine learning models on a balanced dataset.`