In [None]:
Missing values in a dataset are instances where no data is present for a particular
variable in a specific observation. These missing values can occur due to various reasons such as data 
entry errors, equipment malfunction, or simply because the information was not collected.

Handling missing values is crucial because they can lead to biased analyses and inaccurate conclusions.
For example, if a significant portion of data is missing for a particular variable, any analysis involving
that variable may be skewed. Additionally, many machine learning algorithms cannot handle missing values 
and may either produce errors or provide inaccurate results if missing values are present in the dataset.

Some algorithms can work with missing values directly. Certain tree-based implementations, such as XGBoost,
LightGBM, and scikit-learn's HistGradientBoostingClassifier, handle them natively by learning which branch
a missing value should follow at each split. This is a property of the implementation rather than of
Decision Trees, Random Forests, or Gradient Boosting Machines in general, so check the library you use.
Algorithms that rely on distance metrics, like K-Nearest Neighbors (KNN), are affected by missing values:
a distance cannot be computed over a missing coordinate, so they need imputation or a NaN-aware distance first.

In [1]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})


# Drop every row that contains at least one missing value
df_cleaned = df.dropna()
print(df_cleaned)


     A    B
0  1.0  5.0
3  4.0  8.0


In [2]:

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Replace missing values with a constant sentinel (-1)
df_filled = df.fillna(-1)
print(df_filled)


     A    B
0  1.0  5.0
1  2.0 -1.0
2 -1.0  7.0
3  4.0  8.0


In [3]:

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})


# Replace missing values with the mean of their column
df_filled = df.fillna(df.mean())
print(df_filled)


          A         B
0  1.000000  5.000000
1  2.000000  6.666667
2  2.333333  7.000000
3  4.000000  8.000000


In [None]:
Imbalanced data refers to a situation in a dataset where the number of observations in one class 
(the minority class) is significantly lower than the number of observations in the other class(es)
(the majority class(es)).

If imbalanced data is not handled, it can lead to several issues:

Biased Models: Machine learning models trained on imbalanced data tend to be biased towards the majority
class, as they focus more on correctly classifying the majority class while ignoring the minority class.

Poor Generalization: Models trained on imbalanced data may not generalize well to unseen data, especially
for the minority class, leading to lower overall performance.

Misclassification of Minority Class: In scenarios where the minority class is of interest
(e.g., fraud detection, rare disease diagnosis), imbalanced data can result in the minority class
being misclassified more often, leading to higher false negative rates.

Model Evaluation Issues: Traditional evaluation metrics like accuracy can be misleading in the presence 
of imbalanced data, as a model that predicts the majority class all the time can still achieve high 
accuracy.
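
The accuracy problem can be seen with a small sketch. The labels below are made up for illustration: 95 majority-class samples and 5 minority-class samples, scored against a "model" that always predicts the majority class.

```python
import numpy as np

# Toy labels (illustrative): 95 majority-class (0) and 5 minority-class (1) samples
y_true = np.array([0] * 95 + [1] * 5)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

# Share of all samples predicted correctly
accuracy = (y_pred == y_true).mean()
# Share of true minority samples that were predicted as minority
minority_recall = (y_pred[y_true == 1] == 1).mean()

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The model looks strong on accuracy while never detecting a single minority-class sample.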

In [None]:

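# Example of up-sampling
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset (illustrative): six majority rows (0), two minority rows (1)
df = pd.DataFrame({'feature': range(8),
                   'target': [0, 0, 0, 0, 0, 0, 1, 1]})

majority_class = df[df['target'] == 0]
minority_class = df[df['target'] == 1]

# Sample the minority class with replacement up to the majority class size
minority_upsampled = resample(minority_class, replace=True, n_samples=len(majority_class), random_state=42)

upsampled_df = pd.concat([majority_class, minority_upsampled])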


In [None]:
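# Example of down-sampling
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced dataset (illustrative): six majority rows (0), two minority rows (1)
df = pd.DataFrame({'feature': range(8),
                   'target': [0, 0, 0, 0, 0, 0, 1, 1]})

majority_class = df[df['target'] == 0]
minority_class = df[df['target'] == 1]

# Sample the majority class without replacement down to the minority class size
majority_downsampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)

downsampled_df = pd.concat([majority_downsampled, minority_class])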


In [None]:
Data augmentation is a technique used to artificially expand the size of a dataset
by creating modified versions of images, text, or other data points in the dataset. This technique is
commonly used in machine learning and deep learning to improve model performance, especially when the
original dataset is limited in size.

SMOTE (Synthetic Minority Over-sampling Technique) is a specific data augmentation technique used to
address the class imbalance problem in classification tasks. It works by generating synthetic examples
from the minority class rather than simply duplicating them. For a chosen minority instance, SMOTE picks
one of its k nearest neighbors within the minority class, computes the difference between the two feature
vectors, scales that difference by a random number between 0 and 1, and adds it to the original instance.
The new instance therefore lies on the line segment joining the two points in the feature space.
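
The interpolation step can be sketched with NumPy alone. This is a simplified illustration, not the reference implementation: the function name `smote_like_samples`, its parameters, and the toy points in `X_min` are all assumptions made for this example.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest per sample

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))       # random minority sample
        nn = rng.choice(neighbours[base])          # one of its k nearest neighbours
        lam = rng.random()                         # position on the segment, in [0, 1)
        synthetic[i] = X_minority[base] + lam * (X_minority[nn] - X_minority[base])
    return synthetic

# Toy minority-class points (illustrative)
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])
new_points = smote_like_samples(X_min, n_new=5)
print(new_points)
```

In practice, the `SMOTE` class from the imbalanced-learn package (`imblearn.over_sampling.SMOTE`, used through `fit_resample`) is the usual choice.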

In [None]:
Outliers are data points that significantly differ from other observations in a dataset. 
They can occur due to various reasons such as measurement errors, experimental errors, or natural
variation in the data. Outliers can distort statistical analyses and machine learning models, leading
to incorrect conclusions and predictions.

It is essential to handle outliers for several reasons:

Impact on Descriptive Statistics: Outliers can skew descriptive statistics such as the mean and standard
deviation, making them less representative of the central tendency and variability of the data.

Impact on Inferential Statistics: Outliers can lead to incorrect conclusions in hypothesis testing and 
estimation, as they can influence the results of statistical tests and confidence intervals.

Impact on Machine Learning Models: Outliers can negatively impact the performance of machine learning 
models, as they can introduce bias and reduce the accuracy of predictions.

Data Quality: Outliers may indicate data quality issues or measurement errors, which need to be addressed 
to ensure the integrity of the dataset.
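
The effect on descriptive statistics can be sketched on a small series. The values below are made up, and the 1.5 × IQR rule used to flag the outlier is one common convention assumed here, not the only option.

```python
import pandas as pd

# Toy measurements (illustrative); 95 is an injected outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# The mean and standard deviation are pulled up by the outlier, the median is not
print(values.mean(), values.std(), values.median())

# Flag points outside 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Statistics after removing the flagged points
cleaned = values[~is_outlier]
print(cleaned.mean(), cleaned.std(), cleaned.median())
```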

In [None]:
Removing Rows with Missing Data: If the amount of missing data is small and 
randomly distributed, you can simply remove rows with missing values.

Replacing with a Constant Value: You can fill missing values with a constant value (e.g., 0) if
the missingness is meaningful or if it makes sense for your analysis.

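Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
Use the mean for roughly symmetric numeric data, the median when the data is skewed or contains outliers,
and the mode for categorical data. This works best when values are missing at random and only a small
share is missing, because filling many gaps with a single value shrinks the variance of the column.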

Forward Fill (or Backward Fill): Use the last known value to fill missing values (forward fill) or the 
next known value (backward fill). This is useful for time series data where values are likely to be 
correlated over time.
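
Median imputation and forward/backward fill can be sketched on the same toy DataFrame used in the earlier cells, using the pandas methods `fillna`, `ffill`, and `bfill`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Replace missing values with the median of their column (A: 2.0, B: 7.0)
df_median = df.fillna(df.median())

# Forward fill: carry the last known value down (A[2] becomes 2.0, B[1] becomes 5.0)
df_ffill = df.ffill()

# Backward fill: pull the next known value up (A[2] becomes 4.0, B[1] becomes 7.0)
df_bfill = df.bfill()

print(df_median, df_ffill, df_bfill, sep="\n\n")
```

Forward and backward fill assume the rows are ordered, for example by timestamp.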

In [None]:
Visualize Missing Data: Use heatmaps or bar plots to see if missing values occur more frequently in 
certain variables or patterns.

Statistical Tests: Perform tests like Little's MCAR test to check if missingness is completely at random.

Analyze Missing Data Patterns: Look for consistent patterns in missing values across variables.

Compare Across Groups: Compare missing data patterns between different groups to identify differences.

Use Domain Knowledge: Consider if there are logical reasons for missing data based on your understanding 
of the dataset.
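
Two of these checks can be sketched with pandas alone: the share of missing values per variable, and the same share compared across groups. The column names `group`, `income`, and `age` and their values are made up for illustration.

```python
import pandas as pd

# Toy survey data (illustrative)
df = pd.DataFrame({
    'group':  ['a', 'a', 'a', 'b', 'b', 'b'],
    'income': [50, None, 52, None, None, 48],
    'age':    [30, 31, None, 40, 41, 42],
})

# Share of missing values per column
missing_share = df.isna().mean()
print(missing_share)

# Share of missing values per column within each group
by_group = df.drop(columns='group').isna().groupby(df['group']).mean()
print(by_group)
```

A large gap between groups, as for `income` here, suggests the values are not missing completely at random.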

In [None]:
Use of Evaluation Metrics: Instead of accuracy, which can be misleading in imbalanced datasets, use 
evaluation metrics like precision, recall, F1-score, and area under the ROC curve (AUC-ROC) to assess the
performance of your model. These metrics provide a more comprehensive view of the model performance, 
especially in imbalanced datasets.

Resampling Techniques: Use resampling techniques such as oversampling the minority class 
(e.g., Synthetic Minority Over-sampling Technique - SMOTE) or undersampling the majority class to balance
the dataset before training the model. This can help improve the model performance on the minority class.
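
The metrics named above can be computed with `sklearn.metrics`. This is a minimal sketch on made-up labels and scores, with a 0.5 decision threshold assumed for turning scores into class predictions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy ground truth and predicted scores (illustrative): 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.4, 0.1, 0.7, 0.4]

# Turn scores into class predictions with an assumed 0.5 threshold
y_pred = [int(s >= 0.5) for s in y_score]

precision = precision_score(y_true, y_pred)  # correct share of predicted positives
recall = recall_score(y_true, y_pred)        # detected share of true positives
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # threshold-independent ranking quality

print(precision, recall, f1, auc)
```

Accuracy on this data is 0.8, while precision, recall, and F1 are each 0.5, which shows how much the minority class is hidden by accuracy alone.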

In [None]:
When dealing with an unbalanced dataset in which the majority of customers report being satisfied,
you can employ several methods to balance the dataset by down-sampling the majority class. One common
approach is random undersampling, where you randomly select a subset of the majority class to
match the size of the minority class.

In [None]:
When dealing with an unbalanced dataset in which a rare event (the minority class) makes up only a low
percentage of occurrences, you can employ various methods to balance the dataset by up-sampling the
minority class. One common approach is a synthetic oversampling technique such as the Synthetic Minority
Over-sampling Technique (SMOTE).