Missing Values:
Missing values occur when no data value is stored for a variable in an observation. This can happen for various reasons such as data entry errors, equipment malfunctions, or participants opting out of answering certain questions.

Importance of Handling Missing Values:
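Handling missing values is essential because:
Bias Reduction: If values are not missing at random, ignoring or dropping them can skew estimates and model predictions.
Efficiency: Discarding every incomplete row throws away usable data and reduces statistical power.
Accuracy: Many algorithms cannot train on missing values at all, and poorly handled gaps tend to degrade the models that can.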

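Algorithms That Can Handle Missing Values:
Some algorithms can handle missing values natively, depending on the implementation:
Decision Trees (for example through surrogate splits; scikit-learn trees accept NaN only in recent versions)
Random Forests (implementation-dependent)
XGBoost (learns a default split direction for missing values)

K-Nearest Neighbors (KNN) generally does not handle missing values natively: standard distance metrics are undefined when a feature is missing, so the data usually needs to be imputed first.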

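In [2]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [10, 11, 12, 13]
})

# Removal: drop rows or columns that contain any missing value.
df_dropped_rows = df.dropna()
df_dropped_columns = df.dropna(axis=1)

# Each imputation below starts from a copy of df so the methods can be compared;
# applied one after another in place, the first would already fill every gap.

# Imputation with mean/median (use .mode()[0] for the mode):
df_mean_median = df.copy()
df_mean_median['A'] = df_mean_median['A'].fillna(df['A'].mean())
df_mean_median['B'] = df_mean_median['B'].fillna(df['B'].median())

# Imputation with forward/backward fill
# (fillna(method=...) is deprecated in recent pandas; use ffill()/bfill()):
df_filled = df.copy()
df_filled['A'] = df_filled['A'].ffill()
df_filled['B'] = df_filled['B'].bfill()

# Imputation with linear interpolation:
df_interpolated = df.copy()
df_interpolated['A'] = df_interpolated['A'].interpolate(method='linear')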

Imbalanced Data:
Imbalanced data refers to a dataset where the number of observations in one class is significantly higher than the number of observations in other classes. This is common in classification problems.

Consequences if Not Handled:
Bias Towards Majority Class: Models tend to be biased towards the majority class, leading to poor predictive performance on the minority class.
Misleading Metrics: Common metrics like accuracy can be misleading as they do not account for class imbalance.
Poor Generalization: The model may fail to generalize well to new data, especially if the minority class is underrepresented.
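
To make the point about misleading metrics concrete, here is a small sketch with a hypothetical test set of 95 majority and 5 minority examples, scored against a "model" that always predicts the majority class. The numbers are invented for illustration.

```python
# Hypothetical test set: 95 negatives (class 0) and 5 positives (class 1).
y_true = [0] * 95 + [1] * 5
# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
minority_recall = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.2f}")                # accuracy: 0.95
print(f"minority recall: {minority_recall:.2f}")  # minority recall: 0.00
```

The 95% accuracy comes entirely from the class distribution; the model never identifies a single minority example, which is why recall, precision, or F1 on the minority class are usually reported alongside accuracy.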

Up-sampling:
Up-sampling involves increasing the number of samples in the minority class by duplicating existing samples or generating new ones.

Down-sampling:
Down-sampling involves reducing the number of samples in the majority class to balance the class distribution.

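In [None]:
import pandas as pd
from sklearn.utils import resample

# Assumes df has a binary 'target' column in which class 1 is the minority.
df_majority = df[df.target == 0]
df_minority = df[df.target == 1]

# Up-sample: draw minority rows with replacement until the classes match.
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=123)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])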


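In [None]:
from sklearn.utils import resample

# Down-sample: draw majority rows without replacement, down to the minority size.
# Reuses df_majority and df_minority from the previous cell.
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=123)
df_downsampled = pd.concat([df_majority_downsampled, df_minority])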

Data Augmentation:
Data augmentation is a technique used to increase the diversity of the training set by applying random transformations to the existing data, such as rotations, shifts, or flips. It is commonly used in image processing.
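
A minimal sketch of such transformations with NumPy is shown below. The 8x8 array stands in for a grayscale image, and the helper name augment is made up for this example; real pipelines typically rely on an image library rather than hand-written code like this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a grayscale image: an 8x8 array of pixel intensities.
image = rng.integers(0, 256, size=(8, 8))

def augment(img, rng):
    """Return a randomly flipped, rotated, and shifted copy of img."""
    out = img
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal flip
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # rotate by 0, 90, 180, or 270 degrees
    shift = int(rng.integers(-2, 3))                # shift by up to 2 pixels
    out = np.roll(out, shift, axis=1)               # horizontal shift with wrap-around
    return out.copy()

# Four randomly transformed variants of the same image.
augmented = [augment(image, rng) for _ in range(4)]
```

Each variant contains the same pixels in a different arrangement, so the label of the original image can be reused for every augmented copy.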

SMOTE (Synthetic Minority Over-sampling Technique):
SMOTE is an over-sampling technique for imbalanced datasets. Rather than duplicating rows, it generates synthetic minority-class samples by interpolating between an existing minority sample and one of its nearest minority-class neighbors.
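
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation from any particular library: the function name smote_sketch, the choice of k=3, and the sample points are all assumptions made for this example.

```python
import numpy as np

def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # indices of the k nearest neighbors

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for row in range(n_synthetic):
        i = rng.integers(len(X_minority))         # pick a minority sample
        j = rng.choice(neighbors[i])              # pick one of its k neighbors
        gap = rng.random()                        # position along the segment, in [0, 1)
        synthetic[row] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
X_new = smote_sketch(X_min, n_synthetic=10)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stays inside the region the minority class already occupies.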

Outliers:
Outliers are data points that are significantly different from the majority of the data. They can occur due to measurement errors, data entry errors, or inherent variability in the data.

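Importance of Handling Outliers:
1. Distortion of Statistical Measures: Extreme values pull the mean and inflate the variance and standard deviation.
2. Model Performance: Models that minimize squared error, such as linear regression, can be dominated by a few extreme points.
3. Robustness: Detecting and treating outliers makes results less sensitive to a handful of unusual observations.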
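
One common way to flag outliers, shown here as a sketch, is the interquartile range (IQR) rule: treat anything more than 1.5 times the IQR outside the quartiles as an outlier. The rule and the 1.5 factor are a conventional choice brought in for this example, not something prescribed above, and the values are invented.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 14, 11, 95], dtype=float)

# Quartiles and the conventional 1.5 * IQR fences.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]            # the flagged points
cleaned = values[~is_outlier]            # option 1: drop them
clipped = np.clip(values, lower, upper)  # option 2: cap them at the fences

# The single extreme value moves the mean far more than the median.
print("with outlier:   ", values.mean(), np.median(values))
print("without outlier:", cleaned.mean(), np.median(cleaned))
```

Whether to drop, cap, or keep a flagged point depends on whether it is an error or genuine variability, so the rule is only a starting point for inspection.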

Techniques to Handle Missing Data:

Remove Missing Data:
Rows: Remove rows with missing values if the proportion of missing data is small.
Columns: Remove columns with a high proportion of missing values.

Imputation:
Mean/Median/Mode Imputation: Fill missing values with mean, median, or mode of the column.

Predictive Imputation:
Use machine learning models to predict and fill missing values based on other features.
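
A minimal sketch of predictive imputation follows, assuming a linear regression is an adequate model for the column being filled. The DataFrame df_demo and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'income' has gaps; 'age' and 'years_employed' are complete.
df_demo = pd.DataFrame({
    'age': [25, 32, 47, 51, 38, 29, 44, 36],
    'years_employed': [2, 8, 20, 25, 12, 4, 18, 10],
    'income': [30.0, 48.0, np.nan, 95.0, 60.0, np.nan, 82.0, 55.0],
})

features = ['age', 'years_employed']
known = df_demo['income'].notna()

# Fit on the rows where 'income' is observed, then predict it for the rest.
model = LinearRegression()
model.fit(df_demo.loc[known, features], df_demo.loc[known, 'income'])

df_imputed = df_demo.copy()
df_imputed.loc[~known, 'income'] = model.predict(df_demo.loc[~known, features])
```

Observed values are left untouched; only the missing entries are replaced with predictions, which keeps relationships between features better than a single column mean would.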

Forward/Backward Fill:
Use the previous value (forward fill) or the next value (backward fill) to fill missing data. This suits ordered data such as time series.

In [6]:
from sklearn.utils import resample

df_majority = df[df.satisfaction == 'satisfied']
df_minority = df[df.satisfaction == 'unsatisfied']

df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

df_balanced = pd.concat([df_majority_downsampled, df_minority])


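In [None]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.utils import resample

n_clusters = 5
df_majority = df[df.satisfaction == 'satisfied']
df_minority = df[df.satisfaction == 'unsatisfied']

# KMeans needs numeric input, so cluster on the numeric feature columns only.
# This leaves out the string-valued 'satisfaction' label and assumes the
# numeric columns contain no missing values.
majority_features = df_majority.select_dtypes(include='number')
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(majority_features)

downsampled_indices = []
per_cluster = len(df_minority) // n_clusters
for i in range(n_clusters):
    cluster_indices = df_majority.index[kmeans.labels_ == i].to_numpy()
    # Without replacement we cannot draw more rows than the cluster holds,
    # so the final majority sample may be slightly smaller than the minority.
    n_draw = min(per_cluster, len(cluster_indices))
    downsampled_indices.extend(resample(cluster_indices,
                                        replace=False,
                                        n_samples=n_draw,
                                        random_state=42))

df_downsampled = df.loc[downsampled_indices]
df_balanced = pd.concat([df_downsampled, df_minority])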
