#### About

> Handling imbalanced data, outliers and missing values.

Handling imbalanced data, outliers, and missing values is an essential part of data preprocessing, because models and analyses are only as accurate and reliable as the data they are built on. The following techniques can be used.


1. Handling Imbalanced Data:
Imbalanced data refers to datasets in which the classes are not represented equally, for example 90% of the samples in one class and 10% in the other. A model trained on such data tends to favor the majority class and can perform poorly on the minority class even when its overall accuracy looks high. Here are some techniques to handle imbalanced data:



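- a) Resampling Techniques: We can oversample the minority class or undersample the majority class to create a more balanced training set. For oversampling, we can use techniques like Random Oversampling or SMOTE (Synthetic Minority Over-sampling Technique). For undersampling, we can use techniques like Random Undersampling or Tomek Links, which removes majority-class samples that lie right next to minority-class samples. Resampling should be applied to the training set only, so that the test set keeps the original class distribution.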

In [2]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


In [3]:
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)



In [4]:
# Split dataset into train and test sets, stratifying so both keep roughly the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [5]:
# Perform SMOTE oversampling on train set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
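
The two random resampling strategies named earlier can be sketched with NumPy alone. The helper names `random_oversample` and `random_undersample` are illustrative and not part of imbalanced-learn (which provides `RandomOverSampler` and `RandomUnderSampler` for real use). The sketch assumes binary labels in which class 1 is the smaller class.

```python
import numpy as np

def random_oversample(X, y, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority=1, seed=0):
    """Drop randomly chosen majority rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y != minority)
    kept_major = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_major, min_idx])
    return X[keep], y[keep]

# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1)
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 90 + [1] * 10)

X_over, y_over = random_oversample(X_toy, y_toy)
X_under, y_under = random_undersample(X_toy, y_toy)

print(np.bincount(y_over))   # class counts after oversampling
print(np.bincount(y_under))  # class counts after undersampling
```

Oversampling keeps every original row but repeats minority rows, while undersampling discards majority rows, so the choice trades a risk of overfitting to duplicates against a loss of information.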

- b) Ensemble Methods: We can combine ensemble methods such as bagging and boosting with resampling techniques. For example, a Random Forest or Gradient Boosting classifier can be trained on a training set that has been balanced with SMOTE.

In [6]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

In [7]:
# Generate imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=42)


In [8]:
# Split dataset into train and test sets, stratifying so both keep roughly the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)


In [9]:
# Perform SMOTE oversampling on train set
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [10]:
# Train Random Forest classifier on resampled train set
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_resampled, y_train_resampled)


In [11]:
# Predict labels for the test set (the first step of evaluation)
y_pred = rf.predict(X_test)
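
The prediction step above stops short of scoring the model. On imbalanced data the choice of metric matters, because plain accuracy rewards always predicting the majority class. The sketch below makes this concrete with hand-made labels (not the output of the Random Forest above) and a hypothetical helper `binary_scores`. For real predictions, scikit-learn's `classification_report` reports precision and recall per class.

```python
import numpy as np

def binary_scores(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for the positive (minority) class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    accuracy = float(np.mean(y_pred == y_true))
    precision = float(tp / (tp + fp)) if (tp + fp) > 0 else 0.0
    recall = float(tp / (tp + fn)) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# 90 negatives and 10 positives, mirroring a 90/10 class split
y_true = np.array([0] * 90 + [1] * 10)

# A "model" that always predicts the majority class
y_majority = np.zeros(100, dtype=int)

# A model that finds 8 of the 10 positives at the cost of 5 false alarms
y_useful = y_true.copy()
y_useful[90:92] = 0   # two missed positives
y_useful[:5] = 1      # five false positives

print(binary_scores(y_true, y_majority))  # high accuracy, zero recall
print(binary_scores(y_true, y_useful))
```

The always-majority predictor reaches 90% accuracy while detecting no minority samples at all, which is why recall and precision on the minority class are the more informative numbers here.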

2. Handling Outliers:
Outliers are data points that deviate significantly from the rest of the data. They can degrade a model's performance by introducing noise or by pulling fitted parameters toward extreme values.
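
To make the effect of a single outlier concrete, the following sketch fits a least-squares line to ten points that lie exactly on y = 2x + 1, then refits after corrupting one target value. The data and variable names are made up for illustration.

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2 * x + 1            # points on a perfect line

y_outlier = y_clean.copy()
y_outlier[-1] = 100.0          # corrupt one observation (true value is 19)

# A degree-1 polynomial fit returns (slope, intercept)
slope_clean, intercept_clean = np.polyfit(x, y_clean, 1)
slope_outlier, intercept_outlier = np.polyfit(x, y_outlier, 1)

print(f"clean fit:    slope={slope_clean:.2f}, intercept={intercept_clean:.2f}")
print(f"with outlier: slope={slope_outlier:.2f}, intercept={intercept_outlier:.2f}")
```

One corrupted point out of ten roughly triples the estimated slope, which is the kind of distortion the techniques in this section aim to prevent.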

The following techniques can be used to handle outliers:

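- a) Statistical Methods: We can use statistical rules such as the z-score or the IQR (interquartile range) to detect and remove outliers. The z-score measures how many standard deviations a data point lies from the mean. The IQR is the distance between the first and third quartiles (Q3 - Q1), that is, the spread of the middle 50% of the data; points more than 1.5 × IQR below Q1 or above Q3 are commonly treated as outliers.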



In [12]:
import numpy as np
import pandas as pd

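# Build a small sample column. Its values form two clusters (1-5 and 20-24) rather than
# containing one extreme point, so the thresholds used below leave every row in place.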
data = {'A': [1, 2, 3, 4, 5, 20, 21, 22, 23, 24]}
df = pd.DataFrame(data)

In [14]:

# Calculate z-score for each data point
z_scores = (df['A'] - df['A'].mean()) / df['A'].std()



In [15]:
# Detect and remove outliers based on a z-score threshold.
# Compare absolute values so unusually low points are caught as well as high ones.
z_score_threshold = 2
df_no_outliers = df[z_scores.abs() <= z_score_threshold]

In [16]:
# Calculate the quartiles and the IQR of column 'A'
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1

In [17]:
# Detect and remove outliers that fall outside the IQR fences
IQR_threshold = 1.5
lower_bound = Q1 - IQR_threshold * IQR
upper_bound = Q3 + IQR_threshold * IQR
df_no_outliers = df[df['A'].between(lower_bound, upper_bound)]
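
The two rules can be wrapped in small reusable helpers and checked on a sample that contains one obviously extreme value. The helper names `zscore_mask` and `iqr_mask` and the sample values are illustrative assumptions; each helper returns a boolean mask that is True for the rows to keep.

```python
import pandas as pd

def zscore_mask(s, threshold=2.0):
    """True where the absolute z-score is within the threshold."""
    z = (s - s.mean()) / s.std()
    return z.abs() <= threshold

def iqr_mask(s, k=1.5):
    """True where the value lies inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.between(q1 - k * iqr, q3 + k * iqr)

# Sample with one extreme value (100)
sample = pd.Series([1, 2, 3, 4, 5, 20, 21, 22, 23, 24, 100], name='A')

kept_z = sample[zscore_mask(sample)]
kept_iqr = sample[iqr_mask(sample)]

print(sample[~zscore_mask(sample)].tolist())  # values flagged by the z-score rule
print(sample[~iqr_mask(sample)].tolist())     # values flagged by the IQR rule
```

Note that the z-score rule is itself influenced by the outlier, since the extreme value inflates both the mean and the standard deviation. The IQR rule relies on quartiles and is less affected.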

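- b) Robust Algorithms: We can use algorithms that are less sensitive to outliers, such as Decision Trees, Random Forest, or Support Vector Machines (SVM). Tree-based models split on feature thresholds, so an extreme feature value has little effect on where the splits fall, and a soft-margin SVM can tolerate some outlying points depending on its regularization setting. These algorithms do not assume a particular distribution for the data, but they are not completely immune to outliers.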


In [18]:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


In [19]:
# Generate a dataset with some outliers
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X[-1, :] = np.array([10, 10, 10, 10, 10, 10, 10, 10, 10, 100])  # Add an outlier


In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [21]:
# Train Random Forest regressor on train set
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)

In [22]:

# Predict targets for the test set (the first step of evaluation)
y_pred = rf.predict(X_test)

3. Handling Missing Values:
Missing values are gaps in the data where no value is recorded. They can cause problems in data analysis and machine learning because they can bias results or leave analyses incomplete, and many algorithms cannot accept them at all.
The following methods can be used to handle them.

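- a) Imputation Techniques: We can fill in missing values using simple imputation techniques such as mean, median, or mode imputation. These techniques replace each missing value with a measure of central tendency computed from the observed values in the same column: typically the mean or median for numeric features and the mode for categorical ones.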


In [24]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

In [25]:
iris = load_iris()
X, y = iris.data, iris.target

In [26]:
# Introduce some missing values
X[0, 0] = np.nan
X[5, 2] = np.nan
X[10, 1] = np.nan


In [27]:
# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)


In [28]:
# Fill missing values with mean
df_filled = df.fillna(df.mean())
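
When the data is later split for modeling, the fill values should be computed on the training rows only and then reused for the test rows, so that no information leaks from the test set. The sketch below uses a small hand-made DataFrame and the median; the column names are placeholders.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({
    'height': [1.0, 2.0, np.nan, 4.0, 100.0],   # 100.0 is an extreme value
    'width':  [10.0, np.nan, 30.0, 40.0, 50.0],
})
test = pd.DataFrame({
    'height': [np.nan, 3.0],
    'width':  [20.0, np.nan],
})

# Learn the fill values from the training data only
fill_values = train.median()

# Apply the same fill values to both sets (matched by column name)
train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(fill_values)
print(test_filled)
```

Here the median of `height` is 3.0, whereas the mean would be pulled up to 26.75 by the single extreme value, which is why the median is often the safer choice when a column contains outliers.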

- b) Deletion Techniques: We can delete rows or columns that contain missing values when only a small share of the data is affected and removing it does not significantly change the analysis. Use deletion with caution, as it can discard valuable data.



In [29]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

In [30]:
iris = load_iris()
X, y = iris.data, iris.target

In [31]:
# Introduce some missing values
X[0, 0] = np.nan
X[5, 2] = np.nan
X[10, 1] = np.nan

In [32]:

# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)

In [33]:
# Drop rows with missing values
df_dropped = df.dropna()



In [34]:
# Drop columns with missing values
df_dropped_cols = df.dropna(axis=1)
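
Before deleting anything, it helps to measure how much is missing in each column. The sketch below drops mostly-empty columns first and only then drops the remaining incomplete rows. The DataFrame and the 50% cutoff are illustrative assumptions, not recommended defaults.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    'a': [1.0, 2.0, np.nan, 4.0, 5.0],
    'b': [np.nan, np.nan, np.nan, 1.0, np.nan],   # mostly missing
    'c': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Fraction of missing values per column
missing_share = df_demo.isna().mean()

# Keep only columns where at most half of the values are missing
max_missing = 0.5  # illustrative cutoff
cols_to_keep = missing_share[missing_share <= max_missing].index
df_cols = df_demo[cols_to_keep]

# Then drop the few remaining rows that still contain a missing value
df_clean = df_cols.dropna()

print(missing_share)
print(df_clean.shape)
```

Calling `dropna()` directly on `df_demo` would keep only one of the five rows, because column `b` is missing almost everywhere. Removing that column first preserves four rows.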


- c) Advanced Techniques: More advanced techniques such as K-nearest neighbors imputation, regression imputation, and other machine learning-based methods estimate each missing value from patterns observed in the rest of the data, for example from the most similar rows or from correlated features.


In [35]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer


In [36]:
iris = load_iris()
X, y = iris.data, iris.target


In [38]:
# Introduce some missing values
X[0, 0] = np.nan
X[5, 2] = np.nan
X[10, 1] = np.nan

In [39]:
# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)

In [40]:
# Use K-nearest neighbors imputer to fill missing values
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = knn_imputer.fit_transform(df)
df_imputed = pd.DataFrame(df_imputed, columns=iris.feature_names)
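
Regression imputation, mentioned above alongside KNN, can be sketched in a few lines: fit a model on the rows where the target column is observed, then predict it where it is missing. The example uses a straight-line fit with `np.polyfit` on synthetic data; the column names `x` and `y` are placeholders. scikit-learn's `IterativeImputer` applies a related idea across many columns.

```python
import numpy as np
import pandas as pd

# Synthetic data in which y depends linearly on x (y = 3x + 2)
df_reg = pd.DataFrame({'x': np.arange(10, dtype=float)})
df_reg['y'] = 3 * df_reg['x'] + 2
df_reg.loc[[2, 7], 'y'] = np.nan          # hide two values

observed = df_reg['y'].notna()

# Fit y ~ slope * x + intercept on the complete rows only
slope, intercept = np.polyfit(
    df_reg.loc[observed, 'x'].to_numpy(),
    df_reg.loc[observed, 'y'].to_numpy(),
    1,
)

# Predict y for the rows where it is missing and write the predictions back
df_reg.loc[~observed, 'y'] = slope * df_reg.loc[~observed, 'x'] + intercept

print(df_reg.loc[[2, 7]])
```

Because the synthetic relationship is exactly linear, the hidden values are recovered almost perfectly. On real data the imputed values carry the model's error, and they understate the natural spread of the column.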

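- d) By using Skewness and Kurtosis: Skewness measures the asymmetry of the data distribution, while kurtosis describes how heavy its tails are (often loosely described as peakedness or flatness). These measures can guide the choice of fill value: the mean is pulled toward the long tail of a skewed distribution whereas the median is not, so the median is often preferred for strongly skewed features. The example below applies one simple rule of this kind based on the sign of the skewness.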

In [41]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

In [42]:
iris = load_iris()
X, y = iris.data, iris.target

In [43]:
# Introduce some missing values
X[0, 0] = np.nan
X[5, 2] = np.nan
X[10, 1] = np.nan


In [44]:
# Convert to DataFrame
df = pd.DataFrame(X, columns=iris.feature_names)


In [45]:
# Calculate skewness and kurtosis for each feature
# (kurtosis is computed for inspection; the imputation loop branches on skewness only)
skewness = df.skew()
kurtosis = df.kurtosis()

In [46]:
# Loop through each feature and impute missing values based on skewness.
# The result is assigned back to the column: fillna(inplace=True) on df[feature] is
# chained assignment, which newer pandas versions warn about and which does not
# modify df under copy-on-write.
for feature in df.columns:
    if df[feature].isnull().sum() > 0:
        if skewness[feature] > 0:
            # If positive skewness, fill missing values with median
            df[feature] = df[feature].fillna(df[feature].median())
        elif skewness[feature] < 0:
            # If negative skewness, fill missing values with mean
            df[feature] = df[feature].fillna(df[feature].mean())
        else:
            # If skewness is exactly zero (rare in practice), fill missing values with mode
            df[feature] = df[feature].fillna(df[feature].mode()[0])