# Outlier Handling
This notebook applies several outlier detection and removal techniques with the goal of improving classification model performance. Outliers can distort feature distributions and degrade model accuracy, especially in high-dimensional biological datasets such as miRNA expression profiles.

The following methods are used to identify and process outliers, in the order they appear below:

- Percentile capping with winsorize (lowest and highest 2% of each feature)

- Isolation Forest

- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

- One-Class Support Vector Machine (OCSVM)

- Statistical filtering with the 1.5 × IQR rule (rows above the upper fence are dropped)

In [None]:
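# Imports used throughout this notebook
import pandas as pd
from scipy.stats.mstats import winsorize
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Read the miRNA data frame with the new features
df_mir_new = pd.read_csv('df_mir_new.csv', index_col='File_ID')
df_mir_new.head()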

In [None]:
# First strategy: cap outliers by winsorizing each feature. The lowest 2% and highest 2% of
# values are replaced with the nearest value that falls inside those limits
columns = df_mir_new.columns
outlier_capping_columns = columns[:21]
for column in outlier_capping_columns:
  df_mir_new[column] = winsorize(df_mir_new[column], limits=[0.02, 0.02])
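
To make the effect of `limits=[0.02, 0.02]` concrete, here is a minimal sketch on a toy array. The values are hypothetical and are not taken from the miRNA data; the point is only that winsorizing replaces the extreme values instead of removing rows.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Toy feature (hypothetical values): 50 observations with one extreme value at each end
values = np.arange(50, dtype=float)
values[0] = -100.0
values[49] = 1000.0

# With 50 observations, limits=[0.02, 0.02] caps int(0.02 * 50) = 1 value per tail
capped = np.asarray(winsorize(values, limits=[0.02, 0.02]))

print(values.min(), values.max())  # extremes before capping
print(capped.min(), capped.max())  # extremes after capping
print(len(values), len(capped))    # the number of observations is unchanged
```

Because no rows are dropped, this strategy keeps every sample, which may matter when the dataset is small.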

In [None]:
df_mir_new.to_csv('df_mir_new_capped.csv')
from google.colab import files
# Download the capped miRNA data frame
files.download('df_mir_new_capped.csv')

In [None]:
# Reading miRNA data frame with new features
df_mir_new = pd.read_csv('df_mir_new.csv', index_col='File_ID')
df_mir_new.head()

In [None]:
# The second strategy involves using the IsolationForest machine learning model to detect isolated extreme values
df_mir_new_columns = df_mir_new.columns
outlier_detection_columns = df_mir_new_columns[:21]
iso = IsolationForest(contamination=0.1, random_state=42)
df_mir_new['outlier'] = iso.fit_predict(df_mir_new[outlier_detection_columns])
# Keep only the rows labelled as inliers (1); IsolationForest labels outliers as -1
df_mir_new_iso = df_mir_new[df_mir_new['outlier'] == 1].drop(columns='outlier')
df_mir_new_iso.shape
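
The label convention of `fit_predict` can be checked on synthetic data. The sketch below uses a hypothetical toy frame (the column names `f1` to `f3` and the planted extreme rows are assumptions for illustration only) with the same `contamination` and `random_state` as above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 190 typical samples plus 10 samples shifted far from the bulk
typical = rng.normal(loc=0.0, scale=1.0, size=(190, 3))
extreme = rng.normal(loc=12.0, scale=1.0, size=(10, 3))
toy = pd.DataFrame(np.vstack([typical, extreme]), columns=['f1', 'f2', 'f3'])

iso = IsolationForest(contamination=0.1, random_state=42)
labels = iso.fit_predict(toy)  # 1 = inlier, -1 = outlier

# Keep only the inliers
toy_clean = toy[labels == 1]
print(pd.Series(labels).value_counts())
print(toy.shape, '->', toy_clean.shape)
```

Note that `contamination=0.1` sets the score threshold so that roughly 10% of the rows are flagged, whether or not that many genuine outliers exist, so the value is worth tuning against downstream model performance.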

In [None]:
df_mir_new_iso.to_csv('df_mir_new_iso.csv')
from google.colab import files
# Download the new miRNA df without extreme values
files.download('df_mir_new_iso.csv')

In [None]:
# Reading miRNA data frame with new features
df_mir_new = pd.read_csv('df_mir_new.csv', index_col='File_ID')
df_mir_new.head()

In [None]:
# Third strategy: use DBSCAN, an unsupervised density-based clustering model
scaler = StandardScaler()
scaled = scaler.fit_transform(df_mir_new[outlier_detection_columns]) # Standardize the features first so that eps applies on a common scale
# Build the clustering model
db = DBSCAN(eps=1.5, min_samples=3)
df_mir_new['outlier'] = db.fit_predict(scaled)
# Keep only the rows assigned to a cluster; DBSCAN labels noise points (outliers) as -1
df_mir_new_dbscan = df_mir_new[df_mir_new['outlier'] != -1].drop(columns='outlier')
df_mir_new_dbscan.shape
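
As a rough illustration of how DBSCAN marks noise, the sketch below runs the same `eps` and `min_samples` on hypothetical two-dimensional data: one dense cluster plus three isolated points. The data and variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: one dense cluster plus three isolated points
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
isolated = np.array([[50.0, 50.0], [-50.0, 60.0], [70.0, -40.0]])
X = np.vstack([cluster, isolated])

# eps is measured in standardized units, so scale before clustering
scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(scaled)

# Points that belong to no cluster receive the label -1
print(np.unique(labels, return_counts=True))
X_clean = X[labels != -1]
print(X.shape, '->', X_clean.shape)
```

Unlike Isolation Forest with a fixed `contamination`, the number of rows removed here is not set in advance; it depends on how `eps` and `min_samples` interact with the density of the data.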

In [None]:
df_mir_new_dbscan.to_csv('df_mir_new_dbscan.csv')
from google.colab import files
# Download the new miRNA df without outliers
files.download('df_mir_new_dbscan.csv')

In [None]:
# Reading miRNA data frame with new features
df_mir_new = pd.read_csv('df_mir_new.csv', index_col='File_ID')
df_mir_new.head()

In [None]:
# Fourth strategy: flag outliers with a one-class support vector machine
scaler = StandardScaler()
scaled = scaler.fit_transform(df_mir_new[outlier_detection_columns]) # Standardize the features before fitting
# Build the classifier
ocsvm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05)
ocsvm.fit(scaled)
labels = ocsvm.predict(scaled)
# Keep only the rows predicted as inliers (1); outliers are labelled -1.
# df_mir_new was re-read above, so it has no 'outlier' column to drop
df_mir_new_ocsvm = df_mir_new[labels == 1]
df_mir_new_ocsvm.shape
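
The sketch below shows the same filtering pattern on a hypothetical toy frame (random values and column names `f1` to `f4` are assumptions). It filters with a boolean NumPy mask built from the predictions, so no helper column has to be added to the data frame and then dropped again.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical data frame standing in for the expression features
toy = pd.DataFrame(rng.normal(size=(200, 4)), columns=['f1', 'f2', 'f3', 'f4'])

scaled = StandardScaler().fit_transform(toy)
ocsvm = OneClassSVM(kernel='rbf', gamma='auto', nu=0.05)
labels = ocsvm.fit_predict(scaled)  # 1 = inlier, -1 = outlier

# The boolean mask selects rows by position and leaves the columns untouched
toy_clean = toy[labels == 1]
print((labels == -1).mean())  # fraction of rows flagged as outliers
print(toy.shape, '->', toy_clean.shape)
```

In scikit-learn, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors, so the share of flagged rows is tied to `nu` but is not guaranteed to equal it.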

In [None]:
df_mir_new_ocsvm.to_csv('df_mir_new_ocsvm.csv')
from google.colab import files
# Download the new miRNA df without outliers
files.download('df_mir_new_ocsvm.csv')

In [None]:
# Reading miRNA data frame with new features
df_mir_new = pd.read_csv('df_mir_new.csv', index_col='File_ID')
df_mir_new.head()

In [None]:
# Last strategy: drop the rows whose value exceeds a feature's upper fence, Q3 + 1.5 × IQR
columns = df_mir_new.columns
outlier_wrangling_columns = columns[:21]
for column in outlier_wrangling_columns: # Quartiles are recomputed on the rows that remain after each drop
  Q1 = df_mir_new[column].quantile(0.25)
  Q3 = df_mir_new[column].quantile(0.75)
  inter_quartile_range = Q3 - Q1
  upper_extreme = Q3 + inter_quartile_range*1.5 # Upper fence for this feature
  rows_to_drop = df_mir_new[df_mir_new[column] > upper_extreme].index
  df_mir_new.drop(rows_to_drop, axis=0, inplace=True) # Drop the rows that have values above the upper fence
  print(df_mir_new.shape) # Show how many rows remain after each feature
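
For comparison, here is a minimal vectorized sketch of the upper-fence rule on a hypothetical toy frame. It is a variant, not the same procedure as the loop above: the fences are computed once on the full data, so the result does not depend on the order of the columns, whereas the loop recomputes the quartiles after each drop.

```python
import pandas as pd

# Hypothetical toy frame: each feature has one value far above the rest
toy = pd.DataFrame({
    'f1': [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    'f2': [10.0, 11.0, 12.0, 13.0, 500.0, 14.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # one fence per feature, computed on the full data

# Flag a row if any feature exceeds its fence, then drop the flagged rows in one step
above_fence = (toy > upper_fence).any(axis=1)
toy_clean = toy[~above_fence]
print(upper_fence)
print(toy_clean)
```

Which variant is preferable depends on the goal: the sequential loop can remove more rows, because each drop tightens the quartiles used for the following features.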

In [None]:
df_mir_new.to_csv('df_mir_new_statistical.csv')
from google.colab import files
# Download the miRNA df without outliers
files.download('df_mir_new_statistical.csv')