# Preprocessing

In this notebook we assess which steps are needed to preprocess the data for training the SOM. This answers question 4 of the assignment.

> Get the data into the form needed for training SOMs. Describe your preprocessing steps (e.g. transcoding, scaling), why you did it and how you did it. Specifically, if your dataset turns out to be extremely large (very high-dimensional and huge number of vectors so that it does not fit into memory for training SOMs) you may choose to apply subsampling for the training data. 


## Libraries and Setup

### Install Libraries

In [None]:
%pip install liac-arff numpy pandas scikit-learn matplotlib seaborn scipy --quiet

### Import Libraries

In [1]:
import io
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from scipy.io import arff
from scipy.stats import zscore

from sklearn import preprocessing

### Data Loading

In [2]:
with open("data/phpchCuL5.arff", 'r') as f:
    data, meta = arff.loadarff(io.StringIO(f.read()))

df = pd.DataFrame(data)
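
One detail worth keeping in mind after loading: `scipy.io.arff.loadarff` returns nominal attributes as byte strings, so categorical columns hold values such as `b'...'` rather than `str`. The following is a minimal sketch of how they could be decoded. The two-row ARFF snippet, its attribute names, and the `decode_bytes_columns` helper are made up for illustration and are not part of the dataset or the notebook.

```python
import io

import pandas as pd
from scipy.io import arff

# Tiny made-up ARFF document, only for illustration
toy_arff = """@relation toy
@attribute value numeric
@attribute group {a,b}
@data
1.5,a
2.5,b
"""


def decode_bytes_columns(frame):
    """Return a copy in which byte-string columns are decoded to str."""
    out = frame.copy()
    for col in out.columns:
        # Only touch columns whose first entry is a bytes object
        if len(out) and isinstance(out[col].iloc[0], bytes):
            out[col] = out[col].map(
                lambda v: v.decode("utf-8") if isinstance(v, bytes) else v
            )
    return out


toy_data, toy_meta = arff.loadarff(io.StringIO(toy_arff))
toy_df = decode_bytes_columns(pd.DataFrame(toy_data))
print(toy_df["group"].tolist())
```

Applied to the notebook, the same helper could be called as `decode_bytes_columns(df)` right after the DataFrame is created.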

## Processing

There are two main steps in the preprocessing of the data. First, we need to transform the data into a format that can be used by the SOM. Second, we need to scale the data so that the SOM can be trained properly.

The SOMToolbox expects two files:

- **Input Vector File (.vec):** 
    This file contains the data the SOM is trained on. Each line corresponds to one input vector (i.e., one data point, here one sample with its protein expression levels), written as numeric values separated by spaces or tabs.

- **Template Vector File (.tv):** 
    This file describes the structure of the input vectors. It lists the names of the variables (e.g., protein names), optionally with further meta-information, and is used to interpret the input vector file correctly.

We start by dropping rows with missing values and normalizing the numerical data.

In [3]:
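# Columns 1..77 hold the protein expression levels (column 0 is 'MouseID')
numeric_columns = df.columns[1:78]
# Drop rows with missing protein values; copy so that later column assignments
# work on an independent DataFrame and do not trigger SettingWithCopyWarning
df_clean = df.dropna(subset=numeric_columns).copy()
numeric_df = df_clean[numeric_columns]

# Scale every protein column to the range [0, 1]
scaler = preprocessing.MinMaxScaler()
normalized_data = scaler.fit_transform(numeric_df)

normalized_df = pd.DataFrame(normalized_data, columns=numeric_columns)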

In [4]:
# Keep the class label next to the scaled features (row order is unchanged)
normalized_df['class'] = df_clean['class'].values
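
To make the scaling step concrete, here is a small sketch on made-up numbers. It computes min-max scaling by hand, `(x - min) / (max - min)` per column, and compares the result with `MinMaxScaler`. The `toy` array is purely illustrative.

```python
import numpy as np
from sklearn import preprocessing

# Made-up values: two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [4.0, 200.0]])

# Manual min-max scaling, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same transformation via scikit-learn
scaled = preprocessing.MinMaxScaler().fit_transform(toy)

# Both approaches should agree, and every column should span [0, 1]
print(np.allclose(manual, scaled))
print(manual.min(axis=0), manual.max(axis=0))
```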

Categorical columns are converted to integers with the `LabelEncoder` from `sklearn.preprocessing`.

In [None]:
# Encoding categorical columns
categorical_columns = ['Genotype', 'Treatment', 'Behavior']
label_encoders = {}
for col in categorical_columns:
    le = preprocessing.LabelEncoder()
    df_clean[col] = le.fit_transform(df_clean[col])
    label_encoders[col] = le  # Storing the label encoder for each column
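
As a small illustration of what the encoder does, and why it is useful to keep the fitted encoders around, consider the sketch below. The category values are made up and do not come from the dataset.

```python
from sklearn import preprocessing

# Made-up category values, only for illustration
toy_values = ["treated", "control", "treated", "control"]

encoder = preprocessing.LabelEncoder()
codes = encoder.fit_transform(toy_values)

# classes_ is sorted, so each integer code is the position in this array
print(list(encoder.classes_))
print(codes.tolist())

# The stored encoder can translate the codes back to the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```

With the encoders stored in `label_encoders`, the integer codes written to the output files can later be mapped back to the original category names in the same way.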


Now we'll convert the DataFrame into the SOMLib vector format.

In [5]:
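# Export numeric features only: scaled protein levels plus the encoded categorical columns
feature_columns = list(numeric_columns) + categorical_columns
export_df = pd.concat([normalized_df[numeric_columns],
                       df_clean[categorical_columns].reset_index(drop=True)], axis=1)

# SOMLib .vec layout: header, then one line per sample with the values first and the label last
vec_content = f'$TYPE vec\n$XDIM {len(export_df)}\n$YDIM 1\n$VEC_DIM {len(feature_columns)}\n'
for index, row in zip(df_clean.index, export_df.to_numpy()):
    vec_content += ' '.join(map(str, row)) + f' vec_{index}\n'

Next, the `.tv` file:

In [6]:
# SOMLib .tv layout: $XDIM is the number of columns per line (index, name), $YDIM the number of vectors
tv_content = f'$TYPE template\n$XDIM 2\n$YDIM {len(export_df)}\n$VEC_DIM {len(feature_columns)}\n'
for i, col in enumerate(feature_columns):
    tv_content += f'{i} {col}\n'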

Lastly, we save the files.

In [7]:
vec_file_path = 'data/dataset.vec'
tv_file_path = 'data/dataset.tv'

with open(vec_file_path, 'w') as vec_file:
    vec_file.write(vec_content)
    
with open(tv_file_path, 'w') as tv_file:
    tv_file.write(tv_content)
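
Before handing the files to the SOMToolbox, a quick consistency check of the vector file can catch header mismatches early. The sketch below compares the `$XDIM` and `$VEC_DIM` header entries with the actual data lines. `check_vec_content` is a hypothetical helper, not part of any library, and the `toy_vec` string is made up for illustration.

```python
def check_vec_content(content):
    """Check that the header of a .vec string matches its data lines."""
    header = {}
    rows = []
    for line in content.splitlines():
        if not line.strip():
            continue
        if line.startswith("$"):
            # Header lines look like "$KEY value"
            key, value = line.split(maxsplit=1)
            header[key] = value
        else:
            rows.append(line.split())
    n_vectors = int(header["$XDIM"])
    vec_dim = int(header["$VEC_DIM"])
    # Each data line holds vec_dim values plus one label token
    counts_ok = len(rows) == n_vectors
    widths_ok = all(len(tokens) == vec_dim + 1 for tokens in rows)
    return counts_ok and widths_ok


# Made-up content with two vectors of dimension 3
toy_vec = (
    "$TYPE vec\n$XDIM 2\n$YDIM 1\n$VEC_DIM 3\n"
    "0.1 0.2 0.3 vec_0\n"
    "0.4 0.5 0.6 vec_1\n"
)
print(check_vec_content(toy_vec))
```

In the notebook, the same check could be run on `vec_content`.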

### Summary of Preprocessing

1. Transcoding: The original dataset was loaded from an `ARFF` file and converted into a pandas DataFrame to facilitate manipulation, analysis, and description of the data (see also the `profile.ipynb` notebook).

2. Handling Missing Values: SOM training compares complete input vectors with the weight vectors of the map, so every feature needs a value. Rows with missing values in the numerical protein expression levels were therefore removed.
   This step ensures that the SOM algorithm receives consistent and complete data for training.

3. Scaling (Normalization): The protein expression levels were normalized to ensure that each protein has equal weight during the SOM training. 
   Without normalization, proteins with higher absolute values could disproportionately influence the map, leading to biased results. 
   Normalization was done using the `MinMaxScaler` from the `sklearn.preprocessing` module, which scales each feature to a given range, in this case, `[0, 1]`. This scaling is crucial for algorithms like SOM that are sensitive to the scale of the data.

4. Encoding Categorical Data: The categorical columns 'Genotype', 'Treatment', and 'Behavior' were encoded numerically. Since each of them has only two categories, label encoding is sufficient: it maps the two values to 0 and 1 and does not impose an artificial ordering, as it would for columns with more than two categories.

Lastly, the data was converted into the SOMLib vector format, which is the format required by the SOMToolbox.
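
The effect described in step 3 can be illustrated with two made-up features on different scales. Before scaling, the Euclidean distance is dominated by the feature with the larger values. After min-max scaling, both features contribute comparably. The `samples` array and the `distance` helper are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

# Made-up samples: feature 0 lies in [0, 1], feature 1 in the hundreds
samples = np.array([[0.0, 100.0],
                    [1.0, 110.0],
                    [0.0, 300.0]])


def distance(a, b):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))


# Raw data: samples 0 and 1 differ across the full range of feature 0,
# samples 0 and 2 differ across the full range of feature 1
raw_01 = distance(samples[0], samples[1])
raw_02 = distance(samples[0], samples[2])

scaled = preprocessing.MinMaxScaler().fit_transform(samples)
scaled_01 = distance(scaled[0], scaled[1])
scaled_02 = distance(scaled[0], scaled[2])

print(f"raw:    d(0,1)={raw_01:.2f}  d(0,2)={raw_02:.2f}")
print(f"scaled: d(0,1)={scaled_01:.2f}  d(0,2)={scaled_02:.2f}")
```

On the raw data the second distance is many times larger than the first, although both pairs differ across the full range of one feature. After scaling, the two distances are of comparable size.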