# Data preprocessing

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from collections import Counter
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.precision', 3)

In [2]:
# Extra imports
from pandas import read_csv
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import LocalOutlierFactor
from sklearn import preprocessing
from pandas.plotting import scatter_matrix
from scipy.stats import boxcox
from statsmodels.genmod.generalized_linear_model import GLM

## Reading CSV

In [3]:
raw_ILDS = read_csv("train_features_ILDS.csv", delimiter=',')
raw_ILDS.columns = ['Age', 'Female', 'TB', 'DB', 'Alkphos', 'Sgpt', 
                    'Sgot', 'TP', 'ALB', 'AR']
raw_ILDS['target'] = read_csv("train_labels_ILDS.csv", delimiter=',')

raw_ILDS.shape  # size

(462, 11)

Editing dataframe

- Bilirubin (TB & DB): product of the degradation of hemoglobin. The liver should process it and make it water-soluble so that it can be expelled from the body. Direct bilirubin is the bilirubin already processed by the liver, indirect bilirubin is the unprocessed part, and total bilirubin is the sum of the two.
- Alkaline Phosphatase (Alkphos): enzyme that breaks down phosphate groups. It is found everywhere, but is more concentrated in the liver and bones. Liver damage can lead to an increased release of this enzyme.
- Alanine Aminotransferase (Sgpt): enzyme that participates in the metabolism of amino acids. Liver damage can lead to an increase of this enzyme.
- Aspartate Aminotransferase (Sgot): enzyme that participates in the metabolism of amino acids (just like Sgpt). Liver, heart or muscle injuries can lead to an increased concentration of this enzyme.
- Total proteins (TP): total amount of proteins in blood. Problems in the liver or kidneys may lead to a decrease of this concentration.
- Albumin (ALB): keeps fluids inside blood vessels and transports substances. Problems in the liver may decrease its concentration, as this protein is produced exclusively by the liver.
- Albumin to Globulin ratio (AR): a low ratio may indicate liver problems.

In [4]:
raw_ILDS.head()

ValueError: Length mismatch: Expected axis has 11 elements, new values have 10 elements

In [None]:
raw_ILDS.describe()

In [None]:
raw_ILDS['Female'].value_counts() 

In [None]:
raw_ILDS.loc[raw_ILDS['Female'] == 1, 'target'].value_counts()  # target counts among women

In [None]:
raw_ILDS.loc[raw_ILDS['Female'] == 0, 'target'].value_counts()  # target counts among men
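
The two per-sex counts above can also be read from a single contingency table. A minimal sketch on toy data (the values are hypothetical, standing in for `raw_ILDS`), using `pd.crosstab`:

```python
import pandas as pd

# Toy data standing in for raw_ILDS (hypothetical values)
toy = pd.DataFrame({"Female": [0, 0, 1, 1, 0], "target": [1, 0, 1, 1, 1]})

# Rows: Female (0/1), columns: target (0/1), cells: number of patients
table = pd.crosstab(toy["Female"], toy["target"])
```

On the real data the call would be `pd.crosstab(raw_ILDS['Female'], raw_ILDS['target'])`.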

TODO: there are more men than women, so some resampling might help.

TODO2: There are also more healthy people than sick ones. We could try resampling on both, giving priority to fixing the imbalance in the labels.
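
A minimal sketch of the resampling idea in the TODOs above, assuming plain random oversampling with replacement of the smaller `target` class. The helper name `oversample_minority` and the toy frame are hypothetical; nothing here has been run on the real data.

```python
import pandas as pd

def oversample_minority(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Randomly duplicate rows so every class of `label` is as large as the biggest one."""
    counts = df[label].value_counts()
    n_max = counts.max()
    parts = []
    for cls, n in counts.items():
        rows = df[df[label] == cls]
        if n < n_max:
            # Draw the missing rows with replacement from the same class
            extra = rows.sample(n=n_max - n, replace=True, random_state=seed)
            rows = pd.concat([rows, extra])
        parts.append(rows)
    return pd.concat(parts).reset_index(drop=True)

# Toy data standing in for raw_ILDS (hypothetical values)
toy = pd.DataFrame({"Female": [0, 0, 0, 1, 0, 1], "target": [1, 1, 1, 1, 0, 0]})
balanced = oversample_minority(toy, "target")
balanced_counts = balanced["target"].value_counts().to_dict()
```

If this is used, it should be applied only to the training split, after any train/validation split, so that duplicated rows do not leak into the validation data.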

## Creating new variables

If needed, we create the extra variables before doing anything else. I wanted a variable for Indirect Bilirubin, to see how much bilirubin has not been processed by the liver.

Note: since the minimum is negative, there is clearly at least one error in the TB / DB data. There is also one very extreme case where the unprocessed bilirubin is very high.

TODO: I would handle this as follows: in the cases where TB < DB, set $DB := TB$.
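
The correction proposed in the TODO can be sketched as follows, on toy values (hypothetical numbers standing in for `raw_ILDS`). It only illustrates the rule $DB := TB$ where $TB < DB$; it does not decide whether that rule is the right clinical choice.

```python
import pandas as pd

# Toy values standing in for raw_ILDS (hypothetical numbers)
toy = pd.DataFrame({"TB": [1.0, 0.8, 3.0], "DB": [0.3, 1.1, 3.0]})

# Rows where direct bilirubin exceeds the total are inconsistent
bad = toy["TB"] < toy["DB"]

# Apply DB := TB on those rows only
toy.loc[bad, "DB"] = toy.loc[bad, "TB"]

# Indirect bilirubin is now non-negative by construction
toy["IB"] = toy["TB"] - toy["DB"]
```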

In [None]:
raw_ILDS['IB'] = raw_ILDS['TB'] - raw_ILDS['DB']
raw_ILDS['IB'].describe()

## Dealing with missing values

NOTE: Since we plot histograms so often, I wrote a function that draws the histogram and the boxplot together, to speed things up. I have not touched anything else you did.

In [None]:
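def histbox(data, column: str, n_bins: int = 10) -> None:
    """Draw a boxplot (left) and a histogram (right) of `column`, split by 'Female'."""
    fig, axes = plt.subplots(1, 2, gridspec_kw={'width_ratios': [1, 4]}, figsize=(9, 4))
    # Boxplot of the column per target class (sn.boxplot has no `column` argument)
    sn.boxplot(data=data, x='target', y=column, hue='Female', ax=axes[0])
    # Bivariate histogram of the column against the target
    sn.histplot(data=data, x=column, y='target', hue='Female', ax=axes[1], bins=n_bins)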

In [None]:
(raw_ILDS.Age == 90).value_counts()

### Age

In [None]:
histbox(raw_ILDS, "Age", 16)

### Total Bilirubin

In [None]:
histbox(raw_ILDS, "TB", 16)  # TODO2: a log transform may work better.

In [None]:
(raw_ILDS.TB == 75).value_counts()  # TODO: maybe it is not a missing value and someone really had it this high?

It seems that 75 is a _missing_ value, so let's delete it.

In [None]:
ILDS = raw_ILDS[raw_ILDS.TB!=75]
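
As an alternative to dropping the row, the suspicious value could be marked as missing so that the other measurements of that patient are kept. A minimal sketch on toy data, assuming 75 really is a sentinel (which the TODO above questions); the numbers are hypothetical.

```python
import pandas as pd

# Toy data standing in for raw_ILDS (hypothetical values)
toy = pd.DataFrame({"TB": [0.7, 75.0, 1.8], "Age": [45, 60, 33]})

# Option used above: drop the whole row
dropped = toy[toy["TB"] != 75]

# Alternative: keep the row and mark only TB as missing (NaN)
marked = toy.copy()
marked["TB"] = marked["TB"].mask(marked["TB"] == 75)

n_missing = int(marked["TB"].isna().sum())
```

Marking instead of dropping leaves the choice between imputation and deletion open for a later step.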

In [None]:
histbox(ILDS, "TB", 16)  # plot the filtered frame to see the effect of the deletion

In [None]:
raw_ILDS['log_TB'] = np.log(raw_ILDS['TB'])
histbox(raw_ILDS, "log_TB", 16)

### Direct Bilirubin

In [None]:
histbox(raw_ILDS, "DB", 16)  # TODO: log transform as well?

In [None]:
raw_ILDS['log_DB'] = np.log(raw_ILDS['DB'])
histbox(raw_ILDS, "log_DB", 16)

### Indirect Bilirubin

In [None]:
histbox(raw_ILDS, "IB", 16)  # TODO: log transform as well?

In [None]:
raw_ILDS['log_IB'] = np.log(raw_ILDS['IB'])
histbox(raw_ILDS, "log_IB", 16)
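
`np.log` returns `NaN` for negative inputs and `-inf` for zeros, and the note in "Creating new variables" says `IB` has at least one negative value. A minimal sketch, on toy data (hypothetical values), that takes the log only where the value is strictly positive and leaves `NaN` elsewhere:

```python
import numpy as np
import pandas as pd

# Toy indirect-bilirubin values, including a zero and a negative one
ib = pd.Series([0.7, 0.0, -0.2, 2.5])

# Replace non-positive values by NaN first, then take the log
log_ib = np.log(ib.where(ib > 0))

# Number of rows for which the log is undefined
n_undefined = int(log_ib.isna().sum())
```

Counting the undefined rows makes it visible how many observations the transform would lose.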

### Alkaline Phosphatase

In [None]:
histbox(raw_ILDS, "Alkphos", 16)  # TODO: log transform as well?

In [None]:
raw_ILDS['log_Alkphos'] = np.log(raw_ILDS['Alkphos'])
histbox(raw_ILDS, "log_Alkphos", 16)

### Alanine Aminotransferase

In [None]:
histbox(raw_ILDS, "Sgpt", 16)  # TODO: log transform as well?

Sgpt==2000 may be a missing value

In [None]:
(raw_ILDS.Sgpt==2000).value_counts()

In [None]:
ILDS = ILDS[ILDS.Sgpt != 2000]

In [None]:
histbox(ILDS, "Sgpt", 16)  # plot the filtered frame to see the effect of the deletion

### Aspartate Aminotransferase

In [None]:
histbox(raw_ILDS, "Sgot", 16)  # TODO: nothing looks normally distributed :D

In [None]:
(raw_ILDS.Sgot == 2946).value_counts()  # TODO: maybe it is not a missing value, and removing only this value does absolutely nothing?

In [None]:
ILDS = ILDS[ILDS.Sgot != 2946]

In [None]:
histbox(ILDS, "Sgot", 16)  # plot the filtered frame to see the effect of the deletion

### Total Proteins

In [None]:
histbox(raw_ILDS, "TP", 16)

### Albumin

In [None]:
histbox(raw_ILDS, "ALB", 16) # Cool

### Albumin and Globulin Ratio

In [None]:
histbox(raw_ILDS, "AR", 16)  # TODO: log transform as well?