TODO:
- Describe HBc.. Columns

# 4Da Medical Data Analysis and Visualisation FS
**Authors:** Roman Studer, Alexandre Rau

**Goal:** The goal of the Medical Challenge is the automated classification of the eye disease uveitis. The aim is to find the best possible model for classifying the disease based on a data set of more than 1,000 patients and to identify the features that are most important for the prediction.

## Detailed Description
The detailed description was taken from the job description on the [DS-Spaces website](https://ds-spaces.technik.fhnw.ch/medical-data-analysis-and-visualisation-fs/#menu-second) on 08/03/2021: 

In this international challenge, you will work in collaboration with Prof. Dr. Nida Şen (MD, MHS, Director of the Uveitis Clinic at the National Eye Institute, Washington DC) and her team, including Dr. Shilpa Kodati. Dr. David Kuo (MD, Biomed Eng.) of the University of California, San Diego is also glad to support the project (in an asynchronous manner due to the large differences in time zones). You can listen to a 25-min podcast interview with Dr. Şen to get a first impression of her research agenda. Using real data obtained from 1075 patients with 55 markers (more detailed description below), you will delve into computational methods for medical research to identify which markers are relevant for the diagnosis of uveitis, a common eye disease.

**International and interdisciplinary setting** As the description above suggests, an exciting and challenging aspect of this challenge is in its interdisciplinarity and internationality. You will experience different academic cultures, professional roles/backgrounds and fully practice your English in a professional setting. The collaborative international work means several virtual meetings with our international partner and requires some flexibility due to time zones. These interactions will serve as an opportunity to practice and reflect on questions related to intercultural competence, i.e., intercultural communication and collaboration. Since it is a medical challenge you will also need to learn new terminology and communicate your ideas to a client of a different field.

**Description of the disease**  Uveitis is a group of inflammatory diseases of the uveal tract of the eye. It is a sight-threatening autoimmune disease and it is responsible for approximately 10-15% of blindness. This disease has a real and sizable impact: Besides its life-changing negative implications for the individual, it also has consequences for the society, because it affects people in their most productive work years and thus leads to a socioeconomic burden. Uveitis is a multifactorial condition and its causes are not fully understood despite recent advances. Challenges in clinical uveitis research include disease heterogeneity, lack of understanding of what triggers and what propagates the disease. 

## Tasks
Imagine that you are a part of a scientific team, working in the team’s data science unit. The above-mentioned dataset, already pre-processed & cleaned, is delivered to you with the following request: Using an exploratory (agnostic) data science and machine learning approach, analyze which of the variables might 

a) have strong correlations with each other, and with the diagnosed diseases of patients and patient groups

b) lead to the identification (prediction) of the uveitis subtypes which may inform eventual diagnosis

c) compare the HLA haplotypes in healthy controls to those in the project data, check if they correlate with specific subgroups. This will help verifying genetic predictors of certain subtypes of disease in this uveitis cohort. Data from healthy controls, i.e., normative data is publicly available as a separate data set (also see “Data” section below)

d) analyze if the steps in the preprocessing could be improved

e) which missing value strategy is best.

We expect that you use a combination of exploratory data analysis (see competence ‘eda‘), probability testing, machine learning (see competences ule, sul) and visual analytics (see competence ‘van‘) to conduct your analyses. Using machine learning (ML), you will search for patterns and anomalies in the data, profile patients and/or patient groups (e.g., genetic profiles or patient history, their state vs. normative baseline data). Using visual analytics (VA) you will visualize/plot to demonstrate the relationships between variables and what ML detects using multiple linked views.

The project management will be inspired by Scrum. Code and artifacts should be documented in a git repository. After 1 or 2 weeks in the project, the challenge group should submit a proposal for its focus to the challenge owners, preferably with a rough roadmap.

As a data scientist, you will need to interpret and give context to your findings in order to enhance their value.

You may also explore (optional) visualizing an ML algorithm’s decisions (as a meta tool to make ML “understandable”). Such an analysis will facilitate the dialogue with the domain expert when analyzing which factor was (potentially) responsible for which impact. Note that understandable (explainable, interpretable) machine learning is a large and advanced topic. If you are curious about it, a good starting point may be the decision tree concept; this introductory video on decision trees is fine for starters: https://www.youtube.com/watch?v=qB8HZpwqPEg

We will guide you both by providing you some materials from this space and guiding you to the relevant competences. When you have suggestions and questions, please use the “Stream” space, also answer each other’s questions, come to our sessions (see “timelines”), or when it is really needed, request appointments. Main tools for communication should be the “Stream” and the provided office hours.

## Approach
1. Exploratory data analysis (EDA)

This initial step acts as the foundation of the project. By analyzing the data set, initial questions of understanding can be clarified and the steps needed to prepare the data set can be identified. Strategies for imputing missing or incorrect values are also considered here.

2. Data Preprocessing

In parallel with the EDA process, the first transformations of the data are carried out. This includes, for example, normalization and simplification of column names, adaptation of the data types of individual features and elimination of entry errors.

3. Feature extraction

The data set is not in a "Tidy Data" state. This means that there are columns that contain more than one piece of information, i.e. more than one feature. Furthermore, some columns contain values that should be spread across more than one column. This step is necessary to prepare the dataset for a preprocessing pipeline that transforms it into input for a machine learning model.

4. Preprocessing pipeline

At this point, the dataset has been prepared so that, with the help of the `sklearn.preprocessing` module, it can be transformed into a form that a machine learning algorithm can work with.

5. Setup modular environment to test multiple algorithms

By using the pipeline module of the sklearn library, we can build a modular environment that allows us to quickly apply and evaluate different machine learning algorithms.

6. Identify useful models

Of the models tested, the most promising are selected. Through parameter optimization, these can be refined to achieve higher accuracy.

7. Identify useful features

Depending on the algorithm, the influence of a feature is identifiable. This influence of individual features can be checked or increased by various methods.

8. Document findings

The insights gained will be recorded in a conference paper.
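
Steps 4 to 6 can be illustrated with a minimal sketch. It is not the project's actual pipeline: the toy data, the column selection, the imputation strategies and the two candidate models are all assumptions chosen for illustration. It only shows how `ColumnTransformer` and `Pipeline` make it possible to swap estimators while the preprocessing stays the same.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: column names and values are illustrative only.
toy = pd.DataFrame({
    "calcium": [2.2, 2.4, np.nan, 2.1, 2.5, 2.3],
    "wbc": [5.1, 7.8, 6.0, np.nan, 9.2, 4.4],
    "gender": ["female", "male", "female", "male", "female", "male"],
})
target = ["a", "b", "a", "b", "a", "b"]  # placeholder class labels

# Numeric columns: impute with the median, then scale.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: impute with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["calcium", "wbc"]),
    ("cat", categorical_steps, ["gender"]),
])

# Candidate estimators (assumed choices); each gets its own copy of the preprocessing.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
fitted = {}
for name, estimator in candidates.items():
    model = Pipeline([("preprocess", clone(preprocess)), ("model", estimator)])
    fitted[name] = model.fit(toy, target)
```

Adding another algorithm then only requires one more entry in `candidates`.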

In [None]:
# imports, libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import regex as re
from sklearn.preprocessing import StandardScaler, Binarizer, LabelEncoder, Normalizer, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import os

# import helperfunctions
import pipe
#import preprocessing.pipe as pipe

In [None]:
os.getcwd()
#os.chdir("D:/Drive/FHNW/zaRepos/fhnw_ds_fs2021_medical_challenge/preprocessing")

## Import and Renaming
The dataset is imported with pandas `read_excel()`. The naming of the features, i.e. the names of the columns, is not uniform. The features are renamed with the function `pipe.rename()`, which can be found in the script pipe.py, based on a given list. The list can be consulted in the document "col_names_and_data_type.xlsx". All features are renamed in lowercase, and leading and trailing spaces are removed. Brackets and their contents, e.g. "(Blood)", are removed, since they only hamper the readability of the code and the information is recognizable from the context and the name of the feature.
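
The naming rules can be sketched with a small rule-based helper. This is not the body of `pipe.rename()`, which works from a given list; `normalize_column_name` is a hypothetical function, and replacing spaces with underscores is an assumption inferred from the column names used later in this notebook.

```python
import re


def normalize_column_name(name: str) -> str:
    """Hypothetical helper: drop bracketed parts, trim, lowercase, use underscores."""
    without_brackets = re.sub(r"\([^)]*\)", "", name)  # remove "(...)" and its contents
    return without_brackets.strip().lower().replace(" ", "_")


print(normalize_column_name("HBc (HepB core) Ab (Blood)"))
```

Because only the brackets are removed, the two spaces left around "(HepB core)" become a double underscore.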

In [None]:
# import dataframe
df = pd.read_excel("../data/uveitis_data.xlsx")
assert len(df) >= 1075, "Data is not complete"

# rename columns
df = pipe.rename(df, "../data/col_names_and_data_type.xlsx")

In [None]:
# strip leading and trailing whitespace from every string cell
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

In [None]:
df = pipe.drop_nan_columns(df, nan_percentage=.5, verbose = True)

In [None]:
# df = pipe.drop_via_filter(test, 'range', verbose=True)

# Categorical Features
This section deals with categorical variables that can be taken as such directly from the dataset. There are features/variables that contain both categorical and numerical values. These are treated separately. For each feature, a description is given of how it was processed. Mostly this is a simple normalization of the values, a harmonization of values that contain the same information, or the removal of wrong or useless values. The decision to treat a value as "missing" is discussed in each case. All changes made can be adjusted or undone.



## Feature Description 
- **Gender**, a qualitative, nominal feature describing the patient's gender. A patient is in either the "male" or the "female" category.

- **Race** describes the patient's ethnicity.

- **Location** gives the position of the inflammation in the eye. A distinction is made between posterior, anterior, intermediate, etc.

- The feature **Category** records the source of the inflammation as seen by the specialists who recorded the data. Uveitis can be caused by systemic problems or infections, and is often idiopathic.

- **EHR Diagnosis** is an electronically transmitted diagnosis, usually given beforehand by another doctor who had no knowledge of the lab tests and the final diagnosis.

- **Specific Diagnosis** is the diagnosis given by the team that collected the data. According to Dr. Nida Şen this is one of the most important outcome variables. This variable will be considered the target feature.

- **AC Abn Od Cells and AC Abn Os Cells**. These qualitative, ordinal features describe the severity of the inflammation of the Anterior Chamber Cells (AC) in either the left eye (OS) or the right eye (OD). The inflammation can be rated as 0, +0.5, +1, +2, +3, +4. The higher the value, the more severe the inflammation. If either of these values is higher than 0, the patient can be considered "Active", otherwise "Quiet". This information could be recorded in a new column.

- **Vit Abn Od Cells, Vit Abn Os Cells, Vit Abn Od Haze and Vit Abn Os Haze** describe (similar to AC Abn O...) the inflammation of cells in the left (OS) and right (OD) eye. The same scale of 0, +0.5, +1, +2, +3, +4 is used. If one of the values is higher than 0, the patient is considered "Active" as well. This information can also be recorded in a new column.

- **HBc (HepB core) Ab (Blood), HBs (HepB surface) Ag (Blood), HCV (HepC) Ab (Blood)** contain the lab results for different types of hepatitis (hepatitis B and hepatitis C).

Features that contain categorical and numerical information will be discussed in a later chapter.
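
The "Active"/"Quiet" rule described for the inflammation grades could be recorded in a new column roughly as follows. This is a sketch under assumptions: the grades have already been converted to floats, and the column name `activity` is hypothetical.

```python
import numpy as np
import pandas as pd

inflammation_cols = [
    "ac_abn_od_cells", "ac_abn_os_cells",
    "vit_abn_od_cells", "vit_abn_os_cells",
    "vit_abn_od_haze", "vit_abn_os_haze",
]

# Toy grades; one row per patient.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [0.5, 0, 0, 0, 0, 0],
     [0, 0, 0, 2, 0, np.nan]],
    columns=inflammation_cols,
)

# A patient is "Active" if any grade is above 0; missing grades count as not above 0.
is_active = toy[inflammation_cols].gt(0).any(axis=1)
toy["activity"] = np.where(is_active, "Active", "Quiet")
```

Note that with this rule a patient whose grades are all missing would be labelled "Quiet", which may or may not be desired.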

### Gender

This feature contains the gender of the patient ("female" or "male") and is currently of the data type 'Object' ('O'). It is transformed to the dtype 'category' via `astype('category')`, so that it can easily be one-hot encoded later on.
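
As a minimal sketch of this conversion on toy data (using `pd.get_dummies` as a stand-in for the one-hot encoding step, which is an assumption about how the encoding is done later):

```python
import pandas as pd

toy = pd.DataFrame({"gender": ["female", "male", "female"]})

# Convert the object column to the 'category' dtype.
toy["gender"] = toy["gender"].astype("category")

# One-hot encode: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["gender"])
```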

In [None]:
df.gender.unique().tolist() # categories in feature 'gender'

In [None]:
df.gender.dtype # dtype before transformation

In [None]:
df = pipe.gender_dtype(df)
df.gender.dtype # dtype after transformation

### Race
The categorical variable "Race" includes the category "race or ethnic group data not provided by source". These values are treated as missing values and assigned to the category 'unknown', since they do not contain any information about the respective person. "race or ethnic group data not provided by source" and "unknown race" collapse into the category 'unknown'. Missing values (NaNs) are also marked as 'unknown'.
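
A sketch of this collapse on toy data (not the body of `pipe.preprocessing_race`; the value 'white' is an illustrative category):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"race": [
    "white",
    "race or ethnic group data not provided by source",
    "unknown race",
    np.nan,
]})

# Both uninformative labels and real NaNs end up in the single category 'unknown'.
uninformative = ["race or ethnic group data not provided by source", "unknown race"]
toy["race"] = toy["race"].replace(uninformative, "unknown").fillna("unknown")
```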

In [None]:

df = pipe.preprocessing_race(df)
df.race.value_counts()

TODO: Categories with fewer than 10 values, i.e. 'Native Hawaiian or Other Pacific Islander' and 'American Indian or Alaska Native', should perhaps be collapsed or discarded.

### loc, "Location"
The loc feature indicates the location of the inflammation in the eye. The category 'pan' is the same as 'panuveitis' and can be collapsed into it.
We want to explore two different approaches to treating this feature:

1. We keep the categories 'anterior', 'intermediate', 'panuveitis', 'posterior' and 'scleritis'. Each category indicates a different section of the eye (or several at once) that shows inflammation.
2. We collapse multiple categories to get an 'anterior' and a 'posterior' category, i.e. we reduce the location to inflammation in the front or the back of the eye (binary feature). To achieve this, we collapse the categories 'intermediate', 'posterior' and 'panuveitis' into the category 'posterior_segment'; 'anterior' and 'scleritis' are collapsed into the category 'anterior_segment'.
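
The two approaches can be sketched with plain pandas mappings. This illustrates the mapping only; it is not the body of `pipe.preprocessing_loc`, and the variable names are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({"loc": ["anterior", "pan", "intermediate",
                            "posterior", "scleritis", "panuveitis"]})

# Shared by both approaches: 'pan' is the same as 'panuveitis'.
multi = toy["loc"].replace({"pan": "panuveitis"})

# Approach 2: collapse to front and back of the eye.
segment_map = {
    "anterior": "anterior_segment",
    "scleritis": "anterior_segment",
    "intermediate": "posterior_segment",
    "posterior": "posterior_segment",
    "panuveitis": "posterior_segment",
}
binary = multi.map(segment_map)
```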

In [None]:
df = pipe.preprocessing_loc(df,'multi', verbose=True)

### cat, "Category" 
The cat feature describes the origin of the inflammation, for example an infectious or idiopathic origin.
We can collapse the categories "nonneoplastic masquerade" and "neoplastic masquerade" into 'not_uveitis', as these are "pseudo-uveitis" types. The row with the single occurrence of scleritis should be dropped, as this category has too few records. The single occurrence of NaN is a "not_uveitis" case and can be filled with that category.

In [None]:
df = pipe.preprocessing_cat(df)
df.cat.value_counts()

### ehr_diagnosis
EHR diagnosis is an electronically transmitted diagnosis, usually given beforehand by another doctor who did not know the lab results and the final diagnosis. This feature contains many different categories (533 unique values), so we drop it.

In [None]:
df.drop(columns=['ehr_diagnosis'], inplace=True)

### specific_diagnosis
Specific diagnoses that occur 10 times or fewer in the dataset are collapsed into the category 'other'.
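
A minimal sketch of collapsing rare categories (`collapse_rare` is a hypothetical helper, not the body of `pipe.preprocessing_specific`):

```python
import pandas as pd


def collapse_rare(series: pd.Series, max_count: int = 10, other: str = "other") -> pd.Series:
    """Replace every category that occurs max_count times or fewer with `other`."""
    counts = series.value_counts()
    rare = counts[counts <= max_count].index
    return series.where(~series.isin(rare), other)


toy = pd.Series(["a"] * 12 + ["b"] * 10 + ["c"] * 3)
collapsed = collapse_rare(toy)
```

Missing values are not counted by `value_counts()` and therefore stay missing.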

In [None]:
df = pipe.preprocessing_specific(df)
df.specific_diagnosis.value_counts()

### notes
This column contains notes to the diagnosis and is mostly missing. This feature will be dropped at the end of the preprocessing.

In [None]:
if 'notes' in df.columns:
    print(df['notes'].isna().sum() / len(df))  # share of missing notes

### ac_abn_...-columns and vit_abn_...-columns
The value 'C' is treated as missing, and the dtype of these columns is changed to 'float'.
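
On a single toy column, this step might look as follows. It is a sketch, not the body of `pipe.preprocessing_inflammation`, and it assumes the grades are stored as strings such as "+0.5" or as plain numbers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ac_abn_od_cells": ["0", "+0.5", "C", "+2", 1]})

# 'C' becomes NaN; the remaining grades are parsed as floats.
toy["ac_abn_od_cells"] = toy["ac_abn_od_cells"].replace("C", np.nan).astype(float)
```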

In [None]:

df = pipe.preprocessing_inflammation(df)

### hbc__ab, hbs__ag and hcv__ab
These columns encode the lab results for different types of hepatitis. We encode them in binary form: negative results are encoded as '0' and positive results as '1'. In some cases neither a positive nor a negative result can be identified. These values are set as missing.
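
A sketch of such an encoding, assuming the raw results are spelled "Negative" and "Positive" in some capitalization (the actual raw labels in the data may differ, and "Indeterminate" is an invented example of an unclear result):

```python
import pandas as pd

toy = pd.DataFrame({"hbc__ab": ["Negative", "positive", "Indeterminate", None]})

# Anything that is not clearly negative or positive maps to NaN.
result_map = {"negative": 0, "positive": 1}
toy["hbc__ab"] = toy["hbc__ab"].str.lower().map(result_map)
```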

In [None]:
df = pipe.preprocessing_hepatitis(df)

### hla-columns
These columns contain genetic data about the patients. This data should be used for a separate model and is therefore dropped for now. A function has been defined to drop these columns.
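
Dropping columns by a name fragment can be sketched like this. `drop_columns_containing` is a hypothetical helper and not the actual `pipe.drop_via_filter`; the column `calcium_range` is an invented example.

```python
import pandas as pd


def drop_columns_containing(frame: pd.DataFrame, fragment: str) -> pd.DataFrame:
    """Return a copy of `frame` without the columns whose name contains `fragment`."""
    to_drop = [col for col in frame.columns if fragment in col]
    return frame.drop(columns=to_drop)


toy = pd.DataFrame(columns=["gender", "hla_a_1", "hla-b*", "calcium", "calcium_range"])
reduced = drop_columns_containing(toy, "hla")
```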

In [None]:
# df = pipe.drop_via_filter(test, 'hla', verbose=True)

## Numerical Features
This section deals with numerical variables that can be extracted from the dataset. As mentioned before, there are features/variables that contain both categorical and numerical values. These are treated separately. For each feature, a description is given of how it was processed. All numerical features are extracted using the same method. All changes made can be adjusted or undone.

## Feature Description

- **id** is a numerical, nominal feature that is unique to each patient. It has no other use than to serve as an index.

- **calcium** is the concentration in mmol/L (millimole per liter) of calcium in blood.

- **lactate_dehydrogenase** describes the amount in U/L (units per liter) of lactate dehydrogenase, an enzyme present in many cells. A high level in blood usually indicates some form of tissue damage.

- **c-reactive_protein,_normal_and_high_sensitivity** is a protein, measured in mg/L, found in blood whose level rises in response to inflammation. Low sensitivity tests only cover 10-1000 mg/L, while high sensitivity tests range from 0.5-10 mg/L. Healthy patients have a CRP of 0.8-3 mg/L.

- **wbc** is the number of white blood cells in blood, measured in K/uL (thousands per microliter). The healthy count for US patients is between 4,000 and 11,000 cells per microliter, i.e. 4-11 K/uL (Source: Wikipedia 25.03.21).

- **rbc** is the number of red blood cells in blood. Measured in M/uL (million per microliter).

- **hemoglobin** is the amount in g/dL of the proteins that allow the transport of oxygen (these are mostly contained inside the rbc and make up 96% of its dry content).

- **hematocrit** is the volume in % of red blood cells in blood.

- **mcv** is the mean (average) corpuscular (cell) volume of red blood cells, measured in fL (femtoliter).

- **mch** is the mean (average) corpuscular (cell) hemoglobin, i.e. the average mass of hemoglobin per red blood cell in a sample, measured in pg (picogram).

- **mchc** is the mean (average) corpuscular (cell) hemoglobin concentration in g/dL (grams per deciliter) in a given volume of packed blood cells.

- **rdw** is the red blood cell distribution width. This value, measured in %, is a measure of the range of variation of red blood cell volume.

- **platelet_count** is the number of platelets contained in the blood, measured in K/uL (thousands per microliter).

- **neutrophil_%** is the share of neutrophils, a specific type of wbc that makes up 40-70% of all wbc in humans. Measured in %.

- **lymphocytes_%** is the share of lymphocytes, another type of wbc that makes up 18-40% of all wbc in humans. Measured in %.

- **angiotensin_conv#enzyme** is the angiotensin-converting enzyme, which indirectly raises blood pressure if present in large quantities. Measured in U/L (units per liter).

- **beta-2-microglobulin** is measured in mg/L (milligrams per liter). It is a component of molecules present in nucleated cells.

- **lysozyme,_plasma** is measured in mcg/mL (micrograms per milliliter). It is an antimicrobial enzyme and part of the immune system.

- **anti-dnase_b** is measured in U/mL (units per milliliter). It measures the presence of antibodies that combat streptococcus.

- **complement_c3** is measured in mg/dL (milligrams per deciliter). It is a protein of the (innate) immune system.

- **complement_c4** is measured in mg/dL (milligrams per deciliter). It is a protein involved in immunity, tolerance, and autoimmunity.

- **rheumatoid_factor** is measured in IU/mL (International Units per milliliter). It is an autoantibody; a value higher than 20 IU/mL indicates rheumatoid arthritis (in 80% of cases).

### id

A primary key that references each patient. Each value is unique. The values appear to be ordered, but the order carries no meaning.

### calcium

This feature indicates the calcium content of the blood. The values are mostly decimals that range from 1.9 to 2.75, but the column also contains some string values. For this reason the column uses the data type 'Object' ('O'). This feature is transformed into the numerical data type 'float64'.
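
One common way to perform such a conversion is `pd.to_numeric` with `errors="coerce"`, sketched here on toy values. The string "not done" is an invented example, and whether the project's own helper works this way is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({"calcium": [2.31, "2.2", "not done", None]})

# Strings that cannot be parsed as numbers become NaN instead of raising an error.
toy["calcium"] = pd.to_numeric(toy["calcium"], errors="coerce")
```

The same call applies to the other lab columns that are stored with the object dtype.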

### lactate_dehydrogenase

This numerical variable indicates the potential presence of tissue damage when its value is high. While most entries are numerical, some string values are present, as for calcium, which gives this feature an object data type.


### c-reactive_protein,_normal_and_high_sensitivity

This protein acts as an indicator of inflammation in the body. Ideally it would be a numerical column but some strings are contained inside, some containing special characters like > and <. These are tricky because we still want to try and keep their numerical importance without having to outright discard them.
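
One possible way to keep the numerical information of entries such as "<0.5" or ">10" is to store the bound as the value and remember in a separate flag that it was censored. This is only a sketch of that idea, not the approach implemented in pipe.py; the toy values and the column names `value` and `censored` are assumptions.

```python
import pandas as pd

toy = pd.Series(["1.2", "<0.5", ">10", "see note", 3])

# Split each entry into an optional comparison sign and a plain decimal number.
as_text = toy.astype(str).str.strip()
parts = as_text.str.extract(r"^(?P<sign>[<>]?)\s*(?P<value>\d+(?:\.\d+)?)$")

parsed = pd.DataFrame({
    "value": pd.to_numeric(parts["value"]),            # NaN where nothing matched
    "censored": parts["sign"].fillna("").ne(""),       # True for '<' or '>' entries
})
```

Negative numbers and scientific notation are not covered by this pattern.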


### wbc

A fairly straightforward feature that indicates the number of white blood cells in blood. Some values seem abnormally high, which could indicate inconsistencies in the unit of measurement (UOM). It is also encoded as an object data type, so it has to be transformed into a numerical one as well.


### rbc

This feature is the count of red blood cells in blood. Also an object data type, to be converted into numerical


### hemoglobin

A numerical feature, an object data type, to be converted into numeric

### hematocrit

A numerical feature, an object data type, to be converted into numeric

### mcv

A numerical feature, an object data type, to be converted into numeric


### mch

A numerical feature, an object data type, to be converted into numeric


### mchc

A numerical feature, an object data type, to be converted into numeric

### rdw

A numerical feature, an object data type, to be converted into numeric

### platelet_count

A numerical feature, an object data type, to be converted into numeric

### neutrophil_%

A numerical feature, an object data type, to be converted into numeric


### lymphocytes_%

A numerical feature, an object data type, to be converted into numeric

### angiotensin_conv#enzyme

A numerical feature, an object data type, to be converted into numeric

### beta-2-microglobulin

A numerical feature, an object data type, to be converted into numeric

### lysozyme,_plasma

A numerical feature, an object data type, to be converted into numeric


### anti-dnase_b

A numerical feature, an object data type, to be converted into numeric

### complement_c3

A numerical feature, an object data type, to be converted into numeric

### complement_c4

A numerical feature, an object data type, to be converted into numeric

### rheumatoid_factor

A numerical feature, an object data type, to be converted into numeric

## Features containing both numerical and categorical values
Certain columns do not follow the tidy data principle that only one data type should exist in a column/feature.
This chapter deals with these columns and either splits them into a numeric and a categorical feature or changes values to reach a uniform data type across the column.

### Anti-CCP Ab
Anti-CCP is a numeric column in which most values are set to '<20'. A value at or below 20 is viewed as a negative result; above 20 the result is positive. This allows for a binarization of the column: we set every value at or below 20 to 0 (i.e. 'negative') and all values above 20 to 1 (i.e. 'positive'). Some values are still missing.
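
The threshold rule can be sketched as follows. `binarize_at` is a hypothetical helper, not the actual `pipe.num_to_binary`; stripping the comparison sign and comparing the remaining bound is an assumption about how entries like '<20' are handled.

```python
import numpy as np
import pandas as pd


def binarize_at(series: pd.Series, threshold: float) -> pd.Series:
    """Encode values above `threshold` as 1, others as 0, and keep NaN as NaN."""
    # '<20' is read as 20 and therefore counts as negative at a threshold of 20.
    numbers = pd.to_numeric(series.astype(str).str.lstrip("<>"), errors="coerce")
    flags = (numbers > threshold).astype(float)
    return flags.where(numbers.notna())


toy = pd.Series(["<20", 35, "20", np.nan, ">250"])
binary = binarize_at(toy, 20)
```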

In [None]:
# df = pipe.num_to_binary(df, 'anti-ccp_ab', 20)
# df['anti-ccp_ab'].value_counts(dropna=False)

### Anti-ENA Screen
Anti-ENA Screen consists mostly of 'NEG' (negative) values (1001 out of 1075); we assume that the other, numerical values can be regarded as positive. We encode these as 0 (negative) and 1 (positive). The single occurrence of 'see note | In-house test down.  Test re-ordered and sent to Referral L' is replaced with `np.nan`.
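
A sketch of this encoding on toy values (`neg_to_binary` is a hypothetical helper that only illustrates the rule; it is not the body of `pipe.neg_col_to_cat`):

```python
import numpy as np
import pandas as pd


def neg_to_binary(series: pd.Series) -> pd.Series:
    """'NEG' -> 0, any numeric value -> 1, everything else -> NaN."""
    text = series.astype(str).str.strip()
    is_negative = text.str.upper().eq("NEG")
    is_numeric = pd.to_numeric(text, errors="coerce").notna()
    encoded = pd.Series(np.nan, index=series.index)
    encoded[is_negative] = 0.0
    encoded[is_numeric] = 1.0
    return encoded


toy = pd.Series(["NEG", "NEG", 42, "see note | In-house test down.", np.nan])
encoded = neg_to_binary(toy)
```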

In [None]:
# df['anti-ena_screen'].value_counts(dropna=False)

### Antinuclear Antibody

In [None]:
# df['antinuclear_antibody'].value_counts()

### DNA Double-Stranded Ab

In [None]:
# df['dna_double-stranded_ab'].value_counts(dropna=False)

The function `pipe.neg_col_to_cat` transforms a list of columns (in our case `['anti-ccp_ab', 'anti-ena_screen', 'antinuclear_antibody', 'dna_double-stranded_ab', 'rheumatoid_factor']`) into binary, categorical columns where 0 = 'Negative' and 1 = 'Positive'.

In [None]:
df = pipe.neg_col_to_cat(df, ['anti-ccp_ab','anti-ena_screen','antinuclear_antibody','dna_double-stranded_ab', 'rheumatoid_factor'], verbose=True)

### Myeloperoxidase Ab
This column has been dropped because of too many missing values.

### Proteinase-3 Antibodies
This column has been dropped because of too many missing values.
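
The drop rule behind this decision (`nan_percentage=.5` in the call to `pipe.drop_nan_columns` earlier) can be sketched as follows. `drop_sparse_columns` is a hypothetical helper, and whether a column with exactly 50% missing values is kept or dropped is an assumption.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose share of missing values is at most `max_missing`."""
    missing_share = frame.isna().mean()
    keep = missing_share[missing_share <= max_missing].index
    return frame[keep]


toy = pd.DataFrame({
    "calcium": [2.2, 2.4, 2.3, np.nan],                   # 25% missing: kept
    "myeloperoxidase_ab": [np.nan, np.nan, np.nan, 1.0],  # 75% missing: dropped
})
reduced = drop_sparse_columns(toy)
```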

In [None]:
# df['proteinase-3_antibodies'].value_counts(dropna=False)

## Numerical Features
This section deals with numerical variables that can be extracted from the dataset. As mentioned before, there are features/variables that contain both categorical and numerical values. These are treated separately. For each feature, a description is given of how it was processed. Mostly this is a simple normalization of the values, a harmonization of values that contain the same information, or the removal of wrong or useless values. The decision to treat a value as "missing" is discussed in each case. All changes made can be adjusted or undone.

In [None]:
# variable assignment
a = df.copy()
a = pipe.preprocessing_numeric(a)

#### Handle multiple units of measurement in features

In [None]:
df = pipe.uom_fix(df)

### Numeric Features with ranges to categorical
We propose to encode features that have a range of "normal" values associated with them as categorical features. Values within the range are considered 'normal'. We thus create three categories: 0 = 'below normal', 1 = 'normal', 2 = 'above normal'.
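
The three-way encoding can be sketched like this. `to_range_category` is a hypothetical helper, and the bounds passed to it are placeholders rather than the reference range stored in the dataset.

```python
import numpy as np
import pandas as pd


def to_range_category(values: pd.Series, low: float, high: float) -> pd.Series:
    """0 = below normal, 1 = normal, 2 = above normal; missing values stay missing."""
    codes = pd.Series(1.0, index=values.index)
    codes[values < low] = 0.0
    codes[values > high] = 2.0
    return codes.where(values.notna())


calcium = pd.Series([1.9, 2.3, 2.75, np.nan])
categories = to_range_category(calcium, low=2.05, high=2.5)  # placeholder bounds
```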

In [None]:
t = df.copy()
t = pipe.preprocessing_numeric(t, num_to_cat = True, verbose=False)

## Drop 'uom' and 'range' columns
Every lab test is accompanied by two columns. One specifies the unit of measurement (uom) for the test and the other defines the acceptable/normal range of the test (range).
Although this information is important for the exploratory data analysis and the preprocessing, it is not advisable to include these columns in the dataframe that serves as the input for a machine learning algorithm.

In [None]:
# df = pipe.drop_uom_and_range(df, verbose=True)

# Preprocessing Pipeline

CHECK IF EVERY COLUMN IS ACCOUNTED FOR:
- [x] 'id'
- [x] 'gender' 
- [x] 'race'
- [x] 'loc'
- [x] 'ehr_diagnosis' (Dropped)
- [x] 'anti-dnase_b' (Dropped, too many missing values)
- [x] 'other_' (Dropped, too many missing values)
- [x] 'notes' (Dropped, too many missing values)
- [x] 'beta-2-microglobulin' (Dropped, too many missing values)
- [x] 'lupus_anticoagulant' (Dropped, too many missing values)
- [x] 'myeloperoxidase_ab' (Dropped, too many missing values)
- [x] 'proteinase-3_antibodies' (Dropped, too many missing values)
- [x] 'cat'
- [x] 'specific_diagnosis'
- [x] 'ac_abn_od_cells'
- [x] 'ac_abn_os_cells'
- [x] 'vit_abn_od_cells'
- [x] 'vit_abn_os_cells'
- [x] 'vit_abn_od_haze'
- [x] 'vit_abn_os_haze'
- [x] 'calcium'
- [x] 'lactate_dehydrogenase'
- [x] 'c-reactive_protein,_normal_and_high_sensitivity'
- [ ] 'wbc'
- [x] 'rbc'
- [x] 'hemoglobin'
- [x] 'hematocrit'
- [x] 'mcv'
- [x] 'mch'
- [x] 'mchc'
- [x] 'rdw'
- [x] 'platelet_count'
- [x] 'neutrophil_%'
- [x] 'lymphocytes_%'
- [x] 'angiotensin_conv#enzyme'
- [x] 'lysozyme,_plasma'
- [x] 'anti-ccp_ab'
- [x] 'anti-ena_screen'
- [x] 'antinuclear_antibody'
- [x] 'complement_c3'
- [x] 'complement_c4'
- [x] 'dna_double-stranded_ab'
- [x] 'hla-a*' (Dropped)
- [x] 'hla_a_1' (Dropped)
- [x] 'hla_a_2' (Dropped)
- [x] 'hla-b*' (Dropped)
- [x] 'hla_b_1' (Dropped)
- [x] 'hla_b_2' (Dropped)
- [x] 'hla-cw*' (Dropped)
- [x] 'hla_c_1' (Dropped)
- [x] 'hla_c_2' (Dropped)
- [x] 'hla-drb1*' (Dropped)
- [x] 'hla_drb1_1' (Dropped)
- [x] 'hla_drb1_2' (Dropped)
- [x] 'hla-dqb1*_/_dq*' (Dropped)
- [x] 'hla_dq_1' (Dropped)
- [x] 'hla_dq_2' (Dropped)
- [x] 'hla-drb_*' (Dropped)
- [x] 'hla_drb*_1' (Dropped)
- [x] 'hla_drb*_2' (Dropped)
- [x] 'rheumatoid_factor'
- [x] 'hbc__ab'
- [x] 'hbs__ag'
- [x] 'hcv__ab'
- [x] uom and range columns (Dropped, after used for transformation)

In [None]:
df = pipe.preprocessing_pipe()

In [None]:
pd.set_option('display.max_columns', None)  # full option name avoids ambiguous matches
df

In [None]:
df.isna().sum()

## Export dataframe


In [None]:
# df.to_csv('../data/cleaned_uveitis_data.csv')