# Data Cleaning

In [8]:
import os
import pandas as pd
from ydata_profiling import ProfileReport

### Read in data from file
The `trial_classes.xlsx` file has been enriched with additional multiclass labels identifying some characteristics of the trials.

In [6]:
data_directory = 'data'

data_dict = {}
for file_name in os.listdir(data_directory):
    file_path = os.path.join(data_directory, file_name)

    # splitext copes with extra dots in file names and returns the extension with its leading '.'
    name, file_type = os.path.splitext(file_name)
    if file_type == '.parquet':
        data_dict[name] = pd.read_parquet(file_path)
    elif file_type == '.xlsx':
        data_dict[name] = pd.read_excel(file_path)

---
### Create data profile reports

In [None]:
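# Make sure the output directory exists before writing the reports
os.makedirs('profiles', exist_ok=True)

for key, df in data_dict.items():
    prof = ProfileReport(df)
    prof.to_file(output_file=f'profiles/{key}.html')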

---
# Notes

### Country
- A lot of missing data
    - Some columns are almost completely missing
    - Many columns have roughly 50 % missing data
    - -> **We need to choose a methodology robust to missing data**
- Multicollinearity
    - Many variables display a correlation with one another
    - -> **We might need to do some PCA**
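
A minimal sketch of how the two observations above could be quantified outside the profile report. The column names in the toy frame and the 0.7 threshold are assumptions for illustration only; they are not taken from the country data.

```python
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Share of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def correlated_pairs(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """List (col_a, col_b, |corr|) for numeric column pairs at or above the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = corr.loc[a, b]
            # NaN correlations (too few overlapping values) compare as False and are skipped
            if value >= threshold:
                pairs.append((a, b, float(value)))
    return pairs


# Toy data with hypothetical column names
toy = pd.DataFrame({
    "gdp": [1.0, 2.0, 3.0, 4.0],
    "gdp_per_capita": [2.0, 4.0, 6.0, 8.0],
    "mostly_missing": [None, None, None, 1.0],
})

missing = missing_fraction(toy)
pairs = correlated_pairs(toy)
```

Columns near the top of `missing` are candidates for dropping, and the groups of columns that show up in `pairs` are where PCA (or simply dropping one of each pair) might help.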

### Target
- Multicollinearity
    - `no_of_patients` is highly correlated with `enrolment_months` (0.720). This is unsurprising, as more time spent finding patients leads to more patients found. Enrolment can run longer for many reasons, but funding is a likely one.
    - -> **This suggests that a reasonable metric for "goodness" could be `efficiency = no_of_patients / enrolment_months`**
    - The measure of efficiency will vary depending on how common certain diseases are in different locations or which sites have been part of previous studies. This is, in a sense, what we are trying to predict.
- Zeros
    - A lot of sites managed to gather no patients for some trials (5.9 %)
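
The efficiency metric proposed above can be sketched as follows. This is one possible way to compute it, not a settled definition: treating rows with zero (or negative) `enrolment_months` as missing is an assumption, and the toy values are made up.

```python
import pandas as pd


def add_efficiency(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with efficiency = no_of_patients / enrolment_months."""
    out = df.copy()
    # Assumption: non-positive durations are treated as missing to avoid division by zero
    months = out["enrolment_months"].where(out["enrolment_months"] > 0)
    out["efficiency"] = out["no_of_patients"] / months
    return out


# Toy data, values are illustrative only
toy = pd.DataFrame({
    "no_of_patients": [12, 0, 5],
    "enrolment_months": [6, 4, 0],
})

result = add_efficiency(toy)

# Share of rows where no patients were enrolled
zero_share = (toy["no_of_patients"] == 0).mean()
```

Sites with zero patients get an efficiency of 0 rather than being dropped, so they stay in the data as valid (if poor) outcomes.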

### Trial
- Trials differ substantially in what they measure
    - Some are heavily focused on ensuring patients have given "informed consent" (even though this is always required?)
    - Some focus on obesity and weight change
    - Some focus on cardiovascular issues
    - -> **There are not enough different trials to go full-on NLP here and try to fine-tune a language model. Instead we elect a more manual approach of creating multiclass labels for the trials. These explain, on a high level, what the trials involve.**
    - A quick Google search leads me to believe that this is public data, [link](https://ctv.veeva.com/), so GPT will be consulted to create the classes.
- `maximum_age`
    - Zeros seem to signify missing information
    - Values that do exist also seem to be placeholders, chosen with little consideration of their impact on the trial or on patient recruitment (e.g. 99, 100, 130).
    - -> **Drop column**
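
A small sketch of how the `maximum_age` observation could be checked before dropping the column. The placeholder list is just the values named in the notes above, and `trial_id` is a hypothetical column used only to give the toy frame something to keep.

```python
import pandas as pd

# Assumption: zeros plus the placeholder values mentioned in the notes
PLACEHOLDER_AGES = [0, 99, 100, 130]


def placeholder_share(df: pd.DataFrame, column: str = "maximum_age") -> float:
    """Fraction of rows whose value is a zero or a known placeholder."""
    return float(df[column].isin(PLACEHOLDER_AGES).mean())


# Toy data, values are illustrative only
toy = pd.DataFrame({
    "trial_id": [1, 2, 3, 4],
    "maximum_age": [0, 99, 65, 130],
})

share = placeholder_share(toy)

# If most values are placeholders, drop the column as suggested above
cleaned = toy.drop(columns=["maximum_age"])
```

A high `share` supports dropping the column outright instead of trying to impute it.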

### Trial site
- Profile shows no warnings