# EuCT Data Cleaning

#### Imports

In [7]:
import pandas as pd

url = 'https://drive.google.com/u/0/uc?id=1Lo6zbyhzTMww79L3ssETF51_E8rUqZi_&export=download&confirm=t&uuid=27bc2a30-2ec0-4717-9db3-c8d7ffcb9995&at=AKKF8vyKYKl999sGvDqE2TG36L3q:1684746320204'
trials_raw = pd.read_csv(url, low_memory=False)


In [8]:
trials_raw.shape

trials_raw['EudraCT Number:'].nunique()  # number of distinct EudraCT numbers

43464

Since the scraping was done at a very early stage, we were not yet sure which variables the analysis would need. We also wanted the freedom to explore all of the available data to get the whole picture before deciding which variables to include. During the scraping phase we noticed that some of the fields for each trial were mandatory while others were not, so we decided to collect all the data and filter it afterwards.

The original dataset has 114,333 rows and 181 columns, covering 43,464 distinct EudraCT numbers.
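One way to see which fields are effectively mandatory and which are optional is to look at the share of missing values per column. The snippet below is a minimal sketch on a toy DataFrame: the values are made up for illustration, and on the real data the same expression would be applied to trials_raw.

```python
import pandas as pd

# Toy stand-in for trials_raw (hypothetical values, not real trial data)
toy = pd.DataFrame({
    'EudraCT Number:': ['2010-000001-01', '2010-000002-02', '2010-000003-03'],
    'Trial Status:': ['Completed', 'Ongoing', None],
    'Therapeutic area': [None, None, 'Diseases [C] - Cancer [C04]'],
})

# Fraction of missing values per column, most complete columns first
missing_share = toy.isna().mean().sort_values()
print(missing_share)
```

Columns with a share close to zero behave like mandatory fields, while columns with a high share are candidates for dropping or for explicit handling of missing values.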


## 1. Feature Engineering

### 1.1 Filtering

In [25]:
# Map the original EudraCT field names to short column names
fields = {'EudraCT Number:' : 'EudraCT_num',
        'Member State Concerned' : 'Country',
        'Date on which this record was first entered in the EudraCT database:': 'Start_date',
        'Date of the global end of the trial' : 'End_date',
        'Trial Status:' : 'Trial_status',
        'Status of the sponsor' : 'Sponsor_type',
        'Therapeutic area' : 'Therapeutic_area',
        'In the Member State concerned years' : 'Years_duration' , 
        'In the Member State concerned months' : 'Months_duration',
        'In the Member State concerned days' : 'Days_duration',
        'In the member state' : 'Sample_size',
        'Condition being studied is a rare disease' : 'Rare_disease',
        'Diagnosis': 'Diagnosis_scope',
        'Prophylaxis': 'Prophylaxis_scope',
        'Therapy': 'Therapy_scope',
        'Safety': 'Safety_scope',
        'Efficacy': 'Efficacy_scope',
        'Pharmacokinetic': 'Pharmacokinetic_scope',
        'Pharmacodynamic': 'Pharmacodynamic_scope',
        'Bioequivalence': 'Bioequivalence_scope',
        'Dose response': 'Dose_response_scope',
        'Pharmacogenetic': 'Pharmacogenetic_scope',
        'Pharmacogenomic': 'Pharmacogenomic_scope',
        'Pharmacoeconomic': 'Pharmacoeconomic_scope',
        'Others': 'Others_scope',
        'Human pharmacology (Phase I)' : '1_phase',
        'Therapeutic exploratory (Phase II)' : '2_phase' ,
        'Therapeutic confirmatory (Phase III)' : '3_phase',
        'Therapeutic use (Phase IV)' : '4_phase',
        'In Utero': 'In_utero_age',
        'Newborns (0-27 days)': 'Newborns_age',
        'Infants and toddlers (28 days-23 months)': 'Infants_age',
        'Children (2-11years)': 'Children_age',
        'Adolescents (12-17 years)': 'Adolescents_age',
        'Adults (18-64 years)': 'Adults_age',
        'Elderly (>=65 years)': 'Elderly_age',
        'Female' : 'Female_gender',
        'Male' : 'Male_gender'
        }

# Lower-case the new names, then rename the columns and keep only the selected ones
fields = {key: value.lower() for key, value in fields.items()}
trials_filtered = trials_raw.rename(columns=fields)[list(fields.values())]


### 1.2 Cleaning

After filtering the relevant features, we clean each group of columns in turn and define the aggregation functions based on the subgroups of columns.

In [26]:
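# Drop rows in which every selected field is empty; copy so later edits do not touch trials_filtered
trials = trials_filtered.dropna(how='all').copy()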

Country
- Trials outside the EU/EEA are flagged in a different field (Clinical Trial Type:). Since the number of missing values (NaN) in country matches the count for that field, we take the trials without an assigned country to be the ones conducted outside the EU/EEA.

In [27]:
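# values[1] is the count of the second most frequent category, taken here to be the outside-EU/EEA trials
outside = trials_raw['Clinical Trial Type:'].value_counts().values[1]
n_missing = trials.country.isna().sum()
print(f"Number of trials outside the Eu/EEA: {outside} \nNumber of nan in Country {n_missing}")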

Number of trials outside the Eu/EEA: 1553 
Number of nan in Country 1553
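
Matching totals strongly suggest, but do not strictly prove, that the two sets of rows coincide. A row-by-row comparison would confirm it. The sketch below uses a toy DataFrame; the column name trial_type and the label 'Outside EU/EEA' are assumptions, since the exact labels of the raw field are not shown in this notebook.

```python
import pandas as pd

# Toy data; the trial-type labels and agency names are illustrative
toy = pd.DataFrame({
    'trial_type': ['EEA CTA', 'Outside EU/EEA', 'EEA CTA', 'Outside EU/EEA'],
    'country': ['Spain - AEMPS', None, 'Germany - BfArM', None],
})

is_outside = toy['trial_type'] == 'Outside EU/EEA'
no_country = toy['country'].isna()

# True only if every row without a country is an outside-EU/EEA trial, and vice versa
same_rows = bool((is_outside == no_country).all())
print(same_rows)
```

On the real data the two masks would have to be built on the same index, for example both from trials_raw, before comparing them.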


In [28]:
trials['country'] = trials.country.str.split('-').str[0].str.strip()  # Keep the country name, dropping the agency name
trials.loc[trials.country == 'Bulgarian Drug Agency', 'country'] = 'Bulgaria'  # This entry does not follow the "country - agency" convention
trials['country'] = trials.country.fillna('Outside EU/EEA')  # Trials without a country are outside the EU/EEA

Therapeutic Area

In [29]:
# Split the therapeutic area into a category and a sub-category, dropping the bracketed codes
trials['topic_category'] = trials.therapeutic_area.str.split('-').str[0].str.split('[').str[0].str.strip()
trials['topic_sub_category'] = trials.therapeutic_area.str.split('-').str[1].str.split('[').str[0].str.strip()
trials.drop('therapeutic_area', axis=1, inplace=True)
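
To illustrate the parsing, the sketch below applies the same chain of string operations to a few sample values. The sample strings are assumptions about the raw format (a category and a sub-category, each followed by a bracketed code and separated by a hyphen), and the second one is made up. It also shows a caveat: a sub-category whose name contains a hyphen is cut at that hyphen, which splitting only once on ' - ' would avoid.

```python
import pandas as pd

# Sample values in the assumed "Category [code] - Sub-category [code]" format
area = pd.Series([
    'Diseases [C] - Cancer [C04]',
    'Category A [A] - Sub-category one [A01]',  # made-up value with a hyphen in the name
    None,                                       # trial without a therapeutic area
])

# Same logic as in the cell above
category = area.str.split('-').str[0].str.split('[').str[0].str.strip()
sub_category = area.str.split('-').str[1].str.split('[').str[0].str.strip()

# Variant that splits only on the first ' - ' separator
sub_category_safe = area.str.split(' - ', n=1).str[1].str.split('[').str[0].str.strip()

print(pd.DataFrame({'category': category, 'sub': sub_category, 'sub_safe': sub_category_safe}))
```

Whether the variant is needed depends on whether any real sub-category contains a hyphen, which can be checked by counting the hyphens in therapeutic_area before the column is dropped.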

Duration

In [30]:
trials['years_duration'] = trials['years_duration'].fillna(0)
trials['months_duration'] = trials['months_duration'].fillna(0)
trials['days_duration'] = trials['days_duration'].fillna(0)

# Approximate total duration in days (1 year = 365 days, 1 month = 30 days)
trials['duration_days'] = (trials.years_duration * 365) + (trials.months_duration * 30) + trials.days_duration
trials.drop(['years_duration', 'months_duration', 'days_duration'], axis=1, inplace=True)

scope_cols = trials.loc[:, trials.columns.str.contains('_scope')]
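
As a quick check of the conversion, the sketch below applies the same formula to toy values. It uses the same approximations as the cell above (365 days per year, 30 days per month); the input numbers are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    'years_duration': [1, None, 2],
    'months_duration': [6, 3, None],
    'days_duration': [None, 15, None],
})

# Missing components count as zero
toy = toy.fillna(0)

toy['duration_days'] = toy.years_duration * 365 + toy.months_duration * 30 + toy.days_duration
print(toy['duration_days'].tolist())  # [545.0, 105.0, 730.0]
```

Note that with this approach a trial with no duration information at all ends up with 0 days rather than a missing value.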

Scope & Gender & Age

In [31]:
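def agg(subgroup):
    """Join the names of all '<name>_<subgroup>' columns whose value is 'Yes' into one string column."""
    suffix = '_' + subgroup
    cols = trials.loc[:, trials.columns.str.endswith(suffix)]
    # Strip only the suffix so multi-word names such as 'in_utero' or 'dose_response' stay whole
    trials[subgroup] = cols.apply(lambda x: ', '.join([col[:-len(suffix)] for col, val in x.items() if val == 'Yes']), axis=1)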

agg('age')
agg('gender')
agg('scope')
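
For illustration, here is a self-contained variant of the aggregation run on a toy DataFrame. The helper name join_yes_columns is hypothetical; unlike agg, it takes the DataFrame as an argument and returns the new column instead of writing to the global trials. It is a sketch of the idea, not the notebook's own function.

```python
import pandas as pd

def join_yes_columns(df, subgroup):
    """Return, per row, the comma-separated names of '<name>_<subgroup>' columns equal to 'Yes'."""
    suffix = '_' + subgroup
    cols = [c for c in df.columns if c.endswith(suffix)]
    return df[cols].apply(
        lambda row: ', '.join(c[:-len(suffix)] for c in cols if row[c] == 'Yes'),
        axis=1,
    )

# Toy data with two gender flags and one unrelated age flag
toy = pd.DataFrame({
    'female_gender': ['Yes', 'Yes', 'No'],
    'male_gender': ['Yes', 'No', None],
    'adults_age': ['Yes', 'Yes', 'Yes'],
})

toy['gender'] = join_yes_columns(toy, 'gender')
print(toy['gender'].tolist())  # ['female, male', 'female', '']
```

Rows in which no flag is 'Yes' (including rows with missing flags) get an empty string, so they can be told apart from rows with at least one selected value.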

In [32]:
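# Assign the result back: calling fillna(inplace=True) on an attribute-accessed column may not update trials
trials['topic_category'] = trials['topic_category'].fillna('Not possible to specify')
trials['topic_sub_category'] = trials['topic_sub_category'].fillna('Not possible to specify')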

Phase

In [33]:
cols = trials.loc[:, trials.columns.str.contains('_phase')]
# Concatenate the digits of the phases marked 'Yes', e.g. '23' for a Phase II/III trial
trials['phase'] = cols.apply(lambda x: ''.join([col.split('_')[0] for col, val in x.items() if val == 'Yes']), axis=1)
# Keep the highest phase of each trial; trials with no phase information get a missing value
trials['phase'] = trials['phase'].apply(lambda x: max(int(digit) for digit in x) if len(x) > 0 else None)
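
The two steps can be followed on a toy example. The sketch below rebuilds the same logic on made-up data: a trial flagged as both Phase II and Phase III, a Phase IV trial, and a trial with no phase information.

```python
import pandas as pd

# Toy phase flags (illustrative values)
toy = pd.DataFrame({
    '1_phase': ['No', 'No', None],
    '2_phase': ['Yes', 'No', None],
    '3_phase': ['Yes', 'No', None],
    '4_phase': ['No', 'Yes', None],
})

# Step 1: concatenate the digits of the phases marked 'Yes', e.g. '23'
digits = toy.apply(
    lambda row: ''.join(col.split('_')[0] for col, val in row.items() if val == 'Yes'),
    axis=1,
)

# Step 2: highest phase per trial; empty strings (no phase information) become missing
phase = digits.apply(lambda s: max(int(d) for d in s) if s else None)
print(digits.tolist())
print(phase.tolist())
```

Taking the maximum means that a combined Phase II/III trial is counted as Phase III, which is the convention the cell above adopts.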

In [34]:
trials.to_csv('../DataSets/trials.csv', index=False)