### EuCT Data Cleaning:

#### Library and Data Import:

In [1]:
import os
import pandas as pd

In [2]:
path = os.getcwd()

In [3]:
df = pd.read_csv(r'C:\Users\ggrkn\OneDrive - NOVAIMS\PostGrad\PDS\_Final Project\EuCTRegister.csv', low_memory=False)

#### Column Selection and Aggregation

In [4]:
df.shape

(114333, 181)

The original dataset has 114333 rows and 181 columns, so it needs to be reduced and cleaned before any analysis.

Based on their names, we selected 28 columns that might be relevant to the analysis:

In [5]:
column_list=['EudraCT Number:','Country','Member State Concerned',
             'Date on which this record was first entered in the EudraCT database:','Date of the global end of the trial',
             'Trial Status:','Therapeutic area','Diagnosis','Prophylaxis','Therapy','Safety','Efficacy',
             'Main objective of the trial','Human pharmacology (Phase I)','Therapeutic exploratory (Phase II)',
             'Therapeutic confirmatory (Phase III)','Therapeutic use (Phase IV)','In Utero','Newborns (0-27 days)',
             'Infants and toddlers (28 days-23 months)','Children (2-11years)','Adolescents (12-17 years)',
             'Adults (18-64 years)','Elderly (>=65 years)','Female','Male', 'Medical condition(s) being investigated', 'Medical condition in easily understood language']

In [6]:
len(column_list)

28

In [7]:
df = df.loc[:,column_list]
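
Selecting by label with `df.loc[:, column_list]` raises a `KeyError` if any name in the list is misspelled or absent from the export. A minimal sketch of a pre-check follows; the `toy` frame and the `wanted` list are illustrative stand-ins, not the real data:

```python
import pandas as pd

# Hypothetical stand-in for the raw EudraCT export
toy = pd.DataFrame({
    "EudraCT Number:": ["2017-004295-55"],
    "Country": ["Spain"],
    "Trial Status:": ["Completed"],
})

# Hypothetical selection list; one name is deliberately absent from `toy`
wanted = ["EudraCT Number:", "Country", "Therapeutic area"]

# Names requested but not present in the frame
missing = [col for col in wanted if col not in toy.columns]
# Names that exist, in the requested order
present = [col for col in wanted if col in toy.columns]

subset = toy.loc[:, present]
print("missing columns:", missing)
```

Checking `missing` before the selection makes a renamed or dropped source column visible immediately instead of failing inside `.loc`.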

As the table below shows, many of the columns are 'dummy' columns that can probably be aggregated to simplify the analysis.

In [8]:
df.describe().transpose()

| Column | count | unique | top | freq |
|---|---|---|---|---|
| EudraCT Number: | 114290 | 43464 | 2017-004295-55 | 24 |
| Country | 77738 | 88 | United States | 17694 |
| Member State Concerned | 112737 | 31 | Spain - AEMPS | 11095 |
| Date on which this record was first entered in the EudraCT database: | 114290 | 4993 | 2021-06-17 | 327 |
| Date of the global end of the trial | 63363 | 5511 | 2012-12-20 | 60 |
| Trial Status: | 111285 | 10 | Completed | 66850 |
| Therapeutic area | 78199 | 55 | Diseases [C] - Cancer [C04] | 21952 |
| Diagnosis | 114290 | 3 | No | 108189 |
| Prophylaxis | 114290 | 3 | No | 103340 |
| Therapy | 114290 | 3 | Yes | 63972 |


We started by defining the aggregation functions, one for each subgroup of columns.

In [9]:
def aggregate_columns_phase(row):
    yes_columns = [col for col in row.index if col in ['Human pharmacology (Phase I)','Therapeutic exploratory (Phase II)',
             'Therapeutic confirmatory (Phase III)','Therapeutic use (Phase IV)'] 
                   and row[col] == 'Yes']
    return ', '.join(yes_columns) if yes_columns else ''

In [10]:
def aggregate_columns_age(row):
    yes_columns = [col for col in row.index if col in ['In Utero','Newborns (0-27 days)',
             'Infants and toddlers (28 days-23 months)','Children (2-11years)','Adolescents (12-17 years)',
             'Adults (18-64 years)','Elderly (>=65 years)'] 
                   and row[col] == 'Yes']
    return ', '.join(yes_columns) if yes_columns else ''

In [11]:
def aggregate_columns_scope(row):
    yes_columns = [col for col in row.index if col in ['Diagnosis', 'Prophylaxis', 'Therapy', 'Safety', 'Efficacy'] 
                   and row[col] == 'Yes']
    return ', '.join(yes_columns) if yes_columns else ''

In [12]:
def aggregate_columns_gender(row):
    yes_columns = [col for col in row.index if col in ['Female','Male'] 
                   and row[col] == 'Yes']
    return ', '.join(yes_columns) if yes_columns else ''

We then applied the aggregation functions to the dataset to combine the column values:

In [13]:
df['Scope of the Trial'] = df.apply(aggregate_columns_scope, axis=1)
df['Trial Phase'] = df.apply(aggregate_columns_phase, axis=1)
df['Age Group'] = df.apply(aggregate_columns_age, axis=1)
df['Gender'] = df.apply(aggregate_columns_gender, axis=1)
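
The four aggregation functions above differ only in the list of columns they inspect. As a sketch, they could be folded into one parameterized helper; `aggregate_yes_columns` and the `toy` frame are hypothetical names introduced here, not part of the original notebook:

```python
import pandas as pd

def aggregate_yes_columns(frame, columns):
    """For each row, join the names of `columns` whose value is 'Yes'."""
    # Boolean frame: True where the cell equals 'Yes' (missing values become False)
    flags = frame[columns].eq("Yes")
    # Join the flagged column names per row; rows with no 'Yes' yield ''
    return flags.apply(
        lambda row: ", ".join(col for col in columns if row[col]), axis=1
    )

# Hypothetical miniature of the indicator columns
toy = pd.DataFrame({
    "Female": ["Yes", "Yes", "No"],
    "Male": ["Yes", "No", None],
    "Therapy": ["No", "Yes", "Yes"],
})

toy["Gender"] = aggregate_yes_columns(toy, ["Female", "Male"])
print(toy["Gender"].tolist())
```

Restricting the frame to the relevant columns before calling `apply` also avoids scanning every column of each row, which the per-row `row.index` loop does.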

Finally, we deleted the unnecessary columns. The new dataset has 114333 rows and 14 columns.

In [14]:
df.drop(['Diagnosis', 'Prophylaxis', 'Therapy', 'Safety', 'Efficacy',
        'Human pharmacology (Phase I)','Therapeutic exploratory (Phase II)',
         'Therapeutic confirmatory (Phase III)','Therapeutic use (Phase IV)',
         'In Utero','Newborns (0-27 days)', 'Infants and toddlers (28 days-23 months)','Children (2-11years)',
         'Adolescents (12-17 years)','Adults (18-64 years)','Elderly (>=65 years)','Female','Male'],
        axis=1, inplace = True)

In [15]:
df.shape

(114333, 14)

#### Column Rename and Data Split

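Having reduced the number of columns, we renamed some of them to more intuitive names.

Additionally, as we are focusing on depression and neurological diseases, we filter the dataset by therapeutic area and split it into two subsets. The depression subset covers the four Psychiatry and Psychology [F] therapeutic areas; the neurology subset covers Nervous System Diseases [C10] and, as coded below, Cardiovascular Diseases [C14].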

In [16]:
df = df.rename(columns={'Date on which this record was first entered in the EudraCT database:':'Start Date',
                         'Member State Concerned':'Member State',
                         'Date of the global end of the trial':'End Date',
                       'Scope of the Trial':'Trial Scope'})

In [17]:
depression_trials = df[(df['Therapeutic area'] == 'Psychiatry and Psychology [F] - Mental Disorders [F03]') |
                    (df['Therapeutic area'] == 'Psychiatry and Psychology [F] - Behaviours [F01]') |
                    (df['Therapeutic area'] == 'Psychiatry and Psychology [F] - Behavioral Disciplines and Activities [F04]') |
                    (df['Therapeutic area'] == 'Psychiatry and Psychology [F] - Psychological processes [F02]')] 

In [18]:
#neural_trials = df[(df['Therapeutic area'] == 'Diseases [C] - Nervous System Diseases [C10]')]

In [19]:
neurology_trials = df[(df['Therapeutic area'] == 'Diseases [C] - Nervous System Diseases [C10]') |
                    (df['Therapeutic area'] == 'Diseases [C] - Cardiovascular Diseases [C14]')] 
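
The chained `==` comparisons in the two filters above can also be written with `Series.isin`, which keeps the list of therapeutic areas in one place. A minimal sketch on a hypothetical `toy` frame (the single-letter trial identifiers are placeholders):

```python
import pandas as pd

# Hypothetical miniature of the cleaned dataset
toy = pd.DataFrame({
    "EudraCT Number:": ["A", "B", "C", "D"],
    "Therapeutic area": [
        "Psychiatry and Psychology [F] - Mental Disorders [F03]",
        "Diseases [C] - Nervous System Diseases [C10]",
        "Diseases [C] - Cancer [C04]",
        None,
    ],
})

psychiatry_areas = [
    "Psychiatry and Psychology [F] - Mental Disorders [F03]",
    "Psychiatry and Psychology [F] - Behaviours [F01]",
    "Psychiatry and Psychology [F] - Behavioral Disciplines and Activities [F04]",
    "Psychiatry and Psychology [F] - Psychological processes [F02]",
]

# True where the therapeutic area is one of the listed values; missing values are False
mask = toy["Therapeutic area"].isin(psychiatry_areas)

# .copy() gives an independent frame, so later column assignments do not warn
depression_toy = toy[mask].copy()
print(depression_toy["EudraCT Number:"].tolist())
```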

#### Conclusions and Export

The following tables summarize the main information regarding both datasets:

In [20]:
depression_trials.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1785 entries, 17 to 114272
Data columns (total 14 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   EudraCT Number:                                  1785 non-null   object
 1   Country                                          1768 non-null   object
 2   Member State                                     1757 non-null   object
 3   Start Date                                       1785 non-null   object
 4   End Date                                         1042 non-null   object
 5   Trial Status:                                    1707 non-null   object
 6   Therapeutic area                                 1785 non-null   object
 7   Main objective of the trial                      1783 non-null   object
 8   Medical condition(s) being investigated          1780 non-null   object
 9   Medical condition in easily understood

In [21]:
depression_trials.describe().transpose()

| Column | count | unique | top | freq |
|---|---|---|---|---|
| EudraCT Number: | 1785 | 852 | 2019-002992-33 | 14 |
| Country | 1768 | 31 | United States | 375 |
| Member State | 1757 | 28 | Germany - BfArM | 182 |
| Start Date | 1785 | 1255 | 2012-10-22 | 6 |
| End Date | 1042 | 493 | 2013-09-03 | 13 |
| Trial Status: | 1707 | 8 | Completed | 1010 |
| Therapeutic area | 1785 | 4 | Psychiatry and Psychology [F] - Mental Disorde... | 1493 |
| Main objective of the trial | 1783 | 1059 | Evaluation of the long-term safety and tolerab... | 15 |
| Medical condition(s) being investigated | 1780 | 750 | Major Depressive Disorder | 181 |
| Medical condition in easily understood language | 1772 | 701 | Schizophrenia | 160 |
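
In the summary above, `EudraCT Number:` has 1785 values but only 852 unique ones, so the same trial appears in several rows. The notebook does not say why; one plausible reading (an assumption) is one row per trial and member state. A sketch of how rows and distinct trials could be counted separately, using a hypothetical `toy` frame with made-up trial numbers:

```python
import pandas as pd

# Hypothetical rows; the EudraCT numbers are invented for illustration
toy = pd.DataFrame({
    "EudraCT Number:": ["2019-000001-11", "2019-000001-11", "2020-000002-22"],
    "Member State": ["Germany - BfArM", "Spain - AEMPS", "Spain - AEMPS"],
})

n_rows = len(toy)                                   # number of records
n_trials = toy["EudraCT Number:"].nunique()         # number of distinct trials
rows_per_trial = toy.groupby("EudraCT Number:").size()

# One row per trial, keeping the first record encountered
one_row_per_trial = toy.drop_duplicates(subset="EudraCT Number:", keep="first")

print(n_rows, n_trials)
```

Whether to analyse records or distinct trials depends on the question; counts per country need every row, while counts of trials should use the deduplicated frame.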


In [22]:
#neural_trials.info()

In [23]:
#neural_trials.describe().transpose()

In [24]:
#depression_trials.to_csv("depression_trials.csv")
#neural_trials.to_csv("neural_trials.csv")

In [25]:
neurology_trials.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11615 entries, 0 to 114313
Data columns (total 14 columns):
 #   Column                                           Non-Null Count  Dtype 
---  ------                                           --------------  ----- 
 0   EudraCT Number:                                  11615 non-null  object
 1   Country                                          11452 non-null  object
 2   Member State                                     11431 non-null  object
 3   Start Date                                       11615 non-null  object
 4   End Date                                         6243 non-null   object
 5   Trial Status:                                    11251 non-null  object
 6   Therapeutic area                                 11615 non-null  object
 7   Main objective of the trial                      11615 non-null  object
 8   Medical condition(s) being investigated          11551 non-null  object
 9   Medical condition in easily understood

In [26]:
neurology_trials.describe().T

| Column | count | unique | top | freq |
|---|---|---|---|---|
| EudraCT Number: | 11615 | 3732 | 2012-005012-26 | 22 |
| Country | 11452 | 49 | Germany | 2199 |
| Member State | 11431 | 29 | Spain - AEMPS | 1040 |
| Start Date | 11615 | 3389 | 2021-06-17 | 40 |
| End Date | 6243 | 1624 | 2022-12-31 | 31 |
| Trial Status: | 11251 | 10 | Completed | 6141 |
| Therapeutic area | 11615 | 2 | Diseases [C] - Nervous System Diseases [C10] | 6612 |
| Main objective of the trial | 11615 | 5933 | Demonstrate that ofatumumab 20 mg sc once ever... | 27 |
| Medical condition(s) being investigated | 11551 | 4455 | Relapsing Multiple Sclerosis | 122 |
| Medical condition in easily understood language | 11489 | 4310 | Epilepsy | 315 |


In [27]:
depression_trials.to_csv("depression_trials.csv")
neurology_trials.to_csv("neurology_trials.csv")
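
By default, `to_csv` writes the DataFrame index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If the original row positions are not needed downstream, passing `index=False` avoids this. A small sketch using in-memory buffers instead of files; the `toy` frame is hypothetical:

```python
import io
import pandas as pd

# Hypothetical filtered frame with a non-contiguous index, as left by boolean filtering
toy = pd.DataFrame({"Country": ["Germany", "Spain"]}, index=[17, 42])

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Export without the index
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```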