## Data pipeline

Notebook goal: build a data pipeline that processes the PubMed data and produces a graph linking each drug to its mentions in PubMed publications, scientific publications and journals.

In [1]:
import zipfile
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from itertools import compress
import json

%matplotlib inline

#### Load data

In [2]:
# File paths
drugs = pd.read_csv("/Datas/drugs.csv")
pubmed = pd.read_csv("/Datas/pubmed.csv")
pubmedjson = pd.read_json("/Datas/pubmed.json")
clinical_trials = pd.read_csv("/Datas/clinical_trials.csv")

#### Data preparation

In [3]:
drugs

Unnamed: 0,atccode,drug
0,A04AD,DIPHENHYDRAMINE
1,S03AA,TETRACYCLINE
2,V03AB,ETHANOL
3,A03BA,ATROPINE
4,A01AD,EPINEPHRINE
5,6302001,ISOPRENALINE
6,R01AD,BETAMETHASONE


In [4]:
pubmed

Unnamed: 0,id,title,date,journal
0,1,A 44-year-old man with erythema of the face di...,01/01/2019,Journal of emergency nursing
1,2,"An evaluation of benadryl, pyribenzamine, and ...",01/01/2019,Journal of emergency nursing
2,3,Diphenhydramine hydrochloride helps symptoms o...,02/01/2019,The Journal of pediatrics
3,4,Tetracycline Resistance Patterns of Lactobacil...,01/01/2020,Journal of food protection
4,5,Appositional Tetracycline bone formation rates...,02/01/2020,American journal of veterinary research
5,6,Rapid reacquisition of contextual fear followi...,2020-01-01,Psychopharmacology
6,7,The High Cost of Epinephrine Autoinjectors and...,01/02/2020,The journal of allergy and clinical immunology...
7,8,Time to epinephrine treatment is associated wi...,01/03/2020,The journal of allergy and clinical immunology...


In [5]:
pubmedjson

Unnamed: 0,id,title,date,journal
0,9.0,Gold nanoparticles synthesized from Euphorbia ...,2020-01-01,"Journal of photochemistry and photobiology. B,..."
1,10.0,Clinical implications of umbilical artery Dopp...,2020-01-01,The journal of maternal-fetal & neonatal medicine
2,11.0,Effects of Topical Application of Betamethason...,2020-01-01,Journal of back and musculoskeletal rehabilita...
3,12.0,"Comparison of pressure release, phonophoresis ...",2020-01-03,Journal of back and musculoskeletal rehabilita...
4,,"Comparison of pressure BETAMETHASONE release, ...",2020-01-03,The journal of maternal-fetal & neonatal medicine


In [6]:
clinical_trials 

Unnamed: 0,id,scientific_title,date,journal
0,NCT01967433,Use of Diphenhydramine as an Adjunctive Sedati...,1 January 2020,Journal of emergency nursing
1,NCT04189588,Phase 2 Study IV QUZYTTIR™ (Cetirizine Hydroch...,1 January 2020,Journal of emergency nursing
2,NCT04237090,,1 January 2020,Journal of emergency nursing
3,NCT04237091,Feasibility of a Randomized Controlled Clinica...,1 January 2020,Journal of emergency nursing
4,NCT04153396,Preemptive Infiltration With Betamethasone and...,1 January 2020,Hôpitaux Universitaires de Genève
5,NCT03490942,Glucagon Infusion in T1D Patients With Recurre...,25/05/2020,
6,,Glucagon Infusion in T1D Patients With Recurre...,25/05/2020,Journal of emergency nursing
7,NCT04188184,Tranexamic Acid Versus Epinephrine During Expl...,27 April 2020,Journal of emergency nursing\xc3\x28


In [7]:
# pubmed and pubmedjson have the same structure, so we can concatenate them
pubmedb = pd.concat([pubmed,pubmedjson])

In [8]:
# clinical_trials can also be concatenated with pubmedb; a publication_type column distinguishes the two sources
clinical_trials = clinical_trials.rename(columns={"scientific_title": "title"})
pubmedb["publication_type"] = "Pubmed"
clinical_trials["publication_type"] = "Clinical_trials"
df = pd.concat([pubmedb,clinical_trials])

In [9]:
df

Unnamed: 0,id,title,date,journal,publication_type
0,1,A 44-year-old man with erythema of the face di...,01/01/2019,Journal of emergency nursing,Pubmed
1,2,"An evaluation of benadryl, pyribenzamine, and ...",01/01/2019,Journal of emergency nursing,Pubmed
2,3,Diphenhydramine hydrochloride helps symptoms o...,02/01/2019,The Journal of pediatrics,Pubmed
3,4,Tetracycline Resistance Patterns of Lactobacil...,01/01/2020,Journal of food protection,Pubmed
4,5,Appositional Tetracycline bone formation rates...,02/01/2020,American journal of veterinary research,Pubmed
5,6,Rapid reacquisition of contextual fear followi...,2020-01-01,Psychopharmacology,Pubmed
6,7,The High Cost of Epinephrine Autoinjectors and...,01/02/2020,The journal of allergy and clinical immunology...,Pubmed
7,8,Time to epinephrine treatment is associated wi...,01/03/2020,The journal of allergy and clinical immunology...,Pubmed
0,9,Gold nanoparticles synthesized from Euphorbia ...,2020-01-01 00:00:00,"Journal of photochemistry and photobiology. B,...",Pubmed
1,10,Clinical implications of umbilical artery Dopp...,2020-01-01 00:00:00,The journal of maternal-fetal & neonatal medicine,Pubmed


In [10]:
df.index = range(len(df))

In [11]:
# Convert all dates to a single datetime format so they can be compared
df["date"] = pd.to_datetime(df["date"])  
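
The raw date column mixes several layouts (`01/01/2019`, `2020-01-01`, `1 January 2020`, `25/05/2020`). The final dataframe shown later reads `02/01/2019` as 2019-02-01, that is month-first, whereas `25/05/2020` can only be day-first, so the intended convention for slash dates is ambiguous. A minimal sketch of making that choice explicit, using only the standard library; the helper `parse_date`, the `DATE_FORMATS` list and the day-first convention are assumptions for illustration, not part of the notebook:

```python
from datetime import datetime

# Candidate formats, tried in order. Reading slash dates as day-first is an
# assumption; use "%m/%d/%Y" instead if the source is month-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d %B %Y")


def parse_date(raw):
    """Return a datetime for the first format that matches, or None if none do."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


samples = ["01/01/2019", "2020-01-01", "1 January 2020", "25/05/2020"]
parsed_dates = [parse_date(s) for s in samples]
```

Returning `None` for unrecognised strings keeps bad values visible instead of silently guessing a format. Note that `%B` matches English month names only under an English or C locale.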

In [12]:
# Upper-case titles and journals so they can be matched against the upper-case drug names
df["title"] = df["title"].str.upper()
df["journal"] = df["journal"].str.upper()

#### Data pipeline

Building the classification dataframe

In [13]:
def drug_classification(drugs, df):
    """Build one row per publication mention and one per journal mention for each drug."""
    finaldf = []
    drug_names = drugs["drug"].str.strip()  # Names of all the drugs in the dataframe
    for drug in drug_names:
        atccode = drugs.loc[drug_names == drug, "atccode"].iloc[0]
        # Select the publications whose title mentions the drug.
        # na=False excludes missing titles; regex=False does a plain substring search.
        mask = df["title"].str.contains(drug, regex=False, na=False)
        row_pubmed = []
        row_journal = []
        for s in df.index[mask]:
            date = df["date"].loc[s]
            row_pubmed.append([drug, atccode, df["publication_type"].loc[s], df["id"].loc[s], date])
            journal = df["journal"].loc[s]
            if pd.notna(journal):  # Skip publications with a missing journal
                row_journal.append([drug, atccode, "Journal", journal, date])
        # Extend once per drug, so a drug with no match adds nothing
        finaldf = finaldf + row_pubmed + row_journal
    return finaldf
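
The function above treats any substring hit as a mention, so a short drug name could also match inside a longer word. A stricter variant, sketched here with the standard `re` module, matches whole words only. The helper `find_mentions` and the third sample title are hypothetical and serve only to illustrate the idea:

```python
import re


def find_mentions(drug_names, titles):
    """Map each drug name to the indices of the titles that mention it as a whole word."""
    mentions = {}
    for name in drug_names:
        # re.escape guards against regex metacharacters in a drug name.
        pattern = re.compile(r"\b" + re.escape(name.strip()) + r"\b", re.IGNORECASE)
        mentions[name] = [
            i for i, title in enumerate(titles)
            if isinstance(title, str) and pattern.search(title)
        ]
    return mentions


titles = [
    "Tetracycline Resistance Patterns",
    None,  # a missing title, as can happen in the clinical trials data
    "Effects of ethanolamine",  # hypothetical: contains ETHANOL only as a substring
]
result = find_mentions(["TETRACYCLINE", "ETHANOL"], titles)
```

Here `result` maps `TETRACYCLINE` to the first title only and `ETHANOL` to no title at all. Whether whole-word matching is the right rule depends on how drug names appear in the real titles.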

In [14]:
l = drug_classification(drugs,df)

In [15]:
final_df = pd.DataFrame(columns=['Drug_name', 'Drug_atccode', 'Type', 'Source_ID', 'date'], data = l)

In [16]:
final_df = final_df.drop_duplicates() 
nan_value = float("NaN")
final_df.replace("", nan_value, inplace=True)
final_df.dropna(subset = ["Source_ID"], inplace=True)
final_df

Unnamed: 0,Drug_name,Drug_atccode,Type,Source_ID,date
0,DIPHENHYDRAMINE,A04AD,Pubmed,1,2019-01-01
1,DIPHENHYDRAMINE,A04AD,Pubmed,2,2019-01-01
2,DIPHENHYDRAMINE,A04AD,Pubmed,3,2019-02-01
3,DIPHENHYDRAMINE,A04AD,Clinical_trials,NCT01967433,2020-01-01
4,DIPHENHYDRAMINE,A04AD,Clinical_trials,NCT04189588,2020-01-01
5,DIPHENHYDRAMINE,A04AD,Clinical_trials,NCT04237091,2020-01-01
6,DIPHENHYDRAMINE,A04AD,Journal,JOURNAL OF EMERGENCY NURSING,2019-01-01
8,DIPHENHYDRAMINE,A04AD,Journal,THE JOURNAL OF PEDIATRICS,2019-02-01
9,DIPHENHYDRAMINE,A04AD,Journal,JOURNAL OF EMERGENCY NURSING,2020-01-01
12,TETRACYCLINE,S03AA,Pubmed,4,2020-01-01


#### Building the JSON file

In [24]:
import json
from json import dumps

json_dict = {}
json_dict['Drug_classification'] = []
for grp, grp_data in final_df.groupby('Drug_name'):
    grp_dict = {}
    grp_dict['Drug_name'] = grp
    grp_dict['Drug_atccode'] = drugs[drugs["drug"] == grp]["atccode"].iloc[0]
    grp_dict['Publications'] = []
    for p, p_data in grp_data.groupby('Type'):
        p_data = p_data.drop(['Drug_name', 'Drug_atccode'], axis=1).set_index('Type')
        for d in p_data.to_dict(orient='records'):
            grp_dict['Publications'].append({'Type': p, 'Date': str(d['date']), 'Source_ID': d['Source_ID']})
    json_dict['Drug_classification'].append(grp_dict)
json_out = dumps(json_dict)
parsed = json.loads(json_out)

In [25]:
parsed

{'Drug_classification': [{'Drug_name': 'ATROPINE',
   'Drug_atccode': 'A03BA',
   'Publications': [{'Type': 'Journal',
     'Date': '2020-01-03 00:00:00',
     'Source_ID': 'THE JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE'}]},
  {'Drug_name': 'BETAMETHASONE',
   'Drug_atccode': 'R01AD',
   'Publications': [{'Type': 'Clinical_trials',
     'Date': '2020-01-01 00:00:00',
     'Source_ID': 'NCT04153396'},
    {'Type': 'Journal',
     'Date': '2020-01-01 00:00:00',
     'Source_ID': 'THE JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE'},
    {'Type': 'Journal',
     'Date': '2020-01-01 00:00:00',
     'Source_ID': 'JOURNAL OF BACK AND MUSCULOSKELETAL REHABILITATION'},
    {'Type': 'Journal',
     'Date': '2020-01-03 00:00:00',
     'Source_ID': 'THE JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE'},
    {'Type': 'Journal',
     'Date': '2020-01-01 00:00:00',
     'Source_ID': 'HÔPITAUX UNIVERSITAIRES DE GENÈVE'},
    {'Type': 'Pubmed', 'Date': '2020-01-01 00:00:00', 'Source_ID': 10},
    {'T

In [26]:
with open('drug_classification.json', 'w') as f:
    json.dump(parsed, f)
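
As an illustration of how the resulting structure can be queried, the sketch below counts how many distinct drugs each journal mentions. The function `journals_by_drug_count` and the `sample` dictionary are assumptions written for this example; `sample` is a hand-made excerpt in the same shape as the file, not the real output:

```python
from collections import defaultdict


def journals_by_drug_count(classification):
    """Return (journal, number of distinct drugs) pairs, most drugs first."""
    drugs_per_journal = defaultdict(set)
    for entry in classification["Drug_classification"]:
        for pub in entry["Publications"]:
            if pub["Type"] == "Journal":
                drugs_per_journal[pub["Source_ID"]].add(entry["Drug_name"])
    counts = [(journal, len(names)) for journal, names in drugs_per_journal.items()]
    # Sort by descending count, then by journal name for a stable order.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))


# Hand-written sample in the same shape as drug_classification.json.
sample = {
    "Drug_classification": [
        {
            "Drug_name": "ATROPINE",
            "Drug_atccode": "A03BA",
            "Publications": [
                {"Type": "Journal", "Date": "2020-01-03 00:00:00",
                 "Source_ID": "THE JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE"},
            ],
        },
        {
            "Drug_name": "BETAMETHASONE",
            "Drug_atccode": "R01AD",
            "Publications": [
                {"Type": "Journal", "Date": "2020-01-01 00:00:00",
                 "Source_ID": "THE JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE"},
                {"Type": "Journal", "Date": "2020-01-01 00:00:00",
                 "Source_ID": "JOURNAL OF BACK AND MUSCULOSKELETAL REHABILITATION"},
                {"Type": "Pubmed", "Date": "2020-01-01 00:00:00", "Source_ID": 10},
            ],
        },
    ]
}
ranking = journals_by_drug_count(sample)
```

Using a set per journal means a drug mentioned several times by the same journal is counted once.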