# FAERS graph

## Data quality analysis

### 1. Introduction

This notebook explores the data contents and quality of the FAERS data files, available for download [here](https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html).


#### Notebook contents:
1. [Introduction](#1.-Introduction)
2. [Notebook setup](#2.-Notebook-setup)
3. [Data sources](#3.-Data-sources)
4. [Sample raw data files]()
5. [DQA summary]()
6. [Next steps]()

### 2. Notebook setup  
#### Imports

In [1]:
import pandas as pd
import numpy as np

import re
import xml.etree.ElementTree as ET

from timeit import default_timer as timer

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

#### Settings

In [2]:
# Customize matplotlib default settings
matplotlib.rcParams.update({'font.size': 16})
plt.rcParams["figure.figsize"] = (20,10)

In [3]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

### 3. Data sources

FAERS stands for the FDA Adverse Event Reporting System. It is a database of adverse event reports, medication error reports, and product quality complaints resulting in adverse events that were submitted to the FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. ([Source](https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm))


https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm

In [3]:
raw_data_path = "data/raw/"

In [4]:
!ls data/raw/

[1m[36mascii_2018q4[m[m [1m[36mxml_2018q4[m[m


Let's look at the XML data files.

In [5]:
!ls data/raw/xml_2018q4/xml

1_ADR18Q4.xml 2_ADR18Q4.xml 3_ADR18Q4.xml XML_NTS.pdf   xml18q4.pdf


In [5]:
! head -50 data/raw/xml_2018q4/xml/1_ADR18Q4.xml

<?xml version="1.0"?>
<ichicsr lang="en">
  <ichicsrmessageheader>
    <messagetype>ICSR</messagetype>
    <messageformatversion>2.1</messageformatversion>
    <messageformatrelease>1.0</messageformatrelease>
    <messagenumb>2019-02</messagenumb>
    <messagesenderidentifier>FDA CDER</messagesenderidentifier>
    <messagereceiveridentifier>Public Use</messagereceiveridentifier>
    <messagedateformat>204</messagedateformat>
    <messagedate>20190207040220</messagedate>
  </ichicsrmessageheader>
  <safetyreport>
    <safetyreportversion>1</safetyreportversion>
    <safetyreportid>15529521</safetyreportid>
    <primarysourcecountry>US</primarysourcecountry>
    <occurcountry>US</occurcountry>
    <transmissiondateformat>102</transmissiondateformat>
    <transmissiondate>20190205</transmissiondate>
    <reporttype>1</reporttype>
    <serious>2</serious>
    <receivedateformat>102</receivedateformat>
    <receivedate>20181018</receivedate>
    <receiptdateformat>102

In [6]:
raw_xml_path = "data/raw/xml_2018q4/xml/"

In [7]:
xml_file_1 = raw_xml_path + "1_ADR18Q4.xml"

In [8]:
tree = ET.parse(xml_file_1)
root = tree.getroot()

In [9]:
root.tag

'ichicsr'

In [10]:
root.attrib

{'lang': 'en'}

In [11]:
# Print the tag and attributes of the first 11 top-level elements
for i, child in enumerate(root):
    if i > 10:
        break
    print(child.tag, child.attrib)

ichicsrmessageheader {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}


In [12]:
root[0][1].text

'2.1'

In [13]:
# Print the first 11 safety report IDs found anywhere in the tree
for i, report_id in enumerate(root.iter('safetyreportid')):
    if i > 10:
        break
    print(report_id.text)

15529521
15529522
15529524
15529856
15529858
15529861
15530134
15529556
15529558
15529559
15529564


In [14]:
all_tags = list(set([elem.tag for elem in root.iter()]))

In [15]:
len(all_tags)

87

In [16]:
all_tags.sort()

Print out all data fields (tag names) found in this XML file:

In [17]:
all_tags

['actiondrug',
 'activesubstance',
 'activesubstancename',
 'authoritynumb',
 'companynumb',
 'drug',
 'drugadditional',
 'drugadministrationroute',
 'drugauthorizationnumb',
 'drugbatchnumb',
 'drugcharacterization',
 'drugcumulativedosagenumb',
 'drugcumulativedosageunit',
 'drugdosageform',
 'drugdosagetext',
 'drugenddate',
 'drugenddateformat',
 'drugindication',
 'drugintervaldosagedefinition',
 'drugintervaldosageunitnumb',
 'drugrecuraction',
 'drugrecurreadministration',
 'drugrecurrence',
 'drugseparatedosagenumb',
 'drugstartdate',
 'drugstartdateformat',
 'drugstructuredosagenumb',
 'drugstructuredosageunit',
 'drugtreatmentduration',
 'drugtreatmentdurationunit',
 'duplicate',
 'duplicatenumb',
 'duplicatesource',
 'fulfillexpeditecriteria',
 'ichicsr',
 'ichicsrmessageheader',
 'literaturereference',
 'medicinalproduct',
 'messagedate',
 'messagedateformat',
 'messageformatrelease',
 'messageformatversion',
 'messagenumb',
 'messagereceiveridentifier',
 'messagesenderiden
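The tag list suggests that each `safetyreport` is a nested record. As a minimal sketch, assuming only a few top-level fields are of interest, the reports could be streamed and flattened into dictionaries as below. The helper `iter_reports`, the `FIELDS` tuple, and the inline `SAMPLE_XML` are illustrative assumptions and not part of the notebook; the second sample report uses made-up field values.

```python
import io
import xml.etree.ElementTree as ET

# Tiny inline sample that mimics the file structure shown above.
# The second report's field values are invented for illustration.
SAMPLE_XML = b"""<?xml version="1.0"?>
<ichicsr lang="en">
  <ichicsrmessageheader>
    <messagetype>ICSR</messagetype>
  </ichicsrmessageheader>
  <safetyreport>
    <safetyreportid>15529521</safetyreportid>
    <primarysourcecountry>US</primarysourcecountry>
    <serious>2</serious>
  </safetyreport>
  <safetyreport>
    <safetyreportid>15529522</safetyreportid>
    <serious>1</serious>
  </safetyreport>
</ichicsr>
"""

# Hypothetical selection of direct children of <safetyreport> to extract
FIELDS = ("safetyreportid", "primarysourcecountry", "serious")


def iter_reports(source, fields=FIELDS):
    """Yield one dict per <safetyreport>, parsing the source incrementally."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "safetyreport":
            # findtext returns None when the child element is absent
            yield {name: elem.findtext(name) for name in fields}
            # Drop the processed subtree's children to limit memory use
            elem.clear()


reports = list(iter_reports(io.BytesIO(SAMPLE_XML)))
print(reports)
```

For the real files, `iter_reports(xml_file_1)` should work the same way, since `ET.iterparse` also accepts a file path. Incremental parsing is used here only because the quarterly XML files are large; `ET.parse` as used above is equally valid when memory allows.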

Now let's look at the format of the ASCII data files.

In [18]:
!ls data/raw/ascii_2018q4/ascii

ASC_NTS.pdf  INDI18Q4.txt RPSR18Q4.txt drug18q4.pdf reac18q4.pdf
DEMO18Q4.txt OUTC18Q4.txt THER18Q4.txt indi18q4.pdf rpsr18q4.pdf
DRUG18Q4.txt REAC18Q4.txt demo18q4.pdf outc18q4.pdf ther18q4.pdf


In [19]:
!head data/raw/ascii_2018q4/ascii/DEMO18Q4.txt

primaryid$caseid$caseversion$i_f_code$event_dt$mfr_dt$init_fda_dt$fda_dt$rept_cod$auth_num$mfr_num$mfr_sndr$lit_ref$age$age_cod$age_grp$sex$e_sub$wt$wt_cod$rept_dt$to_mfr$occp_cod$reporter_country$occr_country
100035916$10003591$6$F$20130718$20181203$20140312$20181211$EXP$$PHHY2013GB101660$NOVARTIS$$47$YR$$F$Y$$$20181211$$OT$GB$GB
100050413$10005041$3$F$20140306$20141118$20140312$20181213$EXP$$US-TEVA-468475USA$TEVA$$25$YR$$F$Y$68.1$KG$20181213$$CN$US$US
1000551312$10005513$12$F$20120209$20181107$20140313$20181115$EXP$$BR-AMGEN-BRASP2012013548$AMGEN$$55$YR$A$F$Y$67$KG$20181115$$CN$BR$BR
100058832$10005883$2$F$$20180928$20140313$20181012$EXP$$FR-RANBAXY-2014RR-78735$RANBAXY$$31$YR$$F$Y$$$20181012$$OT$GB$FR
100065479$10006547$9$F$201203$20181211$20140313$20181228$EXP$$US-BAYER-2014-035909$BAYER$$36$YR$A$F$Y$90.7$KG$20181228$$CN$US$US
100066188$10006618$8$F$$20181004$20140313$20181017$PER$$US-PFIZER INC-2014069077$PFIZER$$58$YR$$F$Y$$$20181017$$CN$US$US
1000808588$10008085$88$F$201

In [20]:
raw_ascii_path = "data/raw/ascii_2018q4/ascii/"

In [21]:
ascii_file_demo = raw_ascii_path + "DEMO18Q4.txt"

In [22]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'caseversion': np.int32, 
    'i_f_code': 'object', 
    'event_dt': 'object', 
    'mfr_dt': 'object',
    'init_fda_dt': 'object', 
    'fda_dt': 'object', 
    'rept_cod': 'object', 
    'auth_num': 'object', 
    'mfr_num': 'object', 
    'mfr_sndr': 'object',
    'lit_ref': 'object', 
    'age': np.float64, 
    'age_cod': 'object', 
    'age_grp': 'object', 
    'sex': 'object', 
    'e_sub': 'object', 
    'wt': np.float64, 
    'wt_cod': 'object',
    'rept_dt': 'object', 
    'to_mfr': 'object', 
    'occp_cod': 'object', 
    'reporter_country': 'object', 
    'occr_country': 'object'
}

# read_csv's dtype argument accepts a per-column mapping, e.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}

In [23]:
demo = pd.read_csv(ascii_file_demo, sep='$', dtype=datatypes)

In [24]:
demo.columns

Index(['primaryid', 'caseid', 'caseversion', 'i_f_code', 'event_dt', 'mfr_dt',
       'init_fda_dt', 'fda_dt', 'rept_cod', 'auth_num', 'mfr_num', 'mfr_sndr',
       'lit_ref', 'age', 'age_cod', 'age_grp', 'sex', 'e_sub', 'wt', 'wt_cod',
       'rept_dt', 'to_mfr', 'occp_cod', 'reporter_country', 'occr_country'],
      dtype='object')

In [25]:
demo.wt.head(20)

0      nan
1    68.10
2    67.00
3      nan
4    90.70
5      nan
6    50.34
7      nan
8      nan
9      nan
10     nan
11     nan
12     nan
13   72.00
14     nan
15   79.00
16     nan
17     nan
18     nan
19     nan
Name: wt, dtype: float64

In [26]:
demo.head()

| | primaryid | caseid | caseversion | i_f_code | event_dt | mfr_dt | init_fda_dt | fda_dt | rept_cod | auth_num | mfr_num | mfr_sndr | lit_ref | age | age_cod | age_grp | sex | e_sub | wt | wt_cod | rept_dt | to_mfr | occp_cod | reporter_country | occr_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100035916 | 10003591 | 6 | F | 20130718.0 | 20181203 | 20140312 | 20181211 | EXP | | PHHY2013GB101660 | NOVARTIS | | 47.0 | YR | | F | Y | | | 20181211 | | OT | GB | GB |
| 1 | 100050413 | 10005041 | 3 | F | 20140306.0 | 20141118 | 20140312 | 20181213 | EXP | | US-TEVA-468475USA | TEVA | | 25.0 | YR | | F | Y | 68.1 | KG | 20181213 | | CN | US | US |
| 2 | 1000551312 | 10005513 | 12 | F | 20120209.0 | 20181107 | 20140313 | 20181115 | EXP | | BR-AMGEN-BRASP2012013548 | AMGEN | | 55.0 | YR | A | F | Y | 67.0 | KG | 20181115 | | CN | BR | BR |
| 3 | 100058832 | 10005883 | 2 | F | | 20180928 | 20140313 | 20181012 | EXP | | FR-RANBAXY-2014RR-78735 | RANBAXY | | 31.0 | YR | | F | Y | | | 20181012 | | OT | GB | FR |
| 4 | 100065479 | 10006547 | 9 | F | 201203.0 | 20181211 | 20140313 | 20181228 | EXP | | US-BAYER-2014-035909 | BAYER | | 36.0 | YR | A | F | Y | 90.7 | KG | 20181228 | | CN | US | US |


In [27]:
demo.describe(include='all')

| | primaryid | caseid | caseversion | i_f_code | event_dt | mfr_dt | init_fda_dt | fda_dt | rept_cod | auth_num | mfr_num | mfr_sndr | lit_ref | age | age_cod | age_grp | sex | e_sub | wt | wt_cod | rept_dt | to_mfr | occp_cod | reporter_country | occr_country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 394066.0 | 394066.0 | 394066.0 | 394066 | 205438.0 | 370593.0 | 394066.0 | 394066.0 | 394066 | 20168.0 | 370595 | 394065 | 23441 | 235444.0 | 235452 | 80189 | 347760 | 394066 | 81142.0 | 81142 | 393749.0 | 23479 | 387070 | 394066 | 394053 |
| unique | 394066.0 | 394066.0 | | 2 | 4711.0 | 2370.0 | 2503.0 | 183.0 | 3 | 15597.0 | 370595 | 471 | 17759 | | 6 | 6 | 3 | 2 | | 2 | 351.0 | 3 | 5 | 160 | 163 |
| top | 154544191.0 | 14508418.0 | | I | 2018.0 | 20181210.0 | 20181016.0 | 20181016.0 | EXP | 0.0 | PHJP2018JP021151 | PFIZER | STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK... | | YR | A | F | Y | | KG | 20181016.0 | N | CN | US | US |
| freq | 1.0 | 1.0 | | 267661 | 25293.0 | 6857.0 | 11177.0 | 12657.0 | 204438 | 14.0 | 1 | 35409 | 79 | | 230226 | 48200 | 212580 | 370587 | | 80809 | 11312.0 | 22175 | 168973 | 249968 | 262062 |
| mean | | | 1.67 | | | | | | | | | | | 200.03 | | | | | 75.17 | | | | | | |
| std | | | 1.75 | | | | | | | | | | | 1843.75 | | | | | 29.24 | | | | | | |
| min | | | 1.0 | | | | | | | | | | | -10.0 | | | | | 0.0 | | | | | | |
| 25% | | | 1.0 | | | | | | | | | | | 45.0 | | | | | 59.87 | | | | | | |
| 50% | | | 1.0 | | | | | | | | | | | 60.0 | | | | | 72.58 | | | | | | |
| 75% | | | 2.0 | | | | | | | | | | | 71.0 | | | | | 88.45 | | | | | | |


In [28]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394066 entries, 0 to 394065
Data columns (total 25 columns):
primaryid           394066 non-null object
caseid              394066 non-null object
caseversion         394066 non-null int32
i_f_code            394066 non-null object
event_dt            205438 non-null object
mfr_dt              370593 non-null object
init_fda_dt         394066 non-null object
fda_dt              394066 non-null object
rept_cod            394066 non-null object
auth_num            20168 non-null object
mfr_num             370595 non-null object
mfr_sndr            394065 non-null object
lit_ref             23441 non-null object
age                 235444 non-null float64
age_cod             235452 non-null object
age_grp             80189 non-null object
sex                 347760 non-null object
e_sub               394066 non-null object
wt                  81142 non-null float64
wt_cod              81142 non-null object
rept_dt             393749 non-nu

The ASCII files look easier to work with, so let's start with them instead of the XML files.

##### Unique identifiers

The `primaryid` field is the unique identifier for the current version of a case report in the data. It is the concatenation of `caseid` and `caseversion`: for example, `caseid` 10003591 with `caseversion` 6 gives `primaryid` 100035916.
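The concatenation rule can be checked directly. The sketch below rebuilds `primaryid` from the other two fields on a three-row sample copied from the `DEMO18Q4.txt` preview above; the names `sample`, `rebuilt`, and `mismatches` are illustrative and not part of the notebook.

```python
import pandas as pd

# First three rows of DEMO18Q4.txt as shown in the preview above;
# IDs are kept as strings, matching the dtypes used when loading the file.
sample = pd.DataFrame({
    "primaryid": ["100035916", "100050413", "1000551312"],
    "caseid": ["10003591", "10005041", "10005513"],
    "caseversion": [6, 3, 12],
})

# Rebuild primaryid as the string concatenation of caseid and caseversion
rebuilt = sample["caseid"] + sample["caseversion"].astype(str)

# Rows where the rebuilt value disagrees with the stored primaryid
mismatches = sample[rebuilt != sample["primaryid"]]
print(len(mismatches))
```

Running the same comparison on the full `demo` frame (replacing `sample` with `demo`) would show whether the rule holds for every row, which this small sample cannot establish on its own.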

In [37]:
demo.primaryid.describe()

count        394066
unique       394066
top       154544191
freq              1
Name: primaryid, dtype: object

In [38]:
demo.caseid.describe()

count       394066
unique      394066
top       14508418
freq             1
Name: caseid, dtype: object

In [39]:
demo.caseversion.describe()

count   394066.00
mean         1.67
std          1.75
min          1.00
25%          1.00
50%          1.00
75%          2.00
max         88.00
Name: caseversion, dtype: float64

In [40]:
len(demo[demo.caseversion > 2])

51721

In [41]:
len(demo[demo.caseversion > 10])

2161

In [42]:
demo.caseversion.value_counts()

1     267661
2      74684
3      25090
4      11270
5       5559
6       3129
7       1857
8       1261
9        824
10       570
11       443
12       342
13       226
14       177
15       139
16       117
17       106
18        71
19        68
21        58
20        50
22        34
23        31
26        31
24        28
       ...  
36         5
56         5
48         4
49         4
53         4
39         4
37         4
45         3
54         3
47         3
42         3
50         3
43         3
57         2
55         2
77         2
52         2
51         2
44         2
58         1
59         1
66         1
68         1
75         1
88         1
Name: caseversion, Length: 64, dtype: int64

In [29]:
demo.age.value_counts()

70.00       5562
65.00       5499
63.00       5440
60.00       5394
68.00       5333
64.00       5316
62.00       5312
67.00       5191
69.00       5178
61.00       5177
66.00       5176
71.00       5159
59.00       4903
58.00       4885
56.00       4647
57.00       4605
55.00       4567
72.00       4561
75.00       4526
74.00       4404
73.00       4400
54.00       4260
76.00       3869
53.00       3831
52.00       3748
            ... 
25508.00       1
25506.00       1
797.00         1
25444.00       1
25355.00       1
25356.00       1
25380.00       1
25399.00       1
25408.00       1
25409.00       1
25426.00       1
25437.00       1
795.00         1
25456.00       1
25503.00       1
25461.00       1
25463.00       1
25466.00       1
25467.00       1
25477.00       1
25483.00       1
25488.00       1
25490.00       1
25502.00       1
1023.00        1
Name: age, Length: 1809, dtype: int64

In [31]:
demo.age.head(20)

0    47.00
1    25.00
2    55.00
3    31.00
4    36.00
5    58.00
6    73.00
7      nan
8      nan
9      nan
10   47.00
11   59.00
12     nan
13   53.00
14   27.00
15   66.00
16   38.00
17   69.00
18     nan
19   31.00
Name: age, dtype: float64

In [32]:
demo.age[7]

nan

In [33]:
type(demo.age[7])

numpy.float64

In [34]:
type(demo.age[6])

numpy.float64

In [35]:
demo.age[6]

73.0

In [36]:
np.isnan(demo.age[7])

True
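
The `age` values above range from -10 to more than 25,000, which suggests that `age` has to be read together with the unit in `age_cod`, although the notebook has not confirmed this yet. A minimal sketch of converting ages to years follows. The unit codes in `YEARS_PER_UNIT` are an assumption that should be checked against `ASC_NTS.pdf`, and the pairing of 25508 with `DY` in the sample is hypothetical.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: age_cod unit codes and their meaning (verify against ASC_NTS.pdf)
YEARS_PER_UNIT = {
    "DEC": 10.0,                 # decades
    "YR": 1.0,                   # years
    "MON": 1 / 12,               # months
    "WK": 7 / 365.25,            # weeks
    "DY": 1 / 365.25,            # days
    "HR": 1 / (365.25 * 24),     # hours
}

# Illustrative rows; the (25508, "DY") pairing is hypothetical
sample = pd.DataFrame({
    "age": [47.0, 25508.0, np.nan, -10.0],
    "age_cod": ["YR", "DY", None, "YR"],
})

# Map each unit code to a conversion factor; unknown or missing codes give NaN
factor = sample["age_cod"].map(YEARS_PER_UNIT)
age_years = sample["age"] * factor

# Treat negative ages as missing rather than guessing a correction
age_years = age_years.where(age_years >= 0)
print(age_years)
```

Under these assumptions an age of 25508 days corresponds to roughly 70 years, which would be consistent with the most common ages in the `value_counts` output.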