# FAERS graph

## Data quality analysis

### 1. Introduction

This notebook explores the data contents and quality of the FAERS data files, available for download [here](https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html).


#### Notebook contents:
1. [Introduction](#1.-Introduction)
2. [Notebook setup](#2.-Notebook-setup)
3. [Data sources](#3.-Data-sources)  
    3.1 [DEMO ASCII file](#3.1-DEMO-ASCII-file)
4. [Sample raw data files]()
5. [DQA summary]()
6. [Next steps]()

### 2. Notebook setup  
#### Imports

In [1]:
import pandas as pd
import numpy as np

import re
import xml.etree.ElementTree as ET

from timeit import default_timer as timer

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

#### Settings

In [2]:
# Customize matplotlib default settings
matplotlib.rcParams.update({'font.size': 16})
plt.rcParams["figure.figsize"] = (20,10)

In [3]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

### 3. Data sources

FAERS stands for FDA Adverse Event Reporting System. It is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. ([Source](https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm)) 


https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm 


Data file download for 2018Q4.

In [3]:
raw_data_path = "data/raw/"

In [4]:
!ls data/raw/

[1m[36mascii_2018q4[m[m [1m[36mxml_2018q4[m[m


#### Let's look at the XML data files.

In [5]:
!ls data/raw/xml_2018q4/xml

1_ADR18Q4.xml 2_ADR18Q4.xml 3_ADR18Q4.xml XML_NTS.pdf   xml18q4.pdf


In [5]:
! head -50 data/raw/xml_2018q4/xml/1_ADR18Q4.xml

<?xml version="1.0"?>
<ichicsr lang="en">
  <ichicsrmessageheader>
    <messagetype>ICSR</messagetype>
    <messageformatversion>2.1</messageformatversion>
    <messageformatrelease>1.0</messageformatrelease>
    <messagenumb>2019-02</messagenumb>
    <messagesenderidentifier>FDA CDER</messagesenderidentifier>
    <messagereceiveridentifier>Public Use</messagereceiveridentifier>
    <messagedateformat>204</messagedateformat>
    <messagedate>20190207040220</messagedate>
  </ichicsrmessageheader>
  <safetyreport>
    <safetyreportversion>1</safetyreportversion>
    <safetyreportid>15529521</safetyreportid>
    <primarysourcecountry>US</primarysourcecountry>
    <occurcountry>US</occurcountry>
    <transmissiondateformat>102</transmissiondateformat>
    <transmissiondate>20190205</transmissiondate>
    <reporttype>1</reporttype>
    <serious>2</serious>
    <receivedateformat>102</receivedateformat>
    <receivedate>20181018</receivedate>
    <receiptdateformat>102

In [6]:
raw_xml_path = "data/raw/xml_2018q4/xml/"

In [7]:
xml_file_1 = raw_xml_path + "1_ADR18Q4.xml"

In [8]:
tree = ET.parse(xml_file_1)
root = tree.getroot()

In [9]:
root.tag

'ichicsr'

In [10]:
root.attrib

{'lang': 'en'}

In [11]:
# Print the first 11 top-level elements
for child in root[:11]:
    print(child.tag, child.attrib)

ichicsrmessageheader {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}


In [12]:
root[0][1].text

'2.1'

In [13]:
from itertools import islice

# Print the first 11 report IDs
for report_id in islice(root.iter('safetyreportid'), 11):
    print(report_id.text)

15529521
15529522
15529524
15529856
15529858
15529861
15530134
15529556
15529558
15529559
15529564
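If fields from the XML files are needed later, one possible approach is to flatten each `safetyreport` element into a row. The sketch below is only an illustration: it runs on a small inline snippet copied from the `head` output above rather than on the real file, and the helper name `flat_fields` is hypothetical. It keeps only direct children that carry text and ignores nested elements.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Inline snippet based on the head output above (not the full file).
xml_snippet = """<ichicsr lang="en">
  <safetyreport>
    <safetyreportversion>1</safetyreportversion>
    <safetyreportid>15529521</safetyreportid>
    <primarysourcecountry>US</primarysourcecountry>
    <serious>2</serious>
  </safetyreport>
</ichicsr>"""

sample_root = ET.fromstring(xml_snippet)


def flat_fields(report):
    """Collect direct children that have text and no children of their own."""
    return {
        child.tag: child.text
        for child in report
        if len(child) == 0 and child.text
    }


# One row per safetyreport element.
reports = pd.DataFrame(
    [flat_fields(r) for r in sample_root.iter("safetyreport")]
)
print(reports)
```

Nested elements such as `drug` would need their own tables, so this covers only the report-level fields.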


In [14]:
all_tags = list(set([elem.tag for elem in root.iter()]))

In [15]:
len(all_tags)

87

In [16]:
all_tags.sort()

Print out all data fields in this XML file

In [17]:
all_tags

['actiondrug',
 'activesubstance',
 'activesubstancename',
 'authoritynumb',
 'companynumb',
 'drug',
 'drugadditional',
 'drugadministrationroute',
 'drugauthorizationnumb',
 'drugbatchnumb',
 'drugcharacterization',
 'drugcumulativedosagenumb',
 'drugcumulativedosageunit',
 'drugdosageform',
 'drugdosagetext',
 'drugenddate',
 'drugenddateformat',
 'drugindication',
 'drugintervaldosagedefinition',
 'drugintervaldosageunitnumb',
 'drugrecuraction',
 'drugrecurreadministration',
 'drugrecurrence',
 'drugseparatedosagenumb',
 'drugstartdate',
 'drugstartdateformat',
 'drugstructuredosagenumb',
 'drugstructuredosageunit',
 'drugtreatmentduration',
 'drugtreatmentdurationunit',
 'duplicate',
 'duplicatenumb',
 'duplicatesource',
 'fulfillexpeditecriteria',
 'ichicsr',
 'ichicsrmessageheader',
 'literaturereference',
 'medicinalproduct',
 'messagedate',
 'messagedateformat',
 'messageformatrelease',
 'messageformatversion',
 'messagenumb',
 'messagereceiveridentifier',
 'messagesenderiden

#### Now let's look at the format of the ASCII data files.

In [18]:
!ls data/raw/ascii_2018q4/ascii

ASC_NTS.pdf  INDI18Q4.txt RPSR18Q4.txt drug18q4.pdf reac18q4.pdf
DEMO18Q4.txt OUTC18Q4.txt THER18Q4.txt indi18q4.pdf rpsr18q4.pdf
DRUG18Q4.txt REAC18Q4.txt demo18q4.pdf outc18q4.pdf ther18q4.pdf


In [19]:
!head data/raw/ascii_2018q4/ascii/DEMO18Q4.txt

primaryid$caseid$caseversion$i_f_code$event_dt$mfr_dt$init_fda_dt$fda_dt$rept_cod$auth_num$mfr_num$mfr_sndr$lit_ref$age$age_cod$age_grp$sex$e_sub$wt$wt_cod$rept_dt$to_mfr$occp_cod$reporter_country$occr_country
100035916$10003591$6$F$20130718$20181203$20140312$20181211$EXP$$PHHY2013GB101660$NOVARTIS$$47$YR$$F$Y$$$20181211$$OT$GB$GB
100050413$10005041$3$F$20140306$20141118$20140312$20181213$EXP$$US-TEVA-468475USA$TEVA$$25$YR$$F$Y$68.1$KG$20181213$$CN$US$US
1000551312$10005513$12$F$20120209$20181107$20140313$20181115$EXP$$BR-AMGEN-BRASP2012013548$AMGEN$$55$YR$A$F$Y$67$KG$20181115$$CN$BR$BR
100058832$10005883$2$F$$20180928$20140313$20181012$EXP$$FR-RANBAXY-2014RR-78735$RANBAXY$$31$YR$$F$Y$$$20181012$$OT$GB$FR
100065479$10006547$9$F$201203$20181211$20140313$20181228$EXP$$US-BAYER-2014-035909$BAYER$$36$YR$A$F$Y$90.7$KG$20181228$$CN$US$US
100066188$10006618$8$F$$20181004$20140313$20181017$PER$$US-PFIZER INC-2014069077$PFIZER$$58$YR$$F$Y$$$20181017$$CN$US$US
1000808588$10008085$88$F$201

In [20]:
raw_ascii_path = "data/raw/ascii_2018q4/ascii/"

In [21]:
ascii_file_demo = raw_ascii_path + "DEMO18Q4.txt"

In [22]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'caseversion': np.int32, 
    'i_f_code': 'object', 
    'event_dt': 'object', 
    'mfr_dt': 'object',
    'init_fda_dt': 'object', 
    'fda_dt': 'object', 
    'rept_cod': 'object', 
    'auth_num': 'object', 
    'mfr_num': 'object', 
    'mfr_sndr': 'object',
    'lit_ref': 'object', 
    'age': np.float64, 
    'age_cod': 'object', 
    'age_grp': 'object', 
    'sex': 'object', 
    'e_sub': 'object', 
    'wt': np.float64, 
    'wt_cod': 'object',
    'rept_dt': 'object', 
    'to_mfr': 'object', 
    'occp_cod': 'object', 
    'reporter_country': 'object', 
    'occr_country': 'object'
}

# dtype accepts a column-to-type mapping, e.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}

In [23]:
demo = pd.read_csv(ascii_file_demo, sep='$', dtype=datatypes)

In [24]:
demo.columns

Index(['primaryid', 'caseid', 'caseversion', 'i_f_code', 'event_dt', 'mfr_dt',
       'init_fda_dt', 'fda_dt', 'rept_cod', 'auth_num', 'mfr_num', 'mfr_sndr',
       'lit_ref', 'age', 'age_cod', 'age_grp', 'sex', 'e_sub', 'wt', 'wt_cod',
       'rept_dt', 'to_mfr', 'occp_cod', 'reporter_country', 'occr_country'],
      dtype='object')

In [26]:
demo.head()

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
0,100035916,10003591,6,F,20130718.0,20181203,20140312,20181211,EXP,,PHHY2013GB101660,NOVARTIS,,47.0,YR,,F,Y,,,20181211,,OT,GB,GB
1,100050413,10005041,3,F,20140306.0,20141118,20140312,20181213,EXP,,US-TEVA-468475USA,TEVA,,25.0,YR,,F,Y,68.1,KG,20181213,,CN,US,US
2,1000551312,10005513,12,F,20120209.0,20181107,20140313,20181115,EXP,,BR-AMGEN-BRASP2012013548,AMGEN,,55.0,YR,A,F,Y,67.0,KG,20181115,,CN,BR,BR
3,100058832,10005883,2,F,,20180928,20140313,20181012,EXP,,FR-RANBAXY-2014RR-78735,RANBAXY,,31.0,YR,,F,Y,,,20181012,,OT,GB,FR
4,100065479,10006547,9,F,201203.0,20181211,20140313,20181228,EXP,,US-BAYER-2014-035909,BAYER,,36.0,YR,A,F,Y,90.7,KG,20181228,,CN,US,US


In [27]:
demo.describe(include='all')

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
count,394066.0,394066.0,394066.0,394066,205438.0,370593.0,394066.0,394066.0,394066,20168.0,370595,394065,23441,235444.0,235452,80189,347760,394066,81142.0,81142,393749.0,23479,387070,394066,394053
unique,394066.0,394066.0,,2,4711.0,2370.0,2503.0,183.0,3,15597.0,370595,471,17759,,6,6,3,2,,2,351.0,3,5,160,163
top,154544191.0,14508418.0,,I,2018.0,20181210.0,20181016.0,20181016.0,EXP,0.0,PHJP2018JP021151,PFIZER,"STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK...",,YR,A,F,Y,,KG,20181016.0,N,CN,US,US
freq,1.0,1.0,,267661,25293.0,6857.0,11177.0,12657.0,204438,14.0,1,35409,79,,230226,48200,212580,370587,,80809,11312.0,22175,168973,249968,262062
mean,,,1.67,,,,,,,,,,,200.03,,,,,75.17,,,,,,
std,,,1.75,,,,,,,,,,,1843.75,,,,,29.24,,,,,,
min,,,1.0,,,,,,,,,,,-10.0,,,,,0.0,,,,,,
25%,,,1.0,,,,,,,,,,,45.0,,,,,59.87,,,,,,
50%,,,1.0,,,,,,,,,,,60.0,,,,,72.58,,,,,,
75%,,,2.0,,,,,,,,,,,71.0,,,,,88.45,,,,,,


In [28]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394066 entries, 0 to 394065
Data columns (total 25 columns):
primaryid           394066 non-null object
caseid              394066 non-null object
caseversion         394066 non-null int32
i_f_code            394066 non-null object
event_dt            205438 non-null object
mfr_dt              370593 non-null object
init_fda_dt         394066 non-null object
fda_dt              394066 non-null object
rept_cod            394066 non-null object
auth_num            20168 non-null object
mfr_num             370595 non-null object
mfr_sndr            394065 non-null object
lit_ref             23441 non-null object
age                 235444 non-null float64
age_cod             235452 non-null object
age_grp             80189 non-null object
sex                 347760 non-null object
e_sub               394066 non-null object
wt                  81142 non-null float64
wt_cod              81142 non-null object
rept_dt             393749 non-nu

The ASCII files look easier to work with. According to the documentation, the two file types contain mostly the same information, although each has some fields that the other lacks.  

Let's start with the ASCII files first, and add any supplemental info from XMLs later if needed.

#### 3.1 DEMO ASCII file  

From above, the number of case reports in the 2018Q4 DEMO file is 394,066, which is consistent with the number supplied by FDA in the accompanying documentation.
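The field-by-field checks below repeatedly compare missing-value counts against the FDA documentation. A small helper along these lines could produce those counts for every column at once. This is a sketch on a tiny illustrative frame; the function name `missing_summary` is hypothetical, and in the notebook it would be applied to `demo`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and share of missing values per column, most missing first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share_missing": n_missing / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)


# Tiny illustrative frame; values are taken from the first DEMO rows above.
sample = pd.DataFrame({
    "primaryid": ["100035916", "100050413", "100058832"],
    "event_dt": ["20130718", "20140306", None],
    "wt": [None, 68.1, None],
})
print(missing_summary(sample))
```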

##### Unique identifiers

The `primaryid` field is the unique identifier for a current case report in the data, and it is a combination of `caseid` and `caseversion`.

In [37]:
demo.primaryid.describe()

count        394066
unique       394066
top       154544191
freq              1
Name: primaryid, dtype: object

The unique record identifier is indeed unique. Great.

In [38]:
demo.caseid.describe()

count       394066
unique      394066
top       14508418
freq             1
Name: caseid, dtype: object

In [39]:
demo.caseversion.describe()

count   394066.00
mean         1.67
std          1.75
min          1.00
25%          1.00
50%          1.00
75%          2.00
max         88.00
Name: caseversion, dtype: float64

In [43]:
demo[["primaryid", "caseid", "caseversion"]].head()

Unnamed: 0,primaryid,caseid,caseversion
0,100035916,10003591,6
1,100050413,10005041,3
2,1000551312,10005513,12
3,100058832,10005883,2
4,100065479,10006547,9
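The rows above suggest that `primaryid` is the string concatenation of `caseid` and `caseversion`. A quick way to test that reading is to rebuild the identifier and count mismatches. The sketch below uses only the five rows shown above; on the full data the same comparison would run on `demo`, assuming the identifiers are loaded as strings.

```python
import pandas as pd

# First rows of the DEMO file as shown above; identifiers are kept as strings.
ids = pd.DataFrame({
    "primaryid": ["100035916", "100050413", "1000551312", "100058832", "100065479"],
    "caseid": ["10003591", "10005041", "10005513", "10005883", "10006547"],
    "caseversion": [6, 3, 12, 2, 9],
})

# Rebuild the identifier by string concatenation and compare with the file's value.
rebuilt = ids["caseid"] + ids["caseversion"].astype(str)
mismatches = ids[rebuilt != ids["primaryid"]]
print(len(mismatches))
```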


##### Distribution of case version values

In [42]:
demo.caseversion.value_counts()

1     267661
2      74684
3      25090
4      11270
5       5559
6       3129
7       1857
8       1261
9        824
10       570
11       443
12       342
13       226
14       177
15       139
16       117
17       106
18        71
19        68
21        58
20        50
22        34
23        31
26        31
24        28
       ...  
36         5
56         5
48         4
49         4
53         4
39         4
37         4
45         3
54         3
47         3
42         3
50         3
43         3
57         2
55         2
77         2
52         2
51         2
44         2
58         1
59         1
66         1
68         1
75         1
88         1
Name: caseversion, Length: 64, dtype: int64

In [173]:
demo.caseversion.value_counts(normalize=True).head(10)

1    0.68
2    0.19
3    0.06
4    0.03
5    0.01
6    0.01
7    0.00
8    0.00
9    0.00
10   0.00
Name: caseversion, dtype: float64

Most cases have a most recent case version number of 1 (68%), 2 (19%) or 3 (6%); together these account for 93% of the cases. About 2% of all cases have a most recent case version number above 6. The highest case version number is 88. There are no missing values.
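The percentages quoted here can be recomputed directly from the counts in the `value_counts()` output. The snippet below is just that arithmetic, with the counts copied by hand from the output above.

```python
# Counts copied from the value_counts() output above.
total = 394066
counts = {1: 267661, 2: 74684, 3: 25090, 4: 11270, 5: 5559, 6: 3129}

# Share of cases whose latest version is 1, 2 or 3.
share_top3 = sum(counts[v] for v in (1, 2, 3)) / total

# Share of cases whose latest version is above 6 (everything not listed).
share_above_6 = (total - sum(counts.values())) / total

print(f"versions 1-3: {share_top3:.1%}")
print(f"versions > 6: {share_above_6:.1%}")
```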

##### i_f_code  

From documentation:
> Code for initial or follow-up status of report, as reported by manufacturer.
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | I    | Initial      |
> | F    | Follow-up    |


In [110]:
demo.i_f_code.describe()

count     394066
unique         2
top            I
freq      267661
Name: i_f_code, dtype: object

In [104]:
demo.i_f_code.value_counts()

I    267661
F    126405
Name: i_f_code, dtype: int64

In [105]:
demo.i_f_code.value_counts(normalize=True)

I   0.68
F   0.32
Name: i_f_code, dtype: float64

This is consistent with the 68% of records with caseversion=1 shown above: the 267,661 initial reports match the number of version 1 records exactly. No missing values in this field.
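Matching totals do not by themselves show that the same records are involved, so a row-level cross-check could be useful. The sketch below cross-tabulates `i_f_code` against whether `caseversion` equals 1, using a handful of rows copied from the outputs in this notebook. The expectation that initial reports always carry version 1 is an assumption to be tested, not something the documentation quoted here states.

```python
import pandas as pd

# (caseversion, i_f_code) pairs copied from rows shown in this notebook.
rows = pd.DataFrame({
    "caseversion": [6, 3, 12, 2, 9, 1],
    "i_f_code": ["F", "F", "F", "F", "F", "I"],
})

# Label each row by version group, then cross-tabulate against i_f_code.
version_group = rows["caseversion"].eq(1).map(
    {True: "version 1", False: "version > 1"}
)
table = pd.crosstab(rows["i_f_code"], version_group)
print(table)

# Rows where "initial" status and "version 1" disagree.
inconsistent = rows[(rows["i_f_code"] == "I") != (rows["caseversion"] == 1)]
print(len(inconsistent))
```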

##### event_dt

From documentation:  

> Date the adverse event occurred or began. (YYYYMMDD format) –
If a complete date is not available, a partial date is
provided.

In [111]:
demo.event_dt.describe()

count     205438
unique      4711
top         2018
freq       25293
Name: event_dt, dtype: object

In [115]:
demo.event_dt.value_counts(dropna=False)

NaN         188628
2018         25293
201810        4210
201809        4096
2017          4057
201808        3189
201811        2739
2016          2353
201807        2290
2015          1785
201806        1751
201805        1462
201804        1266
20181001      1235
2014          1143
201803        1123
201801         993
201802         947
20181101       887
20181002       879
201712         866
20181010       850
20181015       836
201812         822
20181008       816
             ...  
19920610         1
20061220         1
20110221         1
20070521         1
20130615         1
20111128         1
20081214         1
20090710         1
20100719         1
198804           1
20100830         1
20081129         1
20081020         1
200005           1
20051108         1
20080921         1
20101125         1
20091007         1
20050119         1
20080712         1
20071122         1
20040314         1
19980826         1
20001221         1
20050511         1
Name: event_dt, Length: 4712, d

In [174]:
demo.event_dt.value_counts(normalize=True, dropna=False).head(20)

NaN        0.48
2018       0.06
201810     0.01
201809     0.01
2017       0.01
201808     0.01
201811     0.01
2016       0.01
201807     0.01
2015       0.00
201806     0.00
201805     0.00
201804     0.00
20181001   0.00
2014       0.00
201803     0.00
201801     0.00
201802     0.00
20181101   0.00
20181002   0.00
Name: event_dt, dtype: float64

Nearly half of the adverse event cases do not have a date for when the adverse event occurred or began. The missing values count is consistent with the number provided by the FDA.
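Among the non-missing values, the output above shows a mix of full dates (`YYYYMMDD`) and partial dates (`YYYYMM` or `YYYY`). One simple way to quantify this is to classify each value by its string length. The sketch below assumes `event_dt` is loaded as strings, as in the `datatypes` mapping; the helper name `date_precision` is hypothetical.

```python
import pandas as pd

# Values taken from the event_dt output above, plus one missing entry.
event_dt = pd.Series(["20130718", "201203", "2018", None, "20181001"], dtype="object")


def date_precision(value):
    """Classify a FAERS date string by its length."""
    if pd.isna(value):
        return "missing"
    return {8: "day", 6: "month", 4: "year"}.get(len(value), "other")


precision = event_dt.map(date_precision)
print(precision.value_counts())
```

Any value landing in the `other` bucket would be worth inspecting by hand before parsing dates.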

##### mfr_dt  

From documentation:  

> Date manufacturer first received initial information. In subsequent versions of a case, the latest manufacturer received date will be provided (YYYYMMDD format). If a complete date is not available, a partial date will be provided.

In [118]:
demo.mfr_dt.describe()

count       370593
unique        2370
top       20181210
freq          6857
Name: mfr_dt, dtype: object

In [119]:
demo.mfr_dt.value_counts(dropna=False)

NaN         23473
20181210     6857
20181203     5742
20181029     5662
20181126     5605
20181211     5482
20181001     5481
20181009     5465
20181217     5312
20181022     5298
20181212     5285
20181015     5264
20181127     5176
20181119     5107
20181003     5090
20181214     5007
20181112     5007
20181204     4999
20181105     4912
20181008     4850
20181023     4842
20181128     4796
20181106     4793
20181120     4720
20181024     4705
            ...  
20060831        1
20080701        1
20150222        1
20160313        1
20150115        1
20080717        1
20120516        1
20160717        1
20081114        1
20130123        1
20130711        1
20081020        1
20101025        1
20151002        1
20111110        1
20100624        1
20110302        1
20110905        1
20120316        1
20071003        1
20040312        1
20021014        1
20120731        1
20131111        1
20031024        1
Name: mfr_dt, Length: 2371, dtype: int64

In [175]:
demo.mfr_dt.value_counts(normalize=True, dropna=False).head(20)

NaN        0.06
20181210   0.02
20181203   0.01
20181029   0.01
20181126   0.01
20181211   0.01
20181001   0.01
20181009   0.01
20181217   0.01
20181022   0.01
20181212   0.01
20181015   0.01
20181127   0.01
20181119   0.01
20181003   0.01
20181214   0.01
20181112   0.01
20181204   0.01
20181105   0.01
20181008   0.01
Name: mfr_dt, dtype: float64

About 6% of the values in this field are missing. The missing values count is consistent with the FDA number.

##### init_fda_dt  

From documentation:

> Date FDA received first version (Initial) of Case (YYYYMMDD format)

In [121]:
demo.init_fda_dt.describe()

count       394066
unique        2503
top       20181016
freq         11177
Name: init_fda_dt, dtype: object

In [123]:
demo.init_fda_dt.value_counts()

20181016    11177
20181017     7905
20181018     6215
20181217     6142
20181120     6061
20181129     5967
20181015     5741
20181010     5285
20181116     5029
20181102     5020
20181001     4922
20181022     4912
20181205     4853
20181221     4843
20181113     4833
20181128     4833
20181011     4813
20181029     4800
20181213     4735
20181003     4726
20181219     4721
20181206     4706
20181126     4590
20181009     4498
20181228     4483
            ...  
20100422        1
20030725        1
20100723        1
20130605        1
20141005        1
20110406        1
20151122        1
20121225        1
20130224        1
20050418        1
20110111        1
20060626        1
20130513        1
20110719        1
20170916        1
20110912        1
20121210        1
20091103        1
20061220        1
20110218        1
20030428        1
20130829        1
20100716        1
20150130        1
20150919        1
Name: init_fda_dt, Length: 2503, dtype: int64

In [176]:
demo.init_fda_dt.value_counts(normalize=True).head(20)

20181016   0.03
20181017   0.02
20181018   0.02
20181217   0.02
20181120   0.02
20181129   0.02
20181015   0.01
20181010   0.01
20181116   0.01
20181102   0.01
20181001   0.01
20181022   0.01
20181205   0.01
20181221   0.01
20181113   0.01
20181128   0.01
20181011   0.01
20181029   0.01
20181213   0.01
20181003   0.01
Name: init_fda_dt, dtype: float64

No missing values.

##### fda_dt  

From documentation:  

> Date FDA received Case. In subsequent versions of a case, the latest manufacturer received date will be provided (YYYYMMDD format).

In [125]:
demo.fda_dt.describe()

count       394066
unique         183
top       20181016
freq         12657
Name: fda_dt, dtype: object

In [127]:
demo.fda_dt.value_counts()

20181016    12657
20181017     9620
20181217     8111
20181129     7432
20181120     7355
20181018     7265
20181227     6866
20181221     6863
20181015     6783
20181228     6514
20181219     6444
20181218     6357
20181213     6256
20181206     6212
20181205     6176
20181010     6079
20181116     6071
20181126     6053
20181102     5997
20181220     5981
20181128     5893
20181210     5836
20181029     5823
20181113     5821
20181022     5740
            ...  
20180704       13
20180724       13
20180703       13
20180916       12
20180729       11
20180706       11
20180805       11
20180812        9
20180922        9
20180902        8
20180705        8
20180819        6
20180825        6
20180702        4
20180923        3
20180707        2
20180908        2
20180721        2
20180708        1
20180930        1
20180112        1
20180722        1
20180826        1
20180909        1
20180714        1
Name: fda_dt, Length: 183, dtype: int64

In [177]:
demo.fda_dt.value_counts(normalize=True).head(20)

20181016   0.03
20181017   0.02
20181217   0.02
20181129   0.02
20181120   0.02
20181018   0.02
20181227   0.02
20181221   0.02
20181015   0.02
20181228   0.02
20181219   0.02
20181218   0.02
20181213   0.02
20181206   0.02
20181205   0.02
20181010   0.02
20181116   0.02
20181126   0.02
20181102   0.02
20181220   0.02
Name: fda_dt, dtype: float64

No missing values, consistent with FDA number.
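The tail of the `value_counts()` output shows some `fda_dt` values that fall outside October to December 2018, for example `20180112` and a number of dates in July to September. Whether that is expected for this extract is not established here, but such rows can be flagged for review. The sketch below uses a few values from the output above and assumes the quarter boundaries are 2018-10-01 and 2018-12-31.

```python
import pandas as pd

# A few fda_dt values taken from the output above.
fda_dt = pd.Series(["20181016", "20181228", "20180930", "20180112"])

# Parse as full dates; anything unparseable becomes NaT.
parsed = pd.to_datetime(fda_dt, format="%Y%m%d", errors="coerce")

# Assumed quarter boundaries for 2018Q4.
in_quarter = parsed.between(pd.Timestamp("2018-10-01"), pd.Timestamp("2018-12-31"))

# Values outside the quarter (NaT would also be flagged here).
outside = fda_dt[~in_quarter].tolist()
print(outside)
```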

##### rept_cod  

From documentation:  

> Code for the type of report submitted (See table below)
> 
> | CODE | MEANING_TEXT             |
> | ---- | ------------------------ |
> | EXP  | Expedited (15-Day)       |
> | PER  | Periodic (Non-Expedited) |
> | DIR  | Direct                   |
>
> Expedited (15-day) and Periodic (Non-Expedited) reports are from manufacturers; "Direct" reports are voluntarily submitted to the FDA by non-manufacturers.





In [129]:
demo.rept_cod.describe()

count     394066
unique         3
top          EXP
freq      204438
Name: rept_cod, dtype: object

In [130]:
demo.rept_cod.value_counts()

EXP    204438
PER    166157
DIR     23471
Name: rept_cod, dtype: int64

In [131]:
demo.rept_cod.value_counts(normalize=True)

EXP   0.52
PER   0.42
DIR   0.06
Name: rept_cod, dtype: float64

No missing values.

##### auth_num  

From documentation:  

> Regulatory Authority’s case report number, when available.  
> \* New tag added in 2014Q3 extract.

In [132]:
demo.auth_num.describe()

count     20168
unique    15597
top        0000
freq         14
Name: auth_num, dtype: object

In [133]:
demo.auth_num.value_counts(dropna=False)

NaN                                                  373898
0000                                                     14
00                                                       11
DE-CADRBFARM-2018025631                                  10
GB-MHRA-EYC 00190348                                      9
FR-AFSSAPS-TS20180923                                     8
FR-AFSSAPS-CN20182166                                     8
GB-MHRA-EYC 00188736                                      7
GB-MHRA-ADR 22496422                                      7
FR-AFSSAPS-AM20180734                                     7
GB-MHRA-MIDB-95BC6E56-49F4-47C0-BD11-9D2E1A9707AE         7
IT-MINISAL02-502630                                       7
DE-BFARM-18013879                                         7
FR-AFSSAPS-PV20180920                                     7
FR-AFSSAPS-RN20181277                                     7
FR-AFSSAPS-NT20181672                                     7
FR-AFSSAPS-RN20181400                   

In [178]:
demo.auth_num.value_counts(normalize=True, dropna=False).head(10)

NaN                       0.95
0000                      0.00
00                        0.00
DE-CADRBFARM-2018025631   0.00
GB-MHRA-EYC 00190348      0.00
FR-AFSSAPS-TS20180923     0.00
FR-AFSSAPS-CN20182166     0.00
GB-MHRA-EYC 00188736      0.00
GB-MHRA-ADR 22496422      0.00
FR-AFSSAPS-AM20180734     0.00
Name: auth_num, dtype: float64

About 95% of the values are missing. A few of the remaining values (such as `0000` and `00`) may be placeholders or defaults standing in for missing values.
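If these all-zero strings are indeed placeholders, they could be converted to missing values before further analysis. The sketch below treats any string made up only of zeros as a suspected placeholder. That rule is an assumption, not something stated in the FDA documentation, and it runs on a few values copied from the output above.

```python
import pandas as pd

# Values taken from the auth_num output above, plus one missing entry.
auth_num = pd.Series(
    ["0000", "00", "DE-CADRBFARM-2018025631", None, "GB-MHRA-EYC 00190348"]
)

# Assumption: strings consisting only of zeros are placeholders.
is_placeholder = auth_num.str.fullmatch(r"0+", na=False)

# Replace suspected placeholders with missing values.
cleaned = auth_num.mask(is_placeholder)
print(cleaned)
```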

##### mfr_num  

From documentation:  

> Manufacturer's unique report identifier.

In [135]:
demo.mfr_num.describe()

count               370595
unique              370595
top       PHJP2018JP021151
freq                     1
Name: mfr_num, dtype: object

In [136]:
demo.mfr_num.value_counts(dropna=False)

NaN                                                          23471
US-ROCHE-2112065                                                 1
PHHO2018CA011677                                                 1
CL-PROVELL PHARMACEUTICALS-2056832                               1
US-IGSA-SR10006388                                               1
CN-ROCHE-2208075                                                 1
PH-B.I. PHARMACEUTICALS,INC./RIDGEFIELD-2018-BI-049837           1
US-AMGEN-USASP2018185967                                         1
CA-ROCHE-2190755                                                 1
PHHY2016IT042284                                                 1
US-DSJP-DSU-2015-134280                                          1
JP-BRISTOL-MYERS SQUIBB COMPANY-BMS-2018-033906                  1
US-ENDO PHARMACEUTICALS INC-2017-006027                          1
AU-GLAXOSMITHKLINE-AU2017GSK189294                               1
JP-TEVA-201803_00001484                                       

In [179]:
demo.mfr_num.value_counts(normalize=True, dropna=False).head(10)

NaN                                                      0.06
US-ROCHE-2112065                                         0.00
PHHO2018CA011677                                         0.00
CL-PROVELL PHARMACEUTICALS-2056832                       0.00
US-IGSA-SR10006388                                       0.00
CN-ROCHE-2208075                                         0.00
PH-B.I. PHARMACEUTICALS,INC./RIDGEFIELD-2018-BI-049837   0.00
US-AMGEN-USASP2018185967                                 0.00
CA-ROCHE-2190755                                         0.00
PHHY2016IT042284                                         0.00
Name: mfr_num, dtype: float64

6% of values are missing, and the missing values count is consistent with the FDA number. The non-missing values are unique, as expected.
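The number of missing `mfr_num` values (23,471) equals the number of `DIR` reports in `rept_cod` above, which suggests, but does not prove, that the missing values are the direct reports with no manufacturer involved. A cross-tabulation would confirm or refute this on the full data. The sketch below illustrates the check on a few rows copied from this notebook.

```python
import pandas as pd

# (rept_cod, mfr_num) pairs copied from rows shown in this notebook.
rows = pd.DataFrame({
    "rept_cod": ["EXP", "EXP", "PER", "DIR"],
    "mfr_num": [
        "PHHY2013GB101660",
        "US-TEVA-468475USA",
        "US-PFIZER INC-2014069077",
        None,
    ],
})

# Label each row by whether mfr_num is missing, then cross-tabulate.
missing_label = rows["mfr_num"].isna().map(
    {True: "mfr_num missing", False: "mfr_num present"}
)
table = pd.crosstab(rows["rept_cod"], missing_label)
print(table)
```

If the hypothesis holds on `demo`, the `DIR` row would have all of its records in the `mfr_num missing` column and the other rows would have none.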

##### mfr_sndr  

From documentation:  

> Coded name of manufacturer sending report; if not found, then verbatim name of organization sending report.

In [139]:
demo.mfr_sndr.describe()

count     394065
unique       471
top       PFIZER
freq       35409
Name: mfr_sndr, dtype: object

In [140]:
demo.mfr_sndr.value_counts(dropna=False)

PFIZER                               35409
AMGEN                                30828
NOVARTIS                             25360
FDA-CTU                              23470
SANOFI AVENTIS                       18107
JANSSEN                              14866
CELGENE                              13511
BRISTOL MYERS SQUIBB                 13442
TEVA                                 12871
ABBVIE                               11719
ROCHE                                11304
GLAXOSMITHKLINE                       9406
SANDOZ                                9030
MYLAN                                 7967
ELI LILLY AND CO                      7273
BIOGEN                                7030
ASTRAZENECA                           6940
MERCK                                 6872
BAYER                                 6671
AUROBINDO                             6130
BOEHRINGER INGELHEIM                  4651
GILEAD                                4591
TAKEDA                                4040
ACTELION   

In [180]:
demo.mfr_sndr.value_counts(normalize=True, dropna=False).head(20)

PFIZER                 0.09
AMGEN                  0.08
NOVARTIS               0.06
FDA-CTU                0.06
SANOFI AVENTIS         0.05
JANSSEN                0.04
CELGENE                0.03
BRISTOL MYERS SQUIBB   0.03
TEVA                   0.03
ABBVIE                 0.03
ROCHE                  0.03
GLAXOSMITHKLINE        0.02
SANDOZ                 0.02
MYLAN                  0.02
ELI LILLY AND CO       0.02
BIOGEN                 0.02
ASTRAZENECA            0.02
MERCK                  0.02
BAYER                  0.02
AUROBINDO              0.02
Name: mfr_sndr, dtype: float64

In [155]:
# count missing values
demo.mfr_sndr.isna().sum()

1

In [156]:
demo[demo.mfr_sndr.isna()]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
203873,155751552,15575155,2,F,,,20181026,20181026,DIR,,,,,71.0,YR,,M,N,,,20181025,N,OT,US,US


One missing value, consistent with the FDA number.

##### lit_ref  

From documentation:  

> Literature Reference information, when available; populated with last 500 characters if >500 characters are available.
>
> \* New tag added in 2014Q3 extract.

In [157]:
demo.lit_ref.describe()

count                                                 23441
unique                                                17759
top       STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK...
freq                                                     79
Name: lit_ref, dtype: object

In [160]:
demo.lit_ref.value_counts(dropna=False)

NaN                                                                                                                                                                                                                                                                                                                                                                                                                                                                370625
STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK M, HALL M, VASU S, HAMILTON C, KITZMAN D, HUNDLEY W. ASYMPTOMATIC MYOCARDIAL ISCHEMIA FORECASTS ADVERSE EVENTS IN CARDIOVASCULAR MAGNETIC RESONANCE DOBUTAMINE STRESS TESTING OF HIGH-RISK MIDDLE-AGED AND ELDERLY INDIVIDUALS. JOURNAL OF CARDIOVASCULAR MAGNETIC RESONANCE. 2018;20(75):1-11.                                                                                                                         79
DOI: 10.4081/NI.2018.7469#. LAPMAG A, LERTSINUDOM S, CHAIYAKAM A, SAWANYAWISUTH K, T

In [181]:
demo.lit_ref.value_counts(normalize=True, dropna=False).head(10)

NaN                                                                                                                                                                                                                                                                                                                                              0.94
STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK M, HALL M, VASU S, HAMILTON C, KITZMAN D, HUNDLEY W. ASYMPTOMATIC MYOCARDIAL ISCHEMIA FORECASTS ADVERSE EVENTS IN CARDIOVASCULAR MAGNETIC RESONANCE DOBUTAMINE STRESS TESTING OF HIGH-RISK MIDDLE-AGED AND ELDERLY INDIVIDUALS. JOURNAL OF CARDIOVASCULAR MAGNETIC RESONANCE. 2018;20(75):1-11.   0.00
DOI: 10.4081/NI.2018.7469#. LAPMAG A, LERTSINUDOM S, CHAIYAKAM A, SAWANYAWISUTH K, TIAMKAO S. CLINICAL OUTCOMES OF INTRAVENOUS LEVETIRACETAM TREATMENT IN PATIENTS WITH RENAL IMPAIRMENT. NEUROLOGY INTERNATIONAL. 2018;10(3):7469                                                                                          

94% of the values are missing.

##### age  

From documentation:  

> Numeric value of patient's age at event.

In [162]:
demo.age.describe()

count   235444.00
mean       200.03
std       1843.75
min        -10.00
25%         45.00
50%         60.00
75%         71.00
max      34926.00
Name: age, dtype: float64

In [163]:
demo.age.value_counts(dropna=False)

nan         158622
70.00         5562
65.00         5499
63.00         5440
60.00         5394
68.00         5333
64.00         5316
62.00         5312
67.00         5191
69.00         5178
61.00         5177
66.00         5176
71.00         5159
59.00         4903
58.00         4885
56.00         4647
57.00         4605
55.00         4567
72.00         4561
75.00         4526
74.00         4404
73.00         4400
54.00         4260
76.00         3869
53.00         3831
             ...  
25851.00         1
25835.00         1
12911.00         1
25647.00         1
12825.00         1
25656.00         1
25658.00         1
25660.00         1
25662.00         1
656.00           1
25667.00         1
25680.00         1
25694.00         1
25706.00         1
25728.00         1
25731.00         1
25751.00         1
805.00           1
25771.00         1
25787.00         1
25812.00         1
25815.00         1
25817.00         1
25821.00         1
24309.00         1
Name: age, Length: 1810, dtype:

In [172]:
demo.age.value_counts(normalize=True, dropna=False).head(20)

nan     0.40
70.00   0.01
65.00   0.01
63.00   0.01
60.00   0.01
68.00   0.01
64.00   0.01
62.00   0.01
67.00   0.01
69.00   0.01
61.00   0.01
66.00   0.01
71.00   0.01
59.00   0.01
58.00   0.01
56.00   0.01
57.00   0.01
55.00   0.01
72.00   0.01
75.00   0.01
Name: age, dtype: float64

In [183]:
demo[demo.age < 0]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
263592,156388221,15638822,1,I,20180429,20180508,20181120,20181120,PER,,US-PERRIGO-18US005100,PERRIGO,,-10.0,YR,,F,Y,77.98,KG,20181120,,CN,US,US


In [188]:
len(demo[demo.age > 100])

2011

In [189]:
demo[demo.age > 100].head()

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
532,104248582,10424858,2,F,201407,20181107,20140902,20181115,EXP,,US-ASTRAZENECA-2014SE63425,ASTRAZENECA,,552.0,MON,,M,Y,79.4,KG,20181115,,,US,US
655,1050904010,10509040,10,F,2014,20181030,20141009,20181107,EXP,,US-ASTRAZENECA-2014SE69488,ASTRAZENECA,,801.0,MON,,F,Y,50.3,KG,20181107,,,US,US
1075,107543802,10754380,2,F,201501,20181031,20150202,20181112,PER,,US-ASTRAZENECA-2015SE07497,ASTRAZENECA,,1023.0,MON,,M,Y,65.8,KG,20181112,,,US,US
1211,108765514,10876551,4,F,20130401,20180919,20150301,20181019,PER,,US-ASTRAZENECA-2013SE23016,ASTRAZENECA,,25245.0,DY,,F,Y,101.2,KG,20181019,,,US,US
1337,109705173,10970517,3,F,201411,20181119,20150331,20181122,EXP,,US-ASTRAZENECA-2015SE28983,ASTRAZENECA,,764.0,MON,,F,Y,93.0,KG,20181122,,,US,US


Age is missing in 40% of the records. One record has a negative age value (-10 years), which will need to be cleaned. Most of the age values greater than 100 are recorded in a unit other than years, such as months or days, which likely explains why the mean (200.03) is far above the median (60).  

The missing values count is consistent with the FDA number.
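The claim about units can be checked directly by counting the `age_cod` values among the ages above 100. The sketch below is illustrative only: `sample` is a hypothetical stand-in for `demo`, built from a few of the rows displayed above.

```python
import pandas as pd

# Hypothetical stand-in for `demo`, using age / age_cod pairs shown in the outputs above.
sample = pd.DataFrame({
    "age": [552.0, 801.0, 1023.0, 25245.0, 764.0, 60.0, -10.0],
    "age_cod": ["MON", "MON", "MON", "DY", "MON", "YR", "YR"],
})

# Which units are used by the ages above 100?
units_over_100 = sample.loc[sample["age"] > 100, "age_cod"].value_counts(dropna=False)
print(units_over_100)

# Ages that cannot be valid in any unit.
negative_ages = sample[sample["age"] < 0]
print(negative_ages)
```

On the full data, replacing `sample` with `demo` would show how many of the 2,011 values above 100 are recorded in years.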

##### age_cod  

From documentation:  

> Unit abbreviation for patient's age (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | DEC  | DECADE       |
> | YR   | YEAR         |
> | MON  | MONTH        |
> | WK   | WEEK         |
> | DY   | DAY          |
> | HR   | HOUR         |

In [190]:
demo.age_cod.describe()

count     235452
unique         6
top           YR
freq      230226
Name: age_cod, dtype: object

In [191]:
demo.age_cod.value_counts(dropna=False)

YR     230226
NaN    158614
DY       1935
DEC      1618
MON      1536
WK        127
HR         10
Name: age_cod, dtype: int64

In [192]:
demo.age_cod.value_counts(normalize=True, dropna=False)

YR    0.58
NaN   0.40
DY    0.00
DEC   0.00
MON   0.00
WK    0.00
HR    0.00
Name: age_cod, dtype: float64

In [207]:
demo[(demo.age.isna()) & (demo.age_cod.notna())]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
119358,154870241,15487024,1,I,,,20181010,20181010,DIR,,,FDA-CTU,,,YR,,F,N,20.87,KG,20181010,N,CN,US,US
200400,155714611,15571461,1,I,20181016.0,,20181030,20181030,DIR,,,FDA-CTU,,,WK,,M,N,3.13,KG,20181030,N,MD,US,US
234374,156076701,15607670,1,I,,,20181112,20181112,DIR,,,FDA-CTU,,,YR,,F,N,,,20181112,N,,US,US
281552,156581371,15658137,1,I,,,20181115,20181115,DIR,,,FDA-CTU,,,DY,,F,N,54.43,KG,20181114,N,OT,US,US
318373,156982451,15698245,1,I,20181109.0,,20181126,20181126,DIR,,,FDA-CTU,,,YR,,M,N,,,20181120,N,OT,US,US
322785,157031521,15703152,1,I,20180918.0,,20181130,20181130,DIR,,,FDA-CTU,,,YR,,M,N,,,20181130,N,PH,US,US
332043,157133731,15713373,1,I,20180524.0,,20181129,20181129,DIR,,,FDA-CTU,,,YR,,,N,85.55,KG,20180822,N,PH,US,US
382724,157711891,15771189,1,I,,,20181227,20181227,DIR,,,FDA-CTU,,,YR,,F,N,11.0,KG,20181227,N,,US,US


In [208]:
len(demo[(demo.age.isna()) & (demo.age_cod.notna())])

8

In [209]:
demo[(demo.age.notna()) & (demo.age_cod.isna())]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country


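This field is missing in 40% of the records, in line with the 40% of records with a missing age. The two counts differ by 8 (158,614 missing `age_cod` vs. 158,622 missing `age`): the 8 records shown above have a unit but no age value, while no record has an age without a unit. Of the non-missing values, most are in years (230,226 of 235,452).  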

The missing values count is consistent with the FDA number.
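Because `age` is only meaningful together with `age_cod`, one possible cleaning step is to convert every age to years. The following is a minimal sketch, not the notebook's method: the conversion factors are approximations chosen for illustration (they do not come from the FAERS documentation), and `age_in_years` and `sample` are hypothetical names.

```python
import pandas as pd

# Assumed conversion factors to years (approximate; not from the FAERS documentation).
YEARS_PER_UNIT = {
    "DEC": 10.0,
    "YR": 1.0,
    "MON": 1 / 12,
    "WK": 7 / 365.25,
    "DY": 1 / 365.25,
    "HR": 1 / (365.25 * 24),
}

def age_in_years(df: pd.DataFrame) -> pd.Series:
    """Convert age to years; NaN where age is missing or age_cod is missing/unknown."""
    factors = df["age_cod"].map(YEARS_PER_UNIT)
    return df["age"] * factors

# Hypothetical sample rows; the first two pairs appear in the outputs above.
sample = pd.DataFrame({
    "age": [552.0, 25245.0, 60.0, 7.0, None],
    "age_cod": ["MON", "DY", "YR", "DEC", "YR"],
})
sample["age_yr"] = age_in_years(sample)
print(sample)
```

The sketch does not handle the negative age value; that record would still need a separate rule.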

##### age_grp  

From documentation:  

> Patient Age Group code as follows, when available:
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | N    | Neonate      |
> | I    | Infant       |
> | C    | Child        |
> | T    | Adolescent   |
> | A    | Adult        |
> | E    | Elderly      |
>
> \* New tag added in 2014Q3 extract.

In [210]:
demo.age_grp.describe()

count     80189
unique        6
top           A
freq      48200
Name: age_grp, dtype: object

In [211]:
demo.age_grp.value_counts(dropna=False)

NaN    313877
A       48200
E       27869
C        1547
T        1129
N         916
I         528
Name: age_grp, dtype: int64

In [212]:
demo.age_grp.value_counts(normalize=True, dropna=False)

NaN   0.80
A     0.12
E     0.07
C     0.00
T     0.00
N     0.00
I     0.00
Name: age_grp, dtype: float64

80% of the values are missing, compared to the 40% missing age values. The missing value counts are consistent with the FDA number.
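To see how the missing `age_grp` values overlap with the missing `age` values, the two missingness flags can be cross-tabulated. This is a small sketch on made-up rows; `sample` and the "missing"/"present" labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows: age_grp can be populated even when age is missing.
sample = pd.DataFrame({
    "age": [78.0, None, 51.0, None],
    "age_grp": [None, "A", None, None],
})

labels = {True: "missing", False: "present"}
overlap = pd.crosstab(
    sample["age"].isna().map(labels),      # rows: status of age
    sample["age_grp"].isna().map(labels),  # columns: status of age_grp
)
print(overlap)
```

Run against `demo`, the "missing"/"present" cell would count the records where `age_grp` could partly compensate for a missing `age`.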

##### sex  

From documentation:  

> Code for patient's sex (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | UNK  | Unknown      |
> | M    | Male         |
> | F    | Female       |

In [213]:
demo.sex.describe()

count     347760
unique         3
top            F
freq      212580
Name: sex, dtype: object

In [214]:
demo.sex.value_counts(dropna=False)

F      212580
M      135150
NaN     46306
UNK        30
Name: sex, dtype: int64

In [215]:
demo.sex.value_counts(normalize=True, dropna=False)

F     0.54
M     0.34
NaN   0.12
UNK   0.00
Name: sex, dtype: float64

12% missing values. The frequency counts and percentages are consistent with the FDA numbers.

##### e_sub  

From documentation:  

> Whether (Y/N) this report was submitted under the electronic submissions procedure for manufacturers.

In [216]:
demo.e_sub.describe()

count     394066
unique         2
top            Y
freq      370587
Name: e_sub, dtype: object

In [218]:
demo.e_sub.value_counts(dropna=False)

Y    370587
N     23479
Name: e_sub, dtype: int64

In [219]:
demo.e_sub.value_counts(normalize=True, dropna=False)

Y   0.94
N   0.06
Name: e_sub, dtype: float64

No missing values. The frequency counts are consistent with the FDA numbers.

##### wt  

From documentation:  

> Numeric value of patient's weight.

In [220]:
demo.wt.describe()

count   81142.00
mean       75.17
std        29.24
min         0.00
25%        59.87
50%        72.58
75%        88.45
max      2890.00
Name: wt, dtype: float64

In [221]:
demo.wt.value_counts(dropna=False)

nan       312924
70.00       1404
60.00       1343
65.00       1094
68.00       1039
80.00        990
75.00        949
90.00        844
63.00        834
72.00        803
50.00        802
73.00        741
64.00        726
55.00        707
62.00        699
82.00        685
58.00        674
54.00        657
59.00        652
74.00        648
85.00        635
57.00        617
67.00        613
61.00        610
77.00        608
           ...  
83.07          1
26.45          1
165.60         1
3.06           1
23.80          1
158.75         1
121.05         1
66.33          1
204.30         1
92.89          1
433.00         1
74.37          1
40.65          1
14.05          1
8.16           1
69.48          1
1.74           1
48.18          1
56.29          1
5.27           1
22.80          1
22.25          1
1.67           1
46.81          1
54.11          1
Name: wt, Length: 3961, dtype: int64

In [222]:
demo.wt.value_counts(normalize=True, dropna=False).head(20)

nan     0.79
70.00   0.00
60.00   0.00
65.00   0.00
68.00   0.00
80.00   0.00
75.00   0.00
90.00   0.00
63.00   0.00
72.00   0.00
50.00   0.00
73.00   0.00
64.00   0.00
55.00   0.00
62.00   0.00
82.00   0.00
58.00   0.00
54.00   0.00
59.00   0.00
74.00   0.00
Name: wt, dtype: float64

79% missing values. Missing value counts are consistent with the FDA number.

##### wt_cod  

From documentation:  

> Unit abbreviation for patient's weight (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | KG   | Kilograms    |
> | LBS  | Pounds       |
> | GMS  | Grams        |

In [223]:
demo.wt_cod.describe()

count     81142
unique        2
top          KG
freq      80809
Name: wt_cod, dtype: object

In [224]:
demo.wt_cod.value_counts(dropna=False)

NaN    312924
KG      80809
LBS       333
Name: wt_cod, dtype: int64

In [225]:
demo.wt_cod.value_counts(normalize=True, dropna=False)

NaN   0.79
KG    0.21
LBS   0.00
Name: wt_cod, dtype: float64

79% missing values, consistent with the 79% missing weight values.  
Missing value counts are consistent with the FDA number.
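As with age, `wt` needs `wt_cod` to be interpreted, and the summary statistics above include a minimum of 0 and a maximum of 2890. A possible cleaning step is sketched below: convert to kilograms, then flag values outside a plausible range. The range limits, the helper name `weight_in_kg` and the sample rows (including the unit attached to 2890) are assumptions for illustration.

```python
import pandas as pd

# Conversion factors to kilograms for the documented unit codes.
KG_PER_UNIT = {"KG": 1.0, "LBS": 0.45359237, "GMS": 0.001}

# Assumed plausibility limits in kg; adjust to the analysis at hand.
MIN_KG, MAX_KG = 0.2, 400.0

def weight_in_kg(df: pd.DataFrame) -> pd.Series:
    """Convert wt to kilograms; NaN where wt or wt_cod is missing/unknown."""
    return df["wt"] * df["wt_cod"].map(KG_PER_UNIT)

# Hypothetical sample rows.
sample = pd.DataFrame({
    "wt": [79.4, 150.0, 0.0, 2890.0, None],
    "wt_cod": ["KG", "LBS", "KG", "KG", None],
})
sample["wt_kg"] = weight_in_kg(sample)

# Flag non-missing weights that fall outside the assumed range.
sample["wt_suspect"] = sample["wt_kg"].notna() & ~sample["wt_kg"].between(MIN_KG, MAX_KG)
print(sample)
```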

##### rept_dt  

From documentation:  

> Date report was sent (YYYYMMDD format). If a complete date is not available, a partial date is provided. 

In [226]:
demo.rept_dt.describe()

count       393749
unique         351
top       20181016
freq         11312
Name: rept_dt, dtype: object

In [227]:
demo.rept_dt.value_counts(dropna=False)

20181016    11312
20181017     9311
20181015     7918
20181018     7689
20181120     7604
20181217     7069
20181129     7004
20181227     6808
20181218     6571
20181219     6562
20181128     6405
20181221     6287
20181220     6227
20181228     6179
20181206     6070
20181116     6034
20181210     5996
20181213     5937
20181205     5916
20181113     5889
20181102     5810
20181029     5755
20181126     5716
20181011     5679
20181010     5628
            ...  
20171101        1
20180505        1
20180210        1
20170302        1
20180405        1
20180323        1
20170921        1
20180225        1
20170928        1
20180408        1
20180707        1
20151015        1
20180617        1
20170614        1
20180401        1
20160121        1
20180208        1
20170711        1
20180415        1
20161121        1
20150608        1
20180511        1
20180313        1
20161006        1
20180708        1
Name: rept_dt, Length: 352, dtype: int64

In [229]:
demo.rept_dt.value_counts(normalize=True, dropna=False).head(10)

20181016   0.03
20181017   0.02
20181015   0.02
20181018   0.02
20181120   0.02
20181217   0.02
20181129   0.02
20181227   0.02
20181218   0.02
20181219   0.02
Name: rept_dt, dtype: float64

In [230]:
# missing values count
demo.rept_dt.isna().sum()

317

In [232]:
demo.rept_dt.isna().sum()/demo.primaryid.count()

0.0008044337750529099

In [233]:
demo[demo.rept_dt.isna()].head(20)

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
20427,142482596,14248259,6,F,2017.0,20181023.0,20171204,20181029,EXP,FR-002147023-PHHY2017FR176929,PHHY2017FR176929,NOVARTIS,,78.0,YR,,F,Y,80.0,KG,,,OT,FR,FR
42888,150317284,15031728,4,F,20170306.0,20180722.0,20180619,20180727,EXP,,PHHY2018FR024978,NOVARTIS,,68.0,YR,,M,Y,96.0,KG,,,OT,FR,FR
54684,152340061,15234006,1,I,,20180606.0,20180802,20180802,EXP,,PHHY2018ES063808,SANDOZ,,,,A,F,Y,,,,,OT,ES,ES
54685,152340091,15234009,1,I,20180217.0,20180601.0,20180802,20180802,EXP,,PHHY2018GB063683,SANDOZ,,51.0,YR,,M,Y,82.55,KG,,,OT,GB,GB
54707,152341871,15234187,1,I,20160123.0,20180724.0,20180802,20180802,EXP,FR-AFSSAPS-ST20181059,FR-TEVA-2018-FR-932970,TEVA,,75.0,YR,,F,Y,,,,,MD,FR,FR
63135,153133691,15313369,1,I,201807.0,20180814.0,20180823,20180823,PER,,US-TEVA-2018-US-945214,TEVA,,,,,F,Y,,,,,CN,US,US
64324,153236361,15323636,1,I,20180806.0,20180817.0,20180827,20180827,EXP,,PHHY2018DE078420,SANDOZ,,29.0,YR,,M,Y,,,,,CN,DE,DE
66091,153364721,15336472,1,I,20180623.0,20180819.0,20180830,20180830,EXP,,PHHY2018FR081625,SANDOZ,,73.0,YR,,F,Y,82.0,KG,,,OT,FR,FR
83914,154488013,15448801,3,F,199606.0,20180925.0,20180929,20180929,EXP,,PHHY2018AT111525,NOVARTIS,,67.0,YR,,M,Y,,,,,OT,AT,AT
92411,154587971,15458797,1,I,,,20181001,20181001,DIR,,,FDA-CTU,,64.0,YR,,F,N,89.81,KG,,N,CN,US,US


Fewer than 0.1% of the values are missing (317 of 394,066 records). The missing values count is consistent with the FDA number.
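The documentation notes that a partial date is provided when a complete one is not available, so any date parsing has to accept `YYYYMMDD`, `YYYYMM` and `YYYY`. The sketch below shows one way to do that while keeping track of the precision. The function name `parse_faers_date` and the choice to fill missing parts with the first month or day are assumptions, not part of the FAERS specification.

```python
import math
from datetime import datetime

# Supported layouts, keyed by string length: (strptime format, precision label).
_FORMATS = {8: ("%Y%m%d", "day"), 6: ("%Y%m", "month"), 4: ("%Y", "year")}

def parse_faers_date(value):
    """Return (date, precision) for YYYYMMDD, YYYYMM or YYYY; (None, None) otherwise."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None, None
    if isinstance(value, float):
        value = int(value)  # e.g. 20181016.0 from a float column
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]  # e.g. the string "20181016.0"
    layout = _FORMATS.get(len(text))
    if layout is None or not text.isdigit():
        return None, None
    try:
        # Missing month/day default to 1 (an assumption of this sketch).
        return datetime.strptime(text, layout[0]).date(), layout[1]
    except ValueError:
        return None, None  # not a valid calendar date

print(parse_faers_date("20181016"))
print(parse_faers_date("201407"))
print(parse_faers_date("2014"))
```

Applied to a column, for example with `demo.rept_dt.map(parse_faers_date)`, this would give a date and a precision label per record.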

##### to_mfr  

From documentation:  

> Whether (Y/N) voluntary reporter also notified manufacturer (blank for manufacturer reports).

In [234]:
demo.to_mfr.describe()

count     23479
unique        3
top           N
freq      22175
Name: to_mfr, dtype: object

In [235]:
demo.to_mfr.value_counts(dropna=False)

NaN    370587
N       22175
Y        1303
U           1
Name: to_mfr, dtype: int64

In [236]:
demo.to_mfr.value_counts(normalize=True, dropna=False)

NaN   0.94
N     0.06
Y     0.00
U     0.00
Name: to_mfr, dtype: float64

In [237]:
demo[demo.to_mfr == "U"]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
390524,158228921,15822892,1,I,20181204,,20181217,20181217,DIR,,,FDA-CTU,,1.0,DY,,M,N,,,20181213,U,OT,US,US


94% of the values are missing, which is consistent with the documentation's note that the field is blank for manufacturer reports. The Y, N and missing value counts match the FDA numbers, but the single U value does not appear in the FDA numbers in the accompanying PDF file. The record with this value is displayed above.
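The counts above show 370,587 blank `to_mfr` values and 370,587 records with `e_sub == "Y"`, which suggests that the blanks are exactly the electronically submitted reports. That, and the presence of undocumented codes, can be checked as sketched below. `sample` is a hypothetical stand-in for `demo`, and `DOCUMENTED_TO_MFR` reflects only the Y/N values named in the documentation excerpt.

```python
import pandas as pd

# Hypothetical stand-in for `demo`.
sample = pd.DataFrame({
    "e_sub": ["Y", "Y", "N", "N", "N"],
    "to_mfr": [None, None, "N", "Y", "U"],
})

DOCUMENTED_TO_MFR = {"Y", "N"}

# Rows whose to_mfr is populated but is not a documented code.
undocumented = sample[sample["to_mfr"].notna() & ~sample["to_mfr"].isin(DOCUMENTED_TO_MFR)]
print(undocumented)

# Do blank to_mfr values coincide with e_sub == "Y"?
blank_vs_esub = pd.crosstab(sample["e_sub"], sample["to_mfr"].fillna("blank"))
print(blank_vs_esub)
```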

##### occp_cod  

From documentation:  

> Abbreviation for the reporter's type of occupation in the latest version of a case.
>
> | CODE | MEANING_TEXT              |
> | ---- | ------------------------- |
> | MD   | Physician                 |
> | PH   | Pharmacist                |
> | OT   | Other health-professional |
> | LW   | Lawyer                    |
> | CN   | Consumer                  |

In [238]:
demo.occp_cod.describe()

count     387070
unique         5
top           CN
freq      168973
Name: occp_cod, dtype: object

In [239]:
demo.occp_cod.value_counts(dropna=False)

CN     168973
MD      99246
OT      80654
PH      31766
NaN      6996
LW       6431
Name: occp_cod, dtype: int64

In [240]:
demo.occp_cod.value_counts(normalize=True, dropna=False)

CN    0.43
MD    0.25
OT    0.20
PH    0.08
NaN   0.02
LW    0.02
Name: occp_cod, dtype: float64

There are 2% missing values. The frequency counts are consistent with the FDA numbers.
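The per-field missing value checks in this section can also be collected into a single table, which may make the comparison with the FDA counts quicker. This is a minimal sketch; `missing_summary` and `sample` are hypothetical names.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the missing count and share per column, most incomplete first."""
    n_missing = df.isna().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share_missing": n_missing / len(df),
    })
    return summary.sort_values("n_missing", ascending=False)

# Hypothetical sample rows.
sample = pd.DataFrame({
    "occp_cod": ["CN", "MD", None, "OT"],
    "to_mfr": [None, None, None, "N"],
    "e_sub": ["Y", "Y", "Y", "N"],
})
summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(demo)` would produce one row per DEMO field.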

##### reporter_country  

ISO country codes can be found here: https://www.iso.org/obp/ui/#search/code/  


From documentation:  

> The country of the reporter in the latest version of a case.

\* Note: the links to the country codes in the documentation do not work.

   

In [241]:
demo.reporter_country.describe()

count     394066
unique       160
top           US
freq      249968
Name: reporter_country, dtype: object

In [242]:
demo.reporter_country.value_counts(dropna=False)

US                       249968
CA                        16897
GB                        16739
FR                        15736
JP                        15711
COUNTRY NOT SPECIFIED     14719
DE                        10502
IT                         7833
ES                         4681
BR                         3454
CN                         2883
AU                         2787
NL                         2455
IN                         1889
PT                         1694
CO                         1619
SE                         1462
PL                         1349
BE                         1272
TR                         1086
CH                         1065
ZA                          994
AR                          978
GR                          962
IL                          873
                          ...  
AX                            2
ZW                            2
KH                            2
CI                            1
BN                            1
BQ      

In [243]:
demo.reporter_country.value_counts(normalize=True, dropna=False).head(20)

US                      0.63
CA                      0.04
GB                      0.04
FR                      0.04
JP                      0.04
COUNTRY NOT SPECIFIED   0.04
DE                      0.03
IT                      0.02
ES                      0.01
BR                      0.01
CN                      0.01
AU                      0.01
NL                      0.01
IN                      0.00
PT                      0.00
CO                      0.00
SE                      0.00
PL                      0.00
BE                      0.00
TR                      0.00
Name: reporter_country, dtype: float64

There are no missing values, which is consistent with the FDA number.  
However, about 4% of the case records have "COUNTRY NOT SPECIFIED" in this field.
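One way to make this visible in the missing value counts is to treat the placeholder as a null value. The sketch below assumes that "COUNTRY NOT SPECIFIED" is the only placeholder in use; `sample` is a hypothetical stand-in for `demo.reporter_country`.

```python
import pandas as pd

SENTINEL = "COUNTRY NOT SPECIFIED"

# Hypothetical stand-in for demo.reporter_country.
sample = pd.Series(["US", "CA", SENTINEL, "US", "GB"], name="reporter_country")

# Replace the placeholder with NaN so it is counted as missing.
cleaned = sample.mask(sample == SENTINEL)
effective_missing_share = cleaned.isna().mean()
print(effective_missing_share)
```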

##### occr_country  

From documentation:  

> The country where the event occurred.

In [244]:
demo.occr_country.describe()

count     394053
unique       163
top           US
freq      262062
Name: occr_country, dtype: object

In [245]:
demo.occr_country.value_counts(dropna=False)

US    262062
CA     17550
FR     16969
JP     15968
GB     13561
DE     10752
IT      8315
ES      4896
BR      3873
CN      2988
AU      2850
NL      2568
CO      1962
IN      1825
PT      1795
SE      1478
PL      1432
BE      1324
TR      1124
AR      1096
ZA      1074
CH      1048
GR       999
IL       914
RU       897
       ...  
BM         1
CW         1
GW         1
AS         1
MC         1
BQ         1
KY         1
CI         1
SY         1
WF         1
AZ         1
WS         1
BN         1
PF         1
HT         1
BB         1
CD         1
MQ         1
AQ         1
SZ         1
RW         1
MU         1
UZ         1
FO         1
SN         1
Name: occr_country, Length: 164, dtype: int64

In [246]:
demo.occr_country.value_counts(normalize=True, dropna=False).head(20)

US   0.67
CA   0.04
FR   0.04
JP   0.04
GB   0.03
DE   0.03
IT   0.02
ES   0.01
BR   0.01
CN   0.01
AU   0.01
NL   0.01
CO   0.00
IN   0.00
PT   0.00
SE   0.00
PL   0.00
BE   0.00
TR   0.00
AR   0.00
Name: occr_country, dtype: float64

In [247]:
# missing values count
demo.occr_country.isna().sum()

13

In [254]:
demo.occr_country.str.len().value_counts(dropna=False)

2.00    394053
nan         13
Name: occr_country, dtype: int64

Fewer than 0.01% of the values are missing (13 of 394,066 records), and every non-missing value is two characters long. The missing values count is consistent with the FDA number.
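The length check confirms that every value has two characters, but not that the characters are uppercase letters. A slightly stricter shape check is sketched below. It still does not verify that a value is an assigned ISO 3166-1 alpha-2 code, and `sample` is a hypothetical series.

```python
import pandas as pd

# Hypothetical values, including two that should fail the shape check.
sample = pd.Series(["US", "FR", None, "us", "COUNTRY NOT SPECIFIED"], name="occr_country")

# True where the value is exactly two uppercase ASCII letters; missing values count as False.
is_alpha2 = sample.str.match(r"^[A-Z]{2}$", na=False).astype(bool)

# Non-missing values that do not look like a two-letter code.
malformed = sample[sample.notna() & ~is_alpha2]
print(malformed)
```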

##### Summary for DEMO ASCII file  

The data is mostly consistent with the accompanying FDA missing value and frequency counts pdf.

* **Data quality issues found:**  
  - One record has a negative value in the `age` field.  
  - One record has a "U" value in the `to_mfr` field, which is not listed in the FDA PDF.  
  - The `reporter_country` field has no null values, but about 4% of records contain "COUNTRY NOT SPECIFIED", which is effectively a missing country.  
  - Some fields are mostly missing, which can be problematic for analyses: for example, `lit_ref` and `to_mfr` (94% missing), `age_grp` (80%), and `wt` and `wt_cod` (79%).  

  
* **Data cleaning steps to do:**  
  - 