# FAERS graph

## Data quality analysis

### 1. Introduction

This notebook checks the data contents and quality of the quarterly FAERS data files, available for download [here](https://fis.fda.gov/extensions/FPD-QDE-FAERS/FPD-QDE-FAERS.html).  

The data can be downloaded in two formats: XML and ASCII. Each of these data downloads contains identical README and FAQ documentation pdfs, along with the data files and documentation pdfs specific to the two data formats. The latter contain total record counts, missing value counts for selected fields, and frequency counts for categorical values.  

According to the README.pdf that comes with the data downloads, the two formats mostly contain the same data, but each has some fields that the other lacks.

In this notebook we'll look at what each of the available data formats looks like. We can start by processing whichever format is easier to work with, and add any extra fields from the other format later if needed.  
We'll check the data for consistency with the record counts, missing value counts and frequencies reported by FDA.  
We'll also check the data for any general anomalies, and note what data cleaning will need to be done.

#### Notebook contents:
1. [Introduction](#1.-Introduction)
2. [Notebook setup](#2.-Notebook-setup)
3. [Data sources](#3.-Data-sources)  
4. [Sample raw data files](#4.-Sample-raw-data-files)  
    4.1 [XML data files](#4.1-XML-data-files)  
    4.2 [ASCII data files](#4.2-ASCII-data-files)  
    4.2.1 [DEMO file](#4.2.1-DEMO-file)  
    = [Summary for DEMO ASCII file](#Summary-for-DEMO-ASCII-file)  
    4.2.2 [DRUG file](#4.2.2-DRUG-file)  
    = [Summary for DRUG ASCII file](#Summary-for-DRUG-ASCII-file)  
    4.2.3 [REACTION file](#4.2.3-REACTION-file)  
    = [Summary for REACTION ASCII file](#Summary-for-REACTION-ASCII-file)  
    4.2.4 [OUTCOME file](#4.2.4-OUTCOME-file)  
    = [Summary for OUTCOME ASCII file](#Summary-for-OUTCOME-ASCII-file)  
    4.2.5 [REPORT SOURCE file](#4.2.5-REPORT-SOURCE-file)  
    = [Summary for REPORT SOURCE ASCII file](#Summary-for-REPORT-SOURCE-ASCII-file)  
    4.2.6 [THERAPY file](#4.2.6-THERAPY-file)  
    = [Summary for THERAPY ASCII file](#Summary-for-THERAPY-ASCII-file)
5. [DQA summary](#5.-DQA-summary)
6. [Next steps](#6.-Next-steps)

### 2. Notebook setup  
#### Imports

In [1]:
import pandas as pd
import numpy as np

import re
import xml.etree.ElementTree as ET  # used for the XML parsing in section 4.1

from timeit import default_timer as timer

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

#### Settings

In [2]:
# Customize matplotlib default settings
matplotlib.rcParams.update({'font.size': 16})
plt.rcParams["figure.figsize"] = (20,10)

In [3]:
# set up Pandas options
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)
pd.set_option('display.precision', 3)
pd.options.display.float_format = '{:.2f}'.format

#### Helper functions

In [4]:
def show_value_counts(ser, topn=None, sort_on="freq", sort_ascending=False):
    """Return a table of value counts and proportions for a Series, NaN included.

    sort_on: "freq" sorts by count, "labels" sorts by value.
    topn: keep only the first topn rows after sorting; None keeps all rows.
    Example usage: show_value_counts(demo.caseversion, 5, sort_on="labels", sort_ascending=True)
    """
    counts = ser.value_counts(dropna=False)

    # value_counts already sorts by descending frequency, so re-sort only when asked to
    if sort_on == "freq" and sort_ascending:
        counts = counts.sort_values(ascending=True)
    elif sort_on == "labels":
        counts = counts.sort_index(ascending=sort_ascending, na_position='first')

    # derive proportions from the sorted counts so both columns share one row order
    proportions = counts / len(ser)

    if topn is not None:
        counts = counts.head(topn)
        proportions = proportions.head(topn)

    df = pd.concat([counts, proportions], axis=1).reset_index()
    df.columns = [ser.name, "count", "proportion"]

    return df
        

In [5]:
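def show_na(ser):
    """Return a one-row table with the count and proportion of missing values in a Series."""
    # total records
    total = len(ser)
    # missing values count
    missing = ser.isna().sum()

    return pd.DataFrame([{'na_count': missing, 'na_proportion': missing / total}])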

### 3. Data sources

FAERS stands for FDA Adverse Event Reporting System. It is a database that contains adverse event reports, medication error reports and product quality complaints resulting in adverse events that were submitted to FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. ([Source](https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/default.htm)) 


See also: https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/ucm082193.htm



### 4. Sample raw data files  


As a sample, we use the data file downloads for 2018Q4.

In [10]:
raw_data_path = "data/raw/"

In [6]:
raw_xml_path = "data/raw/xml_2018q4/xml/"

In [12]:
raw_ascii_path = "data/raw/ascii_2018q4/ascii/"

In [4]:
!ls data/raw/

[1m[36mascii_2018q4[m[m [1m[36mxml_2018q4[m[m


#### 4.1 XML data files 
Let's look at the XML data format. We can parse this file format with the `xml.etree.ElementTree` module from the Python standard library.  
I also tried parsing these files with BeautifulSoup, but with its XML parser it ran extremely slowly.
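
If loading a whole file into memory becomes a problem, `ElementTree` can also stream the file one report at a time with `iterparse`. The following is a minimal sketch, assuming only the direct children of each `safetyreport` element are of interest; the sample XML and the `iter_safety_reports` helper are made up for illustration and are not part of the FAERS download.

```python
import io
import xml.etree.ElementTree as ET

# Tiny in-memory stand-in for a FAERS XML file (hypothetical sample, not real data)
SAMPLE_XML = b"""<?xml version="1.0"?>
<ichicsr lang="en">
  <ichicsrmessageheader><messagetype>ICSR</messagetype></ichicsrmessageheader>
  <safetyreport><safetyreportid>1</safetyreportid><serious>2</serious></safetyreport>
  <safetyreport><safetyreportid>2</safetyreportid><serious>1</serious></safetyreport>
</ichicsr>"""


def iter_safety_reports(source):
    """Yield one dict per <safetyreport>, keeping direct children that carry text."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "safetyreport":
            yield {
                child.tag: child.text
                for child in elem
                if child.text and child.text.strip()
            }
            elem.clear()  # release the memory held by this report's subtree


reports = list(iter_safety_reports(io.BytesIO(SAMPLE_XML)))
print(reports)
```

For a real file, a path such as `raw_xml_path + "1_ADR18Q4.xml"` could be passed instead of the `BytesIO` object. Nested elements such as `drug` would need their own handling.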

In [5]:
!ls data/raw/xml_2018q4/xml

1_ADR18Q4.xml 2_ADR18Q4.xml 3_ADR18Q4.xml XML_NTS.pdf   xml18q4.pdf


In [5]:
! head -50 data/raw/xml_2018q4/xml/1_ADR18Q4.xml

<?xml version="1.0"?>
<ichicsr lang="en">
  <ichicsrmessageheader>
    <messagetype>ICSR</messagetype>
    <messageformatversion>2.1</messageformatversion>
    <messageformatrelease>1.0</messageformatrelease>
    <messagenumb>2019-02</messagenumb>
    <messagesenderidentifier>FDA CDER</messagesenderidentifier>
    <messagereceiveridentifier>Public Use</messagereceiveridentifier>
    <messagedateformat>204</messagedateformat>
    <messagedate>20190207040220</messagedate>
  </ichicsrmessageheader>
  <safetyreport>
    <safetyreportversion>1</safetyreportversion>
    <safetyreportid>15529521</safetyreportid>
    <primarysourcecountry>US</primarysourcecountry>
    <occurcountry>US</occurcountry>
    <transmissiondateformat>102</transmissiondateformat>
    <transmissiondate>20190205</transmissiondate>
    <reporttype>1</reporttype>
    <serious>2</serious>
    <receivedateformat>102</receivedateformat>
    <receivedate>20181018</receivedate>
    <receiptdateformat>102

In [7]:
xml_file_1 = raw_xml_path + "1_ADR18Q4.xml"

In [8]:
tree = ET.parse(xml_file_1)
root = tree.getroot()

In [9]:
root.tag

'ichicsr'

In [10]:
root.attrib

{'lang': 'en'}

In [11]:
i=0
for child in root:
    print(child.tag, child.attrib)
    i+=1
    if i>10:
        break

ichicsrmessageheader {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}
safetyreport {}


In [12]:
root[0][1].text

'2.1'

In [13]:
i=0
        
for report_id in root.iter('safetyreportid'):
    print(report_id.text)
    i+=1
    if i>10:
        break

15529521
15529522
15529524
15529856
15529858
15529861
15530134
15529556
15529558
15529559
15529564


In [14]:
all_tags = list(set([elem.tag for elem in root.iter()]))

In [15]:
len(all_tags)

87

In [16]:
all_tags.sort()

Print out all data fields in this XML file

In [17]:
all_tags

['actiondrug',
 'activesubstance',
 'activesubstancename',
 'authoritynumb',
 'companynumb',
 'drug',
 'drugadditional',
 'drugadministrationroute',
 'drugauthorizationnumb',
 'drugbatchnumb',
 'drugcharacterization',
 'drugcumulativedosagenumb',
 'drugcumulativedosageunit',
 'drugdosageform',
 'drugdosagetext',
 'drugenddate',
 'drugenddateformat',
 'drugindication',
 'drugintervaldosagedefinition',
 'drugintervaldosageunitnumb',
 'drugrecuraction',
 'drugrecurreadministration',
 'drugrecurrence',
 'drugseparatedosagenumb',
 'drugstartdate',
 'drugstartdateformat',
 'drugstructuredosagenumb',
 'drugstructuredosageunit',
 'drugtreatmentduration',
 'drugtreatmentdurationunit',
 'duplicate',
 'duplicatenumb',
 'duplicatesource',
 'fulfillexpeditecriteria',
 'ichicsr',
 'ichicsrmessageheader',
 'literaturereference',
 'medicinalproduct',
 'messagedate',
 'messagedateformat',
 'messageformatrelease',
 'messageformatversion',
 'messagenumb',
 'messagereceiveridentifier',
 'messagesenderiden

#### 4.2 ASCII data files

Now let's look at the ASCII datafiles format. According to the docs, these files are delimiter-separated text files, with the delimiter being `$`. The data is split up into separate files that correspond to database tables and are organized around their respective primary keys.
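
Before loading a `$`-delimited file into pandas, it can be useful to confirm that every row has as many fields as the header, since a stray `$` inside a free-text field would shift the columns. This is a small sketch of such a check; the `find_malformed_rows` helper and the sample lines (with one deliberately broken row) are illustrative assumptions, not part of the FAERS documentation.

```python
import io

# Shortened sample with the first four DEMO columns; the last row is deliberately malformed
SAMPLE = (
    "primaryid$caseid$caseversion$i_f_code\n"
    "100035916$10003591$6$F\n"
    "100050413$10005041$3$F\n"
    "1000551312$10005513$12$F$EXTRA\n"
)


def find_malformed_rows(lines, sep="$"):
    """Return (line_number, field_count) for rows whose field count differs from the header's."""
    lines = iter(lines)
    expected = len(next(lines).rstrip("\n").split(sep))
    bad = []
    for lineno, line in enumerate(lines, start=2):
        n_fields = len(line.rstrip("\n").split(sep))
        if n_fields != expected:
            bad.append((lineno, n_fields))
    return bad


bad_rows = find_malformed_rows(io.StringIO(SAMPLE))
print(bad_rows)
```

An open file handle can be passed in place of the `StringIO` object, so the check runs without reading the whole file into memory.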

In [18]:
!ls data/raw/ascii_2018q4/ascii

ASC_NTS.pdf  INDI18Q4.txt RPSR18Q4.txt drug18q4.pdf reac18q4.pdf
DEMO18Q4.txt OUTC18Q4.txt THER18Q4.txt indi18q4.pdf rpsr18q4.pdf
DRUG18Q4.txt REAC18Q4.txt demo18q4.pdf outc18q4.pdf ther18q4.pdf


#### 4.2.1 DEMO file

In [19]:
!head data/raw/ascii_2018q4/ascii/DEMO18Q4.txt

primaryid$caseid$caseversion$i_f_code$event_dt$mfr_dt$init_fda_dt$fda_dt$rept_cod$auth_num$mfr_num$mfr_sndr$lit_ref$age$age_cod$age_grp$sex$e_sub$wt$wt_cod$rept_dt$to_mfr$occp_cod$reporter_country$occr_country
100035916$10003591$6$F$20130718$20181203$20140312$20181211$EXP$$PHHY2013GB101660$NOVARTIS$$47$YR$$F$Y$$$20181211$$OT$GB$GB
100050413$10005041$3$F$20140306$20141118$20140312$20181213$EXP$$US-TEVA-468475USA$TEVA$$25$YR$$F$Y$68.1$KG$20181213$$CN$US$US
1000551312$10005513$12$F$20120209$20181107$20140313$20181115$EXP$$BR-AMGEN-BRASP2012013548$AMGEN$$55$YR$A$F$Y$67$KG$20181115$$CN$BR$BR
100058832$10005883$2$F$$20180928$20140313$20181012$EXP$$FR-RANBAXY-2014RR-78735$RANBAXY$$31$YR$$F$Y$$$20181012$$OT$GB$FR
100065479$10006547$9$F$201203$20181211$20140313$20181228$EXP$$US-BAYER-2014-035909$BAYER$$36$YR$A$F$Y$90.7$KG$20181228$$CN$US$US
100066188$10006618$8$F$$20181004$20140313$20181017$PER$$US-PFIZER INC-2014069077$PFIZER$$58$YR$$F$Y$$$20181017$$CN$US$US
1000808588$10008085$88$F$201

In [21]:
ascii_file_demo = raw_ascii_path + "DEMO18Q4.txt"

In [22]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'caseversion': np.int32, 
    'i_f_code': 'object', 
    'event_dt': 'object', 
    'mfr_dt': 'object',
    'init_fda_dt': 'object', 
    'fda_dt': 'object', 
    'rept_cod': 'object', 
    'auth_num': 'object', 
    'mfr_num': 'object', 
    'mfr_sndr': 'object',
    'lit_ref': 'object', 
    'age': np.float64, 
    'age_cod': 'object', 
    'age_grp': 'object', 
    'sex': 'object', 
    'e_sub': 'object', 
    'wt': np.float64, 
    'wt_cod': 'object',
    'rept_dt': 'object', 
    'to_mfr': 'object', 
    'occp_cod': 'object', 
    'reporter_country': 'object', 
    'occr_country': 'object'
}


In [23]:
demo = pd.read_csv(ascii_file_demo, sep='$', dtype=datatypes)

In [24]:
demo.columns

Index(['primaryid', 'caseid', 'caseversion', 'i_f_code', 'event_dt', 'mfr_dt',
       'init_fda_dt', 'fda_dt', 'rept_cod', 'auth_num', 'mfr_num', 'mfr_sndr',
       'lit_ref', 'age', 'age_cod', 'age_grp', 'sex', 'e_sub', 'wt', 'wt_cod',
       'rept_dt', 'to_mfr', 'occp_cod', 'reporter_country', 'occr_country'],
      dtype='object')

In [26]:
demo.head()

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
0,100035916,10003591,6,F,20130718.0,20181203,20140312,20181211,EXP,,PHHY2013GB101660,NOVARTIS,,47.0,YR,,F,Y,,,20181211,,OT,GB,GB
1,100050413,10005041,3,F,20140306.0,20141118,20140312,20181213,EXP,,US-TEVA-468475USA,TEVA,,25.0,YR,,F,Y,68.1,KG,20181213,,CN,US,US
2,1000551312,10005513,12,F,20120209.0,20181107,20140313,20181115,EXP,,BR-AMGEN-BRASP2012013548,AMGEN,,55.0,YR,A,F,Y,67.0,KG,20181115,,CN,BR,BR
3,100058832,10005883,2,F,,20180928,20140313,20181012,EXP,,FR-RANBAXY-2014RR-78735,RANBAXY,,31.0,YR,,F,Y,,,20181012,,OT,GB,FR
4,100065479,10006547,9,F,201203.0,20181211,20140313,20181228,EXP,,US-BAYER-2014-035909,BAYER,,36.0,YR,A,F,Y,90.7,KG,20181228,,CN,US,US


In [27]:
demo.describe(include='all')

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
count,394066.0,394066.0,394066.0,394066,205438.0,370593.0,394066.0,394066.0,394066,20168.0,370595,394065,23441,235444.0,235452,80189,347760,394066,81142.0,81142,393749.0,23479,387070,394066,394053
unique,394066.0,394066.0,,2,4711.0,2370.0,2503.0,183.0,3,15597.0,370595,471,17759,,6,6,3,2,,2,351.0,3,5,160,163
top,154544191.0,14508418.0,,I,2018.0,20181210.0,20181016.0,20181016.0,EXP,0.0,PHJP2018JP021151,PFIZER,"STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK...",,YR,A,F,Y,,KG,20181016.0,N,CN,US,US
freq,1.0,1.0,,267661,25293.0,6857.0,11177.0,12657.0,204438,14.0,1,35409,79,,230226,48200,212580,370587,,80809,11312.0,22175,168973,249968,262062
mean,,,1.67,,,,,,,,,,,200.03,,,,,75.17,,,,,,
std,,,1.75,,,,,,,,,,,1843.75,,,,,29.24,,,,,,
min,,,1.0,,,,,,,,,,,-10.0,,,,,0.0,,,,,,
25%,,,1.0,,,,,,,,,,,45.0,,,,,59.87,,,,,,
50%,,,1.0,,,,,,,,,,,60.0,,,,,72.58,,,,,,
75%,,,2.0,,,,,,,,,,,71.0,,,,,88.45,,,,,,


In [28]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 394066 entries, 0 to 394065
Data columns (total 25 columns):
primaryid           394066 non-null object
caseid              394066 non-null object
caseversion         394066 non-null int32
i_f_code            394066 non-null object
event_dt            205438 non-null object
mfr_dt              370593 non-null object
init_fda_dt         394066 non-null object
fda_dt              394066 non-null object
rept_cod            394066 non-null object
auth_num            20168 non-null object
mfr_num             370595 non-null object
mfr_sndr            394065 non-null object
lit_ref             23441 non-null object
age                 235444 non-null float64
age_cod             235452 non-null object
age_grp             80189 non-null object
sex                 347760 non-null object
e_sub               394066 non-null object
wt                  81142 non-null float64
wt_cod              81142 non-null object
rept_dt             393749 non-nu

The ASCII files look easier to work with. According to the documentation, the two formats contain mostly the same information, although each includes some fields that the other lacks.  

We'll proceed with the ASCII files first, and add any supplemental info from XMLs later if needed.

##### DEMO file contents  

From above, the number of case reports in the 2018Q4 DEMO file is 394,066, which is consistent with the number supplied by FDA in the accompanying documentation.

##### Unique identifiers

The `primaryid` field is the unique identifier of a case report version in the data. It is the concatenation of `caseid` and `caseversion`: for example, `caseid` 10003591 with `caseversion` 6 gives `primaryid` 100035916.
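
The relationship between the three identifier fields can be checked directly. The sketch below rebuilds `primaryid` from `caseid` and `caseversion` on three rows copied from the DEMO file and counts mismatches; it assumes the identifiers are loaded as strings and the version as an integer, as in the `datatypes` mapping used in this notebook.

```python
import pandas as pd

# Three rows copied from the head of the 2018Q4 DEMO file
sample = pd.DataFrame({
    "primaryid": ["100035916", "100050413", "1000551312"],
    "caseid": ["10003591", "10005041", "10005513"],
    "caseversion": [6, 3, 12],
})

# primaryid should equal caseid followed by caseversion
rebuilt = sample["caseid"] + sample["caseversion"].astype(str)
mismatches = sample[sample["primaryid"] != rebuilt]
print(len(mismatches))
```

Running the same comparison on the full `demo` frame would show whether the rule holds for every record.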

In [37]:
demo.primaryid.describe()

count        394066
unique       394066
top       154544191
freq              1
Name: primaryid, dtype: object

The unique record identifier is indeed unique. Great.

In [38]:
demo.caseid.describe()

count       394066
unique      394066
top       14508418
freq             1
Name: caseid, dtype: object

In [39]:
demo.caseversion.describe()

count   394066.00
mean         1.67
std          1.75
min          1.00
25%          1.00
50%          1.00
75%          2.00
max         88.00
Name: caseversion, dtype: float64

In [43]:
demo[["primaryid", "caseid", "caseversion"]].head()

Unnamed: 0,primaryid,caseid,caseversion
0,100035916,10003591,6
1,100050413,10005041,3
2,1000551312,10005513,12
3,100058832,10005883,2
4,100065479,10006547,9


##### Distribution of case version values

In [313]:
show_value_counts(demo.caseversion)

Unnamed: 0,caseversion,count,proportion
0,1,267661,0.68
1,2,74684,0.19
2,3,25090,0.06
3,4,11270,0.03
4,5,5559,0.01
5,6,3129,0.01
6,7,1857,0.00
7,8,1261,0.00
8,9,824,0.00
9,10,570,0.00


Most cases have a latest case version of 1 (68%), 2 (19%) or 3 (6%); together these account for 93% of the cases. About 2% of all cases have a latest case version above 6. The highest case version number is 88. There are no missing values.
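
The shares quoted above can be reproduced from the counts in the table. This is just the arithmetic written out, using the counts for versions 1 to 6 and the total number of DEMO records shown earlier.

```python
# Counts of the latest case version, copied from the table above (versions 1 to 6)
counts = {1: 267661, 2: 74684, 3: 25090, 4: 11270, 5: 5559, 6: 3129}
total_records = 394066  # total rows in the 2018Q4 DEMO file

share_up_to_3 = sum(counts[v] for v in (1, 2, 3)) / total_records
# everything not in versions 1 to 6 has a version above 6
share_above_6 = (total_records - sum(counts.values())) / total_records

print(f"versions 1-3: {share_up_to_3:.1%}")
print(f"versions above 6: {share_above_6:.1%}")
```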

##### i_f_code  

From documentation:
> Code for initial or follow-up status of report, as reported by manufacturer.
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | I    | Initial      |
> | F    | Follow-up    |


In [110]:
demo.i_f_code.describe()

count     394066
unique         2
top            I
freq      267661
Name: i_f_code, dtype: object

In [314]:
show_value_counts(demo.i_f_code)

Unnamed: 0,i_f_code,count,proportion
0,I,267661,0.68
1,F,126405,0.32


The number of initial reports (267,661) is identical to the number of records with `caseversion` = 1 shown above, as expected. There are no missing values in this field.
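
Matching totals do not prove that the two fields agree row by row. A row-level check could look like the sketch below, which assumes that an initial report (`I`) should always have `caseversion` 1 and a follow-up (`F`) a higher version; the sample pairs are hypothetical.

```python
from collections import Counter

# Hypothetical (i_f_code, caseversion) pairs standing in for the DEMO rows
rows = [("F", 6), ("F", 3), ("I", 1), ("I", 1), ("F", 2)]

# Tally each combination of code and "is this version 1?"
pairs = Counter((code, version == 1) for code, version in rows)

# Inconsistent rows: initial reports with version > 1, or follow-ups with version 1
inconsistent = pairs[("I", False)] + pairs[("F", True)]
print(pairs)
print(inconsistent)
```

In the notebook, `rows` could be replaced with `zip(demo.i_f_code, demo.caseversion)`.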

##### event_dt

From documentation:  

> Date the adverse event occurred or began. (YYYYMMDD format) – If a complete date is not available, a partial date is provided.

In [111]:
demo.event_dt.describe()

count     205438
unique      4711
top         2018
freq       25293
Name: event_dt, dtype: object

In [315]:
show_value_counts(demo.event_dt, 20)

Unnamed: 0,event_dt,count,proportion
0,,188628,0.48
1,2018.0,25293,0.06
2,201810.0,4210,0.01
3,201809.0,4096,0.01
4,2017.0,4057,0.01
5,201808.0,3189,0.01
6,201811.0,2739,0.01
7,2016.0,2353,0.01
8,201807.0,2290,0.01
9,2015.0,1785,0.0


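Nearly half (48%) of the case reports have no date for when the adverse event occurred or began. Many of the dates that are present are partial: the most frequent value is the bare year `2018`, and year-month values such as `201810` are also common. The missing values count is consistent with the number provided by the FDA.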
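
Because full and partial dates share one column, any date parsing has to handle all three lengths. Below is one possible sketch that also records the precision of each value. The `parse_partial_date` helper is hypothetical, and filling a missing month or day with 1 is an assumption made here for illustration, not an FDA convention.

```python
from datetime import date


def parse_partial_date(value):
    """Return (date, precision) for 'YYYYMMDD', 'YYYYMM' or 'YYYY'; (None, None) otherwise."""
    if value is None:
        return None, None
    text = str(value).strip()
    if not text.isdigit():
        return None, None
    try:
        if len(text) == 8:
            return date(int(text[:4]), int(text[4:6]), int(text[6:])), "day"
        if len(text) == 6:
            # assumption: a missing day is set to the first of the month
            return date(int(text[:4]), int(text[4:6]), 1), "month"
        if len(text) == 4:
            # assumption: a missing month and day are set to 1 January
            return date(int(text), 1, 1), "year"
    except ValueError:
        # out-of-range parts such as month 13 or day 32
        pass
    return None, None


for raw in ["20130718", "201203", "2018", "20181340", None]:
    print(raw, parse_partial_date(raw))
```

Keeping the precision next to the parsed date makes it possible to exclude year-only values from analyses that need an exact day.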

##### mfr_dt  

From documentation:  

> Date manufacturer first received initial information. In subsequent versions of a case, the latest manufacturer received date will be provided (YYYYMMDD format). If a complete date is not available, a partial date will be provided.

In [118]:
demo.mfr_dt.describe()

count       370593
unique        2370
top       20181210
freq          6857
Name: mfr_dt, dtype: object

In [316]:
show_value_counts(demo.mfr_dt, 20)

Unnamed: 0,mfr_dt,count,proportion
0,,23473,0.06
1,20181210.0,6857,0.02
2,20181203.0,5742,0.01
3,20181029.0,5662,0.01
4,20181126.0,5605,0.01
5,20181211.0,5482,0.01
6,20181001.0,5481,0.01
7,20181009.0,5465,0.01
8,20181217.0,5312,0.01
9,20181022.0,5298,0.01


There are 6% missing values for this field. The missing values count is consistent with the FDA number.
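
The comparison against the FDA-reported counts is repeated for every field, so it could be automated. A minimal sketch is shown below; `compare_counts` is a hypothetical helper, and the "reported" numbers are simply the counts observed above, which this notebook found to match the FDA documentation.

```python
def compare_counts(observed, reported):
    """Return {field: (observed, reported)} for every field where the two counts differ."""
    return {
        field: (observed.get(field), expected)
        for field, expected in reported.items()
        if observed.get(field) != expected
    }


# Missing-value counts seen in this notebook for two of the DEMO fields
observed_missing = {"event_dt": 188628, "mfr_dt": 23473}
# Stand-in for the numbers typed in from the FDA documentation PDF
reported_missing = {"event_dt": 188628, "mfr_dt": 23473}

differences = compare_counts(observed_missing, reported_missing)
print(differences)
```

An empty result means every reported count was reproduced. In the notebook, `observed_missing` could be built from `demo.isna().sum().to_dict()`.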

##### init_fda_dt  

From documentation:

> Date FDA received first version (Initial) of Case (YYYYMMDD format)

In [121]:
demo.init_fda_dt.describe()

count       394066
unique        2503
top       20181016
freq         11177
Name: init_fda_dt, dtype: object

In [318]:
show_value_counts(demo.init_fda_dt, 20)

Unnamed: 0,init_fda_dt,count,proportion
0,20181016,11177,0.03
1,20181017,7905,0.02
2,20181018,6215,0.02
3,20181217,6142,0.02
4,20181120,6061,0.02
5,20181129,5967,0.02
6,20181015,5741,0.01
7,20181010,5285,0.01
8,20181116,5029,0.01
9,20181102,5020,0.01


No missing values.

##### fda_dt  

From documentation:  

> Date FDA received Case. In subsequent versions of a case, the latest manufacturer received date will be provided (YYYYMMDD format).

In [125]:
demo.fda_dt.describe()

count       394066
unique         183
top       20181016
freq         12657
Name: fda_dt, dtype: object

In [319]:
show_value_counts(demo.fda_dt, 20)

Unnamed: 0,fda_dt,count,proportion
0,20181016,12657,0.03
1,20181017,9620,0.02
2,20181217,8111,0.02
3,20181129,7432,0.02
4,20181120,7355,0.02
5,20181018,7265,0.02
6,20181227,6866,0.02
7,20181221,6863,0.02
8,20181015,6783,0.02
9,20181228,6514,0.02


No missing values, consistent with FDA number.

##### rept_cod  

From documentation:  

> Code for the type of report submitted (See table below)
>
> | CODE | MEANING_TEXT             |
> | ---- | ------------------------ |
> | EXP  | Expedited (15-Day)       |
> | PER  | Periodic (Non-Expedited) |
> | DIR  | Direct                   |
>
> Expedited (15-day) and Periodic (Non-Expedited) reports are from manufacturers; "Direct" reports are voluntarily submitted to the FDA by non-manufacturers.





In [129]:
demo.rept_cod.describe()

count     394066
unique         3
top          EXP
freq      204438
Name: rept_cod, dtype: object

In [320]:
show_value_counts(demo.rept_cod)

Unnamed: 0,rept_cod,count,proportion
0,EXP,204438,0.52
1,PER,166157,0.42
2,DIR,23471,0.06


No missing values.

##### auth_num  

From documentation:  

> Regulatory Authority’s case report number, when available.  
> \* New tag added in 2014Q3 extract.

In [132]:
demo.auth_num.describe()

count     20168
unique    15597
top        0000
freq         14
Name: auth_num, dtype: object

In [321]:
show_value_counts(demo.auth_num, 10)

Unnamed: 0,auth_num,count,proportion
0,,373898,0.95
1,0000,14,0.0
2,00,11,0.0
3,DE-CADRBFARM-2018025631,10,0.0
4,GB-MHRA-EYC 00190348,9,0.0
5,FR-AFSSAPS-TS20180923,8,0.0
6,FR-AFSSAPS-CN20182166,8,0.0
7,GB-MHRA-EYC 00188736,7,0.0
8,GB-MHRA-ADR 22496422,7,0.0
9,FR-AFSSAPS-AM20180734,7,0.0


95% of the values are missing. A few of the remaining values (such as `0000` and `00`) may be placeholders or defaults for missing values.
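
If the all-zero values do turn out to be placeholders, they could be folded into the missing values during cleaning. The sketch below shows one way to do that. Treating every all-zero string as missing is an assumption that should be confirmed first, and `clean_auth_num` is a hypothetical helper.

```python
import re

ALL_ZEROS = re.compile(r"0+")


def clean_auth_num(value):
    """Return None for missing or all-zero values, otherwise the stripped string."""
    # value != value is true only for NaN, which is how pandas marks missing values
    if value is None or value != value:
        return None
    text = str(value).strip()
    if not text or ALL_ZEROS.fullmatch(text):
        return None
    return text


for raw in ["0000", "00", "GB-MHRA-EYC 00190348", None]:
    print(repr(raw), "->", repr(clean_auth_num(raw)))
```

In the notebook this could be applied with `demo.auth_num.map(clean_auth_num)`.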

##### mfr_num  

From documentation:  

> Manufacturer's unique report identifier.

In [135]:
demo.mfr_num.describe()

count               370595
unique              370595
top       PHJP2018JP021151
freq                     1
Name: mfr_num, dtype: object

In [322]:
show_value_counts(demo.mfr_num, 10)

Unnamed: 0,mfr_num,count,proportion
0,,23471,0.06
1,US-ROCHE-2112065,1,0.0
2,PHHO2018CA011677,1,0.0
3,CL-PROVELL PHARMACEUTICALS-2056832,1,0.0
4,US-IGSA-SR10006388,1,0.0
5,CN-ROCHE-2208075,1,0.0
6,"PH-B.I. PHARMACEUTICALS,INC./RIDGEFIELD-2018-B...",1,0.0
7,US-AMGEN-USASP2018185967,1,0.0
8,CA-ROCHE-2190755,1,0.0
9,PHHY2016IT042284,1,0.0


6% of values are missing, and the missing values count is consistent with the FDA number. The non-missing values are unique, as expected.

##### mfr_sndr  

From documentation:  

> Coded name of manufacturer sending report; if not found, then verbatim name of organization sending report.

In [139]:
demo.mfr_sndr.describe()

count     394065
unique       471
top       PFIZER
freq       35409
Name: mfr_sndr, dtype: object

In [323]:
show_value_counts(demo.mfr_sndr, 20)

Unnamed: 0,mfr_sndr,count,proportion
0,PFIZER,35409,0.09
1,AMGEN,30828,0.08
2,NOVARTIS,25360,0.06
3,FDA-CTU,23470,0.06
4,SANOFI AVENTIS,18107,0.05
5,JANSSEN,14866,0.04
6,CELGENE,13511,0.03
7,BRISTOL MYERS SQUIBB,13442,0.03
8,TEVA,12871,0.03
9,ABBVIE,11719,0.03


In [155]:
# count missing values
demo.mfr_sndr.isna().sum()

1

In [156]:
demo[demo.mfr_sndr.isna()]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
203873,155751552,15575155,2,F,,,20181026,20181026,DIR,,,,,71.0,YR,,M,N,,,20181025,N,OT,US,US


One missing value, consistent with the FDA number.

##### lit_ref  

From documentation:  

> Literature Reference information, when available; populated with last 500 characters if >500 characters are available.
>
> \* New tag added in 2014Q3 extract.

In [157]:
demo.lit_ref.describe()

count                                                 23441
unique                                                17759
top       STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK...
freq                                                     79
Name: lit_ref, dtype: object

In [324]:
show_value_counts(demo.lit_ref, 10)

Unnamed: 0,lit_ref,count,proportion
0,,370625,0.94
1,"STACEY R, VERA T, MORGAN T, JORDAN J, WHITLOCK...",79,0.0
2,"DOI: 10.4081/NI.2018.7469#. LAPMAG A, LERTSINU...",71,0.0
3,"GLEESON M, PECKITT C, TO YM, EDWARDS L, OATES ...",70,0.0
4,NOT APPLICABLE,61,0.0
5,"BISHOP-FREEMAN SC, HENSEL EM, FEASTER MS, WINE...",53,0.0
6,"GUMMIN, D.. 2016 ANNUAL REPORT OF THE AMERICAN...",48,0.0
7,"STRUGOV V, STADNIK E, VIRTS Y, ANDREEVA T, ZAR...",45,0.0
8,"DALKILIC E, COSKUN BN, YAGIZ B, TUFAN AN, ERMU...",40,0.0
9,"JABEEN SA, GADDAMANUGU P, CHERIAN A, MRIDULA K...",38,0.0


94% of the values are missing, and 61 records have this value set to "NOT APPLICABLE".

##### age  

From documentation:  

> Numeric value of patient's age at event.

In [162]:
demo.age.describe()

count   235444.00
mean       200.03
std       1843.75
min        -10.00
25%         45.00
50%         60.00
75%         71.00
max      34926.00
Name: age, dtype: float64

In [325]:
show_value_counts(demo.age)

Unnamed: 0,age,count,proportion
0,,158622,0.40
1,70.00,5562,0.01
2,65.00,5499,0.01
3,63.00,5440,0.01
4,60.00,5394,0.01
5,68.00,5333,0.01
6,64.00,5316,0.01
7,62.00,5312,0.01
8,67.00,5191,0.01
9,69.00,5178,0.01


In [183]:
demo[demo.age < 0]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
263592,156388221,15638822,1,I,20180429,20180508,20181120,20181120,PER,,US-PERRIGO-18US005100,PERRIGO,,-10.0,YR,,F,Y,77.98,KG,20181120,,CN,US,US


In [188]:
len(demo[demo.age > 100])

2011

In [189]:
demo[demo.age > 100].head()

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
532,104248582,10424858,2,F,201407,20181107,20140902,20181115,EXP,,US-ASTRAZENECA-2014SE63425,ASTRAZENECA,,552.0,MON,,M,Y,79.4,KG,20181115,,,US,US
655,1050904010,10509040,10,F,2014,20181030,20141009,20181107,EXP,,US-ASTRAZENECA-2014SE69488,ASTRAZENECA,,801.0,MON,,F,Y,50.3,KG,20181107,,,US,US
1075,107543802,10754380,2,F,201501,20181031,20150202,20181112,PER,,US-ASTRAZENECA-2015SE07497,ASTRAZENECA,,1023.0,MON,,M,Y,65.8,KG,20181112,,,US,US
1211,108765514,10876551,4,F,20130401,20180919,20150301,20181019,PER,,US-ASTRAZENECA-2013SE23016,ASTRAZENECA,,25245.0,DY,,F,Y,101.2,KG,20181019,,,US,US
1337,109705173,10970517,3,F,201411,20181119,20150331,20181122,EXP,,US-ASTRAZENECA-2015SE28983,ASTRAZENECA,,764.0,MON,,F,Y,93.0,KG,20181122,,,US,US


Age is missing in 40% of the records. There is one record with a negative age value (-10 years), which will need to be cleaned. Most of the 2,011 age values greater than 100 are expressed in a unit other than years, e.g. months or days, so `age` has to be interpreted together with `age_cod`.  

The missing values count is consistent with the FDA number.
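
To compare ages across records, the values can be converted to a common unit using the `age_cod` unit codes. The sketch below converts to years. The conversion factors (365.25 days per year, 12 months per year) and the choice to return `None` for negative ages are assumptions made for illustration, and `age_in_years` is a hypothetical helper.

```python
# Years per unit, keyed by the FAERS age_cod codes (factors are assumptions)
YEARS_PER_UNIT = {
    "DEC": 10.0,
    "YR": 1.0,
    "MON": 1 / 12,
    "WK": 7 / 365.25,
    "DY": 1 / 365.25,
    "HR": 1 / (24 * 365.25),
}


def age_in_years(age, age_cod):
    """Convert an age to years; return None for missing, negative or unknown-unit values."""
    # age != age is true only for NaN
    if age is None or age != age or age < 0:
        return None
    factor = YEARS_PER_UNIT.get(age_cod)
    if factor is None:
        return None
    return age * factor


# Values taken from the records shown above
print(age_in_years(552.0, "MON"))
print(age_in_years(25245.0, "DY"))
print(age_in_years(-10.0, "YR"))
```

After conversion, the two sample values above 100 fall into a plausible range of about 46 and 69 years.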

##### age_cod  

From documentation:  

> Unit abbreviation for patient's age (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | DEC  | DECADE       |
> | YR   | YEAR         |
> | MON  | MONTH        |
> | WK   | WEEK         |
> | DY   | DAY          |
> | HR   | HOUR         |

In [190]:
demo.age_cod.describe()

count     235452
unique         6
top           YR
freq      230226
Name: age_cod, dtype: object

In [326]:
show_value_counts(demo.age_cod)

Unnamed: 0,age_cod,count,proportion
0,YR,230226,0.58
1,,158614,0.4
2,DY,1935,0.0
3,DEC,1618,0.0
4,MON,1536,0.0
5,WK,127,0.0
6,HR,10,0.0


In [207]:
demo[(demo.age.isna()) & (demo.age_cod.notna())]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
119358,154870241,15487024,1,I,,,20181010,20181010,DIR,,,FDA-CTU,,,YR,,F,N,20.87,KG,20181010,N,CN,US,US
200400,155714611,15571461,1,I,20181016.0,,20181030,20181030,DIR,,,FDA-CTU,,,WK,,M,N,3.13,KG,20181030,N,MD,US,US
234374,156076701,15607670,1,I,,,20181112,20181112,DIR,,,FDA-CTU,,,YR,,F,N,,,20181112,N,,US,US
281552,156581371,15658137,1,I,,,20181115,20181115,DIR,,,FDA-CTU,,,DY,,F,N,54.43,KG,20181114,N,OT,US,US
318373,156982451,15698245,1,I,20181109.0,,20181126,20181126,DIR,,,FDA-CTU,,,YR,,M,N,,,20181120,N,OT,US,US
322785,157031521,15703152,1,I,20180918.0,,20181130,20181130,DIR,,,FDA-CTU,,,YR,,M,N,,,20181130,N,PH,US,US
332043,157133731,15713373,1,I,20180524.0,,20181129,20181129,DIR,,,FDA-CTU,,,YR,,,N,85.55,KG,20180822,N,PH,US,US
382724,157711891,15771189,1,I,,,20181227,20181227,DIR,,,FDA-CTU,,,YR,,F,N,11.0,KG,20181227,N,,US,US


In [208]:
len(demo[(demo.age.isna()) & (demo.age_cod.notna())])

8

In [209]:
demo[(demo.age.notna()) & (demo.age_cod.isna())]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country


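This field is missing in 40% of the records, in line with the 40% of records with missing age. The two are not perfectly aligned, though: 8 records, all direct reports from sender FDA-CTU, have an age unit but no age value, while no record has an age value without a unit. Of the non-missing values, most are in years.  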

The missing values count is consistent with the FDA number.
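
The same value/unit check applies to other field pairs such as `wt` and `wt_cod`, so it can be wrapped in a small function. This is a sketch using a made-up four-row sample; `paired_missing` is a hypothetical helper.

```python
import pandas as pd

# Made-up sample: the second row has a unit but no value
sample = pd.DataFrame({
    "age": [47.0, None, None, 25.0],
    "age_cod": ["YR", "YR", None, "YR"],
})


def paired_missing(df, value_col, unit_col):
    """Count rows where only one of a value/unit pair is present."""
    value_only = (df[value_col].notna() & df[unit_col].isna()).sum()
    unit_only = (df[value_col].isna() & df[unit_col].notna()).sum()
    return {
        "value_without_unit": int(value_only),
        "unit_without_value": int(unit_only),
    }


result = paired_missing(sample, "age", "age_cod")
print(result)
```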

##### age_grp  

From documentation:  

> Patient Age Group code as follows, when available:
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | N    | Neonate      |
> | I    | Infant       |
> | C    | Child        |
> | T    | Adolescent   |
> | A    | Adult        |
> | E    | Elderly      |
>
> \* New tag added in 2014Q3 extract.

In [210]:
demo.age_grp.describe()

count     80189
unique        6
top           A
freq      48200
Name: age_grp, dtype: object

In [327]:
show_value_counts(demo.age_grp)

Unnamed: 0,age_grp,count,proportion
0,,313877,0.8
1,A,48200,0.12
2,E,27869,0.07
3,C,1547,0.0
4,T,1129,0.0
5,N,916,0.0
6,I,528,0.0


80% of the values are missing, twice the share of missing age values (40%). The missing value counts are consistent with the FDA number.

##### sex  

From documentation:  

> Code for patient's sex (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | UNK  | Unknown      |
> | M    | Male         |
> | F    | Female       |

In [213]:
demo.sex.describe()

count     347760
unique         3
top            F
freq      212580
Name: sex, dtype: object

In [328]:
show_value_counts(demo.sex)

Unnamed: 0,sex,count,proportion
0,F,212580,0.54
1,M,135150,0.34
2,,46306,0.12
3,UNK,30,0.0


12% of the values are missing, and a further 30 records are explicitly coded as `UNK`. The frequency counts and percentages are consistent with the FDA numbers.

##### e_sub  

From documentation:  

> Whether (Y/N) this report was submitted under the electronic submissions procedure for manufacturers.

In [216]:
demo.e_sub.describe()

count     394066
unique         2
top            Y
freq      370587
Name: e_sub, dtype: object

In [329]:
show_value_counts(demo.e_sub)

Unnamed: 0,e_sub,count,proportion
0,Y,370587,0.94
1,N,23479,0.06


No missing values. The frequency counts are consistent with the FDA numbers.

##### wt  

From documentation:  

> Numeric value of patient's weight.

In [220]:
demo.wt.describe()

count   81142.00
mean       75.17
std        29.24
min         0.00
25%        59.87
50%        72.58
75%        88.45
max      2890.00
Name: wt, dtype: float64

In [330]:
show_value_counts(demo.wt)

Unnamed: 0,wt,count,proportion
0,,312924,0.79
1,70.00,1404,0.00
2,60.00,1343,0.00
3,65.00,1094,0.00
4,68.00,1039,0.00
5,80.00,990,0.00
6,75.00,949,0.00
7,90.00,844,0.00
8,63.00,834,0.00
9,72.00,803,0.00


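79% of the values are missing. The minimum of 0 and the maximum of 2,890 look implausible as patient weights and will need a closer look during cleaning. Missing value counts are consistent with the FDA number.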

##### wt_cod  

From documentation:  

> Unit abbreviation for patient's weight (See table below)  
>
> | CODE | MEANING_TEXT |
> | ---- | ------------ |
> | KG   | Kilograms    |
> | LBS  | Pounds       |
> | GMS  | Grams        |

In [223]:
demo.wt_cod.describe()

count     81142
unique        2
top          KG
freq      80809
Name: wt_cod, dtype: object

In [331]:
show_value_counts(demo.wt_cod)

Unnamed: 0,wt_cod,count,proportion
0,,312924,0.79
1,KG,80809,0.21
2,LBS,333,0.0


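79% of the values are missing, consistent with the 79% missing weight values (312,924 records in both fields). Only `KG` and `LBS` occur in this quarter; the `GMS` code from the documentation does not appear.  
Missing value counts are consistent with the FDA number.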
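
Since weights are reported in more than one unit, they need to be converted to a common unit before they can be compared. The sketch below converts to kilograms using the three documented `wt_cod` codes. Returning `None` for zero or negative weights is an assumption made for illustration, and `weight_in_kg` is a hypothetical helper.

```python
# Kilograms per unit, keyed by the FAERS wt_cod codes
KG_PER_UNIT = {
    "KG": 1.0,
    "LBS": 0.45359237,  # international avoirdupois pound
    "GMS": 0.001,
}


def weight_in_kg(weight, wt_cod):
    """Convert a weight to kilograms; return None for missing, non-positive or unknown-unit values."""
    # weight != weight is true only for NaN
    if weight is None or weight != weight or weight <= 0:
        return None
    factor = KG_PER_UNIT.get(wt_cod)
    if factor is None:
        return None
    return weight * factor


print(weight_in_kg(68.1, "KG"))
print(weight_in_kg(150.0, "LBS"))
print(weight_in_kg(0.0, "KG"))
```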

##### rept_dt  

From documentation:  

> Date report was sent (YYYYMMDD format). If a complete date is not available, a partial date is provided. 

In [226]:
demo.rept_dt.describe()

count       393749
unique         351
top       20181016
freq         11312
Name: rept_dt, dtype: object

In [332]:
show_value_counts(demo.rept_dt, 20)

Unnamed: 0,rept_dt,count,proportion
0,20181016,11312,0.03
1,20181017,9311,0.02
2,20181015,7918,0.02
3,20181018,7689,0.02
4,20181120,7604,0.02
5,20181217,7069,0.02
6,20181129,7004,0.02
7,20181227,6808,0.02
8,20181218,6571,0.02
9,20181219,6562,0.02


In [230]:
# missing values count
demo.rept_dt.isna().sum()

317

In [232]:
demo.rept_dt.isna().sum()/demo.primaryid.count()

0.0008044337750529099

In [233]:
demo[demo.rept_dt.isna()].head(20)

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
20427,142482596,14248259,6,F,2017.0,20181023.0,20171204,20181029,EXP,FR-002147023-PHHY2017FR176929,PHHY2017FR176929,NOVARTIS,,78.0,YR,,F,Y,80.0,KG,,,OT,FR,FR
42888,150317284,15031728,4,F,20170306.0,20180722.0,20180619,20180727,EXP,,PHHY2018FR024978,NOVARTIS,,68.0,YR,,M,Y,96.0,KG,,,OT,FR,FR
54684,152340061,15234006,1,I,,20180606.0,20180802,20180802,EXP,,PHHY2018ES063808,SANDOZ,,,,A,F,Y,,,,,OT,ES,ES
54685,152340091,15234009,1,I,20180217.0,20180601.0,20180802,20180802,EXP,,PHHY2018GB063683,SANDOZ,,51.0,YR,,M,Y,82.55,KG,,,OT,GB,GB
54707,152341871,15234187,1,I,20160123.0,20180724.0,20180802,20180802,EXP,FR-AFSSAPS-ST20181059,FR-TEVA-2018-FR-932970,TEVA,,75.0,YR,,F,Y,,,,,MD,FR,FR
63135,153133691,15313369,1,I,201807.0,20180814.0,20180823,20180823,PER,,US-TEVA-2018-US-945214,TEVA,,,,,F,Y,,,,,CN,US,US
64324,153236361,15323636,1,I,20180806.0,20180817.0,20180827,20180827,EXP,,PHHY2018DE078420,SANDOZ,,29.0,YR,,M,Y,,,,,CN,DE,DE
66091,153364721,15336472,1,I,20180623.0,20180819.0,20180830,20180830,EXP,,PHHY2018FR081625,SANDOZ,,73.0,YR,,F,Y,82.0,KG,,,OT,FR,FR
83914,154488013,15448801,3,F,199606.0,20180925.0,20180929,20180929,EXP,,PHHY2018AT111525,NOVARTIS,,67.0,YR,,M,Y,,,,,OT,AT,AT
92411,154587971,15458797,1,I,,,20181001,20181001,DIR,,,FDA-CTU,,64.0,YR,,F,N,89.81,KG,,N,CN,US,US


Less than 1% of the values are missing. The missing values count is consistent with the FDA number.
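Since the documentation allows partial dates, and the records above show date-like fields such as `event_dt` holding values like `2017.0` and `201807.0`, date parsing will need to tolerate several lengths. Below is a hedged sketch of one way to do that; `parse_faers_date` is a hypothetical helper, and filling missing month or day components with 1 is an assumption rather than an FDA convention.

```python
from datetime import date
from typing import Optional, Tuple


def parse_faers_date(raw) -> Tuple[Optional[date], Optional[str]]:
    """Parse YYYYMMDD, YYYYMM or YYYY into (date, precision).

    Missing month or day components are filled with 1. Anything that is
    not a valid 4-, 6- or 8-digit date yields (None, None).
    """
    if raw is None:
        return None, None
    text = str(raw).strip()
    if text.endswith(".0"):  # e.g. "201807.0" after a float round trip
        text = text[:-2]
    precision = {4: "year", 6: "month", 8: "day"}.get(len(text))
    if precision is None or not text.isdigit():
        return None, None
    try:
        year = int(text[:4])
        month = int(text[4:6]) if len(text) >= 6 else 1
        day = int(text[6:8]) if len(text) == 8 else 1
        return date(year, month, day), precision
    except ValueError:  # e.g. month 13 or day 32
        return None, None
```

Returning the precision alongside the date keeps the information that a value such as `201807` was only known to the month.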

##### to_mfr  

From documentation:  

> Whether (Y/N) voluntary reporter also notified manufacturer (blank for manufacturer reports).

In [234]:
demo.to_mfr.describe()

count     23479
unique        3
top           N
freq      22175
Name: to_mfr, dtype: object

In [333]:
show_value_counts(demo.to_mfr)

Unnamed: 0,to_mfr,count,proportion
0,,370587,0.94
1,N,22175,0.06
2,Y,1303,0.0
3,U,1,0.0


In [237]:
demo[demo.to_mfr == "U"]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
390524,158228921,15822892,1,I,20181204,,20181217,20181217,DIR,,,FDA-CTU,,1.0,DY,,M,N,,,20181213,U,OT,US,US


94% of the values are missing. The Y, N and missing value counts match the FDA numbers, but the U value with a count of 1 is not present in the FDA numbers in the accompanying pdf file. The record with this value is displayed above.
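A check like this can be applied to any coded field. The sketch below assumes the documented code set is passed in by hand; `unexpected_values` is an illustrative helper name, and the sample series is made up.

```python
import pandas as pd


def unexpected_values(values: pd.Series, documented: set) -> pd.Series:
    """Count non-missing values that are not in the documented code set."""
    present = values.dropna()
    return present[~present.isin(documented)].value_counts()


# Made-up sample mimicking the to_mfr field.
to_mfr = pd.Series(["N", "Y", None, "U", "N", None])
extra = unexpected_values(to_mfr, {"Y", "N"})
```

An empty result means every non-missing value is a documented code.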

##### occp_cod  

From documentation:  

> Abbreviation for the reporter's type of occupation in the latest version of a case.
>
> | CODE      | MEANING_TEXT
| ----      | ------------
| MD        | Physician
| PH        | Pharmacist
| OT        | Other health-professional
| LW        | Lawyer
| CN        | Consumer

In [238]:
demo.occp_cod.describe()

count     387070
unique         5
top           CN
freq      168973
Name: occp_cod, dtype: object

In [334]:
show_value_counts(demo.occp_cod)

Unnamed: 0,occp_cod,count,proportion
0,CN,168973,0.43
1,MD,99246,0.25
2,OT,80654,0.2
3,PH,31766,0.08
4,,6996,0.02
5,LW,6431,0.02


About 2% of the values are missing. The frequency counts are consistent with the FDA numbers.

##### reporter_country  

ISO country codes can be found here: https://www.iso.org/obp/ui/#search/code/  


From documentation:  

> The country of the reporter in the latest version of a case.
>
> \* Note: the links to the country codes in the documentation don't really work.

   

In [241]:
demo.reporter_country.describe()

count     394066
unique       160
top           US
freq      249968
Name: reporter_country, dtype: object

In [335]:
show_value_counts(demo.reporter_country, 20)

Unnamed: 0,reporter_country,count,proportion
0,US,249968,0.63
1,CA,16897,0.04
2,GB,16739,0.04
3,FR,15736,0.04
4,JP,15711,0.04
5,COUNTRY NOT SPECIFIED,14719,0.04
6,DE,10502,0.03
7,IT,7833,0.02
8,ES,4681,0.01
9,BR,3454,0.01


There are no missing values, which is consistent with the FDA number.  
However, about 4% of the case records have "COUNTRY NOT SPECIFIED" in this field.
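Placeholder strings like this one can be converted to proper missing values so that they show up in missing-value counts. This is a minimal sketch under the assumption that the placeholders are listed by hand per column; `PLACEHOLDERS` and `placeholders_to_na` are hypothetical names, and the `lit_ref` entry reflects the "NOT APPLICABLE" value noted for that field in this notebook.

```python
import pandas as pd

# Placeholder strings observed in this notebook that stand in for a missing value.
PLACEHOLDERS = {
    "reporter_country": ["COUNTRY NOT SPECIFIED"],
    "lit_ref": ["NOT APPLICABLE"],
}


def placeholders_to_na(df: pd.DataFrame, placeholders: dict) -> pd.DataFrame:
    """Return a copy of df with the listed placeholder strings set to NA."""
    out = df.copy()
    for column, values in placeholders.items():
        if column in out.columns:
            out[column] = out[column].mask(out[column].isin(values))
    return out


# Made-up sample rows.
sample = pd.DataFrame({
    "reporter_country": ["US", "COUNTRY NOT SPECIFIED", "FR"],
    "lit_ref": [None, "NOT APPLICABLE", "Some reference"],
})
cleaned = placeholders_to_na(sample, PLACEHOLDERS)
```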

##### occr_country  

_From documentation:_  
> The country where the event occurred.

In [244]:
demo.occr_country.describe()

count     394053
unique       163
top           US
freq      262062
Name: occr_country, dtype: object

In [336]:
show_value_counts(demo.occr_country, 20)

Unnamed: 0,occr_country,count,proportion
0,US,262062,0.67
1,CA,17550,0.04
2,FR,16969,0.04
3,JP,15968,0.04
4,GB,13561,0.03
5,DE,10752,0.03
6,IT,8315,0.02
7,ES,4896,0.01
8,BR,3873,0.01
9,CN,2988,0.01


In [247]:
# missing values count
demo.occr_country.isna().sum()

13

In [254]:
demo.occr_country.str.len().value_counts(dropna=False)

2.00    394053
nan         13
Name: occr_country, dtype: int64

Less than 1% of the values are missing. The missing values count is consistent with the FDA number.

##### Summary for DEMO ASCII file  

The data is mostly consistent with the accompanying FDA missing value and frequency counts pdf. 
Some fields have low counts of non-missing values, which can be problematic for analyses.  

* **Data quality issues found:**  
  - one record has a negative value in the `age` field  
  - the field `lit_ref` has the value "NOT APPLICABLE" in 61 records, in addition to the null missing values.
  - one record has a "U" categorical value in the `to_mfr` field, which wasn't listed in the FDA pdf  
  - While the `reporter_country` field does not have null missing values, it does contain the value "COUNTRY NOT SPECIFIED", which indicates missing country values. About 4% of records have this value.   
  
  
* **Data cleaning steps to do:**  
  - fix the data quality issues listed above
  - standardize weight and age fields to SI units  
  - infer age categories

#### 4.2.2 DRUG file

In [255]:
!head data/raw/ascii_2018q4/ascii/DRUG18Q4.txt

primaryid$caseid$drug_seq$role_cod$drugname$prod_ai$val_vbm$route$dose_vbm$cum_dose_chr$cum_dose_unit$dechal$rechal$lot_num$exp_dt$nda_num$dose_amt$dose_unit$dose_form$dose_freq
100035916$10003591$1$PS$GILENYA$FINGOLIMOD HYDROCHLORIDE$1$Oral$QD$$$$$$$022527$$$CAPSULE$QD
100050413$10005041$1$PS$PLAN B ONE-STEP$LEVONORGESTREL$1$Oral$1.5 MILLIGRAM DAILY;$$$D$$$$021998$1.5$MG$TABLET$QD
1000551312$10005513$1$PS$ENBREL$ETANERCEPT$1$Subcutaneous$50 MG, ONCE WEEKLY$50$MG$U$$ G79072$$103795$50$MG$SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE$/wk
1000551312$10005513$2$SS$ENBREL$ETANERCEPT$1$Unknown$50 MG, ONCE WEEKLY (EVERY THURSDAY)$50$MG$U$$ S77448$$103795$50$MG$SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE$/wk
1000551312$10005513$3$SS$ENBREL$ETANERCEPT$1$Unknown$1 DF, WEEKLY$50$MG$U$$$$103795$1$DF$SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE$/wk
1000551312$10005513$4$SS$ENBREL$ETANERCEPT$1$Unknown$UNK$50$MG$U$$$$103795$$$SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE$
1000551312$10005513$5

In [256]:
ascii_file_drug = raw_ascii_path + "DRUG18Q4.txt"

In [275]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'drug_seq': np.int32, 
    'role_cod': 'object', 
    'drugname': 'object', 
    'prod_ai': 'object',
    'val_vbm': 'object', 
    'route': 'object', 
    'dose_vbm': 'object', 
    'cum_dose_chr': np.float64, 
    'cum_dose_unit': 'object',
    'dechal': 'object', 
    'rechal': 'object', 
    'lot_num': 'object', 
    'exp_dt': 'object', 
    'nda_num': 'object', 
    'dose_amt': np.float64,
    'dose_unit': 'object', 
    'dose_form': 'object', 
    'dose_freq': 'object'
}

In [270]:
drug = pd.read_csv(ascii_file_drug, sep='$', dtype=datatypes)

In [271]:
drug.columns

Index(['primaryid', 'caseid', 'drug_seq', 'role_cod', 'drugname', 'prod_ai',
       'val_vbm', 'route', 'dose_vbm', 'cum_dose_chr', 'cum_dose_unit',
       'dechal', 'rechal', 'lot_num', 'exp_dt', 'nda_num', 'dose_amt',
       'dose_unit', 'dose_form', 'dose_freq'],
      dtype='object')

In [272]:
drug.head()

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
0,100035916,10003591,1,PS,GILENYA,FINGOLIMOD HYDROCHLORIDE,1,Oral,QD,,,,,,,22527,,,CAPSULE,QD
1,100050413,10005041,1,PS,PLAN B ONE-STEP,LEVONORGESTREL,1,Oral,1.5 MILLIGRAM DAILY;,,,D,,,,21998,1.5,MG,TABLET,QD
2,1000551312,10005513,1,PS,ENBREL,ETANERCEPT,1,Subcutaneous,"50 MG, ONCE WEEKLY",50.0,MG,U,,G79072,,103795,50.0,MG,SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE,/wk
3,1000551312,10005513,2,SS,ENBREL,ETANERCEPT,1,Unknown,"50 MG, ONCE WEEKLY (EVERY THURSDAY)",50.0,MG,U,,S77448,,103795,50.0,MG,SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE,/wk
4,1000551312,10005513,3,SS,ENBREL,ETANERCEPT,1,Unknown,"1 DF, WEEKLY",50.0,MG,U,,,,103795,1.0,DF,SOLUTION FOR INJECTION IN PRE-FILLED SYRINGE,/wk


In [273]:
drug.describe(include='all')

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
count,1546835.0,1546835.0,1546835.0,1546835,1546823,1511985,1546835.0,1102380,915572,49001.0,49011,794509,264528,236897,4715.0,499438.0,650429.0,650429,672867,410843
unique,394066.0,394066.0,,4,61846,5782,2.0,66,143318,,23,4,4,42141,716.0,6385.0,,36,372,33
top,146088398.0,14608839.0,,C,REVLIMID,ASPIRIN,1.0,Unknown,UNK,,MG,U,U,UNKNOWN,20200131.0,21880.0,,MG,TABLET,QD
freq,310.0,310.0,,714731,15475,19377,1511988.0,463084,278054,,35054,431577,226288,77892,193.0,14551.0,,506064,176663,229695
mean,,,7.2,,,,,,,58156.82,,,,,,,598.74,,,
std,,,11.6,,,,,,,5321159.75,,,,,,,85133.95,,,
min,,,1.0,,,,,,,0.0,,,,,,,0.0,,,
25%,,,1.0,,,,,,,80.0,,,,,,,5.0,,,
50%,,,4.0,,,,,,,610.0,,,,,,,30.0,,,
75%,,,8.0,,,,,,,5100.0,,,,,,,150.0,,,


In [274]:
drug.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1546835 entries, 0 to 1546834
Data columns (total 20 columns):
primaryid        1546835 non-null object
caseid           1546835 non-null object
drug_seq         1546835 non-null int32
role_cod         1546835 non-null object
drugname         1546823 non-null object
prod_ai          1511985 non-null object
val_vbm          1546835 non-null object
route            1102380 non-null object
dose_vbm         915572 non-null object
cum_dose_chr     49001 non-null float64
cum_dose_unit    49011 non-null object
dechal           794509 non-null object
rechal           264528 non-null object
lot_num          236897 non-null object
exp_dt           4715 non-null object
nda_num          499438 non-null object
dose_amt         650429 non-null float64
dose_unit        650429 non-null object
dose_form        672867 non-null object
dose_freq        410843 non-null object
dtypes: float64(2), int32(1), object(17)
memory usage: 230.1+ MB


##### Unique identifiers

The data rows in the DRUG file are unique by `primaryid` + `drug_seq`.  
The DRUG file has a many-to-one relationship with the DEMO file, matching on `primaryid`.  
The DRUG file also has the `caseid` field. Both `primaryid` and `caseid` are defined here the same way as in the DEMO file and across the rest of the datasets.
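These key assumptions can be verified directly rather than inferred from counts. The sketch below uses two tiny made-up frames in place of `demo` and `drug`; the variable names are illustrative only.

```python
import pandas as pd

# Made-up stand-ins for the DEMO and DRUG dataframes.
demo_sample = pd.DataFrame({"primaryid": ["100", "200"]})
drug_sample = pd.DataFrame({
    "primaryid": ["100", "100", "200"],
    "drug_seq": [1, 2, 1],
})

# 1. Composite key: no (primaryid, drug_seq) pair should repeat.
n_duplicate_keys = int(drug_sample.duplicated(subset=["primaryid", "drug_seq"]).sum())

# 2. Many-to-one: merge raises pandas.errors.MergeError if primaryid
#    is not unique on the DEMO side.
merged = drug_sample.merge(
    demo_sample, on="primaryid", how="left", validate="many_to_one", indicator=True
)

# 3. DRUG rows without a matching DEMO record.
n_orphans = int((merged["_merge"] == "left_only").sum())
```

Both counts being zero would support treating `primaryid` + `drug_seq` as the row key and `primaryid` as the join key to DEMO.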

In [278]:
drug.primaryid.describe()

count       1546835
unique       394066
top       146088398
freq            310
Name: primaryid, dtype: object

In [337]:
show_value_counts(drug.primaryid, 20)

Unnamed: 0,primaryid,count,proportion
0,146088398,310,0.0
1,153490012,259,0.0
2,156811382,256,0.0
3,148011912,252,0.0
4,153486263,227,0.0
5,156232931,200,0.0
6,153535922,200,0.0
7,156196031,198,0.0
8,155573181,197,0.0
9,155787136,195,0.0


The number of unique values of `primaryid` matches the number of records in the DEMO file.

In [283]:
drug.caseid.describe()

count      1546835
unique      394066
top       14608839
freq           310
Name: caseid, dtype: object

As above, the number of unique values of `caseid` matches the number of records in the DEMO file.

##### drug_seq  

From documentation:

> Unique number for identifying a drug for a Case.  
> To link to the THERyyQq.TXT data file, both the Case number (primary key) and the DRUG_SEQ number (secondary key) are needed.  

From ENDNOTES in the documentation:  
> DRUG_SEQ (drug sequence number found in the Drug file, Therapy file, and Indications file) denotes the relationship between the drug(s) reported for a Case, the therapy date(s) reported for the drug(s), and the indications reported for the drug(s).  
> Consider Case 3078140 version 1, received by the FDA on 12/31/97. The PRIMARYID for this case is 30781401. Like any Case, it appears once (and only once) in the Demographic file:  
>
> | PRIMARYID |
  | ----- |
  | 30781401 |
>
>        
> Four drugs were reported for this Case: Aricept was reported as suspect, and Estrogens, Prozac, and Synthroid as concomitant. Primaryid 30781401 appears four times in the Drug file, with a different DRUG_SEQ for each drug:
>
> | PRIMARYID | DRUG_SEQ | DRUGNAME
  | --------- | -------- | --------
  | 30781401  | 1        | Aricept
  | 30781401  | 2        | Estrogens
  | 30781401  | 3        | Prozac (Fluoxetine Hydrochloride)
  | 30781401  | 4        | Synthroid (Levothyroxine Sodium)
>
> Dates of therapy for Aricept were reported as "4/97 to 6/13/97", and "6/20/97 (ongoing)." Since the drug was started, stopped, then restarted, there are two entries in the Drug Therapy file. In such a circumstance, the two entries will have the same PRIMARYID and the same DRUG_SEQ # (or DSG_DRUG_SEQ number as it is called in the Therapy file - see below). No therapy dates were reported for the concomitants; therefore, they do not appear in the Drug Therapy file, which is excerpted as follows:  
>
> | PRIMARYID | DSG_DRUG_SEQ # | START_DT | END_DT
  | --------- | -------------- | -------- | ------
  | 30781401  | 1              | 199704   | 19970613
  | 30781401  | 1              | 19970620 |
>
> NOTE:  The Drug Seq number is no longer a unique key as was the case in LAERS QDE.  The Drug Seq number simply shows the order of the DRUGNAME within a unique case.  Additionally, the fields labeled DRUG_SEQ, INDI_DRUG_SEQ, and DSG_DRUG_SEQ in the Drug, Indication, and Therapy files, respectively, all serve the same purpose of linking the data elements in each individual file together with the appropriate drug listed in the case using the PRIMARYID.

In [285]:
drug.drug_seq.describe()

count   1546835.00
mean          7.20
std          11.60
min           1.00
25%           1.00
50%           4.00
75%           8.00
max         310.00
Name: drug_seq, dtype: float64

In [339]:
show_value_counts(drug.drug_seq)

Unnamed: 0,drug_seq,count,proportion
0,1,394061,0.25
1,2,211759,0.14
2,3,148953,0.10
3,4,115906,0.07
4,5,94757,0.06
5,6,78364,0.05
6,7,66161,0.04
7,8,55679,0.04
8,9,47721,0.03
9,10,40166,0.03


No missing values. The maximum `drug_seq` value is 310, which matches the largest number of drug rows recorded for a single `primaryid` above.

##### role_cod  
From documentation:  
> Code for drug's reported role in event (See table below)
>
> | CODE      | MEANING_TEXT
 | ----      | ------------
 | PS        | Primary Suspect Drug
 | SS        | Secondary Suspect Drug
 | C         | Concomitant
 | I         | Interacting

In [340]:
drug.role_cod.describe()

count     1546835
unique          4
top             C
freq       714731
Name: role_cod, dtype: object

In [341]:
show_value_counts(drug.role_cod)

Unnamed: 0,role_cod,count,proportion
0,C,714731,0.46
1,SS,428672,0.28
2,PS,394065,0.25
3,I,9367,0.01


In [351]:
show_na(drug.role_cod)

Unnamed: 0,na_count,na_proportion
0,0,0.0


No missing values. The frequency counts and proportions are consistent with the FDA numbers.

##### drugname  
From documentation:  
> Name of medicinal product.  
> If a "Valid Trade Name" is populated for this Case, then DRUGNAME = Valid Trade Name; if not, then DRUGNAME = "Verbatim" name, exactly as entered on the report.

In [352]:
drug.drugname.describe()

count      1546823
unique       61846
top       REVLIMID
freq         15475
Name: drugname, dtype: object

In [354]:
show_value_counts(drug.drugname, 20)

Unnamed: 0,drugname,count,proportion
0,REVLIMID,15475,0.01
1,HUMIRA,15081,0.01
2,ENBREL,13311,0.01
3,PREDNISONE.,12837,0.01
4,METHOTREXATE.,12210,0.01
5,XARELTO,11786,0.01
6,LYRICA,10128,0.01
7,REPATHA,9924,0.01
8,COSENTYX,9603,0.01
9,XOLAIR,9540,0.01


In [355]:
show_na(drug.drugname)

Unnamed: 0,na_count,na_proportion
0,12,0.0


In [357]:
drug[drug.drugname.isna()]

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
665436,154818901,15481890,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
669087,154831781,15483178,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
722274,154998041,15499804,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
722275,154998041,15499804,3,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
722276,154998041,15499804,4,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
942462,155793001,15579300,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
1008975,156016371,15601637,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
1057196,156172351,15617235,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
1061749,156186731,15618673,3,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,
1105598,156319821,15631982,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,


In [358]:
drug[drug.primaryid == "154818901"]

Unnamed: 0,primaryid,caseid,drug_seq,role_cod,drugname,prod_ai,val_vbm,route,dose_vbm,cum_dose_chr,cum_dose_unit,dechal,rechal,lot_num,exp_dt,nda_num,dose_amt,dose_unit,dose_form,dose_freq
665435,154818901,15481890,1,PS,HUMIRA,ADALIMUMAB,1,Subcutaneous,? OTHER FREQUENCY:Q 2 WEEKS;?,,,D,D,1095254.0,20191130.0,,40.0,MG,,QOW
665436,154818901,15481890,2,C,,UNSPECIFIED INGREDIENT,1,,,,,,,,,,,,,


In [359]:
demo[demo.primaryid == "154818901"]

Unnamed: 0,primaryid,caseid,caseversion,i_f_code,event_dt,mfr_dt,init_fda_dt,fda_dt,rept_cod,auth_num,mfr_num,mfr_sndr,lit_ref,age,age_cod,age_grp,sex,e_sub,wt,wt_cod,rept_dt,to_mfr,occp_cod,reporter_country,occr_country
114413,154818901,15481890,1,I,,,20181003,20181003,DIR,,,FDA-CTU,,,,,F,N,,,20181003,N,CN,US,US


In [360]:
ascii_file_drug

'data/raw/ascii_2018q4/ascii/DRUG18Q4.txt'

In [362]:
# check the datalines in the raw input file
# look at line 665436 and surrounding lines
# looking for primaryid=154818901
!sed '665434,665440!d' data/raw/ascii_2018q4/ascii/DRUG18Q4.txt

154818882$15481888$9$SS$NOXAFIL$POSACONAZOLE$1$Oral$300 MG, QD$2400$MG$Y$$$$$300$MG$GASTRO-RESISTANT TABLET$QD
154818882$15481888$10$SS$ZAVEDOS$IDARUBICIN$1$Intravenous (not otherwise specified)$13.7 MG, QD$55$MG$$$$$$13.7$MG$$QD
154818891$15481889$1$PS$LAMICTAL$LAMOTRIGINE$1$Oral$200 MG, QD$$$Y$$$$020241$200$MG$TABLET$QD
154818901$15481890$1$PS$HUMIRA$ADALIMUMAB$1$Subcutaneous$?          OTHER FREQUENCY:Q 2 WEEKS;?$$$D$D$ 1095254$20191130$$40$MG$$QOW
154818901$15481890$2$C$N/A$UNSPECIFIED INGREDIENT$1$$$$$$$$$$$$$
154818922$15481892$1$PS$TYSABRI$NATALIZUMAB$1$Intravenous (not otherwise specified)$INFUSED OVER 1 HOUR$$$$U$$$$300$MG$CONCENTRATE FOR SOLUTION FOR INFUSION$
154818931$15481893$1$PS$SYMDEKO$IVACAFTOR\TEZACAFTOR$1$Oral$TEZACAFTOR/IVACAFTOR AM, IVACAFTOR PM$$$$$ 1540373$$210491$$$TABLET$


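There are 12 missing `drugname` values in the loaded data, and all of them have `prod_ai`="UNSPECIFIED INGREDIENT". For the record inspected above, however, the raw file contains the literal string "N/A" rather than an empty field, and `pd.read_csv` converts "N/A" to NaN by default. This likely explains why the missing values count is inconsistent with the FDA's number of 0 in the accompanying documentation.  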
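The raw line shown above has `N/A` in the `drugname` position, and "N/A" is among the strings that `pd.read_csv` treats as missing by default. The sketch below reproduces this on two rows trimmed from the raw output above to four columns, then shows one possible way to keep "N/A" as a literal string. Whether only empty fields should count as missing is an assumption here, not something the FDA documentation states.

```python
import io

import pandas as pd

# Two rows shaped like the raw DRUG file, trimmed to four columns.
raw = io.StringIO(
    "primaryid$drug_seq$drugname$prod_ai\n"
    "154818901$1$HUMIRA$ADALIMUMAB\n"
    "154818901$2$N/A$UNSPECIFIED INGREDIENT\n"
)

# Default parsing: "N/A" is in pandas' built-in list of NA markers.
default = pd.read_csv(raw, sep="$", dtype=str)

raw.seek(0)
# Treat only empty fields as missing and keep "N/A" as a literal string.
literal = pd.read_csv(raw, sep="$", dtype=str, keep_default_na=False, na_values=[""])
```

With the second call, `drugname` would have no missing values for these rows, and "N/A" can then be handled explicitly as a placeholder during cleaning.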

##### prod_ai  
From documentation:  
> Product Active Ingredient, when available.  
>
> \* New tag added in 2014Q3 extract.

In [363]:
drug.prod_ai.describe()

count     1511985
unique       5782
top       ASPIRIN
freq        19377
Name: prod_ai, dtype: object

In [365]:
show_value_counts(drug.prod_ai, 10)

Unnamed: 0,prod_ai,count,proportion
0,,34850,0.02
1,ASPIRIN,19377,0.01
2,LENALIDOMIDE,16440,0.01
3,ADALIMUMAB,15714,0.01
4,ACETAMINOPHEN,15582,0.01
5,DOCETAXEL,14951,0.01
6,PREDNISONE,14313,0.01
7,ETANERCEPT,14101,0.01
8,METHOTREXATE,13535,0.01
9,METFORMIN HYDROCHLORIDE,12457,0.01


About 2% of the values are missing. The missing values count is consistent with the FDA number.

##### val_vbm  
From documentation:  
> Code for source of DRUGNAME (See table below)
> 
> | CODE | MEANING_TEXT
  | ---- | ------------
  | 1    | Validated trade name used
  | 2    | Verbatim name used

In [366]:
drug.val_vbm.describe()

count     1546835
unique          2
top             1
freq      1511988
Name: val_vbm, dtype: object

In [367]:
show_value_counts(drug.val_vbm)

Unnamed: 0,val_vbm,count,proportion
0,1,1511988,0.98
1,2,34847,0.02


In [368]:
show_na(drug.val_vbm)

Unnamed: 0,na_count,na_proportion
0,0,0.0


No missing values. The counts and proportions are consistent with the FDA numbers.

##### route  
From documentation:  
> The route of drug administration

In [369]:
drug.route.describe()

count     1102380
unique         66
top       Unknown
freq       463084
Name: route, dtype: object

In [370]:
show_value_counts(drug.route)

Unnamed: 0,route,count,proportion
0,Unknown,463084,0.30
1,,444455,0.29
2,Oral,391052,0.25
3,Subcutaneous,81858,0.05
4,Intravenous (not otherwise specified),79911,0.05
5,Intravenous drip,16901,0.01
6,Respiratory (inhalation),12316,0.01
7,Intramuscular,9050,0.01
8,Transplacental,7618,0.00
9,Topical,7591,0.00


In [371]:
show_value_counts(drug[drug.route.notna()].route)

Unnamed: 0,route,count,proportion
0,Unknown,463084,0.42
1,Oral,391052,0.35
2,Subcutaneous,81858,0.07
3,Intravenous (not otherwise specified),79911,0.07
4,Intravenous drip,16901,0.02
5,Respiratory (inhalation),12316,0.01
6,Intramuscular,9050,0.01
7,Transplacental,7618,0.01
8,Topical,7591,0.01
9,Other,4700,0.00


The `route` field has about 29% missing values and 30% "Unknown" values. The missing values count and the category frequency counts are consistent with the FDA numbers, but the percentages given by the FDA are not. Excluding the missing values from the percentage calculations brings them somewhat closer to the FDA numbers, but significant differences remain.
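The two ways of computing proportions compared here differ only in the denominator. A small sketch on a made-up series, using plain `value_counts` rather than this notebook's `show_value_counts` helper:

```python
import pandas as pd

# Made-up sample: 8 rows, 2 of them missing.
route = pd.Series(["Unknown", None, "Oral", "Unknown", None, "Oral", "Oral", "Topical"])

# Denominator = all rows, with missing values counted as their own category.
share_of_all = route.value_counts(normalize=True, dropna=False)

# Denominator = non-missing rows only.
share_of_reported = route.value_counts(normalize=True)
```

Here "Oral" is 3 of 8 rows in the first case and 3 of 6 in the second, which is the kind of shift seen between the two tables above.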

##### dose_vbm  
From documentation:  
> Verbatim text for dose, frequency, and route, exactly as entered on report.

In [379]:
drug.dose_vbm.describe()

count     915572
unique    143318
top          UNK
freq      278054
Name: dose_vbm, dtype: object

In [380]:
show_value_counts(drug.dose_vbm, 20)

Unnamed: 0,dose_vbm,count,proportion
0,,631263,0.41
1,UNK,278054,0.18
2,"10 MG, QD",7950,0.01
3,UNKNOWN,6622,0.0
4,"20 MG, QD",5887,0.0
5,"1 DF, QD",5735,0.0
6,"UNK UNK, UNKNOWN FREQ.",5610,0.0
7,"140 MG, Q2WK",5345,0.0
8,10 MILLIGRAM,5232,0.0
9,"40 MG, QD",4578,0.0


41% of the values are missing. The missing values count is not consistent with the FDA number in the accompanying documentation (off by 6).  
In addition to the missing values, 18% of records have the value "UNK", and many other records have some variation of it, e.g. "UNKNOWN", "UNK UNK, UNK", "UNK, UNKNOWN", etc.
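One possible way to flag such values is a regular expression that accepts only strings made up of "unknown" tokens. This is a rough sketch: the pattern covers the variants listed above plus "UNKNOWN FREQ.", and `is_unknown_dose` is a hypothetical helper name. Other spellings in the data would need to be added after reviewing the distinct values.

```python
import re

# Matches strings built only from UNK / UNKNOWN / "UNKNOWN FREQ." tokens,
# separated by spaces, commas, or semicolons.
_UNKNOWN_ONLY = re.compile(r"^(?:UNK(?:NOWN)?(?:\s+FREQ\.?)?[\s,;]*)+$", re.IGNORECASE)


def is_unknown_dose(text) -> bool:
    """True if a verbatim dose string carries no information beyond 'unknown'."""
    if not isinstance(text, str):  # None or NaN: missing, not "unknown"
        return False
    return bool(_UNKNOWN_ONLY.match(text.strip()))
```

Applied with something like `drug.dose_vbm.map(is_unknown_dose)`, this would give a boolean mask that separates uninformative strings from real dose text such as "10 MG, QD".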

##### cum_dose_chr  
From documentation:  
> Cumulative dose to first reaction

In [381]:
drug.cum_dose_chr.describe()

count        49001.00
mean         58156.82
std        5321159.75
min              0.00
25%             80.00
50%            610.00
75%           5100.00
max     1163833000.00
Name: cum_dose_chr, dtype: float64

In [382]:
show_value_counts(drug.cum_dose_chr)

Unnamed: 0,cum_dose_chr,count,proportion
0,,1497834,0.97
1,1.00,822,0.00
2,2.00,711,0.00
3,20.00,595,0.00
4,10.00,501,0.00
5,200.00,500,0.00
6,4.00,479,0.00
7,40.00,404,0.00
8,3.00,397,0.00
9,300.00,396,0.00


97% missing values.

##### cum_dose_unit  
From documentation:  
> Cumulative dose to first reaction unit  
>
> NOTE:  The list below provides Dose codes which are commonly reported; however, dose codes are not limited to this list and other code values may be present.
>
> | CODE        | Meaning_Text
  | ----        | ------------
  | KG          | Kilogram(s)
  | GM          | Gram(s)
  | MG          | Milligram(s)
  | UG          | Microgram(s) (μg)
  | NG          | Nanogram(s)
  | PG          | Picogram(s)
  | MG/KG       | Milligram(s)/Kilogram
  | UG/KG       | Microgram(s)/Kilogram (μG/KG)
  | MG/M**2     | Milligram(s)/Sq. Meter
  | UG/M**2     | Microgram(s)/Sq. Meter (μG/M**2)
  | L           | Litre(s)
  | ML          | Millilitre(s)
  | UL          | Microlitre(s) (μL)
  | BQ          | Becquerel(s)
  | GBQ         | Gigabecquerel(s)
  | MBQ         | Megabecquerel(s)
  | KBQ         | Kilobecquerel(s)
  | CI          | Curie(s)
  | MCI         | Millicurie(s)
  | UCI         | Microcurie(s) (μCI)
  | NCI         | Nanocurie(s)
  | MOL         | Mole(s)
  | MMOL        | Millimole(s)
  | UMOL        | Micromole(s)
  | IU          | International Unit(s)
  | KIU         | International Unit*(1000s)
  | MIU         | International Unit*(1,000,000s)
  | IU/KG       | IU/Kilogram
  | MEQ         | Milliequivalent(s)
  | PCT         | Percent (%)
  | GTT         | Drop(s)
  | DF          | Dosage Form  

In [383]:
drug.cum_dose_unit.describe()

count     49011
unique       23
top          MG
freq      35054
Name: cum_dose_unit, dtype: object

In [384]:
show_value_counts(drug.cum_dose_unit, 20)

Unnamed: 0,cum_dose_unit,count,proportion
0,,1497824,0.97
1,MG,35054,0.02
2,DF,7661,0.0
3,G,1680,0.0
4,UG,1108,0.0
5,ML,957,0.0
6,MG/KG,705,0.0
7,IU,649,0.0
8,MG/M2,623,0.0
9,Gtt,206,0.0


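97% of the values are missing, which is consistent with the missing values proportion of `cum_dose_chr`. The non-missing counts differ slightly, though: 49,011 units versus 49,001 values, so at least 10 rows have a unit without an amount.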
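Because an amount without a unit (or a unit without an amount) is hard to interpret, a quick check for unpaired rows can help. A minimal sketch on made-up data; `unpaired_rows` is an illustrative helper, and the same check would apply to `dose_amt` and `dose_unit`.

```python
import pandas as pd


def unpaired_rows(df: pd.DataFrame, value_col: str, unit_col: str) -> pd.DataFrame:
    """Rows where exactly one of the value/unit pair is missing."""
    mismatch = df[value_col].isna() ^ df[unit_col].isna()
    return df[mismatch]


# Made-up sample: row 1 has a unit but no amount.
sample = pd.DataFrame({
    "cum_dose_chr": [50.0, None, None],
    "cum_dose_unit": ["MG", "MG", None],
})
orphans = unpaired_rows(sample, "cum_dose_chr", "cum_dose_unit")
```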

##### dechal  
From documentation:  
> Dechallenge code, indicating if reaction abated when drug therapy was stopped (See table below)  
>
> | CODE      | MEANING_TEXT
  | ----      | ------------
  | Y         | Positive dechallenge
  | N         | Negative dechallenge
  | U         | Unknown
  | D         | Does not apply

In [385]:
drug.dechal.describe()

count     794509
unique         4
top            U
freq      431577
Name: dechal, dtype: object

In [386]:
show_value_counts(drug.dechal)

Unnamed: 0,dechal,count,proportion
0,,752326,0.49
1,U,431577,0.28
2,D,198638,0.13
3,Y,135631,0.09
4,N,28663,0.02


49% missing values. Missing values count and category frequency counts and percentages are consistent with the FDA numbers.

##### rechal  
From documentation:  
> 
> Rechallenge code, indicating if reaction recurred when drug therapy was restarted (See table below)  
>
> | CODE      | MEANING_TEXT
  | ----      | ------------
  | Y         | Positive rechallenge
  | N         | Negative rechallenge
  | U         | Unknown
  | D         | Does not apply

In [387]:
drug.rechal.describe()

count     264528
unique         4
top            U
freq      226288
Name: rechal, dtype: object

In [388]:
show_value_counts(drug.rechal)

Unnamed: 0,rechal,count,proportion
0,,1282307,0.83
1,U,226288,0.15
2,N,22566,0.01
3,D,11507,0.01
4,Y,4167,0.0


83% missing values, and 15% unknown values, adding up to 98%. The missing values count and the category frequencies and percentages are consistent with the FDA numbers.

##### lot_num  
From documentation:  
> Lot number of the drug (as reported).

In [389]:
drug.lot_num.describe()

count       236897
unique       42141
top        UNKNOWN
freq         77892
Name: lot_num, dtype: object

In [390]:
show_value_counts(drug.lot_num, 20)

Unnamed: 0,lot_num,count,proportion
0,,1309938,0.85
1,UNKNOWN,77892,0.05
2,UNK,17091,0.01
3,NOT AVAILABLE,4993,0.0
4,"UNKNOWN,UNKNOWN",2441,0.0
5,NOT REPORTED,2422,0.0
6,"NOT AVAILABLE,NOT AVAILABLE",1401,0.0
7,"NOT AVAILABLE,NOT AVAILABLE,NOT AVA",1339,0.0
8,"UNKNOWN,UNKNOWN,UNKNOWN,UNKNOWN,UNK",876,0.0
9,"UNKNOWN,UNKNOWN,UNKNOWN",823,0.0


85% missing values. The most frequent non-missing values are all placeholders such as "UNKNOWN", "UNK", "NOT AVAILABLE" and "NOT REPORTED" (sometimes repeated or truncated), and there are probably more similar values with lower frequencies.  
The missing values count is consistent with the FDA number.

##### exp_dt  
From documentation:  
> Expiration date of the drug.  
> (YYYYMMDD format) - If a complete date is not available, a partial date is provided

In [391]:
drug.exp_dt.describe()

count         4715
unique         716
top       20200131
freq           193
Name: exp_dt, dtype: object

In [392]:
show_value_counts(drug.exp_dt, 20)

Unnamed: 0,exp_dt,count,proportion
0,,1542120,1.0
1,20200131.0,193,0.0
2,20200331.0,171,0.0
3,20200228.0,165,0.0
4,20210331.0,157,0.0
5,20200430.0,141,0.0
6,20200630.0,139,0.0
7,20200531.0,135,0.0
8,20210131.0,104,0.0
9,20210228.0,104,0.0


Nearly 100% of the values are missing. The missing values count is consistent with the FDA number.

##### nda_num  
From documentation:  
> NDA number (numeric only)

In [394]:
drug.nda_num.describe()

count     499438
unique      6385
top       021880
freq       14551
Name: nda_num, dtype: object

In [396]:
show_value_counts(drug.nda_num, 10)

Unnamed: 0,nda_num,count,proportion
0,,1047397,0.68
1,21880.0,14551,0.01
2,125057.0,11027,0.01
3,125522.0,9536,0.01
4,125504.0,8350,0.01
5,103795.0,7853,0.01
6,21446.0,7456,0.0
7,103976.0,7226,0.0
8,125031.0,7165,0.0
9,20449.0,6930,0.0


68% of the values are missing. The missing values count is consistent with the FDA number.

##### dose_amt  
From documentation:  
> Amount of drug reported

In [397]:
drug.dose_amt.describe()

count     650429.00
mean         598.74
std        85133.95
min            0.00
25%            5.00
50%           30.00
75%          150.00
max     36090180.00
Name: dose_amt, dtype: float64

In [398]:
show_value_counts(drug.dose_amt)

Unnamed: 0,dose_amt,count,proportion
0,,896406,0.58
1,1.00,52053,0.03
2,10.00,40691,0.03
3,20.00,39575,0.03
4,40.00,33125,0.02
5,5.00,31102,0.02
6,100.00,26858,0.02
7,50.00,24904,0.02
8,300.00,24845,0.02
9,2.00,20963,0.01


58% missing values.

##### dose_unit  
From documentation:  
> Unit of drug dose

In [399]:
drug.dose_unit.describe()

count     650429
unique        36
top           MG
freq      506064
Name: dose_unit, dtype: object

In [400]:
show_value_counts(drug.dose_unit)

Unnamed: 0,dose_unit,count,proportion
0,,896406,0.58
1,MG,506064,0.33
2,DF,59186,0.04
3,UG,20135,0.01
4,G,18172,0.01
5,MG/M**2,11348,0.01
6,IU,9767,0.01
7,MG/KG,9383,0.01
8,ML,7858,0.01
9,GTT,2670,0.0


58% missing values, consistent with the missing values proportion of `dose_amt`.

##### dose_form  
From documentation:  
> Form of dose reported

In [401]:
drug.dose_form.describe()

count     672867
unique       372
top       TABLET
freq      176663
Name: dose_form, dtype: object

In [402]:
show_value_counts(drug.dose_form, 20)

Unnamed: 0,dose_form,count,proportion
0,,873968,0.57
1,TABLET,176663,0.11
2,UNKNOWN,66233,0.04
3,SOLUTION FOR INJECTION,65418,0.04
4,CAPSULE,51025,0.03
5,INJECTION,40092,0.03
6,UNSPECIFIED,29405,0.02
7,CAPSULES,19464,0.01
8,FILM-COATED TABLET,17430,0.01
9,TABLETS,16658,0.01


57% missing values. In addition to the missing values, there are "UNKNOWN" values (4%), "UNSPECIFIED" (2%), "FORMULATION UNKNOWN" (1%), and possibly more values that are similar but less frequent.  
This field contains many values that are very similar, for example, "TABLET" and "TABLETS".
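Near-duplicate categories like these could be merged with a small lookup table. The sketch below is illustrative only: the two synonym pairs and the three "unknown" markers are the ones mentioned above, and a real mapping would have to be built by reviewing all distinct `dose_form` values. `normalize_dose_form` is a hypothetical helper.

```python
import pandas as pd

# Illustrative, hand-written mapping; a real one would be built by
# reviewing the full list of distinct dose_form values.
DOSE_FORM_SYNONYMS = {
    "TABLETS": "TABLET",
    "CAPSULES": "CAPSULE",
}
DOSE_FORM_UNKNOWN = {"UNKNOWN", "UNSPECIFIED", "FORMULATION UNKNOWN"}


def normalize_dose_form(dose_form: pd.Series) -> pd.Series:
    """Upper-case and trim, merge known synonyms, blank out 'unknown' markers."""
    cleaned = dose_form.str.strip().str.upper()
    cleaned = cleaned.replace(DOSE_FORM_SYNONYMS)
    return cleaned.mask(cleaned.isin(DOSE_FORM_UNKNOWN))


# Made-up sample values.
result = normalize_dose_form(
    pd.Series(["TABLET", "Tablets ", " UNKNOWN", None, "CAPSULES"])
)
```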

##### dose_freq  
From documentation:  
> Code for Frequency  
>
> NOTE: The list below provides frequency codes which are commonly reported; however, dose frequency codes are not limited to this list and other code values may be present.  
>
> | CODE  | Meaning_Text
  | ----  | ------------
  | 1X    | Once or one time
  | BID   | Twice a day
  | BIW   | Twice a week
  | HS    | At bedtime
  | PRN   | As needed
  | Q12H  | Every 12 hours
  | Q2H   | Every 2 hours
  | Q3H   | Every 3 hours
  | Q3W   | Every 3 weeks
  | Q4H   | Every 4 hours
  | Q5H   | Every 5 hours
  | Q6H   | Every 6 hours
  | Q8H   | Every 8 hours
  | QD    | Daily
  | QH    | Every hour
  | QID   | 4 times a day
  | QM    | Monthly
  | QOD   | Every other day
  | QOW   | Every other week
  | QW    | Every week
  | TID   | 3 times a day
  | TIW   | 3 times a week
  | UNK   | Unknown

In [403]:
drug.dose_freq.describe()

count     410843
unique        33
top           QD
freq      229695
Name: dose_freq, dtype: object

In [404]:
show_value_counts(drug.dose_freq)

Unnamed: 0,dose_freq,count,proportion
0,,1135992,0.73
1,QD,229695,0.15
2,BID,68832,0.04
3,QOW,23402,0.02
4,/wk,23294,0.02
5,TID,17585,0.01
6,/month,12431,0.01
7,Q3W,8061,0.01
8,Q12H,5838,0.0
9,QID,4125,0.0


73% missing values.

##### Summary for DRUG ASCII file  

The data is mostly consistent with the accompanying FDA missing value and frequency counts pdf. There is an inconsistency in the missing values count for the `drugname` field, and inconsistencies in the category proportions for the `route` field.    
Many fields have low counts of non-missing values, which can be problematic for analyses.  

* **Data quality issues found:**  
  - many of the fields are freetext, and would need to be cleaned up and standardized.   
  
* **Data cleaning steps to do:**  
  - fix the data quality issues listed above  
  - maybe infer categories from some of the freetext fields

#### 4.2.3 REACTION file

In [405]:
!ls data/raw/ascii_2018q4/ascii/

ASC_NTS.pdf  INDI18Q4.txt RPSR18Q4.txt drug18q4.pdf reac18q4.pdf
DEMO18Q4.txt OUTC18Q4.txt THER18Q4.txt indi18q4.pdf rpsr18q4.pdf
DRUG18Q4.txt REAC18Q4.txt demo18q4.pdf outc18q4.pdf ther18q4.pdf


In [406]:
!head data/raw/ascii_2018q4/ascii/REAC18Q4.txt

primaryid$caseid$pt$drug_rec_act
100035916$10003591$Angina pectoris$
100035916$10003591$Atrioventricular block first degree$
100050413$10005041$Maternal exposure before pregnancy$
100050413$10005041$Pregnancy after post coital contraception$
1000551312$10005513$Arthralgia$
1000551312$10005513$Dengue fever$
1000551312$10005513$Feeling hot$
1000551312$10005513$Headache$
1000551312$10005513$Injection site pain$


In [407]:
ascii_file_reaction = raw_ascii_path + "REAC18Q4.txt"

In [413]:
reaction = pd.read_csv(ascii_file_reaction, sep='$', dtype="object")

In [414]:
reaction.columns

Index(['primaryid', 'caseid', 'pt', 'drug_rec_act'], dtype='object')

In [415]:
reaction.head()

Unnamed: 0,primaryid,caseid,pt,drug_rec_act
0,100035916,10003591,Angina pectoris,
1,100035916,10003591,Atrioventricular block first degree,
2,100050413,10005041,Maternal exposure before pregnancy,
3,100050413,10005041,Pregnancy after post coital contraception,
4,1000551312,10005513,Arthralgia,


In [416]:
reaction.describe(include='all')

Unnamed: 0,primaryid,caseid,pt,drug_rec_act
count,1250978,1250978,1250978,2988
unique,394066,394066,11884,954
top,1051730251,10517302,Drug ineffective,Diarrhoea
freq,118,118,26078,62


In [417]:
reaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1250978 entries, 0 to 1250977
Data columns (total 4 columns):
primaryid       1250978 non-null object
caseid          1250978 non-null object
pt              1250978 non-null object
drug_rec_act    2988 non-null object
dtypes: object(4)
memory usage: 38.2+ MB


##### Unique identifiers

Rows in the REACTION file are unique by `primaryid` + `pt` + `drug_rec_act`, i.e. by the full set of columns.  
The REACTION file has a many-to-one relationship with the DEMO file, matching on `primaryid`.  
The REACTION file also has the `caseid` field. Both the `primaryid` and `caseid` fields are defined here the same way as in the DEMO file and in the rest of the datasets.

In [423]:
len(reaction.drop_duplicates(subset=["primaryid", "pt", "drug_rec_act"]) )

1250978

In [418]:
reaction.primaryid.describe()

count        1250978
unique        394066
top       1051730251
freq             118
Name: primaryid, dtype: object

In [419]:
show_value_counts(reaction.primaryid, 20)

Unnamed: 0,primaryid,count,proportion
0,1051730251,118,0.0
1,154705241,118,0.0
2,1028691016,104,0.0
3,1193046938,103,0.0
4,1328414623,101,0.0
5,1189245828,98,0.0
6,147066109,97,0.0
7,152163772,94,0.0
8,1016951754,93,0.0
9,1460840510,92,0.0


The number of unique values of `primaryid` matches the number of records in the DEMO file.

In [420]:
reaction.caseid.describe()

count      1250978
unique      394066
top       10517302
freq           118
Name: caseid, dtype: object

Same as above: the number of unique values of `caseid` matches the number of records in the DEMO file.

##### pt  

From documentation:
 
> "Preferred Term"-level medical terminology describing the event, using the Medical Dictionary for Regulatory Activities (MedDRA). The order of the terms for a given event does not imply priority. In other words, the first term listed is not necessarily considered more significant than the last one listed.

In [424]:
reaction.pt.describe()

count              1250978
unique               11884
top       Drug ineffective
freq                 26078
Name: pt, dtype: object

In [425]:
show_value_counts(reaction.pt, 20)

Unnamed: 0,pt,count,proportion
0,Drug ineffective,26078,0.02
1,Fatigue,16794,0.01
2,Nausea,16198,0.01
3,Off label use,14091,0.01
4,Diarrhoea,13748,0.01
5,Headache,13520,0.01
6,Death,13457,0.01
7,Pain,12210,0.01
8,Dyspnoea,11782,0.01
9,Malaise,11201,0.01


In [426]:
show_na(reaction.pt)

Unnamed: 0,na_count,na_proportion
0,0,0.0


In [427]:
len(reaction.drop_duplicates(subset=["primaryid", "pt"]) )

1250965

In [None]:
dups = reaction[reaction.duplicated(subset=["primaryid", "pt"], keep=False)]

In [430]:
len(dups)

26

In [431]:
dups

Unnamed: 0,primaryid,caseid,pt,drug_rec_act
9870,108799712,10879971,Crohn's disease,Crohn^s disease
9871,108799712,10879971,Crohn's disease,
117107,140106725,14010672,Cardiac disorder,Cardiac disorder
117108,140106725,14010672,Cardiac disorder,
177597,1453360216,14533602,Oral disorder,Oral disorder
177598,1453360216,14533602,Oral disorder,
273176,150815975,15081597,Joint dislocation,Joint dislocation
273177,150815975,15081597,Joint dislocation,
388641,153841222,15384122,Staphylococcal infection,Staphylococcal infection
388642,153841222,15384122,Staphylococcal infection,


There are no missing values in the `pt` field.  

Checking for duplicates by `primaryid` + `pt` gives 26 rows that share a key with another row, shown above. These form 13 pairs (1250978 - 1250965 = 13 keys, two rows each). In each pair, one row has `drug_rec_act` equal or almost equal to `pt`, and the other row has `drug_rec_act` missing.
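
The pair structure described above can be verified programmatically. This is a sketch on toy data (`reaction_toy` is hypothetical); on the real data the same steps would be applied to `reaction`.

```python
import pandas as pd

# Toy rows mimicking the duplicate pattern (hypothetical ids).
reaction_toy = pd.DataFrame({
    "primaryid": ["1", "1", "2", "2", "3"],
    "pt": ["Oral disorder", "Oral disorder",
           "Crohn's disease", "Crohn's disease", "Nausea"],
    "drug_rec_act": ["Oral disorder", None, "Crohn^s disease", None, None],
})

key = ["primaryid", "pt"]
dup_rows = reaction_toy[reaction_toy.duplicated(subset=key, keep=False)]

# Per key: number of rows, and number of rows with a non-missing drug_rec_act.
grouped = dup_rows.groupby(key)["drug_rec_act"]
pair_summary = pd.DataFrame({"n_rows": grouped.size(), "n_filled": grouped.count()})

# True when every duplicated key has exactly two rows, exactly one of them filled.
all_pairs_one_filled = bool(
    ((pair_summary["n_rows"] == 2) & (pair_summary["n_filled"] == 1)).all()
)
print(pair_summary)
print(all_pairs_one_filled)
```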

##### drug_rec_act  

From documentation:
 
> Drug Recur Action data - populated with reaction/event information (PT) if/when the event reappears upon readministration of the drug.  
>
> \* New tag added in 2014Q3 extract.

In [432]:
reaction.drug_rec_act.describe()

count          2988
unique          954
top       Diarrhoea
freq             62
Name: drug_rec_act, dtype: object

In [433]:
show_value_counts(reaction.drug_rec_act, 20)

Unnamed: 0,drug_rec_act,count,proportion
0,,1247990,1.0
1,Diarrhoea,62,0.0
2,Nausea,51,0.0
3,Drug ineffective,46,0.0
4,Pruritus,37,0.0
5,Fatigue,36,0.0
6,Toxicity to various agents,35,0.0
7,Headache,34,0.0
8,Vomiting,32,0.0
9,Dizziness,32,0.0


In [439]:
# number of rows where pt=drug_rec_act
(reaction.pt == reaction.drug_rec_act).sum()

2981

In [443]:
# number of rows where pt!=drug_rec_act and drug_rec_act is not missing
# (ideally it should be equal or missing, according to docs)
((reaction.pt != reaction.drug_rec_act) & (reaction.drug_rec_act.notna() ) ).sum()

7

In [444]:
reaction[(reaction.pt != reaction.drug_rec_act) & (reaction.drug_rec_act.notna() ) ]

Unnamed: 0,primaryid,caseid,pt,drug_rec_act
9870,108799712,10879971,Crohn's disease,Crohn^s disease
150831,143164329,14316432,Fournier's gangrene,Fournier^s gangrene
411273,154194662,15419466,Behcet's syndrome,Behcet^s syndrome
719201,155674731,15567473,Basedow's disease,Basedow^s disease
961493,156642931,15664293,Meniere's disease,Meniere^s disease
973921,156696431,15669643,Behcet's syndrome,Behcet^s syndrome
1046755,156982891,15698289,Meige's syndrome,Meige^s syndrome


In [448]:
reaction["drug_rec_act_test"] = (reaction.pt == reaction.drug_rec_act).astype(int)

In [450]:
reaction.head()

Unnamed: 0,primaryid,caseid,pt,drug_rec_act,drug_rec_act_test
0,100035916,10003591,Angina pectoris,,0
1,100035916,10003591,Atrioventricular block first degree,,0
2,100050413,10005041,Maternal exposure before pregnancy,,0
3,100050413,10005041,Pregnancy after post coital contraception,,0
4,1000551312,10005513,Arthralgia,,0


In [451]:
show_value_counts(reaction.drug_rec_act_test)

Unnamed: 0,drug_rec_act_test,count,proportion
0,0,1247997,1.0
1,1,2981,0.0


Almost 100% of values are missing.  

According to the documentation, this field either repeats the value of `pt` or is missing. That is not exactly true in the data: in 7 rows the value matches `pt` except that the apostrophe (`'`) has been replaced with a caret (`^`). This will need to be cleaned later. For now, to get a general idea of the value counts for this field, I've created a binary variable indicating whether the reaction reappears when the drug is reintroduced (1 = yes). The table above shows its approximate frequency distribution (approximate because the 7 caret rows are counted as 0).
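
A possible version of the later cleanup is sketched here on toy data. It assumes that `^` is only ever a stand-in for `'`, which is what the 7 rows above suggest but is not confirmed by the documentation; the column name `recurred` is hypothetical.

```python
import pandas as pd

# Toy rows (hypothetical) covering an exact match, a caret mismatch and a missing value.
reaction_toy = pd.DataFrame({
    "pt": ["Nausea", "Behcet's syndrome", "Headache"],
    "drug_rec_act": ["Nausea", "Behcet^s syndrome", None],
})

# Assumption: "^" replaces "'" in drug_rec_act; regex=False treats "^" literally.
rec_clean = reaction_toy["drug_rec_act"].str.replace("^", "'", regex=False)

# 1 when the cleaned value repeats pt, 0 otherwise (including missing values).
reaction_toy["recurred"] = (rec_clean == reaction_toy["pt"]).astype(int)
print(reaction_toy)
```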

##### Summary for REACTION ASCII file  


* **Data quality issues found:**  
  - There are 26 rows (13 `primaryid` + `pt` keys) that have both a non-missing and a missing `drug_rec_act` value for the same key. It's not clear which of the two rows should be kept.  
  - The `drug_rec_act` non-missing values are supposed to be the same as the corresponding `pt` values, but 7 of them have `^` instead of `'` as in the `pt` field. 
  
* **Data cleaning steps to do:**  
  - fix/address the data quality issues listed above  
  - convert the `drug_rec_act` field to a binary field

#### 4.2.4 OUTCOME file

In [7]:
!ls data/raw/ascii_2018q4/ascii/*.txt

data/raw/ascii_2018q4/ascii/DEMO18Q4.txt
data/raw/ascii_2018q4/ascii/DRUG18Q4.txt
data/raw/ascii_2018q4/ascii/INDI18Q4.txt
data/raw/ascii_2018q4/ascii/OUTC18Q4.txt
data/raw/ascii_2018q4/ascii/REAC18Q4.txt
data/raw/ascii_2018q4/ascii/RPSR18Q4.txt
data/raw/ascii_2018q4/ascii/THER18Q4.txt


In [8]:
!head data/raw/ascii_2018q4/ascii/OUTC18Q4.txt

primaryid$caseid$outc_cod
100035916$10003591$OT
100050413$10005041$OT
1000551312$10005513$OT
100058832$10005883$LT
100058832$10005883$HO
100058832$10005883$OT
100065479$10006547$HO
100065479$10006547$OT
1000808588$10008085$HO


In [13]:
ascii_file_outcome = raw_ascii_path + "OUTC18Q4.txt"

In [14]:
outcome = pd.read_csv(ascii_file_outcome, sep='$', dtype="object")

In [15]:
outcome.columns

Index(['primaryid', 'caseid', 'outc_cod'], dtype='object')

In [16]:
outcome.head()

Unnamed: 0,primaryid,caseid,outc_cod
0,100035916,10003591,OT
1,100050413,10005041,OT
2,1000551312,10005513,OT
3,100058832,10005883,LT
4,100058832,10005883,HO


In [17]:
outcome.describe(include='all')

Unnamed: 0,primaryid,caseid,outc_cod
count,299135,299135,299135
unique,229931,229931,7
top,154808011,15480801,OT
freq,6,6,152703


In [18]:
outcome.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299135 entries, 0 to 299134
Data columns (total 3 columns):
primaryid    299135 non-null object
caseid       299135 non-null object
outc_cod     299135 non-null object
dtypes: object(3)
memory usage: 6.8+ MB


##### Unique identifiers

Rows in the OUTCOME file are unique by `primaryid` + `outc_cod`, i.e. by the full set of columns.  
The OUTCOME file has a many-to-one relationship with the DEMO file, matching on `primaryid`.  
The OUTCOME file also has the `caseid` field. Both the `primaryid` and `caseid` fields are defined here the same way as in the DEMO file and in the rest of the datasets.

In [20]:
len(outcome.drop_duplicates(subset=["primaryid", "outc_cod"]) )

299135

In [21]:
outcome.primaryid.describe()

count        299135
unique       229931
top       154808011
freq              6
Name: primaryid, dtype: object

In [22]:
show_value_counts(outcome.primaryid, 20)

Unnamed: 0,primaryid,count,proportion
0,154808011,6,0.0
1,157188702,6,0.0
2,155849471,6,0.0
3,149253697,6,0.0
4,36703554,6,0.0
5,137774557,5,0.0
6,155121611,5,0.0
7,154730241,5,0.0
8,152082754,5,0.0
9,154029005,5,0.0


In [31]:
outcome[outcome.primaryid == "154808011"]

Unnamed: 0,primaryid,caseid,outc_cod
109545,154808011,15480801,OT
109546,154808011,15480801,LT
109547,154808011,15480801,CA
109548,154808011,15480801,DE
109549,154808011,15480801,DS
109550,154808011,15480801,HO


The number of unique values of `primaryid` is less than the number of records in the DEMO file.

In [23]:
outcome.caseid.describe()

count       299135
unique      229931
top       15480801
freq             6
Name: caseid, dtype: object

Same as above. 

##### outc_cod  

From documentation:
 
> Code for a patient outcome (See table below)  
>
> | CODE      | MEANING_TEXT
  | ----      | ------------
  | DE        | Death
  | LT        | Life-Threatening
  | HO        | Hospitalization - Initial or Prolonged
  | DS        | Disability
  | CA        | Congenital Anomaly
  | RI        | Required Intervention to Prevent Permanent Impairment/Damage
  | OT        | Other Serious (Important Medical Event)  
>
> NOTE:  The outcome from the latest version of a case is provided. If there is more than one outcome, the codes will
be line listed.

In [24]:
outcome.outc_cod.describe()

count     299135
unique         7
top           OT
freq      152703
Name: outc_cod, dtype: object

In [25]:
show_value_counts(outcome.outc_cod)

Unnamed: 0,outc_cod,count,proportion
0,OT,152703,0.51
1,HO,94243,0.32
2,DE,31964,0.11
3,LT,12113,0.04
4,DS,6410,0.02
5,CA,1335,0.0
6,RI,367,0.0


In [28]:
dups = outcome[outcome.duplicated(subset=["primaryid"], keep=False)]

In [29]:
len(dups)

126789

In [30]:
dups.head(20)

Unnamed: 0,primaryid,caseid,outc_cod
3,100058832,10005883,LT
4,100058832,10005883,HO
5,100058832,10005883,OT
6,100065479,10006547,HO
7,100065479,10006547,OT
8,1000808588,10008085,HO
9,1000808588,10008085,OT
13,1001355853,10013558,LT
14,1001355853,10013558,HO
15,1001355853,10013558,OT


In [34]:
num_cases = 394066
num_cases_with_outcomes = outcome.primaryid.nunique()

print("Proportion of cases with non-missing outcomes: ", num_cases_with_outcomes / num_cases)

Proportion of cases with non-missing outcomes:  0.5834834773870367


There are no missing values in this field in the OUTCOME file. However, this file does not contain all cases. About 58% of cases have at least one outcome listed in this file, so we can infer that the remaining 42% had no serious medical event outcome reported at the time the dataset was created.  

The category frequency counts and percentages are consistent with the FDA numbers.

##### Summary for OUTCOME ASCII file   

Cases without `outc_cod` values are not recorded in this file. 

* **Data quality issues found:**  
  - none
  
* **Data cleaning steps to do:**  
  - none, just be aware when merging that this file doesn't contain all cases!
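
To illustrate the merge caveat, here is a sketch that turns the line-listed outcome codes into one indicator column per code and left-joins them onto a case table. `cases_toy` and `outcome_toy` are hypothetical stand-ins for the DEMO and OUTCOME dataframes, and the `outc_` prefix is an arbitrary naming choice.

```python
import pandas as pd

# Toy stand-ins: three cases, of which case "2" has no outcome rows.
cases_toy = pd.DataFrame({"primaryid": ["1", "2", "3"]})
outcome_toy = pd.DataFrame({
    "primaryid": ["1", "1", "3"],
    "outc_cod": ["HO", "OT", "DE"],
})

# One row per case, one 0/1 column per outcome code.
flags = (
    pd.crosstab(outcome_toy["primaryid"], outcome_toy["outc_cod"])
    .clip(upper=1)
    .add_prefix("outc_")
    .reset_index()
)
outc_cols = [c for c in flags.columns if c != "primaryid"]

# A left join keeps cases that are absent from the OUTCOME file.
merged = cases_toy.merge(flags, how="left", on="primaryid")
merged[outc_cols] = merged[outc_cols].fillna(0).astype(int)
print(merged)
```

An inner join would silently drop the cases without outcome rows, which is why the left join and the explicit fill are used here.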

#### 4.2.5 REPORT SOURCE file

In [35]:
!ls data/raw/ascii_2018q4/ascii/*.txt

data/raw/ascii_2018q4/ascii/DEMO18Q4.txt
data/raw/ascii_2018q4/ascii/DRUG18Q4.txt
data/raw/ascii_2018q4/ascii/INDI18Q4.txt
data/raw/ascii_2018q4/ascii/OUTC18Q4.txt
data/raw/ascii_2018q4/ascii/REAC18Q4.txt
data/raw/ascii_2018q4/ascii/RPSR18Q4.txt
data/raw/ascii_2018q4/ascii/THER18Q4.txt


In [36]:
!head data/raw/ascii_2018q4/ascii/RPSR18Q4.txt

primaryid$caseid$rpsr_cod
151290961$15129096$CSM
151290971$15129097$CSM
151290981$15129098$HP
151296241$15129624$CSM
151296331$15129633$CSM
151296801$15129680$HP
151310961$15131096$HP
151310961$15131096$FGN
151311001$15131100$CSM


In [37]:
ascii_file_rsource = raw_ascii_path + "RPSR18Q4.txt"

In [58]:
rsource = pd.read_csv(ascii_file_rsource, sep='$', dtype="object")

In [60]:
rsource.columns

Index(['primaryid', 'caseid', 'rpsr_cod'], dtype='object')

In [40]:
rsource.head()

Unnamed: 0,primaryid,caseid,rpsr_cod
0,151290961,15129096,CSM
1,151290971,15129097,CSM
2,151290981,15129098,HP
3,151296241,15129624,CSM
4,151296331,15129633,CSM


In [41]:
rsource.describe(include='all')

Unnamed: 0,primaryid,caseid,rpsr_cod
count,21075,21075,21075
unique,20914,20914,7
top,156786771,15678677,HP
freq,3,3,17150


In [42]:
rsource.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21075 entries, 0 to 21074
Data columns (total 3 columns):
primaryid    21075 non-null object
caseid       21075 non-null object
rpsr_cod     21075 non-null object
dtypes: object(3)
memory usage: 494.0+ KB


##### Unique identifiers

Rows in the REPORT SOURCE file are unique by `primaryid` + `rpsr_cod`, i.e. by the full set of columns.  
The REPORT SOURCE file has a many-to-one relationship with the DEMO file, matching on `primaryid`.  
The REPORT SOURCE file also has the `caseid` field. Both the `primaryid` and `caseid` fields are defined here the same way as in the DEMO file and in the rest of the datasets.

In [43]:
len(rsource.drop_duplicates(subset=["primaryid", "rpsr_cod"]) )

21075

In [44]:
rsource.primaryid.describe()

count         21075
unique        20914
top       156786771
freq              3
Name: primaryid, dtype: object

In [46]:
show_value_counts(rsource.primaryid, 10)

Unnamed: 0,primaryid,count,proportion
0,156786771,3,0.0
1,156329241,3,0.0
2,155984091,2,0.0
3,156604421,2,0.0
4,154785411,2,0.0
5,156013771,2,0.0
6,155362351,2,0.0
7,156261291,2,0.0
8,154927681,2,0.0
9,156439601,2,0.0


In [47]:
rsource[rsource.primaryid == "156786771"]

Unnamed: 0,primaryid,caseid,rpsr_cod
12800,156786771,15678677,HP
12801,156786771,15678677,UF
12802,156786771,15678677,DT


The number of unique values of `primaryid` is less than the number of records in the DEMO file.

In [48]:
rsource.caseid.describe()

count        21075
unique       20914
top       15678677
freq             3
Name: caseid, dtype: object

Same as above. 

##### rpsr_cod  

From documentation:
 
> Code for the source of the report (See table below)  
> 
> | CODE        | MEANING_TEXT
  | ----        | ------------
  | FGN         | Foreign
  | SDY         | Study
  | LIT         | Literature
  | CSM         | Consumer
  | HP          | Health Professional
  | UF          | User Facility
  | CR          | Company Representative
  | DT          | Distributor
  | OTH         | Other  
>
> NOTE: The source from the latest version of a case is provided. If there is more than one source, the codes will be line listed.

In [49]:
rsource.rpsr_cod.describe()

count     21075
unique        7
top          HP
freq      17150
Name: rpsr_cod, dtype: object

In [50]:
show_value_counts(rsource.rpsr_cod)

Unnamed: 0,rpsr_cod,count,proportion
0,HP,17150,0.81
1,CSM,3749,0.18
2,FGN,153,0.01
3,UF,13,0.0
4,DT,4,0.0
5,SDY,3,0.0
6,CR,3,0.0


In [51]:
dups = rsource[rsource.duplicated(subset=["primaryid"], keep=False)]

In [52]:
len(dups)

320

In [53]:
dups.head(20)

Unnamed: 0,primaryid,caseid,rpsr_cod
6,151310961,15131096,HP
7,151310961,15131096,FGN
8,151311001,15131100,CSM
9,151311001,15131100,FGN
16,151911432,15191143,HP
17,151911432,15191143,FGN
18,152382372,15238237,CSM
19,152382372,15238237,FGN
20,152402751,15240275,FGN
21,152402751,15240275,CSM


In [54]:
num_cases = 394066
num_cases_with_sources = rsource.primaryid.nunique()

print("Proportion of cases with non-missing report sources: ", num_cases_with_sources / num_cases)

Proportion of cases with non-missing report sources:  0.05307232798566738


There are no missing values in this field in the REPORT SOURCE file. However, this file does not contain all cases. About 5% of cases have at least one report source listed in this file, so we can infer that the remaining 95% of cases do not have a report source.  
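
The coverage figure can also be computed from the case ids directly rather than from a hard-coded case count. A minimal sketch, where `case_coverage` is a hypothetical helper and the id series are toy data:

```python
import pandas as pd

def case_coverage(child_ids: pd.Series, case_ids: pd.Series) -> float:
    """Share of distinct case ids that appear at least once in a child file."""
    unique_cases = pd.Series(case_ids.unique())
    return float(unique_cases.isin(child_ids).mean())

# Toy ids: four cases, two of which have at least one report source row.
demo_ids = pd.Series(["1", "2", "3", "4"])
rsource_ids = pd.Series(["2", "2", "4"])

coverage = case_coverage(rsource_ids, demo_ids)
print("Proportion of cases with a report source:", coverage)
```

Unlike dividing two unique counts, this version would also expose ids that occur in the child file but not in the case table, since those ids do not raise the proportion.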

The category frequency counts and percentages are consistent with the FDA numbers.

##### Summary for REPORT SOURCE ASCII file   

Cases without `rpsr_cod` values are not recorded in this file. 

* **Data quality issues found:**  
  - none
  
* **Data cleaning steps to do:**  
  - none, just be aware when merging that this file doesn't contain all cases!

#### 4.2.6 THERAPY file

In [55]:
!ls data/raw/ascii_2018q4/ascii/*.txt

data/raw/ascii_2018q4/ascii/DEMO18Q4.txt
data/raw/ascii_2018q4/ascii/DRUG18Q4.txt
data/raw/ascii_2018q4/ascii/INDI18Q4.txt
data/raw/ascii_2018q4/ascii/OUTC18Q4.txt
data/raw/ascii_2018q4/ascii/REAC18Q4.txt
data/raw/ascii_2018q4/ascii/RPSR18Q4.txt
data/raw/ascii_2018q4/ascii/THER18Q4.txt


In [56]:
!head data/raw/ascii_2018q4/ascii/THER18Q4.txt

primaryid$caseid$dsg_drug_seq$start_dt$end_dt$dur$dur_cod
100035916$10003591$1$20130718$$$
100050413$10005041$1$2014$2014$$
1000551312$10005513$1$201201$20140211$$
1000551312$10005513$4$20120112$2012$$
1000551312$10005513$5$20120215$$$
1000551312$10005513$6$2010$$$
1000551312$10005513$7$2010$$$
1000551312$10005513$8$2010$$$
1000551312$10005513$9$2011$$$


In [57]:
ascii_file_therapy = raw_ascii_path + "THER18Q4.txt"

In [64]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'dsg_drug_seq': np.int32,  
    'start_dt': 'object',  
    'end_dt': 'object', 
    'dur': np.float64, 
    'dur_cod': 'object'
}

In [65]:
therapy = pd.read_csv(ascii_file_therapy, sep='$', dtype=datatypes)

In [66]:
therapy.columns

Index(['primaryid', 'caseid', 'dsg_drug_seq', 'start_dt', 'end_dt', 'dur',
       'dur_cod'],
      dtype='object')

In [67]:
therapy.head()

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
0,100035916,10003591,1,20130718,,,
1,100050413,10005041,1,2014,2014.0,,
2,1000551312,10005513,1,201201,20140211.0,,
3,1000551312,10005513,4,20120112,2012.0,,
4,1000551312,10005513,5,20120215,,,


In [68]:
therapy.describe(include='all')

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
count,620308.0,620308.0,620308.0,590770.0,280743.0,969.0,969
unique,228087.0,228087.0,,7347.0,5850.0,,6
top,146088398.0,14608839.0,,2018.0,2018.0,,MON
freq,282.0,282.0,,13682.0,9288.0,,343
mean,,,7.9,,,11.7,
std,,,15.29,,,98.35,
min,,,1.0,,,1.0,
25%,,,1.0,,,2.0,
50%,,,3.0,,,3.0,
75%,,,8.0,,,7.0,


In [69]:
therapy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620308 entries, 0 to 620307
Data columns (total 7 columns):
primaryid       620308 non-null object
caseid          620308 non-null object
dsg_drug_seq    620308 non-null int32
start_dt        590770 non-null object
end_dt          280743 non-null object
dur             969 non-null float64
dur_cod         969 non-null object
dtypes: float64(1), int32(1), object(5)
memory usage: 30.8+ MB


##### Unique identifiers

The datarows in the THERAPY file are unique by `primaryid` + `dsg_drug_seq` + `start_dt` + `end_dt` + `dur` (and any one or two of `start_dt`, `end_dt` or `dur` can be missing).  
The THERAPY file has a many-to-one relationship with the DRUG file, matching on `primaryid` and `dsg_drug_seq`/`drug_seq`.  

In [71]:
len(therapy.drop_duplicates(subset=["primaryid", "dsg_drug_seq"]) )

620282

In [75]:
len(therapy.drop_duplicates(subset=["primaryid", "dsg_drug_seq", "start_dt"]) )

620308

In [72]:
dups = therapy[therapy.duplicated(subset=["primaryid", "dsg_drug_seq"], keep=False)]

In [73]:
len(dups)

38

In [74]:
dups.head(20)

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
268018,154507832,15450783,1,20170406.0,20180412.0,,
268019,154507832,15450783,1,20180412.0,20180419.0,,
268020,154507832,15450783,1,20180419.0,20180426.0,,
268021,154507832,15450783,1,20180426.0,,,
276968,154591941,15459194,1,20180321.0,,,
276969,154591941,15459194,1,20180404.0,,,
288050,154690241,15469024,1,20171201.0,,,
288051,154690241,15469024,1,20180904.0,,,
309197,154876221,15487622,1,20180706.0,20180712.0,,
309198,154876221,15487622,1,20180713.0,201808.0,,


In [76]:
therapy.primaryid.describe()

count        620308
unique       228087
top       146088398
freq            282
Name: primaryid, dtype: object

In [77]:
show_value_counts(therapy.primaryid, 10)

Unnamed: 0,primaryid,count,proportion
0,146088398,282,0.0
1,153490012,253,0.0
2,148011912,250,0.0
3,156811382,250,0.0
4,153486263,224,0.0
5,156196031,198,0.0
6,156232931,198,0.0
7,153535922,193,0.0
8,1321263815,190,0.0
9,150832072,185,0.0


In [85]:
therapy[therapy.primaryid == "146088398"][:10]

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
106004,146088398,14608839,1,2008,,,
106005,146088398,14608839,2,2009,,,
106006,146088398,14608839,3,2010,,,
106007,146088398,14608839,4,2011,,,
106008,146088398,14608839,5,2012,,,
106009,146088398,14608839,6,2014,,,
106010,146088398,14608839,7,2015,,,
106011,146088398,14608839,8,200601,2015.0,,
106012,146088398,14608839,9,200611,201209.0,,
106013,146088398,14608839,10,201410,201505.0,,


The number of unique values of `primaryid` is less than the number of records in the DEMO file.

In [79]:
therapy.caseid.describe()

count       620308
unique      228087
top       14608839
freq           282
Name: caseid, dtype: object

Same as above. 

##### dsg_drug_seq   

This field is meant to match the `drug_seq` field in the DRUG file, per `primaryid`.  

From documentation:
 
> Drug sequence number for identifying a drug for a Case. To link to the DRUGyyQq.TXT data file, both the Case number (primary key) and the DRUG_SEQ number (secondary key) are needed.  

In [81]:
therapy.dsg_drug_seq.describe()

count   620308.00
mean         7.90
std         15.29
min          1.00
25%          1.00
50%          3.00
75%          8.00
max        310.00
Name: dsg_drug_seq, dtype: float64

In [82]:
show_value_counts(therapy.dsg_drug_seq)

Unnamed: 0,dsg_drug_seq,count,proportion
0,1,215363,0.35
1,2,86547,0.14
2,3,52000,0.08
3,4,35725,0.06
4,5,27606,0.04
5,6,22115,0.04
6,7,18289,0.03
7,8,15383,0.02
8,9,13438,0.02
9,10,11425,0.02


In [84]:
num_cases = 394066
num_cases_with_therapy_dates = therapy.primaryid.nunique()

print("Proportion of cases with non-missing therapy dates: ", num_cases_with_therapy_dates / num_cases)

Proportion of cases with non-missing therapy dates:  0.5788040582034482


There are no missing values in this field in the THERAPY file. However, this file does not contain all cases. About 58% of cases have at least one therapy record in this file, so we can infer that the remaining 42% of cases do not have a therapy date or duration recorded.  

##### start_dt  
From documentation:  
> Date the therapy was started (or re-started) for this drug (YYYYMMDD) – If a complete date not available, a partial date is provided.

In [86]:
therapy.start_dt.describe()

count     590770
unique      7347
top         2018
freq       13682
Name: start_dt, dtype: object

In [89]:
show_value_counts(therapy.start_dt, 10)

Unnamed: 0,start_dt,count,proportion
0,,29538,0.05
1,2018.0,13682,0.02
2,2017.0,7111,0.01
3,2016.0,6873,0.01
4,2015.0,6129,0.01
5,2014.0,4974,0.01
6,2013.0,4756,0.01
7,201808.0,4539,0.01
8,201809.0,4515,0.01
9,2012.0,4500,0.01


In [90]:
therapy[therapy.start_dt.isna()][:10]

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
29,1000808588,10008085,12,,201609,,
79,1002006715,10020067,1,,20170501,,
88,100268942,10026894,2,,2013,,
113,100324825,10032482,5,,201402,,
122,1003280412,10032804,7,,201210,,
127,100356166,10035616,2,,20140321,,
128,100356166,10035616,3,,20140321,,
129,100356166,10035616,4,,20140321,,
130,100356166,10035616,5,,20140321,,
131,100356166,10035616,6,,20140321,,


In [95]:
therapy[therapy.start_dt.isna() & therapy.end_dt.isna()][:10]

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
276548,154586971,15458697,1,,,10.0,DAY
276563,154587101,15458710,1,,,25.0,DAY
276718,154588861,15458886,1,,,5.0,YR
276729,154589001,15458900,1,,,2.0,DAY
276915,154591061,15459106,1,,,3.0,DAY
277392,154596821,15459682,1,,,3.0,MON
277811,154601561,15460156,1,,,4.0,MON
277934,154602681,15460268,1,,,2.0,YR
278199,154604641,15460464,1,,,2.5,YR
280724,154628021,15462802,1,,,1.0,YR


In [94]:
len(therapy[therapy.start_dt.isna() & therapy.end_dt.isna()])

896

In [96]:
# check the datalines in the raw input file
# look at line 276548 and surrounding lines
# looking for primaryid=154586971
!sed '276544,276558!d' data/raw/ascii_2018q4/ascii/THER18Q4.txt

154586912$15458691$1$20180726$$$
154586921$15458692$1$20180828$20180828$$
154586931$15458693$1$20171223$$$
154586931$15458693$2$20170110$$$
154586941$15458694$1$20180413$$$
154586961$15458696$1$20180927$$$
154586971$15458697$1$$$10$DAY
154586981$15458698$1$20180522$$$
154587021$15458702$1$20180628$$$
154587041$15458704$1$20180724$$$
154587041$15458704$2$20180724$$$
154587051$15458705$1$20160216$$$
154587061$15458706$1$20180901$$$
154587071$15458707$1$20180825$20180825$$
154587071$15458707$2$20180825$20180825$$


5% of the values are missing in the dataset. The missing values count is consistent with the FDA number.
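
Because the dates can be partial (`YYYY`, `YYYYMM` or `YYYYMMDD`), a later parsing step will need to keep track of the precision. A sketch of one way to do it, assuming the values are strings as loaded with `dtype='object'`; `parse_partial_date` is a hypothetical helper, and filling the missing month or day with the first of the period is an assumption, not something the documentation prescribes.

```python
import pandas as pd

# Supported lengths mapped to a strptime format and a precision label.
FORMATS = {4: ("%Y", "year"), 6: ("%Y%m", "month"), 8: ("%Y%m%d", "day")}

def parse_partial_date(value):
    """Return (timestamp, precision) for YYYY, YYYYMM or YYYYMMDD strings."""
    if pd.isna(value):
        return pd.NaT, None
    text = str(value).strip()
    if len(text) not in FORMATS:
        return pd.NaT, None
    fmt, precision = FORMATS[len(text)]
    parsed = pd.to_datetime(text, format=fmt, errors="coerce")
    return parsed, (precision if pd.notna(parsed) else None)

# Values taken from the head of the THERAPY file, plus a missing one.
start_toy = pd.Series(["20130718", "2014", "201201", None])
parsed = [parse_partial_date(v) for v in start_toy]
for original, (timestamp, precision) in zip(start_toy, parsed):
    print(original, timestamp, precision)
```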

##### end_dt  

From documentation:  
> A date therapy was stopped for this drug. (YYYYMMDD) – If a complete date not available, a partial date will be provided.

In [97]:
therapy.end_dt.describe()

count     280743
unique      5850
top         2018
freq        9288
Name: end_dt, dtype: object

In [98]:
show_value_counts(therapy.end_dt, 10)

Unnamed: 0,end_dt,count,proportion
0,,339565,0.55
1,2018.0,9288,0.01
2,2016.0,5252,0.01
3,2017.0,3784,0.01
4,2015.0,2695,0.0
5,201809.0,2405,0.0
6,201810.0,2317,0.0
7,201808.0,1916,0.0
8,2014.0,1825,0.0
9,201807.0,1724,0.0


55% of values are missing in the dataset. The dataset missing values count is consistent with the FDA number.

##### dur  
From documentation:  
> Numeric value of the duration (length) of therapy

In [99]:
therapy.dur.describe()

count    969.00
mean      11.70
std       98.35
min        1.00
25%        2.00
50%        3.00
75%        7.00
max     2018.00
Name: dur, dtype: float64

In [100]:
show_value_counts(therapy.dur, 10)

Unnamed: 0,dur,count,proportion
0,,619339,1.0
1,1.0,191,0.0
2,2.0,150,0.0
3,3.0,135,0.0
4,5.0,82,0.0
5,4.0,64,0.0
6,6.0,60,0.0
7,10.0,39,0.0
8,7.0,35,0.0
9,8.0,32,0.0


Almost 100% of the values are missing. The missing values count is consistent with the FDA number.

##### dur_cod  
From documentation:  
> Unit abbreviation for duration of therapy (see table below)  
>
> | CODE      | MEANING TEXT
  | ----      | ------------
  | YR        | Years
  | MON       | Months
  | WK        | Weeks
  | DAY       | Days
  | HR        | Hours
  | MIN       | Minutes
  | SEC       | Seconds

In [101]:
therapy.dur_cod.describe()

count     969
unique      6
top       MON
freq      343
Name: dur_cod, dtype: object

In [102]:
show_value_counts(therapy.dur_cod)

Unnamed: 0,dur_cod,count,proportion
0,,619339,1.0
1,MON,343,0.0
2,YR,267,0.0
3,DAY,216,0.0
4,WK,97,0.0
5,HR,28,0.0
6,MIN,18,0.0


In [109]:
therapy[therapy.start_dt.notna() & therapy.end_dt.notna() & therapy.dur_cod.notna()]

Unnamed: 0,primaryid,caseid,dsg_drug_seq,start_dt,end_dt,dur,dur_cod
281043,154632151,15463215,1,20180504,20180510,7.0,DAY
281676,154638251,15463825,1,20171208,20171217,10.0,DAY
305082,154839811,15483981,1,20180517,20181003,3.0,WK
305083,154839811,15483981,2,20180517,20181003,3.0,WK
308827,154872351,15487235,1,20180810,20181003,4.0,WK
321701,154984081,15498408,1,20170906,20180926,2.0,YR
332677,155090361,15509036,1,20180116,20180216,14.0,DAY
337225,155175041,15517504,1,20181001,20181001,2.0,HR
340378,155219971,15521997,1,20180826,20181005,7.0,WK
380169,155600401,15560040,1,20170403,20170408,6.0,DAY


In [110]:
len(therapy[therapy.start_dt.notna() & therapy.end_dt.notna() & therapy.dur_cod.notna()])

27

Almost 100% of the values are missing. The missing values count is consistent with the FDA number.  

When all three of the values for `start_dt`, `end_dt` and `dur` are present (27 records), the therapy duration is often shorter than the time difference between therapy start and end dates. I could not find anything in the documentation to explain this difference.
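
The size of the mismatch can be quantified by converting `dur` to days and comparing it with the span between the two dates. The sketch below uses two of the rows shown above; the day counts per month and year and the inclusive day counting are assumptions made for illustration.

```python
import pandas as pd

# Approximate days per duration unit (month and year lengths are assumptions).
DAYS_PER_UNIT = {"YR": 365.25, "MON": 30.44, "WK": 7, "DAY": 1,
                 "HR": 1 / 24, "MIN": 1 / 1440, "SEC": 1 / 86400}

# Two of the 27 records where start_dt, end_dt and dur are all present.
therapy_toy = pd.DataFrame({
    "start_dt": ["20180504", "20180517"],
    "end_dt": ["20180510", "20181003"],
    "dur": [7.0, 3.0],
    "dur_cod": ["DAY", "WK"],
})

start = pd.to_datetime(therapy_toy["start_dt"], format="%Y%m%d", errors="coerce")
end = pd.to_datetime(therapy_toy["end_dt"], format="%Y%m%d", errors="coerce")

# Span in days, counting both the start and the end day (assumption).
therapy_toy["span_days"] = (end - start).dt.days + 1
therapy_toy["dur_days"] = therapy_toy["dur"] * therapy_toy["dur_cod"].map(DAYS_PER_UNIT)
therapy_toy["gap_days"] = therapy_toy["span_days"] - therapy_toy["dur_days"]
print(therapy_toy)
```

Under these assumptions the first record is consistent (7 days either way), while the second reports 3 weeks for a span of 140 days.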

##### Summary for THERAPY ASCII file   

Cases without therapy date values are not recorded in this file.  
The records in this file are unique by `primaryid`, `dsg_drug_seq` and some non-missing combination of `start_dt`, `end_dt` and `dur`. Any of those last three fields can have a missing value, as long as one of them is present.  
(In this specific file, the datarows are unique by `primaryid`, `dsg_drug_seq` and `start_dt`, but in general that might not be the case, since `start_dt` can be missing.)
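
A minimal sketch of how this kind of uniqueness claim could be checked, using a toy frame rather than the real THERAPY data. The helper `is_unique_by` and the toy values are hypothetical. Note that `DataFrame.duplicated` treats missing values in the key columns as equal to each other.

```python
import pandas as pd

def is_unique_by(df: pd.DataFrame, keys: list) -> bool:
    """Return True if no two rows share the same values in the key columns."""
    return not df.duplicated(subset=keys).any()

# Toy stand-in for the THERAPY table (illustrative values only).
therapy_toy = pd.DataFrame({
    "primaryid": ["1", "1", "2"],
    "dsg_drug_seq": [1, 1, 1],
    "start_dt": ["20180101", None, "20180101"],
    "end_dt": [None, "20180201", None],
    "dur": [None, None, 5.0],
})

# Not unique by case and drug sequence alone ...
unique_by_drug = is_unique_by(therapy_toy, ["primaryid", "dsg_drug_seq"])
# ... but unique once the date and duration fields are added to the key.
unique_by_full_key = is_unique_by(
    therapy_toy, ["primaryid", "dsg_drug_seq", "start_dt", "end_dt", "dur"]
)
print(unique_by_drug, unique_by_full_key)
```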

* **Data quality issues found:**  
  - In the 27 records where `start_dt`, `end_dt` and `dur` are present, the duration fields are often not consistent with the therapy start and end dates, and it is unclear why.
  
* **Data cleaning steps to do:**  
  - Be aware when merging that this file doesn't contain all cases.  
  - Maybe look further into the discrepancies between therapy duration fields and therapy start and end date fields.
  

#### 4.2.7 INDICATIONS file

In [111]:
!ls data/raw/ascii_2018q4/ascii/*.txt

data/raw/ascii_2018q4/ascii/DEMO18Q4.txt
data/raw/ascii_2018q4/ascii/DRUG18Q4.txt
data/raw/ascii_2018q4/ascii/INDI18Q4.txt
data/raw/ascii_2018q4/ascii/OUTC18Q4.txt
data/raw/ascii_2018q4/ascii/REAC18Q4.txt
data/raw/ascii_2018q4/ascii/RPSR18Q4.txt
data/raw/ascii_2018q4/ascii/THER18Q4.txt


In [112]:
!head data/raw/ascii_2018q4/ascii/INDI18Q4.txt

primaryid$caseid$indi_drug_seq$indi_pt
100035916$10003591$1$Multiple sclerosis
100050413$10005041$1$Post coital contraception
1000551312$10005513$1$Rheumatoid arthritis
1000551312$10005513$6$Rheumatoid arthritis
1000551312$10005513$7$Rheumatoid arthritis
1000551312$10005513$8$Rheumatoid arthritis
1000551312$10005513$9$Cardiovascular disorder
1000551312$10005513$10$Heart rate increased
100058832$10005883$1$Product used for unknown indication


In [113]:
ascii_file_indications = raw_ascii_path + "INDI18Q4.txt"

In [116]:
datatypes = {
    'primaryid': 'object', 
    'caseid': 'object', 
    'indi_drug_seq': np.int32,  
    'indi_pt': 'object'
}

In [117]:
indications = pd.read_csv(ascii_file_indications, sep='$', dtype=datatypes)

In [118]:
indications.columns

Index(['primaryid', 'caseid', 'indi_drug_seq', 'indi_pt'], dtype='object')

In [119]:
indications.head()

Unnamed: 0,primaryid,caseid,indi_drug_seq,indi_pt
0,100035916,10003591,1,Multiple sclerosis
1,100050413,10005041,1,Post coital contraception
2,1000551312,10005513,1,Rheumatoid arthritis
3,1000551312,10005513,6,Rheumatoid arthritis
4,1000551312,10005513,7,Rheumatoid arthritis


In [120]:
indications.describe(include='all')

Unnamed: 0,primaryid,caseid,indi_drug_seq,indi_pt
count,1064664.0,1064664.0,1064664.0,1064664
unique,352491.0,352491.0,,6362
top,148011912.0,14801191.0,,Product used for unknown indication
freq,249.0,249.0,,405708
mean,,,6.33,
std,,,10.29,
min,,,1.0,
25%,,,1.0,
50%,,,3.0,
75%,,,8.0,


In [121]:
indications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1064664 entries, 0 to 1064663
Data columns (total 4 columns):
primaryid        1064664 non-null object
caseid           1064664 non-null object
indi_drug_seq    1064664 non-null int32
indi_pt          1064664 non-null object
dtypes: int32(1), object(3)
memory usage: 28.4+ MB


##### Unique identifiers

The datarows in the INDICATIONS file are unique by `primaryid` + `indi_drug_seq` + `indi_pt`, i.e. the entire column set.  
The INDICATIONS file has a many-to-one relationship with the DRUG file, matching on `primaryid` + `indi_drug_seq`/`drug_seq`.  
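
The many-to-one relationship could be enforced at merge time, as in the sketch below. The frames `drug_toy` and `indications_toy` are hypothetical toy tables with illustrative values, not the real data; the `validate` argument of `DataFrame.merge` raises `pandas.errors.MergeError` if the keys on the DRUG side turn out not to be unique.

```python
import pandas as pd

# Toy stand-ins for the DRUG and INDICATIONS tables (illustrative values only).
drug_toy = pd.DataFrame({
    "primaryid": ["100", "100", "200"],
    "drug_seq": [1, 2, 1],
    "drugname": ["DRUG A", "DRUG B", "DRUG C"],
})
indications_toy = pd.DataFrame({
    "primaryid": ["100", "100", "200"],
    "indi_drug_seq": [1, 1, 1],
    "indi_pt": ["Pain", "Sciatica", "Psoriasis"],
})

# Several indication rows may point at the same drug row, never the reverse.
merged = indications_toy.merge(
    drug_toy,
    how="left",
    left_on=["primaryid", "indi_drug_seq"],
    right_on=["primaryid", "drug_seq"],
    validate="many_to_one",
)
print(merged[["primaryid", "indi_drug_seq", "indi_pt", "drugname"]])
```

Because the merge starts from the indications side, drugs without any indication (here "DRUG B") do not appear in the result.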

In [123]:
len(indications.drop_duplicates(subset=["primaryid", "indi_drug_seq", "indi_pt"]) )

1064664

In [125]:
indications.primaryid.describe()

count       1064664
unique       352491
top       148011912
freq            249
Name: primaryid, dtype: object

In [126]:
show_value_counts(indications.primaryid, 10)

Unnamed: 0,primaryid,count,proportion
0,148011912,249,0.0
1,148178252,184,0.0
2,653031217,170,0.0
3,148244282,147,0.0
4,149298723,128,0.0
5,146756124,128,0.0
6,147032992,128,0.0
7,148017782,122,0.0
8,146653933,119,0.0
9,768679319,117,0.0


In [127]:
indications[indications.primaryid == "148011912"]

Unnamed: 0,primaryid,caseid,indi_drug_seq,indi_pt
181569,148011912,14801191,1,Gastrooesophageal reflux disease
181570,148011912,14801191,2,Gastrooesophageal reflux disease
181571,148011912,14801191,3,Gastrooesophageal reflux disease
181572,148011912,14801191,4,Gastrooesophageal reflux disease
181573,148011912,14801191,5,Gastrooesophageal reflux disease
181574,148011912,14801191,6,Gastrooesophageal reflux disease
181575,148011912,14801191,7,Gastrooesophageal reflux disease
181576,148011912,14801191,8,Gastrooesophageal reflux disease
181577,148011912,14801191,9,Gastrooesophageal reflux disease
181578,148011912,14801191,10,Gastrooesophageal reflux disease


The number of unique values of `primaryid` (352,491) is less than the number of records in the DEMO file (394,066), so not every case appears in the INDICATIONS file.

In [128]:
indications.caseid.describe()

count      1064664
unique      352491
top       14801191
freq           249
Name: caseid, dtype: object

As with `primaryid`, there are 352,491 unique `caseid` values, fewer than the number of records in the DEMO file.

##### indi_drug_seq  

From documentation:
 
> Drug sequence number for identifying a drug for a Case. To link to the DRUGyyQq.TXT data file, both the Case number (primary key) and the DRUG_SEQ number (secondary key) are needed.

In [130]:
indications.indi_drug_seq.describe()

count   1064664.00
mean          6.33
std          10.29
min           1.00
25%           1.00
50%           3.00
75%           8.00
max         310.00
Name: indi_drug_seq, dtype: float64

In [133]:
show_value_counts(indications.indi_drug_seq, 10)

Unnamed: 0,indi_drug_seq,count,proportion
0,1,348149,0.33
1,2,129824,0.12
2,3,93244,0.09
3,4,73286,0.07
4,5,60239,0.06
5,6,49992,0.05
6,7,42179,0.04
7,8,35427,0.03
8,9,30024,0.03
9,10,25549,0.02


In [135]:
dups = indications[indications.duplicated(subset=["primaryid", "indi_drug_seq"], keep=False)]

In [136]:
len(dups)

2114

In [137]:
dups.head(20)

Unnamed: 0,primaryid,caseid,indi_drug_seq,indi_pt
254348,151826491,15182649,1,Breakthrough pain
254349,151826491,15182649,1,Pain
254350,151826491,15182649,1,Sciatica
254351,151826491,15182649,1,Spondylolisthesis
271915,152448301,15244830,1,Pain
271916,152448301,15244830,1,Psoriasis
337641,153911192,15391119,1,Paranasal sinus discomfort
337642,153911192,15391119,1,Pyrexia
337643,153911192,15391119,1,Sneezing
391317,154583571,15458357,1,Prophylaxis


In [138]:
num_cases = 394066  # total number of cases (records in the DEMO file), hard-coded
num_cases_with_indications = indications.primaryid.nunique()

print("Proportion of cases with non-missing indications: ", num_cases_with_indications / num_cases)

Proportion of cases with non-missing indications:  0.8944973684611207


In [146]:
num_drugs = 1546835  # number of reported drugs in the DRUG file, hard-coded
num_drugs_with_indications = len(indications.drop_duplicates(subset=["primaryid", "indi_drug_seq"]) )

print("Number of drugs with non-missing indications: ", num_drugs_with_indications)
print("Proportion of drugs with non-missing indications: ", num_drugs_with_indications / num_drugs)
print("Number of drugs with missing indications: ", num_drugs - num_drugs_with_indications)

Number of drugs with non-missing indications:  1063528
Proportion of drugs with non-missing indications:  0.6875510316226359
Number of drugs with missing indications:  483307


There are no missing values in this field in the INDICATIONS file. However, this file does not contain all cases. About 89% of cases have at least one drug indication listed in this file, so we can infer that about 11% of cases have no drug indication reported.  

Comparing the number of reported drugs in the DRUG file with the number of distinct drugs in the INDICATIONS file, 483,307 reported drugs, or about 31% of all drugs, have no indication recorded.

This count of drugs with missing indications is not consistent with the FDA number in the accompanying documentation. That documentation states a total drug count that differs from the one in both the DRUG documentation and the DRUG data file, and its missing count appears to be the difference between that drug count and the number of records in the INDICATIONS file. Such a calculation fails to account for multiple `indi_pt` entries per drug in the INDICATIONS file.
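
The effect of multiple `indi_pt` entries per drug can be illustrated on toy data. The sketch below uses hypothetical frames (`drug_toy`, `indications_toy`) and contrasts a row-count subtraction with a count based on distinct `primaryid` + drug sequence pairs; it is not a reconstruction of how the FDA number was actually produced.

```python
import pandas as pd

# Toy stand-ins: four reported drugs, one of which has two indication rows.
drug_toy = pd.DataFrame({
    "primaryid": ["100", "100", "200", "300"],
    "drug_seq": [1, 2, 1, 1],
})
indications_toy = pd.DataFrame({
    "primaryid": ["100", "100", "200"],
    "indi_drug_seq": [1, 1, 1],
    "indi_pt": ["Pain", "Sciatica", "Psoriasis"],
})

# Row-count subtraction: treats every indication row as a separate drug.
naive_missing = len(drug_toy) - len(indications_toy)

# Count distinct drugs instead, then flag DRUG rows that have no match.
pairs = indications_toy[["primaryid", "indi_drug_seq"]].drop_duplicates()
flagged = drug_toy.merge(
    pairs,
    how="left",
    left_on=["primaryid", "drug_seq"],
    right_on=["primaryid", "indi_drug_seq"],
    indicator=True,
)
missing = int((flagged["_merge"] == "left_only").sum())

print("Row-count subtraction:", naive_missing)
print("Drugs without any indication row:", missing)
```

In the toy data the row-count subtraction undercounts the drugs without an indication, because the two indication rows for the first drug are counted as two drugs.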

##### indi_pt  
From documentation:  
> "Preferred Term"-level medical terminology describing the Indication for use, using the Medical Dictionary for Regulatory Activities (MedDRA).

In [140]:
indications.indi_pt.describe()

count                                 1064664
unique                                   6362
top       Product used for unknown indication
freq                                   405708
Name: indi_pt, dtype: object

In [141]:
show_value_counts(indications.indi_pt, 20)

Unnamed: 0,indi_pt,count,proportion
0,Product used for unknown indication,405708,0.38
1,Rheumatoid arthritis,37499,0.04
2,Hypertension,24076,0.02
3,Plasma cell myeloma,15838,0.01
4,Multiple sclerosis,13999,0.01
5,Type 2 diabetes mellitus,12609,0.01
6,Gastrooesophageal reflux disease,12332,0.01
7,Pain,12247,0.01
8,Psoriasis,11118,0.01
9,Prophylaxis,10746,0.01


In [142]:
show_na(indications.indi_pt)

Unnamed: 0,na_count,na_proportion
0,0,0.0


There are no actual missing values in this field, but the placeholder value "Product used for unknown indication" is effectively equivalent to a missing value, and it accounts for 38% of the records in this file.
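
If the placeholder is to be treated as missing during cleaning, one option is to mask it before counting, as sketched below. The constant `UNKNOWN_INDICATION` and the helper `mask_unknown_indication` are hypothetical names, and the toy series stands in for `indications.indi_pt`.

```python
import pandas as pd

UNKNOWN_INDICATION = "Product used for unknown indication"

def mask_unknown_indication(indi_pt: pd.Series) -> pd.Series:
    """Return a copy in which the placeholder is replaced by a missing value."""
    return indi_pt.mask(indi_pt == UNKNOWN_INDICATION)

# Toy stand-in for indications.indi_pt (illustrative values only).
toy = pd.Series(["Multiple sclerosis", UNKNOWN_INDICATION, "Pain"], name="indi_pt")
cleaned = mask_unknown_indication(toy)

# Proportion of records that are missing once the placeholder is masked.
effective_na_proportion = cleaned.isna().mean()
print(cleaned)
print(effective_na_proportion)
```

Whether the placeholder should be masked or kept as an explicit category depends on how the field is used downstream.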

##### Summary for INDICATIONS ASCII file   

Cases/drugs without `indi_pt` values are not recorded in this file. 

* **Data quality issues found:**  
  - While there are no missing values in the `indi_pt` field, the equivalent value "Product used for unknown indication" appears in 38% of the records.
  
* **Data cleaning steps to do:**  
  - Be aware when merging that this file doesn't contain all cases.

### 5. DQA summary

### 6. Next steps