# Finding, Scraping and Parsing summary PDFs associated with FDA 510(k) submissions

In [None]:
import sys
sys.path.append("..")

import pandas as pd
import numpy as np
import os
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import tabula.io


from src.data import metadata
import importlib
importlib.reload(metadata)

## Finding submissions

The FDA's [downloadable 510(k) files](https://www.fda.gov/medical-devices/510k-clearances/downloadable-510k-files) page
offers many submissions for download. The files contain only metadata, stored in an awkward format and split up by year.
See the following:

In [138]:
metadata_df = metadata.build_df()
print("Shape of legacy submissions from 1976 to 1995: {}".format(metadata_df.shape))
metadata_df.head()

Shape of legacy submissions from 1976 to 1995: (75477, 22)


Unnamed: 0,KNUMBER,APPLICANT,CONTACT,STREET1,STREET2,CITY,STATE,COUNTRY_CODE,ZIP,POSTAL_CODE,...,DECISION,REVIEWADVISECOMM,PRODUCTCODE,STATEORSUMM,CLASSADVISECOMM,SSPINDICATOR,TYPE,THIRDPARTY,EXPEDITEDREVIEW,DEVICENAME
0,K760001,"ZIMMER, INC.",,"4221 Richmond Rd., N.W.",,Walker,MI,US,49534.0,49534.0,...,SESE,PM,,,,,Traditional,N,,ARCH SUPPORT (ARCH AID)
1,K760002,"ZIMMER, INC.",,,,,MO,US,,,...,SESE,PM,IQI,,PM,,Traditional,N,,KNEE AID
2,K760003,"ZIMMER, INC.",,803 N. Front St. Suite 3,,McHenry,IL,US,60050.0,60050.0,...,SESE,PM,ITG,,PM,,Traditional,N,,CAST MATERIAL (WICKET STOCKINETTE)
3,K760004,STEWART-NAUMANN LABORATORIES,,803 N. Front St. Suite 3,,McHenry,IL,US,60050.0,60050.0,...,SESE,HO,FMF,,HO,,Traditional,N,,"SYRINGE, DISPOSABLE, ALL PLASTIC"
4,K760005,STEWART-NAUMANN LABORATORIES,,803 N. Front St. Suite 3,,McHenry,IL,US,60050.0,60050.0,...,SESE,HO,FMF,,HO,,Traditional,N,,"SYRINGE, DISPOSABLE, GLASS & PLASTIC"


As you can see, even when the data is complete (which is rare), it offers little that helps with tasks like finding
similar devices. However, many submissions have an associated summary PDF
that looks like [this](https://github.com/McClain98/FDAexplorer/blob/main/FDAExplorer/data/raw/pdfs/K183074.pdf). These documents carry more information and
might prove useful, so I'm going to work on parsing them.

I've simplified the workflow so that all you need is the K number (the `KNUMBER` column in the table above), and
the program will find a summary PDF if one exists.

**NOTE: Many submissions don't have an associated summary PDF, but the more recent ones generally do. Since we work
on software, and the earliest submissions (from 1976) predate software devices, I assume this is OK.**
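
For readers without access to `metadata.fetch_summary_pdf`, here is a minimal sketch of how a K number could be mapped
to a summary PDF URL. The URL layout (a `pdfYY` folder under `cdrh_docs` on accessdata.fda.gov, with a plain `pdf`
folder for numbers before 2002) is an assumption about how the FDA hosts these files, not something taken from
`metadata.py`, and the name `summary_pdf_url` is hypothetical.

```python
import re

# Assumed base path for hosted 510(k) summary PDFs (not confirmed by metadata.py).
BASE_URL = "https://www.accessdata.fda.gov/cdrh_docs"


def summary_pdf_url(k_number):
    """Build the assumed summary-PDF URL for a K number such as 'K183074'."""
    k = k_number.strip().upper()
    if not re.fullmatch(r"K\d{6}", k):
        raise ValueError("Expected a K number like 'K183074', got {!r}".format(k_number))

    # The two digits after 'K' are the submission year (76 means 1976, 18 means 2018).
    yy = int(k[1:3])

    # Assumption: 1976-2001 share one 'pdf' folder; later years use 'pdf2', 'pdf3', ..., 'pdf18', ...
    folder = "pdf" if (yy >= 76 or yy < 2) else "pdf{}".format(yy)
    return "{}/{}/{}.pdf".format(BASE_URL, folder, k)


url = summary_pdf_url("K183074")
```

The resulting URL could then be fetched with `requests.get(url)`; a non-200 status would presumably mean that no
summary PDF exists for that submission.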

See below for a simple way to find and save these summary docs:

In [107]:
k_number = 'K183074'
loc = metadata.fetch_summary_pdf(k_number)
# This is where the PDF is saved locally. I couldn't figure out how to do this dynamically, but in production we would
# save the PDF anyway.
loc


 [metadata.py:73]


/Users/mcclainthiel/Dropbox (MurDropBox)/FDAexplorer/FDAExplorer/notebooks


Now that we have a pdf, what can we do with it?

We can grab all the text pretty easily and just dump it. After a bit of cleaning, this could feed a similarity metric
that combines bag-of-words features with embedding models. See below for an example:

In [111]:
import PyPDF2
pdf_file = open(loc, 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()

In [140]:
page = read_pdf.getPage(7)
page_content = page.extractText().replace('\t', '')
print(page_content)

 
 
 
 
 
Item
 
 
Subject device
 
 
Predicate device
 
Substantially
 
equivalence
 
 
No.: AK3
-
20, AK3
-
25)
 
TENS and EMS Stimulator provides 8 
types output modes (P1
-
P8). TENS/EMS 
output modes (P1
-
P6&P8) and TENS 
output modes (P7)
 
For TENS mode
 
(1) Symptomatic relief of chronic 
intractable pain, (2) Post traumatic pain,
 
(3) Post
-
surgical pain 
For EMS mode
 
(1) Relaxation of Muscle spasm, (2) 
Increase of local blood flow circulation,
 
(3) Prevention or retardation of disuse 
atrophy, (4
) Muscle re
-
education, (5) 
Maintaining or increasing range of 
motion, (6) Immediate postsurgical 
stimulation of muscles to prevent venous 
thrombosis
 
 
TENS and EMS Stimulator (Model 
No.: AK3
-
50)
 
TENS and EMS Stimulator provides 10 
types output modes 
(P0
-
P9). TENS output 
modes (P0
-
P4) and EMS output modes 
(P5
-
P9)
 
For TENS mode
 
(1) Symptomatic relief of chronic 
intractable pain, (2) Post traumatic pain,
 
(3) Post
-
surgical pain
 
For EMS mode
 
circ
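
The dump above shows the kind of cleaning that is needed: hyphens end up on their own line (`AK3`, `-`, `20`) and
whitespace is scattered. Below is a rough sketch, using only the standard library, of rejoining those hyphens and
comparing two texts with a plain bag-of-words cosine similarity. The function names are illustrative and the tokenizer
is deliberately naive; an embedding model would sit alongside or replace `bag_of_words` in a fuller version.

```python
import math
import re
from collections import Counter


def clean_page_text(raw):
    """Rejoin hyphens that were split onto their own line, then collapse whitespace."""
    # "AK3\n-\n20" becomes "AK3-20"
    text = re.sub(r"\s*\n-\n\s*", "-", raw)
    return re.sub(r"\s+", " ", text).strip()


def bag_of_words(text):
    """Count lowercase alphanumeric tokens, keeping hyphenated terms together."""
    return Counter(re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


sample = clean_page_text("(3) Post\n-\nsurgical pain \nFor EMS mode")
score = cosine_similarity(bag_of_words(sample), bag_of_words("post-surgical pain"))
```

In the notebook, `page_content` from two different submissions would take the place of the sample strings.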

Not bad, and depending on our use case I'm pretty sure I can make some good headway with just this data
from a modeling perspective, but of course more structured data is always better. These documents tend to have
a lot of tables, so I'll try to extract some of those.

In [141]:
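def read_and_clean(page_num, headers=True):
    # tabula returns a list of DataFrames, one per table found on the page
    tables = tabula.io.read_pdf(loc, pages=page_num)
    if len(tables) < 1:
        # Check before indexing, otherwise tables[0] raises IndexError first
        raise EOFError('No tables found on this page')

    # Keep the first table and drop rows that are entirely empty
    df = tables[0].dropna(how='all')

    if headers:
        # Promote the first remaining row to column names
        df.columns = df.iloc[0]
        df = df.iloc[1:]

    return df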

df = read_and_clean(7)
df

Unnamed: 0,Item,Subject device,Predicate device,equivalence
3,Proprietary Name,"TENS and EMS Stimulator, TENS\rStimulator","FOES 101 (ED401) TENS and EMS\rStimulator, FOE...",-
4,510(k) No.,K183074,K113010,-
5,Model number,"AK-10M, AK3-20,\rAK3-25, AK3-50",FOES101 (ED401),-
6,Manufacturer,ASTEK Technology Ltd.,"Famidoc Technology Co., Ltd",-
7,Prescription or OTC,Prescription,Prescription,Same
8,Regulation Number,890.5850,890.5850,Same
9,Product code,"IPF, GZJ","IPF, GZJ",Same
10,Intended Use,TENS Stimulator (Model No.: AK-\r10M)\rTENS St...,FDES 101 (ED401) TENS and EMS\rStimulator\rFor...,Same


The above function takes a table from a specific page and converts it to a pandas DataFrame. It seems to work
pretty well. The only issue is that I had to determine the following by hand:
1. which page the table is on
2. whether the table includes a header row
3. how far the table continues (for tables that span multiple pages)

So if we combine this with a simple classifier, I think we can get some reasonably robust data extraction.
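
As a starting point before any trained model, the first manual step (finding the pages that hold a table) could be
approximated with a keyword heuristic over the extracted page text. The sketch below is an assumption about what such
a filter might look like, not the classifier itself; the hint phrases are taken from the comparison table above, and
`likely_table_pages` is a hypothetical name. It returns 1-based page numbers because tabula counts pages from 1,
whereas PyPDF2's `getPage` counts from 0.

```python
# Phrases that appear in the device comparison table shown above (illustrative, not exhaustive).
TABLE_HINTS = (
    "subject device",
    "predicate device",
    "510(k) no",
    "product code",
    "regulation number",
)


def likely_table_pages(page_texts, min_hits=2):
    """Return 1-based page numbers whose text contains at least `min_hits` hint phrases."""
    pages = []
    for index, text in enumerate(page_texts):
        # Collapse whitespace so phrases split across lines by extraction can still match.
        normalized = " ".join(text.lower().split())
        hits = sum(hint in normalized for hint in TABLE_HINTS)
        if hits >= min_hits:
            pages.append(index + 1)
    return pages


example_pages = [
    "Cover letter",
    "Item\n \nSubject device\n \nPredicate device\n \nSubstantially\n \nequivalence",
    "Indications for use",
]
candidates = likely_table_pages(example_pages)
```

Each candidate page number could then be passed to `read_and_clean`, leaving header detection and multi-page
stitching as the remaining manual steps.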

I will be working on that classifier next.

