# Download the MAUDE Data

Goals: 
1. ***(COMPLETE)*** Retrieve and list all the [MAUDE zip files on fda.gov](https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude).

The following Python code completes these steps:
1. Read the entire MAUDE webpage from fda.gov
2. Take the single HTML table that the read returns
3. Use the Pandas library to convert the HTML table into a Pandas dataframe
4. Drop the first row of the dataframe which is only used for formatting on the web page
5. Rename the columns of the dataframe to include 'Description' and remove tab characters
6. Convert total records to integer to allow math operations on record counts

In [9]:
import pandas as pd

# Read the entire webpage from fda.gov
tables = pd.read_html('https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude')

# The read should return one table; use that as the dataframe
df = tables[0]

# Drop the first row which is only used for formatting on the web page
df = df.drop(index=df.index[0])

# Rename the columns of the table to include 'Description' and remove tabs
df.columns = [
    'File Name',
    'Compressed Size in Bytes',
    'Uncompressed Size in Bytes',
    'Total Records',
    'Description'
]

# Convert total records to integer
df = df.astype({'Total Records':'int'})


Goals:
1. _**(COMPLETE)**_ Print a summary of the MAUDE data

The following Python code completes these steps:
1. Move the 'Description' and 'Total Records' columns next to the file name
2. Display the dataframe from the previous steps

In [10]:
# Move the 'Description' and 'Total Records' columns to be next to the file name
df = df.reindex(columns=[
    'File Name',
    'Description',
    'Total Records',
    'Compressed Size in Bytes',
    'Uncompressed Size in Bytes'
])

df

Unnamed: 0,File Name,Description,Total Records,Compressed Size in Bytes,Uncompressed Size in Bytes
1,mdrfoi.zip,MAUDE Base records received for 2022,2750504,78914KB,908833KB
2,mdrfoithru2022.zip,Master Record through 2022,12830703,546926KB,5243026KB
3,mdrfoiadd.zip,New MAUDE Base records for the current month.,198524,6294KB,63428KB
4,mdrfoichange.zip,MAUDE Base data updates: changes to existing B...,534164,18020KB,172061KB
5,patient.zip,MAUDE Patient records received for 2022,2757813,8041KB,77409KB
...,...,...,...,...,...
68,foitext2021.zip,Narrative Data for 2021,3746403,217872KB,1306481KB
69,foitext2022.zip,Narrative Data received for 2022,5021760,257211KB,2038145KB
70,foitext.zip,Narrative Data received to date for 2023,5021760,257211KB,2038145KB
71,foitextadd.zip,New MAUDE Narrative data for the current month.,168872,11815KB,90258KB
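
The size columns are labeled "in Bytes" but hold strings such as `78914KB`, so they cannot be summed the way `Total Records` can. The following is a minimal sketch of one way to convert them, assuming every value is a whole number followed by `KB`. The helper name `parse_size_kb` is hypothetical and not part of the notebook.

```python
import re

# Assumption: sizes always look like "<digits>KB", as in the listing above
_SIZE_PATTERN = re.compile(r"^\s*(\d+)\s*KB\s*$", re.IGNORECASE)


def parse_size_kb(text):
    """Return the integer number of kilobytes in a string like '78914KB'."""
    match = _SIZE_PATTERN.match(str(text))
    if match is None:
        raise ValueError(f"Unrecognized size value: {text!r}")
    return int(match.group(1))


# Values copied from the first three rows of the listing above
sizes = ["78914KB", "546926KB", "6294KB"]
total_kb = sum(parse_size_kb(size) for size in sizes)
print(f"Total compressed size: {total_kb} KB")
```

With the dataframe from the earlier cells, something like `df['Compressed Size in Bytes'].map(parse_size_kb)` should apply the same conversion to a whole column.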


Goals:
1. _**(COMPLETE)**_ Download all of the MAUDE files to local storage

The following Python code completes these steps:
1. Iterate over all rows using DataFrame.iterrows()
2. For each row, get the file name
3. Use the file name to create a path on the local system
4. Check to see if the file exists.  If it does, don't download it again
5. Use the file name to create the URL for the file
6. Download the file to the local system



In [11]:
from os.path import exists

import os
import urllib.request

data_directory = './01-Download-MAUDE-Data'

# Create the data directory if needed
try:
    os.makedirs(data_directory, exist_ok = True)
except OSError as error:
    print(f"Error creating {data_directory}: {error}")

# Iterate all rows using DataFrame.iterrows()
for index, row in df.iterrows():
    file_name = row["File Name"]
    file_path = f"{data_directory}/{file_name}"
    if exists(file_path):
        print(f"Already downloaded {file_path}; Skipping!")
    else:
        print(f"Downloading {file_name}")
        url = f"https://www.accessdata.fda.gov/MAUDE/ftparea/{file_name}"
        urllib.request.urlretrieve(url, file_path)

print("Download complete.")

Already downloaded ./01-Download-MAUDE-Data/mdrfoi.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/mdrfoithru2022.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/mdrfoiadd.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/mdrfoichange.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patient.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patientthru2022.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patientadd.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patientchange.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patientproblemcode.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/patientproblemdata.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/foidevthru1997.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/foidev1998.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/foidev1999.zip; Skipping!
Already downloaded ./01-Download-MAUDE-Data/device2000.zip; Skip
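
One caveat of the "skip if the file exists" check in step 4: an interrupted download leaves a partial file behind, and later runs will skip it. The sketch below is an assumption about how that could be handled, not the notebook's method. It downloads to a temporary `.part` name and renames the file only after the transfer finishes. The function name `download_if_missing` and its `fetch` parameter are hypothetical.

```python
import os
import urllib.request


def download_if_missing(url, file_path, fetch=urllib.request.urlretrieve):
    """Download url to file_path unless file_path already exists.

    Returns True if a download happened, False if it was skipped.
    """
    if os.path.exists(file_path):
        return False
    partial_path = file_path + ".part"
    try:
        fetch(url, partial_path)
        # Rename only after the transfer finished without raising
        os.replace(partial_path, file_path)
    finally:
        # Remove any leftover partial file from a failed transfer
        if os.path.exists(partial_path):
            os.remove(partial_path)
    return True
```

Inside the loop above, the `if`/`else` body could then be replaced with a single call such as `download_if_missing(url, file_path)`. The `fetch` parameter exists only so the function can be exercised without network access.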

Goal:
1. List the files on the local system

Steps:
1. Use the 'os.listdir()' function to list the files in the data directory

In [12]:
import os
print(os.listdir(data_directory))


['foitext.zip', 'device2003.zip', 'foitext2005.zip', 'foitext2019.zip', 'device2020.zip', 'device2000.zip', 'device2005.zip', 'foidev1999.zip', 'patientproblemdata.zip', 'foitext2003.zip', 'device2021.zip', 'device2001.zip', 'device2019.zip', 'device2004.zip', 'device2022.zip', 'device2015.zip', 'foitext2021.zip', 'mdrfoithru2021.zip', 'device2009.zip', 'patientthru2022.zip', 'device2008.zip', 'foitext2016.zip', 'deviceadd.zip', 'foitext2001.zip', 'foitext2011.zip', 'foitext2000.zip', 'foitext2007.zip', 'patientchange.zip', 'foitext2002.zip', 'mdrfoichange.zip', 'device.zip', 'device2007.zip', 'foidevthru1997.zip', 'foitext2004.zip', 'foitext1996.zip', 'device2012.zip', 'device2014.zip', 'foitext2012.zip', 'device2013.zip', 'foitext1998.zip', 'foitext2013.zip', 'mdrfoi.zip', 'list-of-all-maude-data-files.csv', 'foitext2006.zip', 'device2002.zip', 'patientadd.zip', 'mdrfoithru2022.zip', 'foitext1999.zip', 'device2016.zip', 'foitext2008.zip', 'foitext2015.zip', 'foitext2014.zip', 'device
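
The raw directory listing is hard to scan by eye. As a rough sketch, the listing can be compared against the expected file names to report anything that is missing. The helper `find_missing_files` is hypothetical, and the demonstration uses a temporary directory rather than the real data directory.

```python
import os
import tempfile


def find_missing_files(expected_names, directory):
    """Return the expected file names not present in directory, sorted."""
    present = set(os.listdir(directory))
    return sorted(name for name in expected_names if name not in present)


# Demonstration with a throwaway directory holding one empty file
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "mdrfoi.zip"), "wb").close()
    missing = find_missing_files(["mdrfoi.zip", "patient.zip"], demo_dir)

print(missing)
```

In the notebook, the equivalent call would be along the lines of `find_missing_files(df['File Name'], data_directory)`.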

## Download Product Codes
Download the [product code zip file](https://www.accessdata.fda.gov/premarket/ftparea/foiclass.zip) listed on [the FDA website](https://www.fda.gov/medical-devices/classify-your-medical-device/download-product-code-classification-files).

In [14]:
from os.path import exists

import os
import urllib.request

foiclass_zip = "https://www.accessdata.fda.gov/premarket/ftparea/foiclass.zip"

file_path = f"{data_directory}/foiclass.zip"

if exists(file_path):
    print(f"Already downloaded {file_path}; Skipping!")
else:
    print(f"Downloading {file_path}")
    urllib.request.urlretrieve(foiclass_zip, file_path)

Already downloaded ./01-Download-MAUDE-Data/foiclass.zip; Skipping!


# Observations and Research
There are four types of MAUDE data files:
1. Master event data: `mdrfoi...`
2. Patient data: `patient...`
3. Device data: `foidev...` (through 1999) and `device...`
4. Text data: `foitext...`

All reports are linked via the `MDR Report Key`.
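
To illustrate the linkage, here is a minimal sketch that joins two record sets on the report key using only the standard library. It assumes the extracted files are pipe-delimited text and that the key column is named `MDR_REPORT_KEY`. The other column names and all of the values are made up for illustration.

```python
import csv
import io

# Tiny made-up stand-ins for two extracted MAUDE text files (illustrative only)
base_text = "MDR_REPORT_KEY|EVENT_TYPE\n1001|M\n1002|IN\n"
narrative_text = (
    "MDR_REPORT_KEY|FOI_TEXT\n"
    "1001|Device alarm failed.\n"
    "1001|Unit was returned.\n"
)


def read_pipe_delimited(text):
    """Parse pipe-delimited text into a list of dicts keyed by column name."""
    reader = csv.DictReader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    return list(reader)


# Index the base records by report key
base_by_key = {row["MDR_REPORT_KEY"]: row for row in read_pipe_delimited(base_text)}

# Attach each narrative row to its base record; unmatched narratives are dropped
linked = []
for narrative in read_pipe_delimited(narrative_text):
    base = base_by_key.get(narrative["MDR_REPORT_KEY"])
    if base is not None:
        linked.append({**base, **narrative})

for row in linked:
    print(row)
```

A report can have several narrative rows, so the join is one-to-many. With Pandas, the same idea would be a `merge` on the key column.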

This paper provides some additional insight on massaging the raw MAUDE data into useful formats:
- [A Primer to the Structure, Content and Linkage of the FDA’s Manufacturer and User Facility Device Experience (MAUDE) Files](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5994953/)

The [MAUDE data page](https://www.fda.gov/medical-devices/mandatory-reporting-requirements-manufacturers-importers-and-device-user-facilities/about-manufacturer-and-user-facility-device-experience-maude) also provides information that can be used to develop a schema for importing the data into a modern database that supports queries.
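
As a starting point for a schema, the column names can also be read from the header row of each file inside a zip archive. The sketch below assumes pipe-delimited text with a header line and decodes it as Latin-1, which is a guess at the encoding. The function name `list_columns` is hypothetical, and the demonstration archive is built in memory with made-up content.

```python
import io
import zipfile


def list_columns(zip_source, delimiter="|"):
    """Return {member name: [column names]} from each file's first line."""
    columns = {}
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as raw:
                # Assumption: Latin-1 decodes any byte, so this never raises
                header = raw.readline().decode("latin-1")
            columns[member] = header.rstrip("\r\n").split(delimiter)
    return columns


# Build a small in-memory archive to demonstrate; the content is made up
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "MDR_REPORT_KEY|DATE_RECEIVED\n1001|01/02/2015\n")
buffer.seek(0)

result = list_columns(buffer)
print(result)
```

For a downloaded file, the call would take a path instead, for example `list_columns(f"{data_directory}/mdrfoi.zip")`.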

This video has useful information on running Jupyter notebooks on AWS:
- [Serverless Jupyter on AWS: Fully Managed Notebook Environments](https://www.youtube.com/watch?v=-k53AcgVHTI&ab_channel=AWSPublicSector)

Running the notebook on a dedicated server in AWS will allow faster run times and persistent storage. By contrast, the shared resources in Google Colab require the data to be downloaded and re-ingested on every run, which leads to long run times.




# Next Steps
1. Set up an AWS account to host a dedicated Jupyter notebook server
2. Migrate the 'Download' notebook to the AWS server
3. Run the 'Download' notebook in the AWS environment and begin processing the data files
4. Extract and concatenate the data for 2015 to build the database schema

# Summary
1. A Jupyter notebook has been developed to download MAUDE data from fda.gov.
2. The Google Colab environment is suitable for running the notebook and viewing the downloaded files.
3. However, the shared Colab environment resulted in slow downloads.  In addition, the data is not saved permanently.
4. A dedicated environment needs to be created which will allow faster runs of the notebook and permanent storage.  The dedicated server will be created in the AWS cloud.
5. Next steps include processing data for 2015 to develop a database schema.

# Export Dataframe Details

In [15]:
# Truncated listing of the data set files downloaded from the MAUDE website
df

Unnamed: 0,File Name,Description,Total Records,Compressed Size in Bytes,Uncompressed Size in Bytes
1,mdrfoi.zip,MAUDE Base records received for 2022,2750504,78914KB,908833KB
2,mdrfoithru2022.zip,Master Record through 2022,12830703,546926KB,5243026KB
3,mdrfoiadd.zip,New MAUDE Base records for the current month.,198524,6294KB,63428KB
4,mdrfoichange.zip,MAUDE Base data updates: changes to existing B...,534164,18020KB,172061KB
5,patient.zip,MAUDE Patient records received for 2022,2757813,8041KB,77409KB
...,...,...,...,...,...
68,foitext2021.zip,Narrative Data for 2021,3746403,217872KB,1306481KB
69,foitext2022.zip,Narrative Data received for 2022,5021760,257211KB,2038145KB
70,foitext.zip,Narrative Data received to date for 2023,5021760,257211KB,2038145KB
71,foitextadd.zip,New MAUDE Narrative data for the current month.,168872,11815KB,90258KB


In [16]:
import IPython.display

def display_left_aligned_df(df):
    html_table = df.to_html(notebook=True, index=False)
    html_table = html_table.replace('<td>', '<td style="text-align:left">').replace('<th>', '<th style="text-align:left">')
    IPython.display.display(IPython.display.HTML(html_table))
    
display_left_aligned_df(df)

File Name,Description,Total Records,Compressed Size in Bytes,Uncompressed Size in Bytes
mdrfoi.zip,MAUDE Base records received for 2022,2750504,78914KB,908833KB
mdrfoithru2022.zip,Master Record through 2022,12830703,546926KB,5243026KB
mdrfoiadd.zip,New MAUDE Base records for the current month.,198524,6294KB,63428KB
mdrfoichange.zip,MAUDE Base data updates: changes to existing Ba...,534164,18020KB,172061KB
patient.zip,MAUDE Patient records received for 2022,2757813,8041KB,77409KB
patientthru2022.zip,Patient Record through 2022,15763975,57699KB,472885KB
patientadd.zip,New MAUDE Patient records for the current month.,227574,615KB,6291KB
patientchange.zip,MAUDE Patient data updates: changes to existing...,514821,1323KB,13728KB
patientproblemcode.zip,Device Data for patientproblemcode,16073453,116578KB,1105701KB
patientproblemdata.zip,Patient Problem Data,998,11KB,25KB


In [17]:
# Save the dataframe as CSV for formatting in Excel
df.to_csv(f"{data_directory}/list-of-all-maude-data-files.csv", index=False)
