# Process 2015 Data

Goals:
- Use the 2015 MAUDE data to create a sample dataset for experimentation
- I really need to finish this soon.

Steps:
1. Identify all common data files

|File                    |Description|Required|
|------------------------|-----------|--------|
|`mdrfoithru2021.zip`    |Master Record through 2021|X|
|`patientthru2021.zip`   |Patient Record through 2021|X|
|`foitextchange.zip`     |Narrative data updates: changes to existing narrative data and additional narrative data for existing base records|X|
|`patientproblemcode.zip`|Device Data for patientproblemcode||
|`patientproblemdata.zip`|Patient Problem Data||
|`patientchange.zip`     |MAUDE Patient data updates: changes to existing Base data||
|`mdrfoichange.zip`      |MAUDE Base data updates: changes to existing Base data||
|`devicechange.zip`      |Device data updates: changes to existing Device data and additional Device data for existing Base records||
|`deviceproblemcodes.zip`|Device Problem Data||
|`foidevproblem.zip`     |Device Data for foidevproblem||

2. Identify all 2015 data files

|File                    |Description|Required|
|------------------------|-----------|--------|
|`device2015.zip`        |Device Data for 2015|X|
|`foitext2015.zip`       |Narrative Data for 2015|X|

3. Create databases for each data type
4. Create a merged dataset using joins for each Master Data Record ID in the 2015 data

## Unzip data files
Goals:
- _*(COMPLETE)*_ Unzip downloaded data files to extract the archived `.txt` file

The following Python code completes these steps:
1. Identify the data directory, working directory, and data files
1. Create the working directory if needed
1. Unzip the data files into the working directory

In [1]:
from zipfile import ZipFile
import os

# Identify the data directory, working directory, and data files
data_directory = './01-Download-MAUDE-Data'
working_directory = './02-Process-2015-Data'

data_files = [
    'device2015.zip', 
    'foitextchange.zip', 
    'patientthru2021.zip',              
    'foitext2015.zip', 
    'mdrfoithru2021.zip'
]

# Create the working directory if needed
try:
    os.makedirs(working_directory, exist_ok=True)
except OSError as error:
    print(f"Error creating {working_directory}: {error}")

# Unzip the data files into the working directory
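for data_file in data_files:
    print(f"Unzipping {data_file}")
    # Use a name other than `zip` so the built-in function is not shadowed
    with ZipFile(os.path.join(data_directory, data_file), "r") as archive:
        archive.extractall(working_directory)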

print("Unzip complete.")


Unzipping device2015.zip
Unzipping foitextchange.zip
Unzipping patientthru2021.zip
Unzipping foitext2015.zip
Unzipping mdrfoithru2021.zip
Unzip complete.


## Create a database with tables for each data file
Goals:
- _*(COMPLETE)*_ Read the data files into dataframes
- _*(COMPLETE)*_ Write the dataframes to tables in an SQLite database

The following Python code completes these steps:
1. Create a database
1. Process each file in the working directory
1. Get the file path and the base name of the file by removing '.txt'
1. Create a dataframe for each file using Pandas, reading each file as a delimiter-separated values file with the pipe (`|`) separator
1. Set the MDR_REPORT_KEY as the index for the dataframe
1. Write the dataframe to a new SQLite table


In [2]:
import pandas as pd
import os
import csv
import sqlite3

# Create an SQLite database
db = sqlite3.connect(f"{working_directory}/database.sqlite3")

# Process each file in the working directory
for root, dirs, files in os.walk(working_directory):
    for file_name in files:

        # Only process '.txt' files...
        if file_name.endswith(".txt"):
            print(f"Processing {file_name}")

            # Get the file path and the base name of the file by removing '.txt'
            file_path = os.path.join(root, file_name)
            base_name = file_name.replace('.txt','')

            # Create a dataframe for the file
            df = pd.read_csv(file_path,
                sep="|",
                encoding="ISO-8859-1",
                on_bad_lines='warn',
                quoting=csv.QUOTE_NONE)

            # Set the MDR_REPORT_KEY as the index for the dataframe
            df.set_index('MDR_REPORT_KEY', inplace=True)

            # Write the dataframe to a new SQLite table
            df.to_sql(base_name, db, if_exists="replace")

            # Remove the dataframe in an attempt to free up memory
            del df

print("Database creation complete")

Processing DEVICE2015.txt


Processing foitext2015.txt
Processing foitextChange.txt
Processing mdrfoiThru2021.txt


b'Skipping line 11971429: expected 82 fields, saw 83\n'


Processing patientThru2021.txt
Database creation complete
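
The output above reports at least one skipped line, so it may be worth confirming that every table was written and holds a plausible number of rows. The following is a minimal sketch, not part of the original notebook; the helper name `table_row_counts` is hypothetical.

```python
import sqlite3


def table_row_counts(connection):
    """Return a {table_name: row_count} dictionary for every table in the database."""
    names = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the table name because names such as DEVICE2015 come from file names
        counts[name] = connection.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    return counts
```

For example, `table_row_counts(db)` could be called after the cell above finishes, and the counts compared against the line counts of the source `.txt` files.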


## Combine the 2015 Data

Goals:
- _*(PARTIALLY COMPLETE)*_ Given that the database has tables for all the data elements, combine the data to form a complete data set for 2015

The following Python code completes these steps:
1. Create tables for MDR and Patient data from 2015
1. Use the 2015 tables to join the DEVICE2015, foitext2015, and foitextChange data

In [3]:
import sqlite3

# Connect to the SQLite database
db = sqlite3.connect(f"{working_directory}/database.sqlite3")

# Create a table for the MDR 2015 data
db.execute('''
    CREATE TABLE IF NOT EXISTS mdrfoi2015 AS
    SELECT *
      FROM mdrfoiThru2021
     WHERE mdrfoiThru2021.DATE_RECEIVED LIKE '%2015%';
''')

# Create a table for the patient 2015 data
db.execute('''
    CREATE TABLE IF NOT EXISTS patient2015 AS
    SELECT *
    FROM patientThru2021
    WHERE patientThru2021.DATE_RECEIVED LIKE '%2015%';
''')

# Create an intermediate table for 2015 Device and FOI Text data
db.execute('''
    CREATE TABLE IF NOT EXISTS devicefoitext2015 AS
    SELECT  
        foitextChange.*,
        foitext2015.*,
        DEVICE2015.*
    FROM DEVICE2015
        LEFT JOIN foitextChange ON DEVICE2015.MDR_REPORT_KEY = foitextChange.MDR_REPORT_KEY
        LEFT JOIN foitext2015 ON DEVICE2015.MDR_REPORT_KEY = foitext2015.MDR_REPORT_KEY;
''')

print("2015 Device and FOI Text data combination complete")

# Create an intermediate table for 2015 Device, FOI Text, and Patient data
db.execute('''
    CREATE TABLE IF NOT EXISTS devicefoitextpatient2015 AS
    SELECT  
        patient2015.*,
        devicefoitext2015.*
    FROM devicefoitext2015
        LEFT JOIN patient2015 ON devicefoitext2015.MDR_REPORT_KEY = patient2015.MDR_REPORT_KEY;
''')

print("2015 Device, FOI Text, and Patient data combination complete")

# Create a table for 2015 Device, FOI Text, Patient, and MDR data
# db.execute('''
#     CREATE TABLE IF NOT EXISTS all2015 AS
#     SELECT  
#         mdrfoi2015.*,
#         devicefoitextpatient2015.*
#     FROM devicefoitextpatient2015
#         LEFT JOIN mdrfoi2015 ON devicefoitextpatient2015.MDR_REPORT_KEY = mdrfoi2015.MDR_REPORT_KEY;
# ''')

# print("2015 Device, FOI Text, Patient data, and MDR data combination complete")


2015 Device and FOI Text data combination complete
2015 Device, FOI Text, and Patient data combination complete


# Observations and Research
To complete this step in the data processing, two dedicated servers were created in AWS to run the Jupyter notebooks.  The first environment was a [SageMaker Studio Lab](https://studiolab.sagemaker.aws/) environment with 4GB RAM and 15GB disk.  The second was a [SageMaker Notebook server](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) with 4GB RAM and 100GB disk.

The SageMaker Studio Lab server was substantially faster than the Google Colab environment initially used and allowed for persistent data storage.  This improved the processing for the download notebook, [1-Download-MAUDE-Data](./1-Download-MAUDE-Data.ipynb).  However, the disk space allocated to the SageMaker Studio Lab server was not enough to hold the downloaded MAUDE data (which is compressed) _and_ the uncompressed MAUDE data.  This prompted the move to a dedicated environment with more disk space.
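
A disk shortfall like this can be detected before extraction starts. The sketch below is an illustration only and was not used in this project; it reads the uncompressed sizes recorded in each archive and compares the total against the free space reported for the target directory. The helper names `uncompressed_size` and `has_room_for` are hypothetical.

```python
import shutil
from zipfile import ZipFile


def uncompressed_size(zip_paths):
    """Return the total uncompressed size, in bytes, of all members of the given archives."""
    total = 0
    for path in zip_paths:
        with ZipFile(path, "r") as archive:
            # file_size is the uncompressed size stored in the archive's directory
            total += sum(info.file_size for info in archive.infolist())
    return total


def has_room_for(zip_paths, target_directory):
    """Return True if target_directory has enough free space for the extracted files."""
    free_bytes = shutil.disk_usage(target_directory).free
    return uncompressed_size(zip_paths) <= free_bytes
```

This check only covers the extracted `.txt` files. The SQLite database built from them needs additional space that the sketch does not estimate.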

The SageMaker Notebook Server environment allowed the server disk space to be specified directly.  The server was created with enough disk space to hold the compressed and uncompressed MAUDE data set.  However, while processing the uncompressed data to create a database, the Jupyter notebook runtime crashed repeatedly, indicating that not enough RAM was available to process the largest of the MAUDE data files, which is about 4 GB uncompressed.  The code was updated to free memory where possible, but that did not prevent the notebook from crashing.
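
One way to reduce peak memory, which was not tried in this project, is to read each file in chunks and append each chunk to the SQLite table instead of loading the whole file into a single dataframe. The following is a minimal sketch under that assumption. The helper name `load_in_chunks` and the chunk size are hypothetical, and every column is read as text so that each chunk is written with the same types.

```python
import csv
import sqlite3

import pandas as pd


def load_in_chunks(file_path, table_name, connection, chunk_rows=100_000):
    """Load a pipe-delimited file into an SQLite table one chunk at a time."""
    # Start from an empty table so that reruns do not append duplicate rows
    connection.execute(f'DROP TABLE IF EXISTS "{table_name}"')

    reader = pd.read_csv(
        file_path,
        sep="|",
        encoding="ISO-8859-1",
        on_bad_lines="warn",
        quoting=csv.QUOTE_NONE,
        dtype=str,             # assumption: keep every column as text
        chunksize=chunk_rows,  # yields dataframes of at most chunk_rows rows
    )

    rows_written = 0
    for chunk in reader:
        chunk.to_sql(table_name, connection, if_exists="append", index=False)
        rows_written += len(chunk)
    return rows_written
```

Unlike the cell above, this sketch does not set `MDR_REPORT_KEY` as the dataframe index, so any index on that column would need to be created in SQLite separately.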

To progress with the data processing, a Jupyter environment was created on a MacBook Pro with 64GB of RAM and plenty of disk space for the compressed and uncompressed MAUDE data.  This allowed the [2-Process-2015-Data](./2-Process-2015-Data.ipynb) notebook to successfully create all databases as needed.

With databases in place for each file type of the MAUDE data for 2015, the process of combining the data was initiated.  An initial attempt was made to join all the data using the MDR_REPORT_KEY as the index for joining rows.  However, this resulted in a long runtime; the process was allowed to run for just over 11 hours before being interrupted.

In an effort to speed up the data combination, new tables were made to hold the 2015 data from the MDR and Patient files, which contain data from the start of the project to the present date.  Then, the data was joined two tables at a time, starting with the Device data and the report data.  The result of the device and text join was used to create an intermediate table.  This new table was then joined with the Patient data to create a second intermediate table.  These two joins were able to complete in less than 6 minutes.  However, attempting to join the second intermediate table with the MDR data resulted in the process running for more than two hours before being interrupted.
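
The cause of the slow MDR join was not determined in this notebook. One thing that may be worth checking is whether the tables being joined have an index on `MDR_REPORT_KEY`, because tables created with `CREATE TABLE ... AS SELECT` do not inherit indexes from their source tables. The sketch below shows how an index could be added and how the query plan could be inspected. The helper names `index_join_key` and `query_plan` are hypothetical, and whether an index shortens this particular join is an assumption that has not been tested.

```python
import sqlite3


def index_join_key(connection, table_name, column_name="MDR_REPORT_KEY"):
    """Create an index on the join column of a table if it does not already exist."""
    index_name = f"idx_{table_name}_{column_name}"
    connection.execute(
        f'CREATE INDEX IF NOT EXISTS "{index_name}" ON "{table_name}" ("{column_name}")'
    )
    connection.commit()
    return index_name


def query_plan(connection, sql):
    """Return the detail text of each step SQLite plans to use for the query."""
    # The human-readable description is the last column of each plan row
    return [row[-1] for row in connection.execute(f"EXPLAIN QUERY PLAN {sql}")]
```

For example, `index_join_key(db, "mdrfoi2015")` could be run before the final join, and `query_plan` could then be called with the `SELECT` portion of that join to see whether SQLite reports a search using the index or a full table scan.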

The document [A Primer to the Structure, Content and Linkage of the FDA’s Manufacturer and User Facility Device Experience (MAUDE) Files](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5994953/) was referenced again to see if anything could be done to expedite the combination of the MAUDE data to produce a complete data set for 2015.

Upon reviewing the problem space related to this student's practice, it was observed that the MDR data may not be needed to complete the project as intended.

# Next Steps
1. Review the merged data for completeness
1. If the data is complete, filter for device type and remove duplicates
1. Review the data in the device and report data files to determine if the data they contain can be used without the MDR data
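
As a starting point for the completeness review, the sketch below counts how many distinct report keys in one table have no matching row in another. It is an illustration only; the helper name `unmatched_key_count` is hypothetical, and it assumes both tables carry an `MDR_REPORT_KEY` column.

```python
import sqlite3


def unmatched_key_count(connection, left_table, right_table, key="MDR_REPORT_KEY"):
    """Count distinct keys in left_table that have no matching row in right_table."""
    sql = f'''
        SELECT COUNT(DISTINCT l."{key}")
          FROM "{left_table}" AS l
         WHERE NOT EXISTS (
               SELECT 1
                 FROM "{right_table}" AS r
                WHERE r."{key}" = l."{key}"
         )
    '''
    return connection.execute(sql).fetchone()[0]
```

For example, `unmatched_key_count(db, "DEVICE2015", "foitext2015")` would report how many device records have no narrative text, which indicates how many rows of the merged table have empty text columns.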

# Summary
1. Two dedicated Jupyter notebook environments were created in AWS to allow for persistent data storage and improved runtime speeds.
1. In both dedicated environments, the size of the MAUDE data taxed the disk and RAM capacity.  Ultimately, the data was processed on a local workstation with 1TB disk and 64GB RAM.
1. Combining the data for 2015 proved difficult using an "all-at-once" approach.  Using intermediate tables that joined two data files together at a time allowed the data to be merged efficiently and to completion.
1. The size of the MDR data set is proving to be problematic for merging.
1. The merged data will be reviewed to determine if the MDR data is required or if the device, report, and patient data can be used without the MDR data.


