# Data Stratification

## Goals and Questions
1. Identify what is obvious in the FOI text and whether the unstructured data contains a signal that can be predicted.
    1. What pattern exists in the data, and how can I identify it as a human?
    1. Based on that analysis, determine how sophisticated the model needs to be. For example, if the text contains “could not be determined”, a simple rule can label the row inconclusive.
1. Write code to sample the data across Device Problem Text so that a human can look for patterns tying FOI text to each Device Problem Text category (see the code notes below).
    1. Stratify across years, FOI text, and device problem code.
        1. Why is this record categorized under this Device Problem Text?
        1. Why do these records share this category?
        1. Is there something common to all of these records that would make me categorize them as component failure, inconclusive, or not-component failure?
1. Determine whether simple rules will suffice or whether more processing is needed.
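
As a minimal sketch of the kind of simple rule mentioned above, the following labels a row inconclusive when its text contains “could not be determined”. The column names `FOI_TEXT` and `RULE_LABEL`, the `unlabeled` fallback, and the two sample rows are assumptions made for illustration; they are not taken from the clean data.

```python
import pandas as pd

# Only "could not be determined" comes from the notes above;
# any further phrases would have to be found by reading sampled records.
INCONCLUSIVE_PHRASES = ["could not be determined"]


def label_foi_text(text):
    """Return 'inconclusive' if the text contains a listed phrase, else 'unlabeled'."""
    lowered = str(text).lower()  # str() also copes with missing (NaN) values
    if any(phrase in lowered for phrase in INCONCLUSIVE_PHRASES):
        return "inconclusive"
    return "unlabeled"


# Made-up rows for illustration only (FOI_TEXT is an assumed column name)
sample = pd.DataFrame({
    "FOI_TEXT": [
        "THE ROOT CAUSE COULD NOT BE DETERMINED.",
        "THE PUMP WAS RETURNED FOR EVALUATION.",
    ]
})
sample["RULE_LABEL"] = sample["FOI_TEXT"].apply(label_foi_text)
print(sample)
```

If most sampled records fall through to the fallback label, that would suggest simple substring rules are not enough on their own.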

## Code Notes
Stratify and analyze:

1. Start with 2020 and 2021 clean data
2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE
    1. Remove any that are not assemblies/systems
3. Look at the top 5 DEVICE_PROBLEM_TEXT
4. Randomly pick 10 records from each of the top 5 DEVICE_PROBLEM_TEXT values, for a total of 50 records
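
Steps 3 and 4 could be sketched roughly as follows. This is one possible approach under stated assumptions, not the notebook's implementation: the helper name `sample_top_problem_texts`, the fixed seed, the `FOI_TEXT` column, and the toy data are hypothetical, and the frame is assumed to have a `DEVICE_PROBLEM_TEXT` column.

```python
import pandas as pd


def sample_top_problem_texts(df, top_n=5, per_group=10, seed=0):
    """Randomly pick up to `per_group` rows from each of the `top_n` most common problem texts."""
    # Step 3: find the most frequent DEVICE_PROBLEM_TEXT values
    top_texts = df["DEVICE_PROBLEM_TEXT"].value_counts().head(top_n).index
    subset = df[df["DEVICE_PROBLEM_TEXT"].isin(top_texts)]

    # Step 4: sample within each group; min() guards against groups smaller than per_group
    pieces = [
        group.sample(n=min(per_group, len(group)), random_state=seed)
        for _, group in subset.groupby("DEVICE_PROBLEM_TEXT")
    ]
    return pd.concat(pieces)


# Toy data for illustration only: six made-up problem texts with different frequencies
counts = {"A": 20, "B": 18, "C": 15, "D": 12, "E": 11, "F": 3}
toy = pd.DataFrame(
    {"DEVICE_PROBLEM_TEXT": [text for text, n in counts.items() for _ in range(n)]}
)
toy["FOI_TEXT"] = ["report " + str(i) for i in range(len(toy))]

picked = sample_top_problem_texts(toy)
print(picked["DEVICE_PROBLEM_TEXT"].value_counts())
```

To stratify across years as well, the same helper could be applied to each year's data separately, so that each year contributes its own set of sampled records.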

In [1]:
from zipfile import ZipFile

import pandas as pd
import os
import csv

# Identify the working directory and data files
working_directory = './data_stratification'

# 1. Start with 2020 and 2021 clean data
data_file_2020 = "./2020_clean/2020_data_clean.csv"
data_file_2021 = "./2021_clean/2021_data_clean.csv"

# Create the working directory if needed
try:
    os.makedirs(working_directory, exist_ok=True)
except OSError as error:
    print(f"Error creating {working_directory}: {error}")

In [2]:
# Read the 2020 data into a pandas DataFrame
data_2020 = pd.read_csv(
    data_file_2020,       # Path assigned in the first cell
    on_bad_lines='warn',  # Warn about malformed lines instead of raising an error
    dtype='str',          # Read every column as a string
)



In [3]:
# Read the 2021 data into a pandas DataFrame
data_2021 = pd.read_csv(
    data_file_2021,       # Path assigned in the first cell
    on_bad_lines='warn',  # Warn about malformed lines instead of raising an error
    dtype='str',          # Read every column as a string
)

In [54]:
# 2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE
product_code_occurrences_2020 = data_2020.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')

In [53]:
# 2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE
product_code_occurrences_2021 = data_2021.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')
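
For a quick look at the most frequent codes without a separate sort, `value_counts` returns counts already ordered from most to least common. The sketch below uses a made-up toy frame rather than the clean data, and `top_codes` is a hypothetical name.

```python
import pandas as pd

# Toy frame for illustration only; the counts are invented
toy = pd.DataFrame(
    {"DEVICE_REPORT_PRODUCT_CODE": ["DZE"] * 4 + ["FRN"] * 3 + ["QBJ"] * 2 + ["OZP"]}
)

# value_counts() sorts in descending order, so head(2) gives the two most common codes
top_codes = toy["DEVICE_REPORT_PRODUCT_CODE"].value_counts().head(2).rename("COUNT")
print(top_codes)
```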

In [55]:
# Extract foiclass.zip (the product code definitions) into the working directory
with ZipFile('./data/foiclass.zip', "r") as zip_file:  # avoid shadowing the built-in zip()
    zip_file.extractall(working_directory)

In [38]:
# Read the product code definitions into a pandas dataframe
foi_class = pd.read_csv(f"{working_directory}/foiclass.txt", 
                        sep="|",
                        encoding="ISO-8859-1",
                        on_bad_lines='warn',
                        dtype = 'str')

In [42]:
# Identify the unwanted columns
unwanted_columns = [
    'REVIEW_PANEL',
    'MEDICALSPECIALTY',
    'UNCLASSIFIED_REASON',
    'GMPEXEMPTFLAG',
    'THIRDPARTYFLAG',
    'REVIEWCODE',
    'REGULATIONNUMBER',
    'SUBMISSION_TYPE_ID',
    'DEFINITION',
    'PHYSICALSTATE',
    'TECHNICALMETHOD',
    'TARGETAREA',
    'Implant_Flag',
    'Life_Sustain_support_flag',
    'SummaryMalfunctionReporting',
]

# Remove the unwanted columns from the FOI class dataframe
foi_class_drop_columns = foi_class.drop(unwanted_columns, axis=1)

In [65]:
# Rename the 'PRODUCTCODE' column to match the clean data
foi_class_clean = foi_class_drop_columns.rename(columns={'PRODUCTCODE': 'DEVICE_REPORT_PRODUCT_CODE'})

In [67]:
# Merge the clean data's product code counts with the FOI class
product_code_occurrences_2020_merged = pd.merge(
        product_code_occurrences_2020, 
        foi_class_clean, 
        on="DEVICE_REPORT_PRODUCT_CODE", 
        how="inner")


In [68]:
# Merge the clean data's product code counts with the FOI class
product_code_occurrences_2021_merged = pd.merge(
        product_code_occurrences_2021, 
        foi_class_clean, 
        on="DEVICE_REPORT_PRODUCT_CODE", 
        how="inner")
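
An inner merge silently drops any product code that has no entry in the definitions. One way to check for that, sketched here on made-up toy frames, is a left merge with `indicator=True`, which adds a `_merge` column showing where each row came from. The code `ZZZ` and the counts are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the counts and the product code definitions
counts = pd.DataFrame(
    {"DEVICE_REPORT_PRODUCT_CODE": ["DZE", "FRN", "ZZZ"], "COUNT": [3, 2, 1]}
)
definitions = pd.DataFrame(
    {
        "DEVICE_REPORT_PRODUCT_CODE": ["DZE", "FRN"],
        "DEVICENAME": ["Implant, Endosseous, Root-Form", "Pump, Infusion"],
    }
)

# A left merge keeps every counted code; _merge is "left_only" where no definition matched
checked = pd.merge(
    counts,
    definitions,
    on="DEVICE_REPORT_PRODUCT_CODE",
    how="left",
    indicator=True,
)
unmatched = checked.loc[
    checked["_merge"] == "left_only", "DEVICE_REPORT_PRODUCT_CODE"
].tolist()
print(unmatched)
```

An empty list would mean the inner merge lost no product codes.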

In [62]:
product_code_occurrences_2020_merged.sort_values(by=['COUNT'], ascending=False).head(5)

| | DEVICE_REPORT_PRODUCT_CODE | COUNT | DEVICENAME | DEVICECLASS |
|---|---|---|---|---|
| 222 | DZE | 354972 | Implant, Endosseous, Root-Form | 2 |
| 1936 | QBJ | 276350 | Integrated Continuous Glucose Monitoring Syste... | 2 |
| 1781 | OZP | 269978 | Automated Insulin Dosing Device System, Single... | 3 |
| 450 | FRN | 256585 | Pump, Infusion | 2 |
| 1780 | OZO | 236061 | Automated Insulin Dosing , Threshold Suspend | 3 |


In [63]:
product_code_occurrences_2021_merged.sort_values(by=['COUNT'], ascending=False).head(5)

| | DEVICE_REPORT_PRODUCT_CODE | COUNT | DEVICENAME | DEVICECLASS |
|---|---|---|---|---|
| 230 | DZE | 690942 | Implant, Endosseous, Root-Form | 2 |
| 455 | FRN | 529091 | Pump, Infusion | 2 |
| 1986 | QBJ | 297367 | Integrated Continuous Glucose Monitoring Syste... | 2 |
| 1824 | OZP | 203393 | Automated Insulin Dosing Device System, Single... | 3 |
| 2002 | QFG | 176681 | Alternate Controller Enabled Insulin Infusion ... | 2 |


In [64]:
# Write the top 5 product codes for each year to disk
product_code_occurrences_2020_merged.sort_values(by=['COUNT'], ascending=False).head(5).to_csv(f"{working_directory}/2020_device_product_code_counts.csv")
product_code_occurrences_2021_merged.sort_values(by=['COUNT'], ascending=False).head(5).to_csv(f"{working_directory}/2021_device_product_code_counts.csv")