# Data Stratification

## Goals and Questions
1. Identify what is obvious in the FOI text and whether the unstructured data provides something to predict
    1. What pattern exists in the data, and how can I identify it as a human?
    1. Based on the answer to the previous question, determine how sophisticated the model needs to be. For example, if the text contains “could not be determined”, then label the row inconclusive
1. Write code that samples the data across Device Problem Text so that a human can look for patterns linking the FOI text to each Device Problem Text category (see the code notes below)
    1. Stratify across years, FOI text, and device problem code
        1. Why is this categorized as a device problem text?
        1. Why do these share this category?
        1. Is there something across all of these that would make me categorize it as component failure, inconclusive, or not-component failure?
1. Determine if simple rules will suffice or if more processing is needed
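
The simple-rules option can be sketched as an ordered list of phrase checks. This is a minimal sketch, not the model used in this project: only the “could not be determined” phrase comes from the goals above, while the second phrase, the `unlabeled` default, and the names `RULES` and `label_foi_text` are hypothetical placeholders.

```python
import pandas as pd

# Ordered (phrase, label) rules; the first phrase found in the text wins.
# Only "COULD NOT BE DETERMINED" comes from the goals above; the second
# phrase is a hypothetical placeholder showing how more rules would be added.
RULES = [
    ("COULD NOT BE DETERMINED", "inconclusive"),
    ("COMPONENT FAILED", "component failure"),  # hypothetical phrase
]


def label_foi_text(text, default="unlabeled"):
    """Return the label of the first rule whose phrase appears in the text."""
    if not isinstance(text, str):
        return default  # missing FOI_TEXT (None/NaN) gets the default label
    upper_text = text.upper()
    for phrase, label in RULES:
        if phrase in upper_text:
            return label
    return default


# Toy rows, invented for illustration only
example = pd.DataFrame({
    "FOI_TEXT": [
        "THE ROOT CAUSE COULD NOT BE DETERMINED.",
        "IT WAS REPORTED THAT SIGNAL LOSS OCCURRED.",
        None,
    ]
})
example["LABEL"] = example["FOI_TEXT"].apply(label_foi_text)
print(example)
```

If most sampled rows end up with the default label, that would suggest simple rules are not enough and more processing is needed.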

## Code Notes
Stratify and analyze:

1. Start with 2020 and 2021 clean data
2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE
    1. Remove any that are not assemblies/systems
3. Look at the top 5 DEVICE_PROBLEM_TEXT
4. Randomly pick 10 from each of the top 5 DEVICE_PROBLEM_TEXT for a total of 50 records
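
Steps 3 and 4 can be sketched with a small helper. This is an illustrative sketch on toy data, not the notebook's own implementation: the function name `sample_top_categories`, its parameters, and the toy frame are assumptions. It also caps the per-category sample at the group size, so a category with fewer than 10 records does not raise an error.

```python
import pandas as pd


def sample_top_categories(df, column, top_k=5, n_per_category=10, seed=1):
    """Sample up to n_per_category rows from each of the top_k most frequent values."""
    # Most frequent values of the stratification column
    top_values = df[column].value_counts().head(top_k).index
    subset = df[df[column].isin(top_values)]
    # Draw the same number of rows from every category (fewer if the category is small)
    samples = [
        group.sample(n=min(n_per_category, len(group)), random_state=seed)
        for _, group in subset.groupby(column)
    ]
    return pd.concat(samples)


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "DEVICE_PROBLEM_TEXT": ["A"] * 30 + ["B"] * 20 + ["C"] * 12 + ["D"] * 3,
    "FOI_TEXT": [f"REPORT {i}" for i in range(65)],
})
sampled = sample_top_categories(toy, "DEVICE_PROBLEM_TEXT", top_k=3, n_per_category=10)
print(sampled["DEVICE_PROBLEM_TEXT"].value_counts())
```

With `top_k=5` and `n_per_category=10` on the real data, this would give the 50 records described in step 4, provided each of the top 5 categories has at least 10 records.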

In [1]:
import os
from zipfile import ZipFile

import pandas as pd

# Specify the decimal form for a percentage of the dataset to sample
# 0.01 = 1%
# 0.10 = 10%
sample_size = 0.01

# Identify the working directory and data files
working_directory = './data_stratification'

# 1. Start with 2020 and 2021 clean data
data_file_2020 = "./2020_clean/2020_data_clean.csv"
data_file_2021 = "./2021_clean/2021_data_clean.csv"

# Create the working directory if needed
try:
    os.makedirs(working_directory, exist_ok=True)
except OSError as error:
    print(f"Error creating {working_directory}: {error}")

In [2]:
# Read the 2020 clean data into a pandas dataframe
data_2020 = pd.read_csv(data_file_2020,      # The data file being read, from the variable assignment above
                        on_bad_lines='warn', # Warn about malformed lines and skip them instead of raising an error
                        dtype='str')         # Read every column as a string so codes are not converted to numbers



In [3]:
# Read the 2021 clean data into a pandas dataframe
data_2021 = pd.read_csv(data_file_2021,      # The data file being read, from the variable assignment above
                        on_bad_lines='warn', # Warn about malformed lines and skip them instead of raising an error
                        dtype='str')         # Read every column as a string so codes are not converted to numbers

In [4]:
# 2. Count the 2020 records for each DEVICE_REPORT_PRODUCT_CODE (the top 5 are selected after the merge)
product_code_occurrences_2020 = data_2020.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')

In [5]:
# 2. Count the 2021 records for each DEVICE_REPORT_PRODUCT_CODE (the top 5 are selected after the merge)
product_code_occurrences_2021 = data_2021.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')

In [6]:
# Unzip foiclass.zip (the product code definitions) into the working directory
# The handle is named zip_file so it does not shadow the built-in zip()
with ZipFile('./data/foiclass.zip', "r") as zip_file:
    zip_file.extractall(working_directory)

In [7]:
# Read the product code definitions into a pandas dataframe
foi_class = pd.read_csv(f"{working_directory}/foiclass.txt", 
                        sep="|",
                        encoding="ISO-8859-1",
                        on_bad_lines='warn',
                        dtype = 'str')

In [8]:
# Identify the unwanted columns
unwanted_columns = [
    'REVIEW_PANEL',
    'MEDICALSPECIALTY',
    'UNCLASSIFIED_REASON',
    'GMPEXEMPTFLAG',
    'THIRDPARTYFLAG',
    'REVIEWCODE',
    'REGULATIONNUMBER',
    'SUBMISSION_TYPE_ID',
    'DEFINITION',
    'PHYSICALSTATE',
    'TECHNICALMETHOD',
    'TARGETAREA',
    'Implant_Flag',
    'Life_Sustain_support_flag',
    'SummaryMalfunctionReporting',
]

# Remove the unwanted columns from the product code definitions
foi_class_drop_columns = foi_class.drop(unwanted_columns, axis=1)

In [9]:
# Rename the 'PRODUCTCODE' column to match the clean data
foi_class_clean = foi_class_drop_columns.rename(columns={'PRODUCTCODE': 'DEVICE_REPORT_PRODUCT_CODE'})

In [10]:
# Merge the clean data's product code counts with the FOI class
product_code_occurrences_2020_merged = pd.merge(
        product_code_occurrences_2020, 
        foi_class_clean, 
        on="DEVICE_REPORT_PRODUCT_CODE", 
        how="inner")

In [11]:
# Merge the clean data's product code counts with the FOI class
product_code_occurrences_2021_merged = pd.merge(
        product_code_occurrences_2021, 
        foi_class_clean, 
        on="DEVICE_REPORT_PRODUCT_CODE", 
        how="inner")
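
Because both merges use `how="inner"`, any product code without an entry in `foiclass.txt` is dropped from the merged counts without notice. One way to check for this, sketched here on toy frames, is a left merge with pandas' `indicator=True` option. The toy codes, counts, and placeholder device names are invented for illustration, and `ZZZ` is a made-up code.

```python
import pandas as pd

# Toy stand-ins for the count table and the product code definitions
counts = pd.DataFrame({
    "DEVICE_REPORT_PRODUCT_CODE": ["QBJ", "FRN", "ZZZ"],  # "ZZZ" is a made-up code
    "COUNT": [3, 2, 1],
})
definitions = pd.DataFrame({
    "DEVICE_REPORT_PRODUCT_CODE": ["QBJ", "FRN"],
    "DEVICENAME": ["Name for QBJ", "Name for FRN"],  # placeholder names
})

# A left merge keeps every counted code; the _merge column records where each row matched
checked = pd.merge(counts, definitions, on="DEVICE_REPORT_PRODUCT_CODE",
                   how="left", indicator=True)

# Codes that have counts but no definition would vanish under how="inner"
unmatched = checked.loc[checked["_merge"] == "left_only",
                        "DEVICE_REPORT_PRODUCT_CODE"].tolist()
print(unmatched)
```

An empty `unmatched` list would mean the inner merge lost no product codes.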

In [12]:
# Show the 5 product codes with the most records in 2020
product_code_occurrences_2020_merged.sort_values(by=['COUNT'], ascending=False).head(5)

Unnamed: 0,DEVICE_REPORT_PRODUCT_CODE,COUNT,DEVICENAME,DEVICECLASS
222,DZE,354972,"Implant, Endosseous, Root-Form",2
1936,QBJ,276350,Integrated Continuous Glucose Monitoring Syste...,2
1781,OZP,269978,"Automated Insulin Dosing Device System, Single...",3
450,FRN,256585,"Pump, Infusion",2
1780,OZO,236061,"Automated Insulin Dosing , Threshold Suspend",3


In [13]:
# Show the 5 product codes with the most records in 2021
product_code_occurrences_2021_merged.sort_values(by=['COUNT'], ascending=False).head(5)

Unnamed: 0,DEVICE_REPORT_PRODUCT_CODE,COUNT,DEVICENAME,DEVICECLASS
230,DZE,690942,"Implant, Endosseous, Root-Form",2
455,FRN,529091,"Pump, Infusion",2
1986,QBJ,297367,Integrated Continuous Glucose Monitoring Syste...,2
1824,OZP,203393,"Automated Insulin Dosing Device System, Single...",3
2002,QFG,176681,Alternate Controller Enabled Insulin Infusion ...,2


In [14]:
# Write the top 5 product codes for each year to disk
product_code_occurrences_2020_merged.sort_values(by=['COUNT'], 
    ascending=False).head(5).to_csv(f"{working_directory}/2020_device_product_code_counts.csv")

product_code_occurrences_2021_merged.sort_values(by=['COUNT'], 
    ascending=False).head(5).to_csv(f"{working_directory}/2021_device_product_code_counts.csv")

In [15]:
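# Keep only the 2020 records for product code QBJ; the other top product codes are left commented out
qbj_2020 = data_2020.query("DEVICE_REPORT_PRODUCT_CODE == 'QBJ'")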
#dze_2020 = data_2020.query("DEVICE_REPORT_PRODUCT_CODE == 'DZE'")
#ozp_2020 = data_2020.query("DEVICE_REPORT_PRODUCT_CODE == 'OZP'")
#frn_2020 = data_2020.query("DEVICE_REPORT_PRODUCT_CODE == 'FRN'")
#ozo_2020 = data_2020.query("DEVICE_REPORT_PRODUCT_CODE == 'OZO'")

In [16]:
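# Keep only the 2021 records for product code QBJ; the other top product codes are left commented out
qbj_2021 = data_2021.query("DEVICE_REPORT_PRODUCT_CODE == 'QBJ'")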
#dze_2021 = data_2021.query("DEVICE_REPORT_PRODUCT_CODE == 'DZE'")
#frn_2021 = data_2021.query("DEVICE_REPORT_PRODUCT_CODE == 'FRN'")
#ozp_2021 = data_2021.query("DEVICE_REPORT_PRODUCT_CODE == 'OZP'")
#qfg_2021 = data_2021.query("DEVICE_REPORT_PRODUCT_CODE == 'QFG'")

In [17]:
# 3. Count the 2020 QBJ records for each DEVICE_PROBLEM_CODE and show the 10 most frequent codes
qbj_2020_count = qbj_2020.groupby(['DEVICE_PROBLEM_CODE']).size().to_frame('COUNT')
qbj_2020_count.sort_values(by=['COUNT'], ascending=False).head(10)

DEVICE_PROBLEM_CODE,COUNT
3283,87978
1435,74530
1307,43113
2591,25194
2896,10144
1480,8500
3191,6646
1559,5031
2907,2740
4032,2243


In [18]:
# 4. Randomly pick 10 records from each of the 10 most frequent DEVICE_PROBLEM_CODE values
#    (100 records in total; the code notes above planned 5 categories and 50 records)
strata_columns = ['FOI_TEXT', 'DEVICE_PROBLEM_CODE', 'DEVICE_PROBLEM_TEXT', 'DEVICE_REPORT_PRODUCT_CODE']
top_codes_2020 = qbj_2020_count.sort_values(by=['COUNT'], ascending=False).head(10).index.tolist()

# Build the sample in one concat instead of appending to an empty dataframe in a loop
qbj_2020_strata = pd.concat(
    [qbj_2020.loc[qbj_2020['DEVICE_PROBLEM_CODE'] == str(code), strata_columns].sample(n=10, random_state=1)
     for code in top_codes_2020])

In [19]:
qbj_2020_strata

Unnamed: 0,FOI_TEXT,DEVICE_PROBLEM_CODE,DEVICE_PROBLEM_TEXT,DEVICE_REPORT_PRODUCT_CODE
595439,IT WAS REPORTED THAT THE TRANSMITTER LOST CONN...,3283,Wireless Communication Problem,QBJ
2584242,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
896955,IT WAS DETERMINED THAT THE SIGNAL LOSS WAS REL...,3283,Wireless Communication Problem,QBJ
1646123,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
1231243,IT WAS DETERMINED THAT THE SIGNAL LOSS WAS REL...,3283,Wireless Communication Problem,QBJ
...,...,...,...,...
358480,IT WAS REPORTED THAT AN UNEXPECTED CGM APP SHU...,4032,Unintended Application Program Shut Down,QBJ
1020523,IT WAS REPORTED THAT AN APP CRASH ALERT OCCURR...,4032,Unintended Application Program Shut Down,QBJ
91659,IT WAS REPORTED THAT AN APP CRASH ALERT OCCURR...,4032,Unintended Application Program Shut Down,QBJ
1648712,IT WAS REPORTED THAT AN UNEXPECTED CGM APP SHU...,4032,Unintended Application Program Shut Down,QBJ


In [20]:
# 3. Count the 2021 QBJ records for each DEVICE_PROBLEM_CODE and show the 10 most frequent codes
qbj_2021_count = qbj_2021.groupby(['DEVICE_PROBLEM_CODE']).size().to_frame('COUNT')
qbj_2021_count.sort_values(by=['COUNT'], ascending=False).head(10)

DEVICE_PROBLEM_CODE,COUNT
3283,114490
1435,73778
2896,45551
1307,26407
1480,9094
1559,6256
2591,3003
2907,2954
3191,2406
4032,1939


In [21]:
# 4. Randomly pick 10 records from each of the 10 most frequent DEVICE_PROBLEM_CODE values
#    (100 records in total; the code notes above planned 5 categories and 50 records)
strata_columns = ['FOI_TEXT', 'DEVICE_PROBLEM_CODE', 'DEVICE_PROBLEM_TEXT', 'DEVICE_REPORT_PRODUCT_CODE']
top_codes_2021 = qbj_2021_count.sort_values(by=['COUNT'], ascending=False).head(10).index.tolist()

# Build the sample in one concat instead of appending to an empty dataframe in a loop
qbj_2021_strata = pd.concat(
    [qbj_2021.loc[qbj_2021['DEVICE_PROBLEM_CODE'] == str(code), strata_columns].sample(n=10, random_state=1)
     for code in top_codes_2021])

In [22]:
qbj_2021_strata

Unnamed: 0,FOI_TEXT,DEVICE_PROBLEM_CODE,DEVICE_PROBLEM_TEXT,DEVICE_REPORT_PRODUCT_CODE
2322778,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
2020672,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOU...,3283,Wireless Communication Problem,QBJ
2845620,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
2759955,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
2598875,IT WAS REPORTED THAT SIGNAL LOSS OVER ONE HOUR...,3283,Wireless Communication Problem,QBJ
...,...,...,...,...
2909164,COM-(B)(4).,4032,Unintended Application Program Shut Down,QBJ
962823,IT WAS REPORTED THAT AN UNEXPECTED CGM APP SHU...,4032,Unintended Application Program Shut Down,QBJ
2536320,IT WAS REPORTED THAT AN APP CRASH ALERT OCCURR...,4032,Unintended Application Program Shut Down,QBJ
1754721,IT WAS REPORTED THAT AN UNEXPECTED CGM APP SHU...,4032,Unintended Application Program Shut Down,QBJ


In [23]:
#qbj_2020_strata.to_csv(f"{working_directory}/qbj_2020_strata.csv")
#qbj_2021_strata.to_csv(f"{working_directory}/qbj_2021_strata.csv")

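Instead of saving the stratified samples, we drew a simple random sample of the QBJ data for each year, using the `sample_size` fraction set at the beginning of the notebook. The output files keep the `_strata` name even though this sample is not stratified.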

In [24]:

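# Draw a simple random sample of sample_size (1%) of the QBJ records for each year and write it to disk
qbj_2020.sample(n=int(sample_size * qbj_2020.shape[0]), random_state=1).to_csv(f"{working_directory}/qbj_2020_strata.csv")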
qbj_2021.sample(n=int(sample_size * qbj_2021.shape[0]), random_state=1).to_csv(f"{working_directory}/qbj_2021_strata.csv")
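
A simple random sample preserves the mix of device problem codes only approximately. If exact proportions per code matter, one alternative (not what this notebook does) is to draw the same fraction inside each code group with pandas' `DataFrameGroupBy.sample`. The sketch below uses toy data and a toy `sample_size` of 10%; the toy row counts are invented for illustration.

```python
import pandas as pd

sample_size = 0.10  # fraction to keep (toy value; the notebook uses 0.01)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "DEVICE_PROBLEM_CODE": ["3283"] * 60 + ["1435"] * 30 + ["4032"] * 10,
    "FOI_TEXT": [f"REPORT {i}" for i in range(100)],
})

# Simple random sample: group proportions are only preserved approximately
simple = toy.sample(frac=sample_size, random_state=1)

# Proportional stratified sample: the same fraction is drawn inside each group
stratified = toy.groupby("DEVICE_PROBLEM_CODE").sample(frac=sample_size, random_state=1)

print(stratified["DEVICE_PROBLEM_CODE"].value_counts())
```

Note that with a small fraction such as 1%, very small groups can round to zero rows, so rare device problem codes may still be absent from a proportional sample.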