# Clean 2020 Data
Earlier processing removed some columns to reduce the width of the data set.

Even so, the data set still contains more than 3 million rows.

It turns out that the following columns must be kept:
- 'DEVICE_REPORT_PRODUCT_CODE'
- 'FOI_TEXT'

In [1]:
# set the year
year = '2020'

In [2]:
# Identify the data directory, working directory, and data files
data_directory = f"./{year}_reprocessed"
working_directory = f"./{year}_clean"
data_file = f"{data_directory}/{year}_data_complete.csv"

import os

# Create the working directory if needed
try:
    os.makedirs(working_directory, exist_ok=True)
except OSError as error:
    print(f"Error creating {working_directory}: {error}")



In [3]:
import pandas as pd


# Read the data into a pandas dataframe
data = pd.read_csv(data_file,           # The data file defined in the cell above
                   on_bad_lines='warn', # Warn about malformed lines and skip them instead of raising an error
                   dtype='str')         # Read every column as text so codes such as '001' keep their leading zeros

In [4]:
# Replace missing values (which pandas reads as NaN) with an empty string.
data.fillna('', inplace=True)

In [5]:
print(f"Number of: (Rows, Columns) = {data.shape}")

Number of: (Rows, Columns) = (3856740, 34)
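
The effect of reading every column as text and then filling missing values can be checked on a small in-memory sample. This is a minimal sketch, not part of the original notebook: `sample_csv` and its rows are made-up stand-ins that mimic values shown elsewhere in this data set.

```python
import io

import pandas as pd

# Hypothetical sample: two rows, one with a missing occupation code.
sample_csv = io.StringIO(
    "GENERIC_NAME,UDI-DI,REPORTER_OCCUPATION_CODE\n"
    "DEFIBRILLATION LEAD,05414734502085,001\n"
    "CENTRAL MONITORING STATION,00851725007023,\n"
)

# Same options as the notebook: everything is read as a string.
as_text = pd.read_csv(sample_csv, dtype="str")

# For comparison only: let pandas infer the column types.
sample_csv.seek(0)
as_inferred = pd.read_csv(sample_csv)

# Missing values arrive as NaN; replace them with empty strings.
as_text = as_text.fillna("")

print(as_text["UDI-DI"].tolist())      # leading zeros preserved
print(as_inferred["UDI-DI"].tolist())  # parsed as integers, zeros lost
print(as_text["REPORTER_OCCUPATION_CODE"].tolist())
```

Reading as text keeps identifiers intact, at the cost of having to convert any genuinely numeric column by hand later.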


## Remove Unwanted Columns

In [6]:
# Remove unwanted columns
unwanted_columns = [
    'MDR_REPORT_KEY',
    'MDR_TEXT_KEY',
    'TEXT_TYPE_CODE',
    'PATIENT_SEQUENCE_NUMBER',
    'DATE_REPORT',
    'DEVICE_SEQUENCE_NO',
    'BRAND_NAME',
    'MANUFACTURER_D_NAME',
    'MODEL_NUMBER',
    'DEVICE_AVAILABILITY',
    'REPORT_NUMBER',
    'REPORT_SOURCE_CODE',
    'NUMBER_DEVICES_IN_EVENT',
    'DATE_RECEIVED',
    'INITIAL_REPORT_TO_FDA',
    'MANUFACTURER_G1_NAME',
    'REMEDIAL_ACTION',
    'EVENT_TYPE',
    'MANUFACTURER_NAME',
    'TYPE_OF_REPORT',
    'SUMMARY_REPORT',
    'NOE_SUMMARIZED',
    #'UDI-DI',
    #'UDI-PUBLIC',
]

data.drop(unwanted_columns, axis=1, inplace=True)

In [7]:
print(f"Number of: (Rows, Columns) = {data.shape}")

Number of: (Rows, Columns) = (3856740, 12)
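
Because the two columns named at the top of this notebook must survive the drop, a quick guard can catch an accidental overlap with the unwanted list. The sketch below is an assumption about how such a check could look; `required_columns`, the sample frame, and the shortened `unwanted` list are illustrative only.

```python
import pandas as pd

# Columns this notebook says must be kept.
required_columns = ["DEVICE_REPORT_PRODUCT_CODE", "FOI_TEXT"]

# Hypothetical miniature frame with a few of the real column names.
sample = pd.DataFrame(
    {
        "MDR_REPORT_KEY": ["1", "2"],
        "BRAND_NAME": ["A", "B"],
        "FOI_TEXT": ["TEXT ONE", "TEXT TWO"],
        "DEVICE_REPORT_PRODUCT_CODE": ["LWS", "NVZ"],
    }
)

# Shortened stand-in for the notebook's unwanted_columns list.
unwanted = ["MDR_REPORT_KEY", "BRAND_NAME"]

# Refuse to continue if a required column is about to be dropped.
overlap = sorted(set(required_columns) & set(unwanted))
if overlap:
    raise ValueError(f"Required columns listed as unwanted: {overlap}")

trimmed = sample.drop(columns=unwanted)

# Confirm the required columns are still present afterwards.
missing = [name for name in required_columns if name not in trimmed.columns]
print(f"Missing required columns: {missing}")
```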


In [8]:
data

| | FOI_TEXT | DEVICE_PROBLEM_CODE | DEVICE_PROBLEM_TEXT | GENERIC_NAME | DEVICE_REPORT_PRODUCT_CODE | UDI-DI | UDI-PUBLIC | DATE_OF_EVENT | REPORTER_OCCUPATION_CODE | REPORT_DATE | EVENT_LOCATION | SOURCE_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | THE RESULTS OF THE INVESTIGATION ARE INCONCLUS... | 2993 | Adverse Event Without Identified Device or Use... | DEFIBRILLATION LEAD | LWS | 05414734502085 | 05414734502085 | 12/12/2019 | 001 | | I | COMPANY REPRESENTATIVE,HEALTH |
| 1 | IT WAS REPORTED THAT THE PATIENT EXPIRED. THER... | 2993 | Adverse Event Without Identified Device or Use... | DEFIBRILLATION LEAD | LWS | 05414734502085 | 05414734502085 | 12/12/2019 | 001 | | I | COMPANY REPRESENTATIVE,HEALTH |
| 2 | INVESTIGATION RESULTS WILL BE PROVIDED IN THE ... | 1332 | Failure to Interrogate | IMPLANTABLE CARDIOVERTER DEFIBRILLATOR | NVZ | 05414734504386 | 05414734504386 | 12/12/2019 | 000 | | I | COMPANY REPRESENTATIVE,HEALTH |
| 3 | IT WAS REPORTED THAT THE PATIENT CALLED EMERGE... | 1332 | Failure to Interrogate | IMPLANTABLE CARDIOVERTER DEFIBRILLATOR | NVZ | 05414734504386 | 05414734504386 | 12/12/2019 | 000 | | I | COMPANY REPRESENTATIVE,HEALTH |
| 4 | COMMUNICATION FAILURE AND PREMATURE BATTERY DE... | 1332 | Failure to Interrogate | IMPLANTABLE CARDIOVERTER DEFIBRILLATOR | NVZ | 05414734504386 | 05414734504386 | 12/12/2019 | 000 | | I | COMPANY REPRESENTATIVE,HEALTH |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3856735 | A REVIEW OF THE SUBJECT DEVICE DHR CONFIRMED T... | 2937 | Failure of Device to Self-Test | HOLMIUM (HO:YAG) SURGICAL LASERS AND DELIVERY ... | GEX | 07290109140513 | 07290109140513 | 02/18/2020 | | | I | USER FACILITY |
| 3856736 | THE CUSTOMER STATED THAT THE PREFENSE MONITORI... | 3010 | Power Problem | CENTRAL MONITORING STATION | DRG | 00851725007023 | 00851725007023 | 04/14/2020 | | | I | USER FACILITY |
| 3856737 | THE CUSTOMER STATED THAT THE PREFENSE MONITORI... | 4032 | Unintended Application Program Shut Down | CENTRAL MONITORING STATION | DRG | 00851725007023 | 00851725007023 | 04/14/2020 | | | I | USER FACILITY |
| 3856738 | THE CUSTOMER STATED THAT THE PREFENSE MONITORI... | 3010 | Power Problem | CENTRAL MONITORING STATION | DRG | 00851725007023 | 00851725007023 | 04/14/2020 | | | I | USER FACILITY |


## Cleaning Data by Dropping Rows Matching Specific Criteria

Use [this answer on Stack Overflow](https://stackoverflow.com/questions/13851535/how-to-delete-rows-from-a-pandas-dataframe-based-on-a-conditional-expression) as a reference for dropping rows from a dataframe based on a conditional expression, including regular-expression matches.

In summary:
```
# CONDITION is a boolean expression, e.g. previous_data_frame.COLUMN_NAME == "Some Text"
new_data_frame = previous_data_frame.drop(previous_data_frame[CONDITION].index)
```
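
The pattern summarized above can be tried on a toy frame. The following is a minimal sketch with made-up data; it also shows that, when the index labels are unique (as they are after a plain `read_csv`), dropping by index gives the same result as keeping the rows where the condition is false.

```python
import pandas as pd

# Hypothetical four-row frame standing in for previous_data_frame.
previous_data_frame = pd.DataFrame(
    {"COLUMN_NAME": ["Some Text", "Other", "Some Text", "Keep"]}
)

# The condition is a boolean Series aligned with the frame's index.
condition = previous_data_frame.COLUMN_NAME == "Some Text"

# Pattern used in this notebook: look up matching index labels, then drop them.
dropped = previous_data_frame.drop(previous_data_frame[condition].index)

# Equivalent alternative: keep the rows where the condition is False.
filtered = previous_data_frame[~condition]

print(dropped)
print(dropped.equals(filtered))
```

Note that `drop` works on index labels, so with duplicate labels it would remove every row sharing a matched label; the inverted mask does not have that side effect.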

### Drop rows where GENERIC_NAME starts with "UNK" ("UNKNOWN", "UNKOWN", or "UNK")

In [9]:
# Drop rows where GENERIC_NAME starts with "UNK" ("UNKNOWN", "UNKOWN", or "UNK")
remove_generic_name_starts_with_unk = data.drop(data[data.GENERIC_NAME.str.contains(r'^UNK')].index)

print(f"Previous row count = {data.shape[0]}")
print(f"New row count      = {remove_generic_name_starts_with_unk.shape[0]}")
print(f"Rows removed       = {data.shape[0] - remove_generic_name_starts_with_unk.shape[0]}")

Previous row count = 3856740
New row count      = 3853567
Rows removed       = 3173
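
To see which names the `^UNK` pattern catches, it can be applied to a handful of sample values. This is a sketch with invented names, not rows taken from the data; it illustrates that the pattern is anchored to the start of the string and is case-sensitive.

```python
import pandas as pd

# Hypothetical GENERIC_NAME values covering the cases of interest.
names = pd.Series(
    ["UNKNOWN", "UNKOWN", "UNK", "DEFIBRILLATION LEAD", "PUMP, UNKNOWN", "unknown", ""]
)

# Same expression as the notebook: True where the name starts with "UNK".
starts_with_unk = names.str.contains(r"^UNK")

for name, flagged in zip(names, starts_with_unk):
    print(f"{name!r:25} dropped={flagged}")
```

Names that merely contain "UNKNOWN" later in the string, or that use lower case, are left in place by this filter.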


### Drop rows where DEVICE_PROBLEM_TEXT == "Insufficient Information"

In [10]:
# Drop rows where DEVICE_PROBLEM_TEXT == "Insufficient Information"
remove_device_problem_text_insufficient_information = remove_generic_name_starts_with_unk.drop(remove_generic_name_starts_with_unk[remove_generic_name_starts_with_unk.DEVICE_PROBLEM_TEXT == "Insufficient Information"].index)

print(f"Previous row count = {remove_generic_name_starts_with_unk.shape[0]}")
print(f"New row count      = {remove_device_problem_text_insufficient_information.shape[0]}")
print(f"Rows removed       = {remove_generic_name_starts_with_unk.shape[0] - remove_device_problem_text_insufficient_information.shape[0]}")

Previous row count = 3853567
New row count      = 3773732
Rows removed       = 79835


### Drop rows where GENERIC_NAME is a number

In [11]:
# Drop rows where GENERIC_NAME is a number
remove_generic_name_is_number = remove_device_problem_text_insufficient_information.drop(remove_device_problem_text_insufficient_information[remove_device_problem_text_insufficient_information.GENERIC_NAME.str.match(r'^\d+$')].index)
print(f"Previous row count = {remove_device_problem_text_insufficient_information.shape[0]}")
print(f"New row count      = {remove_generic_name_is_number.shape[0]}")
print(f"Rows removed       = {remove_device_problem_text_insufficient_information.shape[0] - remove_generic_name_is_number.shape[0]}")

Previous row count = 3773732
New row count      = 3773714
Rows removed       = 18
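
The `^\d+$` pattern only flags names made up entirely of digits. A minimal sketch with invented values shows what is and is not treated as "a number" by this step:

```python
import pandas as pd

# Hypothetical GENERIC_NAME values; only the first is purely digits.
names = pd.Series(["12345", "3M TAPE", "", "12.5", "CATHETER 5"])

# Same expression as the notebook: one or more digits and nothing else.
is_number = names.str.match(r"^\d+$")

for name, flagged in zip(names, is_number):
    print(f"{name!r:15} dropped={flagged}")
```

Decimal values such as `12.5` and empty strings are not matched, so blanks are handled by a separate step.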


### Drop rows where GENERIC_NAME is blank


In [12]:
# Drop rows where GENERIC_NAME is blank
remove_generic_name_is_blank = remove_generic_name_is_number.drop(remove_generic_name_is_number[remove_generic_name_is_number.GENERIC_NAME == ''].index)
print(f"Previous row count = {remove_generic_name_is_number.shape[0]}")
print(f"New row count      = {remove_generic_name_is_blank.shape[0]}")
print(f"Rows removed       = {remove_generic_name_is_number.shape[0] - remove_generic_name_is_blank.shape[0]}")

Previous row count = 3773714
New row count      = 3761023
Rows removed       = 12691
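
The four filters applied so far could also be expressed as one combined boolean mask. The sketch below is an alternative formulation under that assumption, using an invented five-row frame; the notebook's step-by-step approach has the advantage of reporting how many rows each criterion removes.

```python
import pandas as pd

# Hypothetical frame: only the last row should survive all four filters.
sample = pd.DataFrame(
    {
        "GENERIC_NAME": [
            "UNKNOWN",
            "12345",
            "",
            "DEFIBRILLATION LEAD",
            "CENTRAL MONITORING STATION",
        ],
        "DEVICE_PROBLEM_TEXT": [
            "Power Problem",
            "Power Problem",
            "Power Problem",
            "Insufficient Information",
            "Power Problem",
        ],
    }
)

# One boolean Series per criterion, mirroring the cells above.
starts_with_unk = sample.GENERIC_NAME.str.contains(r"^UNK")
is_number = sample.GENERIC_NAME.str.match(r"^\d+$")
is_blank = sample.GENERIC_NAME == ""
insufficient = sample.DEVICE_PROBLEM_TEXT == "Insufficient Information"

# A row is dropped if it meets any one of the criteria.
to_drop = starts_with_unk | is_number | is_blank | insufficient
kept = sample[~to_drop]

print(f"Previous row count = {sample.shape[0]}")
print(f"New row count      = {kept.shape[0]}")
```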


### Drop Rows Where FOI_TEXT == '(B)(4).'
Reference: [USING TEXT MINING OF FDA REPORTS TO INFORM EARLY SIGNAL DETECTION OF CARDIOVASCULAR LEAD RECALLS](https://dashboard.digitalcollections.cuanschutz.edu/downloads/326bf216-7e24-40b3-80b5-2c9afda1da55)

In [13]:
# Drop rows where FOI_TEXT is only '(B)(4).', allowing an optional space after '(B)' and before the final period
remove_foitext_equals_b4_1 = remove_generic_name_is_blank.drop(remove_generic_name_is_blank[remove_generic_name_is_blank.FOI_TEXT.str.match(r'^\(B\)\s?\(4\)\s?\.$')].index)

print(f"Previous row count = {remove_generic_name_is_blank.shape[0]}")
print(f"New row count      = {remove_foitext_equals_b4_1.shape[0]}")
print(f"Rows removed       = {remove_generic_name_is_blank.shape[0] - remove_foitext_equals_b4_1.shape[0]}")

Previous row count = 3761023
New row count      = 3487640
Rows removed       = 273383


In [14]:
# Drop Rows Where FOI_TEXT == '(B)(4). (B)(4).'
remove_foitext_equals_b4_2 = remove_foitext_equals_b4_1.drop(remove_foitext_equals_b4_1[remove_foitext_equals_b4_1.FOI_TEXT == '(B)(4). (B)(4).'].index)

print(f"Previous row count = {remove_foitext_equals_b4_1.shape[0]}")
print(f"New row count      = {remove_foitext_equals_b4_2.shape[0]}")
print(f"Rows removed       = {remove_foitext_equals_b4_1.shape[0] - remove_foitext_equals_b4_2.shape[0]}")

Previous row count = 3487640
New row count      = 3487524
Rows removed       = 116
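
The two `(B)(4).` steps above handle a single marker (by regular expression) and an exact doubled marker (by equality). The sketch below applies the single-marker pattern to invented `FOI_TEXT` values, then shows a hypothetical generalization, `repeated_marker`, that would match any number of repeated markers in one pass. That generalized pattern is an assumption for illustration and is not used by this notebook.

```python
import pandas as pd

# Hypothetical FOI_TEXT values.
texts = pd.Series(
    [
        "(B)(4).",
        "(B) (4).",
        "(B)(4) .",
        "(B)(4). (B)(4).",
        "(B)(4). DEVICE WAS RETURNED FOR EVALUATION.",
    ]
)

# Single marker, with an optional space after "(B)" and before the period.
single_marker = texts.str.match(r"^\(B\)\s?\(4\)\s?\.$")

# Assumed generalization: one or more markers separated by optional whitespace.
repeated_marker = texts.str.fullmatch(r"(?:\(B\)\s?\(4\)\s?\.\s*)+")

for text, single, repeated in zip(texts, single_marker, repeated_marker):
    print(f"{text!r:50} single={single} repeated={repeated}")
```

Text that begins with a marker but carries additional narrative is kept by both patterns.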


## Count the Product Code Occurrences

In [15]:
product_code_occurrences = remove_foitext_equals_b4_2.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')
product_code_occurrences.sort_values(by=['COUNT'], ascending=False).head(10)

| DEVICE_REPORT_PRODUCT_CODE | COUNT |
|---|---|
| DZE | 354972 |
| QBJ | 276350 |
| OZP | 269978 |
| FRN | 256585 |
| OZO | 236061 |
| OYC | 235809 |
| LGW | 82639 |
| QFG | 71301 |
| LZG | 70815 |
| LWS | 59780 |
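
The counting step can be reproduced on a toy frame. This minimal sketch uses invented product codes and compares the notebook's `groupby(...).size()` approach with `Series.value_counts()`, which returns the same counts already sorted in descending order.

```python
import pandas as pd

# Hypothetical product codes: DZE three times, QBJ twice, LWS once.
codes = pd.DataFrame(
    {"DEVICE_REPORT_PRODUCT_CODE": ["DZE", "QBJ", "DZE", "LWS", "DZE", "QBJ"]}
)

# Approach used in this notebook: group, count, then sort.
by_group = (
    codes.groupby(["DEVICE_REPORT_PRODUCT_CODE"])
    .size()
    .to_frame("COUNT")
    .sort_values(by=["COUNT"], ascending=False)
)

# Alternative: value_counts sorts by frequency by default.
by_value_counts = codes["DEVICE_REPORT_PRODUCT_CODE"].value_counts()

print(by_group)
print(by_value_counts)
```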


In [16]:
product_code_occurrences.sort_values(by=['COUNT'], ascending=False).to_csv(f"{working_directory}/product_code_occurrences.csv")

### Identify Rows to Keep Based on Count of Product Code Occurrences
- QBJ

In [17]:
# Drop rows where DEVICE_REPORT_PRODUCT_CODE is not QBJ
remove_device_product_code_not_qbj = remove_foitext_equals_b4_2.drop(remove_foitext_equals_b4_2[remove_foitext_equals_b4_2.DEVICE_REPORT_PRODUCT_CODE != 'QBJ'].index)
print(f"Previous row count = {remove_foitext_equals_b4_2.shape[0]}")
print(f"New row count      = {remove_device_product_code_not_qbj.shape[0]}")
print(f"Rows removed       = {remove_foitext_equals_b4_2.shape[0] - remove_device_product_code_not_qbj.shape[0]}")

Previous row count = 3487524
New row count      = 276350
Rows removed       = 3211174
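
If more than one product code ever needs to be kept, the same selection can be written with `isin`. The sketch below is an assumption about how that could look; `keep_codes` and the sample rows are illustrative, and with a single code it selects the same rows as the `!= 'QBJ'` drop above.

```python
import pandas as pd

# Hypothetical rows with three different product codes.
sample = pd.DataFrame(
    {
        "DEVICE_REPORT_PRODUCT_CODE": ["QBJ", "DZE", "QBJ", "LWS"],
        "FOI_TEXT": ["TEXT A", "TEXT B", "TEXT C", "TEXT D"],
    }
)

# Illustrative list of codes to keep; extend it to keep several codes.
keep_codes = ["QBJ"]

kept = sample[sample.DEVICE_REPORT_PRODUCT_CODE.isin(keep_codes)]

print(f"Previous row count = {sample.shape[0]}")
print(f"New row count      = {kept.shape[0]}")
print(f"Rows removed       = {sample.shape[0] - kept.shape[0]}")
```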


In [18]:
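# Save the cleaned data for all product codes.
# To save only the QBJ rows, use the commented-out line instead.
#remove_device_product_code_not_qbj.to_csv(f"{working_directory}/{year}_data_clean.csv")
remove_foitext_equals_b4_2.to_csv(f"{working_directory}/{year}_data_clean.csv")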
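
By default, `to_csv` also writes the dataframe index as the first, unnamed column. After the row drops above, that index has gaps, and it comes back as an `Unnamed: 0` column when the file is read again. Whether the index is wanted downstream is not stated in this notebook; the sketch below, using an invented two-row frame and in-memory buffers, only illustrates the difference that `index=False` makes.

```python
import io

import pandas as pd

# Hypothetical frame whose index has gaps, as it would after dropping rows.
sample = pd.DataFrame(
    {"FOI_TEXT": ["TEXT A", "TEXT B"], "DEVICE_REPORT_PRODUCT_CODE": ["QBJ", "LWS"]},
    index=[3, 7],
)

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index, dtype="str")

# With index=False only the data columns are written.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index, dtype="str")

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```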