# Data Stratification

## Goals and Questions
1. Identify what is obvious in the FOI text and whether the unstructured data provides something to predict
    1. What pattern exists in the data, and how can I identify it as a human?
    1. Based on the answer to the previous question, determine how sophisticated the model needs to be. For example, if the text contains “could not be determined”, a simple rule can label the row inconclusive
1. Write code to sample the data across device problem text, so that a human can look for patterns that tie FOI text to each category of Device Problem Text (see Code Notes below)
    1. Use stratification across years, FOI text, and device problem code
        1. Why is this categorized as this device problem text?
        1. Why do these share this category?
        1. Is there something across all of these that would make me categorize it as component failure, inconclusive, or not-component failure?
1. Determine if simple rules will suffice or if more processing is needed
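
A minimal sketch of what "simple rules" could look like, assuming plain substring matching is enough for a first pass. Only the phrase “could not be determined” and the label inconclusive come from the notes above; the function name `label_foi_text`, the `RULES` table, and the fallback label `needs review` are hypothetical.

```python
# Hypothetical rule table: (phrase to look for, label to assign).
# Only the first rule is taken from the notes; add more after reviewing samples.
RULES = [
    ("could not be determined", "inconclusive"),
]

def label_foi_text(text, rules=RULES, default="needs review"):
    """Return the label of the first rule whose phrase appears in the text."""
    if not isinstance(text, str):
        # Missing or non-text values cannot match any rule
        return default
    lowered = text.lower()
    for phrase, label in rules:
        if phrase in lowered:
            return label
    return default

examples = [
    "The root cause COULD NOT BE DETERMINED.",
    "The sensor was replaced.",
    None,
]
labels = [label_foi_text(text) for text in examples]
print(labels)
```

With a dataframe, the same function could be applied per row, for example `df["FOI_TEXT"].apply(label_foi_text)`, assuming the FOI text lives in a column named `FOI_TEXT`. If many sampled rows end up as `needs review`, that suggests more processing is needed than simple rules.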

## Code Notes
Stratify and analyze:

1. Start with 2020 and 2021 clean data
2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE
    1. Remove any that are not assemblies/systems
3. Look at the top 5 DEVICE_PROBLEM_TEXT
4. Randomly pick 10 from each of the top 5 DEVICE_PROBLEM_TEXT for a total of 50 records
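
Steps 3 and 4 could be sketched roughly as follows. This is one possible approach, not necessarily the one used later in the notebook: the helper name `sample_top_categories`, the fixed seed, and the toy dataframe are assumptions for illustration.

```python
import pandas as pd

def sample_top_categories(df, column="DEVICE_PROBLEM_TEXT",
                          top_n=5, per_category=10, seed=0):
    """Randomly pick rows from each of the most frequent values of `column`."""
    # Find the top_n most frequent categories
    top_values = df[column].value_counts().head(top_n).index
    subset = df[df[column].isin(top_values)]
    # Sample each category separately; take the whole group if it is small
    parts = [
        group.sample(n=min(per_category, len(group)), random_state=seed)
        for _, group in subset.groupby(column)
    ]
    return pd.concat(parts)

# Toy data: seven made-up categories with different frequencies
counts = {"A": 30, "B": 25, "C": 20, "D": 15, "E": 12, "F": 3, "G": 1}
toy = pd.DataFrame({
    "DEVICE_PROBLEM_TEXT": [name for name, n in counts.items() for _ in range(n)]
})
toy["FOI_TEXT"] = "placeholder text"

sampled = sample_top_categories(toy)
print(sampled["DEVICE_PROBLEM_TEXT"].value_counts())
```

Sampling each category separately keeps the review set balanced, so a very frequent problem text does not crowd out the others. A fixed `seed` makes the 50 records reproducible between runs.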

In [14]:
import pandas as pd

# Identify the working directory and data files
working_directory = './data_stratification'

# 1. Start with 2020 and 2021 clean data
data_file_2020 = "./2020_clean/2020_data_clean.csv"
data_file_2021 = "./2021_clean/2021_data_clean.csv"

import os

# Create the working directory if needed
try:
    os.makedirs(working_directory, exist_ok=True)
except OSError as error:
    print(f"Error creating {working_directory}: {error}")

In [15]:
# Read the 2020 data into a pandas dataframe
data_2020 = pd.read_csv(data_file_2020,       # The data file being read, from the variable assignment above
                        on_bad_lines='warn',  # Warn about malformed lines and skip them instead of raising an error
                        dtype='str')          # Read every column as a string (no numeric parsing)



In [None]:
# Read the 2021 data into a pandas dataframe
data_2021 = pd.read_csv(data_file_2021,       # The data file being read, from the variable assignment above
                        on_bad_lines='warn',  # Warn about malformed lines and skip them instead of raising an error
                        dtype='str')          # Read every column as a string (no numeric parsing)

In [16]:
# 2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE (currently only QBJ) 

product_code_occurrences_2020 = data_2020.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')
product_code_occurrences_2020.sort_values(by=['COUNT'], ascending=False).head(10)

DEVICE_REPORT_PRODUCT_CODE,COUNT
DZE,354972
QBJ,276350
OZP,269978
FRN,256564
OZO,236059
OYC,235809
LGW,82631
QFG,71301
LZG,70815
LWS,59760


In [17]:
# 2. Look at the top 5 DEVICE_REPORT_PRODUCT_CODE (currently only QBJ) 
product_code_occurrences_2021 = data_2021.groupby(['DEVICE_REPORT_PRODUCT_CODE']).size().to_frame('COUNT')
product_code_occurrences_2021.sort_values(by=['COUNT'], ascending=False).head(10)

DEVICE_REPORT_PRODUCT_CODE,COUNT
DZE,690942
FRN,529085
QBJ,297367
OZP,203161
QFG,176681
PZE,140709
OYC,136923
OZO,116199
LZG,71943
LGW,62534


In [None]:
# Remove any rows for product codes that are not assemblies or systems
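
A hedged sketch of this filtering step. It assumes the "currently only QBJ" note in the earlier cells means QBJ is the only product code kept for now; the set `ASSEMBLY_OR_SYSTEM_CODES` and the helper `keep_assemblies_and_systems` are hypothetical names, and the keep-list would need to be confirmed against the product code definitions.

```python
import pandas as pd

# Assumption: QBJ is the only assembly/system code kept, per the note above.
ASSEMBLY_OR_SYSTEM_CODES = {"QBJ"}

def keep_assemblies_and_systems(df, codes=ASSEMBLY_OR_SYSTEM_CODES):
    """Keep only rows whose product code is in the keep-list."""
    mask = df["DEVICE_REPORT_PRODUCT_CODE"].isin(codes)
    # Copy so later edits do not modify the original dataframe
    return df.loc[mask].copy()

# Small made-up frame standing in for data_2020 / data_2021
demo = pd.DataFrame({"DEVICE_REPORT_PRODUCT_CODE": ["QBJ", "DZE", "QBJ", "FRN"]})
filtered = keep_assemblies_and_systems(demo)
print(filtered)
```

The same call would apply to the real data, for example `keep_assemblies_and_systems(data_2020)`.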