# Introduction


Contributors: Kyle G Young, Sally Shrapnel,...

The pyISARICBasics package has been designed to provide a simple introduction to facilitate exploration and analysis of the ISARIC dataset. We suggest first running the tutorial on the September dataset so that your results match the tutorial outputs. Once you are comfortable using the methods, the tutorial will work with any iteration of the dataset.

The dataset comprises individual domains:

SA = Clinical and Adverse Events 

MB = Microbiology Specimen 

LB = Laboratory Results 

HO = Healthcare Encounters 

DM = Demographics

IN = Treatments and Interventions 

RS = Disease Response and Clinical Classification 

SV = Subject Visits 

RP = Reproductive System Findings 

PO = Pregnancy Outcomes 

DS = Disposition 

ER = Environmental Risk 

IE = Inclusion/Exclusion Criteria 

TI = 

VS = Vital Signs 

SC = Subject Characteristics 


This package contains a Domain class that loads an individual domain, along with several methods and functions to explore and analyse the data within that domain. Data are stored as pandas DataFrames, and the functions use the open source pandas library (https://pandas.pydata.org) to facilitate data analysis, visualisation and manipulation.

The package also provides functionality to load the DataFrames into SQLite for easy browsing using, for example, DB Browser (https://sqlitebrowser.org).

## Getting set up
### 1. Set file paths to data
Set DATA_DIRECTORY to the directory containing your raw ISARIC .csv files, and use DATABASE_FILE to name the SQLite database.

In [1]:
DATA_DIRECTORY = "tests/Tutorial_data"
DATABASE_FILE = "test_db.sqlite"

### 2. Import the Domain Class and key functions from the pyISARICBasics package. 

In [2]:
# PIP install
from pyISARICBasics.domain import Domain
from pyISARICBasics.functions import csv_to_sqlite, df_to_sqlite

### 3. Convert CSV files to SQLite database

The first step in our data exploration / analysis is to convert all of our raw .csv files to a SQLite database. This is useful for browsing with the application DB Browser (https://sqlitebrowser.org).

Unfortunately, reading and writing full SQLite tables as DataFrames is not particularly efficient in Python 3. However, the following function also creates auxiliary .pickle files that contain a serialised version of the pandas DataFrame objects; loading these files is much more efficient. Generating the initial database can take some time (approximately 20 minutes on a laptop). We suggest you let this run and read through the pyISARICBasics documentation in the meantime: https://kyleyoung1997.github.io/pyISARICBasics/index.html

In [3]:
# csv_to_sqlite(DATA_DIRECTORY, DATABASE_FILE)

Partner_HO_2021-09-20
______________________________________________________________________________________________________________________________________________________
Creating table: HO
Length of df  HO 2059828
Partner_ER_2021-09-20
______________________________________________________________________________________________________________________________________________________
Creating table: ER
Length of df  ER 1254087
Partner_IE_2021-09-20
______________________________________________________________________________________________________________________________________________________
Creating table: IE
Length of df  IE 847889
Partner_DS_2021-09-20
______________________________________________________________________________________________________________________________________________________
Creating table: DS
Length of df  DS 695096
Partner_SA_2021-09-20
________________________________________________________________________________________________________________
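
To illustrate the idea of keeping both a browsable SQLite table and a fast-loading pickle, here is a minimal sketch in plain pandas. It is not the implementation of csv_to_sqlite; the toy DataFrame, the table name and the file names are assumptions made for the example.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Toy stand-in for one raw domain file; real ISARIC files are far larger.
raw = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2"],
        "HOTERM": ["ICU", "WARD", "WARD"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "example.sqlite")   # hypothetical file name
    pickle_path = os.path.join(tmp, "HO.pickle")    # hypothetical file name

    # Write the table into SQLite so it can be browsed with external tools.
    con = sqlite3.connect(db_path)
    raw.to_sql("HO", con, if_exists="replace", index=False)
    con.close()

    # Also keep a pickled copy, which pandas can reload directly.
    raw.to_pickle(pickle_path)
    reloaded = pd.read_pickle(pickle_path)

print(reloaded.equals(raw))
```

The pickle round trip preserves column types exactly, which is one reason it is convenient for repeated loading in Python.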

## Exploring an example Domain

For this example, we will use the SA domain. This domain contains (insert details from data dictionary).

The Domain class takes three arguments: Domain(domain, data_directory, num_rows).

1. domain: (string): specifying the name of the domain we wish to load e.g. "SA"
2. data_directory: (string): A path to the directory containing the raw ISARIC .csv's (the previous steps should set this up) 
3. num_rows: (int): An optional argument that can be used to specify how many rows of data we wish to load. If we wish to load all the data we can leave this blank or specify num_rows = None

Some of the ISARIC domains contain a large number of rows. If you wish to perform a quick exploration of the dataset or test individual functions, it can be useful to only load a subset of rows. This is achieved using the third argument, e.g. num_rows = 20. 


In [4]:
SA = Domain("SA", DATA_DIRECTORY, num_rows = None)
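
Loading only the first few rows can be sketched in plain pandas with the nrows argument of read_csv. This is only an illustration of the idea behind num_rows, assuming the package reads the raw .csv files in a similar way; the inline CSV text is made up for the example.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for a raw domain file.
csv_text = "USUBJID,SATERM\nP1,COUGH\nP2,FEVER\nP3,TREMOR\n"

# nrows limits how many data rows are parsed; nrows=None reads everything.
subset = pd.read_csv(io.StringIO(csv_text), nrows=2)
full = pd.read_csv(io.StringIO(csv_text), nrows=None)

print(len(subset), len(full))
```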

### 1. List the columns of the SA domain

The Domain.columns( ) function prints a list of the columns in the current domain.


All the columns in UPPERCASE are unaltered from the original SA.csv file. 

We also have one extra column, 'status', which converts the outcomes from ISARIC / SDTM format into a simple "Y", "N" or "U" (yes, no or unknown).

We will use the convention of lower case for columns like 'status' that have been derived or created here.

Some important columns from the original ISARIC data are:

    SATERM, INTRT, LBTEST, HOTERM - contain the verbatim, non-standardised wording of an event
    xxOCCUR - signifies whether an event occurred or not
    xxPRESP - a value of 'Y' in this column indicates that the event was prespecified on the CRF, while 'N' or missing indicates a spontaneous (or free-text) entry
    xxSTDY - gives the day of an event (relative to admission day)
    
The 'status' column indicates whether an event occurred based on the combination of values in xxPRESP and xxOCCUR as follows: 

| xxPRESP | xxOCCUR | status |
|---------|---------|--------|
| NA      | NA      | Y      |
| Y       | NA      | U      |
| Y       | N       | N      |
| Y       | U       | U      |
| NA      | Y       | Y      |
| Y       | Y       | Y      |


Source code and documentation for this function can be viewed at (https://kyleyoung1997.github.io/pyISARICBasics/domain.html#pyISARICBasics.domain.Domain.process_occur) 
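
As a rough illustration of how such a mapping can be applied, the sketch below encodes only the three (xxPRESP, xxOCCUR) combinations that appear in the example outputs of this tutorial: prespecified with OCCUR = N, prespecified with OCCUR = U, and both missing. The helper lookup_status is hypothetical and is not the package's process_occur method; any other combination returns None here.

```python
import pandas as pd

# (xxPRESP, xxOCCUR) -> status, limited to pairs seen in this tutorial's outputs.
# None stands for a missing value.
OBSERVED_STATUS = {
    ("Y", "N"): "N",
    ("Y", "U"): "U",
    (None, None): "Y",
}


def lookup_status(presp, occur):
    """Return the status for an observed pair, or None if the pair is not covered."""
    key = (
        None if pd.isna(presp) else presp,
        None if pd.isna(occur) else occur,
    )
    return OBSERVED_STATUS.get(key)


rows = pd.DataFrame({"SAPRESP": ["Y", "Y", None], "SAOCCUR": ["N", "U", None]})
rows["status"] = [
    lookup_status(p, o) for p, o in zip(rows["SAPRESP"], rows["SAOCCUR"])
]
print(rows)
```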

In [5]:
SA.columns()

['STUDYID', 'DOMAIN', 'USUBJID', 'SASEQ', 'SATERM', 'SAMODIFY', 'SACAT', 'SASCAT', 'SAPRESP', 'SAOCCUR', 'SASTAT', 'SAREASND', 'SALOC', 'SADY', 'SASTDY', 'SATPT', 'SATPTREF', 'SASTRF', 'SAEVLINT', 'SAEVINTX', 'SARPOC', 'status']


### 2. Explore missingness in each column:

When columns are empty, or have very high missingness, it can be useful to remove them from the dataframe.
As individual patients will usually be associated with multiple rows it can also be useful to identify the number of unique patients.

In [6]:
SA.table_missingness()

Total number of rows: 31923533
Total number of unique patients: 677926
STUDYID            0
DOMAIN             0
USUBJID            0
SASEQ              0
SATERM            13
SAMODIFY      255150
SACAT         110859
SASCAT      19858160
SAPRESP       387793
SAOCCUR       488399
SASTAT      31840251
SAREASND    31840252
SALOC       31918377
SADY        10083301
SASTDY      31679830
SATPT       30634974
SATPTREF    30634974
SASTRF      31918482
SAEVLINT    31201067
SAEVINTX    20720309
SARPOC      31920431
status             0
dtype: int64
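
The same kind of summary can be sketched directly in pandas, assuming a DataFrame with one row per event and a USUBJID column identifying the patient. The toy data below is made up, and this is not necessarily how table_missingness is implemented.

```python
import pandas as pd

# Made-up events: patient P1 has two rows, so rows and patients differ.
df = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3"],
        "SATERM": ["COUGH", "FEVER", "COUGH", None],
        "SASTDY": [1.0, None, None, None],
    }
)

n_rows = len(df)
n_patients = df["USUBJID"].nunique()   # unique patients, not rows
missing_per_column = df.isna().sum()   # count of missing values per column

print("Total number of rows:", n_rows)
print("Total number of unique patients:", n_patients)
print(missing_per_column)
```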


### 3. Exclude columns with high missingness
Excluding these columns from our DataFrame frees up memory and makes computations more time efficient.

In [7]:
SA.exclude_columns(['SASCAT', "SASTAT", "SAREASND", "SALOC", "SATPT", "SATPTREF", "SASTRF", "SAEVINTX", "SARPOC"])
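
Rather than typing the column names by hand, the list of columns to exclude could be derived from the missingness itself. The sketch below uses plain pandas on made-up data; the 0.5 threshold is an arbitrary assumption for the example, not a recommendation from the package.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "USUBJID": ["P1", "P2", "P3", "P4"],
        "SACAT": ["MEDICAL HISTORY", None, "COMPLICATIONS", "COMPLICATIONS"],
        "SALOC": [None, None, None, None],
    }
)

THRESHOLD = 0.5  # assumed cut-off: drop columns missing in more than half the rows

missing_fraction = df.isna().mean()  # fraction of missing values per column
high_missing = missing_fraction[missing_fraction > THRESHOLD].index.tolist()
trimmed = df.drop(columns=high_missing)

print(high_missing)
print(trimmed.columns.tolist())
```

A list built this way could then be passed to exclude_columns in the same form as the hand-written list above.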

### 4. Provide a list of the variables contained within each column.

We can use the Domain.column_events method to identify the variables contained within each column. 

In [8]:
SA.column_events("SACAT")

['MEDICAL HISTORY' 'COMPLICATIONS'
 'SIGNS AND SYMPTOMS AT HOSPITAL ADMISSION'
 'SIGNS AND SYMPTOMS AT FOLLOW-UP' 'NEW DIAGNOSES AT FOLLOW-UP'
 'SIGNS AND SYMPTOMS PRE-COVID-19 DIAGNOSIS' nan 'DAILY CLINICAL FEATURES'
 'SIGNS AND SYMPTOMS AT ICU ADMISSION'
 'SIGNS AND SYMPTOMS AT INITIAL ACUTE COVID-19 ILLNESS']


We can see SACAT (SA Category) has only 9 distinct variables (excluding missing values).

In [9]:
SA.column_events("SAMODIFY")

['OTHER (NOT SPECIFIED)' 'TUBERCULOSIS' 'ARDS' 'PULMONARY EMBOLISM OR DVT'
 'MULTISYSTEM INFLAMMATORY SYNDROME'
 'DISSEMINATED INTRAVASCULAR COAGULATION' 'NEUROLOGICAL COMPLICATION'
 'CHRONIC PULMONARY DISEASE (NOT ASTHMA)' 'HYPERTENSION'
 'DIABETES MELLITUS - TYPE NOT SPECIFIED' 'NOSOCOMIAL SEPSIS' 'OBESITY'
 'SMOKING' 'ACUTE KIDNEY INJURY'
 'CHRONIC CARDIAC DISEASE (NOT HYPERTENSION)' 'MALIGNANT NEOPLASM'
 'ASTHMA' 'HIV' 'OTHER COMPLICATION (NOT SPECIFIED)'
 'CHRONIC KIDNEY DISEASE' 'SHOCK' 'CLINICALLY-DIAGNOSED COVID-19'
 'WEIGHT LOSS' nan 'PULMONARY EMBOLISM' 'PALPITATIONS'
 'COUGH - NON-PRODUCTIVE' 'CONJUNCTIVITIS' 'SEIZURES' 'DIZZINESS'
 'DEEP VEIN THROMBOSIS' 'KIDNEY DISEASE' 'SKIN RASH' 'PARAESTHESIA'
 'SHORTNESS OF BREATH' 'MUSCLE WEAKNESS' 'PROBLEMS WITH BALANCE'
 'PROBLEMS SLEEPING' 'ERECTILE DYSFUNCTION' 'ABDOMINAL PAIN'
 'COUGH - PRODUCTIVE' 'CHEST PAIN' 'PROBLEMS SWALLOWING OR CHEWING'
 'DIARRHOEA' 'FEVER/HISTORY OF FEVER' 'LOSS OF SENSATION'
 'CHANGES IN MENSTRUATION' 'C

We can see SAMODIFY (SA modified term) has many distinct variables.

### 5. Identifying variable missingness.
We can now identify the missingness for a specific variable. For example, if we are interested in 'TREMOR' from the SAMODIFY column:



In [10]:
SA.table_missingness("SAMODIFY", "TREMOR")

Total number of rows: 12272
Total number of unique patients: 11974
STUDYID         0
DOMAIN          0
USUBJID         0
SASEQ           0
SATERM          0
SAMODIFY        0
SACAT           0
SAPRESP        46
SAOCCUR        46
SADY         2785
SASTDY      12272
SAEVLINT       46
status          0
dtype: int64


This output displays the missingness for the 12,272 rows where SAMODIFY is TREMOR. Of the 677,926 unique patients in the SA domain, 11,974 have an entry for TREMOR. None of these 12,272 rows has an associated day (SASTDY) on which the event occurred.
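
Restricting the missingness summary to a single variable amounts to filtering the rows first and then counting missing values. A minimal pandas sketch on made-up data, under the assumption that the filter is an exact match on the SAMODIFY value:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3"],
        "SAMODIFY": ["TREMOR", "TREMOR", "TREMOR", "COUGH"],
        "SADY": [197.0, None, None, 3.0],
    }
)

# Keep only the rows for the variable of interest, then summarise them.
tremor = df[df["SAMODIFY"] == "TREMOR"]
tremor_missing = tremor.isna().sum()
tremor_patients = tremor["USUBJID"].nunique()

print("Total number of rows:", len(tremor))
print("Total number of unique patients:", tremor_patients)
print(tremor_missing)
```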

### 6. Visualising the new DataFrame that we have filtered by SAMODIFY and TREMOR: 

The Domain.select_variables_from_column( ) method returns a pandas DataFrame, so any function contained in the pandas library can be used to further filter this DataFrame.

In [11]:
SA.select_variables_from_column("SAMODIFY", "TREMOR")

There is 11974 unique patients in filtered dataframe


Unnamed: 0,STUDYID,DOMAIN,USUBJID,SASEQ,SATERM,SAMODIFY,SACAT,SAPRESP,SAOCCUR,SADY,SASTDY,SAEVLINT,status
9758147,CVVCORE,SA,CVVCORE_00635-H0001,122,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,197.0,,-P7D,N
9759263,CVMEWUS,SA,CVMEWUS_00604-G965,35,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,,,-P7D,N
9759310,CVMEWUS,SA,CVMEWUS_00604-A295,35,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,,,-P7D,N
9759525,CVMEWUS,SA,CVMEWUS_00604-B768,35,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,,,-P7D,N
9759569,CVMEWUS,SA,CVMEWUS_00604-A780,35,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,,,-P7D,N
...,...,...,...,...,...,...,...,...,...,...,...,...,...
31923036,CVSURVY,SA,CVSURVY_00840_4025,75,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,182.0,,-P7D,N
31923109,CVSURVY,SA,CVSURVY_00840_2536,75,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,182.0,,-P7D,N
31923111,CVSURVY,SA,CVSURVY_00840_3006,75,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,324.0,,-P7D,N
31923450,CVSURVY,SA,CVSURVY_00840_0341,75,TREMOR/SHAKINESS,TREMOR,SIGNS AND SYMPTOMS AT FOLLOW-UP,Y,N,183.0,,-P7D,N


### 7. Create a list of relevant columns 
We can also create a list of specific columns that we're interested in: 

In [12]:
cols_of_interest = ["USUBJID", "SASTDY", "SAMODIFY", "SAPRESP", "SAOCCUR", 'status']
SA.select_variables_from_column("SAMODIFY", "TREMOR")[cols_of_interest]

There is 11974 unique patients in filtered dataframe


Unnamed: 0,USUBJID,SASTDY,SAMODIFY,SAPRESP,SAOCCUR,status
9758147,CVVCORE_00635-H0001,,TREMOR,Y,N,N
9759263,CVMEWUS_00604-G965,,TREMOR,Y,N,N
9759310,CVMEWUS_00604-A295,,TREMOR,Y,N,N
9759525,CVMEWUS_00604-B768,,TREMOR,Y,N,N
9759569,CVMEWUS_00604-A780,,TREMOR,Y,N,N
...,...,...,...,...,...,...
31923036,CVSURVY_00840_4025,,TREMOR,Y,N,N
31923109,CVSURVY_00840_2536,,TREMOR,Y,N,N
31923111,CVSURVY_00840_3006,,TREMOR,Y,N,N
31923450,CVSURVY_00840_0341,,TREMOR,Y,N,N
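
Because the filtered result is an ordinary DataFrame, row filters and column selection can be combined in one step with .loc. The sketch below uses a small made-up frame in place of the real filtered output and keeps only rows where status is "Y", sorted by day.

```python
import pandas as pd

# Made-up stand-in for a DataFrame already filtered to TREMOR.
tremor = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3"],
        "SAMODIFY": ["TREMOR"] * 4,
        "SADY": [197.0, 10.0, None, 324.0],
        "status": ["N", "Y", "N", "Y"],
    }
)

cols = ["USUBJID", "SADY", "status"]

# Boolean mask selects rows, the column list selects columns.
reported = tremor.loc[tremor["status"] == "Y", cols].sort_values("SADY")
print(reported)
```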


### 8. Print row counts for each column.


In [13]:
SA.column_summary("SAMODIFY")

Number of unique patients in domain: 677926
                                                  Number of Rows  Unique Patients
DIABETES MELLITUS - TYPE NOT SPECIFIED                   1050801           652833
OBESITY                                                  1031524           641802
TUBERCULOSIS                                              835876           446252
CHRONIC CARDIAC DISEASE (NOT HYPERTENSION)                823236           662106
MALIGNANT NEOPLASM                                        814181           659526
HIV                                                       663760           659143
CHRONIC KIDNEY DISEASE                                    662347           661794
CHRONIC PULMONARY DISEASE (NOT ASTHMA)                    661730           661310
ASTHMA                                                    659295           658831
SMOKING                                                   644254           644127
ACUTE KIDNEY INJURY                                   

### 9. Print row counts + status for each column

We can print a summary of the variables in each column as well as the 'status' variable. 

In [15]:
SA.column_summary("SAMODIFY", status = True)

Number of unique patients in domain: 677926
                                                         Number of rows  Unique patients
SAMODIFY                                         status                                 
ABDOMINAL PAIN                                   N               228210           212888
                                                 U                32344            32343
                                                 Y                21631            21314
ACUTE CARDIAC INJURY                             N                  532              532
                                                 U                   88               88
                                                 Y                   60               60
ACUTE GASTROENTERITIS                            U                   40               40
                                                 Y                   37               37
ACUTE KIDNEY INJURY                              N               3

We can also specify a subset of variables:

In [16]:
SA.column_summary("SAMODIFY", "ASTHMA", "STROKE", "TUBERCULOSIS", status = True)

Number of unique patients in domain: 677926
                     Number of rows  Unique patients
SAMODIFY     status                                 
ASTHMA       N               478805           478803
             U               133982           133982
             Y                46508            46165
STROKE       N               223490           223402
             U                 9095             9095
             Y                 5018             4921
TUBERCULOSIS N               558528           314849
             U               265096           139133
             Y                12252            10107
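
A comparable table of row counts and unique patients per variable and status can be sketched with a pandas groupby. This is an illustration on made-up data and may differ from how column_summary is implemented; the output column names rows and patients are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3", "P3"],
        "SAMODIFY": ["ASTHMA", "ASTHMA", "ASTHMA", "STROKE", "STROKE"],
        "status": ["Y", "Y", "N", "U", "U"],
    }
)

wanted = ["ASTHMA", "STROKE"]

summary = (
    df[df["SAMODIFY"].isin(wanted)]          # optional subset of variables
    .groupby(["SAMODIFY", "status"])
    .agg(
        rows=("USUBJID", "size"),            # number of rows in each group
        patients=("USUBJID", "nunique"),     # number of distinct patients
    )
)
print(summary)
```

Counting both rows and unique patients makes repeated entries visible: a variable with many more rows than patients has been recorded several times for the same person.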


### 10. Saving the modified dataframe as a sqlite table: 

If we want to browse (or access later) our new filtered DataFrame, we can save it into a SQLite table
(note this takes some time for large domains such as SA and IN).

In [17]:
SA.save_to_sqlite("SA_tutorial_modified", DATA_DIRECTORY, DATABASE_FILE )

This creates a new table in our existing SQLite database as well as a .pickle file for quicker reading and writing in Python.

# Free Text Searches
For most variables in the ISARIC dataset, the xxMODIFY column contains a standardised event name. However, xxTERM contains some spontaneously recorded events that have no entry in xxMODIFY.

For example, we can search the SA domain for terms that might be relevant to kidney stones (for which there is no standardised variable in the 'SAMODIFY' column). We use the Domain.free_text_search( ) method. We can enter any number of search terms as strings separated by commas. The method then searches for these terms in the relevant column and returns a DataFrame with the result.

Note that the Domain.free_text_search( ) method checks whether our search terms are substrings of any raw terms. For example, searching "Kidney" would return rows containing "Acute Kidney Injury" as well as "Kidney Stones".

In [18]:
stones_frame = SA.free_text_search("kidney stones", "nephrolithiasis", "renal calculi")

Free text entries containing any of 'kidney stones or nephrolithiasis or renal calculi' were found in 271 rows


In [19]:
stones_frame

Unnamed: 0,STUDYID,DOMAIN,USUBJID,SASEQ,SATERM,SAMODIFY,SACAT,SAPRESP,SAOCCUR,SADY,SASTDY,SAEVLINT,status
839322,CVZXZMV,SA,CVZXZMV_260259,16,RENAL CALCULI,,MEDICAL HISTORY,,,,,,Y
1338607,CVZXZMV,SA,CVZXZMV_384798,6,KIDNEY STONES,,COMPLICATIONS,,,,,,Y
3655180,CVZXZMV,SA,CVZXZMV_333242,23,KIDNEY STONES AND COVID-19 POSITIVE,,MEDICAL HISTORY,,,,,,Y
4881755,CVZXZMV,SA,CVZXZMV_66860,22,KIDNEY STONES,,MEDICAL HISTORY,,,,,,Y
9178547,CVZXZMV,SA,CVZXZMV_438607,10,ADMITTED FOR KIDNEY STONES,,MEDICAL HISTORY,,,,,,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
31446318,CVCCPUK,SA,CVCCPUK_7A4BV-0880,31,RECURRENT KIDNEY STONES,,MEDICAL HISTORY,,,1.0,,,Y
31534376,CVCCPUK,SA,CVCCPUK_RLT01-0706,7,NEPHROLITHIASIS,,MEDICAL HISTORY,,,1.0,,,Y
31653094,CVCCPUK,SA,CVCCPUK_RBA11-1188,18,RENAL CALCULI,,MEDICAL HISTORY,,,1.0,,,Y
31840748,CVCCPUK,SA,CVCCPUK_RDDH0-1157,17,RENAL CALCULI,,COMPLICATIONS,,,13.0,,,Y


We found 271 free text entries that are relevant for kidney stones. Note that the value of SAPRESP is NaN (missing), as is the value of SAOCCUR. This indicates that the entry was made spontaneously (i.e. it was not a prespecified item on the CRF).
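
The substring matching described in this section can be sketched in pandas with Series.str.contains. The example assumes the search is case-insensitive and runs against the SATERM column; both are assumptions for illustration rather than a description of how free_text_search is implemented.

```python
import re

import pandas as pd

terms = ["kidney stones", "nephrolithiasis", "renal calculi"]

# Escape each term so it is matched literally, then join with "|" (logical OR).
pattern = "|".join(re.escape(term) for term in terms)

df = pd.DataFrame(
    {
        "SATERM": [
            "RENAL CALCULI",
            "ACUTE KIDNEY INJURY",
            "RECURRENT KIDNEY STONES",
            None,
        ]
    }
)

# na=False treats missing free text as "no match" instead of propagating NaN.
hits = df[df["SATERM"].str.contains(pattern, case=False, na=False)]
print(hits)
```

Note that "ACUTE KIDNEY INJURY" is not returned here because the search term is "kidney stones" rather than "kidney"; a shorter term would match more broadly.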

# Vaccination Status Example

The following example uses the functionality we have used thus far to retrieve the vaccination status of patients. 

We will load the IN domain as this contains information about vaccinations. Note we first delete the SA domain from memory to save some space. 

In [20]:
del SA

In [21]:
IN = Domain("IN", DATA_DIRECTORY)

We then inspect the columns:

In [22]:
IN.columns()

['STUDYID', 'DOMAIN', 'USUBJID', 'SPDEVID', 'INSEQ', 'INREFID', 'INTRT', 'INMODIFY', 'INCAT', 'INSCAT', 'INPRESP', 'INOCCUR', 'INCLAS', 'INCLASCD', 'INSTAT', 'INREASND', 'ININDC', 'INDOSE', 'INDOSTXT', 'INDOSU', 'INDOSFRM', 'INDOSFRQ', 'INDOSTOT', 'INROUTE', 'INDY', 'INSTDY', 'INENDY', 'INDUR', 'INTPT', 'INTPTREF', 'INSTRF', 'INEVLINT', 'INEVINTX', 'INCDSTDY', 'status']


Most of those columns are not relevant to vaccination status, so we will keep only the relevant columns.

In [23]:
relevant_cols = ['USUBJID', 'INTRT', 'INMODIFY', 'INPRESP', 'INOCCUR', 'INREFID' ,'INSTDY', 'status']
IN.include_columns(relevant_cols)

We can now look at "INMODIFY" to ascertain what variables are relevant to COVID-19 Vaccination

In [24]:
IN.column_summary("INMODIFY", status = True)

Number of unique patients in domain: 684313
                                                           Number of rows  Unique patients
INMODIFY                                           status                                 
ACE INHIBITOR OR A2 BLOCKER                        N                 1258             1258
                                                   U                   21               21
                                                   Y                  332              332
ACETYLSALICYLIC ACID                               N                   43               43
                                                   Y                27693            27485
ACICLOVIR                                          Y                 4020             3227
ACICLOVIR / VALACICLOVIR                           Y                  117               53
ADRENALINE                                         Y                  183              148
AGENTS ACTING ON THE RENIN-ANGIOTENSIN SYSTEM 

We can then take a closer look at only those variables related to COVID-19 Vaccination: 

In [25]:
IN.column_summary("INMODIFY", 'COVID-19 VACCINATION',
                                            'COVID-19 VACCINE PFIZER-BIONTECH',
                                            'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
                                            'COVID-19 VACCINE TYPE UNKNOWN',
                                            'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD',
                                            'COVID-19 VACCINE CANSINBIO', 
                                            'COVID-19 VACCINE SPUTNIK V',
                                            'COVID-19 VACCINE SINOPHARM', 
                                            'COVID-19 VACCINE MODERNA',
                                            'COVID-19 VACCINE SINOVAC', 
                                            'COVID-19 VACCINE COVAXIN',
                                            status = True)

Number of unique patients in domain: 684313
                                                          Number of rows  Unique patients
INMODIFY                                          status                                 
COVID-19 VACCINATION                              N               115705           114963
                                                  U               396668           396642
                                                  Y                23217            23152
COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD Y                12496             6739
COVID-19 VACCINE CANSINBIO                        Y                    3                3
COVID-19 VACCINE COVAXIN                          Y                   29               29
COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)   Y                 1733             1710
COVID-19 VACCINE MODERNA                          Y                  829              760
COVID-19 VACCINE PFIZER-BIONTECH                  Y     

We can also use our list of INMODIFY variables to return a dataframe with only entries relevant to vaccination: 

In [26]:
covid_vacc = IN.select_variables_from_column("INMODIFY", 'COVID-19 VACCINATION', 
                                            'COVID-19 VACCINE PFIZER-BIONTECH',
                                            'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
                                            'COVID-19 VACCINE TYPE UNKNOWN',
                                            'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD',
                                            'COVID-19 VACCINE CANSINBIO', 
                                            'COVID-19 VACCINE SPUTNIK V',
                                            'COVID-19 VACCINE SINOPHARM', 
                                            'COVID-19 VACCINE MODERNA',
                                            'COVID-19 VACCINE SINOVAC', 
                                            'COVID-19 VACCINE COVAXIN')

There is 532581 unique patients in filtered dataframe


We can then take a look at our filtered dataframe: 

In [27]:
covid_vacc

Unnamed: 0,USUBJID,INTRT,INMODIFY,INPRESP,INOCCUR,INREFID,INSTDY,status
26,CVZXZMV_100028,RECEIVED A COVID-19 VACCINE,COVID-19 VACCINATION,Y,U,,,U
68,CVZXZMV_100189,RECEIVED A COVID-19 VACCINE,COVID-19 VACCINATION,Y,U,,,U
98,CVZXZMV_100556,RECEIVED A COVID-19 VACCINE,COVID-19 VACCINATION,Y,U,,,U
157,CVZXZMV_100578,RECEIVED A COVID-19 VACCINE,COVID-19 VACCINATION,Y,N,,,N
215,CVZXZMV_10060,RECEIVED A COVID-19 VACCINE,COVID-19 VACCINATION,Y,U,,,U
...,...,...,...,...,...,...,...,...
32363652,CVVCORE_286-1107,COVID-19 VACCINATION,COVID-19 VACCINATION,Y,N,,,N
32364746,CVVCORE_449-0447,COVID-19 VACCINATION,COVID-19 VACCINATION,Y,N,,,N
32364763,CVVCORE_449-0459,COVID-19 VACCINATION,COVID-19 VACCINATION,Y,N,,,N
32366361,CVVCORE_520-0601,COVID-19 VACCINATION,COVID-19 VACCINATION,Y,N,,,N


In [28]:
covid_vacc.INTRT.unique()

array(['RECEIVED A COVID-19 VACCINE', 'COVID-19 VACCINATION',
       'COVID-19 VACCINE PFIZER/BIONTECH',
       'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
       'JANSSENS (JOHNSON & JOHNSON)', 'COVID-19 VACCINE TYPE UNKNOWN',
       'PFIZER-BIONTECH', 'ASTRA ZENECA (COVISHIELD)',
       'COVID-19 VACCINE CANSINOBIO', 'COVID-19 VACCINE TYPE OTHER',
       'COVID-19 VACCINE SPUTNIK V', 'MODERNA',
       'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD (COVISHIELD IN INDIA)',
       'COVID-19 VACCINE SINOPHARM', 'COVID-19 VACCINE MODERNA',
       'COVID-19 VACCINE SINOVAC',
       'COVID-19 VACCINE JANSSENS (JOHNSON & JOHNSON)',
       'RECEIVED A COVID-19 VACCINE (OPEN LABEL LICENSED PRODUCT)',
       'COVID-19 VACCINE UNKNOWN TYPE',
       'ASTRAZENECA COVID-19 VACCINATION',
       'COVID-19 VACCINE OXFORD-ASTRAZENECA',
       'COVID-19 VACCINE PFIZER-BIONTECH', 'JANSSEN',
       'ASTRAZENECA VACCINE', 'COVID-19 AZ', 'JOHNSON AND JOHNSON',
       'COVID VACCINE',
       'COVI

In [29]:
covid_vacc.INMODIFY.unique()

array(['COVID-19 VACCINATION', 'COVID-19 VACCINE PFIZER-BIONTECH',
       'COVID-19 VACCINE JANSSENS (JOHNSON AND JOHNSON)',
       'COVID-19 VACCINE TYPE UNKNOWN',
       'COVID-19 VACCINE ASTRAZENECA/UNIVERSITY OF OXFORD',
       'COVID-19 VACCINE CANSINBIO', 'COVID-19 VACCINE SPUTNIK V',
       'COVID-19 VACCINE MODERNA', 'COVID-19 VACCINE SINOPHARM',
       'COVID-19 VACCINE SINOVAC', 'COVID-19 VACCINE COVAXIN'],
      dtype=object)

In [30]:
covid_vacc.INREFID.unique()

array([nan, 'DOSE 1', 'DOSE 2', 'DOSE 3'], dtype=object)

In [31]:
covid_vacc.status.value_counts()

U    396668
N    115705
Y     57211
Name: status, dtype: int64
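
The counts above are per row, and one patient can contribute several rows (for example one per dose). A minimal sketch of counting unique patients per status instead, on made-up data:

```python
import pandas as pd

# Made-up vaccination rows: P1 appears twice (e.g. two doses).
covid = pd.DataFrame(
    {
        "USUBJID": ["P1", "P1", "P2", "P3", "P4"],
        "status": ["Y", "Y", "N", "U", "Y"],
    }
)

rows_per_status = covid["status"].value_counts()
patients_per_status = covid.groupby("status")["USUBJID"].nunique()

print(rows_per_status)
print(patients_per_status)
```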

Great! So now what do we do if we want to save this DataFrame to access it later?

We can use the function df_to_sqlite() which saves a DataFrame into the sqlite database created earlier and as a .pickle which we can load quickly into Python.

In [42]:
df_to_sqlite(covid_vacc, "vacc_status", DATA_DIRECTORY, DATABASE_FILE)

True

As you can see, the function returns True, meaning the write has been successful.
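
Once a table is stored in SQLite, it can be read back later with a SQL query. The sketch below uses an in-memory database and a made-up table so that it runs on its own; with a real database file you would connect to that file instead. It relies only on pandas and the standard sqlite3 module, not on the package's own functions.

```python
import sqlite3

import pandas as pd

# In-memory database standing in for the SQLite file created earlier.
con = sqlite3.connect(":memory:")

vacc = pd.DataFrame({"USUBJID": ["P1", "P2"], "status": ["Y", "N"]})
vacc.to_sql("vacc_status", con, index=False)

# Read back only the rows of interest using SQL.
vaccinated = pd.read_sql_query(
    "SELECT USUBJID FROM vacc_status WHERE status = 'Y'", con
)
con.close()

print(vaccinated)
```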