# Data Preparation: Notebook Exploration
2023-07-26 ZD  

This notebook will explore the Key Programs file provided to FNL by the NCI Office of Data Sharing (ODS). The goal is to build functions that will load, clean, and export the file for downstream use. **This notebook will not be used for production processes.** The functions developed in this notebook will be moved to scripts for more standardized pipeline use when ready.   

The Key Programs CSV is a curated list of key research programs annotated with associated funding and administrative details. Its format is unusual because it is an export from Qualtrics survey software. In the future, Principal Investigators or Program Officers are expected to submit study details to ODS via Qualtrics; those responses can then be exported and sent to FNL, making the process more repeatable.

**Update:** Functions from this notebook have been reformatted and copied to `modules/data_preparation.py` for use within `main.py`. 

In [1]:
import pandas as pd

# Method to import from parent directory
import os
import sys
root_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
sys.path.append(root_dir)

import config

In [2]:
# Add ../ to filepaths to make them work in /notebooks directory
csv_filepath = f"../{config.QUALTRICS_CSV_PATH}"
version = f"../{config.QUALTRICS_VERSION}"
col_dict = config.QUALTRICS_COLS
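
For readers without access to `config.py`, the sketch below shows the shape the column mapping might take. The key/value pairs are inferred from the raw and renamed headers shown in this notebook's outputs; only a subset is listed, and the real `config.py` may be organized differently.

```python
# Hypothetical excerpt of config.py, reconstructed from notebook outputs.
# Maps raw Qualtrics question text to short, standardized column names.
QUALTRICS_COLS = {
    "Name of Key Program": "program_name",
    "Acronym for key program": "program_acronym",
    "Focus Area (select all that apply)": "focus_area",
    "DOC": "doc",
    "Primary Contact (PI)": "contact_pi",
    "Login ID": "login_id",
}

# The first key is the first question column in the export.
first_key = list(QUALTRICS_COLS.keys())[0]
print(first_key)
```

Dictionary order matters in a mapping like this, because the notebook uses the first key as the search string when locating the header row.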

In [3]:
# Load raw file into pandas dataframe
df_raw = pd.read_csv(csv_filepath)

# Check top 5 rows
df_raw.head()

Unnamed: 0,StartDate,EndDate,Status,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,RecipientLastName,...,Q8,Q9,Q10,Q11,Q12,Q13,Q14,Q18,11,Login ID
0,Start Date,End Date,Response Type,IP Address,Progress,Duration (in seconds),Finished,Recorded Date,Response ID,Recipient Last Name,...,Primary Contact (PI),Primary Contact (PI) email,NIH Contact (Program Officer/Program Director),NIH Contact (Program Officer/Program Director)...,"NOFO number (eg. format as ""RFA-CA-00-000"") (I...",Grant/Award number {parent award FORMAT #LL#CA...,Link to program website,Link to data or DCC if available,What type of cancer is the primary focus of th...,Login ID
1,"{""ImportId"":""startDate"",""timeZone"":""America/De...","{""ImportId"":""endDate"",""timeZone"":""America/Denv...","{""ImportId"":""status""}","{""ImportId"":""ipAddress""}","{""ImportId"":""progress""}","{""ImportId"":""duration""}","{""ImportId"":""finished""}","{""ImportId"":""recordedDate"",""timeZone"":""America...","{""ImportId"":""_recordId""}","{""ImportId"":""recipientLastName""}",...,"{""ImportId"":""QID8_TEXT""}","{""ImportId"":""QID9_TEXT""}","{""ImportId"":""QID10_TEXT""}","{""ImportId"":""QID11_TEXT""}","{""ImportId"":""QID12_TEXT""}","{""ImportId"":""QID13_TEXT""}","{""ImportId"":""QID14_TEXT""}","{""ImportId"":""QID18_TEXT""}","{""ImportId"":""QID16""}","{""ImportId"":""Login ID""}"
2,7/18/2023 12:42,7/18/2023 12:44,IP Address,73.130.95.160,100,103,TRUE,7/18/2023 12:44,R_2OPgJ9zimeK6aTZ,,...,"MAITRA, ANIRBAN",amaitra@mdanderson.org,"Hildesheim, Jeff; UJHAZY, PETER",hildesheimj@mail.nih.gov; ujhazyp@mail.nih.gov,RFA-CA-21-041; RFA-CA-21-042,1 U24 CA274274-01,https://www.cancer.gov/about-nci/organization/...,,Pancreas Cancer,670373
3,7/18/2023 12:46,7/18/2023 12:48,IP Address,73.130.95.160,100,154,TRUE,7/18/2023 12:48,R_2wdDKMcjOMkYf46,,...,"BUTTE, ATUL J",atul.butte@ucsf.edu,Christine Nadeau;Joanna Watson,christine.nadeau@nih.gov;watsonjo@mail.nih.gov,PAR14-239; PAR-16-059; PAR-17-245; PAR-20-131,5U24CA195858,https://www.cancer.gov/about-nci/organization/...,,This program focuses on cancer broadly - not l...,383545
4,7/18/2023 12:50,7/18/2023 12:51,IP Address,73.130.95.160,100,68,TRUE,7/18/2023 12:51,R_1gkFDttXYeA3sLX,,...,"BULT, CAROL J",carol.bult@jax.org,"Smith, Malcolm",Malcolm.Smith@nih.gov,RFA-CA-20-034; RFA-CA-14-018; RFA-CA-20-041,U24CA263963,https://ctep.cancer.gov/MajorInitiatives/Pedia...,,This program focuses on cancer broadly - not l...,368030


In [4]:
# Check columns
df_raw.columns.tolist()

['StartDate',
 'EndDate',
 'Status',
 'IPAddress',
 'Progress',
 'Duration (in seconds)',
 'Finished',
 'RecordedDate',
 'ResponseId',
 'RecipientLastName',
 'RecipientFirstName',
 'RecipientEmail',
 'ExternalReference',
 'LocationLatitude',
 'LocationLongitude',
 'DistributionChannel',
 'UserLanguage',
 'Q1',
 'Q17',
 'Q15',
 'Q7',
 'Q8',
 'Q9',
 'Q10',
 'Q11',
 'Q12',
 'Q13',
 'Q14',
 'Q18',
 '11',
 'Login ID']

In [5]:
# Check first row with descriptive headers
df_raw.iloc[0]

StartDate                                                       Start Date
EndDate                                                           End Date
Status                                                       Response Type
IPAddress                                                       IP Address
Progress                                                          Progress
Duration (in seconds)                                Duration (in seconds)
Finished                                                          Finished
RecordedDate                                                 Recorded Date
ResponseId                                                     Response ID
RecipientLastName                                      Recipient Last Name
RecipientFirstName                                    Recipient First Name
RecipientEmail                                             Recipient Email
ExternalReference                                  External Data Reference
LocationLatitude         

In [6]:
# Check second row with Qualtrics ImportId coded headers
df_raw.iloc[1]

StartDate                {"ImportId":"startDate","timeZone":"America/De...
EndDate                  {"ImportId":"endDate","timeZone":"America/Denv...
Status                                               {"ImportId":"status"}
IPAddress                                         {"ImportId":"ipAddress"}
Progress                                           {"ImportId":"progress"}
Duration (in seconds)                              {"ImportId":"duration"}
Finished                                           {"ImportId":"finished"}
RecordedDate             {"ImportId":"recordedDate","timeZone":"America...
ResponseId                                        {"ImportId":"_recordId"}
RecipientLastName                         {"ImportId":"recipientLastName"}
RecipientFirstName                       {"ImportId":"recipientFirstName"}
RecipientEmail                               {"ImportId":"recipientEmail"}
ExternalReference                     {"ImportId":"externalDataReference"}
LocationLatitude         

The provided CSV has three header rows (short column codes, descriptive question text, and Qualtrics ImportId JSON) and 17 leading columns of Qualtrics export metadata that are not needed for our purposes. This metadata may be useful in the future, but for now, remove the extra rows and columns to keep only the program data.

In [7]:
import csv

def find_header_location(csv_filepath, key_value):
    """Detect the row and column where the given key_value is found."""

    # Use csv.reader so quoted fields containing commas stay in one column
    with open(csv_filepath, 'r', newline='') as file:
        for row, fields in enumerate(csv.reader(file)):
            for col, header_value in enumerate(fields):
                if key_value == header_value.strip():
                    return row, col

    # If the loop finishes without finding the key_value, raise an error
    raise ValueError(f"Key value '{key_value}' not found in file.")
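
One caveat when locating a header by position: a quoted field that contains a comma counts as one column to a CSV parser but as two to a plain `split(',')`. The sketch below uses a made-up header line (not taken from the real export) to show how the two approaches can disagree on the column index.

```python
import csv
import io

# Made-up header line: the first field is quoted and contains a comma.
sample = '"Last, First",Status,Name of Key Program\n'

# A plain split treats the embedded comma as a separator.
naive_fields = sample.strip().split(",")
naive_col = naive_fields.index("Name of Key Program")

# csv.reader honors the quotes and keeps the field whole.
parsed_fields = next(csv.reader(io.StringIO(sample)))
parsed_col = parsed_fields.index("Name of Key Program")

print(naive_col, parsed_col)  # prints: 3 2
```

Only the parsed index lines up with the column positions a CSV-aware loader would assign, which is what matters when the result is used to slice away leading columns.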

In [8]:
# Get string of first key column to look for within file
key_value = list(col_dict.keys())[0]

# Get row,col of given header text
header_row, header_col = find_header_location(csv_filepath, key_value)
print(header_row, header_col)

1 17


In [9]:
# Load file, skip leading rows, and then drop leading columns
df = (pd.read_csv(csv_filepath, skiprows=header_row)
      .iloc[:, header_col:])

# Show first 5 rows of df
df.head()

Unnamed: 0,Name of Key Program,Acronym for key program,Focus Area (select all that apply),DOC,Primary Contact (PI),Primary Contact (PI) email,NIH Contact (Program Officer/Program Director),NIH Contact (Program Officer/Program Director) email,"NOFO number (eg. format as ""RFA-CA-00-000"") (If more than one, separate with ; semicolon)","Grant/Award number {parent award FORMAT #LL#CA######, eg. 5UG3CA260607} (If more than one, separate with ; semicolon)",Link to program website,Link to data or DCC if available,What type of cancer is the primary focus of the program? (Check all that\napply),Login ID
0,"{""ImportId"":""QID1_TEXT""}","{""ImportId"":""QID17_TEXT""}","{""ImportId"":""QID15""}","{""ImportId"":""QID7""}","{""ImportId"":""QID8_TEXT""}","{""ImportId"":""QID9_TEXT""}","{""ImportId"":""QID10_TEXT""}","{""ImportId"":""QID11_TEXT""}","{""ImportId"":""QID12_TEXT""}","{""ImportId"":""QID13_TEXT""}","{""ImportId"":""QID14_TEXT""}","{""ImportId"":""QID18_TEXT""}","{""ImportId"":""QID16""}","{""ImportId"":""Login ID""}"
1,Pancreatic Adenocarcinoma Stromal Reprograming...,PSRC/PASSCODE,DCC,"DCB,DCTD","MAITRA, ANIRBAN",amaitra@mdanderson.org,"Hildesheim, Jeff; UJHAZY, PETER",hildesheimj@mail.nih.gov; ujhazyp@mail.nih.gov,RFA-CA-21-041; RFA-CA-21-042,1 U24 CA274274-01,https://www.cancer.gov/about-nci/organization/...,,Pancreas Cancer,670373
2,Oncology Models Forum (U24),OMF,"DCC,Cancer Moonshot",DCB,"BUTTE, ATUL J",atul.butte@ucsf.edu,Christine Nadeau;Joanna Watson,christine.nadeau@nih.gov;watsonjo@mail.nih.gov,PAR14-239; PAR-16-059; PAR-17-245; PAR-20-131,5U24CA195858,https://www.cancer.gov/about-nci/organization/...,,This program focuses on cancer broadly - not l...,383545
3,Pediatric Preclinical in Vivo Testing (PIVOT),PIVOT,"Pediatric/AYA,DCC,Cancer Moonshot","CIB,CTEP,DCTD","BULT, CAROL J",carol.bult@jax.org,"Smith, Malcolm",Malcolm.Smith@nih.gov,RFA-CA-20-034; RFA-CA-14-018; RFA-CA-20-041,U24CA263963,https://ctep.cancer.gov/MajorInitiatives/Pedia...,,This program focuses on cancer broadly - not l...,368030
4,Cancer Immunologic Data Commons,CIDC,Cancer Moonshot,DCTD,"CERAMI, ETHAN",cerami@jimmy.harvard.edu,"THURIN, MAGDALENA",thurinm@mail.nih.gov,RFA-CA-17-006; RFA-CA-22-038,1U24CA224316,https://dctd.cancer.gov/ResearchNetworks/cimac...,,This program focuses on cancer broadly - not l...,102692


The given column names match the survey question text and need to be shortened and standardized. Additionally, there is a row beneath the headers with question IDs. These could be used as new headers, but I'll drop them and rename the main headers for downstream clarity.
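
For reference, if the question IDs were ever wanted as headers, each cell in that row is a small JSON object that can be parsed directly. This is a minimal sketch and not part of the pipeline; `extract_import_id` is a hypothetical helper name, and the sample cells are copied from the ImportId row shown in this notebook.

```python
import json

# Cell values copied from the ImportId row in the notebook output.
import_id_row = [
    '{"ImportId":"QID1_TEXT"}',
    '{"ImportId":"QID17_TEXT"}',
    '{"ImportId":"QID15"}',
]

def extract_import_id(cell):
    """Return the ImportId string stored in a Qualtrics header cell."""
    return json.loads(cell)["ImportId"]

import_ids = [extract_import_id(cell) for cell in import_id_row]
print(import_ids)  # prints: ['QID1_TEXT', 'QID17_TEXT', 'QID15']
```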

In [10]:
# Use dictionary to check for unexpected or missing columns in data
actual_cols = df.columns.tolist()
expected_cols = list(col_dict.keys())

assert actual_cols == expected_cols, "Column names do not match expected."
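
When this assertion fails, it does not say which columns are at fault. A small helper along the following lines could report the difference; `describe_column_mismatch` is a hypothetical name, and the inputs are made up for illustration.

```python
def describe_column_mismatch(actual_cols, expected_cols):
    """Return (missing, unexpected) column names, each sorted."""
    missing = sorted(set(expected_cols) - set(actual_cols))
    unexpected = sorted(set(actual_cols) - set(expected_cols))
    return missing, unexpected

# Made-up inputs for illustration only.
missing, unexpected = describe_column_mismatch(
    actual_cols=["DOC", "Login ID", "Extra Column"],
    expected_cols=["DOC", "Login ID", "Link to program website"],
)
print(missing, unexpected)
```

Note that the set comparison ignores column order, whereas the list equality in the assertion also checks it, so this helper would complement the assertion rather than replace it.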

In [11]:
# Rename columns with defined dictionary
df = df.rename(columns=col_dict)

# Drop second header row with survey question IDs
df = df.drop(axis=0, index=0).reset_index(drop=True)

# Check that the last column is login_id and then drop it
assert (df.columns[-1] == "login_id"), ( 
        f"Unexpected final column: {df.columns[-1]}")
df = df.drop(columns=["login_id"])

df.head()

Unnamed: 0,program_name,program_acronym,focus_area,doc,contact_pi,contact_pi_email,contact_nih,contact_nih_email,nofo,award,program_link,data_link,cancer_type
0,Pancreatic Adenocarcinoma Stromal Reprograming...,PSRC/PASSCODE,DCC,"DCB,DCTD","MAITRA, ANIRBAN",amaitra@mdanderson.org,"Hildesheim, Jeff; UJHAZY, PETER",hildesheimj@mail.nih.gov; ujhazyp@mail.nih.gov,RFA-CA-21-041; RFA-CA-21-042,1 U24 CA274274-01,https://www.cancer.gov/about-nci/organization/...,,Pancreas Cancer
1,Oncology Models Forum (U24),OMF,"DCC,Cancer Moonshot",DCB,"BUTTE, ATUL J",atul.butte@ucsf.edu,Christine Nadeau;Joanna Watson,christine.nadeau@nih.gov;watsonjo@mail.nih.gov,PAR14-239; PAR-16-059; PAR-17-245; PAR-20-131,5U24CA195858,https://www.cancer.gov/about-nci/organization/...,,This program focuses on cancer broadly - not l...
2,Pediatric Preclinical in Vivo Testing (PIVOT),PIVOT,"Pediatric/AYA,DCC,Cancer Moonshot","CIB,CTEP,DCTD","BULT, CAROL J",carol.bult@jax.org,"Smith, Malcolm",Malcolm.Smith@nih.gov,RFA-CA-20-034; RFA-CA-14-018; RFA-CA-20-041,U24CA263963,https://ctep.cancer.gov/MajorInitiatives/Pedia...,,This program focuses on cancer broadly - not l...
3,Cancer Immunologic Data Commons,CIDC,Cancer Moonshot,DCTD,"CERAMI, ETHAN",cerami@jimmy.harvard.edu,"THURIN, MAGDALENA",thurinm@mail.nih.gov,RFA-CA-17-006; RFA-CA-22-038,1U24CA224316,https://dctd.cancer.gov/ResearchNetworks/cimac...,,This program focuses on cancer broadly - not l...
4,CANCER IMMUNE MONITORING AND ANALYSIS CENTERS,CIMAC,Cancer Moonshot,"CTEP,DCTD",,,"Thurin, Magdalena",thurinm@mail.nih.gov,RFA-CA-17-005;RFA-CA-22-038,,https://dctd.cancer.gov/ResearchNetworks/cimac...,,This program focuses on cancer broadly - not l...


In [12]:
df.columns.tolist()

['program_name',
 'program_acronym',
 'focus_area',
 'doc',
 'contact_pi',
 'contact_pi_email',
 'contact_nih',
 'contact_nih_email',
 'nofo',
 'award',
 'program_link',
 'data_link',
 'cancer_type']

Looks good. Ready to save as CSV for reference and then use downstream. (Originally saved as TSV, then changed to CSV to prevent column shifting.)
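
The export call itself is not shown in this notebook, and the cause of the earlier column shifting is not recorded here. One common source of shifted columns is a delimiter or newline character inside a field value, which quoting protects against. As a library-agnostic sketch using only the standard `csv` module, the round trip below checks that comma-containing values like those in this dataset survive a write and read.

```python
import csv
import io

# One record using values that appear in this notebook's output.
record = {
    "program_acronym": "PSRC/PASSCODE",
    "doc": "DCB,DCTD",
    "contact_pi": "MAITRA, ANIRBAN",
}

# Write with the csv module's default (minimal) quoting.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)

# Read it back and confirm no value moved to a neighboring column.
buffer.seek(0)
round_tripped = next(csv.DictReader(buffer))
print(round_tripped == record)  # prints: True
```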

Done. Functionality ported to `modules/data_preparation.py` and included in `main.py`.