# Subject metadata

In this notebook, we:
- Read subjects' information from fillable forms 
- Organize the data in a table (Pandas dataframe)
- Save the metadata in a tabular format (.xlsx or .csv)

Note: Run a separate copy of this notebook for each transmittal form template

By Serena Bonaretti

---

Installations:

- If PyPDF2 is not installed on your machine yet, uncomment the code below (remove the #) and run the cell to install it. After installing, comment the line out again (re-insert the #), as you won't need to install the library every time you run this notebook

In [1]:
#! pip install PyPDF2
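
To avoid toggling the comment by hand, one could first check whether the package can be imported. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of the original notebook:

```python
import importlib.util

def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None

# report whether PyPDF2 is available before deciding to run the pip cell
if is_installed("PyPDF2"):
    print("PyPDF2 is already installed")
else:
    print("PyPDF2 is missing: run the pip cell above")
```

The check only looks the package up; it does not import or install anything.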

---
Imports and variables:

In [2]:
import os
import pandas as pd
import PyPDF2

In [3]:
pdf_folder = "./subjects/subjects_3309/"

---
## 1. Getting the names of the *.pdf* files in the folder:

In [4]:
# getting the folder content
folder_content = os.listdir(pdf_folder)

# creating the list for .pdf file names
pdf_file_names = [] 

# getting only .pdf files
for file in folder_content: 
    
    # getting file extensions
    filename, file_extension = os.path.splitext(pdf_folder + file)
    
    # get only the files with a .pdf file extension (case-insensitive)
    if file_extension.lower() == ".pdf":
        pdf_file_names.append(file)
        
print ("-> Found " + str(len(pdf_file_names)) + " .pdf files in folder:" )

for filename in pdf_file_names:
    print (filename)

-> Found 3 .pdf files in folder:
transmittal_3309_subject_1.pdf
transmittal_3309_subject_3.pdf
transmittal_3309_subject_2.pdf
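
The names above come back in the order returned by `os.listdir`, which is arbitrary (note `subject_3` listed before `subject_2`). A possible variant that sorts the names and compares the extension exactly, ignoring case; the helper name `list_pdf_files` is illustrative only:

```python
import os

def list_pdf_files(folder):
    """Return the sorted names of the files in `folder` ending in .pdf (any case)."""
    pdf_names = []
    for name in os.listdir(folder):
        # splitext returns (root, extension); compare the extension exactly
        _, extension = os.path.splitext(name)
        if extension.lower() == ".pdf":
            pdf_names.append(name)
    return sorted(pdf_names)
```

With this helper, the cell above would reduce to `pdf_file_names = list_pdf_files(pdf_folder)`. The sorting is alphabetical, so `subject_10` would still come before `subject_2`.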


---
## 2. Extracting information from fillable forms

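- To read the fillable forms, we use PyPDF2 (version 1.26.0, see Dependencies; `PdfFileReader` and `getFields` belong to the 1.x API and were removed in PyPDF2 3.0)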
- For each file, we will get two lists: 
    - `keys`, containing all the fields (e.g. *sample_ID*, etc.)
    - `values`, containing all the actual values (e.g. *440*, etc.)  
- Then we save the `keys` of the first subject into the list `all_keys` - we do not need to save the keys for every subject because they are the same   
  The values in `all_keys` will become the column names of the table
- Finally for each subject, we add the list `values` to the list of lists `all_values`  
  The values in `all_values` will become the content of the table

In [5]:
# initializing list containing keys and values
all_keys = []
all_values = []

# for each .pdf file in the folder
for i in range(0, len(pdf_file_names)): 
       
    # read fillable form for the current subject
    f = PyPDF2.PdfFileReader(pdf_folder + pdf_file_names[i])
    ff = f.getFields()
    
    # get keys for the current subject
    current_keys = list(ff.keys())
    
    # get the values for the current subject
    current_values = []
    for k,v in ff.items():
        if "/V" in v.keys():
            current_values.append(v["/V"])
        else:
            current_values.append("")
        
    # save the keys of the first subject in the variable all_keys
    if i == 0:
        all_keys = current_keys

    # add the values of the current subject to all_values
    all_values.append(current_values)
    
# print (all_keys)
# print (all_values)
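
The loop above relies on every form listing its fields in the same order as the first one. If that is not guaranteed, the values can be looked up by key instead. A minimal sketch, using plain dictionaries as stand-ins for the objects returned by `getFields()` (an assumption made so that the example runs without any PDF); the helper name `values_in_key_order` and the form contents are hypothetical:

```python
def values_in_key_order(fields, keys):
    """Return the "/V" entry of each field in `keys` order, or "" if absent."""
    values = []
    for key in keys:
        field = fields.get(key, {})      # a missing field behaves like an empty one
        values.append(field.get("/V", ""))
    return values

# stand-ins for two forms whose fields are stored in a different order
form_1 = {"sex": {"/V": "F"}, "height_cm": {"/V": "180"}, "comments_1": {}}
form_2 = {"height_cm": {"/V": "173"}, "comments_1": {}, "sex": {"/V": "M"}}

all_keys = list(form_1.keys())
all_values = [values_in_key_order(form, all_keys) for form in (form_1, form_2)]
print(all_values)
```

Each row of `all_values` then follows the order of `all_keys`, whatever order the form stored its fields in.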

--- 
## 3. Creating a metadata table 

- We want to create a metadata table containing the subject information read from the *.pdf* form fields  
  
- To handle tables, we use the python package [Pandas](https://pandas.pydata.org/), imported at the beginning of the notebook

In [6]:
# display all pandas columns and rows 
pd.options.display.max_rows    = None
pd.options.display.max_columns = None

In [7]:
# create dataframe (=table)
subjects_info = pd.DataFrame(all_values, columns = all_keys)

# adding column with file names in position 0
subjects_info.insert(0, "file_name", pdf_file_names)

# show dataframe
subjects_info

Unnamed: 0,file_name,birth_date,sex,side_per_clinician,height_cm,pat_name,fractures_surgeries,metal_in_VOI,scanner_id_1,pat_no_1,meas_no_1,ctr_file_1,ref_line_1,saved_scout_1,side_1,comments_1,tech_1,weight_kg,study_ID,time_FU_6mo,time_BL,recent_imaging
0,transmittal_3309_subject_1.pdf,14 Mar 1958,F,L,180,EUA_001,NO,NO,3309,2745,9538,78,208,YES,L,,RT,90,3309_HAND,,x,NO
1,transmittal_3309_subject_3.pdf,04 Aug 1955,F,L,175,CJS_3043R,NO,NO,3309,3643,13628,78,206,NO,L,,RT,70,3309_HAND,,x,NO
2,transmittal_3309_subject_2.pdf,15 Jun 1970,M,R,173,EUA_002,Hand surgery,NO,3309,2746,11111,77,204,YES,R,,RT,65,3309_HAND,x,x,MRI


- If the transmittal form contains several *meas_no* fields, we merge them into a single *meas_no* column 
  - Note: This step is needed later, when merging the tables containing subject, protocol, and image information (see notebook merge_and_query.ipynb)

In [8]:
# find all the fields containing meas_no
meas_no_fields = []
for field in subjects_info.columns:
    if "meas_no" in field:
        meas_no_fields.append(field)

# if there are more than 1, we need to merge them
if len(meas_no_fields) > 1:
    
    print ("Merging the columns " + str(meas_no_fields) + " in one column called meas_no")
    
    # prepare data for the merging
    for field in meas_no_fields:
        # replace empty cells with 0
        subjects_info[field] = subjects_info[field].replace({"": "0"})
        # transform cell content from strings to integers
        subjects_info[field] = subjects_info[field].astype(int)
   
    # rename the first column containing meas_no_x to meas_no 
    subjects_info = subjects_info.rename(columns={meas_no_fields[0]: "meas_no"})
    # merge all cells to the first one
    for i in range (1, len(meas_no_fields)):
        subjects_info["meas_no"] += subjects_info[meas_no_fields[i]]
        # delete the column that got merged
        subjects_info = subjects_info.drop(columns=[meas_no_fields[i]])
    
    # make sure the resulting column contains integers
    subjects_info['meas_no'] = subjects_info['meas_no'].astype(int)
    
# if there is only 1, we have to make sure it is called meas_no
else:
    if meas_no_fields[0] != "meas_no":
        print ("Renaming " + meas_no_fields[0] + " to meas_no")
        subjects_info = subjects_info.rename(columns={meas_no_fields[0]: "meas_no"})
    else:
        print ("No change needed for the column meas_no")

Renaming meas_no_1 to meas_no
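
Because the forms in this folder contain a single *meas_no* field, the merging branch is not exercised here. The following toy example, with made-up values and two hypothetical columns `meas_no_1` and `meas_no_2`, sketches what that branch does, assuming each subject has at most one of the fields filled in:

```python
import pandas as pd

# made-up table: each subject has exactly one of the two fields filled in
toy = pd.DataFrame({
    "pat_name":  ["A", "B", "C"],
    "meas_no_1": ["100", "", "300"],
    "meas_no_2": ["", "200", ""],
})

meas_no_fields = [c for c in toy.columns if "meas_no" in c]

# empty cells become 0 so that the row sum keeps the only filled-in value
merged = (
    toy[meas_no_fields]
    .replace("", "0")
    .astype(int)
    .sum(axis=1)
)

# drop the original fields and store the merged values in one column
toy = toy.drop(columns=meas_no_fields)
toy["meas_no"] = merged
print(toy)
```

Unlike the cell above, this sketch appends `meas_no` as the last column instead of keeping it in the position of the first field. If a subject had two fields filled in, the sum would produce a number that matches neither, so that case would need a separate check.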


---
## 4. Saving the table to a *.csv* or *.xlsx* file  

We can save the dataframe to several different file formats. Here we save it as:  
- *.csv* (open, plain-text format)
- *.xlsx* (Microsoft Excel format)  


In [9]:
# save to csv
subjects_info.to_csv("subjects_info_3309.csv", index=False)

# save to excel (requires an Excel writer such as openpyxl)
# subjects_info.to_excel(pdf_folder + "subjects_info_3309.xlsx", index=False)
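
Passing `index=False` keeps the dataframe's row index out of the file. If the index is written, reading the file back with default settings yields an extra `Unnamed: 0` column. A small round-trip sketch on a made-up table, written to a temporary folder so that no real file is touched (the table contents and the file name are illustrative assumptions):

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"file_name": ["a.pdf", "b.pdf"], "meas_no": [9538, 13628]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy_subjects_info.csv")

    # written without the index: the columns come back unchanged
    toy.to_csv(csv_path, index=False)
    without_index = pd.read_csv(csv_path)

    # written with the index: an unnamed first column appears on reading
    toy.to_csv(csv_path)
    with_index = pd.read_csv(csv_path)

print(list(without_index.columns))
print(list(with_index.columns))
```

Reading the saved file back in this way is also a quick check that the table was written as intended.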

---
## Dependencies

In [10]:
%load_ext watermark
%watermark -v -m -p PyPDF2,pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.22.0

PyPDF2: 1.26.0
pandas: 1.2.4

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 20.5.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

