# Merging information tables 

In this notebook, we:
- Read subject information stored in tabular files (*.csv* or *.xlsx*) into dataframes
- Merge the dataframes into a single dataframe
- Save the resulting dataframe


By Serena Bonaretti

---

Imports and variables:

In [1]:
import os
import pandas as pd

In [2]:
folder = "./"

---
## 1. Getting the names of the tabular files in the folder:

In [3]:
# getting the folder content
folder_content = os.listdir(folder)

# creating the list for .csv and .xlsx file names
file_names = []

# getting only .csv and .xlsx files
for file_name in folder_content:

    # getting the file extension
    file_extension = os.path.splitext(file_name)[1]

    # keep only files with .csv or .xlsx extension, skipping the merged file saved by a previous run
    if file_extension in (".csv", ".xlsx") and file_name != "subjects_info.csv":
        file_names.append(file_name)

print("-> Found " + str(len(file_names)) + " .csv or .xlsx files in folder:")

for filename in file_names:
    print(filename)

-> Found 2 .csv or .xlsx files in folder:
subjects_info_3309.csv
subjects_info_3300.csv
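
The file search above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the original notebook: the function name `find_table_files`, its parameters, and the demo file names are hypothetical. It assumes that extensions should be matched case-insensitively and that a sorted list is wanted, since `os.listdir` returns names in arbitrary order.

```python
import os
import tempfile


def find_table_files(folder, extensions=(".csv", ".xlsx"), exclude=()):
    """Return the sorted names of the files in `folder` with one of the given extensions.

    `find_table_files` is a hypothetical helper written for illustration only.
    """
    found = []
    for name in os.listdir(folder):
        # compare extensions in lower case so that ".CSV" and ".csv" are treated alike
        extension = os.path.splitext(name)[1].lower()
        if extension in extensions and name not in exclude:
            found.append(name)
    # os.listdir gives no guaranteed order, so sort for reproducibility
    return sorted(found)


# demo in a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["a.csv", "b.XLSX", "notes.txt", "subjects_info.csv"]:
        open(os.path.join(demo_folder, name), "w").close()
    demo_files = find_table_files(demo_folder, exclude=("subjects_info.csv",))

print(demo_files)
```

Sorting matters here because the first file in the list decides the column order of the merged table.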


---
## 2. Reading the files and merging them:

In [4]:
# display all pandas columns and rows 
pd.options.display.max_rows    = None
pd.options.display.max_columns = None
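
Setting `pd.options.display` as above changes the display for the rest of the session. If the full display is only needed for a single table, `pd.option_context` restores the previous settings on exit. A minimal sketch with a made-up table (`demo_table` is not part of the notebook data):

```python
import pandas as pd

# made-up table that is long enough to be truncated with the default settings
demo_table = pd.DataFrame({"value": range(100)})

rows_before = pd.get_option("display.max_rows")

# inside the block, all rows and columns are displayed
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    rows_inside = pd.get_option("display.max_rows")
    full_text = repr(demo_table)

# outside the block, the previous setting is back
rows_after = pd.get_option("display.max_rows")

print(rows_before, rows_inside, rows_after)
```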

In [5]:
# read the first file

# build the full path, so that reading also works when folder is not the current directory
file_path = os.path.join(folder, file_names[0])
file_extension = os.path.splitext(file_path)[1]
if file_extension == ".csv":
    subjects_info = pd.read_csv(file_path)
else:
    subjects_info = pd.read_excel(file_path)
subjects_info

Unnamed: 0,file_name,birth_date,sex,side_per_clinician,height_cm,pat_name,fractures_surgeries,metal_in_VOI,scanner_id_1,pat_no_1,meas_no,ctr_file_1,ref_line_1,saved_scout_1,side_1,comments_1,tech_1,weight_kg,study_ID,time_FU_6mo,time_BL,recent_imaging
0,transmittal_3309_subject_1.pdf,14 Mar 1958,F,L,180,EUA_001,NO,NO,3309,2745,9538,78,208,YES,L,,RT,90,3309_HAND,,x,NO
1,transmittal_3309_subject_3.pdf,04 Aug 1955,F,L,175,CJS_3043R,NO,NO,3309,3643,13628,78,206,NO,L,,RT,70,3309_HAND,,x,NO
2,transmittal_3309_subject_2.pdf,15 Jun 1970,M,R,173,EUA_002,Hand surgery,NO,3309,2746,11111,77,204,YES,R,,RT,65,3309_HAND,x,x,MRI
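
Because the same extension check is needed for every file, the reading step can be collected in one function. This is a sketch under assumptions: `read_table` is a hypothetical helper, and raising an error for unknown extensions is a design choice that the notebook does not make (it falls back to `pd.read_excel`). The demo uses a made-up *.csv* file only, since reading *.xlsx* files additionally requires an Excel engine such as `openpyxl`.

```python
import os
import tempfile

import pandas as pd


def read_table(path):
    """Read a .csv or .xlsx file into a dataframe (hypothetical helper)."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".csv":
        return pd.read_csv(path)
    if extension == ".xlsx":
        # requires an Excel engine such as openpyxl to be installed
        return pd.read_excel(path)
    raise ValueError("Unsupported file extension: " + extension)


# demo with a made-up csv file in a throwaway folder
with tempfile.TemporaryDirectory() as demo_folder:
    demo_path = os.path.join(demo_folder, "demo.csv")
    with open(demo_path, "w") as demo_file:
        demo_file.write("pat_name,height_cm\nDEMO_001,180\nDEMO_002,175\n")
    demo_table = read_table(demo_path)

print(demo_table)
```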


In [6]:
# read other files and merge them

for current_file_name in file_names[1:]:

    # read current table, using the full path so that folder can be any directory
    file_path = os.path.join(folder, current_file_name)
    file_extension = os.path.splitext(file_path)[1]
    if file_extension == ".csv":
        current_subjects_info = pd.read_csv(file_path)
    else:
        current_subjects_info = pd.read_excel(file_path)

    # append the current table to the merged one (columns missing from a table are filled with NaN)
    subjects_info = pd.concat([subjects_info, current_subjects_info])

subjects_info

Unnamed: 0,file_name,birth_date,sex,side_per_clinician,height_cm,pat_name,fractures_surgeries,metal_in_VOI,scanner_id_1,pat_no_1,meas_no,ctr_file_1,ref_line_1,saved_scout_1,side_1,comments_1,tech_1,weight_kg,study_ID,time_FU_6mo,time_BL,recent_imaging,time_FU_3mo,pregnant,scanner_id_2,scanner_id_3,pat_no_2,pat_no_3,ctr_file_2,ctr_file_3,ref_line_2,ref_line_3,saved_scout_2,saved_scout_3,side_2,side_3,comments_2,comments_3,tech_2,tech_3,LMP
0,transmittal_3309_subject_1.pdf,14 Mar 1958,F,L,180,EUA_001,NO,NO,3309.0,2745.0,9538,78.0,208.0,YES,L,,RT,90,3309_HAND,,x,NO,,,,,,,,,,,,,,,,,,,
1,transmittal_3309_subject_3.pdf,04 Aug 1955,F,L,175,CJS_3043R,NO,NO,3309.0,3643.0,13628,78.0,206.0,NO,L,,RT,70,3309_HAND,,x,NO,,,,,,,,,,,,,,,,,,,
2,transmittal_3309_subject_2.pdf,15 Jun 1970,M,R,173,EUA_002,Hand surgery,NO,3309.0,2746.0,11111,77.0,204.0,YES,R,,RT,65,3309_HAND,x,x,MRI,,,,,,,,,,,,,,,,,,,
0,transmittal_3300_subject_1.pdf,20 Oct 1960,F,R,160,MCP_MAIN7,NO,NO,3300.0,440.0,1909,77.0,200.0,YES,R,,LG,60,3300_SPECTRA,,x,Ultrasound,,no,,,,,,,,,,,,,,,,,
1,transmittal_3300_subject_2.pdf,10 Mar 1967,F,L,170,MCP_MAIN2,hand surgery,NO,3300.0,426.0,1841,78.0,210.0,YES,L,Subject couldn’t stay still,LG,65,3300_SPECTRA,,x,NO,,NO,,,,,,,,,,,,,,,,,
2,transmittal_3300_subject_3.pdf,24 Apr 1958,M,L,180,MCP_MAIN4,NO,NO,,,1863,,,,,,,90,3300_SPECTRA,x,,NO,,NO,,3300.0,,431.0,,78.0,,205.0,,NO,,L,,,,LG,
3,transmittal_3300_subject_4.pdf,03 Mar 1965,F,L,170,MCP_MAIN6,tendon,NO,,,1896,,,,,,,60,3300_SPECTRA,,,X-rays,x,NO,3300.0,,437.0,,78.0,,208.0,,NO,,L,,,,LG,,
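
Two features of the merged table follow from how `pd.concat` works: the row labels restart at 0 for each input table, and integer columns that gain missing values are converted to floats (for example, 3309 becomes 3309.0). The sketch below reproduces both effects on two made-up tables and shows `ignore_index=True` as one possible way to obtain unique row labels. It is an illustration only, and the table contents are invented.

```python
import pandas as pd

# two made-up tables that share only the "pat_name" column
first = pd.DataFrame({"pat_name": ["DEMO_001", "DEMO_002"], "scanner_id_1": [3309, 3309]})
second = pd.DataFrame({"pat_name": ["DEMO_003"], "scanner_id_2": [3300]})

# default behavior: union of the columns, original row labels kept
merged = pd.concat([first, second])

# same data, but with new row labels 0..n-1
merged_reindexed = pd.concat([first, second], ignore_index=True)

print(merged)
print(merged_reindexed)
```

In `merged`, the row labels are 0, 1, 0, and both scanner columns are stored as floats because each contains at least one NaN.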


---
## 3. Saving the table to a *.csv* or *.xlsx* file  

A dataframe can be saved to several file formats. Here we consider:  
- *.csv* (open, plain-text format), which is the format written below
- *.xlsx* (Microsoft Excel format), whose save command is left commented out  


In [7]:
# save to csv
subjects_info.to_csv("subjects_info.csv", index=False)

# save to excel (note: the file search in section 1 does not exclude this file on a re-run)
# subjects_info.to_excel("subjects_info.xlsx", index=False)
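
The argument `index=False` keeps the row labels out of the saved file. As a minimal sketch of what it changes, the example below saves a made-up table both ways into a throwaway folder and reads the files back (all file and column names here are invented for the demo):

```python
import os
import tempfile

import pandas as pd

# made-up table
table = pd.DataFrame({"pat_name": ["DEMO_001", "DEMO_002"], "height_cm": [180, 175]})

with tempfile.TemporaryDirectory() as demo_folder:
    path_without_index = os.path.join(demo_folder, "without_index.csv")
    path_with_index = os.path.join(demo_folder, "with_index.csv")

    # index=False: only the data columns are written
    table.to_csv(path_without_index, index=False)
    # default (index=True): the row labels are written as an extra, unnamed first column
    table.to_csv(path_with_index)

    columns_without_index = list(pd.read_csv(path_without_index).columns)
    columns_with_index = list(pd.read_csv(path_with_index).columns)

print(columns_without_index)
print(columns_with_index)
```

When the file saved with the index is read back, pandas labels the extra column `Unnamed: 0`.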

---
## Dependencies

In [8]:
%load_ext watermark
%watermark -v -m -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.22.0

pandas: 1.2.4

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 20.5.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit
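
If the `watermark` extension is not installed, similar information can be collected with the standard library. This is only a sketch of an alternative, not the method used in this notebook, and it reports fewer details than the output above:

```python
import platform

import pandas as pd

# collect the main version information in a dictionary
versions = {
    "python": platform.python_version(),
    "pandas": pd.__version__,
    "os": platform.system(),
    "machine": platform.machine(),
}

for name, value in versions.items():
    print(name + ": " + value)
```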

