# Merging information tables 

In this notebook, we:
- Read image information organized in tabular data as dataframes
- Merge the dataframes 
- Save the resulting dataframe


By Serena Bonaretti

---

Imports and variables:

In [1]:
import os
import pandas as pd

In [2]:
folder = "./"

---
## 1. Getting the names of the tabular files in the folder:

In [3]:
# getting the folder content
folder_content = os.listdir(folder)

# creating the list for .csv and .xlsx file names
file_names = []

# getting only .csv and .xlsx files
for file_name in folder_content:

    # getting the file extension
    _, file_extension = os.path.splitext(file_name)

    # keep only .csv or .xlsx files, skipping the merged table saved by a previous run
    if file_extension.lower() in (".csv", ".xlsx") and file_name != "images_info.csv":
        file_names.append(file_name)

print("-> Found " + str(len(file_names)) + " .csv or .xlsx files in folder:")

for filename in file_names:
    print(filename)

-> Found 2 .csv or .xlsx files in folder:
images_3300.csv
images_3309.csv
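
The filtering step can also be written as a small function that works on any list of names, so it can be checked without touching the disk. This is a minimal sketch: the function name `find_table_files` and the example names are hypothetical. Sorting is added because `os.listdir` returns entries in arbitrary order, and a fixed order keeps the merged table reproducible.

```python
import os


def find_table_files(names, exclude=("images_info.csv",)):
    """Return the names ending in .csv or .xlsx (case-insensitive), sorted."""
    selected = []
    for name in names:
        # splitext returns (root, extension); only the extension is needed
        extension = os.path.splitext(name)[1].lower()
        if extension in (".csv", ".xlsx") and name not in exclude:
            selected.append(name)
    return sorted(selected)


# hypothetical folder content, for illustration only
example_names = ["images_3309.csv", "notes.txt", "images_info.csv", "images_3300.csv"]
found = find_table_files(example_names)
print(found)
```

In the notebook, the same function could be called as `find_table_files(os.listdir(folder))`.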


---
## 2. Reading the files and merging them:

In [4]:
# display all pandas columns and rows 
pd.options.display.max_rows    = None
pd.options.display.max_columns = None
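
Setting `pd.options.display` changes the display for the whole session. If the full display is needed for one table only, `pd.option_context` restores the previous settings on exit. A minimal sketch, assuming a small stand-in table (`table` is hypothetical):

```python
import pandas as pd

# stand-in table, for illustration only
table = pd.DataFrame({"value": range(5)})

rows_before = pd.get_option("display.max_rows")

# inside the block, all rows and columns are shown
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    rows_inside = pd.get_option("display.max_rows")
    print(table)

# after the block, the earlier setting is back
rows_after = pd.get_option("display.max_rows")
```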

- Read the first file:

In [5]:
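# build the full path so that reading also works when folder is not "./"
file_path = os.path.join(folder, file_names[0])
_, file_extension = os.path.splitext(file_path)

# pick the reader that matches the file extension
if file_extension.lower() == ".csv":
    subjects_info = pd.read_csv(file_path)
else:
    subjects_info = pd.read_excel(file_path)
subjects_info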

Unnamed: 0,file_name,check,data_type,nr_of_bytes,nr_of_blocks,pat_no,scanner_id,date,n_voxels_x,n_voxels_y,n_voxels_z,total_size_um_x,total_size_um_y,total_size_um_z,slice_thickness_um,pixel_size_um,slice_1_pos_um,min_intensity,max_intensity,mu_scaling,nr_of_samples,nr_of_projections,scan_dist_um,scanner_type,exposure_time,meas_no,site,reference_line_um,recon_algo,pat_name,energy_V,intensity_uA,data_offset
0,S0006767.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,440,3300,2013_03_27,1536,1536,330,125952,125952,27060,82,82,48227,-1766,9928,8192,1536,750,125952,9,100000,1909,4,0,3,MCP_MAIN7,59400,900,5
1,S0006514.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,426,3300,2013_02_12,1536,1536,330,125952,125952,27060,82,82,56270,-2814,10518,8192,1536,750,125952,9,100000,1841,4,0,3,MCP_MAIN2,59400,900,5
2,S0006589.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,431,3300,2013_02_25,1536,1536,330,125952,125952,27060,82,82,48449,-1915,11070,8192,1536,750,125952,9,100000,1863,4,0,3,MCP_MAIN4,59400,900,5
3,S0006755.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,437,3300,2013_03_26,1536,1536,330,125952,125952,27060,82,82,44315,-2936,11935,8192,1536,750,125952,9,100000,1896,4,0,3,MCP_MAIN6,59400,900,5


- Read the remaining files and append their rows to the first table:

In [6]:
for i in range(1, len(file_names)):

    # read current table, using the full path so that folder can be changed
    file_path = os.path.join(folder, file_names[i])
    _, file_extension = os.path.splitext(file_path)
    if file_extension.lower() == ".csv":
        current_subjects_info = pd.read_csv(file_path)
    else:
        current_subjects_info = pd.read_excel(file_path)

    # append the rows of the current table to the merged table
    subjects_info = pd.concat([subjects_info, current_subjects_info])

subjects_info

Unnamed: 0,file_name,check,data_type,nr_of_bytes,nr_of_blocks,pat_no,scanner_id,date,n_voxels_x,n_voxels_y,n_voxels_z,total_size_um_x,total_size_um_y,total_size_um_z,slice_thickness_um,pixel_size_um,slice_1_pos_um,min_intensity,max_intensity,mu_scaling,nr_of_samples,nr_of_projections,scan_dist_um,scanner_type,exposure_time,meas_no,site,reference_line_um,recon_algo,pat_name,energy_V,intensity_uA,data_offset
0,S0006767.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,440,3300,2013_03_27,1536,1536,330,125952,125952,27060,82,82,48227,-1766,9928,8192,1536,750,125952,9,100000,1909,4,0,3,MCP_MAIN7,59400,900,5
1,S0006514.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,426,3300,2013_02_12,1536,1536,330,125952,125952,27060,82,82,56270,-2814,10518,8192,1536,750,125952,9,100000,1841,4,0,3,MCP_MAIN2,59400,900,5
2,S0006589.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,431,3300,2013_02_25,1536,1536,330,125952,125952,27060,82,82,48449,-1915,11070,8192,1536,750,125952,9,100000,1863,4,0,3,MCP_MAIN4,59400,900,5
3,S0006755.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,437,3300,2013_03_26,1536,1536,330,125952,125952,27060,82,82,44315,-2936,11935,8192,1536,750,125952,9,100000,1896,4,0,3,MCP_MAIN6,59400,900,5
0,C0008472.ISQ;1,CTDATA-HEADER_V1,3,0,506886,2745,3309,2011_12_22,768,768,220,62976,62976,18040,82,82,107081,-1695,10801,8192,1536,750,125952,9,100000,9538,4,0,3,EUA_001,59400,1000,0
1,C0010013.ISQ;1,CTDATA-HEADER_V1,3,1557138432,3041286,2746,3309,2012_02_10,1536,1536,330,125952,125952,27060,82,82,88161,-2913,11085,8192,1536,750,125952,9,100000,11111,4,0,3,EUA_002,59400,1000,5
2,CJS_R_C0012934.ISQ;1,CTDATA-HEADER_V1,3,1038093312,2027526,3643,3309,2013_10_01,1536,1536,220,125952,125952,18040,82,82,128525,-12332,20959,8192,1536,750,125952,9,100000,13628,4,0,3,CJS_3043R,59400,900,5
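
In the merged table, the first column repeats its values (0, 1, 2, 3, 0, 1, 2) because each input table brings its own numbering. If a single running numbering is preferred, `pd.concat` accepts `ignore_index=True`, and a leftover `Unnamed: 0` column can be dropped. A minimal sketch on two stand-in tables: the file names are hypothetical, and treating `Unnamed: 0` as disposable is an assumption about the data.

```python
import pandas as pd

# two tiny stand-in tables (hypothetical values, not the real image data)
table_a = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "file_name": ["a.ISQ;1", "b.ISQ;1"],
    "scanner_id": [3300, 3300],
})
table_b = pd.DataFrame({
    "Unnamed: 0": [0],
    "file_name": ["c.ISQ;1"],
    "scanner_id": [3309],
})

# stack the rows and renumber the row labels from 0
merged = pd.concat([table_a, table_b], ignore_index=True)

# drop the leftover column (assumed here to be an old saved index)
merged = merged.drop(columns=["Unnamed: 0"])
print(merged)
```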


---
## 3. Saving the table to a *.csv* or *.xlsx* file  

We can save the dataframe to several different file formats. Here we save it as:  
- *.csv* (open, plain-text format)
- *.xlsx* (Microsoft Excel format; the call is commented out in the cell)  


In [7]:
# save to csv
subjects_info.to_csv("images_info.csv", index=False)

# save to excel
# subjects_info.to_excel("images_info.xlsx", index=False)
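
The `index=False` argument matters when the file is read back: without it, the row labels are written as an unnamed first column, which `pd.read_csv` then loads as `Unnamed: 0`. A minimal sketch comparing both cases, using an in-memory buffer instead of a file on disk (the table and its values are hypothetical):

```python
import io

import pandas as pd

# stand-in table, for illustration only
table = pd.DataFrame({"file_name": ["a.ISQ;1", "b.ISQ;1"], "pat_no": [440, 426]})

# save without the index, then read back
buffer = io.StringIO()
table.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# save with the index (the pandas default), then read back
buffer_with_index = io.StringIO()
table.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```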

---
## Dependencies

In [8]:
%load_ext watermark
%watermark -v -m -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.22.0

pandas: 1.2.4

Compiler    : Clang 10.0.0 
OS          : Darwin
Release     : 20.5.0
Machine     : x86_64
Processor   : i386
CPU cores   : 4
Architecture: 64bit

