## Mouse B-ENaC Study Data Loading & Preprocessing

This script preprocesses the data from the Mouse B-ENaC Study. Its primary purpose is to reorganize the data into a more usable format without removing any data.

It reads only the files it needs from the raw dataset folder and writes the processed results to the data folder.

In [1]:
import data_loading
import os
import pandas as pd

# the path of the raw dataset folder
source_folder = "../../../University of Adelaide/Mouse B-ENaC Study/"

# the path of the destination dataset folder
destination_folder = "../../data/Mouse B-ENaC Study/"

# create destination folder if not exists
os.makedirs(destination_folder, exist_ok=True)

### 1. Generating report summary

We first build a summary of the dataset in a single spreadsheet, which includes every scan from the study and its 6 parameters (VDP, MSV, TV, VH, VHSS, VHLS).

The values are scraped from the PDF reports.

We also keep each animal's report `FileName` as a column so that additional information can be extracted from it.

In [2]:
df = data_loading.create_report_summary(source_folder+"PDF_reports/")
df[:5]

Unnamed: 0,ScanName,DatePrepared,VDP(%),MSV(mL/mL),TV(L),VH(%),VHSS(%),VHLS(%),FileName
0,MAL-006531_XV,2023-08-23-15:49:03.475248,14.7,0.39,0.262,39.16,21.23,26.1,MAL-006531_XV_wt.INSP.norm.ventilationReport.pdf
1,MAL-006524_XV,2023-08-23-15:53:24.209860,12.0,0.334,0.245,33.32,17.69,22.01,MAL-006524_XV_wt.INSP.norm.ventilationReport.pdf
2,MAL-006583_XV,2023-08-23-15:52:50.404602,16.9,0.276,0.134,53.12,17.51,40.23,MAL-006583_XV_wt.INSP.norm.ventilationReport.pdf
3,MAL-006504_XV,2023-08-23-15:53:56.931235,12.9,0.367,0.197,39.85,19.17,27.09,MAL-006504_XV_wt.INSP.norm.ventilationReport.pdf
4,MAL-006498_XV,2023-08-23-15:52:35.049030,16.2,0.337,0.342,43.74,18.83,30.9,MAL-006498_XV_tg.INSP.norm.ventilationReport.pdf


We then add a `DiseaseType` column to this DataFrame based on each `FileName`: the suffix `wt` becomes `WT` and `tg` becomes `B-ENaC`.

In [3]:
disease_types = []

for file_name in df["FileName"]:
    disease_type = file_name.split(".")[0].split("_")[-1].upper()
    if disease_type == "TG":
        disease_type = "B-ENaC"
    disease_types.append(disease_type)

df["DiseaseType"] = disease_types
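
The labelling rule in the cell above can also be written as a small standalone function, which makes it easier to check against individual file names. This is a minimal sketch; the function name `disease_type_from_filename` is hypothetical and is not part of `data_loading`.

```python
def disease_type_from_filename(file_name: str) -> str:
    """Return the disease label encoded in a report file name."""
    # The part before the first "." ends in a suffix such as "_wt" or "_tg".
    stem = file_name.split(".")[0]
    suffix = stem.split("_")[-1].upper()
    # "TG" is relabelled as "B-ENaC"; any other suffix is kept as-is.
    return "B-ENaC" if suffix == "TG" else suffix


# File names taken from the report summary shown earlier
print(disease_type_from_filename("MAL-006531_XV_wt.INSP.norm.ventilationReport.pdf"))  # WT
print(disease_type_from_filename("MAL-006498_XV_tg.INSP.norm.ventilationReport.pdf"))  # B-ENaC
```

With a function like this, the column could be filled in one step with `df["FileName"].map(disease_type_from_filename)`.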

We then drop the `FileName` column and save the DataFrame as a CSV file in the data folder.

In [4]:
df = df.drop('FileName', axis=1)
df.to_csv(destination_folder+"report_summary.csv", index=False)

### 2. Copying 3D csv files

Now we copy all the CSV data from the raw dataset to the destination dataset as well.

In [5]:
data_loading.copy_3d_csvs(source_folder+"csv/", destination_folder+"csv/")

We then rename the CSV files for easier organisation: the disease suffix is moved out of the scan name and becomes its own dot-separated field.

In [6]:
csv_files = [f for f in os.listdir(destination_folder+"csv/") if f.endswith(".csv")]
for csv_file in csv_files:
    scan_name = csv_file.split(".")[0]
    disease_type = scan_name.split("_")[-1].upper()
    if disease_type == "TG":
        disease_type = "B-ENaC"
    new_csv_file = "_".join(scan_name.split("_")[:-1])+"."+disease_type+"."+".".join(csv_file.split(".")[1:])
    os.rename(destination_folder+"csv/"+csv_file, destination_folder+"csv/"+new_csv_file)
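
To make the renaming rule easier to follow, the string manipulation can be separated from the file-system call. The sketch below reproduces the same logic as a pure function; the name `renamed_csv` and the example file name are hypothetical (the example only assumes the CSV files follow the same pattern as the PDF report names).

```python
def renamed_csv(csv_file: str) -> str:
    """Compute the new name for a CSV file without touching the disk."""
    # Split "<scan>_<suffix>.<rest...>" into the scan name and the remaining fields.
    parts = csv_file.split(".")
    scan_name, rest = parts[0], parts[1:]
    # The last "_"-separated token of the scan name is the disease suffix.
    *name_parts, suffix = scan_name.split("_")
    disease_type = suffix.upper()
    if disease_type == "TG":
        disease_type = "B-ENaC"
    # Rebuild as "<scan>.<DiseaseType>.<rest...>".
    return ".".join(["_".join(name_parts), disease_type, *rest])


# Hypothetical file name, for illustration only
print(renamed_csv("MAL-006498_XV_tg.INSP.norm.csv"))  # MAL-006498_XV.B-ENaC.INSP.norm.csv
```

Note that the rule is not idempotent: applied to an already renamed file, it would treat `XV` as the suffix, so the renaming cell is best run once on freshly copied files.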

### 3. Updating from metadata

We now update the report summary with the additional information from `metadata.xlsx` (Sex, Age, Weight).

In [9]:
# Open metadata.xlsx and extract necessary info into a dict
df = pd.read_excel(source_folder+"metadata.xlsx", skiprows=2)
df = df.drop(index=0)
df.reset_index(drop=True, inplace=True)
mouse_dict = {}
for i, subject in enumerate(df["Subject "]):
    subject += "_XV"
    subject_dict = {}
    subject_dict["sex"] = df["Sex"][i].upper()
    subject_dict["weight"] = df["Weight"][i]
    subject_dict["age"] = df["Age"][i]
    mouse_dict[subject] = subject_dict

# Open the report summary and add new info
df = pd.read_csv(destination_folder+"report_summary.csv")

sex = []
weight = []
age = []
for scan_name in df["ScanName"]:
    sex.append(mouse_dict[scan_name]["sex"])
    weight.append(mouse_dict[scan_name]["weight"])
    age.append(mouse_dict[scan_name]["age"])


df["Sex"] = sex
df["Weight(G)"] = weight
df["Age(D)"] = age

df.to_csv(destination_folder+"report_summary.csv", index=False)
df

Unnamed: 0,ScanName,DatePrepared,VDP(%),MSV(mL/mL),TV(L),VH(%),VHSS(%),VHLS(%),DiseaseType,Sex,Weight(G),Age(D)
0,MAL-006531_XV,2023-08-23-15:49:03.475248,14.7,0.39,0.262,39.16,21.23,26.1,WT,F,21.5,89
1,MAL-006524_XV,2023-08-23-15:53:24.209860,12.0,0.334,0.245,33.32,17.69,22.01,WT,M,29.0,88
2,MAL-006583_XV,2023-08-23-15:52:50.404602,16.9,0.276,0.134,53.12,17.51,40.23,WT,M,27.8,80
3,MAL-006504_XV,2023-08-23-15:53:56.931235,12.9,0.367,0.197,39.85,19.17,27.09,WT,F,20.0,97
4,MAL-006498_XV,2023-08-23-15:52:35.049030,16.2,0.337,0.342,43.74,18.83,30.9,B-ENaC,M,25.8,97
5,MAL-006502_XV,2023-08-23-15:54:34.799341,14.4,0.381,0.21,36.94,20.32,23.81,WT,F,20.3,97
6,MAL-006503_XV,2023-08-23-15:57:16.651112,17.6,0.345,0.268,43.5,21.84,26.53,B-ENaC,F,24.0,97
7,MAL-006527_XV,2023-08-23-15:43:35.312556,11.1,0.365,0.229,36.77,17.56,23.78,WT,F,22.3,90
8,MAL-006548_XV,2023-08-23-15:55:19.756459,17.4,0.328,0.246,46.01,20.55,31.05,B-ENaC,F,21.3,86
9,MAL-006529_XV,2023-08-23-15:52:10.296985,15.7,0.382,0.297,39.15,19.71,29.26,B-ENaC,F,22.8,90
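
The lookup `mouse_dict[scan_name]` in the cell above raises a `KeyError` if a scan has no matching row in the metadata. A quick check before the loop can list such scans up front. This is a minimal sketch; `find_missing_scans` is a hypothetical helper, and `MAL-000000_XV` is a made-up scan name used only to trigger the check.

```python
def find_missing_scans(scan_names, mouse_dict):
    """Return the scan names that have no entry in the metadata lookup."""
    return [name for name in scan_names if name not in mouse_dict]


# One real entry from the summary above plus one made-up scan name
mouse_dict = {"MAL-006531_XV": {"sex": "F", "weight": 21.5, "age": 89}}
scan_names = ["MAL-006531_XV", "MAL-000000_XV"]

missing = find_missing_scans(scan_names, mouse_dict)
print(missing)  # ['MAL-000000_XV']
```

In the notebook, the same check could be run as `find_missing_scans(df["ScanName"], mouse_dict)` before appending the new columns.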
