# Patient Data Extraction from PhysioNet

This Jupyter notebook downloads the per-patient `.txt` files from the I-CARE dataset hosted on PhysioNet. Each file holds the metadata of one patient, such as age, hospital, ROSC, and outcome. The goal is to download these files, extract the relevant fields, and store them in a structured format such as a pandas DataFrame.

This file-by-file procedure is necessary because a bulk download of the dataset is not practical: the EEG and ECG recordings alone exceed 1.5 TB.

## Steps
1. Import libraries
2. Define the variables
3. Download the RECORDS file and build the patient list
4. Iterate over all patients and download each `.txt` file
5. Save the data as a CSV file

### 1. Import Required Libraries
First, we import the Python libraries needed for making HTTP requests, parsing HTML, working with data, and managing file paths.

- `requests`: For downloading the patient files from the web.
- `bs4` (`BeautifulSoup`): For parsing the HTML pages that wrap the files.
- `pandas`: To store and manipulate the extracted data.
- `os`: For handling file paths and directory creation.

In [1]:
import requests
import pandas as pd
import os
from bs4 import BeautifulSoup

### 2. Define the variables
The variables, such as the base URL, the patient list, and the file paths, need to be defined.

In [2]:
base_url = "https://physionet.org/content/i-care/2.1/training/"
records_url = base_url + "RECORDS"
patient_numbers = []

In [3]:
destination_folder = "data"
# Create the output folder if it does not exist yet
os.makedirs(destination_folder, exist_ok=True)

### 3. Download the RECORDS file of patients

First, we access the published patient list:
- Request the RECORDS page and read its HTML
- Parse the HTML for the patient records
- Create a list of the patient numbers (only the numeric ID, without the surrounding path)

In [4]:
# Download the RECORDS page to get the list of patient numbers
response = requests.get(records_url)
if response.status_code == 200:
    # The response is an HTML page that wraps the RECORDS listing, so keep the raw HTML for parsing
    patient_numbers_html = response.text
else:
    print(f"Failed to download the RECORDS file (status {response.status_code}).")

In [5]:
# Parse the HTML content
soup = BeautifulSoup(patient_numbers_html, 'html.parser')

# Find the <pre> tag with class "plain" and the <code> tag within it
pre_tag = soup.find('pre', {'class': 'plain'})
code_tag = pre_tag.find('code') if pre_tag else None

# Extract the content of the <code> tag (empty string if the tags are missing)
patient_records = code_tag.text if code_tag else ''

# Split the content into individual records (one per line)
patient_records_list = patient_records.splitlines()

# Clean the list by removing any empty strings
patient_records_list = [record.strip() for record in patient_records_list if record.strip()]

# Clean up the patient numbers by removing 'training/' and the trailing '/'
patient_numbers = [record.split('/')[1] for record in patient_records_list]

# Print the cleaned patient numbers to verify
print(patient_numbers)



['0284', '0286', '0296', '0299', '0303', '0306', '0311', '0312', '0313', '0316', '0319', '0320', '0326', '0328', '0332', '0334', '0335', '0337', '0340', '0341', '0342', '0344', '0346', '0347', '0348', '0349', '0350', '0351', '0352', '0353', '0354', '0355', '0356', '0357', '0358', '0359', '0360', '0361', '0362', '0363', '0364', '0365', '0366', '0367', '0368', '0369', '0370', '0371', '0372', '0373', '0375', '0376', '0377', '0378', '0379', '0380', '0382', '0383', '0384', '0385', '0387', '0389', '0390', '0391', '0392', '0394', '0395', '0396', '0397', '0398', '0399', '0400', '0402', '0403', '0404', '0405', '0406', '0407', '0409', '0410', '0411', '0412', '0413', '0414', '0415', '0416', '0417', '0418', '0419', '0420', '0421', '0422', '0423', '0424', '0426', '0427', '0428', '0429', '0430', '0431', '0432', '0433', '0434', '0435', '0436', '0437', '0438', '0439', '0440', '0441', '0442', '0443', '0444', '0445', '0446', '0447', '0448', '0450', '0451', '0452', '0453', '0455', '0456', '0457', '0458',
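The cleaning steps above can also be written as a small pure function, which makes them easy to check without network access. This is only a minimal sketch: the function name `extract_patient_numbers` and the sample input are hypothetical, and it assumes that each record has the form `training/0284/`, as the code comment above suggests.

```python
def extract_patient_numbers(records_text):
    """Return the patient IDs found in a RECORDS-style listing.

    Assumes one record per line in the form 'training/<id>/'.
    """
    numbers = []
    for line in records_text.splitlines():
        record = line.strip()
        if not record:
            continue  # skip blank lines
        # Drop empty segments caused by leading or trailing slashes
        parts = [part for part in record.split("/") if part]
        if parts:
            # The ID is the last path segment
            numbers.append(parts[-1])
    return numbers


# Hypothetical sample input that mimics the assumed RECORDS format
sample = "training/0284/\ntraining/0286/\n\ntraining/0296/\n"
print(extract_patient_numbers(sample))
```

Taking the last path segment instead of a fixed index keeps the function working even if the prefix in front of the ID changes.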

### 4. Download each patient file and get the information

We now take the list of patients and build the links to the `.txt` files, which hold the patient information: number, hospital, age, sex, ROSC, OHCA, rhythm, TTM, outcome, and CPC.

- Build the link
- Request the page and read its HTML
- Parse the HTML for the needed part
- Create a dictionary from the parsed HTML
- Create a single-row DataFrame and append it to the overall DataFrame
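The link-building step is easy to get subtly wrong, because `base_url` already ends with a slash. The following sketch shows one way to join the parts so that the result contains exactly one slash between segments. The helper name `build_patient_url` is hypothetical, and the `<id>/<id>.txt` layout is taken from the notebook code.

```python
from urllib.parse import urljoin


def build_patient_url(base_url, patient_number):
    """Build the URL of a patient's .txt file below base_url."""
    # urljoin only appends to the path if the base ends with "/",
    # so normalise the base to exactly one trailing slash first
    normalised_base = base_url.rstrip("/") + "/"
    return urljoin(normalised_base, f"{patient_number}/{patient_number}.txt")


base_url = "https://physionet.org/content/i-care/2.1/training/"
print(build_patient_url(base_url, "0284"))
# https://physionet.org/content/i-care/2.1/training/0284/0284.txt
```

The helper returns the same URL whether or not the base ends with a slash.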

In [6]:
# Initialize an empty DataFrame with the expected columns
columns = ['Patient', 'Hospital', 'Age', 'Sex', 'ROSC', 'OHCA', 'Shockable Rhythm', 'TTM', 'Outcome', 'CPC']
df = pd.DataFrame(columns=columns)

# Download each patient's .txt file
for patient_number in patient_numbers:
    # Construct the full URL to the .txt file (base_url already ends with "/")
    file_url = f"{base_url}{patient_number}/{patient_number}.txt"

    # Send the HTTP request to download the file
    response = requests.get(file_url)
    
    # Check if the request was successful
    if response.status_code == 200:
        # Parse the HTML content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Find the <pre> tag with class "plain" and the <code> tag within it
        pre_tag = soup.find('pre', {'class': 'plain'})
        code_tag = pre_tag.find('code') if pre_tag else None

        # Extract the content of the <code> tag (empty string if the tags are missing)
        patient_data = code_tag.text if code_tag else ''

        # Split the content into individual records (one per line)
        patient_data_list = patient_data.splitlines()

        # Clean the list by removing any empty strings
        patient_data_list = [record.strip() for record in patient_data_list if record.strip()]

        # Step 1: Create a dictionary by splitting each line once on ": " (lines without the separator are skipped)
        patient_dict = dict(item.split(": ", 1) for item in patient_data_list if ": " in item)

        # Step 2: Convert the dictionary into a DataFrame
        patient_df = pd.DataFrame([patient_dict])
        
        # Step 3: Concatenate the new row to the existing DataFrame
        df = pd.concat([df, patient_df], ignore_index=True)

    else:
        print(f"Failed to download file for patient {patient_number}")
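The dictionary-building step can be isolated in a helper and tested on its own. The sketch below is one possible variant, not the notebook's exact code: the name `parse_patient_text` is hypothetical and the sample values are invented for illustration. It uses `str.partition`, so a value that itself contains a colon stays intact and lines without a separator are skipped instead of raising an error.

```python
def parse_patient_text(patient_text):
    """Turn 'Key: value' lines into a dictionary of strings."""
    patient_dict = {}
    for line in patient_text.splitlines():
        # Split at the first colon only; the rest of the line is the value
        key, separator, value = line.strip().partition(":")
        if not separator:
            continue  # blank line or no 'Key: value' structure
        patient_dict[key.strip()] = value.strip()
    return patient_dict


# Invented sample values, for illustration only
sample_text = "Patient: 0000\nHospital: X\nAge: 50\n\nOutcome: Good\n"
rows = [parse_patient_text(sample_text)]
print(rows[0])
```

Collecting one such dictionary per patient in a list and calling `pd.DataFrame(rows)` once after the loop would avoid copying the growing DataFrame with `pd.concat` on every iteration.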

### 5. Save the DataFrame to a CSV file
Finally, save the collected data to a CSV file so that other data pipelines can use it.

In [7]:
# Save the DataFrame to a CSV file
df.to_csv('data/raw_patient_data.csv', index=False)

# Confirmation message
print("DataFrame has been saved to 'data/raw_patient_data.csv'")

DataFrame has been saved to 'data/raw_patient_data.csv'
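When the CSV file is read back in a later pipeline, pandas infers the column types. If the `Patient` column stores zero-padded IDs like those in `patient_numbers`, they would be read as integers by default and lose their leading zero. A minimal sketch of the effect, using an in-memory CSV with invented values in place of the saved file:

```python
import io

import pandas as pd

# In-memory stand-in for the saved CSV (invented values)
csv_text = "Patient,Age\n0284,53\n0286,85\n"

# Default type inference turns the ID into an integer and drops the leading zero
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["Patient"].tolist())

# Reading the column as a string keeps the ID exactly as stored
preserved = pd.read_csv(io.StringIO(csv_text), dtype={"Patient": str})
print(preserved["Patient"].tolist())
```

Passing `dtype={"Patient": str}` when loading the file keeps the IDs usable for rebuilding the download URLs.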
