# Introduction

This notebook extracts human pluripotent stem cell (hPSC) records from the Cellosaurus dataset into a pandas DataFrame. The data comes from the flat file `cellosaurus.txt`, downloaded from the Cellosaurus FTP server.

---

## Dataset Information

- **File Name:** `cellosaurus.txt`
- **Version:** 49.0
- **Last Update:** 02-May-2024

---

## License

The Cellosaurus dataset is shared under the **Creative Commons Attribution 4.0 International (CC BY 4.0)** license. This license allows you to:
- Copy and redistribute the Cellosaurus in any medium or format.
- Remix, transform, and build upon the dataset for any purpose, including commercial use.

In return, you must give appropriate credit, provide a link to the license, and indicate if any changes were made.

More information about the license is available here: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).

For any questions regarding the Cellosaurus dataset, please contact: [cellosaurus@sib.swiss](mailto:cellosaurus@sib.swiss)


In [1]:
# set up
from google.colab import drive
drive.mount('/content/drive')

%run '/content/drive/My Drive/hPSC-FAIRness Analysis/scripts/setup_drive.py'

root_dir, data_dir, processed_dir, results_dir = setup_drive()

MessageError: Error: credential propagation was unsuccessful

# 1. Download the Text File


*   Fetch the `cellosaurus.txt` file from the Cellosaurus FTP server (`ftp.expasy.org`).
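As a hedged alternative to the download cell that follows, the same transfer can be wrapped in a small function that uses `ftplib.FTP` as a context manager, so the connection is closed even if the transfer fails. The function name `download_file` and its `ftp_factory` parameter are hypothetical, introduced here only for illustration and to allow testing without network access.

```python
from ftplib import FTP


def download_file(host, remote_path, local_path, ftp_factory=FTP):
    """Download one file over anonymous FTP and return the server's reply to RETR.

    `ftp_factory` is a hypothetical hook (not part of the notebook) that lets a
    stand-in class replace ftplib.FTP, for example in an offline test.
    """
    # The with-statement closes the connection even if the transfer raises.
    with ftp_factory(host) as ftp:
        ftp.login()  # anonymous login: no username or password
        with open(local_path, 'wb') as local_file:
            return ftp.retrbinary(f'RETR {remote_path}', local_file.write)


# Example call (commented out because it needs network access):
# download_file('ftp.expasy.org',
#               '/databases/cellosaurus/cellosaurus.txt',
#               'cellosaurus.txt')
```

The returned string is the server's final reply to the `RETR` command, which can be logged to confirm that the transfer completed.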


In [None]:
# Download cellosaurus.txt from ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt

from ftplib import FTP

# Define FTP server and file details
ftp_server = 'ftp.expasy.org'
ftp_path = '/databases/cellosaurus/cellosaurus.txt'
local_filename = 'cellosaurus.txt'

# Connect to the FTP server
ftp = FTP(ftp_server)
ftp.login()  # No username and password required for this FTP server

# Download the file
with open(local_filename, 'wb') as local_file:
    ftp.retrbinary(f'RETR {ftp_path}', local_file.write)

# Close the FTP connection
ftp.quit()

'221 Goodbye.'

In [None]:
# Copy the downloaded text file to the data directory on Drive
!cp cellosaurus.txt "{data_dir}/"

# 2. Convert the Text File into a DataFrame

- Read the downloaded text file and parse its content into a pandas DataFrame
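To make the parsing step easier to check in isolation, here is a minimal sketch of the same logic as a function that works on any iterable of lines. The function name `parse_records` and the two-entry sample are hypothetical; the sample only imitates the two-letter field codes and the `//` terminator and is not real Cellosaurus data.

```python
import re

# Two-letter field code, exactly three spaces, then a value that does not
# start with whitespace (the same pattern as in the notebook cell).
FIELD_PATTERN = re.compile(r'^([A-Z]{2})\s{3}(?!\s)(.*)$')


def parse_records(lines):
    """Parse flat-file lines into a list of dicts, one dict per entry."""
    records = []
    current = {}
    for raw_line in lines:
        line = raw_line.strip()
        # A blank line or the '//' terminator closes the current entry.
        if not line or line.startswith('//'):
            if current:
                records.append(current)
                current = {}
            continue
        match = FIELD_PATTERN.match(line)
        if not match:
            continue  # ignore lines that are not "XX   value"
        code, value = match.groups()
        if code not in current:
            current[code] = value
        elif isinstance(current[code], list):
            current[code].append(value)
        else:
            # Second occurrence of a field: promote the value to a list.
            current[code] = [current[code], value]
    if current:
        records.append(current)
    return records


# Hypothetical sample for illustration only (not real Cellosaurus entries).
sample = """\
ID   Example line A
AC   CVCL_XXX1
DR   Database1; 123
DR   Database2; 456
CA   Induced pluripotent stem cell
//
ID   Example line B
AC   CVCL_XXX2
CA   Cancer cell line
//
""".splitlines()

records = parse_records(sample)
print(len(records), records[0]['DR'])
```

A field that occurs once stays a plain string, while a repeated field such as `DR` becomes a list, so a column in the resulting DataFrame can hold a mix of strings and lists.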

In [None]:
# setup_drive.py may already import these; repeating them keeps this cell runnable on its own
import os
import re

import pandas as pd

# Match "XX   value": a two-letter field code, exactly three spaces, then the value
pattern = re.compile(r'^([A-Z]{2})\s{3}(?!\s)(.*)$')


# Initialize variables
data = []
current_record = {}

# Open the file and process line by line
with open('cellosaurus.txt', 'r') as file:
    for line in file:
        line = line.strip()

        # A blank line or the '//' terminator ends the current record
        if not line or line.startswith('//'):
            if current_record:
                data.append(current_record)
                current_record = {}
            continue

        # Match the line against the pattern; non-matching lines are ignored
        match = pattern.match(line)
        if match:
            field_name, value = match.groups()

            if field_name not in current_record:
                current_record[field_name] = value
            else:
                # Repeated field: convert the existing value to a list if it is not one yet
                if not isinstance(current_record[field_name], list):
                    current_record[field_name] = [current_record[field_name]]
                # Add the new value to the list
                current_record[field_name].append(value)

# Handle the last record if it doesn't end with '//'
if current_record:
    data.append(current_record)

# Convert the list of records to a DataFrame
df = pd.DataFrame(data)

# Set the Cellosaurus ID (RRID) as the index
df.set_index('AC', inplace=True)

# Save the full, unfiltered table to the data directory
df.to_excel(os.path.join(data_dir, 'Cellosaurus.xlsx'), index=True)

# Display the first few rows of the DataFrame
#print(df.head())

# 3. Filter for hPSC Lines
- Apply filters to the DataFrame to retain only the records related to human pluripotent stem cells (hPSCs) based on relevant criteria.
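The filter combines two boolean masks: an exact match on the species field (`OX`) and a membership test on the category field (`CA`). The sketch below applies the same criteria to a small toy DataFrame; the accessions and rows are hypothetical and exist only to show which records survive.

```python
import pandas as pd

HUMAN_OX = 'NCBI_TaxID=9606; ! Homo sapiens (Human)'
HPSC_CATEGORIES = ['Embryonic stem cell', 'Induced pluripotent stem cell']

# Hypothetical toy records for illustration (not real Cellosaurus entries).
toy = pd.DataFrame(
    {
        'OX': [
            HUMAN_OX,
            HUMAN_OX,
            'NCBI_TaxID=10090; ! Mus musculus (Mouse)',
            HUMAN_OX,
        ],
        'CA': [
            'Embryonic stem cell',
            'Cancer cell line',
            'Embryonic stem cell',
            'Induced pluripotent stem cell',
        ],
    },
    index=pd.Index(['CVCL_XXX1', 'CVCL_XXX2', 'CVCL_XXX3', 'CVCL_XXX4'], name='AC'),
)

is_human = toy['OX'] == HUMAN_OX                  # exact string match on species
is_pluripotent = toy['CA'].isin(HPSC_CATEGORIES)  # ESC or iPSC category
hpsc = toy.loc[is_human & is_pluripotent]

print(hpsc.index.tolist())
```

Because the species test is an exact string comparison, an entry whose `OX` field was repeated (and therefore stored as a list by the parser) should not match, which means hybrid lines with more than one species are excluded under this criterion.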

In [None]:
# Keep human records (OX) whose category (CA) is embryonic or induced pluripotent stem cell
filtered_df = df.loc[
    (df['OX'] == 'NCBI_TaxID=9606; ! Homo sapiens (Human)') &
    (df['CA'].isin(['Embryonic stem cell', 'Induced pluripotent stem cell']))
]

In [None]:
print(f'Cellosaurus has {df.shape[0]} cell line records')
print(f'Cellosaurus has {filtered_df.shape[0]} hPSC records')

Cellosaurus has 159461 cell line records
Cellosaurus has 21674 hPSC records


# 4. Save the DataFrame
- Save the filtered DataFrame to Google Drive in Excel format.
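One point to consider before saving: fields that repeat within an entry are stored as Python lists, and such cells are likely to be written to Excel as their string representation (brackets and quotes included). A minimal sketch of one way to flatten them first is shown below; the helper name `join_list_cells`, the `' | '` separator, and the toy rows are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def join_list_cells(df, sep=' | '):
    """Return a copy of df in which list-valued cells are joined into one string."""
    def join_if_list(value):
        return sep.join(value) if isinstance(value, list) else value

    # Apply the helper cell by cell, one column at a time.
    return df.apply(lambda column: column.map(join_if_list))


# Hypothetical toy rows for illustration (not real Cellosaurus entries).
toy = pd.DataFrame(
    {
        'ID': ['Example line A', 'Example line B'],
        'DR': [['Database1; 123', 'Database2; 456'], 'Database1; 789'],
    },
    index=pd.Index(['CVCL_XXX1', 'CVCL_XXX2'], name='AC'),
)

flat = join_list_cells(toy)
print(flat.loc['CVCL_XXX1', 'DR'])

# The flattened frame can then be saved with the same call the notebook uses:
# flat.to_excel('hPSC Cellosaurus.xlsx', index=True)
```

The original DataFrame is left unchanged, so the list structure remains available for further analysis in the notebook.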

In [None]:
# save file to drive
filtered_df.to_excel(os.path.join(processed_dir,'hPSC Cellosaurus.xlsx'), index=True)