# Introduction

This notebook loads human pluripotent stem cell (hPSC) data from the ICSCB database into a pandas DataFrame. The data comes from https://icscb.stemcellinformatics.org/result?lang=en, exported via the ‘Download’ tab.

---

## Dataset Information
- **Download Date:** 25 September 2024



In [None]:
# set up
from google.colab import drive
drive.mount('/content/drive')

%run '/content/drive/My Drive/hPSC-FAIRness Analysis/scripts/setup_drive.py'

root_dir, data_dir, processed_dir, results_dir = setup_drive()

Mounted at /content/drive
Mounted at /content/drive
Setting up root directory with name: 'hPSC-FAIRness Analysis'
Root directory path: '/content/drive/My Drive/hPSC-FAIRness Analysis'


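# 1. Download the CSV File
Download the CSV file from the ICSCB website and place it in the data directory as `ICSCB.csv`, which is the path read in the next step.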
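
Because the download is a manual step, it can help to confirm the file is in place before reading it. The following is a minimal sketch, not part of the original notebook: `find_missing_files` is a hypothetical helper, and a temporary directory stands in for `data_dir`.

```python
import os
import tempfile


def find_missing_files(directory, filenames):
    """Return the names in `filenames` that do not exist as files in `directory`."""
    return [
        name
        for name in filenames
        if not os.path.isfile(os.path.join(directory, name))
    ]


# A temporary directory stands in for data_dir (assumption for this demo).
with tempfile.TemporaryDirectory() as demo_dir:
    missing_before = find_missing_files(demo_dir, ["ICSCB.csv"])

    # Simulate placing the downloaded export in the folder.
    with open(os.path.join(demo_dir, "ICSCB.csv"), "w") as handle:
        handle.write("_source,_cellid\n")

    missing_after = find_missing_files(demo_dir, ["ICSCB.csv"])

print(missing_before)
print(missing_after)
```

In the notebook itself, the same helper could be called as `find_missing_files(data_dir, ['ICSCB.csv'])` before the read step.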

# 2. Filter for hPSC Lines

In [None]:
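# pd and os may already be in scope via setup_drive.py; re-importing is harmless
import os
import pandas as pd

# Load the raw ICSCB export
df = pd.read_csv(os.path.join(data_dir, 'ICSCB.csv'))

# info() prints its summary directly; wrapping it in print() would also emit "None"
df.info()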

print(df['_source'].unique())
print(df['stem_cell_type'].unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19259 entries, 0 to 19258
Data columns (total 14 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   _source                                19259 non-null  object
 1   _cellid                                19259 non-null  object
 2   stem_cell_name                         19259 non-null  object
 3   stem_cell_type                         10231 non-null  object
 4   cell_grade                             2873 non-null   object
 5   produced_by                            8876 non-null   object
 6   provider_distributor                   9718 non-null   object
 7   reference_publications                 4593 non-null   object
 8   gender_of_donor                        14538 non-null  object
 9   ethnicity_of_donor                     10159 non-null  object
 10  health_status                          12221 non-null  object
 11  age_of_donor   
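
The summary above shows `stem_cell_type` populated for only 10231 of 19259 rows, so it may be useful to see how the type labels, including missing ones, are distributed across sources before choosing which to keep. The sketch below uses a made-up toy table: `source_B` and `other type` are placeholder values, not entries from ICSCB.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the ICSCB export; values other than 'hPSCreg',
# 'iPS Cell' and 'ES Cell' are placeholders invented for this demo.
toy = pd.DataFrame({
    "_source": ["hPSCreg", "hPSCreg", "source_B", "source_B", "source_B"],
    "stem_cell_type": [np.nan, "iPS Cell", "ES Cell", "other type", np.nan],
})

# Label missing types explicitly so they get their own column in the table.
type_labels = toy["stem_cell_type"].fillna("<missing>")

# Rows: source registry; columns: stem cell type label; cells: row counts.
table = pd.crosstab(toy["_source"], type_labels)
print(table)
```

Applied to `df`, the same two lines would show which sources leave `stem_cell_type` empty and which labels each source uses.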

In [None]:
# Define the list of stem cell types to keep
allowed_stem_cell_types = [
    'iPS Cell',
    'ES Cell',
    'Human iPS cell lines',
    'Induced pluripotent stem cell line'
]

# Filter the DataFrame
filtered_df = df[(df['_source'] == 'hPSCreg') | (df['stem_cell_type'].isin(allowed_stem_cell_types))]

filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16471 entries, 0 to 17704
Data columns (total 14 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   _source                                16471 non-null  object
 1   _cellid                                16471 non-null  object
 2   stem_cell_name                         16471 non-null  object
 3   stem_cell_type                         8997 non-null   object
 4   cell_grade                             1295 non-null   object
 5   produced_by                            8795 non-null   object
 6   provider_distributor                   8488 non-null   object
 7   reference_publications                 4548 non-null   object
 8   gender_of_donor                        11802 non-null  object
 9   ethnicity_of_donor                     7634 non-null   object
 10  health_status                          9797 non-null   object
 11  age_of_donor        
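
Note that the filtered frame has 16471 rows but only 8997 non-null `stem_cell_type` values. Since `isin` never matches a missing value, rows without a type can only have been kept through the `_source == 'hPSCreg'` clause. The sketch below reproduces this on a toy table; `source_B` and `other type` are placeholders invented for the demo.

```python
import numpy as np
import pandas as pd

# Same allow-list as in the notebook cell above.
allowed_stem_cell_types = [
    "iPS Cell",
    "ES Cell",
    "Human iPS cell lines",
    "Induced pluripotent stem cell line",
]

# Toy stand-in for the ICSCB export (placeholder values).
toy = pd.DataFrame({
    "_source": ["hPSCreg", "source_B", "source_B", "source_B"],
    "stem_cell_type": [np.nan, "iPS Cell", "other type", np.nan],
})

# Clause 1: keep every hPSCreg row, whatever its type (even if missing).
from_hpscreg = toy["_source"] == "hPSCreg"

# Clause 2: keep rows whose type is on the allow-list; NaN evaluates to False.
type_allowed = toy["stem_cell_type"].isin(allowed_stem_cell_types)

kept = toy[from_hpscreg | type_allowed]
print(kept)
```

Here row 0 survives only because of its source, row 1 only because of its type, and rows 2 and 3 are dropped.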

# 3. Save the DataFrame
- Save the filtered DataFrame to Google Drive in Excel format.

In [None]:
# save file to drive
filtered_df.to_excel(os.path.join(processed_dir,'hPSC ICSCB.xlsx'), index=True)
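
With `index=True`, the first column of the Excel file holds the row labels of `filtered_df`. Those labels are inherited from the unfiltered table (the summary above reports 16471 entries labelled 0 to 17704), so they are not consecutive. The sketch below illustrates this on a toy frame with placeholder values; whether to keep the original labels or renumber them is a choice for the downstream analysis, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the ICSCB export (placeholder values).
toy = pd.DataFrame({
    "_source": ["hPSCreg", "source_B", "hPSCreg"],
    "stem_cell_name": ["line_1", "line_2", "line_3"],
})

# Filtering keeps the original row labels, leaving a gap at label 1.
kept = toy[toy["_source"] == "hPSCreg"]
original_labels = list(kept.index)

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = kept.reset_index(drop=True)
renumbered_labels = list(renumbered.index)

print(original_labels)
print(renumbered_labels)
```

Keeping the original labels makes it possible to trace a row back to its position in `ICSCB.csv`; renumbering gives a cleaner sheet when that traceability is not needed.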