# Cellosaurus and ICSCB Integration

This notebook compares hPSC records between two major databases: the Integrated Collection of Stem Cell Bank data (ICSCB) and Cellosaurus. The goal is to measure how well the records in the two databases align and to identify discrepancies and unmatched records, so that we can see whether combining them yields an expanded list of hPSCs.

In [21]:
# Set up: import the libraries used below and mount Google Drive
import os

import pandas as pd
from google.colab import drive

drive.mount('/content/drive')

%run '/content/drive/My Drive/hPSC-FAIRness Analysis/scripts/setup_drive.py'

root_dir, data_dir, processed_dir, results_dir = setup_drive()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Mounted at /content/drive
Setting up root directory with name: 'hPSC-FAIRness Analysis'
Root directory path: '/content/drive/My Drive/hPSC-FAIRness Analysis'



## 1. Load hPSC Records from Cellosaurus and ICSCB

In [37]:
# Load the Cellosaurus hPSC dataset
Cello_df = pd.read_excel(os.path.join(processed_dir,'hPSC Cellosaurus.xlsx'))
Cello_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21674 entries, 0 to 21673
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   AC      21674 non-null  object
 1   ID      21674 non-null  object
 2   SY      15411 non-null  object
 3   DR      21627 non-null  object
 4   RX      11675 non-null  object
 5   CC      21653 non-null  object
 6   OX      21674 non-null  object
 7   HI      5341 non-null   object
 8   CA      21674 non-null  object
 9   DT      21674 non-null  object
 10  WW      2144 non-null   object
 11  SX      20936 non-null  object
 12  AG      20178 non-null  object
 13  DI      10510 non-null  object
 14  ST      1213 non-null   object
 15  OI      6814 non-null   object
 16  AS      35 non-null     object
dtypes: object(17)
memory usage: 2.8+ MB


In [38]:
# Load the ICSCB hPSC records
ICSCB_df = pd.read_excel(os.path.join(processed_dir, 'hPSC ICSCB.xlsx'))
ICSCB_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16471 entries, 0 to 16470
Data columns (total 15 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Unnamed: 0                             16471 non-null  int64 
 1   _source                                16471 non-null  object
 2   _cellid                                16471 non-null  object
 3   stem_cell_name                         16471 non-null  object
 4   stem_cell_type                         8997 non-null   object
 5   cell_grade                             1295 non-null   object
 6   produced_by                            8795 non-null   object
 7   provider_distributor                   8488 non-null   object
 8   reference_publications                 4548 non-null   object
 9   gender_of_donor                        11802 non-null  object
 10  ethnicity_of_donor                     7634 non-null   object
 11  health_status  

In [53]:
# Count the unique cell line IDs ('_cellid') among the ICSCB records
unique_count = ICSCB_df['_cellid'].nunique()
print(unique_count)

16462
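
The gap between the 16471 rows loaded and the 16462 unique `_cellid` values means 9 rows are redundant. Before dropping them, it can help to look at which records are duplicated. The sketch below shows one way to do that on a toy frame; `toy_icscb` and its IDs are made up for illustration and stand in for `ICSCB_df`.

```python
import pandas as pd

# Toy stand-in for ICSCB_df; the IDs and names are invented for illustration.
toy_icscb = pd.DataFrame({
    "_cellid": ["ID-001", "ID-002", "ID-002", "ID-003"],
    "stem_cell_name": ["line A", "line B", "line B (copy)", "line C"],
})

# keep=False flags every member of a duplicated group, not only the later copies,
# so the full set of conflicting rows can be reviewed side by side.
dup_rows = toy_icscb[toy_icscb.duplicated(subset="_cellid", keep=False)]

n_total = len(toy_icscb)
n_unique = toy_icscb["_cellid"].nunique()
n_redundant = n_total - n_unique  # rows that drop_duplicates(keep='first') would remove

print(dup_rows)
print(n_total, n_unique, n_redundant)
```

Reviewing `dup_rows` first shows whether the copies are identical or carry different metadata, which matters because `keep='first'` silently discards the later rows.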


In [55]:
# Remove duplicates based on the '_cellid' column, keeping the first occurrence
ICSCB_df = ICSCB_df.drop_duplicates(subset='_cellid', keep='first')
ICSCB_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16462 entries, 0 to 16470
Data columns (total 15 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Unnamed: 0                             16462 non-null  int64 
 1   _source                                16462 non-null  object
 2   _cellid                                16462 non-null  object
 3   stem_cell_name                         16462 non-null  object
 4   stem_cell_type                         8988 non-null   object
 5   cell_grade                             1295 non-null   object
 6   produced_by                            8795 non-null   object
 7   provider_distributor                   8488 non-null   object
 8   reference_publications                 4548 non-null   object
 9   gender_of_donor                        11802 non-null  object
 10  ethnicity_of_donor                     7625 non-null   object
 11  health_status       

## 2. Match Records

ICSCB does not use Cellosaurus identifiers (RRIDs). However, it stores the local IDs that the source platforms assign to each cell line, and Cellosaurus captures these same IDs in its cross-reference field 'DR'. We will use this field to match Cellosaurus records against the cell line IDs (`_cellid`) in ICSCB.

In [26]:
import ast

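# Function to extract local IDs from the 'DR' column in Cellosaurus
def extract_ids(dr_list_str):
    """Return the unique IDs found in a stringified list of 'Database; ID' entries."""
    try:
        # The DR cell holds the string form of a Python list; parse it safely
        dr_list = ast.literal_eval(dr_list_str)
        # Keep the part after 'Database; ' and de-duplicate with a set
        unique_ids = {entry.split('; ')[1] for entry in dr_list if '; ' in entry}
        return list(unique_ids)
    except (ValueError, SyntaxError, TypeError):
        # Missing (NaN) or malformed DR values yield no IDs
        return []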
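
To make the expected input concrete, here is a small standalone check. It mirrors the logic of the function above and assumes that each 'DR' cell holds the string form of a list of `'Database; ID'` entries; the database names and IDs in `sample` are invented for illustration.

```python
import ast

def extract_ids(dr_list_str):
    """Mirror of the notebook function, repeated so this example runs on its own."""
    try:
        dr_list = ast.literal_eval(dr_list_str)
        return list({entry.split('; ')[1] for entry in dr_list if '; ' in entry})
    except (ValueError, SyntaxError, TypeError):
        return []

# Hypothetical DR value: two entries point at the same local ID
sample = "['BankA; ID-001', 'BankB; ID-002', 'BankC; ID-001']"
ids = extract_ids(sample)

# Missing and malformed values both fall back to an empty list
missing = extract_ids(float("nan"))
malformed = extract_ids("not a list")

print(sorted(ids), missing, malformed)
```

Because the IDs pass through a set, their order in the returned list is not guaranteed, and the database name is discarded, so identical IDs from different databases collapse into one.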

In [27]:
# Apply the function to the DR column to get a list of IDs for each row
Cello_df['extracted_ids'] = Cello_df['DR'].apply(extract_ids)
Cello_df.head(5)

# Explode the DataFrame so that each ID gets its own row
Cello_exploded = Cello_df.explode('extracted_ids')
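
As a quick illustration of what `explode` does here, the sketch below uses a toy frame (`toy_cello`, with invented accessions and IDs). Each list element gets its own row, and a record with an empty list becomes a single row whose `extracted_ids` is NaN.

```python
import pandas as pd

# Toy stand-in for Cello_df after extract_ids has been applied
toy_cello = pd.DataFrame({
    "AC": ["CVCL_TOY1", "CVCL_TOY2", "CVCL_TOY3"],
    "extracted_ids": [["ID-001", "ID-002"], ["ID-003"], []],
})

# One row per (AC, ID) pair; the empty list yields one row with NaN
toy_exploded = toy_cello.explode("extracted_ids")
n_rows_exploded = len(toy_exploded)

# Optional tidy-up: drop records that contributed no ID
toy_clean = toy_exploded.dropna(subset=["extracted_ids"]).reset_index(drop=True)

print(toy_exploded)
print(toy_clean)
```

Dropping the NaN rows is a design choice rather than a requirement: they cannot match a non-null `_cellid` in an inner merge, but removing them keeps row counts easier to interpret.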



### a. Matched Records



In [56]:
# Merge exploded Cellosaurus with ICSCB to find matches.
# `how` can be changed to 'left', 'right', or 'outer' to keep unmatched rows.
matched = pd.merge(ICSCB_df, Cello_exploded, left_on='_cellid', right_on='extracted_ids', how='inner')

# Summarize the matched DataFrame
matched.info()
# Save the matched results
#matched.to_excel(os.path.join(results_dir,'Matched Cello_ICSCB.xlsx'), index=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 33 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Unnamed: 0                             10910 non-null  int64 
 1   _source                                10910 non-null  object
 2   _cellid                                10910 non-null  object
 3   stem_cell_name                         10910 non-null  object
 4   stem_cell_type                         4013 non-null   object
 5   cell_grade                             1009 non-null   object
 6   produced_by                            7913 non-null   object
 7   provider_distributor                   5705 non-null   object
 8   reference_publications                 4310 non-null   object
 9   gender_of_donor                        8316 non-null   object
 10  ethnicity_of_donor                     4378 non-null   object
 11  health_status  

In [57]:
unique_count = matched['_cellid'].nunique()
print(unique_count)

10781


### b. Summary of duplicated matches to Cellosaurus

In [63]:
# Group by AC to list every ICSCB ID matched to each Cellosaurus record
# TODO: this view is one-sided, because a single ICSCB ID can also match
# several Cellosaurus records. Group by '_cellid' first, then by 'AC'.

grouped_matches = matched.groupby('AC').agg({'_cellid': list}).reset_index()

# Count the ICSCB IDs in each group and store the count in 'duplicated records'
grouped_matches['duplicated records'] = grouped_matches['_cellid'].apply(len)

# Print the grouped matches DataFrame
print(grouped_matches)
#grouped_matches.to_excel(os.path.join(results_dir, 'duplication Cello_ICSCB.xlsx'), index=True)

              AC                   _cellid  duplicated records
0      CVCL_0057  [SKIP001492, HVRDe009-A]                   2
1      CVCL_0A03              [SKIP001308]                   1
2      CVCL_0A07              [SKIP001140]                   1
3      CVCL_0G81              [SKIP000533]                   1
4      CVCL_0G82  [SKIP000534, SKIP001333]                   2
...          ...                       ...                 ...
10453  CVCL_ZX65              [UMCGi009-A]                   1
10454  CVCL_ZY38  [SKIP005865, KEIOi001-A]                   2
10455  CVCL_ZZ62              [SKIP005863]                   1
10456  CVCL_ZZ65              [SKIP005857]                   1
10457  CVCL_ZZ82            [HMGUi001-A-4]                   1

[10458 rows x 3 columns]
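
The counts above (10910 matched rows, 10781 unique `_cellid` values, 10458 `AC` groups) suggest that the overlap runs in both directions: some Cellosaurus records collect several ICSCB IDs, and some ICSCB IDs map to several Cellosaurus records. A minimal sketch of checking both directions follows; `toy_matched` and its values are invented, and it is one possible approach, not the notebook's final method.

```python
import pandas as pd

# Toy stand-in for the `matched` frame; IDs and accessions are made up
toy_matched = pd.DataFrame({
    "_cellid": ["ID-001", "ID-002", "ID-002", "ID-003", "ID-004"],
    "AC": ["CVCL_TOY1", "CVCL_TOY1", "CVCL_TOY2", "CVCL_TOY3", "CVCL_TOY3"],
})

# Direction 1: how many distinct ICSCB IDs point at each Cellosaurus accession
ids_per_ac = toy_matched.groupby("AC")["_cellid"].nunique()
multi_id_acs = ids_per_ac[ids_per_ac > 1]

# Direction 2: how many distinct Cellosaurus accessions each ICSCB ID points at
acs_per_id = toy_matched.groupby("_cellid")["AC"].nunique()
multi_ac_ids = acs_per_id[acs_per_id > 1]

print(multi_id_acs)
print(multi_ac_ids)
```

Keeping the two directions separate makes it clear whether a duplicate reflects several bank entries for one line or one bank ID that Cellosaurus cross-references from several records.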


### c. Unmatched ICSCB Records

In [59]:
# Keep only the ICSCB rows whose '_cellid' does not appear in the matched results
unmatched = ICSCB_df[~ICSCB_df["_cellid"].isin(matched["_cellid"])]
# Save the unmatched records
#unmatched.to_excel(os.path.join(results_dir,'UNmatches Cello_ICSCB.xlsx'), index=True)

In [60]:
unmatched.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5681 entries, 33 to 16430
Data columns (total 15 columns):
 #   Column                                 Non-Null Count  Dtype 
---  ------                                 --------------  ----- 
 0   Unnamed: 0                             5681 non-null   int64 
 1   _source                                5681 non-null   object
 2   _cellid                                5681 non-null   object
 3   stem_cell_name                         5681 non-null   object
 4   stem_cell_type                         4983 non-null   object
 5   cell_grade                             286 non-null    object
 6   produced_by                            1005 non-null   object
 7   provider_distributor                   2900 non-null   object
 8   reference_publications                 360 non-null    object
 9   gender_of_donor                        3607 non-null   object
 10  ethnicity_of_donor                     3260 non-null   object
 11  health_status       
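
As a consistency check, the matched and unmatched sets should partition the deduplicated ICSCB records: 10781 matched IDs + 5681 unmatched rows = 16462, which agrees with the figures above. The sketch below reproduces that accounting on toy data; all frame names and values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for ICSCB_df and Cello_exploded
toy_icscb = pd.DataFrame({"_cellid": ["ID-001", "ID-002", "ID-003", "ID-004"]})
toy_exploded = pd.DataFrame({
    "AC": ["CVCL_TOY1", "CVCL_TOY2", "CVCL_TOY3"],
    "extracted_ids": ["ID-001", "ID-001", "ID-003"],
})

# Same inner merge and anti-join pattern as in the notebook
toy_matched = toy_icscb.merge(
    toy_exploded, left_on="_cellid", right_on="extracted_ids", how="inner"
)
toy_unmatched = toy_icscb[~toy_icscb["_cellid"].isin(toy_matched["_cellid"])]

# Count matched IDs, not matched rows: one ID can match several accessions
n_matched_ids = toy_matched["_cellid"].nunique()
n_unmatched = len(toy_unmatched)

assert n_matched_ids + n_unmatched == len(toy_icscb)
print(n_matched_ids, n_unmatched, len(toy_icscb))
```

The check relies on `_cellid` being unique in the ICSCB frame, which holds here only because duplicates were dropped earlier.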