# Introduction

This notebook aims to resolve missing or "Unknown" country information for human pluripotent stem cell (hPSC) lines by leveraging their registration and banking details. By analyzing the associations with specific stem cell banks and registries, we can infer and update the country of origin for these cell lines where such information is missing.


# Set-Up

In [1]:
# set up
from google.colab import drive
drive.mount('/content/drive')

%run '/content/drive/My Drive/hPSC-FAIRness Analysis/scripts/setup_drive.py'

root_dir, data_dir, processed_dir, results_dir = setup_drive()

Mounted at /content/drive
Mounted at /content/drive
Setting up root directory with name: 'hPSC-FAIRness Analysis'
Root directory path: '/content/drive/My Drive/hPSC-FAIRness Analysis'


# Statistics by Country before Rescue

In [2]:
import os
import pandas as pd

# Load the registration and banking status table
df = pd.read_excel(os.path.join(processed_dir,'Registration & Banking Status.xlsx'))
# Correct the country names
df.loc[df['Country'] == 'Quatar', 'Country'] = 'Qatar'
df.loc[df['Country'] == 'United Kingdom', 'Country'] = 'UK'

# Summary of the number of hPSCs by country
print(df['Country'].value_counts(dropna=False))

Country
Unknown                 7525
USA                     5301
UK                      2161
China                   1305
Germany                 1063
Japan                    433
Spain                    361
South Korea              317
Australia                316
France                   236
Netherlands              232
Italy                    216
Israel                   199
Denmark                  197
Sweden                   177
Russia                   156
Ireland                  145
Taiwan                   142
Iran                     127
India                    126
Brazil                   115
Canada                   105
Saudi Arabia              88
Belgium                   72
Thailand                  66
Finland                   65
Switzerland               65
Poland                    47
Austria                   41
Czech Republic            39
Portugal                  37
Jordan                    36
Qatar                     34
Luxembourg                24
Turkey

# Rescue by Registries

Summary of the methodology:

* We used registry records from the ICSCB database, where many cell lines have the generating or distributing institution recorded.

* We manually assigned a country to each record based on the institution information provided in the ICSCB.

* We extracted the registry IDs from our DataFrame and matched them against the corresponding IDs in the ICSCB.

* Where a match was found, we filled in the missing country in our DataFrame with the country from the ICSCB record.

* We rescued countries in the following order: first SKIP, then hPSCreg, and finally NIH hESCs. We chose this order because most registered hPSCs with unknown country information in our DataFrame are registered in SKIP, followed by hPSCreg, and then NIH hESCs.
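
The effect of the priority order can be illustrated with a minimal sketch on toy data. The IDs, the lookup dictionaries, and the helper `fill_unknown` are hypothetical and not part of this notebook; the sketch only shows that a later registry fills rows that are still 'Unknown' after the earlier one has been applied.

```python
import pandas as pd

# Toy data: IDs and lookup contents are made up for illustration.
toy = pd.DataFrame({
    "AC": ["A", "B", "C"],
    "Country": ["Unknown", "Unknown", "Germany"],
    "SKIP_ID": ["S1", None, None],
    "hPSCreg_ID": ["H1", "H2", "H3"],
})
skip_lookup = {"S1": "Japan"}
hpscreg_lookup = {"H1": "USA", "H2": "Spain", "H3": "India"}


def fill_unknown(frame, id_col, lookup):
    """Fill 'Unknown' countries from a lookup keyed by registry ID (hypothetical helper)."""
    candidate = frame[id_col].map(lookup)
    mask = (frame["Country"] == "Unknown") & candidate.notna()
    frame.loc[mask, "Country"] = candidate[mask]
    return int(mask.sum())  # number of rows actually updated


# Apply registries in priority order: SKIP first, then hPSCreg.
n_skip = fill_unknown(toy, "SKIP_ID", skip_lookup)
n_hpscreg = fill_unknown(toy, "hPSCreg_ID", hpscreg_lookup)
print(toy[["AC", "Country"]])
```

In this toy case, line A is listed in both registries but takes its country from SKIP, because it is no longer 'Unknown' when hPSCreg is applied. Line C already has a country and is never overwritten.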

## SKIP

1. Create a new column in the DataFrame to store the SKIP IDs.

In [3]:
import ast
# Function to extract SKIP ID
def extract_skip_id(dr_str):
    try:
        dr_list = ast.literal_eval(dr_str)  # Parse the string to a list
        for item in dr_list:
            if item.startswith('SKIP;'):
                return item.split(' ')[1]  # Extract the SKIP ID
    except (ValueError, SyntaxError):
        return None  # Return None if parsing fails or no SKIP ID is found
    return None  # Return None if no SKIP ID is found
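
Since the same extraction logic is needed for more than one registry, it could be written once with the registry name as a parameter. The following is a sketch, not the notebook's method: the function name `extract_registry_id` is hypothetical, the example IDs are made up, and the `DR` field is assumed to be a stringified list of `"<registry>; <ID>"` entries. It splits on the semicolon rather than on the first space, and it also tolerates non-string input such as NaN.

```python
import ast


def extract_registry_id(dr_str, registry):
    """Return the first ID listed for `registry` in a stringified DR list, else None."""
    try:
        entries = ast.literal_eval(dr_str)
    except (ValueError, SyntaxError, TypeError):
        return None  # not a parsable literal (e.g. NaN or malformed text)
    if not isinstance(entries, (list, tuple)):
        return None  # parsed, but not a list of entries
    for entry in entries:
        name, _, identifier = str(entry).partition(";")
        if name.strip() == registry and identifier.strip():
            return identifier.strip()
    return None


# Assumed DR format; the IDs below are invented for illustration.
example = "['SKIP; SKIP000001', 'hPSCreg; ABCi001-A']"
print(extract_registry_id(example, "SKIP"))
print(extract_registry_id(example, "hPSCreg"))
print(extract_registry_id("not a list", "SKIP"))
```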

In [4]:
# Apply the function only to SKIP-registered rows where Country is 'Unknown'
skip_mask = (df['Country'] == 'Unknown') & (df['SKIP'] == 'yes')
df.loc[skip_mask, 'SKIP_ID'] = df.loc[skip_mask, 'DR'].apply(extract_skip_id)

2. Load the SKIP lookup table from ICSCB, where the country of each cell line has been manually assigned.

In [5]:
# Load my SKIP lookup table
df_SKIP = pd.read_csv(os.path.join(processed_dir,'SKIP_Country.csv'))

3. Populate the country information in the DataFrame using the corresponding data from the SKIP lookup table.

In [6]:
# Left-merge df with the SKIP lookup table, matching 'SKIP_ID' (df) to '_cellid' (df_SKIP)
SKIP_merged_df = df.merge(df_SKIP[['Country', '_cellid']], left_on='SKIP_ID', right_on='_cellid', how='left', suffixes=('', '_df_skip'))
# Count the matched cell lines per country assigned in the lookup table
value_counts = SKIP_merged_df['Country_df_skip'].value_counts()

# Rescue results
print(value_counts)
print('Number of rescued hPSCs:', sum(value_counts))

Country_df_skip
Japan            313
United States    196
USA               96
Germany           10
Canada             7
China              7
Singapore          5
Unknown            3
South Korea        2
Spain              2
Netherlands        1
Denmark            1
Name: count, dtype: int64
Number of rescued hPSCs: 643
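
The output above lists both "United States" and "USA", so the lookup table uses more than one spelling for the same country. A left merge also duplicates rows of `df` if the lookup table contains repeated `_cellid` values. A minimal sketch of guarding against both on toy data follows; the toy tables are made up, and whether the real `SKIP_Country.csv` contains duplicate keys is not known from this notebook.

```python
import pandas as pd

# Toy lookup with a duplicated key and a variant country spelling (made up).
lookup = pd.DataFrame({
    "_cellid": ["S1", "S1", "S2"],
    "Country": ["United States", "United States", "Japan"],
})

# Harmonize spellings and drop duplicate keys before merging.
lookup["Country"] = lookup["Country"].replace({"United States": "USA"})
lookup = lookup.drop_duplicates(subset="_cellid")

toy = pd.DataFrame({"SKIP_ID": ["S1", "S2", None], "Country": ["Unknown"] * 3})

# validate="many_to_one" raises pandas.errors.MergeError if lookup keys are not unique.
merged = toy.merge(
    lookup, left_on="SKIP_ID", right_on="_cellid",
    how="left", suffixes=("", "_df_skip"), validate="many_to_one",
)
print(len(merged) == len(toy))  # the row count should be unchanged by the merge
```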


In [7]:
# Update 'Country' where it is 'Unknown' and the SKIP lookup ('Country_df_skip') provides a value
SKIP_merged_df['Country'] = SKIP_merged_df.apply(
    lambda row: row['Country_df_skip'] if row['Country'] == 'Unknown' and pd.notna(row['Country_df_skip']) else row['Country'],
    axis=1
)

# Drop the helper columns ('_cellid', 'Country_df_skip', 'SKIP_ID'), which are no longer needed
df = SKIP_merged_df.drop(columns=['_cellid', 'Country_df_skip', 'SKIP_ID'])

# Save file
# df.to_excel(os.path.join(results_dir, 'SKIP_matching_result.xlsx'), index=False) # for record keeping
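
The row-wise `apply` above can also be expressed as a boolean mask, which is usually faster on large tables and makes the number of updated rows easy to report. This is a sketch on toy data, not the notebook's code. It additionally skips lookup values that are themselves 'Unknown' (three such matches appear in the SKIP output), so that the count reflects rows whose country actually changes.

```python
import pandas as pd

# Toy merged table; values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["Unknown", "Unknown", "Germany", "Unknown"],
    "Country_df_skip": ["Japan", None, "USA", "Unknown"],
})

# Eligible rows: currently 'Unknown', matched in the lookup, and matched to a known country.
rescuable = (
    (toy["Country"] == "Unknown")
    & toy["Country_df_skip"].notna()
    & (toy["Country_df_skip"] != "Unknown")
)
toy.loc[rescuable, "Country"] = toy.loc[rescuable, "Country_df_skip"]

n_rescued = int(rescuable.sum())
print("Rows updated:", n_rescued)
print(toy["Country"].tolist())
```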

## hPSCreg

1. Create a new column in the DataFrame to store the hPSCreg IDs.


In [8]:
import ast
# Function to extract hPSCreg ID
def extract_hPSCreg_id(dr_str):
    try:
        dr_list = ast.literal_eval(dr_str)  # Parse the string to a list
        for item in dr_list:
            if item.startswith('hPSCreg;'):
                return item.split(' ')[1]
    except (ValueError, SyntaxError):
        return None
    return None

In [9]:
# Apply the function only to hPSCreg-registered rows where Country is 'Unknown'
hpscreg_mask = (df['Country'] == 'Unknown') & (df['hPSCreg'] == 'yes')
df.loc[hpscreg_mask, 'hPSCreg_ID'] = df.loc[hpscreg_mask, 'DR'].apply(extract_hPSCreg_id)

2. Load the hPSCreg lookup table from ICSCB, where the country of origin for cell lines has been manually assigned.

In [10]:
# Load my lookup table
df_hPSCreg = pd.read_csv(os.path.join(processed_dir, 'hPSCreg_Country.csv'))

3. Populate the country information in the DataFrame using the corresponding data from the hPSCreg lookup table.

In [11]:
# Left-merge df with the hPSCreg lookup table, matching 'hPSCreg_ID' (df) to '_cellid' (df_hPSCreg)
hPSCreg_merged_df = df.merge(df_hPSCreg[['Country', '_cellid']], left_on='hPSCreg_ID', right_on='_cellid', how='left', suffixes=('', '_df_hPSCreg'))
# Count the matched cell lines per country assigned in the lookup table
value_counts = hPSCreg_merged_df['Country_df_hPSCreg'].value_counts()

# Rescue results
print(value_counts)
print('Number of rescued hPSCs:', sum(value_counts))

Country_df_hPSCreg
Japan             209
USA                92
Spain              17
India               6
Netherlands         2
Germany             2
Denmark             2
China               1
United Kingdom      1
Name: count, dtype: int64
Number of rescued hPSCs: 332


In [12]:
# Update 'Country' where it is 'Unknown' and the hPSCreg lookup ('Country_df_hPSCreg') provides a value
hPSCreg_merged_df['Country'] = hPSCreg_merged_df.apply(
    lambda row: row['Country_df_hPSCreg'] if row['Country'] == 'Unknown' and pd.notna(row['Country_df_hPSCreg']) else row['Country'],
    axis=1
)

# Drop the helper columns ('_cellid', 'Country_df_hPSCreg', 'hPSCreg_ID'), which are no longer needed
df = hPSCreg_merged_df.drop(columns=['_cellid', 'Country_df_hPSCreg', 'hPSCreg_ID'])

# save file
# df.to_excel(os.path.join(results_dir, 'SKIP&hPSCreg_matching_result.xlsx'), index=False) # for record keeping

## NIH hESCreg

Only one cell line registered in the NIH hESCreg, CVCL_B813, has a missing country in Cellosaurus. According to the [NIH website](https://stemcells.nih.gov/registry/eligible-to-use-lines?ID=NIHhESC-11-0106), this cell line was submitted by Cedars-Sinai Medical Center. We therefore assign the USA as the country for this cell line.


In [13]:
df.loc[df['AC'] == 'CVCL_B813', 'Country'] = 'USA'

## Summary

**A total of 973 hPSCs have had their country information successfully rescued through registries.**


In [14]:
print(df['Country'].value_counts(dropna=False))

Country
Unknown                 6552
USA                     5490
UK                      2161
China                   1313
Germany                 1075
Japan                    955
Spain                    380
South Korea              319
Australia                316
France                   236
Netherlands              235
Italy                    216
Denmark                  200
Israel                   199
United States            196
Sweden                   177
Russia                   156
Ireland                  145
Taiwan                   142
India                    132
Iran                     127
Brazil                   115
Canada                   112
Saudi Arabia              88
Belgium                   72
Thailand                  66
Finland                   65
Switzerland               65
Poland                    47
Austria                   41
Czech Republic            39
Portugal                  37
Jordan                    36
Qatar                     34
Singap

# Rescue by Banks

Summary of the methodology:

* Researchers often deposit cell lines in banks located within their own country/region. Based on this, we assume that the country of origin for a banked cell line corresponds to the country/region of the bank.

**Create a function to assign the country of origin for hPSCs based on their associated banks**

In [15]:
def rescue_by_bank(df, bank, country):
    """Assign `country` to lines deposited in `bank` whose Country is still 'Unknown' (updates df in place)."""
    condition = (df[bank] == 'yes') & (df['Country'] == 'Unknown')

    # Count the rows that meet the condition
    num_rows = condition.sum()

    # Print the count
    print(f"Number of rescued hPSCs in {bank} is : {num_rows}")

    # Update 'Country' to the bank's country for those rows
    df.loc[condition, 'Country'] = country
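
Because the same call is repeated once per bank, the bank-to-country assignments could also be kept in one mapping and the per-bank counts summed programmatically. The sketch below is an assumption-laden variant, not the notebook's code: `rescue_by_bank_counted` and `BANK_COUNTRY` are hypothetical names, only a subset of banks is shown, and the toy table is made up.

```python
import pandas as pd

# Subset of the bank-to-country assignments used in this notebook.
BANK_COUNTRY = {"Coriell": "USA", "WiCell": "USA", "RCB": "Japan", "JCRB": "Japan"}


def rescue_by_bank_counted(frame, bank, country):
    """Hypothetical variant of rescue_by_bank that returns the number of updated rows."""
    condition = (frame[bank] == "yes") & (frame["Country"] == "Unknown")
    frame.loc[condition, "Country"] = country
    return int(condition.sum())


# Toy frame; values are made up for illustration.
toy = pd.DataFrame({
    "Country": ["Unknown", "Unknown", "UK", "Unknown"],
    "Coriell": ["yes", "no", "yes", "no"],
    "WiCell": ["no", "no", "no", "no"],
    "RCB": ["yes", "yes", "no", "no"],
    "JCRB": ["no", "no", "no", "no"],
})

# Banks are applied in dictionary order; a line held by several banks
# takes the country of the first bank that matches.
counts = {bank: rescue_by_bank_counted(toy, bank, country)
          for bank, country in BANK_COUNTRY.items()}
print(counts, "total:", sum(counts.values()))
```

In the toy table, the first line is held by both Coriell and RCB and is assigned to the USA because Coriell is processed first. The order of the mapping therefore matters for lines deposited in more than one bank.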

## Banks in USA

* Coriell

In [16]:
rescue_by_bank(df, 'Coriell', 'USA')

Number of rescued hPSCs in Coriell is : 172


* WiCell

In [17]:
rescue_by_bank(df, 'WiCell', 'USA')

Number of rescued hPSCs in WiCell is : 351


* ATCC

In [18]:
rescue_by_bank(df, 'ATCC', 'USA')

Number of rescued hPSCs in ATCC is : 13


* NHCDR - NINDS


In [19]:
rescue_by_bank(df, 'NHCDR', 'USA')

Number of rescued hPSCs in NHCDR is : 224


* FCDI

In [20]:
rescue_by_bank(df, 'FCDI', 'USA')

Number of rescued hPSCs in FCDI is : 0


## Banks in Japan

* RCB - Riken Biobank

In [21]:
rescue_by_bank(df, 'RCB', 'Japan')

Number of rescued hPSCs in RCB is : 2032


* JCRB

In [22]:
rescue_by_bank(df, 'JCRB', 'Japan')

Number of rescued hPSCs in JCRB is : 17


## Banks in Europe

* HipSci


In [23]:
rescue_by_bank(df, 'HipSci', 'United Kingdom')

Number of rescued hPSCs in HipSci is : 1


* ECACC, EBiSC

*The specific country cannot be identified for these banks, so the lines are assigned to "Europe".*

In [24]:
rescue_by_bank(df, 'ECACC', 'Europe')
rescue_by_bank(df, 'EBiSC', 'Europe')

Number of rescued hPSCs in ECACC is : 1
Number of rescued hPSCs in EBiSC is : 0


## Banks in Taiwan

In [25]:
rescue_by_bank(df, 'BCRC', 'Taiwan')

Number of rescued hPSCs in BCRC is : 0


## Banks in Iran

In [26]:
rescue_by_bank(df, 'RSCB', 'Iran')

Number of rescued hPSCs in RSCB is : 0


## Summary

**A total of 2805 hPSCs have had their country information successfully assigned through banks.**


# Save Results after Country Rescue

**Group European Countries**

In [35]:
# Harmonize country-name variants before grouping
df['Country'] = df['Country'].replace({
    'United States': 'USA',
    'United Kingdom': 'UK',
    'EU': 'Europe'
})

import numpy as np
# List of European countries
'''
eu_countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", "France",
    "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal",
    "Romania", "Slovakia", "Slovenia", "Spain", "Sweden"]
'''

europe_countries = ["Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland", "France",
    "Germany", "Greece", "Hungary", "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", "Netherlands", "Poland", "Portugal",
    "Romania", "Slovakia", "Slovenia", "Spain", "Sweden", 'UK', 'Switzerland']

# Create a 'Region' column: listed European countries become 'Europe'; all other values keep their 'Country' entry
df['Region'] = np.where(df['Country'].isin(europe_countries), 'Europe', df['Country'])
#df['Region'] = df['Region'].replace('Europe', 'EU')
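
The behaviour of the `np.where` grouping can be checked on a small example. The sketch below uses a shortened, illustrative country list and made-up rows; it shows that values outside the list, including 'Unknown' and lines already labelled 'Europe', pass through unchanged.

```python
import numpy as np
import pandas as pd

# Shortened list for illustration; the notebook's own list is longer.
europe_subset = ["Germany", "Spain", "UK", "Switzerland"]

toy = pd.DataFrame({"Country": ["Germany", "UK", "Japan", "Europe", "Unknown"]})

# Listed countries collapse to 'Europe'; every other value is copied from 'Country'.
toy["Region"] = np.where(toy["Country"].isin(europe_subset), "Europe", toy["Country"])
print(toy)
```

Any European country missing from the list keeps its own name in `Region`, so the list determines which countries are grouped.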

In [36]:
df.to_excel(os.path.join(data_dir, 'processed', 'Final.xlsx'), index=False)