# SoMeSci

## Links

### SoMeSci

This is the "gold standard" training and evaluation dataset.

- [Documentation](https://data.gesis.org/somesci/)
- [Dataset](https://zenodo.org/records/8100213)
- [GitHub](https://github.com/dave-s477/SoMeSci_Code)

### SoftwareKG

This is a dataset inferred from PLoS articles.

- [Dataset](https://zenodo.org/records/3715147)
- [GitHub](https://github.com/f-krueger/ESWC-SoftwareKG)

### SoftwareKG-PMC

This is a dataset inferred from PMC articles.

Note that the CSV files do not contain information about articles that were only available as PDF.

- [Dataset](https://zenodo.org/records/10048276)
- [GitHub](https://github.com/f-krueger/SoftwareKG-PMC-Analysis)

### SoftwareKG-Mention-Entity

This appears to be a follow-up to the SoftwareKG-PMC dataset with additional entity information.

- [Dataset](https://zenodo.org/records/10951778)

## Goal

Extract a basic column-wise dataset from the SoMeSci dataset that has the following columns:

- `article_id`: the unique identifier for each article
- `software_id`: the unique identifier for each software
- `software_name`: the name of the software used in the article
- `mention_type`: a classification of the reason for the mention (e.g., "use", "create", "share", etc.)
- `context`: the text surrounding the mention
- `extra_fields`: any other fields available in this dataset that we can use
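As a rough sketch of the target shape, one row of this dataset could be modeled as a small record type. The class name `SoftwareMention`, the field types, and the optional `context` are assumptions for illustration; the source datasets may use different types or leave some of these fields unavailable.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SoftwareMention:
    """Hypothetical record for one row of the target dataset."""

    article_id: str                 # unique identifier of the article
    software_id: int                # unique identifier of the software (type assumed)
    software_name: str              # name of the software used in the article
    mention_type: str               # e.g. "use", "create", "share"
    context: Optional[str] = None   # surrounding text, if the source provides it
    extra_fields: Dict[str, Any] = field(default_factory=dict)  # anything else worth keeping


# Column order of the target dataset, derived from the record definition
TARGET_COLUMNS = [f.name for f in fields(SoftwareMention)]

# Made-up example row, not taken from any of the datasets
example_row = SoftwareMention("PMC0000001", 1, "ExampleTool", "use")
```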

In [1]:
from pathlib import Path

CURRENT_DIR = Path.cwd()
OVERALL_DATA_DIR = CURRENT_DIR / "data"

# SoftwareKG-PMC

Starting with PMC because it seems the easiest to work with (CSV files) and has broad coverage.

In [2]:
import polars as pl

PMC_DATA_DIR = OVERALL_DATA_DIR / "software-kg-pmc"

# Read in the relevant CSV files
article_software = pl.scan_csv(PMC_DATA_DIR / "kg_article_software.csv")
software_name = pl.scan_csv(PMC_DATA_DIR / "kg_software.csv")
software_url = pl.scan_csv(PMC_DATA_DIR / "kg_software_url.csv")
mention_type = pl.scan_csv(PMC_DATA_DIR / "kg_mention_type.csv")

# Keep one row per unique article-software pair.
# (Extra fields could later be marked by prepending "extra_field_" to their column names.)
article_software = article_software.select(
    "article_id",
    "software_id",
).unique(
    ["article_id", "software_id"],
).collect()

# If I recall correctly, "ratio" is the fraction of mentions in which
# the software is referenced with the given name,
# so we take the name with the highest ratio as the top name
top_software_name = software_name.sort(
    "ratio",
    descending=True,
).unique(
    "id",
    keep="first",
).select(
    pl.col("id").alias("software_id"),
    pl.col("name").alias("top_software_name"),
).collect()

# There can be multiple names for a given software_id;
# merge them all into a single string, separated by "; "
all_software_names = software_name.select(
    pl.col("id").alias("software_id"),
    pl.col("name").alias("software_name"),
).group_by(
    ["software_id"],
).agg(
    pl.col("software_name").str.join("; ").alias("all_software_names"),
).select(
    "software_id",
    "all_software_names",
).collect()

# Seemingly only a single URL per article software pair
software_url = software_url.select(
    "article_id",
    "software_id",
    pl.col("url").str.to_lowercase().str.replace_all(
        r"\s+",
        "",
    ).str.replace(
        "https://",
        "http://",
        literal=True,
    ).str.replace_all(
        r"(\/$)",
        "",
    ).alias("software_url"),
).unique(
    ["article_id", "software_id"],
).select(
    "article_id",
    "software_id",
    "software_url",
).collect()

# The same software can be mentioned multiple times in an article,
# take the set of the mention types for each article-software pair
mention_type = mention_type.select(
    "article_id",
    "software_id",
    "mention_type",
).group_by(
    ["article_id", "software_id"],
).agg(
    pl.col("mention_type").unique().str.join("; ").alias("mention_types"),
).select(
    "article_id",
    "software_id",
    "mention_types",
).collect()

# Successive joins to combine all the dataframes
article_software_w_top_name = (
    article_software.join(
        top_software_name,
        on="software_id",
        how="left",
    )
)
article_software_w_top_and_all_names = (
    article_software_w_top_name.join(
        all_software_names,
        on="software_id",
        how="left",
    )
)
article_software_w_top_all_names_and_url = (
    article_software_w_top_and_all_names.join(
        software_url,
        on=("article_id", "software_id"),
        how="left",
    )
)
article_software_w_name_url_mention_type = (
    article_software_w_top_all_names_and_url.join(
        mention_type,
        on=("article_id", "software_id"),
        how="left",
    )
)

# Finally print out complete dataframe
article_software_w_name_url_mention_type

| article_id | software_id | top_software_name | all_software_names | software_url | mention_types |
| --- | --- | --- | --- | --- | --- |
| str | i64 | str | str | str | str |
| "PMC5995956" | 1482052 | "NUPACK" | "NUPACK; Nucleic Acid Package; …" | "http://www.nupack.org" | "Usage" |
| "PMC5847299" | 1474895 | "Prism" | "Prism; Groups of Growth Curves…" | | "Usage" |
| "PMC5460295" | 1687678 | "LingReg PCR" | "LingReg PCR" | | "Usage" |
| "PMC6625405" | 1477542 | "PSORT" | "PSORT; Subcellular Localizatio…" | "http://db.psort.org" | "Usage" |
| "PMC6466382" | 1738111 | "SeNTU" | "SeNTU" | | "Creation; Usage" |
| … | … | … | … | … | … |
| "PMC6154907" | 1481918 | "RandomForestClassifier" | "RandomForestClassifier; Random…" | | "Usage" |
| "PMC6995369" | 1482531 | "SPSS" | "SPSS Statistics; Statistical P…" | | "Usage" |
| "PMC1550565" | 1720034 | "VBscript" | "VBscript" | | "Mention" |
| "PMC3868361" | 1474942 | "R" | "R; The R Foundation for Statis…" | | "Usage" |


In [3]:
# Debugging: check rows whose mention types include "Creation"
article_software_w_name_url_mention_type.filter(
    pl.col("mention_types").str.contains("Creation", literal=True)
)

| article_id | software_id | top_software_name | all_software_names | software_url | mention_types |
| --- | --- | --- | --- | --- | --- |
| str | i64 | str | str | str | str |
| "PMC6466382" | 1738111 | "SeNTU" | "SeNTU" | | "Creation; Usage" |
| "PMC2760863" | 1442540 | "RHGP" | "RHGP" | | "Creation" |
| "PMC5745602" | 1483869 | "REACT" | "REACT; Risk Estimation for Add…" | | "Creation; Usage; Mention" |
| "PMC4178918" | 1686810 | "DAFOP" | "DAFOP" | | "Mention; Creation; Usage" |
| "PMC2853119" | 1482520 | "PhastCons" | "PhastCons; PHylogenetic Analys…" | "http:" | "Creation; Usage" |
| … | … | … | … | … | … |
| "PMC5870566" | 1521720 | "omiXcore" | "omiXcore" | "http://service.tartaglialab.co…" | "Creation; Deposition; Usage; M…" |
| "PMC6147213" | 1487599 | "Tripal" | "Tripal; KGD; Bulk; tripal; Gal…" | | "Mention; Creation; Usage" |
| "PMC7341028" | 1131079 | "SABESS" | "SABESS" | | "Creation" |
| "PMC2635471" | 1745574 | "Genlight" | "Genlight" | | "Usage; Creation; Mention" |

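The PhastCons row in the output above has `http:` as its `software_url`, which suggests that some URL strings do not survive normalization in a usable form. The sketch below is a plain-Python mirror of the clean-up chain (lowercase, drop whitespace, swap the first `https://` for `http://`, strip one trailing slash) for trying out edge cases. The function names `normalize_url` and `has_host` are hypothetical, and the check for a host is an added assumption about what counts as a usable URL.

```python
import re
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Plain-Python mirror of the URL clean-up applied to the `url` column."""
    url = url.lower()
    url = re.sub(r"\s+", "", url)                # drop all whitespace
    url = url.replace("https://", "http://", 1)  # first occurrence only
    url = re.sub(r"/$", "", url)                 # strip a single trailing slash
    return url


def has_host(url: str) -> bool:
    """Return True when the URL still contains a host part after clean-up."""
    return bool(urlparse(url).netloc)


cleaned = normalize_url(" HTTPS://www.NUPACK.org/ ")
degenerate = normalize_url("http:/")  # one input that would end up as a bare scheme
```

A filter based on something like `has_host` could be used to null out values such as `http:` before storing the dataset; whether that is desirable depends on how the URLs are used downstream.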

In [4]:
# Store to parquet
article_software_w_name_url_mention_type.write_parquet(
    PMC_DATA_DIR / "processed-pmc-kg-dataset.parquet",
)