# CZI
## Links
- [Dataset](https://datadryad.org/dataset/doi:10.5061/dryad.6wwpzgn2c)
- [GitHub](https://github.com/chanzuckerberg/software-mentions)

## Goal

Extract a basic column-wise dataset from the CZI dataset that has the following columns:

- `article_id`: the unique identifier for each article,
- `software_id`: the unique identifier for each software,
- `software_name`: the name of the software used in the article,
- `mention_type`: a classification of the reason for the mention (e.g., "use", "create", "share"),
- `context`: the context of the mention (what was the surrounding text),
- `extra_fields`: any other fields available in this dataset that we can use.

## Three sub-datasets

### Raw
Raw, plain-text software mentions, as extracted by the NER model

```
license	location	pmcid	pmid	doi	pubdate	source	number	text	software	version	ID	curation_label
```
### Disambiguated
Software mentions after disambiguation; adds the `mapped_to_software` column

```
license	location	pmcid	pmid	doi	pubdate	source	number	text	software	version	ID	curation_label	mapped_to_software
```
### Linked
Metadata for software mentions linked to external sources (e.g., Bioconductor Index, SciCrunch API), keyed by `ID`

```
ID	software_mention	mapped_to	source	platform	package_url	description	homepage_url	other_urls	license	github_repo	github_repo_license	exact_match	RRID	reference	scicrunch_synonyms
```

### What is available in this dataset, based on the SoMeSci columns
- [x] `article_id`: Available; `pmcid`, `pmid`, or `doi`
- [x] `software_id`: Available as `RRID` (Research Resource Identifier), but few entries seem to have one (is this the same identifier SoMeSci uses?)
- [x] `software_name`: Available at the raw, disambiguated, and linked levels; using the most processed level, linked
- [ ] `mention_type`: Not available
- [x] `context`: Available
- [ ] `extra_fields`: Added `software_url`, built from 4 different URL fields; could it replace the sparse RRID?

## Extracting basic columns

In [None]:
import polars as pl

In [None]:
ROOT_DATA_DIR = "../../data/CZI/"
RAW_PATH = ROOT_DATA_DIR + "raw/comm_raw.tsv.gz"
DISAMB_PATH = ROOT_DATA_DIR + "disambiguated/comm_disambiguated.tsv.gz"
LINKED_PATH = ROOT_DATA_DIR + "linked/metadata.tsv.gz"

In [66]:
disamb_df = pl.read_csv(DISAMB_PATH, separator="\t") #, infer_schema_length=100, n_rows=50000)
linked_df = pl.read_csv(LINKED_PATH, separator="\t") #, infer_schema_length=100, n_rows=50000)

### A look at the original data

In [86]:
disamb_df.describe()

statistic,license,location,pmcid,pmid,doi,pubdate,source,number,text,software,version,ID,curation_label,mapped_to_software
str,str,str,f64,f64,str,f64,str,f64,str,str,str,str,str,str
"""count""","""14770209""","""14770209""",14770209.0,14684219.0,"""14679097""",14770209.0,"""13406580""",14770209.0,"""14770209""","""14764379""","""1127612""","""14770209""","""14770209""","""14770207"""
"""null_count""","""0""","""0""",0.0,85990.0,"""91112""",0.0,"""1363629""",0.0,"""0""","""5830""","""13642597""","""0""","""0""","""2"""
"""mean""",,,5910700.0,29085000.0,,2017.106828,,29.593835,,,,,,
"""std""",,,1745600.0,4332200.0,,3.607958,,109.923538,,,,,,
"""min""","""comm""","""comm/20_Century_Br_Hist/PMC480…",176545.0,1777407.0,""" 10.1186/1477-5956-10-26""",1797.0,"""""""""""""""""Administration""""""""""""""""",0.0,""" # 198 genes mapped to this te…",""" MGA""","""#20""","""SM0""","""not_curated""","""#GenomicDay"""
"""25%""",,,4529189.0,26161174.0,,2015.0,,8.0,,,,,,
"""50%""",,,6128579.0,30105754.0,,2018.0,,18.0,,,,,,
"""75%""",,,7431423.0,32730277.0,,2020.0,,35.0,,,,,,
"""max""","""comm""","""comm/psychopraxis/PMC8325535.n…",8510840.0,34637085.0,"""10.9745/GHSP-D-21-00233""",2022.0,"""𝜑XANES analysis""",20116.0,"""𝜀c regressions and comparisons…","""鼠源及人源化BCMA CAR-T的转染效率""","""应用SPSS22.0软件进行统计学分析""","""SM999999""","""unclear""","""∗BEAST"""


In [87]:
linked_df.describe()

statistic,ID,software_mention,mapped_to,source,platform,package_url,description,homepage_url,other_urls,license,github_repo,github_repo_license,exact_match,RRID,reference,scicrunch_synonyms
str,str,str,str,str,str,str,str,str,str,str,str,str,f64,str,str,str
"""count""","""149015""","""149015""","""149015""","""149015""","""17540""","""149015""","""116070""","""36306""","""18766""","""13485""","""143835""","""39464""",149015.0,"""18766""","""22134""","""18766"""
"""null_count""","""0""","""0""","""0""","""0""","""131475""","""0""","""32945""","""112709""","""130249""","""135530""","""5180""","""109551""",0.0,"""130249""","""126881""","""130249"""
"""mean""",,,,,,,,,,,,,0.903902,,,
"""std""",,,,,,,,,,,,,,,,
"""min""","""SM100000""","""'O""",""" AnaMorph""","""Bioconductor Index""","""Bioconductor""","""https://cran.r-project.org/web…",""" Multilevel Modeling in Epide…","""[""http://www.maths.soton.ac.uk…","""['Mutation', 'Surveyor', 'soft…","""ACM""","""<https://github.com/zhangjunpe…","""0BSD""",0.0,"""SCR_000004""","""<pre>  @Article{,  author …","""[""a character of the italian c…"
"""25%""",,,,,,,,,,,,,,,,
"""50%""",,,,,,,,,,,,,,,,
"""75%""",,,,,,,,,,,,,,,,
"""max""","""SM999993""","""ÖGD""","""zzip""","""SciCrunch API""","""Pypi""","""https://www.bioconductor.org/p…","""🪱 PARASITE || A parallel sente…","""[]""","""[]""","""file LICENSE""","""https://github.com/zzzzbw/Fame""","""Zlib""",1.0,"""SCR_021924""","""https://doi.org/doi:10.18129/B…","""['zymo research corporation', …"


### Exclude the mentions marked as `not_software`


In [None]:
disamb_df_clean = disamb_df.filter(pl.col("curation_label") != "not_software")
# fraction of all mentions that remain after filtering
disamb_df_clean.shape[0] / disamb_df.shape[0]


0.9012042415919774

### Select and rename columns

In [92]:
# Select core columns from the disambiguated data
core_df = disamb_df_clean.select([
    pl.col("doi").alias("article_id"), # TODO: Change to doi or pmid depending on which to use, or combine all?
    pl.col("ID").alias("CZI_software_mention_id"),
    pl.col("software").alias("software_name"),
    pl.col("text").alias("context"),
    # pl.col(" ").alias("mention_type") # TODO: Mention type does not exist, extract from context?
]) 

# Software IDs from linked 
software_info_df = linked_df.select([
    pl.col("ID").alias("CZI_software_mention_id"),
    pl.col("RRID").alias("software_id"),  # disambiguated software identifier (if available)

    # Combine the different URL columns into one list column called software_url.
    # NOTE: nulls and duplicates are not removed here.
    # TODO: all of these except package_url are stringified lists; parse them later
    (
    pl.concat_list([
        pl.col("homepage_url"),
        pl.col("other_urls"),
        pl.col("github_repo"),
        pl.col("package_url")  # this is originally a string
    ]).alias("software_url")
    )

])


# Join: add software IDs to the entries in the core df
merged_df = core_df.join(software_info_df, on="CZI_software_mention_id", how="left")

# Reorder the columns to match SoMeSci.
# This table additionally contains the context and omits mention_type, which is not available.
final_df = merged_df.select([
    "article_id",
    "CZI_software_mention_id",
    "software_id",
    "software_name",
    "software_url",
    "context"  
])

final_df



article_id,CZI_software_mention_id,software_id,software_name,software_url,context
str,str,str,str,list[str],str
"""10.1186/s43591-021-00017-9""","""SM0""",,"""Olympus CellSens""",,"""Then, all items were photograp…"
"""10.1186/s43591-021-00017-9""","""SM1""",,"""OPUS""",,"""Spectra were then vector norma…"
"""10.1186/s43591-021-00017-9""","""SM2""",,"""R package DHARMa""",,"""Model fit was assessed through…"
"""10.1186/s43591-021-00017-9""","""SM3""",,"""R""","[null, null, … ""https://github.com/ncornwell/R""]","""Analyses and plotting were per…"
"""10.1186/s43591-021-00017-9""","""SM3""",,"""R""","[null, null, … ""https://github.com/dmpe/R""]","""Analyses and plotting were per…"
…,…,…,…,…,…
"""10.3390/nu11071443""","""SM53566""",,"""MetaVision""",,"""All data were obtained by revi…"
"""10.3390/nu11071443""","""SM4442""",,,,"""All data were obtained by revi…"
"""10.3390/nu11071443""","""SM53019""",,"""iMDsoft""",,"""All data were obtained by revi…"
"""10.3390/nu11071443""","""SM165""","""SCR_002865""","""SPSS""","[""['http://www-01.ibm.com/software/uk/analytics/spss/']"", ""['https://www.ibm.com/products/software']"", … ""https://scicrunch.org/browse/resources/SCR_002865""]","""Statistical analysis was perfo…"


### Stats

In [93]:
final_df.describe()

statistic,article_id,CZI_software_mention_id,software_id,software_name,software_url,context
str,str,str,str,str,f64,str
"""count""","""16056698""","""16158993""","""4223665""","""16153163""",9465310.0,"""16158993"""
"""null_count""","""102295""","""0""","""11935328""","""5830""",6693683.0,"""0"""
"""mean""",,,,,,
"""std""",,,,,,
"""min""",""" 10.1186/1477-5956-10-26""","""SM0""","""SCR_000004""",""" MGA""",,""" # 198 genes mapped to this te…"
"""25%""",,,,,,
"""50%""",,,,,,
"""75%""",,,,,,
"""max""","""10.9745/GHSP-D-21-00144""","""SM999999""","""SCR_021924""","""鼠源及人源化BCMA CAR-T的转染效率""",,"""𝜀c regressions and comparisons…"


In [94]:
# Share of rows without an RRID (most rows have none)
null_ratio = final_df.select(
    (pl.col("software_id").is_null().sum() / pl.len())
).item()

print(null_ratio)


0.7386183037519727
