In [1]:
import pandas as pd
from pathlib import Path
data_dir = Path("data")

# The New School Method

For the Foras project I took the original `van_de_Schoot_2018` dataset on PTSD, in the version from the Synergy dataset (https://github.com/asreview/synergy-dataset), and used it to produce 4 new datasets that Bruno and his team will screen manually. My main source of data was the OpenAlex database. I worked with a snapshot of OpenAlex from 2023-10-20. Some statistics:

- Total records in OpenAlex: 246M
- Records with abstract: 126M
- Records with publication_year >= 2015: 85M
- Records with abstract and publication_year >= 2015: 49M

The set that we actually searched in was the set of records with publication_year >= 2015 that contain a title or an abstract. This set consisted of slightly less than 85 million records.

And some information on `van_de_Schoot_2018`:
- 4544 records
- 38 included records
- 4 records are excluded based on updated views, so that leaves 34 included records

## Method 1: Snowballing
I took the 34 included records and looked in OpenAlex for records that cite them. I also went one step further and collected the records that cite the records I had found. I call the first group 'primary' and the second group 'secondary'. Here are the numbers:

|        | Total | Primary | Secondary |
|--------|-------|---------|-----------|
| All    | 9016  | 465     | 8551      |
| >=2015 | 8682  | 451     | 8231      |
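
The two snowballing levels can be illustrated with a minimal sketch. It is not the script from the repository: the function name `snowball`, the in-memory `referenced_works` dictionary and the toy ids are assumptions made for illustration, and the choice to keep the two levels disjoint is inferred from the fact that the totals in the table are the sum of the primary and secondary counts.

```python
from collections import defaultdict


def snowball(seed_ids, referenced_works):
    """Return (primary, secondary) sets of citing record ids.

    `referenced_works` maps a record id to the ids it references
    (a hypothetical in-memory stand-in for the OpenAlex field of that name).
    """
    # Invert the reference lists: cited id -> set of citing ids.
    cited_by = defaultdict(set)
    for work_id, refs in referenced_works.items():
        for ref in refs:
            cited_by[ref].add(work_id)

    seeds = set(seed_ids)

    # Primary: records that cite at least one seed record.
    primary = set()
    for seed in seeds:
        primary |= cited_by.get(seed, set())
    primary -= seeds

    # Secondary: records that cite a primary record.
    # Assumption: seeds and primary records are not counted again here.
    secondary = set()
    for work_id in primary:
        secondary |= cited_by.get(work_id, set())
    secondary -= seeds | primary
    return primary, secondary


# Toy data: S1 and S2 play the role of included records.
toy_refs = {
    "W1": ["S1"],
    "W2": ["S1", "S2"],
    "W3": ["W1"],
    "W4": ["W2", "S2"],
    "W5": ["W3"],
}
primary, secondary = snowball(["S1", "S2"], toy_refs)
print(sorted(primary))    # ['W1', 'W2', 'W4']
print(sorted(secondary))  # ['W3']
```

`W5` is not returned because it only cites a secondary record, which would be a third step.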

Below is a description of what I did. For more details and the exact implementation and scripts, see https://github.com/IDfuse/foras/

In [5]:
snowballing_df = pd.read_csv(data_dir / "citations.csv")
snowballing_df.head()

Unnamed: 0,id,doi,title,abstract,referenced_works,publication_date,level
0,https://openalex.org/W2500077069,https://doi.org/10.1093/alcalc/agw046,Overview of the Genetics of Alcohol Use Disorder,Alcohol Use Disorder (AUD) is a chronic psychi...,"['https://openalex.org/W220770808', 'https://o...",2016-07-21,primary
1,https://openalex.org/W2783658081,https://doi.org/10.1080/00273171.2017.1412293,Bayesian PTSD-Trajectory Analysis with Informe...,There is a recent increase in interest of Baye...,"['https://openalex.org/W171228341', 'https://o...",2018-01-11,primary
2,https://openalex.org/W2554356877,https://doi.org/10.1111/acer.13269,Alcohol Misuse and Co‐Occurring Mental Disorde...,Background Problem drinking that predates enli...,"['https://openalex.org/W131455497', 'https://o...",2016-11-24,primary
3,https://openalex.org/W2806456116,https://doi.org/10.1016/j.jpain.2018.04.013,Alcohol and Opioid Use in Chronic Pain: A Cros...,Opioid misuse is regularly associated with dis...,"['https://openalex.org/W1595936767', 'https://...",2018-10-01,primary
4,https://openalex.org/W3003066715,https://doi.org/10.1016/j.bbr.2020.112500,Chronic repeated predatory stress induces resi...,"Trauma related psychiatric disorders, such as ...","['https://openalex.org/W23314739', 'https://op...",2020-03-01,primary


## Vectorizing OpenAlex
For the other methods I needed a feature matrix, just like the one we use in ASReview. I used a multilingual sentence transformers model (https://huggingface.co/intfloat/multilingual-e5-small) to vectorize all records in OpenAlex with a title or abstract. This gave me around 640GB of vectors, which I uploaded to Hugging Face Datasets (https://huggingface.co/datasets/GlobalCampus/openalex-multilingual-embeddings/tree/main) so that others can use them as well. Next I took the vectors of records with a publication date from 2015 onwards and fed them to a search engine (https://vespa.ai/). So the set in which I search consists of the records in OpenAlex with a title or abstract and a publication year of at least 2015. This results in approximately 85 million records. Now I can search in the dataset for texts that are close to a query text:
- I turn the text into a vector
- I send the vector to the search engine, which gives me back the `N` closest vectors. (This uses Vespa to create an HNSW index based on cosine similarity between the vectors. Details on the Vespa implementation of the HNSW index can be found in (https://docs.vespa.ai/en/approximate-nn-hnsw.html); it is based on the original paper (https://arxiv.org/abs/1603.09320).)
- I look up the information of those `N` vectors in OpenAlex.
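
The search step can be illustrated with a minimal sketch. It does an exact, brute-force cosine search over a handful of toy vectors, whereas the HNSW index in Vespa is an approximate method built for tens of millions of vectors. The embedding step is left out, and the function names and toy vectors are assumptions made for illustration.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_records(query_vector, record_vectors, n):
    """Return the ids of the n records most similar to the query, best first.

    `record_vectors` maps a record id to its vector (toy stand-in for the index).
    """
    scored = (
        (cosine_similarity(query_vector, vector), record_id)
        for record_id, vector in record_vectors.items()
    )
    return [record_id for _, record_id in heapq.nlargest(n, scored)]


# Toy 2-dimensional "embeddings"; real vectors have many more dimensions.
toy_vectors = {
    "A": [1.0, 0.0],
    "B": [0.8, 0.6],
    "C": [0.0, 1.0],
    "D": [-1.0, 0.0],
}
print(closest_records([1.0, 0.1], toy_vectors, n=2))  # ['A', 'B']
```

Brute force compares the query with every vector, which is why an index such as HNSW is needed at the scale of OpenAlex.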

I did this in three ways:

## Method 2: Semantic Search using inclusion criteria
For this method I used the inclusion criteria of the systematic review as the input text, as can be found in the section 'Inclusion criteria' under 'Types of study to be included' in (https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=494027). I got back the 1000 records that are closest to these inclusion criteria.

In [7]:
inclusion_criteria_df = pd.read_csv(data_dir / "inclusion_criteria_dataset.csv")
inclusion_criteria_df.head()

Unnamed: 0,id,doi,title,abstract,rank
0,https://openalex.org/W4214825701,https://doi.org/10.1016/b978-0-12-823039-8.000...,"Posttraumatic stress disorder: Diagnosis, meas...",This chapter discussed the diagnosis of posttr...,0
1,https://openalex.org/W2966802484,,Self-Rated versus Clinician-Rated Assessment o...,,1
2,https://openalex.org/W2617390452,https://doi.org/10.1080/24750573.2017.1326746,Psychometric properties of the Turkish version...,Background: In the subsequent revision of Diag...,2
3,https://openalex.org/W3198362577,https://doi.org/10.1016/j.psychres.2021.114197,After a disaster: Validation of PTSD checklist...,Posttraumatic stress disorder (PTSD) is a comm...,3
4,https://openalex.org/W4289526011,https://doi.org/10.1177/10731911221113571,Self-Rated Versus Clinician-Rated Assessment o...,Posttraumatic stress disorder (PTSD) is common...,4


## Method 3: Semantic Search using included records
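With each of the 34 included records I searched for the closest 5000 records. This gave me 170,000 results (34 × 5000), but obviously some records appear multiple times because they are close to multiple input records. After deduplication only 57232 records are left. In the table below, `n_responses` is the number of result lists that contain a record, and `n_records` is the number of records contained in exactly that many lists. Some statistics: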

| n_responses | n_records |  n_responses | n_records |
| ----------- | --------- |  ----------- | --------- |
| 34 |       4 |  17 |     184 |
| 33 |       8 |  16 |     210 |
| 32 |      29 |  15 |     267 |
| 31 |      24 |  14 |     313 |
| 30 |      36 |  13 |     332 |
| 29 |      26 |  12 |     409 |
| 28 |      38 |  11 |     512 |
| 27 |      56 |  10 |     610 |
| 26 |      58 |  9  |     765 |
| 25 |      51 |  8  |     856 |
| 24 |      66 |  7  |    1117 |
| 23 |      73 |  6  |    1431 |
| 22 |      86 |  5  |    1786 |
| 21 |      92 |  4  |    2505 |
| 20 |     116 |  3  |    4085 |
| 19 |     142 |  2  |    7944 |
| 18 |     151 |  1  |   32850 |
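
Counts like these can be derived from the result lists along the following lines. This is a minimal sketch on toy data, not the original script; the function name and the toy lists are assumptions made for illustration.

```python
from collections import Counter


def response_count_histogram(result_lists):
    """Return (per_record, histogram).

    per_record[r] is the number of result lists that contain record r.
    histogram[k] is the number of records contained in exactly k lists.
    """
    per_record = Counter()
    for results in result_lists:
        # set() makes sure a record counts at most once per list.
        per_record.update(set(results))
    histogram = Counter(per_record.values())
    return per_record, histogram


# Three toy result lists instead of 34 lists of 5000 records.
toy_lists = [["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]
per_record, histogram = response_count_histogram(toy_lists)
print(len(per_record))            # 5 unique records after deduplication
print(sorted(histogram.items()))  # [(1, 3), (2, 1), (3, 1)]
```

The number of keys in `per_record` is the size of the deduplicated set, and the values of `histogram` add up to that same number.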

Or in a picture:

![Number of responses containing record](n_responses_log.png)

Description: This figure shows for each `n` between 1 and 34, the number of records that are contained in `n` responses. On the x-axis is the number of responses. On the y-axis is the logarithm of the number of records contained in exactly that many responses.

The next picture shows the overlap between the result lists of the different queries.

![Overlap between queries](overlap_clustered.png)

Description of `overlap_clustered.png`: For each pair of result lists I take the number of records that the two lists have in common. This is a number between 0 (no overlap) and 5000 (complete overlap). These values are plotted in a 2D grid, where the cell at coordinate `(x,y)` shows the overlap between list `x` and list `y`. The lists are ordered so that lists with a lot of overlap are close together, which makes it easier to see clusters of similar lists.
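
The pairwise overlap values can be computed as in the minimal sketch below. It only covers the overlap counts. The reordering of the lists that brings similar lists together is left out, because the description above does not say which method was used. The function name and the toy lists are assumptions made for illustration.

```python
def overlap_matrix(result_lists):
    """Return a square matrix m with m[x][y] = number of shared records
    between result list x and result list y."""
    sets = [set(results) for results in result_lists]
    return [[len(a & b) for b in sets] for a in sets]


# Three toy result lists instead of 34 lists of 5000 records.
toy_overlap_lists = [["a", "b", "c"], ["b", "c", "d"], ["c", "e"]]
for row in overlap_matrix(toy_overlap_lists):
    print(row)
# [3, 2, 1]
# [2, 3, 1]
# [1, 1, 2]
```

The matrix is symmetric, and each diagonal entry equals the number of distinct records in that list.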

Obviously these are too many records to screen manually so I needed to bring down this number. The first idea is to choose the top `N` records for each query.

![New record per rank](n_unique_at_rank.png)

Description of `n_unique_at_rank.png`: For each `i` between 1 and 5000, I check how many more records you get if you select the top `i` records instead of the top `i-1` records. This `i` is on the x-axis and the number of new records is on the y-axis. Of course the y-value can be between 0 and 34, because we have 34 lists of records.

By choosing `N=427` I get a resulting dataset of exactly 7000 records.
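
The relation between the cutoff `N` and the size of the resulting dataset can be illustrated with a minimal sketch on toy data. The function names, the `budget` parameter and the toy lists are assumptions made for illustration. This is not the script that produced the 7000 records.

```python
def new_records_per_rank(result_lists):
    """For each rank i (1-based), count the records that appear at rank i
    in some list and did not appear at any earlier rank in any list."""
    seen = set()
    gains = []
    depth = max(len(results) for results in result_lists)
    for i in range(depth):
        at_rank = {results[i] for results in result_lists if i < len(results)}
        new = at_rank - seen
        gains.append(len(new))
        seen |= new
    return gains


def unique_at_cutoff(result_lists, n):
    """Number of distinct records among the top n of every list."""
    return len({rec for results in result_lists for rec in results[:n]})


def largest_cutoff_within(result_lists, budget):
    """Largest n for which the top-n union has at most `budget` records."""
    total = 0
    best = 0
    for rank, gain in enumerate(new_records_per_rank(result_lists), start=1):
        total += gain
        if total > budget:
            break
        best = rank
    return best


# Three toy ranked lists instead of 34 lists of 5000 records.
toy_ranked = [
    ["a", "b", "c", "d"],
    ["b", "a", "e", "f"],
    ["a", "c", "e", "g"],
]
print(new_records_per_rank(toy_ranked))              # [2, 1, 1, 3]
print(largest_cutoff_within(toy_ranked, budget=4))   # 3
print(unique_at_cutoff(toy_ranked, 3))               # 4
```

The running sum of the per-rank gains equals the size of the top-`n` union, so a single pass over the ranks is enough to find the cutoff for a given budget.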

In [9]:
included_records_df = pd.read_csv(data_dir / "included_records_dataset.csv")
included_records_df.head()

Unnamed: 0,id,doi,title,abstract,rank
0,https://openalex.org/W2793121854,https://doi.org/10.1080/20008198.2018.1443672,Pre-deployment dissociation and personality as...,Objective: This study investigated whether pre...,0
1,https://openalex.org/W3179801679,https://doi.org/10.1080/10615806.2021.1950694,The impact of political violence on posttrauma...,Objective The current paper uses the Conservat...,0
2,https://openalex.org/W3128621410,https://doi.org/10.1016/j.jad.2021.01.086,Understanding trajectories of underlying dimen...,Research suggests four modal trajectories of p...,0
3,https://openalex.org/W2195260412,https://doi.org/10.1002/jts.22055,Mental Health Over Time in a Military Sample: ...,To identify trajectories of depression and pos...,0
4,https://openalex.org/W3027965607,https://doi.org/10.1093/schbul/sbaa028.016,O3.5. EARLY TRAJECTORIES OF POSITIVE SYMPTOMS ...,Abstract Background The Prevention and Early i...,0


## Method 4: Using a logistic model to bring down the numbers

Finally I took the top `10*N = 4270` records for each query. This gave me 50055 unique records. Then I used ASReview (version 1.5) to train a logistic model (with double balance) on the original dataset. I applied this trained model to the 50055 records to get relevance scores, and took the top 7000 based on those scores. See the actual script in https://github.com/IDfuse/foras/ for the implementation details.
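
The overall flow of this method (pool, deduplicate, score, keep the best) can be illustrated with a minimal sketch. The `score` argument is a hypothetical stand-in for the relevance score of the trained ASReview model. The sketch does not train a model and is not the script from the repository. The toy scores and ids are made up for illustration.

```python
def rerank_pool(result_lists, n_per_query, score, keep):
    """Pool the top `n_per_query` records of each query, then keep the
    `keep` records with the highest score.

    `score` is any function from record id to a number (here a toy
    stand-in for a model's relevance score).
    """
    # Pool and deduplicate the top records of every query.
    pool = {rec for results in result_lists for rec in results[:n_per_query]}
    # Highest score first; ties are broken on the id to stay deterministic.
    ranked = sorted(pool, key=lambda rec: (-score(rec), rec))
    return ranked[:keep]


# Toy ranked lists and toy relevance scores.
toy_queries = [["a", "b", "c"], ["b", "d", "e"], ["a", "e", "f"]]
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7, "e": 0.9, "f": 0.1}

top = rerank_pool(toy_queries, n_per_query=2, score=toy_scores.get, keep=3)
print(top)  # ['b', 'e', 'd']
```

With `n_per_query=2` the pool is `{a, b, d, e}`, so `c` and `f` are never scored, which mirrors the effect of only scoring the top 4270 records per query.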

In [10]:
logistic_df = pd.read_csv(data_dir / "included_records_logistic_dataset.csv")
logistic_df.head()

Unnamed: 0,id,doi,title,abstract,relevance_score
0,https://openalex.org/W3046427873,https://doi.org/10.31390/gradschool_dissertati...,Factor Predicting Maternal Posttraumatic Stres...,"Natural disasters are sudden, large-scale even...",0.962412
1,https://openalex.org/W2144310792,https://doi.org/10.1111/jcpp.12420,Trajectories of post‐traumatic stress disorder...,Background Theorists and researchers have demo...,0.957404
2,https://openalex.org/W4213023858,https://doi.org/10.1192/bjp.2022.2,Risk and resilience in trajectories of post-tr...,Background First responders to disasters are a...,0.955518
3,https://openalex.org/W4294199880,https://doi.org/10.1192/j.eurpsy.2022.629,Risk and Resilience in Trajectories of Post-Tr...,Introduction First responders to disasters are...,0.949393
4,https://openalex.org/W2331818908,https://doi.org/10.1037/abn0000020,Posttraumatic stress in deployed Marines: Pros...,We examined the course of PTSD symptoms in a c...,0.94841
