# Biobanks

This Jupyter notebook extracts from the Microsoft Academic Graph (MAG) a list of papers that introduce a biobank.

The final list is obtained in a three-step process. Two of the steps are automated and described in this notebook.

The first step is to obtain a subset of articles that match a list of keywords. By the time you run this notebook, this step should already have been done with the `keywords.py` script in the `python` folder.

The second step is to filter out articles that do not introduce a biobank. **This step is the purpose of this Jupyter notebook**.

The third and last step is a manual curation of the list to remove clinical trials and non-human biobanks. To retrieve the latest biobank table curated from the articles returned here, run the `manual_table.py` script.

## 1) List of keywords

The following list is used to obtain the first set of articles and their biobanks. Note that the colon (:) is included in the searches of articles.

- 'cohort', 'baseline', ':'
- 'biobank', ':'
- 'mega biobank', ':'
- '^cohort profile:'
- 'epidemiological', 'study', ':'
- 'prospective', 'study', 'rationale', ':'
- 'longitudinal', 'study', 'rationale', 'design', ':'
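
The actual matching logic lives in `keywords.py`, which is not shown here. As a minimal sketch, assuming each bullet above is a set of terms that must all occur in the lower-cased title, and that a leading `^` anchors a term to the start of the title, the matching could look like the following. The names `KEYWORD_SETS`, `matches_keyword_set` and `matching_sets` are hypothetical.

```python
# Keyword sets copied from the list above.
KEYWORD_SETS = [
    ['cohort', 'baseline', ':'],
    ['biobank', ':'],
    ['mega biobank', ':'],
    ['^cohort profile:'],
    ['epidemiological', 'study', ':'],
    ['prospective', 'study', 'rationale', ':'],
    ['longitudinal', 'study', 'rationale', 'design', ':'],
]


def matches_keyword_set(title, terms):
    """Return True if every term occurs in the lower-cased title.

    Assumption: a term starting with '^' must appear at the start of the title.
    """
    lowered = title.lower()
    for term in terms:
        if term.startswith('^'):
            if not lowered.startswith(term[1:]):
                return False
        elif term not in lowered:
            return False
    return True


def matching_sets(title):
    """Return every keyword set that the title satisfies."""
    return [terms for terms in KEYWORD_SETS if matches_keyword_set(title, terms)]


# Made-up title, for illustration only.
print(matching_sets('Example Biobank: design and methods'))
```

Under these assumptions, a title without a colon is never selected, which is why the colon appears in every keyword set.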

## 2) Processing the data

To process the list obtained in the first step, we keep or discard articles according to the following title rules, applied separately to each keyword group.

### 1. Cohort, Baseline

Consider:
- 'design', 'rationale', 'characteristics', 'methods'

Avoid:
- 'baseline study', 'baseline findings', 'baseline analysis'

### 2. Health Study

Consider:
- Starts with the word "The"

### 3. Biobank

Consider:
- '^\S+ biobank:'
- Starts with the word "The"
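
To illustrate what the two biobank rules accept, here is a small sketch using Python's `re` module on made-up titles. The helper name `keep_biobank_title` is hypothetical, and lower-casing the title before the prefix check is an assumption about how the "Starts with the word The" rule is applied.

```python
import re

# Same pattern as in the list above; IGNORECASE makes the search case-insensitive.
BIOBANK_PREFIX = re.compile(r'^\S+ biobank:', re.IGNORECASE)


def keep_biobank_title(title):
    """Hypothetical helper: True if either biobank rule accepts the title."""
    # Rule 1: exactly one whitespace-free token, then " biobank:".
    one_token_prefix = bool(BIOBANK_PREFIX.search(title))
    # Rule 2: the title starts with the word "The" (any capitalisation).
    starts_with_the = title.lower().startswith('the ')
    return one_token_prefix or starts_with_the


# Made-up titles, for illustration only.
for title in ['Example Biobank: design and rationale',
              'The Example Regional Biobank: design and rationale',
              'Biobank: an overview']:
    print(title, '->', keep_biobank_title(title))
```

The first title is kept by the regular expression and the second by the "The" rule. The third is rejected because no token precedes "biobank" and it does not start with "The".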

### 4. Mega Biobank

No further processing.

### 5. Cohort Profile

No further processing.

### 6. Epidemiological study

Consider:
- 'design', 'rationale', 'methods', 'characteristics', 'objectives', 'approach'
- '^the \S+ study:'
- ': [Tt]he [A-Z]\S+ [sS]tudy'
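
The two regular expressions above target titles that name a study either at the very start or right after a colon. The following sketch, with made-up titles and a hypothetical helper `names_a_study`, shows which shapes they accept. It assumes the first pattern is applied case-insensitively and the second case-sensitively.

```python
import re

# "The <Name> study:" at the start of the title, any capitalisation.
STARTS_WITH_STUDY = re.compile(r'^the \S+ study:', re.IGNORECASE)
# ": the <Name> Study" after a colon, where <Name> starts with a capital letter
# followed by at least one more non-space character.
STUDY_AFTER_COLON = re.compile(r': [Tt]he [A-Z]\S+ [sS]tudy')


def names_a_study(title):
    """Hypothetical helper: True if either pattern is found in the title."""
    return bool(STARTS_WITH_STUDY.search(title)
                or STUDY_AFTER_COLON.search(title))


# Made-up titles, for illustration only.
for title in ['The Example study: an epidemiological study of X',
              'Epidemiological design and methods: the Example Study',
              'Epidemiological study of X: the example study']:
    print(title, '->', names_a_study(title))
```

The last title is rejected because the word after "the" is not capitalised, so the second pattern treats it as a generic phrase rather than a study name.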


### 7. Prospective

Avoid:
- 'trial'

### 8. Longitudinal

Avoid:
- 'trial', 'comparison', 'random', 'longitudinal study'
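
The Consider/Avoid rules in this section share one shape: a title is dropped if it contains any "avoid" term, and otherwise kept if it contains at least one "consider" term (or if there are no "consider" terms). A minimal sketch of that shape follows. The function `passes_rules` and the `RULES` dictionary are hypothetical and are not the code used in the cells below.

```python
def passes_rules(title, consider=(), avoid=()):
    """Apply substring rules to a lower-cased title.

    Avoid terms take precedence; an empty `consider` means "no requirement".
    """
    lowered = title.lower()
    if any(term in lowered for term in avoid):
        return False
    if consider:
        return any(term in lowered for term in consider)
    return True


# Rules copied from sections 1 and 8 above.
RULES = {
    'cohort': {
        'consider': ['design', 'rationale', 'characteristics', 'methods'],
        'avoid': ['baseline study', 'baseline findings', 'baseline analysis'],
    },
    'longitudinal': {
        'consider': [],
        'avoid': ['trial', 'comparison', 'random', 'longitudinal study'],
    },
}

# Made-up title, for illustration only.
print(passes_rules('Example cohort: design and baseline characteristics',
                   **RULES['cohort']))
```

Keeping the rules in a dictionary makes it easier to compare them with the lists above, at the cost of hiding the regular-expression rules, which do not fit this substring-only shape.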

In [26]:
import sys
sys.path.append('../python')

from serendipity import *

In [1]:
import pandas as pd

In [36]:
cohort = pd.read_csv('../../data/raw/initial.csv', low_memory=False)

## Cohort Baseline

In [38]:
cb = cohort[cohort['keyword']=='cohort']

In [39]:
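def get_design(x):
    """Return True if the title describes a cohort's design, not baseline results."""
    title = x.lower()
    # Titles that report baseline results instead of introducing the cohort.
    avoid = ['baseline study', 'baseline findings',
             'baseline analysis']
    for word in avoid:
        if word in title:
            return False
    # Terms that signal a design or rationale paper.
    words = ['design', 'rationale', 'methods', 'characteristics', 'objectives']
    for word in words:
        if word in title:
            return True
    return False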

In [40]:
cb = cb[cb['OriginalTitle'].apply(get_design)]

## Health Study

In [41]:
hs = cohort[cohort['keyword']=='health_study']

In [42]:
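def filter_hs(x):
    """Return True if the title contains a design-related term."""
    # Note: this helper is defined but not applied below; only the
    # 'the ' prefix filter is used for health studies.
    title = x.lower()
    words = ['design', 'rationale', 'methods', 'characteristics', 'objectives']
    for word in words:
        if word in title:
            return True
    return False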

In [43]:
ths = hs[hs['PaperTitle'].str.startswith('the ')]

## Biobank and Mega Biobank

In [44]:
bb = cohort[cohort['keyword']=='biobank']

In [45]:
mbb = cohort[cohort['keyword']=='mega_biobank']

In [46]:
bb1 = bb[bb['OriginalTitle'].str.contains(r'^\S+ biobank:', case=False)]
bb2 = bb[bb['PaperTitle'].str.startswith('the ')]

In [47]:
bb = pd.concat([bb1, bb2, mbb])

## Cohort Profiles

Nothing to do here.

In [48]:
cp = cohort[cohort['keyword']=='cohort_profile']

## Epidemiological study

In [49]:
ep = cohort[cohort['keyword']=='epidemiological_study']

In [50]:
ep1 = ep[ep['OriginalTitle'].str.contains(r'^the \S+ study:', case=False)]
ep2 = ep[ep['OriginalTitle'].str.contains(r': [Tt]he [A-Z]\S+ [sS]tudy')]

In [51]:
ep = pd.concat([ep1, ep2])

## Prospective Study

In [52]:
pros = cohort[cohort['keyword']=='prospective']

In [53]:
pros = pros[~pros['PaperTitle'].str.contains('trial')]

## Longitudinal

In [54]:
long = cohort[cohort['keyword']=='longitudinal']

In [55]:
# Drop titles containing any of the terms to avoid, one term at a time.
for term in ['trial', 'comparison', 'random', 'longitudinal study']:
    long = long[~long['PaperTitle'].str.contains(term)]

## All together

In [56]:
cohort = pd.concat([bb, ths, cb, cp, ep, pros, long])

In [61]:
cohort = cohort.drop_duplicates(subset='PaperId')
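
The de-duplication matters because one paper can pass more than one filter. For example, a biobank title can satisfy both the `^\S+ biobank:` pattern and the "starts with the" rule, so it appears twice after `pd.concat`. The sketch below reproduces this on a made-up table. It assumes `PaperTitle` holds a lower-cased version of `OriginalTitle`, as the `startswith('the ')` checks in this notebook suggest.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    'PaperId': [1, 2, 3],
    'OriginalTitle': [
        'The Biobank: a made-up resource',          # passes both filters
        'Example Biobank: design and rationale',    # passes the regex only
        'The Example Regional Biobank: rationale',  # passes the prefix only
    ],
})
# Assumption: PaperTitle is the lower-cased title.
toy['PaperTitle'] = toy['OriginalTitle'].str.lower()

by_regex = toy[toy['OriginalTitle'].str.contains(r'^\S+ biobank:', case=False)]
by_prefix = toy[toy['PaperTitle'].str.startswith('the ')]

combined = pd.concat([by_regex, by_prefix])
deduped = combined.drop_duplicates(subset='PaperId')

print(len(combined), len(deduped))
```

`drop_duplicates` keeps the first occurrence of each `PaperId` by default, so the order of the frames passed to `pd.concat` decides which copy of a duplicated row survives.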

In [None]:
cohort.to_csv('../../data/out/step2.csv', index=False)