<a href="https://colab.research.google.com/github/NLP4/Enhacing-SPECTER-with-some-extensions/blob/main/CORD19_Baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is the baseline for the classification task on the CORD-19 dataset. It generates SPECTER embeddings from each paper's title and abstract and trains a classifier on them.

In [None]:
!git clone https://github.com/allenai/specter.git

Cloning into 'specter'...
remote: Enumerating objects: 195, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (61/61), done.
remote: Total 195 (delta 80), reused 70 (delta 59), pack-reused 75
Receiving objects: 100% (195/195), 316.53 KiB | 10.91 MiB/s, done.
Resolving deltas: 100% (96/96), done.


In [None]:
!wget https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/specter/archive.tar.gz

--2023-04-24 18:21:00--  https://ai2-s2-research-public.s3-us-west-2.amazonaws.com/specter/archive.tar.gz
Resolving ai2-s2-research-public.s3-us-west-2.amazonaws.com (ai2-s2-research-public.s3-us-west-2.amazonaws.com)... 52.218.168.225, 52.92.208.10, 52.218.182.49, ...
Connecting to ai2-s2-research-public.s3-us-west-2.amazonaws.com (ai2-s2-research-public.s3-us-west-2.amazonaws.com)|52.218.168.225|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 873742818 (833M) [application/x-tar]
Saving to: ‘archive.tar.gz’


2023-04-24 18:21:27 (30.9 MB/s) - ‘archive.tar.gz’ saved [873742818/873742818]



In [None]:
!pip install -U -q kaggle
!mkdir -p ~/.kaggle
# Replace the placeholders with your own Kaggle credentials; never commit a real API key
!echo '{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_API_KEY"}' > ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d dillonpulliam/cord19cleaneddata

Downloading cord19cleaneddata.zip to /content
 98% 595M/608M [00:07<00:00, 109MB/s]
100% 608M/608M [00:07<00:00, 85.0MB/s]


In [None]:
!unzip cord19cleaneddata.zip

Archive:  cord19cleaneddata.zip
  inflating: covidData.csv           
  inflating: covidDataCleaned.csv    


In [None]:
import re
import numpy as np
import pandas as pd
import pickle
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

In [None]:
metadata_path = '/content/covidDataCleaned.csv'
meta_df = pd.read_csv(metadata_path, dtype={'doi': str})

In [None]:
meta_df = meta_df[['paper_id','abstract','body_text','title','authors','journal','url']].reset_index(drop=True)

In [None]:
def get_label(row):
    abstract = str(row['abstract']).lower() # convert abstract to string and lowercase
    if "covid-19" in abstract or "coronavirus" in abstract or "sars-cov-2" in abstract:
        return "relevant"
    else:
        return "irrelevant"
meta_df['label'] = meta_df.apply(get_label, axis=1)
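
The labelling rule above is a plain substring check on the lowercased abstract. The following is a minimal standalone sketch of the same rule applied to raw strings. `label_abstract`, `KEYWORDS`, and the sample texts are illustrative names and inputs, not part of the notebook. It also shows that a missing abstract (NaN) is stringified to "nan" and therefore ends up labelled irrelevant.

```python
# Keywords taken from the get_label function above.
KEYWORDS = ("covid-19", "coronavirus", "sars-cov-2")

def label_abstract(abstract):
    """Return 'relevant' if any keyword occurs in the lowercased abstract."""
    text = str(abstract).lower()  # a NaN abstract becomes the string "nan"
    return "relevant" if any(k in text for k in KEYWORDS) else "irrelevant"

# Hypothetical sample inputs, including a missing abstract.
samples = [
    "Chest CT findings of Coronavirus disease",
    "Seroprevalence of avian respiratory diseases",
    float("nan"),
]
labels = [label_abstract(s) for s in samples]
print(labels)  # ['relevant', 'irrelevant', 'irrelevant']
```

Because the label is derived from the abstract and the embeddings are also computed from the abstract, the classifier can in principle recover the label from the same keywords, which is worth keeping in mind when reading the scores.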

In [None]:
meta_df.head(5)

Unnamed: 0,paper_id,abstract,body_text,title,authors,journal,url,label
0,7037460cc980744603573744bf370ee8f49a4ffe,Objectives The aim of this study was to determ...,Drugs that inhibit virus replication have beco...,Efficacy and safety of the nucleoside analog G...,"Pedersen, Niels C; Perron, Michel; Bannasch, M...",J Feline Med Surg,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,irrelevant
1,46bf124930f3ef18bc9dd2d4ae356a45d3bae461,Objective: This study presents a preliminary r...,https://doi.org/10.3348/kjr.2020.0132 kjronlin...,Chest Radiographic and CT Findings of the 2019...,"Yoon, Soon Ho; Lee, Kyung Hee; Kim, Jin Yong; ...",Korean J Radiol,https://doi.org/10.3348/kjr.2020.0132,relevant
2,983df610328c1e73e3c12546d42a14d520844f9b,"How to cite: Bhuiyan ZA, Ali MZ, Moula MM, Bar...",The poultry industry is an important subsector...,Seroprevalence of major avian respiratory dise...,"Bhuiyan, Zafar Ahmed; Ali, Md Zulfekar; Moula,...",J Adv Vet Anim Res,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,irrelevant
3,4bbb0c59babc718f67953fae032dad6ae0d7aeb1,"Genome Detective is a web-based, user-friendly...",We are currently faced with a potential global...,Genome Detective Coronavirus Typing Tool for r...,"Cleemput, S.; Dumon, W.; Fonseca, V.; Karim, W...","Bioinformatics (Oxford, England)",https://doi.org/10.1093/bioinformatics/btaa145,relevant
4,3061f05203159384dfbb2fd9b1d9a1ca7b98c8a6,Introduction: The earthquake is one of the mos...,"Over the past 10 years, natural disasters have...",Iranian Emergency Medical Service Response in ...,"Saberian, Peyman; Kolivand, Pir-Hossein; Hasan...",Adv J Emerg Med,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,irrelevant


In [None]:
import pandas as pd
import re
import numpy as np
import spacy

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Regular expression for a token that starts with four digits (used as a proxy for the publication year)
date_regex = re.compile(r"\d{4}")

# Create a new column for the date
meta_df["date"] = ""

# Process each abstract and extract the date
for i, abstract in enumerate(meta_df["abstract"]):
    if isinstance(abstract, float) and np.isnan(abstract):
        abstract = ""
    doc = nlp(abstract)
    for token in doc:
        if token.pos_ == "NUM" and date_regex.match(token.text):
            date = token.text
            break
    else:
        date = ""
    
    # Store the extracted date in the date column
    meta_df.at[i, "date"] = date if isinstance(date, str) else str(date)

In [None]:
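# Keep papers whose extracted year is 2019-2023. The comparison is on strings (lexicographic),
# so rows with an empty date are dropped.
meta_df = meta_df[meta_df['date'].notnull() & (meta_df['date'] > '2018') & (meta_df['date'] < '2024')]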
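
The year extraction and filter above can also be sketched without spaCy, using only a regular expression and an integer comparison. This is an alternative under stated assumptions, not the method the notebook uses: `YEAR_RE`, `extract_year`, and `in_window` are hypothetical names, and the pattern only accepts years beginning with 19 or 20.

```python
import re

# Four-digit years starting with 19 or 20, delimited by word boundaries.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def extract_year(text):
    """Return the first four-digit year found in text as an int, or None."""
    if not isinstance(text, str):  # covers NaN/None for missing abstracts
        return None
    match = YEAR_RE.search(text)
    return int(match.group()) if match else None

def in_window(year, start=2019, end=2023):
    """True if year lies in the closed interval [start, end]."""
    return year is not None and start <= year <= end

# Hypothetical sample abstracts.
examples = [
    "In December 2019, a novel virus emerged.",
    "COVID-19 spread quickly.",
    float("nan"),
]
years = [extract_year(t) for t in examples]
print(years)                          # [2019, None, None]
print([in_window(y) for y in years])  # [True, False, False]
```

Comparing integers avoids the edge cases of lexicographic string comparison, and the word boundaries keep the "19" in "COVID-19" from being read as a year.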

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

# Keep only the columns needed downstream; 'label' must be retained for the classifier
meta_df = meta_df[['paper_id','abstract','body_text','title','authors','journal','url','label']].reset_index(drop=True)

# Load the SPECTER model and tokenizer
model_name = "allenai/specter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Compute the embeddings in batches
batch_size = 12
num_batches = (len(meta_df) + batch_size - 1) // batch_size

embeddings = []
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(meta_df))
    batch = meta_df.iloc[start_idx:end_idx]
    inputs = list(batch.apply(lambda row: f"{row['title']} {row['abstract']}", axis=1))

    # Tokenize the inputs and pad the sequences
    encoded_inputs = tokenizer(inputs, padding=True, truncation=True, max_length=512, return_tensors='pt')
    padded_inputs = {k: v.to(model.device) for k, v in encoded_inputs.items()}

    # Compute the embeddings for the batch
    with torch.no_grad():
        outputs = model(**padded_inputs)
        batch_embeddings = outputs.last_hidden_state[:, 0, :].cpu().numpy()

    embeddings.append(batch_embeddings)

# Concatenate the embeddings for all batches
embeddings = np.concatenate(embeddings, axis=0)
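
The cell above joins title and abstract with a space, so a missing abstract is embedded as the literal text "nan". The sketch below shows one way to build the input strings and batches in plain Python. It assumes the separator is the tokenizer's `sep_token` (taken here to be "[SEP]" for `allenai/specter`; in the notebook one would pass `tokenizer.sep_token` instead), which is how the SPECTER usage example joins the two fields as far as I recall. `build_inputs`, `iter_batches`, and the toy records are hypothetical.

```python
import math

def _text(value):
    """Return value as a string, mapping None/NaN to an empty string."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    return str(value)

def build_inputs(records, sep_token="[SEP]"):
    """Join title and abstract with the separator token for each record."""
    return [_text(r.get("title")) + sep_token + _text(r.get("abstract"))
            for r in records]

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Toy records: every odd-numbered paper has a missing abstract.
records = [
    {"title": f"Paper {i}",
     "abstract": float("nan") if i % 2 else f"Abstract {i}"}
    for i in range(25)
]
texts = build_inputs(records)
sizes = [len(b) for b in iter_batches(texts, 12)]
print(texts[:2])  # ['Paper 0[SEP]Abstract 0', 'Paper 1[SEP]']
print(sizes)      # [12, 12, 1]
```

Each batch of strings would then be passed to the tokenizer exactly as in the loop above. Switching the separator changes the embeddings, so the reported scores would need to be recomputed.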

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(embeddings, meta_df['label'], test_size=0.2, random_state=42)

# Train a logistic regression classifier
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the performance of the classifier
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

  irrelevant       0.86      0.75      0.80        32
    relevant       0.90      0.95      0.92        78

    accuracy                           0.89       110
   macro avg       0.88      0.85      0.86       110
weighted avg       0.89      0.89      0.89       110
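
To put the report in context, the averaged rows can be recomputed from the per-class values and compared with a classifier that always predicts the majority class. The sketch below uses only the numbers printed above; the variable names are illustrative.

```python
# Per-class support and F1 copied from the classification report above.
support = {"irrelevant": 32, "relevant": 78}
f1 = {"irrelevant": 0.80, "relevant": 0.92}

total = sum(support.values())

# Accuracy of always predicting the most frequent class ("relevant").
majority_acc = max(support.values()) / total

# Macro average: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted average: per-class F1 weighted by class support.
weighted_f1 = sum(f1[c] * support[c] for c in support) / total

print(total)                  # 110
print(round(majority_acc, 2)) # 0.71
print(round(macro_f1, 2))     # 0.86
print(round(weighted_f1, 2))  # 0.89
```

The reported accuracy of 0.89 is therefore about 18 points above the 0.71 majority-class baseline. With only 110 test papers (a 20% split of roughly 550 papers remaining after the date filter), these figures carry noticeable sampling uncertainty.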

