**Step 1: Set Up the Environment**

In [1]:
!pip install pandas transformers torch scikit-learn




**Step 2: Loading Required Libraries and Pretrained Model**

I used the pretrained **DistilBERT model** and its tokenizer to represent each paper's abstract and title. Note that the filtering and classification code in Step 4 matches keywords in the abstract text; it does not use the model's embeddings.

In [2]:
import pandas as pd
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load the pretrained DistilBERT model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")
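
The cell above only loads the model. As a minimal sketch of how it could encode text, the functions below assume mean pooling over the last hidden state and cosine similarity against a query sentence. The names `mean_pool`, `embed_texts`, and `similarity_to_query`, as well as the query wording in the usage comment, are hypothetical and not part of the original notebook.

```python
import torch
import torch.nn.functional as F

def mean_pool(last_hidden_state, attention_mask):
    """Average the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero
    return summed / counts

def embed_texts(texts, tokenizer, model, max_length=512):
    """Encode a list of strings into one vector per string (assumed pooling: mean)."""
    inputs = tokenizer(texts, padding=True, truncation=True,
                       max_length=max_length, return_tensors="pt")
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    return mean_pool(outputs.last_hidden_state, inputs["attention_mask"])

def similarity_to_query(text_embeddings, query_embedding):
    """Cosine similarity between each text vector and a single query vector."""
    return F.cosine_similarity(text_embeddings, query_embedding, dim=1)

# Hypothetical usage with the tokenizer and model loaded above:
# texts = ["Title one. Abstract one.", "Title two. Abstract two."]
# embeddings = embed_texts(texts, tokenizer, model)
# query = embed_texts(["deep learning applied to virology"], tokenizer, model)
# scores = similarity_to_query(embeddings, query)
```

Mean pooling is only one option; using the vector at the first token position is another common choice, and the source does not say which the author intended.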


**Step 3: Loading and Preprocessing the Dataset**

In [3]:
# Load dataset
data = pd.read_csv("/content/drive/MyDrive/NLP Task/collection_with_abstracts.csv")
# Display dataset structure
data.head()

| | PMID | Title | Authors | Citation | First Author | Journal/Book | Publication Year | Create Date | PMCID | NIHMS ID | DOI | Abstract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39435445 | Editorial: The operationalization of cognitive... | Winter M, Probst T, Tallon M, Schobel J, Pryss R. | Front Neurosci. 2024 Oct 7;18:1501636. doi: 10... | Winter M | Front Neurosci | 2024 | 2024/10/22 | PMC11491427 | | 10.3389/fnins.2024.1501636 | |
| 1 | 39398866 | Characterization of arteriosclerosis based on ... | Zhou J, Li X, Demeke D, Dinh TA, Yang Y, Janow... | J Med Imaging (Bellingham). 2024 Sep;11(5):057... | Zhou J | J Med Imaging (Bellingham) | 2024 | 2024/10/14 | PMC11466048 | | 10.1117/1.JMI.11.5.057501 | PURPOSE: Our purpose is to develop a computer ... |
| 2 | 39390053 | Multi-scale input layers and dense decoder agg... | Lan X, Jin W. | Sci Rep. 2024 Oct 10;14(1):23729. doi: 10.1038... | Lan X | Sci Rep | 2024 | 2024/10/10 | PMC11467340 | | 10.1038/s41598-024-74701-0 | Accurate segmentation of COVID-19 lesions from... |
| 3 | 39367648 | An initial game-theoretic assessment of enhanc... | Fatemi MY, Lu Y, Diallo AB, Srinivasan G, Azhe... | Brief Bioinform. 2024 Sep 23;25(6):bbae476. do... | Fatemi MY | Brief Bioinform | 2024 | 2024/10/05 | PMC11452536 | | 10.1093/bib/bbae476 | The application of deep learning to spatial tr... |
| 4 | 39363262 | Truncated M13 phage for smart detection of E. ... | Yuan J, Zhu H, Li S, Thierry B, Yang CT, Zhang... | J Nanobiotechnology. 2024 Oct 3;22(1):599. doi... | Yuan J | J Nanobiotechnology | 2024 | 2024/10/04 | PMC11451008 | | 10.1186/s12951-024-02881-y | BACKGROUND: The urgent need for affordable and... |
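
The preview shows that some records, such as the editorial in row 0, have no abstract. Since the stated goal is to work with both title and abstract, here is a minimal sketch of counting missing abstracts and building a combined text field. The helper `build_text_column`, the `text` column, and the two example rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

def build_text_column(frame):
    """Join Title and Abstract into one string per row, treating missing values as empty."""
    title = frame["Title"].fillna("").str.strip()
    abstract = frame["Abstract"].fillna("").str.strip()
    return (title + " " + abstract).str.strip()

# Illustrative rows, not taken from the dataset
example = pd.DataFrame({
    "Title": ["An editorial without an abstract", "A study of image segmentation"],
    "Abstract": [None, "We segment lesions with a neural network."],
})

print(example["Abstract"].isna().sum())  # number of rows with no abstract
example["text"] = build_text_column(example)
print(example["text"].tolist())
```

On the real data the same calls would apply to `data` instead of `example`.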


Print the column names in the dataset:

In [6]:
print(data.columns)


Index(['PMID', 'Title', 'Authors', 'Citation', 'First Author', 'Journal/Book',
       'Publication Year', 'Create Date', 'PMCID', 'NIHMS ID', 'DOI',
       'Abstract'],
      dtype='object')


**Step 4: Defining Filtering and Classification Functions**

Defining Keyword Lists and Filtering Functions


In [32]:
# Define keywords for filtering and classification
domain_keywords = ["virology", "epidemiology"]

method_keywords = ["neural network", "artificial neural network", "machine learning model",
    "feedforward neural network", "neural net algorithm", "multilayer perceptron",
    "convolutional neural network", "recurrent neural network", "long short-term memory network",
    "CNN", "GRNN", "RNN", "LSTM", "deep learning", "deep neural networks"]

text_mining_keywords = ["natural language processing", "text mining", "NLP", "computational linguistics",
    "language processing", "text analytics", "textual data analysis", "speech and language technology",
    "language modeling", "computational semantics"]

computer_vision_keywords = ["computer vision", "vision model", "image processing", "vision algorithms",
    "computer graphics and vision", "object recognition", "scene understanding"]

transformer_keywords = ["transformer models", "self-attention models", "transformer architecture", "transformer",
    "attention-based neural networks", "sequence-to-sequence models", "large language model", "llm",
    "transformer-based model", "pretrained language model", "foundation model"]

generative_ai_keywords = ["generative artificial intelligence", "generative AI", "generative deep learning",
    "generative models", "diffusion model", "vision transformer", "multimodal model",
    "multimodal neural network"]

# Function to check if a paper matches at least one domain keyword and one method keyword
def is_relevant_paper(abstract):
    if isinstance(abstract, str):  # Skip missing abstracts (NaN is a float, not a string)
        # Note: the abstract is lowercased but the keywords are not, so uppercase
        # entries such as "CNN" or "LSTM" never match here.
        abstract_lower = abstract.lower()
        domain_match = any(keyword in abstract_lower for keyword in domain_keywords)
        method_match = any(keyword in abstract_lower for keyword in method_keywords)
        return domain_match and method_match
    return False  # Return False if abstract is not a string (e.g., NaN values)


Applying the Keyword Filter

In [33]:
# Filter papers
filtered_data = data[data['Abstract'].apply(is_relevant_paper)].reset_index(drop=True)
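
One caveat about the filter applied above: `is_relevant_paper` lowercases the abstract but not the keywords, so uppercase entries such as "CNN" or "LSTM" can never match. A minimal sketch of a case-insensitive variant follows. The helpers `normalize_keywords` and `contains_any` are hypothetical, the example sentence is invented, and switching to them would likely change the counts reported in Step 5.

```python
def normalize_keywords(keywords):
    """Lowercase every keyword so it can be compared with lowercased text."""
    return [keyword.lower() for keyword in keywords]

def contains_any(text, keywords_lower):
    """Return True if text is a string containing at least one keyword."""
    if not isinstance(text, str):  # e.g. NaN for a missing abstract
        return False
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in keywords_lower)

# Illustrative keyword subset and sentence, not taken from the dataset
example_keywords = ["neural network", "CNN", "LSTM"]
example_text = "We trained a CNN on chest X-ray images."

# Original-style check: mixed-case keywords against lowercased text
print(any(keyword in example_text.lower() for keyword in example_keywords))  # False
# Normalized check: both sides lowercased
print(contains_any(example_text, normalize_keywords(example_keywords)))  # True
```

Plain substring matching is kept here on purpose, so that plural forms such as "neural networks" still match.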


Classifying Papers by Method


In [34]:
# Function to classify papers by method category
def classify_method(abstract):
    abstract_lower = abstract.lower()
    # Note: keywords containing uppercase letters ("NLP", "generative AI") never
    # match the lowercased abstract.
    is_text_mining = any(keyword in abstract_lower for keyword in text_mining_keywords)
    is_computer_vision = any(keyword in abstract_lower for keyword in computer_vision_keywords)
    is_transformer = any(keyword in abstract_lower for keyword in transformer_keywords)
    is_generative_ai = any(keyword in abstract_lower for keyword in generative_ai_keywords)

    # Classify based on method type. "both" means text mining and computer vision;
    # "transformer/generative" is assigned only when neither of those two matches.
    if is_text_mining and is_computer_vision:
        return "both"
    elif is_text_mining:
        return "text mining"
    elif is_computer_vision:
        return "computer vision"
    elif is_transformer or is_generative_ai:
        return "transformer/generative"
    else:
        return "other"

filtered_data['method_category'] = filtered_data['Abstract'].apply(classify_method)
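
Because `classify_method` uses an if/elif chain, each paper receives exactly one label, and an abstract that mentions both text mining and a transformer is labeled "text mining". As a minimal sketch of an alternative, the function below returns every matching category. The name `classify_method_multilabel`, the shortened keyword lists, and the example abstract are hypothetical; this is not the classification that produced the results in Step 5.

```python
def classify_method_multilabel(abstract, category_keywords):
    """Return every category whose keywords occur in the abstract, or ["other"]."""
    if not isinstance(abstract, str):
        return ["other"]
    abstract_lower = abstract.lower()
    labels = [
        category
        for category, keywords in category_keywords.items()
        if any(keyword.lower() in abstract_lower for keyword in keywords)
    ]
    return labels or ["other"]

# Shortened, illustrative keyword lists
example_categories = {
    "text mining": ["natural language processing", "text mining", "NLP"],
    "computer vision": ["computer vision", "image processing"],
    "transformer/generative": ["transformer", "large language model", "generative AI"],
}

example_abstract = "We apply a transformer to text mining of outbreak reports."
print(classify_method_multilabel(example_abstract, example_categories))
# ['text mining', 'transformer/generative']
```

Returning a list keeps overlapping methods visible, at the cost of a column that needs to be exploded or joined before counting.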


Extracting Method Information

In [36]:
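# Function to extract the specific deep learning methods named in an abstract
def extract_method(abstract):
    # Match keyword phrases against the lowercased text. Comparing WordPiece tokens
    # with the keyword list would miss multi-word phrases and uppercase acronyms
    # such as "CNN", because the uncased tokenizer lowercases its output.
    abstract_lower = abstract.lower()
    methods = [keyword for keyword in method_keywords if keyword.lower() in abstract_lower]
    return ", ".join(methods) if methods else "Not specified"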

filtered_data['deep_learning_method'] = filtered_data['Abstract'].apply(extract_method)


**Step 5: Save and Review Results**


In [38]:
# Save filtered results
filtered_data.to_csv("/content/drive/MyDrive/NLP Task/filtered_virology_ai_papers.csv", index=False)

# Display summary
print(f"Total initial papers: {len(data)}")
print(f"Papers after filtering: {len(filtered_data)}")
print(filtered_data['method_category'].value_counts())

Total initial papers: 11450
Papers after filtering: 307
method_category
other                     288
transformer/generative      9
text mining                 7
computer vision             2
both                        1
Name: count, dtype: int64
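
To put the summary above in proportion, the counts can be converted to shares. The sketch below copies the numbers from the printed output; the variable names are illustrative.

```python
# Counts copied from the summary output above
total_papers = 11450
category_counts = {
    "other": 288,
    "transformer/generative": 9,
    "text mining": 7,
    "computer vision": 2,
    "both": 1,
}

filtered_total = sum(category_counts.values())  # 307, matches "Papers after filtering"
print(f"Retained after filtering: {filtered_total / total_papers:.1%}")

# Share of each method category among the filtered papers
for category, count in category_counts.items():
    print(f"{category}: {count / filtered_total:.1%}")
```

Most filtered papers fall into "other", which is consistent with the case-sensitivity caveat on keywords such as "NLP" and with the single-label if/elif ordering.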
