#### **Step 1: Load the Dataset**

In [1]:
# Mount Google Drive so the CSV file can be read.
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [2]:
file_path = '/content/drive/MyDrive/Datasets/collection_with_abstracts.csv'
df = pd.read_csv(file_path)
df


Unnamed: 0,PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI,Abstract
0,39435445,Editorial: The operationalization of cognitive...,"Winter M, Probst T, Tallon M, Schobel J, Pryss R.",Front Neurosci. 2024 Oct 7;18:1501636. doi: 10...,Winter M,Front Neurosci,2024,2024/10/22,PMC11491427,,10.3389/fnins.2024.1501636,
1,39398866,Characterization of arteriosclerosis based on ...,"Zhou J, Li X, Demeke D, Dinh TA, Yang Y, Janow...",J Med Imaging (Bellingham). 2024 Sep;11(5):057...,Zhou J,J Med Imaging (Bellingham),2024,2024/10/14,PMC11466048,,10.1117/1.JMI.11.5.057501,PURPOSE: Our purpose is to develop a computer ...
2,39390053,Multi-scale input layers and dense decoder agg...,"Lan X, Jin W.",Sci Rep. 2024 Oct 10;14(1):23729. doi: 10.1038...,Lan X,Sci Rep,2024,2024/10/10,PMC11467340,,10.1038/s41598-024-74701-0,Accurate segmentation of COVID-19 lesions from...
3,39367648,An initial game-theoretic assessment of enhanc...,"Fatemi MY, Lu Y, Diallo AB, Srinivasan G, Azhe...",Brief Bioinform. 2024 Sep 23;25(6):bbae476. do...,Fatemi MY,Brief Bioinform,2024,2024/10/05,PMC11452536,,10.1093/bib/bbae476,The application of deep learning to spatial tr...
4,39363262,Truncated M13 phage for smart detection of E. ...,"Yuan J, Zhu H, Li S, Thierry B, Yang CT, Zhang...",J Nanobiotechnology. 2024 Oct 3;22(1):599. doi...,Yuan J,J Nanobiotechnology,2024,2024/10/04,PMC11451008,,10.1186/s12951-024-02881-y,BACKGROUND: The urgent need for affordable and...
...,...,...,...,...,...,...,...,...,...,...,...,...
11445,10607521,The characteristics of epidemics and invasions...,"Cruickshank I, Gurney WS, Veitch AR.",Theor Popul Biol. 1999 Dec;56(3):279-92. doi: ...,Cruickshank I,Theor Popul Biol,1999,1999/12/23,,,10.1006/tpbi.1999.1432,In this paper we report the development of a h...
11446,10072741,Effects of sales promotion on smoking among U....,Redmond WH.,Prev Med. 1999 Mar;28(3):243-50. doi: 10.1006/...,Redmond WH,Prev Med,1999,1999/03/12,,,10.1006/pmed.1998.0410,OBJECTIVE: The purpose of this study was to ex...
11447,9200018,Hypertension in an inner-city minority population,Wieck KL.,J Cardiovasc Nurs. 1997 Jul;11(4):41-9. doi: 1...,Wieck KL,J Cardiovasc Nurs,1997,1997/07/01,,,10.1097/00005082-199707000-00005,This study describes an inner-city elderly min...
11448,8039948,Aerosol transmission of a viable virus affecti...,"Grant RH, Scheidt AB, Rueff LR.",Int J Biometeorol. 1994 May;38(1):33-9. doi: 1...,Grant RH,Int J Biometeorol,1994,1994/05/01,,,10.1007/BF01241802,A Gaussian diffusion model was applied to an e...


In [3]:
print(df['Abstract'][12])
# df['Abstract'][1]

Computationally expensive data processing in neuroimaging research places demands on energy consumption-and the resulting carbon emissions contribute to the climate crisis. We measured the carbon footprint of the functional magnetic resonance imaging (fMRI) preprocessing tool fMRIPrep, testing the effect of varying parameters on estimated carbon emissions and preprocessing performance. Performance was quantified using (a) statistical individual-level task activation in regions of interest and (b) mean smoothness of preprocessed data. Eight variants of fMRIPrep were run with 257 participants who had completed an fMRI stop signal task (the same data also used in the original validation of fMRIPrep). Some variants led to substantial reductions in carbon emissions without sacrificing data quality: for instance, disabling FreeSurfer surface reconstruction reduced carbon emissions by 48%. We provide six recommendations for minimising emissions without compromising performance. By varying par

#### **Step 2: Text Preprocessing**

Select the columns needed for further processing.


In [4]:
# Select the columns PMID, Title, Journal/Book, and Abstract.

selected_columns = df[['PMID', 'Title', 'Journal/Book', 'Abstract']]
selected_columns


Unnamed: 0,PMID,Title,Journal/Book,Abstract
0,39435445,Editorial: The operationalization of cognitive...,Front Neurosci,
1,39398866,Characterization of arteriosclerosis based on ...,J Med Imaging (Bellingham),PURPOSE: Our purpose is to develop a computer ...
2,39390053,Multi-scale input layers and dense decoder agg...,Sci Rep,Accurate segmentation of COVID-19 lesions from...
3,39367648,An initial game-theoretic assessment of enhanc...,Brief Bioinform,The application of deep learning to spatial tr...
4,39363262,Truncated M13 phage for smart detection of E. ...,J Nanobiotechnology,BACKGROUND: The urgent need for affordable and...
...,...,...,...,...
11445,10607521,The characteristics of epidemics and invasions...,Theor Popul Biol,In this paper we report the development of a h...
11446,10072741,Effects of sales promotion on smoking among U....,Prev Med,OBJECTIVE: The purpose of this study was to ex...
11447,9200018,Hypertension in an inner-city minority population,J Cardiovasc Nurs,This study describes an inner-city elderly min...
11448,8039948,Aerosol transmission of a viable virus affecti...,Int J Biometeorol,A Gaussian diffusion model was applied to an e...


In [5]:
#checking null values.
selected_columns.isnull().sum()

Unnamed: 0,0
PMID,0
Title,0
Journal/Book,0
Abstract,213


To address the 213 missing values in the 'Abstract' column, each missing abstract is replaced with the record's 'Title' and 'Journal/Book' values joined by a space.

This approach helps retain relevant information for each record, improving data consistency and completeness for subsequent tasks, such as classification and method extraction.

In [6]:
# Fill missing abstracts with "<Title> <Journal/Book>".
# Work on an independent copy so the assignment below does not raise SettingWithCopyWarning.
selected_columns = selected_columns.copy()
combined = selected_columns['Title'].fillna('') + ' ' + selected_columns['Journal/Book'].fillna('')
selected_columns['Abstract'] = selected_columns['Abstract'].fillna(combined)
selected_columns.head()


Unnamed: 0,PMID,Title,Journal/Book,Abstract
0,39435445,Editorial: The operationalization of cognitive...,Front Neurosci,Editorial: The operationalization of cognitive...
1,39398866,Characterization of arteriosclerosis based on ...,J Med Imaging (Bellingham),PURPOSE: Our purpose is to develop a computer ...
2,39390053,Multi-scale input layers and dense decoder agg...,Sci Rep,Accurate segmentation of COVID-19 lesions from...
3,39367648,An initial game-theoretic assessment of enhanc...,Brief Bioinform,The application of deep learning to spatial tr...
4,39363262,Truncated M13 phage for smart detection of E. ...,J Nanobiotechnology,BACKGROUND: The urgent need for affordable and...


In [7]:
#checking null values again.
selected_columns.isnull().sum()

Unnamed: 0,0
PMID,0
Title,0
Journal/Book,0
Abstract,0


#### **Step 3: Data Cleaning**
With all gaps filled, the next step is to clean the text fields: convert the text to lowercase and remove stopwords and punctuation, which standardizes the data for further processing.


In [8]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def clean_text(text):
  if isinstance(text, str):  # Check if the input is a string
    words = word_tokenize(text.lower())
    cleaned_words = [word for word in words if word not in stop_words and word not in punctuation]
    return " ".join(cleaned_words)
  else:
    return ""  # Or handle non-string values as needed

selected_columns['Title'] = selected_columns['Title'].apply(clean_text)
selected_columns['Journal/Book'] = selected_columns['Journal/Book'].apply(clean_text)
selected_columns['Abstract'] = selected_columns['Abstract'].apply(clean_text)

selected_columns.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Unnamed: 0,PMID,Title,Journal/Book,Abstract
0,39435445,editorial operationalization cognitive systems...,front neurosci,editorial operationalization cognitive systems...
1,39398866,characterization arteriosclerosis based comput...,j med imaging bellingham,purpose purpose develop computer vision approa...
2,39390053,multi-scale input layers dense decoder aggrega...,sci rep,accurate segmentation covid-19 lesions medical...
3,39367648,initial game-theoretic assessment enhanced tis...,brief bioinform,application deep learning spatial transcriptom...
4,39363262,truncated m13 phage smart detection e. coli da...,j nanobiotechnology,background urgent need affordable rapid detect...


In [9]:
print(selected_columns['Abstract'][1])

purpose purpose develop computer vision approach quantify intra-arterial thickness digital pathology images kidney biopsies computational biomarker arteriosclerosis approach severity arteriosclerosis scored 0 3 753 arteries 33 trichrome-stained whole slide images wsis kidney biopsies outer contours media intima lumen manually delineated renal pathologist developed multi-class deep learning dl framework segmenting different intra-arterial compartments training dataset 648 arteries 24 wsis testing dataset 105 arteries 9 wsis subsequently employed radial sampling made measurements media intima thickness function spatially encoded polar coordinates throughout artery pathomic features extracted measurements collectively describe arterial wall characteristics technique first validated numerical analysis simulated arteries systematic deformations applied study effect arterial thickness measurements compared computationally derived measurements pathologists grading arteriosclerosis results num

#### **Step 4: Semantic Filtering**
Since the Abstract field provides the most detailed insight into each paper's content, it is used as the primary source for filtering.

This method prioritizes contextual understanding to accurately capture papers that apply deep learning techniques to virology/epidemiology.



To achieve this, SBERT (Sentence-BERT) is used for semantic NLP filtering, enabling more accurate identification of relevant papers based on meaning rather than simple keyword matching.

In [32]:
# Install Necessary Libraries
!pip install -U sentence-transformers



#### **Load the SBERT Model**

Load a pretrained SBERT model from the sentence-transformers library.

For this semantic similarity task, all-MiniLM-L6-v2 is a good choice due to its efficiency and accurate embeddings, making it well-suited for filtering relevant papers.

In [10]:
from sentence_transformers import SentenceTransformer, util
# Load the SBERT model
#model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
model = SentenceTransformer('all-MiniLM-L6-v2')
model



SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

#### **Defining Target Terms for Semantic Filtering.**

These target terms provide a comprehensive set of phrases related to the application of deep learning in virology/epidemiology.
Using the SBERT model, reference embeddings are then generated for each target term. These embeddings serve as the basis for semantic similarity comparisons during filtering, helping to accurately identify relevant papers.

In [11]:
# Define main categories for combining terms dynamically
fields = ["virology", "epidemiology"]
techniques = [
    "deep learning", "neural network", "artificial neural network", "machine learning model", "feedforward neural network", "neural net algorithm",
    "convolutional neural network", "recurrent neural network", "LSTM", "multilayer perceptron",
    "CNN", "RNN", "GRNN", "computer vision", "vision models", "image processing", "vision algorithms", "computer graphics and vision",
    "object recognition", "scene understanding", "deep neural networks", "computational semantics",
    # The comma after "text analytics" matters: without it Python joins it with "language modeling" into one string
    "natural language processing", "text mining", "NLP", "computational linguistics", "text analytics",
    "language modeling", "text analysis", "generative AI", "generative deep learning", "text data analysis",
    "generative models", "generative artificial intelligence", "transformer models", "self-attention models",
    "transformer architecture", "attention-based neural networks", "transformer networks", "speech and language technology",
    "sequence-to-sequence models", "large language model", "llm", "transformer-based model",
    "pretrained language model", "generative language model", "foundation model", "language processing",
    "state-of-the-art language model", "multimodal model", "multimodal neural network",
    "vision transformer", "diffusion model", "generative diffusion model", "long short-term memory network",
    "diffusion-based generative model", "continuous diffusion model", "textual data analysis"
]

# Combine terms
target_terms = [f"{tech} in {field}" for tech in techniques for field in fields]

# Generate reference embeddings
reference_embedding = model.encode(target_terms, convert_to_tensor=True)
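
As a sanity check on the combination step, the sketch below builds the same kind of cross product from a shortened, illustrative technique list and confirms that it yields one phrase per (technique, field) pair. The helper name `build_target_terms` and the `sample_*` lists are assumptions made for this example, not part of the notebook. It also shows why the list length is worth checking: Python silently joins adjacent string literals that are not separated by a comma.

```python
def build_target_terms(techniques, fields):
    """Return one '<technique> in <field>' phrase per (technique, field) pair."""
    return [f"{tech} in {field}" for tech in techniques for field in fields]

# Shortened, illustrative lists (the notebook's full technique list is much longer)
sample_fields = ["virology", "epidemiology"]
sample_techniques = ["deep learning", "text analytics", "language modeling"]

sample_terms = build_target_terms(sample_techniques, sample_fields)
print(len(sample_terms))   # 6 = 3 techniques x 2 fields
print(sample_terms[:2])    # ['deep learning in virology', 'deep learning in epidemiology']

# Adjacent literals without a comma are joined into a single string
joined = ["text analytics" "language modeling"]
print(len(joined))         # 1
```

Comparing `len(techniques)` against the number of intended entries is a cheap way to catch a dropped comma before the embeddings are generated.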


#### **Generate Embeddings for Each Record's Abstracts**

To facilitate semantic filtering, embeddings are generated for each paper's 'Abstract' by applying the SBERT model.
These embeddings capture the semantic meaning of each 'Abstract', enabling accurate similarity comparisons with target terms.

In [12]:
# Generate embeddings for each paper's Abstract
selected_columns['abstract_embedding'] = selected_columns['Abstract'].apply(lambda x: model.encode(x, convert_to_tensor=True))
selected_columns.head()

Unnamed: 0,PMID,Title,Journal/Book,Abstract,abstract_embedding
0,39435445,editorial operationalization cognitive systems...,front neurosci,editorial operationalization cognitive systems...,"[tensor(0.0982, device='cuda:0'), tensor(-0.03..."
1,39398866,characterization arteriosclerosis based comput...,j med imaging bellingham,purpose purpose develop computer vision approa...,"[tensor(-0.0169, device='cuda:0'), tensor(-0.0..."
2,39390053,multi-scale input layers dense decoder aggrega...,sci rep,accurate segmentation covid-19 lesions medical...,"[tensor(0.0098, device='cuda:0'), tensor(-0.06..."
3,39367648,initial game-theoretic assessment enhanced tis...,brief bioinform,application deep learning spatial transcriptom...,"[tensor(-0.0076, device='cuda:0'), tensor(-0.0..."
4,39363262,truncated m13 phage smart detection e. coli da...,j nanobiotechnology,background urgent need affordable rapid detect...,"[tensor(-0.0285, device='cuda:0'), tensor(-0.0..."
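
Calling `model.encode` once per row works, but the method also accepts a list of texts and processes it in batches, which is typically faster on a GPU. The sketch below assumes `model` is the `SentenceTransformer` loaded earlier; the wrapper name `embed_abstracts` and the batch size of 64 are illustrative choices, not part of the original notebook.

```python
def embed_abstracts(model, abstracts, batch_size=64):
    """Encode all abstracts in one batched call.

    `model` is expected to expose SentenceTransformer's
    encode(sentences, batch_size=..., convert_to_tensor=...) method.
    With a real SentenceTransformer this returns a 2-D tensor, one row per abstract.
    """
    return model.encode(list(abstracts), batch_size=batch_size, convert_to_tensor=True)

# Possible usage in the notebook (not run here):
# abstract_embeddings = embed_abstracts(model, selected_columns['Abstract'])
# scores = util.cos_sim(abstract_embeddings, reference_embedding).max(dim=1).values
```

Keeping the embeddings in a single tensor instead of a DataFrame column would also keep the tensor text representations out of any later CSV export.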


#### **Step 5: Filtering Papers Based on Semantic Similarity**
Each paper’s Abstract embedding is compared to the reference embeddings generated from target terms.

Using cosine similarity, each paper is scored by its highest similarity to any single target term. Papers with a score at or above the threshold are retained for further analysis.

In [13]:
# Set similarity threshold (adjust based on results)
similarity_threshold = 0.5

# Filter papers based on similarity score to reference embedding
selected_columns['similarity_score'] = selected_columns['abstract_embedding'].apply(
    lambda x: util.cos_sim(x, reference_embedding).max().item()
)
filtered_papers = selected_columns[selected_columns['similarity_score'] >= similarity_threshold]
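
To make the scoring rule concrete, here is a minimal pure-Python sketch of what the cell above computes for one paper: the cosine similarity between its embedding and every reference embedding, reduced to the maximum. The 3-dimensional vectors, the `toy_*` names, and the helpers `cosine` and `max_similarity` are made up for illustration; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_similarity(paper_vec, reference_vecs):
    """Best match between one paper and any reference term."""
    return max(cosine(paper_vec, ref) for ref in reference_vecs)

# Toy vectors standing in for embeddings (illustrative values only)
toy_references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_papers = {
    "paper_a": [3.0, 4.0, 0.0],   # cosine with the two references: 0.6 and 0.8
    "paper_b": [0.0, 0.0, 2.0],   # orthogonal to both references
}

toy_threshold = 0.5
toy_scores = {name: max_similarity(vec, toy_references) for name, vec in toy_papers.items()}
toy_kept = [name for name, score in toy_scores.items() if score >= toy_threshold]
print(toy_scores)
print(toy_kept)
```

Because the score is a maximum over all target terms, a paper is kept as soon as it is close to any one phrase, so the threshold is worth tuning against a sample of the retained papers.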


#### **Step 6: Export the Filtered Papers**
Once filtered, save the relevant papers to a new file for further processing or analysis.

In [14]:
filtered_papers.to_csv('filtered_deep_learning_virology_epidemiology.csv', index=False)


In [15]:
# Load the filtered deep learning papers in virology/epidemiology
df1 = pd.read_csv('filtered_deep_learning_virology_epidemiology.csv')
df1

Unnamed: 0,PMID,Title,Journal/Book,Abstract,abstract_embedding,similarity_score
0,39013794,deep learning methods amplify epidemiological ...,j epidemiol,deep learning subfield artificial intelligence...,"tensor([ 3.6120e-02, -2.2385e-02, 2.9091e-02,...",0.758788
1,38473002,advancements glaucoma diagnosis role ai medica...,diagnostics basel,progress artificial intelligence algorithms di...,"tensor([ 1.7732e-02, -3.1443e-02, 1.9534e-02,...",0.514860
2,38454859,scope artificial intelligence retinopathy prem...,indian j ophthalmol,artificial intelligence ai revolutionary techn...,"tensor([-6.3707e-02, -2.1677e-02, -2.9215e-02,...",0.535145
3,38424562,prevalence computer vision syndrome covid-19 p...,bmc public health,background computer vision syndrome become sig...,"tensor([ 6.4034e-02, 1.7856e-02, -1.1642e-02,...",0.521855
4,37920033,unet segmentation network covid-19 ct images m...,math biosci eng,recent years global outbreak covid-19 posed ex...,"tensor([ 5.6156e-02, -4.8756e-02, 2.3895e-02,...",0.532703
...,...,...,...,...,...,...
725,16387332,diffusive si model allee effect application fiv,math biosci,minimal reaction-diffusion model spatiotempora...,"tensor([ 4.0788e-02, -6.0069e-02, 1.2528e-02,...",0.501422
726,16289268,dynamics dengue epidemics large-scale information,theor popul biol,model spatial temporal dynamics dengue fever p...,"tensor([ 8.1916e-02, -6.3010e-02, 3.9929e-02,...",0.511256
727,11177527,diffusion theory drug use,addiction,paper examines applicability diffusion model d...,"tensor([ 1.2172e-01, -4.6752e-02, -3.3996e-02,...",0.588906
728,10607521,characteristics epidemics invasions thresholds,theor popul biol,paper report development highly efficient nume...,"tensor([ 1.6910e-02, -8.1633e-02, 2.1418e-02,...",0.586339


In [16]:
df1["Abstract"][0]

'deep learning subfield artificial intelligence machine learning based mostly neural networks often combined attention algorithms used detect identify objects text audio images video serghiou rough j epidemiol 0000 000 00 :0000-0000 present primer epidemiologists deep learning models models provide substantial opportunities epidemiologists expand amplify research data collection analyses increasing geographic reach studies including research subjects working large high dimensional data tools implementing deep learning methods quite yet straightforward ubiquitous epidemiologists traditional regression methods found standard statistical software exciting opportunities interdisciplinary collaboration deep learning experts epidemiologists statisticians healthcare providers urban planners professionals despite novelty methods epidemiological principles assessing bias study design interpretation others still apply implementing deep learning methods assessing findings studies used'