# Screening Task: Semantic NLP Filtering for Identifying Deep Learning Papers in Virology and Epidemiology

### Step 1: Install Required Libraries

In this step, we install essential libraries that will be used throughout our analysis:

- **`transformers`**: Provides tools for using pre-trained transformer models, which are essential for embedding generation.
- **`pandas`**: Allows for efficient data manipulation and analysis in DataFrame format.
- **`scikit-learn`**: Includes tools for machine learning and similarity calculations.
- **`torch`**: Powers the transformer models and provides deep learning functionalities.
- **`tqdm`**: Enables progress bars to track the progress of various operations, making it easier to follow batch processing steps.

In [1]:
# Install necessary libraries
!pip install transformers pandas scikit-learn torch tqdm


[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: pip install --upgrade pip


### Step 2: Data Loading and Initial Filtering

In this step, we load the dataset and perform initial filtering to ensure data quality for further analysis. Here’s a breakdown of the process:

1. **Load Dataset**: The dataset, which includes research paper titles and abstracts, is loaded into a DataFrame.

2. **Identify Missing Values**:
   - We check for rows where either the "Title" or "Abstract" column is missing.
   - These records are saved separately in a file named `missing_title_or_abstract.csv` for reference.

3. **Filter Non-Empty Abstracts**:
   - We remove rows where the "Abstract" column is empty or contains only whitespace, as these entries are not suitable for semantic analysis.
   - After filtering, we display the number of records remaining.

4. **Confirm Data Quality**:
   - We display summary information about the filtered DataFrame to confirm that only records with both a title and a non-empty abstract remain.
   - Additionally, a preview of the cleaned DataFrame is provided to verify the data structure before moving to the next stage.

In [2]:
import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv('collection_with_abstracts.csv')

print("Total records before filtering:", len(df))

# Identify rows where either Title or Abstract (or both) are missing
missing_either_df = df[df['Title'].isnull() | df['Abstract'].isnull()]
print("Records with missing Title or Abstract:", len(missing_either_df))

# Save this DataFrame for tracking purposes
missing_either_df.to_csv('missing_title_or_abstract.csv', index=False)
print("Records with missing Title or Abstract saved as 'missing_title_or_abstract.csv'.")

# Remove rows where Abstract is NaN, empty, or whitespace-only
df_for_analysis = df.dropna(subset=['Abstract'])  # First, remove rows where Abstract is NaN
# Then, remove empty or whitespace-only abstracts; .copy() lets us add columns later without pandas warnings
df_for_analysis = df_for_analysis[df_for_analysis['Abstract'].str.strip() != ''].copy()

print("Total records after filtering for non-empty Abstract:", len(df_for_analysis))

# Display info to confirm filtering
print("\nMain DataFrame after filtering out rows with both Title and Abstract missing (for analysis):")
print(df_for_analysis.info())

# Optional: Display the first few rows of the cleaned DataFrame for analysis
print("\nCleaned DataFrame preview for analysis:")
print(df_for_analysis.head())


Total records before filtering: 11450
Records with missing Title or Abstract: 213
Records with missing Title or Abstract saved as 'missing_title_or_abstract.csv'.
Total records after filtering for non-empty Abstract: 11237

Main DataFrame after filtering out rows with both Title and Abstract missing (for analysis):
<class 'pandas.core.frame.DataFrame'>
Index: 11237 entries, 1 to 11449
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   PMID              11237 non-null  int64 
 1   Title             11237 non-null  object
 2   Authors           11237 non-null  object
 3   Citation          11237 non-null  object
 4   First Author      11237 non-null  object
 5   Journal/Book      11237 non-null  object
 6   Publication Year  11237 non-null  int64 
 7   Create Date       11237 non-null  object
 8   PMCID             6387 non-null   object
 9   NIHMS ID          946 non-null    object
 10  DOI               1076

### Step 3: Generate Target Embeddings for Virology/Epidemiology and Deep Learning

In this step, we create target embeddings based on phrases relevant to two primary focus areas in our analysis: **virology/epidemiology** and **deep learning**. These target embeddings will serve as reference points to identify papers that align with the topics of interest.

1. **Define Target Phrases**:
   - A list of phrases is defined for each focus area:
     - **Virology/Epidemiology Phrases**: These phrases capture specific keywords related to infectious diseases, viral interactions, public health surveillance, and similar topics.
     - **Deep Learning Phrases**: These phrases cover keywords relevant to neural networks, machine learning models, and advanced computational methods used in virology and epidemiology.


2. **Load Model and Tokenizer**:
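   - We load a pretrained transformer model (`HuggingFaceTB/SmolLM2-1.7B` in this run) and its tokenizer, which are used to generate embeddings. The code assigns the end-of-sequence token as the tokenizer's padding token so that padding can be requested during tokenization. This model provides the vector representation for each phrase.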


3. **Embedding Generation Function**:
   - A function, `get_embedding`, is defined to generate an embedding for each phrase. This function:
     - Tokenizes the input text and applies truncation and padding.
     - Uses the transformer model to create an embedding from the mean of the last hidden states, ensuring each phrase is represented by a dense vector.


4. **Calculate Average Embeddings**:
   - **Virology/Epidemiology Embedding**: Each phrase in this category is converted to an embedding, and the average of these embeddings is calculated to create a unified target embedding.
   - **Deep Learning Embedding**: Similarly, the average embedding for all phrases related to deep learning is calculated.


5. **Display Results**:
   - Once the embeddings are generated, we print a confirmation message for each focus area, indicating that the average embeddings for virology/epidemiology and deep learning are ready for use in the analysis.

These embeddings will enable us to assess the relevance of each research paper by comparing its embedding to these target embeddings.
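
The mean-pooling idea can be illustrated without loading a model. `get_embedding` embeds one text per call, so no padding tokens are present; if texts were ever batched together, padded positions would enter a plain mean. The sketch below is only an illustration under that assumption: NumPy arrays stand in for `last_hidden_state` and the tokenizer's attention mask, and `masked_mean_pool` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    hidden_states: array of shape (batch, tokens, dim)
    attention_mask: array of shape (batch, tokens); 1 = real token, 0 = padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # guard against division by zero
    return summed / counts

# Toy batch: two sequences, three token slots, two dimensions (made-up values).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # third slot is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)   # one vector per sequence
phrase_average = pooled.mean(axis=0)      # analogous to averaging phrase embeddings
```

Averaging the pooled vectors along the first axis mirrors how the per-phrase embeddings are combined into a single target embedding.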


In [3]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Define phrases for each target area
virology_epidemiology_relevance_phrases = [
    "deep learning in virology and epidemiology",
    "neural network applications in infectious diseases",
    "AI models for viral infection analysis",
    "machine learning in public health research",
    "deep learning solutions for disease spread analysis",
    "infectious diseases", "virus detection", "disease spread modeling",
    "viral disease outbreaks", "virus genomics and sequencing", 
    "viral infection mechanisms", "antiviral drug resistance", 
    "virus-host interactions", "epidemic response to viral infections", 
    "pandemic virology research", "epidemiological models", 
    "public health surveillance", "disease transmission patterns", 
    "infection rate prediction", "epidemic spread modeling", 
    "population-level health impact", "disease incidence and prevalence", 
    "contact tracing in disease outbreaks", "infectious disease modeling", 
    "pathogen tracking and monitoring", "viral pathogen analysis", 
    "health risk assessment for infectious diseases", 
    "genomic epidemiology of viruses", "disease outbreak prediction"
]

deep_learning_phrases = [
    "neural network", "artificial neural network", "machine learning model", 
    "feedforward neural network", "neural net algorithm", "multilayer perceptron", 
    "convolutional neural network", "recurrent neural network", "long short-term memory network", 
    "CNN", "GRNN", "RNN", "LSTM", "deep learning", "deep neural networks", 
    "computer vision", "vision model", "image processing", "vision algorithms", 
    "computer graphics and vision", "object recognition", "scene understanding", 
    "natural language processing", "text mining", "NLP", "computational linguistics", 
    "language processing", "text analytics", "textual data analysis", "text data analysis", 
    "text analysis", "speech and language technology", "language modeling", 
    "computational semantics", "generative artificial intelligence", "generative AI", 
    "generative deep learning", "generative models", "transformer models", 
    "self-attention models", "transformer architecture", "transformer", 
    "attention-based neural networks", "transformer networks", "sequence-to-sequence models", 
    "large language model", "LLM", "transformer-based model", "pretrained language model", 
    "generative language model", "foundation model", "state-of-the-art language model", 
    "multimodal model", "multimodal neural network", "vision transformer", 
    "diffusion model", "generative diffusion model", "diffusion-based generative model", 
    "continuous diffusion model"
]

# Load the model and tokenizer
model_name = 'HuggingFaceTB/SmolLM2-1.7B'  # Replace with preferred model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Set padding token for the tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Function to generate embeddings for a given text
def get_embedding(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last hidden state as the embedding
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Generate embeddings for virology/epidemiology relevance phrases and calculate the average embedding
virology_embeddings = [get_embedding(phrase) for phrase in virology_epidemiology_relevance_phrases]
average_virology_embedding = np.mean(virology_embeddings, axis=0)

# Generate embeddings for deep learning phrases and calculate the average embedding
deep_learning_embeddings = [get_embedding(phrase) for phrase in deep_learning_phrases]
average_deep_learning_embedding = np.mean(deep_learning_embeddings, axis=0)

# Display the results
print("Average Virology/Epidemiology Embedding generated.")
print("Average Deep Learning Embedding generated.")


Average Virology/Epidemiology Embedding generated.
Average Deep Learning Embedding generated.


### Step 4: Generate Embeddings for Each Paper and Calculate Similarity with Target Embeddings

In this step, we prepare the dataset by generating embeddings for each paper in the analysis dataset. These embeddings are crucial as they allow us to measure the semantic similarity between the content of each paper and the predefined target embeddings.

#### Step-by-Step Explanation:

1. **Concatenate Title and Abstract**:
   - We combine the `Title` and `Abstract` columns for each paper to create a comprehensive text representation (`Title_Abstract`). This combined text will serve as the input for embedding generation, ensuring we capture the full context of each paper.

2. **Generate Embeddings**:
   - For each paper’s combined title and abstract, we generate an embedding using the `get_embedding` function.
   - To track progress and manage potentially large datasets, we use `tqdm` to provide a progress bar, helping us monitor the embedding generation process.

3. **Progress Tracking**:
   - Calling `tqdm.pandas()` registers a `progress_apply` method on pandas objects. It behaves like `apply` but displays a progress bar, allowing us to view progress updates as embeddings are generated for each paper.

The result is a new column, `Paper_Embedding`, in the `df_for_analysis` DataFrame. This column contains the embeddings for each paper, which we will use in the next steps to calculate similarity to the target embeddings for **virology/epidemiology** and **deep learning** relevance.
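
The recorded run of this step took 13:46:57 for 11,237 papers, so it may be worth caching the embeddings rather than recomputing them in every session. The following is a minimal sketch of one way to do that with NumPy; the small arrays stand in for entries of `Paper_Embedding`, and the file name and variable names are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

# Stand-ins for a few PMIDs and their embeddings (made-up values).
pmids = np.array([101, 102, 103])
embeddings = [np.array([0.1, 0.2]), np.array([0.3, 0.4]), np.array([0.5, 0.6])]

# Stack the per-paper vectors into one (n_papers, dim) matrix.
matrix = np.vstack(embeddings)

# Save to a compressed archive; the file name is illustrative only.
cache_path = os.path.join(tempfile.mkdtemp(), "paper_embeddings.npz")
np.savez_compressed(cache_path, pmids=pmids, embeddings=matrix)

# A later session can reload the arrays instead of re-running the model.
with np.load(cache_path) as cached:
    restored_pmids = cached["pmids"]
    restored = cached["embeddings"]
```

Storing the PMIDs alongside the matrix keeps each row traceable to its paper after reloading.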


In [4]:
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm

# Enable tqdm for pandas apply
tqdm.pandas()

# Combine Title and Abstract text for each paper
df_for_analysis['Title_Abstract'] = df_for_analysis['Title'].fillna('') + " " + df_for_analysis['Abstract'].fillna('')

# Generate embeddings for each paper's combined Title and Abstract with progress tracking
df_for_analysis['Paper_Embedding'] = df_for_analysis['Title_Abstract'].progress_apply(get_embedding)

100%|██████████████████████████████████| 11237/11237 [13:46:57<00:00,  4.42s/it]


### Step 4.1: Calculate Similarity to Virology/Epidemiology Embedding and Apply Threshold Filtering

In this step, we calculate the similarity between each paper's embedding and the **average embedding for virology/epidemiology relevance**. Using a defined threshold, we then filter papers that are most relevant to our focus on virology and epidemiology.

#### Step-by-Step Explanation:

1. **Calculate Similarity with Virology/Epidemiology Embedding**:
   - We use the `cosine_similarity` function to compute the similarity between each paper's embedding (`Paper_Embedding`) and the `average_virology_embedding`.
   - The resulting similarity score is stored in a new column, `Virology_Similarity`, in `df_for_analysis`.

2. **Set and Apply Threshold for Filtering**:
   - We define a similarity threshold (`virology_similarity_threshold = 0.90`) as the cutoff for deciding whether a paper is relevant to virology and epidemiology.
   - Papers with a similarity score greater than or equal to this threshold are considered relevant and are filtered into a new DataFrame, `virology_filtered_df`.
   - The cutoff directly controls how many papers survive: the five sample scores printed in the output range from about 0.84 to 0.88, so none of those five would pass at 0.90.

3. **Preview and Save Filtered Results**:
   - A preview of the filtered data is displayed to validate the filtering.
   - Finally, the filtered DataFrame `virology_filtered_df` is saved to a CSV file, `virology_relevant_papers.csv`, which contains only the virology-relevant papers for further analysis.

A high threshold keeps only papers whose embeddings lie very close to the virology/epidemiology target. This narrows the dataset to our specific area of interest, at the cost of possibly excluding borderline papers.

**Additionally, based on specific requirements or the results observed, this threshold can be adjusted to either broaden or restrict the range of papers filtered, allowing for flexibility in refining the selection to match analysis needs.**
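
As a worked illustration of the scoring and thresholding described above, cosine similarity can be computed for all papers at once as a normalized dot product. This is a sketch with made-up two-dimensional vectors, not the notebook's embeddings, and `cosine_to_target` is a hypothetical helper name.

```python
import numpy as np

def cosine_to_target(paper_matrix, target):
    """Cosine similarity between each row of paper_matrix and one target vector."""
    paper_norms = np.linalg.norm(paper_matrix, axis=1)
    target_norm = np.linalg.norm(target)
    return (paper_matrix @ target) / (paper_norms * target_norm)

# Three toy "paper embeddings" and one toy target embedding.
papers = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])
target = np.array([1.0, 0.0])

scores = cosine_to_target(papers, target)

# Apply the same >= rule used for the virology filter.
threshold = 0.90
keep = scores >= threshold
```

Here only the first toy vector points in the same direction as the target, so it is the only one retained at a 0.90 cutoff.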

In [23]:
# Step 4.1: Calculate similarity with virology/epidemiology embedding
df_for_analysis['Virology_Similarity'] = df_for_analysis['Paper_Embedding'].progress_apply(
    lambda emb: cosine_similarity([emb], [average_virology_embedding])[0][0]
)

# Display sample of the virology/epidemiology similarity scores
print("Virology/Epidemiology similarity scores calculated. Sample data:")
print(df_for_analysis[['PMID', 'Title', 'Abstract', 'Virology_Similarity']].head())

# Define similarity threshold for Virology/Epidemiology relevance
virology_similarity_threshold = 0.90

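# Filter the DataFrame based on Virology/Epidemiology similarity threshold
# .copy() makes the result an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
virology_filtered_df = df_for_analysis[df_for_analysis['Virology_Similarity'] >= virology_similarity_threshold].copy()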

# Display filtered results for virology
print(f"Papers with Virology/Epidemiology Similarity >= {virology_similarity_threshold}:")
print(virology_filtered_df[['PMID', 'Title', 'Abstract', 'Virology_Similarity']].head())
print("Total virology-relevant papers after thresholding:", len(virology_filtered_df))

# Save the filtered DataFrame to a CSV file
virology_filtered_df.to_csv('virology_relevant_papers.csv', index=False)

print("Filtered virology-relevant papers saved as 'virology_relevant_papers.csv'.")

100%|███████████████████████████████████| 11237/11237 [00:01<00:00, 5899.41it/s]


Virology/Epidemiology similarity scores calculated. Sample data:
       PMID                                              Title  \
1  39398866  Characterization of arteriosclerosis based on ...   
2  39390053  Multi-scale input layers and dense decoder agg...   
3  39367648  An initial game-theoretic assessment of enhanc...   
4  39363262  Truncated M13 phage for smart detection of E. ...   
5  39287522  AI for Multistructure Incidental Findings and ...   

                                            Abstract  Virology_Similarity  
1  PURPOSE: Our purpose is to develop a computer ...             0.844549  
2  Accurate segmentation of COVID-19 lesions from...             0.858944  
3  The application of deep learning to spatial tr...             0.878694  
4  BACKGROUND: The urgent need for affordable and...             0.857912  
5  Background Incidental extrapulmonary findings ...             0.870465  
Papers with Virology/Epidemiology Similarity >= 0.9:
        PMID                 

## Download Output File

The following link lets you download the filtered dataset directly from the notebook:

  - `virology_relevant_papers.csv`: Contains papers filtered based on their relevance to virology and epidemiology.

Click the link to download the CSV file.

In [24]:
from IPython.display import FileLink

# Display download link for the final CSV
display(FileLink('virology_relevant_papers.csv'))


### Step 4.2: Calculate Similarity with Deep Learning Embedding and Apply Threshold Filtering

In this step, we extend our filtering process to identify papers that meet both the virology/epidemiology and deep learning relevance criteria. By calculating the similarity of each paper’s embedding to a deep learning embedding, we can further refine the dataset to include papers that intersect with deep learning applications in virology and epidemiology.

#### Step-by-Step Explanation:

1. **Calculate Similarity with Deep Learning Embedding**:
   - We use the `cosine_similarity` function to compute the similarity between each paper’s embedding (`Paper_Embedding`) and the `average_deep_learning_embedding`.
   - The resulting similarity scores are stored in a new column, `Deep_Learning_Similarity`, within the `virology_filtered_df` DataFrame, which contains papers previously filtered for virology relevance.

2. **Set and Apply Threshold for Filtering**:
   - We define a threshold (`deep_learning_similarity_threshold = 0.80`) to assess deep learning relevance.
   - Papers with a similarity score equal to or above this threshold are retained as deep learning-relevant and filtered into a new DataFrame, `final_filtered_df`.
   - This threshold value helps ensure that the filtered dataset contains papers with significant overlap in both deep learning and virology/epidemiology themes.

3. **Preview and Save Final Filtered Results**:
   - A preview of the final filtered data is displayed to confirm that only papers meeting both criteria are included.
   - The final DataFrame, `final_filtered_df`, is saved as `deep_learning_virology_relevant_papers.csv` for future reference and detailed analysis.

The two-stage threshold filtering approach allows us to home in on papers that not only focus on virology/epidemiology but also incorporate substantial deep learning methodologies.

**As with the virology threshold, the deep learning similarity threshold can be adjusted to tailor the selection. Increasing or lowering the threshold can help further refine the dataset based on the specific analysis needs and project objectives.**
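
One way to see how sensitive the two-stage filter is to the second cutoff is to count the retained papers at several candidate thresholds. The sketch below uses made-up similarity scores (not results from this notebook), and `count_retained` is a hypothetical helper.

```python
import numpy as np

# Made-up similarity scores for six papers.
virology_sim = np.array([0.93, 0.91, 0.88, 0.95, 0.90, 0.86])
deep_learning_sim = np.array([0.82, 0.78, 0.85, 0.80, 0.79, 0.90])

virology_threshold = 0.90

def count_retained(dl_threshold):
    """Number of papers passing the virology cut and then the deep learning cut."""
    stage_one = virology_sim >= virology_threshold
    stage_two = deep_learning_sim >= dl_threshold
    return int(np.sum(stage_one & stage_two))

# Count survivors at three candidate deep learning thresholds.
sweep = {t: count_retained(t) for t in (0.75, 0.80, 0.85)}
```

A sharp drop between adjacent thresholds suggests that many papers sit close to the cutoff and may deserve manual review.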


In [18]:
# Step 4.2: Calculate similarity with deep learning embedding
virology_filtered_df['Deep_Learning_Similarity'] = virology_filtered_df['Paper_Embedding'].progress_apply(
    lambda emb: cosine_similarity([emb], [average_deep_learning_embedding])[0][0]
)

# Define deep learning similarity threshold
deep_learning_similarity_threshold = 0.80

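# Filter DataFrame based on Deep Learning similarity threshold
# .copy() makes the result an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
final_filtered_df = virology_filtered_df[virology_filtered_df['Deep_Learning_Similarity'] >= deep_learning_similarity_threshold].copy()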

# Display filtered results for deep learning
print(f"Papers with Deep Learning Similarity >= {deep_learning_similarity_threshold}:")
print(final_filtered_df[['PMID', 'Title', 'Abstract', 'Deep_Learning_Similarity']].head())
print("Total deep learning-relevant papers after thresholding:", len(final_filtered_df))

# Save the final filtered DataFrame to a CSV file
final_filtered_df.to_csv('deep_learning_virology_relevant_papers.csv', index=False)

print("Final filtered deep learning and virology-relevant papers saved as 'deep_learning_virology_relevant_papers.csv'.")


100%|█████████████████████████████████████| 2009/2009 [00:00<00:00, 6038.56it/s]

Papers with Deep Learning Similarity >= 0.8:
         PMID                                              Title  \
30   39013794  Deep Learning - Methods to Amplify Epidemiolog...   
127  38006509  Quantitation of Oncologic Image Features for R...   
228  36189512  Visual ergonomics for changing work environmen...   
402  34253822  Deep learning for COVID-19 detection based on ...   
462  35782182  Computer Audition for Fighting the SARS-CoV-2 ...   

                                              Abstract  \
30   Deep learning is a subfield of artificial inte...   
127  Radiomics is an emerging and exciting field of...   
228  BACKGROUND: The coronavirus 2019 (COVID-19) pa...   
402  COVID-19 has tremendously impacted patients an...   
462  Computer audition (CA) has experienced a fast ...   

     Deep_Learning_Similarity  
30                   0.816207  
127                  0.801202  
228                  0.810959  
402                  0.803380  
462                  0.806091  
Total




## Download Output File

The following link lets you download the filtered dataset directly from the notebook:

  - `deep_learning_virology_relevant_papers.csv`: Contains papers relevant to both virology/epidemiology and deep learning criteria.

Click the link to download the CSV file.


In [25]:
# Display download link for the final CSV
display(FileLink('deep_learning_virology_relevant_papers.csv'))


### Step 5: Classify Method Type Based on Semantic Similarity for Text Mining and Computer Vision

In this step, we further refine our analysis by categorizing each paper into core method types—**Text Mining** and **Computer Vision**—based on the semantic similarity of the paper’s content. This classification allows us to identify which deep learning techniques are being used in virology and epidemiology research, facilitating a targeted review of relevant methodologies.

#### Step-by-Step Explanation:

1. **Define Target Phrases for Core Method Types**:
   - We define target phrases that represent the key method types of interest:
     - **Text Mining**: Captures natural language processing, text mining, and language model applications.
     - **Computer Vision**: Focuses on medical image processing, computer vision, and visual data analysis related to infectious disease research.

2. **Generate Embeddings for Each Method Type**:
   - Using the `get_embedding` function, we create embeddings for each target phrase, representing the core method types. These embeddings serve as a basis for comparing each paper’s content.

3. **Classify Papers by Method Type**:
   - We define a `classify_method_semantically` function that computes the similarity of each paper’s embedding to both the **Text Mining** and **Computer Vision** embeddings using `cosine_similarity`.
   - A similarity threshold (`threshold = 0.90`) is used to determine if a paper falls under a specific method type:
     - If a paper meets the threshold for both method types, it is classified as **Both**.
     - If only one threshold is met, the paper is classified as **Text Mining** or **Computer Vision** accordingly.
     - Papers that do not meet the threshold for either method type are classified as **Other**.

4. **Apply Classification Function**:
   - The classification function is applied to each paper in the `final_filtered_df`, resulting in a new column, `Method_Type`, which indicates the method category for each paper.

5. **Preview and Save Final Data with Method Classification**:
   - A preview of the classified data is displayed as a quick sanity check.
   - The final DataFrame, `final_filtered_df`, containing the method classifications, is saved as `deep_learning_virology_method_classification.csv` for future reference.

By classifying each paper based on the method type, this step enriches our analysis, allowing for a more organized view of deep learning applications in virology and epidemiology. The threshold value can be adjusted if necessary, depending on initial results and specific analysis requirements.
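
The decision rule itself can be separated from the similarity computation, which makes it easy to check each of the four outcomes in isolation. This is a minimal sketch with illustrative scores; `label_method` is a hypothetical name, and the default threshold mirrors the 0.90 value described above.

```python
def label_method(text_mining_sim, computer_vision_sim, threshold=0.90):
    """Map two similarity scores to one of four method labels."""
    is_text = text_mining_sim >= threshold
    is_vision = computer_vision_sim >= threshold
    if is_text and is_vision:
        return "Both"
    if is_text:
        return "Text Mining"
    if is_vision:
        return "Computer Vision"
    return "Other"

# Illustrative (text mining, computer vision) score pairs, not dataset values.
examples = [(0.93, 0.95), (0.92, 0.85), (0.80, 0.91), (0.70, 0.60)]
labels = [label_method(t, v) for t, v in examples]
```

Passing the threshold as a parameter also makes it straightforward to try different cutoffs without editing the function body.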


In [19]:
# Define target phrases for the two core method types
method_type_phrases = {
    "Text Mining": "natural language processing, text mining, and language model applications in virology and epidemiology research",
    "Computer Vision": "medical image processing, computer vision, and visual data analysis in infectious disease research"
}

# Generate embeddings for each method type
method_type_embeddings = {method: get_embedding(phrase) for method, phrase in method_type_phrases.items()}

# Function to classify each paper based on semantic similarity
def classify_method_semantically(paper_embedding):
    # Calculate similarity to each core method type
    similarities = {method: cosine_similarity([paper_embedding], [emb])[0][0] for method, emb in method_type_embeddings.items()}
    text_mining_sim = similarities["Text Mining"]
    computer_vision_sim = similarities["Computer Vision"]
    
    # Define threshold
    threshold = 0.90  # Adjust based on initial results

    # Classify based on similarity scores
    if text_mining_sim >= threshold and computer_vision_sim >= threshold:
        return "Both"
    elif text_mining_sim >= threshold:
        return "Text Mining"
    elif computer_vision_sim >= threshold:
        return "Computer Vision"
    else:
        return "Other"

# Apply the classification function to each paper in the final_filtered_df
final_filtered_df['Method_Type'] = final_filtered_df['Paper_Embedding'].apply(classify_method_semantically)

print(len(final_filtered_df))

print("Semantic method type classification completed with two main phrases.")
print(final_filtered_df[['PMID', 'Title', 'Abstract', 'Method_Type']].head())

# Save the final filtered DataFrame with Method_Type to a CSV file
final_filtered_df.to_csv('deep_learning_virology_method_classification.csv', index=False)

print("Final DataFrame with method classification saved as 'deep_learning_virology_method_classification.csv'.")

154
Semantic method type classification completed with two main phrases.
         PMID                                              Title  \
30   39013794  Deep Learning - Methods to Amplify Epidemiolog...   
127  38006509  Quantitation of Oncologic Image Features for R...   
228  36189512  Visual ergonomics for changing work environmen...   
402  34253822  Deep learning for COVID-19 detection based on ...   
462  35782182  Computer Audition for Fighting the SARS-CoV-2 ...   

                                              Abstract      Method_Type  
30   Deep learning is a subfield of artificial inte...             Both  
127  Radiomics is an emerging and exciting field of...  Computer Vision  
228  BACKGROUND: The coronavirus 2019 (COVID-19) pa...            Other  
402  COVID-19 has tremendously impacted patients an...  Computer Vision  
462  Computer audition (CA) has experienced a fast ...             Both  
Final DataFrame with method classification saved as 'deep_learning_virolog



## Download Output File

The following link lets you download the classified dataset directly from the notebook:

  - `deep_learning_virology_method_classification.csv`: Includes papers classified by method type, indicating their focus on Text Mining, Computer Vision, or both.

Click the link to download the CSV file.

In [26]:
from IPython.display import FileLink

# Display download link for the final CSV
display(FileLink('deep_learning_virology_method_classification.csv'))


### Step 6: Extract Specific Deep Learning Methods from Titles and Abstracts

In this step, we extract specific deep learning methods mentioned in each paper based on a comprehensive list of relevant keywords. This helps to provide a detailed overview of the deep learning techniques applied in each paper, facilitating further analysis.

#### Step-by-Step Explanation:

1. **Define List of Target Keywords**:
   - We compile a list of keywords representing various deep learning and machine learning methods, architectures, and models (e.g., "neural network," "transformer," "convolutional neural network").
   - These keywords cover a broad range of techniques and terms related to both text and image analysis methods.

2. **Extract Methods by Matching Keywords in Title and Abstract**:
   - The function `extract_methods_from_text` is defined to check each paper’s combined `Title_Abstract` field.
   - For each entry, it performs a case-insensitive substring search for every keyword in the combined title and abstract. All matching keywords are recorded as the methods used in that paper; if none match, the method is labeled **Not Specified**. Because matching is by substring, a short keyword such as "RNN" also matches inside "GRNN".

3. **Apply Extraction Function and Save Results**:
   - The extraction function is applied to each row in the `final_filtered_df`, creating a new column, `Methods_Used`, which lists the identified methods.
   - The number of entries labeled "Not Specified" in `Methods_Used` is also printed to track papers where no relevant method keywords were found.

4. **Save the Final Data with Extracted Methods**:
   - The final DataFrame, including the extracted methods for each paper, is saved as `deep_learning_virology_methods_extracted.csv`. This file is the final output of the pipeline and lists the specific deep learning methods mentioned in each retained paper.

By capturing specific methods mentioned in the papers, this step enriches the dataset, making it easier to analyze the prevalence and application of various techniques within the field.
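
Substring matching can over-report short acronyms. As a possible refinement (an alternative to, not a description of, the notebook's approach), the sketch below matches keywords as whole words using regular expressions. The keyword subset, the sample sentence, and `extract_whole_word` are assumptions chosen for illustration.

```python
import re

# A small subset of the keyword list, for illustration.
keywords = ["RNN", "GRNN", "CNN", "transformer"]

# Compile one case-insensitive, whole-word pattern per keyword.
patterns = {
    kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
    for kw in keywords
}

def extract_whole_word(text):
    """Return keywords that occur as whole words in text, or 'Not Specified'."""
    found = [kw for kw, pattern in patterns.items() if pattern.search(text)]
    return ", ".join(found) if found else "Not Specified"

sample = "A GRNN model was compared with a transformer baseline."

# Substring matching (as in a plain `in` check) versus whole-word matching.
substring_hits = [kw for kw in keywords if kw.lower() in sample.lower()]
whole_word_hits = extract_whole_word(sample)
```

In this sample, substring matching reports "RNN" because it occurs inside "GRNN", while whole-word matching does not. The trade-off is that whole-word patterns miss plural forms such as "transformers" unless those variants are added to the keyword list.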


In [30]:
import pandas as pd

# List of specific methods based on provided query phrases
method_keywords = [
    "neural network", "artificial neural network", "machine learning model", 
    "feedforward neural network", "neural net algorithm", "multilayer perceptron", 
    "convolutional neural network", "recurrent neural network", "long short-term memory network", 
    "CNN", "GRNN", "RNN", "LSTM", "deep learning", "deep neural networks", 
    "computer vision", "vision model", "image processing", "vision algorithms", 
    "computer graphics and vision", "object recognition", "scene understanding", 
    "natural language processing", "text mining", "NLP", "computational linguistics", 
    "language processing", "text analytics", "textual data analysis", "text analysis", 
    "speech and language technology", "language modeling", "computational semantics", 
    "generative artificial intelligence", "generative AI", "generative deep learning", 
    "generative models", "transformer models", "self-attention models", 
    "transformer architecture", "transformer", "attention-based neural networks", 
    "transformer networks", "sequence-to-sequence models", "large language model", 
    "LLM", "transformer-based model", "pretrained language model", "generative language model", 
    "foundation model", "state-of-the-art language model", "multimodal model", 
    "multimodal neural network", "vision transformer", "diffusion model", 
    "generative diffusion model", "diffusion-based generative model", "continuous diffusion model"
]

# Function to extract methods by matching keywords in the title and abstract
def extract_methods_from_text(text):
    if not isinstance(text, str):  # Check if text is a valid string
        return "Not Specified"
    
    text = text.lower()  # Convert to lowercase for consistent matching
    found_methods = [method for method in method_keywords if method.lower() in text]
    return ", ".join(found_methods) if found_methods else "Not Specified"

# Display DataFrame information to confirm the current state
print(len(final_filtered_df))
print('length')

# Combine Title and Abstract into a comprehensive text field for analysis
final_filtered_df['Title_Abstract'] = final_filtered_df['Title'].fillna('') + " " + final_filtered_df['Abstract'].fillna('')

# Apply the method extraction function
final_filtered_df['Methods_Used'] = final_filtered_df['Title_Abstract'].apply(extract_methods_from_text)

print(final_filtered_df.info())

# Display a sample of the results to check extracted methods
print("Method extraction based on keywords completed.")
print(final_filtered_df[['PMID', 'Title', 'Abstract', 'Methods_Used']].head())

# Count entries with "Not Specified" in 'Methods_Used'
not_specified_count = final_filtered_df['Methods_Used'].value_counts().get("Not Specified", 0)
print(f"Number of 'Not Specified' entries in Methods_Used: {not_specified_count}")

# Save the final filtered DataFrame with Methods_Used to a CSV file
final_filtered_df.to_csv('deep_learning_virology_methods_extracted.csv', index=False)

print("Final DataFrame with extracted methods saved as 'deep_learning_virology_methods_extracted.csv'.")


154
length
<class 'pandas.core.frame.DataFrame'>
Index: 154 entries, 30 to 11283
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   PMID                      154 non-null    int64  
 1   Title                     154 non-null    object 
 2   Authors                   154 non-null    object 
 3   Citation                  154 non-null    object 
 4   First Author              154 non-null    object 
 5   Journal/Book              154 non-null    object 
 6   Publication Year          154 non-null    int64  
 7   Create Date               154 non-null    object 
 8   PMCID                     113 non-null    object 
 9   NIHMS ID                  6 non-null      object 
 10  DOI                       149 non-null    object 
 11  Abstract                  154 non-null    object 
 12  Title_Abstract            154 non-null    object 
 13  Paper_Embedding           154 non-null    object 
 14  V

## Download Output File

The following link lets you download the final dataset directly from the notebook:

  - `deep_learning_virology_methods_extracted.csv`: Contains papers with extracted method details based on keyword matching for deep learning techniques.

Click the link to download the CSV file.

In [28]:
# Display download link for the final CSV
display(FileLink('deep_learning_virology_methods_extracted.csv'))