In [None]:

import kagglehub
organizations_allen_institute_for_ai_cord_19_research_challenge_path = kagglehub.dataset_download('organizations/allen-institute-for-ai/CORD-19-research-challenge')

print('Data source import complete.')


In [None]:
# Basic setup and dataset check
import numpy as np
import pandas as pd
import os



In [None]:
# Load the first 500 rows from metadata.csv
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=500)
df.head()

In [None]:
# Filter for abstracts that mention the word "vaccine"
vaccine_papers = df[df['abstract'].str.contains("vaccine", case=False, na=False)]

# Show how many were found
print(f"Found {len(vaccine_papers)} papers mentioning 'vaccine'.")

# Preview 5 of them
vaccine_papers[['title', 'abstract', 'publish_time']].head()


In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Ensure inline plots (sometimes needed in fresh notebooks)
%matplotlib inline

# Combine abstracts (limit text to avoid long processing time)
text = " ".join(vaccine_papers['abstract'].dropna())[:50000]

# Generate Word Cloud
wordcloud = WordCloud(width=1200, height=600, background_color='white', max_words=100).generate(text)

# Display it
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Top Terms in Vaccine-Related COVID-19 Research", fontsize=18)
plt.show()


### Insights: Vaccine Research Trends in COVID-19 Literature

We extracted and visualized the top 100 terms appearing in the abstracts of scholarly papers that mention **"vaccine"** in the [CORD-19 dataset](https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge). This initial exploration reveals key research themes:

- **Commonly recurring terms** include: `immune`, `efficacy`, `response`, `antibody`, `SARS-CoV-2`, and `clinical trials`.
- The prevalence of words like **"mRNA"**, **"adjuvant"**, and **"neutralizing"** reflects the scientific focus on developing next-generation vaccine platforms.
- Frequent mentions of **"patients"**, **"safety"**, and **"results"** point to real-world testing and outcomes.

This simple word cloud gives us a quick lens into what the global research community was prioritizing in early COVID-19 vaccine studies.
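
A word cloud shows relative prominence but not exact counts. As a rough numeric complement, the sketch below tallies term frequencies with `collections.Counter`. The `sample_abstracts` list, the `STOP_WORDS` set, and the `top_terms` helper are illustrative assumptions, not part of the original analysis, and the tokenizer is deliberately simple.

```python
import re
from collections import Counter

# Hypothetical stand-in for vaccine_papers['abstract'].dropna(); swap in the real column.
sample_abstracts = [
    "The vaccine induced a strong antibody response in mice.",
    "Antibody titers and vaccine efficacy were measured in a clinical trial.",
]

# Small illustrative stop-word list (an assumption, not WordCloud's built-in list).
STOP_WORDS = {"the", "a", "an", "and", "in", "of", "were", "was", "to"}


def top_terms(abstracts, n=10):
    """Return the n most common lowercase word tokens, excluding stop words."""
    counts = Counter()
    for abstract in abstracts:
        # Tokens are runs of at least two characters that start with a letter.
        tokens = re.findall(r"[a-z][a-z0-9-]+", abstract.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


print(top_terms(sample_abstracts, n=3))
```

Passing `vaccine_papers['abstract'].dropna()` instead of the sample list should give a ranked table that can be compared against the cloud.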


In [None]:
# Filter abstracts that mention "treatment"
treatment_papers = df[df['abstract'].str.contains("treatment", case=False, na=False)]

# How many were found?
print(f"Found {len(treatment_papers)} papers mentioning 'treatment'.")

# Preview 5 of them
treatment_papers[['title', 'abstract', 'publish_time']].head()


In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Combine abstracts (again, keep it lightweight)
text_treatment = " ".join(treatment_papers['abstract'].dropna())[:50000]

# Generate word cloud
wordcloud_treatment = WordCloud(width=1200, height=600, background_color='white', max_words=100).generate(text_treatment)

# Display it
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud_treatment, interpolation='bilinear')
plt.axis('off')
plt.title("Top Terms in Treatment-Related COVID-19 Research", fontsize=18)
plt.show()


### Treatment-Focused Research Insights

This word cloud summarizes the **most frequent terms** found in 64 abstracts related to "treatment" in the CORD-19 dataset. Several key insights emerge:

- **"Patient"**, **"infection"**, and **"disease"** are dominant terms, indicating a strong clinical and public health focus.
- Words like **"mortality"**, **"hospital"**, **"method"**, and **"diagnosis"** suggest studies on patient outcomes, hospital care strategies, and therapeutic protocols.
- Biological terms such as **"cell"**, **"immune"**, and **"viral"** reflect research on how treatments interact with the virus at a cellular level.
- The prominence of **"influenza"** and **"pandemic"** shows the overlap in infectious disease treatment research, extending beyond COVID-19.
- Notably, **"public health"** and **"preparedness"** appear, emphasizing system-wide responses alongside clinical interventions.

This visualization gives a high-level overview of treatment discussions and sets the stage for more targeted analyses, such as which therapies are most frequently evaluated or how treatment strategies have evolved over time.


In [None]:
# Import libraries
import pandas as pd
import os

# Load metadata CSV with limited rows to keep it light
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=1000)

# Define treatment-related keywords
keywords = ['treatment', 'drug', 'antiviral', 'remdesivir', 'hydroxychloroquine', 'dexamethasone', 'efficacy']

# Filter abstracts that mention any of the keywords
treatment_papers = df[df['abstract'].fillna('').str.contains('|'.join(keywords), case=False)]

# Display how many relevant papers were found
print(f"Found {len(treatment_papers)} papers mentioning treatment-related terms.")

# Show a preview of the papers
treatment_papers[['title', 'abstract', 'publish_time']].head()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the first 1000 rows
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=1000)

# Define treatment-related keywords
keywords = ['treatment', 'drug', 'antiviral', 'remdesivir', 'hydroxychloroquine', 'dexamethasone', 'efficacy']

# Filter papers mentioning these keywords
treatment_papers = df[df['abstract'].fillna('').str.contains('|'.join(keywords), case=False)]

# Count how many abstracts mention each keyword
keyword_counts = {k: treatment_papers['abstract'].str.contains(k, case=False, na=False).sum() for k in keywords}

# Plot bar chart
plt.figure(figsize=(10, 6))
plt.bar(keyword_counts.keys(), keyword_counts.values())
plt.title("Mentions of Treatment-Related Terms in Abstracts")
plt.ylabel("Number of Papers")
plt.xlabel("Keyword")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Treatment-Related Keywords in COVID-19 Research

To begin answering Scientific Question #4 from the CORD-19 challenge (*"What is known about the effectiveness of drugs and other therapies?"*), I explored mentions of key treatment-related terms in the abstracts of the dataset.

In the first 1,000 rows of the metadata file:

- **"Treatment"** was the most frequently used keyword, appearing in over 130 papers.
- Terms like **"drug"**, **"antiviral"**, and **"efficacy"** also appeared with moderate frequency.
- Mentions of specific treatments like **remdesivir**, **hydroxychloroquine**, and **dexamethasone** were extremely rare, suggesting that such drugs may either be underrepresented in this slice of the dataset or were not widely discussed in early abstracts.

This preliminary scan helps us identify which therapies are receiving attention in the literature. Future steps might include:
- Analyzing full-text documents for detailed treatment discussions.
- Mapping mentions over time to track drug popularity or emergence.
- Extracting named entities to isolate clinical trial references.

This forms an early but helpful lens into how treatments were discussed in the academic community during the pandemic.
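
The "mentions over time" idea from the list above could start from something like the sketch below. It assumes `publish_time` holds either a bare year or a `YYYY-MM-DD` date, which should be checked against the real column; the `sample` frame and the `mentions_by_year` helper are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of metadata.csv; the real file has many more columns.
sample = pd.DataFrame({
    "abstract": [
        "Remdesivir shortened recovery time.",
        "An antiviral drug screen against coronaviruses.",
        None,
        "Dexamethasone reduced mortality.",
    ],
    "publish_time": ["2020-05-22", "2014", "2020-03-01", "2020-07-17"],
})


def mentions_by_year(frame, keyword):
    """Count abstracts mentioning `keyword`, grouped by publication year."""
    # Take the leading four digits so both "2014" and "2020-05-22" are handled.
    year = frame["publish_time"].astype(str).str.extract(r"^(\d{4})")[0]
    # Literal, case-insensitive match; missing abstracts count as no mention.
    hits = frame["abstract"].str.contains(keyword, case=False, na=False, regex=False)
    return hits.groupby(year).sum().astype(int)


print(mentions_by_year(sample, "remdesivir"))
```

Calling `mentions_by_year(df, "remdesivir")` on the loaded metadata would give a per-year series that can be passed straight to `plt.bar`.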


In [None]:
import pandas as pd

# Load a small sample from the metadata file
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=500)

# Preview to ensure it's loaded
df[['title', 'abstract']].head()


In [None]:
treatment_keywords = [
    "remdesivir", "paxlovid", "molnupiravir", "monoclonal antibodies",
    "hydroxychloroquine", "ivermectin", "dexamethasone", "convalescent plasma"
]

# Count mentions
keyword_counts = {term: df['abstract'].str.contains(term, case=False, na=False).sum()
                  for term in treatment_keywords}

# To DataFrame
keyword_df = pd.DataFrame(list(keyword_counts.items()), columns=["Treatment", "Mentions"])
keyword_df.sort_values(by="Mentions", ascending=False, inplace=True)
keyword_df


In [None]:
import matplotlib.pyplot as plt

# Plot
plt.figure(figsize=(10, 6))
plt.barh(keyword_df['Treatment'], keyword_df['Mentions'], color='teal')
plt.xlabel('Number of Mentions')
plt.title('Mentions of Specific COVID-19 Treatments in Abstracts')
plt.gca().invert_yaxis()  # Highest value on top
plt.tight_layout()
plt.show()


In [None]:
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=2000)


In [None]:
# Define treatment keywords we're searching for
treatment_keywords = [
    "remdesivir", "paxlovid", "molnupiravir", "hydroxychloroquine",
    "ivermectin", "monoclonal antibodies", "dexamethasone", "convalescent plasma"
]

# Count mentions in the abstract column
keyword_counts = {
    term: df['abstract'].str.contains(term, case=False, na=False).sum()
    for term in treatment_keywords
}

# Convert to DataFrame for visualization
keyword_df = pd.DataFrame(list(keyword_counts.items()), columns=["Treatment", "Mentions"])
keyword_df.sort_values(by="Mentions", ascending=False, inplace=True)
keyword_df


In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.bar(keyword_df['Treatment'], keyword_df['Mentions'], color='skyblue')
plt.title("Mentions of COVID-19 Treatments in Abstracts")
plt.ylabel("Number of Mentions")
plt.xlabel("Treatment")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


### Insights & Limitations from Treatment Term Analysis

From the charts above, it's clear that:

- **Monoclonal antibodies** dominate treatment-related mentions, significantly outpacing other interventions.
- **Dexamethasone** and **convalescent plasma** were mentioned, though far less frequently.
- Surprisingly, treatments like **remdesivir**, **paxlovid**, **hydroxychloroquine**, and **ivermectin** showed **zero or near-zero mentions**. This could be due to:
  - Their emergence later in the pandemic (after the publish dates of our sampled abstracts).
  - Terminology mismatches (e.g., alternate drug names or abbreviations not captured by our keyword list).
  - Abstracts not including specific drug names even if the paper discusses them in full text.

I avoided more advanced searches (e.g., across full-text JSON files) to preserve notebook performance. Future work could include querying full documents or applying NLP to improve detection of treatment mentions in context.
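
One way to probe the terminology-mismatch explanation is to count each treatment under several names at once. The sketch below is a minimal version of that idea; `TREATMENT_ALIASES` and `count_mentions` are hypothetical, and the alias lists are illustrative rather than a vetted drug vocabulary.

```python
import re
import pandas as pd

# Hypothetical alias table; extend and verify before drawing conclusions from it.
TREATMENT_ALIASES = {
    "remdesivir": ["remdesivir", "GS-5734"],
    "paxlovid": ["paxlovid", "nirmatrelvir"],
    "monoclonal antibodies": ["monoclonal antibody", "monoclonal antibodies", "mAbs"],
}


def count_mentions(abstracts, aliases):
    """Count abstracts that match any alias of each treatment as a whole word."""
    counts = {}
    for name, variants in aliases.items():
        # re.escape protects characters such as "-" in names like "GS-5734";
        # the non-capturing group keeps pandas from warning about match groups.
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        matches = abstracts.str.contains(pattern, case=False, na=False)
        counts[name] = int(matches.sum())
    return counts


# Hypothetical abstracts standing in for df['abstract'].
sample = pd.Series([
    "GS-5734 inhibited viral replication in vitro.",
    "Neutralizing mAbs were isolated from recovered patients.",
    None,
])
print(count_mentions(sample, TREATMENT_ALIASES))
```

If the counts for a treatment rise noticeably once aliases are included, naming differences are a likely contributor to the near-zero results above.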


In [None]:
import pandas as pd
import re

# Load the first 5000 rows of the metadata (enough for insights, lightweight enough to run)
df = pd.read_csv('/kaggle/input/CORD-19-research-challenge/metadata.csv', nrows=5000)

# Define treatments we're interested in and keywords related to effectiveness
selected_treatments = ['monoclonal antibodies', 'dexamethasone', 'convalescent plasma']
effectiveness_keywords = ['effective', 'efficacy', 'ineffective', 'reduced', 'improved', 'treatment response']

# Store all matching sentences from abstracts
matching_sentences = []

# Iterate through abstracts
for abstract in df['abstract'].dropna():
    # Break the abstract into sentences (naive split on ., !, or ? followed by whitespace)
    sentences = re.split(r'(?<=[.!?])\s+', abstract)
    for sentence in sentences:
        # Lowercase once so both checks are case-insensitive
        lowered = sentence.lower()
        # Keep the sentence if it names a treatment and an effectiveness term
        if any(treat in lowered for treat in selected_treatments) and \
           any(eff in lowered for eff in effectiveness_keywords):
            matching_sentences.append(sentence.strip())

# Preview first 10 matched sentences
matching_sentences[:10]


### Treatment Effectiveness Mentions in Abstracts

To explore **Question #4** (*What is known about the effectiveness of drugs and treatments?*), we conducted a lightweight scan of the abstracts in the **first 5,000 metadata rows** of the CORD-19 dataset. We filtered for sentences mentioning **monoclonal antibodies**, **dexamethasone**, or **convalescent plasma** alongside effectiveness-related terms such as *efficacy*, *improved*, or *reduced*.

#### Sample Sentences:
- *"School closures were effective in reducing pH1N1 transmission, oseltamivir was effective for treatment of severe cases while convalescent plasma therapy has the potential to mitigate future pandemics."*
- *"Monoclonal antibodies against GGT effectively inhibited GGT activity and successfully suppressed H."*
- *"Activity was reduced on exosomes isolated from dexamethasone-treated explants."*

#### Observations:
- **Monoclonal antibodies** are frequently discussed in contexts of broad antiviral activity or inhibition.
- **Dexamethasone** appears in studies focused on molecular and cellular changes.
- **Convalescent plasma** is referenced as a potential intervention in historical pandemic responses.
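
One of the quoted sentences above ends at "H.", which suggests the sentence splitter broke on an abbreviation. The sketch below compares the split pattern used in the extraction cell with a variant that refuses to split after a single capital letter followed by a period. The variant pattern and the example text are assumptions for illustration; a proper sentence tokenizer would be more robust.

```python
import re

# Pattern used in the extraction cell: split on whitespace after ., !, or ?
NAIVE_SPLIT = re.compile(r"(?<=[.!?])\s+")

# Variant (an assumption): skip the split when the period follows a lone capital
# letter, as in genus abbreviations such as "E. coli".
ABBREV_AWARE_SPLIT = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+")

# Made-up example text, not taken from the dataset.
text = "The antibody suppressed E. coli growth. Activity was reduced in treated cells."

naive = NAIVE_SPLIT.split(text)
aware = ABBREV_AWARE_SPLIT.split(text)

print(len(naive), naive)
print(len(aware), aware)
```

The trade-off is that a sentence that genuinely ends in a single capital letter (for example "vitamin C.") would be merged with the next one, so the matched sentences are still worth spot-checking.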


### Conclusion & Reflections

This notebook explored abstracts from up to the first 5,000 metadata records of the **CORD-19 Open Research Dataset** to investigate questions related to COVID-19 **treatments**, **vaccines**, and **medical response strategies**.

Using keyword filtering, sentence extraction, and basic natural language processing, we:

- Identified **64+ vaccine-related papers** and over **60 treatment-related studies**
- Extracted real examples of language discussing the **effectiveness** of specific therapies (e.g., dexamethasone, monoclonal antibodies)
- Created multiple **word clouds and frequency charts** to visualize common terminology and topic trends

Due to frequent **kernel crashes** and **runtime constraints** on Kaggle, we chose to limit the complexity and size of our queries. Several visualizations were omitted to preserve stability.

Nonetheless, this project demonstrates the potential for even lightweight NLP techniques to surface meaningful insights from a massive, unstructured dataset.
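
For anyone hitting the same memory limits, one option is to stream the metadata file in chunks and load only the `abstract` column. The sketch below is a minimal version of that approach; the `count_keyword_in_chunks` helper and the in-memory demo CSV are hypothetical, and the chunk size is a tuning assumption.

```python
import io
import pandas as pd


def count_keyword_in_chunks(csv_source, keyword, chunksize=1000):
    """Count abstracts containing `keyword` without loading the whole file."""
    total = 0
    reader = pd.read_csv(
        csv_source,
        usecols=["abstract"],      # skip columns the scan does not need
        dtype={"abstract": str},   # keep .str usable even if a chunk is all missing
        chunksize=chunksize,
    )
    for chunk in reader:
        hits = chunk["abstract"].str.contains(keyword, case=False, na=False, regex=False)
        total += int(hits.sum())
    return total


# Tiny in-memory CSV standing in for metadata.csv.
demo_csv = io.StringIO(
    "title,abstract\n"
    "Paper A,A vaccine trial in adults\n"
    "Paper B,\n"
    "Paper C,Vaccine safety follow-up\n"
)
print(count_keyword_in_chunks(demo_csv, "vaccine", chunksize=2))
```

Replacing `demo_csv` with the path to `metadata.csv` would scan the full file while holding only one chunk in memory at a time.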

> This notebook was created with the help of **ChatGPT** to assist with scripting, exploration strategy, and markdown drafting.
