<a href='https://ai.meng.duke.edu'> = <img align="left" style="padding-top:10px;" src=https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png>

# Topic Modeling using BERT
One of the downsides of using topic modeling methods which extract topics from documents converted to vectors using word counts/frequency is that there is no consideration given to semantic similarity of words/phrases.  For example, such an approach might not correctly recognize that a document which repeatedly includes the word "covid-19" shares the same topic as another document which uses the word "coronavirus".  

An alternative approach is to use embedings to represent the text with vectors which capture the semantic meaning of the text.  One way to do this is to create a list of candidate topics, and then compare the embedding of each candidate topic to the embedding of each document (which is usually calculated as the mean of the embeddings of all words in the document).  We then presume that the candidate topics with embeddings closest to the embedding of the document (usually measured with cosine similarity) are the topics contained in the document. 

**Notes:** 
- This does not need to be run on GPU

**References:**  
- This demo notebook is inspired by [this article](https://towardsdatascience.com/keyword-extraction-with-bert-724efca412ea) from the creator of the [KeyBERT package](https://github.com/MaartenGr/KeyBERT)

In [2]:
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
import spacy
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
import numpy as np

## Get documents to tag with topics
We will use BeautifulSoup to get the content of a few articles from the web and strip the text content from the hmtl.  The articles we will use for this example are news articles each relating to one or both of two primary themes: COVID-19 and Duke basketball.  Therefore we would expect the topics which we identify to be related to these two themes.

In [3]:
# Get article
article_urls = ['https://www.cbssports.com/college-basketball/news/duke-basketballs-game-vs-clemson-postponed-due-to-positive-covid-19-tests-in-blue-devils-program/',
                'https://www.usatoday.com/story/news/health/2021/12/21/covid-holiday-safety-need-to-know/8968198002/',
                'https://www.fayobserver.com/story/sports/college/basketball/2021/12/29/duke-blue-devils-basketball-recruiting-jon-scheyer-commits/9032663002/',
                'https://www.today.com/health/health/covid-19-cold-flu-tell-difference-rcna10114',
                'https://www.dukechronicle.com/article/2021/06/duke-mens-basketball-head-coach-jon-scheyer-mike-krzyzewski',
                'https://www.hopkinsmedicine.org/health/conditions-and-diseases/coronavirus']
article_text = []
titles = []
for url in article_urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # Extract body text from article
    bodytext = soup.find_all('p')
    bodytext = [i.text for i in bodytext]
    bodytext = ' '.join(bodytext)
    article_text.append(bodytext)
    # Extract titles for articles
    title = soup.find_all('h1')
    title = title[0].text.strip()
    titles.append(title)


## Identify candidate topics
We first need to create a set of candidate topics.  We will then embed each candidate topic and compare the embedding to the overall document embedding to see which candidate topics most closely match the overall article.  To create our list of candidate topics, we will extract all nouns and noun phrases (phrases including an adjective and a noun) from the articles, and these will be our candidates.

In [4]:
# Extract candidate 1-grams and 2-grams 
n_gram_range = (1, 2)
vectorizer = CountVectorizer(ngram_range=n_gram_range, stop_words=stopwords.words('english'))
vectorizer.fit(article_text)
candidates = vectorizer.get_feature_names()

# Get noun phrases and nouns from articles
nlp = spacy.load('en_core_web_sm')
all_nouns = set()
for doc in article_text:
    doc_processed = nlp(doc)
    # Add noun chunks
    all_nouns.add(chunk.text.strip().lower() for chunk in doc_processed.noun_chunks)
    # Add nouns
    for token in doc_processed:
            if token.pos_ == "NOUN":
                all_nouns.add(token.text)

# Filter candidate topics to only those in the nouns set
candidates = [c for c in candidates if c in all_nouns]



## Embed candidates and documents and find matching topics

In [5]:
def model_topics(documents,candidates,num_topics):
    model = SentenceTransformer('distilbert-base-nli-mean-tokens')
    # Encode each of the articles
    doc_embeddings = [model.encode([doc]) for doc in documents]
    # Encode the candidate topics
    candidate_embeddings = model.encode(candidates)

    # Calculate cosine similarity between each document and candidate topics
    # Take the top candidate topics as keywords for each document
    article_keywords = []
    for doc in doc_embeddings:
        scores = cosine_similarity(doc, candidate_embeddings)
        keywords = [candidates[index] for index in scores.argsort()[0][-num_topics:]]
        article_keywords.append(keywords)
    
    return article_keywords

In [6]:
topics = model_topics(article_text,candidates,num_topics=3)
for i,keywords in enumerate(topics):
    print('Article {}: {}'.format(i,titles[i]))
    print('Topic keywords: {}'.format(keywords))
    print()

Article 0: Duke basketball games vs. Clemson, Notre Dame postponed due to positive COVID-19 tests in Blue Devils program
Topic keywords: ['flu', 'viruses', 'coronavirus']

Article 1: Vaccinated and test positive? What to know about omicron, COVID for this holiday season.
Topic keywords: ['viruses', 'flu', 'coronavirus']

Article 2: How did Duke basketball and Jon Scheyer keep up their major recruiting hot streak in December?
Topic keywords: ['preseason', 'basketball', 'championship']

Article 3: Is it COVID-19 or just a cold? Here's how to tell the difference
Topic keywords: ['fever', 'coronavirus', 'flu']

Article 4: Jon Scheyer to succeed Mike Krzyzewski after Duke men's basketball's 2021-22 season
Topic keywords: ['doctor', 'coach', 'coaches']

Article 5: What Is Coronavirus?
Topic keywords: ['vaccines', 'vaccinations', 'coronavirus']

