# Blendle Topic & Event Extraction EN

Based on workflows by:

https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

https://towardsdatascience.com/natural-language-processing-event-extraction-f20d634661d3?gi=253736be2ed7

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

import re
import pickle
import json
import os

import umap
import hdbscan

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances_argmin_min
from tqdm import tqdm
from summarizer import Summarizer

## Importing and preprocessing data

The following functions load the JSON data into a dataframe and preprocess it.

In [5]:
def import_json_data(path):
    """
    import_json_data imports JSON data and keeps only the id, date,
    headline and content of an article.

    :path: path to raw JSON data (one JSON object per line)
    :return: list of dicts where each dict is an article.
    """
    articles = []
    # use a context manager so the file handle is closed after reading
    with open(path, 'r') as f:
        data = [json.loads(line) for line in f]

    for d in data:
        article = {
            'id' : d['id'],
            'date' : d['date'],
        }
        # headline: stop looking at the first malformed body element
        for i in d['body']:
            try:
                if i['type'] == 'hl1':
                    article['headline'] = i['content']
            except (KeyError, TypeError):
                break
        # content: skip malformed body elements and join all paragraphs
        content = []
        for i in d['body']:
            try:
                if i['type'] == 'p':
                    content.append(i['content'])
            except (KeyError, TypeError):
                continue
        article['content'] = ' '.join(content)
        articles.append(article)

    return articles
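
To make the expected input concrete, the sketch below builds one record in the shape that `import_json_data` appears to assume (an `id`, a `date`, and a `body` list of elements with `type` and `content`) and extracts the headline and paragraphs in a similar way. The record is made up for illustration; the real export may contain additional fields.

```python
import json

# Hypothetical record; field names are inferred from import_json_data above.
line = json.dumps({
    "id": "example-0001",
    "date": "2020-02-16",
    "body": [
        {"type": "hl1", "content": "An example headline"},
        {"type": "p", "content": "First paragraph."},
        {"type": "image"},  # no content key; not a headline or paragraph
        {"type": "p", "content": "Second paragraph."},
    ],
})

record = json.loads(line)

# First element of type 'hl1' is taken as the headline
headline = next(
    (el["content"] for el in record["body"] if el.get("type") == "hl1"), None
)
# All elements of type 'p' are joined into the article content
content = " ".join(
    el["content"] for el in record["body"] if el.get("type") == "p"
)

print(headline)
print(content)
```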

In [8]:
def preprocess_data(data):
    """
    preprocess_data takes in list of dicts and returns
    Pandas DataFrame with keys as columns and value
    as rows, and removes HTML tags from text.

    :data: list of dicts of articles.
    
    :return: dataframe.
    """
    TAG_RE = re.compile(r'<[^>]+>')
    
    dataframe = pd.DataFrame(data)
    
    # drop duplicate articles
    dataframe = dataframe.drop_duplicates(subset='content', keep="last")
    
    # remove HTML
    dataframe['content'] = dataframe['content'].apply(lambda x: TAG_RE.sub('', x))
    dataframe['content'] = dataframe['content'].apply(lambda x: x.replace("&nbsp;", " "))
    
    # create new article column
    dataframe['article'] = dataframe['headline'] + ' ' + dataframe['content']
    
    # make date column datetime
    dataframe['date'] = pd.to_datetime(dataframe['date'])
    
    return dataframe.dropna().drop_duplicates()
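
As a quick check of the HTML cleaning step, the sketch below applies the same regular expression and `&nbsp;` replacement to a single made-up string. Other entities such as `&amp;` are left untouched by this approach.

```python
import re

# Same pattern as in preprocess_data: anything between '<' and '>'
TAG_RE = re.compile(r'<[^>]+>')

# Made-up input string for illustration
raw = '<p>Prices rose&nbsp;<b>3%</b> in <a href="#">February</a>.</p>'

clean = TAG_RE.sub('', raw).replace('&nbsp;', ' ')
print(clean)
```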

In [3]:
def create_df(path):
    """
    Loops over directory and calls above functions to
    return dataframe.

    :path: path to JSON files.
    
    :return: dataframe.
    """
    frames = []
    for file in tqdm(os.listdir(path)):
        # os.path.join also works when path has no trailing slash
        data = import_json_data(os.path.join(path, file))
        frames.append(preprocess_data(data))
    
    return pd.concat(frames).reset_index(drop=True)

In [9]:
PATH = 'data_en/'
df = create_df(PATH)
df.head()

100%|██████████| 2/2 [01:22<00:00, 41.39s/it]


| | id | date | headline | content | article |
|---|---|---|---|---|---|
| 0 | bnl-newyorktimes528-20200216-489baa4a905a | 2020-02-16 00:00:00+00:00 | Questioning CPR as a Default Response | DR. MONIQUE STARKS DUKE UNIVERSITY SCHOOL OF M... | Questioning CPR as a Default Response DR. MONI... |
| 1 | bnl-chicagotribune-20200219-89419fb1 | 2020-02-19 00:00:00+00:00 | ‘Taking Sexy Back’ a fantastic and timely book | If I didn’t know better, I would think Alexand... | ‘Taking Sexy Back’ a fantastic and timely book... |
| 2 | bnl-atavist-20200226-5e52a7d36b667 | 2020-02-26 00:00:00+00:00 | Deliverance | Devilry of the kind necessary to kill a toddle... | Deliverance Devilry of the kind necessary to k... |
| 3 | bnl-economist-20200131-7f87b05bd7e | 2020-01-31 00:00:00+00:00 | Whendunnit? | Since the first use of fingerprints to identif... | Whendunnit? Since the first use of fingerprint... |
| 4 | bnl-fastcompany-20200201-ba11405d1ed | 2020-02-01 00:00:00+00:00 | EXPERIENCE MATTERS | IN DIGITAL With support from SAP, Fast Company... | EXPERIENCE MATTERS IN DIGITAL With support fro... |


Let's check out an article.

In [8]:
df['article'][0][:2000]

'Questioning CPR as a Default Response DR. MONIQUE STARKS DUKE UNIVERSITY SCHOOL OF MEDICINE A FEW MONTHS AGO, an ambulance brought a woman in her 90s to the emergency department at Brigham and Women’s Hospital in Boston. Her metastatic breast cancer had entered its final stages, and she had begun home hospice care. Yet a family member who had discovered her unresponsive that morning had called 911. The paramedics determined that she was in cardiac arrest, began cardiopulmonary resuscitation and put a breathing tube down her throat. “It’s a common scenario,” said Dr. Kei Ouchi, an emergency physician and researcher at Brigham and Women’s who reviews such cases. “And it’s not going to have a good outcome.” At the hospital, the patient’s blood pressure continued to fall despite intravenous medications. “She was trying to die, and it was only a matter of time before she arrested again,” Dr. Ouchi said. An oncologist and emergency physicians met with the patient’s family, and explained tha

Because we use BERT for our embeddings, we do not have to lemmatize words or remove stop words; we can feed the raw text straight into the model. Now we convert the article column to a list that the model can encode.

In [9]:
# create list of articles to cluster model
train = df.article.tolist()
print(f'There are {len(train)} articles.')

There are 344740 articles.


## Creating BERT embeddings by encoding data

For more information see:
https://huggingface.co/models

We used the following models:

* **English text**: 'distilbert-base-nli-mean-tokens'

In [8]:
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
embeddings = model.encode(train, show_progress_bar=True)
embeddings.shape

Batches:   0%|          | 0/10774 [00:00<?, ?it/s]

(344740, 768)

Uncomment the following if you are loading precomputed embeddings instead of encoding the articles again.

In [10]:
# embeddings = np.load('embeddings/embeddings_en.npy')
# embeddings.shape

(344740, 768)
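
Encoding the full corpus is slow, so it can help to store the result once and reload it later. The sketch below shows one way to do that with `np.save` and `np.load`; a small random array stands in for the real embedding matrix, and the file name is only an assumption modelled on the commented path above.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real (344740, 768) embedding matrix
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 768)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'embeddings_en.npy')  # hypothetical location
    np.save(path, fake_embeddings)   # write the array to disk
    restored = np.load(path)         # read it back into memory

print(restored.shape)
```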

## Reducing dimensionality of embeddings

For more information see: https://umap-learn.readthedocs.io/en/latest/

In [10]:
umap_embeddings = umap.UMAP(n_neighbors=15, 
                            n_components=5, 
                            metric='cosine').fit_transform(embeddings)
umap_embeddings.shape

(344740, 5)
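
The `metric='cosine'` setting means articles are compared by the direction of their embedding vectors rather than by their length. The sketch below illustrates the usual definition of cosine distance (1 minus cosine similarity) on toy vectors; it is an illustration of the metric, not of UMAP itself.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity of two 1-D vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, different length
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a (dot product is 0)

print(cosine_distance(a, b))  # close to 0: same direction
print(cosine_distance(a, c))  # close to 1: orthogonal
```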

Uncomment the following if you are loading precomputed UMAP embeddings instead of fitting UMAP again.

In [11]:
# umap_embeddings = np.load('embeddings/umap_embeddings_en.npy')
# umap_embeddings.shape

(344740, 5)

## Clustering articles into topics

For more information see: https://github.com/scikit-learn-contrib/hdbscan

In [12]:
cluster = hdbscan.HDBSCAN(min_cluster_size=15,
                          metric='euclidean',                      
                          cluster_selection_method='eom').fit(umap_embeddings)

In [14]:
# class based TF-IDF
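def c_tf_idf(documents, m, ngram_range=(1, 1)):
    """
    Class-based TF-IDF: each element of documents is the joined text
    of one topic, m is the total number of original articles.
    """
    count = CountVectorizer(ngram_range=ngram_range, stop_words="english").fit(
        documents
    )
    t = count.transform(documents).toarray()  # word counts per topic
    w = t.sum(axis=1)  # total number of words per topic
    tf = np.divide(t.T, w)  # term frequency, shape (n_words, n_topics)
    sum_t = t.sum(axis=0)  # frequency of each word over all topics
    idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
    tf_idf = np.multiply(tf, idf)

    return tf_idf, count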


# topic representation
def extract_top_n_words_per_topic(tf_idf, count, docs_per_topic, n=20):
    # get_feature_names() was removed in scikit-learn 1.2
    words = count.get_feature_names_out()
    labels = list(docs_per_topic.Topic)
    tf_idf_transposed = tf_idf.T
    indices = tf_idf_transposed.argsort()[:, -n:]

    top_n_words = {
        label: [(words[j], tf_idf_transposed[i][j]) for j in indices[i]][::-1]
        for i, label in enumerate(labels)
    }
    top_n_words_clean = {
        label: [words[j] for j in indices[i]][::-1] for i, label in enumerate(labels)
    }

    return top_n_words, top_n_words_clean


def extract_topic_sizes(df):
    topic_sizes = (
        df.groupby(["Topic"])
        .Doc.count()
        .reset_index()
        .rename({"Topic": "Topic", "Doc": "Size"}, axis="columns")
        .sort_values("Size", ascending=False)
    )

    return topic_sizes
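
To see what the class-based TF-IDF computes, the sketch below repeats the arithmetic of `c_tf_idf` on a hand-written count matrix instead of a fitted `CountVectorizer`. The words, counts and `m` are invented for illustration.

```python
import numpy as np

# Toy counts: rows are topics (all documents of a topic joined together),
# columns are the words "brexit", "vote", "iowa".
words = ["brexit", "vote", "iowa"]
t = np.array([[2, 1, 0],
              [0, 1, 3]])
m = 4  # assumed total number of original documents

w = t.sum(axis=1)                # number of words per topic
tf = np.divide(t.T, w)           # term frequency, shape (n_words, n_topics)
sum_t = t.sum(axis=0)            # frequency of each word over all topics
idf = np.log(np.divide(m, sum_t)).reshape(-1, 1)
tf_idf = np.multiply(tf, idf)

# Highest-scoring word for each topic (one column per topic)
top_word_per_topic = [words[i] for i in tf_idf.argmax(axis=0)]
print(top_word_per_topic)
```

The word "vote" occurs in both topics, but it is never the most frequent word within a topic, so it does not come out on top for either one.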


In [15]:
docs_df = pd.DataFrame(train, columns=["Doc"])
docs_df["Topic"] = cluster.labels_
docs_df["Doc_ID"] = range(len(docs_df))
docs_per_topic = docs_df.groupby(["Topic"], as_index=False).agg({"Doc": " ".join})

tf_idf, count = c_tf_idf(docs_per_topic.Doc.values, m=len(train))

top_n_words, top_n_words_clean = extract_top_n_words_per_topic(
    tf_idf, count, docs_per_topic, n=20
)
topic_sizes = extract_topic_sizes(docs_df)
topic_sizes.head(10)

| | Topic | Size |
|---|---|---|
| 0 | -1 | 241045 |
| 153 | 152 | 13492 |
| 112 | 111 | 7511 |
| 665 | 664 | 2163 |
| 86 | 85 | 1868 |
| 538 | 537 | 1863 |
| 115 | 114 | 1701 |
| 726 | 725 | 1598 |
| 578 | 577 | 1551 |
| 464 | 463 | 1454 |


In [16]:
print(f'There are {len(topic_sizes)} topics.')

There are 740 topics.
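
Note that topic -1 is the label HDBSCAN assigns to points it treats as noise, so it is not a real cluster. The short calculation below uses the numbers printed above to show how large that group is.

```python
n_articles = 344740   # article count printed earlier
n_noise = 241045      # size of topic -1 in the table above
n_labels = 740        # number of distinct labels, including -1

noise_share = n_noise / n_articles
n_clusters = n_labels - 1  # -1 is the noise label, not a cluster

print(f'{noise_share:.1%} of articles are unclustered')
print(f'{n_clusters} actual clusters')
```

Roughly 70% of the articles end up without a cluster in this run, which is worth keeping in mind when reading the keywords listed for topic -1.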


You can check out different keywords relating to a topic using the following code:

In [18]:
top_n_words[725]

[('sanders', 0.014172281729564824),
 ('biden', 0.01199950858886089),
 ('voters', 0.011383020763193545),
 ('candidates', 0.009810150507078074),
 ('iowa', 0.009126619806508533),
 ('democratic', 0.00881395758274148),
 ('buttigieg', 0.007657650558769603),
 ('democrats', 0.007099756541263775),
 ('warren', 0.006998057847040383),
 ('primary', 0.006820382291034903),
 ('polls', 0.0067843080403029825),
 ('candidate', 0.006755786074205527),
 ('race', 0.00667055535337568),
 ('senator', 0.00632638094080744),
 ('poll', 0.005758337215866537),
 ('republican', 0.005714212024029153),
 ('campaign', 0.005582795841044771),
 ('debate', 0.005372495115494521),
 ('presidential', 0.005089820825376859),
 ('republicans', 0.004729822072244175)]

You can use the following function to get all topics relating to a specific word:

In [19]:
def get_topic_number(topics, word):
    """
    Returns all topics whose keywords contain a specific word

    :topics: dict mapping topic number to its list of keywords.
    :word: the keyword you're looking for (str)

    :return: list of (topic number, keywords) tuples, or
             'no topic found' if no topic contains the word
    """
    results = []

    for topic in topics.items():
        if word in topic[1]:
            results.append(topic)

    if len(results) == 0:
        return 'no topic found'
    return results

In [20]:
get_topic_number(top_n_words_clean, 'brexit')

[(355,
  ['______',
   'briefing',
   'brexit',
   'parliament',
   'smarter',
   'boris',
   'prime',
   'johnson',
   'minister',
   'josephson',
   'break',
   'youto',
   'snapshot',
   'britain',
   'eleanor',
   'today',
   'clue',
   'puzzles',
   'nytimes',
   'morning']),
 (577,
  ['brexit',
   'johnson',
   'britain',
   'parliament',
   'eu',
   'european',
   'labour',
   'mrs',
   'deal',
   'corbyn',
   'union',
   'ireland',
   'prime',
   'referendum',
   'minister',
   'british',
   'lawmakers',
   'bloc',
   'boris',
   'conservative'])]

## Event extraction
We add the (UMAP-reduced) embeddings to the dataframe so we can extract one event per day.

In [21]:
def add_embeddings(dataframe):
    df = dataframe.copy()
    df['keywords'] = df['Topic'].apply(lambda x: top_n_words_clean[x])
    df['keywords_prob'] = df['Topic'].apply(lambda x: top_n_words[x])
    df['embeddings'] = list(umap_embeddings)
    
    return df

def merge_dataframes(df1, df2):
    df = pd.merge(df1, df2, left_index=True, right_index=True)
    clean_df = df[['id', 'date', 'Topic', 'keywords', 'headline', 'content', 'embeddings']]
    
    return clean_df
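
`merge_dataframes` joins the two dataframes on their index, so it relies on both having one row per article in the same order. The sketch below shows that behaviour on two small made-up dataframes.

```python
import pandas as pd

# Made-up stand-ins: one row per article, in the same order in both frames
articles = pd.DataFrame({'id': ['a', 'b', 'c'], 'headline': ['H1', 'H2', 'H3']})
topics = pd.DataFrame({'Topic': [5, -1, 5]})

# Rows are matched on their index labels (0, 1, 2), not on any column
merged = pd.merge(topics, articles, left_index=True, right_index=True)
print(merged)
```

If either dataframe were re-sorted or filtered without resetting its index consistently, topics could be attached to the wrong articles, so the row order should be left unchanged between clustering and merging.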

Now we merge the topic dataframe (`docs_df`, with keywords and embeddings added) with the original article dataframe (`df`) to get a new dataframe that has all the necessary information.

In [22]:
embedded_df = add_embeddings(docs_df)
clean_df = merge_dataframes(embedded_df, df)
clean_df.head()

| | id | date | Topic | keywords | headline | content | embeddings |
|---|---|---|---|---|---|---|---|
| 0 | bnl-newyorktimes528-20200216-489baa4a905a | 2020-02-16 00:00:00+00:00 | -1 | [pandemic, biden, workers, money, social, comp... | Questioning CPR as a Default Response | DR. MONIQUE STARKS DUKE UNIVERSITY SCHOOL OF M... | [8.258087, 5.4001455, 8.514114, 4.913037, 3.66... |
| 1 | bnl-chicagotribune-20200219-89419fb1 | 2020-02-19 00:00:00+00:00 | 537 | [novel, book, books, writing, story, literary,... | ‘Taking Sexy Back’ a fantastic and timely book | If I didn’t know better, I would think Alexand... | [8.350002, 5.7192316, 6.9386177, 4.987291, 1.9... |
| 2 | bnl-atavist-20200226-5e52a7d36b667 | 2020-02-26 00:00:00+00:00 | -1 | [pandemic, biden, workers, money, social, comp... | Deliverance | Devilry of the kind necessary to kill a toddle... | [8.8728895, 6.339035, 7.4803505, 3.258, 4.066124] |
| 3 | bnl-economist-20200131-7f87b05bd7e | 2020-01-31 00:00:00+00:00 | -1 | [pandemic, biden, workers, money, social, comp... | Whendunnit? | Since the first use of fingerprints to identif... | [8.657768, 4.904919, 7.007837, 4.775185, 4.732... |
| 4 | bnl-fastcompany-20200201-ba11405d1ed | 2020-02-01 00:00:00+00:00 | -1 | [pandemic, biden, workers, money, social, comp... | EXPERIENCE MATTERS | IN DIGITAL With support from SAP, Fast Company... | [8.226043, 4.2056518, 6.258909, 5.8530326, 4.2... |


In [24]:
def get_central_vector(dataframe, n=1):
    """
    Extracts key event per day of dataframe of topic

    :dataframe: Pandas DataFrame of specific topic number
    :n: a day with more than n articles is reduced to its most central
        article; with n=1, days with a single article are kept as they are (int)

    :return: new dataframe indexed by date
    """
    selected = []
    articles = dataframe.set_index('date')

    for name, group in articles.groupby(pd.Grouper(freq='D')):
        if len(group) > n:
            # article whose embedding is closest to the day's mean embedding
            em = np.vstack(group.embeddings.tolist())
            mean_vector = em.mean(axis=0)
            index = pairwise_distances_argmin_min(mean_vector.reshape(1, -1), em)[0][0]
            selected.append(group.iloc[[index]])
        elif len(group) > 0 and n == 1:
            selected.append(group)

    # DataFrame.append was removed in pandas 2.0, so collect and concatenate
    if not selected:
        return articles.iloc[0:0]
    return pd.concat(selected)
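
The selection rule inside `get_central_vector` can be illustrated without pandas: average the embeddings of one day and pick the article closest to that average. The sketch below uses plain NumPy with Euclidean distance (the default metric of `pairwise_distances_argmin_min`) on invented 2-D vectors.

```python
import numpy as np

# Toy 2-D "embeddings" for four articles published on the same day
em = np.array([[0.0, 0.0],
               [1.0, 1.0],
               [2.0, 2.0],
               [9.0, 9.0]])

mean_vector = em.mean(axis=0)                          # centre of the day's articles
distances = np.linalg.norm(em - mean_vector, axis=1)   # Euclidean distance to the centre
central_index = int(distances.argmin())                # position of the closest article

print(mean_vector)
print(central_index)
```

The outlier at (9, 9) pulls the mean towards itself, yet the article picked is still one from the dense group, which is the intended effect of choosing the most central article as that day's event.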

In [25]:
def get_topic_df(dataframe, topic):
    """
    Returns new dataframe based on topic number

    :dataframe: Pandas DataFrame
    :topic: number (int)
    
    :return: new dataframe
    """
    return dataframe.loc[dataframe['Topic'] == topic]

In [26]:
def summerize_article(dataframe, n_sentences=3):
    """
    Creates new dataframe column with a summary of each article

    :dataframe: Pandas DataFrame
    :n_sentences: number of sentences the summary should be
    
    :return: dataframe
    """
    model = Summarizer()
    
    # work on a copy so the caller's dataframe is not modified
    dataframe = dataframe.copy()
    dataframe['summary'] = dataframe['content'].apply(lambda x: model(x, num_sentences=n_sentences))
    
    return dataframe

In [27]:
def create_timeline(dataframe, topic, filename, summarize=True, n=1):
    """
    Writes one headline per day (optionally with a summary) to a text file.

    :dataframe: Pandas DataFrame
    :topic: number (int)
    :filename: output txt file name
    :summarize: set to false if you don't want a 
                summarization of the content of an article
    :n: passed on to get_central_vector (int)
    
    :return: None
    """
    # use the dataframe argument rather than the global clean_df
    df1 = get_topic_df(dataframe, topic)
    df2 = get_central_vector(df1, n)
    if summarize:
        df3 = summerize_article(df2)
        with open(filename, 'w') as f:
            for article in tqdm(df3.itertuples()):
                f.write(f'Date: {article.Index} \nHeadline: {article.headline} \nSummary: {article.summary} \n\n')
    else:
        with open(filename, 'w') as f:
            for article in tqdm(df2.itertuples()):
                f.write(f'Date: {article.Index} \nHeadline: {article.headline} \n\n')
    

## Example: Brexit

In [26]:
create_timeline(clean_df, 577, 'brexit.txt')

Downloading:   0%|          | 0.00/434 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

410it [00:00, 75104.36it/s]


## Example: 2020 US Elections 

In [35]:
create_timeline(clean_df, 725, 'elections.txt', summarize=False)

526it [00:00, 249740.08it/s]
