# Import Libraries and Data

In [11]:
import pandas as pd

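import nltk
nltk.download("stopwords", quiet=True)
# word_tokenize (used in clean_text below) needs the Punkt tokenizer models;
# recent NLTK releases look for "punkt_tab" instead of "punkt":
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)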
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.metrics import jaccard_distance

The data was taken from Kaggle: https://www.kaggle.com/datasets/rmisra/news-category-dataset?resource=download

In [2]:
df = pd.read_json("News_Category_Dataset_v3.json", lines=True)

In [3]:
df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209527 entries, 0 to 209526
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   link               209527 non-null  object        
 1   headline           209527 non-null  object        
 2   category           209527 non-null  object        
 3   short_description  209527 non-null  object        
 4   authors            209527 non-null  object        
 5   date               209527 non-null  datetime64[ns]
dtypes: datetime64[ns](1), object(5)
memory usage: 9.6+ MB


# Concept

We want to test whether patient appointment notes can be linked as belonging to the same patient episode, meaning they address the same problem.

As a proof of concept, we will attempt to group rows of the news headline dataset by topic using the `short_description` column. We use descriptions rather than headlines because headlines are often not written as full sentences and rely on jokes or puns to grab the audience's attention.

Some examples of descriptions:

In [5]:
for i in range(5):
    print(df["short_description"][i], "\n")

Health experts said it is too early to predict whether demand would match up with the 171 million doses of the new boosters the U.S. ordered for the fall. 

He was subdued by passengers and crew when he fled to the back of the aircraft after the confrontation, according to the U.S. attorney's office in Los Angeles. 

"Until you have a dog you don't understand what could be eaten." 

"Accidentally put grown-up toothpaste on my toddler’s toothbrush and he screamed like I was cleaning his teeth with a Carolina Reaper dipped in Tabasco sauce." 

Amy Cooper accused investment firm Franklin Templeton of unfairly firing her and branding her a racist after video of the Central Park encounter went viral. 



The process we will use to compare descriptions is:

1. Clean each description so that we have comparable sets of words.
2. Compare each description to the next six descriptions and calculate the Jaccard distance between them. If the distance falls below a chosen threshold (that is, the similarity is high enough), the two descriptions are considered to be on the same topic and are assigned the same group ID.
3. Add a column to the data containing the group ID for each row, so we can see the raw descriptions that were placed in the same group.

We look at the next six descriptions because, going back to the patient appointment notes example, a patient episode is unlikely to be interrupted by many appointments about other issues. This number can of course be adjusted.
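
To make the Jaccard comparison in step 2 concrete, here is a minimal worked sketch on two hand-made word sets. The helper `jaccard_dist` and the sample sets are illustrative assumptions rather than part of the notebook; the helper is meant to mirror the formula behind `nltk.metrics.jaccard_distance`, i.e. 1 - |A ∩ B| / |A ∪ B|.

```python
def jaccard_dist(a: set, b: set) -> float:
    """Jaccard distance: 1 - |a & b| / |a | b| (hypothetical helper)."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return (len(union) - len(a & b)) / len(union)


# Hand-made, already-cleaned word sets (illustrative only):
storm_a = {"storm", "curfew", "rain", "puerto", "rico"}
storm_b = {"storm", "flood", "landslid", "puerto", "rico"}
other = {"dog", "eaten", "understand"}

print(jaccard_dist(storm_a, storm_b))  # 3 shared words out of 7 distinct words
print(jaccard_dist(storm_a, other))    # no shared words
```

With a cutoff of 0.9, as used in the grouping cell further down, the first pair (distance 4/7, roughly 0.57) would be grouped and the second (distance 1.0) would not.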

First we define a function for cleaning raw strings and returning a set of comparable words:

In [6]:
# Use a distinct name so the imported `stopwords` module is not overwritten
# (re-running the cell would otherwise fail), and a set for fast lookups:
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text):
    # Convert to lower case:
    text = text.lower()

    # Split text into individual words:
    words = nltk.word_tokenize(text)

    # Remove insignificant words (stopwords) and punctuation/numbers:
    filtered_words = [word for word in words if word not in stop_words and word.isalpha()]
    
    # Stem words (e.g. milling becomes mill):
    stemmed_words = [stemmer.stem(word) for word in filtered_words]

    # Return as a set to remove duplicates:
    return set(stemmed_words)

Next, we generate a dictionary called `groups` that maps each row index in the data to its group ID.

As discussed, each description is compared to the next six descriptions, and if the Jaccard distance between two descriptions is below the threshold (0.9 here) they are assigned the same group ID.

In [7]:
%%time
# For testing the concept, we only use a portion of the data
df_test = df.head(1000).copy()

# Initialise dictionary for groups (row index -> group ID) and the group ID counter:
groups = {}
current_group_id = 1

for index in df_test.index:
    # Skip row if it has already been grouped with a previous row:
    if index in groups:
        continue

    groups[index] = current_group_id
    current_text = clean_text(df_test.loc[index, "short_description"])

    # Only compare if the description has valid words and the full look-ahead window fits.
    # .loc slices are inclusive, so index+1:index+6 covers the next six rows:
    if current_text and index + 6 <= df_test.index.max():
        for i, row in df_test.loc[index + 1:index + 6].iterrows():
            if i not in groups and jaccard_distance(current_text, clean_text(row["short_description"])) < 0.9:
                groups[i] = current_group_id

    # Always advance the counter, so a row without valid words keeps a group ID of its own:
    current_group_id += 1

CPU times: total: 5.41 s
Wall time: 5.44 s
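
The windowed grouping logic can also be sketched without pandas, which makes it easier to check on toy data. The function name `group_by_window`, its `window` and `threshold` parameters, the `jaccard_dist` helper and the sample sets are all our own illustrative names; the defaults simply echo the values used in the cell above. Unlike that cell, this sketch shortens the window at the end of the data instead of skipping the comparison.

```python
def jaccard_dist(a: set, b: set) -> float:
    """Jaccard distance between two sets (hypothetical helper)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def group_by_window(token_sets, window=6, threshold=0.9):
    """Map each position in token_sets to a group ID.

    Each ungrouped item seeds a new group and absorbs any ungrouped item
    among the next `window` items whose distance to the seed is below
    `threshold`.
    """
    groups = {}
    next_id = 1
    for i, seed in enumerate(token_sets):
        if i in groups:
            continue
        groups[i] = next_id
        if seed:  # a seed with no words is not compared to anything
            last = min(i + 1 + window, len(token_sets))
            for j in range(i + 1, last):
                if j not in groups and jaccard_dist(seed, token_sets[j]) < threshold:
                    groups[j] = next_id
        next_id += 1
    return groups


# Toy, already-cleaned word sets (illustrative only):
toy = [
    {"storm", "curfew", "rain", "puerto", "rico"},
    {"dog", "eaten"},
    {"storm", "flood", "puerto", "rico"},
    set(),
    {"dog", "toothpast", "toddler"},
]

print(group_by_window(toy, window=2))
print(group_by_window(toy, window=3))
```

With `window=2` the two dog-related sets end up in separate groups because the second lies outside the first one's look-ahead range; widening to `window=3` links them, which shows how sensitive the result is to this parameter.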


Then, map the values from `groups` to the dataframe based on row indexes (keys):

In [8]:
df_test["group_id"] = df_test.index.map(groups)

Here are all rows that belong to a group with more than one member:

In [9]:
df_test.groupby("group_id").filter(lambda group: group.shape[0] > 1)

| | link | headline | category | short_description | authors | date | group_id |
|---|---|---|---|---|---|---|---|
| 19 | https://www.huffpost.com/entry/hurricane-fiona... | Fiona Barrels Toward Turks And Caicos Islands ... | WORLD NEWS | The Turks and Caicos Islands government impose... | Dánica Coto, AP | 2022-09-20 | 20 |
| 22 | https://www.huffpost.com/entry/hurricane-fiona... | Hurricane Fiona Bears Down On Dominican Republ... | WORLD NEWS | The storm knocked out the power grid and unlea... | Danica Coto, AP | 2022-09-19 | 20 |
| 27 | https://www.huffpost.com/entry/queen-elizabeth... | World Leaders Pay Respects To Queen Elizabeth II | WORLD NEWS | President Joe Biden and first lady Jill Biden ... | Mike Corder, Jill Lawless and Danica Kirka, AP | 2022-09-18 | 27 |
| 30 | https://www.huffpost.com/entry/europe-britain-... | Biden Says Queen's Death Left 'Giant Hole' For... | POLITICS | U.S. President Joe Biden, in London for the fu... | Darlene Superville, AP | 2022-09-18 | 27 |
| 105 | https://www.huffpost.com/entry/bc-us-texas-sch... | Uvalde Fourth Graders Waited An Hour With Woun... | POLITICS | When Elsa Avila looks at the scar that runs do... | ACACIA CORONADO, AP | 2022-09-05 | 104 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 938 | https://www.huffpost.com/entry/amy-schumers-jo... | Amy Schumer’s Joke About Kirsten Dunst Being A... | ENTERTAINMENT | Chris Rock wasn’t the only comedian at the Aca... | Elyse Wanshel | 2022-03-28 | 900 |
| 949 | https://www.huffpost.com/entry/biden-poland-pu... | Biden On Putin: 'For God's Sake, This Man Cann... | WORLD NEWS | President Joe Biden visited Poland's capital o... | Chris Megerian and Vanessa Gera, AP | 2022-03-26 | 915 |
| 955 | https://www.huffpost.com/entry/biden-poland-uk... | 60 Miles From Ukraine, Biden Sees Refugee Cris... | POLITICS | President Joe Biden paid tribute to Poland for... | CHRIS MEGERIAN and DARLENE SUPERVILLE, AP | 2022-03-25 | 915 |
| 986 | https://www.huffpost.com/entry/russia-ukraine-... | Russian Military Slog In Ukraine A ‘Dreadful M... | WORLD NEWS | More than three weeks into his invasion of Ukr... | Ellen Knickmeyer, AP | 2022-03-20 | 951 |


And an example of descriptions from a group:

In [10]:
for text in df_test.loc[(df_test["group_id"] == 20), "short_description"]:
    print(text, "\n")

The Turks and Caicos Islands government imposed a curfew as the intensifying storm kept dropping copious rain over the Dominican Republic and Puerto Rico. 

The storm knocked out the power grid and unleashed floods and landslides in Puerto Rico, where the governor said the damage was “catastrophic.” 



For this group the process works: both descriptions cover the same storm, Hurricane Fiona (named in the headlines of rows 19 and 22), even though neither description names it.

This has been a proof of concept that a similar process could be applied to doctors' notes on patient appointments to identify patient episodes in data. Many of the parameters and decisions here were chosen arbitrarily to illustrate the process and could be refined considerably (e.g. the Jaccard distance cutoff value, or the cleaning process, which could focus on more relevant vocabulary).
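
As one possible starting point for refining the cutoff, the sketch below sweeps a few candidate thresholds over hand-made word sets and reports which pairs would be linked. The helper names, the sample sets and the candidate cutoffs are all illustrative assumptions; real tuning would need labelled examples of notes known to belong to the same episode.

```python
from itertools import combinations


def jaccard_dist(a: set, b: set) -> float:
    """Jaccard distance between two sets (hypothetical helper)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0


def pairs_below(token_sets, threshold):
    """Return the index pairs whose Jaccard distance is below `threshold`."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(token_sets), 2)
        if jaccard_dist(a, b) < threshold
    ]


# Toy, already-cleaned word sets (illustrative only):
samples = [
    {"storm", "curfew", "rain", "puerto", "rico"},
    {"storm", "flood", "landslid", "puerto", "rico"},
    {"presid", "biden", "visit", "storm"},
]

for cutoff in (0.5, 0.7, 0.9):
    print(cutoff, pairs_below(samples, cutoff))
```

In this toy case a cutoff of 0.9 links the third set to both storm descriptions on the strength of a single shared word, while 0.7 links only the two storm descriptions and 0.5 links nothing.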