## Note: To view the visualizations, run the entire notebook. Alternatively, view them in Text Clustering.html.

### Import the necessary libraries

In [None]:
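# Run this cell if this is the first time you are using nltk
import nltk
nltk.download('punkt')      # sentence tokenizer models used by sent_tokenize
nltk.download('stopwords')  # stopword lists used by stopwords.words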

In [56]:
import re

import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

from transformers import AutoTokenizer

from nltk import ngrams
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

### Load the dataset

In [2]:
news_data = pd.read_csv("news_data.csv")
news_data.head()

Unnamed: 0,altid,title,content
0,sa1a70ab8ef5,Davenport hits out at Wimbledon,World number one Lindsay Davenport has critic...
1,ta497aea0e36,Camera phones are 'must-haves',Four times more mobiles with cameras in them ...
2,ta0f0fa26a93,US top of supercomputing charts,The US has pushed Japan off the top of the su...
3,ba23aaa4f4bb,Trial begins of Spain's top banker,"The trial of Emilio Botin, the chairman of Sp..."
4,baa126aeb946,Safety alert as GM recalls cars,The world's biggest carmaker General Motors (...


### Exploratory Data Analysis

In [3]:
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

In [7]:
# Tokenize the content
news_data["content_tokens"] = news_data.content.apply(tokenizer.tokenize)

Token indices sequence length is longer than the specified maximum sequence length for this model (716 > 512). Running this sequence through the model will result in indexing errors


In [12]:
# Calculate the number of tokens per article
news_data["num_tokens"] = news_data.content_tokens.apply(len)

In [33]:
# Visualize the distribution of the number of tokens in each article
fig = px.box(news_data, y="num_tokens",
            points="all",
            labels={"num_tokens": "Number of Tokens"},
            hover_data=["title"])
fig.show()

The median article has 465 tokens, which exceeds the maximum sequence length (128) of our chosen model. In fact, even the shortest article, "Dementieva prevails in Hong Kong", has 200 tokens and so exceeds the maximum sequence length of 128. By default, the model truncates each article to its maximum of 128 tokens during inference, which causes information loss, especially for longer articles.

To resolve this, we cluster the news at the sentence level instead of the article level. This way, the maximum sequence length (128) is no longer a practical constraint, since a sentence is very unlikely to be longer than 128 tokens.

In [21]:
# Split content by sentences
news_data["sentences"] = news_data.content.apply(sent_tokenize)

In [23]:
# Store each sentence in its own row
news_data_sentence = news_data[["title", "sentences"]].explode(column = "sentences", ignore_index = True)

In [25]:
# Tokenize the sentences
news_data_sentence["sentence_tokens"] = news_data_sentence.sentences.apply(tokenizer.tokenize)

In [27]:
# Calculate the number of tokens per sentence
news_data_sentence["num_tokens"] = news_data_sentence.sentence_tokens.apply(len)

In [35]:
# Verify that the number of tokens in each sentence is below the maximum sequence length of 128
# by visualizing the distribution of the number of tokens in each sentence
fig = px.box(news_data_sentence, y="num_tokens",
            points="all",
            labels={"num_tokens": "Number of Tokens in Sentence"},
            hover_data=["title", "sentences"])
fig.show()

With the exception of one outlier sentence with 131 tokens, every sentence has fewer than 128 tokens, so truncation during inference causes essentially no information loss. Even for the outlier, the loss is minimal at only 5 tokens: of the 128 positions, 2 are reserved for the CLS and SEP tokens, which leaves room for 126 content tokens, and 131 - 126 = 5. (Strictly, a sentence is only guaranteed to be untouched if it has at most 126 tokens.)
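The truncation arithmetic above can be sketched as a small helper. This is a minimal illustration and not part of the original notebook: `tokens_lost` is a hypothetical name, and its defaults assume the 128-token limit and the two special tokens (CLS and SEP) described above.

```python
def tokens_lost(num_tokens, max_seq_length=128, num_special_tokens=2):
    """Number of content tokens dropped when a sequence is truncated."""
    # Positions left for content once CLS and SEP are accounted for
    budget = max_seq_length - num_special_tokens
    return max(0, num_tokens - budget)

# Token counts quoted in this notebook
print(tokens_lost(131))  # outlier sentence -> 5
print(tokens_lost(200))  # shortest article -> 74
print(tokens_lost(465))  # median article -> 339
```

Under the same assumptions, something like `news_data_sentence.num_tokens.apply(tokens_lost)` would give the loss for every sentence.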

In [89]:
# Visualize the top 50 words in the news

# Get stopwords and strip punctuation from them (the sentences are stripped of punctuation too)
stop_words = [re.sub(r"[^a-z]", "", stopword) for stopword in stopwords.words("english")] + ["ha"]
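Stripping punctuation from the stopwords appears to matter because the sentence processing further down removes the same characters from the text, so a contraction such as "don't" reaches the stopword filter as "dont". The sketch below illustrates this on a small hand-picked list; `sample_stopwords` and `strip_non_letters` are hypothetical stand-ins, not names from the notebook.

```python
import re

# Hypothetical stand-in for stopwords.words("english")
sample_stopwords = ["don't", "she's", "the"]

def strip_non_letters(word):
    """Remove every character that is not a lowercase letter."""
    return re.sub(r"[^a-z]", "", word)

cleaned_stopwords = [strip_non_letters(word) for word in sample_stopwords]

# The same cleaning applied to running text (spaces are kept as separators)
text = re.sub(r"[^a-z ]", "", "she's sure they don't like the rain")
kept = [token for token in text.split() if token not in cleaned_stopwords]

print(cleaned_stopwords)  # ['dont', 'shes', 'the']
print(kept)               # ['sure', 'they', 'like', 'rain']
```

Without the cleaning step, "dont" and "shes" would not match the raw entries "don't" and "she's" and would survive the filter.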

In [65]:
# Initialize Stemmer
stemmer = PorterStemmer()

In [112]:
def process_sentence(sentence, stop_words = None, stemmer = None):
    """
        Function to process sentence. Does the following:
        
        1) Convert to lowercase
        2) Remove non-alphabetic characters
        3) Remove stopwords (skipped if stop_words is None)
        4) Perform stemming (skipped if stemmer is None)
        5) Extract bigrams from the remaining tokens
        
        Returns a string of the tokens followed by the "_"-joined bigrams
    """
    # Convert to lowercase
    sentence = sentence.lower()
    
    # Remove non-alphabetic characters (spaces are kept as token separators)
    sentence = re.sub(r"[^a-z ]", "", sentence)
    
    # Remove stopwords
    stop_words = set(stop_words) if stop_words is not None else set()
    tokens = [token for token in sentence.split() if token not in stop_words]
    
    # Perform stemming
    if stemmer is not None:
        tokens = [stemmer.stem(word) for word in tokens]
    
    # Extract bigrams
    bigrams = ["_".join(tokens[i:i+2]) for i in range(len(tokens)-1)]
    
    # Return tokens + bigrams
    return " ".join(tokens + bigrams)

In [113]:
news_data_sentence["processed_sentences"] = news_data_sentence.sentences.apply(
    lambda sentence: process_sentence(sentence, stop_words, stemmer))

In [114]:
news_data_sentence["processed_sentences"][0]

'world number one lindsay davenport criticis wimbledon issu equal prize money women world_number number_one one_lindsay lindsay_davenport davenport_criticis criticis_wimbledon wimbledon_issu issu_equal equal_prize prize_money money_women'
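The output above is the list of stemmed tokens followed by their bigrams. As a rough illustration of the bigram step alone, the sketch below pairs each token with its successor using `zip`, which gives the same result as the slicing used in `process_sentence`. `make_bigrams` is a hypothetical helper name.

```python
def make_bigrams(tokens):
    """Join each pair of adjacent tokens with an underscore."""
    return ["_".join(pair) for pair in zip(tokens, tokens[1:])]

# First few stemmed tokens of the sentence shown above
tokens = ["world", "number", "one", "lindsay", "davenport"]
print(make_bigrams(tokens))
# ['world_number', 'number_one', 'one_lindsay', 'lindsay_davenport']
```

A sentence with n tokens yields n - 1 bigrams, so a sentence with zero or one token contributes none.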

In [115]:
# Calculate the raw frequency of each token
# TODO: use TF-IDF instead; with raw term frequency, many more stopwords must be removed for the results to make sense
freq = pd.Series(" ".join(news_data_sentence.processed_sentences).split()).value_counts().to_frame(name="freq_counts")

In [120]:
freq.head(10)

token,freq_counts
said,206
mr,128
year,84
would,76
peopl,73
also,68
world,50
new,46
say,44
one,42
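The table is dominated by generic words such as "said" and "mr", which is the limitation the comment about TF-IDF points at. A minimal sketch of that alternative follows, treating each processed sentence as one document and using the plain, unsmoothed idf `log(N / df)`. `tfidf_scores` and the toy sentences are hypothetical; a library implementation such as scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers would differ.

```python
import math
from collections import Counter

def tfidf_scores(documents):
    """Sum the tf-idf weight of each token over all documents."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each token
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        for token, count in Counter(tokens).items():
            tf = count / len(tokens)                   # relative term frequency
            idf = math.log(n_docs / doc_freq[token])   # 0 if token is in every document
            scores[token] += tf * idf
    return scores

# Toy stand-ins for processed sentences (assumed, not taken from the dataset)
toy_sentences = ["said mr blair", "said minist tax", "said wimbledon prize"]
scores = tfidf_scores(toy_sentences)
print(scores["said"])       # appears in every sentence, so its weight is 0
print(scores["wimbledon"])  # appears in one sentence, so its weight is positive
```

In this weighting, a word that appears in every sentence scores zero regardless of how often it occurs, which is why it needs less manual stopword cleaning than raw counts.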
