## Note: You have to run this notebook to view the visualizations. Alternatively, you may view the visualizations in Text Clustering with Top2Vec.html.

### Import the necessary libraries

In [4]:
import umap

import pandas as pd

import plotly.express as px
import plotly.graph_objects as go

from transformers import AutoTokenizer

from nltk.tokenize import sent_tokenize

from utils import *

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\addison\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Load dataset

In [5]:
news_data = pd.read_csv("news_data.csv")

# Verify that we loaded the correct dataset
news_data.head()

| Unnamed: 0 | altid | title | content |
|---|---|---|---|
| 0 | sa1a70ab8ef5 | Davenport hits out at Wimbledon | World number one Lindsay Davenport has critic... |
| 1 | ta497aea0e36 | Camera phones are 'must-haves' | Four times more mobiles with cameras in them ... |
| 2 | ta0f0fa26a93 | US top of supercomputing charts | The US has pushed Japan off the top of the su... |
| 3 | ba23aaa4f4bb | Trial begins of Spain's top banker | The trial of Emilio Botin, the chairman of Sp... |
| 4 | baa126aeb946 | Safety alert as GM recalls cars | The world's biggest carmaker General Motors (... |


### Exploratory Data Analysis

To cluster the news, we first have to convert its text content into a numerical representation, i.e. text embeddings. There are several approaches we can take to create text embeddings (e.g. term frequency, TF-IDF, Word2Vec, GloVe, Doc2Vec, Universal Sentence Encoder, the BERT family of models, etc.).
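To make "numerical representation" concrete, here is a minimal sketch of the simplest approach in the list above, term frequency, using only the standard library. The function name `term_frequency_vectors` and the three toy documents are illustrative assumptions, not part of this notebook or of `utils`.

```python
from collections import Counter


def term_frequency_vectors(docs):
    """Return (vocabulary, vectors); each vector holds raw term counts.

    Hypothetical helper for illustration: tokens are lowercased and split
    on whitespace, which is far cruder than a real tokenizer.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # Fixed, sorted vocabulary so every document maps to the same dimensions
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Toy documents (made up for this sketch, not taken from news_data.csv)
docs = ["GM recalls cars", "cars and phones", "camera phones"]
vocabulary, vectors = term_frequency_vectors(docs)
print(vocabulary)
print(vectors)
```

Each document becomes a vector with one dimension per vocabulary term. Such vectors ignore word order and context, which is the gap that contextual embedding models aim to close.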

For this exercise, we will use a pre-trained model from the BERT family to create the text embeddings for the news. We take this approach because these models capture the semantic and contextual meaning of the text, including word order. In particular, we will use the pre-trained all-MiniLM-L6-v2 model from the SentenceTransformers library. The choice of model is arbitrary; any model from [here](https://www.sbert.net/docs/pretrained_models.html) should also deliver decent results so long as it was trained on English text.

BERT-family models, however, have a maximum sequence length of 512 tokens due to computational and memory constraints. Our chosen model was trained on data with a maximum sequence length of 128 tokens, so it performs best when fed inputs of at most 128 tokens.

Let's check the news dataset to see whether any of the news content exceeds the 128-token limit. If it does, we might need to perform additional processing steps to stay within it.
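One way such a check could look is sketched below. The helpers `count_tokens` and `share_over_limit` are hypothetical names introduced here, and `str.split` is only a stand-in tokenizer so that the sketch runs without downloading a model. In the notebook one would presumably pass the model's own tokenizer instead, for example the `tokenize` method of `AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")` (the model identifier is an assumption about how the model is published).

```python
def count_tokens(texts, tokenize):
    """Return the number of tokens per text, given any tokenize callable."""
    return [len(tokenize(text)) for text in texts]


def share_over_limit(token_counts, limit=128):
    """Return the fraction of texts whose token count exceeds the limit."""
    if not token_counts:
        return 0.0
    return sum(count > limit for count in token_counts) / len(token_counts)


# Stand-in data and tokenizer (assumptions for illustration only).
# A subword tokenizer usually yields more tokens than whitespace splitting,
# so real counts would tend to be higher than these.
sample_texts = ["short headline", "word " * 200]
token_counts = count_tokens(sample_texts, str.split)
over = share_over_limit(token_counts, limit=128)

print(token_counts)
print(f"{over:.0%} of texts exceed the limit")
```

Applied to this dataset, `sample_texts` would be replaced by `news_data["content"]`. Keeping the tokenizer as a parameter makes it easy to swap in whichever model tokenizer is eventually used.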