<a href="https://colab.research.google.com/github/BI-DS/ELE-3909/blob/master/lecture7/clusteting_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from nltk.tokenize import word_tokenize
import numpy as np
import re
import nltk
import os
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')
nltk.download('punkt')
from bokeh.plotting import figure, output_file, save, gridplot, show
from bokeh.transform import jitter
from bokeh.models import HoverTool
from bokeh.palettes import Category20_10 as Palette
from bokeh.models import ColumnDataSource,OpenURL, TapTool
from bokeh.transform import factor_cmap
import pandas as pd
import bokeh.io
bokeh.io.output_notebook()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


# Clustering news articles 🔥🔥🔥

Get some articles from the course repository

In [2]:
!wget -P . https://raw.githubusercontent.com/BI-DS/ELE-3909/master/lecture6/news_articles.txt

--2023-10-09 20:03:32--  https://raw.githubusercontent.com/BI-DS/ELE-3909/master/lecture6/news_articles.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25337978 (24M) [text/plain]
Saving to: ‘./news_articles.txt’


2023-10-09 20:03:32 (158 MB/s) - ‘./news_articles.txt’ saved [25337978/25337978]



Define some functions to clean the data

In [3]:
def clean_text(text):
    # Non-string input such as NaN (a float) yields an empty string
    if isinstance(text, float):
        return ""
    stopwords = set(nltk.corpus.stopwords.words("english"))
    # Map accented characters to their ASCII equivalents
    normalMap = {'í':'i', 'ó':'o', 'á':'a', 'é':'e', 'ú':'u', 'ñ':'n'}
    normalize = str.maketrans(normalMap)
    temp = text.lower()
    temp = temp.translate(normalize)

    temp = re.sub(r"'", "", temp)  # drop apostrophes so contractions stay one token (don't -> dont)
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # remove @mentions
    temp = re.sub(r"http\S+", "", temp)  # remove URLs
    temp = re.sub(r"[()!?]", " ", temp)
    temp = re.sub(r"\[.*?\]", " ", temp)  # remove bracketed text
    temp = re.sub(r"[^a-z0-9]", " ", temp)  # keep only lowercase letters and digits
    temp = temp.split()
    temp = [w for w in temp if w not in stopwords]
    return " ".join(temp)
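
To see what these cleaning steps do to a single string, here is a minimal sketch. It assumes a tiny hand-written stopword set (`DEMO_STOPWORDS`, hypothetical) instead of NLTK's English list, and `clean_text_demo` is an illustrative name, not part of the notebook.

```python
import re

# Hypothetical, tiny stopword set for illustration; the notebook uses NLTK's English list
DEMO_STOPWORDS = {"the", "a", "is", "at", "and", "of"}

def clean_text_demo(text):
    temp = text.lower()
    temp = re.sub(r"'", "", temp)               # drop apostrophes
    temp = re.sub(r"@[A-Za-z0-9_]+", "", temp)  # remove @mentions
    temp = re.sub(r"http\S+", "", temp)         # remove URLs
    temp = re.sub(r"\[.*?\]", " ", temp)        # remove bracketed text
    temp = re.sub(r"[^a-z0-9]", " ", temp)      # keep lowercase letters and digits
    return " ".join(w for w in temp.split() if w not in DEMO_STOPWORDS)

sample = "The market's rally [Reuters] is HERE! See http://example.com @trader"
cleaned = clean_text_demo(sample)
print(cleaned)  # markets rally here see
```

Note that the apostrophe is removed before the stopword filter runs, so a contraction such as "don't" becomes "dont" and is only filtered out if that exact form is in the stopword list.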

The code below is almost the same as the one we saw in the last lecture. This version, however, also keeps track of each article's line number in the original file.

In [4]:
def text_to_corpus():
    with open("./news_articles.txt", "r") as infile:
        lines = infile.readlines()
    print('total no of lines {}'.format(len(lines)))

    lengths = []
    corpus = []
    orig_idx = []
    counter = 0

    if os.path.exists('./clean_text.txt'):
        print('deleting old file...')
        os.remove('./clean_text.txt')

    with open("./clean_text.txt", "w") as x:
        for i, text in enumerate(lines):
            clean_content = clean_text(text)
            tokens = word_tokenize(clean_content)
            length = len(tokens)

            # Keep only short articles (at most 100 tokens)
            if length <= 100:
                orig_idx.append(i+1)  # 1-based line number in the original file
                counter += 1
                x.write(" ".join(tokens)+"\n")
                corpus.append(" ".join(tokens))
                lengths.append(length)

    print('{} news with length smaller than {}'.format(counter, np.max(lengths)))
    print('done!')
    # No explicit x.close() is needed: the with block has already closed the file

    return corpus, orig_idx

Get the corpus and the original index of each article with at most 100 tokens

In [5]:
corpus, orig_idx = text_to_corpus()

total no of lines 4551
127 news with length smaller than 100
done!
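
The filtering logic can be tried on a toy list without the news file. This is a sketch under stated assumptions: `filter_short_docs` is a hypothetical helper, and a plain whitespace split stands in for `word_tokenize`.

```python
def filter_short_docs(docs, max_tokens):
    """Return the kept documents and their 1-based positions in the input."""
    kept, positions = [], []
    for i, doc in enumerate(docs):
        tokens = doc.split()  # whitespace split stands in for word_tokenize
        if len(tokens) <= max_tokens:
            kept.append(" ".join(tokens))
            positions.append(i + 1)  # same 1-based convention as orig_idx
    return kept, positions

docs = [
    "stocks rise",
    "a much longer article about markets and rates",
    "oil falls sharply",
]
kept, positions = filter_short_docs(docs, max_tokens=3)
print(kept)       # ['stocks rise', 'oil falls sharply']
print(positions)  # [1, 3]
```

Keeping the positions alongside the filtered texts is what later lets the plot show which line of the original file each point came from.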


Cherry-pick some articles

In [6]:
c1 = [94,64,44,93,53,10,70,107,54,45,35]
c2 = [66,95,50,87,60,65,76,13,40,43,36]
c3 = [51,81,79,99,83,124,4,61,25,91,115]
c_idx = [c1,c2,c3]
clusters = []
for i in range(len(corpus)):
  if i in c1:
    clusters.append(0)
  elif i in c2:
    clusters.append(1)
  elif i in c3:
    clusters.append(2)
  else:
    clusters.append(-1)
clusters = np.array(clusters)
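
The same label array can also be built with NumPy indexing instead of a loop over every article. The sketch below assumes 127 short articles, as reported in the output above; `labels` and `n_docs` are illustrative names.

```python
import numpy as np

c1 = [94, 64, 44, 93, 53, 10, 70, 107, 54, 45, 35]
c2 = [66, 95, 50, 87, 60, 65, 76, 13, 40, 43, 36]
c3 = [51, 81, 79, 99, 83, 124, 4, 61, 25, 91, 115]
n_docs = 127  # assumed number of short articles

# Start with -1 (not selected) everywhere, then overwrite the hand-picked positions
labels = np.full(n_docs, -1)
for label, idx in enumerate([c1, c2, c3]):
    labels[idx] = label

# Count how many articles carry each label (-1, 0, 1, 2)
print(np.bincount(labels + 1))
```

The lists hold positions in `corpus` (0-based), not the line numbers stored in `orig_idx`.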

Now we will plot our results using the `bokeh` library. If you don't know it, take a look at it!

First create a DataFrame to store the corpus and the indexes (this makes things easier when we use `bokeh`).

In [7]:
df_corpus = pd.DataFrame(np.c_[corpus,orig_idx])
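
Because `np.c_` stacks both lists into a single array, the indexes typically end up stored as strings next to the text. If numeric indexes and named columns are preferred, one possible alternative is to build the DataFrame from a dictionary. The data and the names `corpus_demo`, `orig_idx_demo` and `df_demo` below are hypothetical.

```python
import pandas as pd

corpus_demo = ["stocks rise", "oil falls sharply"]  # hypothetical cleaned articles
orig_idx_demo = [3, 17]                             # hypothetical original line numbers

# One column per list, so each column keeps its own dtype
df_demo = pd.DataFrame({"news": corpus_demo, "orig_idx": orig_idx_demo})
print(df_demo.dtypes)
```

With this layout the plotting code would select columns by name (`df_demo["news"]`) rather than by position (`df_corpus[0]`).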

Now define some variables to plot and save our scatter plot

In [8]:
output_file("./plot_and_news.html")
p = figure(title="Visualizing News")
p.title.text_font_size = '15pt'
p.title.align = 'center'
p.background_fill_color = "gray"
p.background_fill_alpha = 0.35

Let's use t-SNE to get 2D data 🔥

In [9]:
max_features = 50
vectorizer  = TfidfVectorizer(max_features = max_features)
tf_idf = vectorizer.fit_transform(corpus).toarray()
transformer = TSNE(n_components=2,learning_rate='auto',init='random',n_jobs=-1,random_state=1234)
representations = transformer.fit_transform(tf_idf)

Finally, an interactive scatter plot © ✨

In [10]:
colors = ['blue','red','yellow']
for l in range(3):
  x = representations[clusters==l,0]
  y = representations[clusters==l,1]
  cluster = list(np.repeat(l,x.shape[0]))
  news=df_corpus.iloc[clusters==l][0].values
  index =df_corpus.iloc[clusters==l][1].values
  source = ColumnDataSource(data=dict(x=x,y=y,index=index,news=list(news),cluster=cluster))
  s = p.circle(x='x',y='y', size=6, line_color='black', source=source, fill_color=colors[l])
  p.add_tools(HoverTool(renderers=[s],tooltips=[("index", "@index"),("news","@news"),('cluster',"@cluster")]))

In [None]:
#save(p)
bokeh.io.show(p)