<a href="https://colab.research.google.com/github/tharindudr/CS3MIR/blob/main/CS3MIR_Lab_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**CS3MIR Lab 5. Document Clustering**

In this lab, we will learn how to cluster a set of documents using Python. Our motivating example is to identify the latent structure within the synopses of the top 100 films of all time. We will perform the following steps:
*   tokenizing and stemming each synopsis
*   transforming the corpus into vector space using tf-idf
*   calculating the cosine distance between each pair of documents as a measure of similarity
*   clustering the documents using the k-means algorithm
*   using multidimensional scaling to reduce dimensionality within the corpus
*   plotting the clustering output using matplotlib


In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import os
import codecs
from sklearn import feature_extraction
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
import joblib

import matplotlib.pyplot as plt
import matplotlib as mpl
import requests

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Let's download the dataset

In [2]:
!wget https://github.com/TharinduDR/CS3MIR/raw/main/data/Lab5.zip
!unzip Lab5.zip -d docs

--2023-02-22 20:02:13--  https://github.com/TharinduDR/CS3MIR/raw/main/data/Lab5.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/TharinduDR/CS3MIR/main/data/Lab5.zip [following]
--2023-02-22 20:02:13--  https://raw.githubusercontent.com/TharinduDR/CS3MIR/main/data/Lab5.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 565526 (552K) [application/zip]
Saving to: ‘Lab5.zip.2’


2023-02-22 20:02:13 (12.2 MB/s) - ‘Lab5.zip.2’ saved [565526/565526]

Archive:  Lab5.zip
replace docs/Lab5/genres_list.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: docs/Lab5/genres_list.txt  
  inflating: 

We have three primary lists:

1. 'titles': the titles of the films in their rank order
2. 'wiki synopses': the synopses of the films from wiki matched to the 'titles' order
3. 'imdb synopses': the synopses of the films from imdb matched to the 'titles' order

The file-reading code is simple and similar to previous workshops:

In [3]:
#import three lists: titles, wikipedia synopses and imdb synopses
#by reading the data from files
#ensure that we are reading only the first 100 records.
titles = open('docs/Lab5/title_list.txt').read().split('\n')
#ensures that only the first 100 are read in
titles = titles[:100]

synopses_wiki = open('docs/Lab5/synopses_list_wiki.txt').read().split('\n BREAKS HERE')
synopses_wiki = synopses_wiki[:100]

synopses_clean_wiki = []
for text in synopses_wiki:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_wiki.append(text)

synopses_wiki = synopses_clean_wiki

synopses_imdb = open('docs/Lab5/synopses_list_imdb.txt').read().split('\n BREAKS HERE')
synopses_imdb = synopses_imdb[:100]

synopses_clean_imdb = []

for text in synopses_imdb:
    text = BeautifulSoup(text, 'html.parser').getText()
    #strips html formatting and converts to unicode
    synopses_clean_imdb.append(text)

synopses_imdb = synopses_clean_imdb

print(str(len(titles)) + ' titles')
print(str(len(synopses_wiki)) + ' wiki synopses')
print(str(len(synopses_imdb)) + ' imdb synopses')

100 titles
100 wiki synopses
100 imdb synopses


After reading the data from files, we clean the synopses (both wiki and imdb) using BeautifulSoup. BeautifulSoup is a Python library for pulling data out of HTML and XML files, and it is often used for web scraping. It builds a parse tree from the page source, which can then be used to extract data in a hierarchical and more readable manner.
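
To make the idea of a parse tree concrete, here is a minimal sketch on a small HTML fragment. The fragment is made up for illustration and is not taken from the lab data; the same `get_text()` call (spelled `getText()` in the cell above) is what strips the markup from the synopses.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for illustration (not from the lab data).
html = "<p>Michael <b>Corleone</b> returns home.<br/>He meets his <i>father</i>.</p>"

# Parse the markup into a tree using Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# get_text() walks the parse tree and concatenates all text nodes,
# so the tags disappear and only the visible text remains.
text = soup.get_text()
print(text)

# The tree can also be queried by tag name, e.g. every <b> element.
bold_words = [tag.get_text() for tag in soup.find_all("b")]
print(bold_words)
```

Note that `get_text()` does not insert whitespace where a tag such as `<br/>` was, so neighbouring sentences can end up joined together.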

Complete the following task: fetch the Aston University wiki page (https://en.wikipedia.org/wiki/Aston_University), extract the page content using BeautifulSoup, and print the text. You can fetch the page using the following command:
page = requests.get(YOURLINK). Store the page text in a variable named astonText.
  

Now, let's combine the wiki and imdb synopses of each film and generate a rank for each film.

In [4]:
synopses = []

for i in range(len(synopses_wiki)):
    item = synopses_wiki[i] + synopses_imdb[i]
    synopses.append(item)

# generates index for each item in the corpora (in this case it's just rank) and I'll use this for scoring later
ranks = []
for i in range(0,len(titles)):
    ranks.append(i)

We need to remove the stopwords from the synopses and stem the remaining words, so we will reuse the code from the last workshop.

In [5]:
# load nltk's English stopwords as variable called 'stopwords'
stopwords = nltk.corpus.stopwords.words('english')

# load nltk's SnowballStemmer as a variable called 'stemmer'
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
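
To see what stopword removal and stemming do to a single sentence, here is a small sketch. It makes two simplifying assumptions so that it runs without any nltk data downloads: the stop list is a tiny hand-picked set (the lab uses nltk's full English list), and a regular expression stands in for `nltk.word_tokenize`. The sentence itself is made up.

```python
import re
from nltk.stem.snowball import SnowballStemmer

# Tiny hand-picked stop list for illustration only;
# the lab uses nltk's full English stopword list.
demo_stopwords = {"the", "are", "in", "a", "of"}
demo_stemmer = SnowballStemmer("english")

sentence = "The soldiers are running in the battles"

# Simple regex tokenizer as a stand-in for nltk.word_tokenize
# (an assumption made here to avoid needing the punkt data).
tokens = re.findall(r"[a-zA-Z]+", sentence.lower())

# Drop the stopwords, then reduce each remaining word to its stem.
kept = [t for t in tokens if t not in demo_stopwords]
stems = [demo_stemmer.stem(t) for t in kept]

print(kept)
print(stems)
```

Stems are not always dictionary words. That is why the next cell keeps a data frame that maps each stem back to a full token for display.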

Let's call these functions to process the synopses and store the processed (filtered) tokens in a data frame.

In [6]:
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in synopses:
  allwords_stemmed = tokenize_and_stem(i)
  totalvocab_stemmed.extend(allwords_stemmed)
    
  allwords_tokenized = tokenize_only(i)
  totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)

Complete the following task: remove the stopwords from astonText and stem the remaining words (this variable contains the Aston wiki page text).

**Tf-idf and document similarity** 
We will define the parameters of a term frequency-inverse document frequency (tf-idf) vectorizer and then convert the synopses list into a tf-idf matrix. Please refer to Lab 02 if you are unfamiliar with tf-idf.

In [7]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)

print(tfidf_matrix.shape)

terms = tfidf_vectorizer.get_feature_names_out()

dist = 1 - cosine_similarity(tfidf_matrix)



(100, 563)
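
The same two steps can be tried on a toy scale. The sketch below uses three made-up mini synopses (not from the lab data) and the default tokenizer rather than `tokenize_and_stem`, so treat it as an illustration of the idea rather than a reproduction of the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up mini "synopses" for illustration (not from the lab data).
docs = [
    "soldiers fight a battle in the war",
    "the army orders soldiers into battle",
    "a family moves to a new apartment in new york",
]

# One row per document, one column per term that survives stopword removal.
demo_vectorizer = TfidfVectorizer(stop_words="english")
demo_matrix = demo_vectorizer.fit_transform(docs)

# Cosine similarity is 1.0 for identical directions and 0.0 for no shared terms,
# so 1 - similarity behaves like a distance.
demo_dist = 1 - cosine_similarity(demo_matrix)

print(demo_matrix.shape)
print(demo_dist.round(2))
```

The two war synopses share terms and therefore end up closer to each other than either is to the family synopsis, which shares no terms with them.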


K-means clustering: Using the tf-idf matrix, we can run a range of clustering algorithms to better understand the hidden structure within the synopses. We will use the k-means algorithm with 5 clusters. Each observation is assigned to the cluster with the nearest centroid, which minimizes the within-cluster sum of squares. Next, the mean of the observations in each cluster is calculated and used as the new cluster centroid. Then, observations are reassigned to clusters and centroids are recalculated in an iterative process until the algorithm converges.
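
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is only an illustration: `kmeans_sketch` is a hypothetical helper written for this explanation, the points are made up, and scikit-learn's `KMeans` adds a smarter initialisation and other refinements.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain assign/update iteration; illustrative only, not scikit-learn's implementation."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Made-up 2-D points forming two well-separated groups.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
demo_labels, demo_centroids = kmeans_sketch(points, k=2)
print(demo_labels)
print(demo_centroids)
```

Which group receives label 0 and which receives label 1 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.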

In [8]:


def cluster(num_clusters):
    km = KMeans(n_clusters=num_clusters)

    km.fit(tfidf_matrix)

    clusters = km.labels_.tolist()

    joblib.dump(km,'doc_cluster.pkl')
    km = joblib.load('doc_cluster.pkl')
    clusters = km.labels_.tolist()

    films = { 'title': titles, 'rank': ranks, 'synopsis': synopses, 'cluster': clusters }

    frame = pd.DataFrame(films, index = [clusters] , columns = ['rank', 'title', 'cluster'])
    return frame, km

num_clusters = 5
frame, km =  cluster(num_clusters)

Let's display the number of films per cluster.

In [9]:
frame['cluster'].value_counts()

2    42
3    18
4    17
0    12
1    11
Name: cluster, dtype: int64

Let's group by cluster for aggregation purposes and display the average rank per cluster (ranks run from 0 to 99 here, because the rank list generated earlier is zero-based).

In [10]:
grouped = frame['rank'].groupby(frame['cluster'])
grouped.mean()

cluster
0    47.583333
1    48.000000
2    55.095238
3    44.333333
4    43.470588
Name: rank, dtype: float64

Let's identify the top n (here we use 6) words per cluster that are nearest to the cluster centroid and display them. These words give a good sense of the main topic of each cluster.
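
The key trick in the next cell is `argsort()[:, ::-1]`, which orders the term indices of each centroid from the largest weight to the smallest. Here is a small sketch with a made-up centroid matrix (2 clusters, 4 stemmed terms); all the names and values are hypothetical.

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters x 4 terms (values made up).
demo_terms = np.array(["army", "famili", "love", "soldier"])
demo_centers = np.array([[0.60, 0.05, 0.10, 0.70],
                         [0.02, 0.80, 0.50, 0.01]])

# argsort sorts each row in ascending order, so reversing the columns
# puts the index of the largest weight first.
order = demo_centers.argsort()[:, ::-1]

# Keep the top_n term indices per cluster and look up the terms.
top_n = 2
top_terms = [demo_terms[row[:top_n]].tolist() for row in order]

for i, words in enumerate(top_terms):
    print("Cluster %d words: %s" % (i, ", ".join(words)))
```

In the lab cell, the looked-up terms are stems, which is why each one is mapped back to a full word through `vocab_frame` before printing.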

In [11]:
from __future__ import print_function

print("Top terms per cluster:")
print()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]

for i in range(num_clusters):
    print("Cluster %d words:" % i, end='')
    for ind in order_centroids[i, :6]:
        print(' %s' % vocab_frame.loc[terms[ind].split(' ')].values.tolist()[0][0].encode('utf-8', 'ignore'), end=',')
    print()
    print()
    print("Cluster %d titles:" % i, end='')
    for title in frame.loc[i]['title'].values.tolist():
        print(' %s,' % title, end='')
    print()
    print()


Top terms per cluster:

Cluster 0 words: b'george', b'brothers', b'family', b'father', b'york', b'new',

Cluster 0 titles: The Godfather, Raging Bull, The Godfather: Part II, It's a Wonderful Life, The Philadelphia Story, An American in Paris, The King's Speech, A Place in the Sun, Rain Man, Annie Hall, Tootsie, Yankee Doodle Dandy,

Cluster 1 words: b'soldiers', b'killed', b'army', b'battles', b'men', b'orders',

Cluster 1 titles: Forrest Gump, Gladiator, From Here to Eternity, Saving Private Ryan, Patton, Braveheart, Butch Cassidy and the Sundance Kid, Platoon, The Deer Hunter, All Quiet on the Western Front, Shane,

Cluster 2 words: b'car', b'police', b'father', b'apartments', b'friends', b'love',

Cluster 2 titles: The Shawshank Redemption, One Flew Over the Cuckoo's Nest, Citizen Kane, Psycho, Sunset Blvd., Vertigo, On the Waterfront, West Side Story, The Silence of the Lambs, Chinatown, Singin' in the Rain, Some Like It Hot, 12 Angry Men, Amadeus, Rocky, A Streetcar Named Desire,

Complete the following task: run k-means clustering with four clusters and print the top terms per cluster.