# SC207 Text Mining
## Vectorisation and TFIDF
### Turning words into numerical values

More advanced forms of text analysis require text documents to be converted into numerical values, or features. In this section we will examine:

* different methods for representing a collection of texts as numbers
* the decisions we need to make when generating a particular representation as well as the kinds of insights each numerical representation can give us.

## Tools
- [SciKit-Learn](https://scikit-learn.org/stable/index.html): A key library in Python data science and machine learning. Has a wide variety of accessible tools for complex data transformation, analysis and AI model building.
- [WordCloud](https://github.com/amueller/word_cloud) by amueller: A well established library for generating wordclouds from text data.

In [None]:
! pip install scikit-learn wordcloud

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import seaborn as sns

from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

In [None]:
# Some settings to make seaborn display better in Jupyter notebook
sns.set(rc={'figure.figsize':(8.2,5.8)})
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})

## Frequency Vectors

The most basic way of representing text numerically is to count the number of times each word appears within a document. Intuitively, two documents that share the same high-frequency words are likely to be more similar than two documents that share no words at all. scikit-learn's `CountVectorizer` allows us to easily transform a set of strings into a matrix of frequency values.

In [None]:
test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence, peas peas peas!',
               'This is my first sentence']

In [None]:
cv = CountVectorizer()

In [None]:
matrix = cv.fit_transform(test_corpus)
matrix

In [None]:
# Show the sparse matrix as a dense array
matrix.toarray()

The matrix shape is 4 rows and 10 columns. The rows represent the documents, and the columns represent the unique words in the entire corpus. We can see this more clearly...

In [None]:
# This shows us the ordering of the matrix columns, and which word each column represents.
cv.get_feature_names_out()

In [None]:
# If we wrap our matrix in a pandas dataframe and provide this list as the column names, everything lines up...

matrix_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
matrix_df


We can see that each row corresponds to a document, and each column to a unique word. The values are the frequency of that word in each document. For example, "peas" only occurs in the document at position 2, where it occurs 4 times. The word "sentence" occurs once in every document except the one at row 1. Note that the vectoriser lowercases every word by default.


In [None]:
# To see the most frequent words, we first sum each column (adding the values together across all documents), then sort in descending order.
matrix_df.sum().sort_values(ascending=False)


## A larger example

In [None]:
df = pd.read_csv('sample_news_large_with_tokens.csv')
df.info()

For this we'll use the titles of some different news stories to give us a more varied dataset.

In [None]:
sample =
corpus =
corpus

In [None]:
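cv = CountVectorizer()
matrix = cv.fit_transform(corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
matrix_df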

The first thing you'll notice is that there are a lot of 0's. Obviously not all words are used in every document but as every word requires a column, it can result in a very wide matrix of many columns.
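
To put a number on how many zeros there are, the sketch below measures the share of non-zero cells. It uses the small four-sentence test corpus rather than the news titles, so the figures are only illustrative; the variable names `docs` and `density` are introduced here for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is my first sentence',
        'This is the second',
        'I enjoy peas in my sentence, peas peas peas!',
        'This is my first sentence']

# fit_transform returns a SciPy sparse matrix, which only stores non-zero cells
matrix = CountVectorizer().fit_transform(docs)

n_rows, n_cols = matrix.shape
density = matrix.nnz / (n_rows * n_cols)  # share of cells that are non-zero

print(matrix.shape, f"{density:.1%} of cells are non-zero")
```

On a real corpus the density is usually far lower, which is why the vectoriser returns a sparse matrix and why calling `toarray()` on a very large matrix can use a lot of memory.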

You may also be thinking that some of the words are a bit useless, in that they don't tell us much about the document. By default the vectoriser does no filtering of words like we would in our pre-processing. There are ways to adjust this which we'll look at later.


In [None]:
# Top words
matrix_df.sum().sort_values(ascending=False).head(10)


The top words are frequent, but not necessarily informative. This is a common problem with frequency counts: a word occurring a lot doesn't necessarily mean it is important. Before we address this, one last aspect of the vectoriser we can experiment with is the `ngram_range` argument.

Whereas normally our vectoriser would ensure each token is a single item (word), n-grams allow the pairing of 2 or more adjacent items into phrases. Other approaches are more sophisticated, examining the entire corpus to determine whether two words together really are a phrase, but scikit-learn simply creates tokens for all adjacent word pairs.
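
As a rough illustration of what "all adjacent word pairs" means, here is a plain-Python sketch. The helper `word_ngrams` is hypothetical and uses a simple whitespace split, so it is a stand-in for the idea behind `ngram_range=(1, 2)` rather than scikit-learn's own tokeniser.

```python
def word_ngrams(text, n_min=1, n_max=2):
    """Return every run of n_min..n_max adjacent words in `text`."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words along the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

example = word_ngrams("This is my first sentence")
print(example)
```

Five words produce five unigrams plus four bigrams, which hints at how quickly the number of columns grows once pairs are included.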

In [None]:
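cv = CountVectorizer(ngram_range=(1,2))
matrix = cv.fit_transform(corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
matrix_df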

In [None]:

matrix_df.sum().sort_values(ascending=False).head(10)

Note that we do have some bi-grams (a pair of words together) as most frequent, but also that it has massively expanded the width of our matrix.

## Improving your Vectorisation

Some issues we've encountered.
- Highly frequent words aren't necessarily informative.
- Adding ngrams massively increases the size of our matrix because it creates a column for every word pair it finds.
- These problems only get worse with larger, full document (not just title) datasets.

### Solutions
- We can use some of the vectoriser's built in filtering features.
- We can use TFIDF to adjust our frequency scores to be more nuanced than simple counts.
- We can pre-process documents first to reduce the noise and variability like we did when we generated our own tokens.
- We can do all 3!

## Filtering Features

- `min_df`: Minimum document frequency. The number or proportion of documents a token must occur in to be included. Filters out very low frequency words, which also helps remove spelling mistakes. An integer is the minimum number of documents a feature must occur in to be kept. A float between 0.0 and 1.0 is the minimum proportion of documents.
    - `min_df=5` means any feature that occurs in fewer than 5 documents will be excluded.
    - `min_df=0.5` means any feature that occurs in less than 50% of the documents will be excluded.


In [None]:
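# Let's do the same again but this time add a minimum document frequency of 2, so anything that only occurs in one document is dropped.

cv = CountVectorizer(ngram_range=(1,2), min_df=2)
matrix = cv.fit_transform(corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
matrix_df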

That cut hundreds of noisy tokens out of our matrix! We'll experiment with some more filtering features later.

## TFIDF
Term Frequency Inverse Document Frequency (TFIDF) is an approach to measuring word frequency that can be thought of as giving higher scores to words of greater "significance".

TFIDF is not a simple word frequency. Instead, it assigns a word a score based on...

- The frequency of that word in a document
- How many other words are in that document
- How many documents are in the overall corpus
- How many of those documents that word appears in.

#### The formula for those interested
- TFIDF = term frequency * inverse document frequency
- term frequency = number of occurrences of a term within a single document, sometimes divided by the number of terms in the document.
- inverse document frequency = number of documents within the entire corpus / number of documents the term occurs in. In practice this ratio is usually log-scaled. scikit-learn's default is ln((1 + n) / (1 + df)) + 1, after which each document's row is scaled to unit length.
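
To make the formula concrete, the sketch below works it through by hand in plain Python. It assumes the variant scikit-learn uses by default (smoothed, log-scaled IDF followed by scaling each document to unit length) and approximates the vectoriser's default tokenisation with a regular expression. The helper `tfidf_row` and the other names are introduced here for illustration.

```python
import math
import re
from collections import Counter

docs = ['This is my first sentence',
        'This is the second',
        'I enjoy peas in my sentence, peas peas peas!',
        'This is my first sentence']

# Lowercase and keep tokens of two or more word characters
# (an approximation of CountVectorizer's default behaviour)
tokenised = [re.findall(r"\b\w\w+\b", d.lower()) for d in docs]
counts = [Counter(tokens) for tokens in tokenised]
vocab = sorted(set(t for tokens in tokenised for t in tokens))
n_docs = len(docs)

# Document frequency: how many documents contain each term
doc_freq = {term: sum(term in c for c in counts) for term in vocab}

# Smoothed, log-scaled IDF: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(count):
    """Weight raw counts by IDF, then scale the row to unit (L2) length."""
    raw = {term: count[term] * idf[term] for term in count}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

tfidf = [tfidf_row(c) for c in counts]

print(round(tfidf[2]["peas"], 3), round(tfidf[2]["sentence"], 3))
print(round(tfidf[0]["sentence"], 3))
```

Because each row is scaled to unit length, a word's score depends on what else is in the same document, which is why "sentence" scores lower in the document dominated by "peas".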

Remember our test example from earlier?

In [None]:
test_corpus = ['This is my first sentence',
               'This is the second',
               'I enjoy peas in my sentence, peas peas peas!',
               'This is my first sentence']

In [None]:
cv = CountVectorizer()
cv_matrix = cv.fit_transform(test_corpus)

tfidf = TfidfTransformer()
tfidf_matrix = tfidf.fit_transform(cv_matrix)

In [None]:
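feature_names = cv.get_feature_names_out()

cv_matrix_df = pd.DataFrame(cv_matrix.toarray(), columns=feature_names)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)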

In [None]:
cv_matrix_df

In [None]:
tfidf_matrix_df

## Interpreting TFIDF

- 'Peas' has a high weighting in doc 2 because it is frequent in doc 2, but infrequent elsewhere.
- 'Sentence' has the same weighting in docs 0 and 3, but a lower one in doc 2 despite occurring once in all three, because in doc 2 it is competing against more terms.
- 'Second' has an above average score because it is only competing against a few other words, and it doesn't occur anywhere else in the corpus.

TFIDF highlights "significant" words for two reasons...

- It gives higher scores to words that occur frequently within a single document, relative to the number of other words in that document.
    - In a document with only 10 words, and 8 of them are "Peas", you would imagine peas to be a word that indicates what that document is about.
    - In a document where "Peas" occurs 8 times, but there are 10,000 other words, then suddenly Peas doesn't look so significant.
- It pulls down the scores of words that occur across a lot of documents.
    - If a document uses the word "Peas" 8 times and the word "Dog" twice, BUT the entire corpus uses the word "Peas" in every document, well now "Dog" is a more significant word for that document relative to the rest of the corpus of documents.

## Pre-Processed Tokens
Let's start working with a larger dataset using our pre-processed tokens. We can also compare to see if TFIDF improves our results.

In [None]:
token_corpus =

In [None]:
cv = CountVectorizer(ngram_range=(1,2), min_df=5, max_df=0.99)
matrix = cv.fit_transform(token_corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=cv.get_feature_names_out())
matrix_df

In [None]:
# Check top words
matrix_df.sum().sort_values(ascending=False).head(10)

In [None]:
# Let's try this again but instead use the TFIDF vectoriser which is essentially the count vectoriser and tfidf transformer in one


tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=5, max_df=0.99)
matrix = tfidf.fit_transform(token_corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=tfidf.get_feature_names_out())
matrix_df

In [None]:
# Check top words

matrix_df.sum().sort_values(ascending=False).head(10)

In [None]:
# Here we ask for the index positions of the stories that match different queries e.g. all the brexit stories

brexit_story_positions =
tesla_story_positions =

In [None]:
# And use those index positions to select only those rows in our matrix before finding the top words.
brexit_top_words =
brexit_top_words

In [None]:
tesla_top_words =
tesla_top_words

## Visualising Word Significance

In [None]:
to_plot =
to_plot

In [None]:
# A simple clear approach is to use a basic bar chart.


In [None]:
# A more fun approach...

def create_wordcloud(word_freq, save_path, max_words=1000, mask_filename=None):
    if mask_filename:
        mask = np.array(Image.open(mask_filename))
    else:
        mask = None
    wc = WordCloud(max_words=max_words, background_color='white', mask=mask, width=1000, height=1000)
    wc.generate_from_frequencies(word_freq)
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.savefig(save_path, dpi=400)


In [None]:
freq =


In [None]:
freq =


## Basic Clustering with [Principal Component Analysis (PCA)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)

So we've converted each of our documents into a row of numbers, also known as a *vector*. Using TFIDF, our vectors are made up of numbers indicating the significance of each word to that document. Each vector is like a signature distinguishing one document from the others. Intuitively, documents that have similarly high values for particular words are probably similar in content.

In our case we already have a classification for our documents: we know what query was used to retrieve each news item. But what if we didn't know that? Could we use these signatures to find groups of documents in an otherwise unknown set of documents? Here we will use two of the most basic techniques. Better methods, specifically for text, are available, but can also be more complex.

If you would like to better understand what is happening under the hood you can read Chapter 8 in the McLevey textbook.

#### Principal Component Analysis (PCA)
PCA is a technique for 'dimensionality reduction'. What does this mean? Imagine a dataset of people's height and weight. Rows are people, and there is one column for height and one for weight. This is a two-dimensional dataset, and could easily be plotted in a two-dimensional scatter plot, with height along one axis and weight along the other. Our dataset, where each row is a document and each column represents the significance of a single word, has *hundreds* of dimensions.
Dimensionality reduction techniques examine those hundreds of dimensions and attempt to create two new dimensions that approximately represent the variance of the original dataset.

There are many approaches to dimensionality reduction. PCA is a basic one, not necessarily best suited to text but good as an introduction.
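
A minimal sketch of the idea on made-up data, assuming a random matrix as a stand-in for a TFIDF matrix (50 "documents" by 20 "words"). The names `X` and `reduced` are introduced here for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for a TFIDF matrix: 50 rows (documents), 20 columns (words)
rng = np.random.default_rng(0)
X = rng.random((50, 20))

# Keep only the first 2 principal components
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(reduced.shape)
# Share of the original variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute is worth checking on real data: if the two components capture only a small share of the variance, the scatter plot is a heavily simplified view of the documents.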

In [None]:
tfidf =  # We use max features to limit the total number of words, keeping only the most significant
matrix = tfidf.fit_transform(token_corpus)
matrix_df = pd.DataFrame(matrix.toarray(), columns=tfidf.get_feature_names_out())
matrix_df

In [None]:
# Principal Component Analysis

# Initialise the PCA estimator and keep the first 2 components
pca = PCA(n_components=2)

# Fit the PCA estimator; first convert the sparse matrix to an array using toarray
pca_components = pca.fit_transform(matrix.toarray())
pca_df =
pca_df['query']
pca_df

In [None]:
# Here we can visualise how well the PCA worked


Whilst we might consider PCA not to have performed very well because our queries have a lot of overlap, consider that many of the topics may well overlap quite significantly in their language. There is a messy middle of political topics, whilst those more specific topics with different language are quite distant from the middle.

#### K-means clustering
Imagine we had no query labels for our texts. How would we know if there were any clusters of documents that talk about similar things? Enter clustering algorithms! A clustering algorithm examines a set of dimensions and determines which items in the dataset are close enough to be part of the same cluster (group). We'll be using K-means, a simple and well-established clustering algorithm. Again, better options are available, particularly for text data, and we'll use one later in the course.
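
Before applying it to the news data, here is a small sketch of K-means on six made-up 2D points that form two obvious groups. The data and variable names are invented for the example; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two tight groups of points: one near (0, 0), one near (5, 5)
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(points)  # one cluster label per point

print(labels)
print(km.cluster_centers_)
```

The label numbers themselves are arbitrary (which group is called 0 and which 1 can vary), so what matters is which points share a label.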

In [None]:
# Initialise the k-means estimator with 3 clusters

n_clusters = 3

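kmeans = KMeans(n_clusters=n_clusters)

# Fit the k-means estimator using the two components
kmeans.fit(pca_components)

# Cluster label assigned to each document
kmeans.labels_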


In [None]:
# We replace our original colouring by query label, with colouring by kmeans label

clusters = # as string



In [None]:
# Because we have a classification already we can check against it to see how the clustering performed.

dat =

hm_data =

hm_data['count']

counts =
#plot heatmap

In [None]:
# We can examine the top words for each cluster like we did when we examined based on the query groupings...



### How do we know how many clusters to form? 
If you don't know what clusters are in your dataset, how do you choose the value of K? We might try running with different values, perhaps qualitatively evaluate each time and go back and adjust to a broader clustering (a lower K) or a more fine grained clustering (a higher K). Whilst time consuming this is not necessarily an unreasonable approach, but it helps if we can get a sense of what a reasonable range of values would be.

For this we can use the *inertia* of our K-means model. The inertia essentially tells us how distant all of the points are from the centre of their assigned cluster. A clustering where some points are really far away from the centre of their cluster might indicate that you need more clusters, so the distant points can be assigned to something closer. However, you don't want too many clusters or the clustering becomes meaningless (a clustering with exactly as many clusters as datapoints will have no distance at all from point to cluster centre, but it won't be very informative).

N.B. Inertia is more typically called 'Sum of squared errors', in case you need to Google it.
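
The sketch below shows the general pattern on synthetic data, assuming three well-separated blobs generated with `make_blobs` rather than our news documents. The range of K values and the variable names are choices made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with 3 true clusters (a stand-in for real document vectors)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

scores = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores.append(km.inertia_)  # sum of squared distances to the nearest centre

for k, score in zip(range(1, 7), scores):
    print(k, round(score, 1))
```

With data like this the inertia typically falls steeply up to the true number of clusters and then flattens out, which is the pattern to look for on real data.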

In [None]:
inertia_scores = [] # Initialise a list

k_range =


inertia_scores

### The 'Elbow' Plot
By plotting those scores as a line graph we can see how the inertia changes as we add more clusters...

In [None]:
# Plot appearance and size

# Generate the plot



Here we can see the inertia score drops dramatically as we start adding clusters, but somewhere between 4 and 6 clusters little is gained by adding more. This is the range of options we would want to explore.

Go back and change `n_clusters` to a different value, re-run the plots and keyword outputs and see what happens. Which value do you think works best?