<a href="https://colab.research.google.com/github/Jumblygrindrod/cooccurrences/blob/main/Constructing_Word_Vectors_A_Taster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Constructing Word Vectors: A Taster

### This notebook will show you how you can produce a co-occurrence matrix using the Brown corpus. Once you have a co-occurrence matrix, you have a (relatively simple) vector space, with the rows corresponding to word vectors and the columns corresponding to the dimensions of the space.

---

### The advantage of hosting this on Google Colab is that you can watch each part of the code run in real time without having to install Python. Simply keep pressing Shift + Enter.


---


### Each chunk of code comes with a little piece of text explaining what is happening at each stage.

This notebook is largely based on the Python notebooks created by John Gamboa and Philip Blandfort. You can access those materials here: https://jcbgamboa.github.io/computational_linguistics/

Their materials have been edited for the purposes of this notebook, in accordance with a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0). This notebook is also licenced under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

First we import the Python packages that we'll need. One of the most important is nltk (Natural Language Toolkit), a package dedicated to pre-processing and processing textual data. We'll also need some other packages: re (regular expressions), numpy (arrays), tabulate (table display), and requests (fetching the corpus from a URL).

In [None]:
import nltk
import re
import numpy as np
from tabulate import tabulate
import requests

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Now we fetch the Brown corpus from a URL and store it as a string (i.e. one long piece of text containing the entire corpus).

(The nltk package actually provides the Brown corpus and other corpora pre-tokenized, but for illustrative purposes we'll ignore that!)


In [None]:
response = requests.get("http://www.sls.hawaii.edu/bley-vroman/brown.txt")
text = response.text

We will now tokenize the corpus (i.e. split it up into smaller linguistic chunks). First we tokenize it into sentences, so that we have a list of every sentence. Then we tokenize every sentence into a list of words. We end up with a list of lists of words. (This step can take a little while to run.)

In [None]:
sentences = nltk.sent_tokenize(text)
sentences = [nltk.word_tokenize(s) for s in sentences]

Now we remove non-words by requiring that each token consists only of word characters, i.e. the characters matched by `\w`: letters, digits, and the underscore:

In [None]:
new_sentences = []

for s in sentences:
    # Raw string avoids an invalid-escape warning for `\w`
    new_s = [w for w in s if re.fullmatch(r'\w+', w)]
    new_sentences.append(new_s)

sentences = new_sentences
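
To see what this filter keeps and drops, here is a minimal sketch on a handful of sample tokens. The token list is made up for illustration and is not taken from the corpus.

```python
import re

# Hypothetical sample tokens, chosen to show what the filter keeps and drops.
tokens = ["Fulton", "County", "'s", "1961", "``", "snake_case", "well-known", ".", "n't"]

# re.fullmatch succeeds only if the WHOLE token is made of \w characters
# (letters, digits, and the underscore).
kept = [t for t in tokens if re.fullmatch(r"\w+", t)]
dropped = [t for t in tokens if t not in kept]

print("kept:   ", kept)
print("dropped:", dropped)
```

Note that hyphenated words and clitics such as "n't" are dropped entirely, not repaired, so this is a fairly blunt filter.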

Then we convert everything to lowercase (using `casefold`, which is a slightly more aggressive form of lowercasing):

In [None]:
new_sentences = []

for s in sentences:
    # Creates a new list of tokens, with every token case-folded (lowercased)
    new_s = [w.casefold() for w in s]
    # Appends the new list (i.e., `new_s`) to the new_sentences list
    new_sentences.append(new_s)

sentences = new_sentences
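
As a small illustration of how `casefold` differs from `lower`, here is a sketch on three sample words. The words are chosen for illustration; for ordinary English text the two methods give the same result.

```python
# Hypothetical sample words; "Straße" is included only to show the difference.
words = ["The", "JURY", "Straße"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)
print(folded)
```

For an English corpus like Brown, either method would most likely do; `casefold` only matters for characters such as the German "ß".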

The following just gives us a list of stopwords, which will be of use when constructing our matrix:

In [None]:
stopwords = set(nltk.corpus.stopwords.words("english"))
print(stopwords)

{'doing', "you're", 'was', 'shouldn', 'll', 'ourselves', 'about', 'other', 'yourselves', 'all', 'that', "shouldn't", "you've", 'once', 'those', 'y', "that'll", 'here', 'o', 'after', 'he', 'is', 'yourself', 't', 'itself', 'under', 'just', 'been', 'are', "hadn't", 'who', 'out', 'then', 'doesn', "doesn't", "you'll", 'and', 'the', 'what', 'they', "don't", 'should', 'couldn', "mightn't", "haven't", 'but', 'below', 'ain', 'a', 'her', 'don', 'theirs', 'between', 'mustn', 'myself', 'while', 'too', "you'd", 'its', 'not', 're', 'haven', 'now', 'each', 'such', 'ma', "didn't", "weren't", 'you', 'because', "wasn't", 'by', 'am', 'does', 'further', 'our', "it's", 'were', "shan't", 'into', 'm', 'own', "wouldn't", 'being', 'through', 'from', 'mightn', 'will', 'why', 'down', 'aren', "she's", 'yours', 'me', 'up', 'had', 'him', 'an', 'his', 'hasn', 'weren', "should've", 'nor', 'until', 'of', 'these', 'my', "mustn't", 'shan', 'some', 'isn', 'during', 'if', 'has', 'more', 'very', 'in', 'both', 'himself', 'f

Now we define a function that will produce a co-occurrence matrix for us. It uses NLTK's `FreqDist` class, which behaves like a dictionary (in Python's sense of the term) whose keys are words and whose values are the number of times each word appears. A plain frequency count over the whole corpus is not enough, however: for each word of interest, we only want to count the words that appear within a window around it.
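
To get a feel for this kind of counting dictionary, here is a minimal sketch. It uses `collections.Counter` from the standard library as a stand-in: in NLTK 3, `FreqDist` is a subclass of `Counter`, so the operations shown here (`update`, indexing, `most_common`) are the same ones the function below relies on. The sample sentence is made up.

```python
from collections import Counter

# Hypothetical sample tokens, not taken from the corpus.
tokens = "the jury said the report said nothing".split()

counts = Counter(tokens)          # count each token once per appearance
counts.update(["jury", "jury"])   # update() adds further observations

print(counts["jury"])             # count for a word that was seen
print(counts["verdict"])          # unseen words simply count as 0
print(counts.most_common(1))      # the single most frequent word
```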

There are a few parameters to play with here. You can change the window size (`window_size`) and the vocab size (`vocab_size`), where the vocab is the set of context words that occupy the columns of the co-occurrence matrix. Here the vocab consists of the context words that co-occur with the largest number of the words of interest. You can also remove stopwords by setting `remove_stopwords=True`.

Note that the output of the function is two items: the vocab (i.e. the list of words that constitutes the columns) and the matrix itself.

In [None]:
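def compute_context_stuff(sentences, words, remove_stopwords=False, vocab_size=10, window_size=5):
    # One frequency distribution of context words per word of interest
    co_occurrences = {word: nltk.FreqDist() for word in words}

    for sentence in sentences:
        for word in words:
            if word in sentence:
                # Note: index() finds only the first occurrence of `word` in the sentence
                word_pos = sentence.index(word)
                # Note: the slice end is exclusive and capped at len(sentence)-1, so the
                # right-hand side covers window_size-1 tokens and never the final token
                co_occurrences[word].update([w.lower() for w in sentence[max(0,word_pos-window_size):min(word_pos+window_size,len(sentence)-1)]
                                            if re.match(r'[\w]+', w) and w!=word
                                            and (not remove_stopwords or w.lower() not in stopwords)])

    # Rank context words by how many words of interest they co-occur with
    # (iterating over a FreqDist yields each distinct context word once)
    vocab = [c for c,count in nltk.FreqDist([w for word in co_occurrences
                                             for w in co_occurrences[word]]).most_common(vocab_size)]

    # Rows: words of interest; columns: vocab words
    co_matrix = np.array([[co_occurrences[word][ctx] for ctx in vocab] for word in words])
    return vocab, co_matrix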
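
The function above looks only at the first occurrence of each word of interest in a sentence. To make the underlying idea easier to follow, here is a simplified sketch of window-based counting on a toy corpus. It is not the notebook's function: the helper name `window_counts`, the toy sentences, and the choice of a symmetric window that covers every occurrence are all assumptions made for illustration, and it uses only the standard library.

```python
from collections import Counter

def window_counts(sentences, targets, window_size=2):
    """Count context words within `window_size` tokens on each side of every
    occurrence of each target word (hypothetical helper for illustration)."""
    counts = {t: Counter() for t in targets}
    for sentence in sentences:
        for pos, token in enumerate(sentence):
            if token in counts:
                left = sentence[max(0, pos - window_size):pos]
                right = sentence[pos + 1:pos + 1 + window_size]
                counts[token].update(left + right)
    return counts

# Toy corpus: a list of tokenized, lowercased sentences.
toy = [
    ["she", "was", "happy", "and", "calm", "today"],
    ["he", "felt", "happy", "today"],
]
targets = ["happy", "calm"]
counts = window_counts(toy, targets, window_size=2)

# Columns: context words ranked by how many target rows they appear in,
# mirroring how the function above builds `vocab`.
row_frequency = Counter(w for t in counts for w in counts[t])
vocab = [w for w, _ in row_frequency.most_common(3)]

# Rows: target words; columns: vocab words.
matrix = [[counts[t][w] for w in vocab] for t in targets]

print(vocab)
for target, row in zip(targets, matrix):
    print(target, row)
```

Each row of `matrix` is the count vector for one target word, which is exactly the structure the full function returns as a numpy array.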

The following function will display the matrix in an easy-to-read format, rather than just a list of arrays:

In [None]:
def show_co_matrix(mat, words, vocab, max_vocab_size=None):
    # Raw strings keep the backslash in the header without an invalid-escape warning
    if max_vocab_size:
        print(tabulate([[word]+list(mat[i,:max_vocab_size]) for i,word in enumerate(words)],
                       headers=[r'word \ context']+vocab[:max_vocab_size]))
    else:
        print(tabulate([[word]+list(mat[i]) for i,word in enumerate(words)],
                       headers=[r'word \ context']+vocab))

Now we need some words of interest that will form the rows in our matrix. Let's use a list of emotive words.

In [None]:
emotive_words = nltk.word_tokenize("afraid angry calm cheerful cold crabby crazy cross excited frigid furious glad glum happy icy jolly jovial kind lively livid mad ornery rosy sad scared seething shy sunny tense tranquil upbeat wary weary worried")

Now let's generate a matrix and vocab list with the function we defined above.

In [None]:
vocab, matrix = compute_context_stuff(sentences, emotive_words, remove_stopwords=True, window_size=10, vocab_size=20)

Now let's look at the matrix using the show_co_matrix function:

In [None]:
show_co_matrix(matrix, emotive_words, vocab)

word \ context      would    man    little    could    one    new    said    go    way    even    first    eyes    back    old    made    people    might    time    many    never
----------------  -------  -----  --------  -------  -----  -----  ------  ----  -----  ------  -------  ------  ------  -----  ------  --------  -------  ------  ------  -------
afraid                  4      1         0        0      6      1       1     0      2       2        1       0       1      0       0         5        2       2       1        1
angry                   1      3         2        0      2      1       1     1      0       2        0       2       3      2       0         0        2       0       0        1
calm                    0      0         0        1      0      0       2     0      0       0        1       1       0      1       0         1        0       1       0        0
cheerful                0      0         0        0      0      0       0     0      1       0        0  

The above matrix shows co-occurrence between the words of interest and the vocab words. For instance, it tells us that "afraid" co-occurs with "would" 4 times. But it can also serve as a vector space, with each row serving as the vector for one word of interest.
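
One common way to use such row vectors is to compare them with cosine similarity. The sketch below is an illustration, not a step the notebook itself performs: the three rows are copied by hand from the table above, and the `cosine` helper is a hypothetical function written with the standard library only.

```python
import math

# Rows copied from the table above (columns: would, man, little, ..., never).
afraid = [4, 1, 0, 0, 6, 1, 1, 0, 2, 2, 1, 0, 1, 0, 0, 5, 2, 2, 1, 1]
angry = [1, 3, 2, 0, 2, 1, 1, 1, 0, 2, 0, 2, 3, 2, 0, 0, 2, 0, 0, 1]
calm = [0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0]

def cosine(u, v):
    """Cosine of the angle between two count vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print("afraid vs angry:", round(cosine(afraid, angry), 3))
print("afraid vs calm: ", round(cosine(afraid, calm), 3))
print("angry vs calm:  ", round(cosine(angry, calm), 3))
```

With counts this small the similarities should be read with caution; the point is only that once words are rows in a matrix, standard vector operations apply to them.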
