# Plotting word context vectors
*Written by Tao Wang and Angus Roberts, August 2024*

Words keep familiar friends. The famous linguist J.R. Firth said:

> *You shall know a word by the company it keeps*

We find that the word *apple* often appears near the words *eat* and *pie*. So does the word *strawberry*. The word *dog*, on the other hand, has different friends, like *tail* and *bark*.

More specifically, groups of words are used in very particular ways. For example, some verbs are only used with certain subjects (people talk, but rocks don't talk much), and certain adjectives will only apply to certain groups of nouns (you might use an adjective to describe the taste of a food, but not the taste of a vehicle).

We will examine this idea using the [iWeb corpus](https://www.english-corpora.org/iweb/): a corpus of 14 billion words in 22 million systematically selected English language web pages. This can be searched and analysed using the tools at [English-Corpora.org](https://www.english-corpora.org/). We have used these tools to look at a few words, and to find what other words appear in their context (on the same web page). We have saved the results in a spreadsheet, and will load this into Python to take a look.


### Set everything up

First we need to load some Python libraries - modules of pre-written code that have functionality we can use. We will use:

- Pandas, which contains code to load and manipulate data - we will call this **pd**;
- pyplot, which contains code to plot graphs - we will call this **plt**.

In [None]:
# Import some libraries that we need.
import pandas as pd
import matplotlib.pyplot as plt

Next we will fetch our spreadsheet of words and their contexts by copying the whole GitHub repository where it lives into the Colab filespace. We can then read the spreadsheet from Python using the Pandas *read_excel* function. Passing `sheet_name=None` makes it read every sheet, storing each one in a separate Pandas *dataframe*. We will put them all in a variable called *contexts*. This is a Python dictionary: it contains a set of keys (the names of our sheets), each of which has a value (the dataframe holding the data for that sheet).

In [None]:
# Copy files from GitHub into the local Colab filespace.
!git clone --quiet https://github.com/KCL-Health-NLP/nlp_youth_awards.git
print("Done copying files")

In [None]:
# Read in the spreadsheet.
contexts = pd.read_excel('./nlp_youth_awards/practicals/contexts.xlsx', sheet_name=None)

Let's take a look at what we have read in. First, we will print out the keys - i.e. the names of the dataframes stored in the *contexts* variable. These are the words for which the spreadsheet contains contexts.

In [None]:
# Let's take a look at the names of the sheets that have been read in.
# These are our words.
print(contexts.keys())

Now let's look at the dataframe for one of our words. We will list just the first 10 rows. See how it contains columns and rows read directly from the spreadsheet. Each row gives one context word found in the context of the sheet word, and a count of how often it has been found with the sheet word. The 50 most common context words are given.

In [None]:
# Let's take a look at the first few lines of one of the sheets.
# You can change this to look at others.
print(contexts['lettuce'].head(10))

### A function to look up context word dimensions
We can imagine each of our context words being a dimension in some space. Imagine a 2-dimensional space for now, like a 2D graph with an x and a y axis. For any given word we can plot a vector that shows how often that word occurs in the same web pages as two chosen context words, i.e. the size of these two dimensions. So, if we are considering the word *apple*, and the word *eat* is found in 100 web pages that mention *apple*, whereas the word *tree* is found in 80 pages, we might plot the (eat, tree) dimensions of *apple* as the vector:

`(100, 80)`

Other words will have different vectors in the same 2-dimensional space.
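
As a minimal sketch of this idea, the snippet below writes the *apple* vector from the example above as a Python tuple. The counts for *dog* are invented purely for illustration and do not come from the iWeb data.

```python
# Each word gets a vector of (eat, tree) counts.
# The apple counts come from the example above;
# the dog counts are made up for illustration only.
example_vectors = {
    'apple': (100, 80),
    'dog': (30, 5),
}

# Unpack the two dimensions of each vector and print them
for word, (eat_count, tree_count) in example_vectors.items():
    print(word, 'eat =', eat_count, 'tree =', tree_count)
```

Both vectors live in the same (eat, tree) space, so they can be drawn on the same graph and compared.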

To model this in our code, we need a function that will look up a dimension in a word dataframe. We can then use this in our code wherever we need a dimension for a word.



In [None]:
# This function takes a word, and a dimension word.
# It looks up the number of times the dimension word
# occurs with the word. The value of the dimension
# is returned. If the dimension word is not found,
# zero is returned
def get_dimension_value(word, dimension):

  # Get the dataframe of dimensions for this word
  word_context = contexts[word]

  # If the dimension word is in the context column of the table
  if dimension in word_context['context'].values:

    # The value of the dimension is found in the row named for that dimension
    # and in the relative-count column
    value = word_context.loc[word_context['context'] == dimension, 'relative-count'].iloc[0]

  # If the dimension word is not found in the table
  else:
    value = 0

  return value
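
To see how this lookup behaves without loading the spreadsheet, here is a small self-contained sketch. It assumes a sheet has the `context` and `relative-count` columns used by the function above. The table `toy_sheet`, its numbers, and the helper `lookup` are invented for illustration and are not part of the real data.

```python
import pandas as pd

# A made-up sheet with the same two columns the function above relies on.
# These numbers are illustrative only, not real iWeb counts.
toy_sheet = pd.DataFrame({
    'context': ['salad', 'bowl', 'leaf'],
    'relative-count': [90, 40, 25],
})

# Hypothetical helper: the same lookup, but taking the table as an argument
def lookup(sheet, dimension):
    # Keep only the rows whose context word matches the dimension
    matches = sheet.loc[sheet['context'] == dimension, 'relative-count']
    # No matching row means the dimension has size zero
    if matches.empty:
        return 0
    return matches.iloc[0]

found = lookup(toy_sheet, 'salad')
missing = lookup(toy_sheet, 'tractor')
print(found, missing)
```

A context word that is in the table gives back its count, while one that is missing gives back zero rather than an error.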

### Make the vectors

Now that we can look up the sizes of different dimensions for words, let's make some vectors! We will choose some words to vectorise, and a couple of dimensions against which to vectorise them. We've put some ideas for this below. You should change the words, and the dimensions, to see what happens.

In [None]:
# Choose some words for which we will create vectors
words_to_vectorise = ['lettuce', 'cucumber', 'butter', 'sugar']

# Choose vector dimensions
x_dimension = 'bowl'
y_dimension = 'salad'

We can now vectorise our words, against the two dimensions. Once we have done this, we will print out the vectors and take a look.

In [None]:
# Make an empty list to hold the vectors
vectors = []

# Go through the words one at a time
for word in words_to_vectorise:

  # Look up the values of the two dimensions
  x_value = get_dimension_value(word, x_dimension)
  y_value = get_dimension_value(word, y_dimension)

  # Add the dimensions in to the vectors
  v = [word, (x_value, y_value)]
  print(v)
  vectors.append(v)

# Take a look at the vectors
print(vectors)



### Plot the vectors

Now we have some vectors, what can we do with them? An obvious thing to do is plot them - let's do that first.

In [None]:
# Go through the vectors, get out the word
# and its vector, and plot each one
for word, vec in vectors:

    # Get out the x and y values
    x = vec[0]
    y = vec[1]

    # Plot an arrow
    plt.quiver(0, 0, x, y, angles='xy', scale_units='xy', scale=1)

    # Add a label at the end of each arrow
    plt.text(x+2, y+2, word, fontsize=8)

# Set axis labels and limits
plt.xlabel(x_dimension)
plt.ylabel(y_dimension)
plt.xlim((-1, 101))
plt.ylim((-1, 101))

# Show the graph
plt.show()


### Next steps
- Try changing the words and dimensions, to explore our small vocabulary
- What else might we do with vectors?
- What about adding more dimensions? How about 3, or 4? How about more?
- How might we use these vectors?
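
One way to start on the question of more dimensions is sketched below. It is only a sketch: the `toy_counts` dictionary and its numbers are invented, standing in for the real spreadsheet, and `make_vector` is a hypothetical helper rather than part of this notebook.

```python
# Made-up context counts for two words (illustrative only)
toy_counts = {
    'lettuce': {'salad': 90, 'bowl': 40, 'leaf': 25},
    'butter': {'bowl': 60, 'bread': 80},
}

# Any number of dimensions can go in this list
dimensions = ['bowl', 'salad', 'bread', 'leaf']

# Hypothetical helper: build one vector with a value per dimension,
# using zero where the context word was not seen with the word
def make_vector(word, dimensions):
    return tuple(toy_counts[word].get(d, 0) for d in dimensions)

for word in toy_counts:
    print(word, make_vector(word, dimensions))
```

A vector with more than three dimensions cannot be drawn as an arrow on a flat graph, but it can still be stored and compared in exactly the same way.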