# Word Embeddings

This notebook looks at vector representations of words, and at word embeddings more generally. Most of the examples are adapted from [Allison Parrish's notebook](https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469), which has lots of great stuff.

First we will play with the data collected in a color survey conducted by the author of [XKCD](https://xkcd.com/). From what I understand, participants were shown colors on screen and asked to name them: many people and many colors. This produced a lot of data about which colors people have in mind when they use a given name. There is also interesting data about how the names differ by gender, and about the general ingenuity (or boredom) of people. [Read more about the survey here](https://blog.xkcd.com/2010/05/03/color-survey-results/).

[The color data can be downloaded here](https://github.com/dariusk/corpora/blob/master/data/colors/xkcd.json).

---

Further down in the notebook we use the text of [_Dracula_](http://www.gutenberg.org/cache/epub/345/pg345.txt) by Bram Stoker and [_The Yellow Wallpaper_](http://www.gutenberg.org/cache/epub/1952/pg1952.txt) by Charlotte Perkins Gilman, both of which can be downloaded from Project Gutenberg.

In [None]:
import json
import matplotlib.pyplot as plt
import numpy as np
import random
import spacy


from scipy import spatial

I have downloaded the two text files (_Dracula_ and _The Yellow Wallpaper_) which we'll use at various points, so I'll store their file names in variables below, along with the file name of the XKCD color data.

In [None]:
DRACULA = "pg345.txt"
YELLOW_WALLPAPER = "pg1952.txt"
XKCD_COLOR = "xkcd.json"

In [None]:
# load the JSON data about colours
with open(XKCD_COLOR, encoding="utf8") as file:
    color_data = json.load(file)

The color data is stored as hex codes, so the function and loop below will convert the hex to a more recognizable RGB color and store it in a new dictionary.

In [None]:
def hex_to_int(s):
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)
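
As a quick check of the conversion, the sketch below repeats the function so that it runs on its own and applies it to `#acc2d9`, an illustrative hex string picked for this example. Each pair of hex digits is read in base 16 as one channel.

```python
def hex_to_int(s):
    # drop the leading "#" and read each pair of hex digits as one channel
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)

# "#acc2d9" is an illustrative value: ac -> 172, c2 -> 194, d9 -> 217
rgb = hex_to_int("#acc2d9")
print(rgb)
```

The function also accepts strings without the leading `#`, because `lstrip` simply finds nothing to remove.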

In [None]:
# translate the whole colour dictionary from hex to tuples of integers
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

In [None]:
print(colors)

In [None]:
colors['olive']

In [None]:
colors['red']

In [None]:
colors['pig pink']

This is just a simple function to turn the color code into an image.

In [None]:
def make_solid_color(rgb):
    # create a 64x64 matrix for red, green, blue colour values
    r = np.full((64,64), rgb[0])
    g = np.full((64,64), rgb[1])
    b = np.full((64,64), rgb[2])
    
    # stack the data to create a 3-channel image stored in a 3-dimensional array
    return np.dstack((r, g, b))

In [None]:
image = make_solid_color(colors['cloudy blue'])

print(colors['cloudy blue'])
print(image.shape)

plt.imshow(image)

You may have noticed that an RGB color value is a vector of length 3: (172, 194, 217)

And we now have a dataset where each of these vectors has an associated name:

'cloudy blue': (172, 194, 217)

So we can start playing with this in maths world.

Below we create a function that finds the distance between two colours as the Euclidean distance between their vectors.

In [None]:
def distance(a, b):
    _a = np.array(a) # array from tuple
    _b = np.array(b)
    return np.linalg.norm(_a - _b) # numpy can perform operations on arrays


distance([10, 1], [5, 2])
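
To make the arithmetic behind `np.linalg.norm` visible, here is a minimal sketch of the same Euclidean distance written with the standard library only. The name `euclidean` is a hypothetical helper for this illustration and is not part of the notebook.

```python
import math

def euclidean(a, b):
    # square the difference along each axis, sum, then take the square root
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# same toy points as the call above: sqrt((10 - 5)**2 + (1 - 2)**2) = sqrt(26)
d = euclidean([10, 1], [5, 2])
print(d)
```

The same formula works unchanged for three-component RGB vectors, since it just sums over however many axes the two inputs share.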

In [None]:
colors['red']

In [None]:
d_red_green = distance(colors['red'], colors['green'])
d_red_pink  = distance(colors['red'], colors['pink'])

print(d_red_green, d_red_pink)
print(d_red_green > d_red_pink)

The function below sorts the color names by their distance from a given coordinate and returns the `n` closest.

In [None]:
def closest(space, coord, n=10):
    # sort the names by distance from coord and keep the n nearest
    return sorted(
        space.keys(),
        key=lambda x: distance(coord, space[x])
    )[:n]
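
The nearest-name lookup is easier to follow on a tiny palette. The sketch below uses a hypothetical four-colour dictionary (`toy_colors`) and a hypothetical helper `closest_names`, with `math.dist` standing in for the notebook's `distance` function; none of these names come from the survey data.

```python
import math

# hypothetical four-colour palette, not taken from the survey data
toy_colors = {
    "red": (255, 0, 0),
    "dark red": (139, 0, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
}

def closest_names(space, coord, n=10):
    # order the names by Euclidean distance to coord, nearest first
    return sorted(space, key=lambda name: math.dist(coord, space[name]))[:n]

# (200, 0, 0) is 55 units from "red" and 61 units from "dark red"
nearest = closest_names(toy_colors, (200, 0, 0), n=2)
print(nearest)
```

Note that the query point does not have to be a named colour itself: any coordinate in RGB space can be looked up.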

In [None]:
closest(colors, colors['olive'])

In [None]:
closest(colors, [150, 60, 150])

Let's subtract one color from another and see which named colors are closest to the resulting vector.

In [None]:
def subtract(a, b):
    _a = np.array(a)
    _b = np.array(b)
    return _a - _b

In [None]:
closest(colors, subtract(colors['purple'], colors['red']))
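
Subtraction works channel by channel, so the result can fall outside the displayable 0 to 255 range. The sketch below shows this with illustrative RGB values (assumed for the example, not read from the survey data). The clamping step is an optional extra that the notebook itself does not perform.

```python
# illustrative RGB values, not read from the survey data
purple = (128, 0, 128)
red = (255, 0, 0)

# subtract channel by channel, as NumPy does for arrays of equal shape
diff = tuple(p - r for p, r in zip(purple, red))
print(diff)

# clamp into the displayable 0-255 range if the result should be shown as a colour
clamped = tuple(min(255, max(0, c)) for c in diff)
print(clamped)
```

A nearest-colour search still works with the unclamped vector, because Euclidean distance is defined for negative coordinates too.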

Let's create a poem by randomly choosing colours close to red and blue.

In [None]:
red = colors['red']
blue = colors['blue']
for i in range(14):
    rednames = closest(colors, red)
    bluenames = closest(colors, blue)
    print("Roses are " + rednames[0] + ", violets are " + bluenames[0])
    red = colors[random.choice(rednames[1:])]
    blue = colors[random.choice(bluenames[1:])]

The function below calculates the average of a list of vectors.

In [None]:
def mean(coords):
    return np.mean(np.array(coords), axis=0)
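
Averaging with `axis=0` means each component is averaged separately across the vectors, rather than collapsing everything into a single number. A minimal standard-library sketch of the same idea, using a hypothetical helper `mean_vector` and illustrative colour values:

```python
def mean_vector(coords):
    # zip(*coords) groups the first components together, then the second, and so on
    return tuple(sum(axis) / len(coords) for axis in zip(*coords))

# illustrative values: the midpoint of pure red and pure blue
avg = mean_vector([(255, 0, 0), (0, 0, 255)])
print(avg)
```

The result is a point halfway between the two inputs, which is why the mean of several colours can be read as their "average colour".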

In [None]:
# Uncomment and run the line below once to download the model
# !python -m spacy download en_core_web_lg

Here we load _Dracula_ and find every mention of a colour in the book. We can then compute the average color and find the named colors closest to that average.

In [None]:
# English pipeline optimized for CPU. Components: tok2vec, tagger, parser, senter, ner, attribute_ruler, lemmatizer.
# more info: 

nlp = spacy.load('en_core_web_lg')

In [None]:
with open(DRACULA, encoding="utf8") as file:
    dracula_text = file.read()

dracula_doc = nlp(dracula_text)

In [None]:
# use word.lower_ to normalize case
drac_colors = [colors[word.lower_] for word in dracula_doc if word.lower_ in colors]
avg_color = mean(drac_colors)
print(avg_color)
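
One consequence of looking up a single token at a time is that only one-word colour names can match. The sketch below illustrates this with a hypothetical miniature colour table and sentence; it splits on whitespace as a stand-in for spaCy's tokenizer, which is an assumption made to keep the example dependency-free.

```python
# hypothetical miniature colour table and sentence
toy_colors = {
    "red": (255, 0, 0),
    "pig pink": (231, 142, 165),
    "white": (255, 255, 255),
}
text = "Her white dress and pig pink ribbon lay beside the Red door"

# one lowercase token at a time, mirroring the list comprehension above
tokens = text.lower().split()
found = [t for t in tokens if t in toy_colors]
print(found)
```

Lowercasing lets "Red" match, but "pig pink" is never found because no single token equals the two-word key.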

In [None]:
drac_color_img = make_solid_color(np.array(avg_color).astype(np.uint8))
plt.imshow(drac_color_img)

In [None]:
closest(colors, avg_color)

In [None]:
closest_dracula_col = closest(colors, avg_color)[0]
print(closest_dracula_col)

In [None]:
image = make_solid_color(colors[closest_dracula_col])
print(colors[closest_dracula_col])

plt.imshow(image)

Here we load _The Yellow Wallpaper_ and find every mention of a colour in the text. We can then compute the average color and find the named colors closest to that average.

In [None]:
with open(YELLOW_WALLPAPER, encoding="utf8") as file:
    wallpaper_doc = nlp(file.read())

In [None]:
wallpaper_colors = [colors[word.lower_] for word in wallpaper_doc if word.lower_ in colors]
avg_color = mean(wallpaper_colors)
wallpaper_closest = closest(colors, avg_color)

In [None]:
image = make_solid_color(colors[wallpaper_closest[0]])
print(wallpaper_closest[0])

plt.imshow(image)

Let's use the _Dracula_ text to explore some word similarities.

In [None]:
def cosine_similarity(vec1,vec2): 
    return 1-spatial.distance.cosine(vec1,vec2)
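
Cosine similarity compares the direction of two vectors and ignores their length. As a sketch of what the SciPy call computes, here is the formula written out with the standard library; `cosine_sim` is a hypothetical name used only for this illustration.

```python
import math

def cosine_sim(u, v):
    # dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_sim([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])       # perpendicular vectors
opposite = cosine_sim([1, 2], [-1, -2])       # opposite directions
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three results are 1, 0 and -1, which are the extremes and midpoint of the similarity scale. The sketch divides by zero for an all-zero vector, so it assumes both inputs are non-zero.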

Build a list of the unique alphabetic tokens in _Dracula_ that have a word vector:

In [None]:
tokens = list(set([w.text for w in dracula_doc if w.is_alpha and w.has_vector]))

In [None]:
def vec(s):
    return nlp(s).vector

In [None]:
cosine_similarity(vec('dog'), vec('puppy')) > cosine_similarity(vec('trousers'), vec('octopus'))

Learn more about the Python `sorted` function [here](https://www.w3schools.com/python/ref_func_sorted.asp).

In [None]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(
        token_list,
        key=lambda x: cosine_similarity(vec_to_check, vec(x)), # sort based on the similarity of each token to the specified vector
        reverse=True, # highest similarity first, so the slice keeps the n most similar tokens
    )[:n]
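
Since `sorted` orders ascending by default, the sort direction matters when ranking by a similarity score: taking the first `n` items of an ascending sort yields the least similar entries. A small sketch with hypothetical scores (the words and numbers are made up for illustration):

```python
# hypothetical similarity scores for three words against some query vector
scores = {"evening": 0.82, "spoon": 0.11, "morning": 0.91}

ascending = sorted(scores, key=lambda w: scores[w])
descending = sorted(scores, key=lambda w: scores[w], reverse=True)

print(ascending[:2])   # the two lowest-scoring words
print(descending[:2])  # the two highest-scoring words
```

With a distance (where smaller means closer) the default ascending order is the right one; with a similarity (where larger means closer) `reverse=True` is needed.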

In [None]:
day_vec = vec("day")

In [None]:
# the length of the vector is the dimensionality of the embedding (300 for en_core_web_lg)
len(day_vec)

In [None]:
min(day_vec), max(day_vec)

In [None]:
# halfway between day and night
spacy_closest(tokens, mean([vec("day"), vec("night")]))

In [None]:
spacy_closest(tokens, subtract(vec("wine"), vec("alcohol")))

## Sentence Similarity
Let's do some basic sentence similarity. We represent a sentence by the average of the vectors of all its words, and assume that if two sentences are similar, their mean word vectors will be close.

In [None]:
def sentvec(s):
    sentence = nlp(s)
    return mean([w.vector for w in sentence])
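
To see what averaging does and does not capture, the sketch below uses hypothetical two-dimensional "embeddings" (`toy_vectors`) and a hypothetical helper `toy_sentvec`. Real spaCy vectors have far more dimensions, and the values here are invented for illustration.

```python
# hypothetical 2-dimensional "embeddings", for illustration only
toy_vectors = {
    "cold": (0.0, 1.0),
    "night": (0.2, 0.8),
    "warm": (1.0, 0.0),
    "day": (0.8, 0.2),
}

def toy_sentvec(sentence):
    # average the vectors of the known words, component by component
    vecs = [toy_vectors[w] for w in sentence.lower().split() if w in toy_vectors]
    return tuple(sum(axis) / len(vecs) for axis in zip(*vecs))

a = toy_sentvec("cold night")
b = toy_sentvec("night cold")
c = toy_sentvec("warm day")
print(a, b, c)
```

The first two sentences get identical vectors because a mean ignores word order, while the third lands in a different region of the space. This is the main limitation of the averaging approach: it captures which words occur, not how they are arranged.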

In [None]:
sentences = list(dracula_doc.sents)

In [None]:
wallpaper_sentences = list(wallpaper_doc.sents)

In [None]:
wallpaper_sentences[100]

In [None]:
sentences[111]

In [None]:
def spacy_closest_sent(space, input_str, n=10):
    input_vec = sentvec(input_str)
    # sort by similarity to the input and keep the top n sentences
    return sorted(space,
                  key=lambda x: cosine_similarity(np.mean([w.vector for w in x], axis=0), input_vec),
                  reverse=True)[:n]

In [None]:
for sent in spacy_closest_sent(wallpaper_sentences, "My favorite food is strawberry ice cream."):
    print(sent.text)
    print("---")