## Fooling with Word Vectors ##
https://github.com/aparrish/rwet/blob/master/understanding-word-vectors.ipynb

To get some of my Dark Souls objects to rhyme, I'll have to learn word vectors.

A word vector represents a word as a point in space: a list of numbers, one per dimension. How related two words are can then be measured by the distance between their points, for example their *Euclidean distance*.

Here's a basic function that computes the distance between two points on a 2D plane.
>(The ** operator raises the value on its left to the power on its right.)



In [1]:
import math
def distance2d(x1, y1, x2, y2):
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

In [2]:
distance2d(70, 30, 75, 40)

11.180339887498949

In [3]:
import json

In [4]:
color_data = json.loads(open("xkcd.json").read())

This code converts a color from hex format (for example `#1a2b3c`) to a tuple of three integers: red, green and blue. The leading "#" is stripped first.

In [6]:
def hex_to_int(s):
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)
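
A quick check of the conversion, using the sample hex value mentioned above. Each pair of hex digits becomes one integer between 0 and 255.

```python
def hex_to_int(s):
    # Drop the leading "#", then read each two-digit hex pair as an integer.
    s = s.lstrip("#")
    return int(s[:2], 16), int(s[2:4], 16), int(s[4:6], 16)

sample = hex_to_int("#1a2b3c")
print(sample)  # (26, 43, 60)
```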

The following creates a dictionary and populates it with a mapping from color names to RGB tuples.

The keys `"colors"`, `"color"` and `"hex"` reflect how the xkcd.json file is organized.

In [7]:
colors = dict()
for item in color_data['colors']:
    colors[item["color"]] = hex_to_int(item["hex"])

In [8]:
colors['olive']

(110, 117, 14)

Next come the functions for vector arithmetic, which find distances between points in several dimensions (trippy!)

Tried it on paper, using a ruler. The math still works. Here's what's happening: √((x2-x1)² + (y2-y1)²)

No idea what's the "zip" for though...
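
Looking it up: `zip` pairs the items of two lists position by position, so the first coordinate of one point meets the first coordinate of the other, and so on. A small sketch (the point values are just the ones used in the test call further down):

```python
import math

point_a = [10, 1]
point_b = [5, 2]

# zip walks both lists in step and yields one pair per position.
pairs = list(zip(point_a, point_b))
print(pairs)  # [(10, 5), (1, 2)]

# Each pair holds the same coordinate (x with x, y with y) of the two points,
# which is what the distance formula needs.
squared_diffs = [(i - j) ** 2 for i, j in pairs]
print(squared_diffs)  # [25, 1]
print(math.sqrt(sum(squared_diffs)))
```

Because `zip` does not care how long the lists are, the same code works for 3D colors or for word vectors with hundreds of dimensions.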

**Distance between points**

In [9]:
def distance(coord1, coord2):
    # VERY SLOW - use for learning purposes only!
    return math.sqrt(sum([(i-j)**2 for i, j in zip(coord1, coord2)]))

In [None]:
distance([10,1], [5,2])

**Subtract Vectors**

In [12]:
def subtractv(coord1, coord2):
    return [c1 - c2 for c1, c2 in zip(coord1, coord2)]

In [13]:
subtractv([10,1],[5,2])

[5, -1]

**Add Vectors**

In [14]:
def addv(coord1, coord2):
    return [c1 + c2 for c1, c2 in zip(coord1, coord2)]
addv([10,1], [5,2])

[15, 3]

**Average of a list of vectors**

In [42]:
def meanv(coords):
    sumv = [0] * len(coords[0])
    for item in coords:
        for i in range(len(item)):
            sumv[i] += item[i]
    mean = [0] * len(sumv)
    for i in range(len(sumv)):
        mean[i] = float(sumv[i]) / len(coords)
    return mean
meanv([[0, 1], [2, 2], [4, 3]])

[2.0, 2.0]

In [44]:
distance(colors['red'], colors['dark pastel green'])

241.4415043028021

In [46]:
distance(colors['blue'], colors['green']) > distance(colors['red'], colors['pink'])

False

## Word Vectors in spaCy ##

The xkcd.json example dealt with colors that were already expressed as numbers (three per color: red, green and blue), so each color was already a point in space. Words are not so easily related. The number of dimensions varies immensely and changes depending on the methodology.

In her tutorial, Allison recommends going straight to spaCy, which comes with **GloVe**, Stanford's ready-made (pre-trained) word vectors.

In [47]:
import spacy

In [56]:
nlp = spacy.load('en_core_web_md')


In [72]:
file = open(filename, encoding="utf8")


SyntaxError: invalid syntax (<ipython-input-72-99e161da019a>, line 1)

In [74]:
doc = nlp(open("frankenstein.txt", encoding="utf8").read())

Grabbed Frankenstein (84-0.txt) from the Project Gutenberg page, renamed it, and opened it with the `encoding="utf8"` argument. Without the encoding I got errors.

In [75]:
# collect every unique alphabetic token in the text file
tokens = list(set([w.text for w in doc if w.is_alpha]))

In [80]:
nlp.vocab['master'].vector

array([  2.65439987e-01,  -4.81970012e-01,   4.41280007e-01,
        -6.79279983e-01,   4.23440002e-02,  -2.80189998e-02,
        -7.59750009e-02,  -7.15219975e-01,  -4.24109995e-01,
         1.94410002e+00,  -1.21140003e-01,  -8.99289995e-02,
        -5.72640002e-01,   2.53939986e-01,   1.99970007e-01,
         5.26599996e-02,   3.35750014e-01,   1.71739995e+00,
        -4.35149997e-01,   2.17390001e-01,   2.21090000e-02,
        -6.64120018e-02,  -3.66420001e-01,  -4.11419988e-01,
        -2.00049996e-01,  -5.14129996e-01,   1.52140006e-01,
         2.87669986e-01,   1.59429997e-01,   5.04710019e-01,
        -4.47310001e-01,   8.42199981e-01,  -4.75209989e-02,
        -2.90859997e-01,  -2.70959996e-02,  -5.63480020e-01,
        -7.71640018e-02,   1.10760003e-01,   2.30910003e-01,
        -5.79890013e-01,   1.91699993e-02,  -2.47799993e-01,
         2.84900010e-01,  -1.42419994e-01,   1.33900002e-01,
         3.67000014e-01,   2.31389999e-01,   1.47330001e-01,
        -2.69219995e-01,

The following cell defines a function that looks up the vector for a given string in spaCy's vocabulary.

In [81]:
def vec(s): 
    return nlp.vocab[s].vector

In [82]:
vec

<function __main__.vec>

## Cosine similarity and closest neighbors ##
`cosine()` returns the cosine similarity of two vectors. This is another way to measure similarity, one "more suited to high-dimensional spaces", like all the words in Frankenstein.

**Have to install numpy**

In [83]:
import numpy as np

In [86]:
from numpy import dot

from numpy.linalg import norm

In [87]:
def cosine(v1, v2):
    if norm(v1) > 0 and norm(v2) > 0:
        return dot(v1,v2) / (norm(v1) * norm(v2))
    else:
        return 0.0

In [94]:
cosine(vec('human'), vec('person')) > cosine(vec('dog'), vec('peace'))

True
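
To see what cosine similarity actually measures, here is a small sketch on hand-made 2D vectors (the values are invented for illustration). `cosine_plain` is a hypothetical stand-in for `cosine`, written with plain `math` instead of NumPy so the dot product and vector lengths are spelled out.

```python
import math

def cosine_plain(v1, v2):
    # Dot product of the two vectors divided by the product of their lengths.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    length1 = math.sqrt(sum(a * a for a in v1))
    length2 = math.sqrt(sum(b * b for b in v2))
    if length1 > 0 and length2 > 0:
        return dot_product / (length1 * length2)
    return 0.0  # a zero-length vector has no direction to compare

short = [1, 2]
long_same_direction = [10, 20]
perpendicular = [-2, 1]

print(cosine_plain(short, long_same_direction))  # close to 1.0
print(cosine_plain(short, perpendicular))        # 0.0
print(cosine_plain(short, [0, 0]))               # 0.0
```

Vectors pointing the same way score about 1 no matter how long they are, and vectors at right angles score 0. Euclidean distance, by contrast, would call `short` and `long_same_direction` far apart.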

In [96]:
def spacy_closest(token_list, vec_to_check, n=10):
    return sorted(token_list,
                 key=lambda x: cosine(vec_to_check, vec(x)),
                 reverse=True)[:n]

In [118]:
spacy_closest(tokens, vec("truth"))

['truth',
 'faith',
 'believing',
 'belief',
 'true',
 'believe',
 'Believe',
 'reality',
 'nothing',
 'Nothing']
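
To see how the ranking in `spacy_closest` works without loading a model, here is a sketch on a tiny made-up vocabulary. The words and 2D vectors in `toy_vectors` are invented for illustration (real spaCy vectors have hundreds of dimensions), and `toy_closest` and `cosine_plain` are hypothetical stand-ins for `spacy_closest` and `cosine`.

```python
import math

# Invented 2D "word vectors", for illustration only.
toy_vectors = {
    "truth": [0.9, 0.1],
    "belief": [0.8, 0.3],
    "war": [0.1, 0.9],
    "strife": [0.2, 0.8],
}

def cosine_plain(v1, v2):
    # Dot product divided by the product of the vector lengths.
    dot_product = sum(a * b for a, b in zip(v1, v2))
    length1 = math.sqrt(sum(a * a for a in v1))
    length2 = math.sqrt(sum(b * b for b in v2))
    if length1 > 0 and length2 > 0:
        return dot_product / (length1 * length2)
    return 0.0

def toy_closest(words, vec_to_check, n=10):
    # Sort the words by similarity to vec_to_check, most similar first,
    # then keep the first n.
    return sorted(words,
                  key=lambda w: cosine_plain(vec_to_check, toy_vectors[w]),
                  reverse=True)[:n]

print(toy_closest(list(toy_vectors), toy_vectors["truth"], n=2))
# ['truth', 'belief']
```

As in the real output, the word itself comes first because nothing is more similar to a vector than the vector itself.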

**Finding the halfway point between two words**

In [119]:
spacy_closest(tokens, meanv([vec("truth"), vec("war")]))

['war',
 'truth',
 'nothing',
 'Nothing',
 'Fear',
 'fear',
 'strife',
 'conflict',
 'wars',
 'humanity']

In [112]:
# Note: the variable name is misleading; it holds the offset from "god" to "men".
blue_to_sky = subtractv(vec("men"), vec("god"))
spacy_closest(tokens, addv(blue_to_sky, vec("industry")))

['men',
 'Men',
 'women',
 'business',
 'Among',
 'among',
 'market',
 'professional',
 'opportunities',
 'employees']

blue_to_sky

## Mixing with Tracery ##

I'll try to mix what I've done so far with Tracery rules.



In [114]:
import tracery


In [115]:
from tracery.modifiers import base_english

In [116]:
import json

In [117]:
animal_data = json.loads(open("darksouls.json").read())


In [120]:
closest_truWar = spacy_closest(tokens, meanv([vec("truth"), vec("war")]))

In [122]:
type(closest_truWar)

list

In [123]:
closest_truWar[2]

'nothing'
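
One way the mixing might go, as a sketch: build a Tracery-style rules dictionary whose word list comes from the nearest-neighbor results. The list is copied from the output above; the rule names `origin` and `concept` and the sentence template are my own placeholders, not anything from darksouls.json.

```python
# Nearest neighbors of the truth/war midpoint, copied from the output above.
closest_truWar = ['war', 'truth', 'nothing', 'Nothing', 'Fear',
                  'fear', 'strife', 'conflict', 'wars', 'humanity']

# Lowercase and de-duplicate while keeping the similarity order.
concepts = []
for word in closest_truWar:
    lowered = word.lower()
    if lowered not in concepts:
        concepts.append(lowered)

# A Tracery grammar is a dict mapping rule names to lists of expansions;
# "#concept#" gets replaced by a random entry from the "concept" list.
rules = {
    "origin": ["The ring whispers of #concept# and #concept#."],
    "concept": concepts,
}
print(rules["concept"])
# ['war', 'truth', 'nothing', 'fear', 'strife', 'conflict', 'wars', 'humanity']
```

With the imports above, this dictionary can be handed to `tracery.Grammar(rules)`; after `grammar.add_modifiers(base_english)`, calling `grammar.flatten("#origin#")` should return one generated sentence.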