# Mapping Print, Charting Enlightenment

## Machine Learning Experiment 1: Classifying books by titles and other metadata

Team: Rachel Hendery, Tomas Trescak, Katie McDonough, Michael Falk, Simon Burrows

## Notebook 2: Converting titles to mean word vectors 

Author: Michael Falk

In this notebook, I experiment with converting each title into a 300-dimensional feature vector using Facebook's pre-trained French word vector model.

The model is large (~5GB), so it has not been uploaded to the repository. All of Facebook's pre-trained models are available here: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

The binary files Facebook provides do not work with the Gensim package, which makes them hard to use in Python. Instead, I have converted the text-format `.vec` file into a new bin file that does work with Gensim, which substantially decreases the load time. You can use this code to do the conversion:

```python
from gensim.models import KeyedVectors
file_path = "fr_model/wiki.fr.vec" # locate .vec file
word_vectors = KeyedVectors.load_word2vec_format(file_path) # load into python
save_path = "fr_model/french_vectors.bin" # choose save path
word_vectors.save_word2vec_format(save_path, binary = True) # save to it in binary word2vec format
```

The aim is to allow the model to generalise to unseen titles that have different words, as well as making the features more meaningful for the learning algorithm.
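The intuition behind this aim can be sketched with toy data. Word embeddings tend to place words used in similar contexts close together, so a title containing an unseen but related word can still end up near titles the model has seen. The three-dimensional vectors below are invented purely for illustration (the real model has 300 dimensions and different values), and `cosine_similarity` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional vectors, invented for illustration only;
# they are NOT taken from the pre-trained French model.
toy_vectors = {
    "histoire": np.array([0.9, 0.1, 0.0]),
    "chronique": np.array([0.8, 0.2, 0.1]),
    "sermon": np.array([0.0, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# In this toy setup the two history-related words point in similar directions,
# while the unrelated word points elsewhere.
sim_related = cosine_similarity(toy_vectors["histoire"], toy_vectors["chronique"])
sim_unrelated = cosine_similarity(toy_vectors["histoire"], toy_vectors["sermon"])
print(sim_related > sim_unrelated)
```

With the real model, the same comparison could be made by looking the words up in `word_vectors` instead of `toy_vectors`.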

NB: If you're running this notebook on Windows, you'll need to have a C compiler installed.

In [None]:
# Load pre-trained word vectors
from gensim.models import KeyedVectors
import numpy as np

file_path = "fr_model/french_vectors.bin"
word_vectors = KeyedVectors.load_word2vec_format(file_path, binary = True)

In [None]:
# Sanity check: does word_vectors return a 300-dimensional vector as intended?
# (The result should be a 1-D numpy array, so the expected shape is (300,))
word_vectors["écraser"].shape

In [None]:
# Load training data and preprocess
import pandas as pd # library for manipulating data
from nltk.tokenize import wordpunct_tokenize # tokeniser
import re # regular expressions

data_path = "data/editions_trimmed.csv" # locate data file

data = pd.read_csv(data_path).dropna()
has_word_char = re.compile(r"\w") # regex for tokens containing a letter, digit or underscore
has_non_digit = re.compile(r"\D") # regex for tokens containing at least one non-digit character
title_strings = [] # initialise list of tokenised titles
for title in data["full_book_title"]: # loop over titles
    title = title.lower() # make all letters lower case
    tokens = wordpunct_tokenize(title) # tokenise
    filtered = [i for i in tokens if has_word_char.search(i)] # strip out punctuation-only tokens
    filtered = [i for i in filtered if has_non_digit.search(i)] # strip out purely numeric tokens
    filtered = [i for i in filtered if len(i) > 2] # strip out short tokens as a crude stopword filter TODO: get proper sw list
    title_strings.append(filtered) # append to results list
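
The length filter above only approximates stopword removal. A minimal sketch of what a list-based filter might look like follows, using only the standard library. The stopword set here is a tiny hypothetical sample rather than a proper French list, and `filter_tokens` and the example title are invented for illustration.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# a real French stopword list would be much longer.
FRENCH_STOPWORDS = {"le", "la", "les", "de", "des", "du", "et", "sur", "par", "pour"}

# Word characters minus digits and underscore, i.e. (roughly) letters.
HAS_LETTER = re.compile(r"[^\W\d_]")

def filter_tokens(tokens, stopwords=FRENCH_STOPWORDS):
    """Keep lower-cased tokens that contain a letter and are not stopwords."""
    kept = []
    for token in tokens:
        token = token.lower()
        if HAS_LETTER.search(token) and token not in stopwords:
            kept.append(token)
    return kept

# An invented, already-tokenised title.
example = ["Histoire", "de", "la", "philosophie", ",", "1770", "par", "M", "."]
print(filter_tokens(example))
```

Unlike the length filter, this keeps short content words and single-letter tokens such as `m`, so the two approaches will not produce identical token lists.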

In [None]:
# Sanity check: how does a random title look?
import random
random.choice(title_strings) # pick one tokenised title at random (never indexes out of range)

The next step could do with some investigation. From my reading, a common way to get an embedding for a whole sentence is to look up the word vector for each word and then average them. We will see if this enables the model to train effectively.
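To make the averaging concrete, here is a minimal sketch using a plain dictionary in place of the Gensim model. The four-dimensional vectors are invented for illustration, and `mean_vector` is a hypothetical stand-in, not the function used in this notebook. Words without a vector are skipped, and a title with no known words falls back to zeros, which is an assumption rather than an established convention.

```python
import numpy as np

# Hypothetical 4-dimensional vectors, invented for illustration only.
toy_vectors = {
    "histoire": np.array([1.0, 0.0, 2.0, 0.0]),
    "naturelle": np.array([0.0, 2.0, 0.0, 0.0]),
    "oiseaux": np.array([2.0, 1.0, 1.0, 3.0]),
}

def mean_vector(tokens, vectors, size):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)  # assumed fallback for titles with no known words
    return np.mean(known, axis=0)

title = ["histoire", "naturelle", "des", "oiseaux"]  # "des" has no vector here
title_vector = mean_vector(title, toy_vectors, size=4)
print(title_vector)  # [1. 1. 1. 1.]
```

One limitation of this approach is that the mean ignores word order, so two titles made of the same words in a different order get the same vector.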

In [None]:
# Get word vectors for each title
import numpy as np

def get_mean_word_vec(string_list, word_vectors):
    '''
    params:
    
        string_list: a list of strings
        word_vectors: a KeyedVectors object (gensim)
        
        dependencies: numpy, gensim
    
    desc:
    
    This function takes a list of strings and a KeyedVectors
    object as arguments. It first looks up the word vector for
    each string in the list, according to the provided model,
    skipping any string the model does not know. It then takes
    the mean of all the vectors. It returns a single n x 1
    vector, where n is the dimensionality of the provided model.
    '''
    n = word_vectors.vector_size # how many dimensions are the word vectors?
    found = [] # word vectors found so far
    
    for word in string_list: # loop through strings
        try:
            found.append(word_vectors[word].reshape(n)) # find the word vector and check dimensions
        except KeyError:
            continue # word is not in the model's vocabulary, so skip it
    
    if not found:
        # No word in the title is known to the model. Falling back to zeros is
        # an arbitrary choice; the mean of no vectors would otherwise be NaN.
        return np.zeros((n, 1))
    
    title_vector = np.mean(found, axis = 0).reshape((n, 1)) # take the mean of each dimension
    
    # title_vector is an n x 1 feature vector of the whole title
    
    return title_vector

In [None]:
# Loop through examples and apply the function.

# TODO: parallelise this part to speed the whole thing up.

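n = word_vectors.vector_size # how many dimensions do our training examples have?

# Each title becomes one row of the design matrix. ravel() flattens the
# vector returned by get_mean_word_vec to length n so the rows stack cleanly.
rows = [get_mean_word_vec(title, word_vectors).ravel() for title in title_strings]
T = np.vstack(rows) if rows else np.empty((0,n)) # design matrix, one row per title

# Sanity check: T should have one row per title and n columns
T.shape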