<a href="https://colab.research.google.com/github/OdysseusPolymetis/enexdi2025_prep/blob/main/2_word_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**Word Vectors**</center>

---



## **Definition**

Imagine language as a cloud of scattered points, where each point is a different word. The location of each point depends on the location of every other point in the cloud (e.g. if two words share the same contexts, they should appear near one another). Once a word can be represented as a point in space, it has a computational representation: it becomes a vector, with a direction. And it becomes possible to compute with it, for instance to measure how close two words are.
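To make the idea of "nearness" concrete, here is a minimal sketch of cosine similarity, a common way to compare the directions of two word vectors. The three-dimensional vectors below are hypothetical values chosen by hand for illustration; real models learn vectors with many more dimensions.

```python
import math

# Hypothetical toy vectors, written by hand (NOT taken from a trained model).
toy_vectors = {
    "roi": [0.9, 0.8, 0.1],
    "reine": [0.8, 0.9, 0.2],
    "pomme": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_close = cosine_similarity(toy_vectors["roi"], toy_vectors["reine"])
sim_far = cosine_similarity(toy_vectors["roi"], toy_vectors["pomme"])
print(f"roi / reine: {sim_close:.3f}")
print(f"roi / pomme: {sim_far:.3f}")
```

With these made-up values, "roi" and "reine" point in nearly the same direction, while "roi" and "pomme" do not.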

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("ORstNrlG_2g", width=512, height=288)

![](https://drive.google.com/uc?export=view&id=1FsTcOQ5LVgbDqkT5nm_gve5gZfQrZ8pV)

In [None]:
import os
import gensim
from gensim.models import Word2Vec
import glob
import nltk

from lxml import etree as ET
import lxml.html
import string
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/auteurs.zip

In [None]:
!unzip "/content/auteurs.zip"

In [None]:
flaubert="auteurs/flaubert/"
balzac="auteurs/balzac/"

In [None]:
def strip_ns_prefix(tree):
    query = "descendant-or-self::*[namespace-uri()!='']"
    for element in tree.xpath(query):
        element.tag = ET.QName(element).localname
    return tree

In [None]:
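if balzac:
    # Collect every XML file under the author's folder, including subfolders.
    files = glob.iglob(os.path.join(balzac, '**', '*.xml'), recursive=True)
    sentences = []

    for filename in files:
        print(filename)
        parser = ET.XMLParser(remove_blank_text=True, resolve_entities=False, encoding='utf8')
        tree = strip_ns_prefix(ET.parse(filename, parser))

        # One lemma per <wf> element, in document order.
        words = tree.xpath(".//wf/@lemma")

        # Cut the flat list of lemmas into sentences at each "." lemma.
        sentence = []
        for word in words:
            sentence.append(word)
            if word == ".":
                sentences.append(sentence)
                sentence = []
        # Keep any trailing lemmas that are not closed by a final ".".
        if sentence:
            sentences.append(sentence)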

In [None]:
print(len(sentences))
print(sentences[5])
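The sentence-splitting step can be tried in isolation, without any XML files. The sketch below uses a hypothetical helper, `split_into_sentences`, and a made-up list of lemmas. It keeps trailing lemmas that have no closing period as a final sentence, which is a choice made for this example.

```python
def split_into_sentences(lemmas, end_mark="."):
    """Group a flat list of lemmas into sentences, cutting after each end_mark."""
    sentences = []
    current = []
    for lemma in lemmas:
        current.append(lemma)
        if lemma == end_mark:
            sentences.append(current)
            current = []
    if current:  # trailing lemmas with no final period
        sentences.append(current)
    return sentences

# Made-up lemmas, for illustration only.
toy_lemmas = ["le", "chat", "dormir", ".", "il", "rêver", ".", "fin"]
toy_sentences = split_into_sentences(toy_lemmas)
print(toy_sentences)
```

The result is a list of lists of strings, which is the input shape `Word2Vec` expects.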

## Building a model

Depending on the amount of data you intend to process, this step may take some time (roughly 8 minutes with the corpus and settings used here).

In [None]:
model = Word2Vec(sentences, min_count=2, max_vocab_size=10000, negative=10, epochs=200)

In [None]:
model.wv.save("/content/model_balzac.bin")

This next cell is to be run only if you want to reload a saved model.

In [None]:
from gensim.models import KeyedVectors

# Load the saved vectors (a single call is enough).
wv = KeyedVectors.load("/content/model_balzac.bin")

# Wrap them in an untrained Word2Vec object so the following cells can keep using `model.wv`.
model = Word2Vec(vector_size=wv.vector_size, min_count=1)
model.wv = wv

In [None]:
print(model.wv.index_to_key)

In [None]:
# Paris is to France what London is to what?
# model.wv.most_similar(positive=['Londres', 'France'], negative=['Paris'], topn=5)

# King is to man what Queen is to what?
model.wv.most_similar(positive=['reine', 'homme'], negative=['roi'], topn=10)

In [None]:
model.wv.most_similar('esprit',topn=20)
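To see what an analogy query does with the vectors, here is a rough sketch of the arithmetic in plain Python: add the unit-length positive vectors, subtract the negative ones, then rank the remaining words by cosine similarity with the result. The function `toy_most_similar` and the three-dimensional vectors are hypothetical and hand-written for illustration. This is a simplified approximation of the idea, not gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def toy_most_similar(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    excluded = set(positive) | set(negative)  # query words are not candidates
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical axes: (royalty, masculine, feminine). Values are made up.
toy = {
    "roi": [1.0, 1.0, 0.0],
    "reine": [1.0, 0.0, 1.0],
    "homme": [0.0, 1.0, 0.0],
    "femme": [0.0, 0.0, 1.0],
    "château": [1.0, 0.0, 0.0],
    "pomme": [0.1, 0.1, 0.1],
}

result = toy_most_similar(toy, positive=["reine", "homme"], negative=["roi"])
for word, score in result:
    print(f"{word}\t{score:.3f}")
```

With these hand-made vectors the top answer is "femme". On a model trained on a real corpus, the answer depends entirely on how the words are used in the texts.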

## Visualization with TensorFlow
You can get a clearer view of the vector space with the online [TensorFlow Embedding Projector](https://projector.tensorflow.org/). The cells below produce two files to upload there: one containing the vectors, the other their labels.

In [None]:
!wget https://raw.githubusercontent.com/ABC-DH/EnExDi2024/main/materials/3_NLP/stopwords_fr.txt

In [None]:
# Read the stopword list (one word per line) and close the file afterwards.
with open("/content/stopwords_fr.txt", encoding="utf-8") as f:
    stops = f.read().split("\n")

In [None]:
with open("/content/vecteurs.tsv", 'w', encoding="utf-8") as file_vectors, open("/content/metadonnees.tsv", 'w', encoding="utf-8") as file_metadata:
    for word in model.wv.index_to_key:
        file_vectors.write('\t'.join([str(x) for x in model.wv[word]]) + "\n")
        file_metadata.write(word + "\n")
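The stopword list loaded earlier is not used when the two files are written. One possible use, and this is an assumption about the intent, is to leave stopwords out of the export so that they do not clutter the visualization. The sketch below shows that filtering on made-up vectors. The helper `export_for_projector` is hypothetical, and it writes to a temporary folder rather than `/content/`.

```python
import os
import tempfile

# Made-up vectors and stopwords, for illustration only.
toy_vectors = {
    "le": [0.1, 0.2],
    "esprit": [0.3, 0.4],
    "de": [0.5, 0.6],
    "âme": [0.7, 0.8],
}
toy_stops = {"le", "de"}

def export_for_projector(vectors, stopwords, vectors_path, metadata_path):
    """Write one tab-separated vector per line and the matching label per line."""
    kept = []
    with open(vectors_path, "w", encoding="utf-8") as f_vec, \
         open(metadata_path, "w", encoding="utf-8") as f_meta:
        for word, vec in vectors.items():
            if word in stopwords:
                continue  # skip stopwords entirely
            f_vec.write("\t".join(str(x) for x in vec) + "\n")
            f_meta.write(word + "\n")
            kept.append(word)
    return kept

out_dir = tempfile.mkdtemp()
vec_path = os.path.join(out_dir, "vecteurs.tsv")
meta_path = os.path.join(out_dir, "metadonnees.tsv")
kept_words = export_for_projector(toy_vectors, toy_stops, vec_path, meta_path)
print(kept_words)
```

Both files keep the same line order, so line *n* of the metadata file labels line *n* of the vectors file.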