<a href="https://colab.research.google.com/github/OdysseusPolymetis/enssib_class/blob/main/3_word_vectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>**Word Vectors**</center>

---



## **Definition**

You can picture language as a cloud of scattered points, where each point is a different word. The location of each point depends on the location of every other point in the cloud (e.g. two words that share the same contexts should end up near one another). Once a word can be represented as a point in space, it has a computational representation: it becomes a vector in space, with a direction, and it becomes possible to compute things from it.
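
To make "near one another" concrete, here is a minimal sketch of cosine similarity, a common way of comparing word vectors. The 2-D coordinates and the three words are invented for illustration; real models use far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-D coordinates, invented for illustration only
toy_vectors = {
    "chat": [0.9, 0.1],
    "chien": [0.8, 0.2],
    "banque": [0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["chat"], toy_vectors["chien"])
sim_far = cosine_similarity(toy_vectors["chat"], toy_vectors["banque"])
print(f"chat / chien:  {sim_close:.3f}")
print(f"chat / banque: {sim_far:.3f}")
```

Vectors that point in nearly the same direction get a score close to 1, while unrelated directions get a score closer to 0.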

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo("ORstNrlG_2g", width=512, height=288)

![](https://drive.google.com/uc?export=view&id=1FsTcOQ5LVgbDqkT5nm_gve5gZfQrZ8pV)

Here are the few modules you're going to need. Basically, here's what they do.

*   `os` and `glob` are useful for navigating in your content.
*   `gensim` is a module that contains loads of practical tools for basic word vectorization.
*   `nltk` is useful here for string manipulation (split in sentences and so on).
*   `lxml` is a module for interpreting xml files.
*   `string` is a standard-library module for basic string manipulation.
*   `numpy` and `pandas` are generally used for table and matrix manipulations and representations.
*   `matplotlib` is a representation/visualization tool.

In [None]:
!pip install gensim

In [None]:
import os
import gensim
from gensim.models import Word2Vec
import glob
import nltk

from lxml import etree as ET
import lxml.html
import string
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Here we're going to download a default folder for vector analysis. These texts are in Frantext format. Later on in the notebook, you'll be able to do the same thing with `.txt` files.

In [None]:
!wget https://raw.githubusercontent.com/OdysseusPolymetis/enexdi2025_prep/refs/heads/main/auteurs.zip

In [None]:
!unzip "/content/auteurs.zip"

Here are the authors proposed by default.

In [None]:
flaubert="auteurs/flaubert/"
balzac="auteurs/balzac/"

This is basic (and admittedly crude) code that strips the namespace prefixes from the XML tags, so that elements can be queried by their plain names.

In [None]:
def strip_ns_prefix(tree):
    query = "descendant-or-self::*[namespace-uri()!='']"
    for element in tree.xpath(query):
        element.tag = ET.QName(element).localname
    return tree

Here we extract the lemma of each word, group the lemmas by sentence, and store every sentence in a list.

In [None]:
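if balzac != "":
    # Collect every XML file under the author's folder, including subfolders
    files = glob.iglob(os.path.join(balzac, '**', '*.xml'), recursive=True)
    sentences = []

    for filename in files:
        print(filename)
        parser = ET.XMLParser(remove_blank_text=True, resolve_entities=False, encoding='utf8')
        tree = strip_ns_prefix(ET.parse(filename, parser))

        # One lemma per <wf> element, in document order
        words = tree.xpath(".//wf/@lemma")

        sentence = []
        for word in words:
            sentence.append(word)
            if word == ".":
                # A full stop closes the current sentence
                sentences.append(sentence)
                sentence = []
        if sentence:
            # Keep the words that follow the last full stop of the file
            sentences.append(sentence)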

Here we check whether it worked (something should be printed).

In [None]:
print(len(sentences))
print(sentences[10])
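
The sentence-grouping step can be tried on its own, without any XML. The sketch below works on a hand-written list of lemmas; `group_into_sentences` and the toy data are hypothetical names introduced for illustration. Keeping the words that follow the last full stop is a choice made for this sketch.

```python
def group_into_sentences(lemmas, end_mark="."):
    """Split a flat list of lemmas into sentences, closing one at each end_mark."""
    sentences, current = [], []
    for lemma in lemmas:
        current.append(lemma)
        if lemma == end_mark:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence that has no final end_mark
        sentences.append(current)
    return sentences

# Invented lemma sequence, for illustration only
toy_lemmas = ["le", "chat", "dormir", ".", "il", "rêver", ".", "fin"]
toy_sentences = group_into_sentences(toy_lemmas)
print(toy_sentences)
```

The result is a list of lists of strings, which is the input shape that `Word2Vec` expects.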

## Building a model

Depending on the amount of data you intend to process, this step may take some time (about 8 minutes with the default corpus).

In [None]:
model = Word2Vec(sentences, min_count=2, max_vocab_size=10000, negative=10, epochs=300)
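
The `min_count=2` argument tells gensim to ignore words whose total frequency in the corpus is below 2. The sketch below imitates that filter on a toy corpus. It is a rough illustration rather than gensim's own implementation, and `drop_rare_words` is a hypothetical helper.

```python
from collections import Counter

def drop_rare_words(sentences, min_count=2):
    """Remove every word whose total corpus frequency is below min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

# Invented corpus: "reine" and "dormir" each occur only once
toy_corpus = [
    ["le", "roi", "parler"],
    ["le", "reine", "parler"],
    ["le", "roi", "dormir"],
]
filtered = drop_rare_words(toy_corpus, min_count=2)
print(filtered)
```

On a small corpus, raising `min_count` quickly shrinks the vocabulary, so a word you want to query later may no longer have a vector.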

In [None]:
model.wv.save("/content/model_balzac.bin")

This next cell is to be run only if you want to reload a saved model.

In [None]:
from gensim.models import KeyedVectors

# Load the saved vectors once
wv = KeyedVectors.load("/content/model_balzac.bin")

# Wrap them in an empty Word2Vec object so that later cells can keep using `model.wv`
model = Word2Vec(vector_size=wv.vector_size, min_count=1)
model.wv = wv

The following cell computes analogies between vectors: given the relation between "roi" (king) and "homme" (man), it looks for the word that stands in the same relation to "reine" (queen). You should get something like "femme" (woman) or "fille" (girl), depending on your corpus.

In [None]:
# Paris is to France what London is to ...?
# model.wv.most_similar(positive=['Londres', 'France'], negative=['Paris'], topn=5)
# King is to man what queen is to ...?
model.wv.most_similar(positive=['reine', 'homme'], negative=['roi'], topn=10)
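
The arithmetic behind such an analogy can be sketched with hand-made vectors: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. This is a simplified illustration (gensim also normalizes the vectors before combining them), and the 3-D coordinates are invented so that the analogy works exactly.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    used = set(positive) | set(negative)  # query words are left out of the results
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical coordinates: (royalty, femininity, shared component)
toy = {
    "roi": [0.9, 0.1, 0.5],
    "reine": [0.9, 0.9, 0.5],
    "homme": [0.1, 0.1, 0.5],
    "femme": [0.1, 0.9, 0.5],
    "pain": [0.5, 0.5, -0.9],
}
ranking = analogy(toy, positive=["reine", "homme"], negative=["roi"])
best_word = ranking[0][0]
print(ranking)
```

With a real corpus the match is never this clean: the expected word is usually among the top results rather than guaranteed to come first.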

The next cell lists the 20 words closest to `'esprit'`. You can replace this default value with any word that is in your model's vocabulary.

In [None]:
model.wv.most_similar('esprit',topn=20)

## Visualization with TensorFlow
You can get a clearer visualization using the online [TensorFlow Embedding Projector](https://projector.tensorflow.org/). The cells below produce two files, `vecteurs.tsv` (the vectors) and `metadonnees.tsv` (their labels), which you can download from Colab and load into the projector with its "Load" button.

In [None]:
!wget https://raw.githubusercontent.com/OdysseusPolymetis/enexdi2025_prep/refs/heads/main/stopwords_fr.txt

In [None]:
stops = open("/content/stopwords_fr.txt", encoding="utf-8").read().split("\n")

In [None]:
with open("/content/vecteurs.tsv", 'w', encoding="utf-8") as file_vectors, open("/content/metadonnees.tsv", 'w', encoding="utf-8") as file_metadata:
    for word in model.wv.index_to_key:
        file_vectors.write('\t'.join([str(x) for x in model.wv[word]]) + "\n")
        file_metadata.write(word + "\n")
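
The export format is simple enough to test on toy data: one tab-separated vector per line in the first file, and the matching label on the same line number in the second. The sketch below also leaves out a set of unwanted words, for instance a stopword list. That filter is an optional extra, and `export_for_projector`, its file names and the toy vectors are all hypothetical.

```python
import csv
import os
import tempfile

def export_for_projector(vectors, out_dir, skip=frozenset()):
    """Write vectors.tsv and metadata.tsv, leaving out any word found in skip."""
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8", newline="") as f_vec, \
         open(meta_path, "w", encoding="utf-8", newline="") as f_meta:
        for word, vector in vectors.items():
            if word in skip:
                continue
            f_vec.write("\t".join(str(x) for x in vector) + "\n")
            f_meta.write(word + "\n")
    return vec_path, meta_path

# Invented vectors, for illustration only
toy_vectors = {"le": [0.1, 0.2], "esprit": [0.3, 0.4], "âme": [0.5, 0.6]}
with tempfile.TemporaryDirectory() as tmp:
    vec_path, meta_path = export_for_projector(toy_vectors, tmp, skip={"le"})
    # Read both files back to check that they line up
    with open(vec_path, encoding="utf-8", newline="") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    with open(meta_path, encoding="utf-8") as f:
        labels = f.read().splitlines()

print(labels, rows)
```

Both files must have the same number of lines, since the projector pairs each vector with the label on the same row.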

## Same thing with your own TXT

A plain `.txt` file carries no lemma annotations, so we need to preprocess it first. We'll use `stanza` for lemmatization.

In [None]:
!pip install stanza

We'll also use a list of stopwords. The [stopwords-iso](https://github.com/stopwords-iso) repositories provide lists for many languages: find the repository for your language, open its `.txt` file, switch to the "Raw" view, then copy the URL and paste it into the next cell.

In [None]:
!wget https://raw.githubusercontent.com/stopwords-iso/stopwords-fr/refs/heads/master/stopwords-fr.txt

In [None]:
# A set makes the membership test in the lemmatization loop fast
stopwords = set(open("/content/stopwords-fr.txt", 'r', encoding="utf8").read().split("\n"))

Here you can upload your own text: click the folder icon on the left (📁) and drop your `.txt` file there.
<br>Then change the file name in the following cell. As a general rule, avoid spaces, accents, and special characters in file names.

In [None]:
filepath_of_text = "/content/3mousquetaires.txt"
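
If your file name does contain spaces or accents, one option is to rename it before uploading. The sketch below shows one possible way to build a safe name; `safe_filename` and the example title are hypothetical.

```python
import re
import unicodedata

def safe_filename(name):
    """Return an ASCII-only version of name: no accents, spaces or special characters."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Decompose accented letters, then drop the combining marks
    ascii_stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Replace every run of other characters with a single underscore
    ascii_stem = re.sub(r"[^A-Za-z0-9]+", "_", ascii_stem).strip("_")
    return f"{ascii_stem}.{ext}" if ext else ascii_stem

clean_name = safe_filename("Les Trois Mousquetaires (éd. 1844).txt")
print(clean_name)
```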

In [None]:
full_text = open(filepath_of_text, encoding="utf-8").read()

By default, the model used is French. You can pick another language from this [list](https://stanfordnlp.github.io/stanza/performance.html): just replace `fr` with its language code in the two following cells.

In [None]:
import stanza
stanza.download('fr')

In [None]:
nlp_stanza = stanza.Pipeline(lang='fr', processors='tokenize,mwt,pos,lemma')

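The next function feeds the text to Stanza in batches of 100 lines rather than all at once, which is a way to use the GPU better. It returns one list of lowercased lemmas per sentence, with stopwords removed.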

In [None]:
def batch_process_to_lemmas(text, nlp, batch_size=100):
    paragraphs = text.split('\n')
    batches = [paragraphs[i:i + batch_size] for i in range(0, len(paragraphs), batch_size)]

    sentences_lemmas = []

    for batch in batches:
        batch_text = '\n'.join(batch)
        doc = nlp(batch_text)
        for sentence in doc.sentences:
            sentence_lemmas = []
            for word in sentence.words:
                if word.lemma is not None:
                    # Lowercase first, so that capitalized stopwords are filtered too
                    lemma = word.lemma.lower()
                    if lemma not in stopwords:
                        sentence_lemmas.append(lemma)
            sentences_lemmas.append(sentence_lemmas)

    return sentences_lemmas

In [None]:
sentences = batch_process_to_lemmas(full_text, nlp_stanza)
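
The batching step of the function above can be checked on its own, without loading Stanza. This sketch repeats the same slicing on seven made-up one-line paragraphs with a batch size of 3; `make_batches` is a hypothetical helper.

```python
def make_batches(items, batch_size):
    """Cut a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

toy_text = "\n".join(f"Paragraph {n}." for n in range(1, 8))  # 7 one-line paragraphs
toy_paragraphs = toy_text.split("\n")
toy_batches = make_batches(toy_paragraphs, batch_size=3)
for batch in toy_batches:
    print(len(batch), batch)
```

The last batch is simply shorter when the number of paragraphs is not a multiple of the batch size, so no text is lost.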

In [None]:
model = Word2Vec(sentences, min_count=2, max_vocab_size=10000, negative=10, epochs=300)

In [None]:
model.wv.save("/content/yourModel.bin")

In [None]:
model.wv.most_similar('courage',topn=50)

In [None]:
with open("/content/vecteurs.tsv", 'w', encoding="utf-8") as file_vectors, open("/content/metadonnees.tsv", 'w', encoding="utf-8") as file_metadata:
    for word in model.wv.index_to_key:
        file_vectors.write('\t'.join([str(x) for x in model.wv[word]]) + "\n")
        file_metadata.write(word + "\n")