**Ben KABONGO**, *21116436*, M1 DAC

# Word Embedding for Sequence Processing

**The goal of this practical is to use pre-trained word embeddings to address the sequence prediction tasks studied in week 2: PoS tagging and chunking.**

In [1]:
import numpy as np
import gensim.downloader as api
from gensim.models import KeyedVectors

# 0) Loading the PoS (or chunking) datasets (small or large)

In [2]:
def load(filename):
    listeDoc = list()
    with open(filename, "r") as f:
        doc = list()
        for ligne in f:
            if len(ligne) < 2: # blank line: end of the current sentence
                listeDoc.append(doc)
                doc = list()
                continue
            mots = ligne.replace("\n","").split(" ")
            # Each line is "word PoS chunk": mots[2] is the chunk tag; use mots[1] instead for PoS tagging
            doc.append((mots[0],mots[2]))
    return listeDoc
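As an illustration of the file format that `load` expects, here is a minimal sketch run on an in-memory sample. It assumes the CoNLL-2000 layout of one `word PoS chunk` triple per line with a blank line between sentences; the sample text and the helper name `parse_conll` are made up for the example and are not part of the practical.

```python
import io

def parse_conll(handle, tag_column=2):
    """Group (word, tag) pairs into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in handle:
        fields = line.split()
        if not fields:                 # blank line: end of the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((fields[0], fields[tag_column]))
    if current:                        # keep a last sentence with no trailing blank line
        sentences.append(current)
    return sentences

# Hypothetical two-sentence sample in "word PoS chunk" format
sample = io.StringIO(
    "He PRP B-NP\n"
    "reckons VBZ B-VP\n"
    "\n"
    "Prices NNS B-NP\n"
    "rose VBD B-VP\n"
)
chunk_docs = parse_conll(sample, tag_column=2)   # chunk tags
sample.seek(0)
pos_docs = parse_conll(sample, tag_column=1)     # PoS tags
print(chunk_docs[0])
print(pos_docs[1])
```

Unlike `load`, this variant also keeps a final sentence that is not followed by a blank line.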

In [3]:
bSmall = False
directory = "../TME02/ressources/conll2000/"

if bSmall:
    # Smaller corpus
    filename = "chtrain.txt"
    filenameT = "chtest.txt"
else:
    # Larger corpus
    filename = "train.txt"
    filenameT = "test.txt"

alldocs = load(directory + filename)
alldocsT = load(directory + filenameT)

print(len(alldocs)," docs read")
print(len(alldocsT)," docs (T) read")

8936  docs read
2012  docs (T) read


# 1) Word embedding for classifying each word

### Pre-trained word2vec

In [4]:
bload = True # True: load the vectors saved locally; False: download them, then save them
fname = "word2vec-google-news-300"
sdir = "" # Change to the directory where the vectors are stored

if bload:
    wv_pre_trained = KeyedVectors.load(sdir+fname+".dat")
else:
    wv_pre_trained = api.load(fname)
    wv_pre_trained.save(sdir+fname+".dat")

### Some tokens in the dataset are missing from the embedding vocabulary; we will encode them with random vectors
This is sub-optimal, but every token needs a vector.

In [5]:
def randomvec():
    default = np.random.randn(300)
    default = default  / np.linalg.norm(default)
    return default
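To see what such random vectors look like, the following sketch draws two of them and checks their norm and cosine similarity. It uses NumPy's `default_rng` generator rather than the notebook's global seed, and the name `random_unit_vector` is hypothetical; the seed value is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(10)   # arbitrary seed, for reproducibility of the example only

def random_unit_vector(dim=300):
    """Draw a Gaussian vector and rescale it to unit Euclidean norm."""
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

a = random_unit_vector()
b = random_unit_vector()
print(np.linalg.norm(a))   # unit length, up to floating-point error
print(float(a @ b))        # cosine similarity of two independent draws
```

In 300 dimensions two independent draws are typically close to orthogonal, so a random vector carries no information about the word it stands for. This is why the approach is sub-optimal.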

In [6]:
np.random.seed(seed=10) # seed the random generator for reproducibility

dictadd = dict() # out-of-vocabulary token -> random unit vector

# Training tokens missing from the pre-trained vocabulary
i = 0
for d in alldocs:
    for (x,pos) in d:
        if x not in wv_pre_trained and x not in dictadd:
            i += 1
            dictadd[x] = randomvec()
print('Random vectors :', i)

# Test tokens missing from both the pre-trained vocabulary and the training additions
i = 0
for d in alldocsT:
    for (x,pos) in d:
        if x not in wv_pre_trained and x not in dictadd:
            i += 1
            dictadd[x] = randomvec()
print('Random vectors test :', i)

Random vectors : 3576
Random vectors test : 659


### Add the (key-value) 'random' word embeddings for missing inputs

In [7]:
wv_pre_trained.add_vectors(list(dictadd.keys()), list(dictadd.values()))

### Store the train and test datasets: a word embedding for each token in the sequences

In [8]:
wvectors = [wv_pre_trained[word] for doc in alldocs for word, tag in doc]
wvectorsT = [wv_pre_trained[word] for doc in alldocsT for word, tag in doc]

### Check the size of your train/test datasets

In [9]:
len(wvectors), len(wvectorsT)

(211727, 47377)

### Collecting train/test labels

In [10]:
# Labels train/test

# Index every tag seen in the training set
buf2 = [[pos for m,pos in d ] for d in alldocs]
cles = []
for b in buf2:
    cles.extend(b)
cles = np.unique(np.array(cles))
cles2ind = dict(zip(cles,range(len(cles))))
nCles = len(cles)
print(nCles," keys in the dictionary")

labels  = np.array([cles2ind[pos] for d in alldocs for (m,pos) in d ])
# Tags that appear only in the test set all get the extra index len(cles)
labelsT  = np.array([cles2ind.setdefault(pos,len(cles)) for d in alldocsT for (m,pos) in d ])

print(len(cles2ind)," keys in the dictionary")

22  keys in the dictionary
23  keys in the dictionary
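The jump from 22 to 23 keys comes from a tag that occurs only in the test set. The sketch below reproduces the idea on made-up tags; the variable names are hypothetical, and it uses `dict.get` so that the mapping is left unchanged, whereas `setdefault` in the cell above also inserts the unseen tag into the dictionary.

```python
train_tags = ["B-NP", "I-NP", "O", "B-NP"]     # made-up training tags
test_tags = ["B-NP", "B-LST", "O", "B-LST"]    # "B-LST" never occurs in training

# Index the tags seen in training, in sorted order
tag_names = sorted(set(train_tags))
tag2ind = {tag: i for i, tag in enumerate(tag_names)}
unknown_index = len(tag_names)                 # shared index for unseen tags

train_labels = [tag2ind[t] for t in train_tags]
test_labels = [tag2ind.get(t, unknown_index) for t in test_tags]
print(tag2ind)
print(train_labels)
print(test_labels)
```

A classifier trained on the training labels can never predict the extra index, so every test token carrying an unseen tag counts as an error.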


In [11]:
print(labels.shape)
print(labelsT.shape)

(211727,)
(47377,)


### Train a Logistic Regression model!
**And compare its performance to the baseline and the sequence models (HMM/CRF) of practical 2a**

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(wvectors, labels)

labelsP = lr.predict(wvectorsT)
accuracy_score(labelsT, labelsP) # argument order is (y_true, y_pred); accuracy is symmetric, so the score is unchanged

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7718724275492328
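For the comparison, one common baseline assigns each word the tag it most often received in training. Whether this matches the baseline of practical 2a is an assumption, since that practical is not reproduced here. The function names and toy data below are hypothetical.

```python
from collections import Counter, defaultdict

def train_baseline(docs):
    """Map each word to its most frequent training tag; also return the most frequent tag overall."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for doc in docs:
        for word, tag in doc:
            per_word[word][tag] += 1
            overall[tag] += 1
    default_tag = overall.most_common(1)[0][0]
    word2tag = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return word2tag, default_tag

def baseline_accuracy(docs, word2tag, default_tag):
    """Token-level accuracy; unseen words fall back to the default tag."""
    pairs = [(word2tag.get(w, default_tag), t) for doc in docs for w, t in doc]
    return sum(p == t for p, t in pairs) / len(pairs)

# Made-up data in the same (word, tag) format as alldocs
toy_train = [[("the", "B-NP"), ("cat", "I-NP"), ("sleeps", "B-VP")],
             [("the", "B-NP"), ("old", "I-NP"), ("dog", "I-NP")]]
toy_test = [[("the", "B-NP"), ("bird", "I-NP"), ("runs", "B-VP")]]

word2tag, default_tag = train_baseline(toy_train)
acc = baseline_accuracy(toy_test, word2tag, default_tag)
print(default_tag, acc)
```

With the notebook's data, the same two functions could be called on `alldocs` and `alldocsT` to obtain a reference score for the logistic regression.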

# 2) Using word embedding with CRF

## We will define the following feature functions for the CRF

In [17]:
def features_wv(sentence, index):
    v = wv_pre_trained.get_vector(sentence[index])
    d = {'f'+str(i):v[i] for i in range(300)}
    return d

def features_structural(sentence, index):
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }

def features_wv_plus_structural(sentence, index):
    v = wv_pre_trained.get_vector(sentence[index]) 
    d = {'f'+str(i):v[i] for i in range(300)}

    return {**d, **features_structural(sentence, index)}

## [Question]: explain what the 3 feature functions encode and what their differences are

The `features_wv` function encodes the word embedding of the word at position `index`. It looks the word up in the pre-trained model (`wv_pre_trained`) and returns a dictionary with keys `f0` to `f299`, one per dimension of the embedding vector. These real-valued features capture semantic similarity, since words with similar meanings tend to have similar vectors; they depend only on the word itself, not on its position or its neighbours.

The `features_structural` function encodes surface properties of the word and its context: the word itself, its position in the sentence (first or last), capitalization, prefixes and suffixes of up to three characters, the previous and next words, and whether it contains a hyphen, consists only of digits, or has capitals after its first character. These symbolic features capture morphological and syntactic cues.

The `features_wv_plus_structural` function merges the two dictionaries, so each word is described by both the 300 embedding dimensions and the structural features. It therefore captures semantic as well as syntactic and morphological properties.
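To make the difference concrete, here is a reduced sketch of the three feature functions applied to a toy sentence. The 3-dimensional `toy_vectors` dictionary stands in for `wv_pre_trained`, and only a few of the structural features are kept; all names prefixed with `toy_` are hypothetical.

```python
# Stand-in for wv_pre_trained: 3 dimensions instead of 300, made-up values
toy_vectors = {"Paris": [0.2, -0.1, 0.7], "is": [0.0, 0.3, -0.4]}

def toy_features_wv(sentence, index):
    """One real-valued feature per embedding dimension."""
    v = toy_vectors[sentence[index]]
    return {'f' + str(i): v[i] for i in range(len(v))}

def toy_features_structural(sentence, index):
    """A subset of the symbolic features: word form, position, context."""
    word = sentence[index]
    return {
        'word': word,
        'is_first': index == 0,
        'is_capitalized': word[0].upper() == word[0],
        'suffix-2': word[-2:],
        'prev_word': '' if index == 0 else sentence[index - 1],
    }

def toy_features_combined(sentence, index):
    """Merge both dictionaries; the key sets do not overlap."""
    return {**toy_features_wv(sentence, index),
            **toy_features_structural(sentence, index)}

sentence = ["Paris", "is"]
combined = toy_features_combined(sentence, 0)
print(combined)
```

The embedding features are floats shared across similar words, while the structural features are strings and booleans tied to the exact word form and its neighbours.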

### You can now train a CRF with the 3 features and analyse the results

In [14]:
from nltk.tag.crf import CRFTagger

tagger = CRFTagger(feature_func=features_wv)
## Train the model
tagger.train(alldocs, 'model_w2v_crf_1')

## Evaluate performances
tagger.evaluate(alldocsT)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  tagger.evaluate(alldocsT)


0.8817147561052832

In [15]:
tagger = CRFTagger(feature_func=features_structural)
## Train the model
tagger.train(alldocs, 'model_w2v_crf_2')

## Evaluate performances
tagger.evaluate(alldocsT)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  tagger.evaluate(alldocsT)


0.9384089326044283

In [18]:
tagger = CRFTagger(feature_func=features_wv_plus_structural)
## Train the model
tagger.train(alldocs, 'model_w2v_crf_3') # separate file, so the first model is not overwritten

## Evaluate performances
tagger.evaluate(alldocsT)

  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  tagger.evaluate(alldocsT)


0.9452054794520548
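To analyse the results beyond a single accuracy figure, a per-tag breakdown can help locate where the models differ. The sketch below is one possible way to compute it on made-up gold and predicted sequences; `per_tag_accuracy` is a hypothetical helper. With the notebook's tagger, the predictions could presumably be obtained with `tagger.tag_sents` applied to the word lists of `alldocsT`.

```python
from collections import Counter

def per_tag_accuracy(gold_docs, predicted_docs):
    """Fraction of tokens correctly tagged, grouped by gold tag."""
    correct, total = Counter(), Counter()
    for gold, pred in zip(gold_docs, predicted_docs):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            total[gold_tag] += 1
            correct[gold_tag] += int(gold_tag == pred_tag)
    return {tag: correct[tag] / total[tag] for tag in total}

# Made-up sequences in the same (word, tag) format as alldocsT
gold = [[("the", "B-NP"), ("cat", "I-NP"), ("sleeps", "B-VP")],
        [("a", "B-NP"), ("dog", "I-NP")]]
pred = [[("the", "B-NP"), ("cat", "B-NP"), ("sleeps", "B-VP")],
        [("a", "B-NP"), ("dog", "I-NP")]]

scores = per_tag_accuracy(gold, pred)
for tag, score in sorted(scores.items()):
    print(tag, score)
```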