# Mangoes : Cooccurrence

This notebook illustrates how to create a co-occurrence matrix from a corpus. The examples use extracts from Wikipedia (in English and French).

First, we have to import the `mangoes.counting` module.

In [1]:
from IPython.display import display
import mangoes.counting
import nltk

## Content of this notebook

1. [Create a corpus](#1.-Create-a-Corpus)
2. [Choose the vocabulary](#2.-Choose-the-vocabulary)
3. [Compute the co-occurrence matrix](#3.-Compute-the-co-occurrence-matrix)
4. [Define your context](#4.-Define-your-context)
5. [Annotated text](#5.-Annotated-text)



## 1. Create a Corpus
The `Corpus` class generates sentences from your documents.

In [2]:
corpus = mangoes.Corpus("data/wiki_article_en", lower=True)
print("{} sentences, {} words".format(corpus.nb_sentences, corpus.size))

382 sentences, 10969 words


The source can be a file, a folder or an iterable. The data can be raw text, or annotated text in Brown, XML or CoNLL format (see the section about annotated text below).

## 2. Choose the vocabulary

The `Vocabulary` class manages lists of words (e.g. the words you're going to represent as vectors). It encapsulates a mapping between words and their ids.  
To create your `Vocabulary` objects, you can provide your own list of words (as a list, a dictionary or another `Vocabulary`) or select a vocabulary from your corpus using filters.
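
To make the "mapping between words and their ids" concrete, here is a minimal pure-Python sketch. `SimpleVocabulary` is a hypothetical stand-in written for this illustration; it is not the mangoes `Vocabulary` class and only mimics the idea of an ordered word list with a reverse index.

```python
class SimpleVocabulary:
    """Hypothetical stand-in: an ordered list of words plus a word -> id index."""

    def __init__(self, words):
        # Keep only the first occurrence of each word, preserving order
        self.words = list(dict.fromkeys(words))
        self.word_index = {word: i for i, word in enumerate(self.words)}

    def index(self, word):
        """Return the id (position) of a word."""
        return self.word_index[word]

    def __contains__(self, word):
        return word in self.word_index

    def __len__(self):
        return len(self.words)

    def __iter__(self):
        return iter(self.words)


vocabulary = SimpleVocabulary(["anarchist", "communism", "societies", "state"])
print(vocabulary.index("societies"))  # id of a word
print(vocabulary.words[2])            # and back from the id to the word
```

In a co-occurrence matrix, these ids are what identify the rows (words) and the columns (contexts).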

Available filters are :  
* truncate, with parameter max_nb : to keep the `max_nb` most common words
* remove_most_frequent, with parameter max_frequency : keep the words that appear at most `max_frequency` times 
* remove_least_frequent, with parameter min_frequency : keep the words that appear at least `min_frequency` times 
* remove_elements, with a list of elements to exclude (e.g. stopwords or punctuation) : keep the words that don't appear in that list
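
The effect of these filters can be sketched in plain Python on a `collections.Counter` of word counts. The functions below are hypothetical re-implementations written for illustration (they are not the mangoes code); `remove_elements` plays the role of the stopword filter, and the toy sentence is made up.

```python
from collections import Counter


def truncate(max_nb):
    """Keep the max_nb most common words."""
    def apply(counts):
        return Counter(dict(counts.most_common(max_nb)))
    return apply


def remove_most_frequent(max_frequency):
    """Keep the words that appear at most max_frequency times."""
    def apply(counts):
        return Counter({w: c for w, c in counts.items() if c <= max_frequency})
    return apply


def remove_least_frequent(min_frequency):
    """Keep the words that appear at least min_frequency times."""
    def apply(counts):
        return Counter({w: c for w, c in counts.items() if c >= min_frequency})
    return apply


def remove_elements(elements):
    """Keep the words that are not in the given collection (e.g. stopwords)."""
    excluded = set(elements)
    def apply(counts):
        return Counter({w: c for w, c in counts.items() if w not in excluded})
    return apply


def create_vocabulary(counts, filters):
    """Apply the filters in order, then list the remaining words by frequency."""
    for apply_filter in filters:
        counts = apply_filter(counts)
    return [word for word, _ in counts.most_common()]


tokens = "the cat sat on the mat and the dog sat on the cat".split()
counts = Counter(tokens)

content_words = create_vocabulary(counts, [remove_elements(["the", "on", "and"]),
                                           remove_least_frequent(2)])
print(content_words)
```

Because the filters are applied in order, placing `truncate` last (as in the cells below) keeps the most common words among those that survived the other filters.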

In [3]:
# Examples :

# Provide your own list of words to create a Vocabulary:
my_words = mangoes.Vocabulary(["anarchist", "communism", "societies", "state"])
print("My words : {}".format(", ".join(my_words)))

# or select words from a corpus: here, drop the words that appear more than 5 times,
# then keep the 10 most common of the remaining ones
frequent_words = corpus.create_vocabulary(filters = [mangoes.corpus.remove_most_frequent(5),mangoes.corpus.truncate(10)])
print("Most frequent words : {}".format(", ".join(frequent_words.words)))

My words : anarchist, communism, societies, state
Most frequent words : advocates, voluntary, basis, particular, instead, considered, syndicalism, next, himself, used


To remove stopwords from your vocabulary when creating it from a corpus, you have to provide your own list of stopwords.  
For example, using nltk :

In [4]:
len(nltk.corpus.stopwords.words("english"))

179

In [5]:
nltk.corpus.stopwords.words("english")

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [6]:
import nltk.corpus
stopwords_filter = mangoes.corpus.remove_elements(nltk.corpus.stopwords.words('english'))
print(corpus.create_vocabulary(filters = [stopwords_filter, mangoes.corpus.truncate(10)]).words)

[',', '.', '"', 'anarchist', 'anarchism', 'anarchists', '-lrb-', '-rrb-', "'s", 'international']


With the same `remove_elements` filter, you can also remove punctuation marks from your vocabulary, using e.g. `string.punctuation`.

In [7]:
import string
punctuation_filter = mangoes.corpus.remove_elements(string.punctuation)
print(corpus.create_vocabulary(filters = [punctuation_filter, mangoes.corpus.truncate(10)]).words)

['the', 'of', 'and', 'in', 'a', 'to', 'anarchist', 'as', 'was', 'anarchism']


You can also combine all these filters :

In [8]:
frequent_words = corpus.create_vocabulary(filters = [punctuation_filter, 
                                                     stopwords_filter,
                                                     mangoes.corpus.remove_elements(["''", '``', '-rrb-', '-lrb-', "'s"]),
                                                     mangoes.corpus.truncate(10)])
print(frequent_words.words)

['anarchist', 'anarchism', 'anarchists', 'international', 'movement', 'first', 'free', 'workers', 'state', 'federation']


## 3. Compute the co-occurrence matrix

To count the word-word co-occurrences, use the `mangoes.counting.count_cooccurrence()` function :

In [9]:
coocc_count_1 = mangoes.counting.count_cooccurrence(corpus, my_words)
coocc_count_1.pprint(display=display)

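| | collectivist | associates | by | ideas | international | congress | the | movement | ; | trade | ... | post-classical | when | heterosexual | robert | women | william | criticises | sponsored | publication | leo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 1 | 1 | 3 | 3 | 4 | 2 | 27 | 13 | 1 | 1 | ... | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 |
| communism | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| societies | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| state | 0 | 0 | 0 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |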


In [10]:
coocc_count_1.matrix

<4x146 sparse matrix of type '<class 'numpy.int32'>'
	with 162 stored elements in Compressed Sparse Row format>

By default, the co-occurrences are counted within a window of one word around each word of the given vocabulary.  
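
As a rough illustration of what counting "within a window of one word" means, here is a pure-Python sketch on made-up sentences. The function below is a hypothetical re-implementation, not the mangoes one: it returns a `Counter` of `(word, context)` pairs instead of a sparse matrix, and how out-of-vocabulary words inside the window are handled is an assumption.

```python
from collections import Counter


def count_window_cooccurrence(sentences, words, context=None, window_half_size=1):
    """Count (word, neighbour) pairs within window_half_size positions.

    words: the rows of the matrix; context: optional set restricting the columns.
    """
    words = set(words)
    context = set(context) if context is not None else None
    counts = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            if word not in words:
                continue
            start = max(0, i - window_half_size)
            end = min(len(sentence), i + window_half_size + 1)
            for j in range(start, end):
                if j == i:
                    continue  # a word is not its own context
                neighbour = sentence[j]
                if context is None or neighbour in context:
                    counts[(word, neighbour)] += 1
    return counts


sentences = [["the", "anarchist", "movement", "grew"],
             ["the", "state", "opposed", "the", "anarchist", "movement"]]

all_contexts = count_window_cooccurrence(sentences, ["anarchist", "state"])
print(all_contexts[("anarchist", "movement")])

# Restricting the contexts keeps only the matching columns
restricted = count_window_cooccurrence(sentences, ["anarchist", "state"],
                                       context=["movement"])
print(dict(restricted))
```

Each distinct context word seen in the first call would become one column of the matrix, which is why an unrestricted count produces many more columns than rows.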

You can give a vocabulary to use as context to limit the columns of the matrix :


In [11]:
coocc_count_2 = mangoes.counting.count_cooccurrence(corpus, my_words, context=frequent_words)
coocc_count_2.pprint(display=display)

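| | anarchist | anarchism | anarchists | international | movement | first | free | workers | state | federation |
|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 0 | 0 | 0 | 4 | 13 | 4 | 0 | 0 | 0 | 8 |
| communism | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| societies | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| state | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |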


In [12]:
coocc_count_2.shape

(4, 10)

You can also define other kinds of context, as described in the next section.

## 4. Define your context

Mangoes provides the `Window` class to define window-based contexts : symmetric or asymmetric, with a fixed or dynamic size.

To create a context :

In [13]:
simple_window = mangoes.context.Window(window_half_size=5, vocabulary=frequent_words)
asymmetric_window = mangoes.context.Window(symmetric=False, window_half_size=(1,2), vocabulary=frequent_words)
dynamic_window = mangoes.context.Window(window_half_size=5, dynamic=True, vocabulary=frequent_words)

Using the contexts defined above :

In [14]:
print("context = symmetric window of size 2*5")
coocc_count_1 = mangoes.counting.count_cooccurrence(corpus,  
                                                    my_words, 
                                                    context=simple_window)
coocc_count_1.pprint(display=display)

context = symmetric window of size 2*5


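| | anarchist | anarchism | anarchists | international | movement | first | free | workers | state | federation |
|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 6 | 4 | 1 | 10 | 16 | 11 | 2 | 2 | 1 | 12 |
| communism | 4 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| societies | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| state | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |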


In [15]:
print("context = asymmetric window 1-x-2")
coocc_count_2 = mangoes.counting.count_cooccurrence(corpus, 
                                                    my_words, 
                                                    context=asymmetric_window)
coocc_count_2.pprint(display=display)

context = asymmetric window 1-x-2


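| | anarchist | anarchism | anarchists | international | movement | first | free | workers | state | federation |
|---|---|---|---|---|---|---|---|---|---|---|
| anarchist | 0 | 0 | 0 | 4 | 13 | 5 | 0 | 0 | 0 | 9 |
| communism | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| societies | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| state | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |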
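
To visualise which positions each kind of window covers, here is a small pure-Python sketch on a made-up sentence. The helpers `window` and `dynamic_window` are hypothetical names written for this illustration. The dynamic variant assumes the word2vec-style convention of sampling the effective half-size uniformly between 1 and the maximum for each target word; the actual behaviour of mangoes' `dynamic=True` may differ.

```python
import random


def window(sentence, i, before, after):
    """Return the words at most `before` positions left and `after` right of i."""
    left = sentence[max(0, i - before):i]
    right = sentence[i + 1:i + 1 + after]
    return left + right


def dynamic_window(sentence, i, max_half_size, rng):
    """Symmetric window whose half-size is drawn uniformly in [1, max_half_size]."""
    size = rng.randint(1, max_half_size)
    return window(sentence, i, size, size)


sentence = "the early anarchist movement opposed the state".split()
target = sentence.index("movement")

symmetric = window(sentence, target, 5, 5)    # like window_half_size=5
asymmetric = window(sentence, target, 1, 2)   # like window_half_size=(1, 2)
dynamic = dynamic_window(sentence, target, 5, random.Random(0))

print("symmetric :", symmetric)
print("asymmetric:", asymmetric)
print("dynamic   :", dynamic)
```

With a dynamic window, nearby words end up inside the window more often than distant ones, so they are effectively given more weight.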


## 5. Annotated text

If your corpus is annotated, you can count entities other than words. You have to declare the format of your corpus by choosing the appropriate reader in the `mangoes.corpus` module. You can also write your own reader.

In [16]:
annotated_corpus = mangoes.Corpus("data/wiki_article_fr.lemmatized", reader=mangoes.corpus.BROWN)
print("{} sentences, {} words".format(annotated_corpus.nb_sentences, annotated_corpus.size))

19 sentences, 514 words


In [17]:
print(list(annotated_corpus.words_count)[:10])

[Token(form='Antoine', lemma='Antoine', POS='NPP'), Token(form='Meillet', lemma='*Meillet', POS='NPP'), Token(form='Paul', lemma='Paul', POS='NPP'), Token(form='Jules', lemma='Jules', POS='NPP'), Token(form=',', lemma=',', POS='PONCT'), Token(form='né', lemma='naître', POS='VPP'), Token(form='le', lemma='le', POS='DET'), Token(form='à', lemma='à', POS='P'), Token(form='Moulins', lemma='Moulins', POS='NPP'), Token(form='(', lemma='(', POS='PONCT')]


From this corpus, you can create vocabularies of different entities, built from any combination of the available attributes :

In [18]:
# creating a vocabulary of lemmas :
lemma_vocabulary = annotated_corpus.create_vocabulary(attributes="lemma", 
                                                      filters=[mangoes.corpus.remove_elements(nltk.corpus.stopwords.words('french')), 
                                                               punctuation_filter, 
                                                               mangoes.corpus.truncate(20)])
print("Lemma vocabulary :")
print(lemma_vocabulary.words)

# creating a vocabulary of (lemma, POS) pairs :
pos_lemma_vocabulary = annotated_corpus.create_vocabulary(attributes=("lemma", "POS"), 
                                                          filters=[mangoes.corpus.remove_elements(string.punctuation, attribute="lemma"), 
                                                                   mangoes.corpus.truncate(10)])
print("\nPOS+Lemma vocabulary :")
print(pos_lemma_vocabulary.words)

Lemma vocabulary :
['ilimp', '*Meillet', 'être', '*des', 'avoir', 'Parry', 'linguiste', '*du', 'étude', '*au', 'Antoine', 'premier', 'cours', 'celui', 'thèse', 'tradition', 'Moulins', '*Châteaumeillant', 'Cher', 'français']

POS+Lemma vocabulary :
[Token(lemma='le', POS='DET'), Token(lemma='de', POS='P'), Token(lemma='à', POS='P'), Token(lemma='ilimp', POS='CLS'), Token(lemma='son', POS='DET'), Token(lemma='un', POS='DET'), Token(lemma='*Meillet', POS='NPP'), Token(lemma='et', POS='CC'), Token(lemma='*des', POS='P+D'), Token(lemma='en', POS='P')]
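
The idea of building a vocabulary from a combination of attributes can be sketched in plain Python: each annotated token is projected onto the chosen attributes before counting. The `project` helper and the toy tokens below are made up for this illustration and do not come from the corpus or from mangoes.

```python
from collections import Counter, namedtuple

Token = namedtuple("Token", ["form", "lemma", "POS"])


def project(token, attributes):
    """Reduce a token to one attribute (a string) or to a tuple of attributes."""
    if isinstance(attributes, str):
        return getattr(token, attributes)
    return tuple(getattr(token, name) for name in attributes)


# Toy annotated tokens (hypothetical data)
tokens = [Token("né", "naître", "VPP"),
          Token("le", "le", "DET"),
          Token("les", "le", "DET"),
          Token("le", "le", "CLO")]

lemma_counts = Counter(project(t, "lemma") for t in tokens)
lemma_pos_counts = Counter(project(t, ("lemma", "POS")) for t in tokens)

print(lemma_counts.most_common())
print(lemma_pos_counts.most_common())
```

Counting on `lemma` merges the three occurrences of "le", while counting on `("lemma", "POS")` keeps the determiner and the clitic apart.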


Then, we can count co-occurrences between these entities.

In [19]:
cc = mangoes.counting.count_cooccurrence(corpus=annotated_corpus, words=lemma_vocabulary, context=pos_lemma_vocabulary)
display(cc.to_df())

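| | le (DET) | de (P) | à (P) | ilimp (CLS) | son (DET) | un (DET) | \*Meillet (NPP) | et (CC) | \*des (P+D) | en (P) |
|---|---|---|---|---|---|---|---|---|---|---|
| ilimp | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| \*Meillet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| être | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| \*des | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| avoir | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| Parry | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| linguiste | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| \*du | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| étude | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| \*au | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |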


If your corpus is annotated with dependencies between words, you can also use another kind of context, based on these dependencies :

## Dependency-based Context

*Reference : Levy, O., & Goldberg, Y. (2014, June). Dependency-Based Word Embeddings. In ACL (2) (pp. 302-308).*

As an alternative to windows (or bag of words), a dependency-based context defines the contexts of a word from the syntactic 
relations it participates in.  

To use such a context, your corpus has to be preprocessed to provide dependency annotations.

### Example

Let's have a look at the contexts of each word of this sentence : "Australian scientist discovers star with telescope".

In [20]:
conllu_string_ud = ["1	australian	australian	ADJ	JJ	_	2	amod	_	_",
                    "2	scientist	scientist	NOUN	NN	_	3	nsubj	_	_",
                    "3	discovers	discover	VERB	VBZ	_	0	root	_	_",
                    "4	star	star	NOUN	NN	_	3	dobj	_	_",
                    "5	with	with	ADP	IN	_	6	case	_	_",
                    "6	telescope	telescope	NOUN	NN	_	3	nmod	_	_"]

corpus = mangoes.Corpus(conllu_string_ud, reader=mangoes.corpus.CONLLU, lower=True)
sentence = next(corpus.reader.sentences())

In [21]:
print("{:30} {}".format("WORD", "CONTEXTS"))
for word, contexts in zip(sentence, mangoes.context.DependencyBasedContext(labels=True)(sentence)):
    print("{:30} {}".format(word.form, ' '.join(contexts)))

WORD                           CONTEXTS
australian                     scientist/amod-
scientist                      australian/amod discovers/nsubj-
discovers                      scientist/nsubj star/dobj telescope/nmod
star                           discovers/dobj-
with                           telescope/case-
telescope                      with/case discovers/nmod-


The `DependencyBasedContext` class takes 2 parameters :
- the `labels` parameter defines whether or not the contexts include the labels of the dependency relations

In [22]:
print("{:30} {:50} {}".format("WORD", "CONTEXTS WITH LABELS", "CONTEXTS WITHOUT LABELS"))
for word, contexts_w_labels, contexts_no_labels in zip(sentence, 
                                                       mangoes.context.DependencyBasedContext(labels=True)(sentence),
                                                       mangoes.context.DependencyBasedContext(labels=False)(sentence)):
    print("{:30} {:50} {}".format(word.form, ' '.join(contexts_w_labels), ' '.join(contexts_no_labels)))

WORD                           CONTEXTS WITH LABELS                               CONTEXTS WITHOUT LABELS
australian                     scientist/amod-                                    scientist
scientist                      australian/amod discovers/nsubj-                   australian discovers
discovers                      scientist/nsubj star/dobj telescope/nmod           telescope star scientist
star                           discovers/dobj-                                    discovers
with                           telescope/case-                                    telescope
telescope                      with/case discovers/nmod-                          with discovers


- the `collapse` parameter defines whether or not relations including a preposition should be collapsed : 

In [23]:
print("{:30} {:50} {}".format("WORD", "COLLAPSED", "NOT COLLAPSED"))
for word, contexts_collapsed, contexts_no_collapsed in zip(sentence, 
                                                           mangoes.context.DependencyBasedContext(collapse=True, labels=True)(sentence),
                                                           mangoes.context.DependencyBasedContext(collapse=False, labels=True)(sentence)):
    print("{:30} {:50} {}".format(word.form, ' '.join(contexts_collapsed), ' '.join(contexts_no_collapsed)))

WORD                           COLLAPSED                                          NOT COLLAPSED
australian                     scientist/amod-                                    scientist/amod-
scientist                      australian/amod discovers/nsubj-                   australian/amod discovers/nsubj-
discovers                      scientist/nsubj star/dobj telescope/case_with      scientist/nsubj star/dobj telescope/nmod
star                           discovers/dobj-                                    discovers/dobj-
with                                                                              telescope/case-
telescope                      discovers/case_with-                               with/case discovers/nmod-
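
To show how such contexts can be derived from the dependency annotations, here is a pure-Python sketch that works directly on CoNLL-U lines. It is a hypothetical re-implementation written for illustration, not the mangoes code: the helper names are made up, and the collapsing convention (dropping the `case` relation and relabelling its governor's relation as `case_<preposition>`) is inferred from the outputs shown above.

```python
def parse_conllu(lines):
    """Read the id, form, head and relation columns of tab-separated CoNLL-U lines."""
    tokens = []
    for line in lines:
        fields = line.split("\t")
        tokens.append({"id": int(fields[0]), "form": fields[1],
                       "head": int(fields[6]), "deprel": fields[7]})
    return tokens


def format_context(form, label, labels, inverse):
    """Render a context as 'form/label' ('form/label-' when seen from the dependent)."""
    if not labels:
        return form
    return "{}/{}{}".format(form, label, "-" if inverse else "")


def dependency_contexts(tokens, labels=True, collapse=False):
    """Return, for each token, the list of its dependency-based contexts."""
    by_id = {token["id"]: token for token in tokens}
    contexts = {token["id"]: [] for token in tokens}
    for token in tokens:
        if token["head"] == 0:
            continue  # the root has no head
        label = token["deprel"]
        if collapse:
            if label == "case":
                continue  # the preposition is merged into its noun's relation
            prepositions = [t for t in tokens
                            if t["head"] == token["id"] and t["deprel"] == "case"]
            if prepositions:
                label = "case_" + prepositions[0]["form"]
        head = by_id[token["head"]]
        # The head sees its dependent, and the dependent sees its head (inverse relation)
        contexts[head["id"]].append(format_context(token["form"], label, labels, False))
        contexts[token["id"]].append(format_context(head["form"], label, labels, True))
    return [contexts[token["id"]] for token in tokens]


conllu_lines = ["1\taustralian\taustralian\tADJ\tJJ\t_\t2\tamod\t_\t_",
                "2\tscientist\tscientist\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
                "3\tdiscovers\tdiscover\tVERB\tVBZ\t_\t0\troot\t_\t_",
                "4\tstar\tstar\tNOUN\tNN\t_\t3\tdobj\t_\t_",
                "5\twith\twith\tADP\tIN\t_\t6\tcase\t_\t_",
                "6\ttelescope\ttelescope\tNOUN\tNN\t_\t3\tnmod\t_\t_"]

tokens = parse_conllu(conllu_lines)
with_labels = dependency_contexts(tokens, labels=True, collapse=False)
collapsed = dependency_contexts(tokens, labels=True, collapse=True)
without_labels = dependency_contexts(tokens, labels=False, collapse=False)

for token, contexts in zip(tokens, collapsed):
    print("{:30} {}".format(token["form"], " ".join(contexts)))
```

On this sentence, the sketch yields the same labelled contexts as the cells above; the order of the contexts within a line is an implementation detail and may differ from what mangoes returns.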
