Natural Language Processing  
author: D.Thébault

Based on NLP Demystified (YouTube) https://nlpdemystified.org


# Part I: Fundamentals of NLP

## 2. **Basic vectorization**

### 2.0. **Reminders**

In [39]:
# Tokenization

import spacy
nlp = spacy.load('en_core_web_sm')

sentence = "He didn't want to pay $20 for the book."
doc = nlp(sentence)
print(doc)
print([t.text for t in doc]) # iterate over the Doc object
print(doc[0:3]) # Slicing
print([(t.text, t.i) for t in doc])

He didn't want to pay $20 for the book.
['He', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'the', 'book', '.']
He didn't
[('He', 0), ('did', 1), ("n't", 2), ('want', 3), ('to', 4), ('pay', 5), ('$', 6), ('20', 7), ('for', 8), ('the', 9), ('book', 10), ('.', 11)]


In [40]:
# Case Folding

print([t.lower_ for t in doc])

['he', 'did', "n't", 'want', 'to', 'pay', '$', '20', 'for', 'the', 'book', '.']


In [41]:
# Stop Words

import spacy
nlp = spacy.load('en_core_web_sm')
# Keep only the tokens that are not stop words and join them back into a string

sentence = "I saw the movie last night. I was not amused."
doc = nlp(sentence)

filtered_sentence = ' '.join([token.text for token in doc if not token.is_stop])
print(filtered_sentence)

saw movie night . amused .


In [42]:
# Stemming

# spaCy does not provide a stemmer; it relies on lemmatization instead (see the next cell).

In [43]:
# Lemmatization

import spacy
nlp = spacy.load('en_core_web_sm')

doc = nlp("Did, Done, Doing")
[(t.text, t.lemma_) for t in doc]

[('Did', 'do'), (',', ','), ('Done', 'do'), (',', ','), ('Doing', 'do')]

In [44]:
# Part-of-Speech tagging

import spacy

# Loads the small English language model provided by SpaCy for NLP tasks.
nlp = spacy.load('en_core_web_sm')

# Example sentence
sentence = "John is a watching an old movie at a cinema."

# Creating a Doc object that contains linguistic annotations for the text.
doc = nlp(sentence)

# POS (coarse-grained) tags are available through the pos_ attribute
print([(t.text, t.pos_) for t in doc])

# To get a description of a POS tag, use the spacy.explain() function
print(spacy.explain('PROPN'))

# Fine-grained tags are available through the tag_ attribute;
# they carry more detail than pos_ (tense, type of pronoun...)
print([(t.text, t.tag_) for t in doc])

print(spacy.explain("VBD"))
print(spacy.explain("NNP"))

[('John', 'PROPN'), ('is', 'AUX'), ('a', 'DET'), ('watching', 'VERB'), ('an', 'DET'), ('old', 'ADJ'), ('movie', 'NOUN'), ('at', 'ADP'), ('a', 'DET'), ('cinema', 'NOUN'), ('.', 'PUNCT')]
proper noun
[('John', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('watching', 'VBG'), ('an', 'DT'), ('old', 'JJ'), ('movie', 'NN'), ('at', 'IN'), ('a', 'DT'), ('cinema', 'NN'), ('.', '.')]
verb, past tense
noun, proper singular


In [45]:
# Named Entity Recognition (NER)

s = "Volkswagen is developping an electric sedan which could potentially come to America next fall."

doc = nlp(s)

# To access named entities, we use the token attribute ent_type_ here;
# spaCy offers other ways to do this as well.

print([(t.text, t.ent_type_) for t in doc])

print(spacy.explain('GPE'))
print(spacy.explain('ORG'))
print(spacy.explain('DATE'))

# You can also check whether a token is part of an entity before printing it by checking the attribute ent_type (without underscore)
print([(t.text, t.ent_type_) for t in doc if t.ent_type != 0])

# Another way is through the ents property of the Doc object itself (note that "next fall" is a single entity this time).
print([(ent.text, ent.label_) for ent in doc.ents])

[('Volkswagen', 'ORG'), ('is', ''), ('developping', ''), ('an', ''), ('electric', ''), ('sedan', ''), ('which', ''), ('could', ''), ('potentially', ''), ('come', ''), ('to', ''), ('America', 'GPE'), ('next', 'DATE'), ('fall', 'DATE'), ('.', '')]
Countries, cities, states
Companies, agencies, institutions, etc.
Absolute or relative dates or periods
[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next', 'DATE'), ('fall', 'DATE')]
[('Volkswagen', 'ORG'), ('America', 'GPE'), ('next fall', 'DATE')]


[spaCy visualizers](https://spacy.io/usage/visualizers)

In [46]:
from spacy import displacy

# We need to set the 'jupyter' argument to True in order to output
# the visualization directly. Otherwise, you'll get raw HTML.
# style='ent' selects the entity recognition view.

displacy.render(doc, style='ent', jupyter=True)

print(spacy.explain('ORG'))

Companies, agencies, institutions, etc.


In [47]:
# Dependency Parsing

import spacy

# Load spaCy's small English language model
nlp = spacy.load("en_core_web_sm")

# Sentence to analyse
sentence = "She enrolled in the course at the university."

# Analyze the sentence
doc = nlp(sentence)

# Display the dependency relations
for token in doc:
    print(f"{token.text} ({token.dep_}) <-- {token.head.text}")

print(spacy.explain("nsubj"))

# Let's visualize a dependency parse
displacy.render(doc, style='dep', jupyter=True)

She (nsubj) <-- enrolled
enrolled (ROOT) <-- enrolled
in (prep) <-- enrolled
the (det) <-- course
course (pobj) <-- in
at (prep) <-- course
the (det) <-- university
university (pobj) <-- at
. (punct) <-- enrolled
nominal subject


In [48]:
# The general Matcher is one of multiple matcher objects
# included with spaCy.
from spacy.matcher import Matcher
# We initialize the Matcher with the spaCy vocab object, which contains
# words along with their labels and tags.
matcher = Matcher(nlp.vocab)
s = "I want to book a hotel room."
doc = nlp(s)
# Patterns are expressed as an ordered sequence of token descriptions.
# Here, we're looking to match occurrences starting with the literal text 'book', followed by
# a determiner (DET) POS tag such as "the" or "a", then a noun POS tag.
# The OP key is a quantifier that sets how many times a token pattern may match.

# Here, the DET POS (marked with '?') will match 0 or 1 times (i.e. the determiner is optional), and 
# the NOUN POS (marked with '+') will match 1 or more times (i.e., at least one noun is required).
# See this link for more information.
# https://spacy.io/usage/rule-based-matching#quantifiers

pattern = [
    {'TEXT': 'book'},
    {'POS': 'DET', 'OP': '?'},
    {'POS': 'NOUN', 'OP': '+'}
]

# So, the pattern will match sequences that start with the word "book", 
# optionally followed by a determiner, and then followed by one or more nouns.
# We give our pattern a label and pass it to the matcher.
matcher.add('USER_INTENT', [pattern])

# Run the matcher over the doc.
matches = matcher(doc)

# For each match, the matcher returns a tuple specifying a match id, start,
# and end of the match.
print("Matches: ", [doc[start:end].text for match_id, start, end in matches])

doc = nlp("I want to book a flight and hotel room in Berlin.")
for noun_phrase in doc.noun_chunks:
  print("phrase: {}, root head: {}".format(noun_phrase, noun_phrase.root.head))

Matches:  ['book a hotel', 'book a hotel room']
phrase: I, root head: want
phrase: a flight and hotel room, root head: book
phrase: Berlin, root head: in


**For an ML algorithm, we need to translate our pre-processed text into vectors.**  

**This is called <u>Vectorization</u>**

A vector is simply an array or a list of numbers.

To use our text with an ML algorithm, we need  
to turn it into numbers, which also lets us measure similarity between documents.

**Feature:**  

Any property in your data you think is useful for making predictions or explaining some relationship.

Examples in NLP:
- word count
- document age
- author id
- ...

At the end, we want a **Matrix** where:  

- Each row represents a document, also called an instance or a feature vector,

- Each column is a feature.  

A document can be a sentence, a tweet, an entire book...
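
As a minimal sketch of such a matrix, the snippet below builds one row per document from two hand-picked features (token count and character count). The feature choice, the naive whitespace tokenization and the helper name `featurize` are assumptions made for illustration only.

```python
# Hypothetical example: two hand-picked features per document.
documents = [
    "Red Bull drops hint on F1 engine.",
    "Aston Martin announces sponsor.",
]

def featurize(text):
    """Return the feature vector [token count, character count] for one document."""
    tokens = text.split()  # naive whitespace tokenization, for illustration only
    return [len(tokens), len(text)]

# Each row is a document (a feature vector); each column is a feature.
feature_matrix = [featurize(doc) for doc in documents]

for row in feature_matrix:
    print(row)
```

Any numeric property of a document could be added as another column in the same way.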

### 2.1. **Bag-of-Words** (BOW)

**Describing documents by word occurrences**  

Basic idea: the more vocabulary two documents share, the more likely they are to belong to the same class.  

This is called a bag-of-words because word order is discarded: no grammar and no syntax are taken into account.
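
Before turning to scikit-learn, the idea can be sketched from scratch with the standard library. The toy documents and the whitespace tokenization below are assumptions for illustration; a real pipeline would use a proper tokenizer.

```python
from collections import Counter

# Toy documents (assumed for illustration).
docs = ["the cat sat on the mat", "the dog sat"]

# Naive whitespace tokenization.
tokenized = [d.split() for d in docs]

# The vocabulary is the sorted set of all tokens seen in the corpus.
vocab = sorted({tok for toks in tokenized for tok in toks})

# One count vector per document, with columns ordered like the vocabulary.
bow_rows = []
for toks in tokenized:
    counts = Counter(toks)
    bow_rows.append([counts[word] for word in vocab])

print(vocab)
for row in bow_rows:
    print(row)
```

Note that the position of each word inside its document plays no role: only the counts survive.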

In [49]:
import spacy

from scipy import spatial
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [50]:
# A corpus of sentences.
corpus = [
  "Red Bull drops hint on F1 engine.",
  "Honda exits F1, leaving F1 partner Red Bull.",
  "Hamilton eyes record eighth F1 title.",
  "Aston Martin announces sponsor."
]

<u>Frequency BOW</u>  

- Each column of the matrix represents a word in the vocabulary, with its frequency in the document.

- Each row is a document.

In [51]:
# CountVectorizer takes a collection of text documents and creates 
# a matrix of token counts

vectorizer = CountVectorizer()

In [52]:
# The fit_transform method does two things:
# 1. It learns a vocabulary dictionary from the corpus.
# 2. It returns a matrix where each row represents a document and 
# each column represents a token

bow = vectorizer.fit_transform(corpus)

print(bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 24 stored elements and shape (4, 20)>
  Coords	Values
  (0, 17)	1
  (0, 2)	1
  (0, 3)	1
  (0, 10)	1
  (0, 14)	1
  (0, 8)	1
  (0, 5)	1
  (1, 17)	1
  (1, 2)	1
  (1, 8)	2
  (1, 11)	1
  (1, 6)	1
  (1, 12)	1
  (1, 15)	1
  (2, 8)	1
  (2, 9)	1
  (2, 7)	1
  (2, 16)	1
  (2, 4)	1
  (2, 19)	1
  (3, 1)	1
  (3, 13)	1
  (3, 0)	1
  (3, 18)	1


In [53]:
# View features (tokens).
print(vectorizer.get_feature_names_out())

# View vocabulary dictionary.
vectorizer.vocabulary_

['announces' 'aston' 'bull' 'drops' 'eighth' 'engine' 'exits' 'eyes' 'f1'
 'hamilton' 'hint' 'honda' 'leaving' 'martin' 'on' 'partner' 'record'
 'red' 'sponsor' 'title']


{'red': 17,
 'bull': 2,
 'drops': 3,
 'hint': 10,
 'on': 14,
 'f1': 8,
 'engine': 5,
 'honda': 11,
 'exits': 6,
 'leaving': 12,
 'partner': 15,
 'hamilton': 9,
 'eyes': 7,
 'record': 16,
 'eighth': 4,
 'title': 19,
 'aston': 1,
 'martin': 13,
 'announces': 0,
 'sponsor': 18}

<u>Binary BOW</u>  

- Each column of the matrix represents a word in the vocabulary (0 absent, 1 present)

- Each row is a document.
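
The difference between the two variants can be sketched on the second sentence of the corpus, where "F1" appears twice. The tokens are lower-cased by hand and the five-word vocabulary is a hypothetical subset chosen to keep the example short.

```python
from collections import Counter

# Second corpus sentence, lower-cased and stripped of punctuation by hand.
tokens = "honda exits f1 leaving f1 partner red bull".split()
counts = Counter(tokens)

# Hypothetical subset of the vocabulary, including one absent word ("engine").
vocab = ["bull", "engine", "f1", "honda", "red"]

frequency_row = [counts[w] for w in vocab]                 # how many times each word occurs
binary_row = [1 if counts[w] > 0 else 0 for w in vocab]    # present (1) or absent (0)

print(frequency_row)
print(binary_row)
```

The two rows differ only in the "f1" column, where the frequency variant stores 2 and the binary variant stores 1.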

CountVectorizer supports using a custom tokenizer.  
For every document, it will call your tokenizer and  
expect a list of tokens returned.  
We'll create a simple callback below which has spaCy  
tokenize and filter tokens, and then return them.

In [55]:
# As usual, we start by importing spaCy and loading a statistical model.
nlp = spacy.load('en_core_web_sm')

# Create a tokenizer callback using spaCy under the hood. Here, we tokenize
# the passed-in text and return the tokens, filtering out punctuation.
def spacy_tokenizer(doc):
  return [t.text for t in nlp(doc) if not t.is_punct]

This time, we instantiate CountVectorizer with our custom tokenizer 
(spacy_tokenizer), turn off case-folding,  
and also set the binary parameter to True  
so we simply get 1s and 0s marking token presence rather than token frequency.

In [56]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
bow = vectorizer.fit_transform(corpus)



In [59]:
# View features (tokens).
print(vectorizer.get_feature_names_out())

['Aston' 'Bull' 'F1' 'Hamilton' 'Honda' 'Martin' 'Red' 'announces' 'drops'
 'eighth' 'engine' 'exits' 'eyes' 'hint' 'leaving' 'on' 'partner' 'record'
 'sponsor' 'title']


In [61]:
corpus

['Red Bull drops hint on F1 engine.',
 'Honda exits F1, leaving F1 partner Red Bull.',
 'Hamilton eyes record eighth F1 title.',
 'Aston Martin announces sponsor.']

In [60]:
print(bow)

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 24 stored elements and shape (4, 20)>
  Coords	Values
  (0, 6)	1
  (0, 1)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 2)	1
  (0, 10)	1
  (1, 6)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 11)	1
  (1, 14)	1
  (1, 16)	1
  (2, 2)	1
  (2, 3)	1
  (2, 12)	1
  (2, 17)	1
  (2, 9)	1
  (2, 19)	1
  (3, 0)	1
  (3, 5)	1
  (3, 7)	1
  (3, 18)	1


To get a dense array representation of our sparse matrix, use toarray.

In [62]:
print('A dense representation like we saw in the slides.')
print(bow.toarray())
print()
print('Indexing and slicing.')
print(bow[0])
print()
print(bow[0:2])

A dense representation like we saw in the slides.
[[0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0]
 [0 1 1 0 1 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0]
 [0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1]
 [1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0]]

Indexing and slicing.
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 7 stored elements and shape (1, 20)>
  Coords	Values
  (0, 6)	1
  (0, 1)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 2)	1
  (0, 10)	1

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 14 stored elements and shape (2, 20)>
  Coords	Values
  (0, 6)	1
  (0, 1)	1
  (0, 8)	1
  (0, 13)	1
  (0, 15)	1
  (0, 2)	1
  (0, 10)	1
  (1, 6)	1
  (1, 1)	1
  (1, 2)	1
  (1, 4)	1
  (1, 11)	1
  (1, 14)	1
  (1, 16)	1
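
To see why a sparse representation is worthwhile, here is a rough sketch that stores only non-zero cells in a dictionary and expands them on demand. It mirrors the printed "Coords / Values" view with a handful of cells taken from the output above; it is not SciPy's actual CSR layout, and `to_dense` is a hypothetical helper.

```python
# Illustrative sparse storage: only non-zero cells are kept, keyed by (row, column).
sparse_cells = {(0, 6): 1, (0, 1): 1, (0, 8): 1, (1, 2): 1, (1, 4): 1}
n_rows, n_cols = 2, 20

def to_dense(cells, n_rows, n_cols):
    """Expand {(row, col): value} into a list of lists, with zeros everywhere else."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (r, c), value in cells.items():
        dense[r][c] = value
    return dense

dense = to_dense(sparse_cells, n_rows, n_cols)
stored = len(sparse_cells)
total = n_rows * n_cols
print(f"{stored} stored values instead of {total} cells")
```

With a realistic vocabulary of tens of thousands of words, almost every cell is zero, so the gap between stored values and total cells becomes much larger.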


With BOW we have shifted from a sequence of symbols to points in a multidimensional space that encodes some meaning of the text.  

Each feature vector in our BOW is now a point in this multidimensional space.  
This is called a **Vector Space Model** (VSM).  

Two documents which have similar vocabulary should be closer together in this space.  

This allows us to measure similarity, which is useful for:

- Relevance ranking,
- Plagiarism detection,
- Document classification
- ...



**Cosine similarity**

How do we measure similarity?  

A first measure is the dot product: it grows as two vectors point in the same direction and as their magnitudes increase.

$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1b_1 + a_2b_2 + \ldots + a_nb_n$

In [54]:
import numpy as np
a = np.array([1,2,3])
b = np.array([4,5,6])
a.dot(b)

32

But high-frequency words lead to larger dot products.  
To avoid this, we normalize by vector length: $\|v\| = \sqrt{\sum_{i=1}^n v_i^2}$,  
also known as the $L^2$ norm or Euclidean norm.  

$\frac{a \cdot b}{\|a\| \|b\|} = \cos(\theta)$  

where $\theta$ is the angle between the two vectors.
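
As a worked example, take the same vectors $a = (1, 2, 3)$ and $b = (4, 5, 6)$ used in the dot-product cell above:

$$
a \cdot b = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32, \qquad \|a\| = \sqrt{1 + 4 + 9} = \sqrt{14}, \qquad \|b\| = \sqrt{16 + 25 + 36} = \sqrt{77}
$$

$$
\cos(\theta) = \frac{32}{\sqrt{14}\,\sqrt{77}} = \frac{32}{\sqrt{1078}} \approx 0.9746
$$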

In [85]:
# Define our Cosine similarity function:
def cos_similar(a, b) -> float:
    """Compute the cosine similarity between two vectors.
    Inputs:
        a, b: 1-D NumPy arrays of the same length
    Returns:
        the cosine of the angle between a and b, as a float
    """
    a_norm = (a.transpose().dot(a))**0.5
    b_norm = (b.transpose().dot(b))**0.5
    return a.dot(b) / (a_norm * b_norm)

In [86]:
print(cos_similar(a,b))

0.9746318461970762


In [87]:
# Or use scipy spatial
from scipy import spatial
1 - spatial.distance.cosine(a, b)

0.9746318461970762

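For vectors with non-negative components, such as BOW counts, the value of $\cos(\theta)$ ranges from 0 to 1 (for general vectors it ranges from -1 to 1):
- 1 when the vectors point in the same direction,
- 0 when the vectors are orthogonal (dissimilar).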
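
The boundary cases can be checked with a small pure-Python sketch. The helper `cosine` and the toy vectors are assumptions for illustration; the third case uses negative components, which never occur in count vectors, and shows that the cosine can in general go below 0.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2], [2, 4])   # parallel vectors
orthogonal = cosine([1, 0], [0, 1])       # no shared non-zero component
opposite = cosine([1, 2], [-1, -2])       # requires negative entries

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1.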

In [91]:
print(type(bow[0].toarray()[0]))

<class 'numpy.ndarray'>


In [92]:
# The cosine method expects array_like inputs, so we need to generate
# arrays from our sparse matrix.
doc1_vs_doc2 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[1].toarray()[0])
doc1_vs_doc3 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[2].toarray()[0])
doc1_vs_doc4 = 1 - spatial.distance.cosine(bow[0].toarray()[0], bow[3].toarray()[0])

print(corpus)

print(f"Doc 1 vs Doc 2: {doc1_vs_doc2}")
print(f"Doc 1 vs Doc 3: {doc1_vs_doc3}")
print(f"Doc 1 vs Doc 4: {doc1_vs_doc4}")

['Red Bull drops hint on F1 engine.', 'Honda exits F1, leaving F1 partner Red Bull.', 'Hamilton eyes record eighth F1 title.', 'Aston Martin announces sponsor.']
Doc 1 vs Doc 2: 0.4285714285714286
Doc 1 vs Doc 3: 0.15430334996209194
Doc 1 vs Doc 4: 0.0
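
These numbers can be cross-checked by hand. For binary vectors, the dot product is the number of shared tokens and each norm is the square root of the number of distinct tokens, so the cosine reduces to a set computation. The sketch below assumes a simplistic tokenizer (whitespace split, then stripping `.` and `,`) as a stand-in for spaCy; `token_set` and `binary_cosine` are hypothetical helper names.

```python
import math

def token_set(text):
    """Split on whitespace and strip surrounding '.' and ','; a stand-in for the spaCy tokenizer."""
    return {tok.strip(".,") for tok in text.split()}

def binary_cosine(text_a, text_b):
    """Cosine similarity of two binary BOW vectors, computed from token sets."""
    a, b = token_set(text_a), token_set(text_b)
    return len(a & b) / math.sqrt(len(a) * len(b))

doc1 = "Red Bull drops hint on F1 engine."
doc2 = "Honda exits F1, leaving F1 partner Red Bull."
doc3 = "Hamilton eyes record eighth F1 title."
doc4 = "Aston Martin announces sponsor."

sim_12 = binary_cosine(doc1, doc2)  # 3 shared tokens, 7 distinct tokens each
sim_13 = binary_cosine(doc1, doc3)  # 1 shared token ("F1"), 7 and 6 distinct tokens
sim_14 = binary_cosine(doc1, doc4)  # no shared token

print(sim_12, sim_13, sim_14)
```

Doc 1 and Doc 2 share "Red", "Bull" and "F1", which gives 3/7; Doc 1 and Doc 3 share only "F1", which gives 1/√42.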


Another approach is using scikit-learn's ```cosine_similarity``` which  
computes the metric between multiple vectors.  
Here, we pass it our BOW and get a matrix of cosine similarities between each document.


In [93]:
# cosine_similarity can take either array-likes or sparse matrices.
print(cosine_similarity(bow))

[[1.         0.42857143 0.15430335 0.        ]
 [0.42857143 1.         0.15430335 0.        ]
 [0.15430335 0.15430335 1.         0.        ]
 [0.         0.         0.         1.        ]]
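
One way to read such a matrix is to look up, for each document, its nearest neighbour. The sketch below copies the rounded values from the output above into a plain list of lists; `most_similar` is a hypothetical helper, and ties are resolved in favour of the lowest index.

```python
# Pairwise cosine similarities copied from the output above (rounded).
similarities = [
    [1.0, 0.42857143, 0.15430335, 0.0],
    [0.42857143, 1.0, 0.15430335, 0.0],
    [0.15430335, 0.15430335, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

def most_similar(matrix, i):
    """Return (index, score) of the document most similar to document i, ignoring i itself."""
    candidates = [(score, j) for j, score in enumerate(matrix[i]) if j != i]
    best_score, best_j = max(candidates, key=lambda pair: pair[0])
    return best_j, best_score

for i in range(len(similarities)):
    j, score = most_similar(similarities, i)
    print(f"Doc {i + 1} is closest to Doc {j + 1} (cosine {score:.3f})")
```

A best score of 0, as for the fourth document, means that it shares no vocabulary with any other document, so its "closest" neighbour is not meaningful.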


Drawbacks of BOW:

- Does not capture similarity between synonyms.
- No way to handle Out-of-Vocabulary (OOV) words.
- Creates sparse vectors, which can be inefficient.
- Word order information is lost.
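
Two of these drawbacks are easy to demonstrate with a toy sketch. The three-word vocabulary and the helper `bow_vector` are assumptions for illustration; the vocabulary is treated as already fitted.

```python
from collections import Counter

def bow_vector(text, vocab):
    """Count vector of `text` over a fixed vocabulary; unknown words are silently dropped."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vocab = ["bites", "dog", "man"]  # toy vocabulary, assumed already fitted

# Word order is lost: opposite meanings, identical vectors.
v1 = bow_vector("dog bites man", vocab)
v2 = bow_vector("man bites dog", vocab)
print(v1 == v2)

# Out-of-vocabulary words vanish: "cat" has no column, so it is ignored.
v3 = bow_vector("cat bites dog", vocab)
print(v3)
```

The only way to represent "cat" would be to rebuild the vocabulary, which changes the dimension of every vector.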

### 2.2. **N-grams**

Chunks of contiguous tokens.  

A 2-gram, or bigram, has two tokens per chunk.  
A 3-gram, or trigram, has three tokens per chunk.  
...

Example of tokenization into bigrams:  

"Barcelona beats Chelsea" => ["Barcelona beats", "beats Chelsea"]
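
A minimal sketch of n-gram extraction over a token list is shown below. The helper name `ngrams` is hypothetical, and joining tokens with a space is an assumption chosen to match the string format of the example above.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token chunks, each joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Barcelona beats Chelsea".split()

bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)

print(bigram_list)
print(trigram_list)
```

A sequence of k tokens yields k - n + 1 n-grams, and none when n is larger than k.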

```CountVectorizer()``` includes an ```ngram_range``` parameter to generate different n-grams.  
```ngram_range``` is specified as a (minimum, maximum) tuple.  
By default, it is set to (1, 1), which generates only unigrams.  
Setting it to (1, 2) generates both unigrams and bigrams.

In [94]:
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, 
                             lowercase=False, 
                             binary=True, 
                             ngram_range=(1,2))

bigrams = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))
print(vectorizer.vocabulary_)

['Aston' 'Aston Martin' 'Bull' 'Bull drops' 'F1' 'F1 engine' 'F1 leaving'
 'F1 partner' 'F1 title' 'Hamilton' 'Hamilton eyes' 'Honda' 'Honda exits'
 'Martin' 'Martin announces' 'Red' 'Red Bull' 'announces'
 'announces sponsor' 'drops' 'drops hint' 'eighth' 'eighth F1' 'engine'
 'exits' 'exits F1' 'eyes' 'eyes record' 'hint' 'hint on' 'leaving'
 'leaving F1' 'on' 'on F1' 'partner' 'partner Red' 'record'
 'record eighth' 'sponsor' 'title']
Number of features: 40
{'Red': 15, 'Bull': 2, 'drops': 19, 'hint': 28, 'on': 32, 'F1': 4, 'engine': 23, 'Red Bull': 16, 'Bull drops': 3, 'drops hint': 20, 'hint on': 29, 'on F1': 33, 'F1 engine': 5, 'Honda': 11, 'exits': 24, 'leaving': 30, 'partner': 34, 'Honda exits': 12, 'exits F1': 25, 'F1 leaving': 6, 'leaving F1': 31, 'F1 partner': 7, 'partner Red': 35, 'Hamilton': 9, 'eyes': 26, 'record': 36, 'eighth': 21, 'title': 39, 'Hamilton eyes': 10, 'eyes record': 27, 'record eighth': 37, 'eighth F1': 22, 'F1 title': 8, 'Aston': 0, 'Martin': 13, 'announces



In [95]:
# Setting ngram_range to (2, 2) generates only bigrams.
vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(2,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(vectorizer.vocabulary_)

['Aston Martin' 'Bull drops' 'F1 engine' 'F1 leaving' 'F1 partner'
 'F1 title' 'Hamilton eyes' 'Honda exits' 'Martin announces' 'Red Bull'
 'announces sponsor' 'drops hint' 'eighth F1' 'exits F1' 'eyes record'
 'hint on' 'leaving F1' 'on F1' 'partner Red' 'record eighth']
{'Red Bull': 9, 'Bull drops': 1, 'drops hint': 11, 'hint on': 15, 'on F1': 17, 'F1 engine': 2, 'Honda exits': 7, 'exits F1': 13, 'F1 leaving': 3, 'leaving F1': 16, 'F1 partner': 4, 'partner Red': 18, 'Hamilton eyes': 6, 'eyes record': 14, 'record eighth': 19, 'eighth F1': 12, 'F1 title': 5, 'Aston Martin': 0, 'Martin announces': 8, 'announces sponsor': 10}


In [96]:
#
# EXERCISE: Create a spacy_tokenizer callback which takes a string and returns
# a list of tokens (each token's text) with punctuation filtered out.
#
corpus = [
  "Students use their GPS-enabled cellphones to take birdview photographs of a land in order to find specific danger points such as rubbish heaps.",
  "Teenagers are enthusiastic about taking aerial photograph in order to study their neighbourhood.",
  "Aerial photography is a great way to identify terrestrial features that aren’t visible from the ground level, such as lake contours or river paths.",
  "During the early days of digital SLRs, Canon was pretty much the undisputed leader in CMOS image sensor technology.",
  "Syrian President Bashar al-Assad tells the US it will 'pay the price' if it strikes against Syria."
]

nlp = spacy.load('en_core_web_sm')

def spacy_tokenizer(doc):
  pass

In [97]:
#
# EXERCISE: Initialize a CountVectorizer object and set it to use
# your spacy_tokenizer with lower-casing off and to create a binary BOW.
#

# Instantiate a CountVectorizer object called 'vectorizer'.


# Create a binary BOW from the corpus using your CountVectorizer.

In [98]:
#
# The string below is a whole paragraph. We want to create another
# binary BOW but using the vocabulary of our *current* CountVectorizer. This means
# that words in this paragraph which AREN'T already in the vocabulary won't be
# represented. This is to illustrate how BOW can't handle out-of-vocabulary words
# unless you rebuild your whole vocabulary. Still, we'll see that if there's
# enough overlapping vocabulary, some similarity can still be picked up.
#
# Note that we call 'transform' only instead of 'fit_transform' because the
# fit step (i.e. vocabulary build) is already done and we don't want to re-fit here.
#
s = ["Teenagers take aerial shots of their neighbourhood using digital cameras sitting in old bottles which are launched via kites - a common toy for children living in the favelas. They then use GPS-enabled smartphones to take pictures of specific danger points - such as rubbish heaps, which can become a breeding ground for mosquitoes carrying dengue fever."]
new_bow = vectorizer.transform(s)

#
# EXERCISE: using the pairwise cosine_similarity method from sklearn,
# calculate the similarities between each document from the corpus against
# this new document (new_bow). HINT: You can pass two parameters to
# cosine_similarity in this case. See the docs:
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
#
# Which document is the most similar? Which is the least similar? Do the results make sense
# based on what you see?
#

In [None]:
#
# EXERCISE: In spacy_tokenizer, instead of returning the plain text,
# return the lemma_ attribute instead. How do the cosine similarity
# results differ? What if you filter out stop words as well?
#

## Modelling Overview

Types of Machine Learning algorithms vs models, evaluation...

## First steps into Classification

Classifying text using Naive Bayes; evaluation with precision and recall.

## Topic modelling

Automatically finding topics in documents using Latent Dirichlet Allocation.