![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

# Semantic Models

# Table of Contents
* [Objectives](#Objectives)
* [Corpus](#Corpus)
* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)
* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)
* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)

# Objectives

In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.

The main objectives of this session are:
* Understand the models and their differences
* Learn to use some of the most popular NLP libraries

# Corpus

We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20  newsgroup dataset contains 20k documents that belong to 20 topics.

We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)

In [1]:
from sklearn.datasets import fetch_20newsgroups

# We filter only some categories, otherwise we have 20 categories
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
# We remove metadata to avoid bias in the classification
newsgroups_train = fetch_20newsgroups(subset='train', 
                                      remove=('headers', 'footers', 'quotes'), 
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),
                                    categories=categories)


# Obtain a vector

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)

vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_train.shape

(2034, 2807)

# Converting Scikit-learn to gensim

Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*. Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.

You should install first:

* *gensim*. Run 'conda install gensim' in a terminal.
* *python-Levenshtein*. Run 'conda install python-Levenshtein' in a terminal

In [2]:
from gensim import matutils

vocab = vectorizer.get_feature_names()

dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])
corpus_tfidf = matutils.Sparse2Corpus(vectors_train)



# Latent Dirichlet Allocation (LDA)

Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*.

In [3]:
from gensim.models.ldamodel import LdaModel

# It takes a long time

#  train the lda model, choosing number of topics equal to 4
lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)

In [4]:
# check the topics
lda.print_topics(4)

[(0,
  '0.003*"activity" + 0.002*"color" + 0.002*"complex" + 0.002*"netters" + 0.002*"objects" + 0.002*"eyes" + 0.002*"direct" + 0.002*"license" + 0.002*"apple" + 0.002*"missions"'),
 (1,
  '0.003*"aware" + 0.003*"objects" + 0.003*"brian" + 0.003*"claiming" + 0.003*"pain" + 0.003*"men" + 0.003*"obtained" + 0.003*"guns" + 0.003*"id" + 0.003*"company"'),
 (2,
  '0.005*"allow" + 0.005*"discuss" + 0.005*"certain" + 0.004*"member" + 0.004*"pounds" + 0.004*"compared" + 0.004*"greater" + 0.004*"fuel" + 0.004*"manipulation" + 0.003*"edited"'),
 (3,
  '0.003*"forces" + 0.003*"profit" + 0.003*"frank" + 0.003*"platform" + 0.003*"led" + 0.003*"friends" + 0.003*"president" + 0.002*"determine" + 0.002*"mechanism" + 0.002*"301"')]

In [6]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> book
    Downloading collection 'book'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package chat80 to /root/nltk_data...
       |   Unzipping corpora/chat80.zip.
       | Downloading package cmudict to /root/nltk_data...
       |   Unzipping corpora/cmudict.zip.
       | Downloading package conll2000 to /root/nltk_data...
       |   Unzipping corpora/conll2000.zip.
       | Downloading package conll2002 to /root/nltk_data...
       |   Unzipping corpora/conll2002.zip.
       | Downloading package dependency_

True

Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim.

In [7]:
# import the gensim.corpora module to generate dictionary
from gensim import corpora

from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk import RegexpTokenizer

import string

def preprocess(words):
    tokenizer = RegexpTokenizer('[A-Z]\w+')
    tokens = [w.lower() for w in tokenizer.tokenize(words)]
    stoplist = stopwords.words('english')
    tokens_stop = [w for w in tokens if w not in stoplist]
    punctuation = set(string.punctuation)
    tokens_clean = [w for w in tokens_stop if  w not in punctuation]
    return tokens_clean

#words = preprocess(newsgroups_train.data)
#dictionary = corpora.Dictionary(newsgroups_train.data)

texts = [preprocess(document) for document in newsgroups_train.data]

dictionary = corpora.Dictionary(texts)

In [8]:
# You can save the dictionary
dictionary.save('newsgroup.dict.texts')

print(dictionary)

Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)


In [9]:
# Generate a list of docs, where each doc is a list of words

docs = [preprocess(doc) for doc in newsgroups_train.data]

In [10]:
# import the gensim.corpora module to generate dictionary
from gensim import corpora

dictionary = corpora.Dictionary(docs)

In [11]:
# We can print the dictionary, it is a mappying of id and tokens

print(dictionary)

Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)


In [12]:
# construct the corpus representing each document as a bag-of-words (bow) vector
corpus = [dictionary.doc2bow(doc) for doc in docs]

In [13]:
from gensim.models import TfidfModel

# calculate tfidf
tfidf_model = TfidfModel(corpus)
corpus_tfidf = tfidf_model[corpus]

In [14]:
#print tf-idf of first document
print(corpus_tfidf[0])

[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]


In [15]:
from gensim.models.ldamodel import LdaModel

# train the lda model, choosing number of topics equal to 4, it takes a long time

lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)

In [16]:
# check the topics
lda_model.print_topics(4)

[(0,
  '0.007*"islam" + 0.006*"ns" + 0.005*"zoroastrians" + 0.005*"khomeini" + 0.005*"ssrt" + 0.005*"samaritan" + 0.005*"yayayay" + 0.004*"bull" + 0.004*"gerald" + 0.004*"septuagint"'),
 (1,
  '0.010*"baptist" + 0.010*"koresh" + 0.009*"bible" + 0.008*"plane" + 0.007*"bob" + 0.005*"shag" + 0.005*"scarlet" + 0.004*"tootsie" + 0.004*"kinda" + 0.004*"captain"'),
 (2,
  '0.010*"mary" + 0.008*"god" + 0.007*"moon" + 0.007*"western" + 0.007*"jeff" + 0.006*"joy" + 0.006*"jesus" + 0.006*"lucky" + 0.006*"joseph" + 0.006*"davidian"'),
 (3,
  '0.010*"whatever" + 0.007*"unix" + 0.007*"thanks" + 0.006*"phobos" + 0.006*"unfortunately" + 0.006*"martian" + 0.005*"hi" + 0.005*"russian" + 0.005*"rayshade" + 0.004*"would"')]

In [17]:
# check the lsa vector for the first document
corpus_lda = lda_model[corpus_tfidf]
print(corpus_lda[0])

[(0, 0.084204), (1, 0.7040298), (2, 0.08284816), (3, 0.12891802)]


In [18]:
#predict topics of a new doc
new_doc = "God is love and God is the Lord"
#transform into BOW space
bow_vector = dictionary.doc2bow(preprocess(new_doc))
print([(dictionary[id], count) for id, count in bow_vector])

[('lord', 1), ('god', 2)]


In [19]:
#transform into LDA space
lda_vector = lda_model[bow_vector]
print(lda_vector)

[(0, 0.06554373), (1, 0.06262062), (2, 0.8093011), (3, 0.06253459)]


In [20]:
# print the document's single most prominent LDA topic
print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))

0.010*"mary" + 0.008*"god" + 0.007*"moon" + 0.007*"western" + 0.007*"jeff" + 0.006*"joy" + 0.006*"jesus" + 0.006*"lucky" + 0.006*"joseph" + 0.006*"davidian"


In [21]:
lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]
print(lda_vector_tfidf)
# print the document's single most prominent LDA topic
print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))

[(0, 0.11036819), (1, 0.104191266), (2, 0.6814583), (3, 0.10398224)]
0.010*"mary" + 0.008*"god" + 0.007*"moon" + 0.007*"western" + 0.007*"jeff" + 0.006*"joy" + 0.006*"jesus" + 0.006*"lucky" + 0.006*"joseph" + 0.006*"davidian"


# Latent Semantic Indexing (LSI)

In [22]:
from gensim.models.lsimodel import LsiModel

#It takes a long time

# train the lsi model, choosing number of topics equal to 20


lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)

In [23]:
# check the topics
lsi_model.print_topics(4)

[(0,
  '-0.769*"god" + -0.346*"jesus" + -0.235*"bible" + -0.203*"christian" + -0.149*"christians" + -0.107*"christ" + -0.089*"well" + -0.085*"koresh" + -0.081*"kent" + -0.080*"christianity"'),
 (1,
  '-0.863*"thanks" + -0.255*"please" + -0.160*"hello" + -0.153*"hi" + 0.123*"god" + -0.111*"sorry" + -0.087*"could" + -0.075*"windows" + -0.066*"jpeg" + -0.063*"gif"'),
 (2,
  '-0.783*"well" + 0.229*"god" + -0.166*"yes" + 0.154*"thanks" + -0.132*"ico" + -0.131*"tek" + -0.129*"beauchaine" + -0.129*"queens" + -0.129*"bronx" + -0.128*"manhattan"'),
 (3,
  '-0.336*"ico" + 0.335*"well" + -0.334*"tek" + -0.329*"beauchaine" + -0.329*"queens" + -0.329*"bronx" + -0.326*"manhattan" + -0.306*"com" + -0.305*"bob" + -0.072*"god"')]

In [24]:
# check the lsi vector for the first document
print(corpus_tfidf[0])

[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]


# References

* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)
* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)

## Licence

The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© Carlos A. Iglesias, Universidad Politécnica de Madrid.