# **Probabilistic Programming Project**

Ghadamiyan Lida

---

The purpose of this project is to implement the LDA (Latent Dirichlet Allocation) algorithm using PyMC, following the guidelines from the course.





In [1]:
import pandas as pd

corpus = ["I had a peanuts butter sandwich for breakfast", 
          "I like to eat almonds, peanuts and walnuts", 
          "My neighbour got a little dog yesterday",
          "Cats and dogs are mortal enemies",
          "You mustn't feed peanuts to your dog"]


In [2]:
from sklearn.datasets import fetch_20newsgroups

## **Data Preprocessing**



* Tokenization. 
* Stopword removal - done by CountVectorizer during feature extraction. 
* Lowercasing the words. 
* Removing punctuation.
* Removing short words.
* Lemmatization - reducing words to their meaningful base form (lemma).
* Stemming - reducing words to their stem by stripping suffixes.
* **Feature Extraction**



In [3]:
import string
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')
nltk.download('punkt')

stemmer = SnowballStemmer('english')

def preprocessing(corpus):

    df = pd.DataFrame({'doc':corpus})     # Converting to data frame

    data = []
    for i in range(0, len(df.index)):

        table = str.maketrans(dict.fromkeys(string.punctuation))                    # Punctuation removal
        words = (df.doc[i].translate(table)).lower() 

        words = nltk.word_tokenize(words) # Tokenization

        words_ = []
        for word in words:
            if len(word) > 2:                                                       # Short words removal
                word1 = stemmer.stem(WordNetLemmatizer().lemmatize(word, pos='v'))  # Lemmatization and stemming           
                words_.append(word1)
        data.append(words_)

    #print(data) # list of list
    corpus = pd.DataFrame({'doc':data})                                             # Saving as dataframe
    return corpus, data

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
data, data_ = preprocessing(corpus)
print([len(doc) for doc in data_])
print(len(data))

[6, 6, 5, 6, 6]
5


# **Feature Extraction**

* Bag of Words - CountVectorizer is used for building a dictionary of features, and also for tokenizing and filtering stopwords. 
* Frequencies - TfidfTransformer computes the term frequencies by dividing the number of occurrences given by CountVectorizer by the total number of words.
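
To make the term-frequency definition above concrete, here is a minimal plain-Python sketch that divides each word count by the total number of words in a document. The `tokens` list and the `term_frequencies` helper are illustrative assumptions; this shows the definition only and does not reproduce the normalization options of `TfidfTransformer`.

```python
from collections import Counter

# Hypothetical preprocessed document (a list of stemmed tokens)
tokens = ['peanut', 'butter', 'sandwich', 'peanut']

def term_frequencies(tokens):
    """Map each distinct token to count / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

tf = term_frequencies(tokens)
print(tf)
```

By construction the frequencies of one document sum to one, which makes documents of different lengths comparable.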


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

def feat_extraction(data):
    data = data['doc'].astype(str).values.tolist() # convert df to list of strings so that tfidf would work
    #print(data)
    
    # lowercase must be a boolean; the tokens were already lowercased in preprocessing
    count_vect = CountVectorizer(lowercase=False, stop_words='english')
    occurrence = count_vect.fit_transform(data)

    tf_transformer = TfidfTransformer(use_idf=False).fit(occurrence)
    tf_vect = tf_transformer.transform(occurrence)

    return count_vect, occurrence

In [6]:
voc, occ = feat_extraction(data)

#print(voc.get_feature_names())
#print(voc.vocabulary_)

vocab_ = {v : k for k, v in voc.vocabulary_.items()}
print(vocab_)


{13: 'peanut', 2: 'butter', 14: 'sandwich', 1: 'breakfast', 8: 'like', 5: 'eat', 0: 'almond', 15: 'walnut', 12: 'neighbour', 9: 'littl', 4: 'dog', 16: 'yesterday', 3: 'cat', 10: 'mortal', 6: 'enemi', 11: 'mustnt', 7: 'fee'}


In [7]:
data_occ = []
for doc in occ.toarray():
    words = []
    for poz, n in enumerate(doc):
        if n > 0:  # keep every word that occurs in the document, even if it occurs more than once
            words.append(poz)
    data_occ.append(words)

print(data_occ)


data_tf = []
for doc in data_occ:
    doc_tf = []
    for w in doc:
        doc_tf.append(w / len(vocab_))
    data_tf.append(doc_tf)

print(data_tf)


[[1, 2, 13, 14], [0, 5, 8, 13, 15], [4, 9, 12, 16], [3, 4, 6, 10], [4, 7, 11, 13]]
[[0.058823529411764705, 0.11764705882352941, 0.7647058823529411, 0.8235294117647058], [0.0, 0.29411764705882354, 0.47058823529411764, 0.7647058823529411, 0.8823529411764706], [0.23529411764705882, 0.5294117647058824, 0.7058823529411765, 0.9411764705882353], [0.17647058823529413, 0.23529411764705882, 0.35294117647058826, 0.5882352941176471], [0.23529411764705882, 0.4117647058823529, 0.6470588235294118, 0.7647058823529411]]


After trying both with data and data_tf, I concluded that working with frequencies instead of occurrences gives better results.

In [8]:
data = data_tf

# **Latent Dirichlet Allocation**

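LDA is a generative statistical model that describes each document as a mixture of several topics, and each topic as a distribution over words.

We are using CompletedDirichlet instead of Dirichlet alone because the Dirichlet variable stores only the first k-1 components; CompletedDirichlet appends the last component as the remainder, so that the vector sums to one.

Alpha and beta are the parameters of the Dirichlet priors and are initialized as constant vectors (0.5 in every component in the code below).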
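
The "completion" step can be illustrated with a small NumPy sketch. It assumes the convention that a Dirichlet variable holds only the first k-1 components; the `complete` helper and the example values are hypothetical and only mimic what CompletedDirichlet does conceptually.

```python
import numpy as np

def complete(p_partial):
    """Append the remaining probability mass so the vector sums to one."""
    p_partial = np.asarray(p_partial, dtype=float)
    return np.append(p_partial, 1.0 - p_partial.sum())

partial = np.array([0.2, 0.5])   # first k-1 components of a 3-dimensional vector
full = complete(partial)
print(full, full.sum())
```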

Let $T$ be the number of topics, $D$ the size of the corpus, and $N_d$ the number of words in document $d$. The LDA generative process is then:

1. For each topic $t = 1, \dots, T$: 

 a) Draw a distribution over words $\phi_t \sim Dirichlet(\beta)$


2. For each document $d = 1, \dots, D$: 

 a) Draw a vector of topic proportions $\theta_d \sim Dirichlet(\alpha)$
        
 b) For each word position $n = 1, \dots, N_d$: 

    i) Draw a topic assignment $z_{d, n} \sim Multinomial(\theta_d)$
        
    ii) Draw a word $w_{d, n} \sim Multinomial(\phi_{z_{d, n}})$
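
The generative process above can be simulated directly with NumPy. This is only a sketch of forward sampling under made-up sizes (`n_topics`, `n_docs`, `vocab_size_toy` and `words_per_doc` are illustrative assumptions); it does not perform inference and is independent of the PyMC model.

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics, n_docs, vocab_size_toy = 2, 3, 6   # illustrative sizes
words_per_doc = [4, 5, 4]                    # illustrative document lengths
alpha = np.ones(n_topics) * 0.5              # prior over topic proportions
beta = np.ones(vocab_size_toy) * 0.5         # prior over word distributions

# 1. One word distribution per topic, shape (n_topics, vocab_size_toy)
phi = rng.dirichlet(beta, size=n_topics)

docs = []
for d in range(n_docs):
    # 2a. Topic proportions of document d
    theta_d = rng.dirichlet(alpha)
    # 2b-i. One topic assignment per word position
    z_d = rng.choice(n_topics, size=words_per_doc[d], p=theta_d)
    # 2b-ii. One word index per position, drawn from the assigned topic
    w_d = np.array([rng.choice(vocab_size_toy, p=phi[t]) for t in z_d])
    docs.append(w_d)

print(docs)
```

Inference runs this process in reverse: the word indices are observed, and the sampler recovers plausible values for the topic assignments, the topic proportions and the word distributions.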
  




In [9]:
# The model below uses the PyMC 2 API (pm.Container, pm.CompletedDirichlet, pm.MAP, pm.MCMC)
!pip install pymc



In [10]:
import pymc as pm
import numpy as np

nr_topics = 2  
vocab_size = len(vocab_)
corpus_size = len(data)

alpha = np.ones(nr_topics)*0.5
beta = np.ones(vocab_size)*0.5
Nm = [len(doc) for doc in data]

phi_ = pm.Container([pm.Dirichlet("phi_ %s" % topic, theta = beta) for topic in range(nr_topics)])

phi = pm.Container( [pm.CompletedDirichlet("Phi %s" % topic,  phi_[topic])  for topic in range(nr_topics)] ) #word distribution per topic

theta_ = pm.Container([pm.Dirichlet("theta_ %s" % doc, theta = alpha) for doc in range(corpus_size)])

theta = pm.Container([pm.CompletedDirichlet("Theta %s" % doc, theta_[doc]) for doc in range(corpus_size)])    # topic distribution per docs


z = pm.Container([pm.Categorical('Z %i' % doc,       # topic for word per docs
                             p = theta[doc], 
                             size = Nm[doc],
                             value = np.random.randint(nr_topics, size = Nm[doc]))
                for doc in range(corpus_size)])


w = pm.Container([pm.Categorical("W %i %i" % (doc, word),     # the word from doc
                                p = pm.Lambda('Phi Z %i %i' % (doc, word), 
                                             lambda z = z[doc][word], 
                                             phi = phi: phi[z]),
                                value = data[doc][word], 
                                observed = True)
                for doc in range(corpus_size)
                for word in range(Nm[doc]) ])


model = pm.Model([phi_, theta_, theta, phi, z, w])

map_ = pm.MAP(model) # improving convergence
map_.fit()

mcmc = pm.MCMC(model)    # fitting
tr = mcmc.sample(10000, 4000)    # 10000 iterations, the first 4000 discarded as burn-in


 [-----------------100%-----------------] 10000 of 10000 complete in 16.0 sec

In [11]:
print('Topic distribution for each word:\n')
for doc in range(corpus_size):  # topic distribution per word per document
    print(mcmc.trace('Z %i' % doc)[-1]) 

Topic distribution for each word:

[1 1 1 1]
[1 1 1 1 1]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]


| Topic distribution for each word | Data |
|---|---|
| [1 1 1 1] | [peanut, butter, sandwich, for, breakfast] |
| [1 1 1 1 1] | [like, eat, almond, peanut, and, walnut] |
| [0 0 0 0] | [neighbour, get, littl, dog, yesterday] |
| [0 0 0 0] | [cat, and, dog, be, mortal, enemi] |
| [0 0 0 0] | [you, mustnt, fee, peanut, your, dog] |



In [12]:
def topics__(nr_topics, vocab_, w):
    
    for topic in range(nr_topics):
        print ("Topic % i" % topic)
        
        idxs = np.argsort(mcmc.trace('Phi %i' % topic)[:].mean(axis=0)[0], axis = 0)   # trace - iid draws from the posterior 
        words_ = [vocab_[idx] for idx in idxs]

        list_ = []

        for i, j in enumerate(mcmc.trace('Phi %i' % topic)[:].mean(axis=0)[0]):
            for id in idxs:
                if id == i:
                    list_.append(j)

        for ix in idxs[w:0:-1]:
            print("\t", words_[ix], ":", list_[ix])
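
Because `np.argsort` returns indices in ascending order, the most probable words sit at the end of the index array. The following is a minimal sketch of picking the `w` most probable words from a mean word distribution; `toy_vocab`, `toy_phi` and `top_words` are made-up names and values for illustration, not taken from the trace.

```python
import numpy as np

# Hypothetical vocabulary (index -> word) and mean word distribution for one topic
toy_vocab = {0: 'almond', 1: 'cat', 2: 'dog', 3: 'peanut', 4: 'walnut'}
toy_phi = np.array([0.10, 0.05, 0.40, 0.30, 0.15])

def top_words(phi_mean, vocab, w):
    """Return the w most probable (word, probability) pairs, most probable first."""
    order = np.argsort(phi_mean)[::-1][:w]   # reverse to descending, keep the first w
    return [(vocab[int(i)], float(phi_mean[i])) for i in order]

for word, prob in top_words(toy_phi, toy_vocab, 3):
    print("\t", word, ":", prob)
```

Indexing both the vocabulary and the probability vector with the same index keeps each word paired with its own probability.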


In [13]:
print('Words distribution per topic:\n')
topics__(nr_topics, vocab_, 5)

Words distribution per topic:

Topic  0
	 neighbour : 0.007115568368636087
	 dog : 0.003947428131491523
	 breakfast : 0.0035652383691614177
	 walnut : 0.0021047412664710404
	 sandwich : 0.001380059487851138
Topic  1
	 enemi : 0.011000210173604476
	 yesterday : 0.01062986523847483
	 breakfast : 0.002811108860692383
	 peanut : 0.002308913654492566
	 mustnt : 0.00032400491911813536



In [14]:
print('Topic distribution per document:\n')
for doc in range(corpus_size):  # topic distribution document
    print(mcmc.trace('Theta %i' % doc)[0]) 

Topic distribution per document:

[[8.72466055e-10 9.99999999e-01]]
[[5.48124998e-10 9.99999999e-01]]
[[1.00000000e+00 6.36080077e-12]]
[[1.00000000e+00 4.47131221e-12]]
[[1.00000000e+00 8.57880433e-12]]




# **TASK2**

# Can the topic model be used to define a topic-based similarity measure between documents?

We will analyze the topic-based similarity between documents by computing the cosine similarity and the Jensen-Shannon similarity (one minus the corresponding distance in each case) of the theta distributions (i.e. the distribution of topics per document) of the first n documents.
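
As a reference for what the two measures compute, here is a NumPy-only sketch. It assumes the Jensen-Shannon distance is the square root of the Jensen-Shannon divergence with natural logarithms; the helper names and the example vectors are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(p, q):
    """Cosine of the angle between two vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def js_similarity(p, q):
    """One minus the Jensen-Shannon distance (natural logarithm)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(1.0 - np.sqrt(max(jsd, 0.0)))

same = ([0.3, 0.7], [0.3, 0.7])           # identical topic proportions
opposite = ([1.0, 0.0], [0.0, 1.0])       # disjoint topic proportions
print(cosine_similarity(*same), js_similarity(*same))
print(cosine_similarity(*opposite), js_similarity(*opposite))
```

Identical topic proportions score one on both measures. Disjoint proportions score zero on cosine similarity but stay above zero on the Jensen-Shannon similarity, because the distance is bounded by the square root of ln 2.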


In [17]:
from scipy.spatial import distance

def cos_sim(nr_doc):
    # Posterior mean of theta (topic proportions) for each of the first nr_doc documents
    thetas = [mcmc.trace('Theta %i' % d)[:].mean(axis=0)[0] for d in range(nr_doc)]

    print('Documents \t cosine similarity \t JensenShannon similarity')
    for i in range(nr_doc):
        for j in range(nr_doc):
            if i != j:
                # Similarity is taken as one minus the corresponding distance
                print('%i & %i:\t\t %f \t\t %f ' % (i, j,
                                             1 - distance.cosine(thetas[i], thetas[j]),
                                             1 - distance.jensenshannon(thetas[i], thetas[j])))


In [18]:
print('Topic similarity between documents:\n')
cos_sim(3)

Topic similarity between documents:

Documents 	 cosine similarity 	 JensenShannon similarity
0 & 1:		 1.000000 		 0.999985 
0 & 2:		 0.000155 		 0.167900 
1 & 0:		 1.000000 		 0.999985 
1 & 2:		 0.000155 		 0.167900 
2 & 0:		 0.000155 		 0.167900 
2 & 1:		 0.000155 		 0.167900 
