In [1]:
from collections import Counter
import random

To start with, we’ll need a function to randomly choose an index based on an arbitrary set of weights:

In [12]:
def sample_from(weights):
    """returns i with probability weights[i] / sum(weights)"""
    total = sum(weights)
    rnd = total * random.random() # uniform between 0 and total
    for i, w in enumerate(weights):
        rnd -= w # return the smallest i such that
        if rnd <= 0: 
            return i # weights[0] + ... + weights[i] >= rnd

In [13]:
sample_from([1, 1, 3])

2

For instance, if you give it weights [1, 1, 3], then one-fifth of the time it will return 0, one-fifth of the time it will return 1, and three-fifths of the time it will return 2.
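
A quick empirical check of that claim, as a sketch rather than part of the original walkthrough. The block repeats `sample_from` so that it runs on its own; the seed and the number of draws are arbitrary choices made for this illustration.

```python
from collections import Counter
import random

def sample_from(weights):
    """returns i with probability weights[i] / sum(weights)"""
    total = sum(weights)
    rnd = total * random.random()      # uniform between 0 and total
    for i, w in enumerate(weights):
        rnd -= w
        if rnd <= 0:
            return i

random.seed(42)                        # arbitrary seed, only for repeatability
n_draws = 10_000                       # arbitrary sample size
draws = Counter(sample_from([1, 1, 3]) for _ in range(n_draws))

# observed frequency of each index; the expected values are 0.2, 0.2 and 0.6
frequencies = {i: draws[i] / n_draws for i in range(3)}
print(frequencies)
```

The observed frequencies should land close to one-fifth, one-fifth, and three-fifths, with small deviations due to sampling noise.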

In [26]:
documents = [
    ["Hadoop", "Big Data", "HBase", "Java", "Spark", "Storm", "Cassandra"],
    ["NoSQL", "MongoDB", "Cassandra", "HBase", "Postgres"],
    ["Python", "scikit-learn", "scipy", "numpy", "statsmodels", "pandas"],
    ["R", "Python", "statistics", "regression", "probability"],
    ["machine learning", "regression", "decision trees", "libsvm"],
    ["Python", "R", "Java", "C++", "Haskell", "programming languages"],
    ["statistics", "probability", "mathematics", "theory"],
    ["machine learning", "scikit-learn", "Mahout", "neural networks"],
    ["neural networks", "deep learning", "Big Data", "artificial intelligence"],
    ["Hadoop", "Java", "MapReduce", "Big Data"],
    ["statistics", "R", "statsmodels"],
    ["C++", "deep learning", "artificial intelligence", "probability"],
    ["pandas", "R", "Python"],
    ["databases", "HBase", "Postgres", "MySQL", "MongoDB"],
    ["libsvm", "regression", "support vector machines"]
]

K = 4

And we’ll try to find K = 4 topics.

In order to calculate the sampling weights, we’ll need to keep track of several counts. Let’s first create the data structures for them.

How many times each topic is assigned to each document:

In [15]:
# a list of Counters, one for each document
document_topic_counts = [Counter() for _ in documents]

In [16]:
document_topic_counts

[Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter(),
 Counter()]

How many times each word is assigned to each topic:

In [41]:
# a list of Counters, one for each topic
topic_word_counts = [Counter() for _ in range(K)]

In [53]:
# topic_word_counts

The total number of words assigned to each topic:

In [17]:
# a list of numbers, one for each topic
topic_counts = [0 for _ in range(K)]

In [18]:
topic_counts

[0, 0, 0, 0]

The total number of words contained in each document:

In [22]:
# a list of numbers, one for each document
document_lengths = list(map(len, documents))

The number of distinct words:

In [23]:
distinct_words = set(word for document in documents for word in document)
W = len(distinct_words)
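
As a small illustration of that comprehension, here is a sketch on two toy documents invented for the example: the nested lists are flattened, and a word that appears in several documents is counted once.

```python
toy_documents = [["Python", "R", "statistics"],
                 ["Python", "Java"]]

# flatten the nested lists; the set keeps each word only once
toy_distinct_words = set(word for document in toy_documents for word in document)
toy_W = len(toy_distinct_words)

print(sorted(toy_distinct_words))
print(toy_W)
```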

And the number of documents:

In [24]:
D = len(documents)

For example, once we populate these, we can find the number of words in documents[3] associated with topic 1 as:



In [57]:
document_topic_counts[3][1]

0

And we can find the number of times the word "nlp" is associated with topic 2 as:

In [58]:
topic_word_counts[2]["nlp"]

0
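
Both lookups return 0 instead of raising a `KeyError` because a `Counter` treats missing keys as zero counts. A minimal illustration, using a toy counter made up for this example:

```python
from collections import Counter

toy_counts = Counter()          # hypothetical stand-in for one topic's word counts
toy_counts["Python"] += 1       # incrementing a missing key starts from 0

print(toy_counts["Python"])     # 1
print(toy_counts["nlp"])        # 0, because missing keys read as zero
print("nlp" in toy_counts)      # False, because reading a missing key does not insert it
```

This is why the counting structures can be created empty and filled in later without any initialization per word or topic.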

Now we’re ready to define our conditional probability functions. As in Chapter 13, each has a smoothing term that ensures every topic has a nonzero chance of being chosen in any document and that every word has a nonzero chance of being chosen for any topic:

In [27]:
def p_topic_given_document(topic, d, alpha=0.1):
    """the fraction of words in document _d_
    that are assigned to _topic_ (plus some smoothing)"""
    return ((document_topic_counts[d][topic] + alpha) /
            (document_lengths[d] + K * alpha))

def p_word_given_topic(word, topic, beta=0.1):
    """the fraction of words assigned to _topic_
    that equal _word_ (plus some smoothing)"""
    return ((topic_word_counts[topic][word] + beta) /
            (topic_counts[topic] + W * beta))
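
To see what the smoothing does, here is the arithmetic for a hypothetical five-word document whose words are all currently assigned to topic 0. The counts are invented for illustration; the formula is the one used in `p_topic_given_document`.

```python
from collections import Counter

K = 4
alpha = 0.1
doc_topic_counts = Counter({0: 5})   # hypothetical: all 5 words assigned to topic 0
doc_length = 5

# same expression as p_topic_given_document, evaluated for every topic
probs = [(doc_topic_counts[k] + alpha) / (doc_length + K * alpha)
         for k in range(K)]

print(probs)
print(sum(probs))
```

Topic 0 gets 5.1 / 5.4, roughly 0.944, while each unused topic still gets 0.1 / 5.4, roughly 0.019. The values sum to 1 (up to floating-point rounding), so the smoothed fractions still form a probability distribution.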

We’ll use these to create the weights for updating topics:

In [28]:
def topic_weight(d, word, k):
    """given a document and a word in that document,
    return the weight for the kth topic"""
    return p_word_given_topic(word, k) * p_topic_given_document(k, d)

def choose_new_topic(d, word):
    return sample_from([topic_weight(d, word, k)
                        for k in range(K)])

There are solid mathematical reasons why topic_weight is defined the way it is, but their details would lead us too far afield. Hopefully it makes at least intuitive sense that—given a word and its document—the likelihood of any topic choice depends on both how likely that topic is for the document and how likely that word is for the topic.
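
A sketch of that intuition with made-up numbers for a single word in a single document. The two lists below are hypothetical values of the two conditional probabilities for each of four topics; they are not taken from the model.

```python
# hypothetical values for one (document, word) pair with 4 topics
word_given_topic = [0.30, 0.02, 0.02, 0.10]   # how likely the word is under each topic
topic_given_doc = [0.10, 0.60, 0.20, 0.10]    # how prevalent each topic is in the document

# the weight of a topic is the product of the two factors
weights = [pw * pt for pw, pt in zip(word_given_topic, topic_given_doc)]

# sample_from normalizes implicitly; here the normalization is made explicit
total = sum(weights)
probabilities = [w / total for w in weights]

for k, (w, p) in enumerate(zip(weights, probabilities)):
    print(k, round(w, 4), round(p, 3))
```

Topic 1 dominates the document, but the word is rare under it, so topic 0 ends up with a bit over half of the probability.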

This is all the machinery we need. We start by assigning every word to a random topic, and populating our counters appropriately:

In [45]:
random.seed(0)
document_topics = [[random.randrange(K) for word in document]
                   for document in documents]

for d in range(D):
    for word, topic in zip(documents[d], document_topics[d]):
        document_topic_counts[d][topic] += 1
        topic_word_counts[topic][word] += 1
        topic_counts[topic] += 1

In [46]:
word

'support vector machines'

In [47]:
topic

3

Our goal is to get a joint sample of the topics-words distribution and the documents-topics distribution. We do this using a form of Gibbs sampling that uses the conditional probabilities defined previously:

In [30]:
for _ in range(1000):  # avoid shadowing the built-in iter
    for d in range(D):
        for i, (word, topic) in enumerate(zip(documents[d],
                                              document_topics[d])):

            # remove this word / topic from the counts
            # so that it doesn't influence the weights
            document_topic_counts[d][topic] -= 1
            topic_word_counts[topic][word] -= 1
            topic_counts[topic] -= 1
            document_lengths[d] -= 1

            # choose a new topic based on the weights
            new_topic = choose_new_topic(d, word)
            document_topics[d][i] = new_topic

            # and now add it back to the counts
            document_topic_counts[d][new_topic] += 1
            topic_word_counts[new_topic][word] += 1
            topic_counts[new_topic] += 1
            document_lengths[d] += 1
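
Because the loop updates several structures by hand, a consistency check can catch bookkeeping mistakes. The sketch below is not part of the original code: `recount` is a hypothetical helper that rebuilds the counts from the current assignments so they can be compared with the incrementally maintained ones, and the toy documents and assignments are invented.

```python
from collections import Counter

def recount(documents, document_topics, K):
    """Rebuild all counts from scratch from the current topic assignments."""
    document_topic_counts = [Counter() for _ in documents]
    topic_word_counts = [Counter() for _ in range(K)]
    topic_counts = [0] * K
    for d, (document, topics) in enumerate(zip(documents, document_topics)):
        for word, topic in zip(document, topics):
            document_topic_counts[d][topic] += 1
            topic_word_counts[topic][word] += 1
            topic_counts[topic] += 1
    return document_topic_counts, topic_word_counts, topic_counts

toy_documents = [["Python", "R"], ["Hadoop", "Java", "Python"]]
toy_document_topics = [[0, 1], [1, 1, 0]]   # one topic index per word

doc_counts, word_counts, totals = recount(toy_documents, toy_document_topics, K=2)

# every word is assigned to exactly one topic, so the totals must match
assert sum(totals) == sum(len(doc) for doc in toy_documents)
assert all(sum(c.values()) == len(doc)
           for c, doc in zip(doc_counts, toy_documents))
print(totals)
```

The same invariants should hold for the real counters after any number of sampling passes, since each step removes one assignment and adds exactly one back.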

What are the topics? They’re just numbers 0, 1, 2, and 3. If we want names for them we have to do that ourselves. Let’s look at the words assigned to each topic, most heavily weighted first (Table 20-1):

In [32]:
for k, word_counts in enumerate(topic_word_counts):
    for word, count in word_counts.most_common():
        if count > 0:
            print(k, word, count)

0 pandas 4
0 scikit-learn 4
0 R 2
0 Big Data 2
0 C++ 2
0 Java 2
0 statsmodels 2
0 HBase 2
0 artificial intelligence 2
0 Haskell 2
0 scipy 1
0 libsvm 1
0 numpy 1
0 deep learning 1
0 Hadoop 1
0 regression 1
0 mathematics 1
0 statistics 1
1 neural networks 4
1 deep learning 3
1 MongoDB 3
1 Postgres 2
1 theory 2
1 decision trees 2
1 MySQL 2
1 HBase 2
1 databases 2
1 Mahout 2
1 numpy 1
1 Big Data 1
1 Cassandra 1
1 Python 1
1 statistics 1
2 R 5
2 Python 5
2 regression 5
2 Java 4
2 machine learning 3
2 Cassandra 3
2 probability 3
2 statistics 3
2 Postgres 2
2 C++ 2
2 statsmodels 2
2 artificial intelligence 2
2 HBase 2
2 scipy 1
2 programming languages 1
2 Storm 1
2 mathematics 1
2 MongoDB 1
3 libsvm 3
3 Big Data 3
3 probability 3
3 NoSQL 2
3 support vector machines 2
3 Spark 2
3 Hadoop 2
3 Python 2
3 MapReduce 2
3 R 1
3 Storm 1
3 machine learning 1
3 programming languages 1
3 statistics 1
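
To limit the listing to the five most heavily weighted words per topic, `most_common` accepts an optional count. A sketch on two toy counters (the values are illustrative, not the model's state):

```python
from collections import Counter

toy_topic_word_counts = [
    Counter({"Java": 3, "Big Data": 3, "Hadoop": 2,
             "Storm": 1, "Spark": 1, "C++": 1}),
    Counter({"statistics": 3, "probability": 3, "R": 2}),
]

# most_common(5) returns at most five (word, count) pairs, highest counts first
top_words = [[word for word, count in word_counts.most_common(5) if count > 0]
             for word_counts in toy_topic_word_counts]

for k, words in enumerate(top_words):
    print(k, words)
```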


Based on these I’d probably assign topic names:

In [45]:
topic_names = ["Big Data and programming languages",
               "Python and statistics",
               "databases",
               "machine learning"]

at which point we can see how the model assigns topics to each user’s interests:

In [35]:
# use a loop name other than topic_counts so the global list is not overwritten
for document, doc_topic_counts in zip(documents, document_topic_counts):
    print(document)
    for topic, count in doc_topic_counts.most_common():
        if count > 0:
            print(topic_names[topic], count)
    print()

['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']
machine learning 6
databases 5
Big Data and programming languages 2

['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres']
databases 5
Python and statistics 3
machine learning 2

['Python', 'scikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas']
Big Data and programming languages 6
databases 4
Python and statistics 2

['R', 'Python', 'statistics', 'regression', 'probability']
databases 6
Big Data and programming languages 3
machine learning 1

['machine learning', 'regression', 'decision trees', 'libsvm']
databases 3
machine learning 3
Python and statistics 2

['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages']
Big Data and programming languages 5
databases 4
machine learning 3

['statistics', 'probability', 'mathematics', 'theory']
Python and statistics 3
Big Data and programming languages 2
machine learning 2
databases 1

['machine learning', 'scikit-learn', 'Mahout', 'neural networks']
Python a

## Turning the model into a class

In [47]:
class CustomLDA:
    """Topic model fitted with the Gibbs sampling steps shown above."""

    def __init__(self, documents, nr_topics):
        self.documents = documents
        self.nr_topics = nr_topics
        self.distinct_words = None
        self.count_distinct_words()

        self.D = len(self.documents)       # number of documents
        self.W = len(self.distinct_words)  # number of distinct words

        self.document_topic_counts = None
        self.count_document_topic()

        self.topic_word_counts = None
        self.count_topic_word()

        self.topic_counts = None
        self.count_topic()

        self.document_lengths = None
        self.get_document_lengths()

        self.document_topics = None
        self.init_document_topics()

        self.init_counts()

        self.nr_iter = None
        self.topic_names = None
        
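    def count_document_topic(self):
        # one Counter per document: topic -> number of words assigned to it
        self.document_topic_counts = [Counter() for _ in self.documents]

    def count_topic_word(self):
        # one Counter per topic: word -> number of times assigned to that topic
        self.topic_word_counts = [Counter() for _ in range(self.nr_topics)]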
    
    def sample_from(self, weights):
        """returns i with probability weights[i] / sum(weights)"""
        total = sum(weights)
        rnd = total * random.random() # uniform between 0 and total
        for i, w in enumerate(weights):
            rnd -= w # return the smallest i such that
            if rnd <= 0: 
                return i # weights[0] + ... + weights[i] >= rnd
            
    def count_topic(self):
        # total number of words assigned to each topic
        self.topic_counts = [0 for _ in range(self.nr_topics)]

    def get_document_lengths(self):
        # number of words in each document
        self.document_lengths = list(map(len, self.documents))

    def count_distinct_words(self):
        self.distinct_words = set(word
                                  for document in self.documents
                                  for word in document)
    
    def p_topic_given_document(self, topic, d, alpha=0.1):
        """the fraction of words in document _d_
        that are assigned to _topic_ (plus some smoothing)"""
        return ((self.document_topic_counts[d][topic] + alpha) /
                (self.document_lengths[d] + self.nr_topics * alpha))

    def p_word_given_topic(self, word, topic, beta=0.1):
        """the fraction of words assigned to _topic_
        that equal _word_ (plus some smoothing)"""
        return ((self.topic_word_counts[topic][word] + beta) /
                (self.topic_counts[topic] + self.W * beta))
    
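    def init_document_topics(self):
        # start from a random topic for every word; note that seeding here
        # resets Python's global random generator on every instantiation
        random.seed(0)
        self.document_topics = [[random.randrange(self.nr_topics) for word in document]
                                for document in self.documents]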

    def init_counts(self):
        for d in range(self.D):
            for word, topic in zip(self.documents[d], self.document_topics[d]):
                self.document_topic_counts[d][topic] += 1
                self.topic_word_counts[topic][word] += 1
                self.topic_counts[topic] += 1
                
    def topic_weight(self, d, word, k):
        """given a document and a word in that document,
        return the weight for the kth topic"""
        return self.p_word_given_topic(word, k) * self.p_topic_given_document(k, d)

    def choose_new_topic(self, d, word):
        return self.sample_from([self.topic_weight(d, word, k)
                            for k in range(self.nr_topics)])

    def train(self, nr_iter):
        """run nr_iter full Gibbs sampling passes over every word"""
        self.nr_iter = nr_iter
        for _ in range(self.nr_iter):
            for d in range(self.D):
                for i, (word, topic) in enumerate(zip(self.documents[d],
                                                      self.document_topics[d])):

                    # remove this word / topic from the counts
                    # so that it doesn't influence the weights
                    self.document_topic_counts[d][topic] -= 1
                    self.topic_word_counts[topic][word] -= 1
                    self.topic_counts[topic] -= 1
                    self.document_lengths[d] -= 1

                    # choose a new topic based on the weights
                    new_topic = self.choose_new_topic(d, word)
                    self.document_topics[d][i] = new_topic

                    # and now add it back to the counts
                    self.document_topic_counts[d][new_topic] += 1
                    self.topic_word_counts[new_topic][word] += 1
                    self.topic_counts[new_topic] += 1
                    self.document_lengths[d] += 1
                    
    def top_words_per_topic(self):
        """print every word assigned to each topic, most frequent first"""
        for k, word_counts in enumerate(self.topic_word_counts):
            for word, count in word_counts.most_common():
                if count > 0:
                    print(k, word, count)

    def assign_topics(self, topic_names):
        """print each document with its topic counts, labeled by topic_names"""
        self.topic_names = topic_names
        for document, topic_counts in zip(self.documents, self.document_topic_counts):
            print(document)
            for topic, count in topic_counts.most_common():
                if count > 0:
                    print(self.topic_names[topic], count)
            print()

In [48]:
lda_model = CustomLDA(documents, 4)

In [49]:
lda_model.train(nr_iter=1000)

In [50]:
lda_model.top_words_per_topic()

0 Java 3
0 Big Data 3
0 Hadoop 2
0 programming languages 1
0 HBase 1
0 MapReduce 1
0 Storm 1
0 Cassandra 1
0 C++ 1
0 Spark 1
0 deep learning 1
1 Postgres 2
1 HBase 2
1 MongoDB 2
1 neural networks 2
1 machine learning 2
1 artificial intelligence 1
1 scipy 1
1 NoSQL 1
1 decision trees 1
1 deep learning 1
1 MySQL 1
1 Cassandra 1
1 databases 1
1 numpy 1
2 regression 3
2 scikit-learn 2
2 libsvm 2
2 R 2
2 Python 2
2 Mahout 1
2 mathematics 1
2 Haskell 1
2 support vector machines 1
3 statistics 3
3 probability 3
3 pandas 2
3 R 2
3 statsmodels 2
3 Python 2
3 artificial intelligence 1
3 theory 1
3 C++ 1


In [51]:
# topic numbering is specific to each fitted model, so the names are chosen
# from this model's own top words (see top_words_per_topic above)
lda_topic_names = ["Big Data and programming languages",
                   "databases",
                   "machine learning",
                   "Python and statistics"]
lda_model.assign_topics(lda_topic_names)

['Hadoop', 'Big Data', 'HBase', 'Java', 'Spark', 'Storm', 'Cassandra']
Big Data and programming languages 7

['NoSQL', 'MongoDB', 'Cassandra', 'HBase', 'Postgres']
databases 5

['Python', 'scikit-learn', 'scipy', 'numpy', 'statsmodels', 'pandas']
databases 2
machine learning 2
Python and statistics 2

['R', 'Python', 'statistics', 'regression', 'probability']
Python and statistics 3
machine learning 2

['machine learning', 'regression', 'decision trees', 'libsvm']
databases 2
machine learning 2

['Python', 'R', 'Java', 'C++', 'Haskell', 'programming languages']
Big Data and programming languages 3
machine learning 3

['statistics', 'probability', 'mathematics', 'theory']
Python and statistics 3
machine learning 1

['machine learning', 'scikit-learn', 'Mahout', 'neural networks']
databases 2
machine learning 2

['neural networks', 'deep learning', 'Big Data', 'artificial intelligence']
databases 3
Big Data and programming languages 1

['Hadoop', 'Java', 'MapReduce', 'Big Data']
Big D

In [52]:
lda_model.document_topic_counts

[Counter({0: 7, 1: 0, 2: 0, 3: 0}),
 Counter({0: 0, 1: 5, 2: 0, 3: 0}),
 Counter({0: 0, 1: 2, 2: 2, 3: 2}),
 Counter({0: 0, 1: 0, 2: 2, 3: 3}),
 Counter({0: 0, 1: 2, 2: 2, 3: 0}),
 Counter({0: 3, 1: 0, 2: 3, 3: 0}),
 Counter({0: 0, 1: 0, 2: 1, 3: 3}),
 Counter({0: 0, 1: 2, 2: 2, 3: 0}),
 Counter({0: 1, 1: 3, 2: 0, 3: 0}),
 Counter({0: 4, 1: 0, 2: 0, 3: 0}),
 Counter({0: 0, 1: 0, 2: 0, 3: 3}),
 Counter({0: 1, 1: 0, 2: 0, 3: 3}),
 Counter({0: 0, 1: 0, 2: 0, 3: 3}),
 Counter({0: 0, 1: 5, 2: 0, 3: 0}),
 Counter({0: 0, 1: 0, 2: 3, 3: 0})]
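
Raw counts are hard to compare across documents of different lengths. One option, sketched here and not part of the original notebook, is to convert each `Counter` into topic proportions. `topic_proportions` is a hypothetical helper; the two counters are copied from the output above.

```python
from collections import Counter

def topic_proportions(counts):
    """Return {topic: share of the document's words}, skipping empty topics."""
    total = sum(counts.values())
    return {topic: count / total
            for topic, count in sorted(counts.items())
            if count > 0}

print(topic_proportions(Counter({0: 7, 1: 0, 2: 0, 3: 0})))   # first document
print(topic_proportions(Counter({0: 3, 1: 0, 2: 3, 3: 0})))   # sixth document
```

The first document is assigned entirely to topic 0, while the sixth is split evenly between topics 0 and 2.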