<font color = green >

# Text classification: topic modeling 

</font>

<font color = green >

### Latent Dirichlet allocation (LDA)

</font>

Typically used to detect underlying topics in the text documents

**Input** : text documents and number of topics 
<br>
**Output**: Distribution of topics for each document (that allows to assign th one with highest probability) and word distribution for each topic 

**Assumptions**:
- Documents with similar topics use similar groups of words 
- Documents are probability distribution over latent topics 
- Topics are probability distribution over words


<font color = green >

#### Generative process
<br>
</font>

LDA considers the every document is created the following way:

1) Define number if words in the document
<br>
2) Chose the topic mixture over the fixed set of topics (e.g. 20% of topic A, 30% of topic A, and 50% of topic A)
<br>
3) Generate the words by:
<br>
   -pick the topic based on document's multinomial distribution 
<br>
   -pick the word based on topic's multinomial distribution 

<img src = "img/topics_modeling.png" height=500 width= 800 align="left">



<font color = green >

#### Recall
</font>


#### Binomial distribution

$$p(k/n)\quad =\quad C^{ k }_{ n }\cdot p^{ k }(1-p)^{ n-k }\quad =\quad \frac { n! }{ k!(n-k)! } p^{ k }(1-p)^{ n-k }$$

Example: Probability of 6 of 10 for fear coin: 
$$p(6,4)\quad =\quad C^{ 6 }_{ 10 }\cdot {0.5}^{ 6 }(0.5)^{ 4 }\quad = 210 \cdot 0.015625 \cdot 0.0625 = 0.205078125$$


#### Multinomial distribution

$$p(n_{ 1 }n_{ 2 }...n_{ k })\quad =\quad \frac { n! }{ n_{ 1 }!n_{ 2 }!...n_{ k }! } p^{ n_{ 1 } }_{ 1 }p^{ n_{ 2 } }_{ 2 }...p^{ n_{ k } }_{ k }$$

Example (three outcomes): <br>
n = 12 (12 games are played),<br>
n1 = 7 (number won by Player A),<br>
n2 = 2 (number won by Player B),<br>
n3 = 3 (the number drawn),<br>
p1 = 0.4 (probability Player A wins)<br>
p2 = 0.35(probability Player B wins)<br>
p3 = 0.25(probability of a draw)<br>
$$p(7,2,3)\quad =\quad \frac {12!}{ 7! \cdot 2! \cdot3 ! }  \cdot 0.4^{7} \cdot 0.35^{2} \cdot0.25^{3} = 0.0248$$




<font color = green >

#### Maximul Likelihood Estimation

</font>

<br>

**Recall** 
<br> Known are text documents and number $K$ of topics 

**Target**:
<br>Within all possible topics distribution for all documemnts and all possible words distribution for topics, shoose the one wich maximizes probability of all text documents.

**Approach** :
<br>
1) Randomly assign each word of each document to $K$ topics 
<br>
2) Iterate the following process till convergence (steady assignments of w to topics) 
<br>$\quad\quad$For each document $d$: 
<br>
    $\quad\quad\bullet$ Assume that all topic assignment except current one are correct     
    $\quad\quad\bullet$ For each word $w$ in $d$:           
    $\quad\quad\quad$ - For every topic $t$ compare the the score for hypothesis that w is in this topic $t$:
   <br>$\quad\quad\quad\quad\quad score (t) =  p(t | d) \cdot p (w |t),$
   <br>$\quad\quad\quad\quad p(t|d)$ is proportion of all words in d from t,
    <br>$\quad\quad\quad\quad p(w|t)$ is share of word w in topic t.  
    $\quad\quad\quad$ - Assign the word w to the topic with max score
    <br>$\quad\quad\bullet$ Iterate through all $w$ in $d$:           
$\quad\quad$Iterate through all $d$

Te results is matrix of distribution of words in topics  
Note: The computed topics are just words distribution, i.e. need to summarize them somehow. 


<font color = green >

## Gensim LDA 

</font>



In [27]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords

<font color = green >

### Define the text documents 

</font>



In [28]:
doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
doc_e = "Health professionals say that brocolli is good for your health."
doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]


<font color = green >

### Tokenize, clean, and stem

</font>



In [29]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()

def tokenize(doc_set):
    texts = []
    for doc in doc_set:
        # tokenize document string
        raw = doc.lower()
        tokens = word_tokenize(raw)

        # remove stop words from tokens
        tokens = [token for token in tokens if token not in en_stop]

        # stem tokens
        tokens = [p_stemmer.stem(token) for token in tokens]

        # add tokens to list
        texts.append(tokens)
    return texts

texts = tokenize(doc_set)
texts[0]

['brocolli',
 'good',
 'eat',
 '.',
 'brother',
 'like',
 'eat',
 'good',
 'brocolli',
 ',',
 'mother',
 '.']

<font color = green >

### Convert tokenized documents into a "id <-> term" dictionary

</font>



In [30]:
dictionary = corpora.Dictionary(texts) # this is alternative way - without using count vectorizer
print (type(dictionary), dictionary)
for k,w in dictionary.items():
    print (k,w)

<class 'gensim.corpora.dictionary.Dictionary'> Dictionary(34 unique tokens: [',', '.', 'brocolli', 'brother', 'eat']...)
0 ,
1 .
2 brocolli
3 brother
4 eat
5 good
6 like
7 mother
8 around
9 basebal
10 drive
11 lot
12 practic
13 spend
14 time
15 blood
16 caus
17 expert
18 health
19 increas
20 may
21 pressur
22 suggest
23 tension
24 better
25 feel
26 never
27 often
28 perform
29 school
30 seem
31 well
32 profession
33 say


<font color = green >

### Create gensim corpus

</font>



In [31]:
print ('\nconvert tokenized documents into a document-term matrix')
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
for item in corpus:
    print (item)


convert tokenized documents into a document-term matrix
[(0, 1), (1, 2), (2, 2), (3, 1), (4, 2), (5, 2), (6, 1), (7, 1)]
[(1, 1), (3, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)]
[(1, 1), (10, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1)]
[(0, 1), (1, 1), (3, 1), (7, 1), (10, 1), (21, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1)]
[(1, 1), (2, 1), (5, 1), (18, 2), (32, 1), (33, 1)]


<font color = green >

### Generate LDA model

</font>



In [32]:
ldamodel = gensim.models.ldamodel.LdaModel(
    corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

### Review topics 

</font>



In [33]:
ldamodel.print_topics(num_topics=2,num_words=10)

[(0,
  '0.098*"." + 0.076*"brocolli" + 0.076*"good" + 0.055*"mother" + 0.055*"brother" + 0.054*"health" + 0.054*"eat" + 0.033*"," + 0.033*"like" + 0.033*"spend"'),
 (1,
  '0.060*"drive" + 0.059*"pressur" + 0.059*"." + 0.036*"," + 0.036*"never" + 0.036*"often" + 0.036*"increas" + 0.036*"perform" + 0.036*"seem" + 0.036*"well"')]

<font color = green >

### Classify the new text 

</font>



In [34]:
test_doc_list = ["Some experts suggest that car may cause increased blood pressure. professionals say that brocolli is good "]
test_texts = tokenize(test_doc_list)
test_corpus = [dictionary.doc2bow(text) for text in test_texts ]
test_doc_topics = ldamodel.get_document_topics(test_corpus)
print ('\nget topics:')
for el in test_doc_topics: # loop over all tests in provided list
    print(el)


get topics:
[(0, 0.37417665), (1, 0.6258233)]


<font color = green >

### Sample of topic modeling on large dataset

</font>



In [16]:
from sklearn.feature_extraction.text import CountVectorizer
import pickle

<font color = green >

#### Load "News" data 

</font>



In [17]:
import os
cwd= os.getcwd()
path = os.path.join(cwd,'')
fn=  os.path.join(path , 'newsgroups')

with open(fn, 'rb') as f:
    newsgroup_data = pickle.load(f)

<font color = green >

#### Review data

</font>



In [18]:
print (type(newsgroup_data))
print ('len of documents = {:,}\n'.format(len(newsgroup_data)))

newsgroup_data[0]

<class 'list'>
len of documents = 2,000



"The best group to keep you informed is the Crohn's and Colitis Foundation\nof America.  I do not know if the UK has a similar organization.  The\naddress of\nthe CCFA is \n\nCCFA\n444 Park Avenue South\n11th Floor\nNew York, NY  10016-7374\nUSA\n\nThey have a lot of information available and have a number of newsletters.\n \nGood Luck."

<font color = green >

#### Define custom vectorizer

</font>



In [133]:
three_words_pattern = r"\b\w{3,}\b"
vectorizer = CountVectorizer(
    min_df=20, 
    stop_words='english',
    token_pattern=three_words_pattern) 
vectorizer.fit(newsgroup_data)
newsgroup_data_vectorized= vectorizer.transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)

scipy.sparse.csr.csr_matrix

<font color = green >

#### Review feratures 

</font>



In [20]:
print ('len of features = {:,}\n'.format(len(vectorizer.get_feature_names())))
print (vectorizer.get_feature_names()[:40])
type(vectorizer.get_feature_names()[:40])

len of features = 902

['000', '100', '1990', '1992', '1993', '200', '2nd', '300', '400', '486', '500', '800', 'ability', 'able', 'accept', 'accepted', 'access', 'according', 'actual', 'actually', 'add', 'addition', 'additional', 'address', 'advance', 'advice', 'age', 'ago', 'agree', 'ahead', 'air', 'allow', 'alt', 'america', 'american', 'answer', 'answers', 'anybody', 'apparently', 'appears']


list

<font color = green >

#### Vectorize data set

</font>



In [40]:
newsgroup_data_vectorized= vectorizer.transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
print (newsgroup_data_vectorized)

  (0, 23)	1
  (0, 33)	1
  (0, 58)	1
  (0, 76)	1
  (0, 326)	1
  (0, 335)	1
  (0, 386)	1
  (0, 409)	1
  (0, 451)	1
  (0, 456)	1
  (0, 515)	1
  (0, 529)	1
  (0, 545)	1
  (0, 727)	1
  (0, 843)	1
  (0, 900)	1
  (1, 33)	1
  (1, 34)	1
  (1, 84)	1
  (1, 184)	1
  (1, 201)	1
  (1, 214)	1
  (1, 231)	2
  (1, 241)	1
  (1, 324)	1
  :	:
  (1998, 622)	1
  (1998, 625)	3
  (1998, 688)	1
  (1998, 698)	2
  (1998, 726)	1
  (1998, 804)	1
  (1998, 805)	1
  (1998, 810)	10
  (1998, 813)	2
  (1998, 814)	1
  (1998, 816)	1
  (1998, 818)	1
  (1998, 844)	1
  (1998, 882)	2
  (1998, 899)	1
  (1999, 171)	1
  (1999, 194)	1
  (1999, 205)	1
  (1999, 213)	1
  (1999, 276)	2
  (1999, 308)	1
  (1999, 344)	1
  (1999, 669)	1
  (1999, 832)	1
  (1999, 874)	1


<font color = green >

#### Create gensim corpus

</font>



In [41]:
corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
# comparing to using corpora.Dictionary:
# corpus = [dictionary.doc2bow(text) for text in texts] 
[item for item in corpus][:5]


[[(23, 1),
  (33, 1),
  (58, 1),
  (76, 1),
  (326, 1),
  (335, 1),
  (386, 1),
  (409, 1),
  (451, 1),
  (456, 1),
  (515, 1),
  (529, 1),
  (545, 1),
  (727, 1),
  (843, 1),
  (900, 1)],
 [(33, 1),
  (34, 1),
  (84, 1),
  (184, 1),
  (201, 1),
  (214, 1),
  (231, 2),
  (241, 1),
  (324, 1),
  (332, 1),
  (359, 1),
  (363, 1),
  (365, 1),
  (409, 1),
  (430, 3),
  (451, 1),
  (475, 1),
  (492, 2),
  (525, 2),
  (605, 1),
  (633, 2),
  (642, 1),
  (674, 1),
  (688, 1),
  (709, 1),
  (750, 1),
  (777, 1),
  (823, 1),
  (838, 1),
  (874, 1),
  (896, 1)],
 [(25, 1),
  (26, 1),
  (63, 1),
  (120, 1),
  (231, 1),
  (297, 1),
  (326, 1),
  (344, 1),
  (373, 1),
  (423, 1),
  (442, 1),
  (444, 1),
  (448, 2),
  (465, 1),
  (572, 1),
  (653, 1),
  (659, 1),
  (714, 1),
  (777, 1),
  (779, 1),
  (781, 1),
  (818, 1),
  (836, 1),
  (855, 1),
  (890, 1),
  (898, 1)],
 [(4, 1),
  (17, 2),
  (18, 1),
  (22, 1),
  (42, 1),
  (48, 2),
  (68, 1),
  (78, 1),
  (86, 1),
  (94, 1),
  (117, 1),
  (119, 1)

<font color = green >

#### Create id2word dictionary

</font>



In [42]:
id_map = dict((v, k) for k, v in vectorizer.vocabulary_.items()) 

<font color = green >

#### Generate LDA model

</font>



In [43]:
ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)
# Comparing to corpora.Dictionary
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)


<font color = green >

#### Review topics

</font>



In [44]:
ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=6, id2word=id_map, passes=25, random_state=34)
ldamodel.print_topics(num_topics=6,num_words=10)

[(0,
  '0.025*"edu" + 0.019*"com" + 0.018*"use" + 0.018*"thanks" + 0.016*"does" + 0.015*"know" + 0.011*"mail" + 0.010*"apple" + 0.009*"help" + 0.008*"want"'),
 (1,
  '0.061*"drive" + 0.039*"disk" + 0.030*"scsi" + 0.027*"drives" + 0.027*"hard" + 0.025*"controller" + 0.021*"card" + 0.018*"rom" + 0.016*"cable" + 0.016*"floppy"'),
 (2,
  '0.024*"people" + 0.022*"god" + 0.013*"atheism" + 0.012*"think" + 0.012*"believe" + 0.012*"don" + 0.010*"does" + 0.010*"just" + 0.009*"argument" + 0.009*"say"'),
 (3,
  '0.023*"game" + 0.021*"year" + 0.020*"team" + 0.013*"games" + 0.013*"play" + 0.011*"good" + 0.011*"don" + 0.010*"think" + 0.010*"season" + 0.010*"players"'),
 (4,
  '0.035*"space" + 0.019*"nasa" + 0.018*"data" + 0.013*"information" + 0.013*"available" + 0.013*"center" + 0.011*"ground" + 0.010*"research" + 0.010*"000" + 0.010*"new"'),
 (5,
  '0.017*"just" + 0.017*"like" + 0.016*"don" + 0.012*"car" + 0.012*"time" + 0.011*"think" + 0.011*"good" + 0.010*"know" + 0.008*"way" + 0.008*"people"')]

<font color = green >

#### Name topics

</font>



In [45]:
topics_names= ['Education', 'Computers & IT', 'Religion', 'Sports', 'Science','Society & Lifestyle']

<font color = green >

#### Classify the new text 

</font>



In [46]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "] 


In [47]:
doc_vectorized= vectorizer.transform(new_doc) # input param is list
new_doc_corpus = gensim.matutils.Sparse2Corpus(doc_vectorized, documents_columns=False)
doc_topics = ldamodel.get_document_topics(new_doc_corpus)
list(doc_topics)

[[(0, 0.033414923),
  (1, 0.03333784),
  (2, 0.033519648),
  (3, 0.033781353),
  (4, 0.83230925),
  (5, 0.033636983)]]

In [48]:
import numpy as np
def elicit_topic_name(doc_topics):    
    return topics_names[np.squeeze(np.array(doc_topics))[:,1].argmax()]
elicit_topic_name(doc_topics)

'Science'

<font color = green >

## Home Task 

</font>


<font color = green >

### Topic Modeling 

</font>

[voted-kaggle-dataset](https://www.kaggle.com/canggih/voted-kaggle-dataset/version/2#voted-kaggle-dataset.csv)

In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
from nltk.corpus import stopwords
import pandas as pd
import os
from sklearn.feature_extraction.text import CountVectorizer
import pickle



In [121]:
import csv
cwd= os.getcwd()
path = os.path.join(cwd,'')
fn=  os.path.join(path , 'voted-kaggle-dataset.csv')
df = pd.read_csv(fn)


# with open('test.csv', 'rb') as f:
#     data = list(csv.reader(f))
# data=df['Description']
# for row in data:
#     writer = csv.writer(open("test1.csv", "wb"))
#     writer.writerows(row)
print ('len of texts= {:,}'.format(len(df)))


len of texts= 2,150


In [124]:

df=df['Description'].values

In [125]:
en_stop  = set(stopwords.words('english'))
p_stemmer = PorterStemmer()
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
import re

# def fix_Plan(df):
#     for text in df:
#         tokens = re.sub("[^a-zA-Z]",  # Search for all non-letters
#                                   " ",          # Replace all non-letters with spaces
#                                   str(text))

#         words = tokens.lower().split()     
#         stops = set(stopwords.words("english"))      
#         meaningful_words = [w for w in words if not w in stops]  
#     return meaningful_words
# tokens=fix_Plan(df)


def tokenize(df):
    texts = []
    for doc in df:
        # tokenize document string
        tokens = re.sub("[^a-zA-Z]",  # Search for all non-letters
                                  " ",          # Replace all non-letters with spaces
                                  str(text))

        words = tokens.lower().split()     
        stops = set(stopwords.words("english"))      
        tokens = [w for w in words if not w in stops]
        texts.append(tokens)
    return texts

texts = tokenize(df)
texts[0]

['datasets',
 'contains',
 'transactions',
 'made',
 'cre',
 'ultimate',
 'soccer',
 'database',
 'data',
 'analysis',
 'name',
 'description',
 'dtype',
 'object']

In [136]:
# vectorizer = CountVectorizer(
#     min_df=20) 
# vectorizer.fit(texts)
three_words_pattern = r"\b\w{3,}\b" 
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts] # id and count
# corpus = gensim.matutils.Sparse2Corpus(newsgroup_data_vectorized, documents_columns=False)
# newsgroup_data_vectorized= v.transform(x)
# id_map = dict((v, k) for k, v in v.vocabulary_.items()) 

In [139]:
ldamodel = gensim.models.ldamodel.LdaModel (corpus, num_topics=2, id2word=dictionary, passes=25, random_state=34)
# ldamodel = gensim.models.ldamodel.LdaModel(
#     corpus, num_topics=2, id2word=dictionary, passes=20, random_state= 0)
ldamodel.print_topics(num_topics=2,num_words=10)

[(0,
  '0.082*"description" + 0.079*"transactions" + 0.079*"ultimate" + 0.079*"name" + 0.075*"data" + 0.074*"analysis" + 0.074*"soccer" + 0.073*"database" + 0.070*"datasets" + 0.068*"dtype"'),
 (1,
  '0.085*"cre" + 0.083*"contains" + 0.081*"object" + 0.077*"made" + 0.075*"dtype" + 0.073*"datasets" + 0.070*"database" + 0.069*"soccer" + 0.069*"analysis" + 0.067*"data"')]

<1x36412 sparse matrix of type '<class 'numpy.int64'>'
	with 164 stored elements in Compressed Sparse Row format>

<font color = green >

## Learn more
</font>

Latent Dirichlet allocation
<br>
https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


<font color = green >

## Next lesson: Clustering 
</font>

