# Topic Modeling
1. Topic modeling allows us to efficiently analyze large volumes of text by clustering documents into topics. A large volume of text is usually unlabeled, so we cannot apply the supervised learning approaches used previously to build a machine learning model for the data.
2. One important idea is that we do not know the 'correct topic' or 'the right answer'.
3. All we know is that documents clustered together share the same topic; it is up to the user to identify what these topics represent.

# LDA
ASSUMPTIONS of LDA
1. Documents with similar topics use similar groups of words
2. Latent topics can be found by searching for groups of words that frequently occur together across the corpus 

ASSUMPTIONS of LDA for topic modeling
1. Documents are probability distributions over latent topics
2. Topics themselves are probability distributions over words

LDA represents documents as mixtures of topics that emit words with certain probabilities.
It assumes that each document is produced in the following fashion:
1. Choose the topic mixture for the document (according to a Dirichlet distribution over a fixed set of topics), e.g. 60% business, 20% politics, 10% food, with the remainder spread over the other topics.
Generate each word in the document by:
1. Picking a topic according to the document's topic mixture chosen above.
2. Using that topic to generate the word itself (according to the topic's multinomial distribution over words), e.g. if we pick the topic 'food', we may generate the word 'apple' with 60% probability, 'home' with 30% probability, and so on.
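
The generative story above can be simulated directly. The following is a minimal sketch, not part of the original notebook: the three topics, the six-word vocabulary, the `topic_word` probabilities, and the helper `generate_document` are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical topics and vocabulary, chosen only for illustration.
topics = ["business", "politics", "food"]
vocab = ["market", "stock", "vote", "senate", "apple", "home"]

# Hypothetical per-topic word distributions (each row sums to 1).
topic_word = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],  # business
    [0.03, 0.02, 0.45, 0.45, 0.02, 0.03],  # politics
    [0.02, 0.02, 0.03, 0.03, 0.60, 0.30],  # food
])

def generate_document(n_words, alpha=(1.0, 1.0, 1.0)):
    # Draw the document's topic mixture from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the topic mixture.
        z = rng.choice(len(topics), p=theta)
        # Pick a word according to that topic's word distribution.
        w = rng.choice(len(vocab), p=topic_word[z])
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(10)
print(dict(zip(topics, theta.round(2))))
print(" ".join(words))
```

Fitting LDA runs this process in reverse: given only the words, it estimates the topic mixtures and the per-topic word distributions that could have produced them.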

In [2]:
import pandas as pd

In [3]:
npr=pd.read_csv('npr.csv')

In [4]:
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [5]:
npr['Article'][0]

'In the Washington of 2016, even when the policy can be bipartisan, the politics cannot. And in that sense, this year shows little sign of ending on Dec. 31. When President Obama moved to sanction Russia over its alleged interference in the U. S. election just concluded, some Republicans who had long called for similar or more severe measures could scarcely bring themselves to approve. House Speaker Paul Ryan called the Obama measures ”appropriate” but also ”overdue” and ”a prime example of this administration’s ineffective foreign policy that has left America weaker in the eyes of the world.” Other GOP leaders sounded much the same theme. ”[We have] been urging President Obama for years to take strong action to deter Russia’s worldwide aggression, including its   operations,” wrote Rep. Devin Nunes,  . chairman of the House Intelligence Committee. ”Now with just a few weeks left in office, the president has suddenly decided that some stronger measures are indeed warranted.” Appearing 

In [6]:
len(npr)

11992

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
#create an instance 
cv=CountVectorizer(max_df=0.95,min_df=2,stop_words='english')

1. min_df=2 means that a word must appear in at least 2 documents to be counted by the CountVectorizer.
2. max_df=0.95 discards words that appear in more than 95% of the documents (very common words such as is, are, he, she...).
3. Stop words: a stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Passing stop_words='english' removes the words on scikit-learn's built-in English stop-word list.
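
To see what `min_df` and `max_df` do, here is a small sketch on a made-up four-document corpus (the corpus and the names `docs`, `cv_demo`, and `dtm_demo` are illustrative assumptions). `stop_words` is left out on purpose so that only the two frequency filters are at work.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the bird sang",
]

# Same thresholds as in the notebook, but without stop-word removal.
cv_demo = CountVectorizer(max_df=0.95, min_df=2)
dtm_demo = cv_demo.fit_transform(docs)

# "the" appears in 4/4 documents (> 95%), so max_df drops it.
# "mat", "log", "chased", "bird", "sang" appear in 1 document, so min_df drops them.
print(sorted(cv_demo.vocabulary_))  # ['cat', 'dog', 'on', 'sat']
print(dtm_demo.shape)               # (4, 4)
```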


In [9]:
dtm=cv.fit_transform(npr['Article'])

In [10]:
dtm

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [19]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
LDA= LatentDirichletAllocation(n_components=7,random_state=42)

In [None]:
LDA.fit(dtm)

In [None]:
# grab the vocabulary of words 

#grab the topics

#grab the highest probability words per topic 
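
The three steps listed in the cell above are left empty in the notebook. One possible way to fill them in is sketched below on a tiny made-up corpus so that it runs on its own; the names `toy_docs`, `toy_cv`, `toy_dtm`, and `toy_lda` are assumptions, and on the real data the fitted `cv` and `LDA` objects would take their place. The vocabulary is rebuilt from `vocabulary_` so the sketch does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus, invented for illustration.
toy_docs = [
    "stocks fell as markets reacted to interest rates",
    "the senate passed the election bill",
    "investors bought stocks and bonds",
    "voters backed the senate candidate in the election",
]

toy_cv = CountVectorizer(stop_words="english")
toy_dtm = toy_cv.fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
toy_lda.fit(toy_dtm)

# Grab the vocabulary of words, ordered by column index in the matrix.
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)

# Grab the topics: one row per topic, one column per vocabulary word.
print(toy_lda.components_.shape)

# Grab the highest-weight words per topic (argsort is ascending, so take the tail).
for index, topic in enumerate(toy_lda.components_):
    top_words = [vocab[i] for i in topic.argsort()[-5:]]
    print(f"Top 5 words for topic #{index}: {top_words}")
```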


# Non-negative Matrix Factorization
Non-negative Matrix Factorization is an unsupervised algorithm that simultaneously performs dimensionality reduction and clustering.
We can use it in conjunction with TF-IDF to model topics across documents.
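
As a rough sketch of what the factorization produces, the example below factors a small TF-IDF matrix X into a document-topic matrix W and a topic-word matrix H, both non-negative, so that W @ H approximates X. The corpus and the variable names are made up for illustration, and `init="nndsvd"` is chosen here only to make the run deterministic.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
docs = [
    "stocks and bonds rallied on the market",
    "the market fell as stocks dropped",
    "the senate debated the election bill",
    "voters decide the election for the senate",
]

X = TfidfVectorizer().fit_transform(docs)

model = NMF(n_components=2, init="nndsvd", random_state=42, max_iter=500)
W = model.fit_transform(X)   # documents x topics (the reduced representation)
H = model.components_        # topics x words

print(X.shape, W.shape, H.shape)
print((W >= 0).all(), (H >= 0).all())        # both factors are non-negative
print(np.linalg.norm(X.toarray() - W @ H))   # reconstruction error
```

Each row of W gives a document's weight on every topic, which is why taking the largest entry per row acts as a cluster assignment.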

In [17]:
import pandas as pd

In [18]:
npr=pd.read_csv('npr.csv')
npr.head()

Unnamed: 0,Article
0,"In the Washington of 2016, even when the polic..."
1,Donald Trump has used Twitter — his prefe...
2,Donald Trump is unabashedly praising Russian...
3,"Updated at 2:50 p. m. ET, Russian President Vl..."
4,"From photography, illustration and video, to d..."


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
tfidf=TfidfVectorizer(max_df=0.95,min_df=0.1,stop_words='english')

In [21]:
dtm=tfidf.fit_transform(npr['Article'])

In [22]:
dtm

<11992x385 sparse matrix of type '<class 'numpy.float64'>'
	with 823171 stored elements in Compressed Sparse Row format>

In [23]:
from sklearn.decomposition import NMF

In [24]:
nmf_model=NMF(n_components=7,random_state=42)

In [25]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=7, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [28]:
# In scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
feature_names = tfidf.get_feature_names()
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC # {index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])

THE TOP 15 WORDS FOR TOPIC # 0
['things', 've', 'don', 'book', 'world', 'new', 'way', 'really', 'time', 'life', 'know', 'think', 'people', 'just', 'like']
THE TOP 15 WORDS FOR TOPIC # 1
['policy', 'washington', 'office', 'presidential', 'election', 'administration', 'republican', 'obama', 'white', 'campaign', 'donald', 'house', 'said', 'president', 'trump']
THE TOP 15 WORDS FOR TOPIC # 2
['department', 'city', 'statement', 'npr', 'law', 'told', 'president', 'according', 'government', 'state', 'reported', 'court', 'reports', 'police', 'said']
THE TOP 15 WORDS FOR TOPIC # 3
['don', 'research', 'year', 'study', '000', 'new', 'company', 'just', 'years', 'university', 'water', 'like', 'food', 'people', 'says']
THE TOP 15 WORDS FOR TOPIC # 4
['million', 'research', 'children', 'public', 'states', 'federal', 'program', 'act', 'law', 'plan', 'study', 'people', 'percent', 'care', 'health']
THE TOP 15 WORDS FOR TOPIC # 5
['won', 'states', 'political', 'said', 'obama', 'percent', 'republican', 'p

In [29]:
topic_results=nmf_model.transform(dtm)

In [30]:
topic_results

array([[0.01458297, 0.14155629, 0.01518527, ..., 0.        , 0.03915428,
        0.00340823],
       [0.00662681, 0.16616382, 0.02065129, ..., 0.        , 0.00096774,
        0.00135205],
       [0.        , 0.16321501, 0.02205973, ..., 0.        , 0.05173463,
        0.        ],
       ...,
       [0.03571259, 0.        , 0.04173332, ..., 0.04512429, 0.00886924,
        0.04080381],
       [0.01823482, 0.04778487, 0.        , ..., 0.        , 0.16048895,
        0.        ],
       [0.01933994, 0.01697408, 0.05502703, ..., 0.00625853, 0.01690319,
        0.02053306]])

In [31]:
topic_results.shape

(11992, 7)

In [34]:
topic_results.argmax(axis=1)

array([1, 1, 1, ..., 4, 5, 3])

In [36]:
npr['Topic']=topic_results.argmax(axis=1)
npr.head()

Unnamed: 0,Article,Topic
0,"In the Washington of 2016, even when the polic...",1
1,Donald Trump has used Twitter — his prefe...,1
2,Donald Trump is unabashedly praising Russian...,1
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1
4,"From photography, illustration and video, to d...",2


In [38]:
# Labels are assigned by hand after reading the top words of each topic
mytopic_dict = {0: 'health', 1: 'election', 2: 'legis', 3: 'poli', 4: 'election', 5: 'music', 6: 'edu'}
npr['Topic Label'] = npr['Topic'].map(mytopic_dict)

In [39]:
npr.head()

Unnamed: 0,Article,Topic,Topic Label
0,"In the Washington of 2016, even when the polic...",1,election
1,Donald Trump has used Twitter — his prefe...,1,election
2,Donald Trump is unabashedly praising Russian...,1,election
3,"Updated at 2:50 p. m. ET, Russian President Vl...",1,election
4,"From photography, illustration and video, to d...",2,legis
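
The labelling step used above (take the largest topic weight per document, then map the topic index to a name) can be seen in isolation in this small sketch. The weights and the label names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical document-topic weights: 3 documents, 3 topics.
weights = np.array([
    [0.01, 0.14, 0.02],
    [0.00, 0.05, 0.16],
    [0.30, 0.02, 0.01],
])

# Index of the largest weight in each row = dominant topic per document.
dominant = weights.argmax(axis=1)
print(dominant)  # [1 2 0]

# Hypothetical hand-written names for each topic index.
labels = {0: "topic A", 1: "topic B", 2: "topic C"}

df = pd.DataFrame({"Topic": dominant})
df["Topic Label"] = df["Topic"].map(labels)
print(df)
```

Note that `argmax` keeps only the single strongest topic, so a document that mixes two topics almost equally is still assigned to just one of them.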


## Project: find the topics in the Quora questions dataset

In [1]:
import pandas as pd

In [6]:
quora=pd.read_csv('quora_questions.csv')

In [12]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [9]:
quora['Question'][0]

'What is the step by step guide to invest in share market in india?'

In [10]:
len(quora)

404289

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
cv=CountVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [16]:
dtm=cv.fit_transform(quora['Question'])

In [17]:
dtm

<404289x38669 sparse matrix of type '<class 'numpy.int64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

In [18]:
from sklearn.decomposition import LatentDirichletAllocation

In [20]:
LDA= LatentDirichletAllocation(n_components=20,random_state=50)

In [21]:
LDA.fit(dtm)

KeyboardInterrupt: 
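
The fit above was interrupted, presumably because batch LDA with 20 topics on roughly 400,000 questions takes a long time (that reading of the `KeyboardInterrupt` is an assumption). One option that may help is scikit-learn's online variational update, which processes the documents in mini-batches. The sketch below shows the relevant parameters on a tiny made-up corpus; the corpus, the variable names, and the values for `batch_size` and `max_iter` are illustrative only and would need tuning on the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy questions, invented for illustration.
questions = [
    "How do I invest in the stock market?",
    "What is the best way to learn programming?",
    "Which stocks should I buy this year?",
    "How can I learn Python programming quickly?",
]

cv_small = CountVectorizer(stop_words="english")
dtm_small = cv_small.fit_transform(questions)

lda_online = LatentDirichletAllocation(
    n_components=2,
    learning_method="online",  # mini-batch updates instead of full-batch
    batch_size=2,              # documents per mini-batch (illustrative value)
    max_iter=5,                # passes over the data (illustrative value)
    random_state=50,
)
doc_topics = lda_online.fit_transform(dtm_small)

print(doc_topics.shape)        # (4, 2)
print(doc_topics.sum(axis=1))  # each row is a distribution over topics
```

Fitting on a random sample of the questions first is another way to check that the pipeline works before committing to a full run.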

# Non-negative Matrix Factorization

In [2]:
import pandas as pd

In [3]:
quora=pd.read_csv('quora_questions.csv')

In [6]:
quora.head()

Unnamed: 0,Question
0,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...
2,How can I increase the speed of my internet co...
3,Why am I mentally very lonely? How can I solve...
4,"Which one dissolve in water quikly sugar, salt..."


In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
tfidf=TfidfVectorizer(max_df=0.95,min_df=2,stop_words='english')

In [9]:
dtm=tfidf.fit_transform(quora['Question'])
dtm

<404289x38669 sparse matrix of type '<class 'numpy.float64'>'
	with 2002912 stored elements in Compressed Sparse Row format>

In [11]:
from sklearn.decomposition import NMF

In [12]:
nmf_model=NMF(n_components=20,random_state=50)

In [13]:
nmf_model.fit(dtm)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=20, random_state=50, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [14]:
# In scikit-learn >= 1.0, get_feature_names() is replaced by get_feature_names_out()
feature_names = tfidf.get_feature_names()
for index, topic in enumerate(nmf_model.components_):
    print(f"THE TOP 15 WORDS FOR TOPIC # {index}")
    print([feature_names[i] for i in topic.argsort()[-15:]])

THE TOP 15 WORDS FOR TOPIC # 0
['thing', 'read', 'place', 'visit', 'places', 'phone', 'buy', 'laptop', 'movie', 'ways', '2016', 'books', 'book', 'movies', 'best']
THE TOP 15 WORDS FOR TOPIC # 1
['majors', 'recruit', 'sex', 'looking', 'differ', 'use', 'exist', 'really', 'compare', 'cost', 'long', 'feel', 'work', 'mean', 'does']
THE TOP 15 WORDS FOR TOPIC # 2
['add', 'answered', 'needing', 'post', 'easily', 'improvement', 'delete', 'asked', 'google', 'answers', 'answer', 'ask', 'question', 'questions', 'quora']
THE TOP 15 WORDS FOR TOPIC # 3
['using', 'website', 'investment', 'friends', 'black', 'internet', 'free', 'home', 'easy', 'youtube', 'ways', 'earn', 'online', 'make', 'money']
THE TOP 15 WORDS FOR TOPIC # 4
['balance', 'earth', 'day', 'death', 'changed', 'live', 'want', 'change', 'moment', 'real', 'important', 'thing', 'meaning', 'purpose', 'life']
THE TOP 15 WORDS FOR TOPIC # 5
['reservation', 'engineering', 'president', 'minister', 'company', 'china', 'country', 'business', 'oly

In [15]:
topic_results=nmf_model.transform(dtm)

In [16]:
topic_results.argmax(axis=1)

array([ 5, 16, 17, ..., 11, 11,  9])

In [17]:
quora['Topic']=topic_results.argmax(axis=1)
quora.head()

Unnamed: 0,Question,Topic
0,What is the step by step guide to invest in sh...,5
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,16
2,How can I increase the speed of my internet co...,17
3,Why am I mentally very lonely? How can I solve...,11
4,"Which one dissolve in water quikly sugar, salt...",14


In [20]:
# to_csv returns None when given a file path, so there is nothing to assign
quora.to_csv('quora_topic.csv')

In [18]:
# Boolean mask of questions assigned to topic 11; quora[quora['Topic'] == 11] selects those rows
quora['Topic'] == 11

0         False
1         False
2         False
3          True
4         False
          ...  
404284    False
404285    False
404286     True
404287     True
404288    False
Name: Topic, Length: 404289, dtype: bool

In [None]:
#does not work very well