# Topic Modelling by `Mr. Harshit Dawar!`
### Algorithm: LDA (Latent Dirichlet Allocation)

***Steps***
1. The user decides the number of topics to which the words from the documents will be assigned.
2. Each word in each document is initially assigned to one of these topics at random.
3. This gives a topic mix for each document and a set of word assignments for each topic, although the initial assignments are random and carry no meaning yet.
4. The assignments are then updated repeatedly, word by word, using the probabilities given below, until they stop changing much.

For each topic:  ***probability( topic "t" | document "d")***, the probability of topic "t" occurring in document "d".

For each word:  ***probability( word "w" | topic "t")***, the probability of word "w" belonging to topic "t".

The probability that topic "t" generated word "w" in document "d" is proportional to: ***probability( topic "t" | document "d") * probability( word "w" | topic "t")***
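The reassignment loop described in the steps can be sketched with collapsed Gibbs sampling, which is one common way to fit LDA (scikit-learn's `LatentDirichletAllocation` uses variational Bayes instead). This is a minimal sketch under stated assumptions, not a reference implementation: the function name `gibbs_lda`, the toy corpus of word ids, and the smoothing values `alpha` and `beta` are all hypothetical choices made for illustration.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler. `docs` is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in document d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t
    assignments = []

    # Step 2: assign every word to a random topic.
    for d, doc in enumerate(docs):
        z = rng.integers(n_topics, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    # Steps 3 and 4: revisit each word and resample its topic.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # Proportional to probability(topic t | document d).
                p_topic_given_doc = doc_topic[d] + alpha
                # probability(word w | topic t), smoothed by beta.
                p_word_given_topic = (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                p = p_topic_given_doc * p_word_given_topic
                p /= p.sum()
                # Sample the new topic and restore the counts.
                t = rng.choice(n_topics, p=p)
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Toy corpus: word ids 0-1 appear only in the first document, 2-3 only in the second.
docs = [[0, 1, 0, 1], [2, 3, 2, 3, 3]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=4)
print(doc_topic)
```

Each row of `doc_topic` sums to the length of the corresponding document, since every word always carries exactly one topic assignment.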


**Important Pointers**

* The user has to decide the number of topics to extract from the documents.
* The user has to interpret the topics themselves.

**A Few Assumptions of LDA**

* Documents are probability distributions over topics, and topics are probability distributions over words.
* Documents with similar topics use similar groups of words.
* Topics can be found by searching for groups of words that frequently occur together in documents across the corpus.
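The first assumption can be checked on a fitted model. The sketch below assumes a tiny made-up corpus (`toy_docs`) and two topics, both chosen only for illustration; it shows that every document row and every normalized topic row sums to 1.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the dataset used in this notebook.
toy_docs = [
    "goal match team player score",
    "team coach match season goal",
    "vote election party senate law",
    "senate vote law president party",
]
counts = CountVectorizer().fit_transform(toy_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
# Each row is one document's probability distribution over topics.
doc_topic = lda.fit_transform(counts)
# components_ holds unnormalized weights; normalizing each row gives
# one topic's probability distribution over words.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topic.sum(axis=1))
print(topic_word.sum(axis=1))
```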

In [13]:
# Importing the required Libraries
import pandas as pd
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
# Loading the Dataset
data = pd.read_csv("data.csv")

In [4]:
data.head()

| | Article |
|---|---|
| 0 | In the Washington of 2016, even when the polic... |
| 1 | Donald Trump has used Twitter — his prefe... |
| 2 | Donald Trump is unabashedly praising Russian... |
| 3 | Updated at 2:50 p. m. ET, Russian President Vl... |
| 4 | From photography, illustration and video, to d... |


In [5]:
data.shape

(11992, 1)

## Vectorizing the words

In [6]:
"""
* min_df is the minimum number of documents in which a word must occur. Words that occur in fewer
  documents are ignored.
* max_df is the maximum fraction of documents in which a word may occur. With 0.95, words that occur
  in more than 95% of the documents are ignored.

* English stop words are removed.
"""
vectorizer = CountVectorizer(min_df = 2, max_df = 0.95, stop_words="english")
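To see what these settings do, here is a small sketch on a made-up four-document corpus (the names `toy_docs` and `toy_vectorizer` are assumptions for illustration only). "apple" occurs in all four documents, which is above the 95% limit, and "eel" occurs in only one document, which is below `min_df`, so both are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the dataset used in this notebook.
toy_docs = [
    "apple banana cat",
    "apple banana dog",
    "apple cat dog",
    "apple eel",
]
toy_vectorizer = CountVectorizer(min_df=2, max_df=0.95, stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# Words that survived the document-frequency filters.
kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)          # ['banana', 'cat', 'dog']
print(toy_matrix.shape)    # (4, 3)
```

An integer `min_df` is read as an absolute document count, while a float `max_df` is read as a proportion of the documents.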

In [8]:
document_word_matrix = vectorizer.fit_transform(data.Article)

In [15]:
document_word_matrix

<11992x54777 sparse matrix of type '<class 'numpy.int64'>'
	with 3033388 stored elements in Compressed Sparse Row format>

In [16]:
# Densify only the first row; toarray() on the full matrix would allocate 11992 x 54777 integers
np.unique(document_word_matrix[0].toarray())

array([ 0,  1,  2,  3,  4,  5,  6,  7,  9, 10, 15, 19])