# What is Topic Modeling?

Topic modeling is an unsupervised technique for analyzing large volumes of text by clustering documents into groups. The text data have no labels attached; instead, topic modeling groups the documents into clusters based on shared characteristics.

A typical example of topic modeling is clustering a large number of newspaper articles so that articles in the same category, that is, on the same topic, end up together. Evaluating the performance of a topic model is difficult because there are no right answers: it is up to the user to find what the documents in a cluster have in common and to assign the cluster an appropriate label or topic.
    
Two approaches are mainly used for topic modeling: **Latent Dirichlet Allocation** and **Non-Negative Matrix Factorization**.

# Latent Dirichlet Allocation (LDA):

**LDA is based on two general assumptions:**

*   Documents that have similar words usually have the same topic.
*   Documents that have groups of words frequently occurring together usually have the same topic.

These assumptions make sense because documents on the same topic tend to share vocabulary: documents about business, for instance, will contain words like "economy", "profit", "stock market", and "loss". The second assumption states that if these words frequently occur together in multiple documents, those documents may belong to the same category.

**Mathematically, the above two assumptions can be represented as:**

*   Documents are probability distributions over latent topics.
*   Topics are probability distributions over words.
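
To make the two statements concrete, here is a small sketch with made-up numbers. The vocabulary, topic names, and probabilities are assumptions chosen for illustration, not the output of any model. It mixes two topic-word distributions according to one document's topic distribution to obtain that document's word distribution.

```python
# Hypothetical toy numbers, chosen only for illustration.
vocab = ["economy", "profit", "match", "goal"]

# Each topic is a probability distribution over the vocabulary.
topics = {
    "business": [0.5, 0.4, 0.05, 0.05],
    "sports": [0.05, 0.05, 0.5, 0.4],
}

# Each document is a probability distribution over topics.
doc_topics = {"business": 0.8, "sports": 0.2}

# P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)
word_probs = [
    sum(doc_topics[t] * topics[t][j] for t in topics)
    for j in range(len(vocab))
]

print(dict(zip(vocab, (round(p, 2) for p in word_probs))))
# {'economy': 0.41, 'profit': 0.33, 'match': 0.14, 'goal': 0.12}
```

The resulting values again sum to 1, so the mixture is itself a probability distribution over words. LDA works in the opposite direction: it starts from the observed words and estimates both sets of distributions.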


# LDA for Topic Modeling in Python:

---



In this section we will see how Python can be used to implement LDA for topic modeling. The data set can be downloaded from Kaggle.

The data set loaded below, *data.csv*, contains 40 short Arabic texts, each labelled in its *Summary* column as either tourism (spelled "toursim" in the data) or sports. We will use LDA to group the texts into 2 topics.

**The first step, as always, is to import the data set:**

In [141]:
import pandas as pd  
import numpy as np

#reviews_datasets = pd.read_csv(r'E:\Datasets\Reviews.csv')
reviews_datasets = pd.read_csv(r'data.csv')
#reviews_datasets = reviews_datasets.head(1000)  
reviews_datasets = reviews_datasets.dropna()  # dropna() returns a new frame, so keep the result
reviews_datasets.head() 

Unnamed: 0,Text,Summary
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,toursim
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,toursim
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,toursim
3,تنقسم السياحة الهدف أربعة أنواع وهم,toursim
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,toursim


In [142]:
reviews_datasets['Text'][30]

'قامت الدولة بعلو مكانة المرأة أشكال الرياضة فوجدنا سيداتا حصلت مدالية ذهبية فضية التنافس دول نفس الرياضة اللاعبة نور الشربيني لاعبة الأسكواش حصلت بطولة انجلترا العالمية تعد لاعبة مصرية تحصل وهناك أيضا مثالا يحتذي اللاعب أحمد براده وغيره المتألقين الأخرين جعلوا اسم بلدهم يصل عنان السماء ويرفع علم مصر الخارج اذن فالرياضة تعتبر أسمي الأشياء يبرع الأنسان تهتم الدولة بشئون اللاعبين لكرة الفدم جعلهم يصلون كأس العالم سيتم انشاء منافسة عالمية ودولية عام 2018 يظهر دور الدولة السير قدما العالمية ويبين الدور الفعال تحققه الرياضة نكن أكثر احترافيا مثلنا اللاعبين الدول الأخري'

**Before we can apply LDA**, we need to create a vocabulary of all the words in our data. As in the previous article, we can do so with the help of a count vectorizer. Look at the following script:

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(max_df=0.8, min_df=2)  
doc_term_matrix = count_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))

In the script above we use the ***CountVectorizer*** class from the ***sklearn.feature_extraction.text*** module to create a document-term matrix. With *max_df=0.8* and *min_df=2* we keep only words that appear in at most 80% of the documents and in at least 2 documents. Note that no *stop_words* argument is passed, so stop words are not removed explicitly; *max_df* only drops the most widespread words. scikit-learn's built-in stop-word list covers English only, so Arabic text would need a custom list.
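
A rough plain-Python sketch of the document-frequency filtering that *max_df* and *min_df* describe. The helper *filter_vocabulary* and the toy corpus are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

def filter_vocabulary(docs, max_df=0.8, min_df=2):
    """Keep terms found in at least min_df documents and at most max_df of all documents."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # set() so that a word is counted once per document
        doc_freq.update(set(doc.split()))
    return sorted(
        term for term, count in doc_freq.items()
        if min_df <= count <= max_df * n_docs
    )

docs = [
    "apple banana",
    "apple cherry",
    "apple banana date",
    "apple cherry",
    "apple elder",
]
kept = filter_vocabulary(docs)
print(kept)  # ['banana', 'cherry']
```

Here "apple" is dropped because it occurs in every document (more than 80%), while "date" and "elder" are dropped because they occur in only one document each.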

**Now let's look at our document term matrix:**

In [144]:
doc_term_matrix

<40x226 sparse matrix of type '<class 'numpy.int64'>'
	with 643 stored elements in Compressed Sparse Row format>

Each of the 40 documents is represented as a 226-dimensional vector, which means that our vocabulary has 226 words.
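
To illustrate what such a matrix holds, here is a tiny dense version built by hand. The three English sentences are invented for the example; the real matrix above is stored in sparse form and uses the Arabic texts.

```python
# Hypothetical mini corpus, only to show the shape of a document-term matrix.
docs = [
    "tourism supports the economy",
    "sport supports health",
    "tourism and sport",
]

# The vocabulary is the sorted set of all distinct words.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, cell = word count.
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each document becomes a vector whose length equals the vocabulary size, which is exactly the relationship between the 40x226 shape and the 226-word vocabulary.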

**Next, we will use LDA to create topics along with the probability distribution for each word in our vocabulary for each topic:**

In [145]:
from sklearn.decomposition import LatentDirichletAllocation

LDA = LatentDirichletAllocation(n_components=2, random_state=42)  
LDA.fit(doc_term_matrix)  

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
                          evaluate_every=-1, learning_decay=0.7,
                          learning_method='batch', learning_offset=10.0,
                          max_doc_update_iter=100, max_iter=10,
                          mean_change_tol=0.001, n_components=2, n_jobs=None,
                          perp_tol=0.1, random_state=42, topic_word_prior=None,
                          total_samples=1000000.0, verbose=0)

In the script above we use the ***LatentDirichletAllocation*** class from the ***sklearn.decomposition*** module to perform LDA on our document-term matrix. The parameter *n_components* specifies the number of categories, or topics, that we want our text to be divided into. The parameter *random_state* (aka the seed) is set to 42 so that you get results similar to mine.

Let's randomly fetch words from our vocabulary. The count vectorizer holds every word in the vocabulary: its get_feature_names() method returns them as a list, which we can index with the ID of the word that we want to fetch. (scikit-learn 1.0 deprecated this method in favour of get_feature_names_out(), and 1.2 removed it.)

**The following script randomly fetches 10 words from our vocabulary:**

In [146]:
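import random

# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_names = count_vect.get_feature_names_out()

for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])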

مختلفة
مختلفة
نقل
الحلول
تعود
البطالة
جامع
الخيل
الرياضية
القدم


**Let's find 10 words with the highest probability for the first topic. To get the first topic, you can use the components_ attribute and pass a 0 index as the value:**

In [0]:
first_topic = LDA.components_[0]  


The first topic contains the weights of 226 words for topic 1. (scikit-learn stores these as unnormalized pseudo-counts; dividing a row by its sum turns it into a probability distribution, and the ordering is the same either way.) To sort the indexes according to these values, we can use the argsort() function. Once sorted, the 10 words with the highest weights belong to the last 10 indexes of the array.

**The following script returns the indexes of the 10 words with the highest weights:**

In [148]:
top_topic_words = first_topic.argsort()[-10:]
top_topic_words

array([108,  88, 165, 131,   1, 220,  21, 182,  33,  57])

These indexes can then be used to retrieve the corresponding words from the count_vect object.

**This can be done as follows:**

In [149]:
feature_names = count_vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
for i in top_topic_words:
    print(feature_names[i])

بشكل
الكثير
فوائد
جسده
أكثر
يقوم
الانسان
ممارسة
الجسم
الرياضة
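
The sort-and-slice step used above can be reproduced on a tiny example without NumPy. The five words and their weights below are hypothetical and not taken from the fitted model.

```python
# Hypothetical weights for a five-word vocabulary.
words = ["a", "b", "c", "d", "e"]
weights = [0.2, 3.1, 0.7, 5.0, 1.4]

# Indexes sorted by ascending weight, which is what argsort() returns.
order = sorted(range(len(weights)), key=lambda i: weights[i])

# The last k indexes point at the k largest weights, smallest of them first.
top3 = order[-3:]
top_words = [words[i] for i in top3]

# Dividing by the row sum turns the weights into probabilities
# without changing their order.
total = sum(weights)
probs = [w / total for w in weights]

print(top3)       # [4, 1, 3]
print(top_words)  # ['e', 'b', 'd']
```

Because the slice keeps ascending order, the most important word is printed last, which is why the strongest word of a topic appears at the end of each list in this notebook.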


**Let's print the 10 words with the highest probabilities for both topics:**

In [150]:
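feature_names = count_vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
for i, topic in enumerate(LDA.components_):
    print(f'Top 10 words for topic #{i}:')
    # a separate loop variable keeps the topic index i from being shadowed
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')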

Top 10 words for topic #0:
['بشكل', 'الكثير', 'فوائد', 'جسده', 'أكثر', 'يقوم', 'الانسان', 'ممارسة', 'الجسم', 'الرياضة']


Top 10 words for topic #1:
['وكذلك', 'الطبيعية', 'الكثير', 'السياح', 'تعتبر', 'سياحة', 'العالم', 'الدولة', 'مصر', 'السياحة']




The output shows that the first topic might contain texts about sports and the second topic might contain texts about tourism. You can see that a few words appear in both topics, for instance 'الكثير'. This is because some words are used in almost all texts regardless of topic.

As a final step, we will add a column to the original data frame that will store the topic for each text. To do so, we can use the LDA.transform() method and pass it our document-term matrix. This method assigns each document a probability for every topic.

**Look at the following code:**

In [151]:
topic_values = LDA.transform(doc_term_matrix)  
topic_values.shape  

(40, 2)

The output is (40, 2), which means that each of the 40 documents has 2 columns, where each column corresponds to the probability of a particular topic. To find the topic index with the maximum value, we can call the argmax() method and pass 1 as the value for the axis parameter.
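
As a small illustration of what a row-wise argmax does, the sketch below picks the most probable topic for three documents. The probabilities are invented for the example.

```python
# Hypothetical topic probabilities for three documents and two topics.
topic_values = [
    [0.9, 0.1],
    [0.3, 0.7],
    [0.5, 0.5],
]

# Row-wise argmax: the index of the largest value in each row.
# On ties the first index wins, which matches NumPy's argmax.
assigned = [max(range(len(row)), key=row.__getitem__) for row in topic_values]

print(assigned)  # [0, 1, 0]
```

The third row shows a limitation of this hard assignment: a document split evenly between two topics is still given a single label.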

**The following script adds a new column for topic in the data frame and assigns the topic value to each row in the column:**

In [0]:
reviews_datasets['Topic'] = topic_values.argmax(axis=1)  
reviews_datasets.to_csv('LDA_result.csv', encoding='utf-8', index=False)

**Let's now see how the data set looks:**

In [153]:
reviews_datasets.head()  

Unnamed: 0,Text,Summary,Topic
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,toursim,1
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,toursim,1
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,toursim,1
3,تنقسم السياحة الهدف أربعة أنواع وهم,toursim,1
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,toursim,0


# Non-Negative Matrix Factorization (NMF):

In the previous section, we saw how LDA can be used for topic modeling. In this section, we will see how non-negative matrix factorization can be used for topic modeling.

Non-negative matrix factorization is also an unsupervised learning technique, one that performs clustering as well as dimensionality reduction. It can be used in combination with the TF-IDF scheme to perform topic modeling. In this section, we will see how Python can be used to perform non-negative matrix factorization for topic modeling.
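
NMF approximates a non-negative document-term matrix V by the product W·H of two smaller non-negative matrices: each row of W holds one document's weights on the topics, and each row of H holds one topic's weights on the words. The sketch below illustrates the idea in plain Python with the classic multiplicative update rules on a made-up 4x4 matrix. It is an assumption-laden toy, not the solver scikit-learn uses by default, and the function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius distance between v and the product w*h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf_multiplicative(v, k, n_iter=200, seed=42, eps=1e-9):
    """Factor v (n x m) into w (n x k) and h (k x m) with non-negative entries."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
v = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]
w, h = nmf_multiplicative(v, k=2)
print(frobenius_error(v, w, h))
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what lets their rows be read as topic weights.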

**NMF for Topic Modeling in Python:**

In this section, we will perform topic modeling on the same data set as we used in the last section. You will see that the steps are also quite similar.

In [154]:
import pandas as pd  
import numpy as np

reviews_datasets = pd.read_csv(r'data.csv')  
reviews_datasets.head()

Unnamed: 0,Text,Summary
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,toursim
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,toursim
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,toursim
3,تنقسم السياحة الهدف أربعة أنواع وهم,toursim
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,toursim


In [155]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(max_df=0.8, min_df=2)  
doc_term_matrix = tfidf_vect.fit_transform(reviews_datasets['Text'].values.astype('U'))  
tfidf_vect

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.8, max_features=None,
                min_df=2, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

In [156]:
doc_term_matrix

<40x226 sparse matrix of type '<class 'numpy.float64'>'
	with 643 stored elements in Compressed Sparse Row format>

In [157]:
from sklearn.decomposition import NMF

nmf = NMF(n_components=2, random_state=42)  
nmf.fit(doc_term_matrix)

NMF(alpha=0.0, beta_loss='frobenius', init=None, l1_ratio=0.0, max_iter=200,
    n_components=2, random_state=42, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

In [158]:
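import random

# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_names = tfidf_vect.get_feature_names_out()

for i in range(10):
    # randrange excludes the upper bound, so the index is always valid
    random_id = random.randrange(len(feature_names))
    print(feature_names[random_id])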

الاثار
أيضا
تساهم
الحياتية
وذلك
أماكن
عرف
كبيرا
تعود
يقوم


In [159]:
first_topic = nmf.components_[0]  
top_topic_words = first_topic.argsort()[-10:] 
top_topic_words

array([147,  87,  65, 119, 149,  53,  94,  76, 179,  66])

In [160]:
feature_names = tfidf_vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
for i in top_topic_words:
    print(feature_names[i])

زيارة
القومي
السياح
تعتبر
سياحة
الدولة
المختلفة
العالم
مصر
السياحة


In [161]:
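feature_names = tfidf_vect.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
for i, topic in enumerate(nmf.components_):
    print(f'Top 10 words for topic #{i}:')
    # a separate loop variable keeps the topic index i from being shadowed
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')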

Top 10 words for topic #0:
['زيارة', 'القومي', 'السياح', 'تعتبر', 'سياحة', 'الدولة', 'المختلفة', 'العالم', 'مصر', 'السياحة']


Top 10 words for topic #1:
['الرياضية', 'الرياضي', 'فوائد', 'جسده', 'أكثر', 'الانسان', 'يقوم', 'ممارسة', 'الجسم', 'الرياضة']




In [162]:
topic_values = nmf.transform(doc_term_matrix)  
reviews_datasets['Topic'] = topic_values.argmax(axis=1)  
reviews_datasets.to_csv('NMF_result.csv', encoding='utf-8', index=False)
reviews_datasets.head()

Unnamed: 0,Text,Summary,Topic
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,toursim,0
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,toursim,0
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,toursim,0
3,تنقسم السياحة الهدف أربعة أنواع وهم,toursim,0
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,toursim,0


# Text Analytics with TF-IDF:

After you have watched this video:
https://www.youtube.com/watch?v=hXNbFNCgPfY

**you can follow this code:**

In [0]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
df=pd.read_csv('data.csv')

In [9]:
df.head()

Unnamed: 0,Text,Summary
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,toursim
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,toursim
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,toursim
3,تنقسم السياحة الهدف أربعة أنواع وهم,toursim
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,toursim


In [10]:
len(df)

40

In [11]:

len(df[df.Summary=='toursim'])

27

In [13]:
len(df[df.Summary=='sports'])

13

In [0]:
df.loc[df["Summary"] == 'toursim', "Summary"] = 0

In [0]:
df.loc[df["Summary"] == 'sports', "Summary"] = 1

In [17]:
df.head()

Unnamed: 0,Text,Summary
0,تعتبر السياحة احدي الركائز المهمة لاقتصاد بلد ...,1
1,تتوقف متعة السياحة الاطلاع تاريخ الأمم السابقة...,1
2,تحتاج السياحة الكثير الاهتمام بالمرافق السياحي...,1
3,تنقسم السياحة الهدف أربعة أنواع وهم,1
4,السياحة الترفيهية تشمل زيارة الأماكن الساحلية...,1


In [0]:
df_x=df["Text"]
df_y=df["Summary"]

In [0]:
cv = TfidfVectorizer(min_df=1)

In [0]:
x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.2, random_state=4)

In [72]:
x_train.head()

16    تساعد السياحة التخلص الضغوطات والعقد النفسية ت...
11    تعد السياحة سفر الانسان داخل خارج دولته بهدف غ...
19    تمثل السياحة الثقافية وكذا السياحة الاثرية اقد...
33    أهم فوائد الرياضة تقوي بنية الجسم وتجعله يتمتع...
12    تتعدد فوائد وأهمية السياحة الكثير المجالات أبر...
Name: Text, dtype: object

In [0]:
x_traincv = cv.fit_transform(["Hi How are you How are you doing","Hi what's up","Wow that's awesome"])

In [74]:
a = x_traincv.toarray()
a

array([[0.54275734, 0.        , 0.27137867, 0.20639047, 0.54275734,
        0.        , 0.        , 0.        , 0.        , 0.54275734],
       [0.        , 0.        , 0.        , 0.4736296 , 0.        ,
        0.        , 0.62276601, 0.62276601, 0.        , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.57735027, 0.        ]])

In [75]:
a[0]

array([0.54275734, 0.        , 0.27137867, 0.20639047, 0.54275734,
       0.        , 0.        , 0.        , 0.        , 0.54275734])

In [76]:
cv.get_feature_names_out().tolist()  # get_feature_names() in scikit-learn < 1.0

['are', 'awesome', 'doing', 'hi', 'how', 'that', 'up', 'what', 'wow', 'you']
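
The numbers in the array above can be reproduced by hand. The sketch below assumes the vectorizer's default settings (lowercasing, tokens of at least two word characters, smoothed IDF, L2 normalisation) and reimplements them in plain Python; it is meant as an illustration, not as scikit-learn's code.

```python
import math
import re
from collections import Counter

docs = ["Hi How are you How are you doing", "Hi what's up", "Wow that's awesome"]

# Lowercase and keep tokens of two or more word characters.
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n = len(docs)
doc_freq = Counter(t for doc in tokenized for t in set(doc))
idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Term count times IDF, then scaled so the row has unit L2 norm."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in tokenized]

print(vocab)
print([round(x, 8) for x in rows[0]])
```

The first row agrees with the output above: about 0.54275734 for "are", "how", and "you", which each occur twice, and 0.20639047 for "hi", which is down-weighted because it also appears in the second sentence.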

In [0]:
x_traincv=cv.fit_transform(x_train)

In [78]:
a=x_traincv.toarray()
a[0]

array([0., 0., 0., ..., 0., 0., 0.])

In [79]:
cv.inverse_transform(a[:1])  # expects a 2-D array: one row per document

[array(['التخلص', 'السياحة', 'الضغوطات', 'العمل', 'النفسية', 'اليومية',
        'تساعد', 'تسببها', 'مشاكل', 'والعقد'], dtype='<U13')]

In [80]:
x_train.iloc[0]

'تساعد السياحة التخلص الضغوطات والعقد النفسية تسببها مشاكل العمل اليومية'

In [0]:
x_testcv=cv.transform(x_test)

In [0]:
mnb = MultinomialNB()

In [0]:
y_train=y_train.astype('int')

In [83]:
y_train

16    1
11    1
19    1
33    1
12    1
18    1
38    1
13    1
10    1
22    1
32    1
25    1
17    1
36    1
29    1
14    1
2     1
24    1
27    1
6     1
35    1
34    1
21    1
37    1
0     1
3     1
30    1
9     1
8     1
23    1
1     1
5     1
Name: Summary, dtype: int64

In [84]:
mnb.fit(x_traincv,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [85]:
testmessage=x_test.iloc[0]
testmessage

'ولا تنحصر أهمية الرياضة الجانب الجسدي بينت الدراسات الرياضة تأثير ايجابي تحسين أعراض الأمراض النفسية كالفصام يتمثل بمشاعر الغضب والكراهية والشعور بالذنب والثبات الوضعي وتأخر الحركة6  وتتخذ الرياضة العديد الأشكال تكون رياضة ساكنة كرفع الأثقال والتزلج الماء حركية كالمشي والركض والتجديف وركوب الخيل والدراجات والسباحة فالسباحة نشاط ترفيهي تنافسي العديد الفوائد لجسم الانسان وأكثر الأشكال الرياضية شيوعا المشي يتخذ شكل نشاط يومي اعتيادي مسابقة تنافسية وتشجع منظمات الصحة محاولة قضاء المهام اليومية مشيا استخدام السيارة والحافلات والرياضة الحركية تحرك أكبر عدد العضلات الجسم وتنشط الدورة الدموية لحاجة الجسم الحركة للأكسجين'

In [95]:
predictions=mnb.predict(x_testcv)
predictions

array([1, 1, 1, 1, 1, 1, 1, 1])

In [96]:
a=np.array(y_test)
a

array([1, 1, 1, 1, 1, 1, 1, 1], dtype=object)

In [0]:
# count how many predictions match the true labels
count = 0
for i in range(len(predictions)):
    if predictions[i] == a[i]:
        count = count + 1

In [98]:
count

8

In [99]:
len(predictions)

8

In [101]:
accuracy = count / len(predictions)
accuracy

1.0
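
The counting loop above computes plain accuracy. Below is a minimal sketch of the same calculation together with a majority-class baseline for comparison, using the label values printed above. The helper names *accuracy* and *majority_baseline* are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(int(t) == int(p) for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(int(t) for t in y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Values copied from the outputs of predictions and y_test above.
y_pred = [1, 1, 1, 1, 1, 1, 1, 1]
y_true = [1, 1, 1, 1, 1, 1, 1, 1]

print(accuracy(y_true, y_pred))   # 1.0
print(majority_baseline(y_true))  # 1.0
```

When every test label belongs to the same class, as in the output shown, the baseline is also 1.0, so the accuracy on its own says little about how well the classifier separates the two categories. It is worth checking that both classes are present in *y_train* and *y_test* before reading much into the score.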

In [163]:
# you can also classify a single text as follows
predictions=mnb.predict(x_testcv[0])
predictions

array([1])