#### Author: Mohlatlego nakeng

#### Topic Modelling

* Latent Dirichlet Allocation (LDA) is a technique that automatically discovers the topics a set of documents contains. It represents each document as a mixture of topics, and each topic as a probability distribution over words.
* Suppose you have a set of documents and have chosen a fixed number K of topics to discover. LDA then learns the topic representation of each document and the words associated with each topic.
* The Dirichlet distribution is specified by a vector parameter α containing one αi for each topic i, which we write as Dir(α). In LDA it serves as the prior over each document's topic mixture.
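
The generative story behind LDA can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper names (`sample_dirichlet`, `generate_document`), the four-word vocabulary, and the topic-word probabilities are all made up for the example and do not come from the notebook's data.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, topic_word_probs, vocab, n_words, rng):
    """Generate one toy document following LDA's generative story."""
    # Topic mixture of this document, drawn from the Dirichlet prior
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's mixture
        k = rng.choices(range(len(alpha)), weights=theta)[0]
        # Pick a word according to that topic's word distribution
        words.append(rng.choices(vocab, weights=topic_word_probs[k])[0])
    return theta, words

# Hypothetical toy setup: K = 2 topics over a 4-word vocabulary
rng = random.Random(0)
vocab = ["vaccine", "dose", "lockdown", "mask"]
topic_word_probs = [
    [0.6, 0.4, 0.0, 0.0],  # topic 0 only emits "vaccine" and "dose"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "lockdown" and "mask"
]
alpha = [0.5, 0.5]

theta, words = generate_document(alpha, topic_word_probs, vocab, 8, rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the topic-word distributions.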

#### Non-negative Matrix Factorization

* LDA is based on probabilistic graphical modeling, while NMF relies on linear algebra.
* Both algorithms take a bag-of-words matrix as input: each row represents a document, and each column holds the count of one vocabulary word in that document.
* Each algorithm then produces two smaller matrices, a document-to-topic matrix and a word-to-topic matrix, whose product approximately reproduces the bag-of-words matrix with the lowest possible error.
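
To make the factorization idea concrete, here is a minimal pure-Python sketch of NMF using the classic multiplicative update rules. It is an assumption-laden illustration: the toy matrix `v` and all helper names are hypothetical, and multiplicative updates are not the solver scikit-learn's `NMF` uses by default.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorise V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    n_docs, n_words = len(v), len(v[0])
    # Strictly positive random initialisation
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_words)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_words)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h

# Hypothetical bag-of-words matrix: 4 documents x 4 words, two obvious word groups
v = [
    [2, 1, 0, 0],
    [3, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 3],
]
w, h = nmf(v, n_topics=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

Here `w` plays the role of the document-to-topic matrix and `h` the topic-to-word matrix; their product approximates `v`.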

#### Importing Libraries  

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import pyLDAvis
import pyLDAvis.sklearn  # renamed to pyLDAvis.lda_model in pyLDAvis 3.4+
import sys
import pandas as pd
import os
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

In [11]:
import preprocessor as p

In [12]:
from nltk.corpus import stopwords

#### Import datasets

In [13]:
data = pd.read_csv("data/vaccine_jhb.csv", sep="\t")

In [14]:
data.columns

Index(['id', 'conversation_id', 'created_at', 'date', 'time', 'timezone',
       'user_id', 'username', 'name', 'place', 'tweet', 'language', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video',
       'thumbnail', 'near', 'geo', 'source', 'user_rt_id', 'user_rt',
       'retweet_id', 'reply_to', 'retweet_date', 'translate', 'trans_src',
       'trans_dest'],
      dtype='object')

In [15]:
data.tweet

0       @LadyhawkAnnie The fact that nobody has attemp...
1       @Newzroom405 Maybe he needs Pfizer for the spear.
2       @__Inolofatse__ Fostofol I'm not your fan, sec...
3       @SundayTimesZA Asina ndaba we trust our own na...
4       As August comes to an end we look back at 27 d...
                              ...                        
9722    @TedPhaladi I agree just go with the flow beca...
9723    And for the love of God, fuck #SAA, use this m...
9724    And A VACCINE STRATEGY SHOULD BE ANNOUNCED AND...
9725    Where it all start the chines are on with thei...
9726    @leylurr ❤️❤️❤️ just you wait until we have a ...
Name: tweet, Length: 9727, dtype: object

#### Data cleaning

* Removing mentions or tags

In [16]:
import re

In [17]:
from nltk.stem import PorterStemmer
stop_words=stopwords.words('english')
stemmer=PorterStemmer()

In [18]:
def data_clean(text):
    # Remove @mentions first, while the '@' is still there to anchor the pattern
    tweet = re.sub(r'@[A-Za-z0-9_]+', ' ', text)
    # Replace every remaining non-letter character with a space
    tweet = re.sub(r'[^a-zA-Z]', ' ', tweet)
    tweet = tweet.lower().split()
    # Drop English stop words and stem what is left
    tweet = [stemmer.stem(word) for word in tweet if word not in stop_words]
    # tweet = p.clean(text)
    return ' '.join(tweet)
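
The order of the two regular-expression substitutions in the cleaning step matters. The sketch below uses only the standard library to show the difference on one tweet from the dataset; the two function names are hypothetical and exist only for this comparison.

```python
import re

MENTION = re.compile(r'@[A-Za-z0-9_]+')
NON_LETTER = re.compile(r'[^a-zA-Z]')

def strip_mentions_then_symbols(text):
    text = MENTION.sub(' ', text)      # '@user' is removed as a whole
    text = NON_LETTER.sub(' ', text)   # then digits and punctuation go
    return ' '.join(text.lower().split())

def strip_symbols_then_mentions(text):
    text = NON_LETTER.sub(' ', text)   # '@' is gone after this line
    text = MENTION.sub(' ', text)      # so this pattern can no longer match
    return ' '.join(text.lower().split())

sample = "@Newzroom405 Maybe he needs Pfizer for the spear."
print(strip_mentions_then_symbols(sample))  # maybe he needs pfizer for the spear
print(strip_symbols_then_mentions(sample))  # newzroom maybe he needs pfizer for the spear
```

Stripping symbols first leaves the username behind as an ordinary word, which then ends up in the vocabulary.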

In [19]:
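def display_topics(model, feature_names, no_top_words):
    # Print the highest-weighted words of every topic in the fitted model
    for topic_idx, topic in enumerate(model.components_):
        print("Topic:", topic_idx)
        # argsort is ascending, so this slice takes the last no_top_words indices in reverse
        top_indices = topic.argsort()[:-no_top_words - 1:-1]
        print(" ".join(feature_names[i] for i in top_indices))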


def tfidf_vectorizer(documents,total_features):

    #  TFIDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=total_features, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
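    # get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()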
    return tfidf_vectorizer,tfidf,tfidf_feature_names

def count_vectorizer(documents,total_features):

    #  Count Vectorizer
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=total_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
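    # get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
    tf_feature_names = tf_vectorizer.get_feature_names_out()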
    return tf_vectorizer,tf,tf_feature_names
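
The slice `[:-no_top_words - 1:-1]` used in `display_topics` is compact but easy to misread. The following sketch reproduces the same logic on a plain Python list so it can be checked by hand; `top_indices` and the toy weights are hypothetical, and `sorted` stands in for NumPy's `argsort`.

```python
def top_indices(weights, n):
    # Indices sorted by ascending weight, like numpy's argsort
    order = sorted(range(len(weights)), key=weights.__getitem__)
    # [:-n - 1:-1] walks backwards from the end and stops after n items
    return order[:-n - 1:-1]

# Hypothetical topic: one weight per vocabulary word
feature_names = ["mask", "vaccine", "lockdown", "dose", "tweet"]
topic = [0.1, 2.5, 0.7, 1.9, 0.3]

top = top_indices(topic, 3)
print([feature_names[i] for i in top])  # ['vaccine', 'dose', 'lockdown']
```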

In [20]:
total_features = 15000
num_topic = 20
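# Note: the raw tweets are vectorized here; data_clean is defined above but never applied.
# This call also rebinds the name tfidf_vectorizer from the helper function to the fitted
# vectorizer, so the cell cannot be re-run without re-running the definitions first.
tfidf_vectorizer, tfidf, tfidf_feature_names = tfidf_vectorizer(data['tweet'], total_features)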
tf_vectorizer, tf, tf_feature_names = count_vectorizer(data['tweet'],total_features)
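
Both vectorizers are configured with `min_df=2` and `max_df=0.95`. An integer `min_df` is an absolute document count and a float `max_df` is a fraction of all documents. The sketch below mimics only that filtering rule with the standard library; `filter_vocabulary` and the four toy documents are hypothetical, and plain whitespace splitting is assumed instead of scikit-learn's tokenizer and stop-word list.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.95):
    """Keep terms that occur in at least min_df documents and at most max_df of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document
        doc_freq.update(set(doc.lower().split()))
    max_count = max_df * len(docs)
    return sorted(t for t, c in doc_freq.items() if min_df <= c <= max_count)

docs = [
    "vaccine rollout starts",
    "vaccine doses arrive",
    "vaccine myths spread",
    "lockdown rules and vaccine doses",
]
print(filter_vocabulary(docs))  # ['doses']
```

In this toy corpus "vaccine" appears in every document and is dropped by `max_df`, while words that appear in a single document are dropped by `min_df`.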



In [21]:
# LDA is normally fit on raw term counts (tf); this run fits it on the TF-IDF matrix instead.
model_lda = LatentDirichletAllocation(
    n_components=num_topic,
    max_iter=30,
    learning_method='online',
    learning_offset=50.,
    random_state=0,
).fit(tfidf)
no_top_words = 20

In [22]:
display_topics(model_lda, tfidf_feature_names, no_top_words)

Topic: 0
gene therapy lockdownhouseparty experimental universities skynews kimberleypipet baba lynnbrittney2 khadijapatel thread authority zoeharcombe declined pentse radio702 bullied revolutionary hase abuse
Topic: 1
kingphaahla2 johnson_yolande unokwandalo thando_honey chile bdlivesa potus gloves donate hello veins edmnangagwa appreciated existence r350 noxnonozi khayajames locals diy amazulu
Topic: 2
ne queen tea ncube_johnson church lock yhooo established zanu cap bonang_m aunts grace scn_nkosi dozes astrazenaca sole overthinking container covidvaccine
Topic: 3
promote masego version lava tablet suncloud shaken chava sunboxambassador spreadalittlesunshine sunbox mokang pilane sunshinecinema sparkconversation pontsho busts cape discusses northern
Topic: 4
sabreakingnews carteblanchetv mnet utshwala sam zimbabwean reacting zimbabweans lakho material pruwolfie presently hermainem wath covid_19sa robert mnangagwa nanoza23 sogama_l_sabifa engineered
Topic: 5
bonitascares bonitasmedical 

#### Testing on JHB data

In [23]:
data_op = pyLDAvis.sklearn.prepare(model_lda,tfidf,tfidf_vectorizer)
pyLDAvis.enable_notebook()
pyLDAvis.display(data_op)

#### Testing on MMA data

In [24]:
data_comp = pd.read_excel("Complaints_Reviewed_As_Disinformation_twitter.xlsx")

In [25]:
data_comp.title

0       Contrasting images used to further racist agenda
1                            Racist Tweets by @dirkdup69
2                All Nigerians are Criminals and must go
3      Disinformation with xenophobic agenda re. Oran...
4      Tweet indicating that masks are a farce and sh...
                             ...                        
126         Twitter: 28 deaths from vaccine in SA so far
127    Twitter: Do not get vaccinated...you will die ...
128    Tweet claims that new COVID-19 strains more li...
129    Twitter: Do not get vaccinated...or else you w...
130                 False claims about COVID-19 vaccines
Name: title, Length: 131, dtype: object

In [26]:
def display_topics_comp(model, feature_names, no_top_words):
    
    for topic_idx, topic in enumerate(model.components_):
        print("Topic:", (topic_idx))
        print(" ".join([feature_names[i]
        for i in topic.argsort()[:-no_top_words - 1:-1]]))


def tfidf_vectorizer_comp(documents,total_features):

    #  TFIDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=total_features, stop_words='english')
    tfidf = tfidf_vectorizer.fit_transform(documents)
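    # get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()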
    return tfidf_vectorizer,tfidf,tfidf_feature_names

def count_vectorizer_comp(documents,total_features):

    #  Count Vectorizer
    tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=total_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(documents)
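    # get_feature_names() was removed in scikit-learn 1.2; versions before 1.0 only have the old name
    tf_feature_names = tf_vectorizer.get_feature_names_out()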
    return tf_vectorizer,tf,tf_feature_names

In [27]:
total_features = 15000 
num_topic = 20
tfidf_vectorizer_comp, tfidf_comp, tfidf_feature_names_comp = tfidf_vectorizer_comp(data_comp['title'],total_features)
tf_vectorizer_comp, tf_comp, tf_feature_names_comp = count_vectorizer_comp(data_comp['title'],total_features)



In [28]:
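# LDA is normally fit on raw term counts (tf_comp); this run fits it on the TF-IDF matrix instead.
model_lda_comp = LatentDirichletAllocation(
    n_components=num_topic,
    max_iter=30,
    learning_method='online',
    learning_offset=50.,
    random_state=0,
).fit(tfidf_comp)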
no_top_words = 20

In [29]:
display_topics(model_lda_comp, tfidf_feature_names_comp, no_top_words)

Topic: 0
masks lockdowns tweet protests ostensible zweli video conspiracy sa claims africa statements new alleged 19 alleges crimes chester constitutes pres
Topic: 1
likely new vaccinated claims tweet 19 covid ostensible da allegations mkhize makes video alleged claiming sa virus africa measures julius
Topic: 2
gates conspiracy theories vaccination mynameisjerm foundation death pres covid19 tweet 19 phoenix murder nigerians protests claims constitutes statement measures regarding
Topic: 3
rates measures disinformation statement efficacy violence claim dna lockdowns new mynameisjerm racist living news tweet racial theories 19 old cloth
Topic: 4
spreader rage super events various dr zweli mkhize associated claiming video tweet efficacy regarding current murder alleged sa taxi rates
Topic: 5
phoenix allegations wave foreign claims tweets protests protest used enca ostensible statements nationals mkhize vaccine taxi cites disinformation false tweet
Topic: 6
die poison twitter vaccinated co

In [30]:
data_op = pyLDAvis.sklearn.prepare(model_lda_comp,tfidf_comp,tfidf_vectorizer_comp)
pyLDAvis.enable_notebook()
pyLDAvis.display(data_op)
pyLDAvis.save_html(data_op, 'topics_MMA_LDA.html')

#### Non-negative Matrix Factorization

In [31]:
n_components = 20

In [32]:
def fit_NMF(X, n_components):
    # Fit an NMF model with n_components topics to the document-term matrix X
    model = NMF(n_components=n_components, random_state=0)
    return model.fit(X)

In [33]:
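nmf_tfidf_comp = fit_NMF(tfidf_comp, n_components)
# The model was fit on tfidf_comp, so label its columns with the TF-IDF feature names.
# Both vectorizers share the same settings, so the vocabulary is identical either way.
display_topics(nmf_tfidf_comp, tfidf_feature_names_comp, no_top_words)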



Topic: 0
tweet ramaphosa ostensible image xenophobic da report cites mynameisjerm measures likely pres enca anti event rage makes associated murder allegations
Topic: 1
twitter flu lockdown vaccine kill medical wearing zweli zuma foreign nationals lockdowns mkhize dna deaths rates claim die poison crimes
Topic: 2
covid 19 measures medical dna efficacy statements pertaining false makes scam various flu rage event vaccination mortality cloth statistics second
Topic: 3
south africa foreign nationals living number statistics wave zimbabweans crimes responsible claiming flu regarding kill racist constitutes mortality pertaining efficacy
Topic: 4
masks wearing lockdowns cloth efficacy claims claim mortality used tweet rates 19 chester gates events evidence extreme fake false flu
Topic: 5
covid19 regarding vaccination lockdowns gates rates scam claim second wave cabanac roman vaccine dr pres foundation used mortality twitter death
Topic: 6
disinformation constitutes claim foundation racist xe