<a href="https://colab.research.google.com/github/Sarpwus/datasciencecoursera/blob/master/TopicModeling_UsingTwitterData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Topic Modeling of Tweets via Unsupervised Approach using the LDA Algorithm**

Learning guide from:


*   Data Science with Raghav: https://github.com/raaga500/YTshared/blob/master/V4_TopicModelling_4.ipynb




In [None]:
# Mount Google Drive to access files from Google Colab
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
!pip install pyldavis --quiet

In [None]:
# import the necessary libraries

# related to dataframe and exploration
import pandas as pd
import string

# related to topic modeling
import gensim # the library for Topic modeling
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
import pyLDAvis.gensim_models #LDA visualisation library

# related to NLTK
import nltk
nltk.download('stopwords') # for getting list of stopwords
nltk.download('wordnet') # WordNet lexical database used by the lemmatizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import warnings
warnings.simplefilter('ignore')
from itertools import chain

# for text manipulation
import re


In [None]:
# list the datasets in the datasets directory
!ls '/content/drive/My Drive/Colab Notebooks/Omdena_HrF/datasets'

snscrape_killing.csv		Tweets_Dataset_parth_final.xlsx
snscrape_warcrime.csv		twint_keyword_complete.csv
trainDataset_extract.csv	twint_warcrime_complete.csv
Tweets_Dataset_parth_final.csv


## Read in the Raw Data

In [None]:
# directory path
dirpath = '/content/drive/My Drive/Colab Notebooks/Omdena_HrF/datasets/'

# read in one of the twitter datasets we have
raw_tweet_df = pd.read_csv(dirpath + 'snscrape_killing.csv')

In [None]:
raw_tweet_df.shape

(5002, 8)

In [None]:
raw_tweet_df.head()

Unnamed: 0,text,hashtag,date,lang,event_location,submission_location,war_crime,war_crime_category
0,#UnknownGovernment running Nigeria\n#UnknownGu...,#killing,2021-06-09 16:55:37+00:00,en,,,,
1,Hey we talk about a bunch of bad things. And t...,#killing,2021-06-09 16:42:19+00:00,en,,,,
2,@chimbiko_jerome @Nj99625368 @CatrionaLaing1 @...,#killing,2021-06-09 16:35:52+00:00,en,,,,
3,#UnknownGovernment running Nigeria\n#UnknownGu...,#killing,2021-06-09 16:35:30+00:00,en,,,,
4,@MaziNnamdiKanu @UNHumanRights @JohnCampbellcf...,#killing,2021-06-09 13:45:52+00:00,en,,,,


In [None]:
# column names
for col_name in raw_tweet_df.columns: 
    print(col_name)

text
hashtag
date
lang
event_location
submission_location
war_crime
war_crime_category


## Explore and Clean the dataset to make it ready for Analysis

In [None]:
# viewing an example of the tweet [e.g. pick at random]
tweet_example = raw_tweet_df['text'].values[700]
tweet_example

'@UN @UNICEF @DrYasminAHaque @MelissaFleming Ugandans seeking  your aid Museveni #Killing us #Kidnapping we are bleeding  by guns not even  #COVID19 😭😭🇺🇬'

In [None]:
# setting up variables needed to clean the tweets - stopwords, punctuation, and lemmatizer
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [None]:
# seeing what's inside the variables, stop and exclude
print(list(stop)[:5])
print(list(exclude)[:5])

['again', 'here', 'that', 'yourself', 're']
['^', '%', '+', '*', '!']


In [None]:
# Example: remove stopwords from tweet
' '.join([word for word in tweet_example.lower().split() if word not in stop])

'@un @unicef @dryasminahaque @melissafleming ugandans seeking aid museveni #killing us #kidnapping bleeding guns even #covid19 😭😭🇺🇬'

In [None]:
# Example: remove punctuation, including special characters and symbols
punc_free_text = ''.join(ch for ch in tweet_example if ch not in exclude)
punc_free_text

'UN UNICEF DrYasminAHaque MelissaFleming Ugandans seeking  your aid Museveni Killing us Kidnapping we are bleeding  by guns not even  COVID19 😭😭🇺🇬'

In [None]:
# Example: normalise the text using the lemmatizer, lemma
' '.join([lemma.lemmatize(word) for word in punc_free_text.split()])

'UN UNICEF DrYasminAHaque MelissaFleming Ugandans seeking your aid Museveni Killing u Kidnapping we are bleeding by gun not even COVID19 😭😭🇺🇬'

In [None]:
# Wrap the process into a function to clean text:
# remove stopwords, strip punctuation, then normalise with the lemmatizer
def clean_tweet_msg(message):
  '''
  Clean a tweet in three steps and return its words as a list.

  1. Lowercase the message and drop stopwords such as 'a' and 'the', using the set `stop`.
  2. Remove punctuation characters such as @, ! and ., using the set `exclude`.
  3. Lemmatize each remaining word with `lemma`.
  '''
  stop_free = ' '.join([word for word in message.lower().split() if word not in stop])
  punc_free = ''.join(symbol for symbol in stop_free if symbol not in exclude)
  normalised = ' '.join([lemma.lemmatize(word) for word in punc_free.split()])
  
  return normalised.split()

In [None]:
# Test the function clean_tweet_msg
print(tweet_example)
clean_tweet_msg(tweet_example)

@UN @UNICEF @DrYasminAHaque @MelissaFleming Ugandans seeking  your aid Museveni #Killing us #Kidnapping we are bleeding  by guns not even  #COVID19 😭😭🇺🇬


['un',
 'unicef',
 'dryasminahaque',
 'melissafleming',
 'ugandan',
 'seeking',
 'aid',
 'museveni',
 'killing',
 'u',
 'kidnapping',
 'bleeding',
 'gun',
 'even',
 'covid19',
 '😭😭🇺🇬']
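One detail of `clean_tweet_msg` is worth checking: stopwords are filtered before punctuation is stripped, so a token such as `it.` is not recognised as the stopword `it` and survives as `it` once the full stop is removed. The sketch below compares the two orderings. It uses a small hypothetical stopword set (`STOP`) instead of the NLTK list and skips lemmatization, so it only illustrates the ordering effect and is not a replacement for the function above.

```python
import string

# Hypothetical, tiny stopword set so the sketch runs without NLTK downloads.
STOP = {"the", "it", "we", "are", "by", "not", "your"}
PUNCT = set(string.punctuation)

def filter_then_strip(message):
    """Same order as clean_tweet_msg: drop stopwords, then remove punctuation."""
    kept = " ".join(w for w in message.lower().split() if w not in STOP)
    no_punct = "".join(ch for ch in kept if ch not in PUNCT)
    return no_punct.split()

def strip_then_filter(message):
    """Alternative order: remove punctuation first, then drop stopwords."""
    no_punct = "".join(ch for ch in message.lower() if ch not in PUNCT)
    return [word for word in no_punct.split() if word not in STOP]

sample = "We are bleeding, stop it. #Killing"
print(filter_then_strip(sample))  # 'it' survives because 'it.' was not a stopword
print(strip_then_filter(sample))  # 'it' is removed
```

Whether the second ordering is preferable depends on the analysis; it would change the cleaned tokens and therefore the topics reported further down.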

### **Observations so far:**


*   The tweets contain useful information such as user mentions (the @ symbol) and hashtags (#tags). These are a good source of information that we can pull into separate columns.



## **Interim:** Identifying Topics using a Sample of 150 Tweets

In [None]:
# sample dataframe of the text column in raw_tweet_df
sample_tweet_df = raw_tweet_df[['text']][:150]

In [None]:
sample_tweet_df.head()

Unnamed: 0,text
0,#UnknownGovernment running Nigeria\n#UnknownGu...
1,Hey we talk about a bunch of bad things. And t...
2,@chimbiko_jerome @Nj99625368 @CatrionaLaing1 @...
3,#UnknownGovernment running Nigeria\n#UnknownGu...
4,@MaziNnamdiKanu @UNHumanRights @JohnCampbellcf...


In [None]:
# use the clean_tweet_msg function to clean the text and store the result in a separate column
sample_tweet_df['text_clean'] = sample_tweet_df['text'].apply(clean_tweet_msg)

In [None]:
sample_tweet_df.head()

Unnamed: 0,text,text_clean
0,#UnknownGovernment running Nigeria\n#UnknownGu...,"[unknowngovernment, running, nigeria, unknowng..."
1,Hey we talk about a bunch of bad things. And t...,"[hey, talk, bunch, bad, thing, podcast, hosted..."
2,@chimbiko_jerome @Nj99625368 @CatrionaLaing1 @...,"[chimbikojerome, nj99625368, catrionalaing1, g..."
3,#UnknownGovernment running Nigeria\n#UnknownGu...,"[unknowngovernment, running, nigeria, unknowng..."
4,@MaziNnamdiKanu @UNHumanRights @JohnCampbellcf...,"[mazinnamdikanu, unhumanrights, johncampbellcf..."


In [None]:
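# create dictionary
dictionary = corpora.Dictionary(sample_tweet_df['text_clean'])
# print the number of non-zero entries in the bag-of-words matrix,
# i.e. the sum of unique words per tweet; len(dictionary) gives the vocabulary size
print(dictionary.num_nnz)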

2748
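The figure above comes from `num_nnz`, which counts distinct tokens per document and sums them, so a word that appears in many tweets is counted once for each tweet. The vocabulary size is a different quantity. A minimal pure-Python sketch of the difference, using hypothetical toy documents in place of `sample_tweet_df['text_clean']`:

```python
# Hypothetical toy documents standing in for the cleaned tweets.
docs = [
    ["killing", "people", "killing"],
    ["people", "nigeria"],
    ["killing", "un"],
]

# Vocabulary size: distinct tokens across all documents
# (the quantity len(dictionary) reports in gensim).
vocab = set(token for doc in docs for token in doc)

# Non-zero entries: distinct tokens counted per document, then summed
# (the quantity dictionary.num_nnz reports).
nnz = sum(len(set(doc)) for doc in docs)

print(len(vocab))
print(nnz)
```

Here the vocabulary has 4 words while the non-zero count is 6, because `killing` and `people` each occur in two documents.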


In [None]:
# create document term matrix
doc_term_matrix = [dictionary.doc2bow(doc) for doc in sample_tweet_df['text_clean']]
print(len(doc_term_matrix))

150


In [None]:
#doc_term_matrix

**Interim:** LDA Model Training on Sample of 150 Tweets

In [None]:
# alias the LdaModel class (the model itself is created in the next cell)
lda = gensim.models.ldamodel.LdaModel

In [None]:
# fit LDA model on dataset
ntopics = 5
%time ldamodel = lda(doc_term_matrix, num_topics = ntopics, id2word = dictionary, passes = 50, minimum_probability = 0)

CPU times: user 3.12 s, sys: 69.8 ms, total: 3.19 s
Wall time: 3.11 s


In [None]:
# Print the topics identified by LDA model
ldamodel.print_topics(num_topics = ntopics)

[(0,
  '0.037*"killing" + 0.020*"million" + 0.017*"amp" + 0.011*"two" + 0.011*"muslim" + 0.011*"say" + 0.009*"like" + 0.009*"christian" + 0.009*"spread" + 0.009*"sponsored"'),
 (1,
  '0.037*"killing" + 0.010*"amp" + 0.009*"compassion" + 0.008*"child" + 0.007*"wave" + 0.007*"muslim" + 0.006*"u" + 0.006*"un" + 0.004*"even" + 0.004*"it"'),
 (2,
  '0.041*"killing" + 0.014*"nigeria" + 0.014*"biafrans" + 0.013*"u" + 0.012*"people" + 0.012*"woman" + 0.010*"buhari" + 0.010*"un" + 0.010*"biafraland" + 0.010*"running"'),
 (3,
  '0.048*"killing" + 0.008*"u" + 0.007*"canadian" + 0.007*"worst" + 0.007*"it’s" + 0.007*"history" + 0.007*"free" + 0.007*"nosis" + 0.007*"june" + 0.007*"cdnpoli"'),
 (4,
  '0.039*"killing" + 0.012*"murder" + 0.009*"people" + 0.008*"innocent" + 0.006*"make" + 0.006*"brotherhood" + 0.004*"must" + 0.004*"know" + 0.004*"one" + 0.004*"planet"')]

In [None]:
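# visualise the LDA model results
# note: mds accepts 'pcoa', 'mmds' or 'tsne'; an unrecognised value such as 'mds' falls back to 'pcoa'
lda_display = pyLDAvis.gensim_models.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics = False, mds = 'mmds')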
pyLDAvis.display(lda_display)



In [None]:
# assign the topics to the documents in corpus
lda_corpus = ldamodel[doc_term_matrix]

In [None]:
#[doc for doc in lda_corpus]

In [None]:
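# average topic probability over all documents; with minimum_probability = 0 every
# tweet gets a score for each topic, so this is approximately 1 / ntopics
scores = [score for doc in lda_corpus for topic_id, score in doc]

threshold = sum(scores)/len(scores)
print(threshold)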

0.19999999932323892
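The threshold lands at almost exactly 0.2 because each tweet's topic probabilities sum to 1, so their overall mean is 1 divided by the number of topics. The sketch below reproduces this with hypothetical distributions (`toy_lda_corpus`) shaped like the `(topic_id, probability)` pairs the model returns; the numbers are made up for illustration.

```python
# Hypothetical per-document topic distributions, shaped like the output of
# ldamodel[doc_term_matrix] when minimum_probability = 0.
toy_lda_corpus = [
    [(0, 0.70), (1, 0.10), (2, 0.20)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.40), (1, 0.10), (2, 0.50)],
]
n_topics = 3

# Flatten every probability and take the mean, as in the cell above.
scores = [score for doc in toy_lda_corpus for _, score in doc]
threshold = sum(scores) / len(scores)

print(threshold)     # close to 1 / n_topics
print(1 / n_topics)
```

In other words, the threshold is fixed by the choice of `ntopics` rather than learned from the data, which is worth keeping in mind when interpreting the clusters.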


In [None]:
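# a tweet joins a cluster when its probability for that topic exceeds the threshold,
# so one tweet can belong to several clusters
cluster1 = [j for i,j in zip(lda_corpus, sample_tweet_df.index) if i[0][1] > threshold]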
cluster2 = [j for i,j in zip(lda_corpus, sample_tweet_df.index) if i[1][1] > threshold]
cluster3 = [j for i,j in zip(lda_corpus, sample_tweet_df.index) if i[2][1] > threshold]
cluster4 = [j for i,j in zip(lda_corpus, sample_tweet_df.index) if i[3][1] > threshold]
cluster5 = [j for i,j in zip(lda_corpus, sample_tweet_df.index) if i[4][1] > threshold]

print(len(cluster1))
print(len(cluster2))
print(len(cluster3))
print(len(cluster4))
print(len(cluster5))

25
32
34
36
25
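The five cluster sizes add up to 152, more than the 150 tweets in the sample, because a tweet can exceed the threshold for more than one topic. The sketch below shows the same thresholding written as a loop over topics, next to an alternative that assigns each tweet only to its single most probable topic. The distributions and document ids are hypothetical.

```python
# Hypothetical per-document topic distributions and document ids.
toy_lda_corpus = [
    [(0, 0.70), (1, 0.10), (2, 0.20)],
    [(0, 0.05), (1, 0.90), (2, 0.05)],
    [(0, 0.40), (1, 0.10), (2, 0.50)],
]
doc_ids = [0, 1, 2]
n_topics = 3
threshold = 1 / n_topics

# Threshold-based clusters: one list of document ids per topic.
# A document appears in every topic whose probability exceeds the threshold.
clusters = {
    topic_id: [doc_id for doc, doc_id in zip(toy_lda_corpus, doc_ids)
               if doc[topic_id][1] > threshold]
    for topic_id in range(n_topics)
}

# Alternative: assign each document to its most probable topic only.
dominant = [max(doc, key=lambda pair: pair[1])[0] for doc in toy_lda_corpus]

print(clusters)  # document 2 appears under topics 0 and 2
print(dominant)
```

Indexing `doc[topic_id]` relies on every topic being present in order for each document, which holds here only because the model was trained with `minimum_probability = 0`.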


In [None]:
# view cluster 1 in the sample_tweet_df
# (the cluster lists hold index labels, so use label-based .loc)
sample_tweet_df.loc[cluster1]

Unnamed: 0,text,text_clean
9,@ShivAroor we live in a world where #Modi is c...,"[shivaroor, live, world, modi, called, dictato..."
11,@MarianneSansum @frederickone OH! I don't know...,"[mariannesansum, frederickone, oh, know, owner..."
15,Like #Christians #killing millions to spread #...,"[like, christian, killing, million, spread, lo..."
26,Like #Christians #killing millions to spread #...,"[like, christian, killing, million, spread, lo..."
38,One million Indian troops are engaged in #kill...,"[one, million, indian, troop, engaged, killing..."
39,I strongly condemn the brutal killing of a #Ca...,"[strongly, condemn, brutal, killing, canadian,..."
44,Canada condemns 'targeted' killing of Muslim f...,"[canada, condemns, targeted, killing, muslim, ..."
56,#Killing of Canadian Muslim family with truck ...,"[killing, canadian, muslim, family, truck, hat..."
57,Like #Christians #killing millions to spread #...,"[like, christian, killing, million, spread, lo..."
69,It's also #Ohio. What the heck is wrong with t...,"[also, ohio, heck, wrong, people, state, alway..."


In [None]:
# view cluster 2 in the sample_tweet_df
sample_tweet_df.loc[cluster2]

In [None]:
# view cluster 3 in the sample_tweet_df
sample_tweet_df.loc[cluster3]

In [None]:
# view cluster 4 in the sample_tweet_df
sample_tweet_df.loc[cluster4]

In [None]:
# view cluster 5 in the sample_tweet_df
sample_tweet_df.loc[cluster5]

## **Feature Engineering:** Extracting the @mentions and #tags to identify & interpret the topic clusters

In [None]:
# A function to pull @mentions and #tags from tweets
def pull_mention_n_tags(message):
    '''
    Take a tweet message and pull out its @mentions and #tags,
    which are returned as two separate lists:
        output_at_word - i.e. @mentions
        output_hash_word - i.e. #tags
    '''
    # match the symbol followed by word characters only, so trailing
    # punctuation (e.g. '@UN,') is not captured; searching the whole message
    # also catches a mention and a tag that share one whitespace-separated token
    output_at_word = re.findall(r'@\w+', message)
    output_hash_word = re.findall(r'#\w+', message)

    # return the 2 outputs
    return output_at_word, output_hash_word
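One possible way to use the extracted mentions and tags for interpreting a cluster is to count which ones occur most often among its tweets. The sketch below is an assumption about how that could look, not part of the original notebook: `pull_tags` is a compact stand-in for `pull_mention_n_tags`, and `cluster_tweets` is a hypothetical list of tweet texts.

```python
import re
from collections import Counter

def pull_tags(message):
    """Stand-in for pull_mention_n_tags: returns (mentions, hashtags)."""
    return re.findall(r"@\w+", message), re.findall(r"#\w+", message)

# Hypothetical tweets standing in for the rows of one cluster.
cluster_tweets = [
    "@UN Ugandans seeking your aid #Killing us #Kidnapping",
    "Stop the #killing now @UNHumanRights",
    "#Kidnapping must end @UN",
]

mention_counts = Counter()
tag_counts = Counter()
for tweet in cluster_tweets:
    mentions, tags = pull_tags(tweet)
    # Lowercase so '#Killing' and '#killing' are counted together.
    mention_counts.update(m.lower() for m in mentions)
    tag_counts.update(t.lower() for t in tags)

print(tag_counts.most_common(2))
print(mention_counts.most_common(1))
```

Since every tweet in this dataset was collected via the `#killing` hashtag, that tag will dominate any such count, so the less frequent tags and the mentions are likely to be more informative.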