# TOPIC INTRODUCTION AND SOCIAL SCIENCE CONTEXT

## The Case Study with #WhatsHappeninginThailand Hashtag

Name: Tunyanan Pimonbutpong

Link for download dataset: https://drive.google.com/file/d/1ngd1QQk-SPphEHsB7IZ2srauy4QCuOzg/view?usp=sharing

or Github link: https://github.com/Tunyananp/nlpprojects_SIMM71.git

Thailand has a long history of political instability, with a strong military culture and monarchical overreach periodically disrupting democracy over the past century. Over the past decade, social media and digital activism have gradually shaped and amplified the nation's political and social discourse. Since 2018, Twitter's popularity in Thailand has increased significantly (Kemp, 2020), and Thais have used the platform to discuss controversial and sensitive topics. #WhatsHappeninginThailand is one of the popular hashtags on Twitter; Thais use it, mainly in English, to draw international attention to the situation in Thailand and to publicise protest events (Sombatpoonsiri, 2020).

This notebook is based on the connective action framework in the article by Bennett, W. L., & Segerberg, A. (2012). The Logic of Connective Action: Digital Media and The Personalization of Contentious Politics. Information, Communication & Society, 15(5), 739-768.

I use the connective action framework as a foundation. It argues that internet users engage in digital activism through self-motivation and individualized activities, without the need for traditional organizations to act as intermediaries (Bennett and Segerberg, 2012). To achieve their collective goal, users in the connective action framework can generate their own content under a common hashtag. I propose that Thai users use Twitter to develop their own narratives via shared hashtags in order to discuss and mobilize support for common aims. Based on this assumption, I aim to identify the major topics discussed under the #WhatsHappeninginThailand hashtag.

# IMPORT TEXT FILES (101 TWEETS)

As a first step, I import the dataset.

In [None]:
import pandas as pd

# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv("whathappenthai.csv")

print(df)

# DATA CLEANING 

Raw tweets without preprocessing are unstructured and contain redundant and often problematic information. This dataset contains the hashtag (#WhatsHappeninginThailand) and emojis. I decided to remove them because they do not provide meaningful context for discovering the inherent topics in the dataset, so they are not needed for the topic modeling approach.

First, I remove the emojis.

The code is adapted from jfs (2015). Removing emojis from a string in Python [Source code]. https://stackoverflow.com/a/49146722/330558

In [None]:
!pip install emoji==1.7
import emoji
import re

In [None]:
def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

call_emoji_free = lambda x: remove_emoji(x)
df['emoji_free_tweets'] = df['post_text'].apply(call_emoji_free)

print(df)
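
To see what this pattern does and does not strip, the function can be tried on a few short strings. The sketch below repeats the same pattern so that it runs on its own; the sample strings are made up for illustration and are not taken from the dataset. One observation worth keeping in mind: the last range (`U+24C2` to `U+1F251`) is very wide, so it also removes CJK characters, while Thai script (`U+0E00` to `U+0E7F`) lies outside every range and is kept.

```python
import re

def remove_emoji(text):
    """Same pattern as in the cell above, repeated so this sketch is self-contained."""
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"
        "\U000024C2-\U0001F251"  # very wide range: also covers CJK characters
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)

# Made-up sample strings: an emoji, Thai script, and CJK script
samples = ["Free our friends \U0001F64F", "ประชาธิปไตย", "日本"]
cleaned = [remove_emoji(s) for s in samples]
print(cleaned)
```

For an English-language dataset such as this one, the wide range is unlikely to matter, but it would need narrowing before the function is reused on multilingual text.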

Next, I remove the hashtags, including #WhatsHappeninginThailand.

In [None]:
def remove_hashtags(tweet):
    """Takes a string and removes any hash tags"""
    tweet = re.sub('(#[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)  # remove hash tags
    return tweet

call_hashtag_free = lambda x: remove_hashtags(x)
df['final_tweets'] = df['emoji_free_tweets'].apply(call_hashtag_free)

print(df)
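
The regular expression only matches hashtags that start with an ASCII letter and are at least two characters long. The following sketch applies the same expression to three made-up tweets to show which hashtags are removed and which are left in place; the example strings are illustrative assumptions, not rows from the dataset.

```python
import re

def remove_hashtags(tweet):
    """Same regular expression as in the cell above."""
    return re.sub(r'(#[A-Za-z]+[A-Za-z0-9-_]+)', '', tweet)

examples = [
    "Stop the violence #WhatsHappeninginThailand now",  # typical English hashtag
    "Rally at 5pm #ม็อบ",                                # Thai-script hashtag
    "Round #2 tomorrow",                                 # hashtag starting with a digit
]
results = [remove_hashtags(t) for t in examples]
for before, after in zip(examples, results):
    print(repr(before), "->", repr(after))
```

Only the first hashtag is removed, and the removal leaves a double space behind. The later tokenization step splits on whitespace, so the extra space does no harm, but Thai-script hashtags would need a broader pattern if they were to be removed as well.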

# DATA PRE-PROCESSING

After cleaning the data, the next step is pre-processing. This stage converts sentences into words, reduces words to their root form, and removes words that are too common or irrelevant for the purpose of topic modelling. The process includes tokenization and word normalization.

## Tokenization

To build up a vocabulary, the tweets first have to be broken into chunks (tokens). As the first step, I therefore tokenize the tweets with the NLTK package.

In [None]:
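import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')  # stopword lists used below
nltk.download('punkt')      # tokenizer models needed by word_tokenize
# Note: recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'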

In [None]:
df['Tokenized'] = df.apply(lambda row: nltk.word_tokenize(row['final_tweets']), axis=1)

The tweets in the tokenized column are now separated into chunks. However, they still contain stopwords and punctuation. I delete these because they are common items that carry little information (Kedia & Rasu, 2020, pp. 66-67).

Furthermore, as the words in the dataset include both uppercase and lowercase letters, I standardize them by converting all words to lowercase.

In [None]:
# remove stopwords and lowercase words

def remove_stopwords(tokenized_column):
    """Return a list of lowercased tokens with English stopwords removed.

    Args:
        tokenized_column (list): Tokens of one tweet, as produced by word_tokenize()

    Returns:
        tokens (list): Lowercased tokens with stopwords removed.

    """
    stop_words = set(stopwords.words("english"))
    return [word.lower() for word in tokenized_column if word.lower() not in stop_words]

df['stopwords_removed'] = df.apply(lambda x: remove_stopwords(x['Tokenized']), axis=1)

In [None]:
# Remove punctuation

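print(string.punctuation)

def remove_punctuations(punc_col):
    """Drop tokens that consist only of punctuation characters."""
    punctuation = set(string.punctuation)
    # A set gives exact character matches; `word in string.punctuation` would be a substring test
    return [word for word in punc_col if not all(ch in punctuation for ch in word)]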

df['punc_removed'] = df.apply(lambda x: remove_punctuations(x['stopwords_removed']), axis=1)

In addition, the dataset contains many occurrences of the country name (Thailand) and the nationality (Thai), as well as numbers. I decided to remove these because they do not add much information.

In [None]:
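# Delete the words "thailand", "thai" and "thais"

COUNTRY_TERMS = {'thailand', 'thai', 'thais'}

def remove_thai(thai_col):
    # Set membership matches whole tokens only; testing against a string
    # would also drop substrings such as "land" or "hai"
    return [word for word in thai_col if word not in COUNTRY_TERMS]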

df['thai_removed'] = df.apply(lambda x: remove_thai(x['punc_removed']), axis=1)

# Delete tokens that consist only of digits

def remove_number(num):
    return [x for x in num if not x.isdigit()]

df['number_removed'] = df.apply(lambda x: remove_number(x['thai_removed']), axis=1)
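
The filters in this section all test whether a token belongs to a collection of unwanted items. In Python, `token in some_string` is a substring test, whereas `token in some_set` matches whole tokens only, so sets are the safer container for this kind of filtering. The sketch below is a minimal illustration on a made-up token list; `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, and `clean_tokens` is an illustrative helper, not part of the notebook's pipeline.

```python
import string

STOP_WORDS = {"the", "in", "are", "of"}  # hypothetical stand-in for NLTK's list
PUNCTUATION = set(string.punctuation)
COUNTRY_TERMS = {"thailand", "thai", "thais"}

def clean_tokens(tokens):
    """Lowercase tokens, then drop stopwords, punctuation, country terms and numbers."""
    cleaned = []
    for token in tokens:
        word = token.lower()
        if word in STOP_WORDS or word in COUNTRY_TERMS:
            continue
        if all(ch in PUNCTUATION for ch in word):  # tokens made only of punctuation
            continue
        if word.isdigit():
            continue
        cleaned.append(word)
    return cleaned

# Made-up tokens for illustration
tokens = ["Thais", "demand", "land", "reform", "in", "2020", "!", "..."]
print(clean_tokens(tokens))

# Substring pitfall: "land" is a substring of the string, but not a member of the set
print("land" in "thailand, thai, thais", "land" in COUNTRY_TERMS)
```

The token "land" survives the set-based filter, whereas a substring test against the string `"thailand, thai, thais"` would have discarded it.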


## Word Normalization

After tokenizing the tweets, I lemmatize the tokens into their root forms using the spaCy lemmatizer. This reduces the number of distinct word forms the model has to deal with, and therefore improves efficiency.

In [None]:
import spacy
#nlp = spacy.load('en')
nlp = spacy.load("en_core_web_sm")

In [None]:
df['tokens_back_to_text'] = [' '.join(map(str, l)) for l in df['number_removed']]

def get_lemmas(text):
    '''Used to lemmatize the processed tweets'''
    lemmas = []
    
    doc = nlp(text)
    
    for token in doc: 
        if not token.is_stop and not token.is_punct and token.pos_ != 'PRON':
            lemmas.append(token.lemma_)
    
    return lemmas

df['lemmas'] = df['tokens_back_to_text'].apply(get_lemmas)

# TOPIC MODELLING

Next, I conduct the topic modeling with Gensim. The topic model is based on the Latent Dirichlet Allocation (LDA) algorithm, an unsupervised machine learning method. I use Gensim to create the bag-of-words representations that form the corpus. The first step is to build the dictionary, because the corpus consists of documents converted to bag-of-words, and that conversion requires a dictionary.

In [None]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaMulticore
from gensim.models import CoherenceModel

In [None]:
# Make the dictionary 

dictionary = Dictionary(df['lemmas'])

#Use the dictionary to generate the corpus (set of bag-of-words model)

corpus = [dictionary.doc2bow(doc) for doc in df['lemmas']]
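
For readers new to the bag-of-words representation, the following standard-library sketch mimics the shape of what `Dictionary` and `doc2bow` produce: each document becomes a list of `(token_id, count)` pairs. It is an illustration only. The helper names `build_vocab` and `to_bow` are hypothetical, the toy documents are made up, and the ids that Gensim assigns may differ.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs; unknown tokens are skipped."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Two made-up lemmatized "tweets"
docs = [
    ["police", "arrest", "student"],
    ["student", "want", "democracy", "student"],
]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```

Word order is discarded and only the counts remain, which is the input that LDA works with.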

After building the corpus, I need to decide on the number of topics. Topic coherence is one of the main techniques used for this estimate. In the next step, I therefore calculate the C_v coherence score for models with one to nine topics and plot the scores with matplotlib, using seaborn for styling.

The code is adapted from Ghanoum, T. (2021). Topic Modelling in Python with spaCy and Gensim [Source code]. https://towardsdatascience.com/topic-modelling-in-python-with-spacy-and-gensim-dc8f7748bdbf

In [None]:
!pip install pyLDAvis 
!pip install matplotlib
!pip install seaborn

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()  # Visualise inside a notebook

In [None]:
topics = []
score = []
for i in range(1,10,1):
   lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=10, num_topics=i, workers = 4, passes=10, random_state=100)
   cm = CoherenceModel(model=lda_model, texts = df['lemmas'], corpus=corpus, dictionary=dictionary, coherence='c_v')
   topics.append(i)
   score.append(cm.get_coherence())
_=plt.plot(topics, score)
_=plt.xlabel('Number of Topics')
_=plt.ylabel('Coherence Score')
plt.show()
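
Reading the best value off the plot can be complemented by picking it programmatically. The following is a minimal sketch, assuming `topics` and `score` are the two lists filled in the loop above. The helper `best_num_topics` is hypothetical, and the numbers in the usage example are invented for illustration; they are not results from this dataset.

```python
def best_num_topics(topics, scores):
    """Return (num_topics, score) for the highest coherence score.

    Ties go to the smaller number of topics, which favours the simpler model.
    """
    if not topics or len(topics) != len(scores):
        raise ValueError("topics and scores must be non-empty and of equal length")
    best_index = max(range(len(scores)), key=lambda i: (scores[i], -topics[i]))
    return topics[best_index], scores[best_index]

# Invented values, for illustration only
example_topics = [1, 2, 3, 4]
example_scores = [0.30, 0.45, 0.45, 0.38]
print(best_num_topics(example_topics, example_scores))
```

In the notebook this would be called as `best_num_topics(topics, score)`. The highest score is a guide rather than a rule, so the resulting topics should still be checked for interpretability.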

Based on the C_v coherence scores, I choose two topics because this setting has a high coherence score of around 0.45. As the last step, I train the topic model with the number of topics set to two.

In [None]:
#Optimal Model

lda_model = LdaMulticore(corpus=corpus, id2word=dictionary, iterations=100, num_topics= 2, workers = 4, passes=10, random_state = 100)

# VISUALIZATION
 
After finalizing the number of topics, I visualize the topic model using pyLDAvis.

In [None]:
lda_display = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(lda_display)

The visualization shows two bubbles, each representing a topic. The red bars give the estimated number of times a given term was generated by a given topic. Judging by these bars, the first topic relates to people's demands (keywords such as "want" and "need"). The keywords indicate that Thais would like the problems of human rights, economy, democracy, freedom of expression, law, mistreatment and violence against protesters caused by police and the military government to be resolved.

The keywords under the second topic relate to state violence and police brutality. They describe the force that the government and police used (e.g., shoot, bullet, arrest, and abuse). They also highlight that the victims are protesters, in particular students. This may be because, since 2018, as Twitter has gradually shaped and amplified the nation's political and social discourse, protest events in Thailand have often been led by student movements from various campuses (Sombatpoonsiri, 2020).

In conclusion, this project has shown the major themes that hashtag activists addressed on Twitter. In the connective action framework, users can generate their own content under a common hashtag to achieve their collective goal. In other words, multiple issues within the hashtag can be connected to common concerns. Topic modeling indicated that activists using the #WhatsHappeninginThailand hashtag focused on multiple issues that were connected through a common concern, i.e. state violence and police brutality. Furthermore, they used Twitter as a platform to demand reform of a national system that they see as full of problems.

# REFERENCES

Bennett, W. L., & Segerberg, A. (2012). The Logic of Connective Action: Digital Media and The Personalization of Contentious Politics. Information, Communication & Society, 15(5), 739-768.

Kedia, A., & Rasu, M. (2020). Hands-on Python natural language processing: explore tools and techniques to analyze and process text with a view to building real-world NLP applications. Packt Publishing Ltd.

Kemp, S. (2020, February 18). Digital 2020: Thailand. https://datareportal.com/reports/digital-2020-thailand.

Sombatpoonsiri, J. (2020, September 20). Unpacking Thailand’s Protests: Current Contour and Future Trajectories. https://www.ispionline.it/en/pubblicazione/unpacking-thailands-protests-current-contour-and-future-trajectories-27300
    