## Trending Topics
To obtain trending topics, we will be matching each text to dictionaries of different topics. If a proportion of the words exceed a threshold of matching the dictionary, then the text will be tagged to the topic. We set the threshold to 0.7. 

A text can be tagged to multiple topics.

In [4]:
import pandas as pd
import regex as re 
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
nltk_stopwords = set(stopwords.words('english'))

In [32]:
# fb_data = pd.read_csv("../data/processed_facebook_text.csv", index_col=0)
# fb_data

In [33]:
import json

f = open("../data/2022-02-08.json")
data = json.load(f)

reddit_data = pd.DataFrame()

for post_id in data:
    comments = data[post_id]["comments"]
    if comments:
        for cid in comments:
            temp_dict = {
                "text": comments[cid].get("body"),
            }
            reddit_data = reddit_data.append(temp_dict, ignore_index=True)

    reddit_data = reddit_data.dropna().drop_duplicates(ignore_index=True).reset_index(drop=True)

reddit_data
reddit_data.to_csv("sample_reddit_data.csv")

Unnamed: 0,text
0,I'm not sure how far we are from the peak but ...
1,"> # Singapore reports 7,806 new COVID-19 cases..."
2,Good morning all have a great Tuesday and stay...
3,Fuck lah. Morning ruined.\n\nMy ballerina sock...
4,These past few days the trains & buses are so ...
...,...
39221,Is HDB the only thing that is increasing in pr...
39222,That’s true. Inflation makes debt cheaper. But...
39223,Yes but an increase back to “normal” is still ...
39224,These imports are usually paid in usd\n\nEdit:...


In [20]:
def get_topics(text):
    try: 
        with open('../data/Trending Topics/dictionary/art.csv') as f:
            art_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/covid19.csv') as f:
            covid19_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/education.csv') as f:
            edu_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/environment.csv') as f:
            env_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/fashion.csv') as f:
            fashion_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/food.csv') as f:
            food_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/health.csv') as f:
            health_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/politics.csv') as f:
            politics_dict = set([i for i in f][0].split(","))
        with open('../data/Trending Topics/dictionary/technology.csv') as f:
            tech_dict = set([i for i in f][0].split(","))

        topic_dicts = [('art', art_dict), ('covid19', covid19_dict), ('education', edu_dict), ('environment', env_dict), ('fashion', fashion_dict), ('food', food_dict), ('health', health_dict), ('politics', politics_dict), ('technology', tech_dict)]
        # print(topic_dicts)

        topic_match = {'art':0, 'covid19':0, 'education':0, 'environment':0, 'fashion':0, 'food':0, 'health':0, 'politics':0, 'technology':0}

        for i in range(8):
            match_text = topic_dicts[i][1].intersection(set(text.split()))
            # print("MATCHED WORDS")
            # print(topic_dicts[i][0], match_text)

            for k in topic_match:
                if k == topic_dicts[i][0]:
                    topic_match[k] = len(match_text)

        return max(topic_match, key=topic_match.get)

    except Exception as e:
        return []

In [11]:
def processing(text):
    wnl = WordNetLemmatizer()
    
    lowercase_text = text.lower()
    punctuations_removed = re.sub('[^a-z]', ' ', lowercase_text)
    tokens = word_tokenize(punctuations_removed)
    stopwords_removed = [token for token in tokens if token not in nltk_stopwords]
    lemmatized_tokens = [wnl.lemmatize(w) for w in stopwords_removed]    
    
    return " ".join(lemmatized_tokens)

#### Testing

In [18]:
sentence = "No, our education system is doing a pretty good job serving the educational needs of our young while not imposing too much workload or stress on them. This is due to the presence of myriad pathways that students can choose to take. Primary school students can take their subjects at foundation level or at the normal level, secondary school students can choose between 3 or 4 streams, including Express, Normal (A) and Normal (T). Tertiary students can choose between 3 to 4 educational institutions as well. I believe stress does exist but it is not due to the nature of the education system, but due to societal influence, parental/family influence or it is self-imposed by the student."

cleantext= processing(sentence)
get_topics(cleantext)

MATCHED WORDS
art {'nature', 'foundation', 'subject', 'family', 'school', 'young', 'need', 'student', 'job', 'self'}
MATCHED WORDS
covid19 set()
MATCHED WORDS
education {'level', 'educational', 'institution', 'primary', 'foundation', 'express', 'family', 'school', 'normal', 'job', 'education', 'self', 'stress', 'young', 'secondary', 'student', 'subject', 'stream', 'need'}
MATCHED WORDS
environment {'system', 'nature', 'level', 'family', 'school', 'young', 'need', 'student', 'job', 'education'}
MATCHED WORDS
fashion {'level', 'take', 'family', 'school', 'young', 'need', 'pretty', 'job', 'good', 'self'}
MATCHED WORDS
food {'level', 'take', 'family', 'school', 'young', 'need', 'job', 'serving'}
MATCHED WORDS
health {'level', 'family', 'school', 'young', 'need', 'normal', 'self', 'stress'}
MATCHED WORDS
politics {'need', 'education', 'family', 'job'}


'education'

### Evaluation

In [35]:
reddit_data["cleantext"] = reddit_data["text"].apply(processing)

reddit_data["topic"] = reddit_data["cleantext"].apply(get_topics)
reddit_data

Unnamed: 0,text,cleantext,topic
0,I'm not sure how far we are from the peak but ...,sure far peak certainly hope hospitalization s...,fashion
1,"> # Singapore reports 7,806 new COVID-19 cases...",singapore report new covid case death moh new ...,health
2,Good morning all have a great Tuesday and stay...,good morning great tuesday stay strong stay sa...,education
3,Fuck lah. Morning ruined.\n\nMy ballerina sock...,fuck lah morning ruined ballerina sock came sn...,art
4,These past few days the trains & buses are so ...,past day train bus slow even funny anymore,art
...,...,...,...
39221,Is HDB the only thing that is increasing in pr...,hdb thing increasing price periodic inflation ...,environment
39222,That’s true. Inflation makes debt cheaper. But...,true inflation make debt cheaper long term lia...,environment
39223,Yes but an increase back to “normal” is still ...,yes increase back normal still increase let sa...,environment
39224,These imports are usually paid in usd\n\nEdit:...,import usually paid usd edit idiot know world ...,food


1: art
2: covid19
3: education
4: environment
5: fashion
6: food
7: health
8: politics
9: technology
10: others

In [41]:
sample_reddit_data = reddit_data.sample(n=100)
sample_reddit_data.to_csv("sample_reddit_data.csv")

In [53]:
sample_reddit_data = pd.read_csv("sample_reddit_data.csv", index_col = 0)
labelled_reddit_data = pd.read_csv("sample_reddit_data_labelled.csv", index_col = 0)

def rename_topics(text):
    if text == 'art':
        text = 1
    elif text == 'covid19':
        text = 2    
    elif text == 'education':
        text = 3
    elif text == 'environment':
        text = 4
    elif text == 'fashion':
        text = 5
    elif text == 'food':
        text = 6
    elif text == 'health':
        text = 7
    elif text == 'politics':
        text = 8
    elif text == 'technology':
        text = 9
    else:
        text = 10  
    return text

sample_reddit_data['topic'] = sample_reddit_data['topic'].apply(rename_topics)
sample_reddit_data

Unnamed: 0,topic,text,cleantext
3159,5,Thanks for the info. Yes I think I probably do...,thanks info yes think probably use bag time ti...
9794,7,"Condoms are pretty useful against STDs, which ...",condom pretty useful std appreciate gon na pre...
3726,1,"What you don't like recycled plots, lifeless c...",like recycled plot lifeless character dialogue...
8731,7,"It's <20 in icu, 90+ is needing oxygen support...",icu needing oxygen support nasal prong
31844,1,"Never mind then, I guess I was wrong!",never mind guess wrong
12449,1,Complaint what? He got bigger space than most ...,complaint got bigger space u rent free
25355,1,"Trickle down is an urban myth. Gaslighting, re...",trickle urban myth gaslighting really
9103,6,They are already milking in SGCovidLaKopi,already milking sgcovidlakopi
7684,1,"The market can bear the prices they ask, eithe...",market bear price ask either actual quality ma...
24583,1,how to count my caloriesss when my mom is such...,count calorie mom good cook lmao


In [54]:
from sklearn.metrics import classification_report
y_true = labelled_reddit_data['topic'].to_list()
y_pred = sample_reddit_data['topic'].to_list()

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         8
           3       0.11      0.67      0.19         3
           4       0.25      0.33      0.29         3
           5       0.00      0.00      0.00         0
           6       0.43      0.75      0.55         4
           7       0.33      0.33      0.33         9
           8       0.00      0.00      0.00         3
           9       0.00      0.00      0.00         4
          10       1.00      0.02      0.03        66

    accuracy                           0.10       100
   macro avg       0.21      0.21      0.14       100
weighted avg       0.72      0.10      0.09       100



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


**TO DO:**
- Add dictionary for COVID-19, business, finance, social issues, entertainment
- Refine the words for dictionaries 
  - Remove numbers [DONE]
  - More stopwords [DONE]
  - Consider bigrams or trigrams (for e.g. names of ministers, schools, other terms, etc.) [DONE]

  - Manually add more specific words => Further refine dictionaries
- Consider logic for others?
- Multiple topics (one-hot encoding)
- Alternative methods
  - Compare results with pre-trained model: https://huggingface.co/cvcio/mediawatch-el-topics?text=embrace+hail+singapore+let+go => has more topics (even more niche, but takes very long to run, still waiting to evaluate)
  - Consider doing thoughtful comments modelling first, non-thoughtful comments will be labelled as others
- 

