In [1]:
import pandas as pd

In [2]:
comments_df = pd.read_csv('./comments_with_score.csv',index_col = 0)

In [3]:
comments_df.head()

Unnamed: 0,article_id,comments,is_reply,neg,neu,pos,compound
0,0.0,What's the point of studying so much ended up ...,0.0,0.0,0.872,0.128,0.7096
1,0.0,No matter what kind of streaming or subject ba...,0.0,0.156,0.76,0.084,-0.8555
2,0.0,Seems to be that the purpose of this system is...,1.0,0.0,0.844,0.156,0.6322
3,0.0,This feels like just another diversion from RE...,0.0,0.045,0.797,0.159,0.8981
4,0.0,Isn’t a “real” issue the boxing of kids into s...,1.0,0.0,0.69,0.31,0.6597


## Topic Modelling (LDA unigram)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [5]:
cvec = CountVectorizer(stop_words='english',max_df=0.98,min_df=3)

In [6]:
comments_cvec = cvec.fit_transform(comments_df['comments'])

In [7]:
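#Vocabulary of words (in scikit-learn 1.0+ this method is get_feature_names_out(); get_feature_names() was removed in 1.2)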
cvec.get_feature_names()

['1990s',
 '1st',
 '20',
 '2024',
 '40',
 '90s',
 'ability',
 'able',
 'abuse',
 'academic',
 'act',
 'acts',
 'actual',
 'actually',
 'affect',
 'affected',
 'age',
 'ago',
 'agree',
 'ah',
 'allow',
 'allows',
 'anti',
 'arts',
 'ask',
 'attitude',
 'average',
 'away',
 'bad',
 'badly',
 'banding',
 'based',
 'batok',
 'belief',
 'believe',
 'belong',
 'benefit',
 'best',
 'better',
 'big',
 'blame',
 'bond',
 'boon',
 'born',
 'boss',
 'bosses',
 'brainwash',
 'breakers',
 'bring',
 'brought',
 'brutally',
 'build',
 'bukit',
 'business',
 'called',
 'care',
 'caste',
 'category',
 'cause',
 'certain',
 'challenging',
 'chan',
 'chance',
 'change',
 'changes',
 'charge',
 'cheaters',
 'cheating',
 'check',
 'chee',
 'child',
 'children',
 'china',
 'chinese',
 'citizen',
 'citizens',
 'civil',
 'class',
 'classes',
 'clean',
 'cleaning',
 'clear',
 'clumsiness',
 'come',
 'coming',
 'comment',
 'comments',
 'common',
 'companies',
 'competition',
 'competitiveness',
 'complacency',


In [8]:
#Model for 2 topics
lda = LatentDirichletAllocation(n_components=2,random_state=42)
lda.fit(comments_cvec)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=2, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [9]:
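#Create a function to print the top words for each topic, to get a sense of what each topic might represent
def show_top15(lda_model):
    '''
    This function takes in a fitted LDA model and prints the top 15 words for each topic modelled.
    Words are listed in ascending order of weight, so the strongest word for a topic comes last.
    '''
    #Look up the vocabulary once instead of once per topic
    feature_names = cvec.get_feature_names()
    for index,topic in enumerate(lda_model.components_):
        print('The top 15 words for topic #{}'.format(index))
        #argsort sorts ascending, so the last 15 positions hold the highest-weighted words
        print([feature_names[i] for i in topic.argsort()[-15:]])
        print('\n')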

In [10]:
#Show top words for each topic
show_top15(lda)

The top 15 words for topic #0
['change', 'schools', 'like', 'pap', 'time', 'normal', 'school', 'stream', 'people', 'chinese', 'education', 'just', 'good', 'streaming', 'students']


The top 15 words for topic #1
['normal', 'called', 'reform', 'big', 'education', 'government', 'civil', 'restructure', 'world', 'years', 'singaporean', 'bad', 'kids', 'like', 'singapore']




<div class='alert alert-block alert-warning'>
    Both topics seem to be a mix of politics and education.
</div>

In [11]:
#Model for 3 topics
lda_3 = LatentDirichletAllocation(n_components=3,random_state=42)
lda_3.fit(comments_cvec)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=3, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [12]:
#Show top 15 words for each topic
show_top15(lda_3)

The top 15 words for topic #0
['just', 'time', 'policies', 'new', 'like', 'need', 'foreign', 'local', 'years', 'change', 'ministers', 'education', 'singapore', 'good', 'pap']


The top 15 words for topic #1
['don', 'sex', 'called', 'education', 'reform', 'world', 'big', 'years', 'government', 'civil', 'restructure', 'singaporean', 'bad', 'like', 'singapore']


The top 15 words for topic #2
['social', 'teachers', 'like', 'chinese', 'schools', 'people', 'school', 'just', 'good', 'kids', 'express', 'stream', 'normal', 'streaming', 'students']




<div class='alert alert-block alert-warning'>
    The first two topics both seem to be about politics, while the last is more distinctly about education.
</div>
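
One rough way to put a number on this impression is to compare the top-15 word lists directly. The sketch below is not part of the original analysis: it copies the three lists printed above and computes the Jaccard similarity (shared words divided by distinct words) for each pair of topics. The helper name `jaccard` is an assumption for illustration, and top-word overlap is only a coarse proxy for topic similarity.

```python
from itertools import combinations

# Top-15 word lists copied from the 3-topic output above
top_words = {
    0: ['just', 'time', 'policies', 'new', 'like', 'need', 'foreign', 'local',
        'years', 'change', 'ministers', 'education', 'singapore', 'good', 'pap'],
    1: ['don', 'sex', 'called', 'education', 'reform', 'world', 'big', 'years',
        'government', 'civil', 'restructure', 'singaporean', 'bad', 'like', 'singapore'],
    2: ['social', 'teachers', 'like', 'chinese', 'schools', 'people', 'school', 'just',
        'good', 'kids', 'express', 'stream', 'normal', 'streaming', 'students'],
}


def jaccard(a, b):
    """Share of distinct words that two word lists have in common (hypothetical helper)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Similarity for every pair of topics
overlap = {
    (i, j): jaccard(top_words[i], top_words[j])
    for i, j in combinations(sorted(top_words), 2)
}

for pair, score in overlap.items():
    print(pair, round(score, 3))
```

On these lists, topics 0 and 1 share four words (`like`, `years`, `education`, `singapore`), more than any other pair, which is consistent with reading them as two facets of the same political theme.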

In [13]:
#Model for 4 topics
lda_4 = LatentDirichletAllocation(n_components=4,random_state=42)
lda_4.fit(comments_cvec)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=4, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [14]:
#Show top 15 words for each topic
show_top15(lda_4)

The top 15 words for topic #0
['singapore', 'streaming', 'time', 'jobs', 'years', 'foreign', 'don', 'change', 'pap', 'just', 'local', 'ministers', 'chinese', 'new', 'education']


The top 15 words for topic #1
['sex', 'don', 'called', 'world', 'education', 'reform', 'big', 'government', 'years', 'civil', 'restructure', 'singaporean', 'bad', 'like', 'singapore']


The top 15 words for topic #2
['non', 'look', 'chinese', 'like', 'just', 'teachers', 'schools', 'people', 'kids', 'school', 'streaming', 'express', 'stream', 'normal', 'students']


The top 15 words for topic #3
['ong', 'grc', 'change', 'lose', 'subjects', 'parents', 'pap', 'students', 'end', 'election', 'different', 'singapore', 'like', 'streaming', 'good']




In [15]:
#Model for 5 topics
lda_5 = LatentDirichletAllocation(n_components=5,random_state=42)
lda_5.fit(comments_cvec)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='batch', learning_offset=10.0,
             max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
             n_components=5, n_jobs=None, n_topics=None, perp_tol=0.1,
             random_state=42, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)

In [16]:
#Show top 15 words for each topic
show_top15(lda_5)

The top 15 words for topic #0
['elites', 'parents', 'change', 'years', 'singapore', 'jobs', 'pap', 'chinese', 'foreign', 'ministers', 'new', 'don', 'just', 'education', 'local']


The top 15 words for topic #1
['sex', 'called', 'don', 'big', 'reform', 'world', 'education', 'years', 'civil', 'government', 'restructure', 'singaporean', 'bad', 'like', 'singapore']


The top 15 words for topic #2
['good', 'just', 'non', 'like', 'teachers', 'people', 'chinese', 'schools', 'school', 'streaming', 'kids', 'express', 'stream', 'normal', 'students']


The top 15 words for topic #3
['study', 'level', 'just', 'got', 'end', 'education', 'future', 'subject', 'students', 'ong', 'parents', 'streaming', 'like', 'singapore', 'good']


The top 15 words for topic #4
['mp', 'people', 'like', 'issue', 'need', 'louis', 'lose', 'grc', 'change', 'election', 'ng', 'pap', 'good', 'time', 'streaming']




<div class='alert alert-block alert-warning'>
    Beyond three topics, the topics overlap rather strongly. Even with three topics the first two overlap, but the third forms a more distinct topic on education.
</div>
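
The comparison above relies on reading the top words by eye. A complementary check, sketched here on a hypothetical toy corpus rather than the real comments, is to fit one model per candidate number of topics and compare perplexity via scikit-learn's `LatentDirichletAllocation.perplexity` (lower is better). The corpus `docs` and the dictionary `perplexity_by_k` are made up for illustration; ideally perplexity would be computed on held-out comments rather than on the training data.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for comments_df['comments']
docs = [
    "streaming students normal express school",
    "students school teachers streaming kids",
    "government ministers election policies change",
    "election government pap policies ministers",
    "school kids express stream students",
    "pap government change election jobs",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

perplexity_by_k = {}
for k in range(2, 6):
    model = LatentDirichletAllocation(n_components=k, random_state=42)
    model.fit(counts)
    # Lower perplexity means the model assigns higher likelihood to the counts
    perplexity_by_k[k] = model.perplexity(counts)

for k, value in perplexity_by_k.items():
    print("n_components={}: perplexity={:.1f}".format(k, value))
```

Perplexity does not always agree with human judgement of topic quality, so it is best treated as a supplement to inspecting the top words, not a replacement.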

In [17]:
#Retrieve the topic distribution of each comment (one row per comment, one column per topic)
topic_results = lda_3.transform(comments_cvec)

In [18]:
#Add the most probable topic of each comment to the DataFrame
comments_df['topic'] = topic_results.argmax(axis=1)

In [19]:
#Check the number of comments for each topic
for i in range(3):
    print('Number of comments under topic {}: '.format(i),len(comments_df[comments_df['topic']==i]))

Number of comments under topic 0:  110
Number of comments under topic 1:  12
Number of comments under topic 2:  143
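
Taking `argmax` keeps only the strongest topic for each comment and discards how mixed the comment is. A minimal sketch, using a hypothetical document-topic matrix shaped like the output of `lda_3.transform`, shows how the dominant weight could be kept alongside the label; the 0.5 cutoff for flagging ambiguous comments is an arbitrary assumption.

```python
import numpy as np

# Hypothetical document-topic matrix: one row per comment, one column per topic
doc_topic = np.array([
    [0.70, 0.10, 0.20],
    [0.34, 0.33, 0.33],
    [0.05, 0.05, 0.90],
])

dominant_topic = doc_topic.argmax(axis=1)  # index of the largest weight in each row
dominant_weight = doc_topic.max(axis=1)    # how strongly that topic dominates

# Flag comments whose winning topic holds less than half of the weight (assumed cutoff)
ambiguous = dominant_weight < 0.5

print(dominant_topic, dominant_weight, ambiguous)
```

In this toy matrix the second comment is assigned to topic 0 even though its weights are almost evenly split, which is the kind of case the flag is meant to surface.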


<div class='alert alert-block alert-info'>
    With only 12 comments under topic 1, and given how much topics 0 and 1 overlap, they could reasonably be combined into a single topic, leaving 2 topics (likely politics and education, judging from the top 15 words).
</div>
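
A minimal sketch of that merge, assuming the labels `politics` and `education` and rebuilding the topic column from the counts printed above instead of the real `comments_df`:

```python
import pandas as pd

# Topic column rebuilt from the counts reported above (110, 12 and 143 comments)
topics = pd.Series([0] * 110 + [1] * 12 + [2] * 143, name="topic")

# Assumed labels: topics 0 and 1 read as politics, topic 2 as education
topic_labels = {0: "politics", 1: "politics", 2: "education"}
merged = topics.map(topic_labels)

# Share of comments under each merged label
share = merged.value_counts(normalize=True).round(3)
print(share)
```

With topics 0 and 1 merged, 122 comments fall under politics and 143 under education, roughly 46% and 54%.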

In [20]:
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [21]:
#Topic and terms visualisation
pyLDAvis.sklearn.prepare(lda_3, comments_cvec, cvec)



In [22]:
import random

In [23]:
def random_5_comments(topic):
    '''
    This function takes in a topic number and prints up to 5 randomly selected comments from that topic.
    '''
    #for reproducibility
    random.seed(99)
    #filter the comments for this topic once
    topic_comments = comments_df[comments_df['topic']==topic]['comments']
    #randomly pick 5 index positions (fewer if the topic has under 5 comments)
    selected = random.sample(range(len(topic_comments)),min(5,len(topic_comments)))

    #Print the selected comments
    for index in selected:
        print(topic_comments.iloc[index])
        print('-----'*20)
        print('\n')

In [24]:
#Print 5 comments from topic 0
random_5_comments(0)

Back to square one, after 40 years😜
----------------------------------------------------------------------------------------------------


After election, things might change again...
----------------------------------------------------------------------------------------------------


i game u fifty dollars you don't have school going kids now. on?
----------------------------------------------------------------------------------------------------


Can such polls be taken seriously? The majority of dumb local born S'poreans are after all Champion Complainers who have repeatedly returned P@P to power.
----------------------------------------------------------------------------------------------------


The number of local masters, double degrees and degrees jobless are alarming... Surprising foreigners coming in as tourists can land jobs in days.
----------------------------------------------------------------------------------------------------




In [25]:
#Print 5 comments from topic 1
random_5_comments(1)

To Everyone in this Website, Especially PAP, Opposition Parties & All Singaporean,

To improve our competitiveness in Global Economy , We really must REVAMP our entire school education system , in actual fact, it should have been Done it in over 20 years ago, during the 1990s .

From this website on “ Subject-Based Banding to replace streaming in secondary schools by 2024 by Singapore Government , it has EXPOSE OUT these BIG PROBLEMS in Singapore that had spread over many years .

Unfortunately, our “ MOST EXPENSIVE GOVERNMENT IN THE WHOLE WORLD “ don’t seem to do much on it , though TALK very BIG in Mass Media that our “ MOST EXPENSIVE GOVERNMENT IN THE WHOLE WORLD “ is doing so ! ! !

The PROBLEMS that we are facing now are these , due to the very Harmful Effects of " DON' T CARE " of or , should say, SACRIFICE Professional Ethics & Moral Education for many years :===>

(1). Bad CORORATE CULTURES & Unethical SOCIAL VALUES are widespread in Singapore Business Wor

In [26]:
#Print 5 comments from topic 2
random_5_comments(2)

So sad that in a country as modern as Singapore that pride herself in one race one people , etc thst we talk about segregation. Isnt it better gor he kids ti decide if he prefers a faster thuss express or slower m thus normal himself? Ut is all about nurturung a good character jot how fadt or slow you graduate. Many famous folks also never graduate. In this world...less educated does not mean less successful. Good lord...please do not do this poll.
----------------------------------------------------------------------------------------------------


As long as any school is gov-aided or supported, MOE should insist these schools take in students with non-Express PSLE T-Scores to show that these schools can value-add. Also, this will in the long term help social cohesiveness.
----------------------------------------------------------------------------------------------------


The surest way to travel to the sun is by night!
--------------------------------------------------------------

<div class='alert alert-block alert-info'>
    Judging by the contents of the comments separated via LDA, the comments largely cluster around two topics: politics and education. Slightly more than half of the comments were about education, streaming and social segregation, while the remainder focused on politics and the government.
</div>