
In this exercise we perform topic modeling on `realDonaldTrump_tweets`. Each line in this file is a tweet.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import re

**Task 1: Load the data**

Consider each tweet as a document. Load the tweets, then strip symbols and web links from them. If a tweet becomes an empty string after preprocessing, discard it from the analysis.


In [2]:
file_path = '/dsa/data/all_datasets/linguistic/realDonaldTrump_tweets.txt'

In [3]:
# load each tweet as a document
with open(file_path, 'r') as f:
    tweets = f.readlines()

In [10]:
def preprocess_tweet(tweet):
    # Remove web links
    tweet = re.sub(r'https?://\S+', '', tweet)
    # Keep only letters and whitespace (this also drops digits)
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)
    return tweet.strip().lower()

# Clean each tweet once and discard those that end up empty
cleaned_tweets = [t for t in map(preprocess_tweet, tweets) if t]

In [11]:
print("First 5 preprocessed tweets:", cleaned_tweets[:5])

print(f"Number of preprocessed tweets: {len(cleaned_tweets)}")

First 5 preprocessed tweets: ['it was a great honor to have spoken before the countries of the world at the united nations', 'usaatungaunga', 'god bless the people of mexico city we are with you and will be there for you', 'as president of the united states of america i will always put americafirstunga', 'full remarks']
Number of preprocessed tweets: 3651
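
As a quick sanity check of the cleaning rules described in Task 1, the sketch below applies the same two regular expressions to a few made-up strings. The sample tweets are hypothetical and are not taken from the dataset; they only illustrate that links and symbols are removed and that a tweet consisting solely of a link is discarded.

```python
import re

def preprocess_tweet(tweet):
    """Same steps as the notebook cell: drop links, keep letters and whitespace, lowercase."""
    tweet = re.sub(r'https?://\S+', '', tweet)
    tweet = re.sub(r"[^a-zA-Z\s]", "", tweet)
    return tweet.strip().lower()

# Hypothetical sample tweets (assumption: not from the real file)
samples = [
    "Great rally in Ohio! https://t.co/abc123 #MAGA",
    "https://t.co/xyz789",   # becomes empty and is dropped
    "Join us at 7:00 PM",    # digits and ':' are removed
]

# Clean once, then keep only non-empty results
cleaned = [c for c in (preprocess_tweet(s) for s in samples) if c]
print(cleaned)
```

Note that digits are stripped along with punctuation, so times and years disappear from the text.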


**Task 2: Create term frequency matrix for these tweets.**


In [13]:
# Add your code below
# -------------------
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: TfidfVectorizer produces TF-IDF weights, not raw term frequencies
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_tweets)


print(f"TF-IDF matrix dimensions: {tfidf_matrix.shape}")


print(tfidf_matrix[:5].toarray())



TF-IDF matrix dimensions: (3651, 1000)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
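
Task 2 asks for a term frequency matrix, which is usually built from raw counts rather than TF-IDF weights. A minimal sketch with `CountVectorizer` follows; the three-document corpus is a hypothetical stand-in for `cleaned_tweets`, and the default vectorizer settings are an assumption rather than the notebook's choice.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for cleaned_tweets
docs = [
    "make america great again",
    "great rally great crowd",
    "thank you ohio",
]

# Each row is a document, each column a vocabulary term, each cell a raw count
count_vectorizer = CountVectorizer()
tf_matrix = count_vectorizer.fit_transform(docs)

print(tf_matrix.shape)

# Look up how often "great" occurs in the second document
vocab = count_vectorizer.vocabulary_
print(tf_matrix[1, vocab["great"]])
```

On the real data, the same call with `cleaned_tweets` in place of `docs` would give the count matrix; options such as `stop_words='english'` or `max_features` could be carried over from the TF-IDF cell.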


**Task 3: Apply LDA topic modeling method with 5 topics**

Fit an LDA model with 5 topics on these tweets. 


In [14]:
# Add your code below
# -------------------
from sklearn.decomposition import LatentDirichletAllocation


lda = LatentDirichletAllocation(n_components=5, random_state=42)

lda.fit(tfidf_matrix)


def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))


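# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()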
display_topics(lda, tfidf_feature_names, 10)



Topic 0:
trump great amp thank people american united day states president
Topic 1:
rt obamacare amp vote realdonaldtrump debates teamtrump healthcare hillaryclinton trumppence
Topic 2:
great america foxandfriends make enjoy today president rt honor jobs
Topic 3:
makeamericagreatagain clinton hillary watch imwithyou crooked media cnn news fake
Topic 4:
thank join maga pm tickets draintheswamp florida tomorrow ohio americafirst
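
Besides the topic-word weights in `components_`, a fitted LDA model also gives a topic mixture for every document. The sketch below shows this on a hypothetical four-document corpus using raw counts and two topics; the corpus, the number of topics, and the variable names are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical mini-corpus (assumption: stands in for the cleaned tweets)
docs = [
    "great rally in ohio thank you",
    "thank you florida great crowd",
    "fake news media is failing",
    "the media reports fake news",
]

counts = CountVectorizer().fit_transform(docs)

# Two topics only because the toy corpus is tiny
lda_toy = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda_toy.fit_transform(counts)

# One row per document, one column per topic; each row is a probability distribution
print(doc_topic.shape)
print(np.round(doc_topic.sum(axis=1), 6))
```

Calling `argmax(axis=1)` on `doc_topic` would assign each document to its dominant topic.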


**Task 4: Print the top 10 words for each of the topics**

In [12]:
# Add your code below
# -------------------

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx + 1}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))


no_top_words = 10
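# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()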
display_topics(lda, tfidf_feature_names, no_top_words)




Topic 1:
pm clinton hillary media news watch enjoy foxandfriends fake rt
Topic 2:
thank great join maga draintheswamp america tickets make florida tomorrow
Topic 3:
trump americafirst obamacare debates crookedhillary amp healthcare donald hillaryclinton repeal
Topic 4:
makeamericagreatagain hillary states jobs amp national clinton years obama happy
Topic 5:
rt amp north realdonaldtrump people president great failing korea trump


**Task 5: Name each topic (no right answer)**

After observing the top 10 words in each topic, do these topics make sense to you? Can you name each topic?

In [13]:
# Add your code below
# -------------------

topic_names = {
    1: "Politics",
    2: "Economy",
    3: "Healthcare",
    4: "Foreign Policy",
    5: "Social Media"
}

for topic_idx, topic_name in topic_names.items():
    print(f"Topic {topic_idx}: {topic_name}")




Topic 1: Politics
Topic 2: Economy
Topic 3: Healthcare
Topic 4: Foreign Policy
Topic 5: Social Media


**Task 6: Create a TFIDF matrix**

Create a TF-IDF matrix for these tweets.

In [15]:
# Add your code below
# -------------------

from sklearn.feature_extraction.text import TfidfVectorizer

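tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

# Note: this fits on the raw `tweets`, not `cleaned_tweets`, so link fragments
# such as "https" stay in the vocabulary and show up in the topics below.
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets)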

print('TFIDF Matrix Shape: {}'.format(tfidf_matrix.shape))




TFIDF Matrix Shape: (3998, 3083)


**Task 6: Apply NMF topic modeling with 5 topics**

In [17]:
# Add your code below
# -------------------

from sklearn.decomposition import NMF

nmf_model = NMF(n_components=5, random_state=42)

nmf_model.fit(tfidf_matrix)

def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

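# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0, call get_feature_names() instead
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()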

display_topics(nmf_model, tfidf_feature_names, 10)







Topic 0:
https tickets watch draintheswamp 3kwol2ibaw americafirst crookedhillary icymi rt video
Topic 1:
great america make safe going honor day people today state
Topic 2:
makeamericagreatagain imwithyou erictrump lets movement lesm pennsylvania join poll nfib
Topic 3:
thank maga americafirst florida ohio join imwithyou trumppence16 support new
Topic 4:
amp rt hillary trump clinton realdonaldtrump people draintheswamp president media


**Task 7: Print the top 10 words for each of the topics**

In [18]:
# Add your code below
# -------------------
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

display_topics(nmf_model, tfidf_feature_names, 10)





Topic 0:
https tickets watch draintheswamp 3kwol2ibaw americafirst crookedhillary icymi rt video
Topic 1:
great america make safe going honor day people today state
Topic 2:
makeamericagreatagain imwithyou erictrump lets movement lesm pennsylvania join poll nfib
Topic 3:
thank maga americafirst florida ohio join imwithyou trumppence16 support new
Topic 4:
amp rt hillary trump clinton realdonaldtrump people draintheswamp president media


**Task 8: Perform a comparison between the topics identified by LDA and NMF methods.**

In [20]:
# Add your code below
# -------------------

from sklearn.decomposition import LatentDirichletAllocation, NMF

lda_model = LatentDirichletAllocation(n_components=5, random_state=42)
lda_model.fit(tfidf_matrix)

nmf_model = NMF(n_components=5, random_state=42)
nmf_model.fit(tfidf_matrix)






NMF(n_components=5, random_state=42)

In [21]:
def display_topics_comparison(lda_model, nmf_model, feature_names, no_top_words):
    print("LDA Topics:")
    for topic_idx, topic in enumerate(lda_model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))
    
    print("\nNMF Topics:")
    for topic_idx, topic in enumerate(nmf_model.components_):
        print(f"Topic {topic_idx}:")
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

display_topics_comparison(lda_model, nmf_model, tfidf_feature_names, 10)


LDA Topics:
Topic 0:
thank great https america join make maga rt realdonaldtrump amp
Topic 1:
https makeamericagreatagain draintheswamp bigleaguetruth clinton hillary debates2016 vpdebate rt bernie
Topic 2:
tickets media hillary news fake clinton amp people years healthcare
Topic 3:
https americafirst watch imwithyou thank crookedhillary amp great president united
Topic 4:
https enjoy amp today interviewed 00 great whitehouse foxandfriends rt

NMF Topics:
Topic 0:
https tickets watch draintheswamp 3kwol2ibaw americafirst crookedhillary icymi rt video
Topic 1:
great america make safe going honor day people today state
Topic 2:
makeamericagreatagain imwithyou erictrump lets movement lesm pennsylvania join poll nfib
Topic 3:
thank maga americafirst florida ohio join imwithyou trumppence16 support new
Topic 4:
amp rt hillary trump clinton realdonaldtrump people draintheswamp president media
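
One simple way to make the comparison quantitative is to measure how many top words two topics share. The sketch below computes the Jaccard overlap (shared words divided by all distinct words) between LDA Topic 0 and three of the NMF topics, using the word lists printed above. The helper `jaccard` is a hypothetical name introduced for this illustration, and Jaccard overlap is only one possible similarity measure.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two top-word lists have in common (0.0 to 1.0)."""
    a, b = set(words_a), set(words_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Top-10 word lists copied from the printed output above
lda_topic_0 = "thank great https america join make maga rt realdonaldtrump amp".split()
nmf_topics = {
    1: "great america make safe going honor day people today state".split(),
    3: "thank maga americafirst florida ohio join imwithyou trumppence16 support new".split(),
    4: "amp rt hillary trump clinton realdonaldtrump people draintheswamp president media".split(),
}

for idx, words in nmf_topics.items():
    print(f"LDA Topic 0 vs NMF Topic {idx}: {jaccard(lda_topic_0, words):.2f}")
```

Applying the same function to every LDA/NMF pair would give a 5 x 5 overlap table, where a high value suggests the two methods found a similar theme.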


# Save your notebook, then `File > Close and Halt`