# Lecture 4 - Topic Models

In this notebook we will learn how to cluster text into topics using different embeddings and the K-means clustering algorithm. 

Below is the overview of this notebook.

0. Install required packages (only need to do this the first time we run the notebook)

1. Load corpus of tweets

2. Make word clouds of the tweets

3. Create tf and tf-idf embeddings of the tweets

4. Create LDA topic model embeddings of the tweets

5. Create low dimensional embeddings of the tweets using UMAP

6. Cluster the tweets using K-means clustering

7. Compare clusters from different embeddings

This notebook can be opened in Colab 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zlisto/social_media_analytics/blob/main/Lecture04_TopicModels.ipynb)

Before starting, select "Runtime->Factory reset runtime" to start with your directories and environment in the base state.

If you want to save changes to the notebook, select "File->Save a copy in Drive" from the top menu in Colab.  This will save the notebook in your Google Drive.


# Clone GitHub Repository
This will clone the repository to your machine.  This includes the code and data files.  Then change into the directory of the repository.

In [None]:
!git clone https://github.com/zlisto/social_media_analytics

import os
os.chdir("social_media_analytics")

## Install Requirements 


In [None]:
!pip install -r requirements.txt

## Import packages

We import the packages we are going to use.  A package contains several useful functions that make our life easier.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import umap
import gensim.downloader as api
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

import sklearn.cluster as cluster
from sklearn import metrics
from scipy import stats

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

import scripts.TextAnalysis as ta
from scripts.api import *

# Data Cleaning



### Load data

We will load csv files containing tweets from several users into a dataframe **df**. 

In [2]:
fname_db = "data/lecture_04"
df = DB.fetch(table_name = 'user_tweets', path = fname_db)

#df = pd.read_csv("data/tweets_lec_10.csv")
print(f"{len(df)} tweets in dataframe")
df.sample(1000).head()

8626 tweets in dataframe


Unnamed: 0,created_at,screen_name,text,lang,retweet_count,reply_count,like_count,quote_count,id,author_id,conversation_id,in_reply_to_user_id,geo
1861,2021-05-06T16:05:20.000Z,KimKardashian,Cotton collection just dropped in 2 new colors...,en,244,149,10186,30,1390336891363467266,25365536,1390336891363467266,,
5445,2021-12-08T20:49:45.000Z,AOC,RT @cspan: Rep. Alexandria Ocasio-Cortez (@AOC...,en,1127,0,0,0,1468684246420271104,138203134,1468684246420271104,,
6209,2021-06-10T17:47:47.000Z,AOC,RT @AyannaPressley: Stop the bad faith attempt...,en,1486,0,0,0,1403046247305560065,138203134,1403046247305560065,,
6105,2021-07-02T00:48:26.000Z,AOC,"This is not a drill! 🚨 \n\nElias, my homework ...",en,2055,561,29212,219,1410762252261728257,138203134,1410762252261728257,,
3316,2020-05-29T16:06:54.000Z,BarackObama,My statement on the death of George Floyd: htt...,en,439246,37108,1832494,42624,1266400635429310466,813286,1266400635429310466,,


###  Remove Superfluous Columns 

We don't need all the columns.  We can remove them from this dataframe using the column selection operation.  We just choose which columns we want to keep and put them in a list.


In [3]:
df = df[ ['screen_name', 'text', 'retweet_count']]
df.sample(10).head()

Unnamed: 0,screen_name,text,retweet_count
8271,RashidaTlaib,It is not radical to want clean water. https:/...,119
6166,AOC,"For districts with multiple candidates, we enc...",141
3720,JBALVIN,Esta tarde nos vemos en TikTok a las 6 pm hora...,93
5972,AOC,"If mods want to blow up the infra deal, that’s...",2752
3497,JBALVIN,@SonrienteBenito ❤️🙏👏,0


### Plot Tweets per User

A count plot shows us how many tweets each user has in the dataset.  If we choose `y` to be `"screen_name"`, the bars will be horizontal, with one bar per user.

We can choose the `palette` for the plot from this list here: https://seaborn.pydata.org/tutorial/color_palettes.html

In [None]:
plt.figure(figsize=(8,8))
sns.countplot(data=df,y='screen_name',  palette = "Set2")
plt.ylabel("Screen name", fontsize = 14)
plt.xlabel("Tweet count", fontsize = 14)
plt.show()

### Cleaning Text Data
Next we will clean the tweet text.  We use the *clean_tweet* function in the TextAnalysis module (imported above as `ta`).  This function removes punctuation and hyperlinks, and also makes all the text lower case.  We remove any cleaned tweets which have zero length, as these won't be useful for clustering.  We add a column to **df** called *text_clean* with the cleaned tweets.

In [4]:
n0 = len(df)  #number of tweets before cleaning

df['text_clean'] = df.text.apply(ta.clean_tweet)  #clean the tweets
df = df[df.text_clean.str.len() >0].copy()  #remove cleaned tweets of length 0

nclean = len(df)  #number of tweets left after cleaning

print(f"{n0} tweets, {nclean} clean tweets")

df.sample(n=5)

8626 tweets, 8055 clean tweets


Unnamed: 0,screen_name,text,retweet_count,text_clean
3309,BarackObama,"4. So if we want to bring about real change, t...",25910,4 so if we want to bring about real change the...
5553,AOC,"For people saying this is too out there, read ...",279,for people saying this is too out there read t...
5078,sanbenito,como ustedes se aprenden esos bailes???,5789,como ustedes se aprenden esos bailes
4705,sanbenito,desconfigurao,15465,desconfigurao
2717,MichelleObama,"During times like this, we all need to lean on...",1174,during times like this we all need to lean on ...
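
To make the cleaning step concrete, here is a minimal sketch of a function that behaves as described above: it drops hyperlinks and punctuation and lower-cases the text.  `clean_tweet_sketch` is a hypothetical stand-in, not the actual `ta.clean_tweet`, whose implementation may differ (for example in how it treats emojis or mentions).

```python
import re
import string

def clean_tweet_sketch(text):
    """Hypothetical stand-in for ta.clean_tweet; the real function may differ."""
    text = re.sub(r"https?://\S+", "", text)  # drop hyperlinks
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop ASCII punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
    return text.lower()

print(clean_tweet_sketch("It is not radical to want clean water. https://t.co/abc"))
print(clean_tweet_sketch("como ustedes se aprenden esos bailes???"))
```

A tweet that consists only of a link or punctuation becomes an empty string under this sketch, which is why the zero-length filter is needed.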


# Copy of Dataframe

Sometimes you want to work on a slice of a dataframe.  For example, maybe you want to work with a slice that contains tweets from a single screen name.  If you add a column to the slice, you will get a warning, because pandas cannot tell whether the slice is still tied to the original dataframe.  To avoid this, use the `copy` method when creating the slice.  This makes the slice an independent copy, and now you can add columns without any warning.

In [9]:
df_aoc = df[df.screen_name=='AOC']

df_aoc['test'] = df.retweet_count


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [6]:
df_aoc = df[df.screen_name=='AOC'].copy()

df_aoc['test'] = df.retweet_count
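
The same pattern can be checked on a tiny dataframe.  This is a sketch with made-up data; the names `toy`, `toy_aoc`, and `doubled` are hypothetical and only serve the illustration.

```python
import pandas as pd

toy = pd.DataFrame({"screen_name": ["AOC", "AOC", "BarackObama"],
                    "retweet_count": [10, 20, 30]})

# .copy() makes the filtered rows an independent dataframe
toy_aoc = toy[toy.screen_name == "AOC"].copy()
toy_aoc["doubled"] = toy_aoc.retweet_count * 2  # no warning is raised here

print(toy_aoc)
print("doubled" in toy.columns)  # the original dataframe does not get the new column
```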

# Word Cloud

We can make a word cloud of the tweets using the `WordCloud` class, which takes a set of stopwords and many other parameters.  We then call its `generate` method on a string of text to make the word cloud, and use the `imshow` function to display it.

We convert the `text_clean` column of our dataframe into a single giant string called `text` using the `tolist` and `join` methods.

In [None]:
stopwords = set(STOPWORDS)

text=' '.join(df.text_clean.tolist()).lower()
wordcloud = WordCloud(stopwords=stopwords,max_font_size=150, 
                      max_words=100, 
                      background_color="black",
                      width=1000, 
                      height=600)

wordcloud.generate(text)
fig = plt.figure(figsize = (10,8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# Create Text Embeddings

To cluster the tweets, we need to create vector embeddings for them.  We can do this using vectorizers.  We have two simple options here.  One is a term frequency (tf) vectorizer called *CountVectorizer*.  The other is a term frequency-inverse document frequency (tf-idf) vectorizer called *TfidfVectorizer*.


### Term Frequency (TF) Embedding

We initialize the *CountVectorizer* and tell it to remove English stopwords by setting the `stop_words` parameter to `"english"`.  We also tell it to remove any word that occurs in fewer than 5 documents with the `min_df` parameter.  Then we apply the `fit_transform` method to the `text_clean` column of `df` to create the document vectors, which we call `tf_embedding`.  We store the word corresponding to each element of the vector in `tf_feature_names`.

In [None]:
tf_vectorizer = CountVectorizer(min_df=5, stop_words='english')
tf_embedding = tf_vectorizer.fit_transform(df.text_clean)
tf_feature_names = tf_vectorizer.get_feature_names_out()

nvocab = len(tf_feature_names)
ntweets = len(df.text_clean)
print(f"{ntweets} tweets, {nvocab} words in vocabulary")

### Term Frequency-Inverse Document Frequency (TF-IDF) Embedding

We initialize the `TfidfVectorizer` as we did the `CountVectorizer`.  Then we use the `fit_transform` method applied to the `text_clean` column of `df` to create the document vectors, which we call `tfidf_embedding`.  We store the words for each element of the vector in `tfidf_feature_names`.

In [None]:
tfidf_vectorizer = TfidfVectorizer(min_df=5, stop_words='english')
tfidf_embedding = tfidf_vectorizer.fit_transform(df.text_clean)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

nvocab = len(tfidf_feature_names)
print(f"{ntweets} tweets, {nvocab} words in vocabulary")

### Latent Dirichlet Allocation (LDA) Embedding

We will fit an LDA topic model on the tf embedding of the tweets. Much of this section pulls code from this blog:

https://medium.com/mlreview/topic-modeling-with-scikit-learn-e80d33668730



#### Fitting LDA Model

To fit an LDA model we need to specify the number of topics.  There are sophisticated ways to choose this number, but because it takes some time to fit the model, we will cheat here.  We set `num_topics` equal to the number of unique users in the dataset, in the hope of finding one topic for each user.  To fit the model we use the `LatentDirichletAllocation` class.  We first initialize this object with the number of topics, and then use the `fit` method to fit the model to `tf_embedding` (we can't use `tfidf_embedding` because LDA models word counts, which are integers).  The fitted model object is called `lda`.

In [None]:
%%time
num_topics = len(df.screen_name.unique())
lda = LatentDirichletAllocation(n_components=num_topics, max_iter=5, 
                                learning_method='online', learning_offset=50.,
                                random_state=0).fit(tf_embedding)

#### Convert Tweets into Topic Embedding Vectors Using LDA Model

Next we convert each tweet into a topic embedding vector.  The length of this vector is the number of topics in the LDA model.  Each element is the estimated share of the tweet that belongs to the corresponding topic, so the elements of each vector sum to one.  The conversion is done using the `transform` method of `lda`.  The resulting topic vectors are called `lda_embedding`.

In [None]:
lda_embedding = lda.transform(tf_embedding)
print(f"{ntweets} tweets, {num_topics} topics in LDA model")
print(f"shape of lda embedding is {lda_embedding.shape}")
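
As a quick sanity check on what `transform` returns, the sketch below fits a two-topic model on a made-up corpus of four short documents.  The corpus and the names `toy_counts`, `toy_lda`, and `toy_topics` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["vote congress bill vote", "bill congress senate",
            "album music tour", "music tour concert album"]

toy_counts = CountVectorizer().fit_transform(toy_docs)
toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)
toy_topics = toy_lda.transform(toy_counts)

print(toy_topics.shape)        # one row per document, one column per topic
print(toy_topics.sum(axis=1))  # each row is a distribution over topics
```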

#### Visualizing LDA Topics with pyLDAvis

A useful way to visualize the topics in an LDA model is the pyLDAvis package.  We use the `prepare` function in `pyLDAvis.sklearn` to create an object called `viz`.  The inputs are the model (`lda`), the tf embedding (`tf_embedding`), and the CountVectorizer (`tf_vectorizer`).  Then we render an interactive visualization of the model inside the notebook by passing `viz` to `pyLDAvis.display`.

Here's how to use the pyLDAvis visualization.  Each circle is a topic.  Hover over it and the bar chart shows the most relevant words in the topic.  You can slide the value of the relevance metric (lambda) to adjust how the relevance of each word is measured.  lambda = 1 ranks words by their probability in the topic.  lambda = 0 ranks words by their probability in the topic divided by their probability in the entire corpus of tweets, which favors words that are distinctive to the topic.  For our purposes, lambda = 0 is fine.

In [None]:
viz = pyLDAvis.sklearn.prepare(lda, tf_embedding, tf_vectorizer)
pyLDAvis.display(viz)
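
The relevance metric behind the lambda slider is lambda * log p(word | topic) + (1 - lambda) * log( p(word | topic) / p(word) ).  The sketch below evaluates it for two words with made-up probabilities to show how the ranking flips between lambda = 1 and lambda = 0.  The function name `relevance` and all the numbers are hypothetical.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a word to a topic for a given lambda in [0, 1]."""
    lift = p_word_given_topic / p_word  # how much more likely the word is inside the topic
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)

# Made-up probabilities: a word common everywhere vs. a rarer, topic-specific word
common_word = {"p_word_given_topic": 0.05, "p_word": 0.04}
specific_word = {"p_word_given_topic": 0.01, "p_word": 0.001}

for lam in (1.0, 0.0):
    r_common = relevance(lam=lam, **common_word)
    r_specific = relevance(lam=lam, **specific_word)
    top = "common" if r_common > r_specific else "specific"
    print(f"lambda={lam}: common={r_common:.2f}, specific={r_specific:.2f}, top word: {top}")
```

With lambda = 1 the frequent word ranks first; with lambda = 0 the topic-specific word does.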


### UMAP Embedding

We can use UMAP to create low-dimensional embeddings of the tweets.  This allows us to plot the tweets in two dimensions. Also, sometimes the lower dimensional embedding makes better text clusters.


In [None]:
%%time
umap_tf_embedding = umap.UMAP(n_components=2, metric='hellinger').fit_transform(tf_embedding)
umap_tfidf_embedding = umap.UMAP(n_components=2, metric='hellinger').fit_transform(tfidf_embedding)

#z-scoring centers each coordinate at zero and scales it to unit variance
umap_tf_embedding = stats.zscore(umap_tf_embedding,nan_policy='omit')
umap_tfidf_embedding = stats.zscore(umap_tfidf_embedding,nan_policy='omit')
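
For reference, the z-score step amounts to subtracting each column's mean and dividing by its standard deviation, which is what `stats.zscore` does with its default settings (`axis=0`, `ddof=0`).  A minimal sketch on made-up 2-D points, with hypothetical names `points` and `z`:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2-D points whose columns have different centers and spreads
points = rng.normal(loc=[5.0, -2.0], scale=[3.0, 0.5], size=(200, 2))

# Column-wise z-score: subtract the mean, then divide by the standard deviation
z = (points - points.mean(axis=0)) / points.std(axis=0)

print(z.mean(axis=0).round(6))  # each column is now centered at zero
print(z.std(axis=0).round(6))   # and has unit standard deviation
```

After this rescaling, most points fall within about three units of the origin on each axis, regardless of the scale UMAP happened to produce.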


#### Add UMAP Embeddings to DataFrame

Add UMAP embeddings x and y coordinates for each tweet to `df`.


In [None]:
df['tf_umap_x'] = umap_tf_embedding[:,0]
df['tf_umap_y'] = umap_tf_embedding[:,1]
df['tfidf_umap_x'] = umap_tfidf_embedding[:,0]
df['tfidf_umap_y'] = umap_tfidf_embedding[:,1]

#### Visualize Embeddings

We can use `scatterplot` to plot the embeddings using the UMAP x-y coordinates.  We will color the data points, which are tweets, by the screen name of their creator using the `hue` parameter.

In [None]:
xmax = 3  #range for x-axis
ymax = 3  #range for y-axis
s = 5  #marker size

fig = plt.figure(figsize = (16,8))

ax1 = plt.subplot(1,2,1)
sns.scatterplot(data=df, x="tf_umap_x", 
                y="tf_umap_y", hue="screen_name", s=s)
plt.title("TF Embedding")
plt.xlim([-xmax, xmax])
plt.ylim([-ymax,ymax])

ax2 = plt.subplot(1,2,2)
sns.scatterplot(data=df, x="tfidf_umap_x", 
                y="tfidf_umap_y", hue="screen_name", s=s)
plt.title("TF-IDF Embedding")
plt.xlim([-xmax, xmax])
plt.ylim([-ymax,ymax])

plt.show()


# Cluster Tweets Using K-Means on Embeddings

We will cluster the tf, tf-idf, LDA, and UMAP embedding vectors using the k-means algorithm.  We choose the number of clusters with the variable `n_clusters`.  To get the cluster label of each tweet we initialize a `KMeans` object with the number of clusters, and then call the `fit_predict` method on the embedding array.

We create a column in `df` for each k-means cluster label.  

In [None]:
#n_clusters = len(df.screen_name.unique())
n_clusters = 6

kmeans_label = cluster.KMeans(n_clusters=n_clusters).fit_predict(tf_embedding)
df['kmeans_label_tf'] = [str(x) for x in kmeans_label]

kmeans_label = cluster.KMeans(n_clusters=n_clusters).fit_predict(tfidf_embedding)
df['kmeans_label_tfidf'] = [str(x) for x in kmeans_label]

kmeans_label = cluster.KMeans(n_clusters=n_clusters).fit_predict(lda_embedding)
df['kmeans_label_lda'] = [str(x) for x in kmeans_label]

kmeans_label = cluster.KMeans(n_clusters=n_clusters).fit_predict(np.nan_to_num(umap_tf_embedding))
df['kmeans_label_tf_umap'] = [str(x) for x in kmeans_label]

kmeans_label = cluster.KMeans(n_clusters=n_clusters).fit_predict(np.nan_to_num(umap_tfidf_embedding))
df['kmeans_label_tfidf_umap'] = [str(x) for x in kmeans_label]
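
One way to put a number on how well a set of cluster labels matches known groups is the adjusted Rand index, available as `metrics.adjusted_rand_score` in scikit-learn.  The sketch below runs k-means on made-up, well-separated points.  The data and names are hypothetical, and `random_state` is fixed only so the run is repeatable.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated made-up groups of 2-D points
group_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X = np.vstack([group_a, group_b])
true_labels = np.array([0] * 50 + [1] * 50)

toy_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# The score ignores how clusters are numbered; 1.0 means a perfect match
ari = metrics.adjusted_rand_score(true_labels, toy_labels)
print(f"Adjusted Rand index: {ari:.2f}")
```

The same call could be applied to `df.screen_name` and any of the `kmeans_label_*` columns to compare embeddings, bearing in mind that a user is only a rough proxy for a topic.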


#### Plot Embeddings with Cluster Labels

We can make a scatterplot of the tweet embeddings, but this time color the data points using the cluster label.

In [None]:
embedding_types = ['tf_umap','tfidf_umap','lda']
s = 5
xmax,ymax = 3,3

for embedding_type in embedding_types:
    kmeans_label = f"kmeans_label_{embedding_type}"
    fig = plt.figure(figsize = (16,8))

    #both panels use the tf-idf UMAP coordinates; only the coloring changes
    ax1 = plt.subplot(1,2,1)
    sns.scatterplot(data=df, x="tfidf_umap_x", 
                    y="tfidf_umap_y", 
                    hue="screen_name", s=s)
    plt.title("True Clusters")
    plt.xlim([-xmax, xmax])
    plt.ylim([-ymax,ymax])

    ax2 = plt.subplot(1,2,2)
    sns.scatterplot(data=df, x="tfidf_umap_x", 
                    y="tfidf_umap_y", 
                    hue=kmeans_label, s=s)
    plt.title(f"{kmeans_label} Clusters")
    plt.xlim([-xmax, xmax])
    plt.ylim([-ymax,ymax])
    plt.show()

### Histograms of Users and Word Clouds of Tweets in the Clusters

We will take the tweets in each cluster and make a word cloud for them, along with a histogram of the screen names of the users who posted the tweets.  If we have good clusters, we expect each cluster to be dominated by one user, or by a group of users who tweet about similar topics.

We will be creating word clouds and histograms again later on, so let's write a function to do it.  The function is called `kmeans_wordcloud_userhist`.  Its inputs are the dataframe with the tweets and cluster labels, `df`, the name of the column with the cluster labels, `cluster_label_column`, and a set of stopwords called `stopwords`.

In [None]:
def kmeans_wordcloud_userhist(df, cluster_label_column,stopwords):
    print(cluster_label_column)
    for k in np.sort(df[cluster_label_column].unique()):
        s=df[df[cluster_label_column]==k]
        text=' '.join(s.text_clean.tolist()).lower()
        wordcloud = WordCloud(stopwords=stopwords,max_font_size=150, max_words=100, background_color="white",width=1000, height=600)
        wordcloud.generate(text)
     
        print(f"\n\tCluster {k} {cluster_label_column} has {len(s)} tweets")
        plt.figure(figsize = (16,4))
        plt.subplot(1,2,1)
        ax = sns.countplot(data = s, x = 'screen_name')
        plt.xticks(rotation=45)
        plt.ylabel("Number of tweets")
        plt.xlabel("Screen name")

        plt.subplot(1,2,2)
        plt.imshow(wordcloud, interpolation="bilinear")
        plt.axis("off")
        plt.show()
    return 1

# Wordcloud of Clusters

We can plot a word cloud for each cluster found, along with a histogram of the screen names in the cluster.

#### Wordcloud for TF UMAP Embedding

In [None]:
stopwords = set(STOPWORDS)
cluster_label_column= 'kmeans_label_tf_umap'
kmeans_wordcloud_userhist(df,cluster_label_column,stopwords )

#### Wordcloud for TF-IDF UMAP Embedding

In [None]:
stopwords = set(STOPWORDS)
cluster_label_column= 'kmeans_label_tfidf_umap'
kmeans_wordcloud_userhist(df,cluster_label_column,stopwords )

#### Wordcloud for LDA Embedding

In [None]:
stopwords = set(STOPWORDS)
cluster_label_column= 'kmeans_label_lda'
kmeans_wordcloud_userhist(df,cluster_label_column,stopwords )
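
The histograms above can also be summarized as a table.  The sketch below uses `pd.crosstab` on made-up labels to count tweets per cluster and user, and computes the share of each cluster that belongs to its most common user.  The dataframe `toy` and the name `purity` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "screen_name": ["AOC", "AOC", "AOC", "JBALVIN", "JBALVIN", "JBALVIN"],
    "kmeans_label": ["0", "0", "1", "1", "1", "1"],
})

# Rows are clusters, columns are users, entries are tweet counts
table = pd.crosstab(toy["kmeans_label"], toy["screen_name"])

# Share of each cluster that belongs to its most common user
purity = table.max(axis=1) / table.sum(axis=1)

print(table)
print(purity)
```

On the real data, the same two lines could be run with `df[cluster_label_column]` and `df["screen_name"]` for each embedding, giving one purity value per cluster to compare.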