# Twitter Topic Modeling with Non-negative Matrix Factorization

###### Ignacio Antequera Sanchez

*Content/Trigger Warning: This project uses real-world data from Twitter, so the dataset may contain tweets about sensitive or triggering topics. The project can be followed without digging into the individual tweets or reading their content.*

# 0. Introduction
---
Hello Everyone!

My name is Ignacio Antequera Sanchez, and in this project I will use techniques from recommender systems in an unexpected way: to model topics found on Twitter. Throughout the project, I will practice extracting topics from tweets using matrix factorization. This method assumes every tweet is a combination of several topics, each weighted by its prevalence in the text. In effect, the approach finds a low-dimensional representation of the tweets (the topic weights).

For this project, we will be working with tweets about the COVID-19 pandemic from 2020, when the pandemic entered our lives. The dataset was obtained from [Kaggle](https://www.kaggle.com/smid80/coronavirus-covid19-tweets-late-april?select=2020-04-30+Coronavirus+Tweets.CSV) and the preprocessing followed the steps [here](https://www.kaggle.com/satanizer/covid-19-tweets-analysis). For computational speed, we will first analyze a dataset from a single day: April 30, 2020. I encourage you to explore this dataset further. I might include additional analysis of other days in the future to study how topics change over time.

Without further ado, let's delve into the data!

---

# 1. The Data
---
*Extracted from [Kaggle](https://www.kaggle.com/smid80/coronavirus-covid19-tweets-late-april?select=2020-04-30+Coronavirus+Tweets.CSV)*

First, let's read the dataset into a data frame and have a look at what is there.

In [1]:
import numpy as np
import pandas as pd

text = pd.read_csv('tweets-2020-4-30-1.csv')
text = text.fillna('') # some rows are nan so replace with empty string
text.tail()

np.random.seed(416)

## Note: Some preprocessing

The dataset you have just loaded was actually pre-processed by me in a different project. Here I briefly describe the steps already handled, so you know that text data usually needs some extra work. The code is also presented below.

* Removed tweets not in English. This is a tricky modeling choice, but one that is pretty common for simplicity and accuracy. As with other questions of bias, a better choice would probably be to build separate models for each language.
* Removed URLs from tweets (not relevant to the analysis).
* Made all text lower-case.
* Removed all punctuation.
* Removed stop-words (e.g., "a", "the", "to") using [NLTK](https://www.nltk.org/).
* Removed some overly frequent COVID-related terms that would otherwise skew the analysis.

The code for these steps is shown below. The original dataset had extra columns other than just text.

```
import re
import string

# select tweets in English (`data` is the original Kaggle DataFrame)
text = data['text'][data['lang']=='en']

# remove URL links
text = text.apply(lambda x: re.sub(r"https\S+", "", str(x)))

# make lower case
text = text.str.lower()

# remove punctuation
text = text.apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

# remove stopwords and common COVID terms
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stop_words.update(['#coronavirus', '#coronavirusoutbreak', 
                   '#coronavirusPandemic', '#covid19', '#covid_19', 
                   '#epitwitter', '#ihavecorona', 'amp', 'coronavirus', 
                   'covid19','covid-19', 'covidー19'])

def remove_stopwords(tweet):
    words = tweet.split()
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)  # Trick to make string separated by spaces

text = text.apply(remove_stopwords)
```

## TF-IDF Matrix

Remember that matrix factorization methods work on matrices of numbers, not text, so we need to convert the text into a meaningful numeric representation.

Term Frequency-Inverse Document Frequency (TF-IDF) is a good way to do this: it defines a word weight vector for each document while down-weighting words that appear in most documents, such as `the` or `a`. We can compute it using `scikit-learn`.
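To make the weighting concrete, here is a minimal by-hand sketch on a made-up three-document corpus. It assumes the smoothed IDF formula and unit-length row scaling that, as far as I know, `TfidfVectorizer` applies by default. The corpus and the helper names (`toy_docs`, `smoothed_idf`, `tf_idf_vector`) are hypothetical and not part of the project.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the tweet dataset)
toy_docs = [
    "masks help slow spread",
    "masks masks everywhere",
    "stores limit shopping hours",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for words in tokenized for word in set(words))

def smoothed_idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tf_idf_vector(words):
    """Return {word: weight} for one document, scaled to unit Euclidean length."""
    counts = Counter(words)
    raw = {word: count * smoothed_idf(word) for word, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {word: weight / norm for word, weight in raw.items()}

toy_vectors = [tf_idf_vector(words) for words in tokenized]
print(toy_vectors[0])
```

In the first toy document, `masks` receives a lower weight than `help` because `masks` also appears in a second document, which is exactly the down-weighting effect described above.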

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create TF-IDF matrix
vectorizer = TfidfVectorizer(max_df=0.95)  # ignore words with very high doc frequency
tf_idf = vectorizer.fit_transform(text['text'])

# extract feature names (vocabulary)
feature_names = vectorizer.get_feature_names_out()

# check out the shape
print("TF-IDF matrix shape:", tf_idf.shape)

TF-IDF matrix shape: (119147, 183012)


The output shows that the TF-IDF matrix has a shape of (119147, 183012): there are 119,147 documents (tweets) in the dataset and 183,012 unique words (features) in the vocabulary after preprocessing and the TF-IDF transformation.

This TF-IDF matrix will serve as the input for further analysis.

We will now make two variables `num_tweets` and `num_words` that store the number of tweets in our dataset and number of words in our analysis respectively.

In [3]:
# Calculate the number of tweets
num_tweets = tf_idf.shape[0]

# Calculate the number of words (features)
num_words = tf_idf.shape[1]

# Print the results
print("Number of Tweets:", num_tweets)
print("Number of Words (Features):", num_words)

Number of Tweets: 119147
Number of Words (Features): 183012


This confirms that there are 119,147 tweets in the dataset and 183,012 unique words (features) in the TF-IDF analysis.

# 2. Modeling Tweets with Topics
---
We will use a technique similar to matrix factorization for recommendation to help us model tweets. In particular, we will use a model called Non-negative Matrix Factorization (NMF) to help us discover topics.

You might be wondering how an approach built for recommender systems can model tweets, when there is no notion of recommending a tweet. The idea is to create two matrices, describing "Tweet factors" and "Word factors", that will hopefully correspond to distinct topics of discussion. Just like with matrix factorization for recommendation, our hope is that each factor corresponds to a topic.
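One way to write this idea down, as a sketch in notation of my own (the symbols $V$, $W$, $H$, $n$, $m$ and $k$ are assumptions for illustration, not names from the project code):

$$
V \approx W H, \qquad V \in \mathbb{R}_{\ge 0}^{n \times m}, \quad W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad H \in \mathbb{R}_{\ge 0}^{k \times m}
$$

Here $V$ is the TF-IDF matrix ($n$ tweets by $m$ words), each row of $W$ holds the "Tweet factors" for one tweet, and each row of $H$ holds the "Word factors" for one of the $k$ topics. As far as I know, scikit-learn's `NMF` with default settings searches for such a pair by minimizing the squared Frobenius reconstruction error $\tfrac{1}{2}\lVert V - WH \rVert_F^2$ subject to all entries of $W$ and $H$ being non-negative.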

We will use the [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html) method from `scikit-learn` to extract the topics. 

We will set up an NMF model with 5 components and fit it to our TF-IDF data. We will use `fit_transform` to both fit the NMF model and transform our tweet data in one step.

When creating the model, we will use the following hyperparameters:
* `init='nndsvd'`
* `n_components=5`
* `random_state=42` (for reproducibility)

When fitting the model, we will fit it on our TF-IDF data which is a matrix of the shape `(num_tweets, num_words)`.

Once we have done this, we will save our model in a variable called `nmf` and the projected tweets in a variable called `tweets_projected`.

In [4]:
from sklearn.decomposition import NMF

# Initialize NMF model with 5 components
nmf = NMF(n_components=5, init='nndsvd', random_state=42)

# Fit NMF model to TF-IDF data (tweets)
tweets_projected = nmf.fit_transform(tf_idf)

# Print the shape of the projected tweets matrix
print("Shape of projected tweets matrix:", tweets_projected.shape)

Shape of projected tweets matrix: (119147, 5)


### Inspecting Components

The topics are stored within the object `nmf.components_`. Let's investigate this matrix and the `tweets_projected` matrix by printing their shapes.

In [5]:
# Print the shapes of nmf.components_ and tweets_projected
print("Shape of nmf.components_:", nmf.components_.shape)
print("Shape of tweets_projected:", tweets_projected.shape)

Shape of nmf.components_: (5, 183012)
Shape of tweets_projected: (119147, 5)


- The `nmf.components_` matrix has a shape of (5, 183012): there are 5 topics (or "Word factors"), each represented by weights over the 183,012 unique words (features) in the TF-IDF analysis.
- The `tweets_projected` matrix has a shape of (119147, 5): there are 119,147 tweets, each represented by 5 components (or "Tweet factors") corresponding to the topics extracted by the NMF model.

The `nmf.components_` field corresponds to the "Word factors" in the terminology of matrix factorization. This matrix represents the topics as combinations of words (features) found in the dataset. It therefore characterizes the contribution of each word to each topic, rather than the representation of each tweet in terms of topics.
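The relationship between the two matrices can be checked on a tiny made-up example. The sketch below assumes a hypothetical 4x4 "tweet by word" matrix with two clearly separated groups of words. It fits a 2-component NMF and multiplies the tweet factors by the word factors to approximately recover the input. All `toy_*` names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical toy "tweet x word" matrix with two obvious blocks of words
toy_matrix = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
])

toy_nmf = NMF(n_components=2, init='nndsvd', random_state=0, max_iter=500)
toy_tweet_factors = toy_nmf.fit_transform(toy_matrix)  # one row per tweet
toy_word_factors = toy_nmf.components_                 # one row per topic

# Tweet factors times word factors should approximately rebuild the input
toy_reconstruction = toy_tweet_factors @ toy_word_factors
print(toy_tweet_factors.shape, toy_word_factors.shape)
print(np.round(toy_reconstruction, 2))
```

The same shapes appear in the real model at a larger scale: `tweets_projected` plays the role of `toy_tweet_factors` and `nmf.components_` the role of `toy_word_factors`.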

# 3. Analyzing Topics
---
We are now interested in inspecting each topic to find the most prevalent or meaningful words for that topic. We'll consider the words with the highest weights for a topic in the NMF model to be the most important words for that topic. Recall that the words themselves are stored in a variable called `feature_names`.

Before trying to investigate the values in the real data, let's do a small example first to explore how this can be done. We can use [`argsort()`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) to get an array of indices sorted by the values at those indices; this is useful when you want to use the ordered indices for another purpose.

In [6]:
# Given data
small_words = ['dogs', 'cats', 'axolotl']
small_weights = np.array([1, 4, 2])

# Sort the indices based on weights in descending order
sorted_indices = np.argsort(small_weights)[::-1]

# Use take_along_axis to get the sorted words array
sorted_small_words = np.take_along_axis(np.array(small_words), sorted_indices, axis=None).tolist()

# Print the sorted words array
print("Sorted Small Words:", sorted_small_words)

Sorted Small Words: ['cats', 'axolotl', 'dogs']


We will now generalize the code from the last section to work on our real dataset.

We will now write a function `words_from_topic` to extract an ordered list of words in a topic (highest weight first).

In [7]:
def words_from_topic(topic, feature_names):
    """
    Sorts the words by their weight in the given topic from largest to smallest.
    
    Args:
        topic (np.array): A numpy array with one entry per word that shows the weight in this topic.
        feature_names (list): A list of words that each entry in topic corresponds to
    
    Returns:
        A list of words in feature_names sorted by weight in topic from largest to smallest. 
    """

    # Sort the indices based on weights in descending order
    sorted_indices = np.argsort(topic)[::-1]

    # Use sorted indices to rearrange words in feature_names
    sorted_feature_names = np.array(feature_names)[sorted_indices]

    # Convert the sorted numpy array back to a Python list and return
    return sorted_feature_names.tolist()

Now that we have implemented the function above, we can define the helper below, which uses it to print the top words in each topic (for example, `print_top_words(nmf.components_, feature_names, 10)` prints the top 10 words per topic).

In [8]:
def print_top_words(components, feature_names, n_top_words):
    """Print the first n_top_words for each topic in components.

    Args:
        components (numpy.ndarray): NMF components matrix.
        feature_names (list): List of feature names (words).
        n_top_words (int): Number of top words to print for each topic.
    """
    for topic_index, topic in enumerate(components):
        ordered_words = words_from_topic(topic, feature_names)
        top_words = ', '.join(ordered_words[:n_top_words])
        print(f'Topic: #{topic_index}: {top_words}')

## Investigating a Tweet
Next let's look at a specific tweet (index 40151) and the individual contributions of the topics. The cell below prints the text of the original tweet and then the value of the tweet after being transformed by our NMF.

In [9]:
index = 40151
print(text.iloc[index]['text'])
print(tweets_projected[index])

attention seattle shoppers grocery stores working hard keep employees customers safe part help slow spread ☑️ limit trips ☑️ respect special shopping hours ☑️ follow socialdistance guidance stores wegotthisseattle
[0.00825208 0.         0.02897575 0.         0.01537722]


The transformed tweet is a vector with one entry per topic extracted by the NMF model. The magnitude of each entry indicates how strongly that topic contributes to the tweet.

Here's the breakdown of the transformed tweet value (topics are 0-indexed, as in the rest of the analysis):

- Topic 0: 0.00825208
- Topic 1: 0.0
- Topic 2: 0.02897575
- Topic 3: 0.0
- Topic 4: 0.01537722

A value of zero indicates that the topic makes no contribution to the tweet.

The tweet is most associated with Topic #2, because that topic has the highest value (0.02897575) among all topics.
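The same conclusion can be reached programmatically. The sketch below copies the five weights printed above into a small array, finds the strongest topic with `np.argmax`, and optionally rescales the weights so they sum to 1. The rescaling is my own addition for readability and is not something the NMF model does. The names `tweet_weights`, `dominant_topic` and `topic_shares` are illustrative.

```python
import numpy as np

# Topic weights printed above for the tweet at index 40151
tweet_weights = np.array([0.00825208, 0.0, 0.02897575, 0.0, 0.01537722])

# Index of the strongest topic (0-indexed, matching the "Topic: #k" labels)
dominant_topic = int(np.argmax(tweet_weights))

# Optional: rescale so the weights sum to 1 and read like proportions
topic_shares = tweet_weights / tweet_weights.sum()

print("Dominant topic:", dominant_topic)
print("Topic shares:", np.round(topic_shares, 3))
```

Read as proportions, Topic #2 accounts for a little over half of this tweet's total topic weight.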

---

In our analysis above where we modeled each tweet in 5 topics, which topic has the most tweets strongly associated with it? 

For each tweet, we are now going to calculate which topic it is most strongly associated with by looking at the topic values for the tweet. If there is ever a tie for the largest topic weight, we will take the one with the lowest index (although this is unlikely to happen in our dataset).

We will save the index of the topic with the most tweets strongly associated with it in a variable called `largest_topic`. The result should be an integer for the index of the largest topic.

*Note: There is a very efficient, vectorized way to do this with NumPy, but there are many ways to solve this problem in general.*
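The tie-breaking rule and the counting step can be tried on a small made-up matrix first. This is a sketch under the assumption that `np.argmax` is used for the per-tweet choice, since it returns the first (lowest) index when several entries share the maximum. The `toy_*` arrays are hypothetical.

```python
import numpy as np

# Hypothetical topic weights for four tweets and three topics
toy_projected = np.array([
    [0.2, 0.5, 0.1],
    [0.3, 0.3, 0.0],   # tie between topics 0 and 1
    [0.0, 0.1, 0.4],
    [0.6, 0.1, 0.2],
])

# argmax along axis 1 picks each tweet's strongest topic;
# on ties it returns the first (lowest) index
toy_assignments = np.argmax(toy_projected, axis=1)

# Count tweets per topic; minlength keeps topics with zero tweets in the result
toy_counts = np.bincount(toy_assignments, minlength=toy_projected.shape[1])
toy_largest = int(np.argmax(toy_counts))

print("Assignments:", toy_assignments)
print("Counts per topic:", toy_counts)
print("Largest topic:", toy_largest)
```

The tied second row is assigned to topic 0, so topic 0 ends up with two tweets and is reported as the largest.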


In [None]:
# For each tweet, find the index of its largest topic weight
# (np.argmax returns the lowest index when there is a tie)
largest_topic_indices = np.argmax(tweets_projected, axis=1)

# Count how many tweets are most strongly associated with each topic
topic_counts = np.bincount(largest_topic_indices, minlength=tweets_projected.shape[1])

# Find the index of the topic with the most tweets
largest_topic = int(np.argmax(topic_counts))

# Print the index of the largest topic
print("Index of the largest topic:", largest_topic)

# 4. Investigating Trends
---

One benefit of using matrix factorization to reduce the data to a small number of dimensions is that it lets us visualize tweets in this "topic space" and look for interesting groupings.

In our earlier analysis, we modeled each tweet as 5 topics, but that is hard to visualize.

In the next cell, we make a new NMF model and projected tweets (called `nmf_small` and `tweets_projected_small` respectively) with 3 components instead of 5, using the same settings for the other parameters as we did earlier.

In [None]:
nmf_small = NMF(n_components=3, init='nndsvd', random_state=42)  # same settings as before
tweets_projected_small = nmf_small.fit_transform(tf_idf)

We can investigate the topics in this small model. Unsurprisingly, they seem mostly the same, but a couple of topics had to merge.

In [None]:
print_top_words(nmf_small.components_, feature_names, 10)

Now that we have 3 values for each tweet, we can actually plot each tweet in 3D space to see how all the tweets relate to each other. The following cell does exactly that. You don't need to understand all the specifics of how to make a 3D plot; just note that it uses the 3 topic values for each tweet as the x, y, z coordinates.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Set up axes to plot on
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

# Make 3D scatterplot
ax.scatter(tweets_projected_small[:, 0], tweets_projected_small[:, 1], tweets_projected_small[:, 2])

# Set axis labels
ax.set_xlabel('Topic 0')
ax.set_ylabel('Topic 1')
ax.set_zlabel('Topic 2')

# Rotate plot to be easily viewed
ax.view_init(30, 30)

Interestingly, it looks like there is a small cluster of tweets that are far away from all the others when looking at Topic 2. In other words, there are a few tweets that are very far in the Topic 2 direction, while the majority of tweets are spread out more along Topics 0 and 1.

### Outlier Tweets
Let's look into the tweets that seem very different from the rest.

We want to compute all of the unique tweets (since there are some duplicates) that appear in this region of the "topic space". The result is saved in a variable called `outlier_tweets` that has type `numpy.array` and stores all the unique tweets that are these outliers (as described below).

The steps are:
1. Find which rows in `tweets_projected_small` are outliers. We define these as tweets that have a value of `0.15` or more for Topic 2.
2. Use that information to access the `text` column of our original tweets `DataFrame` `text` for those rows.
3. Use the `.unique` method available on a column of a `pandas` `DataFrame` to find all the unique values.

Following these steps (particularly the last) yields a `numpy.array` with all of the unique tweets that meet this criterion. Note that many of the tweets look similar, but they count as unique tweets since they have some character differences!
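The three steps can be sketched on small stand-in data before running them on the real tweets. The example below assumes a hypothetical `toy_text` DataFrame with a `text` column and a hypothetical `toy_projected_small` array in place of the real variables. The tweet strings are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the `text` DataFrame and `tweets_projected_small`
toy_text = pd.DataFrame({
    'text': ['event tonight', 'stay safe', 'event tonight', 'event tonight!']
})
toy_projected_small = np.array([
    [0.01, 0.00, 0.20],
    [0.05, 0.02, 0.00],
    [0.00, 0.01, 0.18],
    [0.02, 0.00, 0.15],
])

# Step 1: boolean mask, True where the Topic 2 weight is at least 0.15
toy_mask = toy_projected_small[:, 2] >= 0.15

# Steps 2 and 3: select the matching rows of the text column, then drop duplicates
toy_outliers = toy_text.loc[toy_mask, 'text'].unique()

print(toy_outliers)
```

Three rows pass the threshold, but two of them are exact duplicates, so only two unique strings remain. The near-duplicate with the extra exclamation mark still counts as its own tweet.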

Do you spot a theme amongst these tweets? Do you think there is an explanation why our model isolated them as their own topic?


In [None]:
# Step 1: boolean mask that is True for tweets whose Topic 2 weight is at least 0.15
outlier_mask = tweets_projected_small[:, 2] >= 0.15

# Steps 2 and 3: select the text of those rows and keep only the unique tweets
outlier_tweets = text.loc[outlier_mask, 'text'].unique()

print(outlier_tweets)

Based on the output, the theme of the outlier tweets seems to be a virtual meeting or event taking place on April 30th at 5pm UK time.
It is possible that the model isolated these tweets as their own topic because they contain unique words or phrases that are not present in the other tweets in the dataset.

It is also possible that these tweets were grouped together because they contain a higher frequency of certain keywords that the model identified as important for this particular topic.