# In-class exercise 6b: Converting tweets to a matrix and doing fun things with them

## Remember the moral dictionary? From count method to frequency and inverse frequency

In the last session, we pre-processed the American presidential candidates' tweets and applied a simple counting/frequency method to check who appeals to which moral value. 

The moral dictionary (or any custom-made dictionary) is a fixed set of words. If you have a good theoretical foundation for why these particular words need to be considered, it is a great idea: simple, easy to implement, deterministic and convincing. A fancy algorithm won't necessarily perform better in this case. 

But what if you don't have a good set of words to use, or your research question is not tied to one specific angle? In this case, we can also convert the text data to a "general" matrix. Imagine that instead of the 10 types of words in the MFD, you have one column that counts **each individual English word**. Then, you can do fun things with it (such as machine learning, see the next class!) 
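To make the idea of a "general" matrix concrete, here is a minimal sketch in plain Python that builds a document-term count matrix by hand. The three toy documents are made up for illustration and are not taken from our dataset; the vectorizers used later in this notebook do the same job with many more options.

```python
from collections import Counter

# Toy documents (hypothetical, for illustration only)
toy_docs = ["mexico pay wall", "great job", "great wall great"]

# The vocabulary: every distinct word in the corpus, in alphabetical order
vocab = sorted({word for doc in toy_docs for word in doc.split()})

# One row per document, one column per vocabulary word, each cell is a count
count_rows = []
for doc in toy_docs:
    word_counts = Counter(doc.split())
    count_rows.append([word_counts[word] for word in vocab])

print(vocab)
for row in count_rows:
    print(row)
```

Each row is one document and each column is one word, so the matrix has as many columns as there are distinct words in the corpus.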

Of course, such a matrix would be extremely large, and many of its words are completely uninformative (such as the word "the"). That's why we use all kinds of techniques to extract the "useful" information from a text, most importantly limiting the matrix to the top [insert dimension] N-grams and applying TF-IDF. 

In [104]:
# Import everything we need today 

# The basics
import scipy 
from pathlib import Path
import pandas as pd

# NLP tool kit
import nltk
from nltk.stem import SnowballStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('punkt_tab')      
nltk.download('wordnet')    
nltk.download('omw-1.4') 
nltk.download('averaged_perceptron_tagger_eng')
from nltk.corpus import wordnet
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag

# Regular expressions
import re

# Language detection 
#import langdetect

# Loading
import fastparquet

# convert the tweets to embeddings using word2vec
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import numpy as np

# K-mean clustering 
from sklearn.cluster import KMeans

# Tfidf Vectoriser
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Our example the last time 
this_dir = Path(".")
df_path = this_dir / "data" / "cooked_data.parquet"
df = pd.read_parquet(df_path)

# let's quickly convert the column "processed text" in a single string instead of a list of words
def join_with_space(lst): 
    return(" ".join(lst))

df['processed_text'] = df['processed_text'].apply(join_with_space)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/huangyuchen/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


### Exercise 1: Getting to know TF-IDF vectorizers

In this exercise, we will use the `TfidfVectorizer` from `sklearn` to convert the tweets into a matrix.

TF-IDF is in essence a counting method with reweighting: it down-weights words that appear in many documents (e.g., "the", "is", "and") and up-weights words that appear in fewer documents (e.g., "election", "vote"). Let's first look at a small example: please take a look at the first 10 entries in our dataset. 

Run all the code blocks in this section. 


In [24]:
# Taking a look at the first 10 entries (they are from Trump 2016)
print(df['processed_text'][0:10])

index
0                                  land iowa speak soon
1                                                  time
3                                               # enjoy
5                                       mexico pay wall
8                                                 watch
9                                             # p enjoy
10       trump promise special session repeal obamacare
11                      yet evidence media rig election
12                                great job far contest
14    hillary advisers want avoid support israel tal...
Name: processed_text, dtype: object


In [25]:
# Using CountVectorizer
count_vectorizer = CountVectorizer()
count_matrix = count_vectorizer.fit_transform(df['processed_text'][0:10])
print("Count Vectorizer Result:\n", count_matrix.toarray())

# Using TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'][0:10])
print("TF-IDF Vectorizer Result:\n", tfidf_matrix.toarray())

Count Vectorizer Result:
 [[0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 0]
 [0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1]
 [0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 1 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0]]
TF-IDF Vectorizer Result:
 [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.5        0.
  0.         0.5        0.         0.         0.         0.
  0.         0.         0.         0.         0.5        0.5
  0.         0.

In [27]:
tfidf_matrix.shape

(10, 33)

In [28]:
# Looking at the words each vectorizer keeps (its vocabulary)
print(count_vectorizer.get_feature_names_out())
print(tfidf_vectorizer.get_feature_names_out())

['advisers' 'avoid' 'contest' 'democrats' 'election' 'enjoy' 'evidence'
 'far' 'great' 'hillary' 'iowa' 'israel' 'job' 'land' 'media' 'mexico'
 'obamacare' 'pay' 'promise' 'repeal' 'rig' 'session' 'soon' 'speak'
 'special' 'support' 'talk' 'time' 'trump' 'wall' 'want' 'watch' 'yet']
['advisers' 'avoid' 'contest' 'democrats' 'election' 'enjoy' 'evidence'
 'far' 'great' 'hillary' 'iowa' 'israel' 'job' 'land' 'media' 'mexico'
 'obamacare' 'pay' 'promise' 'repeal' 'rig' 'session' 'soon' 'speak'
 'special' 'support' 'talk' 'time' 'trump' 'wall' 'want' 'watch' 'yet']


#### Questions to answer: 

1. Each of the matrices is 10x33. Why are there 10 vectors, and 33 entries per vector, in the matrix? 
2. Explain in your own words what an entry means: for example, what does a 1 in the 11th place of the first vector represent? 
3. Look at the first vector from both vectorizers. Why are the entries for "Iowa" and "land" 1 with the count vectorizer and 0.5 with the TF-IDF vectorizer?  
4. Why do we want to use the TF-IDF vectorizer instead of the count vectorizer? 
Hint: imagine two scenarios: 
- Trump always finishes his tweets with "enjoy!"
- Trump tweets in a long and repetitive manner, as in the tweet below: 


> “When will all of the ‘reporters’ who have received Noble Prizes for their work on Russia, Russia, Russia, only to have been proven totally wrong (and, in fact, it was the other side who committed the crimes), be turning back their cherished ‘Nobles’ so that they can be given…to the REAL REPORTERS & JOURNALISTS who got it right. I can give the Committee a very comprehensive list. When will the Noble Committee DEMAND the Prizes back, especially since they were gotten under fraud? The reporters and Lamestream Media knew the truth all along …Lawsuits should be brought against all, including the Fake News Organizations, to rectify this terrible injustice. For all of the great lawyers out there, do we have any takers? When will the Noble Committee Act? Better be fast!”



### Exercise 2: Implementing and fine-tuning the vectorizer 

After the small illustration, we want to apply the TF-IDF vectorizer to the whole dataset. The next cell contains some basic syntax for creating the TF-IDF vectorizer and checking its shape.

We restrict the data to Biden's tweets only to make it faster. Please run the 2 code blocks below.

In [None]:
# Cut to biden's tweets
df = df[df['candidate'] == 'biden']

In [43]:
# create another vectorizer 
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_text'])

#print some samples
print(tfidf_vectorizer.get_feature_names_out()[-20:])
print(tfidf_vectorizer.get_feature_names_out()[0:20])


# How long is this? 
tfidf_vectorizer.get_feature_names_out().shape

['yo' 'yom' 'york' 'yorknorth' 'you' 'youd' 'youll' 'young' 'younger'
 'youre' 'youth' 'youve' 'zero' 'zeta' 'zip' 'zl' 'zones' '보내세요' '한가위'
 '행복한']
['10' '100th' '10216donald' '1030am' '10th' '1215pm' '16th' '19th' '1st'
 '21st' '350pm' '3pm' '3rd' '47th' '48th' '4th' '50th' '5283nurses' '57th'
 '60']


(3897,)

Biden's roughly 2,000 tweets gave us 3,897 features. That is quite a lot! Looking at the list of features, there are some that we probably don't need: numbers that failed to be cleaned ('10' '100th' '10216donald' '1030am' '10th' '1215pm' '16th'), things that look like typos ("yom"), or foreign-language words ("행복한"). We want only meaningful words! 

To do this, we are going to fine-tune the TF-IDF vectorizer so that it keeps only the meaningful words. Please create a few different TF-IDF vectorizers with the following options, using the documentation as a reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html. What does each criterion do?

- Try the `max_df` and `min_df` options: set `max_df` to 0.9 and `min_df` to 0.01. What do you get?
- Try the `max_features` option: set `max_features` to 500. What do you get?
- Try n-grams (features made of more than one word) while limiting the number of features in some way: set `ngram_range` to cover 1-grams and 2-grams. What do you get? 
- Try the `vocabulary` option and load in our moral dictionary from the last session. 

#### To proceed

For our next step, create a vectorizer with max 2000 features, max_df = 0.9 and min_df = 0.005, and n-grams of 1 and 2. Check the shape of the resulting matrix and print out the last 20 features and the first 20 features. Also remove these specific words: 

`special_stop_words = ['trump', 'hillary', 'clinton', 'president', 'vote', 'election', 'amp', 'realdonaldtrump', 'donald', 'makeamericagreatagain', 'imwithher', 'gettyimages', 'us', 'one', 'new', 'now', 'like', 'day', 'time', 'see', 'want', 'go', 'know', "hillary clinton", 'joe', 'biden', 'democrat', 'democrats', 'democratic', 'america', 'american', 'people', 'country', 'think', 'great', 'donald', 'donald trump', 'vote', 'america'  ]`

#### Optional: visualize your matrix as a dataframe. 

As an additional exercise, and to better visualize the matrix, you can convert the sparse matrix to a dataframe whose column names are the features (the words). We can also add the tweet text and the candidate's name as extra columns to make the matrix easier to read.

### Exercise 3: K-means clustering - unsupervised learning with text data

For the next step, we would like to group the tweets into different clusters based on their content. This is an unsupervised learning task, where we don't have predefined labels for the tweets, but we want to discover patterns and similarities among them. (e.g. what are the x main topics that the candidates are tweeting about? Do they differ between candidates?)


To do so, we are going to use the K-means clustering algorithm, a popular method for grouping data points into K distinct clusters based on their features. For an introduction to K-means clustering: https://www.youtube.com/watch?v=4b5d3muPQmA

The parameter that we provide is K, the number of clusters that we want, and we will see how the tweets are grouped into these clusters.
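To give a feeling for what happens under the hood, here is a minimal NumPy sketch of the basic K-means iteration (often called Lloyd's algorithm) on six made-up 2D points. It is a simplified illustration: the function `kmeans_sketch` is hypothetical, and scikit-learn's `KMeans` differs in details such as its default k-means++ initialization.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Simplified K-means: random start, then alternate assignment and update."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (a centroid with no points stays where it is)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two obvious groups of toy points (hypothetical data)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
print(centroids)
```

Which cluster gets the label 0 and which gets the label 1 depends on the starting points, so the labels themselves carry no meaning; only the grouping does.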

Please run the next block

In [None]:
# Define the quantity of clusters - try around with this number! 
k = 3

# apply the Kmeans algorithm
kmeans = KMeans(n_clusters=k, random_state=2025) # Why do we have to set a seed? Which step in the algorithm is random?
kmeans.fit(tfidf_matrix)

# attach the labels to the dataframe (one label per row of tfidf_matrix, i.e. per tweet in df)
df = df.assign(cluster=kmeans.labels_)

# Check out how many tweets are in each cluster
df['cluster'].value_counts()



cluster
0    1752
1     317
2     153
Name: count, dtype: int64

Now that we have applied K-means clustering to the TF-IDF matrix, it is still not very clear what each cluster means. To better understand the clusters, we can look at the top terms (words) that are most representative of each cluster. We are also interested in which candidate's tweets belong to which cluster.

Please run the next block. What are the three vectors in the output? How can you find the corresponding words? 

In [122]:
# top terms of each cluster. What do these numbers mean? 
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
order_centroids


array([[358, 215, 544, ..., 298, 249, 218],
       [362,  67, 251, ..., 274, 273, 311],
       [ 41, 542, 143, ..., 548, 309, 311]])

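Each of the three vectors in the output corresponds to one of the three clusters formed by the K-means algorithm. The numbers are not weights but column indices of the TF-IDF matrix: `argsort` orders the features of each centroid by their value, and `[:, ::-1]` reverses that order so that the highest-valued feature comes first. The centroids themselves (`kmeans.cluster_centers_`) are vectors in the same feature space as the original data points (tweets), and each represents the "center" of its cluster; the larger a feature's value in a centroid, the more characteristic that word is of the cluster. To find the corresponding words, look the indices up in `tfidf_vectorizer.get_feature_names_out()`.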

Do you think that the clusters make sense? Can you interpret what each cluster is about based on the top terms?

### Exercise 4: Getting to Know Embeddings

TF-IDF is a simple and effective way to convert text data into a numerical format that can be used for machine learning tasks. However, it has some limitations, such as not capturing the semantic meaning of words and phrases: "happy" and "joyful" are treated as completely different words, even though they have similar meanings. We might want to associate words that are similar in meaning and make the model "understand" the text better.

Enter embeddings. Embeddings are dense vector representations of words or phrases that capture their semantic meaning and relationships. Popular embedding techniques include Word2Vec, GloVe, and FastText. These embeddings are typically pre-trained on large corpora of text data and can be fine-tuned for specific tasks. We are going to download a pre-trained GloVe model trained on tweets (`glove-twitter-25`) and use it to convert the tweets into embeddings.
Please run the next block.

In [None]:
# import a tweet embedding model
import gensim.downloader
glove_vectors = gensim.downloader.load('glove-twitter-25')



The object that we just downloaded is of type `KeyedVectors`. What could that possibly mean? 

Please run the next block. Here, we are asking the model to provide the vector representation of a specific word (in this case, "election"). 

In [118]:
glove_vectors['election']

array([ 1.0729  ,  0.77963 , -0.98176 , -1.0813  ,  0.15096 , -1.9077  ,
        0.34671 , -0.85255 ,  0.638   , -0.11122 , -0.28173 ,  0.032971,
       -3.074   ,  1.9266  ,  0.81371 , -0.86205 , -0.67    , -0.041087,
       -0.26829 , -0.66733 , -0.40652 ,  0.5476  ,  0.13716 , -1.094   ,
       -0.96162 ], dtype=float32)

The model represents the word in an abstract 25-dimensional space. Each dimension captures some latent semantic feature of the word, and the values in the vector represent the strength or presence of those features. 

For example, imagine that one of the dimensions captures the concept of "positivity" or "happiness". A word like "happy" might have a high value in this dimension, indicating that it is strongly associated with positive emotions. On the other hand, a word like "sad" might have a low value in this dimension, indicating that it is associated with negative emotions. 

The pre-trained embeddings are learned from a large corpus, and we can exploit some cool functions that the model has. For example, please try the subsequent functions: 

- Use the `most_similar` function to find the words that are most similar to "election". 
- Use the `similarity` function to test the similarity between the following pairs of words: 'happy' and 'joy', 'happy' and 'sad', 'happy' and 'bus'; then, try some other topics (e.g. immigration vs politics).
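The score returned by `similarity` on a gensim `KeyedVectors` object is the cosine similarity between two word vectors. The sketch below computes it by hand with NumPy on three invented 3-dimensional vectors; the numbers are hypothetical and are not taken from the GloVe model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Invented toy embeddings (hypothetical values, for illustration only)
toy_embeddings = {
    "happy": np.array([0.9, 0.8, 0.1]),
    "joy": np.array([0.8, 0.9, 0.2]),
    "bus": np.array([-0.2, 0.1, 0.9]),
}

sim_happy_joy = cosine_similarity(toy_embeddings["happy"], toy_embeddings["joy"])
sim_happy_bus = cosine_similarity(toy_embeddings["happy"], toy_embeddings["bus"])
print(round(sim_happy_joy, 3), round(sim_happy_bus, 3))
```

Cosine similarity only compares directions, not lengths, so two words get a high score whenever their vectors point the same way in the embedding space.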

What can you say about the similarity scores? Do they make sense? In what cases do they not perform well? 

### Exercise 5: Word embeddings for tweets

Now, we would like to use our pre-trained embedding model to convert the tweets into embeddings. Since a tweet contains multiple words, we will do something simple: take the average of the embeddings of all the words in the tweet. (We can do more complicated things: check out sentence embeddings such as BERT!)

Please run the next block. Explain the size of the outcome object "embeddings". What does each vector in this represent? 

In [None]:

# Get the embeddings for each tweet by averaging the word vectors
def get_tweet_embedding(tweet):
    words = word_tokenize(tweet)
    word_vectors = [glove_vectors[word] for word in words if word in glove_vectors]
    if word_vectors:
        return np.mean(word_vectors, axis=0)
    else:
        return np.zeros(glove_vectors.vector_size)
embeddings = np.array([get_tweet_embedding(tweet) for tweet in df['processed_text']])
embeddings.shape

(2222, 25)

Next, we will apply K-means clustering to the embeddings and see how the tweets are grouped into different clusters. Please adapt the code above that runs K-means clustering so that it works on the new matrix object `embeddings`. 

Comparing the two approaches, both TF-IDF and embeddings are ways to convert text data into a numerical format for machine learning tasks. What are their main differences? Which one do you prefer and why? 

We would like to know whether the clusters formed from the GloVe embeddings are more meaningful and coherent than those formed from the TF-IDF matrix. To do so, we want to take a look at the words with the highest TF-IDF scores in each cluster. (In other words: "for this cluster, here are the most frequent words, adjusted for how distinctive they are in the corpus.") 

- Please write code that prints out the top 20 words by TF-IDF score in each cluster. 
- What do you think each cluster is about?
- Do you think embeddings necessarily perform better than the TF-IDF vectorizer? Why or why not?

In these two classes, we learned how to convert text data into a numerical format using different techniques, and how to apply unsupervised learning algorithms to discover patterns and relationships in the data. 

That being said, as we can see, unsupervised learning is not a great technique for "loose" data like tweets. The texts are short, noisy, and often lack context, making it difficult to extract meaningful information. In the next lesson, we will learn more about machine learning, particularly supervised learning, where we have predefined labels for the data and can train models to predict these labels based on the features extracted from the text. 