# Lab 3: Introduction to Mining Matrix Data

In this lab, we will focus on **Dimensionality Reduction**. As discussed in lecture, dimensionality reduction can help us uncover insights by letting us "zoom in" on the important features in our data and discover hidden patterns. To show the power of this technique, we will apply **Latent Semantic Analysis (LSA)** to tweets about the H1N1 pandemic and vaccine. We can think of LSA as applying Singular Value Decomposition (SVD) to a Document x Term matrix. Then we will shift our attention to recommender systems: we will apply SVD to a Users x Movies dataset to find user factors and movie factors, and we will implement a simple recommender system using a clustering approach.

In [3]:
!pip install nltk
import nltk
nltk.download('stopwords')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
import json
import random
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import pandas as pd
import numpy as np
import seaborn as sns
from scipy import spatial

from google.colab import files as colab_files
uploaded = colab_files.upload()

Saving Lab3_Matrix-StarterCode.ipynb to Lab3_Matrix-StarterCode.ipynb
Saving links.csv to links.csv
Saving movies.csv to movies.csv
Saving ratings.csv to ratings.csv
Saving README.txt to README.txt
Saving tags.csv to tags.csv
Saving tweets_time1.json to tweets_time1.json
Saving tweets_time5.json to tweets_time5.json


### 1.  LSA on H1N1 Twitter Data

The first task is to use Twitter data to gain insights into the evolution of vaccine acceptance during the H1N1 pandemic. As our recent experience with COVID-19 showed, there is an increased sense of urgency to develop and distribute vaccines rapidly during pandemics. H1N1 offers an interesting perspective because the H1N1 vaccine was developed in a very short period of time.

Instead of looking at and comparing all of the H1N1 pandemic milestones (see below for a timeline of the milestones), we will focus in on just two of them:

1) Milestone 1 - April 15: first human infection in California

2) Milestone 5 - September 15: FDA announces approval of 4 H1N1 vaccines

![image.png](attachment:image.png)

To complete this analysis, we must complete the following steps:

* **Step 1**: Use the Twitter API to retrieve ~500 tweets for each milestone

    * Query definition: (H1N1 OR swine flu) AND (vaccine*), no retweets, only English tweets
    * **This step is already completed for you**  - you will just have to read in the tweet data from json files
    

* **Step 2**: Explore and clean the tweets (e.g., remove links, stop words and numbers)


* **Step 3**: Apply TfidfVectorizer to obtain the document-term matrix


* **Step 4**: Apply TruncatedSVD to get the topics for each milestone

#### 1.1 Load the Tweets Data

In [16]:
# we will load and analyze tweets from the two milestones separately
# tweets1 will refer to tweets from the 1st H1N1 milestone - 1st human infection in California
# tweets5 will refer to tweets from the 5th H1N1 milestone - FDA announces approval of 4 H1N1 vaccines

with open('tweets_time1.json', 'r') as f:
    tweets1 = json.load(f)
    
print("Milestone 1 - Number of Tweets:", len(tweets1))
    
with open('tweets_time5.json', 'r') as f:
    tweets5 = json.load(f)
    
print("Milestone 5 - Number of Tweets:", len(tweets5))

Milestone 1 - Number of Tweets: 541
Milestone 5 - Number of Tweets: 500


In [5]:
# example tweet object
# each tweet is stored with a unique id as the key and has the following attributes:
# (1) timestamp
# (2) text (this will be our focus)

tweets1['1598228282']

{'timestamp': 'Thu Apr 23 22:12:33 +0000 2009',
 'text': 'CDC: rare swine flu detected in 7 Americans. Still unknown if vaccine is available to protect against the strain. http://tinyurl.com/cksqp4'}

#### 1.2 Explore and Clean Tweets

The first step will be to check for duplicate tweets in our data and remove them as necessary. This is important as we want to get a good distribution of the topics being discussed from the milestones and do not want them to be biased by duplicate tweets. 

In [6]:
# check for and remove any duplicates

# source: https://www.w3schools.com/python/python_howto_remove_duplicates.asp
def remove_dups(x):
    return list(dict.fromkeys(x))

tweets1_no_dups = []
tweets1_text = [tweets1[tweet]["text"] for tweet in tweets1] # extract the text from the tweet object
tweets1_no_dups = remove_dups(tweets1_text)
print("Before:", len(tweets1), "After:", len(tweets1_no_dups))

tweets5_no_dups = []
tweets5_text = [tweets5[tweet]["text"] for tweet in tweets5] # extract the text from the tweet object
tweets5_no_dups = remove_dups(tweets5_text)
print("Before:", len(tweets5), "After:", len(tweets5_no_dups))

Before: 541 After: 527
Before: 500 After: 462


Note: we may still see substantial overlap in the content of some tweets; for example, tweets may refer to the same article by its title, but we will no longer have tweets with the same *exact* content. Next, to familiarize ourselves with the data, let's look at some example tweets from each milestone.

In [7]:
print("Sample Tweets for Milestone 1")
print(random.sample(tweets1_no_dups, 10))
print("-------------------------------------------------------------------------------------------------------------")
print("Sample Tweets for Milestone 5")
print(random.sample(tweets5_no_dups, 10))

Sample Tweets for Milestone 1
['Damn Swine Flu!!!Why did the CDC know about this years ago but this is the first we are told about it? Why is there no Vaccine??huh?why?why?', "Vaccines 'will not stop swine flu': CURRENT flu vaccines will not stop a deadly virus spreading around the world.. http://tinyurl.com/dzywoj", "The swine Flu in Mexico was man made so that they could give all it's citizens a Vaccine shot it's coming to the USA as well! VACCINES!!!", '@govchains \nThose against TARP have been vaccinated against swine flu (pork barrel spending)', 'Swine flu vaccine poss ready by sep or oct via npr (via @edwardboches)', 'CDC press conference: very unlikely seasonal H1N1 vaccination effective against current outbreak. #swineflu', 'I heard Bruce Lee was born immune to Swine Flu...the only vaccine to cure it was in his blood.', 'Chuck Norris cured the Swine Flu by eating a cold pig whole as vaccination.', 'Researchers Working To Develop Swine Flu Vaccine: Last Friday, an e-mail contain

Next, we will apply some pre-processing to the tweets. In any natural language processing task, this is an essential step but the pre-processing you apply will depend on the nature of the text. In this example, we will lowercase the text and remove links, stop words, numbers and the words that were included in the original query (the 'query words').
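To see what these steps do to a single tweet, here is a minimal sketch applied to a shortened version of the example tweet shown earlier. The simplified link pattern, the tiny stop-word set and the names `toy_clean`, `toy_stop_words` and `toy_query_words` are assumptions made for illustration; the lab itself uses the NLTK stop-word list and the fuller regex defined in the next cell.

```python
import re

# assumptions for illustration: a simplified link pattern and a tiny hand-written stop list
SIMPLE_LINK = re.compile(r"https?://\S+")
toy_stop_words = {"in", "the", "is", "a"}
toy_query_words = {"swine", "flu", "vaccine"}

def toy_clean(text):
    text = SIMPLE_LINK.sub(" ", text)   # remove links
    text = text.lower()                 # lowercase all letters
    text = re.sub("[^a-z]", " ", text)  # replace anything that is not a letter with a space
    # split() discards the extra whitespace; then drop stop words and query words
    words = [w for w in text.split() if w not in toy_stop_words and w not in toy_query_words]
    return " ".join(words)

toy_tweet = "CDC: rare swine flu detected in 7 Americans. http://tinyurl.com/cksqp4"
toy_cleaned = toy_clean(toy_tweet)
print(toy_cleaned)
```

Only the content-bearing words survive, which is what we want the topic model to see.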

In [8]:
# function to remove links
# the pattern is a raw string so that \w and \d reach the regex engine without invalid-escape warnings
LINK_REGEX = re.compile(r'((https?):((//)|(\\))+([\w\d:#@%/;$()~_?\+-=\\.&](#!)?)*)')

def strip_links(text):
    return LINK_REGEX.sub(', ', text) # replace every link with ', '

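# define stop words and query words list to be removed
# note: the cleaning step lowercases the text and replaces digits with spaces before this filter runs,
# so "H1N1" never matches here; it becomes the tokens "h" and "n", which TfidfVectorizer's default
# token pattern (two or more characters) ignores later on
stop_words = stopwords.words('english')
query_words = ["H1N1", "swineflu", "swine", "flu", "vaccine", "vaccines", "vaccination", "vaccinations",
               "vaccinate", "vaccinates", "vaccinated"]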

In [9]:
# the same cleaning steps are applied to both milestones, so we wrap them in one function
def clean_tweet(tweet):
    tweet_no_link = strip_links(tweet) # strip off any links
    tweet_lower = tweet_no_link.lower() # lowercase all letters
    tweet_only_alpha = re.sub("[^a-zA-Z]", " ", tweet_lower) # remove all characters that are not alphabetical
    tweet_spaces_removed = re.sub(' +', ' ', tweet_only_alpha).strip() # remove extra spaces and strip any at beginning/end
    tweet_no_stop_words = [item for item in tweet_spaces_removed.split() if (item not in stop_words and item not in query_words)]
    return " ".join(tweet_no_stop_words)

cleaned_tweets1 = [clean_tweet(tweet) for tweet in tweets1_no_dups]
cleaned_tweets5 = [clean_tweet(tweet) for tweet in tweets5_no_dups]

#### 1.3 Extract the topics: Apply TfidfVectorizer & TruncatedSVD

Now that we have cleaned the tweets, we can move on to extracting the topics! To do so, we must first apply TfidfVectorizer to convert the tweets into a matrix of TF-IDF features. TF-IDF, which stands for term frequency–inverse document frequency, is a commonly used technique in natural language processing. It scores the importance of each word in a text by weighing how often the word appears in that text against how many texts in the corpus contain it, so words that appear in almost every text receive low weights.

In [10]:
# source: https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

# apply to milestone 1 

vectorized_tweets1 = []

vectorizer = TfidfVectorizer(stop_words='english', max_df = 0.5)  # ignore terms that have a document frequency higher than 0.5
vectorized_tweets1 = vectorizer.fit_transform(cleaned_tweets1) # apply to our cleaned tweets
print("Shape of Milestone 1 Document-Term Matrix:", vectorized_tweets1.shape) # check shape of the document-term matrix
terms1 = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
print("Number of terms:", len(terms1))

# apply to milestone 5

vectorized_tweets5 = []

vectorizer = TfidfVectorizer(stop_words='english', max_df = 0.5)  # ignore terms that have a document frequency higher than 0.5
vectorized_tweets5 = vectorizer.fit_transform(cleaned_tweets5) # apply to our cleaned tweets
print("Shape of Milestone 5 Document-Term Matrix:", vectorized_tweets5.shape) # check shape of the document-term matrix
terms5 = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
print("Number of terms:", len(terms5))

Shape of Milestone 1 Document-Term Matrix: (527, 1501)
Number of terms: 1501
Shape of Milestone 5 Document-Term Matrix: (462, 1086)
Number of terms: 1086




After applying TF-IDF to our tweets, we see that we get a document-term matrix for the tweets for each of our milestones. We see that the number of rows is the number of tweets and the number of columns is the number of terms. Next, we will apply TruncatedSVD to obtain our topics. In this example, we will set n_components to 5 since we want 5 topics for each milestone but this is a parameter that we will want to adjust given the task at hand.

In [11]:
terms_list1 = []

svd_model = TruncatedSVD(n_components=5, random_state=671) # n_components = # of topics
svd_model.fit(vectorized_tweets1) # fit to our vectorized tweets
    
for i, comp in enumerate(svd_model.components_): # loop through our 5 topics
    terms_comp = zip(terms1, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:20] # only take top 20 terms for each topic for simplicity
    terms_list1.append(sorted_terms)

terms_list5 = []

svd_model = TruncatedSVD(n_components=5, random_state=671) # n_components = # of topics
svd_model.fit(vectorized_tweets5) # fit to our vectorized tweets
    
for i, comp in enumerate(svd_model.components_): # loop through our 5 topics
    terms_comp = zip(terms5, comp) 
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:20] # only take top 20 terms for each topic for simplicity
    terms_list5.append(sorted_terms)

Nice! Now let's take a look at our results!

In [12]:
print("Milestone 1 Topics \n")
sub_count = 1
for topic in terms_list1:
    print("Topic "+str(sub_count)+": ")
    print([t[0] for t in topic])
    sub_count += 1
    print("-------------------------------------------------------------------------------------------------")

print("\n Milestone 5 Topics \n")
sub_count = 1
for topic in terms_list5:
    print("Topic "+str(sub_count)+": ")
    print([t[0] for t in topic])
    sub_count += 1
    print("-------------------------------------------------------------------------------------------------")

Milestone 1 Topics 

Topic 1: 
['cdc', 'readies', 'case', 'pandemic', 'time', 'closely', 'novel', 'watching', 'seasonal', 'prepares', 'pat', 'sickening', 'source', 'feedzilla', 'new', 'help', 'prevents', 'memory', 'post', 'blog']
-------------------------------------------------------------------------------------------------
Topic 2: 
['seasonal', 'help', 'health', 'officials', 'ap', 'say', 'news', 'pessimistic', 'protection', 'post', 'ingredient', 'national', 'protect', 'wants', 'pessimisti', 'seaso', 'aga', 'offer', 'protects', 'months']
-------------------------------------------------------------------------------------------------
Topic 3: 
['months', 'ingredient', 'wants', 'news', 'mexico', 'ready', 'city', 'key', 'scientists', 'away', 'hope', 'health', 'launches', 'campaign', 'stop', 'com', 'livescience', 'roche', 'massive', 'gilead']
-------------------------------------------------------------------------------------------------
Topic 4: 
['months', 'away', 'help', 'livescien

Looking at the topics, we can start to get an idea of how we could assign 'topic names' to each of these. For example, for topic 1 from milestone 1, we may label it 'National Preparation for Emerging Vaccine' given the words in the topic. 

In [13]:
### YOUR CODE: apply TruncatedSVD to the tweets with your choice of n_components (not 5!)
### how do the results differ? what do you think the 'best' choice of n_components is for this data?
svd_model = TruncatedSVD(n_components=3, random_state=671)
svd_model.fit(vectorized_tweets5)

TruncatedSVD(n_components=3, random_state=671)

From here, there are a variety of next steps and analyses that we could perform. For example, we may want to look at the similarity of topics between the different milestones, or we may want to perform additional analyses such as sentiment analysis to combine with these findings. In this lab, however, we will stop here and move on to our next example: applying SVD to a Users-Movies dataset.

### 2. Users-Movie Dataset

First, we must download the data. Navigate to https://grouplens.org/datasets/movielens/ and download the 'ml-latest-small.zip' file under MovieLens Latest Datasets. This dataset contains 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. We will focus on:
   * 'movies.csv' - contains the 'movieId', 'title' and 'genres'
   * 'ratings.csv' - contains the 'userId', 'movieId', 'rating', 'timestamp'
   
Dataset citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872 

In [14]:
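# movies data
# the files uploaded above are saved to the working directory;
# if you unzipped the archive instead, use 'ml-latest-small/movies.csv'
movies_data = pd.read_csv('movies.csv')
print(movies_data.shape)
movies_data.head()

In [None]:
# users data
users_data = pd.read_csv('ratings.csv') # or 'ml-latest-small/ratings.csv' if you unzipped the archive
print(users_data.shape)
users_data.head()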

#### 2.1 Explore the Data

In [None]:
# YOUR CODE: explore the data! some ideas for what you may want to consider include:
# (1) how many users are there?
# (2) how many movies are there with at least 1 rating? 
# (3) on average, how many movies has each user rated?
# (4) what does the distribution of ratings look like?

#### 2.2 Apply SVD to Obtain Factors

Before we can apply SVD to our data, we must transform it into an appropriate format. To do so, we will use the pivot function with index='userId' so that we have one row for each user, columns='movieId' so that we have one column for each movie, and values='rating' to fill each cell in the table. We will use 0 to indicate movies for which the user has not provided a rating. As the average user has rated 165.3 movies, we expect the majority of the values to be equal to 0.
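As a small illustration of the reshaping, here is a sketch on a made-up ratings table (the `toy_` data is an assumption for illustration, laid out like 'ratings.csv' without the timestamp column). Note that `pivot` expects each (userId, movieId) pair to appear at most once.

```python
import pandas as pd

# toy long-format ratings (hypothetical): one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5, 4.5],
})

# wide format: one row per user, one column per movie, 0 where no rating exists
toy_pivot = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(toy_pivot)

# share of cells that are 0, i.e. movies the user has not rated
toy_sparsity = (toy_pivot == 0).to_numpy().mean()
print("Share of unrated cells:", round(toy_sparsity, 2))
```

Even in this tiny table a large share of the cells are 0; in the real data the share is much higher.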

In [None]:
pivoted_users_df = users_data.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
print("Shape:", pivoted_users_df.shape) #check
pivoted_users_df.head()

As we have many movies that have only been rated by a handful of users, we are going to focus on the 500 most-rated movies. Therefore, we must find the names of the 500 most rated movies and filter to the subset.

In [None]:
# find the names of the 500 most rated movies

movieIds = pivoted_users_df.columns
movieNames = [movies_data[movies_data.movieId == id_]['title'].iloc[0] for id_ in movieIds] # extract movie titles
pivoted_users_df.columns = movieNames
counts = pivoted_users_df.apply(np.count_nonzero, axis=0) # count the number of ratings for each movie
number_ratings_df = pd.DataFrame(counts).reset_index()
number_ratings_df.columns = ['movieName', 'numberRatings']
# get the names of the 500 movies with the most ratings in the df
movieNames_500 = number_ratings_df.sort_values('numberRatings', ascending=False)[0:500].movieName
len(movieNames_500) # check

In [None]:
# get the dataframe for these 500 movies only
movies500_ratings = pivoted_users_df[list(movieNames_500)]
movies500_ratings
# we have 501 columns - looks like we may have a duplicate movie title? something to keep in mind...

Now that the data is in the proper format, we can apply SVD. We will get three items returned:

1) U:  user-to-concept similarity matrix

2) Σ (sigma): the diagonal elements representing the ‘strength’ of each concept

3) V: movie-to-concept similarity matrix

In this example, we will extract the 50 latent factors with the greatest 'strength' values. 

**Thought exercise**: what do you expect to be the shapes of U, Σ and V?
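To check your intuition on a smaller problem first, here is a minimal sketch that uses NumPy's exact SVD on a made-up 6 x 4 ratings matrix (the matrix and all names are assumptions for illustration). It keeps the top k factors and measures how much is lost by dropping the rest. For comparison with the next cell: scikit-learn's TruncatedSVD returns U·Σ from `fit_transform` and stores the transposed V in `components_`.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.integers(0, 6, size=(6, 4)).astype(float)  # 6 users x 4 movies (hypothetical)

# thin SVD: toy_matrix = U_full @ diag(sigma_full) @ Vt_full
U_full, sigma_full, Vt_full = np.linalg.svd(toy_matrix, full_matrices=False)

# keep only the k strongest concepts
k = 2
U_k, sigma_k, Vt_k = U_full[:, :k], sigma_full[:k], Vt_full[:k, :]
print("Shapes:", U_k.shape, sigma_k.shape, Vt_k.shape)

# with all factors the original matrix is recovered
reconstructed = U_full @ np.diag(sigma_full) @ Vt_full
print("Exact reconstruction:", np.allclose(reconstructed, toy_matrix))

# with k factors we get the best rank-k approximation; the error (Frobenius norm)
# equals the square root of the sum of the squared singular values that were dropped
approx = U_k @ np.diag(sigma_k) @ Vt_k
rank_k_error = np.linalg.norm(toy_matrix - approx)
print("Rank-k error:", rank_k_error)
```

The same pattern carries over to the ratings data, only with more users, more movies and k = 50.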

In [None]:
svd_model = TruncatedSVD(n_components=50, random_state=671)
users_df_svd = svd_model.fit_transform(movies500_ratings)

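U = users_df_svd / svd_model.singular_values_ # fit_transform returns U * Sigma, so divide by sigma to recover U
sigma = svd_model.singular_values_
V = svd_model.components_ # note: stored transposed (one row per concept, one column per movie)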

# checks 
print("Shape of U:", U.shape)
print("Length of Sigma:", len(sigma))
print("Shape of V:", V.shape)

In [None]:
sigma # in decreasing order

Now that we have decomposed our matrix, one question we may want to answer is: what movies are clustering together? To begin answering this, let's look at some of the latent factors for the movies and see if we can get an idea of what the factors may represent.

In [None]:
movie_concept1 = V[0] # the concept with the greatest strength, just the average (not very helpful)
movieNames = movies500_ratings.columns # get a list of the movie names
indices = sorted(range(len(movie_concept1)), key=lambda i: movie_concept1[i], reverse=True)[0:20] # get the indices of the movies with the top 20 values
names = [movieNames[ind] for ind in indices] # get the corresponding movies
movies_data[movies_data.title.isin(names)] 
# we see lots of classics/popular movies

In [None]:
movie_concept1 = V[1] # the concept with the 2nd greatest strength
movieNames = movies500_ratings.columns # get a list of the movie names
indices = sorted(range(len(movie_concept1)), key=lambda i: movie_concept1[i], reverse=True)[0:20] # get the indices of the movies with the top 20 values
names = [movieNames[ind] for ind in indices] # get the corresponding movies
movies_data[movies_data.title.isin(names)] 
# we see lots of action/adventure movies

In [None]:
### YOUR CODE: pick some other movie concepts and try to determine what they could be representing

As you can see, it is not always exactly clear what the latent factors represent but we can make some hypotheses about why certain movies are being clustered together. Next, we will see how we can use the matrix V to calculate the similarity between movies.
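Cosine similarity compares the direction of two factor vectors and ignores their length: it is the dot product divided by the product of the two norms. Here is a minimal sketch with made-up three-factor vectors (the vectors and the helper `cosine_similarity` are assumptions for illustration), checked against the SciPy function used in this lab.

```python
import numpy as np
from scipy import spatial

# hypothetical factor vectors for three movies
movie_a = np.array([0.9, 0.1, 0.3])
movie_b = np.array([0.8, 0.2, 0.4])
movie_c = np.array([-0.2, 0.9, -0.5])

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine_similarity(movie_a, movie_b)
sim_ac = cosine_similarity(movie_a, movie_c)
print("a vs. b:", sim_ab)
print("a vs. c:", sim_ac)

# SciPy returns the cosine *distance*, so similarity = 1 - distance
print("a vs. b (SciPy):", 1 - spatial.distance.cosine(movie_a, movie_b))
```

Values close to 1 mean the two movies load on the latent factors in nearly the same way, and values near 0 or below mean they do not.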

In [None]:
# transform V into a dataframe with the movie titles as the columns
movie_factors_df = pd.DataFrame(V, columns = movies500_ratings.columns)
movie_factors_df.shape #check

In [None]:
# calculate the similarity between movies using cosine similarity/distance
movie1 = movie_factors_df['Beauty and the Beast (1991)']
movie2 = movie_factors_df['Aladdin (1992)']
movie3 = movie_factors_df['Godfather, The (1972)']
result = 1 - spatial.distance.cosine(movie1, movie2)
print("Beauty and the Beast vs. Aladdin:", result)
result = 1 - spatial.distance.cosine(movie1, movie3)
print("Beauty and the Beast vs. The Godfather:", result)
result = 1 - spatial.distance.cosine(movie2, movie3)
print("Aladdin vs. The Godfather:", result)

As an example, we see that Beauty and the Beast and Aladdin are much more similar to one another than they are to a serious film like The Godfather.

In [None]:
### YOUR CODE: select a few movies of your choice and calculate the similarity between them

#### 2.3 Simple Recommendation System

Lastly, we will see how we can build a simple movie recommendation system using K-means clustering. We will apply this to the user factors matrix we got by applying SVD above. 

In [None]:
# cluster based on user factors - goal is to find similar users

# apply clustering algorithm
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=5, random_state=6).fit_predict(U) # arbitrary choice of k
clustered_users_df = pd.DataFrame(U, index = pivoted_users_df.index)
clustered_users_df['cluster'] = clusters # create new column with cluster membership
print(clustered_users_df.shape) # check
clustered_users_df.head()

In [None]:
# distribution of users among clusters
clustered_users_df.cluster.value_counts()
# see that the majority of users are in cluster 3

In [None]:
# who are the similar users?

# select a user
user = int(input("Enter a user id: ")) 

cluster_no = int(clustered_users_df.loc[user, 'cluster']) # get the user's cluster (scalar lookup; int() on a one-element Series is deprecated in recent pandas)
print("User", user, "belongs to cluster", cluster_no)
similar_users = list(clustered_users_df[clustered_users_df.cluster == cluster_no].index) # find other users in the cluster
print("Number of similar users:", len(similar_users))
print("Similar users:", similar_users)

In [None]:
# how has the cluster rated the movies in our dataset?

cluster_df = movies500_ratings.loc[similar_users] # filter to users in cluster
cluster_df = cluster_df.replace(0, np.nan) # set 0 values to NaN (to not bias calculation of means); replace() returns a new frame and avoids assigning into a slice
avg_ratings = cluster_df.mean() # get the mean rating for each movie for the cluster
avg_ratings.sample(5) # preview a sample of the mean cluster ratings

In [None]:
# what are top 10 rated movies for the cluster?

avg_ratings_df = pd.DataFrame(avg_ratings).reset_index()
avg_ratings_df.columns = ['movieName', 'avgClusterRating']
avg_ratings_df.sort_values('avgClusterRating', ascending=False).head(10)

The above list tells us the 10 movies with the highest average ratings across the cluster's users. However, when we recommend movies to our user, we want to make sure they haven't already seen them, so we must filter out the movies the user has rated from our recommendations.
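One possible way to structure this filter is sketched below on a made-up cluster of three users and four movies (the data and all `toy_` names are assumptions for illustration, and a rating of 0 is treated as "not rated", as in the rest of this lab). Try to adapt the idea to `movies500_ratings` yourself.

```python
import numpy as np
import pandas as pd

# toy cluster ratings (hypothetical): rows are users, columns are movies, 0 = not rated
toy_cluster_ratings = pd.DataFrame(
    {
        "Movie A": [5.0, 4.0, 0.0],
        "Movie B": [0.0, 3.0, 4.0],
        "Movie C": [4.5, 0.0, 5.0],
        "Movie D": [2.0, 0.0, 0.0],
    },
    index=[1, 2, 3],
)
toy_user = 2

# mean rating per movie across the cluster, ignoring unrated (0) cells
toy_avg = toy_cluster_ratings.replace(0, np.nan).mean()

# movies the selected user has already rated
already_rated = toy_cluster_ratings.columns[toy_cluster_ratings.loc[toy_user] > 0]

# drop those movies, then rank what is left by the cluster's mean rating
toy_recs = toy_avg.drop(already_rated).sort_values(ascending=False)
print(toy_recs.head(10))
```

In practice you may also want to require a minimum number of ratings within the cluster, since a mean based on a single rating is not very reliable.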

In [None]:
### YOUR CODE: what are top 10 rated movies for the cluster (that our user has not rated previously)?