# Chapter 2: Latent Semantic Analysis (LSA)

## Instructions

- Run the cells with "assert" statements to check your work. If a cell runs without error, your answer matches the expected output. If it doesn't match, the assertion message gives you a hint.

In this notebook, you'll get to practice topic modeling with LSA.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.decomposition import TruncatedSVD

Remember, a corpus is a collection of documents. Here, we have 7 documents. Scanning the text, what topics do you see? How many topics do you see? Keep those answers in mind as we move forward.

In [2]:
corpus = ['Football baseball basketball',
          'baseball giants cubs redsox',
          'football broncos cowboys',
          'a baseball redsox tigers',
          'the pop stars hendrix prince',
          'hendrix prince jagger rock',
          'joplin pearl jam tupac rock']

Before topic modeling, the first step is to get your text data in a format that is ready for modeling. Use `CountVectorizer` with English stop words to turn the corpus into a document-term matrix. Save the matrix as `doc_term` and print out the shape of the matrix.

In [3]:
### BEGIN SOLUTION
vectorizer = CountVectorizer(stop_words='english')
doc_term = vectorizer.fit_transform(corpus)
### END SOLUTION
doc_term.shape

(7, 19)

In [4]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert doc_term.shape == (7, 19), "The doc_term matrix should have 7 documents and 19 terms."
### END HIDDEN TESTS
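
To see why the matrix ends up with 19 columns, it can help to compare the vocabulary with and without stop-word removal. The snippet below is a minimal sketch, not part of the exercise; the names `vocab_default` and `vocab_filtered` are made up for illustration. It relies on two `CountVectorizer` defaults: text is lowercased, and tokens shorter than two characters (such as "a") are ignored by the default token pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Football baseball basketball',
          'baseball giants cubs redsox',
          'football broncos cowboys',
          'a baseball redsox tigers',
          'the pop stars hendrix prince',
          'hendrix prince jagger rock',
          'joplin pearl jam tupac rock']

# Vocabulary learned with default settings (no stop-word list)
vocab_default = set(CountVectorizer().fit(corpus).vocabulary_)

# Vocabulary learned with the built-in English stop-word list
vocab_filtered = set(CountVectorizer(stop_words='english').fit(corpus).vocabulary_)

# Terms that the stop-word list removed
removed_by_stop_words = vocab_default - vocab_filtered
print(len(vocab_default), len(vocab_filtered), removed_by_stop_words)
```

In this corpus the stop-word list only removes "the". "Football" and "football" collapse into a single lowercase term, which is why they share one column.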

Turn the `doc_term` matrix into a dataframe. Modify the code below to do so and save the output as `doc_term_df`.

```
pd.DataFrame(INSERT_VALUE_HERE.toarray(), index=corpus, columns=vectorizer.get_feature_names_out())
```

In [5]:
### BEGIN SOLUTION
doc_term_df = pd.DataFrame(doc_term.toarray(), index=corpus, columns=vectorizer.get_feature_names_out())
### END SOLUTION
doc_term_df

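| | baseball | basketball | broncos | cowboys | cubs | football | giants | hendrix | jagger | jam | joplin | pearl | pop | prince | redsox | rock | stars | tigers | tupac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Football baseball basketball | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| baseball giants cubs redsox | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| football broncos cowboys | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a baseball redsox tigers | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| the pop stars hendrix prince | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| hendrix prince jagger rock | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| joplin pearl jam tupac rock | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |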


In [6]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(doc_term_df) == pd.DataFrame, "The output should be a dataframe."
assert doc_term_df.shape == (7, 19), "The output should have 7 documents and 19 terms."
### END HIDDEN TESTS

Fit an LSA model using `TruncatedSVD` with two components, or in other words, two topics. Save the fitted model as `lsa`.

In [7]:
### BEGIN SOLUTION
lsa = TruncatedSVD(n_components=2)
lsa.fit(doc_term)
### END SOLUTION
lsa

TruncatedSVD()

In [8]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert lsa.explained_variance_ratio_[0].round(3) == 0.185, "The output should be a fitted TruncatedSVD model with two components."
### END HIDDEN TESTS
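
LSA is a truncated singular value decomposition of the document-term matrix, so the fitted model can be compared against a full SVD from NumPy. The following is a rough sketch under a few assumptions: `random_state=0` is added only for reproducibility, the names `same_singular_values` and `same_components` are illustrative, and absolute values are compared because singular vectors are only defined up to sign.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Football baseball basketball',
          'baseball giants cubs redsox',
          'football broncos cowboys',
          'a baseball redsox tigers',
          'the pop stars hendrix prince',
          'hendrix prince jagger rock',
          'joplin pearl jam tupac rock']

doc_term = CountVectorizer(stop_words='english').fit_transform(corpus)

# random_state is an assumption added here so repeated runs agree
lsa = TruncatedSVD(n_components=2, random_state=0).fit(doc_term)

# Full SVD of the dense matrix: doc_term = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(doc_term.toarray().astype(float), full_matrices=False)

# The two largest singular values should match those kept by TruncatedSVD
same_singular_values = np.allclose(lsa.singular_values_, s[:2], atol=1e-6)

# The first two rows of Vt should match lsa.components_ up to sign
same_components = np.allclose(np.abs(lsa.components_), np.abs(Vt[:2]), atol=1e-6)

print(same_singular_values, same_components)
```

Keeping only the first two rows of `Vt` is what "two topics" means here: each row is a direction in the 19-dimensional term space.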

The LSA model reduced the 19 terms into 2 topics. Save the topic-term matrix as `topic_term`.

In [9]:
### BEGIN SOLUTION
topic_term = lsa.components_.round(3)
### END SOLUTION
topic_term

array([[ 0.   , -0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,  0.488,
         0.266,  0.191,  0.191,  0.191,  0.222,  0.488,  0.   ,  0.457,
         0.222,  0.   ,  0.191],
       [ 0.675,  0.172,  0.053,  0.053,  0.278,  0.225,  0.278, -0.   ,
         0.   , -0.   , -0.   , -0.   , -0.   , -0.   ,  0.503, -0.   ,
        -0.   ,  0.225, -0.   ]])

In [10]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert topic_term.shape == (2, 19), "The topic_term matrix should have 2 components and 19 terms."
### END HIDDEN TESTS

Turn the `topic_term` matrix into a dataframe. Modify the code below to do so and save the output as `topic_term_df`.

```
pd.DataFrame(INSERT_VALUE_HERE.round(3), index = ["component_1", "component_2"], columns = vectorizer.get_feature_names_out())
```

In [11]:
### BEGIN SOLUTION
topic_term_df = pd.DataFrame(topic_term.round(3),
                index = ["component_1", "component_2"],
                columns = vectorizer.get_feature_names_out())
### END SOLUTION
topic_term_df

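| | baseball | basketball | broncos | cowboys | cubs | football | giants | hendrix | jagger | jam | joplin | pearl | pop | prince | redsox | rock | stars | tigers | tupac |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| component_1 | 0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 0.488 | 0.266 | 0.191 | 0.191 | 0.191 | 0.222 | 0.488 | 0.0 | 0.457 | 0.222 | 0.0 | 0.191 |
| component_2 | 0.675 | 0.172 | 0.053 | 0.053 | 0.278 | 0.225 | 0.278 | -0.0 | 0.0 | -0.0 | -0.0 | -0.0 | -0.0 | -0.0 | 0.503 | -0.0 | -0.0 | 0.225 | -0.0 |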


In [12]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(topic_term_df) == pd.DataFrame, "The output should be a dataframe."
assert topic_term_df.shape == (2, 19), "The output should have 2 topics and 19 terms."
### END HIDDEN TESTS

While you can scan the output to determine the top words in each topic, the function below displays the top terms in each topic in a format that is easier to read.

Apply the function to find the top 5 terms in each of the 2 topics. Save the results of the function as `output`.

In [13]:
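# Function to display the top n terms in each topic
def display_topics(model, feature_names, no_top_words, topic_names = None):
    # Each row of model.components_ holds one topic's weights over all terms
    for ix, topic in enumerate(model.components_):
        # Fall back to a numbered heading when no name is given for this topic
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix + 1)
        else:
            print("\nTopic: ", topic_names[ix])
        # argsort is ascending, so the reversed slice picks the
        # no_top_words largest weights, highest first
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
    print("\n")
    # Return the inputs so the check cell can verify the arguments used
    return model, feature_names, no_top_words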
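
The slice `topic.argsort()[:-no_top_words - 1:-1]` is the least obvious part of the function. Here is a small sketch of what it does, using made-up weights and terms (the values in `weights` and `terms` are illustrative and do not come from the fitted model):

```python
import numpy as np

# Hypothetical topic weights and their matching terms
weights = np.array([0.1, 0.7, 0.3, 0.5])
terms = ['jam', 'rock', 'pop', 'prince']
n = 2

# argsort returns indices from smallest to largest weight;
# the reversed slice walks back from the end and keeps the last n
top_idx = weights.argsort()[:-n - 1:-1]
top_terms = [terms[i] for i in top_idx]

# An equivalent, arguably more readable form (when there are no ties):
# sort the negated weights and take the first n indices
top_idx_alt = np.argsort(-weights)[:n]

print(top_terms)
```

Both forms return the indices of the `n` largest weights in descending order of weight.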

In [14]:
### BEGIN SOLUTION
output = display_topics(lsa, vectorizer.get_feature_names_out(), 5)
### END SOLUTION
output;


Topic  1
hendrix, prince, rock, jagger, stars

Topic  2
baseball, redsox, cubs, giants, football




In [15]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert output[0].n_components == 2, "The model argument should be set to lsa."
assert len(output[1]) == 19, "The feature_names argument should be set to vectorizer.get_feature_names_out()."
assert output[2] == 5, "The no_top_words argument should be set to 5."
### END HIDDEN TESTS

Now this is the fun part. Take a look at the top words in the two topics, and using your human brain, name them.

The first one looks like it's about music and the second one is about sports.

In [16]:
display_topics(lsa, vectorizer.get_feature_names_out(), 5, ['Music', 'Sports']);


Topic:  Music
hendrix, prince, rock, jagger, stars

Topic:  Sports
baseball, redsox, cubs, giants, football




In this last step, your task is to figure out which topics are in each document. Transform the original `doc_term` matrix into a document-topic matrix and save it as `doc_topic`.

In [17]:
### BEGIN SOLUTION
doc_topic = lsa.transform(doc_term)
### END SOLUTION
doc_topic.shape

(7, 2)

In [18]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert doc_topic.shape == (7, 2), "The doc_topic matrix should have 7 documents and 2 topics."
### END HIDDEN TESTS
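
Under the hood, transforming documents into topic space is a matrix product: each document's term counts are projected onto the topic directions stored in `components_`. The sketch below checks this on the same corpus. `random_state=0` and the variable names `manual` and `matches` are assumptions added for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Football baseball basketball',
          'baseball giants cubs redsox',
          'football broncos cowboys',
          'a baseball redsox tigers',
          'the pop stars hendrix prince',
          'hendrix prince jagger rock',
          'joplin pearl jam tupac rock']

doc_term = CountVectorizer(stop_words='english').fit_transform(corpus)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(doc_term)

# Document-topic matrix from the fitted model: shape (7, 2)
doc_topic = lsa.transform(doc_term)

# The same projection written out by hand: (7 x 19) @ (19 x 2)
manual = doc_term.toarray() @ lsa.components_.T

matches = np.allclose(doc_topic, manual)
print(matches)
```

Reading the shapes left to right, 7 documents by 19 terms times 19 terms by 2 topics gives 7 documents by 2 topics.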

Turn the `doc_topic` matrix into a dataframe. Modify the code below to do so and save the output as `doc_topic_df`.

```
pd.DataFrame(INSERT_VALUE_HERE.round(5), index = corpus, columns = ["music", "sports"])
```

In [19]:
### BEGIN SOLUTION
doc_topic_df = pd.DataFrame(doc_topic.round(5), index=corpus, columns=["music", "sports"])
### END SOLUTION
doc_topic_df

| | music | sports |
|---|---|---|
| Football baseball basketball | 0.0 | 1.07195 |
| baseball giants cubs redsox | -0.0 | 1.73445 |
| football broncos cowboys | -0.0 | 0.33125 |
| a baseball redsox tigers | 0.0 | 1.4032 |
| the pop stars hendrix prince | 1.42034 | -0.0 |
| hendrix prince jagger rock | 1.69829 | -0.0 |
| joplin pearl jam tupac rock | 1.22058 | -0.0 |


In [20]:
### CHECK YOUR OUTPUT WITH THE ANSWER

### BEGIN HIDDEN TESTS
assert type(doc_topic_df) == pd.DataFrame, "The output should be a dataframe."
assert doc_topic_df.shape == (7, 2), "The output should have 7 documents and 2 topics."
### END HIDDEN TESTS

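Each of the 7 documents, previously represented by 19 term counts, is now represented by just 2 topic scores. The first four documents score only on the sports topic and the last three only on the music topic. Notice that "football broncos cowboys" has the lowest sports score (0.33125): the only term it shares with another document is "football".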
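
Rather than reading the table by eye, the dominant topic for each document can be pulled out programmatically. This is one possible sketch, not part of the exercise. It assumes `random_state=0` for reproducibility, takes absolute values because the sign of an SVD component is arbitrary, and the name `dominant_topic` is illustrative. It also assumes the component order matches the labels chosen earlier in the notebook.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Football baseball basketball',
          'baseball giants cubs redsox',
          'football broncos cowboys',
          'a baseball redsox tigers',
          'the pop stars hendrix prince',
          'hendrix prince jagger rock',
          'joplin pearl jam tupac rock']

doc_term = CountVectorizer(stop_words='english').fit_transform(corpus)

# random_state is an assumption added here so repeated runs agree
doc_topic = TruncatedSVD(n_components=2, random_state=0).fit_transform(doc_term)

# Column labels follow the names chosen by inspecting the top terms
doc_topic_df = pd.DataFrame(doc_topic, index=corpus, columns=["music", "sports"])

# For each document, pick the topic with the largest absolute score
dominant_topic = doc_topic_df.abs().idxmax(axis=1)
print(dominant_topic)
```

With only two well-separated topics this is straightforward. On real corpora, documents often mix several topics, so a single label per document is a simplification.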