## INFS 770 - Assignment 3

**Note**: Created using Anaconda Python 3.7.3 (64-bit)

---

### Pre-task setup

In [1]:
# Imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import gensim
from gensim.models import LdaModel, LsiModel
from pprint import pprint

# Sci-Py Packages
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

# nltk packages
import nltk
from nltk import word_tokenize 
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import FreqDist

In [2]:
# download nltk data first
# Note: ran once, then commented out for future notebook runs
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

### T1 - read in & print textfile data

In [3]:
# Load text data
text_file = open("reviews.txt", "r")
docs = text_file.readlines()
text_file.close()

In [4]:
# Make sure we've got data
print("Number of lines in document:", len(docs))

Number of lines in document: 235


In [5]:
# Pretty print out the data
pprint(docs)

['with all the available applications and new updates for its software , '
 'iphone is a phone you grow with , not out of .\n',
 'i upgraded from the sd550 canon there is not comparisent  .\n',
 'i have also converted my office from the pc and windows world to mac os and '
 'that is the best business decision i have made in years i have tried other '
 'phones with android , windows mobile and they all fall short  .\n',
 "i've also experienced the unit occasionally freezing or resetting itself , "
 "usually when the bluetooth connection is active and i'm receiving a phone "
 'call  .\n',
 "this is my family's seventh (at least) garmin unit , and may be our last .\n",
 'thanks canon for making my family happy and able to share our family photos '
 'again  .\n',
 'a really great camera  .\n',
 'so i knew when i is buying this unit i is not buying a unit that i would '
 'build for this price but the best available as far as it being balanced in   '
 ', efficientcy and bells and whisles  .\

### T2

In [6]:
# Pre-processing goals:
# 1 - Remove punctuations
# 2 - Remove numbers
# 3 - Do a lemmatization
# 4 - Remove stop words
# 5 - Remove words appearing in only 1 document
# 6 - Remove words appearing in >90% of documents

def before_token(documents):
    # remove punctuations
    punctuationless = list(map(lambda x: " ".join(re.findall('\\b\\w\\w+\\b',x)), documents))
    # remove numbers
    return list(map(lambda x:re.sub('\\b[0-9]+\\b', '', x), punctuationless))
docs1 = before_token(docs)


class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t,"v") for t in word_tokenize(doc)]
stopwords = nltk.corpus.stopwords.words("english") # for removal
vectorizer = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                             norm=None,
                             stop_words=stopwords,
                             max_df=0.9, # remove frequent words (> 90%)
                             min_df=1) # remove unique words

corpus_vect = vectorizer.fit_transform(docs1)
#print(corpus_vect) # sparse matrix
df_vect = pd.DataFrame(corpus_vect.toarray(), columns=vectorizer.get_feature_names())
#print(df_vect)

In [7]:
# Print the features (vectorizer.vocabulary) extracted based on TF-IDF
# examine the mapping of words to feature indexes
print("Extracted features: ")
pprint(vectorizer.vocabulary_)

Extracted features: 
{'1000fd': 0,
 '16gb': 1,
 '265wt': 2,
 '2nd': 3,
 '30k': 4,
 '32mb': 5,
 '3g': 6,
 '3gs': 7,
 '4g': 8,
 '765t': 9,
 '865t': 10,
 'aa': 11,
 'able': 12,
 'accent': 13,
 'access': 14,
 'account': 15,
 'accuracy': 16,
 'accurate': 17,
 'acquire': 18,
 'across': 19,
 'active': 20,
 'actually': 21,
 'ad': 22,
 'add': 23,
 'addition': 24,
 'additional': 25,
 'admit': 26,
 'ads': 27,
 'ago': 28,
 'agree': 29,
 'airport': 30,
 'alaska': 31,
 'allow': 32,
 'almost': 33,
 'already': 34,
 'also': 35,
 'although': 36,
 'always': 37,
 'amateur': 38,
 'amaze': 39,
 'american': 40,
 'anagram': 41,
 'android': 42,
 'angular': 43,
 'another': 44,
 'anti': 45,
 'anyhow': 46,
 'anymore': 47,
 'anyone': 48,
 'anything': 49,
 'anyway': 50,
 'apon': 51,
 'app': 52,
 'appear': 53,
 'apple': 54,
 'applications': 55,
 'approach': 56,
 'approx': 57,
 'apps': 58,
 'around': 59,
 'ask': 60,
 'aspect': 61,
 'assistance': 62,
 'attractive': 63,
 'audio': 64,
 'auto': 65,
 'autolock': 66,
 'ava

### T3

**What TF-IDF means:**

As we learned in class, TF-IDF is short for Term Frequency - Inverse Document Frequency.  The main concept behind TF-IDF is that within a specific document, the more frequently that a word occurs then the more relevant that term is to that document (that's the Term Frequency part of TF-IDF).  However, if that same term occurs across multiple different documents then it may be less relevant overall and could simply be a commonly-occuring word (that's the Inverse Document Frequency part of TF-IDF).

TF-IDF for data mining involves applying scoring to terms to weight them higher or lower in relevancy when observed within a document and also applying scoring to those terms as observed throughout multiple different documents.  Your ultimate goal with TF-IDF is to find relevant (high-scoring) words that will do a good job of summarizing an individual document.

### T4

In [8]:
# Build a topic model using Latent Dirichlet Allocation (LDA)
# Set # of topics to 3

# Convert the vectorized data to a gensim corpus object
corpus_vect1 = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer.vocabulary_.items())
#print(id2word)

# Build the lda model
lda1 = LdaModel(corpus_vect1, 
                num_topics=3,
                id2word=id2word, 
                passes=10)

# Print the topics
pprint(lda1.print_topics())

[(0,
  '0.010*"phone" + 0.009*"well" + 0.009*"new" + 0.008*"price" + 0.008*"google" '
  '+ 0.008*"unit" + 0.007*"cameras" + 0.007*"use" + 0.007*"say" + '
  '0.007*"best"'),
 (1,
  '0.013*"camera" + 0.013*"iphone" + 0.010*"big" + 0.009*"buy" + 0.009*"take" '
  '+ 0.008*"picture" + 0.008*"like" + 0.008*"great" + 0.007*"get" + '
  '0.007*"really"'),
 (2,
  '0.014*"use" + 0.013*"camera" + 0.013*"get" + 0.011*"great" + 0.010*"phone" '
  '+ 0.009*"would" + 0.009*"iphone" + 0.009*"one" + 0.008*"traffic" + '
  '0.008*"easy"')]


In [9]:
# use the lda model to transform documents
lda_docs1 = lda1[corpus_vect1]
lda_docs1 = gensim.matutils.corpus2csc(lda_docs1)
lda_docs1 = lda_docs1.T.toarray()
#for row in lda_docs1:
#    print(row)

# extract the scores and round them to 3 decimal places
scores1 = np.round([[doc for doc in row] for row in lda_docs1], 3)
#print(scores1)

In [10]:
# convert the documents scores into a data frame
df_lda1 = pd.DataFrame(scores1, columns=["topic 1", "topic 2", "topic 3"])
#df_lda1

### T5

In [11]:
# Build a topic model using Latent Dirichlet Allocation (LDA) 
# Set # of topics to 4
# Build a topic model using Latent Dirichlet Allocation (LDA)

# Convert the vectorized data to a gensim corpus object
corpus_vect2 = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer.vocabulary_.items())
#print(id2word)

# Build the lda model
lda2 = LdaModel(corpus_vect2,
                num_topics=4,
                id2word=id2word, 
                passes=10)

# Print the topics
pprint(lda2.print_topics())

[(0,
  '0.016*"iphone" + 0.014*"phone" + 0.010*"get" + 0.010*"one" + 0.010*"know" + '
  '0.009*"great" + 0.009*"camera" + 0.008*"far" + 0.008*"keep" + '
  '0.008*"garmin"'),
 (1,
  '0.021*"use" + 0.012*"picture" + 0.011*"take" + 0.010*"easy" + 0.010*"best" '
  '+ 0.009*"iphone" + 0.009*"price" + 0.008*"camera" + 0.008*"like" + '
  '0.008*"never"'),
 (2,
  '0.012*"get" + 0.012*"phone" + 0.012*"would" + 0.009*"love" + 0.008*"multi" '
  '+ 0.008*"well" + 0.008*"look" + 0.008*"go" + 0.007*"recommend" + '
  '0.007*"make"'),
 (3,
  '0.018*"camera" + 0.011*"need" + 0.011*"happy" + 0.011*"battery" + '
  '0.010*"life" + 0.010*"great" + 0.010*"like" + 0.009*"buy" + 0.009*"traffic" '
  '+ 0.009*"canon"')]


In [12]:
# use the lda model to transform documents
lda_docs2 = lda2[corpus_vect2]
lda_docs2 = gensim.matutils.corpus2csc(lda_docs2)
lda_docs2 = lda_docs2.T.toarray()
#for row in lda_docs2:
#    print(row)

# extract the scores and round them to 3 decimal places
scores2 = np.round([[doc for doc in row] for row in lda_docs2], 3)
#print(scores2)

In [13]:
# convert the documents scores into a data frame
df_lda2 = pd.DataFrame(scores2, columns=["topic 1", "topic 2", "topic 3", "topic 4"])
#df_lda2

### T6

In [14]:
# Build a topic model using Latent Dirichlet Allocation (LDA)
# Set # of topics to 5

# Convert the vectorized data to a gensim corpus object
corpus_vect3 = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer.vocabulary_.items())
#print(id2word)

# Build the lda model
lda3 = LdaModel(corpus_vect3,
                num_topics=5,
                id2word=id2word, 
                passes=10)

# Print the topics
pprint(lda3.print_topics())

[(0,
  '0.017*"take" + 0.014*"picture" + 0.013*"best" + 0.012*"camera" + '
  '0.012*"great" + 0.011*"like" + 0.011*"review" + 0.010*"buy" + 0.009*"far" + '
  '0.009*"battery"'),
 (1,
  '0.015*"easy" + 0.012*"use" + 0.010*"traffic" + 0.010*"iphone" + '
  '0.010*"phone" + 0.009*"price" + 0.009*"one" + 0.009*"poi" + 0.009*"love" + '
  '0.009*"camera"'),
 (2,
  '0.011*"garmin" + 0.009*"like" + 0.009*"also" + 0.009*"signal" + '
  '0.009*"highway" + 0.008*"camera" + 0.008*"use" + 0.008*"bluetooth" + '
  '0.008*"thing" + 0.007*"little"'),
 (3,
  '0.013*"use" + 0.013*"phone" + 0.011*"unit" + 0.010*"say" + 0.010*"nikon" + '
  '0.009*"camera" + 0.009*"recommend" + 0.008*"one" + 0.008*"better" + '
  '0.008*"quality"'),
 (4,
  '0.019*"iphone" + 0.018*"get" + 0.014*"much" + 0.013*"phone" + 0.012*"would" '
  '+ 0.010*"buy" + 0.010*"love" + 0.010*"shoot" + 0.009*"great" + '
  '0.009*"point"')]


In [15]:
# use the lda model to transform documents
lda_docs3 = lda3[corpus_vect3]
lda_docs3 = gensim.matutils.corpus2csc(lda_docs3)
lda_docs3 = lda_docs3.T.toarray()
#for row in lda_docs3:
#    print(row)

# extract the scores and round them to 3 decimal places
scores3 = np.round([[doc for doc in row] for row in lda_docs3], 3)
#print(scores3)

In [16]:
# convert the documents scores into a data frame
df_lda3 = pd.DataFrame(scores3, columns=["topic 1", "topic 2", "topic 3", "topic 4", "topic5"])
#df_lda3

### T7

Q: Among T4, T5, T6 results, which number of topics (among 3,4,5) gives you the best topic modeling results?

A: The results of T4 (3 topics) seems to give the most coherent collections of words for each topic.  

The results of having 5 topics gives several groups of words that don't give a strong understanding of what the topic actually is about, for example this group of words:

 (2,
  '0.011*"use" + 0.011*"get" + 0.011*"great" + 0.010*"take" + 0.010*"go" + '
  '0.010*"much" + 0.010*"like" + 0.009*"sometimes" + 0.008*"battery" + '
  '0.008*"image"'),
  
Having only 4 topics is a bit more coherent than 5, but still produces this word grouping that is difficult to generalize into a topic:

 (3,
  '0.013*"phone" + 0.013*"one" + 0.012*"need" + 0.011*"use" + 0.011*"iphone" + '
  '0.010*"battery" + 0.010*"apps" + 0.009*"make" + 0.009*"multi" + '
  '0.009*"take"')]

Ultimately, I think a setting of 3 topics is the best fit as it gives the most coherent word groupings as compared to the other two alternatives.

### T8

In [18]:
# Re-run the TF-IDF vectorizer, set norm to be “l1” for the vectorizer
# Set num_topics to be the optimal number of topics you found in T7 and re-run LDA
vectorizer_l1 = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                norm="l1",
                                stop_words=stopwords,
                                max_df=0.9, # remove frequent words (> 90%)
                                min_df=1) # remove unique words

corpus_vect_l1 = vectorizer_l1.fit_transform(docs1)
#print(corpus_vect_l1) # sparse matrix

In [19]:
# Build a topic model using Latent Dirichlet Allocation (LDA)
# Set # of topics to 3

# Convert the vectorized data to a gensim corpus object
corpus_vect_l1 = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer_l1.vocabulary_.items())
#print(id2word)

# Build the lda model
lda_l1 = LdaModel(corpus_vect_l1,
                  num_topics=3,
                  id2word=id2word,
                  passes=10)

# Print the topics
pprint(lda_l1.print_topics())

[(0,
  '0.013*"camera" + 0.013*"phone" + 0.012*"one" + 0.011*"think" + 0.010*"buy" '
  '+ 0.009*"like" + 0.008*"get" + 0.008*"go" + 0.008*"happy" + 0.007*"shoot"'),
 (1,
  '0.012*"great" + 0.012*"camera" + 0.010*"iphone" + 0.010*"well" + '
  '0.009*"unit" + 0.009*"also" + 0.009*"price" + 0.008*"use" + 0.007*"phone" + '
  '0.007*"traffic"'),
 (2,
  '0.013*"iphone" + 0.012*"picture" + 0.012*"get" + 0.011*"use" + 0.010*"take" '
  '+ 0.009*"good" + 0.008*"screen" + 0.008*"phone" + 0.008*"easy" + '
  '0.008*"rout"')]


In [20]:
# use the lda model to transform documents
lda_docs_l1 = lda_l1[corpus_vect_l1]
lda_docs_l1 = gensim.matutils.corpus2csc(lda_docs_l1)
lda_docs_l1 = lda_docs_l1.T.toarray()
#for row in lda_docs_l1:
#    print(row)

# extract the scores and round them to 3 decimal places
scores_l1 = np.round([[doc for doc in row] for row in lda_docs_l1], 3)
#print(scores_l1)

In [21]:
# convert the documents scores into a data frame
df_lda_l1 = pd.DataFrame(scores_l1, columns=["topic 1", "topic 2", "topic 3"])
#df_lda_l1

### T9

In [23]:
# Re-run the TF-IDF vectorizer, set norm to be “l2” for the vectorizer
# Set num_topics to be the optimal number of topics you found in T7 and re-run LDA
vectorizer_l2 = TfidfVectorizer(tokenizer=LemmaTokenizer(),
                                norm="l2",
                                stop_words=stopwords,
                                max_df=0.9, # remove frequent words (> 90%)
                                min_df=1) # remove unique words

corpus_vect_l2 = vectorizer_l2.fit_transform(docs1)
#print(corpus_vect_l2) # sparse matrix

In [24]:
# Build a topic model using Latent Dirichlet Allocation (LDA)
# Set # of topics to 3

# Convert the vectorized data to a gensim corpus object
corpus_vect_l2 = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
id2word = dict((v,k) for k,v in vectorizer_l2.vocabulary_.items())
#print(id2word)

# Build the lda model
lda_l2 = LdaModel(corpus_vect_l2,
                  num_topics=3,
                  id2word=id2word,
                  passes=10)

# Print the topics
pprint(lda_l2.print_topics())

[(0,
  '0.020*"iphone" + 0.015*"get" + 0.012*"phone" + 0.012*"one" + 0.011*"use" + '
  '0.008*"love" + 0.008*"would" + 0.008*"well" + 0.008*"go" + 0.008*"like"'),
 (1,
  '0.016*"great" + 0.012*"camera" + 0.010*"use" + 0.010*"take" + '
  '0.009*"picture" + 0.008*"phone" + 0.008*"easy" + 0.008*"best" + '
  '0.008*"traffic" + 0.008*"anyone"'),
 (2,
  '0.014*"camera" + 0.010*"price" + 0.009*"screen" + 0.009*"like" + '
  '0.008*"battery" + 0.007*"picture" + 0.007*"cameras" + 0.007*"video" + '
  '0.007*"shoot" + 0.007*"also"')]


In [25]:
# use the lda model to transform documents
lda_docs_l2 = lda_l2[corpus_vect_l2]
lda_docs_l2 = gensim.matutils.corpus2csc(lda_docs_l2)
lda_docs_l2 = lda_docs_l2.T.toarray()
#for row in lda_docs_l2:
#    print(row)

# extract the scores and round them to 3 decimal places
scores_l2 = np.round([[doc for doc in row] for row in lda_docs_l2], 3)
#print(scores_l2)

In [26]:
# convert the documents scores into a data frame
df_lda_l2 = pd.DataFrame(scores_l2, columns=["topic 1", "topic 2", "topic 3"])
#df_lda_l2

### T10

***Q:*** Compare the topics you obtained with norm set to be 1) None, 2) “l1” or 3) “l2” (num_topics is always the optimal number of topics you identified in T7). Please tell me which norm gives you the best topics, and why.

***A:*** The norm set to "None" seems to give me the best topics, alongside the optimal number of topics that I chose in T7 (3 topics).  Of the 3 norm options, the "None" norm results seem to have the best distinction of topics.  

For the "None" results, the first topic has nouns such as "camera" and "picture".  The second topic has nouns such as "price" and "screen".  The third sopic contains nouns such as "unit" and "traffic".  These all seem to be distinct from one another and could each be attributable back to a discrete topic

The "l1" norm results has nouns such as "camera" that appear in more than one topic.  The "l2" norm results are short on nouns, and contain lots of adjectives or adverbs, and are tough to make a distinction on what the topic may be about. 

Ultimately, because of the better clarity and distinction, I think the results from the "None" group are the best and make the most sense.

### T11

In [27]:
# Write code to show the topic/document table of LDA with the best number of topics 
# using the vectorized dataset obtained with the best norm

# convert the documents scores into a data frame
df_lda1 = pd.DataFrame(scores1, columns=["topic 1", "topic 2", "topic 3"])
df_lda1

Unnamed: 0,topic 1,topic 2,topic 3
0,0.981,0.010,0.000
1,0.967,0.017,0.017
2,0.993,0.000,0.000
3,0.000,0.989,0.000
4,0.000,0.011,0.980
5,0.000,0.000,0.984
6,0.028,0.939,0.032
7,0.990,0.000,0.000
8,0.000,0.000,0.991
9,0.000,0.982,0.000


### T12

In [28]:
# Run truncated SVD(set n_components to be the optimal number of topics you 
# found in T7) and print topics and the topic/document table

# svd
from scipy.linalg import svd
U, s, V = svd(corpus_vect.toarray())
print(s)

[4.10077832e+01 3.55933224e+01 3.36799870e+01 3.22005039e+01
 3.15007989e+01 3.11632193e+01 2.93039921e+01 2.83331740e+01
 2.75286309e+01 2.68950262e+01 2.62036735e+01 2.58306682e+01
 2.55997184e+01 2.52955123e+01 2.48936170e+01 2.44516819e+01
 2.44110726e+01 2.40673574e+01 2.37539880e+01 2.35996448e+01
 2.34361968e+01 2.32749803e+01 2.31963571e+01 2.28439412e+01
 2.26515552e+01 2.25645072e+01 2.23013228e+01 2.21720949e+01
 2.19530625e+01 2.16887126e+01 2.14753621e+01 2.13724636e+01
 2.11680580e+01 2.06476927e+01 2.05842165e+01 2.05022507e+01
 2.03350047e+01 2.01154707e+01 1.99643197e+01 1.97357820e+01
 1.95752090e+01 1.94201477e+01 1.91649182e+01 1.90040013e+01
 1.89117954e+01 1.88458830e+01 1.85883276e+01 1.85450530e+01
 1.85073995e+01 1.83984981e+01 1.80978209e+01 1.79646818e+01
 1.78760167e+01 1.78138693e+01 1.76680659e+01 1.75717761e+01
 1.74833874e+01 1.73106963e+01 1.72175664e+01 1.71812278e+01
 1.70603817e+01 1.69438627e+01 1.68181437e+01 1.67746209e+01
 1.66743849e+01 1.660069

In [29]:
from sklearn.decomposition import TruncatedSVD
tsvd = TruncatedSVD(n_components=3)
tsvd.fit(corpus_vect)

print(np.round(tsvd.transform(corpus_vect), 3))
print(tsvd.singular_values_)

[[ 1.1460e+00 -1.2330e+00 -7.8000e-02]
 [ 1.5300e-01 -1.1000e-02 -1.2000e-02]
 [ 3.2300e+00 -2.8300e+00 -5.5000e-02]
 [ 1.6670e+00 -1.5340e+00  7.2600e-01]
 [ 9.9400e-01 -6.2900e-01 -4.9900e-01]
 [ 9.4100e-01 -4.5800e-01  6.1000e-02]
 [ 1.0000e+00  3.1000e-01  1.4100e-01]
 [ 3.9130e+00 -2.6780e+00 -1.0600e+00]
 [ 3.5790e+00 -1.8170e+00 -7.7000e-02]
 [ 1.9020e+00 -1.2860e+00 -3.0000e-01]
 [ 1.3250e+00 -1.6500e-01  6.5000e-02]
 [ 4.7500e-01 -2.9000e-02 -7.3000e-02]
 [ 3.4410e+00 -2.6690e+00  3.8800e-01]
 [ 3.5390e+00  3.5500e-01 -3.2500e-01]
 [ 2.1000e-02  4.0000e-03  1.0000e-03]
 [ 6.5400e-01 -6.0600e-01  1.0000e-02]
 [ 6.1100e-01 -6.9000e-02 -1.1500e-01]
 [ 1.0040e+00 -8.1800e-01  9.1000e-02]
 [ 1.6500e+00  5.3600e-01  9.0000e-03]
 [ 5.4060e+00 -6.7480e+00 -6.5500e-01]
 [ 1.9200e-01 -8.2000e-02 -2.1000e-02]
 [ 2.6900e-01 -4.5000e-02 -2.1000e-02]
 [ 9.0600e-01 -2.5400e-01 -2.8900e-01]
 [ 5.1000e-02  4.0000e-03  2.9000e-02]
 [ 1.8951e+01  2.6528e+01  1.8990e+00]
 [ 1.3400e+00 -1.4840e+00

In [30]:
df_comp = pd.DataFrame(tsvd.components_, columns=vectorizer.get_feature_names())
df_comp = df_comp.apply(lambda x: np.round(x,3))
print(df_comp)

   1000fd   16gb  265wt    2nd    30k   32mb     3g    3gs     4g   765t  ...  \
0   0.003  0.003  0.003  0.045  0.003  0.005  0.020  0.003  0.025  0.007  ...   
1  -0.000 -0.005 -0.001 -0.053 -0.004  0.003 -0.023 -0.005 -0.016  0.002  ...   
2  -0.000  0.000 -0.001 -0.045 -0.001 -0.001  0.162  0.000 -0.002  0.004  ...   

   world  worry  would  write   yeah   year  years    yes    yet  youtube  
0  0.040  0.004  0.140    0.0  0.009  0.028  0.053  0.028  0.021    0.009  
1 -0.053 -0.000 -0.170    0.0 -0.010 -0.029 -0.017 -0.036 -0.007   -0.003  
2  0.002  0.002 -0.036    0.0 -0.006  0.002 -0.013 -0.007  0.014    0.011  

[3 rows x 902 columns]


In [31]:
# using the LsiModel class in gensim
lsi = LsiModel(corpus=corpus_vect1, 
               id2word=id2word, 
               num_topics=3)

In [32]:
# Print a list of topics
pprint(lsi.print_topics())

[(0,
  '0.255*"like" + 0.196*"camera" + 0.184*"screen" + 0.180*"picture" + '
  '0.157*"way" + 0.154*"use" + 0.146*"get" + 0.143*"take" + 0.141*"would" + '
  '0.138*"quality"'),
 (1,
  '-0.205*"screen" + -0.200*"picture" + 0.190*"way" + 0.170*"would" + '
  '0.164*"phone" + 0.154*"highway" + -0.122*"note" + -0.121*"file" + '
  '-0.121*"garage" + -0.121*"dead"'),
 (2,
  '0.312*"signal" + 0.163*"3g" + 0.160*"customizable" + 0.160*"wi" + '
  '0.160*"theme" + 0.160*"generally" + 0.160*"autolock" + 0.160*"settings" + '
  '0.160*"ssh" + 0.160*"strong"')]


In [33]:
# use the lsi model to transform documents
lsi_docs = lsi[corpus_vect1]
lsi_docs = gensim.matutils.corpus2csc(lsi_docs)
lsi_docs = lsi_docs.T.toarray()
#for row in lsi_docs:
#    print(row)

# extract the scores and round them to 3 decimal places
scores_lsi = np.round([[doc for doc in row] for row in lsi_docs], 3)
#print(scores_lsi)

In [34]:
# Print scores
# convert the documents scores into a data frame
df_lsi = pd.DataFrame(scores_lsi, columns=["topic 1", "topic 2", "topic 3"])
df_lsi

Unnamed: 0,topic 1,topic 2,topic 3
0,1.147,1.268,-0.023
1,0.149,0.018,-0.011
2,3.211,2.635,-0.363
3,1.682,1.533,0.921
4,1.000,0.638,-0.375
5,0.937,0.471,-0.035
6,0.996,-0.318,0.030
7,3.924,2.678,-0.808
8,3.577,1.878,0.046
9,1.893,1.313,-0.283
