# Topic Modeling Assessment Project

#### Task: Import pandas and read in the amazonreviews.tsv file.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('amazonreviews.tsv', sep='\t')
df.head()

Unnamed: 0,label,review
0,pos,Stuning even for the non-gamer: This sound tra...
1,pos,The best soundtrack ever to anything.: I'm rea...
2,pos,Amazing!: This soundtrack is my favorite music...
3,pos,Excellent Soundtrack: I truly like this soundt...
4,pos,"Remember, Pull Your Jaw Off The Floor After He..."


In [3]:
df.drop('label', inplace=True, axis=1)
df.head()

Unnamed: 0,review
0,Stuning even for the non-gamer: This sound tra...
1,The best soundtrack ever to anything.: I'm rea...
2,Amazing!: This soundtrack is my favorite music...
3,Excellent Soundtrack: I truly like this soundt...
4,"Remember, Pull Your Jaw Off The Floor After He..."


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  10000 non-null  object
dtypes: object(1)
memory usage: 78.2+ KB


In [5]:
# REMOVE NaN VALUES (dropna() does not remove empty or whitespace-only strings):
df.dropna(inplace=True)
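
The cell above only drops NaN rows. If blank reviews also need to go, a second filter is required. Below is a minimal sketch on a made-up frame (`toy` and its contents are assumptions for illustration, not part of amazonreviews.tsv):

```python
import pandas as pd

# Hypothetical toy data: one missing value, one empty string, one whitespace-only string
toy = pd.DataFrame({"review": ["Great album", None, "", "   ", "Nice book"]})

# Step 1: drop rows where the review is NaN/None
toy = toy.dropna(subset=["review"])

# Step 2: drop rows whose review is empty once surrounding whitespace is stripped
is_blank = toy["review"].str.strip() == ""
toy = toy[~is_blank].reset_index(drop=True)

print(toy)
```

In this notebook `df.info()` already reports 10000 non-null rows, so the `dropna` call is a no-op here; whether any blank strings exist in the real file is not shown.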


In [6]:
df.head()

Unnamed: 0,review
0,Stuning even for the non-gamer: This sound tra...
1,The best soundtrack ever to anything.: I'm rea...
2,Amazing!: This soundtrack is my favorite music...
3,Excellent Soundtrack: I truly like this soundt...
4,"Remember, Pull Your Jaw Off The Floor After He..."


# Preprocessing

#### Task: Use TF-IDF Vectorization to create a vectorized document term matrix. You may want to explore the max_df and min_df parameters.

max_df: float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

min_df: float in range [0.0, 1.0] or int, default=1
When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
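
To make the two thresholds concrete, here is a small sketch on a made-up four-document corpus (the documents are an assumption for illustration). It uses the same `max_df=0.95, min_df=2` settings as this notebook, without the stop-word list so the effect of the thresholds is isolated:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: 'great' is in every document, 'sound' is in two,
# and every other word appears in exactly one document.
docs = [
    "great album great sound",
    "great movie bad acting",
    "great book good story",
    "great game fun graphics sound",
]

# No pruning: every token of two or more characters enters the vocabulary
full = TfidfVectorizer().fit(docs)

# max_df=0.95 drops terms found in more than 95% of documents ('great': 4 of 4)
# min_df=2 drops terms found in fewer than 2 documents (all the one-off words)
pruned = TfidfVectorizer(max_df=0.95, min_df=2).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(pruned.vocabulary_), sorted(pruned.vocabulary_))
```

Only `sound` survives both filters here, which shows how aggressive the combination can be on a tiny corpus; on 10,000 reviews the same settings keep 15,302 terms.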

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
tfidf = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')

In [9]:
dtm = tfidf.fit_transform(df['review'])

In [10]:
dtm

<10000x15302 sparse matrix of type '<class 'numpy.float64'>'
	with 298224 stored elements in Compressed Sparse Row format>
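
As a quick sanity check on the matrix summary above, the numbers it reports can be turned into a density figure. This is plain arithmetic on the printed values, nothing more:

```python
# Values copied from the sparse-matrix summary printed above
n_docs, n_terms = 10000, 15302
stored = 298224

# Fraction of cells that hold a stored value
density = stored / (n_docs * n_terms)

# Average number of stored (distinct) vocabulary terms per review
avg_terms_per_review = stored / n_docs

print(f"density: {density:.4%}")
print(f"average stored terms per review: {avg_terms_per_review:.1f}")
```

Roughly 0.2% of the cells are filled, or about 30 distinct vocabulary terms per review, which is why a sparse format is used.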

# LDA

In [11]:
from sklearn.decomposition import LatentDirichletAllocation

In [12]:
LDA = LatentDirichletAllocation(n_components=20,random_state=42)

In [13]:
# This can take a while; we're dealing with a large number of documents!
LDA.fit(dtm)

LatentDirichletAllocation(n_components=20, random_state=42)
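
A side note, offered as a suggestion rather than a correction: LDA is usually described as a model of word counts, and it is often paired with `CountVectorizer` rather than TF-IDF weights. The task here asks for TF-IDF, so the notebook keeps it, but a count-based variant might be worth comparing. A minimal sketch on a made-up corpus (`toy_reviews` is an assumption for illustration):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews, not taken from amazonreviews.tsv
toy_reviews = [
    "great album with wonderful songs and music",
    "the music on this album is great",
    "this book is a wonderful read with a great story",
    "the author wrote an interesting book and story",
]

# Raw term counts instead of TF-IDF weights
cv = CountVectorizer(stop_words="english")
counts = cv.fit_transform(toy_reviews)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topics = toy_lda.fit_transform(counts)

# Each row is a distribution over topics for one document
print(doc_topics.round(2))
```

Whether counts give cleaner topics than TF-IDF on the Amazon reviews is not something this notebook tests.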

In [14]:
len(LDA.components_)

20

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [15]:
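# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort() is ascending, so the last 15 indices are the highest-weighted terms
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')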

THE TOP 15 WORDS FOR TOPIC #0
['shorted', 'fitness', 'auto', 'disturbed', 'sizes', 'lean', 'stockings', 'recipes', 'glider', 'chocolate', 'network', 'bar', 'scarlett', 'haiku', 'diabetes']


THE TOP 15 WORDS FOR TOPIC #1
['size', 'used', 'works', 'got', 'price', 'bought', 'amazon', 'just', 'work', 'good', 'item', 'buy', 'use', 'great', 'product']


THE TOP 15 WORDS FOR TOPIC #2
['nightwish', 'bedroom', 'footnotes', 'pam', 'design', 'arthritis', 'voodoo', 'lapinator', 'cup', 'enlightening', 'cups', 'crawford', 'tea', 'sandler', 'adam']


THE TOP 15 WORDS FOR TOPIC #3
['pour', 'pas', 'massager', 'eargel', 'loops', 'ballet', 'sf', 'le', 'infrared', 'dragged', 'expired', 'replacing', 'filter', 'coffee', 'explosions']


THE TOP 15 WORDS FOR TOPIC #4
['time', 'don', 'songs', 'album', 'movies', 'quality', 'just', 'film', 'like', 'bad', 'cd', 'great', 'good', 'dvd', 'movie']


THE TOP 15 WORDS FOR TOPIC #5
['vegan', 'vimes', 'wolverine', 'gun', 'court', 'recommendable', 'miles', 'angel', 'heal
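
Note that the lists above are printed weakest-to-strongest, because `argsort()` sorts ascending and `[-15:]` keeps that order. A small NumPy sketch of the indexing, with a made-up five-word vocabulary and weights (both are assumptions for illustration):

```python
import numpy as np

# Hypothetical vocabulary and one topic's weights over it
vocab = np.array(["album", "book", "game", "movie", "music"])
topic_weights = np.array([0.9, 0.1, 0.3, 0.2, 1.5])

# Same pattern as the notebook: last k indices of an ascending argsort
top3_ascending = vocab[topic_weights.argsort()[-3:]]

# Reversing first puts the strongest term at the front
top3_descending = vocab[topic_weights.argsort()[::-1][:3]]

print(list(top3_ascending))
print(list(top3_descending))
```

Both selections contain the same terms; only the order differs, so the last word in each printed topic list is the most heavily weighted one.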

#### Attaching Discovered Topic Labels to the Original Reviews

In [16]:
dtm

<10000x15302 sparse matrix of type '<class 'numpy.float64'>'
	with 298224 stored elements in Compressed Sparse Row format>

In [17]:
dtm.shape

(10000, 15302)

In [18]:
len(df)

10000

In [19]:
topic_results = LDA.transform(dtm)

In [20]:
topic_results.shape

(10000, 20)

In [21]:
topic_results[0].round(2)

array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.45, 0.01, 0.01, 0.01, 0.01,
       0.41, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

In [22]:
topic_results[0].argmax()

6
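
`argmax()` reports only the winning topic, but the row printed above is close to a tie: topic 6 has 0.45 and topic 11 has 0.41. A short sketch, using the rounded values copied from that output, shows how to surface the runner-up and the margin:

```python
import numpy as np

# Rounded topic probabilities for review 0, copied from the output above
row = np.array([0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.45, 0.01, 0.01, 0.01, 0.01,
                0.41, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])

# Indices of the two largest probabilities, strongest first
best, runner_up = row.argsort()[::-1][:2]

# How far ahead the winning topic is
margin = row[best] - row[runner_up]

print(best, runner_up, round(margin, 2))
```

A small margin like this suggests the hard label for this review should be read with some caution.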

In [23]:
df.head()

Unnamed: 0,review
0,Stuning even for the non-gamer: This sound tra...
1,The best soundtrack ever to anything.: I'm rea...
2,Amazing!: This soundtrack is my favorite music...
3,Excellent Soundtrack: I truly like this soundt...
4,"Remember, Pull Your Jaw Off The Floor After He..."


In [24]:
topic_results_lda = topic_results.argmax(axis=1)

In [25]:
df['Topic_LDA'] = topic_results_lda

In [26]:
df.head(10)

Unnamed: 0,review,Topic_LDA
0,Stuning even for the non-gamer: This sound tra...,6
1,The best soundtrack ever to anything.: I'm rea...,17
2,Amazing!: This soundtrack is my favorite music...,4
3,Excellent Soundtrack: I truly like this soundt...,13
4,"Remember, Pull Your Jaw Off The Floor After He...",11
5,an absolute masterpiece: I am quite sure any o...,11
6,"Buyer beware: This is a self-published book, a...",13
7,Glorious story: I loved Whisper of the wicked ...,13
8,A FIVE STAR BOOK: I just finished reading Whis...,13
9,Whispers of the Wicked Saints: This was a easy...,13


# Non-negative Matrix Factorization

#### TASK: Using Scikit-Learn, create an instance of NMF with 20 expected components. (Use random_state=42.)

In [27]:
from sklearn.decomposition import NMF

In [28]:
nmf_model = NMF(n_components=20,random_state=42)

In [29]:
nmf_model.fit(dtm)



NMF(n_components=20, random_state=42)

#### TASK: Print out the top 15 most common words for each of the 20 topics.

In [30]:
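# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = tfidf.get_feature_names_out()
for index,topic in enumerate(nmf_model.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    # argsort() is ascending, so the last 15 indices are the highest-weighted terms
    print([feature_names[i] for i in topic.argsort()[-15:]])
    print('\n')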

THE TOP 15 WORDS FOR TOPIC #0
['thought', 'reader', 'chapter', 'understand', 'review', 'people', 'interesting', 'information', 'life', 'pages', 'recommend', 'author', 'written', 'reading', 'book']


THE TOP 15 WORDS FOR TOPIC #1
['transformers', 'watched', 'saw', 'worst', 'horror', 'watching', 'effects', 'bad', 'special', 'acting', 'seen', 'action', 'watch', 'movies', 'movie']


THE TOP 15 WORDS FOR TOPIC #2
['say', 'got', 'thought', 'way', 'thing', 'people', 'did', 'know', 'think', 'didn', 'don', 'bad', 'really', 'just', 'like']


THE TOP 15 WORDS FOR TOPIC #3
['sounds', 'listening', 'track', 'heard', 'listen', 'rock', 'tracks', 'albums', 'best', 'sound', 'band', 'song', 'songs', 'music', 'album']


THE TOP 15 WORDS FOR TOPIC #4
['kids', 'controls', 'raider', 'gameplay', 'levels', 'tomb', 'glitches', 'playing', 'buy', 'played', 'graphics', 'fun', 'games', 'play', 'game']


THE TOP 15 WORDS FOR TOPIC #5
['boots', 'excellent', 'definitely', 'fast', 'loved', 'recommend', 'job', 'wonderfu

#### TASK: Add a new column to the original Amazon reviews dataframe that labels each review with one of the 20 topic categories.

In [31]:
df.head()

Unnamed: 0,review,Topic_LDA
0,Stuning even for the non-gamer: This sound tra...,6
1,The best soundtrack ever to anything.: I'm rea...,17
2,Amazing!: This soundtrack is my favorite music...,4
3,Excellent Soundtrack: I truly like this soundt...,13
4,"Remember, Pull Your Jaw Off The Floor After He...",11


In [32]:
topic_results = nmf_model.transform(dtm)
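
Unlike the LDA output, the rows returned by NMF's `transform` are non-negative weights that do not sum to 1. `argmax` works on them as-is, but if proportions are wanted the rows can be normalized. A minimal sketch on a random non-negative matrix (the data and the `W_norm` name are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative toy matrix standing in for a document-term matrix
rng = np.random.default_rng(0)
X = rng.random((6, 8))

model = NMF(n_components=3, random_state=42, max_iter=500)
W = model.fit_transform(X)  # document-by-topic weights, not probabilities

# Normalize each row to sum to 1, leaving all-zero rows as zeros
row_sums = W.sum(axis=1, keepdims=True)
W_norm = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)

print(W.sum(axis=1).round(2))
print(W_norm.sum(axis=1).round(2))
```

Normalizing does not change which topic wins for a row, since every weight in the row is divided by the same positive number.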

In [36]:
topic_results.argmax(axis=1)

df['Topic_NMF'] = topic_results.argmax(axis=1)

df.head(10)

Unnamed: 0,review,Topic_LDA,Topic_NMF
0,Stuning even for the non-gamer: This sound tra...,6,4
1,The best soundtrack ever to anything.: I'm rea...,17,4
2,Amazing!: This soundtrack is my favorite music...,4,4
3,Excellent Soundtrack: I truly like this soundt...,13,4
4,"Remember, Pull Your Jaw Off The Floor After He...",11,4
5,an absolute masterpiece: I am quite sure any o...,11,4
6,"Buyer beware: This is a self-published book, a...",13,0
7,Glorious story: I loved Whisper of the wicked ...,13,17
8,A FIVE STAR BOOK: I just finished reading Whis...,13,6
9,Whispers of the Wicked Saints: This was a easy...,13,6
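
The topic numbers in `Topic_LDA` and `Topic_NMF` are arbitrary per model, so equal numbers do not mean equal topics. One way to see how the two labelings line up is a cross-tabulation. The sketch below uses only the ten rows printed above (copied by hand into a small frame named `labels`, which is an assumption for illustration):

```python
import pandas as pd

# Topic labels for the first ten reviews, copied from the output above
labels = pd.DataFrame({
    "Topic_LDA": [6, 17, 4, 13, 11, 11, 13, 13, 13, 13],
    "Topic_NMF": [4, 4, 4, 4, 4, 4, 0, 17, 6, 6],
})

# Rows: LDA topic, columns: NMF topic, cells: number of reviews with that pair
table = pd.crosstab(labels["Topic_LDA"], labels["Topic_NMF"])
print(table)
```

On the full dataframe the same call would be `pd.crosstab(df['Topic_LDA'], df['Topic_NMF'])`; large cells there would indicate pairs of topics that the two models treat as roughly the same theme.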


In [38]:
df['review'][0]

'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

In [40]:
df['review'][5000]

'Wonderful book: This is the wonderfully engaging tale of how Bugliosi successfully prosecuted the Manson Family for the August 1969 murders. However, if you wish to learn a lot about the backgrounds of the "family" members, this is not the book for you. This book is from a law-enforcement point of view. If you pick up this book hoping to learn about Manson and his philosophy, you\'ll be disappointed. However, if you want to know how the players were implicated and prosecuted, there\'s no better book.'

In [41]:
df['Topic_LDA'][5000]

13

In [42]:
df['Topic_NMF'][5000]

0

For the first review (index 0):
'Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^'

Top 15 words of LDA topic 6:
['tracks', 'good', 'class', 'love', 'like', 'labor', 'childbirth', 'sound', 'concert', 'video', 'great', 'cd', 'music', 'album', 'dvd']


Top 15 words of NMF topic 4:
['kids', 'controls', 'raider', 'gameplay', 'levels', 'tomb', 'glitches', 'playing', 'buy', 'played', 'graphics', 'fun', 'games', 'play', 'game']


LDA assigns this review to a topic dominated by music terms ('album', 'music', 'cd', 'sound', 'tracks'), which matches the first customer's review of a soundtrack. The topic is not entirely clean, though: it also contains unrelated terms such as 'labor' and 'childbirth'.

NMF assigns the review to a video-game topic ('game', 'play', 'graphics', 'gameplay'). That is only an indirect match: the review is about a game's soundtrack and mentions playing the game, so NMF picks up the gaming context rather than the music itself.


For the review at index 5000:
'Wonderful book: This is the wonderfully engaging tale of how Bugliosi successfully prosecuted the Manson Family for the August 1969 murders. However, if you wish to learn a lot about the backgrounds of the "family" members, this is not the book for you. This book is from a law-enforcement point of view. If you pick up this book hoping to learn about Manson and his philosophy, you'll be disappointed. However, if you want to know how the players were implicated and prosecuted, there's no better book.'


Top 15 words of LDA topic 13:
THE TOP 15 WORDS FOR TOPIC #13
['people', 'don', 'characters', 'really', 'time', 'reading', 'just', 'great', 'books', 'like', 'movie', 'good', 'story', 'read', 'book']


Top 15 words of NMF topic 0:
THE TOP 15 WORDS FOR TOPIC #0
['thought', 'reader', 'chapter', 'understand', 'review', 'people', 'interesting', 'information', 'life', 'pages', 'recommend', 'author', 'written', 'reading', 'book']

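Both models assign this review to a book-related topic, which matches the text. The NMF topic is the more focused of the two: its top words are mostly about reading and writing ('author', 'written', 'pages', 'chapter'), while the LDA topic also includes 'movie' and generic words such as 'good' and 'great'.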

# Great job!