<a href="https://colab.research.google.com/github/KalikaKay/Author-Classification-Project/blob/master/Topic_Modeling_Agglomerative_Clusters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic Modeling

Using LSA, LDA, and NMF, print the top ten words (those with the highest loadings) for each topic.

Analyze and compare the three methods.

# Data Cleaning



In [106]:
#Topic Models
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import LatentDirichletAllocation, NMF

#Data Engineering
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import numpy as np
import pandas as pd


# file location
PATH = '/content/drive/MyDrive/Author Classification/Books/Books.parquet'
books = pd.read_parquet(PATH)

In [4]:
books.head()

| | sentence | author | tokenized | lemmatized | kbow | abow | dbow |
|---|---|---|---|---|---|---|---|
| 0 | Produced David Price | Anne Bronte | [produced, david, price] | [produced, david, price] | 0 | 1 | 0 |
| 1 | Agnes Grey NOVEL | Anne Bronte | [agnes, grey, novel] | [agnes, grey, novel] | 0 | 1 | 0 |
| 5 | Illustration Birthplace Charlotte Emily and An... | Anne Bronte | [illustration, birthplace, charlotte, emily, a... | [illustration, birthplace, charlotte, emily, a... | 0 | 1 | 0 |
| 9 | All true histories contain instruction though ... | Anne Bronte | [all, true, histories, contain, instruction, t... | [all, true, history, contain, instruction, tho... | 0 | 0 | 0 |
| 10 | father was clergyman the north England who was... | Anne Bronte | [father, was, clergyman, the, north, england, ... | [father, wa, clergyman, the, north, england, w... | 1 | 0 | 0 |


*...emphasis on unsupervised learning*

Perhaps I didn't pay enough attention to the cluster results the agglomerative model provided in my first unsupervised learning notebook. 

I'm curious about the topics for the cluster models. What are the topics in the agglomerative clusters? 

I will run the topic models on the agglomerative clusters, which the code below selects with the `abow` column (1 or 0).



In [80]:
#Number of Keywords and Topics for each model.
num_keywords = 10
num_topics = 10

# Latent Semantic Analysis (LSA)

In [None]:
#TF-IDF vectors, one per agglomerative cluster (abow == 1 and abow == 0)
vectorizer_one = TfidfVectorizer()
vectorized_one = vectorizer_one.fit_transform(books[books.abow ==1]['lemmatized'].astype(str))
vectorizer_zero = TfidfVectorizer()
vectorized_zero= vectorizer_zero.fit_transform(books[books.abow ==0]['lemmatized'].astype(str))

In [84]:
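model = TruncatedSVD(n_components=num_topics)
# TruncatedSVD accepts the sparse TF-IDF matrix directly, so .toarray() is not needed.
model.fit(vectorized_one)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = vectorizer_one.get_feature_names_out()
# For each topic, pair the num_keywords highest-loading words with their loadings.
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
one = [[x[0] for x in i] for i in results]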

In [98]:
#This format for readability. 
for topic in one:
  print(*topic)

the you and she her that wa said his not
you said are what have don know your will can
she her said jake yes hand him herself maud nap
yes said jake the nap him bunny right lucas well
said jake the nap his don bunny lucas maud again
what said asked matter jake mean wa the nap cried
not that but she have could would had all wa
his him that wa and jake not but eye with
him she his wa you did asked why the raskolnikov
wa that jake you are there the silent maud voice
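
The slice `topic.argsort()[:-num_keywords - 1:-1]` used in the cell above is compact but easy to misread. Here is a minimal sketch of the same idea as a helper. The name `top_words` and the toy numbers are my own illustration, not part of the notebook or the books data.

```python
import numpy as np

def top_words(components, feature_names, num_keywords=10):
    """Return the num_keywords highest-loading (word, loading) pairs per topic."""
    feature_names = np.asarray(feature_names)
    topics = []
    for topic in components:
        # argsort sorts ascending, so the slice walks backwards from the
        # largest loading and stops after num_keywords entries.
        top_idx = topic.argsort()[:-num_keywords - 1:-1]
        topics.append([(str(feature_names[i]), float(topic[i])) for i in top_idx])
    return topics

# Toy 2-topic, 4-word loading matrix (made-up numbers).
toy_components = np.array([[0.1, 0.9, 0.3, 0.5],
                           [0.7, 0.2, 0.8, 0.0]])
toy_vocab = ["door", "jake", "room", "said"]

toy_topics = top_words(toy_components, toy_vocab, num_keywords=2)
for topic in toy_topics:
    print(*[word for word, _ in topic])
```

With a fitted model, the same helper could be called as `top_words(model.components_, feature_names, num_keywords)`, assuming `feature_names` holds the vectorizer's vocabulary.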


In [101]:
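#Redefine the model so it's not tainted by the first one.
model = TruncatedSVD(n_components=num_topics)
# The sparse matrix can be passed directly; .toarray() is not needed.
model.fit(vectorized_zero)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
feature_names = vectorizer_zero.get_feature_names_out()
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
zero = [[x[0] for x in i] for i in results]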

In [102]:
#This format for readability. 
for topic in zero:
  print(*topic)

the and wa that her she had his you for
you that she have not for your and her will
her she had wa not could herself been would with
and his wa that him had not all them they
and her they their them will child little our it
that they for their all one will have not are
wa they but there not and did myself were could
his that wa her man hand one she with eye
she not but him for would will have very could
she that had and him out then went come ivanovna
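
One thing to keep in mind when reading the LSA lists: unlike LDA and NMF, the SVD components can contain negative loadings, and sorting by the raw value only surfaces the largest positive ones. The sketch below uses made-up numbers to show how the ranking can change when sorting by absolute value instead. It is only an illustration of the difference, not a claim about which ranking is right for this data.

```python
import numpy as np

vocab = np.array(["door", "jake", "room", "said"])
# Made-up signed loadings for a single LSA-style component.
component = np.array([-0.8, 0.5, 0.1, -0.2])

# Rank by raw loading (what the cells above do).
top_signed = vocab[component.argsort()[::-1][:2]]
# Rank by magnitude, so strongly negative words also appear.
top_abs = vocab[np.abs(component).argsort()[::-1][:2]]

print("by raw loading:     ", *top_signed)
print("by absolute loading:", *top_abs)
```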


# Latent Dirichlet Allocation (LDA)

In [99]:
#Bag of Words Vector
vectorizer_one = CountVectorizer()
vectorized_one = vectorizer_one.fit_transform(books[books.abow ==1]['lemmatized'].astype(str))
vectorizer_zero = CountVectorizer()
vectorized_zero= vectorizer_zero.fit_transform(books[books.abow ==0]['lemmatized'].astype(str))

In [100]:
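model = LatentDirichletAllocation(n_components=num_topics, learning_method='online')
# LDA accepts the sparse count matrix directly; .toarray() is not needed.
model.fit(vectorized_one)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
feature_names = vectorizer_one.get_feature_names_out()
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
one = [[x[0] for x in i] for i in results]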

#This format for readability. 
for topic in one:
  print(*topic)

jake you back sure what believe don know told yes
mind wish father petrovitch hope arkady along raskolnikov ivanovna secret
the and wa his had her with from that were
you and she the that her not but said for
felt life more and since fairfax boy jane rest glad
silence spoke sonia ask between silent into moved stopped sent
old passed the vasya snap street twenty hundred arm daughter
the light miko anita over dead yesterday grantline straight gregg
the and into hear through kleig earth stood which it
face the side upon his and slowly fell svidriga lov


In [105]:
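model = LatentDirichletAllocation(n_components=num_topics, learning_method='online')
# LDA accepts the sparse count matrix directly; .toarray() is not needed.
model.fit(vectorized_zero)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
feature_names = vectorizer_zero.get_feature_names_out()
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
zero = [[x[0] for x in i] for i in results]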

#This format for readability. 
for topic in zero:
  print(*topic)

mastakovitch yulian member waggon promotion gas counter flown transcendental seth
emma mr harriet miss elton knightley woodhouse churchill weston frank
inventor solarium carnes sunburn admiral planet plane quartz operating sanity
sophia macdonald sensibility janetta augustus laura scotland tho edinburgh laurina
von beyer moonlight lunium cathode cadmium detected ultra spectrum medical
ivanovitch ivanovna sonia semyon katerina svidriga lov rouble arkady amalia
the and wa his that with had him for but
eloisa tho marlowe elizabeth lesley chaise charlotte louisa disgraced 7th
and the that you her she wa not for had
pyotr petrovitch wud romanovna avdotya alexandrovna pulcheria heave swung yer


# Non-Negative Matrix Factorization (NMF)

In [107]:
#TF-IDF vectors, one per agglomerative cluster (abow == 1 and abow == 0)
vectorizer_one = TfidfVectorizer()
vectorized_one = vectorizer_one.fit_transform(books[books.abow ==1]['lemmatized'].astype(str))
vectorizer_zero = TfidfVectorizer()
vectorized_zero= vectorizer_zero.fit_transform(books[books.abow ==0]['lemmatized'].astype(str))

In [108]:
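model = NMF(n_components=num_topics)
# NMF accepts the sparse TF-IDF matrix directly; .toarray() is not needed.
model.fit(vectorized_one)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
feature_names = vectorizer_one.get_feature_names_out()
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
one = [[x[0] for x in i] for i in results]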

#This format for readability. 
for topic in one:
  print(*topic)

the and from with had room they door out into
you are know don your why can will have how
she her had herself hand eye with did face upon
yes answered sir replied raskolnikov quite sure course now well
said jake nap don she bunny maud lucas again anne
what asked mean matter cried say want are did raskolnikov
not have and but for that all would will very
his and eye with hand face her head upon looked
him looked with don raskolnikov let and bunny know she
wa that there had silent but were voice maud when


In [109]:
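model = NMF(n_components=num_topics)
# NMF accepts the sparse TF-IDF matrix directly; .toarray() is not needed.
model.fit(vectorized_zero)

# get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2.
feature_names = vectorizer_zero.get_feature_names_out()
results = [[(feature_names[i], topic[i])
           for i in topic.argsort()[:-num_keywords - 1:-1]]
           for topic in model.components_]
zero = [[x[0] for x in i] for i in results]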

#This format for readability. 
for topic in zero:
  print(*topic)

the room door which with from and through window into
you your are don know have will that come see
her she and the with had from that eye mother
very emma and mr harriet the wa weston miss elton
his him himself and with the man would vasya ivanovitch
that will have and not for ha all what one
and wa then went said out came when down were
had wa that been not but all could the there
she her herself would not him knew that did could
they them their and were the are themselves with for


# Conclusion


Three topic models were run against the books dataset. In the interest of maintaining the unsupervised learning theme, the sentences were filtered by their agglomerative cluster.

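Of the three, LDA seems to give the most informative topics for each cluster. In cluster 0 especially, several LDA topics line up with recognizable books through their character names (for example Emma, Harriet, and Knightley in one topic, and Sonia, Katerina, and Ivanovna in another). LSA comes in a close second.
The NMF topics are mostly common words such as *the*, *and*, and *that*, so I can't tell what they are about.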

There are definitely other approaches I could have taken with this topic modeling. I could have filtered by author or book and retrieved topics for specific titles or authors.

I took the cluster route because I wanted to see if there was a stark difference in topics between the clusters.
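
One simple way to put a number on that difference is the Jaccard overlap between keyword lists from the two clusters. The sketch below compares the second NMF topic printed for each cluster. The `jaccard` helper is my own illustration, and pairing topics by position is an assumption, since topic order is arbitrary across separately fitted models.

```python
def jaccard(words_a, words_b):
    """Share of distinct words that two keyword lists have in common."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Second NMF topic printed above for cluster 1 and cluster 0.
cluster_one_topic = "you are know don your why can will have how".split()
cluster_zero_topic = "you your are don know have will that come see".split()

overlap = jaccard(cluster_one_topic, cluster_zero_topic)
print(f"Jaccard overlap: {overlap:.2f}")
```

A value near 1 means the two topics share most of their top words, and a value near 0 means they share almost none.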

---
*a Thinkful Project by Kalika Kay Curry*
