Exploratory data analysis is *exploratory* by nature, often used to get a sense of the content and structure of a dataset before diving into specific research questions. For this homework, you have one task: tell us something interesting about a dataset using topic modeling. You are free to use any dataset (even those from Kaggle this time); some sample datasets are listed below:

* [CMU Book Summary Dataset](https://www.cs.cmu.edu/~dbamman/booksummaries.html)
    * Metadata: Title, author, publication date, genre

* [CMU Movie Summary Dataset](http://www.cs.cmu.edu/~ark/personas/)
    * Metadata: Movie box office revenue, genre, release date, runtime, and language

* [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/)
    * Metadata: positive/negative sentiment
    
* [Congressional Speech Data](https://www.cs.cornell.edu/home/llee/data/convote.html)
    * Metadata: political party
    
* [100K ArXiv abstracts](https://drive.google.com/file/d/1ThK1D9AstYI6s2Z7m9SmvLqLZPneMp12/view?usp=sharing)
    * Metadata: ArXiv subject (e.g., math, physics, cs)

For any dataset, you'll have to understand its format in order to read in the text (and any metadata of interest), and then tokenize the text before passing it to the topic model.

Refer to `4.eda/TopicModel.ipynb` for example code on how to work with topic models in gensim, and remember that we've considered several concepts already for characterizing differences across groups; feel free to draw on those as you see appropriate for this exploration.

A **check** submission will run topic modeling on the text of your chosen dataset and discuss the topics that emerge; a **check-plus** submission will relate variation in those topics to aspects of metadata (e.g., discussing interesting topical differences over time or between genre/political party, etc.).  In all cases, be sure to explain why what you have found is interesting.
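
One way to approach the check-plus criterion is to average each topic's probability over the documents that share a metadata value. The sketch below is only an illustration under stated assumptions: `mean_topic_by_group` is a hypothetical helper, the toy data is made up, and the input is assumed to have the shape that gensim's `get_document_topics` returns (a list of `(topic_id, probability)` pairs per document). Topics missing from a document's list are treated as probability zero.

```python
from collections import defaultdict

def mean_topic_by_group(doc_topics, groups, num_topics):
    """Average each topic's probability over the documents in each group.

    doc_topics: one list of (topic_id, probability) pairs per document.
    groups:     one metadata label per document, aligned with doc_topics.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for topics, group in zip(doc_topics, groups):
        counts[group] += 1
        for topic_id, prob in topics:
            sums[group][topic_id] += prob
    # Divide the per-group sums by the number of documents in that group.
    return {g: [s / counts[g] for s in sums[g]] for g in sums}

# Toy data (hypothetical): three documents, two topics, labeled by decade.
toy_doc_topics = [[(0, 0.9), (1, 0.1)], [(0, 0.5), (1, 0.5)], [(1, 1.0)]]
toy_decades = ["1950s", "1950s", "2000s"]

means = mean_topic_by_group(toy_doc_topics, toy_decades, num_topics=2)
for decade in sorted(means):
    print(decade, [round(p, 2) for p in means[decade]])
```

Comparing the resulting per-group averages (for example, across publication decades or genres) shows which topics are over- or under-represented in each group.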

In [1]:
# CMU Book Summary Dataset

import nltk
import re
import gensim
from gensim import corpora
import operator

nltk.download('stopwords')
from nltk.corpus import stopwords

import numpy as np
import random

random.seed(1)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\linzh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
def read_stopwords(filename):
    stopwords={}
    with open(filename) as file:
        for line in file:
            stopwords[line.rstrip()]=1
    return stopwords

In [3]:
stop_words = {k:1 for k in stopwords.words('english')}
stop_words.update(read_stopwords("../data/jockers.stopwords"))
stop_words["'s"]=1
# Keep the stopwords in a set so the membership test in filter() is a constant-time lookup
stop_words=set(stop_words)

In [4]:
def filter(word, stopwords):
    
    """ Return True if a word should be kept: it is not a stopword and contains at least one letter """
    
    # no stopwords
    if word in stopwords:
        return False
    
    # has to contain at least one letter
    if re.search("[A-Za-z]", word) is not None:
        return True
    
    return False

In [5]:
# Metadata: Title, author, publication date, genre

def read_docs(metadataFile, stopwords):
    
    idds = []
    titles = []
    authors = []
    p_dates = []
    # genres = []

    docs=[]

    with open(metadataFile, encoding="utf-8") as file:
        for line in file:

            cols=line.rstrip().split("\t")

            idd    = cols[0]
            title  = cols[2]
            author = cols[3]
            p_date = cols[4]
            # genre  = cols[5]

            if len(idd) != 0 and len(title) != 0 and len(author) != 0 and len(p_date) != 0:
                idds.append(idd)
                titles.append(title)
                authors.append(author)
                p_dates.append(p_date)

                tokens=nltk.word_tokenize(cols[6].lower())
                tokens=[x for x in tokens if filter(x, stopwords)]
                docs.append(tokens)
                
    return docs, titles

In [6]:
metadataFile="../data/booksummaries.txt"
data, doc_names=read_docs(metadataFile, stop_words)

In [7]:
# data, doc_names

Using gensim's `corpora.Dictionary`

In [8]:
# Create vocab from data; restrict vocab to only the top 10K terms that show up in at least 5 documents 
# and no more than 50% of all documents

dictionary = corpora.Dictionary(data)
dictionary.filter_extremes(no_below=5, no_above=.5, keep_n=10000)

In [9]:
# Convert each document to a bag of words: (word id, count) pairs for words in the vocab (all other words are dropped)
corpus = [dictionary.doc2bow(text) for text in data]
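
To make the two steps above concrete, here is a minimal standard-library sketch of document-frequency filtering followed by bag-of-words conversion. It is meant to mirror the idea behind `filter_extremes` and `doc2bow`, not gensim's exact implementation; `build_vocab`, `to_bow`, and the toy documents are hypothetical names and data introduced only for illustration.

```python
from collections import Counter

def build_vocab(docs, no_below=1, no_above=1.0, keep_n=None):
    """Map each kept token to an integer id, filtering by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most widespread tokens first; ties broken alphabetically for determinism.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return {token: i for i, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, dropping out-of-vocabulary tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

# Toy corpus (hypothetical): "magic" appears in every document, so it is too common to keep.
toy_docs = [["dragon", "magic", "dragon"], ["magic", "school"], ["magic", "dragon", "war"]]
toy_vocab = build_vocab(toy_docs, no_below=2, no_above=0.9)
print(toy_vocab)
print(to_bow(toy_docs[0], toy_vocab))
```

In this toy corpus only "dragon" survives: "magic" is in too many documents, and "school" and "war" are in too few.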

In [10]:
num_topics=20

In [11]:
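# random.seed(1) above does not seed gensim (it draws from NumPy), so pass a seed to the model directly
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=dictionary,
                                            num_topics=num_topics,
                                            passes=10,
                                            alpha='auto',
                                            random_state=1)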

In [12]:
for i in range(num_topics):
    print("topic %s:\t%s" % (i, ' '.join([term for term, freq in lda_model.show_topic(i, topn=10)])))

topic 0:	magic world find help dragon power kill magical dark human
topic 1:	find tells island away water people camp day river city
topic 2:	army battle men death father lord help escape brother killed
topic 3:	earth planet time human space world humans system alien years
topic 4:	park fox monk ms. tarzan beast cult badger lion mole
topic 5:	novel murder death story police killer case woman investigation murdered
topic 6:	heaven cat circus cats cal tintin drake odd smiley gregor
topic 7:	de sir england french la london english british spain spanish
topic 8:	school mother n't time father friends finds tells day goes
topic 9:	mr. mrs. children boy fly story game mouse book flock
topic 10:	case company firm bridge drug number smith henri mexico bank
topic 11:	war family novel american white world father black women story
topic 12:	narrator abbot day boy father sherlock hermit priest horse brother
topic 13:	family father mother life wife daughter years husband brother time
topic 14:	book 

In [13]:
topic_model=lda_model 

topic_docs=[]
for i in range(num_topics):
    topic_docs.append({})
for doc_id in range(len(corpus)):
    doc_topics=topic_model.get_document_topics(corpus[doc_id])
    for topic_num, topic_prob in doc_topics:
        topic_docs[topic_num][doc_id]=topic_prob

for i in range(num_topics):
    print("%s\n" % ' '.join([term for term, freq in topic_model.show_topic(i, topn=10)]))
    sorted_x = sorted(topic_docs[i].items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_x[:5]:
        print("%s\t%.3f\t%s" % (i,v,doc_names[k]))
    print()
    
    

magic world find help dragon power kill magical dark human

0	0.960	Cavern of the Fear
0	0.911	Magic Moon
0	0.908	Darke
0	0.896	Trail of the Wolf
0	0.889	Cube Route

find tells island away water people camp day river city

1	0.890	The Moomins and the Great Flood
1	0.888	The Fourth Apprentice
1	0.885	Harimau! Harimau!
1	0.848	The Secret of the Forgotten City
1	0.841	The Borrowers Afloat

army battle men death father lord help escape brother killed

2	0.872	Viking Warrior
2	0.863	Durgeshnandini
2	0.822	Captain from Castile
2	0.804	Durandal
2	0.801	Kapalkundala

earth planet time human space world humans system alien years

3	0.889	Abduction
3	0.882	The Flaming Arrow
3	0.864	The World at the End of Time
3	0.852	Manifold: Space
3	0.834	The Abode of Life

park fox monk ms. tarzan beast cult badger lion mole

4	0.506	Tarzan Triumphant
4	0.350	Fox's Feud
4	0.348	Tarzan and the Lost Empire
4	0.331	Tarzan the Magnificent
4	0.319	In the Path of the Storm

novel murder death story police killer c

Q: discuss the topics that emerge

The no. 1 topic is murder and detective stories. \
The no. 2 topic seems to be fairy tales or children's stories with animals as the main characters. \
The no. 3 topic centers on humans and society, but interestingly connects books from different genres such as psychology and science fiction. \
The no. 4 topic is about family across various classes and cultural backgrounds. \
It is interesting that the tokens from the book summaries alone are enough to group the books into coherent categories by topic.