In [None]:
import numpy as np 
import pandas as pd 
from typing import List

import os
from matplotlib import pyplot as plt
from IPython.display import display, Markdown

from sklearn.decomposition import PCA

import gensim
from gensim.models import word2vec as w2v

from gensim.models import LdaModel
from gensim.corpora import Dictionary

In [None]:
def show_markdown_table(headers: List[str], data: List) -> None:
    # Build a Markdown table (header row, right-aligned separator row, data rows) and render it.
    s = f"| {' | '.join(headers)} |\n| {' | '.join([(max(1, len(header) - 1)) * '-' + ':' for header in headers])} |\n"
    for row in data:
        s += f"| {' | '.join([str(item) for item in row])} |\n"
    display(Markdown(s))

# Latent Dirichlet Allocation (LDA)

A **topic** is a set of terms that suggest a shared theme. You'll use **Latent Dirichlet Allocation (LDA)** to generate five topics for your book. 


In [None]:
# Replace the argument with the filename of your book.
sentences = w2v.LineSentence('../input/csci-270-tokenized-books-2022/huck-finn_fixed.txt')

## Running LDA

Create an LDA topic model for your book using the tokenized sentences in the `sentences` variable defined above. Run the `lda_analysis()` function below for this purpose. It will generate 5 topics for you to examine, each containing 20 words. It may take a couple of minutes to complete.
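
The value returned by `lda_analysis()` is a list of `(topic_id, [(word, probability), ...])` pairs, which can be hard to read when echoed raw. A minimal sketch of a formatter follows. The name `format_topics` and the sample data are hypothetical and invented for illustration; they are not output from any book.

```python
from typing import List, Tuple

Topic = Tuple[int, List[Tuple[str, float]]]


def format_topics(topics: List[Topic], max_words: int = 10) -> List[str]:
    """Return one readable line per topic, listing its top terms."""
    lines = []
    for topic_id, word_probs in topics:
        # Sort by probability, highest first, in case the input is unordered.
        ranked = sorted(word_probs, key=lambda pair: pair[1], reverse=True)
        terms = ", ".join(f"{word} ({prob:.3f})" for word, prob in ranked[:max_words])
        lines.append(f"Topic {topic_id}: {terms}")
    return lines


# Hypothetical data in the same shape as show_topics(formatted=False).
sample_topics = [
    (0, [("river", 0.03), ("raft", 0.02)]),
    (1, [("king", 0.04), ("duke", 0.035)]),
]
for line in format_topics(sample_topics):
    print(line)
```

Passing the result of `lda_analysis(sentences)` instead of `sample_topics` should print one line per topic.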

In [None]:
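def lda_analysis(sentences):
    # Map every unique token to an integer id.
    dictionary = Dictionary(sentences)
    # Drop tokens that appear in more than 80% of the sentences; no_below=1 sets no minimum frequency.
    dictionary.filter_extremes(no_below=1, no_above=0.8)
    # Convert each sentence to a bag of words: (token id, count) pairs.
    corpus = [dictionary.doc2bow(text) for text in sentences]
    lda = LdaModel(corpus, num_topics=5, id2word=dictionary, update_every=5, chunksize=10000, passes=10)
    return lda.show_topics(formatted=False, num_words=20)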

In [None]:
%time lda_analysis(sentences)

## Analyzing Topics

Examine the five topics generated by the function call above. Then answer the following questions:

1. For each topic, can you perceive a coherent aspect of your book that summarizes the terms included in the topic? If so, state that aspect and justify it. If not, speculate as to how that collection of terms is in some way representative of the book.

**I am having a hard time with this one. I have rerun the function several times to try to dissect the different topics, but they all seem indistinguishable from a single topic. This is probably because most of the story is told through narration or through stories that characters retell.**

2. What themes or important concepts from the book were not represented in the topics that arguably could have been? Don't focus on specific terms; instead, think about abstractions that transcend individual terms.

**The book contains a significant amount of lying, stealing, pretending/impersonating, and general mischief. I expected this to be represented in some of the topics, but it does not seem to be.**

3. What terms in the topic represent vocabulary specific to the book, as opposed to more generic terms? Examples include major characters, locations, objects, etc.

**There are several terms specific to the dialect some characters speak, because of the setting and time period of the book. I have chosen not to include them because they are just different forms of more generic terms. Other terms: tom, huck, jim, king, duke, mary, river**
    
4. What terms from the book were **not** represented in the topics that arguably could have been? Examples again include major characters, locations, objects, etc.

**raft, father, lie, steal, hide**
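
One way to check the impression that the topics are hard to tell apart is to measure how many terms each pair of topics shares, and to check which hand-picked terms appear in no topic at all. The sketch below assumes topics in the shape returned by `lda_analysis()`. The function names and the sample data are hypothetical, not real model output.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Topic = Tuple[int, List[Tuple[str, float]]]


def topic_overlap(topics: List[Topic]) -> Dict[Tuple[int, int], float]:
    """Jaccard similarity (shared terms / all terms) for each pair of topics."""
    term_sets = {topic_id: {word for word, _ in words} for topic_id, words in topics}
    overlaps = {}
    for a, b in combinations(sorted(term_sets), 2):
        union = term_sets[a] | term_sets[b]
        shared = term_sets[a] & term_sets[b]
        overlaps[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlaps


def missing_terms(topics: List[Topic], candidates: List[str]) -> List[str]:
    """Candidate terms that appear in none of the topics, in input order."""
    seen: Set[str] = {word for _, words in topics for word, _ in words}
    return [term for term in candidates if term not in seen]


# Hypothetical topics, not real model output.
sample_topics = [
    (0, [("jim", 0.03), ("river", 0.02), ("tom", 0.01)]),
    (1, [("jim", 0.04), ("king", 0.02), ("duke", 0.01)]),
]
print(topic_overlap(sample_topics))
print(missing_terms(sample_topics, ["raft", "jim", "steal"]))
```

Overlap values near 1 would suggest that two topics are nearly the same set of terms, while values near 0 would suggest distinct vocabularies.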

## Additional books

Select two other books from our corpus, and run topic analysis on each one. 

In [None]:
%time lda_analysis(w2v.LineSentence('../input/csci-270-tokenized-books-2022/KJV_Bible.txt'))

In [None]:
%time lda_analysis(w2v.LineSentence('../input/csci-270-tokenized-books-2022/The Martian - Andy Weir_fixed.txt'))

## Analysis of Additional Books

For each of the additional books, answer the following questions:

1. How do the topic terms that are not book-specific compare to those from your own book? 

**Even the more generic terms still fit their own book, because they reflect the vocabulary of its time.**

2. What similarities and differences between the books can you identify based on examining the listed terms?

**The two books I chose are drastically different. Even without using my prior knowledge of these books, I can tell that The Martian deals with more material subjects and the Bible deals with more spiritual subjects.**

- **The Martian: hydrogen, water, solar, mars, rover, regulator, cells, trailer**
- **Bible: thee, thy, thou, faith, spirit, holy**

# Word2Vec

Word2Vec is a machine learning algorithm that learns a vector for each word from the contexts in which it appears in a corpus, so that words used in similar ways receive similar vectors. We will begin by building a Word2Vec model of your book. We will then compare the results with a Word2Vec model of the entire corpus.
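
The scores that gensim reports from `similarity()` and `most_similar()` are cosine similarities between word vectors. As a small illustration, the sketch below computes cosine similarity by hand. The three-dimensional vectors are made up for the example; a trained model uses learned vectors with many more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors; real Word2Vec vectors are learned from the corpus.
toy_vectors = {
    "river": [0.9, 0.1, 0.0],
    "raft": [0.8, 0.2, 0.1],
    "king": [0.0, 0.3, 0.9],
}
print(cosine_similarity(toy_vectors["river"], toy_vectors["raft"]))
print(cosine_similarity(toy_vectors["river"], toy_vectors["king"]))
```

Vectors that point in nearly the same direction score close to 1, and unrelated directions score close to 0.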

In [None]:
model = w2v.Word2Vec(sentences)

In [None]:
def word_similarities(model, test_word_list):
    # Only words in the model's vocabulary can be compared or plotted.
    known_words = [word for word in test_word_list if word in model.wv.key_to_index]
    for test_word in test_word_list:
        if test_word in known_words:
            print(f"Words similar to {test_word}")
            print(model.wv.most_similar(test_word, topn=20))
            print()
        else:
            print(f"{test_word} is not present in word2vec corpus")

    # Pairwise similarity table, restricted to known words to avoid a KeyError.
    headers = ['Word'] + known_words
    rows = [[word2] + [model.wv.similarity(word1, word2) for word1 in known_words] for word2 in known_words]
    show_markdown_table(headers, rows)

    # PCA onto two components needs at least two vectors.
    if len(known_words) < 2:
        return

    # Graphing code from: https://www.askpython.com/python-modules/gensim-word2vec
    X = model.wv[known_words]
    pca = PCA(n_components=2)
    result = pca.fit_transform(X)

    plt.scatter(result[:, 0], result[:, 1])
    for i, word in enumerate(known_words):
        plt.annotate(word, xy=(result[i, 0], result[i, 1]))
    plt.show()

## Terms to analyze

Select ten terms from your answers to Questions 3 and 4 in the **Analyzing Topics** section above:

**tom, huck, jim, king, duke, father, river, raft, lie, steal**


Explain why you selected each of these terms:

**I selected the first six terms because they are all characters. The next two are a place and an object that commonly appear in the book. The last two are actions that I think commonly occur in Huck Finn.**

Create a variable `term_list` below as a list of your ten selected terms.

In [None]:
# Place your ten terms in this list
term_list = ['tom', 'huck', 'jim', 'king', 'duke', 'father', 'river', 'raft', 'lie', 'steal']

## Term Analysis 1

Run the `word_similarities()` function with your terms in `term_list`. Then answer the questions that follow.
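
`Word2Vec` drops rare words: with gensim's default `min_count=5`, a word seen fewer than five times gets no vector, which is why `word_similarities()` checks `key_to_index` before looking a term up. A sketch of the same frequency cutoff on plain token lists follows. The helper `rare_terms` and the toy sentences are hypothetical; the real model applies the cutoff internally.

```python
from collections import Counter
from typing import Dict, Iterable, List


def rare_terms(sentences: Iterable[List[str]], terms: List[str], min_count: int = 5) -> Dict[str, int]:
    """Map each term that falls below min_count to its corpus frequency."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {term: counts[term] for term in terms if counts[term] < min_count}


# Toy corpus for illustration only.
toy_sentences = [["jim", "raft", "river"]] * 5 + [["huck", "steal"]]
print(rare_terms(toy_sentences, ["jim", "steal", "father"]))
```

The `sentences` object defined earlier yields lists of tokens, so it could be passed in place of `toy_sentences` together with `term_list`.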

In [None]:
word_similarities(model, term_list)

## Term Analysis 1 Questions

1. For each term, what similar terms are most pertinent? 
   - **tom: jim is the most similar. Tom does not appear much in the story, but when he does, it is usually with jim.**
   - **huck: jim is not the most similar term, but it is included in the list. It makes sense because they spend the majority of the story together.**
   - **jim: the only word similar to jim that has some discernible meaning is raft, because it is the setting of most of the story.**
   - **king, duke: both of these characters have some similarity to jim, which makes sense given that they are all runaways.**
   - **father: the only term that seems to have some relation to the story is dollars. Huck's father only comes into the story when he finds out about Huck's fortune.**
   - **river: raft, jim, water, run.**
   - **raft: run, jim, town.**
   - **lie: men, man, young, people.**
   - **steal: take is the only similar word that is comparable to steal. I find it interesting that the only character in this list is jim, because huck certainly does most, if not all, of the stealing.**

2. Which similar terms are least pertinent?
**For the most part, the similar terms include generic terms that are altered slightly by the dialect spoken in the text: knowed, warnt, id, em, reckoned.**

3. What insights about the structure of the book are represented by the pertinent similar terms?
**These terms show some relationships among the characters, places, and actions in the story. But I feel that they do not represent the most important relationships in the text. At least, they do not match my assumptions about what those relationships would be.**

## Term Analysis 2

Repeat the above analysis using a Word2Vec model of the entire corpus.

In [None]:
sentences = w2v.PathLineSentences('/kaggle/input/csci-270-tokenized-books-2022', max_sentence_length=5000)
model = w2v.Word2Vec(sentences)
word_similarities(model, term_list)

## Term Analysis 2 Questions

1. What did you find to be the most striking differences in the lists of similar terms?

**The most striking difference is that most of the similarity values are lower than they were when analyzing the book alone. The next most apparent difference is that the characters share distinct similarities with characters from other books.**

2. Overall, is it more beneficial to train Word2Vec on your book alone or on the whole corpus? Support your answer with specific details as to how the content of your book interacts with the similarity lists Word2Vec generated.

**I feel that it is significantly better to train Word2Vec on a whole corpus. In my experience, the terms and topics generated from my book alone were overwhelmingly generic and bland. When analyzing the whole corpus, there is at least more data to compare with the specific terms from your book, even if it takes a bit of research to determine the relationships.**
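
The comparison between the single-book model and the whole-corpus model can be made more concrete by checking how many nearest neighbors the two models share for a given term. The sketch below assumes neighbor lists in the `(word, score)` shape returned by `most_similar()`. The helper `compare_neighbors` and the sample lists are hypothetical, not real model output.

```python
from typing import Dict, List, Tuple

Neighbors = List[Tuple[str, float]]


def compare_neighbors(book: Neighbors, corpus: Neighbors) -> Dict[str, List[str]]:
    """Split neighbor words into those shared by both models and those unique to one."""
    book_words = [word for word, _ in book]
    corpus_words = [word for word, _ in corpus]
    return {
        "shared": [w for w in book_words if w in corpus_words],
        "book_only": [w for w in book_words if w not in corpus_words],
        "corpus_only": [w for w in corpus_words if w not in book_words],
    }


# Hypothetical neighbor lists, not real model output.
book_neighbors = [("jim", 0.99), ("raft", 0.98), ("town", 0.97)]
corpus_neighbors = [("jim", 0.71), ("sawyer", 0.69), ("raft", 0.65)]
print(compare_neighbors(book_neighbors, corpus_neighbors))
```

A long `corpus_only` list for a character name would be consistent with the observation that the whole-corpus model links characters to characters from other books.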

## Concluding Analysis

1. In what ways did you find LDA useful in summarizing aspects of your book?

**I feel that LDA did not do a sufficient job of determining topics for my book. This is probably because of the way Huck Finn is told. I imagine that others had an easier time determining meaning from their topics.**

2. In what ways did you find Word2Vec useful in finding and documenting connections among ideas, concepts, and characters in your book?

**I did not find it useful in matching my intuition about the relationships among characters, settings, and actions. But it did find some similarities that I did not expect. Overall, I think that Word2Vec did a fair job, considering how much data it is processing.**
