# Project2 Part1 - Text Analysis through TFIDF computation


In [181]:
from text_analyzer import read_sonnets, clean_corpus, tf, get_top_k, idf, tf_idf, cosine_sim

import pandas as pd
import numpy as np
import plotly.express as px

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [182]:
# run text_analyzer.py with default arguments
!python text_analyzer.py


Sonnet 1 TF (Top 20):
[('the', 6), ('thy', 5), ('to', 4), ('and', 3), ('tender', 2), ('by', 2), ('but', 2), ('self', 2), ('thine', 2), ('own', 2), ('thou', 2), ('that', 2), ('his', 2), ('worlds', 2), ('might', 2), ('within', 1), ('or', 1), ('only', 1), ('due', 1), ('fairest', 1)]
Corpus TF (Top 20):
[('and', 491), ('the', 430), ('to', 408), ('my', 397), ('of', 372), ('i', 343), ('in', 322), ('that', 320), ('thy', 287), ('thou', 235), ('with', 181), ('for', 171), ('is', 168), ('not', 166), ('a', 166), ('me', 164), ('but', 163), ('love', 162), ('thee', 162), ('so', 144)]
Corpus IDF (Top 20):
[('sullied', 5.0369526024136295), ('played', 5.0369526024136295), ('space', 5.0369526024136295), ('warrior', 5.0369526024136295), ('Wooing', 5.0369526024136295), ('Compare', 5.0369526024136295), ('impregnable', 5.0369526024136295), ('rudely', 5.0369526024136295), ('Bound', 5.0369526024136295), ('divide', 5.0369526024136295), ('delights', 5.0369526024136295), ('unbless', 5.0369526024136295), ('

## a. Read about argparse.
Look at its implementation in the Python Script. Follow the instruction and answer the questions in the Argparse section.

The argparse module simplifies the handling of command-line arguments. A program that uses it gets help and usage messages generated automatically, so a user running it from the command line can see which arguments it expects, what types they take, and how to invoke it; invalid arguments produce an error message. To use it, create an `ArgumentParser` instance, attach argument specifications with `add_argument()`, and call `parse_args()` to read the values.
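As a minimal sketch of that pattern, the snippet below builds a parser with two options. The option names (`--directory`, `--top-k`) and their defaults are assumptions chosen for illustration; the actual arguments defined in `text_analyzer.py` may differ.

```python
import argparse


def build_parser():
    """Create a parser with two illustrative (hypothetical) options."""
    parser = argparse.ArgumentParser(
        description="Text analysis through TF-IDF computation"
    )
    # Hypothetical option: where the sonnet files live.
    parser.add_argument(
        "-d", "--directory",
        default="data/shakespeare_sonnets/",
        help="directory containing the sonnet text files",
    )
    # Hypothetical option: how many top-ranked entries to print.
    parser.add_argument(
        "-k", "--top-k",
        type=int,
        default=20,
        help="number of top entries to print",
    )
    return parser


parser = build_parser()

# Passing an explicit list avoids reading sys.argv (useful inside a notebook).
defaults = parser.parse_args([])
custom = parser.parse_args(["--top-k", "5"])
print(defaults.directory, defaults.top_k)
print(custom.top_k)
```

Running a script that uses such a parser with `-h` prints the generated help text, which is the behavior described above.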

## b. Read and Clean the data

In [183]:
d_corpus='data/shakespeare_sonnets/'

# return dictionary with keys corresponding to file names and values being the respective contents
corpus = read_sonnets(d_corpus)

# return corpus (dict) with each sonnet cleaned and tokenized for further processing
corpus = clean_corpus(corpus)

In [199]:
corpus['1']

['From',
 'fairest',
 'creatures',
 'we',
 'desire',
 'increase',
 'That',
 'thereby',
 'beautys',
 'rose',
 'might',
 'never',
 'die',
 'But',
 'as',
 'the',
 'riper',
 'should',
 'by',
 'time',
 'decease',
 'His',
 'tender',
 'heir',
 'might',
 'bear',
 'his',
 'memory',
 'But',
 'thou',
 'contracted',
 'to',
 'thine',
 'own',
 'bright',
 'eyes',
 'Feedst',
 'thy',
 'lights',
 'flame',
 'with',
 'selfsubstantial',
 'fuel',
 'Making',
 'a',
 'famine',
 'where',
 'abundance',
 'lies',
 'Thy',
 'self',
 'thy',
 'foe',
 'to',
 'thy',
 'sweet',
 'self',
 'too',
 'cruel',
 'Thou',
 'that',
 'art',
 'now',
 'the',
 'worlds',
 'fresh',
 'ornament',
 'And',
 'only',
 'herald',
 'to',
 'the',
 'gaudy',
 'spring',
 'Within',
 'thine',
 'own',
 'bud',
 'buriest',
 'thy',
 'content',
 'And',
 'tender',
 'churl',
 'makst',
 'waste',
 'in',
 'niggarding',
 'Pity',
 'the',
 'world',
 'or',
 'else',
 'this',
 'glutton',
 'be',
 'To',
 'eat',
 'the',
 'worlds',
 'due',
 'by',
 'the',
 'grave',
 'and',

## c. TF

In [185]:
# assign sonnet 1 (1.txt) to a variable and find its TF (note: corpus is a dict, while sonnet1 is a list of tokens)
sonnet1 = corpus['1']

# determine tf of sonnet
sonnet1_tf = tf(sonnet1)

# get sorted list and slice out top 20
sonnet1_top20 = get_top_k(sonnet1_tf)

print()
print("Sonnet 1 (Top 20):")
df = pd.DataFrame(sonnet1_top20, columns=["word", "count"])
df.head(20)

Sonnet 1 (Top 20):


| | word | count |
|---|---|---|
| 0 | the | 6 |
| 1 | thy | 5 |
| 2 | to | 4 |
| 3 | and | 3 |
| 4 | self | 2 |
| 5 | thine | 2 |
| 6 | by | 2 |
| 7 | own | 2 |
| 8 | worlds | 2 |
| 9 | his | 2 |


In [186]:
kv_dict = {'apple': 5.0, 'banana': 3.0, 'orange': 2.5, 'peach': 1.0}
get_top_k(kv_dict, 2)

[('apple', 5.0), ('banana', 3.0)]

In [187]:
# TF of entire corpus
flattened_corpus = [word for sonnet in corpus.values() for word in sonnet]
corpus_tf = tf(flattened_corpus)
corpus_top20 = get_top_k(corpus_tf)
# print
# print("Corpus TF (Top 20):")
df = pd.DataFrame(corpus_top20, columns=["word", "count"])
df.head(20)

| | word | count |
|---|---|---|
| 0 | and | 491 |
| 1 | the | 430 |
| 2 | to | 408 |
| 3 | my | 397 |
| 4 | of | 372 |
| 5 | i | 343 |
| 6 | in | 322 |
| 7 | that | 320 |
| 8 | thy | 287 |
| 9 | thou | 235 |


### Q: Discussion
Do you believe the most frequent words would discriminate between documents well? Why or why not? Any thoughts on how we can improve this representation? Does there appear to be any ‘noise’? If so, where? If not, it should be clear by the end of the assignment.

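I do not think that the most frequent words alone are enough to discriminate between documents. The top words are all common function words (conjunctions, articles, prepositions, and pronouns such as "and", "the", "to", and "my") that appear in nearly every sonnet and say little about the content of any one of them. These words are the noise in this representation. It could be improved by removing such stop words or by down-weighting words that occur in most documents, which is what IDF does later in this assignment.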
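One way to reduce the dominance of function words is to filter them out before counting. The sketch below is illustrative only: the function name `tf_without_stop_words` and the tiny stop-word set are assumptions, not part of `text_analyzer.py`.

```python
from collections import Counter

# Small illustrative stop-word set (an assumption; a real list would be longer).
STOP_WORDS = {"and", "the", "to", "my", "of", "i", "in", "that", "a", "is", "we"}


def tf_without_stop_words(tokens, stop_words=STOP_WORDS):
    """Count lowercased tokens, skipping any that are in stop_words."""
    lowered = (token.lower() for token in tokens)
    return Counter(token for token in lowered if token not in stop_words)


# First two lines of sonnet 1, as tokenized earlier in this notebook.
tokens = ["From", "fairest", "creatures", "we", "desire", "increase",
          "That", "thereby", "beautys", "rose", "might", "never", "die"]
counts = tf_without_stop_words(tokens)
print(counts.most_common(5))
```

A fixed stop-word list is a blunt instrument, since it has to be curated by hand for early modern English ("thy", "thou"); weighting by document frequency avoids that manual step.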

## d. IDF

In [189]:
# IDF of corpus
corpus_idf = idf(corpus)
corpus_idf_ordered = get_top_k(corpus_idf)
# print top 20 to add to report
df = pd.DataFrame(corpus_idf_ordered, columns=["word", "score"])
df.head(20)

| | word | score |
|---|---|---|
| 0 | tresses | 5.036953 |
| 1 | consent | 5.036953 |
| 2 | Sweets | 5.036953 |
| 3 | overgoes | 5.036953 |
| 4 | mud | 5.036953 |
| 5 | reckoned | 5.036953 |
| 6 | newappearing | 5.036953 |
| 7 | neercloying | 5.036953 |
| 8 | Nativity | 5.036953 |
| 9 | fore | 5.036953 |


### Q: observe and briefly comment on the difference in top 20 lists (comparing TF of corpus vs its IDF).

The top of the corpus TF list is dominated by words such as "and" and "the". They occur in almost every sonnet, so they are mostly noise and are not useful for telling sonnets apart. The top of the IDF list instead contains words such as "tresses" and "consent", all tied at the same maximum score of about 5.037, which indicates words that occur in very few sonnets. Because these rare words are specific to individual sonnets, IDF is the more useful signal for distinguishing them. The capitalized entries such as "Sweets" and "Nativity" suggest that the IDF step does not lowercase tokens, which would be a remaining source of noise.
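The tied top score can be checked against one common IDF definition, `idf(w) = ln(N / df(w))`, where `N` is the number of documents and `df(w)` the number of documents containing `w`. The sketch below is an assumption about the formula, not the code in `text_analyzer.py`; the name `sketch_idf`, the toy corpus, and the corpus size of 154 (the number of Shakespeare's sonnets, not printed in this notebook) are all illustrative.

```python
import math


def sketch_idf(corpus):
    """Return ln(N / df) for every word, given a dict of token lists."""
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus.values():
        # set() so that a word is counted once per document
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


# Toy corpus: "the" is in every document, "mud" in only one.
toy = {"1": ["the", "rose"], "2": ["the", "mud"], "3": ["the", "rose", "the"]}
scores = sketch_idf(toy)
print(scores["the"], scores["mud"])

# Under the assumed formula, a word found in exactly 1 of 154 sonnets scores:
single_sonnet_score = math.log(154 / 1)
print(round(single_sonnet_score, 6))
```

Under these assumptions `ln(154)` agrees with the 5.036953 shown in the table, which is consistent with each top-ranked word appearing in exactly one sonnet.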

## e. TF-IDF

In [190]:
# TFIDF of Sonnet1 w.r.t. corpus
sonnet1_tfidf = tf_idf(corpus_idf, sonnet1_tf)
sonnet1_tfidf_ordered = get_top_k(sonnet1_tfidf)
# print
# print("Sonnet 1 TFIDF (Top 20):")
df = pd.DataFrame(sonnet1_tfidf_ordered, columns=["word", "score"])
df.head(20)

| | word | score |
|---|---|---|
| 0 | worlds | 7.301316 |
| 1 | tender | 6.490386 |
| 2 | glutton | 5.036953 |
| 3 | foe | 5.036953 |
| 4 | gaudy | 5.036953 |
| 5 | buriest | 5.036953 |
| 6 | selfsubstantial | 5.036953 |
| 7 | herald | 5.036953 |
| 8 | fuel | 5.036953 |
| 9 | niggarding | 5.036953 |


### Q. What is different with this list than just using TF?

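The TF list mainly contained common function words that are universal in Shakespeare's writing. The TF-IDF list instead ranks content words that occur in Sonnet 1 but in few other sonnets. Words such as "worlds" and "tender" score highest because they appear twice in Sonnet 1 and are uncommon in the rest of the corpus, while words like "the" drop out of the top of the list despite their high counts because they appear in almost every sonnet.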
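The re-ranking follows from multiplying each word's count by its IDF. The sketch below assumes the plain product `tf * idf`; the name `sketch_tf_idf` is hypothetical and the real `tf_idf` may differ. The IDF values for "worlds" and "glutton" are back-derived from the tables above under that assumption (7.301316 / 2 and 5.036953 / 1), and the value for "the" is made up purely for illustration.

```python
def sketch_tf_idf(idf_scores, tf_counts):
    """Multiply each word's count by its IDF (0.0 if the word has no IDF)."""
    return {
        word: count * idf_scores.get(word, 0.0)
        for word, count in tf_counts.items()
    }


# Counts taken from the Sonnet 1 TF output shown earlier.
tf_counts = {"the": 6, "worlds": 2, "glutton": 1}

idf_scores = {
    "worlds": 3.650658,   # derived: 7.301316 / 2, assuming tf * idf
    "glutton": 5.036953,  # derived: 5.036953 / 1, assuming tf * idf
    "the": 0.1,           # made-up small value, for illustration only
}

tfidf = sketch_tf_idf(idf_scores, tf_counts)
ranked = sorted(tfidf.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Even with the highest raw count, a word with a near-zero IDF ends up at the bottom of the ranking.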

## f. Compare all documents

In [208]:
# pairwise cosine similarity of the sonnets' TF-IDF vectors, visualized as a heatmap
corpus_idf = idf(corpus)

# one TF-IDF vector (dict) per sonnet, in the corpus' iteration order
sonnet_tfidfs = [tf_idf(corpus_idf, tf(tokens)) for tokens in corpus.values()]

# similarity[a, b] holds the cosine similarity between sonnets a and b
n_sonnets = len(corpus)
similarity = np.zeros((n_sonnets, n_sonnets))
for a in range(n_sonnets):
    for b in range(n_sonnets):
        similarity[a, b] = cosine_sim(sonnet_tfidfs[a], sonnet_tfidfs[b])

fig = px.imshow(similarity, text_auto=False)
fig.show()



### Q. Observe the heatmap. What insight do you get from it?

The heatmap shows the cosine similarity between every pair of sonnets in the corpus. Zooming in on any cell shows the similarity of one particular pair. Every sonnet has the maximum similarity of 1 with itself, which forms the main diagonal, and most other pairs have cosine similarity values in the range of 0 to 0.5.
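For reference, cosine similarity between two sparse TF-IDF vectors stored as dicts can be sketched as follows. The name `sketch_cosine_sim` is hypothetical and the `cosine_sim` in `text_analyzer.py` may be implemented differently; the example vectors are toy values.

```python
import math


def sketch_cosine_sim(vec_a, vec_b):
    """Cosine similarity of two dict vectors; missing words count as 0."""
    # Only words present in both vectors contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[word] * vec_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in vec_a.values()))
    norm_b = math.sqrt(sum(value * value for value in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors (made-up scores) to show the three typical cases.
a = {"worlds": 7.3, "tender": 6.5}
b = {"tender": 3.2, "mud": 5.0}
c = {"tresses": 5.0}

print(sketch_cosine_sim(a, a))  # a vector compared with itself
print(sketch_cosine_sim(a, b))  # partial overlap
print(sketch_cosine_sim(a, c))  # no shared words
```

A vector compared with itself gives 1, which matches the diagonal of the heatmap, and two sonnets with no shared weighted words give 0.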