# Project 1: TF-IDF + Visualization - ASoIaF

The goal of this exercise is to analyze the A Song of Ice and Fire (ASoIaF) series by examining the frequency and distribution of words across the corpus. As a simple example, we visualize the 10 bigrams with the highest TF-IDF weights in each book of the series, which highlights the key characters of each book.

**-------------------------------------------------------SPOILER ALERT!-------------------------------------------------------**

Among all books in the series, the bigram "ned said" has the highest TF-IDF weight, and it appears only in the first book's list of top 10 bigrams. A closer look at that list shows that the name "ned" appears in 3 of the top 10 bigrams from *A Game of Thrones*. This matches the storyline: the first book of the series ends with the key character Eddard "Ned" Stark being executed.

--------------------------------------------------------------------------------------------------------------------------------
## Import Data

In [2]:
import pandas as pd

# map each book title to the text file that holds it
d = {
    'Book': ["A Game of Thrones", "A Clash of Kings", "A Storm of Swords", "A Feast for Crows", "A Dance With Dragons"],
    'Filename': ['./files/001ssb.txt', './files/002ssb.txt', './files/003ssb.txt', './files/004ssb.txt', './files/005ssb.txt'],
}
df = pd.DataFrame(data=d)
df

|   | Book | Filename |
|---|------|----------|
| 0 | A Game of Thrones | ./files/001ssb.txt |
| 1 | A Clash of Kings | ./files/002ssb.txt |
| 2 | A Storm of Swords | ./files/003ssb.txt |
| 3 | A Feast for Crows | ./files/004ssb.txt |
| 4 | A Dance With Dragons | ./files/005ssb.txt |


## TF-IDF

### Step 1: Read Text

In [3]:
# create list to store text
text_list = []

for filename in df['Filename']:
    with open(filename) as file:
        text = "".join(file.readlines()[1:])
    text_list.append(text)

### Step 2: Clean Text

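Tokenize and clean the text of each book. The cleaned tokens are kept in `words_list` (one list per book) and then joined back into one string per book in `corpus`.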

In [4]:
# split into words
from nltk.tokenize import word_tokenize
tokens_list = [word_tokenize(text) for text in text_list]
# convert to lower case
tokens_list = [[w.lower() for w in tokens] for tokens in tokens_list]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped_list = [[w.translate(table) for w in tokens] for tokens in tokens_list]
# remove remaining tokens that are not alphabetic
words_list = [[word for word in stripped if word.isalpha()] for stripped in stripped_list]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words_list = [[w for w in words if w not in stop_words] for words in words_list]

In [5]:
corpus = []

for word_list in words_list:
    word = ' '.join(word_list)
    corpus.append(word)
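
The cleaning steps above can be sketched on a single sentence. This is a minimal, dependency-free illustration rather than the notebook's exact pipeline: `str.split()` stands in for `nltk.word_tokenize`, and `STOP_WORDS` is a hypothetical mini list instead of NLTK's English stop words. It also shows why a name such as "H'ghar" ends up as "hghar" in the results: the apostrophe is removed together with the rest of the punctuation.

```python
import string

# Hypothetical mini stop-word list; the notebook uses NLTK's full English list.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "was"}

def clean(text):
    """Lower-case, strip punctuation, keep alphabetic tokens that are not stop words."""
    # str.split() is a stand-in for nltk.word_tokenize (assumption for this sketch)
    tokens = [w.lower() for w in text.split()]
    # delete every ASCII punctuation character, including apostrophes
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    return [w for w in stripped if w.isalpha() and w not in STOP_WORDS]

sample = "Jaqen H'ghar was the 3rd man to speak."
cleaned = clean(sample)
print(cleaned)
```

Tokens that still contain digits after stripping, such as "3rd", are dropped by the `isalpha()` check.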

### Step 3: Compute TF-IDF

Compute the TF-IDF weight of every bigram (a 2-word term), then list the 10 highest-weighted bigrams for each book. The line that would save each list to a CSV file is left commented out.
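
To make the weights less of a black box, here is a small pure-Python sketch of how a bigram TF-IDF weight can be computed. It assumes the scheme that `TfidfVectorizer` uses with its default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row). The helper names and the toy corpus are made up for illustration, and the `min_df`/`max_df` filtering is left out.

```python
import math
from collections import Counter

def bigrams(text):
    """Return the list of adjacent word pairs in a whitespace-separated string."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

def tfidf_rows(docs):
    """Return one {bigram: weight} dict per document, L2-normalized."""
    counts = [Counter(bigrams(d)) for d in docs]
    n = len(docs)
    # document frequency: in how many documents each bigram occurs
    df = Counter(term for c in counts for term in c)
    rows = []
    for c in counts:
        raw = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

# toy corpus, invented for this sketch
toy = ["ned said ned said winter", "jon said winter comes", "ned said winter comes"]
rows = tfidf_rows(toy)
for row in rows:
    print(max(row, key=row.get))
```

A bigram scores high in a document when it is frequent there and rare elsewhere, which is why character names tied to one book tend to rise to the top.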

In [6]:
# Bigrams:
from sklearn.feature_extraction.text import TfidfVectorizer
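# extra stop word; named so that it does not shadow the nltk `stopwords` module imported earlier
extra_stop_words = ['chapter']
vectorizer = TfidfVectorizer(min_df=2, max_df=.5, ngram_range=(2,2), stop_words=extra_stop_words)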
tfidf = vectorizer.fit_transform(corpus)
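
The combination `min_df=2, max_df=.5` is worth a closer look. As far as I understand scikit-learn's filtering, an integer threshold is an absolute document count and a float is a proportion of the corpus, so with 5 books the upper bound is 0.5 × 5 = 2.5 documents. Together with the lower bound of 2, that would keep only bigrams that occur in exactly 2 of the 5 books. The sketch below mimics that filter in plain Python; `kept_terms` and the placeholder terms are hypothetical and not taken from the books.

```python
from collections import Counter

def kept_terms(docs_terms, min_df=2, max_df=0.5):
    """Return the terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs_terms)
    # count each term once per document
    df = Counter(t for terms in docs_terms for t in set(terms))
    # int => absolute document count, float => proportion of documents
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, k in df.items() if low <= k <= high)

# five placeholder "books", each reduced to a set of made-up bigrams
books = [
    {"alpha beta", "common pair"},
    {"alpha beta", "common pair", "only once"},
    {"common pair", "gamma delta"},
    {"common pair"},
    {"common pair", "gamma delta"},
]
kept = kept_terms(books)
print(kept)
```

Here "common pair" (5 books) and "only once" (1 book) are dropped, while the two bigrams that occur in exactly 2 books survive.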

In [8]:
import numpy as np
# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 use get_feature_names() instead
terms = vectorizer.get_feature_names_out()
for i in range(len(corpus)):
    # row i of the TF-IDF matrix holds the weight of every bigram in book i
    weights = np.asarray(tfidf[i].todense()).ravel()
    weights_df = pd.DataFrame({'Term': terms, 'TF-IDF weight': weights})
    # keep the 10 highest-weighted bigrams
    weights_df = weights_df.sort_values(by='TF-IDF weight', ascending=False).head(10)
    weights_df = weights_df.reset_index(drop=True)
    print('------------' + df['Book'][i] + '------------')
    print(weights_df)
#    weights_df.to_csv("bigrams_" + str(i+1) + ".csv")

------------A Game of Thrones------------
   TF-IDF weight               Term
0       0.471123           ned said
1       0.106234           ned told
2       0.078520  littlefinger said
3       0.069283        vayon poole
4       0.064664           said ned
5       0.064664        rodrik said
6       0.060045    stallion mounts
7       0.060045       mounts world
8       0.050807          bran robb
9       0.046189        jory cassel
------------A Clash of Kings------------
   TF-IDF weight              Term
0       0.203817       ser cortnay
1       0.161157       ser jacelyn
2       0.123238       jaqen hghar
3       0.113758   maester cressen
4       0.109018       black betha
5       0.090059     lady hornwood
6       0.080579  thoren smallwood
7       0.075839       lady selyse
8       0.071099   cortnay penrose
9       0.066359   jacelyn bywater
------------A Storm of Swords------------
   TF-IDF weight              Term
0       0.115351  tom sevenstrings
1       0.090633      ye

## Visualization

Visualize the bigrams with the top 10 TF-IDF weights for each book.

In [9]:
%%HTML
<script type='text/javascript' src='https://us-west-2b.online.tableau.com/javascripts/api/viz_v1.js'></script><div class='tableauPlaceholder' style='width: 1000px; height: 1927px;'><object class='tableauViz' width='1000' height='1927' style='display:none;'><param name='host_url' value='https%3A%2F%2Fus-west-2b.online.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='&#47;t&#47;jessiecreates' /><param name='name' value='ASoIaFVisualization&#47;ASoIaF' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='showAppBanner' value='false' /><param name='filter' value='iframeSizedToWindow=true' /></object></div>