### Extracting word cloud data

Much more can be done with the text of the books. We now explore how to extract the words that best characterise each book, with the goal of building a word-cloud-style visualisation. To do so, we reload the individual files and convert each book into a single large string containing its full text.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from helper.constantes import *
import numpy as np
import pickle

In [2]:
all_books = []
for i in range(1,8):
    # Read the whole book into a single string
    with open(data_folder+hpbooks_folder+f"hp{i}.txt") as f:
        all_books.append(f.read())

In [3]:
def fit_transform_return(corpus, min_df=0.2, max_df=1.0, ngram_range=(1, 4)):
    # English stopwords plus a few corpus-specific ones
    final_stopwords_list = stopwords.words('english') + ["said", "page", "mind"]
    tfidf = TfidfVectorizer(min_df=min_df, max_df=max_df,
                            stop_words=final_stopwords_list,
                            use_idf=True, ngram_range=ngram_range)
    # Rows of X are books, columns are words / n-grams
    X = tfidf.fit_transform(corpus)
    feature_names = np.array(tfidf.get_feature_names_out())
    return X, feature_names

In [4]:
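def get_all_top_tf_idf_words(full_x, feature_names, top_n=2):
    def get_top_tf_idf_words(x):
        # Positions of the top_n largest non-zero scores, in descending order
        sorted_nzs = np.argsort(x.data)[:-(top_n+1):-1]
        # Map those positions back to column indices, then to feature names
        return feature_names[x.indices[sorted_nzs]]
    return [get_top_tf_idf_words(cur) for cur in full_x]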
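
The helper above works on one sparse row at a time: `x.data` holds the non-zero TF-IDF scores and `x.indices` the matching column numbers. The sketch below reproduces the same lookup with plain Python lists, only to make the indexing explicit; the names `data`, `indices` and the scores are made up for illustration.

```python
# Hypothetical sparse row: only non-zero scores are stored,
# together with the column index each score belongs to.
feature_names = ["azkaban", "dobby", "harry", "lupin", "ron"]
indices = [2, 0, 3, 4]          # columns that have a non-zero score
data = [0.9, 0.3, 0.5, 0.1]     # score for each of those columns

top_n = 2

# Rank the positions inside `data` by score, highest first, and keep top_n.
order = sorted(range(len(data)), key=lambda k: data[k], reverse=True)[:top_n]

# A position in `data` is not a column number: go through `indices` first.
top = [feature_names[indices[k]] for k in order]
print(top)
```

The key point is the two-step lookup: sorting gives positions within the stored non-zero values, and `indices` translates those positions into vocabulary columns.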

In [5]:
def save_list(words):
    with open("word_cloud", "wb") as fp:   #Pickling
        pickle.dump(words, fp)

In [6]:
x, feat = fit_transform_return(all_books, 0.2, 1.0, (1,4))
get_all_top_tf_idf_words(x,feat, 25)

[array(['harry', 'potter', 'ron', 'stone', 'harry potter', 'hagrid',
        'rowling', 'hermione', 'back', 'one', 'know', 'got', 'could',
        'get', 'like', 'professor', 'see', 'snape', 'looked', 'quirrell',
        'dumbledore', 'around', 'dudley', 'go', 'going'], dtype=object),
 array(['harry', 'ron', 'chamber secrets', 'potter', 'secrets', 'chamber',
        'harry potter', 'rowling', 'lockhart', 'hermione', 'back', 'one',
        'malfoy', 'could', 'dobby', 'professor', 'got', 'like', 'riddle',
        'weasley', 'around', 'know', 'hagrid', 'dumbledore', 'go'],
       dtype=object),
 array(['harry', 'prisoner', 'ron', 'hermione', 'azkaban', 'potter',
        'lupin', 'harry potter', 'rowling', 'professor', 'black', 'back',
        'one', 'hagrid', 'snape', 'around', 'like', 'looked', 'could',
        'see', 'know', 'got', 'professor lupin', 'get', 'malfoy'],
       dtype=object),
 array(['harry', 'potter', 'ron', 'fire', 'harry potter', 'goblet',
        'hermione', 'rowling',

We can see that several parameters control which words are extracted for each book. In this first run, the words "harry", "potter", "hermione" and "ron" appear in the results for most books, which is expected since they are the main characters of the saga. For an interesting visualisation, however, we want different words for each book. To get them, we tune the parameters by requiring the document frequency to be strictly below 1.0, i.e. a word must not appear in all the books. As already shown, we also remove the most common English stopwords and add custom stopwords such as 'said' or 'mind', which mostly belong to the narration of dialogue. Below, we consider a second version that uses a maximum document frequency of 90% (the call also widens the n-gram range to (1, 6)).
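
To make the effect of the document-frequency bounds concrete, here is a minimal pure-Python sketch on a toy corpus of seven tiny "books". It assumes, as the scikit-learn documentation describes, that float values of `min_df` and `max_df` are read as proportions of documents and that the default smoothed IDF is `ln((1 + n) / (1 + df)) + 1`. The corpus and the helper names are hypothetical and are not part of the notebook.

```python
import math

# Toy corpus standing in for the seven books (made-up content).
docs = [
    "harry stone quirrell",
    "harry chamber dobby",
    "harry azkaban lupin",
    "harry goblet dobby",
    "harry phoenix umbridge dobby",
    "harry prince slughorn dobby",
    "harry hallows dobby",
]
n_docs = len(docs)


def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in doc.split() for doc in docs)


def smoothed_idf(term):
    """Smoothed IDF; it equals 1 (not 0) for a term present in every document."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1


def kept(term, min_df=0.2, max_df=1.0):
    """True if the term's document-frequency ratio lies within [min_df, max_df]."""
    ratio = document_frequency(term) / n_docs
    return min_df <= ratio <= max_df


for term in ("harry", "dobby", "quirrell"):
    print(term, document_frequency(term), round(smoothed_idf(term), 3),
          kept(term, max_df=1.0), kept(term, max_df=0.9))
```

In this sketch a term present in all seven documents still gets a non-zero weight, so a very frequent name can dominate the ranking until `max_df` drops below 1.0. With seven documents, `max_df=0.9` removes only terms found in all seven, while `min_df=0.2` also drops terms found in a single document.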

In [7]:
x, feat = fit_transform_return(all_books, 0.2, 0.9, (1,6))
words = get_all_top_tf_idf_words(x,feat, 30)
words

[array(['quirrell', 'flamel', 'mr dursley', 'troll', 'professor quirrell',
        'norbert', 'ronan', 'mrs dursley', 'piers', 'nicolas', 'firenze',
        'turban', 'mr ollivander', 'bane', 'ollivander', 'unicorn',
        'gotten', 'nimbus two', 'nicolas flamel', 'griphook',
        'two thousand', 'nimbus two thousand', 'flint', 'nimbus',
        'house cup', 'mr potter', 'sorcerer', 'scabbers', 'get yer',
        'third floor'], dtype=object),
 array(['chamber secrets', 'secrets', 'lockhart', 'dobby', 'riddle',
        'myrtle', 'diary', 'gilderoy', 'gilderoy lockhart', 'colin',
        'justin', 'heir', 'basilisk', 'moaning myrtle', 'fawkes', 'lucius',
        'elf', 'ernie', 'sir dobby', 'borgin', 'aragog', 'mr borgin',
        'lucius malfoy', 'mandrakes', 'myrtle bathroom', 'heir slytherin',
        'riddle diary', 'attacks', 'hospital wing', 'fifty years'],
       dtype=object),
 array(['prisoner', 'azkaban', 'lupin', 'professor lupin', 'pettigrew',
        'sirius', 'scabber

Just by removing the words and n-grams that appear in every book, we obtain meaningful words for each book. Indeed, the top terms for each book correspond to characters and events that are specific to that book or that rarely appear in the others. Using these results, we will be able to produce the expected word cloud visualisations and make them evolve over time.

In [32]:
import json

# Save to json
word_cloud_json = json.dumps(dict(zip(range(0, 7), [book.tolist() for book in words])))

with open('../data/cleaned/word_cloud.json', 'w') as f:
    f.write(word_cloud_json)
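
One detail to keep in mind when the file is read back for the visualisation: JSON object keys are always strings, so the integer book indices used above come back as `"0"`, `"1"`, and so on. The sketch below shows the round trip in memory; `top_words` is a small hypothetical stand-in for the real lists.

```python
import json

# Hypothetical stand-in for the per-book word lists computed above.
top_words = [["quirrell", "flamel"], ["lockhart", "dobby"]]

# Same construction as in the notebook: integer index -> list of words.
payload = json.dumps(dict(zip(range(len(top_words)), top_words)))

loaded = json.loads(payload)
print(list(loaded.keys()))

# Rebuild the ordered list by looking the books up with string keys.
restored = [loaded[str(i)] for i in range(len(loaded))]
```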

In [14]:
# Pickle
save_list(words)