In [1]:
import pickle as pkl
import pandas as pd

# Motivation

## What is your dataset?

The raw dataset consisted of data from two sources:
1. The original Harry Potter books taken from: https://github.com/formcept/whiteboard/tree/master/nbviewer/notebooks/data/harrypotter
2. A list of Harry Potter characters taken from:
    1. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Philosopher%27s_Stone_(character_index)
    2. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Chamber_of_Secrets_(character_index)
    3. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Prisoner_of_Azkaban_(character_index)
    4. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Goblet_of_Fire_(character_index)
    5. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Order_of_the_Phoenix_(character_index)
    6. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Half-Blood_Prince_(character_index)
    7. https://harrypotter.fandom.com/wiki/Harry_Potter_and_the_Deathly_Hallows_(character_index)

For each character, we additionally scraped their aliases from their individual wiki pages.

The datasets were collected using API requests and BeautifulSoup to scrape the data from the websites. The specific functions used for web scraping can be found in `1.Dataset_files/get_Books.ipynb`, `1.Dataset_files/get_Characters.ipynb` and `1.Dataset_files/get_CharacterWikis.ipynb`. Dataset statistics before any preprocessing are shown below:

In [2]:
len_books = []

book_titles = ["Book 1 - The Philosopher's Stone",
               "Book 2 - The Chamber of Secrets",
               "Book 3 - The Prisoner of Azkaban",
               "Book 4 - The Goblet of Fire",
               "Book 5 - The Order of the Phoenix",
               "Book 6 - The Half Blood Prince",
               "Book 7 - The Deathly Hallows"]

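for title in book_titles:
    # Record (word count, character count); words are approximated by splitting on single spaces.
    with open("1.Dataset_files/OriginalBooks/" + title + ".txt", "r") as f:
        book = f.read()
    len_books.append((len(book.split(" ")), len(book)))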

print(f"The raw-book dataset consists of {len(len_books)} books, with {sum([l[0] for l in len_books])} words, and {sum([l[1] for l in len_books])} characters.")
print(f"The raw character-list for all 7 books consisted of {len(pd.read_csv('1.Dataset_files/CharacterWikis.csv'))} characters.")


The raw-book dataset consists of 7 books, with 1168851 words, and 6765119 characters.
The raw character-list for all 7 books consisted of 707 characters.


## Why did you choose this/these particular dataset(s)?

It should be no secret that all three of us consider ourselves huge "Potterheads". We have all read and loved the books multiple times. For us, the reasoning behind choosing this dataset was therefore quite simple: we wanted to work with something we were passionate about and had some degree of expert knowledge in. Choosing this dataset also enabled us to work with more advanced network and text analysis techniques, such as temporal analysis of networks, wordclouds and sentiment analysis.

## What was your goal for the end user's experience?

With our dataset, we set out to explore the following key research questions:
1. **HOW DO THE BOOKS CHANGE OVER TIME?**: The Harry Potter books were written by J.K. Rowling and first published between 26 June 1997 and 21 July 2007. The first book (Harry Potter and the Philosopher's Stone) was originally marketed as a children's book, but as the next six books were published over a span of 10 years, the primary audience grew up. We therefore wanted to understand whether the topics discussed in the books became more complex and more adult over time. We answer this question by 1) performing a temporal analysis of the character graphs, to understand whether the number of key characters and the number of interactions change over time to address a more mature reader with better memory, and 2) examining the full book texts through wordclouds and LIX scores over time, to understand whether the language and the themes discussed become more adult and complex.
2. **HOW DO THE CHARACTERS, THEIR INTERACTIONS / RELATIONSHIPS, AND LANGUAGE SURROUNDING THEM CHANGE THROUGHOUT KEY MOMENTS IN THE HARRY POTTER BOOKS?**: It is no secret that what makes a book or series of books exciting to read are its shocking key moments, where big changes in character arcs happen. We therefore wanted to understand how the characters, their interactions and the language surrounding them change throughout a number of identified key moments in the Harry Potter books. We answer this question by 1) performing a temporal analysis of the character graphs, to understand how key moments affect sub-communities (i.e. changes in alliance), and 2) examining the book text that surrounds a character at a specific time through wordclouds and sentiment analysis, to understand whether characters are portrayed in a different light as developments in their character arcs are uncovered.
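
For readers unfamiliar with the LIX score mentioned in the first question: it is a readability index computed as the average sentence length plus the percentage of long words (words with more than six letters). The snippet below is a minimal sketch of that formula, not necessarily the implementation used later in this notebook. The function name `lix_score` and the simple regex-based tokenisation are assumptions made for illustration.

```python
import re


def lix_score(text: str) -> float:
    """Approximate LIX = words / sentences + 100 * long_words / words.

    Simplifying assumptions (for illustration only):
    - a word is any run of word characters, so "Harry's" counts as two words;
    - a sentence ends at any run of '.', '!', '?' or ':';
    - a long word has more than six characters.
    """
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    # Count at least one sentence so text without punctuation does not divide by zero.
    n_sentences = max(1, len(re.findall(r"[.!?:]+", text)))
    n_long = sum(1 for word in words if len(word) > 6)
    return len(words) / n_sentences + 100 * n_long / len(words)


sample = "The cat sat. The wonderful magician disappeared."
print(round(lix_score(sample), 2))
```

A higher score indicates text that is harder to read, so computing the score per book gives one number per book that can be compared over time.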

In summary, our main goal for the end user's experience was to provide a tool that could help them explore the Harry Potter books in a new way. We wanted to create new-found excitement in the user's mind about this phenomenal book series, and to give some reasoning behind what makes the books so good, namely their dynamically changing and expanding world of characters, with their motives and relationships.