# 02805 Project B

# Part 1: Motivation

### 1. About the dataset.

> Our dataset consists of the characters from the Harry Potter novels, including character relationships, character attributes, and character stories.<br>We obtained the dataset through the API of the Fandom pages. J.K. Rowling's books will also be used as a data source.

### 2. The reason to choose this/these particular dataset(s).

> <b>First of all</b>, this dataset has a very complete structure of character relations, which makes it perfect material for social network analysis. 
<br><b>Moreover, </b>there are some particularly interesting relationships between worlds to study, such as those between wizards and Muggles (non-wizards), from which we hope to derive a new "X degrees of separation" for the whole universe. 
<br><b>Also,</b> the division into four houses within Hogwarts caught our attention, and we want to find the attributes and patterns that would let everyone find out which house they would be assigned to. 
<br><b>Finally, </b>the Harry Potter novels accompanied the childhood of many readers, and most people are very familiar with the characters and stories, so the outcomes of our exploration will be easier to understand. It is also exciting to make new discoveries in familiar stories and relationships.

### 3. Goal for the end user's experience.

<b>Through our project, we hope that users will be able to:</b>
    
> * View the social network structure of the characters.
> * Get the basic statistics of the social network, such as how many characters are involved (nodes) and how many relationships they have (links).
> * Understand the node attributes and the corresponding network structures (for instance: the four houses; Muggle world & wizarding world; dark wizards & good wizards).
> * Follow the community detection on the social network (compared with the four houses of Hogwarts) and see how the detected communities differ from the original house allocation.
> * Know the keywords for each community.
> * Understand the general emotional tone of the characters in the books through clustering and sentiment analysis of the characters.

<b>As for the rendering of the web page (more technical), we hope to</b>:
 > * Partition the content into sections
 > * Make the visualization of the social network interactive (for instance, zooming in and out)
 > * Implement machine learning and data interaction: predict which house a character belongs to
 > * Implement data interaction: given a character, show the 10 most relevant characters or the social network associated with that character

# Part 2: Basic stats

> **<font color='green'>Let's understand the dataset better! :)</font>**

### 1. Data cleaning and preprocessing

In [45]:
#import the libraries
import requests
import json
import matplotlib.pyplot as plt
import re
import networkx as nx
import collections
from tqdm import tqdm
from community import community_louvain

> ### (1) Get the nodes and the links from Wikipedia

### <font color='grey'>a. Try with Wikipedia</font>

<b>In this section, we first downloaded the list of Harry Potter characters from Wikipedia via its API.</b>


In [4]:
baseurl = "https://en.wikipedia.org/w/api.php?"
action = "action=query"
content = "prop=revisions&rvprop=content"
dataformat ="format=json"

# get the list of characters from wiki

cha_list_word = 'List_of_Harry_Potter_characters'

title = "titles={}".format(cha_list_word)
query = "{}{}&{}&{}&{}".format(baseurl, action, content, title, dataformat)

r = requests.get(query)
text= r.text
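The query string above is assembled by hand. As a minimal sketch (the helper name `build_query_url` is hypothetical and not part of the notebook), the same parameters could be URL-encoded with the standard library, so that titles containing spaces or special characters are escaped safely:

```python
from urllib.parse import urlencode

WIKI_API = "https://en.wikipedia.org/w/api.php"

def build_query_url(title, base=WIKI_API):
    """Return a MediaWiki query URL requesting the raw content of `title`."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return base + "?" + urlencode(params)

url = build_query_url("List_of_Harry_Potter_characters")
print(url)
```

The resulting `url` can be passed to `requests.get` exactly like `query` in the cell above.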

In [5]:
cha_names=[]
spl = text.split('\\n*')
for each_spl in spl:
    contents = each_spl.split('\\u2013')
    if len(contents) >1:
#         print(contents[0])
        cha_names.append(contents[0])
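The cell above splits the raw response text on the escaped sequences `\\n*` and `\\u2013`. If the response is decoded first (for example with `r.json()`), the wikitext contains real newlines and en dashes, and the same idea can be sketched as below. The function name `extract_names` and the sample text are assumptions made for illustration only:

```python
def extract_names(wikitext):
    """Collect the text before the first en dash on each bulleted line."""
    names = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue
        head, sep, _ = line.lstrip("* ").partition("\u2013")
        if sep:  # keep only entries of the form "Name (en dash) description"
            names.append(head.strip())
    return names

# Made-up sample in the same "* Name (en dash) description" layout.
sample = (
    "Intro text\n"
    "* Hannah Abbott \u2013 a Hufflepuff student\n"
    "* Not an entry\n"
    "* Ludo Bagman \u2013 a Ministry official"
)
names = extract_names(sample)
print(names)
```

Like the original cell, this keeps whatever wiki markup surrounds the name, so a further cleaning step may still be needed.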

In [6]:
print('number of characters from wikipedia:{}'.format(len(cha_names)))

number of characters from wikipedia:204


<b>However, the dataset from Wikipedia is far from complete: the list contains only 204 characters.<br>This is most likely because Wikipedia only covers the main characters.</b>

<b>We then found a website called Fandom, a wiki hosting service where fan communities who share a common interest document their favourite works. It contains very specific and interesting information about characters from novels, movies, books, etc.</b>

### <font color='grey'>b. Try with Fandom</font>

In [12]:
hp_cat_url = 'https://harrypotter.fandom.com/api/v1/Articles/List?expand=1&category=Individuals&namespaces=0&limit=100'
r = requests.get(hp_cat_url)
response = r.json()

a = 0
sub_cat = []
for item in response['items']:
    a += 1
    print("{}\t{}\t({})".format(str(a),item['title'].encode(encoding='utf-8'),item['id']))
    title = str(item['title'].encode(encoding='utf-8'))
    sub_cat.append(title[2:-1])

1	b'Individual infobox test'	(191486)
2	b'Deities'	(163464)
3	b'Disowned individuals'	(167918)
4	b'Fictional characters'	(7830)
5	b'Homosexuals'	(218202)
6	b'Impersonated individuals'	(22241)
7	b'Individuals by house'	(191523)
8	b'Individuals by physical characteristics'	(214731)
9	b'Individuals by ability'	(9521)
10	b'Individuals by achievement'	(9522)
11	b'Individuals by allegiance'	(10221)
12	b'Individuals by class'	(15091)
13	b'Individuals by death'	(9905)
14	b'Individuals by deed'	(73564)
15	b'Individuals by era'	(122081)
16	b'Individuals by gender'	(9516)
17	b'Individuals by injury'	(9523)
18	b'Individuals by job'	(9538)
19	b'Individuals by marital status'	(119075)
20	b'Individuals by parentage'	(9645)
21	b'Individuals by place of origin'	(187776)
22	b'Individuals by place of residence'	(202826)
23	b'Individuals by relationship'	(10130)
24	b'Individuals by school'	(13832)
25	b'Individuals by species'	(9552)
26	b'Missing individuals'	(73563)
27	b'Objects with Personality'	(35200)
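The titles above are converted with `str(title.encode(...))[2:-1]`, which works for plain ASCII titles but turns any non-ASCII character into a literal escape sequence. The sketch below illustrates the difference on a hypothetical title (not taken from the dataset) and shows percent-encoding with `urllib.parse.quote` as one possible alternative:

```python
from urllib.parse import quote

title = "Beauxbatons Académie"  # hypothetical title containing a non-ASCII letter

# The approach used in the notebook: repr of the bytes object, minus b'...'
hacked = str(title.encode("utf-8"))[2:-1]

# One alternative: keep the str as it is and percent-encode it for URLs
safe = quote(title.replace(" ", "_"))

print(hacked)
print(safe)
```

With the second form the original title stays recoverable through `urllib.parse.unquote`.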


In [13]:
baseurl = "https://harrypotter.fandom.com/api/v1/Articles/List?"
expand = "expand=1"
namespaces = "namespaces=0"
limit = "limit={}".format(100)


cat_list = []
# pages found inside the categories (character names, not sub-categories)
direct_cat = []

depth_control = 6


# recursively collect the pages under a category name
# if the name is not a category (the API returns no 'items'), return the name itself
def get_list(sub_cat_name,depth):
    depth+=1
    if depth >= depth_control:
        print('*Depth warning.')
        return []
    l = []
    category = "category={}".format(sub_cat_name)
    query = "{}{}&{}&{}&{}".format(baseurl, expand, category, namespaces, limit)

    mark = True
    
    r = requests.get(query)
    response = r.json()
    if 'items' not in response.keys():
        mark = False

    if mark:      
        cat_list.append(sub_cat_name)
        for item in response['items']:
            title = str(item['title'].encode(encoding='utf-8'))[2:-1]
            if title == sub_cat_name:
                continue
            if title in cat_list: # if this category already went through
                print('*REPEATED cate: {}'.format(title))
                continue
            if title.startswith('Locations'):
                print('*Locations cate: {}'.format(title))
                continue
            if 'Creatures' in title:
                print('*Creatures cate: {}'.format(title))
                continue
            l.extend(get_list(title,depth))
            
        print('exploring category:{}'.format(sub_cat_name))

    else:
        l = [sub_cat_name]

    return l


for i,each_sub_cat in enumerate(sub_cat[20:25]):
    depth = 1
    print('*************************\nexploring category:{}\t{} START\n'.format(i,each_sub_cat))
    direct_cat.extend(get_list(each_sub_cat,depth))
    print('\nexploring category:{}\t{} FINISH\n*************************'.format(i,each_sub_cat))

*************************
exploring category:0	Individuals by place of origin START

exploring category:Individuals from alternate realities
exploring category:Emigrants
exploring category:Chadian individuals
exploring category:Egyptian individuals
exploring category:Ivorian individuals
exploring category:Nigerian individuals
exploring category:African individuals
exploring category:American individuals
exploring category:Chinese individuals
exploring category:Japanese individuals
exploring category:Nepali individuals
exploring category:Persian individuals
exploring category:Russian individuals
exploring category:Tibetan individuals
exploring category:Asian individuals
exploring category:Albanian individuals
exploring category:Austrian individuals
exploring category:British Isles individuals
exploring category:Bulgarian individuals
exploring category:French individuals
exploring category:German individuals
exploring category:Greek individuals
exploring category:Italian individuals
expl

exploring category:Stevenson family
*REPEATED cate: Dumbledore family
exploring category:Wizard families
exploring category:Families
exploring category:Only children
exploring category:Half-orphans
exploring category:Orphans
exploring category:Parents
exploring category:Twins
exploring category:Unnamed family members
exploring category:Individuals by family
exploring category:Personal assistants
exploring category:Cedric Diggory's romantic relationships
exploring category:Celestina Warbeck's romantic relationships
exploring category:Cho Chang's romantic relationships
exploring category:Dean Thomas's romantic relationships
exploring category:Draco Malfoy's romantic relationships
exploring category:Fleur Delacour's romantic relationships
exploring category:Ginny Weasley's romantic relationships
exploring category:Harry Potter's romantic relationships
exploring category:Hermione Granger's romantic relationships
exploring category:Lavender Brown's romantic relationships
exploring category:

exploring category:Slytherins
exploring category:Unknown House
exploring category:Individuals by Hogwarts house
exploring category:Circle of Khanna
exploring category:Dumbledore's Army defectors
exploring category:Dumbledore's Army
exploring category:Inquisitorial Squad
exploring category:Hogwarts dropouts
exploring category:Hogwarts expellees
exploring category:Non-graduate Hogwarts students
exploring category:Head Boys
exploring category:Duelling Club Captains
exploring category:Founder Duels champions
exploring category:Head Girls
exploring category:Hatstalls
exploring category:Hogwarts prefects
exploring category:Slug Club
exploring category:Triwizard Champions
exploring category:Special students
exploring category:Unidentified Hogwarts Students
exploring category:Hogwarts students
exploring category:Home-schooled individuals
exploring category:Horned Serpents
exploring category:Pukwudgies (house)
exploring category:Thunderbirds (house)
exploring category:Wampuses
exploring categor
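The recursive crawl above can also be written iteratively. The following is only a sketch under stated assumptions: `fetch_children` is a hypothetical callback standing in for the API request (it returns the child titles of a category, or `None` for an ordinary page), and the depth handling is close to, but not identical with, `depth_control` in the notebook. Unlike `get_list`, it returns every page only once.

```python
from collections import deque

def collect_pages(root, fetch_children, max_depth=6):
    """Breadth-first walk over a category tree, visiting each title once."""
    pages, seen = [], {root}
    queue = deque([(root, 1)])
    while queue:
        name, depth = queue.popleft()
        children = fetch_children(name)
        if children is None:
            pages.append(name)   # leaf: an ordinary (character) page
            continue
        if depth >= max_depth:
            continue             # do not descend past the depth limit
        for child in children:
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return pages

# Toy category tree used instead of real API calls.
tree = {"A": ["B", "C", "x"], "B": ["y", "x"], "C": ["B", "z"]}
pages = collect_pages("A", tree.get)
print(pages)
```

Passing the fetch function as an argument keeps the traversal logic testable without network access.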

In [14]:
len(direct_cat)

801

In [15]:
direct_cat_1_10_set = set(direct_cat)
print(len(direct_cat_1_10_set))
import json
file_path = './data/cha_names_partd(20-24).json'
with open(file_path,'w') as f:
    json.dump(direct_cat,f)

632


FileNotFoundError: [Errno 2] No such file or directory: './data/cha_names_partd(20-24).json'
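Opening a file for writing raises `FileNotFoundError` when its parent directory does not exist, which is what happens above for `./data/`. A minimal sketch of a saving helper that creates the directory first (the name `save_json` is hypothetical):

```python
import json
import os
import tempfile

def save_json(obj, file_path):
    """Write `obj` as JSON, creating the parent directory when needed."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(file_path, "w") as f:
        json.dump(obj, f)

# Demonstration in a temporary directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "names.json")
    save_json(["Harry Potter", "Hermione Granger"], target)
    with open(target) as f:
        restored = json.load(f)

print(restored)
```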

Load all the saved name lists, remove the duplicates, and count the resulting nodes.

In [17]:
folder_path = './data/cha_name_all/'

cha_all = []

import os
import json
file_list = os.listdir(folder_path)
for f_path in file_list:
    with open(os.path.join(folder_path,f_path),'r') as f:
        f_l = json.load(f)
    f_l = list(f_l)
    cha_all.extend(f_l)

FileNotFoundError: [Errno 2] No such file or directory: './data/cha_name_all/'

In [18]:
cha_all = set(cha_all)
print('number of nodes:{}'.format(len(cha_all)))

number of nodes:0


In [20]:
with open('./data/cha_name_tmp.json','w') as f:
    json.dump(list(cha_all),f)

FileNotFoundError: [Errno 2] No such file or directory: './data/cha_name_tmp.json'

In [21]:
# get the name for file saving and reading
def file_saving_reading_name(s):
    s = s.replace(':','@')
    s = s.replace('/','$')
    return s

# get the name with ':' and '/'
def node_name(s):
    s = s.replace('@',':')
    s = s.replace('$','/')
    return s
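The two helpers above are inverses of each other only as long as no character name already contains `@` or `$`. As a hedged alternative, percent-encoding gives a mapping that is always reversible; the function names below are hypothetical:

```python
from urllib.parse import quote, unquote

def to_file_name(node):
    """Encode every reserved character, including ':' and '/'."""
    return quote(node, safe="")

def from_file_name(file_name):
    """Invert to_file_name."""
    return unquote(file_name)

original = "Harry_Potter:test/a@b"   # made-up name with awkward characters
encoded = to_file_name(original)
print(encoded)
print(from_file_name(encoded) == original)
```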

In [23]:
# download the pages
baseurl = "https://harrypotter.fandom.com/api.php?"
action = "action=query"
prop = "prop=revisions"
rvprop = "rvprop=content&rvparse=1"
format_ = "format=json"

fandom_folder = './data/cha_fandom_page'
len_cha = len(cha_all)

cha_all = list(cha_all)




# for each characters:
for i,each_cha in enumerate(cha_all[696:]):
    # deal with the character name
    each_cha = each_cha.replace(' ','_')  
    
    if r'\\' in each_cha:
        print(each_cha)
        continue
    
    print('downloading {}...\t{}/{}'.format(each_cha,i+696,len_cha))
    # query
    title = "titles={}".format(each_cha)
    query = "{}{}&{}&{}&{}&{}".format(baseurl, action,prop, title,rvprop,format_)
    
    try:
        r = requests.get(query)
        text= r.json()

        file_path = os.path.join(fandom_folder,file_saving_reading_name(each_cha)+'.json')

        with open(file_path,'w') as f:
            json.dump(text,f)
    except Exception as exc:  # a bare except would also swallow KeyboardInterrupt
        print('*warning:{} ({})'.format(each_cha, exc))
    

In [24]:
# after getting all files -> find the links and store them

# 1. get all the nodes
all_nodes = []
all_cha_json_files = os.listdir(fandom_folder)
for each in all_cha_json_files:
    cha_name = node_name(each.split('.json')[0])
    all_nodes.append(cha_name)

print('number of nodes:{}'.format(len(all_nodes)))
with open('./data/all_nodes.json','w') as f:
    json.dump(list(all_nodes),f)

FileNotFoundError: [Errno 2] No such file or directory: './data/cha_fandom_page'

In [None]:
# before step 2, define a function to get the link list from a text
# using a regular expression
def get_links(text):
    pattern=re.compile(r'<a href="/wiki/.*?" title=".*?">')
    result= pattern.findall(str(text))
    list_link=[]
    for x in result:
        x=x.split(' title')
        href=x[0].split('href="/wiki/')[-1].split('"')[0]
        list_link.append(href)
    return list_link
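The regular expression above only matches anchors in which `title` directly follows `href`. As a sketch of an alternative, the standard-library HTML parser can collect every `/wiki/` link regardless of attribute order. The class and function names are hypothetical, and the function expects the HTML string itself rather than the whole JSON response:

```python
from html.parser import HTMLParser

class WikiLinkParser(HTMLParser):
    """Collect the targets of all <a href="/wiki/..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.startswith("/wiki/"):
            self.links.append(href[len("/wiki/"):])

def get_links_from_html(html):
    parser = WikiLinkParser()
    parser.feed(html)
    return parser.links

sample_html = (
    '<p><a href="/wiki/Harry_Potter" title="Harry Potter">Harry</a> and '
    '<a class="x" href="/wiki/Hogwarts">school</a> '
    '<a href="https://example.org">external</a></p>'
)
links = get_links_from_html(sample_html)
print(links)
```

Because it also accepts anchors without a `title` attribute, this may return more links than the regex version.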

In [27]:
# 2. in each json file, get the links and store them in a dictionary
# format{node_name:[linked_node_1, linked_node_2, ... ], ...}

node_edges = {}

num_of_nodes = len(all_nodes)
folder = './data/cha_fandom_page/'

for i,node in enumerate(all_nodes):
    # 2.1 get the json file 
    json_path = os.path.join(folder,file_saving_reading_name(node)+'.json')
    with open(json_path,'r') as f:
        text = json.load(f)

#     print(str(text)[:30])
    
    # get the links from text
    raw_links = get_links(text)
    
    # avoid repeating
    raw_links = list(set(raw_links))
    
    # check if links are from the nodes_list
    true_links = []
    for link in raw_links:
        if link in all_nodes:
            true_links.append(link)
    
    node_edges[node] = true_links
    print('{}/{}\tnode {} has {} links'.format(i+1,num_of_nodes,node,len(true_links)))
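The filtering step above checks each link against the list `all_nodes`, which scans the list once per link. A minimal sketch of the same filtering with a set lookup (the function name `build_node_edges` and the toy input are assumptions for illustration):

```python
def build_node_edges(page_links, nodes):
    """Map each node to its de-duplicated links that point at known nodes."""
    known = set(nodes)
    return {
        node: sorted(set(page_links.get(node, [])) & known)
        for node in nodes
    }

# Toy input: raw links per page, including a duplicate and an unknown target.
page_links = {"A": ["B", "B", "Z"], "B": ["A"]}
edges = build_node_edges(page_links, ["A", "B", "C"])
print(edges)
```

The links are sorted only to make the result deterministic; the order has no effect on the network built from it.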

In [28]:
# 3. store the node edges into a json file
json_file = './data/nodes_egdes.json'
with open(json_file,'w') as f:
    json.dump(node_edges,f)

FileNotFoundError: [Errno 2] No such file or directory: './data/nodes_egdes.json'

In [29]:
# calculate how many links we have
num_of_links = 0 
for each in node_edges:
    num_of_links+=len(node_edges[each])
print('number of nodes:{}\nnumber of links:{}'.format(len(all_nodes),num_of_links))

number of nodes:0
number of links:0


In [30]:
# building the network

# build the network from data 
G = nx.DiGraph()

# add nodes    
G.add_nodes_from(node_edges.keys())

# add links
for each in node_edges:
    for link in node_edges[each]:
        G.add_edge(each,link)

print('number of nodes in network:{}'.format(len(G.nodes())))
print('number of links in network:{}'.format(len(G.edges())))

number of nodes in network:0
number of links in network:0


In [31]:
# extract the largest component 
largest_cc = max(nx.weakly_connected_components(G), key=len)
print('Number of nodes in the largest component:{}'.format(len(largest_cc)))
GCC = G.subgraph(largest_cc)
print('Number of links in the largest component:{}'.format(len(GCC.edges())))

ValueError: max() arg is an empty sequence

In [34]:
in_degrees = [d for n,d in G.in_degree()]
out_degrees = [d for n,d in G.out_degree()]


def plot_degree_distribution(degrees,title):
    degreeCount = collections.Counter(degrees)
    degree, count = zip(*degreeCount.items())

    # plot
    width = 1.0
    plt.bar(degree,count, align='center', width=width,edgecolor ='black',color = 'lightblue')
#     plt.hist(degrees,bins=1000,edgecolor ='black',color = 'lightblue')
    plt.title(title,pad = 20.0)
    plt.ylabel("Count")
    plt.xlabel("Degree")
    plt.show()
    
print('In degree\tmax:{}\tmin:{}'.format(max(in_degrees),min(in_degrees)))
print('Out degree\tmax:{}\tmin:{}'.format(max(out_degrees),min(out_degrees)))
print('Nodes with in-degree larger than 100: {}'.format(len([i for i in in_degrees if i >100])))
plot_degree_distribution(in_degrees,'In-degree Distribution for Magic World Characters')
plot_degree_distribution(out_degrees,'Out-degree Distribution for Magic World Characters')

ValueError: max() arg is an empty sequence
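The `ValueError` above appears because `max()` is called on an empty degree list (the network has no nodes when the data files are missing). A small sketch of a summary function that handles this case explicitly; `degree_summary` is a hypothetical helper, not part of the notebook:

```python
from collections import Counter

def degree_summary(degrees):
    """Return (max, min, histogram), or None when there are no degrees."""
    if not degrees:
        return None
    histogram = Counter(degrees)
    return max(degrees), min(degrees), dict(sorted(histogram.items()))

print(degree_summary([]))
print(degree_summary([0, 2, 2, 5]))
```

The histogram has the same content as the `collections.Counter` built inside `plot_degree_distribution`, so it could be passed on to the plotting code once it is known to be non-empty.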

In [35]:
degreeCount = collections.Counter(in_degrees)
in_degree, in_count = zip(*degreeCount.items())
degreeCount = collections.Counter(out_degrees)
out_degree, out_count = zip(*degreeCount.items())
width = 1.0
plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.bar(in_degree,in_count, align='center', width=width,edgecolor ='darkblue',color = 'darkblue',label='in-degree')
plt.title('In-degree Distribution in Magic Universe')
plt.legend()
plt.subplot(1,2,2)
plt.bar(out_degree,out_count, align='center', width=width,edgecolor ='olive',color = 'olive',label='out-degree')
plt.legend()
plt.title('Out-degree Distribution in Magic Universe')
plt.savefig('./magic_degree_dis.png')
plt.show()

ValueError: not enough values to unpack (expected 2, got 0)

In [37]:
G_undirect = GCC.to_undirected()
partition = community_louvain.best_partition(G_undirect,random_state=2020)

NameError: name 'GCC' is not defined

In [38]:
partition

NameError: name 'partition' is not defined

In [39]:
par_class = set(partition.values())
par_class_dict = {}
for v in partition.values():
    str_v = str(v)
    if str_v not in par_class_dict.keys():
        par_class_dict[str_v] = 1
    else:
        par_class_dict[str_v] += 1
par_class_dict

NameError: name 'partition' is not defined
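The loop above counts how many nodes fall into each community. Assuming `partition` is a dictionary mapping node names to community ids, the same count can be sketched with `collections.Counter`. The example partition below is made up, and the keys stay integers instead of being converted to strings:

```python
from collections import Counter

def community_sizes(partition):
    """Count the nodes per community id, largest community first."""
    return dict(Counter(partition.values()).most_common())

# Hypothetical partition, for illustration only.
example_partition = {"Harry_Potter": 0, "Ron_Weasley": 0, "Draco_Malfoy": 1}
sizes = community_sizes(example_partition)
print(sizes)
```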

In [40]:
json_file = './data/cha_fandom_page/16th_century_Potions_Professor.json'
with open(json_file,'r') as f:
    text = json.load(f)

# print(text)
import re
pattern=re.compile(r'<a href="/wiki/.*?" title=".*?">')
result= pattern.findall(str(text))

list_href=[]
list_title=[]
print(type(result),len(result))
for x in result:
    x=x.split(' title')
    href=x[0].split('href="/wiki/')[-1].split('"')[0]
    title=x[1].split('="')[-1].split('"')[0]
    list_href.append(href)
    list_title.append(title)

    
print(list_href[:25])
print(list_title[:25])

FileNotFoundError: [Errno 2] No such file or directory: './data/cha_fandom_page/16th_century_Potions_Professor.json'

In [41]:
baseurl = "https://harrypotter.fandom.com/api.php?"
action = "action=parse"
# prop = "prop=revisions&rvprop=content&rvparse=1"
format_ = "format=json"

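each_cha  = 'Devil'
# NOTE: this passes cha_list_word (the Wikipedia list title) instead of each_cha,
# which is the likely cause of the 'missingtitle' error returned by the API
page = "page={}".format(cha_list_word)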
query = "{}{}&{}&{}".format(baseurl, action, page,format_)

r = requests.get(query)
text= r.json()
print(text)
fandom_folder = './data/cha_fandom_page'
file_path = os.path.join(fandom_folder,file_saving_reading_name(each_cha)+'.json')

with open(file_path,'w') as f:
    json.dump(text,f)

{'error': {'code': 'missingtitle', 'info': "The page you specified doesn't exist.", '*': 'See https://harrypotter.fandom.com/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.'}}


FileNotFoundError: [Errno 2] No such file or directory: './data/cha_fandom_page/Devil.json'

In [42]:
url = 'https://harrypotter.fandom.com/wiki/Devil'
r = requests.get(url)
text= r.text

In [43]:
baseurl = "https://harrypotter.fandom.com/api.php?"

query = baseurl+'action=query&prop=revisions&titles=Devil&rvprop=content&rvparse=1&format=json'
r = requests.get(query)
text= r.json()

In [44]:
text

{'batchcomplete': '',
  'revisions': {'*': 'The parameter "rvparse" has been deprecated.\nBecause "rvslots" was not specified, a legacy format has been used for the output. This format is deprecated, and in the future the new format will always be used.'}},
 'query': {'pages': {'95140': {'pageid': 95140,
    'ns': 0,
    'title': 'Devil',
    'revisions': [{'*': '<div class="mw-parser-output"><p>\n<aside role="region" class="portable-infobox pi-background pi-theme-spirit pi-layout-default">\n<figure class="pi-item pi-image" data-source="image">\n\t<a href="https://static.wikia.nocookie.net/harrypotter/images/4/4d/Devil.jpg/revision/latest?cb=20160605110157" class="image image-thumbnail"\n\t   title="">\n\t\t<img src="https://static.wikia.nocookie.net/harrypotter/images/4/4d/Devil.jpg/revision/latest/scale-to-width-down/286?cb=20160605110157" srcset="https://static.wikia.nocookie.net/harrypotter/images/4/4d/Devil.jpg/revision/latest/scale-to-width-down/286?cb=20160605110157 1x, https://
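The response above warns that `rvparse` is deprecated. One possible replacement is the MediaWiki `action=parse` endpoint, which returns the rendered HTML of a page. The sketch below only builds the URL and reads a response; the helper names are hypothetical, and `extract_html` assumes the legacy JSON layout (format version 1), which is an assumption about the server's default:

```python
from urllib.parse import urlencode

FANDOM_API = "https://harrypotter.fandom.com/api.php"

def build_parse_url(page, base=FANDOM_API):
    """Return an action=parse URL asking for the rendered text of `page`."""
    params = {"action": "parse", "page": page, "prop": "text", "format": "json"}
    return base + "?" + urlencode(params)

def extract_html(response):
    """Pull the rendered HTML out of an action=parse JSON response."""
    return response["parse"]["text"]["*"]

parse_url = build_parse_url("Devil")
print(parse_url)

# Hand-written stand-in for a real response, for illustration only.
fake_response = {"parse": {"title": "Devil", "text": {"*": "<div>...</div>"}}}
print(extract_html(fake_response))
```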

### 2. Dataset discussion