# Hitchhikers fandom dataset creation

I am compiling a database of fictional world elements from the Hitchhiker's Guide to the Galaxy universe. The database will be used to generate new stories in this universe with the help of large language models. My focus is on feeding the model information about the universe rather than the story itself, in the hope that the generated content diverges from the original corpus.

## Dependencies and utilities

In [3]:
!pip install python-slugify fuzzywuzzy mwparserfromhell

Collecting python-slugify
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting fuzzywuzzy
  Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Collecting mwparserfromhell
  Downloading mwparserfromhell-0.6.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, fuzzywuzzy, python-slugify, mwparserfromhell
Successfully installed fuzzywuzzy-0.18.0 mwparserfromhell-0.6.4 python-slugify-8.0.1 text-unidecode-1.3


In [4]:
from bs4 import BeautifulSoup
import requests
import os
from slugify import slugify
from fuzzywuzzy import fuzz
import mwparserfromhell



In [5]:
def print_breadcrumb(node):
    # Print the chain of ancestor tag names, outermost first
    names = [parent.name for parent in node.parents if parent.name]
    print(" → ".join(reversed(names)))

## Main page

In [22]:
url = "https://hitchhikers.fandom.com/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy_(book)"
r = requests.get(url)
html_content = r.content
soup = BeautifulSoup(html_content, 'html.parser')

In [53]:
header_tags = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

In [23]:
main_text = soup.find(id = "mw-content-text")
headers = main_text.find_all(header_tags)

In [38]:
# Filter intro headers that I'm not interested in
startIdx = next((i for i, h in enumerate(headers) if h.text == "Appearances"), None)
headers = headers[startIdx:]
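
If no header is titled exactly "Appearances", `startIdx` is `None` and the slice quietly keeps every header. The sketch below makes that fallback explicit. It works on plain strings rather than BeautifulSoup tags, and `headers_from` is a hypothetical helper name, not part of the notebook.

```python
def headers_from(titles, start_title):
    """Return the titles from the first exact match of start_title onward."""
    # next() with a default avoids StopIteration when nothing matches
    start = next(
        (i for i, t in enumerate(titles) if t.strip() == start_title),
        None,
    )
    if start is None:
        # Explicit fallback instead of relying on titles[None:]
        return list(titles)
    return list(titles[start:])


titles = ["Contents", "Plot summary", "Appearances", "Characters"]
print(headers_from(titles, "Appearances"))
print(headers_from(titles, "Not a header"))
```

Stripping whitespace before comparing is an assumption about how the header text is rendered; it may need adjusting if the wiki adds extra markup inside headers.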

In [10]:
base = "https://hitchhikers.fandom.com"  # hrefs on the page already start with /wiki/

last_seen_header = ""

def visit_node(node):
    global last_seen_header

    # Remember the most recent header so links can be grouped under it
    if node.name in header_tags:
        last_seen_header = node.text

    if node.name == "a" and len(node.text.strip()) > 0:
        fn = slugify(last_seen_header) + os.sep + slugify(node.text) + ".txt"
        print(fn)
        print(node.get('href'))
        # TODO: parse the linked page

    # Recurse into child tags only (text nodes have no name)
    for child in node.children:
        if child.name is not None:
            visit_node(child)
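
To show the idea in isolation, here is a minimal sketch of the same "remember the last header, file each link under it" traversal using only the standard library. It swaps BeautifulSoup for `html.parser`, and `simple_slug` is only a rough stand-in for `slugify`; both `LinkCollector` and `simple_slug` are hypothetical names.

```python
from html.parser import HTMLParser
import re

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def simple_slug(text):
    # Rough stand-in for slugify(): lowercase, other characters become hyphens
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.last_header = ""
        self.links = []        # (relative path, href) pairs
        self._in_header = False
        self._href = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            self._in_header = True
            self._buf = []
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._buf = []

    def handle_data(self, data):
        # Only collect text while inside a header or a link with an href
        if self._in_header or self._href is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        text = "".join(self._buf).strip()
        if tag in HEADER_TAGS and self._in_header:
            self.last_header = text
            self._in_header = False
        elif tag == "a" and self._href is not None:
            if text:
                path = simple_slug(self.last_header) + "/" + simple_slug(text) + ".txt"
                self.links.append((path, self._href))
            self._href = None


html = (
    "<h2>Main characters</h2>"
    '<p><a href="/wiki/Arthur_Dent">Arthur Dent</a> and '
    '<a href="/wiki/Ford_Prefect">Ford Prefect</a></p>'
)
collector = LinkCollector()
collector.feed(html)
for path, href in collector.links:
    print(path, href)
```

The sketch does not handle links nested inside headers; the recursive BeautifulSoup version in the notebook is the better fit for real pages.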

## Getting content from a single page

In [11]:
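def text_from_wiki(content):
    res = ""
    wikicode = mwparserfromhell.parse(content)
    # NOTE: by default get_sections() also returns every subsection on its own,
    # so subsection text can appear twice and subsections of a skipped section
    # are still kept.
    sections = wikicode.get_sections()

    skip_titles = ["Notes and references", "Behind the scenes", "Appearances"]

    for section in sections:
        heading_nodes = section.filter_headings()

        # The lead section has no heading
        if len(heading_nodes) == 0:
            title = "Unnamed section"
        else:
            title = heading_nodes[0].title.strip()

        # Drop templates, link markup and other wikicode, keeping plain text
        text = section.strip_code()
        if title not in skip_titles:
            res = res + text

    return res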
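
The sample outputs in this notebook contain non-breaking spaces (`\xa0`) and stray spaces around newlines. A possible cleanup pass, sketched with the standard library only; `tidy_text` is a hypothetical helper that is not part of the original pipeline.

```python
import re


def tidy_text(text):
    # Replace non-breaking spaces with ordinary spaces
    text = text.replace("\xa0", " ")
    # Trim spaces and tabs that surround a newline
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Collapse runs of three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


sample = "a tremendously good evening will follow.\xa0\n\n\n Book \n Life, the Universe and Everything "
print(repr(tidy_text(sample)))
```

It could be applied to the return value of `text_from_wiki` before writing the file, if plain whitespace is preferred in the dataset.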

In [12]:
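def text_from_url(url):
    # The edit view exposes the raw wikitext in the element with id "wpTextbox1"
    url = url + "?action=edit"

    r = requests.get(url)
    html_content = r.content
    page_soup = BeautifulSoup(html_content, 'html.parser')

    content_el = page_soup.find(id = "wpTextbox1")
    if content_el is None:
        raise ValueError("No wikitext textbox found at " + url)
    return text_from_wiki(content_el.text)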
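
Appending `"?action=edit"` by hand breaks when the URL already carries a query string or a fragment. A small sketch of a safer way to build the edit URL with `urllib.parse`; `edit_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def edit_url(page_url):
    """Return page_url with action=edit set, keeping existing query parameters."""
    parts = urlsplit(page_url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["action"] = "edit"
    # The fragment is dropped on purpose: it is never sent to the server
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


print(edit_url("https://hitchhikers.fandom.com/wiki/Hrarf-Hrarf"))
print(edit_url("https://example.org/wiki/Page?oldid=5#top"))
```

The `example.org` URL is a placeholder used only to show the query-string case.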

## Traversing the link list page

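The goal is to save the entire fandom wiki. For now this walks a single `Special:AllPages` listing page and saves every page linked from it.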

In [13]:
url = "https://hitchhikers.fandom.com/wiki/Special:AllPages?from=Hotblack+Desiato%27s+bodyguard"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
content_el = soup.find(class_ = "mw-allpages-body")  


In [25]:
link_names = [
    {"fn": "data/dump" + os.sep + slugify(link_node.text), "href": base + link_node.get("href")}
    for link_node in content_el.find_all("a")
]

In [24]:
text_from_url("https://hitchhikers.fandom.com/wiki/Hrarf-Hrarf")

"The Hrarf-Hrarf are a species which live backwards in time. They are said to find that getting the business of “sagging bottoms and death” out of the way at an early stage prepares them for an increasingly wonderful time after mid-life crisis celebrations. Their lives finish in a “really quite extraordinarily pleasant birth.” They were first mentioned in Life, the Universe and Everything.\n\nThe Hrarf-Hrarf are the only known race which enjoys hangovers, as for them it guarantees that a tremendously good evening will follow.\xa0\n\nWhen Arthur Dent found himself on prehistoric Earth, a planet on which he was born some two million years later, it was noted that any being other than the Hrarf-Hrarf would find that a terribly lonely position to be in. Book \n Life, the Universe and Everything Radio \n\n Tertiary Phase \n Fit the Thirteenth Tertiary Phase \n Fit the ThirteenthTrivia\nIn The Hitchhiker's Guide to the Galaxy Radio Scripts: Tertiary, Quandary and Quintessential Phases, one o

In [28]:
base = "https://hitchhikers.fandom.com"

for elem in link_names:
    print(elem)
    try:
        text = text_from_url(elem['href'])
        os.makedirs(os.path.dirname(elem['fn']), exist_ok=True)
        with open(elem['fn'], 'w', encoding='utf-8') as f:
            f.write(text)
    except Exception as e:
        print(e)
    

{'fn': 'data/dump/hotblack-desiato-s-bodyguard', 'href': 'https://hitchhikers.fandom.com/wiki/Hotblack_Desiato%27s_bodyguard'}
{'fn': 'data/dump/hrarf-hrarf', 'href': 'https://hitchhikers.fandom.com/wiki/Hrarf-Hrarf'}
{'fn': 'data/dump/human', 'href': 'https://hitchhikers.fandom.com/wiki/Human'}
{'fn': 'data/dump/human-beings', 'href': 'https://hitchhikers.fandom.com/wiki/Human_beings'}
{'fn': 'data/dump/humans', 'href': 'https://hitchhikers.fandom.com/wiki/Humans'}
{'fn': 'data/dump/humma-kavula', 'href': 'https://hitchhikers.fandom.com/wiki/Humma_Kavula'}
{'fn': 'data/dump/hurling-frootmig', 'href': 'https://hitchhikers.fandom.com/wiki/Hurling_Frootmig'}
{'fn': 'data/dump/hyperspace', 'href': 'https://hitchhikers.fandom.com/wiki/Hyperspace'}
{'fn': 'data/dump/ibiza', 'href': 'https://hitchhikers.fandom.com/wiki/Ibiza'}
{'fn': 'data/dump/infinidim-enterprises', 'href': 'https://hitchhikers.fandom.com/wiki/InfiniDim_Enterprises'}
{'fn': 'data/dump/infinidim-enterprises', 'href': 'https
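
The listing above contains `data/dump/infinidim-enterprises` twice, presumably because two page titles slugify to the same name, so the second write overwrites the first. One way to keep both is to suffix repeated file names. This is a minimal sketch; `dedupe_by_filename` is a hypothetical helper and the `example.org` entries are placeholders.

```python
def dedupe_by_filename(entries):
    """Give every entry a unique 'fn' by suffixing repeats with -2, -3, ..."""
    seen = {}
    result = []
    for entry in entries:
        fn = entry["fn"]
        count = seen.get(fn, 0) + 1
        seen[fn] = count
        unique_fn = fn if count == 1 else f"{fn}-{count}"
        # Copy the entry so the input list is left untouched
        result.append({**entry, "fn": unique_fn})
    return result


entries = [
    {"fn": "data/dump/page", "href": "https://example.org/wiki/Page"},
    {"fn": "data/dump/page", "href": "https://example.org/wiki/PAGE"},
    {"fn": "data/dump/other", "href": "https://example.org/wiki/Other"},
]
unique = dedupe_by_filename(entries)
for entry in unique:
    print(entry["fn"], entry["href"])
```

The sketch does not guard against a suffixed name colliding with a page whose slug already ends in `-2`; for a wiki of this size that seems an acceptable risk.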

## Traversing the main page

Visits the links on the main page and files each one under the section header it appears beneath.

In [267]:
url = "https://hitchhikers.fandom.com/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy_(book)"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
content_el = soup.find(id = "mw-content-text")  


In [296]:
base = "https://hitchhikers.fandom.com"

last_seen_header = ""
scrape_list = []

# Walks the tree and fills scrape_list with one entry per link
def clean_visit_node(node):
    global last_seen_header

    if node.name in header_tags:
        last_seen_header = node.text

    # Skip empty links, links without an href, and in-page anchors ("#...")
    href = node.get('href') if node.name == "a" else None
    if href and len(node.text.strip()) > 0 and "#" not in href:
        fn = slugify(last_seen_header) + os.sep + slugify(node.text) + ".txt"
        scrape_list.append({"fn" : "data" + os.sep + fn, "href": base + href})

    for child in node.children:
        if child.name is not None:
            clean_visit_node(child)

clean_visit_node(content_el)
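
The link filter could be made stricter so that only site-relative article links are prefixed with `base`. The sketch below is one heuristic, not the filter used above; `is_article_href` is a hypothetical helper, and treating a colon as a namespace marker is an assumption that would also exclude article titles containing a colon.

```python
from urllib.parse import urlsplit


def is_article_href(href):
    """Heuristic: keep site-relative /wiki/ links that are not anchors or namespaced pages."""
    if not href:
        return False  # <a> without an href attribute
    parts = urlsplit(href)
    if parts.scheme or parts.netloc:
        return False  # absolute link; prefixing `base` would give a broken URL
    if parts.fragment:
        return False  # in-page anchor such as a citation link
    if not parts.path.startswith("/wiki/"):
        return False
    # A colon in the title usually marks a namespace (Category:, File:, Special:)
    return ":" not in parts.path[len("/wiki/"):]


candidates = [
    "/wiki/Arthur_Dent",
    "#cite_note-1",
    "/wiki/Category:Books",
    "https://example.org/external",
    None,
]
for href in candidates:
    print(href, is_article_href(href))
```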


In [301]:
for elem in scrape_list:
    text = text_from_url(elem['href'])
    os.makedirs(os.path.dirname(elem['fn']), exist_ok=True)
    with open(elem['fn'], 'w', encoding='utf-8') as f:
        f.write(text)
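
Because the loop rewrites every file on each run, an interrupted scrape starts over from the beginning. A write helper that skips files already on disk is one way around that. This is a sketch under that assumption; `save_text` is a hypothetical name.

```python
import os
import tempfile


def save_text(path, text, overwrite=False):
    """Write text as UTF-8, creating parent directories; return True if written."""
    if os.path.exists(path) and not overwrite:
        return False  # already saved on an earlier run
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return True


# Demonstration in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "plot-summary", "arthur-dent.txt")
    first = save_text(target, "Arthur Dent → Earthman")
    second = save_text(target, "different text")
    with open(target, encoding="utf-8") as f:
        saved = f.read()

print(first, second, saved)
```

To avoid the network request as well, the existence check would have to happen before calling `text_from_url`.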

In [284]:
text_from_url("https://hitchhikers.fandom.com/wiki/Earth")

'Earth was a giant supercomputer designed to find the Ultimate Question of Life, the Universe and Everything. Designed by Deep Thought and built by the Magratheans, it was commonly mistaken for a planet, especially by the ape descendants who lived on it. It was situated far out in the uncharted backwaters of the unfashionable end of the Western Spiral Arm of the Galaxy.\n\nUnfortunately, the Earth was destroyed by the Vogons five minutes before the program was to be completed. The Vogons were sent by the psychiatrist Gag Halfrunt, who thought his profession would cease if the Question were known. Later on, the Earth reappeared but all forms of the Earth were later demolished.\n\nThe only two humans to survive the Earth\'s destruction were Arthur Dent and Trillian.Elvis was revealed to be singing at The Domain of The King, though he may not have been human to begin with.Lifeforms\nEarth was mainly populated by "ape-descended life forms" or\xa0Humans, which number around 6.5 billion at t

In [298]:
scrape_list

[{'fn': 'data/series/the-hitchhiker-s-guide-to-the-galaxy.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy'},
 {'fn': 'data/author-s/douglas-adams.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/Douglas_Adams'},
 {'fn': 'data/audiobook-narrator-s/stephen-fry.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/Stephen_Fry'},
 {'fn': 'data/followed-by/the-restaurant-at-the-end-of-the-universe.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/The_Restaurant_at_the_End_of_the_Universe'},
 {'fn': 'data/followed-by/the-hitchhiker-s-guide-to-the-galaxy.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy'},
 {'fn': 'data/followed-by/douglas-adams.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/Douglas_Adams'},
 {'fn': 'data/plot-summary/arthur-dent.txt',
  'href': 'https://hitchhikers.fandom.com/wiki/Arthur_Dent'},
 {'fn': 'data/plot-summary/ford-prefect.txt',
  'href': 'https://hitchhikers.fandom.

In [None]:
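with open("data/main-page.txt", "w") as f:  # placeholder path; the original cell omitted the `with` line
    f.write(content_el.text)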

In [None]:
# Scratch: list the links that follow each header.
# Target layout: data/main-characters/arthur-dent.txt
# TODO: write each page straight to a file instead of collecting in a dict.

for header in headers:
    print(header.text)
    for tag in header.find_next_siblings():
        print(tag.find_all("a"))

In [None]:
for header in headers:
    print(header.text)
    # Collect the siblings that follow this header, up to the next header
    section_elements = []
    next_sibling = header.find_next_sibling()
    while next_sibling and next_sibling.name not in header_tags:
        section_elements.append(next_sibling)
        next_sibling = next_sibling.find_next_sibling()

    for element in section_elements:
        pass  # TODO: process each element, e.g. print(element.text)