# Scraping roguelike articles on the web

This Python notebook describes the data collection process for Roguelike Universe. If you have not yet installed the requirements, install them by running:

`pip install -r requirements.txt`

## Sourcing game titles

We prepared a [list of roguelike games](https://en.wikipedia.org/wiki/List_of_roguelikes) from Wikipedia as a starting point. To detect out-of-genre influences, we also sourced more than 10,000 video game titles from a Pastebin upload by the user ________.

First, we set up read/write functions that handle Unicode text correctly, because scraped web pages may contain characters from any script.

In [84]:
import os
import io
import json
import pandas as pd

def read_json(path):
    # Read as UTF-8 explicitly so non-ASCII titles load the same on every platform.
    with io.open(path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    print(__message('Loaded {}'.format(path)))
    return data
    
def save_json(path, data):
    # ensure_ascii=False writes non-ASCII characters as-is. The file is opened in
    # text mode as UTF-8, so no manual encoding step is needed.
    output = json.dumps(data, indent=2, ensure_ascii=False)
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(output)
    print(__message('Written to {}'.format(path)))
    
def __success(text):
    return '  (SUCC) {}'.format(text).encode('utf-8')
    
def __failure(text):
    return '!!FAIL!! {}'.format(text).encode('utf-8')
    
def __warning(text):
    return '??WARN?? {}'.format(text).encode('utf-8')
    
def __message(text):
    return '   |MSG| {}'.format(text).encode('utf-8')
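
As a quick check of the Unicode handling described above, here is a minimal round-trip sketch. The helper name `roundtrip` and the temporary file are assumptions made for illustration and are not part of the notebook; the sample name comes from the roguelike-likes table.

```python
import io
import json
import os
import tempfile

def roundtrip(data):
    """Write data as UTF-8 JSON to a temporary file and read it back."""
    fd, path = tempfile.mkstemp(suffix='.json')
    os.close(fd)
    try:
        with io.open(path, 'w', encoding='utf-8') as f:
            # ensure_ascii=False keeps 'ä' readable instead of escaping it.
            f.write(json.dumps(data, indent=2, ensure_ascii=False))
        with io.open(path, 'r', encoding='utf-8') as f:
            raw = f.read()
        return raw, json.loads(raw)
    finally:
        os.remove(path)

sample = {'developer': 'Iikka Keränen'}
raw, loaded = roundtrip(sample)
print(raw)
```

The file on disk contains the literal character rather than a `\u00e4` escape, which keeps the saved corpus readable in a text editor.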

Here is a sample of the list of roguelike games from Wikipedia:

In [85]:
roguelikes = pd.read_csv(os.path.join(os.getcwd(), 'roguelikes.csv'), skip_blank_lines=True)
roguelikes.head()

Unnamed: 0,Name,RogueTemple,Link,Released,Updated,Developer,Theme,Influences
0,100 Rogues,http://roguebasin.roguelikedevelopment.org/ind...,http://www.100rogues.com/,2010/05/06,2010/05/06,Dinofarm Games,Fantasy,Rogue
1,1Quest,http://roguebasin.roguelikedevelopment.org/ind...,http://www.ratzngodz.fr,2014/02/20,2015/02/07,Ratz 'N' Godz,Fantasy,"Dungeon Crawl Stone Soup, Dominions 4: Thrones..."
2,3059,http://roguebasin.roguelikedevelopment.org/ind...,https://sites.google.com/site/free3069/3059---...,2005/00/00,2005/06/11,Phr00t,"Science Fiction, Alien Planets, Futuristic",NetHack
3,3069,http://roguebasin.roguelikedevelopment.org/ind...,http://sites.google.com/site/free3069/,2009/07/06,2009/10/06,Phr00t,"Science Fiction, Alien Planets, Futuristic",3059
4,3079,http://roguebasin.roguelikedevelopment.org/ind...,http://sites.google.com/site/3079game/,2011/10/25,2015/02/13,Phr00t,"Science Fiction, Alien Planets, Futuristic","3059, 3069, Fallout, Minecraft"


In addition, here is a sample of the roguelike-like games:

In [86]:
roguelikelikes = pd.read_csv(os.path.join(os.getcwd(), 'roguelike-likes.csv'), skip_blank_lines=True)
roguelikelikes.head()

Unnamed: 0,Name,Released,Updated,Developer,Theme,Influences
0,ToeJam & Earl,1991,,Johnson Voorsanger Productions,Fantasy,
1,Diablo,1996,,Blizzard North,Fantasy,
2,Diablo II,2000,,Blizzard Entertainment,Fantasy,
3,Lost Labyrinth,2001,2011.0,Lost Labyrinth,Fantasy,
4,Strange Adventures In Infinite Space,2002,2004.0,"Rich Carlson, Iikka Keränen",Space science fiction,


And a sample of the list of video games:

In [87]:
video_games = pd.read_json(os.path.join(os.getcwd(), 'games.json'))
video_games.head(10)

Unnamed: 0,title,year
0,$hop-n-$pree,2009
1,'43 - One Year After,1986
2,'89 Denno Kyusei Uranai,1988
3,'Nam 1965-1975,1991
4,'Splosion Man,2009
5,'Til Death Do Us Part,2013
6,(Almost) Total Mayhem,2011
7,(Not) Just another Space Shooter,2004
8,(T)Raumschiff Surprise - Periode 1,2004
9,*NSYNC Hotline Phone and Fantasy CD-Rom Game,2001


## Building a corpus 

Before we can do any text analysis, we need to build a corpus to operate on.

### 1. RogueTemple

The RogueTemple Wiki collects detailed descriptions of roguelike games.

In [88]:
import bs4
import requests

def scrape_mediawiki_url(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    response = requests.get(url, headers=headers, timeout=(9.1, 12.1))
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    
    content = [node.text.strip() for node in soup.select('#mw-content-text') if node.text]
    return ''.join(content)

In [89]:
# Sample scrape
content = scrape_mediawiki_url('http://roguebasin.roguelikedevelopment.org/index.php?title=100_Rogues')
print(content)

100 Rogues



Stable game



Developer

Dinofarm Games



Theme

Fantasy



Influences

Rogue



Released

 (?)



Updated

May 6, 2010 (?)



Licensing

Commercial



P. Language





Platforms

iPhone



Interface

Graphical Tiles



Game Length

Medium



Official site of 100 Rogues




100 Rogues is an original Roguelike for the iPhone, iPod Touch and iPad devices.  It features two playable classes with unique skill trees (similar to Diablo), several tilesets, original SNES-style music, and fully animated pixel art.  100 Rogues was developed from scratch for the iPhone OS devices, and has a click to move control scheme rather than a virtual d-pad.

Contents

1 Gameplay
2 Combat
3 Skills
4 Reception


 Gameplay
A game of 100 Rogues is a linear progression through 12 dungeon levels spread across 3 worlds: The Bandit Hole, The Dungeon, and Hell. The first three levels of each world consist of randomly-generated maps initially populated by Mobs of different Monsters. After these Mobs a
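
The scraped text above begins with the wiki infobox flattened into alternating label and value lines. If those fields are needed separately, one possible approach is sketched below. The function name `parse_infobox` and the label list are assumptions based only on the sample output shown here, and other pages may use different labels.

```python
# Labels as they appear in the sample output above (an assumption, not a
# complete list for every page).
INFOBOX_LABELS = (
    'Developer', 'Theme', 'Influences', 'Released', 'Updated', 'Licensing',
    'P. Language', 'Platforms', 'Interface', 'Game Length',
)

def parse_infobox(text, labels=INFOBOX_LABELS):
    """Pair each known label with the next non-empty line of the scraped text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fields = {}
    for i, line in enumerate(lines):
        if line in labels and line not in fields:
            following = lines[i + 1] if i + 1 < len(lines) else ''
            # If the next line is itself a label, this field was left empty.
            fields[line] = '' if following in labels else following
    return fields

sample = """100 Rogues

Stable game

Developer

Dinofarm Games

Theme

Fantasy

Influences

Rogue

P. Language



Platforms

iPhone
"""
fields = parse_infobox(sample)
print(fields)
```

Empty fields such as `P. Language` come back as empty strings rather than swallowing the next label.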

### 2. Wikipedia

Searching Wikipedia takes about two dozen lines of code.

In [90]:
import wikipedia

def scrape_wiki_id(pageid):
    page = wikipedia.page(pageid=pageid)    
    print_wiki_page(page)
    
def scrape_wiki(title):
    # Try the bare title first, then progressively more specific article names.
    candidates = [
        title,
        '{} (video game)'.format(title).replace(' ', '_'),
        '{} (Unix video game)'.format(title).replace(' ', '_'),
    ]
    for searchstring in candidates:
        try:
            return wikipedia.page(searchstring, auto_suggest=False)
        except wikipedia.DisambiguationError:
            # Ambiguous title: fall through to the next, more specific candidate.
            continue
        except wikipedia.PageError:
            print(__warning(u'Search term "{}" returned nothing'.format(searchstring)))
            return None
        except Exception:
            print(__warning(u'Wikipedia cannot find "{}"'.format(searchstring)))
            return None
    print(__warning(u'Wikipedia cannot find "{}"'.format(searchstring)))
    return None
    
def print_wiki_page(page):
    print(page.title)
    print(page.content)
    print(page.references)  

In [91]:
# Test Wikipedia crawl
print(scrape_wiki('Rogue Legacy'))

<WikipediaPage 'Rogue Legacy'>


### 3. DuckDuckGo

We also source a list of potentially interesting webpages through an internet search engine, DuckDuckGo.

In [92]:
import bs4
import time
import requests
import urllib.parse

def scrape_duckduckgo(keywords, developer=""):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    searchstring = u'"{}" AND {} AND game AND (interview OR mortem OR history OR develop)'.format(keywords, developer)
    q = u'http://duckduckgo.com/html/?q={}'.format(urllib.parse.quote(searchstring.encode('utf-8')))
    print(q)
                                                   
    response = requests.get(q, headers=headers, timeout=(9.1, 12.1))
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    
    links = [node.get('href') for node in soup.select('a.result__a')]
    return links
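
The query construction can be looked at in isolation, without any network access. The sketch below is a variation rather than the notebook's own function: the name `build_query_url` is hypothetical, and unlike `scrape_duckduckgo` it leaves out the developer term when none is given, so the query does not contain an empty `AND  AND` clause.

```python
import urllib.parse

TOPIC_FILTER = '(interview OR mortem OR history OR develop)'

def build_query_url(keywords, developer=''):
    """Build the DuckDuckGo HTML search URL for a game title."""
    terms = ['"{}"'.format(keywords)]  # quote the title for an exact match
    if developer:
        terms.append(developer)
    terms += ['game', TOPIC_FILTER]
    searchstring = ' AND '.join(terms)
    # quote() percent-encodes spaces, quotes and parentheses.
    return 'http://duckduckgo.com/html/?q={}'.format(urllib.parse.quote(searchstring))

url = build_query_url('Ancient Domains of Mystery', 'Thomas Biskup')
print(url)
```

With both arguments supplied, this should produce the same URL that `scrape_duckduckgo` prints.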

In [93]:
scrape_duckduckgo('Ancient Domains of Mystery', 'Thomas Biskup')

http://duckduckgo.com/html/?q=%22Ancient%20Domains%20of%20Mystery%22%20AND%20Thomas%20Biskup%20AND%20game%20AND%20%28interview%20OR%20mortem%20OR%20history%20OR%20develop%29


['https://store.steampowered.com/video/333300',
 'https://forum.zoneofgames.ru/topic/36951-adom-ancient-domains-of-mystery/',
 'https://en.wikipedia.org/wiki/Ancient_Domains_of_Mystery',
 'https://www.youtube.com/watch?v=ChtBuBrFYc8',
 'https://ancient-domains-of-mystery.ru.uptodown.com/',
 'https://www.cultureofgaming.com/ancient-domains-of-mystery-adom-review/',
 'http://en.wikibedia.ru/wiki/Ancient_Domains_of_Mystery',
 'https://www.turkaramamotoru.com/en/ancient-domains-of-mystery-181815.html',
 'https://stillnessinthestorm.com/2018/08/the-mystery-of-ancient-nuclear-war/',
 'https://lgdb.org/game/adom-ancient-domains-of-mystery',
 'https://steamcommunity.com/sharedfiles/filedetails/?l=russian&id=258925365',
 'https://www.ranker.com/review/ancient-domains-of-mystery/455750',
 'https://alchetron.com/Ancient-Domains-of-Mystery',
 'http://RuTracker.org/forum/viewtopic.php?t=5451914',
 'https://rawg.io/games/adom-ancient-domains-of-mystery',
 'https://yepdownload.com/ancient-domains-of-

## Scrape the corpus

With the functions above we can collect a corpus.

In [None]:
def scrape(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    try:
        print(__message('Scraping {} ...'.format(url)))
        response = requests.get(url, headers=headers, timeout=(9.1, 12.1))
    except Exception as e:
        print(__failure('Failed to load {}'.format(url)))
        print(e)
        return None
    
    html = response.text
    if html and any(word in html.lower() for word in ['tutorial']):
        return None
    if html and any(word in html.lower() for word in ['rogue', 'procedural', 'generation', 'interview', 'mortem', 'review', 'history', 'develop', 'idea', 'inspir']):
        print(__message('Found article'))
        soup = bs4.BeautifulSoup(response.text, 'lxml')
        selections = soup.select('body > p') + soup.select('div > p') + soup.select('table td')
        content = [node.text.strip() for node in selections]
        return ''.join(content)
    return None
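
The keyword filter inside `scrape` can be pulled out as a pure function, which makes it easy to test without fetching pages. The sketch below uses the same word lists. The function name `looks_relevant` and the optional `title` check are assumptions added for illustration: requiring the game title is one possible way to reject generic pages that merely contain a word such as "interview".

```python
SKIP_WORDS = ('tutorial',)
KEEP_WORDS = ('rogue', 'procedural', 'generation', 'interview', 'mortem',
              'review', 'history', 'develop', 'idea', 'inspir')

def looks_relevant(html, title=None):
    """Return True if the page passes the keyword filter used by scrape()."""
    if not html:
        return False
    lowered = html.lower()
    if any(word in lowered for word in SKIP_WORDS):
        return False
    # Optional tightening (not in the original): the page must name the game.
    if title is not None and title.lower() not in lowered:
        return False
    return any(word in lowered for word in KEEP_WORDS)

print(looks_relevant('<p>An interview about 100 Rogues</p>', title='100 Rogues'))
print(looks_relevant('<p>Top 10 job interview questions</p>', title='100 Rogues'))
```

Without a `title`, the function accepts the same pages as the filter in `scrape`, including the second example.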

In [None]:
# For Roguelike games, we build a corpus with RogueTemple and DuckDuckGo
corpus = []
    
for index, roguelike in roguelikes.iterrows():
    print(roguelike)
    title = roguelike['Name']
    if not isinstance(title, str):
        continue
    text = []
    
    rogue_temple = scrape_mediawiki_url(roguelike['RogueTemple'])
    text.append(rogue_temple)

    developers = str(roguelike['Developer']).replace(',', ' OR ')
    links = scrape_duckduckgo(title, developers)
    
    for link in links[:10]:
        if 'roguebasin.roguelikedevelopment.org' in link \
            or 'roguebasin.com' in link \
            or 'wikipedia' in link:
            continue
        content = scrape(link)
        if content:
            text.append(content)
    
    corpus.append({"title": title, "text": text})
  
save_json('corpus.json', corpus)

Name                                                  100 Rogues
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                                   http://www.100rogues.com/
Released                                              2010/05/06
Updated                                               2010/05/06
Developer                                         Dinofarm Games
Theme                                                    Fantasy
Influences                                                 Rogue
Name: 0, dtype: object
http://duckduckgo.com/html/?q=%22100%20Rogues%22%20AND%20Dinofarm%20Games%20AND%20game%20AND%20%28interview%20OR%20mortem%20OR%20history%20OR%20develop%29
b'   |MSG| Scraping https://100rogues.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.dinofarmgames.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.dinofarmgames.com/forum/index.php?threads/100-rogues-on-android.1501/ ...'
b'   |MSG| Found article'
b'   |MSG| Scra

b'   |MSG| Scraping https://www.livecareer.com/career/advice/interview/surprising-hospitality-interview-questions ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://career.guru99.com/top-25-interview-questions-for-game-developer/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://indiegamehq.tumblr.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.geeksforgeeks.org/xoriant-interview-experience/ ...'
b'   |MSG| Scraping https://www.michaelpage.ae/advice/career-advice/job-interview-tips/top-10-interview-questions-and-how-answer-them ...'
b'   |MSG| Found article'
Name                                                        3089
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                              http://3089game.wordpress.com/
Released                                              2013/02/02
Updated                                               2014/02/27
Developer                                                 Phr00t
Theme    

b'   |MSG| Found article'
b'   |MSG| Scraping https://student.unsw.edu.au/interview-dos-and-donts ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://translate.google.ru/ ...'
b'   |MSG| Scraping https://www.youtube.com/watch?v=aMcjxSThD54 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.michaelpage.ae/advice/career-advice/job-interview-tips/top-10-interview-questions-and-how-answer-them ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.cleverism.com/15-funny-interview-questions/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.michaelpage.com.au/advice/career-advice/changing-jobs/top-do-s-and-don-ts-when-meeting-recruiter ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://employment.williams.edu/staff/staff-hiring-guidelines/interview-and-selection/interview-dos-and-donts/ ...'
Name                                                   AGB Rogue
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link               http:

In [None]:
# For Roguelike-like games, we build a corpus with Wikipedia and DuckDuckGo
# corpus = read_json('corpus-roguelike-like.json')
# if not corpus:
corpus = []
    
for index, roguelike in roguelikelikes.iterrows():
    print(roguelike)
    title = roguelike['Name']
    text = []
    
    page = scrape_wiki(title)
    if page:
        text.append(page.content)

    developers = str(roguelike['Developer']).replace(',', ' OR ')
    links = scrape_duckduckgo(title, developers)
    
    for link in links[:20]:
        if 'roguebasin.roguelikedevelopment.org' in link \
            or 'roguebasin.com' in link \
            or 'wikipedia' in link:
            continue
        content = scrape(link)
        if content:
            text.append(content)
    
    corpus.append({"title": title, "text": text})
  
save_json('corpus-roguelike-like.json', corpus)
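
Both corpus loops skip links by testing for substrings anywhere in the URL. A slightly stricter alternative, sketched here under assumptions, compares the host name only. The helper name `is_excluded` and the `example.com` URL in the usage note are hypothetical, and `wikipedia.org` stands in for the looser `'wikipedia'` substring used above.

```python
from urllib.parse import urlparse

EXCLUDED_DOMAINS = (
    'roguebasin.roguelikedevelopment.org',
    'roguebasin.com',
    'wikipedia.org',
)

def is_excluded(link, excluded=EXCLUDED_DOMAINS):
    """Return True if the link's host is an excluded domain or a subdomain of one."""
    host = (urlparse(link).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain) for domain in excluded)

links = [
    'https://en.wikipedia.org/wiki/Ancient_Domains_of_Mystery',
    'https://lgdb.org/game/adom-ancient-domains-of-mystery',
]
print([link for link in links if not is_excluded(link)])
```

The difference from the substring test is that a page such as `https://example.com/why-i-love-wikipedia` is kept, because only its host is examined.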

In [None]:
import io
import os
import re
import bs4
import sys
import json
import time
import nltk
import urllib
import pprint
import random
import string
import requests
import wikipedia
import itertools
import collections

In [None]:
# Extract themes
# Items of interest, of genre, of identification
# Emotions of joy, sadness, frustrations
# Memory recall? Specific sentences or sentiments
corpus = read_json(os.path.join(os.getcwd(), 'data', 'corpus.json'))
distributions = {}

for game, sites in corpus.items():
    print(game)
    tagged_sentences = []
    for url, content in sites.items():
        for sentence in content:
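            # encode_english is not defined in this notebook; it is assumed to
            # return a list of (word, tag) pairs with universal POS tags.
            tagged_sentences += encode_english(sentence)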
    freqdist = nltk.FreqDist((word, tag) for word, tag in tagged_sentences if tag == u'ADJ')
    distributions[game] = freqdist

In [None]:
for game, dist in distributions.items():
    words = [word for (word, tag), count in dist.most_common(10)]
    print(u'{}: {}'.format(game, u', '.join(words)))
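
The adjective ranking above can be tried without NLTK, since `nltk.FreqDist` is a subclass of `collections.Counter`. The sketch below uses a small hand-tagged sample. The function name `top_adjectives` and the sample tokens are made up for illustration; real input would come from a part-of-speech tagger.

```python
import collections

def top_adjectives(tagged_tokens, n=10):
    """Return the n most frequent words tagged 'ADJ', most frequent first."""
    counts = collections.Counter(
        (word, tag) for word, tag in tagged_tokens if tag == 'ADJ')
    return [word for (word, tag), count in counts.most_common(n)]

# Hand-tagged sample using universal POS tags (illustrative only).
tagged = [
    ('A', 'DET'), ('brutal', 'ADJ'), ('dungeon', 'NOUN'), ('with', 'ADP'),
    ('brutal', 'ADJ'), ('monsters', 'NOUN'), ('and', 'CONJ'),
    ('random', 'ADJ'), ('maps', 'NOUN'),
]
print(top_adjectives(tagged, n=2))
```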

In [None]:
# Testing
scrape_wiki(u"Dungeon_(video_game)")

In [None]:
# Scrape for links
game_meta = read_json(os.path.join(os.getcwd(), 'data', 'game-sources.json'))

# dict.items() returns a view in Python 3; random.shuffle needs a list.
shuffled_game_meta = list(game_meta.items())
random.shuffle(shuffled_game_meta)
for game, meta in shuffled_game_meta:
    game_meta[game]['Links'] += scrape_duckduckgo(game, game_meta[game]['Developer'])
    game_meta[game]['Links'] = list(set(game_meta[game]['Links']))
    save_json(os.path.join(os.getcwd(), 'data', 'game-sources.json'), game_meta)
    time.sleep(2)
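
The cell above deduplicates links with `list(set(...))`, which does not preserve order, so the saved link lists can be reshuffled from one run to the next. If a stable order matters, one option is sketched below; the helper name `merge_links` and the sample values are hypothetical.

```python
def merge_links(existing, new):
    """Combine two link lists, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and keep insertion order.
    return list(dict.fromkeys(list(existing) + list(new)))

print(merge_links(['a', 'b'], ['b', 'c', 'a']))
```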

In [None]:
# Load content in search results
game_meta = read_json(os.path.join(os.getcwd(), 'data', 'game-sources.json'))
cached = read_json(os.path.join(os.getcwd(), 'data', 'corpus.json'))

output = cached
for game, meta in game_meta.items():
    if game not in output:
        output[game] = {}
    print(__message(game))
    for url in meta['Links']:
        if url in output[game] or url.endswith('pdf'):
            continue
        data = []
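        # NOTE: the parsing below expects raw HTML, but scrape() as defined
        # above returns already-extracted text (or None).
        html = scrape(url)
        if html and any(word in html.lower() for word in ['interview', 'mortem', 'review', 'history', 'develop', 'idea', 'inspir']):
            soup = bs4.BeautifulSoup(html, 'lxml')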
            content = soup.select('div > p') + soup.select('body > p')
            data = [c.string.strip() for c in content if c.string and c.string.strip()]
            output[game][url] = data
            print(__message(u'Scraped {}'.format(url)))
            save_json(os.path.join(os.getcwd(), 'data', 'corpus.json'), output)

In [None]:
# Locate mentions of games
game_LUT = set(read_json(os.path.join(os.getcwd(), 'data', 'games.json')))
game_meta = read_json(os.path.join(os.getcwd(), 'data', 'game-sources.json'))
game_articles = read_json(os.path.join(os.getcwd(), 'data', 'corpus.json'))
not_games = set(read_json(os.path.join(os.getcwd(), 'data', 'not-games.json')))

# Create a look up table for games
roguelike_LUT = {}
for game, meta in game_meta.items():
    roguelike_LUT[game] = game
    if 'AKA' in meta:
        for aka in meta['AKA']:
            roguelike_LUT[aka] = game

In [None]:
# Look through the interview articles
roguelike_relations = {}
other_relations = {}
for game, articles in game_articles.items():
    roguelike_relations[game] = []
    other_relations[game] = []
    counter = collections.Counter()
    for url, article in articles.items():
        # Intersection for fast search
        things = []
        current = u''
        for paragraph in article:
            for token in paragraph.split():
                # Raw string avoids invalid-escape warnings; re caches the compiled pattern.
                if re.match(r"^[A-Z0-9][\w:']*[\w:']|[A-Z\.]+$", token) or \
                        (current and token in ('the', 'of', 'no', 'to')):
                    current += u'{} '.format(token)
                elif current:
                    things.append(current.strip())
                    current = u''
        roguelike_things = [roguelike_LUT[s] for s in things if s in roguelike_LUT]
        if roguelike_things:
            roguelike_relations[game].extend(roguelike_things)
        other_things = [s for s in things if
                            s in game_LUT and
                            s not in not_games and
                            s not in roguelike_LUT and
                            len(s) > 1 and
                            not s.isdigit()]
        if other_things:
            other_relations[game].extend(other_things)

# print("\n### ROGUELIKES ###\n")
# pprint.pprint(roguelike_relations, indent=2)
# print("\n### OTHER GAMES ###\n")
# pprint.pprint(other_relations, indent=2)

save_json(os.path.join(os.getcwd(), 'generated', 'roguelike-relations.json'), roguelike_relations)
save_json(os.path.join(os.getcwd(), 'generated', 'other-relations.json'), other_relations)
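
The phrase extraction in the loop above groups runs of capitalized tokens, joined by a few lowercase connectors, into candidate titles. Here it is as a standalone sketch that can be tested on a single sentence. The function name `capitalized_phrases` is hypothetical, and unlike the loop above this version also emits a phrase that is still open when the text ends.

```python
import re

# Same pattern as in the cell above: a capitalized or numeric token, or an
# all-caps/dotted abbreviation.
TITLE_TOKEN = re.compile(r"^[A-Z0-9][\w:']*[\w:']|[A-Z\.]+$")
CONNECTORS = ('the', 'of', 'no', 'to')

def capitalized_phrases(paragraph):
    """Collect runs of title-like tokens as candidate game titles."""
    phrases = []
    current = []
    for token in paragraph.split():
        if TITLE_TOKEN.match(token) or (current and token in CONNECTORS):
            current.append(token)
        elif current:
            phrases.append(' '.join(current))
            current = []
    if current:  # flush a phrase that runs to the end of the text
        phrases.append(' '.join(current))
    return phrases

print(capitalized_phrases('We loved Dungeon Crawl Stone Soup and the original Rogue as kids'))
```

Sentence-initial words such as "We" are picked up as candidates too, which is why the results are then intersected with the lookup tables.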

In [None]:
# Construct influence network

roguelike_relations = read_json(os.path.join(os.getcwd(), 'generated', 'roguelike-relations.json'))
other_relations = read_json(os.path.join(os.getcwd(), 'generated', 'other-relations.json'))
games_years = read_json(os.path.join(os.getcwd(), 'generated', 'games-years.json'))

roguelike_influence = {}
for roguelike, other_roguelikes in roguelike_relations.items():
    roguelike_influence[roguelike] = []
    
    roguelike_relation_counter = collections.Counter()
    for other_roguelike in other_roguelikes:
        if other_roguelike != roguelike:
            roguelike_relation_counter[other_roguelike] += 1
            
    other_relation_counter = collections.Counter()
    for other_relation in other_relations[roguelike]:
        if other_relation != roguelike:
            other_relation_counter[other_relation] += 1
            
    for roguelike_relation in roguelike_relation_counter.most_common(5):
        roguelike_influence[roguelike].append(roguelike_relation[0])
        
    for other_relation in other_relation_counter.most_common(5):
        if other_relation[1] > 1:
            roguelike_influence[roguelike].append(other_relation[0])
            
#     print(u'{}\n{}\n{}\n'.format(roguelike, 
#                                    roguelike_relation_counter.most_common(3), 
#                                    other_relation_counter.most_common(3)))

# dict.values() views cannot be added with + in Python 3; chain them instead.
games_set_small = set(itertools.chain(*roguelike_relations.values(), *other_relations.values()))
    
games_years_small = {game: int(year) for game, year in games_years.items() if game in games_set_small}
    
print(games_years_small)
                                                        
save_json(os.path.join(os.getcwd(), 'generated', 'relations.json'), roguelike_influence)
save_json(os.path.join(os.getcwd(), 'generated', 'games-years-small.json'), games_years_small)
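
The ranking step for a single game can be written as a small function: count mentions, drop self-references, keep the five most mentioned roguelikes, and keep other games only if they are mentioned more than once. The sketch below follows that logic. The function name `top_influences` and the mention lists are made up for illustration.

```python
import collections

def top_influences(game, roguelike_mentions, other_mentions, n=5):
    """Rank the games mentioned alongside `game`, ignoring self-mentions."""
    roguelike_counts = collections.Counter(m for m in roguelike_mentions if m != game)
    other_counts = collections.Counter(m for m in other_mentions if m != game)
    influences = [title for title, count in roguelike_counts.most_common(n)]
    # Non-roguelike titles need at least two mentions to count as an influence.
    influences += [title for title, count in other_counts.most_common(n) if count > 1]
    return influences

# Illustrative mention lists, not real scraped data.
print(top_influences('3069',
                     ['3059', '3059', '3069', 'NetHack'],
                     ['Minecraft', 'Minecraft', 'Fallout']))
```

The stricter threshold for non-roguelike titles guards against common words that happen to match a game title once.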