# Scraping roguelike articles on the web

This Python notebook describes the data collection process for Roguelike Universe. If you have not yet installed the requirements, you can do so by running:

`pip install -r requirements.txt`

## Sourcing game titles

We prepared a [list of roguelike games](https://en.wikipedia.org/wiki/List_of_roguelikes) from Wikipedia as a starting point. To detect out-of-genre influences, we also sourced more than 10,000 video game titles from a Pastebin upload by the user ________.

First, we set up read/write functions that handle Unicode correctly, because web pages can contain characters from any script.

In [15]:
import os
import io
import json
import pandas as pd

def read_json(path):
    data = ''
    with io.open(path, 'r', encoding='utf-8') as f:
        data = json.loads(f.read())
        print(__message('Loaded {}'.format(path)))
    return data
    
def save_json(path, data):
    # The file is opened as UTF-8 text, so the JSON string is written directly;
    # ensure_ascii=False keeps non-ASCII characters unescaped.
    output = json.dumps(data, indent=2, ensure_ascii=False)
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(output)
    print(__message('Written to {}'.format(path)))
    
def __success(text):
    return '  (SUCC) {}'.format(text).encode('utf-8')
    
def __failure(text):
    return '!!FAIL!! {}'.format(text).encode('utf-8')
    
def __warning(text):
    return '??WARN?? {}'.format(text).encode('utf-8')
    
def __message(text):
    return '   |MSG| {}'.format(text).encode('utf-8')
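
As a quick check of the Unicode handling described above, the following sketch writes a record containing a non-ASCII character and reads it back. It is a standalone illustration rather than part of the pipeline: the temporary file path and the `sample` record are made up for the example, and only the standard library is used.

```python
import io
import json
import os
import tempfile

# Hypothetical sample record; the developer name also appears in the data below.
sample = [{"title": "Strange Adventures In Infinite Space",
           "developer": "Iikka Keränen"}]

# Write to a throwaway directory so no project file is touched.
path = os.path.join(tempfile.mkdtemp(), 'roundtrip.json')

with io.open(path, 'w', encoding='utf-8') as f:
    # ensure_ascii=False stores 'ä' as-is instead of the escape \u00e4.
    f.write(json.dumps(sample, indent=2, ensure_ascii=False))

with io.open(path, 'r', encoding='utf-8') as f:
    raw = f.read()

restored = json.loads(raw)
print(restored == sample)   # the data survives the round trip
print('Keränen' in raw)     # the character is stored unescaped
```

Opening the file with an explicit `encoding='utf-8'` is what makes the result independent of the platform's default encoding.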

Here is a sample of the list of roguelike games from Wikipedia:

In [16]:
roguelikes = pd.read_csv(os.path.join(os.getcwd(), 'roguelikes.csv'), skip_blank_lines=True)
roguelikes.head()

Unnamed: 0,Name,RogueTemple,Link,Status,Released,Updated,Developer,Theme,Influences
0,100 Heroes: Shopkeeper of Doom,http://roguebasin.roguelikedevelopment.org/ind...,http://www.bay12forums.com/smf/index.php?topic...,alpha,2012/00/00,2012/10/25,Paul Wright,Economics/Trading,Recettear
1,100 Rogues,http://roguebasin.roguelikedevelopment.org/ind...,http://www.100rogues.com/,stable,2010/05/06,2010/05/06,Dinofarm Games,Fantasy,Rogue
2,1Quest,http://roguebasin.roguelikedevelopment.org/ind...,http://www.ratzngodz.fr/,stable,2014/02/20,2015/02/07,Ratz 'N' Godz,Fantasy,"DCSS, Dominion4"
3,3059,http://roguebasin.roguelikedevelopment.org/ind...,https://sites.google.com/site/free3069/3059---...,stable,2005/00/00,2005/06/11,Phr00t,"Science Fiction, Alien Planets, Futuristic",nethack
4,3069,http://roguebasin.roguelikedevelopment.org/ind...,http://sites.google.com/site/free3069/,stable,2009/07/06,2009/10/06,Phr00t,"Science Fiction, Alien Planets, Futuristic",3059


In addition, here is a sample of the roguelike-like games:

In [17]:
roguelikelikes = pd.read_csv(os.path.join(os.getcwd(), 'roguelike-likes.csv'), skip_blank_lines=True)
roguelikelikes.head()

Unnamed: 0,Name,Released,Updated,Developer,Theme,Influences
0,ToeJam & Earl,1991,,Johnson Voorsanger Productions,Fantasy,
1,Diablo,1996,,Blizzard North,Fantasy,
2,Diablo II,2000,,Blizzard Entertainment,Fantasy,
3,Lost Labyrinth,2001,2011.0,Lost Labyrinth,Fantasy,
4,Strange Adventures In Infinite Space,2002,2004.0,"Rich Carlson, Iikka Keränen",Space science fiction,


And a sample of the list of video games:

In [18]:
video_games = pd.read_json(os.path.join(os.getcwd(), 'games.json'))
video_games.head(10)

Unnamed: 0,title,year
0,$hop-n-$pree,2009
1,'43 - One Year After,1986
2,'89 Denno Kyusei Uranai,1988
3,'Nam 1965-1975,1991
4,'Splosion Man,2009
5,'Til Death Do Us Part,2013
6,(Almost) Total Mayhem,2011
7,(Not) Just another Space Shooter,2004
8,(T)Raumschiff Surprise - Periode 1,2004
9,*NSYNC Hotline Phone and Fantasy CD-Rom Game,2001
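
This notebook does not show how the video game titles are matched against scraped text later on. One plausible approach, sketched here purely as an assumption, is a case-insensitive whole-phrase search. The function name `find_title_mentions` and the sample inputs are hypothetical.

```python
import re

def find_title_mentions(text, titles):
    """Return the titles that occur in text as whole phrases, ignoring case."""
    found = []
    for title in titles:
        # re.escape protects titles containing regex metacharacters,
        # such as "(Almost) Total Mayhem".
        # The lookarounds stop "Rogue" from matching inside "Rogues".
        pattern = r'(?<!\w)' + re.escape(title) + r'(?!\w)'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(title)
    return found

titles = ["Diablo", "'Splosion Man", "(Almost) Total Mayhem"]
text = "It features two playable classes with unique skill trees (similar to Diablo)."
mentions = find_title_mentions(text, titles)
print(mentions)
```

Plain substring search would be simpler, but very short titles would then match inside unrelated words.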


## Building a corpus 

Before we can do any text analysis, we need to build a corpus to operate on.

### 1. RogueTemple

The RogueTemple wiki collects detailed descriptions of roguelike games; the `RogueTemple` column of the roguelikes table holds each game's page URL.

In [26]:
import bs4
import requests

def scrape_mediawiki_url(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    try:
        response = requests.get(url, headers=headers, timeout=(9.1, 12.1))
    except Exception as e:
        print(__failure('Failed to scrape {}'.format(url)))
        print(e)
        return ''
    
    soup = bs4.BeautifulSoup(response.text, 'lxml')
    
    content = [node.text.strip() for node in soup.select('#mw-content-text') if node.text]
    return ''.join(content)

In [20]:
# Sample scrape
content = scrape_mediawiki_url('http://roguebasin.roguelikedevelopment.org/index.php?title=100_Rogues')
print(content)

100 Rogues



Stable game



Developer

Dinofarm Games



Theme

Fantasy



Influences

Rogue



Released

 (?)



Updated

May 6, 2010 (?)



Licensing

Commercial



P. Language





Platforms

iPhone



Interface

Graphical Tiles



Game Length

Medium



Official site of 100 Rogues




100 Rogues is an original Roguelike for the iPhone, iPod Touch and iPad devices.  It features two playable classes with unique skill trees (similar to Diablo), several tilesets, original SNES-style music, and fully animated pixel art.  100 Rogues was developed from scratch for the iPhone OS devices, and has a click to move control scheme rather than a virtual d-pad.

Contents

1 Gameplay
2 Combat
3 Skills
4 Reception


 Gameplay
A game of 100 Rogues is a linear progression through 12 dungeon levels spread across 3 worlds: The Bandit Hole, The Dungeon, and Hell. The first three levels of each world consist of randomly-generated maps initially populated by Mobs of different Monsters. After these Mobs a
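
The scraped text above starts with the page's infobox, flattened into alternating label and value lines. If those fields are needed separately, they could be recovered along these lines. This is a sketch under assumptions: `parse_infobox` and the `LABELS` list are hypothetical names, and the label set is simply read off the sample output.

```python
def parse_infobox(text, labels):
    """Pair each known infobox label with the next non-empty line of text."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fields = {}
    for i, line in enumerate(lines):
        if line in labels and line not in fields:
            nxt = lines[i + 1] if i + 1 < len(lines) else ''
            # If the next line is itself a label, this field was left blank.
            fields[line] = '' if nxt in labels else nxt
    return fields

LABELS = ['Developer', 'Theme', 'Influences', 'Released', 'Updated',
          'Licensing', 'P. Language', 'Platforms', 'Interface', 'Game Length']

# Shortened excerpt of the scraped text shown above.
sample = """100 Rogues

Stable game

Developer

Dinofarm Games

Theme

Fantasy

P. Language



Platforms

iPhone
"""
info = parse_infobox(sample, LABELS)
print(info)
```

A value that happens to equal a label name would be misread as an empty field, so this only suits rough exploration.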

### 2. Wikipedia

Searching Wikipedia takes about two dozen lines of code with the `wikipedia` package.

In [21]:
import wikipedia

def scrape_wiki_id(pageid):
    page = wikipedia.page(pageid=pageid)    
    print_wiki_page(page)
    
def scrape_wiki(title):
    # Try the plain title first, then fall back to disambiguated page names.
    candidates = [
        title,
        '{} (video game)'.format(title).replace(' ', '_'),
        '{} (Unix video game)'.format(title).replace(' ', '_'),
    ]
    for searchstring in candidates:
        try:
            return wikipedia.page(searchstring, auto_suggest=False)
        except wikipedia.DisambiguationError:
            # Ambiguous title: move on to the next, more specific candidate.
            continue
        except wikipedia.PageError:
            print(__warning(u'Search term "{}" returned nothing'.format(searchstring)))
            return None
        except Exception:
            print(__warning(u'Wikipedia cannot find "{}"'.format(searchstring)))
            return None
    print(__warning(u'Wikipedia cannot find "{}"'.format(title)))
    return None
    
def print_wiki_page(page):
    print(page.title)
    print(page.content)
    print(page.references)  

In [22]:
# Test Wikipedia crawl
print(scrape_wiki('Rogue Legacy'))

<WikipediaPage 'Rogue Legacy'>


### 3. DuckDuckGo

We also collect potentially interesting web pages through an internet search engine, DuckDuckGo.

In [23]:
import time
import requests
import urllib.parse

def scrape_duckduckgo(keywords, developer=""):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    # Quote the title for an exact match and require at least one topical keyword.
    searchstring = u'"{}" AND {} AND game AND (mortem OR history OR developer OR review)'.format(keywords, developer)
    q = u'http://duckduckgo.com/html/?q={}'.format(urllib.parse.quote(searchstring.encode('utf-8')))
    print(q)

    response = requests.get(q, headers=headers, timeout=(9.1, 12.1))
    soup = bs4.BeautifulSoup(response.text, 'lxml')

    # Result links are the anchors matched by the 'a.result__a' selector.
    links = [node.get('href') for node in soup.select('a.result__a')]
    return links
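
The query string above inserts the developer verbatim, so an empty default leaves a dangling `AND`, and a missing value from pandas shows up as the literal word `nan`. The sketch below shows one way the query could be assembled more defensively. `build_query` is a hypothetical helper, not part of the notebook. Grouping the developer names in parentheses mirrors the last clause of the original query; whether the search engine honours that grouping is not verified here.

```python
import math
import re
import urllib.parse

def build_query(title, developer=None):
    """Build the DuckDuckGo HTML search URL, skipping a missing developer."""
    terms = ['"{}"'.format(title)]
    # pandas represents an empty CSV cell as float('nan').
    missing = developer is None or (
        isinstance(developer, float) and math.isnan(developer))
    if not missing:
        # Drop parenthesised contact details, then OR the names together.
        cleaned = re.sub(r'\(.*?\)', '', str(developer))
        names = [name.strip() for name in cleaned.split(',') if name.strip()]
        if names:
            terms.append('({})'.format(' OR '.join(names)))
    terms.append('game')
    terms.append('(mortem OR history OR developer OR review)')
    searchstring = ' AND '.join(terms)
    return 'http://duckduckgo.com/html/?q={}'.format(
        urllib.parse.quote(searchstring))

url_no_dev = build_query('Tower', float('nan'))
url_with_devs = build_query(
    'UltraRogue', 'Herb Chong, Roguelike Restoration Project, others')
print(url_no_dev)
print(url_with_devs)
```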

In [24]:
# scrape_duckduckgo('Ancient Domains of Mystery', 'Thomas Biskup')

## Scrape the corpus

With the functions above we can collect a corpus.

In [29]:
def scrape(url, title):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36',
    }
    try:
        print(__message('Scraping {} ...'.format(url)))
        response = requests.get(url, headers=headers, timeout=(10.1, 15.1))
    except Exception as e:
        print(__failure('Failed to load {}'.format(url)))
        print(e)
        return None
    
    html = response.text
    if not html:
        return None
    lowered = html.lower()
    # Skip tutorials outright.
    if 'tutorial' in lowered:
        return None
    # Lower-case the title as well, since it is compared against lower-cased HTML.
    keywords = [title.lower(), 'rogue', 'procedural', 'generation', 'mortem', 'review', 'history', 'develop', 'idea', 'inspir']
    if any(word in lowered for word in keywords):
        print(__message('Found article'))
        soup = bs4.BeautifulSoup(html, 'lxml')
        selections = soup.select('body > p') + soup.select('div > p') + soup.select('table td')
        content = [node.text.strip() for node in selections]
        return ''.join(content)
    return None
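
The keyword test in `scrape` accepts a page as soon as any one of several common words appears, and the run log further down shows many off-topic pages being accepted. A stricter variant might require the game's title as well as a topical keyword. The following is only a sketch of that idea: `looks_relevant` and its shortened keyword list are hypothetical, not what the notebook actually ran.

```python
KEYWORDS = ['rogue', 'procedural', 'mortem', 'review', 'history',
            'develop', 'inspir']

def looks_relevant(html, title, keywords=KEYWORDS):
    """Keep a page only if it names the game and also hits a topical keyword."""
    text = html.lower()
    if 'tutorial' in text:
        return False
    # The title must be lower-cased too before comparing with lower-cased text.
    if title.lower() not in text:
        return False
    return any(word in text for word in keywords)

on_topic = looks_relevant('<p>Tower review: a roguelike worth your time</p>', 'Tower')
off_topic = looks_relevant('<p>Common problems with traction control: a review</p>', 'Tower')
print(on_topic, off_topic)
```

Short or generic titles such as "Tower" will still match unrelated pages, so this reduces noise without removing it.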

## Web scraping -- all games

In [34]:
# For Roguelike games, we build a corpus with RogueTemple and DuckDuckGo
corpus = []
    
for index, roguelike in roguelikes[531:].iterrows():
    print(roguelike)
    title = roguelike['Name']
    if not isinstance(title, str):
        continue
    text = []
    
    rogue_temple = scrape_mediawiki_url(roguelike['RogueTemple'])
    text.append(rogue_temple)

    developers = str(roguelike['Developer']).replace(',', ' OR ')
    links = scrape_duckduckgo(title, developers)
    
    for link in links[:25]:
        if 'roguebasin.roguelikedevelopment.org' in link \
            or 'roguebasin.com' in link \
            or 'wikipedia' in link:
            continue
        content = scrape(link, title)
        if content:
            text.append(content)
    
    corpus.append({"title": title, "text": text})
    
    # Be nice
    if not links:
        print('Sleeping')
        time.sleep(10)
  
save_json('corpus-4.json', corpus)

Name                                                       Tower
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                               http://tower.sourceforge.net/
Status                                                      beta
Released                                              2007/01/27
Updated                                                2009/2/13
Developer                                                    NaN
Theme                                                    Fantasy
Influences                                                   NaN
Name: 531, dtype: object
http://duckduckgo.com/html/?q=%22Tower%22%20AND%20nan%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping https://gamedevelopment.tutsplus.com/articles/tower-of-greed-post-mortem--gamedev-115 ...'
b'   |MSG| Scraping https://www.gamespot.com/reviews/post-mortem-review/1900-2911836/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://

b'   |MSG| Found article'
b'   |MSG| Scraping https://pawfriction.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://physics.stackexchange.com/q/340384 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://mortalengines.wikia.com/wiki/Traction_City ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://thejournal.com/articles/2016/05/05/report-games-and-online-video-gain-traction-in-education.aspx ...'
b'   |MSG| Scraping http://www.icj-cij.org/en/case/50 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://startuprunner.com/traction-bullseye-framework/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.cars.com/articles/common-problems-with-traction-control-1420680310438/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://thepitchclinic.com/traction-investors-want-it-heres-how-you-show-it/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://mcasuspension.com/traction-mod ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.

b'   |MSG| Found article'
Name                                             Triangle Wizard
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                             http://trianglewizard.webs.com/
Status                                                    stable
Released                                              2008/10/27
Updated                                               2018/01/16
Developer      Wouter van den Wollenberg(''wollie73''@''hotma...
Theme                                                    Fantasy
Influences                       NetHack, Diablo, Age of Wonders
Name: 536, dtype: object
http://duckduckgo.com/html/?q=%22Triangle%20Wizard%22%20AND%20Wouter%20van%20den%20Wollenberg%28%27%27wollie73%27%27%40%27%27hotmail.com%27%27%29%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                                 Trollhunter
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...

http://duckduckgo.com/html/?q=%22Ultima%20Ratio%20Regum%22%20AND%20Mark%20Johnson%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                                  UltraRogue
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                http://rogue.rogueforge.net/ultrarogue-1-07/
Status                                                    stable
Released                                              2005/86/00
Updated                                               2005/02/07
Developer      Herb Chong, Roguelike Restoration Project, others
Theme                                                        NaN
Influences                             Rogue 3.6, Advanced Rogue
Name: 547, dtype: object
http://duckduckgo.com/html/?q=%22UltraRogue%22%20AND%20Herb%20Chong%20OR%20%20Roguelike%20Restoration%20Project%20OR%20%20others%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
N

http://duckduckgo.com/html/?q=%22Valkyrie%20Framework%22%20AND%20Kornel%20Kisielewicz%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                          Vapors of Insanity
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                        http://www.roguetemple.com/z/vapors/
Status                                                      beta
Released                                              2011/07/29
Updated                                               2012/06/27
Developer                                                      Z
Theme                           High Fantasy, society gone crazy
Influences                                        Ragnarok, ADOM
Name: 558, dtype: object
http://duckduckgo.com/html/?q=%22Vapors%20of%20Insanity%22%20AND%20Z%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                          Virtual Con

b'   |MSG| Found article'
b'   |MSG| Scraping http://www.wadjeteyegames.com/2015/05/25/post-mortem-golden-wake/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.pagerduty.com/resources/learn/post-mortem-incident-report/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamasutra.com/view/news/238773/10_seminal_game_postmortems_every_developer_should_read.php ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.portent.com/blog/10-tips-for-a-successful-post-mortem.htm ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.moddb.com/games/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.behindthename.com/name/Benjamin ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://indiegamebundle.wikia.com/wiki/Post_Mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://en.wiktionary.org/wiki/post_mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://store.steampowered.com/app/586370 ...'
Name                 

http://duckduckgo.com/html/?q=%22Web%20Raid%22%20AND%20Karlheinz%20Agsteiner%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                             Web Raid Mobile
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link           https://forums.roguetemple.com/irldb/[https://...
Status                                                    stable
Released                                              2011/11/02
Updated                                               2013/11/05
Developer                                    Karlheinz Agsteiner
Theme                                          None particularly
Influences                                      NetHack, WebRaid
Name: 570, dtype: object
http://duckduckgo.com/html/?q=%22Web%20Raid%20Mobile%22%20AND%20Karlheinz%20Agsteiner%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                        When 

b'   |MSG| Found article'
b'   |MSG| Scraping http://gamehistory.org/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.youtube.com/watch?v=P9JYgBmfL-Y ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.theguardian.com/technology/gamesblog/2012/dec/06/video-games-as-art ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamedevelopment.tutsplus.com/articles/15-analyses-post-mortems-and-game-design-docs--gamedev-11554 ...'
b'   |MSG| Scraping https://www.indiedb.com/games/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://allaboutwindowsphone.com/software/developer/Dawid-Farbaniec.php ...'
b'   |MSG| Scraping http://indiegames.clickteam.com/1277/gdc-2016-post-mortem.html ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://AppAgg.com/developer/dawid-farbaniec-d-f/?hl=en ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://nighthoodgames.itch.io/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorming.com/pre

b'   |MSG| Scraping https://itch.io/jam/libgdxjam/topic/12068/david-and-albertos-dev-blog ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.roguelikedevelopment.org/archive/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.billmorefield.com/index.php/2016/03/06/roguelike-development-with-c-part-1-introduction/ ...'
b'   |MSG| Scraping http://www.oldpcgaming.net/witchaven-2-blood-vengeance-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.theverge.com/2018/4/25/17280908/george-r-r-martin-grrm-game-of-thrones-song-of-ice-and-fire ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.theguardian.com/artanddesign/2017/apr/21/the-terrors-genius-of-alberto-giacometti-artist-sculptor-tate-modern ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://theplayersaid.com/2017/05/13/interview-with-elo-darkness-designers-tommaso-mondadori-and-alberto-parisi/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.rockpapershotgun.com

b'   |MSG| Found article'
b'   |MSG| Scraping https://theworldgame.sbs.com.au/video ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://notgameworld.ru/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.youtube.com/watch?v=h0tbNg4kTwg ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamesradar.com/monster-hunter-world-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gameranx.com/features/id/6335/article/best-open-world-games/2/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamespclist.com/best-games-pc-list/open-world-games-pc-list-2.html ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://worldofwarcraft.com/en-us/news/21498532 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://worldofgamesdownload.weebly.com/ ...'
b'!!FAIL!! Failed to load https://worldofgamesdownload.weebly.com/'
HTTPSConnectionPool(host='worldofgamesdownload.weebly.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeou

http://duckduckgo.com/html/?q=%22XLarn%22%20AND%20Swinfjord-Games%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping http://swinfjord-games.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamespot.com/reviews/post-mortem-review/1900-2911836/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamesmojo.com/games/indie/xlarn ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://store.steampowered.com/app/360030/XLarn/ ...'
b'   |MSG| Scraping https://www.moddb.com/games/xlarn ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.indiedb.com/company/swinfjord-games ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://AppAgg.com/developer/swinfjord-games/?hl=ru ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.qwant.com/game/xlarn?l=fr ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.youtube.com/watch?v=P9JYgBmfL-Y ...'
b'   |MSG| Found article'
b'   |MSG| Scraping ht

b'   |MSG| Found article'
b'   |MSG| Scraping https://www.historicmysteries.com/post-mortem-photography/ ...'
b'   |MSG| Scraping https://killscreen.com/articles/developer-does-a-post-mortem-on-his-divorce/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://waltorious.wordpress.com/2012/02/27/roguelike-highlights-xenocide/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://pcgamingwiki.com/wiki/Post_Mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://wiki2.org/en/Past_Mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamepressure.com/download.asp?ID=1993 ...'
b'   |MSG| Found article'
Name                                                  Xenomarine
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                             http://www.ascifiroguelike.com/
Status                                                     alpha
Released                                              2016/01/05
Updated                          

http://duckduckgo.com/html/?q=%22XirrelaiRPG%22%20AND%20Pteriforever%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping https://www.gamespot.com/reviews/post-mortem-review/1900-2911836/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.youtube.com/watch?v=P9JYgBmfL-Y ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://nighthoodgames.itch.io/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorming.com/pre-mortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.giantbomb.com/profile/yukoasho/blog/post-mortem-duke-nukem-forever/83005/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamejolt.com/@Pteriforever ...'
b'   |MSG| Scraping https://gamejolt.com/games/mortem/223601 ...'
b'   |MSG| Scraping https://adventuregamers.com/articles/view/17557 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://manishearth.github.io/blog/2015/05/28/github-streak-end-game-and-post-mor

b'   |MSG| Found article'
b'   |MSG| Scraping https://www.dailystar.co.uk/news/latest-news/453137/john-palmer-goldfinger-shot-death-killed-essex ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamasutra.com/features/postmortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.historicmysteries.com/post-mortem-photography/ ...'
b'   |MSG| Scraping https://codeascraft.com/2012/05/22/blameless-postmortems/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://twitter.com/john_palmer10 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.bizarrepedia.com/victorian-coffins-post-mortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://lawandorder.wikia.com/wiki/Post-Mortem_Blues ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://en.wiktionary.org/wiki/post_mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.thefullwiki.org/John_Palmer ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.facebook.com/JohnPalm

b'   |MSG| Scraping https://github.com/blubaron/z-angband/blob/master/z_readme ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://github.com/blubaron/z-angband/blob/master/z_faq.txt ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamejolt.com/games/knightsquest/313416 ...'
b'   |MSG| Scraping https://sites.google.com/site/mangojuice75/home2 ...'
Name                                                     Zaiband
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                    http://www.zaimoni.com/zaiband/index.htm
Status                                                     alpha
Released                                              2007/03/04
Updated                                                 2008/7/3
Developer                                              Bessarion
Theme                                                    fantasy
Influences       Moria, Tolkien's Middle-Earth, *D&D, Rolemaster
Name: 592, dtype: object
http://duckduckgo.com/h

b'   |MSG| Found article'
b'   |MSG| Scraping http://www.ign.com/articles/2003/02/25/post-mortem-review ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://nethack.wikia.com/wiki/ZAPM ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorming.com/pre-mortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.moddb.com/games/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://adventuregamers.com/articles/view/17557 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://en.wiktionary.org/wiki/post_mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.thefullwiki.org/Cyrus_A._Dolph ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://killscreen.com/articles/developer-does-a-post-mortem-on-his-divorce/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://mortemmanor.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://hbr.org/2007/09/performing-a-project-premortem ...'
b'   |MSG| Found article'
b'   |MSG|

Name                                                     Zomband
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                    http://www.zooptek.net/drupal/?q=node/15
Status                                                    stable
Released                                              2007/03/31
Updated                                               2008/07/20
Developer                                                ZoopTEK
Theme                                                    Zombies
Influences                                      Angband, zombies
Name: 597, dtype: object
http://duckduckgo.com/html/?q=%22Zomband%22%20AND%20ZoopTEK%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
Sleeping
Name                                    Zombie Minefield Sweeper
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link           http://jsfiddle.net/JesterBLUE/cco8mt0a/embedd...
Status                                        

b'   |MSG| Found article'
b'   |MSG| Scraping https://hbr.org/2007/09/performing-a-project-premortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorming.com/pre-mortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://theadventurezone.wikia.com/wiki/The_The_Adventure_Zone_Zone:_Experiments_Post-Mortem,_More_on_Season_Two! ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamasutra.com/features/postmortem/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://nighthoodgames.itch.io/mortem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.gameboomers.com/wtcheats/pcPp/Post_Mortem.htm ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.ashford.zone/2009/08/creepy-post-mortem-photos-from-the-victorian-age ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorrent.co/mortem.html ...'
b'   |MSG| Scraping https://store.steampowered.com/app/586370 ...'
b'   |MSG| Scraping https://www.torrent-zone.com/mortem-postmortem/
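
The log above shows the same URLs being fetched again for different games. The sketch below shows one way to avoid repeat downloads and to pause between requests. It rests on several assumptions: `PoliteFetcher` is a hypothetical class, the one-second delay is arbitrary, and the cached value should be the raw page, because the keyword filter in `scrape` depends on the title and would have to be applied afterwards. A stand-in fetch function keeps the example offline.

```python
import time

class PoliteFetcher:
    """Wrap a fetch function with a per-URL cache and a delay between calls."""

    def __init__(self, fetch, delay=1.0, sleep=time.sleep):
        self.fetch = fetch      # function taking a URL and returning raw content
        self.delay = delay      # seconds to wait between uncached requests
        self.sleep = sleep      # injectable so the example runs instantly
        self.cache = {}
        self.calls = 0

    def get(self, url):
        if url in self.cache:
            return self.cache[url]
        if self.calls:          # no need to wait before the very first request
            self.sleep(self.delay)
        self.calls += 1
        self.cache[url] = self.fetch(url)
        return self.cache[url]

# Stand-in fetch function so the sketch runs without network access.
fetched = []
def fake_fetch(url):
    fetched.append(url)
    return 'content of ' + url

fetcher = PoliteFetcher(fake_fetch, delay=1.0, sleep=lambda seconds: None)
for url in ['http://gamestorming.com/pre-mortem/',
            'https://en.wiktionary.org/wiki/post_mortem',
            'http://gamestorming.com/pre-mortem/']:
    fetcher.get(url)
print(len(fetched))  # number of real fetches performed
```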

In [180]:
# For Roguelike-like games, we build a corpus with Wikipedia and DuckDuckGo
# corpus = read_json('corpus-roguelike-like.json')
# if not corpus:
corpus = []
    
for index, roguelike in roguelikelikes.iterrows():
    print(roguelike)
    title = roguelike['Name']
    text = []
    
    page = scrape_wiki(title)
    if page:
        text.append(page.content)

    developers = str(roguelike['Developer']).replace(',', ' OR ')
    links = scrape_duckduckgo(title, developers)
    
    for link in links[:20]:
        if 'roguebasin.roguelikedevelopment.org' in link \
            or 'roguebasin.com' in link \
            or 'wikipedia' in link:
            continue
        content = scrape(link, title)
        if content:
            text.append(content)
    
    corpus.append({"title": title, "text": text})
  
save_json('corpus-roguelike-like.json', corpus)

Name                           ToeJam & Earl
Released                                1991
Updated                                  NaN
Developer     Johnson Voorsanger Productions
Theme                                Fantasy
Influences                               NaN
Name: 0, dtype: object
http://duckduckgo.com/html/?q=%22ToeJam%20%26%20Earl%22%20AND%20Johnson%20Voorsanger%20Productions%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping http://gaming.wikia.com/wiki/ToeJam_%26_Earl_Productions ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://lastdaydeaf.com/toejam-earldev-johnson-voorsanger-productions-sega-genesis-1991/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://toejamandearl.wikia.com/wiki/ToeJam_%26_Earl_in_Panic_on_Funkotron ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://segaretro.org/ToeJam_%26_Earl ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.facebook.com/pages/ToeJam-Earl-Prod





http://duckduckgo.com/html/?q=%22Diablo%22%20AND%20Blizzard%20North%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping https://www.youtube.com/watch?v=VscdPA6sUkc ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://us.battle.net/forums/en/d3/topic/7415795753 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://idclips.com/rev/blizzard+diablo/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.pcgamer.com/diablo-designer-david-breviks-full-gdc-post-mortem-is-now-online/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.diablowiki.net/Blizzard_North ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.diabloii.net/blog/comments/diablo-3-post-mortem-jay-wilson-pt1 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://ltclip.com/rev/diablo+blizzard/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamasutra.com/view/feature/131533/postmortem_blizzards_diablo_ii.php ...'
b'   |MSG

b'   |MSG| Found article'
b'   |MSG| Scraping http://www.g4g.it/2009/10/01/strange-adventures-in-infinite-space/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamezebo.com/2010/11/18/strange-adventures-infinite-space-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamespot.com/reviews/strange-adventures-in-infinite-space-review/1900-2856375/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.myabandonware.com/game/strange-adventures-in-infinite-space-3qy ...'
b'   |MSG| Scraping http://igrotop.com/games/strange_adventures_in_infinite_space/info ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.game-ost.ru/games/6212/strange_adventures_in_infinite_space/meta/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://wiki2.org/en/Strange_Adventures_in_Infinite_Space ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://vikimy.com/l-en/Strange_Adventures_In_Infinite_Space ...'
b'   |MSG| Found article'
b'   |MSG| Scrapin

b'   |MSG| Found article'
b'   |MSG| Scraping http://bestgamer.net/load/3125-the-binding-of-isaac-edmund-mcmillen-eng-p.html ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://tcrf.net/The_Binding_of_Isaac ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://steamcommunity.com/workshop/filedetails/?id=895960454 ...'
b'   |MSG| Found article'
Name          FTL: Faster Than Light
Released                        2012
Updated                          NaN
Developer               Subset Games
Theme          Space science fiction
Influences                       NaN
Name: 8, dtype: object
http://duckduckgo.com/html/?q=%22FTL%3A%20Faster%20Than%20Light%22%20AND%20Subset%20Games%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping http://gaming.wikia.com/wiki/FTL:_Faster_Than_Light ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://speed-new.com/ftl-faster-than-light-full-pc-game ...'
b'   |MSG| Found article'
b'   |MSG| Scrap

b'   |MSG| Found article'
b'   |MSG| Scraping https://roguelegacy.gamepedia.com/Cellar_Door_Games ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.usgamer.net/articles/rogue-legacy-review ...'
b'   |MSG| Scraping https://jayisgames.com/review/rogue-legacy.php ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.gamingdebugged.com/2013/04/18/interview-cellar-door-games-creators-of-rogue-legacy/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gbatemp.net/review/rogue-legacy.166/ ...'
b'   |MSG| Scraping https://www.windowscentral.com/rogue-legacy-review ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://iamericm.com/2014/07/rogue-legacy-indie-game-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.moregameslike.com/rogue-legacy/android/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.pcgamer.com/rogue-legacy/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.facebook.com/CellarDoorGames/ ...'
b'   |M

b'   |MSG| Scraping https://store.steampowered.com/video/264280 ...'
b'   |MSG| Scraping http://gameverse.com/2013/03/22/99-levels-to-hell-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://indiegamereviewer.com/review-99-levels-to-hell-an-indie-roguelike-like/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://gamesmojo.com/games/action/99-levels-to-hell ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://toucharcade.com/2018/08/10/arena-of-valor-news-idol-liliana-map-overhauls-and-a-new-hero/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://ru-ru.facebook.com/99levelstohell/posts/?ref=page_internal ...'
b'   |MSG| Scraping http://gamestorrent.co/99-levels-to-hell-prophet.html ...'
b'   |MSG| Scraping https://kyojim.com/99-levels-hell-prophet/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gogwiki.com/wiki/99_Levels_to_Hell ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://zaxisgames.blogspot.com/2012/09/99-levels-to-he

b'   |MSG| Found article'
b'   |MSG| Scraping http://www.gamerevolution.com/review/67605-crypt-of-the-necrodancer-review ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://steamcommunity.com/games/247080/announcements/detail/611738382089044545 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.softpedia.com/reviews/games/pc/Crypt-of-the-Necrodancer-Review-479596.shtml ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://oceanofgames.com/crypt-of-the-necrodancer-free-download/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.microsoft.com/en-us/p/crypt-of-the-necrodancer/bzhl37cpgp4x ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.pcgamesn.com/crypt-of-the-necrodancer/crypt-of-the-necrodancer-s-new-boss-is-an-entire-band-of-enemies ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.ign.com/games/crypt-of-the-necrodancer ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://torrentsgames.net/macintosh/crypt-of-the-necrod

b'   |MSG| Found article'
Name          The Binding of Isaac: Rebirth
Released                               2014
Updated                                 NaN
Developer                           Nicalis
Theme                               Fantasy
Influences                              NaN
Name: 22, dtype: object
http://duckduckgo.com/html/?q=%22The%20Binding%20of%20Isaac%3A%20Rebirth%22%20AND%20Nicalis%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping https://gamesmojo.com/games/action/the-binding-of-isaac-rebirth ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://maddownload.com/games/rpg/the-binding-of-isaac-rebirth/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://store.steampowered.com/app/250900/The_Binding_of_Isaac_Rebirth/ ...'
b'   |MSG| Scraping http://fragrun.com/games/action/the-binding-of-isaac-rebirth ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://playstationgamer.org/games/action/the-binding-of

b'   |MSG| Scraping https://wccftech.com/review/hand-of-fate-2/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://oceanofgames.com/hand-of-fate-free-download/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gamespot.com/reviews/hand-of-fate-2-review/1900-6416814/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.gog.com/game/hand_of_fate_2_outlands_and_outsiders ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://www.gamerevolution.com/review/67003-hand-of-fate-review ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://gamestorrent.co/hand-of-fate-2-outlands-and-outsiders-plaza.html ...'
b'   |MSG| Scraping https://www.pcpowerplay.com.au/review/hand-of-fate-2,477783 ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://speed-new.com/hand-of-fate-full-pc-game ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://unity3d.com/showcase/case-stories/hand-of-fate ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.facebook.

b'??WARN?? Search term "Infinite Space III: Sea of Stars" returned nothing'
http://duckduckgo.com/html/?q=%22Infinite%20Space%20III%3A%20Sea%20of%20Stars%22%20AND%20Rich%20Carlson%20OR%20%20Iikka%20Ker%C3%A4nen%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping https://www.imdb.com/title/tt1323932/fullcredits ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.facebook.com/WeirdWorldsReturnToInfiniteSpace ...'
b'   |MSG| Scraping https://www.facebook.com/infinitespace3 ...'
b'   |MSG| Scraping http://doom.wikia.com/wiki/List_of_notable_WADs ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://doomwiki.org/wiki/Hacx ...'
b'   |MSG| Scraping https://doomwiki.org/wiki/Requiem ...'
b'   |MSG| Scraping https://store.steampowered.com/app/698100/Protagon_VR/ ...'
b'   |MSG| Scraping http://doom.wikia.com/wiki/Requiem ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://onemandoom.blogspot.com/p/index-of-reviews.html ...'


b'   |MSG| Found article'
b'   |MSG| Scraping https://steamcommunity.com/app/265000/discussions/0/365163537817753926/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.steamgifts.com/discussion/VYsDF/something-to-read-interesting-history-of-indie-dev-betadwarf-makers-of-forced-and-forced-showdown ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.destructoid.com/review-forced-showdown-368410.phtml ...'
b'   |MSG| Scraping https://ru-ru.facebook.com/BetaDwarf/posts ...'
b'   |MSG| Scraping http://cogconnected.com/review/forced-showdown-review/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://linuxgamenews.com/post/63208838351/the-incredible-story-of-betadwarf-and-forced ...'
b'   |MSG| Scraping https://gamingbolt.com/forced-review ...'
b'   |MSG| Scraping http://www.grabthegames.com/review-forced-showdown.html ...'
b'   |MSG| Found article'
Name          Pixel Cave
Released            2016
Updated              NaN
Developer       Megabyte
Theme         

b'   |MSG| Scraping https://saveorquit.com/2017/10/07/review-heat-signature/ ...'
b'   |MSG| Scraping http://heatsig.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.rockpapershotgun.com/2017/09/26/heat-signature-review/ ...'
b'   |MSG| Scraping https://gamesmojo.com/games/action/heat-signature ...'
b'   |MSG| Found article'
b'   |MSG| Scraping http://heatsignature.wikia.com/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.pcgamesn.com/heat-signature/heat-signature-release-date-steam-trading-cards ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.polygon.com/2017/9/22/16350330/heat-signature-review ...'
b'   |MSG| Scraping https://waypoint.vice.com/en_us/article/evpvmk/watch-us-steal-cool-spaceships-in-heat-signature ...'
b'   |MSG| Found article'
Name                   FARA
Released               2018
Updated                 NaN
Developer     Brian Roberts
Theme               Fantasy
Influences              NaN
Name: 37, dtype: object
b'?

## Web scraping - subset of games

Scraping every title takes a long time, so this pass is limited to a hand-picked subset of games.
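The cell that follows skips RogueBasin and Wikipedia results with plain substring checks on the link. As a hedged alternative, the sketch below compares the parsed hostname instead, so a URL that only mentions "wikipedia" in its path is not dropped. The names `EXCLUDED_DOMAINS` and `is_excluded` are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlparse

# Hypothetical list, taken from the substrings tested in the scraping loop.
EXCLUDED_DOMAINS = (
    'roguebasin.roguelikedevelopment.org',
    'roguebasin.com',
    'wikipedia.org',
)

def is_excluded(link):
    """Return True if the link's host is an excluded domain or one of its subdomains."""
    host = (urlparse(link).hostname or '').lower()
    return any(host == domain or host.endswith('.' + domain)
               for domain in EXCLUDED_DOMAINS)

for link in ['https://en.wikipedia.org/wiki/Rogue_(video_game)',
             'http://time.com/4791258/game-of-thrones-george-r-r-martin-interview/']:
    print(link, is_excluded(link))
```

This check is stricter than the substring test: a Wikipedia mirror hosted on another domain would pass through, which may or may not be desirable for the corpus.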

In [222]:
# For roguelike games, we build a corpus from RogueTemple and DuckDuckGo results
corpus = []
games_to_scrape = ["The Tombs"]

for index, roguelike in roguelikes.iterrows():
    title = roguelike['Name']
    # Skip rows without a usable title and games outside the chosen subset
    if not isinstance(title, str) or title not in games_to_scrape:
        continue
    print(roguelike)
    text = []

    # Start with the page linked in the RogueTemple column
    rogue_temple = scrape_mediawiki_url(roguelike['RogueTemple'])
    text.append(rogue_temple)

    # Wrap the developer field in quotes; commas between names become ' OR '
    developers = '"' + str(roguelike['Developer']).replace(',', ' OR ') + '"'
    links = scrape_duckduckgo(title, developers)

    # Scrape at most 25 search results, skipping the wiki sources
    for link in links[:25]:
        if 'roguebasin.roguelikedevelopment.org' in link \
            or 'roguebasin.com' in link \
            or 'wikipedia' in link:
            continue
        content = scrape(link, title)
        if content:
            text.append(content)

    corpus.append({"title": title, "text": text})

save_json('corpus-addition.tmp.json', corpus)

Name                                                   The Tombs
RogueTemple    http://roguebasin.roguelikedevelopment.org/ind...
Link                                    http://www.thetombs.com/
Released                                              2005/04/16
Updated                                               2005/04/16
Developer                                         Martin Woodard
Theme                                                    Fantasy
Influences                                                   NaN
Name: 229, dtype: object
http://duckduckgo.com/html/?q=%22The%20Tombs%22%20AND%20%22Martin%20Woodard%22%20AND%20game%20AND%20%28mortem%20OR%20history%20OR%20developer%20OR%20review%29
b'   |MSG| Scraping http://time.com/4791258/game-of-thrones-george-r-r-martin-interview/ ...'
b'   |MSG| Found article'
b'   |MSG| Scraping https://www.theverge.com/2018/4/25/17280908/george-r-r-martin-grrm-game-of-thrones-song-of-ice-and-fire ...'
b'   |MSG| Found article'
b'   |MSG| Scraping h

KeyboardInterrupt: 
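The search URLs printed in the logs above follow one pattern: the quoted title, the developer string, the word `game`, and a group of keywords, joined with `AND` and percent-encoded. The sketch below reconstructs that pattern from the logged URLs only. It is not the notebook's `scrape_duckduckgo` implementation, and `build_query_url`, `SEARCH_ENDPOINT` and `KEYWORDS` are hypothetical names.

```python
from urllib.parse import quote

# Endpoint and keyword group as they appear in the logged search URLs.
SEARCH_ENDPOINT = 'http://duckduckgo.com/html/?q='
KEYWORDS = '(mortem OR history OR developer OR review)'

def build_query_url(title, developers):
    """Build a search URL in the same shape as the ones printed in the logs."""
    query = '"{}" AND {} AND game AND {}'.format(title, developers, KEYWORDS)
    # safe='' percent-encodes every reserved character, including ':' and '/'
    return SEARCH_ENDPOINT + quote(query, safe='')

print(build_query_url('The Tombs', '"Martin Woodard"'))
```

For "The Tombs" this reproduces the URL logged above. Quoting the title and developer asks the search engine for exact phrases, but the George R. R. Martin results in the log suggest it does not always honor them.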