# Track Missing Data
We need a system for efficiently identifying missing data, with our initial population being the set of all completed Mini Normal games. We also want to identify _why_ a data point is missing and, when a data point is incomplete, what is present and what isn't. 

### What kinds of missing data are there?
We can be missing setup/slot information, phase transition information, voting data, and/or thread content. And information can be missing because it hasn't been successfully extracted yet or because it's missing from the forum (e.g. because of a site crash). Also, rather than being present or missing, information can also instead be _inaccurate_ - but solving that problem requires an entirely different approach from that of detecting missing data, so we focus on it elsewhere. And finally, data can be undesirable - either because of a broken game (e.g. modflaking) or other issues that make inclusion in analysis difficult/unreasonable.

### How will we manage the prospect of missing data?
We'll write a script that collects a list of existing completed game threads and checks our data set for associated data, building a list marking wherever data is missing. From there, we'll maintain this list, including updating it regularly as data collection ensues and new games finish, as well as manually marking instances where data collection is impossible or undesirable.


## Build List of Completed Games
For now we focus solely on Mini Normals.

In [21]:
# dependences
import requests
from lxml import html

# start by finding number of threads in subforum
url = 'https://forum.mafiascum.net/viewforum.php?f=53&start={}'
base = requests.get(url.format(0)).content
topic_count = html.fromstring(base).xpath('//div[@class="pagination"]/text()')[0].strip()
topic_count = int(topic_count[:topic_count.find(' ')])

# build list of game urls across each page of threads
game_urls, game_titles = [], []
for i in range(0, topic_count, 100):
    page = requests.get(url.format(i)).content
    
    titles = html.fromstring(page).xpath("//div[@class='forumbg']//dt/a/text()")
    game_titles += [title.strip() for index, title in enumerate(titles) if index % 2 == 0]
    
    urls = html.fromstring(page).xpath("//div[@class='forumbg']//dt/a/@href")
    game_urls += [url[1:url.find('&sid')] for index, url in enumerate(urls) if index % 2 == 0]

# print result
for index, url in enumerate(game_urls):
    print(url, game_titles[index])

/viewtopic.php?f=53&t=29549 Mini Normal Archives
/viewtopic.php?f=53&t=80170 Mini Normal 2086 - Pastries! GAME OVER!
/viewtopic.php?f=53&t=80044 Mini 2082: Hall of Mirrors I (Game Over!)
/viewtopic.php?f=53&t=80031 Mini Normal 2081 — My First Game! [Game Over]
/viewtopic.php?f=53&t=79944 Mini Normal 2080 ft. My Cats [Game Over]
/viewtopic.php?f=53&t=79638 Mini Normal 2075 - Game Over
/viewtopic.php?f=53&t=79563 Mini Normal 2073: ~ramblings~ (game over)
/viewtopic.php?f=53&t=79475 Mini Normal 2071 (Game Over!)
/viewtopic.php?f=53&t=79373 Mini-Normal 2070 is done
/viewtopic.php?f=53&t=79261 Mini Normal 2068: Cat Art! [Game Over]
/viewtopic.php?f=53&t=79212 Mini Normal 2067: Musicals [Endgame]
/viewtopic.php?f=53&t=79138 Mini Normal 2066: Catloaves [Game Over!]
/viewtopic.php?f=53&t=79035 Mini Normal 2062: Erinnerungen (um game over)
/viewtopic.php?f=53&t=78945 Mini Normal 2060: World Architecture [Endgame]
/viewtopic.php?f=53&t=78822 Mini Normal 2058 (Endgame)
/viewtopic.php?f=53&t=78680

# List URLs Missing from archive.txt

In [24]:
# load archive
with open('../data/archive.txt') as f:
    archive = f.read()

# print info tied to each thread not mentioned in archive
count = 0
for index, url in enumerate(game_urls):
    if url[1:] in archive:
        
        # print results
        print(index, game_titles[index], url)
        count += 1

72 Mini Normal 1938 [Game Over] /viewtopic.php?f=53&t=72911
73 Mini Normal 1931 | Endgame /viewtopic.php?f=53&t=72708
75 ♡☭☆ Mini Normal 1933 - Shiba Inus! - Game Over!☆☭♡ /viewtopic.php?f=53&t=72770
76 Mini Normal 1929 - Game Over /viewtopic.php?f=53&t=72581
77 Mini Normal 1925: Complete /viewtopic.php?f=53&t=72508
78 Mini Normal 1923: BooneyToonz IV - Five Weeks in a Boon END /viewtopic.php?f=53&t=72397
79 Mini Normal 1921 - Town Win /viewtopic.php?f=53&t=72324
80 Mini Normal 1920 (Game Over) /viewtopic.php?f=53&t=72291
81 Mini Normal 1917: :X Mafia (Game over!) /viewtopic.php?f=53&t=72052
82 Mini Normal 1914 - Sunshine Mafia - Game Over /viewtopic.php?f=53&t=71908
83 Mini Normal 1919: Endgame /viewtopic.php?f=53&t=72179
84 Mini 1895: Shaziro Mafia - GAME OVER /viewtopic.php?f=53&t=71119
85 Mini Normal 1911 | Penguin Mafia Redux | Endgame /viewtopic.php?f=53&t=71796
86 Mini Normal 1909: Girls ♥ Girls 1 ~ Endgame /viewtopic.php?f=53&t=71697
87 Mini Normal 1908 - In The Web (Game Over)