# Track Missing Data
We need a system for efficiently identifying missing data. Our initial population is the set of all completed Mini Normal games. We also want to identify _why_ a data point is missing and, when a data point is incomplete, which parts are present and which aren't.

### What kinds of missing data are there?
We can be missing setup/slot information, phase transition information, voting data, and/or thread content. Information can be missing either because it hasn't been successfully extracted yet or because it's missing from the forum itself (e.g. because of a site crash). Information can also be present but _inaccurate_; solving that problem requires an entirely different approach from detecting missing data, so we address it elsewhere. Finally, data can be undesirable, either because of a broken game (e.g. modflaking) or because of other issues that make inclusion in analysis difficult or unreasonable.
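
One way to make these distinctions concrete is to give each game one status per kind of data. The sketch below is only an illustration, not the project's actual schema: `Category`, `Status`, and `GameRecord` are hypothetical names, and the example URL is a placeholder.

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    """The kinds of data a game can be missing (hypothetical names)."""
    SETUP = "setup/slot information"
    TRANSITIONS = "phase transitions"
    VOTES = "voting data"
    CONTENT = "thread content"


class Status(Enum):
    """Why a piece of data is or isn't available (hypothetical names)."""
    PRESENT = "present"
    NOT_EXTRACTED = "not yet extracted"
    MISSING_FROM_FORUM = "missing from forum"
    UNDESIRABLE = "undesirable"


def _default_status():
    # assumption: until checked, every category counts as not yet extracted
    return {category: Status.NOT_EXTRACTED for category in Category}


@dataclass
class GameRecord:
    url: str
    status: dict = field(default_factory=_default_status)
    note: str = ""

    def missing(self):
        """Return the categories that are not marked as present."""
        return [c for c, s in self.status.items() if s is not Status.PRESENT]


# placeholder URL, for illustration only
record = GameRecord(url="viewtopic.php?f=53&t=0")
record.status[Category.SETUP] = Status.PRESENT
record.status[Category.CONTENT] = Status.MISSING_FROM_FORUM
record.note = "thread lost in a site crash"
print([c.name for c in record.missing()])
```

Keeping the reason in the status itself means a game whose thread is gone from the forum is never queued for extraction again.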

### How will we manage the prospect of missing data?
We'll write a script that collects a list of existing completed game threads and checks our data set for associated data, building a list that marks wherever data is missing. From there, we'll maintain this list, updating it regularly as data collection proceeds and new games finish, and manually marking instances where data collection is impossible or undesirable.
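
The workflow described above could be sketched roughly as follows. This is an illustration under assumptions, not the script itself: `build_missing_report`, the `datasets` mapping (kind of data to the set of thread URLs we hold data for), and `manual_marks` are hypothetical names, and the URLs are placeholders.

```python
def build_missing_report(thread_urls, datasets, manual_marks=None):
    """Mark, for every thread and every kind of data, whether data exists.

    thread_urls  -- iterable of completed game thread URLs
    datasets     -- dict: kind of data -> set of URLs we have data for
    manual_marks -- dict: (url, kind) -> hand-written label, e.g. 'impossible'
    """
    manual_marks = manual_marks or {}
    report = {}
    for url in thread_urls:
        row = {}
        for kind, covered in datasets.items():
            if (url, kind) in manual_marks:
                # manual labels take priority over automatic checks
                row[kind] = manual_marks[(url, kind)]
            elif url in covered:
                row[kind] = 'present'
            else:
                row[kind] = 'missing'
        report[url] = row
    return report


# placeholder data, for illustration only
threads = ['thread-a', 'thread-b']
datasets = {'transitions': {'thread-a'}, 'votes': set()}
manual = {('thread-b', 'votes'): 'undesirable'}

report = build_missing_report(threads, datasets, manual)
for url, row in report.items():
    print(url, row)
```

Re-running the function after each round of data collection refreshes the automatic marks while leaving the manual ones in place.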

## Dependencies

In [12]:
# dependencies
import requests
import csv
import string
from lxml import html

# needed variables
no_punctuation = str.maketrans('', '', string.punctuation) # for quick removal of punctuation 
completed_url = 'https://forum.mafiascum.net/viewforum.php?f=53&start={}'

with open('../data/archive.txt') as f:
    archive = f.read()

## Build List of Completed Games and Identify Those Missing from `archive.txt`
For now we focus solely on Mini Normals. We probably don't need this cell anymore.

In [13]:
# start by finding number of threads in subforum
base = requests.get(completed_url.format(0)).content
topic_count = html.fromstring(base).xpath('//div[@class="pagination"]/text()')[0].strip()
topic_count = int(topic_count[:topic_count.find(' ')])

# scrape list of game urls and titles across each page of threads
game_urls, game_titles = [], []
for i in range(0, topic_count, 100):
    page = requests.get(completed_url.format(i)).content
    
    # game titles
    titles = html.fromstring(page).xpath("//div[@class='forumbg']//dt/a/text()")
    game_titles += [title.strip() for index, title in enumerate(titles) if index % 2 == 0]
    
    # game urls
    urls = html.fromstring(page).xpath("//div[@class='forumbg']//dt/a/@href")
    game_urls += [url[1:url.find('&sid')] for index, url in enumerate(urls) if index % 2 == 0]

# mark which of these aren't in archive
excluded = []
for index, url in enumerate(game_urls):
    count = archive.count(url[1:] + '\n')
    if count == 0:
        excluded.append(index)
    
# print counts
print('Number of URLs:', len(game_urls))
print('Number of URLs Unmatched to String in Archive:', len(excluded))
print('Number of Games in Archive:', len(archive.split('\n\n\n')))
print('{} threads unaccounted for!'.format(len(game_urls) - len(excluded) - len(archive.split('\n\n\n'))))
print('Number of URLs After Excluding Duplicates:', len(set(game_urls)))
print()

Number of URLs: 1025
Number of URLs Unmatched to String in Archive: 727
Number of Games in Archive: 298
0 threads unaccounted for!
Number of URLs After Excluding Duplicates: 1023



## Identify and Count Games Included in `transitions.tsv`
The first entry of each row holds the game number. Storing this instead of the URL was a bad idea, as I now have to write code to match game numbers with URLs. Along with checking whether a row associated with a game exists, I have to check whether the row is complete, meaning that no entry in the row contains a question mark and the last entry is a hyphen. My goal for this project today is to achieve that for every game in my initial ~322 game data set, or to mark games where this isn't possible for some reason.
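
The two checks described above can be sketched as small helpers. This is a hedged illustration, not the notebook's final code: `game_number`, `row_is_complete`, and `urls_by_number` are hypothetical names, the sample rows and URLs are placeholders, and taking the first run of digits in a title as the game number is an assumption that holds for titles shaped like `Game 1164: 9p normal mafia.`.

```python
import re


def game_number(title):
    """Return the first run of digits in a title as an int, or None."""
    match = re.search(r'\d+', title)
    return int(match.group()) if match else None


def row_is_complete(row):
    """A row is complete if its last entry is '-' and no entry contains '?'."""
    return bool(row) and row[-1] == '-' and not any('?' in entry for entry in row)


def urls_by_number(titles, urls):
    """Map game number -> thread URL, skipping titles without a number."""
    mapping = {}
    for title, url in zip(titles, urls):
        number = game_number(title)
        if number is not None:
            mapping[number] = url
    return mapping


# placeholder data, for illustration only
titles = ['Game 1091: Mafia Mania', 'Game 1164: 9p normal mafia.']
urls = ['thread-a', 'thread-b']
lookup = urls_by_number(titles, urls)

rows = [['1091', 'x', 'y', '-'], ['1164', 'x', '?', '-'], ['1152', 'x']]
for row in rows:
    print(row[0], lookup.get(int(row[0])), row_is_complete(row))
```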

In [7]:
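# build list of game numbers from archive
numbers = []
for game in archive.split('\n\n\n'):
    # the second line of each archive entry holds the game title
    name = game.split('\n')[1]
    
    # print the title and the integers found in it, to check the extraction by eye
    print(name)
    print([int(i) for i in name.translate(no_punctuation).split() if i.isdigit()])

# deliberate halt while debugging: nothing below this line runs yet
assert False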

# load transitions tsv, halting once more than 10 rows have been read
count = 0
with open('../data/transitions.tsv') as f:
    transitions = csv.reader(f, delimiter='\t')
    for row in transitions:
        # print(row)
        
        count += 1
        if count > 10:
            assert False



Game 1091: Mafia Mania
[1091]
Game 1094: Mariposa Peak Mafia
[1094]
Game 1098: The Mafia Experiment!
[1098]
Game 1101: Suspiciously Normal Mafia
[1101]
Game 1102: Rivertown Mafia
[1102]
Game 1105: A Mafia Invasion!
[1105]
Game 1107: Just a Game
[1107]
Game 1112: Mundania
[1112]
Game 1114: Jim's Mafia
[1114]
Game 1117: Manhattan Special
[1117]
Game 1121: Nexusville Mafia
[1121]
Game 1122: Mafia.Exe
[1122]
Game 1126: Averagely Suspicious Mafia
[1126]
Game 1130: A Fishbowl Invasion by Ninja Monkeys!
[1130]
Game 1133: Mafia in Venice
[1133]
Game 1137: Long Overdue Mafia
[1137]
Game 1140: Mafia Mishmash
[1140]
Game 1142: Quintessentially English Mafia
[1142]
Game 1145: Plain Mafia
[1145]
Game 1146: Don't Get Slapped in the Face by a Fish Mafia
[1146]
Game 1147: Royal Mafia at the Round Table
[1147]
Game 1152
[1152]
Game 1156
[1156]
Game 1157: Witch-Hunt Nightless
[1157]
Game 1159: Powerrox93's Mini Normal I
[1159]
Game 1161: Neruzian Era Mafia
[1161]
Game 1164: 9p normal mafia.
[1164]
Game 

AssertionError: 