# Summary

This notebook contains the code used in gathering the data from [boardgamegeek](https://boardgamegeek.com/).
<br>

First, we gather a master game list from the top 100 games by boardgamegeek's `geekrating` score, using `BeautifulSoup`, and save it in the file `games_master_list.csv`.
<br>

Then, using functions from the `bgg_data_func` file, we loop through all the games and save each game's ratings (as of March 31, 2020) in `json` files, each containing 50 ratings. These files are in the `json_data` folder.
<br>

In the next step, for each game, we loop through its `json` files and create a summary file in `csv` format with two columns: `user_name` and `rating`. These summary files are also saved in the `json_data` folder.
<br>

Finally, we loop through the games and combine their summary `csv` files into one file that has three columns: `user_name`, `game_id`, and `rating`. This file is saved in the `data_input` folder and can be imported directly for use with the objects from the `surprise` library in the following Jupyter notebooks.

In [1]:
import requests
from bs4 import BeautifulSoup

import os
import json
import csv
import pandas as pd
import numpy as np

import time
from datetime import datetime

import bgg_data_func

# Games Master List from BGG

In this section, we gather a master game list from the top 100 games by boardgamegeek's `geekrating` score, using `BeautifulSoup`, and save it in the file `games_master_list.csv`. 
<br>

The table is in this format: [boardgamegeek top 100](https://boardgamegeek.com/browse/boardgame), although please note that the list changes with time. My master game list comes from the site as of March 31, 2020. 
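
The row-parsing logic used in the cells that follow can be tried offline on a static snippet first. The sketch below is only an illustration: `SAMPLE_HTML` and `parse_game_rows` are hypothetical names, and the markup is a heavily simplified assumption about the ranking table, not a copy of the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup; the live page has more columns
# and attributes, and its layout may change over time.
SAMPLE_HTML = """
<table class="collection_table">
  <tr><th>Rank</th><th>Thumbnail</th><th>Title</th></tr>
  <tr>
    <td>1</td>
    <td></td>
    <td>
      <a href="/boardgame/174430/gloomhaven">Gloomhaven</a>
      <span class="smallerfont dull">(2017)</span>
    </td>
  </tr>
</table>
"""


def parse_game_rows(html):
    """Return a list of (name, link, year) tuples from a ranking table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "collection_table"})
    games = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 3:
            continue  # header rows contain <th> cells only
        anchor = cells[2].find("a")
        year_span = cells[2].find("span", {"class": "smallerfont dull"})
        # Strip the surrounding parentheses from "(2017)".
        year = year_span.text.strip()[1:-1] if year_span else ""
        games.append((anchor.text.strip(), anchor["href"], year))
    return games


games = parse_game_rows(SAMPLE_HTML)
print(games)
```

Skipping rows by their cell count, rather than by a fixed index, is one way to make the loop less sensitive to changes in the table header.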

In [2]:
session = requests.Session()
headers = {
    # Adjacent string literals are concatenated, so the trailing space matters.
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
                  'AppleWebKit 537.36 (KHTML, like Gecko) Chrome',
    'Accept': 'text/html,application/xhtml+xml,application/xml;'
              'q=0.9,image/webp,*/*;q=0.8',
}
url = 'https://boardgamegeek.com/browse/boardgame?sort=rank'
req = session.get(url, headers=headers)
req.raise_for_status()  # stop early if the page could not be fetched

bs = BeautifulSoup(req.text, 'html.parser')

In [3]:
ranking_table = bs.find('table', {'class': 'collection_table'})

In [4]:
rows = ranking_table.findChildren(['th', 'tr'])

In [5]:
bg_name_list = []
bg_link_list = []
bg_year_list = []

# rows[0:8] hold the header row and its <th> cells; game rows start at index 8
for row in rows[8:]:
    cell = row.findChildren(['td'])[2]
    bg_name = cell.find('a').text
    bg_link = cell.find('a')['href']
    bg_year = cell.find('span', {'class': 'smallerfont dull'}).text[1:-1]
    
    bg_name_list.append(bg_name)
    bg_link_list.append(bg_link)
    bg_year_list.append(bg_year)
    
games_master_list = pd.DataFrame({
    'name': bg_name_list,
    'link': bg_link_list,
    'year': bg_year_list,
})

In [6]:
games_master_list.to_csv('games_master_list.csv', index=False)

# Import Master List

Work in the remainder of the notebook starts with importing the master list. 

In [2]:
df = pd.read_csv('games_master_list.csv')

game_ids = df['link'].map(bgg_data_func.link_to_gameid)
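
The body of `bgg_data_func.link_to_gameid` is not shown in this notebook. As a rough sketch of what such a helper could look like, assuming links of the form `/boardgame/<id>/<slug>`, the hypothetical function `extract_game_id` below pulls the numeric ID out with a regular expression and keeps it as a string.

```python
import re


def extract_game_id(link):
    """Return the numeric ID from a link such as '/boardgame/174430/gloomhaven'."""
    match = re.search(r"/boardgame/(\d+)", link)
    if match is None:
        raise ValueError(f"no game ID found in link: {link!r}")
    return match.group(1)  # kept as a string so it can be used in file paths


print(extract_game_id("/boardgame/174430/gloomhaven"))
```

Returning a string matches how `game_id` is used later in the notebook, where it is concatenated into file paths.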

# BGG API Calls

In this section, using functions from the `bgg_data_func` file, we loop through all the games and save each game's ratings (as of March 31, 2020) in `json` files, each containing 50 ratings. These files are in the `json_data` folder.
<br>

I used three different processes. Each can be run from its corresponding cell after adjusting the `game_id`, `page_num`, and `rated_only` parameters:
- collect all the pages for all the games in a certain range
- collect pages of one game, starting from a specific page
- collect one specific page of one game
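
All three processes share the same pattern: request a page, pause, and move to the next page until an empty one comes back. The sketch below shows that pattern with a stand-in for the real request. `collect_pages` and `fake_fetch` are hypothetical names, and the only assumption about the real helper is that it returns `True` when a page holds no ratings, as `handle_one_api_request` does in the cells that follow.

```python
import random
import time


def collect_pages(fetch_page, start_page=1, base_delay=3.0, sleep=time.sleep):
    """Call fetch_page(page_num) for consecutive pages until one is empty.

    fetch_page must return True when the page holds no ratings.
    Returns the number of the first empty page.
    """
    page_num = start_page
    while True:
        page_empty = fetch_page(page_num)
        if page_empty:
            return page_num
        # Randomized pause so requests are not sent at a fixed rhythm.
        sleep(base_delay + random.uniform(0, 1))
        page_num += 1


# Stand-in for the real request: pretend the game has three pages of ratings.
def fake_fetch(page_num):
    return page_num > 3


slept = []  # record the pauses instead of actually waiting
last_page = collect_pages(fake_fetch, sleep=slept.append)
print(last_page, len(slept))
```

Passing `start_page` covers the second process (resuming from a specific page), and calling `fetch_page` once outside the loop covers the third.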

In [7]:
# Save all available pages for each game in the selected slice; adjust the game_ids slice as needed.
# Other string IDs work as well, but each ID must already have a folder.

for game_id in game_ids[85:100]:
    
    print(game_id)
    
    page_num = 1
    page_empty = False

    while not page_empty:

        page_empty = bgg_data_func.handle_one_api_request(game_id, page_num)

        seconds_to_sleep = 3 + np.random.uniform(0,1)
        time.sleep(seconds_to_sleep)
        
        page_num += 1
        
        if page_num % 50 == 0:
            print(page_num)
        
    time.sleep(10)

In [8]:
# Collect the pages of a single game, starting from a given page number.

game_id = '157354'
page_num = 359
rated_only = True

page_empty = False

while not page_empty:

    # get API response
    page_empty = bgg_data_func.handle_one_api_request(game_id, page_num, rated_only)

    seconds_to_sleep = 2 + np.random.uniform(0,1)
    time.sleep(seconds_to_sleep)

    page_num += 1
    
    if page_num % 50 == 0:
        print(page_num)

In [9]:
# Collect one specific page of one game.

game_id = '28143'
page_num = 808

bgg_data_func.handle_one_api_request(game_id, page_num, False)

# JSON Data Converter

For each game, we loop through the `json` files, and create a summary file in a `csv` format, each having two columns: `user_name` and `rating`. These summary files are also saved in the `json_data` folder. 
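
The conversion itself lives in `bgg_data_func.create_csv_summary`, which is not shown here. As a minimal sketch of the idea, the hypothetical function `summarize_ratings` below assumes that each `json` file holds a list of objects with `user_name` and `rating` keys. The real files may be structured differently.

```python
import csv
import json
import tempfile
from pathlib import Path


def summarize_ratings(json_dir, summary_path):
    """Write one 'user_name,rating' row per rating found in json_dir/*.json."""
    rows_written = 0
    with open(summary_path, "w", newline="") as f:
        writer = csv.writer(f)
        for json_file in sorted(Path(json_dir).glob("*.json")):
            with open(json_file) as page:
                ratings = json.load(page)  # assumed: a list of dicts
            for entry in ratings:
                writer.writerow([entry["user_name"], entry["rating"]])
                rows_written += 1
    return rows_written


# Demonstration on two tiny made-up pages.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "page_1.json").write_text(json.dumps([{"user_name": "alice", "rating": 8.5}]))
    (tmp / "page_2.json").write_text(json.dumps([{"user_name": "bob", "rating": 7}]))
    summary_file = tmp / "summary.csv"
    count = summarize_ratings(tmp, summary_file)
    summary_text = summary_file.read_text()

print(count)
print(summary_text)
```

The summary is written without a header row, which matches how the summary files are read back later with `header=None`.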

In [10]:
# game_ids_to_convert = pd.concat([game_ids[49:100],(game_ids[-1:])])
game_ids_to_convert = game_ids[49:100]

In [11]:
for game_id in game_ids_to_convert:
    print(game_id)
    bgg_data_func.create_csv_summary(game_id)

# Create Data Summary

Finally, we loop through the games and combine their summary `csv` files into one file that has three columns: `user_name`, `game_id`, and `rating`. These are saved in the `data_input` folder, and can be directly imported to be used with the objects from the `surprise` library in the following Jupyter notebooks. 
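
The same merge can also be expressed entirely in `pandas`. The sketch below is an alternative to the loop used in this notebook, not a description of it. The in-memory `summaries` are made-up stand-ins for the per-game summary files.

```python
import io

import pandas as pd

# Made-up stand-ins for two per-game summary files (no header row).
summaries = {
    "174430": io.StringIO("alice,8.5\nbob,7\n"),
    "161936": io.StringIO("alice,9\n"),
}

frames = []
for game_id, source in summaries.items():
    frame = pd.read_csv(source, header=None, names=["user_name", "rating"])
    frame["game_id"] = game_id  # tag every rating with its game
    frames.append(frame)

# Stack the per-game frames and put the columns in user, item, rating order.
combined = pd.concat(frames, ignore_index=True)[["user_name", "game_id", "rating"]]
print(combined.shape)
```

Writing the result once with `combined.to_csv(path, index=False, header=False)` would produce the same headerless layout while overwriting any earlier output, so rerunning the step does not duplicate rows.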

In [12]:
summary_filename = "games_100"

In [13]:
game_ids_to_merge = game_ids[:100]

In [14]:
for game_id in game_ids_to_merge:
    
    current_file = pd.read_csv('./json_data/game_' + game_id + '/' + game_id + '_summary.csv', header = None)
    
    current_file[2] = game_id
    
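    # Append mode: delete the output file before re-running, or rows will be duplicated.
    # newline='' is what the csv module expects for files it writes.
    with open('./data_input/' + summary_filename + '_summary.csv', 'a', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(zip(current_file[0], current_file[2], current_file[1]))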

In [15]:
# check length

# Count the rows with a context manager so the file is closed afterwards.
with open('./data_input/' + summary_filename + '_summary.csv', newline='') as ratings_file:
    lines = sum(1 for _ in csv.reader(ratings_file))
print(lines)

With that step, the data preparation is complete. Work on the recommender system can start by importing the `games_100_summary.csv` file.