# Scraping MoonBoard Problems
This notebook scrapes problems from the MoonBoard site using an automated clicking routine built on Selenium.

Scraping produces four intermediate files:
1. problems_dict.pickle
2. failed_uids_dict.pickle
3. problems_dict_holds.pickle
4. moonboard_data.pickle

These files map onto three phases of data mining:

**Phase 1: Get all URLs leading to specific problems**

*Produces: Item(s) 1*
* Accessing all problems in the MoonBoard problems repository requires clicking through every page on their site
* On each page, a set of problems is shown as a scrollable UI element
* Each problem within this scrollable UI element has a URL leading to a unique webpage that displays a problem and related metadata
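
The records shown later in this notebook are keyed by the numeric id that appears in each problem URL (for example `367789` in `https://moonboard.com/Problems/View/367789/flubber`). A minimal sketch of how the collected links could be indexed by that id follows. The function names `uid_from_url` and `index_problem_urls` are hypothetical and are not taken from `moonboard_helper`.

```python
from urllib.parse import urlparse


def uid_from_url(url):
    """Return the problem id from a '/Problems/View/<uid>/<slug>' URL, or None."""
    parts = [p for p in urlparse(url).path.split('/') if p]
    # Expected path shape: ['Problems', 'View', '<uid>', '<slug>']
    if (len(parts) >= 3 and parts[0] == 'Problems'
            and parts[1] == 'View' and parts[2].isdigit()):
        return parts[2]
    return None


def index_problem_urls(urls):
    """Map uid -> {'url': url}, skipping links that are not problem pages."""
    problems = {}
    for url in urls:
        uid = uid_from_url(url)
        if uid is not None:
            problems[uid] = {'url': url}
    return problems
```

Keying by id rather than by name keeps entries unique even when two problems share a title.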

**Phase 2: Access each problem's page and extract metadata**

*Produces: Item(s) 2, 3*
* After Phase 1, we have a dictionary that maps each unique problem to its corresponding webpage via URL
* We access each unique webpage and extract metadata into **Item (3)**
* Every unsuccessful access attempt is stored in **Item (2)**
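
The retry-and-record idea behind Phase 2 can be sketched without a browser. In the sketch below, `fetch` is a hypothetical callable standing in for whatever page-scraping routine is used, and `scrape_with_retries` is an illustrative name; neither is claimed to match the actual `scrape_problems` helper.

```python
def scrape_with_retries(problems, fetch, num_tries=3):
    """Try to fetch metadata for every problem, recording the ones that fail.

    problems : dict mapping uid -> record containing at least a 'url' key
    fetch    : hypothetical callable taking a URL and returning a metadata dict
    Returns (problems, failed), where failed maps uid -> last error message.
    """
    failed = {}
    for uid, record in problems.items():
        error = 'not attempted'
        for _ in range(num_tries):
            try:
                record.update(fetch(record['url']))
                error = None
                break
            except Exception as exc:  # keep going so one bad page cannot stop the run
                error = repr(exc)
        if error is not None:
            failed[uid] = error
    return problems, failed
```

Keeping failures in their own dictionary makes it possible to retry only those ids in a later run.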

**Phase 3: Format schema for neural network**

*Produces: Item(s) 4*
* After Phase 2, we have a dictionary of MoonBoard problems and associated metadata
* Phase 3 processes this scraped data into a schema that is consistent and suitable for input to neural network training
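
The final records at the end of this notebook store holds as coordinate pairs under `start`, `mid`, and `end`. One way such a record could be turned into a fixed-size network input is a binary occupancy grid. The sketch below assumes an 11-column by 18-row board and a `[column, row]` ordering, both inferred from the coordinate ranges in the output rather than stated by the source; `holds_to_grid` is a hypothetical name.

```python
N_COLS, N_ROWS = 11, 18  # assumed board size, inferred from the scraped coordinates


def holds_to_grid(problem):
    """Return an N_ROWS x N_COLS grid of 0/1 marking every hold in the problem."""
    grid = [[0] * N_COLS for _ in range(N_ROWS)]
    for key in ('start', 'mid', 'end'):
        for x, y in problem.get(key, []):
            if not (0 <= x < N_COLS and 0 <= y < N_ROWS):
                raise ValueError(f'hold {[x, y]} is outside the assumed board')
            grid[y][x] = 1  # row index first, then column
    return grid
```

A fixed grid gives every problem the same shape regardless of how many holds it uses.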

## Setup

In [6]:
import os
import pickle
import shutil
import time

from moonboard_helper import *

In [7]:
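# Load credentials: one "key - value" pair per line
with open('./credentials.txt') as f:
    flines = f.readlines()

# Split on the first hyphen only, so values such as paths may contain hyphens
cred_dict = {}
for s in flines:
    if '-' in s:
        key, value = s.split('-', 1)
        cred_dict[key.strip()] = value.strip()
print(sorted(cred_dict))  # print keys only, to keep the password out of the output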

FileNotFoundError: [Errno 2] No such file or directory: './credentials.txt'

In [17]:
username = None
password = None
driver_path = '/usr/local/bin/chromedriver'
moonboard_url = 'https://moonboard.com/'
save_path = '/Users/jrchang612/moonGen_scrape_2016'
save_path_holds = '/Users/jrchang612/moonGen_scrape_2016_cp'
save_path_failed = '/Users/jrchang612/moonGen_scrape_2016_fail'
save_path_final = '/Users/jrchang612/moonGen_scrape_2016_final'

In [None]:
"""
username = cred_dict['username']
password = cred_dict['password']
driver_path = cred_dict['driver_path']
save_path = cred_dict['save_path']
save_path_holds = cred_dict['save_path_holds']
save_path_failed = cred_dict['save_path_failed']
save_path_final = cred_dict['save_path_final']

moonboard_url = 'https://moonboard.com/'
"""

## Phase 1: Preliminary Scraping (URLs)

In [18]:
# Load browser and login to MoonBoard
browser = load_browser(driver_path)
loginMoonBoard(browser, moonboard_url, username, password)
time.sleep(2)

In [4]:
# Get problems view
click_view_problems(browser)
click_holdsetup(browser)

<selenium.webdriver.remote.webelement.WebElement (session="0e1a18b0657fd4660db8d902ac5968c7", element="f224b536-bf01-4cce-aa36-1e790ecc54d2")>

In [5]:
find_and_click(browser, tag_name = 'button', attribute = 'data-key', value= '25° MoonBoard', num_tries=1, sleep_val=1)

<selenium.webdriver.remote.webelement.WebElement (session="0e1a18b0657fd4660db8d902ac5968c7", element="d3578630-5f40-4e3f-8929-2648ac3ad4e0")>

In [7]:
find_and_click(browser, tag_name = 'button', attribute = 'data-field', value= 'Benchmarks', num_tries=1, sleep_val=1)

<selenium.webdriver.remote.webelement.WebElement (session="b28bf42d9e81b7e3930365026fe8d7de", element="93bd7258-8bc3-4da6-832a-c5e2e08f64e5")>

In [11]:
find_and_click(browser, tag_name = 'button', attribute = 'data-field', value= 'None', num_tries=1, sleep_val=1)

<selenium.webdriver.remote.webelement.WebElement (session="2dcbf1517f5de3c5e4f51f47b5e453db", element="1d68f8d7-9f20-4a2f-8b2b-fcf3c973060d")>

In [11]:
# Process all pages (num_pages=-1 scrapes every page); reuse the saved pickle if it already exists
if not os.path.exists(save_path):
    problems_dict = process_all_pages(browser, save_path, num_pages = -1, sleep_val=1)
    save_pickle(problems_dict, save_path)
else:
    problems_dict = load_pickle(save_path)

Processed page: 1!
Clicked page 2
Processed page: 2!
Clicked page 3
Processed page: 3!
Clicked page 4
Processed page: 4!
Clicked page 5
Failed to process problems on page 5
Processed page: 5!
Clicked page 6
Processed page: 6!
Clicked page 7
Processed page: 7!
Clicked page 8
Processed page: 8!
Clicked page 9
Processed page: 9!
Clicked page 10
Processed page: 10!
Clicked page 11
Processed page: 11!
Clicked page 12
Processed page: 12!
Clicked page 13
Processed page: 13!
Clicked page 14
Processed page: 14!
Clicked page 15
Processed page: 15!
Clicked page 16
Processed page: 16!
Clicked page 17
Processed page: 17!
Clicked page 18
Processed page: 18!
Clicked page 19
Processed page: 19!
Clicked page 20
New page not loaded yet!
Failed to process problems on page 20
New page not loaded yet!
Failed to process problems on page 20
New page not loaded yet!
Failed to process problems on page 20
New page not loaded yet!
Failed to process problems on page 20
New page not loaded yet!
Failed to process p

In [12]:
# Number of scraped problems
print('Number of problems:', len(problems_dict))

Number of problems: 30673
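
The load-if-present, otherwise-scrape-and-save pattern used for `problems_dict` above can be written as one reusable function. This is a minimal sketch using only the standard library; `load_or_build` is a hypothetical name, and the real `save_pickle` and `load_pickle` helpers may behave differently.

```python
import os
import pickle


def load_or_build(path, build):
    """Return the object pickled at `path`, building and saving it if absent.

    build : zero-argument callable that produces the object (e.g. a scrape).
    """
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = build()
    # Write to a temporary file first so an interrupted run
    # cannot leave a half-written pickle at `path`.
    tmp_path = path + '.tmp'
    with open(tmp_path, 'wb') as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result
```

Because a long scrape can fail partway through, checkpointing like this avoids repeating completed work on the next run.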


## Phase 2: Secondary Scraping (Problems)

In [13]:
# Copy problem dict
if not os.path.exists(save_path_holds):
    shutil.copyfile(save_path, save_path_holds)

holds_dict = load_pickle(save_path_holds)
    
# Failed uids
if not os.path.exists(save_path_failed):
    print('Creating failed uids dictionary...')
    failed_uids_dict = {}
    save_pickle(failed_uids_dict, save_path_failed)
else:
    print('Loading failed uids dictionary...')
    failed_uids_dict = load_pickle(save_path_failed)
    print('Number of failed Uids:', len(failed_uids_dict))

Creating failed uids dictionary...


In [19]:
# Scrape specific problems
holds_dict, failed_uids_dict = scrape_problems(
    browser, 
    holds_dict, 
    save_path_holds, 
    failed_uids_dict, 
    save_path_failed,
    num_tries=10
    )
    
# Format mined problems
final_dict = cast_to_basic_schema(holds_dict)
save_pickle(final_dict, save_path_final)

1 / 30673
2 / 30673
3 / 30673
4 / 30673
5 / 30673
6 / 30673
7 / 30673
8 / 30673
9 / 30673
10 / 30673
11 / 30673
12 / 30673
13 / 30673
14 / 30673
15 / 30673
16 / 30673
17 / 30673
18 / 30673
19 / 30673
20 / 30673
21 / 30673
22 / 30673
23 / 30673
24 / 30673
25 / 30673
26 / 30673
27 / 30673
28 / 30673
29 / 30673
30 / 30673
31 / 30673
32 / 30673
33 / 30673
34 / 30673
35 / 30673
36 / 30673
37 / 30673
38 / 30673
39 / 30673
40 / 30673
41 / 30673
42 / 30673
43 / 30673
44 / 30673
45 / 30673
46 / 30673
47 / 30673
48 / 30673
49 / 30673
50 / 30673
51 / 30673
52 / 30673
53 / 30673
54 / 30673
55 / 30673
56 / 30673
57 / 30673
58 / 30673
59 / 30673
60 / 30673
61 / 30673
62 / 30673
63 / 30673
64 / 30673
65 / 30673
66 / 30673
67 / 30673
68 / 30673
69 / 30673
70 / 30673
71 / 30673
72 / 30673
73 / 30673
74 / 30673
75 / 30673
76 / 30673
77 / 30673
78 / 30673
79 / 30673
80 / 30673
81 / 30673
82 / 30673
83 / 30673
84 / 30673
85 / 30673
86 / 30673
87 / 30673
88 / 30673
89 / 30673
90 / 30673
91 / 30673
92 / 306

In [15]:
final_dict = cast_to_basic_schema(holds_dict)
save_pickle(final_dict, save_path_final)

Failed to read 356367
Failed to read 353028
Failed to read 348825
Failed to read 333117
Failed to read 332272
Failed to read 314544
Failed to read 314454
Failed to read 312470
Failed to read 311460
Failed to read 310781
Failed to read 309769
Failed to read 309545
Failed to read 309079
Failed to read 287806
Failed to read 287597
Failed to read 280514
Failed to read 280420
Failed to read 264916
Failed to read 263758
Failed to read 256115
Failed to read 242702
Failed to read 233728
Failed to read 222189
Failed to read 207982
Failed to read 206103
Failed to read 204447
Failed to read 197168
Failed to read 166051
Failed to read 115780
Failed to read 61494
Failed to read 23360


In [8]:
len(final_dict)

15836

In [5]:
holds_dict['367789']

{'problem_name': 'FLUBBER',
 'info': ['Vertical World North',
  'Be the first to repeat this problem',
  '6B+',
  'Feet follow hands',
  '40° MoonBoard'],
 'url': 'https://moonboard.com/Problems/View/367789/flubber',
 'num_empty': 3,
 'num_stars': 0}
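
The `info` field above is a flat list of strings, so individual attributes have to be recognized by their shape. The sketch below pulls out the grade and the board angle with regular expressions. The patterns are assumptions based only on the single record shown here, and `parse_info` is a hypothetical name, not the logic used by `cast_to_basic_schema`.

```python
import re

# Assumed shapes, based on the example record: grades such as '6B+',
# and board setups such as '40° MoonBoard'.
GRADE_RE = re.compile(r'^[4-9][ABC]\+?$')
ANGLE_RE = re.compile(r'^(\d+)° MoonBoard$')


def parse_info(info):
    """Extract the first grade-like string and the board angle from an info list."""
    parsed = {'grade': None, 'angle': None}
    for item in info:
        angle_match = ANGLE_RE.match(item)
        if GRADE_RE.match(item) and parsed['grade'] is None:
            parsed['grade'] = item
        elif angle_match:
            parsed['angle'] = int(angle_match.group(1))
    return parsed


example = ['Vertical World North', 'Be the first to repeat this problem',
           '6B+', 'Feet follow hands', '40° MoonBoard']
print(parse_info(example))  # {'grade': '6B+', 'angle': 40}
```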

In [13]:
# Copy problem dict
if not os.path.exists(save_path_holds):
    shutil.copyfile(save_path, save_path_holds)

holds_dict = load_pickle(save_path_holds)

In [14]:
# Failed uids
if not os.path.exists(save_path_failed):
    print('Creating failed uids dictionary...')
    failed_uids_dict = {}
    save_pickle(failed_uids_dict, save_path_failed)
else:
    print('Loading failed uids dictionary...')
    failed_uids_dict = load_pickle(save_path_failed)
    print('Number of failed Uids:', len(failed_uids_dict))

Creating failed uids dictionary...


In [15]:
# Scrape specific problems
holds_dict, failed_uids_dict = scrape_problems(
    browser, 
    holds_dict, 
    save_path_holds, 
    failed_uids_dict, 
    save_path_failed,
    num_tries=1
)

1 / 15
Failed to find class field-validation-error
2 / 15
Failed to find class field-validation-error
3 / 15
Failed to find class field-validation-error
4 / 15
Failed to find class field-validation-error
5 / 15
Failed to find class field-validation-error
6 / 15
Failed to find class field-validation-error
7 / 15
Failed to find class field-validation-error
8 / 15
Failed to find class field-validation-error
9 / 15
Failed to find class field-validation-error
10 / 15
Failed to find class field-validation-error
11 / 15
Failed to find class field-validation-error
12 / 15
Failed to find class field-validation-error
13 / 15
Failed to find class field-validation-error
14 / 15
Failed to find class field-validation-error
15 / 15
Failed to find class field-validation-error


In [None]:
# Close the browser and end the WebDriver session
browser.quit()

## Phase 3: Schema Organization

In [16]:
# Format mined problems
final_dict = cast_to_basic_schema(holds_dict)
save_pickle(final_dict, save_path_final)

In [19]:
with open('/Users/jrchang612/moonGen/output/moonGen_scrape_final', 'rb') as f:
    data = pickle.load(f)

In [20]:
data

{'341667': {'url': 'https://moonboard.com/Problems/View/341667/woods-are-nice',
  'start': [[3, 2], [7, 0]],
  'mid': [[8, 6], [6, 11], [10, 4], [5, 15], [3, 12]],
  'end': [[3, 17]],
  'grade': 6,
  'user_grade': '6C+',
  'is_benchmark': True,
  'repeats': 546,
  'problem_type': None,
  'is_master': False,
  'setter': {'Id': 'd3a099ae-6cc1-457d-ba80-af2dab1c7ebf',
   'Nickname': 'Matt',
   'Firstname': 'Matthew ',
   'Lastname': 'Hammond ',
   'City': 'Bedworth ',
   'Country': 'United Kingdom',
   'ProfileImageUrl': '/Content/Account/Images/default-profile.png?637228639487042349',
   'CanShareData': True}},
 '339318': {'url': 'https://moonboard.com/Problems/View/339318/sugar-poor',
  'start': [[1, 5], [3, 2]],
  'mid': [[6, 7], [6, 9], [9, 12], [6, 15], [7, 16], [9, 2], [10, 4]],
  'end': [[5, 17]],
  'grade': 5,
  'user_grade': '6C+',
  'is_benchmark': True,
  'repeats': 857,
  'problem_type': None,
  'is_master': False,
  'setter': {'Id': 'b8032ce3-eec0-42c6-829d-ae5ced48dda6',
   