# Web Scraping IRL

In [225]:
# This file is an example of using scraping to accomplish a task.
# Synopsis: I play the game Android Netrunner and would like to build a data set/database
# of tournament-winning decks with the name and the card list.  The names will be specially crafted
# so that each identifies the set (it's important if you play the game), the tournament name, and the
# faction (corp or runner).  All this information is important for my purposes; if you don't
# understand what was just said, don't worry about it!  The desired end state is to have each card list
# saved as a document with that name.  I'll be able to use that information for later analysis
# and in online collections so that I can read through each card.
# The target site will be https://stimhack.com/tournament-decklists/ which contains tables
# with hyperlinks to the winning decks for a given entry. This notebook will be a little messy as I'm intending
# to leave in the bits where I figure this out (i.e. adjusting my targeting on the site to scrape it, finding tables,
# following links, etc.).
# Plan:
# 1) Check robots.txt - ensure we're not breaking rules!
# 2) Parse the target document so that we can programmatically extract links and names
# 3) Put the deck names and links in a data frame
# 4) Follow each link to get the web doc in which the cards are listed
# 5) Parse the page with card lists
# 6) (finally) store the card lists with proper names for later use
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup as bs
from urllib.parse import urlparse
import re


# First, set up some get, response check, and "log" functions <- all borrowed from the mathematicians file.
# These will allow us to re-use code better.
# Simple Get:  Pass it a URL, if the URL is valid and the response looks like HTML, we'll send the content back
#              to the caller.
def simple_get(url):
    # Attempts to get the content at 'url' by making an HTTP GET request.
    # If the content-type of the response is some kind of HTML, return the
    # raw response body (bytes); otherwise return None
    try:
        # Contextlib's closing allows us to open, parse, and pass the contents of the url
        # without having to explicitly close the connection
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    # Handle Request exception
    except RequestException as e:
        log_error('Error during requests to {} : {}'.format(url, str(e)))
        return None
    
    
def is_good_response(resp):
    # Returns True if the response seems to be HTML, False otherwise.
    # .get() with a default avoids a KeyError when the Content-Type header is missing
    content_type = resp.headers.get('Content-Type', '').lower()
    print('[*] Get results content type: {}'.format(content_type))
    return resp.status_code == 200 and 'html' in content_type


def log_error(e):
    # It is always a good idea to log errors.  This function just prints them,
    # but you can make it do anything
    print(e)
    
    
def find_robots(url):
    # Check the domain for a robots.txt file. We must check this to ensure that
    # we are not violating any rules of the site - please play nice when scraping!
    # Get just the domain from the target
    parser = urlparse(url)
    rbts = str(parser.scheme + '://' + parser.netloc + '/robots.txt')
    
    try:
        # Contextlib's closing allows us to open, parse, and pass the contents of the url
        # without having to explicitly close the connection
        with closing(get(rbts, stream=True)) as resp:
            return (resp.content).decode('UTF-8')
    except RequestException as e:
        log_error('Error during requests to {} : {}'.format(rbts, str(e)))
        return None
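
`find_robots` only returns the raw text of robots.txt, which still has to be read by eye. As a minimal sketch, the standard library's `urllib.robotparser` can answer "may I fetch this URL?" directly. The helper name `is_allowed` and the sample rules below are made up for illustration; they are not the real contents of the site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request happens here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, NOT the real file from the target site
sample = """User-agent: *
Disallow: /wp-admin/
"""

print(is_allowed(sample, 'https://stimhack.com/tournament-decklists/'))
print(is_allowed(sample, 'https://stimhack.com/wp-admin/options.php'))
```

In practice the string returned by `find_robots(target)` could be passed in place of `sample`.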

### Testing 1 - Investigating the Target Structure

In [226]:
# The first thing one should do before conducting scraping is check to see what's allowed - please scrape responsibly.
# I've checked this site and there is only one place not allowed.
# test = find_robots('https://stimhack.com/tournament-decklists/')
# print(test)

# Instantiate a "target" variable to contain the value of the site we're scraping
target = 'https://stimhack.com/tournament-decklists/'

# Grab the page's HTML
page = simple_get(target)

# Parse the page's HTML
html = bs(page, 'html.parser')

# Display the HTML - using .prettify() method to make the output more "human readable"
# print(html.prettify())
# Find all <table> tags with class "tablepress" <- tried using dataTable (seemed more appropriate) but it
# wouldn't grab all the tables.  After some tinkering, came up with the below
tables = html.find_all('table', {'class': 'tablepress'})
# Check the number of items in our list of tables - comes out to be 21, which is the number of tables
# on the page (sigh). I counted them at the time of writing, so trust me?
print(len(tables))

# After a little more digging I figured out how to get better labels for the tables in the "older" section of
# the document.  If this seems jarringly out of place, that's because it is.  I need a unique name for each of my tables
# in order to better sort the data later on
older_expansions = html.find_all('h2', {'class': 'tablepress-table-name'})

# Showing a little of the information we grabbed
for name in older_expansions[:3]:
    print(name.string)

[*] Get results content type: text/html; charset=utf-8
21
All That Remains
Up and Over
First Contact
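
A note on why the `{'class': 'tablepress'}` filter works: when a tag carries several classes, bs4 matches the filter against each class individually. The sketch below shows this on made-up markup (the ids, class names other than `tablepress`, and headings are invented for the example, not copied from the site).

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the real page is much larger
sample_html = """
<h2 class="tablepress-table-name">First Set</h2>
<table id="tablepress-2" class="tablepress tablepress-id-2"><tbody></tbody></table>
<h2 class="tablepress-table-name">Second Set</h2>
<table id="tablepress-1" class="tablepress tablepress-id-1"><tbody></tbody></table>
<table id="other" class="layout"><tbody></tbody></table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Matches any <table> whose list of classes includes 'tablepress'
sample_tables = soup.find_all('table', {'class': 'tablepress'})
print([t['id'] for t in sample_tables])

# class_ is the keyword form of the same filter
same_tables = soup.find_all('table', class_='tablepress')
print(len(same_tables))
```

The table with class `layout` is skipped, and `tablepress-id-2` does not count as a match for `tablepress` because each class is compared as a whole token.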


### Testing 2 - Parsing the tables

In [227]:
# At this point we've:
#    Retrieved the document containing the target information from the website
#    Parsed the document to pull out the tables of information
# Now, we need to pull the information out and store it in a useful manner.
# The first of those two things is pulling the info out of a table.
# It turns out each item in the list can still be treated like a bs4
# object...
print('[*] Displaying the type of object each element in the list is:')
for table in tables[:3]:
    print(type(table))
print('\n')

# So we continue to parse through each one, finding a unique name for later use
print('[*] The unique identifier for each of our tables:')
for table in tables[:3]:
    print(table['id'])
print('\n')

# Now that we can name each table something, we'll find the headers next to create
# the structure for our table
print('[*] The headers for the tables:')
for table in tables[:3]:
    print('Table {} columns:'.format(table['id']))
    for col in table.thead.find_all('th'):
        print('     {}'.format(col.string))
print('\n')

[*] Displaying the type of object each element in the list is:
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>


[*] The unique identifier for each of our tables:
decklists
tablepress-20
tablepress-19


[*] The headers for the tables:
Table decklists columns:
     DataPack
     Location
     Type
     Players
     Winner
     Runner
Table tablepress-20 columns:
     No
     Tournament
     Players
     Champion
     Runner
     Corp
Table tablepress-19 columns:
     No
     Tournament
     Players
     Champion
     Runner
     Corp




In [228]:
# Data Frame Structure:
#    Name: expansion_tournament_faction
#    Link: <hyperlink>
#    Cards: [maybe list?]
# The site has divided its records into (essentially) two different sets:
# "modern" and "older" <- I made those up, just don't think about it
# The table titled decklists I'm calling "modern" and everything else I'm
# calling older.  We'll need to parse them a bit differently so let's put them
# into two different variables so we can approach them appropriately and,
# ultimately, store them in the same dataframe!
# Had to switch this up because the tbody was not a part of the table for
# some reason...
modern = html.find('tbody', {'class': 'list'})

# Had to use a bit o' the ol' re to get the job done, the expression below
# says "find all the tables (html.find_all('table'...)) where the id
# contains 'tablepress' followed by anything else ({'id': re.compile('tablepress.*')})"
older = html.find_all('table', {'id': re.compile('tablepress.*')})

# Now, we pull out the beginnings of our information - the deck name.
# We need expansion_tournament_faction; however, we can't get faction
# from this table, as that's tied to the card list. So, we'll grab the first two pieces
# and store that with the link as a dictionary
names_links = {}
for row in modern.find_all('tr'):
    # Get the cell that has the data pack name and modify it
    set_name = row.find('td', {'class': 'data-pack'}).string.replace(' ', '_').lower()
    dual_purpose = row.find('td', {'class': 'location'})
    tourny = dual_purpose.string.replace(' ', '_').replace(',', '').lower()
    name = set_name + '_' + tourny
    link = dual_purpose.a['href']
    names_links[name] = link

# Now the "modern" table has been parsed
for key, value in list(names_links.items())[:3]:
    print('Name: {}\n    Link: {}'.format(key, value))

Name: blood_money_games_of_berkeley_berkeley_ca
    Link: https://stimhack.com/gnk-games-of-berkeley-berkeley-ca-14-players/
Name: 23_seconds_gamespace_kashiwagi
    Link: https://stimhack.com/gnk-gamespace-kashiwagi-6-players/
Name: 23_seconds_ropecon_helsinki_finland
    Link: https://stimhack.com/regional-ropecon-helsinki-finland-23-players/
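
The name-building steps above (`replace`, `lower`, join with an underscore) get repeated for the older tables, so they could live in one helper. This is a minimal sketch; `make_deck_name` is a hypothetical name, and the input strings are guesses inferred from the first key printed above. Unlike the cell above, it strips commas from both parts and collapses repeated whitespace.

```python
def make_deck_name(set_name, tournament):
    """Build a lowercase, underscore-separated key from a set name and a tournament name."""
    def slug(text):
        # Drop commas, collapse any run of whitespace into one underscore, lowercase
        return '_'.join(text.replace(',', '').split()).lower()
    return slug(set_name) + '_' + slug(tournament)


# Assumed inputs, reconstructed from the printed key rather than read from the page
print(make_deck_name('Blood Money', 'Games of Berkeley, Berkeley, CA'))
# -> blood_money_games_of_berkeley_berkeley_ca
```

Because `names_links` is a dictionary, two rows that produce the same name would overwrite each other, so it may be worth checking for an existing key before assigning.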


In [229]:
# Now we need to parse the "older" tables - remember, this isn't just one but a list of tables.
# Iterate over the tables; enumerate() supplies the index used to look up the matching expansion name
for counter, table in enumerate(older):
    for row in table.tbody.find_all('tr'):
        try:
            set_name = older_expansions[counter].string.replace(' ', '_').lower()
            dual_purpose = row.find('td', {'class': 'column-2'})
            tourny = dual_purpose.string.replace(',', '').replace(' ', '_').lower()
            link = dual_purpose.a['href']
            name = set_name + '_' + tourny
            names_links[name] = link
        except Exception as e:
            # .string is None when a cell does not hold exactly one string, which is what
            # triggers the "'NoneType' object has no attribute 'replace'" messages
            print('Set {} error {}'.format(set_name, e))
            continue

Set honor_and_profit error 'NoneType' object has no attribute 'replace'
Set double_time error 'NoneType' object has no attribute 'replace'
Set double_time error 'NoneType' object has no attribute 'replace'
Set double_time error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set fear_and_loathing error 'NoneType' object has no attribute 'replace'
Set true_colors error 'NoneType' object has no attribute 'replace'
Set true_colors error 'NoneType' object has no attribute 'replace'
Set secon
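
The `'NoneType' object has no attribute 'replace'` messages mean `dual_purpose.string` came back as `None`. In bs4, `.string` is only set when a tag contains exactly one string (directly, or through a single nested child); otherwise it is `None`. The sketch below reproduces that on made-up cells and shows `get_text()` as one possible workaround. I have not checked what the failing cells on the site actually contain, so treat the markup as an assumption.

```python
from bs4 import BeautifulSoup

# Invented rows: the first cell holds only a link, the second holds a link plus loose text
sample_row = """
<table><tr>
  <td class="column-2"><a href="https://example.com/a">Store Champs, Springfield</a></td>
  <td class="column-2"><a href="https://example.com/b">Regional</a> (report pending)</td>
</tr></table>
"""

soup = BeautifulSoup(sample_row, 'html.parser')
simple_cell, mixed_cell = soup.find_all('td', {'class': 'column-2'})

print(simple_cell.string)                     # the single nested string
print(mixed_cell.string)                      # None, because the cell has two children
print(mixed_cell.get_text(' ', strip=True))   # all text in the cell, joined with spaces
```

Swapping `.string` for `.get_text(' ', strip=True)` in the loop above would likely recover some of the skipped rows, at the cost of pulling in any extra text the cell holds.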

In [230]:
# So now, names_links has a bunch of links to follow to get the coveted card sets.
# While I wasn't able to parse out everything, for one reason or another, there's still
# a LOT to work with
#print('Number of links to follow: {}\n'.format(len(names_links.keys())))

# Need some links to test the next bit - parsing out the deck list
test_1 = names_links['blood_money_games_of_berkeley_berkeley_ca']
test_2 = names_links['23_seconds_gamespace_kashiwagi']
test_3 = names_links['23_seconds_ropecon_helsinki_finland']

card_page = simple_get(test_1)
card_html = bs(card_page, 'html.parser')
set_1 = card_html.find('div', {'class': 'wc-shortcodes-column-first'})
# print(set_1.text)
# print(type(set_1))

[*] Get results content type: text/html; charset=utf-8
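
Step 4 of the plan is to follow every link, not just the three test ones. A rough sketch of that loop is below, with a pause between requests to stay polite to the server. The function name `fetch_all`, the `delay_seconds` parameter, and the fake fetcher are all hypothetical; in the notebook, `simple_get` would be passed as `fetch`.

```python
import time


def fetch_all(names_links, fetch, delay_seconds=1.0):
    """Fetch every link in names_links, pausing between requests.

    fetch is any callable that takes a URL and returns page content or None.
    Returns (pages, failed): a dict of name -> content, and a list of names that failed.
    """
    pages = {}
    failed = []
    for i, (name, link) in enumerate(names_links.items()):
        if i > 0:
            # Wait before every request except the first
            time.sleep(delay_seconds)
        content = fetch(link)
        if content is None:
            failed.append(name)
        else:
            pages[name] = content
    return pages, failed


# Stand-in fetcher so the example runs without touching the network
def fake_fetch(url):
    return None if url.endswith('/missing/') else b'<html></html>'


demo_links = {'deck_a': 'https://example.com/a/', 'deck_b': 'https://example.com/missing/'}
demo_pages, demo_failed = fetch_all(demo_links, fake_fetch, delay_seconds=0)
print(list(demo_pages), demo_failed)
```

Keeping the failed names in a separate list makes it easy to retry them later or inspect them by hand.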


In [309]:
for linebreak in set_1.find_all('br'):
    linebreak.extract()

place_holder = ''
for child in set_1.children:
    try:
        # The deck name and the card names are in a tags which have a text attribute.
        # If the child does not it will drop down to the exception logic below.
        try_var = child.text.strip()
        # If that child object is a card name, there should have been a number before it.
        # If there was a number the exception logic below would have stored that value in the place_holder variable.
        if place_holder != '':
            print('{} x {} < - Number and Card'.format(place_holder, try_var))
            # reset the place_holder var to be empty
            place_holder = ''
            # Start the loop over
            continue
        # If place_holder has nothing in it, it's not a card, but we need to see if it should be printed on its own line
        try:
            # If the child wasn't a card, it was instead a single line of text relevant on its own,
            # i.e. (<card_subtype> <count>), or the name of the deck
            # if re.match("^[A-Za-z0-9_-]*$", my_little_string)
            # re.search('[a-zA-Z]', try_var)
            if any(c.isalpha() for c in try_var):
                print('{} <- Something relavent for a single line'.format(try_var))
                continue
        except TypeError:
            # If the child contained no letters, it was something I didn't care to have
            # I.e. the amount of influence (something specific to the game) a given card consumes
            print('{} <- don\'t need this'.format(try_var))
    except Exception as e:
        # print('Error occurred at {}: {}'.format(child, e))
        # Get rid of spaces around the text - because that's what this is, just text, meaning not really a child element.
        # Looking at the page source reveals that the count for each card in a deck isn't between any tags.
        exc_var = str(child).strip()
        try:
            # Check the contents of the variable to see if it's a number
            int(exc_var)
            # If so, we want to save that number to print it on a line with the card to designate how many should
            # be in the deck
            place_holder = exc_var
            print('{} <- place_holder'.format(exc_var))
            continue
        except Exception as e:
            pass
        if re.search('[a-zA-Z]', child):
            print('{} <- wasn\'t in tag but single line relavent'.format(exc_var))
        else:
            print('{} <- What got here?!'.format(exc_var))
            continue

Vex'ahlia <- Something relavent for a single line
(45 cards) <- wasn't in tag but single line relavent
Kate "Mac" McCaffrey: Digital Tinker <- Something relavent for a single line
Event (12) <- wasn't in tag but single line relavent
3 <- place_holder
3 x Career Fair < - Number and Card
 <- What got here?!
3 <- place_holder
3 x Diesel < - Number and Card
1 <- place_holder
1 x "Freedom Through Equality" < - Number and Card
2 <- place_holder
2 x Indexing < - Number and Card
3 <- place_holder
3 x Sure Gamble < - Number and Card
Hardware (7) <- wasn't in tag but single line relavent
2 <- place_holder
2 x Clone Chip < - Number and Card
2 <- place_holder
2 x Mirror < - Number and Card
1 <- place_holder
1 x Plascrete Carapace < - Number and Card
2 <- place_holder
2 x R&D Interface < - Number and Card
Resource (17) <- wasn't in tag but single line relavent
2 <- place_holder
2 x Beth Kilrain-Chang < - Number and Card
3 <- place_holder
3 x Daily Casts < - Number and Card
2 <- place_holder
2 x Fil

'3'
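
The loop above tells bare text from tags by waiting for an exception, which depends on how the installed bs4 version treats text nodes. A possible alternative is to check the node type directly with `isinstance`. The sketch below does that on a made-up fragment shaped like the output above (counts as bare text, card names in `<a>` tags); `parse_cards` is a hypothetical helper and the markup is an assumption, not the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Made-up fragment: counts are bare text, card names sit inside <a> tags
sample = """
<div class="wc-shortcodes-column-first">
Event (12)<br/>
3 <a href="#">Career Fair</a> <span>**</span><br/>
1 <a href="#">Diesel</a><br/>
</div>
"""


def parse_cards(column):
    """Return (count, card_name) pairs found among column's direct children."""
    cards = []
    pending_count = None
    for child in column.children:
        if isinstance(child, NavigableString):
            # Bare text: remember it only if it is a card count
            text = str(child).strip()
            if text.isdigit():
                pending_count = int(text)
        elif isinstance(child, Tag) and child.name == 'a' and pending_count is not None:
            # A link that directly follows a count is treated as a card name
            cards.append((pending_count, child.get_text(strip=True)))
            pending_count = None
    return cards


column = BeautifulSoup(sample, 'html.parser').find('div', {'class': 'wc-shortcodes-column-first'})
print(parse_cards(column))
```

Returning a list of tuples instead of printing makes the result easy to write to a file named after the deck, which is the last step of the plan.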