# Web Scraping with Beautiful Soup Demo

This demo walks through an example of scraping data from a website programmatically using the Beautiful Soup package in Python. Complete code is available on GitHub (see Resources).


## Why scrape data from the web?

+ No API available
+ Build a dataset for analysis
+ Keep track of data over time
+ Use data in another program
+ The Internet is a Big Data Lake

## DEMO: Magic: The Gathering (MTG)

MTG is a card-dueling game. Players build a deck prior to game play with certain rules of deck construction based on the format of play. For this use-case, I was interested in the Commander format of game play. Commander decks consist of 100 cards: 1 commander, 99 spells & mana (currency to cast spells). 

There are around 20,000 unique MTG cards in existence. Not all of them are 'legal', but even if you assume that only 10,000 are, the number of possible deck combinations is astronomical in nearly every format of game play. Certain deck combinations float to the top and become widely replicated and played in the tournament circuits. These comprise the 'metagame'.

I was interested in looking at the deck characteristics of the current 'metagame'. There are several websites that allow users to save deck configurations and provide stats on the decks. After exhaustive searching, I found that there were no websites that exposed API endpoints to pull this data. 

Example Sites:
+ [EDH Rec](https://edhrec.com/)
+ [MTG Top 8](https://mtgtop8.com/)
+ [MTG Goldfish](https://www.mtggoldfish.com/metagame/commander#paper)

**Enter: Beautiful Soup and Web Scraping!!**

### Imports

+ urllib: (Standard Library)
    + urllib is a package that collects several modules for working with URLs:
        + **urllib.request** for opening and reading URLs
        + **urllib.error** containing the exceptions raised by urllib.request
        + urllib.parse for parsing URLs
        + urllib.robotparser for parsing robots.txt files
+ JSON: (Standard Library)
    + JSON encoder and decoder
+ RE: Regular Expressions (Standard Library)
    + Regular Expression Operations
+ BS4: Beautiful Soup
    + "Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work."

In [6]:
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
import json
import re
from bs4 import BeautifulSoup
# Then get card info using the MTG SDK

from mtgsdk import Card
cards = Card.all() #https://mtgjson.com/

## Examine the structure of the site

+ Robots.txt
    + Example: https://www.mtggoldfish.com/robots.txt
    + More Info: http://www.robotstxt.org/robotstxt.html
    + BLUF: Respect the robots.txt file...it's just the right thing to do. 
+ Look at the site and figure out exactly what data you want
    + https://www.mtggoldfish.com/metagame/commander/full#paper
+ What is the structure of the URLs?
    + https://www.mtggoldfish.com/metagame/commander/full?page=2#paper
    + Notice the page number in the query string. How many pages are there?
+ Is there a way to programmatically loop through URLs?
+ F12 (developer mode) to 'inspect elements'
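
The robots.txt check in the list above can also be done in code with `urllib.robotparser` from the standard library. The sketch below parses a made-up robots.txt (the rules are an assumption for illustration, not MTG Goldfish's actual file) and asks whether two URLs may be fetched. The helper name `build_parser` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# To check a live site, call parser.set_url(...) and parser.read() instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def build_parser(robots_txt):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch('*', 'https://www.mtggoldfish.com/metagame/commander/full')
blocked = parser.can_fetch('*', 'https://www.mtggoldfish.com/admin/settings')
print(allowed, blocked)
```

Running a check like this once before the scraping loops start is a cheap way to confirm that the pages you plan to request are not disallowed.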

## Scraping Plan: Two loops

+ Loop through the ~12 pages that have **urls** for each deck
    + https://www.mtggoldfish.com/metagame/commander/full?page=1#paper
    + https://www.mtggoldfish.com/metagame/commander/full?page=2#paper
    + https://www.mtggoldfish.com/metagame/commander/full?page=3#paper
    + ...
    + https://www.mtggoldfish.com/metagame/commander/full?page=12#paper
+ Loop through deck URLs on each of the above pages, e.g.
    + zur-the-enchanter
    + kozilek-the-great-distortion
    + sen-triplets
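
A minimal sketch of the outer loop's URLs, assuming the listing pages keep the `?page=N#paper` pattern shown above and that there are 12 of them (the page count changes over time). The names `LISTING_PATH`, `listing_url`, and `listing_urls` are hypothetical.

```python
from urllib.parse import urlencode

DOMAIN = 'https://www.mtggoldfish.com'
LISTING_PATH = '/metagame/commander/full'  # assumed from the URLs listed above


def listing_url(page_number):
    """Build the URL of one metagame listing page."""
    query = urlencode({'page': page_number})
    return f'{DOMAIN}{LISTING_PATH}?{query}#paper'


def listing_urls(last_page):
    """Build the URLs for pages 1 through last_page, inclusive."""
    return [listing_url(n) for n in range(1, last_page + 1)]


urls = listing_urls(12)
print(urls[0])
print(urls[-1])
```

The inner loop then visits each deck link found on those pages, which is what the rest of this demo builds up for a single deck.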

In [4]:
# Target Domain
DOMAIN = 'https://www.mtggoldfish.com'

# Use the following URL to make a list of links to all the commander decks in current metagame
myurl = 'https://www.mtggoldfish.com/metagame/commander/full?page=1#paper'
#pull the html text of myurl into python 
html = urlopen(myurl)
#create beautiful soup object from the html
bs = BeautifulSoup(html, 'html.parser')
#close the connection to the page
html.close()
#place to store scraped data
decks = {}

In [5]:
bs.head()

[<title>Magic: the Gathering Commander Decks and Metagame Breakdown (All</title>,
 <meta content="Popular Commander Magic: the Gathering decks with prices from tournament results." name="description"/>,
 <meta content="decks, decklists, tournament results, archetypes, metagame, magic, mtg, magic the gathering, magic the gathering online, mtgo, daily events" name="keywords"/>,
 <meta content="Magic: the Gathering Commander Decks and Metagame Breakdown (All Decks)" property="og:title"/>,
 <meta content="website" property="og:type"/>,
 <meta content="https://www.mtggoldfish.com/metagame/commander/full" property="og:url"/>,
 <meta content="Popular Commander Magic: the Gathering decks with prices from tournament results." property="og:description"/>,
 <meta content="summary" name="twitter:card"/>,
 <meta content="@mtggoldfish" name="twitter:site"/>,
 <link href="https://www.mtggoldfish.com/feed" rel="alternate" title="ATOM" type="application/atom+xml"/>,
 <link href="https://www.mtggoldfish

## Find the deck URL in the HTML
+ 'Inspect Element'
+ Use Beautiful Soup's findAll() method (find_all() in current versions) to search the bs object for that HTML element

In [6]:
bs.title

<title>Magic: the Gathering Commander Decks and Metagame Breakdown (All</title>

In [7]:
bs.title.string

'Magic: the Gathering Commander Decks and Metagame Breakdown (All'

In [8]:
bs.findAll('meta',{'name':"description"})

[<meta content="Popular Commander Magic: the Gathering decks with prices from tournament results." name="description"/>]

In [9]:
bs.find('div',{'class':"card-image-tile"})

<div class="card-image-tile" style="background-image: url('https://cdn1.mtggoldfish.com/images/gf/Sisay%252C%2BWeatherlight%2BCaptain%2B%255BMH1%255D.jpg');"></div>

In [10]:
#chain together
bs.find('div',{'class':"card-image-tile"}).get('style')

"background-image: url('https://cdn1.mtggoldfish.com/images/gf/Sisay%252C%2BWeatherlight%2BCaptain%2B%255BMH1%255D.jpg');"

In [11]:
image_style = bs.find('div',{'class':"card-image-tile"}).get('style')
image = image_style.split("('", 1)[1].split("')")[0]
print(image)

https://cdn1.mtggoldfish.com/images/gf/Sisay%252C%2BWeatherlight%2BCaptain%2B%255BMH1%255D.jpg
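
The two chained `split()` calls work for this page, but they raise an `IndexError` if the style attribute has no `url('...')` part. A regular expression is one alternative. The sketch below is an assumption about how one might harden this step, not part of the original notebook, and `url_from_style` is a hypothetical helper name.

```python
import re

# Matches url(...) with optional single or double quotes around the address
STYLE_URL = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def url_from_style(style):
    """Return the URL inside a CSS url(...) value, or None if there is none."""
    if not style:
        return None
    match = STYLE_URL.search(style)
    return match.group(1) if match else None


style = ("background-image: url('https://cdn1.mtggoldfish.com/images/gf/"
         "Sisay%252C%2BWeatherlight%2BCaptain%2B%255BMH1%255D.jpg');")
print(url_from_style(style))
print(url_from_style('color: red;'))
```

Returning `None` instead of raising lets the scraping loop skip a deck tile without an image and carry on.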


In [12]:
#Display image
from IPython.display import Image
Image(url=image)

**NOTE: We can use a Magic: The Gathering API to get some of the card data. Before scraping a site, it's worth checking whether it limits the number of requests; you might need to split the data collection between multiple data sources. In this use case, I pull one page at a time and parse the HTML locally, which also limits the number of requests made to the site.**

In [13]:
bs.findAll('span', {'class': "deck-price-paper"})

[<span class="deck-price-paper">
 <!-- items [ [url, label], [url, label] ] -->
 <div class="subNav-menu-mobile">
 <label class="sr-only" for="menu-format-selection-6021cbeb-0a31-4917-8557-0625dbf8394a">Format Selection</label>
 <select class="form-control subNav-menu-select" id="menu-format-selection-6021cbeb-0a31-4917-8557-0625dbf8394a">
 <option selected="selected" value="">Select a Format ...</option>
 <option value="/metagame/standard/full#paper">Standard</option>
 <option value="/metagame/modern/full#paper">Modern</option>
 <option value="/metagame/pioneer/full#paper">Pioneer</option>
 <option value="/metagame/pauper/full#paper">Pauper</option>
 <option value="/metagame/legacy/full#paper">Legacy</option>
 <option value="/metagame/vintage/full#paper">Vintage</option>
 <option value="/metagame/historic/full#paper">Historic</option>
 <option value="/metagame/penny_dreadful/full#paper">Penny Dreadful</option>
 <option value="/metagame/commander_1v1/full#paper">Commander 1v1</option>


In [14]:
deck_urls = []
try:
    deck_url = bs.findAll('span', {'class': "deck-price-paper"})
    for i in deck_url:
        for link in i.findAll('a'):
            deck_urls.append(link.attrs['href'])
except AttributeError as e:
    print(e)

In [15]:
deck_urls

['/metagame/standard/full#paper',
 '/metagame/modern/full#paper',
 '/metagame/pioneer/full#paper',
 '/metagame/pauper/full#paper',
 '/metagame/legacy/full#paper',
 '/metagame/vintage/full#paper',
 '/metagame/historic/full#paper',
 '/metagame/penny_dreadful/full#paper',
 '/metagame/commander_1v1/full#paper',
 '/metagame/commander/full#paper',
 '/metagame/brawl/full#paper',
 '/archetype/sisay-weatherlight-captain#paper',
 '/archetype/golos-tireless-pilgrim#paper',
 '/archetype/kenrith-the-returned-king#paper',
 '/archetype/commander-rin-and-seri-inseparable#paper',
 '/archetype/breya-etherium-shaper#paper',
 '/archetype/chulane-teller-of-tales#paper',
 '/archetype/alela-artful-provocateur#paper',
 '/archetype/atraxa-praetors-voice#paper',
 '/archetype/muldrotha-the-gravetide#paper',
 '/archetype/yuriko-the-tiger-s-shadow#paper',
 '/archetype/korvold-fae-cursed-king#paper',
 '/archetype/krenko-mob-boss#paper',
 '/archetype/gishath-sun-s-avatar#paper',
 '/archetype/gavi-nest-warden#paper',

### This returns every link at the top of the page, but we only want the ones starting with /archetype. Identify the pattern and use a regular expression to narrow down the list.

In [16]:
deck_urls = []
try:
    deck_url = bs.findAll('span', {'class': "deck-price-paper"})
    for i in deck_url:
        for link in i.findAll('a', href=re.compile(r'^/archetype/')):
            deck_urls.append(link.attrs['href'])
except AttributeError as e:
    print(e)

print(deck_urls)

['/archetype/sisay-weatherlight-captain#paper', '/archetype/golos-tireless-pilgrim#paper', '/archetype/kenrith-the-returned-king#paper', '/archetype/commander-rin-and-seri-inseparable#paper', '/archetype/breya-etherium-shaper#paper', '/archetype/chulane-teller-of-tales#paper', '/archetype/alela-artful-provocateur#paper', '/archetype/atraxa-praetors-voice#paper', '/archetype/muldrotha-the-gravetide#paper', '/archetype/yuriko-the-tiger-s-shadow#paper', '/archetype/korvold-fae-cursed-king#paper', '/archetype/krenko-mob-boss#paper', '/archetype/gishath-sun-s-avatar#paper', '/archetype/gavi-nest-warden#paper', '/archetype/kalamax-the-stormsire#paper', '/archetype/kess-dissident-mage#paper', '/archetype/the-ur-dragon#paper', '/archetype/arcades-the-strategist#paper', '/archetype/nethroi-apex-of-death#paper', '/archetype/niv-mizzet-parun#paper', '/archetype/meren-of-clan-nel-toth#paper', '/archetype/syr-gwyn-hero-of-ashvale#paper', '/archetype/kaalia-of-the-vast#paper', '/archetype/kinnan-bon
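
A note on writing the pattern: in a regular expression, `*` repeats only the character before it, so a pattern such as `/archetype*` is looser than it looks, while `^/archetype/` says "starts with /archetype/" explicitly. The sketch below shows the difference on plain strings; the sample links and the helper name `keep_archetype_links` are made up for illustration.

```python
import re

ARCHETYPE_LINK = re.compile(r'^/archetype/')  # anchored: must start with /archetype/
LOOSE_PATTERN = re.compile(r'/archetype*')    # '*' applies to the final 'e' only


def keep_archetype_links(hrefs):
    """Keep only the links that point at a deck archetype page."""
    return [href for href in hrefs if ARCHETYPE_LINK.search(href)]


sample = [
    '/metagame/standard/full#paper',
    '/archetype/sisay-weatherlight-captain#paper',
    '/archetypal-decks',  # hypothetical link that should not be kept
    '/archetype/golos-tireless-pilgrim#paper',
]

kept = keep_archetype_links(sample)
print(kept)
# The loose pattern also accepts the hypothetical '/archetypal-decks' link
print(bool(LOOSE_PATTERN.search('/archetypal-decks')))
```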

### Now we have a list of deck URLs we can loop through! Look at one deck (the second in the list).

In [17]:
deck_url = DOMAIN + deck_urls[1]
print(deck_url)

https://www.mtggoldfish.com/archetype/golos-tireless-pilgrim#paper
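
String concatenation works here because every link starts with `/`. As a hedged alternative, `urllib.parse.urljoin` resolves relative links of any shape against the domain, and `urlparse` can recover the deck's short name from the result. The helper names `absolute_url` and `deck_slug` are hypothetical.

```python
from urllib.parse import urljoin, urlparse

DOMAIN = 'https://www.mtggoldfish.com'


def absolute_url(href):
    """Resolve a link found on the page against the site's domain."""
    return urljoin(DOMAIN, href)


def deck_slug(url):
    """Return the last path segment of a deck URL, ignoring the #fragment."""
    return urlparse(url).path.rstrip('/').rsplit('/', 1)[-1]


deck_url = absolute_url('/archetype/golos-tireless-pilgrim#paper')
print(deck_url)
print(deck_slug(deck_url))
```

The slug is a convenient dictionary key or file name when storing one record per deck.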


In [18]:
#Get the deck page
page = None
try:
    html = urlopen(deck_url)
except HTTPError as e:
    # The server answered with an error status (e.g. 404)
    print(e)
except URLError as e:
    print('The server could not be found!')
else:
    # Only parse when the request succeeded, so html is always defined here
    page = BeautifulSoup(html.read(), 'html.parser')
    html.close()
if page is not None:
    print(page.head())

[<title>Golos, Tireless Pilgrim Deck for Magic: the Gathering</title>, <meta content="Golos, Tireless Pilgrim deck list with prices for Magic: the Gathering (MTG)." name="description"/>, <meta content="golos, tireless pilgrim, golos, tireless pilgrim, sol ring, cultivate, commander, deck, decklist, price, archetype, magic, mtg, magic the gathering, magic the gathering online, mtgo" name="keywords"/>, <meta content="https://cdn1.mtggoldfish.com/images/gf/Golos%252C%2BTireless%2BPilgrim%2B%255BM20%255D.jpg" property="og:image"/>, <meta content="Golos, Tireless Pilgrim Deck for Magic: the Gathering" property="og:title"/>, <meta content="website" property="og:type"/>, <meta content="https://www.mtggoldfish.com/archetype/golos-tireless-pilgrim" property="og:url"/>, <meta content="Golos, Tireless Pilgrim deck list with prices for Magic: the Gathering (MTG)." property="og:description"/>, <meta content="summary" name="twitter:card"/>, <meta content="@mtggoldfish" name="twitter:site"/>, <link h
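
Since this request is repeated for every deck, it can help to wrap it in a function that returns `None` on failure and pauses between requests. The sketch below is one possible shape, not the notebook's code: `fetch_html`, its `opener` parameter (there so the function can be exercised without a network), and the one-second default delay are all assumptions.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen, delay_seconds=1.0):
    """Return the raw page bytes, or None if the request fails."""
    try:
        response = opener(url)
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        print(f'HTTP error for {url}: {e.code}')
        return None
    except URLError as e:
        print(f'Could not reach {url}: {e.reason}')
        return None
    try:
        return response.read()
    finally:
        response.close()
        # Pause after each successful request to go easy on the server
        time.sleep(delay_seconds)


# Usage inside the deck loop (needs network access, so it is left commented out):
# raw = fetch_html(deck_url)
# if raw is not None:
#     page = BeautifulSoup(raw, 'html.parser')
```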

## Get Data!

+ Deck Name
+ Number of Decks
+ % of Metagame
+ Online Price
+ Paper Price
+ Cards!!!


### Deck Name

In [19]:
page.findAll('h1', {'class': 'deck-view-title'})

[<h1 class="deck-view-title">
 Golos, Tireless Pilgrim
 Report Deck Name
 </a></h1>]

In [20]:
try:
    for i in page.findAll('h1', {'class': 'deck-view-title'}):
        name = i.get_text()
        # Keep only the first block of text (the deck name itself)
        name = name.split('\n\n')[0].strip('\n')
        print(f'Deck name: {name}')
except AttributeError as e:
    print(e)

Deck name: Golos, Tireless Pilgrim


### Number of Decks

In [21]:
deckStats = page.find('p').get_text().strip().split()
# Number of decks of the same kind on MTG Goldfish
numDecks = deckStats[0]
# Percent of the metagame, with the '(' and '%' stripped out
pctMeta = re.sub('[^0-9.]+', '', deckStats[2])

print(deckStats)
print(f'Number of Decks: {numDecks}')
print(f'% of Metagame: {pctMeta}')

['48', 'Decks', '(1.73%', 'of', 'meta)']
Number of Decks: 48
% of Metagame: 1.73
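
Splitting on whitespace leaves both values as strings and depends on the word order staying fixed. A hedged alternative is to match the whole phrase with one regular expression and convert the pieces to numbers. The pattern below is written against the single example shown above ('48 Decks (1.73% of meta)'), so treat it as an assumption about the page text; `parse_deck_stats` is a hypothetical helper name.

```python
import re

# Written against text shaped like '48 Decks (1.73% of meta)'
DECK_STATS = re.compile(r'(\d[\d,]*)\s+Decks?\s+\(([\d.]+)%\s+of\s+meta\)')


def parse_deck_stats(text):
    """Return (number_of_decks, percent_of_meta), or None if the text does not match."""
    match = DECK_STATS.search(text)
    if match is None:
        return None
    num_decks = int(match.group(1).replace(',', ''))
    pct_meta = float(match.group(2))
    return num_decks, pct_meta


stats = parse_deck_stats('48 Decks (1.73% of meta)')
print(stats)
print(parse_deck_stats('no stats here'))
```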


### Online Price/Paper Price

In [22]:
# Deck price: paper and online
deckPrice_paper = list(page.find('div', {'class': 'price-box paper'}).div.next_siblings)[1].get_text()
deckPrice_online = list(page.find('div', {'class': 'price-box online'}).div.next_siblings)[1].get_text()
print(f'Paper Price: {deckPrice_paper}')
print(f'Online Price: {deckPrice_online}')

Paper Price: 1,349.48
Online Price: 322.07
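
The prices come back as strings with thousands separators, so they need converting before any analysis. A minimal sketch using `decimal.Decimal`, which avoids floating-point rounding on currency values; `parse_price` is a hypothetical helper, and the handling of a leading `$` is an assumption in case the page text includes one.

```python
from decimal import Decimal


def parse_price(text):
    """Convert a price string such as '1,349.48' to a Decimal."""
    cleaned = text.strip().replace('$', '').replace(',', '')
    return Decimal(cleaned)


paper = parse_price('1,349.48')
online = parse_price('322.07')
print(paper, online)
print(paper - online)
```

Note that `json.dump` cannot serialize `Decimal` directly, so convert with `str()` or `float()` before writing the data out.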


### Get Cards!!

This is the real impetus behind this use case: what cards are in the most popular decks being played right now?

A few applications in mind for using this data paired with the data collected above:

+ Network analysis of decks with prices as edge weights
+ Text analysis of entire decks
+ EDA on deck stats with price

In [23]:
# First get the cards

card_list = []

try:
    deck_page = page.find('table', {'class': "deck-view-deck-table"})

    # Card links are the ones whose href contains '/price'
    link_finder = re.compile(r'/price', re.IGNORECASE)

    for i in deck_page.findAll('a', {'href': link_finder}):
        card = i.get_text()
        if ' // ' in card:
            # Two names joined by ' // ' are stored as separate cards
            card_list.extend(card.split(' // '))
        else:
            card_list.append(card)
except AttributeError as e:
    # page.find() returned None: the deck table was not found
    print(e)

print(card_list)

['Golos, Tireless Pilgrim', 'Bloom Tender', 'Faeburrow Elder', 'Arena Rector', "Atraxa, Praetors' Voice", 'Felidar Guardian', 'Nicol Bolas, the Ravager', 'Spark Double', 'Deepglow Skate', 'Seedborn Muse', 'Aminatou, the Fateshifter', 'Narset, Parter of Veils', 'Oko, Thief of Crowns', 'Saheeli Rai', 'Teferi, Time Raveler', 'Chandra, Torch of Defiance', 'Narset Transcendent', 'Ral Zarek', 'Tamiyo, Field Researcher', 'Teferi, Master of Time', 'Nicol Bolas, Dragon-God', 'Teferi, Hero of Dominaria', 'Tezzeret the Seeker', 'Tezzeret, Artifice Master', 'Chandra, Awakened Inferno', 'Liliana, Dreadhorde General', 'Teferi, Temporal Archmage', 'Nicol Bolas, God-Pharaoh', 'Nicol Bolas, Planeswalker', 'Ugin, the Spirit Dragon', 'Swords to Plowshares', "Assassin's Trophy", 'Cyclonic Rift', 'Heroic Intervention', 'Generous Gift', "Teferi's Protection", 'Supreme Verdict', "Urza's Ruinous Blast", 'Merciless Eviction', "Primevals' Glorious Rebirth", 'Sol Ring', 'Arcane Signet', 'Talisman of Creativity',

In [24]:
# Index the SDK cards by name once, keeping the first match for each name
cards_by_name = {}
for card in cards:
    cards_by_name.setdefault(card.name, card)

deck_list = {}
for j, card_name in enumerate(card_list):
    card = cards_by_name.get(card_name)
    if card is None:
        # Card not found in the SDK data
        deck_list[j] = {'name': card_name, 'text': 'NA',
                        'colors': 'NA', 'mana_cost': 'NA'}
    else:
        deck_list[j] = {'name': card.name, 'text': card.text,
                        'colors': card.colors, 'mana_cost': card.mana_cost}

In [25]:
print(deck_list)

tap another target permanent.\n[−2]: Ral Zarek deals 3 damage to any target.\n[−7]: Flip five coins. Take an extra turn after this one for each coin that comes up heads.', 'colors': ['Red', 'Blue'], 'mana_cost': '{2}{U}{R}'}, 18: {'name': 'Tamiyo, Field Researcher', 'text': '[+1]: Choose up to two target creatures. Until your next turn, whenever either of those creatures deals combat damage, you draw a card.\n[−2]: Tap up to two target nonland permanents. They don\'t untap during their controller\'s next untap step.\n[−7]: Draw three cards. You get an emblem with "You may cast spells from your hand without paying their mana costs."', 'colors': ['Green', 'Blue', 'White'], 'mana_cost': '{1}{G}{W}{U}'}, 19: {'name': 'Teferi, Master of Time', 'text': "You may activate loyalty abilities of Teferi, Master of Time on any player's turn any time you could cast an instant.\n[+1]: Draw a card, then discard a card.\n[−3]: Target creature you don't control phases out. (Treat it and anything attache

## Store the information

Store the data as a JSON file to be used for analysis.

In [27]:
decks[0] = {'DeckName': name, 'NumberOfDecks': numDecks,
            'PercentOfDecks': pctMeta, 'PaperPrice': deckPrice_paper,
            'OnlinePrice': deckPrice_online, 'Cards': deck_list}

In [32]:
decks[0].get('DeckName')

'Golos, Tireless Pilgrim'

In [40]:
decks[0].get('Cards')[22]

{'name': 'Tezzeret the Seeker',
 'text': '[+1]: Untap up to two target artifacts.\n[−X]: Search your library for an artifact card with converted mana cost X or less and put it onto the battlefield. Then shuffle your library.\n[−5]: Artifacts you control become artifact creatures with base power and toughness 5/5 until end of turn.',
 'colors': ['Blue'],
 'mana_cost': '{3}{U}{U}'}

In [41]:
len(decks[0].get('Cards'))

100

In [42]:
with open('data_demo.json', 'w') as fp:
    json.dump(decks, fp, indent=4)
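
One thing to watch when reading the file back: JSON object keys are always strings, so the integer keys used in `decks` and in each deck's `Cards` dictionary come back as `'0'`, `'1'`, and so on. The sketch below shows the round trip in memory on a tiny made-up record; `int_keys` is a hypothetical helper.

```python
import json

# Tiny made-up record with the same shape as the scraped data
decks = {0: {'DeckName': 'Golos, Tireless Pilgrim',
             'Cards': {0: {'name': 'Sol Ring'}}}}

# Same conversion that json.dump / json.load perform via a file
restored = json.loads(json.dumps(decks))
print(list(restored.keys()))


def int_keys(mapping):
    """Return a copy of the mapping with its top-level keys converted to int."""
    return {int(key): value for key, value in mapping.items()}


fixed = int_keys(restored)
print(fixed[0]['DeckName'])
```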

## Beautiful Soup Cheatsheet

+ [Beautiful Soup Cheatsheet on GitHub](https://gist.github.com/yoki/b7f2fcef64c893e307c4c59303ead19a)

Beautiful Soup features not shown here:

+ CSS selectors
+ parent/child/sibling selectors (up/down/sideways navigation through HTML elements)
+ modifying tags
+ deleting/replacing/encoding
+ SoupStrainer: parse only part of a document

## Resources

+ Official Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
+ Maggie's GitHub Repo: https://github.com/MKS310/MTG-Web-Scraping
+ MTG: https://en.wikipedia.org/wiki/Magic:_The_Gathering
+ MTG Goldfish: https://www.mtggoldfish.com
