# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Using-Beautiful-soup-to-extract-web-data" data-toc-modified-id="Using-Beautiful-soup-to-extract-web-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Using <strong>Beautiful soup</strong> to extract web data</a></div><div class="lev2 toc-item"><a href="#Importing-our-libs" data-toc-modified-id="Importing-our-libs-11"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Importing our libs</a></div><div class="lev2 toc-item"><a href="#Setting-the-url" data-toc-modified-id="Setting-the-url-12"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Setting the url</a></div><div class="lev2 toc-item"><a href="#Looking-for-our-data-of-interest" data-toc-modified-id="Looking-for-our-data-of-interest-13"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Looking for our data of interest</a></div><div class="lev2 toc-item"><a href="#Creating-a-dataframe" data-toc-modified-id="Creating-a-dataframe-14"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Creating a dataframe</a></div><div class="lev2 toc-item"><a href="#Saving-into-csv-file" data-toc-modified-id="Saving-into-csv-file-15"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Saving into <em>csv</em> file</a></div>

# Using __Beautiful soup__ to extract web data

## Importing our libs

In [1]:
# Importing our libs:
from IPython.display import display
import pandas as pd
from bs4 import BeautifulSoup
import requests

## Setting the url

In this case we are using a standard on to test som functionalities of the libraries in order to build a script to automatize the scraping.

In [2]:
# Setting path:
url = "http://numismatics.org/ocre/results?q=&start=0"

r    = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

# Print a 'beautiful' verion of file:
print(soup.prettify)

<bound method Tag.prettify of <!DOCTYPE HTML>
<html><head profile="http://a9.com/-/spec/opensearch/1.1/"><title>Online Coins of the Roman Empire: Browse Collection</title><link href="http://numismatics.org/ocre/feed/?q=" rel="alternate" type="application/atom+xml"/><link href="http://numismatics.org/ocre/query.csv/?q=" rel="alternate" type="text/csv"/><link href="http://numismatics.org/ocre/query.kml/?q=" rel="alternate" type="application/vnd.google-earth.kml+xml"/><link href="http://numismatics.org/ocre/opensearch.xml" rel="search" title="Example Search for http://numismatics.org/ocre/" type="application/opensearchdescription+xml"/><meta content="42746" name="totalResults"/><meta content="0" name="startIndex"/><meta content="20" name="itemsPerPage"/><link href="http://numismatics.org/themes/ocre/images/favicon.png" rel="shortcut icon" type="image/x-icon"/><meta content="width=device-width, initial-scale=1" name="viewport"/><script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11

## Looking for our data of interest

All `div` tags with the class `result-doc` are subsections that contain all the info abour the coins (including all the images). We first extract all these divs, and later we will clean the info contained.

In [3]:
results = soup.find_all("div", class_="result-doc")
display(results[0])

<div class="row result-doc"><div class="col-md-12"><h4><a href="id/ric.1(2).aug.1A">RIC I (second edition) Augustus 1A</a></h4></div><div class="col-md-5 col-lg-4 pull-right"><a class="thumbImage" href="http://ww2.smb.museum/mk_edit/images/n8/8102/vs_opt.jpg" id="http://ww2.smb.museum/ikmk/object.php?id=18207929" rel="gallery" title="Obverse of DE-MUS-814819/18207929: Münzkabinett Berlin"><img class="side-thumbnail" src="http://ww2.smb.museum/mk_edit/images/n8/8102/vs_thumb.jpg"/></a><a class="thumbImage" href="http://ww2.smb.museum/mk_edit/images/n8/8102/rs_opt.jpg" id="http://ww2.smb.museum/ikmk/object.php?id=18207929" rel="gallery" title="Reverse of DE-MUS-814819/18207929: Münzkabinett Berlin"><img class="side-thumbnail" src="http://ww2.smb.museum/mk_edit/images/n8/8102/rs_thumb.jpg"/></a><a class="thumbImage" href="http://ww2.smb.museum/mk_edit/images/n8/8101/vs_opt.jpg" id="http://ww2.smb.museum/ikmk/object.php?id=18207928" rel="gallery" style="display:none" title="Obverse of DE-M

We filter the tags to build our dataframe using pandas.

In [4]:
# Tag filtering:
tags = results[0].find_all("dt")
display(tags)
tags = [ tags[i].text for i in range(len(tags)) ]
display(tags)

[<dt>Date</dt>,
 <dt>Denomination</dt>,
 <dt>Mint</dt>,
 <dt>Obverse</dt>,
 <dt>Reverse</dt>]

['Date', 'Denomination', 'Mint', 'Obverse', 'Reverse']

## Creating a dataframe

We're now able to build a simple dataframe.

In [5]:
# We create an empty dataframe:
df = pd.DataFrame(columns=tags)

for result in results:
    # Extract all and images:
    data = result.find_all("dd")
    imgs = result.find_all("img")
    
    # Build a dictionary for dataframe appending:
    data = { tags[i] : data[i].text for i in range(len(data)) }
    # Try to add images url (if they exist):
    try:
        data["Obverse_URL"] = imgs[0]['src']
        data["Reverse_URL"] = imgs[1]['src']
    except:
        data["Obverse_URL"] = None
        data["Reverse_URL"] = None
    
    data = pd.Series(data)
    df = df.append(data, ignore_index = True)

display(df.head())

Unnamed: 0,Date,Denomination,Mint,Obverse,Reverse,Obverse_URL,Reverse_URL
0,25 BC - 23 BC,Quinarius,Emerita,"AVGVST: Head of Augustus, bare, left","P CARISI LEG: Victory standing right, placing ...",http://ww2.smb.museum/mk_edit/images/n8/8102/v...,http://ww2.smb.museum/mk_edit/images/n8/8102/r...
1,25 BC - 23 BC,Quinarius,Emerita,"AVGVST: Head of Augustus, bare, left","P CARISI LEG: Victory standing right, placing ...",http://www.kenom.de/iiif/image/record_DE-MUS-0...,http://www.kenom.de/iiif/image/record_DE-MUS-0...
2,25 BC - 23 BC,Denarius,Emerita,"IMP CAESAR AVGVST: Head of Augustus, bare, right","P CARISIVS LEG PRO PR: Round shield, spear-hea...",http://ww2.smb.museum/mk_edit/images/n7/7794/v...,http://ww2.smb.museum/mk_edit/images/n7/7794/r...
3,25 BC - 23 BC,Denarius,Emerita,"IMP CAESAR AVGVST: Head of Augustus, bare, left","P CARISIVS LEG PRO PR: Round shield, spear-hea...",http://ww2.smb.museum/mk_edit/images/n2/2615/v...,http://ww2.smb.museum/mk_edit/images/n2/2615/r...
4,25 BC - 23 BC,Denarius,Emerita,"IMP CAESAR AVGVSTVS: Head of Augustus, bare, r...","P CARISIVS LEG PRO PR: Round shield, spear-hea...",http://ww2.smb.museum/mk_edit/images/n7/7798/v...,http://ww2.smb.museum/mk_edit/images/n7/7798/r...


## Saving into *csv* file

Now we save the sample dataframe into a csv file:

In [6]:
df.to_csv("sample_df.csv")