# Challenges Scraping Non-Tabular Data

On <a href="https://sandeepmj.github.io/scrape-example-page">this demo page</a> I've reproduced several variations of issues we are likely to encounter when scraping.

- Review scrape of a well-organized page.
- Dynamically getting column names.
- Scraping a challenging page.
- Excluding multi-classes.


Let's start by scraping <a href="https://sandeepmj.github.io/scrape-example-page/#organized">the organized CEO data</a>.

In [8]:
## import libraries

import pandas as pd
from bs4 import BeautifulSoup
import requests

In [9]:
## target url
url = "https://sandeepmj.github.io/scrape-example-page/#organized"

In [10]:
## response

response = requests.get(url)

In [12]:
## turn into soup

soup = BeautifulSoup(response.text, "html.parser")

In [15]:
## scrape

organized = soup.find(id="organized")
organized

<section id="organized">
<h2>Organized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 3</p>
<p class="name">Name: Michael Rapino</p>
<p class="annual_compensation">Annual Compensation: $70.6 million</p>
<p class="company">Company: Live Nation Entertainment (LYV)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 4</p>
<p class="name">Name: Leslie Moonves</p>
<p class="annual_compensation">Annual Compensation: 68.4 million</p>
<p class="company">Company: CBS</p>
</div>
<div class="ceo">
<p class="rank">Rank: 5</p>
<p class="name">Name: Gregory Ma

In [16]:
type(organized)

bs4.element.Tag

In [31]:
## isolate ceos

ceos = organized.find_all('div', class_="ceo")
ceos

[<div class="ceo">
 <p class="rank">Rank: 1</p>
 <p class="name">Name: Hock E. Tan</p>
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>
 <p class="company">Company: Broadcom</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 2</p>
 <p class="name">Name: Frank Bisignano</p>
 <p class="annual_compensation">Annual Compensation: $102.2 million</p>
 <p class="company">Company: First Data (FDC)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 3</p>
 <p class="name">Name: Michael Rapino</p>
 <p class="annual_compensation">Annual Compensation: $70.6 million</p>
 <p class="company">Company: Live Nation Entertainment (LYV)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 4</p>
 <p class="name">Name: Leslie Moonves</p>
 <p class="annual_compensation">Annual Compensation: 68.4 million</p>
 <p class="company">Company: CBS</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 5</p>
 <p class="name">Name: Gregory Maffei</p>
 <p class="annual_compensation">Annua

In [32]:
## find all the names using a for loop (FL)
names_fl = []
for ceo in ceos:
    names_fl.append(ceo.find("p", class_="name").get_text().replace("Name: ", ""))
    
names_fl

['Hock E. Tan',
 'Frank Bisignano',
 'Michael Rapino',
 'Leslie Moonves',
 'Gregory Maffei']

In [34]:
## same result using a list comprehension (LC)
names_lc = [ceo.find("p", class_="name").get_text().replace("Name: ", "") for ceo in ceos]
names_lc

['Hock E. Tan',
 'Frank Bisignano',
 'Michael Rapino',
 'Leslie Moonves',
 'Gregory Maffei']
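
Each field carries a different label ("Name: ", "Company: " and so on), so one option is to split on the first colon rather than hard-code a ```.replace()``` per field. This is a minimal sketch, assuming a small inline HTML snippet shaped like the demo page; ```strip_label``` and ```sample_html``` are hypothetical names, not part of BeautifulSoup or the original notebook.

```python
from bs4 import BeautifulSoup

## inline HTML modeled on the organized section (an assumption, not the live page)
sample_html = """
<div class="ceo">
<p class="name">Name: Hock E. Tan</p>
<p class="company">Company: Broadcom</p>
</div>
"""

def strip_label(text):
    """Return whatever follows the first colon, without surrounding spaces."""
    label, sep, value = text.partition(":")
    ## if there is no colon at all, keep the original text
    return value.strip() if sep else text.strip()

sample_soup = BeautifulSoup(sample_html, "html.parser")
cleaned = [strip_label(p.get_text()) for p in sample_soup.find_all("p")]
print(cleaned)
```

Splitting on the first colon only would also leave intact any value that happens to contain a colon of its own.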

In [None]:
for ceo in ceos:
    print(ceo.get_text())

In [35]:
name = soup.find("p", class_="name")
name

<p class="name">Name: Hock E. Tan</p>

In [22]:
annual_compensation = soup.find_all("p", class_="annual_compensation")
annual_compensation

[<p class="annual_compensation">Annual Compensation: $103.2 million</p>,
 <p class="annual_compensation">Annual Compensation: $102.2 million</p>,
 <p class="annual_compensation">Annual Compensation: $70.6 million</p>,
 <p class="annual_compensation">Annual Compensation: 68.4 million</p>,
 <p class="annual_compensation">Annual Compensation: $67.2 million</p>,
 <p class="annual_compensation"><span>Annual compensation:</span> $103.2 million</p>,
 <p class="annual_compensation"><span>Annual Compensation:</span> $102.2 million</p>,
 <p class="annual_compensation"><span>Annual Compensation:</span> $70.6 million</p>,
 <p class="annual_compensation"><span>Annual Compensation:</span> 68.4 million</p>,
 <p class="annual_compensation"><span>Annual Compensation:</span> $67.2 million</p>,
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>,
 <p class="annual_compensation">Annual Compensation: $102.2 million</p>,
 <p class="annual_compensation">Annual Compensation: $70.6 million<

In [40]:
annual_compensation_fl = []
for ceo in ceos:
    ## each ceo is a <div>, so look for the <p> inside it
    annual_compensation_fl.append(ceo.find("p", class_="annual_compensation").get_text().replace("Annual Compensation: ", ""))
    
annual_compensation_fl

['$103.2 million',
 '$102.2 million',
 '$70.6 million',
 '68.4 million',
 '$67.2 million']

In [41]:
pd.DataFrame(ceo_list)

NameError: name 'ceo_list' is not defined

In [37]:
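## NOTE: the text on the page reads "Annual Compensation: ", so .replace("annual_compensation: ", "") finds nothing to remove and the label stays in the output
compensation_lc = [annual_comp.find("p", class_="annual_compensation").get_text().replace("annual_compensation: ", "") for annual_comp in ceos]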
compensation_lc

['Annual Compensation: $103.2 million',
 'Annual Compensation: $102.2 million',
 'Annual Compensation: $70.6 million',
 'Annual Compensation: 68.4 million',
 'Annual Compensation: $67.2 million']

In [50]:
company_names = soup.find_all("p", class_="company")
company_names

[<p class="company">Company: Broadcom</p>,
 <p class="company">Company: First Data (FDC)</p>,
 <p class="company">Company: Live Nation Entertainment (LYV)</p>,
 <p class="company">Company: CBS</p>,
 <p class="company">Company: Liberty Media &amp; Qurate Retail Group</p>,
 <p class="company"><span>Company:</span> Broadcom</p>,
 <p class="company"><span>Company:</span> First Data (FDC)</p>,
 <p class="company"><span>Company:</span> Live Nation Entertainment (LYV)</p>,
 <p class="company"><span>Company:</span> CBS</p>,
 <p class="company"><span>Company:</span> Liberty Media &amp; Qurate Retail Group</p>,
 <p class="company">Company: Broadcom</p>,
 <p class="company">Company: First Data (FDC)</p>,
 <p class="company">Company: Live Nation Entertainment (LYV)</p>,
 <p class="company">Company: CBS</p>,
 <p class="company">Company: Qurate Retail Group</p>]

In [46]:
ceo_list = []
for item in zip(names_fl, company_names, compensation_lc):
    ceo_list.append(item)
    
ceo_list

NameError: name 'company_names' is not defined

In [47]:
list(("cat", "dog", "mouse"))

['cat', 'dog', 'mouse']

In [48]:
type(("cat", "dog", "mouse"))

tuple

## Built-in Functions

- Use ```list``` and ```zip``` instead of writing a for loop.

In [51]:
zip(names_fl, company_names, compensation_lc)


<zip at 0x12d40f740>

In [103]:
mydata = list(zip(names_fl, company_names, compensation_lc))
mydata

[('Hock E. Tan',
  <p class="company">Company: Broadcom</p>,
  'Annual Compensation: $103.2 million'),
 ('Frank Bisignano',
  <p class="company">Company: First Data (FDC)</p>,
  'Annual Compensation: $102.2 million'),
 ('Michael Rapino',
  <p class="company">Company: Live Nation Entertainment (LYV)</p>,
  'Annual Compensation: $70.6 million'),
 ('Leslie Moonves',
  <p class="company">Company: CBS</p>,
  'Annual Compensation: 68.4 million'),
 ('Gregory Maffei',
  <p class="company">Company: Liberty Media &amp; Qurate Retail Group</p>,
  'Annual Compensation: $67.2 million')]

### If the lists are different lengths, ```zip``` stops at the shortest one and silently ignores the excess data, so your data will be wrong
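
A small sketch of that truncation, using made-up short lists rather than the scraped ones. It also shows two possible guards: ```itertools.zip_longest``` from the standard library, and a plain length comparison before zipping.

```python
from itertools import zip_longest

names = ["Hock E. Tan", "Frank Bisignano", "Michael Rapino"]
companies = ["Broadcom", "First Data (FDC)"]  ## one value missing on purpose

## zip stops at the shortest list, so the third name disappears
truncated = list(zip(names, companies))
print(truncated)

## zip_longest keeps every row and fills the gap with a placeholder
padded = list(zip_longest(names, companies, fillvalue=None))
print(padded)

## a cheap guard: compare lengths before zipping
lengths_match = len(names) == len(companies)
print(lengths_match)
```

Checking the lengths first is usually enough to catch a selector that matched more (or fewer) tags than expected.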

In [54]:
pd.DataFrame(mydata, columns = ["name", "company", "annual_comp"])

In [56]:
ceos[0]

<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>

### HTML Attributes

In ```<div class="some_class">```:

* ```div``` is a tag.
* ```class``` is an attribute.
* ```some_class``` is the value that attribute holds.

In [58]:
ceos[0]["class"]
## [0] returns the first item in the list; ["class"] then returns that tag's classes as a list

['ceo']

In [59]:
ceos[0]

<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>

In [64]:
my_ptags = ceos[0].find_all("p")
my_ptags

[<p class="rank">Rank: 1</p>,
 <p class="name">Name: Hock E. Tan</p>,
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>,
 <p class="company">Company: Broadcom</p>]

In [67]:
my_ptags[0]["class"]

['rank']

In [81]:
col_names = []
for my_ptag in my_ptags:
    #print(my_ptag)
    print(my_ptag["class"][0])
    col_names.append(my_ptag["class"][0])
    
col_names

rank
name
annual_compensation
company


['rank', 'name', 'annual_compensation', 'company']
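
The same idea can label every value as it is scraped: use each ```p``` tag's first class as a dictionary key, so the column names never have to be typed by hand. A rough sketch on an inline snippet modeled on the organized section (```sample_html``` and ```rows``` are illustrative names, and the snippet is an assumption, not the live page):

```python
from bs4 import BeautifulSoup

## inline HTML modeled on the organized section (an assumption, not the live page)
sample_html = """
<section id="organized">
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
</div>
</section>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for ceo in sample_soup.find_all("div", class_="ceo"):
    row = {}
    for p in ceo.find_all("p"):
        key = p["class"][0]  ## first class on the tag becomes the column name
        value = p.get_text().split(":", 1)[1].strip()  ## drop the "Rank: " style label
        row[key] = value
    rows.append(row)

print(rows)
```

A list of dictionaries like ```rows``` can be handed directly to ```pd.DataFrame```, which takes its column names from the keys.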

In [80]:
pd.DataFrame(ceos_list, columns = col_names)

NameError: name 'ceos_list' is not defined

In [82]:
ceos[0]
## Parents
## siblings
## children

<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
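
The parent, sibling and child relationships noted in the comments can be walked with BeautifulSoup's navigation attributes and methods. A short sketch on an inline snippet (an assumption standing in for one ```div``` from the page):

```python
from bs4 import BeautifulSoup

## inline HTML standing in for one CEO block (an assumption, not the live page)
sample_html = """
<section id="organized">
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="company">Company: Broadcom</p>
</div>
</section>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
name_tag = sample_soup.find("p", class_="name")

## parent: the tag that directly contains this one
parent_classes = name_tag.parent["class"]

## siblings: tags at the same level, before and after this one
previous_text = name_tag.find_previous_sibling("p").get_text()
next_text = name_tag.find_next_sibling("p").get_text()

## children: the <p> tags sitting directly inside the parent
child_classes = [child["class"][0] for child in name_tag.parent.find_all("p", recursive=False)]

print(parent_classes)
print(previous_text)
print(next_text)
print(child_classes)
```

Passing a tag name to the sibling methods skips over the whitespace text nodes that sit between tags.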

In [83]:
pd.DataFrame(mydata, columns = col_names[1:4])

NameError: name 'mydata' is not defined

In [85]:
target = soup.find(id = "disorganized")
target

<section id="disorganized">
<h2>Disorganized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 2</dt>
<span>Name:</span><dt> Frank Bisignano</dt>
<span>Annual Compensation:</span><dt> $102.2 million</dt>
<span>Company:</span><dt> First Data (FDC)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 3</dt>
<span>Name:</span><dt> Michael Rapino</dt>
<span>Annual Compensation:</span><dt> $70.6 million</dt>
<span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 4</dt>
<span>Name:</span><dt> Leslie Moonves</dt>
<span>Annual Compensation:</span><dt> 68.4 million</dt>
<span>Company:</span><dt> CBS</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 5</dt>
<span>Name:</span> <dt> Gregory Maffei</dt>
<span>Annual Com

In [86]:
ceos = target.find_all("div", class_="ceo")
ceos

[<div class="ceo">
 <span>Rank:</span><dt> 1</dt>
 <span>Name:</span><dt> Hock E. Tan</dt>
 <span>Annual compensation:</span><dt> $103.2 million</dt>
 <span>Company:</span><dt> Broadcom</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 2</dt>
 <span>Name:</span><dt> Frank Bisignano</dt>
 <span>Annual Compensation:</span><dt> $102.2 million</dt>
 <span>Company:</span><dt> First Data (FDC)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 3</dt>
 <span>Name:</span><dt> Michael Rapino</dt>
 <span>Annual Compensation:</span><dt> $70.6 million</dt>
 <span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 4</dt>
 <span>Name:</span><dt> Leslie Moonves</dt>
 <span>Annual Compensation:</span><dt> 68.4 million</dt>
 <span>Company:</span><dt> CBS</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 5</dt>
 <span>Name:</span> <dt> Gregory Maffei</dt>
 <span>Annual Compensation:</span><dt> $67.2 million</dt>
 <span>Com

In [101]:
for ceo in ceos:
    #print(ceo)
    #print("***********")
    #ceo.find_all("dt")
    
    print(ceo.find_all("dt")[1].get_text())
    print("***********")
    
    
### find_all returns all of its results in a ResultSet
### a BeautifulSoup ResultSet behaves like a list, so you can index and slice it the same way

 Hock E. Tan
***********
 Frank Bisignano
***********
 Michael Rapino
***********
 Leslie Moonves
***********
 Gregory Maffei
***********


In [105]:
for ceo in ceos:
    #print(ceo)
    
    print(ceo.find_all("dt")[0:4])
    print("***********")
    

[<dt> 1</dt>, <dt> Hock E. Tan</dt>, <dt> $103.2 million</dt>, <dt> Broadcom</dt>]
***********
[<dt> 2</dt>, <dt> Frank Bisignano</dt>, <dt> $102.2 million</dt>, <dt> First Data (FDC)</dt>]
***********
[<dt> 3</dt>, <dt> Michael Rapino</dt>, <dt> $70.6 million</dt>, <dt> Live Nation Entertainment (LYV)</dt>]
***********
[<dt> 4</dt>, <dt> Leslie Moonves</dt>, <dt> 68.4 million</dt>, <dt> CBS</dt>]
***********
[<dt> 5</dt>, <dt> Gregory Maffei</dt>, <dt> $67.2 million</dt>, <dt> Liberty Media &amp; Qurate Retail Group</dt>]
***********


In [108]:
ceo_values = []
for ceo in ceos:
    #print(ceo)
    
    ## read the list comprehension starting from its second half, "for dt in ..."
    data_ceo = [dt for dt in ceo.find_all("dt")[0:4]]
    print(f"Data_CEO: {data_ceo}")
    for dt in data_ceo:
        print(dt.get_text())
        ceo_values.append(dt.get_text(strip = True))

Data_CEO: [<dt> 1</dt>, <dt> Hock E. Tan</dt>, <dt> $103.2 million</dt>, <dt> Broadcom</dt>]
 1
 Hock E. Tan
 $103.2 million
 Broadcom
Data_CEO: [<dt> 2</dt>, <dt> Frank Bisignano</dt>, <dt> $102.2 million</dt>, <dt> First Data (FDC)</dt>]
 2
 Frank Bisignano
 $102.2 million
 First Data (FDC)
Data_CEO: [<dt> 3</dt>, <dt> Michael Rapino</dt>, <dt> $70.6 million</dt>, <dt> Live Nation Entertainment (LYV)</dt>]
 3
 Michael Rapino
 $70.6 million
 Live Nation Entertainment (LYV)
Data_CEO: [<dt> 4</dt>, <dt> Leslie Moonves</dt>, <dt> 68.4 million</dt>, <dt> CBS</dt>]
 4
 Leslie Moonves
 68.4 million
 CBS
Data_CEO: [<dt> 5</dt>, <dt> Gregory Maffei</dt>, <dt> $67.2 million</dt>, <dt> Liberty Media &amp; Qurate Retail Group</dt>]
 5
 Gregory Maffei
 $67.2 million
 Liberty Media & Qurate Retail Group
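
In this section the labels live in ```span``` tags rather than in class names, so one way to get column names dynamically is to pair each ```span``` with the ```dt``` that follows it. A hedged sketch on an inline copy of the first block (the snippet and the names ```labels```, ```values``` and ```record``` are illustrative):

```python
from bs4 import BeautifulSoup

## inline copy of one disorganized block (an assumption, not the live page)
sample_html = """
<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
ceo = sample_soup.find("div", class_="ceo")

## labels: span text without the trailing colon, lowercased for consistency
labels = [span.get_text(strip=True).rstrip(":").lower() for span in ceo.find_all("span")]
## values: dt text without the leading space
values = [dt.get_text(strip=True) for dt in ceo.find_all("dt")]

## guard against a block where a label or a value is missing
if len(labels) != len(values):
    raise ValueError("labels and values do not line up")

record = dict(zip(labels, values))
print(record)
```

Lowercasing the labels matters on this page because the first block says "Annual compensation" while the others say "Annual Compensation".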


In [121]:
ceo_data_list = []
for ceo in ceos:
    all_targets = ceo.find_all("dt")
    rank = all_targets[0].get_text(strip = True)
    name = all_targets[1].get_text(strip = True)
    annual_comp = all_targets[2].get_text(strip = True)
    company = all_targets[3].get_text(strip = True)
    ceo_data_list.append({"rank": rank,
                "name": name,
                "annual compensation": annual_comp,
                 "company": company})
    
ceo_data_list

[{'rank': '1',
  'name': 'Hock E. Tan',
  'annual compensation': '$103.2 million',
  'company': 'Broadcom'},
 {'rank': '2',
  'name': 'Frank Bisignano',
  'annual compensation': '$102.2 million',
  'company': 'First Data (FDC)'},
 {'rank': '3',
  'name': 'Michael Rapino',
  'annual compensation': '$70.6 million',
  'company': 'Live Nation Entertainment (LYV)'},
 {'rank': '4',
  'name': 'Leslie Moonves',
  'annual compensation': '68.4 million',
  'company': 'CBS'},
 {'rank': '5',
  'name': 'Gregory Maffei',
  'annual compensation': '$67.2 million',
  'company': 'Liberty Media & Qurate Retail Group'}]

In [122]:
df = pd.DataFrame(ceo_data_list)
df

Unnamed: 0,rank,name,annual compensation,company
0,1,Hock E. Tan,$103.2 million,Broadcom
1,2,Frank Bisignano,$102.2 million,First Data (FDC)
2,3,Michael Rapino,$70.6 million,Live Nation Entertainment (LYV)
3,4,Leslie Moonves,68.4 million,CBS
4,5,Gregory Maffei,$67.2 million,Liberty Media & Qurate Retail Group
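
The compensation values are still strings, and one of them ("68.4 million") is missing its dollar sign, so sorting or summing them will not work yet. A possible cleanup step, sketched with the standard ```re``` module; ```parse_millions``` is a hypothetical helper, not something from the notebook:

```python
import re

def parse_millions(text):
    """Turn a string like '$103.2 million' into the float 103.2, or None if it does not match."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*million", text)
    return float(match.group(1)) if match else None

raw_values = ["$103.2 million", "68.4 million", "n/a"]
parsed = [parse_millions(value) for value in raw_values]
print(parsed)
```

Returning ```None``` for anything that does not match keeps bad rows visible instead of letting them raise partway through a scrape.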


### The same steps each time:

* Is the content on the page (use ```Reveal Source```)?
* Where and how is the content held on the page?
* Which classes and IDs do we target?
* Is there a pattern?
* Is there anything that breaks the pattern?
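
The first question on the list can also be checked in code by searching the raw HTML for text you can see in the browser. A minimal sketch, using a short string as a stand-in for ```response.text``` (an assumption; no request is made here):

```python
## stand-in for response.text (an assumption; no request is made here)
page_source = '<section id="organized"><p class="name">Name: Hock E. Tan</p></section>'

expected_snippets = ["Hock E. Tan", 'id="organized"', 'id="missing-section"']

## True means the text is in the raw HTML;
## False suggests it is absent or loaded some other way
found = {snippet: snippet in page_source for snippet in expected_snippets}
print(found)
```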

# Excluding classes

Most modern sites have tags that carry multiple classes.

What if you want to target tags that have a single class, but that class also appears on other tags alongside additional classes, and those tags hold other types of content?

For example, capture the ```Excluding Some Classes``` section of our page in a ```BeautifulSoup``` object.



In [42]:
## RUN this cell that holds some html
some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''
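
One way to keep only the tags whose class is exactly ```z``` is to let ```find_all``` collect every tag that includes ```z```, then filter on the full class list. This is a sketch of one approach under that assumption, not the only way to do it; ```sample_soup```, ```any_z``` and ```only_z``` are illustrative names.

```python
from bs4 import BeautifulSoup

some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''

sample_soup = BeautifulSoup(some_html, "html.parser")

## class_="z" matches every tag whose class list contains "z"
any_z = sample_soup.find_all("li", class_="z")

## keep only the tags whose class list is exactly ["z"]
only_z = [li for li in any_z if li.get("class") == ["z"]]

print(len(any_z))
print([li.get_text(strip=True) for li in only_z])
```

BeautifulSoup stores ```class``` as a list of values, which is why comparing against ```["z"]``` excludes tags such as ```class="a z"```.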



### Back to our CEOs