# Web Scraping and Parsing | Retrieving Tags with Beautiful Soup in Python - Tutorial 35 in Anaconda

## Data Parsing

In [1]:
import pandas

from bs4 import BeautifulSoup
import re

In [5]:
filename = 'data/DSFD_Listing.html'

# Read the saved HTML page into a single string
with open(filename, 'r') as f:
    html_doc = f.read()

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(html_doc, 'html.parser')

In [6]:
type(soup)

bs4.BeautifulSoup
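
The file `data/DSFD_Listing.html` is not reproduced in this notebook, so here is a minimal sketch of the same parsing step on an inline string. The `sample_html` markup is a shortened, hypothetical stand-in pieced together from the outputs shown in this notebook; it is not the actual file, and the names `sample_html` and `sample_soup` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for data/DSFD_Listing.html
sample_html = """<!DOCTYPE html>
<html>
<head><title>Best Books</title></head>
<body>
<p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
<ul>
<li>Details different data visualization techniques</li>
<li>Includes coverage of big data processing tools</li>
</ul>
<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>
</body>
</html>"""

# Same call as above, but fed from a string instead of a file
sample_soup = BeautifulSoup(sample_html, 'html.parser')

print(type(sample_soup).__name__)        # BeautifulSoup
print(sample_soup.title.string)          # Best Books
print(len(sample_soup.find_all('li')))   # 2
```

Parsing from a string like this is a convenient way to try the filters in the rest of the notebook without the data file.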

## Parsing your data

In [8]:
print(soup.prettify()[:100])

<!DOCTYPE html>
<html>
 <head>
  <title>
   Best Books
  </title>
 </head>
 <body>
  <p class="title


## Getting data from a parse tree

In [10]:
text_only = soup.get_text()
print(text_only)




Best Books


DATA SCIENCE FOR DUMMIES
Jobs in data science abound, but few people have the data science skills needed to fill these
    increasingly important roles in organizations. Data Science For Dummies is the pe
    
    Edition 1 of this book:
    

Provides a background in data science fundamentals before moving on to working with relational databases and
        unstructured data and preparing your data for analysis
Details different data visualization techniques that can be used to showcase and summarize your data
Explains both supervised and unsupervised machine learning, including regression, model validation, and
        clustering techniques
Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark


    What to do next:
    
See a preview of the
      book,
    get the free pdf download, and then
    buy the book!

...
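
The raw `get_text()` output above keeps every newline and indent from the source markup. As a rough sketch, the `separator` and `strip` arguments of `get_text()` (or the `stripped_strings` generator) can tidy this up. The `demo_html` fragment below is a hypothetical, shortened stand-in, not the tutorial's data file.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same kind of stray whitespace
demo_html = (
    "<p class='title'><b>DATA SCIENCE FOR DUMMIES</b></p>\n"
    "<ul>\n"
    "  <li>First point</li>\n"
    "  <li>Second point</li>\n"
    "</ul>"
)
demo_soup = BeautifulSoup(demo_html, 'html.parser')

# strip=True trims each text node and drops the whitespace-only ones;
# separator controls what is placed between the remaining pieces
clean_text = demo_soup.get_text(separator=' ', strip=True)
print(clean_text)   # DATA SCIENCE FOR DUMMIES First point Second point

# stripped_strings yields the same pieces one at a time
pieces = list(demo_soup.stripped_strings)
print(pieces)
```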




## Searching and retrieving data from a parse tree

### Retrieving tags by filtering with name arguments

In [11]:
soup.find_all('li')

[<li>Provides a background in data science fundamentals before moving on to working with relational databases and
         unstructured data and preparing your data for analysis</li>,
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>,
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and
         clustering techniques</li>,
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>]

### Retrieving tags by filtering with keyword arguments

In [12]:
soup.find_all(id='link 3')

[<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>]
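
Any attribute can be used as a keyword filter in the same way as `id`. The sketch below shows a few variants: `class_` (with a trailing underscore, because `class` is a reserved word in Python), an `attrs` dictionary, and a tag name combined with a keyword. The second link in `kw_html` is a hypothetical addition so that the filters have something to exclude.

```python
from bs4 import BeautifulSoup

# First link copied from the output above; the second is hypothetical
kw_html = (
    '<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" '
    'id="link 3">buy the book!</a>'
    '<a class="other" href="#" id="link 1">hypothetical link</a>'
)
kw_soup = BeautifulSoup(kw_html, 'html.parser')

by_id = kw_soup.find_all(id='link 3')                     # keyword argument
by_class = kw_soup.find_all(class_='preview')             # note the underscore
by_attrs = kw_soup.find_all(attrs={'id': 'link 3'})       # dictionary form
by_name_and_id = kw_soup.find_all('a', id='link 1')       # name + keyword

print(by_id[0].get_text())            # buy the book!
print(by_name_and_id[0].get_text())   # hypothetical link
```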

### Retrieving tags by filtering with string arguments

In [13]:
soup.find_all('ul')

[<ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and
         unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and
         clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

### Retrieving tags by filtering with list objects

In [14]:
soup.find_all(['ul', 'b'])

[<b>DATA SCIENCE FOR DUMMIES</b>, <ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and
         unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and
         clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>]

### Retrieving strings by filtering with regular expressions

In [27]:
rex = re.compile(r'I')

soup.find_all(string=rex)

['DATA SCIENCE FOR DUMMIES',
 'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark']
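
Note that `string=rex` returns the matching text nodes, not tags. A compiled pattern can also be passed as the first argument, in which case it is matched against tag names instead. The sketch below contrasts the two on a hypothetical fragment (`rx_html`), not the tutorial's data file.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
rx_html = ('<p><b>DATA SCIENCE FOR DUMMIES</b><br/>'
           'Includes coverage of big data tools</p>')
rx_soup = BeautifulSoup(rx_html, 'html.parser')

# Pattern as the name filter: matches tags whose name starts with "b"
tag_names = [tag.name for tag in rx_soup.find_all(re.compile(r'^b'))]
print(tag_names)   # ['b', 'br']

# Pattern as the string filter: matches text containing an uppercase "I"
texts = rx_soup.find_all(string=re.compile(r'I'))
print(texts)
```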

### Retrieving tags by filtering with a Boolean value

In [29]:
soup.find_all(True)

[<html>
 <head>
 <title>Best Books</title>
 </head>
 <body>
 <p class="title"><b>DATA SCIENCE FOR DUMMIES</b></p>
 <p class="description">Jobs in data science abound, but few people have the data science skills needed to fill these
     increasingly important roles in organizations. Data Science For Dummies is the pe
     <br/><br/>
     Edition 1 of this book:
     <br/>
 <ul>
 <li>Provides a background in data science fundamentals before moving on to working with relational databases and
         unstructured data and preparing your data for analysis</li>
 <li>Details different data visualization techniques that can be used to showcase and summarize your data</li>
 <li>Explains both supervised and unsupervised machine learning, including regression, model validation, and
         clustering techniques</li>
 <li>Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark</li>
 </ul>
 <br/><br/>
     What to do next:
     <br/>
 <a class="preview" href="http

### Retrieving web links by filtering with string objects

In [31]:
for a in soup.find_all('a'):
    print(a.get('href'))

http://www.data-mania.com/blog/books-by-lillian-pierson/
http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/
http://bit.ly/Data-Science-For-Dummies
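
`a.get('href')` returns `None` for an anchor that has no `href` attribute. One way to skip such anchors, sketched below, is to pass `href=True` so that only tags carrying the attribute are returned. The middle anchor in `link_html` is a hypothetical addition to show the difference; the two URLs are taken from the output above.

```python
from bs4 import BeautifulSoup

# Two URLs from the output above plus a hypothetical anchor without href
link_html = (
    '<a href="http://www.data-mania.com/blog/books-by-lillian-pierson/">preview</a>'
    '<a name="top">no link here</a>'
    '<a href="http://bit.ly/Data-Science-For-Dummies">buy the book!</a>'
)
link_soup = BeautifulSoup(link_html, 'html.parser')

# Every <a> tag, including the one without an href
all_hrefs = [a.get('href') for a in link_soup.find_all('a')]

# Only <a> tags that actually have an href attribute
real_links = [a['href'] for a in link_soup.find_all('a', href=True)]

print(all_hrefs)
print(real_links)
```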


### Retrieving strings that contain a substring with regular expressions

In [33]:
rex = re.compile(r'data')

soup.find_all(string=rex)

['Jobs in data science abound, but few people have the data science skills needed to fill these\n    increasingly important roles in organizations. Data Science For Dummies is the pe\n    ',
 'Provides a background in data science fundamentals before moving on to working with relational databases and\n        unstructured data and preparing your data for analysis',
 'Details different data visualization techniques that can be used to showcase and summarize your data',
 'Includes coverage of big data processing tools like MapReduce, Hadoop, Storm, and Spark']
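
The pattern `r'data'` is case-sensitive, which is why the title string `DATA SCIENCE FOR DUMMIES` is missing from the result. As a hedged sketch, compiling with `re.IGNORECASE` widens the match, and each returned string's `.parent` gives the tag that encloses it. The `ci_html` fragment is a hypothetical stand-in for the data file.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration
ci_html = (
    '<p class="description">Data Science For Dummies is a book</p>'
    '<ul><li>Includes coverage of big data processing tools</li></ul>'
)
ci_soup = BeautifulSoup(ci_html, 'html.parser')

# Case-sensitive: only the lowercase "data" in the list item matches
strict = ci_soup.find_all(string=re.compile(r'data'))

# Case-insensitive: "Data" in the paragraph matches as well
loose = ci_soup.find_all(string=re.compile(r'data', re.IGNORECASE))

# Each result is a text node; .parent is the tag that contains it
parent_names = [s.parent.name for s in loose]

print(len(strict), len(loose))   # 1 2
print(parent_names)              # ['p', 'li']
```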