Web-Scraping

Getting started with web scraping
Navigable string objects
Parsing tags in the HTML tree

Use cases of web scraping:

e-commerce store automation
Emergency resource allocation

BeautifulSoup has four main object types:

BeautifulSoup object
Tag object
Navigable string object
Comment object

1. Web scraping with beautiful soup: Book 1

Install the Beautiful Soup Python package from the command line: pip install beautifulsoup4
Import the library: from bs4 import BeautifulSoup
Create an HTML document
Call the constructor with an HTML parser: BeautifulSoup(html_doc, 'html.parser')

The constructor returns a BeautifulSoup object, which is the markup transformed into a parse tree. The parse tree is a set of linked objects representing the structure of the document.

When I print the object directly, I find it difficult to read because the output has no indentation. Therefore, I call soup.prettify() to get a structured, readable version of the document.
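The steps above can be sketched as follows. The HTML string here is a small hypothetical stand-in, not the document used in the notebook, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in document; the notebook's own HTML is not reproduced here.
html_doc = """
<html><head><title>Sample page</title></head>
<body><p class="intro"><b>Hello</b> world</p></body></html>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, "html.parser")
print(type(soup).__name__)  # class name of the object the constructor returns

# prettify() returns a string with one tag or text node per line,
# indented according to its depth in the tree.
pretty = soup.prettify()
print(pretty)
```

Using 'html.parser' avoids any extra dependency; other parsers such as lxml can be swapped in if they are installed.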

2. Tag objects: Book 2

A tag object consists of a name and attributes. Attributes are used to reference, search, and navigate the tagged data in BeautifulSoup.

Working with names:

Obtain a tag from the HTML body (for example, soup.b)
Read its name with tag.name, or rename it by assigning a new value to tag.name
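A minimal sketch of reading and changing a tag name, assuming a one-line hypothetical document:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document for illustration.
soup = BeautifulSoup("<body><b>Bold text</b></body>", "html.parser")

tag = soup.b               # first <b> tag in the tree
original_name = tag.name   # 'b'

# Assigning to .name renames the tag in the tree itself.
tag.name = "strong"
renamed = str(tag)         # '<strong>Bold text</strong>'
print(original_name, renamed)
```

The rename is reflected anywhere the tree is serialized afterwards, because the tag object is the tree node, not a copy of it.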

Working with attributes:

Treat a tag's attributes as a dictionary (tag['id'], tag.attrs)
Add, modify, and delete attributes using dictionary syntax
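The dictionary-style access described above might look like this. The link and its attribute values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical link; the URL and id are placeholders.
soup = BeautifulSoup(
    '<a href="http://example.com" id="link 1">a link</a>', "html.parser"
)
tag = soup.a

href = tag["href"]          # read an attribute like a dictionary key
tag["class"] = "preview"    # add a new attribute
tag["id"] = "link 2"        # modify an existing attribute
del tag["href"]             # delete an attribute

# .get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get("href")
attrs = dict(tag.attrs)     # snapshot of all remaining attributes
print(attrs)
```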

Navigating tree with tags:

soup.title
soup.head
soup.body.b
Get the first unordered list: soup.ul
Get the first web link: soup.a
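As a sketch of this attribute-style navigation, assuming a small hypothetical page: each access such as soup.a returns only the first matching tag, and a tag that does not exist yields None.

```python
from bs4 import BeautifulSoup

# Hypothetical page; tag contents and URLs are placeholders.
html_doc = """
<html><head><title>Sample page</title></head>
<body>
<p><b>First bold</b> and <b>second bold</b></p>
<ul><li>one</li><li>two</li></ul>
<a href="http://example.com/a">first link</a>
<a href="http://example.com/b">second link</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string      # 'Sample page'
first_bold = soup.body.b.string     # 'First bold' (first <b> only)
first_link = soup.a["href"]         # 'http://example.com/a'
items = [li.string for li in soup.ul.find_all("li")]  # ['one', 'two']
no_table = soup.table               # None, since the page has no <table>
```

To reach every match rather than the first, use find_all instead of attribute access.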

3. Navigable string objects: Book 3

Obtain the string inside a tag:
tag = soup.b
nav_string = tag.string
Obtain all strings in the object, stripped of surrounding whitespace: for string in soup.stripped_strings: print(repr(string))
Obtain the parent tag within the parse tree: nav_string.parent
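The operations above, put together in a minimal sketch with a hypothetical fragment:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment with extra whitespace around the trailing text.
soup = BeautifulSoup("<p><b>Bold text</b>\n  and a tail  </p>", "html.parser")

tag = soup.b
nav_string = tag.string                      # the text inside <b>
is_nav = isinstance(nav_string, NavigableString)

# A NavigableString remembers where it sits in the tree.
parent_name = nav_string.parent.name         # 'b'

# stripped_strings yields every string with surrounding whitespace
# removed and skips whitespace-only strings.
strings = list(soup.stripped_strings)        # ['Bold text', 'and a tail']

# Convert to a plain str when the tree reference is no longer needed.
plain = str(nav_string)
```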

4. Retrieving tags: Book 4

Filtering with name argument
- soup.find_all("li")
- <li>Provides a background in data science fundamentals ... data for analysis</li>
Filtering with keyword argument
- soup.find_all(id = "link 3")
- [<a class="preview" href="http://bit.ly/Data-Science-For-Dummies" id="link 3">buy the book!</a>]
Filtering with string argument
- soup.find_all('ul')
- [<ul>\n<li>Provides a background in data science fundamentals ... data for analysis</li> ... </ul>]
Filtering with list object
- soup.find_all(['ul','b'])
- [<b>DATA SCIENCE FOR DUMMIES</b>, <ul>\n<li>Provides a background in data science ... data for analysis</li>\n</ul>]
Filtering with regular expression argument (tag names are tested with the regex search method, so any name containing "l" matches)
- l = re.compile('l')
- for tag in soup.find_all(l): print(tag.name)
html
title
ul
li
Retrieving web links by filtering with a string argument
- for link in soup.find_all('a'): print(link.get('href'))
http://www.data-mania.com/blog/books-by-lillian-pierson/
http://www.data-mania.com/blog/data-science-for-dummies-answers-what-is-data-science/
http://bit.ly/Data-Science-For-Dummies
Retrieving strings by filtering with a regular expression
- soup.find_all(string = re.compile("data"))
[u'Jobs in data science abound, ...les in ...]
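The filters listed above can be tried side by side on a small page. The sketch below uses a hypothetical document (not the one from the notebook), so the tag contents, ids, and URLs are placeholders.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page; all contents are placeholders.
html_doc = """
<html><head><title>Sample page</title></head>
<body>
<b>HEADLINE</b>
<ul><li>uses data daily</li><li>second item</li></ul>
<a href="http://example.com/a" id="link-1">more data here</a>
<a href="http://example.com/b" id="link-2">buy now</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Name (string) argument: every tag with exactly this name.
by_name = soup.find_all("li")

# Keyword argument: filter on an attribute value.
by_keyword = soup.find_all(id="link-2")

# List: tags whose name matches any entry, in document order.
by_list = [t.name for t in soup.find_all(["ul", "b"])]

# Compiled regex: tag names are tested with search(), so any
# name containing the letter "l" matches.
by_regex = [t.name for t in soup.find_all(re.compile("l"))]

# Collect the href of every link.
hrefs = [a.get("href") for a in soup.find_all("a")]

# string= with a regex returns the matching text nodes, not tags.
by_string = soup.find_all(string=re.compile("data"))

print(by_list)   # ['b', 'ul']
print(by_regex)  # ['html', 'title', 'ul', 'li', 'li']
```

find_all always returns a list-like result, even for a single match, so index into it or use find when only the first match is wanted.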

Adhira-Deogade/Web-Scraping

Folders and files

Latest commit

History

Repository files navigation

Web-Scraping

Use cases of web scraping:

1. Web scraping with beautiful soup: Book 1

2. Tag objects: Book 2

Working with names:

Working with attributes:

Navigating tree with tags:

2. Navigable string objects: Book 3

3. Retrieving tags: Book 4

About

Resources

Stars

Watchers

Forks

Languages