## Web Scraping
Web scraping is used to collect large amounts of information from websites.
<img src="https://d1jnx9ba8s6j9r.cloudfront.net/blog/wp-content/uploads/2018/11/Untitled-1-768x183.jpg">

Python's standard library includes the urllib package, which contains tools for working with URLs. Its urllib.request module provides a function called urlopen() that opens a URL from within a program.

In [1]:
from urllib.request import urlopen

In [2]:
url = "http://olympus.realpython.org/profiles/aphrodite"

In [3]:
# To open the web page, pass url to urlopen():
page = urlopen(url)
# urlopen() returns an HTTPResponse object
page

<http.client.HTTPResponse at 0x7f2e978998d0>

************************************************************************************************

To extract the HTML from the page, first call the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then call .decode() to decode the bytes into a string using UTF-8.

In [4]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")
# print html to see the contents
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
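The read-then-decode steps can be wrapped in a small helper. The sketch below is one possible arrangement, not part of the original notebook: the function name fetch_html and the timeout value are assumptions made for illustration. It reads the charset from the response headers instead of hard-coding UTF-8, and falls back to UTF-8 when the server does not declare one.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    # The with-statement closes the response even if reading fails.
    with urlopen(url, timeout=timeout) as response:
        raw_bytes = response.read()
        # Use the charset declared in the Content-Type header, if any;
        # otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    # errors="replace" keeps going when a byte cannot be decoded.
    return raw_bytes.decode(charset, errors="replace")


# Usage (requires network access):
# html = fetch_html("http://olympus.realpython.org/profiles/aphrodite")
```

Using errors="replace" is a design choice: it trades strict failure for a string that may contain replacement characters, which is usually acceptable for exploratory scraping.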



### Extract Text From HTML With String Methods
One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.

In [5]:
title_index = html.find("<title>")
title_index

14

In [6]:
start_index = title_index + len("<title>")
end_index = html.find("</title>")
title = html[start_index:end_index]
title

'Profile: Aphrodite'
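One caveat with this approach is that .find() returns -1 when the text is not present, and slicing with -1 silently produces the wrong result. A minimal sketch of a more defensive version follows; the helper name extract_between and the shortened sample_html string are assumptions made for illustration.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None  # the opening marker is missing
    start += len(start_marker)
    # Search for the closing marker only after the opening one.
    end = text.find(end_marker, start)
    if end == -1:
        return None  # the closing marker is missing
    return text[start:end]


# Shortened copy of the page shown above, so the example runs offline.
sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"

title = extract_between(sample_html, "<title>", "</title>")
missing = extract_between(sample_html, "<h1>", "</h1>")
print(title)
print(missing)
```

This still relies on the tags appearing exactly as written, for example <title> without extra spaces or attributes, which is the main reason to move on to a real HTML parser.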

## Advanced Tools and Libraries for Web Scraping
- BeautifulSoup
- Scrapy
- Selenium

Not every website permits scraping. To check whether a site allows it, look at the site’s “robots.txt” file. For example, the Flipkart website’s “robots.txt” file is at www.flipkart.com/robots.txt.
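The standard library can also read these rules programmatically through urllib.robotparser. The sketch below parses a small, made-up set of rules so that it runs offline; the rules and the example.com URLs are assumptions for illustration and do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch("*", "http://example.com/profiles/aphrodite")
blocked = parser.can_fetch("*", "http://example.com/private/data")
print(allowed, blocked)
```

For a live site, call parser.set_url() with the address of the robots.txt file and then parser.read() instead of parse(); that variant needs network access.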

The three main components of web scraping are:
- Crawl: navigate to the target website by making an HTTP request and download the response
- Parse and Transform: parse the response with an HTML parser and extract the required data
- Store: save the required data as JSON or CSV
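The Store step is not demonstrated in this notebook, so here is a minimal sketch of it using only the standard library. The field names and the helper names to_json and to_csv are assumptions for illustration; the values come from the Aphrodite profile shown earlier. The sketch assumes a non-empty list of records that all share the same keys.

```python
import csv
import io
import json

# One scraped record; the key names are hypothetical.
records = [
    {
        "name": "Aphrodite",
        "favorite_animal": "Dove",
        "favorite_color": "Red",
        "hometown": "Mount Olympus",
    }
]


def to_json(rows):
    """Serialize the records as a JSON string."""
    return json.dumps(rows, indent=2)


def to_csv(rows):
    """Serialize the records as CSV text with a header row."""
    buffer = io.StringIO()
    # Column order follows the keys of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(to_json(records))
print(to_csv(records))
```

To write to disk instead of a string, pass a file opened with open(path, "w", newline="") to csv.DictWriter, or use json.dump() with an open file.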

### Crawl

In [9]:
# importing required libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
# The second argument, "html.parser", tells BeautifulSoup which parser to use
# behind the scenes; it is Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")
soup

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>
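To get a feel for what the built-in parser does on its own, the sketch below uses html.parser.HTMLParser directly, without BeautifulSoup. The class name ImageSourceCollector and the shortened sample string are assumptions for illustration; BeautifulSoup layers a much more convenient tree interface on top of these low-level events.

```python
from html.parser import HTMLParser


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        # Self-closing tags such as <img .../> also reach this method,
        # because the default handle_startendtag() calls it.
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


# Shortened copy of the Dionysus page, so the example runs offline.
sample = (
    '<img src="/static/dionysus.jpg"/>'
    "<h2>Name: Dionysus</h2>"
    '<img src="/static/grapes.png"/>'
)

collector = ImageSourceCollector()
collector.feed(sample)
print(collector.sources)
```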

### Parse and Transform
The BeautifulSoup object created in the previous cell already holds the parsed HTML. The cells below extract the required data from it.

### Use a BeautifulSoup Object
BeautifulSoup objects have a .get_text() method that can be used to extract all the text from the document and automatically remove any HTML tags.

In [12]:
print("TEXT >>>>>>>")
print(soup.get_text())
# find_all() returns a list of all instances of a particular tag
image1, image2 = soup.find_all("img")
print("IMAGES ...... ", image1, image2)
print("image1.name ....", image1.name)
print("image1 src ... ", image1["src"])

TEXT >>>>>>>


Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine




IMAGES ......  <img src="/static/dionysus.jpg"/> <img src="/static/grapes.png"/>
image1.name .... img
image1 src ...  /static/dionysus.jpg


In [13]:
# Certain tags in HTML documents can be accessed by properties of the Tag object.
soup.title

<title>Profile: Dionysus</title>
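Putting these pieces together, the sketch below shows one way to get from a parsed document to clean values. It assumes the bs4 package is installed and uses an embedded copy of the Dionysus page so that it runs offline; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Embedded copy of the page shown above, so no network access is needed.
sample_html = """<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>"""

soup = BeautifulSoup(sample_html, "html.parser")

# .string returns the text inside the tag, without the tag itself.
title_text = soup.title.string

# get_text() leaves many blank lines; keep only the non-empty ones.
lines = [line.strip() for line in soup.get_text().splitlines() if line.strip()]

# Collect the src attribute of every image on the page.
image_sources = [img["src"] for img in soup.find_all("img")]

print(title_text)
print(lines)
print(image_sources)
```

Collecting the sources with a list comprehension, rather than unpacking into image1 and image2, keeps the code working when a page has a different number of images.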