Collecting data from websites using an automated process is known as web scraping.

### urllib
The urllib.request module contains a function called urlopen() that you can use to open a URL from within a program.

In [14]:
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/aphrodite"
# To open the web page, pass url to urlopen():
page = urlopen(url)
# urlopen() returns an HTTPResponse object:
print(page)


<http.client.HTTPResponse object at 0x000001DFE714EB30>


To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

In [15]:
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)

<html>
<head>
<title>Profile: Aphrodite</title>
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="/static/aphrodite.gif" />
<h2>Name: Aphrodite</h2>
<br><br>
Favorite animal: Dove
<br><br>
Favorite color: Red
<br><br>
Hometown: Mount Olympus
</center>
</body>
</html>
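
The page above is served as UTF-8, but not every site uses that encoding. As a minimal sketch, the helper below decodes with the charset declared in the response headers when there is one and falls back to UTF-8 otherwise. The names decode_body and fetch_html are hypothetical helpers written for this example, not part of urllib, and fetch_html needs network access, so it is defined here but not called.

```python
from urllib.request import urlopen


def decode_body(raw_bytes, charset=None):
    """Decode response bytes, falling back to UTF-8 if no charset is given."""
    # errors="replace" substitutes undecodable bytes instead of raising.
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_html(url):
    """Fetch a URL and return its body as a string (requires network access)."""
    # The with statement closes the connection once the body has been read.
    with urlopen(url) as page:
        # get_content_charset() returns None when the server declares no charset.
        charset = page.headers.get_content_charset()
        return decode_body(page.read(), charset)


# Offline check of the decoding step with hand-made bytes:
print(decode_body("Favorite color: Red".encode("utf-8")))
print(decode_body("café".encode("latin-1"), "latin-1"))
```

With a connection available, fetch_html("http://olympus.realpython.org/profiles/aphrodite") should return the same HTML string as the read-then-decode steps above.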



### Extract Text From HTML With String Methods

One way to extract information from a web page’s HTML is to use string methods. For instance, you can use .find() to search through the text of the HTML for the <title> tags and extract the title of the web page.
To start, you’ll extract the title of the web page that you requested in the previous example. If you know the index of the first character of the title and the index of the first character of the closing </title> tag, then you can use a string slice to extract the title.

Because .find() returns the index of the first occurrence of a substring, you can get the index of the opening <title> tag by passing the string "<title>" to .find():

In [20]:
title_index = html.find("<title>")
title_index


14

You don’t want the index of the <title> tag, though. You want the index of the title itself. To get the index of the first letter in the title, you can add the length of the string "<title>" to title_index:

In [21]:
start_index = title_index + len("<title>")
start_index

21

Now get the index of the closing </title> tag by passing the string "</title>" to .find():

In [22]:
end_index = html.find("</title>")
end_index


39

In [23]:
# Finally, extract the title by slicing the html string:
title = html[start_index:end_index]
title


'Profile: Aphrodite'
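
The find-and-slice steps can be gathered into one reusable function. The sketch below is one possible way to do that. The name extract_between is hypothetical, and the sample string is a shortened copy of the Aphrodite page so the example runs without a network call. It also handles the case where .find() returns -1 because a tag is missing.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    marker_index = text.find(start_marker)
    if marker_index == -1:  # .find() returns -1 when the substring is absent
        return None
    start_index = marker_index + len(start_marker)
    # Search for the closing marker only after the opening one.
    end_index = text.find(end_marker, start_index)
    if end_index == -1:
        return None
    return text[start_index:end_index]


sample_html = "<html>\n<head>\n<title>Profile: Aphrodite</title>\n</head>\n</html>"
print(extract_between(sample_html, "<title>", "</title>"))  # Profile: Aphrodite
print(extract_between(sample_html, "<h1>", "</h1>"))        # None
```

String methods only match exact text, so a tag written as <title > or <TITLE> would not be found. That limitation is the usual reason to move on to regular expressions or an HTML parser.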

### Regular Expressions

Regular expressions, or regexes, are short patterns that you can use to search for text within a string. Python supports regular expressions through the standard library’s re module.

In [24]:
import re

In [30]:
# "ab*c" matches "a", then zero or more "b" characters, then "c"
re.findall("ab*c", "dsfacu")


['ac']

In [33]:
re.findall("ab*c", "abcd")
# Matching is case sensitive by default; pass re.IGNORECASE to ignore case:
# re.findall("ab*c", "ABC", re.IGNORECASE)


['abc']

In [34]:
# "." matches any single character except a newline
re.findall("a.c", "abc")

['abc']

In [36]:
# ".*" matches any run of characters, so "a.*c" spans from "a" to the last "c"
re.findall("a.*c", "abtyc")

['abtyc']
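
The same building blocks can pull a title out of HTML that string methods would miss. The sketch below is illustrative only. The messy_html string is a made-up sample with inconsistent tag case and a stray space, and the pattern is one reasonable choice rather than the only one. It uses re.search() with a capturing group and the non-greedy quantifier *?, which matches as few characters as possible.

```python
import re

# Hypothetical sample: inconsistent tag case and a stray space inside the tag.
messy_html = "<head>\n<TITLE >Profile: Dionysus</title>\n</head>"

# <title.*?> matches the opening tag plus anything up to its first ">";
# (.*?) captures as few characters as possible up to the closing tag.
pattern = "<title.*?>(.*?)</title.*?>"
match = re.search(pattern, messy_html, re.IGNORECASE | re.DOTALL)

# re.search() returns None when nothing matches, so check before using it.
title = match.group(1) if match else None
print(title)  # Profile: Dionysus

# Greedy versus non-greedy matching on the same input:
print(re.findall("a.*c", "abcabc"))   # ['abcabc']
print(re.findall("a.*?c", "abcabc"))  # ['abc', 'abc']
```

re.IGNORECASE lets the pattern match <TITLE > as well as <title>, and re.DOTALL lets . match newlines in case the title spans more than one line.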

## Use an HTML Parser for Web Scraping in Python

### Beautiful Soup

In [42]:
from bs4 import BeautifulSoup as BS
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
content = page.read().decode("utf-8")
soup = BS(content, "html.parser")
print(soup)

<html>
<head>
<title>Profile: Dionysus</title>
</head>
<body bgcolor="yellow">
<center>
<br/><br/>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/><br/><br/>
Hometown: Mount Olympus
<br/><br/>
Favorite animal: Leopard <br/>
<br/>
Favorite Color: Wine
</center>
</body>
</html>



The code above does the following:

- Opens the URL http://olympus.realpython.org/profiles/dionysus by using urlopen() from the urllib.request module
- Reads the HTML from the page as a string and assigns it to the content variable
- Creates a BeautifulSoup object, using the built-in "html.parser" parser, and assigns it to the soup variable

In [48]:
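# get_text() returns all the text in the document with the HTML tags removed
text = soup.get_text()
# Remove the newlines to print the text on a single line
compact_text = text.replace("\n", "")
print(compact_text)
print(text)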

Profile: DionysusName: DionysusHometown: Mount OlympusFavorite animal: Leopard Favorite Color: Wine


Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine
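
Besides get_text(), a BeautifulSoup object lets you reach individual tags directly. The sketch below assumes the bs4 package is installed and parses a trimmed, hand-copied version of the Dionysus page (the sample_html string) so that it runs without a network call. It reads the title, collects the image paths, and uses the separator and strip arguments of get_text() as a tidier alternative to removing newlines by hand.

```python
from bs4 import BeautifulSoup

# A trimmed copy of the Dionysus profile, so the example runs offline.
sample_html = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<img src="/static/dionysus.jpg"/>
<h2>Name: Dionysus</h2>
<img src="/static/grapes.png"/>
</body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# .title returns the first <title> tag; .string gives the text inside it.
print(soup.title.string)

# find_all() returns every matching tag; index a tag like a dict to read an attribute.
image_paths = [img["src"] for img in soup.find_all("img")]
print(image_paths)

# strip=True trims whitespace around each piece of text; separator joins the pieces.
print(soup.get_text(separator=" | ", strip=True))
```

Because the parser understands the tag structure, this approach keeps working when tags contain extra spaces or mixed case, which is where the string-method and regex approaches tend to break.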




