## WEB SCRAPING USING BEAUTIFUL SOUP

## Hyper Text Markup Language (HTML)

Hypertext Markup Language is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScript (JS).

- HTML describes the structure of a web page.
- HTML consists of a series of elements.
- HTML elements tell the browser how to display the content.
- HTML elements are represented by tags.
- HTML tags label pieces of content such as "heading", "paragraph", "table", and so on.
- Browsers do not display the HTML tags, but use them to render the content of the page.
A simple HTML document is built from the following parts:

- The `<!DOCTYPE html>` declaration defines this document to be HTML5.
- The `<html>` element is the root element of an HTML page.
- The `<head>` element contains meta information about the document.
- The `<title>` element specifies a title for the document.
- The `<body>` element contains the visible page content.
- The `<h1>` element defines a large heading.
- The `<p>` element defines a paragraph.
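As a minimal sketch of the structure described above, the snippet below holds a small HTML5 document in a Python string and lists its elements with the standard-library `html.parser` module. The page text (`Page Title`, `My First Heading`, `My first paragraph.`) and the `TagCollector` class name are placeholders chosen for illustration, not content from the original notebook.

```python
from html.parser import HTMLParser

# A small HTML5 document with the parts listed above (placeholder text).
SIMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record the name of every start tag in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(SIMPLE_PAGE)
print(collector.tags)  # ['html', 'head', 'title', 'body', 'h1', 'p']
```

The `<!DOCTYPE html>` declaration is not an element, so it does not appear in the list of tags.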
HTML Tags
HTML tags are element names surrounded by angle brackets.

Example: `<tagname>content goes here...</tagname>`

- HTML tags normally come in pairs, like `<p>` and `</p>`.
- The first tag in a pair is the start tag; the second tag is the end tag.
- The end tag is written like the start tag, but with a forward slash inserted before the tag name.
HTML Page Structure

## Most Commonly Used Tags
Div tag: It is used as a container that groups a section of the page.
`<div> Statements... </div>`

Anchor tag: It is used to link one page to another page.
`<a href="..."> Statements... </a>`

List item tag: It defines a single item inside an ordered or unordered list.
`<li> Statements... </li>`

Ordered List tag: It is used to list the content in a particular order.
`<ol> Statements... </ol>`

    <ol>
      <li>List item 1</li>
      <li>List item 2</li>
      <li>List item 3</li>
      <li>List item 4</li>
    </ol>

Unordered List tag: It is used to list the content without order. Each item inside the list is still written with the `<li>` tag.
`<ul> Statements... </ul>`

    <ul>
      <li>List item</li>
      <li>List item</li>
      <li>List item</li>
      <li>List item</li>
    </ul>

Image tag: It is used to add an image element to an HTML document.
`<img src="" width="40" height="40">`

Table tag: It is used to create a table in an HTML document.

    <table>
      <tr>
        <th>Month</th>
        <th>Savings</th>
      </tr>
      <tr>
        <td>January</td>
        <td>100</td>
      </tr>
    </table>

Th tag: It defines a header cell in a table.
Tr tag: It defines a row of an HTML table.
Td tag: (Table data) It defines a standard data cell in a table.
Form tag: It is used to create an HTML form for the user.
Input tag: It is used to take input from the user. With `type="submit"` it renders the button that submits the form.

    <form method="post" action="/cgibin/example.cgi">
        <input type="text" maxlength="30">
        <input type="submit" value="Submit">
    </form>

Web scraping is an automatic method to obtain large amounts of data from websites.

Most of this data is unstructured data in an HTML format, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

There are many different ways to perform web scraping to obtain data from websites.

These include using online services, using a site's own APIs, or writing your own scraping code from scratch.
Many large websites, such as Google, Twitter, Facebook, and Stack Overflow, provide APIs that give access to their data in a structured format.

## Web Scraping has multiple applications 
1. Price Monitoring
2. Market Research
3. News Monitoring
4. Sentiment Analysis
5. Email tracking


## This activity will use the below modules:

Requests: To make web requests

Beautiful Soup: To extract data from the HTML response

Beautiful Soup can extract a single occurrence or every occurrence of a specific tag, and its search methods also accept criteria based on attributes. The main search methods are:

Find: `find` takes the name of a tag as a string and returns the first match in the parsed page, or `None` if there is no match.

Find all: `find_all` extracts all occurrences of a particular tag from the parsed page.
`find_all` returns a `ResultSet`, which offers index-based access to the matches and can be iterated with a for loop.
`find_all` can accept a list of tags, as in `soup.find_all(['th', 'td'])`, and keyword parameters such as `id` to find the tag with a given unique id.
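The difference between `find` and `find_all` can be sketched on the small savings table shown earlier. The markup is inlined as a string so the example runs without a network request; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The savings table from the tag overview, inlined as a string.
html = """
<table>
  <tr><th>Month</th><th>Savings</th></tr>
  <tr><td>January</td><td>100</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first matching tag.
first_cell = soup.find("td")
print(first_cell.get_text())  # January

# find_all returns every match as a ResultSet with index-based access.
rows = soup.find_all("tr")
print(len(rows))  # 2
print(rows[1].get_text(" ", strip=True))  # January 100

# A list of tag names matches any of them, in document order.
cell_texts = [cell.get_text() for cell in soup.find_all(["th", "td"])]
print(cell_texts)  # ['Month', 'Savings', 'January', '100']

# find returns None when nothing matches, so check before using the result.
missing = soup.find("img")
print(missing)  # None
```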

Select: `select` takes a CSS selector string and returns a list of all matching elements.

Attribute-Driven Search: Most of the time, attributes like id, class, or value are used to further refine the search.
Example: `soup.find_all('div', class_='thumbnail')`
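A minimal sketch of attribute-driven search follows. The markup, the `id` and `class` values, and the `/item/...` paths are made up for illustration; only `find`, `find_all`, the `class_` and `id` keyword arguments, and the `attrs` dictionary are Beautiful Soup features.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with id, class, and href attributes.
html = """
<div id="main" class="container">
  <div class="thumbnail"><h4 class="price">$10</h4><a href="/item/1">Item one</a></div>
  <div class="thumbnail"><h4 class="price">$20</h4><a href="/item/2">Item two</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class is a reserved word in Python, so the keyword argument is class_.
thumbnails = soup.find_all("div", class_="thumbnail")
print(len(thumbnails))  # 2

# An id should be unique on a page, so find is enough.
main = soup.find(id="main")
print(main["class"])  # ['container']

# Any attribute can be matched through the attrs dictionary.
links = soup.find_all("a", attrs={"href": "/item/2"})
print(links[0].get_text())  # Item two

prices = [h4.get_text() for h4 in soup.find_all("h4", class_="price")]
print(prices)  # ['$10', '$20']
```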

Nested Tags: Nested tags can be found by passing a descendant selector to the select method.
Example: `soup.select("html head title")[0].get_text()`
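The following sketch shows `select` with nested CSS selectors on a small inlined page. The markup and its text are placeholders; `select_one`, which returns only the first match or `None`, is also used here as a convenience.

```python
from bs4 import BeautifulSoup

# Hypothetical page with nested tags.
html = """<html><head><title>Demo page</title></head>
<body>
<div class="thumbnail"><h4><a href="/a">First</a></h4><p class="description">Desc A</p></div>
<div class="thumbnail"><h4><a href="/b">Second</a></h4><p class="description">Desc B</p></div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Descendant selector: a title inside head inside html.
title_text = soup.select("html head title")[0].get_text()
print(title_text)  # Demo page

# "h4 > a" matches only anchors that are direct children of an h4.
names = [a.get_text() for a in soup.select("div.thumbnail h4 > a")]
print(names)  # ['First', 'Second']

# select_one returns the first match instead of a list.
first_description = soup.select_one("p.description").get_text()
print(first_description)  # Desc A
```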


Beautiful Soup also provides navigation properties such as:

next_sibling and previous_sibling: To move between tags at the same level, such as the td cells within the same tr row.

next_element and previous_element: To move to the element parsed immediately after or before the current one, regardless of nesting level.
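A short sketch of these navigation properties, using one row of the savings table written without whitespace between the tags. In indented HTML, `next_sibling` often returns the whitespace string between two tags; `find_next_sibling()` skips such strings and returns the next tag.

```python
from bs4 import BeautifulSoup

# One table row, written compactly so siblings are tags, not whitespace.
html = "<table><tr><td>January</td><td>100</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

month_cell = soup.find("td")

# Siblings share the same parent (here the tr row).
savings_cell = month_cell.next_sibling
print(savings_cell.get_text())  # 100
print(savings_cell.previous_sibling.get_text())  # January

# next_element is whatever was parsed next: the text inside the first cell.
inner_text = str(month_cell.next_element)
print(inner_text)  # January

# The last cell in the row has no next sibling.
print(savings_cell.next_sibling)  # None
```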


Points To Remember:

The logic to extract the data usually depends upon the HTML structure of the webpage, so some changes in structure can break the logic.

The content of a website can be subject to applicable laws, so make sure to read the site's terms and conditions about its content before scraping it.
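Because a change in page structure can break the extraction logic, one defensive pattern is to check for a missing element instead of indexing into a result list. The sketch below assumes a hypothetical helper `extract_price` and made-up markup in which the second product no longer has an `h4.price` element; the `$99.00` value is invented.

```python
from bs4 import BeautifulSoup


def extract_price(product):
    """Return the price text of a product card, or None if the markup changed."""
    node = product.select_one("h4.price")
    return node.get_text(strip=True) if node is not None else None


# The second card simulates a layout change: the price moved to a span.
html = """
<div class="thumbnail"><h4 class="price">$1139.54</h4></div>
<div class="thumbnail"><span class="cost">$99.00</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
prices = [extract_price(card) for card in soup.select("div.thumbnail")]
print(prices)  # ['$1139.54', None]
```

Returning `None` lets the caller log or skip the affected record, whereas `product.select('h4.price')[0]` would raise an `IndexError` on the second card.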

Loading Web Pages with 'requests'

Extracting title and body of the web with beautiful soup

In [5]:
import requests

# Make a request to https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/
# Store the result in 'res' variable
res = requests.get(
    'https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')
txt = res.text
status = res.status_code

print(txt, status)

<!DOCTYPE html>
<html lang="en">
	<head>
		<!-- Anti-flicker snippet (recommended)  -->
		<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
		<title>codedamn Web Scraper demo</title>
		<meta charset="utf-8" />
		<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1" />

		<meta
			name="keywords"
			content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, "
		/>
		<meta name="description" content="The most popular web scraping website." />
		<link
			rel="icon"
			sizes="128x128"
			href="/webscraper-python-codedamn-classroom-website/favicon.png"
		/>

		<meta name="viewport" content="width=device-width, initial-scale=1.0" />

		<link rel="stylesheet" href="/webscraper-python-codedamn-classroom-website/app.css" />

		<link
			rel="apple-touch-icon"
			href="/webscraper-python-codedamn-classroom-website/logo-icon.png"
		/>

		<script defer src="/webscraper-python-codedamn-classroom-website/app.js"></script>
	</head>
	<body>
		<header r
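The request above prints the status code but does not act on it. A slightly more defensive variant, sketched below, wraps the call in a hypothetical helper named `fetch_html` that sets a timeout and raises on HTTP error statuses. The 10-second default is an arbitrary choice, not a value from the notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text.

    Raises requests.HTTPError for 4xx/5xx responses and
    requests.Timeout if the server does not answer in time.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/')` would return the same text as `res.text` in the cell above when the request succeeds.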

Extract title and body

In [6]:
import requests
from bs4 import BeautifulSoup

# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Extract title of page
page_title = soup.title

# Extract body of page
page_body = soup.body

# Extract head of page
page_head = soup.head

# print the result
print(page_title, page_head)


<title>codedamn Web Scraper demo</title> <head>
<!-- Anti-flicker snippet (recommended)  -->
<style>
			.async-hide {
				opacity: 0 !important;
			}
		</style>
<title>codedamn Web Scraper demo</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, " name="keywords"/>
<meta content="The most popular web scraping website." name="description"/>
<link href="/webscraper-python-codedamn-classroom-website/favicon.png" rel="icon" sizes="128x128"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<link href="/webscraper-python-codedamn-classroom-website/app.css" rel="stylesheet"/>
<link href="/webscraper-python-codedamn-classroom-website/logo-icon.png" rel="apple-touch-icon"/>
<script defer="" src="/webscraper-python-codedamn-classroom-website/app.js"></script>
</head>
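`soup.title`, `soup.head`, and `soup.body` return whole tags, markup included. When only the text is needed, `.string` or `get_text()` can be used, as in this small offline sketch. The inlined page reuses the title from the output above; the `<h1>` content is a placeholder.

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the downloaded page.
html = (
    "<html><head><title>codedamn Web Scraper demo</title></head>"
    "<body><h1>Hello</h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

print(soup.title)  # <title>codedamn Web Scraper demo</title>

# .string gives the text of a tag that contains a single string.
title_text = soup.title.string
print(title_text)  # codedamn Web Scraper demo

# .name gives the tag name; get_text() joins all text inside the tag.
print(soup.title.name)  # title
body_text = soup.body.get_text()
print(body_text)  # Hello
```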


Extracting image information (src and alt attributes)

In [7]:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create image_data as an empty list
image_data = []

# Collect the src and alt attributes of every <img> tag
images = soup.select('img')
for image in images:
    src = image.get('src')
    alt = image.get('alt')
    image_data.append({"src": src, "alt": alt})

print(image_data)

[{'src': '/webscraper-python-codedamn-classroom-website/logo_white.svg', 'alt': 'Web Scraper'}, {'src': '/webscraper-python-codedamn-classroom-website/cart2.png', 'alt': 'item'}, {'src': '/webscraper-python-codedamn-classroom-website/cart2.png', 'alt': 'item'}, {'src': '/webscraper-python-codedamn-classroom-website/cart2.png', 'alt': 'item'}, {'src': '/webscraper-python-codedamn-classroom-website/fbicon.png', 'alt': 'Web Scraper on Facebook'}, {'src': '/webscraper-python-codedamn-classroom-website/twicon.png', 'alt': 'Web Scraper on Twitter'}]
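The `src` values in this output are paths relative to the site root, so they cannot be downloaded as they stand. One way to turn them into absolute URLs is `urljoin` from the standard library, sketched here on two entries copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/"

# Two entries copied from the scraped image_data list.
image_data = [
    {"src": "/webscraper-python-codedamn-classroom-website/logo_white.svg", "alt": "Web Scraper"},
    {"src": "/webscraper-python-codedamn-classroom-website/cart2.png", "alt": "item"},
]

# urljoin resolves each path against the page URL it was found on.
absolute_urls = [urljoin(BASE_URL, image["src"]) for image in image_data]
for url in absolute_urls:
    print(url)
```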


Extracting links (href attribute values and link text)

In [8]:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_links as an empty list
all_links = []

# Collect the href and visible text of every <a> tag
links = soup.select('a')
for ahref in links:
    text = ahref.text
    text = text.strip() if text is not None else ''

    href = ahref.get('href')
    href = href.strip() if href is not None else ''
    all_links.append({"href": href, "text": text})

print(all_links)

[{'href': '', 'text': 'Toggle navigation'}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': ''}, {'href': '#page-top', 'text': ''}, {'href': '/webscraper-python-codedamn-classroom-website/', 'text': 'Web Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/cloud-scraper', 'text': 'Cloud Scraper'}, {'href': '/webscraper-python-codedamn-classroom-website/pricing', 'text': 'Pricing'}, {'href': '#section3', 'text': 'Learn'}, {'href': '/webscraper-python-codedamn-classroom-website/documentation', 'text': 'Documentation'}, {'href': '/webscraper-python-codedamn-classroom-website/tutorials', 'text': 'Video Tutorials'}, {'href': '/webscraper-python-codedamn-classroom-website/how-to-videos', 'text': 'How to'}, {'href': '/webscraper-python-codedamn-classroom-website/test-sites', 'text': 'Test Sites'}, {'href': 'https://forum.webscraper.io/', 'text': 'Forum'}, {'href': 'https://chrome.google.com/webstore/detail/web-scraper/jnhgnonknehpejjnehehllkliplmbmhn?hl=en', '
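The scraped list mixes empty hrefs, in-page anchors, site-relative paths, and links to other sites. A possible post-processing step, sketched below on four entries copied from the output above, separates them with `urlparse`. The helper name `is_external` is illustrative.

```python
from urllib.parse import urlparse

# Four entries copied from the scraped all_links list.
all_links = [
    {"href": "", "text": "Toggle navigation"},
    {"href": "#page-top", "text": ""},
    {"href": "/webscraper-python-codedamn-classroom-website/pricing", "text": "Pricing"},
    {"href": "https://forum.webscraper.io/", "text": "Forum"},
]


def is_external(href):
    """A link is treated as external when it names a host of its own."""
    return bool(urlparse(href).netloc)


external = [link for link in all_links if is_external(link["href"])]
internal = [link for link in all_links if link["href"].startswith("/")]

print([link["text"] for link in external])  # ['Forum']
print([link["text"] for link in internal])  # ['Pricing']
```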

Generating CSV file from the data

In [14]:
import requests
from bs4 import BeautifulSoup
import csv
# Make a request
page = requests.get(
    "https://codedamn-classrooms.github.io/webscraper-python-codedamn-classroom-website/")
soup = BeautifulSoup(page.content, 'html.parser')

# Create all_products as an empty list
all_products = []

# Each product card is a div with the class "thumbnail"
products = soup.select('div.thumbnail')
for product in products:
    name = product.select('h4 > a')[0].text.strip()
    description = product.select('p.description')[0].text.strip()
    price = product.select('h4.price')[0].text.strip()
    reviews = product.select('div.ratings')[0].text.strip()
    image = product.select('img')[0].get('src')

    all_products.append({
        "name": name,
        "description": description,
        "price": price,
        "reviews": reviews,
        "image": image
    })


keys = all_products[0].keys()

with open('products.csv', 'w', newline='') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(all_products)
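The cell above takes the column names from `all_products[0]`, which raises an `IndexError` if the selector matches no products. A variant with an explicit column list is sketched below. It writes to an in-memory buffer instead of `products.csv` so that it runs anywhere, and the single product row is sample data: the `description` and `reviews` values are invented.

```python
import csv
import io

# Explicit column order, independent of whether any product was scraped.
FIELDNAMES = ["name", "description", "price", "reviews", "image"]

# Sample row; description and reviews are invented placeholders.
all_products = [
    {
        "name": "Asus AsusPro Adv...",
        "description": "sample description",
        "price": "$1139.54",
        "reviews": "sample reviews",
        "image": "/webscraper-python-codedamn-classroom-website/cart2.png",
    }
]

# newline="" leaves line endings to the csv module, as with open(..., newline='').
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerows(all_products)

# Read the data back to confirm the round trip.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["name"], rows[0]["price"])  # Asus AsusPro Adv... $1139.54
```

With an empty `all_products` list this version still produces a valid file that contains only the header row.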

In [15]:
# Print the text of every <a> and <div> tag on the page
for tag in soup.find_all(['a', 'div']):
    print(tag.text)





Toggle navigation

















Web Scraper





Cloud Scraper





Pricing





Learn




Documentation


Video Tutorials


How to


Test Sites


Forum




Install


Login








Toggle navigation











Toggle navigation

















Web Scraper





Cloud Scraper





Pricing





Learn




Documentation


Video Tutorials


How to


Test Sites


Forum




Install


Login






Web Scraper




Cloud Scraper




Pricing




Learn



Documentation
Video Tutorials
How to
Test Sites
Forum
Install
Login






Test Sites











Home



											Computers
											




											Phones
											








E-commerce training site

								Welcome to WebScraper e-commerce site. You can use this site for
								training to learn how to use the Web Scraper. Items listed here are
								not for sale.
							

Top items being scraped right now





$1139.54

Asus AsusPro Adv...


											Asus AsusPro Advanced BU401LA-FA271G Dark Grey,
											14", Core i5-4210U, 