# **Web Scraping** 

 
Web scraping is an automated method of collecting large amounts of data from websites.  
Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or database so that it can be used in other applications.  

Web scraping involves two parts: the crawler and the scraper.  
Web Crawler: Also known as a spider or bot, a crawler is a program that systematically browses the internet, following links from one page to another.  
In practice, it opens the links it finds on a website and stores the pages it discovers in a database.  
Web Scraper: A web scraper is a tool or program designed to extract information from web pages. Unlike crawlers, which focus on discovering and indexing pages, scrapers extract specific data from one or more pages.  
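
The crawler's link-following behaviour can be sketched without touching the network. This is a minimal illustration, not a production crawler: the `PAGES` dictionary is a hypothetical in-memory site standing in for real HTTP responses, and `LinkCollector` and `crawl` are names made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical site: URL -> HTML body (an assumption, not real pages).
PAGES = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "https://example.com/a": '<a href="/b">B</a>',
    "https://example.com/b": '<a href="/">Home</a>',
}


def crawl(start_url):
    """Breadth-first crawl that visits every reachable page once."""
    seen = [start_url]          # discovered URLs, in discovery order
    queue = deque([start_url])  # URLs still waiting to be opened
    while queue:
        url = queue.popleft()
        collector = LinkCollector()
        collector.feed(PAGES.get(url, ""))
        for href in collector.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.append(absolute)
                queue.append(absolute)
    return seen


visited = crawl("https://example.com/")
print(visited)
```

A scraper would then be run on each URL in `visited` to pull out the specific fields of interest.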

**Types of Web Scraper:**  
Self-built web scraper, pre-built web scraper, browser extension web scraper, software web scraper, cloud web scraper  

**Important Python libraries for web scraping:**  
Scrapy, BeautifulSoup, Selenium  
 

**Components of web scraping**  
1. HTTP Requests:  
Scraping starts by sending an HTTP request to the target website's server to retrieve the content of the page.  
2. HTML Parsing:  
Once the content of the webpage is retrieved, the next step is to parse the HTML to locate and extract the desired information.  
Tools like BeautifulSoup (Python) are used to parse and manipulate the HTML content.
3. Data Extraction:  
Data extraction involves identifying the specific HTML elements that contain the data you want and then extracting that data.
4. Handling Dynamic Content:  
Many modern websites load content dynamically using JavaScript, so the data may be missing from the initial HTML response; a browser automation tool such as Selenium is then needed to render the page.
5. Data Storage:  
Once extracted, the data is stored in a structured format such as CSV, JSON, or XML for further use.
6. Rate Limiting and Throttling:  
To avoid overloading the target server and triggering anti-scraping mechanisms, it’s important to control the frequency of requests.
7. Legal and Ethical Considerations:  
Web scraping must stay within legal and ethical bounds.  
robots.txt: This file, served from the root of a site, tells web crawlers which parts of the site they may access.  
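
Components 6 and 7 can be sketched with the standard library alone. This is a rough illustration under stated assumptions: the `ROBOTS_TXT` content, the `my-scraper` user agent, the example URLs, and the `polite_fetch` helper are all hypothetical, and a real scraper would download the file from the site instead of hard-coding it.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

urls = [
    "https://example.com/products",
    "https://example.com/private/admin",
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [u for u in urls if parser.can_fetch("my-scraper", u)]

# Use the site's requested delay, falling back to 1 second if none is given.
delay = parser.crawl_delay("my-scraper") or 1


def polite_fetch(urls, delay_seconds, fetch):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


print(allowed, delay)
```

Here `fetch` could be something like `lambda u: requests.get(u).status_code`; passing it in as a parameter keeps the throttling logic separate from the request itself.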


In [38]:
# Install the libraries first: pip install requests beautifulsoup4 lxml

In [39]:
import requests

In [40]:
webpage= requests.get('https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets')
print(webpage)

<Response [200]>


HTTP status codes are three-digit codes that indicate the result of an HTTP request. They fall into five classes:  
1. Informational Responses (100–199)  
2. Successful Responses (200–299)  
3. Redirection Responses (300–399)  
4. Client Error Responses (400–499)  
5. Server Error Responses (500–599)  
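
Since the class is determined by the first digit, it can be derived with integer division. The helper below is a small sketch; `status_class` is a made-up name, not part of `requests`.

```python
def status_class(code):
    """Return the class of an HTTP status code based on its first digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    return classes[code // 100]


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

With `requests`, `webpage.ok` is `True` for any status code below 400, and `webpage.raise_for_status()` raises an exception for 4xx and 5xx responses.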

In [41]:
webpage.url

'https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets'

In [42]:
webpage.status_code

200

# **HTML Parsing**

In [43]:
from bs4 import BeautifulSoup

In [44]:
# webpage.content

**NOTE:**  
The raw content is returned as one long byte string of HTML, which is hard to read.  
So we parse it with BeautifulSoup into a tree of HTML elements, which prettify() can print with indentation and from which we can extract the data.

In [8]:
soup=BeautifulSoup(webpage.content,"html.parser")

In [45]:
# print(soup.prettify())

In [10]:
# print(soup.body.prettify())

In [46]:
soup.title

<title>Allinone | Web Scraper Test Sites</title>

In [47]:
soup.p #Returns the 1st paragraph tag

<p>Web Scraper</p>

In [48]:
soup.a #Returns the 1st anchor tag

<a data-bs-target=".side-collapse" data-bs-target-2=".side-collapse-container" data-bs-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggler float-end collapsed" data-bs-target="#navbar" data-bs-target-2=".side-collapse-container" data-bs-target-3=".side-collapse" data-bs-toggle="collapse" type="button">
<span class="visually-hidden">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
<span class="icon-bar extra-bottom-bar"></span>
</button>
</a>

# **Kinds of Objects**  
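1. Tag: an HTML element such as `<p>` or `<a>`
2. NavigableString: the text inside a tag
3. BeautifulSoup: the parsed document as a whole
4. Comment: an HTML comment (`<!-- ... -->`), a special kind of NavigableString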

## Tag

In [49]:
tag=soup.html

In [50]:
type(tag)

bs4.element.Tag

In [51]:
tag=soup.p
tag

<p>Web Scraper</p>

In [52]:
tag=soup.a
tag

<a data-bs-target=".side-collapse" data-bs-target-2=".side-collapse-container" data-bs-toggle="collapse-side">
<button aria-controls="navbar" aria-expanded="false" class="navbar-toggler float-end collapsed" data-bs-target="#navbar" data-bs-target-2=".side-collapse-container" data-bs-target-3=".side-collapse" data-bs-toggle="collapse" type="button">
<span class="visually-hidden">Toggle navigation</span>
<span class="icon-bar top-bar"></span>
<span class="icon-bar middle-bar"></span>
<span class="icon-bar bottom-bar"></span>
<span class="icon-bar extra-bottom-bar"></span>
</button>
</a>

## NavigableString

In [53]:
tag=soup.p.string
tag

'Web Scraper'

## BeautifulSoup

In [55]:
soup.title

<title>Allinone | Web Scraper Test Sites</title>

In [56]:
soup.head

<head>
<!-- Google Tag Manager -->
<script>(function (w, d, s, l, i) {
		w[l] = w[l] || [];
		w[l].push({
			'gtm.start':
				new Date().getTime(), event: 'gtm.js'
		});
		var f = d.getElementsByTagName(s)[0],
			j = d.createElement(s), dl = l != 'dataLayer' ? '&l=' + l : '';
		j.async = true;
		j.src =
			'https://www.googletagmanager.com/gtm.js?id=' + i + dl;
		f.parentNode.insertBefore(j, f);
	})(window, document, 'script', 'dataLayer', 'GTM-NVFPDWB');</script>
<!-- End Google Tag Manager -->
<title>Allinone | Web Scraper Test Sites</title>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper" name="keywords"/>
<meta content="The most popular web scraping extension. Start scraping in minutes. Automate your tasks with our Cloud Scraper. No software to download, no coding needed." name="description"/>
<link href="/favicon.png" rel="icon" sizes="128x128"/>
<meta c

In [60]:
# soup.body

In [61]:
soup.find("h1")

<h1>Test Sites</h1>

In [62]:
soup.find_all("h1")

[<h1>Test Sites</h1>, <h1 class="page-header">Computers / Tablets</h1>]

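## Comments

A Comment object represents an HTML comment (`<!-- ... -->`). The `.string` calls below return ordinary NavigableString objects, because these tags contain plain text rather than comments.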

In [63]:
com=soup.p.string
com

'Web Scraper'

In [65]:
com=soup.h1.string
com

'Test Sites'
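
To see an actual Comment object, the sketch below parses a small made-up snippet. The `html` string and the `demo` name are assumptions for illustration and are not taken from the test site.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: a paragraph holding a comment followed by visible text.
html = "<p><!-- hidden note -->Visible text</p>"
demo = BeautifulSoup(html, "html.parser")

# The first child of <p> is the comment node.
first_child = demo.p.contents[0]
print(type(first_child))

# Collect every comment in the document by filtering string nodes on type.
comments = demo.find_all(string=lambda s: isinstance(s, Comment))
print(comments)
```

Because `Comment` is a subclass of `NavigableString`, checking the type with `isinstance` is the way to tell comments apart from ordinary text.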

# **Implementation**

In [66]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import numpy as np

Reference Video: https://www.youtube.com/watch?v=1DGTRnrvPpo&list=PLjVLYmrlmjGfSYkgH-_jgC8KMxyRzq7US&index=4

In [67]:
webpage=requests.get('https://webscraper.io/test-sites/e-commerce/allinone/computers/tablets')
webpage

<Response [200]>

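If the response code is 403 (Forbidden), the server may be rejecting the default User-Agent sent by requests. Sending a browser-like User-Agent header often helps:  
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}  
requests.get('url',headers=headers).text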

In [68]:
soup=BeautifulSoup(webpage.text,'lxml')


In [69]:
# soup.prettify()

In [71]:
tablets= soup.find_all('div', class_='col-md-4 col-xl-4 col-lg-4')
len(tablets)


21

**NOTE:**  
In BeautifulSoup, you filter HTML elements by CSS class with the keyword argument class_ rather than class. class is a reserved keyword in Python (it defines classes), so it cannot be used as an argument name; BeautifulSoup uses class_ to avoid the conflict. When the value contains several space-separated classes, as in the cell above, it is matched against the exact string of the class attribute.
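
The different ways of filtering by class can be compared on a small made-up snippet. The `html` string below is an assumption that loosely imitates the product cards; it is not copied from the test site.

```python
from bs4 import BeautifulSoup

# Made-up markup resembling one product card.
html = """
<div class="card">
  <a class="title" href="/product/1">Tablet A</a>
  <h4 class="price float-end">$10.00</h4>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# class_ keyword and an attrs dictionary are equivalent ways to filter.
by_keyword = demo.find_all("a", class_="title")
by_attrs = demo.find_all("a", attrs={"class": "title"})

# A single class name matches even when the tag carries several classes.
single_class = demo.find_all("h4", class_="price")

# A space-separated value matches the exact class attribute string.
exact_string = demo.find_all("h4", class_="price float-end")

# A CSS selector matches on the set of classes, in any order.
by_css = demo.select("h4.float-end.price")

print(by_keyword[0].text, by_css[0].text)
```

Matching on one distinctive class, or using `select()`, tends to be less brittle than matching the full class string, which breaks if the site adds or reorders classes.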

In [72]:
names=soup.find_all('a',class_='title')
print(len(names))
for i in names:
    print(i.text)

21
Lenovo IdeaTab
IdeaTab A3500L
Acer Iconia
Galaxy Tab 3
Iconia B1-730H...
Memo Pad HD 7
Asus MeMO Pad
Amazon Kindle
Galaxy Tab 3
IdeaTab A8-50
MeMO Pad 7
IdeaTab A3500-...
IdeaTab S5000
Galaxy Tab 4
Galaxy Tab
MeMo PAD FHD 1...
Galaxy Note
Galaxy Note
iPad Mini Reti...
Galaxy Note 10...
Apple iPad Air


In [73]:
prices=soup.find_all('h4',class_='price float-end card-title pull-right')
print(len(prices))
for i in prices:
    print(i.text)

21
$69.99
$88.99
$96.99
$97.99
$99.99
$101.99
$102.99
$103.99
$107.99
$121.99
$130.99
$148.99
$172.99
$233.99
$251.99
$320.99
$399.99
$489.99
$537.99
$587.99
$603.99


In [74]:
descriptions=soup.find_all('p',class_="description card-text")
print(len(descriptions))
for i in descriptions:
    print(i.text)

21
7" screen, Android
Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2
7" screen, Android, 16GB
7", 8GB, Wi-Fi, Android 4.2, White
Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4
IPS, Dual-Core 1.2GHz, 8GB, Android 4.3
7" screen, Android, 8GB
6" screen, wifi
7", 8GB, Wi-Fi, Android 4.2, Yellow
Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2
White, 7", Atom 1.2GHz, 8GB, Android 4.4
Blue, 7" IPS, Quad-Core 1.3GHz, 8GB, 3G, Android 4.2
Silver, 7" IPS, Quad-Core 1.2Ghz, 16GB, 3G, Android 4.2
LTE (SM-T235), Quad-Core 1.2GHz, 8GB, Black
16GB, White
White, 10.1" IPS, 1.6GHz, 2GB, 16GB, Android 4.2
10.1", 3G, Android 4.0, Garnet Red
12.2", 32GB, WiFi, Android 4.4, White
Wi-Fi + Cellular, 32GB, Silver
10.1", 32GB, Black
Wi-Fi, 64GB, Silver


## Extracting from each product container, converting the data to a pandas DataFrame, and exporting it as a .csv file

Reference Video: https://www.youtube.com/watch?v=RmPVICEWF4w&t=1680s

**NOTE:**  
If you use find_all() but treat the result as a single element (as you would with find()), you'll get an AttributeError, because find_all() returns a list-like ResultSet rather than one tag.  
This happens, for example, when you call .text or access an attribute directly on the result of find_all() without first iterating over it or indexing into it.
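
The difference can be reproduced on a tiny made-up document (the markup and the `demo` name are assumptions for illustration):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<p>first</p><p>second</p>", "html.parser")
paragraphs = demo.find_all("p")  # list-like ResultSet of two tags

# Treating the whole result as a single tag fails.
try:
    paragraphs.text
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

# Correct: iterate over the result, or index into it.
texts = [p.text for p in paragraphs]

# find() returns a single tag, so .text works directly.
first_text = demo.find("p").text

print(error_name, texts, first_text)
```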

In [75]:
names=[]
prices=[]
descriptions=[]


for i in tablets:
    names.append(i.find('a',class_='title').text)
    prices.append(i.find('h4',class_='price float-end card-title pull-right').text)
    descriptions.append(i.find('p',class_="description card-text").text)


df=pd.DataFrame({'names':names,'prices':prices,'descriptions':descriptions})
df

| | names | prices | descriptions |
|---|---|---|---|
| 0 | Lenovo IdeaTab | $69.99 | 7" screen, Android |
| 1 | IdeaTab A3500L | $88.99 | Black, 7" IPS, Quad-Core 1.2GHz, 8GB, Android 4.2 |
| 2 | Acer Iconia | $96.99 | 7" screen, Android, 16GB |
| 3 | Galaxy Tab 3 | $97.99 | 7", 8GB, Wi-Fi, Android 4.2, White |
| 4 | Iconia B1-730H... | $99.99 | Black, 7", 1.6GHz Dual-Core, 8GB, Android 4.4 |
| 5 | Memo Pad HD 7 | $101.99 | IPS, Dual-Core 1.2GHz, 8GB, Android 4.3 |
| 6 | Asus MeMO Pad | $102.99 | 7" screen, Android, 8GB |
| 7 | Amazon Kindle | $103.99 | 6" screen, wifi |
| 8 | Galaxy Tab 3 | $107.99 | 7", 8GB, Wi-Fi, Android 4.2, Yellow |
| 9 | IdeaTab A8-50 | $121.99 | Blue, 8" IPS, Quad-Core 1.3GHz, 16GB, Android 4.2 |


In [76]:
# Export the DataFrame to a CSV file:
df.to_csv('tablets.csv', index=False)  # index=False to exclude the row numbers

# **BeautifulSoup vs Selenium for Web Scraping**

BeautifulSoup is ideal for simple, quick, and lightweight scraping tasks on static websites.
Selenium is more powerful and flexible, suitable for scraping dynamic content, interacting with web elements, and automating complex tasks on the web.