# Web Scraping
- Web scraping is a technique for extracting large amounts of data from websites.
- The extracted data can be stored in a structured format such as CSV.
- We can then use the extracted data as needed.
- For example, we can collect data from e-commerce portals and social media platforms to understand customer behavior, sentiment, and buying patterns, which are critical insights for any business.

# What is Web Scraping?
- Web scraping is an automated technique used to extract large amounts of data from websites.
- Data on websites is mostly unstructured. Web scraping collects this unstructured data and stores it in a structured form.
- There are different ways to scrape websites, such as online services, APIs, or writing your own code.
![image.png](attachment:image.png)

# Is Web Scraping Legal?
- Some websites explicitly allow web scraping, and some forbid it.
- So, before scraping a website, read its terms and conditions regarding web scraping and proceed accordingly.
- Many websites publish a robots.txt file at the root of the site. This file tells automated crawlers which parts of the site they may or may not access.
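As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules, the user agent name `my-scraper`, and the example.com URLs are made up for illustration; a real script would load the site's own file with `set_url()` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/menu"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

With these rules, the first URL is allowed and the second is not. Note that robots.txt only covers crawling rules; the site's terms and conditions still apply.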

# robots.txt

- **Web Page:** You can use a robots.txt file for web pages (HTML, PDF, or other non-media formats that Google can read), to manage crawling traffic if you think your server will be overwhelmed by requests from Google's crawler, or to avoid crawling unimportant or similar pages on your site.

- **Resource File:** You can use a robots.txt file to block resource files such as unimportant image, script, or style files, if you think that pages loaded without these resources will not be significantly affected by the loss. However, if the absence of these resources makes the page harder for Google's crawler to understand, don't block them, or else Google won't do a good job of analyzing pages that depend on those resources.

- **Media File:** Use a robots.txt file to manage crawl traffic, and also to prevent image, video, and audio files from appearing in Google search results. This won't prevent other pages or users from linking to your image, video, or audio file.

![image.png](attachment:image.png)

# How does Web Scraping work?
- We write code that sends a request to the server hosting the page we specified.
- Our code downloads that page's source code.
- It then filters through the page, looking for the HTML elements we've specified and extracting whatever content we've instructed it to extract.

For example, if we want to get all of the titles inside H2 tags from a website, we could write code that works as follows:

1. Our code would request the site's content from its server and download it.
2. Then it would go through the page's HTML code and look for the H2 tags.
3. Whenever it finds an H2 tag, it would copy whatever text is inside the tag, and save it in whatever format we have specified.
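The steps above can be sketched with nothing but the standard library. This example uses `html.parser.HTMLParser` rather than the libraries introduced later, and the HTML string is a made-up stand-in for a downloaded page, so treat it as an illustration of the idea and not as the approach used in the demo.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text found inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self.inside_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.inside_h2 = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == "h2":
            self.inside_h2 = False

    def handle_data(self, data):
        if self.inside_h2:
            self.headings[-1] += data  # copy the text inside the tag


# Step 1 stand-in: a hypothetical page instead of a real download.
html = (
    "<html><body><h1>Menu</h1><h2>Starters</h2><p>Soup</p>"
    "<h2>Main <em>Course</em></h2></body></html>"
)

# Steps 2 and 3: walk the HTML and save the text of each H2 tag.
collector = H2Collector()
collector.feed(html)
headings = [h.strip() for h in collector.headings]
print(headings)  # ['Starters', 'Main Course']
```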

# Components Of A Web Page
A web page is generally made up of four types of files:

1. HTML file: It contains the main content of the web page
2. CSS file: It styles the web page
3. JS file: JavaScript brings interactivity to the web page
4. Image files: JPG/PNG files for the images shown on the web page

Since we are interested in extracting data, we will work with the HTML file, because it contains the main content of the web page.

# Basics of HTML Structure

**HTML is the standard markup language for creating Web pages.**

###### What is HTML?

- HTML stands for Hyper Text Markup Language
- HTML describes the structure of a Web page
- HTML consists of a series of elements
- HTML elements tell the browser how to display the content
- HTML elements label pieces of content such as "this is a heading", "this is a paragraph", "this is a link", etc.

###### A Simple HTML Document
![image.png](attachment:image.png)

###### Example Explained

- The `<!DOCTYPE html>` declaration defines that this document is an HTML5 document
- The `<html>` element is the root element of an HTML page
- The `<head>` element contains meta information about the HTML page
- The `<title>` element specifies a title for the HTML page (which is shown in the browser's title bar or in the page's tab)
- The `<body>` element defines the document's body, and is a container for all the visible contents, such as headings, paragraphs, images, hyperlinks, tables, lists, etc.
- The `<h1>` element defines a large heading
- The `<p>` element defines a paragraph

###### What is an HTML Element?
![image.png](attachment:image.png)

###### Web Browsers
The purpose of a web browser (Chrome, Edge, Firefox, Safari) is to read HTML documents and display them correctly.

A browser does not display the HTML tags, but uses them to determine how to display the document:
![image.png](attachment:image.png)

###### HTML Page Structure
Below is a visualization of an HTML page structure:
![image-2.png](attachment:image-2.png)

Let's look at what a typical HTML structure might look like:

![image-3.png](attachment:image-3.png)

# Libraries for Web Scraping
**There are many libraries we can use for web scraping. We will look at the most popular ones.**
- **Selenium:** A web testing library that automates browser activity. Through WebDriver, Selenium supports all major browsers, such as Chrome, Firefox, Internet Explorer, Opera, and Safari, though most of the community uses Chrome. The web drivers let Python control the browser so that it can load and process web pages to reach the data we need.
- **BeautifulSoup:** A Python package for parsing HTML and XML documents. It builds a parse tree from the document so that data can be extracted easily.
- **Pandas:** A library for data manipulation and analysis. Here it is used to store the extracted data in the desired format.

# Demo of Web Scraping

## Beautifulsoup Practical

**First Install Libraries**

First we will install the libraries required for web scraping. We need two libraries:

1. **requests:** Used to send a GET request to the web page's server to get the source code of the page.
2. **BeautifulSoup:** Used to parse the source code and extract the required data from the parsed structure.

In [1]:
!pip install bs4
!pip install requests



### Import Required Libraries

In [2]:
# Importing the required libraries
from bs4 import BeautifulSoup
import requests

# How to View Source Code
- **Firefox:** CTRL + U (hold down the CTRL key and press the “u” key)
- **Edge/Internet Explorer:** CTRL + U, or right-click and select “View Source”
- **Chrome:** CTRL + U
- **Opera:** CTRL + U
![image-2.png](attachment:image-2.png)

## Send a GET request to the web page's server to get the source code of the page

In [3]:
page = requests.get('https://www.dineout.co.in/delhi-restaurants/buffet-special')

page

<Response [200]>

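A `<Response [200]>` means the request succeeded. As a hedged sketch, a request can be wrapped in a small helper that sets a timeout and fails loudly on an error status. The helper name `fetch_html` and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download a page and return its raw content, raising on HTTP errors."""
    # Hypothetical User-Agent; identify your own scraper here.
    headers = {"User-Agent": "my-scraper/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently handing an error page to the parser.
    response.raise_for_status()
    return response.content


# Usage (performs a real network request, so it is left commented out):
# content = fetch_html('https://www.dineout.co.in/delhi-restaurants/buffet-special')
```

Without a `timeout`, `requests.get` can wait indefinitely on an unresponsive server, which is why the sketch sets one.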
### Page content

In [4]:
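# Name the parser explicitly; otherwise BeautifulSoup guesses one and emits a warning
soup = BeautifulSoup(page.content, 'html.parser')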
soup

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/><link href="/manifest.json" rel="manifest"/><style type="text/css">
            @font-face {
                font-family: 'dineicon';
                src:  url('/fonts/dineicon.eot');
                src:  url('/fonts/dineicon.eot#iefix') format('embedded-opentype'),
                url('/fonts/dineicon.ttf') format('truetype'),
                url('/fonts/dineicon.woff') format('woff'),
                url('/fonts/dineicon.svg#dineicon') format('svg');
                font-weight: normal;
				font-style: normal;
				font-display: swap;
            }
            .hide {
                display: none !important;
            }
            .async-hide{
                opacity: inherit !important;
            }
            iframe[name="google_conversion_frame"]{
        

### Scraping First Name

In [5]:
# First, we find the <a> tag that holds the name of the first restaurant.

first_title = soup.find('a',class_="restnt-name ellipsis")
first_title

<a analytics-action="RestaurantCardClick" analytics-label="86792_Castle Barbeque" class="restnt-name ellipsis" data-w-onclick="sendAnalyticsCommon|w1-restarant" href="/delhi/castle-barbeque-connaught-place-central-delhi-86792">Castle Barbeque</a>

In [6]:
first_title.text

'Castle Barbeque'
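One thing to keep in mind is that `find` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`. Here is a minimal sketch of a guarded lookup; the helper name `first_text` and the inline HTML are hypothetical and only mimic the structure seen above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the restaurant link shown above.
html = '<div><a class="restnt-name ellipsis" href="/delhi/example">Castle Barbeque</a></div>'
soup = BeautifulSoup(html, "html.parser")


def first_text(soup, tag, css_class):
    """Return the stripped text of the first matching tag, or None if absent."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return None
    return element.get_text(strip=True)


found = first_text(soup, "a", "restnt-name ellipsis")
missing = first_text(soup, "a", "no-such-class")
print(found)    # Castle Barbeque
print(missing)  # None
```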

![Untitled.png](attachment:Untitled.png)

### Scraping first location

In [7]:
loc = soup.find('div',class_="restnt-loc ellipsis")
loc.text

'Connaught Place, Central Delhi'

![image.png](attachment:image.png)

### Scraping first price


In [9]:
sta = soup.find('span',class_="double-line-ellipsis")
sta.text

'₹ 2,000 for 2 (approx) | Chinese, North Indian'

In [12]:
# The text above contains both the price and the cuisines.
# We only want the price here, so we split the text on '|' and keep the first part: sta.text.split('|')[0]

sta = soup.find('span',class_="double-line-ellipsis")
sta.text.split('|')[0]

'₹ 2,000 for 2 (approx) '

In [13]:
# If we need only the other part (the cuisines), we can change [0] to [1]
sta = soup.find('span',class_="double-line-ellipsis")
sta.text.split('|')[1]

' Chinese, North Indian'
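The split results above still carry leading or trailing spaces. As a small sketch, the same string can be separated and cleaned in one pass with `str.partition` and `strip`; extracting the numeric amount with a regular expression is an optional extra shown here as an assumption about what you may want downstream.

```python
import re

raw = "₹ 2,000 for 2 (approx) | Chinese, North Indian"

# partition splits on the first '|' and always returns three parts,
# so it does not fail when the separator is missing.
price_part, _, cuisine_part = raw.partition("|")
price_text = price_part.strip()
cuisines = [c.strip() for c in cuisine_part.split(",") if c.strip()]

# Optional: pull the first number (with thousands separators) out of the price.
match = re.search(r"[\d,]+", price_text)
amount = int(match.group().replace(",", "")) if match else None

print(price_text)  # ₹ 2,000 for 2 (approx)
print(cuisines)    # ['Chinese', 'North Indian']
print(amount)      # 2000
```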

On the web page below, the element we want has no class of its own. In such cases we use the class of its parent element to locate it.
![image.png](attachment:image.png)
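A minimal sketch of the parent-class approach: find the parent by its class, then search inside it for the child tag. The markup and the class name `restnt-rating` are hypothetical stand-ins, since the real element is only shown in the screenshot.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <span> has no class, but its parent <div> does.
html = """
<div class="restnt-rating">
  <span>4.2</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the parent by class, then look for the child inside it.
parent = soup.find("div", class_="restnt-rating")
child = parent.find("span") if parent is not None else None
rating = child.get_text(strip=True) if child is not None else None

# The same lookup written as a CSS selector.
selected = soup.select_one("div.restnt-rating > span")

print(rating)  # 4.2
```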

### Scraping Multiple Titles

In [15]:
# find_all returns every tag that holds a restaurant name.

# We extract the text from these tags by looping over them.

titles = []

for i in soup.find_all('a',class_="restnt-name ellipsis"):
    titles.append(i.text)

titles[1:2] 

['Jungle Jamboree']

In [16]:
titles

['Castle Barbeque',
 'Jungle Jamboree',
 'Cafe Knosh',
 'Castle Barbeque',
 'The Barbeque Company',
 'India Grill',
 'Delhi Barbeque',
 'The Monarch - Bar Be Que Village',
 'Indian Grill Room']

### Scraping multiple locations

In [17]:
location = [] #empty list

for i in soup.find_all('div',class_="restnt-loc ellipsis"): 
    location.append(i.text)

location

['Connaught Place, Central Delhi',
 '3CS Mall,Lajpat Nagar - 3, South Delhi',
 'The Leela Ambience Convention Hotel,Shahdara, East Delhi',
 'Pacific Mall,Tagore Garden, West Delhi',
 'Gardens Galleria,Sector 38A, Noida',
 'Hilton Garden Inn,Saket, South Delhi',
 'Taurus Sarovar Portico,Mahipalpur, South Delhi',
 'Indirapuram Habitat Centre,Indirapuram, Ghaziabad',
 'Suncity Business Tower,Golf Course Road, Gurgaon']

### Scraping multiple prices

In [19]:
price = []

for i in soup.find_all('span',class_="double-line-ellipsis"):
    price.append(i.text.split('|')[0])

price

['₹ 2,000 for 2 (approx) ',
 '₹ 1,680 for 2 (approx) ',
 '₹ 3,000 for 2 (approx) ',
 '₹ 2,000 for 2 (approx) ',
 '₹ 1,700 for 2 (approx) ',
 '₹ 2,400 for 2 (approx) ',
 '₹ 1,800 for 2 (approx) ',
 '₹ 1,900 for 2 (approx) ',
 '₹ 2,200 for 2 (approx) ']

### Scraping the images

In [20]:
images = []
for i in soup.find_all("img",class_="no-img"):
    images.append(i.get('data-src'))

images

['https://im1.dineout.co.in/images/uploads/restaurant/sharpen/8/k/b/p86792-16062953735fbe1f4d3fb7e.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/5/p/m/p59633-166088382462ff137009010.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/4/p/m/p406-15438184745c04ccea491bc.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/3/j/o/p38113-15959192065f1fcb666130c.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/7/p/k/p79307-16051787755fad1597f2bf9.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/2/v/t/p2687-1482477169585cce712b90f.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/5/d/i/p52501-1661855212630de5eceb6d2.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/restaurant/sharpen/3/n/o/p34822-15599107305cfa594a13c24.jpg?tr=tr:n-medium',
 'https://im1.dineout.co.in/images/uploads/

![image.png](attachment:image.png)
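The code above reads `data-src` rather than `src`. A likely reason, stated here as an assumption, is that the page lazy-loads its images and keeps the real URL in `data-src`. A hedged sketch that falls back to `src` and skips tags with neither attribute, using made-up markup and example.com URLs:

```python
from bs4 import BeautifulSoup

# Hypothetical markup covering three cases: lazy-loaded, plain, and empty.
html = """
<img class="no-img" data-src="https://example.com/a.jpg" src="placeholder.png">
<img class="no-img" src="https://example.com/b.jpg">
<img class="no-img">
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for img in soup.find_all("img", class_="no-img"):
    # Tag.get returns None when the attribute is missing.
    url = img.get("data-src") or img.get("src")
    if url:
        urls.append(url)

print(urls)
```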

### Printing length

In [21]:
print(len(titles),len(location),len(images))


9 9 9
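Equal lengths matter because the lists are later combined row by row. If one restaurant were missing a field, separately scraped lists would silently shift out of alignment. One way to avoid that, sketched here under the assumption that each restaurant sits in its own container element (the class name `card` is hypothetical), is to loop over the containers and read every field from inside each one.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card has no location element.
html = """
<div class="card">
  <a class="restnt-name ellipsis">Castle Barbeque</a>
  <div class="restnt-loc ellipsis">Connaught Place, Central Delhi</div>
</div>
<div class="card">
  <a class="restnt-name ellipsis">Jungle Jamboree</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(element):
    """Return stripped text for a tag, or None when the tag is missing."""
    return element.get_text(strip=True) if element is not None else None


rows = []
for card in soup.find_all("div", class_="card"):
    rows.append({
        "Titles": text_or_none(card.find("a", class_="restnt-name ellipsis")),
        "Location": text_or_none(card.find("div", class_="restnt-loc ellipsis")),
    })

print(rows)
```

Each row keeps its own fields together, and a missing value shows up as `None` in the right row instead of shifting the rows after it.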


## Making dataframe


In [24]:
import pandas as pd

df = pd.DataFrame({'Titles':titles,'Location':location,'Price':price,'Images_url':images})
df

| | Titles | Location | Price | Images_url |
|---|---|---|---|---|
| 0 | Castle Barbeque | Connaught Place, Central Delhi | ₹ 2,000 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 1 | Jungle Jamboree | 3CS Mall,Lajpat Nagar - 3, South Delhi | ₹ 1,680 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 2 | Cafe Knosh | The Leela Ambience Convention Hotel,Shahdara, ... | ₹ 3,000 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 3 | Castle Barbeque | Pacific Mall,Tagore Garden, West Delhi | ₹ 2,000 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 4 | The Barbeque Company | Gardens Galleria,Sector 38A, Noida | ₹ 1,700 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 5 | India Grill | Hilton Garden Inn,Saket, South Delhi | ₹ 2,400 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 6 | Delhi Barbeque | Taurus Sarovar Portico,Mahipalpur, South Delhi | ₹ 1,800 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 7 | The Monarch - Bar Be Que Village | Indirapuram Habitat Centre,Indirapuram, Ghaziabad | ₹ 1,900 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
| 8 | Indian Grill Room | Suncity Business Tower,Golf Course Road, Gurgaon | ₹ 2,200 for 2 (approx) | https://im1.dineout.co.in/images/uploads/resta... |
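To store the result in a structured format such as CSV, the DataFrame can be exported with `to_csv`. This is a minimal sketch on two of the rows above; the file name `restaurants.csv` in the commented line is a hypothetical choice.

```python
import pandas as pd

# Two rows taken from the scraped results, for illustration.
df = pd.DataFrame({
    "Titles": ["Castle Barbeque", "Jungle Jamboree"],
    "Location": ["Connaught Place, Central Delhi",
                 "3CS Mall,Lajpat Nagar - 3, South Delhi"],
    "Price": ["₹ 2,000 for 2 (approx)", "₹ 1,680 for 2 (approx)"],
})

# With no path, to_csv returns the CSV as a string.
# index=False leaves out the row-number column.
csv_text = df.to_csv(index=False)
print(csv_text)

# To write a file instead (hypothetical file name):
# df.to_csv("restaurants.csv", index=False, encoding="utf-8-sig")
```

Passing `index=False` avoids an extra unnamed index column when the file is read back later. Values that contain commas are quoted automatically.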

