# Web Scraping in Python
There are several ways to extract information from the web. Using an API is probably the best way to extract data from a website. Almost all large websites, such as Twitter, Facebook, Google, and StackOverflow, provide APIs to access their data in a more structured manner. If you can get what you need through an API, it is almost always the preferred approach over web scraping: when the provider already gives you structured data, there is little reason to build an engine that extracts the same information.


Sadly, not all websites provide an API. Some choose not to because they do not want readers to extract large amounts of information in a structured way, while others lack the technical knowledge to provide one.


Web scraping is a method of gathering data from websites with bots that crawl the pages, and then organizing that data in a logical manner. In practice, almost any publicly available website can be scraped.


Because scraping bots replicate, in a limited way, the actions of a human visitor, sophisticated web scrapers can scrape almost any web page currently found online.


While scraping a website, however, you can run into some difficulties:


Your scraper may be identified as a bot.

Your IP address may be banned.

If you make a lot of requests, the server may start denying them.
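
One common way to reduce these problems is to identify the client honestly, pause between requests, and back off when the server answers with 429 (Too Many Requests) or 503 (Service Unavailable). The following is only a minimal sketch: the names `polite_get` and `fetch`, the User-Agent string, and the delay values are assumptions made for illustration, not settings recommended by any particular website.

```python
import time

# Hypothetical header; replace the contact details with your own.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: you@example.com)"}


def polite_get(url, fetch, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) and retry with exponential backoff on HTTP 429 or 503.

    `fetch` is any callable that returns an object with a `status_code`
    attribute, for example:
        lambda u, h: requests.get(u, headers=h, timeout=10)
    """
    for attempt in range(max_retries + 1):
        response = fetch(url, HEADERS)
        if response.status_code not in (429, 503):
            return response
        if attempt < max_retries:
            # Wait base_delay, then twice as long after each further failure.
            sleep(base_delay * 2 ** attempt)
    # Retries exhausted: hand back the last response so the caller can inspect it.
    return response
```

Passing `fetch` in as an argument keeps the retry logic independent of the HTTP library, so it can be exercised with a fake response object before it is pointed at a real site.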

**Web scraping** means extracting information from a website. It is also sometimes referred to as web harvesting or web data extraction. Generally, the extraction is automated with the help of web crawlers.


Web scraping allows you to convert unstructured data on the web (present in HTML format) into structured data (such as a database or spreadsheet).


#### Can you scrape from all the websites?
Scraping can make a website's traffic spike and, in the worst case, overload its server. Thus, not all websites allow people to scrape. How do you know what a website allows? Look at its `robots.txt` file: append `/robots.txt` to the root URL of the site (for example, `https://example.com/robots.txt`) and you will see which paths the host allows or disallows for crawlers.
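
The same check can be done in code with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` from a string so that it runs without network access; the rules, the user agent name `my-scraper`, and the `example.com` URLs are all assumptions for illustration. For a real site you would call `set_url()` and `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Example rules, made up for illustration; a real file lives at
# https://<site>/robots.txt and can be loaded with set_url() + read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Ask whether a given user agent may fetch a given URL.
can_fetch_search = parser.can_fetch("my-scraper", "https://example.com/search?q=laptops")
can_fetch_checkout = parser.can_fetch("my-scraper", "https://example.com/checkout/cart")
print(can_fetch_search, can_fetch_checkout)
```

Note that `robots.txt` only states the host's wishes for crawlers; a site's terms of service may add further restrictions.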

## 3 Popular Tools and Libraries used for Web Scraping in Python
You’ll come across multiple libraries and frameworks in Python for web scraping. Here are three popular ones that do the task with efficiency and aplomb:

### BeautifulSoup
Beautiful Soup is a pure Python library for extracting structured data from a website. It allows you to parse (pull) data from HTML and XML files.

### Scrapy
Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.

### Selenium
Selenium is a popular tool for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping.

# Web scraping and Data Analysis

#### Life Cycle of the Data Science Project:

1) Business Understanding (understand the business functionalities before going further)

2) Data Requirement (know the important variables)

3) Data Collection (here we use web scraping)

4) Data Cleaning (missing data, preprocessing)

5) Data Analysis (EDA, to gain insight into the data)

6) Model Building (Apply ML Models)

7) Model Evaluation (Verify Model performance)

8) Deployment (Deploy the best model)



Note: Steps 1 to 5 usually take about 70% of the total project time.

Web scraping is an automated method to extract large amounts of data from websites.


**Basic Steps:**


1) Identify the URL from which you need the data

2) Inspect the code behind the page

3) Find the elements/data you want to extract

4) Write and execute the code to store the data in the required format


**Python libraries required:**


**requests:** This is used to fetch the HTML code from the given URL.

**BeautifulSoup:** This is used to parse and navigate the HTML content.

**re:** To handle the text data.

**numpy & pandas:** To represent missing values and store the extracted data in a data frame.

HyperText Markup Language (HTML) is the language that web pages are created in. Unlike Python, HTML isn’t a programming language; it’s a markup language that tells a browser how to display content. HTML consists of elements called tags. The most basic tag is the `<html>` tag, which tells the web browser that everything inside of it is HTML. Right inside an `<html>` tag, we can put two other tags: the `<head>` tag and the `<body>` tag.


The main content of the web page goes into the `<body>` tag. The `<head>` tag contains data about the page, such as its title. The `<p>` tag defines a paragraph, and any text inside the tag is shown as a separate paragraph.


## Sample HTML Code
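
The tags described above can be seen in a small, made-up page. The sketch below is not taken from any real website; it only shows how the `html`, `head`, `body` and `p` tags nest and how BeautifulSoup reads them back. The variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page showing the html, head, body and p tags.
sample_html = """
<html>
  <head>
    <title>Sample page</title>
  </head>
  <body>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

# "html.parser" is the parser that ships with Python itself.
sample_soup = BeautifulSoup(sample_html, "html.parser")

page_title = sample_soup.title.text                        # text inside <title>
paragraphs = [p.text for p in sample_soup.find_all("p")]   # text of every <p>
print(page_title)
print(paragraphs)
```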

In [6]:
# load the required libraries
import numpy as np
import pandas as pd
import re
import requests
from bs4 import BeautifulSoup

import time

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [7]:
URL = 'https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'

In [8]:
page  = requests.get(URL)

In [9]:
page.status_code  # 200 means the request succeeded; it does not by itself mean scraping is permitted

200

In [10]:
page.text

'<!doctype html><html lang="en"><head><link href="https://rukminim1.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/www/linchpin/fk-cp-zion/css/app_modules.chunk.94b5e7.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.6e7580.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><meta name="robots" content="noodp"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" href="/appl

In [11]:
pagecontent = page.text

In [12]:
soup = BeautifulSoup(pagecontent, 'html.parser')  # name the parser explicitly to avoid a warning
soup

<!DOCTYPE html>
<html lang="en"><head><link href="https://rukminim1.flixcart.com" rel="preconnect"/><link href="//static-assets-web.flixcart.com/www/linchpin/fk-cp-zion/css/app_modules.chunk.94b5e7.css" rel="stylesheet"/><link href="//static-assets-web.flixcart.com/www/linchpin/fk-cp-zion/css/app.chunk.6e7580.css" rel="stylesheet"/><meta content="text/html; charset=utf-8" http-equiv="Content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="102988293558" property="fb:page_id"/><meta content="658873552,624500995,100000233612389" property="fb:admins"/><meta content="noodp" name="robots"/><link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/><link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/><meta content="website" property="og:type"/><meta content="Flipkart.com" name="og_site_name" property="og:site_name"/><link href="/apple-touch-icon-57x57.png" rel="apple-to

In [13]:
soup.find('div', attrs={'class': '_3pLy-c row'})  # the div that holds one product's details

<div class="_3pLy-c row"><div class="col col-7-12"><div class="_4rR01T">MSI Stealth 15M Core i7 11th Gen - (16 GB/1 TB SSD/Windows 10 Home/6 GB Graphics/NVIDIA GeForce RTX 30...</div><div class="gUuXy-"><span class="_1lRcqv" id="productRating_LSTCOMG2HCTYYGFHZHMAJQO9M_COMG2HCTYYGFHZHM_"><div class="_3LWZlK">4.2</div></span><span class="_2_R_DZ"><span><span>26 Ratings </span><span class="_13vcmD">&amp;</span><span> 4 Reviews</span></span></span></div><div class="fMghEO"><ul class="_1xgFaf"><li class="rgWa7D">Intel Core i7 Processor (11th Gen)</li><li class="rgWa7D">16 GB DDR4 RAM</li><li class="rgWa7D">64 bit Windows 10 Operating System</li><li class="rgWa7D">1 TB SSD</li><li c

In [14]:
type(page)
type(pagecontent)
type(soup)

requests.models.Response

str

bs4.BeautifulSoup

In [15]:
soup.find_all('div',attrs={'class':'_3pLy-c row'})

[<div class="_3pLy-c row"><div class="col col-7-12"><div class="_4rR01T">MSI Stealth 15M Core i7 11th Gen - (16 GB/1 TB SSD/Windows 10 Home/6 GB Graphics/NVIDIA GeForce RTX 30...</div><div class="gUuXy-"><span class="_1lRcqv" id="productRating_LSTCOMG2HCTYYGFHZHMAJQO9M_COMG2HCTYYGFHZHM_"><div class="_3LWZlK">4.2</div></span><span class="_2_R_DZ"><span><span>26 Ratings </span><span class="_13vcmD">&amp;</span><span> 4 Reviews</span></span></span></div><div class="fMghEO"><ul class="_1xgFaf"><li class="rgWa7D">Intel Core i7 Processor (11th Gen)</li><li class="rgWa7D">16 GB DDR4 RAM</li><li class="rgWa7D">64 bit Windows 10 Operating System</li><li class="rgWa7D">1 TB SSD</li><li 

In [16]:
# find the first PageElement that matches the given criteria.
soup.find('div', attrs={'class':"_3pLy-c row"})

<div class="_3pLy-c row"><div class="col col-7-12"><div class="_4rR01T">MSI Stealth 15M Core i7 11th Gen - (16 GB/1 TB SSD/Windows 10 Home/6 GB Graphics/NVIDIA GeForce RTX 30...</div><div class="gUuXy-"><span class="_1lRcqv" id="productRating_LSTCOMG2HCTYYGFHZHMAJQO9M_COMG2HCTYYGFHZHM_"><div class="_3LWZlK">4.2</div></span><span class="_2_R_DZ"><span><span>26 Ratings </span><span class="_13vcmD">&amp;</span><span> 4 Reviews</span></span></span></div><div class="fMghEO"><ul class="_1xgFaf"><li class="rgWa7D">Intel Core i7 Processor (11th Gen)</li><li class="rgWa7D">16 GB DDR4 RAM</li><li class="rgWa7D">64 bit Windows 10 Operating System</li><li class="rgWa7D">1 TB SSD</li><li c

In [17]:
# store the first product's details

x = soup.find('div', attrs={'class':'_3pLy-c row'})

In [18]:
x.text

'MSI Stealth 15M Core i7 11th Gen - (16 GB/1 TB SSD/Windows 10 Home/6 GB Graphics/NVIDIA GeForce RTX 30...4.226 Ratings\xa0&\xa04 ReviewsIntel Core i7 Processor (11th Gen)16 GB DDR4 RAM64 bit Windows 10 Operating System1 TB SSD39.62 cm (15.6 inch) DisplaySilver-Lining Print, Cooler Boost, Audio Boost, Dragon Center, Nahimic 32 Year Warranty Term for Gaming & Content Creation (EU-WE)₹1,20,990₹1,62,99025% offFree deliveryUpto ₹18,100 Off on ExchangeBank Offer'
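
In the output above the fields run together (the rating `4.2` is fused with `26 Ratings`) because `.text` concatenates every nested string without a separator. BeautifulSoup's `get_text()` accepts `separator` and `strip` arguments that keep the pieces apart. The sketch below uses a made-up, shortened fragment that imitates one product card; the class names are copied from the output above, and everything else is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Shortened imitation of one product card from the search results.
fragment = """
<div class="_3pLy-c row">
  <div class="_4rR01T">MSI Stealth 15M Core i7 11th Gen</div>
  <div class="_3LWZlK">4.2</div>
  <span class="_2_R_DZ"><span>26 Ratings </span><span>&amp;</span><span> 4 Reviews</span></span>
</div>
"""

card = BeautifulSoup(fragment, "html.parser").find("div", attrs={"class": "_3pLy-c row"})

# Default behaviour: all strings are joined with nothing in between.
joined = card.get_text()

# Strip each string, drop the empty ones, and join the rest with a visible separator.
separated = card.get_text(separator=" | ", strip=True)
print(separated)
```

Extracting each field with its own `find()` call, as the later cells do, avoids the problem entirely; the separator is mainly useful for quick inspection.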

In [19]:
for x in soup.find_all('div',attrs={'class':'_3pLy-c row'}):
    print(x.text,'\n\n')

MSI Stealth 15M Core i7 11th Gen - (16 GB/1 TB SSD/Windows 10 Home/6 GB Graphics/NVIDIA GeForce RTX 30...4.226 Ratings & 4 ReviewsIntel Core i7 Processor (11th Gen)16 GB DDR4 RAM64 bit Windows 10 Operating System1 TB SSD39.62 cm (15.6 inch) DisplaySilver-Lining Print, Cooler Boost, Audio Boost, Dragon Center, Nahimic 32 Year Warranty Term for Gaming & Content Creation (EU-WE)₹1,20,990₹1,62,99025% offFree deliveryUpto ₹18,100 Off on ExchangeBank Offer 


acer Aspire 7 Core i5 10th Gen - (8 GB/512 GB SSD/Windows 10 Home/4 GB Graphics/NVIDIA GeForce GTX 165...4.412,081 Ratings & 1,306 ReviewsFree upgrade to Windows 11 when availableIntel Core i5 Processor (10th Gen)8 GB DDR4 RAM64 bit Windows 10 Operating System512 GB SSD39.62 cm (15.6 inch) DisplayQuick Access, Acer Care Center, Acer Product Registration, Acer Collection1 Year International Travelers Warranty₹52,990₹89,99941% offFree deliveryUpto ₹18,100 Off on ExchangeBank Offer 


ASUS VivoBook 15 Core i3 10th Gen - (8 GB/1 TB HDD/Wind

In [20]:
# Extracting the data from all pages

titles = []             # names of the products
prices = []             # prices of the products
ratings = []            # ratings of the products
num_of_rev_rating = []  # number of ratings & reviews of the products
specifications = []     # specifications of the products
pgno = []               # page number on which each product was found

for i in range(1, 41):
    start_time = time.time()
    URL = 'https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={}'.format(i)
    pagecontent = requests.get(URL).text
    soup = BeautifulSoup(pagecontent, 'html.parser')

    # each div with this class holds the details of one product
    for x in soup.find_all('div', attrs={'class': '_3pLy-c row'}):
        # title: _4rR01T
        title = x.find('div', attrs={'class': '_4rR01T'})
        if title is None:
            titles.append(np.nan)  # np.nan instead of np.NaN, which NumPy 2.0 removed
        else:
            titles.append(title.text)

        # rating: _3LWZlK
        rating = x.find('div', attrs={'class': '_3LWZlK'})
        if rating is None:
            ratings.append(np.nan)
        else:
            ratings.append(rating.text)

        # number of ratings & reviews: _2_R_DZ
        no_of_ratings = x.find('span', attrs={'class': '_2_R_DZ'})
        if no_of_ratings is None:
            num_of_rev_rating.append(np.nan)
        else:
            num_of_rev_rating.append(no_of_ratings.text)

        # specifications: _1xgFaf
        specs = x.find('ul', attrs={'class': '_1xgFaf'})
        if specs is None:
            specifications.append(np.nan)
        else:
            specifications.append(specs.text)

        # price: _30jeq3 _1_WHN1
        price = x.find('div', attrs={'class': '_30jeq3 _1_WHN1'})
        if price is None:
            prices.append(np.nan)
        else:
            prices.append(price.text)

        pgno.append(i)

    end_time = time.time()
    print('Page {} completed in {} seconds'.format(i, end_time - start_time))

Page 1 completed in 0.9016995429992676 seconds
Page 2 completed in 0.9451339244842529 seconds
Page 3 completed in 0.7123939990997314 seconds
Page 4 completed in 0.8988687992095947 seconds
Page 5 completed in 0.9168484210968018 seconds
Page 6 completed in 0.9552786350250244 seconds
Page 7 completed in 0.8273038864135742 seconds
Page 8 completed in 1.274430274963379 seconds
Page 9 completed in 1.0249056816101074 seconds
Page 10 completed in 0.7671785354614258 seconds
Page 11 completed in 1.0295748710632324 seconds
Page 12 completed in 1.1183106899261475 seconds
Page 13 completed in 1.111745834350586 seconds
Page 14 completed in 1.6322529315948486 seconds
Page 15 completed in 1.3131518363952637 seconds
Page 16 completed in 1.5582475662231445 seconds
Page 17 completed in 0.9168782234191895 seconds
Page 18 completed in 0.9612231254577637 seconds
Page 19 completed in 0.8331019878387451 seconds
Page 20 completed in 0.9316422939300537 seconds
Page 21 completed in 1.0739431381225586 seconds
Pag

In [25]:
# build a data frame using the lists

laptop_df = pd.DataFrame({'Title' : titles, 'Price': prices,
             'Specifications' : specifications,
             'Rating': ratings, 'No_of_R&R': num_of_rev_rating,
             'PageNo': pgno})

In [27]:
laptop_df

| | Title | Price | Specifications | Rating | No_of_R&R | PageNo |
|---|---|---|---|---|---|---|
| 0 | ASUS VivoBook 15 Core i3 10th Gen - (8 GB/1 TB... | ₹29,990 | Intel Core i3 Processor (10th Gen)8 GB DDR4 RA... | 3.5 | 142 Ratings & 25 Reviews | 1 |
| 1 | HP Ryzen 3 Dual Core 3250U - (8 GB/256 GB SSD/... | ₹34,990 | AMD Ryzen 3 Dual Core Processor8 GB DDR4 RAM64... | 4.3 | 2,731 Ratings & 294 Reviews | 1 |
| 2 | Lenovo IdeaPad 3 Core i3 11th Gen - (8 GB/512 ... | ₹39,490 | Intel Core i3 Processor (11th Gen)8 GB DDR4 RA... | 4.3 | 688 Ratings & 70 Reviews | 1 |
| 3 | ASUS VivoBook 15 (2022) Core i3 10th Gen - (8 ... | ₹37,990 | Intel Core i3 Processor (10th Gen)8 GB DDR4 RA... | 4.3 | 1,708 Ratings & 191 Reviews | 1 |
| 4 | Lenovo IdeaPad 3 Core i3 10th Gen - (8 GB/256 ... | ₹35,490 | Intel Core i3 Processor (10th Gen)8 GB DDR4 RA... | 4.3 | 2,881 Ratings & 311 Reviews | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 954 | ASUS VivoBook 17 Ryzen 5 Hexa Core 5500U - (16... | ₹62,990 | AMD Ryzen 5 Hexa Core Processor16 GB DDR4 RAM6... | 2.7 | 3 Ratings & 0 Reviews | 40 |
| 955 | HP Core i7 11th Gen - (8 GB/512 GB SSD/Windows... | ₹1,19,599 | Intel Core i7 Processor (11th Gen)8 GB DDR4 RA... | NaN | NaN | 40 |
| 956 | DELL Core i5 11th Gen - (16 GB/512 GB SSD/Wind... | ₹68,500 | Intel Core i5 Processor (11th Gen)16 GB DDR4 R... | 3.3 | 6 Ratings & 0 Reviews | 40 |
| 957 | ASUS Zenbook 14 Flip Ryzen 5 Hexa Core 5600H -... | ₹91,900 | AMD Ryzen 5 Hexa Core Processor16 GB LPDDR4X R... | NaN | NaN | 40 |
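
The `Price` and `No_of_R&R` columns are still plain text at this point. One possible way to turn them into numbers is sketched below on two made-up rows in the same shape as the data frame; the column names `Price_INR`, `Ratings_Count` and `Reviews_Count` are hypothetical, and the pattern assumes the "N Ratings & M Reviews" wording seen in the output.

```python
import pandas as pd

# Two made-up rows in the same shape as the scraped data frame.
sample_df = pd.DataFrame({
    "Price": ["₹29,990", "₹1,19,599"],
    "No_of_R&R": ["2,731 Ratings & 294 Reviews", None],
})

# Drop the currency symbol and the thousands separators, then convert to numbers.
sample_df["Price_INR"] = pd.to_numeric(
    sample_df["Price"].str.replace(r"[₹,]", "", regex=True)
)

# Capture the two counts; \s also matches the non-breaking spaces found in the
# scraped text. Rows without ratings stay missing (NaN).
counts = sample_df["No_of_R&R"].str.extract(r"([\d,]+) Ratings\s*&\s*([\d,]+) Reviews")
sample_df["Ratings_Count"] = pd.to_numeric(counts[0].str.replace(",", "", regex=False))
sample_df["Reviews_Count"] = pd.to_numeric(counts[1].str.replace(",", "", regex=False))

print(sample_df[["Price_INR", "Ratings_Count", "Reviews_Count"]])
```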


In [23]:
laptop_df.Specifications[0][0:34]
laptop_df.Specifications[1][0:34]
laptop_df.Specifications[200][0:34]
laptop_df.Specifications[300][0:34]
laptop_df.Specifications[900]

'Intel Core i3 Processor (10th Gen)'

'AMD Ryzen 3 Dual Core Processor8 G'

'AMD Ryzen 7 Octa Core Processor16 '

'AMD Ryzen 5 Hexa Core Processor8 G'

'Intel Core i7 Processor (10th Gen)8 GB DDR4 RAM64 bit Windows 10 Operating System512 GB SSD39.62 cm (15.6 Inch) DisplayMicrosoft Office (Trial Only)1 Year Warranty + 1 Year Premium Care + 1 Year ADP'

## Extract Processor

In [24]:
# Extract processor details

myregex = re.compile(r'[A-Za-z0-9\s]+Processor')
myregex.findall(laptop_df.Specifications[36])

['Intel Core i3 Processor']
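
The same pattern can be applied to every row at once with pandas' `Series.str.extract`, which needs the pattern wrapped in a capture group. The sketch below runs on a few made-up specification strings rather than on the scraped data, and the variable names are chosen for illustration.

```python
import pandas as pd

# Made-up specification strings in the same style as the scraped column.
specs = pd.Series([
    "Intel Core i3 Processor (10th Gen)8 GB DDR4 RAM",
    "AMD Ryzen 3 Dual Core Processor8 GB DDR4 RAM",
    None,  # a product whose specifications were missing
])

# str.extract needs a capture group; expand=False returns a Series.
# Rows that do not match (or are missing) come back as NaN.
processors = specs.str.extract(r"([A-Za-z0-9\s]+Processor)", expand=False)
print(processors.tolist())
```

On the scraped data, the equivalent call would presumably be made on `laptop_df['Specifications']` and the result assigned to a new column.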