# **Web Scraping**
### requests - Fetches the HTML code from a given URL
### BeautifulSoup - Parses the HTML and extracts data from it
## **Steps**

1.   Identify URL.
2.   Inspect HTML code.
3.   Find the HTML tag for the element that you want to extract.
4.   Write some code to scrape this data.
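
The four steps can be sketched end to end on a tiny, made-up HTML snippet, so that nothing depends on a live site. The markup and the `name` and `price` class names below are assumptions for illustration only, not Flipkart's actual structure.

```python
from bs4 import BeautifulSoup

# Steps 1-2: a hypothetical page, standing in for the HTML you would inspect.
sample_html = """
<html><body>
  <div class="product">
    <div class="name">Example Laptop</div>
    <div class="price">₹34,990</div>
  </div>
</body></html>
"""

# Step 3: the element we want lives in <div class="price">.
soup = BeautifulSoup(sample_html, "html.parser")
price_tag = soup.find("div", attrs={"class": "price"})

# Step 4: pull the text out of the tag.
price_text = price_tag.text
print(price_text)
```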


In [2]:
# Installing BeautifulSoup
# Uncomment the line below to install bs4
#! pip install bs4

In [1]:
# Loading required libraries

import numpy as np
import pandas as pd

import requests
from bs4 import BeautifulSoup

In [3]:
# Identify the URL

URL = 'https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off'


In [4]:
# Loading the WebPage in Memory using requests library

page = requests.get(URL)


In [5]:
# Check the Status Code of the Page

page.status_code


200
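
A status code of 200 means the request succeeded. Rather than checking it by eye, the check can be folded into a small helper. This is a minimal sketch: the function name `fetch_html`, the `User-Agent` value, and the 10 second timeout are assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML of `url`, raising an error for 4xx/5xx responses."""
    # Hypothetical User-Agent string; some sites may respond differently
    # to the default one sent by requests.
    headers = {"User-Agent": "Mozilla/5.0 (tutorial example)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.exceptions.HTTPError on 4xx/5xx.
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out):
# htmlCode = fetch_html(URL)
```

With this helper a failed request stops the notebook immediately instead of silently handing an error page to the parser.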

In [6]:
# Extracting the HTML Code of the WebPage

htmlCode = page.text


In [7]:
htmlCode

'<!doctype html><html lang="en"><head><link href="https://rukminim1.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.8f4f44.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><meta name="robots" content="noodp"/><link rel="shortcut icon" href="https:///www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.p

Let's identify the features listed below; based on them, we will scrape the relevant data from the Flipkart website.

URL = '?'

Price = '?'

Rating = '?'

Title = '?'

Feature = '?'


1.   Price = div _30jeq3 _1_WHN1
2.   Features = ul _1xgFaf
3.   Rating = div _3LWZlK
4.   Prod Title = div _4rR01T
5.   URL = https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
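
The tag and class pairs above can be kept in one dictionary so that each lookup is driven by data rather than repeated code. This is a sketch under assumptions: the `SELECTORS` name and the sample markup are hypothetical, and class names like these look auto-generated, so they may change when the site is updated.

```python
from bs4 import BeautifulSoup

# Tag and class for each field, copied from the list above.
SELECTORS = {
    "Price": ("div", "_30jeq3 _1_WHN1"),
    "Features": ("ul", "_1xgFaf"),
    "Rating": ("div", "_3LWZlK"),
    "Title": ("div", "_4rR01T"),
}

# Hypothetical markup that mimics a single product entry.
sample_html = """
<div class="_4rR01T">Example Laptop</div>
<div class="_3LWZlK">4.2</div>
<div class="_30jeq3 _1_WHN1">₹34,990</div>
<ul class="_1xgFaf"><li>8 GB RAM</li><li>512 GB SSD</li></ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

record = {}
for field, (tag_name, css_class) in SELECTORS.items():
    tag = soup.find(tag_name, attrs={"class": css_class})
    # find() returns None when nothing matches, so guard before using .text
    record[field] = tag.text if tag is not None else None

print(record)
```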





In [8]:
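# Parse the HTML code using the bs4 library
# (naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning)

soup = BeautifulSoup(htmlCode, 'html.parser')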


In [9]:
help(soup)

Help on BeautifulSoup in module bs4 object:

class BeautifulSoup(bs4.element.Tag)
 |  BeautifulSoup(markup='', features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None, **kwargs)
 |  
 |  A data structure representing a parsed HTML or XML document.
 |  
 |  Most of the methods you'll call on a BeautifulSoup object are inherited from
 |  PageElement or Tag.
 |  
 |  Internally, this class defines the basic interface called by the
 |  tree builders when converting an HTML/XML document into a data
 |  structure. The interface abstracts away the differences between
 |  parsers. To write a new tree builder, you'll need to understand
 |  these methods as a whole.
 |  
 |  These methods will be called by the BeautifulSoup constructor:
 |    * reset()
 |    * feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    * handle_starttag(name, attrs) # See note about return value
 |    * handle_endtag(n

# find()



In [10]:
# Price

price = soup.find('div', attrs={'class' : '_30jeq3 _1_WHN1'})

print(price.text)

₹34,990
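
The price lookup passes two class names in a single string. BeautifulSoup treats such a string as an exact match against the whole `class` attribute, so the same classes in a different order would not match. A CSS selector via `select()` matches elements that carry both classes in any order. The sample markup below is hypothetical and only illustrates the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same two classes in both orders, plus a single-class div.
sample_html = """
<div class="_30jeq3 _1_WHN1">₹34,990</div>
<div class="_1_WHN1 _30jeq3">₹84,990</div>
<div class="_30jeq3">₹46,990</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: only the first div qualifies.
exact = soup.find_all("div", attrs={"class": "_30jeq3 _1_WHN1"})

# CSS selector: any div that has both classes, regardless of order.
both = soup.select("div._30jeq3._1_WHN1")

print([tag.text for tag in exact])
print([tag.text for tag in both])
```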


In [11]:
# Product title

title = soup.find('div', attrs={'class' : '_4rR01T'})

print(title.text)

ASUS Vivobook 15 Core i3 11th Gen - (8 GB/512 GB SSD/Windows 11 Home) X515EA-EJ322WS Thin and Light La...


In [12]:
# Rating

rating = soup.find('div', attrs={'class' : '_3LWZlK'})

print(rating.text)

4.2


In [13]:
# Feature List

feature_list = soup.find('ul', attrs = {'class' : '_1xgFaf'})

print(feature_list.text)


Intel Core i3 Processor (11th Gen)8 GB DDR4 RAM64 bit Windows 11 Operating System512 GB SSD39.62 cm (15.6 Inch) Display1 Year Onsite Warranty
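
In the output above the individual features run together because `.text` concatenates the text of every child element with no separator. A possible way to keep them apart is sketched below, assuming each feature sits in its own `<li>` inside the `<ul>` (the sample markup is made up to reflect that assumption).

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assuming one <li> per feature.
sample_html = """
<ul class="_1xgFaf">
  <li>Intel Core i3 Processor (11th Gen)</li>
  <li>8 GB DDR4 RAM</li>
  <li>512 GB SSD</li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")
feature_list = soup.find("ul", attrs={"class": "_1xgFaf"})

# Option 1: one list entry per feature.
items = [li.get_text(strip=True) for li in feature_list.find_all("li")]

# Option 2: a single string with a visible separator between features.
joined = feature_list.get_text(separator=" | ", strip=True)

print(items)
print(joined)
```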


# find_all()

In [14]:
# Find All Prices

soup.find_all('div', attrs={'class' : '_30jeq3 _1_WHN1'})

[<div class="_30jeq3 _1_WHN1">₹34,990</div>,
 <div class="_30jeq3 _1_WHN1">₹84,990</div>,
 <div class="_30jeq3 _1_WHN1">₹46,990</div>,
 <div class="_30jeq3 _1_WHN1">₹38,990</div>,
 <div class="_30jeq3 _1_WHN1">₹49,990</div>,
 <div class="_30jeq3 _1_WHN1">₹21,990</div>,
 <div class="_30jeq3 _1_WHN1">₹36,158</div>,
 <div class="_30jeq3 _1_WHN1">₹44,990</div>,
 <div class="_30jeq3 _1_WHN1">₹39,990</div>,
 <div class="_30jeq3 _1_WHN1">₹33,490</div>,
 <div class="_30jeq3 _1_WHN1">₹51,990</div>,
 <div class="_30jeq3 _1_WHN1">₹38,990</div>,
 <div class="_30jeq3 _1_WHN1">₹49,990</div>,
 <div class="_30jeq3 _1_WHN1">₹54,990</div>,
 <div class="_30jeq3 _1_WHN1">₹74,990</div>,
 <div class="_30jeq3 _1_WHN1">₹1,24,990</div>,
 <div class="_30jeq3 _1_WHN1">₹50,990</div>,
 <div class="_30jeq3 _1_WHN1">₹45,490</div>,
 <div class="_30jeq3 _1_WHN1">₹52,490</div>,
 <div class="_30jeq3 _1_WHN1">₹37,490</div>,
 <div class="_30jeq3 _1_WHN1">₹43,490</div>,
 <div class="_30jeq3 _1_WHN1">₹52,490</div>,
 <div cl

In [15]:
# Find All Ratings

soup.find_all('div', attrs={'class' : '_3LWZlK'})


[<div class="_3LWZlK">4.2</div>,
 <div class="_3LWZlK">4.7</div>,
 <div class="_3LWZlK">4.4<img class="_1wB99o" src="

In [16]:
prices = soup.find_all('div', attrs = {'class' : '_30jeq3 _1_WHN1'})

print(prices)

print(type(prices))

print(type(prices[1]))

for tag in prices:
    print(tag.text)


[<div class="_30jeq3 _1_WHN1">₹34,990</div>, <div class="_30jeq3 _1_WHN1">₹84,990</div>, <div class="_30jeq3 _1_WHN1">₹46,990</div>, <div class="_30jeq3 _1_WHN1">₹38,990</div>, <div class="_30jeq3 _1_WHN1">₹49,990</div>, <div class="_30jeq3 _1_WHN1">₹21,990</div>, <div class="_30jeq3 _1_WHN1">₹36,158</div>, <div class="_30jeq3 _1_WHN1">₹44,990</div>, <div class="_30jeq3 _1_WHN1">₹39,990</div>, <div class="_30jeq3 _1_WHN1">₹33,490</div>, <div class="_30jeq3 _1_WHN1">₹51,990</div>, <div class="_30jeq3 _1_WHN1">₹38,990</div>, <div class="_30jeq3 _1_WHN1">₹49,990</div>, <div class="_30jeq3 _1_WHN1">₹54,990</div>, <div class="_30jeq3 _1_WHN1">₹74,990</div>, <div class="_30jeq3 _1_WHN1">₹1,24,990</div>, <div class="_30jeq3 _1_WHN1">₹50,990</div>, <div class="_30jeq3 _1_WHN1">₹45,490</div>, <div class="_30jeq3 _1_WHN1">₹52,490</div>, <div class="_30jeq3 _1_WHN1">₹37,490</div>, <div class="_30jeq3 _1_WHN1">₹43,490</div>, <div class="_30jeq3 _1_WHN1">₹52,490</div>, <div class="_30jeq3 _1_WHN1">

In [17]:
ratings = soup.find_all('div', attrs={'class' : '_3LWZlK'})

# print(ratings)

for tag in ratings:
    print(tag.text)


4.2
4.7
4.4
4.1
4.4
4.1
4.3
4.3
4.3
4.2
4.3
4.1
4.4
4.5
4.3
4.8
4.4
4.3
4.3
4.2
4.2
4.4
4.4
4.4
4.3
3
5
4.3
5
5
4.7
4
5
4.1
4
1
4.3
5
4


In [18]:
# find() returns only the first matching tag, not a list
first_rating = soup.find('div', attrs={'class' : '_3LWZlK'})

print(first_rating.text)


4.2
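
The difference between `find()` and `find_all()` matters most when nothing matches: `find()` returns `None`, while `find_all()` returns an empty list. A small sketch on made-up markup (the `no-such-class` name is hypothetical):

```python
from bs4 import BeautifulSoup

sample_html = '<div class="_3LWZlK">4.2</div><div class="_3LWZlK">4.7</div>'
soup = BeautifulSoup(sample_html, "html.parser")

# find() gives the first matching tag; find_all() gives every match.
first = soup.find("div", attrs={"class": "_3LWZlK"})
every = soup.find_all("div", attrs={"class": "_3LWZlK"})

# With no match, find() returns None and find_all() returns an empty list.
missing_one = soup.find("div", attrs={"class": "no-such-class"})
missing_all = soup.find_all("div", attrs={"class": "no-such-class"})

print(first.text, [tag.text for tag in every])
print(missing_one, len(missing_all))
```

Calling `.text` on the `None` returned by `find()` raises an `AttributeError`, which is why a `None` check is needed before reading the text.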


# Let's look at the URLs of the other result pages
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=2

https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=5

https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=8

https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=3

https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=10

https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=9



In [21]:
# Code

# URL pattern (only the page number at the end changes):
# https://www.flipkart.com/search?q=laptops&otracker=search
# &otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=9

for i in range(1, 31):
    print('https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={}'.format(i))


https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=2
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=3
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=4
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=5
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=6
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=7
https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=8
https://www.flipkart.com/search?
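
As an alternative to formatting one long string, the query parameters can be kept in a dictionary and encoded with the standard library. This is a sketch, and the names `BASE_URL` and `build_page_url` are assumptions introduced here for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.flipkart.com/search"


def build_page_url(page):
    """Build the search URL for one result page."""
    params = {
        "q": "laptops",
        "otracker": "search",
        "otracker1": "search",
        "marketplace": "FLIPKART",
        "as-show": "on",
        "as": "off",
        "page": page,
    }
    # urlencode joins the pairs with & and escapes special characters.
    return f"{BASE_URL}?{urlencode(params)}"


urls = [build_page_url(page) for page in range(1, 31)]
print(urls[0])
```

Keeping the parameters separate makes it easy to change the search term or the page range without editing the URL by hand.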

# Code for Web Scraping

In [22]:
# Scraping the web pages

title = []
rating = []
price = []
features = []

for i in range(1, 31):
    URL = 'https://www.flipkart.com/search?q=laptops&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page={}'.format(i)

    page = requests.get(URL)
    htmlCode = page.text

    soup = BeautifulSoup(htmlCode, 'html.parser')

    # Loop over each product container (div with class _2kHMtA)
    for x in soup.find_all('div', attrs={'class' : '_2kHMtA'}):

        # Append NaN when a field is missing so the four lists stay aligned
        # (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
        product = x.find('div', attrs={'class' : '_4rR01T'})
        if product is None:
            title.append(np.nan)
        else:
            title.append(product.text)

        mrp = x.find('div', attrs={'class' : '_30jeq3 _1_WHN1'})
        if mrp is None:
            price.append(np.nan)
        else:
            price.append(mrp.text)

        rate = x.find('div', attrs={'class' : '_3LWZlK'})
        if rate is None:
            rating.append(np.nan)
        else:
            rating.append(rate.text)

        f = x.find('ul', attrs={'class' : '_1xgFaf'})
        if f is None:
            features.append(np.nan)
        else:
            features.append(f.text)
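
The four if/else blocks in the loop follow the same pattern, so they could be folded into one helper. The sketch below is one possible refactoring, not the notebook's original code: `text_or_nan` and `parse_cards` are hypothetical names, and the sample markup is made up so the example runs without network access.

```python
import numpy as np
from bs4 import BeautifulSoup


def text_or_nan(parent, tag_name, css_class):
    """Return the text of the first matching tag inside `parent`, or NaN."""
    tag = parent.find(tag_name, attrs={"class": css_class})
    return np.nan if tag is None else tag.text


def parse_cards(html):
    """Turn one result page into a list of row dictionaries."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", attrs={"class": "_2kHMtA"}):
        rows.append({
            "Product": text_or_nan(card, "div", "_4rR01T"),
            "Rating": text_or_nan(card, "div", "_3LWZlK"),
            "MRP": text_or_nan(card, "div", "_30jeq3 _1_WHN1"),
            "Feature": text_or_nan(card, "ul", "_1xgFaf"),
        })
    return rows


# Hypothetical page with two products; the second one has no rating.
sample_html = """
<div class="_2kHMtA">
  <div class="_4rR01T">Laptop A</div>
  <div class="_3LWZlK">4.2</div>
  <div class="_30jeq3 _1_WHN1">₹34,990</div>
  <ul class="_1xgFaf"><li>8 GB RAM</li></ul>
</div>
<div class="_2kHMtA">
  <div class="_4rR01T">Laptop B</div>
  <div class="_30jeq3 _1_WHN1">₹62,990</div>
  <ul class="_1xgFaf"><li>16 GB RAM</li></ul>
</div>
"""
rows = parse_cards(sample_html)
print(rows)
```

Because every row is built in one place, the columns cannot drift out of alignment, and a list of dictionaries like this can be passed directly to `pd.DataFrame`.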


In [23]:
print(len(title))
print(len(price))
print(len(rating))
print(len(features))


720
720
720
720


# Create a DataFrame and save it to a CSV file

In [24]:
df = pd.DataFrame({'Product' : title, 'Rating' : rating, 'MRP' : price, 'Feature' : features})
df.head()

|   | Product | Rating | MRP | Feature |
|---|---------|--------|-----|---------|
| 0 | acer Swift Go 14 Ryzen 5 Hexa Core 7530U - (8 ... | NaN | ₹62,990 | AMD Ryzen 5 Hexa Core Processor8 GB LPDDR4X RA... |
| 1 | Lenovo Intel Celeron Dual Core - (8 GB/256 GB ... | 4.1 | ₹25,517 | Intel Celeron Dual Core Processor8 GB DDR4 RAM... |
| 2 | ASUS Vivobook 15 Core i3 11th Gen - (8 GB/512 ... | 4.2 | ₹34,990 | Intel Core i3 Processor (11th Gen)8 GB DDR4 RA... |
| 3 | APPLE 2020 Macbook Air M1 - (8 GB/256 GB SSD/M... | 4.7 | ₹84,990 | Apple M1 Processor8 GB DDR4 RAMMac OS Operatin... |
| 4 | ASUS TUF Gaming F15 Core i5 10th Gen - (8 GB/5... | 4.4 | ₹49,990 | Intel Core i5 Processor (10th Gen)8 GB DDR4 RA... |


In [25]:
df.shape


(720, 4)

In [26]:
df.tail()

|     | Product | Rating | MRP | Feature |
|-----|---------|--------|-----|---------|
| 715 | DELL Ryzen 3 Dual Core 3rd Gen - (8 GB/256 GB ... | NaN | ₹43,490 | AMD Ryzen 3 Dual Core Processor (3rd Gen)8 GB ... |
| 716 | DELL Inspiron Ryzen 5 Hexa Core 5625U - (8 GB/... | NaN | ₹59,999 | AMD Ryzen 5 Hexa Core Processor8 GB DDR4 RAM64... |
| 717 | Lenovo IdeaPad Gaming 3 Intel Core i5 11th Gen... | 4.0 | ₹56,890 | Intel Core i5 Processor (11th Gen)8 GB DDR4 RA... |
| 718 | ASUS Vivobook 14 (2022) Core i5 12th Gen - (16... | 4.4 | ₹63,990 | Intel Core i5 Processor (12th Gen)16 GB DDR4 R... |
| 719 | ASUS VivoBook 15 (2022) Core i3 10th Gen - (8 ... | 4.3 | ₹35,990 | Intel Core i3 Processor (10th Gen)8 GB DDR4 RA... |


In [27]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Product  720 non-null    object
 1   Rating   451 non-null    object
 2   MRP      720 non-null    object
 3   Feature  720 non-null    object
dtypes: object(4)
memory usage: 22.6+ KB
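
All four columns are stored as `object` (text), so prices and ratings cannot be used in calculations yet. One possible cleaning step is sketched below on a small made-up frame. It assumes every MRP value has the form `₹34,990` with no missing entries, which matches the 720 non-null count reported above.

```python
import numpy as np
import pandas as pd

# Small made-up sample in the same shape as the scraped data.
df = pd.DataFrame({
    "Product": ["Laptop A", "Laptop B", "Laptop C"],
    "Rating": ["4.2", np.nan, "4.7"],
    "MRP": ["₹34,990", "₹62,990", "₹1,24,990"],
})

# Ratings: convert text to float; missing values stay NaN.
df["Rating"] = pd.to_numeric(df["Rating"], errors="coerce")

# Prices: drop the currency symbol and the thousands separators, then cast.
df["MRP"] = (
    df["MRP"]
    .str.replace("₹", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df.dtypes)
print(df)
```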


In [29]:
#df.to_csv('directory/filename.csv', index = False)
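
The `index = False` argument matters when the file is read back: without it, pandas writes the row index as an extra unnamed column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows both cases on a small made-up frame, writing to a temporary directory; the file names are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Product": ["Laptop A", "Laptop B"],
    "MRP": ["₹34,990", "₹62,990"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Written without the index: the file has exactly the DataFrame's columns.
    clean_path = os.path.join(tmp_dir, "laptops.csv")
    df.to_csv(clean_path, index=False, encoding="utf-8")
    restored = pd.read_csv(clean_path, encoding="utf-8")

    # Written with the default index=True: an extra unnamed column appears.
    indexed_path = os.path.join(tmp_dir, "laptops_with_index.csv")
    df.to_csv(indexed_path, encoding="utf-8")
    restored_with_index = pd.read_csv(indexed_path, encoding="utf-8")

print(list(restored.columns))
print(list(restored_with_index.columns))
```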
