# Web Scraping Flipkart Using BeautifulSoup

**Web scraping** is a technique used to extract data from websites. It involves fetching web pages, parsing the HTML or XML content of those pages, and then extracting useful information. Web scraping is commonly used for various purposes, including data collection, research, price comparison, monitoring, and more.

Here are some key components of web scraping:

1. **HTTP Requests**: Web scraping typically begins with sending HTTP requests to a website to retrieve its content. You can use libraries like requests in Python to make these requests.
2. **HTML Parsing**: After obtaining the web page content, you parse the HTML (or XML) using a parser to navigate the document's structure and locate the data you want to extract.
3. **Data Extraction**: You identify and extract specific data or elements from the parsed HTML content. This can include text, tables, images, links, and more.
4. **Data Cleaning**: The extracted data often requires cleaning and formatting to be useful. This may include removing unnecessary characters, handling missing values, and structuring the data as needed.
5. **Storage**: Finally, you can store the extracted data in a file or database, or use it for analysis, reporting, or other purposes.

**Beautiful Soup** is a Python library commonly used for web scraping. It provides tools for parsing HTML and XML documents, navigating their structure, and extracting data. Beautiful Soup is designed to make it easy to scrape information from web pages and is often used in combination with other libraries like requests to fetch web pages. Some of its key features include:

1. **Parsing**: It parses HTML or XML documents and creates a parse tree, allowing you to navigate and manipulate the document's elements.
2. **Search and Navigation**: Beautiful Soup provides methods for searching and navigating the parse tree, making it easy to locate specific elements in the document.
3. **Data Extraction**: You can extract data from elements by accessing their attributes, text content, or child elements.
4. **Automatic Encoding Conversion**: Beautiful Soup handles different character encodings and converts them to Unicode.
5. **Error Handling**: It can handle poorly formatted or invalid HTML, which is common when scraping real-world websites.
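
These features can be tried out on a small, made-up HTML snippet. This is only a minimal sketch: the markup and the class names (`product`, `name`) are invented for illustration, and Python's built-in `html.parser` is used so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, slightly broken HTML: the second <li> is never closed
html = """
<div class="product">
  <a class="name" href="/phone-a">Phone A</a>
  <ul><li>4 GB RAM</li><li>64 GB ROM</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")          # parsing: builds the parse tree

link = soup.find("a", class_="name")               # search: first matching element
name = link.text                                   # extraction: text content
href = link["href"]                                # extraction: attribute value
specs = [li.text for li in soup.find_all("li")]    # search: all matching elements

print(name, href, specs)
```

Even though one tag is left open, both list items are still found, which is the kind of tolerance that matters on real-world pages.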

In summary, web scraping is the process of extracting data from websites, and Beautiful Soup is a Python library that simplifies the process of parsing and navigating HTML and XML documents for web scraping purposes.










________________________________________________________________________________________________________________________________

**OBJECTIVE**: To scrape data from a single Flipkart web page listing smartphones priced under Rs 20,000 and to store it in a CSV file.

**LIBRARIES**: We are using 3 libraries here
1. Pandas
2. Requests
3. BeautifulSoup4

In [31]:
#Import the libraries

import pandas as pd              #to use DataFrames
import requests                  #to fetch web content
from bs4 import BeautifulSoup    #to parse and extract data from that content


The process involves sending an HTTP request to a webpage, obtaining its HTML content, parsing that content with BeautifulSoup, and then using pandas to organize and analyze the data.

**NOTE**: "from bs4 import BeautifulSoup": This line imports the BeautifulSoup class from the bs4 package, which is what the beautifulsoup4 library installs.

We first need to decide what we want to scrape.
In this example we want to scrape the following: mobile name, price, description, and rating.


In [32]:
# Creating lists
Product_name=[]
Prices=[]
Description=[]
Reviews=[]


In [33]:
# Storing the url in the variable

url="https://www.flipkart.com/mobiles/~mobile-phones-under-rs20000/pr?sid=tyy%2C4io&otracker=undefined_footer_footer&page=1"

r=requests.get(url)
print(r)

#Getting a 200 response means the request succeeded and the HTML was returned, so in short the page can be scraped

<Response [200]>
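
A status code of 200 only says that the server answered; it does not guarantee that the response contains the product listing. A slightly more defensive version of the request might look like the sketch below. The `fetch_html` helper, the `User-Agent` string, and the 10-second timeout are assumptions made for illustration and are not part of the original notebook.

```python
import requests

# Assumed browser-like header; some sites respond differently to the default client string
REQUEST_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; learning-scraper)"}

def fetch_html(url, timeout=10):
    """Return the page HTML, raising an exception on HTTP error codes (4xx/5xx)."""
    response = requests.get(url, headers=REQUEST_HEADERS, timeout=timeout)
    response.raise_for_status()   # turns 4xx/5xx responses into requests.HTTPError
    return response.text

# Usage (not run here, since it needs network access):
# html = fetch_html(url)
```

The timeout keeps the cell from hanging forever if the site does not respond.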


In [35]:
#Let's first learn how to scrape data from a single page

soup=BeautifulSoup(r.text,"lxml")   #the "lxml" parser needs the lxml package to be installed
#print(soup.prettify()) #printing the HTML of the 1st page

# This restricts the search to one specific portion of the page; otherwise find_all would start retrieving from various parts of the page
box=soup.find("div",class_="_1YokD2 _3Mn1Gg")
#print(box)
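
Class names such as `_1YokD2 _3Mn1Gg` look auto-generated and may change whenever the site is updated. When that happens, `soup.find` returns `None` and the later `box.find_all` call fails with an `AttributeError`. A small guard, sketched here on invented HTML with hypothetical class names, makes the failure easier to understand:

```python
from bs4 import BeautifulSoup

# Invented HTML standing in for the real page; "results grid" is a hypothetical class value
html = '<div class="results grid"><div class="title">Phone A</div></div>'
soup = BeautifulSoup(html, "html.parser")

def find_container(soup, class_value):
    """Return the container div, or raise a clear error if the class no longer exists."""
    box = soup.find("div", class_=class_value)
    if box is None:
        raise ValueError(f"No <div> with class {class_value!r}; the page layout may have changed")
    return box

box = find_container(soup, "results grid")
print(len(box.find_all("div", class_="title")))

try:
    find_container(soup, "old-class-name")
except ValueError as err:
    print(err)
```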


In [38]:
# Now we will start Fetching the Product name by inspecting the page and finding the class of the div where the name is stored
#_4rR01T is the class name

names= box.find_all("div",class_="_4rR01T")
#print(names)



In [12]:
# Try not to run this cell again and again, or it will keep appending duplicate values to the list, and you will freak out like me and run to ChatGPT for help.
for i in names:
    name=i.text
    Product_name.append(name)


print(len(Product_name))

24
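
The duplicate-values problem comes from `append` adding to a list that was created in an earlier cell. One way around it, sketched below on invented HTML, is to rebuild the list in a single assignment so that re-running the cell always gives the same result. The class name `title` is hypothetical.

```python
from bs4 import BeautifulSoup

html = '<div><div class="title">Phone A</div><div class="title">Phone B</div></div>'
box = BeautifulSoup(html, "html.parser")
names = box.find_all("div", class_="title")

# Assigning a fresh list (instead of appending) makes the cell safe to re-run
Product_name = [tag.text for tag in names]
Product_name = [tag.text for tag in names]   # simulating a second run: still no duplicates

print(len(Product_name))
```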


In [13]:
print(Product_name) # Accurate output

['SAMSUNG Galaxy F34 5G (Orchid Violet, 128 GB)', 'SAMSUNG Galaxy F14 5G (OMG Black, 128 GB)', 'REDMI 12 (Jade Black, 128 GB)', 'vivo T2x 5G (Glimmer Black, 128 GB)', 'vivo T2x 5G (Marine Blue, 128 GB)', 'vivo T2x 5G (Marine Blue, 128 GB)', 'vivo T2x 5G (Glimmer Black, 128 GB)', 'SAMSUNG Galaxy F14 5G (GOAT Green, 128 GB)', 'vivo T2x 5G (Aurora Gold, 128 GB)', 'vivo T2x 5G (Aurora Gold, 128 GB)', 'SAMSUNG Galaxy F14 5G (GOAT Green, 128 GB)', 'SAMSUNG Galaxy F34 5G (Electric Black, 128 GB)', 'vivo T2x 5G (Aurora Gold, 128 GB)', 'vivo T2x 5G (Marine Blue, 128 GB)', 'OnePlus Nord CE 2 Lite 5G (Blue Tide, 128 GB)', 'POCO C55 (Cool Blue, 128 GB)', 'SAMSUNG Galaxy F14 5G (B.A.E. Purple, 128 GB)', 'SAMSUNG Galaxy F14 5G (B.A.E. Purple, 128 GB)', 'POCO C55 (Power Black, 128 GB)', 'SAMSUNG Galaxy F04 (Opal Green, 64 GB)', 'SAMSUNG Galaxy F14 5G (OMG Black, 128 GB)', 'SAMSUNG Galaxy F34 5G (Orchid Violet, 128 GB)', 'OnePlus Nord CE 3 Lite 5G (Pastel Lime, 128 GB)', 'POCO C51 (Royal Blue, 64 GB)']

In [37]:
# Now we will fetch the prices from the site
#_30jeq3 _1_WHN1 is the class name

prices=box.find_all("div",class_="_30jeq3 _1_WHN1")
#print(prices)


In [15]:
# Now we want to store the values in the list, keeping only the text and dropping the tags

for i in prices:
    price=i.text
    Prices.append(price)

print(Prices)

['₹16,499', '₹12,490', '₹9,999', '₹14,999', '₹14,999', '₹12,999', '₹12,999', '₹11,490', '₹12,999', '₹14,999', '₹12,490', '₹16,499', '₹11,999', '₹11,999', '₹17,224', '₹7,999', '₹12,490', '₹11,490', '₹7,999', '₹6,499', '₹11,490', '₹18,499', '₹19,954', '₹6,499']


In [17]:
# Now we are going to cross-check the length
print(len(Prices))

24
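
The prices are still strings such as '₹16,499', which cannot be sorted or averaged as numbers. A possible cleaning step is sketched below, assuming every price is a rupee sign followed by digits with comma separators, as in the output above. The `clean_price` helper is hypothetical.

```python
import re

def clean_price(text):
    """Convert a string like '₹16,499' to the integer 16499."""
    digits = re.sub(r"[^\d]", "", text)   # drop the currency sign and the commas
    if not digits:
        raise ValueError(f"No digits found in {text!r}")
    return int(digits)

sample = ["₹16,499", "₹12,490", "₹9,999"]    # first three values from the output above
print([clean_price(p) for p in sample])      # [16499, 12490, 9999]
```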


In [36]:
# Now we will be fetching the description
#_1xgFaf is the class name

descripts=box.find_all("ul",class_="_1xgFaf")
#print(descripts)

#In the future I will edit the code to add more columns, so that the data in the description part can be segregated further

In [19]:
# Once we have fetched the data, let's add it to the list

for i in descripts:
    des=i.text
    Description.append(des)
    
print(Description)

['6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16.51 cm (6.5 inch) Full HD+ Display50MP (OIS) + 8MP + 2MP | 13MP Front Camera6000 mAh BatteryExynos 1280 Processor1 Year Manufacturer Warranty for Device and 6 Months Manufacturer Warranty for In-Box Accessories', '6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16.76 cm (6.6 inch) Full HD+ Display50MP + 2MP | 13MP Front Camera6000 mAh BatteryExynos 1330, Octa Core Processor1 Year Manufacturer Warranty for Device and 6 Months Manufacturer Warranty for In-Box Accessories', '4 GB RAM | 128 GB ROM | Expandable Upto 1 TB17.25 cm (6.79 inch) Full HD+ Display50MP + 8MP + 2MP | 8MP Front Camera5000 mAh BatteryHelio G88 Processor1 Year Manufacturer Warranty for Phone and 6 Months Warranty for in the Box Accessories', '8 GB RAM | 128 GB ROM16.71 cm (6.58 inch) Full HD+ Display50MP + 2MP | 8MP Front Camera5000 mAh BatteryDimensity 6020 Processor1 Year of Device & 6 Months for Inbox Accessories', '8 GB RAM | 128 GB ROM16.71 cm (6.58 inch) Full HD+ Displa
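
In the output above, the individual specifications run together ('Expandable Upto 1 TB16.51 cm'), because `.text` joins the text of all child elements without a separator. Assuming each specification sits in its own `li` element inside the `ul` (which should be confirmed by inspecting the page), the items can be collected separately. The HTML below is a shortened, hand-written stand-in for one description block:

```python
from bs4 import BeautifulSoup

html = """
<ul class="_1xgFaf">
  <li>6 GB RAM | 128 GB ROM | Expandable Upto 1 TB</li>
  <li>16.51 cm (6.5 inch) Full HD+ Display</li>
  <li>6000 mAh Battery</li>
</ul>
"""
ul = BeautifulSoup(html, "html.parser").find("ul", class_="_1xgFaf")

# One list entry per <li>, so the specifications stay separated
spec_items = [li.get_text(strip=True) for li in ul.find_all("li")]
description = " ; ".join(spec_items)

print(spec_items)
print(description)
```

Keeping `spec_items` as a list is also a natural starting point for splitting the description into separate columns later.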

In [39]:
# Once we are done with our description part, let's proceed to the review part
#_3LWZlK is the class name

reviews=box.find_all("div",class_="_3LWZlK")
#print(reviews)

In [21]:
#Now let's extract the text part

for i in reviews:
    rating=i.text          # a separate name, so the response object r is not overwritten
    Reviews.append(rating)
    
print(Reviews)

# Checked the data and the length too


['4.3', '4.2', '4.3', '4.3', '4.3', '4.4', '4.4', '4.2', '4.4', '4.3', '4.2', '4.3', '4.4', '4.4', '4.4', '4.2', '4.2', '4.2', '4.2', '4.2', '4.2', '4.2', '4.4', '4.1']


In [22]:
print(len(Reviews))

24


In [23]:
print(len(Description))

24
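
Equal lengths are necessary for building the table, but they do not prove that row i of one list belongs to row i of another: if one product had no rating, every later rating would shift up by one position. A more robust pattern is to loop over one container per product and look up each field inside it. The sketch below uses invented HTML and hypothetical class names (`card`, `title`, `rating`); the real container class would have to be found by inspecting the page.

```python
from bs4 import BeautifulSoup

html = """
<div class="card"><div class="title">Phone A</div><div class="rating">4.3</div></div>
<div class="card"><div class="title">Phone B</div></div>
<div class="card"><div class="title">Phone C</div><div class="rating">4.1</div></div>
"""
soup = BeautifulSoup(html, "html.parser")

def text_or_none(parent, tag, class_name):
    """Return the stripped text of the first match inside parent, or None if absent."""
    found = parent.find(tag, class_=class_name)
    return found.get_text(strip=True) if found is not None else None

rows = [
    {"Product Name": text_or_none(card, "div", "title"),
     "Reviews": text_or_none(card, "div", "rating")}
    for card in soup.find_all("div", class_="card")
]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame`, with the missing rating kept as an empty value in the right row.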


In [25]:
#It's time for us to bring all the lists under one roof!
# We will first add them to a DataFrame and then convert it into a CSV file

df=pd.DataFrame({"Product Name":Product_name, "Prices":Prices , "Description":Description , "Reviews":Reviews})
print(df)

                                       Product Name   Prices  \
0     SAMSUNG Galaxy F34 5G (Orchid Violet, 128 GB)  ₹16,499   
1         SAMSUNG Galaxy F14 5G (OMG Black, 128 GB)  ₹12,490   
2                     REDMI 12 (Jade Black, 128 GB)   ₹9,999   
3               vivo T2x 5G (Glimmer Black, 128 GB)  ₹14,999   
4                 vivo T2x 5G (Marine Blue, 128 GB)  ₹14,999   
5                 vivo T2x 5G (Marine Blue, 128 GB)  ₹12,999   
6               vivo T2x 5G (Glimmer Black, 128 GB)  ₹12,999   
7        SAMSUNG Galaxy F14 5G (GOAT Green, 128 GB)  ₹11,490   
8                 vivo T2x 5G (Aurora Gold, 128 GB)  ₹12,999   
9                 vivo T2x 5G (Aurora Gold, 128 GB)  ₹14,999   
10       SAMSUNG Galaxy F14 5G (GOAT Green, 128 GB)  ₹12,490   
11   SAMSUNG Galaxy F34 5G (Electric Black, 128 GB)  ₹16,499   
12                vivo T2x 5G (Aurora Gold, 128 GB)  ₹11,999   
13                vivo T2x 5G (Marine Blue, 128 GB)  ₹11,999   
14    OnePlus Nord CE 2 Lite 5G (Blue Ti

In [28]:
#Write the DataFrame to a CSV file
df.to_csv("flipkart_mobiles_under20k_pd1.csv")

Now we can go to the working directory and check for the new CSV file named "flipkart_mobiles_under20k_pd1.csv".

Friends, this is the code for scraping a single page. I will be uploading the code for scraping multiple pages. It just needs a loop and a little bit of editing in the URL; the rest of the code would be the same.
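
As a rough sketch of that idea: the URL used above ends in `page=1`, so a loop can generate one URL per page by changing that number. That higher page numbers follow the same pattern is an assumption, as are the `page_urls` helper and the page count of 3.

```python
BASE_URL = ("https://www.flipkart.com/mobiles/~mobile-phones-under-rs20000/pr"
            "?sid=tyy%2C4io&otracker=undefined_footer_footer&page={page}")

def page_urls(last_page):
    """Return the listing URLs for pages 1..last_page."""
    return [BASE_URL.format(page=n) for n in range(1, last_page + 1)]

for page_url in page_urls(3):
    print(page_url)
    # Here each page would be fetched and parsed exactly as in the single-page code,
    # ideally with a short pause (time.sleep) between requests.
```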

I have also uploaded the dataset, with the following edits made to the file:
1. Enhanced the header
2. Changed the index to begin at 1 instead of 0
3. Made changes to the currency
And that's it!
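
Edits like these could also be made in pandas before writing the file, instead of by hand. This is only a sketch of one way to do it, on two rows copied from the output above. The new column name, the `S.No` index label, and the output file name are made up for illustration, and since the exact currency change is not stated, converting the price to a plain number is a guess.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "Product Name": ["SAMSUNG Galaxy F34 5G (Orchid Violet, 128 GB)", "REDMI 12 (Jade Black, 128 GB)"],
    "Prices": ["₹16,499", "₹9,999"],
})

# 1. Clearer header (hypothetical name)
df = df.rename(columns={"Prices": "Price (INR)"})
# 2. Index starting at 1 instead of 0
df.index = range(1, len(df) + 1)
# 3. Currency as a plain number: remove the rupee sign and the commas
df["Price (INR)"] = df["Price (INR)"].str.replace("[₹,]", "", regex=True).astype(int)

buffer = io.StringIO()                 # in-memory stand-in for a file
df.to_csv(buffer, index_label="S.No")
print(buffer.getvalue())
# To write a real file: df.to_csv("flipkart_mobiles_clean.csv", index_label="S.No")
```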

I am open to your feedback and suggestions. This is my first post on Kaggle. I will keep upskilling and posting more on the site, and I look forward to your cooperation on the journey.

Till then take care and Happy Coding!

