# Scraping Flipkart's Mobile Details using Python



![](https://i.imgur.com/J9Np8Il.jpg)



Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. While web scraping often involves parsing and processing HTML documents, some platforms also offer REST APIs to retrieve information in a machine-readable format like JSON. In this tutorial, we'll use web scraping to create a real-world dataset.

This project covers the following topics:

* Downloading web pages using the requests library
* Inspecting the HTML source code of a web page
* Parsing parts of a website using Beautiful Soup
* Scraping multiple pages of search results
* Collecting the parsed information in a pandas DataFrame
* Writing the parsed information into a CSV file

### How to Run the Code

The best way to learn the material is to execute the code and experiment with it yourself. This tutorial is an executable [Jupyter notebook](https://jupyter.org). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.





## Problem 


> **QUESTION**: Write a Python function that creates a CSV (comma-separated values) file containing details about the 24 top mobile phones priced below 20k on Flipkart. You can view the top mobile phones below 20k on this [Flipkart page](https://www.flipkart.com/search?q=mobile+below+20000&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1). The output CSV should contain these details: mobile_name, mobile_description, mobile_price.

![](https://i.imgur.com/CW1nZZ1.png)

## Downloading a web page using `requests`

When you access a URL like this [Flipkart page](https://www.flipkart.com/search?q=mobile%20below%2020000&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off) using a web browser, it downloads the contents of the web page the URL points to and displays the output on the screen. Before we can extract information from a web page, we need to download the page using Python.

We'll use a library called [`requests`](https://docs.python-requests.org/en/master/) to download web pages from the internet. Let's begin by installing and importing the library.

In [1]:
!pip install jovian requests --upgrade --quiet

In [2]:
import jovian

In [3]:
import requests

To download a page, we can use the `get` function from `requests`, which returns a response object.

In [4]:
topic_url = "https://www.flipkart.com/search?q=mobile+below+20000&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=1"

In [5]:
response = requests.get(topic_url)

* `requests.get` returns a response object containing the page contents and a status code that indicates whether the request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.

In [6]:
print(response.status_code)

200


In [7]:
page_contents=response.text

If the request was successful, response.status_code is set to a value between 200 and 299.
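As a minimal sketch of the check described above, the helper below treats any code from 200 to 299 as a success and raises an exception otherwise. The names `is_success` and `check_status` are hypothetical and not part of `requests`; the library itself also provides `response.ok` and `response.raise_for_status()` for similar purposes.

```python
def is_success(status_code):
    """Return True if the HTTP status code is in the 200-299 range."""
    return 200 <= status_code <= 299


def check_status(status_code, url):
    """Raise an exception if the request for `url` was not successful."""
    if not is_success(status_code):
        raise Exception('Failed to load page {} (status code {})'.format(url, status_code))


# Usage with the response downloaded earlier (assumes `response` and `topic_url` exist):
# check_status(response.status_code, topic_url)
```

Failing early like this avoids parsing an error page as if it were a list of search results.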

In [8]:
len(page_contents)

595802

In [9]:
print(page_contents[:1000])

<!doctype html><html lang="en"><head><link href="https://rukminim1.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.104e9a.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><meta name="robots" content="noodp"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" hr


* The page contains a large number of characters, so the cell above prints only the first 1000 of them.

* What you see above is the *source code* of the web page. It is written in a language called [HTML](https://developer.mozilla.org/en-US/docs/Web/HTML), which defines the content and structure of the web page.

In [10]:
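# Use mode 'w' so that re-running the cell overwrites the file instead of appending to it
with open('top-mobile-below-20k.html', 'w', encoding="utf-8") as file:
    file.write(page_contents)

The cell above saves the page contents to a file with the `.html` extension.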
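The file mode matters when a cell is run more than once. The small sketch below uses a temporary directory and a made-up HTML string (a stand-in for `page_contents`, not the real page) to show that writing with mode `'w'` replaces the file on every run, while mode `'a'` adds another copy to the end of it.

```python
import os
import tempfile

# Stand-in for page_contents; the real page is much longer
sample_contents = '<!doctype html><html><body>Sample page</body></html>'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.html')

    # Writing twice with 'w' leaves exactly one copy of the contents
    for _ in range(2):
        with open(path, 'w', encoding='utf-8') as file:
            file.write(sample_contents)
    with open(path, 'r', encoding='utf-8') as file:
        written_length = len(file.read())

    # Appending with 'a' adds a second copy to the existing file
    with open(path, 'a', encoding='utf-8') as file:
        file.write(sample_contents)
    with open(path, 'r', encoding='utf-8') as file:
        appended_length = len(file.read())

print(written_length, appended_length)
```

A file that contains the page twice would make every phone appear twice when it is parsed later, so overwriting is usually the safer choice.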

If you open the saved file in a browser, it looks similar to the original web page, but note that it's simply a copy. You will notice that none of the links or buttons work. To view or edit the source code of the file, click "File > Open" within Jupyter, then select the file `top-mobile-below-20k.html` from the list and click the "Edit" button.

![](https://i.imgur.com/TzE3Aiu.png)

As you might expect, the source code looks something like this:

![](https://i.imgur.com/pJwaANN.png)

## Inspecting the HTML source code of a web page

![](https://i.imgur.com/lg2nKNI.jpg)

As mentioned earlier, web pages are written in a language called HTML (HyperText Markup Language). HTML is a fairly simple language made up of *tags* (also called *nodes* or *elements*), e.g. `<a href="https://jovian.ai" target="_blank">Go to Jovian</a>`. An HTML tag has three parts:

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of a tag used by the browser to customize how the tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
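To make the three parts concrete, here is a small sketch that pulls them out of the example tag above. It uses Python's built-in `html.parser` module rather than the library introduced later in this tutorial, and the class name `TagInspector` is made up for illustration.

```python
from html.parser import HTMLParser


class TagInspector(HTMLParser):
    """Record the name, attributes and text content of the tags it sees."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # `tag` is the tag name; `attrs` is a list of (attribute, value) pairs
        self.parts.append(('name', tag))
        self.parts.append(('attributes', dict(attrs)))

    def handle_data(self, data):
        # Text between the opening and closing segments of a tag
        self.parts.append(('children', data))


inspector = TagInspector()
inspector.feed('<a href="https://jovian.ai" target="_blank">Go to Jovian</a>')
inspector.close()

for label, value in inspector.parts:
    print(label, '->', value)
```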


### Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)

> **EXERCISE**: Complete this tutorial on HTML: https://www.htmldog.com/guides/html/ . Once done, try describing what the above tags and attributes are used for. Try creating a new HTML page using the tags you find most interesting.

To learn how to style HTML tags, check out this tutorial on CSS: https://www.htmldog.com/guides/css/

### Inspecting HTML in the Browser

You can view the source code of any webpage right within your browser by right-clicking anywhere on a page and selecting the "Inspect" option. It opens the "Developer Tools" pane, where you can see the source code as a tree. You can expand and collapse various nodes and find the source code for a specific portion of the page.

Here's what it looks like on the Chrome browser:

![](https://i.imgur.com/ZTrcylh.png)

## Extracting information from HTML using Beautiful Soup

To extract information from the HTML source code of a webpage programmatically, we can use the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Let's install the library and import the `BeautifulSoup` class from the `bs4` module.

In [11]:
# Install the library
!pip install beautifulsoup4 --upgrade --quiet

In [12]:
# Import the library
from bs4 import BeautifulSoup

In [13]:
?BeautifulSoup

Next, let's read the contents of the file `top-mobile-below-20k.html` and create a `BeautifulSoup` object to parse the content.

In [14]:
with open('top-mobile-below-20k.html', 'r', encoding="utf-8") as f:
    html_source = f.read()

In [15]:
html_source[:1000]

'<!doctype html><html lang="en"><head><link href="https://rukminim1.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.905c37.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.104e9a.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><meta name="robots" content="noodp"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" h

In [16]:
doc = BeautifulSoup(html_source, 'html.parser')

In [17]:
type(doc)

bs4.BeautifulSoup

The `doc` object contains several properties and methods for extracting information from the HTML document.

### Finding all tags of the same type and Searching by Class

To find all the occurrences of a tag, use the `find_all` method.

The `class` attribute is one of the most frequently used attributes on HTML tags (used for layout and styling). We can search for tags containing a class using the `class_` argument in `find_all` (note that `class` is a reserved keyword in Python, hence the underscore in the argument name).
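As a quick illustration of `find_all` with the `class_` argument, the sketch below parses a tiny hand-written HTML snippet. The class names `phone-name`, `phone-price` and `featured` are invented for this example and are not the ones used on Flipkart.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a page of search results
sample_html = """
<div class="phone-name">Phone A</div>
<div class="phone-name">Phone B</div>
<div class="phone-price featured">₹9,999</div>
"""

sample_doc = BeautifulSoup(sample_html, 'html.parser')

# All div tags whose class list includes 'phone-name'
name_tags = sample_doc.find_all('div', class_='phone-name')
names = [tag.text for tag in name_tags]

# A tag with several classes matches if any one of them equals the argument
price_tags = sample_doc.find_all('div', class_='phone-price')
prices = [tag.text for tag in price_tags]

print(names)
print(prices)
```

Passing a dictionary such as `{'class': 'phone-name'}` as the second argument, as the cells below do, is an equivalent way to filter by class.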

### Collecting the contents of the tags into a `list` using a `for` loop

To get the mobile names, filter the `div` tags by a specific `class` and collect the text of each matching tag.

In [18]:
# Find all the div tags that hold a mobile name
mobilename_tags = doc.find_all('div', {'class': '_4rR01T'})

# Collect the text of each tag into a list
def getMobile_name_list(mobilename_tags):
    mobile_names = []
    for tag in mobilename_tags:
        mobile_names.append(tag.text)
    return mobile_names

mobile_name = getMobile_name_list(mobilename_tags)


Next, filter the `ul` tags by a specific `class` to get the mobile descriptions.

In [19]:
# Find all the ul tags that hold a mobile description
mobile_description_tags = doc.find_all('ul', {'class': '_1xgFaf'})

# Collect the text of each tag into a list
def getMobile_description_list(mobile_description_tags):
    mobile_descriptions = []
    for tag in mobile_description_tags:
        mobile_descriptions.append(tag.text)
    return mobile_descriptions

mobile_description = getMobile_description_list(mobile_description_tags)
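One detail worth knowing: `tag.text` joins the text of all the list items inside a `ul` without any separator, so the end of one item runs straight into the start of the next. The sketch below, on a hand-written snippet with an invented class name `specs`, compares it with Beautiful Soup's `get_text` method, which accepts a `separator` argument.

```python
from bs4 import BeautifulSoup

# A made-up description list; the class name 'specs' is hypothetical
sample_html = '<ul class="specs"><li>4 GB RAM | 64 GB ROM</li><li>6.5 inch Display</li></ul>'
spec_tag = BeautifulSoup(sample_html, 'html.parser').find('ul', class_='specs')

# .text concatenates the list items with nothing in between
fused = spec_tag.text

# get_text can insert a separator between the pieces of text
separated = spec_tag.get_text(separator='; ', strip=True)

print(fused)
print(separated)
```

Whether to use a separator is a design choice; it mainly makes the description column easier to read or split later.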


Finally, filter the `div` tags by a specific `class` to get the mobile prices.

In [20]:
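# Find all the div tags that hold a mobile price
price_tags = doc.find_all('div', {'class': '_30jeq3 _1_WHN1'})

# Collect the text of each tag into a list
def getMobile_price_list(price_tags):
    mobile_prices = []
    for tag in price_tags:
        mobile_prices.append(tag.text)
    return mobile_prices

mobile_price = getMobile_price_list(price_tags)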
 



In [21]:
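# Download and parse one page of search results, given its page number
def scrape_page(page_number):
    url = "https://www.flipkart.com/search?q=mobile+below+20000&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off&page=" + str(page_number)
    # Download this specific page instead of reusing the contents of the first page
    response = requests.get(url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    doc = BeautifulSoup(response.text, 'html.parser')
    # Find the tags on this page and extract their text
    name = getMobile_name_list(doc.find_all('div', {'class': '_4rR01T'}))
    description = getMobile_description_list(doc.find_all('ul', {'class': '_1xgFaf'}))
    price = getMobile_price_list(doc.find_all('div', {'class': '_30jeq3 _1_WHN1'}))
    return name, description, price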
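The search URL in `scrape_page` is put together by string concatenation. As an alternative sketch, the standard library's `urllib.parse.urlencode` can assemble the same query string from a dictionary, which keeps the individual parameters visible. The helper name `build_search_url` is hypothetical; the parameter names and values are copied from the URL used in this tutorial.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.flipkart.com/search'


def build_search_url(page_number, query='mobile below 20000'):
    """Build the search URL for one page of results."""
    params = {
        'q': query,               # spaces are encoded as '+'
        'otracker': 'search',
        'otracker1': 'search',
        'marketplace': 'FLIPKART',
        'as-show': 'on',
        'as': 'off',
        'page': page_number,
    }
    return BASE_URL + '?' + urlencode(params)


print(build_search_url(2))
```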

In [22]:
# Scrape pages 1 to 5 and combine the results into three lists
mobile_name, mobile_description, mobile_price = [], [], []
for page_number in range(1, 6):
    name, description, price = scrape_page(page_number)
    mobile_name += name
    mobile_description += description
    mobile_price += price
    

## Pandas 

pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.

In [23]:
!pip install pandas --quiet

In [24]:
import pandas as pd

In [25]:
All_mobile={
    'mobile_name' : mobile_name,
    'mobile_description' : mobile_description,
    'mobile_price' : mobile_price
}

### Pandas DataFrame

A pandas DataFrame is a two-dimensional data structure, like a table with rows and columns. Here, each key of the `All_mobile` dictionary becomes a column and each phone becomes a row. The `loc` attribute can be used to select one or more rows by their index label.

In [26]:
mobile_df = pd.DataFrame(All_mobile)
mobile_df.index += 1
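`pd.DataFrame` raises a `ValueError` when the lists in the dictionary have different lengths, which could happen if a listing on the page lacks a price or a description. A minimal sketch of a check that reports the lengths before the DataFrame is built (the helper name `check_equal_lengths` and the sample data are made up for illustration):

```python
def check_equal_lengths(columns):
    """Raise ValueError if the lists in the `columns` dictionary differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return True


# Made-up sample data in the same shape as All_mobile
sample_columns = {
    'mobile_name': ['Phone A', 'Phone B'],
    'mobile_description': ['4 GB RAM', '6 GB RAM'],
    'mobile_price': ['₹9,999', '₹12,499'],
}
print(check_equal_lengths(sample_columns))
```

Calling `check_equal_lengths(All_mobile)` before creating the DataFrame would point to the column that is short, rather than leaving it to a less specific error from pandas.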

In [27]:
mobile_df


Unnamed: 0,mobile_name,mobile_description,mobile_price
1,"vivo T1 5G (Starlight Black, 128 GB)",6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹16,990"
2,"vivo T1 5G (Starlight Black, 128 GB)",4 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹15,990"
3,"vivo T1 5G (Rainbow Fantasy, 128 GB)",6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹16,990"
4,Cellecor E8,32 MB RAM | 32 MB ROM6.1 cm (2.4 inch) Display...,"₹1,469"
5,"vivo T1 5G (Rainbow Fantasy, 128 GB)",4 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹15,990"
...,...,...,...
116,"realme 9i 5G (Soulful Blue, 128 GB)",6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹16,999"
117,Cellecor E2+,32 MB RAM | 32 MB ROM | Expandable Upto 32 GB4...,"₹1,119"
118,"realme 8s 5G (Universe Purple, 128 GB)",8 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹16,499"
119,"vivo T1 44W (Starry Sky, 128 GB)",6 GB RAM | 128 GB ROM | Expandable Upto 1 TB16...,"₹15,999"
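The `mobile_price` column holds strings such as `₹16,990`, which cannot be sorted or averaged as numbers. A small sketch of one way to convert them, assuming every price is a whole number of rupees (the helper name `parse_price` is hypothetical):

```python
def parse_price(price_text):
    """Convert a price string such as '₹16,990' to the integer 16990."""
    # Keep only the digits, dropping the currency symbol and thousands separators
    digits = ''.join(ch for ch in price_text if ch.isdigit())
    return int(digits)


sample_prices = ['₹16,990', '₹15,990', '₹1,469']
numeric_prices = [parse_price(price) for price in sample_prices]
print(numeric_prices)
```

With the DataFrame above, the same function could be applied to the whole column using `mobile_df['mobile_price'].apply(parse_price)`.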


## Create CSV file from extracted information.

In [28]:
mobile_df.to_csv('list_of _mobile_below_20k', index=False)
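To see what a CSV file looks like for this kind of data, here is a sketch that writes two rows taken from the table above using Python's built-in `csv` module instead of pandas. It writes to an in-memory buffer rather than a file, and includes only two of the three columns to keep it short. Values that contain a comma are wrapped in double quotes so that they still read back as a single field.

```python
import csv
import io

rows = [
    {'mobile_name': 'vivo T1 5G (Starlight Black, 128 GB)', 'mobile_price': '₹16,990'},
    {'mobile_name': 'Cellecor E8', 'mobile_price': '₹1,469'},
]

# Write a header row followed by one line per phone
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['mobile_name', 'mobile_price'])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the original values, commas included
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```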

## Summary 

* The scraping was done using the Python libraries Requests and Beautiful Soup, and pandas was used to organize the extracted data.
* We scraped the first five pages of search results for mobile phones below 20k on the Flipkart website, extracting each phone's name, description and price.
* The scraped data was saved to a CSV file with 3 columns (mobile_name, mobile_description and mobile_price) and one row per phone (119 rows in the run shown above).

Web scraping means collecting or downloading content or data from a website. Many popular websites offer an API for retrieving their data, and Python also has several web scraping libraries for websites that don't. I hope you liked this article on a web scraping project with Python.

## Future work

* Extracting more details about each phone by following the links to the individual product pages
* Code optimization
* Improving the documentation of the project
* Adding a date and time stamp to the output at the point when each page is requested, since the web page is dynamic and its listings and prices change frequently.

## References for web scraping

* https://youtu.be/9gwlKLxI3YA
* https://www.youtube.com/watch?v=m-koIYWCaIo&t=946s
* https://www.youtube.com/c/JovianML
* https://github.com/benteddy/Web-Scraping-using-Python-Jupyter-Notebook-and-Selenium
* https://github.com/H2001-hj/Web-Scraping-using-Python-Jupyter-Notebook

In [None]:
jovian.commit(files=["Scraping_Flipkarts_Mobile_Details_using_Python.ipynb"], outputs=["list_of _mobile_below_20k"])

