# Introduction to Web Scraping

## 1. Introduction to Web Scraping

In today's world, we have tons of unstructured data/information (mostly web data) available freely. Sometimes the freely available data is easy to read and sometimes not. No matter how your data is available, web scraping is a very useful tool to transform unstructured data into structured data that is easier to read and analyze. In other words, web scraping is a way to collect, organize and analyze this enormous amount of data. So let us first understand what is web-scraping.

### When and Where Data Scientists/Analysts Use Web Scraping:

- **When:** When data is not readily available in a structured format (e.g., APIs, databases).
- **Where:** E-commerce, social media, news websites, financial sites, etc.

### Why Use Web Scraping?

- **Data Collection:** Gather large amounts of data from the web quickly and efficiently.
- **Automation:** Automate the process of data collection and updates.
- **Insights:** Derive insights from data that is not easily accessible through other means.

### Ethical and Legal Considerations:

- **Ethics:** Respecting website terms of service, avoiding overloading servers.
- **Legality:** Following the robots.txt file, understanding copyright laws.

### Common Web Scraping Tools:

- **Libraries:** Beautiful Soup, Scrapy, Selenium, Requests.
- **Comparison:** Use cases for each tool.

## 2. Basic HTML and CSS

### HTML Tree Structure

Before we look into the functionality provided by Beautiful Soup, let us first understand the HTML tree structure.

The root element in the document tree is the `<html>`, which can have parents, children, and siblings, determined by its position in the tree structure. To move among HTML elements, attributes, and text, you have to move among nodes in your tree structure.

### Content:

#### Basics of HTML:

- **HTML Tags:** `<html>`, `<head>`, `<body>`, `<div>`, `<span>`, etc.
- **Attributes:** `id`, `class`, `href`, `src`, etc.
- **DOM Structure:** The hierarchical organization of elements.

#### Common HTML Tags and Their Uses:

- **Headings:** `<h1>`, `<h2>`, `<h3>`, etc.
- **Paragraphs:** `<p>`
- **Links:** `<a href="URL">Link Text</a>`
- **Images:** `<img src="image.jpg" alt="description">`
- **Lists:** `<ul>`, `<ol>`, `<li>`
- **Tables:** `<table>`, `<tr>`, `<td>`

#### Basics of CSS:

- **Selectors:** Element selectors, class selectors, ID selectors.
- **Styling:** Applying styles using CSS.
  ```html
  <style>
    .class-selector {
      color: blue;
    }
    #id-selector {
      font-size: 20px;
    }
  </style>

```



## 3.Fetching Web Pages with Requests

#### Introduction to the requests Library:

* **Requests:** Sending HTTP requests to fetch web content.
      
      --> pip install requests

#### Sending GET Requests and Handling Responses:

```python
import requests
url = "http://google.com"
response = requests.get(url)


### Handling HTTP Responses:

#### Status Codes:
 
    200: OK
    300: Redirection
    400: Client Error (e.g., 404 Not Found)
    500: Server Error (e.g., 500 Internal Server Error)
        
        


## 5.Parsing HTML with Beautiful Soup

### Learning Objectives:

- Parse HTML documents using Beautiful Soup.
- Extract specific elements and data from web pages.

### Introduction to Beautiful Soup

Beautiful Soup is a Python library named after a Lewis Carroll poem of the same name in "Alice's Adventures in Wonderland". Beautiful Soup parses unwanted data and helps organize and format messy web data by fixing bad HTML and presenting it in easily-traversable XML structures. In short, Beautiful Soup is a Python package that allows us to pull data out of HTML and XML documents.

### Installing Beautiful Soup:

```bash
pip install beautifulsoup4
```


### Commonly Used Methods and Attributes:

    soup.title: The <title> tag of the HTML document.
    soup.body: The <body> tag of the HTML document.
    soup.find(): Finds the first instance of an element.
    soup.find_all(): Finds all instances of an element.


In [1]:
# !pip install requests

# !pip install beautifulsoup4 lxml

In [2]:
# Loading required libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

In [3]:
url = "https://www.flipkart.com/search?q=iphone%2016&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"

In [4]:
req = requests.get(url)

In [5]:
req.status_code

200

### Extracting iPhone-16 Data From Flipkart website

In [6]:
url = "https://www.flipkart.com/search?q=iphone%2017&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"

In [7]:
request_header = {'Content-Type': 'image/webp; charset=UTF-8', 
                  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:126.0) Gecko/20100101 Firefox/126.0',
                'Accept-Encoding': 'gzip, deflate, br, zstd'                  
                 }

response = requests.get(url, headers=request_header)

### Fretch Html Data

In [8]:
response.status_code

200

In [9]:
response.headers

{'server': 'nginx', 'date': 'Thu, 30 Oct 2025 03:32:02 GMT', 'content-type': 'text/html; charset=utf-8', 'transfer-encoding': 'chunked', 'content-security-policy': "script-src 'self' 'unsafe-eval' https://*.flixcart.com https://*.flixcart.net https://flipkart.d1.sc.omtrdc.net https://dpm.demdex.net https://tnc.phonepe.com https://js-agent.newrelic.com https://bam.nr-data.net https://www.googletagmanager.com 'nonce-84511626129568490'; style-src 'self' 'unsafe-inline' https://*.flixcart.com https://tnc.phonepe.com https://*.flixcart.net; img-src 'self' data: blob: https://*.flixcart.com https://*.flixcart.net https://images.ixigo.com https://flipkart.d1.sc.omtrdc.net https://www.facebook.com https://*.fkapi.net https://googleads.g.doubleclick.net https://www.google.com https://www.google.co.in https://www.googleadservices.com https://sp.analytics.yahoo.com https://bat.bing.com https://bat.r.msn.com https://pay.payzippy.com https://1.pay.payzippy.com https://tnc.phonepe.com https://img.fk

In [10]:
response.text

'<!doctype html><html lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.c48a12.css"/><link rel="stylesheet" href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.066267.css"/><meta http-equiv="Content-type" content="text/html; charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="fb:page_id" content="102988293558"/><meta property="fb:admins" content="658873552,624500995,100000233612389"/><link rel="shortcut icon" href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico"/><link type="application/opensearchdescription+xml" rel="search" href="/osdd.xml?v=2"/><meta property="og:type" content="website"/><meta name="og_site_name" property="og:site_name" content="Flipkart.com"/><link rel="apple-touch-icon" sizes="57x57" href="/apple-touch-icon-57x57.png"/><l

In [11]:
html_response  = response.text

### BeautifulSoup

#### Parse HTML


In [12]:
soup = BeautifulSoup(html_response)

In [13]:
soup.text

'  Iphone 17- Buy Products Online at Best Price in India - All Categories | Flipkart.com     Explore PlusLoginBecome a Seller More CartFiltersCATEGORIESMobiles & AccessoriesMobilesPrice......Min₹10000₹15000₹20000₹30000to₹10000₹15000₹20000₹30000₹30000+BrandAppleInfinixPOCOSamsungOnePlus?Customer RatingsGST Invoice AvailableRAM1GB and Below6 GB  AboveInternal StorageBattery CapacityScreen SizePrimary CameraSecondary CameraProcessor BrandSpecialityResolution TypeOperating SystemNetwork TypeSim TypeOffersSpecial PriceBuy More, Save MoreNo Cost EMIFeaturesTypeNumber of CoresAvailabilityDiscount50% or more40% or more30% or more20% or more10% or moreClock SpeedNeed help?Help me decideHomeMobiles & AccessoriesMobilesShowing 1 – 24 of 610 results for "iphone 17"Sort ByRelevancePopularityPrice -- Low to HighPrice -- High to LowNewest FirstAdd to CompareApple iPhone 17 (Black, 256 GB)4.71,611 Ratings\xa0&\xa0127 Reviews256 GB ROM16.0 cm (6.3 inch) Super Retina XDR Display48MP + 48MP | 18MP Front 

In [14]:
soup.prettify

<bound method Tag.prettify of <!DOCTYPE html>
<html lang="en"><head><link href="https://rukminim2.flixcart.com" rel="preconnect"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app_modules.chunk.c48a12.css" rel="stylesheet"/><link href="//static-assets-web.flixcart.com/fk-p-linchpin-web/fk-cp-zion/css/app.chunk.066267.css" rel="stylesheet"/><meta content="text/html; charset=utf-8" http-equiv="Content-type"/><meta content="IE=Edge" http-equiv="X-UA-Compatible"/><meta content="102988293558" property="fb:page_id"/><meta content="658873552,624500995,100000233612389" property="fb:admins"/><link href="https://static-assets-web.flixcart.com/www/promos/new/20150528-140547-favicon-retina.ico" rel="shortcut icon"/><link href="/osdd.xml?v=2" rel="search" type="application/opensearchdescription+xml"/><meta content="website" property="og:type"/><meta content="Flipkart.com" name="og_site_name" property="og:site_name"/><link href="/apple-touch-icon-57x57.png" rel="apple

In [15]:
soup.text

'  Iphone 17- Buy Products Online at Best Price in India - All Categories | Flipkart.com     Explore PlusLoginBecome a Seller More CartFiltersCATEGORIESMobiles & AccessoriesMobilesPrice......Min₹10000₹15000₹20000₹30000to₹10000₹15000₹20000₹30000₹30000+BrandAppleInfinixPOCOSamsungOnePlus?Customer RatingsGST Invoice AvailableRAM1GB and Below6 GB  AboveInternal StorageBattery CapacityScreen SizePrimary CameraSecondary CameraProcessor BrandSpecialityResolution TypeOperating SystemNetwork TypeSim TypeOffersSpecial PriceBuy More, Save MoreNo Cost EMIFeaturesTypeNumber of CoresAvailabilityDiscount50% or more40% or more30% or more20% or more10% or moreClock SpeedNeed help?Help me decideHomeMobiles & AccessoriesMobilesShowing 1 – 24 of 610 results for "iphone 17"Sort ByRelevancePopularityPrice -- Low to HighPrice -- High to LowNewest FirstAdd to CompareApple iPhone 17 (Black, 256 GB)4.71,611 Ratings\xa0&\xa0127 Reviews256 GB ROM16.0 cm (6.3 inch) Super Retina XDR Display48MP + 48MP | 18MP Front 

###  Create Empty Lists to Store Scraped Data




In [16]:
product_name = []          # Product Titles
product_price = []         # Product Prices
product_ratings = []       # Product Ratings
product_rom = []           # Storage Info (ROM)
product_display = []       # Display Details
product_camera = []        # Camera Specifications
product_warranty = []      # Warranty Information

### Extracting a mobile Features

In [17]:
## Fetch Page & Parse HTML
response = requests.get(url)
soup = BeautifulSoup(response.text)

# Capture Product Price Details
price = soup.find_all("div", class_="Nx9bqj _4b5DiR")
for i in price:
    product_price.append(re.sub("\D", "", i.text))

# Retrieve Product Names
names = soup.find_all("div", class_="KzDlHZ")
for i in names:
    product_name.append(i.text)

# Pull Rating Information
ratings = soup.find_all('div', class_="XQDdHH")
for i in ratings:
    product_ratings.append(i)

# Organize Product Attributes: ROM, Display, Camera, Warranty
features = soup.find_all("li", class_="J+igdf")
for i in features:
    if "ROM" in i.text:
        product_rom.append(i.text)
    elif "Display" in i.text:
        product_display.append(i.text)
    elif "Camera" in i.text:
        product_camera.append(i.text)
    elif "Warranty" in i.text:
        product_warranty.append(i.text)


In [18]:
print("Name     :", len(product_name))
print("Price    :", len(product_price))
print("Ratings  :", len(product_ratings))
print("ROM      :", len(product_rom))
print("Display  :", len(product_display))
print("Camera   :", len(product_camera))
print("Warranty :", len(product_warranty))


Name     : 24
Price    : 24
Ratings  : 33
ROM      : 24
Display  : 24
Camera   : 24
Warranty : 24


### Creating DataFrame


In [19]:
# Fix mismatched list length
correct_length = len(product_name)

product_ratings = product_ratings[:correct_length]

import pandas as pd

products_df = pd.DataFrame({
    "Product Name": product_name,
    "Price": product_price,
    "Ratings": product_ratings,
    "ROM": product_rom,
    "Display": product_display,
    "Camera": product_camera,
    "Warranty": product_warranty
})


## Data Cleaning

In [20]:
# Extract Product Name
products_df.product_name = products_df['Product Name'].apply(
    lambda x: re.findall(r'^[^\(]+', x)[0].strip()
)

# Extract Color
products_df['color'] = products_df['Product Name'].apply(
    lambda x : re.findall("\((.+),", x)[0].strip()
)

#  Extract Storage
products_df.storage = products_df['Product Name'].apply(
    lambda x : re.findall(",\s*(\d+)", x)[0]
)

#  Extract Ratings
products_df.Ratings = products_df['Ratings'].apply(
    lambda x : re.findall(r"\d{1}\.\d{1}", str(x))[0]
)

# Extract Warranty
products_df.Warranty = products_df.Warranty.apply(
    lambda x : re.findall(r"\d", str(x))[0]
)

# Extract Max Camera MP
products_df['Camera'] = products_df['Camera'].str.findall(
    r'(\d+)(?=MP)'
).apply(lambda x: max(map(int, x)))


  products_df.product_name = products_df['Product Name'].apply(
  products_df.storage = products_df['Product Name'].apply(


In [21]:
products_df.drop("Product Name",axis=1)

Unnamed: 0,Price,Ratings,ROM,Display,Camera,Warranty,color
0,82900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Black
1,82900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,White
2,82900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Lavender
3,82900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Mist Blue
4,134900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Deep Blue
5,134900,4.7,256 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Cosmic Orange
6,169900,4.8,512 GB ROM,17.53 cm (6.9 inch) Super Retina XDR Display,48,1,Deep Blue
7,174900,4.7,1 TB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Cosmic Orange
8,149900,4.8,256 GB ROM,17.53 cm (6.9 inch) Super Retina XDR Display,48,1,Deep Blue
9,154900,4.7,512 GB ROM,16.0 cm (6.3 inch) Super Retina XDR Display,48,1,Deep Blue
