## Smartphone Market Analysis (₹20,000+ Segment) — Data Collection

This notebook documents the process of **scraping smartphone data from Flipkart** to construct a comprehensive dataset for analyzing India’s **upper-midrange smartphone market (₹20,000 and above)**.

The collected dataset aims to capture essential competitive parameters, including:

- Price  
- Brand  
- Model Name  
- Ratings and Review Count  
- RAM and Storage Configuration  
- Camera Specifications  
- Battery Capacity  
- Processor / Chipset  
- Display Refresh Rate  

This notebook focuses on **data collection**, with only a light first pass of cleaning (Step 5) so that the scraped records can be exported. Subsequent notebooks will address **deeper data cleaning, structuring, and analytical exploration**.

---
### Step 1: Import Required Libraries

This step imports the Python libraries needed for data collection.  
**Selenium** (with **webdriver-manager**, which installs a matching ChromeDriver) loads Flipkart's dynamically rendered listing pages, and **pandas** holds the scraped smartphone listings.

In [1]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import time
import pandas as pd

---
### Step 2: Define Scraping Strategy

The scraping process targets smartphone listings on Flipkart that meet the following criteria:

- 5G smartphones  
- Price ≥ ₹20,000  
- Sorted by **Popularity** (to prioritize currently relevant and widely sold models)

Each **brand** will be scraped individually to ensure accuracy and consistency across datasets.  
All brand-specific datasets will be **merged** in subsequent stages for unified analysis.
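
The criteria above map onto query parameters in the search URL. The following is a minimal sketch, assuming the parameter names seen in the Oppo URL used later in this notebook also apply to other brands. The helper name `build_search_url` and the brand list are hypothetical and only for illustration.

```python
from urllib.parse import urlencode

BASE = "https://www.flipkart.com/search"

def build_search_url(brand, min_price=20000, page=None):
    """Build a search URL for one brand's 5G phones priced at min_price or more.

    Hypothetical helper: the parameter names are copied from the Oppo URL
    used in this notebook and are assumed to work for other brands too.
    """
    params = [
        ("q", f"{brand} 5g smartphone"),
        ("sort", "popularity"),
        ("p[]", f"facets.price_range.from={min_price}"),
        ("p[]", "facets.price_range.to=Max"),
    ]
    if page is not None:
        params.append(("page", page))
    # urlencode percent-encodes "[", "]" and "=" and turns spaces into "+"
    return f"{BASE}?{urlencode(params)}"

# Illustrative brand list, not the notebook's final selection
for brand in ["oppo", "vivo"]:
    print(build_search_url(brand))
```

Building the URL from a parameter list keeps the price and sort filters identical across brands, which is what makes the per-brand datasets comparable when they are merged.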

---
### Step 3: Identify HTML Structure & CSS Selectors

Analysis of Flipkart’s smartphone listing pages revealed the following key HTML elements and CSS selectors necessary for data extraction:

- **Product card container:** `div.tUxRFH`  
- **Product name:** `div.KzDlHZ`  
- **Product price:** `div.Nx9bqj._4b5DiR`  
- **Product specifications:** `ul.G4BRas > li.J\+igdf` (the `+` in the class name must be escaped, otherwise CSS reads it as a sibling combinator)

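Selenium waits for the product card selector to appear before the page is parsed, and **BeautifulSoup** then uses these selectors to extract the required information from each loaded page. The class names look auto-generated, so they may change and should be re-checked if a scrape returns no cards.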
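
Class names that contain characters such as `+` need escaping before they can be used in a CSS selector. The sketch below shows one simplified way to build such selectors. The helper `class_selector` is hypothetical, and it does not cover every CSS escaping rule (for example, class names that start with a digit).

```python
import re

def class_selector(tag, *classes):
    """Return a CSS selector for `tag` with the given class names.

    Any character that is not a letter, digit, underscore, or hyphen is
    backslash-escaped so it is not read as CSS syntax.
    """
    escaped = [re.sub(r"([^A-Za-z0-9_-])", r"\\\1", c) for c in classes]
    return tag + "".join("." + c for c in escaped)

print(class_selector("div", "Nx9bqj", "_4b5DiR"))  # div.Nx9bqj._4b5DiR
print(class_selector("li", "J+igdf"))              # li.J\+igdf
```

In a regular (non-raw) Python string literal the escaped selector is written with a doubled backslash, for example `"li.J\\+igdf"`.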

In [2]:
# webdriver, Service, By, ChromeDriverManager, and time were imported in Step 1;
# only the explicit-wait helpers are new here
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

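# Launch Chrome; webdriver-manager downloads a matching ChromeDriver if needed
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Oppo 5G smartphones, sorted by popularity, priced from ₹20,000 upward
url = "https://www.flipkart.com/search?q=oppo+5g+smartphone&sort=popularity&p%5B%5D=facets.price_range.from%3D20000&p%5B%5D=facets.price_range.to%3DMax"

driver.get(url)

# Wait up to 10 seconds for at least one product card to be present
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "div.tUxRFH"))
)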

print("Page loaded with 20k+ Oppo phones")

Page loaded with 20k+ Oppo phones


---
### Step 4: Scrape All Pages (Pagination)

Once the first results page loads correctly, the next step automates pagination to capture the complete set of listings.  

Each page is requested by incrementing the **`page=`** parameter in the URL and is then parsed with BeautifulSoup.  
Scraping stops at the first page that contains no product cards, or when the loop's page cap is reached, so the cap must be higher than the number of result pages.
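
The stop condition can be sketched independently of the browser. In this minimal example `fetch_cards` is a hypothetical callable that stands in for "load page N and return its product cards", and `max_pages` is an assumed safety cap rather than a value taken from Flipkart.

```python
def scrape_all_pages(fetch_cards, max_pages=50):
    """Collect cards page by page until a page returns none.

    fetch_cards(page) is a hypothetical callable returning the list of
    product cards for that page number. max_pages is a safety cap.
    """
    collected = []
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        if not cards:  # empty page: past the last page of results
            break
        collected.extend(cards)
    return collected

# Stand-in for a real page fetch: three pages of fake cards, then nothing
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
result = scrape_all_pages(lambda page: fake_site.get(page, []))
print(len(result))  # 5
```

Separating the loop from the fetch makes the stop logic easy to check with fake data before running it against live pages.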

In [3]:
from bs4 import BeautifulSoup

all_products = []

# Base URL for Oppo smartphones ≥ 20k
base_url = "https://www.flipkart.com/search?q=oppo+5g+smartphone&sort=popularity&p%5B%5D=facets.price_range.from%3D20000&p%5B%5D=facets.price_range.to%3DMax&page={}"
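# Loop through pages until one comes back with no product cards.
# MAX_PAGES is only a safety cap so the loop cannot run forever.
MAX_PAGES = 50
for page in range(1, MAX_PAGES + 1):
    print(f"Scraping Page {page}...")
    driver.get(base_url.format(page))
    time.sleep(2)  # fixed pause to let the listing render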

    soup = BeautifulSoup(driver.page_source, "html.parser")
    cards = soup.select("div.tUxRFH")

    if len(cards) == 0:
        print("No more products. Stopping.")
        break

    for card in cards:
        name_tag = card.select_one("div.KzDlHZ")
        price_tag = card.select_one("div.Nx9bqj._4b5DiR")
        specs = card.select("ul.G4BRas li.J\\+igdf")

        name = name_tag.text.strip() if name_tag else None
        price = price_tag.text.strip() if price_tag else None
        specs_list = [s.text.strip() for s in specs] if specs else []

        all_products.append({
            "name": name,
            "price": price,
            "specs": specs_list
        })

print("Total products scraped:", len(all_products))

# Close the browser now that scraping is finished
driver.quit()

Scraping Page 1...
Scraping Page 2...
Scraping Page 3...
Scraping Page 4...
Scraping Page 5...
Scraping Page 6...
Scraping Page 7...
Scraping Page 8...
No more products. Stopping.
Total products scraped: 156


---
### Step 5: Data Cleaning

The raw scraped data requires initial preprocessing before analysis.  
Key cleaning steps include:

- Converting **price strings** (e.g., `"₹24,999"`) into integer values (`24999`)  
- Parsing and normalizing the **specifications list** into structured fields:  
  - RAM (GB)  
  - Storage (GB)  
  - Display Size (inches)  
  - Rear Camera (MP)  
  - Battery Capacity (mAh)  
- Preparing the cleaned dataset for **CSV export**, enabling consistent formatting and easier integration into subsequent processing notebooks.

In [4]:
df = pd.DataFrame(all_products)

# Drop rows where price is None
df = df[df['price'].notna()].copy()

# Convert price string → integer
df['price'] = (
    df['price']
    .str.replace('₹', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(int)
)
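
For checking single values outside the DataFrame, the same conversion can be written as a small function. This is a sketch with a hypothetical helper name, `parse_price`. It assumes listing prices are whole rupees, since a decimal point would simply be dropped along with the other non-digit characters.

```python
import re

def parse_price(text):
    """Return the price as an int, or None if no digits are found.

    Hypothetical helper: keeps only the digits, so the currency symbol,
    thousands separators, and stray whitespace are all ignored.
    """
    if text is None:
        return None
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None

print(parse_price("₹24,999"))  # 24999
print(parse_price(None))       # None
```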

In [5]:
import re

# Each helper joins the spec strings into one text and returns the first
# regex match as a number, or None when the pattern is not found.

def extract_ram(spec_list):
    # RAM in GB, e.g. "8 GB RAM"
    text = " ".join(spec_list)
    match = re.search(r'(\d+)\s*GB RAM', text, re.IGNORECASE)
    return int(match.group(1)) if match else None

def extract_storage(spec_list):
    # Storage in GB, e.g. "256 GB ROM"; sizes given in TB are not matched
    text = " ".join(spec_list)
    match = re.search(r'(\d+)\s*GB ROM', text, re.IGNORECASE)
    return int(match.group(1)) if match else None

def extract_display(spec_list):
    # Display size in inches, e.g. "6.67 inch"
    text = " ".join(spec_list)
    match = re.search(r'(\d+\.\d+|\d+)\s*inch', text, re.IGNORECASE)
    return float(match.group(1)) if match else None

def extract_battery(spec_list):
    # Battery capacity in mAh, e.g. "5000 mAh"
    text = " ".join(spec_list)
    match = re.search(r'(\d+)\s*mAh', text, re.IGNORECASE)
    return int(match.group(1)) if match else None

def extract_camera(spec_list):
    # First MP figure in the spec text, assumed to be the primary rear camera
    text = " ".join(spec_list)
    match = re.search(r'(\d+)\s*MP', text, re.IGNORECASE)
    return int(match.group(1)) if match else None

df['ram_gb'] = df['specs'].apply(extract_ram)
df['storage_gb'] = df['specs'].apply(extract_storage)
df['display_inch'] = df['specs'].apply(extract_display)
df['battery_mah'] = df['specs'].apply(extract_battery)
df['camera_mp'] = df['specs'].apply(extract_camera)
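
To see what these patterns pick up, they can be run on one hand-written spec list. The sample strings below are hypothetical and only imitate the style of a listing card, and `extract_all` is an illustrative table-driven variant of the five functions in the cell above, using the same regular expressions.

```python
import re

# Hypothetical spec list, written in the style of a listing card
sample_specs = [
    "8 GB RAM | 256 GB ROM",
    "16.94 cm (6.67 inch) Full HD+ Display",
    "50MP + 8MP | 32MP Front Camera",
    "5000 mAh Battery",
]

# Field name -> (pattern, type to cast the captured number to)
PATTERNS = {
    "ram_gb": (r"(\d+)\s*GB RAM", int),
    "storage_gb": (r"(\d+)\s*GB ROM", int),
    "display_inch": (r"(\d+\.\d+|\d+)\s*inch", float),
    "battery_mah": (r"(\d+)\s*mAh", int),
    "camera_mp": (r"(\d+)\s*MP", int),
}

def extract_all(spec_list):
    """Return one dict of parsed fields; missing fields are None."""
    text = " ".join(spec_list)
    row = {}
    for field, (pattern, cast) in PATTERNS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        row[field] = cast(match.group(1)) if match else None
    return row

print(extract_all(sample_specs))
# {'ram_gb': 8, 'storage_gb': 256, 'display_inch': 6.67, 'battery_mah': 5000, 'camera_mp': 50}
```

Note that the camera pattern returns the first MP value in the text (50 here), so the result depends on the rear camera being listed before the front camera.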

---
### Step 6: Export Clean Dataset to CSV

In [6]:
df.to_csv("oppo_20k_clean.csv", index=False)
df.shape

(156, 8)

---
### Backing up Oppo scraped data
Save the raw scraped records as JSON and the cleaned DataFrame as CSV and Parquet under a `data/` directory.

In [7]:
import sys
!{sys.executable} -m pip install --upgrade pip



In [8]:
import sys
# pyarrow provides the Parquet engine needed by df.to_parquet
!{sys.executable} -m pip install pyarrow



In [9]:
import json
import os

# Create the output folders (no error if they already exist)
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/clean", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)

# Save raw list (all_products came from scraping)
with open("data/raw/oppo_20k_raw.json", "w", encoding="utf-8") as f:
    json.dump(all_products, f, ensure_ascii=False, indent=2)

# Save cleaned DataFrame (df) to CSV and parquet
df.to_csv("data/clean/oppo_20k_clean.csv", index=False)
df.to_parquet("data/clean/oppo_20k_clean.parquet", index=False)

print("Saved:")
print(" - data/raw/oppo_20k_raw.json")
print(" - data/clean/oppo_20k_clean.csv")
print(" - data/clean/oppo_20k_clean.parquet")
print("Current cleaned df shape:", df.shape)

Saved:
 - data/raw/oppo_20k_raw.json
 - data/clean/oppo_20k_clean.csv
 - data/clean/oppo_20k_clean.parquet
Current cleaned df shape: (156, 8)
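
A quick way to confirm that a raw backup is usable is to reload it and compare it with what was written. The sketch below does this in a temporary directory with a single hypothetical record shaped like the scraped entries. It does not touch the files saved above.

```python
import json
import os
import tempfile

# Hypothetical record shaped like one entry of the scraped list
records = [
    {"name": "Example Phone", "price": "₹24,999", "specs": ["8 GB RAM | 256 GB ROM"]}
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "raw.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    # Reload the file as parsed JSON and as plain text
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)
    with open(path, encoding="utf-8") as f:
        raw_text = f.read()

print(reloaded == records)  # True
print("₹" in raw_text)      # True: ensure_ascii=False keeps the symbol readable
```

With `ensure_ascii=False` the rupee symbol is stored as-is rather than as a `\u20b9` escape, which is why the file is opened with `encoding="utf-8"` for both writing and reading.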

