This project is part of the CodeAlpha Data Science Internship. It covers a complete end-to-end data science pipeline — from raw web scraping of Flipkart's laptop listings, through data cleaning and EDA, to advanced data visualization and NLP-based sentiment analysis.
- Target Website: Flipkart.com — India's leading e-commerce platform
- Category Scraped: Laptops
- Total Records Collected: 700+ unique laptop listings
```text
CodeAlpha_DataPulse/
│
├── scraper.py                          # Task 1A — Web scraping (Selenium + BS4)
├── clean_data.py                       # Task 1B — Data cleaning & preprocessing
├── Task2_EDA_Flipkart_Laptops.ipynb    # Task 2 — Exploratory Data Analysis
├── Task3_Task4_Flipkart_Laptops.ipynb  # Task 3 & 4 — Visualization + Sentiment Analysis
│
├── data/
│   ├── flipkart_laptops.csv            # Raw scraped data (messy)
│   └── flipkart_laptops_cleaned.csv    # Final cleaned dataset
│
├── images/
│   ├── viz1_executive_dashboard.png
│   ├── viz2_brand_battle.png
│   ├── viz3_bubble_chart.png
│   ├── viz4_correlation_heatmap.png
│   ├── viz5_price_boxplot.png
│   ├── viz6_best_value.png
│   ├── viz7_sentiment_distribution.png
│   ├── viz8_vader_by_brand.png
│   ├── viz9_sentiment_vs_rating.png
│   ├── viz10_vader_vs_textblob_scatter.png
│   ├── viz11_wordcloud_positive.png
│   └── viz12_subjectivity_vs_polarity.png
│
└── README.md                           # Project documentation
```
| Tool | Purpose |
|---|---|
| Python 3.8+ | Core programming language |
| Selenium | Browser automation to bypass JS rendering |
| BeautifulSoup4 | HTML parsing and data extraction |
| Pandas | Data manipulation and cleaning |
| NumPy | Numerical operations |
| Regex (re) | Pattern matching for text cleaning |
| Matplotlib | Core plotting library |
| Seaborn | Statistical data visualization |
| WordCloud | Word cloud generation |
| VADER Sentiment | Rule-based NLP sentiment scoring |
| TextBlob | Lexicon-based polarity & subjectivity analysis |
| ChromeDriver | Chrome driver binary that Selenium controls (run headless) |
Flipkart renders its product listings with JavaScript, so a plain `requests.get()` returns an empty or incomplete HTML page. Selenium launches a real Chrome browser, waits for the page to load fully, and then hands the rendered HTML to BeautifulSoup for extraction.
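Once Selenium has produced the rendered page source, the extraction side is plain BeautifulSoup. Below is a minimal sketch against a stand-in HTML fragment: the `product-card`/`title`/`price` class names are placeholders, since Flipkart's real (obfuscated) class names differ and change often.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source after Selenium has rendered the listings.
# Class names are illustrative placeholders, not Flipkart's real ones.
rendered_html = """
<div class="product-card">
  <div class="title">HP Victus Intel Core i5...</div>
  <div class="price">₹72,990</div>
</div>
<div class="product-card">
  <div class="title">Acer TravelLite AMD Ryzen 5...</div>
  <div class="price">₹44,990</div>
</div>
"""

soup = BeautifulSoup(rendered_html, "html.parser")
rows = []
for card in soup.select("div.product-card"):
    rows.append({
        "Product Name": card.select_one("div.title").get_text(strip=True),
        "Selling Price": card.select_one("div.price").get_text(strip=True),
    })

print(rows)
```

In the real scraper this parsing runs once per results page, appending each page's rows before writing the combined list to `flipkart_laptops.csv`.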
| Column | Description |
|---|---|
| Product Name | Full product title as shown on Flipkart |
| Selling Price | Current discounted price |
| Original Price | MRP before discount |
| Discount | Percentage discount offered |
| Rating | Average customer rating (out of 5) |
| Rating Count | Total number of ratings and reviews |
```shell
# Step 1 — Install dependencies
pip install selenium webdriver-manager beautifulsoup4 pandas

# Step 2 — Run the scraper
python scraper.py
```

Expected output:

```text
✅ Page 1 scraped — 24 products collected so far
✅ Page 2 scraped — 48 products collected so far
...
🎉 Done! 700+ products saved to flipkart_laptops.csv
```
| Issue | Example (Raw) | Fix Applied |
|---|---|---|
| Broken price encoding | `â,'44,990` | Extracted digits only → `44990` |
| Ratings and reviews merged | `125 Ratings & 11 Reviews` | Split into two numeric columns |
| No brand/company column | — | Extracted brand name from product title |
| Discount stored as text | `10% off` | Extracted number only → `10` |
| Duplicate product listings | Multiple identical rows | Removed via `drop_duplicates()` |
| Prices stored as strings | `"₹44,990"` | Converted to float |
```shell
python clean_data.py
```

Before:
| Product Name | Selling Price | Discount | Rating Count |
|---|---|---|---|
| Acer TravelLite AMD Ryzen 5... | â,'44,990 | 10% off | 125 Ratings & 11 Reviews |
| HP Victus Intel Core i5... | â,'72,990 | 10% off | 733 Ratings & 50 Reviews |
After:
| Brand | Product Name | Selling Price (INR) | Original Price (INR) | Discount (%) | Rating | Number of Ratings | Number of Reviews |
|---|---|---|---|---|---|---|---|
| Acer | Acer TravelLite AMD Ryzen 5... | 44990 | 50000 | 10 | 3.8 | 125 | 11 |
| HP | HP Victus Intel Core i5... | 72990 | 81201 | 10 | 4.3 | 733 | 50 |
| ASUS | ASUS Vivobook Go 15... | 38990 | 50990 | 23 | 4.3 | 1532 | 77 |
1. Extract Brand Name → from Product Name using keyword matching
2. Fix Price Encoding → strip non-numeric chars, convert to float
3. Split Rating Column → separate "Number of Ratings" and "Number of Reviews"
4. Clean Discount → extract integer from "10% off" → 10
5. Remove Duplicates → based on Product Name
6. Drop Null Rows → remove rows with missing Product Name
7. Reorder Columns → logical order for analysis
8. Export Clean CSV → flipkart_laptops_cleaned.csv
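The core of these steps can be sketched with pandas and regex. This is a simplified illustration of the pipeline, not the exact `clean_data.py` code; in particular, brand extraction here just takes the first token of the title, whereas the script uses keyword matching.

```python
import re

import pandas as pd

# Toy rows mirroring the "Before" table above
raw = pd.DataFrame({
    "Product Name": ["Acer TravelLite AMD Ryzen 5...", "HP Victus Intel Core i5..."],
    "Selling Price": ["â,'44,990", "â,'72,990"],
    "Discount": ["10% off", "10% off"],
    "Rating Count": ["125 Ratings & 11 Reviews", "733 Ratings & 50 Reviews"],
})

def digits_only(text: str) -> float:
    """Strip every non-digit character (this also fixes the broken ₹ encoding)."""
    return float(re.sub(r"\D", "", text))

clean = pd.DataFrame()
clean["Brand"] = raw["Product Name"].str.split().str[0]               # step 1 (simplified)
clean["Product Name"] = raw["Product Name"]
clean["Selling Price (INR)"] = raw["Selling Price"].map(digits_only)  # step 2
counts = raw["Rating Count"].str.extract(r"(\d+) Ratings & (\d+) Reviews")  # step 3
clean["Number of Ratings"] = counts[0].astype(int)
clean["Number of Reviews"] = counts[1].astype(int)
clean["Discount (%)"] = raw["Discount"].str.extract(r"(\d+)", expand=False).astype(int)  # step 4
clean = clean.drop_duplicates(subset="Product Name").dropna(subset=["Product Name"])  # steps 5-6

print(clean)
```

The final script also reorders columns and writes the result with `to_csv("flipkart_laptops_cleaned.csv", index=False)`.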
| Metric | Value |
|---|---|
| Total Products | 700+ |
| Brands Covered | HP, Lenovo, ASUS, Acer, Dell, Samsung, MSI, Apple and more |
| Price Range | ₹14,990 — ₹2,00,000+ |
| Average Rating | ~4.1 / 5 |
| Columns | 8 |
Performed a thorough EDA on the cleaned dataset to uncover patterns and insights before visualization.
- Descriptive Statistics — mean, median, std, quartiles across all numeric columns
- Missing Value Analysis — identified and handled nulls in Rating, Price, and Review columns
- Brand-wise Distribution — product count and market share per brand
- Price Distribution — histogram + KDE plot of selling prices; identified skew towards mid-range
- Discount Analysis — distribution of discounts; brands offering the highest average discounts
- Rating Distribution — most products cluster between 3.8 and 4.5 stars
- Correlation Analysis — explored relationships between price, discount, rating, and review count
- Price Segmentation — categorized products into Budget / Mid-Range / Premium / Ultra Premium
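The price segmentation step is a natural fit for `pd.cut`. The bin edges below are illustrative assumptions, not necessarily the notebook's exact cut-offs:

```python
import pandas as pd

prices = pd.Series([14990, 38990, 44990, 72990, 129990, 204990],
                   name="Selling Price (INR)")

# Illustrative bin edges; the notebook's actual cut-offs may differ
segments = pd.cut(
    prices,
    bins=[0, 30_000, 60_000, 100_000, float("inf")],
    labels=["Budget", "Mid-Range", "Premium", "Ultra Premium"],
)
print(segments.value_counts().sort_index())
```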
- HP and Lenovo dominate the listing count with the most diverse portfolios
- Higher-priced laptops do not guarantee higher ratings — only a weak positive correlation was observed
- Discount % has no strong correlation with ratings, suggesting discounts are marketing-driven
- Number of Ratings and Number of Reviews are highly correlated (as expected)
- Budget segment (<₹30K) has limited options but includes some highly rated value picks
Six compelling visualizations were created to tell the complete story of the Flipkart laptop market.
**Viz 1 — Executive Dashboard:** A multi-panel overview with KPI cards showing total products, brand count, average price, average rating, average discount, and total reviews — plus five sub-charts covering brand distribution, price segment share, discount histogram, rating distribution, and top brands by review volume.

**Viz 2 — Brand Battle:** Three side-by-side horizontal bar charts comparing every brand across average selling price, average discount %, and average rating, color-mapped for quick comparison.

**Viz 3 — Bubble Chart:** A scatter plot where each bubble represents one laptop: X-axis = selling price, Y-axis = discount %, and bubble size = number of ratings (popularity). Color-coded by brand.

**Viz 4 — Correlation Heatmap:** A lower-triangle correlation matrix across all numeric features. It reveals a strong positive correlation between the number of ratings and reviews, and a weak correlation between price and rating.

**Viz 5 — Price Boxplot:** A box + strip plot showing price spread (IQR, median, outliers) for the top 7 brands by listing count. Apple stands out as a clear ultra-premium outlier, while ASUS and Acer compete in the budget-to-mid range.

**Viz 6 — Best Value:** A ranked horizontal bar chart using a Value Score = (Rating / Selling Price) × 100,000, filtered to laptops with 50+ ratings. It identifies the best quality-per-rupee laptops on the platform.
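The Value Score ranking reduces to a few pandas lines. Toy data is shown here; the notebook reads `flipkart_laptops_cleaned.csv` instead:

```python
import pandas as pd

# Toy rows; real data comes from flipkart_laptops_cleaned.csv
df = pd.DataFrame({
    "Product Name": ["ASUS Vivobook Go 15...", "HP Victus Intel Core i5...", "Niche Laptop"],
    "Selling Price (INR)": [38990, 72990, 25000],
    "Rating": [4.3, 4.3, 4.9],
    "Number of Ratings": [1532, 733, 12],
})

# Value Score = (Rating / Selling Price) × 100,000, restricted to 50+ ratings
popular = df[df["Number of Ratings"] >= 50].copy()
popular["Value Score"] = popular["Rating"] / popular["Selling Price (INR)"] * 100_000
best = popular.sort_values("Value Score", ascending=False)
print(best[["Product Name", "Value Score"]])
```

The 50+ ratings filter keeps thinly-reviewed listings from dominating the ranking on the strength of a handful of ratings.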
Since the dataset contains product names rather than raw review text, a layered NLP strategy was applied — two scoring layers plus a rating-based validation layer:
| Layer | Tool | What It Analyzes |
|---|---|---|
| VADER | Rule-based lexicon (domain-tuned) | Product name tokens and spec keywords |
| TextBlob | Lexicon-based polarity scoring | Product name polarity & subjectivity |
| Rating-Backed | Ground-truth validation | Actual customer rating converted to sentiment label |
Product names were cleaned before NLP scoring:
- Removed parenthesized content (model suffixes)
- Stripped storage specs (e.g. `16GB`, `512GB SSD`)
- Removed all non-alphabetic characters
- Lowercased and normalized whitespace
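These steps can be sketched as a small regex pipeline; the exact patterns in the notebook may differ:

```python
import re

def clean_name(name: str) -> str:
    """Normalize a product title before sentiment scoring (illustrative patterns)."""
    name = re.sub(r"\(.*?\)", " ", name)                  # drop parenthesized model suffixes
    name = re.sub(r"\b\d+\s?(gb|tb)\b|\bssd\b|\bhdd\b",   # strip storage specs
                  " ", name, flags=re.I)
    name = re.sub(r"[^a-zA-Z\s]", " ", name)              # keep letters only
    return re.sub(r"\s+", " ", name).strip().lower()      # normalize whitespace + lowercase

print(clean_name("ASUS Vivobook Go 15 (16GB/512GB SSD) Slim Laptop"))
```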
VADER's lexicon was extended with laptop-domain-specific words:
| Positive Words (boosted) | Negative Words (penalized) |
|---|---|
| slim, ultra, pro, elite, premium, gaming, lightweight, oled, powerful | lag, budget, basic, lite, entry, slow, chromebook |
Each product received four VADER scores: Positive, Negative, Neutral, Compound. Compound ≥ 0.05 → Positive, ≤ −0.05 → Negative, else → Neutral.
TextBlob provided:
- Polarity score (−1 to +1): product name tone
- Subjectivity score (0 to 1): how descriptive vs factual the name is
Customer ratings were mapped to sentiment labels for validation:
- Rating ≥ 4.0 → Positive
- Rating 3.5–3.99 → Neutral
- Rating < 3.5 → Negative
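As a sketch, the mapping is a simple threshold function:

```python
def rating_to_sentiment(rating: float) -> str:
    """Map a 0-5 star rating to the ground-truth label used for validation."""
    if rating >= 4.0:
        return "Positive"
    if rating >= 3.5:
        return "Neutral"
    return "Negative"

print([rating_to_sentiment(r) for r in (4.3, 3.8, 3.2)])
```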
| # | Chart | Description |
|---|---|---|
| 7 | Sentiment Distribution | Side-by-side bar charts comparing VADER vs TextBlob vs Rating-based sentiment |
| 8 | VADER by Brand | Average VADER compound score per brand — reveals brand name perception |
| 9 | Sentiment vs Avg Rating | Validates NLP labels against actual customer ratings |
| 10 | VADER vs TextBlob Scatter | Agreement scatter plot; points colored by actual rating sentiment |
| 11 | Word Cloud (Positive) | Word cloud of product names with positive VADER sentiment |
| 12 | Subjectivity vs Polarity | TextBlob scatter by price segment — Ultra Premium names are more subjective |
- VADER assigned a majority of laptops Positive sentiment, driven by spec-rich branding (Ultra, Pro, Elite)
- TextBlob was more conservative, with most products labeled Neutral due to factual/technical naming
- Rating-backed validation confirmed NLP accuracy: Positive-labeled products consistently had higher average customer ratings
- VADER and TextBlob agree on polarity direction for ~70% of products
- Ultra Premium laptops had the highest TextBlob subjectivity scores — premium brand names are more emotionally descriptive
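The VADER-vs-TextBlob agreement figure comes from a straightforward label comparison, sketched here on toy labels (the notebook compares the full label columns):

```python
import pandas as pd

# Toy labels; in the notebook these come from the VADER and TextBlob label columns
df = pd.DataFrame({
    "vader_label":    ["Positive", "Neutral", "Positive", "Negative"],
    "textblob_label": ["Positive", "Neutral", "Neutral",  "Negative"],
})

# Fraction of rows where the two tools assign the same label
agreement = (df["vader_label"] == df["textblob_label"]).mean()
print(f"Agreement: {agreement:.0%}")
```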
```shell
# 1. Clone the repository
git clone https://github.com/Khprateek/CodeAlpha_DataPulse.git
cd CodeAlpha_DataPulse

# 2. Install all dependencies
pip install selenium webdriver-manager beautifulsoup4 pandas numpy \
    matplotlib seaborn wordcloud vaderSentiment textblob

# 3. Download TextBlob corpora
python -m textblob.download_corpora

# 4. Run scraper
python scraper.py

# 5. Run data cleaning
python clean_data.py

# 6. Open Jupyter notebooks for EDA, Visualization & Sentiment Analysis
jupyter notebook
```

CodeAlpha Data Science Intern
🔗 GitHub: Khprateek
🔗 LinkedIn: Prateek Kharwar
This project is open source and available under the MIT License.
Built as part of the CodeAlpha Data Science Internship Program