# Checkpoint 3: Web Scraping for Reviews

## Objective
This task scrapes product reviews from a given product URL, extracts the review text, rating, reviewer name, and date of each review, and saves the results to a structured CSV file for further analysis.

## Libraries Used
- **requests**: To fetch HTML content of the product page.
- **BeautifulSoup**: For parsing and extracting data from HTML.
- **pandas**: To organize and save the extracted data.

Install the required libraries using:
```bash
pip install requests beautifulsoup4 pandas
```
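Before running the scraper, it can help to confirm that the packages are importable in the active environment. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the original script. Note that the `beautifulsoup4` package is imported under the module name `bs4`.

```python
from importlib.util import find_spec

def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if find_spec(name) is None]

# Import names can differ from pip names: beautifulsoup4 is imported as bs4.
required = ["requests", "bs4", "pandas"]
missing = missing_packages(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If any module is reported as missing, rerun the `pip install` command above in the same environment that runs the notebook.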

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import logging
import os

# Configure logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Function to scrape reviews from a product URL
def scrape_reviews(product_url, review_selector, text_selector, rating_selector, author_selector, date_selector):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    try:
        logging.info(f"Fetching {product_url}...")
        # A timeout prevents the request from hanging indefinitely
        response = requests.get(product_url, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching the page: {e}")
        return pd.DataFrame()

    soup = BeautifulSoup(response.content, "html.parser")
    reviews = soup.select(review_selector)

    review_texts = []
    ratings = []
    reviewer_names = []
    review_dates = []

    for review in reviews:
        # Query each selector once; a missing element yields None
        text_el = review.select_one(text_selector)
        review_texts.append(text_el.get_text(strip=True) if text_el else None)

        # Ratings are converted to a float (or None) by clean_rating
        rating_el = review.select_one(rating_selector)
        ratings.append(clean_rating(rating_el.get_text(strip=True) if rating_el else None))

        name_el = review.select_one(author_selector)
        reviewer_names.append(name_el.get_text(strip=True) if name_el else None)

        date_el = review.select_one(date_selector)
        review_dates.append(date_el.get_text(strip=True) if date_el else None)

    return pd.DataFrame({
        "Review Text": review_texts,
        "Rating": ratings,
        "Reviewer Name": reviewer_names,
        "Review Date": review_dates
    })

# Function to clean rating data
def clean_rating(rating):
    if rating:
        try:
            return float(rating.split()[0])
        except (ValueError, AttributeError):
            return None
    return None

# Main function to scrape and save reviews
def main():
    product_url = input("Enter the product URL to scrape reviews: ").strip()
    output_file = "Reviews.csv"

    # Define selectors (customize for the target website)
    selectors = {
        "review_selector": ".review",
        "text_selector": ".review-text",
        "rating_selector": ".review-rating",
        "author_selector": ".review-author",
        "date_selector": ".review-date"
    }

    # Scrape reviews
    reviews_data = scrape_reviews(product_url, **selectors)

    if reviews_data.empty:
        logging.error("No reviews scraped. Check the URL or selectors.")
        return

    # Save to CSV
    if os.path.exists(output_file):
        logging.warning(f"{output_file} already exists. Appending data.")
        existing_data = pd.read_csv(output_file)
        reviews_data = pd.concat([existing_data, reviews_data], ignore_index=True)

    reviews_data.to_csv(output_file, index=False)
    logging.info(f"Scraped reviews saved to {output_file}")

if __name__ == "__main__":
    main()

2025-02-10 14:57:08,661 - INFO - Fetching https://www.amazon.in/ZEBRONICS-Zeb-Thunder-Connectivity-Sea-Green/dp/B09B5CPV71/ref=sr_1_3?sr=8-3...
2025-02-10 14:57:12,270 - INFO - Scraped reviews saved to Reviews.csv


## Steps to Run the Script
1. **Install Required Libraries**:
   ```bash
   pip install requests beautifulsoup4 pandas
   ```
2. **Run the Script**:
   - Execute the code and provide a valid product URL when prompted.
3. **Output**:
   - The extracted reviews are saved to a file named `Reviews.csv`. If the file already exists, the new rows are appended to its existing contents.
   - The file contains fields: Review Text, Rating, Reviewer Name, and Review Date.
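After a run, the output file can be sanity-checked without pandas. The sketch below is one possible check, not part of the original script: the helper `summarize_reviews` and the sample rows are hypothetical, and it assumes missing ratings are stored as empty cells, which is what `DataFrame.to_csv` writes for `None` by default.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["Review Text", "Rating", "Reviewer Name", "Review Date"]

def summarize_reviews(csv_path):
    """Return (row_count, average_rating) for a CSV with the expected columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != EXPECTED_COLUMNS:
            raise ValueError(f"Unexpected columns: {reader.fieldnames}")
        rows = list(reader)
    # Missing ratings appear as empty cells, so they are skipped here
    ratings = [float(r["Rating"]) for r in rows if r["Rating"]]
    average = sum(ratings) / len(ratings) if ratings else None
    return len(rows), average

# Demonstration on made-up rows written to a temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample_reviews.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(["Good sound", "4.0", "A. Reader", "1 January 2025"])
        writer.writerow(["No rating given", "", "B. Reader", "2 January 2025"])
    count, average = summarize_reviews(sample_path)
    print(count, average)
```

To check a real run, call `summarize_reviews("Reviews.csv")` from the directory where the script was executed.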

## Notes
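- The script parses only the HTML returned by the initial request, so it works for static pages. Reviews that are loaded by JavaScript after the page renders will not be found; for such dynamic content, use a browser automation tool like `selenium`.
- Ensure the URL provided is from the target website and that the page contains reviews. The selectors defined in `main()` (`.review`, `.review-text`, and so on) must be customized to match that website's HTML; otherwise no reviews are scraped.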
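
The URL typed at the prompt is passed to `requests.get` unchecked. A lightweight pre-check along the following lines could reject obviously malformed input before any request is made. This is a sketch under assumptions: the function name `is_probable_product_url` and the `expected_host` parameter are hypothetical, and the check only inspects the URL's shape; it cannot tell whether the page actually contains reviews.

```python
from urllib.parse import urlparse

def is_probable_product_url(url, expected_host=None):
    """Return True if url is an absolute http(s) URL, optionally on expected_host."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    if expected_host is not None:
        host = parsed.hostname or ""
        # Accept the host itself or any of its subdomains
        return host == expected_host or host.endswith("." + expected_host)
    return True

# example.com is a placeholder domain used only for illustration
print(is_probable_product_url("https://www.example.com/product/123"))
print(is_probable_product_url("www.example.com/product/123"))
print(is_probable_product_url("https://shop.example.com/p/1", expected_host="example.com"))
```

The second call returns `False` because the URL has no scheme, which is a common mistake when pasting an address by hand.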