# Web Scraping

### Introduction to the project
In this IT Coding project, we dive into the world of web scraping, harnessing the power of two formidable Python libraries, Selenium and BeautifulSoup, to extract and analyze Amazon product reviews.

### Understanding web scraping
***What exactly is web scraping?*** Web scraping is a technique for automatically retrieving and extracting data from websites. In our case, it involves parsing the underlying HTML (Hypertext Markup Language) and XML (eXtensible Markup Language) of web pages. Through this process, we can capture specific information, be it plain text, images, links, or more structured data like product reviews.

In this project, we focus on Amazon product reviews, but the applications of web scraping are boundless.

This is the URL of the product's review pages (the page number is appended after **pageNumber=**): https://www.amazon.co.uk/Sennheiser-Crystal-Clear-Cancellation-Customizable-Lightweight-MOMENTUM-4-Wireless-Black/product-reviews/B0B6GHW1SX/ref=cm_cr_arp_d_paging_btm_next_2?ie=8&reviewerType=all_reviews&pageNumber=
 

# Importing Dependencies

In [2]:
import time
from datetime import datetime
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException
import pandas as pd
import re

The script begins by importing necessary Python libraries:
- **time**: Used for adding time delays during web scraping.
- **datetime**: Used for parsing and reformatting the review dates.
- **BeautifulSoup**: A library for parsing HTML and XML documents.
- **selenium**: A web automation tool for controlling web browsers.
- **pandas**: A data manipulation library.
- **re**: The regular expressions library for text pattern matching.

# Chromedriver

Selenium is a powerful tool for controlling a web browser from a program. It works with all major browsers and has bindings for several programming languages, including Python.

Selenium WebDriver is a web automation framework that allows you to execute your tests against different browsers.

To find a ChromeDriver version that matches your Chrome installation, follow this link: https://chromedriver.chromium.org/

Selenium WebDriver requires a driver to interface with the chosen browser. The driver acts as a bridge between your script and the browser. In this case, I'm using Chrome, so I need the ChromeDriver.

In the following code I didn't explicitly specify the path to the ChromeDriver because I had already set it up for a previous project.

**How is this possible?** I've added *chromedriver.exe* to a directory listed in my PATH (on Windows, the system variable that the operating system uses to locate executables from the command line or Terminal window). As a result, the system automatically finds *chromedriver.exe* in the directories specified in the PATH.

Otherwise, it is sufficient to define **chromedriver_path = 'path/to/chromedriver.exe'** and pass that path to Selenium when creating the driver.

### In Windows
To add ChromeDriver to the PATH, edit the system environment variables and append the folder that contains *chromedriver.exe* to the **Path** variable.
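
A quick way to confirm that the driver is visible through the PATH is to ask Python where it would find the executable. The sketch below uses only the standard library; the helper name **find_chromedriver** is hypothetical and not part of the original script.

```python
import shutil


def find_chromedriver(executable='chromedriver'):
    """Return the full path of the driver if a PATH directory contains it, else None."""
    # On Windows, shutil.which also tries the usual extensions such as .exe
    return shutil.which(executable)


driver_path = find_chromedriver()
if driver_path is None:
    print('chromedriver is not on the PATH; its location must be given explicitly.')
else:
    print(f'chromedriver found at: {driver_path}')
```

If the lookup returns nothing right after editing the environment variables, opening a new terminal (or restarting the notebook kernel) is usually needed so that the updated PATH is picked up.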

# Insights

The script uses the Selenium library to automate a headless Chrome browser for web scraping.
- Headless means that the browser runs without a visible interface, which is useful for scraping without user interaction.
- The argument **'--disable-gpu'** disables GPU hardware acceleration when running Chrome in headless mode.

The line **driver = webdriver.Chrome(options=options)** then starts the browser with these options, so the script can load and scrape pages in the background **without a visible browser window**.
Disabling the GPU was long recommended as a workaround for headless Chrome, particularly on Windows; it mainly avoids GPU-related problems and is not guaranteed to make scraping faster.

### CSS Selectors

Cascading Style Sheets (CSS) selectors are patterns used to select HTML elements, originally in order to style them. Here BeautifulSoup uses them to locate the parts of each review.

**'.a-section.review'**: This selector targets the container element that wraps each individual review on the Amazon product reviews page.

**'.review-text'**: This selector is used to extract the text content of the review.

**'.review-title'**: This selector is used to extract the text content of the review title.

**'.a-icon-alt'**: This selector is used to extract the text content of the star rating given in the review.

**'.review-date'**: This selector is used to extract the text content of the review date.

**'.a-profile-name'**: This selector is used to extract the text content of the reviewer's ID or name.

**'li.a-last'**: This selector is used to find the "Next" button element at the bottom of the reviews page, allowing the script to navigate to the next page of reviews.

### Functions:
- **extract_star_rating(star_text)**: This function takes the text containing star ratings and extracts the numeric part.
- **extract_review_date(review_date_text)**: It extracts the date part from the text containing the review date.
- **scrape_amazon_reviews**: Scrapes Amazon product reviews from a given URL. It uses Selenium for web automation and BeautifulSoup for HTML parsing. The function collects review details such as text, titles, star ratings, dates, and reviewer IDs. It handles retries and errors. After defining the function, the code utilizes it to scrape reviews from multiple pages, and the data is stored in lists for analysis and research.
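
The retry handling mentioned above follows a common pattern: try an action, wait, and try again a limited number of times. Below is a minimal, generic sketch of that pattern. The names **with_retries** and **flaky_fetch** are hypothetical, and unlike **scrape_amazon_reviews** this version re-raises the last error instead of returning empty results.

```python
import time


def with_retries(action, max_retries=3, retry_delay=2, sleep=time.sleep):
    """Call action() until it succeeds or max_retries retries have been used."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception as error:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller see the last error
            print(f'Attempt {attempt + 1} failed ({error}); retrying in {retry_delay} s')
            sleep(retry_delay)


# Usage with a deliberately flaky stand-in for a page request
attempts = []


def flaky_fetch():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError('temporary failure')
    return 'page source'


print(with_retries(flaky_fetch, retry_delay=0))
```

Passing the **sleep** function as a parameter is only a convenience for testing, so that the delay can be replaced by a no-op.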

In [None]:
def extract_star_rating(star_text):
    star_parts = star_text.split()
    if len(star_parts) > 0:
        return star_parts[0]
    else:
        return 'N/A'

def extract_review_date(review_date_text):
    date_match = re.search(r'on (\d+ [A-Za-z]+ \d{4})', review_date_text)
    if date_match:
        return date_match.group(1)
    else:
        return 'N/A'

def scrape_amazon_reviews(url, max_retries=3, retry_delay=2):
    options = Options()
    options.add_argument('--headless')
    options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=options)
    
    all_reviews = []
    all_review_titles = []
    all_star_ratings = []
    all_review_dates = []
    all_reviewer_ids = []  
    unique_reviews = set() 
    
    for attempt in range(max_retries + 1):
        try:
            driver.get(url)
            time.sleep(3)
            
            reached_end = False
            
            while True:
                try:
                    soup = BeautifulSoup(driver.page_source, 'html.parser')
                    
                    review_elements = soup.select('.a-section.review')
                    
                    for review_elem in review_elements:
                        review_text = review_elem.select_one('.review-text').get_text(strip=True)
                        
                        # Review title
                        review_title_elem = review_elem.select_one('.review-title')
                        review_title = review_title_elem.get_text(strip=True)
                        
                        # Star rating
                        star_elem = review_elem.select_one('.a-icon-alt')
                        star_rating = star_elem.get_text(strip=True)
                        
                        # Review date
                        review_date_elem = review_elem.select_one('.review-date')
                        review_date = review_date_elem.get_text(strip=True)
                        review_date = extract_review_date(review_date)  # Extract the date part
                        
                        # Reviewer ID
                        reviewer_id_elem = review_elem.select_one('.a-profile-name')
                        reviewer_id = reviewer_id_elem.get_text(strip=True)
                        
                        # Avoid repetition of reviews
                        if review_text not in unique_reviews:
                            all_reviews.append(review_text)
                            all_review_titles.append(review_title)
                            all_star_ratings.append(extract_star_rating(star_rating))
                            all_review_dates.append(review_date)
                            all_reviewer_ids.append(reviewer_id)
                            unique_reviews.add(review_text)
                    
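                    # "Next" button: on the last page it is missing or has no link
                    next_button = soup.find('li', {'class': 'a-last'})
                    next_link = next_button.find('a') if next_button else None
                    if next_link is None:
                        reached_end = True
                        break
                    
                    next_url = f"https://www.amazon.co.uk{next_link['href']}"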
                    driver.get(next_url)
                    time.sleep(5)
                
                except Exception as inner_e:
                    print(f"An error occurred while parsing: {inner_e}")
                    break  
            
            if reached_end or attempt >= max_retries:
                driver.quit()
                return all_reviews, all_review_titles, all_star_ratings, all_review_dates, all_reviewer_ids
            
            print(f"Failed to fetch data. Retrying in {retry_delay} seconds... (Attempt {attempt + 1}/{max_retries})")
            time.sleep(retry_delay)
        
        except WebDriverException as e:
            print(f"An error occurred with the WebDriver: {e}")
            print(f"Retrying in {retry_delay} seconds... (Attempt {attempt + 1}/{max_retries})")
            time.sleep(retry_delay)
    
    driver.quit()
    print(f"Data retrieval failed after {max_retries} attempts. Please check the URL and website accessibility.")
    return None, None, None, None, None

base_url = 'https://www.amazon.co.uk/Sennheiser-Crystal-Clear-Cancellation-Customizable-Lightweight-MOMENTUM-4-Wireless-Black/product-reviews/B0B6GHW1SX/ref=cm_cr_arp_d_paging_btm_next_2?ie=8&reviewerType=all_reviews&pageNumber='

# 'pageNumber=' is left without a value so that the page number can be appended in the loop below

all_reviews = []
all_review_titles = []
all_star_ratings = []
all_review_dates = []
all_reviewer_ids = []

for page_number in range(1, 12):
    page_url = base_url + str(page_number)
    reviews, review_titles, star_ratings, review_dates, reviewer_ids = scrape_amazon_reviews(page_url)
    
    if reviews is not None:
        all_reviews.extend(reviews)
        all_review_titles.extend(review_titles)
        all_star_ratings.extend(star_ratings)
        all_review_dates.extend(review_dates)
        all_reviewer_ids.extend(reviewer_ids)

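df = pd.DataFrame({'ID': all_reviewer_ids, 'Title': all_review_titles, 'Rating': all_star_ratings, 'Review': all_reviews, 'Date': all_review_dates})

# scrape_amazon_reviews follows the "Next" links itself, so runs started from
# consecutive pages overlap; keep each review only once
df = df.drop_duplicates(subset=['ID', 'Review']).reset_index(drop=True)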

print(df.head())
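
The page URLs in the loop above are built by appending the page number to a base URL that ends in **pageNumber=**. An alternative sketch, using only the standard library, sets the query parameter explicitly so that it works wherever the parameter sits in the URL. The helper name **with_page_number** and the shortened example URL are illustrative assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page_number(url, page_number):
    """Return url with its pageNumber query parameter set to page_number."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves the empty 'pageNumber=' entry
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query['pageNumber'] = str(page_number)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Shortened, illustrative URL (not the full product URL used in the script)
example_url = 'https://www.amazon.co.uk/product-reviews/B0B6GHW1SX/?ie=8&reviewerType=all_reviews&pageNumber='
for page in range(1, 4):
    print(with_page_number(example_url, page))
```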


The scrape came with one problem: the rating text (for example, "4.0 out of 5 stars") was attached to the beginning of the titles.

In [4]:
# Cleaning the titles
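# regex=True is required: since pandas 2.0 the pattern is otherwise treated as a literal string
df['Title'] = df['Title'].str.replace(r'\d(\.\d)? out of 5 stars', '', regex=True)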
df

| Unnamed: 0 | ID | Review | Title | Rating | Date |
|---|---|---|---|---|---|
| 0 | John C. | i have owned the momentum 2’s for many years a... | awesome headphones…. but… | 4.0 | 08 10 2023 |
| 1 | R Griffiths | at first i felt these headphones were a mixed ... | wonderful sound now that initial connection is... | 5.0 | 13 07 2023 |
| 2 | Simon Pawlin | i wasn’t sure how good these were going to be ... | comfortable and quality sound | 4.0 | 04 08 2023 |
| 3 | Dezmo63 | read a very long review by a lady who had tria... | 60 hour battery life. amazing. | 5.0 | 01 10 2023 |
| 4 | Gabriele Fazio | this is the 3rd wireless headphones i owned in... | good quality, bad controls | 4.0 | 06 09 2023 |
| ... | ... | ... | ... | ... | ... |
| 95 | Mr. P. Charles | loved my gsp650s thought the momentum 4 would ... | do not buy for gaming | 1.0 | 26 11 2022 |
| 96 | Wayne | bought these probably about 2 months ago, abso... | rubbish | 1.0 | 12 03 2023 |
| 97 | "kas5239" | bought in august for my son's september birthd... | headphones developed a fault very quickly | 1.0 | 30 10 2022 |
| 98 | Suzi Edwards-Alexander | bought two pairs. returned one pair because th... | unusable and unreliable | 1.0 | 15 09 2022 |
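
To see what the cleaning pattern does to a single string, here is a small sketch with the standard **re** module. The sample titles are made up for illustration, and **clean_title** is a hypothetical helper rather than part of the notebook.

```python
import re

# Same pattern as in the pandas call: a rating such as "4.0" or "5" followed by " out of 5 stars"
RATING_PREFIX = re.compile(r'\d(\.\d)? out of 5 stars')


def clean_title(raw_title):
    """Remove the rating text that the scrape attached to the review title."""
    return RATING_PREFIX.sub('', raw_title).strip()


print(clean_title('4.0 out of 5 starsAwesome headphones'))  # Awesome headphones
print(clean_title('rubbish'))                               # rubbish (unchanged)
```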


To make the analysis easier, I converted the dates from month names to numeric months (for example, "October" becomes "10").

In [7]:
def extract_and_format_date(date_str):
    try:
        date = datetime.strptime(date_str, '%d %B %Y')
        return date.strftime('%d %m %Y')  # zero-padded day and month, e.g. '8 October 2023' -> '08 10 2023'
    except ValueError:
        return 'N/A'

df['Date'] = df['Date'].apply(extract_and_format_date)

df

| Unnamed: 0 | ID | Review | Title | Rating | Date |
|---|---|---|---|---|---|
| 0 | John C. | i have owned the momentum 2’s for many years a... | awesome headphones…. but… | 4.0 | 08 10 2023 |
| 1 | R Griffiths | at first i felt these headphones were a mixed ... | wonderful sound now that initial connection is... | 5.0 | 13 07 2023 |
| 2 | Simon Pawlin | i wasn’t sure how good these were going to be ... | comfortable and quality sound | 4.0 | 04 08 2023 |
| 3 | Dezmo63 | read a very long review by a lady who had tria... | 60 hour battery life. amazing. | 5.0 | 01 10 2023 |
| 4 | Gabriele Fazio | this is the 3rd wireless headphones i owned in... | good quality, bad controls | 4.0 | 06 09 2023 |
| ... | ... | ... | ... | ... | ... |
| 95 | Mr. P. Charles | loved my gsp650s thought the momentum 4 would ... | do not buy for gaming | 1.0 | 26 11 2022 |
| 96 | Wayne | bought these probably about 2 months ago, abso... | rubbish | 1.0 | 12 03 2023 |
| 97 | "kas5239" | bought in august for my son's september birthd... | headphones developed a fault very quickly | 1.0 | 30 10 2022 |
| 98 | Suzi Edwards-Alexander | bought two pairs. returned one pair because th... | unusable and unreliable | 1.0 | 15 09 2022 |
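
The whole date pipeline, from the raw text on the page to the numeric format, can be traced on one string. The sketch below combines the regular expression from **extract_review_date** with the conversion above. The sample sentence is an assumption about how the raw date text looks, the helper name **review_date_to_numeric** is hypothetical, and **%B** expects English month names (the default in most setups).

```python
import re
from datetime import datetime


def review_date_to_numeric(review_date_text):
    """Turn text like '... on 8 October 2023' into '08 10 2023', or 'N/A' on failure."""
    match = re.search(r'on (\d+ [A-Za-z]+ \d{4})', review_date_text)
    if not match:
        return 'N/A'
    try:
        parsed = datetime.strptime(match.group(1), '%d %B %Y')
    except ValueError:
        # The text looked like a date but the month name was not recognised
        return 'N/A'
    return parsed.strftime('%d %m %Y')


print(review_date_to_numeric('Reviewed in the United Kingdom on 8 October 2023'))  # 08 10 2023
print(review_date_to_numeric('no date here'))                                      # N/A
```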


I stored the results in a CSV file, which makes the data easier to retrieve later.

In [75]:
df.to_csv('SENHEISER.csv', index=False)