
# Glassdoor Review Scraper WIP

## How to Use
The scraper is designed to work with any company's reviews page. To use it:

 1. Open this notebook in a local IDE (built and tested in VS Code). Cloud-based IDEs such as Google Colab won't work because the scraper needs a locally installed Chrome instance.
 2. Ensure you have Chrome installed on your device
 3. Run the following blocks of code
 4. When prompted, paste in the URL of the reviews page for the company whose reviews you want to scrape

	> Example URL: https://www.glassdoor.co.uk/Reviews/eBay-Reviews-E7853.htm
	
 5. Entering a Glassdoor email address and password is optional, but without them the tool can only return the first page of reviews. To limit risk, the credentials are held in memory only long enough to pass them to Glassdoor.
 6. Enter how many pages of reviews you want to return. Each page holds roughly 10 reviews.
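
Before launching Chrome it can be worth checking that the pasted URL looks like a reviews page. The following is only a rough sketch: `looks_like_reviews_url` is a hypothetical helper that is not part of the scraper, and its rules are inferred purely from the example URL above, so other valid Glassdoor URL shapes may be rejected.

```python
from urllib.parse import urlparse


def looks_like_reviews_url(url: str) -> bool:
    """Return True if `url` resembles the example Glassdoor reviews URL.

    Hypothetical helper: the checks are assumptions based on one example URL.
    """
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    return (
        parsed.scheme in ("http", "https")
        and (host.startswith("www.glassdoor.") or host.startswith("glassdoor."))
        and parsed.path.startswith("/Reviews/")
        and parsed.path.endswith(".htm")
    )


print(looks_like_reviews_url("https://www.glassdoor.co.uk/Reviews/eBay-Reviews-E7853.htm"))
```

A check like this could run straight after the URL prompt, so a mistyped address fails early instead of after the browser has opened.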

## Current Issues

 - Can't be run in headless mode because of the CAPTCHA-solving method. Alternative methods that remove this requirement are being explored.
 - No unit tests. These will be included before final submission; I've just not had time to include them before the formative!
 - Occasionally, when signing in, it clicks the forgot-password link instead of the sign-in button, causing the code to fail. Currently looking into a solution.
 - No logic to check whether you're asking for more pages than exist for that company. The issue is that there's no clear identifier for the page count, as the pagination elements all share the same class and identifiers. It's a solvable problem; there just hasn't been time to implement it before the formative.
 - Code is an uncommented mess and needs serious clean-up and commenting before final submission
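
One possible shape for the missing page-count check is sketched below. It assumes the total number of reviews for the company could be obtained somehow, which this notebook does not currently do, and it uses the rough figure of 10 reviews per page from the usage notes. `clamp_page_count` and `REVIEWS_PER_PAGE` are hypothetical names.

```python
import math

REVIEWS_PER_PAGE = 10  # approximate figure taken from the usage notes


def clamp_page_count(requested: int, total_reviews: int) -> int:
    """Limit the requested page count to the pages that should exist.

    Assumes `total_reviews` is known; how to read it from the page is
    still an open problem for this scraper.
    """
    available = max(1, math.ceil(total_reviews / REVIEWS_PER_PAGE))
    return max(1, min(requested, available))


print(clamp_page_count(50, 123))  # 123 reviews fit on 13 pages
```

Because the per-page figure is only approximate, the result is best treated as an upper bound rather than an exact page count.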

In [None]:
%pip install -q selenium pandas bs4 seleniumbase tqdm

In [None]:
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
from seleniumbase import Driver
from bs4 import BeautifulSoup
from itertools import zip_longest
from IPython.display import clear_output
from tqdm import tqdm
import time
import pandas as pd

The following will likely be integrated into the login function before final submission.

If using VS Code, you'll be prompted to fill in the details in the search bar at the top of the screen.

In [None]:
url = input("Enter the Glassdoor reviews URL: ")
email = input("Enter your Glassdoor email: ")
password = input("Enter your Glassdoor password: ")
page_count = int(input("Enter how many pages of results to trawl through: "))  # Each page has roughly 10 reviews
df_reviews = pd.DataFrame(columns=["Title","Rating","Date","Job Title","Pros","Cons"])

# Without login details only the first page of reviews can be returned
if not (email and password):
    page_count = 1
    print("Only returning first page as no login details were provided.")
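
The page-count prompt above converts the input with `int()`, so a non-numeric entry raises `ValueError`. A more forgiving version might look like the sketch below. `resolve_page_count` is a hypothetical helper, and falling back to a single page on bad input is an assumed policy rather than the notebook's current behaviour.

```python
def resolve_page_count(raw: str, has_credentials: bool) -> int:
    """Turn the raw prompt text into a usable page count.

    Falls back to 1 on non-numeric or non-positive input (assumed policy),
    and to 1 whenever no login details were supplied.
    """
    try:
        requested = int(raw.strip())
    except ValueError:
        requested = 1
    requested = max(1, requested)
    return requested if has_credentials else 1


print(resolve_page_count("5", has_credentials=True))
print(resolve_page_count("five", has_credentials=True))
```

In the cell above it could be called as `resolve_page_count(input(...), bool(email and password))`.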

The following is the main scraping tool

In [None]:
class GlassdoorScraper:
    def __init__(self, url, email="", password="", page_count=1):
        self.url = url
        self.email = email
        self.password = password
        self.page_count = page_count
        self.driver = Driver(uc=True)
        #self.driver.set_window_position(0,-2000)
        # Drop the local references; the copies held on self are deleted in login()
        del email, password
        print("Initialized GlassdoorScraper")

    def login(self):
        self.driver.uc_open_with_reconnect(self.url, reconnect_time=6)
        self.driver.uc_gui_click_captcha()
        try:
            if self.email and self.password:
                self.driver.click('button[aria-label="sign in"]', timeout=5)
                self.driver.type('input[type="email"]', self.email, timeout=2)
                self.driver.click('button[data-test="continue-with-email-modal"]', timeout=2)
                self.driver.sleep(2)
                self.driver.type('input[type="password"]', self.password, timeout=2)
                self.driver.click('button[class="Button Button"]', timeout=2)
                self.driver.click('button[id="onetrust-accept-btn-handler"]', timeout=2)
                print("Logged in successfully")
            else:
                self.driver.click('button[id="onetrust-accept-btn-handler"]', timeout=2)
                print("Skipping sign-in: email or password empty")
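        except Exception as exc:
            # A missing element raises a Selenium/SeleniumBase exception, never NameError
            print(f"Login failed ({type(exc).__name__}), closing scraper")
            self.driver.quit()
            raise
        finally:
            # Drop the credentials as soon as they are no longer needed
            del self.email, self.password
        time.sleep(5)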

    def grab_reviews(self):
        
        rows = []
        reviews = self.driver.find_element(by=By.ID,value="ReviewsFeed")
        soup = BeautifulSoup(reviews.get_attribute('innerHTML'),'html.parser')
        pros = [a.get_text(separator=" ", strip=True) for a in soup.find_all(attrs={"data-test": "review-text-PROS"})]
        cons = [a.get_text(separator=" ", strip=True) for a in soup.find_all(attrs={"data-test": "review-text-CONS"})]
        title = [a.get_text(separator=" ", strip=True) for a in soup.find_all(attrs={"data-test": "review-details-title"})]
        job = [a.get_text(separator=" ", strip=True) for a in soup.find_all(attrs={"data-test": "review-avatar-label"})]
        rating = [a.get_text(separator=" ", strip=True) for a in soup.find_all(attrs={"data-test": "review-rating-label"})]
        date = [a.get_text(separator=" ", strip=True) for a in soup.find_all(class_ = "timestamp_reviewDate__dsF9n")]

        # zip stops at the shortest list, so a review missing a field can shift or drop rows
        for p, c, t, j, r, d in zip(pros, cons, title, job, rating, date):
            rows.append({"Title": t,"Rating": r,"Date": d, "Job Title": j, "Pros": p, "Cons": c})
        return rows


    def scrape_reviews(self):
        print("Starting to scrape reviews")
        
        # The first page is already loaded once login() has finished
        df_reviews = pd.DataFrame(self.grab_reviews())

        # Use the instance's page count rather than the notebook-level variable
        for page in tqdm(range(1, self.page_count), desc="Scraping Pages", unit="page"):
            # Scroll to the pagination bar so the next-page button can be clicked
            ActionChains(self.driver).move_to_element(self.driver.find_element(By.CLASS_NAME, value="PaginationContainer_paginationContainer__bDHGx")).perform()
            self.driver.click('button[data-test="next-page"]', timeout=2)
            self.driver.sleep(2)
            # Scroll back to the reviews feed before parsing the new page
            ActionChains(self.driver).move_to_element(self.driver.find_element(by=By.ID,value="ReviewsFeed")).perform()
            df_reviews_cont = pd.DataFrame(self.grab_reviews())
            df_reviews = pd.concat([df_reviews, df_reviews_cont], ignore_index=True)
        
        self.driver.quit()
        clear_output(wait=True)
        print("Scraping complete. Dataset Info: \n")
        df_reviews.info()  # info() prints its summary directly and returns None
        return df_reviews
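
`grab_reviews` combines six separately collected lists with `zip`, which silently stops at the shortest one. If a review on the page is missing a field, rows can be dropped or paired with the wrong values. The sketch below shows one way to catch that case. `build_rows` is a hypothetical helper, and raising an error on a mismatch is an assumed policy rather than what the scraper currently does.

```python
def build_rows(columns: dict[str, list[str]]) -> list[dict[str, str]]:
    """Combine per-field lists into row dicts, refusing mismatched lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Different counts mean at least one review is missing a field
        raise ValueError(f"Field counts differ, rows would be misaligned: {lengths}")
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


rows = build_rows({"Title": ["Great", "Okay"], "Pros": ["Pay", "Team"]})
print(rows)
```

On Python 3.10 or later, `zip(..., strict=True)` gives a similar guarantee without a helper, although its error message does not say which field was short.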

Use the following to call the tool

In [None]:
scraper = GlassdoorScraper(url, email, password, page_count)
scraper.login()
df_reviews = scraper.scrape_reviews()
df_reviews.to_csv("glassdoor_reviews.csv", index=False)