
Maria Guideng

rvest-scrape-glassdoor


😴 Oh hi. This repo is no longer maintained. Use the gdscrapeR package instead. 👻

About

Scrape Glassdoor.com for company reviews with rvest, write all reviews to a CSV, and prep the text data for text analytics.

Demo

Take Tesla, for example. The URL to scrape is: https://www.glassdoor.com/Reviews/Tesla-Reviews-E43129.htm

Here's a screen shot of the text to extract:

[Screenshot: gd-tesla — a Tesla review page on Glassdoor showing the text to extract]

Web scraper function

Extract the following fields for each company review:

  • Total reviews - overall count, from full- and part-time employees only
  • Date - when the review was posted
  • Summary - e.g., "Amazing Tesla"
  • Rating - star rating between 1.0 and 5.0
  • Title - e.g., "Current Employee - Anonymous Employee"
  • Pros - upsides of the workplace
  • Cons - downsides of the workplace
  • Helpful - number of users who marked the review as helpful, if any

#### SCRAPE ####
# Packages
library(rvest)    #scrape
library(purrr)    #iterate scraping by map_df()

# Set URL
baseurl <- "https://www.glassdoor.com/Reviews/"
company <- "Tesla-Reviews-E43129"
sort <- ".htm?sort.sortType=RD&sort.ascending=true"

# How many reviews in total? This determines the maximum number of result pages to iterate over.
totalreviews <- read_html(paste(baseurl, company, sort, sep="")) %>% 
  html_nodes(".margBot.minor") %>% 
  html_text() %>% 
  sub(" reviews", "", .) %>% 
  sub(",", "", .) %>% 
  as.integer()

maxresults <- as.integer(ceiling(totalreviews/10))    #10 reviews per page, round up to whole number

# Scraping function to create dataframe of: Date, Summary, Rating, Title, Pros, Cons, Helpful
df <- map_df(1:maxresults, function(i) {
  
  Sys.sleep(sample(seq(1, 5, by=0.01), 1))    #be a polite bot. ~12 mins to run with this system sleeper
  
  cat("boom! ")   #progress indicator
  
  pg <- read_html(paste(baseurl, company, "_P", i, sort, sep=""))   #pagination (_P1 to _P163)
  
  data.frame(rev.date = html_text(html_nodes(pg, ".date.subtle.small, .featuredFlag")),
             rev.sum = html_text(html_nodes(pg, ".reviewLink .summary:not([class*='hidden'])")),
             rev.rating = html_attr(html_nodes(pg, ".gdStars.gdRatings.sm .rating .value-title"), "title"),
             rev.title = html_text(html_nodes(pg, "#ReviewsFeed .hideHH")),
             rev.pros = html_text(html_nodes(pg, "#ReviewsFeed .pros:not([class*='hidden'])")),
             rev.cons = html_text(html_nodes(pg, "#ReviewsFeed .cons:not([class*='hidden'])")),
             rev.helpf = html_text(html_nodes(pg, ".tight")),
             stringsAsFactors=F)
})
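
Scraping ~160 pages in one run can trip over a transient network error partway through. Below is a minimal retry sketch that could stand in for read_html() inside the map_df() call; the read_html_retry helper is hypothetical, not part of the original script.

# Hypothetical helper: retry read_html() a few times before giving up
read_html_retry <- function(url, tries = 3, wait = 5) {
  for (k in seq_len(tries)) {
    pg <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(pg)) return(pg)    # success: return the parsed page
    Sys.sleep(wait)                 # back off before the next attempt
  }
  stop("failed to read ", url, " after ", tries, " tries")
}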

RegEx

Use regular expressions to clean and extract additional variables:

  • Reviewer ID (1 to N, numbered by review date from first to last)
  • Year (from Date)
  • Location (e.g., Palo Alto, CA)
  • Position (e.g., Project Manager)
  • Status (current or former employee)

#### REGEX ####
# Packages
library(stringr)    #pattern matching functions

# Clean: Helpful
df$rev.helpf <- as.numeric(gsub("\\D", "", df$rev.helpf))

# Add: ID
df$rev.id <- as.numeric(rownames(df))

# Extract: Year, Position, Location, Status
df$rev.year <- as.numeric(sub(".*, ","", df$rev.date))

df$rev.pos <- sub(".* Employee - ", "", df$rev.title)
df$rev.pos <- sub(" in .*", "", df$rev.pos)

df$rev.loc <- sub(".* in ", "", df$rev.title)
df$rev.loc <- ifelse(df$rev.loc %in% 
                       (grep("Former Employee|Current Employee", df$rev.loc, value = T)), 
                     "Not Given", df$rev.loc)

df$rev.stat <- str_extract(df$rev.title, ".* Employee -")
df$rev.stat <- sub(" Employee -", "", df$rev.stat)
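
It's worth spot-checking the extractions before moving on. A quick sanity check using only the columns created above:

# Spot-check the regex extractions
table(df$rev.stat, useNA = "ifany")    #counts of Current vs. Former employees
head(df[, c("rev.id", "rev.year", "rev.pos", "rev.loc", "rev.stat")])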

Output

[Screenshot: df-tesla — first rows of the resulting data frame]

#### EXPORT ####
write.csv(df, "rvest-scrape-glassdoor-output.csv")  #to csv
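
To pick the data back up in a later text-mining session, the CSV can be read back in, keeping the review text as character rather than factors:

# Reload the exported reviews; keep text columns as character for text mining
df <- read.csv("rvest-scrape-glassdoor-output.csv", stringsAsFactors = FALSE)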

Notes
