
## Indeed.com Scraping



### Step 1: Figure out how to scrape the company data from Indeed.

<font color='17321F'>
    
The data I want to scrape:

* Reviews - Overall company Rating
* Reviews - Work Life Balance Rating
* Reviews - Pay and Benefits Rating
* Reviews - Job Security and Advancement Rating
* Reviews - Management Rating
* Reviews - Culture Rating
* Reviews - Overall company Rating
* Reviews - count 5 star Rating
* Reviews - count 4 star Rating
* Reviews - count 3 star Rating
* Reviews - count 2 star Rating
* Reviews - count 1 star Rating
* Reviews/UK - Reviews for UK (perform NLP):
* number_of_reviews to know how many pages to loop through
* review_date: can use this to choose which reviews I want to include in my dataset (convert to date and filter on dates not in the 'future')
* review_header: one liner, review summary - NLP
* review_text: main body of text - nlp
* review_pros: pro's in note form - nlp
* review_cons: con's in note form - nlp
    
Dropped Columns:
* Snapshot - CEO
* Snapshot - Founded (not sure how accurate this is, IQVIA says 2017?)
* Snapshot - Company Size (This is categorical but worth comparing to the figure in the csv gender pay gap file)
* Snapshot - Revenue (Gives an idea of how big the company is)
* Snapshot - Industry (want to see if this impacts the overall statistics)
* Snapshot - (!)About section (perform NLP on this) DECIDED TO DROP THIS BECAUSE NOT SURE HOW USEFUL IT WOULD BE
    
</font>


In [1]:
import requests
import bs4
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import time, random
import math
import re
from datetime import datetime

### Step 2: load the data (trialled with 18_19 will do the other years later) and clean. Then created a dataframe with the first word of the company name column. Then scraped information from Indeed based using company = first word of the company name. This reduced the dataframe by about 20%, partially because some companies did not have a page on the site and also because the URL was incorrect for some of the company names using this method so it returned NA values. 

In [2]:
GPG_18_19 = pd.read_csv('/Users/gitas/Desktop/GA/Capstone/Gender_Pay_Gap_Data/UK_Gender_Pay_Gap_Data_2018_2019.csv')

In [3]:
GPG_18_19.dropna(subset=['CompanyNumber', 'DiffMeanBonusPercent', 'DiffMedianBonusPercent'], inplace=True)

In [4]:
GPG_18_19.reset_index(drop=True, inplace=True)

In [5]:
GPG_18_19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8105 entries, 0 to 8104
Data columns (total 25 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   EmployerName               8105 non-null   object 
 1   Address                    8105 non-null   object 
 2   CompanyNumber              8105 non-null   object 
 3   SicCodes                   7713 non-null   object 
 4   DiffMeanHourlyPercent      8105 non-null   float64
 5   DiffMedianHourlyPercent    8105 non-null   float64
 6   DiffMeanBonusPercent       8105 non-null   float64
 7   DiffMedianBonusPercent     8105 non-null   float64
 8   MaleBonusPercent           8105 non-null   float64
 9   FemaleBonusPercent         8105 non-null   float64
 10  MaleLowerQuartile          8105 non-null   float64
 11  FemaleLowerQuartile        8105 non-null   float64
 12  MaleLowerMiddleQuartile    8105 non-null   float64
 13  FemaleLowerMiddleQuartile  8105 non-null   float

In [6]:
indeed_companies = GPG_18_19[['EmployerName', 'CompanyNumber']]
indeed_companies.head()

Unnamed: 0,EmployerName,CompanyNumber
0,"""RED BAND"" CHEMICAL COMPANY, LIMITED",SC016876
1,118 LIMITED,03951948
2,123 EMPLOYEES LTD,10530651
3,1509 GROUP,04104101
4,1610 LIMITED,06727055


In [7]:
employer_clean = []
for line in indeed_companies.EmployerName:
    employer_clean.append((line.split(' ', 1)[0]))

In [8]:
indeed_companies['employer_clean'] = employer_clean

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  indeed_companies['employer_clean'] = employer_clean


In [None]:
# indeed_comp.to_csv('indeed_companies', index=False)

In [9]:
indeed_comp = pd.read_csv('indeed_companies.csv')

In [10]:
indeed_comp.head()

Unnamed: 0,EmployerName,CompanyNumber,employer_clean
0,"""RED BAND"" CHEMICAL COMPANY, LIMITED",SC016876,"""RED"
1,118 LIMITED,03951948,118
2,123 EMPLOYEES LTD,10530651,123
3,1509 GROUP,04104101,1509
4,1610 LIMITED,06727055,1610


<font color='red'>

#### Note - for all the below scrapes, I did these for a few companies at a time (around 500-1000), converted to a dataframe, saved to csv then ran for the next few companies. This was to ensure that if I got blocked from the site or had any issues with internet connection I would still have some of the data saved.
    
</font>

### Step 3: First scrape was for the CEO, founded, company size, revenue and industry but these fields were later dropped because I noticed that not all company had all these fields completed so there would be a large number of NA's and I did not want to lose anymore rows. Also, a lot of this information I could get from companies house. 

In [None]:
# scraping the company stats:
url_template = "https://uk.indeed.com/cmp/{}"
CEO = []
founded = []
company_size = []
revenue = []
industry = []
for company in indeed_comp.employer_clean[100:110]:
    time.sleep(random.randint(1, 4))
    r = requests.get(url_template.format(company))
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        CEO.append(soup.find('li', attrs={'class':"css-v4e08k eu4oa1w0", 'data-testid':"companyInfo-ceo"}).text[3:].strip())
    except:
        CEO.append(np.nan)
    try:
        founded.append(soup.find('li', attrs={'class':"css-v4e08k eu4oa1w0", 'data-testid':"companyInfo-founded"}).text[7:].strip())
    except:
        founded.append(np.nan)
    try:
        company_size.append(soup.find('li', attrs={'class':"css-v4e08k eu4oa1w0", 'data-testid':"companyInfo-employee"}).text[12:].strip())
    except:
        company_size.append(np.nan)
    try:
        revenue.append(soup.find('li', attrs={'class':"css-v4e08k eu4oa1w0", 'data-testid':"companyInfo-revenue"}).text[7:].strip())
    except:
        revenue.append(np.nan)
    try:
        industry.append(soup.find('li', attrs={'class':"css-v4e08k eu4oa1w0", 'data-testid':"companyInfo-industry"}).text[8:].strip())
    except:
        industry.append(np.nan)

### Step 4: Scrape all the rating scores - overall and categorical and save to a dataframe.

In [11]:
# scraping the overall and category ratings:
url_template = "https://uk.indeed.com/cmp/{}/reviews"
company_no = []
number_of_reviews = []
company_rating = []

work_life_balance = []
pay_and_benefits = []
job_security_and_advancement = []
management = []
culture = []

for company in indeed_comp.employer_clean[100:110]:
    time.sleep(random.randint(1, 4))
    r = requests.get(url_template.format(company))
    soup = BeautifulSoup(r.text, 'html.parser')
    try:
        number_of_reviews.append(soup.find('a', class_="cmp-RatingsCountLink").text.strip())
    except:
        number_of_reviews.append(np.nan)
    try:
        company_rating.append(soup.find('span', class_="cmp-CompactHeaderCompanyRatings-value").text.strip())
    except:
        company_rating.append(np.nan)
    try:
        work_life_balance.append(soup.find('div', class_="cmp-ReviewsFilters-singleSection").text[19:22].strip())
    except:
        work_life_balance.append(np.nan)
    try:
        pay_and_benefits.append(soup.find('div', class_="cmp-ReviewsFilters-singleSection").text[39:42].strip())
    except:
        pay_and_benefits.append(np.nan)
    try:
        job_security_and_advancement.append(soup.find('div', class_="cmp-ReviewsFilters-singleSection").text[56:59].strip())
    except:
        job_security_and_advancement.append(np.nan)
    try:
        management.append(soup.find('div', class_="cmp-ReviewsFilters-singleSection").text[85:88].strip())
    except:
        management.append(np.nan)
    try:
        culture.append(soup.find('div', class_="cmp-ReviewsFilters-singleSection").text[98:101].strip())
    except:
        culture.append(np.nan)

In [14]:
category_ratings_100_110 = pd.DataFrame()

category_ratings_100_110['number_of_reviews'] = number_of_reviews
category_ratings_100_110['company_rating'] = company_rating
category_ratings_100_110['work_life_balance'] = work_life_balance
category_ratings_100_110['pay_and_benefits'] = pay_and_benefits
category_ratings_100_110['job_security_and_advancement'] = job_security_and_advancement
category_ratings_100_110['management'] = management
category_ratings_100_110['culture'] = culture

# category_ratings_0_2000.to_csv('category_ratings_0_2000')

In [15]:
category_ratings_100_110

Unnamed: 0,number_of_reviews,company_rating,work_life_balance,pay_and_benefits,job_security_and_advancement,management,culture
0,25 reviews,3.2,3.1,3.0,2.8,3.0,2.7
1,,,,,,,
2,1 review,5.0,5.0,5.0,5.0,5.0,5.0
3,,,,,,,
4,107 reviews,3.7,3.8,3.6,3.1,3.4,3.5
5,6 reviews,3.5,3.0,3.8,3.0,3.2,3.2
6,15 reviews,3.4,3.5,3.5,3.2,3.3,3.4
7,216 reviews,3.4,3.3,3.1,2.9,3.1,3.3
8,43 reviews,3.4,3.1,2.9,2.8,2.8,3.2
9,43 reviews,3.4,3.1,2.9,2.8,2.8,3.2


In [None]:
# category_ratings_18_19 = pd.concat([category_ratings_18_19, indeed_comp], axis=1)

### Step 5: Scrape the star rating counts and save to a dataframe. 

In [16]:
# scraping the 1-5 * ratings
url_template = "https://uk.indeed.com/cmp/{}/reviews"
count_5_star = []
count_4_star = []
count_3_star = []
count_2_star = []
count_1_star = []

for company in indeed_comp.employer_clean[100:110]:
    time.sleep(random.randint(1, 4))
    r = requests.get(url_template.format(company))
    soup = BeautifulSoup(r.text, 'html.parser')  
    try:
        first_rating = soup.find('div', class_="cmp-ReviewHistogram-row")
        count_5_star.append(first_rating.text[1:].strip())
    except:
        count_5_star.append(np.nan)
    
    try:
        second_rating = first_rating.find_next_sibling('div')
        count_4_star.append(second_rating.text[1:].strip())
    except:
        count_4_star.append(np.nan)
    
    try:
        third_rating = second_rating.find_next_sibling('div')
        count_3_star.append(third_rating.text[1:].strip())
    except:
        count_3_star.append(np.nan)
    
    try:
        forth_rating = third_rating.find_next_sibling('div')
        count_2_star.append(forth_rating.text[1:].strip())
    except:
        count_2_star.append(np.nan)
    
    try:
        fifth_rating = forth_rating.find_next_sibling('div')
        count_1_star.append(fifth_rating.text[1:].strip())
    except:
        count_1_star.append(np.nan)

In [17]:
star_ratings_100_110 = pd.DataFrame()

star_ratings_100_110['count_5_star'] = count_5_star
star_ratings_100_110['count_4_star'] = count_4_star
star_ratings_100_110['count_3_star'] = count_3_star
star_ratings_100_110['count_2_star'] = count_2_star
star_ratings_100_110['count_1_star'] = count_1_star

# star_ratings_0_2000.to_csv('star_ratings_0_2000')

In [18]:
star_ratings_100_110

Unnamed: 0,count_5_star,count_4_star,count_3_star,count_2_star,count_1_star
0,7.0,4.0,6,2,6
1,,,6,2,6
2,1.0,0.0,0,0,0
3,,,0,0,0
4,33.0,36.0,22,10,6
5,1.0,2.0,2,1,0
6,3.0,6.0,2,2,2
7,55.0,59.0,50,19,33
8,14.0,8.0,10,2,9
9,14.0,8.0,10,2,9


In [None]:
# star_ratings_18_19 = pd.concat([star_ratings_18_19, category_ratings_18_19], axis=1)

In [None]:
# star_ratings_18_19.to_csv('all_ratings_18_19', index=False)

In [None]:
all_ratings_18_19 = pd.read_csv('all_ratings_18_19.csv')
all_ratings_18_19.head()

### Step 6: Scrape the review data (pros, cons, main text and header). To do this, I had to do two things, before saving to a dataframe:
 ### a) Identify the number of pages to loop through for each company - there was a number at the top stating the total number of reviews and I knew each page displays 20 results and I didn't want any repeats as it would have been quite difficult to get rid of duplicates, since all reviews would be included in the same row of the dataframe as one list. I wrote a formula that used the total number of reviews to determine how many pages to loop through.
### b) Scrape the date of review, convert to a datetime series and ensure that the date of the review is not after gender pay gap data period. 

In [19]:
# scraping the review data: after applying the datetime filter
url_template = "https://uk.indeed.com/cmp/{}/reviews?fcountry=GB&start={}"
review_date = []
review_header = []
review_text = []
review_pros = []
review_cons = []
march_31_2019 = datetime(2019, 3, 31).toordinal()

for company in indeed_comp.employer_clean[100:110]:
    company_date = []
    company_header = []
    company_text = []
    company_pros = []
    company_cons = []

    url_template_company = "https://uk.indeed.com/cmp/{}/reviews?fcountry=GB"
    r_company = requests.get(url_template_company.format(company))
    soup_company = BeautifulSoup(r_company.text, 'html.parser')
    counts = soup_company.find('div', attrs={'class':"cmp-ReviewsCount"})
    try:
        number_reviews = int((counts.findChild('b').text.strip()))
    except:
        number_reviews = 20
    max_reviews = ((math.ceil(number_reviews/20)))*20
    for page in range(0,max_reviews,20):
        time.sleep(random.randint(1, 4))
        r = requests.get(url_template.format(company, page))
        soup = BeautifulSoup(r.text, 'html.parser')
        for review in soup.find_all('div', attrs={'class':"cmp-Review", 'itemprop':'review'}):
            try:
                review_author = review.find('span', attrs={'class':"cmp-ReviewAuthor"}).text.strip()
                date = re.findall(r'(\w+\s\w+\s\w+)$', review_author)
                for i in date:
                    datetime_object = datetime.strptime(i, '%d %B %Y')
                
                if datetime_object.toordinal() < march_31_2019:
                    company_date.append(date)
                else:
                    pass
                
            except:
                pass
                
            if datetime_object.toordinal() < march_31_2019:
                try:
                    company_header.append(review.find('a', attrs={'class':"cmp-Review-titleLink"}).text.strip())
                except:
                    pass

                text_section = review.find('div', attrs={'class':"cmp-Review-text"})
                try:
                    company_text.append(text_section.findChild('span', attrs={'class':"cmp-NewLineToBr-text"}).text.strip())
                except:
                    pass
                pros_section = review.find('div', attrs={'class':"cmp-ReviewProsCons-prosText"})
                try: 
                    company_pros.append(pros_section.find('span', attrs={'class':"cmp-NewLineToBr-text"}).text.strip())
                except:
                    pass
                cons_section = review.find('div', attrs={'class':"cmp-ReviewProsCons-consText"})
                try: 
                    company_cons.append(cons_section.find('span', attrs={'class':"cmp-NewLineToBr-text"}).text.strip())
                except:
                    pass
            else:
                pass

    review_date.append(company_date)
    review_header.append(company_header)
    review_text.append(company_text)
    review_pros.append(company_pros)
    review_cons.append(company_cons)

In [20]:
review_details_100_110 = pd.DataFrame()

review_details_100_110['review_date'] = review_date
review_details_100_110['review_header'] = review_header
review_details_100_110['review_text'] = review_text
review_details_100_110['review_pros'] = review_pros
review_details_100_110['review_cons'] = review_cons

# review_details_0_500.to_csv('review_details_0_500')

In [21]:
review_details_100_110

Unnamed: 0,review_date,review_header,review_text,review_pros,review_cons
0,"[[9 August 2017], [6 June 2018], [4 March 2018...","[Friendly, fun place to work, fun and professi...","[I enjoyed working there., This was a very goo...","[Friendly, easy hours, free lunch, being left ...","[Split shifts, physically exhausting, notg eno..."
1,[],[],[],[],[]
2,[],[],[],[],[]
3,[],[],[],[],[]
4,"[[1 February 2013], [5 January 2018], [14 Augu...","[An intresting challenge., do not trust this c...",[A typical day at work involves dealing with i...,"[exciting time, met a lot of users, Warning, F...","[long but intresting, Warning, Pay, short pays..."
5,[],[],[],[],[]
6,"[[6 September 2017], [11 December 2015], [19 F...","[Fun and creative job, Enjoyable but very trav...",[It's a fun a creative place to do what you lo...,"[Get your hair done for free, Exposed to main ...","[Can be Long hours, Long hours and lots of tra..."
7,"[[14 January 2018], [13 November 2018], [8 Feb...","[Fun workplace, Its a job, Worse place ever, I...",[Good Working Environment with international a...,[wages],"[traveling, Long hours, poor working condition..."
8,"[[29 May 2018], [19 November 2018], [2 July 20...","[Great Sense of Community, Boring, Very fun an...",[In regards to its property management departm...,"[Teamwork, Money, Beach, Colleagues and workpl...","[Workload, Less, Pay is embrassingly low. The ..."
9,"[[29 May 2018], [19 November 2018], [2 July 20...","[Great Sense of Community, Boring, Very fun an...",[In regards to its property management departm...,"[Teamwork, Money, Beach, Colleagues and workpl...","[Workload, Less, Pay is embrassingly low. The ..."


In [None]:
# all_reviews_18_19 = pd.concat([review_details_18_19, all_ratings_18_19], axis=1)
# all_reviews_18_19.head()

In [None]:
# all_reviews_18_19.to_csv('all_reviews_18_19', index=False)