<a id=contents></a>

# Extraction and cleaning notebook
## Data retrieved from indeed.co.uk

Data was extracted by scraping https://www.indeed.co.uk/ with Selenium and parsing the page source with Beautiful Soup.

[0. Data extraction via Selenium scraping](#api)

[1. Data Inspection](#insp)

[2. Cleaning numerical data](#numerical)

[3. Cleaning categorical data](#categ)

[4. Cleaning text data](#text)

In [3]:
# Standard library
import os
import re
import string
import time
import warnings

# Third-party: data handling and plotting
import numpy as np
import pandas as pd
import requests as req
import matplotlib.pyplot as plt
import seaborn as sns
from dotenv import load_dotenv

# Scraping: Selenium drives the browser, Beautiful Soup parses the page source
import selenium as sl
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import ElementClickInterceptedException, NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.support.ui import WebDriverWait

# Text processing
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

load_dotenv()  # read environment variables from a local .env file

%matplotlib inline
sns.set_style("darkgrid")

# Keep only word tokens of three or more characters
tokenizer = RegexpTokenizer(r'\b\w{3,}\b')
stop_words = list(set(stopwords.words("english")))
stop_words += list(string.punctuation)

warnings.filterwarnings('ignore')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


<a id=api></a>

## 0. Data Extraction via Selenium scraping
    
[LINK to table of contents](#contents)

In [25]:
driver_test = webdriver.Chrome('/Users/ipreoteasa/Desktop/Io/chromedriver_2')


### Iteratively experimenting with Selenium and checking outputs

In [26]:
driver_test.get('https://www.indeed.co.uk/')
elem = driver_test.find_element_by_name('q')
elem.clear()
elem.send_keys('data scientist')

elem = driver_test.find_element_by_name('l')
elem.clear()
elem.send_keys('london')

elem.send_keys(Keys.RETURN)


In [6]:
DOM = driver_test.page_source


In [7]:
print(DOM)

<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr"><head>
<meta http-equiv="content-type" content="text/html;charset=UTF-8" />
<script type="text/javascript" async="" src="https://www.google-analytics.com/plugins/ua/linkid.js"></script><script async="" src="https://sb.scorecardresearch.com/beacon.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script type="text/javascript" src="//d3fw5vlhllyvee.cloudfront.net/s/ef27480/en_GB.js"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/b512638/jobsearch_all.css" rel="stylesheet" type="text/css" />
<link rel="alternate" type="application/rss+xml" title="Data Scientist Jobs, vacancies in London" href="http://www.indeed.co.uk/rss?q=data+scientist&amp;l=london" />
<link rel="alternate" media="only screen and (max-width: 640px)" href="/m/jobs?q=data+scientist&amp;l=london" />
<link rel="alternate" media="handheld" href="/m/jobs?q=data+scientist&amp;l=london" />

<script type="te

### Now using Beautiful Soup for the text extraction

In [68]:
soup_test = BeautifulSoup(DOM, 'lxml')

In [69]:
soup_test.prettify()



In [85]:
jobtitle_soup = soup_test.find_all(name='a', attrs= {'class': 'jobtitle turnstileLink', 'data-tn-element':'jobTitle'})

list_hrefs_page1 = [jobtitle_elem['href'] for jobtitle_elem in jobtitle_soup]

In [86]:
jobtitle_soup[4]

<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=f30a1d14d045a98c&amp;fccid=7cbdd150bffe93e0&amp;vjs=3" id="jl_f30a1d14d045a98c" onclick="setRefineByCookie([]); return rclk(this,jobmap[4],true,0);" onmousedown="return rclk(this,jobmap[4],0);" rel="noopener nofollow" target="_blank" title="Data Scientist-London, UK">
<b>Data</b> <b>Scientist</b>-London, UK</a>

In [92]:
list_hrefs_page1[2]

'/rc/clk?jk=e96b45c3c7a40fb9&fccid=f1d8e147024abb3f&vjs=3'

In [73]:
jobtitle_soup

[<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0CDunQwmDWxuCtK3LuAbt0ghr7w0gk9qUTosg2llpPs7WVAS-JHSgOSk7fPsG1xMWbgaUIrQx2unHK9ei5KvQwfZyEwNtIZQNVoDhQfq9e9uyj7_GE-gAlQpD_Vq3Ozk4yRFuDaxTNNfP-CDQmMZwcHsZ_WmCfc08l6vJV0C1eu9E2CmDomw1uRTsrjoLVnjYA-uBpLOpgKxHCNf0jp744ueAMb_mQP4pYETN5ExMauz3dFq7NSZ4jGZ3EQTzJ5tx98GQWcufCsZusmw-MOdW0BGYXfEJtmeLENJWPam_Cvp3Gom-rklvMAknRcO196ny1ggboK6fdFwMYmuJzd3qPuajnfsmBtfnA8iMC2fUbtNpNXX5tP8OO-vvr6vE-rFzjg-kwqxpW-YElaTkoo5OGjLzyBu_GOfEpaOU7KNrTPRiWleAugjZ5TUtRgicyC8ZUlp2CpiHuAbm_D8UpxsHyGpkQXR0sjCwQ=&amp;p=0&amp;fvj=0&amp;vjs=3" id="sja0" onclick="setRefineByCookie([]); sjoc('sja0', 1); convCtr('SJ'); rclk(this,jobmap[0],true,1);" onmousedown="sjomd('sja0'); clk('sja0'); rclk(this,jobmap[0],1);" rel="noopener nofollow" target="_blank" title="CPRD Data Scientist">
 CPRD <b>Data</b> <b>Scientist</b></a>,
 <a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0AbcLY

### Pagination in Selenium

In [32]:
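# next_page_button = driver_test.find_element_by_xpath('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a')
# The last element with class 'np' is taken to be the "Next" link
try:
    next_page_buttons = driver_test.find_elements_by_class_name('np')
    time.sleep(1)
    next_page_buttons[-1].click()
except Exception:
    # The first click can fail (e.g. when a popup intercepts it),
    # so look the buttons up again and retry once
    next_page_buttons = driver_test.find_elements_by_class_name('np')
    time.sleep(1)
    next_page_buttons[-1].click()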
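
The cell above retries the click once if the first attempt fails. A small helper can make that retry explicit and reusable. The sketch below is only an illustration: `retry` is a hypothetical function (it is not part of Selenium), and the flaky click is simulated with a plain Python callable so the example runs without a browser.

```python
import time


def retry(action, attempts=2, delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call `action()` up to `attempts` times, pausing `delay` seconds between tries.

    Hypothetical helper for illustration. With Selenium, `action` could be
    something like `lambda: next_page_buttons[-1].click()` and `exceptions`
    could be narrowed to (ElementClickInterceptedException,).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                sleep(delay)  # wait before the next try
    raise last_error


# Stand-in for a flaky click: fails on the first call, succeeds on the second.
calls = {"n": 0}


def flaky_click():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("click intercepted")
    return "clicked"


result = retry(flaky_click, attempts=2, delay=0)
print(result, calls["n"])
```

Passing the exception types explicitly keeps unrelated errors (for example a typo in a selector) from being silently retried.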

In [36]:
# signup_popup_close = driver_test.find_element_by_id('data-tn-section')
# At this point we identify the svg element of the popup that comes up on the next page
signup_popup_svg = driver_test.find_element(By.CSS_SELECTOR, value='svg')
# We now want to subselect the close button of that particular element
# signup_popup_svg.click()

### Dealing with sign-up popups on the Indeed website

In [30]:
popup_elem=WebDriverWait(driver_test, 2).until(ec.presence_of_element_located((By.ID, 'popover-x')))
ActionChains(driver_test).move_to_element(popup_elem).click().perform()

In [76]:
driver_test.page_source



Some of our links (the first two and the last one) might look broken, but when we test further down which pages they open, they turn out to be in working order.

In [114]:
for i,x in enumerate(list_hrefs_page1):
    print(i,  '   ' , x)

0     /pagead/clk?mo=r&ad=-6NYlbfkN0CDunQwmDWxuCtK3LuAbt0ghr7w0gk9qUTosg2llpPs7WVAS-JHSgOSk7fPsG1xMWbgaUIrQx2unHK9ei5KvQwfZyEwNtIZQNVoDhQfq9e9uyj7_GE-gAlQpD_Vq3Ozk4yRFuDaxTNNfP-CDQmMZwcHsZ_WmCfc08l6vJV0C1eu9E2CmDomw1uRTsrjoLVnjYA-uBpLOpgKxHCNf0jp744ueAMb_mQP4pYETN5ExMauz3dFq7NSZ4jGZ3EQTzJ5tx98GQWcufCsZusmw-MOdW0BGYXfEJtmeLENJWPam_Cvp3Gom-rklvMAknRcO196ny1ggboK6fdFwMYmuJzd3qPuajnfsmBtfnA8iMC2fUbtNpNXX5tP8OO-vvr6vE-rFzjg-kwqxpW-YElaTkoo5OGjLzyBu_GOfEpaOU7KNrTPRiWleAugjZ5TUtRgicyC8ZUlp2CpiHuAbm_D8UpxsHyGpkQXR0sjCwQ=&p=0&fvj=0&vjs=3
1     /pagead/clk?mo=r&ad=-6NYlbfkN0AbcLYJH6EEU2hhfUhIe4V-wtZUPXOEfJh72XLwlGjJwXW5GoFa_FpEAJY0bA41l88eG6x9Zf5eNV9CO8Bn0q6evh40YruuK2jjwlLb60bx0EDglGndWrUpozLWnGqvJes6HOlSBcfaiqRi129Fm6HY-jeqnZPGCgZR_pC8w4OKveTvZ0XZ3BnAAbRu3zB6MIzA4R-RaT2yZ7OV3UfNjE1Keaw3P2GRajb9todTFpnLMj1tfPqihYPMs0qUc-oPfH-hfCN2-Qcaeq-HDRbtFlyIu3qjp1-uCBChsB2lsNpwdDuI4fzQiI4_cEul6vLUXJ22GKXbKbaws0MAnzKofo6WQ8zQT6pksXbYSPsnn3UZrwfpA1uWlhCSX3luLEsP5E0ZmaZWCDVOjYkORYkQbgQRSHC5ekrHS2VD_ZMqefqh4g=

### Accessing individual job post pages from our list of stored URLs

In [96]:
'indeed.co.uk'+ test_url

'indeed.co.uk/rc/clk?jk=e96b45c3c7a40fb9&fccid=f1d8e147024abb3f&vjs=3'
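
The plain concatenation above produces a URL without a scheme, which a browser driver will typically reject; the next cell prepends `https://www.` by hand instead. As an alternative sketch, the standard library's `urllib.parse.urljoin` resolves a relative href against a root URL. The helper name `absolute_job_url` is made up for this illustration.

```python
from urllib.parse import urljoin

ROOT_URL = "https://www.indeed.co.uk"  # same root URL used later in the notebook


def absolute_job_url(href, root=ROOT_URL):
    """Resolve an href taken from the results page against the site root.

    Hypothetical helper; hrefs that are already absolute are returned unchanged.
    """
    return urljoin(root, href)


example_href = "/rc/clk?jk=e96b45c3c7a40fb9&fccid=f1d8e147024abb3f&vjs=3"
full_url = absolute_job_url(example_href)
print(full_url)
```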

In [120]:
test_url = list_hrefs_page1[0]
driver_test = webdriver.Chrome('/Users/ipreoteasa/Desktop/Io/chromedriver_2')
time.sleep(2)
driver_test.get('https://www.indeed.co.uk'+test_url)


In [121]:
dom =  driver_test.page_source

In [122]:
job_soup = BeautifulSoup(dom, 'lxml')

In [123]:
job_soup_descr = job_soup.find(name='div', attrs= {'class': 'jobsearch-jobDescriptionText', 'id':'jobDescriptionText'})
job_soup_descr

<div class="jobsearch-jobDescriptionText" id="jobDescriptionText"><div><div>We would like to reassure all applicants that we are continuing to recruit through these challenging circumstances. For any vacancies advertised we are proceeding with Video Interviews where possible and are facilitating remote homeworking for all positions where possible. Should you have any queries on the current recruitment process at the MHRA please contact careers@mhra.gov.uk.</div><div></div><br/>
<div>
We currently have a great opportunity for a Data Scientist to join the DTT team within the CPRD Division, to work full time on a permanent, full-time contract.</div><div></div><br/>
<div>
The Medicines and Healthcare products Regulatory Agency enhance and improve the health of millions of people every day through the effective regulation of medicines and medical devices, underpinned by science and research. The agency is made up of c.1300 staff working across three centres:</div><div></div><br/>
<ul><li> M

In [124]:
job_soup_descr.get_text()

'We would like to reassure all applicants that we are continuing to recruit through these challenging circumstances. For any vacancies advertised we are proceeding with Video Interviews where possible and are facilitating remote homeworking for all positions where possible. Should you have any queries on the current recruitment process at the MHRA please contact careers@mhra.gov.uk.\n\nWe currently have a great opportunity for a Data Scientist to join the DTT team within the CPRD Division, to work full time on a permanent, full-time contract.\n\nThe Medicines and Healthcare products Regulatory Agency enhance and improve the health of millions of people every day through the effective regulation of medicines and medical devices, underpinned by science and research. The agency is made up of c.1300 staff working across three centres:\n Medicines and Healthcare products Regulatory Agency regulatory centre (MHRA) Clinical Practice Research Datalink (CPRD) National Institute for Biological S

### Refactoring into a scraper class

Now that we've tested each step on the Indeed pages, it's time to refactor the code into a class and build in waiting times so that the scraper doesn't hit the website with a sudden surge of requests.
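
One possible way to build in those waiting times is to add a small random extra to each fixed pause, so that requests are not perfectly evenly spaced. The sketch below assumes a hypothetical helper called `polite_pause`; the class that follows uses fixed `time.sleep` calls instead. The sleeper is injectable so the demonstration returns immediately.

```python
import random
import time


def polite_pause(base=2.0, jitter=1.0, sleep=time.sleep, rng=random):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used so it can be logged.
    Hypothetical helper, shown only as an illustration.
    """
    delay = base + rng.uniform(0, jitter)
    sleep(delay)
    return delay


# Demonstration with a no-op sleeper so nothing actually waits.
delays = [polite_pause(base=2.0, jitter=1.0, sleep=lambda seconds: None) for _ in range(5)]
print(all(2.0 <= d <= 3.0 for d in delays))
```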

In [150]:
import pandas as pd

class JobPostScraper:
    def __init__(self, root_url, search_term_job, location, num_jobs):
        """Initialise the job scraper object with 
        - root_url - (str) of the website you're visiting in our case 'indeed.co.uk', 
        - search_term_job - (str) the job you're looking for (e.g. 'data scientist'), 
        - location - (str) your location (e.g. 'London')
        - num_jobs - (int) how many job postings you'd like to look at
        """
        self.root_url = root_url
        self.search_term_job = search_term_job
        self.location = location
        self.num_jobs = num_jobs
        self.job_descr_lst_ = []
        self.job_titles_lst_ =[]
        self.companies_lst_ = []
        self.job_post_dom_ = []
        self.job_post_urls_ = []
        return
    


    def get_job_link_urls(self, headless=False):
        """Instance method that start a Selenium Chrome driver that scrapes a website and searches
        for job URLs, paginates and then stores the num_jobs amount of URLs in a pandas dataframe
        for use later down the pipeline.
        headless - (bool) whether to have the chrome window showing or not as it's scraping
        """
        start = time.time()
        # empty list to store urls from within the main job posting website
        sub_urls = []
        #init the selenium driver
        chrome_options = Options()
        if headless:
            chrome_options.add_argument("--headless")
        driver = webdriver.Chrome('/Users/ipreoteasa/Desktop/Io/chromedriver_2', 
                                 options=chrome_options)
    
        # accessing main page
        driver.get(self.root_url)
        time.sleep(2)
        #enter our job search terms
        elem = driver.find_element_by_name('q')
        elem.clear()
        elem.send_keys(self.search_term_job)

        time.sleep(2)
        #enter our location search term
        elem = driver.find_element_by_name('l')
        elem.clear()
        elem.send_keys(self.location)
        elem.click()
        time.sleep(1)
        elem.send_keys(Keys.RETURN)

        time.sleep(4)
        
        time_index = 0
        
        while len(sub_urls)<=self.num_jobs:
            try:
                time.sleep(3)
                pop_up_close = driver.find_element_by_class_name('popover-x')
                pop_up_close.click()
            except:
                pass

            # using BS4 on the page source to get all the urls
            DOM = driver.page_source
            soup = BeautifulSoup(DOM, 'lxml')
            
            jobtitle_soup = soup.find_all(name='a', 
                                               attrs= {'class': 'jobtitle turnstileLink', 
                                                       'data-tn-element':'jobTitle'})
            
            # getting href attributes and storing them
            list_hrefs = [jobtitle_elem['href'] for jobtitle_elem in jobtitle_soup]
            for href in list_hrefs:
                sub_urls.append(href)
            
            WebDriverWait(driver, 2).until(ec.element_to_be_clickable((By.CLASS_NAME, 'np')))
            next_page_buttons = driver.find_elements_by_class_name('np')
            time.sleep(4)
            ActionChains(driver).move_to_element(next_page_buttons[-1]).click().perform()

            try:
                popup_elem=WebDriverWait(driver, 2).until(ec.presence_of_element_located((By.ID, 'popover-x')))
                ActionChains(driver).move_to_element(popup_elem).click().perform()
            except:
                pass
            
            time_elapsed = time.time() - start
            printout = f'Step {time_index} --- Time elapsed so far {time_elapsed}; URLs stored : {len(sub_urls)}'
            print(printout)
            time_index+=1
            
        # Now we take our list of urls, prepend the root url to them and store them in a dataframe
        job_urls_full = list(map(lambda x: str(self.root_url)+x , sub_urls))
        job_url_df = pd.DataFrame(job_urls_full, columns=['job_url'])
        
        self.job_post_urls_ = job_urls_full
        print('URL column successfully stored as pandas obj')

        return job_url_df
    
    
    def get_job_text_html(self,url_df, url_column = 'job_url', headless=True):
        """Retrieve the body of the job posting text using Selenium for browser interaction and
        Beautiful Soup for parsing and HTML tag removal
        url_df - (pandas dataframe/series) that contains our URLs
        url_column - (str) name of the dataframe column that contains URLs, by default = 'job_url'
        headless - (bool) whether to have the chrome window showing or not as it's scraping
        """
        start_job_descr = time.time()
        job_descr_lst = []
        # Accept either a DataFrame with a URL column or a Series of URLs.
        # (Comparing str(type(...)) to the bare class path never matches,
        # so isinstance is used instead.)
        if isinstance(url_df, pd.DataFrame):
            url_list = list(url_df[url_column].values)
        elif isinstance(url_df, pd.Series):
            url_list = list(url_df)
        else:
            url_list = list(url_df[url_column].values)
            

        #init the selenium driver
        chrome_options = Options()
        if headless:
            chrome_options.add_argument("--headless")
        driver = webdriver.Chrome('/Users/ipreoteasa/Desktop/Io/chromedriver_2', 
                                 options=chrome_options)
        
        job_descr_list = []
        
        for url in url_list:
            driver.get(url)
            time.sleep(5)
            dom =  driver.page_source
            job_soup = BeautifulSoup(dom, 'lxml')
            job_soup_title = job_soup.find(name='div', 
                                           attrs= {'class': 'jobsearch-JobInfoHeader-title-container'})
            
            job_soup_descr = job_soup.find(name='div', 
                                           attrs= {'class': 'jobsearch-jobDescriptionText', 
                                                   'id':'jobDescriptionText'})
            
            job_soup_company = job_soup.find(name='div', 
                                           attrs= {'class': 'jobsearch-InlineCompanyRating icl-u-xs-mt--xs jobsearch-DesktopStickyContainer-companyrating'})
            # Store None when an element is missing, so the three lists stay the
            # same length and every row still lines up with its URL
            self.job_titles_lst_.append(
                job_soup_title.get_text() if job_soup_title is not None else None)
            self.job_descr_lst_.append(
                job_soup_descr.get_text() if job_soup_descr is not None else None)
            self.companies_lst_.append(
                job_soup_company.get_text() if job_soup_company is not None else None)
            
            self.job_post_dom_.append(job_soup)

            
            if (len(self.job_descr_lst_)%50 ==0):
                time_elapsed_get_jobs = time.time() - start_job_descr
                printout = f'--- Time elapsed so far {time_elapsed_get_jobs}; jobs stored so far:  {len(self.job_descr_lst_)}'
                print(printout)
                
        return            
    
    def get_jobs_df(self):
        """Functions assembles a Pandas dataframe with 5 columns:
        the company; the job title; the job description text;
        the entire html of the job post page (in case the user
        would like to do any more data extraction from it);
        and the search term that was used
        """
        data = pd.DataFrame({
                            'company': self.companies_lst_,
                            'job_title' : self.job_titles_lst_,
                            'job_descr' : self.job_descr_lst_,
                            'job_post_html' : self.job_post_dom_})
        
        data['job_search_term'] = str(self.search_term_job)
        
        return data
    

### Testing our new scraper class

In [49]:
root_url = 'https://www.indeed.co.uk'
search_term_job = 'data scientist'


job_scraper = JobPostScraper(root_url, search_term_job, location='London',
                            num_jobs=5)

In [50]:
job_url_df = job_scraper.get_job_link_urls()

Step 0 --- Time elapsed so far 21.6008939743042; URLs stored : 15
URL column successfully stored as pandas obj


In [51]:
job_url_df.drop_duplicates(inplace=True)
job_url_df

| | job_url |
|---|---|
| 0 | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 1 | https://www.indeed.co.uk/rc/clk?jk=e96b45c3c7a... |
| 2 | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 3 | https://www.indeed.co.uk/rc/clk?jk=3b623080f2e... |
| 4 | https://www.indeed.co.uk/rc/clk?jk=71824434f4e... |
| 5 | https://www.indeed.co.uk/rc/clk?jk=f30a1d14d04... |
| 6 | https://www.indeed.co.uk/rc/clk?jk=cae641d93a8... |
| 7 | https://www.indeed.co.uk/rc/clk?jk=7548e60a487... |
| 8 | https://www.indeed.co.uk/rc/clk?jk=616297d5778... |
| 9 | https://www.indeed.co.uk/rc/clk?jk=cd3c41b9c6d... |


In [52]:
job_scraper.get_job_text_html(job_url_df[:5])

--- Time elapsed so far 7.321532964706421; job descriptions stored so far:  1
--- Time elapsed so far 12.565306186676025; job descriptions stored so far:  2
--- Time elapsed so far 18.03852105140686; job descriptions stored so far:  3
--- Time elapsed so far 23.32067894935608; job descriptions stored so far:  4
--- Time elapsed so far 28.561187982559204; job descriptions stored so far:  5


In [54]:
ds_job_df = job_scraper.get_jobs_df()
ds_job_df

| | company | job_titles_lst | job_descr | job_post_html | job_search_term |
|---|---|---|---|---|---|
| 0 | Medicines and Healthcare products Regulatory A... | CPRD Data Scientist | We would like to reassure all applicants that ... | `[[[\n, <title>CPRD Data Scientist - London - I...` | data scientist |
| 1 | Deutsche Bank2,894 reviews | Artificial Intelligence – Data Scientist | Job Title: Artificial Intelligence – Data Scie... | `[[[\n, <title>Artificial Intelligence – Data S...` | data scientist |
| 2 | OSTC5 reviews | Data Scientist - ZISHI Adaptive | Company Description\n\nIn just over 15 years, ... | `[[[\n, <title>Data Scientist - ZISHI Adaptive ...` | data scientist |
| 3 | COLLAB. Recruitment Ltd | Data Scientist | Data Scientist\n\nRole\nDo you want to make a ... | `[[[\n, <title>Data Scientist - London EC1V - I...` | data scientist |
| 4 | Deloitte9,527 reviews | Consultant, Data Scientist, Defence and Securi... | Your opportunity\nYou can expect to work as pa... | `[[[\n, <title>Consultant, Data Scientist, Defe...` | data scientist |


The main issues we noticed here:
* the job titles and descriptions had been inverted; thankfully, that error has now been corrected
* the company field often ends with a number followed by ' reviews'; thankfully, this is regular enough for a regex to solve later

In [58]:
ds_job_df.company[1]

'Deutsche Bank2,894 reviews'
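
As a sketch of the regex clean-up mentioned above, the pattern below strips a trailing review count such as `2,894 reviews` from the company field. The function name `strip_review_count` and the exact pattern are assumptions made for illustration, not the cleaning step used later in the notebook; the sample strings are taken from the output above.

```python
import re

# Matches a trailing review count such as "2,894 reviews" or "5 reviews".
REVIEWS_SUFFIX = re.compile(r"\s*[\d,]+\s+reviews?$")


def strip_review_count(company):
    """Remove the trailing '<number> reviews' text from a scraped company field."""
    return REVIEWS_SUFFIX.sub("", company).strip()


samples = [
    "Deutsche Bank2,894 reviews",
    "OSTC5 reviews",
    "COLLAB. Recruitment Ltd",  # no review count: left unchanged
]
cleaned = [strip_review_count(s) for s in samples]
print(cleaned)
```

One caveat: because the count is glued directly onto the name, a company whose name itself ends in a digit would lose that digit, so the result is worth spot-checking. On the dataframe, the same pattern could be applied with `ds_job_df['company'].str.replace(REVIEWS_SUFFIX, '', regex=True)`.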

### Running the scraper to get an adequate number of DS job posts

In [103]:
root_url = 'https://www.indeed.co.uk'
search_term_job = 'data scientist'

ds_job_scraper = JobPostScraper(root_url, search_term_job, location='London',
                            num_jobs=2000)

In [104]:
ds_job_urls_df = ds_job_scraper.get_job_link_urls(headless=True)

Step 0 --- Time elapsed so far 21.043319940567017; URLs stored : 15
Step 1 --- Time elapsed so far 28.54232096672058; URLs stored : 30
Step 2 --- Time elapsed so far 38.03959107398987; URLs stored : 45
Step 3 --- Time elapsed so far 47.22696900367737; URLs stored : 60
Step 4 --- Time elapsed so far 56.93040490150452; URLs stored : 75
Step 5 --- Time elapsed so far 66.1243588924408; URLs stored : 90
Step 6 --- Time elapsed so far 75.66573286056519; URLs stored : 105
Step 7 --- Time elapsed so far 84.84239292144775; URLs stored : 120
Step 8 --- Time elapsed so far 94.32680988311768; URLs stored : 135
Step 9 --- Time elapsed so far 103.5050220489502; URLs stored : 150
Step 10 --- Time elapsed so far 112.99656510353088; URLs stored : 165
Step 11 --- Time elapsed so far 122.48103404045105; URLs stored : 180
Step 12 --- Time elapsed so far 131.68594098091125; URLs stored : 195
Step 13 --- Time elapsed so far 141.19254207611084; URLs stored : 210
Step 14 --- Time elapsed so far 150.3793339729

In [105]:
# a problem came up - we seem to have duplicate URLs, which is very bizarre
ds_job_urls_df.drop_duplicates(inplace=True)

In [157]:
ds_job_urls_df.isna().sum()

job_url    0
dtype: int64

In [106]:
ds_job_urls_df.shape

(666, 1)
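
Exact duplicates are removed by `drop_duplicates`, but two different URL strings can still point at the same post. The sketch below shows one possible stricter filter, under the assumption (not verified here) that the `jk` query parameter identifies a job post; URLs without a `jk` parameter fall back to being compared as whole strings. The helper names and the example URLs are made up for illustration.

```python
from urllib.parse import parse_qs, urlparse


def job_key(url):
    """Return the `jk` query parameter if present, otherwise the full URL.

    Assumption: `jk` identifies a job post on the site.
    """
    values = parse_qs(urlparse(url).query).get("jk")
    return values[0] if values else url


def unique_jobs(urls):
    """Keep the first URL seen for each job key, preserving the original order."""
    seen = set()
    kept = []
    for url in urls:
        key = job_key(url)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Hypothetical URLs: the first two share a `jk` value, the third has none.
urls = [
    "https://www.indeed.co.uk/rc/clk?jk=aaa111&vjs=3",
    "https://www.indeed.co.uk/rc/clk?jk=aaa111&fccid=xyz&vjs=3",
    "https://www.indeed.co.uk/pagead/clk?mo=r&ad=abc",
]
unique = unique_jobs(urls)
print(len(urls), "->", len(unique))
```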

In [107]:
ds_job_scraper.get_job_text_html(ds_job_urls_df)

--- Time elapsed so far 269.23754620552063; jobs stored so far:  50
--- Time elapsed so far 535.5580632686615; jobs stored so far:  100
--- Time elapsed so far 801.8132569789886; jobs stored so far:  150
--- Time elapsed so far 1064.9572911262512; jobs stored so far:  200
--- Time elapsed so far 1330.8274309635162; jobs stored so far:  250
--- Time elapsed so far 1593.7352080345154; jobs stored so far:  300
--- Time elapsed so far 1858.9256629943848; jobs stored so far:  350
--- Time elapsed so far 2121.920718193054; jobs stored so far:  400
--- Time elapsed so far 2386.624708175659; jobs stored so far:  450
--- Time elapsed so far 2649.5660848617554; jobs stored so far:  500
--- Time elapsed so far 2916.4089229106903; jobs stored so far:  550
--- Time elapsed so far 3185.776230096817; jobs stored so far:  600
--- Time elapsed so far 3453.8773341178894; jobs stored so far:  650


In [117]:
ds_job_df = ds_job_scraper.get_jobs_df()

In [145]:
ds_job_df.job_descr.isna().sum()

0

In [137]:
ds_job_urls_df.nunique()

job_url    666
dtype: int64

In [159]:
ds_df = ds_job_df

ds_df['job_url'] = ds_job_urls_df.values

ds_df.head()

| | company | job_titles_lst | job_descr | job_post_html | job_search_term | job_url |
|---|---|---|---|---|---|---|
| 0 | Medicines and Healthcare products Regulatory A... | CPRD Data Scientist | We would like to reassure all applicants that ... | `[[[\n, <title>CPRD Data Scientist - London - I...` | data scientist | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 1 | PwC7,678 reviews | Deals - Investigative Analytics - Data Scienti... | A career within Forensics Technology services,... | `[[[\n, <title>Deals - Investigative Analytics ...` | data scientist | https://www.indeed.co.uk/rc/clk?jk=7548e60a487... |
| 2 | Capital One - UK8,902 reviews | Data Scientist - Cyber | White Collar Factory (95009), United Kingdom, ... | `[[[\n, <title>Data Scientist - Cyber - London ...` | data scientist | https://www.indeed.co.uk/rc/clk?jk=cae641d93a8... |
| 3 | Globant | Lead Data Scientist | We are a digitally native technology services ... | `[[[\n, <title>Lead Data Scientist - London EC1...` | data scientist | https://www.indeed.co.uk/rc/clk?jk=ef47bc3a6fc... |
| 4 | UK Government - National Crime Agency32 reviews | G4 Lead Data Scientist - Cyber | Deploy analytical capabilities in support of o... | `[[[\n, <title>G4 Lead Data Scientist - Cyber -...` | data scientist | https://www.indeed.co.uk/rc/clk?jk=83ec2522af2... |


In [160]:
ds_df.tail()

| | company | job_titles_lst | job_descr | job_post_html | job_search_term | job_url |
|---|---|---|---|---|---|---|
| 661 | Harnham9 reviews | Computer Vision Engineer | Computer Vision Engineer\n\n\n£60,000 - £70,00... | `[[[\n, <title>Computer Vision Engineer - Londo...` | data scientist | https://www.indeed.co.uk/rc/clk?jk=bbbe53800f4... |
| 662 | causaLens | Data Scientist | SummaryWe are looking for a motivated and high... | `[[[\n, <title>Data Scientist - London W6 - Ind...` | data scientist | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 663 | Metrica Recruitment | Senior Data Scientist | The Company\n\nA well-funded and exciting star... | `[[[\n, <title>Senior Data Scientist - London -...` | data scientist | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 664 | MatchesFashion.com22 reviews | LEAD DATA SCIENTIST | The Brand\nAt MATCHESFASHION.COM we are on a m... | `[[[\n, <title>LEAD DATA SCIENTIST - London - I...` | data scientist | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |
| 665 | Beecher Madden. | Machine Learning Specialist Job, London, Up To... | Role Summary:\n\nThis company is recruiting a ... | `[[[\n, <title>Machine Learning Specialist Job,...` | data scientist | https://www.indeed.co.uk/pagead/clk?mo=r&ad=-6... |


In [161]:
ds_df.loc[ds_df.job_url.isna()]

Unnamed: 0,company,job_titles_lst,job_descr,job_post_html,job_search_term,job_url


In [152]:
import sys
sys.setrecursionlimit(10000)

In [163]:
ds_df.to_pickle('ds_jobs_raw.pickle')

In [162]:
ds_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 666 entries, 0 to 665
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   company          666 non-null    object
 1   job_titles_lst   666 non-null    object
 2   job_descr        666 non-null    object
 3   job_post_html    666 non-null    object
 4   job_search_term  666 non-null    object
 5   job_url          666 non-null    object
dtypes: object(6)
memory usage: 31.3+ KB


### Scraping job posts for machine learning engineer

In [164]:
root_url = 'https://www.indeed.co.uk'
search_term_job_ml = 'machine learning engineer'

mleng_job_scraper = JobPostScraper(root_url, search_term_job_ml, location='London',
                                   num_jobs=2000)

mleng_job_url_df = mleng_job_scraper.get_job_link_urls(headless=True)

Step 0 --- Time elapsed so far 19.301270961761475; URLs stored : 15
Step 1 --- Time elapsed so far 26.489434003829956; URLs stored : 30
Step 2 --- Time elapsed so far 36.00507211685181; URLs stored : 45
Step 3 --- Time elapsed so far 45.50082206726074; URLs stored : 60
Step 4 --- Time elapsed so far 54.99500012397766; URLs stored : 75
Step 5 --- Time elapsed so far 65.19107103347778; URLs stored : 90
Step 6 --- Time elapsed so far 74.6653790473938; URLs stored : 105
Step 7 --- Time elapsed so far 84.21054315567017; URLs stored : 120
Step 8 --- Time elapsed so far 93.6920108795166; URLs stored : 135
Step 9 --- Time elapsed so far 103.21356105804443; URLs stored : 150
Step 10 --- Time elapsed so far 112.67856311798096; URLs stored : 165
Step 11 --- Time elapsed so far 122.16320824623108; URLs stored : 180
Step 12 --- Time elapsed so far 131.3673460483551; URLs stored : 195
Step 13 --- Time elapsed so far 140.8595130443573; URLs stored : 210
Step 14 --- Time elapsed so far 150.61151313781

In [168]:
mleng_job_url_df.drop_duplicates(inplace=True)
mleng_job_url_df.head()

| | job_url |
|---|---|
| 0 | https://www.indeed.co.uk/rc/clk?jk=adc2d336aec... |
| 1 | https://www.indeed.co.uk/company/Transformativ... |
| 2 | https://www.indeed.co.uk/rc/clk?jk=110484a7c65... |
| 3 | https://www.indeed.co.uk/rc/clk?jk=4c460bbd6a3... |
| 4 | https://www.indeed.co.uk/rc/clk?jk=6f06b212bad... |


In [169]:
mleng_job_scraper.get_job_text_html(mleng_job_url_df)

--- Time elapsed so far 266.7283411026001; jobs stored so far:  50
--- Time elapsed so far 532.2183480262756; jobs stored so far:  100
--- Time elapsed so far 798.9522480964661; jobs stored so far:  150
--- Time elapsed so far 1063.7755060195923; jobs stored so far:  200
--- Time elapsed so far 1326.1524529457092; jobs stored so far:  250
--- Time elapsed so far 1589.0028960704803; jobs stored so far:  300
--- Time elapsed so far 1852.0578010082245; jobs stored so far:  350
--- Time elapsed so far 2115.673990011215; jobs stored so far:  400
--- Time elapsed so far 2377.841024160385; jobs stored so far:  450
--- Time elapsed so far 2642.8857209682465; jobs stored so far:  500
--- Time elapsed so far 2908.149560213089; jobs stored so far:  550
--- Time elapsed so far 3171.8328001499176; jobs stored so far:  600
--- Time elapsed so far 3434.5646159648895; jobs stored so far:  650
--- Time elapsed so far 3697.0847749710083; jobs stored so far:  700
--- Time elapsed so far 3966.532073974609

In [170]:
mleng_job_df = mleng_job_scraper.get_jobs_df()


mleng_df = mleng_job_df

mleng_df['job_url'] = mleng_job_url_df.values

mleng_df.to_pickle('mleng_jobs_raw.pkl')

In [171]:
mleng_df.shape

(812, 6)

In [172]:
mleng_df.job_descr.nunique()

680

<a id=insp></a>

## 1. Data Inspection
    
[LINK to table of contents](#contents)

Now that we have the job descriptions as a collection of strings, we can check how many there are, how long each one is and which terms occur most frequently, which will give us insight into what we need to clean.
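
As a rough illustration of those checks, the sketch below computes character lengths, token counts and term frequencies. It uses two made-up descriptions and a tiny hand-written stop-word list, both assumptions for the example; on the real data, the `tokenizer` and `stop_words` defined at the top of the notebook would take their place. The regular expression is the same one passed to `RegexpTokenizer` earlier.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the scraped job descriptions.
descriptions = [
    "We are looking for a Data Scientist with Python and SQL experience.",
    "Machine learning engineer role: Python, cloud and data pipelines.",
]

# Small illustrative stop-word list (the notebook itself uses NLTK's English list).
STOP_WORDS = {"the", "and", "for", "are", "with"}

# Words of three or more characters, as in the tokenizer defined earlier.
TOKEN_PATTERN = re.compile(r"\b\w{3,}\b")


def tokenize(text):
    """Lower-case the text, keep words of 3+ characters and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]


char_lengths = [len(d) for d in descriptions]          # size of each string
token_counts = [len(tokenize(d)) for d in descriptions]  # tokens kept per description
term_freq = Counter(t for d in descriptions for t in tokenize(d))

print(char_lengths, token_counts)
print(term_freq.most_common(2))
```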

<a id=numerical></a>

## 2. Cleaning numerical data
    
[LINK to table of contents](#contents)

<a id=categ></a>

## 3. Cleaning categorical data
   
[LINK to table of contents](#contents)

<a id=text></a>

## 4. Cleaning text data
    
[LINK to table of contents](#contents)