# Notes

Reason for using Selenium

The job search results page keeps adding more results at the bottom, but the URL stays the same. Therefore, it is not possible to loop through result pages by URL. It is necessary to click the button that loads more results (id `showMoreJobs`) until all the results are loaded, and only then scrape.

The URL for a job search results page (for example, a search for technology) redirects to the job search home page,
without the search results, when it is opened later. Therefore, it is necessary to type in the search term and run a fresh search before scraping.

To do for this notebook

- See whether only the qualifications part of each job description can be scraped.

Running the script 
    
- It takes about 10-15 minutes to run the part that scrapes the job titles and job description page links.
- It may take 3-4 hours to run the part that scrapes the job description text, and longer
  if the script gets interrupted. Someone needs to check whether the script ran successfully
  and rerun it as needed.
- The script tends to get interrupted if the computer goes into sleep mode,
  so it helps to keep the computer awake until the script finishes.
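
Since a full run can take hours and may be interrupted, one way to make reruns cheaper is to write each result to disk as soon as it is scraped and to skip links that are already done. The following is a minimal sketch, not part of this notebook: `fetch` is a hypothetical callable standing in for the Selenium lookup, and the checkpoint file path is also an assumption.

```python
import csv
import os


def scrape_with_checkpoint(links, fetch, path):
    """Scrape each link once, appending results to a CSV checkpoint at `path`.

    `fetch` is a hypothetical callable (link -> text) standing in for the
    Selenium lookup. Links already present in the checkpoint are skipped.
    """
    done = {}
    if os.path.exists(path):
        # Load results saved by an earlier (possibly interrupted) run
        with open(path, newline='', encoding='utf-8') as f:
            for link, text in csv.reader(f):
                done[link] = text

    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for link in links:
            if link in done:
                continue  # scraped in an earlier run
            try:
                text = fetch(link)
            except Exception:
                continue  # leave this link for the next rerun
            writer.writerow([link, text])
            f.flush()  # push the row to disk before moving on
            done[link] = text
    return done
```

Calling the function again with the same path retries only the links that are still missing from the checkpoint file.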

In [1]:
from selenium import webdriver
import pandas as pd
import numpy as np
from selenium.webdriver.common.by import By

# Get job titles and job description page links

In [2]:
# The commented-out search URL on the next line redirects to the home page 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#home'
#url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#keyWordSearch=technology%20or%20software%20engineering%20or%20developer%20or%20azure%20or%20aws&locationSearch='

url = 'https://sjobs.brassring.com/TGnewUI/Search/Home/Home?partnerid=25633&siteid=5439&Codes=BeMore#home'

driver = webdriver.Chrome()

driver.implicitly_wait(20)

driver.get(url)
    
location = driver.find_element(By.XPATH,'//*[@id="initialSearchBox__26"]')
location.send_keys('usa')

search_button = driver.find_element(By.XPATH,'//*[@id="searchControls_BUTTON_2"]')
search_button.click()

In [3]:
# Keep clicking "show more" until the button can no longer be found or clicked
while True:
    try:
        next_button = driver.find_element(By.XPATH,'//*[@id="showMoreJobs"]')
        next_button.click()

    except Exception:
        break
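
The loop above stops only when a lookup or click fails. A slightly more defensive variant caps the number of clicks, so that a page which never stops offering more results cannot loop forever. The following is a minimal sketch: `max_clicks` is a hypothetical safety limit, the string `'xpath'` is the value of `By.XPATH`, and the function works with any object that exposes Selenium's `find_element`.

```python
def click_until_gone(driver, xpath, max_clicks=1000):
    """Click the element at `xpath` until it can no longer be found or clicked.

    Returns the number of successful clicks. `max_clicks` is a hypothetical
    safety limit, not something the site requires.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            driver.find_element('xpath', xpath).click()
        except Exception:
            break  # button missing or not clickable: assume everything is loaded
        clicks += 1
    return clicks
```

With the driver from the earlier cell this would be called as `click_until_gone(driver, '//*[@id="showMoreJobs"]')`.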

In [4]:
job_title = []
job_link = []

In [5]:

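y = 2
i = 1

# Walk the result list by position until neither a title nor a link is found
while y != 0:

    y = 2  # counts down by one for each lookup that fails

    try:
        job = driver.find_element(By.XPATH,'//*[@id="mainJobListContainer"]/div/div/ul/li['+str(i)+']/div[2]/div[1]')
        job_title.append(job.text)

    except Exception:
        job_title.append('')
        y -= 1

    try:
        link = driver.find_element(By.XPATH,'//*[@id="Job_'+str(i)+'"]')
        job_link.append(link.get_attribute('href'))

    except Exception:
        job_link.append('')
        y -= 1

    i = i+1

# The final pass, where both lookups fail, leaves one empty title/link pair at the end
print(len(job_title), len(job_link))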
    


509 509


In [6]:
# convert to dataframe

df = pd.DataFrame(zip(job_title, job_link))
df.columns = ['TITLE', 'LINK']
df.head()

Unnamed: 0,TITLE,LINK
0,Pega Engineer,https://sjobs.brassring.com/TGnewUI/Search/hom...
1,"Senior Java Developer Spring, Spring Boot",https://sjobs.brassring.com/TGnewUI/Search/hom...
2,Senior .NET Developer,https://sjobs.brassring.com/TGnewUI/Search/hom...
3,Automation Test Engineer,https://sjobs.brassring.com/TGnewUI/Search/hom...
4,Principal Enterprise SAP BPC Consultant,https://sjobs.brassring.com/TGnewUI/Search/hom...


In [7]:
df['QUALIFICATIONS'] = np.nan
df.head()

Unnamed: 0,TITLE,LINK,QUALIFICATIONS
0,Pega Engineer,https://sjobs.brassring.com/TGnewUI/Search/hom...,
1,"Senior Java Developer Spring, Spring Boot",https://sjobs.brassring.com/TGnewUI/Search/hom...,
2,Senior .NET Developer,https://sjobs.brassring.com/TGnewUI/Search/hom...,
3,Automation Test Engineer,https://sjobs.brassring.com/TGnewUI/Search/hom...,
4,Principal Enterprise SAP BPC Consultant,https://sjobs.brassring.com/TGnewUI/Search/hom...,


In [8]:
df.to_csv('infosys_jobs_usa_title_link.csv')

# Get job description data

In [None]:
# Job description text: the wording and the way the text and headings are organized differ among job postings.
# The most common pattern I could find is that the job descriptions and qualifications sit under li tags,
# though not always. Therefore, I am grabbing all li tags from the job description pages.
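
To illustrate the "grab every li tag" approach without a browser, the sketch below collects the text of all `li` elements from an HTML string using only the standard library. The sample HTML is made up for illustration and is not taken from a real posting; in the notebook itself Selenium does this work. The parser assumes every `li` has a closing tag.

```python
from html.parser import HTMLParser


class ListItemText(HTMLParser):
    """Collect the text content of every top-level <li> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <li> elements currently open
        self.items = []   # one string per top-level <li>

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            if self.depth == 0:
                self.items.append('')
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'li' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested tags such as <b> is still part of the open <li>
        if self.depth > 0:
            self.items[-1] += data


# Made-up sample; real postings vary in structure.
sample = '<p>Basic</p><ul><li>Bachelor degree</li><li>2 years of <b>Java</b></li></ul>'
parser = ListItemText()
parser.feed(sample)
job_text = ' '.join(item.strip() for item in parser.items)
print(job_text)
```

Text outside the list, such as the `Basic` paragraph in the sample, is ignored, which mirrors the limitation noted in the comments above: content that is not under `li` tags is not captured.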

In [None]:
# get job description data in batches. It takes too long to get all at once.
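
One way to process the links in batches is to split the row positions into fixed-size ranges and run the scraping loop over one range at a time. The following is a minimal sketch; the batch size of 100 is an arbitrary assumption, and `batch_ranges` is a hypothetical helper, not part of this notebook.

```python
def batch_ranges(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total) in order."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# 509 rows, matching the count printed earlier, in batches of 100 (hypothetical size)
batches = list(batch_ranges(509, 100))
print(batches)
```

Each pair can then drive a loop of the form `for i in range(start, stop):`, so an interrupted batch can be rerun without repeating the earlier ones.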

In [10]:
driver = webdriver.Chrome()

# Use an object column so that strings can be stored in it
df['QUALIFICATIONS'] = df['QUALIFICATIONS'].astype(object)

for i in range(len(df)):

    try:
        driver.get(df.loc[i, 'LINK'])

        # Element that holds the job description on the posting page
        desc = driver.find_element(By.XPATH,'//*[@id="content"]/div[1]/div[7]/div[4]/div[2]/div/div[3]/div[4]/p[2]')
        texts = desc.find_elements(By.TAG_NAME, 'li')

        # Concatenate the text of every li, each followed by a space
        job_text = ''.join(item.text + ' ' for item in texts)

        # .loc writes to df itself; df['QUALIFICATIONS'][i] = ... is chained
        # assignment and triggers pandas' SettingWithCopyWarning
        df.loc[i, 'QUALIFICATIONS'] = job_text

    except Exception:
        # Leave the cell empty when the page or element cannot be read
        df.loc[i, 'QUALIFICATIONS'] = ''

# quit() ends the browser session, not just the current window
driver.quit()


In [11]:
df['QUALIFICATIONS'][500]

'Experience in client facing role managing highly complex programs Experience in life insurance domain Should have managed multi-million dollar programs with 100+ team members and multiple sub-projects Delivering with near-shore and off-shore teams Prior experience in managing Policy Administration System (PAS) conversion PMP Certification Your responsibilities would include '

In [12]:
len(df[df['QUALIFICATIONS'] == ''])


155

# Basic Cleaning

In [13]:
to_drop = df[df['QUALIFICATIONS'] == ''].index
df = df.drop(to_drop)
df = df.dropna()
len(df)


354

In [18]:
df['QUALIFICATIONS'] = df['QUALIFICATIONS'].str.lower()
df = df.drop_duplicates(subset=['TITLE', 'QUALIFICATIONS'])
df = df.reset_index(drop=True)
len(df)

354
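
The effect of the cleaning steps can be checked on a small made-up frame. The sketch below uses invented rows, not scraped data, to show why lowercasing before `drop_duplicates` matters: two rows that differ only in letter case collapse into one.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'TITLE': ['Dev', 'Dev', 'Tester', 'Analyst'],
    'QUALIFICATIONS': ['Java Experience ', 'java experience ', '', np.nan],
})

# Drop rows whose description is empty or missing
toy = toy[toy['QUALIFICATIONS'] != ''].dropna().copy()

# Lowercase first so that case-only differences count as duplicates
toy['QUALIFICATIONS'] = toy['QUALIFICATIONS'].str.lower()
toy = toy.drop_duplicates(subset=['TITLE', 'QUALIFICATIONS']).reset_index(drop=True)

print(len(toy))
```

Without the `str.lower()` step the two `Dev` rows would both survive, because `drop_duplicates` compares strings exactly.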

In [20]:
df['COMPANY'] = 'Infosys'
df['DESCRIPTION'] = df['QUALIFICATIONS']
df.head()

Unnamed: 0,TITLE,LINK,QUALIFICATIONS,COMPANY,DESCRIPTION
0,Senior .NET Developer,https://sjobs.brassring.com/TGnewUI/Search/hom...,0 - 1 year experience in java full stack devel...,Infosys,0 - 1 year experience in java full stack devel...
1,Automation Test Engineer,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign equivalent requir...,Infosys,bachelor’s degree or foreign equivalent requir...
2,Principal Enterprise SAP BPC Consultant,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign university equiva...,Infosys,bachelor’s degree or foreign university equiva...
3,Senior Analyst - Analytics,https://sjobs.brassring.com/TGnewUI/Search/hom...,"analyze complex market information, understand...",Infosys,"analyze complex market information, understand..."
4,Senior Operations Manager - Sourcing & Procure...,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign equivalent requir...,Infosys,bachelor’s degree or foreign equivalent requir...


In [22]:
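# Reload the saved results; this file is written by the last cell, so it must exist from an earlier run
df = pd.read_csv('infosys_usa_ jobs.csv')
df = df.reindex(columns=['COMPANY', 'TITLE', 'QUALIFICATIONS', 'LINK', 'DESCRIPTION'])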
df.head()

Unnamed: 0,COMPANY,TITLE,QUALIFICATIONS,LINK,DESCRIPTION
0,Infosys,Senior .NET Developer,0 - 1 year experience in java full stack devel...,https://sjobs.brassring.com/TGnewUI/Search/hom...,0 - 1 year experience in java full stack devel...
1,Infosys,Automation Test Engineer,bachelor’s degree or foreign equivalent requir...,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign equivalent requir...
2,Infosys,Principal Enterprise SAP BPC Consultant,bachelor’s degree or foreign university equiva...,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign university equiva...
3,Infosys,Senior Analyst - Analytics,"analyze complex market information, understand...",https://sjobs.brassring.com/TGnewUI/Search/hom...,"analyze complex market information, understand..."
4,Infosys,Senior Operations Manager - Sourcing & Procure...,bachelor’s degree or foreign equivalent requir...,https://sjobs.brassring.com/TGnewUI/Search/hom...,bachelor’s degree or foreign equivalent requir...


In [24]:
df.to_csv('infosys_usa_ jobs.csv')
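
By default `to_csv` writes the row index as an unnamed first column, and `read_csv` then loads it as a column called `Unnamed: 0`. The `reindex(columns=...)` call above drops that column on reload. The following is a minimal sketch on an invented one-row frame, showing the default behaviour next to `index=False`, which avoids the extra column in the first place.

```python
import io

import pandas as pd

# Invented row for illustration only
toy = pd.DataFrame({'COMPANY': ['Infosys'], 'TITLE': ['Senior .NET Developer']})

# Default: the index is written and comes back as an extra column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the real columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

Whether to pass `index=False` when saving depends on what later steps expect from the file; the notebook as written keeps the default.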