## An Academic Exploration of Data Scraping for Job Application Automation
### Introduction
In today's competitive job market, efficiency is key. Automating repetitive tasks is one way to improve this efficiency. While full automation might not be possible due to varying application requirements and human interactions, partial automation can certainly help ease the burden. This article explores the technical aspects of automating job applications, focusing on the theoretical concept rather than practical application to ensure we abide by websites' terms of service.

### What is Web Scraping?
Web scraping is a technique for extracting information from websites. It automates the manual process of collecting data from the web, using various programming languages and tools.

For browser-based scraping in Python, the first step is to import the necessary packages and set up a web driver. The web driver lets us control a web browser programmatically, interacting with web pages much as a human user would.

For this academic example, we'll use Selenium, a tool for automating a web browser from code. We'll also use pandas for data manipulation and the `time` module to pause between actions.

Here's how we do it:

In [1]:
# Import packages
from selenium import webdriver # Controls the browser programmatically
import time # Standard-library timing utilities
import pandas as pd # For data manipulation and storage
import os # For handling directory and file paths
from time import sleep # Same function as time.sleep, imported so we can call sleep() directly

# The following imports are included but not used in this code snippet. 
# They can be useful for advanced functionalities like waiting for certain conditions.
# from selenium.webdriver.support.select import Select 
# from selenium.webdriver.support.ui import WebDriverWait 
# from selenium.webdriver.common.by import By 
# from selenium.webdriver.support import expected_conditions as EC 

# The import below is not used but can be important if you want to add Chrome options.
# from selenium.webdriver.chrome.options import Options 

# Path to the webdriver executable (ChromeDriver here; other browsers have their own drivers)
# Note: Replace the empty string with the path to your webdriver executable.
driver = webdriver.Chrome(executable_path="")


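`driver = webdriver.Chrome(executable_path="")` initializes a new Chrome browser window controlled by the webdriver. Replace the empty string with the path to your ChromeDriver executable, and make sure the ChromeDriver version you download is compatible with your installed Chrome browser. The code in this article uses the Selenium 3 style API (`executable_path`, `find_element_by_*`); later Selenium 4 releases removed these in favor of a `Service` object and `find_element(By.ID, ...)` style lookups.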

Remember, this is an academic exercise, and running automated scripts to interact with websites like LinkedIn can violate their terms of service. Always make sure to review and comply with a website's terms before attempting any form of scraping or automation.
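
One machine-readable signal that can be checked before any automated request is a site's `robots.txt` file. The sketch below uses Python's standard `urllib.robotparser` on a hypothetical, inlined `robots.txt` so that it runs offline; the helper name `is_allowed`, the user agent `example-bot`, and the `example.com` URLs are assumptions made for illustration. A `robots.txt` file is not the same thing as a site's terms of service, so this check complements reading the terms rather than replacing it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/jobs/"))
print(is_allowed(ROBOTS_TXT, "example-bot", "https://example.com/private/data"))
```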

### Navigating to LinkedIn and Logging In
Once the necessary packages are imported and the webdriver is set up, the next step is to navigate to LinkedIn's login page. We will enter the username and password programmatically and then click the "Sign In" button.

Here is how this part is coded:

In [4]:
# Define the URL for LinkedIn login
url1='https://www.linkedin.com/login'

# Set an implicit wait: element lookups keep polling for up to 1 second before failing
driver.implicitly_wait(1)

# Navigate to the LinkedIn login page
driver.get(url1)

# Locate email field by its id and input email
email_field = driver.find_element_by_id('username')
email_field.send_keys('') # Replace empty quotes with your email
print('- Finish keying in email')
sleep(1) # Pause for 1 second

# Locate password field by its name attribute and input password
password_field = driver.find_element_by_name('session_password')
password_field.send_keys('') # Replace empty quotes with your password
print('- Finish keying in password')
sleep(1) # Pause for 1 second

# Locate the sign-in button by its XPath and click it
signin_field = driver.find_element_by_xpath('//*[@id="organic-div"]/form/div[3]/button')
signin_field.click()
sleep(1) # Pause for 1 second


In [12]:
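import requests
from lxml import html
import csv  # Not used in this cell

# Fetch the hedge fund manager rankings page and parse the HTML
url = "https://www.swfinstitute.org/fund-manager-rankings/hedge-fund-manager"
response = requests.get(url)
tree = html.fromstring(response.content)

# Collect the text of every link inside elements whose class list includes "table-striped"
companies = tree.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "table-striped", " " ))]//a/text()')
companies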

[]
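
The XPath in the cell above selects link text inside elements whose `class` attribute contains the token `table-striped`. As a rough illustration of that class-token test, the sketch below applies the same idea to a hypothetical inline HTML snippet using only the standard library. The class name `StripedTableLinks` and the sample markup are assumptions, and the sketch only looks at `<table>` elements, whereas the XPath matches any element. An empty result like the one above can mean that the served HTML did not contain the table at all, for example if the page builds it client-side or the request was refused; the source does not say which.

```python
from html.parser import HTMLParser


class StripedTableLinks(HTMLParser):
    """Collect link text found inside <table> elements with class token 'table-striped'."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0   # > 0 while inside a matching table
        self.in_link = False   # True while inside an <a> within a matching table
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Split the class attribute into tokens so "table-striped-dark" does not match
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and (self.table_depth or "table-striped" in classes):
            self.table_depth += 1
        elif tag == "a" and self.table_depth:
            self.in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data


# Hypothetical markup standing in for a rankings page
SAMPLE_HTML = """
<table class="table table-striped">
  <tr><td>1</td><td><a href="/p/1">Example Fund A</a></td></tr>
  <tr><td>2</td><td><a href="/p/2">Example Fund B</a></td></tr>
</table>
<table class="table-striped-dark"><tr><td><a href="/x">Not matched</a></td></tr></table>
<a href="/y">Outside any table</a>
"""

parser = StripedTableLinks()
parser.feed(SAMPLE_HTML)
print(parser.links)
```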

In [9]:
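def clean_company_name(company_name):
    """Strip legal suffixes and commas from a company name and encode spaces for a URL."""
    extra_stuff = ['LLC', 'LP', 'Inc.', 'Co.', 'Corp.', 'Ltd.', 'LLP', 'PLC', 'AG', 'AB', 'BV', 'GmbH']

    # Drop words that exactly match a legal suffix (variants such as "L.L.C." pass through)
    cleaned_name = ' '.join(word for word in company_name.split() if word not in extra_stuff)

    # Remove commas, then encode spaces as %20 for use in a query string
    cleaned_name = cleaned_name.replace(',', '')
    cleaned_name = cleaned_name.replace(' ', '%20')

    return cleaned_name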
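
The function above encodes only spaces, so a name containing `&` or another reserved character would produce a malformed query string. A minimal sketch of a variant that delegates encoding to the standard library's `urllib.parse.quote` follows. The name `clean_company_name_quoted` and the sample company names are hypothetical, and the variant also removes commas before filtering suffixes, which is a small behavioral difference from the original.

```python
from urllib.parse import quote

# Suffix list copied from clean_company_name above
SUFFIXES = {'LLC', 'LP', 'Inc.', 'Co.', 'Corp.', 'Ltd.', 'LLP', 'PLC', 'AG', 'AB', 'BV', 'GmbH'}


def clean_company_name_quoted(company_name: str) -> str:
    """Hypothetical variant: drop commas, drop legal suffixes, then percent-encode."""
    words = company_name.replace(',', '').split()
    kept = [word for word in words if word not in SUFFIXES]
    return quote(' '.join(kept))


print(clean_company_name_quoted("Bridgewater Associates, LP"))
print(clean_company_name_quoted("Smith & Jones Co."))
```

Because commas are stripped first, a suffix written as `LLC,` is also removed here, whereas the original keeps it as `LLC`.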

In [7]:
df = pd.read_csv("companies.csv")
df.head()

Unnamed: 0,Rank,Company
0,1,"Bridgewater Associates, LP"
1,2,Balyasny Asset Management
2,3,Tiger Global Management LLC
3,4,Garda Capital Partners
4,5,Renaissance Technologies LLC


In [11]:
company_links = []
for i, row in df.iterrows():
    clean_name = clean_company_name(row['Company'])
    company_search_page = f"https://www.linkedin.com/search/results/companies/?keywords={clean_name}"
    print(company_search_page)
    driver.get(company_search_page)
    sleep(2)  # Give the results page time to render
    # The first search result sits at one of two absolute XPaths, depending on the page layout
    try:
        element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div[2]/div/div[1]/main/div/div/div[2]/div/ul/li[1]/div/div/div[2]/div[1]/div[1]/div/span/span/a')
    except Exception:
        try:
            element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div[2]/div/div[1]/main/div/div/div[3]/div/ul/li/div/div/div[2]/div[1]/div[1]/div/span/span/a')
        except Exception:
            continue  # No result found in either layout; skip this company

    # Extract the link and store it
    link = element.get_attribute("href")
    company_links.append(link)

https://www.linkedin.com/search/results/companies/?keywords=Bridgewater%20Associates
https://www.linkedin.com/search/results/companies/?keywords=Balyasny%20Asset%20Management
https://www.linkedin.com/search/results/companies/?keywords=Tiger%20Global%20Management
https://www.linkedin.com/search/results/companies/?keywords=Garda%20Capital%20Partners
https://www.linkedin.com/search/results/companies/?keywords=Renaissance%20Technologies
https://www.linkedin.com/search/results/companies/?keywords=ExodusPoint%20Capital%20Management
https://www.linkedin.com/search/results/companies/?keywords=Squarepoint%20Capital
https://www.linkedin.com/search/results/companies/?keywords=Capula%20Investment%20Management
https://www.linkedin.com/search/results/companies/?keywords=Coatue%20Capital%20L.L.C.
https://www.linkedin.com/search/results/companies/?keywords=Two%20Sigma
https://www.linkedin.com/search/results/companies/?keywords=Lighthouse%20Investment%20Partners
https://www.linkedin.com/search/results/

In [8]:
import pickle

# Saving the list to a file
filename = "company_list.pkl"
# with open(filename, "wb") as file:
#     pickle.dump(company_links, file)


# Loading the list from the file
with open(filename, "rb") as file:
    loaded_list = pickle.load(file)

# Printing the loaded list
print(loaded_list)

['https://www.linkedin.com/company/bridgewater-associates/', 'https://www.linkedin.com/company/balyasny-asset-management-l.p./', 'https://www.linkedin.com/company/tiger-global-management/', 'https://www.linkedin.com/company/garda-capital-partners/', 'https://www.linkedin.com/company/exoduspoint/', 'https://www.linkedin.com/company/squarepoint-capital/', 'https://www.linkedin.com/company/capula_2/', 'https://www.linkedin.com/company/coatue/', 'https://www.linkedin.com/company/two-sigma-investments/', 'https://www.linkedin.com/company/lighthouse-investment-partners-llc/', 'https://www.linkedin.com/company/haidar-capital-management/', 'https://www.linkedin.com/company/viking-global-investors/', 'https://www.linkedin.com/company/adage-capital-management/', 'https://www.linkedin.com/company/elliottinvestmentmanagementlp/', 'https://www.linkedin.com/company/marshall-wace/', 'https://www.linkedin.com/company/millennium-events-management/', 'https://www.linkedin.com/company/element-capital-man
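
Pickle is convenient for caching Python objects between notebook sessions, but `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself. For a plain list of URL strings, JSON is a readable alternative. The sketch below is one way to do that round trip; the helper names `save_links` and `load_links` and the file name `company_list.json` are assumptions, and a temporary directory is used so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path


def save_links(links, path):
    """Write a list of URL strings to path as JSON."""
    Path(path).write_text(json.dumps(links, indent=2), encoding="utf-8")


def load_links(path):
    """Read a list of URL strings back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Two entries taken from the output above
sample_links = [
    "https://www.linkedin.com/company/bridgewater-associates/",
    "https://www.linkedin.com/company/two-sigma-investments/",
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "company_list.json"
    save_links(sample_links, target)
    restored = load_links(target)

print(restored == sample_links)
```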

In [54]:
job_links = []

for company_url in loaded_list:
    driver.get(company_url + "jobs")  # Company URLs end with "/", so this opens the jobs tab
    # The link to the full job list sits at one of two absolute XPaths
    try:
        element = driver.find_element_by_xpath('/html/body/div[5]/div[3]/div/div[2]/div/div[2]/main/div[2]/div/section/div/div/a')
    except Exception:
        try:
            element = driver.find_element_by_xpath('/html/body/div[4]/div[3]/div/div[2]/div/div[2]/main/div[2]/div/section/div/div/a')
        except Exception:
            continue  # No jobs link found in either layout; skip this company
    link = element.get_attribute("href")

    driver.get(link)
    sleep(1)
    ul_element = driver.find_element_by_xpath("/html/body/div[5]/div[3]/div[4]/div/div/main/div/div[1]/div/ul")
    li_elements = ul_element.find_elements_by_tag_name("li")

    # Collect the href of each list item that contains a job title link
    for li in li_elements:
        link_element = li.find_elements_by_class_name("job-card-list__title")
        if link_element:
            job_links.append(link_element[0].get_attribute("href"))

In [9]:
# Saving the job links to a file
filename = "job_links.pkl"
# with open(filename, "wb") as file:
#     pickle.dump(job_links, file)


# Loading the list from the file (this rebinds loaded_list from company URLs to job URLs)
with open(filename, "rb") as file:
    loaded_list = pickle.load(file)

# Printing the loaded list
print(loaded_list)


['https://www.linkedin.com/jobs/view/3659222868/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3D%3D&trackingId=xaMcBVVrGcM%2B3lARTbTGVA%3D%3D&trk=flagship3_search_srp_jobs', 'https://www.linkedin.com/jobs/view/3655151423/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3D%3D&trackingId=JGfcBeRlWqBy184zE0xQAA%3D%3D&trk=flagship3_search_srp_jobs', 'https://www.linkedin.com/jobs/view/3641773495/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3D%3D&trackingId=8m0WvsZ6nv%2BQbik55VzF%2BA%3D%3D&trk=flagship3_search_srp_jobs', 'https://www.linkedin.com/jobs/view/3656750052/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3D%3D&trackingId=nU7rq9h6UJqkHNGWrT0tPg%3D%3D&trk=flagship3_search_srp_jobs', 'https://www.linkedin.com/jobs/view/3641773918/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3D%3D&trackingId=A%2B6079X1TbFFHC6S2pNipQ%3D%3D&trk=flagship3_search_srp_jobs', 'https://www.linkedin.com/jobs/view/3657165739/?eBP=JOB_SEARCH_ORGANIC&refId=TkoYSTBE1EiuayhpB9mi7Q%3
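
The URLs above carry query parameters such as `eBP`, `refId`, `trackingId`, and `trk` after the job path. Assuming the numeric ID in the path is what identifies a posting (the source does not confirm this), the query string can be dropped before storing or de-duplicating links. The helper name `strip_query` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return url without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Shortened form of the first URL in the output above
raw = ("https://www.linkedin.com/jobs/view/3659222868/"
       "?eBP=JOB_SEARCH_ORGANIC&trk=flagship3_search_srp_jobs")
print(strip_query(raw))
```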

In [25]:
import traceback

# The list to hold all job details
job_data = []

for job in loaded_list:
    try:
        driver.get(job)

        job_details = {}  # Dictionary to store the details of each job

        try:
            job_details['job_title'] = driver.find_element_by_class_name("jobs-unified-top-card__job-title").text
        except Exception:
            print("Exception when fetching job title")
            print(traceback.format_exc())

        try:
            job_details['job_metadata'] = driver.find_element_by_class_name("jobs-unified-top-card__primary-description").text
        except Exception:
            print("Exception when fetching job metadata")
            print(traceback.format_exc())

        try:
            job_details['hiring_team_name'] = driver.find_element_by_class_name("jobs-poster__name").text
        except Exception:
            print("Exception when fetching hiring team name")
            print(traceback.format_exc())

        try:
            hirer_card = driver.find_element_by_class_name("hirer-card__hirer-information")
            a_element = hirer_card.find_element_by_xpath(".//a")
            job_details['hiring_team_link'] = a_element.get_attribute("href")
        except Exception:
            print("Exception when fetching hiring team link")
            print(traceback.format_exc())

        company_metadata = []
        try:
            for i in driver.find_elements_by_class_name("jobs-unified-top-card__job-insight"):
                company_metadata.append(i.text)
            job_details['company_metadata'] = company_metadata
        except Exception:
            print("Exception when fetching company metadata")
            print(traceback.format_exc())

        try:
            job_details['description'] = driver.find_element_by_class_name("jobs-description-content__text").text
        except Exception:
            print("Exception when fetching job description")
            print(traceback.format_exc())

        job_data.append(job_details)  # Add the job details to the job_data list

    except Exception:
        print("Exception when processing job: ", job)
        print(traceback.format_exc())

# Now job_data is a list of dictionaries, each containing details about a job


Exception when fetching company metadata
Traceback (most recent call last):
  File "<ipython-input-25-f45b55eac8cf>", line 41, in <module>
    company_metadata.append(i.text)
  File "/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 76, in text
    return self._execute(Command.GET_ELEMENT_TEXT)['value']
  File "/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webelement.py", line 633, in _execute
    return self._parent.execute(command, params)
  File "/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/opt/anaconda3/lib/python3.8/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.StaleElementReferenceException: Message: stale element reference: stale element not found
  (Session info: chrome=115.0.5790.
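
The traceback above shows a `StaleElementReferenceException` raised while reading `.text` from elements that had been located a moment earlier, which typically means the page re-rendered in between. One common mitigation is to locate the elements again and retry. The sketch below shows a generic retry helper; `retry_on`, `StaleElementError`, and `read_insights` are hypothetical names, and a stand-in exception class is used so the example runs without a browser. With Selenium, the exception type passed in would be the one named in the traceback.

```python
import time


class StaleElementError(Exception):
    """Stand-in for Selenium's StaleElementReferenceException."""


def retry_on(exc_type, action, attempts=3, delay=0.0):
    """Call action() up to `attempts` times, retrying only when exc_type is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exc_type:
            if attempt == attempts:
                raise  # Out of attempts; surface the original error
            time.sleep(delay)


# Simulated flaky read: fails twice, then succeeds
calls = {"count": 0}


def read_insights():
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleElementError("element went stale")
    return ["Full-time", "Entry level"]


result = retry_on(StaleElementError, read_insights)
print(result, calls["count"])
```

For the retry to help, the callable has to both locate the elements and read their text, so that every attempt works with fresh element references.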

In [29]:
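df = pd.DataFrame(job_data)
# Requires len(job_data) == len(loaded_list); a job skipped by the outer except above breaks this alignment
df['link'] = loaded_list
df.head()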

Unnamed: 0,job_title,job_metadata,hiring_team_name,hiring_team_link,description,company_metadata,link
0,Investment Engineer,"Bridgewater Associates · Westport, CT (On-site...",Jason Koulouras,https://www.linkedin.com/in/jasonkoulouras,About the job\nAbout Bridgewater\n\nBridgewate...,,https://www.linkedin.com/jobs/view/3659222868/...
1,Investment Associate Intern - Summer 2024,"Bridgewater Associates · Westport, CT (On-site...",Jason Koulouras,https://www.linkedin.com/in/jasonkoulouras,About the job\nSeeking deeply curious students...,"[$41,000/yr (from job description) · Internshi...",https://www.linkedin.com/jobs/view/3655151423/...
2,Support Engineer,"Bridgewater Associates · Singapore, Singapore ...",,,About the job\nAbout Bridgewater\n\nBridgewate...,"[Full-time · Entry level, 1,001-5,000 employee...",https://www.linkedin.com/jobs/view/3641773495/...
3,"Business Analyst, Workday Technology","Bridgewater Associates · Westport, CT (Remote)...",Jason Koulouras,https://www.linkedin.com/in/jasonkoulouras,About the job\nAbout Bridgewater\n\nBridgewate...,"[$150,000/yr - $225,000/yr (from job descripti...",https://www.linkedin.com/jobs/view/3656750052/...
4,"Technical Project Manager, Investment Technology","Bridgewater Associates · Westport, CT (On-site...",,,About the job\nAbout Bridgewater\n\nBridgewate...,"[$150,000/yr - $250,000/yr (from job descripti...",https://www.linkedin.com/jobs/view/3641773918/...
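
The empty cells in the table above follow from how pandas builds a DataFrame from a list of dictionaries: a key that is missing from one dictionary becomes a missing value (`NaN`) in that row. A minimal sketch with hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; the second one lacks the hiring team key,
# as happens when that lookup raises an exception in the loop above
rows = [
    {"job_title": "Example Role A", "hiring_team_name": "Example Person"},
    {"job_title": "Example Role B"},
]

frame = pd.DataFrame(rows)
print(frame)
print(frame["hiring_team_name"].isna().tolist())
```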


In [30]:
df.to_csv("final_output.csv")

In [9]:
import pandas as pd
df = pd.read_csv("final_output.csv", index_col=0)

In [10]:
df['description'] = df['description'].str.replace(r'\r?\n', ' ', regex=True)
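
The pattern `\r?\n` in the cell above matches both Unix (`\n`) and Windows (`\r\n`) line breaks. The same substitution on a single string, using the standard `re` module and a hypothetical helper name `flatten_newlines`, looks like this:

```python
import re


def flatten_newlines(text: str) -> str:
    """Replace each Unix or Windows line break with a single space."""
    return re.sub(r'\r?\n', ' ', text)


sample = "About the job\r\nAbout the team\n\nResponsibilities"
print(repr(flatten_newlines(sample)))
```

Each line break becomes its own space, so a blank line in the input leaves two consecutive spaces. Substituting on `\s+` instead would collapse any run of whitespace into one space.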


In [1]:
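from transformers import pipeline

# Summarization pipeline used by summarize() below
get_completion = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
# Text-generation pipeline, loaded here but not used in the cells below
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")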

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


In [13]:
def summarize(text):
    """Return the summary of text, or None if the pipeline raises ValueError."""
    try:
        output = get_completion(text)
        return output[0]['summary_text']
    except ValueError:
        # Leave the summary empty for inputs the pipeline rejects
        return None

In [14]:
df['description_summary'] = df['description'].apply(summarize)

Your max_length is set to 142, but your input_length is only 75. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=37)
Your max_length is set to 142, but your input_length is only 69. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=34)
Your max_length is set to 142, but your input_length is only 66. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=33)
Your max_length is set to 142, but your input_length is only 110. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=55)
You
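
The warnings above appear when an input is shorter than the pipeline's `max_length` of 142, and each one suggests a `max_length` of half the reported input length (75 gives 37, 110 gives 55). A small helper can compute that value per input. The function name `suggested_max_length` is hypothetical, and `input_length` is assumed to be the length the pipeline reports in its warning, not a character count.

```python
def suggested_max_length(input_length: int, default: int = 142) -> int:
    """Halve short inputs, mirroring the values suggested in the warnings;
    otherwise keep the default limit of 142 seen in the same warnings."""
    if input_length < default:
        return max(1, input_length // 2)
    return default


for n in (75, 69, 66, 110, 500):
    print(n, suggested_max_length(n))
```

The resulting value would be passed in the way the warning itself shows, as a `max_length` keyword argument on the summarizer call.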

In [16]:
df.to_csv("summarized_description.csv")