### Scraping LinkedIn Data

In this notebook we are going to scrape data from the LinkedIn profiles of people working in data science.


First we need to make sure that `webdriver-manager` is installed, as follows.

In [4]:
pip install webdriver-manager -q

Note: you may need to restart the kernel to use updated packages.


In the following code cell we import all the packages that we are going to use in this notebook for the web scraping task.

In [98]:
import os
import pandas as pd
import requests
import tqdm
import random
import shutil
import time
import multiprocessing


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
from getpass import getpass
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

Next we are going to create an instance of a ``Chrome`` driver using ``selenium`` for automation.

In [2]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Next we are going to visit https://www.linkedin.com/

In [69]:
driver.get('https://www.linkedin.com/')
driver.maximize_window()

Next we collect the credentials that we will use to log in to our LinkedIn account.

In [4]:
email = 'crispengari@gmail.com'
password = getpass(prompt = 'Enter the Linked in password: ').strip()

Enter the Linked in password:  ········
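
Hard-coding the email in a cell makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming hypothetical environment variables named `LINKEDIN_EMAIL` and `LINKEDIN_PASSWORD` (these names are not part of the original notebook), reads each value from the environment and only prompts when it is missing:

```python
import os
from getpass import getpass


def read_credential(env_var: str, prompt: str, secret: bool = False) -> str:
    """Return the value of env_var, or prompt the user if it is not set."""
    value = os.environ.get(env_var, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt; getpass hides what is typed.
    ask = getpass if secret else input
    return ask(prompt).strip()
```

It could then be called as `email = read_credential("LINKEDIN_EMAIL", "Enter the LinkedIn email: ")` and `password = read_credential("LINKEDIN_PASSWORD", "Enter the LinkedIn password: ", secret=True)`.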


Then we click the `Sign in` link.

In [5]:
driver.find_element(By.LINK_TEXT, 'Sign in').click() #driver.find_element(By.XPATH, '/html/body/nav/div/a[2]')

Then we are going to sign in to LinkedIn with our credentials.

In [6]:
email_field = driver.find_element(By.XPATH, "//input[@id='username']")
email_field.send_keys(email)
email_field.send_keys(Keys.RETURN)

password_field = driver.find_element(By.XPATH, "//input[@id='password']")
password_field.send_keys(password)
password_field.send_keys(Keys.RETURN)

We then define the search queries that we want Selenium to run on LinkedIn.

In [70]:
# search queries to type into the LinkedIn search field
queries  = ["data engineer in south africa", "Data Analyst in south africa",
            "Data Scientist in south africa", 'machine learning engineer in south africa']

The function `get_page_profile` collects the profile URLs of the people listed on the current search results page, with any query string removed.

In [82]:
def get_page_profile():
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    page_result = soup.find_all('li', {'class': 'reusable-search__result-container'})
    profile_urls = [res.find('a')['href'].split('?')[0] for res in page_result]
    return profile_urls
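
The `split('?')[0]` step above strips the tracking query string from each link. A slightly more defensive sketch of the same idea, using only the standard library, also drops fragments and trailing slashes so that two spellings of one profile compare equal. The helper name `clean_profile_url` and the sample hrefs are hypothetical and only illustrate the shape of the links:

```python
from urllib.parse import urlsplit, urlunsplit


def clean_profile_url(href: str) -> str:
    """Drop the query string, the fragment and any trailing slash from a URL."""
    parts = urlsplit(href)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


# Hypothetical hrefs, shaped like the links found in search results.
hrefs = [
    "https://www.linkedin.com/in/some-profile?ref=search",
    "https://www.linkedin.com/in/some-profile/",
]
cleaned = [clean_profile_url(h) for h in hrefs]
print(cleaned)
```

Both entries reduce to the same string, so a later `set(...)` treats them as one profile.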

In [99]:
actions = ActionChains(driver)

The function `search_for_profiles` takes a query, the number of seconds to wait for each page to load, and the number of result pages to go through. It returns a list of the unique profile URLs that it found.

In [126]:
def search_for_profiles(q, wait_for:int = 10, pages:int = 100):
    # Clear the search box, type the query and submit it.
    search_field = driver.find_element(By.XPATH, "//input[@placeholder='Search']")
    driver.execute_script("arguments[0].value = '';", search_field)
    search_field.send_keys(q)
    search_field.send_keys(Keys.RETURN)
    time.sleep(wait_for)
    try:
        # Restrict the results to people if that filter is not active yet.
        driver.find_element(By.XPATH, "//button[@aria-pressed='false'][normalize-space()='People']").click()
    except Exception:
        pass

    profile_urls = []
    for i in tqdm.tqdm(range(pages), total=pages, desc="Next Page >>>"):
        try:
            # Scroll to the bottom so that the pagination controls are rendered.
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(wait_for)
            # Collect the current page before moving on, so the first page is not skipped.
            profile_urls.extend(get_page_profile())
            time.sleep(3)
            next_btn = driver.find_element(By.XPATH,  "//button[@aria-label='Next' and @type='button']")
            actions.move_to_element(next_btn).click().perform()
            time.sleep(3)
        except Exception:
            # No "Next" button (last page) or a transient page error: keep what we have.
            pass
    return list(set(profile_urls))
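
The pauses in `search_for_profiles` are fixed (`wait_for` and 3 seconds). One variation, which is not used in the original notebook, is to randomise each pause around a base value so that requests are not perfectly periodic. The sketch below uses the `random` and `time` modules that are already imported; the helper names `jittered_delay` and `polite_sleep` and the default jitter of 30% are assumptions made for illustration:

```python
import random
import time


def jittered_delay(base: float, jitter: float = 0.3) -> float:
    """Return a delay drawn uniformly from base * (1 - jitter) to base * (1 + jitter)."""
    if base < 0 or not 0 <= jitter <= 1:
        raise ValueError("base must be >= 0 and jitter must be in [0, 1]")
    return random.uniform(base * (1 - jitter), base * (1 + jitter))


def polite_sleep(base: float, jitter: float = 0.3) -> float:
    """Sleep for a jittered delay and return the number of seconds slept."""
    delay = jittered_delay(base, jitter)
    time.sleep(delay)
    return delay
```

A call such as `polite_sleep(wait_for)` could stand in for `time.sleep(wait_for)` inside the loop.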

In the following code cell we run every query and collect the profile URLs in a single list, so that we can save them to a `.csv` file for later use.

In [127]:
all_urls = []
for q in queries:
    print(f"Fetching for: {q}")
    res = search_for_profiles(q)
    all_urls.extend(res)

Fetching for: data engineer in south africa


Next Page >>>: 100%|█████████████████████████████████████████████████████████████████| 100/100 [28:46<00:00, 17.26s/it]


Fetching for: Data Analyst in south africa


Next Page >>>: 100%|█████████████████████████████████████████████████████████████████| 100/100 [29:42<00:00, 17.83s/it]


Fetching for: Data Scientist in south africa


Next Page >>>: 100%|█████████████████████████████████████████████████████████████████| 100/100 [21:59<00:00, 13.20s/it]


Fetching for: machine learning engineer in south africa


Next Page >>>: 100%|█████████████████████████████████████████████████████████████████| 100/100 [29:49<00:00, 17.89s/it]
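
The same profile can match more than one query, so `all_urls` may contain duplicates. Converting to a `set` removes them but loses the order in which profiles were found. A minimal sketch of an order-preserving alternative is shown here; the helper name `unique_in_order` and the sample URLs are hypothetical:

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Hypothetical results for two queries, with one profile appearing in both.
results = [
    "https://www.linkedin.com/in/profile-a",
    "https://www.linkedin.com/in/profile-b",
    "https://www.linkedin.com/in/profile-a",
    "https://www.linkedin.com/in/profile-c",
]
merged = unique_in_order(results)
print(merged)
```

Keeping a stable order makes the saved file reproducible between runs over the same results.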


Next we create a dataframe from these URLs, with a single column named ``profile``.

In [135]:
profile_df = pd.DataFrame(set(all_urls), columns=['profile'])
profile_df.head()

| | profile |
|---|---|
| 0 | https://www.linkedin.com/in/matthew-shane-van-... |
| 1 | https://www.linkedin.com/in/dewaalnienaber |
| 2 | https://www.linkedin.com/in/sthembiso-makofane... |
| 3 | https://www.linkedin.com/in/lindokuhle-tshuma-... |
| 4 | https://www.linkedin.com/in/gladstone-ndhlovu-... |


In [137]:
if len(profile_df) != 0:
    profile_df.to_csv('profiles.csv', index=False)
    print("Saved!")

Saved!


In the following code cell we load the URLs back from the `.csv` file into a NumPy array.

In [139]:
urls = pd.read_csv('profiles.csv').profile.values
urls[:2]

array(['https://www.linkedin.com/in/matthew-shane-van-den-berg-1b90ba143',
       'https://www.linkedin.com/in/dewaalnienaber'], dtype=object)

Now we can use these profile URLs to scrape the information that we are interested in from each profile.

In [33]:
driver.find_element(By.CLASS_NAME, "pvs-navigation__text").click()

In [32]:
experience_soup =  BeautifulSoup(driver.page_source, 'html.parser')

items = experience_soup.find_all('li', {'class': 'pvs-list__paged-list-item'})
items[0].find('span', {'class':'t-14 t-normal'}).text.split('·')[0].strip()
# items[0].find('div>span', {'class':'visually-hidden'}).text.split('·')[0].strip()

IndexError: list index out of range

In [27]:
def get_company(soup):
    return soup.find('span', {'aria-hidden': 'true'}).get_text(strip=True)

def get_roles(soup):
    roles = []
    role_elements = soup.find_all('li', class_='pvs-list__paged-list-item pvs-list__item--one-column')
    
    for role in role_elements:
        role_name = role.find('span', {'aria-hidden': 'true'}).get_text(strip=True)
        date_range = role.find_all('span', {'aria-hidden': 'true'})[1].get_text(strip=True).split('·')[0]
        from_date, to_date = date_range.split(' - ')
        roles.append({'name': role_name, 'from': from_date, 'to': to_date})
    return roles

get_roles(items[3])

[]
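
In `get_roles`, the line `from_date, to_date = date_range.split(' - ')` raises a `ValueError` whenever the text contains no ` - ` separator, for example when a role shows a single date. A more tolerant sketch of that parsing step is given below. The helper name `parse_date_range` and the sample strings are assumptions about how the date text looks, based on the `'·'` and `' - '` splits used above:

```python
def parse_date_range(text: str) -> dict:
    """Split text such as 'Nov 2022 - Present · 1 yr 3 mos' into start and end dates."""
    # Keep only the part before the duration, then split once on the dash.
    date_part = text.split("·")[0].strip()
    start, sep, end = date_part.partition(" - ")
    # When there is no separator, only the start date is known.
    return {"from": start.strip(), "to": end.strip() if sep else None}


print(parse_date_range("Nov 2022 - Present · 1 yr 3 mos"))
print(parse_date_range("2019"))
```

Returning `None` for a missing end date leaves the decision about how to treat such roles to the code that builds the final record.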

In [30]:
items[1]

<li class="pvs-list__paged-list-item pvs-list__item--one-column" id="profilePagedListComponent-ACoAAAvbmn8BgkeRRgd4-urnYm8lHXSlpr-LujQ-EXPERIENCE-VIEW-DETAILS-profilePositionGroup-ACoAAAvbmn8BgkeRRgd4-urnYm8lHXSlpr-LujQ-c218197500c6500b6fb587a99cfe182962496e58-NONE-en-US-0">
<div>
<span class="XXPUrJqMIZMEhzFhbPwQJlRaYcIINdcNLqOXYYc"></span>
<div class="BMQQbQbgIDIXbuCJMEYkpUriexATDoZLBuVSA ampufUOsTbANTpJpxILrGEjGvFMwYs" data-view-name="profile-component-entity">
<div>
<div aria-hidden="true" class="display-flex" tabindex="-1">
<!-- -->
</div>
</div>
<div class="display-flex flex-column full-width align-self-center">
<div class="display-flex flex-row justify-space-between">
<a class="optional-action-target-wrapper display-flex flex-column full-width" href="https://www.linkedin.com/company/614583/" target="_self">
<div class="display-flex flex-wrap align-items-center full-height">
<div class="display-flex">
<div class="display-flex full-width">
<div class="display-flex align-items-cent

In [None]:
{
    'company': "FNB South Africa",
    'employment_type': "Full-time",
    "roles": [
        {'name':'Data Engineer' , 'from': 'Nov 2022', 'to': 'Present'}
    ]

}
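
The dictionary above appears to be the target shape for one scraped experience entry: a company with a nested list of roles. To store many such entries in a table, one option is to flatten each record into one row per role. The sketch below does this with plain dictionaries; the helper name `flatten_experience` and the output column names are hypothetical:

```python
def flatten_experience(record: dict) -> list:
    """Turn one nested experience record into a list of flat rows, one per role."""
    rows = []
    for role in record.get("roles", []):
        rows.append({
            "company": record.get("company"),
            "employment_type": record.get("employment_type"),
            "role": role.get("name"),
            "from": role.get("from"),
            "to": role.get("to"),
        })
    return rows


# The example record from the cell above.
record = {
    "company": "FNB South Africa",
    "employment_type": "Full-time",
    "roles": [
        {"name": "Data Engineer", "from": "Nov 2022", "to": "Present"},
    ],
}
rows = flatten_experience(record)
print(rows)
```

A list of such rows can be passed directly to `pd.DataFrame(rows)` and saved with `to_csv`, in the same way as the profile URLs.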

In [39]:
profile_urls = [url.split('?')[0] for url in profile_urls]

In [59]:
driver.get(f"{profile_urls[0]}/details/skills/")

In [60]:
def get_skills():
    skills_soup = BeautifulSoup(driver.page_source, 'html.parser')
    skills = []
    # Search the freshly parsed skills page for the active tab panel.
    div = skills_soup.find('div', {'class': 'artdeco-tabpanel active ember-view'})
    print(div.prettify())
    for skill in div.find_all('li', {'class': 'pvs-list__paged-list-item'}):
        print(skill)
        print(skill.prettify()); break

get_skills()

<div aria-labelledby="ember206" class="artdeco-tabpanel active ember-view" id="ember211" role="tabpanel" tabindex="0">
 <div class="sjoTiQXsIrHRaOZVCcNpsUnpmNfFurXkMeiWaq">
  <!-- -->
  <ul class="xiDmfLWqcHAcVUVmKmTdTsLRIMszMNQbKJXE ph5 display-flex flex-row flex-wrap">
   <li class="artdeco-list__item aBSSLnlilONmmhQJzhalIKzDXmbhVZFAcr pvs-list__item--two-column">
    <!-- -->
    <div class="BMQQbQbgIDIXbuCJMEYkpUriexATDoZLBuVSA GLFwElfnlDhiKmeTGjzYNFKspIMtjYgQtk KluEZLkYInIgxIdbcSXvRCIGRczabULoAw" data-view-name="profile-component-entity">
     <div>
      <a class="optional-action-target-wrapper display-flex" data-field="active_tab_influencers_interests" href="https://www.linkedin.com/in/lizryan" target="_self">
       <div class="ivm-image-view-model pvs-entity__image">
        <div class="ivm-view-attr__img-wrapper">
         <!-- -->
         <!-- -->
         <img alt="Liz Ryan profile picture" class="ivm-view-attr__img--centered EntityPhoto-circle-3 evi-image lazy-image ember