# LinkedIn Post Scraping

Because LinkedIn changes its page layout frequently, much of the code in online tutorials no longer works. LinkedIn's anti-scraping measures are also strict, and I could not find a usable interface for scraping. Here are the approaches I tried and where each one got stuck:

1. I started from this [code](https://github.com/christophe-garon/Linkedin-Post-Scraper), but it raised errors during the crawl. The main problem: after logging in, the browser showed the home page, and the post search page set in the script never loaded.

2. I tried [this](https://stevesie.com/apps/linkedin-apidownload), which I found on YouTube, and downloaded the page data in HAR format, but it could not be parsed.
I also installed a Chrome Web Scraper [plug-in](https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn?hl=en), but it could not loop over the posts automatically. I had to select the text areas manually, one by one.

3. I tried [Octoparse](https://www.octoparse.com/tutorial-7/scrape-post-from-linkedin), but the simulated login session failed: I had to enter a cell phone verification code, and it reported "cell phone number was incorrect".

I finally adopted the first method, fixed the code, and successfully crawled the real-time posts from the last several hours. I used the following tools:

* Selenium: This tool works in conjunction with ChromeDriver to perform actions such as clicking links and scrolling. It's rather cool to watch the program run because it looks as though someone else is in control of the screen.
* ChromeDriver: This tool is the middleman between Selenium and Google Chrome; it lets Selenium drive the browser.
* Beautiful Soup: This is a Python package that lets us find and access the LinkedIn elements we want to collect. It searches the page's source code for all of the tags we tell it to find.

We originally planned to crawl three months of historical posts published between 2021-11-15 and 2022-02-15. This failed because we could only get real-time data: 276 posts and 2231 words in total.

However, this text data can still help us understand which themes appear when users mention "meta", what noise to expect, and whether the posts show a sentiment tendency.


In [None]:
# required installs (i.e. pip3 install in terminal): pandas, numpy, selenium, bs4, pyyaml, xlsxwriter
import math
import sys
import time
import traceback

import pandas as pd
import numpy as np
import yaml
from bs4 import BeautifulSoup as bs
from bs4 import Tag
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

In [None]:
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver


In [None]:
# the rest of the notebook refers to the driver as `browser`
browser = webdriver.Chrome('/content/chromedriver.exe')

## Helper Functions

In [None]:
# The LinkedIn username and password are stored in credentials.yaml;
# read them here so the crawler can log in automatically
def get_infos():
    infos_dict = {}
    try:
        with open("./credentials.yaml", encoding="utf-8") as f:
            infos_dict = yaml.safe_load(f)
        # infos_dict is not printed here because it contains the password
    except Exception as ex:
        print(ex)
        sys.exit()
    return (
        infos_dict.get("username"),
        infos_dict.get("password"),
        infos_dict.get("filter_settings"),
    )
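
Taken together, `get_infos()` and `set_filters()` imply a particular layout for `credentials.yaml` once it is parsed. The sketch below shows that structure as a plain Python dict, with a small validity check. Every value in it (the account, the filter name `example-filter`, the labels) is a placeholder rather than a setting used for this crawl, and `check_settings` is a hypothetical helper that is not part of the notebook.

```python
# Hypothetical example of what yaml.safe_load() is expected to return.
# All values are placeholders, not the settings used for this crawl.
example_settings = {
    "username": "user@example.com",
    "password": "not-a-real-password",
    "filter_settings": {
        "base": "Posts",  # text of the primary filter button to click (placeholder)
        "options": [
            # type 0: "value" is an index into the filter's option list
            {"name": "example-filter", "type": 0, "value": 1},
            # type 1: "value" is the visible label of the option
            {"name": "example-filter", "type": 1, "value": "Example label"},
        ],
    },
}


def check_settings(settings):
    """Return a list of problems found in a parsed credentials dict."""
    problems = []
    for key in ("username", "password", "filter_settings"):
        if not settings.get(key):
            problems.append(f"missing key: {key}")
    filters = settings.get("filter_settings") or {}
    for i, opt in enumerate(filters.get("options") or []):
        if "name" not in opt or "value" not in opt:
            problems.append(f"option {i} needs 'name' and 'value'")
        elif int(opt.get("type", 0)) == 0 and not isinstance(opt["value"], int):
            problems.append(f"option {i}: type 0 expects an integer index")
    return problems
```

Running a check like this before opening the browser would surface a malformed file early, rather than partway through `set_filters()`.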

In [None]:
# log in
def login(browser, url, username, password):
    print("*" * 50 + " sign on " + "*" * 50)
    try:
        # Open login page
        browser.get(url)

        # Enter login info:
        elementID = browser.find_element(by=By.ID, value="username")
        elementID.send_keys(username)

        elementID = browser.find_element(by=By.ID, value="password")
        elementID.send_keys(password)
        # username and password come from credentials.yaml (see get_infos)
        elementID.submit()
    except Exception:  # e.g. the login form is missing because we are already signed in
        pass

In [None]:
# verification
def verification(browser):
    try:
        code = input("Enter the Verification Code: ")
        vcode_input = browser.find_element(By.ID, "input__email_verification_pin")
        vcode_submit = browser.find_element(By.ID, "email-pin-submit-button")
        vcode_input.send_keys(code)
        vcode_submit.submit()
    except:
        pass

In [None]:
# find the keyword
def find(browser, target_text, max_times=10):
    print("*" * 50 + " start searching " + "*" * 50)
    # Go to the home page and start searching:
    times = 0
    while True:
        try:
            search_element_root = None
            while search_element_root is None:
                search_element_root = browser.find_element(
                    by=By.ID, value="global-nav-typeahead"
                )
                time.sleep(1.5)

            search_element = search_element_root.find_elements(By.TAG_NAME, "input")[0]
            search_element.send_keys(target_text)
            ActionChains(browser).key_down(Keys.ENTER).send_keys_to_element(
                search_element, ""
            ).perform()
            break
        except:
            times += 1
            time.sleep(1.5)
            if times > max_times:
                print("There seems to be a problem loading the page")
                sys.exit()
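
The helpers in this notebook all repeat the same pattern: try an action, sleep, and give up after a fixed number of failures. The sketch below factors that pattern into one function. `retry` is a hypothetical helper, not something the notebook defines, and its `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def retry(action, max_times=10, delay=1.5, sleep=time.sleep):
    """Call action() until it succeeds or max_times attempts have failed.

    Returns action()'s result. If every attempt fails, raises RuntimeError
    chained to the last error.
    """
    last_error = None
    for _ in range(max_times):
        try:
            return action()
        except Exception as ex:  # page elements may be missing while the page loads
            last_error = ex
            sleep(delay)
    raise RuntimeError(f"gave up after {max_times} attempts") from last_error
```

With this in place, a lookup such as the search box could be written as `retry(lambda: browser.find_element(By.ID, "global-nav-typeahead"))`. Raising an exception rather than calling `sys.exit()` leaves the caller free to decide how to recover.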

In [None]:
# set the filters
def set_filters(browser, filters_settings, max_times=20):
    """Matching type of filter: 0: by index, 1: by name"""

    print("*" * 50 + " Set filter conditions " + "*" * 50)
    filter_base_name = filters_settings["base"]
    filters = filters_settings["options"]
    times = 0
    while True:
        try:
            if filter_base_name == "" or filter_base_name is None:
                break
            # find_element_by_* was removed in Selenium 4.3; use find_element(By...)
            container = browser.find_element(
                By.ID, "search-reusables__filters-bar"
            ).find_element(By.TAG_NAME, "ul")
            base_options = container.find_elements(
                By.CLASS_NAME, "search-reusables__primary-filter"
            )
            for base_option in base_options:
                option = base_option.find_element(By.TAG_NAME, "button")
                if option.get_property("innerText") == filter_base_name:
                    ActionChains(browser).click(option).perform()
                    break
            break
        except Exception as ex:
            times += 1
            traceback.print_exc()
            time.sleep(1.5)
            if times > max_times:
                sys.exit()

    if filters is None or len(filters) == 0:
        return
    for filter in filters:
        times = 0
        page_show = False
        while True:
            try:
                container = browser.find_element(
                    By.ID, "search-reusables__filters-bar"
                ).find_element(By.TAG_NAME, "ul")
                filter_element = container.find_element(
                    By.ID, "hoverable-outlet-{}-filter-value".format(filter["name"])
                )
                root = filter_element.find_element(By.XPATH, "..")
                active_button = root.get_property("children")[1].get_property(
                    "children"
                )[0]
                if not page_show:
                    ActionChains(browser).click(active_button).perform()
                    page_show = True
                options = filter_element.find_elements(By.TAG_NAME, "li")
                option = None
                if int(filter.get("type", 0)) == 0:
                    option = options[filter["value"]]
                else:
                    for op in options:
                        name_span = op.find_element(By.TAG_NAME, "span")
                        if name_span.get_property("innerText") == filter["value"]:
                            option = op
                            break
                submit_btn = (
                    filter_element.find_element(By.TAG_NAME, "fieldset")
                    .get_property("children")[-1]
                    .get_property("children")[-1]
                )
                if option is not None:
                    # selector = option.find_element(By.TAG_NAME, "input")
                    # ActionChains(browser).click(selector).perform()
                    label = option.find_element(By.TAG_NAME, "label")
                    ActionChains(browser).click(label).perform()
                    ActionChains(browser).click(
                        submit_btn
                    ).perform()  # submit_btn.click()
                break
            except:
                times += 1
                traceback.print_exc()
                time.sleep(2)
                if times > max_times:
                    break

In [None]:
# set trends filters
def set_trends_filters(browser):
    filter_trends = None
    times = 0
    while filter_trends is None:
        try:
            filter_trends = browser.find_element(
                by=By.CLASS_NAME, value="search-reusables__primary-filter"
            )
            filter_trends.click()
        except:
            time.sleep(1.5)
            times += 1
            if times > 10:
                sys.exit()

In [None]:
# load all results
def load_all_result(browser):
    print("*" * 50 + " Load all data " + "*" * 50)
    times = 0
    while True:
        try:
            containers = browser.find_element(
                By.CLASS_NAME, "scaffold-finite-scroll__content"
            )
            infos_element = (
                browser.find_element(By.ID, "main")
                .find_element(By.CLASS_NAME, "search-marvel-srp")
                .find_element(By.TAG_NAME, "h1")
            )
            text = infos_element.text
            nums = int(text.split(".")[2].split("total")[-1].split(" ")[1])
            pages = int(math.ceil(nums / 10))
            while pages + 1 != containers.get_property("childElementCount"):
                print(
                    f"pages: {pages}, count: {containers.get_property('childElementCount')}"
                )
                browser.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);"
                )
                time.sleep(3)
            browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            break
        except Exception as ex:
            time.sleep(1)
            times += 1
            print(ex)
            if times > 10:
                sys.exit()

In [None]:
# get target contents for test
def get_target_contents1(browser):
    element_roots = []
    for i in range(10):
        element_roots = browser.find_elements(
            By.CLASS_NAME, "scaffold-finite-scroll__content"
        )
        if len(element_roots) != 0:
            break
        time.sleep(2)

    for element_root in element_roots:
        # Compound class names need a CSS selector; By.CLASS_NAME takes a single class
        username = element_root.find_element(
            By.CSS_SELECTOR,
            ".feed-shared-actor__name.t-14.t-bold.hoverable-link-text.t-black",
        )
        # .find_element(By.TAG_NAME, "span").get_property("innerText")
        date = element_root.find_element(
            By.CSS_SELECTOR,
            ".feed-shared-actor__sub-description.t-12.t-normal.t-black--light",
        )  # .find_element(By.CLASS_NAME, "visually-hidden").get_property("innerText")
        text = element_root.find_element(By.CLASS_NAME, "break-words").get_property(
            "children"
        )[0]
        # .get_property("children")[0].get_property("innerText")

In [None]:
def get_target_contents(browser):
    print("*" * 50 + " Crawling content " + "*" * 50)
    company_page = browser.page_source
    linkedin_soup = bs(company_page.encode("utf-8"), "html.parser")
    linkedin_soup.prettify()
    data_roots = []
    for i in range(10):
        try:
            container = linkedin_soup.find(
                "div", {"class": "scaffold-finite-scroll__content"}
            )
            data_roots = container.find_all(
                "div", {"class": "ph0 pv0 search-results__no-cluster-container mb2"}
            )
            if len(data_roots) != 0:
                break
        except:
            traceback.print_exc()
            time.sleep(2)

    post_user = []
    post_dates = []
    post_texts = []
    for data_root in data_roots:
        roots = data_root.children
        if roots is None:
            continue
        for element_root in roots:
            try:
                if not isinstance(element_root, Tag):
                    continue
                infos_root = element_root.find(
                    "div", {"class": "feed-shared-actor__meta relative"}
                )
                if infos_root is None:
                    # Not a post card: log its text and skip it
                    with open("log.txt", "a", encoding="utf-8") as f:
                        f.write(" ".join(element_root.strings))
                        f.write("\n")
                    continue
                username = infos_root.find("span", {"dir": "ltr"}).text
                date = infos_root.find("span", {"class": "visually-hidden"}).text
                text = (
                    element_root.find(
                        "div",
                        {
                            "class": "feed-shared-text relative feed-shared-update-v2__commentary"
                        },
                    )
                    .find("span", {"dir": "ltr"})
                    .text
                )
                post_user.append(username)
                post_dates.append(date.strip())
                post_texts.append(text)
            except Exception as ex:
                traceback.print_exc()

    data = {"User Name": post_user, "Date Posted": post_dates, "Post Text": post_texts}
    return data

In [None]:
def write_to_file(datas, file_name):
    print("*" * 50 + " write to file " + "*" * 50)
    df = pd.DataFrame(datas)
    now_time = time.strftime("%Y-%m-%d_%H-%M-%S", time.localtime(time.time()))
    try:
        df.to_csv(
            "{}_{}_posts.csv".format(file_name, now_time), encoding="utf-8", index=False
        )
    except Exception as e:
        print(e)

    # ExcelWriter.save() was removed in pandas 2.0; the context manager saves on exit
    with pd.ExcelWriter(
        path="{}_{}_posts.xlsx".format(file_name, now_time), engine="xlsxwriter"
    ) as writer:
        df.to_excel(writer, index=False)

In [None]:
# search term, LinkedIn URLs, and login settings
target_text = "meta"
url = "https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin"
base_url = "https://www.linkedin.com/feed/"
username, password, filter_settings = get_infos()

## Perform Web Scraping

### login

In [None]:
login(browser, url, username, password)  # sign in
while not browser.current_url.startswith(base_url):
    time.sleep(2)
find(browser, target_text)  # find target text
# set_trends_filters(browser)  # only get content

### filter

In [None]:
set_filters(browser, filter_settings)

### scroll to get all pages

In [None]:
containers = browser.find_element(By.CLASS_NAME, "scaffold-finite-scroll__content")

In [None]:
infos_element = (
    browser.find_element(By.ID, "main")
    .find_element(By.CLASS_NAME, "search-marvel-srp")
    .find_element(By.TAG_NAME, "h1")
)

In [None]:
text = infos_element.text
text

In [None]:
if "Search results page" in text:
    nums = int(text.split("(共")[1].split(" ")[1])
else:
    nums = int(text.split(".")[2].split("total")[-1].split(" ")[1])
print(f"number: {nums}")
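
The chained `split` calls above depend on the exact wording of the results header, so they break whenever LinkedIn rephrases it or serves another locale. A regular expression is a more tolerant alternative. The sketch below assumes the result count is the first number in the header text, which has not been verified against the live page; `parse_result_count` is a hypothetical helper.

```python
import re


def parse_result_count(header_text):
    """Return the first integer found in header_text, or None if there is none.

    Thousands separators such as "1,234" are accepted.
    """
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))
```

Returning `None` instead of raising lets the caller fall back to a fixed page count when the header cannot be parsed.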

In [None]:
# Number of currently loaded pages + 1
containers.get_property("childElementCount")

In [None]:
# pages = int(math.ceil(nums / 10))
pages = 9
while pages + 1 != containers.get_property("childElementCount"):
    print(
        f"pages: {pages}, loaded pages: {int(containers.get_property('childElementCount')) - 1}"
    )
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
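
The loop above runs until the child count equals `pages + 1` exactly, so it never ends if LinkedIn stops loading results early or loads more than expected. A bounded variant is sketched below. `scroll_until_loaded` is a hypothetical helper that takes the count and scroll actions as callables, so it can be tried without a browser, and the default limit of 50 scrolls is an arbitrary assumption.

```python
import time


def scroll_until_loaded(get_count, scroll, target, max_scrolls=50, pause=3.0,
                        sleep=time.sleep):
    """Scroll until get_count() reaches target or max_scrolls is used up.

    Returns True if the target count was reached, False otherwise.
    """
    for _ in range(max_scrolls):
        if get_count() >= target:
            return True
        scroll()
        sleep(pause)  # give the page time to load the next batch
    return get_count() >= target
```

In this notebook, `get_count` would wrap `containers.get_property("childElementCount")`, `scroll` would wrap the `browser.execute_script(...)` call, and `target` would be `pages + 1`.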

### Crawl target content

In [None]:
datas = get_target_contents(browser)

In [None]:
# show crawled data
df = pd.DataFrame(datas)
df

### write to file

In [None]:
file_name = "res"
write_to_file(datas, file_name)

### quit browser

In [None]:
browser.quit()