# Web Crawler for Beginners

## README
**This is a simple web crawler script written in Python for beginners. The script allows you to crawl job posting webpages on LinkedIn, extract the responsibilities section, and print the job posting number, company name, and responsibilities.**

### Prerequisites
To run this script, make sure you have the following installed:

- Python 3: download it from the official website: https://www.python.org/downloads/

Additionally, install the following Python libraries:

- requests: pip install requests
- beautifulsoup4: pip install beautifulsoup4

### How to Use
Clone the repository or download the script to your local machine.

Open the script file (web_crawler.py) in a text editor or Python IDE of your choice.

Modify the job_posting_numbers list with the job posting numbers you want to crawl. You can replace the example job posting numbers with your desired ones.

Save the changes.

Open a command-line interface or terminal and navigate to the directory where the script is saved.

Run the script by executing the following command:

python web_crawler.py

The script will start crawling the job posting webpages on LinkedIn and display the job posting number, company name, and responsibilities (if found) for each job posting.

### Notes
The script uses the requests library to send HTTP GET requests to the webpages and retrieve the HTML content.

The HTML content is then parsed using the beautifulsoup4 library to extract relevant information.

The script searches for the "Responsibilities" section in the HTML using various keywords commonly used for responsibilities sections.

If the "Responsibilities" section is found, the script extracts the responsibilities from the corresponding 'ul' tag.

The company name is extracted from the 'title' tag of the webpage.

The extracted information is then printed to the console.

In case the "Responsibilities" section is not found for a job posting, a message will be displayed indicating the absence of the section.
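The steps described in these notes can be sketched on a small, made-up HTML snippet instead of a live LinkedIn page. This is only an illustration under assumptions: `SAMPLE_HTML`, `extract_responsibilities`, and the `headings` default are hypothetical names, and real pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure the script looks for.
SAMPLE_HTML = """
<html>
  <head><title>ExampleCo hiring Data Scientist | LinkedIn</title></head>
  <body>
    <p><strong>Responsibilities</strong></p>
    <ul>
      <li> Build data pipelines </li>
      <li>Write unit tests</li>
    </ul>
  </body>
</html>
"""

def extract_responsibilities(html, headings=("Responsibilities", "Duties")):
    """Return the <li> texts of the first <ul> after a matching <strong> heading."""
    soup = BeautifulSoup(html, "html.parser")
    # Exact match of the <strong> text against any of the given headings
    heading = soup.find("strong", string=list(headings))
    if heading is None:
        return []
    ul_tag = heading.find_next("ul")
    if ul_tag is None:
        return []
    return [li.get_text(strip=True) for li in ul_tag.find_all("li")]

print(extract_responsibilities(SAMPLE_HTML))
print(extract_responsibilities("<p>No such section here</p>"))
```

Returning an empty list when either the heading or the list is missing lets the caller decide how to report an absent section.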

### Disclaimer
This script is intended for educational purposes only. Use it responsibly and respect the website's terms of service. LinkedIn's terms of service may prohibit scraping or crawling their website, so ensure you are authorized to perform these actions.
    
### License
This script is released under the MIT License. Feel free to modify and distribute it according to the terms of the license.

## 1. Get webpage using *requests*

In [36]:
import requests

req = requests.get('https://en.wikipedia.org/wiki/Web_crawler')

In [37]:
req

<Response [200]>

In [38]:
webpage = req.text

In [39]:
with open("filename.html", "w", encoding="utf-8") as f:
    f.write(webpage)
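The request above assumes the server answers quickly and successfully. A slightly more defensive variant might look like the sketch below. `fetch_html` is a hypothetical helper, and the 10-second timeout is an arbitrary choice.

```python
import requests

def fetch_html(url, session=None, timeout=10):
    """Return the page text, raising an exception on HTTP errors or timeouts."""
    if session is None:
        session = requests.Session()
    response = session.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of
    # silently returning an error page
    response.raise_for_status()
    return response.text

# Example use (performs a real network request):
# webpage = fetch_html("https://en.wikipedia.org/wiki/Web_crawler")
```

Passing in a session is optional. It mainly helps when many pages are fetched from the same site, because connections can be reused.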

In [40]:
print(webpage)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Web crawler - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )enwikimwclientprefs=(

## 2. Get specific contents using BeautifulSoup

In [41]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')

### 2.1 Prettify the webpage

In [42]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Web crawler - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^|; )en

### 2.2 Get the first paragraph

Try removing the "attrs" argument to see how the result changes.

In [17]:
paragraph = soup.find_all('p')

In [18]:
paragraph

[<p class="mw-empty-elt">
 </p>,
 <p><b>Data science</b> is an <a class="mw-redirect" href="/wiki/Interdisciplinary" title="Interdisciplinary">interdisciplinary</a> academic field <sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup> that uses <a href="/wiki/Statistics" title="Statistics">statistics</a>, <a class="mw-redirect" href="/wiki/Scientific_computing" title="Scientific computing">scientific computing</a>, <a href="/wiki/Scientific_method" title="Scientific method">scientific methods</a>, processes, <a href="/wiki/Algorithm" title="Algorithm">algorithms</a> and systems to extract or extrapolate <a href="/wiki/Knowledge" title="Knowledge">knowledge</a> and insights from noisy, structured, and <a href="/wiki/Unstructured_data" title="Unstructured data">unstructured data</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup>
 </p>,
 <p>Data science also integrates domain knowledge from the underlying application domain (e.g., natural 

In [43]:
# Find the first <p> without a class attribute (this skips the empty placeholder paragraph)
paragraph = soup.find('p', attrs={"class":False})

In [44]:
paragraph

<p>A <b>Web crawler</b>, sometimes called a <b>spider</b> or <b>spiderbot</b> and often shortened to <b>crawler</b>, is an <a href="/wiki/Internet_bot" title="Internet bot">Internet bot</a> that systematically browses the <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a> and that is typically operated by search engines for the purpose of <a href="/wiki/Web_indexing" title="Web indexing">Web indexing</a> (<i>web spidering</i>).<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup>
</p>

### 2.3 Get all the links in this paragraph which point to other webpages

In [45]:
paragraph.find_all('a')

[<a href="/wiki/Internet_bot" title="Internet bot">Internet bot</a>,
 <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>,
 <a href="/wiki/Web_indexing" title="Web indexing">Web indexing</a>,
 <a href="#cite_note-1">[1]</a>]

In [46]:
# Keep only the links to other pages: they all have a title attribute
paragraph.find_all('a', attrs={"title":True})

[<a href="/wiki/Internet_bot" title="Internet bot">Internet bot</a>,
 <a href="/wiki/World_Wide_Web" title="World Wide Web">World Wide Web</a>,
 <a href="/wiki/Web_indexing" title="Web indexing">Web indexing</a>]

In [47]:
data = {"title":[], "href":[]}
for link in paragraph.find_all('a', attrs={"title":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])

In [48]:
import pandas as pd
df = pd.DataFrame(data)

In [49]:
df

|   | title | href |
|---|-------|------|
| 0 | Internet bot | /wiki/Internet_bot |
| 1 | World Wide Web | /wiki/World_Wide_Web |
| 2 | Web indexing | /wiki/Web_indexing |


## 3. Get the contents from all the webpages

In [50]:
webpages = []
head = "https://en.wikipedia.org"
for href in data["href"]:
    link = head + href
    req = requests.get(link)
    webpage = req.text
    webpages.append(webpage)
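Concatenating `head + href` works here because every collected href starts with `/wiki/`. As a sketch of a more general approach, the standard library's `urljoin` resolves relative links against the page URL. `absolute_links` is a hypothetical helper name, and the third sample href is made up.

```python
from urllib.parse import urljoin

BASE_URL = "https://en.wikipedia.org/wiki/Web_crawler"

def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping in-page anchors such as '#cite_note-1'."""
    return [urljoin(base_url, href) for href in hrefs if not href.startswith("#")]

links = absolute_links(
    BASE_URL,
    ["/wiki/Internet_bot", "#cite_note-1", "https://example.org/page"],
)
print(links)
```

Links that are already absolute pass through unchanged, so the same helper can handle a mix of internal and external links.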

## 4. Further reading

### 4.1 robots.txt

Check the website's robots.txt to find out what crawlers are allowed to do.

In [51]:
req = requests.get("https://en.wikipedia.org/robots.txt")
webpage = req.text

In [52]:
# Note: do not crawl websites intensively or too quickly
# robots.txt is plain text, so it can be printed without HTML parsing
print(webpage)

# robots.txt for http://www.wikipedia.org/ and friends
#
# Please note: There are a lot of pages on this site, and there are
# some misbehaved spiders out there that go _way_ too fast. If you're
# irresponsible, your access to the site may be blocked.
#

# Observed spamming large amounts of https://en.wikipedia.org/?curid=NNNNNN
# and ignoring 429 ratelimit responses, claims to respect robots:
# http://mj12bot.com/
User-agent: MJ12bot
Disallow: /

# advertising-related bots:
User-agent: Mediapartners-Google*
Disallow: /

# Wikipedia work bots:
User-agent: IsraBot
Disallow:

User-agent: Orthogaffe
Disallow:

# Crawlers that are kind enough to obey, but which we'd rather not have
# unless they're feeding search engines.
User-agent: UbiCrawler
Disallow: /

User-agent: DOC
Disallow: /

User-agent: Zao
Disallow: /

# Some bots are known to be trouble, particularly those designed to copy
# entire sites. Please obey robots.txt.
User-agent: sitecheck.internetseer.com
Disallow: /

User-agent: 
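Instead of reading the rules by eye, Python's standard library can evaluate them. The sketch below parses a short sample offline. The `MJ12bot` rule is taken from the output above, while the `*` rule, the `MyCrawler` user agent, and the `build_parser` helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules: the first entry appears in the output above,
# the second one is hypothetical.
ROBOTS_TXT = """\
User-agent: MJ12bot
Disallow: /

User-agent: *
Disallow: /w/
"""

def build_parser(robots_text):
    """Create a RobotFileParser from robots.txt text without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
page = "https://en.wikipedia.org/wiki/Web_crawler"
print(parser.can_fetch("MJ12bot", page))    # this bot is disallowed everywhere
print(parser.can_fetch("MyCrawler", page))  # falls back to the '*' rules
```

To check a live site, `RobotFileParser.set_url()` followed by `read()` downloads and parses the file in one step.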

### 4.2 Sleep

You may be banned if you scrape a website too fast. Let your crawler sleep for a while after each request.

In [53]:
import time

for i in range(5):
    time.sleep(3)
    print(i)

0
1
2
3
4


### 4.3 Randomness

Pausing for exactly three seconds after each round is too robotic. Let's add some randomness to make your crawler look more like a human.

In [54]:
from random import random

for i in range(5):
    t = 1 + 2 * random()
    time.sleep(t)
    print(i)

0
1
2
3
4
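The fixed and random pauses can be wrapped in one small helper so that every request loop uses the same delay policy. This is a minimal sketch. `polite_pause` and its default bounds are assumptions, and the `sleep` parameter exists only so the helper can be tried without actually waiting.

```python
import random
import time

def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return that duration."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Short bounds here so the demonstration finishes quickly
for i in range(3):
    waited = polite_pause(0.01, 0.03)
    print(i, round(waited, 3))
```

In a crawler, the call would go right after each `requests.get(...)` inside the loop.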


### 4.4 Separate the scraping code from the data-extraction code

1. Scraping is the more fragile step. Nothing is more annoying than a crawler that breaks because of a bug in the data-extraction code.
2. You never know which data you will need for modeling, so keep all the webpages you download.
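One way to follow both points is to write each downloaded page to disk first and run the extraction later from the saved files. The sketch below uses only the standard library. The folder name `raw_pages` and the helpers `save_page` and `load_pages` are hypothetical.

```python
from pathlib import Path

def save_page(html, name, folder="raw_pages"):
    """Write one downloaded page to <folder>/<name>.html and return its path."""
    folder_path = Path(folder)
    folder_path.mkdir(parents=True, exist_ok=True)
    path = folder_path / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path

def load_pages(folder="raw_pages"):
    """Read every saved page back as a {name: html} dictionary."""
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.html"))
    }

# Scraping step (run once):      save_page(webpage, "Web_crawler")
# Extraction step (run anytime): pages = load_pages()
```

If the extraction code later turns out to have a bug, it can be fixed and rerun on the saved files without sending any new requests.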

### 4.5 Chrome Driver and Selenium

These tools make your crawler act even more like a human.

In [55]:
pip install selenium




# Download ChromeDriver: https://chromedriver.chromium.org/
# (Selenium 4.6 and later can usually locate a matching driver on its own.)

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Looks for a suitable chromedriver automatically

driver.get('http://www.google.com/')

time.sleep(5)  # Let the user actually see something!

search_box = driver.find_element(By.NAME, 'q')  # Selenium 4 style: locate the search box by its name attribute

search_box.send_keys('ChromeDriver')

search_box.submit()

time.sleep(5)  # Let the user actually see something!

driver.quit()  # Close the browser when done


# Example: using BeautifulSoup and requests to crawl the job description from a LinkedIn job post

In [87]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the webpage
url = "https://www.linkedin.com/jobs/view/3630122749"
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find the "Responsibilities" section
responsibilities_section = soup.find("strong", string="Responsibilities")

if responsibilities_section:
    # Find the <ul> tag that follows the section heading
    ul_tag = responsibilities_section.find_next("ul")

    # Extract the responsibilities as a list (empty if no <ul> follows)
    responsibilities = [li.text.strip() for li in ul_tag.find_all("li")] if ul_tag else []

    # Print the responsibilities
    for responsibility in responsibilities:
        print(responsibility)
else:
    print("Responsibilities section not found.")

Work on a variety of data-intensive problems using diverse data sets – including combinations of images, tabular data, time series and more specialized remote-sensing datasets like SAR and/or stereoimagery
Understand the fundamentals of oil and gas production and trading
Build and maintain data engineering pipelines, SQL databases, machine learning models, unit tests, code documentation and GCP-hosted client-facing services and API endpoints
Communicate machine learning model results to team members and clients in clear, concise and quantitative terms
Work independently or in a team, be ready to share ideas, collaborate and learn.


# Example: using BeautifulSoup and requests to crawl key job information from the LinkedIn job posts listed on one search page

In [121]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the webpage
url = "https://www.linkedin.com/jobs/search/?keywords=data%20scientist&location=United%20States&refresh=true"
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find all jobPosting numbers
job_postings = soup.find_all(attrs={"data-entity-urn": True})

# Extract the jobPosting numbers and store in a list
job_posting_numbers = [posting["data-entity-urn"].split(":")[-1] for posting in job_postings]

# Print the jobPosting numbers
print(job_posting_numbers)


['3637369960', '3639003584', '3635038298', '3628246598', '3636826550', '3626165537', '3624206900', '3635543605', '3636908707', '3634683031', '3638688017', '3637550459', '3632547910', '3627030054', '3628584634', '3633627920', '3630388478', '3624206545', '3625359500', '3613562559', '3617085448', '3627200228', '3638650918', '3637313634', '3632749941']
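The list comprehension above keeps whatever follows the last colon of each `data-entity-urn` value. A slightly stricter variant is sketched below. It assumes the values look like `urn:li:jobPosting:<digits>`, which the source does not show directly, and `job_ids_from_urns` is a hypothetical helper.

```python
def job_ids_from_urns(urns):
    """Return the numeric IDs from URN strings, without duplicates, in first-seen order."""
    seen = set()
    ids = []
    for urn in urns:
        job_id = urn.rsplit(":", 1)[-1]
        # Skip empty or non-numeric tails and IDs that were already collected
        if job_id.isdigit() and job_id not in seen:
            seen.add(job_id)
            ids.append(job_id)
    return ids

sample_urns = [
    "urn:li:jobPosting:3637369960",
    "urn:li:jobPosting:3637369960",  # duplicate
    "urn:li:jobPosting:",            # malformed
    "urn:li:jobPosting:3639003584",
]
print(job_ids_from_urns(sample_urns))
```

Dropping duplicates avoids requesting the same posting twice when the IDs are crawled later.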


In [168]:
import requests
from bs4 import BeautifulSoup

# Job posting numbers to crawl (examples; replace them with your own)
job_posting_numbers = ['3637369960', '3639003584',  '3630388478', '3617085448', '3637313634']

for job_posting_number in job_posting_numbers:
    # Send a GET request to the job posting page
    url = f"https://www.linkedin.com/jobs/view/{job_posting_number}"
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Find the "Responsibilities" section: a <strong> tag whose text exactly
    # matches one of these headings (the misspelled 'Responsibilites' matches
    # postings that contain that typo)
    responsibilities_section = soup.find("strong", string=['Requirement','Requirements','Responsibilites', "Responsibilities", "Key Responsibilities", ' Roles and Responsibilities',"Duties",
    "Tasks",
    "Job Duties",
    "Functions",
    "Work Responsibilities",
    "Job Responsibilities",
    "Job Tasks",
    "Role Expectations",
    "You Will"])

    # Initialize responsibilities list for each job posting
    responsibilities_list = []

    if responsibilities_section:
        # Find the <ul> tag that follows the section heading
        ul_tag = responsibilities_section.find_next("ul")

        # Extract the responsibilities as a list (empty if no <ul> follows)
        responsibilities = [li.text.strip() for li in ul_tag.find_all("li")] if ul_tag else []

        # Add the responsibilities to the main list
        responsibilities_list.extend(responsibilities)

        # Take the first word of the <title> text as the company name
        # (multi-word company names are cut down to their first word)
        title_tag = soup.find("title")
        title_text = title_tag.text.strip() if title_tag else "N/A"
        company_name = title_text.split(" ", 1)[0] if title_text else "N/A"

        # Print the job posting number, company name, and responsibilities
        print(f"Job Posting Number: {job_posting_number}")
        print(f"Company Name: {company_name}")
        print("Responsibilities:")
        for responsibility in responsibilities:
            print(f"- {responsibility}")
        print()

    else:
        print(f"Responsibilities section not found for jobPosting {job_posting_number}")
        print()

Job Posting Number: 3637369960
Company Name: Patterned
Responsibilities:
- Partner with engineers, product managers, and business partners to identify algorithmic problems, brainstorm possible approaches, and recommend the best path forward.
- Develop algorithms iteratively, building in the right level of complexity to solve the business problem at hand and support future improvements.
- Define success criteria for your models so that you can measure impact and changes over time. You'll be expected to communicate findings and drive continuous improvements.
- Collaborate with Software Engineers to implement algorithms in production that scale gracefully.
- Collaborate with stakeholders to prioritize projects and define requirements.
- Carry out analysis on data produced by our hardware systems and create insightful visualizations to share your findings.
- Contribute to internal libraries to help other teams with their data science needs including visualization, prediction, optimization,
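Because the script keeps only the first word of the page title, a multi-word company name is shortened. A possible refinement is sketched below. It assumes the title follows a pattern like "Company hiring Role in Location | LinkedIn", which is a guess that may not hold for every posting, and `company_from_title` is a hypothetical helper.

```python
def company_from_title(title_text, marker=" hiring "):
    """Return the text before the marker, falling back to the first word of the title."""
    title_text = title_text.strip()
    if not title_text:
        return "N/A"
    if marker in title_text:
        return title_text.split(marker, 1)[0].strip()
    # Fallback: same behaviour as the script above
    return title_text.split(" ", 1)[0]

print(company_from_title("Example Labs Inc hiring Data Scientist in Austin, TX | LinkedIn"))
print(company_from_title("ExampleCo Careers"))
```

The fallback keeps the original first-word behaviour whenever the assumed pattern is absent.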

Besides requests and BeautifulSoup, several other tools and methods are available for web scraping:

1. **Scrapy**: Scrapy is a powerful and flexible Python framework for web scraping. It provides a high-level API and handles many common tasks, such as handling cookies, handling redirects, and parsing HTML/XML. Scrapy allows you to build scalable and efficient web crawlers easily.

2. **Selenium**: Selenium is a popular tool for web scraping that allows you to automate browser interactions. It provides a way to control a browser programmatically and interact with web elements, making it useful for scraping websites that heavily rely on JavaScript.

3. **Puppeteer**: Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium browsers. It allows you to automate browser tasks, including web scraping. Puppeteer is particularly useful for scraping JavaScript-rendered websites and provides powerful features for interacting with the page.

4. **APIs**: Many websites provide APIs (Application Programming Interfaces) that allow you to access their data in a structured manner. Instead of scraping HTML, you can make HTTP requests to the API endpoints and retrieve the data directly. This approach is more reliable, efficient, and often encouraged by the website owners.

5. **Web Scraping Services**: If you prefer a non-technical approach or require more advanced scraping capabilities, there are web scraping services available that offer easy-to-use platforms or APIs. These services handle the scraping infrastructure for you and provide tools for data extraction and management.

6. **Browser Extensions and Point-and-Click Tools**: Tools such as the Web Scraper browser extension and the Octoparse desktop application offer point-and-click interfaces for extracting data from websites. They allow you to define scraping rules visually and export the extracted data in various formats.

It's important to note that when web scraping, you should always respect the website's terms of service and legal requirements. Make sure you have permission to scrape the website or that the website allows scraping through its terms of service or API.

## Summary 

In summary, this web crawler provides a basic framework for scraping job posting webpages on LinkedIn. It demonstrates how to send HTTP GET requests, parse HTML content, extract specific sections like "Responsibilities," and print relevant information such as job posting numbers, company names, and responsibilities.

The script is designed for beginners and uses Python along with the requests and BeautifulSoup libraries. It provides a starting point for those interested in web scraping and demonstrates how to navigate and extract data from webpages.

However, it's essential to keep in mind that web scraping should be conducted responsibly and in accordance with the website's terms of service. Before scraping any website, ensure you have permission or that the website allows scraping through an API or other means.

This web crawler can serve as a foundation for expanding your scraping capabilities. You can customize it to scrape other websites or enhance it with additional functionalities like handling pagination, extracting more data points, or storing the scraped data in a database.

Remember to always respect the legal and ethical boundaries of web scraping, and consult the specific website's terms of service before scraping their content. Happy scraping!