# Web Scraping Job Vacancies

## Introduction

In this project, we'll build a web scraper to extract job listings from a popular job search platform. We'll extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from our job search platform
5. Save the data to a CSV file
6. Test our web scraper and refine our code as needed

## Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. In addition, you may want to use the following packages in your Python environment:

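- requests
- BeautifulSoup (installed as `beautifulsoup4`, imported from `bs4`)
- pandas
- csv (part of the Python standard library)
- datetime (part of the Python standard library)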

These packages should already be installed in Coursera's Jupyter Notebook environment. If you need a package that is not included in this environment, or you are working off platform, you can install it with `!pip install packagename` in a notebook cell, for example:

- `!pip install requests`
- `!pip install beautifulsoup4`
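
Before moving on, it can help to confirm that the packages are importable. The sketch below is not part of the original notebook: the helper name `missing_packages` is hypothetical, and it only checks that each module can be found, not which version is installed.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Import names, which can differ from the pip names (bs4 is installed as beautifulsoup4)
required = ["requests", "bs4", "pandas"]
print("Missing:", missing_packages(required))
```

An empty list means every module in `required` is available.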

## Step 1: Importing Required Libraries

In [7]:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

## Step 3: Define a Data Wrangling function

In [8]:
def data_wrangler(response):
    # Converting the JSON response to a dictionary
    data_base = response.json()
    # Converting the list of job records under 'data' to a pandas DataFrame
    df = pd.DataFrame(data_base['data'])
    # Cleaning the company name: keep only the 'name' field of each company record
    df['company'] = df['company'].apply(lambda company: company['name'])
    # Dropping columns I do not find useful; drop() returns a new DataFrame, so reassign it
    df = df.drop(columns=['id','referenceId','postAt','postedTimestamp','posterId'])
    
    return df
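
To try the wrangling step without calling the API, the sketch below feeds a hand-made payload through the same kind of steps. The `FakeResponse` class, the `wrangle` name, and the sample records are invented for illustration. The real payload is assumed to have a top-level `data` list whose items carry a nested `company` dictionary with a `name` key, as the function above implies.

```python
import pandas as pd


class FakeResponse:
    """Hypothetical stand-in for requests.Response that only provides .json()."""

    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload


def wrangle(response):
    """Mirror the steps of data_wrangler on any object that has a .json() method."""
    df = pd.DataFrame(response.json()["data"])
    # Replace each nested company dictionary with its 'name' value
    df["company"] = df["company"].apply(lambda company: company["name"])
    # errors="ignore" skips listed columns that are absent from the payload
    unwanted = ["id", "referenceId", "postAt", "postedTimestamp", "posterId"]
    return df.drop(columns=unwanted, errors="ignore")


# Invented sample payload, shaped like the structure assumed above
sample = {
    "data": [
        {"id": "1", "title": "Data Analyst", "company": {"name": "Acme"},
         "url": "https://example.com/jobs/1"},
        {"id": "2", "title": "Junior Data Analyst", "company": {"name": "Globex"},
         "url": "https://example.com/jobs/2"},
    ]
}

jobs = wrangle(FakeResponse(sample))
print(jobs)
```

The `errors="ignore"` argument is a design choice for the sketch: it keeps the function from failing when the API omits one of the listed columns.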

## Step 4: Define a function that will write the data to Excel

In [9]:
def excel_writer(df,your_name):
    with pd.ExcelWriter("Data_Storage"+"-"+str(your_name)+".xlsx") as writer:
        # Write the DataFrame passed in as df (not an undefined df4) to a sheet named "Data"
        df.to_excel(writer, sheet_name="Data", index=False)
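
Where `pd.ExcelWriter` is unavailable (writing `.xlsx` files needs an Excel engine such as openpyxl to be installed), a CSV file is a simple alternative and matches the goal of saving the data to a CSV file. The following is a minimal sketch. The function name `csv_writer`, the `folder` parameter, and the sample row are assumptions, and the file-name pattern mirrors `excel_writer`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def csv_writer(df, your_name, folder="."):
    """Write df to <folder>/Data_Storage-<your_name>.csv and return the path."""
    path = Path(folder) / f"Data_Storage-{your_name}.csv"
    # index=False leaves out the row index, as in excel_writer
    df.to_csv(path, index=False)
    return path


# Invented one-row example, written to a temporary folder for the demonstration
example = pd.DataFrame({"title": ["Data Analyst"], "company": ["Acme"]})
with tempfile.TemporaryDirectory() as tmp:
    saved = csv_writer(example, "demo", folder=tmp)
    print(saved.name)
```

In the notebook itself, `csv_writer(df, your_name)` would write the file next to the notebook.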

## Step 2: Define a function to collect jobs by title and location

In [10]:
# Importing the data using an API 

# Requesting the pages directly with requests.get, without going through an API, mostly results in ConnectionResetError [WinError 10054]

# I chose the LinkedIn API as it is the most popular website/social network for job postings

def linkedIn_jobs(position:str,location_Id:str,your_name:str):
    # LinkedIn api (Rapidapi.com)
    url = "https://linkedin-data-api.p.rapidapi.com/search-jobs-v2"
    # Querystring limited to position and location; sorting and time filters can be added to narrow your search
    querystring = {"keywords":position,"locationId":location_Id}

    # API key from RapidAPI.com (replace the placeholder with your own key; avoid sharing real keys in a notebook)
    headers = {
        "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
        "X-RapidAPI-Host": "linkedin-data-api.p.rapidapi.com"
    }
    
    response = requests.get(url, headers=headers, params=querystring)
    
    value_1 = data_wrangler(response)
    
    #### value_2 is commented out because the Coursera Jupyter notebook does not have a module that supports
    #### pd.ExcelWriter
    
    #value_2 = excel_writer(value_1,your_name)
    
    return value_1
 
#linkedIn_jobs("Data Analyst","104035573","Kamogelo")
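
Hard-coding an API key in a notebook makes it easy to leak when the notebook is shared. One common alternative is to read the key from an environment variable. The sketch below assumes a variable named `RAPIDAPI_KEY`, a name chosen here for illustration, and a hypothetical helper `build_headers`. The header names and host are the ones used in `linkedIn_jobs`.

```python
import os


def build_headers(host="linkedin-data-api.p.rapidapi.com"):
    """Build the RapidAPI headers, reading the key from the environment."""
    api_key = os.environ.get("RAPIDAPI_KEY")  # assumed variable name
    if not api_key:
        raise RuntimeError("Set the RAPIDAPI_KEY environment variable first")
    return {"X-RapidAPI-Key": api_key, "X-RapidAPI-Host": host}


# Demonstration only: a real key would be set outside the notebook
os.environ.setdefault("RAPIDAPI_KEY", "dummy-key-for-demo")
print(build_headers()["X-RapidAPI-Host"])
```

Inside `linkedIn_jobs`, `headers = build_headers()` could then replace the literal dictionary. Passing a `timeout` argument to `requests.get` and calling `response.raise_for_status()` before wrangling are further options worth considering.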

In [11]:
search = linkedIn_jobs("Data Analyst","104035573","Kamogelo")


def get_job_description(query):
    # Get all the urls from the jobs
    links = query['url']
    
    # Using a for loop, we request each URL to get the information on the page,
    # storing the responses inside a list, "response_store"
    response_store = []
    for j in links:
        response_store.append(requests.get(j))
    
    # Using BeautifulSoup to make the responses readable
    # I would have liked to convert each response to a dictionary using .json(),
    # since dictionaries are simpler to work with, but that kept raising an error
    # (the job pages appear to return HTML rather than JSON), so I parse the HTML with BeautifulSoup instead
    
    response_revealed = []

    for k in response_store:
        response_revealed.append(bs(k.content,"html.parser"))
        
    # Creating a dictionary that will store the revealed responses with its appropriate link
    # So that we can later use the links to join the responses to the dataframe
    
    links_and_responses = dict(zip(links,response_revealed))
    
    return [response_revealed,links_and_responses]
        

html_responses = get_job_description(search)

In [None]:
# Code to be added

#### Function to extract the job description from the bs4 objects captured (response_revealed)
#### Function to link (using the URL as the joining key) the extracted job description to the main DataFrame that will be written out to an Excel file
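
The two functions listed above could be sketched as follows. This is not the author's implementation: the names `extract_description` and `add_descriptions`, the CSS selector `div.description__text`, and the sample HTML are assumptions made for illustration, and the real selector has to be found by inspecting a job page. The sketch takes a dictionary that maps each URL to its parsed page, like `links_and_responses` returned by `get_job_description`, and joins on the `url` column.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_description(soup, selector="div.description__text"):
    """Return the text of the first element matching selector, or None."""
    node = soup.select_one(selector)  # selector is an assumption, not verified
    return node.get_text(" ", strip=True) if node is not None else None


def add_descriptions(df, links_and_responses):
    """Left-join extracted descriptions onto df, using 'url' as the key."""
    descriptions = pd.DataFrame(
        {
            "url": list(links_and_responses.keys()),
            "description": [
                extract_description(soup) for soup in links_and_responses.values()
            ],
        }
    )
    return df.merge(descriptions, on="url", how="left")


# Invented stand-ins for the search results and the scraped pages
jobs = pd.DataFrame(
    {
        "title": ["Data Analyst", "BI Analyst"],
        "url": ["https://example.com/1", "https://example.com/2"],
    }
)
pages = {
    "https://example.com/1": BeautifulSoup(
        '<div class="description__text"><p>Build dashboards.</p></div>',
        "html.parser",
    ),
    "https://example.com/2": BeautifulSoup(
        "<p>No description here</p>", "html.parser"
    ),
}

result = add_descriptions(jobs, pages)
print(result)
```

With the notebook's variables, this would be called as `add_descriptions(search, html_responses[1])`, since `get_job_description` returns the dictionary as the second item of its list. A left join keeps every job row even when no description could be extracted for its URL.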