## Introduction
This practical walks through scraping job listings from Foundit. Using Python, Selenium, and pandas, we'll extract information from foundit.sg and build a pandas DataFrame. Before we begin, here is a quick explanation of web scraping.

Imagine you need to collect a lot of information about a topic from various web pages and articles, and store it in a suitable format such as an Excel file. One way is to visit every website and copy the useful information into the spreadsheet by hand. Programmers tend to take the easier route: web scraping. Web scraping is the technique of automatically extracting large amounts of data from web pages and storing it in a structured format.

## Scraping job details from Foundit
The foundit (formerly Monster) Job Search App is a platform where freshers and experienced job seekers can find career opportunities.

Here are the steps involved:

1. Install and import necessary modules
2. Send basic queries such as job title or company name and location to the Foundit website using selenium
3. Fetch the current URL after sending the queries to the website using selenium
4. Fetch the information about job title, company name, rating, location, simple description, date of posting, etc
5. Store this information into a CSV file using pandas


## Load Libraries
First, we need to install some modules, along with a Chrome driver for selenium. After downloading the Chrome driver, move it to the working directory. Alternatively, the webdriver_manager package installed below can download a matching driver for you.

We need to import the libraries used in this practical. requests sends HTTP requests from Python, Selenium is a browser automation tool that we use here to send queries to the website, and lxml parses the page's HTML or XML. pandas converts the collected data into a CSV file.



In [1]:
pip install selenium webdriver_manager

Defaulting to user installation because normal site-packages is not writeable
Collecting webdriver_manager
  Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Collecting python-dotenv (from webdriver_manager)
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Using cached python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, webdriver_manager
Successfully installed python-dotenv-1.1.0 webdriver_manager-4.0.2

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49m/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
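
Once installation finishes, it can be useful to confirm that the libraries named in this section are visible to the kernel before importing them. The sketch below is one possible check using only the standard library; the helper name `missing_modules` is made up for this example, and the list of module names is an assumption based on the libraries mentioned above.

```python
from importlib.util import find_spec

# Top-level module names for the libraries mentioned in this section (assumed list).
REQUIRED_MODULES = ["selenium", "webdriver_manager", "pandas", "requests", "lxml"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

# Prints the modules that still need installing, or "none" if all are present.
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If anything is reported as missing, install it with pip and restart the kernel as the note above suggests.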


## Sending job title and location using selenium
Now let's create a function that sends queries to the web page and returns the resulting URL. The function takes the Foundit URL as one of its parameters and opens it, then uses selenium to enter the job title and location. The site responds with a new page listing the jobs that match the job title and location you passed in. Finally, the function returns the URL of that results page so that we can scrape it using Beautiful Soup.
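
A minimal sketch of such a function is shown here. It is not the original notebook's code: the function names `make_driver` and `get_results_url` are illustrative, and the CSS selectors for the search boxes are passed in as parameters because the real ones must be looked up on the live page. The two constants hold the same string values as selenium's `By.CSS_SELECTOR` and `Keys.ENTER`, written as literals so the search logic can be exercised without a browser.

```python
CSS_SELECTOR = "css selector"  # same value as selenium.webdriver.common.by.By.CSS_SELECTOR
ENTER_KEY = "\ue007"           # same value as selenium.webdriver.common.keys.Keys.ENTER

def make_driver():
    """Start Chrome, letting webdriver_manager download a matching driver."""
    # Imported here so the rest of the sketch can be read and tested without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    return webdriver.Chrome(service=Service(ChromeDriverManager().install()))

def get_results_url(driver, url, job_title, location, title_css, location_css):
    """Open `url`, type the job title and location, submit, and return the new URL.

    `title_css` and `location_css` are selectors for the two search boxes;
    they are assumptions that must be checked against the live page.
    """
    driver.get(url)
    title_box = driver.find_element(CSS_SELECTOR, title_css)
    title_box.send_keys(job_title)
    location_box = driver.find_element(CSS_SELECTOR, location_css)
    location_box.send_keys(location)
    location_box.send_keys(ENTER_KEY)  # pressing Enter submits the search
    return driver.current_url
```

A call might look like `get_results_url(make_driver(), "https://www.foundit.sg", "data analyst", "Singapore", title_css, location_css)`, with the two selectors filled in after inspecting the page. On a real site the results page may take a moment to load, so waiting (for example with selenium's `WebDriverWait`) before reading `driver.current_url` is likely to be necessary.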

## Scraping jobs using Beautiful Soup

Not every website can be scraped with Beautiful Soup. Check https://www.foundit.sg/robots.txt to see which paths the site allows crawlers to access.
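
The robots.txt check can also be done in code with the standard library's `urllib.robotparser`. The sketch below parses a small made-up rule set rather than Foundit's real file, so the rules, the `my-scraper` user agent, and the example.com URLs are placeholders for illustration only.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Made-up rules for illustration; these are NOT the contents of Foundit's robots.txt.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/jobs"))          # True
print(is_allowed(SAMPLE_RULES, "my-scraper", "https://example.com/private/data"))  # False
```

To check the live site instead, `RobotFileParser` can download the file itself: call `set_url("https://www.foundit.sg/robots.txt")` followed by `read()`, then `can_fetch(...)` as above.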

## Scraping jobs using Selenium

The next step is to find the CSS selectors for the elements we need and retrieve the raw text inside the matching tags. The CSS selectors given in the code probably match those on the web page, but the site's markup may change over time, so check them before running.

By looping through all the job posts, we collect the details of each one. Lastly, we convert the data into a pandas DataFrame and return it. You'll get the job title, company name, salary, post date and experience. You can save it as a CSV file using df.to_csv("jobs.csv").
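
The loop described above could be sketched as follows. Every selector in `FIELD_SELECTORS` is a hypothetical placeholder, not Foundit's actual markup, and the function names are illustrative; the real selectors have to be read off the page with the browser's inspector. The `driver` argument is assumed to be a selenium WebDriver that has already loaded the results page.

```python
CSS_SELECTOR = "css selector"  # same value as selenium's By.CSS_SELECTOR

# Hypothetical selectors for illustration; replace with the ones found on the live page.
FIELD_SELECTORS = {
    "title": ".job-title",
    "company": ".company-name",
    "salary": ".salary",
    "posted": ".post-date",
    "experience": ".experience",
}

def first_text(element, css):
    """Return the stripped text of the first match inside `element`, or None."""
    matches = element.find_elements(CSS_SELECTOR, css)
    return matches[0].text.strip() if matches else None

def scrape_jobs(driver, card_css, field_selectors=FIELD_SELECTORS):
    """Collect one dict per job card found on the page currently loaded in `driver`."""
    rows = []
    for card in driver.find_elements(CSS_SELECTOR, card_css):
        rows.append({name: first_text(card, css) for name, css in field_selectors.items()})
    return rows

def jobs_to_csv(rows, path="jobs.csv"):
    """Convert the scraped rows to a DataFrame, save it as CSV, and return it."""
    import pandas as pd  # imported here so the functions above work without pandas
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)
    return df
```

Using `find_elements` (plural) for each field means a card that lacks, say, a salary yields `None` for that column instead of raising an exception and stopping the loop.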


In [None]:
import socket
import struct

def send_command(sock, command: str):
    # Message format: [type: 1 byte] [length: 4 bytes, little endian] [command as bytes]
    cmd_bytes = command.encode()
    packet = struct.pack('<BI', 1, len(cmd_bytes)) + cmd_bytes
    sock.sendall(packet)

def read_line(sock):
    line = b''
    while not line.endswith(b'\n'):
        chunk = sock.recv(1)
        if not chunk:
            break
        line += chunk
    return line.decode()

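def recv_exact(sock, n: int) -> bytes:
    # recv() may return fewer bytes than requested, so loop until n bytes arrive
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            break  # connection closed early; the caller checks the length
        buf += chunk
    return buf

def read_response(sock):
    print("[+] Server response:")
    while True:
        # Peek at the next byte without consuming it
        head = sock.recv(1, socket.MSG_PEEK)
        if not head:
            break
        if head == b'\x05':  # End byte
            sock.recv(1)
            break
        if head == b'\x03':  # Control/Separator
            sock.recv(1)
            continue

        # Read message: [type: 1 byte] [length: 4 bytes, little endian] [payload]
        type_byte = sock.recv(1)
        if type_byte != b'\x02':
            print("[!] Unexpected type:", type_byte)
            break

        length_bytes = recv_exact(sock, 4)
        if len(length_bytes) < 4:
            print("[!] Incomplete length")
            break
        length = struct.unpack('<I', length_bytes)[0]

        data = recv_exact(sock, length)
        if len(data) < length:
            print("[!] Incomplete data")
            break

        try:
            print(data.decode())
        except UnicodeDecodeError:
            print(data)  # fall back to raw bytes if the payload is not valid UTF-8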

def main():
    host = 'ctf1.sentinel-cyber.sg'
    port = 65432

    with socket.create_connection((host, port)) as sock:
        # Try likely commands — "flag" is a good guess
        command = "flag"
        print(f"[+] Sending command: {command}")
        send_command(sock, command)

        response = read_line(sock)
        print(f"[+] Initial server line: {response.strip()}")

        read_response(sock)

if __name__ == '__main__':
    main()


[+] Sending command: employees


ConnectionResetError: [Errno 54] Connection reset by peer
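
The packet layout that `send_command` builds (a 1-byte type, a 4-byte little-endian length, then the payload) can be checked offline without connecting to any server. The sketch below is only an illustration of that framing; the helper names `frame` and `unframe` are made up for this example, and nothing here is confirmed to be what the remote server actually expects.

```python
import struct

# 1-byte type followed by a 4-byte little-endian length: 5 bytes, no padding.
HEADER = struct.Struct("<BI")

def frame(msg_type: int, payload: bytes) -> bytes:
    """Prefix `payload` with its type byte and length."""
    return HEADER.pack(msg_type, len(payload)) + payload

def unframe(packet: bytes):
    """Split a packet back into (type, payload), checking the declared length."""
    msg_type, length = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    if len(payload) < length:
        raise ValueError("incomplete payload")
    return msg_type, payload

packet = frame(1, b"flag")
print(packet.hex())     # 0104000000666c6167
print(unframe(packet))  # (1, b'flag')
```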