<h1><center>Tutorial on Web-scraping</center></h1>  
<center>By <b>HARIPRASHAAD SR</b></center>
<center>Student</center>

> **TOPIC**      : _Scraping Job Posting_  
> **TOOLS USED** : _Beautiful Soup_  
> **DIFFICULTY** : _Beginner_  
> **DOMAIN**     : _Web scraping_

## What is web scraping?
Web scraping is an automated method for collecting large amounts of data from websites. Most of this data is unstructured HTML, which is converted into structured data in a spreadsheet or a database so that it can be used in other applications. There are several ways to scrape a website: online services, site-specific APIs, or code written from scratch. With Python tools such as BeautifulSoup and Scrapy, scraping involves navigating web pages, selecting the relevant HTML elements, and retrieving the desired information. This makes it possible to collect data from diverse sources automatically, monitor changes, and analyze trends. Scraping is a powerful tool, but it must respect ethical considerations and each website's terms of service. Typical applications include data mining, market research, and building datasets for machine learning and analytics.

## Objective of the Project
The objective of this web scraping project is to automate the collection of job postings from the official portal of the Government of Tamil Nadu using `Beautiful Soup` and `requests` in Python. The primary goal is to develop a script that extracts detailed information from each job posting, including job titles, departments, locations, eligibility criteria, application deadlines, and other pertinent details. The extracted data is then organized into a structured dataset that is easy to analyze and reference. The script is also meant to be run periodically so that the job postings data stays up to date.

Comprehensive documentation will be provided for users, offering instructions on interacting with the data and understanding the presented information.
> The project will strictly adhere to ethical standards and comply with the terms of use of the Government of Tamil Nadu's portal, respecting robots.txt rules and avoiding excessive requests. 
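
One way to honour the `robots.txt` commitment is to check each URL before requesting it. The sketch below uses the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` and the user-agent name `job-scraper` are made up for illustration and are not the portal's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; the real portal's robots.txt may differ.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    # Parse robots.txt text that has already been downloaded
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def is_allowed(parser: RobotFileParser, url: str, user_agent: str = "job-scraper") -> bool:
    # True if the rules permit this user agent to fetch the URL
    return parser.can_fetch(user_agent, url)

robots = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(robots, "https://www.tnprivatejobs.tn.gov.in/Home/jobs"))
print(is_allowed(robots, "https://www.tnprivatejobs.tn.gov.in/admin/login"))
```

In a real run, the text would come from the site's own `/robots.txt` file rather than from a hard-coded sample.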

## Libraries used
- ### pandas:
Pandas is a Python library for data manipulation and analysis, featuring powerful data structures like Series and DataFrame. Widely used in data science, it simplifies tasks such as cleaning, transforming, and visualizing structured data. With seamless integration and an intuitive interface, Pandas is essential for efficient data handling and analysis in Python.  
**Used here for creating DataFrames and CSV files**
- ### numpy:
NumPy is a foundational Python library for numerical computing, providing support for large, multi-dimensional arrays and matrices. Essential for scientific and mathematical applications, it offers efficient operations on arrays, linear algebra functions, and tools for integrating with other data analysis libraries. NumPy is a cornerstone in the Python scientific computing ecosystem.  
**Used here for manipulating arrays**
- ### requests:
The requests module allows you to **send HTTP requests in Python**, which is useful for interacting with web APIs or for web scraping.
The requests module is easy to use and well documented, making it a good choice for beginners. 
It supports the common HTTP methods (GET, POST, PUT, DELETE) and lets you set request headers and parameters, which makes common tasks such as retrieving a page from a server straightforward.
- ### BeautifulSoup:
Beautiful Soup is a Python library for **parsing HTML and XML documents**. It lets you navigate and search a page's HTML in much the same way as you inspect a web page using browser developer tools.

## Steps involved in the project
- Importing libraries
- Getting the base URL of the job posting website
- Requesting the HTML contents of the page using the `requests` package
- Parsing the contents using the `BeautifulSoup` library
- Searching the HTML content to retrieve information on job postings: company name, posted date, requirements, city, etc.
- Storing this information in a DataFrame using the `pandas` library
- Exporting the DataFrame to a CSV file

### STEP - 0 : Working Environment
Make sure you have Python 3.7 or higher and `pip` installed on your system. Open a Python IDE of your choice, and install the packages listed in `requirements.txt` by typing the following in your terminal.  
  
`pip install -r requirements.txt`

### STEP - 1 : Importing Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### STEP - 2 : Job Posting Website
We will collect the job postings from `https://www.tnprivatejobs.tn.gov.in/Home/jobs` (Government of Tamil Nadu) and scrape its contents.

![](https://imgur.com/NpqPyqh.png)

In [2]:
base_url = "https://www.tnprivatejobs.tn.gov.in/Home/jobs"

The website has around 280 job postings spread across 28 pages. Apart from the first page, the pages have addresses of the form  
<b>base_url + "/" + [10, 20, 30, ..., 280]</b>

In [3]:
# Create a list to store all the urls
urls = [] 

# Appending the base url to the list
urls.append(base_url)

for i in range(10,281,10):
    # Appending all urls to the list
    urls.append(f"{base_url}/{i}")
    
urls

['https://www.tnprivatejobs.tn.gov.in/Home/jobs',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/10',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/20',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/30',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/40',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/50',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/60',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/70',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/80',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/90',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/100',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/110',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/120',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/130',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/140',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/150',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/160',
 'https://www.tnprivatejobs.tn.gov.in/Home/jobs/170',
 'https://www.tnprivatejobs.tn.gov.in/Hom
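
The same list can be built by a small reusable helper, which also makes the page count easy to check. This is a sketch: the name `build_page_urls` and its parameters are hypothetical, and the default offsets simply mirror the `range(10, 281, 10)` used above.

```python
from typing import List

def build_page_urls(base_url: str, last_offset: int = 280, step: int = 10) -> List[str]:
    # The first page has no offset; later pages append /10, /20, ... to the base URL
    return [base_url] + [f"{base_url}/{offset}" for offset in range(step, last_offset + 1, step)]

base_url = "https://www.tnprivatejobs.tn.gov.in/Home/jobs"
urls = build_page_urls(base_url)

print(len(urls))   # 29: the base URL plus the offsets 10 through 280
print(urls[-1])
```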

### STEP - 3 : Requesting the website 
We will request the HTML content of the search page from the website using the `requests` package  
  
> We will use only the `base_url` to learn about some basic concepts

In [4]:
try:
    page = requests.get(base_url)

    # Check if the request was successful(status code 200)
    if page.status_code == 200:
        print("Request successful")
    else:
        print(f"Request failed with status code {page.status_code}")

except requests.RequestException as e:
    print(f"An error occurred: {e}")

Request successful


Displaying `page` shows the response object with its status code rather than the HTML content itself.

In [5]:
page

<Response [200]>

### STEP - 4 : Parse the contents
Using the BeautifulSoup library, parse the contents and store it in a variable `response` which will be used throughout this project

In [6]:
# Storing the parsed contents
response = BeautifulSoup(page.text,"html.parser")

**The following line prints the HTML of the scraped page:**

`print(response)`   

It lists the entire `HTML` content of the webpage, which is very large, so the line is not executed here.  
You can run it yourself if you want to view the content.

Sample output of the above line:
![](https://imgur.com/XtQBSTH.png)

**Consolidation of Step 3 and Step 4**

In [8]:
def parse(url):
    # Returns the parsed page, or None if the request fails
    response = None
    try:
        page = requests.get(url)

        # Check if the request was successful (status code 200)
        if page.status_code == 200:
            print("Request successful")

            # Storing the parsed contents
            response = BeautifulSoup(page.text,"html.parser")
        else:
            print(f"Request failed with status code {page.status_code}")

    except requests.RequestException as e:
        print(f"An error occurred: {e}")

    return response

### STEP - 5 : Scraping through the returned web-content
- Job
- Company name
- City
- Role Description
- Field
- Requirement
- Salary
- Open date
- Close date

Before going further, we will learn how basic web scraping works.  
The details we need are present in the `response` in `HTML` format.

### STEP 5.1 : Scraping the Job Title

Right-click on the web page and click `Inspect Element` to see its `HTML` content. The figure below shows that the job titles are inside `<h4>` tags, so we find all the `h4` tags using the `find_all()` function.

![](https://imgur.com/fWshCmn.png)

In [9]:
# find_all() finds all occurrences of the search element - <h4> tags
all_h4 = response.find_all('h4')
all_h4

[<h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbycity">Jobs By Location</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbytype">Jobs By Type</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbysector">Jobs By Sector</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbygender">Jobs By Gender</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbyexperience">Jobs By Experience</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbysalary">Salary Range</a> </h4>,
 <h4 class="panel-title"> <a class="collapsed" data-parent="#accordion" data-toggle="collapse" href="#jobsbytopcomp">Top Companies</a> </h4>,
 <h4

All the job titles are within `<a>` tags with the unique `style = 'color: #02b44a;'`. We scrape them using the `find()` function, which returns the first occurrence of the search element, or `None` if there is no match.

In [10]:
# find() finds the first occurrence of the search element - <a> tags
all_a_tags = [h4.find('a',style = 'color: #02b44a;') for h4 in all_h4]
all_a_tags

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050" style="color: #02b44a;">
 											Digital Marketing Executive 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702045226171" style="color: #02b44a;">
 											Graphic Desginer 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702230126397" style="color: #02b44a;">
 											VIDEO JOCKEY 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23112109532337594" style="color: #02b44a;">
 											Bus Captain 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23072512412210335" style="color: #02b44a;">
 											Electrical desig

**Now we have all the `a` tags, but some values are `None` because some `<h4>` tags contain no matching `<a>` tag. We get the contents of the `a` tags that are not `None` using the `.text` attribute.**

In [11]:
# .text returns the text content of a particular tag - <a> tag
job_titles = [a.text for a in all_a_tags if a is not None]
job_titles

['\r\n\t\t\t\t\t\t\t\t\t\t\tDigital Marketing Executive \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tGraphic Desginer \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tVIDEO JOCKEY \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tBus Captain \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tElectrical design engineer \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tBILLING EXECUTIVES \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tSewing Machine Operator  - பெண்களுக்கான வேலைவாய்ப்பு  \r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t',
 '\r\n\t\t\t\t\t\t\t\t\t\t\tProject Manager/Project In-charge/ Project Engineer/ 

**Though we have extracted the job titles, they are not clean. So we create a function `clean_text` that cleans the text by removing the whitespace characters `\n`, `\t`, and `\r`.**

In [12]:
def clean_text(text):
    # Remove the whitespace control characters - ['\n', '\t', '\r']
    cleaned_text = text.replace('\t','').replace('\n','').replace('\r','').strip()
    
    # Collapse any remaining runs of spaces into a single space
    cleaned_text = ' '.join(cleaned_text.split()) 
    
    return cleaned_text
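
A note on the design: `str.split()` with no arguments already treats `\n`, `\t`, and `\r` as separators, so the whitespace can also be normalised in a single step. The sketch below uses a hypothetical helper name, `normalize_whitespace`, and differs from `clean_text` in one respect: two words separated only by a tab stay separated instead of being glued together.

```python
def normalize_whitespace(text: str) -> str:
    # split() with no arguments splits on any run of whitespace (spaces, \n, \t, \r)
    # and ignores leading/trailing whitespace; join() puts single spaces back
    return " ".join(text.split())

raw = "\r\n\t\t\tDigital Marketing Executive \r\n\r\n\t\t\t"
print(normalize_whitespace(raw))             # Digital Marketing Executive
print(normalize_whitespace("Bus\tCaptain"))  # Bus Captain
```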

In [13]:
# Clean the scraped job titles using the clean_text function
job_titles = [clean_text(title) for title in job_titles]
job_titles

['Digital Marketing Executive',
 'Graphic Desginer',
 'VIDEO JOCKEY',
 'Bus Captain',
 'Electrical design engineer',
 'BILLING EXECUTIVES',
 'Sewing Machine Operator - பெண்களுக்கான வேலைவாய்ப்பு',
 'Project Manager/Project In-charge/ Project Engineer/ Executive',
 'Graphic Desginer',
 'Graphic Designer & DTP Operator']

###### The final part of Step 5.1 is to consolidate everything inside a function `scrap_title()` 

In [14]:
def scrap_title(response):
    # This function is used to scrape the job titles from each webpage
    
    # find_all() finds all occurrences of the search element - <h4> tags
    all_h4 = response.find_all('h4')
    
    # find() finds the first occurrence of the search element - <a> tags
    all_a_tags = [h4.find('a',style = 'color: #02b44a;') for h4 in all_h4]
    
    # .text returns the text content of a particular tag - <a> tag
    job_titles = [a.text for a in all_a_tags if a is not None]
    
    # Clean the scraped job titles using the clean_text function
    job_titles = [clean_text(title) for title in job_titles]
    
    return job_titles

In [15]:
scrap_title(response)

['Digital Marketing Executive',
 'Graphic Desginer',
 'VIDEO JOCKEY',
 'Bus Captain',
 'Electrical design engineer',
 'BILLING EXECUTIVES',
 'Sewing Machine Operator - பெண்களுக்கான வேலைவாய்ப்பு',
 'Project Manager/Project In-charge/ Project Engineer/ Executive',
 'Graphic Desginer',
 'Graphic Designer & DTP Operator']
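
As an alternative sketch, Beautiful Soup's `select()` method accepts a CSS selector, which can pick out the styled links directly and so avoids the `None` entries. The HTML in `SAMPLE_HTML` is made up to mimic the structure seen above, and the `example.com` links are placeholders. Matching on the mere presence of a `style` attribute is an assumption about the page.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure seen in the notebook output
SAMPLE_HTML = """
<h4 class="panel-title"><a class="collapsed" href="#jobsbycity">Jobs By Location</a></h4>
<h4><a href="https://example.com/job/1" style="color: #02b44a;">
        Digital Marketing Executive
</a></h4>
<h4><a href="https://example.com/job/2" style="color: #02b44a;">
        Bus Captain
</a></h4>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "h4 a[style]" matches <a> tags inside an <h4> that carry a style attribute,
# so headings without such a link are skipped instead of yielding None
links = soup.select("h4 a[style]")

titles = [" ".join(a.get_text().split()) for a in links]
hrefs = [a["href"] for a in links]
print(titles)
print(hrefs)
```

The same `links` list serves both the titles and the job URLs, which keeps the two in step with each other.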

### STEP 5.2 : Scraping through each Job 
Now that we have completed our first scrape, we will speed things up.  
In this step, we will go through the webpage of each job and scrape the required information.

![](https://imgur.com/FJN7xwW.png)

**We want to visit each job's webpage and extract its contents. To do that, we first collect the URL of each job's webpage.**

![](https://imgur.com/li5SJJU.png)

**Looking at the image above, it is clear that all the links are present in the same `<a>` tags that we used to retrieve the `job_titles`.**

In [16]:
# Retrieving all the <a> tags with a link to each job
all_h4s = response.find_all("h4")
all_a_tag = [h4.find('a',style = 'color: #02b44a;') for h4 in all_h4s]
all_a_tag

[None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050" style="color: #02b44a;">
 											Digital Marketing Executive 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702045226171" style="color: #02b44a;">
 											Graphic Desginer 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702230126397" style="color: #02b44a;">
 											VIDEO JOCKEY 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23112109532337594" style="color: #02b44a;">
 											Bus Captain 
 
 											
 																					</a>,
 <a href="https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23072512412210335" style="color: #02b44a;">
 											Electrical desig

**Now, for each of these `<a>` tags that is not `None`, we get the link to the job description. 
For this we use `.attrs['href']`, which reads the `href` attribute from the `<a>` tag.**

In [17]:
# .attrs[] is used to read the named attribute (here href) from a tag
href = [a.attrs['href'] for a in all_a_tag if a is not None]
href

['https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702045226171',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702230126397',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23112109532337594',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23072512412210335',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23083104491031380',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23090602402603079',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091105483508313',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091210083521943',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091210132821264']

**Now that we have come to the end of Step 5.2, we consolidate this step in a function `get_href()`.**

In [18]:
def get_href(response):
    # This function is used to get the links to each job in a web-page
    
    # Retrieving all the <a> tags with a link to each job
    all_h4s = response.find_all("h4")
    all_a_tag = [h4.find('a',style = 'color: #02b44a;') for h4 in all_h4s]
    
    # .attrs[] is used to read the named attribute (here href) from a tag
    href = [a.attrs['href'] for a in all_a_tag if a is not None]
    
    return href

In [19]:
get_href(response)

['https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702045226171',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23111702230126397',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23112109532337594',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23072512412210335',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23083104491031380',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23090602402603079',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091105483508313',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091210083521943',
 'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23091210132821264']
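
The links on this page are already absolute. If a listing page ever returned relative links instead (an assumption, not something observed here), `urllib.parse.urljoin` from the standard library would resolve them against the page address while leaving absolute links untouched. The helper name `absolute_links` and the relative link in the sample are hypothetical.

```python
from typing import List
from urllib.parse import urljoin

def absolute_links(page_url: str, hrefs: List[str]) -> List[str]:
    # urljoin resolves relative paths against page_url and returns absolute URLs unchanged
    return [urljoin(page_url, href) for href in hrefs]

page_url = "https://www.tnprivatejobs.tn.gov.in/Home/jobs"
sample = [
    "https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050",
    "/candidate/Home/ca_jobfair_single/23111702045226171",  # hypothetical relative form
]

resolved = absolute_links(page_url, sample)
for link in resolved:
    print(link)
```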

### STEP 5.3 : Getting contents for each job
We now get the contents of each job's webpage, using the links retrieved in the last step. Initially we use only the first link.  
- Field
- Description
- Salary
- Qualification
- City
- Landmark
- Requirements(Gender, Age, Openings, Experience,...)

In [20]:
# Initialising all the variables to store the scraped content
gender = age = opening = exp = city = landmark = field = desc = sal = qual = " "

In [21]:
# Get the first URL
url_1 = get_href(response)[0]
url_1

'https://www.tnprivatejobs.tn.gov.in/candidate/Home/ca_jobfair_single/23102202344326050'

**Now we will request and parse the `url_1` using the `parse()` function**

In [22]:
# Parse it using the parse() function
res_1 = parse(url_1)

Request successful


**Since our contents are in `div` tags with the class name `location`, we follow the same steps that we used for the job-title scraping.**

In [23]:
# Get the contents from the <div> tags
all_divs = res_1.find_all('div',class_ = 'location')
conts = [div.text for div in all_divs]
conts

['\n   Media & Entertainment\r\n\r\n                                    |                                     Search Engine Marketing Executive                                  \r\n\r\n\r\n                                \n\n',
 '\n  15,000 - 25,000 p.m                               \r\n                                 | \r\n                                Under Graduate  \r\n\r\n\r\n                                                                  \r\n\r\n\r\n                                  \r\n\r\n                                  |   Chennai \r\n\r\n                              \r\n\r\n                            ',
 '\n Anna nagar                               \r\n                               ',
 '\n\r\n                             Gender :  All                              \r\n                               | Age Limit  - 20-30  | Openings - 3 | Experience - 0-1 Year                            \r\n                            ',
 '  15,000 - 25,000    |    Media & Entertainmen

**Yet again, we clean each element of the above list using the `clean_text()` function.**

In [24]:
# Clean each retrieved data using the clean_text() function
conts = [clean_text(cont) for cont in conts]
conts

['Media & Entertainment | Search Engine Marketing Executive',
 '15,000 - 25,000 p.m | Under Graduate | Chennai',
 'Anna nagar',
 'Gender : All | Age Limit - 20-30 | Openings - 3 | Experience - 0-1 Year',
 '15,000 - 25,000 | Media & Entertainment',
 '15,000 - 25,000 | Media & Entertainment']

**We split the contents at the pipe character '|' and store the pieces in a new variable `temps`.  
We then store the needed contents from `temps` in a new list `cleaned_conts`.**

In [25]:
# Store the needed contents in a list
cleaned_conts = [] 

for cont in conts:
    # temps stores the split elements
    temps = cont.split('|')
    temps = [temp.strip() for temp in temps]
    cleaned_conts.append(temps)
cleaned_conts

[['Media & Entertainment', 'Search Engine Marketing Executive'],
 ['15,000 - 25,000 p.m', 'Under Graduate', 'Chennai'],
 ['Anna nagar'],
 ['Gender : All',
  'Age Limit - 20-30',
  'Openings - 3',
  'Experience - 0-1 Year'],
 ['15,000 - 25,000', 'Media & Entertainment'],
 ['15,000 - 25,000', 'Media & Entertainment']]

In [26]:
# Printing the requirements content
for cont in cleaned_conts[3]:
    print(cont)

Gender : All
Age Limit - 20-30
Openings - 3
Experience - 0-1 Year


**We get a nested list `cleaned_conts` holding the necessary information. Its fourth element (index 3) is the 'Requirement', which differs from job to job, so we handle it separately.**  
  
**For example,  
Job-1 might list gender and age while Job-2 might list only the age in its requirements.**

In [27]:
for cont in cleaned_conts[3]:
    if cont.startswith('Gender :'):
        gender = cont[9:]
    if cont.startswith('Age Limit -'):
        age = cont[12:]
    if cont.startswith('Openings -'):
        opening = cont[11:]
    if cont.startswith('Experience - '):
        exp = cont[13:]
print(f"Gender : {gender}, Age : {age}, Openings : {opening}, Experience : {exp}")

Gender : All, Age : 20-30, Openings : 3, Experience : 0-1 Year
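
Fixed slice positions such as `cont[12:]` depend on the exact label length. A minimal sketch of a more flexible approach splits each item on its first ` : ` or ` - ` separator and collects the results in a dictionary. The helper name `parse_requirements` is hypothetical, and the approach assumes that labels never contain a separator surrounded by spaces.

```python
import re
from typing import Dict, List

def parse_requirements(items: List[str]) -> Dict[str, str]:
    # Split each "Label : value" or "Label - value" item on its first separator
    parsed = {}
    for item in items:
        parts = re.split(r"\s+[:-]\s+", item, maxsplit=1)
        if len(parts) == 2:
            label, value = parts
            parsed[label.strip()] = value.strip()
    return parsed

requirements = ['Gender : All', 'Age Limit - 20-30', 'Openings - 3', 'Experience - 0-1 Year']
info = parse_requirements(requirements)
print(info)

# A label that is missing on some job page falls back to a default value
print(repr(info.get('Qualification', ' ')))
```

Because the hyphen inside `20-30` has no spaces around it, the value is kept whole.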


**Similarly we store every value from the `cleaned_conts` to its corresponding variables**

In [28]:
field, desc = map(str, cleaned_conts[0])
print(f"Field = {field}, Description : {desc}")

Field = Media & Entertainment, Description : Search Engine Marketing Executive


In [29]:
sal, qual, city = map(str,cleaned_conts[1])
print(f"Salary = {sal}, Qualification : {qual}, City = {city}")

Salary = 15,000 - 25,000 p.m, Qualification : Under Graduate, City = Chennai
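
Tuple unpacking such as `sal, qual, city = ...` raises a `ValueError` when a page supplies fewer or more parts than expected. Below is a minimal sketch of a more forgiving approach, assuming that missing values should fall back to the same `" "` placeholder used earlier. The helper name `pad_parts` is hypothetical.

```python
from typing import List

def pad_parts(parts: List[str], size: int, default: str = " ") -> List[str]:
    # Trim extra parts and fill missing ones so unpacking always receives `size` values
    return (list(parts) + [default] * size)[:size]

sal, qual, city = pad_parts(['15,000 - 25,000 p.m', 'Under Graduate', 'Chennai'], 3)
print(sal, qual, city, sep=" | ")

# A page that lacks the city still unpacks cleanly
sal, qual, city = pad_parts(['15,000 - 25,000 p.m', 'Under Graduate'], 3)
print(repr(city))
```

Padding only helps when the missing value is the last one; if a middle value is absent, the remaining values shift left and need a content-based check instead.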


In [30]:
landmark = cleaned_conts[2][0]
print(f"Landmark : {landmark}")

Landmark : Anna nagar


**Now that we have retrieved all the necessary contents from this webpage, we consolidate Step 5.3 in a function 
`scrap_info()`**

In [31]:
def scrap_info(response):
    # This function is used to retrieve the information from the child web-pages
    gender = age = opening = exp = city = landmark = field = desc = sal = qual = " "
    
    # Get the contents from the <div> tags
    all_divs = response.find_all('div',class_ = 'location')
    conts = [div.text for div in all_divs]
    
    # Clean each retrieved data using the clean_text() function
    conts = [clean_text(cont) for cont in conts]
    
    # Store the needed contents in a list
    cleaned_conts = [] 

    for cont in conts:
        # temps stores the split elements
        temps = cont.split('|')
        
        temps = [temp.strip() for temp in temps]
        cleaned_conts.append(temps) 
        
    field, desc = map(str, cleaned_conts[0])   
    sal, qual, city = map(str,cleaned_conts[1])
    landmark = str(cleaned_conts[2][0])
    
    for cont in cleaned_conts[3]:
        if cont.startswith('Gender :'):
            gender = cont[9:]
        if cont.startswith('Age Limit -'):
            age = cont[12:]
        if cont.startswith('Openings -'):
            opening = cont[11:]
        if cont.startswith('Experience - '):
            exp = cont[13:]
    
    return (gender,age,opening,exp,city,landmark,desc,sal,qual,field)

In [32]:
# Scraping the information for the first url in the home webpage
scrap_info(res_1)

('All',
 '20-30',
 '3',
 '0-1 Year',
 'Chennai',
 'Anna nagar',
 'Search Engine Marketing Executive',
 '15,000 - 25,000 p.m',
 'Under Graduate',
 'Media & Entertainment')
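
A ten-element tuple is easy to misread, because the meaning of each position lives only in the `return` statement. As a sketch, the values can be paired with column names. The names in `COLUMNS` and the helper `to_record` are hypothetical, and their order is chosen to match the order returned by `scrap_info()`.

```python
# Hypothetical column names, in the same order as the tuple returned by scrap_info()
COLUMNS = ("gender", "age", "openings", "experience", "city",
           "landmark", "description", "salary", "qualification", "field")

def to_record(row):
    # Pair each value with its column name, and fail loudly on a length mismatch
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
    return dict(zip(COLUMNS, row))

row = ('All', '20-30', '3', '0-1 Year', 'Chennai', 'Anna nagar',
       'Search Engine Marketing Executive', '15,000 - 25,000 p.m',
       'Under Graduate', 'Media & Entertainment')

record = to_record(row)
print(record["city"], record["salary"])
```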

### STEP 5.4 : Extending to all URLs in a single webpage
Now that we have done this for a single URL, we extend it to all the URLs on a single webpage, which are stored in the variable `href`.

In [33]:
def ret_info(href):
    # This function takes the urls from the get_href() function and retrieves the data
    
    # The info_list is used to store all the acquired data
    info_list = []
    for url in href:
        # Request and parse each url and append the information to the info_list
        res_1 = parse(url)
        # Appending the scraped content to the info_list using the scrap_info() function
        info_list.append(scrap_info(res_1))
        
    # Convert the list to a np.array() and transpose it so that each row holds one attribute
    info_list = (np.array(info_list)).T
    return info_list

In [34]:
ret_info([url_1])

Request successful


array([['All'],
       ['20-30'],
       ['3'],
       ['0-1 Year'],
       ['Chennai'],
       ['Anna nagar'],
       ['Search Engine Marketing Executive'],
       ['15,000 - 25,000 p.m'],
       ['Under Graduate'],
       ['Media & Entertainment']], dtype='<U33')
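
The transpose turns a list of per-job rows into one row per attribute. The same regrouping can be sketched in plain Python with `zip(*rows)`. The two sample rows below are shortened, made-up stand-ins for real `scrap_info()` results.

```python
# Shortened, made-up rows: (gender, age, city) for two jobs
rows = [
    ("All", "20-30", "Chennai"),
    ("Female", "18-35", "Madurai"),
]

# zip(*rows) pairs up the first items of every row, then the second items, and so on
columns = [list(col) for col in zip(*rows)]
print(columns)

genders, ages, cities = columns
print(cities)
```

Each inner list now holds one attribute for all jobs, which is the shape needed to fill one DataFrame column at a time.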

### STEP 5.5 : Retrieving other information
In this step, we will retrieve other data such as the posting date, the closing date, the company name, and the roles required.

Since all the other data that we are looking for are present in `<div>` tags with the class name `companyName`, we retrieve them in the same way as in the steps above.

In [35]:
# Retrieving other information using the div tag
all_div = response.find_all("div",class_ = 'companyName')
all_datas = [div.text for div in all_div]
all_datas

['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tSearch Engine Marketing Executive\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t  \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t| \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\tSri global innovatiion\t\t\t\t\t\t\t\t\t\t\t\t\t \n',
 '\n\r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tPosted Date : 22-10-2023   | Open Until : 29-11-2023',
 '\n  Dealership Sales and Value Added Services Executive\t\t\t\t\t\t\t\t \n',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tGraphic Designer\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\t\t\t\t\t\t\t\t  \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t| \r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\tC2S ENTERPRISES\t\t\t\t\t\t\t\t\t\t\t\t\t \n',
 '\n\r\n\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tPosted Date : 17-11-2023   | Open Until : 29-11-2023',
 '\n  Animator | Assistant Graphic Designer | Graphic Designer | Social Media & Digital Marketing Manager\t\t\t\t\t\t\t\t \n',
 '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tActor\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\r\n\t\t\t\t\

**Ufffff!!!!  
The Same story!   
Clean it using the `clean_text()` function**

In [36]:
all_datas = [clean_text(data) for data in all_datas]
all_datas

['Search Engine Marketing Executive | Sri global innovatiion',
 'Posted Date : 22-10-2023 | Open Until : 29-11-2023',
 'Dealership Sales and Value Added Services Executive',
 'Graphic Designer | C2S ENTERPRISES',
 'Posted Date : 17-11-2023 | Open Until : 29-11-2023',
 'Animator | Assistant Graphic Designer | Graphic Designer | Social Media & Digital Marketing Manager',
 'Actor | C2S ENTERPRISES',
 'Posted Date : 17-11-2023 | Open Until : 29-11-2023',
 'Actor | Marketing and Social Media manager | Social Media & Digital Marketing Manager | Social Media Executive | Storyboard Artist',
 'Captain | Sri global innovatiion',
 'Posted Date : 21-11-2023 | Open Until : 29-11-2023',
 'Transport Coordinator',
 'Electrical Design Developer | SRG POWER CONTROL SYSTEM',
 'Posted Date : 25-07-2023 | Open Until : 30-11-2023',
 'Electrical Design Developer',
 'Billing Executive | SI Automobiles',
 'Posted Date : 31-08-2023 | Open Until : 30-11-2023',
 'Accounts Executive | Billing Executive',
 'Special

**From the above step, it is clear that the `all_datas` contains the information that we are looking for.  
We now manipulate the above list to retrieve the data.**
  
**The dates appear in every third element starting from the second one (index 1), so we collect them as follows.**

In [37]:
# Retrieving the post_date and the last_date to apply
post_date = [] 
last_date = []

for i in range(1,len(all_datas),3):
    post, last = map(str,all_datas[i].split('|'))
    
    post_date.append(post)
    last_date.append(last)

In [38]:
print(f"Post date : {post_date}\n")
print(f"Last date : {last_date}")

Post date : ['Posted Date : 22-10-2023 ', 'Posted Date : 17-11-2023 ', 'Posted Date : 17-11-2023 ', 'Posted Date : 21-11-2023 ', 'Posted Date : 25-07-2023 ', 'Posted Date : 31-08-2023 ', 'Posted Date : 06-09-2023 ', 'Posted Date : 11-09-2023 ', 'Posted Date : 12-09-2023 ', 'Posted Date : 12-09-2023 ']

Last date : [' Open Until : 29-11-2023', ' Open Until : 29-11-2023', ' Open Until : 29-11-2023', ' Open Until : 29-11-2023', ' Open Until : 30-11-2023', ' Open Until : 30-11-2023', ' Open Until : 30-11-2023', ' Open Until : 30-11-2023', ' Open Until : 30-11-2023', ' Open Until : 30-11-2023']


In [39]:
# Removing the text from the dates
post_date = [post.split(':')[1] for post in post_date]
last_date = [last.split(':')[1] for last in last_date]

In [40]:
print(f"Post date : {post_date}\n")
print(f"Last date : {last_date}")

Post date : [' 22-10-2023 ', ' 17-11-2023 ', ' 17-11-2023 ', ' 21-11-2023 ', ' 25-07-2023 ', ' 31-08-2023 ', ' 06-09-2023 ', ' 11-09-2023 ', ' 12-09-2023 ', ' 12-09-2023 ']

Last date : [' 29-11-2023', ' 29-11-2023', ' 29-11-2023', ' 29-11-2023', ' 30-11-2023', ' 30-11-2023', ' 30-11-2023', ' 30-11-2023', ' 30-11-2023', ' 30-11-2023']
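
The dates are still strings with stray spaces around them. A minimal sketch of converting them into real `date` objects with the standard library, assuming every value follows the day-month-year pattern seen above (the helper name `parse_date` is hypothetical):

```python
from datetime import date, datetime

def parse_date(text: str) -> date:
    # Strip the surrounding spaces, then read the value as day-month-year
    return datetime.strptime(text.strip(), "%d-%m-%Y").date()

post_date = [' 22-10-2023 ', ' 17-11-2023 ']
last_date = [' 29-11-2023', ' 29-11-2023']

posted = [parse_date(d) for d in post_date]
closing = [parse_date(d) for d in last_date]

# With real dates, the length of each posting window is a simple subtraction
for start, end in zip(posted, closing):
    print(start.isoformat(), end.isoformat(), (end - start).days)
```

Date objects also sort correctly, unlike `dd-mm-yyyy` strings.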


**The company names appear in every third element starting from the first one (index 0), so we collect them as follows.**

In [41]:
# Retrieving the company name
company_name = []

for i in range(0, len(all_datas), 3):
    # The part before '|' is not needed, so it is discarded into `_`
    _, name = all_datas[i].split('|')
    
    company_name.append(name)

In [42]:
company_name

[' Sri global innovatiion',
 ' C2S ENTERPRISES',
 ' C2S ENTERPRISES',
 ' Sri global innovatiion',
 ' SRG POWER CONTROL SYSTEM',
 ' SI Automobiles',
 ' Talentpro India HR Private Limited',
 ' ANNAI INFRA DEVELOPERS LIMITED',
 ' Darus infotech india pvt ltd',
 ' Darus infotech india pvt ltd']

**Finally, we retrieve the roles from `all_datas`, which appear on every third line, starting from the third line (index 2).**

In [43]:
# Retrieving the roles needed for each job
roles = []

for i in range(2,len(all_datas),3):
    role = all_datas[i]
    
    roles.append(role)
roles

['Dealership Sales and Value Added Services Executive',
 'Animator | Assistant Graphic Designer | Graphic Designer | Social Media & Digital Marketing Manager',
 'Actor | Marketing and Social Media manager | Social Media & Digital Marketing Manager | Social Media Executive | Storyboard Artist',
 'Transport Coordinator',
 'Electrical Design Developer',
 'Accounts Executive | Billing Executive',
 'Sewing Machine Operator',
 'Construction Laboratory & Field Technician | Construction Technician (Civil)- Wind Power Plant',
 'Editor | Graphic Designer',
 'Director of Photography | Editor | Graphic Designer']
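
The three loops above all walk through `all_datas` in steps of three. As a sketch of the same idea, assuming every job contributes exactly three entries in the order company, dates, roles, the list can be grouped into one record per job with slicing and `zip`. The `sample` list is a shortened, hand-written stand-in for `all_datas`, and the word before the `|` in the company entries is a placeholder, since the tutorial discards that part.

```python
# Shortened stand-in for all_datas; "Employer" is a placeholder label
sample = [
    'Employer | Sri global innovatiion',
    'Posted Date : 22-10-2023 | Open Until : 29-11-2023',
    'Dealership Sales and Value Added Services Executive',
    'Employer | C2S ENTERPRISES',
    'Posted Date : 17-11-2023 | Open Until : 29-11-2023',
    'Animator | Assistant Graphic Designer',
]

jobs = []
# Slices starting at offsets 0, 1 and 2 pick the company, date and role lines
for company_line, date_line, role_line in zip(sample[0::3], sample[1::3], sample[2::3]):
    company = company_line.split('|')[1].strip()
    # Keep only the value after the "Posted Date :" / "Open Until :" label
    posted, closes = (part.split(':')[1].strip() for part in date_line.split('|'))
    # A posting can list several roles separated by '|'
    role_list = [role.strip() for role in role_line.split('|')]
    jobs.append({"company": company, "posted": posted, "closes": closes, "roles": role_list})

for job in jobs:
    print(job)
```

Keeping the three values of a job together in one record avoids the risk of the separate lists drifting out of step with each other.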

**Consolidating Step 5.5 into the function `scrap_add_data()`**

In [44]:
def scrap_add_data(response):
    # This function is to get all the additional information from the main webpage
    post_date = [] 
    last_date = []
    roles = []
    company_name = []
    
    # Retrieving other information
    all_div = response.find_all("div",class_ = 'companyName')
    all_datas = [div.text for div in all_div]
    all_datas = [clean_text(data) for data in all_datas]

    # Retrieving the post_date and the last_date to apply
    for i in range(1, len(all_datas), 3):
        post, last = all_datas[i].split('|')
        post_date.append(post)
        last_date.append(last)
    
    # Removing the text from the dates
    post_date = [post.split(':')[1] for post in post_date]
    last_date = [last.split(':')[1] for last in last_date]

    # Retrieving the company name
    for i in range(0, len(all_datas), 3):
        # The part before '|' is not needed, so it is discarded into `_`
        _, name = all_datas[i].split('|')
        company_name.append(name)
        
    # Retrieving the roles needed for each job
    for i in range(2,len(all_datas),3):
        role = all_datas[i]
        roles.append(role)
    
    return (post_date,last_date,roles,company_name)

In [45]:
scrap_add_data(response)

([' 22-10-2023 ',
  ' 17-11-2023 ',
  ' 17-11-2023 ',
  ' 21-11-2023 ',
  ' 25-07-2023 ',
  ' 31-08-2023 ',
  ' 06-09-2023 ',
  ' 11-09-2023 ',
  ' 12-09-2023 ',
  ' 12-09-2023 '],
 [' 29-11-2023',
  ' 29-11-2023',
  ' 29-11-2023',
  ' 29-11-2023',
  ' 30-11-2023',
  ' 30-11-2023',
  ' 30-11-2023',
  ' 30-11-2023',
  ' 30-11-2023',
  ' 30-11-2023'],
 ['Dealership Sales and Value Added Services Executive',
  'Animator | Assistant Graphic Designer | Graphic Designer | Social Media & Digital Marketing Manager',
  'Actor | Marketing and Social Media manager | Social Media & Digital Marketing Manager | Social Media Executive | Storyboard Artist',
  'Transport Coordinator',
  'Electrical Design Developer',
  'Accounts Executive | Billing Executive',
  'Sewing Machine Operator',
  'Construction Laboratory & Field Technician | Construction Technician (Civil)- Wind Power Plant',
  'Editor | Graphic Designer',
  'Director of Photography | Editor | Graphic Designer'],
 [' Sri global innovatiion',

### STEP 6 : Updating the Data
We will collect all the data retrieved in Step 5 into a dictionary and then load it into a pandas `DataFrame`.

We will now use all the functions that we created during our journey:
- `parse()`
- `scrap_title()`
- `get_href()`
- `clean_text()`
- `scrap_info()`
- `ret_info()`
- `scrap_add_data()`
  
We will create a dictionary to store all of the above data for a single webpage.

> Remember: all we have done until now covers only the `base_url`. Uffff, we still need to extend it to all the webpages in `urls`.

In [46]:
# Initialise all the lists that store the required data
job_title = []
gender = []
age = []
opening = []
exp = []
city = []
landmark = []
field = []
desc = []
sal = []
qual = []
post_date = [] 
last_date = []
roles = []
company_name = []

In [47]:
url = base_url
url

'https://www.tnprivatejobs.tn.gov.in/Home/jobs'

### Appending functions 
Two appending functions add the scraped content to the respective lists: `append_data()` for the data from the child pages and `append_add_data()` for the additional data from the main page.

In [48]:
def append_data(info):
    # Append the data scraped from each child webpage to the respective lists
    gender.extend(info[0])
    age.extend(info[1])
    opening.extend(info[2])
    exp.extend(info[3])
    city.extend(info[4])
    landmark.extend(info[5])
    desc.extend(info[6])
    sal.extend(info[7])
    qual.extend(info[8])
    field.extend(info[9])

In [49]:
def append_add_data(info):
    # Append the additional data scraped from the main webpage to the respective lists
    post_date.extend(info[0])
    last_date.extend(info[1])
    roles.extend(info[2])
    company_name.extend(info[3])

### Main Function 
The main function `scrap_webpage()` scrapes the contents of one job-posting webpage.

In [50]:
def scrap_webpage(url):
    # Request and Parse the URL
    response = parse(url)
    
    # Get the child URLs for each job using the get_href() function
    hrefs = get_href(response)
    
    # Retrieve all the information from each child URL using the ret_info() and scrap_info() functions
    all_info = ret_info(hrefs)
    # Append all the data using the append_data() function
    append_data(all_info)
    
    # Retrieve the additional information from the main webpage
    all_add_info = scrap_add_data(response)
    # Append it using the append_add_data() function
    append_add_data(all_add_info)
    
    # Append the job titles found on the main webpage
    job_title.extend(scrap_title(response))

### Scraping through all the webpages
Now we repeat the web-scraping for all the pages in `urls`.

> Execute the lines below to scrape the contents of every webpage


![](https://imgur.com/KR2wVXH.png)
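
One way to run the scraper over every page is a plain loop that calls `scrap_webpage()` once per URL. The sketch below is an illustration and may differ from the code in the screenshot: `scrape_all`, its `delay` parameter, and the error handling are assumptions, and `fake_scrap_webpage` is a stand-in so that the example runs without network access.

```python
import time

def scrape_all(urls, scrape_one, delay=1.0):
    # Call scrape_one(url) for every URL, pausing between requests
    # so the server is not flooded; return the URLs that raised an error.
    failed = []
    for url in urls:
        try:
            scrape_one(url)
        except Exception as error:
            print(f"Skipping {url}: {error}")
            failed.append(url)
        time.sleep(delay)
    return failed

# Stand-in for scrap_webpage() that records the pages it was given
visited = []

def fake_scrap_webpage(url):
    if "bad" in url:
        raise ValueError("page could not be parsed")
    visited.append(url)

failed = scrape_all(["page-1", "bad-page", "page-2"], fake_scrap_webpage, delay=0)
print(visited, failed)
```

In the notebook the call would look like `scrape_all(urls, scrap_webpage)`. The pause between requests keeps the load on the portal low.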

### Dictionary
Convert the acquired data into a dictionary `dic`

In [52]:
# Converting the acquired data into a dictionary
dic = {
    "Job Title" : job_title,
    "Description" : desc,
    "Field" : field,
    "Company Name" : company_name,
    "City" : city,
    "Landmark" : landmark,
    "Post Date" : post_date, 
    "Last Date" : last_date,
    "Salary" : sal,
    "Gender" : gender,
    "Age" : age,
    "Experience" : exp,
    "Qualification" : qual,
    "Roles" : roles,
    "Openings" : opening
}
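
A DataFrame can only be built from this dictionary if every list has the same length; pandas raises a `ValueError` otherwise. This can happen when a page fails halfway through scraping. Below is a small sketch of a check that can be run first. `column_lengths` and `sample_dic` are hypothetical names with made-up sample values.

```python
def column_lengths(columns):
    # Map each column name to the number of values collected for it
    return {name: len(values) for name, values in columns.items()}

sample_dic = {
    "Job Title": ["Bus Captain", "VIDEO JOCKEY"],
    "City": ["Chennai", "Salem"],
    "Openings": ["30"],  # one value left out on purpose
}

lengths = column_lengths(sample_dic)
if len(set(lengths.values())) > 1:
    print(f"Columns have different lengths: {lengths}")
else:
    print("All columns line up")
```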

### DataFrame
After having created a dictionary, we will now create a Pandas dataframe `df` for easy viewing of the acquired data

In [53]:
# Creating a table using the DataFrame function of the pandas library for easy viewing
df = pd.DataFrame(dic)

In [54]:
df.head(5)

| | Job Title | Description | Field | Company Name | City | Landmark | Post Date | Last Date | Salary | Gender | Age | Experience | Qualification | Roles | Openings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Digital Marketing Executive | Search Engine Marketing Executive | Media & Entertainment | Sri global innovatiion | Chennai | Anna nagar | 22-10-2023 | 29-11-2023 | 15,000 - 25,000 p.m | All | 20-30 | 0-1 Year | Under Graduate | Dealership Sales and Value Added Services Exec... | 3 |
| 1 | Graphic Desginer | Graphic Designer | Media & Entertainment | C2S ENTERPRISES | Salem | SALEM | 17-11-2023 | 29-11-2023 | 10,000 - 15,000 p.m | All | 18-32 | Fresher | Under Graduate | Animator \| Assistant Graphic Designer \| Graphi... | 5 |
| 2 | VIDEO JOCKEY | Actor | Media & Entertainment | C2S ENTERPRISES | Salem | SALEM | 17-11-2023 | 29-11-2023 | 10,000 - 15,000 p.m | All | 18-25 | Fresher | Under Graduate - Bachelor of Arts , Bachelor o... | Actor \| Marketing and Social Media manager \| S... | 4 |
| 3 | Bus Captain | Captain | Tourism & Hospitality | Sri global innovatiion | Chennai | koyambedu | 21-11-2023 | 29-11-2023 | 15,000 - 25,000 p.m | Male | 18-35 | Fresher | SSLC | Transport Coordinator | 30 |
| 4 | Electrical design engineer | Electrical Design Developer | Electronics & Hardware | SRG POWER CONTROL SYSTEM | Coimbatore | 641107 | 25-07-2023 | 30-11-2023 | 10,000 - 15,000 p.m | All | 18-35 | Fresher | Under Graduate - Bachelor of Engineering / Tec... | Electrical Design Developer | 5 |


### STEP 7 : Storing the acquired data
We will store the data acquired during this web-scraping tutorial in a CSV file using the pandas `to_csv()` method.  
  
The contents go into the file `job_postings.csv`.

In [55]:
# Store the contents of the DataFrame df to a file
df.to_csv('job_postings.csv',index = False)

![](https://imgur.com/udNnm9r.png)
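
Salary values such as `15,000 - 25,000 p.m` contain a comma, so they are wrapped in double quotes in the CSV file. As a quick sketch, the standard-library `csv` module can be used to confirm that such a field is read back as a single value. The `text` string is a two-column, in-memory stand-in for `job_postings.csv`, not the real file.

```python
import csv
import io

# In-memory stand-in for job_postings.csv with two of its columns
text = 'Job Title,Salary\nBus Captain,"15,000 - 25,000 p.m"\n'

# DictReader uses the first line as the column names
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["Salary"])
```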

## Conclusion

In conclusion, this web scraping project extracted job postings from the Government of Tamil Nadu's jobs portal using Beautiful Soup and `requests` in Python. The result is a repeatable process that collects the job title, company, location, dates, salary, and eligibility details of each posting and stores them in a CSV file. It is essential to note that the project aims to follow ethical scraping practices and to respect the terms of use of the government portal. Beautiful Soup proved effective for parsing the HTML, and the resulting dataset is a useful reference for those looking for job opportunities listed on the portal.

**_We have now come to the end of this tutorial, and I hope you have learned the basics of web-scraping. Till the next tutorial, peace out!_**

In [56]:
print('Sayonara!!!')

Sayonara!!!


<h1><center>THE END !!!</center></h1>