## **Project for Indian Institute Of Business**

<br>

<br>


## **Problem Statement**

Scrape all job postings by area from the links given in the file ‘link_by_areas.csv’. For every link, loop through all the job postings by page, collecting all the information for a given posting. 

Create an output CSV file with job title, company, experience, salary, location, description, tags associated, function area, posting date, scraping date. 

# **Solution**

The task at hand is to build a Python scraper that collects data from the Naukri job site and saves the output as a CSV file for further analysis.


To accomplish this task, I will use ***Python*** to scrape the required data from the job search site ***Naukri***.

To begin with, the necessary libraries need to be imported.

**WebDriver**: drives a browser natively, as a user would, either locally or on a remote machine using the Selenium server. It marks a leap forward in browser automation.

**Selenium WebDriver**: refers to both the language bindings and the implementations of the individual browser controlling code. This is commonly referred to as just WebDriver.

**Beautiful Soup**: is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.

**Pandas**: is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Used here for dataframe manipulation.

In [1]:
# importing the libraries

from selenium import webdriver
import chromedriver_binary
from bs4 import BeautifulSoup
import time
import pandas as pd

In [2]:
joblinks = pd.read_csv('link_by_areas.csv')
joblinks

| | type | link |
|---|---|---|
| 0 | Accounting Jobs | https://www.naukri.com/accounting-jobs?xt=cats... |
| 1 | Interior Design Jobs | https://www.naukri.com/interior-design-jobs?xt... |
| 2 | Bank Jobs | https://www.naukri.com/bank-jobs?xt=catsrch&qf... |
| 3 | Content Writing Jobs | https://www.naukri.com/content-writing-jobs?xt... |
| 4 | Consultant Jobs | https://www.naukri.com/consultant-jobs?xt=cats... |
| 5 | Engineering Jobs | https://www.naukri.com/engineering-jobs?xt=cat... |
| 6 | Export Import Jobs | https://www.naukri.com/export-import-jobs?xt=c... |
| 7 | Merchandiser Jobs | https://www.naukri.com/merchandiser-jobs?xt=ca... |
| 8 | Security Jobs | https://www.naukri.com/security-jobs?xt=catsrc... |
| 9 | HR Jobs | https://www.naukri.com/hr-jobs?xt=catsrch&qf[]=12 |


# The **URL STRUCTURE** 

To scrape data page after page, the urls need to be built so that the page number can be changed automatically as we move from one page to the next.

When we look closely at the **urls**, we can see that they all share the domain **https://www.naukri.com/**, followed by the job type, then the **page number** and a question mark, as shown below. We need a way to make the page number generic so that we can navigate through the different pages automatically.
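As a minimal sketch of this structure, the snippet below splits one of the links with the standard library's `urllib.parse` and rebuilds it with a page number. The helper name `paged_url` is hypothetical, and the sketch assumes that later pages append `-<page>` to the job type, which is the pattern this notebook builds on.

```python
from urllib.parse import urlsplit


def paged_url(url, page):
    """Hypothetical helper: insert '-<page>' between the path and the query string."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}-{page}?{parts.query}"


url = "https://www.naukri.com/accounting-jobs?xt=catsrch&qf[]=1"
parts = urlsplit(url)

print(parts.path)         # /accounting-jobs
print(parts.query)        # xt=catsrch&qf[]=1
print(paged_url(url, 2))  # https://www.naukri.com/accounting-jobs-2?xt=catsrch&qf[]=1
```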

<br>

Let's first convert our **urls** to a list.

In [3]:
urls = joblinks['link'].tolist()

In [4]:
urls

['https://www.naukri.com/accounting-jobs?xt=catsrch&qf[]=1',
 'https://www.naukri.com/interior-design-jobs?xt=catsrch&qf[]=2',
 'https://www.naukri.com/bank-jobs?xt=catsrch&qf[]=6',
 'https://www.naukri.com/content-writing-jobs?xt=catsrch&qf[]=5',
 'https://www.naukri.com/consultant-jobs?xt=catsrch&qf[]=9',
 'https://www.naukri.com/engineering-jobs?xt=catsrch&qf[]=21',
 'https://www.naukri.com/export-import-jobs?xt=catsrch&qf[]=10',
 'https://www.naukri.com/merchandiser-jobs?xt=catsrch&qf[]=10',
 'https://www.naukri.com/security-jobs?xt=catsrch&qf[]=45',
 'https://www.naukri.com/hr-jobs?xt=catsrch&qf[]=12',
 'https://www.naukri.com/hotel-jobs?xt=catsrch&qf[]=4',
 'https://www.naukri.com/application-programming-jobs?xt=catsrch&qf[]=24.01',
 'https://www.naukri.com/client-server-jobs?xt=catsrch&qf[]=24.02',
 'https://www.naukri.com/dba-jobs?xt=catsrch&qf[]=24.03',
 'https://www.naukri.com/ecommerce-jobs?xt=catsrch&qf[]=24.12',
 'https://www.naukri.com/erp-jobs?xt=catsrch&qf[]=24.04',
 'h

<br>


To make the **urls** generic, we will use a library called **yarl** (Yet another URL library). All url parts (scheme, user, password, host, port, path, query and fragment) are accessible through yarl properties.

We will use **yarl** to access the different parts of the various **urls**.



In [5]:
!pip install yarl



In [6]:
from yarl import URL

# Collect the path of every url, e.g. '/accounting-jobs'
n=[]

for i in urls:
    n.append(URL(i).path)

In [7]:
n

['/accounting-jobs',
 '/interior-design-jobs',
 '/bank-jobs',
 '/content-writing-jobs',
 '/consultant-jobs',
 '/engineering-jobs',
 '/export-import-jobs',
 '/merchandiser-jobs',
 '/security-jobs',
 '/hr-jobs',
 '/hotel-jobs',
 '/application-programming-jobs',
 '/client-server-jobs',
 '/dba-jobs',
 '/ecommerce-jobs',
 '/erp-jobs',
 '/vlsi-jobs',
 '/mainframe-jobs',
 '/middleware-jobs',
 '/mobile-jobs',
 '/network-administrator-jobs',
 '/information-technology-jobs',
 '/testing-jobs',
 '/system-programming-jobs',
 '/edp-jobs',
 '/telecom-software-jobs',
 '/telecom-jobs',
 '/bpo-jobs',
 '/legal-jobs',
 '/marketing-jobs',
 '/packaging-jobs',
 '/pharma-jobs',
 '/maintenance-jobs',
 '/logistics-jobs',
 '/sales-jobs',
 '/secretary-jobs',
 '/corporate-planning-jobs',
 '/site-engineering-jobs',
 '/film-jobs',
 '/teaching-jobs',
 '/airline-jobs',
 '/graphic-designer-jobs',
 '/shipping-jobs',
 '/analytics-jobs',
 '/business-intelligence-jobs']

In [8]:
# Collect the query string of every url, e.g. 'xt=catsrch&qf[]=1'
m=[]

for i in urls:
    m.append(URL(i).query_string)

In [9]:
m

['xt=catsrch&qf[]=1',
 'xt=catsrch&qf[]=2',
 'xt=catsrch&qf[]=6',
 'xt=catsrch&qf[]=5',
 'xt=catsrch&qf[]=9',
 'xt=catsrch&qf[]=21',
 'xt=catsrch&qf[]=10',
 'xt=catsrch&qf[]=10',
 'xt=catsrch&qf[]=45',
 'xt=catsrch&qf[]=12',
 'xt=catsrch&qf[]=4',
 'xt=catsrch&qf[]=24.01',
 'xt=catsrch&qf[]=24.02',
 'xt=catsrch&qf[]=24.03',
 'xt=catsrch&qf[]=24.12',
 'xt=catsrch&qf[]=24.04',
 'xt=catsrch&qf[]=24.05',
 'xt=catsrch&qf[]=24.13',
 'xt=catsrch&qf[]=24.15',
 'xt=catsrch&qf[]=24.14',
 'xt=catsrch&qf[]=24.06',
 'xt=catsrch&qf[]=24',
 'xt=catsrch&qf[]=24.08',
 'xt=catsrch&qf[]=24.09',
 'xt=catsrch&qf[]=24.11',
 'xt=catsrch&qf[]=24.10',
 'xt=catsrch&qf[]=37',
 'xt=catsrch&qf[]=8',
 'xt=catsrch&qf[]=13',
 'xt=catsrch&qf[]=15',
 'xt=catsrch&qf[]=18',
 'xt=catsrch&qf[]=16',
 'xt=catsrch&qf[]=19',
 'xt=catsrch&qf[]=14',
 'xt=catsrch&qf[]=22',
 'xt=catsrch&qf[]=11',
 'xt=catsrch&qf[]=7',
 'xt=catsrch&qf[]=20',
 'xt=catsrch&qf[]=43',
 'xt=catsrch&qf[]=36',
 'xt=catsrch&qf[]=44',
 'xt=catsrch&qf[]=3',
 

<br>

Now that we have extracted the various parts of the **urls**, we will stitch them back together, this time adding a placeholder **-{}** after the ***job type part*** of the url.

In [10]:
gen_urls=[]

# n[i] already starts with '/', so each result contains '//' after the domain
for i in range(len(n)):
    url='https://www.naukri.com/'+n[i]+'-{}?'+m[i]
    gen_urls.append(url)

As seen below, each url now has a placeholder **{}** where we can insert a page number to navigate to different pages as we scrape our data.

In [11]:
gen_urls

['https://www.naukri.com//accounting-jobs-{}?xt=catsrch&qf[]=1',
 'https://www.naukri.com//interior-design-jobs-{}?xt=catsrch&qf[]=2',
 'https://www.naukri.com//bank-jobs-{}?xt=catsrch&qf[]=6',
 'https://www.naukri.com//content-writing-jobs-{}?xt=catsrch&qf[]=5',
 'https://www.naukri.com//consultant-jobs-{}?xt=catsrch&qf[]=9',
 'https://www.naukri.com//engineering-jobs-{}?xt=catsrch&qf[]=21',
 'https://www.naukri.com//export-import-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com//merchandiser-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com//security-jobs-{}?xt=catsrch&qf[]=45',
 'https://www.naukri.com//hr-jobs-{}?xt=catsrch&qf[]=12',
 'https://www.naukri.com//hotel-jobs-{}?xt=catsrch&qf[]=4',
 'https://www.naukri.com//application-programming-jobs-{}?xt=catsrch&qf[]=24.01',
 'https://www.naukri.com//client-server-jobs-{}?xt=catsrch&qf[]=24.02',
 'https://www.naukri.com//dba-jobs-{}?xt=catsrch&qf[]=24.03',
 'https://www.naukri.com//ecommerce-jobs-{}?xt=catsrch&qf[]=24.12',
 

Let's confirm that we still have the same number of urls as in the original dataset.

In [12]:
len(urls)

45

In [13]:
len(gen_urls)

45
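The generated urls above contain a doubled slash after the domain, because every path returned by yarl already begins with `/`. The sketch below shows one way to build the same templates without it and then expand them into concrete page urls. The two-item `paths` and `queries` lists are sample stand-ins for `n` and `m`, and `BASE`, `page_urls` and `last_page` are hypothetical names introduced only for this illustration.

```python
BASE = "https://www.naukri.com"  # no trailing slash: every path already starts with "/"

# Sample values standing in for the lists n (paths) and m (query strings)
paths = ["/accounting-jobs", "/hr-jobs"]
queries = ["xt=catsrch&qf[]=1", "xt=catsrch&qf[]=12"]

# "{{}}" in an f-string leaves a literal "{}" for str.format to fill in later
templates = [f"{BASE}{path}-{{}}?{query}" for path, query in zip(paths, queries)]


def page_urls(templates, last_page):
    """Yield one concrete url per (template, page) pair, for pages 1..last_page."""
    for template in templates:
        for page in range(1, last_page + 1):
            yield template.format(page)


for url in page_urls(templates, last_page=2):
    print(url)
```

Using a generator keeps the page loop separate from the scraping logic, so the number of pages per job type can be changed in one place.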

# **Scraping Data From Naukri.Com**

To scrape the required data from naukri.com, we will follow a simple two-step process:



1.   Define a dataframe which will contain our data.
2.   Write generic Python code which will extract the data as required.



<br>


**Create a dataframe to contain our scraped data**

In [15]:
!pip install chromedriver_binary==108.0.5359.71.0
!pip install selenium
!pip install msedge-selenium-tools
!pip install bs4

Collecting chromedriver_binary==108.0.5359.71.0
  Using cached chromedriver_binary-108.0.5359.71.0-py3-none-any.whl
Installing collected packages: chromedriver-binary
  Attempting uninstall: chromedriver-binary
    Found existing installation: chromedriver-binary 94.0.4606.113.0
    Uninstalling chromedriver-binary-94.0.4606.113.0:
      Successfully uninstalled chromedriver-binary-94.0.4606.113.0
Successfully installed chromedriver-binary-108.0.5359.71.0


In [16]:
!pip install chromedriver_binary==94.0.4606.113.0

Collecting chromedriver_binary==94.0.4606.113.0
  Using cached chromedriver_binary-94.0.4606.113.0-py3-none-any.whl
Installing collected packages: chromedriver-binary
  Attempting uninstall: chromedriver-binary
    Found existing installation: chromedriver-binary 108.0.5359.71.0
    Uninstalling chromedriver-binary-108.0.5359.71.0:
      Successfully uninstalled chromedriver-binary-108.0.5359.71.0
Successfully installed chromedriver-binary-94.0.4606.113.0


**Write generic Python code to scrape the required data**

In [17]:
gen_urls

['https://www.naukri.com//accounting-jobs-{}?xt=catsrch&qf[]=1',
 'https://www.naukri.com//interior-design-jobs-{}?xt=catsrch&qf[]=2',
 'https://www.naukri.com//bank-jobs-{}?xt=catsrch&qf[]=6',
 'https://www.naukri.com//content-writing-jobs-{}?xt=catsrch&qf[]=5',
 'https://www.naukri.com//consultant-jobs-{}?xt=catsrch&qf[]=9',
 'https://www.naukri.com//engineering-jobs-{}?xt=catsrch&qf[]=21',
 'https://www.naukri.com//export-import-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com//merchandiser-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com//security-jobs-{}?xt=catsrch&qf[]=45',
 'https://www.naukri.com//hr-jobs-{}?xt=catsrch&qf[]=12',
 'https://www.naukri.com//hotel-jobs-{}?xt=catsrch&qf[]=4',
 'https://www.naukri.com//application-programming-jobs-{}?xt=catsrch&qf[]=24.01',
 'https://www.naukri.com//client-server-jobs-{}?xt=catsrch&qf[]=24.02',
 'https://www.naukri.com//dba-jobs-{}?xt=catsrch&qf[]=24.03',
 'https://www.naukri.com//ecommerce-jobs-{}?xt=catsrch&qf[]=24.12',
 

In [46]:
from datetime import date

columns = ['Job_Title','Experience','Company','Scraping_Date', 'Salary','Location','Tags_Associated','Posting_Date']
rows = []  # one dict per posting that passes every check below

# Scraping date in dd/mm/YYYY format, computed once per run
date_today = date.today().strftime("%d/%m/%Y")

for page in range(1,2):
    for urll in gen_urls[0:1]:
        url = urll.format(page)
        # Raw string so the backslashes in the Windows path are not treated as escapes
        driver = webdriver.Chrome(r'E:\The Full Stack Data Science Bootcamp\Web Scraper Practises\Job Board Data Web Scrapping and Automation with Python PROJECT\chromedriver')
        driver.get(url)

        # Give the page time to render before reading its source
        time.sleep(5)

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()  # the HTML is captured, so release the browser

        results = soup.find(class_='list')
        job_elems = results.find_all('article',class_='jobTuple bgWhite br4 mb-8')

        for job_elem in job_elems:
            Job_Title = job_elem.find('a', class_='title fw500 ellipsis')
            print(Job_Title.text)

            # Experience (the posting is skipped when it is missing)
            Exp = job_elem.find('li',class_='fleft grey-text br2 placeHolderLi experience')
            Exp_span = Exp.find('span',class_='ellipsis fleft fs12 lh16 expwdth') if Exp is not None else None
            if Exp_span is None:
                continue
            Experience = Exp_span.text
            print(Experience)

            # Company
            Company = job_elem.find('a',class_='subTitle ellipsis fleft')
            print(Company.text)

            # Date scraped
            print(date_today)

            # Salary (the posting is skipped when it is missing)
            Sal = job_elem.find('li',class_='fleft grey-text br2 placeHolderLi salary')
            Sal_span = Sal.find('span',class_='ellipsis fleft fs12 lh16') if Sal is not None else None
            if Sal_span is None:
                continue
            Salary = Sal_span.text
            print(Salary)

            # Location: None when the <li> is absent, skip when only the inner <span> is absent
            Loc = job_elem.find('li',class_='fleft grey-text br2 placeHolderLi location')
            if Loc is not None:
                Loc_exp = Loc.find('span',class_='ellipsis fleft fs12 lh16 locWdth')
                if Loc_exp is None:
                    continue
                Location = Loc_exp.text
            else:
                Location = None
            print(Location)

            # Tags (the posting is skipped when they are missing)
            tags = job_elem.find('li',class_='fleft fs12 grey-text lh16 dot')
            if tags is None:
                continue
            assoc_tags = tags.text
            print(assoc_tags)

            # Date the job was posted, handled like Location
            # (named date_div so it does not shadow datetime.date)
            date_div = job_elem.find('div', class_='type br2 fleft grey')
            if date_div is not None:
                date_posted = date_div.find('span',class_='fleft fw500')
                if date_posted is None:
                    continue
                Posting_Date = date_posted.text
            else:
                Posting_Date = None
            print(Posting_Date)

            rows.append({'Job_Title':Job_Title.text,'Experience':Experience,'Company':Company.text,'Scraping_Date':date_today, 'Salary':Salary,'Location':Location,'Tags_Associated':assoc_tags,'Posting_Date':Posting_Date})

# Build the dataframe once; DataFrame.append is deprecated and was removed in pandas 2.0
df = pd.DataFrame(rows, columns=columns)
                  
        

Senior Executive/Assistant Manager - Chartered Accountant
0-1 Yrs
Encube Ethicals
13/12/2022
Not disclosed
Mumbai (All Areas)
chartered accountant
None
Accounts Executive - Receivable
0-3 Yrs
Marriott
13/12/2022
Not disclosed
Kochi/Cochin
Supervisor
4 Days Ago
Finance and Accounting Drive - Accenture Operations
0-1 Yrs
Accenture
13/12/2022
Not disclosed
Gurgaon/Gurugram
Financial control
None
Senior Executive Finance & Accounts
0-1 Yrs
Max Healthcare
13/12/2022
6,00,000 - 7,50,000 PA.
Walk-in drive -(US Accounts Executive) Hyderabad
0-1 Yrs
SilverXis
13/12/2022
2,00,000 - 3,00,000 PA.
Cashier Accounts / Purchase Executives freshers
0-3 Yrs
Iioa Trainings and Placements
13/12/2022
Not disclosed
Hyderabad/Secunderabad
Accounts
None
Accounts executive
0-2 Yrs
Myriad Solutionz
13/12/2022
Not disclosed
Ahmedabad
Tally
None
Accounts Executive and Administrator
0-2 Yrs
Futurenet
13/12/2022
Not disclosed
Accounts Executive 0/5
0-5 Yrs
Innomech Technologies
13/12/2022
3,00,000 - 5,00,000 PA.
Hy



0-3 Yrs
Collab Accounting (au)
13/12/2022
1,50,000 - 6,00,000 PA.
Ahmedabad(Manekbaugh +1)
Financial Reporting
6 Days Ago
Senior Accounts Executive
5-9 Yrs
PVR Cinemas
13/12/2022
Not disclosed
Mumbai Suburban
accounts payable
None
Accounting Manager
0-2 Yrs
Hdfc Bank
13/12/2022
Not disclosed
Mumbai (All Areas)(Churchgate)
Accounting
4 Days Ago
Accounts Executive/Sr. Accounts Executive/MIS Executive
5-15 Yrs
Della Group
13/12/2022
Not disclosed
Lonavala
TDS
None
Account Manager - Ent / Govt
0-3 Yrs
Check Point Software Technologies
13/12/2022
Not disclosed
New Delhi
Training
None
Account Executive -Sports
0-2 Yrs
Paytm
13/12/2022
Not disclosed
Mumbai
Ticketing
10 Days Ago
Executive/SR Executive Accounts
0-1 Yrs
THE Central Financial Credit And Investment Co Operative India Ltd
13/12/2022
1,75,000 - 2,25,000 PA.
Ernakulam
Excel Sheet
None
Opening For Accounts Payable



0-5 Yrs
RRD
13/12/2022
2,00,000 - 7,00,000 PA.
Chennai
Accounts Payable
None
Opening For Accounts Billing , AP and AR
0-3 Yrs
RRD
13/12/2022
1,75,000 - 3,50,000 PA.
Chennai
Bcom
None
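The scraping cell repeats the same pattern for each field: look up an element, check it for `None`, then read its text. A minimal sketch of how that check could be factored out is shown below. The helpers `text_or_none` and `first_missing` are hypothetical and not part of the notebook, and `SimpleNamespace` objects stand in for parsed tags, since anything with a `.text` attribute behaves the same way here.

```python
from types import SimpleNamespace


def text_or_none(node):
    """Return the stripped text of a parsed element, or None if the element is missing."""
    return node.text.strip() if node is not None else None


def first_missing(record, required):
    """Return the name of the first required field whose value is None, else None."""
    for field in required:
        if record.get(field) is None:
            return field
    return None


# Stand-ins for parsed tags: a real run would pass the results of find()
salary_tag = SimpleNamespace(text=" Not disclosed ")
location_tag = None  # e.g. find() matched nothing

record = {
    "Salary": text_or_none(salary_tag),
    "Location": text_or_none(location_tag),
}

print(record)                                         # {'Salary': 'Not disclosed', 'Location': None}
print(first_missing(record, ["Salary", "Location"]))  # Location
```

With helpers like these, the decision to skip a posting or keep it with an empty field is made in one place instead of once per field.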




# Let's observe the data we have scraped and saved in our pandas dataframe

In [47]:
df

| | Job_Title | Experience | Company | Scraping_Date | Salary | Location | Tags_Associated | Posting_Date |
|---|---|---|---|---|---|---|---|---|
| 0 | Senior Executive/Assistant Manager - Chartered... | 0-1 Yrs | Encube Ethicals | 13/12/2022 | Not disclosed | Mumbai (All Areas) | chartered accountant | |
| 1 | Accounts Executive - Receivable | 0-3 Yrs | Marriott | 13/12/2022 | Not disclosed | Kochi/Cochin | Supervisor | 4 Days Ago |
| 2 | Finance and Accounting Drive - Accenture Opera... | 0-1 Yrs | Accenture | 13/12/2022 | Not disclosed | Gurgaon/Gurugram | Financial control | |
| 3 | Cashier Accounts / Purchase Executives freshers | 0-3 Yrs | Iioa Trainings and Placements | 13/12/2022 | Not disclosed | Hyderabad/Secunderabad | Accounts | |
| 4 | Accounts executive | 0-2 Yrs | Myriad Solutionz | 13/12/2022 | Not disclosed | Ahmedabad | Tally | |
| 5 | Accounts Executive 0/5 | 0-5 Yrs | Innomech Technologies | 13/12/2022 | 3,00,000 - 5,00,000 PA. | Hybrid - Hubli, Mysore/Mysuru, Bangalore/Benga... | accounting | |
| 6 | Frond Desk Executive cum Accountant @ Podar Pr... | 0-5 Yrs | Podar Education Network | 13/12/2022 | 1,25,000 - 2,00,000 PA. | Junagadh | follow ups | |
| 7 | Accounts Executive (Australian Process) | 0-3 Yrs | Collab Accounting (au) | 13/12/2022 | 1,50,000 - 6,00,000 PA. | Ahmedabad(Manekbaugh +1) | Financial Reporting | 6 Days Ago |
| 8 | Senior Accounts Executive | 5-9 Yrs | PVR Cinemas | 13/12/2022 | Not disclosed | Mumbai Suburban | accounts payable | |
| 9 | Accounting Manager | 0-2 Yrs | Hdfc Bank | 13/12/2022 | Not disclosed | Mumbai (All Areas)(Churchgate) | Accounting | 4 Days Ago |


In [49]:
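# index=False keeps the row index out of the file, so the CSV holds only the scraped columns
df.to_csv('Output_csv.csv', index=False)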