## **Situation**

Scrape all job postings by area from the links given in the file ‘link_by_areas.csv’. For each link, loop through the result pages and collect all the information for every posting.

Create an output CSV file with the following columns: job title, company, experience, salary, location, description, associated tags, function area, posting date, and scraping date.

In [3]:
import pandas as pd
from selenium import webdriver
import chromedriver_binary
from bs4 import BeautifulSoup
import time

In [7]:
joblinks = pd.read_csv('link_by_areas.csv')
print(joblinks.shape)
joblinks.head()

(45, 2)


Unnamed: 0,type,link
0,Accounting Jobs,https://www.naukri.com/accounting-jobs?xt=cats...
1,Interior Design Jobs,https://www.naukri.com/interior-design-jobs?xt...
2,Bank Jobs,https://www.naukri.com/bank-jobs?xt=catsrch&qf...
3,Content Writing Jobs,https://www.naukri.com/content-writing-jobs?xt...
4,Consultant Jobs,https://www.naukri.com/consultant-jobs?xt=cats...


## Working on the links 

In [8]:
urls = joblinks.link.tolist()
urls

['https://www.naukri.com/accounting-jobs?xt=catsrch&qf[]=1',
 'https://www.naukri.com/interior-design-jobs?xt=catsrch&qf[]=2',
 'https://www.naukri.com/bank-jobs?xt=catsrch&qf[]=6',
 'https://www.naukri.com/content-writing-jobs?xt=catsrch&qf[]=5',
 'https://www.naukri.com/consultant-jobs?xt=catsrch&qf[]=9',
 'https://www.naukri.com/engineering-jobs?xt=catsrch&qf[]=21',
 'https://www.naukri.com/export-import-jobs?xt=catsrch&qf[]=10',
 'https://www.naukri.com/merchandiser-jobs?xt=catsrch&qf[]=10',
 'https://www.naukri.com/security-jobs?xt=catsrch&qf[]=45',
 'https://www.naukri.com/hr-jobs?xt=catsrch&qf[]=12',
 'https://www.naukri.com/hotel-jobs?xt=catsrch&qf[]=4',
 'https://www.naukri.com/application-programming-jobs?xt=catsrch&qf[]=24.01',
 'https://www.naukri.com/client-server-jobs?xt=catsrch&qf[]=24.02',
 'https://www.naukri.com/dba-jobs?xt=catsrch&qf[]=24.03',
 'https://www.naukri.com/ecommerce-jobs?xt=catsrch&qf[]=24.12',
 'https://www.naukri.com/erp-jobs?xt=catsrch&qf[]=24.04',
 'h

To make the **urls** generic, we will use a library called **yarl** (Yet Another URL Library). It exposes every part of a URL (scheme, user, password, host, port, path, query and fragment) as a property.

We will use **yarl** to access the path and the query string of the various **urls**.
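
For readers who do not have yarl installed, the same two parts can be read with the standard library. The block below is a minimal sketch using `urllib.parse.urlsplit` instead of yarl; it is not the approach this notebook takes, only an illustration of which part of the URL is the path and which is the query string.

```python
from urllib.parse import urlsplit

# First URL from the list above
url = 'https://www.naukri.com/accounting-jobs?xt=catsrch&qf[]=1'

parts = urlsplit(url)

print(parts.scheme)  # https
print(parts.netloc)  # www.naukri.com
print(parts.path)    # /accounting-jobs  (the job type, with a leading slash)
print(parts.query)   # xt=catsrch&qf[]=1 (everything after the '?')
```

Note that the path keeps its leading slash, which matters when it is later joined back onto the host.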

In [15]:
from yarl import URL

n = []

for i in urls:
    n.append(URL(i).path) #getting the job type part of the url using yarl property: path

In [16]:
print(n)

['/accounting-jobs', '/interior-design-jobs', '/bank-jobs', '/content-writing-jobs', '/consultant-jobs', '/engineering-jobs', '/export-import-jobs', '/merchandiser-jobs', '/security-jobs', '/hr-jobs', '/hotel-jobs', '/application-programming-jobs', '/client-server-jobs', '/dba-jobs', '/ecommerce-jobs', '/erp-jobs', '/vlsi-jobs', '/mainframe-jobs', '/middleware-jobs', '/mobile-jobs', '/network-administrator-jobs', '/information-technology-jobs', '/testing-jobs', '/system-programming-jobs', '/edp-jobs', '/telecom-software-jobs', '/telecom-jobs', '/bpo-jobs', '/legal-jobs', '/marketing-jobs', '/packaging-jobs', '/pharma-jobs', '/maintenance-jobs', '/logistics-jobs', '/sales-jobs', '/secretary-jobs', '/corporate-planning-jobs', '/site-engineering-jobs', '/film-jobs', '/teaching-jobs', '/airline-jobs', '/graphic-designer-jobs', '/shipping-jobs', '/analytics-jobs', '/business-intelligence-jobs']


In [13]:
m = []

for i in urls:
    m.append(URL(i).query_string)

In [14]:
print(m)

['xt=catsrch&qf[]=1', 'xt=catsrch&qf[]=2', 'xt=catsrch&qf[]=6', 'xt=catsrch&qf[]=5', 'xt=catsrch&qf[]=9', 'xt=catsrch&qf[]=21', 'xt=catsrch&qf[]=10', 'xt=catsrch&qf[]=10', 'xt=catsrch&qf[]=45', 'xt=catsrch&qf[]=12', 'xt=catsrch&qf[]=4', 'xt=catsrch&qf[]=24.01', 'xt=catsrch&qf[]=24.02', 'xt=catsrch&qf[]=24.03', 'xt=catsrch&qf[]=24.12', 'xt=catsrch&qf[]=24.04', 'xt=catsrch&qf[]=24.05', 'xt=catsrch&qf[]=24.13', 'xt=catsrch&qf[]=24.15', 'xt=catsrch&qf[]=24.14', 'xt=catsrch&qf[]=24.06', 'xt=catsrch&qf[]=24', 'xt=catsrch&qf[]=24.08', 'xt=catsrch&qf[]=24.09', 'xt=catsrch&qf[]=24.11', 'xt=catsrch&qf[]=24.10', 'xt=catsrch&qf[]=37', 'xt=catsrch&qf[]=8', 'xt=catsrch&qf[]=13', 'xt=catsrch&qf[]=15', 'xt=catsrch&qf[]=18', 'xt=catsrch&qf[]=16', 'xt=catsrch&qf[]=19', 'xt=catsrch&qf[]=14', 'xt=catsrch&qf[]=22', 'xt=catsrch&qf[]=11', 'xt=catsrch&qf[]=7', 'xt=catsrch&qf[]=20', 'xt=catsrch&qf[]=43', 'xt=catsrch&qf[]=36', 'xt=catsrch&qf[]=44', 'xt=catsrch&qf[]=3', 'xt=catsrch&qf[]=82', 'xt=catsrch&qf[]=81'

In [18]:
# Adding '-{}' before the query string as a placeholder for the page number

common_url = []

for i in range(len(n)):
    url = 'https://www.naukri.com' + n[i] + '-{}?' + m[i] # n[i] already starts with '/'
    common_url.append(url)

In [19]:
common_url

['https://www.naukri.com/accounting-jobs-{}?xt=catsrch&qf[]=1',
 'https://www.naukri.com/interior-design-jobs-{}?xt=catsrch&qf[]=2',
 'https://www.naukri.com/bank-jobs-{}?xt=catsrch&qf[]=6',
 'https://www.naukri.com/content-writing-jobs-{}?xt=catsrch&qf[]=5',
 'https://www.naukri.com/consultant-jobs-{}?xt=catsrch&qf[]=9',
 'https://www.naukri.com/engineering-jobs-{}?xt=catsrch&qf[]=21',
 'https://www.naukri.com/export-import-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com/merchandiser-jobs-{}?xt=catsrch&qf[]=10',
 'https://www.naukri.com/security-jobs-{}?xt=catsrch&qf[]=45',
 'https://www.naukri.com/hr-jobs-{}?xt=catsrch&qf[]=12',
 'https://www.naukri.com/hotel-jobs-{}?xt=catsrch&qf[]=4',
 'https://www.naukri.com/application-programming-jobs-{}?xt=catsrch&qf[]=24.01',
 'https://www.naukri.com/client-server-jobs-{}?xt=catsrch&qf[]=24.02',
 'https://www.naukri.com/dba-jobs-{}?xt=catsrch&qf[]=24.03',
 'https://www.naukri.com/ecommerce-jobs-{}?xt=catsrch&qf[]=24.12',
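
The `{}` in each entry is a placeholder that `str.format` fills with a page number. A minimal sketch, assuming a template of the same shape as the entries above; the square brackets in the query string are left untouched because `format` only acts on curly braces.

```python
# Template of the same shape as the entries in common_url
template = 'https://www.naukri.com/accounting-jobs-{}?xt=catsrch&qf[]=1'

# Fill the placeholder for the first three result pages
page_urls = [template.format(page) for page in range(1, 4)]

for page_url in page_urls:
    print(page_url)
```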
 

## **Scraping the data**

In [20]:
df = pd.DataFrame(columns=['Job_Title','Experience','Company','Scraping_Date','Salary','Location','Tags_Associated','Posting_Date'])
df

Unnamed: 0,Job_Title,Experience,Company,Scraping_Date,Salary,Location,Tags_Associated,Posting_Date


In [None]:
# The code for scraping the web
from datetime import date

SPAN = 'ellipsis fleft fs12 lh16' # class of the span that holds the experience, salary and location text
date_today = date.today().strftime('%d/%m/%Y') # scraping date as dd/mm/YYYY
rows = [] # one dictionary per job posting

def find_text(parent, name, class_):
    # text of the first matching tag, or None when the parent or the tag is missing
    tag = parent.find(name, class_=class_) if parent is not None else None
    return tag.text.strip() if tag is not None else None

for page in range(1,2):
    for iurl in common_url:
        url = iurl.format(page)
        driver = webdriver.Chrome(r'C:\Program Files\Chromedriver') # raw string keeps the backslashes literal
        driver.get(url)

        time.sleep(10) # Time between the searches to avoid being detected as a bot

        soup = BeautifulSoup(driver.page_source, 'html5lib')

        driver.quit() # close the browser and stop the chromedriver process

        results = soup.find(class_='list')
        if results is None: # no job list on this page
            continue
        job_elements = results.find_all('article', class_='jobTuple bgWhite br4 mb-8') # class where job information is located

        for job_element in job_elements:
            # containers for the experience, salary, location and posted date
            exp = job_element.find('li', class_='fleft grey-text br2 placeHolderLi experience')
            salary = job_element.find('li', class_='fleft grey-text br2 placeHolderLi salary')
            location = job_element.find('li', class_='fleft grey-text br2 placeHolderLi location')
            posted = job_element.find('div', class_=['type br2 fleft grey','type br2 fleft green'])

            # a missing field is stored as None instead of skipping the rest of the posting
            row = {'Job_Title': find_text(job_element, 'a', 'title fw500 ellipsis'),
                   'Experience': find_text(exp, 'span', SPAN),
                   'Company': find_text(job_element, 'a', 'subtitle ellipsis fleft'),
                   'Scraping_Date': date_today,
                   'Salary': find_text(salary, 'span', SPAN),
                   'Location': find_text(location, 'span', SPAN),
                   'Tags_Associated': find_text(job_element, 'li', 'fleft fs12 grey-text lh16 dot'),
                   'Posting_Date': find_text(posted, 'span', 'fleft fw500')}
            print(row)
            rows.append(row)

In [None]:
# Build the DataFrame from the collected rows, keeping the column order defined earlier
df = pd.DataFrame(rows, columns=df.columns)
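
Not every posting carries every field, and calling `.find` or `.text` on a missing tag raises an `AttributeError`. The sketch below shows one way to read optional fields safely. The sample markup is made up: it reuses the class names from the scraping cell, but the real page structure may differ, and `find_text` is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second card has no salary element at all
SAMPLE_HTML = """
<div class="list">
  <article class="jobTuple bgWhite br4 mb-8">
    <a class="title fw500 ellipsis">Accountant</a>
    <li class="fleft grey-text br2 placeHolderLi salary">
      <span class="ellipsis fleft fs12 lh16">Not disclosed</span>
    </li>
  </article>
  <article class="jobTuple bgWhite br4 mb-8">
    <a class="title fw500 ellipsis">Auditor</a>
  </article>
</div>
"""

def find_text(parent, name, class_):
    """Return the stripped text of the first matching tag, or None if absent."""
    if parent is None:
        return None
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else None

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
cards = soup.find_all('article', class_='jobTuple bgWhite br4 mb-8')

parsed = []
for card in cards:
    salary_li = card.find('li', class_='fleft grey-text br2 placeHolderLi salary')
    parsed.append({
        'Job_Title': find_text(card, 'a', 'title fw500 ellipsis'),
        'Salary': find_text(salary_li, 'span', 'ellipsis fleft fs12 lh16'),
    })

print(parsed)
```

Storing `None` keeps the posting in the output with an empty cell, which is usually preferable to dropping the whole row because one field is absent.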

In [None]:
df.to_csv('jobs_data.csv', index=False) # index=False avoids an extra 'Unnamed: 0' column when the file is read back
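
The scraping loop reads only the first page of each area (`range(1,2)`), while the task asks for all pages. One possible way to cover them is to keep requesting pages until one comes back empty. The sketch below illustrates that idea under stated assumptions: `scrape_all_pages`, `fetch_jobs`, `fake_fetch` and the `example.com` URLs are all hypothetical, and `fake_fetch` stands in for the Selenium and BeautifulSoup step.

```python
def scrape_all_pages(template, fetch_jobs, max_pages=50):
    """Collect postings page by page until a page comes back empty.

    template   -- URL with a '{}' placeholder for the page number
    fetch_jobs -- hypothetical callable: takes a URL, returns a list of postings
    max_pages  -- safety cap so the loop always terminates
    """
    collected = []
    for page in range(1, max_pages + 1):
        jobs = fetch_jobs(template.format(page))
        if not jobs:  # an empty page means there is nothing left to read
            break
        collected.extend(jobs)
    return collected

def fake_fetch(url):
    # Stand-in for the browser step: only pages 1 and 2 have postings
    fake_site = {
        'https://example.com/accounting-jobs-1?qf[]=1': ['job A', 'job B'],
        'https://example.com/accounting-jobs-2?qf[]=1': ['job C'],
    }
    return fake_site.get(url, [])

all_jobs = scrape_all_pages('https://example.com/accounting-jobs-{}?qf[]=1', fake_fetch)
print(all_jobs)
```

Whether an empty result list is the right stop signal depends on how the site behaves past its last page, so that condition should be checked against the real pages before relying on it.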