# PSET4 - Web Scraping

This week you will be scraping job postings from one of Mongolia's top job boards: Zangia.mn (formerly BizNetwork). We will collect the following features from the site:

- Job title
- Job description
- Job sector
- Salary range

If you go to https://www.zangia.mn/job/list you will see all the job listings. There are several pages of listings. The recommended process is:

1. Make a list of job post results.
2. Scrape the listing URLs from the job list pages (1 through n).
3. Use the resulting URL list to scrape the features to a dataframe.

The final dataframe should include a column for each of the features above. Not every job post will have every feature, but the result should still be a clean dataframe with each value in its correct column.
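
One way to keep every value in the right column when a post is missing a feature is to normalize each scraped post into a dictionary with a fixed set of keys before building the dataframe. The sketch below is only an illustration: the `FEATURES` column names and the `normalize_record` helper are hypothetical, not part of the assignment.

```python
# Hypothetical column names for the four features listed above.
FEATURES = ["title", "description", "sector", "salary"]


def normalize_record(raw):
    """Return a dict with exactly the FEATURES keys.

    Missing or blank values become None, so every row has the same
    shape no matter which features a job post actually had.
    """
    record = {}
    for name in FEATURES:
        value = raw.get(name)
        if isinstance(value, str):
            # Trim whitespace and treat empty strings as missing.
            value = value.strip() or None
        record[name] = value
    return record


# A post that only exposed a title and an empty salary field.
example = normalize_record({"title": " Химич ", "salary": ""})
print(example)
```

A list of such dictionaries can then be passed to `pd.DataFrame(records, columns=FEATURES)`, which keeps absent features as missing values instead of shifting the other values into the wrong column.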

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [2]:
# A timeout keeps the request from hanging if the site does not respond
response = requests.get('https://www.zangia.mn/job/list', timeout=10)

In [3]:
response.status_code

200

In [4]:
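# Name the parser explicitly so results do not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")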

In [5]:
job_titles = soup.find_all("div", {"class":"ad"})

In [6]:
job_titles[0].find_all("a")[0]

<a href="job/_jr39pn90ff"><b>Смарт оператор</b><span class="fsal">1,200,000 - 1,500,000 Тохиролцоно</span><span class="floca">Баянзүрх дүүрэг</span><em><b>2024-11-13</b>-н хүртэл анкет хүлээн авна</em><span class="sdate">10 сарын 22. 9:00</span></a>
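
The `href` in the output above is relative (`job/_jr39pn90ff`), so it has to be turned into a full URL before it can be requested. The sketch below uses `urllib.parse.urljoin` and assumes the links resolve against the site root `https://www.zangia.mn/`; that root, and the helper name `to_absolute_urls`, are assumptions made for illustration. Note that joining against the list page URL itself would give `.../job/job/_jr39pn90ff`, because `urljoin` resolves relative to the directory of the base path.

```python
from urllib.parse import urljoin

# Assumption: relative hrefs on the list page resolve against the site root.
SITE_ROOT = "https://www.zangia.mn/"


def to_absolute_urls(hrefs, root=SITE_ROOT):
    """Resolve relative hrefs to absolute URLs, dropping duplicates in order."""
    seen = set()
    urls = []
    for href in hrefs:
        url = urljoin(root, href)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# The href value is taken from the cell output above.
print(to_absolute_urls(["job/_jr39pn90ff", "job/_jr39pn90ff"]))
```

The `hrefs` argument could be built from the ad blocks, for example by reading the `href` attribute of the first `<a>` tag in each one.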

In [7]:
job_titles[0].find_all("b")[0].text.strip()

'Смарт оператор'

In [8]:
salaries = job_titles[0].find_all("span")[0].text.strip()
print(salaries)

1,200,000 - 1,500,000 Тохиролцоно


In [9]:
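# Collect the title and the raw salary text from every ad block on the page
listing_title = []
listing_salary = []
for job_title in job_titles:
    title = job_title.find_all("b")[0].text      # first <b> is the job title (see In [6])
    salary = job_title.find_all("span")[0].text  # first <span> is the salary text (see In [8])
    listing_title.append(title)
    listing_salary.append(salary)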

In [10]:
listing_salary

['1,200,000 - 1,500,000 Тохиролцоно',
 '2,100,000 - 2,500,000',
 '1,500,000 - 1,800,000',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000',
 '2,500,000 - 3,000,000',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000',
 '3,000,000 - 4,000,000',
 '1,800,000 - 2,100,000',
 '2,500,000 - 3,000,000',
 '2,500,000 - 3,000,000',
 '1,500,000 - 1,800,000',
 '2,500,000 - 3,000,000',
 '2,500,000 - 3,000,000',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000 Тохиролцоно',
 '1,800,000 - 2,100,000',
 '2,100,000 - 2,500,000 Тохиролцоно',
 '2,100,000 - 2,500,000',
 '1,800,000 - 2,100,000',
 ' Тохиролцоно',
 '1,800,000 - 2,100,000',
 '2,100,000 - 2,500,000',
 '1,500,000 - 1,800,000',
 '2,100,000 - 2,500,000',
 ' Тохиролцоно',
 '2,500,000 - 3,000,000 Тохиролцоно',
 '2,500,000 - 3,000,000 Тохиролцоно',
 '1,500,000 - 1,800,000',
 '4,000,000 - 5,000,000 Тохиролцоно',
 '2,100,000 - 2,500,000',
 '2,100,000 - 2,500,000',
 '4,000,000 - 5,000,000',
 '2,500,00
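
The salary strings above come in three shapes: a numeric range, a range followed by "Тохиролцоно" (negotiable), and "Тохиролцоно" on its own. A minimal sketch of splitting them into numbers plus a flag is shown below; the helper name `parse_salary` and the returned tuple layout are assumptions for illustration, and the pattern only covers the formats visible in this output.

```python
import re

# Marker text seen in the scraped salary strings above.
NEGOTIABLE = "Тохиролцоно"

# Two comma-grouped numbers separated by a hyphen, e.g. "1,200,000 - 1,500,000".
RANGE_RE = re.compile(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)")


def parse_salary(text):
    """Return (low, high, negotiable) parsed from a raw salary string.

    low and high are ints, or None when the string has no numeric range.
    """
    negotiable = NEGOTIABLE in text
    match = RANGE_RE.search(text)
    if match is None:
        return None, None, negotiable
    low, high = (int(part.replace(",", "")) for part in match.groups())
    return low, high, negotiable


for raw in ["1,200,000 - 1,500,000 Тохиролцоно", "2,100,000 - 2,500,000", " Тохиролцоно"]:
    print(repr(raw), parse_salary(raw))
```

Parsing into separate values keeps the negotiable postings available for analysis, since the flag can be stored in its own column rather than used to discard rows.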

In [11]:
df = pd.DataFrame({"title": listing_title, "salary": listing_salary})
df

Unnamed: 0,title,salary
0,Смарт оператор,"1,200,000 - 1,500,000 Тохиролцоно"
1,Химич,"2,100,000 - 2,500,000"
2,"ХУДАЛДАГЧ,КАСС / ЗАЙСАН САЛБАРТ/","1,500,000 - 1,800,000"
3,НЯГТЛАН БОДОГЧ,"2,100,000 - 2,500,000"
4,ХУДАЛДААНЫ ТӨЛӨӨЛӨГЧ-ЖОЛООЧ,"2,100,000 - 2,500,000"
...,...,...
75,"Токарьчин, засварчин","2,100,000 - 2,500,000"
76,"Моторчин, засварчин","2,500,000 - 3,000,000"
77,"Автын цахилгаанчин, засварчин","2,100,000 - 2,500,000"
78,Автын механик,"2,500,000 - 3,000,000"


In [12]:
# Keep only the rows whose salary text does not contain "Тохиролцоно"
df = df[~df['salary'].str.contains("Тохиролцоно", regex=False)]

In [13]:
df

Unnamed: 0,title,salary
1,Химич,"2,100,000 - 2,500,000"
2,"ХУДАЛДАГЧ,КАСС / ЗАЙСАН САЛБАРТ/","1,500,000 - 1,800,000"
3,НЯГТЛАН БОДОГЧ,"2,100,000 - 2,500,000"
4,ХУДАЛДААНЫ ТӨЛӨӨЛӨГЧ-ЖОЛООЧ,"2,100,000 - 2,500,000"
5,Борлуулалтын ажилтан - (Салбар),"2,100,000 - 2,500,000"
...,...,...
75,"Токарьчин, засварчин","2,100,000 - 2,500,000"
76,"Моторчин, засварчин","2,500,000 - 3,000,000"
77,"Автын цахилгаанчин, засварчин","2,100,000 - 2,500,000"
78,Автын механик,"2,500,000 - 3,000,000"


In [14]:
# Save the filtered listings to CSV without the index column
df.to_csv('zangia_job_listings.csv', index=False)

In [15]:
df

Unnamed: 0,title,salary
1,Химич,"2,100,000 - 2,500,000"
2,"ХУДАЛДАГЧ,КАСС / ЗАЙСАН САЛБАРТ/","1,500,000 - 1,800,000"
3,НЯГТЛАН БОДОГЧ,"2,100,000 - 2,500,000"
4,ХУДАЛДААНЫ ТӨЛӨӨЛӨГЧ-ЖОЛООЧ,"2,100,000 - 2,500,000"
5,Борлуулалтын ажилтан - (Салбар),"2,100,000 - 2,500,000"
...,...,...
75,"Токарьчин, засварчин","2,100,000 - 2,500,000"
76,"Моторчин, засварчин","2,500,000 - 3,000,000"
77,"Автын цахилгаанчин, засварчин","2,100,000 - 2,500,000"
78,Автын механик,"2,500,000 - 3,000,000"
