# Individual Data Challenge: Scraping Jobs.ch

(Weekend homework recap)

As a job seeker, you have to search through job portals to find the jobs most relevant to your profile.

In this challenge, your goal is to find all jobs related to the keywords “Data Scientist”, “Data Analyst”, “Python Developer”, “Data Engineer”, “Data Manager”, “Data Architect”, “Big Data Analyst” and “Data Python” on jobs.ch.

## Questions

Download all necessary information (including job text, job rank, company name, job keyword…) for all webpages.
Using the information obtained, perform a descriptive analysis of the data that answers the following questions:

1. How many jobs are shared between these categories?
2. How much do the keywords “Data Analyst” and “Big Data Analyst” overlap?
3. Are some companies hiring more than average?
4. How many jobs are there in each canton?
5. Does the keyword “machine learning” appear more often in Data Scientist or in Data Analyst jobs?
6. What is the distribution of most common keywords between and across categories?
7. Produce a report in the form of a clean notebook (or jupyter slides), with commented code and markdown cells for structuring and interpretations.
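
For questions 1 and 2, one possible approach is to treat each keyword's results as a set of job identifiers and intersect the sets. The sketch below is only an illustration: the toy `jobs_by_keyword` dictionary is made up, and identifying a job by its `(company, title)` pair is an assumption (the job URL would be a more reliable key if it were stored alongside each row).

```python
from itertools import combinations

# Hypothetical toy data: keyword -> set of (company, title) pairs found for it.
jobs_by_keyword = {
    "Data Analyst": {("A", "Analyst"), ("B", "Analyst"), ("C", "Scientist")},
    "Big Data Analyst": {("B", "Analyst"), ("C", "Scientist"), ("D", "Engineer")},
    "Data Scientist": {("C", "Scientist"), ("E", "Researcher")},
}

def overlap(a, b):
    """Return (number of shared jobs, Jaccard similarity) for two sets of job ids."""
    shared = a & b
    union = a | b
    jaccard = len(shared) / len(union) if union else 0.0
    return len(shared), jaccard

# Pairwise overlap between all keyword categories.
pairwise = {
    (k1, k2): overlap(jobs_by_keyword[k1], jobs_by_keyword[k2])
    for k1, k2 in combinations(jobs_by_keyword, 2)
}

# Jobs that appear under every keyword.
shared_by_all = set.intersection(*jobs_by_keyword.values())

for pair, (n_shared, jaccard) in pairwise.items():
    print(pair, n_shared, round(jaccard, 2))
print("shared by all:", shared_by_all)
```

The Jaccard similarity (shared jobs divided by all distinct jobs in the two sets) gives a size-independent answer to "how much do two keywords overlap".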

In [2]:
import requests
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm # notebook progress bar (the top-level tqdm_notebook alias is deprecated)
from time import sleep
import numpy as np
from datetime import datetime
import pandas as pd
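
The search URLs requested in the next cell all follow the pattern `https://www.jobs.ch/en/vacancies/?page=N&term=KEYWORD`, so they can be generated for every keyword rather than assembled by hand. This is a minimal sketch using only the standard library. The helper name `search_url` is hypothetical, and `quote_via=quote` is chosen so that spaces are encoded as `%20`, matching the hand-written links.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.jobs.ch/en/vacancies/"

KEYWORDS = [
    "Data Scientist", "Data Analyst", "Python Developer", "Data Engineer",
    "Data Manager", "Data Architect", "Big Data Analyst", "Data Python",
]

def search_url(keyword, page=1):
    """Build the results-page URL for one keyword and page number (hypothetical helper)."""
    # quote_via=quote encodes spaces as %20 (the default, quote_plus, would use '+').
    query = urlencode({"page": page, "term": keyword}, quote_via=quote)
    return f"{BASE_URL}?{query}"

for keyword in KEYWORDS:
    print(search_url(keyword))
```

With such a helper, the eight near-identical scraping blocks could be driven by one loop over `KEYWORDS`.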

In [3]:
#DATA SCIENTIST
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Scientist"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
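def get_links(soup, job_links):
    """Append the absolute URL of every job ad on a results page to job_links."""
    # The class names below look auto-generated and may change whenever the site is redeployed.
    all_links = soup.find_all('div', class_ = 'Box-sc-7ekkso-0 Position-b2pct5-0 Position__Relative-b2pct5-1 VacancySerpItem__ShadowBox-p4qu0m-0 hthPRS')
    for job_add in all_links:
        anchor = job_add.find('a', {'class':'x--job-link t--job-link SearchVacancyResultsComponent__StyledVacancySerpItem-n25jij-0 dQDQbr'})
        if anchor is not None:  # skip result boxes that contain no job link
            job_links.append(link_first_part + anchor.get('href'))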



# The third whitespace-separated token of the pagination text is taken as the number of result pages.
max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

# Visit every results page and collect the job links, pausing between requests.
for page in tqdm(range(1, int(max_pages)+1)):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

def get_content(soup):
    """Extract ad text, title, company, publication date and location from a job page.

    A field whose element is missing on the page is returned as NaN.
    """
    # soup.find() returns None when nothing matches, so the chained call raises AttributeError.
    try:
        content = soup.find('div', class_ = 'Div-v2w9ke-0 fjQgMg').get_text()
    except AttributeError:
        content = np.nan
    try:
        title = soup.find('div', {'class' : 'Div-v2w9ke-0 hPuVjT'}).find('h1').get('title')
    except AttributeError:
        title = np.nan
    try:
        company = soup.find('div', class_ = 'Div-v2w9ke-0 cvwY').find('a').get('title')
    except AttributeError:
        company = np.nan
    try:
        published = soup.find('span', class_ = 'Span-bhy2uh-0 Badge-ndaeev-0 krRxxu').get_text()
    except AttributeError:
        published = np.nan
    try:
        location = soup.find('span', class_ = 'Span-bhy2uh-0 Text__span-sc-1vcmz87-8 YbklG Span-bhy2uh-0 Text__span-sc-1vcmz87-8 Text-sc-1vcmz87-9 gdfMMD').get_text()
    except AttributeError:
        location = np.nan

    return content, title, company, published, location

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect one dict per job ad and build the DataFrame once at the end:
# DataFrame.append was removed in pandas 2.0 and was slow inside a loop.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_datascientist = pd.DataFrame(rows, columns = cols)


df_datascientist['location'] = df_datascientist['location'].str.replace('—', '')
df_datascientist

#DATA ANALYST
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Analyst"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_dataanalyst = pd.DataFrame(rows, columns = cols)


df_dataanalyst['location'] = df_dataanalyst['location'].str.replace('—', '')
df_dataanalyst

#PYTHON DEVELOPER
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Python%20Developer"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_pythondeveloper = pd.DataFrame(rows, columns = cols)


df_pythondeveloper['location'] = df_pythondeveloper['location'].str.replace('—', '')
df_pythondeveloper

#DATA ENGINEER
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Engineer"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_dataengineer = pd.DataFrame(rows, columns = cols)


df_dataengineer['location'] = df_dataengineer['location'].str.replace('—', '')
df_dataengineer

#DATA MANAGER
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Manager"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_datamanager = pd.DataFrame(rows, columns = cols)


df_datamanager['location'] = df_datamanager['location'].str.replace('—', '')
df_datamanager

#DATA ARCHITECT
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Architect"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_dataarchitect = pd.DataFrame(rows, columns = cols)


df_dataarchitect['location'] = df_dataarchitect['location'].str.replace('—', '')
df_dataarchitect

#BIG DATA ANALYST
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Big%20Data%20Analyst"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_bigdataanalyst = pd.DataFrame(rows, columns = cols)


df_bigdataanalyst['location'] = df_bigdataanalyst['location'].str.replace('—', '')
df_bigdataanalyst

#DATA PYTHON
# Build the search URL from its parts (worthwhile whenever the URLs follow a clear pattern).
link_first_part = 'https://www.jobs.ch'
link_mid_1_part = '/en/vacancies/?page='
link_mid_2_part = '&term='
link_mid_3_part = "Data%20Python"

link = link_first_part + link_mid_1_part + link_mid_2_part + link_mid_3_part

response = requests.get(link, timeout = 15)
soup = BeautifulSoup(response.content, 'html.parser')

job_links = []
# get_links() was defined above and is reused as is.



max_pages = soup.find('div', class_ = 'Div-v2w9ke-0 Flex-sc-4aokm-0 eykbax').text.split()[2]

for page in tqdm(range(1, int(max_pages)+1)[:]):
    url = link_first_part +link_mid_1_part + str(page) + link_mid_2_part + link_mid_3_part
    response = requests.get(url, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    get_links(soup, job_links)
    sleep(0.6)

# get_content() was defined above and is reused as is.

get_content(soup)

cols = ['date', 'company', 'title', 'published', 'content', 'location']

# Collect the rows in a list; DataFrame.append was removed in pandas 2.0.
rows = []
for job in tqdm(job_links):
    response = requests.get(job, timeout = 15)
    soup = BeautifulSoup(response.content, 'html.parser')
    content, title, company, published, location = get_content(soup)

    rows.append({
        'date': datetime.now(),
        'company': company,
        'title': title,
        'published': published,
        'content': content,
        'location': location
    })
    sleep(0.6)

df_datapython = pd.DataFrame(rows, columns = cols)


df_datapython['location'] = df_datapython['location'].str.replace('—', '')
df_datapython

Progress bar totals for the cell above, in run order (result pages scanned, job ads fetched):

keyword,result pages,job ads
Data Scientist,23,457
Data Analyst,24,472
Python Developer,10,187
Data Engineer,49,975
Data Manager,66,1323
Data Architect,12,228
Big Data Analyst,11,208
Data Python,29,573

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 01:56:21.205658,MBA Michael Bailey Associates GmbH,DevOps Python Developer,08.11.2019,Job descriptionInfoCompany,Zurich
1,2019-11-17 01:56:22.608452,Leica Geosystems AG,Quality Engineer (m/f/d),25.10.2019,Job descriptionInfoCompanyQuality Engineer (m/...,Heerbrugg
2,2019-11-17 01:56:23.833174,Arcanite Solutions Sàrl,Développeur/-euse Python orienté Web,14.11.2019,Job descriptionInfoCompanyArcanite est une jeu...,Puidoux
3,2019-11-17 01:56:25.060690,Noser Engineering AG,Data Engineer mit Flair für Analytics,15.11.2019,Job descriptionInfoCompanyData Engineer mit Fl...,Winterthur
4,2019-11-17 01:56:27.223534,Geberit AG,Product Data Analyst (m/w),15.11.2019,Job descriptionInfoCompanyProduct Data Analyst...,Rapperswil-Jona
5,2019-11-17 01:56:28.739321,UBS AG,IT Big Data Software Engineer,15.11.2019,Job descriptionInfoCompany200792BRIT Big Data ...,Opfikon
6,2019-11-17 01:56:29.978640,SIX,Internship Big Data,12.11.2019,Job descriptionInfoCompanySIX operates the inf...,Zürich
7,2019-11-17 01:56:31.210686,Signifikant Solutions AG,SOFTWARE ENGINEER PYTHON (M/W) 60-100%,30.10.2019,Job descriptionInfoCompany,Root D4
8,2019-11-17 01:56:32.478572,Die Schweizerische Post,Data Analyst Prozessentwicklung CRM Kundendien...,13.11.2019,Job descriptionInfoCompanyData Analyst Prozess...,Bern
9,2019-11-17 01:56:33.681222,Universitätsspital Basel,Django / Python web developer 80-100%,08.11.2019,Job descriptionInfoCompanyDjango / Python web ...,Basel


In [150]:
#export all the data to csv

df_datascientist.to_csv('datascientist.csv')
df_dataanalyst.to_csv('dataanalyst.csv')
df_pythondeveloper.to_csv('pythondeveloper.csv')
df_dataengineer.to_csv('dataengineer.csv')
df_datamanager.to_csv('datamanager.csv')
df_dataarchitect.to_csv('dataarchitect.csv')
df_bigdataanalyst.to_csv('bigdataanalyst.csv')
df_datapython.to_csv('datapython.csv')
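
For the analysis it may be easier to work with a single table that records which search keyword produced each row. The following is a minimal pandas sketch using two tiny made-up frames in place of the real ones. The `keyword` column name and the placeholder companies are assumptions. When the exported files are read back, `pd.read_csv(path, index_col=0)` avoids the extra `Unnamed: 0` column, because `to_csv` writes the index as the first column by default.

```python
import pandas as pd

# Made-up stand-ins; in the notebook these would be df_dataanalyst, df_bigdataanalyst, ...
frames = {
    "Data Analyst": pd.DataFrame({
        "company": ["Company A", "Company B"],
        "title": ["Data Analyst", "Junior Data Analyst"],
    }),
    "Big Data Analyst": pd.DataFrame({
        "company": ["Company A"],
        "title": ["Data Analyst"],
    }),
}

# Stack all frames and keep the search keyword as a regular column.
all_jobs = (
    pd.concat(frames, names=["keyword", "row"])
      .reset_index(level="keyword")
      .reset_index(drop=True)
)

# Number of distinct keywords under which each (company, title) pair appears.
keywords_per_job = all_jobs.groupby(["company", "title"])["keyword"].nunique()

print(all_jobs)
print(keywords_per_job)
```

Rows with a count above 1 in `keywords_per_job` are the jobs shared between categories.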




In [4]:
df_bigdataanalyst

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 01:50:33.458324,Scintilla AG,Data Analyst (m/w/d),11.11.2019,Job descriptionInfoCompany,Zuchwil
1,2019-11-17 01:50:34.859341,Hapimag AG,Data Engineer / Data Analyst (m/w/d) 100%,29.10.2019,Job descriptionInfoCompany,Steinhausen
2,2019-11-17 01:50:36.398228,Die Schweizerische Post,Data Analyst Prozessentwicklung CRM Kundendien...,13.11.2019,Job descriptionInfoCompanyData Analyst Prozess...,Bern
3,2019-11-17 01:50:37.932056,EY,Senior Data Scientist - Forensics in Zurich,01.11.2019,Job descriptionInfoCompanySenior Data Scientis...,"Zurich, CH-ZH"
4,2019-11-17 01:50:39.160241,Geberit AG,Product Data Analyst (m/w),15.11.2019,Job descriptionInfoCompanyProduct Data Analyst...,Rapperswil-Jona
5,2019-11-17 01:50:40.497278,Vorwerk International & Co. KmG,Senior Quality Data Analyst,14.11.2019,Job descriptionInfoCompanySenior Quality Data ...,Wollerau
6,2019-11-17 01:50:41.961658,Swiss Life Asset Managers,Market Data Analyst,24.10.2019,Job descriptionInfoCompanySwiss Life Asset Man...,Zürich
7,2019-11-17 01:50:43.152884,Migros Bank AG,Data Analyst / Anwendungsentwickler (w/m),12.11.2019,"Job descriptionInfoCompanySie sind stolz, für ...",Wallisellen
8,2019-11-17 01:50:45.314227,Ultimativo Consulting GmbH,Group HR Data Analyst,15.11.2019,Job descriptionInfoCompany,Region Olten
9,2019-11-17 01:50:46.532069,CPM Switzerland AG,Junior Data Analyst,25.10.2019,Job descriptionInfoCompany,Thalwil
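
Every `content` value in the output above starts with `Job descriptionInfoCompany`, which looks like tab labels picked up by `get_text()` rather than ad text (an assumption, since the page markup is not shown here). Before counting phrases such as “machine learning” for question 5, that prefix can be stripped. This is a small standard-library sketch with made-up ad texts, and the helper names are hypothetical.

```python
PREFIX = "Job descriptionInfoCompany"  # leading text seen in the scraped `content` column

def clean_content(text):
    """Drop the leading label text and surrounding whitespace; missing values become ''."""
    if not isinstance(text, str):  # NaN from a failed scrape is a float, not a str
        return ""
    if text.startswith(PREFIX):
        text = text[len(PREFIX):]
    return text.strip()

def mentions(text, phrase="machine learning"):
    """Case-insensitive check for a phrase in a cleaned ad text."""
    return phrase.lower() in clean_content(text).lower()

# Made-up examples, for illustration only.
ads = [
    "Job descriptionInfoCompanyWe use Machine Learning to forecast demand.",
    "Job descriptionInfoCompany",
    float("nan"),
]

# Share of ads that mention the phrase.
share = sum(mentions(ad) for ad in ads) / len(ads)
print([clean_content(ad) for ad in ads])
print(f"{share:.2f}")
```

Computing this share separately for the Data Scientist and the Data Analyst results allows the two categories to be compared even though they contain different numbers of ads.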


In [5]:
df_dataarchitect

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 01:44:52.634685,ROSEN Swiss AG,Microsoft Data Warehouse/BI-Entwickler (m/w),08.11.2019,Job descriptionInfoCompanyJobbeschreibung Mit ...,Stans
1,2019-11-17 01:44:53.857132,PwC,Junior Software Developer in PwC Digital Services,29.10.2019,Job descriptionInfoCompany You act as a Consul...,Zürich
2,2019-11-17 01:44:55.086277,Banian AG,Data Engineer / Solution Architect,28.10.2019,Job descriptionInfoCompany,Deutschschweiz
3,2019-11-17 01:44:58.770917,RM IT Professional Resources AG,Data Warehouse Developer - Oracle PL/SQL,15.11.2019,Job descriptionInfoCompanyData Warehouse Devel...,Bern
4,2019-11-17 01:45:00.006157,Credit Suisse AG,Agile DWH Analyst 80 - 100%,21.10.2019,Job descriptionInfoCompanyAgile DWH Analyst 80...,Zürich
5,2019-11-17 01:45:01.247604,42matters,Big Data Engineer,30.10.2019,Job descriptionInfoCompany,Zürich
6,2019-11-17 01:45:02.792472,Bossard AG,ICT BI / DWH Specialist (m/w),05.11.2019,Job descriptionInfoCompanyWir sind ein zukunft...,Zug
7,2019-11-17 01:45:06.453847,F. Hoffmann-La Roche AG,Data & Information Architect,22.10.2019,Job descriptionInfoCompany,Kaiseraugst
8,2019-11-17 01:45:07.715320,Coopers Group AG,Global Master Data Cleansing Manager (Master D...,14.11.2019,Job descriptionInfoCompany,Basel Area
9,2019-11-17 01:45:08.911239,Cognizant Technology Solutions AG,Data Platform Business Architect,23.10.2019,Job descriptionInfoCompanyThe Cognizant Digita...,Zürich


In [6]:
df_datamanager

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 01:12:32.945828,Comfone AG,Data Scientist Manager,25.10.2019,Job descriptionInfoCompany,Bern 22
1,2019-11-17 01:12:34.085141,B. Braun Medical AG,Stammdaten - Manager (w/m) 80-100%,08.11.2019,Job descriptionInfoCompanyHigh Tech aus dem Bi...,Escholzmatt
2,2019-11-17 01:12:35.409787,Sonova,Data Protection Manager,18.10.2019,Job descriptionInfoCompanyData Protection Mana...,Stäfa und Zug
3,2019-11-17 01:12:36.640146,maxon motor ag,Service Responsible Datacenter / Oracle 90-100...,06.11.2019,Job descriptionInfoCompanyFür unsere Abteilung...,Sachseln
4,2019-11-17 01:12:37.862540,Page Personnel,Data Administrator 50% (m/f),30.10.2019,Job descriptionInfoCompany,Zug
5,2019-11-17 01:12:39.678117,Geberit AG,Master Data Manager (m/w),15.11.2019,Job descriptionInfoCompanyMaster Data Manager ...,Rapperswil/Jona
6,2019-11-17 01:12:40.803619,UPC Schweiz GmbH,Manager Data Network Data Security,14.11.2019,Job descriptionInfoCompany,Wallisellen
7,2019-11-17 01:12:41.930758,Stämpfli AG,"Datenmanager/in Fahrplan, 80-100%",11.11.2019,Job descriptionInfoCompanyDatenmanager/in Fahr...,Bern
8,2019-11-17 01:12:43.075691,Aebi & Co. AG,Master Data Manager (m/w) 100 %,03.11.2019,Job descriptionInfoCompany,Burgdorf
9,2019-11-17 01:12:44.171209,Luzerner Kantonsspital,Datenmanager/in 80%,28.10.2019,Job descriptionInfoCompanyDatenmanager/in 80%I...,Luzern
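
The `location` values above are free text (`Bern 22`, `Rapperswil/Jona`, `Stäfa und Zug`), so counting jobs per canton for question 4 needs a normalisation step and a town-to-canton lookup. The sketch below uses the ten locations displayed above and a small hand-made lookup. The lookup table and the helper name `canton` are assumptions, and a complete analysis would need a full list of Swiss municipalities.

```python
import re
from collections import Counter

# Small hand-made lookup (an assumption; incomplete by design).
TOWN_TO_CANTON = {
    "Bern": "BE", "Burgdorf": "BE", "Luzern": "LU", "Escholzmatt": "LU",
    "Sachseln": "OW", "Zug": "ZG", "Wallisellen": "ZH", "Rapperswil-Jona": "SG",
}

def canton(location):
    """Map a raw location string to a canton code, or None if it is not in the lookup."""
    town = re.sub(r"\s+\d+$", "", location.strip())  # "Bern 22" -> "Bern"
    town = town.replace("/", "-")                     # "Rapperswil/Jona" -> "Rapperswil-Jona"
    return TOWN_TO_CANTON.get(town)

# Location column of the ten rows displayed above.
locations = [
    "Bern 22", "Escholzmatt", "Stäfa und Zug", "Sachseln", "Zug",
    "Rapperswil/Jona", "Wallisellen", "Bern", "Burgdorf", "Luzern",
]

jobs_per_canton = Counter(canton(loc) for loc in locations)
print(jobs_per_canton)
```

Entries that end up under `None` (here the two-site ad `Stäfa und Zug`) show which locations still need manual handling.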


In [7]:
df_dataengineer

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 00:46:55.410891,Raiffeisen Gruppe,Big Data Engineer (w/m),08.11.2019,Job descriptionInfoCompanyWerden Sie Teil eine...,Arbeitsort St. Gallen
1,2019-11-17 00:46:56.962005,Hapimag AG,Data Engineer / Data Analyst (m/w/d) 100%,29.10.2019,Job descriptionInfoCompany,Steinhausen
2,2019-11-17 00:46:58.172127,Experis Schweiz Zürich,Senior Data Engineer,04.11.2019,Job descriptionInfoCompanySenior Data Engineer...,Zürich
3,2019-11-17 00:46:59.720010,The Stamford Group AG,DevOps Engineer,13.11.2019,Job descriptionInfoCompanyDevOps EngineerJob D...,Zürich
4,2019-11-17 00:47:01.853487,Stamford Consultants AG,Cloud Site Reliability Engineer,15.11.2019,Job descriptionInfoCompany,Zürich
5,2019-11-17 00:47:03.087609,uniqFEED AG,DevOps / SRE (80-100%),05.11.2019,"Job descriptionInfoCompanyuniqFEED, a Spin-off...",Glattbrugg
6,2019-11-17 00:47:04.631022,The Stamford Group AG,Cloud Site Reliability Engineer,15.11.2019,Job descriptionInfoCompanyCloud Site Reliabili...,Zürich
7,2019-11-17 00:47:06.767791,Geberit AG,Product Data Engineer (m/w),15.11.2019,Job descriptionInfoCompanyProduct Data Enginee...,Rapperswil-Jona
8,2019-11-17 00:47:07.992880,Noser Engineering AG,Data Engineer mit Flair für Analytics,15.11.2019,Job descriptionInfoCompanyData Engineer mit Fl...,Winterthur
9,2019-11-17 00:47:09.225791,Leica Geosystems AG,Machine Learning Engineer (f/m),09.11.2019,Job descriptionInfoCompanyMachine Learning Eng...,Heerbrugg


In [8]:
df_pythondeveloper

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 00:41:23.929207,MBA Michael Bailey Associates GmbH,DevOps Python Developer,08.11.2019,Job descriptionInfoCompany,Zurich
1,2019-11-17 00:41:25.095972,Vontobel,API Specialist & Integration Engineer 80-100%,04.11.2019,Job descriptionInfoCompany,Zürich
2,2019-11-17 00:41:26.400489,Camptocamp SA,Python Developer,05.11.2019,Job descriptionInfoCompany,Olten
3,2019-11-17 00:41:27.933068,Labour Search GmbH,Junior Python Developer 100%,31.10.2019,Job descriptionInfoCompany,Wettingen
4,2019-11-17 00:41:29.473870,Universitätsspital Basel,Django / Python web developer 80-100%,08.11.2019,Job descriptionInfoCompanyDjango / Python web ...,Basel
5,2019-11-17 00:41:31.003573,Infomaniak Network SA,Développeur de logiciel,15.11.2019,"Job descriptionInfoCompanyParmi ses projets, I...",Genève
6,2019-11-17 00:41:32.222287,SonarSource SA,Java Developer,12.11.2019,Job descriptionInfoCompanySonarSource provides...,1215 Geneva 15
7,2019-11-17 00:41:33.451969,Harvey Nash AG,Cloud Site Reliability Engineer,15.11.2019,Job descriptionInfoCompany,Zürich
8,2019-11-17 00:41:34.981968,Stamford Consultants AG,Cloud Site Reliability Engineer,15.11.2019,Job descriptionInfoCompany,Zürich
9,2019-11-17 00:41:36.233081,The Stamford Group AG,Cloud Site Reliability Engineer,15.11.2019,Job descriptionInfoCompanyCloud Site Reliabili...,Zürich


In [9]:
df_dataanalyst

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 00:30:18.532134,Hapimag AG,Data Engineer / Data Analyst (m/w/d) 100%,29.10.2019,Job descriptionInfoCompany,Steinhausen
1,2019-11-17 00:30:20.058648,Scintilla AG,Data Analyst (m/w/d),11.11.2019,Job descriptionInfoCompany,Zuchwil
2,2019-11-17 00:30:21.287482,Geberit AG,Product Data Analyst (m/w),15.11.2019,Job descriptionInfoCompanyProduct Data Analyst...,Rapperswil-Jona
3,2019-11-17 00:30:22.499035,Vorwerk International & Co. KmG,Senior Quality Data Analyst,14.11.2019,Job descriptionInfoCompanySenior Quality Data ...,Wollerau
4,2019-11-17 00:30:23.731574,Swiss Life Asset Managers,Market Data Analyst,24.10.2019,Job descriptionInfoCompanySwiss Life Asset Man...,Zürich
5,2019-11-17 00:30:24.960227,Die Schweizerische Post,Data Analyst Prozessentwicklung CRM Kundendien...,13.11.2019,Job descriptionInfoCompanyData Analyst Prozess...,Bern
6,2019-11-17 00:30:26.241601,Migros Bank AG,Data Analyst / Anwendungsentwickler (w/m),12.11.2019,"Job descriptionInfoCompanySie sind stolz, für ...",Wallisellen
7,2019-11-17 00:30:27.734569,Ultimativo Consulting GmbH,Group HR Data Analyst,15.11.2019,Job descriptionInfoCompany,Region Olten
8,2019-11-17 00:30:29.110201,Magazine zum Globus AG,Data Analyst Produktdaten (m/w),15.11.2019,Job descriptionInfoCompanyFür unseren Hauptsit...,Zürich
9,2019-11-17 00:30:30.836197,CPM Switzerland AG,Junior Data Analyst,25.10.2019,Job descriptionInfoCompany,Thalwil


In [10]:
df_datascientist

Unnamed: 0,date,company,title,published,content,location
0,2019-11-17 00:19:10.475106,Hapimag AG,Data Engineer / Data Analyst (m/w/d) 100%,29.10.2019,Job descriptionInfoCompany,Steinhausen
1,2019-11-17 00:19:11.692854,Scintilla AG,Data Analyst (m/w/d),11.11.2019,Job descriptionInfoCompany,Zuchwil
2,2019-11-17 00:19:13.079891,Die Schweizerische Post,Data Analyst Prozessentwicklung CRM Kundendien...,13.11.2019,Job descriptionInfoCompanyData Analyst Prozess...,Bern
3,2019-11-17 00:19:14.545353,EY,Senior Data Scientist - Forensics in Zurich,01.11.2019,Job descriptionInfoCompanySenior Data Scientis...,"Zurich, CH-ZH"
4,2019-11-17 00:19:16.240372,Geberit AG,Product Data Analyst (m/w),15.11.2019,Job descriptionInfoCompanyProduct Data Analyst...,Rapperswil-Jona
5,2019-11-17 00:19:17.639676,Zurich Insurance Company,Distribution Data Scientist Analyst,29.10.2019,Job descriptionInfoCompanyDistribution Data Sc...,Zürich
6,2019-11-17 00:19:18.948675,Vorwerk International & Co. KmG,Senior Quality Data Analyst,14.11.2019,Job descriptionInfoCompanySenior Quality Data ...,Wollerau
7,2019-11-17 00:19:20.182978,Swiss Life Asset Managers,Market Data Analyst,24.10.2019,Job descriptionInfoCompanySwiss Life Asset Man...,Zürich
8,2019-11-17 00:19:21.408568,Migros Bank AG,Data Analyst / Anwendungsentwickler (w/m),12.11.2019,"Job descriptionInfoCompanySie sind stolz, für ...",Wallisellen
9,2019-11-17 00:19:22.942733,AZ Direct AG,Senior Data Scientist/Consultant,05.11.2019,Job descriptionInfoCompanyUnsere Kunden in Dia...,Cham ZG


In [55]:
df_datascientist.groupby('title').size()


title
(Senior) DMPK/PD Leader Small molecules, pRED Pharmaceutical Sciences                                                                 1
(Senior) DMPK/PD Leader, Large Molecules, pRED Pharmaceutical Sciences                                                                1
(Senior) Medical Director - Alzheimer's Disease                                                                                       1
(Senior) Principal Data Scientist, PHC Analytics (Methodology Focus)                                                                  1
(Senior) Principal Data Scientist, Real World Data - Neurosciences                                                                    1
(Senior) Real World Data Scientist-Epidemiologist, Neuroscience                                                                       1
(Senior) Scientist for Microbiome Data Analyses - pRED Pharmaceutical Sciences (Temporary Position 2 years)                           1
(Senior) Software Engineer - Big Data     

In [62]:
df_datascientist['title'].value_counts()

Data Scientist                                                                                                                                                 10
Data Analyst                                                                                                                                                    8
Data Engineer                                                                                                                                                   3
Junior Software Developer in PwC Digital Services                                                                                                               3
Product Specialist - Preclinical Software                                                                                                                       3
Senior Data Scientist                                                                                                                                           3
Clinical Scientist          

### 1. How many jobs are shared between these categories?

In [78]:
# Add the search term to each respective dataframe so we can keep track of it when comparing.
# This has to run before the dataframes are concatenated, otherwise the 'search' column is missing below.
df_datascientist['search']='Data Scientist'
df_dataanalyst['search']='Data Analyst'
df_pythondeveloper['search']='Python Developer'
df_dataengineer['search']='Data Engineer'
df_datamanager['search']='Data Manager'
df_dataarchitect['search']='Data Architect'
df_bigdataanalyst['search']='Big Data Analyst'
df_datapython['search']='Data Python'

In [82]:
# Combine all dataframes into a single dataframe
shared_concat = pd.concat([df_datascientist, df_dataanalyst, df_pythondeveloper, df_dataengineer, df_datamanager,
                           df_dataarchitect, df_bigdataanalyst, df_datapython])

In [93]:
duplicated_ex1st = shared_concat[shared_concat.duplicated(['title','company','location'],keep='first')]

In [92]:
duplicated_ex1st['search'].count()

1830

In [95]:
shared_concat['search'].count()

4423

In [103]:
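# Compute the share from the dataframes instead of hard-coding the two counts above
a = duplicated_ex1st.shape[0] / shared_concat.shape[0] * 100
print("there are approx " + str(round(a)) + "% duplicate job postings among all search keywords")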

there are approx 41% duplicate job postings among all search keywords
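The count above relies on `keep='first'`, which flags every repeat but not the first occurrence. A minimal sketch of the difference to `keep=False`, using made-up rows rather than the scraped data:

```python
import pandas as pd

# Hypothetical toy rows: the same job appears three times, another job once
df = pd.DataFrame({
    "title":   ["DA", "DA", "DA", "DE"],
    "company": ["A", "A", "A", "B"],
})

# keep="first": only the 2nd and 3rd copies are flagged as duplicates
repeats_only = df.duplicated(["title", "company"], keep="first")

# keep=False: every copy of a repeated job is flagged, including the first
all_copies = df.duplicated(["title", "company"], keep=False)

print(repeats_only.sum(), all_copies.sum())
```

With `keep='first'` the result answers "how many rows are redundant", while `keep=False` answers "how many rows belong to a job that appears more than once".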


In [None]:
# Let's clean each search result separately: the 'recommended' jobs at the top of each results page
# likely appear several times within one search, since companies pay to have them pinned there on every page.

In [117]:
#CLEANING OUR DATASETS BY REMOVING DUPLICATES FROM EACH SEARCH
df_datascientist_cleaned = df_datascientist.drop_duplicates(subset=['title','company','location'])
df_dataanalyst_cleaned = df_dataanalyst.drop_duplicates(subset=['title','company','location'])
df_pythondeveloper_cleaned = df_pythondeveloper.drop_duplicates(subset=['title','company','location'])
df_dataengineer_cleaned = df_dataengineer.drop_duplicates(subset=['title','company','location'])
df_datamanager_cleaned = df_datamanager.drop_duplicates(subset=['title','company','location'])
df_dataarchitect_cleaned = df_dataarchitect.drop_duplicates(subset=['title','company','location'])
df_bigdataanalyst_cleaned = df_bigdataanalyst.drop_duplicates(subset=['title','company','location'])
df_datapython_cleaned = df_datapython.drop_duplicates(subset=['title','company','location'])


In [120]:
# number of duplicates removed during cleaning from each respective dataframe
print(df_datascientist.shape[0] - df_datascientist_cleaned.shape[0])
print(df_dataanalyst.shape[0] - df_dataanalyst_cleaned.shape[0])
print(df_pythondeveloper.shape[0] - df_pythondeveloper_cleaned.shape[0])
print(df_dataengineer.shape[0] - df_dataengineer_cleaned.shape[0])
print(df_datamanager.shape[0] - df_datamanager_cleaned.shape[0])
print(df_dataarchitect.shape[0] - df_dataarchitect_cleaned.shape[0])
print(df_bigdataanalyst.shape[0] - df_bigdataanalyst_cleaned.shape[0])
print(df_datapython.shape[0] - df_datapython_cleaned.shape[0])



33
45
9
52
78
12
16
41
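The eight near-identical lines above could also be written as a loop over a dictionary. This is only a sketch under assumptions: the `frames` dictionary and its tiny contents are hypothetical stand-ins for the scraped dataframes.

```python
import pandas as pd

key = ["title", "company", "location"]

# Hypothetical miniature search results, keyed by search term
frames = {
    "Data Scientist": pd.DataFrame({
        "title": ["DS", "DS", "DA"],
        "company": ["A", "A", "B"],
        "location": ["Bern", "Bern", "Zug"],
    }),
    "Data Analyst": pd.DataFrame({
        "title": ["DA", "DA"],
        "company": ["B", "B"],
        "location": ["Zug", "Zug"],
    }),
}

# Remove duplicates within each search
cleaned = {term: df.drop_duplicates(subset=key) for term, df in frames.items()}

# Number of rows removed per search
removed = {term: len(frames[term]) - len(cleaned[term]) for term in frames}

# Tag each cleaned frame with its search term while concatenating
combined = pd.concat(
    [df.assign(search=term) for term, df in cleaned.items()],
    ignore_index=True,
)
print(removed)
```

Keeping the frames in one dictionary means a new search term only has to be added in one place.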


In [None]:
# let's re-run our comparison with the cleaned data:

In [121]:
shared_concat_cleaned = pd.concat([df_datascientist_cleaned, df_dataanalyst_cleaned, df_pythondeveloper_cleaned
                                   ,df_dataengineer_cleaned,df_datamanager_cleaned,df_dataarchitect_cleaned,
                                   df_bigdataanalyst_cleaned,df_datapython_cleaned])

In [122]:
duplicated_ex1st_cleaned = shared_concat_cleaned[shared_concat_cleaned.duplicated(['title','company','location'],keep='first')]

In [123]:
# total number of repeated job postings (every occurrence after the first)
duplicated_ex1st_cleaned.shape[0]

1544

In [124]:
# total number of postings after per-search cleaning (jobs repeated across searches are still included)
shared_concat_cleaned.shape[0]

4137

In [125]:
round(duplicated_ex1st_cleaned.shape[0] / shared_concat_cleaned.shape[0] * 100)

37

In [None]:
# After removing jobs replicated within each individual search table, approx 37% of the remaining rows
# (1544 of 4137) repeat a job already listed under another search category, which leaves 2593 distinct jobs.
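A more direct way to answer "how many jobs are shared" is to count, for each job, the number of distinct search terms it appears under. A minimal sketch on made-up rows; `jobs` is a hypothetical stand-in for `shared_concat_cleaned`:

```python
import pandas as pd

# Toy stand-in for shared_concat_cleaned (hypothetical rows, not scraped data)
jobs = pd.DataFrame({
    "title":    ["Data Analyst", "Data Analyst", "Data Engineer", "Python Developer", "Data Analyst"],
    "company":  ["A AG", "A AG", "B AG", "C AG", "A AG"],
    "location": ["Bern", "Bern", "Zug", "Basel", "Bern"],
    "search":   ["Data Scientist", "Data Analyst", "Data Engineer", "Python Developer", "Big Data Analyst"],
})

key = ["title", "company", "location"]

# Number of distinct search terms under which each job appears
n_searches = jobs.groupby(key)["search"].nunique()

# Jobs that show up under at least two search terms
shared_jobs = n_searches[n_searches > 1]

print(len(n_searches), "distinct jobs,", len(shared_jobs), "shared between categories")
```

Note that `groupby` drops rows whose key contains a missing value by default, so postings without a company or location would need `dropna=False` (available in recent pandas versions) or a fill value first.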

### 2. How much do the keywords “Data Analyst” and “Big Data Analyst” overlap?

In [126]:
shared_concat_DA_BDA = pd.concat([df_dataanalyst_cleaned,df_bigdataanalyst_cleaned])

In [127]:
duplicates_DA_BDA = shared_concat_DA_BDA[shared_concat_DA_BDA.duplicated(['title','company','location'],keep='first')]

In [128]:
duplicates_DA_BDA.shape[0]

192

In [129]:
shared_concat_DA_BDA.shape[0]

619

In [130]:
round(duplicates_DA_BDA.shape[0] / shared_concat_DA_BDA.shape[0] * 100)

31

In [None]:
# approx 31% of the combined Data Analyst + Big Data Analyst rows (192 of 619) are repeats,
# i.e. 192 jobs appear under both search terms
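The percentage above is relative to the combined table. The overlap can also be expressed relative to each search separately, or as a Jaccard index. A sketch with hypothetical miniature tables `da` and `bda` standing in for the two cleaned dataframes:

```python
import pandas as pd

key = ["title", "company", "location"]

# Hypothetical miniature versions of the two cleaned tables
da = pd.DataFrame({
    "title":    ["Data Analyst", "Market Data Analyst", "Junior Data Analyst"],
    "company":  ["A AG", "B AG", "C AG"],
    "location": ["Bern", "Zürich", "Thalwil"],
})
bda = pd.DataFrame({
    "title":    ["Data Analyst", "Big Data Engineer"],
    "company":  ["A AG", "D AG"],
    "location": ["Bern", "Zug"],
})

# Jobs present in both tables (each table has no internal duplicates)
both = da.merge(bda, on=key, how="inner")
n_both = len(both)

share_of_da = n_both / len(da)     # fraction of Data Analyst jobs also in Big Data Analyst
share_of_bda = n_both / len(bda)   # fraction of Big Data Analyst jobs also in Data Analyst
jaccard = n_both / (len(da) + len(bda) - n_both)  # intersection over union

print(n_both, share_of_da, share_of_bda, jaccard)
```

The three ratios answer slightly different questions, so it helps to state which denominator a reported overlap uses.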

### 3. Are there some companies doing more hires than average?

In [132]:
tesssst = shared_concat_cleaned[shared_concat_cleaned.duplicated(['company'],keep='first')]

In [133]:
# 3578 rows are postings by a company that already appears earlier in the table, implying that many companies
# have multiple job postings, which makes sense. Rows without a company name also count as repeats of each other.

Unnamed: 0,date,company,title,published,content,location,search,parsed
23,2019-11-17 00:19:42.908446,PwC,Junior Software Developer in PwC Digital Services,29.10.2019,Job descriptionInfoCompany You act as a Consul...,Zürich,Data Scientist,16/11/2019
30,2019-11-17 00:19:53.389399,Credit Suisse AG,Senior Data Scientist (80-100%),10.11.2019,Job descriptionInfoCompanySenior Data Scientis...,Zürich,Data Scientist,16/11/2019
31,2019-11-17 00:19:54.583631,PwC,Data Scientist (Consultant),28.10.2019,Job descriptionInfoCompany Du arbeitest bei in...,Zürich,Data Scientist,16/11/2019
32,2019-11-17 00:19:56.162605,Zurich Insurance Company,Actuarial Data Scientist (m/w) 80-100%,07.11.2019,Job descriptionInfoCompanyActuarial Data Scien...,Zürich,Data Scientist,16/11/2019
52,2019-11-17 00:20:22.609973,CTC Resourcing Solutions,Biomarker Operations Project Manager (896517),12.11.2019,Job descriptionInfoCompany,Basel area,Data Scientist,16/11/2019
57,2019-11-17 00:20:30.048413,gateB AG,Technischer Projektleiter,11.11.2019,Job descriptionInfoCompanyTechnischer Projektl...,Steinhausen,Data Scientist,16/11/2019
66,2019-11-17 00:20:42.509593,PSI,Linux Engineer,31.10.2019,Job descriptionInfoCompanyLinux Engineer ...,Villigen,Data Scientist,16/11/2019
71,2019-11-17 00:20:48.959469,"Helbling Technik Bern AG, Bern",Senior Hardware Engineer for Consumer Electron...,08.11.2019,Job descriptionInfoCompanySenior Hardware Engi...,Liebefeld,Data Scientist,16/11/2019
77,2019-11-17 00:20:57.609708,,Lead Data Scientist (M/W),15.11.2019,Job descriptionInfoLead Data Scientist (M/W) T...,,Data Scientist,16/11/2019
78,2019-11-17 00:20:58.782719,,Data Scientist - Senior Solution Developer,15.11.2019,Job descriptionInfoData Scientist - Senior Sol...,,Data Scientist,16/11/2019


In [137]:
#here is a list of how many unique jobs each company has posted
shared_concat_cleaned['company'].value_counts()

Credit Suisse AG                                    203
F. Hoffmann-La Roche AG                             162
Novartis AG                                          78
Swisscom (Schweiz) AG                                72
Google Switzerland GmbH                              50
Atos AG                                              45
Ernst & Young AG                                     44
PwC                                                  41
SIX Group AG                                         32
EPAM Systems (Switzerland) GmbH                      32
EF Education AG                                      30
Universität Basel                                    27
The Stamford Group AG                                25
ETH Zürich                                           25
Qualipet AG                                          24
EY                                                   24
ABB Schweiz AG                                       23
ELCA Informatik AG                              

In [138]:
#The mean number of jobs per company is
shared_concat_cleaned['company'].value_counts().mean()

5.281639928698752

In [141]:
#the most common number of jobs posted by a company is:
shared_concat_cleaned['company'].value_counts().mode()

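# the most common case is a company with a single posting, while the mean (about 5.3) is pulled up by a few
# large employers; this may indicate that most hiring companies are small, but posting counts alone cannot confirm it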

0    1
dtype: int64

In [143]:
# True marks the companies that have more than the mean number of job postings
shared_concat_cleaned['company'].value_counts() > shared_concat_cleaned['company'].value_counts().mean()

Credit Suisse AG                                     True
F. Hoffmann-La Roche AG                              True
Novartis AG                                          True
Swisscom (Schweiz) AG                                True
Google Switzerland GmbH                              True
Atos AG                                              True
Ernst & Young AG                                     True
PwC                                                  True
SIX Group AG                                         True
EPAM Systems (Switzerland) GmbH                      True
EF Education AG                                      True
Universität Basel                                    True
The Stamford Group AG                                True
ETH Zürich                                           True
Qualipet AG                                          True
EY                                                   True
ABB Schweiz AG                                       True
ELCA Informati

In [147]:
# True marks the companies that have more than 20 job postings
shared_concat_cleaned['company'].value_counts() > 20

Credit Suisse AG                                     True
F. Hoffmann-La Roche AG                              True
Novartis AG                                          True
Swisscom (Schweiz) AG                                True
Google Switzerland GmbH                              True
Atos AG                                              True
Ernst & Young AG                                     True
PwC                                                  True
SIX Group AG                                         True
EPAM Systems (Switzerland) GmbH                      True
EF Education AG                                      True
Universität Basel                                    True
The Stamford Group AG                                True
ETH Zürich                                           True
Qualipet AG                                          True
EY                                                   True
ABB Schweiz AG                                       True
ELCA Informati
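The comparisons above return True/False for every company. To list only the companies above the average, the boolean result can be used as a mask. A minimal sketch on made-up company names:

```python
import pandas as pd

# Hypothetical 'company' column: one row per posting
postings = pd.Series(
    ["X AG", "X AG", "X AG", "X AG", "Y AG", "Z AG", "Z AG", "W AG"],
    name="company",
)

# Postings per company
counts = postings.value_counts()

# Keep only companies strictly above the mean number of postings
above_mean = counts[counts > counts.mean()]

print(above_mean)
```

Because the distribution is very skewed, comparing against the median or a fixed threshold may be just as informative as the mean.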

### 4. How many jobs are there in different Kantons?

In [158]:
# Here are the 30 most common locations. These are towns rather than cantons, and some appear more than once
# under slightly different names (e.g. Zürich / Zurich, Genève / Geneva). They are not merged here, so the
# counts for those places are underestimates.
shared_concat_cleaned['location'].value_counts()[0:30]

  Zürich                  789
  Basel                   347
  Zurich                  160
  Bern                     89
  Lausanne                 65
  Zug                      51
  Wallisellen              42
  Genève                   37
  Rotkreuz                 33
  Baar                     32
  Geneva                   31
  Zürich, Zürich           28
  Kaiseraugst              28
  Winterthur               27
  Glattbrugg               25
  Thalwil                  24
  Olten                    21
  Heerbrugg                19
  Luzern                   18
  Schaan                   18
  St. Gallen               15
  Zug, CH                  14
  Liebefeld                14
  Muttenz                  13
  Schaffhausen             13
  Aarau                    13
  CH - Rapperswil/Jona     12
  Wollerau                 12
  Gland                    12
  Steinhausen              12
Name: location, dtype: int64
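To get from town names to cantons, the locations could be cleaned and mapped with a lookup table. The sketch below is only an illustration: `TOWN_TO_CANTON` is a hand-written, deliberately incomplete mapping and `to_canton` is a hypothetical helper, neither comes from the scrape.

```python
import pandas as pd

# Hand-written, incomplete town -> canton lookup (an assumption for illustration)
TOWN_TO_CANTON = {
    "Zürich": "ZH", "Zurich": "ZH", "Wallisellen": "ZH", "Winterthur": "ZH",
    "Genève": "GE", "Geneva": "GE",
    "Bern": "BE", "Basel": "BS", "Lausanne": "VD", "Zug": "ZG",
}

def to_canton(location):
    """Map a raw location string to a canton code, or 'unknown'."""
    if not isinstance(location, str):
        return "unknown"  # missing location
    # Keep the part before the first comma, e.g. 'Zug, CH' -> 'Zug'
    town = location.split(",")[0].strip()
    return TOWN_TO_CANTON.get(town, "unknown")

# Hypothetical raw values in the style of the 'location' column
locations = pd.Series(
    ["  Zürich", "Zurich", "Zürich, Zürich", "Zug, CH", "Genève", "Geneva", "Schaan", None]
)

per_canton = locations.map(to_canton).value_counts()
print(per_canton)
```

Inspecting the rows that end up as 'unknown' shows which towns still need to be added to the lookup.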

### 5. Is “machine learning” keyword most often in data scientist or data analyst jobs?

In [159]:
df_datascientist_cleaned.content.str.contains(r'machine learning').sum()

77

In [160]:
df_dataanalyst_cleaned.content.str.contains(r'machine learning').sum()

46

In [163]:
# 'machine learning' appears in the 'content' of 46 Data Analyst postings and of 77 Data Scientist postings.
# However, let's consider the percentage of the jobs this corresponds to:

print(round(df_datascientist_cleaned.content.str.contains(r'machine learning').sum() / df_datascientist_cleaned.shape[0] * 100))
print(round(df_dataanalyst_cleaned.content.str.contains(r'machine learning').sum() / df_dataanalyst_cleaned.shape[0] * 100))

18.0
11.0


In [None]:
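# 18% of Data Scientist jobs contain "machine learning", compared to 11% of Data Analyst jobs, so the keyword
# is more common in Data Scientist postings. The match is case-sensitive, so "Machine Learning" is not counted.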
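Since `str.contains` is case-sensitive by default, a case-insensitive match may give a fuller picture. A minimal sketch on made-up job texts (not the scraped content), which also treats missing text as "no match":

```python
import pandas as pd

# Hypothetical job descriptions, including one missing value
content = pd.Series([
    "We use machine learning daily",
    "Machine Learning Engineer wanted",
    "MACHINE LEARNING and statistics",
    "Reporting in Excel",
    None,
])

# Case-sensitive match, as in the cells above
strict = content.str.contains("machine learning", na=False).sum()

# Case-insensitive match
relaxed = content.str.contains("machine learning", case=False, na=False).sum()

# Share of all postings that mention the phrase in any capitalisation
share = relaxed / len(content) * 100

print(strict, relaxed, share)
```

Re-running the comparison this way would change the counts reported above, so the percentages would have to be recomputed on the real data.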

### 6. What is the distribution of most common keywords between and across categories?

In [215]:
ol = ['analyst', 'data scientist', 'data engineer', 'data manager', 'big data', 'architect']







177
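One possible way to look at the keyword distribution across categories is to flag, per posting, which keywords occur in the title and then sum the flags per search term. This is a sketch under assumptions: the `jobs` rows are made up, and matching on the title (rather than the content) is a choice made here for brevity.

```python
import pandas as pd

# A subset of the keyword list, written in lower case
keywords = ["analyst", "data scientist", "data engineer"]

# Hypothetical postings with their search term
jobs = pd.DataFrame({
    "search": ["Data Scientist", "Data Scientist", "Data Analyst", "Data Analyst"],
    "title":  ["Senior Data Scientist", "Data Analyst (m/w)",
               "Junior Data Analyst", "Data Engineer / Data Analyst"],
})

# One boolean column per keyword: does the lower-cased title contain it?
title = jobs["title"].str.lower()
flags = pd.DataFrame({kw: title.str.contains(kw, regex=False) for kw in keywords})

# Number of postings per search term that mention each keyword
distribution = flags.groupby(jobs["search"]).sum()

print(distribution)
```

Dividing each row by the number of postings in that search would turn the counts into shares, which are easier to compare between categories of different size.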