# Scraping job market data using a web crawler 

While searching for a summer internship, I became interested in doing a web scraping project, because it seemed quite feasible while still offering a wide range of possibilities. The fetched data could be used to extract useful insights such as long- and short-term job market trends or the change of salaries in a certain field over time. The most ambitious analysis I could think of was applying natural language processing (NLP) to the advert texts in order to answer questions such as <i>"What are the general requirements for a certain job/field?"</i>

### Approach
Using the libraries <b><i>requests</i></b> and <b><i>BeautifulSoup</i></b> enabled me to get a quick start with fetching easily accessible data. To extract the data I wanted, the methods .select() and .select_one() turned out to be sufficient for my use case. As my goal was a suitable pandas DataFrame, I first tried appending the result of each .select() call directly to the DataFrame, which did not work. My final (and working) approach was to create lists containing the <b>job titles, creation dates, employer names, locations, links to the detailed adverts and extents of the offered job (full-time, part-time etc.)</b> and to build the DataFrame from these lists. As this DataFrame could later be extended with job market data from other providers, I decided to also store the source of each job advert.
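
The list-based approach can be sketched on a small, made-up HTML fragment. The class names are the ones used by the crawler in this notebook, while the fragment itself, the example.org links and the helper text_or_none() are assumptions for illustration only. As a variation, the sketch collects one dictionary per advert instead of parallel lists, so an advert with a missing field cannot shift the columns against each other.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical HTML fragment; the markup of the real page may differ.
html = """
<div class="m-jobsListItem__dataContainer">
  <a class="m-jobsListItem__titleLink" href="https://example.org/jobs/1">Web Developer</a>
  <span class="m-jobsListItem__companyName">Example GmbH</span>
  <a class="m-jobsListItem__locationLink">Wien</a>
</div>
<div class="m-jobsListItem__dataContainer">
  <a class="m-jobsListItem__titleLink" href="https://example.org/jobs/2">Data Analyst</a>
  <span class="m-jobsListItem__companyName">Sample AG</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


rows = []
# select() returns all matching elements, select_one() only the first (or None)
for item in soup.select(".m-jobsListItem__dataContainer"):
    link = item.select_one(".m-jobsListItem__titleLink")
    rows.append({
        "Titel": text_or_none(item, ".m-jobsListItem__titleLink"),
        "Arbeitgeber": text_or_none(item, ".m-jobsListItem__companyName"),
        "Standort": text_or_none(item, ".m-jobsListItem__locationLink"),
        "Links": link.get("href") if link is not None else None,
    })

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(rows)
print(df)
```

The second advert has no location element, so its "Standort" entry stays empty instead of raising an error.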

### Difficulties
During the time I spent researching the topic, I gained great insights into the difficulties of large-scale data scraping, which is the bread and butter of a lot of tech firms out there. In particular, the required computing effort rose steadily with each list comprehension / loop I built in, especially since fetching() also requests the detail page of every single advert. This resulted in a waiting time of several minutes after calling the fetching() method for regular search expressions. Nevertheless, I am quite content with the final result considering the time invested. In addition to the fetching() method, the class also provides the possibility to save the data externally as a CSV file and to open the link of a detailed job advert in a web browser.
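
A large part of the waiting time probably comes from waiting for one HTTP response per job advert rather than from computation. The following is a minimal sketch of one possible remedy, assuming the website tolerates a few parallel requests: a small thread pool from the standard library. fetch_all() and fake_fetch() are hypothetical names, and fake_fetch() only imitates network latency; in the crawler it would have to be replaced by a real request such as requests.get(url).text.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_all(urls, fetch, max_workers=4):
    """Call fetch(url) for every URL using a small thread pool.

    The results are returned in the same order as the URLs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real request; sleeps to imitate network latency
    time.sleep(0.1)
    return "<html>" + url + "</html>"


# Hypothetical advert links
urls = ["https://example.org/jobs/" + str(i) for i in range(8)]

start = time.perf_counter()
pages = fetch_all(urls, fake_fetch, max_workers=4)
elapsed = time.perf_counter() - start
print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

Keeping max_workers small limits the load on the website; the value 4 is an arbitrary choice for this sketch.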

### Result
In the end, I have a clean pandas DataFrame, which can be used for further analysis.
I want to state clearly that this project is only for private/educational use, as I did not find any detailed explanation of the permissions or prohibitions regarding scraping the data of the website(s) used.
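
One possible starting point for checking what a site allows crawlers to do is its robots.txt file, which Python's standard library can evaluate. The following sketch parses a made-up robots.txt, so the rules and the example.org URLs are assumptions and say nothing about any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read()
robots_txt = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# can_fetch(useragent, url) reports whether the rules allow fetching the URL
allowed_jobs = parser.can_fetch("*", "https://example.org/jobs/python")
allowed_private = parser.can_fetch("*", "https://example.org/private/data")
print(allowed_jobs, allowed_private)
```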

In [1]:
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
import webbrowser
import timeit

In [2]:
class CrawlerKarriereAT():
    def __init__(self, key):
        # Defining instance variable(s)
        self.key = key
        self.url = "https://www.karriere.at/jobs/" + str(key)
        
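        # Initial request to get the number of result pages
        self.source = requests.get(self.url)
        soup = bs(self.source.text, "html.parser")
        try:
            # The last token of the pagination text is used as the page count
            self.pages = int(soup.select_one(".m-pagination__meta").get_text().split()[-1])
        except (AttributeError, ValueError, IndexError):
            # No pagination element or unexpected text: assume a single page
            self.pages = 1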
        

    def fetching(self):
        
        # Start loading timer
        start = timeit.default_timer()
        
        # Create empty lists
        url_list = []
        titles, locations, dates, employer, links, portal = [], [], [], [], [], []

        # Filling url_list with url with page variables
        for i in range(1, (self.pages)+1):
            url_list.append(str(self.url) + "?page=" + str(i))
        
        # Defining URL List as instance variable for reusability
        self.url_list = url_list
        
        # Looping through URLs in url_list and adding values to respective list
        for url in url_list:
            # Request each URL
            source = requests.get(url)
            # Souping each request
            soup = bs(source.text, "html.parser")
            
            # Adding values to respective lists
            titles.extend(element.text for element in soup.select(".m-jobsListItem__titleLink"))
            # One location per advert; None keeps the lists aligned if a location is missing
            for element in soup.select(".m-jobsListItem__dataContainer"):
                location = element.select_one(".m-jobsListItem__locationLink")
                locations.append(location.text if location is not None else None)
            # Skip the first three characters in front of the date
            dates.extend(element.text[3:] for element in soup.select(".m-jobsListItem__date"))
            employer.extend(element.text for element in soup.select(".m-jobsListItem__companyName"))
            links.extend(element["href"] for element in soup.find_all(class_="m-jobsListItem__titleLink", href=True))
         


        extent1 = []
        # Request each job advert link and fetch additional information,
        # in this case the work extent (full-time, part-time, ...)
        for link in links:
            isource = requests.get(link)
            soup = bs(isource.text, "html.parser")
            extent = [element.text for element in soup.select(".m-jobHeader__metaItemInner")]
            # The third meta item is used as the work extent; None if it is missing
            extent1.append(extent[2] if len(extent) > 2 else None)
           
        
        # Every advert fetched by this class comes from the same portal
        portal.extend(["karriere.at"] * len(titles))
                

        # Defining input data
        data={"Titel": titles, "Standort": locations, "Datum": dates, "Arbeitgeber": employer, "Portal": portal, "Links": links, "Ausmaß": extent1}
        # Create DataFrame with prev. defined input data
        df = pd.DataFrame(data, columns = ["Arbeitgeber", "Titel", "Standort", "Datum", "Portal", "Links", "Ausmaß"])
        # Sort values by date (descending)
        df["Datum"] = pd.to_datetime(df['Datum'], format="%d.%m.%Y")
        df.sort_values(by=["Datum"], inplace=True, ascending=False)
        # Drop duplicates
        df.drop_duplicates(inplace=True)
        # Reindex list
        df.reset_index(drop = True, inplace=True)
        # Defining DataFrame as instance variable for reusability
        self.df = df
        # Stop loading timer
        stop = timeit.default_timer()
        # Display loading time
        print('Loading time in seconds: ', round(stop - start, 2))
        
    def export(self):
        # Export CSV-file (reusability)
        self.df.to_csv("./karriere_at_" + str(self.key) + ".csv")
    
    def openLink(self):
        # Open link of respective job insertion entry (integer)
        print("Please choose a value between 0 and " + str(self.df.shape[0]-1) + (" to open the respective link!"))
        position = int(input())
        webbrowser.open(self.df["Links"][position])
        return self.df[position:position+1]
        

            

In [3]:
web_developement = CrawlerKarriereAT("web developement")
web_developement.fetching()
web_developement.df

Loading time in seconds:  332.14


| | Arbeitgeber | Titel | Standort | Datum | Portal | Links | Ausmaß |
|---|---|---|---|---|---|---|---|
| 0 | Wirecard Central Eastern Europe GmbH | Junior PHP / Web Developer (m/w/d) E-Commerce | Graz | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5666377 | Vollzeit |
| 1 | Bundesrechenzentrum GmbH | Java Fullstack - Lead Developer (w/m/d) | Wien | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5648771 | Vollzeit |
| 2 | Computer Futures | Senior Software Developer (m/f/d) React / HTM... | Wien | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5636642 | Vollzeit |
| 3 | BikerSOS GmbH | Frontend - Backend Developer | Linz | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5653302 | Vollzeit, Teilzeit, geringfügig |
| 4 | Raiffeisen Software GmbH | Frontend/UI Developer (m/w/d) | Linz | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5629100 | Vollzeit |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 451 | epunkt GmbH | Fullstack Web Developer (w/m/x) - Juniors & S... | Großraum Wels | 2020-02-07 | karriere.at | https://www.karriere.at/jobs/5659107 | Vollzeit |
| 452 | MODUL Technology GmbH | Researchers and Software Developers for Web I... | Wien | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5624657 | Vollzeit, Teilzeit, geringfügig |
| 453 | VACE Engineering GmbH | PHP Entwickler (m/w/d) | Wien | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5658468 | Vollzeit |
| 454 | Med-el Elektromedizinische Geräte GmbH | Cloud Developer (m/f) | Innsbruck | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5658367 | Vollzeit |


In [4]:
web_developement.openLink()

Please choose a value between 0 and 455 to open the respective link!
155


| | Arbeitgeber | Titel | Standort | Datum | Portal | Links | Ausmaß |
|---|---|---|---|---|---|---|---|
| 155 | Fronius | CRM Developer (m/w/d) | Wels | 2020-02-17 | karriere.at | https://www.karriere.at/jobs/5544659 | Vollzeit |


In [5]:
data_analytics = CrawlerKarriereAT("data analytics")
data_analytics.fetching()
data_analytics.df

Loading time in seconds:  217.24


| | Arbeitgeber | Titel | Standort | Datum | Portal | Links | Ausmaß |
|---|---|---|---|---|---|---|---|
| 0 | cubido business solutions gmbh | Projektleiter Big Data / Analytics (m/w/x) | Leonding | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5642541 | Vollzeit |
| 1 | A1 Telekom Austria AG | Praktikum Web Data Scientist (w/m) | Vienna | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5667299 | Praktika |
| 2 | A1 Digital International GmbH | M2M Solution Architect (w/m/d) | Wien | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5619593 | Vollzeit |
| 3 | waterdrop microdrink GmbH | (Senior) CRM and Marketing Automation Manager... | Wien | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5630565 | Vollzeit |
| 4 | Österreichische Post AG | Data Scientist (d/m/w) | Wien | 2020-02-20 | karriere.at | https://www.karriere.at/jobs/5642317 | Vollzeit |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 264 | Med-el Elektromedizinische Geräte GmbH | Research Scientist for Artificial Intelligenc... | Innsbruck | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5658369 | Vollzeit |
| 265 | ÖBB-Konzern | Senior SpezialistIn Datenanalytik im Fernverk... | 1100 Wien | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5658637 | Vollzeit |
| 266 | RÜBIG Gesellschaft m.b.H. & Co. KG. | Junior BI Mitarbeiter (m/w) | Marchtrenk | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5658588 | Vollzeit |
| 267 | solvistas GmbH | Data Scientist mit Schwerpunkt Business Intel... | Wien | 2020-02-06 | karriere.at | https://www.karriere.at/jobs/5624665 | Vollzeit |


In [7]:
web_developement.export()
data_analytics.export()
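
As a small example of the further analysis mentioned in the Result section, an exported CSV file can be loaded again and aggregated. The sketch below uses four rows copied from the web development output above instead of a real file, so the counts only describe this tiny sample. Because the "Ausmaß" column can hold several values in one cell, it is split before counting.

```python
import io
import pandas as pd

# Four rows copied from the output above; a real analysis would pass the path
# of an exported file to pd.read_csv() instead of this in-memory text.
csv_text = """,Arbeitgeber,Titel,Standort,Datum,Portal,Links,Ausmaß
0,Wirecard Central Eastern Europe GmbH,Junior PHP / Web Developer (m/w/d) E-Commerce,Graz,2020-02-20,karriere.at,https://www.karriere.at/jobs/5666377,Vollzeit
3,BikerSOS GmbH,Frontend - Backend Developer,Linz,2020-02-20,karriere.at,https://www.karriere.at/jobs/5653302,"Vollzeit, Teilzeit, geringfügig"
4,Raiffeisen Software GmbH,Frontend/UI Developer (m/w/d),Linz,2020-02-20,karriere.at,https://www.karriere.at/jobs/5629100,Vollzeit
453,VACE Engineering GmbH,PHP Entwickler (m/w/d),Wien,2020-02-06,karriere.at,https://www.karriere.at/jobs/5658468,Vollzeit
"""

# The first (unnamed) column is the index written by DataFrame.to_csv()
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Datum"])

# Number of adverts per location
per_location = df["Standort"].value_counts()
# Split cells such as "Vollzeit, Teilzeit, geringfügig" into single values before counting
per_extent = df["Ausmaß"].str.split(", ").explode().value_counts()
# Number of adverts per creation date
per_day = df.groupby("Datum").size()

print(per_location)
print(per_extent)
print(per_day)
```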