# **Web Scraping Best Hiring Companies for Developers from StackOverflow**
### _A Job Finding Service For Developers Based on Tech Stacks_
![img](https://i.imgur.com/7nilnCg.png)


## **Introduction**


[**Stack Overflow**](https://stackoverflow.com/) is a popular website for programmers and developers, with a large community of people who help each other work through problems and errors in their code. Beyond code, answers to most tech-related questions can be found there. Stack Overflow also has a job portal that lists up-to-date tech jobs for developers.

The aim of this project is to make use of the job listings posted on Stack Overflow by scraping them and tabulating the important information. With the results in a table, one can easily find the jobs best suited to a **particular tech stack.**

### **The information collected for each company is as follows:**

  - **Followers** :
    Total number of followers

  - **Industry** :
    The type of industry (e.g. e-commerce, fintech, etc.)

  - **Founded** :
    The year it was founded

  - **Size** :
    Total number of employees working
  
  - **Website** :
    Official link of the company

  - **Office Location** :
    Location of office

  - **Tech stack 1** :
    The first tech stack listed by the company

  - **Tech stack 2** :
    The second tech stack listed by the company

  - **Location Link** :
    Map link to the company's office location

Based on the above information, a candidate can easily shortlist companies by their top two `tech stacks`.

### Steps involved in building the project:

- Planning the code flow
- Installing and Importing the Required Libraries
- Building Helper Functions to Extract Data
- Creating the **Main and Wrapper Functions** to wrap all the Functions
- Summary
- Future Work
- References


## **Planning the Code Flow**

Let's head to Stack Overflow's [companies page](https://stackoverflow.com/jobs/companies) to analyse how the job updates are presented.

![img](https://i.imgur.com/ILJXI3M.png)



Each job update spans two pages. The `first page` is the main page, which consists of a list of all jobs with their links. Clicking a link opens the `second page`, which gives more details about that particular job.

![img](https://i.imgur.com/jBsrWyW.png)

Inspecting the first page reveals the link to the second page for each job. This link can then be used to scrape the information about that job.
**So our first step is to scrape the links to all the second pages from the first page, and then to extract the data from the second page for each job.**

Steps involved in scraping the web page:
- Write a main function as the starting point of the program
- Write a wrapper function that calls the other functions
- Write a function to scrape the links of all jobs from the first page
- Write functions to scrape the data of individual jobs, taking the links returned by the main parser function as input

## **Installing and Importing the Required Libraries**

In [1]:
!pip install requests bs4 pandas --quiet

In [2]:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd
import numpy as np

## **Building Helper Functions to Extract Data**

### **Scraping the First Page**

#### **mainParser Function**
As mentioned in the code flow, the `mainParser` function scrapes all the job links on the first page.

In [3]:



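def mainParser(pageNo):
    # Base URL of the site and path of the companies listing
    links={
    "baseUrl":"https://stackoverflow.com",
    "subUrl":"/jobs/companies"
    }

    mainInfo={}
    try:
        mainUrl=links['baseUrl']+links['subUrl']+"?pg="+str(pageNo)

        response=requests.get(mainUrl)
        mainPage=bs(response.text,"html.parser")
        # Each title is an <a class="s-link"> whose href points to the second page
        mainAtags=mainPage.find_all('a',class_="s-link")
        for tag in mainAtags:
            title=tag.text
            link=tag["href"]
            # Skip anchors that carry no visible title
            if title!='\n\n':
                mainInfo[title]=links["baseUrl"]+link

    except Exception as e:
        print(str(e))

    # Always return a dict (possibly empty) so callers can safely pass it to dict.update()
    return mainInfo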
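
`mainParser` builds its URLs by string concatenation, which assumes every `href` on the listing page is a site-relative path. As a hedged alternative, the standard library's `urllib.parse` can build the same URLs and also copes with links that are already absolute. The helper names `listing_url` and `absolute_link` and the sample path `/jobs/companies/example-co` are illustrative assumptions, not part of the original notebook.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://stackoverflow.com"
LISTING_PATH = "/jobs/companies"


def listing_url(page_no):
    """Build the URL of one listing page, e.g. .../jobs/companies?pg=2."""
    return urljoin(BASE_URL, LISTING_PATH) + "?" + urlencode({"pg": page_no})


def absolute_link(href):
    """Turn a scraped href into an absolute URL.

    Relative paths are resolved against BASE_URL; absolute URLs pass through unchanged.
    """
    return urljoin(BASE_URL, href)


print(listing_url(2))
print(absolute_link("/jobs/companies/example-co"))        # hypothetical relative href
print(absolute_link("https://example.com/careers"))       # hypothetical absolute href
```

Inside `mainParser`, `absolute_link(tag["href"])` could stand in for `links["baseUrl"]+link` if mixed link styles ever appear.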

### **Scraping the Second Page**

In order to scrape the second page, two functions have been developed.

**The `infoParser` function, which collects:**
- Website link
- Size of the company
- The type of industry
- Year founded
- Number of followers

**The `extraParser` function, which collects:**
- Company motto
- Location Link
- Office Locations
- **Tech Stack 1**
- **Tech Stack 2**


#### **infoParser Function**

The links scraped from the first page by the `mainParser` function are iterated over one by one, and each is passed to the `infoParser` function, which scrapes the information about that particular job.

In [4]:
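def infoParser(link):
    attributes={}
    response=requests.get(link)
    page=bs(response.text,'html.parser')
    # The box holding the label/value pairs (Website, Industry, Size, ...)
    divTag=page.find_all('div',class_="ba bc-black-100 ps-relative p16 bar-sm")
    # Every label is a <p> tag with this class
    pTags=divTag[0].find_all('p',class_="fw-bold fs-caption fs-category fc-black-400 mb0")

    attributes.update(aboutpTags(pTags))
    # Return the parsed page as well, so extraParser can reuse it without a second request
    return page, attributes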

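Helper functions for the `aboutpTags` function. Each one checks the label text of a `<p>` tag and, on a match, returns the value stored next to it; otherwise it returns `False`.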

In [5]:
def web_tag(tag):
    if tag.text=="Website":
        webtag=tag.parent()[1].find_all('a')[0]["href"]
        return webtag
    return False

def ind_tag(tag):
    if tag.text=="Industry":
        indtag=tag.parent()[1].text.strip()
        return indtag
    return False

def size_tag(tag):
    if tag.text=="Size":
        sizetag=tag.parent()[1].text.strip()
        return sizetag
    return False

def found_tag(tag):
    if tag.text=="Founded":
        ftag=tag.parent()[1].text.strip()
        return ftag
    return False

def follow_tag(tag):
    if tag.text=="Followers":
        fftag=tag.parent()[1].text.strip()
        return fftag
    return False
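
The expression `tag.parent()[1]` used in these helpers relies on a Beautiful Soup shorthand: calling a `Tag` object is the same as calling `find_all()` on it, so `tag.parent()` lists every tag inside the label's parent in document order. The sketch below illustrates this on a small hand-written snippet. The markup and the class name `label` are assumptions made for the example; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one label/value pair on the second page.
html = """
<div class="field">
  <p class="label">Website</p>
  <p><a href="https://example.com/">example.com</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
label = soup.find("p", class_="label")

# Calling a Tag is shorthand for find_all(): every tag inside the parent <div>.
children = label.parent()
value_tag = children[1]  # index 0 is the label itself, index 1 is the value <p>
print(value_tag.find_all("a")[0]["href"])

# An equivalent lookup that does not depend on positional indexing.
sibling = label.find_next_sibling("p")
print(sibling.a["href"])
```

Positional indexing is compact but breaks if the page inserts another tag before the value, which is why the sibling lookup may be the more robust choice.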

Helper function for the `infoParser` function

In [6]:
def aboutpTags(pTags):
    attributes={}
    for tag in pTags:

        w_tag=web_tag(tag)
        if w_tag:attributes['website']=w_tag

        i_tag=ind_tag(tag)
        if i_tag:attributes['industry']=i_tag            
             
        s_tag=size_tag(tag)
        if s_tag:attributes['size']=s_tag     
        
        f_tag=found_tag(tag)
        if f_tag:attributes['founded']=f_tag     

        ff_tag=follow_tag(tag)
        if ff_tag:attributes['followers']=ff_tag  

    if not attributes.get("followers"): attributes['followers']=np.nan
    if not attributes.get("founded"): attributes['founded']=np.nan
    if not attributes.get("size"): attributes['size']=np.nan
    if not attributes.get("industry"): attributes['industry']=np.nan
    if not attributes.get("website"): attributes['website']=np.nan
    return attributes
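
The pattern in `aboutpTags` (match a label, store its value, default to NaN when it is missing) can also be written as a single lookup table. The following is a minimal sketch of that idea operating on plain `(label, value)` pairs rather than on tags, so it runs without any scraping. `FIELD_KEYS`, `collect_fields` and the sample values are hypothetical names and data introduced for illustration, and `math.nan` is used here in place of `np.nan`.

```python
import math

# Label text as shown on the page -> key used in the result dict.
FIELD_KEYS = {
    "Website": "website",
    "Industry": "industry",
    "Size": "size",
    "Founded": "founded",
    "Followers": "followers",
}


def collect_fields(pairs):
    """Map (label, value) pairs to a dict, defaulting every known field to NaN."""
    attributes = {key: math.nan for key in FIELD_KEYS.values()}
    for label, value in pairs:
        key = FIELD_KEYS.get(label.strip())
        if key and value:  # ignore unknown labels and empty values
            attributes[key] = value.strip()
    return attributes


# Hypothetical input: two known labels and one label the table does not know.
sample = [("Industry", " Fintech "), ("Size", "51-200 employees"), ("Unknown", "x")]
result = collect_fields(sample)
print(result)
```

Adding a new field then only needs one more entry in the table instead of a new helper function.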

Helper functions for the `extraParser` function

In [7]:
def location_tag(page):

    locationDivTag=page.find('div',class_="mt32 js-locations")
    if locationDivTag:
        llink=locationDivTag.find('a')["href"]
        if llink:
            location_link=llink
        else:
            location_link=None
        lname=locationDivTag.find('a')["data-query"]
        if lname:
            office_location=lname
        else:
            office_location=None
    else:
        location_link=None
        office_location=None
    return (location_link,office_location)


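def tech_tag(page):
    # The tech stack section may be missing, or may list fewer than two tags
    tech_stack=page.find('div',class_="fs-body2 mt32 js-nav-content")
    if tech_stack is None:
        return (None,None)
    stack_a_tags=tech_stack.find_all('a',class_="flex--item s-tag no-tag-menu")
    names=[tag.text for tag in stack_a_tags[:2]]
    names+=[None]*(2-len(names))   # pad so that two values are always returned
    return (names[0],names[1])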


#### **extraParser Function**

This function scrapes additional data about each job: the company motto (stored under the key `moto`), the office location with its map link, and the top two tech stacks.

In [8]:
def extraParser(link, page):
    attributes={}

    # The motto paragraph may be absent on some pages
    divTag=page.find('p',class_="fc-light lh-md fs-body3 sticky:fade-out mb12 sm:mb0")
    attributes["moto"]=divTag.text.strip() if divTag else None

    l_tag=location_tag(page)
    attributes["locationLink"]=l_tag[0]
    attributes["officelocations"]=l_tag[1]

    ts_tag=tech_tag(page)
    attributes["tstack1"]=ts_tag[0]
    attributes["tstack2"]=ts_tag[1]

    return attributes

## **Creating the Main and Wrapper Functions**

### **Wrapper Functions**

The `subParser` function takes the links of all the jobs from the first page and iterates through them. Each link is passed to the `infoParser` function, and the page it returns is passed on to the `extraParser` function, to scrape the second page.

In [9]:
def subParser(fulldata):
    attributes={
        "followers":[],
        "industry":[],
        "founded":[],
        "size":[],
        "website":[],
        "officelocations":[],
        "moto":[],
        "tstack1":[],
        "tstack2":[],
        "locationLink":[]    
    }
    for data in fulldata:
        link=fulldata[data]
        page, attribute=infoParser(link)
        attribute.update(extraParser(link,page))
        for key in attributes:
            attributes[key].append(attribute[key])
    return attributes
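
`subParser` sends one request per link in a tight loop, and a single page that fails to parse stops the whole run. A hedged sketch of a loop that pauses between requests and records failures instead of aborting is shown below. `scrape_all`, `fake_fetch`, the one-second default delay and the example URLs are assumptions for illustration; in the notebook, `fetch` would wrap the calls to `infoParser` and `extraParser`.

```python
import time


def scrape_all(links, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(link) for every entry, pausing between consecutive requests.

    Returns (results, failures), where failures maps a name to the error text.
    """
    results, failures = {}, {}
    for index, (name, link) in enumerate(links.items()):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        try:
            results[name] = fetch(link)
        except Exception as exc:  # one broken page should not abort the run
            failures[name] = str(exc)
    return results, failures


# Stand-in for the real parsers so the sketch runs without network access.
def fake_fetch(link):
    if link.endswith("/broken"):
        raise ValueError("page layout not recognised")
    return {"website": link}


pauses = []  # records the requested pauses instead of actually sleeping
links = {
    "A": "https://example.com/a",
    "B": "https://example.com/broken",
    "C": "https://example.com/c",
}
results, failures = scrape_all(links, fake_fetch, delay=0.5, sleep=pauses.append)
print(sorted(results), failures, pauses)
```

Passing `sleep` as a parameter keeps the function easy to test; in real use the default `time.sleep` applies.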

The `dfBuild` function takes a page count as input and passes each page number to the `mainParser` function. Note that `range(1,no_pages)` excludes its upper bound, so pages 1 through `no_pages - 1` are scraped. Once all the links are collected, they are sent as input to the `subParser` function. `dfBuild` returns a **Pandas DataFrame** of all the data collected.

In [10]:
def dfBuild(no_pages):
    mainInfos={}

    for pageNo in range(1,no_pages):
        mainInfo=mainParser(pageNo)
        mainInfos.update(mainInfo)

    attributes=subParser(mainInfos)
    attributes["Company Name"]=list(mainInfos.keys())
    df=pd.DataFrame(attributes)
    return df
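
Because `range(1,no_pages)` stops before `no_pages`, a call such as `dfBuild(15)` visits 14 listing pages. The notebook does not say whether that is intended, so the sketch below only illustrates the difference. `page_numbers` and its `inclusive` flag are hypothetical names introduced for this example.

```python
def page_numbers(no_pages, inclusive=True):
    """Return the listing page numbers to visit, starting from page 1."""
    stop = no_pages + 1 if inclusive else no_pages
    return list(range(1, stop))


# Same pages as range(1, 15) in dfBuild: the last page number is left out.
print(page_numbers(15, inclusive=False))

# Inclusive variant: visits page 15 as well.
print(page_numbers(15))
```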

### **Main Function**

The starting point of the program, which calls the `dfBuild` function with the number of pages as input.

In [11]:
if __name__=='__main__':
    number_of_pages=15
    df=dfBuild(number_of_pages)
    df.to_csv('jobs.csv',index=False)

In [12]:
df

| | followers | industry | founded | size | website | officelocations | moto | tstack1 | tstack2 | locationLink | Company Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 101 | Banking, Financial Technology, Software Develo... | 1976 | 5k-10k employees | https://www.jackhenry.com/ | 663 West Highway 60\nMonett, MO 65708 | Jack Henry is a well-rounded financial technol... | scala | go | https://www.google.com/maps/search/?api=1&quer... | Jack Henry & Associates, Inc.® |
| 1 | 3 | Ad Tech, Cloud-Based Solutions, Real Estate | 2021 | 1k-5k employees | https://www.aviv-group.com/ | Axel-Springer-Straße 65, 10969 Berlin, Germany | Unlock everyone's perfect place. | amazon-web-services | sql | https://www.google.com/maps/search/?api=1&quer... | AVIV Group GmbH |
| 2 | 64 | Pharmaceuticals | NaN | 10k+ employees | https://www.sanofi.ca/ | 240 Richmond Street\nToronto, ON M5V 1V6 | We are an innovative global healthcare company... | artificial | intelligence | https://www.google.com/maps/search/?api=1&quer... | Sanofi |
| 3 | 89 | Cybersecurity, Network Security, Software Deve... | 2012 | 1k-5k employees | https://nordsecurity.com/careers | Vilnius, Lithuania | Creating a safe cyber future. | php | go | https://www.google.com/maps/search/?api=1&quer... | Nord Security |
| 4 | 672 | Agile Software Development, Automotive, Mobility | 2019 | 201-500 employees | https://www.finn.auto/?utm_source=stackoverflo... | Prinzregentenplatz 9, 81675 Munich | FINN is your monthly car subscription with eve... | typescript | amazon-web-services | https://www.google.com/maps/search/?api=1&quer... | FINN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 119 | 13 | Big Data, Data Science, Life Sciences | 1992 | 501-1k employees | https://www.ebi.ac.uk/careers | Wellcome Genome Campus, Hinxton, Saffron Walde... | We help scientists realise the potential of bi... | javascript | java | https://www.google.com/maps/search/?api=1&quer... | EMBL-EBI (EMBL's European Bioinformatics Insti... |
| 120 | NaN | Beauty | 1882 | 10k+ employees | https://www.beiersdorf.com/career/departments-... | Beiersdorf AG\nUnnastrasse 48\n20245 Hamburg\n... | WE ARE SKIN CARE | sap | sap-commerce-cloud | https://www.google.com/maps/search/?api=1&quer... | Beiersdorf |
| 121 | 2 | Health Care | 2011 | 10k+ employees | https://www.optum.com/ | USA | Optum is an information and technology-enabled... | apis | agilescrum | https://www.google.com/maps/search/?api=1&quer... | Optum |
| 122 | 1 | Business Process Outsourcing, IT Consulting, S... | 1976 | 10k+ employees | https://www.cgi.com/canada/en-ca | Quebec\nMontréal - Head office\n1350 René-Léve... | CGI developers contribute to several projects ... | azure | amazon-web-services | https://www.google.com/maps/search/?api=1&quer... | CGI (Canada) |


## **Summary**

Thus, the Stack Overflow job portal was scraped and parsed to gather important information about jobs, which makes life a lot easier for a person looking for a job based on a specific **tech stack**. Other useful details, such as the company's website and its number of employees, are collected as well. These small factors help candidates choose the company that suits them best.





The dataframe can also be regrouped to view it from a different perspective. Grouping by the two tech stack columns helps users narrow their job search to the tech stacks they are proficient in.

In [13]:
tech=df.groupby(['tstack1','tstack2'])
tech.first()

| tstack1 | tstack2 | followers | industry | founded | size | website | officelocations | moto | locationLink | Company Name |
|---|---|---|---|---|---|---|---|---|---|---|
| .net | asp.net-core | 77 | Cloud-Based Solutions, Information Technology,... | 2019 | 51-200 employees | https://www.atma.io/ | Wickenburggasse 32, 8010 Graz, Austria | atma.io -->the world's leading connected produ... | https://www.google.com/maps/search/?api=1&quer... | atma.io |
| .net | azure | 85 | Consulting, Software Development / Engineering | 1845 | 10k+ employees | https://www2.deloitte.com/us/en/pages/careers/... | 420 North 20th Street\nSuite 2400\nBirmingham,... | Engineer your future at Deloitte. Apply your e... | https://www.google.com/maps/search/?api=1&quer... | Deloitte US |
| .net | c# | 62 | SaaS | 2004 | 1k-5k employees | https://www.pluralsight.com/careers | Draper, Utah | We help thousands of organizations upskill and... | https://www.google.com/maps/search/?api=1&quer... | Pluralsight |
| .net | core | 64 | Corporate Training, Education, Higher Education | 1983 | 1k-5k employees | https://careers.qa.com/ | International House, 1 St Katharines Way, Lond... | Powering our clients, learners, students and c... | https://www.google.com/maps/search/?api=1&quer... | QA Group |
| .net | corestandard | 231 | Software Development / Engineering | 1907 | 10k+ employees | https://careers.bakerhughes.com/global/en/digi... | Houston, TX | We take energy forward - making it safer, clea... | https://www.google.com/maps/search/?api=1&quer... | Baker Hughes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| typescript | angular | 71 | Information Technology | NaN | 10k+ employees | NaN | Bavaria, DE | Let's care for tomorrow | https://www.google.com/maps/search/?api=1&quer... | Allianz Technology |
| typescript | react-hooks | 81 | Computer Software | 2006 | 201-500 employees | https://www.gsoft.com/en/ | 1751 Rue Richardson #1050, Montreal, Quebec H3... | We’re GSoft, home to a family of software prod... | https://www.google.com/maps/search/?api=1&quer... | GSoft |
| ubuntu | cloud | 501 | Cloud Computing, Information Technology, Inter... | 2004 | 501-1k employees | https://canonical.com/ | Remote | Deliver, maintain, secure and sustain. Open so... | https://www.google.com/maps/search/?api=1&quer... | Canonical |
| verilogvhdl | c++ | 109 | Capital Markets, Financial Technology, High Fr... | 1989 | 1k-5k employees | http://www.imc.com | Amsterdam Office\nInfinity Building\nAmstelvee... | Global market maker trading on over 100 venues... | https://www.google.com/maps/search/?api=1&quer... | IMC Trading |
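
Grouping shows one company per tech stack pair. To shortlist every company that mentions a given technology in either column, a boolean mask is one possible approach. The sketch below uses three rows copied from the first output table; the helper name `companies_using` is an assumption made for this example.

```python
import pandas as pd

# Three rows taken from the scraped output shown earlier; the full DataFrame is larger.
sample = pd.DataFrame({
    "Company Name": ["Jack Henry & Associates, Inc.®", "Nord Security", "FINN"],
    "tstack1": ["scala", "php", "typescript"],
    "tstack2": ["go", "go", "amazon-web-services"],
})


def companies_using(frame, tech):
    """Return the rows where `tech` is either of the two listed tech stacks."""
    mask = (frame["tstack1"] == tech) | (frame["tstack2"] == tech)
    return frame[mask]


shortlist = companies_using(sample, "go")
print(shortlist["Company Name"].tolist())
```

The same function can be applied to the full `df` built by `dfBuild`, for example `companies_using(df, "typescript")`.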


## **Future Work**


Since Stack Overflow's career page updates its job openings regularly, the notebook can be converted to a script and scheduled to run on a server at regular intervals to pick up the latest jobs. The data can also be mailed directly to the user using Python's [smtplib module](https://docs.python.org/3/library/smtplib.html).
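
As a rough sketch of the mailing idea, the standard library's `email.message.EmailMessage` can carry the CSV as an attachment and `smtplib` can send it. The addresses, subject line, port 587 with STARTTLS, and the function names `build_report_email` and `send_email` are all placeholder assumptions; a real deployment would use the settings of its own mail provider.

```python
import smtplib
from email.message import EmailMessage


def build_report_email(sender, recipient, csv_text, filename="jobs.csv"):
    """Create an email with the scraped CSV attached as text/csv."""
    msg = EmailMessage()
    msg["Subject"] = "Latest Stack Overflow job data"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("The latest scraped data is attached as a CSV file.")
    msg.add_attachment(csv_text, subtype="csv", filename=filename)
    return msg


def send_email(msg, host, port=587, user=None, password=None):
    """Send the message through an SMTP server (placeholder connection settings)."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        if user:
            server.login(user, password)
        server.send_message(msg)


# Building the message needs no network access; sending is left to the caller.
message = build_report_email(
    "bot@example.com", "dev@example.com", "followers,industry\n101,Banking\n"
)
print(message["Subject"], [part.get_filename() for part in message.iter_attachments()])
```

In the script version of the notebook, `csv_text` could come from `df.to_csv(index=False)`, which returns the CSV as a string when no path is given.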

## **References**

- [Tutorial on Webscraping project from scratch by Jovian](https://www.youtube.com/watch?v=RKsLLG-bzEY)
- [Stack Overflow's career page](https://stackoverflow.com/jobs/companies)
- [Beautiful Soup module documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Requests module documentation](https://requests.readthedocs.io/en/latest/)