# Web Scraping Best Hiring Companies for Developers from StackOverflow


[**Stack Overflow**](https://stackoverflow.com/) is a popular website for programmers and developers. Its community helps people work through problems and errors in their code, and it also covers broader tech questions beyond code. Stack Overflow also has a job portal where tech jobs for developers across various tech stacks are listed.

This notebook scrapes the [**company listings page**](https://stackoverflow.com/jobs/companies) of that job portal and collects key information about each company hiring through Stack Overflow. The aim is to make the search easier for people looking for **jobs based on a particular tech stack**.





## Web Scraping the Page with detailed explanation

### [Web Scraping](https://www.geeksforgeeks.org/what-is-web-scraping-and-how-to-use-it/)
Web scraping is the automated process of extracting data from web pages. The data obtained is usually raw and unstructured, so it is then processed to derive the important information.

Fetching and parsing the data are therefore the two basic steps in web scraping.

**Fetching**

The `Requests` library is used to fetch data in Python.

**Parsing**

The `Beautiful Soup` library is used to parse the raw data.
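
The two steps can be sketched on a tiny inline HTML snippet, so no network access is needed. The markup and the `sample_html` name are made up for illustration; in this notebook the HTML comes from `requests.get(url).text` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
# In a real run this string would be requests.get(url).text (the fetching step).
sample_html = """
<ul>
  <li><a class="s-link" href="/jobs/companies/acme">Acme</a></li>
  <li><a class="s-link" href="/jobs/companies/globex">Globex</a></li>
</ul>
"""

# Parsing step: build a searchable tree from the raw HTML
soup = BeautifulSoup(sample_html, "html.parser")

# Pull out the text and the href attribute of every matching <a> tag
companies = {a.text: a["href"] for a in soup.find_all("a", class_="s-link")}
print(companies)
```

The same `find_all` pattern, filtered by tag name and CSS class, is what the parsing functions in this notebook rely on.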


### **The following information is collected for each company:**

  - **Followers**

    Total number of followers

  - **Status**

    Status of the company (e.g. Public or Private)

  - **Industry**

    The type of industry (e.g. e-commerce, fintech)

  - **Founded**

    The year it was founded

  - **Size**

    Total number of employees

  - **Website**

    Official website of the company

  - **Office Location**

    Location of the office

  - **Motto**

    Short tagline of the company (stored under the key `moto`)

  - **Tech stack 1**

    First tech stack listed

  - **Tech stack 2**

    Second tech stack listed

  - **GitHub link**

    Link to the company's GitHub, if provided

  - **Twitter link**

    Link to the company's Twitter, if provided

  - **Location Link**

    Google Maps link for the office location






### Installing required libraries

In [None]:
!pip install requests bs4 pandas --quiet

### Importing required libraries

In [None]:
import requests
from bs4 import BeautifulSoup as bs
import json
import pandas as pd

### Required links to parse
This is a dictionary of links that contains the main Stack Overflow URL and the relative link of the company listings page.

In [None]:
links={
    "baseUrl":"https://stackoverflow.com",
    "subUrl":"/jobs/companies"
    }

### Functions for parsing and scraping data

### Parsing the main page

This function parses one page of Stack Overflow's company listings. It grabs each company's name and the link to its details page and stores them in a dictionary, with the company name as the key and the company URL as the value.

```
maininfo['company name']='company_url.com'
```


In [None]:
def mainParser(pageNo):
    # Collects {company name: company page url} for one page of the listing
    mainInfo={}
    try:
        mainUrl=links['baseUrl']+links['subUrl']+"?pg="+str(pageNo)

        response=requests.get(mainUrl)
        mainPage=bs(response.text,"html.parser")
        mainAtags=mainPage.find_all('a',class_="s-link")
        for tag in mainAtags:
            title=tag.text
            link=tag["href"]
            # Skip links whose text is only line breaks
            if title!='\n\n':
                mainInfo[title]=links["baseUrl"]+link

    except Exception as e:
        print(str(e))

    # Always return a dictionary, even if the request or parsing failed
    return mainInfo
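
As a rough illustration of how the page URL and the absolute company links are put together, the sketch below uses `urllib.parse` from the standard library rather than the string concatenation used in `mainParser`. The `page_url` helper and the `/jobs/companies/example` path are hypothetical.

```python
from urllib.parse import urlencode, urljoin

links = {
    "baseUrl": "https://stackoverflow.com",
    "subUrl": "/jobs/companies",
}

def page_url(page_no):
    """Build the listing URL for one results page (hypothetical helper)."""
    query = urlencode({"pg": page_no})
    return urljoin(links["baseUrl"], links["subUrl"]) + "?" + query

# A relative href, as found in an <a> tag, becomes an absolute URL
company_url = urljoin(links["baseUrl"], "/jobs/companies/example")

print(page_url(2))
print(company_url)
```

`urljoin` takes care of the slash between the base URL and the relative path, which plain concatenation leaves to the caller.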

### Parsing a particular company page

The functions *aboutParser()* and *extraParser()* grab information from a particular company's page. Both take the *link* of the page as an argument. *aboutParser()* downloads and parses the page, and *extraParser()* reuses that parsed page, so *aboutParser()* must be called first.

#### Parsing About Information

This function grabs the following information:
  - Website link
  - Size of the company
  - Status of the company
  - The type of industry
  - Year founded
  - Number of followers
  - Twitter Link
  - Github Link




In [None]:
def aboutParser(link):
    # extraParser() reuses this parsed page, so it is kept as a global
    global page
    response=requests.get(link)
    page=bs(response.text,'html.parser')
    divTag=page.find_all('div',class_="ba bc-black-100 ps-relative p16 bar-sm")
    pTags=divTag[0].find_all('p',class_="fw-bold fs-caption fs-category fc-black-400 mb0")

    # Label text on the page -> key in the attributes dictionary
    labels={
        "Website":"website",
        "Industry":"industry",
        "Size":"size",
        "Founded":"founded",
        "Status":"status",
        "Followers":"followers"
    }
    # Start every field as None so each list grows by exactly one entry per company
    found=dict.fromkeys(labels.values())

    for tag in pTags:
        key=labels.get(tag.text)
        if key is None:
            continue
        # tag.parent()[1] is the second tag inside the label's parent, which holds the value
        valueTag=tag.parent()[1]
        if key=="website":
            found[key]=valueTag.find_all('a')[0]["href"]
        else:
            found[key]=valueTag.text.strip()

    for key,value in found.items():
        attributes[key].append(value)

    socialTags=divTag[0].find_all('div',class_="flex--item")[-1]
    socialTag=socialTags.find_all('a',class_="js-gps-track")
    twitterlink=None
    githublink=None

    for tag in socialTag:
        if "twitter" in tag["href"]:
            twitterlink=tag["href"]
        if "github.com" in tag["href"]:
            githublink=tag["href"]

    # Append once per company, even if several matching links were found
    attributes["twitterlink"].append(twitterlink)
    attributes["githublink"].append(githublink)
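
The label-to-value lookup that `aboutParser` performs can be tried on a small inline snippet. The markup, class names and the `wanted` mapping below are invented for illustration and are much simpler than the real company page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each label <p> shares a parent with its value element
sample_html = """
<div class="about">
  <div><p class="label">Industry</p><p>Financial Technology</p></div>
  <div><p class="label">Size</p><p> 1k-5k employees </p></div>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
wanted = {"Industry": "industry", "Size": "size", "Founded": "founded"}

# Default every field to None, then overwrite the ones present on the page
record = dict.fromkeys(wanted.values())
for label in soup.find_all("p", class_="label"):
    key = wanted.get(label.text)
    if key is not None:
        # label.parent() is shorthand for label.parent.find_all();
        # index 1 is the second tag in the parent, i.e. the value element
        record[key] = label.parent()[1].text.strip()

print(record)
```

Because "Founded" is missing from the sample markup, its entry stays `None`, which mirrors how a missing field is recorded for a company.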
  

#### Parsing additional information

This function grabs some additional information:
  - Motto
  - Location Link
  - Office Locations
  - **Tech Stack 1**
  - **Tech Stack 2**

In [None]:
def extraParser(link):
    # Reuses the global `page` parsed by aboutParser(), so aboutParser(link) must run first

    divTag=page.find('p',class_="fc-light lh-md fs-body3 sticky:fade-out mb12 sm:mb0")
    attributes["moto"].append(divTag.text.strip() if divTag else None)

    locationDivTag=page.find('div',class_="mt32 js-locations")
    locationATag=locationDivTag.find('a') if locationDivTag else None
    if locationATag:
        # .get() returns None when the attribute is missing instead of raising KeyError
        attributes["locationLink"].append(locationATag.get("href") or None)
        attributes["officelocations"].append(locationATag.get("data-query") or None)
    else:
        attributes["locationLink"].append(None)
        attributes["officelocations"].append(None)

    tech_stack=page.find('div',class_="fs-body2 mt32 js-nav-content")
    stack_a_tags=tech_stack.find_all('a',class_="flex--item s-tag no-tag-menu") if tech_stack else []
    # Pad with None so pages listing fewer than two tags do not raise an error
    stackNames=[tag.text for tag in stack_a_tags[:2]]+[None,None]

    attributes["tstack1"].append(stackNames[0])
    attributes["tstack2"].append(stackNames[1])
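
Company pages are not guaranteed to list two technologies. One way to always get exactly two values is to pad the list with `None`; the `top_two` helper below is hypothetical and not part of the notebook.

```python
def top_two(names):
    """Return exactly two entries, padding with None when fewer are given."""
    padded = list(names[:2]) + [None, None]
    return padded[0], padded[1]

print(top_two(["java", "kotlin", "sql"]))
print(top_two(["rust"]))
print(top_two([]))
```

Padding keeps the `tstack1` and `tstack2` lists the same length as the other columns, whatever a single page contains.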


### Wrapper function
This is a wrapper function. It is called after all the listing pages have been parsed, and it runs both parsers on every company link.

In [None]:
def subParser(fulldata):
    for data in fulldata:
        link=fulldata[data]
        aboutParser(link)
        extraParser(link)
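
The parsers append to parallel lists in `attributes`, so every list must receive exactly one value per company; otherwise `pd.DataFrame(attributes)` fails because the lists have different lengths. An alternative sketch, not the approach used in this notebook, builds one dictionary per company instead. The `FIELDS` list and `make_record` helper are hypothetical, and the sample values are taken from the output shown further down.

```python
# Hypothetical subset of fields, for illustration only
FIELDS = ["followers", "status", "githublink"]

def make_record(name, scraped):
    """Return one row with every field present; missing ones become None."""
    record = {"Company Name": name}
    for field in FIELDS:
        record[field] = scraped.get(field)
    return record

records = [
    make_record("Ockam", {"followers": "36", "status": "Private",
                          "githublink": "https://github.com/build-trust/ockam"}),
    make_record("Endava", {"followers": "76", "status": "Public"}),
]

# Every record has the same keys, so the rows always line up
print(records[1])
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)`.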
        

### Main Function


The main function is the starting point, where all the functions are called and the data is gathered systematically.




In [13]:
mainInfos={}
attributes={
        "followers":[],
        "status":[],
        "industry":[],
        "founded":[],
        "size":[],
        "website":[],
        "officelocations":[],
        "moto":[],
        "tstack1":[],
        "tstack2":[],
        "githublink":[],
        "twitterlink":[],
        "locationLink":[]    
    }

if __name__ =="__main__":

    # this variable can be changed to grab more data, the default value is 5 pages
    no_of_pages=5
    # range() excludes its end value, so add 1 to include the last page
    for pageNo in range(1,no_of_pages+1):
        mainInfo=mainParser(pageNo)
        #mainInfo.pop('We do not currently allow content pasted from ChatGPT on Stack Overflow; read our policy here.')
        # skip pages that returned nothing
        if mainInfo:
            mainInfos.update(mainInfo)

    attributes["Company Name"]=list(mainInfos.keys())
    subParser(mainInfos)
    df=pd.DataFrame(attributes)
df


,followers,status,industry,founded,size,website,officelocations,moto,tstack1,tstack2,githublink,twitterlink,locationLink,Company Name
0,36,Private,"Cloud Services, Security, Software Development",2017.0,11-50 employees,https://www.ockam.io/,,Trust for Data-in-Motion.\r\n\r\nWe are Hiring!,rust,elixir,https://github.com/build-trust/ockam,https://twitter.com/ockam,,Ockam
1,58,Public,"Banking, Financial Technology, Software Develo...",1976.0,5k-10k employees,https://www.jackhenry.com/,"663 West Highway 60\nMonett, MO 65708",Jack Henry is a well-rounded financial technol...,scala,go,https://github.com/Banno,https://twitter.com/JH_Fintech,https://www.google.com/maps/search/?api=1&quer...,"Jack Henry & Associates, Inc.®"
2,24,Private,"Cybersecurity, Network Security, Software Deve...",2012.0,1k-5k employees,https://nordsecurity.com/careers,"Vilnius, Lithuania",Creating a safe cyber future.,php,go,https://github.com/NordSecurity,https://twitter.com/NordNewsroom,https://www.google.com/maps/search/?api=1&quer...,Nord Security
3,76,Public,Information Technology,2000.0,10k+ employees,http://endava.com,United Kingdom,Reimagining the relationship between people an...,reactjs,angular,,https://twitter.com/endava,https://www.google.com/maps/search/?api=1&quer...,Endava
4,304,Private,Financial Technology,2018.0,1k-5k employees,https://about.paypay.ne.jp/career/en/,Work from anywhere at anytime,Work for Life or Work for Rice /We are Japan ...,java,spring-boot,,https://twitter.com/PayPayOfficial,https://www.google.com/maps/search/?api=1&quer...,PayPay Corporation.
5,461,Private,"Advertising, Enterprise Software",2008.0,501-1k employees,https://stackoverflow.com/company/work-here,"110 William Street\n28th Fl\nNew York, NY 10038",Stack Overflow empowers the world to develop t...,c#,asp.net-mvc,https://github.com/stackexchange,https://twitter.com/stackoverflow,https://www.google.com/maps/search/?api=1&quer...,Stack Overflow
6,73,Public,"Computer Software, Financial Technology",1983.0,10k+ employees,https://www.intuit.com/?cid=cpg_so_click_us_ca...,"2701 Coast Ave, Mountain View, CA 94043",Intuit’s Engineering and Data teams are using ...,java,kotlin,https://github.com/intuit,https://twitter.com/intuit,https://www.google.com/maps/search/?api=1&quer...,Intuit
7,493,Public,Financial Services,1799.0,10k+ employees,https://careers.jpmorgan.com/us/en/our-busines...,"Brooklyn, NY",www.jpmorganchase.com/techcareers,java,python,,,https://www.google.com/maps/search/?api=1&quer...,JPMorgan Chase & Co.
8,143,Public,"Biotechnology, Pharmaceuticals, Science",1956.0,10k+ employees,https://jobs.thermofisher.com/global/en/c/it-j...,"Carlsbad, CA",Our Mission is to enable our customers to make...,javascript,sql,,https://twitter.com/MyThermoFisher,https://www.google.com/maps/search/?api=1&quer...,Thermo Fisher Scientific Careers
9,197,Public,Software Development / Engineering,1907.0,10k+ employees,https://careers.bakerhughes.com/global/en/digi...,"Houston, TX","We take energy forward - making it safer, clea...",.net,corestandard,,https://twitter.com/bakerhughesco,https://www.google.com/maps/search/?api=1&quer...,Baker Hughes


## Summary and Conclusion

The Stack Overflow job portal was scraped and parsed to gather important information about the companies hiring there. This makes the search easier for anyone looking for a job based on a specific **tech stack**. Other useful details, such as the company's website and its number of employees, can further help job seekers choose the company that suits them best.





The dataframe can be tweaked slightly to look at the data from a different perspective, for example by grouping companies by tech stack. This helps users narrow their job search to the tech stacks they are proficient in.

In [18]:
tech=df.groupby(['tstack1','tstack2'])

In [19]:
tech.first()

tstack1,tstack2,followers,status,industry,founded,size,website,officelocations,moto,githublink,twitterlink,locationLink,Company Name
.net,asp.net-core,63,Public,"Cloud-Based Solutions, Information Technology,...",2019.0,51-200 employees,https://www.atma.io/,"Wickenburggasse 32, 8010 Graz, Austria",atma.io -->the world's leading connected produ...,,https://twitter.com/ADSmartrac,https://www.google.com/maps/search/?api=1&quer...,atma.io
.net,corestandard,197,Public,Software Development / Engineering,1907.0,10k+ employees,https://careers.bakerhughes.com/global/en/digi...,"Houston, TX","We take energy forward - making it safer, clea...",,https://twitter.com/bakerhughesco,https://www.google.com/maps/search/?api=1&quer...,Baker Hughes
azure,c#,17,Private,"Healthcare, Insurance",1947.0,10k+ employees,https://careers.bupa.co.uk/,"Salford Quays, Manchester","Helping people live longer, healthier, happier...",,https://twitter.com/BupaUKCareers,https://www.google.com/maps/search/?api=1&quer...,Bupa
c,java,96,Public,"Big Data, Enterprise Software, Information Tec...",,201-500 employees,https://www.vertica.com/,150 Cambridgepark Drive\n10th Floor\nCambridge...,Unified Analytics powering the fastest analyti...,,https://twitter.com/VerticaUnified,https://www.google.com/maps/search/?api=1&quer...,Vertica
c#,.net,114,Private,"Computer Software, Enterprise Software, Softwa...",1998.0,501-1k employees,https://www.authoritypartners.com/,"AP Headquarters: 200 Spectrum Center Dr., Suit...",Authority Partners is the premier IT solutions...,,https://twitter.com/apiknowsit,https://www.google.com/maps/search/?api=1&quer...,Authority Partners
c#,asp.net-mvc,461,Private,"Advertising, Enterprise Software",2008.0,501-1k employees,https://stackoverflow.com/company/work-here,"110 William Street\n28th Fl\nNew York, NY 10038",Stack Overflow empowers the world to develop t...,https://github.com/stackexchange,https://twitter.com/stackoverflow,https://www.google.com/maps/search/?api=1&quer...,Stack Overflow
c#,java-ee,161,Public,"Hardware Development, Semiconductors, Software...",1984.0,10k+ employees,https://www.asml.com/en/careers/job-categories...,"ASML Veldhoven | De Run 6501,\n5504 DR, Veldho...",Software makes our machines perform beyond raw...,,https://twitter.com/ASMLcompany,https://www.google.com/maps/search/?api=1&quer...,ASML
c++,c++17,104,Public,"Computer Software, Databases, Enterprise Software",2007.0,1k-5k employees,https://www.mongodb.com/,"1633 Broadway, 38th Floor, New York, NY 10019,...","MongoDB empowers innovators to create, transfo...",https://github.com/mongodb/mongo,https://twitter.com/MongoDB,https://www.google.com/maps/search/?api=1&quer...,MongoDB
golang,javascript,24,Public,"Information Technology, Technical Services, Tr...",2012.0,5k-10k employees,http://grab.careers,"3 Media Cl, Singapore 138498","Grab is Southeast Asia’s leading superapp, pro...",,,https://www.google.com/maps/search/?api=1&quer...,Grab
ios,android,70,Private,"Ad Tech, Advertising Technology, Mobile Applic...",2018.0,11-50 employees,http://www.adjoe.io,"An der Alster 42, 20099 Hamburg, Germany",We Build Advertising Technologies to Outperform,https://github.com/adjoeio,,https://www.google.com/maps/search/?api=1&quer...,adjoe
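
`tech.first()` shows a single row per stack pair. To list every company that mentions a given technology in either column, a boolean filter is one option. The sketch below is not part of the original notebook: it uses a small hand-built dataframe with a few rows copied from the output above, and the `companies_using` helper is hypothetical.

```python
import pandas as pd

# A few rows copied from the output above (only three columns kept)
sample = pd.DataFrame({
    "Company Name": ["Ockam", "Intuit", "JPMorgan Chase & Co.", "Nord Security"],
    "tstack1": ["rust", "java", "java", "php"],
    "tstack2": ["elixir", "kotlin", "python", "go"],
})

def companies_using(df, tech):
    """Return the names of companies listing `tech` in either stack column."""
    mask = (df["tstack1"] == tech) | (df["tstack2"] == tech)
    return df.loc[mask, "Company Name"].tolist()

print(companies_using(sample, "java"))
print(companies_using(sample, "go"))
```

The same function should work on the full `df` built earlier, since it only relies on the `tstack1`, `tstack2` and `Company Name` columns.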
