# 1. Data collection

## Job Postings Data

- Collect job data from the GitHub Jobs API.
- Determine the number of jobs currently open for various technologies.
- Store the collected data in an Excel spreadsheet.

The documentation for the GitHub Jobs API can be found at https://jobs.github.com/api





In [20]:
import requests
from openpyxl import Workbook

# Base endpoint of the GitHub Jobs API; it returns a JSON list of postings.
baseurl = "https://jobs.github.com/positions.json"

response = requests.get(baseurl)

# Fall back to an empty list so data_git is defined even if the request fails.
data_git = response.json() if response.ok else []

data_git

[{'id': '32bf67e5-4971-47ce-985c-44b6b3860cdb',
  'type': 'Full Time',
  'url': 'https://jobs.github.com/positions/32bf67e5-4971-47ce-985c-44b6b3860cdb',
  'created_at': 'Wed May 19 00:49:17 UTC 2021',
  'company': 'SweetRush',
  'company_url': 'https://www.sweetrush.com/',
  'location': 'Remote',
  'title': 'Senior Creative Front End Web Developer',
  'description': '<p><strong>SweetRush has an exciting opportunity for an experienced creative front-end developer (full stack is also acceptable) with an eye for graphic and UX design!</strong></p>\n<p><strong>ABOUT THE ROLE:</strong></p>\n<p>This is an important role on the Engineering and Development department’s Course Development team, and you will be reporting directly to the Course Development team lead.</p>\n<p>Historically, the developers most successful in this role contribute to multiple projects at the same time; show a willingness to improve existing techniques, frameworks, and templates; and come up with innovations of their 

In [6]:
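def get_number_of_jobs(technology):
    """Count postings whose description mentions `technology`, across all pages."""
    # Module-level counter; the same value is also returned below.
    global number_of_jobs
    number_of_jobs = 0
    page = 0

    while True:
        page += 1
        payload = {"page": page, "description": technology}
        response = requests.get(baseurl, params=payload)

        # Stop on an HTTP error rather than looping indefinitely.
        if not response.ok:
            break

        jobs_in_page = len(response.json())
        number_of_jobs += jobs_in_page

        print('Page N.: ', page)
        print('Jobs in the page: ', jobs_in_page)
        print('Total Jobs: ', number_of_jobs)

        # An empty page means there are no more results.
        if jobs_in_page == 0:
            break

    return technology, number_of_jobs

print(get_number_of_jobs('python'))  # checking example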

Page N.:  1
Jobs in the page:  2
Total Jobs:  2
Page N.:  2
Jobs in the page:  0
Total Jobs:  2
('python', 2)
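
The loop in `get_number_of_jobs` keeps requesting pages until one comes back empty. Below is a minimal sketch of that stopping rule, separated from the network call so it can be tried offline. `count_items` and `fake_fetch` are hypothetical names introduced for illustration, and the `max_pages` cap is an added assumption that is not part of the notebook.

```python
from typing import Callable, List


def count_items(fetch_page: Callable[[int], List[dict]], max_pages: int = 100) -> int:
    """Sum the items on pages 1, 2, ... until an empty page is returned."""
    total = 0
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # an empty page signals the end of the results
            break
        total += len(items)
    return total


# Hypothetical stand-in for the API: two results on page 1, none afterwards.
def fake_fetch(page: int) -> List[dict]:
    pages = {1: [{"id": "a"}, {"id": "b"}]}
    return pages.get(page, [])


print(count_items(fake_fetch))  # prints 2
```

With `requests`, `fetch_page` could wrap a call such as `requests.get(baseurl, params={"page": page, "description": technology}).json()`.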


In [7]:
wb=Workbook()
ws=wb.active


tech_list=['C','C#','C++','Java','JavaScript','Python','Scala','Oracle','SQL Server','MySQL Server','PostgreSQL','MongoDB']

ws.append(['Technology','Number_of_Jobs'])  

for tech in tech_list:
    print('Technology: ', tech)
    # Use the returned count instead of relying on the global variable.
    _, job_count = get_number_of_jobs(tech)
    print('number_of_jobs printed in excel: ', job_count)
    ws.append([tech, job_count])

Technology:  C
Page N.:  1
Jobs in the page:  13
Total Jobs:  13
Page N.:  2
Jobs in the page:  0
Total Jobs:  13
number_of_jobs printed in excel:  13
Technology:  C#
Page N.:  1
Jobs in the page:  0
Total Jobs:  0
number_of_jobs printed in excel:  0
Technology:  C++
Page N.:  1
Jobs in the page:  0
Total Jobs:  0
number_of_jobs printed in excel:  0
Technology:  Java
Page N.:  1
Jobs in the page:  6
Total Jobs:  6
Page N.:  2
Jobs in the page:  0
Total Jobs:  6
number_of_jobs printed in excel:  6
Technology:  JavaScript
Page N.:  1
Jobs in the page:  5
Total Jobs:  5
Page N.:  2
Jobs in the page:  0
Total Jobs:  5
number_of_jobs printed in excel:  5
Technology:  Python
Page N.:  1
Jobs in the page:  2
Total Jobs:  2
Page N.:  2
Jobs in the page:  0
Total Jobs:  2
number_of_jobs printed in excel:  2
Technology:  Scala
Page N.:  1
Jobs in the page:  1
Total Jobs:  1
Page N.:  2
Jobs in the page:  0
Total Jobs:  1
number_of_jobs printed in excel:  1
Technology:  Oracle
Page N.:  1
Jobs in

In [None]:
wb.save("github-job-postings.xlsx")
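
Several entries in `tech_list` contain `#`, `+`, or spaces, which are escaped when sent as query parameters. The sketch below uses the standard library's `urllib.parse.urlencode` to show the kind of query string involved. `requests` performs its own encoding internally, so treat this as an approximation of what is sent rather than an exact trace. `build_query` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_query(technology: str, page: int = 1) -> str:
    """Return the URL-encoded query string for one page of one technology."""
    return urlencode({"page": page, "description": technology})


for tech in ["C", "C#", "C++", "SQL Server"]:
    print(tech, "->", build_query(tech))
```

Checking the encoded form is a quick first step when a search term returns an unexpected count.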

## Popular Language Data


- Write the scraped data into a CSV file.

The data consists of the **name of the programming language** and its **average annual salary**.

In [2]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/Programming_Languages.html"

# Download the page and parse it; the "html5lib" parser requires the html5lib package.
data_web = requests.get(url).text
soup = BeautifulSoup(data_web, "html5lib")
data_web

'<!doctype html>\n<html lang="en">\n<head>\n<title>\nSalary survey results of programming languages\n</title>\n<style>\ntable, th, td {\n  border: 1px solid black;\n}\n</style>\n</head>\n\n<body>\n<hr />\n<h2>Popular Programming Languages</h2>\n<hr />\n<p>Finding out which is the best language is a tough task. A programming language is created to solve a specific problem. A language which is good for task A may not be able to properly handle task B. Comparing programming language is never easy. What we can do, however, is find which is popular in the industry.</p>\n<p>There are many ways to find the popularity of a programming languages. Counting the number of google searchs for each language is a simple way to find the popularity. GitHub and StackOverflow also can give some good pointers.</p>\n<p>Salary surveys are a way to find out the programmings languages that are most in demand in the industry. Below table is the result of one such survey. When using any survey keep in mind that 

In [3]:
table = soup.find('table')
language_name = []
annual_avg_salary = []

# Each <tr> is one table row; the header row comes first.
for row in table.find_all('tr'):
    cols = row.find_all('td')
    lan_col = cols[1].getText()  # second cell: language name
    sal_col = cols[3].getText()  # fourth cell: average annual salary

    language_name.append(lan_col)
    annual_avg_salary.append(sal_col)
    print("{}--------------->{}".format(lan_col, sal_col))

Language--------------->Average Annual Salary
Python--------------->$114,383
Java--------------->$101,013
R--------------->$92,037
Javascript--------------->$110,981
Swift--------------->$130,801
C++--------------->$113,865
C#--------------->$88,726
PHP--------------->$84,727
SQL--------------->$84,793
Go--------------->$94,082
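
The loop above reads the second and fourth cell of every table row. For readers without BeautifulSoup installed, the same idea can be sketched with the standard library's `html.parser`, which is a different parser from the one the notebook uses. The `TableCells` class is a hypothetical helper, and the `SAMPLE_HTML` snippet and its column layout are made up for illustration and are not the real page.

```python
from html.parser import HTMLParser


class TableCells(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while the parser is inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hypothetical sample; the real page may be laid out differently.
SAMPLE_HTML = """
<table>
<tr><td>No.</td><td>Language</td><td>Creator</td><td>Average Annual Salary</td></tr>
<tr><td>1</td><td>Python</td><td>n/a</td><td>$114,383</td></tr>
</table>
"""

parser = TableCells()
parser.feed(SAMPLE_HTML)
# Keep columns 1 and 3, skipping rows too short to index safely.
pairs = [(r[1].strip(), r[3].strip()) for r in parser.rows if len(r) > 3]
print(pairs)
```

The `len(r) > 3` guard is a design choice for robustness: a row with fewer than four cells is skipped instead of raising an `IndexError`.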


In [8]:
# The first scraped row holds the column headers; the remaining rows are data.
df_web = pd.DataFrame({language_name[0]: language_name[1:], annual_avg_salary[0]: annual_avg_salary[1:]})
df_web

| | Language | Average Annual Salary |
|---|---|---|
| 0 | Python | $114,383 |
| 1 | Java | $101,013 |
| 2 | R | $92,037 |
| 3 | Javascript | $110,981 |
| 4 | Swift | $130,801 |
| 5 | C++ | $113,865 |
| 6 | C# | $88,726 |
| 7 | PHP | $84,727 |
| 8 | SQL | $84,793 |
| 9 | Go | $94,082 |
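
The salaries are stored as strings such as `$114,383`, so they would need converting before any numeric comparison. A minimal sketch of that conversion follows. `parse_salary` is a hypothetical helper, and it assumes US-style formatting with a leading dollar sign and comma thousands separators.

```python
def parse_salary(text: str) -> int:
    """Convert a string like '$114,383' to the integer 114383."""
    return int(text.strip().lstrip("$").replace(",", ""))


# Values copied from the scraped table above.
salaries = {"Python": "$114,383", "Swift": "$130,801", "PHP": "$84,727"}
numeric = {lang: parse_salary(s) for lang, s in salaries.items()}

# Language with the highest average salary in this small sample.
top_language = max(numeric, key=numeric.get)
print(top_language, numeric[top_language])  # prints: Swift 130801
```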


In [9]:
df_web.to_csv('popular-languages.csv',index=False)

## Survey of Technology professionals

This is the main data used for the projects.

The dataset comes from Stack Overflow (https://stackoverflow.blog/2019/04/09/the-2019-stack-overflow-developer-survey-results-are-in/) and is licensed under the Open Database License (ODbL).



In [21]:
import pandas as pd

In [22]:
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m1_survey_data.csv"

df_survey=pd.read_csv(dataset_url)

df_survey.head()

Unnamed: 0,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
1,9,I am a developer by profession,Yes,Once a month or more often,The quality of OSS and closed source software ...,Employed full-time,New Zealand,No,Some college/university study without earning ...,"Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,,23.0,Man,No,Bisexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult
2,13,I am a developer by profession,Yes,Less than once a month but more than once per ...,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,United States,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)","Computer science, computer engineering, or sof...",...,Somewhat more welcome now than last year,Tech articles written by other developers;Cour...,28.0,Man,No,Straight / Heterosexual,White or of European descent,Yes,Appropriate in length,Easy
3,16,I am a developer by profession,Yes,Never,The quality of OSS and closed source software ...,Employed full-time,United Kingdom,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,26.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult
4,17,I am a developer by profession,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,Employed full-time,Australia,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,29.0,Man,No,Straight / Heterosexual,Hispanic or Latino/Latina;Multiracial,No,Appropriate in length,Easy


In [23]:
print(df_survey.shape)
df_survey.dtypes

(11552, 85)


Respondent       int64
MainBranch      object
Hobbyist        object
OpenSourcer     object
OpenSource      object
                 ...  
Sexuality       object
Ethnicity       object
Dependents      object
SurveyLength    object
SurveyEase      object
Length: 85, dtype: object