Scrape the Python ITJobsWatch page. Showcase:
1. Data Ingestion
2. Data Wrangling
3. Data Analysis
4. Data Visualisation

In [331]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

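# Fetch the Python skill page; naming the parser explicitly avoids bs4's "no parser specified" warning
response = requests.get('https://www.itjobswatch.co.uk/jobs/uk/python.do')
soup = BeautifulSoup(response.text, 'html.parser')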

Parse a single table into pandas
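
As a rough illustration of the parsing step, the sketch below extracts a skill name and percentage from one related-skill cell. The cell format "<count> (<percentage>%) <skill name>" is an assumption inferred from the split logic in this notebook. `parse_related_skill` and the sample strings are hypothetical.

```python
import re

# Assumed cell format: "<count> (<percentage>%) <skill name>"
RELATED_SKILL = re.compile(r"\((?P<pct>[\d.]+)%\)\s*(?P<skill>.+)")

def parse_related_skill(text):
    """Return (skill, percentage) for one cell, or None if the cell does not match."""
    match = RELATED_SKILL.search(text)
    if match is None:
        return None
    return match.group("skill").strip(), float(match.group("pct"))

# Hypothetical sample cells
print(parse_related_skill("1,234 (42.53%) Microsoft"))
print(parse_related_skill("no percentage here"))
```

Returning None for non-matching cells is one way to skip headers and footers without relying on a length cut-off.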

In [332]:
def get_info_from_row(row):
    # Collect the text of every nested <tr>, skipping entries of 50 or more characters
    return [tag.get_text() for tag in row.find_all('tr') if len(tag.get_text()) < 50]

In [333]:
def get_info_from_section(section):
    # Parse each <td> once, drop the empty results, and keep the first 14
    rows = (get_info_from_row(cell) for cell in section.find_all('td'))
    return [row for row in rows if row][:14]

In [334]:
def get_skill_stats_df(skill, soup):
    # The related-skills section is looked up under either of two ids
    section = soup.find(attrs={'id':'related_skills'})
    if section is None:
        section = soup.find(attrs={'id':'skill-set'})
    # Splitting on "%" leaves the percentage (after the "(") on the left and the skill name on the right
    related_skill_texts = [text.split("%") for table in get_info_from_section(section) for text in table]

    # Named skill_stats so the built-in dict is not shadowed
    skill_stats = {'Primary Skill' : skill,
        'Secondary Skill' : [skill_text[1][1:].strip() for skill_text in related_skill_texts],
        'Percentage' : [skill_text[0].split("(")[1] for skill_text in related_skill_texts]}

    return pd.DataFrame(skill_stats)

Scrape the results pages to get the Skill pages in descending order, then:
- Build a df of Skill features
- Build the association table
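
To make the association table concrete, here is a minimal sketch that reshapes long-format (primary, secondary, percentage) rows into a square matrix. The skill names and percentages are made up for illustration, and the matrix view is an optional way to inspect the data, not something the notebook itself builds.

```python
import pandas as pd

# Hypothetical long-format rows: one row per scraped (primary, secondary) pair
long_rows = pd.DataFrame({
    "Primary Skill": ["A", "A", "B"],
    "Secondary Skill": ["B", "C", "A"],
    "Percentage": [40.0, 10.0, 25.0],
})

skills = ["A", "B", "C"]

# Pivot to one row per primary skill and one column per secondary skill;
# pairs that were never scraped become 0
matrix = (
    long_rows.pivot(index="Primary Skill", columns="Secondary Skill", values="Percentage")
    .reindex(index=skills, columns=skills)
    .fillna(0.0)
)
print(matrix)
```

The matrix is generally not symmetric: the share of A jobs that mention B need not equal the share of B jobs that mention A.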

In [344]:
def get_skill_pages(page_num):
    search_page = 'https://www.itjobswatch.co.uk/default.aspx?ql=&ll=&id=0&p=6&e=200&page=' + str(page_num) + '&sortby=0&orderby=0'
    search_soup = BeautifulSoup(requests.get(search_page).text, 'html.parser')
    return [(tag.a.get_text(), tag.a['href']) for tag in search_soup.find_all(attrs={'class':'c2'})]

Initialisation

In [348]:
skill_pages = [get_skill_pages(page_num) for page_num in range(1, 4)]
skills = [skill for page in skill_pages for skill, _ in page]
data = []
for skill_1 in skills:
    for skill_2 in skills:
        data.append([skill_1, skill_2, 0])
df_association = pd.DataFrame(columns=["Primary Skill", "Secondary Skill", "Percentage"], data=data)

Create a df containing the following features for each skill:
- Name of skill
- Rank
- Rank change
- % of all permanent jobs
- Category
- % of category
- Median annual salary
- Median annual salary (excl London)
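
The features are scraped as strings such as '25.30%' and '£50,000'. A possible way to convert them to numbers is sketched below. The three helper names are hypothetical, and the sketch assumes every value is present and follows the formats seen in the printed rows.

```python
import pandas as pd

def percent_to_float(text):
    """Convert a string such as '25.30%' to 25.3."""
    return float(text.rstrip("%"))

def salary_to_int(text):
    """Convert a string such as '£50,000' to 50000."""
    return int(text.replace("£", "").replace(",", ""))

def rank_change_to_int(text):
    """Convert a string such as '+1', '-3' or '0' to a signed integer."""
    return int(text)

# Two rows copied from the scraper's printed output
rows = [
    ['Social Skills', '1', '+1', '25.30%', 'General', '41.10%', '£50,000', '£43,500'],
    ['Agile', '2', '-1', '21.21%', 'Processes & Methodologies', '23.76%', '£65,000', '£59,526'],
]
columns = ["Skill", "Rank", "Rank Change", "% Jobs", "Category", "% Category",
           "Median Salary", "Median Salary (Excluding London)"]
typed = pd.DataFrame(rows, columns=columns)

typed["Rank"] = typed["Rank"].astype(int)
typed["Rank Change"] = typed["Rank Change"].apply(rank_change_to_int)
for col in ["% Jobs", "% Category"]:
    typed[col] = typed[col].apply(percent_to_float)
for col in ["Median Salary", "Median Salary (Excluding London)"]:
    typed[col] = typed[col].apply(salary_to_int)

print(typed.dtypes)
```

A page with a missing salary or a non-numeric rank change would raise a ValueError here, so a real run would need a fallback for those cases.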


In [338]:
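def get_job_stats(skill, soup):
    # Flatten the first table on the page into a list of cell texts; the indices below are positional
    info = [tag.get_text() for tag in soup.find('table').find_all('td')]
    # info[16] contains "As % of the <category>" plus a 9-character suffix, which [:-9] removes
    return [skill, info[1], info[5], info[13], info[16].split("As % of the ")[1][:-9], info[17], info[33], info[49]]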

In [351]:
columns = ["Skill", "Rank", "Rank Change", "% Jobs", "Category", "% Category", "Median Salary", "Median Salary (Excluding London)"]
data = []
for search_results_page in skill_pages:
    for skill, page in search_results_page:
        soup = BeautifulSoup(requests.get("https://www.itjobswatch.co.uk/" + page).text, 'html.parser')
        features = get_job_stats(skill, soup)
        print(features)
        data.append(features)
        # skill association
        df_skills = get_skill_stats_df(skill, soup)
        df_association = df_association.merge(df_skills, on=["Primary Skill", "Secondary Skill"], how="left")
        df_association['Percentage'] = df_association['Percentage_y'].fillna(df_association['Percentage_x'])
        df_association = df_association.drop(['Percentage_x', 'Percentage_y'], axis=1)

df = pd.DataFrame(columns=columns, data=data)


['Social Skills', '1', '+1', '25.30%', 'General', '41.10%', '£50,000', '£43,500']
['Agile', '2', '-1', '21.21%', 'Processes & Methodologies', '23.76%', '£65,000', '£59,526']
['Finance', '3', '0', '20.67%', 'General', '33.57%', '£65,000', '£55,000']
['Azure', '4', '0', '19.68%', 'Cloud Services', '50.21%', '£60,000', '£53,426']
['Microsoft', '5', '0', '18.09%', 'Vendors', '46.70%', '£49,000', '£45,000']
['Developer', '6', '+3', '15.76%', 'Job Titles', '16.54%', '£60,000', '£52,500']
['Problem-Solving', '7', '+6', '15.57%', 'Processes & Methodologies', '17.45%', '£50,000', '£45,000']
['Senior', '8', '+2', '14.58%', 'Job Titles', '15.31%', '£65,000', '£60,000']
['Degree', '9', '-1', '14.51%', 'Qualifications', '49.50%', '£55,000', '£50,000']
['SQL', '10', '-3', '13.99%', 'Programming Languages', '36.66%', '£57,500', '£50,000']
['AWS', '11', '-5', '12.53%', 'Cloud Services', '31.98%', '£70,000', '£60,000']
['Analyst', '12', '+3', '11.18%', 'Job Titles', '11.74%', '£45,000', '£40,000']
['So

Explore with starting skill

In [361]:
df_association = df_association[df_association['Percentage'] != 0]

In [475]:
def query_primary_skill(skill):
    percentages_df = df_association[df_association['Primary Skill'] == skill]
    percentages_df = percentages_df.drop("Primary Skill", axis=1).rename(columns={"Secondary Skill":"Skill"})
    df_1 = percentages_df.merge(df, on="Skill")
    
    df_1["% Jobs"] = df_1["% Jobs"].apply(lambda x : float(x[:-1]))
    df_1["% Category"] = df_1["% Category"].apply(lambda x : float(x[:-1]))
    df_1["Percentage"] = df_1["Percentage"].apply(lambda x : float(x))
    df_1["Weighted Percentage"] = df_1["Percentage"] * df_1["% Jobs"] / df_1["% Jobs"].sum()
    
    return df_1.sort_values(["Weighted Percentage"], ascending=False)

In [480]:
def query_secondary_skill(skill):
    percentages_df = df_association[df_association['Secondary Skill'] == skill]
    percentages_df = percentages_df.drop("Secondary Skill", axis=1).rename(columns={"Primary Skill":"Skill"})
    df_1 = percentages_df.merge(df, on="Skill")
    
    df_1["% Jobs"] = df_1["% Jobs"].apply(lambda x : float(x[:-1]))
    df_1["% Category"] = df_1["% Category"].apply(lambda x : float(x[:-1]))
    df_1["Percentage"] = df_1["Percentage"].apply(lambda x : float(x))
    df_1["Weighted Percentage"] = df_1["Percentage"] * df_1["% Jobs"] / df_1["% Jobs"].sum()
    
    return df_1.sort_values(["Weighted Percentage"], ascending=False)

Primary Skill
- If you know _b_ you are eligible for _x_% of jobs looking for _a_

Secondary Skill
- If you know _c_ you are eligible for _y_% of jobs looking for _b_
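
The weighting used by both query functions can be followed on a small worked example. All numbers below are hypothetical and chosen so the arithmetic is easy to check by hand. The formula is the one used in this notebook: Percentage * % Jobs / sum of % Jobs.

```python
import pandas as pd

# Hypothetical values: "% Jobs" sums to 100 so each weight can be read off directly
toy = pd.DataFrame({
    "Skill": ["X", "Y", "Z"],
    "Percentage": [50.0, 20.0, 10.0],  # association percentage with the queried skill
    "% Jobs": [20.0, 30.0, 50.0],      # % of all permanent jobs for each skill
})

# X: 50 * 20 / 100 = 10, Y: 20 * 30 / 100 = 6, Z: 10 * 50 / 100 = 5
toy["Weighted Percentage"] = toy["Percentage"] * toy["% Jobs"] / toy["% Jobs"].sum()

print(toy.sort_values("Weighted Percentage", ascending=False))
```

The weighting scales each association by how common the skill is overall, so a strong association with a rarely requested skill counts for less than the raw percentage suggests.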

In [494]:
skill = "Azure"
df_1 = query_primary_skill(skill)
df_2 = query_secondary_skill(skill)
df_3 = df_1.merge(df_2[["Skill", "Weighted Percentage"]], on="Skill")
df_3["Score"] = 0.5 * df_3["Weighted Percentage_x"] + df_3["Weighted Percentage_y"]
df_3[df_3["Category"] != "Processes & Methodologies"].sort_values(by=["Category", "Score"], ascending=False)

Unnamed: 0,Skill,Percentage,Rank,Rank Change,% Jobs,Category,% Category,Median Salary,Median Salary (Excluding London),Weighted Percentage_x,Weighted Percentage_y,Score
0,Microsoft,42.53,5,0,18.09,Vendors,46.7,"£49,000","£45,000",2.649703,1.123477,2.448329
24,VMware,11.47,61,36,4.19,Vendors,10.83,"£49,000","£45,000",0.165516,0.302736,0.385495
22,Kubernetes,11.11,50,-18,4.76,Systems Management,39.12,"£75,000","£69,447",0.182131,0.293447,0.384513
16,Active Directory,17.66,28,23,6.97,System Software,47.31,"£40,000","£36,000",0.423923,0.466644,0.678605
12,Degree,11.77,9,-1,14.51,Qualifications,49.5,"£55,000","£50,000",0.588176,0.311027,0.605115
18,Security Cleared,10.06,17,21,9.4,Qualifications,32.07,"£57,500","£55,000",0.325678,0.265713,0.428552
5,SQL,24.35,10,-3,13.99,Programming Languages,36.66,"£57,500","£50,000",1.173221,0.642951,1.229561
9,C#,22.07,18,1,9.22,Programming Languages,24.17,"£60,000","£55,000",0.700804,0.582631,0.933033
14,JavaScript,17.23,19,-5,8.94,Programming Languages,23.43,"£60,000","£55,000",0.530501,0.454902,0.720152
13,Python,15.73,14,-3,10.72,Programming Languages,28.08,"£70,000","£57,500",0.580747,0.415545,0.705918


In [464]:
print(query_primary_skill("Python"))

                      Skill  Percentage Rank Rank Change  % Jobs  \
1                     Agile       32.02    2          -1   21.21   
2                   Finance       28.06    3           0   20.67   
3                     Azure       28.88    4           0   19.68   
0             Social Skills       20.26    1          +1   25.30   
7                       AWS       38.57   11          -5   12.53   
6                       SQL       33.40   10          -3   13.99   
4           Problem-Solving       22.29    7          +6   15.57   
8      Software Engineering       30.93   13          +3   11.06   
5                    Degree       20.62    9          -1   14.51   
9                    DevOps       25.72   16          -4    9.77   
14                     Java       27.35   21          -4    8.36   
12               JavaScript       19.24   19          -5    8.94   
15                    CI/CD       22.14   29          -7    6.66   
10         Security Cleared       14.70   17    

In [485]:
print(query_primary_skill("HTML"))

                     Skill  Percentage Rank Rank Change  % Jobs  \
1                    Agile       40.70    2          -1   21.21   
12              JavaScript       77.15   19          -5    8.94   
3                    Azure       31.72    4           0   19.68   
0            Social Skills       21.49    1          +1   25.30   
7                      SQL       32.10   10          -3   13.99   
4                Microsoft       22.14    5           0   18.09   
11                      C#       42.42   18          +1    9.22   
2                  Finance       18.59    3           0   20.67   
17                     CSS       79.46   64         -24    4.18   
13                    .NET       40.64   24          +2    7.92   
5          Problem-Solving       19.83    7          +6   15.57   
6                   Degree       16.79    9          -1   14.51   
9     Software Engineering       18.62   13          +3   11.06   
8                      AWS       14.60   11          -5   12.5