# Using online job postings to improve a data science resume

## Problem Statement
I want to find out which data science skills are not represented in my resume draft,  
so that I can improve the **CV** and start applying for jobs.  
A folder contains all the job listings in HTML format.  
The goal is to extract common data science skills and compare these skills to the resume.

<img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*rsJsA9wsN2Y5-o7HsItJ2A.jpeg"
     width="500"
     height="300">


## Plan

To achieve the goal we will do the following:

1. Parse out all the text from the HTML files.
2. Learn how job skills are commonly described in online postings.
3. Filter irrelevant postings.
4. Cluster the job skills within the relevant postings and visualize.
5. Compare the clustered skills and our resume content.


## Packages

In [14]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs
from IPython.display import HTML, display
from sklearn.feature_extraction.text import TfidfVectorizer

## Data Extraction

In [15]:
import zipfile

html_contents = []

with zipfile.ZipFile("job_postings.zip", "r") as z:
    for name in sorted(z.namelist()):
        if name.endswith(".html"):
            with z.open(name) as f:
                html_contents.append(f.read().decode("utf-8"))

print(f"{len(html_contents)} HTML files have been loaded")


1458 HTML files have been loaded


Let's parse each HTML file and store the results in the **soup_objects** list.

In [16]:
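# Parse every HTML file; each posting must have a title and a body
soup_objects = []
for html in html_contents:
    soup = bs(html)
    assert soup.title is not None
    assert soup.body is not None
    soup_objects.append(soup)

# Collect the title and body text of each posting into a DataFrame
html_dict = {'Title': [], 'Body': []}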
for soup in soup_objects:
    title = soup.find('title').text
    body = soup.find('body').text
    html_dict['Title'].append(title)
    html_dict['Body'].append(body)

df_jobs = pd.DataFrame(html_dict)
summary = df_jobs.describe()
display(summary)

| | Title | Body |
|---|---|---|
| count | 1458 | 1458 |
| unique | 1364 | 1458 |
| top | Data Scientist - New York, NY | Data Scientist - Beavercreek, OH\nData Scienti... |
| freq | 13 | 1 |


There are 1364 unique titles among the 1458 postings, so the remaining 94 titles are duplicates.  

The most common title is repeated 13 times.  

All 1458 bodies are unique, so no job posting occurs more than once, even though some postings share a title.  
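
As a small illustration of how these numbers are read off the `describe()` output, here is a sketch on a toy DataFrame. The titles and the `toy` name are made up for the example and are not taken from the real postings.

```python
import pandas as pd

# Hypothetical titles, used only to illustrate the describe() summary.
toy = pd.DataFrame(
    {"Title": ["Data Scientist", "Data Scientist", "Data Analyst", "ML Engineer"]}
)

summary = toy.describe()
n_total = int(summary.loc["count", "Title"])    # number of rows
n_unique = int(summary.loc["unique", "Title"])  # number of distinct titles
n_repeats = n_total - n_unique                  # rows that repeat an earlier title
top_title = summary.loc["top", "Title"]         # most frequent title
top_freq = int(summary.loc["freq", "Title"])    # how often it occurs

print(n_total, n_unique, n_repeats, top_title, top_freq)
```

The same subtraction, count minus unique, gives the 94 duplicate titles reported for the real data.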
  
Now let's explore the HTML content in more detail. The aim is to determine how job skills are described in the files.

In [17]:
# Explore the skill descriptions
assert len(set(html_contents)) == len(html_contents)
display(HTML(html_contents[3]))

There are usually 2 subsections, 1 for Responsibilities and 1 for Qualifications.  
  
They are not that different, yet qualifications focus on tools and concepts, while responsibilities cover the actions to be performed on the job.  
  
Let's divide the posting description into 2 parts:
  
  
A. Initial job summary.  

B. List of skills required to get the job.

Do these types of skill descriptions also appear in other job postings?  
Let's extract the bullets from each of the parsed HTML files.  
A bullet point is tagged with "li" and appears as a dot in the rendered HTML.
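
To make the idea concrete, here is a minimal sketch on a made-up HTML snippet. The `toy_html` content and the `toy_` names are hypothetical, and the parser is set explicitly to `html.parser` for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical posting, not one of the real files.
toy_html = """
<html><head><title>Data Scientist - Example City</title></head>
<body>
<p>We are hiring a data scientist.</p>
<ul>
  <li> Experience with Python </li>
  <li>Knowledge of statistics</li>
</ul>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

# Every bullet point lives in an <li> tag; strip() removes the padding.
toy_bullets = [li.text.strip() for li in toy_soup.find_all("li")]
print(toy_bullets)  # ['Experience with Python', 'Knowledge of statistics']
```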

In [18]:
# Extract the list of bullet points from each soup object
df_jobs['Bullets'] = [[bullet.text.strip() for bullet in soup.find_all('li')]
                      for soup in soup_objects]


We created a new column, "Bullets", to store all the bullet points. However, it is possible that the majority of the job postings simply don't contain any!  
  
Let's find out the percentage of postings that actually contain bulleted text.  
  
If it's too low, it is worth changing the approach of the analysis.

In [19]:
bulleted_post_count = 0
for bullet_list in df_jobs.Bullets:
    if bullet_list:
        bulleted_post_count += 1
percent_bulleted = 100 * bulleted_post_count / df_jobs.shape[0]
print(f"{percent_bulleted:.2f}% of the postings contain bullet points ")


90.53% of the postings contain bullet points 


Okay, the next step is to understand whether most of these bullets focus on skills.  
To check, let's print out the top-ranked words in their text.  
For ranking we will use **Term Frequency Inverse Document Frequency** (TF-IDF).

### **How TF-IDF Is Computed**
TF-IDF is the product of two distinct metrics designed to balance local importance and global rarity:

1.  **Term Frequency (TF):** Measures how frequently a word appears in a specific document.
    *   *Core Idea:* If a word appears often, it is likely important to that document's topic.
    *   *Formula:* $\text{TF}(t, d) = \frac{\text{Count of term } t \text{ in document } d}{\text{Total number of words in document } d}$.
2.  **Inverse Document Frequency (IDF):** Measures how rare a word is across the entire corpus.
    *   *Core Idea:* Common words (like "the" or "is") appear in many documents and provide little unique information, so they are penalized with a lower weight.
    *   *Formula:* $\text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$.
3.  **Final Score:** $\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$.
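
The three formulas above can be worked through by hand on a tiny corpus. This is only a sketch: the `docs` corpus and the helper names `tf`, `idf` and `tf_idf` are made up for illustration, and the natural logarithm is assumed for the IDF.

```python
import math

# Hypothetical, already tokenized corpus of three short "documents".
docs = [
    ["python", "data", "python"],
    ["data", "team"],
    ["data", "sql", "team", "work"],
]

def tf(term, doc):
    """Share of the words in doc that are equal to term."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Log of (number of documents / documents containing term).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

# "python" is frequent in the first document and rare elsewhere.
score_python = tf_idf("python", docs[0], docs)  # (2/3) * ln(3/1)
# "data" occurs in every document, so its IDF (and score) is zero.
score_data = tf_idf("data", docs[0], docs)      # (1/3) * ln(3/3)

print(f"python: {score_python:.4f}, data: {score_data:.4f}")
```

Note that scikit-learn's `TfidfVectorizer`, used in the next cell, applies a variant of these formulas: by default it uses raw term counts, a smoothed IDF of the form $\ln\frac{1 + n}{1 + \text{df}(t)} + 1$, and L2-normalizes each row. Its scores therefore differ numerically from the textbook values, while the ranking intuition stays the same.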


In [20]:
def rank_words(text_list):
    """Rank words by their TF-IDF score summed over all texts."""
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(text_list)
    # Sum each word's column while the matrix is still sparse to save memory
    summed_tfidf = np.asarray(tfidf_matrix.sum(axis=0)).ravel()
    df = pd.DataFrame({'Words': vectorizer.get_feature_names_out(),
                       'Summed TFIDF': summed_tfidf})
    return df.sort_values('Summed TFIDF', ascending=False)

# Flatten the per-posting bullet lists into one list of bullet strings
all_bullets = []
for bullet_list in df_jobs.Bullets:
    all_bullets.extend(bullet_list)
sorted_df = rank_words(all_bullets)
print(sorted_df[:5].to_string(index=False))

     Words  Summed TFIDF
experience    878.030398
      data    842.978780
    skills    440.780236
      work    371.684232
   ability    370.969638


Terms such as **skills** and **ability** appear among the top five bulleted words.  
Most likely, the bullets correspond to individual job skills.  
How do these bulleted words compare to the remaining words in each job posting?  
  
  
We iterate over the body of each posting and delete every bullet point (each "li" tag) using the Beautiful Soup *decompose* method.  
Then we extract the remaining body text and store it in a non_bullets list.  
Finally, we apply the rank_words function to list and display the top five non-bullet words.
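
Before running this on the real postings, here is a small sketch of what *decompose* does. The `toy_html` snippet and the `toy_`/`remaining_` names are hypothetical, and `html.parser` is chosen explicitly for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical posting body with two paragraphs and a bulleted list.
toy_html = (
    "<html><body>"
    "<p>Join our team.</p>"
    "<ul><li>Python skills</li><li>SQL skills</li></ul>"
    "<p>Apply today.</p>"
    "</body></html>"
)
toy_soup = BeautifulSoup(toy_html, "html.parser")

# decompose() removes the tag and its contents from the tree in place.
for tag in toy_soup.body.find_all("li"):
    tag.decompose()

remaining_text = toy_soup.body.get_text(" ", strip=True)
remaining_bullets = toy_soup.find_all("li")

print(remaining_text)          # Join our team. Apply today.
print(len(remaining_bullets))  # 0
```

Because the removal happens in place, the soup objects no longer contain any bullets afterwards. The bullets collected earlier in the "Bullets" column are unaffected, since they were stored as plain strings.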


In [21]:
non_bullets = []
for soup in soup_objects:
    body = soup.body
    # decompose() removes each <li> tag from the soup in place
    for tag in body.find_all('li'):
        tag.decompose()
    non_bullets.append(body.text)
sorted_df = rank_words(non_bullets)
print(sorted_df[:5].to_string(index=False))

     Words  Summed TFIDF
      data     99.111312
      team     39.175041
      work     38.928948
experience     36.820836
  business     36.140488


The words **skills** and **ability** are no longer present in the ranked output.  
They have been replaced by the words **business** and **team**. Thus, the non-bulleted text appears to be less skill oriented.  
  
The words *data*, *experience* and *work* are top-ranked words shared between the bulleted and non-bulleted text.  
Strangely, the words **science** and **scientist** are missing.  

A logical question arises: do some postings pertain to data-driven jobs that aren't directly *data science* jobs?  
 

Let's iterate over the titles of all jobs and check whether each title mentions a data science position.  
Then, we will measure the percentage of jobs where data science and data scientist are missing from the titles.
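
One possible way to run such a check is sketched below on a handful of made-up titles. The `toy_titles` list, the regular expression and the `mentions_data_science` helper are assumptions for illustration, not necessarily the exact rule applied to the real data.

```python
import re

# Hypothetical titles, used only to illustrate the check.
toy_titles = [
    "Data Scientist - New York, NY",
    "Senior Data Science Manager",
    "Data Analyst",
    "Software Engineer, Machine Learning",
]

# Match "data science" or "data scientist", ignoring case.
pattern = re.compile(r"data\s+scien(ce|tist)", flags=re.IGNORECASE)

def mentions_data_science(title):
    return pattern.search(title) is not None

non_ds_titles = [t for t in toy_titles if not mentions_data_science(t)]
percent_non_ds = 100 * len(non_ds_titles) / len(toy_titles)

print(f"{percent_non_ds:.2f}% of the toy titles do not mention data science")
```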