# Extracting Raw Text from Job Posting HTML Web Pages

## Objective 

Extract relevant sections of the job posting webpages for similarity comparison with our resume and analysis of skills missing from our resume. Your pandas DataFrame might look like this once completed:

Hint: Where do most of the required skills end up showing up? Which HTML tags typically contain the skills requirements for jobs?

## Workflow 

1. Obtain the HTML job posting pages from the GitHub repository. Examine some HTML pages manually by opening them in a browser and inspecting the HTML elements (e.g., ctrl+shift+i in many browsers or right click on the page and choose “inspect”). Determine which sections should be extracted and stored.
**Hint: Where do most of the required skills end up showing up? Which HTML tags typically contain the skills requirements for jobs?**
2. Load all webpages into Python as strings using any method.
Be sure to check for exact duplicate HTML pages and remove them. **It may be easier to remove duplicate entries once you have the data in a pandas DataFrame.**
3. Store the webpage sections identified in step 1 (e.g., the title, body, and bullet points) into a pandas DataFrame with sensible column names.
**Examine some of the data in the DataFrame to make sure the data extraction worked as we expect.**
4. Filter the jobs to only include data science jobs. Since we already have our data in a pandas DataFrame, this can be done using pandas.
5. Save the DataFrame to disk so we can load it at a later time for future parts of the project.

In [1]:
import os
import pandas as pd
import sys
sys.path.append('/home/michal/Desktop/resume-job-posting-nlp-project/src')
from bs4 import BeautifulSoup as bs
from utils import render,read_html_from_file, show_result_with

In [2]:
#os.chdir("../data/html_job_postings/")
os.getcwd()

'/home/michal/Desktop/resume-job-posting-nlp-project/notebooks'

## Check html visually :  

Where do most of the required skills end up showing up? Which HTML tags typically contain the skills requirements for jobs?

In [3]:
os.chdir("../data/html_job_postings/")

In [4]:
file_list = [file for file in os.listdir() if file.endswith(".html")]
result = show_result_with(file_list[4])
#result.find_all("li")

**HTML**

<html>
 <head>
  <title>
   Transit Service Development Specialist I - San Jose, CA 95134
  </title>
 </head>
 <body>
  <h2>
   Transit Service Development Specialist I - San Jose, CA 95134
  </h2>
  <div>
   <div>
    <div>
     <b>
      Definition
     </b>
    </div>
    Initially, under close supervision, a Transit Service Development Specialist I assists in the assembly and analysis of transit performance data; modifies transit routes and schedules; assists in the preparation of materials for the operator bid process; prepares reports; and makes presentations.
    <br/>
    <br/>
    <b>
     Distinguishing Characteristics
    </b>
    <br/>
    This classification differs from the Transit Service Development Specialist II in that the latter independently performs a full range of complex analysis, planning, scheduling and runcutting for general and specialized transit services. After a period of training and experience, the Transit Service Development Specialist I perfo

# Preprocessing 
 

In [5]:
def extract_web_components(files):
    titles = []
    bodies = []
    bullets = []
    for file in files:
        html = read_html_from_file(file)
        titles.append(html.head.text)
        bodies.append(html.body.text)
        bullets.append([bull.text.strip() for bull in html.find_all("li")])
        
    return titles, bodies, bullets

In [6]:
titles, bodies, bullets = extract_web_components(file_list)

In [7]:
df = pd.DataFrame(
    {
        "title":titles,
        "body":bodies,
        "bullets":bullets
    }
)

In [8]:
df.head()

Unnamed: 0,title,body,bullets
0,"Data Analyst - San Francisco, CA","Data Analyst - San Francisco, CA\nRimeto’s mis...","[Interpret data, analyze results and provide o..."
1,Machine Learning and Computer Vision Engineer ...,Machine Learning and Computer Vision Engineer ...,"[Design, implement, and optimize cutting-edge ..."
2,"RHEL Integrator - North Charleston, SC 29406","RHEL Integrator - North Charleston, SC 29406\n...",[Provide engineering leadership in support of ...
3,"Talend Automation spoc - Somerset, NJ","Talend Automation spoc - Somerset, NJ\nOver 7+...",[Over 5.5 years of experience as python Full S...
4,Transit Service Development Specialist I - San...,Transit Service Development Specialist I - San...,[Develops and supports the implementation of b...


## Filter jobs containing Data Science jobs

In [9]:
df = df[df.title.str.contains("(data scientist) | (data science)",case=False)].copy()
df.head()

  return func(self, *args, **kwargs)


Unnamed: 0,title,body,bullets
8,Data Scientist II - Payment Products - Seattle...,Data Scientist II - Payment Products - Seattle...,"[Bachelor’s degree in Computer Science, Mathem..."
9,"Data Scientist - Seattle, WA","Data Scientist - Seattle, WA\nRing is looking ...",[Use predictive analytics and machine learning...
15,"Data Scientist - Jersey City, NJ 07311","Data Scientist - Jersey City, NJ 07311\nWorkin...",[Create predictive models using current and em...
18,2020 PhD Data Scientist Internship - Uber Eats...,2020 PhD Data Scientist Internship - Uber Eats...,[Develop models for user behavior and marketpl...
23,Data Analyst- Data Science & Analytics - Palo ...,Data Analyst- Data Science & Analytics - Palo ...,[Detailed and clear understanding of data used...


## Remove duplicates

In [10]:
df['bullets'] = df['bullets'].apply(tuple, 1)

In [11]:
df.drop_duplicates(inplace=True)

In [12]:
print(f"Number of Data Science Jobs offers : {len(df)}")

Number of Data Science Jobs offers : 388


# Save to disc

In [19]:
df.to_pickle("../step1_df.pk")