# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using keywords and tags for the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
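
Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below uses the standard library's `urllib.robotparser`. The helper name `is_allowed` and the sample rules are illustrative assumptions, not Indeed's actual policy, and a `robots.txt` check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that are already held in memory
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, for illustration only.
sample_rules = ["User-agent: *", "Disallow: /private/"]
print(is_allowed(sample_rules, "https://example.com/jobs"))
print(is_allowed(sample_rules, "https://example.com/private/page"))
```

For a live site, calling `set_url()` with the address of the site's `robots.txt` and then `read()` on the parser loads the real file instead of in-memory lines.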

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
import random
import time
import nltk
import openai

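# Read the API key from a local file; the context manager closes the file handle.
with open("api_key.txt", "r") as key_file:
    openai.api_key = key_file.read().strip()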

### ChatGPT API for generating a list of relevant Job Titles

In [2]:
msg_hist = []

def chat(inp, role="user"):
    msg_hist.append({"role": role, "content": f"{inp}"})
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=msg_hist
    )
    reply_content = completion.choices[0].message.content
    msg_hist.append({"role": "assistant", "content": f"{reply_content}"})
    return reply_content

In [3]:
def listify(output):
    lst = []
    for title in output.split('\n'):
        title = title.split(' ')
        role = ''
        for word in title[1:]:
            role += word+' '
        lst.append(role.rstrip().lower())
    return lst
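
`listify` drops the first space-separated token of every line, so it keeps trailing periods and turns blank lines into empty strings. A minimal alternative sketch, assuming the model always answers in the `N. item` format, is shown below. The name `parse_numbered_list` is hypothetical and is not used elsewhere in this notebook.

```python
import re

def parse_numbered_list(text):
    """Extract items from lines such as '3. Data Analyst', lower-cased."""
    items = []
    for line in text.splitlines():
        # Accept "1." or "1)" followed by the item text; skip other lines.
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            items.append(match.group(1).strip().rstrip(".").lower())
    return items

sample_reply = "1. Data Scientist\n\n2. Infrastructure Engineer. "
print(parse_numbered_list(sample_reply))
```

Because lines that do not match the numbered pattern are skipped, any introductory sentence in the reply is ignored rather than mangled.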

### checkpoint

In [2]:
input = """
        I have an assignment whose problem statement is given as follows:
        "
        Data science, analytics, AI, big data are becoming widely used in many fields, that leads to the ever-increasing demand of data analysts, 
        data scientists, ML engineers, managers of analytics and other data professionals. Due to that, data science education is now a hot topic 
        for educators and entrepreneurs.

        In this assignment, you will need to design a course curriculum for a new “Master of Business and Management in Data Science and 
        Artificial Intelligence” program at University of Toronto with focus not only on technical but also on business and soft skills. 
        Your curriculum would need to contain optimal courses (and topics covered in each course) for students to obtain necessary technical 
        and business skills to pursue a successful career as data scientist, analytics and data manager, data analyst, business analyst, 
        AI system designer, etc.

        You are required to extract skills that are in demand at the job market from job vacancies posted on http://indeed.com web-portal 
        and apply clustering algorithms to group/segment skills into courses.

        You are provided with a sample Python code to web-scrape job postings from http://indeed.com web-portal, that you would need to modify 
        for your assignment. You can decide on the geographical locations of the job postings (e.g., Canada, USA, Canada and USA) and job roles 
        (e.g., “data scientist”, “data analyst”, “manager of analytics”, “director of analytics”) of the posting that you will be web-scraping, 
        but your dataset should contain at least 1000 unique job postings.
        "
        Following the requirements of this assignment, suggest at least 20 job titles relevant to the course curriculum of 
        "Master of Business and Management in Data Science and Artificial Intelligence"
        """
output = chat(input)
print(output)

1. Data Scientist
2. Machine Learning Engineer
3. Data Analyst
4. Business Analytics Manager
5. Artificial Intelligence Specialist
6. Predictive Analytics Manager
7. Data Mining Engineer
8. Big Data Engineer
9. Business Intelligence Analyst
10. Chief Data Officer
11. Data Warehousing and Business Intelligence Analyst
12. Director of Analytics
13. Data Governance Manager
14. Enterprise Data Architect
15. Database Administrator
16. Data Integration Specialist
17. Analytics Strategy Consultant
18. Information Security Analyst
19. Software Developer
20. Infrastructure Engineer.


In [4]:
job_roles = listify(output)
job_roles

['data scientist',
 'machine learning engineer',
 'data analyst',
 'business analytics manager',
 'artificial intelligence specialist',
 'predictive analytics manager',
 'data mining engineer',
 'big data engineer',
 'business intelligence analyst',
 'chief data officer',
 'data warehousing and business intelligence analyst',
 'director of analytics',
 'data governance manager',
 'enterprise data architect',
 'database administrator',
 'data integration specialist',
 'analytics strategy consultant',
 'information security analyst',
 'software developer',
 'infrastructure engineer.']

### Path to webdriver (Firefox, Chrome) 

In [5]:
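# Ensure that the driver path is correct before running this script.
# Selenium 4 deprecates executable_path on the driver; pass it through a Service object.
from selenium.webdriver.firefox.service import Service

# MacOS
driver_path = "/Users/dhairyaparmar/geckodriver"
driver = webdriver.Firefox(service=Service(executable_path=driver_path))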


### Define position and location 

In [6]:
## Enter a job position
job_role = job_roles[0]
## Enter a location (City, State or Zip or remote)
locations = "remote"

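from urllib.parse import urlencode

def get_url(position, location):
    # URL-encode the query so multi-word roles such as "data scientist" form a valid URL.
    url_template = "https://www.indeed.com/jobs?{}"
    url = url_template.format(urlencode({"q": position, "l": location}))
    return url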

### Scrape job postings

In [7]:
def scrape_job_posts(job_role, postings = 50):
    url = get_url(position = job_role, location = locations)
    dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

    jn=0
    for i in range(0, postings, 10):
        driver.get(url + "&start=" + str(i))
        driver.implicitly_wait(3)

        jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

        for job in jobs:
            result_html = job.get_attribute('innerHTML')
            soup = BeautifulSoup(result_html, 'html.parser')
            
            jn += 1
            
            liens = job.find_elements(By.TAG_NAME, "a")
            links = liens[0].get_attribute("href")
            
            title = soup.select('.jobTitle')[0].get_text().strip()
            company = soup.select('.companyName')[0].get_text().strip()
            location = soup.select('.companyLocation')[0].get_text().strip()
            # soup.select() returns an empty list when a field is missing,
            # so only IndexError needs to be handled here.
            try:
                salary = soup.select('.salary-snippet-container')[0].get_text().strip()
            except IndexError:
                salary = 'NaN'
            try:
                rating = soup.select('.ratingNumber')[0].get_text().strip()
            except IndexError:
                rating = 'NaN'
            try:
                date = soup.select('.date')[0].get_text().strip()
            except IndexError:
                date = 'NaN'
            try:
                description = soup.select('.job-snippet')[0].get_text().strip()
            except IndexError:
                description = ''
        
            dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                            "Company": company,
                                            'Location': location,
                                            'Rating': rating,
                                            'Date': date,
                                            "Salary": salary,
                                            "Description": description,
                                            "Links": links}])], ignore_index=True)
            print("Job number {0:4d} added - {1:s}".format(jn,title))

    driver.quit()
    return dataframe
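
The pagination in `scrape_job_posts` works by advancing the `start` query parameter in steps of 10. A minimal sketch of that offset logic on its own is shown below. The names `page_urls` and `page_size` are hypothetical, and the step of 10 is taken from the loop above rather than from any documented Indeed API.

```python
def page_urls(base_url, postings=50, page_size=10):
    """Yield one results-page URL per block of `page_size` postings."""
    for start in range(0, postings, page_size):
        yield f"{base_url}&start={start}"

# Example search URL in the same shape that get_url produces.
urls = list(page_urls("https://www.indeed.com/jobs?q=data+scientist&l=remote", postings=30))
for url in urls:
    print(url)
```

Keeping the URL generation separate from the browser calls makes it possible to check the offsets without launching a driver.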

In [8]:
# data_sci_jobs = scrape_job_posts(job_role)
# data_sci_jobs.head()

### Scrape full job descriptions

In [9]:
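from selenium.webdriver.firefox.service import Service

def scrape_job_desc(job_role):
    dataframe = scrape_job_posts(job_role, postings = 50)
    Links_list = dataframe['Links'].tolist()
    # scrape_job_posts quits the shared driver, so start a fresh one here.
    driver = webdriver.Firefox(service=Service(executable_path=driver_path))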
    descriptions=[]
    for i in Links_list:
        driver.get(i)
        driver.implicitly_wait(random.randint(3, 8))
        jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
        descriptions.append(jd)
        time.sleep(random.randint(5,10))

    dataframe['Descriptions'] = descriptions
    driver.quit()
    return dataframe

In [10]:
data_sci_jobs = scrape_job_desc(job_role)
data_sci_jobs.head()

Job number    1 added - Data Scientist
Job number    2 added - Data Scientist - RWD
Job number    3 added - Data Scientist
Job number    4 added - Jr. Data Scientist
Job number    5 added - Data Scientist (All Levels)
Job number    6 added - Computational Biologist / Data Scientist
Job number    7 added - Interdisciplinary-Microbiologist/Data Scientist
Job number    8 added - Data Scientist (US Remote Eligible)
Job number    9 added - Jr. Data Scientist
Job number   10 added - Associate Data Scientist
Job number   11 added - Data Scientist
Job number   12 added - Data Scientist I, Product Analytics
Job number   13 added - Data Scientist
Job number   14 added - Senior Statistician
Job number   15 added - Senior Data Scientist
Job number   16 added - Data Scientist
Job number   17 added - Data Scientist / NLP (Python, Django/Flask, NLP, Clustering, Rest API)
Job number   18 added - Data Scientist jobs
Job number   19 added - Data Scientist - Fully remote
Job number   20 added - Data Scie


|   | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Data Products LLC | Remote |  | PostedToday | $80,000 - $120,000 a year | Present information using data visualization t... | https://www.indeed.com/company/Data-Products-L... | About us\nWe are professional and data-driven.... |
| 1 | Data Scientist - RWD | Norstella | Remote |  | EmployerActive 5 days ago | $125,000 - $175,000 a year | Design data pipelines and queries and analyze ... | https://www.indeed.com/company/NorStella/jobs/... | Job Summary:\nWe are seeking an experienced Da... |
| 2 | Data Scientist | TELLUS SOLUTIONS | +3 locationsRemote | 4.1 | PostedToday | Up to $82.14 an hour | Proficiency in machine learning algorithms suc... | https://www.indeed.com/company/Tellus-Solution... | Job Description:\nData Scientists in the clien... |
| 3 | Jr. Data Scientist | Net2Aspire | Remote |  | EmployerActive 4 days ago | $65,000 - $80,000 a year | Create data dashboards and other data visual... | https://www.indeed.com/company/net2aspire/jobs... | Apply Statistical and Machine Learning metho... |
| 4 | Data Scientist (All Levels) | Noblis | Remote in Reston, VA 20191 | 4.0 | PostedToday |  | In your role, you will work on multiple projec... | https://www.indeed.com/rc/clk?jk=b8f1b308b3018... | Responsibilities:\nNoblis is seeking to hire D... |


### Saving and Reloading Results

In [4]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
data_sci_jobs.to_csv(date + "_" + job_role + "_" + locations + ".csv", index=False)

NameError: name 'data_sci_jobs' is not defined

### Extracting Skills from Job Descriptions

In [4]:
data_sci_jobs = pd.read_csv('2023-04-04_data scientist_remote.csv')
data_sci_jobs.head()

|   | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Data Products LLC | Remote |  | PostedToday | $80,000 - $120,000 a year | Present information using data visualization t... | https://www.indeed.com/company/Data-Products-L... | About us\nWe are professional and data-driven.... |
| 1 | Data Scientist - RWD | Norstella | Remote |  | EmployerActive 5 days ago | $125,000 - $175,000 a year | Design data pipelines and queries and analyze ... | https://www.indeed.com/company/NorStella/jobs/... | Job Summary:\nWe are seeking an experienced Da... |
| 2 | Data Scientist | TELLUS SOLUTIONS | +3 locationsRemote | 4.1 | PostedToday | Up to $82.14 an hour | Proficiency in machine learning algorithms suc... | https://www.indeed.com/company/Tellus-Solution... | Job Description:\nData Scientists in the clien... |
| 3 | Jr. Data Scientist | Net2Aspire | Remote |  | EmployerActive 4 days ago | $65,000 - $80,000 a year | Create data dashboards and other data visual... | https://www.indeed.com/company/net2aspire/jobs... | Apply Statistical and Machine Learning metho... |
| 4 | Data Scientist (All Levels) | Noblis | Remote in Reston, VA 20191 | 4.0 | PostedToday |  | In your role, you will work on multiple projec... | https://www.indeed.com/rc/clk?jk=b8f1b308b3018... | Responsibilities:\nNoblis is seeking to hire D... |


In [5]:
sample_job_desc = data_sci_jobs.loc[0].Descriptions
print(sample_job_desc)

About us
We are professional and data-driven.
Our work environment includes:
Growth opportunities
In this role, you should be highly analytical with a knack for analysis, math and statistics. Critical thinking and problem-solving skills are essential for interpreting data. We also want to see a passion for machine-learning and research.
Responsibilities
Identify valuable data sources and automate collection processes
Undertake preprocessing of structured and unstructured data
Analyze large amounts of information to discover trends and patterns
Build predictive models and machine-learning algorithms
Combine models through ensemble modeling
Present information using data visualization techniques
Propose solutions and strategies to business challenges
Collaborate with engineering and product development teams
Requirements and skills
Proven experience as a Data Scientist or Data Analyst
Experience in data mining
Understanding of machine-learning and operations research
Knowledge of R, SQL 

In [7]:
# Named `prompt` rather than `input` to avoid shadowing the built-in function.
prompt = """Following is a sample job description:\n" """+sample_job_desc+""" "\nprovide a numbered list of skills mentioned in this job description 
        (name each skill to be a maximum of two words)"""
raw_skills = chat(prompt)
print(raw_skills, '\n')
skills = listify(raw_skills)
skills

1. data-driven
2. growth opportunities
3. analytical skills
4. math skills
5. statistics skills
6. critical thinking
7. problem-solving skills
8. machine-learning
9. research skills
10. data collection
11. data preprocessing
12. data analysis
13. predictive modeling
14. algorithm building
15. ensemble modeling
16. data visualization
17. solution proposing
18. collaboration skills
19. data mining
20. operations research
21. R programming
22. SQL
23. Python
24. Scala
25. Java
26. C++
27. business intelligence
28. Hadoop
29. business acumen
30. communication skills
31. presentation skills
32. computer science
33. engineering
34. quantitative field
35. remote work. 



['data-driven',
 'growth opportunities',
 'analytical skills',
 'math skills',
 'statistics skills',
 'critical thinking',
 'problem-solving skills',
 'machine-learning',
 'research skills',
 'data collection',
 'data preprocessing',
 'data analysis',
 'predictive modeling',
 'algorithm building',
 'ensemble modeling',
 'data visualization',
 'solution proposing',
 'collaboration skills',
 'data mining',
 'operations research',
 'r programming',
 'sql',
 'python',
 'scala',
 'java',
 'c++',
 'business intelligence',
 'hadoop',
 'business acumen',
 'communication skills',
 'presentation skills',
 'computer science',
 'engineering',
 'quantitative field',
 'remote work.']

In [10]:
# def extract_skills(desc):
#     chat_input =  """
#                     Following is a sample job description:\n" """+desc+""" "\n list the skills mentioned in this job description 
#                     (name each skill to be a maximum of two words)
#                     """
#     raw_skills = chat(chat_input)
#     skills = listify(raw_skills)
#     return skills

In [8]:
sample_job_desc2 = data_sci_jobs.loc[1].Descriptions
print(sample_job_desc2)

Job Summary:
We are seeking an experienced Data Scientist to join our Real-world data (RWD) team and support our data-driven initiatives. The ideal candidate will have a strong background in SQL and one high-level programming language (ideally Python), content knowledge of the Life Sciences industry and the drug development lifecycle, and experience querying healthcare databases (claims, lab, EMR, etc.). The candidate will also contribute to conference and journal publications.
Responsibilities:
Design data pipelines and queries and analyze data to support our Life Sciences use cases, particularly claims and lab data
Develop and optimize database queries and procedures for data processing, transformation, and analysis
Develop advanced algorithms on large-scale healthcare databases
Create final deliverables in industry standard to share with external clients
Work with cross-functional teams to identify data needs and develop solutions to support business goals
Support conference and jou

In [9]:
input2 = """Following is a sample job description:\n" """+sample_job_desc2+""" "\nprovide a numbered list of skills mentioned in this job description 
        (name each skill to be a maximum of two words)"""
raw_skills2 = chat(input2)
print(raw_skills2, '\n')
skills2 = listify(raw_skills2)
skills2

1. Data Scientist
2. Real-world data (RWD)
3. SQL
4. High-level programming
5. Life Sciences
6. Drug development lifecycle
7. Healthcare databases
8. Claims
9. Lab data
10. Algorithms
11. Large-scale databases
12. Data processing
13. Data transformation
14. Data analysis
15. Final deliverables
16. Cross-functional teams
17. Business goals
18. Research publication
19. Data quality
20. Problem-solving
21. Analytical skills
22. Epidemiological study
23. Collaborative skills
24. Attention to detail
25. Bachelor's degree 
26. Public Health
27. Biostatistics
28. Epidemiology
29. 5+ years of experience
30. Data engineering
31. Healthcare databases 
32. Remote work. 



['data scientist',
 'real-world data (rwd)',
 'sql',
 'high-level programming',
 'life sciences',
 'drug development lifecycle',
 'healthcare databases',
 'claims',
 'lab data',
 'algorithms',
 'large-scale databases',
 'data processing',
 'data transformation',
 'data analysis',
 'final deliverables',
 'cross-functional teams',
 'business goals',
 'research publication',
 'data quality',
 'problem-solving',
 'analytical skills',
 'epidemiological study',
 'collaborative skills',
 'attention to detail',
 "bachelor's degree",
 'public health',
 'biostatistics',
 'epidemiology',
 '5+ years of experience',
 'data engineering',
 'healthcare databases',
 'remote work.']

In [10]:
sample_job_desc3 = data_sci_jobs.loc[2].Descriptions
print(sample_job_desc3)

Job Description:
Data Scientists in the client Technology division specialize in applying computer vision, machine learning and artificial intelligence to solve problems across Merchandising, Marketing, Club Operations, Finance, eCommerce, and Security. You will have the opportunity to work with a high caliber team from a variety of disciplines to build new software and radically change our business. Data Scientists work as part of an Experience Team to develop and deploy advanced algorithms at scale. You will support and enable the entire project lifecycle including problem discovery with business clients, algorithmic design, coding, validation, deployment, testing, and monitoring. Data Scientists adhere to agile software development standards through rapid prototyping, iterative development, and incremental deployment of capabilities.
Key Responsibilities
Performs research and applies new techniques and concepts to solve problems
Understands and translates business and functional nee

In [11]:
input3 = """Following is a sample job description:\n" """+sample_job_desc3+""" "\nprovide a numbered list of skills mentioned in this job description 
        (name each skill to be a maximum of two words)"""
raw_skills3 = chat(input3)
print(raw_skills3, '\n')
skills3 = listify(raw_skills3)
skills3

1. Data Scientists
2. Computer Vision
3. Machine Learning
4. Artificial Intelligence
5. Merchandising
6. Marketing
7. Club Operations
8. Finance
9. eCommerce
10. Security
11. Agile software development
12. Research
13. Algorithmic design
14. Coding
15. Validation
16. Deployment
17. Testing
18. Monitoring
19. Data Sets
20. Disparate Data Sources
21. Data Pipelines
22. Engineering Teams
23. Exploratory Data Analysis
24. Anomaly Detection
25. Probability
26. Statistical Models
27. Technical Communication
28. Collaboration
29. Cloud Environment
30. Deep Learning
31. Computer Vision Algorithms
32. Python
33. Smart Phone Models
34. Data Science
35. Optimization Models
36. Master's degree
37. Spark
38. Scala
39. R Programming
40. Scikit Learn
41. TensorFlow
42. Torch
43. Contract
44. Remote Work. 



['data scientists',
 'computer vision',
 'machine learning',
 'artificial intelligence',
 'merchandising',
 'marketing',
 'club operations',
 'finance',
 'ecommerce',
 'security',
 'agile software development',
 'research',
 'algorithmic design',
 'coding',
 'validation',
 'deployment',
 'testing',
 'monitoring',
 'data sets',
 'disparate data sources',
 'data pipelines',
 'engineering teams',
 'exploratory data analysis',
 'anomaly detection',
 'probability',
 'statistical models',
 'technical communication',
 'collaboration',
 'cloud environment',
 'deep learning',
 'computer vision algorithms',
 'python',
 'smart phone models',
 'data science',
 'optimization models',
 "master's degree",
 'spark',
 'scala',
 'r programming',
 'scikit learn',
 'tensorflow',
 'torch',
 'contract',
 'remote work.']

Considering the following candidate skill n-grams:

In [None]:
skill_ngrams = ["Data Analysis",
                "Statistics",
                "Python",
                "R",
                "SQL",
                "Machine Learning",
                "Deep Learning",
                "Data Mining",
                "Data Visualization",
                "Feature Engineering",
                "Hadoop",
                "Spark",
                "Web Scraping",
                "Natural Language Processing",
                "Computer Vision",
                "Data-driven decision making",
                "Communication",
                "Ethics",
                "Collaboration",
                "Project Management",
                "Problem-solving",
                "Critical Thinking",
                "Business Strategy"]
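
One possible use of such a list is to check which n-grams occur in a scraped job description. The sketch below is an assumption about how the list might be applied, not a step the notebook performs. The function name `find_skills` and the sample inputs are hypothetical, and matching is case-insensitive on whole words, so a one-letter skill such as "R" can also match a stray standalone "r".

```python
import re

def find_skills(description, skills):
    """Return the skills that appear in `description` as whole words or phrases."""
    found = []
    for skill in skills:
        # Lookarounds stop "R" from matching inside words such as "research".
        pattern = r"(?<!\w)" + re.escape(skill) + r"(?!\w)"
        if re.search(pattern, description, flags=re.IGNORECASE):
            found.append(skill)
    return found

# Illustrative inputs only.
sample_text = "Knowledge of R, SQL and machine learning"
sample_skills = ["R", "SQL", "Python", "Machine Learning"]
matches = find_skills(sample_text, sample_skills)
print(matches)
```

Applied to every row of the `Descriptions` column, the same function would give one list of matched skills per posting.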