# <center> Web scraping for Data Scientist jobs in CO (Total points 9)

## <center> Adam Brenner

In this exercise we'll scrape the web for **Data Scientist jobs in CO**.


Here is the link to the search query

https://www.indeed.com/jobs?q=data+scientist&l=CO

As you can see, at the bottom of the page there are links to a series of pages related to this search.
If you click on the second page, the search URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10

If you click on the third page, the URL changes to

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20

Hence, to reach more pages we can format the search string (**change the start=?? part**) and call **requests.get in a loop**.
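The pagination pattern described here can be sketched as a small helper. This is a minimal illustration, assuming 10 results per page as the URLs above suggest; `build_search_urls` is a hypothetical name, and the `requests.get` call is left as a comment so the snippet runs offline.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_urls(query, location, pages=10, step=10):
    """Return one search URL per results page (start=0, 10, 20, ...)."""
    urls = []
    for page in range(pages):
        # urlencode turns the space in the query into "+"
        params = {"q": query, "l": location, "start": page * step}
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls

urls = build_search_urls("data scientist", "CO")
for url in urls:
    # response = requests.get(url)  # fetch each page here
    print(url)
```

Building the query with `urlencode` rather than string concatenation keeps the URL valid if the query or location ever contains characters that need escaping.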


# Q1 (5 points = 4 for non-indicator columns + 1 for indicator columns) Please complete the following task

- Scrape 10 pages (**the URL of the last (10th) page will be https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90**) and build a pandas DataFrame containing the following information
    + **job title, name of the company, location, summary of job description**
    + **Indicator columns (with value True/False) for the keywords Python, SQL, AWS, RESTFUL, Machine learning, Deep Learning, Text Mining, NLP, SAS, Tableau, Sagemaker, TensorFlow, Spark**

Note:
- Make sure that you do a case-insensitive search for keywords when filling in (True/False) the indicator columns.
- You need to go to the webpage of the detailed job posting for the keyword search. The main search results only contain a summary of the job description. Build the URL of each detailed job posting from the scraped main search results.
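The case-insensitive keyword check in the note above could look roughly like the following. This is a sketch under stated assumptions, not the assignment's reference solution: `keyword_indicators` is a hypothetical helper, the sample description is made up, and matching on word boundaries is a design choice that goes beyond the plain substring search the task requires.

```python
import re

KEYWORDS = ["Python", "SQL", "AWS", "RESTFUL", "Machine learning", "Deep Learning",
            "Text Mining", "NLP", "SAS", "Tableau", "Sagemaker", "TensorFlow", "Spark"]

def keyword_indicators(description, keywords=KEYWORDS):
    """Map each keyword to True/False for one job description (case-insensitive)."""
    flags = {}
    for kw in keywords:
        # \b keeps short keywords such as "SAS" from matching inside other words
        pattern = r"\b" + re.escape(kw) + r"\b"
        flags[kw] = re.search(pattern, description, flags=re.IGNORECASE) is not None
    return flags

# Made-up description for illustration
sample = "We use python, PySpark and sql daily; offices in Kansas City."
flags = keyword_indicators(sample)
print(flags)
```

Word boundaries avoid counting "Kansas" as SAS, at the cost of missing compounds such as "PySpark" for Spark. A list of these dictionaries, one per posting, can be passed to `pd.DataFrame` to obtain the indicator columns.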

In [1]:
from bs4 import BeautifulSoup as bsoup
import urllib.robotparser
import pandas as pd
import numpy as np
import requests

In [2]:
# robots.txt lives at the site root; ask whether a generic crawler ("*") may fetch the search page
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.indeed.com/robots.txt")
rp.read()
rp.can_fetch("*", "https://www.indeed.com/jobs?q=data+scientist&l=CO")

True

In [3]:
indeed_url = 'https://www.indeed.com/q-data-scientist-l-CO-jobs.html'
response = requests.get(indeed_url)
response.status_code
soup = bsoup(response.text, 'lxml')

In [4]:
# Used for testing
# print(soup.prettify())

In [5]:
# Used for testing
# print(soup.get_text())

In [6]:
# Used for testing
# for tag in soup.find_all(True):
#     print(tag.name)

In [7]:
# This extracts the job title from the webpage
def extract_job_title(soup):
    job_title = []
    for div in soup.find_all('h2', attrs = {'class':'title'}):
        for a in div.find_all('a', attrs = {'data-tn-element':'jobTitle'}):
            job_title.append(a['title'])
    return job_title

In [8]:
# This extracts the company name from the webpage
def extract_company_name(soup):
    company_name = []
    for div in soup.find_all('div', attrs = {'class':'row'}):
        a = div.find_all('span', attrs = {'class':'company'})
        for b in a:
            company_name.append(b.text.strip())
    return company_name

In [9]:
# This extracts the location from the webpage
def extract_location(soup):
    locations = []
    for div in soup.find_all('span', attrs = {'class':'location'}):
        locations.append(div.text)
    return locations

In [10]:
# This extracts the summary of the position from the webpage
def extract_summary(soup):
    summary = []
    for div in soup.find_all('div', attrs = {'class':'summary'}):
        summary.append(div.text.replace('\n', ""))
    return summary

In [11]:
# This builds the URL of each posting's full-description page from the links on the search results page
def extract_full_descript(soup):
    full_d = []
    for div in soup.find_all('a', attrs = {'class':'jobtitle turnstileLink'}):
        full_d.append(div['href'])
    full_description = []
    for f in full_d:
        full_description.append('https://www.indeed.com' + f)
    return full_description

In [12]:
# This fetches each posting's full description and records, per keyword, whether it appears (case-insensitive)
def keywords(soup):
    full = extract_full_descript(soup)
    # Named keyword_list so the local variable does not shadow the function name
    keyword_list = ['python', 'sql', 'aws', 'restful', 'machine learning', 'deep learning', 'text mining', 'nlp', 'sas',
                    'tableau', 'sagemaker', 'tensorflow', 'spark']
    keyword_true = []
    keyword_false = []
    for f in full:
        response = requests.get(f)
        soup2 = bsoup(response.text, 'lxml')
        for div in soup2.find_all('div', attrs={'class':'jobsearch-jobDescriptionText'}):
            # Lowercase once per description instead of once per keyword
            text = div.text.lower()
            for k in keyword_list:
                if k in text:
                    keyword_true.append((k, True))
                else:
                    keyword_false.append((k, False))
    return keyword_true, keyword_false

In [13]:
# keywords(soup)

In [14]:
# Webscrapes the indeed pages and puts the title, location, company name, and job summary into a pandas dataframe.
url_codes = [0,10,20,30,40,50,60,70,80,90]
job_titles = []
companies = []
locations = []
summaries = []
for n in url_codes:
    # Wanted to check to make sure we were hitting all the pages
    indeed_url = 'https://www.indeed.com/jobs?q=data+scientist&l=CO&start=' + str(n)
    print(indeed_url)
    
    response = requests.get(indeed_url)
    soup = bsoup(response.text, 'lxml')
    
    # These take our functions from above and returns a list of lists for each category
    job_titles.append(extract_job_title(soup))
    companies.append(extract_company_name(soup))
    locations.append(extract_location(soup))
    summaries.append(extract_summary(soup))
    
# This takes that list of lists and combines all entries into one list for transforming into a dataframe
job_titles = [j2 for j in job_titles for j2 in j]
companies = [c2 for c in companies for c2 in c]
locations = [l2 for l in locations for l2 in l]
summaries = [s2 for s in summaries for s2 in s]

# This dictionary is used for putting each category into columns in the dataframe
job_dict = {'Title':job_titles, 'Company':companies, 'Location':locations, 'Summary':summaries}
# I noticed that not all lists were of the same size and thus had to find a way
# for putting NA's into those spots while creating the dataframe
job_DF = pd.DataFrame({key:pd.Series(value) for key, value in job_dict.items()})

https://www.indeed.com/jobs?q=data+scientist&l=CO&start=0
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=10
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=20
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=30
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=40
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=50
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=60
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=70
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=80
https://www.indeed.com/jobs?q=data+scientist&l=CO&start=90


In [15]:
job_DF

| | Title | Company | Location | Summary |
|---|---|---|---|---|
| 0 | CIRES/Earth Lab Research Data Scientist | University of Colorado Boulder | Boulder, CO 80309 (Colorado University area) | Knowledge of statistical and machine learning ... |
| 1 | Applied Scientist II | Amazon.com Services LLC | Denver, CO | Extensive knowledge and practical experience i... |
| 2 | Data Scientist | The Aerospace Corporation | Colorado Springs, CO | Leading teams to identify key opportunities fo... |
| 3 | Senior Data Scientist | Verizon | Denver, CO 80202 (Central Business District area) | Experience in exceptional data mining, statist... |
| 4 | Data Science Engineer II Boulder | Sovrn Holdings | Boulder, CO | Participating in data science and machine lear... |
| 5 | Senior Data Scientist | Uplight | Boulder, CO 80301 | Aggregate and disseminate results of analyses ... |
| 6 | Data Scientist | Viola AI | Boulder, CO | Expert level knowledge of statistics, linear a... |
| 7 | Intern - Data Scientist | Teradata | Aurora, CO | Strong data visualization and data mining skil... |
| 8 | Associate Data Scientist | Gates Corporation | Denver, CO | 1-3 years of relevant experience in data scien... |
| 9 | Lead Data Scientist | Verizon | Colorado Springs, CO 80919 (Northwest Colorado... | In this technical lead role, you will use mach... |


In [16]:
# To handle the NA problem, missing values are filled with the placeholder 'CO Missing'.
print(job_DF.isna().sum())
job_DF = job_DF.fillna('CO Missing')
print(job_DF.isna().sum())

Title       0
Company     0
Location    4
Summary     0
dtype: int64
Title       0
Company     0
Location    0
Summary     0
dtype: int64
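The unequal list lengths arise because each field is collected across the whole page, so a posting that lacks one field shifts every later value up by a row. One way to keep fields aligned is to walk the result cards one at a time and store `None` for anything missing. The sketch below assumes each posting sits in a `div` with class `row` (as `extract_company_name` suggests) and reuses the class names from the functions above; the HTML is made up for illustration, and `text_or_none` and `extract_rows` are hypothetical helpers.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the second card has no location span.
SAMPLE_HTML = """
<div class="row">
  <h2 class="title"><a title="Data Scientist">Data Scientist</a></h2>
  <span class="company"> Acme </span>
  <span class="location">Denver, CO</span>
  <div class="summary">Build models.</div>
</div>
<div class="row">
  <h2 class="title"><a title="Senior Data Scientist">Senior Data Scientist</a></h2>
  <span class="company"> Globex </span>
  <div class="summary">Lead projects.</div>
</div>
"""

def text_or_none(card, name, css_class):
    """Return the stripped text of the first matching tag inside card, or None."""
    tag = card.find(name, class_=css_class)
    return tag.get_text(strip=True) if tag else None

def extract_rows(soup):
    """Build one dict per result card so the fields of a posting stay together."""
    rows = []
    for card in soup.find_all("div", class_="row"):
        rows.append({
            "Title": text_or_none(card, "h2", "title"),
            "Company": text_or_none(card, "span", "company"),
            "Location": text_or_none(card, "span", "location"),
            "Summary": text_or_none(card, "div", "summary"),
        })
    return rows

rows = extract_rows(BeautifulSoup(SAMPLE_HTML, "html.parser"))
print(rows)
```

Passing `rows` to `pd.DataFrame` then gives one row per posting, with the missing value appearing only in the row it belongs to.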


# Q2 (1 point) Save your DataFrame to a pickle file named *indeed_job_co.pkl*. 
   Load this pkl file into a DataFrame and use that DataFrame to answer the following questions.

   <font color='red'>Upload the pickle file (indeed_job_co.pkl) along with the solution notebook to Canvas</font>

In [17]:
# Creates a pickle file
job_DF.to_pickle("./indeed_job_co.pkl")

indeed_DF = pd.read_pickle("./indeed_job_co.pkl")

<font size = "6" color='red'> Use pandas functionality to answer question 2</font>
# Q3 a (1 point) Which city has the most job postings?



In [18]:
# Top Job Location
# I included the top 2 just in case the location NA problem from above was significant
indeed_DF['Location'].value_counts().head(2)

Boulder, CO    20
Denver, CO     15
Name: Location, dtype: int64
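Because some location strings carry a ZIP code or neighborhood (for example "Boulder, CO 80309 (Colorado University area)"), counting raw strings treats them as different places from "Boulder, CO". A possible refinement, sketched here with the standard library and a few location strings taken from the table above, is to count by the city name alone; `city_of` is a hypothetical helper.

```python
from collections import Counter

def city_of(location):
    """Return the text before the first comma, e.g. 'Boulder' from 'Boulder, CO 80301'."""
    return location.split(",")[0].strip()

locations = [
    "Boulder, CO 80309 (Colorado University area)",
    "Denver, CO",
    "Boulder, CO",
    "Denver, CO 80202 (Central Business District area)",
    "Boulder, CO 80301",
]

city_counts = Counter(city_of(loc) for loc in locations)
top_city, top_count = city_counts.most_common(1)[0]
print(top_city, top_count)
```

In pandas the same idea can be written as `indeed_DF['Location'].str.split(',').str[0].str.strip().value_counts()`.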

# Q3 b (1.5 points) Top 3 most in-demand skills (like Python, AWS, SQL ...)



In [19]:
# This scans the full job descriptions linked from `soup`, which at this point holds
# the last search page fetched in the loop above (start=90)
yes, no = keywords(soup)

In [20]:
# Top 3 Skills
pd.DataFrame(yes)[0].value_counts().head(3)

python              12
machine learning    12
sql                 10
Name: 0, dtype: int64

# Q3 c (0.5 point) What other questions would you like to ask based on the Indeed data?

This is a free-response question.

Does the job title contain keywords such as "senior", "II", or "2"? These could be used to categorize postings as entry level or experienced.

What salary is listed for each position?

Does the summary mention years of experience? That would help refine the categorization of job titles from the first question.
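The title-based categorization proposed above could start from something like this rough sketch. The markers "senior", "II" and "2" come from the answer above; the others ("sr", "lead", "principal", "iii", "3") are illustrative guesses, and `seniority` is a hypothetical helper. The titles are taken from the DataFrame shown earlier.

```python
import re

# Assumed markers of an experienced role; only "senior", "ii" and "2" are
# named in the text, the rest are illustrative guesses.
EXPERIENCED = re.compile(r"\b(senior|sr|lead|principal|ii|iii|2|3)\b", re.IGNORECASE)

def seniority(title):
    """Label a title 'experienced' if it contains a seniority marker, else 'entry level'."""
    return "experienced" if EXPERIENCED.search(title) else "entry level"

titles = [
    "Senior Data Scientist",
    "Applied Scientist II",
    "Data Scientist",
    "Data Science Engineer II Boulder",
    "Intern - Data Scientist",
]
levels = {title: seniority(title) for title in titles}
print(levels)
```

Treating every unmarked title as entry level is crude, which is why combining it with the years of experience mentioned in the summary would likely give a better split.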