# Career Scraper Lab

### Introduction

In this lesson, we'll work with data collected by a scraper that pulls job postings from Indeed.com.  The goal of the scraper was to learn more about data engineering positions, and tech positions in general.

We can use the data to determine what skills are needed by data engineers, the kinds of companies hiring data engineers, and where they are being hired. 

### Connecting to our database

We can begin by using the `psycopg2` library.  This library is already installed on Google Colab (so there is no need to install it there).  If we would like to install it on our own computer, we can do so with the following:

In [1]:
! pip install psycopg2-binary

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting psycopg2-binary
  Downloading psycopg2_binary-2.9.5-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.9.5


Then we should be able to import the library.  We can then use it to connect to a PostgreSQL database, one that is hosted on Jigsaw's AWS account.

In [2]:
import psycopg2

In [11]:
DB_NAME="careers"
DB_HOST="career-scraper.crd5vw1vref2.us-east-1.rds.amazonaws.com"
DB_USER="student"
DB_PASSWORD="jigsaw_student"
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/{DB_NAME}"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
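
The connection string above is assembled by hand with an f-string. As a minimal sketch, and assuming we might later use credentials that contain characters such as `@` or `/`, a small helper can percent-encode the user and password before building the URL. The name `build_db_url` is hypothetical (it is not part of `psycopg2`); the result would be passed to `psycopg2.connect` exactly as before.

```python
from urllib.parse import quote


def build_db_url(user, password, host, db_name):
    """Assemble a PostgreSQL connection URL, percent-encoding the credentials."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{host}/{db_name}"


db_url = build_db_url(
    "student",
    "jigsaw_student",
    "career-scraper.crd5vw1vref2.us-east-1.rds.amazonaws.com",
    "careers",
)
print(db_url)
```

For the credentials used in this lab the helper produces the same URL as the f-string, since none of the characters need escaping.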

From here, we can list all of the tables in the database.

In [10]:
cursor.execute("""SELECT table_name FROM information_schema.tables
       WHERE table_schema = 'public'""")
tables = [row[0] for row in cursor.fetchall()]
tables[1:]  # skip the first table returned

['states',
 'cities',
 'scrapings',
 'scraped_pages',
 'cards',
 'companies',
 'position_locations',
 'position_skills',
 'skills',
 'positions',
 'job_titles']

So there are a number of tables listed, but we can ignore the `scrapings`, `scraped_pages`, and `cards` tables.  This leaves us with the following relevant tables.

In [5]:
relevant_tables = ['states', 'cities', 'companies',
 'position_locations', 'position_skills', 'skills', 'positions',
 'job_titles']

And we can see the columns of each of these tables with the following:

In [12]:
for relevant_table in relevant_tables:
    cursor.execute(f"Select * FROM {relevant_table} LIMIT 0")
    print(relevant_table, [desc[0] for desc in cursor.description])

states ['id', 'name', 'timestamp']
cities ['id', 'name', 'state_id', 'timestamp']
companies ['id', 'name', 'timestamp']
position_locations ['id', 'position_id', 'city_id', 'state_id', 'is_remote', 'timestamp']
position_skills ['id', 'position_id', 'skill_id']
skills ['id', 'name', 'timestamp']
positions ['id', 'source_id', 'card_id', 'title', 'description', 'minimum_salary', 'maximum_salary', 'minimum_experience', 'maximum_experience', 'company_id', 'timestamp', 'date_posted', 'query_string', 'job_title_id']
job_titles ['id', 'name', 'timestamp']
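
The loop above interpolates the table name directly into the SQL string, which is fine for a fixed list but worth guarding if the name ever comes from elsewhere (table names cannot be passed as ordinary query parameters). Below is a rough sketch of a helper that checks the name against the known list first. The helper name `column_names` is hypothetical, and because the remote database may not be reachable, the demonstration uses Python's built-in `sqlite3` with a stand-in `states` table; `cursor.description` behaves the same way on a `psycopg2` cursor.

```python
import sqlite3

RELEVANT_TABLES = {
    "states", "cities", "companies", "position_locations",
    "position_skills", "skills", "positions", "job_titles",
}


def column_names(cursor, table):
    """Return the column names of `table`, refusing unknown table names."""
    if table not in RELEVANT_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    # LIMIT 0 fetches no rows but still populates cursor.description
    cursor.execute(f"SELECT * FROM {table} LIMIT 0")
    return [desc[0] for desc in cursor.description]


# Stand-in database (an assumption for illustration, not the careers database)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE states (id INTEGER PRIMARY KEY, name TEXT, timestamp TEXT)")
demo_cursor = demo_conn.cursor()
print(column_names(demo_cursor, "states"))
```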


Begin by looking at the columns in the tables.  At this point, it's worth diagramming the structure of these tables.  Look at the foreign keys (the columns ending in `_id`) to determine how the tables relate to one another, and specify each relation.

* Which tables do you think we will be relying on the most?
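
As a starting point for the diagram, the relations can be guessed from the column names printed above. This is only a sketch: it assumes the naming convention that a column such as `city_id` points at the `cities` table, which is not confirmed by the database itself (the actual constraints would have to be looked up in the schema). The function names are made up for illustration.

```python
# Column lists copied from the output above
columns_by_table = {
    "states": ["id", "name", "timestamp"],
    "cities": ["id", "name", "state_id", "timestamp"],
    "companies": ["id", "name", "timestamp"],
    "position_locations": ["id", "position_id", "city_id", "state_id", "is_remote", "timestamp"],
    "position_skills": ["id", "position_id", "skill_id"],
    "skills": ["id", "name", "timestamp"],
    "positions": ["id", "source_id", "card_id", "title", "description",
                  "minimum_salary", "maximum_salary", "minimum_experience",
                  "maximum_experience", "company_id", "timestamp",
                  "date_posted", "query_string", "job_title_id"],
    "job_titles": ["id", "name", "timestamp"],
}


def pluralize(name):
    """Naive pluralization: city -> cities, skill -> skills."""
    if name.endswith("y"):
        return name[:-1] + "ies"
    return name + "s"


def infer_relations(columns_by_table):
    """Guess (table, column, referenced_table) triples from *_id column names."""
    relations = []
    for table, columns in columns_by_table.items():
        for column in columns:
            if not column.endswith("_id"):
                continue
            target = pluralize(column[: -len("_id")])
            if target in columns_by_table:  # skip columns with no matching table
                relations.append((table, column, target))
    return relations


for table, column, target in infer_relations(columns_by_table):
    print(f"{table}.{column} -> {target}.id")
```

Columns such as `card_id` and `source_id` are skipped here because no matching table appears in the relevant list.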

### Answering Questions

> Hint: Before diving into the questions below, it may be useful to explore the data a little, and ask some questions of the data.

#### Assessing the data

Now this scraper does not have data on all of the jobs in the US, so it's good to start by getting a sense of the data.  What type of data does it have the most of?

* For example, what `job_titles` appear most frequently in the database?

Then ask similar questions to see what type of data has been collected.

> **Hint**: Think about dimensions in the data (who, what, where, when)?
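
One way to approach the job title question is a `GROUP BY` with a `COUNT`. The sketch below runs that pattern against an in-memory SQLite database with a few made-up rows (they are not real scraper data), so it can be tried without a network connection. The assumption is that the same query text could then be passed to `cursor.execute` on the careers database, since it only uses the `positions.job_title_id` and `job_titles.name` columns listed earlier.

```python
import sqlite3

# Stand-in database with invented rows, for illustration only
demo = sqlite3.connect(":memory:")
demo.executescript("""
CREATE TABLE job_titles (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE positions (id INTEGER PRIMARY KEY, title TEXT, job_title_id INTEGER);
INSERT INTO job_titles (id, name) VALUES (1, 'data engineer'), (2, 'data scientist');
INSERT INTO positions (id, title, job_title_id) VALUES
    (1, 'Senior Data Engineer', 1),
    (2, 'Data Engineer II', 1),
    (3, 'Data Engineer', 1),
    (4, 'Data Scientist', 2);
""")

# Count positions per job title, most frequent first
TITLE_COUNTS_SQL = """
select job_titles.name, count(positions.id) from positions
join job_titles on job_titles.id = positions.job_title_id
group by job_titles.name
order by count(positions.id) desc
"""

title_counts = demo.execute(TITLE_COUNTS_SQL).fetchall()
print(title_counts)
```

Swapping the grouped column (for example, to a city or company name) gives the "where" and "who" versions of the same question.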

### Diving into data engineers

* What are the top skills required of data engineers (limit to the top 10 results)?


In [8]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()

In [14]:
cursor.execute(""" select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like '%data enginee%'
group by skills.id order by count(skills.id) desc limit 10""")
cursor.fetchall()
# name	amount
# 0	business	790
# 1	engineering	712
# 2	support	683
# 3	design	654
# 4	sql	624
# 5	python	555
# 6	software	538
# 7	analytics	489
# 8	computer science	448
# 9	communication	443

[('business', 790),
 ('engineering', 712),
 ('support', 683),
 ('design', 654),
 ('sql', 624),
 ('python', 555),
 ('software', 538),
 ('analytics', 489),
 ('computer science', 448),
 ('communication', 443)]
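
The same top-skills query is reused later in the lab for other job titles, changing only the `like` pattern. As a hedged sketch, the query could be wrapped in a function that passes the pattern and the limit as query parameters. `psycopg2` uses `%s` placeholders, and supplying the `%...%` pattern as a parameter avoids having to escape literal percent signs in the SQL text. The function name `top_skills_for_title` is an assumption for illustration, not something defined elsewhere in this lab.

```python
TOP_SKILLS_SQL = """
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like %s
group by skills.id
order by count(skills.id) desc
limit %s
"""


def top_skills_for_title(cursor, title, limit=10):
    """Return (skill, count) rows for job titles containing `title`."""
    pattern = f"%{title}%"  # the wildcard pattern is sent as a parameter
    cursor.execute(TOP_SKILLS_SQL, (pattern, limit))
    return cursor.fetchall()
```

With the `cursor` created earlier, a call such as `top_skills_for_title(cursor, "data engineer")` would be expected to run the same query as the cell above.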

### Choosing a Career

* Now perhaps we want to use this database to help us choose a specific profession.  What are some questions we can ask?

We can ask questions such as:

* What is the average minimum salary of a data engineer?

In [15]:
cursor.execute(""" select avg(minimum_salary) from positions
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like '%data engineer%'
""")
cursor.fetchall()
# avg_salary
# 0	78606.050898

[(Decimal('78606.050898203593'),)]
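
Note that `avg` skips rows where `minimum_salary` is null, and that `psycopg2` returns the result as a `Decimal`. A small sketch of turning that value into something easier to read, using the number from the output above:

```python
from decimal import Decimal

# Value copied from the query result above
rows = [(Decimal("78606.050898203593"),)]

avg_min_salary = float(rows[0][0])      # first column of the first row
formatted = f"${avg_min_salary:,.2f}"   # thousands separator, two decimals
print(formatted)
```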

* What are the average minimum salaries by job title, ordered from highest to lowest?

In [16]:
cursor.execute(""" select job_titles.name, avg(minimum_salary) from positions
join job_titles on job_titles.id = positions.job_title_id
group by job_titles.name
order by avg(minimum_salary) desc 
limit 5
""")
cursor.fetchall()
# name	avg_salary
# 0	machine learning engineer	104419.226190
# 1	data scientist	88704.466981
# 2	data engineer	78606.050898

[('machine learning engineer', Decimal('104419.226190476190')),
 ('data scientist', Decimal('88704.466981132075')),
 ('data engineer', Decimal('78606.050898203593'))]

* Which positions appear most often in New York?

In [17]:
cursor.execute(""" select job_titles.name, count(job_titles.name) from positions
join job_titles on job_titles.id = positions.job_title_id
join position_locations on position_locations.position_id = positions.id
join cities on cities.id = position_locations.city_id

where cities.name = 'New York'
group by job_titles.name
order by count(positions.id) desc
limit 5
""")
cursor.fetchall()

# name	count
# 0	data engineer	38
# 1	data scientist	27
# 2	machine learning engineer	23

[('data engineer', 38),
 ('data scientist', 27),
 ('machine learning engineer', 23)]

* Order the cities from highest to lowest average minimum salary for data engineers.  Only include cities where there are at least 10 positions, and exclude any positions without a listed minimum salary.

In [18]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
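# Note: this query does not filter on job_titles.name, so the averages below cover every job title.
cursor.execute("""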
select cities.name, avg(minimum_salary), count(cities.name) from cities
join position_locations on cities.id = position_locations.city_id
join positions on position_locations.position_id = positions.id
join job_titles  on job_titles.id = positions.job_title_id
where minimum_salary is not null
group by cities.name
having count(cities.name) >= 10 
order by avg(minimum_salary) desc
limit 6
""")

cursor.fetchall()

# name	avg_salary	num_positions
# 0	Remote In San Francisco	124056.565217	23
# 1	Remote In New York	121450.000000	28
# 2	New York	113408.000000	25
# 3	San Francisco	104985.000000	10
# 4	Denver	99855.555556	18
# 5	San Diego	90680.000000	12


[('Remote In San Francisco', Decimal('124056.565217391304'), 23),
 ('Remote In New York', Decimal('121450.000000000000'), 28),
 ('New York', Decimal('113408.000000000000'), 25),
 ('San Francisco', Decimal('104985.000000000000'), 10),
 ('Denver', Decimal('99855.555555555556'), 18),
 ('San Diego', Decimal('90680.000000000000'), 12)]
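
In the results above, "Remote In San Francisco" and "San Francisco" are stored as separate city names. If we wanted to treat them as the same place, one option is to split off the prefix in Python. This is only a sketch: the helper name `split_remote` is made up, and the `is_remote` column in `position_locations` may already capture the same information.

```python
REMOTE_PREFIX = "Remote In "


def split_remote(city_name):
    """Return (base_city, is_remote) for a scraped city name."""
    if city_name.startswith(REMOTE_PREFIX):
        return city_name[len(REMOTE_PREFIX):], True
    return city_name, False


for name in ["Remote In San Francisco", "New York"]:
    print(split_remote(name))
```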

* Find the average minimum years of experience required by job title, ordered from fewest years required to most.

In [19]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select job_titles.name, avg(minimum_experience) from positions
join job_titles on job_titles.id = positions.job_title_id
group by job_titles.name
order by avg(minimum_experience) asc
""")

cursor.fetchall()

# name	avg_experience
# 0	data engineer	2.744042
# 1	machine learning engineer	3.135922
# 2	data scientist	3.217742

[('data engineer', Decimal('2.7440423654015887')),
 ('machine learning engineer', Decimal('3.1359223300970874')),
 ('data scientist', Decimal('3.2177419354838710'))]

### Choosing a skillset

Of course, we may also want to determine which skills are most valuable to learn.
We could ask questions such as:

* What are the top skills requested in the dataset?

In [20]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

# name	amount
# 0	business	1662
# 1	engineering	1547
# 2	support	1422
# 3	design	1385
# 4	software	1293
# 5	python	1191
# 6	analytics	1068
# 7	communication	1057
# 8	sql	1022
# 9	computer science	1022

[('business', 1662),
 ('engineering', 1547),
 ('support', 1422),
 ('design', 1385),
 ('software', 1293),
 ('python', 1191),
 ('analytics', 1068),
 ('communication', 1057),
 ('sql', 1022),
 ('computer science', 1022)]

* Which skills are most associated with jobs requiring 2 or fewer years of minimum experience?

In [21]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from position_skills
join skills on skills.id = position_skills.skill_id
join positions on positions.id = position_skills.position_id
join job_titles on job_titles.id = positions.job_title_id
where minimum_experience <=2 and minimum_experience is not null
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

# name	amount
# 0	engineering	585
# 1	business	581
# 2	support	525
# 3	design	508
# 4	python	481
# 5	software	459
# 6	sql	415
# 7	analytics	395
# 8	communication	395
# 9	computer science	389

[('engineering', 585),
 ('business', 581),
 ('support', 525),
 ('design', 508),
 ('python', 481),
 ('software', 459),
 ('sql', 415),
 ('communication', 395),
 ('analytics', 395),
 ('computer science', 389)]

* Which skills are most associated with a minimum salary over 160k?

In [22]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions

join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where minimum_salary > 160000 
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

# name	amount
# 0	machine learning	18
# 1	engineering	17
# 2	software	16
# 3	design	14
# 4	analytics	11
# 5	support	11
# 6	computer science	11
# 7	communication	10
# 8	business	10
# 9	python	8

[('machine learning', 18),
 ('engineering', 17),
 ('software', 16),
 ('design', 14),
 ('computer science', 11),
 ('support', 11),
 ('analytics', 11),
 ('communication', 10),
 ('business', 10),
 ('python', 8)]

* Does demand for skills differ by region?  Or by position?

In [25]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join position_locations on position_locations.position_id = positions.id
join cities on cities.id = position_locations.city_id
where cities.name = 'New York'
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

[('business', 534),
 ('support', 452),
 ('engineering', 428),
 ('design', 384),
 ('analytics', 367),
 ('machine learning', 356),
 ('software', 352),
 ('communication', 329),
 ('python', 281),
 ('computer science', 279)]

In [26]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join position_locations on position_locations.position_id = positions.id
join cities on cities.id = position_locations.city_id
where cities.name = 'San Francisco'
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

[('business', 34),
 ('engineering', 31),
 ('machine learning', 26),
 ('software', 24),
 ('design', 24),
 ('python', 22),
 ('support', 21),
 ('communication', 19),
 ('sql', 18),
 ('data science', 18)]

In [27]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like '%data scientist%'
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

[('business', 488),
 ('support', 369),
 ('analytics', 346),
 ('engineering', 345),
 ('python', 336),
 ('machine learning', 331),
 ('design', 308),
 ('communication', 303),
 ('data science', 292),
 ('software', 286)]

In [28]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like '%data engineer%'
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

[('business', 790),
 ('engineering', 712),
 ('support', 683),
 ('design', 654),
 ('sql', 624),
 ('python', 555),
 ('software', 538),
 ('analytics', 489),
 ('computer science', 448),
 ('communication', 443)]

In [29]:
import psycopg2
db_url = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}/careers"
conn = psycopg2.connect(db_url)
cursor = conn.cursor()
cursor.execute("""
select skills.name, count(skills.id) from positions
join position_skills on position_skills.position_id = positions.id
join skills on skills.id = position_skills.skill_id
join job_titles on job_titles.id = positions.job_title_id
where job_titles.name like '%machine learn%'
group by skills.id
order by count(skills.id) desc
limit 10
""")
cursor.fetchall()

[('machine learning', 516),
 ('engineering', 490),
 ('software', 469),
 ('design', 423),
 ('business', 384),
 ('support', 370),
 ('computer science', 320),
 ('communication', 311),
 ('python', 300),
 ('analytics', 233)]
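
Since the top-10 lists overlap heavily, it can help to look at which skills appear in one job title's list and in none of the others. The sketch below copies the three lists from the outputs above (names only, in rank order). The function name `distinctive_skills` is hypothetical, and the comparison only covers the top 10, so a skill missing from a list may still be requested for that title.

```python
# Top-10 skill names copied from the query outputs above
top_skills = {
    "data scientist": ["business", "support", "analytics", "engineering", "python",
                       "machine learning", "design", "communication",
                       "data science", "software"],
    "data engineer": ["business", "engineering", "support", "design", "sql",
                      "python", "software", "analytics", "computer science",
                      "communication"],
    "machine learning engineer": ["machine learning", "engineering", "software",
                                  "design", "business", "support",
                                  "computer science", "communication", "python",
                                  "analytics"],
}


def distinctive_skills(top_skills, title):
    """Skills in `title`'s top list that appear in no other title's top list."""
    others = set()
    for other_title, skills in top_skills.items():
        if other_title != title:
            others.update(skills)
    return [skill for skill in top_skills[title] if skill not in others]


for title in top_skills:
    print(title, distinctive_skills(top_skills, title))
```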