# Project 4

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the **industry factors** that are most important in **predicting the salary amounts** for these data.
   2. Determine the **factors that distinguish job categories and titles from each other**. For example, can required skills accurately predict job title?
   
   
To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, **data analyst**, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

### Directions

In this project you will be leveraging a variety of skills. The first will be to use the **web-scraping** and/or **API** techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
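The classification framing above can be sketched in a few lines. This is only a minimal illustration, assuming salaries have already been cleaned into plain numbers; the function name `label_by_median` and the example figures are hypothetical and not part of the assignment.

```python
from statistics import median

def label_by_median(salaries):
    """Return (threshold, labels); label 1 means the salary is above the median."""
    threshold = median(salaries)
    labels = [1 if salary > threshold else 0 for salary in salaries]
    return threshold, labels

# Hypothetical monthly salaries, for illustration only
example_salaries = [4000, 5500, 6000, 7500, 9000, 12000]
threshold, labels = label_by_median(example_salaries)
```

Splitting at the median keeps the two classes roughly balanced, which makes plain accuracy easier to interpret; other thresholds (quartiles, for instance) are equally valid if you justify them.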

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.



### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.
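One possible form of feature engineering for this question is a set of binary "skill mentioned" indicators extracted from each posting's text. The sketch below is an assumption-laden illustration: the `SKILLS` vocabulary and the `skill_features` helper are hypothetical, and a real analysis would more likely derive its vocabulary from the scraped data (for example with a count or TF-IDF vectorizer).

```python
import re

# Hypothetical skill vocabulary, chosen only for illustration
SKILLS = ["python", "sql", "tableau", "spark", "excel"]

def skill_features(description, skills=SKILLS):
    """Map a job description to 0/1 indicators, one per skill keyword."""
    # Lowercase the text and split it into simple word tokens
    tokens = set(re.findall(r"[a-z+#]+", description.lower()))
    return {skill: int(skill in tokens) for skill in skills}

features = skill_features("Experience with Python, SQL and Spark required.")
```

Indicators like these can be fed to any classifier, and their coefficients or feature importances give a direct reading of which skills separate one job title from another.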


### BONUS PROBLEM

Your boss would rather incorrectly tell a client that they would get a lower-salary job than incorrectly tell them that they would get a high-salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
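One way to read this requirement is that false "high salary" predictions are the costlier error, so the decision threshold for predicting "high" can be raised above 0.5. The sketch below assumes you already have predicted probabilities from a classifier; the helper names and the example values are hypothetical, and `roc_points` assumes both classes appear in `y_true`.

```python
def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, tn, fn) when predicting 'high' (1) for y_prob >= threshold."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def roc_points(y_true, y_prob):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_prob), reverse=True):
        tp, fp, tn, fn = confusion_counts(y_true, y_prob, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical true labels (1 = high salary) and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.55, 0.6, 0.7, 0.9]

default_counts = confusion_counts(y_true, y_prob, 0.5)
strict_counts = confusion_counts(y_true, y_prob, 0.65)
points = roc_points(y_true, y_prob)
```

On this toy data, raising the threshold from 0.5 to 0.65 removes the one false "high" prediction at the cost of one missed high-salary job, which is the tradeoff to explain. The `points` list can be passed to `plt.plot` to draw the ROC curve.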

---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - The writeup should be 500-1000 words and should define any technical terms, explain your approach, and describe any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

In [12]:
from bs4 import BeautifulSoup
from time import sleep
import random

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
from selenium import webdriver

chromedriver = "../chromedriver/chromedriver"

os.environ["webdriver.chrome.driver"] = chromedriver






#Create list of links for appending to
links = []

#Create 2 drivers (reusing the chromedriver path defined above) and a list for iteration
driver1 = webdriver.Chrome(executable_path=chromedriver)
driver2 = webdriver.Chrome(executable_path=chromedriver)

drivers = [driver1,driver2]

#Iterate through the pages in steps of 2 (one page per driver)
search_url = "https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page={}"
for num in range(0,206,2):
    driver1.get(search_url.format(num))
    driver2.get(search_url.format(num+1))
    #Give some time for pages to load
    sleep(random.randint(4,8))
    
    #Iterate through the drivers to get the information from each page
    for driver in drivers:
        # Grab the page source.
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')

        #Save all the entries on the page
        page = soup.find_all("a",{"class":"bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3"})
 
        #Iterate through the links from each page and append to list
        for entry in page:
            links.append('https://www.mycareersfuture.sg'+entry['href'])
    
    #Periodically print a statement to show status and save links
    if num%5==0:
        print('Done with page {}'.format(num+1))
        #Create a table of the links and save the table as a csv file in case anything fails in between
        link_table = pd.DataFrame({'num':range(len(links)),'links':links})
        link_table.to_csv('careerfutures_links_{}.csv'.format(num+1),index=False)
        
    #Wait after grabbing the info
    sleep(random.randint(2,6))
    
#Create a table of the links and save all the links retrieved
link_table = pd.DataFrame({'num':range(len(links)),'links':links})
link_table.to_csv('careerfutures_links_done.csv',index=False)


WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home


In [2]:
# Length of the page source, i.e. the number of characters
len(html)

6696

In [None]:
# html is already a str, so from_encoding would be ignored (with a warning) and is omitted
soup = BeautifulSoup(html, 'html.parser')