# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.

---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. A brief writeup in an executive summary, written for a non-technical audience.
   - Writeups should be at least 500-1000 words, defining any technical terms, explaining your approach, as well as any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your Boss is interested in what overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

### Project Feedback + Evaluation

For all projects, students will be evaluated on a simple 3 point scale (0, 1, or 2). Instructors will use this rubric when scoring student performance on each of the core project **requirements:** 

Score | Expectations
----- | ------------
**0** | _Does not meet expectations. Try again._
**1** | _Meets expectations. Good job._
**2** | _Surpasses expectations. Brilliant!_

[For more information on how we grade our DSI projects, see our project grading walkthrough.](https://git.generalassemb.ly/dsi-projects/readme/blob/master/README.md)


In [29]:
# Import the necessary libraries for webscraping

import requests     # Pull raw HTML from site
from bs4 import BeautifulSoup     # Parsing library that pulls data from HTML/XML code
from lxml import html     # High-speed parsing library used with BeautifulSoup


# Import library to set up and work in DataFrame
import numpy as np     # Scientific computing
import pandas as pd     # Build out DataFrame
import scipy.stats as stats

# Import libraries for plotting and visualizations
import matplotlib.pyplot as plt
import seaborn as sns

import time
import regex as re
import pickle # Haven't figured how to use some of these yet

sns.set_style("whitegrid")     # Control the appearances of the plots

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [35]:
# Initialize search parameters and dataframe
# 'my',
country_set = ['sg']
search_string = ['data scientist', 'data analyst', 'business analyst']
columns = ["job_category","job_title", "company_name", "location", "summary", "salary"]

In [36]:
# Initialize container to store all job postings
jobs_list = []

# Iterate through search parameters and store relevant data in respective columns in dataframe
for country in country_set:
    for query in search_string:
        
        url = 'https://www.indeed.com.' + country + '/jobs?q=' + '+'.join([word for word in query.split()]) + '&start='
        print(url)
        time.sleep(1)
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'lxml')
        jobs_count = soup.find_all('div', {'id':'searchCount'})[0].get_text()

        # Get maximum number of jobs to iterate over all pages
#         max_jobs = int(re.sub('[^0-9a-zA-Z]+', '', jobs_count.split()[-1]))
        max_jobs = int(jobs_count.replace(' Page 1 of ', '').replace('jobs', '').replace(',', ''))

        for start_number in range(0,max_jobs,10):
            time.sleep(1)
            url_page = url + str(start_number)
            page = requests.get(url_page)
            soup = BeautifulSoup(page.text, 'lxml')
            
            # Get all advertised job descriptions
            regex = re.compile('.*row.*')
            jobs = soup.find_all(name='div', attrs={'class':regex})
            
            # Get job title from job description
            for job in jobs:
                job_title = job.find(name='a', attrs={'data-tn-element':'jobTitle'})
                company = job.find(name='span', attrs={'class':'company'})
                location = job.find(name='span', attrs={'class':'location'})
                summary = job.find(name='span', attrs={'class':'summary'})
                salary = job.find(name='span', attrs={'class':'no-wrap'})

                # Put default for missing variables
                if job_title != None:
                    job_title_result = job_title.get_text()
                    job_title_result = job_title_result.replace('\n','')
                    job_title_result = job_title_result.strip()
                else:
                    job_title_result = np.nan

                if company != None:
                    company_result = company.get_text()
                    company_result = company_result.replace('\n','')
                    company_result = company_result.strip()
                else:
                    company_result = np.nan

                if location != None:
                    location_result = location.get_text()
                    location_result = location_result.replace('\n','')
                    location_result = location_result.strip()
                else:
                    location_result = np.nan

                if summary != None:
                    summary_result = summary.get_text()
                    summary_result = summary_result.replace('\n','')
                    summary_result = summary_result.strip()
                else:
                    summary_result = np.nan

                if salary != None:

                    salary_result = salary.get_text()
                    salary_result = salary_result.replace('\n','')
                    salary_result = salary_result.strip()
                else:
                    salary_result = np.nan

                # Append to list
                job_category = '_'.join([word for word in query.split()])
                jobs_list.append([job_category,job_title_result, company_result, location_result, summary_result, salary_result])

# Convert jobs list to dataframe
df = pd.DataFrame(jobs_list, columns = columns)
# drop all duplicated job postings based on summary
df.drop_duplicates(subset=['summary'], inplace=True)
df.reset_index(drop=True, inplace=True)
df.info()

https://www.indeed.com.sg/jobs?q=data+scientist&start=


SSLError: HTTPSConnectionPool(host='www.indeed.com.sg', port=443): Max retries exceeded with url: /jobs?q=data+scientist&start=250 (Caused by SSLError(SSLError("bad handshake: SysCallError(10060, 'WSAETIMEDOUT')")))

In [225]:
df.to_csv("DS_Search")

In [226]:
# For some reason when I tried do to multiple searches, it always returned me the error on top. Couldn't solve it in time.
# So i just opened 3x terminals and doing each category of search, saving them into CSV, then concat the DF together
df0 = pd.read_csv("DS_Search")
df1 = pd.read_csv("BA_Search")
df2 = pd.read_csv("DA_Search")
df = pd.concat([df0, df1, df2])
df

Unnamed: 0.1,Unnamed: 0,job_category,job_title,company_name,location,summary,salary
0,0,data_scientist,data scientist,indeed,,significant prior success as a data scientist ...,
1,1,data_scientist,data scientist,capita singapore,,data scientist data scientist needed to impr...,
2,2,data_scientist,growth strategy operations strategic planning...,wework,,ensuring data quality minimum years of experi...,
3,3,data_scientist,data scientist,gateway search pte ltd,,assess the effectiveness and accuracy of new d...,
4,4,data_scientist,data scientists machine learning,biofourmis singapore,singapore,knowledge in big data technologies including c...,
5,5,data_scientist,data scientist aml group customer analytics ...,ocbc bank,singapore,data scientist aml group customer analytics ...,
6,6,data_scientist,data engineer data science,twitter,singapore,data engineers work alongside data scientists ...,
7,7,data_scientist,data scientist,zyllem,singapore,interpreting data analyzing results using stat...,
8,8,data_scientist,data scientist,lenddoefl,singapore,proven experience in data manipulation as a da...,
9,9,data_scientist,data scientist,cxa group pte. limited,singapore,leverage data visualization techniques and too...,


In [227]:
df.drop_duplicates(subset=['summary'], inplace=True)

In [228]:
df.reset_index(drop=True, inplace=True)
df.drop(columns=['Unnamed: 0'], inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3341 entries, 0 to 3340
Data columns (total 6 columns):
job_category    3341 non-null object
job_title       3341 non-null object
company_name    3306 non-null object
location        3303 non-null object
summary         3341 non-null object
salary          122 non-null object
dtypes: object(6)
memory usage: 156.7+ KB


In [229]:
df.to_csv("df_cleaned_combined")

In [230]:
df

Unnamed: 0,job_category,job_title,company_name,location,summary,salary
0,data_scientist,data scientist,indeed,,significant prior success as a data scientist ...,
1,data_scientist,data scientist,capita singapore,,data scientist data scientist needed to impr...,
2,data_scientist,growth strategy operations strategic planning...,wework,,ensuring data quality minimum years of experi...,
3,data_scientist,data scientist,gateway search pte ltd,,assess the effectiveness and accuracy of new d...,
4,data_scientist,data scientists machine learning,biofourmis singapore,singapore,knowledge in big data technologies including c...,
5,data_scientist,data scientist aml group customer analytics ...,ocbc bank,singapore,data scientist aml group customer analytics ...,
6,data_scientist,data engineer data science,twitter,singapore,data engineers work alongside data scientists ...,
7,data_scientist,data scientist,zyllem,singapore,interpreting data analyzing results using stat...,
8,data_scientist,data scientist,lenddoefl,singapore,proven experience in data manipulation as a da...,
9,data_scientist,data scientist,cxa group pte. limited,singapore,leverage data visualization techniques and too...,


In [231]:
# Drop all null values except for salary
df.dropna(subset=['company_name','summary'], inplace=True)

In [232]:
# convert all to small letters if string
df = df.applymap(lambda x: x.lower().strip() if isinstance(x, str) else x)
# remove all non-alphabets
df.job_title = df.job_title.map(lambda x: re.sub(r'[^A-Za-z\s]','',x).strip())
df.summary = df.summary.map(lambda x: re.sub(r'[^A-Za-z\s]','',x).strip())
# remove business licence numbers
df.company_name = df.company_name.map(lambda x: x[:x.index(', ea licence')] if x.find(', ea licence') != -1 else x)

In [233]:
# filter jobs by data related terms
data_terms = ['data','analytics','intelligence','analysis','statistics','machine learning']
df_summary_null = df.summary.map(lambda x: x if any(x.find(t)>=0 for t in data_terms) else np.nan).isnull()
df_job_title_null = df.job_title.map(lambda x: x if any(x.find(t)>=0 for t in data_terms) else np.nan).isnull()
df_mod = df[(~df_summary_null) | (~df_job_title_null)]

In [234]:
# Well, looks like people just don't want to put salary. For the sake of time (and my sanity), I'll just use it. 
# If I had a chance to, I'd probably see if I can find more data on this.

df_mod[~df_mod.salary.isnull()].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88 entries, 10 to 3309
Data columns (total 6 columns):
job_category    88 non-null object
job_title       88 non-null object
company_name    88 non-null object
location        88 non-null object
summary         88 non-null object
salary          88 non-null object
dtypes: object(6)
memory usage: 4.8+ KB


In [235]:
# Extract those without salary data
df_unsalaried = df_mod[df_mod.salary.isnull()]

In [236]:
# Extract those with salary data
df_salaried = df_mod[~df_mod.salary.isnull()]
df_salaried.reset_index(drop=True, inplace=True)

In [237]:
# Convert all salary into yearly format
salary_range = df_salaried.salary.map(lambda x: re.sub('[^0-9\s]', '', ' '.join(re.findall(r'\d+(?:[\d,.]*\d)', x))))
salary_period = df_salaried.salary.map(lambda x: x[x.find('month'):] if x.find('month') >= 0 else x[x.find('hour'):] if x.find('hour') >= 0 else 'year')

In [238]:
temp_sal = []

for i in range(0, len(salary_period)):
    if salary_period.loc[i] == 'month':
        sal = int(salary_range[i].split()[0])
        temp_sal.append(sal*12)
    elif salary_period.loc[i] == 'hour':
        sal = int(salary_range[i].split()[0])
        temp_sal.append(sal*2080)
    else:
        temp_sal.append(int(salary_range[i].split()[0]))

salary_annual = pd.DataFrame(temp_sal, columns=['salary_annual'])

In [239]:
salary_annual = pd.DataFrame(temp_sal, columns=['salary_annual'])
# Classifiy Salary into high, low tier
salary_high_tier = salary_annual.applymap(lambda x: 1 if x > int(salary_annual.median()) else 0)

In [240]:
df_salaried.drop(labels=['salary'], axis=1, inplace=True)
df_salaried_mod = pd.concat([df_salaried, salary_high_tier], axis=1)
df_salaried_mod.rename(index=str, columns={'salary_annual': 'salary_high_tier'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [241]:
# We will use those jobs with description to predict those without
df_salaried_mod.info()

<class 'pandas.core.frame.DataFrame'>
Index: 88 entries, 0 to 87
Data columns (total 6 columns):
job_category        88 non-null object
job_title           88 non-null object
company_name        88 non-null object
location            88 non-null object
summary             88 non-null object
salary_high_tier    88 non-null int64
dtypes: int64(1), object(5)
memory usage: 4.8+ KB


## Question 1:

1. Get TFIDF of job title, company, location, summary

2. Use data with salary to predict those without

3. Find out features with highest significant in distinguishing high vs low salary jobs

4. Then collect TFIDF again for whole dataset and do second round of modelling

5. Check to see whether top features are the same with round 1

6. Features that appear highly significant in both rounds are the factors that are best at distinguishing high vs low salary

7. For my study, i will generate features from my dataset using TFIDF

8. Will use log reg and decision tree to predict, unless results really bad

In [242]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_predict, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler

In [243]:
# Get TFIDF for job summary
job_summary_tvec = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_summary_tvec.fit(df_salaried_mod.summary)
job_summary_tvec_df = pd.DataFrame(job_summary_tvec.transform(df_salaried_mod.summary).todense(),
                       columns=['summary_[' + f + ']' for f in job_summary_tvec.get_feature_names()])

In [244]:
# Get TFIDF for job title
job_title_tvec = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_title_tvec.fit(df_salaried_mod.job_title)
job_title_tvec_df = pd.DataFrame(job_title_tvec.transform(df_salaried_mod.summary).todense(),
                       columns=['title_[' + f + ']' for f in job_title_tvec.get_feature_names()])

In [245]:
# Get TFIDF for company name
job_company_tvec = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_company_tvec.fit(df_salaried_mod.company_name)
job_company_tvec_df = pd.DataFrame(job_company_tvec.transform(df_salaried_mod.company_name).todense(),
                       columns=['company_[' + f + ']' for f in job_company_tvec.get_feature_names()])

In [246]:
y_with_sal = df_salaried_mod.salary_high_tier
X_with_sal = pd.concat([job_summary_tvec_df,job_title_tvec_df,job_company_tvec_df], axis=1)

In [247]:
# Get training and testing set
X_train, X_test, y_train, y_test = train_test_split(X_with_sal, y_with_sal, test_size=0.3, random_state=42)

In [248]:
# Standardize predictors
X_train_ss = StandardScaler().fit_transform(X_train)
X_test_ss = StandardScaler().fit_transform(X_test)

In [249]:
X_train_ss = pd.DataFrame(X_train_ss, columns=X_train.columns)
X_test_ss = pd.DataFrame(X_test_ss, columns=X_train.columns)

In [250]:
# Fit with plain logistic regression
lr = LogisticRegression()
lr.fit(X_train_ss, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [251]:
pred = lr.predict(X_test_ss)
score = metrics.f1_score(y_test, pred)
print(classification_report(y_test, pred))
print('f1-score:', score)

             precision    recall  f1-score   support

          0       0.91      0.62      0.74        16
          1       0.62      0.91      0.74        11

avg / total       0.79      0.74      0.74        27

f1-score: 0.7407407407407406


In [252]:
# Gridsearch for Ridge and Lasso Logistic Regression, optimize C

parameters = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

print ("GRID SEARCH:")
lr_grid_search = GridSearchCV(LogisticRegression(), parameters, cv=10, verbose=0)
lr_grid_search.fit(X_train_ss, y_train)
print ("Best parameters set:")
lr_best_parameters = lr_grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print ("\t%s: %r" % (param_name, lr_best_parameters[param_name]))

GRID SEARCH:
Best parameters set:
	C: 0.08697490026177834
	penalty: 'l2'
	solver: 'liblinear'


In [253]:
print ("Logistic Regression with best parameter:")
clf = LogisticRegression(**lr_best_parameters)
clf.fit(X_train_ss, y_train)
lr_gs_pred = clf.predict(X_test_ss)
print(metrics.classification_report(y_test, lr_gs_pred, labels=[1,0], target_names=['high salary','low salary']))

Logistic Regression with best parameter:
             precision    recall  f1-score   support

high salary       0.59      0.91      0.71        11
 low salary       0.90      0.56      0.69        16

avg / total       0.77      0.70      0.70        27



In [254]:
from sklearn.tree import DecisionTreeClassifier

In [255]:
# gridsearch params
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,'log2','sqrt',2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), dtc_params, cv=5, verbose=1)

In [256]:
# use the gridsearch C model to fit the data
dtc_gs.fit(X_train_ss, y_train)

Fitting 5 folds for each of 385 candidates, totalling 1925 fits


[Parallel(n_jobs=1)]: Done 1925 out of 1925 | elapsed:    5.4s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [None, 1, 2, 3, 4], 'max_features': [None, 'log2', 'sqrt', 2, 3, 4, 5], 'min_samples_split': [2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [257]:
# Best Estimator
dtc_best = dtc_gs.best_estimator_
print(dtc_gs.best_params_)
print(dtc_gs.best_score_)

{'max_depth': None, 'max_features': 4, 'min_samples_split': 3}
0.8852459016393442


In [258]:
fi = pd.DataFrame({
        'feature':X_train_ss.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
fi.head(10)

Unnamed: 0,feature,importance
11,summary_[insights],0.120376
32,title_[data],0.117584
24,summary_[working],0.094951
22,summary_[using],0.078879
2,summary_[analysts],0.067611
19,summary_[team],0.065733
15,summary_[requirements],0.061147
69,company_[systems],0.057787
33,title_[data analyst],0.051404
65,company_[pte],0.043822


In [259]:
coef = lr_grid_search.best_estimator_.coef_
lr_coef = pd.DataFrame({'coef':coef.ravel(),
                    'mag':np.abs(coef.ravel()),
                    'pred':X_test_ss.columns})

lr_coef.sort_values('mag', ascending=False, inplace=True)
lr_coef.head(10)

Unnamed: 0,coef,mag,pred
22,-0.398748,0.398748,summary_[using]
15,-0.334503,0.334503,summary_[requirements]
2,-0.292206,0.292206,summary_[analysts]
19,-0.269633,0.269633,summary_[team]
0,0.221801,0.221801,summary_[analysis]
16,-0.214737,0.214737,summary_[role]
11,-0.201654,0.201654,summary_[insights]
17,-0.196832,0.196832,summary_[skills]
74,-0.16508,0.16508,company_[tvconal]
62,-0.160385,0.160385,company_[ninja]


In [260]:
pred = dtc_best.predict(X_test_ss)
print(classification_report(y_test, pred, labels=[1,0], target_names=['high salary','low salary']))

             precision    recall  f1-score   support

high salary       0.67      0.73      0.70        11
 low salary       0.80      0.75      0.77        16

avg / total       0.75      0.74      0.74        27



In [261]:
# Decision Tree has a better score than log reg
# Will take Decision Tree as model to evaluate factors that impact salary

In [262]:
# Use decision tree to predict the rest of the dataset
df_unsalaried.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2016 entries, 0 to 3340
Data columns (total 6 columns):
job_category    2016 non-null object
job_title       2016 non-null object
company_name    2016 non-null object
location        1990 non-null object
summary         2016 non-null object
salary          0 non-null object
dtypes: object(6)
memory usage: 110.2+ KB


In [263]:
# Get TFIDF for job title
job_title_tvec_unsal = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_title_tvec_unsal.fit(df_unsalaried.job_title)
job_title_tvec_unsal_df = pd.DataFrame(job_title_tvec_unsal.transform(df_unsalaried.job_title).todense(),
                       columns=['title_[' + f + ']' for f in job_title_tvec_unsal.get_feature_names()])

In [264]:
# Get TFIDF for job summary
job_summary_tvec_unsal = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_summary_tvec_unsal.fit(df_unsalaried.summary)
job_summary_tvec_unsal_df = pd.DataFrame(job_summary_tvec_unsal.transform(df_unsalaried.summary).todense(),
                       columns=['summary_[' + f + ']' for f in job_summary_tvec_unsal.get_feature_names()])

In [265]:
# Get TFIDF for company name
job_company_tvec_unsal = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_company_tvec_unsal.fit(df_unsalaried.company_name)
job_company_tvec_unsal_df = pd.DataFrame(job_company_tvec_unsal.transform(df_unsalaried.company_name).todense(),
                       columns=['company_[' + f + ']' for f in job_company_tvec_unsal.get_feature_names()])

In [266]:
X_without_sal = pd.concat([job_summary_tvec_unsal_df,job_title_tvec_unsal_df,job_company_tvec_unsal_df], axis=1)

In [267]:
# Standardize predictors
X_without_sal_ss = StandardScaler().fit_transform(X_without_sal)

In [268]:
X_without_sal_ss = pd.DataFrame(X_without_sal_ss, columns=X_without_sal.columns)

In [269]:
pred = dtc_best.predict(X_without_sal_ss)
df_unsalaried.salary = pred
df_unsalaried.rename(index=str, columns={"salary": "salary_high_tier"}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)


In [270]:
# Merge predicted with original
final_df = pd.concat([df_unsalaried, df_salaried_mod], axis=0, ignore_index=True)

In [271]:
# Get TFIDF for job title
job_title_tvec_final = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_title_tvec_final.fit(final_df.job_title)
job_title_tvec_final_df = pd.DataFrame(job_title_tvec_final.transform(final_df.job_title).todense(),
                       columns=['title_[' + f + ']' for f in job_title_tvec_final.get_feature_names()])

In [272]:
# Get TFIDF for job summary
job_summary_tvec_final = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_summary_tvec_final.fit(final_df.summary)
job_summary_tvec_final_df = pd.DataFrame(job_summary_tvec_final.transform(final_df.summary).todense(),
                       columns=['summary_[' + f + ']' for f in job_summary_tvec_final.get_feature_names()])

In [273]:
# Get TFIDF for company name
job_company_tvec_final = TfidfVectorizer(ngram_range=(1,3), stop_words='english', min_df=2, max_df=0.5, max_features=25)
job_company_tvec_final.fit(final_df.company_name)
job_company_tvec_final_df = pd.DataFrame(job_company_tvec_final.transform(final_df.company_name).todense(),
                       columns=['company_[' + f + ']' for f in job_company_tvec_final.get_feature_names()])

In [274]:
X = pd.concat([job_title_tvec_final_df, job_summary_tvec_final_df, job_company_tvec_final_df], axis=1)
y = final_df.salary_high_tier

In [275]:
# Standardize predictors
Xs = StandardScaler().fit_transform(X)
Xs = pd.DataFrame(Xs, columns=X.columns)

In [276]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.33, random_state=42)

In [277]:
# Gridsearch for Ridge and Lasso Logistic Regression, optimize C

parameters = {
    'penalty':['l1','l2'],
    'solver':['liblinear'],
    'C':np.logspace(-5,0,100)
}

print ("GRID SEARCH:")
lr_grid_search = GridSearchCV(LogisticRegression(), parameters, cv=10, verbose=0)
lr_grid_search.fit(X_train, y_train)
print ("Best parameters set:")
lr_best_parameters = lr_grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print ("\t%s: %r" % (param_name, lr_best_parameters[param_name]))

GRID SEARCH:
Best parameters set:
	C: 0.49770235643321137
	penalty: 'l1'
	solver: 'liblinear'


In [278]:
print("Logistic Regression with best param:")
clf = LogisticRegression(**lr_best_parameters)
clf.fit(X_train, y_train)
lr_gs_pred = clf.predict(X_test)
print(metrics.classification_report(y_test, lr_gs_pred, labels=[1,0], target_names=['high salary','low salary']))

Logistic Regression with best param:
             precision    recall  f1-score   support

high salary       0.94      0.95      0.95       446
 low salary       0.91      0.90      0.90       249

avg / total       0.93      0.93      0.93       695



In [279]:
# Gridsearch params for decision tree classifier
dtc_params = {
    'max_depth':[None,1,2,3,4],
    'max_features':[None,'log2','sqrt',2,3,4,5],
    'min_samples_split':[2,3,4,5,10,15,20,25,30,40,50]
}

# set the gridsearch
dtc_gs = GridSearchCV(DecisionTreeClassifier(), dtc_params, cv=5, verbose=1)

In [280]:
# use the gridsearch C model to fit the data
dtc_gs.fit(X_train, y_train)

Fitting 5 folds for each of 385 candidates, totalling 1925 fits


[Parallel(n_jobs=1)]: Done 1925 out of 1925 | elapsed:    8.7s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [None, 1, 2, 3, 4], 'max_features': [None, 'log2', 'sqrt', 2, 3, 4, 5], 'min_samples_split': [2, 3, 4, 5, 10, 15, 20, 25, 30, 40, 50]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [281]:
# Best Estimator
dtc_best = dtc_gs.best_estimator_
print(dtc_gs.best_params_)
print(dtc_gs.best_score_)

{'max_depth': None, 'max_features': None, 'min_samples_split': 2}
0.9772888573456352


In [282]:
pred = dtc_best.predict(X_test)
print(classification_report(y_test, pred, labels=[1,0], target_names=['high salary','low salary']))

             precision    recall  f1-score   support

high salary       0.98      0.98      0.98       446
 low salary       0.96      0.96      0.96       249

avg / total       0.97      0.97      0.97       695



In [283]:
fi = pd.DataFrame({
        'feature':X_train.columns,
        'importance':dtc_best.feature_importances_
    })

fi.sort_values('importance', ascending=False, inplace=True)
fi.head(10)

Unnamed: 0,feature,importance
7,title_[data],0.260233
40,summary_[requirements],0.13036
49,summary_[years],0.123791
36,summary_[insights],0.12015
42,summary_[science],0.108398
69,company_[singapore pte],0.107946
48,summary_[working],0.03883
11,title_[engineer],0.013765
45,summary_[systems],0.013444
65,company_[pte],0.012108


In [284]:
lr_coef = pd.DataFrame({'coef':clf.coef_.ravel(),
                    'mag':np.abs(clf.coef_.ravel()),
                    'pred':X_test.columns})

lr_coef.sort_values('mag', ascending=False, inplace=True)
lr_coef.head(10)

Unnamed: 0,coef,mag,pred
40,-2.824296,2.824296,summary_[requirements]
36,-2.176361,2.176361,summary_[insights]
49,-1.993205,1.993205,summary_[years]
7,-1.505648,1.505648,title_[data]
69,-1.425915,1.425915,company_[singapore pte]
42,-1.195753,1.195753,summary_[science]
48,0.844117,0.844117,summary_[working]
8,-0.835526,0.835526,title_[data analyst]
20,0.594279,0.594279,title_[scientist]
71,0.486018,0.486018,company_[solutions pte]


In [285]:
# Looks like the "Top Factors" that affect high salary vs low salary are mostly the same
# Also, Decision Tree is the better model.

In [286]:
# df for question 2
final_df.to_csv('final_df')