# Web Scraping for Indeed.com & Predicting Salaries

This project is a test of three major skills: collecting data by scraping a website, applying natural language processing, and building a binary classifier.

I collected salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, I attempted to predict its salary. For job posting sites, this would be extraordinarily useful: most listings DO NOT come with salary information, so being able to extrapolate or predict the expected salary from other listings is valuable.

Normally, regression could be used for a task like this; however, since there is a fair amount of natural variance in job salaries, I approached this as a classification problem and used a random forest classifier.

Therefore, the first part of the project was focused on scraping Indeed.com. The latter part of the project was focused on using natural language processing and building models using job postings with salary information to predict salaries.

## Scraping job listings from Indeed.com

In [1]:
import requests
import bs4
from bs4 import BeautifulSoup

### Building functions to extract each item: location, company, job, summary, and salary.

These functions must be able to handle cases where the data/field may not be available.

In [60]:
def get_loc(result):
    try:
        return result.find('span', {'class':'location'}).text
    except AttributeError:  # find() returned None, so the field is missing
        return 'NA'

In [61]:
def get_comp(result):
    try:
        return result.find('span', {'class':'company'}).text
    except AttributeError:  # find() returned None, so the field is missing
        return 'NA'

In [62]:
def get_job(result):
    try:
        return result.find('a', {'data-tn-element':'jobTitle'}).text
    except AttributeError:  # find() returned None, so the field is missing
        return 'NA'

In [63]:
def get_sal(result):
    try:
        return result.find('td', {'class':'snip'}).find('nobr').text
    except AttributeError:  # one of the find() calls returned None, so there is no salary
        return 'NA'

In [68]:
def get_desc(result):
    try:
        return result.find('span', {'class':'summary'}).text
    except AttributeError:  # find() returned None, so the field is missing
        return 'NA'
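
The five extractors above all follow the same pattern: look up an element and fall back to 'NA' when it is missing. As a minimal sketch (not part of the original notebook), the pattern can be written once. The name `get_text` and the `steps` argument are hypothetical; the only thing assumed about `result` is that it has a BeautifulSoup-style `find(tag, attrs)` method that returns None when nothing matches.

```python
def get_text(result, steps, default='NA'):
    """Follow a chain of find() lookups and return the final element's text.

    `steps` is a list of (tag, attrs) pairs, applied one after another.
    If any lookup returns None, `default` is returned instead.
    """
    node = result
    for tag, attrs in steps:
        node = node.find(tag, attrs)
        if node is None:          # the field is not present in this posting
            return default
    return node.text


# Hypothetical usage, mirroring the selectors used in the functions above:
# company = get_text(result, [('span', {'class': 'company'})])
# salary = get_text(result, [('td', {'class': 'snip'}), ('nobr', {})])
```

Checking for None explicitly avoids catching unrelated exceptions, at the cost of a slightly longer helper.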

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL of a search results page:
- "https://www.indeed.com/jobs?q=data+scientist&l=Washington+DC&limit=50&radius=25&start=0&pp=ADIAAAFbp-95iAAAAAEMMobjAQEBD3MzvjaRN_DEggu9hUHO3jOZTkJ2Z7SvcZJ1pEgnRjAhqUC21q96H2LZRAEACYLb_gg9TZj-Uiq9LmLnHRNRQqKGAQPQktPTwy4n4Swd39sFFyyDrS9wQYcfRpSo64YDtw"

There are three query parameters here that we can alter to collect more results: l=Washington+DC, limit=50, and start=0. The first controls the location of the results (so we can try a different city). The second controls how many results are displayed per page. The third controls the offset at which the results start.
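
As a rough sketch of how those three parameters could be varied programmatically, the helper below assembles a search URL with the standard library's `urlencode`. The function name `build_search_url` and its defaults are hypothetical, and only the parameters discussed above (plus the query itself) are included; the other parameters in the example URL are left out.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2, as used in this notebook

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(location, start=0, limit=50, query="data scientist"):
    """Build a search URL for one page of results in one location."""
    params = [
        ("q", query),        # search terms
        ("l", location),     # city to search in
        ("limit", limit),    # number of results per page
        ("start", start),    # offset of the first result on the page
    ]
    # urlencode escapes spaces and commas, so plain city names can be passed in
    return BASE_URL + "?" + urlencode(params)


# Offsets for the first three pages of a 50-per-page search
urls = [build_search_url("Washington, DC", start=s) for s in range(0, 150, 50)]
```

Letting `urlencode` do the escaping means the city list would not need hand-written `+` and `%2C` sequences.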

In [64]:
from time import sleep

In [2]:
import pandas as pd
import numpy as np

In [413]:
indeed_cities = ['New+York', 'Chicago', 'San+Francisco', 'San+Jose', 'San+Diego', 'Los+Angeles', 'Washington%2C+DC',
          'Boston', 'Pittsburgh', 'Philadelphia', 'Atlanta', 'Cincinnati', 'St.+Louis', 'Tampa', 'Oakland',
          'Austin', 'Houston', 'Dallas', 'Seattle', 'Portland', 'Denver', 'Phoenix', 'Minneapolis', 'Miami',
          'Charlotte', 'Jacksonville', 'Indianapolis', 'Nashville', 'Kansas+City', 'Columbus']
len(indeed_cities)

30

In [415]:
# Loop over each city in indeed_cities and over each page of search results for that city:
# 100 results per page, 10 pages per city --> up to 1000 job postings per city.
# The results list is reset each time this cell runs (this does not affect my dataframe).
# Each job posting is appended to the results list as a parsed HTML element.
# Results are appended (rather than built with a list comprehension) so earlier pages aren't overwritten.
# Sleep 1 second between URL requests.

max_results_per_city = 1000

results = []

for city in indeed_cities:
    for start in range(0, max_results_per_city, 100):
        # adjacent string literals are joined, so no stray whitespace ends up inside the URL
        url = ("https://www.indeed.com/jobs?as_and=data+scientist&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st="
               "&salary=&radius=25&l=" + city + "&fromage=any&limit=100&start=" + str(start) + "&sort=&psf=advsrch")
        html = requests.get(url)
        soup = BeautifulSoup(html.text, 'html.parser')
        for result in soup.find_all('div', {'class':' row result'}):
            results.append(result)
        sleep(1)

#### Use the functions to parse out the 5 fields - location, title, company, summary, and salary. Create a dataframe from the results with those 5 columns.

In [70]:
jobs0 = pd.DataFrame(columns=['location', 'title', 'company', 'salary', 'summary'])

In [71]:
for entry in results:
    location = get_loc(entry)
    title = get_job(entry)
    company = get_comp(entry)
    salary = get_sal(entry)
    desc = get_desc(entry)
    jobs0.loc[len(jobs0)] = [location, title, company, salary, desc]
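
Growing a dataframe one row at a time with `.loc[len(df)]` works, but it gets slow as the frame grows. A common alternative, sketched here under the assumption that the extractor functions above are available, is to collect plain dicts first and build the dataframe once at the end. `build_rows` and the `extractors` mapping are hypothetical names.

```python
def build_rows(results, extractors):
    """Return one dict per scraped result, keyed by column name.

    `extractors` maps a column name to a function that takes one result
    and returns the value for that column.
    """
    rows = []
    for entry in results:
        rows.append({column: extract(entry) for column, extract in extractors.items()})
    return rows


# Hypothetical usage with the notebook's own helpers and pandas imported as pd:
# extractors = {'location': get_loc, 'title': get_job, 'company': get_comp,
#               'salary': get_sal, 'summary': get_desc}
# jobs = pd.DataFrame(build_rows(results, extractors),
#                     columns=['location', 'title', 'company', 'salary', 'summary'])
```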

In [133]:
jobs0.head(20)

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 0 | New York, NY | Data Scientist | NBA | | The primary focus for this fast paced and coll... |
| 1 | New York, NY 10018 (Clinton area) | Data Scientist | JW Player | | Provide expertise on machine learning concepts... |
| 2 | New York, NY | Data Scientist | Movable Ink | | As a Data Scientist at Movable Ink, you’ll be ... |
| 3 | New York, NY 10016 (Gramercy area) | Data Scientist | Simulmedia | | You will find yourself working with other data... |
| 4 | New York, NY | Data Scientists | FXcompared | | We are looking for full-time Data Scientists t... |
| 5 | New York, NY | Data Scientist | HarperCollins Publishers Inc. | | Uses mid to large-scale machine learning, data... |
| 6 | New York, NY 10261 (Murray Hill area) | Data Scientist | MassMutual Financial Group | | MassMutual’s Advanced Analytics group is seeki... |
| 7 | New York, NY | Data Scientist | PulsePoint | | MS/PhD in Astronomy, Physics, Applied Mathemat... |
| 8 | New York, NY | Data Science Analyst | AIG | | § Stay current on the latest machine learning ... |
| 9 | New York, NY | Data Scientist/ Modeler | Nucleus Marketing | | You have a Master’s Degree in operations resea... |


In [136]:
jobs0.shape

(11597, 5)

In [128]:
jobs0.company = jobs0.company.str.encode('utf-8').astype(str).str.replace('\n', '')
jobs0.summary = jobs0.summary.str.encode('utf-8').astype(str).str.replace('\n', '')

In [132]:
for col in ['location', 'title', 'salary']:
    jobs0[col] = jobs0[col].str.encode('utf-8').astype(str)

In [None]:
# more results

In [416]:
# made a new df because of the encoding step
jobs1 = pd.DataFrame(columns=['location', 'title', 'company', 'salary', 'summary'])

In [417]:
for entry in results:
    location = get_loc(entry)
    title = get_job(entry)
    company = get_comp(entry)
    salary = get_sal(entry)
    desc = get_desc(entry)
    jobs1.loc[len(jobs1)] = [location, title, company, salary, desc]

In [419]:
jobs1.shape

(11591, 5)

In [420]:
jobs1.company = jobs1.company.str.encode('utf-8').astype(str).str.replace('\n', '')
jobs1.summary = jobs1.summary.str.encode('utf-8').astype(str).str.replace('\n', '')

In [421]:
for col in ['location', 'title', 'salary']:
    jobs1[col] = jobs1[col].str.encode('utf-8').astype(str)

In [422]:
jobs1.head()

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 0 | New York, NY 10154 (Midtown area) | Data Scientist | KPMG | | Analyze and model structured data and implemen... |
| 1 | New York, NY 10018 (Clinton area) | Data Scientist | JW Player | | Provide expertise on machine learning concepts... |
| 2 | New York, NY | Data Scientist | Movable Ink | | As a Data Scientist at Movable Ink, you’ll be ... |
| 3 | New York, NY | Data Scientist | NBA | | The primary focus for this fast paced and coll... |
| 4 | New York, NY | Data Scientist | HarperCollins Publishers Inc. | | Uses mid to large-scale machine learning, data... |


In [None]:
# merge all my information into final df

In [423]:
# combine both scraping runs (jobs0 and jobs1) and drop postings that appear in both
results = pd.concat([jobs0, jobs1]).drop_duplicates()

In [425]:
results.shape

(12745, 5)

In [440]:
results.to_csv('../csv/indeed-results.csv', index=False, encoding='utf-8')

Lastly, we need to clean up the salary data.

1. Only a small number of the scraped results have salary information; only these will be used for modeling.
1. Some of the salaries are hourly, daily, weekly, or monthly rather than yearly; these will not be useful to us for now.
1. Some of the entries may be duplicated.
1. The salaries are given as text, usually as ranges.
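
The cleaning rules in the list above can be sketched as a single function. This is only an illustration, not the code used in the cells that follow: `parse_yearly_salary` is a hypothetical name, and it keeps a salary only when the text says "a year", whereas the notebook filters out the hourly, daily, weekly, and monthly listings instead.

```python
import re


def parse_yearly_salary(text):
    """Return a yearly salary as a float (the midpoint for ranges), else None."""
    if 'a year' not in text:
        return None                       # hourly, daily, weekly or monthly listing
    # Grab every dollar amount: "$90,000 - $170,000 a year" -> ['90,000', '170,000']
    amounts = re.findall(r'\$([\d,]+(?:\.\d+)?)', text)
    values = [float(amount.replace(',', '')) for amount in amounts]
    if not values:
        return None
    return sum(values) / len(values)      # a single value is its own midpoint


print(parse_yearly_salary('$90,000 - $170,000 a year'))   # 130000.0
print(parse_yearly_salary('$40 an hour'))                 # None
```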

In [15]:
results = pd.read_csv('../csv/indeed-results.csv')

In [16]:
salaries = results[results.salary.notnull()]

In [17]:
salaries.shape

(768, 5)

In [18]:
salaries.head()

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 24 | New York, NY | Data Scientist | indify | $90,000 - $170,000 a year | Indify data scientists contribute to all aspec... |
| 50 | New York, NY 10031 (Hamilton Heights area) | Computer Science (Data Analysis) Instructor | Urban Scholars Program, City College of Ne... | $40 an hour | Data, data filtering, basic spreadsheet operat... |
| 63 | New York, NY | Data Scientist | WorldCover | $70,000 - $110,000 a year | Your primary focus will be in applying data mi... |
| 79 | New York, NY | Data Scientist | Scienaptic Systems Inc | $100,000 a year | As our representative in front of client, you ... |
| 98 | New York, NY 10038 (Financial District area) | Data Scientist | Enterprise Select | $130,000 a year | Deep knowledge of applied statistics and machi... |


In [19]:
# keep only yearly salaries; .copy() avoids pandas' SettingWithCopyWarning on the assignments below
salaries = salaries[~salaries.salary.str.contains('an hour|a day|a week|a month')].copy()

In [20]:
salaries.salary = salaries.salary.str.replace('a year', '').str.replace(',', '').str.replace('$', '')

#### Need to turn salary ranges to an average and convert the salaries to floats

In [21]:
for i in salaries.salary:
    if len(i.split('-')) != 1 and len(i.split('-')) != 2:
        print i

In [22]:
new_salaries = []
for i in salaries.salary:
    a = i.split('-')
    if len(a) == 2:
        new_salaries.append(np.mean([float(b) for b in a]))
    else:
        new_salaries.append(float(a[0]))

In [23]:
new_salaries[0:5]

[130000.0, 90000.0, 100000.0, 130000.0, 75000.0]

In [24]:
salaries.salary = new_salaries

In [25]:
salaries.shape

(551, 5)

In [26]:
salaries.duplicated().sum()

0

In [27]:
salaries.head()

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 24 | New York, NY | Data Scientist | indify | 130000.0 | Indify data scientists contribute to all aspec... |
| 63 | New York, NY | Data Scientist | WorldCover | 90000.0 | Your primary focus will be in applying data mi... |
| 79 | New York, NY | Data Scientist | Scienaptic Systems Inc | 100000.0 | As our representative in front of client, you ... |
| 98 | New York, NY 10038 (Financial District area) | Data Scientist | Enterprise Select | 130000.0 | Deep knowledge of applied statistics and machi... |
| 105 | New York, NY | Senior Research Analyst | Research Foundation of The City Univer... | 75000.0 | Overseeing all project activities related to d... |


### Save results as a CSV

In [28]:
salaries.to_csv('../csv/salaries-complete.csv', index=False, encoding='utf-8')

# Predicting salaries using Random Forests

#### Load in the data of scraped salaries

In [3]:
salaries = pd.read_csv('../csv/salaries-complete.csv')

In [4]:
salaries.head()

| | location | title | company | salary | summary |
|---|---|---|---|---|---|
| 0 | New York, NY | Data Scientist | indify | 130000.0 | Indify data scientists contribute to all aspec... |
| 1 | New York, NY | Data Scientist | WorldCover | 90000.0 | Your primary focus will be in applying data mi... |
| 2 | New York, NY | Data Scientist | Scienaptic Systems Inc | 100000.0 | As our representative in front of client, you ... |
| 3 | New York, NY 10038 (Financial District area) | Data Scientist | Enterprise Select | 130000.0 | Deep knowledge of applied statistics and machi... |
| 4 | New York, NY | Senior Research Analyst | Research Foundation of The City Univer... | 75000.0 | Overseeing all project activities related to d... |


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

Regression could be used for a task like this, but since there is a fair amount of variance in job salaries, I treated this as a classification problem, with the goal of predicting whether a job salary would be above or below the median salary for a data scientist.
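
To make the labeling rule concrete, here is a small sketch in plain Python. The helper names are hypothetical, and the sample values are just the five salaries shown in the preview above, so the cutoff here differs from the median of the full dataset. A salary exactly equal to the median is labeled 0, because the comparison is strictly "greater than".

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:                              # odd count: middle element
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0    # even count: mean of the two middle elements


def label_high_salary(salaries):
    """Return the median cutoff and a 1/0 label per salary (1 = above the median)."""
    cutoff = median(salaries)
    return cutoff, [1 if salary > cutoff else 0 for salary in salaries]


sample = [130000.0, 90000.0, 100000.0, 130000.0, 75000.0]
cutoff, labels = label_high_salary(sample)
print(cutoff)   # 100000.0
print(labels)   # [1, 0, 0, 1, 0]
```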

In [5]:
median_salary = np.median(salaries.salary)
median_salary

111000.0

In [6]:
salaries['high_salary'] = [1 if i > median_salary else 0 for i in salaries.salary]

In [7]:
salaries.head()

| | location | title | company | salary | summary | high_salary |
|---|---|---|---|---|---|---|
| 0 | New York, NY | Data Scientist | indify | 130000.0 | Indify data scientists contribute to all aspec... | 1 |
| 1 | New York, NY | Data Scientist | WorldCover | 90000.0 | Your primary focus will be in applying data mi... | 0 |
| 2 | New York, NY | Data Scientist | Scienaptic Systems Inc | 100000.0 | As our representative in front of client, you ... | 0 |
| 3 | New York, NY 10038 (Financial District area) | Data Scientist | Enterprise Select | 130000.0 | Deep knowledge of applied statistics and machi... | 1 |
| 4 | New York, NY | Senior Research Analyst | Research Foundation of The City Univer... | 75000.0 | Overseeing all project activities related to d... | 0 |


#### Create a Random Forest model to predict High/Low salary using scikit-learn. Start by ONLY using the location as a feature.

In [8]:
from sklearn.model_selection import cross_val_score, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

In [9]:
salaries.location.value_counts()[0:30]

New York, NY                                         76
Chicago, IL                                          37
Boston, MA                                           26
Manhattan, NY                                        24
Los Angeles, CA                                      21
Washington, DC                                       19
San Francisco, CA                                    17
St. Louis, MO                                        15
Seattle, WA                                          13
San Jose, CA 95113 (Downtown area)                   12
Phoenix, AZ                                          11
Philadelphia, PA                                     10
Coral Gables, FL                                      9
Portland, OR                                          8
Houston, TX                                           8
Atlanta, GA                                           7
Santa Clara, CA                                       6
Chicago, IL 60603 (Loop area)                   

In [10]:
cities = []
states = []

for loc in salaries.location:
    items = loc.split(',')
    cities.append(items[0])
    states.append(items[1])

In [11]:
import re

In [12]:
only_states = []
for state in states:
    only_states.append(re.search(r'\w+', state).group(0))

In [13]:
salaries['city'] = cities
salaries['state'] = only_states

In [14]:
salaries.city = salaries.city + ", " + salaries.state
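
The split-then-regex steps above can also be expressed as one regular expression. The sketch below is an alternative under stated assumptions, not the notebook's method: `split_location` and `LOCATION_RE` are hypothetical names, and the pattern assumes locations look like "City, ST" optionally followed by a ZIP code and an area name.

```python
import re

# City name up to the first comma, then a two-letter state code
LOCATION_RE = re.compile(r'^\s*([^,]+),\s*([A-Z]{2})\b')


def split_location(location):
    """Return (city, state), or (location, None) when the pattern does not match."""
    match = LOCATION_RE.match(location)
    if match is None:
        return location.strip(), None
    return match.group(1).strip(), match.group(2)


print(split_location('New York, NY 10038 (Financial District area)'))   # ('New York', 'NY')
```

Returning None for the state, rather than raising an error, makes it easy to spot locations that do not fit the expected format.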

In [15]:
for i in salaries.state:
    if len(i) > 2:
        print i

In [16]:
salaries.head()

| | location | title | company | salary | summary | high_salary | city | state |
|---|---|---|---|---|---|---|---|---|
| 0 | New York, NY | Data Scientist | indify | 130000.0 | Indify data scientists contribute to all aspec... | 1 | New York, NY | NY |
| 1 | New York, NY | Data Scientist | WorldCover | 90000.0 | Your primary focus will be in applying data mi... | 0 | New York, NY | NY |
| 2 | New York, NY | Data Scientist | Scienaptic Systems Inc | 100000.0 | As our representative in front of client, you ... | 0 | New York, NY | NY |
| 3 | New York, NY 10038 (Financial District area) | Data Scientist | Enterprise Select | 130000.0 | Deep knowledge of applied statistics and machi... | 1 | New York, NY | NY |
| 4 | New York, NY | Senior Research Analyst | Research Foundation of The City Univer... | 75000.0 | Overseeing all project activities related to d... | 0 | New York, NY | NY |


In [17]:
salaries.city.nunique()

101

In [18]:
salaries.city.value_counts()

New York, NY                  91
Chicago, IL                   47
Boston, MA                    32
San Francisco, CA             24
Manhattan, NY                 24
Los Angeles, CA               22
Washington, DC                21
St. Louis, MO                 15
San Jose, CA                  15
Philadelphia, PA              14
Seattle, WA                   14
Phoenix, AZ                   11
Houston, TX                   10
Austin, TX                     9
Coral Gables, FL               9
Santa Clara, CA                9
Portland, OR                   8
Atlanta, GA                    7
Columbus, OH                   6
Charlotte, NC                  6
Denver, CO                     6
Pittsburgh, PA                 5
San Diego, CA                  5
Cambridge, MA                  5
San Mateo, CA                  5
Berkeley, CA                   5
Dallas, TX                     5
Alexandria, VA                 4
Indianapolis, IN               4
Pleasanton, CA                 4
          

---

## City random forest

In [20]:
city_dummies = pd.get_dummies(salaries.city)

X_city = city_dummies
y_city = salaries.high_salary

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X_city, y_city, test_size=0.3, random_state=90)

In [22]:
rfc = RandomForestClassifier(n_estimators=300, random_state=90)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_city, y_city, cv=10, n_jobs=-1)
print "Cross Validation Score:\t{:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.663
Cross Validation Score:	0.62 ± 0.109
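
When reading these scores, it helps to compare them with a trivial baseline. Because high_salary is 1 only for salaries strictly above the median, no more than half of the postings can be labeled 1, so always predicting 0 is right at least half the time. The sketch below computes that baseline for any list of labels. The function name is hypothetical and the sample labels are illustrative, not the notebook's data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / float(len(labels))


print(majority_baseline_accuracy([1, 0, 0, 1, 0]))   # 0.6
```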


In [23]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_city.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
for i in X_city.columns:
    feature_medians.append(np.median(salaries[salaries.city == i].salary))

feature_importances['median_salary'] = feature_medians
feature_importances['over_or_under'] = [1 if i > median_salary else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(15)

| | feature | importance | median_salary | over_or_under |
|---|---|---|---|---|
| 48 | Manhattan, NY | 0.091707 | 76143.0 | 0 |
| 58 | New York, NY | 0.089924 | 135000.0 | 1 |
| 83 | San Jose, CA | 0.0478 | 162500.0 | 1 |
| 82 | San Francisco, CA | 0.046551 | 162500.0 | 1 |
| 93 | St. Louis, MO | 0.039109 | 54095.5 | 0 |
| 26 | Coral Gables, FL | 0.037266 | 48000.0 | 0 |
| 11 | Boston, MA | 0.029993 | 135000.0 | 1 |
| 66 | Philadelphia, PA | 0.025194 | 148750.0 | 1 |
| 70 | Pleasanton, CA | 0.021447 | 187500.0 | 1 |
| 68 | Pittsburgh, PA | 0.020785 | 57500.0 | 0 |


## Summary Count Vectorizer

In [24]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

In [25]:
salaries_w_desc = salaries[salaries.summary.notnull()]

X_summ = salaries_w_desc.summary
y_summ = salaries_w_desc.high_salary

In [26]:
cv = CountVectorizer(stop_words="english")
cv.fit(X_summ)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [27]:
len(cv.get_feature_names())

1810
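
For intuition about what the fitted vectorizer produces, here is a rough pure-Python sketch of bag-of-words counting. It reuses the token pattern shown in the repr above, but it is not scikit-learn's implementation: the function names are hypothetical, and `STOP_WORDS` is a tiny illustrative subset rather than the library's full English list.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')    # words of two or more characters
STOP_WORDS = {'a', 'an', 'and', 'as', 'at', 'be', 'in', 'of', 'the', 'to', 'will', 'with', 'you'}


def count_tokens(text):
    """Lowercase the text, tokenize it, and count the non-stop-word tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return Counter(token for token in tokens if token not in STOP_WORDS)


def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({token for doc in docs for token in count_tokens(doc)})
    return {token: index for index, token in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Turn one document into a row of counts, one column per vocabulary token."""
    row = [0] * len(vocabulary)
    for token, count in count_tokens(doc).items():
        if token in vocabulary:            # tokens unseen during fitting are ignored
            row[vocabulary[token]] = count
    return row


docs = ['Data scientists work with data', 'Machine learning in Python']
vocabulary = build_vocabulary(docs)
print(sorted(vocabulary))                  # ['data', 'learning', 'machine', 'python', 'scientists', 'work']
print(vectorize(docs[0], vocabulary))      # [2, 0, 0, 0, 1, 1]
```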

In [28]:
X_summ_trans = pd.DataFrame(cv.transform(X_summ).todense(), columns=cv.get_feature_names())

In [29]:
X_train, X_test, y_train, y_test = train_test_split(np.asmatrix(X_summ_trans), y_summ, test_size=0.3,
                                                    random_state=59, stratify=y_summ)

In [30]:
# show df
# X_train_trans.transpose().sort_values(0, ascending=False).head(10).transpose()
# sorting by most frequent words in doc 0, showing first 10 words

In [32]:
word_counts = X_summ_trans.sum(axis=0)
word_counts.sort_values(ascending = False).head(20)

data           491
learning       135
machine        123
scientist      102
scientists      96
analytics       95
team            85
health          77
experience      75
research        70
analysis        50
statistical     47
science         47
looking         44
clinical        39
python          38
project         36
modeling        34
work            34
develop         33
dtype: int64

In [33]:
word_counts.to_csv('../csv/indeed-words.csv', encoding='utf-8')

In [34]:
# X_train and X_test are already transformed

# X_test_trans = pd.DataFrame(cv.transform(X_test).todense(), columns=cv.get_feature_names())
# X_trans = pd.DataFrame(cv.transform(X_summ).todense(), columns=cv.get_feature_names())

In [35]:
rfc = RandomForestClassifier(200, random_state=59)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_summ_trans.values, y_summ.values, cv=10, n_jobs=-1)  # .values replaces the deprecated .as_matrix()
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.832
Cross Validation Score: 0.78 ± 0.077


In [37]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_summ_trans.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
feature_means = []
for i in X_summ_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc.summary.str.lower().str.contains(i)].salary))
    feature_means.append(np.mean(salaries_w_desc[salaries_w_desc.summary.str.lower().str.contains(i)].salary))


feature_importances['median_salary'] = feature_medians
feature_importances['mean_salary'] = feature_means
feature_importances['over_or_under'] = [1 if i > median_salary else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

| | feature | importance | median_salary | mean_salary | over_or_under |
|---|---|---|---|---|---|
| 979 | machine | 0.039811 | 135000.0 | 140627.310185 | 1 |
| 934 | learning | 0.039283 | 140000.0 | 142024.77027 | 1 |
| 432 | data | 0.017486 | 125000.0 | 121076.754296 | 1 |
| 1445 | role | 0.011246 | 130000.0 | 123617.647059 | 1 |
| 754 | health | 0.009099 | 75400.0 | 84945.204918 | 0 |
| 1409 | research | 0.009051 | 78374.75 | 88871.125 | 0 |
| 108 | analytics | 0.008713 | 125000.0 | 129306.620482 | 1 |
| 568 | engineer | 0.008713 | 130000.0 | 128050.47619 | 1 |
| 1336 | python | 0.008018 | 135000.0 | 140526.315789 | 1 |
| 1728 | university | 0.007678 | 65977.0 | 65709.423077 | 0 |
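
One caveat about the median_salary and mean_salary columns: `str.contains(i)` matches substrings, so a token such as "data" also matches summaries that only contain "database". A whole-word check is one way to tighten this. The sketch below shows the idea with the standard library; `contains_token` is a hypothetical helper, not something the notebook uses.

```python
import re


def contains_token(text, token):
    """True when `token` appears in `text` as a whole word, ignoring case."""
    pattern = r'\b' + re.escape(token) + r'\b'    # escape so the token is matched literally
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_token('Experience with database design', 'data'))   # False
print(contains_token('Work with Big Data tools', 'data'))          # True
```

In pandas, passing the same `\b...\b` pattern to `str.contains` should give the equivalent whole-word filter.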


## Title Count Vectorizer

In [38]:
salaries_w_desc = salaries[salaries.summary.notnull()]

X_title = salaries_w_desc.title
y_title = salaries_w_desc.high_salary

In [39]:
cv = CountVectorizer(stop_words="english")
cv.fit(X_title)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [40]:
X_title_trans = pd.DataFrame(cv.transform(X_title).todense(), columns=cv.get_feature_names())

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_title_trans, y_title, test_size=0.3, random_state=59)

In [42]:
rfc = RandomForestClassifier(200, random_state=59)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X_title_trans.values, y_title.values, cv=10, n_jobs=-1)  # .values replaces the deprecated .as_matrix()
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.812
Cross Validation Score: 0.821 ± 0.053


In [43]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X_title_trans.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
feature_means = []
for i in X_title_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc.title.str.lower().str.contains(i)].salary))
    feature_means.append(np.mean(salaries_w_desc[salaries_w_desc.title.str.lower().str.contains(i)].salary))


feature_importances['median_salary'] = feature_medians
feature_importances['mean_salary'] = feature_means
feature_importances['over_or_under'] = [1 if i > median_salary else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

| | feature | importance | median_salary | mean_salary | over_or_under |
|---|---|---|---|---|---|
| 115 | data | 0.114136 | 138750.0 | 136604.8875 | 1 |
| 373 | research | 0.050251 | 60000.0 | 73759.431818 | 0 |
| 393 | scientist | 0.042662 | 130000.0 | 128084.261538 | 1 |
| 150 | engineer | 0.041488 | 135000.0 | 136137.059322 | 1 |
| 28 | analyst | 0.038988 | 75000.0 | 83395.379808 | 0 |
| 251 | learning | 0.034745 | 145000.0 | 145178.571429 | 1 |
| 397 | senior | 0.031822 | 142500.0 | 137266.97973 | 1 |
| 363 | quantitative | 0.028956 | 145000.0 | 150111.111111 | 1 |
| 263 | machine | 0.026596 | 143750.0 | 144187.5 | 1 |
| 131 | director | 0.019587 | 163200.0 | 162867.647059 | 1 |


# Combining Title CV, Summary CV, and Location

In [44]:
salaries_w_desc = salaries[salaries.summary.notnull()].reset_index()
city_dummies = pd.get_dummies(salaries_w_desc.city)

X = pd.concat([city_dummies, X_title_trans, X_summ_trans], axis=1)
y = salaries_w_desc.high_salary

In [45]:
print X.shape
print y.shape

(495, 2387)
(495,)


In [46]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.3, random_state=68, stratify=y)  # .values replaces the deprecated .as_matrix()

In [47]:
rfc = RandomForestClassifier(500, random_state=59)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print "Accuracy Score:", acc.round(3)

s = cross_val_score(rfc, X.values, y.values, cv=10, n_jobs=-1)  # .values replaces the deprecated .as_matrix()
print "Cross Validation Score: {:0.3} ± {:0.3}".format(s.mean().round(3), s.std().round(3))

Accuracy Score: 0.792
Cross Validation Score: 0.826 ± 0.068


In [48]:
feature_importances = pd.DataFrame(rfc.feature_importances_,
                                   index = X.columns).reset_index()
feature_importances.columns = ['feature', 'importance']

feature_medians = []
for i in city_dummies.columns:
    feature_medians.append(np.median(salaries[salaries.city == i].salary))
for i in X_title_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc.title.str.lower().str.contains(i)].salary))
for i in X_summ_trans.columns:
    feature_medians.append(np.median(salaries_w_desc[salaries_w_desc.summary.str.lower().str.contains(i)].salary))

feature_importances['median_salary'] = feature_medians
feature_importances['over_or_under'] = [1 if i > median_salary else 0 for i in feature_importances.median_salary]

feature_importances.sort_values('importance', ascending=False).head(20)

| | feature | importance | median_salary | over_or_under |
|---|---|---|---|---|
| 211 | data | 0.051987 | 138750.0 | 1 |
| 1511 | learning | 0.034643 | 140000.0 | 1 |
| 1556 | machine | 0.027272 | 135000.0 | 1 |
| 489 | scientist | 0.024844 | 130000.0 | 1 |
| 1009 | data | 0.016558 | 125000.0 | 1 |
| 124 | analyst | 0.015555 | 75000.0 | 0 |
| 347 | learning | 0.014885 | 145000.0 | 1 |
| 469 | research | 0.014808 | 60000.0 | 0 |
| 359 | machine | 0.012412 | 143750.0 | 1 |
| 55 | New York, NY | 0.008109 | 135000.0 | 1 |
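
Several features appear twice in this table (for example "data", "learning", and "machine") because the title and summary vocabularies share words, and the combined frame keeps both columns under the same name. One way to tell them apart is to prefix each block of columns with its source before concatenating. The sketch below illustrates the idea on plain lists. The helper names are hypothetical; in pandas, `DataFrame.add_prefix` would do the renaming on the frames themselves.

```python
def prefix_names(names, prefix):
    """Return the column names with a source prefix, e.g. 'title_data'."""
    return [prefix + name for name in names]


def find_duplicates(names):
    """Return the names that occur more than once, in order of first repeat."""
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates


title_words = ['analyst', 'data', 'machine']       # illustrative vocabularies
summary_words = ['data', 'machine', 'python']

print(find_duplicates(title_words + summary_words))   # ['data', 'machine']
combined = prefix_names(title_words, 'title_') + prefix_names(summary_words, 'summary_')
print(find_duplicates(combined))                       # []
```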
