For this project, I will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

I am going to collect salary information on data science jobs in indeed. Then using the location, title, and summary of the job I will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, I will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second part, the focus is on using listings with salary information to build a model and predict high or low salaries and what features are predictive of that result.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

## Importing necessary functions

In [4]:
# Import necessary packages
# "%pdb" you can use debugger for this function. 
# import pdb 
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import re
import sys
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score

## Webscraping

In [5]:
# Was inspired by Adam, he talked about this piece of code in class. 
# this function extracts location from 
def extract_location_from_result(result):
    a = result.find("span", class_ = "location") # Location
    return None if a is None else a.text.strip()
def salary_from_result(result):
    a = result.find("nobr")
    return None if a is None else a.text.strip()
def company_from_result(result):
    a = result.find("span", class_ = "company")
    return None if a is None else a.text.strip()
def jobtitle_from_result(result):
    a = result.find("a", attrs={'data-tn-element': "jobTitle"}) #Job title
    return None if a is None else a.text.strip()
def extract_summary_from_result(result):
    a = result.find('span',class_='summary')
    return None if a is None else a.text.strip()



In [6]:

url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 500

# Creating 4 empty lists to store data from results. 
results = []
job_title_lst = []
location_lst = []
company_lst = []
salary_lst = []
summary_lst = []

# looks through the url to g
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Atlanta',"Fort+Lee","Boston", "Tampa", "Washington","Seattle", "Pittsburgh", "Princeton","Cincinnati","Jersey+city","Palm+beach"]):
    for start in range(0, max_results_per_city, 10):
        url = url_template.format(city, start)
        r = requests.get(url)
        page_html = r.content
        soup = BeautifulSoup(page_html,"lxml")
        result = soup.findAll("div", class_ = "result")
        results.append(result)
        ## fix this 
        for i in results:
            for a in i:
                job_title_lst.append(jobtitle_from_result(a))
                location_lst.append(extract_location_from_result(a))
                company_lst.append(company_from_result(a))
                salary_lst.append(salary_from_result(a))
                summary_lst.append(extract_summary_from_result(a))       

In [7]:
print (len(job_title_lst))
print (len(location_lst))
print (len(company_lst))
print (len(salary_lst))
print (len(summary_lst))

4204739
4204739
4204739
4204739
4204739


## Exploring and cleaning dataset

In [8]:
# creates a dataFrame from the four list created from the code above. 
df = pd.DataFrame({"job_title": job_title_lst, "salary": salary_lst, "company":company_lst, "location": location_lst, "summary":summary_lst})

In [9]:
# Rearranging the columns
df = df[["job_title", "salary", "location", "company","summary"]]

In [10]:
df.head()

Unnamed: 0,job_title,salary,location,company,summary
0,Data Scientist - Big Data & Analytics,,"New York, NY 10154",KPMG,KPMG is currently seeking a Data Scientist - B...
1,Senior NLP Data Scientist,,"New York, NY",Elevano,Parsing of structured/unstructured data. Our c...
2,Lead Data Scientist,,"New York, NY 10017",Aetna,We are currently looking for an exceptional Le...
3,Business Intelligence Analyst/Data Scientist,,"Jersey City, NJ 07310 (Downtown area)",JP Morgan Chase,Effectively use Data Visualization techniques ...
4,Data Scientist with NLP Experience,,"Jersey City, NJ",EXL,Data Scientist with NLP Experience. EXL Analyt...


In [11]:
## Droping Duplicates
df.drop_duplicates(inplace = True,)

In [12]:
len(df)

3644

In [13]:
#df.dropna(inplace=True)

In [14]:
#Lessons for regular expression
# "^" (carot) Stands before a text to match that text similar to startswith() from the string library. 
# "."(period) matches any characters. 
# ".+"(period and plus) matches 

In [15]:
df.head()

Unnamed: 0,job_title,salary,location,company,summary
0,Data Scientist - Big Data & Analytics,,"New York, NY 10154",KPMG,KPMG is currently seeking a Data Scientist - B...
1,Senior NLP Data Scientist,,"New York, NY",Elevano,Parsing of structured/unstructured data. Our c...
2,Lead Data Scientist,,"New York, NY 10017",Aetna,We are currently looking for an exceptional Le...
3,Business Intelligence Analyst/Data Scientist,,"Jersey City, NJ 07310 (Downtown area)",JP Morgan Chase,Effectively use Data Visualization techniques ...
4,Data Scientist with NLP Experience,,"Jersey City, NJ",EXL,Data Scientist with NLP Experience. EXL Analyt...


In [16]:
# Splits column values based on "," and stores the first item. 
df["location"] = df["location"].apply(lambda x: None if x is None else x.split(",")[0].strip())

In [451]:
#Lessons for regular expression
# "^" (carot) Stands before a text to match that text similar to startswith() from the string library. 
# "."(period) matches any characters. 
# ".+"(period and plus) matches 

In [452]:
"""# cleans job_title columns
# Del
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("-")[0].strip())
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("/")[0].strip())
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("-")[0].strip())
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split(",")[0].strip())
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("|")[0].strip())
df["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("(")[0].strip())

"""

'# cleans job_title columns\n# Del\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("-")[0].strip())\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("/")[0].strip())\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("-")[0].strip())\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split(",")[0].strip())\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("|")[0].strip())\ndf["job_title"] = df["job_title"].apply(lambda x: None if x is None else x.split("(")[0].strip())\n\n'

In [453]:
# convert the type from unicode to text
df["location"] = df["location"].apply(lambda x: str(x))

In [454]:
#df["job_title"] = df["job_title"].apply(lambda x : None if x is None else x.encode('utf-8'))

In [455]:
df["company"] = df["company"].apply(lambda x : None if x is None else x.encode("utf-8"))

In [456]:
# this function is writen to clean salary column 
def fix_salary(n): 
    result = None
    if n is None:
        result = None
    elif "year" in n:
        a = re.findall('\d\S+', n)# 
        b = int(a[0].replace(",",""))
        try:
            c = int(a[1].replace(",",""))
            avg = (b+c)/2
            result = avg
        except:
            result = b
    elif "month" in n:
        a = re.findall('\d\S+', n)
        b = int(a[0].replace(",",""))
        try:
            c = int(a[1].replace(",",""))
            avg = (b+c)/2
            result = avg*12
        except:
            result = b*12

    elif "hour" in n:
        a = re.findall('\d\S+', n)
#         b = int(a[0].replace(",",""))
        b = float(a[0])
        try:
            c = float(a[1])
#             c = int(a[1].replace(",",""))
            avg = (b+c)/2
            result = avg*40*52
        except:
            result = b*40*52
    return result


In [457]:
# Apply the fix_salary to clean salary data. 
df["salary"] = df["salary"].apply(fix_salary)

In [458]:
len(df)

3617

In [459]:
train_df = df.dropna().reset_index()

In [460]:
len(train_df)

239

In [461]:
train_df.head()

Unnamed: 0,index,job_title,salary,location,company,summary
0,85,Research associate,41977.0,New York,Research Foundation of The City University of ...,"Assists scientists, students, and other techni..."
1,415,"Statistician, Level I",52903.0,New York,POLICE DEPARTMENT,Selected candidate will be responsible for pro...
2,668,"Data Analyst, Bureau of Immunization",65977.0,New York,DEPT OF HEALTH/MENTAL HYGIENE,"Review program data, assuring data quality and..."
3,817,Senior Data Scientist,160000.0,New York,Analytic Recruiting,Major insurance company seeks an experienced S...
4,985,Data Analyst/Modeler,75557.0,Manhattan,DEPARTMENT OF FINANCE,"Strong programming, data analysis, statistical..."


In [462]:
del train_df["index"]

In [463]:
train_df.head()

Unnamed: 0,job_title,salary,location,company,summary
0,Research associate,41977.0,New York,Research Foundation of The City University of ...,"Assists scientists, students, and other techni..."
1,"Statistician, Level I",52903.0,New York,POLICE DEPARTMENT,Selected candidate will be responsible for pro...
2,"Data Analyst, Bureau of Immunization",65977.0,New York,DEPT OF HEALTH/MENTAL HYGIENE,"Review program data, assuring data quality and..."
3,Senior Data Scientist,160000.0,New York,Analytic Recruiting,Major insurance company seeks an experienced S...
4,Data Analyst/Modeler,75557.0,Manhattan,DEPARTMENT OF FINANCE,"Strong programming, data analysis, statistical..."


In [464]:
train_df["salary_category"] = train_df["salary"].apply(lambda x : "high" if x > np.mean(train_df["salary"]) else "low")

In [587]:
train_df.head()

Unnamed: 0,salary,salary_category,location_Arlington,location_Atlanta,location_Austin,location_Bellevue,location_Berkeley,location_Boston,location_Bridgewater,location_Brooklyn,...,services,smith,solutions,south,state,texas,university,washington,water,workbridge
0,41977.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,52903.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,65977.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,160000.0,high,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,75557.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [466]:
"""#X = pd.concat(pd.DataFrame(train_df.ix[:,"job_title"]), pd.DataFrame(train_df.ix[:,"location":]),ignore_index=True)
columns_list = ['company','job_title','summary','location']
X = train_df[columns_list]"""

'#X = pd.concat(pd.DataFrame(train_df.ix[:,"job_title"]), pd.DataFrame(train_df.ix[:,"location":]),ignore_index=True)\ncolumns_list = [\'company\',\'job_title\',\'summary\',\'location\']\nX = train_df[columns_list]'

### Building Text feature for job_title and summary

In [468]:
# Build text features for job title and summary

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer(
    binary=True,  # Create binary features
    stop_words ='english', # Ignore common words 
    max_features= 40 
)

jwords_df = v.fit_transform(train_df["job_title"]).todense()
jwords_df = pd.DataFrame(jwords_df, columns=v.get_feature_names())

swords_df = v.fit_transform(train_df['summary']).todense()
swords_df = pd.DataFrame(swords_df, columns=v.get_feature_names())

com_words_df = v.fit_transform(train_df['company']).todense()
com_words_df = pd.DataFrame(com_words_df, columns=v.get_feature_names())

loc_words_df = v.fit_transform(train_df['company']).todense()
loc_words_df = pd.DataFrame(loc_words_df, columns=v.get_feature_names())



In [469]:
jwords_df.shape

(239, 40)

In [437]:
com_words_df.shape

(239, 40)

In [None]:
# categorizing the salary column so that we can use logistic regression model.
#df["salary_position"] = df["salary"].apply(lambda x: None if x is None elif 'high' if x > df['salary'].mean() else 'low')

In [156]:
"""import patsy
location_df = patsy.dmatrix('~ C(location)',train_df)"""

# My model starts from here

In [475]:
for i in swords_df.columns:
    if i in jwords_df.columns:
        del jwords_df[i]
    if i in com_words_df.columns:
        del com_words_df[i]

In [488]:
for i in jwords_df.columns:
    if i in com_words_df.columns:
        del com_words_df[i]

In [None]:
train_df = pd.get_dummies(train_df, columns=['location'], drop_first=True).reset_index(drop=True)

In [482]:
train_df = train_df.join(jwords_df)

In [486]:
train_df = train_df.join(swords_df)

In [489]:
train_df = train_df.join(com_words_df)

In [493]:
train_df = train_df.drop(['company','job_title','summary'], axis=1)

In [494]:
train_df.head()

Unnamed: 0,salary,salary_category,location_Arlington,location_Atlanta,location_Austin,location_Bellevue,location_Berkeley,location_Boston,location_Bridgewater,location_Brooklyn,...,services,smith,solutions,south,state,texas,university,washington,water,workbridge
0,41977.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,52903.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,65977.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,160000.0,high,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,75557.0,low,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [495]:
new_train = train_df.ix[:,"salary_category":]

In [497]:
y = new_train["salary_category"]

In [512]:
X = new_train.ix[:,"location_Arlington":]

In [513]:
X.shape

(239, 158)

In [509]:
new_train.shape

(239, 159)

In [515]:

# of course my trainning and test will be different.
X_train, X_test, y_train, y_test =\
train_test_split(X, y, test_size=0.33, random_state=77)

In [17]:
calories = map(int, raw_input().strip().split(' '))


2 4 5 6 8


In [18]:
calories

[2, 4, 5, 6, 8]

In [20]:
sorted(calories, reverse=True)

[8, 6, 5, 4, 2]

## Logistic Regression

In [517]:
# Now let's fit a standard logistic regression model
lr = LogisticRegression()
lr_model = lr.fit(X_train, y_train)

In [518]:
#Make our predictions
lr_ypred = lr_model.predict(X_test)

In [521]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [536]:
# Check our misclassifications with a confusion matrix
lr_cm = confusion_matrix(y_test,lr_ypred, labels=lr.classes_)
lr_cm = pd.DataFrame(lr_cm, columns=lr.classes_, index=lr.classes_)
lr_cm


Unnamed: 0,high,low
high,26,10
low,6,37


In [523]:
# Check our precision, recall, and f1
print classification_report(y_test, lr_ypred, labels=lr.classes_)

             precision    recall  f1-score   support

       high       0.81      0.72      0.76        36
        low       0.79      0.86      0.82        43

avg / total       0.80      0.80      0.80        79



In [558]:
# check accuracy score
print accuracy_score(y_test, lr_ypred)

0.79746835443


In [541]:
# Check the CV Score
cvs1 = cross_val_score(lr, X, y, cv=3)
cvs1

array([ 0.725     ,  0.65      ,  0.73417722])

## Penalized regression - LASSO (L1)

In [525]:
# Let's now use a penalized regression - we'll use LASSO (L1)
lr_l1 = LogisticRegression(C=10, penalty='l1')
lr_l1_model = lr_l1.fit(X_train, y_train)
lr_l1_ypred = lr_l1_model.predict(X_test)
lr_l1_model.predict_proba(X_test)

array([[  2.36336419e-01,   7.63663581e-01],
       [  9.75585254e-01,   2.44147461e-02],
       [  2.08354880e-08,   9.99999979e-01],
       [  2.41239685e-02,   9.75876031e-01],
       [  4.11490248e-02,   9.58850975e-01],
       [  8.76141057e-01,   1.23858943e-01],
       [  5.60569728e-02,   9.43943027e-01],
       [  1.00020442e-01,   8.99979558e-01],
       [  9.99082813e-01,   9.17187423e-04],
       [  7.71759934e-02,   9.22824007e-01],
       [  9.91528127e-01,   8.47187319e-03],
       [  9.07207292e-01,   9.27927083e-02],
       [  3.01983462e-04,   9.99698017e-01],
       [  2.66657173e-04,   9.99733343e-01],
       [  6.47241998e-03,   9.93527580e-01],
       [  9.76246855e-01,   2.37531454e-02],
       [  6.87993529e-01,   3.12006471e-01],
       [  2.41625788e-03,   9.97583742e-01],
       [  2.45955675e-02,   9.75404432e-01],
       [  5.50180440e-01,   4.49819560e-01],
       [  3.30168862e-01,   6.69831138e-01],
       [  3.70473277e-01,   6.29526723e-01],
       [  

In [526]:
# Get the confusion matrix
lr_l1_cm = confusion_matrix(y_test, lr_l1_ypred, labels=lr_l1.classes_)
lr_l1_cm = pd.DataFrame(lr_l1_cm, columns=lr_l1.classes_,\
                        index=lr_l1.classes_)
print lr_l1_cm, "\n"

print lr_cm

      high  low
high    21   15
low     12   31 

      high  low
high    26   10
low      6   37


In [527]:
# Get the classification report
print classification_report(y_test, lr_l1_ypred, labels=lr_l1.classes_)

             precision    recall  f1-score   support

       high       0.64      0.58      0.61        36
        low       0.67      0.72      0.70        43

avg / total       0.66      0.66      0.66        79



In [557]:

# check accuracy score
print accuracy_score(y_test, lr_l1_ypred)

0.658227848101


In [531]:
#Get mean cross val score
cvs2 = cross_val_score(lr_l1, X, y, cv=3)
np.mean(cvs2)

0.65675105485232066

### Comparing

In [533]:
# compare between the two 
print cvs2.mean()

print cvs1.mean()

0.656751054852
0.70305907173


In [None]:
# Get the classification report for our best model
print classification_report(y_test, logreg_cv.predict(X_test))

## GridSearchCV

In [545]:
from sklearn.grid_search import GridSearchCV

In [582]:
logreg = LogisticRegression(solver='liblinear')
C_vals = [0.0001, 0.001, 0.01, 0.1, .15, .25, .275, .33, 0.5, .66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1','l2']

gs = GridSearchCV(logreg, {'penalty': penalties, 'C': C_vals},\
                  verbose=False, cv=15)
gs.fit(X, y)

GridSearchCV(cv=15, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.0001, 0.001, 0.01, 0.1, 0.15, 0.25, 0.275, 0.33, 0.5, 0.66, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=False)

In [583]:
# finding the best parameters
print gs.best_params_
print gs.best_score_

{'penalty': 'l2', 'C': 0.15}
0.694560669456


In [549]:
# Use this parameter to fit, predict, and print a classification_report for our X and Y
logreg = LogisticRegression(C=gs.best_params_['C'],\
                            penalty=gs.best_params_['penalty'])
cv_model = logreg.fit(X_train, y_train)

In [550]:
cv_pred = cv_model.predict(X_test)

In [552]:
cm3 = confusion_matrix(y_test, cv_pred, labels=logreg.classes_)
cm3 = pd.DataFrame(cm3, columns=logreg.classes_, index=logreg.classes_)

In [563]:
print cm3

      high  low
high    24   12
low      6   37


In [574]:
print "Logistic Regression confusion Matrix: \n", lr_cm
print "L1 confusion Matrix: \n",lr_l1_cm
print "confusion Matrix using grid search results: \n", cm3

Logistic Regression confusion Matrix: 
      high  low
high    26   10
low      6   37
L1 confusion Matrix: 
      high  low
high    21   15
low     12   31
confusion Matrix using grid search results: 
      high  low
high    24   12
low      6   37


In [555]:
print classification_report(y_test, cv_pred,\
                            labels=logreg.classes_)

             precision    recall  f1-score   support

       high       0.80      0.67      0.73        36
        low       0.76      0.86      0.80        43

avg / total       0.78      0.77      0.77        79



In [556]:
# check accuracy score
print accuracy_score(y_test, cv_pred)

0.772151898734


In [598]:
coef_df = pd.DataFrame(zip(X_train.columns, abs(cv_model.coef_[0])))

In [585]:
coef_df

Unnamed: 0,0,1
0,location_Arlington,0.000000
1,location_Atlanta,0.158689
2,location_Austin,0.103552
3,location_Bellevue,0.001965
4,location_Berkeley,0.054088
5,location_Boston,0.121711
6,location_Bridgewater,0.229213
7,location_Brooklyn,0.000000
8,location_Cambridge,0.110307
9,location_Chantilly,0.000000


In [586]:
coef_df.sort([1], ascending=False)

  """Entry point for launching an IPython kernel.


Unnamed: 0,0,1
81,analysis,0.437172
85,big,0.424814
154,university,0.420373
55,associate,0.418473
105,research,0.401182
59,engineer,0.369078
102,new,0.359925
131,dept,0.319813
77,sr,0.285695
58,director,0.283389
