#  Web Scraping for Indeed.com & Predicting Salaries
This project is a test of three major skills: collecting data by scraping a website, using natural language processing, and building a binary classifier.

The first part of the project was focused on scraping Indeed.com. This notebook focused on using natural language processing and building models using job postings with salary information to predict salaries.


Data Collection:

I used Beautiful Soup to scrape the data from the site. I collected salary information for data analyst and data scientist jobs from Indeed. Then, using the location, title, and summary of each job, I attempted to predict its salary. For job posting sites, this would be extraordinarily useful.

One challenge I faced during data collection is that not all job postings on Indeed list a salary. To work around this, I wrapped the salary lookup in a simple try/except statement that returns 'NA' when no salary is listed. Since I am trying to predict salaries, I then filtered my results down to only the job postings that listed salaries.
The next step was to compute the median salary of my results, which came out to $100,000. I then created a binary variable for each job: 1 if the salary was above the median and 0 otherwise.
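
The filtering and labelling steps described above can be sketched in plain Python. The `postings` list and its values are made up for illustration; the notebook itself does this with pandas.

```python
from statistics import median

# Hypothetical scraped records; 'NA' marks postings with no salary listed.
postings = [
    {"title": "Data Analyst", "salary": 80000},
    {"title": "Senior Data Analyst", "salary": 120000},
    {"title": "Data Scientist", "salary": "NA"},
    {"title": "BI Analyst", "salary": 100000},
]

# Keep only postings that list a salary.
with_salary = [p for p in postings if p["salary"] != "NA"]

# Label each posting: 1 if strictly above the median, else 0.
med = median(p["salary"] for p in with_salary)
labels = [1 if p["salary"] > med else 0 for p in with_salary]

print(med, labels)
```

A posting that sits exactly on the median gets label 0, which is also how the notebook's `x > median` comparison behaves.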

Models:

I approached this as a classification problem and used logistic regression, a decision tree classifier and a random forest classifier. I predicted the salary label as high or low based on the following features:
Job Title, Job Location and Job Summary
For the job title and job summary models, I used natural language processing to predict the salary class based on the various words that appeared.

According to my analysis, the following five locations had the highest feature importance when predicting whether a job pays above the median salary:

1. Canberra ACT	
2. Sydney Central Business District NSW	
3. Sydney NSW	
4. Chatswood NSW	
5. Melbourne VIC

My finding is that most of the jobs in Canberra are contract and government jobs, so the median salary there is higher than in other cities.

The skills most associated with highly paid data analyst roles are strong data analysis skills, a good understanding of data, and experience. SQL knowledge is required in most of the jobs.

The job titles most associated with a high salary are:

1. Datahub data analyst              
2. Senior data analyst                
3. Lead data scientist                
4. Audience data analyst              
5. Digital  Data Analyst              
6. Data analyst consultant 

For an employer such as a contracting company, the results indicate how much a job candidate might be worth based on the role they are being hired for and the skills that role requires. For example, a data analyst with strong data analysis skills might be paid more. The results also show that contract jobs are paid more than permanent jobs, which influences the model outcome.



In [2734]:
import pandas as pd
import csv
import numpy as np
import seaborn as sns
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category = FutureWarning)
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer




In [2876]:
#loading the scraped data analyst jobs from indeed.
df = pd.read_csv('indeed_data_analyst.csv')


# EDA



In [2877]:
df.head(30)
df.shape


(3097, 6)

In [2737]:
df.Salary.isnull().sum()


2479

In [2738]:
#dropping the Unnamed column
df.drop('Unnamed: 0',axis=1,inplace=True)


In [2739]:
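# .copy() gives an independent frame, so later column assignments do not hit a slice of df
df_salary = df.dropna().copy()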
df_salary.shape

(615, 5)

In [2740]:
#df_salary.drop_duplicates(subset='Summary',inplace=True)

In [2741]:
df_salary.shape

(615, 5)

In [2742]:
# Take commas, dollar signs and the text ' a year' out
df_salary['Salary'] = df_salary['Salary'].replace(to_replace=[r'\$', ',', ' a year'], value='', regex=True)
#df_salary



In [2743]:
# Average every run of digits in the salary text (the midpoint of a range)
df_salary['new'] = df_salary['Salary'].str.extractall(r'(\d+)')[0].unstack().astype(float).mean(1)
    
df_salary
    



| | Company | Location | Salary | Job Title | Summary | new |
|---|---|---|---|---|---|---|
| 1 | Cognizant Technology Solutions | Melbourne VIC | 90000 - 110000 | Data Analyst | Good understanding of Data Analysis and data m... | 100000.0 |
| 3 | Cognizant Technology Solutions | Sydney NSW | 60000 - 90000 | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 |
| 4 | Ignite | Melbourne VIC | 600 - 700 a day | Business Analyst | 5+ years’ experience in analysis, data managem... | 650.0 |
| 7 | Publicis Emil | Melbourne VIC | 70000 - 90000 | Data Analyst | Collaboration with global data team to provide... | 80000.0 |
| 8 | Agency for Clinical Innovation | Chatswood NSW | 110961 - 126496 | Data & Statistical Analyst | An experienced data analyst who has worked wit... | 118728.5 |
| 15 | City of Greater Geelong | Geelong VIC | 83508 - 90180 | Data Analyst | The Data analyst positively contributes to the... | 86844.0 |
| 25 | University of New England (UNE) | Armidale NSW | 76549 - 84141 | Academic Data Analyst | Working from either UNE Armidale or the UNE Sy... | 80345.0 |
| 28 | Aginic | Melbourne VIC | 60000 - 70000 | Data Analyst (Melbourne) | We are looking for motivated and passionate Da... | 65000.0 |
| 31 | Cognizant Technology Solutions | Sydney NSW | 60000 - 90000 | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 |
| 32 | Ignite | Melbourne VIC | 600 - 700 a day | Business Analyst | 5+ years’ experience in analysis, data managem... | 650.0 |
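
The `new` column is the average of every run of digits found in the cleaned salary text, so a range becomes its midpoint. A minimal plain-Python sketch of the same idea follows; the helper name `salary_midpoint` is hypothetical, and it assumes commas and dollar signs have already been stripped.

```python
import re

def salary_midpoint(text):
    """Average all integer runs in a cleaned salary string, or None if there are none."""
    numbers = [float(n) for n in re.findall(r"\d+", text)]
    return sum(numbers) / len(numbers) if numbers else None

print(salary_midpoint("90000 - 110000"))   # 100000.0
print(salary_midpoint("600 - 700 a day"))  # 650.0
```

One caveat of matching on `\d+`: a value written with a decimal point would be read as two separate numbers, so this approach is only safe for whole-number salaries.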


In [2744]:
#converting the salary formats to per year

m1 = df_salary['Salary'].str.contains('month')
m2 = df_salary['Salary'].str.contains('day') | df_salary['Salary'].str.contains('p/d')
m3 = df_salary['Salary'].str.contains('hour') |df_salary['Salary'].str.contains('p/hr')
df_salary['Annual_salary'] =np.select([m1,m2,m3],[df_salary['new']*12,df_salary['new']*269,df_salary['new']*8*269], default=df_salary['new'])
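
The annualisation above multiplies daily rates by 269 and hourly rates by 8 × 269; those factors are the notebook's own choice. A small self-contained sketch of the same `np.select` pattern, using made-up rows:

```python
import numpy as np
import pandas as pd

# Toy rows: the cleaned salary text and its numeric midpoint.
toy = pd.DataFrame({
    "Salary": ["600 - 700 a day", "90000 - 110000", "50 an hour", "8000 a month"],
    "new": [650.0, 100000.0, 50.0, 8000.0],
})

WORK_DAYS = 269  # factor used in the notebook above
conditions = [
    toy["Salary"].str.contains("month"),
    toy["Salary"].str.contains("day"),
    toy["Salary"].str.contains("hour"),
]
choices = [toy["new"] * 12, toy["new"] * WORK_DAYS, toy["new"] * 8 * WORK_DAYS]

# Rows matching none of the conditions are treated as already annual.
toy["Annual_salary"] = np.select(conditions, choices, default=toy["new"])
print(toy)
```

`np.select` takes the first condition that is true for each row, so the order of the masks matters if a salary string could match more than one of them.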



In [2745]:
df_salary.Location.value_counts().head(7)


Melbourne VIC                           406
Sydney NSW                              114
Canberra ACT                             20
Sydney Central Business District NSW      7
Geelong VIC                               6
North Sydney NSW                          6
Brisbane QLD                              6
Name: Location, dtype: int64

In [2746]:

df2=df_salary.groupby('Job Title').median().head(20)



In [2747]:
df2.sort_values(by='Annual_salary', ascending=False)


| Job Title | new | Annual_salary |
|---|---|---|
| Application Specialist, Data Analyst - Global Financial Inte... | 170000.0 | 170000.0 |
| Analytics Lead/ Data Scientist | 145000.0 | 145000.0 |
| BI Analyst - Tableau | 140000.0 | 140000.0 |
| Associate / Data Science delivery lead | 125000.0 | 125000.0 |
| APS6 Data Analyst | 54.0 | 116208.0 |
| APS4 Business Analyst | 52.5 | 112980.0 |
| APS6 Senior Data Analyst | 50.5 | 108676.0 |
| Analyst | 45.0 | 96840.0 |
| APS4 Data Analyst - SQL / R Code | 41.0 | 88232.0 |
| 3D Spatial Information System Analyst | 87391.0 | 87391.0 |


In [2748]:
#finding the median salary
median=np.median(df_salary['Annual_salary'])
median

100000.0

### Creating label 'Job Type' for salary as high and low based on median salary


In [2751]:

df_salary['Job_Type'] = [1 if x>median else 0 for x in df_salary.Annual_salary]




In [2752]:
df_salary


| | Company | Location | Salary | Job Title | Summary | new | Annual_salary | Job_Type |
|---|---|---|---|---|---|---|---|---|
| 1 | Cognizant Technology Solutions | Melbourne VIC | 90000 - 110000 | Data Analyst | Good understanding of Data Analysis and data m... | 100000.0 | 100000.0 | 0 |
| 3 | Cognizant Technology Solutions | Sydney NSW | 60000 - 90000 | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 75000.0 | 0 |
| 4 | Ignite | Melbourne VIC | 600 - 700 a day | Business Analyst | 5+ years’ experience in analysis, data managem... | 650.0 | 174850.0 | 1 |
| 7 | Publicis Emil | Melbourne VIC | 70000 - 90000 | Data Analyst | Collaboration with global data team to provide... | 80000.0 | 80000.0 | 0 |
| 8 | Agency for Clinical Innovation | Chatswood NSW | 110961 - 126496 | Data & Statistical Analyst | An experienced data analyst who has worked wit... | 118728.5 | 118728.5 | 1 |
| 15 | City of Greater Geelong | Geelong VIC | 83508 - 90180 | Data Analyst | The Data analyst positively contributes to the... | 86844.0 | 86844.0 | 0 |
| 25 | University of New England (UNE) | Armidale NSW | 76549 - 84141 | Academic Data Analyst | Working from either UNE Armidale or the UNE Sy... | 80345.0 | 80345.0 | 0 |
| 28 | Aginic | Melbourne VIC | 60000 - 70000 | Data Analyst (Melbourne) | We are looking for motivated and passionate Da... | 65000.0 | 65000.0 | 0 |
| 31 | Cognizant Technology Solutions | Sydney NSW | 60000 - 90000 | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 75000.0 | 0 |
| 32 | Ignite | Melbourne VIC | 600 - 700 a day | Business Analyst | 5+ years’ experience in analysis, data managem... | 650.0 | 174850.0 | 1 |


In [2753]:
#Dropping the unwanted columns and resetting the index

df_salary.reset_index(drop=True,inplace=True)
df_salary.drop(['Salary','new'],axis=1,inplace=True)




In [2755]:
#saving the cleaned data to a csv file and reloading it
df_salary.to_csv('indeed_data_analyst_cleaned.csv', encoding='utf-8')
df_salary_data = pd.read_csv('indeed_data_analyst_cleaned.csv')


In [2756]:
df_salary_data

| | Unnamed: 0 | Company | Location | Job Title | Summary | Annual_salary | Job_Type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Cognizant Technology Solutions | Melbourne VIC | Data Analyst | Good understanding of Data Analysis and data m... | 100000.0 | 0 |
| 1 | 1 | Cognizant Technology Solutions | Sydney NSW | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 0 |
| 2 | 2 | Ignite | Melbourne VIC | Business Analyst | 5+ years’ experience in analysis, data managem... | 174850.0 | 1 |
| 3 | 3 | Publicis Emil | Melbourne VIC | Data Analyst | Collaboration with global data team to provide... | 80000.0 | 0 |
| 4 | 4 | Agency for Clinical Innovation | Chatswood NSW | Data & Statistical Analyst | An experienced data analyst who has worked wit... | 118728.5 | 1 |
| 5 | 5 | City of Greater Geelong | Geelong VIC | Data Analyst | The Data analyst positively contributes to the... | 86844.0 | 0 |
| 6 | 6 | University of New England (UNE) | Armidale NSW | Academic Data Analyst | Working from either UNE Armidale or the UNE Sy... | 80345.0 | 0 |
| 7 | 7 | Aginic | Melbourne VIC | Data Analyst (Melbourne) | We are looking for motivated and passionate Da... | 65000.0 | 0 |
| 8 | 8 | Cognizant Technology Solutions | Sydney NSW | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 0 |
| 9 | 9 | Ignite | Melbourne VIC | Business Analyst | 5+ years’ experience in analysis, data managem... | 174850.0 | 1 |


In [2757]:
df_salary_new =df_salary_data.drop(['Unnamed: 0'],axis=1)

In [2758]:
df_salary_new.shape

(615, 6)

###  Setting the target variable as Job_Type(high or low)

In [2759]:

y = df_salary_new["Job_Type"]

In [2760]:
#Calculating the baseline model accuracy
baseline =1-np.mean(y)
print('baseline:', baseline)


baseline: 0.6097560975609756
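
The baseline above is the accuracy of always predicting the majority class (here class 0, since fewer than half of the postings are above the median). A sketch with hypothetical labels; taking the larger of the two class shares keeps the calculation correct whichever class is in the majority.

```python
def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common binary label."""
    positive_share = sum(labels) / len(labels)
    return max(positive_share, 1 - positive_share)

# Hypothetical labels: 4 high-salary postings out of 10.
toy_labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(majority_baseline(toy_labels))
```

A classifier is only useful here if its test accuracy is clearly above this figure.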


In [2761]:



ss = StandardScaler()



### Predicting salary with location

In [2762]:
df_salary_new['Location']= df_salary_new['Location'].apply(lambda x: 'Melbourne' if x in ['Docklands VIC','St Kilda VIC','Parkville VIC'] else x)

In [2763]:

X = df_salary_new["Location"]
X= pd.get_dummies(X, drop_first=True)
y = df_salary_new["Job_Type"]
# saving the encoded value to a dataframe
df_loc=X
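
As an illustration of what `pd.get_dummies(..., drop_first=True)` produces, here is a toy example with made-up locations. The first category in sorted order is dropped and becomes the implicit reference level.

```python
import pandas as pd

locations = pd.Series(["Melbourne VIC", "Sydney NSW", "Canberra ACT", "Sydney NSW"])

# drop_first=True drops the first category in sorted order ("Canberra ACT"),
# so a Canberra row is encoded as all zeros.
encoded = pd.get_dummies(locations, drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```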

In [2764]:
df_salary_new.shape



(615, 6)

In [2765]:
df_salary_new

| | Company | Location | Job Title | Summary | Annual_salary | Job_Type |
|---|---|---|---|---|---|---|
| 0 | Cognizant Technology Solutions | Melbourne VIC | Data Analyst | Good understanding of Data Analysis and data m... | 100000.0 | 0 |
| 1 | Cognizant Technology Solutions | Sydney NSW | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 0 |
| 2 | Ignite | Melbourne VIC | Business Analyst | 5+ years’ experience in analysis, data managem... | 174850.0 | 1 |
| 3 | Publicis Emil | Melbourne VIC | Data Analyst | Collaboration with global data team to provide... | 80000.0 | 0 |
| 4 | Agency for Clinical Innovation | Chatswood NSW | Data & Statistical Analyst | An experienced data analyst who has worked wit... | 118728.5 | 1 |
| 5 | City of Greater Geelong | Geelong VIC | Data Analyst | The Data analyst positively contributes to the... | 86844.0 | 0 |
| 6 | University of New England (UNE) | Armidale NSW | Academic Data Analyst | Working from either UNE Armidale or the UNE Sy... | 80345.0 | 0 |
| 7 | Aginic | Melbourne VIC | Data Analyst (Melbourne) | We are looking for motivated and passionate Da... | 65000.0 | 0 |
| 8 | Cognizant Technology Solutions | Sydney NSW | DataHub Data Analyst | Cognizant hiring for DataHub Data Analyst with... | 75000.0 | 0 |
| 9 | Ignite | Melbourne VIC | Business Analyst | 5+ years’ experience in analysis, data managem... | 174850.0 | 1 |


In [2766]:
# Scaling the data
from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
Xs = ss.fit_transform(X)



In [2767]:
X_train, X_test, y_train, y_test = train_test_split(Xs, y,random_state=50, test_size=0.3)
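
The cells above fit the scaler on the full feature matrix and split afterwards. A common alternative, shown here only as a sketch on synthetic data, is to split first and wrap the scaler and model in a pipeline so that the scaler's statistics come from the training rows alone. All names and values below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_toy = rng.randint(0, 2, size=(100, 5)).astype(float)  # synthetic dummy columns
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1).astype(int)     # synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=50)

# The scaler's mean and variance are learned from X_tr only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))
```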

In [2768]:
#using Logistic regression

In [2769]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.6054054054054054

In [2770]:
#using decision tree to predict salary

In [2771]:
# 'sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted by newer scikit-learn
dtc = DecisionTreeClassifier(max_depth=None, max_features='sqrt')
dtc.fit(X_train, y_train)
print(('dtc acc:', dtc.score(X_test, y_test)))


('dtc acc:', 0.6054054054054054)


In [2772]:
#Finding the feature importances for Location

In [2773]:
feature_importances = pd.DataFrame(dtc.feature_importances_,
                                   index = X.columns,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)



| | importance |
|---|---|
| Canberra ACT | 0.168915 |
| Sydney Central Business District NSW | 0.16354 |
| Sydney NSW | 0.111274 |
| Chatswood NSW | 0.087604 |
| Melbourne VIC | 0.063744 |
| Parramatta NSW | 0.058657 |
| Camperdown NSW | 0.05623 |
| Perth WA | 0.05451 |
| Newcastle NSW | 0.05319 |
| Brisbane QLD | 0.047554 |


### Predict Salary by Summary using  Count Vectorizer

In [2790]:
# NLP Using a count vectorizer to predict salary with summary as predictor  
from sklearn.feature_extraction.text import CountVectorizer



In [2791]:
# Let's use the stop_words argument to remove words like "and, the, a"
cvec = CountVectorizer(stop_words='english',ngram_range=(3,3))

# Fit our vectorizer using our train data
cvec_model=cvec.fit(df_salary_new['Summary'])

# and check out the length of the vectorized data after
len(cvec.get_feature_names())

2194
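
The vectorizer above keeps only three-word sequences (`ngram_range=(3,3)`) after removing English stop words. A small sketch on one made-up sentence shows what such features look like. The `hasattr` check is there because newer scikit-learn releases expose `get_feature_names_out`, while older ones (as used in this notebook) expose `get_feature_names`.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["solid SQL skills and the ability to create complex queries"]
vec = CountVectorizer(stop_words="english", ngram_range=(3, 3))
vec.fit(docs)

# Pick whichever accessor this scikit-learn version provides.
if hasattr(vec, "get_feature_names_out"):
    names = list(vec.get_feature_names_out())
else:
    names = list(vec.get_feature_names())

print(names)
```

Stop words are removed before the n-grams are built, so a trigram such as "skills ability create" can bridge words that were not adjacent in the original text.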

In [2792]:
from sklearn.preprocessing import StandardScaler
#need to standardise as we calculate the distance

ss = StandardScaler()
Xs = ss.fit_transform(X)




In [2793]:
# Transforming our x data using our fit cvec.
# And converting the result to a DataFrame.
X = pd.DataFrame(cvec.transform(df_salary_new['Summary']).todense(),
                       columns=cvec.get_feature_names())
# setting the target variable
y= df_salary_new['Job_Type']

In [2794]:
#X = pd.DataFrame(cvec.transform(df_salary_new['Summary']).todense(),columns=cvec.get_feature_names())

In [2795]:
#saving the count vectorized summary to a dataframe
df_summary = X
X.shape

(615, 2194)

In [2796]:
# Which words appear the most?
word_counts = X.sum(axis=0)

word_counts.sort_values(ascending = False).head(20)

analysis data management       292
data analysis data             203
good understanding data        201
able analysis data             200
issue root cause               200
analysis data issue            200
create complex queries         200
solid sql skill                200
cause create complex           200
skill able analysis            200
root cause create              200
sql skill able                 200
data issue root                200
understanding data analysis    200
data management solid          188
management solid sql           188
experience analyst ability      92
management reporting change     92
reporting change management     92
data management reporting       92
dtype: int64
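
Many of the trigrams above share the exact count of 200, which suggests (though the notebook does not confirm it) that the same posting text appears many times; the duplicate-removal line earlier in the notebook is commented out. A quick way to check for this, sketched on toy data with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Summary": ["solid sql skill", "solid sql skill", "machine learning", "solid sql skill"],
    "Job_Type": [0, 0, 1, 0],
})

# Rows whose Summary already appeared earlier in the frame.
dup_mask = toy.duplicated(subset="Summary")
print(dup_mask.sum(), "of", len(toy), "rows repeat an earlier Summary")

# Dropping them before the train/test split keeps identical postings
# from landing on both sides of the split.
deduped = toy.drop_duplicates(subset="Summary").reset_index(drop=True)
print(deduped.shape)
```

If duplicates are present, test accuracy for the summary-based models may be inflated, because the model can see the same text in training and in testing.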

#### Model 1:Logistic Regression

In [2797]:
#Import and fit our logistic regression 
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=100, test_size=0.2)
lr = LogisticRegression()
model_summary=lr.fit(X_train, y_train)
y_pred= model_summary.predict(X_test)


print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))

Accuracy:  0.8699186991869918


In [2798]:
# we can predict the salary label with about 87.0% accuracy based on the job summary

#### Model 2: Decision tree

In [2799]:
dtc = DecisionTreeClassifier(max_depth=None, max_features=50)
dtc.fit(X_train, y_train)
print(('dtc acc:', dtc.score(X_test, y_test)))


('dtc acc:', 0.8617886178861789)


In [2800]:
feature_importances = pd.DataFrame(dtc.feature_importances_,
                                   index = X.columns,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head()



| | importance |
|---|---|
| data management reporting | 0.311991 |
| business automotive data | 0.127245 |
| issue root cause | 0.091642 |
| including data cleansing | 0.04164 |
| validation reporting analytics | 0.03604 |


### Predict salary label with Job Title

In [2801]:
# Setting the vectorizer just like we would set a model
cvec = CountVectorizer(stop_words='english',ngram_range=(3,3))

# Fitting the vectorizer on our training data
cvec.fit(df_salary_new['Job Title'])


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(3, 3), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [2802]:
# Transforming our x_train data using our fit cvec.
# And converting the result to a DataFrame.
X = pd.DataFrame(cvec.transform(df_salary_new['Job Title']).todense(),
                       columns=cvec.get_feature_names())



In [2803]:
#loading the features to a dataframe
df_jobtitle= X

X.shape



(615, 251)

In [2804]:
# setting the target variable
y= df_salary_new['Job_Type']

In [2805]:
from sklearn.preprocessing import StandardScaler
#need to standardise as we calculate the distance

ss = StandardScaler()
Xs = ss.fit_transform(X)




In [2806]:
#using test train split
X_train, X_test, y_train, y_test = train_test_split(Xs, y,random_state=80, test_size=0.2)


In [2807]:
#Import and fit our logistic regression and test it too
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)


0.6178861788617886

In [2808]:
word_counts = X.sum(axis=0)

word_counts.sort_values(ascending = False).head(20)


customer insights analyst         34
datahub data analyst              29
senior data analyst                6
lead data scientist                6
audience data analyst              5
analyst digital media              5
data analyst digital               5
data analyst consultant            3
analyst modeller visualisation     2
data analyst modeller              2
ict business analyst               2
rural ehealth business             2
12 month contract                  2
customer service systems           2
service systems analyst            2
junior business analyst            2
contract fin services              2
aps6 data analyst                  2
data scientist python              2
business systems analyst           2
dtype: int64

In [2809]:
# 'sqrt' is what 'auto' meant for classifiers; 'auto' is no longer accepted by newer scikit-learn
dtc = DecisionTreeClassifier(max_depth=None, max_features='sqrt')
dtc.fit(X_train, y_train)
print(('dtc acc:', dtc.score(X_test, y_test)))


('dtc acc:', 0.6666666666666666)


In [2810]:
feature_importances = pd.DataFrame(dtc.feature_importances_,
                                   index = X.columns,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head()


| | importance |
|---|---|
| customer insights analyst | 0.104324 |
| datahub data analyst | 0.099041 |
| senior data analyst | 0.052983 |
| audience data analyst | 0.021924 |
| aps6 data analyst | 0.01979 |


### Predicting Salary label with location, title and summary


In [2811]:
X = pd.concat([df_summary,df_jobtitle, df_loc], axis=1)
y = df_salary_new.Job_Type
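
For the models in this section to use the combined features, the concatenated `X` needs its own train/test split; otherwise `X_train` and `X_test` still hold the Job Title features from the previous section, and the scores will simply repeat the Job Title results. A sketch with small stand-in frames follows. The column names and values are made up, and `test_size` and `random_state` are arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-ins for df_summary, df_jobtitle and df_loc, sharing the same row index.
summary_toy = pd.DataFrame({"solid sql skill": [1, 0, 1, 0, 1, 0]})
title_toy = pd.DataFrame({"senior data analyst": [0, 1, 0, 0, 1, 0]})
loc_toy = pd.DataFrame({"Sydney NSW": [1, 1, 0, 0, 0, 1]})
y_toy = pd.Series([1, 1, 0, 0, 1, 0])

X_all = pd.concat([summary_toy, title_toy, loc_toy], axis=1)

# Split the combined matrix so the models see all three feature groups.
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(
    X_all, y_toy, test_size=0.33, random_state=0
)
print(X_train_all.shape, X_test_all.shape)
```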

#### Model 1: Logistic Regression

In [2812]:
#Import and fit our logistic regression and test it too
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred=lr.predict(X_test)
lr.score(X_test, y_test)

0.6178861788617886

In [2813]:
#print(classification_report(y_pred,y_test))
print(metrics.classification_report(y_test, y_pred, target_names=['lower salary', 'higher salary']))

               precision    recall  f1-score   support

 lower salary       0.66      0.82      0.73        79
higher salary       0.44      0.25      0.32        44

    micro avg       0.62      0.62      0.62       123
    macro avg       0.55      0.54      0.53       123
 weighted avg       0.58      0.62      0.59       123



In [2814]:
# Display confusion matrix
confusion = confusion_matrix(y_test, y_pred)
pd.DataFrame(confusion, columns=['Predicted Low Salary', 'Predicted High Salary'], 
             index=['Actual Low Salary', 'Actual High Salary'])


| | Predicted Low Salary | Predicted High Salary |
|---|---|---|
| Actual Low Salary | 65 | 14 |
| Actual High Salary | 33 | 11 |
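
The classification report figures for the higher-salary class follow directly from this confusion matrix. The arithmetic, using the counts shown above:

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted).
tn, fp = 65, 14   # actual low salary
fn, tp = 33, 11   # actual high salary

precision_high = tp / (tp + fp)              # 11 / 25
recall_high = tp / (tp + fn)                 # 11 / 44
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 76 / 123

print(round(precision_high, 2), round(recall_high, 2), round(accuracy, 2))
```

The low recall means the model misses three out of four high-salary postings, even though its overall accuracy looks close to the baseline.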


#### Model 2 : Random Forest Classifier

In [2815]:
#using random forest to predict salary label

rfc = RandomForestClassifier(n_estimators=300, random_state=90)
rfc.fit(X_train, y_train)

rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print ("Accuracy Score of Random Forest:", acc.round(3))



Accuracy Score of Random Forest: 0.667


#### Model 3: Decision tree classifier

In [2816]:

dtc = DecisionTreeClassifier(max_depth=None, max_features=50)
dtc.fit(X_train, y_train)
print(('dtc acc:', dtc.score(X_test, y_test)))



('dtc acc:', 0.6666666666666666)


# QUESTION 2: Factors that distinguish job category

In [None]:
# Second round of scraping done to get Data Scientist jobs

In [2858]:
#loading the scraped Data scientist jobs
df = pd.read_csv('indeed_data_scientist.csv')

In [2859]:
#Cleaning the data defined as a function 
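def data_cleaning(df):
    """Clean the scraped postings and return the rows that have an annualised salary."""
    # Work on a new frame so the caller's DataFrame is not modified
    df = df.drop('Unnamed: 0', axis=1)
    df_ds = df.dropna().reset_index(drop=True)

    # Take commas, dollar signs and the text ' a year' out
    df_ds['Salary'] = df_ds['Salary'].replace(to_replace=[r'\$', ',', ' a year'], value='', regex=True)

    # Average every run of digits in the salary text (the midpoint of a range)
    df_ds['new'] = df_ds['Salary'].str.extractall(r'(\d+)')[0].unstack().astype(float).mean(1)

    # Convert monthly, daily and hourly rates to a yearly figure, using the
    # same 269-day factor as the data analyst cleaning earlier in the notebook
    m1 = df_ds['Salary'].str.contains('month')
    m2 = df_ds['Salary'].str.contains('day')
    m3 = df_ds['Salary'].str.contains('hour')
    df_ds['Annual_salary'] = np.select([m1, m2, m3], [df_ds['new']*12, df_ds['new']*269, df_ds['new']*8*269], default=df_ds['new'])

    df_ds = df_ds[['Company', 'Location', 'Annual_salary', 'Job Title', 'Summary']]
    df_ds = df_ds.drop_duplicates(subset='Summary')

    # Save the cleaned data and reload it from the CSV file
    df_ds.to_csv('indeed_data_scientist_cleaned.csv', encoding='utf-8')
    df_ds = pd.read_csv('indeed_data_scientist_cleaned.csv')
    return df_ds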

In [2860]:
#calling the function to clean the data
df_ds= data_cleaning(df)

In [2861]:
df_ds.drop('Unnamed: 0',axis=1,inplace=True)

In [2862]:
#combine data scientist and data analyst jobs
df_full=pd.concat([df_salary_new,df_ds],axis=0)

In [2863]:
# deleting the duplicate rows based on summary
df_ds.drop_duplicates(subset='Summary', inplace=True)

In [2864]:

df_full['Job_Type'] = [1 if x > median else 0 for x in df_full.Annual_salary]

In [2865]:
df_full.shape

(740, 6)

## Distinguish data scientists from other data jobs

In [2866]:
# Creating label for Data scientist Jobs
df_full['Data_Scientist']= df_full['Job Title'].apply(lambda x: 1 if x in ['Data Scientist','Data Scientists','DATA SCIENTIST'] else 0)

In [2867]:
#Setting the target variable
y=df_full['Data_Scientist']

In [2868]:
#calculating the baseline accuracy
1 - sum(y) / len(y)

0.9797297297297297

In [2869]:
#count vectorizer 
cvec = CountVectorizer(ngram_range=(2,2),stop_words='english',max_features=50)

# Fitting the vectorizer on our training data
cvec.fit(df_full['Summary'])


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=50, min_df=1,
        ngram_range=(2, 2), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [2870]:
X = pd.DataFrame(cvec.transform(df_full['Summary']).todense(),columns=cvec.get_feature_names())


In [2871]:
len(df_full['Annual_salary'])



740

### Using k means clustering 

In [2872]:
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans, k_means
from sklearn import cluster, metrics


df_loc=pd.get_dummies(df_full['Location'], drop_first=True)
df_jobtitle= pd.DataFrame(cvec.transform(df_full['Job Title']).todense(),
                       columns=cvec.get_feature_names())
# Transforming our x data using our fit cvec.
# And converting the result to a DataFrame.
df_summary = pd.DataFrame(cvec.transform(df_full['Summary']).todense(),
                       columns=cvec.get_feature_names())

# using the features Summary and Job Title for clustering
X= pd.concat([df_summary,df_jobtitle],axis=1)



#X = pd.DataFrame(cvec.transform(df_full['Summary']).todense(),columns=cvec.get_feature_names())
kmeans = cluster.KMeans(n_clusters=2)
kmeans.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [2873]:
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
df_full['Labels']= labels


In [2843]:
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score, adjusted_mutual_info_score

In [2844]:
print(completeness_score(df_full['Data_Scientist'], labels))

0.011077642644507214


In [2845]:
metrics.silhouette_score(X, labels, metric='euclidean')

0.6318572180387948

In [2631]:
#The silhouette score (about 0.63) indicates reasonably good separation
# and coherence with 2 clusters.

In [2632]:
print(homogeneity_score(df_full['Data_Scientist'], labels))

0.06523531566445571


In [2633]:
print(adjusted_mutual_info_score(df_full['Data_Scientist'], labels))

0.009861754973659966
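
The silhouette score only measures how geometrically separated the clusters are, whereas completeness, homogeneity and adjusted mutual information compare the clusters with the `Data_Scientist` label. Scores near 0 for the latter, as above, mean the two clusters do not line up with that label even though they are reasonably well separated. A toy illustration with made-up points and labels:

```python
import numpy as np
from sklearn.metrics import homogeneity_score, silhouette_score

# Two tight, well-separated blobs (made-up 2-D points).
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                   [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
cluster_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A "true" label that is split evenly across both clusters.
true_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])

sil = silhouette_score(points, cluster_labels)        # high: clusters are well separated
hom = homogeneity_score(true_labels, cluster_labels)  # about 0: clusters say nothing about the label
print(sil, hom)
```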


## Categorizing Senior and Junior jobs


In [2533]:
y = [1 if 'Senior' in x else 0 for x in df_full['Job Title'].values]

In [2534]:
# Calculate the baseline.
1 - sum(y) / len(y)

0.9648648648648649

In [2535]:
df_company= pd.get_dummies(df_full['Company'],drop_first=True)

In [2574]:
vect = TfidfVectorizer(stop_words='english', min_df=0.005, ngram_range=(2,3))
X_sum = vect.fit_transform(df_full['Summary'])
df_sum = pd.DataFrame(X_sum.toarray(), columns=vect.get_feature_names())
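
To illustrate how `min_df` and `ngram_range` shape the TF-IDF vocabulary, here is a sketch on three made-up summaries. It uses an integer `min_df` for readability; the notebook's `min_df=0.005` is a float and is therefore read as a fraction of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "strong sql skills and data analysis",
    "data analysis and reporting experience",
    "machine learning experience with python",
]

# min_df=2 means "appears in at least 2 documents".
vec = TfidfVectorizer(stop_words="english", ngram_range=(2, 3), min_df=2)
tfidf = vec.fit_transform(docs)

print(vec.vocabulary_)  # only n-grams shared by at least two documents survive
print(tfidf.shape)
```

Here only "data analysis" occurs in two documents, so it is the single feature that remains.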

In [2575]:
X= pd.concat([df_sum,df_jobtitle], axis=1)
X.columns

Index(['ability years', 'ability years experience', 'able analysis',
       'able analysis data', 'advanced excel', 'analyse data', 'analysis data',
       'analysis data issue', 'analysis data management', 'analysis reporting',
       ...
       'roadmap standardizes', 'root cause', 'skill able', 'solid sql',
       'sql skill', 'stakeholder engagement', 'standardizes optimizes',
       'understanding data', 'working closely', 'years experience'],
      dtype='object', length=427)

In [2576]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Model 1: Logistic Regression

In [2577]:
lr = LogisticRegression(solver='lbfgs', max_iter=200)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [2578]:
print(metrics.confusion_matrix(y_test, y_pred))
print('-'*55, '\n')
print(metrics.classification_report(y_test, y_pred, target_names=['junior', 'senior']))

[[144   0]
 [  4   0]]
------------------------------------------------------- 

              precision    recall  f1-score   support

      junior       0.97      1.00      0.99       144
      senior       0.00      0.00      0.00         4

   micro avg       0.97      0.97      0.97       148
   macro avg       0.49      0.50      0.49       148
weighted avg       0.95      0.97      0.96       148
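
With only 4 senior postings among the 148 test rows, accuracy says little here: the model reaches about 97% by never predicting "senior". The arithmetic, using the counts printed above:

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted).
junior_as_junior, junior_as_senior = 144, 0
senior_as_junior, senior_as_senior = 4, 0

total = junior_as_junior + junior_as_senior + senior_as_junior + senior_as_senior
accuracy = (junior_as_junior + senior_as_senior) / total
senior_recall = senior_as_senior / (senior_as_senior + senior_as_junior)

print(round(accuracy, 3), senior_recall)
```

This accuracy is essentially the same as the baseline calculated earlier for this target, so recall on the senior class is the more informative number.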





### Model 2: Random Forest Classifier

In [2589]:
rfc = RandomForestClassifier(n_estimators=200, random_state=1)
rfc.fit(X_train,y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=None,
            oob_score=False, random_state=1, verbose=0, warm_start=False)

In [2590]:
rfc_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, rfc_pred)
print ("Accuracy Score of Random Forest:", acc.round(3))


Accuracy Score of Random Forest: 0.98


In [2591]:

# Check important features.
pd.Series(rfc.feature_importances_, index=X_train.columns).sort_values(ascending=False)[:40]

senior analyst                   0.150029
senior data                      0.118573
senior data analyst              0.077881
data scientists                  0.062069
skills including                 0.048462
data models                      0.044454
analyst join                     0.043583
senior data scientist            0.040628
data analyst                     0.037932
data manipulation                0.030246
analyse data                     0.027479
data analyst                     0.024803
years experience                 0.019229
working closely                  0.015417
data reporting                   0.015414
work closely                     0.015182
data scientist                   0.015072
analyst responsible              0.014152
data management                  0.012288
data science                     0.010798
machine learning                 0.010571
data analyst data                0.009514
big data                         0.007692
exciting opportunity             0