### Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

1. Determine the industry factors that are most important in predicting the salary amounts for these data.
2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you focus on data-related job postings, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by limiting your search to a single region.

Hint: Aggregators like Indeed.com regularly pool job postings from a variety of markets and industries.

Goal: Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

### Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

You have learned a variety of new skills and models that may be useful for this problem:

* NLP
* Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
* Ensemble methods and decision tree models
* SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. Communication of your process is key. Note that most listings DO NOT come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:

* What components of a job posting distinguish data scientists from other data jobs?
* What features are important for distinguishing junior vs. senior positions?
* Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.

### BONUS PROBLEM

Your boss would rather incorrectly tell a client that they would get a lower-salary job than incorrectly tell them that they would get a high-salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.

In [23]:
import pandas as pd
import seaborn as sns
import numpy as np
import re
import matplotlib.pyplot as plt
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA
from sklearn.tree import export_graphviz
from gensim.models import Word2Vec
from sklearn.model_selection import cross_val_score

In [24]:
jobs = pd.read_csv('./data_MCF_combined_1-180.csv')
jobs

Unnamed: 0.1,Unnamed: 0,Company,JobTitle,Location,EmploymentType,Seniority,Category,GovSupport,SalaryRange,JobLink,...,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31
0,0,INTERACTIVE DATA (SINGAPORE) PTE. LTD.,Reference Data Sales Specialist,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,"$60,000to$100,000Annually",/job/4444308b9cdfe083237649bb58399cda,...,,,,,,,,,,
1,1,NTT DATA SINGAPORE PTE. LTD.,Data Analyst,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,"$4,500to$7,000Monthly",/job/1808ee6a1f76b35f3e1fccc2bcbe2f2e,...,,,,,,,,,,
2,2,CIMB BANK BERHAD,"Head, Data and Information Security",East,Contract,Professional,Banking and Finance ...,Government support available,"$10,000to$11,000Monthly",/job/4eab9054bf3e5af132624759f99a033c,...,,,,,,,,,,
3,3,AXA DIL@ASIA PTE. LTD.,Data Science Engineer,East,Contract,Professional,Banking and Finance ...,Government support available,"$4,600to$5,600Monthly",/job/6bb5bd34a64c3f119dde566d77cc1976,...,,,,,,,,,,
4,4,FEHMARN CONSULTING PTE. LTD.,Data Architect,Central,Permanent,Senior Management,Banking and Finance,,"$6,000to$7,500Monthly",/job/31843d4e1f993ad8949d0014d02abf40,...,,,,,,,,,,
5,5,OSIM INTERNATIONAL PTE. LTD.,Data Architect,Central,Permanent,Senior Management,Banking and Finance,Government support available,"$4,700to$7,000Monthly",/job/ff94655ae809d1aa0091b2b801d1589f,...,,,,,,,,,,
6,6,AXA DIL@ASIA PTE. LTD.,Data Scientist Junior,Central,Full Time,Executive,Banking and Finance ...,Government support available,"$4,600to$5,600Monthly",/job/557127dc1c23f59816f38e0c9c7c4fc3,...,,,,,,,,,,
7,7,JONES LANG LASALLE TECHNOLOGY SERVICES PTE. LTD.,"Senior Manager, Data Governance",Central,Full Time,Executive,Banking and Finance ...,Government support available,"$8,000to$10,000Monthly",/job/ca75f75d752636fe19b0f7b4e0d580c2,...,,,,,,,,,,
8,8,NTUC INCOME INSURANCE CO-OPERATIVE LTD,Senior Data Research Executive,East,Permanent,Professional,Information Technology,,"$4,000to$7,000Monthly",/job/115abfac7a8347ba682c8b116ce59768,...,,,,,,,,,,
9,9,PROCTER & GAMBLE EUROPE SA SINGAPORE BRANCH,Senior Data Wrangler,East,Permanent,Professional,Information Technology,,"$8,000to$15,000Monthly",/job/d18e1cdd4d80258ca15d53e089660f6a,...,,,,,,,,,,


### EDA

1. Summarize the dataframe and remove unnecessary columns.
2. Drop duplicate rows.
3. Find null values (all of them are in the GovSupport feature).
4. Lowercase the JobTitle feature and filter the data on data-science-related keywords.
5. Drop rows with 'Salary undisclosed' in the SalaryRange feature.
6. Split each salary range into minimum salary, maximum salary, and salary frequency (monthly/annually).
7. Standardize all salaries to a monthly frequency.
8. Create three features holding the minimum, maximum, and average salary.
9. Group salaries into two categories in the SalCategory feature: 0 for average salaries below 6000 (the median) and 1 for average salaries of 6000 or more.
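
Steps 5 to 8 can be sketched as a single parsing function. This is a minimal sketch rather than the approach used in the cells below: `parse_salary` is a hypothetical helper, and it assumes every disclosed salary has the form `$<min>to$<max>` followed by `Monthly` or `Annually`, as in the rows shown above.

```python
import re

def parse_salary(raw):
    """Return (min, max, avg) as monthly amounts, or None if the string does not parse.

    Hypothetical helper: assumes strings such as '$4,500to$7,000Monthly'.
    """
    match = re.match(r'\$([\d,]+)to\$([\d,]+)(Monthly|Annually)$', raw)
    if match is None:
        # e.g. 'Salary undisclosed'
        return None
    low = int(match.group(1).replace(',', ''))
    high = int(match.group(2).replace(',', ''))
    if match.group(3) == 'Annually':
        # Standardize annual figures to a monthly frequency
        low, high = low / 12.0, high / 12.0
    return low, high, (low + high) / 2.0

monthly = parse_salary('$4,500to$7,000Monthly')
annual = parse_salary('$60,000to$100,000Annually')
undisclosed = parse_salary('Salary undisclosed')
```

Returning `None` for unparseable strings lets the caller drop undisclosed salaries in the same pass.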

In [25]:
jobs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3742 entries, 0 to 3741
Data columns (total 32 columns):
Unnamed: 0            3742 non-null int64
Company               3742 non-null object
JobTitle              3742 non-null object
Location              3742 non-null object
EmploymentType        3742 non-null object
Seniority             3742 non-null object
Category              3742 non-null object
GovSupport            1836 non-null object
SalaryRange           3742 non-null object
JobLink               3742 non-null object
PostedDate            3742 non-null object
ClosingDate           3742 non-null object
RoleResponsibility    3742 non-null object
Requirements          3742 non-null object
Unnamed: 14           9 non-null object
Unnamed: 15           3 non-null object
Unnamed: 16           2 non-null object
Unnamed: 17           1 non-null object
Unnamed: 18           1 non-null object
Unnamed: 19           1 non-null object
Unnamed: 20           1 non-null object
Unnamed: 21 

In [26]:
# Removing unnecessary columns
jobs = jobs.iloc[:, 1:14]

In [27]:
# Count duplicated rows, then drop them.
# Note: keep=False removes every copy of a duplicated row, including its first occurrence.
print(jobs.duplicated().value_counts())
jobs.drop_duplicates(keep=False, inplace=True)

False    3737
True        5
dtype: int64
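
The `keep` argument of `drop_duplicates` decides how many copies of a repeated row survive. The toy frame below (made-up rows, not taken from the dataset) shows the difference between `keep=False`, as used above, and the default `keep='first'`.

```python
import pandas as pd

# Toy data (assumption): two identical rows and one unique row
demo = pd.DataFrame({
    'Company': ['A', 'A', 'B'],
    'JobTitle': ['data analyst', 'data analyst', 'data engineer'],
})

# keep='first' retains one copy of each duplicated row
kept_first = demo.drop_duplicates(keep='first')

# keep=False discards every copy of a duplicated row
kept_none = demo.drop_duplicates(keep=False)
```

If the aim is to keep one copy of each posting, `keep='first'` is probably the better fit; `keep=False` is appropriate only when duplicated postings should be excluded entirely.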


In [28]:
# Finding null values
jobs.isnull().sum()

Company                  0
JobTitle                 0
Location                 0
EmploymentType           0
Seniority                0
Category                 0
GovSupport            1906
SalaryRange              0
JobLink                  0
PostedDate               0
ClosingDate              0
RoleResponsibility       0
Requirements             0
dtype: int64

In [29]:
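# Lowercase the titles so the keyword filter is case-insensitive
jobs['JobTitle'] = jobs['JobTitle'].apply(lambda x: x.lower())

# Keep data-related titles. These are substring matches, so 'ai' also matches
# any title that merely contains those letters (e.g. 'retail', 'maintenance').
jobs = jobs[jobs['JobTitle'].str.contains('data|analy|ai|scientist|architect|engineer')]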
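
Because `str.contains` treats the pattern as a regular expression and matches substrings, the short keyword `ai` can let unrelated titles through. The sketch below uses made-up titles to compare the pattern above with a variant that puts word boundaries around `ai`; the stricter pattern is a suggestion, not what the notebook ran.

```python
import pandas as pd

# Toy titles (assumption), already lowercased
titles = pd.Series(['retail assistant', 'ai engineer', 'data analyst'])

# Pattern used in the notebook: 'ai' matches inside 'retail'
loose = titles.str.contains('data|analy|ai|scientist|architect|engineer')

# Variant with word boundaries so that only the standalone token 'ai' matches
strict = titles.str.contains(r'data|analy|\bai\b|scientist|architect|engineer')
```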

In [30]:
drop = jobs[jobs['SalaryRange']=='Salary undisclosed'].index.tolist()
jobs.drop(drop, inplace=True)

In [31]:
jobs['SalaryRange'] = (jobs['SalaryRange'].str.replace('to', ' ').str.replace('$', '').str.replace(',', ''))

In [32]:
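# Split each string into its numeric part and its frequency label,
# e.g. '60000 100000Annually' -> ['60000 100000', 'Annually'].
# The pattern requires at least one letter, so it never matches an empty string.
salary = []
for s in jobs['SalaryRange']:
    salary.append([part for part in re.split(r'([A-Za-z]+)', s) if part])

# Keep only the frequency label ('Monthly' or 'Annually') for each row
what = [parts[1] for parts in salary]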

In [33]:
salary[:4]

[['60000 100000', 'Annually'],
 ['4500 7000', 'Monthly'],
 ['10000 11000', 'Monthly'],
 ['4600 5600', 'Monthly']]

In [34]:
jobs['SalaryRange'] = jobs['SalaryRange'].str.replace('Monthly', '').str.replace('Annually', '').str.split(' ')

In [35]:
jobs['SalaryFreq'] = what

In [36]:
Min = []
Max=[]
for a in jobs['SalaryRange']:
    Min.append(a[0])
    Max.append(a[1])

In [37]:
jobs['Min'], jobs['Max']  = Min, Max

In [38]:
jobs.drop(['SalaryRange'], axis=1, inplace=True)

In [39]:
jobs['Min'] = jobs['Min'].astype('int64')
jobs['Max'] = jobs['Max'].astype('int64')

In [40]:
jobs.loc[jobs['SalaryFreq'] == 'Annually', 'Min'] = jobs['Min']/12
jobs.loc[jobs['SalaryFreq'] == 'Annually', 'Max'] = jobs['Max']/12

In [41]:
jobs['AvgSalary'] = (jobs['Min']+ jobs['Max'])/2

In [42]:
np.median(jobs['AvgSalary'])

6000.0

In [43]:
jobs['SalCategory'] = ''

In [44]:
jobs.loc[jobs['AvgSalary'] < 6000, 'SalCategory'] = 0.0
jobs.loc[jobs['AvgSalary'] >= 6000, 'SalCategory'] = 1.0

In [45]:
jobs['SalCategory'] = jobs['SalCategory'].astype('int64')

In [46]:
jobs

Unnamed: 0,Company,JobTitle,Location,EmploymentType,Seniority,Category,GovSupport,JobLink,PostedDate,ClosingDate,RoleResponsibility,Requirements,SalaryFreq,Min,Max,AvgSalary,SalCategory
0,INTERACTIVE DATA (SINGAPORE) PTE. LTD.,reference data sales specialist,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,/job/4444308b9cdfe083237649bb58399cda,23 Mar 2018,22 Apr 2018,ICE Data Services are looking for an innovativ...,Excellent knowledge of the Reference Data bus...,Annually,5000.000000,8333.333333,6666.666667,1
1,NTT DATA SINGAPORE PTE. LTD.,data analyst,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,/job/1808ee6a1f76b35f3e1fccc2bcbe2f2e,15 Apr 2018,15 May 2018,Job Duties & responsibilities Develop data mo...,A Bachelor’s degree in Data Scientist or Comp...,Monthly,4500.000000,7000.000000,5750.000000,0
2,CIMB BANK BERHAD,"head, data and information security",East,Contract,Professional,Banking and Finance ...,Government support available,/job/4eab9054bf3e5af132624759f99a033c,03 Apr 2018,03 May 2018,The Role serve as primary custodian across th...,"• Bachelor’s degree or equivalent experience, ...",Monthly,10000.000000,11000.000000,10500.000000,1
3,AXA DIL@ASIA PTE. LTD.,data science engineer,East,Contract,Professional,Banking and Finance ...,Government support available,/job/6bb5bd34a64c3f119dde566d77cc1976,21 Mar 2018,20 Apr 2018,"The AXA Group, a worldwide leader in Insurance...",Successful qualifier of this mission: Technic...,Monthly,4600.000000,5600.000000,5100.000000,0
4,FEHMARN CONSULTING PTE. LTD.,data architect,Central,Permanent,Senior Management,Banking and Finance,,/job/31843d4e1f993ad8949d0014d02abf40,18 Apr 2018,18 May 2018,"Our client, located in Ubi, is looking for a D...",Bachelor's Degree in Computer Science or a re...,Monthly,6000.000000,7500.000000,6750.000000,1
5,OSIM INTERNATIONAL PTE. LTD.,data architect,Central,Permanent,Senior Management,Banking and Finance,Government support available,/job/ff94655ae809d1aa0091b2b801d1589f,03 Apr 2018,03 May 2018,The Data Architect is a newly created position...,Professional skills required: Essential knowl...,Monthly,4700.000000,7000.000000,5850.000000,0
6,AXA DIL@ASIA PTE. LTD.,data scientist junior,Central,Full Time,Executive,Banking and Finance ...,Government support available,/job/557127dc1c23f59816f38e0c9c7c4fc3,21 Mar 2018,20 Apr 2018,"The AXA Group, a worldwide leader in Insurance...",Successful qualifier of this mission: Profes...,Monthly,4600.000000,5600.000000,5100.000000,0
7,JONES LANG LASALLE TECHNOLOGY SERVICES PTE. LTD.,"senior manager, data governance",Central,Full Time,Executive,Banking and Finance ...,Government support available,/job/ca75f75d752636fe19b0f7b4e0d580c2,03 Apr 2018,03 May 2018,"At Jones Lang LaSalle (JLL), we are going thro...",Experience Must have 5+ years of experience i...,Monthly,8000.000000,10000.000000,9000.000000,1
8,NTUC INCOME INSURANCE CO-OPERATIVE LTD,senior data research executive,East,Permanent,Professional,Information Technology,,/job/115abfac7a8347ba682c8b116ce59768,03 Apr 2018,03 May 2018,"As Senior Data Resarch Executive, you will foc...",Degree holder in Data Engineering or Business...,Monthly,4000.000000,7000.000000,5500.000000,0
9,PROCTER & GAMBLE EUROPE SA SINGAPORE BRANCH,senior data wrangler,East,Permanent,Professional,Information Technology,,/job/d18e1cdd4d80258ca15d53e089660f6a,02 Apr 2018,02 May 2018,Information Technology (IT) at Procter & Gambl...,We are looking for leaders who have graduated ...,Monthly,8000.000000,15000.000000,11500.000000,1


### Training and testing data

1. The X variable combines the text fields that are relevant in determining salaries.
2. Split the data into training and test sets.

In [47]:
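# Merge the text fields into one string per posting.
# Note: the fields are joined with an empty string, so adjacent fields run together
# (e.g. 'Central' followed by 'ICE Data Services...' becomes 'CentralICE Data Services...').
X = (jobs['Location']+ '' +jobs['RoleResponsibility'] + ''+ jobs['Requirements']
          + ''+ jobs['EmploymentType']+ ''+ jobs['Seniority'])
y = jobs['SalCategory']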
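
Joining the fields with an empty string fuses the last word of one field with the first word of the next, which then reaches the vectorizer as a single token. The sketch below contrasts that with a space-separated join; the row is a shortened, made-up example and `str.cat` is one possible alternative, not the notebook's method.

```python
import pandas as pd

# Toy row (assumption): shortened text standing in for a real posting
demo = pd.DataFrame({
    'Location': ['Central'],
    'RoleResponsibility': ['ICE Data Services are looking for an analyst'],
    'Seniority': ['Senior Executive'],
})

# Joining with '' fuses neighbouring fields: 'CentralICE ...'
fused = demo['Location'] + '' + demo['RoleResponsibility'] + '' + demo['Seniority']

# Joining with a space keeps the field boundaries as token boundaries
spaced = demo['Location'].str.cat(
    [demo['RoleResponsibility'], demo['Seniority']], sep=' ')
```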

In [48]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

### Vectorizing text

In [49]:
cvec = CountVectorizer(stop_words= 'english')

In [50]:
stopwords = list(cvec.get_stop_words())
stopwords

['all',
 'six',
 'less',
 'being',
 'indeed',
 'over',
 'move',
 'anyway',
 'fifty',
 'four',
 'not',
 'own',
 'through',
 'yourselves',
 'go',
 'where',
 'mill',
 'only',
 'find',
 'before',
 'one',
 'whose',
 'system',
 'how',
 'somewhere',
 'with',
 'thick',
 'show',
 'had',
 'enough',
 'should',
 'to',
 'must',
 'whom',
 'seeming',
 'under',
 'ours',
 'has',
 'might',
 'thereafter',
 'latterly',
 'do',
 'them',
 'his',
 'around',
 'than',
 'get',
 'very',
 'de',
 'none',
 'cannot',
 'every',
 'whether',
 'they',
 'front',
 'during',
 'thus',
 'now',
 'him',
 'nor',
 'name',
 'several',
 'hereafter',
 'always',
 'who',
 'cry',
 'whither',
 'this',
 'someone',
 'either',
 'each',
 'become',
 'thereupon',
 'sometime',
 'side',
 'two',
 'therein',
 'twelve',
 'because',
 'often',
 'ten',
 'our',
 'eg',
 'some',
 'back',
 'up',
 'namely',
 'towards',
 'are',
 'further',
 'beyond',
 'ourselves',
 'yet',
 'out',
 'even',
 'will',
 'what',
 'still',
 'for',
 'bottom',
 'mine',
 'since',
 '

In [51]:
# Add filler words common in job postings to the built-in English stop words
stopwords.extend(['years', 'working', 'provide', 'including', 'responsibilities',
                  'experience', 'good', 'candidates', 'ability', 'edu', 'visit'])

In [52]:
cvec = CountVectorizer(ngram_range=(2,3), stop_words=stopwords, min_df=50)
cvec.fit(X_train)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=50,
        ngram_range=(2, 3), preprocessor=None,
        stop_words=['all', 'six', 'less', 'being', 'indeed', 'over', 'move', 'anyway', 'fifty', 'four', 'not', 'own', 'through', 'yourselves', 'go', 'where', 'mill', 'only', 'find', 'before', 'one', 'whose', 'system', 'how', 'somewhere', 'with', 'thick', 'show', 'had', 'enough', 'should', 'to', 'must', 'who...de', 'including', 'responsibilities', 'experience', 'good', 'candidates', 'ability', 'edu', 'visit'],
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [53]:
X_train = cvec.transform(X_train)
X_test = cvec.transform(X_test)
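
Fitting the vectorizer on the training split only, as above, keeps test-set vocabulary out of the model. The same idea extends to cross-validation by wrapping the vectorizer and the classifier in a `Pipeline`, so the vocabulary is refit inside each fold. This is a minimal sketch on a made-up corpus; the step names and toy texts are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): stands in for the merged posting text
texts = ['machine learning research', 'deep learning models', 'big data platform',
         'data entry clerk', 'office admin support', 'customer service assistant'] * 2
labels = [1, 1, 1, 0, 0, 0] * 2

# The vectorizer is refit on the training part of every fold
pipe = Pipeline([
    ('vectorize', CountVectorizer(stop_words='english')),
    ('classify', LogisticRegression()),
])

# One accuracy score per fold
scores = cross_val_score(pipe, texts, labels, cv=3)
```

Averaging `scores` gives a less split-dependent estimate than a single train/test split.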

In [54]:
Xpd = pd.DataFrame(X_train.todense(), columns=cvec.get_feature_names())

In [55]:
Xpd.sum(axis=0).sort_values(ascending=False)

machine learning                351
computer science                290
communication skills            287
big data                        236
data science                    189
data analytics                  185
problem solving                 184
project management              154
bachelor degree                 144
able work                       140
team player                     140
data analysis                   140
degree computer                 129
degree computer science         117
work independently              114
work closely                    113
software development            101
data center                      98
analytical skills                98
best practices                   97
data management                  94
fast paced                       93
internal external                92
strong analytical                92
end end                          91
entry level                      89
business requirements            88
data mining                 
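
The totals above can also be computed directly from the sparse matrix, which avoids building a dense copy with `todense()`. The sketch below uses a three-sentence toy corpus and reads the column order from the fitted `vocabulary_` attribute; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (assumption)
corpus = ['machine learning and big data',
          'machine learning with python',
          'big data pipelines']

vec = CountVectorizer(ngram_range=(2, 2))
matrix = vec.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sort the terms into column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Column sums of the sparse matrix, flattened to a 1-D array
totals = np.asarray(matrix.sum(axis=0)).ravel()
counts = pd.Series(totals, index=terms).sort_values(ascending=False)
```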

### Fitting a logistic regression model to the vectorized text

In [56]:
log = LogisticRegression()

log.fit(X_train, y_train)
y_pred_class = log.predict(X_test)

In [57]:
# Per-class precision/recall and the confusion matrix (rows: actual, columns: predicted)
print(classification_report(y_test, y_pred_class))
cm = confusion_matrix(y_test, y_pred_class)
print(cm)

             precision    recall  f1-score   support

          0       0.62      0.69      0.66       146
          1       0.68      0.61      0.64       157

avg / total       0.65      0.65      0.65       303

[[101  45]
 [ 61  96]]
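
For the bonus problem, one option is to raise the probability threshold at which a posting is labelled high salary, so that fewer low-salary jobs are wrongly predicted as high. The sketch below illustrates this on synthetic data from `make_classification` (an assumption standing in for the vectorized postings); the 0.7 threshold and the helper `errors_at` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized postings (assumption, not the real data)
X_demo, y_demo = make_classification(n_samples=600, n_features=20, random_state=1)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba_high = clf.predict_proba(Xte)[:, 1]  # estimated P(high salary)

def errors_at(threshold):
    """Return (false positives, false negatives) at a given probability threshold."""
    pred = (proba_high >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(yte, pred, labels=[0, 1]).ravel()
    return fp, fn

fp_default, fn_default = errors_at(0.5)  # threshold used by predict()
fp_strict, fn_strict = errors_at(0.7)    # stricter: fewer 'high' predictions

# Points of the ROC curve and the area under it
fpr, tpr, thresholds = roc_curve(yte, proba_high)
auc = roc_auc_score(yte, proba_high)
```

The tradeoff is that a stricter threshold can only hold or lower the false positives while holding or raising the false negatives. Calling `plt.plot(fpr, tpr)` draws the ROC curve, on which each threshold is one point.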


### Fitting a random forest model to the vectorized text

In [58]:
RF_clf = RandomForestClassifier(n_estimators=15)
RF_clf = RF_clf.fit(X_train, y_train)
predicted = RF_clf.predict(X_test)


print(classification_report(y_test, predicted))
cm = confusion_matrix(y_test, predicted)
print(cm)

             precision    recall  f1-score   support

          0       0.63      0.67      0.65       146
          1       0.67      0.63      0.65       157

avg / total       0.65      0.65      0.65       303

[[98 48]
 [58 99]]
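
To see which terms drive the predictions, the fitted models can be inspected: a logistic regression exposes one coefficient per term in `coef_`, and a random forest exposes `feature_importances_`. The sketch below ranks logistic regression coefficients on a made-up corpus; the texts and labels are assumptions, not results from the scraped data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus (assumption): 1 = high salary, 0 = low salary
docs = ['machine learning research', 'machine learning engineer', 'deep learning machine',
        'data entry clerk', 'data entry assistant', 'entry level clerk']
labels = [1, 1, 1, 0, 0, 0]

vec = CountVectorizer()
features = vec.fit_transform(docs)
model = LogisticRegression().fit(features, labels)

# vocabulary_ maps each term to its column index; sort the terms into column order
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Positive weights push towards the high-salary class, negative towards low
weights = pd.Series(model.coef_[0], index=terms).sort_values()
```

The head and tail of `weights` list the terms most associated with the low and high classes respectively.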


## Question 2: Factors that distinguish job category

In [59]:
# Concat X variables into the dataframe
jobs['MergedTxt'] = X

In [60]:
jobs

Unnamed: 0,Company,JobTitle,Location,EmploymentType,Seniority,Category,GovSupport,JobLink,PostedDate,ClosingDate,RoleResponsibility,Requirements,SalaryFreq,Min,Max,AvgSalary,SalCategory,MergedTxt
0,INTERACTIVE DATA (SINGAPORE) PTE. LTD.,reference data sales specialist,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,/job/4444308b9cdfe083237649bb58399cda,23 Mar 2018,22 Apr 2018,ICE Data Services are looking for an innovativ...,Excellent knowledge of the Reference Data bus...,Annually,5000.000000,8333.333333,6666.666667,1,CentralICE Data Services are looking for an in...
1,NTT DATA SINGAPORE PTE. LTD.,data analyst,Central,Permanent ...,Senior Executive,Banking and Finance ...,Government support available,/job/1808ee6a1f76b35f3e1fccc2bcbe2f2e,15 Apr 2018,15 May 2018,Job Duties & responsibilities Develop data mo...,A Bachelor’s degree in Data Scientist or Comp...,Monthly,4500.000000,7000.000000,5750.000000,0,CentralJob Duties & responsibilities Develop ...
2,CIMB BANK BERHAD,"head, data and information security",East,Contract,Professional,Banking and Finance ...,Government support available,/job/4eab9054bf3e5af132624759f99a033c,03 Apr 2018,03 May 2018,The Role serve as primary custodian across th...,"• Bachelor’s degree or equivalent experience, ...",Monthly,10000.000000,11000.000000,10500.000000,1,EastThe Role serve as primary custodian acros...
3,AXA DIL@ASIA PTE. LTD.,data science engineer,East,Contract,Professional,Banking and Finance ...,Government support available,/job/6bb5bd34a64c3f119dde566d77cc1976,21 Mar 2018,20 Apr 2018,"The AXA Group, a worldwide leader in Insurance...",Successful qualifier of this mission: Technic...,Monthly,4600.000000,5600.000000,5100.000000,0,"EastThe AXA Group, a worldwide leader in Insur..."
4,FEHMARN CONSULTING PTE. LTD.,data architect,Central,Permanent,Senior Management,Banking and Finance,,/job/31843d4e1f993ad8949d0014d02abf40,18 Apr 2018,18 May 2018,"Our client, located in Ubi, is looking for a D...",Bachelor's Degree in Computer Science or a re...,Monthly,6000.000000,7500.000000,6750.000000,1,"CentralOur client, located in Ubi, is looking ..."
5,OSIM INTERNATIONAL PTE. LTD.,data architect,Central,Permanent,Senior Management,Banking and Finance,Government support available,/job/ff94655ae809d1aa0091b2b801d1589f,03 Apr 2018,03 May 2018,The Data Architect is a newly created position...,Professional skills required: Essential knowl...,Monthly,4700.000000,7000.000000,5850.000000,0,CentralThe Data Architect is a newly created p...
6,AXA DIL@ASIA PTE. LTD.,data scientist junior,Central,Full Time,Executive,Banking and Finance ...,Government support available,/job/557127dc1c23f59816f38e0c9c7c4fc3,21 Mar 2018,20 Apr 2018,"The AXA Group, a worldwide leader in Insurance...",Successful qualifier of this mission: Profes...,Monthly,4600.000000,5600.000000,5100.000000,0,"CentralThe AXA Group, a worldwide leader in In..."
7,JONES LANG LASALLE TECHNOLOGY SERVICES PTE. LTD.,"senior manager, data governance",Central,Full Time,Executive,Banking and Finance ...,Government support available,/job/ca75f75d752636fe19b0f7b4e0d580c2,03 Apr 2018,03 May 2018,"At Jones Lang LaSalle (JLL), we are going thro...",Experience Must have 5+ years of experience i...,Monthly,8000.000000,10000.000000,9000.000000,1,"CentralAt Jones Lang LaSalle (JLL), we are goi..."
8,NTUC INCOME INSURANCE CO-OPERATIVE LTD,senior data research executive,East,Permanent,Professional,Information Technology,,/job/115abfac7a8347ba682c8b116ce59768,03 Apr 2018,03 May 2018,"As Senior Data Resarch Executive, you will foc...",Degree holder in Data Engineering or Business...,Monthly,4000.000000,7000.000000,5500.000000,0,"EastAs Senior Data Resarch Executive, you will..."
9,PROCTER & GAMBLE EUROPE SA SINGAPORE BRANCH,senior data wrangler,East,Permanent,Professional,Information Technology,,/job/d18e1cdd4d80258ca15d53e089660f6a,02 Apr 2018,02 May 2018,Information Technology (IT) at Procter & Gambl...,We are looking for leaders who have graduated ...,Monthly,8000.000000,15000.000000,11500.000000,1,EastInformation Technology (IT) at Procter & G...


In [61]:
# Split the data into data scientist jobs and all other jobs (exact title match)

ds = jobs[jobs['JobTitle']=='data scientist']
nds = jobs[jobs['JobTitle']!='data scientist']

In [62]:
Xds = ds['MergedTxt'] 
cvec2 = CountVectorizer(ngram_range=(2,2), stop_words = 'english', min_df=10)
Xds = cvec2.fit_transform(Xds)

In [63]:
# Most frequent bigrams in data scientist postings (bigrams found in at least 10 postings)
pd.DataFrame(Xds.todense(), columns=cvec2.get_feature_names()).sum(axis=0).sort_values(ascending=False)

machine learning      60
data science          34
computer science      28
big data              22
data mining           19
data sets             15
experience working    12
dtype: int64

In [69]:
# Most frequent bigrams in the other postings (bigrams found in at least 180 postings)
Xnds = nds['MergedTxt'] 
cvec2 = CountVectorizer(ngram_range=(2,2), stop_words = 'english', min_df=180)
Xnds = cvec2.fit_transform(Xnds)
pd.DataFrame(Xnds.todense(), columns=cvec2.get_feature_names()).sum(axis=0).sort_values(ascending=False)

machine learning        408
communication skills    382
computer science        361
years experience        359
big data                301
problem solving         248
ability work            232
bachelor degree         201
dtype: int64
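
The two lists above use different `min_df` cut-offs and the groups appear to differ in size, so the raw counts are hard to compare directly. One alternative is to fit a single vocabulary on all postings and compare the share of postings in each group that contain each bigram. This is a minimal sketch on made-up text; `doc_share` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy postings (assumption)
ds_docs = ['machine learning models', 'machine learning research']
other_docs = ['strong communication skills', 'machine learning and communication skills']

# binary=True records presence or absence, so column means are document shares
vec = CountVectorizer(ngram_range=(2, 2), binary=True)
vec.fit(ds_docs + other_docs)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

def doc_share(docs):
    """Fraction of documents in `docs` that contain each bigram."""
    return np.asarray(vec.transform(docs).mean(axis=0)).ravel()

comparison = pd.DataFrame(
    {'data_scientist': doc_share(ds_docs), 'other': doc_share(other_docs)},
    index=terms)

# Positive values mark bigrams that are more typical of data scientist postings
comparison['difference'] = comparison['data_scientist'] - comparison['other']
```

Sorting `comparison` by `difference` surfaces the bigrams that best separate the two groups.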