# Finding Types of Experience from Adzuna Job Ads

We're going to use linguistic features to extract the types of experience commonly required for jobs from job ads. I'm not exactly sure what I mean by "types of experience"; we're going to let the data decide that!

We'll end up with a list of skills, and some relationships between skills that occur together.

Import libraries and data

In [1]:
import re
import pandas as pd
import spacy
from spacy.util import filter_spans
from spacy.tokens import Span
from spacy.matcher import Matcher

In [2]:
spacy.__version__

'2.2.3'

In [3]:
from spacy import displacy
from IPython.display import HTML, display

Get the data from the [Adzuna Job Salary Prediction Kaggle Competition](https://www.kaggle.com/c/job-salary-prediction), put it in the `data` subfolder and unzip all the files.

You can do this manually, or use the [Kaggle API](https://github.com/Kaggle/kaggle-api) (once you've installed the API, downloaded your `kaggle.json` file and agreed to the competition rules).

In [4]:
# for split, ext in [('Test', 'zip'), ('Train', 'zip'), ('Valid', 'csv')]:
#     !kaggle competitions download -c job-salary-prediction --path data/ -f {split}_rev1.{ext}
    
# !find data/ -name '*.zip' -execdir unzip '{}' ';'
# !find data/ -name '*.zip' -exec rm '{}' ';'

# !ls data/

Read in all the data to a single dataframe

In [5]:
%%time
dfs = []
for split in ['Train', 'Valid', 'Test']:
    dfs.append(pd.read_csv(f'data/{split}_rev1.csv').assign(split=split))
df = pd.concat(dfs, sort=False, ignore_index=True)
del dfs

CPU times: user 6.55 s, sys: 2.8 s, total: 9.34 s
Wall time: 16.1 s


Train/Valid/Test is in the ratio 6:1:3, with about 400k ads in total

In [6]:
df.split.value_counts()

Train    244768
Test     122463
Valid     40663
Name: split, dtype: int64
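The 6:1:3 ratio can be checked directly from the counts above. This is a minimal sketch that only uses the numbers printed by `value_counts()`; nothing is read from disk.

```python
# Counts copied from the value_counts() output above
counts = {'Train': 244768, 'Test': 122463, 'Valid': 40663}

total = sum(counts.values())
shares = {split: n / total for split, n in counts.items()}

print(total)
for split, share in shares.items():
    print(f'{split:<6}{share:.1%}')
```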

We're mainly interested in the ad content where the skills will be; that's the `FullDescription`

In [7]:
df

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,split
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000.0,cv-library.co.uk,Train
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000.0,cv-library.co.uk,Train
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Engineering Systems Analyst / Mathematical Mod...,"Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,25000 - 30000/annum 25K-30K negotiable,27500.0,cv-library.co.uk,Train
4,12613647,"Pioneer, Miser Engineering Systems Analyst","Pioneer, Miser Engineering Systems Analyst Do...","Surrey, South East, South East",Surrey,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000.0,cv-library.co.uk,Train
...,...,...,...,...,...,...,...,...,...,...,...,...,...
407889,72703426,Foreign Exchange Consultant Worcestershire,Do you have foreign exchange cashier experienc...,Worcestershire,Worcestershire,full_time,permanent,Travel Trade Recruitment,Travel Jobs,,,jobs.travelweekly.co.uk,Test
407890,72703453,Senior Business Travel Consultant,Senior Business Travel Consultant Birmingham ...,Birmingham,Birmingham,full_time,permanent,AA Appointments,Travel Jobs,,,jobs.travelweekly.co.uk,Test
407891,72705210,TEACHER OF MATHS,Position: Qualified Teacher Subject/Specialism...,Swindon,Swindon,,contract,,Teaching Jobs,,,hays.co.uk,Test
407892,72705214,Welsh Speaking Teaching Assistant Job,Hays Education currently have a job for a Wels...,Cardiff,Cardiff,,contract,,Teaching Jobs,,,hays.co.uk,Test


Extract the ads into a list

In [8]:
ads = list(df.FullDescription)

Initialise the spaCy model

In [9]:
nlp = spacy.load('en_core_web_lg')

## Extracting from job ads

Let's look at sentences in job ads containing the word 'experience'.

Experience is a common word, but used in a few different ways:

*    has experience with a tool/using a skill/in a system
*    providing an experience to customers
*    this job will give you experience

We're interested in the first kind which occurs in a few different ways:

* {type of} experience ...
* experience in {field}

Let's look at extracting them

In [10]:
def highlight_terms(terms, texts):
    for doc in nlp.pipe(texts):
        for sentence in set([tok.sent for tok in doc if tok.lower_ in terms]):
            text = sentence.text.strip()
            markup = re.sub(fr'(?i)\b({"|".join(terms)})\b', r'<strong>\1</strong>', text)
            display(HTML(markup))

Note that you can already see some problems with the way the text was cleaned: the list structure is gone and hyphens have been removed (so "35 years experience" is probably "3-5 years experience").

In [11]:
highlight_terms(['experience'], ads[:10])
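To see what the highlighting regex does without loading a spaCy model, here is a small standalone sketch of the same substitution. The helper name `mark_terms` and the sample sentence are made up for illustration, and `re.escape` is an addition (the original joins the terms as-is, which is fine for plain words like "experience").

```python
import re

def mark_terms(terms, text):
    # Escape each term so that regex metacharacters (e.g. a dot) are
    # treated literally; this is an addition to the notebook's version.
    alternatives = '|'.join(re.escape(t) for t in terms)
    pattern = rf'(?i)\b({alternatives})\b'
    # Wrap every case-insensitive whole-word match in <strong> tags
    return re.sub(pattern, r'<strong>\1</strong>', text)

marked = mark_terms(['experience'], 'Experience of Pioneer software; 35 years experience.')
print(marked)
```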

## Helper functions

Let's take a variety of informative examples to test extractions on

In [12]:
examples = [
    'They will need someone who has at least 1015 years of subsea cable engineering experience',
    'This position is ideally suited to high calibre engineering graduate with significant and appropriate post graduate experience.',
    'Aerospace industry experience would be advantageous covering aerostructures and/or aero engines.',
    'A sufficient and appropriate level of building services and controls experience gained within a client organisation, engineering consultancy or equipment supplier.',
    
    'Experience in Modelling and Simulation Techniques',
    'Any experience of Pioneer or Miser software would be an advantage.',
    'For this role, you must have a minimum of 10 years experience in subsea engineering, pipelines design or construction.',
    'Has experience within the quality department of a related company in a similar role Ideally from a mechanical or manufacturing engineering background.',
    'and have experience of the technical leadership of projects to time, quality and cost objectives.',
    'Experience of protection and control design at Transmission and Distribution voltages.',
    'Candidates with experience in telesales, callcentre, customer service, receptionist or travel are ideal for this role',
    'Experience dealing with business clients (B2B) would be preferable.',
    'Previous experience working as a Chef de Partie in a one AA Rosette hotel is needed for the position.',
    'The post holder must hold as a minimum Level 1 in Trampolining (British Gymnastics) and have experience in working with children, be fun, outgoing and have excellent customer service skills and be able to instruct in line with the British Gymnastics syllabus.',
    'Experience of techniques such as Discrete Event Simulation and/or SD modelling Mathematical/scientific background',

    
]

We could look to extract:

* a series of nouns before the word experience (e.g. "subsea cable engineering experience"); or
* experience as/in something (e.g. "experience as a Chef de Partie")

We'll do this using [spaCy's rule-based Matcher](https://spacy.io/usage/rule-based-matching).

In [13]:
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'NOUN', 'OP': '+'}, {'LOWER': 'experience'}]
matcher.add('experience_noun', [pattern])

pattern = [{'LOWER': 'experience'}, {'POS': 'ADP'}, {'POS': {'IN': ('DET', 'NOUN', 'PROPN')}, 'OP': '+'}]
matcher.add('experience_adp', [pattern])

In [14]:
doc = nlp(examples[0])
matcher(doc)

[(12285600890577657150, 13, 15), (12285600890577657150, 12, 15)]

Here we have a little helper function to visualise extractions.

In [15]:
def show_extraction(examples, *extractors):
    seen = set()
    for doc in nlp.pipe(examples):
        doc.ents = filter_spans([Span(doc, start, end, label) for extractor in extractors for label, start, end in extractor(doc)])
        for tok in doc:
            if tok.lower_ == 'experience':
                sentence = tok.sent
                if sentence.text in seen:
                    continue
                seen.update([sentence.text])
                if not sentence.ents:
                    doc.ents = list(doc.ents) + [Span(doc, tok.i, tok.i+1, 'MISSING')]
                displacy.render(sentence, style='ent', options = {'colors': {'MISSING': 'pink',
                                                                            'EXPERIENCE': 'lightgreen'}})
                

This is on the right track, but doesn't always pick up the appropriate context.

In [16]:
show_extraction(examples, matcher)

We can then extract them from a document.

Note the use of `filter_spans`; this ensures if we have overlapping spans we only take the largest one.

In [17]:
def get_extractions(examples, *extractors):
    # Could use context instead of enumerate
    for idx, doc in enumerate(nlp.pipe(examples, batch_size=100, disable=['ner'])):
        for ent in filter_spans([Span(doc, start, end, label) for extractor in extractors for label, start, end in extractor(doc)]):
            sent = ent.root.sent
            yield ent.text, idx, ent.start, ent.end, ent.label_, sent.start, sent.end

In [18]:
list(get_extractions(ads[:3], matcher))

[('experience in a', 1, 150, 153, 'experience_adp', 122, 164),
 ('years experience', 2, 45, 47, 'experience_noun', 16, 48),
 ('decision support models Experience', 2, 92, 96, 'experience_noun', 79, 118),
 ('Experience of techniques', 2, 102, 105, 'experience_adp', 79, 118)]
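To illustrate the "keep the largest span" behaviour, here is a plain-Python approximation that works on `(start, end)` tuples instead of spaCy `Span` objects. It is a sketch of the idea (longest first, skip anything that overlaps a span already kept), not spaCy's actual implementation, and `keep_longest_spans` is a hypothetical name.

```python
def keep_longest_spans(spans):
    """Greedy overlap filter on (start, end) token offsets, end exclusive."""
    # Longest spans first; ties go to the span that starts earlier
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen = [], set()
    for start, end in ordered:
        if not seen.intersection(range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    # Return in document order
    return sorted(kept)

# The two overlapping matches from the matcher output earlier, plus a
# non-overlapping span added for illustration
filtered = keep_longest_spans([(13, 15), (12, 15), (20, 22)])
print(filtered)
```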

Put it in a dataframe and join with the job metadata

In [19]:
def extract_df(*extractors, n_max=None, **kwargs):
    if n_max is None:
        n_max = len(df)
    ent_df = pd.DataFrame(list(get_extractions(df[:n_max].FullDescription, *extractors)),
                          columns=['text', 'docidx', 'start', 'end', 'label', 'sent_start', 'sent_end'])
    return ent_df.merge(df, how='left', left_on='docidx', right_index=True)

In [20]:
%time ent_df = extract_df(matcher, n_max=1000)
ent_df.head()

CPU times: user 16.2 s, sys: 11.7 s, total: 27.9 s
Wall time: 1min 18s


Unnamed: 0,text,docidx,start,end,label,sent_start,sent_end,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,split
0,experience in a,1,150,153,experience_adp,122,164,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000.0,cv-library.co.uk,Train
1,years experience,2,45,47,experience_noun,16,48,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
2,decision support models Experience,2,92,96,experience_noun,79,118,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
3,Experience of techniques,2,102,105,experience_adp,79,118,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
4,experience within the Water industry,5,117,122,experience_adp,71,127,13179816,Engineering Systems Analyst Water Industry,Engineering Systems Analyst Water Industry Loc...,"Dorking, Surrey, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20K to 30K,25000.0,cv-library.co.uk,Train


Aggregate the counts of different texts.

It's more significant if it happens across multiple Advertisers/Sources.

In [21]:
def aggregate_df(df, col=['text']):
    return (df
            .groupby(col)
            .agg(n_company=('Company', 'nunique'),
                 n_ad=('Id', 'nunique'),
                 n_source=('SourceName', 'nunique'),
                 n=('Id', 'count'))
            .reset_index()
            .sort_values(['n_company', 'n_ad', 'n'], ascending=False)
        )
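The reason for sorting by distinct companies before raw counts can be shown with a toy example in plain Python. The rows below are invented for illustration; the point is that a phrase repeated many times by one advertiser ranks below a phrase used by several advertisers.

```python
from collections import defaultdict

# Hypothetical extractions: (text, company, ad_id)
rows = [
    ('sales', 'Acme', 1),
    ('sales', 'Acme', 1),    # same phrase twice in one ad
    ('sales', 'Acme', 2),
    ('sales', 'Bolt', 3),
    ('rosette', 'Acme', 4),
    ('rosette', 'Acme', 5),
    ('rosette', 'Acme', 6),
    ('rosette', 'Acme', 7),
    ('rosette', 'Acme', 8),
]

companies = defaultdict(set)   # distinct companies per text (n_company)
ads_seen = defaultdict(set)    # distinct ads per text (n_ad)
n = defaultdict(int)           # raw number of extractions (n)
for text, company, ad_id in rows:
    companies[text].add(company)
    ads_seen[text].add(ad_id)
    n[text] += 1

# Sort by (n_company, n_ad, n) descending, mirroring aggregate_df
summary = sorted(
    ((text, len(companies[text]), len(ads_seen[text]), n[text]) for text in n),
    key=lambda row: row[1:],
    reverse=True,
)
print(summary)
```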

Unfortunately, these simple rules give mixed results

In [22]:
aggregate_df(ent_df).head(10)

Unnamed: 0,text,n_company,n_ad,n_source,n
119,experience in a,4,52,5,52
286,years experience,3,22,3,22
233,management experience,2,17,3,18
69,banqueting experience,2,2,1,2
87,design experience,2,2,1,2
88,development experience,2,2,1,2
196,experience within a,1,7,2,7
260,rosette experience,1,5,2,5
176,experience of the,1,4,2,4
142,experience in software development,1,3,1,3


Let's add some tooling to look at specific cases

In [23]:
def showent(docidx, start, end, label, sent_start, sent_end, **kwargs):
    # We don't need to parse it, so just make_doc
    doc = nlp.make_doc(ads[docidx])
    doc.ents = [Span(doc, start, end, label)]
    sent = doc[sent_start:sent_end]
    displacy.render(sent, style='ent')
    
def showent_df(df):
    for idx, row in df.iterrows():
        showent(**row)

We can see that we've actually missed the subject entirely: the match stops at "a", before the field the experience is in.

We could be a bit more clever and use some structure from the grammar to extract what we need.

In [24]:
showent_df(ent_df.query('text == "experience in a"').head())

### Extracting types of experience

Let's extract some examples of {type of} experience

Here's a rough rule to extract the phrase to the left of the word 'experience' using spaCy's `noun_chunks`, which are based on the syntactic structure (see [`spacy.lang.en.syntax_iterators`](https://github.com/explosion/spaCy/blob/master/spacy/lang/en/syntax_iterators.py#L7))

In [25]:
def extract_noun_phrase_experience(doc):
    for np in doc.noun_chunks:
        if np[-1].lower_ == 'experience':
            if len(np) > 1:
                yield 'EXPERIENCE', np[0].i, np[-1].i

Notice how our rule picks out the right amount of context like "subsea cable engineering".

However we're also picking up quantifiers like "Any" and "10 years"

In [26]:
show_extraction(examples, extract_noun_phrase_experience)

Let's look at how this does across a larger sample of job ads:

* There are sentence/word boundary errors that cause the rule to break (e.g. powerful decision support models Experience)
* We pick up qualifiers (previous, some, appropriate), as well as time quantities (3-5 years, 10 years)

In [27]:
show_extraction(ads[:10], extract_noun_phrase_experience)

In [28]:
%time ent_df = extract_df(extract_noun_phrase_experience, n_max=50000)

CPU times: user 14min 4s, sys: 6min 43s, total: 20min 48s
Wall time: 25min 59s


Again we are frequently picking up quantifiers

In [29]:
aggregate_df(ent_df).head(50)

Unnamed: 0,text,n_company,n_ad,n_source,n
13558,previous,894,2040,100,2128
5745,Previous,883,1745,100,1829
6321,Proven,260,468,73,515
15315,some,229,355,65,367
17077,your,221,864,70,874
11492,extensive,220,381,68,389
14616,relevant,208,398,62,405
14298,proven,189,301,63,305
10820,demonstrable,142,217,54,222
16178,the,138,303,56,313


In [30]:
showent_df(ent_df.query("text=='Previous'").head(5))

This looks like a bad parse (probably because of the stripped whitespace)

In [31]:
showent_df(ent_df.query("text=='Skills'").head(5))

We can blacklist the most common qualifiers

In [32]:
experience_qualifiers = ['previous', 'prior', 'following', 'recent', 'the above', 'past',
                         
                         'proven', 'demonstrable', 'demonstrated', 'relevant', 'significant', 'practical',
                         'essential', 'equivalent', 'desirable', 'required', 'considerable', 'similar',
                         'working', 'specific', 'qualified', 'direct', 'hands on', 'handson', 
                         
                         'strong', 'solid', 'good', 'substantial', 'excellent', 'the right', 'valuable', 'invaluable',
                         
                         'some', 'any', 'none', 'much', 'extensive', 'no', 'more',
                         'your', 'their',
                         'years', 'months',
                         'uk',
                        ]

stopwords = ['a', 'an', '*', '**', '•', 'this', 'the', ':', 'Skills']

experience_qualifier_pattern = rf'\b(?:{"|".join(experience_qualifiers)})\b'

experience_qualifier_pattern

'\\b(?:previous|prior|following|recent|the above|past|proven|demonstrable|demonstrated|relevant|significant|practical|essential|equivalent|desirable|required|considerable|similar|working|specific|qualified|direct|hands on|handson|strong|solid|good|substantial|excellent|the right|valuable|invaluable|some|any|none|much|extensive|no|more|your|their|years|months|uk)\\b'
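A quick standalone check of how this pattern behaves, using a shortened subset of the qualifier list and some made-up candidate phrases. Two things are worth noticing: the `\b` word boundaries stop "no" from matching inside "technology" or "knowledge", and a phrase is dropped entirely if it contains a qualifier anywhere.

```python
import re

# A shortened, illustrative subset of experience_qualifiers
qualifiers = ['previous', 'proven', 'no', 'years', 'hands on']
pattern = re.compile(rf'\b(?:{"|".join(qualifiers)})\b')

# Made-up candidate phrases
candidates = ['Previous', 'proven sales', '3 years', 'sales',
              'technology', 'knowledge', 'hands on']

# Same logic as the pandas filter: lowercase, then search for any qualifier
kept = [c for c in candidates if not pattern.search(c.lower())]
print(kept)
```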

If we ignore the qualifiers and stopwords we start getting some skills out:

* sales
* commercial
* management
* supervisory
* customer service
* development
* technical
* telesales
* financial services
* design
* project management
* retail
* business sales
* SQL
* marketing
* people management
* SAP
* engineering

In [33]:
aggregate_df(ent_df[(~ent_df.text.str.lower().str.contains(experience_qualifier_pattern)) & # Not a qualifier
                     ~ent_df.text.isin(stopwords)]).head(50)

Unnamed: 0,text,n_company,n_ad,n_source,n
8436,sales,117,201,35,204
6108,commercial,96,167,33,171
7445,management,61,104,36,106
8733,supervisory,56,105,31,106
4013,Sales,54,98,23,98
7187,industry,52,82,29,83
1311,Commercial,43,68,20,71
9040,the customer,43,65,27,66
6359,customer service,42,58,23,69
9463,work,39,61,29,62


Commercial is more of a qualifier

In [34]:
showent_df(ent_df.query("text=='Commercial'").head(5))

Management experience seems correct

In [35]:
showent_df(ent_df.query("text=='Management'").head(5))

In [36]:
showent_df(ent_df.query("text=='Financial Services'").head(5))

In [37]:
showent_df(ent_df.query("text=='SQL'").head(5))

In [38]:
showent_df(ent_df.query("text=='development'").head(5))

### Extracting experience in a field

Another way experience is commonly stated is with an adposition

>  experience in/with modelling

For example

In [39]:
doc = nlp('Experience of protection and control design at Transmission and Distribution voltages.')
displacy.render([doc], style='dep', jupyter=True)

We extract the experience by looking to the right for a preposition (e.g. in/with) and then looking for its object and extracting the whole left subtree.

This is obviously quite specific to English.

In [40]:
def extract_adp_experience(doc, label='EXPERIENCE'):
    for tok in doc:
        if tok.lower_ == 'experience':
            for child in tok.rights:
                if child.dep_ == 'prep':
                    for obj in child.children:
                        if obj.dep_ == 'pobj':
                            yield label, obj.left_edge.i, obj.i+1

This works very well! All of our examples are specific.

Notice that:

* We're missing conjunctions: we get experience in subsea engineering, but miss "pipelines design" and "construction"
* We miss elaborations: "such as Discrete Event Simulation  ..."
* We miss experience in actions (experience in working with children)

In [41]:
show_extraction(examples, extract_adp_experience)
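To make the traversal in `extract_adp_experience` concrete without loading a model, here is a sketch on a hand-annotated dependency tree. The `Tok` class, the helper functions and the parse itself are assumptions written by hand for illustration; a real parse from spaCy may differ.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int
    text: str
    dep: str
    head: int  # index of the syntactic head; the root points at itself

# Hand-annotated parse (an assumption, not model output) of
# "experience in subsea engineering"
toks = [
    Tok(0, 'experience', 'ROOT', 0),
    Tok(1, 'in', 'prep', 0),
    Tok(2, 'subsea', 'compound', 3),
    Tok(3, 'engineering', 'pobj', 1),
]

def children(toks, tok):
    return [t for t in toks if t.head == tok.i and t.i != tok.i]

def left_edge(toks, tok):
    # Leftmost token index in tok's subtree
    return min([tok.i] + [left_edge(toks, c) for c in children(toks, tok)])

def extract(toks):
    for tok in toks:
        if tok.text.lower() == 'experience':
            # Only look at children to the right of 'experience'
            for child in children(toks, tok):
                if child.i > tok.i and child.dep == 'prep':
                    for obj in children(toks, child):
                        if obj.dep == 'pobj':
                            yield left_edge(toks, obj), obj.i + 1

spans = list(extract(toks))
phrases = [' '.join(t.text for t in toks[s:e]) for s, e in spans]
print(spans, phrases)
```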

An alternative strategy would be to look for a phrase like "Experience in/with/using" and then look for the noun phrase

Using spaCy's noun chunks we can do it backwards (I'm sure there's an easy way to do it forwards which could be quicker, but it's nice using spaCy's noun chunks directly):

In [42]:
def extract_adp_experience_2(doc):
    for np in doc.noun_chunks:
        start_tok = np[0].i
        if start_tok >= 2 and doc[start_tok - 2].lower_ == 'experience' and doc[start_tok - 1].pos_ == 'ADP':
            yield 'EXPERIENCE', start_tok, start_tok + len(np)

In [43]:
show_extraction(examples, extract_adp_experience_2)

Comparing speeds on a small sample, the two approaches take about the same time:

In [44]:
%time ent_adp_df = extract_df(extract_adp_experience, n_max=50)

CPU times: user 953 ms, sys: 266 ms, total: 1.22 s
Wall time: 1.28 s


In [45]:
%time ent_adp_df = extract_df(extract_adp_experience_2, n_max=50)

CPU times: user 984 ms, sys: 203 ms, total: 1.19 s
Wall time: 1.19 s


Extracting 50k results

In [46]:
%time ent_adp_df = extract_df(extract_adp_experience, n_max=50000)

CPU times: user 13min 29s, sys: 6min 32s, total: 20min 1s
Wall time: 25min


In [47]:
aggregate_df(ent_adp_df).head(50)

Unnamed: 0,text,n_company,n_ad,n_source,n
6342,a similar role,213,456,60,461
13767,the following,130,256,40,261
12074,sales,77,103,37,106
11041,one,55,80,30,85
13645,the design,53,74,31,83
14355,the use,49,71,28,72
8748,design,47,72,25,76
526,C,46,86,18,87
12167,selling,43,57,15,60
14498,this role,42,86,25,87


The extraction works pretty well for "sales" (although the last example should be "sales interviewing skills")

In [48]:
showent_df(ent_adp_df.query("text=='sales'").head(5))

Selling works as well, but we lose the context of *who* they are selling to

In [49]:
showent_df(ent_adp_df.query("text=='selling'").head(5))

In [50]:
showent_df(ent_adp_df.query("text=='design'").head(5))

In [51]:
showent_df(ent_adp_df.query("text=='C'").head(5))

We're often getting "a" because of bad tokenization

In [52]:
showent_df(ent_adp_df.query("text=='a'").head(5))

In [58]:
def highlight_text_context(terms, texts, n_before=1, n_after=2):
    for doc in nlp.pipe(texts):
        sentences = list(doc.sents)
        # Indices of the sentences that mention any of the terms
        idxs = [i for i, sent in enumerate(sentences) if any(term in sent.text.lower() for term in terms)]
        
        for idx in idxs:
            # Up to n_before sentences of leading context
            before = ''.join(sent.text for sent in sentences[max(idx-n_before, 0):idx])
            # Up to n_after sentences of trailing context
            after = ''.join(sent.text for sent in sentences[idx+1:min(idx+n_after+1, len(sentences))])
            text = sentences[idx].text
            markup = re.sub(fr'(?i)\b({"|".join(terms)})\b', r'<strong>\1</strong>',
                                 f'<span style="color:blue">{text}</span>')
            display(HTML(before + markup + after))

The term "a" occurs mostly due to bad parsing because all numbers have been replaced with `****`

In [59]:
terms = ['experience']

for _, q in ent_adp_df.query("text=='a'").head(7).iterrows():
    doc = nlp(q.FullDescription)
    if q.sent_start > 0:
        prev_sent = doc[q.sent_start - 1].sent.text
    else:
        prev_sent = ''
    
    if q.sent_end < len(doc):
        next_sent = doc[q.sent_end].sent.text
    else:
        next_sent = ''
        
    text = doc[q.sent_start:q.sent_end].text
    markup = re.sub(fr'(?i)\b({"|".join(terms)})\b', r'<strong>\1</strong>',
                     f'<span style="color:blue">{text}</span>')
    display(HTML(prev_sent + markup + next_sent))

This is an interesting case where our heuristic extraction rule hasn't captured the complexity

In [60]:
displacy.render(nlp('Recent care experience within a Nursing Home or Care Home Environment'))

### Expanding conjunctions

It would be useful to get each form of experience in long lists:

In [61]:
doc = nlp("Candidates with experience in telesales, callcentre, customer service, receptionist or travel are ideal for this role.")
doc

Candidates with experience in telesales, callcentre, customer service, receptionist or travel are ideal for this role.

In [62]:
displacy.render(doc)

In [63]:
span = doc[4:5]
span

telesales

This function is a very crude approximation of spaCy's `noun_chunks`, to get an approximate noun phrase

In [64]:
def get_left_span(tok, label='', include=True):
    offset = 1 if include else 0
    idx = tok.i
    while idx > tok.left_edge.i:
        if tok.doc[idx - 1].pos_ in ('NOUN', 'PROPN', 'ADJ', 'X'):
            idx -= 1
        else:
            break
    return label, idx, tok.i+offset

In [65]:
get_left_span(nlp('The Subsea pipeline engineering')[-1])

('', 1, 4)

In [66]:
get_left_span(span.root)

('', 4, 5)

This function follows the `conj` dependencies from a token to collect every item in the conjunction

In [67]:
def get_conjugations(tok):
    new = [tok]
    while new:
        tok = new.pop()
        yield tok
        for child in tok.children:
            if child.dep_ == 'conj':
                new.append(child)

In [68]:
list(get_conjugations(span.root))

[telesales, callcentre, service, receptionist, travel]

And we then expand them by getting the left span

In [69]:
[doc[start:end] for label, start, end in [get_left_span(tok) for tok in get_conjugations(span.root)]]

[telesales, callcentre, customer service, receptionist, travel]
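The two steps above (follow `conj` links, then grow each item leftwards over nouns and adjectives) can be sketched on a hand-annotated parse, so it runs without a model. The `Tok` class, the helpers and the annotations are assumptions for illustration: each list item is attached to the previous one via `conj`, which is a guess at how the parser chains them.

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int
    text: str
    pos: str
    dep: str
    head: int  # index of the syntactic head; the root points at itself

# Hand-annotated parse (an assumption, not model output) of
# "telesales, customer service or travel"
toks = [
    Tok(0, 'telesales', 'NOUN', 'pobj', 0),
    Tok(1, ',', 'PUNCT', 'punct', 0),
    Tok(2, 'customer', 'NOUN', 'compound', 3),
    Tok(3, 'service', 'NOUN', 'conj', 0),
    Tok(4, 'or', 'CCONJ', 'cc', 3),
    Tok(5, 'travel', 'NOUN', 'conj', 3),
]

def children(toks, tok):
    return [t for t in toks if t.head == tok.i and t.i != tok.i]

def left_edge(toks, tok):
    # Leftmost token index in tok's subtree
    return min([tok.i] + [left_edge(toks, c) for c in children(toks, tok)])

def conjuncts(toks, tok):
    # Depth-first walk over 'conj' children, starting from tok itself
    stack = [tok]
    while stack:
        tok = stack.pop()
        yield tok
        stack.extend(c for c in children(toks, tok) if c.dep == 'conj')

def left_span(toks, tok):
    # Grow leftwards while the previous token looks like part of a noun phrase
    idx = tok.i
    while idx > left_edge(toks, tok) and toks[idx - 1].pos in ('NOUN', 'PROPN', 'ADJ', 'X'):
        idx -= 1
    return idx, tok.i + 1

spans = [left_span(toks, t) for t in conjuncts(toks, toks[0])]
phrases = [' '.join(t.text for t in toks[s:e]) for s, e in spans]
print(phrases)
```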

Note we *could* expand with other related terms like 'proficiency' or 'ability' or 'skill', but we won't for now (because they don't occur as much)

In [70]:
#old
EXP_TERMS = ['experience']
def extract_adp_conj_experience(doc, label='EXPERIENCE'):
    for tok in doc:
        if tok.lower_ in EXP_TERMS:
            for child in tok.rights:
                if child.dep_ == 'prep':
                    for obj in child.children:
                        if obj.dep_ == 'pobj':
                            for conj in get_conjugations(obj):
                                yield get_left_span(conj, label)

That's much better; we still lose elaboration (such as), but we're extracting much more from lists.

Notice that we're not getting Pioneer

In [71]:
show_extraction(examples, extract_adp_conj_experience)

The reason we don't get Pioneer is the sentence

>    Any experience of Pioneer or Miser software would be an advantage.

really means

>    Any experience of Pioneer **software** or Miser software would be an advantage.

but we don't have any way to reconstruct the missing word (yet)

In [72]:
doc = nlp('Any experience of Pioneer or Miser software would be an advantage.')

displacy.render(doc)

In [73]:
show_extraction(['Any experience of Pioneer software or Miser software would be an advantage.'], extract_adp_conj_experience)

In [74]:
doc = nlp('Any experience of Pioneer software or Miser software would be an advantage.')

displacy.render(doc)

Looking at a sample of ads it works alright

In [75]:
show_extraction(ads[:10], extract_adp_conj_experience)

### Extracting Verbs followed by Adposition

Notice that in something like 'Experience dealing with business clients' we have a verb followed by an adposition followed by the noun. We can write more complex rules to parse things like this.

In [76]:
def extract_verb_maybeadj_noun_experience(doc, label='EXPERIENCE'):
    for tok in doc:
        if tok.lower_ in EXP_TERMS:
            for child in tok.rights:
                # A clausal modifier of 'experience', e.g. "dealing" or "working"
                if child.dep_ == 'acl':
                    for gc in child.children:
                        # verb + preposition + object, e.g. "dealing with business clients"
                        if gc.dep_ == 'prep':
                            for ggc in gc.children:
                                if ggc.dep_ == 'pobj':
                                    for c in get_conjugations(ggc):
                                        yield get_left_span(c, label)
                        # verb + direct object
                        elif gc.dep_ == 'dobj':
                            for c in get_conjugations(gc):
                                yield get_left_span(c, label)

This works pretty well, when the parse works well

In [77]:
show_extraction(examples, extract_verb_maybeadj_noun_experience)

## Extracting types of experience across all job ads

Let's just focus on the cleanest rule

In [78]:
extract_exps = [extract_adp_conj_experience,]

Most of the false positives are due to bad dependency parsing (which is often due to bad tokenization/sentence splitting)

However it looks like we're extracting a lot of signal

This takes a while because we need to parse every job ad and then run the rules across them.

Since documents are independent we could easily distribute this across a cluster if we needed to (think Hadoop/Dask/GNU Parallel).
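As a rough sketch of what that distribution could look like on a single machine, the snippet below splits the texts into chunks, carries each chunk's offset so document indices stay global, and maps a worker over the chunks. Everything here is hypothetical: `extract_chunk` is a placeholder that just counts the word "experience" (a real worker would load the spaCy model and run the extraction rules), and threads are used only to keep the sketch runnable anywhere. For CPU-bound parsing you would swap in a process pool or a cluster scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

def chunks(items, size):
    """Yield (offset, chunk) pairs so results can keep global document indices."""
    for offset in range(0, len(items), size):
        yield offset, items[offset:offset + size]

def extract_chunk(args):
    # Placeholder worker (hypothetical): stands in for parsing the chunk
    # and running the extraction rules over it.
    offset, texts = args
    return [(offset + idx, text.lower().count('experience'))
            for idx, text in enumerate(texts)]

def extract_all(texts, chunk_size=2, max_workers=2, executor_cls=ThreadPoolExecutor):
    with executor_cls(max_workers=max_workers) as pool:
        # map() returns results in the order the chunks were submitted
        results = pool.map(extract_chunk, chunks(texts, chunk_size))
        return [row for chunk_rows in results for row in chunk_rows]

sample = ['Experience in sales', 'No requirements', 'experience, experience',
          'Retail', 'Some experience']
rows = extract_all(sample)
print(rows)
```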

In [79]:
len(df)

407894

In [80]:
n_ads = len(df)

In [81]:
%%time
df_ents = extract_df(*extract_exps, n_max=n_ads)

CPU times: user 1h 58min 21s, sys: 59min 34s, total: 2h 57min 55s
Wall time: 3h 43min 48s


In [82]:
df_ents.to_csv('experience_adp_ents.csv', index=False)

In [83]:
df_ents = pd.read_csv('experience_adp_ents.csv', low_memory=False)

In [84]:
df_ents

Unnamed: 0,text,docidx,start,end,label,sent_start,sent_end,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,split
0,professional engineering environment,1,153,156,EXPERIENCE,122,164,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000.0,cv-library.co.uk,Train
1,Simulation Techniques,2,99,101,EXPERIENCE,79,118,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
2,techniques,2,104,105,EXPERIENCE,79,118,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000.0,cv-library.co.uk,Train
3,Water industry,5,120,122,EXPERIENCE,71,127,13179816,Engineering Systems Analyst Water Industry,Engineering Systems Analyst Water Industry Loc...,"Dorking, Surrey, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20K to 30K,25000.0,cv-library.co.uk,Train
4,Miser software,5,211,213,EXPERIENCE,206,218,13179816,Engineering Systems Analyst Water Industry,Engineering Systems Analyst Water Industry Loc...,"Dorking, Surrey, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20K to 30K,25000.0,cv-library.co.uk,Train
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315902,insurance,407886,46,47,EXPERIENCE,40,62,72703412,Marine & International Trade Lawyer ****,Marine and International Trade Assistant **** ...,London,London,,permanent,,Legal Jobs,,,hays.co.uk,Test
315903,marine contracting,407886,49,51,EXPERIENCE,40,62,72703412,Marine & International Trade Lawyer ****,Marine and International Trade Assistant **** ...,London,London,,permanent,,Legal Jobs,,,hays.co.uk,Test
315904,construction,407886,52,53,EXPERIENCE,40,62,72703412,Marine & International Trade Lawyer ****,Marine and International Trade Assistant **** ...,London,London,,permanent,,Legal Jobs,,,hays.co.uk,Test
315905,shipbuilding,407886,55,56,EXPERIENCE,40,62,72703412,Marine & International Trade Lawyer ****,Marine and International Trade Assistant **** ...,London,London,,permanent,,Legal Jobs,,,hays.co.uk,Test


Because we kept the context we can show where the label came from

In [85]:
showent_df(df_ents[:2])

Let's count the most common terms

In [86]:
df_ent_agg = aggregate_df(df_ents)
df_ent_agg.head(10)

Unnamed: 0,text,n_company,n_ad,n_source,n
56334,similar role,1447,3829,107,3848
54297,role,841,2529,103,2565
33715,design,702,1612,71,1730
33966,development,702,1590,91,1677
37952,following,684,2118,76,2160
36705,experience,634,1350,97,1369
44556,management,572,1159,94,1179
54523,sales,513,1176,83,1195
42695,knowledge,504,950,87,954
36155,environment,480,968,83,979


In [87]:
len(df_ent_agg)

62589

In [88]:
from flashtext import KeywordProcessor

In [89]:
keyword_processor = KeywordProcessor(case_sensitive=True)

In [90]:
skills = df_ent_agg.query('n_company >= 3').text
len(skills)

12757

In [91]:
for skill in skills:
    keyword_processor.add_keyword(skill)

In [92]:
from collections import Counter

In [93]:
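%%time
# Count term occurrences in the full text of the first 10,000 ads only
counter = Counter()     # total occurrences of each term
ad_counter = Counter()  # number of ads containing each term at least once
for ad in ads[:10_000]:
    keywords = keyword_processor.extract_keywords(ad)
    counter.update(keywords)
    ad_counter.update(set(keywords))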

CPU times: user 13.8 s, sys: 93.8 ms, total: 13.9 s
Wall time: 14.9 s


In [94]:
df_count_ad = pd.DataFrame(ad_counter.items(), columns=['text', 'n_ad_occur'])
df_count = pd.DataFrame(counter.items(), columns=['text', 'n_occur'])

In [95]:
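df_c = (
    df_ent_agg
    .merge(df_count, how='left', validate='1:1')
    .merge(df_count_ad, how='left', validate='1:1')
    # n_occur and n_ad_occur come from the 10,000-ad sample counted above
    .assign(
        pct_ad_occur=lambda df: df.n_ad_occur / n_ads,
        avg_occur=lambda df: df.n_occur / df.n_ad_occur,
        # ads mentioning the term anywhere, per ad where it was extracted as experience
        ad_freq=lambda df: df.n_ad_occur / df.n_ad,
    )
)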

In [96]:
df_c.to_csv('term_counts.csv', index=False)

In [97]:
df_c = pd.read_csv('term_counts.csv')

In [98]:
df_c.head(50)

Unnamed: 0,text,n_company,n_ad,n_source,n,n_occur,n_ad_occur,pct_ad_occur,avg_occur,ad_freq
0,similar role,1447,3829,107,3848,257.0,250.0,0.000613,1.028,0.065291
1,role,841,2529,103,2565,8811.0,5080.0,0.012454,1.734449,2.008699
2,design,702,1612,71,1730,1590.0,908.0,0.002226,1.751101,0.563275
3,development,702,1590,91,1677,2927.0,2077.0,0.005092,1.409244,1.306289
4,following,684,2118,76,2160,1232.0,1110.0,0.002721,1.10991,0.524079
5,experience,634,1350,97,1369,9882.0,6024.0,0.014769,1.640438,4.462222
6,management,572,1159,94,1179,2049.0,1516.0,0.003717,1.351583,1.308024
7,sales,513,1176,83,1195,2285.0,1135.0,0.002783,2.013216,0.965136
8,knowledge,504,950,87,954,2130.0,1652.0,0.00405,1.289346,1.738947
9,environment,480,968,83,979,1788.0,1535.0,0.003763,1.164821,1.585744


In [99]:
skills = list(
    df_c
    .query('n_company >= 3')
    .query('ad_freq < 100')
    .text
)
len(skills)

8075

In [100]:
with open('skills.txt', 'w') as f:
    for skill in skills:
        print(skill, file=f)

In [101]:
for a, b, c in zip(skills[::3], skills[1::3], skills[2::3]):
    print('{:<35}{:<35}{:<}'.format(a, b, c))

similar role                       role                               design
development                        following                          experience
management                         sales                              knowledge
environment                        areas                              field
industry                           one                                delivery
use                                C                                  ability
project management                 implementation                     projects
planning                           teams                              maintenance
testing                            Experience                         area
SQL                                selling                            manufacturing environment
marketing                          analysis                           HTML
aspects                            more                               managing
business                           years   

Key Stages                         environmental consultancy          fitness industry
disabilities                       APIs                               FX
ICT                                Scrub                              digital electronics design
libraries                          GMP environment                    MYSQL
PostgreSQL                         Service Delivery                   Team Management
account manager                    business management                capture
development experience             drainage                           footwear
retail organisation                routing                            suite
whole project life cycle           electromechanical systems          medical sales
soldering                          Benefits                           C .NET
Windows operating systems          benefit                            bias
brand management                   customer relationship management   environmental
hospital                    

security architecture              small organisation                 small projects
software installation              softwareasaservice                 staff recruitment
stock ordering                     strong communication skills        system development lifecycle
target setting                     taxation                           team player
technical integrity                technical project management       tendering process
web content                        workings                           Paid Search
derivatives                        exploratory testing                international sales
marketing roles                    sales techniques                   Access Control
Application                        Audits                             Board level
Communications                     Consulting                         Data Migration
Frameworks                         IT Project Management              IT Service Management
IT project management              Injecti

IT market                          IT packages                        IT recruitment industry
IT support role                    ITIL processes                     JIT
Java EE                            Java Spring                        Legal Cashier
Line Management                    Linux Engineer                     London schools
M A                                Management Systems                 Microsoft Office Packages
Mitsubishi                         Murex development                  NEBOSH
Network Operations                 Networking Technologies            Networks
Nonconformance                     OS                                 Object Oriented
Out                                PBX                                People Development
Planning Manager                   Products                           RECRUITMENT
Recoveries                         Regional Manager                   SIP
SOA principles                     SPC                                SQL experi

AGILE environment                  AUTOCAD                            Account Director level
Accounting Software                Accounting software                Active
Aerospace Industry                 Agile development practices        Agile development processes
Agile methodology Experience       Android Developer                  Application Monitoring
Appsense                           Architecting                       Army
Asset Manager                      Assistant Merchandiser             Attend
Autistic Spectrum Disorders        Britain                            Broadcasting
Building Control                   Building Services sector           Business Change projects
Business case                      C .net                             CAPA
CAT                                CSS skills                         CTS
Campaign Marketing                 Care Manager                       Certification
Cisco environment                  Cisco products                     Citrix

research techniques                retentions                         route
routes                             sales achievement                  sales cycle
sales manager                      sales training                     scaling
scripting skills                   scrum master                       search role
secondary education                secretarial support                security solutions
senior member                      senior stakeholders                server configuration
server infrastructure              sewing                             shift work
similar role Qualifications        similar sized business             site audit
sludge treatment                   small bore                         smoke
social marketing                   social media channels              software architecture
software company                   something                          source
specialist sector                  speciality                         specific emphasis
specifi

## Analysis

Look up a skill

Co-occurrence would be great for understanding skills!
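
To make the idea concrete, here's a minimal sketch of co-occurrence counting on toy data. The function name `cooccurring_terms` and the `toy_ads` sets are made up for illustration; the real version against `df_ents` is `related_experience`, defined in the cells that follow.

```python
from collections import Counter


def cooccurring_terms(ad_terms, query):
    """Count how many ads mention each other term alongside `query`.

    ad_terms: iterable of sets of terms, one set per ad (toy input).
    Returns a Counter mapping term -> number of ads shared with `query`.
    """
    counts = Counter()
    for terms in ad_terms:
        if query in terms:
            # Every other term in this ad co-occurs with the query once
            counts.update(terms - {query})
    return counts


# Hypothetical example: three ads, each reduced to its set of extracted terms
toy_ads = [
    {"Python", "Django", "SQL"},
    {"Python", "SQL"},
    {"Java", "SQL"},
]
print(cooccurring_terms(toy_ads, "Python").most_common())
# [('SQL', 2), ('Django', 1)]
```

Counting per ad (sets rather than lists) keeps a term that is repeated many times in one ad from dominating the ranking.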

In [102]:
def filter_ents(query, exact=False, match_case=True):
    if exact and match_case:
        return df_ents[df_ents.text == query]
    elif exact:
        return df_ents[df_ents.text.str.lower() == query.lower()]
    else:
        # Escape the query so characters like '+' or '.' are matched literally
        pattern = fr'\b{re.escape(query)}\b'
        return df_ents[df_ents.text.str.contains(pattern, flags=0 if match_case else re.IGNORECASE)]

In [103]:
def show_exp(query, exact=True, match_case=True, n_max=10):
    showent_df(filter_ents(query, exact, match_case)[:n_max])

In [104]:
def job_exp(query, exact=True, match_case=True):
    return filter_ents(query, exact, match_case).drop_duplicates('docidx')[['Company', 'Title']]

In [105]:
def related_experience(query, exact=True, match_case=True):
    # Ads (docidx) in which the query term was extracted
    docs = filter_ents(query, exact, match_case).docidx.to_numpy()
    return (
        df_ents[df_ents['docidx'].isin(docs)]
        .query('label == "EXPERIENCE"')
        .groupby('text')
        .agg(
            n=('text', 'count'),
            ads=('docidx', 'nunique'),
            advertisers=('Company', 'nunique'),
        )
        # Drop terms that only a single advertiser uses
        .query('advertisers > 1')
        .sort_values(['advertisers', 'ads', 'n'], ascending=False)
    )

"Experience" is a result of bad parsing.

It looks like these were probably lists whose bullet markers have been stripped away, so consecutive items run together into one sentence. Improving the sentence boundary detection would probably help here.

In [106]:
show_exp('Experience', n_max=5)

In [107]:
show_exp('sales', n_max=5)

In [108]:
related_experience('sales').head(10)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
sales,1195,1176,513
customer service,138,132,56
marketing,106,104,53
business development,46,42,26
retail,64,64,24
account management,30,30,20
telesales,30,30,20
promotions,56,56,19
hospitality,40,40,14
recruitment,21,21,14


In [109]:
show_exp('project management', n_max=5)

In [110]:
related_experience('project management').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
project management,589,582,355
design,24,23,18
delivery,24,21,15
development,17,17,10
management,10,10,9
managing,10,10,9
implementation,10,9,7
planning,8,8,7
building,8,8,6
customer service,12,12,5


In [111]:
filter_ents('price control environment')

Unnamed: 0,text,docidx,start,end,label,sent_start,sent_end,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,split
34533,price control environment,47690,250,253,EXPERIENCE,244,257,68580067,Regulatory Analyst,Assessing quantitatively the impact on shareho...,Berkshire,Berkshire,,permanent,,Consultancy Jobs,40000 - 45000,42500.0,michaelpage.co.uk,Train
34539,price control environment,47692,270,273,EXPERIENCE,235,274,68580069,Regulatory Manager,A fantastic opportunity has arisen for a Regul...,Berkshire,Berkshire,,permanent,,Consultancy Jobs,50000 - 60000,55000.0,michaelpage.co.uk,Train
139178,price control environment,183711,414,417,EXPERIENCE,379,418,71631376,Regulatory Manager,We are the UK s biggest water and sewerage com...,"Reading, Berkshire",Reading,,permanent,Reed,Engineering Jobs,45000 - 50000/annum depending on experience + ...,47500.0,cv-library.co.uk,Train
187417,price control environment,242918,328,331,EXPERIENCE,322,335,72689668,Regulatory Analyst,We are the UK ****;s biggest water and sewerag...,Reading,Reading,,permanent,Reed Consulting,Accounting & Finance Jobs,"40,000 to 43,000",41500.0,jobsite.co.uk,Train
212607,price control environment,275788,229,232,EXPERIENCE,223,232,71680024,Regulatory Economist **** package,We have an excellent opportunity for a Regulat...,Reading Berkshire South East,Reading,,permanent,Jonathan Lee Engineering & Manufacturing,Accounting & Finance Jobs,,,totaljobs.com,Valid
288871,price control environment,375027,422,425,EXPERIENCE,416,429,71557474,Regulatory Analyst,We are the UK s biggest water and sewerage com...,"Reading, Berkshire",Reading,,permanent,Reed,Other/General Jobs,,,cv-library.co.uk,Test
293650,price control environment,380912,228,231,EXPERIENCE,222,231,71745569,"Regulatory Economist **** , **** , **** package",We have an excellent opportunity for a Regulat...,"Reading,Berkshire",UK,,permanent,Jonathan Lee Recruitment Product Eng,"Energy, Oil & Gas Jobs",,,renewablescareers.com,Test
310855,price control environment,401801,641,644,EXPERIENCE,606,645,72479775,Regulatory Manager,What is the purpose of the role? You will be r...,Reading,Reading,,permanent,Thames Water Utilities Ltd,Consultancy Jobs,,,jobsite.co.uk,Test
310859,price control environment,401803,670,673,EXPERIENCE,664,677,72479777,Regulatory Analyst,What is the purpose of the role? You will have...,Reading,Reading,,permanent,Thames Water Utilities Ltd,Consultancy Jobs,,,jobsite.co.uk,Test


In [112]:
related_experience('price control environment')

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
price control environment,9,9,5
project management,9,9,5
economic regulatory policy development,7,7,3
economic regulatory price controls,3,3,3
economic regulatory price control,3,3,2
finance role,2,2,2


In [113]:
related_experience('AJAX').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AJAX,89,89,60
CSS,33,33,21
HTML,28,28,20
JavaScript,21,21,17
Javascript,17,17,12
PHP,11,11,8
jQuery,9,9,7
design,7,7,7
IBM DB,11,11,6
Java,7,7,6


In [114]:
related_experience('Java').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Java,480,464,233
C++,75,75,41
C,77,67,41
SQL,27,26,15
JavaScript,24,24,15
J****EE,20,20,15
Linux,17,17,15
Spring,18,17,14
development,19,16,14
HTML,19,19,13


In [115]:
related_experience('Python').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Python,141,137,88
Perl,24,24,12
Ruby,18,18,12
Java,14,14,10
C,13,13,9
Bash,15,15,7
Django,8,8,7
PHP,14,14,6
OpenFrameworks,7,7,6
etc,7,7,6


In [116]:
related_experience('C++').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
C++,337,313,151
C,163,126,65
Java,76,75,41
stages,16,16,12
development,13,13,11
highlevel language,14,14,10
hightraffic systems,14,14,10
this,12,12,9
MFC,25,18,8
Linux,13,12,8


In [117]:
related_experience('Javascript').head(15)

Unnamed: 0_level_0,n,ads,advertisers
text,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Javascript,159,159,86
HTML,83,82,50
CSS,60,59,34
AJAX,17,17,12
PHP,16,14,10
JQuery,12,12,8
experience,9,9,8
Ajax,15,15,7
Java,8,7,7
Linux,15,15,6


In [118]:
filter_ents('IBM DB')

Unnamed: 0,text,docidx,start,end,label,sent_start,sent_end,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName,split
1449,IBM DB,2647,311,313,EXPERIENCE,309,313,55409877,"Java J****EE Developer ****k ****k Music, F...","Java J****EE Developer ****k ****k Music, F...",London,London,full_time,permanent,JOBG8,IT Jobs,"Up to 50,000 per year + 40000.00-50000.00",50000.0,planetrecruit.com,Train
2757,IBM DB,4408,302,304,EXPERIENCE,289,304,61811863,"Java J****EE Developer – ****k ****k Music, ...",NEW Java J****EE Developer – ****k ****k Mu...,London South East,South East London,,permanent,Parham Consulting,IT Jobs,"From 40,000 to 50,000 per annum 40,000 - 50,00...",45000.0,cwjobs.co.uk,Train
12054,IBM DB,17428,133,135,EXPERIENCE,130,135,66925434,Application/Integration Developer,We are looking for an experienced developer (*...,"Nottingham, Nottinghamshire",Nottingham,,permanent,Seismic Group,IT Jobs,35000 - 40000/annum,37500.0,cv-library.co.uk,Train
33948,IBM DB,46689,301,303,EXPERIENCE,288,303,68567721,"Java J****EE Developer ****k ****k Music, F...","Java J****EE Developer ****k ****k Music, F...",City of London - London,The City,full_time,permanent,London4Jobs,IT Jobs,40000-50000,45000.0,london4jobs.co.uk,Train
51186,IBM DB,68754,301,303,EXPERIENCE,288,303,68799489,"Java J****EE Developer ****k ****k Music, F...","Java J****EE Developer ****k ****k Music, F...",Central London,Central London,full_time,permanent,Parham Consulting Ltd,IT Jobs,40000.00 - 50000.00 GBP Annual,45000.0,jobs.newstatesman.com,Train
51865,IBM DB,69555,311,313,EXPERIENCE,297,313,68806243,NEW Java J****EE Developer ****k ****k Mus...,NEW Java J****EE Developer ****k ****k Mus...,"London,Euston,Kings Cross",London,,permanent,Parham Consulting Ltd,IT Jobs,"40K - 50K + bonus, bens",45000.0,jobsite.co.uk,Train
72518,IBM DB,95902,137,139,EXPERIENCE,134,139,69222789,Application/Integration Developer,We are looking for an experienced developer (*...,NOTTINGHAM,Nottingham,full_time,permanent,Seismic Recruitment,IT Jobs,"From 35,000 to 40,000 per year",37500.0,fish4.co.uk,Train
91450,IBM DB,120304,312,314,EXPERIENCE,298,314,69895464,NEW Java J****EE Developer ****k ****k Mus...,NEW Java J****EE Developer ****k ****k Mus...,London,London,full_time,permanent,PARHAM CONSULTING LIMITED,IT Jobs,"From 40,000 to 50,000 per year + 40K - 50K + d...",45000.0,planetrecruit.com,Train
103043,IBM DB,136377,301,303,EXPERIENCE,288,303,70322570,"Java J****EE Developer ****k ****k Music, F...","Java J****EE Developer ****k ****k Music, F...",UK,UK,,permanent,Parham Consulting Ltd,IT Jobs,40000-50000,45000.0,fish4.co.uk,Train
135503,IBM DB,179068,260,262,EXPERIENCE,247,262,71558569,"Java J****EE Developer ****k****k Music, Film...","Java J****EE Developer ****k ****k Music, F...",London Greater London,London,,permanent,,IT Jobs,50000,50000.0,technojobs.co.uk,Train


In [119]:
for ad in [ads[2647], ads[4408]]:
    print(ad + '\n')

Java J****EE Developer  ****k  ****k  Music, Film & TV  London Java J****EE Developers required for software house with client sectors of music, film and TV. Salary: Maximum ****: Discretionary bonus and benefits package. Location: Near Euston and King's Cross, London THE COMPANY: Consistent new business wins for the world leader in the provision of software solutions to the Music and Entertainment industry has given rise to the need for an experienced Java Developer. The working environment here is very pleasant with a casual dress code, laid back and friendly atmosphere, but also hardworking and dynamic with the autonomy to drive your job role forward. This is predominantly a development role, but you will be involved in the full product life cycle including design and clientfacing duties, so they need a good allrounder. EXPERIENCE REQUIRED: The experience required for this role is as follows:  A minimum of 5 years experience in the development of web applications for the J****EE dev

# Next Steps

We could keep building out a rule-based approach:

* Analyse this list to build up a list of positive/negative phrases
* Search the documents for those phrases
* Look at the results and build new rules to get those phrases

Or we could use this as the seed of a model-based approach:

* Build an NER model on these base phrases
* Annotate the predictions and refine the model

Or we could use some hybrid of the two.
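
As a rough sketch of how the model-based option might start, the phrases found so far could be turned into character-offset annotations. This assumes spaCy 2.x's simple training format of `(text, {"entities": [(start, end, label)]})` tuples; the function name `to_training_example` and the sample sentence are hypothetical, and the phrase list could come from something like `skills.txt`.

```python
import re


def to_training_example(text, phrases, label="EXPERIENCE"):
    """Annotate every occurrence of the given phrases in `text`.

    Longer phrases are matched first and overlapping matches are skipped,
    so "project management" wins over a bare "management" inside it.
    """
    taken = []  # (start, end) character spans already claimed
    for phrase in sorted(phrases, key=len, reverse=True):
        # Lookarounds instead of \b so terms like "C++" still match
        pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            start, end = match.start(), match.end()
            if any(start < e and s < end for s, e in taken):
                continue  # overlaps a longer phrase already annotated
            taken.append((start, end))
    entities = [(start, end, label) for start, end in sorted(taken)]
    return text, {"entities": entities}


example = to_training_example(
    "Experience in project management and SQL is essential.",
    ["project management", "management", "SQL"],
)
print(example[1])
# {'entities': [(14, 32, 'EXPERIENCE'), (37, 40, 'EXPERIENCE')]}
```

Annotations produced this way are only as good as the seed list, so they would still need the manual review and refinement step listed above.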