## Detecting programming languages with spaCy - Part 1


> Based on the tutorial by Vincent Warmerdam: https://www.youtube.com/watch?v=WnGPv6HnBok

---

### Import libraries and read data

In [1]:
import pandas as pd
import numpy as np
import spacy

In [3]:
df = (pd.read_csv("./data/Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

## extract the titles
titles = [_ for _ in df['Title']]

In [4]:
df.head()

Unnamed: 0,Id,Title
0,80,SQLStatement.execute() - multiple queries in o...
1,90,Good branching and merging tutorials for Torto...
2,120,ASP.NET Site Maps
3,180,Function for creating color wheels
4,260,Adding scripting functionality to .NET applica...


### Super naive function

We want to detect the programming language "Go" when it is mentioned in a question title.

In [5]:
def has_golang(text):
    return " go " in text

# create a generator
g = (title for title in titles if has_golang(title))

print ([next(g) for i in range(10)])

['Where does Console.WriteLine go in ASP.NET?', 'Should try...catch go inside or outside a loop?', 'Way to go from recursion to iteration', 'When are API methods marked "deprecated" actually going to go away?', 'How to go to main stack', 'In Mac OS X, is there a programmatic way to get the machine to go to sleep/hibernate?', 'Is .NET already the right way to go for small app development?', 'Should I use Java date and time classes or go with a 3rd party library like Joda Time?', 'Make multiple monitors go to sleep with Windows API?', 'Where to go to browse for open source projects to work on?']


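It doesn't work too well: every hit above uses "go" as an ordinary English word, and the check would also miss titles where "Go" is capitalized or followed by punctuation. So let us try spaCy. You might need to install the language model first (for example with `python -m spacy download en_core_web_sm`).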
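To make the failure concrete, here is a small sketch that compares the substring check with a case-insensitive whole-word regex. The helper `has_go_word` and the pattern `GO_WORD` are our own illustration and are not used anywhere else in this notebook; the sample titles are taken from outputs in this notebook. The regex finds more mentions, but it still cannot tell the verb from the language, which is the gap spaCy is meant to fill.

```python
import re

def has_golang_naive(text):
    # Original check: only matches a lowercase "go" surrounded by spaces.
    return " go " in text

# Hypothetical alternative: match "go" or "golang" as a whole word, ignoring case.
GO_WORD = re.compile(r"\b(go|golang)\b", re.IGNORECASE)

def has_go_word(text):
    return GO_WORD.search(text) is not None

samples = [
    "Shared library in Go?",                   # about the language
    "Embedding instead of inheritance in Go",  # about the language
    "Way to go from recursion to iteration",   # the verb "go"
    "ASP.NET Site Maps",                       # no "go" at all
]

# Columns: naive result, whole-word result, title
for title in samples:
    print(f"{has_golang_naive(title)!s:5} {has_go_word(title)!s:5} {title}")
```

The naive check misses both language titles and fires only on the verb, while the whole-word regex fires on all three titles that contain the word.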



### Intro to spaCy

In [6]:
nlp = spacy.load("en_core_web_sm")
[t for t in nlp("My name is mini.")]

[My, name, is, mini, .]

In [8]:
doc = nlp('Little mini likes gifts')
spacy.displacy.render(doc)

What does "amod" mean here?

In [9]:
spacy.explain("amod")

'adjectival modifier'

In [11]:
spacy.displacy.render(nlp("Where does Console.WriteLine go in ASP.NET?"))

In [17]:
for t in nlp("Where does Console.WriteLine go in ASP.NET?"):
    print(f'Token: {t} POS: {spacy.explain(t.pos_)} DEP: {t.dep_} Head text: {t.head.text}')

Token: Where POS: adverb DEP: advmod Head text: does
Token: does POS: auxiliary DEP: ROOT Head text: does
Token: Console POS: proper noun DEP: nsubj Head text: does
Token: . POS: punctuation DEP: punct Head text: does
Token: WriteLine POS: proper noun DEP: nsubj Head text: go
Token: go POS: verb DEP: ROOT Head text: go
Token: in POS: adposition DEP: prep Head text: go
Token: ASP.NET POS: proper noun DEP: pobj Head text: in
Token: ? POS: punctuation DEP: punct Head text: go


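#### How to go from the text description to the diagram?

The head text tells us how to connect the arrows: each token is linked by an arc to its head, and the arc is labelled with the token's dep.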

> spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head.

- Read more here: https://spacy.io/usage/linguistic-features
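As a rough illustration of how the printed table maps onto the diagram, the sketch below rebuilds the arcs from (token, dep, head text) triples copied from the output above. The `rows` list and the `arcs` helper are ours and not part of spaCy; each arc points from the head to its child and carries the dep label.

```python
# (token, dep, head text) triples copied from the printed output above.
rows = [
    ("Where", "advmod", "does"),
    ("does", "ROOT", "does"),
    ("Console", "nsubj", "does"),
    (".", "punct", "does"),
    ("WriteLine", "nsubj", "go"),
    ("go", "ROOT", "go"),
    ("in", "prep", "go"),
    ("ASP.NET", "pobj", "in"),
    ("?", "punct", "go"),
]

def arcs(rows):
    """Return (head, dep, child) arcs; a ROOT token is its own head and gets no arc."""
    return [(head, dep, tok) for tok, dep, head in rows if dep != "ROOT"]

for head, dep, child in arcs(rows):
    print(f"{head} --{dep}--> {child}")
```

Matching heads by text only works here because no word repeats in the title. spaCy itself links tokens by position: `token.head` is a token object, and `token.head.i` gives its index in the doc.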



### Back to Detecting Golang

In [18]:
nlp = spacy.load("en_core_web_md")

In [20]:
## read the first 2,000,000 rows now
df = (pd.read_csv("./data/Questions.csv", nrows=2_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [21]:
### extract titles with "go" in them
titles = [_ for _ in df.loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title']]

In [22]:
titles[:5]

['Good branching and merging tutorials for TortoiseSVN?',
 'Good STL-like library for C',
 'My website got hacked... What should I do?',
 "DVCS Choices - What's good for Windows?",
 'Is a "Confirm Email" input good practice when user changes email address?']

Now let us modify the `has_golang()` function to use some of spaCy's features.

In [23]:
def has_golang(text):
    doc = nlp(text)
    for token in doc:
        if token.lower_ in ['go', 'golang']:
            if token.pos_ != 'VERB':
                return True
    return False

g = (title for title in titles if has_golang(title))
print ([next(g) for i in range(10)])

['How to Convince Programming Team to Let Go of Old Ways?', 'Deploying multiple Java web apps to Glassfish in one go', 'Removing all event handlers in one go', 'Paypal integration to serve multiple sellers in one go for a shopping site', 'How to Create a Dropdown List Hyperlink without the GO button?', 'How do I disable multiple listboxes in one go using jQuery?', 'Embedding instead of inheritance in Go', 'Shared library in Go?', 'multi package makefile example for go', "What's the point of having pointers in Go?"]


There are quite a few "in one go" titles that we need to filter out.

Let's take an example that works well and visualize its pattern.

In [24]:
spacy.displacy.render(nlp("Embedding instead of inheritance in Go"))

In [25]:
spacy.explain("pobj")

'object of preposition'

The dependency structure here is that "Go" is the object of a preposition ("in").

Let's see an example that does not work well:

In [27]:
spacy.displacy.render(nlp("Removing all event handlers in one go"))

In [29]:
spacy.explain("csubj")

'clausal subject'

Here it is the "clausal subject"

If we take the dependencies into account, we can maybe detect the language better.

In [31]:
def has_golang(text):
    doc = nlp(text)
    for token in doc:
        if token.lower_ in ['go', 'golang']:
            if token.pos_ != 'VERB':
                if token.dep_ == 'pobj': # obj of preposition
                    return True
    return False

g = (title for title in titles if has_golang(title))
print ([next(g) for i in range(10)])

['How do I disable multiple listboxes in one go using jQuery?', 'Embedding instead of inheritance in Go', 'Shared library in Go?', 'multi package makefile example for go', "What's the point of having pointers in Go?", 'Simulate a tcp connection in Go', 'Trouble reading from a socket in go', "What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?", 'Convert string to integer type in Go?', 'Global Variables with GO']


Better, but there are still some errors:

"How do I disable multiple listboxes in one go using jQuery?"

Let's explore this one again:

In [32]:
spacy.displacy.render(nlp("How do I disable multiple listboxes in one go using jQuery?"))

In [33]:
spacy.explain("nummod")

'numeric modifier'

It seems that the presence of the numeric modifier is what distinguishes "in one go" from a mention of the language.

In [40]:
for token in nlp("How do I disable multiple listboxes in one go using jQuery?"):
    print (token.text, token.pos_, token.dep_, token.head.text, [child for child in token.children])

How ADV advmod disable []
do AUX aux disable []
I PRON nsubj disable []
disable VERB ROOT disable [How, do, I, listboxes, in, using, ?]
multiple ADJ amod listboxes []
listboxes NOUN dobj disable [multiple]
in ADP prep disable [go]
one NUM nummod go []
go NOUN pobj in [one]
using VERB advcl disable [jQuery]
jQuery NOUN dobj using []
? PUNCT punct disable []


Basically, the head of "one" is "go", and "one" is attached to it as a "nummod". In other words, "go" has a child whose dep is "nummod".

This can be an effective filtering condition to add.
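The rule can be tried out without loading a model by feeding it the values from the printed parse above. This is only a sketch: `Tok` is a hypothetical stand-in for a spaCy token that holds just the attributes the rule reads, and `is_golang_token` is our name for the per-token check.

```python
from dataclasses import dataclass, field

@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; holds only the attributes the rule reads."""
    text: str
    pos_: str
    dep_: str
    children: list = field(default_factory=list)

    @property
    def lower_(self):
        return self.text.lower()

def is_golang_token(token):
    """A non-verb 'go'/'golang' that is a pobj and has no nummod child."""
    if token.lower_ not in ("go", "golang"):
        return False
    if token.pos_ == "VERB" or token.dep_ != "pobj":
        return False
    return not any(child.dep_ == "nummod" for child in token.children)

# Values copied from the printed parse above: "one" is a nummod child of "go".
one = Tok("one", "NUM", "nummod")
go_in_one_go = Tok("go", "NOUN", "pobj", children=[one])

# "Go" in "Embedding instead of inheritance in Go": the pobj label comes from the
# notebook; the PROPN tag is an assumption (any non-VERB tag behaves the same).
go_language = Tok("Go", "PROPN", "pobj")

print(is_golang_token(go_in_one_go))  # False: filtered out by the nummod child
print(is_golang_token(go_language))   # True
```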

Let's verify it with an example where "go" should be detected: "How do I disable multiple listboxes in go using jQuery?"

In [37]:
spacy.displacy.render(nlp("How do I disable multiple listboxes in go using jQuery?"))

Looks good... Let's add this condition and try it out.

In [43]:
%%time

def has_golang(text):
    # print (type(text))
    doc = nlp(text)
    for token in doc:
        if token.lower_ in ['go', 'golang']:
            if token.pos_ != 'VERB':
                if token.dep_ == 'pobj': # obj of preposition
                    ## check the children
                    children = [child for child in token.children]
                    for child in children:
                        if child.dep_ == "nummod":
                            return False
                    ### not found; so return True
                    return True
    return False
g = (title for title in titles if has_golang(title))
print ([next(g) for i in range(10)])

['Embedding instead of inheritance in Go', 'Shared library in Go?', 'multi package makefile example for go', "What's the point of having pointers in Go?", 'Simulate a tcp connection in Go', 'Trouble reading from a socket in go', 'Convert string to integer type in Go?', 'Global Variables with GO', 'Generating Random Numbers in Go', 'making generic algorithms in go']
Wall time: 32.8 s


Good, this looks much better now :)

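We can make the process faster using `nlp.pipe()`, which processes the texts as a stream (in batches) and yields `Doc` objects. So `has_golang()` now takes a `Doc` instead of a string.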

In [44]:
%%time

def has_golang(doc):
    for token in doc:
        if token.lower_ in ['go', 'golang']:
            if token.pos_ != 'VERB':
                if token.dep_ == 'pobj': # obj of preposition
                    ## check the children
                    children = [child for child in token.children]
                    for child in children:
                        if child.dep_ == "nummod":
                            return False
                    ### not found; so return True
                    return True
    return False
g = (doc for doc in nlp.pipe(titles) if has_golang(doc))
print ([next(g) for i in range(10)])

[Embedding instead of inheritance in Go, Shared library in Go?, multi package makefile example for go, What's the point of having pointers in Go?, Simulate a tcp connection in Go, Trouble reading from a socket in go, Convert string to integer type in Go?, Global Variables with GO, Generating Random Numbers in Go, making generic algorithms in go]
Wall time: 8.41 s


This makes the process considerably faster (32.8 s down to 8.41 s)!

In [45]:
nlp.pipe_names

['tagger', 'parser', 'ner']

We don't really need the "ner" component here, so we can disable it.

In [46]:
nlp.disable_pipes("ner")

[('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x2ca77757168>)]

In [47]:
nlp.pipe_names

['tagger', 'parser']

In [48]:
%%time

g = (doc for doc in nlp.pipe(titles) if has_golang(doc))
print ([next(g) for i in range(10)])

[Embedding instead of inheritance in Go, Shared library in Go?, multi package makefile example for go, What's the point of having pointers in Go?, Simulate a tcp connection in Go, Trouble reading from a socket in go, Convert string to integer type in Go?, Global Variables with GO, Generating Random Numbers in Go, making generic algorithms in go]
Wall time: 5.3 s


This speeds the process up even more (8.41 s down to 5.3 s).

### Some benchmarks and evaluations

We have the `Tags.csv` file with the ground truth labels for each question.

In [49]:
df_tags = pd.read_csv('./data/Tags.csv')
df_tags.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


In [50]:
df_tags.loc[df_tags['Tag'] == 'go'].head()

Unnamed: 0,Id,Tag
98267,1724680,go
98367,1726130,go
98457,1727250,go
100482,1757090,go
101172,1766720,go


In [53]:
## get the ids of the go tags
go_ids = df_tags.loc[lambda d: d['Tag']=='go']['Id']

print (go_ids[:10])

98267     1724680
98367     1726130
98457     1727250
100482    1757090
101172    1766720
107663    1863460
115079    1976950
126820    2148190
135806    2270670
145758    2403520
Name: Id, dtype: int64


In [54]:
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
all_go_sentences[:5]

['Go language benchmarks?',
 'Go code contribution: license and patent implications?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go']

In [55]:
### additional filter

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            return True
    return False

detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]

detectable[:5]

['Go language benchmarks?',
 'Go code contribution: license and patent implications?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go']

Basically, `all_go_sentences` holds all questions that Stack Overflow says are about Go. But some of them might not have the word "go" or "golang" in the title.

We are only interested in the ones that do. These are in the `detectable` variable.

Now for the entries that we do not want the model to flag. These are questions that:

- are not among the ids tagged "go"
- have the text "go" in the title

In [60]:
df.loc[lambda d: ~d['Id'].isin(go_ids)].loc[lambda d: d['Title'].str.lower().str.contains("go")].sample(5)

Unnamed: 0,Id,Title
1014037,33493750,Error Response: [13] An internal error occurre...
113965,4676310,Django Australian local flavor form validation
273869,10066100,Google Chrome Extension Manipulate DOM of Open...
935472,31161120,Optimize functions in C using Genetic Algorith...
638735,21939370,Nodejs Mongo date queries relative to database...


In [62]:
non_detectable = (df.loc[lambda d: ~d['Id'].isin(go_ids)].loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title'].tolist())

non_detectable[:5]

['Good branching and merging tutorials for TortoiseSVN?',
 'Good STL-like library for C',
 'My website got hacked... What should I do?',
 "DVCS Choices - What's good for Windows?",
 'Is a "Confirm Email" input good practice when user changes email address?']

In [63]:
print (len(non_detectable))

48243


We can filter these further by requiring "go" to appear as a token of its own, not merely as part of a word like "google", using the `has_go_token()` function.

In [64]:
non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]

print (len(non_detectable))

non_detectable[:5]

1696


['Where did all the java applets go?',
 "Where'd my generic ActionLink go?",
 'Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?',
 'Way to go from recursion to iteration']

These are the variables that matter here:

- `all_go_sentences`: all questions tagged "go", not necessarily containing "go" in the title
- `detectable`: the subset of `all_go_sentences` with the token "go" (or "golang") in the title. Ideally we should be able to detect these
- `non_detectable`: titles that contain the token "go" but are not about the language. We don't want to detect these
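The way the three lists relate can be sketched on a handful of toy rows. Everything here is illustrative: the ids are made up, the last title is invented to show a tagged question without the word "go", and the regex split is only a crude stand-in for spaCy's tokenizer.

```python
import re

# Toy rows of (id, title). Ids are made up for illustration; the last title is invented.
rows = [
    (1, "Go language benchmarks?"),
    (2, "Shared library in Go?"),
    (3, "Good STL-like library for C"),
    (4, "Way to go from recursion to iteration"),
    (5, "Why does this goroutine deadlock?"),
]
go_ids = {1, 2, 5}  # pretend these questions carry the "go" tag

def has_go_token(title):
    # Crude stand-in for spaCy tokenization: split on runs of non-alphanumeric characters.
    tokens = re.split(r"[^a-z0-9]+", title.lower())
    return any(tok in ("go", "golang") for tok in tokens)

all_go_sentences = [t for i, t in rows if i in go_ids]
detectable = [t for t in all_go_sentences if has_go_token(t)]
non_detectable = [t for i, t in rows if i not in go_ids and has_go_token(t)]

print(len(all_go_sentences), len(detectable), len(non_detectable))
```

Note that "Good STL-like library for C" ends up in none of the lists: it contains the substring "go" but not the token, so it is neither a target nor a distractor for the detector.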

In [65]:
print (len(all_go_sentences), len(detectable), len(non_detectable))

1858 1208 1696


### Evaluating the model

In [66]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=["ner"])

def has_golang(doc):
    for token in doc:
        if token.lower_ in ['go', 'golang']:
            if token.pos_ != 'VERB':
                if token.dep_ == 'pobj': # obj of preposition
                    ## check the children
                    children = [child for child in token.children]
                    for child in children:
                        if child.dep_ == "nummod":
                            return False
                    ### not found; so return True
                    return True
    return False

In [70]:
true_positives = sum(has_golang(doc) for doc in model.pipe(detectable))
false_positives = sum(has_golang(doc) for doc in model.pipe(non_detectable))
true_negatives = len(non_detectable) - false_positives
false_negatives = len(detectable) - true_positives

precision = true_positives/(true_positives+false_positives)
recall = true_positives/len(detectable)

f1_score = 2*(precision*recall)/(precision+recall)

accuracy = (true_positives + true_negatives)/(len(detectable) + len(non_detectable))

print (precision, recall, f1_score, accuracy)

0.9655172413793104 0.3708609271523179 0.5358851674641149 0.7327823691460055


Recall is pretty bad here (about 0.37), which means we are missing quite a few of the detectable titles.

Some examples of these misses are:

In [74]:
go_lang_detected_correctly = [has_golang(doc) for doc in model.pipe(detectable)]

go_lang_detected_correctly[:5]

[False, False, True, True, True]

In [80]:
detectable[0:2]

['Go language benchmarks?',
 'Go code contribution: license and patent implications?']

In [78]:
spacy.displacy.render(nlp('Go code contribution: license and patent implications?'))

In [71]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=["ner"])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ != "VERB":
                return True
    return False

method = "not-verb-but-pobj"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct + wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))

f"{precision},{recall},{accuracy},{model_name},{method}" # this is logged

'0.8937381404174574,0.7798013245033113,0.8698347107438017,en_core_web_sm,not-verb-but-pobj'

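When we relax the restriction to just "not a VERB", recall improves considerably (0.37 to 0.78) while precision drops (0.97 to 0.89), which is expected. Accuracy rises from 0.73 to 0.87, so this is still the best variant so far. Note that the logged `method` label still reads `not-verb-but-pobj`, although this variant no longer checks for `pobj`.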
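To put the two variants side by side on a single number, the sketch below recomputes the scores from confusion-matrix counts and also reports F1 for the relaxed variant. The counts are back-derived from the logged precision and recall together with the set sizes (1208 detectable, 1696 non-detectable). They are not printed anywhere in the notebook, so treat them as a reconstruction; the `scores` helper and the variant names are ours.

```python
def scores(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

n_detectable, n_non_detectable = 1208, 1696  # set sizes printed earlier

# (true positives, false positives), back-derived from the logged scores.
variants = {
    "not VERB + pobj + no nummod": (448, 16),
    "not VERB only": (942, 112),
}

for name, (tp, fp) in variants.items():
    fn = n_detectable - tp      # detectable titles the rule missed
    tn = n_non_detectable - fp  # distractor titles the rule correctly ignored
    p, r, f1, acc = scores(tp, fp, fn, tn)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

F1 is a convenient tie-breaker here because the two variants trade precision against recall in opposite directions.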