## Detecting programming languages with spaCy - Part 1


> Based on the tutorial by Vincent Warmerdam: https://www.youtube.com/watch?v=WnGPv6HnBok

---

### Import libraries and read data

In [1]:
import pandas as pd
import numpy as np
import spacy

In [3]:
df = (pd.read_csv("./data/Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

# extract the titles as a plain Python list
titles = df['Title'].tolist()

In [4]:
df.head()

Unnamed: 0,Id,Title
0,80,SQLStatement.execute() - multiple queries in o...
1,90,Good branching and merging tutorials for Torto...
2,120,ASP.NET Site Maps
3,180,Function for creating color wheels
4,260,Adding scripting functionality to .NET applica...


### Super naive function

We want to detect question titles that mention the programming language "Go". A first attempt is a plain substring check for " go " (with a space on each side).

In [5]:
def has_golang(text):
    return " go " in text

# create a generator
g = (title for title in titles if has_golang(title))

print([next(g) for _ in range(10)])

['Where does Console.WriteLine go in ASP.NET?', 'Should try...catch go inside or outside a loop?', 'Way to go from recursion to iteration', 'When are API methods marked "deprecated" actually going to go away?', 'How to go to main stack', 'In Mac OS X, is there a programmatic way to get the machine to go to sleep/hibernate?', 'Is .NET already the right way to go for small app development?', 'Should I use Java date and time classes or go with a 3rd party library like Joda Time?', 'Make multiple monitors go to sleep with Windows API?', 'Where to go to browse for open source projects to work on?']
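The substring check has two separate weaknesses, which the short sketch below tries to illustrate. It reuses three titles from the output above and adds two made-up titles (marked as hypothetical) that really do mention the language. The regex variant `has_golang_regex` is not part of the tutorial; it is only here to show that fixing capitalization and punctuation still leaves the verb/language ambiguity.

```python
import re

# First three titles are copied from the output above; the last two are
# hypothetical examples added for illustration only.
sample_titles = [
    "Where does Console.WriteLine go in ASP.NET?",
    "Way to go from recursion to iteration",
    "How to go to main stack",
    "Go: how do I read a file line by line?",  # hypothetical
    "Parsing JSON in go",                      # hypothetical
]

def has_golang_naive(text):
    # Same check as the notebook: needs a space on both sides, lowercase only.
    return " go " in text

# Word-boundary match, case-insensitive (an assumed alternative, not the author's).
GO_WORD = re.compile(r"\bgo\b", flags=re.IGNORECASE)

def has_golang_regex(text):
    return GO_WORD.search(text) is not None

naive_hits = [t for t in sample_titles if has_golang_naive(t)]
regex_hits = [t for t in sample_titles if has_golang_regex(t)]

print(len(naive_hits), "naive hits;", len(regex_hits), "regex hits")
```

The naive check finds only the three verb uses and misses both hypothetical Go titles. The regex finds all five, so it still cannot tell the verb from the language. That distinction needs grammatical information, which is where a parser comes in.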


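The substring check doesn't work well: all ten matches it printed use "go" as an ordinary English verb, not as the name of the language. So let us try spaCy. You might need to install the language model first, for example with `python -m spacy download en_core_web_sm`.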



### Intro to spaCy

In [6]:
nlp = spacy.load("en_core_web_sm")
[t for t in nlp("My name is mini.")]

[My, name, is, mini, .]

In [8]:
doc = nlp('Little mini likes gifts')
spacy.displacy.render(doc)

What does "amod" mean here?

In [9]:
spacy.explain("amod")

'adjectival modifier'

In [11]:
spacy.displacy.render(nlp("Where does Console.WriteLine go in ASP.NET?"))

In [17]:
for t in nlp("Where does Console.WriteLine go in ASP.NET?"):
    print(f'Token: {t} POS: {spacy.explain(t.pos_)} DEP: {t.dep_} Head text: {t.head.text}')

Token: Where POS: adverb DEP: advmod Head text: does
Token: does POS: auxiliary DEP: ROOT Head text: does
Token: Console POS: proper noun DEP: nsubj Head text: does
Token: . POS: punctuation DEP: punct Head text: does
Token: WriteLine POS: proper noun DEP: nsubj Head text: go
Token: go POS: verb DEP: ROOT Head text: go
Token: in POS: adposition DEP: prep Head text: go
Token: ASP.NET POS: proper noun DEP: pobj Head text: in
Token: ? POS: punctuation DEP: punct Head text: go


#### How do we go from the text description to the diagram?

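The head text tells us how to connect the arrows: each arc in the diagram links a token to its head, and the arc's label is the token's `dep_` value.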

> spaCy uses the terms head and child to describe the words connected by a single arc in the dependency tree. The term dep is used for the arc label, which describes the type of syntactic relation that connects the child to the head.

- Read more here: https://spacy.io/usage/linguistic-features
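To make the head/child relationship concrete, here is a minimal sketch that rebuilds the arcs from the printed table alone, with no spaCy needed. The `rows` list is copied from the output above. Grouping by head *text* is a shortcut that only works because every token in this title is distinct. With a live `Doc` you would normally rely on `token.head` and `token.children` instead.

```python
from collections import defaultdict

# (token, dep, head) rows copied from the printed output above.
rows = [
    ("Where", "advmod", "does"),
    ("does", "ROOT", "does"),
    ("Console", "nsubj", "does"),
    (".", "punct", "does"),
    ("WriteLine", "nsubj", "go"),
    ("go", "ROOT", "go"),
    ("in", "prep", "go"),
    ("ASP.NET", "pobj", "in"),
    ("?", "punct", "go"),
]

children = defaultdict(list)  # head text -> list of (child text, dep label)
roots = []                    # tokens labeled ROOT (their head is themselves)

for token, dep, head in rows:
    if dep == "ROOT":
        roots.append(token)
    else:
        children[head].append((token, dep))

# One line per arc in the diagram.
for head, kids in children.items():
    for token, dep in kids:
        print(f"{head} --{dep}--> {token}")

print("roots:", roots)
```

The table contains two `ROOT` tokens, "does" and "go". This suggests the parser treated the title as two separate segments, most likely because of the "." inside `Console.WriteLine`.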

