## Start

Let's first load the dataset (which can be found [here](https://www.kaggle.com/stackoverflow/stacksample)) into pandas.

We'll start by grabbing a list of the question titles.

In [1]:
import pandas as pd

df = (pd.read_csv("data/Questions.csv", nrows=100_000,
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))
titles = df['Title'].tolist()

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Id      100000 non-null  int64 
 1   Title   100000 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.5+ MB


In [47]:
df

|  | Id | Title |
|---|---|---|
| 0 | 80 | SQLStatement.execute() - multiple queries in o... |
| 1 | 90 | Good branching and merging tutorials for Torto... |
| 2 | 120 | ASP.NET Site Maps |
| 3 | 180 | Function for creating color wheels |
| 4 | 260 | Adding scripting functionality to .NET applica... |
| ... | ... | ... |
| 1264211 | 40143210 | URL routing in PHP (MVC) |
| 1264212 | 40143300 | Bigquery.Jobs.Insert - Resumable Upload? |
| 1264213 | 40143340 | Obfuscating code in android studio |
| 1264214 | 40143360 | How to fire function after v-model change? |


In [4]:
len(titles)

100000

In [5]:
import random

random.choices(titles, k=20)

['Non-blocking getch(), ncurses',
 'Automated download script',
 'Ensuring thread synchronization in SQL possible?',
 'What is the best Java-based Mailing List application allowing opt-in/opt-out',
 'Using System.Net.Socket, how can we know when the remote socket is closed?',
 'How to predict MySQL tipping points?',
 'Building a NSPredicate for a filter',
 'Why are Doctype declarations split on two lines?',
 'Logging Search Results in a Rails Application',
 'Using Bash to automate creation of test outputs',
 'Custom fonts, ellipsis on MultiLine TextViews, Glyphs and glitches',
 'Referencing figures with numbers in Sphinx and reStructuredText',
 'Check number of checked nodes in TreeView',
 'Open source RTP mixer/translator exe or sdk',
 'javascript string exec strange behavior',
 'Check conditions on each page request',
 'How do I refer to the document object of an <iframe> from the parent page, using jQuery?',
 'Interfacing common functionality between controls',
 'C#: menu item not s

In [6]:
def has_golang(text):
    return " go " in text

has_golang("abcdd")

False

In [7]:
g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]

['Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?']
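Pulling items with `next` works here because the generator has at least two hits. If it ran dry, `next` would raise `StopIteration`. As a small aside, `itertools.islice` takes up to `k` items without that risk. The three-title list below is a stand-in for `titles`, not the real data.

```python
from itertools import islice

# Stand-in for the real `titles` list (two of these appear in the output above)
titles = [
    "Where does Console.WriteLine go in ASP.NET?",
    "ASP.NET Site Maps",
    "Should try...catch go inside or outside a loop?",
]

g = (t for t in titles if " go " in t)

# Ask for five items; islice simply stops when the generator is exhausted
first_five = list(islice(g, 5))
print(first_five)
```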

In [8]:
type(g)

generator
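To see where the substring check falls short, here it is on a few titles that show up elsewhere in this notebook. The first three are tagged `go` in the dataset and all get missed because of capitalisation or position; the only hit uses "go" as a verb.

```python
def has_golang(text):
    return " go " in text

examples = [
    "Go language benchmarks?",                      # capitalised, at the start
    "Embedding instead of inheritance in Go",       # capitalised, at the end
    "Shared library in Go?",                        # followed by punctuation
    "Where does Console.WriteLine go in ASP.NET?",  # "go" as a verb
]

results = [has_golang(t) for t in examples]
print(results)
```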

The substring check doesn't work too well: both hits above use "go" as a verb, not as the name of the language. So let's try spaCy instead. You might need to install the language model first.

```
python -m spacy download en_core_web_sm
```

In [9]:
import spacy 

nlp = spacy.load("en_core_web_sm")

Let's first see how it works.

In [10]:
nlp("Some text")

Some text

In [11]:
[t for t in nlp("My name is Vincent.")]

[My, name, is, Vincent, .]

In [12]:
doc = nlp("My name is Vincent.")

In [13]:
t = doc[0]

In [14]:
type(t)

spacy.tokens.token.Token

In [15]:
from spacy import displacy

displacy.render(doc)

In [16]:
spacy.explain("DET")

'determiner'

In [17]:
for t in doc:
    print(t, t.pos_, t.dep_)

My DET poss
name NOUN nsubj
is AUX ROOT
Vincent PROPN attr
. PUNCT punct


In [18]:
for t in nlp("Where does Console.WriteLine go in ASP.NET?"):
    print(t, t.pos_, t.dep_)

Where ADV advmod
does AUX ROOT
Console PROPN nsubj
. PUNCT punct
WriteLine PROPN nsubj
go VERB ROOT
in ADP prep
ASP.NET PROPN pobj
? PUNCT punct


## Back to Detecting Golang

Let's now apply spaCy to our original problem: detecting titles that are about the Go language. I'll also write the code in a more performant way.

In [19]:
nlp = spacy.load("en_core_web_sm")

In [20]:
df = (pd.read_csv("data/Questions.csv", nrows=2_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

titles = df.loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title'].tolist()

In [21]:
len(titles)

49589

In [22]:
# Print the first 10 titles
print(*titles[0:10], sep="\n")

Good branching and merging tutorials for TortoiseSVN?
Good STL-like library for C
My website got hacked... What should I do?
DVCS Choices - What's good for Windows?
Is a "Confirm Email" input good practice when user changes email address?
Any good advice on using emacs for C++ project?
What is a good way to denormalize a mysql database?
Is AnkhSVN any good?
Arguments for going open source
Does Hostmonster support Django


In [23]:
# PERF improvement: disabling NER from the pipeline
nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [24]:
%%time

def has_golang(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ == "NOUN":
                return True 
    return False

# PERF improvement: use nlp.pipe()
g = (doc for doc in nlp.pipe(titles) if has_golang(doc))
[next(g) for i in range(30)]

CPU times: user 8.79 s, sys: 245 ms, total: 9.04 s
Wall time: 9.04 s


[Deploying multiple Java web apps to Glassfish in one go,
 Removing all event handlers in one go,
 Paypal integration to serve multiple sellers in one go for a shopping site,
 How do I disable multiple listboxes in one go using jQuery?,
 multi package makefile example for go,
 Google's 'go' and scope/functions,
 Where is App.config go after publishing?,
 SOAPUI & Groovy Scripts, executing multiple SQL statements in one go,
 What's the simplest way to edit conflicted files in one go when using git and an editor like Vim or textmate?,
 Import large chunk of data into Google App Engine Data Store at one go,
 Saving all nested form objects in one go,
 what's the state of go language IDE support?,
 Decrypt many PDFs in one go using pdftk,
 How do I allocate memory for an array in the go programming language?,
 Is message passing via channels in go guaranteed to be non-blocking?,
 The maximum value for an int type in Go,
 Is there a reason why arrays in memory 'go' down while the function st

In [25]:
displacy.render(nlp("The maximum value for an int type in Go"))

This works! Now let's write some pandas code that will help us with our benchmarks.

![](images/these-2.png)

TODOs
1. Pandas + Lambdas, Pandas querying / filtering
2. Precision, Recall
3. Printing, formatting
4. A few benchmarking experiments

In [26]:
df_tags = pd.read_csv("data/Tags.csv")

In [27]:
df_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3750994 entries, 0 to 3750993
Data columns (total 2 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   Id      int64 
 1   Tag     object
dtypes: int64(1), object(1)
memory usage: 57.2+ MB


In [28]:
df_tags

|  | Id | Tag |
|---|---|---|
| 0 | 80 | flex |
| 1 | 80 | actionscript-3 |
| 2 | 80 | air |
| 3 | 90 | svn |
| 4 | 90 | tortoisesvn |
| ... | ... | ... |
| 3750989 | 40143360 | javascript |
| 3750990 | 40143360 | vue.js |
| 3750991 | 40143380 | npm |
| 3750992 | 40143380 | mocha |


In [29]:
go_ids = df_tags.loc[lambda d: d['Tag'] == 'go']['Id']

In [30]:
go_ids

98267       1724680
98367       1726130
98457       1727250
100482      1757090
101172      1766720
             ...   
3746985    40110670
3747206    40112250
3748186    40120850
3750374    40138660
3750837    40142060
Name: Id, Length: 1858, dtype: int64

In [42]:
#df_tags.query("Id == 1726130")
df_tags.iloc[3750837]
#df_tags.loc[3750837]

Id     40142060
Tag          go
Name: 3750837, dtype: object

In [50]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            #if t.pos_ != 'VERB':
            return True
    return False

In [51]:
# All titles of questions tagged 'go'
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()
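The `.loc[lambda d: ...]` plus `isin` pattern is easier to follow on a tiny frame. The rows below are made up for illustration (the Ids and the extra tags are hypothetical), but the filtering calls are the same ones used in this notebook.

```python
import pandas as pd

# Toy stand-ins for Questions.csv and Tags.csv (hypothetical rows)
questions = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Title": [
        "Shared library in Go?",
        "Is AnkhSVN any good?",
        "exec.Run and argv problem",
        "How to go to main stack",
    ],
})
tags = pd.DataFrame({
    "Id": [1, 1, 2, 3, 4],
    "Tag": ["go", "linker", "svn", "go", "ios"],
})

# Ids of questions that carry the 'go' tag
go_ids = tags.loc[lambda d: d["Tag"] == "go"]["Id"]

# Titles of tagged questions, whether or not "go" appears in the title
tagged_go = questions.loc[lambda d: d["Id"].isin(go_ids)]["Title"].tolist()

# Titles that contain the substring "go" but are NOT tagged 'go'
untagged_with_go = (questions
                    .loc[lambda d: ~d["Id"].isin(go_ids)]
                    .loc[lambda d: d["Title"].str.lower().str.contains("go")]
                    ["Title"]
                    .tolist())

print(tagged_go)
print(untagged_with_go)
```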

In [54]:
all_go_sentences[0:10]

['Go language benchmarks?',
 'Go code contribution: license and patent implications?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go',
 "What's the point of having pointers in Go?",
 'Simulate a tcp connection in Go',
 'exec.Run and argv problem',
 'Trouble reading from a socket in go',
 "basic json > struct question ( using 'Go')"]

In [56]:
all_go_sentences[7] # Tagged as 'go' but no 'go' token in sentence

'exec.Run and argv problem'

In [57]:
detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]

In [58]:
detectable[0:10] # 'go' token and 'go' tag

['Go language benchmarks?',
 'Go code contribution: license and patent implications?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go',
 "What's the point of having pointers in Go?",
 'Simulate a tcp connection in Go',
 'Trouble reading from a socket in go',
 "basic json > struct question ( using 'Go')",
 "Google's 'go' and scope/functions"]

In [59]:
non_detectable = (df
                  .loc[lambda d: ~d['Id'].isin(go_ids)]
                  .loc[lambda d: d['Title'].str.lower().str.contains("go")]
                  ['Title']
                  .tolist())

In [60]:
non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]

In [61]:
non_detectable[0:10] # 'go' token but not 'go' tag

['Where did all the java applets go?',
 "Where'd my generic ActionLink go?",
 'Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?',
 'Way to go from recursion to iteration',
 'With wicket where does hibernate.cfg.xml file go?',
 'When are API methods marked "deprecated" actually going to go away?',
 'How to go to main stack',
 'In Mac OS X, is there a programmatic way to get the machine to go to sleep/hibernate?',
 'How to Convince Programming Team to Let Go of Old Ways?']

In [62]:
len(all_go_sentences), len(detectable), len(non_detectable)

(1858, 1208, 1696)

Nice, we get some numbers that can result in a meaningful benchmark.

We can calculate precision/recall-like stats by running the code below. You can put a for loop around it if you want, but as it is you can fiddle with the `has_go_token` function to see how well it performs.

In [63]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=["ner"])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ != "VERB":
                return True
    return False

method = "not-verb-but-pobj"

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))  # true positives
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))  # false positives
precision = correct / (correct + wrong)
# Recall is measured against `detectable` (tagged 'go' AND containing a go/golang token),
# not against every title tagged 'go'.
recall = correct / len(detectable)  # len(detectable) = TP + FN within this subset
accuracy = (correct + len(non_detectable) - wrong) / (len(detectable) + len(non_detectable))

f"{precision},{recall},{accuracy},{model_name},{method}" # this is logged

'0.8910984848484849,0.7789735099337748,0.8684573002754821,en_core_web_sm,not-verb-but-pobj'
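The for loop mentioned above could be sketched like this. `evaluate`, `any_go`, `go_not_verb` and `fake_doc` are hypothetical helper names, and the hand-labelled stand-in tokens only exist so the sketch runs without loading a model. With real data you would pass `list(model.pipe(detectable))` and `list(model.pipe(non_detectable))` instead.

```python
from types import SimpleNamespace

def evaluate(predicate, positives, negatives):
    """Return (precision, recall, accuracy) for a doc-level predicate."""
    tp = sum(1 for doc in positives if predicate(doc))
    fp = sum(1 for doc in negatives if predicate(doc))
    fn = len(positives) - tp
    tn = len(negatives) - fp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (len(positives) + len(negatives))
    return precision, recall, accuracy

def any_go(doc):
    return any(t.lower_ in ("go", "golang") for t in doc)

def go_not_verb(doc):
    return any(t.lower_ in ("go", "golang") and t.pos_ != "VERB" for t in doc)

def fake_doc(*pairs):
    # Assumption: mimics only the two token attributes the predicates read
    return [SimpleNamespace(lower_=word, pos_=pos) for word, pos in pairs]

# Hand-labelled stand-ins; the POS tags are made up, not spaCy output
positives = [
    fake_doc(("pointers", "NOUN"), ("in", "ADP"), ("go", "PROPN")),
    fake_doc(("go", "VERB"), ("channels", "NOUN")),
]
negatives = [
    fake_doc(("where", "ADV"), ("does", "AUX"), ("it", "PRON"), ("go", "VERB")),
    fake_doc(("all", "DET"), ("in", "ADP"), ("one", "NUM"), ("go", "NOUN")),
]

results = {}
for pred in (any_go, go_not_verb):
    results[pred.__name__] = evaluate(pred, positives, negatives)
    print(pred.__name__, results[pred.__name__])
```

Keeping each variant as its own small function makes it easy to log one line per method, the same way the f-string above does.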

Enjoy playing around with this.

TODO: convert this to a binary classification representation with TP, TN, FP and FN, as in a confusion matrix. Using a pandas DataFrame with joins and derived features should make it easier to understand.
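A first pass at that confusion matrix, using only numbers already printed in this notebook. The true positive and false positive counts (941 and 115) are not printed anywhere; they are back-computed from the logged precision and recall, so treat them as an assumption.

```python
# Counts printed earlier in the notebook
n_tagged = 1858               # titles tagged 'go'
n_detectable = 1208           # tagged titles that contain a go/golang token
n_untagged_with_token = 1696  # untagged titles that contain a go/golang token

# Assumption: back-computed from the logged precision and recall
tp = 941
fp = 115

fn_detectable = n_detectable - tp   # tagged, has token, predicted False
tn = n_untagged_with_token - fp     # untagged, has token, predicted False
fn_all_tagged = n_tagged - tp       # counts every tagged title as a positive

confusion = {"TP": tp, "FP": fp, "FN": fn_detectable, "TN": tn}

precision = tp / (tp + fp)
recall_detectable = tp / (tp + fn_detectable)
recall_all_tagged = tp / (tp + fn_all_tagged)
accuracy = (tp + tn) / (tp + tn + fp + fn_detectable)

print(confusion)
print(precision, recall_detectable, accuracy, recall_all_tagged)
```

The first three metrics reproduce the logged line. Measured against all 1858 tagged titles instead of the 1208 detectable ones, recall drops to about 0.51, because titles like 'exec.Run and argv problem' can never be caught by a token-based rule.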