# NAMED ENTITY RECOGNITION

----

# PART 1 : GETTING STARTED

Problem Statement: We have data from the Stack Overflow questions repository, and we want to find which programming languages the questions are about.

Data : https://www.kaggle.com/stackoverflow/stacksample

In [7]:
import pandas as pd
df = pd.read_csv(r"archive/Questions.csv", nrows=1000000, encoding = 'ISO-8859-1', usecols = ['Title', 'Id'])

In [8]:
df.head()

Unnamed: 0,Id,Title
0,80,SQLStatement.execute() - multiple queries in o...
1,90,Good branching and merging tutorials for Torto...
2,120,ASP.NET Site Maps
3,180,Function for creating color wheels
4,260,Adding scripting functionality to .NET applica...


In [9]:
titles = [_ for _ in df['Title']]
titles

['SQLStatement.execute() - multiple queries in one statement',
 'Good branching and merging tutorials for TortoiseSVN?',
 'ASP.NET Site Maps',
 'Function for creating color wheels',
 'Adding scripting functionality to .NET applications',
 'Should I use nested classes in this case?',
 'Homegrown consumption of web services',
 'Deploying SQL Server Databases from Test to Live',
 'Automatically update version number',
 'Visual Studio Setup Project - Per User Registry Settings',
 'How do I connect to a database and loop over a recordset in C#?',
 'How to get the value of built, encoded ViewState?',
 'How do I delete a file which is locked by another process in C#?',
 'Process size on UNIX',
 'Use SVN Revision to label build in CCNET',
 'How to make subdomain user accounts in a webapp',
 'Is nAnt still supported and suitable for .net 3.5/VS2008?',
 'Is Windows Server 2008 "Server Core" appropriate for a SQL Server instance?',
 'What is the best way to copy a database?',
 'Can I logically re

Naive Attempt : Let's try to build a function to find the languages

In [17]:
def has_golang(text):
    return " go " in text

# Generator expression: lazily yields only the titles that pass the check
g = (title for title in titles if has_golang(title))
[next(g) for i in range(2)]

['Where does Console.WriteLine go in ASP.NET?',
 'Should try...catch go inside or outside a loop?']

Realisation: Basic string search (even with regex) isn't going to help us. It ignores the grammatical role a word plays in the sentence. Example: "go" is used as a verb in many titles, so how do we tell that apart from the language name?
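To see why a smarter string search does not rescue the naive approach, here is a minimal sketch using a word-boundary regex. The helper name `has_golang_regex` is illustrative and not part of the notebook; the sample titles are taken from outputs shown in this notebook.

```python
import re

# Two titles where "go" is a verb, and one that is about the Go language.
sample_titles = [
    "Where does Console.WriteLine go in ASP.NET?",
    "Should try...catch go inside or outside a loop?",
    "Embedding instead of inheritance in Go",
]

# \b marks a word boundary, so "go" is also found at the end of a title,
# and IGNORECASE lets "Go" and "go" both match.
go_word = re.compile(r"\bgo\b", flags=re.IGNORECASE)

def has_golang_regex(text):
    """Return True if the standalone word "go" appears anywhere in the text."""
    return go_word.search(text) is not None

regex_hits = [t for t in sample_titles if has_golang_regex(t)]
print(len(regex_hits))  # 3
```

All three titles match, including the two where "go" is a verb, so the pattern alone cannot separate the language from the everyday word.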

#### Let's step into spaCy

Command to download : **python -m spacy download en_core_web_sm**

In [19]:
import spacy
nlp = spacy.load("en_core_web_sm")

Calling nlp on a string returns a spaCy Doc object

In [21]:
type(nlp("My name is Akshit"))

spacy.tokens.doc.Doc

In [25]:
doc = nlp("My name is Akshit")

Indexing a Doc gives a spaCy Token, and tokens have a lot of properties

In [26]:
print(doc[0])
type(doc[0])

My


spacy.tokens.token.Token

In [28]:
from spacy import displacy
displacy.render(doc)

In [29]:
spacy.explain("poss")

'possession modifier'

In [31]:
for t in doc:
    print(t, t.pos_, t.dep_)

My PRON poss
name NOUN nsubj
is AUX ROOT
Akshit PROPN attr


Going back to an instance of our problem

In [33]:
for t in nlp("Where does Console.WriteLine go in ASP.NET?"):
    print(t, t.pos_, t.dep_)

Where ADV advmod
does VERB ROOT
Console PROPN nsubj
. PUNCT punct
WriteLine PROPN nsubj
go VERB ROOT
in ADP prep
ASP.NET PROPN pobj
? PUNCT punct


#### Great! Here we can see that "go" is actually tagged as a verb.

Details : https://spacy.io/usage/linguistic-features

## Restart

In [64]:
df = pd.read_csv(r"archive/Questions.csv", nrows=2000000, encoding = 'ISO-8859-1', usecols = ['Title', 'Id'])
# Keep only the titles that contain "go" (case-insensitive substring match)
titles = [_ for _ in df.loc[lambda d: d['Title'].str.lower().str.contains("go")]['Title']]
titles

['Good branching and merging tutorials for TortoiseSVN?',
 'Good STL-like library for C',
 'My website got hacked... What should I do?',
 "DVCS Choices - What's good for Windows?",
 'Is a "Confirm Email" input good practice when user changes email address?',
 'Any good advice on using emacs for C++ project?',
 'What is a good way to denormalize a mysql database?',
 'Is AnkhSVN any good?',
 'Arguments for going open source',
 'Does Hostmonster support Django',
 "What's a good way to check if two datetimes are on the same calendar day in TSQL?",
 'Good strategy for leaving an audit trail/change history for DB applications?',
 'Factorial Algorithms in different languages',
 'What is a good dvd burning component for Windows or .Net?',
 'Best .NET Wrapper for Google Maps or Yahoo Maps?',
 'Is there a best .NET algorithm for credit card encryption?',
 'How to generate urls in django',
 'Suggest some good MVC framework in perl',
 'Is there a good Fogbugz client for Mac OS X?',
 'What are some

In [65]:
# Return True if the text contains the token "go" or "golang" and it is not tagged as a verb; otherwise return False
def has_golang2(text):
    doc = nlp(text)
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_!='VERB':
                return True
            
    return False

g = (title for title in titles if has_golang2(title))
[next(g) for i in range(5)]

['Removing all event handlers in one go',
 'How to Create a Dropdown List Hyperlink without the GO button?',
 'Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go']

This is an improvement, but it is not perfect: "in one go" and "the GO button" are not about the language. Let's see what else we can do.

In [66]:
displacy.render(nlp("Embedding instead of inheritance in Go"))

In [67]:
spacy.explain("pobj")

'object of preposition'

In [68]:
def has_golang3(text):
    doc = nlp(text)
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_!='VERB':
                if t.dep_ == "pobj":
                    return True
            
    return False

g = (title for title in titles if has_golang3(title))
[next(g) for i in range(5)]

['Embedding instead of inheritance in Go',
 'Shared library in Go?',
 'multi package makefile example for go',
 "What's the point of having pointers in Go?",
 'Simulate a tcp connection in Go']

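This is even better: once we require "go" to be the object of a preposition (as in "in Go"), the two false positives above no longer show up in the sample.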

## Let's try to optimize the model - reduce the time

- Pass the list of titles through nlp.pipe (pipe processes texts in batches and is faster)
- The nlp model that we are using does all sorts of things behind the scenes: tokenization, lemmatization and more. We're not using all of them. We're not using NER (named entity recognition) for now, so we'll turn it off

In [69]:
nlp = spacy.load("en_core_web_sm", disable = ["ner"])

In [70]:
%%time

def has_golang4(doc):
    for t in doc:
        if t.lower_ in ["go","golang"]:
            if t.pos_!='VERB':
                if t.dep_ == "pobj":
                    return True
            
    return False

g = (doc for doc in nlp.pipe(titles) if has_golang4(doc))
[next(g) for i in range(5)]

Wall time: 3.3 s


[Embedding instead of inheritance in Go,
 Shared library in Go?,
 multi package makefile example for go,
 What's the point of having pointers in Go?,
 Simulate a tcp connection in Go]

## Let's try to find a dataset (without using the function we have built) that seems about correct

**Then we will test our model on correct/not correct**

LHS: We have the "Tags" file in our data, so we pick all the rows with the tag "go". These are all the questions about Go. Some of them don't contain the word "go" at all.

RHS: All the titles that contain the word "go". A plain string search obviously returns bad results as well, but if a title has also been **tagged (LHS)** as "go", that gives us confidence.

![approach](images/these-2.png)
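The LHS/RHS split can be sketched with plain Python sets. This is only an illustration on toy data: the `questions` dict, the Ids, and the helper `has_go_word` are hypothetical, and title 3 is invented to show a question that is tagged "go" without containing the word.

```python
# Hypothetical toy data: question Id -> title.
questions = {
    1: "Shared library in Go?",
    2: "Where does Console.WriteLine go in ASP.NET?",
    3: "How do I read a file line by line?",  # invented: tagged "go", no "go" word
    4: "Is AnkhSVN any good?",
}
tagged_go = {1, 3}  # LHS: Ids carrying the "go" tag

def has_go_word(title):
    """Crude word-level check, standing in for the token check used later."""
    words = title.lower().replace("?", " ").split()
    return "go" in words or "golang" in words

# RHS: Ids whose title contains the word "go".
contains_go = {qid for qid, title in questions.items() if has_go_word(title)}

detectable = tagged_go & contains_go      # tagged AND contains the word
non_detectable = contains_go - tagged_go  # contains the word but NOT tagged
tagged_only = tagged_go - contains_go     # tagged, but no keyword to find

print(detectable, non_detectable, tagged_only)  # {1} {2} {3}
```

A good detector should fire on everything in `detectable` and stay silent on everything in `non_detectable`; the `tagged_only` questions cannot be found by any keyword-based check.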

In [71]:
df_tags = pd.read_csv(r"archive/Tags.csv")
df_tags.head()

Unnamed: 0,Id,Tag
0,80,flex
1,80,actionscript-3
2,80,air
3,90,svn
4,90,tortoisesvn


In [72]:
go_ids = df_tags.loc[df_tags['Tag']=='go']['Id']
go_ids

98267       1724680
98367       1726130
98457       1727250
100482      1757090
101172      1766720
             ...   
3746985    40110670
3747206    40112250
3748186    40120850
3750374    40138660
3750837    40142060
Name: Id, Length: 1858, dtype: int64

In [73]:
go_ids = df_tags.loc[lambda d: d['Tag']=='go']['Id']
go_ids

98267       1724680
98367       1726130
98457       1727250
100482      1757090
101172      1766720
             ...   
3746985    40110670
3747206    40112250
3748186    40120850
3750374    40138660
3750837    40142060
Name: Id, Length: 1858, dtype: int64

In [74]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            return True
    return False

# All titles from the questions DataFrame whose Id is in go_ids, i.e. questions tagged "go"
all_go_sentences = df.loc[lambda d: d['Id'].isin(go_ids)]['Title'].tolist()

# Titles tagged "go" that also contain the token "go" or "golang"
# This was our idea: the common part (LHS intersection RHS)
detectable = [d.text for d in nlp.pipe(all_go_sentences) if has_go_token(d)]


# Titles that contain the word "go" but have not been tagged "go"
# This is RHS minus LHS (contains "go", not tagged "go")
non_detectable = (df
                  .loc[lambda d: ~d['Id'].isin(go_ids)]
                  .loc[lambda d: d['Title'].str.lower().str.contains('go')]
                  ['Title']
                  .tolist()
                 )

non_detectable = [d.text for d in nlp.pipe(non_detectable) if has_go_token(d)]


len(all_go_sentences), len(detectable), len(non_detectable)

(1858, 1208, 1696)

![result](images/results.png)

## It's time to test our function/model on the above results to see the ACCURACY

In [97]:
model_name = "en_core_web_sm"
model = spacy.load(model_name, disable=["ner"])

def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ != "VERB":
                return True
    return False

method = "not-verb-but-pobj"  # label logged with the scores; note that has_go_token above only applies the not-a-verb check

correct = sum(has_go_token(doc) for doc in model.pipe(detectable))
wrong = sum(has_go_token(doc) for doc in model.pipe(non_detectable))
precision = correct/(correct + wrong)
recall = correct/len(detectable)
accuracy = (correct + len(non_detectable) - wrong)/(len(detectable) + len(non_detectable))

f"{precision*100:^7.2f},{recall*100:^7.2f},{accuracy*100:^7.2f},{model_name:^15},{method:^15}" # this is logged

' 92.29 , 72.35 , 85.98 ,en_core_web_sm ,not-verb-but-pobj'
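As a worked check of the three formulas, the sketch below recomputes the logged percentages from confusion-matrix counts. The set sizes 1208 and 1696 come from the notebook output; the values of `correct` and `wrong` are an assumption, back-solved from the logged percentages because the notebook never prints them.

```python
# Sizes of the two evaluation sets, as printed earlier in the notebook.
n_detectable = 1208
n_non_detectable = 1696

# Assumption: inferred from the logged percentages, not printed in the notebook.
correct = 874  # detector fired on a "detectable" title
wrong = 73     # detector fired on a "non_detectable" title

true_positives = correct
false_positives = wrong
false_negatives = n_detectable - correct
true_negatives = n_non_detectable - wrong

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (n_detectable + n_non_detectable)

print(f"{precision*100:.2f}, {recall*100:.2f}, {accuracy*100:.2f}")
# 92.29, 72.35, 85.98
```

Read this way, precision says that about 92% of the titles the detector flags really are tagged "go", while recall says it still misses more than a quarter of the detectable ones.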

----

# PART 2 : DETECTING MULTI-TOKEN PROGRAMMING LANGUAGES

Where are we? We wrote a function to detect the "go" programming language and used the tagged dataset to check the accuracy of our model

In [98]:
nlp = spacy.load("en_core_web_sm")

In [99]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ["go", "golang"]:
            if t.pos_ != "VERB":
                return True
    return False

In [102]:
doc = nlp("I like to program in go")

In [103]:
has_go_token(doc)

True

**Problem** : detecting other languages

In [107]:
def has_go_token_2(doc):
    for t in doc:
        if t.lower_ in ["go", "golang", "objective-c"]:
            if t.pos_ != "VERB":
                return True
    return False

In [108]:
doc = nlp("I like to program in objective-c")

In [110]:
has_go_token_2(doc)

False

In [111]:
[t for t in doc]

[I, like, to, program, in, objective, -, c]

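**Solution** : Make token patterns. The tokenizer splits "objective-c" into three tokens, so a check on a single token can never match it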
<br>
https://explosion.ai/demos/matcher?text=I%20am%20a%20Ios%20dev%20who%20codes%20in%20objective-c.%20%20objective%20xyz%20-c&model=en_core_web_sm&pattern=%5B%7B%22id%22%3A3%2C%22attrs%22%3A%5B%7B%22name%22%3A%22LOWER%22%2C%22value%22%3A%22objective%22%7D%5D%7D%2C%7B%22id%22%3A1%2C%22attrs%22%3A%5B%7B%22name%22%3A%22IS_PUNCT%22%2C%22value%22%3Atrue%7D%5D%7D%2C%7B%22id%22%3A2%2C%22attrs%22%3A%5B%7B%22name%22%3A%22LOWER%22%2C%22value%22%3A%22c%22%7D%5D%7D%5D
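To make the idea of a token pattern concrete, here is a toy, pure-Python sketch of how a pattern with an optional element can be matched against a list of tokens. It is not spaCy's implementation: `token_ok`, `match_at` and `find_matches` are hypothetical helpers, and only the `LOWER`, `IS_PUNCT` and `'OP': '?'` keys are imitated.

```python
import string

def token_ok(token, spec):
    """Check one token against one pattern element."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True

def match_at(tokens, start, pattern):
    """Return the end index of a match beginning at `start`, or None."""
    if not pattern:
        return start
    spec, rest = pattern[0], pattern[1:]
    # First try to consume the current token with this pattern element.
    if start < len(tokens) and token_ok(tokens[start], spec):
        end = match_at(tokens, start + 1, rest)
        if end is not None:
            return end
    # An optional element ('OP': '?') may also be skipped entirely.
    if spec.get("OP") == "?":
        return match_at(tokens, start, rest)
    return None

def find_matches(tokens, pattern):
    """Collect (start, end) spans for every position where the pattern matches."""
    matches = []
    for start in range(len(tokens)):
        end = match_at(tokens, start, pattern)
        if end is not None:
            matches.append((start, end))
    return matches

pattern = [{"LOWER": "objective"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "c"}]
hyphenated = ["I", "like", "to", "program", "in", "objective", "-", "c"]
spaced = ["I", "like", "to", "program", "in", "Objective", "C"]

print(find_matches(hyphenated, pattern))  # [(5, 8)]
print(find_matches(spaced, pattern))      # [(5, 7)]
```

The same three-element pattern covers both spellings because the middle element is optional, which is the behaviour the notebook relies on when it switches to spaCy's `Matcher`.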

In [146]:
from spacy.matcher import Matcher

In [161]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  # punctuation is optional, because the name can also be written "objective c"
                  {'IS_PUNCT': True, 'OP': '?'},
                  {'LOWER': 'c'}]

obj_c_pattern2 = [{'LOWER': 'objectivec'}]

golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

python_pattern = [{'LOWER': 'python'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

In [163]:
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", [obj_c_pattern1, obj_c_pattern2])
matcher.add("PYTHON_LANG", [python_pattern])
matcher.add("GO_LANG", [golang_pattern1, golang_pattern2])
matcher.add("JS_LANG", [js_pattern])
matcher.add("RUBY_LANG", [ruby_pattern])

In [164]:
doc = nlp("I am an ios dev who codes in both golang and objective-c")
matcher(doc)

[(4055088451470951652, 9, 10), (4002319739860662978, 11, 14)]

In [165]:
doc = nlp("I am an iOS dev who codes in both python, go/golang as well as objective-c")
for match_id, start, end in matcher(doc):
    print(doc[start: end])

python
golang
objective-c


In [166]:
doc = nlp("I've done some js and ruby and go programming")
for match_id, start, end in matcher(doc):
    print(doc[start: end])

js
ruby


## Benchmarking

Our current approach works, but it would be good to confirm this with data. I'll do a soft benchmark: I'll check for the occurrence of a string, like "objective", and see which of those titles my matcher does not pick up. If I am missing something, I should get a pretty clear picture of it.

In [153]:
import pandas as pd

df = (pd.read_csv("archive/Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [170]:
titles = (_ for _ in df['Title'] if "objective" in _.lower())

In [171]:
next(titles)

'Other than Xcode, are there any full functioned IDEs for Objective-C?'

In [172]:
for i in range(200):
    doc = nlp(next(titles))
    # Print the titles the matcher misses: matcher(doc) returns an empty list for those
    if len(matcher(doc)) == 0:
        print(doc)

Having to set objectives for developers, even though objectives don't work
How can i connect MySQL database with objective project?
Downloading multiple files in iphone app(Objective c)
Including Objective C++ Type in C++ Class Definition
Storing UIImages with ObjectiveRecord and ObjectiveSync
__OBJC__ equivalent for Objective-C++
iPhone Device/Simulator memory oddities using Objective-C++
How well is Objective-J documented? Is the documentation good enough to start using it seriously?
Objective-C. I have typedef float DuglaType[3]. How do I declare the property for this?


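Wow, only a few titles in the first 200 iterations were missed by our matcher, and several of them (for example "objectives", "objective project" and Objective-J) are not about Objective-C at all. That's great so far. No model in ML is perfect anyway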
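The soft benchmark above can be wrapped in a small reusable helper. This is a sketch under assumptions: `soft_benchmark` and `regex_matcher` are hypothetical names, and a regex stands in for the spaCy matcher so that the example runs without a model. It uses `itertools.islice`, which simply stops early when fewer than `limit` titles are available.

```python
import re
from itertools import islice

def soft_benchmark(titles, keyword, is_match, limit=200):
    """Check the first `limit` titles containing `keyword`; return the ones `is_match` misses."""
    candidates = (t for t in titles if keyword in t.lower())
    return [t for t in islice(candidates, limit) if not is_match(t)]

# Stand-in matcher (assumption): "objective", an optional space or hyphen, then "c"
# that is not followed by another word character or "+", so "Objective C++" is skipped.
objc_regex = re.compile(r"\bobjective[\s-]?c(?![\w+])", re.IGNORECASE)

def regex_matcher(title):
    return objc_regex.search(title) is not None

sample_titles = [
    "Other than Xcode, are there any full functioned IDEs for Objective-C?",
    "How can i connect MySQL database with objective project?",
    "Including Objective C++ Type in C++ Class Definition",
    "ASP.NET Site Maps",
]

missed = soft_benchmark(sample_titles, "objective", regex_matcher)
for title in missed:
    print(title)
```

With the notebook's objects, the `is_match` argument would be something like `lambda t: len(matcher(nlp(t))) > 0`.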

----

# PART 3

**Where are we?**

1. In part 1, we started with a function and used "pos" to disregard verbs
2. In part 2, we found a better way of achieving this with patterns and matchers, which let us detect multi-token languages such as objective-c

**Goal of part 3**

1. Evaluate the model and identify any shortcomings