# Detect Programming Languages

- using spaCy

- handling multi-token language names

# 1)- Importing key modules

In [0]:
#support both Python 2 and Python 3 with minimal overhead.
from __future__ import absolute_import, division, print_function

# I am an engineer: I care about errors, not warnings. So let's be mavericks and ignore warnings.
import warnings
warnings.filterwarnings('ignore')

In [0]:
# For data processing and maths
import numpy as np
import pandas as pd
import time
import math
import os
#For Visuals
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from matplotlib import rcParams
rcParams['figure.figsize'] = 11, 8
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

In [0]:
# For text we shall use spaCy (the en_core_web_sm model must be installed first; see the download cell below)

import spacy 
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

In [4]:
! python -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


In [5]:
! pip install version_information

Collecting version_information
  Downloading https://files.pythonhosted.org/packages/ff/b0/6088e15b9ac43a08ccd300d68e0b900a20cf62077596c11ad11dd8cc9e4b/version_information-1.0.3.tar.gz
Building wheels for collected packages: version-information
  Building wheel for version-information (setup.py) ... done
  Created wheel for version-information: filename=version_information-1.0.3-cp36-none-any.whl size=3880 sha256=50dad7b56af96bf866ba67319b8cd1df41bdfd54a2756da802fb8ff8b87db456
  Stored in directory: /root/.cache/pip/wheels/1f/4c/b3/1976ac11dbd802723b564de1acaa453a72c36c95827e576321
Successfully built version-information
Installing collected packages: version-information
Successfully installed version-information-1.0.3


In [6]:
# first install: pip install version_information
%reload_ext version_information
%version_information pandas,spacy,numpy,seaborn, matplotlib

Software,Version
Python,3.6.8 64bit [GCC 8.3.0]
IPython,5.5.0
OS,Linux 4.14.137+ x86_64 with Ubuntu 18.04 bionic
pandas,0.25.3
spacy,2.1.9
numpy,1.17.4
seaborn,0.9.0
matplotlib,3.1.1
Sun Nov 24 17:02:16 2019 UTC


# 2)- Detecting more than one Language

Our previous function to detect one language was:

In [0]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang']:
            if t.pos_ != 'VERB':
                return True
    return False

In [0]:
doc = nlp("What's the point of having pointers in Go?")

In [9]:
# check if it reads it well
has_go_token(doc)

True

**More than one language function**

In [0]:
def has_go_token(doc):
    for t in doc:
        if t.lower_ in ['go', 'golang', 'python', 'ruby', 'objective-c']:
            if t.pos_ != 'VERB':
                return True
    return False

In [0]:
doc = nlp("i am an iOS dev and I like to code in objective-c")

In [12]:
# check if it reads it well
has_go_token(doc)

False

The function returns `False`, meaning that "objective-c" was not detected in the given sentence. But why?

In [13]:
[x for x in doc]

[i, am, an, iOS, dev, and, I, like, to, code, in, objective, -, c]

The function compares one token at a time. "objective-c" is one word, but spaCy splits it into three tokens: `objective`, `-`, and `c`. So our approach needs to be updated to handle multi-token names.
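The mismatch can be reproduced without spaCy. The sketch below is plain Python: the token list is copied from the output above, and `has_language_token` is a hypothetical stand-in for the token-by-token check (it ignores the POS condition).

```python
# Tokens copied from the output of [x for x in doc] above.
tokens = ["i", "am", "an", "iOS", "dev", "and", "I", "like",
          "to", "code", "in", "objective", "-", "c"]

LANGUAGES = ["go", "golang", "python", "ruby", "objective-c"]


def has_language_token(tokens):
    """Return True if any single token equals a language name (hypothetical helper)."""
    return any(tok.lower() in LANGUAGES for tok in tokens)


# No single token equals "objective-c", so the check fails ...
print(has_language_token(tokens))  # False
# ... even though the last three tokens spell it out when joined.
print("".join(tokens[11:14]))      # objective-c
```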

# 3)- Use of spacy.matcher

https://explosion.ai/demos/matcher

In [0]:
from spacy.matcher import Matcher

In [0]:
# we shall make pattern to counter multi-token challenge
pattern = [{'LOWER': 'objective'}, # these are attributes taken from demos
           {'IS_PUNCT': True},
           {'LOWER': 'c'}]

In [0]:
# Instantiate a matcher object

matcher= Matcher(nlp.vocab)

In [0]:
# The matcher is still empty, so we add our pattern to it.

matcher.add("OBJ-C", None, pattern)
# add() takes three parts: a name, an optional callback (None here), and the pattern

In [18]:
# finding matches from match object

matcher(doc)

[(9809469755149532664, 11, 14)]

The matcher returns a list of tuples.
**The first element is the match ID, the second is the start token index, and the third is the end token index (exclusive).**

We can verify our results:

In [19]:
doc[11:14]

objective-c

So the multi-token name has now been detected.

# 4)- Detecting more than one language

In [0]:
# we will create two different patterns for two different languages

objc_pattern = [{'LOWER': 'objective'}, # these are attributes taken from demos
                {'IS_PUNCT': True},
                {'LOWER': 'c'}]

go_pattern = [{'LOWER': {'IN': ['go', 'golang']}, 'POS': {'NOT_IN': ['VERB']}}]  # IN matches when the value is any item of the list

In [0]:
# Instantiate a matcher object

matcher= Matcher(nlp.vocab)

In [0]:
# add pattern for both languages
matcher.add("OBJ-C", None, objc_pattern)
matcher.add("GOLANG", None, go_pattern)

In [0]:
# make a new doc that contains both languages

doc2= nlp("It is always good to know more than one language. For example, learn golang and objective-c.")

In [30]:
print([x for x in doc2])

[It, is, always, good, to, know, more, than, one, language, ., For, example, ,, learn, golang, and, objective, -, c, .]


Here we have two languages, one of them a multi-token name. Let's see if we can detect them.

In [31]:
matcher(doc2)

[(4125859422530420866, 15, 16), (9809469755149532664, 17, 20)]

In [35]:
print(doc2[15:16])
print(doc2[17:20])

golang
objective-c


In [38]:
for match_id , start , end in matcher(doc2):
  print(doc2[start: end])

golang
objective-c


So this time we got both right: the multi-token name and the two languages.

# 5)- Clean up code and add more patterns to languages

In [0]:
objc_pattern = [{'LOWER': 'objective'}, 
                {'IS_PUNCT': True},
                {'LOWER': 'c'}]


golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

In [0]:
matcher= Matcher(nlp.vocab)
matcher.add("OBJ-C", None, objc_pattern)
matcher.add("GOLANG", None, golang_pattern1,golang_pattern2)

In [0]:
doc2= nlp("It is always good to know more than one language. For example, learn go/golang and objective-c.")

In [42]:
for match_id , start , end in matcher(doc2):
  print(doc2[start: end])

go
golang
objective-c


So we have improved one step further: all programming languages are detected. Although "go" and "golang" mean the same language, the matcher reports both because each appears as a separate token.
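If only the set of languages matters, duplicate hits can be collapsed by label. The sketch below is plain Python so it runs without a model: the labels are written out by hand and the token offsets are illustrative. In spaCy the label for a match can be looked up with `nlp.vocab.strings[match_id]`.

```python
# (label, start, end) triples; labels and offsets are illustrative,
# standing in for what the matcher reports for "go", "golang" and "objective-c".
matches = [("GOLANG", 15, 16), ("GOLANG", 17, 18), ("OBJ-C", 19, 22)]

# A set keeps each label once, however many spans matched it.
languages = sorted({label for label, start, end in matches})
print(languages)  # ['GOLANG', 'OBJ-C']
```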

### Add more languages to the patterns

In [0]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True}, 
                  {'LOWER': 'c'}]

obj_c_pattern2 = [{'LOWER': 'objectivec'}]

golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

python_pattern = [{'LOWER': 'python'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

In [0]:
matcher = Matcher(nlp.vocab)
matcher.add("OBJ-C", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("GOLANG", None, golang_pattern1, golang_pattern2)
# each language gets its own match name instead of reusing "OBJ-C"
matcher.add("PYTHON", None, python_pattern)
matcher.add("RUBY", None, ruby_pattern)
matcher.add("JS", None, js_pattern)

In [0]:
doc1=nlp("It is always good to know more than one language. For example, learn ruby, go/golang and objective-c.")
doc2=nlp("Learning other languages that are recently in trend is also significant. For example, Python has developed, javascript is good to know and obviously learn go/golang .")

In [46]:
for match_id , start , end in matcher(doc1):
  print(doc1[start: end])


ruby
golang
objective-c


We detected golang but not go. Let's check the other example document.

In [47]:
for match_id , start , end in matcher(doc2):
  print(doc2[start: end])

Python
javascript
go
golang


Here we detect them all. So why the difference?

In [48]:
# lets see why 

print([x for x in doc1])
print([x for x in doc2])

[It, is, always, good, to, know, more, than, one, language, ., For, example, ,, learn, ruby, ,, go, /, golang, and, objective, -, c, .]
[Learning, other, languages, that, are, recently, in, trend, is, also, significant, ., For, example, ,, Python, has, developed, ,, javascript, is, good, to, know, and, obviously, learn, go, /, golang, .]


In [49]:
print([(x,x.pos_) for x in doc1])

[(It, 'PRON'), (is, 'VERB'), (always, 'ADV'), (good, 'ADJ'), (to, 'PART'), (know, 'VERB'), (more, 'ADJ'), (than, 'ADP'), (one, 'NUM'), (language, 'NOUN'), (., 'PUNCT'), (For, 'ADP'), (example, 'NOUN'), (,, 'PUNCT'), (learn, 'VERB'), (ruby, 'PROPN'), (,, 'PUNCT'), (go, 'VERB'), (/, 'SYM'), (golang, 'NOUN'), (and, 'CCONJ'), (objective, 'NOUN'), (-, 'PUNCT'), (c, 'NOUN'), (., 'PUNCT')]


Notice that go is tagged as a verb here, even though it is used as a language name.

In [50]:
print([(x,x.pos_) for x in doc2])

[(Learning, 'VERB'), (other, 'ADJ'), (languages, 'NOUN'), (that, 'DET'), (are, 'VERB'), (recently, 'ADV'), (in, 'ADP'), (trend, 'NOUN'), (is, 'VERB'), (also, 'ADV'), (significant, 'ADJ'), (., 'PUNCT'), (For, 'ADP'), (example, 'NOUN'), (,, 'PUNCT'), (Python, 'PROPN'), (has, 'VERB'), (developed, 'VERB'), (,, 'PUNCT'), (javascript, 'ADJ'), (is, 'VERB'), (good, 'ADJ'), (to, 'PART'), (know, 'VERB'), (and, 'CCONJ'), (obviously, 'ADV'), (learn, 'VERB'), (go, 'PROPN'), (/, 'SYM'), (golang, 'NOUN'), (., 'PUNCT')]


In this case, go is tagged PROPN. Since golang_pattern2 excludes tokens tagged VERB, go was detected in doc2 but not in doc1.
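The effect of the POS condition can be replayed on the tagged output. This is a plain-Python sketch, not spaCy: the `(text, pos)` pairs are copied from the two outputs above (trimmed to the relevant tokens), and `go_matches` is a hypothetical helper that mimics `golang_pattern1` and `golang_pattern2`.

```python
# (token, POS) pairs copied from the tagged outputs above, trimmed for brevity.
doc1_tagged = [("learn", "VERB"), ("ruby", "PROPN"), (",", "PUNCT"),
               ("go", "VERB"), ("/", "SYM"), ("golang", "NOUN")]
doc2_tagged = [("learn", "VERB"), ("go", "PROPN"), ("/", "SYM"),
               ("golang", "NOUN"), (".", "PUNCT")]


def go_matches(tagged):
    """Mimic the two Go patterns on (text, pos) pairs (hypothetical helper)."""
    hits = []
    for text, pos in tagged:
        lower = text.lower()
        if lower == "golang":
            hits.append(text)          # golang_pattern1: no POS condition
        elif lower == "go" and pos not in ["VERB"]:
            hits.append(text)          # golang_pattern2: skip verbs
    return hits


print(go_matches(doc1_tagged))  # ['golang']
print(go_matches(doc2_tagged))  # ['go', 'golang']
```

The rule is only as reliable as the tagger: when "go" is mis-tagged as a verb, the language name is silently dropped.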

# 6)- Back to dataset

In [0]:
df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [61]:
df.shape

(1000000, 2)

In [62]:
df.head()

Unnamed: 0,Id,Title
0,80,SQLStatement.execute() - multiple queries in o...
1,90,Good branching and merging tutorials for Torto...
2,120,ASP.NET Site Maps
3,180,Function for creating color wheels
4,260,Adding scripting functionality to .NET applica...


In [0]:
# For objective C
titles = (_ for _ in df['Title'] if "objective-c" in _.lower())

In [69]:
for i in range(200):
    doc = nlp(next(titles))
    if len(matcher(doc)) == 0:
        print(doc)

Objective-C. I have typedef float DuglaType[3]. How do I declare the property for this?
Learning Objective-C. Using Xcode 3.2.1. What is error: Program received signal: âEXC_ARITHMETICâ


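These titles slip through even though they contain "Objective-C", most likely because of how the text around the name is tokenized (for example, a trailing period). Variants such as "objective c" (with a space instead of a hyphen) or "objective c++" are not covered by the pattern either. So we need some more modifications.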

### 6a)- Back to the pattern

- https://spacy.io/usage/rule-based-matching

See the "Operators and quantifiers" section of that page.

In [0]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '?'},  # "?" makes this token optional: it may match 0 or 1 times
                  {'LOWER': 'c'}]

obj_c_pattern2 = [{'LOWER': 'objectivec'}]

golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

python_pattern = [{'LOWER': 'python'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

In [0]:
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("GO_LANG", None, golang_pattern1, golang_pattern2)
matcher.add("PYTHON_LANG", None, python_pattern)
matcher.add("JS_LANG", None, js_pattern)
matcher.add("RUBY_LANG", None, ruby_pattern)

In [0]:
df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [0]:
# For objective C
titles = (_ for _ in df['Title'] if "objective-c" in _.lower())

In [75]:
for i in range(200):
    doc = nlp(next(titles))
    if len(matcher(doc)) == 0:
        print(doc)

__OBJC__ equivalent for Objective-C++
iPhone Device/Simulator memory oddities using Objective-C++
Objective-C. I have typedef float DuglaType[3]. How do I declare the property for this?
Learning Objective-C. Using Xcode 3.2.1. What is error: Program received signal: âEXC_ARITHMETICâ


There are still a few cases where Objective-C and Objective-C++ are not detected. One option is to check whether a different operator makes any difference.
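As a rough analogy, the Matcher operators behave like regular-expression quantifiers applied to whole tokens. The sketch below uses Python's `re` module on space-joined tokens rather than spaCy itself, so it only illustrates the semantics of `?` (0 or 1) and `+` (1 or more).

```python
import re

# Token sequences joined with single spaces, so each token is one "word".
with_hyphen = "objective - c"
with_space = "objective c"

# "( \W)?" plays the role of {'IS_PUNCT': True, 'OP': '?'}: zero or one punctuation token.
optional = re.compile(r"objective( \W)? c")
# "( \W)+" plays the role of 'OP': '+': one or more punctuation tokens.
one_or_more = re.compile(r"objective( \W)+ c")

for text in (with_hyphen, with_space):
    print(text, bool(optional.fullmatch(text)), bool(one_or_more.fullmatch(text)))
# objective - c True True
# objective c True False
```

Neither quantifier changes what the final `c` token must look like, so a title whose last token is not exactly `c` is missed either way.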

### 6b)- Change operators and quantifiers

In [0]:
obj_c_pattern1 = [{'LOWER': 'objective'},
                  {'IS_PUNCT': True, 'OP': '+'},  # "+" requires this token to match 1 or more times
                  {'LOWER': 'c'}]

obj_c_pattern2 = [{'LOWER': 'objectivec'}]

golang_pattern1 = [{'LOWER': 'golang'}] 
golang_pattern2 = [{'LOWER': 'go', 
                    'POS': {'NOT_IN': ['VERB']}}]

python_pattern = [{'LOWER': 'python'}]
ruby_pattern   = [{'LOWER': 'ruby'}]
js_pattern     = [{'LOWER': {'IN': ['js', 'javascript']}}]

In [0]:
matcher = Matcher(nlp.vocab, validate=True)
matcher.add("OBJ_C_LANG", None, obj_c_pattern1, obj_c_pattern2)
matcher.add("GO_LANG", None, golang_pattern1, golang_pattern2)
matcher.add("PYTHON_LANG", None, python_pattern)
matcher.add("JS_LANG", None, js_pattern)
matcher.add("RUBY_LANG", None, ruby_pattern)

In [0]:
df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [0]:
# For objective C
titles = (_ for _ in df['Title'] if "objective-c" in _.lower())

In [80]:
for i in range(200):
    doc = nlp(next(titles))
    if len(matcher(doc)) == 0:
        print(doc)

__OBJC__ equivalent for Objective-C++
iPhone Device/Simulator memory oddities using Objective-C++
Objective-C. I have typedef float DuglaType[3]. How do I declare the property for this?
Learning Objective-C. Using Xcode 3.2.1. What is error: Program received signal: âEXC_ARITHMETICâ


The problem remains the same. Another option is data cleaning, where punctuation marks are removed before matching. In that case, we would treat Objective-C++ and Objective-C as the same thing, on the assumption that they are variants of one language. Our main concern is to detect the programming language.
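One way the cleaning step might look is sketched below. It is an assumption, not part of the original notebook: `normalize_objc` is a hypothetical helper that rewrites the common spellings to the single token `objectivec`, which `obj_c_pattern2` is already written to match. Whether spaCy then tokenizes every cleaned title as hoped would still need checking on the data.

```python
import re

# Matches "objective-c", "objective c" and "objectivec", with an optional "++",
# as long as the name is not the start of a longer word ("objective criteria").
OBJC_RE = re.compile(r"\bobjective[\s-]?c(?:\+\+)?(?!\w)", re.IGNORECASE)


def normalize_objc(title):
    """Rewrite Objective-C spellings to 'objectivec' (hypothetical cleaning step)."""
    return OBJC_RE.sub("objectivec", title)


print(normalize_objc("__OBJC__ equivalent for Objective-C++"))
# __OBJC__ equivalent for objectivec
print(normalize_objc("Learning Objective-C. Using Xcode 3.2.1."))
# Learning objectivec. Using Xcode 3.2.1.
```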

### 6c)- Check other languages

In [0]:
df = (pd.read_csv("Questions.csv", nrows=1_000_000, 
                  encoding="ISO-8859-1", usecols=['Title', 'Id']))

In [0]:
# for Python

titles = (_ for _ in df['Title'] if "python" in _.lower())

In [83]:
for i in range(200):
    doc = nlp(next(titles))
    if len(matcher(doc)) == 0:
        print(doc)

mod_python/MySQL error on INSERT with a lot of data: "OperationalError: (2006, 'MySQL server has gone away')"
Running subversion under apache and mod_python
What's the best way to embed IronPython inside my C# App?
How to set the PYTHONPATH in Emacs?
wxPython wxDC object from win32gui.GetDC
Need skeleton code to call Excel VBA from PythonWin
Questions for python->scheme conversion
wxPython and sharing objects between windows
Django on IronPython
IronPython Webframework
A SuggestBox for wxPython?
Intercepting Method Access on the Host Program of IronPython
Is there anything like IPython / IRB for Perl?


The misses involve names such as mod_python, PYTHONPATH, IPython, wxPython, PythonWin, and IronPython, where "python" is embedded in a longer name. We can address these either by trying more operators or by matching on cleaned text. It depends on the problem statement. If someone wants to see how Python is integrated with other tools, then titles such as mod_python/MySQL or IPython / IRB are interesting. If not, data cleaning can give us more precision.
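The gap between the substring filter used to select titles and a whole-word match can be seen with a small regular-expression sketch. It uses Python's `re` module instead of spaCy, and the four titles are copied from the output above.

```python
import re

titles = [
    "Running subversion under apache and mod_python",
    "How to set the PYTHONPATH in Emacs?",
    "Django on IronPython",
    "Questions for python->scheme conversion",
]

# \b treats letters, digits and "_" as word characters, so "python" inside
# mod_python, PYTHONPATH or IronPython is not a standalone word.
standalone = re.compile(r"\bpython\b", re.IGNORECASE)

for title in titles:
    in_substring_filter = "python" in title.lower()
    is_standalone = bool(standalone.search(title))
    print(in_substring_filter, is_standalone, title)
# True False Running subversion under apache and mod_python
# True False How to set the PYTHONPATH in Emacs?
# True False Django on IronPython
# True True Questions for python->scheme conversion
```

The last title does contain "python" as a standalone word, so its miss presumably comes from tokenization around `->` rather than from the name being embedded in a longer one.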