# Programming Language Classifier

Download Watson Developer Cloud, import libraries, and load train/test sets

In [4]:
!pip install --upgrade watson_developer_cloud
!pip install wget

Requirement already up-to-date: watson_developer_cloud in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages
Requirement not upgraded as not directly required: Twisted>=13.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson_developer_cloud)
Requirement not upgraded as not directly required: python-dateutil>=2.5.3 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson_developer_cloud)
Requirement not upgraded as not directly required: pyOpenSSL>=16.2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson_developer_cloud)
Requirement not upgraded as not directly required: requests<3.0,>=2.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson_developer_cloud)
Requirement not upgraded as not directly required: service-identity>=17.0.0 in /opt/conda/envs/DSX-Python35/lib/python3.5/site-packages (from watson_developer_cloud)
Requirement not upgraded as not directly required: autobahn>=0.10.9 in /opt/con

In [5]:
import os
import re
import csv
import json
import wget
import base64
import operator
import numpy as np
import pandas as pd
from os import listdir
from collections import *
from os.path import isfile, join
from watson_developer_cloud import NaturalLanguageClassifierV1

In [13]:
wget.download( 'https://github.com/IBM/programming-language-classifier/blob/master/data/githubtrainingdatacompressed.npz?raw=true' )
wget.download( 'https://github.com/IBM/programming-language-classifier/blob/master/data/githubtestdatacompressed.npz?raw=true' )

train_data = np.array(np.load("githubtrainingdatacompressed.npz")['arr_0'])
test_data = np.array(np.load("githubtestdatacompressed.npz")['arr_0'])

### A little more preprocessing

break the training data into separate dictionaries indexed by pl type, and map training data to a csv for Watson (my csv is on GitHub)

In [14]:
pls = {}
for row in range(len(train_data)):
    if train_data[row][1].decode() not in pls:
        pls[train_data[row][1].decode()] = []
    pls[train_data[row][1].decode()].append(train_data[row][0].decode())
    

CSV cannot exceed 1024 characters for column width and 15000 rows. So each piece of code is pushed into a Pandas dataframe in at most 1024 character chunks. Watson cannot take empty column values either, so those are removed, then the dataframe is converted into a csv.

In [93]:
d = []
chunk = 1024

for i in train_data:
        for j in range(0,len(i[0]),chunk):
            text = re.sub(' +',' '," ".join(re.split(r'[^\w]', re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,i[0][j:j+chunk].decode('utf-8')))))   
            d.append({'text': text, 'pl': i[1].decode()})

df = pd.DataFrame(d, columns = ['text', 'pl'])
df['text'].replace(' ', np.nan, inplace=True)
df = df.dropna()
df.to_csv('trainingdata.csv', header=['text','pl'],index=False)

## Naive Bayes Classifier

Here we train a Naive Bayes Classifier
for a light review on Naive Bayes look through the slides on GitHub
for a thorough background on this topic (and many others in Machine Learning) 
check out Tom Mitchell's Carnegie Mellon course 
http://cc-web.isri.cmu.edu/CourseCast/Viewer/Default.aspx?id=a666b6e6-ad23-4fa3-96ce-ae50a42f45a3

In [10]:
def bayes_train(pldict, samples):
    plprobs = {}
    counts = Counter()
    for i in pldict:
        plprobs[i] = float(len(pldict[i]))/samples
        
    plwordprobs = {}
    plwordcounts = {}
    for pl in pldict:
        plwordprobs[pl] = {}
        plwordcounts[pl] = 0
    
    for pl in pldict:
        for i in pldict[pl]:
            counts.update(filter(None, re.split(r'[^\w]', re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,i))))
            for word in counts:
                if word not in plwordprobs[pl]:
                    plwordprobs[pl][word] = counts[word]
                else:
                    plwordprobs[pl][word] += counts[word]
                plwordcounts[pl] += counts[word]
            plwordcount = 0
            counts = Counter()
    for pl in plwordprobs:   
        for word in plwordprobs[pl]:
            plwordprobs[pl][word] = float(plwordprobs[pl][word])/plwordcounts[pl]
        
    
    return plprobs, plwordprobs
    
plprobs, plwordprobs = bayes_train(pls, len(train_data))

Checking out the distribution of programming languages in our training set, and 10 of the most commonly used words of a particular language, try replacing 'sh' with other languages and observe the output

In [161]:
plprobs

{'go': 0.04184782608695652,
 'java': 0.2184782608695652,
 'js': 0.20597826086956522,
 'm': 0.06956521739130435,
 'py': 0.07880434782608696,
 'sh': 0.17771739130434783,
 'swift': 0.175,
 'xml': 0.03260869565217391}

In [153]:
sorted(plwordprobs['sh'].items(), key=operator.itemgetter(1) ,reverse=True)[:10]

[('echo', 0.03864587716931944),
 ('the', 0.03778674579624146),
 ('0', 0.033368860713964506),
 ('1', 0.030890171747144934),
 ('if', 0.02780092617570351),
 ('for', 0.027205853790126674),
 ('import', 0.02685395198273526),
 ('to', 0.02617263735275594),
 ('in', 0.02470367772682303),
 ('return', 0.023925887959715765)]

Using the Naive Bayes Classifier to predict on the test set, again use the CMU course as a reference

In [17]:
def testbayes(testdata,plprob,plwordprob):
    Ypred = []

    for row in testdata:
        testcounter = Counter()
        testcounter.update(filter(None, re.split(r'[^\w]', re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,str(row[0])))))

        prob = {}
        for key in plprob:
            prob[key] = 0
        for key in prob:
            for i in testcounter:
                if i not in plwordprobs[key]:
                    plwordprob[key][i] = 1e-4
                else:
                    plwordprob[key][i] += 1e-4
                prob[key] += testcounter[i]*np.log(plwordprob[key][i])
            prob[key] += np.log(plprob[key])
        Ypred.append(max(prob.items(), key=operator.itemgetter(1))[0])
    
    return Ypred

In [20]:
predictions = testbayes(test_data, plprobs, plwordprobs)

### Watson and Evaluating Classification Accuracy

Autherticate with Watson, send it the training data csv, wait for it to finish its training phase, and compute the accuarcy of both models. 

My results are displayed below. Let me know what you get at nacosta@us.ibm.com

In [104]:
natural_language_classifier = NaturalLanguageClassifierV1(
    username="YOURUSERNAME",
    password="YOURPASSWORD")

In [117]:
with open('trainingdata.csv', 'rb') as training_data:
    print(json.dumps(natural_language_classifier.create_classifier(training_data=training_data, metadata='{"name": "Programming Language Classifier","language": "en"}'), indent=2))

{
  "name": "Programming Language Classifier",
  "classifier_id": "3b08fex552-nlc-1263",
  "url": "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/3b08fex552-nlc-1263",
  "language": "en",
  "status_description": "The classifier instance is in its training phase, not yet ready to accept classify requests",
  "created": "2018-08-23T17:52:43.060Z",
  "status": "Training"
}


Copy/Paste your classifier_id into the variable below. In the above example, "3b08fex552-nlc-1263" would be used. Monitor the status of your classifer be using the API below. Once the classifier's "status_description" matches the example below, proceed.

In [None]:
classifier_id = "YOURCLASSIFIERID"

In [118]:
natural_language_classifier.get_classifier(classifier_id)

{'classifier_id': '3b08fex552-nlc-1263',
 'created': '2018-08-23T17:52:43.060Z',
 'language': 'en',
 'name': 'Programming Language Classifier',
 'status': 'Training',
 'status_description': 'The classifier instance is in its training phase, not yet ready to accept classify requests',
 'url': 'https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/3b08fex552-nlc-1263'}

In [154]:
natural_language_classifier.get_classifier("classifier_id")

{u'classifier_id': u'3b08fex552-nlc-1263',
 u'created': u'2018-08-23T17:52:43.060Z',
 u'language': u'en',
 u'name': u'Programming Language Classifier',
 u'status': u'Available',
 u'status_description': u'The classifier instance is now available and is ready to take classifier requests.',
 u'url': u'https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/ee2ec4x254-nlc-4275'}

In [35]:
def compute_my_accuracy(pred, testdata):
    count = 0
    for i in range(len(pred)):
        if pred[i] == testdata[i][1].decode():
            count += 1
    return float(count)/len(pred)

def compute_watson_accuracy(pred, testdata):
    count = 0
    for i in range(len(pred)):
        if pred[i] == testdata[i][1].decode():
            count += 1
    return float(count)/len(pred)
    

In [129]:
watsonpred = []
for i in test_data:
    x = natural_language_classifier.classify(classifier_id,re.sub(' +',' '," ".join(re.split(r'[^\w]', re.sub(re.compile("/\*.*?\*/",re.DOTALL ) ,"" ,i[0].decode()))))[0:1024])
    watsonpred.append(x['top_class'])
    

In [None]:
print("My classifier's accuracy: " + str(compute_my_accuracy(predictions, test_data)))
print("Watson's accuracy: " + str(compute_watson_accuracy(watsonpred, test_data)))