In [60]:
from collections import Counter, defaultdict, namedtuple
from functools import partial
import pandas as pd
import numpy as np
import math, random
import glob

The current exercise has the following steps:
1. Create a DECODER model that discriminates between computer code snippets and identifies the language from features of the code.
2. The input training and test files need to be prepared.
3. Target structure to feed into the decoding model:
    - training set: [file ID, [set of features, known code]]
    - testing set: [file ID, [set of features, known code]]
4. The discriminating features of the code are not yet defined.
5. The files need to be processed for input.
6. Need to decide which classifier is useful.

Strategy and Initial observations:
1. Envisioning a set of data that includes a row number, a profile of regex distributions, and the code type. 
2. Need to determine how to read in a file name, capture the file extension and associate it with the regex text profile.
3. Get familiar with the TextBlob library.
4. Need to understand the role of glob.glob.
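
The strategy above can be sketched roughly as follows. glob.glob returns the list of paths that match a shell-style wildcard pattern, and os.path.splitext splits off the file extension, which can then serve as the language label. The names regex_profile, profile_files, and PATTERNS, as well as the three regexes, are illustrative assumptions; the notebook has not settled on a feature set.

```python
import glob
import os
import re
import tempfile

# Hypothetical regex features, chosen only for illustration.
PATTERNS = {
    'open_paren': re.compile(r'\('),
    'def_keyword': re.compile(r'\bdef\b'),
    'semicolon': re.compile(r';'),
}

def regex_profile(text):
    """Count how often each illustrative pattern occurs in the text."""
    return {name: len(rx.findall(text)) for name, rx in PATTERNS.items()}

def profile_files(pattern):
    """Return one (path, extension, profile) row per file matching the glob pattern."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        # splitext keeps only the last extension, e.g. 'x.clojure' -> 'clojure'
        ext = os.path.splitext(path)[1].lstrip('.')
        with open(path) as f:
            rows.append((path, ext, regex_profile(f.read())))
    return rows

# Small demonstration on throwaway files
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.clojure'), 'w') as f:
        f.write('(defn f [x] (inc x))')
    with open(os.path.join(tmp, 'b.py'), 'w') as f:
        f.write('def f(x):\n    return x + 1\n')
    demo_rows = profile_files(os.path.join(tmp, '*'))
```

Each row pairs a file with its extension and its regex counts, which is close to the "row number, regex profile, code type" layout envisioned in item 1.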


Loading test data:
1. glob the test set of files: 1 -> 32
2. append the test.csv code label (answer key for code file 1 to 32)
   - There are 2-4 code assignments per code type in the test data.
   - Target training to test ratio should be between 60/40 and 75/25.
   - Therefore, need a minimum of 16 training snippets per language so that the 4 test snippets are at most 25% of the training count (4 test / 16 train).
3. Initial training/testing will focus on one language.
4. quick build - a test and a benchmark. [start with ONE]
   - build a test for Clojure.
   - build a training set for Clojure.

Will start by getting control of the test data:

In [61]:
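def read_snippets_test(num):
    """Read test files named 1 .. num-1 from the test/ directory, in numeric order."""
    snips = []
    for i in range(1, num):  # range excludes num, so pass 33 to read files 1-32
        # glob returns an empty list if the file is missing, so gaps are skipped silently
        filetest = glob.glob('test/{}'.format(i))
        for path in filetest:
            with open(path) as f:
                snips.append(f.read())
    return snips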

In [125]:
sniptest = read_snippets_test(33)

Verified that all test files read into sniptest with
- len(sniptest)
- type(sniptest)
- sniptest[0] - various slices compared to original files.
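
The manual checks listed above could also be written as a small function. This matters because the reader skips missing files silently, which would shift the snippets out of line with the answer key. check_snippets is a hypothetical helper, not part of the notebook, and the demo lists stand in for the real sniptest.

```python
def check_snippets(snips, expected_count):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(snips) != expected_count:
        problems.append('expected {} snippets, got {}'.format(expected_count, len(snips)))
    for i, snip in enumerate(snips):
        if not isinstance(snip, str):
            problems.append('snippet {} is {}, not str'.format(i, type(snip).__name__))
        elif not snip.strip():
            problems.append('snippet {} is empty'.format(i))
    return problems

# Toy inputs standing in for the real snippet list
demo_ok = check_snippets(['(ns a)', 'import re'], 2)
demo_bad = check_snippets(['(ns a)', ''], 3)
```

For the real data the call would presumably be check_snippets(sniptest, 32).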

In [129]:
sniptestnames = pd.read_csv('test.csv',sep=',',header=None)

In [130]:
type(sniptestnames)

pandas.core.frame.DataFrame

In [131]:
len(sniptestnames)

32

Appended a column holding the test snippets to the answer-key DataFrame, then verified that the content was read in:

In [137]:
sniptestnames[2] = sniptest

In [141]:
sniptestnames[-3:]

Unnamed: 0,0,1,2
29,30,php,class Application extends App {\n\t/**\n\t * @...
30,31,ocaml,type name = string\n\nlet compare_label label1...
31,32,ocaml,let search_compiler_libs () =\n prerr_endline...


In [142]:
sniptestnames

Unnamed: 0,0,1,2
0,1,clojure,"(defn cf-settings\n ""Setup settings for campf..."
1,2,clojure,(ns my-cli.core)\n\n(defn -main [& args]\n (p...
2,3,clojure,(extend-type String\n Person\n (first-name [...
3,4,clojure,(require '[overtone.live :as overtone])\n\n(de...
4,5,python,from pkgutil import iter_modules\nfrom subproc...
5,6,python,import re\nimport subprocess\n\ndef cmd_keymap...
6,7,python,class NoSuchService(Exception):\n def __ini...
7,8,python,from collections import namedtuple\nimport fun...
8,9,javascript,function errorHandler(context) {\n return fun...
9,10,javascript,"var _ = require('lodash'),\n fs = require('..."


In [191]:
def read_snippets_train(directory):
    filetrain = glob.glob('benchmarksgame-2014-08-31/benchmarksgame/bench/{}/*.clojure*'.format(directory))
    snips = []
    for files in filetrain:
        with open(files) as f:
            snips.append(f.read())
    return snips

In [197]:
dir_list = ['binarytrees','binarytreesredux','chameneosredux', 'fannkuchredux', 'fasta']
#,'fannkuchredux','fasta','fastaredux','knucleotide','mandelbrot','meteor']

def clojure_capture(dir_list):
    total_clojure_set = []
    for i in dir_list:
        total_clojure_set = read_snippets_train(i)
        total_clojure_set += total_clojure_set
    return total_clojure_set


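Checked whether the helper function accurately captures the text of each code file in the directories:
[For the first 2 directories it appeared to work perfectly. When the third was added, the function lost the first 5 entries! The cause is in the loop: total_clojure_set is reassigned on every pass, so snippets from earlier directories are discarded, and the += line then appends the current directory's list to itself.]
[Had to fall back to a stepwise approach rather than the loop due to lack of time.]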

In [201]:
total_clojure = clojure_capture(dir_list)

In [205]:
len(dir_list)

5

In [203]:
len(total_clojure)

4
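
A length this small is what one would expect when the accumulator is overwritten on each pass of the loop instead of extended. A minimal sketch of a version that accumulates across directories follows. The base_dir and pattern parameters and the names read_snippets and capture_all are assumptions introduced for the example, and the demo runs on throwaway files rather than the benchmarks game data.

```python
import glob
import os
import tempfile

def read_snippets(base_dir, directory, pattern='*.clojure*'):
    """Read every file under base_dir/directory whose name matches pattern."""
    snips = []
    for path in sorted(glob.glob(os.path.join(base_dir, directory, pattern))):
        with open(path) as f:
            snips.append(f.read())
    return snips

def capture_all(base_dir, dir_list, pattern='*.clojure*'):
    """Accumulate snippets across all directories in dir_list."""
    total = []
    for directory in dir_list:
        # extend adds this directory's snippets without discarding earlier ones
        total.extend(read_snippets(base_dir, directory, pattern))
    return total

# Demonstration: two directories with 2 and 3 files, plus one that does not exist
with tempfile.TemporaryDirectory() as tmp:
    for directory, count in [('d1', 2), ('d2', 3)]:
        os.makedirs(os.path.join(tmp, directory))
        for i in range(count):
            name = 'prog.clojure-{}.clojure'.format(i)
            with open(os.path.join(tmp, directory, name), 'w') as f:
                f.write('(println {})'.format(i))
    demo_total = capture_all(tmp, ['d1', 'd2', 'missing'])
```

A directory with no matching files contributes nothing, since glob.glob returns an empty list for it.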

In [208]:
one_clojure = read_snippets_train('binarytrees')

In [209]:
two_clojure = read_snippets_train('binarytreesredux')

In [210]:
three_clojure = read_snippets_train('chameneosredux')

In [211]:
four_clojure = read_snippets_train('fannkcuchredux')  # NOTE: misspelling of 'fannkuchredux'; glob likely matches nothing here

In [212]:
five_clojure = read_snippets_train('fasta')

In [213]:
six_clojure = read_snippets_train('knucleotide')

In [216]:
seven_clojure = read_snippets_train('fastaredux')

In [217]:
eight_clojure = read_snippets_train('mandelbrot')

In [218]:
total_c = (one_clojure + two_clojure + three_clojure + four_clojure + five_clojure + six_clojure + seven_clojure +
           eight_clojure)

In [219]:
len(total_c)

18

Now have a training set of 18 Clojure snippets. Next step: featurize them.
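
One possible way to featurize a snippet is sketched below: the share of characters that are parentheses, plus presence flags for a few keywords. The TOKENS list, the feature names, and the featurize function are hypothetical choices for illustration only; which features actually separate the languages is still an open question in this notebook.

```python
import re
from collections import Counter

# Hypothetical keywords; not a feature set chosen by this notebook.
TOKENS = ['defn', 'let', 'def', 'function', 'import']

def featurize(snippet):
    """Turn one code snippet into a dict of simple numeric features."""
    # Words may contain hyphens so that Clojure names such as 'first-name' stay whole
    words = Counter(re.findall(r'[A-Za-z_][\w-]*', snippet))
    n_chars = max(len(snippet), 1)  # avoid dividing by zero on empty input
    features = {
        'paren_ratio': (snippet.count('(') + snippet.count(')')) / n_chars,
    }
    for tok in TOKENS:
        features['has_' + tok] = int(words[tok] > 0)
    return features

demo_features = featurize('(defn add [a b] (+ a b))')
```

Applying featurize to every entry of total_c would give one feature dict per training snippet.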

In [220]:
type(total_c)

list

In [233]:
# Helper function intended to attach the code type to each code snippet in a tuple.

codelist = total_c
codeid = 'clojure'

def labeler(codelist, codeid):
    code_group = defaultdict(list)
    for item in codelist:
        key = item[0][codeid]
        code_group[key].append(item)
    return code_group

# Discouraging when code written in class does not work!!
# The goal is to attach 'clojure' to each snippet. As written this fails: each item is
# a string, so item[0] is its first character, which cannot be indexed by the string codeid.

In [236]:
labeler(codelist, codeid)

TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
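
A minimal sketch of what the labeler seems intended to do: pair each snippet with its language label, then group snippets by label. label_snippets and group_by_label are hypothetical names, and the short strings stand in for the real contents of total_c.

```python
from collections import defaultdict

def label_snippets(snippets, label):
    """Pair every snippet with the same language label."""
    return [(snippet, label) for snippet in snippets]

def group_by_label(labeled):
    """Collect snippets into a dict keyed by language label."""
    groups = defaultdict(list)
    for snippet, label in labeled:
        groups[label].append(snippet)
    return groups

# Toy snippets standing in for the real training data
demo_labeled = (label_snippets(['(ns a)', '(defn f [])'], 'clojure')
                + label_snippets(['import re'], 'python'))
demo_groups = group_by_label(demo_labeled)
```

With the real data the call would presumably be label_snippets(total_c, 'clojure').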

In [None]:
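import csv

# The sms_* names appear to be carried over from an earlier example.
# In test.csv, row[0] is the file ID and row[1] is the language label.
sms_data = []
sms_results = []

with open('test.csv') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        sms_data.append(row[1])     # language label
        sms_results.append(row[0])  # file ID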
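
As a sketch of where this cell could go next, the answer key can be read with the csv module and joined to the snippets by position. This assumes test.csv has no header and two columns (file ID, language), as the DataFrame output earlier suggests. read_answer_key and attach_snippets are hypothetical helpers, and the demo uses an in-memory string in place of the real file.

```python
import csv
import io

def read_answer_key(handle):
    """Return (file_id, label) pairs from a header-less two-column CSV."""
    return [(int(row[0]), row[1]) for row in csv.reader(handle) if row]

def attach_snippets(key, snippets):
    """Join the answer key to the snippets, which must be in the same order."""
    if len(key) != len(snippets):
        raise ValueError('answer key and snippets differ in length')
    return [(file_id, label, snip) for (file_id, label), snip in zip(key, snippets)]

# In-memory stand-in for test.csv and the snippet list
demo_key = read_answer_key(io.StringIO('1,clojure\n2,python\n'))
demo_joined = attach_snippets(demo_key, ['(ns a)', 'import re'])
```

The length check guards against a missing test file silently shifting every label by one.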