# Code classifier

### At the bottom of this notebook one can try classifying code

## Potential further improvements:
1. Preprocessing
    - Explore further whether some signs (symbols) should be kept rather than removed
    - Fully understand and implement tokenisation; as of now, I'm not 100% sure how it works
    - Explore training / testing split options; the only ones used were 5:95 and 20:80
    - Is TFIDF the best choice for this problem? Probably
    - Explore TFIDF options
2. Models
    - One should definitely try more models
    - The two models used (Random Forest, Gradient Boosting) were not fully explored
3. General
    - How do comments affect the models? If someone were to write an essay in a comment, the training, or the trained model, could potentially get confused
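
One way to probe the question about comments is to train and evaluate twice, once with comments left in and once with them stripped. The sketch below is a deliberately naive stripper and is not part of the original pipeline: `strip_line_comments` is a hypothetical helper that only handles `#` and `//` line comments. It will also cut a marker that appears inside a string literal, and in languages such as C it would remove `#include` lines, so it should be read as a starting point only.

```python
import re

# Hypothetical helper (not part of the original notebook): removes
# everything from a '#' or '//' marker to the end of each line.
# Assumption: naive on purpose, so markers inside string literals
# (e.g. URLs) are cut as well, and block comments are left alone.
LINE_COMMENT = re.compile(r'(#|//).*$', flags=re.MULTILINE)


def strip_line_comments(code):
    stripped = LINE_COMMENT.sub('', code)
    # Drop trailing whitespace and lines that became empty
    lines = [line.rstrip() for line in stripped.splitlines()]
    return '\n'.join(line for line in lines if line)


sample = '''
// Higher order function
var x = 1;  // inline note
# Python-style comment
y = 2
'''

print(strip_line_comments(sample))
```

Comparing test accuracy with and without this step would give a first indication of how much the models rely on comment text.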

In [1]:
import pickle
import pandas as pd
import numpy as np 

from sklearn.feature_extraction.text import TfidfVectorizer

#### Choose the best model

In [2]:
models = 'Models/'
list_pickles = ['rfc_5%/df_models_rfc.pickle',
                'rfc_20%/df_models_rfc.pickle',
                'rfc_small/df_models_rfc.pickle', 
                'rfc_big/df_models_rfc.pickle',
                'df_models_rfc.pickle',
                'df_models_gbc.pickle',
                'gbc_big/df_models_gbc.pickle']

frames = []

for pickle_ in list_pickles:
    path = models + pickle_

    with open(path, 'rb') as data:
        frames.append(pickle.load(data))

# Concatenate once and renumber the rows 0..n-1 in list_pickles order
df_summary = pd.concat(frames).reset_index(drop=True)

# Gap between training and test accuracy, used as a rough overtraining indicator
diff = abs(df_summary['Training Set Accuracy'] - df_summary['Test Set Accuracy'])
df_summary['Accuracy difference'] = diff

The rows of the summary are indexed in the same order as `list_pickles`, so each index has to be matched against that list to see which pickle it came from.

Random Forest 4 is the only model trained with the improved tokenisation.
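
Matching row indices against `list_pickles` by eye is error-prone. A minimal sketch of one alternative, assuming each pickle holds a single-row summary DataFrame, is to record the path in a new column while loading. The `Source` column name and the in-memory stand-in frames are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Stand-ins for the pickled summaries (assumption: one row per pickle).
# Accuracy values are copied from two rows of this notebook's summary table.
loaded = {
    'rfc_20%/df_models_rfc.pickle': pd.DataFrame({
        'Model': ['Random Forest'],
        'Training Set Accuracy': [0.865],
        'Test Set Accuracy': [0.827522],
    }),
    'df_models_gbc.pickle': pd.DataFrame({
        'Model': ['Gradient Boosting'],
        'Training Set Accuracy': [0.895797],
        'Test Set Accuracy': [0.807645],
    }),
}

frames = []
for path, df in loaded.items():
    # 'Source' is a hypothetical column recording where each row came from
    frames.append(df.assign(Source=path))

df_labelled = pd.concat(frames, ignore_index=True)
print(df_labelled[['Source', 'Model', 'Test Set Accuracy']])
```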

In [3]:
df_summary.sort_values('Test Set Accuracy', ascending=False)

| Index | Model | Training Set Accuracy | Test Set Accuracy | Accuracy difference |
|---|---|---|---|---|
| 4 | Random Forest | 0.992826 | 0.900648 | 0.092178 |
| 1 | Random Forest | 0.865 | 0.827522 | 0.037478 |
| 0 | Random Forest | 0.866758 | 0.821208 | 0.045549 |
| 3 | Random Forest | 0.869095 | 0.809582 | 0.059513 |
| 2 | Random Forest | 0.89239 | 0.808354 | 0.084037 |
| 5 | Gradient Boosting | 0.895797 | 0.807645 | 0.088152 |
| 6 | Gradient Boosting | 0.911415 | 0.785012 | 0.126402 |


Choose the model with the highest Test Set Accuracy that shows no sign of overtraining.

<b> I will choose the untokenised Random Forest 1 </b> - Its test accuracy (about 0.83) is not great, but the gap to its training accuracy is small (about 0.04), so it is unlikely to be overtrained.
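
The selection rule can be written down explicitly. The sketch below rebuilds the summary from the printed values and keeps only models whose train/test gap is under a threshold. The 0.04 cut-off (`MAX_GAP`) is an assumption chosen for illustration, not a value from the original analysis.

```python
import pandas as pd

# Values copied from the summary table; the index is the row label shown there
summary = pd.DataFrame(
    {
        'Model': ['Random Forest'] * 5 + ['Gradient Boosting'] * 2,
        'Training Set Accuracy': [0.866758, 0.865, 0.89239, 0.869095,
                                  0.992826, 0.895797, 0.911415],
        'Test Set Accuracy': [0.821208, 0.827522, 0.808354, 0.809582,
                              0.900648, 0.807645, 0.785012],
    }
)

summary['Accuracy difference'] = (
    summary['Training Set Accuracy'] - summary['Test Set Accuracy']
).abs()

MAX_GAP = 0.04  # assumed tolerance for the train/test gap

# Keep models with a small gap, then take the best test accuracy among them
candidates = summary[summary['Accuracy difference'] < MAX_GAP]
best_index = candidates['Test Set Accuracy'].idxmax()
print(best_index, summary.loc[best_index, 'Model'])
```

With this threshold only Random Forest 1 passes; a looser threshold would admit other models, so the outcome depends on how much of a gap one is willing to tolerate.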

#### Load the best model

In [4]:
with open('Models/rfc_20%/best_rfc.pickle', 'rb') as data:
    model = pickle.load(data)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
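
The linked scikit-learn page points out that pickles should only be loaded from trusted sources and are not guaranteed to load correctly across library versions. One possible safeguard, sketched below, is to store the library version next to the model and compare it on load. The `save_with_version` and `load_with_version` helpers and the dictionary layout are hypothetical, and a plain dict stands in for the model so the example runs without scikit-learn.

```python
import io
import pickle


def save_with_version(model, version, fileobj):
    # Store the model together with the library version that produced it
    pickle.dump({'model': model, 'library_version': version}, fileobj)


def load_with_version(fileobj, current_version):
    # Only unpickle data from a trusted source: pickle can run arbitrary code
    payload = pickle.load(fileobj)
    if payload['library_version'] != current_version:
        raise RuntimeError(
            'Model was saved with version %s but %s is installed'
            % (payload['library_version'], current_version)
        )
    return payload['model']


# In-memory buffer instead of a file on disk; '1.0.2' is a placeholder,
# in practice the string would come from sklearn.__version__
buffer = io.BytesIO()
save_with_version({'n_estimators': 100}, '1.0.2', buffer)

buffer.seek(0)
restored = load_with_version(buffer, '1.0.2')
print(restored)
```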


#### TF-IDF object

<div class="alert alert-block alert-info">
If using Random Forest #4, choose the tokenised vectoriser; otherwise choose the tokenless one.
</div>

In [5]:
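# Load only the vectoriser that matches the chosen model.

# Tokenised (for Random Forest #4 only):
# with open('Pickles/tfidf.pickle', 'rb') as data:
#     tfidf = pickle.load(data)

# Tokenless (for every other model, including the chosen Random Forest 1):
with open('Pickles/tfidf_tokenless.pickle', 'rb') as data:
    tfidf = pickle.load(data)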



#### Features from text

In [6]:
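def create_features_from_code(code):
    # Wrap the snippet in a one-row DataFrame with a 'file_body' column
    df = pd.DataFrame(columns=['file_body'])
    df.loc[0] = code
    
    # Treat an empty snippet as missing and drop it
    df.replace('', np.nan, inplace = True)
    df.dropna(subset=['file_body'], inplace = True)
    
    # Uses the global tfidf object loaded above
    features = tfidf.transform(df['file_body']).toarray()
    return features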
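
To make the transformation step concrete, here is a minimal sketch that fits a toy `TfidfVectorizer` on three snippets and transforms a new one. The corpus and the `toy_tfidf` name are assumptions for illustration; the pickled vectoriser used in this notebook was fitted elsewhere and may use different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the training files (assumption: the real
# vectoriser was fitted on a much larger set of source files)
corpus = [
    'var x = function (y) { return y; };',
    'def f(y):\n    return y',
    'program main\n    implicit none\nend program',
]

toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(corpus)

# transform() expects an iterable of documents, so the snippet is wrapped in a list
features = toy_tfidf.transform(['def g(z):\n    return z']).toarray()

print(features.shape)  # one row, one column per term in the fitted vocabulary
print(sorted(toy_tfidf.vocabulary_))
```

With the default settings, punctuation and single-character tokens are dropped, which bears on the open question of which signs should be kept during preprocessing.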

#### Predict from code

In [7]:
def predict_from_code(code):
    f = create_features_from_code(code)
    # Predict using the global model loaded above
    prediction = model.predict(f)[0]
    # predict_proba returns one probability per class; report the largest, in percent
    prediction_prob = model.predict_proba(f)[0]
    print("The predicted language is", prediction)
    print("The conditional probability is: %s" %(prediction_prob.max()*100))
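
Some of the predictions in this notebook are made with a top probability below 50% (the Mathematica and Swift cells), so it can help to see the runner-up classes as well. The sketch below trains a tiny toy model purely for illustration. `top_predictions`, `toy_model` and `toy_tfidf` are hypothetical names, and the training snippets are made up for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny toy training set (assumption: stands in for the real data)
snippets = [
    'var x = function (y) { return y; };',
    'console.log(Math.pow(x, 3));',
    'def f(y):\n    return y',
    'import numpy as np\nprint(np.zeros(3))',
]
labels = ['JavaScript', 'JavaScript', 'Python', 'Python']

toy_tfidf = TfidfVectorizer()
toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(toy_tfidf.fit_transform(snippets), labels)


def top_predictions(code, k=2):
    # Hypothetical helper: the k most probable (label, probability) pairs
    probs = toy_model.predict_proba(toy_tfidf.transform([code]))[0]
    ranked = sorted(zip(toy_model.classes_, probs),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


for label, prob in top_predictions('import numpy as np'):
    print('%s: %.1f%%' % (label, prob * 100))
```

The same ranking could be applied to the real `model` and `tfidf` objects, since `classes_` and `predict_proba` are available on any fitted scikit-learn classifier that supports probability estimates.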

## Make some predictions

In [8]:
#JavaScript

code = '''
// Functions as values of a variable
var cube = function (x) {
  return Math.pow(x, 3);
};
var cuberoot = function (x) {
  return Math.pow(x, 1 / 3);
};

// Higher order function
var compose = function (f, g) {
  return function (x) {
    return f(g(x));
  };
};

// Storing functions in a array
var fun = [Math.sin, Math.cos, cube];
var inv = [Math.asin, Math.acos, cuberoot];

for (var i = 0; i < 3; i++) {
  // Applying the composition to 0.5
  console.log(compose(inv[i], fun[i])(0.5));
}

'''

predict_from_code(code)

The predicted language is JavaScript
The conditional probability is: 75.63670901813417


In [9]:
# Python

code = '''
# Search for an odd factor of a using brute force:
for i in range(n):
    if (n%2) == 0:
        continue
    if (n%i) == 0:
        result = i
        break
else:
    result = None
    print "No odd factors found"

'''

predict_from_code(code)

The predicted language is Python
The conditional probability is: 87.44826928787701


In [10]:
# Mathematica

code = '''
{And @@ Table[l = RandomInteger[150, RandomInteger[1000]];
   Through[And[Length@# == Length@SelectSort@# &, OrderedQ@SelectSort@# &]@l],
   {RandomInteger[150]}],
 Block[{$RecursionLimit = Infinity},
  And @@ Table[l = RandomInteger[150, RandomInteger[1000]];
    Through[And[Length@# == Length@SelectSort2@# &, OrderedQ@SelectSort2@# &]@l],
    {RandomInteger[150]}]
  ]}

'''

predict_from_code(code)

The predicted language is Mathematica
The conditional probability is: 42.47233917011415


In [11]:
# Fortran

code = '''
program textposition
    use kernel32
    implicit none
    integer(HANDLE) :: hConsole
    integer(BOOL) :: q

    hConsole = GetStdHandle(STD_OUTPUT_HANDLE)
    q = SetConsoleCursorPosition(hConsole, T_COORD(3, 6))
    q = WriteConsole(hConsole, loc("Hello"), 5, NULL, NULL)
end program
'''

predict_from_code(code)

The predicted language is Fortran
The conditional probability is: 86.2725051710276


In [12]:
# Swift
code = '''
if let firstNumber = Int("4"), let secondNumber = Int("42"), firstNumber < secondNumber && secondNumber < 100 {
    print("\(firstNumber) < \(secondNumber) < 100")
}
// Prints "4 < 42 < 100"

if let firstNumber = Int("4") {
    if let secondNumber = Int("42") {
        if firstNumber < secondNumber && secondNumber < 100 {
            print("\(firstNumber) < \(secondNumber) < 100")
        }
    }
}
// Prints "4 < 42 < 100"
'''
predict_from_code(code)

The predicted language is Swift
The conditional probability is: 47.15944260072262


In [13]:
# Python
code = '''
import numpy as np

x = np.random.rand(100)
print(x)

x.sort()
print(x)
'''
predict_from_code(code)

The predicted language is Python
The conditional probability is: 74.9849906272347
