In [1]:
import pandas as pd
import prepare
import model
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)

In [2]:
df = pd.read_json('data1.json')
df = prepare.prep_repos(df)

In [3]:
target = 'language_reduced'

**Split the Data** into three separate samples to avoid data leakage while fitting and evaluating machine learning models.

In [4]:
# split the data using sklearn's train_test_split inside the split_data function
train, validate, test = prepare.split_data(df, target)

train	 n = 51
validate n = 23
test	 n = 19
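As a rough sketch of what a three-way split like this could look like, the snippet below chains two `train_test_split` calls. The function name `split_data_sketch`, the 20% and 30% proportions, the stratification, and the seed are all assumptions for illustration, not the actual contents of `prepare.split_data`. The toy DataFrame only stands in for our prepared data. Applied to 93 rows, these proportions would give samples of 51, 23, and 19, which matches the sizes printed above, but we have not checked them against the real function.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, target, seed=42):
    """Hypothetical stand-in for prepare.split_data: a stratified three-way split."""
    # Carve off the test sample first (20% of all rows, an assumed proportion).
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Split the remainder into train and validate (30% of the remainder, also assumed).
    train, validate = train_test_split(
        train_validate,
        test_size=0.3,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


# Toy data standing in for the prepared repos DataFrame.
toy = pd.DataFrame({
    "readme": [f"doc {i}" for i in range(100)],
    "language_reduced": ["Python"] * 60 + ["Other"] * 40,
})
toy_train, toy_validate, toy_test = split_data_sketch(toy, "language_reduced")
print(len(toy_train), len(toy_validate), len(toy_test))
```

Because the test sample is set aside before anything is fit, no information from it can influence model selection.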


# Modeling

The following code initializes the structures we use to store information about our models' performance.

In [5]:
# initialize the model_number variable at 0
model_number = 0 
# initialize an empty dataframe
model_results = pd.DataFrame()

### Baseline

Here, we establish baseline predictions based on the most frequently occurring target class and evaluate the baseline accuracy.

In [6]:
# create baseline predictions using the mode and get accuracy score with sklearn.metrics inside run_baseline()
model_number, model_results = model.run_baseline(train[target], model_number, model_results)
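The idea behind a mode-based baseline can be sketched as follows. This is a minimal illustration on made-up labels, not the body of `model.run_baseline`; the variable names and the toy series are assumptions.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Toy target values standing in for train[target].
y_train = pd.Series(["Python", "Python", "Python", "JavaScript", "Other"])

# The baseline predicts the most frequent class for every observation.
most_common = y_train.mode()[0]
baseline_predictions = pd.Series(most_common, index=y_train.index)

# Baseline accuracy is the share of observations that belong to that class.
baseline_accuracy = accuracy_score(y_train, baseline_predictions)
print(most_common, baseline_accuracy)
```

Any model worth keeping should beat this number on unseen data.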

### Machine Learning Classifiers

Here, we create, fit and evaluate the following types of machine learning classification models, using our train and validate samples. 

- Decision Tree
- Random Forest
- Multinomial Naive Bayes
- Complement Naive Bayes

Each model is run using two types of feature preprocessing:

- Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization
- Count Vectorization (CV), also known as "Bag of Words"

For each combination of model type and preprocessing type, we also varied the associated hyperparameters, creating a total of 68 unique models with varying performance.
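To make the two preprocessing options concrete, here is a small sketch that fits the same decision tree on count-vectorized and TF-IDF-vectorized text. The toy documents, labels, `max_depth=3`, and `random_state` are assumptions for illustration; the real `model.run_*` functions may be organized quite differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy README-like text and labels (hypothetical, not our data).
train_docs = [
    "import pandas as pd", "def main print", "pip install numpy",
    "console log function var", "const let function", "npm install react",
]
train_labels = ["Python", "Python", "Python",
                "JavaScript", "JavaScript", "JavaScript"]
validate_docs = ["pip install pandas", "npm install function"]
validate_labels = ["Python", "JavaScript"]

results = {}
for name, vectorizer in [("CV/BOW", CountVectorizer()),
                         ("TF-IDF", TfidfVectorizer())]:
    # Learn the vocabulary on train only, then reuse it for validate.
    X_train = vectorizer.fit_transform(train_docs)
    X_validate = vectorizer.transform(validate_docs)

    clf = DecisionTreeClassifier(max_depth=3, random_state=123)
    clf.fit(X_train, train_labels)

    # Store (train accuracy, validate accuracy) for this feature type.
    results[name] = (clf.score(X_train, train_labels),
                     clf.score(X_validate, validate_labels))

print(results)
```

Fitting the vectorizer on train alone keeps validate vocabulary from leaking into the features.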

In [7]:
# user defined functions which create sklearn models with varying features and hyperparameters and store
# information about the models and their performance
model_number, model_results = model.run_decision_tree(train, validate, target, model_number, model_results)
model_number, model_results = model.run_random_forest(train, validate, target, model_number, model_results)
model_number, model_results = model.run_naive_bayes(train, validate, target, model_number, model_results)

### Results

Below, you can see the resulting accuracy scores on both the train and validate sets for each model created.

In [8]:
model.display(model_results)

sample_type,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,baseline
train,0.862745,0.803922,0.921569,0.862745,0.960784,0.941176,1.0,0.980392,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.647059,0.647059,0.627451,0.627451,0.627451,0.627451,0.627451,0.627451,0.666667,0.647059,0.627451,0.627451,0.627451,0.627451,0.627451,0.627451,0.666667,0.647059,0.627451,0.627451,0.627451,0.627451,0.627451,0.627451,0.666667,0.647059,0.627451,0.627451,0.627451,0.627451,0.627451,0.627451,0.960784,0.980392,1.0,0.960784,0.627451,0.941176,0.980392,0.941176,0.627451,0.921569,0.921569,0.941176,0.627451,0.862745,0.921569,0.941176,0.627451,0.843137,0.921569,0.941176,0.627451
validate,0.434783,0.652174,0.434783,0.608696,0.434783,0.608696,0.434783,0.608696,0.434783,0.608696,0.434783,0.608696,0.434783,0.608696,0.434783,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,0.608696,


### Best Model

Here, we choose the model that achieved the highest **accuracy** on the validate sample. We use accuracy as our evaluation metric because, in this multi-class classification problem, we have no reason to favor any one target class over another.

In [9]:
# obtain the model numbers with the highest accuracy on validate
best = model.get_best(model_results)
# display info about those models
model_results[model_results.model_number.isin(best)]

,model_number,model_type,sample_type,accuracy,feature_type,max_depth,min_samples_leaf,alpha
3,2,decision_tree,train,0.803922,CV/BOW,3.0,,
4,2,decision_tree,validate,0.652174,CV/BOW,3.0,,


Model #2 achieves the highest validate accuracy, 65%, though this is still 1 percentage point below our baseline accuracy.

### Final Test

Here, we recreate model #2 and evaluate it on our test sample, which approximates how it might perform on additional READMEs that were not included in our samples.

In [10]:
model.test_model_2(train, test, target)

0.631578947368421

Accuracy drops slightly again, to 63% on the test sample. That is about 3 percentage points below the baseline of 66%, so the model is not particularly useful.