<a href="https://colab.research.google.com/github/HarrisonSantiago/Habitual_be_classifier/blob/master/examples/Habitual_Be_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**1. Setup**

The following lines install the Habitual_be_classifier library in the Colab environment.

In [None]:
# Clone the repository, then install numpy, cython, and the package itself (editable mode)
!git clone https://github.com/HarrisonSantiago/Habitual_be_classifier.git

%cd /content/Habitual_be_classifier
!pip install numpy cython
!pip install -e .

import Habitual_be_classifier as hbc
import numpy as np
import pickle

**2. Rule Based Portion**

The following cells load an example corpus in which each instance of "be" is already labeled as habitual or non-habitual. By default, the csv_processor splits these into habitual and non-habitual sets so that segments can be held out if desired. Here, all of the data is then run through the rule-based filter. This filter removes as many non-habitual instances as possible and returns the remaining undetermined instances.

In [2]:
filepath = ['/content/Habitual_be_classifier/examples/CORAAL_example.csv']

hab_input, nonhab_input = hbc.csv_processor(filepath)

In [3]:
combined = np.concatenate((hab_input, nonhab_input))

In [4]:
declared_nonhab, unknown_hab = hbc.rule_filter(combined)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
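
The rules applied by `hbc.rule_filter` are not listed in this notebook. As a purely illustrative sketch of the general idea, the function below uses one hypothetical rule (a "be" directly preceded by a modal verb or "to" is declared non-habitual) and passes everything else through as undetermined. The name `toy_rule_filter` and the pattern are assumptions made for this example, not the library's implementation.

```python
import re

# Hypothetical rule (not taken from the library): "be" right after a modal
# verb or "to" is treated as a confident non-habitual instance.
NON_HABITUAL_BE = re.compile(
    r"\b(?:will|would|can|could|should|might|must|to)\s+be\b",
    re.IGNORECASE,
)

def toy_rule_filter(sentences):
    """Split sentences into (declared_nonhab, unknown) using a single rule."""
    declared_nonhab, unknown = [], []
    for sentence in sentences:
        if NON_HABITUAL_BE.search(sentence):
            declared_nonhab.append(sentence)
        else:
            # No rule fired, so the instance is left for the ML models.
            unknown.append(sentence)
    return declared_nonhab, unknown
```

As in the cell above, the filter only makes the confident non-habitual calls; anything it cannot decide is handed on to the later stages.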


**3. Augmentation** 

To run the augmentation functions, the word2vec model must be downloaded and unzipped. The first cell shows how to do that by downloading a copy of the word2vec model (provided by Tomas Mikolov). If the link has been accessed too frequently through gdown, the automatic download may fail. In that case, paste the link into your browser, download the file from there, and then run the gzip command.

In [None]:
!pip install gdown
!gdown https://drive.google.com/u/0/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
!gzip -d GoogleNews-vectors-negative300.bin.gz
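
If the file was downloaded manually, or the `gzip` command is not available, the archive can also be unpacked from Python. The helper below is a minimal sketch using only the standard library; `decompress_gz` is a name introduced here for illustration, and it assumes the input path ends in `.gz`.

```python
import gzip
import shutil
from pathlib import Path

def decompress_gz(src, dst=None):
    """Decompress a .gz file and return the path of the output file."""
    src = Path(src)
    # Default output name drops the trailing ".gz", like `gzip -d` does.
    dst = Path(dst) if dst is not None else src.with_suffix("")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Copy in chunks so the large model file is never fully in memory.
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Calling `decompress_gz("GoogleNews-vectors-negative300.bin.gz")` would write `GoogleNews-vectors-negative300.bin` next to the archive. Unlike `gzip -d`, this sketch leaves the original `.gz` file in place.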

In [None]:
filepath_toStoreAugmentFiles = "."

augmented = hbc.augmenter(unknown_hab, filepath_toStoreAugmentFiles)

**4. Training the ML models**

Here the augmented data and the data that the rule-based filter could not classify are used to train the ML models. The trained ensemble then predicts the habituality of the same data. Skipping the train/test split like this is not recommended in practice; it is done here for demonstration purposes only.

In [None]:
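# Stack the undetermined instances with the augmented instances
augmentedAndUnknown = np.concatenate((unknown_hab, augmented), axis=0)

# Column 2 is used as the integer label
y = augmentedAndUnknown[:, 2].astype(int)

X = hbc.vectorize(augmentedAndUnknown)

models = hbc.algo_trainers(X, y)

# Predict on the training data itself (demonstration only)
hab_prediction = models['ensemble'].predict(X)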
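
For a more realistic evaluation, some rows can be held out before training. The helper below is a minimal sketch of a shuffled holdout split that returns row indices; `holdout_indices`, `test_fraction`, and `seed` are names introduced here for illustration and are not part of the package.

```python
import random

def holdout_indices(n_rows, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx): disjoint, shuffled lists of row indices."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_rows))
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_rows * test_fraction))
    return indices[n_test:], indices[:n_test]
```

Assuming `X` and `y` support indexing with a list of integers (true for NumPy arrays), the models could then be trained with `hbc.algo_trainers(X[train_idx], y[train_idx])` and evaluated with `models['ensemble'].predict(X[test_idx])`. Note that splitting after augmentation may place augmented copies of a test instance in the training set, so it is likely safer to split `unknown_hab` first and augment only the training portion.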

**5. Using Pretrained Models**

This section shows how to access the pretrained models that come with this package. Classification with them works the same way as with the models generated in Part 4.

In [None]:
pretrained_models = hbc.get_pretrained('./Habitual_be_classifier/Classifiers.obj')

print(pretrained_models.keys())

dict_keys(['Logistic Regression', 'Linear SGD Classifier', 'Neural Net', 'Ensemble'])
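
To reuse models trained in Part 4 in a later session, a dictionary of models can be written to disk and read back. The sketch below uses `pickle`, which is imported in Setup; `save_models` and `load_models` are hypothetical helper names, and whether `Classifiers.obj` itself is stored in this format is an assumption that this notebook does not confirm.

```python
import pickle

def save_models(models, path):
    """Serialize a dictionary of trained models to a file."""
    with open(path, "wb") as f:
        pickle.dump(models, f)

def load_models(path):
    """Load a dictionary of models previously written by save_models."""
    # Only unpickle files from trusted sources: pickle can run arbitrary code.
    with open(path, "rb") as f:
        return pickle.load(f)
```

A round trip such as `save_models(models, "my_models.obj")` followed by `load_models("my_models.obj")` should return a dictionary with the same keys, as long as the same library versions are installed when loading.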
