The ```Bag of words``` and ```TF-IDF``` parts are inspired by a [Coursera MOOC](https://www.coursera.org/learn/language-processing/home/week/1) that I followed recently.  
For BERT, I was inspired by [https://www.kaggle.com/rahulvks/distilbert-text-classification](https://www.kaggle.com/rahulvks/distilbert-text-classification), a reference listed in the document [Technical Resources and Tutorials -- Pascal Notsawo summer 2020 Project 1](https://docs.google.com/document/d/1Sfev84E2mkF5rNNuvtZURlpYAhRw3NLmJqsJ2HQV--I/edit?usp=sharing) (shared drive folder), which is itself a replication of [https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/](https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/).

# **Data Overview**

**Workspace**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
! cp -R /content/drive/"My Drive"/"foo"/Data /content

**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import pickle

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


**Import the dataset**

In [None]:
df = pd.read_csv('Data/EULA_Training_Data_Set_1_v1.csv')

In [None]:
df.shape

(7879, 3)

In [None]:
df.head(5)

Unnamed: 0,Clause ID,Clause Text,Classification
0,1588,18. Governing Law: This Agreement shall be gov...,0
1,1146,"1.8 Modification. We may modify, update, or di...",1
2,4792,Except as otherwise expressly provided in this...,0
3,2759,8.3. The benefit and burdens of this Ag...,1
4,4400,DEFINITIONS,0


**Summary of the dataset**

In [None]:
#df.describe()

In [None]:
df['Classification'].value_counts()

0    6407
1    1472
Name: Classification, dtype: int64

In [None]:
"""
training_pkl = pickle.load(open("Data/train.pkl", "rb"))
print(len(training_pkl))
for x_y in training_pkl[:5]:
  print("============")
  print("text : " ,x_y["text"])
  print("target : " ,x_y["target"])
print("============")
"""

# **Splitting the training dataset**

In [None]:
X, y = df['Clause Text'].values, df['Classification'].values

In [None]:
seed = 1234
test_ratio = 0.2

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_ratio, random_state = seed)
#X_train, X_test, y_train, y_test = X[:6303], X[6304:], y[:6303], y[6304:]

In [None]:
len(X_train), len(X_test),

(6303, 1575)

In [None]:
X_train[0]

'18. Governing Law: This Agreement shall be governed by and interpreted in accordance with the Federal laws of theUnited States, without reference to conflict-of-laws principles. If for any reason a court of competent jurisdiction finds any provision of this Agreement to be unenforceable, that provision will be enforced to the maximum extent possible to effectuate the intent of the parties, and the remainder of this Agreement will continue in full force and effect. This Agreement shall not be governed by the United Nations Convention on Contracts for the International Sale of Goods. Buyer agrees that exclusive jurisdiction for any dispute arising out of or relating to this Agreement lies within the venue mandated by applicable Federal law.'

**Text Prepare**

In [None]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]') # raw string, so the backslashes reach the regex engine unchanged
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def text_prepare(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = text.lower() # lowercase text
    text = re.sub(REPLACE_BY_SPACE_RE, ' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = re.sub(BAD_SYMBOLS_RE, '', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    text = ' '.join([word for word in text.split() if word not in STOPWORDS]) # delete stopwords from text
    return text

In [None]:
a = X_train[0]
X_train = [text_prepare(x) for x in X_train]
X_test = [text_prepare(x) for x in X_test]

In [None]:
print(a)
print(X_train[0])

18. Governing Law: This Agreement shall be governed by and interpreted in accordance with the Federal laws of theUnited States, without reference to conflict-of-laws principles. If for any reason a court of competent jurisdiction finds any provision of this Agreement to be unenforceable, that provision will be enforced to the maximum extent possible to effectuate the intent of the parties, and the remainder of this Agreement will continue in full force and effect. This Agreement shall not be governed by the United Nations Convention on Contracts for the International Sale of Goods. Buyer agrees that exclusive jurisdiction for any dispute arising out of or relating to this Agreement lies within the venue mandated by applicable Federal law.
18 governing law agreement shall governed interpreted accordance federal laws theunited states without reference conflictoflaws principles reason court competent jurisdiction finds provision agreement unenforceable provision enforced maximum extent po

# **Transforming text to a vector**

## **1) Bag of words**   

   




1. Find the *N* most popular words in the train corpus and number them. Now we have a dictionary of the most popular words.
2. For each text in the corpus, create a zero vector of dimension *N*.
3. For each text in the corpus, iterate over the words that are in the dictionary and increase the corresponding coordinate by 1.  

Let's try to do it for a toy example. Imagine that we have *N* = 4 and the list of the most popular words is 

    ['hi', 'you', 'me', 'are']

Then we need to number them, for example, like this: 

    {'hi': 0, 'you': 1, 'me': 2, 'are': 3}

And we have the text, which we want to transform to the vector:

    'hi how are you'

For this text we create a corresponding zero vector 

    [0, 0, 0, 0]
    
And iterate over all words, and if the word is in the dictionary, we increase the value of the corresponding position in the vector:

    'hi':  [1, 0, 0, 0]
    'how': [1, 0, 0, 0] # word 'how' is not in our dictionary
    'are': [1, 0, 0, 1]
    'you': [1, 1, 0, 1]

The resulting vector will be 

    [1, 1, 0, 1]
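
The toy example above can be reproduced with a few lines of plain Python. This is a minimal sketch: `toy_bag_of_words` is a name introduced here for illustration only and is separate from the function used later in this notebook.

```python
def toy_bag_of_words(text, words_to_index):
    """Count how often each dictionary word occurs in `text`."""
    vector = [0] * len(words_to_index)
    for word in text.split():
        index = words_to_index.get(word)  # None for out-of-dictionary words
        if index is not None:
            vector[index] += 1
    return vector

words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
print(toy_bag_of_words('hi how are you', words_to_index))  # [1, 1, 0, 1]
```

A word that occurs several times is counted several times, so `'you you me'` would give `[0, 2, 1, 0]`.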



---



To find the most common words, we use only the train data.

**Words counts and most common words**

In [None]:
words_counts = {}
for line in X_train:
  word_list = line.split()
  for word in word_list: 
    words_counts[word] = words_counts.get(word, 0) + 1

In [None]:
most_common_words = sorted(words_counts.items(), key=lambda x: x[1], reverse=True)[:10]
print(most_common_words)

[('company', 8552), ('software', 5066), ('agreement', 5040), ('shall', 3940), ('use', 3493), ('customer', 2941), ('services', 2869), ('may', 2587), ('party', 2280), ('information', 2228)]


In [None]:
DICT_SIZE = 10000 # size of the dictionary
WORDS_TO_INDEX = {key: rank for rank, key in enumerate(sorted(words_counts.keys(), key=lambda x: words_counts[x], reverse=True)[:DICT_SIZE], 0)}
INDEX_TO_WORDS = {y:x for x,y in WORDS_TO_INDEX.items()}
ALL_WORDS = WORDS_TO_INDEX.keys()

In [None]:
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        words_to_index: dictionary mapping each known word to its vector index
        dict_size: size of the dictionary
        
        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    for item in text.split():
        if item in words_to_index: # words outside the dictionary are ignored
            result_vector[words_to_index[item]] += 1
    return result_vector

Now apply the implemented function to all samples.  
We use [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix) (Compressed Sparse Row matrix) for fast matrix-vector products and [scipy.sparse.vstack](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.vstack.html#scipy.sparse.vstack) to stack sparse matrices vertically (row-wise).
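
To make the storage format concrete, here is a hand-rolled sketch of what a CSR matrix keeps in memory: only the non-zero values (`data`), their column positions (`indices`), and the offset at which each row starts (`indptr`). The helper `to_csr` is hypothetical and written for illustration; the notebook itself relies on `scipy.sparse`.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        indptr.append(len(data))  # end of this row == start of the next one
    return data, indices, indptr

rows = [
    [1, 1, 0, 1],   # 'hi how are you'
    [0, 0, 0, 0],   # a text with no dictionary words
    [0, 2, 1, 0],   # 'you you me'
]
data, indices, indptr = to_csr(rows)
print(data)     # [1, 1, 1, 2, 1]
print(indices)  # [0, 1, 3, 1, 2]
print(indptr)   # [0, 3, 3, 5]
```

With a dictionary of 10000 words and clauses of a few dozen words, almost every coordinate is zero, so storing only the non-zero entries saves a lot of memory.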

In [None]:
# sparse matrix package for numeric data.
from scipy import sparse as sp_sparse 

In [None]:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])

In [None]:
print('X_train shape ', X_train_mybag.shape)
print('X_test shape ', X_test_mybag.shape)

X_train shape  (6303, 10000)
X_test shape  (1575, 10000)


## **2) TF-IDF**


TF-IDF takes into account how frequently each word occurs across the whole corpus. It penalizes words that are too frequent and provides a better feature space. 

- We use the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from *scikit-learn*. 
- We use the *train* corpus to fit the vectorizer. 
- We filter out words that are too rare (occurring in fewer than 5 texts) and too frequent (occurring in more than 90% of the texts).
- We use bigrams along with unigrams in our vocabulary.
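
The filtering rules in the list above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it splits on whitespace, and the names `ngrams` and `filtered_vocabulary` as well as the toy texts are introduced here for the example. As in `TfidfVectorizer`, an integer `min_df` is an absolute number of texts and a float `max_df` is a fraction of the corpus.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    """Return all n-grams of a token list for n = 1 .. n_max."""
    grams = []
    for n in range(1, n_max + 1):
        grams += [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def filtered_vocabulary(texts, min_df, max_df):
    """Keep terms found in at least `min_df` texts and at most `max_df` (a fraction) of them."""
    doc_freq = Counter()
    for text in texts:
        doc_freq.update(set(ngrams(text.split())))  # count each term once per text
    limit = max_df * len(texts)
    return sorted(term for term, df in doc_freq.items() if min_df <= df <= limit)

toy_texts = [
    'agreement governing law',
    'agreement governing body',
    'agreement software license',
    'agreement software',
]
print(filtered_vocabulary(toy_texts, min_df=2, max_df=0.9))
# ['agreement governing', 'agreement software', 'governing', 'software']
```

Here `agreement` is dropped because it appears in all four texts (more than 90%), and terms such as `law` or `software license` are dropped because they appear in only one text.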

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

How does it work?

In [None]:
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X_dummy = vectorizer.fit_transform(corpus)
print(vectorizer.vocabulary_)
print(vectorizer.get_feature_names()) # removed in scikit-learn 1.2; newer versions provide get_feature_names_out()
print(X_dummy.shape)
print(X_dummy)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
(4, 9)
  (0, 1)	0.46979138557992045
  (0, 2)	0.5802858236844359
  (0, 6)	0.38408524091481483
  (0, 3)	0.38408524091481483
  (0, 8)	0.38408524091481483
  (1, 5)	0.5386476208856763
  (1, 1)	0.6876235979836938
  (1, 6)	0.281088674033753
  (1, 3)	0.281088674033753
  (1, 8)	0.281088674033753
  (2, 4)	0.511848512707169
  (2, 7)	0.511848512707169
  (2, 0)	0.511848512707169
  (2, 6)	0.267103787642168
  (2, 3)	0.267103787642168
  (2, 8)	0.267103787642168
  (3, 1)	0.46979138557992045
  (3, 2)	0.5802858236844359
  (3, 6)	0.38408524091481483
  (3, 3)	0.38408524091481483
  (3, 8)	0.38408524091481483
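
The weights of the first toy document can be recomputed by hand. The sketch below assumes the default `TfidfVectorizer` settings (raw term counts, smoothed IDF, L2 normalisation); the document frequencies are counted from the four toy sentences above.

```python
import math

n_docs = 4
# Number of toy documents that contain each term of the first document.
doc_freq = {'this': 4, 'is': 4, 'the': 4, 'first': 2, 'document': 3}

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

# Each term occurs once in 'This is the first document.', so tf = 1 and tf * idf = idf.
norm = math.sqrt(sum(value ** 2 for value in idf.values()))
weights = {term: value / norm for term, value in idf.items()}

for term, weight in weights.items():
    print(term, round(weight, 4))
```

The rarer term `first` gets the largest weight, while `this`, `is` and `the`, which occur in every document, share the smallest one. The values agree with row 0 of the printed matrix.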


In [None]:
def tfidf_features(X_train, X_test):
    """
        X_train, X_test — samples        
        return TF-IDF vectorized representation of each sample and vocabulary
    """
    # Create TF-IDF vectorizer with a proper parameters choice
    # Fit the vectorizer on the train set
    # Transform the train and test sets and return the result
    
    
    tfidf_vectorizer = TfidfVectorizer(
        lowercase = True, 
        min_df=5, 
        max_df=0.9, 
        ngram_range=(1, 2), 
        #token_pattern='(\S+)' # todo
    )
    
    X_train = tfidf_vectorizer.fit_transform(X_train)
    X_test = tfidf_vectorizer.transform(X_test)
    
    return X_train, X_test, tfidf_vectorizer, tfidf_vectorizer.vocabulary_

In [None]:
X_train_tfidf, X_test_tfidf, tfidf_vectorizer, tfidf_vocab = tfidf_features(X_train, X_test)
tfidf_reversed_vocab = {i:word for word,i in tfidf_vocab.items()}

In [None]:
print('X_train_tfidf shape ', X_train_tfidf.shape)
print('X_test_tfidf shape ', X_test_tfidf.shape)

X_train_tfidf shape  (6303, 11638)
X_test_tfidf shape  (1575, 11638)


In [None]:
assert list(tfidf_vocab.keys())[:10] == list(tfidf_reversed_vocab.values())[:10], "An error occurred"
list(tfidf_vocab.keys())[:10]

['18',
 'governing',
 'law',
 'agreement',
 'shall',
 'governed',
 'interpreted',
 'accordance',
 'federal',
 'laws']

## **3) BERT**


In [None]:
! pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/27/3c/91ed8f5c4e7ef3227b4119200fc0ed4b4fd965b1f0172021c25701087825/transformers-3.0.2-py3-none-any.whl (769kB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.8.1.rc1
  Downloading https://files.pythonhosted.org/packages/40/d0/30d5f8d221a0ed981a186c8eb986ce1c94e3a6e87f994eae9f4aa5250217/tokenizers-0.8.1rc1-cp36-cp36m-manylinux1_x86_64.whl (3.0MB)
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl 

In [None]:
import random
import itertools
import torch
import transformers as tfm
from keras.preprocessing.sequence import pad_sequences

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.backends.cudnn.deterministic = True

**Loading the Pre-trained BERT model**

In [None]:
# For DistilBERT:
# model_class, tokenizer_class, pretrained_weights = (tfm.DistilBertModel, tfm.DistilBertTokenizer, 'distilbert-base-uncased')

## For BERT :
model_class, tokenizer_class, pretrained_weights = (tfm.BertModel, tfm.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
#max_input_length = tokenizer.max_model_input_sizes['distilbert-base-uncased']
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

model = model_class.from_pretrained(pretrained_weights)
model = model.to(device)


In [None]:
def pad_and_batching(df, pad_len, batch_size = 32, n_samples = None):
    """
        df: DataFrame with 'Clause Text' and 'Classification' columns
        pad_len: length every token sequence is padded or truncated to
        batch_size: number of sequences per batch
        n_samples: number of rows to use (all rows if None)

        return: list of input-id batches, list of attention-mask batches, labels
    """
    # Keep only the first n_samples rows (all rows if n_samples is None)
    n_samples = min(n_samples, df.shape[0]) if n_samples else df.shape[0]
    df = df.iloc[:n_samples]

    ### Tokenization
    tokenized = df['Clause Text'].apply(lambda x : tokenizer.encode(x, add_special_tokens=True))
    
    ### Padding (sequences longer than pad_len are cut at the end)
    padded = pad_sequences(sequences = tokenized.tolist(), maxlen = pad_len, dtype = 'int64', truncating = "post", padding = "post")
      
    ### Batching
    input_ids = [
        torch.LongTensor(padded[start:start + batch_size]).to(device)
        for start in range(0, n_samples, batch_size)
    ]
    
    ### Masking: 1 for real tokens, 0 for padding (token id 0)
    attention_mask = [(batch != 0).long() for batch in input_ids]
    
    return input_ids, attention_mask, df['Classification']


In [None]:
input_ids, attention_mask, labels = pad_and_batching(
    df = df, 
    pad_len = max_input_length, # equal to model.config.to_dict()['max_position_embeddings']
    batch_size = 32,
    n_samples = None
)

In [None]:
print(len(input_ids), input_ids[0].shape)
print(len(attention_mask), attention_mask[0].shape)
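
The padding and masking steps can be illustrated on a toy sequence of token ids. This is a plain-Python sketch; `pad_and_mask` is a hypothetical helper introduced for the example. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`; the other ids are arbitrary.

```python
def pad_and_mask(token_ids, pad_len, pad_id=0):
    """Truncate or right-pad `token_ids` to `pad_len` and build the matching attention mask."""
    padded = token_ids[:pad_len] + [pad_id] * max(0, pad_len - len(token_ids))
    mask = [0 if token == pad_id else 1 for token in padded]
    return padded, mask

padded, mask = pad_and_mask([101, 2023, 3820, 102], pad_len=6)
print(padded)  # [101, 2023, 3820, 102, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]

# A sequence longer than pad_len is cut at the end, which also removes the final [SEP].
print(pad_and_mask([101, 5, 6, 7, 102], pad_len=3))  # ([101, 5, 6], [1, 1, 1])
```

The mask tells the model which positions hold real tokens, so the padding does not contribute to attention.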

**Model**: calling `model()` runs our sentences through BERT. The output of each batch is appended to `last_hidden_states`.

In [None]:
last_hidden_states = []
model.eval()
with torch.no_grad():
    for input_ids_batch, attention_mask_batch in zip(input_ids, attention_mask) :
        last_hidden_states.append(
            model(input_ids_batch, attention_mask = attention_mask_batch)
        )

We keep the hidden state of the first token (`[CLS]`) of each sequence and save it in the `features` variable, which serves as the input features for our logistic regression model.

In [None]:
# flatten
features = list(
    itertools.chain.from_iterable(
      [batch[0][:,0,:].cpu().numpy() for batch in last_hidden_states]
    )
)
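
The slice `[:, 0, :]` used above keeps, for every sequence in a batch, the hidden vector at position 0. The following sketch shows the same selection on nested Python lists; the numbers are invented and the hidden size is reduced to 2 for readability.

```python
# A toy "batch" of 2 sequences, 3 token positions each, hidden size 2.
# Layout: hidden_states[sequence][token position][hidden dimension]
hidden_states = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

# Equivalent of tensor[:, 0, :]: take position 0 of every sequence.
cls_vectors = [sequence[0] for sequence in hidden_states]
print(cls_vectors)  # [[0.1, 0.2], [0.7, 0.8]]
```

Each sequence is thus reduced to a single fixed-size vector, whatever its length.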

In [None]:
X_train_bert, X_test_bert, y_train_bert, y_test_bert = train_test_split(features, labels, test_size = test_ratio, random_state = seed)
#X_train_bert, X_test_bert, y_train_bert, y_test_bert = features[:6303], features[6304:], labels[:6303], labels[6304:]

**Best parameters search**

# **Classifiers**


## **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  

### **Exhaustive search over specified parameter values for an estimator.**

In [None]:
parameters = {'C': np.linspace(start = 0.0001, stop= 100, num=100)}

In [None]:
grid_search_mybag = GridSearchCV(LogisticRegression(), parameters, n_jobs = -1)
grid_search_tfidf = GridSearchCV(LogisticRegression(), parameters, n_jobs = -1)
grid_search_bert = GridSearchCV(LogisticRegression(), parameters, n_jobs = -1)

In [None]:
grid_search_mybag.fit(X_train_mybag, y_train)
grid_search_tfidf.fit(X_train_tfidf, y_train)
grid_search_bert.fit(X_train_bert, y_train_bert)

In [None]:
print('best parameters mybag: ', grid_search_mybag.best_params_)
print('best score mybag: ', grid_search_mybag.best_score_)

print('best parameters tfidf: ', grid_search_tfidf.best_params_)
print('best score tfidf: ', grid_search_tfidf.best_score_)

print('best parameters bert: ', grid_search_bert.best_params_)
print('best score bert: ', grid_search_bert.best_score_)

best parameters mybag:  {'C': 1.0102}
best score mybag:  0.8332507584053974
best parameters tfidf:  {'C': 6.0607}
best score tfidf:  0.8453113553113554
best parameters bert:  {'C': 0.0001}
best score bert:  0.8113597170298201
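
To see where the selected values of `C` sit in the search grid, the grid can be rebuilt without NumPy. The `linspace` helper below is a plain-Python stand-in for `np.linspace`, written for this illustration.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

grid = linspace(0.0001, 100, 100)
print([round(c, 4) for c in grid[:3]])  # [0.0001, 1.0102, 2.0203]
print(round(grid[6], 4))                # 6.0607
```

The grid is linear, so only its first point lies below 1. The values selected for bag-of-words and TF-IDF are its second and seventh points, while the value selected for BERT, 0.0001, is the lower edge of the grid. A best value on the edge can mean that the range does not bracket the optimum; a log-spaced grid is a common alternative worth trying, although that was not tested here.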


Train the classifiers for different data transformations: *bag-of-words*, *tf-idf* and *bert*.

In [None]:
classifier_mybag = LogisticRegression(penalty="l2", C=1.0102, solver="newton-cg", random_state = 0, n_jobs = -1).fit(X_train_mybag, y_train)
classifier_tfidf = LogisticRegression(penalty="l2", C=6.0607, solver="newton-cg", random_state = 0, n_jobs = -1).fit(X_train_tfidf, y_train)
classifier_bert = LogisticRegression(penalty="l2", C=0.0001, solver="newton-cg", random_state = 0, n_jobs = -1).fit(X_train_bert, y_train_bert)

Create predictions for the data : labels and scores.

In [None]:
y_test_predicted_labels_mybag = classifier_mybag.predict(X_test_mybag)
y_test_predicted_scores_mybag = classifier_mybag.decision_function(X_test_mybag)

y_test_predicted_labels_tfidf = classifier_tfidf.predict(X_test_tfidf)
y_test_predicted_scores_tfidf = classifier_tfidf.decision_function(X_test_tfidf)

y_test_predicted_labels_bert = classifier_bert.predict(X_test_bert)
y_test_predicted_scores_bert = classifier_bert.decision_function(X_test_bert)
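
For a binary logistic regression, the score returned by `decision_function` is the linear term w·x + b, the predicted label is 1 when that score is positive, and the probability of class 1 is the sigmoid of the score. The sketch below shows this relationship with invented coefficients; the helper names are introduced for the example.

```python
import math

def decision_score(weights, bias, x):
    """Linear score w . x + b of a binary linear classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def sigmoid(score):
    """Map a score to the probability of class 1."""
    return 1.0 / (1.0 + math.exp(-score))

weights, bias = [0.8, -0.5], -0.1   # invented coefficients
x = [1.0, 2.0]                      # invented feature vector

score = decision_score(weights, bias, x)
label = 1 if score > 0 else 0
print(label, round(sigmoid(score), 3))
```

The scores carry more information than the labels, which is useful for ranking clauses or for choosing a decision threshold other than 0.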

In [None]:
print('===== Bag-of-words : ', classifier_mybag.score(X_test_mybag, y_test))
print('===== Tfidf : ', classifier_tfidf.score(X_test_tfidf, y_test))
print('===== Bert : ', classifier_bert.score(X_test_bert, y_test_bert))

===== Bag-of-words :  0.8533333333333334
===== Tfidf :  0.8565079365079366
===== Bert :  0.8203174603174603


### Evaluation

To evaluate the results we will use several classification metrics:
 - [Accuracy](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)
 - [F1-score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 - [Recall](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html)

 

In [None]:
from sklearn.metrics import accuracy_score, f1_score, recall_score  

In [None]:
def print_evaluation_scores(y, predicted):
    print("accuracy_score : ", accuracy_score(y, predicted))
    print("f1_score : ", f1_score(y, predicted, average="macro"))
    print("recall_score : ", recall_score(y, predicted, average="macro"))

In [None]:
print('===== Bag-of-words')
print_evaluation_scores(y_test, y_test_predicted_labels_mybag)
print('===== Tfidf')
print_evaluation_scores(y_test, y_test_predicted_labels_tfidf)
print('===== Bert')
print_evaluation_scores(y_test_bert, y_test_predicted_labels_bert)

===== Bag-of-words
accuracy_score :  0.8533333333333334
f1_score :  0.728489008590292
recall_score :  0.7105345206708311
===== Tfidf
accuracy_score :  0.8565079365079366
f1_score :  0.7143955246873956
recall_score :  0.686253541773786
===== Bert
accuracy_score :  0.8203174603174603
f1_score :  0.45064527380537145
recall_score :  0.5
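
A macro recall of exactly 0.5 together with an accuracy close to the share of the majority class is the pattern produced by a classifier that always predicts the same class. The toy sketch below shows this effect; `macro_recall` is a hand-written helper and the labels are invented. Whether the BERT classifier above actually behaves this way was not checked here; a confusion matrix would confirm it.

```python
def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in predictions_for_cls) / len(predictions_for_cls))
    return sum(recalls) / len(recalls)

y_true = [0] * 8 + [1] * 2   # imbalanced toy labels, 80% class 0
y_pred = [0] * 10            # a classifier that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                      # 0.8
print(macro_recall(y_true, y_pred))  # 0.5
```

This is why accuracy alone is misleading on an imbalanced dataset, and why the macro-averaged metrics are reported as well.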


### **Save for production**

In [None]:
import pickle

In [None]:
production = {
   "WORDS_TO_INDEX" : WORDS_TO_INDEX,
   "DICT_SIZE" : DICT_SIZE,
   "tfidf_vectorizer" : tfidf_vectorizer,
   "classifier_mybag": classifier_mybag,
   "classifier_tfidf" : classifier_tfidf,
   #"tokenizer" : tokenizer,
   #"model" : model,
   "classifier_bert" : classifier_bert,
   "max_input_length" : max_input_length
}

In [None]:
# The context manager closes the file once the dump is complete
with open('/content/production.pth', 'wb') as f:
    pickle.dump(production, f)
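
Reading the file back works the same way in the other direction. The sketch below round-trips a small stand-in dictionary through a temporary file; the dictionary content and the temporary path are placeholders, not the real production bundle.

```python
import os
import pickle
import tempfile

bundle = {"DICT_SIZE": 10000, "max_input_length": 512}  # stand-in for the real dictionary

path = os.path.join(tempfile.mkdtemp(), "production.pth")
with open(path, "wb") as f:   # the file is closed when the block ends
    pickle.dump(bundle, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == bundle)  # True
```

Two points to keep in mind: unpickling scikit-learn objects generally requires a compatible scikit-learn version in the loading environment, and pickle files should only be loaded from trusted sources.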

### **Deploy model**

In [None]:
! pip install gradio

Collecting gradio
  Downloading https://files.pythonhosted.org/packages/e8/42/94dad1613672f0c7047bce471943581a6180275e6b23aff587636c87ee26/gradio-1.0.4-py3-none-any.whl (1.4MB)

In [None]:
import gradio as gr

In [None]:
def mybag_predict(eula):
    vec = my_bag_of_words(text_prepare(eula) , WORDS_TO_INDEX, DICT_SIZE)
    output = classifier_mybag.predict([vec])[0]
    return "EULA acceptable" if output == 1 else "EULA unacceptable"

def tfidf_predict(eula):
    vec = tfidf_vectorizer.transform([text_prepare(eula)])
    output = classifier_tfidf.predict(vec)[0]
    return "EULA acceptable" if output == 1 else "EULA unacceptable"

def bert_predict(eula):
    # Tokenize and leave room for the [CLS] and [SEP] special tokens
    tokens = tokenizer.tokenize(eula)
    tokens = tokens[:max_input_length-2]
    init_token_idx = tokenizer.cls_token_id
    eos_token_idx = tokenizer.sep_token_id
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0) # add the batch dimension
    with torch.no_grad():
        # The first element of the model output is the last hidden state
        last_hidden_state = model(tensor)[0]
    # Use the [CLS] vector (position 0) as features, as was done for training
    vec = last_hidden_state[:,0,:].cpu().numpy()
    output = classifier_bert.predict(vec)[0]
    return "EULA acceptable" if output == 1 else "EULA unacceptable"

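def predict(model_name, eula):
    # "TD-IDF" is the label used by the dropdown; "TF-IDF" is accepted as well
    if model_name == "Bag of word":
        return mybag_predict(eula)
    elif model_name in ("TD-IDF", "TF-IDF"):
        return tfidf_predict(eula)
    elif model_name == "BERT":
        return bert_predict(eula)
    raise ValueError("Unknown model name: " + str(model_name))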



---



In [None]:
inputs = gr.inputs.Textbox(placeholder="Your end-user license agreements", label = "EULA", lines=20)
output = gr.outputs.Textbox()
gr.Interface(fn = mybag_predict, inputs = inputs, outputs = output).launch()

In [None]:
inputs = gr.inputs.Textbox(placeholder="Your end-user license agreements", label = "EULA", lines=20)
output = gr.outputs.Textbox()
gr.Interface(fn = tfidf_predict, inputs = inputs, outputs = output).launch()

In [None]:
inputs = gr.inputs.Textbox(placeholder="Your end-user license agreements", label = "EULA", lines=20)
output = gr.outputs.Textbox()
gr.Interface(fn = bert_predict, inputs = inputs, outputs = output).launch()

In [None]:
inputs = gr.inputs.Textbox(placeholder="Your end-user license agreements", label = "EULA", lines=20)
model_name = gr.inputs.Dropdown(["Bag of word", "TD-IDF", "BERT"], label = "model name")
output = gr.outputs.Textbox()
gr.Interface(fn = predict, inputs = [model_name, inputs], outputs = output).launch()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on External URL: https://14360.gradio.app
Interface loading below...


(<gradio.networking.serve_files_in_background.<locals>.HTTPServer at 0x7f8926ead278>,
 'http://127.0.0.1:7860/',
 'https://14360.gradio.app')

## **Gaussian Naive Bayes**

In [None]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb = gnb.fit(X_train_bert, y_train_bert)

gnb.score(X_test_bert,  y_test_bert)

0.753015873015873