In [15]:
import pandas as pd
from thirdai import bolt

We will use yelp_polarity, a standard dataset for sentiment classification. Download it from https://www.kaggle.com/datasets/irustandi/yelp-review-polarity.

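The dataset is a CSV file with two columns:
1. category: the sentiment label. The raw files store it as 1 (negative) or 2 (positive), as the preview below shows; the model outputs later in this notebook index the classes as 0 (negative) and 1 (positive).
2. text: the review text.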


In [20]:
train_data=pd.read_csv("yelp_review_polarity_csv/train.csv",header=None)
train_data.columns=["category","text"]

test_data=pd.read_csv("yelp_review_polarity_csv/test.csv",header=None)
test_data.columns=["category","text"]

In [29]:
train_data.head()

   category                                               text
0         1  Unfortunately the frustration of being Dr. Go...
1         2  Been going to Dr. Goldberg for over 10 years. ...
2         1  I don't know what Dr. Goldberg was like before...
3         1  I'm writing this review to give you a heads up...
4         2  All the food is great here. But the best thing...


You should write your own preprocessing routine for the data.
When saving the dataframes, make sure that each file has the dataframe column names as its header row.

In [16]:
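def preprocess_and_save(dataframe, filename):
    # Replace commas in the review text so they cannot clash with the CSV delimiter.
    dataframe["text"] = dataframe["text"].str.replace(",", " ", regex=False)
    # Write the column names as the header row and leave out the index column.
    dataframe.to_csv(filename, index=False, encoding="utf-8")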

In [25]:
train_file="yelp_polarity_train.csv"
test_file="yelp_polarity_test.csv"
preprocess_and_save(train_data,train_file)
preprocess_and_save(test_data,test_file)
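Before training, it can help to confirm that a saved file really starts with the expected header row and to see which label values it contains. The following is a minimal sketch using only the standard library; the helper name `inspect_csv` and the three demo rows are made up for illustration and are not part of the notebook or of thirdai.

```python
import csv
import os
import tempfile
from collections import Counter


def inspect_csv(path, expected_header=("category", "text")):
    """Return (header_ok, label_counts) for a review CSV (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row should hold the column names
        header_ok = tuple(header) == tuple(expected_header)
        label_index = header.index("category") if "category" in header else 0
        # Count how often each raw label value occurs in the file.
        label_counts = Counter(row[label_index] for row in reader if row)
    return header_ok, label_counts


# Demo on a tiny made-up file; in the notebook you would pass train_file instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "text"])
        writer.writerow(["1", "slow service and cold food"])
        writer.writerow(["2", "great food and friendly staff"])
        writer.writerow(["2", "would come back again"])
    header_ok, label_counts = inspect_csv(demo_path)

print(header_ok, dict(label_counts))
```

Calling `inspect_csv(train_file)` on the real file would show whether the labels are still the raw values from the Kaggle files. Whether they need to be remapped before training is not stated in this notebook, so treat that as something to check against the UDT documentation.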

UniversalDeepTransformer (UDT) reads data directly from a CSV file. If your file uses a separator other than a comma, you can specify your own delimiter.

When initializing UDT, you have to specify the data type of every column that you want the model to train on, as well as the target column.

In [26]:
text_model = bolt.UniversalDeepTransformer(
    data_types={
        "text": bolt.types.text(),
        "category": bolt.types.categorical(n_unique_classes=2)
    },
    target="category",
    delimiter=','
)


input_1 (Input): dim=200000
input_1 -> fc_1 (FullyConnected): dim=512, sparsity=1, act_func=ReLU
fc_1 -> fc_2 (FullyConnected): dim=2, sparsity=1, act_func=Softmax
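The layer dimensions in the summary above are enough to estimate the model size. The sketch below counts the weights of the two fully connected layers, assuming each layer is dense (as `sparsity=1` suggests) and has one bias per output unit; the bias assumption is ours and is not confirmed by the summary.

```python
def dense_params(n_in, n_out, bias=True):
    """Parameters of one fully connected layer: weights plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)


# Dimensions taken from the printed summary: input -> fc_1 -> fc_2.
layer_dims = [200_000, 512, 2]

total_params = sum(
    dense_params(n_in, n_out) for n_in, n_out in zip(layer_dims, layer_dims[1:])
)
print(f"{total_params:,}")  # 102,401,538
```

Almost all of the parameters sit in the first layer (200,000 x 512 = 102,400,000 weights), which puts the model just above 100M parameters with or without the bias terms.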



#### We will be training a model that has more than 100M parameters on an M1 in under 10 minutes.

After initializing UDT, we only have to specify the number of epochs and the learning rate for training, and we are good to go!

In [27]:
train_config = (bolt.TrainConfig(epochs=5, learning_rate=0.01)
                    .with_metrics(["categorical_accuracy"]))

text_model.train(train_file, train_config)

test_config = (bolt.EvalConfig()
                   .with_metrics(["categorical_accuracy"]))

text_model.evaluate(test_file, test_config)

Loading vectors from 'yelp_polarity_train.csv'
Loaded 560000 vectors from 'yelp_polarity_train.csv' in 10 seconds.
train epoch 0:


train | epoch 0 | updates 274 | {categorical_accuracy: 0.915975} | batches 274 | time 112s | complete

train epoch 1:


train | epoch 1 | updates 548 | {categorical_accuracy: 0.960654} | batches 274 | time 105s | complete

train epoch 2:


train | epoch 2 | updates 822 | {categorical_accuracy: 0.975034} | batches 274 | time 107s | complete

train epoch 3:


train | epoch 3 | updates 1096 | {categorical_accuracy: 0.981511} | batches 274 | time 104s | complete

train epoch 4:


train | epoch 4 | updates 1370 | {categorical_accuracy: 0.98777} | batches 274 | time 110s | complete

Loading vectors from 'yelp_polarity_test.csv'
Loaded 38000 vectors from 'yelp_polarity_test.csv' in 0 seconds.
test:


predict | epoch 5 | updates 1370 | {categorical_accuracy: 0.920974} | batches 19 | time 5784ms

array([[4.8032477e-02, 9.5196736e-01],
       [9.9999559e-01, 4.3396126e-06],
       [4.7769544e-10, 9.9999988e-01],
       ...,
       [9.9999928e-01, 6.3129028e-07],
       [9.9999988e-01, 8.8657220e-10],
       [9.9999988e-01, 1.5347833e-10]], dtype=float32)
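The array printed in the output above holds one row per test review, with one probability per class. A minimal sketch of how such rows can be turned into class predictions and an accuracy figure is shown here in plain Python. The three probability rows are copied from the output; the labels are made up for illustration, and we assume (without confirmation from the library) that `categorical_accuracy` compares the highest-probability class against the label in the same way.

```python
def predicted_classes(probabilities):
    """Index of the highest probability in each row (an argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def accuracy(probabilities, labels):
    """Fraction of rows whose predicted class matches the label."""
    predictions = predicted_classes(probabilities)
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# First three rows of the evaluation output shown in the notebook.
sample_probabilities = [
    [4.8032477e-02, 9.5196736e-01],
    [9.9999559e-01, 4.3396126e-06],
    [4.7769544e-10, 9.9999988e-01],
]
# Hypothetical labels, NOT taken from the dataset.
sample_labels = [1, 0, 0]

sample_predictions = predicted_classes(sample_probabilities)
print(sample_predictions)  # [1, 0, 1]
print(accuracy(sample_probabilities, sample_labels))
```

With the real outputs, you would pass the full array returned by `text_model.evaluate` (converted to a list of rows, or handled with NumPy's `argmax`) together with the labels from the test file.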



In [28]:
text_model.save("sentiment_model.bolt")

Machine learning models can feel like a black box: it is often hard to see how a model arrives at its decision for a given input. Our explainability module lifts the curtain on predictions and offers deeper insight into the model's decision process.

In [36]:
inference_sample={"text":"I disliked the movie"}

To explain the decision of the model for a datapoint, pass the datapoint as a dictionary to model.explain.

In [35]:
predicted_class=text_model.predict(inference_sample)
print(predicted_class)

explanations=text_model.explain(inference_sample)
print(explanations[0])

[9.9922657e-01 7.7333325e-04]
column_name: "text" | keyword: "dislike" | percentage_significance: 24.3795


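The model assigns a probability of about 0.999 to the first class (negative), and we can see that the word "dislike" has the highest significance for that prediction.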

Not only can we get insights into the model's predictions, we can also find out how to change a datapoint to get a desired output. By specifying a target class, we can see which parts of the input should change to reach it. "Nobu is an underwhelming restaurant" has negative sentiment. After setting the target to 1 (positive), the model reports that "underwhelming" is the word that should be changed to turn the negative sentiment into a positive one.

In [37]:
inference_sample={"text":"Nobu is an underwhelming restaurant"}
explanations=text_model.explain(inference_sample, target_class="1")
print(explanations[0])

column_name: "text" | keyword: "underwhelming" | percentage_significance: 54.4764
