![logo](https://raw.githubusercontent.com/Pacific-AI-Corp/langtest/main/docs/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Pacific-AI-Corp/langtest/blob/main/demo/tutorials/misc/Custom_Hub_Notebook.ipynb)

In this notebook, we explore the **LangTest** open-source Python library, a tool for data scientists and developers who evaluate Natural Language Processing (NLP) models. Whether you prefer established libraries such as **John Snow Labs, Hugging Face, spaCy**, or LLM providers such as **OpenAI, Cohere, AI21, Hugging Face Inference API, and Azure-OpenAI**, LangTest supports them all.

LangTest focuses on assessing many aspects of NLP models. You can test any Named Entity Recognition (NER), text classification, fill-mask, or translation model with the library. It also supports testing LLMs on question-answering, summarization, and text-generation tasks against benchmark datasets. The library ships with 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

Getting started with LangTest is a breeze. Simply execute the following command in your terminal or command prompt:

```bash
pip install langtest
```

Once you've installed the library, we'll demonstrate how to leverage its functionalities within this Notebook. Specifically, we'll showcase how LangTest can be used to evaluate the robustness, bias, accuracy, performance, security, fairness, toxicity, translation, representation, and clinical aspects of your customized NLP models. Follow along to harness the full potential of LangTest for your NLP evaluations.

In [18]:
# import libraries
import pandas as pd
import numpy as np
import pickle
import os

# text processing libraries
import re
from bs4 import BeautifulSoup
from collections import Counter

# nltk libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# sklearn libraries
from sklearn.model_selection import train_test_split

# PyTorch libraries and modules
import torch
import torch.utils.data
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# import langtest 
from langtest import Harness

In [None]:
# download imdb dataset
!wget https://raw.githubusercontent.com/JohnSnowLabs/langtest/main/demo/data/imdb.csv

#### Dataset Pre-Processing

In the pre-processing phase, we use the **IMDB Movie Reviews Dataset**, a collection of highly polarized movie reviews. The copy downloaded above contains 21,821 reviews (10,974 negative and 10,847 positive, as the class counts below show). It has two columns: `Sentence` holds the review text and `class` holds the sentiment label (`positive` or `negative`), which we later convert to 1 and 0. We use this dataset to train a **text classification** model that predicts whether a movie review is positive or negative.

The `review_to_words` function cleans the text data. It removes HTML tags, punctuation, and stopwords, converts the text to lowercase, and stems each word with the Porter stemmer.

In [6]:
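def review_to_words(review):
    """Clean a raw review and return a list of stemmed tokens."""
    nltk.download("stopwords", quiet=True)  # no-op if already downloaded
    stop_words = set(stopwords.words("english"))  # build once, not once per word
    stemmer = PorterStemmer()

    text = BeautifulSoup(review, "html.parser").get_text()  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # Lowercase; replace non-alphanumerics with spaces
    words = text.split()  # Split string into words
    words = [w for w in words if w not in stop_words]  # Remove stopwords
    words = [stemmer.stem(w) for w in words]  # Stem each word

    return words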

In [7]:
# Read the data
df = pd.read_csv('./imdb.csv')
df['class'].value_counts()

class
negative    10974
positive    10847
Name: count, dtype: int64

In [8]:
X = df['Sentence']
y = df['class'].apply(lambda x: 1 if x == 'positive' else 0) # Convert the target to numerical values (0, 1)

#### Splitting the Dataset into Train and Test Sets

In [9]:
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [10]:
# where to store cache files
cache_dir = os.path.join("./", "cache/sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True)  # ensure cache directory exists


def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # unable to read from cache, but that's okay

    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]

        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                                                              cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])

    return words_train, words_test, labels_train, labels_test


In [11]:
X_train, X_test, y_train, y_test = preprocess_data(X_train, X_test, y_train, y_test)

Read preprocessed data from cache file: preprocessed_data.pkl


### Transforming the Dataset

In [12]:
def build_dict(data, vocab_size=5000):
    """Construct and return a dictionary mapping each of the most frequently appearing words to a unique integer."""
    
    words = [word for sens in data for word in sens] 
    
    counter = Counter(words)
    word_count = dict(counter.most_common(vocab_size - 2))
    
    word_dict = {word: idx + 2 for idx, word in enumerate(word_count)}
        
    return word_dict
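
To see what the mapping looks like, here is a small sketch that repeats the same logic on a toy corpus. The helper name `toy_build_dict` and the sample reviews are made up for illustration. Indices 0 and 1 are left unused because they are reserved for padding and out-of-vocabulary words when the reviews are encoded.

```python
from collections import Counter


def toy_build_dict(data, vocab_size=5):
    """Same idea as build_dict: keep the (vocab_size - 2) most frequent words."""
    counter = Counter(word for sentence in data for word in sentence)
    most_common = counter.most_common(vocab_size - 2)
    # Start numbering at 2 so that 0 and 1 remain reserved.
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


# Hypothetical, already-stemmed reviews
toy_reviews = [
    ["great", "movi", "great", "cast"],
    ["bad", "movi", "cast", "great"],
    ["great", "plot", "movi"],
]
toy_dict = toy_build_dict(toy_reviews)
print(toy_dict)  # {'great': 2, 'movi': 3, 'cast': 4}
```

With `vocab_size=5`, only the three most frequent words get their own index; `bad` and `plot` would later be mapped to the shared "infrequent" index.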

In [13]:
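word_dict = build_dict(X_train + X_test)

# Make sure the output directory exists before writing to it
os.makedirs("./sentiment_analysis", exist_ok=True)
with open("./sentiment_analysis/word_dict.pkl", "wb") as f:
    pickle.dump(word_dict, f)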

In [14]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0 # We will use 0 to represent the 'no word' category
    INFREQ = 1 # and we use 1 to represent the infrequent words, i.e., words not appearing in word_dict
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [15]:
X_train, X_train_len = convert_and_pad_data(word_dict, X_train)
X_test, X_test_len = convert_and_pad_data(word_dict, X_test)

In [16]:
# print(X_train[2])
print("Array length of the train_X :- {}".format(len(X_train[2])))
print("Length of the train_X_len :- {}".format(X_train_len[2]))

Array length of the train_X :- 500
Length of the train_X_len :- 296
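
The effect of encoding and padding is easier to see with a short pad length. The sketch below is a compact rewrite of `convert_and_pad` for illustration only; the name `encode_and_pad`, the tiny vocabulary, and `pad=6` are assumptions, not part of the notebook.

```python
def encode_and_pad(word_dict, sentence, pad=6):
    """Toy version of convert_and_pad with a short pad length."""
    NOWORD, INFREQ = 0, 1  # 0 = padding, 1 = word missing from word_dict
    encoded = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    encoded += [NOWORD] * (pad - len(encoded))  # right-pad short reviews
    return encoded, min(len(sentence), pad)


toy_dict = {"great": 2, "movi": 3, "cast": 4}  # hypothetical vocabulary

short_ids, short_len = encode_and_pad(toy_dict, ["great", "plot", "movi"])
long_ids, long_len = encode_and_pad(toy_dict, ["movi"] * 8)

print(short_ids, short_len)  # [2, 1, 3, 0, 0, 0] 3
print(long_ids, long_len)    # [3, 3, 3, 3, 3, 3] 6
```

Short reviews are padded with zeros on the right, long reviews are truncated, and the returned length records how many positions hold real tokens. This matches the output above, where the array has 500 entries but only 296 of them are real words.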


#### Save the transformed dataset locally as CSV files

In [17]:
data_dir = "./sentiment_analysis/"
os.makedirs(data_dir, exist_ok=True)

pd.concat([pd.DataFrame(y_train.values), pd.DataFrame(X_train_len), pd.DataFrame(X_train)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)

pd.concat([pd.DataFrame(y_test.values), pd.DataFrame(X_test_len), pd.DataFrame(X_test)], axis=1) \
        .to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)


In [19]:
# Read the full training set
train_data = pd.read_csv('./sentiment_analysis/train.csv', header=None, names=None)

# Turn the input pandas dataframe into tensors
train_y = torch.from_numpy(train_data[0].values).float().squeeze()
train_X = torch.from_numpy(train_data.drop([0], axis=1).values).long()

# Build the dataset
train_ds = torch.utils.data.TensorDataset(
    train_X, train_y)
# Build the dataloader
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=50)


### Architecture of NLP Model with PyTorch

In [20]:
class LSTMClassifier(nn.Module):
    """
    A simple LSTM-based recurrent model used to perform sentiment analysis.
    """

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        """
        Initialize the model by setting up the various layers.
        """
        super(LSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(in_features=hidden_dim, out_features=1)
        self.sig = nn.Sigmoid()

        self.word_dict = None

    def forward(self, x):
        """
        Perform a forward pass of our model on some input.
        """
        x = x.t()
        lengths = x[0, :]
        reviews = x[1:, :]
        embeds = self.embedding(reviews)
        lstm_out, _ = self.lstm(embeds)
        out = self.dense(lstm_out)
        out = out[lengths - 1, range(len(lengths))]
        return self.sig(out.squeeze())
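
The indexing step in `forward` picks, for every review in the batch, the output at its last real (unpadded) timestep. The sketch below reproduces that indexing with NumPy on made-up numbers; the array `scores` is a 2-D stand-in for the model's `out` tensor, with the trailing size-1 dimension dropped for readability.

```python
import numpy as np

seq_len, batch_size = 5, 3
# Fake per-timestep scores with shape (seq_len, batch_size)
scores = np.arange(seq_len * batch_size).reshape(seq_len, batch_size)
lengths = np.array([2, 5, 3])  # true (unpadded) length of each review

# Row index: last real timestep of each review; column index: the review itself
last_scores = scores[lengths - 1, np.arange(batch_size)]
print(last_scores)  # [ 3 13  8]
```

This is why each input row stores the review length in its first column: after the transpose, `x[0, :]` holds the lengths and `x[1:, :]` holds the padded word ids.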


In [21]:
class SentimentAnalysis:
    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

        # Keep the model on the same device as the batches it will receive
        self.model = LSTMClassifier(embedding_dim, hidden_dim, vocab_size).to(self.device)
        with open("./sentiment_analysis/word_dict.pkl", "rb") as f:
            self.word_dict = pickle.load(f)

    def train(self, train_loader, epochs):
        optimizer = optim.Adam(self.model.parameters(), lr=1e-3)
        loss_fn = torch.nn.BCELoss()
        for epoch in range(1, epochs + 1):
            self.model.train()
            total_loss = 0
            for batch_X, batch_y in train_loader:
                batch_X = batch_X.to(self.device)
                batch_y = batch_y.to(self.device)

                optimizer.zero_grad()  # reset gradients from the previous step

                model_output = self.model(batch_X)
                loss = loss_fn(model_output, batch_y)

                loss.backward()  # backpropagate the loss
                optimizer.step()  # update the weights

                total_loss += loss.item()
            print("Epoch: {}, BCELoss: {}".format(
                epoch, total_loss / len(train_loader)))

    def predict(self, x):
        data_X, data_len = convert_and_pad(self.word_dict, review_to_words(x), pad=500)
        # The model expects the review length first, followed by the padded word ids
        data_pack = np.hstack((data_len, data_X))
        data_pack = data_pack.reshape(1, -1)

        data = torch.from_numpy(data_pack).long()
        data = data.to(self.device)

        # Make sure to put the model into evaluation mode
        self.model.eval()
        with torch.no_grad():
            output = self.model(data)
            return "positive" if round(output.item()) else "negative"

    def evaluate(self, x, y):
        """Return the accuracy of the model on inputs x with labels y."""
        x = x.to(self.device)
        y = y.to(self.device)
        self.model.eval()
        with torch.no_grad():
            output = self.model(x)
            predicted = output.round()
            correct = (predicted == y).sum().item()
            return correct / len(y)


In [22]:
model = SentimentAnalysis(32, 100, 5000)


In [23]:
model.train(train_loader=train_dl, epochs=5)

Epoch: 1, BCELoss: 0.6145765907423837
Epoch: 2, BCELoss: 0.5105757576227188
Epoch: 3, BCELoss: 0.39539344251155856
Epoch: 4, BCELoss: 0.3329727892790522
Epoch: 5, BCELoss: 0.29848648594958443
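
For readers unfamiliar with the reported metric, binary cross-entropy averages `-[y*log(p) + (1-y)*log(1-p)]` over the examples. The sketch below computes it by hand in plain Python on made-up probabilities; the function name `binary_cross_entropy` and the sample values are illustrative and are not taken from the training run.

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, labels)
    ]
    return sum(losses) / len(losses)


labels = [1.0, 0.0, 1.0]
confident = binary_cross_entropy([0.9, 0.1, 0.8], labels)  # mostly right
uncertain = binary_cross_entropy([0.5, 0.5, 0.5], labels)  # coin-flip outputs
print(round(confident, 4), round(uncertain, 4))
```

A model that always outputs 0.5 scores ln 2 (about 0.693), so the first-epoch loss of roughly 0.61 is only slightly better than guessing, while the later epochs move well below that level.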


In [24]:
test_data = pd.read_csv(os.path.join(
    './sentiment_analysis/test.csv'), header=None, names=None)

test_y = torch.from_numpy(test_data[0].values).float().squeeze()
test_X = torch.from_numpy(test_data.drop([0], axis=1).values).long()

In [25]:
model.evaluate(test_X, test_y)

0.8561282932416953

In [28]:
model.predict("I am happy with the product")

'positive'

In [30]:
model.predict("I am not feeling good today")

'negative'

### Testing the model with LangTest

**LangTest** is an open-source Python library designed to help developers deliver safe and effective Natural Language Processing (NLP) models. Whether you are using **John Snow Labs, Hugging Face, spaCy** models or **OpenAI, Cohere, AI21, Hugging Face Inference API, and Azure-OpenAI** based LLMs, it has got you covered. You can test any Named Entity Recognition (NER), text classification, fill-mask, or translation model with the library. It also supports testing LLMs on question-answering, summarization, and text-generation tasks against benchmark datasets. The library ships with 60+ out-of-the-box tests. For a complete list of supported test categories, please refer to the [documentation](http://langtest.org/docs/pages/docs/test_categories).

For the robustness tests run below, metrics are calculated by comparing the model's predictions on the original sentences with its predictions on the perturbed (noisy) versions of those sentences. The original annotated labels are not used at any point; we are simply comparing the model against itself in two settings.
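
The self-comparison idea can be sketched in a few lines. Everything here is hypothetical: `toy_predict` is a deliberately case-sensitive stand-in for a model, and `lowercase` stands in for one perturbation. It illustrates the principle only and is not LangTest's implementation.

```python
def lowercase(text):
    """Perturbation: a stand-in for one robustness test."""
    return text.lower()


def toy_predict(text):
    """Hypothetical case-sensitive classifier, used only for illustration."""
    return "positive" if "Great" in text or "good" in text else "negative"


originals = ["Great film", "A good story", "Dull and slow", "Great cast, weak plot"]

results = []
for original in originals:
    expected = toy_predict(original)           # prediction on the clean text
    actual = toy_predict(lowercase(original))  # prediction on the perturbed text
    results.append(expected == actual)         # gold labels are never consulted

pass_rate = sum(results) / len(results)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 50%
```

A test case passes when the prediction does not change under the perturbation, whether or not that prediction was correct in the first place.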

In [None]:
from langtest import Harness

This imports the `Harness` class from the `langtest` module. `Harness` provides a framework for conducting NLP tests, and each instance can be configured for a different testing scenario or environment.

Here is a list of the parameters that can be passed to the `Harness` constructor:

<br/>



| Parameter     | Description |
| - | - |
| **task**      | Task for which the model is to be evaluated (text-classification or ner) |
| **model**     | Specifies the model(s) to be evaluated. This parameter can be provided as either a dictionary or a list of dictionaries. Each dictionary should contain the following keys: <ul><li>model (mandatory): PipelineModel, path to a saved model, or pretrained pipeline/model from a hub.</li><li>hub (mandatory): Hub (library) to use in the back end for loading the model from a public models hub or from a path, or `custom` for your own model object.</li></ul>|
| **data**      | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: <ul><li>data_source (mandatory): The source of the data.</li><li>subset (optional): The subset of the data.</li><li>feature_column (optional): The column containing the features.</li><li>target_column (optional): The column containing the target labels.</li><li>split (optional): The data split to be used.</li><li>source (optional): Set to 'huggingface' when loading Hugging Face dataset.</li></ul> |
| **config**    | Configuration for the tests to be performed, specified in the form of a YAML file. |


<br/>
<br/>
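
Judging from how the `SentimentAnalysis` object is passed to `Harness` in the next cell, a `custom` hub model for text classification needs a `predict` method that takes raw text and returns a label string. That is an inference from this notebook, not a statement of LangTest's documented contract. The class below is a hypothetical keyword-based stand-in that follows the same shape.

```python
class KeywordSentimentModel:
    """Hypothetical stand-in for a custom model: raw text in, label string out."""

    POSITIVE_WORDS = {"good", "great", "happy", "wonderful"}

    def predict(self, text):
        # Lowercase and split so the keyword match is case-insensitive
        words = set(text.lower().split())
        return "positive" if words & self.POSITIVE_WORDS else "negative"


toy_model = KeywordSentimentModel()
print(toy_model.predict("I am happy with the product"))  # positive
print(toy_model.predict("The plot was dull"))            # negative
```

Under that assumption, such an object would be passed as `model={'model': toy_model, 'hub': 'custom'}`, mirroring the call used for the LSTM model.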

In [31]:
harness = Harness(task="text-classification",
                  model={'model': model, "hub": "custom"},
                  data={'data_source': './imdb.csv'})

Test Configuration : 
 {
 "tests": {
  "defaults": {
   "min_pass_rate": 1.0
  },
  "robustness": {
   "add_typo": {
    "min_pass_rate": 0.7
   },
   "american_to_british": {
    "min_pass_rate": 0.7
   }
  },
  "accuracy": {
   "min_micro_f1_score": {
    "min_score": 0.7
   }
  },
  "bias": {
   "replace_to_female_pronouns": {
    "min_pass_rate": 0.7
   },
   "replace_to_low_income_country": {
    "min_pass_rate": 0.7
   }
  },
  "fairness": {
   "min_gender_f1_score": {
    "min_score": 0.6
   }
  },
  "representation": {
   "min_label_representation_count": {
    "min_count": 50
   }
  }
 }
}


In [34]:
harness.configure(
    {
        'tests': {
            'defaults': {'min_pass_rate': 0.65},
            'robustness': {
                'add_contraction': {'min_pass_rate': 0.7},
                'lowercase': {'min_pass_rate': 0.7},
            }
        }}
)


{'tests': {'defaults': {'min_pass_rate': 0.65},
  'robustness': {'add_contraction': {'min_pass_rate': 0.7},
   'lowercase': {'min_pass_rate': 0.7}}}}

In [35]:
harness.data = harness.data[:100]

In [36]:
harness.generate().run()


Generating testcases...: 100%|██████████| 1/1 [00:00<?, ?it/s]
- Test 'add_contraction': 19 samples removed out of 100

Running testcases... : 100%|██████████| 181/181 [00:53<00:00,  3.40it/s]




In [37]:
harness.report()


| | category | test_type | fail_count | pass_count | pass_rate | minimum_pass_rate | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | add_contraction | 2 | 79 | 98% | 70% | True |
| 1 | robustness | lowercase | 0 | 100 | 100% | 70% | True |
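
The pass rates in the report appear to be the share of passing test cases, which a quick calculation can confirm. This sketch assumes `pass_rate = pass_count / (pass_count + fail_count)`; the formula is inferred from the numbers in the report, not taken from LangTest's source.

```python
# (fail_count, pass_count, minimum_pass_rate) copied from the report above
report_rows = {
    "add_contraction": (2, 79, 0.70),
    "lowercase": (0, 100, 0.70),
}

summary = {}
for test_type, (fail_count, pass_count, min_rate) in report_rows.items():
    pass_rate = pass_count / (fail_count + pass_count)
    passed = pass_rate >= min_rate  # a test passes if it meets its threshold
    summary[test_type] = (round(pass_rate * 100), passed)
    print(test_type, f"{pass_rate:.0%}", passed)
```

The totals also line up with the progress bar: 81 `add_contraction` cases (100 minus the 19 removed samples) plus 100 `lowercase` cases give the 181 test cases that were run.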


In [40]:
t_df = harness.generated_results()

In [44]:
t_df.head()

| | category | test_type | original | test_case | expected_result | actual_result | pass |
| - | - | - | - | - | - | - | - |
| 0 | robustness | add_contraction | One of the other reviewers has mentioned that ... | One of the other reviewers has mentioned that ... | negative | positive | False |
| 1 | robustness | add_contraction | `A wonderful little production. <br /><br />The...` | `A wonderful little production. <br /><br />The...` | positive | positive | True |
| 2 | robustness | add_contraction | I thought this was a wonderful way to spend ti... | I thought this was a wonderful way to spend ti... | negative | negative | True |
| 3 | robustness | add_contraction | Petter Mattei's "Love in the Time of Money" is... | Petter Mattei's "Love in the Time of Money" is... | positive | positive | True |
| 4 | robustness | add_contraction | I sure would like to see a resurrection of a u... | I sure would like to see a resurrection of a u... | positive | positive | True |


In [58]:

print("{:=<100}".format("="))
for idx, rows in t_df[t_df['pass'] == False].iterrows():
    print("original: {}\n".format(rows['original']))
    print("testcase: {}\n".format(rows['test_case']))

    print("diff => (original, testcase): \n", set(rows['test_case'].split()) - set(rows['original'].split()), "\n")

    print("expected: {}\n".format(rows['expected_result']))
    print("actual: {}\n".format(rows['actual_result']))
    print("{:=<100}".format("="))

original: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due 