<a href="https://colab.research.google.com/github/Kacper-W-Kozdon/notebook-testing-ivy/blob/main/Sarcasm_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IVY TRANSPILER

Installing kaggle and uploading the API key necessary to use it.

In [None]:
!pip install -q kaggle
from google.colab import files
from google.colab import userdata
import os
files.upload();  # Upload kaggle.json - you can get it from the API section of your Kaggle account settings.

Installing packages necessary to use torch's transformers.

In [None]:
!pip install tqdm boto3 requests regex sentencepiece sacremoses

To use the API, credentials need to be copied into the kaggle folder. If everything works, the output will show the list of available datasets.

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets list
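If `kaggle datasets list` fails, the credentials file is the first thing to check. The snippet below is a minimal sketch of such a check in plain Python. The helper name `check_kaggle_credentials` is hypothetical, and the check assumes a POSIX file system and a `kaggle.json` holding `username` and `key` fields.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none).

    Hypothetical helper, not part of the kaggle package.
    """
    path = Path(path).expanduser()
    if not path.is_file():
        return [f"{path} does not exist"]

    problems = []
    # Any group/other permission bit means other users could read the key.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & 0o077:
        problems.append(f"permissions are {oct(mode)}, expected 0o600")

    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    if not isinstance(data, dict):
        return problems + ["file does not contain a JSON object"]

    # Assumed layout of kaggle.json: {"username": ..., "key": ...}
    for field in ("username", "key"):
        if field not in data:
            problems.append(f"missing field: {field}")
    return problems
```

Calling `check_kaggle_credentials("~/.kaggle/kaggle.json")` after the cell above should return an empty list when the file is in place.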

Preparing the ivy library.

In [None]:
# Insert the correct user when cloning the repos. Make sure that they are up-to-date.

!git clone "https://github.com/Kacper-W-Kozdon/demos.git"
!git clone "https://github.com/Kacper-W-Kozdon/ivy.git"
# Quote the version specifiers; otherwise the shell treats ">=" as an output redirection.
!pip install -U -q paddlepaddle ivy "accelerate>=0.21.0" mlflow "datasets>=2.14.5" nlp 2>/dev/null

Next: import the ivy library and get the dataset.

In [None]:
import ivy
!kaggle datasets download -d danofer/sarcasm
!cp -f sarcasm.zip '/content/demos/Contributor_demos/Sarcasm Detection/'
!unzip '/content/demos/Contributor_demos/Sarcasm Detection/sarcasm.zip' -d '/content/demos/Contributor_demos/Sarcasm Detection/'
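The copy-and-extract step can also be done from Python with the standard library, which avoids quoting the directory name that contains a space. This is a sketch only; the helper name `copy_and_extract` is an assumption and not part of the demo repository.

```python
import shutil
import zipfile
from pathlib import Path


def copy_and_extract(archive, target_dir):
    """Copy a zip archive into target_dir and extract it there.

    Returns the sorted names of the archive members.
    Hypothetical helper for illustration.
    """
    archive = Path(archive)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    copied = target_dir / archive.name
    # Skip the copy when the archive already sits in the target directory.
    if copied.resolve() != archive.resolve():
        shutil.copy2(archive, copied)

    with zipfile.ZipFile(copied) as zf:
        zf.extractall(target_dir)
        return sorted(zf.namelist())
```

In this notebook the call would be `copy_and_extract("sarcasm.zip", "/content/demos/Contributor_demos/Sarcasm Detection")`.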

Import the libraries required by the model that is to be transpiled.

In [None]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import gc  # For garbage collection to manage memory
import re  # For regular expressions
import numpy as np  # For numerical operations and arrays
import paddle

# Libraries to accompany torch's transformers
import tqdm
import boto3
import requests
import regex
import sentencepiece
import sacremoses

import warnings  # For handling warnings
warnings.filterwarnings("ignore")  # Ignore warning messages

import tensorflow as tf
import torch  # PyTorch library for deep learning
from transformers import AutoModel, AutoTokenizer  # Transformers library for natural language processing
from transformers import TextDataset, LineByLineTextDataset, DataCollatorForLanguageModeling, \
pipeline, Trainer, TrainingArguments, DataCollatorWithPadding  # Transformers components for text processing
from transformers import AutoModelForSequenceClassification  # Transformer model for sequence classification

import accelerate

from nlp import Dataset  # Dataset class from the legacy 'nlp' package (shadowed by datasets.Dataset imported below)
from imblearn.over_sampling import RandomOverSampler  # For oversampling to handle class imbalance
import datasets  # Import datasets library
from datasets import Dataset, Image, ClassLabel  # Import custom 'Dataset', 'ClassLabel', and 'Image' classes
from transformers import pipeline  # Transformers library for pipelines
from bs4 import BeautifulSoup  # For parsing HTML content

import matplotlib.pyplot as plt  # For data visualization
import itertools  # For working with iterators
from sklearn.metrics import (  # Import various metrics from scikit-learn
    accuracy_score,  # For calculating accuracy
    roc_auc_score,  # For ROC AUC score
    confusion_matrix,  # For confusion matrix
    classification_report,  # For classification report
    f1_score  # For F1 score
)

from datasets import load_metric  # Import load_metric function to load evaluation metrics

from tqdm import tqdm  # For displaying progress bars
tqdm.pandas()  # Enable progress bars for pandas operations

Set the seeds.

In [None]:
tf.keras.utils.set_random_seed(0)
torch.manual_seed(0)
paddle.seed(0)

Get the API key for the ivy transpiler from your account, upload it to the project, and move it to the correct directory.

In [None]:
files.upload();  # Upload key.pem - the ivy transpiler API key from your account.

In [None]:
!mkdir -p ~/.ivy  # It might be necessary to change ".ivy" to "ivy".
!cp key.pem ~/.ivy/

First, load the tokenizer and the models through `torch.hub`. The basic set-up instructions can be found here: https://colab.research.google.com/github/pytorch/pytorch.github.io/blob/master/assets/hub/huggingface_pytorch-transformers.ipynb#scrollTo=72d8f2de

In [None]:
tokenizer = torch.hub.load('huggingface/pytorch-transformers', 'tokenizer', 'bert-base-cased')
model = torch.hub.load('huggingface/pytorch-transformers', 'model', 'bert-base-cased')

sequence_classifier = torch.hub.load('huggingface/pytorch-transformers', 'modelForSequenceClassification', 'bert-base-cased')

In [None]:
from ivy.stateful.module import Module
from ivy.stateful.sequential import Sequential
from ivy.stateful.layers import *
from ivy.stateful.losses import *
from ivy.stateful.optimizers import *
from ivy.stateful.activations import *
from ivy.stateful.initializers import *
from ivy.stateful.norms import *

In [None]:
ivy.set_backend("tensorflow")
ivy.set_soft_device_mode(True)
#tokenizer_tf = ivy.transpile(tokenizer, source="torch", to="tensorflow")
#model_pd = ivy.to_ivy_module(model)
#model_pd = model_pd.trace_graph()
#model_pd = model_pd.set_backend("tensorflow")
model_tf = ivy.transpile(model, source="torch", to="tensorflow")
sequence_classifier_tf = ivy.transpile(sequence_classifier, source="torch", to="tensorflow")

In [None]:
df = pd.read_csv("/content/demos/Contributor_demos/Sarcasm Detection/train-balanced-sarcasm.csv")
df = df.drop_duplicates()
df = df.rename(columns={'comment': 'title'})
df = df[['label', 'title']]
df = df[~df['label'].isnull()]
df = df[~df['title'].isnull()]
df.sample(5)

In [None]:
def count_words(text: str) -> int:
  return len(text.split())

def count_symbols(text: str) -> int:
  return len("".join(text.split()))

def symbol_to_word_ratio(text: str) -> float:
  # Guard against whitespace-only comments, which contain zero words.
  return count_symbols(text)/max(count_words(text), 1)

def upper_lower_ratio(text: str) -> float:
  text = "".join(text.split())
  return sum(1 for c in text if c.isupper())/(max([sum(1 for c in text if c.islower()), 1]))

df['word_count'] = df["title"].apply(count_words)
df['symbol_count'] = df["title"].apply(count_symbols)
df["upper_lower_ratio"] = df["title"].apply(upper_lower_ratio)
df["symbol_to_word_ratio"] = df["title"].apply(symbol_to_word_ratio)
df.sample(5)
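To make the four features concrete, here they are applied to a single string without pandas. The sample comment is made up for illustration, and the zero-word guard in `symbol_to_word_ratio` is an assumption added so that an empty comment does not raise.

```python
def count_words(text):
    # Words are whitespace-separated chunks.
    return len(text.split())


def count_symbols(text):
    # Every non-whitespace character counts as a symbol.
    return len("".join(text.split()))


def upper_lower_ratio(text):
    chars = "".join(text.split())
    upper = sum(1 for c in chars if c.isupper())
    lower = sum(1 for c in chars if c.islower())
    return upper / max(lower, 1)  # avoid division by zero for all-caps text


def symbol_to_word_ratio(text):
    return count_symbols(text) / max(count_words(text), 1)


sample = "Oh GREAT, another Monday"  # hypothetical comment, not from the dataset
features = {
    "word_count": count_words(sample),
    "symbol_count": count_symbols(sample),
    "upper_lower_ratio": upper_lower_ratio(sample),
    "symbol_to_word_ratio": symbol_to_word_ratio(sample),
}
print(features)
```

The sample has 4 words and 21 non-whitespace characters, of which 7 are upper case and 13 lower case, so the two ratios are 7/13 and 21/4.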

A few plots to see some characteristics of the data.

In [None]:
df_no_sarc = df.where(df["label"] == 0)
df_no_sarc = df_no_sarc.where(df_no_sarc["word_count"] <= 51)
df_sarc = df.where(df["label"] == 1)
df_sarc = df_sarc.where(df_sarc["word_count"] <= 51)
df_no_sarc = df_no_sarc[np.isfinite(df_no_sarc["word_count"])]
df_sarc = df_sarc[np.isfinite(df_sarc["word_count"])]
plt.style.use('_mpl-gallery-nogrid')

hist_df_no_sarc, bin_edges_no = np.histogram(df_no_sarc["word_count"].values, density=True)
hist_df_sarc, bin_edges = np.histogram(df_sarc["word_count"].values, density=True)
# plot:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

bin_mids_no = [(bin_edges_no[i+1] + bin_edges_no[i])/2 for i in range(len(bin_edges_no) - 1)]
bin_mids = [(bin_edges[i+1] + bin_edges[i])/2 for i in range(len(bin_edges) - 1)]
ax1.bar(bin_mids_no, hist_df_no_sarc, width=bin_edges_no[1] - bin_edges_no[0])
ax2.bar(bin_mids, hist_df_sarc, width=bin_edges[1] - bin_edges[0])
ax1.set_title("Hist no sarcasm")
ax1.set_ylabel("density")
ax1.set_xlabel("word count")
ax1.set_xticks(bin_edges_no)
ax1.grid(True)
ax2.set_title("Hist sarcasm")
ax2.set_xlabel("word count")
ax2.set_xticks(bin_edges)
ax2.grid(True)
plt.show()
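The plots use `np.histogram(..., density=True)`, which divides each bin count by the number of samples times the bin width, so the bar areas sum to 1 and the two classes can be compared even if their sizes differ. The pure-Python sketch below reproduces that normalisation and the bin midpoints for equal-width bins. It is an illustration of the formula rather than a replacement for NumPy, and edge cases may be binned slightly differently.

```python
def density_histogram(values, bins=10):
    """Equal-width histogram normalised so that the bar areas sum to 1.

    Returns (densities, edges, mids). Illustrative sketch only.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # Widen a degenerate range so the bin width is not zero.
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins

    counts = [0] * bins
    for v in values:
        # The last bin is closed on the right, so the maximum falls into it.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1

    n = len(values)
    densities = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    mids = [(edges[i] + edges[i + 1]) / 2 for i in range(bins)]
    return densities, edges, mids


word_counts = [1, 2, 2, 3, 3, 3, 4, 10]  # made-up word counts
densities, edges, mids = density_histogram(word_counts, bins=3)
print(densities, edges, mids)
```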

In [None]:
df_no_sarc = df.where(df["label"] == 0)
df_no_sarc = df_no_sarc.where(df_no_sarc["symbol_count"] <= 201)
df_sarc = df.where(df["label"] == 1)
df_sarc = df_sarc.where(df_sarc["symbol_count"] <= 201)
df_no_sarc = df_no_sarc[np.isfinite(df_no_sarc["symbol_count"])]
df_sarc = df_sarc[np.isfinite(df_sarc["symbol_count"])]
plt.style.use('_mpl-gallery-nogrid')

hist_df_no_sarc, bin_edges_no = np.histogram(df_no_sarc["symbol_count"].values, density=True)
hist_df_sarc, bin_edges = np.histogram(df_sarc["symbol_count"].values, density=True)
# plot:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

bin_mids_no = [(bin_edges_no[i+1] + bin_edges_no[i])/2 for i in range(len(bin_edges_no) - 1)]
bin_mids = [(bin_edges[i+1] + bin_edges[i])/2 for i in range(len(bin_edges) - 1)]
ax1.bar(bin_mids_no, hist_df_no_sarc, width=bin_edges_no[1] - bin_edges_no[0])
ax2.bar(bin_mids, hist_df_sarc, width=bin_edges[1] - bin_edges[0])
ax1.set_title("Hist no sarcasm")
ax1.set_ylabel("density")
ax1.set_xlabel("symbol count")
ax1.set_xticks(bin_edges_no)
ax1.grid(True)
ax2.set_title("Hist sarcasm")
ax2.set_xlabel("symbol count")
ax2.set_xticks(bin_edges)
ax2.grid(True)
plt.show()

In [None]:
df_no_sarc = df.where(df["label"] == 0)
df_no_sarc = df_no_sarc.where(df_no_sarc["upper_lower_ratio"] <= 0.3)
df_sarc = df.where(df["label"] == 1)
df_sarc = df_sarc.where(df_sarc["upper_lower_ratio"] <= 0.3)
df_no_sarc = df_no_sarc[np.isfinite(df_no_sarc["upper_lower_ratio"])]
df_sarc = df_sarc[np.isfinite(df_sarc["upper_lower_ratio"])]
plt.style.use('_mpl-gallery-nogrid')

hist_df_no_sarc, bin_edges_no = np.histogram(df_no_sarc["upper_lower_ratio"].values, density=True)
hist_df_sarc, bin_edges = np.histogram(df_sarc["upper_lower_ratio"].values, density=True)
# plot:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

bin_mids_no = [(bin_edges_no[i+1] + bin_edges_no[i])/2 for i in range(len(bin_edges_no) - 1)]
bin_mids = [(bin_edges[i+1] + bin_edges[i])/2 for i in range(len(bin_edges) - 1)]
ax1.bar(bin_mids_no, hist_df_no_sarc, width=bin_edges_no[1] - bin_edges_no[0])
ax2.bar(bin_mids, hist_df_sarc, width=bin_edges[1] - bin_edges[0])
ax1.set_title("Hist no sarcasm")
ax1.set_ylabel("density")
ax1.set_xlabel("upper/lower ratio")
ax1.set_xticks(bin_edges_no)
ax1.grid(True)
ax2.set_title("Hist sarcasm")
ax2.set_xlabel("upper/lower ratio")
ax2.set_xticks(bin_edges)
ax2.grid(True)
plt.show()

In [None]:
df_no_sarc = df.where(df["label"] == 0)
df_no_sarc = df_no_sarc.where(df_no_sarc["symbol_to_word_ratio"] <= 11)
df_sarc = df.where(df["label"] == 1)
df_sarc = df_sarc.where(df_sarc["symbol_to_word_ratio"] <= 11)
df_no_sarc = df_no_sarc[np.isfinite(df_no_sarc["symbol_to_word_ratio"])]
df_sarc = df_sarc[np.isfinite(df_sarc["symbol_to_word_ratio"])]
plt.style.use('_mpl-gallery-nogrid')

hist_df_no_sarc, bin_edges_no = np.histogram(df_no_sarc["symbol_to_word_ratio"].values, density=True)
hist_df_sarc, bin_edges = np.histogram(df_sarc["symbol_to_word_ratio"].values, density=True)
# plot:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

bin_mids_no = [(bin_edges_no[i+1] + bin_edges_no[i])/2 for i in range(len(bin_edges_no) - 1)]
bin_mids = [(bin_edges[i+1] + bin_edges[i])/2 for i in range(len(bin_edges) - 1)]
ax1.bar(bin_mids_no, hist_df_no_sarc, width=bin_edges_no[1] - bin_edges_no[0])
ax2.bar(bin_mids, hist_df_sarc, width=bin_edges[1] - bin_edges[0])
ax1.set_title("Hist no sarcasm")
ax1.set_ylabel("density")
ax1.set_xlabel("symbols/words ratio")
ax1.set_xticks(bin_edges_no)
ax1.grid(True)
ax2.set_title("Hist sarcasm")
ax2.set_xlabel("symbols/words ratio")
ax2.set_xticks(bin_edges)
ax2.grid(True)
plt.show()

Checking if the original tokenizer and model work.

In [None]:
input = df["title"][1]
print(f"The raw input: \n{input}\n")
token = tokenizer(input, return_tensors="pt", add_special_tokens=True)
print(f"The token: \n{token}\n")
with torch.no_grad():
  encoded_token = model(**token)
print(f"The encoded token: \n{encoded_token}\n")

Checking if the transpiled model works. The tokenizer itself is not transpiled; it is only asked to return TensorFlow tensors.

In [None]:
input = df["title"][1]
print(f"The raw input: \n{input}\n")
token = tokenizer.encode(input, return_tensors="tf", add_special_tokens=True)
print(f"The token: \n{token}\n")
#input_ids, token_type_ids, attention_mask = token["input_ids"], token["token_type_ids"], token["attention_mask"]
encoded_token = model_tf(token)
print(f"The encoded token: \n{encoded_token}\n")
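Printing both outputs only shows that the calls run. To check that the transpiled model agrees with the original, the outputs can be compared element-wise within a tolerance. The sketch below assumes both outputs have already been converted to nested Python lists (for instance via `.tolist()` on a torch tensor or `.numpy().tolist()` on a TensorFlow tensor). The helper names and the tolerance are placeholders.

```python
def flatten(nested):
    """Yield the scalar leaves of arbitrarily nested lists or tuples."""
    for item in nested:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two nested lists."""
    flat_a, flat_b = list(flatten(a)), list(flatten(b))
    if len(flat_a) != len(flat_b):
        raise ValueError("outputs have different numbers of elements")
    return max((abs(x - y) for x, y in zip(flat_a, flat_b)), default=0.0)


def outputs_match(a, b, atol=1e-4):
    # atol is an arbitrary placeholder; pick it to suit the model's precision.
    return max_abs_diff(a, b) <= atol


# Made-up values standing in for the torch and TensorFlow outputs.
reference = [[0.1, 0.2], [0.3, 0.4]]
transpiled = [[0.1, 0.20001], [0.3, 0.4]]
print(outputs_match(reference, transpiled))
```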

A quick check whether transpiling to paddle works as intended.

In [None]:
class Network(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self._linear = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self._linear(x)

x = torch.tensor([1., 2., 3.])
net = Network()
net(x)

In [None]:
ivy.set_backend("paddle")
net_pd = ivy.transpile(net, source="torch", to="paddle")
x_pd = paddle.to_tensor([1., 2., 3.])
net_pd(x_pd)

Setting up the classifier based on BERT's sequence classifier model.

In [None]:
class Classifier(torch.nn.Module):
    def __init__(self, num_classes=2):
        super(Classifier, self).__init__()
        self.tokenizer = tokenizer
        self.model = sequence_classifier

    def forward(self, x):
        x = self.tokenizer(x, return_tensors="pt", add_special_tokens=True)
        x = self.model(**x)
        return x

input = df["title"][1]
print(input)
classifier = Classifier()
print(classifier(input))

ivy.set_backend("paddle")
classifier_pd = ivy.transpile(classifier, source="torch", to="paddle")
print(classifier_pd(input))
print(f"Layers: {classifier_pd.sublayers()}")

Setting up the training and training the model.

In [None]:
import paddle.distributed as dist
def one_hot(input):
  input = paddle.to_tensor(input)
  return paddle.nn.functional.one_hot(input, num_classes=2)

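# pandas stores the labels as NumPy integers, so a plain `is int` check never matches.
if isinstance(df['label'].iloc[0], (int, np.integer)):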
  df['label'] = df['label'].map(one_hot)

print(df.sample(5))

train_dataset = df[['title', 'label']]
test_dataset = df[['title', 'label']]

train_loader = paddle.io.DataLoader(train_dataset, batch_size=64, shuffle=True)

def train(model):

  parameters = model.parameters()
  adam = paddle.optimizer.Adam(learning_rate=0.001, parameters=parameters)
  loss_fn = paddle.nn.CrossEntropyLoss()
  metric = paddle.metric.Accuracy()
  epochs = 2
  classifier = paddle.DataParallel(model)

  for epoch in range(epochs):
    for batch_id, data in enumerate(train_loader()):
      x_data = data[0]
      y_data = data[1]
      predicts = classifier(x_data)
      acc = paddle.metric.accuracy(predicts, y_data)
      loss = loss_fn(predicts, y_data)
      loss.backward()
      if batch_id % 100 == 0:
          print("epoch: {}, batch_id: {}, loss is: {}, acc is: {}".format(epoch, batch_id, loss.numpy(), acc.numpy()))
      adam.step()
      adam.clear_grad()
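The quantities in the training loop (one-hot labels, cross-entropy loss and accuracy) can be worked through by hand on a tiny example. This is a plain-Python sketch of the arithmetic with made-up logits. It does not use Paddle and is not meant to mirror the exact conventions of `paddle.nn.CrossEntropyLoss` or `paddle.metric.accuracy`.

```python
import math


def one_hot(label, num_classes=2):
    # Vector of zeros with a single 1 at the label's index.
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a one-hot target."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def accuracy(batch_logits, labels):
    """Fraction of samples whose arg-max logit equals the integer label."""
    hits = sum(
        1 for logits, y in zip(batch_logits, labels)
        if logits.index(max(logits)) == y
    )
    return hits / len(labels)


batch_logits = [[2.0, 1.0], [0.1, 0.9]]  # made-up classifier outputs
labels = [0, 0]                          # 0 = not sarcastic, 1 = sarcastic
losses = [cross_entropy(l, one_hot(y)) for l, y in zip(batch_logits, labels)]
print(sum(losses) / len(losses), accuracy(batch_logits, labels))
```

With equal logits the loss for either class is log 2, which is a useful reference point: a loss stuck near 0.693 on the balanced dataset means the classifier is not doing better than chance.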