__Names:__ <br> 1. Amrita Veshin [22122104] <br> 2. Atharva Vetal [22122109]

--------------------------------------------------------------------------------
# <center> Humour Detection Using NLP: XLNet Implementation </center>
--------------------------------------------------------------------------------

XLNet is a pre-trained, transformer-based language model developed by researchers at Carnegie Mellon University and Google Brain and released as an open-source project. It builds on BERT (Bidirectional Encoder Representations from Transformers) and Transformer-XL, and introduces a pretraining objective called "permutation language modeling."

## Features of XLNet:
1. __Bidirectional Context Modeling:__ Like BERT, XLNet uses context from both sides of a token, which helps it model the relationships between words and their contextual meanings. Unlike BERT, it obtains this bidirectional context without replacing input tokens with a mask symbol.

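2. __Permutation-Based Training:__ What sets XLNet apart is its permutation language modeling objective. The model remains autoregressive, but instead of always factorizing a sentence left to right, it maximizes the expected log-likelihood over permutations of the factorization order. The input tokens themselves are not shuffled: positional encodings keep the original order, and attention masks decide which tokens each prediction may see. Because a sentence of length T has T! possible orders, an order is sampled during training rather than enumerated. Over many sampled orders, each token learns to use context from both its left and its right.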

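3. __Improved Learning:__ XLNet avoids two weaknesses of BERT's masked-language-model objective. First, the artificial `[MASK]` token appears during pretraining but never during fine-tuning, which creates a mismatch between the two stages. Second, BERT predicts the masked tokens independently of one another. XLNet predicts tokens autoregressively, so it needs no `[MASK]` corruption and can model dependencies among the predicted tokens.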

4. __Multi-Head Attention:__ Like other transformer models, XLNet uses multi-head self-attention mechanisms, allowing it to focus on different parts of a sentence simultaneously. This enables it to capture complex dependencies between words.
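
The permutation-based objective described in item 2 can be illustrated with a small sketch. It samples one factorization order and derives the visibility mask that order implies. The function name `permutation_mask` and the example sentence are hypothetical, and the sketch deliberately leaves out the rest of XLNet's training setup (such as its two-stream attention); it only shows how a sampled order decides what each prediction may see.

```python
import random

def permutation_mask(seq_len, seed=0):
    """Sample a factorization order and build the visibility mask it implies.

    mask[i][j] is True when the prediction for position i may attend to
    position j, i.e. when j comes earlier than i in the sampled order.
    (Hypothetical helper for illustration only.)
    """
    rng = random.Random(seed)
    order = list(range(seq_len))
    rng.shuffle(order)  # the sampled factorization order
    # rank[pos] = step at which position pos is predicted
    rank = {pos: step for step, pos in enumerate(order)}
    mask = [[rank[j] < rank[i] for j in range(seq_len)] for i in range(seq_len)]
    return order, mask

tokens = ["why", "did", "the", "chicken", "cross"]
order, mask = permutation_mask(len(tokens), seed=0)

# Walk through the sampled order: the tokens stay in their original
# positions, only the set of visible context tokens changes.
for pos in order:
    visible = [tokens[j] for j in range(len(tokens)) if mask[pos][j]]
    print(f"predict {tokens[pos]!r} given {visible}")
```

The first position in the sampled order is predicted with no context, and the last one sees every other token, whether it lies to the left or to the right in the sentence.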

## Role of XLNet in Humour Detection (Jokes Classification)
1. __Fine-Tuning Approach:__ XLNet can be fine-tuned for many natural language understanding tasks, including humor detection. For jokes classification, the usual approach is to take a pre-trained XLNet model, add a classification head, and fine-tune it on a labeled dataset of jokes and non-jokes. During fine-tuning, the model learns to recognize patterns and linguistic features that indicate humor.

2. __Features and Context:__ XLNet's ability to capture context and dependencies between words can be valuable for humor detection. Jokes often hinge on wordplay, sarcasm, irony, or a punchline that only makes sense against its setup, so a model that relates distant words to each other is better placed to pick up these cues.

3. __Fine-Tuning for Specific Tasks:__ Fine-tuning allows you to adapt XLNet to your specific task, making it a powerful tool for various natural language processing applications. In the context of humor detection, fine-tuning can help the model become proficient at distinguishing between humorous and non-humorous text.
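
As a rough illustration of what the fine-tuned classifier produces, the sketch below turns a pair of output logits into a label and a confidence score. The helper names `softmax` and `predict_label` and the example logits are made up for illustration; the only assumption carried over from the notebook is that index 0 stands for non-humorous text and index 1 for humorous text, matching the `int(label)` conversion of the boolean `humor` column.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, labels=("not humor", "humor")):
    """Return the most probable label and its probability.

    Hypothetical helper: index 0 = not humor, index 1 = humor.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Made-up logits standing in for the model output on a single sentence
example_logits = [-1.2, 2.3]
label, confidence = predict_label(example_logits)
print(label, round(confidence, 3))
```

In the notebook itself, the same decision is taken with `outputs.logits.argmax(dim=1)` during evaluation.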

In [None]:
import numpy as np
import pandas as pd


In [None]:
data=pd.read_csv('JokeDetectionDataset_classified.csv')
data.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


In [None]:
data.tail()

Unnamed: 0,text,humor
199995,Conor maynard seamlessly fits old-school r&b h...,False
199996,How to you make holy water? you boil the hell ...,True
199997,How many optometrists does it take to screw in...,True
199998,Mcdonald's will officially kick off all-day br...,False
199999,An irish man walks on the street and ignores a...,True


In [None]:
import spacy

In [None]:
nlp=spacy.load('en_core_web_sm')

## Pre-Trained Language Model Implementation: XLNet

In [None]:
!pip install transformers



In [None]:
from transformers import XLNetTokenizer, XLNetForSequenceClassification
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

X = data['text']
y = data['humor']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
# XLNetTokenizer depends on the sentencepiece package
!pip install sentencepiece

In [None]:
import sentencepiece
# Load XLNet tokenizer and model
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=2)


Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.weight', 'logits_proj.bias', 'sequence_summary.summary.weight', 'sequence_summary.summary.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Define a custom dataset
class JokeDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts.iloc[idx])
        label = int(self.labels.iloc[idx])
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt'
        )
        # squeeze(0) drops only the batch dimension added by return_tensors='pt'
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Create data loaders
max_length = 128
train_dataset = JokeDataset(X_train, y_train, tokenizer, max_length)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

val_dataset = JokeDataset(X_val, y_val, tokenizer, max_length)
val_loader = DataLoader(val_dataset, batch_size=32)

# Fine-tune the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
num_epochs = 5

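for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for batch in train_loader:
        # Move the tokenized inputs and labels to the training device
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
        labels = batch['labels'].to(device)
        optimizer.zero_grad()
        # Passing labels makes the model return the classification loss
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Report progress once per epoch so long runs give some feedback
    print(f'Epoch {epoch + 1}/{num_epochs} - mean training loss: {running_loss / len(train_loader):.4f}')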

# Evaluation
model.eval()
val_loss = 0.0
val_correct = 0

with torch.no_grad():
    for batch in val_loader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != 'labels'}
        labels = batch['labels'].to(device)
        outputs = model(**inputs, labels=labels)
        val_loss += outputs.loss.item()
        val_correct += (outputs.logits.argmax(dim=1) == labels).sum().item()

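val_accuracy = val_correct / len(y_val)
# val_loss was summed per batch, so average it over the number of batches
print(f'Validation Loss: {val_loss / len(val_loader):.4f}')
print(f'Validation Accuracy: {val_accuracy*100:.2f}%')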


KeyboardInterrupt: ignored
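
Accuracy is the only metric the evaluation cell reports. As a possible complement, the sketch below computes precision, recall, and F1 for the humor class from lists of true and predicted labels. The function name `binary_metrics` and the toy labels are hypothetical; in the notebook, the predictions would have to be collected from `outputs.logits.argmax(dim=1)` inside the validation loop.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (1).

    Hypothetical helper for illustration; labels are 0 (not humor) or 1 (humor).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels, not results from the notebook
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```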