## 🚀**Introduction** <a class="anchor"  id="h1"></a>

<div class = "alert alert-info">
This notebook tackles the task of identifying <b>AI-generated text</b>. It begins by loading the datasets, including the test essays and an extended training set. The texts are tokenized with a custom <b>Byte-Pair Encoding (BPE)</b> tokenizer and converted to features with <b>TF-IDF vectorization</b>. Four classifiers, <b>Multinomial Naive Bayes, Stochastic Gradient Descent, LightGBM, and CatBoost</b>, are then combined into an ensemble with a <b>Voting Classifier</b>. The trained ensemble is applied to the test dataset, and the predictions are written out in the format the competition requires.
</div>

### 📋**Table of Contents**

* [Introduction](#h1)

* [Importing Libraries](#h2)

* [Loading Datasets](#h3)

* [Setting Constants](#h4)

* [Creating Byte-Pair Encoding Tokenizer](#h5)

* [Adding Special Tokens and Creating Trainer Instance](#h6)

* [Creating Huggingface Dataset Object](#h7)

* [Training the Tokenizer](#h8)

* [Tokenizing Texts](#h9)

* [TF-IDF Vectorization](#h10)

* [Getting Vocab](#h11)

* [Model Training](#h12)

## 📚 **Importing Libraries** <a class="anchor"  id="h2"></a>

In [1]:
import sys
import gc

import pandas as pd
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

from datasets import Dataset
from tqdm.auto import tqdm
from transformers import PreTrainedTokenizerFast

from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

## 📊**Loading Datasets** <a class="anchor"  id="h3"></a>

In [2]:
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
sub = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/sample_submission.csv')
org_train = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_essays.csv')
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv", sep=',')
train = train.drop_duplicates(subset=['text'])
train.reset_index(drop=True, inplace=True)

## 📌**Setting Constants** <a class="anchor"  id="h4"></a>

In [3]:
LOWERCASE = False
VOCAB_SIZE = 30522

## 🤖**Creating Byte-Pair Encoding Tokenizer** <a class="anchor"  id="h5"></a>

In [4]:
raw_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Parenthesize the conditional so NFC is applied even when LOWERCASE is False
raw_tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFC()] + ([normalizers.Lowercase()] if LOWERCASE else [])
)
raw_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
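
Python parses `a + b if c else d` as `(a + b) if c else d`, so the placement of parentheses in the normalizer list decides whether NFC normalization survives when `LOWERCASE` is `False`. A minimal sketch of that parsing rule, using plain strings as stand-ins for the normalizer objects (the strings are illustrative only):

```python
# Plain strings stand in for the normalizer objects (illustrative only).
LOWERCASE = False

# Without parentheses the conditional covers the whole sum:
# (["NFC"] + ["Lowercase"]) if LOWERCASE else []
unparenthesized = ["NFC"] + ["Lowercase"] if LOWERCASE else []

# With parentheses only the optional part is conditional.
parenthesized = ["NFC"] + (["Lowercase"] if LOWERCASE else [])

print(unparenthesized)  # []
print(parenthesized)    # ['NFC']
```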

## 🛠️**Adding Special Tokens and Creating Trainer Instance** <a class="anchor"  id="h6"></a>

In [5]:
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.BpeTrainer(vocab_size=VOCAB_SIZE, special_tokens=special_tokens)
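
To give a feel for what the trainer does, here is a simplified, pure-Python sketch of the core BPE idea: repeatedly merge the most frequent adjacent pair of symbols. It is not the `tokenizers` implementation (which works on byte-level symbols and may break ties differently); the helper names and the toy corpus are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus: word (as a tuple of characters) -> frequency
toy_words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):  # a real trainer keeps merging until the vocabulary reaches vocab_size
    pair = most_frequent_pair(toy_words)
    if pair is None:
        break
    merges.append(pair)
    toy_words = merge_pair(toy_words, pair)

print(merges)
```

Each merge adds one new symbol to the vocabulary, which is why `vocab_size` effectively bounds the number of merges.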

## 📚**Creating Huggingface Dataset Object** <a class="anchor"  id="h7"></a>

In [6]:
dataset = Dataset.from_pandas(test[['text']])

## 🚀**Training the Tokenizer** <a class="anchor"  id="h8"></a>

In [7]:
def train_corp_iter(): 
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["text"]
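
The generator above feeds the tokenizer trainer in batches of 1000 texts. The same slicing pattern can be sketched on a plain list; the `batched` helper and the sample texts are hypothetical stand-ins for `dataset[...]["text"]`. Slicing past the end of a sequence is safe in Python, so the last batch is simply shorter.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]

texts = [f"essay {i}" for i in range(7)]  # hypothetical stand-in for the text column
batches = list(batched(texts, 3))
print([len(b) for b in batches])  # [3, 3, 1]
```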

In [8]:
raw_tokenizer.train_from_iterator(train_corp_iter(), trainer=trainer)
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]")

## 🗂️**Tokenizing Texts** <a class="anchor"  id="h9"></a>

In [9]:
tokenized_texts_test = []
for text in tqdm(test['text'].tolist()):
    tokenized_texts_test.append(tokenizer.tokenize(text))

  0%|          | 0/3 [00:00<?, ?it/s]

In [10]:
tokenized_texts_train = []
for text in tqdm(train['text'].tolist()):
    tokenized_texts_train.append(tokenizer.tokenize(text))

  0%|          | 0/44868 [00:00<?, ?it/s]

## 🤖➡️🔠**TF-IDF Vectorization** <a class="anchor"  id="h10"></a>

In [11]:
def dummy(text):
    return text

In [12]:
vectorizer = TfidfVectorizer(ngram_range=(3, 5), 
                             lowercase=False, 
                             sublinear_tf=True, 
                             analyzer = 'word',
                             tokenizer = dummy,
                             preprocessor = dummy,
                             token_pattern = None,
                             strip_accents='unicode')

In [13]:
vectorizer.fit(tokenized_texts_test)
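
Because `tokenizer` and `preprocessor` are both the identity function `dummy`, the vectorizer receives the BPE token lists unchanged and only builds n-grams from them. A rough sketch of what `ngram_range=(3, 5)` produces for one token list, assuming n-grams are formed by joining consecutive tokens with a space (the `word_ngrams` helper is hypothetical, written for illustration rather than taken from scikit-learn):

```python
def word_ngrams(tokens, min_n=3, max_n=5):
    """Return space-joined n-grams of `tokens` for every n in [min_n, max_n]."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i : i + n]))
    return ngrams

tokens = ["The", "cat", "sat", "down"]  # hypothetical token list
print(word_ngrams(tokens))
# ['The cat sat', 'cat sat down', 'The cat sat down']
```

A document with fewer than three tokens yields no features at all under this setting.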

## **Getting Vocab** 📚🧠 <a class="anchor"  id="h11"></a>

In [14]:
vocab = vectorizer.vocabulary_

In [15]:
vectorizer = TfidfVectorizer(ngram_range=(3, 5), 
                             lowercase=False, 
                             sublinear_tf=True, 
                             vocabulary=vocab,
                             analyzer = 'word',
                             tokenizer = dummy,
                             preprocessor = dummy,
                             token_pattern = None, 
                             strip_accents='unicode')
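
Passing `vocabulary=vocab` fixes the feature columns to the n-grams found in the test essays, so n-grams that occur only in the training texts are ignored. A minimal sketch of that filtering on raw counts; the `count_in_vocab` helper and the toy vocabulary are hypothetical:

```python
from collections import Counter

def count_in_vocab(ngrams, vocab):
    """Count only the n-grams that appear in the fixed vocabulary."""
    counts = Counter(g for g in ngrams if g in vocab)
    # Order the counts by the column index stored in the vocabulary
    return [counts[term] for term in sorted(vocab, key=vocab.get)]

vocab = {"a b c": 0, "b c d": 1}  # hypothetical: term -> column index
train_ngrams = ["a b c", "x y z", "a b c", "b c d"]
print(count_in_vocab(train_ngrams, vocab))  # [2, 1]
```

This keeps the feature matrix limited to n-grams that can actually appear at prediction time.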

In [16]:
tf_train = vectorizer.fit_transform(tokenized_texts_train)
tf_test = vectorizer.transform(tokenized_texts_test)
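
For intuition about the values in `tf_train` and `tf_test`, the sketch below computes one TF-IDF row by hand. It assumes scikit-learn's default smoothed IDF and L2 normalization, which the notebook does not override, together with `sublinear_tf=True`; the `tfidf_row` helper and its inputs are hypothetical.

```python
import math

def tfidf_row(counts, doc_freq, n_docs):
    """Sublinear TF-IDF for one document, L2-normalized."""
    weights = []
    for count, df in zip(counts, doc_freq):
        # sublinear_tf: replace the raw count with 1 + ln(count)
        tf = (1.0 + math.log(count)) if count > 0 else 0.0
        # smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights.append(tf * idf)
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else weights

row = tfidf_row(counts=[2, 1, 0], doc_freq=[3, 1, 2], n_docs=3)
print([round(w, 4) for w in row])  # [0.7071, 0.7071, 0.0]
```

In this toy case a term seen twice but present in every document ends up with the same weight as a term seen once in a single document.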

In [17]:
del vectorizer
gc.collect()

21

## **Model Training** 🚀 <a class="anchor"  id="h12"></a>

In [18]:
y_train = train['label'].values

In [19]:
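# With only a handful of test rows (3 here), skip training and write the sample submission unchanged
if len(test.text.values) <= 5:
    sub.to_csv('submission.csv', index=False)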
else:
    clf = MultinomialNB(alpha=0.02)
    sgd_model = SGDClassifier(max_iter=8000, tol=1e-4, loss="modified_huber") 
    p6={'n_iter': 1500,'verbose': -1,'objective': 'binary','metric': 'auc','learning_rate': 0.05073909898961407, 'colsample_bytree': 0.726023996436955, 'colsample_bynode': 0.5803681307354022, 'lambda_l1': 8.562963348932286, 'lambda_l2': 4.893256185259296, 'min_data_in_leaf': 115, 'max_depth': 23, 'max_bin': 898}
    lgb=LGBMClassifier(**p6)
    cat=CatBoostClassifier(iterations=1000,
                           verbose=0,
                           l2_leaf_reg=6.6591278779517808,
                           learning_rate=0.005689066836106983,
                           allow_const_label=True,loss_function = 'CrossEntropy')
    weights = [0.07,0.31,0.31,0.31]
 
    ensemble = VotingClassifier(estimators=[('mnb',clf),
                                            ('sgd', sgd_model),
                                            ('lgb',lgb), 
                                            ('cat', cat)
                                           ],
                                weights=weights, voting='soft', n_jobs=-1)
    ensemble.fit(tf_train, y_train)
    gc.collect()
    final_preds = ensemble.predict_proba(tf_test)[:,1]
    sub['generated'] = final_preds
    sub.to_csv('submission.csv', index=False)
    sub
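
With `voting='soft'`, the ensemble's predicted probability is a weighted average of the individual models' probabilities. The arithmetic can be sketched in plain Python; the `soft_vote` helper and the per-model probabilities below are hypothetical, while the weights are the ones used above.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class-1 probabilities for each sample."""
    total = sum(weights)
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]

weights = [0.07, 0.31, 0.31, 0.31]  # same weights as the ensemble above
# Hypothetical class-1 probabilities from the four models for two essays
probas = [
    [0.90, 0.10],  # mnb
    [0.60, 0.20],  # sgd
    [0.80, 0.30],  # lgb
    [0.70, 0.10],  # cat
]
blended = soft_vote(probas, weights)
print([round(p, 3) for p in blended])  # [0.714, 0.193]
```

The low weight on Multinomial Naive Bayes means its confident 0.90 moves the blended score only slightly.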