## **Introduction** <a class="anchor"  id="h1"></a>

<div class = "alert alert-info">
Welcome to this Kaggle notebook. This iteration continues our work on <b>essay classification</b>: distinguishing essays <b>written by students</b> from essays <b>generated by a large language model (LLM)</b>. We build a <b>Soft Voting Classifier</b> that combines <b>Logistic Regression</b> with a linear classifier trained by <b>Stochastic Gradient Descent (SGD)</b>.</div>

### 📋**Table of Contents**

* [Introduction](#h1)

* [Import Libraries](#h2)

* [Read Dataset](#h3)

* [Vectorize Text Data](#h4)

* [Define and Train Models](#h5)

* [Generate Predictions and Create Submission File](#h6)

## 📚**Import Libraries** <a class="anchor"  id="h2"></a>

In [1]:
import regex as re
import numpy as np
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

## 📖**Read Dataset** <a class="anchor"  id="h3"></a>

In [2]:
train = pd.read_csv("/kaggle/input/daigt-v2-train-dataset/train_v2_drcat_02.csv")
test = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')

In [3]:
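# Rows outside the RDizzl3_seven subset: keep a random sample of 8000 generated essays (label 1).
train_ = train[train.RDizzl3_seven == False].reset_index(drop=True)
train_ = train_[train_["label"] == 1].sample(8000)
# Keep every row of the RDizzl3_seven subset, then append the sampled generated essays.
train = train[train.RDizzl3_seven == True].reset_index(drop=True)
train = pd.concat([train, train_])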
train['text'] = train['text'].str.replace('\n', '')
test['text'] = test['text'].str.replace('\n', '')
train['label'].value_counts()

label
0    14250
1    14200
Name: count, dtype: int64
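
The cell above removes newlines by replacing them with an empty string. As a small illustration of what that does to the text, the sketch below applies the same replacement to a made-up sentence and tokenizes it with the regular expression used later in the vectorizer. It uses the standard-library `re` module instead of `regex`, and the sample string is hypothetical; whether the effect matters for the final score is not tested here.

```python
import re

def tokenize(text):
    # Same pattern as the vectorizer's tokenizer: runs of word characters.
    return re.findall(r"[^\W]+", text)

# Hypothetical essay fragment with a line break between two words.
raw = "the first line ends here\nnext line starts"

joined = raw.replace("\n", "")   # what the notebook does
spaced = raw.replace("\n", " ")  # alternative that keeps the word boundary

print(tokenize(joined))  # "here" and "next" fuse into the single token "herenext"
print(tokenize(spaced))  # "here" and "next" stay separate tokens
```

When a line break is preceded by punctuation, the tokenizer still splits at the punctuation mark, so only words separated by nothing but a newline are merged.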

## 🔠**Vectorize Text Data** <a class="anchor"  id="h4"></a>

In [4]:
%%time
df = pd.concat([train['text'], test['text']])

vectorizer = TfidfVectorizer(sublinear_tf=True,
                             ngram_range=(3, 4),
                             tokenizer=lambda x: re.findall(r'[^\W]+', x),
                             token_pattern=None,
                             strip_accents='unicode',
                             )

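# The vocabulary and IDF weights are learned from the test essays only.
vectorizer = vectorizer.fit(test['text'])
# Train and test essays are then transformed together with that vocabulary;
# the first train.shape[0] rows of X belong to the training set.
X = vectorizer.transform(df)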

CPU times: user 22.2 s, sys: 304 µs, total: 22.2 s
Wall time: 22.2 s
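
To make the `ngram_range=(3, 4)` setting concrete, here is a minimal pure-Python sketch of the word n-grams such a configuration produces for one sentence. It is an approximation for illustration only: the helper `word_ngrams` is hypothetical, it uses the standard-library `re` module, it lowercases the text (the `TfidfVectorizer` default), and it ignores accent stripping and the TF-IDF weighting itself.

```python
import re

def word_ngrams(text, n_min=3, n_max=4):
    """Return space-joined word n-grams for n_min <= n <= n_max."""
    tokens = re.findall(r"[^\W]+", text.lower())
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

features = word_ngrams("The quick brown fox jumps")
for feature in features:
    print(feature)
```

Because single words and word pairs are excluded, a text with fewer than three tokens yields no features at all under this setting.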


## ⚙️**Define and Train Models** <a class="anchor"  id="h5"></a>

In [5]:
%%time
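# Logistic regression with default settings.
lr_model = LogisticRegression()
# Linear model trained with SGD; the modified_huber loss provides predict_proba,
# which soft voting requires.
sgd_model = SGDClassifier(max_iter=5000, loss="modified_huber", random_state=42)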

ensemble = VotingClassifier(estimators=[('lr', lr_model),
                                        ('sgd', sgd_model),
                                       ],
                            weights=[0.01, 0.99],
                            voting='soft'
                           )
ensemble.fit(X[:train.shape[0]], train.label)

CPU times: user 28 ms, sys: 0 ns, total: 28 ms
Wall time: 34.3 ms
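
Soft voting averages the class probabilities of the individual models using the given weights. The sketch below reproduces that arithmetic in plain Python for a single essay. The helper `soft_vote` and the probability values are hypothetical and chosen only to show how strongly the `[0.01, 0.99]` weights favor the SGD model.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probas)) / total
        for k in range(n_classes)
    ]

# Hypothetical [P(student), P(generated)] outputs for one essay.
lr_proba = [0.20, 0.80]
sgd_proba = [0.70, 0.30]

combined = soft_vote([lr_proba, sgd_proba], weights=[0.01, 0.99])
print(combined)  # approximately [0.695, 0.305]
```

Even though the two models disagree in this example, the combined probability stays within 0.01 of the SGD output, so with these weights the ensemble behaves almost like the SGD model alone.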


## 📊**Generate Predictions and Create Submission File** <a class="anchor"  id="h6"></a>

In [6]:
preds_test = ensemble.predict_proba(X[train.shape[0]:])[:, 1]
pd.DataFrame({'id':test["id"], 'generated':preds_test}).to_csv('submission.csv', index=False)