<div style="background-color:#5D73F2; color:#19180F; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> Multiple approaches </div>
<div style="background-color:#A8B4F6; color:#19180F; font-size:20px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> 
📌 1. TFIDF + Stacking + K Fold CV + Optuna based approach <br>
📌 2. Word2Vec  + Stacking + K Fold CV + Optuna based approach<br>
📌 3. DistilBERT in PyTorch <br>
</div>


<div style="background-color:#A8B4F6; color:#19180F; font-size:20px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> 
📌 1. TFIDF + Stacking + K Fold CV + Optuna based approach <br>

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
    Importing modules
    </div>

In [None]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import optuna

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
    Loading data.
    </div>

In [None]:
train_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_essays.csv', low_memory=True, nrows=2000)
test_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv',nrows=2000)#remove nrows arg when using first method to generate submission
train_prompts = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv',nrows=2000)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
    Merging train essays and train prompts
    </div>

In [None]:
train_data = pd.merge(train_essays, train_prompts, on='prompt_id', how='left')

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
    Splitting the data into training and validation sets
    </div>

In [None]:
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Feature engineering using TF-IDF    </div>

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_data['text'])
X_val_tfidf = tfidf_vectorizer.transform(val_data['text'])

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining optuna based hyperparam optimization for random forest    </div>

In [None]:
def objective_rf(trial):
    params = { #define more dense param search space if using this method for submission
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'max_depth': trial.suggest_int('max_depth', 5, 6),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 3),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 2),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2']),
    }

    model = RandomForestClassifier(**params, random_state=42)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_tfidf, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining optuna based hyperparam optimization for gradient boosting    </div>

In [None]:
def objective_gb(trial):
    params = {#define more dense param search space if using this method for submission
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.02),
        'max_depth': trial.suggest_int('max_depth', 3, 4),
        'subsample': trial.suggest_float('subsample', 0.5, 0.6),
    }

    model = GradientBoostingClassifier(**params, random_state=42)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_tfidf, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining optuna based hyperparam optimization for extra trees   </div>

In [None]:
def objective_et(trial):
    params = {#define more dense param search space if using this method for submission
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'max_depth': trial.suggest_int('max_depth', 5, 6),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 3),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 2),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2']),
    }

    model = ExtraTreesClassifier(**params, random_state=42)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_tfidf, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining and training base mdels    </div>

In [None]:
study_rf = optuna.create_study(direction='maximize')
study_rf.optimize(objective_rf, n_trials=1)

study_gb = optuna.create_study(direction='maximize')
study_gb.optimize(objective_gb, n_trials=1)

study_et = optuna.create_study(direction='maximize')
study_et.optimize(objective_et, n_trials=1)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Fetching the best hyperparameters</div>

In [None]:
best_params_rf = study_rf.best_params
best_params_gb = study_gb.best_params
best_params_et = study_et.best_params


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Training base models on best params   </div>

In [None]:
best_rf_clf = RandomForestClassifier(**best_params_rf, random_state=42)
best_gb_clf = GradientBoostingClassifier(**best_params_gb, random_state=42)
best_et_clf = ExtraTreesClassifier(**best_params_et, random_state=42)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Creating voting classifier with soft voting    </div>

In [None]:
soft_voting_clf = VotingClassifier(
    estimators=[
        ('rf', best_rf_clf),
        ('gb', best_gb_clf),
        ('et', best_et_clf),
    ],
    voting='soft'
)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Creating stacking classifier with logistic regression as meta classifier   </div>

In [None]:
stacking_clf = StackingClassifier(
    estimators=[('rf', best_rf_clf), ('gb', best_gb_clf), ('et', best_et_clf)],
    final_estimator=LogisticRegression(),
    stack_method='auto', 
    n_jobs=-1, 
)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Training soft voting and stacking classifier    </div>

In [None]:
soft_voting_clf.fit(X_train_tfidf, train_data['generated'])
stacking_clf.fit(X_train_tfidf, train_data['generated'])

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Making predictions on the val set for soft voting</div>

In [None]:
val_predictions_soft = soft_voting_clf.predict(X_val_tfidf)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Evaluating soft voting model    </div>

In [None]:
accuracy_soft = accuracy_score(val_data['generated'], val_predictions_soft)
print(f'Soft Voting Model Accuracy: {accuracy_soft:.2f}')


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Making predictions on the val set for stacking model   </div>

In [None]:
val_predictions_stacking = stacking_clf.predict(X_val_tfidf)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Evaluating stacking model    </div>

In [None]:
accuracy_stacking = accuracy_score(val_data['generated'], val_predictions_stacking)
print(f'Stacking Model Accuracy: {accuracy_stacking:.2f}')


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Making predictions on the test set for soft voting    </div>

In [None]:
X_test_tfidf = tfidf_vectorizer.transform(test_essays['text'])
test_predictions_soft = soft_voting_clf.predict_proba(X_test_tfidf)[:, 1]


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Creating submission file for soft voting   </div>

In [None]:
submission_df_soft = pd.DataFrame({'id': test_essays['id'], 'generated': test_predictions_soft})
submission_df_soft.to_csv('submission_soft_voting.csv', index=False)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Making preds on test set for stacking and generating submission file    </div>

In [None]:
test_predictions_stacking = stacking_clf.predict_proba(X_test_tfidf)[:, 1]

In [None]:
submission_df_stacking = pd.DataFrame({'id': test_essays['id'], 'generated': test_predictions_stacking})
submission_df_stacking.to_csv('submission.csv', index=False)


In [None]:
submission_df_soft

In [None]:
submission_df_stacking


<div style="background-color:#A8B4F6; color:#19180F; font-size:20px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> 
📌 2. Word2vec + Stacking + KfoldCV + Optuna <br>

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Importing modules   </div>

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import optuna
from gensim.models import KeyedVectors
from sklearn.preprocessing import LabelEncoder




<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Loading word2vec model   </div>

In [2]:
word2vec_model = KeyedVectors.load_word2vec_format('/kaggle/input/google-word2vec/GoogleNews-vectors-negative300.bin', binary=True)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Loading data  </div>

In [3]:
train_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_essays.csv', low_memory=True,nrows=4000)
test_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv',nrows=4000)
train_prompts = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv',nrows=4000)#remove nrows arg if you want to submit via this

train_data = pd.merge(train_essays, train_prompts, on='prompt_id', how='left')



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Encoding labels
</div>

In [4]:
label_encoder = LabelEncoder()
train_data['generated'] = label_encoder.fit_transform(train_data['generated'])
#train_data['generated'] = [0,1,0,1,0,0,0,0,0,1] #for sanity check of code

  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


In [5]:
np.unique(train_data['generated'].values)

array([0, 1])

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Splitting the data   </div>

In [6]:
train_data, val_data = train_test_split(train_data, test_size=0.2, random_state=42)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Tokenizing and padding sequences   </div>

In [7]:
X_train_word2vec = train_data['text'].apply(lambda x: [word2vec_model[word] for word in x.split() if word in word2vec_model]).values.tolist()
X_val_word2vec = val_data['text'].apply(lambda x: [word2vec_model[word] for word in x.split() if word in word2vec_model]).values.tolist()


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining objective functions for hyperparam tuning   </div>

In [8]:
def objective_rf(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'max_depth': trial.suggest_int('max_depth', 5, 6),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 3),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 2),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2']),
    }

    model = RandomForestClassifier(**params, random_state=42)

    max_seq_length = max(len(seq) for seq in X_train_word2vec)
    X_train_padded = np.array([np.pad(seq, ((0, max_seq_length - len(seq)), (0, 0)), mode='constant') for seq in X_train_word2vec])

    X_train_padded = X_train_padded.reshape((X_train_padded.shape[0], -1))

    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_padded, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score


def objective_gb(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.02),
        'max_depth': trial.suggest_int('max_depth', 3, 4),
        'subsample': trial.suggest_float('subsample', 0.5, 0.6),
    }

    model = GradientBoostingClassifier(**params, random_state=42)

    max_seq_length = max(len(seq) for seq in X_train_word2vec)
    X_train_padded = np.array([np.pad(seq, ((0, max_seq_length - len(seq)), (0, 0)), mode='constant') for seq in X_train_word2vec])
    X_train_padded = X_train_padded.reshape((X_train_padded.shape[0], -1))

    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_padded, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score

def objective_et(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 51),
        'max_depth': trial.suggest_int('max_depth', 5, 6),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 3),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 2),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2']),
    }

    model = ExtraTreesClassifier(**params, random_state=42)

    max_seq_length = max(len(seq) for seq in X_train_word2vec)
    X_train_padded = np.array([np.pad(seq, ((0, max_seq_length - len(seq)), (0, 0)), mode='constant') for seq in X_train_word2vec])
    X_train_padded = X_train_padded.reshape((X_train_padded.shape[0], -1))

    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    score = cross_val_score(model, X_train_padded, train_data['generated'], cv=kfold, scoring='accuracy').mean()

    return score


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Optimizing hyperparams  </div>

In [10]:
#uncomment if you're submitting via this
# study_rf = optuna.create_study(direction='maximize')
# study_rf.optimize(objective_rf, n_trials=1)#increase num trials if you want to do more extensive search

# study_gb = optuna.create_study(direction='maximize')
# study_gb.optimize(objective_gb, n_trials=1)

# study_et = optuna.create_study(direction='maximize')
# study_et.optimize(objective_et, n_trials=1)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Fetching the best params   </div>

In [11]:
# best_params_rf = study_rf.best_params
# best_params_gb = study_gb.best_params
# best_params_et = study_et.best_params
#uncomment if you ran optuna and wish to submit via this

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Initializing the models with best params   </div>

In [12]:
#uncomment if you ran optuna previously
best_params_rf = {
    'n_estimators': 100,
    'max_depth': 10,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
}

best_params_gb = {
    'n_estimators': 150,
    'learning_rate': 0.1,
    'max_depth': 5,
    'subsample': 0.8,
}

best_params_et = {
    'n_estimators': 120,
    'max_depth': 15,
    'min_samples_split': 10,
    'min_samples_leaf': 3,
    'max_features': 'log2',
}


In [13]:
best_rf_clf = RandomForestClassifier(**best_params_rf, random_state=42)
best_gb_clf = GradientBoostingClassifier(**best_params_gb, random_state=42)
best_et_clf = ExtraTreesClassifier(**best_params_et, random_state=42)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Instantiating soft voting and stacking classifier   </div>

In [14]:
soft_voting_clf = VotingClassifier(
    estimators=[
        ('rf', best_rf_clf),
        ('gb', best_gb_clf),
        ('et', best_et_clf),
    ],
    voting='soft'
)

stacking_clf = StackingClassifier(
    estimators=[('rf', best_rf_clf), ('gb', best_gb_clf), ('et', best_et_clf)],
    final_estimator=LogisticRegression(),
    stack_method='auto', 
    n_jobs=-1, 
)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Padding sequences to common length and evaluating after fitting   </div>

In [15]:
max_seq_length_train = max(len(seq) for seq in X_train_word2vec)
max_seq_length_val = max(len(seq) for seq in X_val_word2vec)
max_seq_length = max(max_seq_length_train, max_seq_length_val)

X_train_padded = np.array([np.pad(seq, ((0, max_seq_length - len(seq)), (0, 0)), mode='constant') for seq in X_train_word2vec])
X_train_padded = X_train_padded.reshape((X_train_padded.shape[0], -1))

X_val_padded = np.array([np.pad(seq, ((0, max_seq_length - len(seq)), (0, 0)), mode='constant') for seq in X_val_word2vec])
X_val_padded = X_val_padded.reshape((X_val_padded.shape[0], -1))

soft_voting_clf.fit(X_train_padded, train_data['generated'])
stacking_clf.fit(X_train_padded, train_data['generated'])

val_predictions_soft = soft_voting_clf.predict(X_val_padded)
accuracy_soft = accuracy_score(val_data['generated'], val_predictions_soft)
print(f'Soft Voting Model Accuracy: {accuracy_soft:.2f}')

val_predictions_stacking = stacking_clf.predict(X_val_padded)
accuracy_stacking = accuracy_score(val_data['generated'], val_predictions_stacking)
print(f'Stacking Model Accuracy: {accuracy_stacking:.2f}')


  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
  if is_sparse(pd_dtype):
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):


KeyboardInterrupt: 

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Generating submission by predicting on test set   </div>

In [None]:
X_test_word2vec = test_essays['text'].apply(lambda x: [word2vec_model[word] for word in x.split() if word in word2vec_model]).values.tolist()

max_seq_length_test = max(len(seq) for seq in X_test_word2vec)

X_test_padded = np.array([
    np.pad(seq, ((0, max_seq_length_test - len(seq)), (0, 0)), mode='constant') for seq in X_test_word2vec
])
X_test_padded = X_test_padded.reshape((X_test_padded.shape[0], -1))

max_features_train = X_train_padded.shape[1]
if X_test_padded.shape[1] < max_features_train:
    X_test_padded = np.pad(X_test_padded, ((0, 0), (0, max_features_train - X_test_padded.shape[1])), mode='constant')
elif X_test_padded.shape[1] > max_features_train:
    X_test_padded = X_test_padded[:, :max_features_train]

test_predictions_soft = soft_voting_clf.predict_proba(X_test_padded)

submission_df_soft = pd.DataFrame({'id': test_essays['id'], 'generated': test_predictions_soft[:, 1]})

submission_df_soft.to_csv('submission.csv', index=False)

test_predictions_stacking = stacking_clf.predict_proba(X_test_padded)

submission_df_stacking = pd.DataFrame({'id': test_essays['id'], 'generated': test_predictions_stacking[:, 1]})
submission_df_stacking.to_csv('submission_stacking.csv', index=False)


In [None]:
submission_df_stacking

In [None]:
submission_df_soft

<div style="background-color:#A8B4F6; color:#19180F; font-size:20px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> 
📌 3. DistilBERT in PyTorch <br>

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Importing modules   </div>

In [16]:
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from tqdm import tqdm


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Loading data   </div>

In [17]:
train_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_essays.csv', low_memory=True)
test_essays = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/test_essays.csv')
train_prompts = pd.read_csv('/kaggle/input/llm-detect-ai-generated-text/train_prompts.csv')
train_data = pd.merge(train_essays, train_prompts, on='prompt_id', how='left')

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Defining the dataset class </div>

In [18]:
class EssaysDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])
        encoding = self.tokenizer(text, truncation=True, padding='max_length', max_length=self.max_length, return_tensors='pt')

        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(label, dtype=torch.float32)
        }


In [20]:
model_tokenizer_dir= '/kaggle/input/distilbert-model-num-labels-1'
model = DistilBertForSequenceClassification.from_pretrained(model_tokenizer_dir)
tokenizer = DistilBertTokenizer.from_pretrained(model_tokenizer_dir)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Splitting the data into train and val sets   </div>

In [21]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_data['text'].values,
    train_data['generated'].values,
    test_size=0.2,
    random_state=42
)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Creating datasets and dataloaders  </div>

In [22]:
train_dataset = EssaysDataset(train_texts, train_labels, tokenizer)
val_dataset = EssaysDataset(val_texts, val_labels, tokenizer)

train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Hyperparams  </div>

In [23]:
epochs = 10
lr = 2e-5

optimizer = AdamW(model.parameters(), lr=lr)
criterion = torch.nn.BCEWithLogitsLoss()

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Training loop </div>

In [24]:
for epoch in tqdm(range(epochs)):
    model.train()
    for step,batch in (enumerate(train_dataloader)):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        loss = criterion(outputs.logits.squeeze(), labels)
        if step%100==0:
            print("Step-{}, Loss-{}".format(step,loss.item()))
        loss.backward()
        optimizer.step()

    model.eval()
    all_val_labels = []
    all_val_preds = []
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask)
            val_preds = torch.sigmoid(outputs.logits.squeeze()).detach().cpu().numpy()
            all_val_labels.extend(labels.cpu().numpy())
            all_val_preds.extend(val_preds)

    auc_roc = roc_auc_score(all_val_labels, all_val_preds)
    print(f'Epoch {epoch + 1}/{epochs}, AUC-ROC: {auc_roc}')


  0%|          | 0/10 [00:00<?, ?it/s]

Step-0, Loss-0.0033012470230460167


  0%|          | 0/10 [00:06<?, ?it/s]


KeyboardInterrupt: 

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Generating submission  </div>

In [25]:
test_dataset = EssaysDataset(test_essays['text'].values, [0] * len(test_essays), tokenizer)
test_dataloader = DataLoader(test_dataset, batch_size=8, shuffle=False)

model.eval()
all_test_preds = []
with torch.no_grad():
    for batch in test_dataloader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask)
        test_preds = torch.sigmoid(outputs.logits.squeeze()).detach().cpu().numpy()
        all_test_preds.extend(test_preds)

submission_df = pd.DataFrame({'id': test_essays['id'], 'generated': all_test_preds})
submission_df.to_csv('submission.csv', index=False)


In [26]:
submission_df

Unnamed: 0,id,generated
0,0000aaaa,0.02126
1,1111bbbb,0.028847
2,2222cccc,0.024098


In [None]:
# output_dir = '/kaggle/working/'
# tokenizer.save_pretrained(output_dir)
# model.save_pretrained(output_dir)