# Modeling

This notebook documents my modeling experiments and what I learned from them. I tried doing this directly on Kaggle, but the notebook got messy very quickly.

In [None]:
# Importing libraries
import pandas as pd
import sys
from sklearn.preprocessing import RobustScaler 
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import pickle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import shap
from sklearn.model_selection import cross_validate
from catboost import CatBoostClassifier

sys.path.append('../')

tqdm.pandas()

In [None]:
# Getting the training data
data = pd.read_csv('../prepared_training_set.csv')
data.head()

## Some minor data processing

I just need to perform some minor data processing: scaling the numerical features and dropping the essay and row_id columns.

In [None]:
# Dropping columns
training_data = data.drop(['row_id','essay'],axis=1)
training_data.head()

In [None]:
# Getting numerical features for scaling
numerical = ['word_count','stop_word_count','stop_word_ratio','unique_word_count','unique_word_ratio','count_question','count_exclamation',
            'count_semi','count_colon','grammar_errors']

In [None]:
# Using RobustScaler since I know that there are outliers and that the data distribution isn't normal
# RobustScaler will use median and IQR instead of mean and standard deviation
scaler = RobustScaler()

In [None]:
# Scaling the data
training_data[numerical] = scaler.fit_transform(training_data[numerical])
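
As a sanity check on what RobustScaler does with its default settings, here is a minimal sketch on a made-up toy column (not the essay data). It reproduces the scaling by hand with the median and the IQR; the `toy`, `manual`, and `scaled` names are only for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme outlier (made-up values, not the essay data)
toy = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

# RobustScaler defaults: subtract the median, divide by the IQR (75th - 25th percentile)
median = np.median(toy, axis=0)
q25, q75 = np.percentile(toy, [25, 75], axis=0)
manual = (toy - median) / (q75 - q25)

scaled = RobustScaler().fit_transform(toy)

# The hand-computed values should match the scaler's output
print(np.allclose(manual, scaled))
```

The outlier still ends up far from the rest, but it does not distort the center and spread used to scale the other rows, which is the reason for preferring it over mean/standard-deviation scaling here.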

In [None]:
# Saving the fitted scaler so the same transformation can be reused at inference time
with open('scalar_grammar.pkl', 'wb') as file:
    pickle.dump(scaler, file)
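
For later reference, this is roughly how I expect to reuse a pickled scaler at inference time. It is a sketch on made-up numbers that uses an in-memory buffer as a stand-in for the .pkl file; the key point is calling `transform` rather than `fit_transform` so new rows are scaled with the training statistics.

```python
import io
import pickle

import numpy as np
from sklearn.preprocessing import RobustScaler

# Fit on made-up "training" values, then serialize to an in-memory buffer
# (a stand-in for the .pkl file written by the notebook)
fitted = RobustScaler().fit(np.array([[1.0], [2.0], [3.0], [4.0], [5.0]]))
buffer = io.BytesIO()
pickle.dump(fitted, buffer)

# At inference: load the scaler and call transform, NOT fit_transform,
# so new rows are scaled with the training median and IQR
buffer.seek(0)
loaded = pickle.load(buffer)
new_rows = np.array([[3.0], [7.0]])
scaled_new = loaded.transform(new_rows)
print(scaled_new.ravel())
```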

In [None]:
# Splitting the data into X & y
train_X = training_data.drop(['LLM_written'],axis=1)
train_y = training_data['LLM_written'].values

## Modeling

The data is ready to be modeled!

### Logistic Regression

In [None]:
# Defining and training the model
log_model = LogisticRegression(random_state=42,C=0.5)
log_model.fit(train_X.values,train_y)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = log_model.predict_proba(train_X.values)[:,1]
roc_auc_score(train_y,predictions)

In [None]:
# Cross validating
cross_val_scores = pd.DataFrame(cross_validate(LogisticRegression(random_state=42,C=0.5),
                                train_X.values,train_y,scoring='roc_auc',cv=5))
cross_val_scores['test_score'].describe()

### Decision Tree

In [None]:
# Building the model
d_tree = DecisionTreeClassifier(criterion='gini',min_samples_leaf=20,random_state=42)
d_tree.fit(train_X.values,train_y)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = d_tree.predict_proba(train_X.values)[:,1]
roc_auc_score(train_y,predictions)

In [None]:
# Cross validating
cross_val_scores = pd.DataFrame(cross_validate(
    DecisionTreeClassifier(criterion='gini',min_samples_leaf=20,random_state=42),
                                train_X.values,train_y,scoring='roc_auc',cv=5))
cross_val_scores['test_score'].describe()

## Feature Importance and SHAP

I found that the models score high on the training set and in cross-validation but much lower on the test set (the leaderboard, LB). The disparity is very large. After reading the discussion posts for the competition, I concluded that the test data distribution is probably very different from the training data distribution. What exactly that difference is remains the question everyone is trying to answer.

[One post](https://www.kaggle.com/competitions/llm-detect-ai-generated-text/discussion/452750) suggests that noise may have been introduced into the test essays. The author ran an experiment using grammar/spelling errors as a predictor and found that the rule works really well for training essays, but not as well for test essays (LB). This suggests that the grammar errors in the test set are noisy, for example because the hosts added grammatical mistakes to the test essays of both classes. Thus, I want to check which features dominate my models' decisions. I can then experiment with removing them to see if that better matches the test set.
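
To get a feel for how this kind of noise could hurt an error-count feature, here is a small simulation. Everything in it is hypothetical: the labels, the Poisson rates, and the amount of injected noise are made up for illustration and are not estimated from the competition data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 5000

# Hypothetical labels: 1 = LLM-written, 0 = student-written
y = rng.integers(0, 2, size=n)

# Hypothetical "clean" error counts: students make more errors than LLMs
clean_errors = rng.poisson(lam=np.where(y == 1, 2.0, 12.0))

# Hypothetical test-time noise: extra errors injected into every essay, regardless of class
noisy_errors = clean_errors + rng.poisson(lam=30.0, size=n)

# Fewer errors -> more likely LLM-written, so score with the negated count
auc_clean = roc_auc_score(y, -clean_errors)
auc_noisy = roc_auc_score(y, -noisy_errors)
print(f"clean AUC: {auc_clean:.3f}, noisy AUC: {auc_noisy:.3f}")
```

In this toy setup the same rule separates the classes less well once class-independent noise is added, which is the pattern the post describes.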

In [None]:
# Plotting the mean SHAP values for the logistic regression model
explainer = shap.LinearExplainer(log_model,train_X)
shap_values = explainer(train_X)
shap.plots.bar(shap_values)

As the above SHAP plot shows, grammar_errors has a very high mean absolute SHAP value. This means that, on average, the model places heavy importance on grammar_errors. If the test data has grammar noise, this would explain why the models perform poorly on the LB data and excellently on the training data.
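
To make the bar plot less of a black box, here is a sketch of what a SHAP value is for a linear model when the features are assumed to be independent: the coefficient times the feature's deviation from its mean, in log-odds units. It runs on synthetic stand-in data and does not call the shap library, so the numbers will not match the plot above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: feature 0 drives the label, feature 1 is pure noise
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + 0.1 * rng.normal(size=2000) > 0).astype(int)

model = LogisticRegression(C=0.5, random_state=42).fit(X, y)

# Linear SHAP under the feature-independence assumption:
# contribution of feature i for a row = coef_i * (x_i - mean of x_i), in log-odds units
contributions = model.coef_[0] * (X - X.mean(axis=0))

# A bar plot of this kind ranks features by the mean absolute contribution
mean_abs = np.abs(contributions).mean(axis=0)
print(mean_abs)

# Contributions plus the base value recover the model's log-odds output
base_value = model.decision_function(X.mean(axis=0, keepdims=True))[0]
reconstructed = contributions.sum(axis=1) + base_value
print(np.allclose(reconstructed, model.decision_function(X)))
```

A feature with a large coefficient and a wide spread dominates this ranking, which is how I read the grammar_errors bar.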

In [None]:
# Plotting the mean SHAP values for the Decision Tree model
explainer = shap.TreeExplainer(d_tree)
shap_values = explainer(train_X)
shap.plots.bar(shap_values[:,:,1])

This plot also shows that grammar_errors plays a big role in the decision tree's decisions. Both models depend heavily on grammar_errors. This is problematic if the test data has grammar noise, since the noise would cause the models to perform poorly. Based on the post and my analysis, I want to experiment with removing grammar_errors.

## Removing grammar_errors column

My hypothesis is that the models lean heavily on the grammar_errors column and that grammatical noise in the test essays is what hurts them on the LB. I will remove this column and re-evaluate my models.
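
The experiment is a simple feature ablation: score the same model with and without one column. Here is a sketch of that comparison on a synthetic frame; the column names mirror mine, but the values and signal strengths are made up, and `mean_cv_auc` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-in frame; column names mirror the notebook, values are made up
y = rng.integers(0, 2, size=n)
frame = pd.DataFrame({
    "word_count": rng.normal(size=n) + 0.3 * y,      # weak signal
    "grammar_errors": rng.normal(size=n) - 2.0 * y,  # strong signal
})

def mean_cv_auc(features: pd.DataFrame) -> float:
    """Mean 5-fold ROC AUC using the notebook's logistic regression settings."""
    scores = cross_validate(LogisticRegression(random_state=42, C=0.5),
                            features.values, y, scoring="roc_auc", cv=5)
    return float(scores["test_score"].mean())

auc_full = mean_cv_auc(frame)
auc_ablated = mean_cv_auc(frame.drop(columns=["grammar_errors"]))
print(f"with grammar_errors: {auc_full:.3f}, without: {auc_ablated:.3f}")
```

Note that a drop in cross-validation score after the ablation only says the column is useful on training-like data; it says nothing about how the column behaves on a shifted test set.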

In [None]:
no_gram_numerical = ['word_count','stop_word_count','stop_word_ratio','unique_word_count','unique_word_ratio','count_question','count_exclamation',
            'count_semi','count_colon']

In [None]:
# Dropping the unused columns again, this time including grammar_errors
training_data_no_gram = data.drop(['row_id','essay','grammar_errors'],axis=1)
training_data_no_gram.head()

In [None]:
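# Scaling the data with a new scaler for the feature set without grammar_errors
# Fitting on the raw (unscaled) columns of training_data_no_gram; training_data was already scaled above
scale_no_gram = RobustScaler()
training_data_no_gram[no_gram_numerical] = scale_no_gram.fit_transform(training_data_no_gram[no_gram_numerical])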

In [None]:
with open('scalar_no_grammar.pkl', 'wb') as file:
    pickle.dump(scale_no_gram, file)

In [None]:
# Splitting the data into X & y
train_X_no_gram = training_data_no_gram.drop(['LLM_written'],axis=1)
train_y_no_gram = training_data_no_gram['LLM_written'].values

### Logistic Regression

In [None]:
# Training a new model
log_reg_no_grammar = LogisticRegression(random_state=42,C=0.5)
log_reg_no_grammar.fit(train_X_no_gram.values,train_y_no_gram)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = log_reg_no_grammar.predict_proba(train_X_no_gram.values)[:,1]
roc_auc_score(train_y_no_gram,predictions)

In [None]:
# Cross Validating
cross_val_scores = pd.DataFrame(cross_validate(LogisticRegression(random_state=42,C=0.5),
                                train_X_no_gram.values,train_y_no_gram,scoring='roc_auc',cv=5))
cross_val_scores['test_score'].describe()

There is still a slight amount of overfitting, but the fact that the model no longer scores as high on the validation folds is promising. Perhaps I will see a performance boost on the LB.
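
One way to put a number on "a slight amount of overfitting" is to ask cross_validate for the training-fold scores as well. This is a sketch on synthetic data from `make_classification`, not on the essay features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the essay features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# return_train_score=True adds a train_score column next to test_score
scores = pd.DataFrame(cross_validate(
    LogisticRegression(random_state=42, C=0.5),
    X, y, scoring="roc_auc", cv=5, return_train_score=True))

# A large positive gap means the model fits the training folds better than the held-out folds
scores["gap"] = scores["train_score"] - scores["test_score"]
print(scores[["train_score", "test_score", "gap"]].mean())
```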

### Decision Tree

In [None]:
# Building the model on the reduced feature set (this reuses, and overwrites, the d_tree name)
d_tree = DecisionTreeClassifier(criterion='gini',min_samples_leaf=20,random_state=42)
d_tree.fit(train_X_no_gram.values,train_y_no_gram)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = d_tree.predict_proba(train_X_no_gram.values)[:,1]
roc_auc_score(train_y_no_gram,predictions)

In [None]:
# Cross Validating
cross_val_scores = pd.DataFrame(cross_validate(DecisionTreeClassifier(criterion='gini',min_samples_leaf=20,random_state=42),
                                train_X_no_gram.values,train_y_no_gram,scoring='roc_auc',cv=5))
cross_val_scores['test_score'].describe()

Update: I made submissions for both models. Both performed terribly on the test set (0.51 and 0.478 respectively), which is no better than a random classifier. Perhaps removing the grammar_errors column isn't a good idea. Let me revert to the original feature set and try running a random forest on it. Maybe a more powerful model will help?

### Random Forest

In [None]:
# Building the model
random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(train_X.values,train_y)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = random_forest.predict_proba(train_X.values)[:,1]
roc_auc_score(train_y,predictions)

In [None]:
# Cross Validating on the original feature set (the same data the model above was trained on)
cross_val_scores = pd.DataFrame(cross_validate(RandomForestClassifier(random_state=42),
                                train_X.values,train_y,scoring='roc_auc',cv=5))
cross_val_scores['test_score'].describe()

Update: Random Forest performed really well on the test set (0.769). Removing grammar_errors doesn't seem to be a good idea. I think I should focus my efforts on utilizing more complex models.

### CatBoost

Random Forest was promising, so let's try gradient boosting.

In [None]:
# Building catboost model
catboost_clf = CatBoostClassifier(iterations=100,learning_rate=0.03,loss_function='Logloss',
                                 random_seed=42)
catboost_clf.fit(train_X,train_y)

In [None]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = catboost_clf.predict_proba(train_X.values)[:,1]
roc_auc_score(train_y,predictions)

### Detector

I want to see how well the detector works on the test dataset. This could give me insight on how to utilize it better or if I shouldn't be using it.

In [None]:
# Loading the OpenAI detector
# Using the GPU when one is available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Getting model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base-openai-detector")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base-openai-detector")
model.to(device)  # the inputs are moved to this device during inference, so the model must be too
model.eval()      # inference mode (disables dropout)

In [None]:
# Defining a function for inference
def detector_pred(essay:str) -> float:
  # Tokenizing the input essay (truncated to the model's maximum input length)
  inputs = tokenizer(essay,return_tensors='pt',truncation=True).to(device)

  # Getting the logits and converting them to class probabilities
  with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.nn.functional.softmax(logits,dim=-1)[:,0]
  # Taking the probability of class 0 because the model has "Real" = class 1 and "Fake" = class 0
  # My labels are the opposite, 1 = LLM Written and 0 = student written.
  # P(Fake) = 1 - P(Real), so a value near 1 = LLM Written
  # and a value near 0 = student written
  return probabilities.item()
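
To double check the label flip, here is a small NumPy-only sketch with made-up logits. It assumes the same class order as above (column 0 = "Fake", column 1 = "Real") and shows that taking the class 0 probability is the same as computing 1 - P(Real); the `softmax` helper here is my own, not the torch function.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits for two essays; column 0 = "Fake", column 1 = "Real" (assumed order)
logits = np.array([[2.0, -1.0],
                   [-0.5, 1.5]])
probs = softmax(logits)

p_fake = probs[:, 0]                # the quantity detector_pred returns
one_minus_real = 1.0 - probs[:, 1]  # the "1 - P(Real)" view of the same number
print(np.allclose(p_fake, one_minus_real))
```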

In [None]:
# Getting the probability predictions
detector_test = data.copy()
detector_test['generated'] = data['essay'].progress_apply(detector_pred)
detector_test

In [None]:
# Evaluating the detector's probabilities against the training labels
print('ROC AUC on Training Set:')
roc_auc_score(train_y,detector_test['generated'])

Update: Just using the detector moved my score from 0.769 to 0.789. This gives me some intuition that a deep learning approach may be needed. The classical ML approach seems to cap at 0.75 to 0.77.

## Notes

This section got a little messy. A big takeaway for me is to get the modeling done first, then worry about tuning the hyperparameters. With that being said, here are my learnings:

1. The classical ML approach with engineered features worked on the training data. I got high training scores and high CV scores. However, when it was time to generalize to a dataset with a completely different distribution and some added noise, the models struggled mightily. They appear to have overfit the training distribution and capped at a test performance of 0.76-0.77. These models perform reasonably well, but I think I will need stronger models to rank up on the LB.

2. When I ran the detector, it outperformed every classical ML model on the test dataset. This suggests that I should switch my focus to deep learning approaches, as they seem better suited to this problem.