# Tuning Noisy Grammar Models

I added noise to the grammar-error feature and trained models on the noisy data. The random forest and gradient boosting models overfit the training set, while the ANN maintained its performance. When I submitted the ANN to the competition to evaluate it on the test data, its test-set performance rose. This gives me reasonable evidence that adding noise to the grammar errors improves model performance.
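
The text does not say how the noise was generated, so the following is only a sketch of one plausible scheme: zero-mean Gaussian noise added to the `grammar_errors` column, then rounded and clipped at zero so the counts stay valid. The helper name `add_count_noise`, the `sigma` value, and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def add_count_noise(counts, sigma=1.0, seed=42):
    """Hypothetical helper: perturb non-negative counts with Gaussian noise."""
    rng = np.random.default_rng(seed)
    noisy = counts.to_numpy(dtype=float) + rng.normal(0.0, sigma, size=len(counts))
    # Round back to whole errors and clip so no essay gets a negative count
    return pd.Series(np.clip(np.rint(noisy), 0, None).astype(int), index=counts.index)

# Toy data, not taken from the real training set
toy = pd.DataFrame({'grammar_errors': [0, 2, 5, 9]})
toy['grammar_errors_noisy'] = add_count_noise(toy['grammar_errors'])
```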

In this notebook, I tune the random forest and gradient boosting models using randomized search.

In [1]:
import pandas as pd 
import numpy as np 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
import pickle

%matplotlib inline

In [2]:
# Getting the data
training_data = pd.read_csv('../../data/train-noisy-grammar.csv')
valid_data = pd.read_csv('../../data/validation.csv')

In [3]:
# Loading the fitted scaler (it is applied to the numerical columns in a later cell)
with open('../../models/custom-features/noisy-grammar-errors/scalar-noisy.pkl','rb') as file:
    scalar = pickle.load(file)

In [4]:
# Getting the X and y
X_train = training_data.drop(['LLM_written','prompt','essay','row_id'],axis=1)
y_train = training_data['LLM_written'].values

X_valid = valid_data.drop(['LLM_written','prompt','essay','row_id'],axis=1)
y_valid = valid_data['LLM_written'].values

In [5]:
numerical = ['word_count','stop_word_count','stop_word_ratio','unique_word_count','unique_word_ratio',
             'count_question','count_exclamation','count_semi','count_colon','grammar_errors']
X_train[numerical] = scalar.transform(X_train[numerical])
X_valid[numerical] = scalar.transform(X_valid[numerical])
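
Only `transform` is called here because the scaling statistics should come from the data the scaler was fitted on. Refitting on the validation set would leak validation information into preprocessing. A minimal sketch of that idea in plain NumPy, assuming the pickled object behaves like a standard (mean/variance) scaler, which the notebook does not show, and using toy arrays:

```python
import numpy as np

train = np.array([[10.0], [20.0], [30.0]])   # toy training column
valid = np.array([[20.0], [40.0]])           # toy validation column

# "Fit": statistics are taken from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "Transform": the same training statistics are applied to both splits
train_scaled = (train - mean) / std
valid_scaled = (valid - mean) / std
```

The scaled training column has zero mean by construction, while the validation column generally does not, which is expected.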

In [6]:
# Creating a dictionary for the model performances
performances = {
    'model':[],
    'Train ROC AUC':[],
    'Valid ROC AUC':[]
}

## Random Forest

In [7]:
forest_clf = RandomForestClassifier(criterion='gini',class_weight='balanced',bootstrap=True,random_state=42,n_jobs=-1)
param_dict = {
    'n_estimators':[100,500,750],
    'max_depth':[2,3,5],
    'min_samples_leaf':[2,6,8],
    'max_features':['sqrt','log2',None],
    'max_samples':[0.3,0.5,1.0]
}

In [8]:
# Running the randomized search and finding the best model
search_obj = RandomizedSearchCV(forest_clf,param_dict,n_iter=20,scoring='roc_auc',refit=True,cv=3,random_state=42,return_train_score=True,
                                verbose=10,error_score="raise",n_jobs=-1)
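
To put `n_iter=20` in context, the grid can be enumerated to see how much of it the randomized search covers. This sketch restates the grid from the cell above; only the arithmetic is new.

```python
from math import prod

param_dict = {
    'n_estimators': [100, 500, 750],
    'max_depth': [2, 3, 5],
    'min_samples_leaf': [2, 6, 8],
    'max_features': ['sqrt', 'log2', None],
    'max_samples': [0.3, 0.5, 1.0],
}
n_iter, cv = 20, 3

# Five hyperparameters with three values each
total_combinations = prod(len(values) for values in param_dict.values())
coverage = n_iter / total_combinations
total_fits = n_iter * cv

print(f'{total_combinations} combinations, {coverage:.1%} sampled, {total_fits} fits')
# 243 combinations, 8.2% sampled, 60 fits
```

The 60 fits agree with the "totalling 60 fits" line in the search log.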

In [9]:
search_obj.fit(X_train.values,y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[CV 1/3; 1/20] START max_depth=2, max_features=sqrt, max_samples=1.0, min_samples_leaf=8, n_estimators=100
[CV 2/3; 1/20] START max_depth=2, max_features=sqrt, max_samples=1.0, min_samples_leaf=8, n_estimators=100
[CV 3/3; 1/20] START max_depth=2, max_features=sqrt, max_samples=1.0, min_samples_leaf=8, n_estimators=100
[CV 1/3; 2/20] START max_depth=2, max_features=sqrt, max_samples=0.3, min_samples_leaf=8, n_estimators=100
[CV 1/3; 2/20] END max_depth=2, max_features=sqrt, max_samples=0.3, min_samples_leaf=8, n_estimators=100;, score=(train=0.988, test=0.942) total time=   3.5s
[CV 2/3; 2/20] START max_depth=2, max_features=sqrt, max_samples=0.3, min_samples_leaf=8, n_estimators=100
[CV 1/3; 1/20] END max_depth=2, max_features=sqrt, max_samples=1.0, min_samples_leaf=8, n_estimators=100;, score=(train=0.987, test=0.941) total time=   6.5s
[CV 3/3; 2/20] START max_depth=2, max_features=sqrt, max_samples=0.3, min_samples_leaf=8

In [10]:
# Getting the best model
best_model = search_obj.best_estimator_

In [11]:
# Showing best parameters
search_obj.best_params_

{'n_estimators': 750,
 'min_samples_leaf': 8,
 'max_samples': 0.3,
 'max_features': 'sqrt',
 'max_depth': 5}
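
Because the search was run with `return_train_score=True`, `search_obj.cv_results_` also holds the mean train and test scores of every sampled candidate, which can show how large the train/test gap is per configuration. A sketch, using a hypothetical helper `gap_table` and a small hand-written dict standing in for `cv_results_` (the scores in it are made up for illustration):

```python
import pandas as pd

def gap_table(cv_results):
    """Hypothetical helper: rank candidates and show train-minus-test score."""
    table = pd.DataFrame({
        'params': cv_results['params'],
        'mean_train_score': cv_results['mean_train_score'],
        'mean_test_score': cv_results['mean_test_score'],
    })
    table['gap'] = table['mean_train_score'] - table['mean_test_score']
    return table.sort_values('mean_test_score', ascending=False).reset_index(drop=True)

# Stand-in for search_obj.cv_results_; values are illustrative only
toy_results = {
    'params': [{'max_depth': 2}, {'max_depth': 5}],
    'mean_train_score': [0.95, 0.99],
    'mean_test_score': [0.93, 0.96],
}
ranked = gap_table(toy_results)
```

In the notebook itself this would be called as `gap_table(search_obj.cv_results_)`.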

In [12]:
# Making predictions
print('Predictions for Random Forest')
train_preds = best_model.predict_proba(X_train.values)[:,1]
valid_preds = best_model.predict_proba(X_valid.values)[:,1]
train_score = roc_auc_score(y_train,train_preds)
valid_score = roc_auc_score(y_valid,valid_preds)
print(f'Training ROC AUC: {train_score}')
print(f'Validation ROC AUC: {valid_score}')

Predictions for Random Forest
Training ROC AUC: 0.9884051614184051
Validation ROC AUC: 0.9763446142797685
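
When reading these numbers, it may help to recall that ROC AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half. A small pure-Python sketch of that definition on toy labels and scores follows. `pairwise_auc` is a name introduced here and is not part of scikit-learn.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Toy example: one of the four positive/negative pairs is ranked wrongly
auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```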


In [13]:
# Adding the metrics
model = 'Random Forest'
performances['model'].append(model)
performances['Train ROC AUC'].append(train_score)
performances['Valid ROC AUC'].append(valid_score)

In [14]:
# Saving the model
with open('../../models/custom-features/noisy-grammar-errors/fine-tuned/forest-fine-noisy.pkl','wb') as file:
    pickle.dump(best_model,file)

## Gradient Boosting

In [15]:
# Creating balanced class weights (one weight per class, not per sample)
sample_weights =  X_train.shape[0] / (2.0 * np.bincount(y_train.astype(int)))
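
The expression above follows the "balanced" heuristic `n_samples / (n_classes * count_per_class)`, so it yields one weight per class rather than one per row. A toy check of the arithmetic, with made-up labels:

```python
import numpy as np

y_toy = np.array([0, 0, 0, 1])                 # toy labels: three of class 0, one of class 1
counts = np.bincount(y_toy)                    # array([3, 1])
class_weights = len(y_toy) / (2.0 * counts)    # 4 / (2 * [3, 1])

print(class_weights)                           # the rarer class gets the larger weight
```

With these weights, each class contributes the same total weight (`counts * class_weights` is equal for both classes).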

In [16]:
# model
catboost_clf = CatBoostClassifier(iterations=1000,loss_function='Logloss',random_seed=42,early_stopping_rounds=10,
                                  eval_metric='AUC',class_weights=sample_weights)

# Parameter grid
param_grid = {
    'learning_rate':[0.01,0.03,0.3],
    'depth':[2,3,5],
    'l2_leaf_reg':[1,3,7],
    'min_data_in_leaf':[1,5,15]
}

In [17]:
# Performing randomized search
search_results = catboost_clf.randomized_search(param_grid,X_train.values,y_train,cv=3,n_iter=10,refit=True,shuffle=True)

0:	test: 0.9381435	best: 0.9381435 (0)	total: 185ms	remaining: 3m 4s
1:	test: 0.9388241	best: 0.9388241 (1)	total: 292ms	remaining: 2m 25s
2:	test: 0.9409334	best: 0.9409334 (2)	total: 396ms	remaining: 2m 11s
3:	test: 0.9417026	best: 0.9417026 (3)	total: 546ms	remaining: 2m 15s
4:	test: 0.9417031	best: 0.9417031 (4)	total: 739ms	remaining: 2m 27s
5:	test: 0.9443461	best: 0.9443461 (5)	total: 848ms	remaining: 2m 20s
6:	test: 0.9511140	best: 0.9511140 (6)	total: 1.04s	remaining: 2m 28s
7:	test: 0.9511157	best: 0.9511157 (7)	total: 1.16s	remaining: 2m 23s
8:	test: 0.9511914	best: 0.9511914 (8)	total: 2.15s	remaining: 3m 56s
9:	test: 0.9511869	best: 0.9511914 (8)	total: 2.25s	remaining: 3m 43s
10:	test: 0.9511890	best: 0.9511914 (8)	total: 2.41s	remaining: 3m 36s
11:	test: 0.9511978	best: 0.9511978 (11)	total: 2.53s	remaining: 3m 28s
12:	test: 0.9511886	best: 0.9511978 (11)	total: 2.63s	remaining: 3m 20s
13:	test: 0.9513660	best: 0.9513660 (13)	total: 2.79s	remaining: 3m 16s
14:	test: 0.95

In [18]:
# Checking if model is fitted
catboost_clf.is_fitted()

True

In [19]:
# Showing the model's parameters (after the refit these include the values chosen by the search)
catboost_clf.get_params()

{'iterations': 1000,
 'loss_function': 'Logloss',
 'random_seed': 42,
 'class_weights': array([0.78561644, 1.37529976]),
 'eval_metric': 'AUC',
 'early_stopping_rounds': 10,
 'min_data_in_leaf': 15,
 'depth': 5,
 'l2_leaf_reg': 3,
 'learning_rate': 0.3}

In [20]:
# Making predictions
print('Predictions for Gradient Boosting')
# Passing NumPy arrays, matching the input format used when fitting
train_preds = catboost_clf.predict_proba(X_train.values)[:,1]
valid_preds = catboost_clf.predict_proba(X_valid.values)[:,1]
train_score = roc_auc_score(y_train,train_preds)
valid_score = roc_auc_score(y_valid,valid_preds)
print(f'Training ROC AUC: {train_score}')
print(f'Validation ROC AUC: {valid_score}')

Predictions for Gradient Boosting
Training ROC AUC: 0.9999597242041011
Validation ROC AUC: 0.9715287627410495


In [21]:
# Adding the metrics
model = 'Gradient Boosting'
performances['model'].append(model)
performances['Train ROC AUC'].append(train_score)
performances['Valid ROC AUC'].append(valid_score)

In [22]:
catboost_clf.save_model('../../models/custom-features/noisy-grammar-errors/fine-tuned/catboost-noisy-fine')

In [23]:
# Printing out the model performances in a dataframe and saving it
metrics_df = pd.DataFrame.from_dict(performances)
metrics_df

| | model | Train ROC AUC | Valid ROC AUC |
|---|---|---|---|
| 0 | Random Forest | 0.988405 | 0.976345 |
| 1 | Gradient Boosting | 0.99996 | 0.971529 |
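
One way to compare the two tuned models is the gap between training and validation ROC AUC, which gives a rough indication of overfitting. The sketch below rebuilds the table from the rounded values shown above and adds a `gap` column (the column name is chosen here).

```python
import pandas as pd

metrics = pd.DataFrame({
    'model': ['Random Forest', 'Gradient Boosting'],
    'Train ROC AUC': [0.988405, 0.99996],
    'Valid ROC AUC': [0.976345, 0.971529],
})
# Train-minus-validation AUC: larger values suggest more overfitting
metrics['gap'] = metrics['Train ROC AUC'] - metrics['Valid ROC AUC']
print(metrics[['model', 'gap']])
```

On these numbers, gradient boosting has the larger gap (about 0.028 versus about 0.012) and the slightly lower validation score.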


In [25]:
# Saving the performances
metrics_df.to_csv('../../models/custom-features/noisy-grammar-errors/fine-tuned/metrics.csv',index=False)