# CatBoost with TF-IDF 3-5 n-grams and Custom Features

I found that CatBoost with TF-IDF over 3- to 5-word n-grams works well. This makes sense: longer n-grams capture whole phrases, which makes the difference between human-written and LLM-written essays clearer. Here I combine my custom features with the vectorized text to see whether the combination improves the model.
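
To make the "3-5 n-grams" idea concrete, here is a minimal sketch of how contiguous word n-grams of length 3 to 5 can be enumerated from a tokenized sentence. The helper name `word_ngrams` and the sample sentence are made up for illustration; the notebook itself relies on `TfidfVectorizer` to do this.

```python
def word_ngrams(tokens, n_min=3, n_max=5):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Hypothetical sample sentence, already lowercased and split into words
tokens = "we should abolish the electoral college".split()
grams = word_ngrams(tokens)

for gram in grams:
    print(gram)
```

Longer windows produce phrases such as "abolish the electoral college", which is the kind of column name that shows up in the vectorizer output further down.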

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import pickle
from catboost import CatBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        os.path.join(dirname, filename)
        
tqdm.pandas()

In [3]:
# Getting the data
data = pd.read_csv('../input/prepared-data-llm-competition/prepared_training_set.csv')
essays = data['essay']
labels = data['LLM_written']

In [4]:
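# Text cleaning
def text_cleaning(essay: str) -> str:
    # Remove newlines and tabs; each call works on the previous result
    # so that both replacements are applied
    cleaned_text = essay.replace('\n', "")
    cleaned_text = cleaned_text.replace("\t", "")

    return cleaned_text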

In [5]:
# Cleaning the text
essays_cleaned = data['essay'].progress_apply(text_cleaning)

100%|██████████| 49929/49929 [00:00<00:00, 163640.97it/s]


In [6]:
# Setting up the vectorizer
pattern = r'(?u)\b\w\w+\b|!|\?|\:|\;' # tokens: words of 2+ characters, plus the punctuation marks ! ? : ;
vectorizer = TfidfVectorizer(token_pattern=pattern,ngram_range=(3,5),max_df=0.85,
                            min_df=100,max_features=1000,norm=None)

# Fitting it to the essays
X = vectorizer.fit_transform(essays_cleaned)
transformed_data = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
transformed_data.head()

Unnamed: 0,able to attend,able to attend classes,able to do,able to get,able to have,able to use,abolish the electoral,abolish the electoral college,abolishing the electoral,abolishing the electoral college,...,you for your time,you get to,you have to,you need to,you should join,you to support,you want to,you will be,you will have,young people are
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,4.173458,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
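
The token pattern above is easier to read with an example. The sketch below applies the same regular expression with `re.findall` to a made-up sentence, lowercased first because `TfidfVectorizer` lowercases by default. The sample text is an assumption for illustration only.

```python
import re

# Same pattern as in the vectorizer cell
pattern = r'(?u)\b\w\w+\b|!|\?|\:|\;'

# Hypothetical sample sentence
sample = "Wait, is a car really needed? Yes; I think so!"

# Words of two or more characters are kept, single letters and commas
# are dropped, and ! ? : ; are kept as tokens of their own
tokens = re.findall(pattern, sample.lower())
print(tokens)
```

Because the punctuation marks survive as tokens, they can also take part in the 3-5 n-grams.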

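Since the vectorizer uses `norm=None`, the non-zero entries in the table (for example 4.173458) are raw count-times-IDF products rather than length-normalized weights. A minimal sketch of that arithmetic, assuming scikit-learn's default smoothed IDF (`smooth_idf=True`, `sublinear_tf=False`); the function names and the example document frequency are hypothetical.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF as used by scikit-learn by default: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


def tfidf_unnormalized(term_count, n_docs, doc_freq):
    """TF-IDF value without row normalization (norm=None)."""
    return term_count * smooth_idf(n_docs, doc_freq)


# 49929 essays as in the progress bar above; doc_freq=100 is a made-up
# value equal to the min_df cutoff, i.e. the rarest n-gram that is kept
rare_value = tfidf_unnormalized(term_count=1, n_docs=49929, doc_freq=100)
common_value = tfidf_unnormalized(term_count=1, n_docs=49929, doc_freq=20000)
print(rare_value, common_value)
```

Rarer n-grams get larger values, and an n-gram that occurs twice in an essay gets twice the value.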

In [8]:
combined_data = pd.concat([data,transformed_data],axis=1)
combined_data.head()

Unnamed: 0,row_id,essay,word_count,LLM_written,stop_word_count,stop_word_ratio,unique_word_count,unique_word_ratio,count_question,count_exclamation,...,you for your time,you get to,you have to,you need to,you should join,you to support,you want to,you will be,you will have,young people are
0,1,"Dear State Senator,\n\nI'm writting to you tod...",291,1,137,0.47079,131,0.450172,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,"Uh, hi! So, like, summers are, like, awesome r...",311,1,137,0.440514,121,0.389068,3,4,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,"When peoples ask for advices, they sometimes t...",333,1,158,0.474474,155,0.465465,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,4.173458,0.0,0.0,0.0
3,4,I think art edukation is super impotent for ki...,308,1,121,0.392857,130,0.422078,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,I think we should totally switch to renewable ...,307,1,138,0.449511,146,0.47557,0,2,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
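
One thing worth checking with `pd.concat(..., axis=1)` is that it aligns rows by index label, not by position. Here both frames carry a default 0..n-1 index, so the rows line up, but a filtered or re-indexed `data` frame would not. A small sketch with made-up frames (`features` and `tfidf` are hypothetical names):

```python
import pandas as pd

# Custom features whose index no longer starts at 0 (e.g. after filtering)
features = pd.DataFrame({"word_count": [291, 311, 333]}, index=[10, 11, 12])

# TF-IDF frame built from an array, so it has the default index 0, 1, 2
tfidf = pd.DataFrame({"you want to": [0.0, 0.0, 4.17]})

# Index labels do not overlap: the result has 6 rows padded with NaN
misaligned = pd.concat([features, tfidf], axis=1)

# Resetting the index first restores row-by-row alignment
aligned = pd.concat([features.reset_index(drop=True), tfidf], axis=1)

print(misaligned.shape, aligned.shape)
```

A quick `combined_data.isna().sum().sum()` after the concat is a cheap way to confirm nothing was misaligned.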


In [9]:
# Splitting data into X and y
X = combined_data.drop(['row_id','essay','LLM_written'],axis=1).values
y = combined_data['LLM_written'].values

In [12]:
# Building model
catboost_clf = CatBoostClassifier(iterations=6000,learning_rate=0.03,loss_function='Logloss',
                                  random_seed=42,task_type='GPU')
catboost_clf.fit(X,y,verbose=100)

0:	learn: 0.6372789	total: 85.2ms	remaining: 8m 30s
100:	learn: 0.0876930	total: 4.34s	remaining: 4m 13s
200:	learn: 0.0705795	total: 8.39s	remaining: 4m 2s
300:	learn: 0.0628254	total: 12.4s	remaining: 3m 54s
400:	learn: 0.0584043	total: 16.3s	remaining: 3m 48s
500:	learn: 0.0547688	total: 20.3s	remaining: 3m 42s
600:	learn: 0.0522529	total: 24.2s	remaining: 3m 37s
700:	learn: 0.0507818	total: 28.2s	remaining: 3m 32s
800:	learn: 0.0490042	total: 32.2s	remaining: 3m 29s
900:	learn: 0.0474822	total: 36.2s	remaining: 3m 24s
1000:	learn: 0.0459604	total: 40.1s	remaining: 3m 20s
1100:	learn: 0.0445719	total: 44s	remaining: 3m 15s
1200:	learn: 0.0431273	total: 47.9s	remaining: 3m 11s
1300:	learn: 0.0420768	total: 51.8s	remaining: 3m 7s
1400:	learn: 0.0413320	total: 55.7s	remaining: 3m 2s
1500:	learn: 0.0406090	total: 59.7s	remaining: 2m 58s
1600:	learn: 0.0397566	total: 1m 3s	remaining: 2m 55s
1700:	learn: 0.0391005	total: 1m 7s	remaining: 2m 51s
1800:	learn: 0.0383604	total: 1m 11s	remaini

<catboost.core.CatBoostClassifier at 0x7e0e92e39090>

In [13]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = catboost_clf.predict_proba(X)[:,1]
roc_auc_score(y,predictions)

ROC AUC on Training Set:


0.9995609722115035
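
A training-set ROC AUC of 0.99956 only shows that the model fits the data it has already seen. To estimate how it generalizes, part of the data can be held out before fitting. The sketch below shows the pattern on synthetic data with a `LogisticRegression` stand-in so that it runs without a GPU; it is not the CatBoost model from this notebook, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=42)

# Hold out 20% of the rows, keeping the class balance in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Stand-in classifier; the notebook would fit CatBoostClassifier here instead
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
valid_auc = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
print(f"train AUC: {train_auc:.4f}  validation AUC: {valid_auc:.4f}")
```

For the essays, the vectorizer would also need to be fit on the training split only, since fitting it on all rows lets the held-out essays influence the vocabulary and IDF weights.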

In [14]:
# Saving the model
catboost_clf.save_model('catboost_clf-custom-and-tfidf')

In [15]:
# Saving the vectorizer for later use
with open('vectorizer-3-5-submit.pk','wb') as file:
    pickle.dump(vectorizer,file)