# Term Frequency Inverse Document Frequency (TFIDF)

The models trained on my hand-built feature set overfit heavily and hit a "performance cap" of 0.77. To climb the leaderboard, I will likely need to move away from those features and work with the actual words in the essays.

Many high-scoring solutions use TFIDF. The TFIDF of a word in a document is:  

**TFIDF_word = Term Frequency * log (number of documents / document frequency)**  

where term frequency = (number of times the word appears in the document) / (total words in the document), and document frequency = the number of documents that contain the word. 
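
To check my understanding of the formula, here is a minimal sketch that computes it by hand. The three-document corpus and the `tf_idf` helper are made up for illustration, and I am assuming the natural log, since the formula above does not fix the base.

```python
import math

# Toy corpus (hypothetical, not taken from the training set)
docs = [
    "art helps kids think",
    "kids like art and art class",
    "school is fun",
]

def tf_idf(word, doc, corpus):
    """TFIDF of `word` in `doc`, following the formula in the text."""
    tokens = doc.split()
    # Term frequency: occurrences of the word / total words in the document
    tf = tokens.count(word) / len(tokens)
    # Document frequency: number of documents that contain the word
    df = sum(1 for d in corpus if word in d.split())
    if df == 0:
        return 0.0  # the word never occurs, so there is nothing to weight
    return tf * math.log(len(corpus) / df)

# "art" appears 2 times out of 6 words in docs[1] and occurs in 2 of 3 documents
score_art = tf_idf("art", docs[1], docs)
# "fun" appears in only one document, so it gets a larger IDF
score_fun = tf_idf("fun", docs[2], docs)
print(score_art, score_fun)
```

A word that shows up in every document gets log(1) = 0 under this formula, which is why very common words contribute nothing.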

TFIDF appears to be a powerful tool for NLP problems and a quick way to get a working solution. I am noting this as a lesson for future projects. 

In [31]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score
import pickle

%matplotlib inline

In [3]:
# Getting the essays and labels (reading the file only once)
training_set = pd.read_csv('../prepared_training_set.csv')
essays = training_set['essay'].tolist()
labels = training_set['LLM_written'].values
essays[3]

'I think art edukation is super impotent for kids. Some peoples might say its not that impotent but I disagree. Arts helps kids with theyre imagination and creativity. Like for example when we do art projects in skool it helps me think outside the box and come up with new ideas. Also it helps with theyre self esteem and confidence. When we perform in frunt of the skool or in a play it helps us not be affraid to speak in frunt of peoples. Also art is a way to express our selfs and show how we feel. Like when Im feeling sad or mad I can draw a picshure that shows how I feel and it helps me feel beter.\n\nAnother thing that sucks is that some skools are cutting art programs cuz they think its not that impotent. This is super wrong cuz art is a way for kids to express theyre selfs and it helps them develop theyre brain. It also helps them be more creative and have better imagination. Without art edukation kids will be bornt and not have as much fun in skool.\n\nI also think that art edukat

In [4]:
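# Setting up the vectorizer: word 1- to 3-grams, no row normalization,
# at most 500 features, each found in at least 100 essays and at most 80% of essays
vectorizer = TfidfVectorizer(ngram_range=(1, 3), norm=None, max_features=500, min_df=100, max_df=0.8)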

# Fitting it to the essays
X = vectorizer.fit_transform(essays)

In [5]:
# Storing the data into a dataframe
transformed_data = pd.DataFrame(X.toarray(),columns=vectorizer.get_feature_names_out())
transformed_data.head()

Unnamed: 0,able,able to,about,about the,activities,activity,advice,after,air,all,all the,also,always,am,an,and it,and that,and the,and they,animals,another,any,are not,around,article,as,as the,as well,ask,at,at home,at the,attention,author,average,away,back,bad,based,be able,...,voters,votes,want,want to,was,way,way to,ways,we,we can,we should,well,were,what,what they,when,when you,where,which,while,who,why,will,will be,with,with the,without,work,working,world,would,would be,year,years,you,you are,you can,you have,young,your
0,0.0,0.0,4.95527,2.675781,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.445918,2.638084,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.44025,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.789292,2.859709,0.0,8.951207,0.0,6.544805,0.0,2.340838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.585469,0.0,0.0,0.0,2.497354,0.0,0.0,0.0,0.0,0.0,3.192727,2.225022,0.0,0.0,3.25424,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,6.607026,0.0,0.0,0.0,0.0,0.0,0.0,3.130157,0.0,1.445918,0.0,0.0,0.0,0.0,0.0,0.0,2.867196,7.700808,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.231589,0.0,0.0,7.578584,2.859709,0.0,16.112173,6.086583,0.0,0.0,0.0,1.807921,0.0,0.0,0.0,2.384413,0.0,0.0,1.861823,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.596363,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.651757,0.0,0.0,0.0,16.822185,0.0,0.0,1.565078,2.908467,1.445918,0.0,0.0,0.0,0.0,0.0,0.0,2.867196,0.0,2.220125,0.0,0.0,0.0,0.0,1.364978,0.0,0.0,24.272324,1.670572,0.0,2.83406,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,2.231589,2.476658,0.0,1.894646,0.0,0.0,0.0,0.0,0.0,2.259636,0.0,1.807921,0.0,1.650873,0.0,0.0,0.0,0.0,3.723646,4.181465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.542404,0.0,8.323337,0.0,0.0,16.082399
3,0.0,0.0,1.651757,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.675507,0.0,0.0,0.0,10.275265,0.0,0.0,0.0,0.0,2.220125,0.0,0.0,0.0,0.0,1.364978,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,5.683938,2.859709,0.0,7.160966,0.0,0.0,2.259636,0.0,0.0,0.0,6.603492,0.0,0.0,0.0,0.0,0.0,0.0,1.614378,2.610741,6.243385,0.0,2.525812,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,1.651757,0.0,0.0,0.0,0.0,0.0,6.825125,3.130157,2.908467,1.445918,0.0,0.0,0.0,2.568816,0.0,0.0,0.0,0.0,4.44025,0.0,0.0,2.54328,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9.00235,0.0,0.0,...,0.0,0.0,2.231589,2.476658,0.0,1.894646,0.0,2.87657,16.112173,3.043291,16.362012,0.0,0.0,0.0,0.0,3.301746,0.0,0.0,3.774841,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.637163,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
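
The values in this table are larger than the formula at the top would give, and within a column they look like integer multiples of one number (for `about`: 1.651757, 4.95527, 6.607026). That matches scikit-learn's documented defaults: `TfidfVectorizer` uses the raw count as term frequency and, with `smooth_idf=True`, computes IDF as ln((1 + n) / (1 + df)) + 1. With `norm=None` the product is left unscaled. The sketch below checks this on a made-up three-essay corpus; the toy essays and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy essays (hypothetical, not taken from the training set)
toy_essays = [
    "we should keep art",
    "we like art and we like school",
    "school is fun",
]

# Same normalization setting as the notebook, but default unigrams and no
# min_df/max_df filtering, since those would empty such a tiny vocabulary
toy_vectorizer = TfidfVectorizer(norm=None)
toy_matrix = toy_vectorizer.fit_transform(toy_essays).toarray()
we_column = toy_vectorizer.vocabulary_["we"]

# "we" occurs in 2 of the 3 essays, and twice in the second essay
n_docs = len(toy_essays)
df_we = 2
idf_we = np.log((1 + n_docs) / (1 + df_we)) + 1
expected_value = 2 * idf_we  # raw count * smoothed IDF

actual_value = toy_matrix[1, we_column]
print(expected_value, actual_value)
```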


In [6]:
# Saving the vectorizer for later use
with open('../vectorizer.pk','wb') as file:
    pickle.dump(vectorizer,file)
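
For later use, the saved vectorizer has to be loaded and applied with `transform` (not `fit_transform`) so that new essays are mapped onto the same columns. A minimal sketch of that round trip, assuming a toy vectorizer and a temporary file rather than the real `../vectorizer.pk`:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the fitted vectorizer (hypothetical data)
train_texts = ["art is fun", "school is fun", "art class at school"]
fitted_vectorizer = TfidfVectorizer(norm=None).fit(train_texts)

# Save to a temporary path so this sketch does not touch the real file
demo_path = os.path.join(tempfile.mkdtemp(), "vectorizer_demo.pk")
with open(demo_path, "wb") as file:
    pickle.dump(fitted_vectorizer, file)

# Load it back and transform unseen text with the stored vocabulary and IDF
with open(demo_path, "rb") as file:
    loaded_vectorizer = pickle.load(file)

new_texts = ["art is great fun"]
new_features = loaded_vectorizer.transform(new_texts)

# Words not seen during fitting ("great") are ignored, so the column count
# always equals the size of the fitted vocabulary
print(new_features.shape)
```

Pickles are generally only safe to load from a trusted source and with a compatible scikit-learn version.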

## Models

Comparing how different models perform on this feature set.

### Logistic Regression

In [9]:
# Building the model
log_reg = LogisticRegression(penalty='l2',random_state=42,max_iter=900,C=0.5)
log_reg.fit(transformed_data.values,labels)

In [11]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = log_reg.predict_proba(transformed_data.values)[:,1]
roc_auc_score(labels,predictions)

ROC AUC on Training Set:


0.99781706635762

In [12]:
# Cross-validation
# This shows how well the model generalizes within the training set distribution,
# which gives me an idea of how well it will do on unseen data
cross_val_scores = pd.DataFrame(cross_validate(LogisticRegression(penalty='l2',random_state=42,max_iter=900,C=0.5),
                                transformed_data.values,labels,scoring='roc_auc',cv=5,return_train_score=True))
cross_val_scores[['train_score','test_score']].describe()

| | train_score | test_score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 0.998733 | 0.867 |
| std | 0.000516 | 0.077472 |
| min | 0.998031 | 0.802032 |
| 25% | 0.998391 | 0.815302 |
| 50% | 0.998869 | 0.816038 |
| 75% | 0.999083 | 0.935529 |
| max | 0.999293 | 0.966099 |
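
One caveat about this cross-validation: the vectorizer was fit on all essays before the folds were split, so each held-out fold already contributed to the vocabulary and IDF weights. I have not measured how much that matters here. A possible way to rule it out is to put the vectorizer and the model in one scikit-learn `Pipeline`, so the vectorizer is refit on the training part of every fold. The sketch below uses made-up toy essays and drops `min_df`/`max_df`, which would not work on so few documents. The shuffled `StratifiedKFold` is also my assumption, not something the notebook does.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): 10 "human" and 10 "LLM" essays
toy_essays = [f"i think art is fun number {i}" for i in range(10)] + [
    f"in conclusion art is important number {i}" for i in range(10)
]
toy_labels = [0] * 10 + [1] * 10

# The vectorizer sits inside the pipeline, so it only ever sees the
# training portion of each fold
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3), norm=None, max_features=500),
    LogisticRegression(C=0.5, max_iter=900, random_state=42),
)

# Shuffling is an assumption here, in case the file is ordered in some way
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    pipeline, toy_essays, toy_labels,
    scoring="roc_auc", cv=folds, return_train_score=True,
)
print(cv_results["test_score"])
```

On the real data, `toy_essays` and `toy_labels` would be replaced by `essays` and `labels`, with the original `min_df` and `max_df` restored.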


In [14]:
# Saving the model
with open('../models/tfidf-trained models/log_reg.pkl','wb') as file:
    pickle.dump(log_reg,file)

### Decision Tree

In [15]:
# Building the model
d_tree = DecisionTreeClassifier(criterion='gini',min_samples_leaf=20)
d_tree.fit(transformed_data.values,labels)

In [16]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = d_tree.predict_proba(transformed_data.values)[:,1]
roc_auc_score(labels,predictions)

ROC AUC on Training Set:


0.9850713017756577

In [17]:
# Cross-validation
# This shows how well the model generalizes within the training set distribution,
# which gives me an idea of how well it will do on unseen data
cross_val_scores = pd.DataFrame(cross_validate(DecisionTreeClassifier(criterion='gini',min_samples_leaf=20),
                                transformed_data.values,labels,scoring='roc_auc',cv=5,return_train_score=True))
cross_val_scores[['train_score','test_score']].describe()

| | train_score | test_score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 0.985957 | 0.872913 |
| std | 0.001984 | 0.050421 |
| min | 0.983173 | 0.817527 |
| 25% | 0.984626 | 0.829144 |
| 50% | 0.986838 | 0.873998 |
| 75% | 0.987248 | 0.90824 |
| max | 0.987901 | 0.935655 |


In [18]:
# Saving the model
with open('../models/tfidf-trained models/d_tree.pkl','wb') as file:
    pickle.dump(d_tree,file)

### Random Forest

In [19]:
# Building the model
random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(transformed_data.values,labels)

In [20]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = random_forest.predict_proba(transformed_data.values)[:,1]
roc_auc_score(labels,predictions)

ROC AUC on Training Set:


1.0

In [21]:
# Cross-validation
# This shows how well the model generalizes within the training set distribution,
# which gives me an idea of how well it will do on unseen data
cross_val_scores = pd.DataFrame(cross_validate(RandomForestClassifier(random_state=42),
                                transformed_data.values,labels,scoring='roc_auc',cv=5,return_train_score=True))
cross_val_scores[['train_score','test_score']].describe()

| | train_score | test_score |
|---|---|---|
| count | 5.0 | 5.0 |
| mean | 1.0 | 0.94864 |
| std | 0.0 | 0.040026 |
| min | 1.0 | 0.887465 |
| 25% | 1.0 | 0.934532 |
| 50% | 1.0 | 0.962802 |
| 75% | 1.0 | 0.965085 |
| max | 1.0 | 0.993317 |


In [22]:
# Saving the model
with open('../models/tfidf-trained models/random_forest.pkl','wb') as file:
    pickle.dump(random_forest,file)

### CatBoost/Gradient Boosting

In [24]:
# Building the model
catboost_clf = CatBoostClassifier(iterations=100,learning_rate=0.03,loss_function='Logloss',random_seed=42)
catboost_clf.fit(transformed_data.values,labels)

0:	learn: 0.6614737	total: 278ms	remaining: 27.6s
1:	learn: 0.6312344	total: 509ms	remaining: 24.9s
2:	learn: 0.6032792	total: 683ms	remaining: 22.1s
3:	learn: 0.5774462	total: 930ms	remaining: 22.3s
4:	learn: 0.5543001	total: 1.25s	remaining: 23.7s
5:	learn: 0.5329778	total: 1.5s	remaining: 23.5s
6:	learn: 0.5142026	total: 1.67s	remaining: 22.2s
7:	learn: 0.4963277	total: 1.86s	remaining: 21.4s
8:	learn: 0.4796403	total: 2.1s	remaining: 21.2s
9:	learn: 0.4643784	total: 2.31s	remaining: 20.8s
10:	learn: 0.4505088	total: 2.5s	remaining: 20.2s
11:	learn: 0.4370316	total: 2.7s	remaining: 19.8s
12:	learn: 0.4256112	total: 2.9s	remaining: 19.4s
13:	learn: 0.4142865	total: 3.1s	remaining: 19s
14:	learn: 0.4040846	total: 3.32s	remaining: 18.8s
15:	learn: 0.3948493	total: 3.47s	remaining: 18.2s
16:	learn: 0.3862387	total: 3.64s	remaining: 17.8s
17:	learn: 0.3772845	total: 3.82s	remaining: 17.4s
18:	learn: 0.3693770	total: 4.03s	remaining: 17.2s
19:	learn: 0.3619550	total: 4.24s	remaining: 16.9

<catboost.core.CatBoostClassifier at 0x166420b20>

In [25]:
# Making predictions on training data and evaluating
print('ROC AUC on Training Set:')
predictions = catboost_clf.predict_proba(transformed_data.values)[:,1]
roc_auc_score(labels,predictions)

ROC AUC on Training Set:


0.9879435864950069

In [32]:
# Saving the model
catboost_clf.save_model('../models/tfidf-trained models/catboost_clf')

Takeaways:

TFIDF did a little better than my hand-built features, but the models did not improve by much: only by 0.02. What if I try combining my features with the TFIDF features? What results will I get? What if I increase the number of TFIDF features from 500 to 750? What if I use n-grams of length 3 to 5, as many notebooks do? These are worthwhile experiments to run alongside deep learning models.
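
For the first experiment, one way to combine the two feature sets is to stack them column-wise with `scipy.sparse.hstack`. This is only a sketch: the toy essays and the two handcrafted features (word count and average word length) are placeholders, not the features I actually built.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy essays (hypothetical, not taken from the training set)
toy_essays = [
    "i think art is fun",
    "in conclusion art education is important for students",
    "skool is fun cuz we draw",
]

# Placeholder handcrafted features: word count and average word length
handcrafted = np.array([
    [len(text.split()), np.mean([len(word) for word in text.split()])]
    for text in toy_essays
])

# TFIDF features with the larger feature budget from the second experiment
tfidf_features = TfidfVectorizer(
    ngram_range=(1, 3), norm=None, max_features=750
).fit_transform(toy_essays)

# Stack column-wise: TFIDF columns first, then the handcrafted ones
combined = hstack([tfidf_features, csr_matrix(handcrafted)]).tocsr()
print(combined.shape)
```

Since the TFIDF values are unnormalized and the handcrafted features may sit on a different scale, scaling could matter for logistic regression. Tree-based models should be less sensitive to it.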