## Dragon Riders

### Feature generation and Classical ML

Part of Grigoriy Kryukov

In [199]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

In [201]:
# upload data
df = pd.read_csv("train.csv")

In [202]:
df.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5


In [203]:
df['full_text'][0]

"I think that students would benefit from learning at home,because they wont have to change and get up early in the morning to shower and do there hair. taking only classes helps them because at there house they'll be pay more attention. they will be comfortable at home.\n\nThe hardest part of school is getting ready. you wake up go brush your teeth and go to your closet and look at your cloths. after you think you picked a outfit u go look in the mirror and youll either not like it or you look and see a stain. Then you'll have to change. with the online classes you can wear anything and stay home and you wont need to stress about what to wear.\n\nmost students usually take showers before school. they either take it before they sleep or when they wake up. some students do both to smell good. that causes them do miss the bus and effects on there lesson time cause they come late to school. when u have online classes u wont need to miss lessons cause you can get everything set up and go t

I have no doubt that advanced models, such as DeBERTaV3, will cope with the task much better than the methods of classical ML. However, our course is ML, not DL. Therefore, at least one person from the team should show the power of classical methods.

To apply classical methods, we need features. I deliberately refuse pre-irradiated embeddings and try to show imagination and knowledge of computational linguistics.

Let's list a list of features that could potentially be useful.

- $\textbf{The number of tokens}$. To count them, we use the nltk.tokenize.word_tokenize. There is a hypothesis that the better a person knows the language, the more voluminous an essay he can write.
- $\textbf{The number of words and ratio words/tokens}$. Clearing tokens from punctuation. Perhaps a more literate person will use more punctuation marks.
- $\textbf{The number and share of words not included in the list nltk.corpus.stopwords}$. (Frequently used words that do not carry a semantic load). There are two hypotheses here. On the one hand, the less a person uses stop words, the more diverse and meaningful speech. On the other hand, a person who does not know English well may forget about the stop words "a", "the" and the like.
- $\textbf{The average and median length of words}$. Perhaps the longer the words, the more complex they are and the better the person using them knows English.
- $\textbf{The number of sentences}$. There is a hypothesis that the better a person knows the language, the more sentences he can leave.
- $\textbf{The average and median length of sentences}$. There are two hypotheses here. On the one hand, a person who does not know English well is able to compose only short sentences. On the one hand, a person who does not know English well may compose sentences that are too long with a large number of words. Which is often not natural for English.
- $\textbf{The number of uses of a part of speech}$. The hypothesis is that in the letter of a person who does not know English well, there will be many prepositions (for example, 'of'). Conversely, an expert in English will use rare parts of speech, such as adverbs.

In [75]:
# example of stop words
stop_words = sorted(stopwords.words('english'))
stop_words[::20]

['a', 'before', "don't", 'here', 'm', 'on', 'shouldn', 'through', 'who']

In [76]:
# set of parts of speech

tags = []

for i in tqdm(range(N)):
    tokens = word_tokenize(df['full_text'][i])
    clean_words = [w.strip(punct) for w in tokens if w.strip(punct)]
    pos = pos_tag(clean_words, tagset = "universal")
    for j in range(len(clean_words)):
        tags.append(pos[j][1])

tags = list(set(tags))

100%|██████████| 3911/3911 [01:30<00:00, 43.14it/s]


In [77]:
tags

['PRON',
 'NOUN',
 'NUM',
 'ADV',
 'CONJ',
 'VERB',
 'DET',
 'X',
 'ADJ',
 'ADP',
 '.',
 'PRT']

In [78]:
N = len(df)

punct = '!"#$%&()*\+,-\./:;<=>?@\[\]^_`{|}~„“«»†*\—/\-–‘’'
stop_words = sorted(stopwords.words('english'))

for tag in tags:
    df[tag] = np.zeros(N)

for i in tqdm(range(N)):
    tokens = word_tokenize(df['full_text'][i])
    df.loc[i, "num_tokens"] = len(tokens)
    clean_words = [w.strip(punct) for w in tokens if w.strip(punct)]
    df.loc[i, "num_punct"] = len(clean_words)
    df.loc[i, "share_punct"] = 1 - len(clean_words) / len(tokens)
    good_words = [w for w in clean_words if w not in stop_words]
    df.loc[i, "num_nonstop"] = len(good_words)
    df.loc[i, "share_nonstop"] = len(good_words) / len(clean_words)
    arrlen = np.array([len(w) for w in good_words])
    df.loc[i, "avglen"] = arrlen.mean()
    df.loc[i, "medlen"] = np.median(arrlen)
    sents = sent_tokenize(df['full_text'][i])
    df.loc[i, "num_sents"] = len(sents)
    sentlen = np.array([len(word_tokenize(s)) for s in sents])
    df.loc[i, "avgsentlen"] = sentlen.mean()
    df.loc[i, "medsentlen"] = np.median(sentlen)
    pos = pos_tag(clean_words, tagset = "universal")
    for p in pos:
        if p[1] in tags:
            df.loc[i, p[1]] += 1.

100%|██████████| 3911/3911 [08:17<00:00,  7.87it/s]


In [79]:
df.head()

Unnamed: 0,text_id,full_text,cohesion,syntax,vocabulary,phraseology,grammar,conventions,PRON,NOUN,...,num_tokens,num_punct,share_punct,num_nonstop,share_nonstop,avglen,medlen,num_sents,avgsentlen,medsentlen
0,0016926B079C,I think that students would benefit from learn...,3.5,3.5,3.0,3.0,4.0,3.0,33.0,47.0,...,283.0,264.0,0.067138,134.0,0.507576,5.11194,5.0,18.0,15.722222,14.5
1,0022683E9EA5,When a problem is a change you have to let it ...,2.5,2.5,3.0,2.0,2.0,2.5,63.0,89.0,...,554.0,536.0,0.032491,226.0,0.421642,5.376106,5.0,14.0,39.571429,25.0
2,00299B378633,"Dear, Principal\n\nIf u change the school poli...",3.0,3.5,3.0,3.0,3.0,2.5,34.0,70.0,...,356.0,330.0,0.073034,165.0,0.5,4.769697,4.0,19.0,18.736842,19.0
3,003885A45F42,The best time in life is when you become yours...,4.5,4.5,4.5,4.5,4.0,5.0,138.0,93.0,...,836.0,759.0,0.092105,352.0,0.463768,5.073864,5.0,36.0,23.222222,22.5
4,0049B1DF5CCC,Small act of kindness can impact in other peop...,2.5,3.0,3.0,3.0,2.5,2.5,31.0,59.0,...,237.0,234.0,0.012658,112.0,0.478632,5.4375,6.0,3.0,79.0,42.0


The data is ready. Now we split the sample into train and test to evaluate the quality of the model.

In [141]:
train, test = train_test_split(df, train_size = 0.8, random_state=42)

In [142]:
train.columns[8:]

Index(['PRON', 'NOUN', 'NUM', 'ADV', 'CONJ', 'VERB', 'DET', 'X', 'ADJ', 'ADP',
       '.', 'PRT', 'num_tokens', 'num_punct', 'share_punct', 'num_nonstop',
       'share_nonstop', 'avglen', 'medlen', 'num_sents', 'avgsentlen',
       'medsentlen'],
      dtype='object')

Now we implement a function that takes machine learning models as input, trains on a training sample and returns RMSE for a test sample.

Since we have six target metrics, the RMSE will be calculated for each of them.

In [162]:
X_train = train[list(train.columns)[8:]]
X_test = test[list(test.columns)[8:]]
y_col = list(train.columns[2:8])

def estimate_RMSE(main_model):
    
    models = []
    
    for col in y_col:
        model = main_model
        model.fit(X_train, train[col])
        models.append(model)

    preds = []

    for model in models:
        preds.append(model.predict(X_test))
        
    RMSE = np.zeros(len(preds))
    for i in range(len(preds)):
        RMSE[i] = MSE(test[y_col[i]], preds[i], squared=False)
        
    return RMSE

Let's try to apply different models.

In [168]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor

In [186]:
main_models = [LinearRegression(), Ridge(), Lasso(), SVR(), KNeighborsRegressor(),
         DecisionTreeRegressor(random_state=42), RandomForestRegressor(random_state=42),
         AdaBoostRegressor(random_state=42), XGBRegressor(random_state=42)]

scores = np.zeros((len(main_models), len(y_col)))
for i, model in enumerate(main_models):
    scores[i] = estimate_RMSE(model)
    
scores.round(4)

array([[0.5835, 0.5805, 0.5378, 0.6022, 0.6515, 0.5989],
       [0.5831, 0.5806, 0.5378, 0.6021, 0.6515, 0.5993],
       [0.6159, 0.6133, 0.5561, 0.6301, 0.6806, 0.6457],
       [0.5927, 0.5878, 0.5431, 0.6099, 0.6563, 0.6194],
       [0.6373, 0.6608, 0.5887, 0.6555, 0.7058, 0.6681],
       [0.8541, 0.8331, 0.8301, 0.8588, 0.8826, 0.8725],
       [0.5876, 0.5824, 0.5376, 0.6128, 0.6534, 0.615 ],
       [0.5927, 0.5987, 0.5384, 0.6182, 0.6704, 0.6278],
       [0.6326, 0.621 , 0.5838, 0.6482, 0.6877, 0.6442]])

The best average result is shown by linear regression.

In [187]:
MCRMSE = scores.mean(axis=1)
print(MCRMSE.round(3))
print(main_models[np.argmin(MCRMSE)])

[0.592 0.592 0.624 0.602 0.653 0.855 0.598 0.608 0.636]
LinearRegression()


However, we can use different models to predict different metrics!

Let's find out which models did a better job of predicting different variables.

In [191]:
print("Best models:")
print()
for i in range(len(y_col)):
    ind_best = np.argmin(scores, axis=0)
    print(y_col[i], main_models[ind_best[i]])
    np.argmin(scores, axis=0)

Best models:

cohesion Ridge()
syntax LinearRegression()
vocabulary RandomForestRegressor(random_state=42)
phraseology Ridge()
grammar LinearRegression()
conventions LinearRegression()


We use these models.

On the internal test, a score of $0.59$ was obtained.

The score on the kaggle public leaderboard was higher. $0.56$.

In [198]:
np.min(scores, axis=0).mean()

0.5922862069064257

![score1](./score1.png)

This is only $0.13$ behind the first place.

We can say that we were able to get a good intermediate result without using embeddings and neural networks!