# Automated Essay Scoring
Machine Learning Challenge by [Kaggle](https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2/overview)

In [21]:
import pandas as pd


In [22]:
df = pd.read_csv('data/train.csv')

In [23]:
df.head()

|   | essay_id | full_text | score |
|---|---|---|---|
| 0 | 000d118 | Many people have car where they live. The thin... | 3 |
| 1 | 000fe60 | I am a scientist at NASA that is discussing th... | 3 |
| 2 | 001ab80 | People always wish they had the same technolog... | 4 |
| 3 | 001bdc0 | We all heard about Venus, the planet without a... | 4 |
| 4 | 002ba53 | Dear, State Senator\n\nThis is a letter to arg... | 3 |


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17307 entries, 0 to 17306
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   essay_id   17307 non-null  object
 1   full_text  17307 non-null  object
 2   score      17307 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 405.8+ KB


In [25]:
df.describe()

|   | score |
|---|---|
| count | 17307.0 |
| mean | 2.948402 |
| std | 1.044899 |
| min | 1.0 |
| 25% | 2.0 |
| 50% | 3.0 |
| 75% | 4.0 |
| max | 6.0 |


In [26]:
df.isna().sum()

essay_id     0
full_text    0
score        0
dtype: int64

In [27]:
df['score'].value_counts()

score
3    6280
2    4723
4    3926
1    1252
5     970
6     156
Name: count, dtype: int64

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='score'), df['score'], test_size=0.2, random_state=42)

## Evaluation using the Quadratic Weighted Kappa
The quadratic weighted kappa (QWK) measures agreement between the predicted scores and the true scores, penalizing each disagreement by the squared distance between the two ratings. It ranges from -1 to 1. A score of 1 indicates perfect agreement, a score of 0 indicates agreement no better than chance, and negative scores indicate agreement worse than chance.
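Written out, the standard definition is as follows, where $O_{ij}$ is the number of essays with true score $i$ and predicted score $j$, $E_{ij}$ is the count expected if true and predicted scores were independent (the outer product of the two score histograms divided by the number of essays), and $N$ is the number of distinct ratings. The symbol names are chosen here for illustration; this is the quantity the function below computes.

$$
w_{ij} = \frac{(i - j)^2}{(N - 1)^2}, \qquad
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
$$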

In [29]:
import numpy as np
from sklearn.metrics import confusion_matrix

def quadratic_weighted_kappa(y_true, y_pred, min_rating=None, max_rating=None):
    """
    Computes the quadratic weighted kappa between two vectors of integer ratings.
    If not given, the rating range is inferred from the data.
    """
    if min_rating is None:
        min_rating = min(min(y_true), min(y_pred))
    if max_rating is None:
        max_rating = max(max(y_true), max(y_pred))

    # Observed counts: rows are true ratings, columns are predicted ratings.
    conf_mat = confusion_matrix(y_true, y_pred, labels=list(range(min_rating, max_rating + 1)))
    num_ratings = len(conf_mat)
    num_scored_items = float(len(y_true))

    # Per-rating histograms of the true and predicted scores.
    hist_true = np.histogram(y_true, bins=np.arange(min_rating, max_rating + 2))[0]
    hist_pred = np.histogram(y_pred, bins=np.arange(min_rating, max_rating + 2))[0]

    # Counts expected if true and predicted ratings were independent.
    expected_mat = np.outer(hist_true, hist_pred) / num_scored_items

    # Quadratic penalty: grows with the squared distance between ratings.
    weight_mat = np.zeros((num_ratings, num_ratings))
    for i in range(num_ratings):
        for j in range(num_ratings):
            weight_mat[i, j] = ((i - j) ** 2) / ((num_ratings - 1) ** 2)

    kappa = 1.0 - (np.sum(weight_mat * conf_mat) / np.sum(weight_mat * expected_mat))
    return kappa
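
As a sanity check, scikit-learn's `cohen_kappa_score` with `weights="quadratic"` should agree with a hand-written QWK. The sketch below is self-contained: `qwk_reference` is a compact NumPy restatement of the same computation (a hypothetical helper, not part of the notebook), and the toy ratings are made up for illustration.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def qwk_reference(y_true, y_pred, min_rating, max_rating):
    """Compact NumPy version of the QWK computation, for checking only."""
    n = max_rating - min_rating + 1
    t = np.asarray(y_true) - min_rating
    p = np.asarray(y_pred) - min_rating
    observed = np.zeros((n, n))
    np.add.at(observed, (t, p), 1)  # confusion matrix of counts
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / len(t)
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Made-up toy ratings on the 1-6 scale, purely for illustration.
toy_true = [1, 2, 3, 3, 4, 5, 6, 2]
toy_pred = [1, 3, 3, 2, 4, 4, 6, 2]

ours = qwk_reference(toy_true, toy_pred, 1, 6)
theirs = cohen_kappa_score(toy_true, toy_pred, labels=list(range(1, 7)), weights="quadratic")
print(ours, theirs)
```

If the two numbers match, calling `cohen_kappa_score` directly is a reasonable alternative to maintaining a custom function.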

## Using CountVectorizer and TfidfTransformer from sklearn 
[Working With Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

*Text Preprocessing*

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train['full_text'])
X_train_counts.shape

(13845, 56588)

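`CountVectorizer` splits each essay into tokens and counts how often each token occurs. The result is a sparse matrix with one row per essay (13,845) and one column per vocabulary term (56,588).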

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(13845, 56588)

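`TfidfTransformer` converts the raw counts into TF-IDF weights. Each count (the term frequency) is multiplied by the term's inverse document frequency, which shrinks the weight of words that appear in many essays, and each row is then normalized to unit length. The shape of the matrix does not change.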
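
The following sketch reproduces the transformation by hand on a small made-up count matrix, assuming the scikit-learn defaults (smoothed idf and L2 row normalization), and compares the result with `TfidfTransformer`. All variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up count matrix: 3 documents (rows) x 3 terms (columns).
toy_counts = np.array([[3, 0, 1],
                       [2, 0, 0],
                       [3, 1, 0]])

n_docs = toy_counts.shape[0]
doc_freq = (toy_counts > 0).sum(axis=0)          # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (scikit-learn default)
weighted = toy_counts * idf                      # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

from_sklearn = TfidfTransformer().fit_transform(toy_counts).toarray()
print(np.allclose(manual, from_sklearn))
```

The first term appears in every document, so it gets the smallest idf and contributes the least per occurrence.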

In [32]:
y_train.shape

(13845,)

In [33]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, y_train)

In [34]:
X_new_counts = count_vect.transform(X_test['full_text'])
X_new_counts

<3462x56588 sparse matrix of type '<class 'numpy.int64'>'
	with 564659 stored elements in Compressed Sparse Row format>

In [35]:
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
X_new_tfidf

<3462x56588 sparse matrix of type '<class 'numpy.float64'>'
	with 564659 stored elements in Compressed Sparse Row format>

In [36]:
predicted = clf.predict(X_new_tfidf)
predicted

array([3, 3, 3, ..., 3, 3, 3], dtype=int64)

In [37]:
pred_df = pd.DataFrame({'essay_id': X_test['essay_id'], 'full_text': X_test['full_text'], 'score': predicted})
pred_df

|   | essay_id | full_text | score |
|---|---|---|---|
| 12696 | bb4c434 | People tend to use there cars so much, they ba... | 3 |
| 4625 | 44e88b0 | Imagine being a top scientist at NASA and Viki... | 3 |
| 733 | 0ba78ec | The face of Mars could not be created by alien... | 3 |
| 16885 | f96c287 | Many people belive that the face on Mars was c... | 3 |
| 3334 | 317173f | Driverless Cars are coming soon or later? Peop... | 3 |
| ... | ... | ... | ... |
| 16145 | ee1d27b | How the author support his suggests that study... | 3 |
| 4229 | 3e7dd0b | In this aricle , the author its trying to you ... | 3 |
| 4313 | 3fdbec2 | The Facial Action Coding System enables comput... | 3 |
| 934 | 0edee1b | Hello my name is Luke Bomberger and, welcome t... | 3 |


In [38]:
import numpy as np
print(f"Accuracy: {np.mean(predicted == y_test)}")

Accuracy: 0.3694396302715193


In [39]:
kappa_score = quadratic_weighted_kappa(y_test, predicted)
print(f"Quadratic Weighted Kappa: {kappa_score}")

Quadratic Weighted Kappa: 0.034102332855281525
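
A kappa this close to zero is what one would expect from a model that almost always predicts the majority score, which the predicted array above suggests. As a rough illustration, the sketch below scores a hypothetical baseline that always predicts 3. It uses the class counts from `value_counts()` over the full training file rather than the held-out split, so the numbers are indicative only.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Class counts copied from df['score'].value_counts() above.
counts = {1: 1252, 2: 4723, 3: 6280, 4: 3926, 5: 970, 6: 156}
y_all = np.repeat(list(counts.keys()), list(counts.values()))

# Hypothetical baseline that always predicts the most common score.
always_three = np.full_like(y_all, 3)

baseline_accuracy = np.mean(always_three == y_all)
baseline_qwk = cohen_kappa_score(
    y_all, always_three, labels=list(counts.keys()), weights="quadratic"
)
print(baseline_accuracy, baseline_qwk)
```

For a constant prediction the observed and expected matrices coincide, so the kappa should come out at 0 up to floating-point error, while the accuracy equals the share of essays scored 3.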


In [40]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(X_train['full_text'], y_train)
predicted = text_clf.predict(X_test['full_text'])

print(f"Accuracy: {np.mean(predicted == y_test)}")

kappa_score = quadratic_weighted_kappa(y_test, predicted)
print(f"Quadratic Weighted Kappa: {kappa_score}")

Accuracy: 0.4679376083188908
Quadratic Weighted Kappa: 0.5708430116379231


For my next steps: <br/>
https://developer.ibm.com/tutorials/awb-tokenizing-text-in-python/
<br/>
https://medium.com/@bukowski.daniel/a-practical-framework-for-evaluating-text-generation-llms-4016ffa93736
<br/>
https://www.datacamp.com/blog/what-is-tokenization
<br/>
https://www.nltk.org
<br/>