## Train a classifier to classify words as concrete or abstract

1. Load words with their concreteness scores from an xlsx file
2. Train a classifier to classify words as concrete or abstract, then evaluate and save it
3. Train a regression model that predicts the concreteness score directly

In [92]:
import json
import os.path

import joblib
import pandas as pd
import spacy
from nltk.corpus import wordnet as wn
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.model_selection import train_test_split


# spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md")

### 1. Load words with concreteness score from xls file

The file contains a list of words with their concreteness score. The concreteness score is a value between 1 and 5.
Words with a concreteness score >= 3 are considered concrete, while words with a concreteness score < 3 are considered abstract.
The threshold of 3 can be changed to test other splits between concrete and abstract.
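To make the effect of the threshold easier to see, here is a minimal sketch that labels a few scores under different thresholds. The function name `label_by_threshold` and the sample scores are hypothetical and are not part of the notebook; the label convention (0 = concrete, 1 = abstract) follows the one used later on.

```python
def label_by_threshold(scores, threshold=3.0):
    """Return 0 (concrete) for scores >= threshold, else 1 (abstract)."""
    return [0 if score >= threshold else 1 for score in scores]


# Hypothetical scores, for illustration only
sample_scores = [1.04, 2.99, 3.0, 4.85]

for threshold in (2.5, 3.0, 3.5):
    print(threshold, label_by_threshold(sample_scores, threshold))
```

Moving the threshold changes the class balance, so accuracy figures obtained with different thresholds are not directly comparable.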

In [93]:
concreteness_df = pd.read_excel("data/brysbaert_concreteness_ratings.xlsx", na_filter=False)
# split the words into two lists: concreteness >= 3 (concrete) and < 3 (abstract)
concrete_words = concreteness_df.query('conc_score >= 3')['word'].tolist()
abstract_words = concreteness_df.query('conc_score < 3')['word'].tolist()

Create a file that maps each word to its vector representation, stored as a JSON dictionary.
If the file already exists, load it instead of computing the vectors again.

In [94]:
if not os.path.exists("data/concreteness_word_vectors.txt"):
    # compute the spaCy vector of every word (slow, so the result is cached)
    word_vector_dict = {}
    for word in concreteness_df['word']:
        word_vector = nlp(word).vector.tolist()
        word_vector_dict[word] = word_vector

    # save dictionary to file (JSON content, despite the .txt extension)
    with open("data/concreteness_word_vectors.txt", "w") as file:
        json.dump(word_vector_dict, file)
else:
    # load the cached vectors
    with open("data/concreteness_word_vectors.txt", "r") as file:
        word_vector_dict = json.load(file)
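The load-or-compute pattern used for the vector file can be factored into a small helper. The sketch below is illustrative only: `load_or_compute`, `fake_vectors`, and the temporary path are hypothetical names, and the tiny dictionary stands in for the real spaCy vectors.

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Load a JSON cache from `path`, or call `compute()` and write its result there."""
    if os.path.exists(path):
        with open(path, "r") as file:
            return json.load(file)
    data = compute()
    with open(path, "w") as file:
        json.dump(data, file)
    return data


calls = []


def fake_vectors():
    """Stand-in for the expensive vector computation (hypothetical data)."""
    calls.append(1)
    return {"bed": [0.1, 0.2], "belief": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "vectors.json")
    first = load_or_compute(cache_path, fake_vectors)   # computes and writes
    second = load_or_compute(cache_path, fake_vectors)  # reads the cache

print(first == second, len(calls))
```

The second call reads the file instead of recomputing, which is the same behaviour as the `if`/`else` above.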

In [95]:
print(len(concrete_words))
print(concrete_words[:10])
print()
print(len(abstract_words))
print(abstract_words[:10])

18776
['accumulate', 'add', 'aerially', 'ahead', 'aiming', 'airless', 'alternation', 'anaphylactic', 'anatomically', 'annotate']

21178
['eh', 'essentialness', 'although', 'spirituality', 'would', 'spiritually', 'whatsoever', 'conceptualistic', 'conventionalism', 'belief']


### 2. Train a classifier to classify words as concrete or abstract

#### 2.1. Prepare the data
Create training data by combining the concrete and abstract words and their corresponding labels.

In [96]:
classes = ['concrete', 'abstract']
train_set = [concrete_words, abstract_words]

Convert the words to spaCy word vectors to get a more meaningful representation than the raw strings.
Then prepare the labels: words with a concreteness score >= 3 are labeled 0 (concrete), and words with a concreteness score < 3 are labeled 1 (abstract).
The result is a data set of word vectors `X` with their corresponding labels `y`.

In [97]:
# create word vector list
X = []
for part in train_set:
    for word in part:
        # get word vector for word
        word_vector = word_vector_dict[word]
        X.append(word_vector)
        
# get labels
y = [0] * len(concrete_words) + [1] * len(abstract_words)

#### 2.2. Train the classifier

In [98]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classifier = LogisticRegression(max_iter=500).fit(X_train, y_train)
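As a rough illustration of what the fitted binary logistic regression does at prediction time, the sketch below applies a sigmoid to a weighted sum of the input features and assigns label 1 (abstract) when the probability exceeds 0.5. The weights, bias, and 3-dimensional vector are hypothetical toy values; the real model learns one weight per embedding dimension during `fit`.

```python
import math


def predict_abstract_probability(vector, weights, bias):
    """Sigmoid of the weighted feature sum: probability of label 1 (abstract)."""
    z = sum(w * x for w, x in zip(weights, vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy parameters, for illustration only
weights = [1.5, -2.0, 0.5]
bias = 0.1
vector = [0.2, 0.9, -0.4]

probability = predict_abstract_probability(vector, weights, bias)
label = 1 if probability > 0.5 else 0
print(f"P(abstract) = {probability:.3f}, label = {label}")
```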

#### 2.3. Evaluate and save the classifier

In [99]:
y_pred = classifier.predict(X_test)
print(f"Accuracy score: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

Accuracy score: 0.8082923166763994
              precision    recall  f1-score   support

           0       0.81      0.76      0.79      5614
           1       0.80      0.85      0.82      6373

    accuracy                           0.81     11987
   macro avg       0.81      0.81      0.81     11987
weighted avg       0.81      0.81      0.81     11987
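
To put the accuracy in context, it can be compared with a baseline that always predicts the larger class. The sketch below only reuses the support counts and accuracy printed in the report above; the variable names are illustrative.

```python
# Support values and accuracy copied from the classification report above
support = {0: 5614, 1: 6373}
reported_accuracy = 0.8082923166763994

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Improvement over baseline: {reported_accuracy - baseline_accuracy:.3f}")
```

Always predicting "abstract" would be right on roughly 53% of the test set, so the classifier's 81% is a clear improvement over that trivial strategy.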


In [100]:
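# save classifier (create the output directory first if it does not exist)
os.makedirs("trained_models", exist_ok=True)
joblib.dump(classifier, "trained_models/concrete_abstract_classifier.joblib")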

['trained_models/concrete_abstract_classifier.joblib']

#### 2.4. Sample use

In [101]:
# load classifier
classifier = joblib.load("trained_models/concrete_abstract_classifier.joblib")

In [102]:
synsets = ['war.n.01', 'fiefdom.n.01', 'bed.n.03', 'return_on_invested_capital.n.01', 'texture.n.02', 'news.n.01', 'look.n.02']

for synset_str in synsets:
    synset = wn.synset(synset_str)
    # use the first lemma name of the synset as the word to classify
    synset_name = synset.lemma_names()[0]
    # note: only the vector of the first spaCy token is used here
    synset_vector = nlp(synset_name)[0].vector
    synset_class = classifier.predict([synset_vector])[0]
    # print classification
    print(f'{synset_name} -> {synset_class} - {classes[synset_class]}')

war -> 1 - abstract
fiefdom -> 1 - abstract
bed -> 0 - concrete
return_on_invested_capital -> 1 - abstract
texture -> 0 - concrete
news -> 0 - concrete
look -> 0 - concrete
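
WordNet lemma names join multiword expressions with underscores, as in `return_on_invested_capital`. One possible refinement, sketched below under the assumption that such names are better handled as ordinary phrases, is to replace the underscores with spaces before computing the vector. The helper `lemma_to_phrase` is hypothetical and is not used in the cell above.

```python
def lemma_to_phrase(lemma_name):
    """Turn a WordNet lemma name such as 'return_on_invested_capital' into plain text."""
    return lemma_name.replace("_", " ")


for name in ["war", "return_on_invested_capital"]:
    print(f"{name!r} -> {lemma_to_phrase(name)!r}")
```

The resulting phrase could then be passed to `nlp(...).vector`, the same call used when the vector file was created, instead of taking only the first token. Whether this changes any of the predictions above has not been tested here.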


### 3. Regression model version

Instead of a binary label, we now predict the concreteness score of a word directly, treating the task as a regression problem.

In [103]:
words = concreteness_df['word'].tolist()
concreteness_scores = concreteness_df['conc_score'].tolist()
X = []
y = concreteness_scores

for word in words:
    # get word vector for word
    word_vector = word_vector_dict[word]
    X.append(word_vector)

In [104]:
# print the first 10 words along with their concreteness scores
for i in range(10):
    print(f'{words[i]} -> {y[i]}')

eh -> 1.04
essentialness -> 1.04
although -> 1.07
spirituality -> 1.07
would -> 1.12
spiritually -> 1.14
whatsoever -> 1.17
conceptualistic -> 1.18
conventionalism -> 1.18
belief -> 1.19


In [105]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

Mean Squared Error: 0.47152531458015673
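
The mean squared error is in squared score units, which makes it hard to read against the 1 to 5 scale. A minimal sketch of converting it to a root mean squared error, using only the value printed above:

```python
import math

mse = 0.47152531458015673  # value printed above
rmse = math.sqrt(mse)
score_range = 5 - 1  # concreteness scores lie between 1 and 5

print(f"RMSE: {rmse:.3f}")
print(f"RMSE as a fraction of the score range: {rmse / score_range:.1%}")
```

An RMSE of roughly 0.69 means predictions are typically off by about two thirds of a point on the 1 to 5 scale.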


In [106]:
# save model
joblib.dump(model, "trained_models/concreteness_regression_model.joblib")

['trained_models/concreteness_regression_model.joblib']