Notebook containing the final versions of models 

In [None]:
import pandas as pd
import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import spacy 
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, ComplementNB

I will try to describe what I have done and why my final models look the way they do. One quick note: when I say accuracy increased or decreased, I am actually referring to a combination of Accuracy, Macro Average, and Weighted Average from the classification_report output. The data we have is drastically skewed toward proper annotations, so by looking only at accuracy we might miss that a model which is really good at predicting proper annotations actually performs poorly on the low and uninformative ones. Macro average is the unweighted mean of the per-class scores, which makes it useful when analysing imbalanced datasets. Weighted average is the mean of the per-class scores weighted by each class's support (its number of true samples), so it is dominated by the majority class; I need to read up on it more and have honestly not looked at it much.
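
To make the difference between the three numbers concrete, here is a minimal sketch that computes them by hand for recall only. The labels and class names below are made up for illustration and are not taken from the dataset; the model is a hypothetical one that always predicts the majority class.

```python
from collections import Counter

# Toy example (made-up labels): 8 proper, 1 low, 1 uninformative.
y_true = ["proper"] * 8 + ["low", "uninformative"]
# A hypothetical model that always predicts the majority class.
y_pred = ["proper"] * 10

support = Counter(y_true)  # number of true samples per class
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

# Recall per class: correctly predicted samples / true samples of that class.
recall = {label: correct[label] / support[label] for label in support}

accuracy = sum(correct.values()) / len(y_true)
# Macro average: every class counts equally.
macro_recall = sum(recall.values()) / len(recall)
# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in support) / len(y_true)

print(f"accuracy={accuracy:.2f} macro={macro_recall:.2f} weighted={weighted_recall:.2f}")
# accuracy=0.80 macro=0.33 weighted=0.80
```

Accuracy and the weighted average both look fine here, while the macro average exposes that two of the three classes are never predicted.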

In [None]:
#Loading annotation and label file
data = pd.read_csv(r'.\AF50m_subset_REGEX_man_labels_5k.txt', sep="\t")

# Getting the annotations that have been labeled manually
labeled = data.loc[data["manual_label"].notna(), ["protein_annotation", "manual_label"]]

My main concern for the cleaning function was which tokens I want to preserve. Take "Si:ch211-256e16.3", "5'(3')-deoxyribonucleotidase", and "[NAD(P)H]" as examples. Do I keep these completely intact? The default tokenization pattern in TfidfVectorizer is r"(?u)\b\w\w+\b", which treats a token as a run of at least two word characters (letters, digits, or underscore), so punctuation is ignored and always acts as a token separator. In the end, I decided to remove brackets and parentheses, and to replace all other non-alphanumeric characters with a blank space. This method will not keep specialized words such as atoms or accession codes intact, but it seems to increase accuracy. I think this is mainly due to the low and uninformative class, where accession codes and the like aren't as frequent. Essentially, it is a noise reduction step. I also remove any words with fewer than two characters.
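
As a quick check of what the default pattern does to the three examples above, the sketch below applies the same regular expression with plain `re`. Lowercasing is done by hand here to mimic the vectorizer's default `lowercase=True`; the variable names are only for illustration.

```python
import re

# Default token_pattern of TfidfVectorizer: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

examples = ["Si:ch211-256e16.3", "5'(3')-deoxyribonucleotidase", "[NAD(P)H]"]

# Map each raw annotation to the tokens the default pattern would extract.
tokens = {text: DEFAULT_TOKEN_PATTERN.findall(text.lower()) for text in examples}
for text, toks in tokens.items():
    print(f"{text!r} -> {toks}")
# 'Si:ch211-256e16.3' -> ['si', 'ch211', '256e16']
# "5'(3')-deoxyribonucleotidase" -> ['deoxyribonucleotidase']
# '[NAD(P)H]' -> ['nad']
```

None of the three examples survives intact: single characters such as the trailing "3" or the "p" and "h" in "[NAD(P)H]" are dropped, and the rest is split at every punctuation mark.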

In [None]:
# Cleaning function

# TODO: consider unseen data: should we force everything to str and check for NaNs and non-ASCII characters?

def cleaner(text):
    """
    Takes a string as an input and performs the following operations:
        - Lowercases the text
        - Removes the characters [ ] ( )
        - Replaces any other non-alphanumeric characters with a blank space
        - Removes words with fewer than 2 characters
    Returns the cleaned text.
    """
    text = text.lower()  # Lowercase
    text = re.sub(r"[\[\]\(\)]", "", text)  # removing brackets etc
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)  # replace runs of non-alphanumeric characters with a space
    text = re.sub(r"\s+", " ", text)  # remove multiple spaces

    # Remove words shorter than 2 characters
    text = re.sub(r"\b\w{1}\b", "", text)  # Removes isolated 1-character tokens
    text = re.sub(r"\s+", " ", text)  # cleans up extra spaces again

    return text.strip()
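
On the open question in the comment above about unseen data: one possible approach is a thin wrapper that handles missing and non-string values before the cleaning function is called. This is only a sketch. `safe_clean` is a hypothetical helper name, and returning an empty string for missing values is an assumption rather than a decision made in this notebook.

```python
import math

def safe_clean(value, clean_fn):
    """Apply clean_fn to value as text; return "" for missing values.

    value    -- anything found in the annotation column (str, None, NaN, number)
    clean_fn -- a function that takes a str and returns a str, e.g. the cleaner
    """
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""  # pandas represents missing text as float NaN
    return clean_fn(str(value))

# Stand-in cleaning function for the demonstration.
print(repr(safe_clean(float("nan"), str.lower)))  # ''
print(repr(safe_clean("ATP-binding", str.lower)))  # 'atp-binding'
```

With the notebook's own function this would be called as `safe_clean(value, cleaner)`. Non-ASCII characters need no special case here, because the cleaner already replaces everything outside a-z and 0-9 with a space.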


Lemmatization reduces a word to its base form. Here, I try to keep words that contain numbers intact, but remove pure digits completely, as these do not seem to be indicative of the informativeness of an annotation. I also tried a version where the cleaner keeps any words with special characters such as .:-_'+ and the lemmatizer also treats words containing these as tokens, but ultimately found that removing these characters increases accuracy.

In [None]:
# Lemmatization function

# Here I need to think about how things should be loaded for the package
# e.g. do I put load inside the function or outside?

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

# Pattern for tokens that contain both letters and digits
HAS_LETTER_AND_DIGIT = re.compile(r'(?=.*[a-zA-Z])(?=.*\d)')

def lemmatizer(text):
    """
    Lemmatizes the input text using spaCy. Removes spaces, punctuation, and pure numbers.
    Keeps alphanumeric tokens (those containing both letters and digits).
    Returns the lemmatized text as a single string.
    """
    doc = nlp(text)
    lemmas = []
    for tok in doc:
        if tok.is_space:
            continue
        t = tok.text
        if tok.is_alpha:
            lemmas.append(tok.lemma_.lower()) #eg binding -> bind
        elif HAS_LETTER_AND_DIGIT.search(t):
            # Keep alphanumeric tokens like asp45 or hsp70
            lemmas.append(t.lower())
        # else: skip pure numbers and punctuation (shouldn’t occur post-cleaning)

    return " ".join(lemmas)
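
Regarding the question in the comments above about where to load the spaCy model when this becomes a package: one common pattern is to load it lazily on first use and cache the result, so that importing the package stays cheap and the model is still loaded only once. The sketch below assumes this pattern fits the package; `get_nlp` is a hypothetical helper name.

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_nlp():
    """Load the spaCy pipeline on first call and reuse it afterwards."""
    import spacy  # imported here so the cost is only paid when lemmatizing

    return spacy.load("en_core_web_sm", disable=["parser", "ner", "textcat"])

# Nothing has been loaded yet at this point; the cache is still empty.
print(get_nlp.cache_info().currsize)  # 0
```

Inside the lemmatizer, `doc = nlp(text)` would then become `doc = get_nlp()(text)`.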

For vectorisation I tried quite a few things. 

Stopwords are words that occur frequently but contribute very little to the meaning of a sentence. The most frequent word in our dataset is protein, which appears in more than half of the annotations. There is a steep drop-off in frequency after this, with the second most frequent word, domain, appearing in only 14.6% of annotations. Ultimately, I found that just using the English stopwords is fine.
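
The share of annotations that contain a given word can be computed along these lines. This is a sketch on made-up annotations, not the code used for the numbers above, and `document_frequency` is a hypothetical helper name.

```python
from collections import Counter

def document_frequency(docs):
    """Fraction of documents in which each token appears at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # set(): count a token once per document
    return {token: n / len(docs) for token, n in counts.items()}

# Made-up annotations, assumed to be cleaned and lowercased already.
toy_docs = [
    "uncharacterized protein",
    "kinase domain protein",
    "zinc finger domain",
    "heat shock protein",
]

df = document_frequency(toy_docs)
top = sorted(df.items(), key=lambda item: item[1], reverse=True)[:2]
print(top)  # [('protein', 0.75), ('domain', 0.5)]
```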

min_df defines the minimum number of documents a word must appear in. I found that models have the highest accuracy when min_df is left unset, that is, when tokens that appear in only one document are kept.
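
To illustrate what such a threshold removes, the sketch below filters a vocabulary by an absolute document count, which is how scikit-learn interprets an integer min_df (a float is interpreted as a proportion of documents). The annotations are made up and `vocabulary` is a hypothetical helper name.

```python
from collections import Counter

def vocabulary(docs, min_df=1):
    """Sorted tokens that appear in at least min_df documents."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.split()))  # count each token once per document
    return sorted(token for token, n in counts.items() if n >= min_df)

# Made-up annotations; the last one mimics a cleaned accession-style name.
toy_docs = ["atp binding protein", "dna binding protein", "si ch211 256e16"]

vocab_all = vocabulary(toy_docs, min_df=1)
vocab_min2 = vocabulary(toy_docs, min_df=2)
print(len(vocab_all), vocab_min2)  # 7 ['binding', 'protein']
```

With min_df=2 the third annotation is left without any tokens at all, which is one way that a threshold can discard information about rare annotations.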

ngram_range defines the sizes of the n-grams that are used. As an example, say we have the text "protein domain" and ngram_range=(1,2); that gives three tokens: "protein", "domain", and "protein domain". Because the cleaning is harsh and removes all special characters, I tried different n-gram ranges (n from 1 to 4, could expand if interesting) and found that (1,2) was the best.
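
The example above can be reproduced with a few lines of plain Python. `word_ngrams` is a hypothetical helper that only mimics word-level n-gram extraction on whitespace-separated tokens; it is not the vectorizer's own implementation.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    tokens = text.split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("protein domain"))
# ['protein', 'domain', 'protein domain']
```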

I also tried a combination of a cleaner allowing special characters and a tokenizer allowing these, but ultimately found that these models had worse accuracy. 

In [None]:
# Vectorizer

# Testing without min_df
vectorizer = TfidfVectorizer(
    lowercase=False,
    stop_words=list(ENGLISH_STOP_WORDS), 
    ngram_range=(1,2), 
    max_df=0.9 # words that appear in more than 90% of annotations are ignored
)

In [None]:
# Splitting data into training and test sets

x = labeled["protein_annotation"].apply(cleaner)
x = x.apply(lemmatizer)
y = labeled["manual_label"]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

train_vectors = vectorizer.fit_transform(x_train)
test_vectors = vectorizer.transform(x_test)

The classifiers I have tried are logistic regression, SVM, random forests, and two Naive Bayes methods. I also want to use a simple neural network, but my first attempt did not go well, so that is a future problem.

I will now write the ChatGPT summary of these models' strengths and weaknesses, but I do want to find sources on this separately at some point.

Logistic regression is fast to train, relatively easy to interpret, and works for multiclass classification. It is, however, a linear model, so it will not capture non-linear trends in the data. It may also struggle if the classes are imbalanced without proper handling, and may struggle more with sparse data than other text algorithms.

Below is the model which I found had the highest macro average and accuracy after tuning parameters. I cannot say if this is the best possible model, but it is what it is.

In [None]:
# best logistic regression model 

lr = LogisticRegression(
    solver="lbfgs",
    max_iter=1000,
    penalty='l2',
    C=13.1451
)


Support vector machines perform well on sparse, high-dimensional tasks such as text classification. They are, however, harder to interpret and require tuning, so they may be slower to train than logistic models.

I again base this model on parameter tuning analysis.

In [None]:
# best SVM model

svm = LinearSVC(
    C=0.7743,
    class_weight='balanced',
    loss='squared_hinge',
    penalty='l2'
)
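
For reference, class_weight='balanced' in scikit-learn weights each class by n_samples / (n_classes * class_count), so rare classes count for more during training. The sketch below computes these weights by hand. The class counts are made up for illustration and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as used by class_weight='balanced': n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Made-up class distribution, skewed toward proper annotations.
toy_labels = ["proper"] * 80 + ["low"] * 15 + ["uninformative"] * 5

weights = balanced_class_weights(toy_labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
# proper: 0.42
# low: 2.22
# uninformative: 6.67
```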


Random forests can capture complex patterns, not just linear ones, and are generally robust to overfitting. They are, however, less efficient for sparse data, as the decision trees may struggle to find meaningful splits among sparse features. Training time and memory use may be large, and interpretability is lower.

The model I have here is based on a plain random forest. I am trying to reduce the dimensionality of the vectors and then apply a random forest, but the tuning of this is taking ages; I will update when there is something to update.

In [None]:
# best RF model
rf = RandomForestClassifier(
    bootstrap=True,
    class_weight='balanced',
    max_depth=10,
    max_features='sqrt',
    min_samples_leaf=2,
    min_samples_split=11,
    n_estimators=910,
    random_state=42,
    n_jobs=-1   
)
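
One way the dimensionality reduction step could be wired in is sketched below. TruncatedSVD is an assumption here (it accepts sparse TF-IDF matrices directly); the notebook does not say which reduction method is being tuned. The data, labels, and parameter values are made up and far smaller than anything that would be used in practice.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up annotations and labels, for illustration only.
toy_docs = [
    "uncharacterized protein",
    "hypothetical protein",
    "protein of unknown function",
    "atp binding kinase domain",
    "dna repair helicase",
    "zinc finger transcription factor",
]
toy_labels = ["uninformative"] * 3 + ["proper"] * 3

svd_rf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),  # assumed reduction method
    RandomForestClassifier(n_estimators=50, random_state=42),
)
svd_rf.fit(toy_docs, toy_labels)

# Output of the first two steps: dense vectors with n_components columns.
reduced = svd_rf[:-1].transform(toy_docs)
print(reduced.shape)  # (6, 2)
```

Keeping the reduction inside a pipeline means n_components can be tuned together with the forest's own parameters.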

Multinomial Naive Bayes was designed for count data, such as word counts. TF-IDF values are fractional weights rather than counts, but the model tends to work with them in practice. It is fast to train and tends to do well when the assumption of independent features holds. This independence assumption is, however, weak for text, and the model may struggle with skewed classes.

In [None]:
# best multinomial NB model

mnb = MultinomialNB(
    alpha=0.001)
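
The alpha parameter is the additive smoothing term: for a class with per-feature totals N_i, the smoothed likelihood of feature i is (N_i + alpha) / (sum of all N_i + alpha * n_features). The sketch below shows how a small alpha such as the one above changes the probability assigned to a feature never seen in a class. The counts are made up and `smoothed_likelihoods` is a hypothetical helper name.

```python
def smoothed_likelihoods(feature_totals, alpha):
    """Additively smoothed feature probabilities for one class."""
    total = sum(feature_totals)
    n_features = len(feature_totals)
    return [(count + alpha) / (total + alpha * n_features) for count in feature_totals]

# Made-up per-feature totals for one class; the third feature was never seen.
totals = [3, 1, 0]

laplace = smoothed_likelihoods(totals, alpha=1.0)
light = smoothed_likelihoods(totals, alpha=0.001)
print([round(p, 4) for p in laplace])
print([round(p, 4) for p in light])
```

With alpha=1.0 the unseen feature still gets a probability of 1/7, while with alpha=0.001 it is close to zero, so a small alpha lets the model rely much more strongly on which tokens were or were not observed for a class.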

Complement Naive Bayes is an NB variant designed to handle imbalanced text classification tasks while retaining the advantages of NB (it may be robust if one class is very large, like proper). It has similar downsides to MNB and is less understood in text classification frameworks.

In [None]:
# best complement NB model

cnb = ComplementNB(
    alpha=0.007743)

This text is for an eventual neural network. Neural networks offer flexibility and may notice patterns that linear models miss. However, they tend to require more data to train and may otherwise overfit; they are slower, with more parameters to tune, and have weaker interpretability. Dimensionality may need to be reduced, as for random forests.

Finally, the following code is a simple setup for the final models. I need to see about reducing dimensions for random forests, which might require some extra steps. When I am happy with my models, the next steps are as follows: talk to the group about how we want these models to be accessible and how to compare accuracy and computational cost across all models, and look up how I can make the pretrained models available to others without them needing to retrain.

In [None]:
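# training classifier

# Pick one of the models defined above: lr, svm, rf, mnb, or cnb
classifier = lr
title = f"Confusion matrix: {type(classifier).__name__}"

classifier.fit(train_vectors, y_train)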

y_pred = classifier.predict(test_vectors)
print(classification_report(y_test, y_pred))    

class_names = classifier.classes_
cm = confusion_matrix(y_test, y_pred, labels=class_names)  # same label order as the display

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap="Blues", values_format='d')  # 'd' = integer format
plt.title(title)
plt.show()
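
On making the pretrained models available without retraining: one option is to bundle the vectorizer and the classifier in a single pipeline and save the fitted pipeline with joblib. The sketch below is an assumption about how this could be done, not a decision made in this notebook. The data, labels, and file name are made up, and in practice the cleaning and lemmatization steps would still have to be applied to new text before calling the model.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up annotations and labels, for illustration only.
toy_docs = [
    "uncharacterized protein",
    "hypothetical protein",
    "atp binding kinase",
    "dna repair helicase",
]
toy_labels = ["uninformative", "uninformative", "proper", "proper"]

# Vectorizer and classifier travel together, so raw text can be passed in.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(toy_docs, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "annotation_model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # save the fitted pipeline
    restored = joblib.load(path)   # load it again without retraining

same_predictions = list(restored.predict(toy_docs)) == list(model.predict(toy_docs))
print(same_predictions)  # True
```

Two caveats: joblib files are pickle-based, so they should only be loaded from trusted sources, and loading a model with a different scikit-learn version than the one used for saving is not guaranteed to work.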