# Albanian POS Tagging

This notebook uses the [Albanian POS](https://github.com/NeldaKote/Albanian-POS/blob/master/albanian-all-devel-new.conllu) dataset, which contains 10,000 sentences annotated with part-of-speech (POS) tags. The dataset is in the CoNLL-U format.
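
To make the file layout concrete, here is a minimal sketch of reading CoNLL-U with the standard library only. Each token line has ten tab-separated columns, with the word form in column 2 and the universal POS tag (UPOS) in column 4; comment lines start with `#`, and a blank line ends a sentence. The sample text and the helper name `read_conllu` are hypothetical illustrations, not part of the dataset or of `pyconll`.

```python
# Hypothetical one-sentence sample in CoNLL-U layout (not taken from the dataset).
# Columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
SAMPLE = (
    "# text = Ky është shembull.\n"
    "1\tKy\t_\tPRON\t_\t_\t_\t_\t_\t_\n"
    "2\tështë\t_\tVERB\t_\t_\t_\t_\t_\t_\n"
    "3\tshembull\t_\tNOUN\t_\t_\t_\t_\t_\t_\n"
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Yield (forms, upos_tags) for each sentence in a CoNLL-U string."""
    forms, tags = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if forms:
                yield forms, tags
                forms, tags = [], []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        token_id, form, upos = cols[0], cols[1], cols[3]
        if "-" in token_id or "." in token_id:
            continue  # multiword-token range (e.g. 1-2) or empty node (e.g. 1.1)
        forms.append(form)
        tags.append(upos)
    if forms:
        # The file may end without a trailing blank line.
        yield forms, tags


parsed = list(read_conllu(SAMPLE))
print(parsed)
```

Unlike the loader used later in this notebook, which drops any sentence containing a token without a form or UPOS tag, this sketch skips only the multiword-token and empty-node rows.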

A `CRF` (conditional random field) model from the `sklearn_crfsuite` package is trained on this dataset to predict POS tags for Albanian text.

In [2]:
import os

# Create the output directory for the trained model (parent directories included).
os.makedirs('./models/pos', exist_ok=True)

In [36]:
import pyconll
from sklearn_crfsuite import CRF
import os
import joblib

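def load_conllu_file(file_path):
    """Return parallel lists of token forms and UPOS tags, one inner list per sentence.

    A sentence is skipped entirely if any of its tokens lacks a form or a UPOS tag.
    """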
    sentences = []
    pos_tags = []
    conll = pyconll.load_from_file(file_path)
    for sentence in conll:
        words = []
        tags = []
        skip_sentence = False
        for token in sentence:
            if token.form is None or token.upos is None:
                skip_sentence = True
                break
            words.append(token.form)
            tags.append(token.upos)
        if not skip_sentence:
            sentences.append(words)
            pos_tags.append(tags)
    return sentences, pos_tags

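# Step 1: Load the CoNLL-U file, skipping sentences with a missing form or UPOS tag
sentences, pos_tags = load_conllu_file('./data/albanian-all-devel-new.conllu')

# Step 2: Train the CRF Model
# Note: the raw tokens are passed in directly; no per-token feature dicts are built here.
crf = CRF(algorithm='lbfgs', max_iterations=100)
crf.fit(sentences, pos_tags)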

# Step 3: Save the Model
model_path = "./models/pos/"
os.makedirs(model_path, exist_ok=True)
model_file = os.path.join(model_path, "pos_model.pkl")
joblib.dump(crf, model_file)

print("Model trained and saved successfully at:", model_file)

Model trained and saved successfully at: ./models/pos/pos_model.pkl
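
The cell above passes raw token strings to `crf.fit`. `sklearn_crfsuite` is more commonly given one feature dict per token, so that the model can use suffixes, capitalization, and neighbouring words. If the underlying `python-crfsuite` iterates over each string to collect attribute names (an assumption worth verifying), the model would see only the characters of each token. The following is a minimal sketch of explicit feature extraction; the helper names `token_features` and `sentence_features` and the feature names are hypothetical choices, not part of the original notebook.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i]; the feature names are illustrative."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "prefix2": word[:2],
        "suffix3": word[-3:],
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
    }
    # Context features: previous and next word, or sentence-boundary markers.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """Convert one tokenized sentence into a list of feature dicts."""
    return [token_features(tokens, i) for i in range(len(tokens))]


example = sentence_features(["Ky", "është", "një", "shembull"])
print(example[0])
```

With these helpers, training would become `crf.fit([sentence_features(s) for s in sentences], pos_tags)`, and the same transformation would have to be applied to the tokens before calling `predict`.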


In [37]:
# Load the model
model = joblib.load(model_file)

# Test the model
text = "Ky është një shembull i një teksti të shkruar në gjuhën shqipe." # This is an example of a text written in the Albanian language.

# Whitespace tokenization leaves punctuation attached to words (e.g. "shqipe.").
tokens = text.split()
predicted_tags = model.predict([tokens])[0]

for token, tag in zip(tokens, predicted_tags):
    print(token, tag)

Ky PRON
është VERB
një NUM
shembull NOUN
i DET
një NUM
teksti NOUN
të DET
shkruar ADJ
në ADP
gjuhën NOUN
shqipe. ADJ
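
In the output above, the final token `shqipe.` still carries its full stop, because `text.split()` splits only on whitespace. A minimal sketch of a tokenizer that separates punctuation into its own tokens follows; `simple_tokenize` is a hypothetical helper, and it assumes that the training data also treats punctuation marks as separate tokens.

```python
import re
import unicodedata

# A token is either a run of word characters or a single punctuation character.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens."""
    # NFC normalization keeps letters such as "ë" as single code points,
    # so they stay inside the \w+ match.
    text = unicodedata.normalize("NFC", text)
    return _TOKEN_RE.findall(text)


tokens = simple_tokenize("Ky është një shembull i një teksti të shkruar në gjuhën shqipe.")
print(tokens)
```

This pattern also splits at apostrophes, so contracted forms such as `s'ka` would be broken into three tokens; whether that matches the dataset's tokenization should be checked before relying on it.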
