# Uralic Language Identification Task - VarDial2021 - Part 3

This notebook contains the code developed by Team Phlyers to distinguish among 'target' languages for the ULI shared task at VarDial2021.

The first two blocks mount Google Drive and change into the project directory.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/My Drive/Colab Notebooks/ULI-VarDial2021

/content/drive/My Drive/Colab Notebooks/ULI-VarDial2021


This block loads the data.

In [None]:
import json
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# The corpus is stored in a dictionary in json format
# Dictionary format: {category:{language:[list of texts]}}
with open('data.json', encoding='utf-8') as f:
  data = json.load(f)

# Flatten the corpus into a list of (category, lang, sentence) tuples
dataset = []

for category in data:
  for lang in data[category]:
    for sentence in data[category][lang]:
      dataset.append((category, lang, sentence))

# Shuffle the sentences in place (no seed is set, so the order differs between runs)
random.shuffle(dataset)
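
As a rough illustration of the flattening step above, the sketch below applies the same idea to a miniature corpus. The category key "XX", the language codes `lang_a`, `lang_b`, `lang_c`, and the sentences are placeholders invented for this example; they are not taken from data.json.

```python
# Hypothetical miniature corpus in the {category: {language: [texts]}} layout.
# All keys except "UR" and all sentences are placeholders, not real entries.
toy_data = {
    "UR": {
        "lang_a": ["first sentence", "second sentence"],
        "lang_b": ["third sentence"],
    },
    "XX": {
        "lang_c": ["fourth sentence"],
    },
}

# Flatten the nested dictionary into (category, lang, sentence) tuples.
toy_dataset = [
    (category, lang, sentence)
    for category, langs in toy_data.items()
    for lang, sentences in langs.items()
    for sentence in sentences
]

# Keep only the rows whose category is "UR", as the later cell does.
target_rows = [row for row in toy_dataset if row[0] == "UR"]

print(len(toy_dataset), "rows in total,", len(target_rows), "in category UR")
```

Keeping the category in each tuple means the same flat list can be filtered for different subsets later without reloading the file.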



We then run five-fold cross-validation (the scikit-learn default) on a Multinomial Naive Bayes (NB) classifier trained to distinguish among the 'target' languages, i.e. the entries whose category is "UR".

In [None]:
# Build parallel lists of sentences and language labels, keeping only the target languages (category "UR")

X_train = [sentence for category, _, sentence in dataset if category == "UR"]
y_train = [language for category, language, _ in dataset if category == "UR"]

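# Train a Multinomial Naive Bayes model with cross-validation

# Features: TF-IDF over character 3- to 5-grams taken inside word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,5), min_df=0.000001, sublinear_tf=True)
# Note: the vectorizer is fitted on all sentences before the folds are split
X_train = vectorizer.fit_transform(X_train)
model = MultinomialNB(alpha=0.0000001)
# cross_val_score uses 5 folds by default; each fold is scored with macro-averaged F1
scores = cross_val_score(model, X_train, y_train, scoring='f1_macro')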
print('Results:')
print(scores)

Results:
[0.91832377 0.91967079 0.91794949 0.90454321 0.92074492]
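
In the cell above, the TF-IDF vectorizer is fitted on every sentence before cross-validation splits the data, so the vocabulary and IDF weights also reflect the held-out folds. A possible variant, shown here only as a sketch and not as the setup the team used, wraps the vectorizer and the classifier in a scikit-learn pipeline so that both are refitted inside each training fold. The toy strings and the labels `lang_a` and `lang_b` are made up for illustration, and `cv=2` is chosen only because the toy set is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: two artificial "languages" with disjoint alphabets.
texts = [
    "abab abba baab", "abba baab abab", "baab abab abba", "abab baab abba",
    "xyxy xyyx yxxy", "xyyx yxxy xyxy", "yxxy xyxy xyyx", "xyxy yxxy xyyx",
]
labels = ["lang_a"] * 4 + ["lang_b"] * 4

# Same hyperparameters as the notebook cell, but combined in one estimator
# so the vectorizer is fitted on the training part of each fold only.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5),
                    min_df=0.000001, sublinear_tf=True),
    MultinomialNB(alpha=0.0000001),
)

# Raw strings are passed directly; the pipeline vectorizes them per fold.
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring='f1_macro')
print('Per-fold macro-F1:', scores)
print('Mean macro-F1:', scores.mean())
```

On the real data this would be called with the raw sentence list rather than the pre-transformed matrix. Whether the scores change noticeably is not something the notebook reports.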
