# Uralic Language Identification Task - VarDial2021 - Part 2

This notebook contains the code developed by Team Phlyers to distinguish between 'target' and 'non-target' languages for the ULI shared task at VarDial2021.

The first few blocks mount Google Drive and change into the project directory.

In [None]:
# Load the Drive helper and mount
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%cd /content/drive/My Drive/Colab Notebooks/ULI-VarDial2021

/content/drive/My Drive/Colab Notebooks/ULI-VarDial2021


This block loads the data.

In [None]:
import json
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# The corpus is stored in a dictionary in json format
# Dictionary format: {category:{language:[list of texts]}}
with open('data.json', encoding='utf-8') as f:
  data = json.load(f)

# The dataset is a list of (category, lang, sentence) tuples
dataset = []

for category in data:
  for lang in data[category]:
    for sentence in data[category][lang]:
      dataset.append((category, lang, sentence))

# Sentences are shuffled
random.shuffle(dataset)
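
The flattening step above can be sketched on a toy corpus. This is only an illustration: the category keys, language codes and sentences below are hypothetical placeholders, not values taken from `data.json`, and the fixed seed is an addition for reproducibility that the original cell does not use.

```python
import random

# Hypothetical toy corpus in the same nested layout as data.json:
# {category: {language: [list of texts]}}
toy_data = {
    "target": {"lang_a": ["first example sentence", "second example sentence"]},
    "non-target": {"lang_b": ["third example sentence"],
                   "lang_c": ["fourth example sentence"]},
}

# Flatten the nested dictionary into (category, lang, sentence) tuples
toy_dataset = [
    (category, lang, sentence)
    for category, langs in toy_data.items()
    for lang, sentences in langs.items()
    for sentence in sentences
]

# Assumption: seed the generator so the shuffle is repeatable
random.seed(0)
random.shuffle(toy_dataset)

print(len(toy_dataset))
print(sorted({category for category, _, _ in toy_dataset}))
```

Each tuple keeps the language code alongside the category, so the same list could later be reused for per-language analysis even though only the category is used as the label here.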



We then cross-validate an SVM classifier that is trained to distinguish 'target' from 'non-target' languages.
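
The vectorizer in the next cell uses `analyzer='char_wb'` with `ngram_range=(3,4)`, that is, character 3- and 4-grams taken inside word boundaries. The helper below is a simplified pure-Python sketch of that kind of feature extraction. The function name `char_wb_ngrams` is hypothetical, and the code is an approximation for illustration, not scikit-learn's implementation.

```python
def char_wb_ngrams(text, min_n=3, max_n=4):
    """Approximate word-boundary character n-grams (illustrative sketch)."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "  # pad with spaces so n-grams can mark word edges
        for n in range(min_n, max_n + 1):
            # Slide a window of width n over the padded word; a word shorter
            # than n contributes the whole padded word exactly once
            grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
            ngrams.extend(grams)
            if len(grams) == 1:
                break  # longer n would only repeat the same short word
    return ngrams


print(char_wb_ngrams("Hei"))
# prints [' he', 'hei', 'ei ', ' hei', 'hei ']
```

Because every word is padded separately, no n-gram spans two words, which keeps the features tied to word-internal spelling patterns.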

In [None]:
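# Build a list of all sentences and a parallel list of their categories

X_train = [sentence for _, _, sentence in dataset]
y_train = [category for category, _, _ in dataset]

# Cross-validate a linear SVM (SGDClassifier's default loss is the hinge loss)

# Features: TF-IDF over character 3- and 4-grams within word boundaries
vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3,4), max_features=100000)
# with_mean=False scales without centering, which keeps the matrix sparse
scaler = StandardScaler(with_mean=False)
# Note: the vectorizer and scaler are fitted on the full dataset before
# cross-validation, so each fold's features have seen its held-out sentences
X_train = vectorizer.fit_transform(X_train)
X_train = scaler.fit_transform(X_train)
model = SGDClassifier(max_iter=7000)
# One macro-averaged F1 score is returned per cross-validation fold
scores = cross_val_score(model, X_train, y_train, scoring='f1_macro')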
print('Results:')
print(scores)




Results:
[0.99536094 0.99527404 0.99538954 0.99539333 0.99531734]
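
Each number above is a macro-averaged F1 score for one cross-validation fold: F1 is computed separately for each class and the per-class values are averaged with equal weight. The following is a minimal pure-Python sketch of that metric. The function `macro_f1` and the toy labels are hypothetical and only illustrate the arithmetic; they are not the notebook's data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    f1_scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels: one 'target' sentence is misclassified
y_true = ["target", "target", "non-target", "non-target", "non-target"]
y_pred = ["target", "non-target", "non-target", "non-target", "non-target"]
print(round(macro_f1(y_true, y_pred), 4))
# prints 0.7619
```

Because both classes count equally in the average, a high macro F1 indicates that the smaller class is being recognised well too, not only the larger one.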
