Task details

This task should not take you more than 3-4 hours. You are a language engineer working on a lexicon delivery for the Kazakh language. You are provided with corpus data (a corpus of sentences) where each sentence has been tokenised, and each token has been lemmatized and annotated with additional information including part-of-speech tags and morphological features.

A short sample of the corpus data is provided with the link below: sample_parsed_sentences.json

Your task is to take this input and process it using Python. You should use data classes (you can use Pydantic if you are familiar with this package) to produce the output. The output should be a JSON file and should contain the following information:

· An entry per lemma for all lemmas in the sample_parsed_sentences file

· The part of speech label and all inflection information per lemma

· A total frequency count for each lemma

· A total frequency count for each wordform per lemma

Please share your code in a public git repository. Please provide the following additional details along with your submission:

1. Any documentation you feel necessary to understand and run your solution

2. How you foresee your solution running in a production environment (you do not need to build the actual pipeline). Can you mention any cloud services you are familiar with that could be used to run your solution?

3. Any other details you would like to share with us

In [1]:
!pip install ijson

In [25]:
from pydantic import BaseModel, ValidationError
from typing import Optional
import ijson
import json

In [9]:
class WordForm(BaseModel):
    text: str

class Lemma(BaseModel):
    lemma: str

class Feats(BaseModel):
    pos: str
    pos_finegrained: str
    # Explicit default: Pydantic v2 treats Optional without a default as required.
    feats: Optional[str] = None

In [32]:
lemmas = {}

input_file_path = 'sample_parsed_sentences.json'
with open(input_file_path, "rb") as f:
    for k, v in ijson.kvitems(f, 'sentences.item'):
        if k == 'tokens':
            for obj in v:
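                try:
                    lemma_obj = Lemma(**obj)
                    wordform_obj = WordForm(**obj)
                    feats_obj = Feats(**obj)
                except ValidationError as e:
                    # Skip malformed tokens; without this, the objects from the
                    # previous iteration would be counted a second time.
                    print(e)
                    continue

                lemma = lemma_obj.lemma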

                if lemma not in lemmas:
                    lemmas[lemma] = feats_obj.dict()
                    lemmas[lemma].update({'lemma_total_frequency_count': 1})
                    lemmas[lemma]['wordform_total_frequency_count'] = {}
                else:
                    lemmas[lemma]['lemma_total_frequency_count'] +=1

                wordform = wordform_obj.text
                if wordform not in lemmas[lemma]['wordform_total_frequency_count']:
                    lemmas[lemma]['wordform_total_frequency_count'][wordform] = 1
                else:
                    lemmas[lemma]['wordform_total_frequency_count'][wordform] += 1

# ensure_ascii=False keeps Kazakh text readable, so write the file as UTF-8 explicitly.
with open("lemmas_out.json", "w", encoding="utf-8") as outfile:
    json.dump(lemmas, outfile, indent=4, ensure_ascii=False)
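The cell above stores the part of speech and features of a lemma's first occurrence only. A minimal sketch of one way to keep the inflection information for every wordform is shown here, using standard-library data classes. The `LexiconEntry` class, the `build_lexicon` function, the choice to key entries by (lemma, pos), and the sample tokens with their feature strings are all hypothetical illustrations; only the field names `text`, `lemma`, `pos` and `feats` come from the cells above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LexiconEntry:
    """Hypothetical output record: one entry per (lemma, pos) pair."""
    lemma: str
    pos: str
    frequency: int = 0
    wordforms: dict = field(default_factory=dict)    # wordform -> count
    inflections: dict = field(default_factory=dict)  # wordform -> distinct feats strings


def build_lexicon(tokens):
    """Aggregate token dicts into LexiconEntry objects keyed by (lemma, pos)."""
    entries = {}
    for tok in tokens:
        key = (tok["lemma"], tok["pos"])
        entry = entries.setdefault(key, LexiconEntry(tok["lemma"], tok["pos"]))
        entry.frequency += 1
        entry.wordforms[tok["text"]] = entry.wordforms.get(tok["text"], 0) + 1
        feats = tok.get("feats")
        if feats:
            seen = entry.inflections.setdefault(tok["text"], [])
            if feats not in seen:
                seen.append(feats)
    return entries


# Made-up tokens for illustration; the real feats format may differ.
sample_tokens = [
    {"text": "кітап", "lemma": "кітап", "pos": "NOUN", "feats": "Case=Nom|Number=Sing"},
    {"text": "кітаптар", "lemma": "кітап", "pos": "NOUN", "feats": "Case=Nom|Number=Plur"},
    {"text": "кітап", "lemma": "кітап", "pos": "NOUN", "feats": "Case=Nom|Number=Sing"},
]

lexicon = build_lexicon(sample_tokens)
print(json.dumps([asdict(e) for e in lexicon.values()], ensure_ascii=False, indent=2))
```

Keying by (lemma, pos) keeps homographs with different parts of speech apart; whether that is wanted depends on the lexicon specification.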

Notes:

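Set the `input_file_path` variable to the location of the input file on your machine. The file is read with ijson, which streams the JSON instead of loading the whole document into memory, so potentially large corpora can be processed. The part-of-speech and feature values stored for a lemma are those of its first occurrence. For larger corpora, processing time could be reduced by splitting the input into shards, counting each shard in parallel, and merging the partial counts.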
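As a rough sketch of the merge step that parallel or distributed processing would need, the function below combines partial results produced from separate shards of the corpus. The name `merge_lexicons` and the sample shards are hypothetical; the two count keys are the ones used in the cell above. How the shards are produced and scheduled is left open.

```python
from collections import Counter

LEMMA_COUNT = "lemma_total_frequency_count"
WORDFORM_COUNT = "wordform_total_frequency_count"


def merge_lexicons(partials):
    """Merge per-shard lemma dictionaries into a single dictionary."""
    merged = {}
    for partial in partials:
        for lemma, entry in partial.items():
            if lemma not in merged:
                # Descriptive fields (pos, feats, ...) come from the first shard seen.
                merged[lemma] = {
                    k: v for k, v in entry.items()
                    if k not in (LEMMA_COUNT, WORDFORM_COUNT)
                }
                merged[lemma][LEMMA_COUNT] = 0
                merged[lemma][WORDFORM_COUNT] = Counter()
            merged[lemma][LEMMA_COUNT] += entry[LEMMA_COUNT]
            # Counter.update adds counts instead of overwriting them.
            merged[lemma][WORDFORM_COUNT].update(entry[WORDFORM_COUNT])
    return merged


# Made-up shard results for illustration.
shard_a = {"кітап": {"pos": "NOUN", LEMMA_COUNT: 2,
                     WORDFORM_COUNT: {"кітап": 1, "кітаптар": 1}}}
shard_b = {"кітап": {"pos": "NOUN", LEMMA_COUNT: 1, WORDFORM_COUNT: {"кітап": 1}},
           "үй": {"pos": "NOUN", LEMMA_COUNT: 1, WORDFORM_COUNT: {"үй": 1}}}

merged = merge_lexicons([shard_a, shard_b])
```

Because the merge only adds counts, the shards can be processed in any order and by any number of workers.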

Running on cloud:

To run in the cloud, for example on Google App Engine, the script can be wrapped in a web application that exposes an API endpoint. Clients send the input data to this endpoint in a REST request and receive the resulting lexicon in the response.
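One possible shape for such an endpoint is a small WSGI application, sketched below with the standard library only. This is an illustration under assumptions, not a deployment recipe: the request format (`{"tokens": [...]}`), the `count_lemmas` helper, and the `call_app` test helper are hypothetical, the endpoint returns lemma counts only, and any platform-specific configuration is not shown.

```python
import io
import json
from collections import Counter


def count_lemmas(tokens):
    """Simplified stand-in for the lexicon builder: lemma -> frequency."""
    return dict(Counter(tok["lemma"] for tok in tokens))


def application(environ, start_response):
    """WSGI entry point: accepts a JSON body {"tokens": [...]}, returns counts."""
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length))
        result = count_lemmas(payload["tokens"])
        status = "200 OK"
    except (ValueError, KeyError, TypeError, AttributeError) as exc:
        # Malformed JSON or missing fields are reported as a client error.
        result = {"error": str(exc)}
        status = "400 Bad Request"
    body = json.dumps(result, ensure_ascii=False).encode("utf-8")
    start_response(status, [
        ("Content-Type", "application/json; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


def call_app(raw_body):
    """Invoke the app in-process, without starting a server (for local checks)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    environ = {
        "REQUEST_METHOD": "POST",
        "CONTENT_LENGTH": str(len(raw_body)),
        "wsgi.input": io.BytesIO(raw_body),
    }
    chunks = application(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))
```

For a quick local run, `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` from the standard library serves the same function over HTTP.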