No due date Last updated 28 days ago

    Some good software engineering wish-list items to gamify the app, give it more impact for users, and make it more fun, so that users will use it more...

     
    No due date Last updated 28 days ago

    Main Update: Upload session info to LingSync server w/ authentication.

     
    100% complete
    No due date Last updated over 1 year ago
     
    100% complete
    Past due by almost 3 years Last updated about 2 years ago

    Tasks: Harvest data and convert to JSON format (total ~200 hrs); create an XML-JSON conversion script (total ~200 hrs).

    Milestones: New databases entered into LingSync; completed XML-JSON conversion script; publication of the scripts with documentation on GitHub.

    Success criteria: Conversion scripts adapt to different annotation schemas (90% accuracy); the converted data correctly reflects the original annotation schema (100% accuracy). Success will be measured by linguists comparing small data sets.

    Risks: Unadaptable conversion scripts may lose or mix up some data elements in the converted data. To avoid this risk, scripts will be tested with databases of varying annotation schemas.

    Data to test: Morpho Challenge, CHILDES, data from almaya, and others. Maybe re-connect with SIL to see if there is some data they would want us to try?

    We should provide an encryption/decryption hook in the import tests so that we can test import with private data, and investigate import with encryption at the morpheme level so that all the services can still run, but with encrypted data as input (no need to decrypt before running the service); see the sketch below.

    We should go with a plan of one folder per database to be imported, with additional metadata, consent forms and other info in the same directory, and the directory replicated on the two department servers ("first be stored in their original formats on the local servers in the two partner institutions, Harvard University and Concordia University, prior to the conversion into JSON format. Storage in two independent servers will prevent accidental loss of the data, and ensures access to the data even if one of the servers is down.")

    ... The schema of linguistic annotation in the data files also varies among databases, reflecting the different purposes for which the databases are created. If the purpose is to document folktales in Marshallese, for example, sentences in the original language and the translation will suffice. If the purpose is the theoretical analysis of sentence structure, morpheme glosses and POS tags will be necessary. If the purpose is to analyze vowel quality in spoken Marshallese, audio recording of utterances aligned with phonetic transcription will be required. For language databases to be useful for linguistic analysis as well as for creating language teaching materials, annotation of morpheme glosses and POS tags is essential. ...
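    A minimal sketch of what the encryption/decryption hook could look like, in Node.js. The codec interface, the makeAesCodec name and the importDirectory entry point are hypothetical, for illustration only; the crypto calls are the real Node.js API.

        const crypto = require('crypto');

        // Hypothetical codec that the import tests could pass in: values are
        // encrypted at the morpheme level so services can run on encrypted input.
        function makeAesCodec(key) {
          return {
            encrypt: function (text) {
              var iv = crypto.randomBytes(16);
              var cipher = crypto.createCipheriv('aes-256-cbc', key, iv);
              return iv.toString('hex') + ':' + cipher.update(text, 'utf8', 'hex') + cipher.final('hex');
            },
            decrypt: function (payload) {
              var parts = payload.split(':');
              var decipher = crypto.createDecipheriv('aes-256-cbc', key, Buffer.from(parts[0], 'hex'));
              return decipher.update(parts[1], 'hex', 'utf8') + decipher.final('utf8');
            }
          };
        }

        // Usage in a test (importDirectory is a placeholder for the real import entry point):
        // importDirectory('./databases/some-private-corpus', { codec: makeAesCodec(crypto.randomBytes(32)) });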

     
    100% complete
    No due date Last updated over 2 years ago
     
    No due date Last updated over 2 years ago
     
    100% complete
    No due date Last updated over 2 years ago
     
    44% complete
    No due date Last updated over 2 years ago

    We used to have a Chrome app to go offline; maintaining it and updating it to the latest specs for the Chrome store, while Chrome apps were still new, was very difficult and often resulted in urgent updates in order to publish new versions. Since 2012, various ways of building native desktop apps using a Chromium frame have evolved. In this milestone we will make a Chromium-based container which we can use to make native desktop apps for any of the FieldDB apps (the corpus pages, experimentation dashboard, lexicon browser, Dative, prototype, spreadsheet, etc.)
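    As an illustration, here is a minimal sketch of such a container using Electron, one of the Chromium-based shells that has since emerged. The milestone does not commit to a specific shell, and the URL below is a placeholder for whichever FieldDB app gets wrapped.

        // main.js for a minimal Electron wrapper (assumes: npm install electron)
        const { app, BrowserWindow } = require('electron');

        const APP_URL = 'https://example.org/corpus-pages'; // placeholder for any FieldDB app

        app.whenReady().then(function () {
          const win = new BrowserWindow({ width: 1200, height: 800 });
          win.loadURL(APP_URL);
        });

        // Quit when all windows are closed (macOS conventions aside, for brevity)
        app.on('window-all-closed', function () {
          app.quit();
        });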

     
    No due date Last updated over 2 years ago
     
    100% complete
    No due date Last updated over 2 years ago
     
    100% complete
    No due date Last updated over 2 years ago

    Set up resources for people to find out about, learn more about, and interact with FieldDB, including Facebook, Twitter, a website, tutorial videos, a white paper, a wiki, etc. In order for the project to survive, we need people to find out about it, share it with friends and, more importantly, find more programming linguists who want to join in the development of new tools and webservices :)

     
    100% complete
    No due date Last updated almost 3 years ago
     
    16% complete
    No due date Last updated almost 3 years ago
     
    No due date Last updated almost 3 years ago

    This milestone contains links, screencasts and tutorials that the team has put together for each other and for future interns or other linguists who want to learn how to program/script to save time in data collection/analysis. Most of the resources focus on JavaScript, HTML and CSS, since most of the app is written using these technologies. One of the goals of the project is to collect resources to train students (and future linguists) to program and to design scripts, applications and databases for research purposes. Currently most linguists receive no such training, despite the fact that it is an invaluable skill for both data collection and analysis.

     
    No due date Last updated almost 3 years ago
     
    Past due by about 3 years Last updated about 3 years ago

    Currently we have no lexicon database (we have an implementation of a connected graph in the client libraries, which we use to do glossing and to display morpheme precedence relations, and a few map reduces that index the corpus and build a static index of morphemes), but we originally planned to have a lexicon database which could contain extra meta information about lexical items (edits, suggestions and other information, so that we can help automate cleaning and connecting lexicons). Time to stitch it together into a dashboard... and make it easier for users to explore their data via a lexicon.

    Tasks: Human annotation of a small portion of selected datasets using the automated morpheme segmentation tool and the morpheme-gloss alignment checker (continued from M5-6); create the lexicon server (~300 hrs).

    Milestones: Integration of the lexicon server, with code and documentation published on GitHub.

    Success criteria: The lexicon server successfully extracts and maintains morpheme entries in databases. Tested by linguists and language consultants to measure the relevance of the extractions and the accuracy of lexical entries (ideally >90%).

    ... A lexicon server, independent of databases for language corpora, stores candidate morphemes created by auto-segmentation in (i), and the morphemes and glosses confirmed by human annotation in (iii), which is aided by the alignment checking tool created in (ii). Based on the morphemes in the lexicon server, the cleaning scripts suggest possible morpheme segmentation and glosses to human annotators (vi) and warn of any inconsistent annotation (v). There is a script in the system that extracts and indexes morphemes from annotated data together with their frequencies, and visualizes their distribution in a corpus. The lexicon server will be built on the existing script using ElasticSearch (which uses Lucene), an open-source information retrieval library that features text-indexing as well as the spell-checking, fuzzy search and highlighting relevant to the data cleaning tools. While the scripts for automated segmentation (i) and morpheme extraction (iv) operate in the background, the resulting morphological lexicon needs a user interface that visually represents the lexicon in a transparent and informative manner, in order to facilitate linguistic analysis as well as to make linguistic data more accessible to language teachers and learners. Dendrogram graphs will be intuitive and useful in providing information about the categories of the morphemes (e.g. person markers, tense-aspect markers), while connected graphs will capture relationships among morphemes (e.g. negative concord). Project members will seek input from fellow researchers and language teachers/learners during the development process in order to create effective and useful visualizations. ...
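    Since the plan above names ElasticSearch, here is a rough sketch of how the lexicon server could index morphemes and do fuzzy lookups via the ElasticSearch REST API (Node 18+ for the global fetch). The index name, field names and sample entry are assumptions, not the actual FieldDB schema.

        const ES = 'http://localhost:9200';

        // Store a candidate or confirmed morpheme entry in the lexicon index
        async function indexMorpheme(entry) {
          await fetch(ES + '/lexicon/_doc', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(entry) // e.g. { morpheme: 'nemi', gloss: 'see', frequency: 42, confirmed: false }
          });
        }

        // Fuzzy search lets the cleaning scripts match near-miss segmentations
        async function fuzzyLookup(candidate) {
          const res = await fetch(ES + '/lexicon/_search', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({
              query: { fuzzy: { morpheme: { value: candidate, fuzziness: 'AUTO' } } }
            })
          });
          const body = await res.json();
          return body.hits.hits.map(function (hit) { return hit._source; });
        }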

     
    No due date Last updated about 3 years ago
     
    100% complete
    No due date Last updated about 3 years ago

    We plan on integrating more phonetics/phonology into the app. We might start with Phonological Search, or with integrating the ProsodyLab Aligner into the data so that users can get aligned TextGrids of the audio and orthography lines in the corpus (a rough integration sketch appears below). For code and more info about the ProsodyLab Aligner see: https://github.com/kylebgorman/Prosodylab-Aligner

    Tasks: Integration of the Prosodylab Forced Aligner to train new languages (~550 hrs); evaluation of speech-text alignment.

    Milestone: Functional speech-text alignment integrated into LingSync; publish the code and documentation on GitHub; publish the evaluation result on the website.

    Success criteria: Speech-text alignment is executed with > ~80% accuracy when evaluated against human alignment.

    Risks: The aligner may take longer to integrate than the time allotted. In this case, the team will host hackathons focused on the integration of the aligner with the app.
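    A rough sketch of the integration point: a thin wrapper that shells out to the aligner and hands the resulting TextGrids back to the app. The exact command line and flags below are assumptions; check the Prosodylab-Aligner README for the real invocation.

        const { execFile } = require('child_process');

        // dataDir is assumed to contain matching .wav/.lab (audio/transcription) pairs
        function align(dataDir, callback) {
          // Hypothetical invocation — verify against the aligner's documentation
          execFile('python', ['-m', 'aligner', '-a', dataDir], function (err, stdout, stderr) {
            if (err) {
              return callback(err);
            }
            // The aligner writes Praat TextGrids next to the audio files; the app
            // could then attach each TextGrid to the corresponding datum's audio field.
            callback(null, stdout);
          });
        }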

     
    100% complete
    No due date Last updated about 3 years ago

    The online talking dictionary at http://www.mikmaqonline.org/servlet/dictionaryFrameSet.html has quite a lot of data, including audio recordings of headwords and example sentences, but its search features are limited and it's not collaboratively editable as-is. Crawl it and import it into LingSync so it's more usable: this will be an example of importing dictionary-structured data, with outcomes that are helpful for community members.
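    A rough crawling sketch (Node 18+ plus the cheerio package). The selectors and the datum shape are hypothetical — the servlet's actual markup needs to be inspected first.

        const cheerio = require('cheerio');

        async function scrapeEntry(url) {
          const html = await (await fetch(url)).text();
          const $ = cheerio.load(html);
          return {
            utterance: $('.headword').first().text().trim(),     // hypothetical selector
            translation: $('.definition').first().text().trim(), // hypothetical selector
            audio: $('a[href$=".mp3"]').attr('href') || null     // headword recording, if present
          };
        }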

     
    100% complete
    No due date Last updated about 3 years ago

    In this milestone we will extract the existing export options (cesine) and then convert them into a more robust npm module for export, and a bower module for export widgets to be embedded in the spreadsheet app and/or corpus pages (a sketch of one such exporter appears at the end of this section).

    Existing export:
    - LaTeX
    - CSV
    - JSON
    - RDF
    - lexicon

    TODOs:
    - Improve all of the above
    - Zip of entire database including audio files etc.
    - Add richer support for ordering and sorting datum
    - Add support for selecting certain fields etc. :)
    - XML export
    - OLAC export
    - Investigate endangeredlanguages interoperability
    - Investigate SoundCloud interoperability
    - Export spidered version of corpus public pages to a CD .iso for distribution in villages :)

    More details: ... OLAC catalogues and provides links to existing natural language resources in various formats, and links to tools necessary to view the data. Similarly, TalkBank gathers repositories of natural language data created using CLAN (MacWhinney 2000). LDC distributes text and speech databases as resources for linguistic and language technology research, while TLA archives digital language databases and provides annotation and data management tools such as ELAN and Arbil. ...
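    As an illustration of the module's likely shape, here is a sketch of one exporter: a datum rendered as an interlinear gloss example using the LaTeX gb4e package. The datum field names are assumed for illustration, and gb4e is just one common choice — the existing exporter's actual format may differ.

        // Convert one datum to a gb4e example; escape LaTeX special characters
        function datumToLatex(datum) {
          var escape = function (s) { return String(s).replace(/([&%$#_{}])/g, '\\$1'); };
          return [
            '\\begin{exe}',
            '  \\ex',
            '  \\gll ' + escape(datum.morphemes) + ' \\\\',
            '       ' + escape(datum.gloss) + ' \\\\',
            '  \\glt `' + escape(datum.translation) + "'",
            '\\end{exe}'
          ].join('\n');
        }

        // datumToLatex({ morphemes: 'ni-ka-kwen', gloss: '1-FUT-take', translation: 'I will take it' })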

     
    No due date Last updated about 3 years ago
     
    100% complete
    No due date Last updated about 3 years ago
     
    100% complete
    No due date Last updated about 3 years ago

    We can use the learn-x code base to redeploy the Learn Migmaq prototype so that it can be used. While the prototype had support for very structured language lessons, maybe just letting users build lessons is better...

     
    100% complete
    No due date Last updated about 3 years ago

    A psycholing dashboard which is backed by FieldDB is now deployed at http://app.phophlo.ca

    - You can log in to FieldDB
    - You can get custom branded welcome/reset password emails
    - You can log out
    - If the user was logged in, it will show their most recent dashboard
    - It shows the prototype results screen if you go to the base URL
    - If the user is not logged in, it will show a welcome page (with customized branding for signup and login)

    Next week's tasks:
    - Implement the Register buttons
    - Implement the dashboard that appears after you log in (it currently just shows default FieldDB info, nothing which would make sense to a psycholinguistics experiment user)

     
    100% complete
    No due date Last updated about 3 years ago
     
    No due date Last updated about 3 years ago
     
    100% complete
    No due date Last updated over 3 years ago
     
    No due date Last updated over 3 years ago
     
    100% complete
    No due date Last updated over 4 years ago

    A tokenizer in the context of agglutinative languages means more than finding words; it means finding morphemes. We already have a tokenizer that relies on the gold standard info in the morphemes line of corpora (essentially a CouchDB map reduce that serves the tokens — sketched below — and a client library in the glosser that uses the tokens to propose segmentations of unknown words). This milestone is to design and implement a web service API which can run existing morpheme analyzers on corpora for benchmarking and for training new tokenizers (for ElasticSearch and/or for client apps), and also to make it easier for researchers to use the tokenizers that they have been using for their own projects. It is a subset and a more specialized example of the webservice we set up for Inuktitut (the lexicon web service).

    Tasks: Human annotation of a small portion of selected datasets using the morpheme-gloss alignment checker (~250 hrs, continues into M7); create an automated morpheme segmentation tool (~150 hrs); evaluate the accuracy of the morpheme segmentation tool.

    Milestones: Functional automated morpheme segmentation tool with code and documentation published on GitHub; the accuracy evaluation result published on the website.

    Success criteria: Average >65% accuracy in morpheme segmentation, to be measured against non-automated segmentation by linguists and language consultants.

    Risks: Automated morpheme segmentation may not reach the desired accuracy with certain types of languages (e.g. polysynthetic), which in itself will pose new, interesting research questions in Natural Language Processing. To mitigate the potentially low accuracy, some computer-assisted hand annotation will be required.

    ... Schone and Jurafsky (2001) employed induced semantics (latent semantic analysis) with affix frequency and local syntactic context, which resulted in an F-score of 80% on average with English, Dutch and German data. The algorithm in Bernhard (2006) relied on transitional probability and segment alignment. It produced an F-score of on average 64% with English and Finnish and 60% with Turkish (see Bernhard 2006). The algorithm in Goldsmith (2001) exploits minimum description length (MDL) in conjunction with expectation maximization (EM). It obtained above 80% reliability in bi-morphemic analysis of words from corpora of English and French (see Goldsmith 2001). Goldsmith's algorithm is publicly available as Linguistica. ...
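    For reference, the kind of CouchDB view described above could look like the following design-document sketch. The datum field name (morphemes) and the separator pattern are assumptions; the real view in the corpus may differ.

        {
          "_id": "_design/tokenizer",
          "views": {
            "morphemes": {
              "map": "function (doc) { if (doc.morphemes) { doc.morphemes.split(/[-= ]+/).forEach(function (m) { if (m) { emit(m, 1); } }); } }",
              "reduce": "_sum"
            }
          }
        }

    Querying the view with GET /corpus/_design/tokenizer/_view/morphemes?group=true would then return corpus-wide morpheme frequencies, which is what the glosser's client library consumes.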

     
    No due date Last updated over 4 years ago

    Allow community users to add data via the spreadsheet module using a different interface from the linguists' one. The default view has fewer fields and removes the linguist-specific ones so as not to overwhelm users; e.g. it shows only utterance, translation, audio and notes. However, users can add additional custom fields, as with the existing spreadsheet module. This version is an approximation of what we think community users might want, to give people an idea of what's possible; other custom skins can be created in response to individual users' needs. (Potentially a skin based on the fields in the Mi'kmaq Online talking dictionary? http://www.mikmaqonline.org/servlet/dictionaryFrameSet.html) A sketch of a possible skin configuration follows below.

    Tasks: Workshop (Canadian & US teams, community language workers, with guest speakers); training community users in annotation using the augmented client app (~20 hrs); test annotation time cost and error rate comparing expert (linguist) vs. non-expert (community user).

    Milestones: Publish the time cost test result on the website and submit the results for publication in a journal specializing in computational linguistics.

    Success criteria: No significant difference in annotation time and error rate between experts and non-experts. To be measured by comparing volunteer sample groups using the same data.

    Risks: The user interface may not be immediately transparent to non-experts with only a small amount of annotation training. The project will seek the input of non-experts about the user interface in order to avoid this setback.
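    A sketch of what such a skin configuration might look like; the format itself is hypothetical, and only the field names echo the description above.

        var communitySkin = {
          defaultFields: ['utterance', 'translation', 'audio', 'notes'],
          hiddenFields: ['morphemes', 'gloss', 'syntacticCategory'], // linguist-specific fields
          allowCustomFields: true // users can still add fields, as in the existing spreadsheet module
        };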

     
    No due date Last updated over 4 years ago

    Tentative:
    - Allow user to download session info from the corpus server
    - Allow user to edit session info on the device for downloaded sessions
    - Investigate how this can be done for downloaded sessions, given that session info is stored in each datum
    - Upload signed beta 1 release APK to the Google Play store

     
    No due date Last updated over 4 years ago

    Another essential part of this project involves automatically segmenting audio files and aligning them with their transcriptions in a variety of different languages. This part of the project would be useful for automated audio-to-text transcription, not only for linguistic research but also for other practical applications (e.g., subtitling).

     
    No due date Last updated about 5 years ago

    Allow user to change field names and types (e.g. date vs. text field). Upload signed beta 2 release APK to the Google Play store.