# Retrieving grammatical information from Morphalou 3  

[To be further updated]:
* check against the XML files that *give_grammatical_info* does not miss any forms or relevant grammatical attributes  
* provide the code as a separate script?  
* a basic query can take ~2 min: see if this can be reduced
* write lookup(string, feature); for the moment, only lookup(string) is available

## 1. Data Loading & XML Parsing

The following cell loads and parses the Morphalou3 data. This can take a while (~10 min).

In [1]:
from bs4 import BeautifulSoup
import re

# Paths to Morphalou3 data files
FILE_PATHS = ["data/adjective_Morphalou3_LMF.xml",
              "data/adverb_Morphalou3_LMF.xml",
              "data/commonNoun_Morphalou3_LMF.xml",
              "data/grammaticalWords_Morphalou3_LMF.xml",
              "data/interjection_Morphalou3_LMF.xml",
              "data/noCategory_Morphalou3_LMF.xml",
              "data/verb_Morphalou3_LMF.xml"]

SOUPS = []

# Parsing of the whole dataset
for file_path in FILE_PATHS:
    with open(file_path, "rb") as f:
        file = f.read()
        soup = BeautifulSoup(file, "xml")
        SOUPS.append(soup)
        print(f"{file_path}: parsed.")
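Since every query scans all the parsed trees, one possible way to shorten query time is to build a lookup table once, right after parsing. The sketch below illustrates the idea on a tiny hand-written XML sample using the standard library's `xml.etree.ElementTree` rather than BeautifulSoup, so that it runs without the data files. The sample layout and the names `SAMPLE_XML`, `build_id_index` and `find_entries_exact` are assumptions made for illustration, not part of Morphalou3 or of this notebook. It only returns exact matches on the base id, unlike the end-of-id matching done by *find_forms*.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical miniature of a Morphalou3 file (assumed layout, to be checked
# against the real data, which is far larger).
SAMPLE_XML = """
<lexicon>
  <lexicalEntry id="parent_1">
    <formSet>
      <lemmatizedForm>
        <orthography>parent</orthography>
        <grammaticalCategory>commonNoun</grammaticalCategory>
      </lemmatizedForm>
      <inflectedForm>
        <orthography>parents</orthography>
        <grammaticalNumber>plural</grammaticalNumber>
      </inflectedForm>
    </formSet>
  </lexicalEntry>
  <lexicalEntry id="humain_1"><formSet/></lexicalEntry>
  <lexicalEntry id="humain_2"><formSet/></lexicalEntry>
</lexicon>
"""

def build_id_index(root):
    """Map each base form (the id without its trailing _n suffix) to its entries."""
    index = defaultdict(list)
    for entry in root.iter("lexicalEntry"):
        base = re.sub(r"_[^_]*$", "", entry.get("id", ""))
        index[base].append(entry)
    return index

# Built once; each later query is a dictionary access instead of a full scan
INDEX = build_id_index(ET.fromstring(SAMPLE_XML))

def find_entries_exact(form):
    """Return the entries whose base id equals `form` (exact match only)."""
    return INDEX.get(form, [])

print([e.get("id") for e in find_entries_exact("humain")])
```

Whether this pays off on the full dataset has not been measured here; building the index is itself a full pass over the data, so the gain would only show up from the second query onwards.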

In [None]:

def find_forms(form: str) -> list:
    """Return all nodes from SOUPS whose id ends with the given written form,
    optionally followed by a suffix such as _1 (e.g. <lexicalEntry id="parent_1">)."""
    entries = []
    # re.escape keeps characters such as "." or "(" in the form from being read as regex syntax
    pattern = re.compile(f"{re.escape(form)}(_.*)?$")

    # We will look for the form in all the Morphalou3 categories
    for soup_obj in SOUPS:
        entries.extend(soup_obj.find_all(id=pattern))

    return entries

def convert_to_dict(entries: list) -> dict:
    """Convert a list of <lexicalEntry> nodes into a readable dictionary of useful grammatical information.
    The resulting dictionary is indexed by lexical entry ids."""
    
    d = {}
    
    for entry in entries:
        # Keep only the element children of <formSet>: whitespace text nodes have no name.
        # The first element is the lemma, the following ones are the inflected forms.
        forms = [child for child in entry.formSet.children if child.name]
        if not forms:
            continue
        lemmatized_form, inflected_forms = forms[0], forms[1:]

        # Lemma
        lemma_attributes_dict = {
            attribute.name: attribute.text
            for attribute in lemmatized_form.children
            if attribute.name and attribute.name != "originatingEntry"
        }

        # Inflected forms
        inflected_forms_dict = {}
        for inflected_form in inflected_forms:
            if inflected_form.orthography is None:
                continue  # no written form to use as a key
            inflected_forms_dict[inflected_form.orthography.text] = {
                attribute.name: attribute.text
                for attribute in inflected_form.children
                if attribute.name and attribute.name not in ("originatingEntry", "orthography")
            }
                
        # Final dictionary
        d[entry['id']] = {"lemma": lemma_attributes_dict, "inflected_forms": inflected_forms_dict }

        
    return d

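def give_grammatical_info(form: str, feature: str = "all"):
    """Return the grammatical information of every lexical entry matching the written form."""
    if feature == "all":
        return convert_to_dict(find_forms(form))
    else:
        # to do: restrict the output to a single feature
        raise NotImplementedError("Filtering by feature is not implemented yet.")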
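The `feature` branch of *give_grammatical_info* is still marked as to do. One way it might be filled in is to post-process the dictionary returned by *convert_to_dict*, keeping a single feature per lemma and per inflected form. The helper name `select_feature` below is hypothetical, and the input is hand-written in the shape shown by the examples of this notebook; this is a minimal sketch rather than the intended implementation.

```python
def select_feature(info: dict, feature: str) -> dict:
    """Keep only `feature` for the lemma and each inflected form of every entry."""
    selected = {}
    for entry_id, entry in info.items():
        lemma = entry.get("lemma", {})
        inflected = entry.get("inflected_forms", {})
        selected[entry_id] = {
            # The lemma part stays empty when the lemma does not carry the feature
            "lemma": {feature: lemma[feature]} if feature in lemma else {},
            # Inflected forms that do not carry the feature are left out
            "inflected_forms": {
                form: {feature: attrs[feature]}
                for form, attrs in inflected.items()
                if feature in attrs
            },
        }
    return selected

# Hand-written input in the shape produced by convert_to_dict
sample = {
    "parent_1": {
        "lemma": {"orthography": "parent",
                  "grammaticalCategory": "commonNoun",
                  "grammaticalGender": "masculine"},
        "inflected_forms": {"parent": {"grammaticalNumber": "singular"},
                            "parents": {"grammaticalNumber": "plural"}},
    }
}

print(select_feature(sample, "grammaticalNumber"))
```

Filtering after the conversion keeps the XML traversal untouched, at the cost of building the full dictionary first.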

## 2. Querying Morphalou3 with a given form

The function *give_grammatical_info* returns all lexical entries in Morphalou3 that match a given written form. Because matching is done on the end of the entry id, compounds such as *arrière-grand-parent* are returned for *parent* as well. For each entry, the following information is available:  
* the **lemma** of the entry and its relevant features (orthography, grammaticalCategory, grammaticalGender, etc.)  
* its associated **inflected forms** and their relevant features (grammaticalNumber, grammaticalGender, etc.)
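The nested result can be awkward to scan when many entries match. As a rough illustration of its structure, the sketch below flattens it into one row per inflected form. The helper name `iter_inflected_forms` is hypothetical, and the input is a hand-written dictionary in the shape returned by *give_grammatical_info*.

```python
def iter_inflected_forms(info):
    """Yield (entry_id, category, written_form, features) for every inflected form."""
    for entry_id, entry in info.items():
        # The category is stored on the lemma; None if the lemma has no category
        category = entry["lemma"].get("grammaticalCategory")
        for written_form, features in entry["inflected_forms"].items():
            yield entry_id, category, written_form, features

# Hand-written result in the shape returned by give_grammatical_info
result = {
    "parent_1": {
        "lemma": {"orthography": "parent",
                  "grammaticalCategory": "commonNoun",
                  "grammaticalGender": "masculine"},
        "inflected_forms": {"parent": {"grammaticalNumber": "singular"},
                            "parents": {"grammaticalNumber": "plural"}},
    }
}

for row in iter_inflected_forms(result):
    print(row)
```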

In [14]:
# Example 1
give_grammatical_info("parent")

{'parent_1': {'lemma': {'orthography': 'parent',
   'grammaticalCategory': 'commonNoun',
   'grammaticalGender': 'masculine'},
  'inflected_forms': {'parent': {'grammaticalNumber': 'singular'},
   'parents': {'grammaticalNumber': 'plural'}}},
 'arrière-grand-parent_1': {'lemma': {'orthography': 'arrière-grand-parent',
   'grammaticalCategory': 'commonNoun',
   'grammaticalGender': 'masculine'},
  'inflected_forms': {'arrière-grand-parent': {'grammaticalNumber': 'singular'}}}}

In [15]:
# Example 2 : a form that can be a Noun or an Adjective
give_grammatical_info("humain")

{'a-humain_1': {'lemma': {'orthography': 'a-humain',
   'grammaticalCategory': 'adjective'},
  'inflected_forms': {'a-humain': {'grammaticalNumber': 'singular',
    'grammaticalGender': 'masculine'}}},
 'anti-humain_1': {'lemma': {'orthography': 'anti-humain',
   'grammaticalCategory': 'commonNoun',
   'grammaticalGender': 'masculine'},
  'inflected_forms': {'anti-humain': {'grammaticalNumber': 'singular'}}},
 'anté-humain_1': {'lemma': {'orthography': 'anté-humain',
   'grammaticalCategory': 'commonNoun'},
  'inflected_forms': {'anté-humain': {'grammaticalNumber': 'singular'}}},
 'humain_1': {'lemma': {'orthography': 'humain',
   'grammaticalCategory': 'commonNoun',
   'grammaticalGender': 'masculine'},
  'inflected_forms': {'humain': {'grammaticalNumber': 'singular'},
   'humains': {'grammaticalNumber': 'plural'}}},
 'infra-humain_1': {'lemma': {'orthography': 'infra-humain',
   'grammaticalCategory': 'adjective'},
  'inflected_forms': {'infra-humain': {'grammaticalNumber': 'singular',

In [16]:
# Example 3 : A non-existing word
give_grammatical_info("xiuzc")

{}