
# Designing the morphological types of the major parts of speech in Spanish

## Nouns
- inherent features: -
- inflective features:

feature | possible values
--- | ---
number | `Sing`, `Plur`
gender | `Masc`, `Fem`

I am not sure it is correct to say that Spanish has no inherent features when it comes to nouns: in a sense, one could argue that case is one such feature.

Furthermore, according to [Wikipedia](https://es.wikipedia.org/wiki/G%C3%A9nero_gramatical_en_espa%C3%B1ol), gender is grammatical in Spanish. While this makes sense to me, at the same time the textbook (par. 2.3) says that

> A typical example is nouns and adjectives in languages like French and Italian: (...) morphologically, nouns inflect for number and have an inherent gender

and as far as I know, with respect to noun inflection, Spanish and Italian work the same way: while not all nouns have both a feminine and a masculine form, for many of them there is a systematic way to derive the masculine from the feminine and vice versa.

## Adjectives
- inherent features: -
- inflective features:

feature | possible values
--- | ---
number | `Sing`, `Plur`
gender | `Masc`, `Fem`
degree | `Pos`, `Cmp`, `Sup`, `Abs`

## Verbs
- inherent features: -
- inflective features:
  
feature | possible values
--- | ---
number | `Sing`, `Plur`
person | 1, 2, 3
tense | `Past`, `Pres`, `Fut`
mood | `Ind`, `Imp`, `Sub`, `Cnd` 
aspect | `Perf`, `Imp`, `Prog` (?)
voice | `Act`, `Pass`

Designing a morphological type proved harder for the verbs, as:
- while [the UD webpage](https://universaldependencies.org/u/feat/Mood.html#Cnd) considers the conditional a mood, many other sources, including [Wikipedia](https://es.wikipedia.org/wiki/Gram%C3%A1tica_del_espa%C3%B1ol#Verbo), are ambiguous about it, describing it sometimes as a mood and sometimes, for reasons that are unclear to me, as a tense
- it is not completely clear what to do with non-finite (impersonal) forms: for instance, I would say that the gerund has progressive aspect, but we'll see further down that `Prog` is nowhere to be found in the treebank
- it is hard for me to see the exact difference between [`Imp` tense](https://universaldependencies.org/u/feat/Tense.html#Imp) and [`Imp` aspect](https://universaldependencies.org/u/feat/Aspect.html#Imp).
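
As a rough sanity check on the three designs above, the tables can be written down as plain Python data and the size of each inflection table computed as the product of the number of values per feature. This is only a sketch: the name `DESIGN` and the helper `paradigm_size` are made up for this example, the feature names use UD-style capitalization for convenience, and the product is an upper bound, since not every combination of values exists in Spanish (there is no past imperative, for instance).

```python
# Feature inventories copied from the three tables above.
DESIGN = {
    "NOUN": {
        "Number": ["Sing", "Plur"],
        "Gender": ["Masc", "Fem"],
    },
    "ADJ": {
        "Number": ["Sing", "Plur"],
        "Gender": ["Masc", "Fem"],
        "Degree": ["Pos", "Cmp", "Sup", "Abs"],
    },
    "VERB": {
        "Number": ["Sing", "Plur"],
        "Person": ["1", "2", "3"],
        "Tense": ["Past", "Pres", "Fut"],
        "Mood": ["Ind", "Imp", "Sub", "Cnd"],
        "Aspect": ["Perf", "Imp", "Prog"],
        "Voice": ["Act", "Pass"],
    },
}


def paradigm_size(features):
    """Multiply the number of values of each feature (hypothetical helper)."""
    size = 1
    for values in features.values():
        size *= len(values)
    return size


for pos, features in DESIGN.items():
    print(pos, paradigm_size(features))
# NOUN 4
# ADJ 16
# VERB 432
```

The jump from 16 to 432 cells is one way to see why the verb type was the hardest of the three to design.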

# Statistics

In [0]:
from pandas import read_csv # cause it's easier to abuse read_csv

es_features = read_csv('es_pud-ud-test.conllu', sep="\t", comment="#", usecols=[3, 5], names=["POS", "features"])

intresting_POS = ['NOUN', 'ADJ', 'VERB']

sub_dataframes = [es_features[es_features['POS']==POS] for POS in intresting_POS]  
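
One caveat about the `read_csv` shortcut: `comment="#"` tells pandas to stop parsing a line at the first `#`, wherever it occurs, and the default quoting treats `"` as a quote character, so token lines whose word form is `#` or `"` may be read incorrectly. Whether that affects the counts below has not been checked here. A minimal alternative using only the standard library might look like the following sketch, where `read_pos_and_feats` is a hypothetical helper and the sample lines are a toy input written for this example, not copied from the treebank.

```python
def read_pos_and_feats(lines):
    """Yield (UPOS, FEATS) for every token line of a CoNLL-U file.

    `lines` is any iterable of strings, for example an open file.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and whole-line comments only.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            continue  # not a well-formed token line
        # Column 3 is UPOS and column 5 is FEATS (0-based indices).
        yield columns[3], columns[5]


# Toy input for illustration only.
sample = [
    "# text = La casa",
    "1\tLa\tel\tDET\t_\tDefinite=Def|Gender=Fem|Number=Sing|PronType=Art\t2\tdet\t_\t_",
    "2\tcasa\tcasa\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\t_",
    "",
]
print(list(read_pos_and_feats(sample)))
```

Because it only skips lines that start with `#` and never interprets quotes, this reader keeps every token line intact, at the cost of not producing a dataframe.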

In [0]:
def get_featurename_val_pairs(features_col):
  '''
  This monstrosity processes the sub-dataframes above and returns a list of tuples whose first element is 
  the name of a given feature and whose second element is its value.
  A FEATS value of '_' (no features) yields the one-element tuple ('_',).
  '''
  return list(map(lambda x: tuple(x.split('=')), [i for sub in map(lambda x: x.split('|'), features_col['features']) for i in sub]))
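
For readers who find the one-liner above hard to follow, the same flattening can be spelled out with explicit loops. This is a sketch rather than a drop-in replacement: the hypothetical `featurename_val_pairs` takes any iterable of FEATS strings instead of a dataframe, and, unlike the original, it skips the `_` placeholder instead of turning it into a one-element tuple.

```python
def featurename_val_pairs(feats_strings):
    """Flatten FEATS strings such as 'Gender=Fem|Number=Sing' into pairs."""
    pairs = []
    for feats in feats_strings:
        if feats == "_":  # CoNLL-U placeholder for "no features"
            continue
        for item in feats.split("|"):
            # partition splits on the first '=' only
            name, _, value = item.partition("=")
            pairs.append((name, value))
    return pairs


print(featurename_val_pairs(["Gender=Fem|Number=Sing", "_", "Number=Plur"]))
# [('Gender', 'Fem'), ('Number', 'Sing'), ('Number', 'Plur')]
```

With a dataframe like the ones above, it could be called as `featurename_val_pairs(sub_dataframes[0]['features'])`.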

In [0]:
feature_dict = {}
for i, POS in enumerate(intresting_POS):
  feature_dict[POS] = get_featurename_val_pairs(sub_dataframes[i])

In [0]:
def print_POS_features_stats(pairs):
  '''
  This even more monstrous function prints the things I want to know about the features
  and respective values of a certain POS.
  '''
  features = set(map(lambda pair: pair[0], pairs))
  print('features: {}'.format(features))
  for feature in features:
    values = list(map(lambda pair: pair[1] if len(pair) > 1 else "I don't think this is my fault!", filter(lambda pair: pair[0] == feature, pairs)))
    values_counts = [(value, values.count(value)) for value in set(values)]
    print('possible values of {}: {}'.format(feature, values_counts))
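
The counting itself can also be done in a single pass with `collections.Counter`, which avoids calling `values.count` once per distinct value. The sketch below assumes the same list of tuples as input; `feature_value_counts` is a hypothetical name, and tuples that are not proper (name, value) pairs, such as `('_',)`, are simply ignored.

```python
from collections import Counter, defaultdict


def feature_value_counts(pairs):
    """Map each feature name to a Counter of its values."""
    counts = defaultdict(Counter)
    for pair in pairs:
        if len(pair) != 2:  # e.g. ('_',) from an empty FEATS column
            continue
        name, value = pair
        counts[name][value] += 1
    return counts


# Toy input for illustration only.
pairs = [("Gender", "Fem"), ("Number", "Sing"), ("Number", "Plur"),
         ("Number", "Sing"), ("_",)]
for name, counter in sorted(feature_value_counts(pairs).items()):
    print("possible values of {}: {}".format(name, counter.most_common()))
# possible values of Gender: [('Fem', 1)]
# possible values of Number: [('Sing', 2), ('Plur', 1)]
```

Sorting the feature names also makes the output order stable between runs, whereas iterating over a `set` does not guarantee any particular order.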

## Nouns
When it comes to nouns, there is nothing unexpected. The singular forms are much more common in the treebank, but in my opinion that doesn't say anything interesting about the language.

In [5]:
print_POS_features_stats(feature_dict["NOUN"]) # the "possible values of _" line comes from the 2 nouns whose FEATS column is `_`, the CoNLL-U placeholder for "no features"

features: {'Gender', 'Number', '_'}
possible values of Gender: [('Fem', 1965), ('Masc', 2662)]
possible values of Number: [('Plur', 1370), ('Sing', 3349)]
possible values of _: [("I don't think this is my fault!", 2)]


## Adjectives
The same applies to adjectives. The fact that most of them are not explicitly assigned a value for their degree simply means that, unsurprisingly, most of them appear in their basic form, although I expected those to be marked as `Pos` instead.

In [6]:
print_POS_features_stats(feature_dict["ADJ"])

features: {'Degree', 'Gender', 'Number'}
possible values of Degree: [('Abs', 1), ('Cmp', 30), ('Sup', 5)]
possible values of Gender: [('Fem', 608), ('Masc', 806)]
possible values of Number: [('Plur', 432), ('Sing', 1002)]


## Verbs
The results are more interesting (or disappointing) when it comes to verbs:
- there are two extra features:
  - `Gender`, relatively rare, due to participles, which I had basically ignored like most other non-finite forms
  - [`VerbForm`](https://universaldependencies.org/u/feat/VerbForm.html). The existence of this feature would explain a lot about how to deal with non-finite forms (infinitives, participles...), but the fact that its value is always `Fin` is perplexing, so I did not modify my original design
- there are no instances of verbs in the imperative mood, but again, I think this only says something about the corpus, not about Spanish itself
- there are no instances of verbs with progressive aspect, which makes me think my assumptions about the gerund were wrong
- as I imagined, `Imp`erfect is considered to be an aspect, not a tense.

In [0]:
print_POS_features_stats(feature_dict["VERB"])

features: {'Gender', 'Person', 'Number', 'Mood', 'VerbForm', 'Aspect', 'Voice', 'Tense'}
possible values of Gender: [('Masc', 151), ('Fem', 83)]
possible values of Person: [('1', 51), ('2', 2), ('3', 1177)]
possible values of Number: [('Plur', 416), ('Sing', 1049)]
possible values of Mood: [('Ind', 1141), ('Sub', 67), ('Cnd', 22)]
possible values of VerbForm: [('Fin', 2115)]
possible values of Aspect: [('Perf', 504), ('Imp', 1241)]
possible values of Voice: [('Act', 1496), ('Pass', 84)]
possible values of Tense: [('Pres', 483), ('Past', 694), ('Fut', 31)]
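
The comparison between the verb design and the treebank can also be made mechanical with set differences. In the sketch below, `VERB_DESIGN` is copied from the verb table in the first part and `VERB_OBSERVED` from the output just above; both names are introduced only for this example.

```python
# Values proposed in the verb table of the design section.
VERB_DESIGN = {
    "Number": {"Sing", "Plur"},
    "Person": {"1", "2", "3"},
    "Tense": {"Past", "Pres", "Fut"},
    "Mood": {"Ind", "Imp", "Sub", "Cnd"},
    "Aspect": {"Perf", "Imp", "Prog"},
    "Voice": {"Act", "Pass"},
}

# Values attested in the treebank output above.
VERB_OBSERVED = {
    "Gender": {"Masc", "Fem"},
    "Person": {"1", "2", "3"},
    "Number": {"Plur", "Sing"},
    "Mood": {"Ind", "Sub", "Cnd"},
    "VerbForm": {"Fin"},
    "Aspect": {"Perf", "Imp"},
    "Voice": {"Act", "Pass"},
    "Tense": {"Pres", "Past", "Fut"},
}

# Features present in the treebank but absent from the design.
extra_features = sorted(set(VERB_OBSERVED) - set(VERB_DESIGN))

# Designed values that never show up in the treebank.
unattested = {}
for feature, values in VERB_DESIGN.items():
    missing = values - VERB_OBSERVED.get(feature, set())
    if missing:
        unattested[feature] = sorted(missing)

print(extra_features)  # ['Gender', 'VerbForm']
print(unattested)      # {'Mood': ['Imp'], 'Aspect': ['Prog']}
```

The two results match the observations listed above: `Gender` and `VerbForm` are the unexpected features, while imperative mood and progressive aspect are the designed values that the treebank does not attest.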
