# Tildes & acentos in Spanish and Spellcheck Testing

The following is a brief summary of the main accentuation rules in Spanish, adapted mainly from the *Real Academia Española* (RAE).

ref: http://lema.rae.es/dpd/srv/search?id=Adwesaq4ND64VT09xQ

**Definition**: An orthographic sign written to mark word stress according to a set of rules, or to distinguish **homonyms** (words with the same spelling but different meanings), in which case it is called a **tilde diacrítica**.

## Word categorization according to number of syllables 

- One syllable: **monosílabo**
- Two syllables: **bisílabo**
- Three syllables: **trisílabo**
- Four or more: **polisílabo**

## Word categorization according to stress 

Divide the word into syllables, then categorize it according to where the stress falls. Syllables are counted from right to left: the last syllable is the **última**, the one before it the **penúltima**, the third from the right the **antepenúltima**, and those further to the left the **anteantepenúltima**, **anteanteantepenúltima**, etc.

Ex. the word *estratificación* (es-tra-ti-fi-ca-ción), reading the last four syllables from left to right:

- 'ti' -> `anteantepenúltima`
- 'fi' -> `antepenúltima`
- 'ca' -> `penúltima`
- 'ción' -> `última`

### Categorization 

- A word is called **aguda** if it has the stress on the *última sílaba* 
- A word is called **grave** if it has the stress on the *penúltima sílaba* 
- A word is called **esdrújula** if it has the stress on the *antepenúltima sílaba* 
- A word is called **sobreesdrújula** if it has the stress on the *anteantepenúltima sílaba.* 
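
The four categories depend only on the position of the stressed syllable counted from the end of the word. A minimal sketch, assuming the word has already been split into syllables and the stressed syllable is known (the function name `classify_by_stress` and its arguments are illustrative, not part of any library):

```python
def classify_by_stress(syllables, stressed_index):
    """Return the stress category of a word.

    syllables      -- list of syllables, left to right (assumed to be given)
    stressed_index -- 0-based index of the stressed syllable in that list
    """
    # Position counted from the right: 1 = última, 2 = penúltima, ...
    from_end = len(syllables) - stressed_index
    if from_end == 1:
        return "aguda"
    if from_end == 2:
        return "grave"
    if from_end == 3:
        return "esdrújula"
    return "sobreesdrújula"


print(classify_by_stress(["can", "ción"], 1))            # aguda
print(classify_by_stress(["ár", "bol"], 0))              # grave
print(classify_by_stress(["te", "lé", "fo", "no"], 1))   # esdrújula
print(classify_by_stress(["dí", "ga", "me", "lo"], 0))   # sobreesdrújula
```

Syllabification itself is not attempted here; it would need its own rules for diphthongs and hiatus.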

## Accentuation rules according to number of syllables

### Polisílabos (2+ syllables)

- **Palabras agudas**: Write the accent when the word ends in **-n, -s, or a vowel**, unless the final -s is preceded by another consonant (e.g. *canción*, *café*, but *robots*).
- **Palabras graves**: Write the accent when the word **does not end in -n, -s, or a vowel**, and also when it ends in -s preceded by another consonant (e.g. *árbol*, *bíceps*).
- **Palabras esdrújulas and sobreesdrújulas**: Always write the accent (e.g. *teléfono*, *dígamelo*).

### Monosílabos (1 syllable) 
- They do not take a written accent, unless it is a **tilde diacrítica**.
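
The general rules for polisílabos and monosílabos can be sketched as a small decision function. This is only an illustration: the name `needs_tilde` is made up, the stress category and syllable count are assumed to be known already, and special cases such as hiatus (*país*) and the tilde diacrítica are deliberately left out.

```python
VOWELS = set("aeiouáéíóú")


def needs_tilde(word, category, n_syllables):
    """Decide whether the general rules call for a written accent.

    word        -- the word, used only to inspect its final letters
    category    -- 'aguda', 'grave', 'esdrújula' or 'sobreesdrújula'
    n_syllables -- number of syllables (assumed to be known)
    """
    if n_syllables == 1:
        return False  # monosílabos: only a tilde diacrítica could apply
    last = word[-1].lower()
    ends_n_s_vowel = last in VOWELS or last in "ns"
    # A final -s preceded by another consonant counts as "other consonant"
    if last == "s" and word[-2].lower() not in VOWELS:
        ends_n_s_vowel = False
    if category == "aguda":
        return ends_n_s_vowel
    if category == "grave":
        return not ends_n_s_vowel
    return True  # esdrújulas and sobreesdrújulas always take the accent


print(needs_tilde("cancion", "aguda", 2))       # True  (canción)
print(needs_tilde("robots", "aguda", 2))        # False
print(needs_tilde("arbol", "grave", 2))         # True  (árbol)
print(needs_tilde("casa", "grave", 2))          # False
print(needs_tilde("telefono", "esdrújula", 4))  # True  (teléfono)
```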


## Tilde diacrítica 

**Definition**: A **tilde diacrítica** is the accent mark written to distinguish homonyms in written text. It occurs most often on monosílabos.

The following is a dictionary of some of these:

In [198]:
# Imports 

import re
import spacy 
import pandas as pd 
import numpy as np
from IPython.display import display
from tqdm import tqdm
from tqdm import tqdm_notebook
from tqdm import trange
from spellchecker import SpellChecker 

spell = SpellChecker(language='es') 

In [2]:
# load the language model 
nlp = spacy.load("es_core_news_md")

Define the following three objects: 
- a dictionary whose keys are the words with and without tilde diacrítica, each mapping to a list of POS tags 
- a full list of all these words
- a list of tuples pairing the forms with and without the tilde 

Side note: spaCy POS accuracy for Spanish is about **96.9%**. 
See https://spacy.io/usage/facts-figures

In [3]:
# The tags below were assigned by hand, loosely following spaCy's annotation scheme: 
# https://spacy.io/api/annotation
# (spaCy itself labels prepositions 'ADP' and pronouns 'PRON', not 'PREP' / 'PRN')

# Keys are ordered as (plain, accented) pairs; all_pairs below relies on this order 
categ_dict ={
    'de':['PREP'], 
    'dé':['VERB'], 
    'el':['DET'],  
    'él':['PRON'], 
    'mas':['CONJ','CCONJ'], 
    'más':['ADV','ADJ','PRN'], 
    'mi':['ADJ','DET'], 
    'mí':['PRON'], 
    'se':['PRON'], 
    'sé':['VERB'], 
    'si':['CONJ','CCONJ'], 
    'sí':['ADV'], 
    'te':['PRON'], 
    'té':['NOUN'], 
    'tu':['DET'], 
    'tú':['PRON'], 
    'como':['CONJ','SCONJ'], 
    'cómo':['PRON'], 
    'cuando':['CONJ','SCONJ'], 
    'cuándo':['PRON'],
    'cual':['PRON'], 
    'cuál':['ADJ'],
    'cuanto':['CONJ','SCONJ'], 
    'cuánto':['DET'], 
    'donde':['PRON'], 
    'dónde':['PRON'], # question 
    'que':['SCONJ'], 
    'qué':['PRON'],  
    'quien':['PRON'], 
    'quién':['PRON'], # interrogative pronoun 
    'solo':['ADJ'], 
    'sólo':['ADV'],     
}

# Build the full word list and the list of (plain, accented) pairs 
full_list = list(categ_dict.keys())  # collection of tilde diacrítica words
all_pairs = list(zip(full_list[0::2], full_list[1::2]))


print(full_list)
print(all_pairs)

['de', 'dé', 'el', 'él', 'mas', 'más', 'mi', 'mí', 'se', 'sé', 'si', 'sí', 'te', 'té', 'tu', 'tú', 'como', 'cómo', 'cuando', 'cuándo', 'cual', 'cuál', 'cuanto', 'cuánto', 'donde', 'dónde', 'que', 'qué', 'quien', 'quién', 'solo', 'sólo']
[('de', 'dé'), ('el', 'él'), ('mas', 'más'), ('mi', 'mí'), ('se', 'sé'), ('si', 'sí'), ('te', 'té'), ('tu', 'tú'), ('como', 'cómo'), ('cuando', 'cuándo'), ('cual', 'cuál'), ('cuanto', 'cuánto'), ('donde', 'dónde'), ('que', 'qué'), ('quien', 'quién'), ('solo', 'sólo')]
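
Pairing words by their position in the list is sensitive to the order of the keys and to any entry whose counterpart is missing. As an alternative sketch (not what the notebook does), the pairs could be derived by stripping the acute accent and matching on the resulting base form. The helper names `strip_tilde` and `pair_by_base` are hypothetical; only the standard-library `unicodedata` module is used.

```python
import unicodedata

ACUTE = "\u0301"  # combining acute accent


def strip_tilde(word):
    """Remove acute accents only, leaving ñ and ü untouched."""
    decomposed = unicodedata.normalize("NFD", word)
    without_acute = decomposed.replace(ACUTE, "")
    return unicodedata.normalize("NFC", without_acute)


def pair_by_base(words):
    """Return (plain, accented) pairs for words that differ only by an acute accent."""
    accented = {strip_tilde(w): w for w in words if strip_tilde(w) != w}
    return [(base, accented[base]) for base in words if base in accented]


sample = ["de", "dé", "tu", "cuál", "cual", "mañana"]
print(pair_by_base(sample))  # [('de', 'dé'), ('cual', 'cuál')]
```

A word with no counterpart in the list (here `tu`) is simply skipped instead of shifting every later pair.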


In [108]:
# Notice that we can obtain the POS tags for all of these 
# Create some examples in which these words are employed
texts= ["De qué quieres hablar hoy? ", 
        "Quiero que me dé un poco de tiempo", 
       "Ya no quiero comer más!", 
       "Yo sé, tu quieres que él se lave las manos", 
       "Sólo el tiempo decidirá qué se quiere hacer"] 

docs = nlp.pipe(texts)

for i,doc in enumerate(docs):
    
    tokens = [token.text for token in doc]  # obtain tokens
    lemmas = [token.lemma_ for token in doc] # obtain lemmas 
    
    print("\nText {}:".format(i))
    print("POS tags: \n", [[(token.text, token.pos_) for token in doc]])
    
    # tilde diacrítica words (tdw) found among the tokens, 
    # along with their respective POS taggings. Note these are the ones we 
    # annotated manually. Compare these with spaCy's ones. 
    tdw_tokens = [(token.text, categ_dict[token.text]) for token in doc if token.text in full_list] 
    tdw_lemmas = [(token.lemma_, categ_dict[token.lemma_]) for token in doc if token.lemma_ in full_list] 
    
    # display 
    print("Tilde diacritica tokens for doc + dict POS {} : \n{}".format(i, tdw_tokens))
    print("Tilde diacritica lemmas for doc + dict POS {} : \n{}".format(i, tdw_lemmas))



Text 0:
POS tags: 
 [[('De', 'ADP'), ('qué', 'DET'), ('quieres', 'AUX'), ('hablar', 'VERB'), ('hoy', 'ADV'), ('?', 'PUNCT')]]
Tilde diacritica tokens for doc + dict POS 0 : 
[('qué', ['PRON'])]
Tilde diacritica lemmas for doc + dict POS 0 : 
[('qué', ['PRON'])]

Text 1:
POS tags: 
 [[('Quiero', 'VERB'), ('que', 'SCONJ'), ('me', 'PRON'), ('dé', 'VERB'), ('un', 'DET'), ('poco', 'PRON'), ('de', 'ADP'), ('tiempo', 'NOUN')]]
Tilde diacritica tokens for doc + dict POS 1 : 
[('que', ['SCONJ']), ('dé', ['VERB']), ('de', ['PREP'])]
Tilde diacritica lemmas for doc + dict POS 1 : 
[('que', ['SCONJ']), ('de', ['PREP'])]

Text 2:
POS tags: 
 [[('Ya', 'ADV'), ('no', 'ADV'), ('quiero', 'VERB'), ('comer', 'VERB'), ('más', 'ADV'), ('!', 'PUNCT')]]
Tilde diacritica tokens for doc + dict POS 2 : 
[('más', ['ADV', 'ADJ', 'PRN'])]
Tilde diacritica lemmas for doc + dict POS 2 : 
[('más', ['ADV', 'ADJ', 'PRN'])]

Text 3:
POS tags: 
 [[('Yo', 'PRON'), ('sé', 'VERB'), (',', 'PUNCT'), ('tu', 'DET'), ('quieres'

Note in the above that in some cases a **tilde diacrítica word disappears when the sentence is lemmatized**: in Text 1, `dé` is found among the tokens but not among the lemmas.

## Testing pyspellchecker 

In this part we will apply some basic tests to assess the performance of pyspellchecker on Spanish. From the documentation, 

"*It uses a Levenshtein Distance algorithm to find permutations within an edit distance of 2 from the original word. It then compares all permutations (insertions, deletions, replacements, and transpositions) to known words in a word frequency list. Those words that are found more often in the frequency list are **more likely** the correct results.*"
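
To make the quoted description concrete, here is a minimal sketch of the general idea (generate every string one edit away, keep the ones found in a frequency list, prefer the most frequent). It is not pyspellchecker's actual code: the function names, the alphabet constant, and the toy frequency counts are all made up for illustration, and only edit distance 1 is covered.

```python
SPANISH_LETTERS = "abcdefghijklmnñopqrstuvwxyzáéíóúü"


def edits1(word, alphabet=SPANISH_LETTERS):
    """All strings exactly one edit away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


# Toy frequency list with invented counts, for illustration only
TOY_FREQ = {"aquí": 50, "aquel": 20, "que": 300}


def correct(word, freq=TOY_FREQ):
    """Return the most frequent known word within one edit, or the word itself."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word


print(correct("aqquí"))  # aquí
```

Reaching distance 2 means applying `edits1` again to every result, which is why the candidate set grows quickly for long words.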

It uses the following algorithm: 

![](../figs/Levenshtein_distance.jpg)

The important thing to notice is that, according to the documentation, reducing the edit distance to 1 (the `distance` argument of `SpellChecker`) is recommended when the words to check are long. 
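
For reference, the edit distance itself can be computed with the classic dynamic-programming recurrence. The sketch below is a plain Levenshtein distance (insertions, deletions, substitutions); it is written for illustration and is not taken from pyspellchecker. A transposition counts as two edits here, whereas the quoted description treats it as a single edit.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(levenshtein("aqquí", "aquí"))            # 1
print(levenshtein("adjucacion", "educacion"))  # 2
```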


In [215]:
### Basic Usage ### 
spell = SpellChecker(language='es') # Spanish dictionary  



# from spaCY
text0 = "Solicito que algo pase aqquí mañana en la mañana !" # note the word "aqquí" is misspelled 
doc = nlp(text0) 
tokens = [token.text for token in doc] 
lemma_tokens = [token.lemma_ for token in doc]



# find those words that might be misspelled 
misspelled = spell.unknown(tokens)

# words that are in the frequency list 
in_freq = spell.known(tokens) 

print("misspelled: \n", misspelled)
print("in_frequency list: \n", in_freq)


for word in misspelled: 
    
    print("Input word: {}\n".format(word)) 
    
    # Get the one 'most likely' answer  
    print("correction: ", spell.correction(word))
    
    # Get a list of the 'likely' options  
    print("candidates: ", spell.candidates(word))
    
    # Get probability of the input word occurring 
    print("probability of mispelled word: ", spell.word_probability(word))
    
del(misspelled) 

misspelled: 
 {'aqquí'}
in_frequency list: 
 {'la', 'solicito', 'algo', 'que', 'mañana', '!', 'en', 'pase'}
Input word: aqquí

correction:  aquí
candidates:  {'aquí'}
probability of mispelled word:  0.0


### Adding the tilde diacrítica list to the dictionary  

As a side note, notice that we can add the list above to the dictionary if necessary, as well as any other list of words (such as names) that we would like to have as correctly spelled. This can be done in the following fashion: 

In [63]:
spell2 = SpellChecker(language='es') # create a new spellcheck instance 
spell2.word_frequency.load_words(full_list) # add the full list of tilde diacrítica words 
spell2.word_frequency.remove_words(full_list) # remove a list of words from the spellchecker 

In [177]:
# loading some of the request data 
data = pd.read_csv('../data_clean/request_texts.txt', sep='|')
descriptions= data['DESCRIPCIONSOLICITUD'][0:11]
print("description: \n", descriptions[0])
print("\ndescription: \n", descriptions[1])

description: 
 1) Requiero que se entregue copia de la investigación de mercados realizado por COMESA EN LA CUAL SE DETERMINA CUALES SON LAS MEJORES CONDICIONES PARA EL ESTADO PARA LA ADJUCACION DIRECTA NO. SA-018TQA001-N231-2013.  2) REQUIERO QUE SE ENTREGUE COPIA DEL ACUERDO No 04/E25/2013   3) REQUIERO COPIA DE LA MINUTA DE  LA SESSION VIGESIMA QUINTA EXTRAORDINARIA DE FECHA DE 30 DICIEMBRE DE 2013  DEL CAAS EN COMESA   4) REQUIERO COPIA DEL ACTA  DE  LA SESION VIGESIMA QUINTA EXTRAORDINARIA DE FECHA DE 30 DICIEMBRE DE 2013  DEL CAAS EN COMESA   5) DE ACUERDO AL ART 41 FRACCION III DE LA LEY AASSP    III. Existan circunstancias que puedan provocar pérdidas o costos adicionales importantes   cuantificados y justificados;   Solicito a COMESA ME ENTREGUE UN DOCUMENTO O ACTA EN LA CUAL ME DIGAN CUALES  SON LOS SUPUESTOS QUE ELLOS INDICAN QUE PUEDEN PROVOCAR PERDIDAS O COSTOS ADICIONALES IMPORTANTES Y QUE ESTAN CUANTIFICADOS Y JUSTIFICADOS.  6)SOLICITO UN DOCUMENTO EN LA CUAL ME INDIQUEN

We define a function that collects, for each text, the misspelled words, their most likely corrections, and the probability of each misspelled input occurring. 

In [209]:

# Collect spell-check statistics for a collection of texts 
def spellcheck_test(texts, 
                    get_unknown=True,   # currently unused 
                    get_known=False,    # currently unused 
                    lowercase=False,
                    verbose=True): 
    
    # create an empty dataframe 
    df = pd.DataFrame(columns=['text','misspelled','correction','prob_wrong_input','num_tokens']) 
    
    misspelled_lists = []
    corrections_lists = [] 
    prob_wrong_input_lists = [] 
    num_tokens = []
    num_misspelled = []
    prop_misspelled = []
    
    # stream the texts through the spaCy pipeline 
    docs = nlp.pipe(texts)
    
    for idx, doc in tqdm_notebook(enumerate(docs), desc='processing text...'): 
        
        print("******\nText: {}\n******\n".format(idx))
                
        # obtain raw token list, lowercased only when requested 
        # (by default spell.unknown() returns its results in lowercase either way) 
        tokens = [token.text.lower() if lowercase else token.text for token in doc] 
                
#         # obtain lemmas for all tokens 
#         lemmas = [token.lemma_.lower() if lowercase else token.lemma_ for token in doc]
        
        # find those words that might be misspelled 
        misspelled = [token for token in spell.unknown(tokens) if ' ' not in token]  
        misspelled_lists += [misspelled] # add the list of misspelled words 
                
        # obtain probability of misspelled words occurring
        probabilities = {word: spell.word_probability(word) 
                         for word in tqdm_notebook(misspelled, desc='finding misspelled words...')}  
        prob_wrong_input_lists += [probabilities]
        if verbose: 
            print("Misspelled words & probabilities: \n", probabilities)

        # obtain corrections (NOTE: this is by far the slowest step, 
        # especially for long words at the default edit distance of 2)
        corrections = [spell.correction(word) 
                       for word in tqdm_notebook(misspelled, desc='finding input correction...')]
        corrections_lists += [corrections]
        if verbose: 
            print("Corrections: \n", corrections)
            
        # number of tokens per text 
        num_tokens += [len(tokens)]
        
        # number of misspelled tokens 
        num_misspelled += [len(misspelled)]
        
        # proportion of misspelled tokens (0.0 for an empty text) 
        prop_misspelled += [len(misspelled)/len(tokens) if tokens else 0.0]         
        
    # build the returning data frame 
    df['text'] = texts 
    df['misspelled'] = misspelled_lists    
    df['correction'] = corrections_lists
    df['prob_wrong_input'] = prob_wrong_input_lists
    df['num_tokens'] = num_tokens
    df['num_misspelled'] = num_misspelled
    df['proportion_misspelled'] = prop_misspelled
    
    return df

In [210]:
descriptions= data['DESCRIPCIONSOLICITUD'][0:11]
df = spellcheck_test(descriptions)

******
Text: 0
******



Misspelled words & probabilities: 
 {'cuantificados': 0.0, 'sa-018tqa001-n231': 0.0, '6)solicito': 0.0, 'vigesima': 0.0, '04/e25/2013': 0.0, 'session': 0.0, 'caas': 0.0, 'aassp': 0.0, 'realizacion': 0.0, 'fraccion': 0.0, 'comesa': 0.0, 'adjucacion': 0.0}


Corrections: 
 ['cualificados', 'sa-018tqa001-n231', 'solicito', 'vigésima', '04e252013', 'sesion', 'casa', 'cass', 'realización', 'fracción', 'comes', 'educacion']
******
Text: 1
******



Misspelled words & probabilities: 
 {'medidos': 0.0, 'descentralizados': 0.0, 'fundamentan': 0.0}


Corrections: 
 ['medios', 'descentralizados', 'fundamental']
******
Text: 2
******



Misspelled words & probabilities: 
 {'no.': 0.0, '2300-a001': 0.0, '05/06/98': 0.0, 'ct011': 0.0, '20/06/2000': 0.0, '1.-': 0.0, 'institucionales': 0.0, 'normativos': 0.0, 'realizacion': 0.0, 'planeacion': 0.0, 'actualizacion': 0.0, 'sustantivas': 0.0, '2.-': 0.0, 'deroga': 0.0, 'emision': 0.0}


Corrections: 
 ['no', '2300001', '050698', '011', '20062000', '1.', 'instituciones', 'informativos', 'realización', 'planeación', 'actualización', 'sustancias', '2.', 'droga', 'emisión']
******
Text: 3
******



Misspelled words & probabilities: 
 {'expide': 0.0, 'consumibles': 0.0, 'mensualizado': 0.0}


Corrections: 
 ['expire', 'consumibles', 'mensualidad']
******
Text: 4
******



Misspelled words & probabilities: 
 {'tuxpan': 0.0, 'solicitadas': 0.0, 'solicitad': 0.0, '¿': 0.0}


Corrections: 
 ['culpan', 'solicitada', 'solicitud', 'a']
******
Text: 5
******



Misspelled words & probabilities: 
 {'ifai': 0.0}


Corrections: 
 ['fai']
******
Text: 6
******



Misspelled words & probabilities: 
 {'09dpr1936a': 0.0, 'cct': 0.0, 'escutia': 0.0}


Corrections: 
 ['09dpr1936a', 'cat', 'escucha']
******
Text: 7
******



Misspelled words & probabilities: 
 {}


Corrections: 
 []
******
Text: 8
******



Misspelled words & probabilities: 
 {'revocacion': 0.0, 'elaboracion': 0.0, 'inicadas': 0.0, 'imposicion': 0.0, 'promovio': 0.0, 'alude': 0.0, 'inciadas': 0.0, 'tubieron': 0.0, 'derivaron': 0.0, 'dictamenes': 0.0, 'decomisos': 0.0, 'originadas': 0.0, 'originados': 0.0, 'conseciones': 0.0, 'clausuras': 0.0, 'ascendio': 0.0, 'recaudados': 0.0, '¿': 0.0, 'contensioso': 0.0, 'peritajes': 0.0, 'sancion': 0.0, 'lgeepa': 0.0}


Corrections: 
 ['revocación', 'elaboración', 'indicadas', 'imposición', 'promovido', 'ayude', 'iniciadas', 'tuvieron', 'derribaron', 'dictamen', 'decomiso', 'originales', 'originado', 'condiciones', 'clausura', 'ascendido', 'recaudador', 'a', 'contensioso', 'peritaje', 'cancion', 'leela']
******
Text: 9
******



Misspelled words & probabilities: 
 {'dependendecnia': 0.0, 'apf': 0.0}


Corrections: 
 ['dependendecnia', 'alf']
******
Text: 10
******



Misspelled words & probabilities: 
 {}


Corrections: 
 []


In [214]:
display(df)
df.to_csv('../data_clean/spellcheck_requests_sample_statistics.csv', encoding='utf-8')

| | text | misspelled | correction | prob_wrong_input | num_tokens | num_misspelled | proportion_misspelled |
|---|---|---|---|---|---|---|---|
| 0 | 1) Requiero que se entregue copia de la invest... | [cuantificados, sa-018tqa001-n231, 6)solicito,... | [cualificados, sa-018tqa001-n231, solicito, vi... | {'cuantificados': 0.0, 'sa-018tqa001-n231': 0.... | 195 | 12 | 0.061538 |
| 1 | Favor de proporcionar toda la documentación re... | [medidos, descentralizados, fundamentan] | [medios, descentralizados, fundamental] | {'medidos': 0.0, 'descentralizados': 0.0, 'fun... | 196 | 3 | 0.015306 |
| 2 | En archivo electronico favor de aportar compl... | [no., 2300-a001, 05/06/98, ct011, 20/06/2000, ... | [no, 2300001, 050698, 011, 20062000, 1., insti... | {'no.': 0.0, '2300-a001': 0.0, '05/06/98': 0.0... | 165 | 15 | 0.090909 |
| 3 | Con fundamento en lo previsto en los Artículos... | [expide, consumibles, mensualizado] | [expire, consumibles, mensualidad] | {'expide': 0.0, 'consumibles': 0.0, 'mensualiz... | 263 | 3 | 0.011407 |
| 4 | ¿De que manera la Secretaría de Marina da a co... | [tuxpan, solicitadas, solicitad, ¿] | [culpan, solicitada, solicitud, a] | {'tuxpan': 0.0, 'solicitadas': 0.0, 'solicitad... | 138 | 4 | 0.028986 |
| 5 | Se solicita información que detalle las resolu... | [ifai] | [fai] | {'ifai': 0.0} | 92 | 1 | 0.01087 |
| 6 | SOLICITO CONOCES LA APLICACION DE LOS BIENES C... | [09dpr1936a, cct, escutia] | [09dpr1936a, cat, escucha] | {'09dpr1936a': 0.0, 'cct': 0.0, 'escutia': 0.0} | 58 | 3 | 0.051724 |
| 7 | Deseo conocer el fundamento legal por el que e... | [] | [] | {} | 43 | 0 | 0.0 |
| 8 | ¿numero de denuncias populares inciadas durant... | [revocacion, elaboracion, inicadas, imposicion... | [revocación, elaboración, indicadas, imposició... | {'revocacion': 0.0, 'elaboracion': 0.0, 'inica... | 230 | 22 | 0.095652 |
| 9 | RELACION CON TODOS LOS NOMBRES Y CARGOS (INCLU... | [dependendecnia, apf] | [dependendecnia, alf] | {'dependendecnia': 0.0, 'apf': 0.0} | 55 | 2 | 0.036364 |


In [221]:
for i,misspelled_list in enumerate(df['misspelled']): 
    print("Text: {}\n misspelled_list: {}".format(i, misspelled_list)) 

Text: 0
 misspelled_list: ['cuantificados', 'sa-018tqa001-n231', '6)solicito', 'vigesima', '04/e25/2013', 'session', 'caas', 'aassp', 'realizacion', 'fraccion', 'comesa', 'adjucacion']
Text: 1
 misspelled_list: ['medidos', 'descentralizados', 'fundamentan']
Text: 2
 misspelled_list: ['no.', '2300-a001', '05/06/98', 'ct011', '20/06/2000', '1.-', 'institucionales', 'normativos', 'realizacion', 'planeacion', 'actualizacion', 'sustantivas', '2.-', 'deroga', 'emision']
Text: 3
 misspelled_list: ['expide', 'consumibles', 'mensualizado']
Text: 4
 misspelled_list: ['tuxpan', 'solicitadas', 'solicitad', '¿']
Text: 5
 misspelled_list: ['ifai']
Text: 6
 misspelled_list: ['09dpr1936a', 'cct', 'escutia']
Text: 7
 misspelled_list: []
Text: 8
 misspelled_list: ['revocacion', 'elaboracion', 'inicadas', 'imposicion', 'promovio', 'alude', 'inciadas', 'tubieron', 'derivaron', 'dictamenes', 'decomisos', 'originadas', 'originados', 'conseciones', 'clausuras', 'ascendio', 'recaudados', '¿', 'contensioso', '

### Are these actually misspelled?  

The table above shows how many words the spellchecker flagged as misspelled, but are they actually misspelled? As a native Spanish speaker, I checked each flagged word by hand and recorded the ones that were in fact spelled correctly (false positives). Note that dates in certain formats and alphanumeric codes are usually flagged as misspelled. Words flagged only because they lack a stress mark were not counted as false positives. Precision is computed separately for each text, over the words flagged in that text; note that the number of tokens also differs from text to text. 

$$Precision = \left( \frac{TP}{TP + FP} \right)$$
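
As a worked example of the formula, the sketch below computes precision from a list of flagged words and the subset judged to be false positives, using the lists reported for Text 1 and Text 5 and the counts for Text 8. The function name `precision` and the choice to return `None` when nothing was flagged are assumptions of this sketch.

```python
def precision(flagged, false_positives):
    """Precision = TP / (TP + FP), where TP + FP is every word the checker flagged."""
    flagged = set(flagged)
    if not flagged:
        return None  # undefined when nothing was flagged
    fp = len(flagged & set(false_positives))
    tp = len(flagged) - fp
    return tp / (tp + fp)


# Text 1: all three flagged words were judged to be correct Spanish
text1 = ["medidos", "descentralizados", "fundamentan"]
print(precision(text1, text1))  # 0.0

# Text 5: one flagged word, no false positives
print(precision(["ifai"], []))  # 1.0

# Text 8: 22 flagged words, 13 of them false positives (placeholder names)
flagged8 = ["w{}".format(i) for i in range(22)]
print(round(precision(flagged8, flagged8[:13]), 2))  # 0.41
```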

**NOTE:** 
- The spellchecker does seem to flag words that lack a required stress mark. 

The following are the false positives for each of the texts above: 

**Text 0** 

"False positives:" 
['cuantificados'] 

Precision: **0.92**


**Text 1** 

"False positives:" 
['medidos', 'descentralizados', 'fundamentan']

Precision: **0.00**

**Text 2** 

"False positives:" 
['no.','05/06/98', '20/06/2000', '1.-', 'institucionales', 'normativos','sustantivas', 'emision'] **Note the dates**

Precision: **0.46**

**Text 3** 

"False positives:" 
['expide', 'consumibles', 'mensualizado']

Precision: **0.00**

**Text 4** 

"False positives:" 
['tuxpan', 'solicitadas', '¿'] **Note the question mark. Tuxpan is a place name in Mexico.**

Precision: **0.25**

**Text 5** 

"False positives:" 
[]

Precision: **1.00**

**Text 6** 

"False positives:" 
['09dpr1936a', 'cct', 'escutia'] **Note the codes are detected as misspelled**

Precision: **0**

**Text 7** 

"False positives:" 
[]

Precision: **1**

**Text 8** 

"False positives:" 
['alude', 'tubieron', 'derivaron', 'decomisos', 'originadas', 'originados', 'conseciones', 'clausuras', 'ascendio', 'recaudados', '¿', 'contensioso', 'peritajes']

Precision: **0.41**


**Text 9** 

"False positives:" 
[]

Precision: **1.0**

**Text 10** 

"False positives:" 
[]

Precision: **1.0**


**NOTE:** These precision calculations are not really indicative unless we also consider the proportion of "misspelled" tokens in the text. 


### Verdict 

The spellchecker does a good job overall, but one should be especially aware of the following: 
- Numbers, dates, and codes are usually flagged as misspelled.
- Words that lack a stress mark are flagged as misspelled. 
- Beyond these cases, the rate of false positives doesn't seem to be too high, but they do occur every once in a while. 
- Capitalization is irrelevant.
- The initial '¿' is flagged as misspelled.
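
One way to act on these observations would be to filter out tokens that are not ordinary words before they reach the spellchecker. The sketch below is only a suggestion: the function name `is_checkable` and the two regular expressions are assumptions chosen to match the kinds of tokens seen in the outputs above, and they were not part of the tests reported here.

```python
import re

# Patterns are assumptions based on the tokens flagged above
HAS_DIGIT = re.compile(r"\d")
ONLY_PUNCT = re.compile(r"^[\W_]+$")


def is_checkable(token):
    """True if the token looks like an ordinary word worth spell-checking."""
    if HAS_DIGIT.search(token):   # dates, codes, numbered items
        return False
    if ONLY_PUNCT.match(token):   # '¿', '!', and similar
        return False
    return True


tokens = ["¿", "tuxpan", "05/06/98", "2300-a001", "1.-", "emision", "no."]
print([t for t in tokens if is_checkable(t)])  # ['tuxpan', 'emision', 'no.']
```

Proper nouns such as place names would still need a separate list of known words.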