<a href="https://colab.research.google.com/github/19barsav/EXTRA_STANZA/blob/main/Extravaganza_Standard_Stanza_Bulk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Initialize Stanza**

In [2]:
pip install stanza

Note: you may need to restart the kernel to use updated packages.


In [3]:
import stanza

In [4]:
stanza.download('es') # download Spanish model

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-04-15 11:09:48 INFO: Downloaded file to C:\Users\savan\stanza_resources\resources.json
2025-04-15 11:09:48 INFO: Downloading default packages for language: es (Spanish) ...
2025-04-15 11:09:49 INFO: File exists: C:\Users\savan\stanza_resources\es\default.zip
2025-04-15 11:09:53 INFO: Finished downloading models and saved to C:\Users\savan\stanza_resources


In [5]:
from stanza.pipeline.processor import register_processor, Processor
from stanza.models.common.doc import Word, Token

# the very last paragraph here explains this well: https://stanfordnlp.github.io/stanza/data_objects.html#token
# essentially, you are making a property called "pronoun_separated", which will be a Token type
# the getter allows you to access the property, and the setter allows you to  the value
Token.add_property('pronoun_separated', #make a property for our pre-processing in Stanza token object
                   default=False,
                   getter=lambda self: self._pronoun_separated,
                   setter=lambda self, value: value)

In [34]:
# from https://stanfordnlp.github.io/stanza/pipeline.html#processors
# Processors are units of the neural pipeline that perform specific NLP
# functions and create different annotations for a Document.

# register_processor is a function that allows you to make your own processor
# here, it's being used as a decorator
# essentially, it's the same as calling the register_processor function with
# the PronounProcessor class as input

@register_processor("pronoun")
class PronounProcessor(Processor):
    """
    This custom Stanza processor separates pronouns from the end of verbs in Spanish text.
    It ensures accurate separation by checking if the word before the pronoun is indeed a verb using Stanza's POS tagger.

    These two variables dente what is required (tokenized words) and what is given (pronouns)
    """
    _requires = set(['tokenize'])
    _provides = set(['pronoun'])

    def __init__(self, device, config, pipeline):
        """
        Initialize the processor, making sure the pipeline has tokenization and POS tagging.
        I believe that the "mwt" should say "pos"
        mwt is required by pos (in general)
        The token_id does not account for mwt, so it seems that spanish pronouns are not separated
        """
        if "tokenize" not in pipeline.processors:
            raise ValueError("Need a Pipeline with a valid tokenize processor")
        #pos included mwt, so this is fine
        if "pos" not in pipeline.processors:
            raise ValueError("Need a Pipeline with a valid mwt processor")

    def _set_up_model(self, *args):
        """
        This is purposely left blank, following the stanza instructions from the link above
        """
        pass

    def update_token_id(self, token, word_cnt):
        """
        Given a current word_cnt up till current token, update the id
        of the token as well as their words (per sentence)

        all token properties can be found here:
        https://stanfordnlp.github.io/stanza/data_objects.html#token
        """
        token.id = (word_cnt, )
        for word in token.words:
            word.id = word_cnt
            word_cnt += 1

        return word_cnt

    def process(self, doc):
        """
        Core processing method: Iterates through sentences and tokens, separating pronouns if conditions are met.
        """
        for sent in doc.sentences:
            word_cnt = 1  # Current number of words in a sentence

            # Iterate through each token in the sentence
            for i, token in enumerate(sent.tokens):
                # Skip if already separated
                #print(i, token)
                #print(word_cnt)
                if token._pronoun_separated:
                    continue

                token._pronoun_separated = True

                # Skip pronouns
                # Skip verbs with past tense
                # Skip adverbs with specific suffixes, note more suffixes may necessary to add
                if token.text in set(["ella", "ello", "ellas", "ellos"])\
                or token.text[-4:] in set(["illo", "illa",
                                              "uelo", "uela",
                                              "filo", "fila"])\
                or token.text[-5:] in set(["illos", "illas",
                                              "uelos", "uelas",
                                              "filos", "filas"])\
                or token.text.endswith("ste")\
                or token.text.endswith("mente"):

                    word_cnt = self.update_token_id(token, word_cnt)
                    continue

                # Skip noun
                if i > 0 and sent.tokens[i - 1].words[0].upos == "DET":
                    word_cnt = self.update_token_id(token, word_cnt)
                    continue

                # Check the ending of the token
                SINGULARS = ("lo", "me", "te", "la", "le")
                if token.text.endswith(SINGULARS):
                    # Skip pronoun that has been separated
                    # Is this a duplicate of if token._pronoun_separated?
                    # I don't think it is hurting anything, but maybe run on a small
                    # set of data with and without and see if you get the same results
                    
                    #WHYY this is broken
                  
                    print("here")
                    if token.words[-1].text in SINGULARS:
                        print('skipped')
                        word_cnt = self.update_token_id(token, word_cnt)
                        continue
                        
                    print(token.words[-1].text)
                    print(token.words)
                    print(token)

                    sent.words.remove(token.words[0])

                    # VERB
                    word_left = Word(sentence=sent,word_entry={
                        "id": word_cnt,
                        "text": token.text[:-2],
                        "start_char": token.start_char,
                        "end_char": token.start_char + len(token.text[:-2]) - 1,
                        # "head": word_cnt
                    })
                    sent.words.insert(word_cnt - 1, word_left)
                    word_cnt += 1

                    # PRONOUN
                    word_right = Word(sentence=sent,word_entry={
                        "id": word_cnt,
                        "text": token.text[-2:],
                        "start_char": word_left.end_char + 1,
                        "end_char": token.end_char,
                        # "head": word_left.head
                    })
                    # why is this cut out? I do not have the linguistics knowledge for why
                    '''
                    Add a check for left word being a verb:
                    Example: hola becomes ho-la and new tokens should only be replaced if
                    the 'ho' segment IS a verb. Since ho is not a verb, process is undone.

                    nlp_output = nlp(word_left["text"]) #nlp assumed to be defined globally
                    word_left_upos = nlp_output.sentences[0].words[0].upos

                    if word_left.upos == 'VERB':
                        sent.words.insert(word_cnt - 1, word_right)
                        word_cnt += 1

                        # Add new words into current token
                        new_words = [word_left, word_right]
                        token.id = (word_left.id, word_right.id)
                    else:
                        # Undo the process if 'ho' is not a verb
                        token.words = [Word(text=token.text, id=token.id)]
                    '''

                    sent.words.insert(word_cnt - 1, word_right)
                    word_cnt += 1

                    # Add new words into current token
                    new_words = [word_left, word_right]
                    token.id = (word_left.id, word_right.id)
                    token.words = new_words


                    doc.num_words += 1   # Update number of words in document
                    continue

                PLURALS = ("los", "las", "les")
                if token.text.endswith(PLURALS):
                    print("here2")
                    # Skip pronoun that has been separated
                    if token.words[-1].text in PLURALS:
                        word_cnt = self.update_token_id(token, word_cnt)
                        continue

                    sent.words.remove(token.words[0])
                    print("here3")

                    # VERB
                    word_left = Word({
                        "id": word_cnt,
                        "text": token.text[:-3],
                        "start_char": token.start_char,
                        "end_char": token.start_char + len(token.text[:-3]) - 1,
                        # "head": word_cnt
                    })

                    sent.words.insert(word_cnt - 1, word_left)
                    word_cnt += 1

                    #PRONOUN
                    word_right = Word({
                        "id": word_cnt,
                        "text": token.text[-3:],
                        "start_char": word_left.end_char + 1,
                        "end_char": token.end_char,
                        # "head": word_left.head
                    })

                    '''
                    Add a check for left word being a verb:
                    Example: hola becomes ho-la and new tokens should only be replaced if
                    the 'ho' segment IS a verb. Since ho is not a verb, process is undone.

                    nlp_output = nlp(word_left["text"]) #nlp assumed to be defined globally
                    word_left_upos = nlp_output.sentences[0].words[0].upos

                    if word_left.upos == 'VERB':
                        sent.words.insert(word_cnt - 1, word_right)
                        word_cnt += 1

                        # Add new words into current token
                        new_words = [word_left, word_right]
                        token.id = (word_left.id, word_right.id)
                    else:
                        # Undo the process if 'ho' is not a verb
                        token.words = [Word(text=token.text, id=token.id)]
                    '''

                    sent.words.insert(word_cnt - 1, word_right)
                    word_cnt += 1

                    # Add new words into current token
                    new_words = [word_left, word_right]
                    token.id = (word_left.id, word_right.id)
                    token.words = new_words

                    doc.num_words += 1   # Update number of words in document
                    continue

                word_cnt = self.update_token_id(token, word_cnt)

        return doc

#hola, habla, solo, nouns not preceeded with a determiner


In [35]:
TEST_MODELS_DIR = "~/root/stanza_resources/es"
nlp = stanza.Pipeline(dir=TEST_MODELS_DIR, lang='es', processors='tokenize,mwt,pos,pronoun')
nlp_2 = stanza.Pipeline(dir=TEST_MODELS_DIR, lang='es', processors='tokenize,pos,ner,lemma,depparse', tokenize_pretokenized = True)

2025-04-15 12:31:31 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-04-15 12:31:31 INFO: Downloaded file to ~/root/stanza_resources/es\resources.json
2025-04-15 12:31:31 INFO: Loading these models for language: es (Spanish):
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| mwt       | combined        |
| pos       | combined_charlm |
| pronoun   | default         |
| pronoun   | default         |

2025-04-15 12:31:31 INFO: Using device: cpu
2025-04-15 12:31:31 INFO: Loading: tokenize
2025-04-15 12:31:31 INFO: Loading: mwt
2025-04-15 12:31:31 INFO: Loading: pos
2025-04-15 12:31:33 INFO: Loading: pronoun
2025-04-15 12:31:33 INFO: Loading: pronoun
2025-04-15 12:31:33 INFO: Done loading processors!
2025-04-15 12:31:33 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

2025-04-15 12:31:33 INFO: Downloaded file to ~/root/stanza_resources/es\resources.json
2025-04-15 12:31:34 INFO: Loading these models for language: es (Spanish):
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
| ner       | conll02           |

2025-04-15 12:31:34 INFO: Using device: cpu
2025-04-15 12:31:34 INFO: Loading: tokenize
2025-04-15 12:31:34 INFO: Loading: pos
2025-04-15 12:31:36 INFO: Loading: lemma
2025-04-15 12:31:38 INFO: Loading: depparse
2025-04-15 12:31:38 INFO: Loading: ner
2025-04-15 12:31:40 INFO: Done loading processors!


**Pre-processing**

Remove the phonetic symbols and punctuation to leave just the corrected words

In [36]:
def create_revised_punct_file(file_name):
  cha_file = file_name + ".cha" #.cha file to pre-process, replace after uploading more file
  input = open(cha_file, "r")

  punctuation = "()&"
  processed = open(file_name + "_revised_punct.txt", "w")

  for line in input:
      new_line = line.strip()
      if new_line[0] == '*':
          speaker = new_line[:6].strip()
          text = new_line[6:].strip()
          for ch in punctuation:
              text = (text.replace(ch,'')).strip()

          if '[' in text:
              words = text.split()
              # Find corrected word and replace it into the transcription
              for i, word in enumerate(words):
                  if word == "[:":
                      words[i-1] = words[i+1].replace(']',"")
                      words[i] = ""
                      words[i+1] = ""
              text = ""
              for i in range(len(words)):
                  if i == (len(words)-1):
                      text += words[i]
                  elif words[i] != "":
                      text += words[i] + " "
          new_line = speaker + "\t" + text
      print(new_line, end="\n", file=processed)

  processed.close()
  input.close()

Create a dictionary in the following format
**{
  "id": line-index,
  "speaker",
  "text"
}**

In [37]:
def create_transcriptions_dict(file_name):
  processed = open(file_name + "_revised_punct.txt", "r")
  transcriptions = []

  for i, line in enumerate(processed):
      if line[0] == "*": #Speaker line
          line = line.strip()
          line = line.split(":\t")  #Split line into speaker name and their speech
          speaker = line[0]
          text = line[1]

          transcription = {
              "id": i,
              "speaker": speaker,
              "text": text
          }
          transcriptions.append(transcription)
  processed.close()

  return transcriptions

In [38]:
def create_csv_file(file_name, transcriptions):
    output = open("output.txt", "w")
    csv_file = open(file_name + ".csv", "w")

    for transcription in transcriptions:
        doc = nlp(transcription["text"])
        #doc = nlp_2(doc)
        print(doc, file=output)
        
        transcription["sentences"] = doc.to_dict()

        for sentence in doc.sentences:
            #Print: #text = sentence
            print("#text = {}".format(transcription["text"]), end = "\n", file = csv_file)

            for token in sentence.tokens:
                for i, word in enumerate(token.words):
                    # print(str(i + 1), file = csv_file)

                    word_dict = word.to_dict()
                    KEYS = ["id", "text", "lemma", "upos", "xpos", "feats", "head",
                            "deprel", "deps", "misc"]


                    VALUES = [str(word_dict[key]) if key in word_dict else "_" for key in KEYS]

                    csv_line = "\t".join(VALUES)

                    print(csv_line, end = "\n", file = csv_file)

            print(file = csv_file)

    csv_file.close()
    output.close()


**Create  txt file and save tag result there**

In [39]:
def create_tag_file(file_name, transcriptions):
  tagged =  open(file_name + ".txt", "w")
  print(sentence.words)

  for transcription in transcriptions:
      # doc = nlp_tokenized(transcription["text"])
      # doc = nlp(doc)
      doc = nlp_2(transcription["text"])
      #doc = nlp_2(doc)

      transcription["sentences"] = doc.to_dict()

      for sentence in doc.sentences:
          # sentence_dict = {
          #     "speaker": transcription["speaker"],
          #     "id": transcription["id"],
          #     "tokens": sentence.to_dict()
          # }

          # transcription_nlp.append(sentence_dict)

          print(str(transcription["id"]), end = "\n", file=tagged)
          print(transcription["speaker"], end = "\n", file=tagged)
          print(sentence.text, end = "\n", file=tagged)

          for word in sentence.words:
              print("{:20}{:20}{:20}{:20}{}"
                    .format(str(word.text),
                            str(word.lemma),
                            str(word.upos),
                            str(word.xpos),
                            str(word.feats)), end = "\n", file=tagged)
          print("\n", file=tagged)
  tagged.close()

In [40]:
def create_json_file(file_name, transcriptions):
  import json

  tagged_json = file_name + ".json"

  with open(tagged_json, 'w') as tagged_json_file:
      json.dump(transcriptions, tagged_json_file)
  tagged_json_file.close()

###Compile list of words that Stanza failed, along with the sentence and line num

In [41]:
suffices = ["lo", "me" , "te", "la", "le", "nos", "los", "las", "les", "se"]

In [42]:
collection = []

In [43]:
# For each word, check if the ending matches
# Print the word, lemma, upos, feats, sentence, and line number
# Compile a list of words

In [44]:
def collect_potential_misclassfied_words(transcriptions):
  for transcription in transcriptions:
    for sentence in transcription["sentences"]:
      for token in sentence:
        for suffix in suffices:
          if token["text"].endswith(suffix):
              try:
                result = {
                    "text": token["text"],
                    #"lemma": token["lemma"], why?
                    "upos": token["upos"],
                    "feats": token["feats"] if "feats" in token else "",
                    "transcription": transcription["text"],
                    "misclassfied": False
                }

                collection.append(result)
              except KeyError as err:
                  print(token)
                  print(transcription['text'])
                  print(err)
                  continue
            # print(transcription["text"])
            # print(token["text"])
            # print(token["lemma"], token["upos"], end=" ")
            # if "feats" in token:
            #   print(token["feats"])
            # print()
            #break
            
  return collection


In [45]:
import os
cha_files = [file for file in os.listdir() if file.endswith(".cha")]
file_names = [file[:len(".cha") * (-1)] for file in cha_files]

In [46]:
for file_name in file_names:
  file_name = os.getcwd() + "/" + file_name
  print(file_name)
  create_revised_punct_file(file_name)
  transcriptions = create_transcriptions_dict(file_name)
  # create_tag_file(file_name, transcriptions)
  create_csv_file(file_name, transcriptions)
  # create_json_file(file_name, transcriptions)
  

C:\Users\savan\EXTRA_STANZA/a-aal-MOT-INV-BIL1-2017-07-05-anlc
here
cumple
[{
  "id": 3,
  "text": "cumple",
  "upos": "VERB",
  "xpos": "vmip3s0",
  "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
  "start_char": 8,
  "end_char": 14
}]
[
  {
    "id": 3,
    "text": "cumple",
    "upos": "VERB",
    "xpos": "vmip3s0",
    "feats": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
    "start_char": 8,
    "end_char": 14,
    "misc": "SpaceAfter=No"
  }
]


TypeError: 'Word' object is not subscriptable

In [33]:
pot = open('potential' + ".txt", "w")
print(collect_potential_misclassfied_words(transcriptions), file=pot)

{'id': (3, 4), 'text': 'cumple', 'start_char': 8, 'end_char': 14, 'misc': 'SpaceAfter=No'}
cuÃ¡ndo cumple?
'upos'
{'id': 4, 'text': 'le', 'start_char': 12, 'end_char': 14}
cuÃ¡ndo cumple?
'upos'
{'id': (9, 10), 'text': 'tranquilo', 'start_char': 33, 'end_char': 42, 'misc': 'SpaceAfter=No'}
pero ahora ya no, ahora se queda tranquilo.
'upos'
{'id': 10, 'text': 'lo', 'start_char': 40, 'end_char': 42}
pero ahora ya no, ahora se queda tranquilo.
'upos'
{'id': (14, 15), 'text': 'sÃ³lo', 'start_char': 50, 'end_char': 55}
y es, xxx estÃ¡n con, viven todos juntos, o viven sÃ³lo +/?
'upos'
{'id': 15, 'text': 'lo', 'start_char': 53, 'end_char': 55}
y es, xxx estÃ¡n con, viven todos juntos, o viven sÃ³lo +/?
'upos'
{'id': (4, 5), 'text': 'aparte', 'start_char': 7, 'end_char': 13}
+< no, aparte son dos materias .. nada mÃ¡s.
'upos'
{'id': 5, 'text': 'te', 'start_char': 11, 'end_char': 13}
+< no, aparte son dos materias .. nada mÃ¡s.
'upos'
{'id': (1, 2), 'text': 'hola', 'start_char': 0, 'end_char':

In [21]:
transcriptions[0]["sentences"][0]

[{'id': 1,
  'text': 'bueno',
  'upos': 'ADJ',
  'xpos': 'aq0ms0',
  'feats': 'Gender=Masc|Number=Sing',
  'start_char': 0,
  'end_char': 5,
  'misc': 'SpaceAfter=No'},
 {'id': 2,
  'text': ',',
  'upos': 'PUNCT',
  'xpos': 'fc',
  'feats': 'PunctType=Comm',
  'start_char': 5,
  'end_char': 6},
 {'id': 3,
  'text': 'eh',
  'upos': 'INTJ',
  'start_char': 7,
  'end_char': 9,
  'misc': 'SpaceAfter=No'},
 {'id': 4,
  'text': ',',
  'upos': 'PUNCT',
  'xpos': 'fc',
  'feats': 'PunctType=Comm',
  'start_char': 9,
  'end_char': 10},
 {'id': 5,
  'text': 'cÃ³mo',
  'upos': 'X',
  'xpos': 'np00000',
  'start_char': 11,
  'end_char': 16,
  'misc': 'SpaceAfter=No'},
 {'id': 6,
  'text': ',',
  'upos': 'PUNCT',
  'xpos': 'fc',
  'feats': 'PunctType=Comm',
  'start_char': 16,
  'end_char': 17},
 {'id': 7,
  'text': '.',
  'upos': 'PUNCT',
  'xpos': 'fp',
  'feats': 'PunctType=Peri',
  'start_char': 18,
  'end_char': 19}]