# Stanza

Stanza is a Python Natural Language Processing (NLP) library developed by the Stanford NLP Group. It provides tools and pre-trained neural models for common NLP tasks in many languages.

**Key Features of Stanza:**

* **Multilingual Support:** Stanza supports a wide range of languages, making it a valuable tool for researchers and developers working with diverse linguistic data.
* **Accurate and Efficient Models:** Stanza includes state-of-the-art neural network models that have achieved high accuracy on standard NLP benchmarks. These models are also optimized for efficiency, allowing for fast processing of large volumes of text.
* **Comprehensive NLP Pipeline:** Stanza offers a complete NLP pipeline that covers essential tasks like tokenization, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition. You can use the full pipeline or individual components based on your needs.
* **Customization and Extensibility:** Stanza allows you to customize existing models or train your own models on specific datasets, providing flexibility for specialized NLP applications.
* **User-Friendly Interface:** Stanza's Python API is designed to be intuitive and easy to use, even for those new to NLP.

In Stanza, a pipeline is a sequence of NLP processors that work together to analyze text. Each processor performs a specific task, such as tokenization, part-of-speech tagging, or dependency parsing. Stanza offers flexible pipelines that can be customized to your needs.

**Core Processors in Stanza Pipelines**

The main processors that you can include in a Stanza pipeline are:

1. **Tokenizer (`tokenize`):** Splits text into sentences and tokens.
2. **MWT (`mwt`, Multi-Word Token Expansion):** Expands multi-word tokens into their constituent words.
3. **Lemmatizer (`lemma`):** Reduces words to their base or dictionary forms.
4. **POS (Part-of-Speech) Tagger (`pos`):** Assigns grammatical tags to words (e.g., noun, verb, adjective) and also provides their morphological features (e.g., gender, number, case).
5. **Dependency Parser (`depparse`):** Analyzes the grammatical structure of sentences, identifying the relationships between words.
6. **NER (`ner`, Named Entity Recognizer):** Identifies named entities like persons, organizations, and locations.
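
The annotations written by these processors can be read back from the parsed document. The following is a minimal sketch and not part of the original notebook: `word_rows` is a hypothetical helper, and it assumes a pipeline that includes at least `tokenize`, `pos`, and `lemma`. The attributes `text`, `lemma`, `upos`, and `feats` are word-level fields that Stanza fills in.

```python
def word_rows(doc):
    """Flatten a parsed Stanza document into one tuple per word.

    Each tuple is (text, lemma, upos, feats); feats holds the
    morphological features produced by the pos processor.
    """
    return [
        (word.text, word.lemma, word.upos, word.feats)
        for sentence in doc.sentences
        for word in sentence.words
    ]
```

With a Latin pipeline this could be called as `word_rows(nlp(text))`, where `nlp = stanza.Pipeline('la', processors='tokenize,pos,lemma')`.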


In [1]:
# Connect Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Set didip_ss as working directory
%cd /content/drive/MyDrive/didip_ss
# Check it
!ls DATA

/content/drive/MyDrive/didip_ss
lib		     readme.md		      similarity_results.json
mom_1000_sample.tsv  similarity_network.html  vulgate_sample_1000.tsv


In [3]:
# Install stanza
!pip install stanza

Collecting stanza
  Downloading stanza-1.8.2-py3-none-any.whl (990 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.3.0->stanza)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.3.0->stanza)
  Using cached nvidia_cudnn_cu12-8.9.2.26-p

## Download language models


In [4]:
import stanza

# Download the required language models (you only need to do this once).
# Comment out the languages you do not need, because each package is around 0.5 GB.
stanza.download('la')  # Latin
stanza.download('de')  # German (we'll use this for Middle German)
stanza.download('en')  # English (we'll use this for Middle English)

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: la (Latin) ...


Downloading https://huggingface.co/stanfordnlp/stanza-la/resolve/v1.8.0/models/default.zip:   0%|          | 0…

INFO:stanza:Downloaded file to /root/stanza_resources/la/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: de (German) ...


Downloading https://huggingface.co/stanfordnlp/stanza-de/resolve/v1.8.0/models/default.zip:   0%|          | 0…

INFO:stanza:Downloaded file to /root/stanza_resources/de/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.8.0/models/default.zip:   0%|          | 0…

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
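
Since each default package is large, one option is to download only the processors that are actually used. The sketch below assumes that `stanza.download` accepts a `processors` argument given as a comma-separated string; `download_models` and its `download_fn` parameter are hypothetical names introduced here so that the loop can be tried without network access.

```python
def download_models(langs, processors, download_fn):
    """Download only the listed processors for each language code.

    download_fn is expected to behave like stanza.download; passing it in
    is an assumption of this sketch so the function can be tested offline.
    """
    processor_spec = ",".join(processors)
    for lang in langs:
        download_fn(lang, processors=processor_spec)
    return processor_spec
```

In the notebook this could be called as `download_models(["la"], ["tokenize", "pos", "lemma"], stanza.download)`.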


In [5]:
# Initialize pipelines for each language (comment out what you do not need)
latin_nlp = stanza.Pipeline('la')
german_nlp = stanza.Pipeline('de')  # Using modern German for Middle German
english_nlp = stanza.Pipeline('en')  # Using modern English for Middle English

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: la (Latin):
| Processor | Package       |
-----------------------------
| tokenize  | ittb          |
| pos       | ittb_nocharlm |
| lemma     | ittb_nocharlm |
| depparse  | ittb_nocharlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: de (German):
| Processor    | Package      |
-------------------------------
| tokenize     | gsd          |
| mwt          | gsd          |
| pos          | gsd_charlm   |
| lemma        | gsd_nocharlm |
| constituency | spmrl_charlm |
| depparse     | gsd_charlm   |
| sentiment    | sb10k_charlm |
| ner          | germeval2014 |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor    | Package                   |
--------------------------------------------
| tokenize     | combined                  |
| mwt          | combined                  |
| pos          | combined_charlm           |
| lemma        | combined_nocharlm         |
| constituency | ptb3-revised_charlm       |
| depparse     | combined_charlm           |
| sentiment    | sstplus_charlm            |
| ner          | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: constituency
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
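
The default German and English pipelines above also load constituency, sentiment, and NER models, which the cells that follow do not use, and every call re-checks `resources.json`. A leaner setup might look like the sketch below. `pipeline_options` is a hypothetical helper; `processors` and `download_method=None` are `stanza.Pipeline` arguments, the latter being the option named in the log message above. It assumes the models have already been downloaded.

```python
def pipeline_options(lang, processors, check_for_updates=False):
    """Build keyword arguments for stanza.Pipeline.

    Only the listed processors are requested. Unless check_for_updates is
    True, download_method=None is set so that the resources.json update
    check is skipped and the locally stored models are used.
    """
    options = {"lang": lang, "processors": ",".join(processors)}
    if not check_for_updates:
        options["download_method"] = None
    return options
```

For example, `latin_nlp = stanza.Pipeline(**pipeline_options("la", ["tokenize", "pos", "lemma"]))` would load only the three processors used in the rest of this notebook.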


In [6]:
# Create some test data from the Vulgate
import pandas as pd

# Read the tab-separated file into a DataFrame
df = pd.read_csv("DATA/vulgate_sample_1000.tsv", delimiter="\t")

# Take the first 15 rows; .copy() gives an independent DataFrame, so adding
# columns later does not trigger pandas' SettingWithCopyWarning
vulgata_15_df = df.head(15).copy()

In [7]:
vulgata_15_df

| | id | text |
|---|---|---|
| 0 | 2Kings_chapter_20 | in diebus illis aegrotavit Ezechias usque ad m... |
| 1 | Luke_chapter_17 | et ad discipulos suos ait inpossibile est ut n... |
| 2 | Isaiah_chapter_59 | ecce non est adbreviata manus Domini ut salvar... |
| 3 | Job_chapter_39 | numquid nosti tempus partus hibicum in petris ... |
| 4 | 1John_chapter_4 | carissimi nolite omni spiritui credere sed pro... |
| 5 | Jeremiah_chapter_10 | audite verbum quod locutus est Dominus super v... |
| 6 | Joshua_chapter_22 | eodem tempore vocavit Iosue Rubenitas et Gaddi... |
| 7 | Genesis_chapter_16 | igitur Sarai uxor Abram non genuerat liberos s... |
| 8 | 2Chronicles_chapter_9 | regina quoque Saba cum audisset famam Salomoni... |
| 9 | Job_chapter_10 | taedet animam meam vitae meae dimittam adversu... |


## Tokenizer

In [8]:
def tokenize_latin(text):
    doc = latin_nlp(text)
    return [token.text for sentence in doc.sentences for token in sentence.tokens]

In [9]:
# Tokenize the first 15 Vulgate passages
vulgata_15_df['tokens'] = vulgata_15_df['text'].apply(tokenize_latin)



In [10]:
# Check the vulgata_15_df columns
vulgata_15_df.columns

Index(['id', 'text', 'tokens'], dtype='object')

In [11]:
# Show the tokens of each row alongside the original text
vulgata_15_df[['text', 'tokens']]

| | text | tokens |
|---|---|---|
| 0 | in diebus illis aegrotavit Ezechias usque ad m... | [in, diebus, illis, aegrotavit, Ezechias, usqu... |
| 1 | et ad discipulos suos ait inpossibile est ut n... | [et, ad, discipulos, suos, ait, inpossibile, e... |
| 2 | ecce non est adbreviata manus Domini ut salvar... | [ecce, non, est, adbreviata, manus, Domini, ut... |
| 3 | numquid nosti tempus partus hibicum in petris ... | [numquid, nosti, tempus, partus, hibicum, in, ... |
| 4 | carissimi nolite omni spiritui credere sed pro... | [carissimi, nolite, omni, spiritui, credere, s... |
| 5 | audite verbum quod locutus est Dominus super v... | [audite, verbum, quod, locutus, est, Dominus, ... |
| 6 | eodem tempore vocavit Iosue Rubenitas et Gaddi... | [eodem, tempore, vocavit, Iosue, Rubenitas, et... |
| 7 | igitur Sarai uxor Abram non genuerat liberos s... | [igitur, Sarai, uxor, Abram, non, genuerat, li... |
| 8 | regina quoque Saba cum audisset famam Salomoni... | [regina, quoque, Saba, cum, audisset, famam, S... |
| 9 | taedet animam meam vitae meae dimittam adversu... | [taedet, animam, meam, vitae, meae, dimittam, ... |
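
`tokenize_latin` reads `sentence.tokens`, whereas the lemmatizer and POS cells read `sentence.words`. In Stanza a token can expand into several syntactic words when a multi-word token (MWT) model is loaded, so the two lists may differ in length for languages such as German. The following sketch, with the hypothetical helper `token_word_pairs`, makes the mapping visible through `token.words`.

```python
def token_word_pairs(doc):
    """Return (token text, [word texts]) for every token in the document.

    Without an MWT model each token maps to exactly one word; with one,
    a single token may map to two or more words.
    """
    return [
        (token.text, [word.text for word in token.words])
        for sentence in doc.sentences
        for token in sentence.tokens
    ]
```

For the Latin pipeline loaded above, which has no `mwt` processor, every token should map to a single word, so the token and word lists stay aligned.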


## Lemmatizer

In [12]:
# mwt is requested here, but the log below shows that only tokenize and lemma are loaded for Latin
latin_lemmatizer = stanza.Pipeline(lang="la", processors="tokenize,mwt,lemma")

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: la (Latin):
| Processor | Package       |
-----------------------------
| tokenize  | ittb          |
| lemma     | ittb_nocharlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!


In [13]:
def lemmatize_latin(text):
    doc = latin_lemmatizer(text)
    return [word.lemma for sentence in doc.sentences for word in sentence.words]
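
Lemmatizer output on historical text is worth spot-checking. One cheap heuristic, sketched below with the hypothetical helper `unchanged_forms`, lists tokens whose lemma is identical to the lowercased token. It assumes the token and lemma lists are aligned, which should hold for the Latin pipeline here because no MWT model is loaded. Many hits are legitimately uninflected, so the result is a list of candidates to review, not a list of errors.

```python
def unchanged_forms(tokens, lemmas, min_length=5):
    """Return tokens whose lemma equals the lowercased token.

    Short words are skipped via min_length (an arbitrary threshold chosen
    for this sketch) because particles and prepositions are usually their
    own lemma.
    """
    return [
        token
        for token, lemma in zip(tokens, lemmas)
        if len(token) >= min_length and lemma == token.lower()
    ]
```

A call such as `unchanged_forms(tokenize_latin(text), lemmatize_latin(text))` would flag verb forms like `aegrotavit` when the lemmatizer returns them unchanged.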

In [14]:
# Lemmatize the first 15 Vulgate passages
vulgata_15_df['lemmas'] = vulgata_15_df['text'].apply(lemmatize_latin)



## POS tagger

In [15]:
# mwt is requested here, but the log below shows that only tokenize and pos are loaded for Latin
latin_pos_tagger = stanza.Pipeline(lang="la", processors="tokenize,mwt,pos")

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.8.0.json:   0%|   …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: la (Latin):
| Processor | Package       |
-----------------------------
| tokenize  | ittb          |
| pos       | ittb_nocharlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!


In [22]:
def pos_tagger_latin(text):
    doc = latin_pos_tagger(text)
    return [word.upos for sentence in doc.sentences for word in sentence.words]

In [23]:
# POS-tag the first 15 Vulgate passages
vulgata_15_df['pos'] = vulgata_15_df['text'].apply(pos_tagger_latin)



In [24]:
vulgata_15_df

| | id | text | tokens | lemmas | pos |
|---|---|---|---|---|---|
| 0 | 2Kings_chapter_20 | in diebus illis aegrotavit Ezechias usque ad m... | [in, diebus, illis, aegrotavit, Ezechias, usqu... | [in, dies, ille, aegrotavit, ezechias, usque, ... | [ADP, NOUN, DET, VERB, PROPN, ADP, ADP, NOUN, ... |
| 1 | Luke_chapter_17 | et ad discipulos suos ait inpossibile est ut n... | [et, ad, discipulos, suos, ait, inpossibile, e... | [et, ad, discipulus, suus, aio, inpossibilis, ... | [CCONJ, ADP, NOUN, DET, VERB, ADJ, AUX, SCONJ,... |
| 2 | Isaiah_chapter_59 | ecce non est adbreviata manus Domini ut salvar... | [ecce, non, est, adbreviata, manus, Domini, ut... | [ecce, non, sum, adbreviata, manus, dominus, u... | [ADV, PART, AUX, ADJ, NOUN, NOUN, SCONJ, VERB,... |
| 3 | Job_chapter_39 | numquid nosti tempus partus hibicum in petris ... | [numquid, nosti, tempus, partus, hibicum, in, ... | [numquis, nosco, tempus, partus, hibicus, in, ... | [PRON, VERB, NOUN, NOUN, NOUN, ADP, NOUN, CCON... |
| 4 | 1John_chapter_4 | carissimi nolite omni spiritui credere sed pro... | [carissimi, nolite, omni, spiritui, credere, s... | [carissamus, nolo, omnis, spiritus, credo, sed... | [ADJ, VERB, DET, NOUN, VERB, CCONJ, VERB, NOUN... |
| 5 | Jeremiah_chapter_10 | audite verbum quod locutus est Dominus super v... | [audite, verbum, quod, locutus, est, Dominus, ... | [audio, verbum, quod, loquor, sum, dominus, su... | [VERB, NOUN, PRON, VERB, AUX, NOUN, ADP, DET, ... |
| 6 | Joshua_chapter_22 | eodem tempore vocavit Iosue Rubenitas et Gaddi... | [eodem, tempore, vocavit, Iosue, Rubenitas, et... | [idem, tempus, vocavit, iosus, Rubenitas, et, ... | [DET, NOUN, VERB, PROPN, NOUN, CCONJ, VERB, CC... |
| 7 | Genesis_chapter_16 | igitur Sarai uxor Abram non genuerat liberos s... | [igitur, Sarai, uxor, Abram, non, genuerat, li... | [igitur, sara, uxor, abra, non, genusra, liber... | [ADV, NOUN, NOUN, NOUN, PART, VERB, NOUN, CCON... |
| 8 | 2Chronicles_chapter_9 | regina quoque Saba cum audisset famam Salomoni... | [regina, quoque, Saba, cum, audisset, famam, S... | [regina, quoque, sabus, cum, audessus, fama, s... | [NOUN, PART, PROPN, SCONJ, VERB, NOUN, PROPN, ... |
| 9 | Job_chapter_10 | taedet animam meam vitae meae dimittam adversu... | [taedet, animam, meam, vitae, meae, dimittam, ... | [taedet, anima, meus, vitae, meus, dimitta, ad... | [VERB, NOUN, DET, NOUN, DET, VERB, NOUN, PRON,... |
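
The `tokens`, `lemmas`, and `pos` columns were produced by running three separate pipelines over the same text. Since `latin_nlp` already loads `tokenize`, `pos`, and `lemma`, the same information can be collected in one pass. The following is a minimal sketch, with `annotate_once` as a hypothetical helper that accepts any callable behaving like a Stanza pipeline.

```python
def annotate_once(nlp, text):
    """Run the pipeline once and return tokens, lemmas and UPOS tags.

    All three lists are read from sentence.words, so they always have the
    same length. Note that "tokens" therefore holds word texts, which equal
    the token texts only when no multi-word tokens are expanded.
    """
    doc = nlp(text)
    words = [word for sentence in doc.sentences for word in sentence.words]
    return {
        "tokens": [word.text for word in words],
        "lemmas": [word.lemma for word in words],
        "pos": [word.upos for word in words],
    }
```

In the notebook, `annotations = vulgata_15_df['text'].apply(lambda t: annotate_once(latin_nlp, t))` followed by `pd.DataFrame(annotations.tolist(), index=vulgata_15_df.index)` would give the three columns from a single parse per row.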
