# StanfordNLP

StanfordNLP is a Python natural language analysis package. It contains tools that can be used in a pipeline to convert a string of human language text into lists of sentences and words, to generate the base forms of those words along with their parts of speech and morphological features, and to produce a syntactic dependency parse. The parse uses the Universal Dependencies formalism, which is designed to be parallel across more than 70 languages. In addition, StanfordNLP can call the CoreNLP Java package and inherits additional functionality from it, such as constituency parsing, coreference resolution, and linguistic pattern matching.

The StanfordNLP package is built with highly accurate neural network components that enable efficient training and evaluation with your own annotated data. The modules are built on top of ***PyTorch***, and the system runs much faster on a GPU-enabled machine. The package combines software based on the Stanford entry in the CoNLL 2018 Shared Task on Universal Dependency Parsing with the group’s official Python interface to the Java Stanford CoreNLP software. The CoNLL UD system is partly a cleaned-up version of the code used in the shared task and partly an approximate rewrite in PyTorch of the original TensorFlow version of the tagger and parser.


# Installation & Model Download

### Installation

Install StanfordNLP from [PyPI](https://pypi.org/) by running the following command in your command line or Anaconda prompt:

    pip install stanfordnlp
    
This will take care of all your necessary dependencies to run StanfordNLP. The neural pipeline of StanfordNLP depends on PyTorch 1.0.0 or a later version with compatible APIs.

**Note:** Installing PyTorch.

For Conda (works on Windows and Linux):

    conda install pytorch torchvision cpuonly -c pytorch
    
For Conda (Mac):

    conda install pytorch torchvision -c pytorch
    
For pip (Windows and Linux):

    pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
    
For pip (Mac):

    pip install torch torchvision

In [None]:
!pip install stanfordnlp

Collecting stanfordnlp
  Downloading https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl (158kB)
     |████████████████████████████████| 163kB 2.7MB/s 
Installing collected packages: stanfordnlp
Successfully installed stanfordnlp-0.2.0


In [None]:
!pip install torch==1.4.0+cpu torchvision==0.5.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.4.0+cpu
  Downloading https://download.pytorch.org/whl/cpu/torch-1.4.0%2Bcpu-cp36-cp36m-linux_x86_64.whl (127.2MB)
     |████████████████████████████████| 127.2MB 94kB/s 
Collecting torchvision==0.5.0+cpu
  Downloading https://download.pytorch.org/whl/cpu/torchvision-0.5.0%2Bcpu-cp36-cp36m-linux_x86_64.whl (5.4MB)
     |████████████████████████████████| 5.4MB 30.6MB/s 
Installing collected packages: torch, torchvision
  Found existing installation: torch 1.4.0
    Uninstalling torch-1.4.0:
      Successfully uninstalled torch-1.4.0
  Found existing installation: torchvision 0.5.0
    Uninstalling torchvision-0.5.0:
      Successfully uninstalled torchvision-0.5.0
Successfully installed torch-1.4.0+cpu torchvision-0.5.0+cpu
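
One way to confirm that an installed PyTorch satisfies the 1.0.0 minimum stated earlier is to compare version strings. The sketch below is illustrative only: `parse_version` and `meets_minimum` are hypothetical helpers (not part of StanfordNLP or PyTorch), and they deliberately ignore local build suffixes such as `+cpu`.

```python
import re

def parse_version(version):
    """Turn '1.4.0+cpu' into (1, 4, 0), ignoring any local suffix after '+'."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum="1.0.0"):
    """Return True when `installed` is at least `minimum` (hypothetical helper)."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so '1.0' compares equal to '1.0.0'
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

print(meets_minimum("1.4.0+cpu"))  # version string taken from the install log
print(meets_minimum("0.4.1"))
```

In practice the `installed` argument would typically be `torch.__version__`.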


In [None]:
import stanfordnlp
stanfordnlp.download('en')   # This downloads the English models for the neural pipeline
nlp = stanfordnlp.Pipeline() # This sets up a default neural pipeline in English
doc = nlp("Barack Obama was born in Hawaii.  He was elected president in 2008.")
doc.sentences[0].print_dependencies()

Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
Y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /root/stanfordnlp_resources/en_ewt_models.zip


100%|██████████| 235M/235M [00:23<00:00, 10.1MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources



The last command prints the words in the first sentence of the input string (or Document, as it is represented in StanfordNLP). For each word it also prints the index of the word that governs it in the Universal Dependencies parse of that sentence (its “head”) and the dependency relation between the two words.

# Models for Human Languages

Downloading a language pack is as simple as:

In [None]:
import stanfordnlp 
#stanfordnlp.download('ar')    # replace "ar" with the language

To use the default language pack for any language, simply build the pipeline as follows:

In [None]:
import stanfordnlp 
nlp = stanfordnlp.Pipeline(lang="en") # This sets up a default neural pipeline in English
doc = nlp("Narendra Modi was born in India. He became Prime minister in 2014.")
doc.sentences[0].print_dependencies()

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_parser.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand



# Pipeline

Users of StanfordNLP can process documents by building a Pipeline with the desired Processor units. The pipeline takes in a Document object or raw text, runs the processors in succession, and returns an annotated Document.
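
The idea of running processors in succession over a shared document can be sketched in plain Python. This is a toy illustration, not StanfordNLP's implementation: `run_pipeline`, `toy_tokenize`, and `toy_lowercase` are hypothetical stand-ins, and the rule-based splitting here is far cruder than the neural models the library uses.

```python
import re

def toy_tokenize(doc):
    """Stand-in tokenizer: split sentences after '.', then split words and punctuation."""
    sentences = [s for s in re.split(r"(?<=\.)\s+", doc["text"].strip()) if s]
    doc["sentences"] = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
    return doc

def toy_lowercase(doc):
    """Stand-in for a later annotator that relies on the tokenizer's output."""
    doc["lowercased"] = [[tok.lower() for tok in sent] for sent in doc["sentences"]]
    return doc

def run_pipeline(text, processors):
    """Wrap raw text in a document and apply each processor in order."""
    doc = {"text": text}
    for processor in processors:
        doc = processor(doc)
    return doc

doc = run_pipeline(
    "Barack Obama was born in Hawaii. He was elected president in 2008.",
    [toy_tokenize, toy_lowercase],
)
print(doc["sentences"][0])
print(doc["lowercased"][1])
```

The ordering matters: `toy_lowercase` would fail if it ran before `toy_tokenize`, which mirrors why later processors in a real pipeline depend on earlier ones.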

## Options
![alt text](1.png "Title")

In [None]:
import stanfordnlp

MODELS_DIR = '.'
stanfordnlp.download('en', MODELS_DIR) # Download the English models
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR, treebank='en_ewt', use_gpu=True, pos_batch_size=3000) # Build the pipeline, specify part-of-speech processor's batch size
doc = nlp("Barack Obama was born in Hawaii.") # Run the pipeline on input text
doc.sentences[0].print_tokens() # Look at the result

Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
y

Downloading models for: en_ewt
Download location: ./en_ewt_models.zip


100%|██████████| 235M/235M [00:10<00:00, 22.2MB/s]



Download complete.  Models saved to: ./en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': './en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': './en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': './en_ewt_models/en_ewt.pretrain.pt', 'batch_size': 3000, 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
<Token index=1;words=[<Word index=1;text=Barack;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=2;words=[<Word index=2;text=Obama;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=3;words=[<Word index=3;text=was;upos=AUX;xpos=VBD;feats=Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin>]>
<Token index=4;words=[<Word index=4;text=born;upos=VERB;xpos=VBN;feats=Tense=Past|VerbForm=Part|Voice=Pass>]>
<Token index=5;words=[<Word index=5;text=in;upos=ADP;xpos=IN;feats
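
The log above shows `pos_batch_size=3000` arriving in the pos processor's settings as `'batch_size': 3000`, which suggests that options prefixed with a processor name are routed to that processor. A minimal sketch of that kind of routing follows; `split_processor_options` is a hypothetical helper written for illustration and makes no claim about how StanfordNLP implements it internally.

```python
def split_processor_options(processors, **kwargs):
    """Group keyword options by processor-name prefix (illustrative only)."""
    names = [name.strip() for name in processors.split(",")]
    per_processor = {name: {} for name in names}
    shared = {}
    for key, value in kwargs.items():
        for name in names:
            prefix = name + "_"
            if key.startswith(prefix):
                # Strip the prefix: 'pos_batch_size' becomes 'batch_size' for 'pos'
                per_processor[name][key[len(prefix):]] = value
                break
        else:
            # No processor prefix matched, so treat it as a pipeline-wide option
            shared[key] = value
    return per_processor, shared

per_processor, shared = split_processor_options(
    "tokenize,pos", treebank="en_ewt", use_gpu=True, pos_batch_size=3000
)
print(per_processor)
print(shared)
```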

## Processors Summary

Processors are units of the neural pipeline that create different annotations for a Document. The neural pipeline now supports the following processors:
![alt text](2.png "Title")

## Data Objects

This section describes the data objects used in StanfordNLP and how they interact with each other.

### Document

A Document object holds the annotation of an entire document, and is automatically generated when a string is annotated by the Pipeline. It holds a collection of Sentences, and can be seamlessly translated into a CoNLL-U file.

Objects of this class expose useful properties such as text, sentences, and conll_file.
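
To make the CoNLL-U connection concrete, here is a rough sketch of how word-level annotations map onto the ten tab-separated CoNLL-U columns. It works on plain dictionaries rather than StanfordNLP objects, and `to_conllu` is a hypothetical helper, not the code behind conll_file. Fields that have not been annotated are written as `_`.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def to_conllu(sentences):
    """Serialize sentences (lists of word dicts) as CoNLL-U text."""
    blocks = []
    for sentence in sentences:
        rows = [
            "\t".join(str(word.get(field, "_")) for field in CONLLU_FIELDS)
            for word in sentence
        ]
        blocks.append("\n".join(rows))
    # Every sentence, including the last one, is followed by a blank line
    return "\n\n".join(blocks) + "\n\n"

# Annotations copied from the tokenize,pos output shown earlier
sentence = [
    {"id": 1, "form": "Barack", "upos": "PROPN", "xpos": "NNP", "feats": "Number=Sing"},
    {"id": 2, "form": "Obama", "upos": "PROPN", "xpos": "NNP", "feats": "Number=Sing"},
]
conllu = to_conllu([sentence])
print(conllu)
```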

### Sentence

A Sentence object represents a sentence (as predicted by the tokenizer), and holds a list of the Tokens in the sentence as well as a list of all its Words. It also processes the dependency parse predicted by the parser, through its member method build_dependencies.

Objects of this class expose useful properties such as words, tokens, and dependencies, as well as methods such as print_tokens, print_words, print_dependencies.

### Token

A Token object holds a token and a list of its underlying words. If the token is a multi-word token (e.g., French au = à le), it will have a range index as described in the CoNLL-U format specification (e.g., 3-4), with its words property containing the underlying Words. In other cases, the Token object is a simple wrapper around one Word object, and its words property is a singleton.
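
As a small illustration of the range index, the following sketch expands a token index string into the word indices it covers. `word_indices` is a hypothetical helper for this notebook, not a StanfordNLP function, and it assumes the index is a string such as `"4-5"` or `"6"`.

```python
def word_indices(token_index):
    """Expand a CoNLL-U token ID such as '4-5' into the word indices it covers."""
    if "-" in token_index:
        start, end = (int(part) for part in token_index.split("-"))
        return list(range(start, end + 1))
    return [int(token_index)]

print(word_indices("4-5"))  # a multi-word token spanning words 4 and 5
print(word_indices("6"))    # an ordinary token wrapping a single word
```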

### Word

A Word object holds a syntactic word and all of its word-level annotations. In the case of multi-word tokens (MWT), Words are generated by multi-word token expansion and are used in all downstream syntactic analyses such as tagging, lemmatization, and parsing. If a Word is the result of an MWT expansion, its text will usually not be found in the raw input text. Aside from multi-word tokens, Words should be similar to the familiar “tokens” one would see elsewhere.

## TokenizeProcessor

### Description
Tokenizes the text and performs sentence segmentation.
![alt text](3.png "Title")

### Options
![alt text](4.png "Title")

### Example 

The tokenize processor is usually the first processor used in the pipeline. It performs tokenization and sentence segmentation at the same time. After this processor is run, the input document will become a list of Sentences. The list of tokens for sentence sent can then be accessed with sent.tokens. The code below shows an example of tokenization and sentence segmentation.

In [None]:
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize', lang='en')
doc = nlp("This is a test sentence for stanfordnlp. This is another sentence.")
for i, sentence in enumerate(doc.sentences):
    print(f"====== Sentence {i+1} tokens =======")
    print(*[f"index: {token.index.rjust(3)}\ttoken: {token.text}" for token in sentence.tokens], sep='\n')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
index:   1	token: This
index:   2	token: is
index:   3	token: a
index:   4	token: test
index:   5	token: sentence
index:   6	token: for
index:   7	token: stanfordnlp
index:   8	token: .
index:   1	token: This
index:   2	token: is
index:   3	token: another
index:   4	token: sentence
index:   5	token: .


## MWTProcessor

### Description
Expands multi-word tokens (MWT) predicted by the tokenizer.
![alt text](5.png "Title")
### Options
![alt text](6.png "Title")
### Example
The mwt processor only requires tokenize. After these two processors have run, the Sentences will have lists of tokens and corresponding words based on the multi-word token expander model. For a sentence sent, the list of tokens can be accessed with sent.tokens and the list of words with sent.words. For a token, the list of its words can be accessed with token.words. The code below shows an example of accessing tokens and words.

In [None]:
import stanfordnlp
stanfordnlp.download('fr') 
nlp = stanfordnlp.Pipeline(processors='tokenize,mwt', lang='fr')
doc = nlp("Alors encore inconnu du grand public, Emmanuel Macron devient en 2014 ministre de l'Économie, de l'Industrie et du Numérique.")
print(*[f'token: {token.text.ljust(9)}\t\twords: {token.words}' for sent in doc.sentences for token in sent.tokens], sep='\n')
print('')
print(*[f'word: {word.text.ljust(9)}\t\ttoken parent:{word.parent_token.index+"-"+word.parent_token.text}' for sent in doc.sentences for word in sent.words], sep='\n')

Using the default treebank "fr_gsd" for language "fr".
Would you like to download the models for: fr_gsd now? (Y/n)
Y

Default download directory: C:\Users\sudha\stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: fr_gsd
Download location: C:\Users\sudha\stanfordnlp_resources\fr_gsd_models.zip


100%|████████████████████████████████████████████████████████████████████████████████| 235M/235M [09:54<00:00, 418kB/s]



Download complete.  Models saved to: C:\Users\sudha\stanfordnlp_resources\fr_gsd_models.zip
Extracting models file for: fr_gsd
Cleaning up...Done.
Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_tokenizer.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
---
Loading: mwt
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_mwt_expander.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
Done loading processors!
---
token: Alors    		words: [<Word index=1;text=Alors>]
token: encore   		words: [<Word index=2;text=encore>]
token: inconnu  		words: [<Word index=3;text=inconnu>]
token: du       		words: [<Word index=4;text=de>, <Word index=5;text=le>]
token: grand    		words: [<Word index=6;text=grand>]
token: public   		words

## POSProcessor

### Description
Labels tokens with their universal POS (UPOS) tags, treebank-specific POS (XPOS) tags, and universal morphological features (UFeats).
![alt text](7.png "Title")
### Options
![alt text](8.png "Title")

### Example

Running the part of speech tagger simply requires tokenization and multi-word expansion. So the pipeline can be run with tokenize,mwt,pos as the list of processors. After the pipeline is run, the document will contain a list of sentences, and the sentences will contain lists of words. The part-of-speech tags can be accessed via the upos and xpos fields.

In [None]:
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos')
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'word: {word.text+" "}\tupos: {word.upos}\txpos: {word.xpos}' for sent in doc.sentences for word in sent.words], sep='\n')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---
word: Barack 	upos: PROPN	xpos: NNP
word: Obama 	upos: PROPN	xpos: NNP
word: was 	upos: AUX	xpos: VBD
word: born 	upos: VERB	xpos: VBN
word: in 	upos: ADP	xpos: IN
word: Hawaii 	upos: PROPN	xpos: NNP
word: . 	upos: PUNCT	xpos: .
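
The morphological features (UFeats) arrive as a single string of `Name=Value` pairs joined by `|`, as seen in the `feats` values printed by print_tokens earlier. A small sketch for turning that string into a dictionary follows; `parse_feats` is a hypothetical convenience function, not part of StanfordNLP.

```python
def parse_feats(feats):
    """Split a UFeats string like 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":
        return {}  # '_' marks an empty feature set in CoNLL-U
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Feature string copied from the output for the word "was"
feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(feats["Tense"])
print(parse_feats("_"))
```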


## LemmaProcessor

### Description
Generates the word lemmas for all tokens in the corpus.
![alt text](9.png "Title")
### Options
![alt text](10.png "Title")
### Example
If your main interest is lemmatizing, you can supply a smaller processors list with just the prerequisites for lemma. After the pipeline is run, the document will contain a list of sentences, and the sentences will contain lists of words. The lemma information can be found in the lemma field.

In [None]:
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma')
doc = nlp("Barack Obama was born in Hawaii.")
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_tagger.pt', 'pretrain_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\en_ewt_models\\en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
Done loading processors!
---
word: Barack 	lemma: Barack
word: Obama 	lemma: Obama
word: was 	lemma: be
word: born 	lemma: bear
word: in 	lemma: in
word: Hawaii 	le
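
Since each processor needs the ones before it, it can be handy to derive the processors string from the target annotation. The table below only mirrors the processor lists used in this notebook's examples (the depparse entry follows the DepparseProcessor section); `REQUIRES` and `processors_for` are hypothetical names, not part of the StanfordNLP API.

```python
# Prerequisites as used in this notebook's examples (an assumption, not an official table)
REQUIRES = {
    "tokenize": [],
    "mwt": ["tokenize"],
    "pos": ["tokenize", "mwt"],
    "lemma": ["tokenize", "mwt", "pos"],
    "depparse": ["tokenize", "mwt", "pos", "lemma"],
}

def processors_for(target):
    """Build the comma-separated processors string needed for `target`."""
    return ",".join(REQUIRES[target] + [target])

print(processors_for("lemma"))
print(processors_for("mwt"))
```

The resulting string has the same shape as the `processors` argument passed to the Pipeline in the cells above.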

## DepparseProcessor

### Description
Provides an accurate syntactic dependency parser.
![alt text](11.png "Title")
### Options
![alt text](12.png "Title")
### Example
The depparse processor depends on tokenize, mwt, pos, and lemma. After all these processors have run, each Sentence in the output will have been parsed into a Universal Dependencies structure, where the governor index of each word can be accessed with word.governor and the dependency relation between the word and its governor with word.dependency_relation. Note that the governor index starts at 1 for actual words and is 0 only when the word itself is the root of the tree. Because the words list is 0-based, subtract 1 from this index when looking up the governor word in the sentence. Here is an example that accesses dependency parse information:

In [None]:
import stanfordnlp

nlp = stanfordnlp.Pipeline(processors='tokenize,mwt,pos,lemma,depparse', lang='fr')
doc = nlp("Van Gogh grandit au sein d'une famille de l'ancienne bourgeoisie.")
print(*[f"index: {word.index.rjust(2)}\tword: {word.text.ljust(11)}\tgovernor index: {word.governor}\tgovernor: {(doc.sentences[0].words[word.governor-1].text if word.governor > 0 else 'root').ljust(11)}\tdeprel: {word.dependency_relation}" for word in doc.sentences[0].words], sep='\n')

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_tokenizer.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
---
Loading: mwt
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_mwt_expander.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
---
Loading: pos
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_tagger.pt', 'pretrain_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd.pretrain.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': 'C:\\Users\\sudha\\stanfordnlp_resources\\fr_gsd_models\\fr_gsd_lemmatizer.pt', 'lang': 'fr', 'shorthand': 'fr_gsd', 'mode': 'predict'}
Building an attentional Seq2



index:  1	word: Van        	governor index: 3	governor: grandit    	deprel: nsubj
index:  2	word: Gogh       	governor index: 1	governor: Van        	deprel: flat:name
index:  3	word: grandit    	governor index: 0	governor: root       	deprel: root
index:  4	word: à          	governor index: 6	governor: sein       	deprel: case
index:  5	word: le         	governor index: 6	governor: sein       	deprel: det
index:  6	word: sein       	governor index: 3	governor: grandit    	deprel: obl
index:  7	word: d'         	governor index: 9	governor: famille    	deprel: case
index:  8	word: une        	governor index: 9	governor: famille    	deprel: det
index:  9	word: famille    	governor index: 6	governor: sein       	deprel: nmod
index: 10	word: de         	governor index: 13	governor: bourgeoisie	deprel: case
index: 11	word: l'         	governor index: 13	governor: bourgeoisie	deprel: det
index: 12	word: ancienne   	governor index: 13	governor: bourgeoisie	deprel: amod
index: 13	word: bourgeo
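
The 1-based governor index and the 0-based words list are easy to mix up, so here is the lookup isolated on plain tuples. The data is copied from the first nine rows of the output above; `governor_text` is a hypothetical helper and does not use StanfordNLP objects.

```python
# (text, governor index, dependency relation) for the first nine words above
words = [
    ("Van", 3, "nsubj"),
    ("Gogh", 1, "flat:name"),
    ("grandit", 0, "root"),
    ("à", 6, "case"),
    ("le", 6, "det"),
    ("sein", 3, "obl"),
    ("d'", 9, "case"),
    ("une", 9, "det"),
    ("famille", 6, "nmod"),
]

def governor_text(words, position):
    """Return the head word's text for the word at 1-based `position`."""
    _, governor, _ = words[position - 1]
    if governor == 0:
        return "root"  # index 0 marks the root of the tree, not a real word
    return words[governor - 1][0]  # governor is 1-based, the list is 0-based

print(governor_text(words, 1))
print(governor_text(words, 3))
print(governor_text(words, 9))
```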