<h1><b>Introduction to StanfordNLP</b></h1>

StanfordNLP is a collection of pre-trained state-of-the-art models. These models were used by the researchers in the CoNLL 2017 and 2018 competitions. All the models are built on PyTorch and can be trained and evaluated on your own annotated data. Awesome!

StanfordNLP also contains an official wrapper for CoreNLP, the popular behemoth NLP library, which had been somewhat limited to the Java ecosystem until now. You should check out this tutorial to learn more about CoreNLP and how it works in Python.

What more could an NLP enthusiast ask for? Now that we have a handle on what this library does, let’s take it for a spin in Python.

<h2><b>Step 1. Install StanfordNLP using the command below</b></h2>

In [4]:
# !pip install stanfordnlp

<h2><b>Step 2. Before using the library, download the model for the language you want to work with.</b></h2>

In [5]:
# First, import the library

import stanfordnlp

<b>After that, download the language model</b>

In [8]:
#--stanfordnlp.download('en') 
# The download may take a while depending on your internet connection, since the model files are large.

<h2><b>Step 3. StanfordNLP is built on top of PyTorch, so if you run into issues along the way, check which version of PyTorch you are using.</b></h2>

In [9]:
# Use the following command
#---pip freeze | grep torch
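
The same check can also be done from inside Python without leaving the notebook. The sketch below uses the standard library's `importlib.metadata` (Python 3.8 or newer). The helper name `installed_version` is made up for this illustration, and which PyTorch version your StanfordNLP release expects is something to confirm in its own installation notes.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the versions of the two packages this walkthrough relies on.
for name in ("torch", "stanfordnlp"):
    version = installed_version(name)
    print(name, version if version is not None else "is not installed")
```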

<h2><b> Step 4. Let's go through some examples. </b></h2>

In [10]:
nlp = stanfordnlp.Pipeline(processors = "tokenize,mwt,lemma,pos")

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/home/ajay/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/home/ajay/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: pos
With settings: 
{'model_path': '/home/ajay/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/home/ajay/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---


> <b>As we can see, the output also tells us which models were loaded and what they were trained on (here, en_ewt).</b>

<h3>processors</h3>

<b> The processors argument specifies which tasks to run, as a comma-separated string. All five processors are used by default if no argument is passed. </b>

<ol> The five processors are:
    <li> tokenize </li>
    <li> mwt </li>
    <li> lemma</li>
    <li> pos</li>
    <li> depparse</li>
</ol>

For more information, see the processors documentation: <a href="https://stanfordnlp.github.io/stanfordnlp/processors.html">https://stanfordnlp.github.io/stanfordnlp/processors.html</a>
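
Since the processors argument is just a comma-separated string, it is easy to mistype a name. Below is a small sketch of a helper that checks names against the five processors listed above and joins them in that listing order. Both `build_processors_arg` and `KNOWN_PROCESSORS` are hypothetical names for this illustration and are not part of StanfordNLP.

```python
# The five processor names listed above (hypothetical constant, not a library API).
KNOWN_PROCESSORS = ("tokenize", "mwt", "lemma", "pos", "depparse")


def build_processors_arg(*names):
    """Validate processor names and join them into one comma-separated string."""
    unknown = [name for name in names if name not in KNOWN_PROCESSORS]
    if unknown:
        raise ValueError("unknown processor(s): " + ", ".join(unknown))
    # Keep the listing order and drop duplicates.
    return ",".join(name for name in KNOWN_PROCESSORS if name in names)


print(build_processors_arg("pos", "tokenize", "lemma"))  # tokenize,lemma,pos
```

The returned string could then be passed as the `processors` argument of `stanfordnlp.Pipeline`, as in Step 4.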

In [21]:
# Pass some text to the pipeline.

document = nlp("Murasaki Shikibu's Tale of Genji (1010) has sometimes been described as the world's first novel, but there is considerable debate over this — there were certainly long fictional works much earlier. Spread of printed books in China led to the appearance of classical Chinese novels by the Ming dynasty (1368–1644). Parallel European developments occurred after the invention of the printing press. Miguel de Cervantes, author of Don Quixote (the first part of which was published in 1605), is frequently cited as the first significant European novelist of the modern era.[2] Ian Watt, in The Rise of the Novel (1957), suggested that the modern novel was born in the early 18th century.")

Let's go through the processors one by one.

<h2><b> 1. Tokenization </b></h2>

In [13]:
document.sentences[0].print_tokens()

<Token index=1;words=[<Word index=1;text=Murasaki;lemma=Murasaki;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=2;words=[<Word index=2;text=Shikibu;lemma=Shikibu;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=3;words=[<Word index=3;text='s;lemma='s;upos=PART;xpos=POS;feats=_>]>
<Token index=4;words=[<Word index=4;text=Tale;lemma=Tale;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=5;words=[<Word index=5;text=of;lemma=of;upos=ADP;xpos=IN;feats=_>]>
<Token index=6;words=[<Word index=6;text=Genji;lemma=Genji;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=7;words=[<Word index=7;text=(;lemma=(;upos=PUNCT;xpos=-LRB-;feats=_>]>
<Token index=8;words=[<Word index=8;text=1010;lemma=1010;upos=NUM;xpos=CD;feats=NumType=Card>]>
<Token index=9;words=[<Word index=9;text=);lemma=);upos=PUNCT;xpos=-RRB-;feats=_>]>
<Token index=10;words=[<Word index=10;text=has;lemma=have;upos=AUX;xpos=VBZ;feats=Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin>]>
<Token index=11;words=[<W

Each token object contains the index of the token in the sentence and a list of word objects (more than one in the case of a multi-word token). Each word object contains useful information, such as the index of the word, its text, its lemma, its universal and treebank-specific part-of-speech tags (upos and xpos), and its morphological features (feats).
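
The feats value in the output above packs several morphological features into one string, with `|` between features, `=` between a feature name and its value, and `_` when there are none. As a rough sketch, it can be unpacked into a dictionary with plain string operations. The function name `parse_feats` is invented here for illustration; the sample string is copied from the word "has" in the output above.

```python
def parse_feats(feats):
    """Split a feats string such as 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


features = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
print(features["Tense"])  # Pres
print(parse_feats("_"))   # {}
```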

<h2> <b> 2. Lemmatization </b></h2>

In [14]:
# We will save the output in a DataFrame for easier viewing

import pandas as pd

# Define a function that takes a processed document and returns each word with its lemma

def lemma_extract(document):
    text = {'word':[],'lemma':[]}
    for sentence in document.sentences:
        for word in sentence.words:
            text['word'].append(word.text)
            text['lemma'].append(word.lemma)
    return pd.DataFrame(text)

In [15]:
# Let's call the function
lemma_extract(document)

   word      lemma
0  Murasaki  Murasaki
1  Shikibu   Shikibu
2  's        's
3  Tale      Tale
4  of        of
5  Genji     Genji
6  (         (
7  1010      1010
8  )         )
9  has       have
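
Most rows in the table are identical in both columns, so it can help to look only at the words the lemmatizer actually changed. Here is a minimal sketch in plain Python, using the ten rows shown above as sample data. The name `changed_by_lemmatizer` is made up for this example.

```python
# (word, lemma) pairs copied from the ten rows of the output above.
rows = [
    ("Murasaki", "Murasaki"), ("Shikibu", "Shikibu"), ("'s", "'s"),
    ("Tale", "Tale"), ("of", "of"), ("Genji", "Genji"), ("(", "("),
    ("1010", "1010"), (")", ")"), ("has", "have"),
]


def changed_by_lemmatizer(pairs):
    """Keep only the pairs whose lemma differs from the surface word."""
    return [(word, lemma) for word, lemma in pairs if word != lemma]


print(changed_by_lemmatizer(rows))  # [('has', 'have')]
```

With the DataFrame returned by `lemma_extract`, the same filter could be written as `df[df['word'] != df['lemma']]`.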


This looks cool, right?

Let's look at the remaining ones.

<h2><b> 3. Parts of Speech (PoS) Tagging </b></h2>

In [18]:
#dictionary to hold pos tags and their explanations
pos_dict = {
'CC': 'coordinating conjunction',
'CD': 'cardinal digit',
'DT': 'determiner',
'EX': 'existential there (like: \"there is\" ... think of it like \"there exists\")',
'FW': 'foreign word',
'IN':  'preposition/subordinating conjunction',
'JJ': 'adjective \'big\'',
'JJR': 'adjective, comparative \'bigger\'',
'JJS': 'adjective, superlative \'biggest\'',
'LS': 'list marker 1)',
'MD': 'modal could, will',
'NN': 'noun, singular \'desk\'',
'NNS': 'noun plural \'desks\'',
'NNP': 'proper noun, singular \'Harrison\'',
'NNPS': 'proper noun, plural \'Americans\'',
'PDT': 'predeterminer \'all the kids\'',
'POS': 'possessive ending parent\'s',
'PRP': 'personal pronoun I, he, she',
'PRP$': 'possessive pronoun my, his, hers',
'RB': 'adverb very, silently',
'RBR': 'adverb, comparative better',
'RBS': 'adverb, superlative best',
'RP': 'particle give up',
'TO': 'to go \'to\' the store.',
'UH': 'interjection errrrrrrrm',
'VB': 'verb, base form take',
'VBD': 'verb, past tense took',
'VBG': 'verb, gerund/present participle taking',
'VBN': 'verb, past participle taken',
'VBP': 'verb, sing. present, non-3rd take',
'VBZ': 'verb, 3rd person sing. present takes',
'WDT': 'wh-determiner which',
'WP': 'wh-pronoun who, what',
'WP$': 'possessive wh-pronoun whose',
'WRB': 'wh-adverb where, when',
'QF' : 'quantifier, bahut, thoda, kam (Hindi)',
'VM' : 'main verb',
'PSP' : 'postposition, common in indian langs',
'DEM' : 'demonstrative, common in indian langs'
}

# The tags above are treebank-specific (xpos) tags; for the universal (upos) tags, you can go through this link
#---https://universaldependencies.org/u/pos/-------------

def pos_extract(document):
    # Collect each word, its POS tag, and the tag's explanation ('NA' if the tag is not in pos_dict)
    text = {'word':[], 'pos':[], 'exp':[]}
    for sentence in document.sentences:
        for word in sentence.words:
            text['word'].append(word.text)
            text['pos'].append(word.pos)
            text['exp'].append(pos_dict.get(word.pos, 'NA'))
    return pd.DataFrame(text)

pos_extract(document)

   word      pos    exp
0  Murasaki  NNP    proper noun, singular 'Harrison'
1  Shikibu   NNP    proper noun, singular 'Harrison'
2  's        POS    possessive ending parent's
3  Tale      NNP    proper noun, singular 'Harrison'
4  of        IN     preposition/subordinating conjunction
5  Genji     NNP    proper noun, singular 'Harrison'
6  (         -LRB-  NA
7  1010      CD     cardinal digit
8  )         -RRB-  NA
9  has       VBZ    verb, 3rd person sing. present takes
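
Once every word has a tag, a quick frequency count gives a feel for the text. The sketch below counts the tags from the ten rows shown above with `collections.Counter`. The tag list is copied by hand from that output rather than produced by the pipeline.

```python
from collections import Counter

# Tags copied from the 'pos' column of the ten rows above.
tags = ["NNP", "NNP", "POS", "NNP", "IN", "NNP", "-LRB-", "CD", "-RRB-", "VBZ"]

tag_counts = Counter(tags)
print(tag_counts.most_common(1))  # [('NNP', 4)]
```

On the full DataFrame, `pos_extract(document)['pos'].value_counts()` would give the same kind of summary.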


<b>We can try other language models too: just download one and use it in the same way.</b>