# StanfordNLP
- [Reference](https://medium.com/analytics-vidhya/introduction-to-stanfordnlp-an-nlp-library-for-53-languages-with-python-code-d7c3efdca118)

In [1]:
!pip install stanfordnlp

Collecting stanfordnlp
  Downloading https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl (158kB)

In [2]:
import stanfordnlp

In [3]:
stanfordnlp.download('en')

Using the default treebank "en_ewt" for language "en".
Would you like to download the models for: en_ewt now? (Y/n)
Y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: en_ewt
Download location: /root/stanfordnlp_resources/en_ewt_models.zip


100%|██████████| 235M/235M [00:14<00:00, 16.5MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/en_ewt_models.zip
Extracting models file for: en_ewt
Cleaning up...Done.


In [4]:
!pip freeze | grep torch

torch==1.5.0+cu101
torchsummary==1.5.1
torchtext==0.3.1
torchvision==0.6.0+cu101


# Using StanfordNLP to Perform Basic NLP Tasks

In [5]:
# "mwt" is requested here, but the loading log below lists only tokenize, lemma and pos for English.
nlp = stanfordnlp.Pipeline(processors="tokenize,mwt,lemma,pos")
doc = nlp("""The prospects for Britain’s orderly withdrawal from the European Union on March 29 have receded further, even as MPs rallied to stop a no-deal scenario. An amendment to the draft bill on the termination of London’s membership of the bloc obliges Prime Minister Theresa May to renegotiate her withdrawal agreement with Brussels. A Tory backbencher’s proposal calls on the government to come up with alternatives to the Irish backstop, a central tenet of the deal Britain agreed with the rest of the EU.""")

Use device: gpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tokenizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_lemmatizer.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/en_ewt_models/en_ewt.pretrain.pt', 'lang': 'en', 'shorthand': 'en_ewt', 'mode': 'predict'}
Done loading processors!
---




## Tokenization

In [6]:
doc.sentences[0].print_tokens()

<Token index=1;words=[<Word index=1;text=The;lemma=the;upos=DET;xpos=DT;feats=Definite=Def|PronType=Art>]>
<Token index=2;words=[<Word index=2;text=prospects;lemma=prospect;upos=NOUN;xpos=NNS;feats=Number=Plur>]>
<Token index=3;words=[<Word index=3;text=for;lemma=for;upos=ADP;xpos=IN;feats=_>]>
<Token index=4;words=[<Word index=4;text=Britain;lemma=Britain;upos=PROPN;xpos=NNP;feats=Number=Sing>]>
<Token index=5;words=[<Word index=5;text=’s;lemma='s;upos=PART;xpos=POS;feats=_>]>
<Token index=6;words=[<Word index=6;text=orderly;lemma=orderly;upos=ADJ;xpos=JJ;feats=Degree=Pos>]>
<Token index=7;words=[<Word index=7;text=withdrawal;lemma=withdrawal;upos=NOUN;xpos=NN;feats=Number=Sing>]>
<Token index=8;words=[<Word index=8;text=from;lemma=from;upos=ADP;xpos=IN;feats=_>]>
<Token index=9;words=[<Word index=9;text=the;lemma=the;upos=DET;xpos=DT;feats=Definite=Def|PronType=Art>]>
<Token index=10;words=[<Word index=10;text=European;lemma=european;upos=ADJ;xpos=JJ;feats=Degree=Pos>]>
<Token index=
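
The `feats` field in the output above packs morphological features into a `Name=Value|Name=Value` string, with `_` standing for "no features". A minimal sketch of unpacking it, assuming plain strings copied from the printed output; `parse_feats` is a hypothetical helper and not part of stanfordnlp:

```python
def parse_feats(feats):
    """Split a 'Name=Value|Name=Value' feature string into a dict.

    Hypothetical helper for illustration; '_' is treated as 'no features'.
    """
    if feats == "_":
        return {}
    # Split on the first '=' only, so values are kept intact
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Strings taken from the token output above
print(parse_feats("Definite=Def|PronType=Art"))
print(parse_feats("Number=Plur"))
print(parse_feats("_"))
```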

## Lemmatization

In [7]:
import pandas as pd

def extract_lemma(doc):
    """Return a DataFrame with one row per word: surface text and lemma."""
    parsed_text = {'word': [], 'lemma': []}
    for sent in doc.sentences:
        for wrd in sent.words:
            # Collect the surface form and its lemma
            parsed_text['word'].append(wrd.text)
            parsed_text['lemma'].append(wrd.lemma)
    return pd.DataFrame(parsed_text)

extract_lemma(doc)

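|     | word      | lemma    |
|-----|-----------|----------|
| 0   | The       | the      |
| 1   | prospects | prospect |
| 2   | for       | for      |
| 3   | Britain   | Britain  |
| 4   | ’s        | 's       |
| ... | ...       | ...      |
| 86  | rest      | rest     |
| 87  | of        | of       |
| 88  | the       | the      |
| 89  | EU        | EU       |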
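
Most rows in the lemma table are unchanged, so it can help to look only at the words the lemmatizer actually altered. A minimal sketch, assuming the (word, lemma) pairs are available as plain tuples; the sample data is copied from the visible rows above, and `changed_by_lemmatizer` is a hypothetical helper name:

```python
# (word, lemma) pairs copied from the visible rows of the output above
rows = [
    ("The", "the"),
    ("prospects", "prospect"),
    ("for", "for"),
    ("Britain", "Britain"),
    ("’s", "'s"),
    ("rest", "rest"),
    ("of", "of"),
    ("the", "the"),
    ("EU", "EU"),
]


def changed_by_lemmatizer(pairs):
    """Keep only the pairs where the lemma differs from the surface form."""
    return [(word, lemma) for word, lemma in pairs if word != lemma]


changed = changed_by_lemmatizer(rows)
for word, lemma in changed:
    print(f"{word} -> {lemma}")
```

The comparison is case-sensitive on purpose, so lowercasing (`The` to `the`) counts as a change.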


## Part-of-Speech (PoS) Tagging

In [8]:
# Dictionary mapping POS tags to short explanations
pos_dict = {
    'CC': 'coordinating conjunction',
    'CD': 'cardinal number',
    'DT': 'determiner',
    'EX': 'existential there (like: "there is" ... think of it like "there exists")',
    'FW': 'foreign word',
    'IN': 'preposition/subordinating conjunction',
    'JJ': "adjective 'big'",
    'JJR': "adjective, comparative 'bigger'",
    'JJS': "adjective, superlative 'biggest'",
    'LS': 'list marker 1)',
    'MD': 'modal could, will',
    'NN': "noun, singular 'desk'",
    'NNS': "noun plural 'desks'",
    'NNP': "proper noun, singular 'Harrison'",
    'NNPS': "proper noun, plural 'Americans'",
    'PDT': "predeterminer 'all the kids'",
    'POS': "possessive ending parent's",
    'PRP': 'personal pronoun I, he, she',
    'PRP$': 'possessive pronoun my, his, hers',
    'RB': 'adverb very, silently',
    'RBR': 'adverb, comparative better',
    'RBS': 'adverb, superlative best',
    'RP': 'particle give up',
    'TO': "to go 'to' the store.",
    'UH': 'interjection errrrrrrrm',
    'VB': 'verb, base form take',
    'VBD': 'verb, past tense took',
    'VBG': 'verb, gerund/present participle taking',
    'VBN': 'verb, past participle taken',
    'VBP': 'verb, non-3rd person sing. present take',
    'VBZ': 'verb, 3rd person sing. present takes',
    'WDT': 'wh-determiner which',
    'WP': 'wh-pronoun who, what',
    'WP$': 'possessive wh-pronoun whose',
    'WRB': 'wh-adverb where, when',
    'QF': 'quantifier, bahut, thoda, kam (Hindi)',
    'VM': 'main verb',
    'PSP': 'postposition, common in Indian languages',
    'DEM': 'demonstrative, common in Indian languages',
}

def extract_pos(doc):
    """Return a DataFrame with each word, its POS tag and the tag's explanation."""
    parsed_text = {'word': [], 'pos': [], 'exp': []}
    for sent in doc.sentences:
        for wrd in sent.words:
            parsed_text['word'].append(wrd.text)
            parsed_text['pos'].append(wrd.pos)
            # Tags missing from pos_dict get the placeholder 'NA'
            parsed_text['exp'].append(pos_dict.get(wrd.pos, 'NA'))
    return pd.DataFrame(parsed_text)

extract_pos(doc)

|     | word      | pos | exp                                   |
|-----|-----------|-----|---------------------------------------|
| 0   | The       | DT  | determiner                            |
| 1   | prospects | NNS | noun plural 'desks'                   |
| 2   | for       | IN  | preposition/subordinating conjunction |
| 3   | Britain   | NNP | proper noun, singular 'Harrison'      |
| 4   | ’s        | POS | possessive ending parent's            |
| ... | ...       | ... | ...                                   |
| 86  | rest      | NN  | noun, singular 'desk'                 |
| 87  | of        | IN  | preposition/subordinating conjunction |
| 88  | the       | DT  | determiner                            |
| 89  | EU        | NNP | proper noun, singular 'Harrison'      |
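
A quick way to summarise tagger output like the table above is to count how often each tag occurs. A minimal sketch using only the standard library; the (word, tag) pairs are copied from the visible rows above rather than read from a live pipeline, and the variable names are illustrative:

```python
from collections import Counter

# (word, tag) pairs copied from the visible rows of the output above
tagged = [
    ("The", "DT"),
    ("prospects", "NNS"),
    ("for", "IN"),
    ("Britain", "NNP"),
    ("’s", "POS"),
    ("rest", "NN"),
    ("of", "IN"),
    ("the", "DT"),
    ("EU", "NNP"),
]

# Count occurrences of each tag, ignoring the word itself
tag_counts = Counter(tag for _, tag in tagged)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

With a real document, the same counter could be fed from the `pos` column of the DataFrame returned by `extract_pos`.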


## Dependency Extraction

In [9]:
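# Dependencies are only populated when the pipeline includes the "depparse" processor.
# The pipeline built in In [5] does not, which is presumably why this cell shows no output.
doc.sentences[0].print_dependencies()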

# Implementing StanfordNLP on the Hindi Language

In [10]:
stanfordnlp.download('hi')

Using the default treebank "hi_hdtb" for language "hi".
Would you like to download the models for: hi_hdtb now? (Y/n)
Y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: hi_hdtb
Download location: /root/stanfordnlp_resources/hi_hdtb_models.zip


100%|██████████| 208M/208M [00:20<00:00, 9.94MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/hi_hdtb_models.zip
Extracting models file for: hi_hdtb
Cleaning up...Done.


In [11]:
# Caution: `nlp` is still the English pipeline from In [5], so the Hindi text below is tagged with the English models.
# To use the downloaded Hindi models, build a separate pipeline first,
# e.g. stanfordnlp.Pipeline(processors="tokenize,lemma,pos", lang="hi"), and call that instead.
hindi_doc = nlp("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")

In [12]:
extract_pos(hindi_doc)

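|     | word   | pos | exp                   |
|-----|--------|-----|-----------------------|
| 0   | केंद्र   | GW  |                       |
| 1   | की     | FW  | foreign word          |
| 2   | मोदी   | AFX |                       |
| 3   | सरकार  | AFX |                       |
| 4   | ने     | AFX |                       |
| ... | ...    | ... | ...                   |
| 67  | आसान   | GW  |                       |
| 68  | भाषा   | GW  |                       |
| 69  | में     | FW  | foreign word          |
| 70  | यहां    | NN  | noun, singular 'desk' |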
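
The empty `exp` cells above belong to tags that have no entry in `pos_dict`. A small sketch of listing such tags so the gaps are easy to spot; `known_tags` is only a two-entry subset of `pos_dict` for illustration, the observed tags are copied from the visible rows above, and `unexplained_tags` is a hypothetical helper:

```python
# Illustrative subset of pos_dict (assumption: the full dict would be used in practice)
known_tags = {
    "FW": "foreign word",
    "NN": "noun, singular 'desk'",
}

# Tags copied from the visible rows of the Hindi output above
observed = ["GW", "FW", "AFX", "AFX", "AFX", "GW", "GW", "FW", "NN"]


def unexplained_tags(tags, tag_dict):
    """Return the sorted set of tags that have no explanation in tag_dict."""
    return sorted(set(tags) - set(tag_dict))


missing = unexplained_tags(observed, known_tags)
print(missing)
```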
