<a href="https://colab.research.google.com/github/Justin-Labs/NLP/blob/main/it_nltmadj_03_enus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Navigating Basic Polyglot Features

# it_nltmadj_03_enus_04

- Exploring polyglot features
 - Multi-language tokenization
 - Multi-language Named Entity Recognition
 - Multi-language Part-of-speech Tagging
 - Multi-Language Morphological Analysis

## Multi-Language Tokenization

Tokenization is the task of identifying the boundaries of words and sentences in text. We can identify sentence boundaries first and then tokenize each sentence into the words that make it up. Alternatively, we can tokenize the words first and then partition the token sequence into sentences.
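As a rough illustration of the sentence-first approach, here is a minimal sketch using only Python's `re` module. It is not how polyglot or ICU tokenize: the two regular expressions and the function names `split_sentences` and `split_words` are assumptions made for this example, and they only behave sensibly on simple English text.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive: abbreviations break it)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Return runs of word characters, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

sample = "Tokenization finds boundaries. It works on sentences first, then words!"
sentences = split_sentences(sample)          # step 1: sentence boundaries
tokens = [split_words(s) for s in sentences]  # step 2: words within each sentence
print(sentences)
print(tokens)
```

A rule set like this fails on languages written without spaces, such as the Chinese sample used in this section, which is why a language-aware tokenizer is needed.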

%pip install polyglot  # Polyglot is a natural language pipeline that supports massively multilingual applications.
%pip install PyICU  # PyICU is a Python extension, implemented in C++, that wraps the C/C++ ICU library
# ("International Components for Unicode"). It is also known to work as a PyPy extension.
%pip install pycld2  # Python bindings for Compact Language Detect 2 (CLD2).
%pip install Morfessor  # Morfessor is a tool for unsupervised and semi-supervised morphological segmentation.
%pip install polyglot transliteration  # Transliteration is the conversion of text from one script to another.

In [None]:
%pip install polyglot
%pip install PyICU
%pip install pycld2
%pip install Morfessor
%pip install polyglot transliteration

In [None]:
import polyglot
from polyglot.text import Text

### Word Tokenization

To invoke our word tokenizer, we must first create a Text object.

In [None]:
blob = u"""
机器学习 (ML) 是对可以通过经验和数据使用自动改进的计算机算法的研究。它被视为人工智能的一部分。机器学习算法基于样本数据（称为训练数据）构建模型，
以便在没有明确编程的情况下做出预测或决策。 机器学习算法广泛用于各种应用，例如医学、电子邮件过滤、语音识别和计算机视觉，在这些领域开发传统算法来执行所需的任务是困难的或不可行的。
"""
text = Text(blob)

In [None]:
text.words

Since ICU's boundary-break algorithms are language-aware, polyglot detects the language first, before calling the tokenizer.

In [None]:
print(text.language)

## Multi-Language Named Entity Extraction

Entity extraction is the task of extracting phrases from plain text that correspond to entities. Polyglot recognizes three categories of entities:

- Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions, and so on.
- Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies, and so on.
- Persons (Tag: I-PER): politicians, scientists, artists, athletes, and so on.
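To make the tag scheme concrete, the following sketch shows one way token-level tags can be merged into entity chunks. It is plain Python, not polyglot's implementation; the function `group_entities`, the `"O"` label for non-entity tokens, and the hand-labelled sample are assumptions for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share the same I-* tag into (tag, words) chunks."""
    entities = []
    current_tag, current_words = None, []
    for word, tag in tagged_tokens:
        if tag == current_tag and tag != "O":
            current_words.append(word)  # still inside the same entity
            continue
        if current_tag not in (None, "O"):
            entities.append((current_tag, current_words))  # close the previous entity
        current_tag, current_words = tag, [word]
    if current_tag not in (None, "O"):
        entities.append((current_tag, current_words))
    return entities

# Hand-labelled illustration, not real tagger output.
tagged = [("Vladimir", "I-PER"), ("Putin", "I-PER"), ("visited", "O"),
          ("Geneva", "I-LOC"), (".", "O")]
print(group_entities(tagged))
```

Note that with this simple scheme two adjacent entities of the same type would be merged into one chunk.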

### Languages Coverage

The models were trained on datasets extracted automatically from Wikipedia. Polyglot's NER models currently cover 40 major languages. We can ask the download manager which tasks polyglot supports for a given language, as follows:
downloader.supported_tasks(lang="en")
[u'embeddings2',
u'counts2',
u'pos2',
u'ner2',
u'sentiment2',
u'morph2',
u'tsne2']

In [None]:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("ner2", 3))

In [None]:
%%bash
# Download the English embeddings and NER models (%%bash must be the first line of the cell)
polyglot download embeddings2.en ner2.en

Let's look at an example.

In [None]:
blob = """US President Joe Biden and Russian President Vladimir Putin may meet in early 2022."""
text = Text(blob)

Let's query all the entities mentioned in the text.

In [None]:
text.entities

In [None]:
for sent in text.sentences:
  print(sent, "\n")
  for entity in sent.entities:
    print(entity.tag, entity)

In [None]:
# Inspect the first entity more closely: its start and end attributes locate it within the sentence's word list.
sent = text.sentences[0]
first_entity = sent.entities[0]
sent.words[first_entity.start: first_entity.end]

## Multi-Language Part-of-Speech Tagging

Part-of-speech tagging assigns each word/token in plain text a category that identifies the syntactic function of that word occurrence.

Polyglot recognizes 17 parts of speech, known as the universal part-of-speech tag set:

- ADJ: adjective
- ADP: adposition
- ADV: adverb
- AUX: auxiliary verb
- CONJ: coordinating conjunction
- DET: determiner
- INTJ: interjection
- NOUN: noun
- NUM: numeral
- PART: particle
- PRON: pronoun
- PROPN: proper noun
- PUNCT: punctuation
- SCONJ: subordinating conjunction
- SYM: symbol
- VERB: verb
- X: other

In [None]:
# Check which languages polyglot's POS tagger supports
from polyglot.downloader import downloader
print(downloader.supported_languages_table("pos2"))

In [None]:
%%bash
# Download the English embeddings and POS models (%%bash must be the first line of the cell)
polyglot download embeddings2.en pos2.en

In [None]:
from polyglot.text import Text

In [None]:
blob = '''Biden and Putin may meet in early 2022. If that sounds like deja vu, you’re right.
After Russia mobilized troops on Ukraine’s border last April, a Biden–Putin summit took place in mid-June in Geneva.
Long ago, North Korea discovered that missile launches were an effective way of getting Washington’s attention'''
text = Text(blob)
text.pos_tags

In [None]:
# After pos_tags has been accessed once, the Word objects carry their POS tags.
text.words[0].pos_tag
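Tagged output of this kind, a list of `(word, tag)` pairs, is easy to post-process with standard Python. The sketch below counts tags and filters proper nouns; the pairs are hand-written for illustration and are not actual tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs for illustration; not actual tagger output.
pairs = [("Biden", "PROPN"), ("and", "CONJ"), ("Putin", "PROPN"), ("may", "AUX"),
         ("meet", "VERB"), ("in", "ADP"), ("early", "ADJ"), ("2022", "NUM"), (".", "PUNCT")]

tag_counts = Counter(tag for _, tag in pairs)               # how often each tag occurs
proper_nouns = [word for word, tag in pairs if tag == "PROPN"]  # keep only proper nouns

print(tag_counts.most_common(3))
print(proper_nouns)
```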

## Morphological Analysis

Polyglot provides trained Morfessor models for generating morphemes from words. The Morpho project aims to develop unsupervised, data-driven methods for discovering the regularities behind word formation in natural languages. It focuses in particular on discovering morphemes, the primitive units of syntax and the smallest individually meaningful elements in the utterances of a language. Morphemes matter for the automatic generation and recognition of language, especially in languages whose words can take many different inflected forms.

In [None]:
# Morphemes language coverage

from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))

In [None]:
%%bash
polyglot download morph2.en morph2.ar

In [None]:
from polyglot.text import Text, Word

In [None]:
# Word Segmentation


words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w = Word(w, language="en")
  print("{:<20}{}".format(w, w.morphemes))

In [None]:
# Sentence Segmentation - morphological analysis can also split text that was not tokenized properly

blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"

In [None]:
text.morphemes
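To give a feel for what splitting unspaced text involves, here is a dictionary-based sketch that uses dynamic programming to find the segmentation with the fewest pieces. This is a deliberately different and much simpler technique than Morfessor, which learns its units from data without supervision; the function `segment` and the tiny vocabulary are assumptions for illustration.

```python
def segment(text, vocabulary):
    """Split `text` into vocabulary words using the fewest pieces; return None if impossible."""
    # best[i] holds the shortest segmentation found for text[:i], or None if there is none.
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]

# Hypothetical vocabulary; a real system would not be given one by hand.
vocab = {"we", "will", "meet", "today", "to", "day", "me"}
print(segment("wewillmeettoday", vocab))
```

Preferring fewer pieces is what makes the sketch choose "today" over "to" + "day".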

Section Ends Here

# it_nltmadj_03_enus_05

Exploring Polyglot features
 - Language Detection
 - Multi-lingual Sentiment Analysis
 - Transliteration

## Language Detection

Polyglot relies on the `pycld2` library, which in turn wraps the cld2 library, to detect the language(s) used in plain text.

In [None]:
from polyglot.detect import Detector

In [None]:
arabic_text = u'''يحتفل الناس حول العالم بأعياد الميلاد، وهو واحد من أقدس الأوقات في التقويم المسيحي.

مع هذا، وللعام الثاني على التوالي، يشارك أعداد أقل في المراسم الكنسية والفعاليات الأخرى بسبب استمرار تفشي جائحة كورونا.

'''

In [None]:
detector = Detector(arabic_text)
print(detector.language)

In [None]:
# mixed text

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state located in East Asia.
"""

If the text contains snippets from different languages, the detector can find the most probable languages used in the text. For each language, we can query the model's confidence level:

In [None]:
for language in Detector(mixed_text).languages:
  print(language)

In [None]:
for line in mixed_text.strip().splitlines():
  print(line + u"\n")
  for language in Detector(line).languages:
    print(language)
  print("\n")
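One simple signal for mixed text is the writing script of each character. The sketch below estimates the share of each script using only the standard `unicodedata` module. It is a crude heuristic for illustration, not what pycld2 does: a script is not a language (Latin letters alone cannot tell English from French), and the function `script_shares` is an assumption of this example.

```python
import unicodedata
from collections import Counter

def script_shares(text):
    """Return the share of alphabetic characters per script, keyed by the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. 'LATIN CAPITAL LETTER C' -> 'LATIN', 'CJK UNIFIED IDEOGRAPH-4E2D' -> 'CJK'
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}

shares = script_shares("China 中国")
print(shares)
```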

In [None]:
# Supported languages

from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))

## Multi-Language Sentiment Analysis

Polyglot has polarity lexicons for 136 languages. Word polarity is on a three-value scale: +1 for positive words, -1 for negative words, and 0 for neutral words.
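The idea of a polarity lexicon can be sketched in a few lines of plain Python. The lexicon below is tiny and hypothetical, and the averaging in `mean_polarity` is just one possible way to aggregate word scores; neither is taken from polyglot.

```python
# Tiny hypothetical lexicon: +1 positive, -1 negative; anything missing counts as 0 (neutral).
LEXICON = {"good": 1, "great": 1, "hit": 1, "bad": -1, "boring": -1}

def word_polarities(words, lexicon=LEXICON):
    """Look up each word's polarity, defaulting to 0 for words not in the lexicon."""
    return [(w, lexicon.get(w.lower(), 0)) for w in words]

def mean_polarity(words, lexicon=LEXICON):
    """Average polarity over the words that carry sentiment; 0.0 if there are none."""
    scored = [s for _, s in word_polarities(words, lexicon) if s != 0]
    return sum(scored) / len(scored) if scored else 0.0

sample_words = "The movie was really good but the ending was boring".split()
print(word_polarities(sample_words))
print(mean_polarity(sample_words))
```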

In [None]:
# Language Coverage

from polyglot.downloader import downloader
print(downloader.supported_languages_table("sentiment2", 3))

In [None]:
%%bash
# Download the English sentiment lexicon (%%bash must be the first line of the cell)
polyglot download sentiment2.en

In [None]:
# Polarity - to inspect a word's polarity, read its polarity attribute.

text = Text("The movie named Avengers was really good to watch. It's a big hit !")

print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in text.words:
    print("{:<16}{:>2}".format(w, w.polarity))


In [None]:
# Entity Sentiment - We may generate a more specific sentiment score for an entity mentioned in text as follows:

blob = ("Barack Obama gave a fantastic speech last night. "
        "Reports indicate he will move next to New Hampshire.")
text = Text(blob)

In [None]:
# First, we split the text into sentences, which limits the words that affect an entity's sentiment to those in the same sentence.

first_sentence = text.sentences[0]
print(first_sentence)

In [None]:
# We will extract the entities

first_entity = first_sentence.entities[0]
print(first_entity)

In [None]:
# Finally, for each entity we found, we can compute the strength of its positive or negative sentiment on a scale of 0 to 1.

print('Positive Sentiment=>',first_entity.positive_sentiment)
print('Negative Sentiment=>',first_entity.negative_sentiment)

## Transliteration

Transliteration is the conversion of text from one script to another. For example, "Ellēnikḗ Dēmokratía" is a Latin-script transliteration of the Greek "Ελληνική Δημοκρατία", which is usually translated as "Hellenic Republic."
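For alphabetic scripts, the simplest form of transliteration is a letter-by-letter table. The sketch below applies such a table to Greek. It is an assumption-laden illustration: the mapping `GREEK_TO_LATIN` is a simplified, hypothetical scheme, accent marks are dropped rather than carried over, and polyglot's own transliteration relies on trained models rather than a fixed table.

```python
import unicodedata

# Simplified, hypothetical letter table for illustration only.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z", "η": "ē", "θ": "th",
    "ι": "i", "κ": "k", "λ": "l", "μ": "m", "ν": "n", "ξ": "x", "ο": "o", "π": "p",
    "ρ": "r", "σ": "s", "ς": "s", "τ": "t", "υ": "y", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "ō",
}

def transliterate_greek(text):
    """Lowercase, strip accent marks, then map each Greek letter to Latin letters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    bare = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in bare)  # unknown characters pass through

print(transliterate_greek("Ελληνική Δημοκρατία"))
```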

In [None]:
from polyglot.transliteration import Transliterator

In [None]:
# Language Coverage

from polyglot.downloader import downloader
print(downloader.supported_languages_table("transliteration2"))

In [None]:
%%bash
# Download the English embeddings and the Arabic transliteration model (%%bash must be the first line of the cell)
polyglot download embeddings2.en transliteration2.ar

In [None]:
from polyglot.text import Text

In [None]:
blob = """Lufthansa plans to cut 33,000 flights from its winter schedule due to the spread of the
Omicron coronavirus variant and related travel restrictions, CEO Carsten Spohr told the Frankfurter
Allgemeine Sonntagszeitung newspaper."""
text = Text(blob)

In [None]:
for x in text.transliterate("ar"):
  print(x)

Section Ends here

# it_nltmadj_03_enus_06
# Basic TextBlob Features

### Exploring features of TextBlob
 - Installation
 - Noun Phrase Extraction
 - Part-of-speech
 - Parsing
 - WordNet Integration

## Installation

In [None]:
!pip install -U textblob
import nltk
nltk.download('brown')
nltk.download('punkt')

In [None]:
from textblob import TextBlob

## Noun-Phrase Extraction

In [None]:
tb=TextBlob('''Lufthansa plans to cut 33,000 flights from its winter schedule due to the spread of
the Omicron coronavirus variant and related travel restrictions, CEO Carsten Spohr told the Frankfurter Allgemeine Sonntagszeitung newspaper.''')
tb.noun_phrases

### Part-of-speech

In [None]:
import nltk
nltk.download('averaged_perceptron_tagger')

In [None]:
tb1=TextBlob("Biden and Putin may meet in early 2022. If that sounds like deja vu, you’re right")
tb1.tags

### Parsing

In [None]:
from textblob.parsers import PatternParser
tb2 = TextBlob('''Lufthansa plans to cut 33,000 flights from its winter schedule due to the spread of
the Omicron coronavirus variant and related travel restrictions, CEO Carsten Spohr told the Frankfurter Allgemeine Sonntagszeitung newspaper.''', parser=PatternParser())
tb2.parse()

### WordNet Integration

In [None]:
nltk.download('wordnet')

In [None]:
# You can retrieve a Word's synsets through the synsets property or the get_synsets method, optionally passing in a part of speech.
from textblob import Word
from textblob.wordnet import VERB
word = Word("octopus")
print(word.synsets)
print('---------//----------------//----------------------//---------------')
print(Word("hack").get_synsets(pos=VERB))

In [None]:
from textblob.wordnet import Synset
octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)
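Path similarity scores two concepts by the shortest path between them in the is-a hierarchy, using 1 / (distance + 1). The sketch below reproduces that calculation on a miniature, hypothetical taxonomy with a breadth-first search; the `HYPERNYM` table is invented for illustration and is not real WordNet data, so its scores will not match the cell above.

```python
from collections import deque

# Hypothetical miniature taxonomy (child -> parent); not real WordNet data.
HYPERNYM = {
    "octopus": "cephalopod", "squid": "cephalopod", "cephalopod": "mollusc",
    "mollusc": "invertebrate", "shrimp": "crustacean", "crustacean": "arthropod",
    "arthropod": "invertebrate", "invertebrate": "animal",
}

def shortest_path_length(a, b, parent=HYPERNYM):
    """Breadth-first search over is-a links, followed in both directions."""
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected

def toy_path_similarity(a, b):
    """Return 1 / (shortest path length + 1), or None if no path exists."""
    distance = shortest_path_length(a, b)
    return None if distance is None else 1 / (distance + 1)

print(toy_path_similarity("octopus", "squid"))
print(toy_path_similarity("octopus", "shrimp"))
```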

# it_nltmadj_03_enus_07
# Advanced TextBlob Features

## Exploring features of TextBlob
 - Sentiment Analysis
 - Classification model
 - Tokenization
 - Word/Phrases Frequencies
 - Word Inflection
 - Spell Correction

## Sentiment Analysis

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float in the range [-1.0, 1.0]. The subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

In [None]:
tb3 = TextBlob("The movie named Avengers was really good to watch. It's a big hit !, What great fun!")
print('polarity and subjectivity =>',tb3.sentiment)
print('polarity =>',tb3.sentiment.polarity)

## Tokenization

In [None]:
# TextBlobs can be broken down into words or sentences.
tb4 = TextBlob("Beautiful is better than ugly. "
              "Explicit is better than implicit. "
           "Simple is better than complex.")
print('words=>',tb4.words)
print('--------------------//---------------------//--------------------------//---------------------------//-----------------------//-----------')
print('sentences=>',tb4.sentences)

## Word/Phrase frequencies

In a TextBlob, there are two methods to obtain the frequency of a word or noun phrase.


In [None]:
# The first is through the word_counts dictionary.

tb5 = TextBlob('''Lufthansa plans to cut 33,000 flights from its winter schedule due to the spread of
the Omicron coronavirus variant and related travel restrictions according to the newspaper, CEO Carsten Spohr told the Frankfurter Allgemeine Sonntagszeitung newspaper.''')
tb5.word_counts['newspaper']

In [None]:
# The second is the count() method.
tb5.words.count('newspaper')
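The same kind of lookup can be sketched with the standard library alone, which makes the underlying idea explicit: normalize case, split into words, and count. The regular expression and the sample sentence are assumptions for this example and do not reproduce TextBlob's tokenizer.

```python
import re
from collections import Counter

sample_text = "The newspaper quoted another newspaper. Newspaper sales fell."

# Lowercase first so that 'Newspaper' and 'newspaper' are counted together.
sample_tokens = re.findall(r"[a-z0-9']+", sample_text.lower())
frequency = Counter(sample_tokens)

print(frequency["newspaper"])
print(frequency["magazine"])  # a Counter returns 0 for words it has never seen
```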

## Words Inflection

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode) with useful methods, e.g. for word inflection.

In [None]:
tb6 = TextBlob('''Lufthansa plans to cut 33,000 flights from its winter schedule due to the spread of
the Omicron coronavirus variant and related travel restrictions according to the newspaper, CEO Carsten Spohr told the Frankfurter Allgemeine Sonntagszeitung newspaper.''')
print('words=>',tb6.words)

print('singularize=>',tb6.words[1].singularize())  # words[1] is 'plans'; words[2] is 'to', which has nothing to singularize

print('pluralize=>',tb6.words[-1].pluralize())

## Spell Correction

To attempt spelling correction, use the correct() method.

In [None]:
tb7 = TextBlob("I am writing ths to let you know,you havv goood speling!")
print(tb7.correct())

In [None]:
# Word objects have a spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.

from textblob import Word
w = Word('falibility')
w.spellcheck()

Spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector", as implemented in the pattern library. It is about 70% accurate.
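The core of Norvig's approach is to generate every string one edit away from the input and pick the most frequent known word among them. The following is a reduced sketch of that idea, not TextBlob's or pattern's code: it considers a single edit only, and the `WORD_FREQ` table is a hypothetical stand-in for counts that would normally come from a large corpus.

```python
import string

# Hypothetical word frequencies; a real corrector derives these from a large corpus.
WORD_FREQ = {"spelling": 50, "spewing": 2, "good": 120, "this": 300, "have": 250}

def edits1(word):
    """All strings that are one delete, transpose, replace or insert away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def suggest(word, freq=WORD_FREQ):
    """Return `word` if known, else the most frequent known word one edit away, else `word` unchanged."""
    if word in freq:
        return word
    candidates = [w for w in edits1(word) if w in freq]
    return max(candidates, key=freq.get) if candidates else word

for misspelled in ["ths", "havv", "goood", "speling"]:
    print(misspelled, "->", suggest(misspelled))
```

Ranking candidates by frequency is what lets "speling" resolve to "spelling" rather than the rarer "spewing", which is also one edit away.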

Section Ends here

# it_nltmadj_03_enus_08
## Navigating Basic Gensim Features

### Exploring Features of Gensim
 - Installation
 - Topic Modelling

### Installation

In [None]:
!pip install --upgrade gensim

In [None]:
import nltk; nltk.download('stopwords')

# Download spaCy's small English model (it is loaded later as 'en_core_web_sm')
!python -m spacy download en_core_web_sm

In [None]:
!pip install pyLDAvis

In [None]:
import pyLDAvis.gensim_models

In [None]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [None]:
# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

In [None]:
# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()

In [None]:
# Convert to list
data = df.content.values.tolist()

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Collapse newlines and other runs of whitespace into a single space
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])

In [None]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks; simple_preprocess itself lowercases and drops punctuation

data_words = list(sent_to_words(data))

print(data_words[:1])

In [None]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
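To see why a higher `threshold` yields fewer phrases, it helps to look at how a candidate bigram is scored. The sketch below assumes gensim's default scorer, which to my understanding is `(count_ab - min_count) / count_a / count_b * vocab_size`; treat the formula as an assumption to check against the gensim documentation, and the counts as invented toy values.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a candidate bigram 'a b' from raw counts (assumed to mirror gensim's default scorer)."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

# Toy counts, invented for illustration.
common_pair = phrase_score(count_a=20, count_b=25, count_ab=15, vocab_size=1000)
rare_but_tight_pair = phrase_score(count_a=12, count_b=10, count_ab=10, vocab_size=5000)

threshold = 100  # same value as in the cell above
print(common_pair, common_pair > threshold)
print(rare_but_tight_pair, rare_but_tight_pair > threshold)
```

Under this formula a pair scores highly when its two words appear together far more often than their individual frequencies would suggest.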

In [None]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [None]:
# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en_core_web_sm' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])

In [None]:
# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

In [None]:
# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
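The dictionary and bag-of-words steps can be mimicked in plain Python to show what the `(id, count)` pairs mean. This is a sketch of the idea only: the helper names are hypothetical, and ids here are assigned in order of first appearance, which is not guaranteed to match the ids gensim assigns.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["car", "engine", "car"], ["engine", "oil"]]
toy_vocab = build_vocabulary(toy_docs)
toy_bow = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_vocab)
print(toy_bow)
```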

In [None]:
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [None]:
# Print the keywords of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

In [None]:
# Compute Perplexity
# log_perplexity() returns the per-word likelihood bound (a negative number); perplexity is 2**(-bound), and lower perplexity is better.
print('\nPer-word likelihood bound: ', lda_model.log_perplexity(corpus))

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
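According to the gensim documentation, `log_perplexity` returns a per-word likelihood bound and the perplexity itself is `2 ** (-bound)`. The conversion is a one-liner; the helper name and the sample bounds below are made up for illustration.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound (base 2) into perplexity."""
    return 2 ** (-per_word_bound)

# Made-up bounds: a bound closer to zero means lower (better) perplexity.
for bound in (-8.0, -7.5):
    print(bound, perplexity_from_bound(bound))
```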

In [None]:
# Visualize the topics
%pip install pandas -U
import pandas
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
vis

Section Ends here

# it_nltmadj_03_enus_09
## Navigating Advanced Gensim Features

Exploring features of Gensim
 - Query Similarity

In [None]:
# Shows how to search a corpus for texts that are similar.

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [None]:
# First, we need a corpus to work with. This step is the same as in the previous tutorial; if you have already completed it, feel free to skip to the next section.
from collections import defaultdict
from gensim import corpora

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]

In [None]:
# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

### Similarity interface

In [None]:
from gensim import models
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=2)

Suppose a user typed in the query "Human computer interaction". We would like to sort our nine corpus documents in decreasing order of relevance to this query. Unlike modern search engines, we concentrate here on a single aspect of possible similarity: the apparent semantic relatedness of the texts (their words). There are no hyperlinks and no random-walk static ranks, just a semantic extension of the boolean keyword match:

In [None]:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

To prepare for similarity queries, we first need to index all the documents that we want to compare against subsequent queries. In our case, they are the same nine documents used to train LSI, converted to 2-D LSA space. That is only incidental, though; we could just as well be indexing a completely different corpus.

In [None]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it

Index persistence is handled via the standard save() and load() functions:

In [None]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')

To compare our query document to the nine indexed documents, perform the following:

In [None]:
sims = index[vec_lsi]  # perform a similarity query against the corpus
print(list(enumerate(sims)))  # print (document_number, document_similarity) 2-tuples

The cosine measure returns similarities in the range [-1, 1] (the higher the score, the greater the similarity), so the first document has a score of 0.99809301, and so on.
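For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it for small 2-D vectors, comparable in shape to the 2-topic LSI vectors used here; the vectors and labels are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between vectors u and v; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

toy_query = (1.0, 0.0)
toy_vectors = {
    "same direction": (2.0, 0.0),   # cosine 1.0, regardless of length
    "orthogonal": (0.0, 3.0),       # cosine 0.0
    "opposite": (-1.0, 0.0),        # cosine -1.0
}
ranked = sorted(toy_vectors, key=lambda name: cosine(toy_query, toy_vectors[name]), reverse=True)
print(ranked)
```

Because only the angle matters, a long document and a short one pointing in the same direction score the same.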

With some standard Python, we sort these similarities into descending order and obtain the final answer to the query "Human computer interaction":

In [None]:
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])

A conventional boolean full-text search would never return documents no. 2 ("The EPS user interface management system") and no. 4 ("Relation of user perceived response time to error measurement"), since they share no words with "Human computer interaction". After applying LSI, however, both receive quite high similarity scores (no. 2 is actually the most similar!), which matches our intuition that they share a "computer-human" topic with the query. In fact, this semantic generalization is the reason we apply transformations and do topic modelling in the first place.

Section Ends here

# it_nltmadj_03_enus_10

## Core NLP Features

Exploring features of NLP
 - Installations
 - Named Entities
 - Dependency Parse & coreference

### Installations

In [None]:
!pip install stanfordnlp

In [None]:
import stanfordnlp

In [None]:
!wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

In [None]:
!unzip stanford-corenlp-full-2018-10-05.zip

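In [None]:
# Optional: start the CoreNLP server manually. The jars live in the unzipped folder, so that folder must be on the classpath.
# This call blocks the cell while the server runs; by default, the CoreNLPClient used below launches its own server from CORENLP_HOME.
!java -mx4g -cp "/content/stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000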

In [None]:
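# `!export` only affects a throwaway subshell; %env sets the variable for the notebook process itself.
%env CORENLP_HOME=stanford-corenlp-full-2018-10-05/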

In [None]:
from stanfordnlp.server import CoreNLPClient
# example text
print('---')
print('input text')
print('')
text = "Chris Manning is a nice person. Chris wrote a simple sentence. He also gives oranges to people."
print(text)
print('---')
print('starting up Java Stanford CoreNLP Server...')
# set up the client
with CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner','depparse','coref'], timeout=30000, memory='16G') as client:
    # submit the request to the server
    ann = client.annotate(text)
    # get the first sentence
    sentence = ann.sentence[0]

In [None]:
stanfordnlp.download('en')

In [None]:
import nltk
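from nltk.tag.stanford import StanfordNERTagger

# Named Entity tagging in Python with NLTK and the Stanford NER tagger.
# NOTE: StanfordNERTagger expects local file paths to the Stanford NER jar and to a trained
# classifier model, not web URLs. The values below are download pages; download the files
# and replace both constants with their local paths before running this cell.
PATH_TO_JAR = 'http://www.java2s.com/Code/Jar/s/Downloadstanfordner127sourcesjar.htm'
PATH_TO_MODEL = 'http://www.java2s.com/Code/Jar/s/Downloadstanfordsutimemodels135sourcesjar.htm'

tagger = StanfordNERTagger(model_filename=PATH_TO_MODEL, path_to_jar=PATH_TO_JAR, encoding='utf-8')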

sentence = 'First up in London will be Riccardo Tisci, onetime Givenchy darling, favorite of Kardashian-Jenners everywhere, who returns to the catwalk with men’s and women’s wear after a year and a half away, this time to reimagine Burberry after the departure of Christopher Bailey.'

#split the sentence into words
words = nltk.word_tokenize(sentence)

# tag each word and show the resulting (word, tag) pairs
tagged = tagger.tag(words)
print(tagged)