## NLP Python Libraries

1. [NLTK (Natural Language Toolkit)](#1)
2. [spaCy](#2)
3. [TextBlob](#3)
4. [Hugging Face Transformers](#4)
5. Gensim
6. Textacy
7. VADER (Valence Aware Dictionary and sEntiment Reasoner)
8. AllenNLP
9. Stanza
10. Pattern
11. PyNLPl 
12. flair Library
13. FastText
14. Polyglot
15. Regex (Regular Expressions)

## 1. NLTK (Natural Language Toolkit) <a class="anchor" id="1"></a>

Natural Language Toolkit (NLTK) is one of the most widely used Python libraries for Natural Language Processing. Its API covers a broad range of tasks, from basic text pre-processing to tagging, chunking, and classification. In this section, we will cover the basics of NLTK and walk through some core NLP tasks: tokenization, stemming, lemmatization, POS tagging, and named entity recognition.

**Installation:**

In [1]:
! pip install nltk



**Accessing Additional Resources:**

To use additional resources, such as corpora and models for languages other than English, run the following in a Python script. You only need to do this once, the first time you use NLTK on your system. Note that downloading everything with 'all' can take a while.

In [2]:
from IPython.display import clear_output

import nltk
nltk.download('all')

clear_output()     # This function clears the output

Let’s perform some basic operations on text data using NLTK.

### Tokenization
Tokenization refers to breaking text down into smaller units. It entails splitting paragraphs into sentences and sentences into words. It is one of the first steps of any NLP pipeline.

**Word Tokenization**

It involves breaking down the text into words.

* "I study Machine Learning on Kaggle." 

* ['I', 'study', 'Machine', 'Learning', 'on', 'Kaggle', '.'].

**Sentence Tokenization**

It involves breaking down the text into individual sentences. 


* "I study Machine Learning on Kaggle. Currently, I'm studying NLP."

* ["I study Machine Learning on Kaggle.", "Currently, I'm studying NLP."]

In [3]:
# Tokenization using NLTK
from nltk import word_tokenize, sent_tokenize
sent = "Kaggle is a great learning platform. It is one of the best for Data Science students."
print(word_tokenize(sent))
print(sent_tokenize(sent))

['Kaggle', 'is', 'a', 'great', 'learning', 'platform', '.', 'It', 'is', 'one', 'of', 'the', 'best', 'for', 'Data', 'Science', 'students', '.']
['Kaggle is a great learning platform.', 'It is one of the best for Data Science students.']
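
To make the idea concrete, here is a rough sketch of word and sentence splitting using only Python's `re` module. The function names `toy_word_tokenize` and `toy_sent_tokenize` are made up for this illustration, and the rules are far simpler than what NLTK actually does, so treat it as a way to see the principle rather than a replacement.

```python
import re

def toy_word_tokenize(text):
    # Match either a word (optionally with an internal apostrophe, e.g. "I'm")
    # or a single punctuation character.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

def toy_sent_tokenize(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    return re.split(r"(?<=[.!?])\s+", text.strip())

sample = "I study Machine Learning on Kaggle. Currently, I'm studying NLP."
print(toy_word_tokenize(sample))
print(toy_sent_tokenize(sample))
```

A naive rule like this breaks on abbreviations such as "Dr." and keeps "I'm" as one token, which is one reason to prefer a tested tokenizer for real work.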


### Stemming and Lemmatization 

When working with Natural Language, we are not much interested in the form of words – rather, we are concerned with the meaning that the words intend to convey. Thus, we try to map every word of the language to its root/base form. This process is called canonicalization. 

E.g. The words ‘play’, ‘plays’, ‘played’, and ‘playing’ convey the same action – hence, we can map them all to their base form i.e. ‘play’.

Now, there are two widely used canonicalization techniques: Stemming and Lemmatization.

**Stemming**

Stemming generates the base word from the inflected word by removing the affixes of the word. It has a set of pre-defined rules that govern the dropping of these affixes. It must be noted that stemmers might not always result in semantically meaningful base words.  Stemmers are faster and computationally less expensive than lemmatizers. 

In [4]:
from nltk.stem import PorterStemmer

# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

play
play
play
play


We can see that all the variations of the word ‘play’ have been reduced to the same word  – ‘play’. In this case, the output is a meaningful word, ‘play’. However, this is not always the case. Let us take an example. 

In [5]:
from nltk.stem import PorterStemmer
# create an object of class PorterStemmer
porter = PorterStemmer()
print('Output: ',porter.stem("Communication"))

Output:  commun


The stemmer reduces the word ‘communication’ to a base word ‘commun’ which is meaningless in itself.
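
The rule-based nature of stemming can be illustrated with a toy suffix stripper. This is not the Porter algorithm: the suffix list and the minimum stem length below are assumptions chosen only to reproduce the examples above.

```python
# Hypothetical suffix rules, checked in order (longest first).
SUFFIX_RULES = ["ication", "ing", "ed", "s"]
MIN_STEM_LENGTH = 3  # guard so that short words like "sing" are left alone

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        # Strip the suffix only if enough of the word is left over.
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

for w in ["play", "plays", "played", "playing", "Communication", "sing"]:
    print(w, "->", toy_stem(w))
```

Because the rules never consult a dictionary, nothing stops them from producing a non-word such as "commun". That is the trade-off stemmers make for speed.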

**Lemmatization**

**What is Lemmatization?**

Lemmatization is a linguistic process that involves reducing words to their base or canonical form, which is called the lemma. The lemma represents the dictionary form or citation form of a word. For example, the lemma of "running" is "run", and the lemma of "better" is "good".

**Purpose of Lemmatization**

The main purpose of lemmatization is to normalize words so that different forms of the same word are treated as one entity. This is particularly useful in text analysis tasks such as:

* Information Retrieval: Ensuring that searches retrieve all relevant documents regardless of the word form used in the query.

* Text Mining and Analytics: Improving the accuracy of analysis by consolidating different forms of a word.

* Machine Learning: Enhancing the performance of models by reducing the vocabulary size and capturing the semantic meaning more accurately.

**Lemmatization vs Stemming**

Lemmatization is often compared to stemming, another technique used to reduce words to their base forms. While stemming strips affixes using fixed rules, lemmatization uses a vocabulary and morphological analysis to obtain the correct base form of a word. As a result, lemmatization tends to be more precise than stemming, and it can take the word's part of speech into account, at the cost of being slower.

**How Lemmatization Works**

Lemmatization typically involves identifying the part of speech (POS) of a word and applying morphological analysis to determine its lemma. For instance, verbs, nouns, adjectives, and adverbs may have different lemmatization rules in a language. Tools like NLTK (Natural Language Toolkit) in Python and other NLP libraries provide lemmatization functionalities that leverage linguistic databases such as WordNet.

**Example Sentence:**

"The cats are chasing mice."

In lemmatization, words are transformed to their base form (lemma):

1. Tokenization: Split the sentence into individual words: ["The", "cats", "are", "chasing", "mice"].

2. Part-of-Speech Tagging: Determine the grammatical category of each word:

    * "The" (Determiner)
    * "cats" (Noun, plural)
    * "are" (Verb, present tense)
    * "chasing" (Verb, gerund or present participle)
    * "mice" (Noun, plural)

3. Lemmatization: Convert each word to its base form:

    * "The" -> "The" (unchanged for articles)
    * "cats" -> "cat"
    * "are" -> "be"
    * "chasing" -> "chase"
    * "mice" -> "mouse"

After lemmatization, the sentence becomes: "The cat be chase mouse."

Lemmatization helps in standardizing words so that variations of the same word (like "cat" and "cats") are treated as the same token, which is useful in various natural language processing tasks like text classification, information retrieval, and sentiment analysis.
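
The three steps above can be sketched with a small dictionary-based lemmatizer. Everything in this block is a toy assumption: the exception table, the suffix rules, and the tiny `LEMMAS` vocabulary are hand-written for the example sentence. In NLTK, this role is played by `WordNetLemmatizer` from `nltk.stem`, which relies on the much larger WordNet database.

```python
# Irregular forms that no suffix rule can handle, keyed by (word, POS).
EXCEPTIONS = {
    ("are", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("better", "ADJ"): "good",
}
# Suffix rewrite rules per part of speech: (suffix, replacement).
DETACH_RULES = {
    "NOUN": [("s", "")],
    "VERB": [("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", ""), ("s", "")],
}
# A tiny stand-in vocabulary of known lemmas.
LEMMAS = {"cat", "chase", "play", "mouse", "be", "good"}

def toy_lemmatize(word, pos):
    lowered = word.lower()
    if (lowered, pos) in EXCEPTIONS:
        return EXCEPTIONS[(lowered, pos)]
    for suffix, replacement in DETACH_RULES.get(pos, []):
        if lowered.endswith(suffix):
            candidate = lowered[: len(lowered) - len(suffix)] + replacement
            # Accept a candidate only if it is a known dictionary form.
            if candidate in LEMMAS:
                return candidate
    return word  # no rule applied: leave the word unchanged

tagged = [("The", "DET"), ("cats", "NOUN"), ("are", "VERB"),
          ("chasing", "VERB"), ("mice", "NOUN")]
print([toy_lemmatize(word, pos) for word, pos in tagged])
```

Checking each candidate against a vocabulary is what separates this from stemming: "chasing" becomes "chase" rather than "chas", because only "chase" is a known lemma.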

### Part of Speech Tagging
Part of Speech (POS) tagging refers to labelling each word of a sentence with its part of speech. It matters because it gives a syntactic overview of the sentence.

The parts of speech include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, and interjections.

The primary goal of POS tagging is to analyze the structure of a sentence and understand how words relate to each other syntactically. This information is crucial for many downstream NLP tasks, such as syntactic parsing, named entity recognition, sentiment analysis, and machine translation.

**Full Form of POS tags**

* DT: determiner
* JJ: adjective
* NN: noun, singular or mass
* VBZ: verb, third-person singular present
* IN: preposition or subordinating conjunction

**Example of POS Tagging**

Consider the sentence: “The quick brown fox jumps over the lazy dog.”

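Tagged by hand, the sentence looks like the list below. An automatic tagger can disagree on individual words: in the NLTK output that follows, “brown” is tagged NN rather than JJ.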

* “The” is tagged as determiner (DT)
* “quick” is tagged as adjective (JJ)
* “brown” is tagged as adjective (JJ)
* “fox” is tagged as noun (NN)
* “jumps” is tagged as verb (VBZ)
* “over” is tagged as preposition (IN)
* “the” is tagged as determiner (DT)
* “lazy” is tagged as adjective (JJ)
* “dog” is tagged as noun (NN)

In [6]:
from nltk import pos_tag
from nltk import word_tokenize

text = "The quick brown fox jumps over the lazy dog"
tokenized_text = word_tokenize(text)
tags = pos_tag(tokenized_text)
tags


# I will discuss more on POS tags later in this notebook

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN')]
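
Once you have (word, tag) pairs, plain Python is enough to work with them. The sketch below copies the tagger output shown above into a list, so it runs without NLTK, and groups the words by tag.

```python
from collections import defaultdict

# (word, tag) pairs copied from the NLTK output above.
tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('dog', 'NN')]

# Group words under their tag.
words_by_tag = defaultdict(list)
for word, tag in tags:
    words_by_tag[tag].append(word)

# All noun tags in this tag set start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith("NN")]

print(dict(words_by_tag))
print(nouns)
```

Note that "brown" ends up among the nouns because of the tagger's choice, so downstream results are only as good as the tags.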

### Named Entity Recognition (NER)
Named Entity Recognition (NER) is a key task in Natural Language Processing (NLP) that involves the identification and classification of named entities in unstructured text, such as people, organizations, locations, dates, and other relevant information. NER is used in various NLP applications such as information extraction, sentiment analysis, question-answering, and recommendation systems.

**Steps involved in NER**

Now, let’s take a look at the various steps involved in the NER process:

* **Tokenization**: The first step in NER involves breaking down the input text into individual words or tokens.
* **POS Tagging**: Next, we need to label each word in the text with its corresponding part of speech.
* **Chunking**: After POS tagging, we can group the words together into meaningful phrases using a process called chunking.
* **Named Entity Recognition**: Once we have identified the chunks, we can apply NER techniques to identify and classify the named entities in the text.
* **Evaluation**: Finally, we can evaluate the performance of our NER model on a set of testing data to determine its accuracy and effectiveness.
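
The chunking step can be sketched in a few lines. The tagged tokens below are hand-written stand-ins for POS tagger output, and `proper_noun_chunks` is a hypothetical helper that groups consecutive proper nouns (tag NNP) into candidate entities. A real NER system would then classify each candidate as a person, organization, and so on.

```python
from itertools import groupby

# Hand-written (word, tag) pairs standing in for POS tagger output.
tagged = [("Steve", "NNP"), ("Jobs", "NNP"), ("was", "VBD"), ("the", "DT"),
          ("CEO", "NNP"), ("of", "IN"), ("Apple", "NNP"), ("Inc", "NNP"),
          (".", ".")]

def proper_noun_chunks(tagged_tokens):
    chunks = []
    # groupby yields runs of adjacent tokens that share the same key value.
    for is_proper, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "NNP"):
        if is_proper:
            chunks.append(" ".join(word for word, _ in group))
    return chunks

print(proper_noun_chunks(tagged))
```

This rule finds spans only. It cannot tell that "CEO" is a job title rather than a name, which is the part a trained classifier has to handle.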

**Use of NER in NLP**

Here are some common use cases of NER:

* **Information Extraction**: NER can be used to extract relevant information from large volumes of unstructured text, such as news articles, social media posts, and online reviews. This information can be used to generate insights and make informed decisions.
* **Sentiment Analysis**: NER can be used to identify the sentiment expressed in a text towards a particular named entity, such as a product or service. This information can be used to improve customer satisfaction and identify areas for improvement.
* **Question Answering**: NER can be used to identify the relevant entities in a text that can be used to answer a specific question. This is particularly useful for chatbots and virtual assistants.
* **Recommendation Systems**: NER can be used to identify the interests and preferences of users based on the entities mentioned in their search queries or online interactions. This information can be used to provide personalized recommendations and improve user engagement.

**Necessary requirements:**

In [7]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /usr/share/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /usr/share/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [8]:
import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

# Sample text
text = "Steve Jobs was the CEO of Apple Inc."

# Tokenize the text
tokens = word_tokenize(text)

# Perform POS tagging
tagged = pos_tag(tokens)

# Perform named entity recognition (NER)
entities = ne_chunk(tagged)

# Print named entities
for entity in entities:
    print(entity)

(PERSON Steve/NNP)
(PERSON Jobs/NNP)
('was', 'VBD')
('the', 'DT')
(ORGANIZATION CEO/NNP)
('of', 'IN')
(PERSON Apple/NNP Inc/NNP)
('.', '.')


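That is a simple example, and it also shows the limits of NLTK's default chunker: in this output "Steve" and "Jobs" are returned as two separate PERSON entities, "CEO" is labelled ORGANIZATION, and "Apple Inc" is labelled PERSON. NER with other libraries is covered later in this notebook.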

## 2. spaCy <a class="anchor" id="2"></a>

spaCy is a free, open-source library for NLP in Python written in Cython. spaCy is designed to make it easy to build systems for information extraction or general-purpose natural language processing.

**Installation of spaCy**

In [9]:
from IPython.display import clear_output

!pip install spacy    # install the spaCy library
!python -m spacy download en_core_web_sm    # download the small English model used below

clear_output()     # clear the installation output if you want

spaCy provides trained models for many languages. The small English model used in this notebook is named en_core_web_sm. Models are installed separately from the library, because bundling every language into one package would make the download too large.

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")

To start processing your input, you construct a Doc object. A Doc object is a sequence of Token objects, each representing a lexical token. Each Token object has information about a particular piece of text, typically one word. You can instantiate a Doc object by calling the Language object with the input string as an argument:

In [11]:
introduction_doc = nlp("This tutorial is about Natural Language Processing in spaCy.")

print(type(introduction_doc))


print([token.text for token in introduction_doc])

<class 'spacy.tokens.doc.Doc'>
['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'spaCy', '.']


In the above example, the text is used to instantiate a Doc object. From there, you can access a whole bunch of information about the processed text.

For instance, the list comprehension iterates over the Doc object, which yields Token objects, and reads the .text attribute of each one to get the text contained within that token.

### Sentence Detection
Sentence detection is the process of locating where sentences start and end in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech (POS) tagging and named-entity recognition, which you’ll come to later in this notebook.

In spaCy, the .sents property is used to extract sentences from the Doc object. Here’s how you would extract the total number of sentences and the sentences themselves for a given input:

In [12]:
about_text = ("Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.")

doc = nlp(about_text)

sentences = list(doc.sents)

print('Length: ',len(sentences))

print(sentences)

Length:  2
[Gus Proto is a Python developer currently working for a London-based Fintech company., He is interested in learning Natural Language Processing.]


In the above example, spaCy correctly identifies the input’s sentences. The .sents property yields Span objects, one per sentence, which is why the code wraps it in list() before counting. You can also slice a Span object to produce a section of a sentence.

### Tokens in spaCy
Building the Doc container involves tokenizing the text. The process of tokenization breaks a text down into its basic units—or tokens—which are represented in spaCy as Token objects.

As you’ve already seen, with spaCy, you can print the tokens by iterating over the Doc object. But Token objects also have other attributes available for exploration. For instance, the token’s original index position in the string is still available as an attribute on Token:

In [13]:
import spacy
nlp = spacy.load("en_core_web_sm")
about_text = ("Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.")
about_doc = nlp(about_text)

for token in about_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142


In this example, you iterate over Doc, printing both Token and the .idx attribute, which represents the starting position of the token in the original text. 
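
Since .idx is a character offset into the original string, you can check it with ordinary string slicing. The sketch below does not use spaCy at all. It approximates the tokenization with a simple regular expression (an assumption that happens to agree with spaCy's output for this sentence) and records where each match starts.

```python
import re

about_text = ("Gus Proto is a Python developer currently working "
              "for a London-based Fintech company.")

# Each match is a word or a single punctuation mark; m.start() is its offset.
offsets = [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", about_text)]

for token_text, idx in offsets:
    # Slicing the original string at the offset recovers the token text.
    assert about_text[idx: idx + len(token_text)] == token_text

print(offsets[:5])
```

The same slicing check works with real spaCy tokens, because offsets refer to the untouched input string rather than to token positions.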

### Stop Words in spaCy
Stop words are typically defined as the most common words in a language. In the English language, some examples of stop words are the, are, but, and they. Most sentences need to contain stop words in order to be full sentences that make grammatical sense.

In many NLP tasks, stop words are removed because they carry little topical information and would otherwise dominate any word frequency analysis. spaCy stores a list of stop words for the English language:

In [14]:
# Import the set directly so this cell works even before a model has been loaded
from spacy.lang.en.stop_words import STOP_WORDS

spacy_stopwords = STOP_WORDS
len(spacy_stopwords)

326

In [15]:
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

fifteen
itself
we
either
can
below
even
whither
eleven
some


In this example, you’ve examined the STOP_WORDS set from **spacy.lang.en.stop_words**. Because it is a Python set, the ten words printed above come out in arbitrary order and may differ on your machine. You don’t need to access this set directly, though. You can remove stop words from the input text by making use of the **.is_stop** attribute of each token.

In [16]:
custom_about_text = ("Gus Proto is a Python developer currently working for a London-based Fintech company. He is interested in learning Natural Language Processing.")
nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
print([token for token in about_doc if not token.is_stop])

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]


Here you use a list comprehension with an if filter to produce a list of all the tokens in the text that are not stop words.

While you can’t be sure exactly what the sentence is trying to say without stop words, you still have a lot of information about what it’s generally about.

### Lemmatization in spaCy
Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form, or root word, is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is necessary because it helps you reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.

spaCy puts a **.lemma_** attribute on the Token class. This attribute has the lemmatized form of the token:

In [17]:
import spacy
nlp = spacy.load("en_core_web_sm")

about_text = ("Gus is helping organize a developer conference on Applications of Natural Language Processing. He keeps organizing local Python meetups and several internal talks at his workplace.")
about_doc = nlp(about_text)

for token in about_doc:
    if str(token) != str(token.lemma_):
        print(token,' : ',token.lemma_)

is  :  be
He  :  he
keeps  :  keep
organizing  :  organize
meetups  :  meetup
talks  :  talk


In this example, you check to see if the original word is different from the lemma, and if it is, you print both the original word and its lemma.

You’ll note, for instance, that organizing reduces to its lemma form, organize. If you don’t lemmatize the text, then organize and organizing will be counted as different tokens, even though they both refer to the same concept. Lemmatization helps you avoid duplicate words that may overlap conceptually.

### Word Frequency in spaCy
You can now convert a given text into tokens and perform statistical analysis on it. This analysis can give you various insights, such as common words or unique words in the text:

In [18]:
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

text = ("Four out of five of the most common words are stop words that don’t really tell you much about the summarized text. This is why stop words are often considered noise for many applications.")
doc = nlp(text)

# Words without stop_words and punctuation 
words = [token.text for token in doc if not token.is_stop and not token.is_punct]

print(Counter(words).most_common(5))

[('words', 3), ('stop', 2), ('common', 1), ('tell', 1), ('summarized', 1)]


If you can just look at the most common words, that may save you a lot of reading, because you can immediately tell if the text is about something that interests you or not.
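
To see how much stop words skew a frequency count, compare the counts with and without them. This sketch uses only `collections.Counter` and a hand-picked, hypothetical stop-word set of three words. spaCy's English list is far larger.

```python
from collections import Counter

# Hypothetical mini stop-word set, chosen just for this example.
TOY_STOP_WORDS = {"the", "on", "and"}

text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()

raw_counts = Counter(tokens)
filtered_counts = Counter(t for t in tokens if t not in TOY_STOP_WORDS)

print(raw_counts.most_common(1))       # dominated by a stop word
print(filtered_counts.most_common(1))  # a content word surfaces instead
```

Without filtering, the top entry is "the", which says nothing about the topic. After filtering, the most frequent word is one that describes what happens in the text.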

### Part-of-Speech Tagging
Part of speech or POS is a grammatical role that explains how a particular word is used in a sentence. There are typically eight parts of speech:

* Noun
* Pronoun
* Adjective
* Verb
* Adverb
* Preposition
* Conjunction
* Interjection

Part-of-speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence. POS tags are useful for assigning a syntactic category like noun or verb to each word. 

In [19]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = ("He is interested in learning Natural Language Processing.")     # try with your own text and have fun.
doc = nlp(text)

for token in doc:
    print(f'''
    Token: {str(token)}
    ------------
    Tag: {str(token.tag_)}       Pos: {str(token.pos_)}
    Explanation: {spacy.explain(token.tag_)}
    ''')


    Token: He
    ------------
    Tag: PRP       Pos: PRON
    Explanation: pronoun, personal
    

    Token: is
    ------------
    Tag: VBZ       Pos: AUX
    Explanation: verb, 3rd person singular present
    

    Token: interested
    ------------
    Tag: JJ       Pos: ADJ
    Explanation: adjective (English), other noun-modifier (Chinese)
    

    Token: in
    ------------
    Tag: IN       Pos: ADP
    Explanation: conjunction, subordinating or preposition
    

    Token: learning
    ------------
    Tag: VBG       Pos: VERB
    Explanation: verb, gerund or present participle
    

    Token: Natural
    ------------
    Tag: NNP       Pos: PROPN
    Explanation: noun, proper singular
    

    Token: Language
    ------------
    Tag: NNP       Pos: PROPN
    Explanation: noun, proper singular
    

    Token: Processing
    ------------
    Tag: NNP       Pos: PROPN
    Explanation: noun, proper singular
    

    Token: .
    ------------
    Tag: .       Pos: PUNCT


Here, two attributes of the Token class are accessed and printed using f-strings:

1. **.tag_** displays a fine-grained tag.
2. **.pos_** displays a coarse-grained tag, which is a reduced version of the fine-grained tags.

You also use **spacy.explain()** to give descriptive details about a particular POS tag, which can be a valuable reference tool.

### Visualization: Using displaCy

spaCy comes with a built-in visualizer called displaCy. You can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can use displaCy to draw the dependency parse of a sentence, which also shows the POS tag of each token:

In [20]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

text = ("He is interested in learning Natural Language Processing.")     # try with your own text and have fun.
doc = nlp(text)

displacy.render(doc, style="dep", jupyter=True)

In the rendered diagram, each token is shown with its POS tag just below it, and the arcs between tokens show the dependency relations.

### Named-Entity Recognition in spaCy
Named-entity recognition (NER) is the process of locating named entities in unstructured text and then classifying them into predefined categories, such as person names, organizations, locations, monetary values, percentages, and time expressions.

You can use NER to learn more about the meaning of your text. For example, you could use it to populate tags for a set of documents in order to improve the keyword search. You could also use it to categorize customer support tickets into relevant categories.

spaCy has the property **.ents** on Doc objects. You can use it to extract named entities:

In [21]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

text = ("Great Piano Academy is situated in Mayfair or the City of London and has world-class piano instructors.")     
doc = nlp(text)

for ent in doc.ents:
    print(f'''
    {ent.text} =  {spacy.explain(ent.label_)}
    ''')


    Great Piano Academy =  Companies, agencies, institutions, etc.
    

    Mayfair =  Countries, cities, states
    

    the City of London =  Countries, cities, states
    


First, the named entities are extracted through the **.ents** property of the Doc object.

In the above example, **ent** is a Span object with various attributes:

* **.text** gives the Unicode text representation of the entity.
* **.label_** gives the label of the entity.

**spacy.explain** gives descriptive details about each entity label. 

**You can also use displaCy to visualize these entities:**

In [22]:
displacy.render(doc, style="ent")

Here are descriptions of the entity labels you may see:

* PERSON: People, including fictional.
* NORP: Nationalities or religious or political groups.
* FAC: Buildings, airports, highways, bridges, etc.
* ORG: Companies, agencies, institutions, etc.
* GPE: Countries, cities, states.
* LOC: Non-GPE locations, mountain ranges, bodies of water.
* PRODUCT: Objects, vehicles, foods, etc. (Not services.)
* EVENT: Named hurricanes, battles, wars, sports events, etc.
* WORK_OF_ART: Titles of books, songs, etc.
* LAW: Named documents made into laws.
* LANGUAGE: Any named language.
* DATE: Absolute or relative dates or periods.
* TIME: Times smaller than a day.
* PERCENT: Percentage, including “%”.
* MONEY: Monetary values, including unit.
* QUANTITY: Measurements, as of weight or distance.
* ORDINAL: “first”, “second”, etc.
* CARDINAL: Numerals that do not fall under another type.

**One use case** for NER is to redact people’s names from a text. For example, you might want to do this in order to hide personal information collected in a survey. Take a look at the following example:
In [23]:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

survey_text = ("Pepsi is a carbonated soft drink with a cola flavor, manufactured by PepsiCo. David loves pepsi a lot.")

doc = nlp(survey_text)

def replace_person_names(token):
    if token.ent_type_ == "PERSON":
        return "[REDACTED] "
    return token.text_with_ws

def redact_names(nlp_doc):
    with nlp_doc.retokenize() as retokenizer:
        for ent in nlp_doc.ents:
            retokenizer.merge(ent)
    tokens = map(replace_person_names, nlp_doc)
    return "".join(tokens)

redact_names(doc)

'Pepsi is a carbonated soft drink with a cola flavor, manufactured by PepsiCo. [REDACTED] loves pepsi a lot.'

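The **redact_names()** function uses a retokenizer to merge each multi-token entity into a single token, so that a full name is replaced as one unit. It then passes every token through **map()**, which replaces any token whose entity type is PERSON with [REDACTED] and leaves the rest unchanged.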

So just like that, you would be able to redact a huge amount of text in seconds, while doing it manually could take many hours. That said, you always need to be careful with redaction, because the models aren’t perfect!
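
The replacement step itself does not depend on spaCy. Given entity spans as (start, end, label) character offsets, a few lines of plain Python can mask them. In the sketch below the spans are written by hand as an assumption. With spaCy you could build them from each entity's `start_char`, `end_char`, and `label_` attributes. The helper name `redact_spans` is made up for this example.

```python
def redact_spans(text, spans, label="PERSON", mask="[REDACTED]"):
    # Replace from right to left so that earlier offsets stay valid
    # after each substitution changes the string length.
    for start, end, span_label in sorted(spans, reverse=True):
        if span_label == label:
            text = text[:start] + mask + text[end:]
    return text

survey_text = ("Pepsi is a carbonated soft drink with a cola flavor, "
               "manufactured by PepsiCo. David loves pepsi a lot.")

# Hand-written spans standing in for the output of an NER model.
david_start = survey_text.index("David")
spans = [(0, 5, "ORG"), (david_start, david_start + len("David"), "PERSON")]

print(redact_spans(survey_text, spans))
```

Working with character offsets keeps the original spacing and punctuation intact, since nothing outside the masked spans is touched.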

## 3. TextBlob <a class="anchor" id="3"></a>

TextBlob is a simple, beginner-friendly NLP library built on top of NLTK and Pattern. It provides a high-level interface for common NLP tasks such as sentiment analysis, part-of-speech tagging, noun phrase extraction, and classification. Older releases also wrapped a translation service, but recent versions no longer include it. TextBlob’s easy-to-use API makes it suitable for beginners and rapid prototyping.

**Features**

* Noun phrase extraction
* Part-of-speech tagging
* Sentiment analysis
* Classification (Naive Bayes, Decision Tree)
* Tokenization (splitting text into words and sentences)
* Word and phrase frequencies
* Parsing
* n-grams
* Word inflection (pluralization and singularization) and lemmatization
* Spelling correction
* Add new models or languages through extensions
* WordNet integration

**Get it now**

In [24]:
!pip install -U textblob
!python -m textblob.download_corpora

Collecting nltk>=3.8 (from textblob)
  Downloading nltk-3.8.1-py3-none-any.whl.metadata (2.8 kB)
Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
Installing collected packages: nltk
  Attempting uninstall: nltk
    Found existing installation: nltk 3.2.4
    Uninstalling nltk-3.2.4:
      Successfully uninstalled nltk-3.2.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
preprocessing 0.1.13 requires nltk==3.2.4, but you have nltk 3.8.1 which is incompatible.
Successfully installed nltk-3.8.1
[nltk_data] Downloading package brown to /usr/share/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already

### Tutorial: Quickstart

In [25]:
# First, the import
from textblob import TextBlob

# Let’s create our first TextBlob.
wiki = TextBlob("Python is a high-level, general-purpose programming language.")

**Part-of-speech Tagging**

In [26]:
# Part-of-speech tags can be accessed through the tags property.
wiki.tags

[('Python', 'NNP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('high-level', 'JJ'),
 ('general-purpose', 'JJ'),
 ('programming', 'NN'),
 ('language', 'NN')]

**Noun Phrase Extraction**

In [27]:
# Similarly, noun phrases are accessed through the noun_phrases property.
wiki.noun_phrases

WordList(['python'])

**Sentiment Analysis**

The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0]. The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective.

In [28]:
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
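
To get a feel for where such scores come from, here is a toy lexicon-based scorer. It is not TextBlob's algorithm or lexicon: the `TOY_LEXICON` entries are invented for illustration, and the scorer just averages the values of the words it recognizes. It does show why polarity falls in [-1.0, 1.0] and subjectivity in [0.0, 1.0].

```python
import re

# Invented (polarity, subjectivity) values; a real lexicon has thousands of entries.
TOY_LEXICON = {
    "great": (0.75, 0.75),
    "fun": (0.25, 0.25),
    "terrible": (-1.0, 1.0),
}

def toy_sentiment(text):
    words = re.findall(r"[a-z']+", text.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not hits:
        return (0.0, 0.0)  # no known words: treat as neutral and objective
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return (polarity, subjectivity)

print(toy_sentiment("What great fun!"))
print(toy_sentiment("The report is due on Monday."))
```

An average of values that each lie within a range stays within that range, which is why the scores are bounded. Real analyzers also deal with negation and intensifiers such as "not" and "very", which this sketch ignores.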

**Tokenization**

In [29]:
zen = TextBlob(
    "Beautiful is better than ugly. "
    "Explicit is better than implicit. "
    "Simple is better than complex."
)
print('Word list: ',zen.words, '\n')

print('Sentence list: ',zen.sentences)

Word list:  ['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'] 

Sentence list:  [Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]


**WordLists**

A WordList is just a Python list with additional methods.

In [30]:
animals = TextBlob("cat dog octopus")
print(animals.words)

print(animals.words.pluralize())

['cat', 'dog', 'octopus']
['cats', 'dogs', 'octopodes']


**Spelling Correction**

Use the **correct()** method to attempt spelling correction.

In [31]:
b = TextBlob("I havv goood speling!")
b.correct()

TextBlob("I have good spelling!")

Word objects have a **spellcheck()** method that returns a list of (word, confidence) tuples with spelling suggestions.

In [32]:
from textblob import Word

w = Word('correct')
w.spellcheck()

[('correct', 1.0)]

**Get Word and Noun Phrase Frequencies**

There are two ways to get the frequency of a word or noun phrase in a TextBlob.

The first is through the **word_counts** dictionary.

In [33]:
monty = TextBlob("We are no longer the Knights who say Ni. "
                    "We are now the Knights who say Ekki ekki ekki PTANG.")
monty.word_counts['ekki']

3

The second way is to use the **count()** method.

In [34]:
monty.words.count('ekki')

3

**Get Start and End Indices of Sentences**

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

In [35]:
from textblob import TextBlob

text = TextBlob('We are no longer the Knights who say Ni')
for s in text.sentences:
    print('Start at index {}, Ends at index {}'.format(s.start,s.end))

Start at index 0, Ends at index 39


### Tutorial: Building a Text Classification System

The **textblob.classifiers** module makes it simple to create custom classifiers.

As an example, let’s create a custom sentiment analyzer.

**Loading Data and Creating a Classifier**

First we’ll create some training and test data.

In [36]:
train = [
    ("I love this sandwich.", "pos"),
    ("this is an amazing place!", "pos"),
    ("I feel very good about these beers.", "pos"),
    ("this is my best work.", "pos"),
    ("what an awesome view", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff.", "neg"),
    ("I can't deal with this", "neg"),
    ("he is my sworn enemy!", "neg"),
    ("my boss is horrible.", "neg"),
]
test = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
    ("I can't believe I'm doing this.", "neg"),
]

Now we’ll create a Naive Bayes classifier, passing the training data into the constructor.

In [37]:
from textblob.classifiers import NaiveBayesClassifier

cl = NaiveBayesClassifier(train)
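
To give a feel for what a Naive Bayes text classifier computes, here is a small from-scratch sketch. It assumes a multinomial bag-of-words model with add-one smoothing. `TinyNaiveBayes` and `tokenize` are hypothetical names, and the details may differ from the classifier TextBlob actually uses.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Crude tokenizer: lower-case, split on whitespace, trim basic punctuation.
    words = (w.strip(".,!?") for w in text.lower().split())
    return [w for w in words if w]


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self, labeled_texts):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in labeled_texts:
            self.label_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def log_score(self, text, label):
        # log P(label) + sum over words of log P(word | label).
        total_docs = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in tokenize(text):
            if word not in self.vocab:
                continue  # ignore words never seen during training
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text):
        return max(self.label_counts, key=lambda label: self.log_score(text, label))


sample_train = [
    ("I love this sandwich.", "pos"),
    ("this is an amazing place!", "pos"),
    ("what an awesome view", "pos"),
    ("I do not like this restaurant", "neg"),
    ("I am tired of this stuff.", "neg"),
    ("my boss is horrible.", "neg"),
]
model = TinyNaiveBayes(sample_train)
print(model.classify("an amazing sandwich"))   # pos
print(model.classify("my boss is horrible"))   # neg
```

Working in log space avoids multiplying many small probabilities together, and the smoothing keeps a single unseen word from driving a class probability to zero.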

**Classifying Text**

Call the **classify(text)** method to use the classifier.

In [38]:
cl.classify('I do not like this restaurant')

'neg'

**Classifying TextBlobs**

Another way to classify text is to pass a classifier into the constructor of **TextBlob** and call its **classify()** method.

In [39]:
from textblob import TextBlob

sents = TextBlob("Beautiful is better than ugly. Explicit is better than implicit. Simple is not better than complex.", classifier = cl)

for s in sents.sentences:
    print(s,'-->',s.classify())

Beautiful is better than ugly. --> pos
Explicit is better than implicit. --> pos
Simple is not better than complex. --> neg


**Evaluating Classifiers**

To compute the accuracy on our test set, use the **accuracy(test_data)** method.

In [40]:
cl.accuracy(test)

0.8333333333333334
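
Accuracy is the fraction of test examples whose predicted label matches the true label, so the 0.8333 above corresponds to 5 correct predictions out of 6. The sketch below computes the same metric by hand. `toy_classify` is a deliberately simple, hypothetical stand-in for a trained classifier, so its score differs from the one above.

```python
def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label equals the gold label."""
    correct = sum(1 for text, gold in labeled_examples if classify(text) == gold)
    return correct / len(labeled_examples)


def toy_classify(text):
    # Hypothetical rule: a text is negative only if it contains the word "not".
    return "neg" if " not " in " {} ".format(text.lower()) else "pos"


test_set = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
    ("I can't believe I'm doing this.", "neg"),
]
print(accuracy(toy_classify, test_set))   # 4 of 6 correct
```

The toy rule misses the two sentences that express negation with "ain't" and "can't", which is why it gets 4 of the 6 examples right.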

### Advanced Usage: Overriding Models and the Blobber Class

TextBlob allows you to specify which algorithms you want to use under the hood of its simple API.

**Sentiment Analyzers**

The **textblob.sentiments** module contains two sentiment analysis implementations, **PatternAnalyzer** (based on the pattern library) and **NaiveBayesAnalyzer** (an NLTK classifier trained on a movie reviews corpus).

The default implementation is **PatternAnalyzer**, but you can override the analyzer by passing another implementation into a TextBlob’s constructor.

For instance, the **NaiveBayesAnalyzer** returns its result as a namedtuple of the form: **Sentiment(classification, p_pos, p_neg)**.

In [41]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer

blob = TextBlob('I love this library', analyzer = NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)
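
Since the result is a namedtuple, its fields can be read by name or unpacked like an ordinary tuple. The snippet below recreates the result shape with the standard library, using the values from the output above. The `Sentiment` type defined here is a local stand-in, not an import from TextBlob.

```python
from collections import namedtuple

# Local stand-in for the result type; values copied from the output above.
Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])
result = Sentiment(classification="pos",
                   p_pos=0.7996209910191279,
                   p_neg=0.2003790089808724)

print(result.classification)          # access a field by name
label, p_pos, p_neg = result          # or unpack positionally
print(round(p_pos + p_neg, 6))        # the two probabilities sum to about 1
```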

**Tokenizers**

The **words** and **sentences** properties are helpers that use the **textblob.tokenizers.WordTokenizer** and **textblob.tokenizers.SentenceTokenizer** classes, respectively.

You can use other tokenizers, such as those provided by NLTK, by passing them into the **TextBlob** constructor and then accessing the **tokens** property.

In [42]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
blob = TextBlob("This is\ta rather tabby\tblob.", tokenizer=tokenizer)
blob.tokens

WordList(['This is', 'a rather tabby', 'blob.'])
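
Conceptually, a tab tokenizer just splits the string on tab characters. The following sketch reproduces that behaviour with a plain Python class. `SimpleTabTokenizer` is a hypothetical name, and this class is only meant to show the splitting logic, not to be passed to TextBlob as-is.

```python
class SimpleTabTokenizer:
    """Hypothetical tokenizer that splits text on tab characters only."""

    def tokenize(self, text):
        # Spaces are left alone, so each token may contain several words.
        return text.split("\t")


tokens = SimpleTabTokenizer().tokenize("This is\ta rather tabby\tblob.")
print(tokens)   # ['This is', 'a rather tabby', 'blob.']
```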

NLTK's tokenizer functions can also be called directly, without going through TextBlob:

In [43]:
from nltk.tokenize import word_tokenize

text = "Hello, how are you?"
tokens = word_tokenize(text)
print(tokens)

['Hello', ',', 'how', 'are', 'you', '?']


In [44]:
from nltk.tokenize import sent_tokenize

text = "This is a sentence. And another sentence!"
sentences = sent_tokenize(text)
print(sentences) 

['This is a sentence.', 'And another sentence!']


### Noun Phrase Chunkers

TextBlob currently has two noun phrase chunker implementations: **textblob.np_extractors.FastNPExtractor** (the default, based on Shlomi Babluki’s implementation) and **textblob.np_extractors.ConllExtractor**, which uses the CoNLL 2000 corpus to train a tagger.

You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to a TextBlob’s constructor.

In [45]:
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor

extractor = ConllExtractor()
blob = TextBlob("Python is a high-level programming language.", np_extractor=extractor)

blob.noun_phrases

WordList(['python', 'high-level programming language'])
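
As a rough illustration of what noun phrase chunking does, the sketch below groups consecutive adjective and noun tags into phrases. The POS tags are assigned by hand and the grouping rule is an assumption made for this example. Neither `FastNPExtractor` nor `ConllExtractor` is claimed to work this way.

```python
def chunk_noun_phrases(tagged_words):
    """Group maximal runs of adjective (JJ*) and noun (NN*) tags into phrases."""
    phrases, current = [], []
    for word, tag in tagged_words:
        if tag.startswith("JJ") or tag.startswith("NN"):
            current.append(word.lower())
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:                      # flush a phrase that ends the sentence
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged input (hypothetical tags, Penn Treebank style).
tagged = [("Python", "NNP"), ("is", "VBZ"), ("a", "DT"), ("high-level", "JJ"),
          ("programming", "NN"), ("language", "NN"), (".", ".")]
print(chunk_noun_phrases(tagged))    # ['python', 'high-level programming language']
```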

### POS Taggers

TextBlob currently has two POS tagger implementations, located in **textblob.taggers**. The default is the **PatternTagger** which uses the same implementation as the pattern library.

The second implementation is **NLTKTagger**, which uses NLTK’s TreeBank tagger. NumPy is required to use the **NLTKTagger**.

Similar to the tokenizers and noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.

In [46]:
from textblob import TextBlob
from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
blob.pos_tags

[('Tag', 'NN'), ('You', 'PRP'), ("'re", 'VBP'), ('It', 'PRP')]

### Parsers

Parser implementations can also be passed to the TextBlob constructor.

In [47]:
from textblob import TextBlob
from textblob.parsers import PatternParser
blob = TextBlob("Parsing is fun.", parser=PatternParser())
blob.parse()

'Parsing/VBG/B-VP/O is/VBZ/I-VP/O fun/NN/B-NP/O ././O/O'
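
The parse string packs several slash-separated annotations into each token. The sketch below splits the output above into one record per token. The field names `word`, `pos`, `chunk`, and `pnp` are an assumed reading of the four slots, so check them against the parser's documentation before relying on them.

```python
parsed = "Parsing/VBG/B-VP/O is/VBZ/I-VP/O fun/NN/B-NP/O ././O/O"

rows = []
for token in parsed.split():
    # Split from the right at most three times, so exactly four fields result.
    word, pos, chunk, pnp = token.rsplit("/", 3)
    rows.append({"word": word, "pos": pos, "chunk": chunk, "pnp": pnp})

for row in rows:
    print(row)
```

Limiting the split to three slashes guarantees four fields per token, including for the final `././O/O` token, whose word and tag are both a period.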

## 4. Hugging Face Transformer <a class="anchor" id="4"></a>

Read the basics in this article:

https://www.geeksforgeeks.org/hugging-face-transformers/