## TextBlob

Natural language processing (NLP) means understanding and processing everyday language.

Natural language processing is a form of artificial intelligence that helps computers read and respond by simulating the human ability to understand everyday language.
Many organizations use NLP techniques to optimize customer support, make text analytics more efficient by surfacing the information they need, and enhance social media monitoring.
For example, a bank might implement NLP algorithms to optimize customer support, while a large consumer products brand might combine natural language processing and semantic analysis to improve its knowledge management strategies and social media monitoring.

TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

### Features

* Noun phrase extraction
* Part-of-speech tagging
* Sentiment analysis
* Classification (Naive Bayes, Decision Tree)
* Tokenization (splitting text into words and sentences)
* Word and phrase frequencies
* Parsing
* n-grams
* Word inflection (pluralization and singularization) and lemmatization
* Spelling correction
* Add new models or languages through extensions
* WordNet integration

#### Installing and downloading the necessary corpora

pip install -U textblob
python -m textblob.download_corpora

In [2]:
import nltk
import textblob

In [39]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')
nltk.download('movie_reviews')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

This will install TextBlob and download the necessary NLTK corpora. If you need to change the default download
directory, set the NLTK_DATA environment variable.
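
As a minimal sketch of the NLTK_DATA option, the variable can also be set from Python before `nltk` is imported. The directory name `my_nltk_data` is a placeholder chosen for illustration; nothing in TextBlob or NLTK requires that name.

```python
import os
import tempfile

# Hypothetical download location; any writable directory would do.
data_dir = os.path.join(tempfile.gettempdir(), "my_nltk_data")
os.makedirs(data_dir, exist_ok=True)

# Set the variable before importing nltk so that its search path,
# which is built at import time, includes this directory.
os.environ["NLTK_DATA"] = data_dir
print(os.environ["NLTK_DATA"])
```

Setting the variable in the shell before launching Python or Jupyter has the same effect and avoids depending on import order.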

#### Short Tutorial

TextBlob aims to provide access to common text-processing operations through a familiar interface. You can treat
TextBlob objects as if they were Python strings that learned how to do Natural Language Processing.

In [3]:
from textblob import TextBlob
sf = TextBlob("San Francisco, officially the City and County of San Francisco and colloquially known as SF, San Fran, Frisco, or The City,is the cultural, commercial, and financial center of Northern California. San Francisco is the 16th most populous city in the United States, and the fourth most populous in California, with 881,549 residents as of 2019.It covers an area of about 46.89 square miles (121.4 km2),mostly at the north end of the San Francisco Peninsula in the San Francisco Bay Area, making it the second most densely populated large U.S. city, and the fifth most densely populated U.S. ")

### Part-of-speech Tagging

In [8]:
sf.tags 

[('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('officially', 'RB'),
 ('the', 'DT'),
 ('City', 'NNP'),
 ('and', 'CC'),
 ('County', 'NNP'),
 ('of', 'IN'),
 ('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('and', 'CC'),
 ('colloquially', 'RB'),
 ('known', 'VBN'),
 ('as', 'IN'),
 ('SF', 'NNP'),
 ('San', 'NNP'),
 ('Fran', 'NNP'),
 ('Frisco', 'NNP'),
 ('or', 'CC'),
 ('The', 'DT'),
 ('City', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('cultural', 'JJ'),
 ('commercial', 'JJ'),
 ('and', 'CC'),
 ('financial', 'JJ'),
 ('center', 'NN'),
 ('of', 'IN'),
 ('Northern', 'NNP'),
 ('California', 'NNP'),
 ('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('16th', 'CD'),
 ('most', 'JJS'),
 ('populous', 'JJ'),
 ('city', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('fourth', 'JJ'),
 ('most', 'RBS'),
 ('populous', 'JJ'),
 ('in', 'IN'),
 ('California', 'NNP'),
 ('with', 'IN'),
 ('881,549', 'CD'),
 ('residents', 'NNS'),
 ('as', 'IN'),
 (

In [9]:
sf.pos_tags

[('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('officially', 'RB'),
 ('the', 'DT'),
 ('City', 'NNP'),
 ('and', 'CC'),
 ('County', 'NNP'),
 ('of', 'IN'),
 ('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('and', 'CC'),
 ('colloquially', 'RB'),
 ('known', 'VBN'),
 ('as', 'IN'),
 ('SF', 'NNP'),
 ('San', 'NNP'),
 ('Fran', 'NNP'),
 ('Frisco', 'NNP'),
 ('or', 'CC'),
 ('The', 'DT'),
 ('City', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('cultural', 'JJ'),
 ('commercial', 'JJ'),
 ('and', 'CC'),
 ('financial', 'JJ'),
 ('center', 'NN'),
 ('of', 'IN'),
 ('Northern', 'NNP'),
 ('California', 'NNP'),
 ('San', 'NNP'),
 ('Francisco', 'NNP'),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('16th', 'CD'),
 ('most', 'JJS'),
 ('populous', 'JJ'),
 ('city', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('United', 'NNP'),
 ('States', 'NNPS'),
 ('and', 'CC'),
 ('the', 'DT'),
 ('fourth', 'JJ'),
 ('most', 'RBS'),
 ('populous', 'JJ'),
 ('in', 'IN'),
 ('California', 'NNP'),
 ('with', 'IN'),
 ('881,549', 'CD'),
 ('residents', 'NNS'),
 ('as', 'IN'),
 (

TextBlob currently has two POS tagger implementations, located in textblob.taggers. The default is
PatternTagger, which uses the same implementation as the pattern library.

The second implementation is NLTKTagger, which uses NLTK’s TreeBank tagger. NumPy is required to use
NLTKTagger.

Similar to the tokenizers and noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a
tagger instance to the constructor.

In [10]:

from textblob.taggers import NLTKTagger
nltk_tagger = NLTKTagger()
blob = TextBlob("Tag! You're It!", pos_tagger=nltk_tagger)
blob.pos_tags


[('Tag', 'NN'), ('You', 'PRP'), ("'re", 'VBP'), ('It', 'PRP')]
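
Because the tagger returns plain (word, tag) tuples, ordinary list processing is enough to pull out words of a given class. The sketch below works on the output copied from the cell above; `words_with_tag` is a helper invented for this example, not part of TextBlob.

```python
# Tagger output copied from the cell above: (word, Penn Treebank tag) pairs.
tags = [('Tag', 'NN'), ('You', 'PRP'), ("'re", 'VBP'), ('It', 'PRP')]


def words_with_tag(tagged, prefix):
    """Return the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]


pronouns = words_with_tag(tags, 'PRP')
nouns = words_with_tag(tags, 'NN')  # the 'NN' prefix also matches NNS, NNP, NNPS
print(pronouns)  # ['You', 'It']
print(nouns)     # ['Tag']
```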

### Noun Phrase Extraction

In [None]:
sf.noun_phrases

WordList(['san francisco', 'sf', 'san fran', 'frisco', 'financial center', 'california', 'san francisco', 'populous city', 'california', 'square miles', 'san francisco peninsula', 'san francisco', 'bay area', 'u.s.', 'u.s'])

In [13]:
blob2 = TextBlob("Google is great search engine for finding almost anything ")
for np in blob2.noun_phrases:
    print(np)
    

google
great search engine


TextBlob currently has two noun phrase chunker implementations in textblob.np_extractors: FastNPExtractor
(the default) and ConllExtractor, which uses the CoNLL 2000 corpus to train a tagger.
You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase
extractor to a TextBlob’s constructor.

In [None]:
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
extractor = ConllExtractor()
blob = TextBlob("Python is a high-level programming language.", np_extractor=extractor)
blob.noun_phrases

WordList(['python', 'high-level programming language'])

### Sentiment Analysis


- Also known as opinion mining or emotion AI
- Categorizing the opinions and attitudes expressed in a piece of text (positive, negative, or neutral)
- Determining the attitude of a speaker or a writer with respect to some topic


The textblob.sentiments module contains two sentiment analysis implementations: PatternAnalyzer
(based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus).

The default implementation is PatternAnalyzer, which returns Sentiment(polarity, subjectivity). Polarity is a
float in the range [-1.0, 1.0] and subjectivity is a float in the range [0.0, 1.0], where 0.0 is very objective
and 1.0 is very subjective. You can override the analyzer by passing another implementation into a TextBlob’s
constructor.

For instance, the NaiveBayesAnalyzer returns
Sentiment(classification, p_pos, p_neg).

In [14]:
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

In [15]:
testimonial.sentiment.polarity

0.39166666666666666

In [18]:
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.7996209910191279, p_neg=0.2003790089808724)

In [19]:
myfeelings = ["I love my phone but would not recommend it to any of my colleagues",
              "I love this watch",
              "This is an amazing library",
              "I do not like this restaurant",
              "Wow what a great tip.",
             "a surprisingly interesting movie",
              "hard to resist",
              "one lousy film",
              "this is too long"]

# The loop below uses PatternAnalyzer (the default).
# To use NaiveBayesAnalyzer instead:
# from textblob.sentiments import NaiveBayesAnalyzer
# blob = TextBlob("I love this city", analyzer=NaiveBayesAnalyzer())

for f in myfeelings:
    result = TextBlob(f).sentiment.polarity
    print(f'{f} ==> polarity:  {result}')

I love my phone but would not recommend it to any of my colleagues ==> polarity:  0.5
I love this watch ==> polarity:  0.5
This is an amazing library ==> polarity:  0.6000000000000001
I do not like this restaurant ==> polarity:  0.0
Wow what a great tip. ==> polarity:  0.45
a surprisingly interesting movie ==> polarity:  0.5
hard to resist ==> polarity:  -0.2916666666666667
one lousy film ==> polarity:  -0.5
this is too long ==> polarity:  -0.05
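
To turn a polarity score into the positive, negative, or neutral categories mentioned above, one option is a simple threshold. This is only a sketch: `label_polarity` and its cutoff of 0.1 are assumptions made for illustration, not something TextBlob provides, and the scores are copied from the output above.

```python
def label_polarity(polarity, threshold=0.1):
    """Map a polarity in [-1.0, 1.0] to a coarse label (threshold is arbitrary)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Polarity values copied from the PatternAnalyzer output above.
scores = {
    "I love this watch": 0.5,
    "I do not like this restaurant": 0.0,
    "one lousy film": -0.5,
    "this is too long": -0.05,
}

labels = {text: label_polarity(score) for text, score in scores.items()}
for text, label in labels.items():
    print(f"{text} ==> {label}")
```

Note that "I do not like this restaurant" scored 0.0 and therefore lands in the neutral bucket, so a threshold on polarity alone can miss statements that a human reader would call negative.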


### Tokenization

You can break TextBlobs into words or sentences.

In [33]:
sf.words

WordList(['San', 'Francisco', 'officially', 'the', 'City', 'and', 'County', 'of', 'San', 'Francisco', 'and', 'colloquially', 'known', 'as', 'SF', 'San', 'Fran', 'Frisco', 'or', 'The', 'City', 'is', 'the', 'cultural', 'commercial', 'and', 'financial', 'center', 'of', 'Northern', 'California', 'San', 'Francisco', 'is', 'the', '16th', 'most', 'populous', 'city', 'in', 'the', 'United', 'States', 'and', 'the', 'fourth', 'most', 'populous', 'in', 'California', 'with', '881,549', 'residents', 'as', 'of', '2019.It', 'covers', 'an', 'area', 'of', 'about', '46.89', 'square', 'miles', '121.4', 'km2', 'mostly', 'at', 'the', 'north', 'end', 'of', 'the', 'San', 'Francisco', 'Peninsula', 'in', 'the', 'San', 'Francisco', 'Bay', 'Area', 'making', 'it', 'the', 'second', 'most', 'densely', 'populated', 'large', 'U.S', 'city', 'and', 'the', 'fifth', 'most', 'densely', 'populated', 'U.S'])

In [34]:
sf.sentences

[Sentence("San Francisco, officially the City and County of San Francisco and colloquially known as SF, San Fran, Frisco, or The City,is the cultural, commercial, and financial center of Northern California."),
 Sentence("San Francisco is the 16th most populous city in the United States, and the fourth most populous in California, with 881,549 residents as of 2019.It covers an area of about 46.89 square miles (121.4 km2),mostly at the north end of the San Francisco Peninsula in the San Francisco Bay Area, making it the second most densely populated large U.S. city, and the fifth most densely populated U.S.")]

Sentence objects have the same properties and methods as TextBlobs.

In [None]:
for sentence in sf.sentences:
    print(sentence.sentiment)

Sentiment(polarity=0.0, subjectivity=0.05)
Sentiment(polarity=0.3163265306122449, subjectivity=0.34693877551020413)


The words and sentences properties are helpers that use the textblob.tokenizers.WordTokenizer
and textblob.tokenizers.SentenceTokenizer classes, respectively.
You can use other tokenizers, such as those provided by NLTK, by passing them into the TextBlob constructor and then
accessing the tokens property.

In [None]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
tokenizer = TabTokenizer()
blob = TextBlob("This is\ta rather tabby\tblob.", tokenizer=tokenizer)
blob.tokens

WordList(['This is', 'a rather tabby', 'blob.'])

### Word Inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str, or of unicode on Python 2) with useful
methods, e.g. for word inflection.

In [35]:
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words

WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [36]:
sentence.words[2].singularize()

'space'

In [37]:
sentence.words[-1].pluralize()

'levels'

Words can be lemmatized by calling the lemmatize method.

In [40]:
from textblob import Word
w = Word("octopi")
w.lemmatize()



'octopus'

In [None]:
w = Word("went")
w.lemmatize("v")  # Pass in the WordNet part of speech (verb)

'go'

In [None]:
Word("octopus").definitions

['tentacles of octopus prepared as food',
 'bottom-living cephalopod having a soft oval body with eight long tentacles']

### WordLists

A WordList is just a Python list with additional methods.

In [None]:
animals = TextBlob("cat dog octopus")
animals.words




WordList(['cat', 'dog', 'octopus'])

In [None]:
animals.words.pluralize()

WordList(['cats', 'dogs', 'octopodes'])

### Spelling Correction

Use the correct() method to attempt spelling correction.

In [None]:
b = TextBlob("I havv goood speling!")
print(b.correct())

I have good spelling!


Word objects have a Word.spellcheck() method that returns a list of (word,
confidence) tuples with spelling suggestions.

In [None]:
w = Word('falibility')
w.spellcheck()

[('fallibility', 1.0)]
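
TextBlob's documentation attributes its spelling correction to Peter Norvig's "How to Write a Spelling Corrector". The sketch below shows only the candidate-generation step of that idea: it builds every string one edit away from a word and keeps those found in a vocabulary. It is not TextBlob's code, the three-word vocabulary is invented for illustration, and the frequency-based ranking that produces a confidence value is left out.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """Return all strings one delete, transpose, replace, or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


# Toy vocabulary (an assumption); a real corrector uses a large word list.
vocabulary = {"fallibility", "facility", "fertility"}

candidates = edits1("falibility") & vocabulary
print(candidates)  # {'fallibility'}
```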

### Get Word and Noun Phrase Frequencies

There are two ways to get the frequency of a word or noun phrase in a TextBlob.

The first is through the word_counts dictionary.

In [None]:
monty = TextBlob("We are no longer the Knights who say Ni. "
"We are now the Knights who say Ekki ekki ekki PTANG.")

monty.word_counts['ekki']

3

If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a
frequency of 0.

The second way is to use the count() method.

In [None]:
monty.words.count('ekki', case_sensitive=True)

2
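
The difference between the two counts can be reproduced in plain Python. The sketch below is an illustration of case-insensitive versus case-sensitive counting, not TextBlob's implementation; the regular expression is a stand-in for TextBlob's word tokenizer.

```python
import re
from collections import Counter

text = ("We are no longer the Knights who say Ni. "
        "We are now the Knights who say Ekki ekki ekki PTANG.")

# Rough word tokenization (an assumption; TextBlob uses its own tokenizer).
words = re.findall(r"\w+", text)

# Case-insensitive counts: lowercase every word before counting.
counts = Counter(word.lower() for word in words)
print(counts["ekki"])       # 3
print(counts["missing"])    # 0, a Counter returns 0 for unseen keys

# Case-sensitive count: compare the words exactly as written.
print(words.count("ekki"))  # 2
```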

In [None]:
sf.noun_phrases.count('San Francisco')

3

### TextBlobs Are Like Python Strings!

You can use Python’s substring syntax.

In [None]:
sf[0:200]

TextBlob("San Francisco, officially the City and County of San Francisco and colloquially known as SF, San Fran, Frisco, or The City,is the cultural, commercial, and financial center of Northern California. San")

You can make comparisons between TextBlobs and strings.

In [None]:
apple_blob = TextBlob('apples')
banana_blob = TextBlob('bananas')
apple_blob < banana_blob

True

You can concatenate and interpolate TextBlobs and strings.

In [None]:
apple_blob + ' and ' + banana_blob

TextBlob("apples and bananas")

In [None]:
"{0} and {1}".format(apple_blob, banana_blob)

'apples and bananas'

### n-grams

The TextBlob.ngrams() method returns a list of WordList objects, each holding n successive words.

In [None]:
blob = TextBlob("Now is better than never.")

In [None]:
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]
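
Conceptually, an n-gram list is a sliding window over the word list. The following plain-Python sketch reproduces the trigram output above; the `ngrams` function here is written for illustration and is separate from the TextBlob method of the same name.

```python
def ngrams(words, n=3):
    """Return every run of n successive words, in order."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = "Now is better than never".split()
trigrams = ngrams(words, 3)
for gram in trigrams:
    print(gram)
```

A list of k words yields k - n + 1 n-grams, so the five words here give three trigrams, and a request for n larger than k gives an empty list.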

### What language is it?

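- detect_language() guesses the language of the text
- translate(to='en') translates the text into the target language

Both methods call the Google Translate service, so they need an internet connection. They were deprecated in TextBlob 0.16.0 and have since been removed, so the cells below only run on older releases.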


In [21]:
myword = TextBlob("Hello")
myword.detect_language()

'en'

In [20]:
myword2 = TextBlob("Привет")
myword2.detect_language()

'ru'

In [22]:
## Translation to French
myword.translate(to='fr')

TextBlob("Bonjour")

In [23]:
## Translation to Espanol/Spanish
myword.translate(to='es')

TextBlob("Hola")

In [28]:
newtext = "The Quick brown fox jumped over the lazy dogs."
eng = TextBlob(newtext)
eng.translate(to='fr')


TextBlob("Le renard brun rapide sauta par-dessus les chiens paresseux.")

In [31]:
newfrench_text = "Le renard brun rapide sauta par-dessus les chiens paresseux."

In [32]:
fr = TextBlob(newfrench_text)
fr.translate(to='en')

TextBlob("The quick brown fox jumped over the lazy dogs.")

### Get Start and End Indices of Sentences

In [None]:
for s in sf.sentences:
    print(s)
    print("---- Starts at index {}, Ends at index {}".format(s.start, s.end))

San Francisco, officially the City and County of San Francisco and colloquially known as SF, San Fran, Frisco, or The City,is the cultural, commercial, and financial center of Northern California.
---- Starts at index 0, Ends at index 196
San Francisco is the 16th most populous city in the United States, and the fourth most populous in California, with 881,549 residents as of 2019.It covers an area of about 46.89 square miles (121.4 km2),mostly at the north end of the San Francisco Peninsula in the San Francisco Bay Area, making it the second most densely populated large U.S. city, and the fifth most densely populated U.S.
---- Starts at index 197, Ends at index 588
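
The output above suggests that the end index points one past the sentence's last character (the first sentence ends at 196 and the second starts at 197, after the space). Under that assumption, slicing the original text with the two indices recovers the sentence. The sketch below computes such offsets in plain Python for two hand-split sentences; `sentence_spans` is a helper written for this example.

```python
text = "Simple is better than complex. Flat is better than nested."
sentences = ["Simple is better than complex.", "Flat is better than nested."]


def sentence_spans(text, sentences):
    """Return (start, end) offsets of each sentence, with end exclusive."""
    spans = []
    cursor = 0
    for sent in sentences:
        start = text.index(sent, cursor)  # search from the previous end
        end = start + len(sent)
        spans.append((start, end))
        cursor = end
    return spans


spans = sentence_spans(text, sentences)
for (start, end), sent in zip(spans, sentences):
    print(start, end, text[start:end] == sent)
```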


### Building a Text Classification System

In [None]:
train = [
('I love this sandwich.', 'pos'),
('this is an amazing place!', 'pos'),
('I feel very good about these beers.', 'pos'),
('this is my best work.', 'pos'),
("what an awesome view", 'pos'),
('I do not like this restaurant', 'neg'),
('I am tired of this stuff.', 'neg'),
("I can't deal with this", 'neg'),
('he is my sworn enemy!', 'neg'),
]

In [None]:
test = [
('the beer was good.', 'pos'),
('I do not enjoy my job', 'neg'),
("I ain't feeling dandy today.", 'neg'),
("I feel amazing!", 'pos'),
('Gary is a friend of mine.', 'pos'),
("I can't believe I'm doing this.", 'neg')
]

In [None]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

You can also load data from common file formats including CSV, JSON, and TSV.

In [None]:
# Requires a train.json file in the working directory:
# with open('train.json', 'r') as fp:
#     cl = NaiveBayesClassifier(fp, format="json")
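
For reference, here is a sketch of how such a JSON training file could be produced with the standard library. It assumes the layout described in the TextBlob documentation, a list of objects with `text` and `label` keys; the temporary file path and the two training examples are chosen only for illustration.

```python
import json
import os
import tempfile

# Two examples taken from the training set above.
train = [
    ("I love this sandwich.", "pos"),
    ("I do not like this restaurant", "neg"),
]

# Assumed layout: one object per example with "text" and "label" keys.
records = [{"text": text, "label": label} for text, label in train]

# Hypothetical location for the file.
path = os.path.join(tempfile.gettempdir(), "train.json")
with open(path, "w", encoding="utf-8") as fp:
    json.dump(records, fp, indent=2)

# Read it back to confirm the structure survived the round trip.
with open(path, encoding="utf-8") as fp:
    loaded = json.load(fp)
print(loaded[0])
```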

In [None]:
cl.classify("This is an amazing library!")

'pos'

In [None]:
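# prob_dist was never defined in this notebook; assuming it comes from
# prob_classify() on the sentence classified above.
prob_dist = cl.prob_classify("This is an amazing library!")
round(prob_dist.prob("pos"), 2)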

0.96

Another way to classify text is to pass a classifier into the constructor of TextBlob and call its classify()
method.

In [None]:
from textblob import TextBlob
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier=cl)
blob.classify()

'pos'

The advantage of this approach is that you can classify sentences within a TextBlob.

In [None]:
for s in blob.sentences:
    print(s)
    print(s.classify())

The beer is good.
pos
But the hangover is horrible.
pos


### Evaluating Classifiers

In [None]:
cl.accuracy(test)

1.0
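
Accuracy here is the fraction of test examples whose predicted label matches the expected label. The sketch below spells out that calculation with a deliberately crude keyword rule standing in for a trained classifier. Both `accuracy` and `keyword_classifier` are written for this example, and the four-item test set is made up so that one prediction is wrong.

```python
def accuracy(classify, labeled_examples):
    """Fraction of (text, label) pairs for which classify(text) == label."""
    correct = sum(1 for text, label in labeled_examples if classify(text) == label)
    return correct / len(labeled_examples)


def keyword_classifier(text):
    """Toy stand-in for a classifier: negation words mean 'neg'."""
    negative_cues = ("not", "n't")
    return "neg" if any(cue in text.lower() for cue in negative_cues) else "pos"


# Hypothetical test set; the last item has no negation word, so the toy rule misses it.
examples = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I feel amazing!", "pos"),
    ("he is my sworn enemy!", "neg"),
]

score = accuracy(keyword_classifier, examples)
print(score)  # 0.75
```

With only a handful of test sentences, a single misclassification moves the score a long way, so a result on a test set this small should be read with caution.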