## TextBlob: Simplified Text Processing

TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

### Features:

- Noun phrase extraction
- Part-of-speech tagging
- Sentiment Analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (Splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration

## Installation

In [1]:
!pip install textblob

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Create a TextBlob

First, import the TextBlob class.

In [2]:
from textblob import TextBlob

Let's create our first TextBlob

In [3]:
wiki = TextBlob("I Love Natural Language Processing, not you!")

### Part-of-speech (POS) Tagging
Part-of-speech tags can be accessed through the tags property. The tagger depends on NLTK data, so the required packages are downloaded first.

In [6]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers/averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping grammars/basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to /root/nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   U

True

In [9]:
wiki.tags

[('I', 'PRP'),
 ('Love', 'VBP'),
 ('Natural', 'JJ'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('not', 'RB'),
 ('you', 'PRP')]
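Because tags is a plain list of (word, tag) pairs, it can be filtered with ordinary Python. The sketch below copies the output above into a list instead of calling TextBlob, and assumes the Penn Treebank convention that noun tags begin with NN.

```python
# Output of wiki.tags copied from the cell above (no TextBlob call is made here).
tags = [("I", "PRP"), ("Love", "VBP"), ("Natural", "JJ"),
        ("Language", "NNP"), ("Processing", "NNP"),
        ("not", "RB"), ("you", "PRP")]

# Keep only the words whose tag marks them as a noun (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith("NN")]
print(nouns)
```

The same pattern works for any other tag family, for example tag.startswith("VB") for verbs.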

### Noun Phrase Extraction

Similarly, noun phrases are accessed through the noun_phrases property.

In [10]:
wiki.noun_phrases

WordList(['love', 'language processing'])

### Sentiment Analysis

The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where negative values indicate negative sentiment and positive values indicate positive sentiment. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

In [11]:
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

In [12]:
testimonial.sentiment.subjectivity

0.4357142857142857
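As a rough illustration of how such a named tuple can be consumed, the sketch below builds a stand-in with collections.namedtuple and turns the polarity into a label. The Sentiment stand-in and the zero threshold are assumptions made for this example, not part of TextBlob.

```python
from collections import namedtuple

# Hypothetical stand-in that mirrors the shape of the result shown above.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])
result = Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

# Fields can be read by name or unpacked like a regular tuple.
polarity, subjectivity = result

# Assumed rule: the sign of the polarity decides the label.
if result.polarity > 0:
    label = "positive"
elif result.polarity < 0:
    label = "negative"
else:
    label = "neutral"

print(label, round(subjectivity, 2))
```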

## Tokenization

In [13]:
zen = TextBlob("Data is a new fuel. "
               "Explicit is better than implicit. "
               "Simple is better than complex. ")

zen.words

WordList(['Data', 'is', 'a', 'new', 'fuel', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

In [14]:
zen.sentences

[Sentence("Data is a new fuel."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

Sentence objects have the same properties and methods as TextBlobs.

In [16]:
for sentence in zen.sentences:
  print(sentence)

Data is a new fuel.
Explicit is better than implicit.
Simple is better than complex.


## Word inflection and Lemmatization

Each word in TextBlob.words or Sentence.words is a Word object (a subclass of str, or unicode in Python 2) with useful methods, e.g. for word inflection.

In [18]:
sentence = TextBlob("Use 4 spaces per indentation level")

sentence.words

WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [19]:
sentence.words[2].singularize()

'space'

In [20]:
sentence.words[0].pluralize()

'Uses'

Words can be lemmatized by calling the lemmatize method.

In [24]:
from textblob import Word

q = Word('lions')
q.lemmatize()

'lion'

In [25]:
q = Word("went")
q.lemmatize('v')  # Pass in WordNet part of speech

'go'

## WordNet Integration

You can access the synsets for a Word via the synsets property or the get_synsets method, optionally passing in a part of speech.

### WordNet

WordNet is a lexical database of the English language. It works like a dictionary designed specifically for natural language processing.

### Synset:
Synset is a simple interface in NLTK for looking up words in WordNet. Synset instances are groupings of synonymous words that express the same concept. Some words have only one synset and some have several.

You can access the definitions for each synset via the definitions property or the define() method, which can also take an optional part-of-speech (POS) argument.

In [26]:
Word("length").definitions

['the linear extent in space from one end to the other; the longest dimension of something that is fixed in place',
 'continuance in time',
 'the property of being the extent of something from beginning to end',
 'size of the gap between two places',
 'a section of something that is long and narrow']

You can create synsets directly.

In [27]:
from textblob.wordnet import Synset
octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)

0.1111111111111111
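The score above can be read as a function of distance in the WordNet hierarchy. As far as I understand NLTK's WordNet interface, path similarity is 1 / (d + 1), where d is the number of edges on the shortest path between the two synsets. The helper below is a hypothetical illustration of that formula, not a TextBlob or NLTK function.

```python
def path_similarity_from_distance(shortest_path_length):
    """Return 1 / (d + 1) for a shortest path of d edges between two synsets."""
    if shortest_path_length < 0:
        raise ValueError("path length must be non-negative")
    return 1.0 / (shortest_path_length + 1)


# Identical synsets are 0 edges apart, so their similarity is 1.0.
print(path_similarity_from_distance(0))

# Under this formula, a score of 1/9 corresponds to synsets 8 edges apart.
print(path_similarity_from_distance(8))
```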

### WordLists

A WordList is just a Python list with additional methods.

The words property returns a WordList containing the words of the text, ignoring the spaces between them.

In [28]:
animals = TextBlob("cow sheep octopus")
animals.words

WordList(['cow', 'sheep', 'octopus'])

In [29]:
animals.words.pluralize()  # It'll pluralize the words

WordList(['kine', 'sheep', 'octopodes'])

### Spelling Correction
You can use the correct() method to attempt spelling correction.


In [30]:
g = TextBlob("Can you pronounce czechuslovakia?")
print(g.correct())

An you pronounce czechoslovakia?


Word objects have a spellcheck() method, Word.spellcheck(), that returns a list of (word, confidence) tuples with spelling suggestions.

In [31]:
from textblob import Word
k = Word("longituode")
k.spellcheck()

[('longitude', 1.0)]

This spelling correction is based on Peter Norvig's "How to Write a Spelling Corrector", as implemented in the pattern library. It is about 70% accurate.
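To give a feel for the idea behind Norvig's approach, here is a minimal sketch: generate every string one edit away from the input and keep the known word with the highest count. The tiny WORD_COUNTS table is a made-up assumption for this example; a real corrector counts words in a large corpus, and this is not the code that TextBlob or pattern actually runs.

```python
from collections import Counter

# Hypothetical, tiny frequency table used only for this illustration.
WORD_COUNTS = Counter({"longitude": 5, "latitude": 4, "long": 9, "attitude": 3})
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def known(words):
    """Subset of words that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Prefer the word itself, then known words one edit away, else give up."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORD_COUNTS[w])


print(correct("longituode"))
```

Deleting the extra "o" from "longituode" yields "longitude", which is the only known candidate in this toy table.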

### Get Word and Noun Phrase Frequencies

There are two ways to get the frequencies of a word or noun phrase in the TextBlob.

The first one is through the word_counts dictionary.

In [32]:
sent = TextBlob('She sales sea shells at the sea shore.')

sent.word_counts['sea']

2

If you access the frequencies this way, the search will not be case sensitive, and words that are not found will have a frequency of 0.

The second way is to use the count() method.

In [33]:
sent.words.count('sea')

2

You can specify whether or not the search should be case-sensitive (default is False).

In [35]:
sent.words.count('Sea', case_sensitive=True)

0

In the above example we searched for 'Sea'. The word appears in the sentence only in lowercase, so the case-sensitive search returns 0.
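The difference between the two search modes can be sketched with the standard library alone. The regex tokenizer and the count() helper below are assumptions made for illustration; they stand in for TextBlob's own tokenization and are not its implementation.

```python
import re
from collections import Counter

text = "She sales sea shells at the sea shore."

# Assumption: a simple regex tokenizer stands in for TextBlob's tokenizer.
tokens = re.findall(r"[A-Za-z']+", text)

# Lowercasing before counting is what makes lookups case-insensitive.
word_counts = Counter(token.lower() for token in tokens)


def count(word, case_sensitive=False):
    """Count occurrences of word, optionally matching case exactly."""
    if case_sensitive:
        return tokens.count(word)
    return word_counts[word.lower()]


print(count("sea"), count("Sea"), count("Sea", case_sensitive=True))
```

A Counter returns 0 for missing keys, which matches the behavior described above for words that are not found.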

Each of these methods can also be used with noun phrases.

In [37]:
sent.noun_phrases.count('sea')

0

## Translation and Language Detection

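TextBlobs can be translated between languages. This feature relies on the Google Translate API; translate() and detect_language() were deprecated in TextBlob 0.16.0 and removed in later releases, so the cells below need an older version and an internet connection.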

In [45]:
blob = TextBlob(u'Something is better than nothing.')
blob.translate(from_lang='en', to='hi')

TextBlob("कुछ नहीं से कुछ भला।")

In [44]:
blob = TextBlob("hello")
blob.translate(from_lang='en', to='hi')

TextBlob("नमस्ते")

If no source language is specified, TextBlob will attempt to detect the language. You can specify the source language explicitly, like so. The method raises TranslatorError if the TextBlob cannot be translated into the requested language, or NotTranslated if the translated result is the same as the input string.

In [46]:
chinese_blob = TextBlob(u"有总比没有好")
chinese_blob.translate(from_lang="zh-CN", to='en')

TextBlob("There is always better than not")

You can also attempt to detect a TextBlob's language using TextBlob.detect_language().

#### Run this cell only when you have a good internet connection

d = TextBlob("कुछ नहीं से कुछ भला")

d.detect_language()

## Parsing 

Use the parse() method to parse the text.

In [54]:
b = TextBlob("And now for something completely different.")
print(b.parse())

And/CC/O/O now/RB/B-ADVP/O for/IN/B-PP/B-PNP something/NN/B-NP/I-PNP completely/RB/B-ADJP/O different/JJ/I-ADJP/O ././O/O


### TextBlobs Are Like Python Strings!

You can use Python's substring syntax.

In [55]:
zen[0:15]

TextBlob("Data is a new f")

You can also use common string methods.

In [56]:
zen.upper()

TextBlob("DATA IS A NEW FUEL. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX. ")

In [57]:
zen.find('than')  # The first occurrence of 'than' starts at index 39.

39

You can make comparisons between TextBlobs and strings.

In [58]:
a_blob = TextBlob('apple')
s_blob = TextBlob('samsung')

a_blob < s_blob

True

In [59]:
a_blob =='apple'

True

You can concatenate and interpolate TextBlobs and strings.

In [60]:
a_blob + ' and ' + s_blob

TextBlob("apple and samsung")

In [61]:
"{0} and {1}".format(a_blob,s_blob)

'apple and samsung'

### N-grams

The TextBlob.ngrams() method returns a list of WordLists, each holding n successive words.

In [62]:
blob = TextBlob("Now is better than never.")
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]

In [63]:
blob = TextBlob("Now is better than never.")
blob.ngrams(n=2)

[WordList(['Now', 'is']),
 WordList(['is', 'better']),
 WordList(['better', 'than']),
 WordList(['than', 'never'])]
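The outputs above follow a sliding-window pattern, which can be sketched in a few lines of plain Python. This ngrams() function is a hypothetical re-implementation for illustration and works on an already tokenized list of words.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words, in order."""
    if n <= 0:
        return []
    # A list of length L has L - n + 1 windows of size n (none if n > L).
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = ["Now", "is", "better", "than", "never"]
for gram in ngrams(words, n=3):
    print(gram)
```

Five words give three 3-grams and four 2-grams, matching the counts in the cells above.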

### Get Start and End Indices of Sentences

Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob.

In [65]:
for k in zen.sentences:
  print(k)
  print("---- Starts at index {}, Ends at index {}".format(k.start, k.end))

Data is a new fuel.
---- Starts at index 0, Ends at index 19
Explicit is better than implicit.
---- Starts at index 20, Ends at index 53
Simple is better than complex.
---- Starts at index 54, Ends at index 84
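These indices behave like ordinary slice bounds on the underlying string. The sketch below recomputes them with str.index, assuming the sentences are already split; sentence_spans() is a hypothetical helper, not a TextBlob function.

```python
text = ("Data is a new fuel. "
        "Explicit is better than implicit. "
        "Simple is better than complex. ")
sentences = ["Data is a new fuel.",
             "Explicit is better than implicit.",
             "Simple is better than complex."]


def sentence_spans(text, sentences):
    """Return (start, end) offsets of each sentence, searching left to right."""
    spans = []
    cursor = 0
    for sentence in sentences:
        start = text.index(sentence, cursor)  # search from the previous end
        end = start + len(sentence)           # end is exclusive, like a slice
        spans.append((start, end))
        cursor = end
    return spans


for start, end in sentence_spans(text, sentences):
    print(start, end, repr(text[start:end]))
```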


## Let's Start Building the Text Classification System

The textblob.classifiers module makes it simple to create custom classifiers.

As an example, let's create a custom sentiment analyzer.

### Loading data and Creating a Classifier
First we'll create some training and test data.

In [66]:
train = [
    ("I love this sandwich.",'pos'),
    ('this is an amazing place!','pos'),
    ('I feel very good about these beers.','pop'),
    ('this is my best work.','pos'),
    ('what an awesome view','pos'),
    ('I do not like this restaurant','neg'),
    ('I am tried of this stuff.','neg'),
    ("I can't deal with this",'neg'),
    ('he is my sworn enemy!','neg'),
    ('my boss is horrible.','neg')
]

test = [
    ('the beer was good.','pos'),
    ('I do not enjoy my job','neg'),
    ("I ain't feeling dandy today.",'neg'),
    ("I feel amazing!",'pos'),
    ('Gary is a friend of mine.','pos'),
    ("I can't believe I'm doing this.",'neg')
]

Now we'll create a Naive Bayes classifier, passing the training data into the constructor.

In [67]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

### Classifying Text

Call the classify(text) method to use the classifier

In [68]:
cl.classify("This is an amazing library!")

'pos'

You can get the label probability distribution with the prob_classify(text) method.

In [69]:
prob_dist = cl.prob_classify("I am suffering from cough and cold.")
prob_dist.max()

'neg'

In [71]:
round(prob_dist.prob("neg"), 2)

0.91

In [73]:
round(prob_dist.prob("pos"), 2)

0.09
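To show where such a probability distribution can come from, here is a minimal Naive Bayes sketch over binary word-presence features. It is written from scratch with add-one smoothing as a simplifying assumption; the classifier used above is backed by NLTK and smooths differently, so the numbers will not match. The function names and the two-example dataset are made up for this illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Count, per label, how many documents contain each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in set(text.lower().split()):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def prob_classify(model, text):
    """Return a dict mapping each label to its posterior probability."""
    label_counts, word_counts, vocabulary = model
    words = set(text.lower().split())
    total = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total)  # prior
        for word in vocabulary:
            # Add-one smoothed chance that a document of this label contains the word.
            p_present = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p_present if word in words else 1 - p_present)
        log_scores[label] = score
    # Normalize in a numerically stable way so the probabilities sum to 1.
    peak = max(log_scores.values())
    unnormalized = {label: math.exp(s - peak) for label, s in log_scores.items()}
    norm = sum(unnormalized.values())
    return {label: value / norm for label, value in unnormalized.items()}


model = train_naive_bayes([("great fun", "pos"), ("awful mess", "neg")])
probs = prob_classify(model, "great")
print(max(probs, key=probs.get), round(probs["pos"], 2))
```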

### Classifying TextBlobs

Another way to classify text is to pass a classifier into the constructor of TextBlob and call its classify() method.

In [75]:
from textblob import TextBlob
blob = TextBlob("Alcohol is good. But the hangover is horrible.", classifier=cl)
blob.classify()

'neg'

In [76]:
for b in blob.sentences:
  print(b)
  print(b.classify())

Alcohol is good.
pos
But the hangover is horrible.
neg


## Evaluating Classifiers

In [78]:
cl.accuracy(test)

0.8333333333333334
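Accuracy here means the fraction of test examples whose predicted label matches the true label. The sketch below computes it by hand for a made-up keyword rule; both accuracy() and keyword_classifier() are hypothetical helpers for illustration and say nothing about how the Naive Bayes model reached its own score.

```python
def accuracy(classify, labeled_examples):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(1 for text, label in labeled_examples if classify(text) == label)
    return correct / len(labeled_examples)


def keyword_classifier(text):
    # Hypothetical rule, for illustration only: a text is negative
    # when it contains "not" or "can't".
    return "neg" if ("not" in text or "can't" in text) else "pos"


test = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
    ("I feel amazing!", "pos"),
    ("Gary is a friend of mine.", "pos"),
    ("I can't believe I'm doing this.", "neg"),
]

score = accuracy(keyword_classifier, test)
print(round(score, 2))
```

The toy rule misses only the "ain't" sentence, so it gets 5 of the 6 examples right.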

Use the show_informative_features() method to display a listing of the most informative features.

In [79]:
cl.show_informative_features(5)

Most Informative Features
             contains(I) = True              pop : pos    =      2.5 : 1.0
          contains(this) = False             pop : pos    =      2.5 : 1.0
            contains(an) = False             neg : pos    =      1.8 : 1.0
             contains(I) = False             pos : neg    =      1.7 : 1.0
            contains(is) = False             pop : pos    =      1.5 : 1.0


### Updating Classifiers with New Data

Use the update(new_data) method to update a classifier with new training data.

In [80]:
new_data = [("She is my best friend.",'pos'),
            ("I'm happy to have a new friend.",'pos'),
            ("Stay thirsty, my friend.",'pop'),
            ("He ain't from around here.",'neg')]

cl.update(new_data)

True

In [81]:
cl.accuracy(test)

1.0

## Feature Extraction

By default, the NaiveBayesClassifier uses a simple feature extractor that indicates which words in the training set are contained in a document.

For example, the sentence "I love" might have the features contains(love): True or contains(hate): False.
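A word-presence extractor of this kind can be sketched as follows. The contains_extractor() function and the two-word vocabulary are assumptions for illustration; this is a simplified stand-in, not TextBlob's built-in extractor.

```python
def contains_extractor(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    tokens = set(document.lower().split())
    return {"contains({0})".format(word): (word in tokens) for word in vocabulary}


# Hypothetical vocabulary; in practice it would be built from the training set.
vocabulary = ["love", "hate"]
features = contains_extractor("I love", vocabulary)
print(features)
```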

You can override this feature extractor by writing your own. A feature extractor is simply a function with document (the text to extract features from) as the first argument. The function may include a second argument, train_set (the training dataset), if necessary.

The function should return a dictionary of features for document.

In [82]:
def end_word_extractor(document):
  tokens = document.split()
  first_word, last_word = tokens[0], tokens[-1]
  feats = {}
  feats["first({0})".format(first_word)] = True
  feats["last({0})".format(last_word)] = False
  return feats

In [83]:
features = end_word_extractor("I love")

In [84]:
assert features == {'last(love)': False, 'first(I)': True}

We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.

In [85]:
cl2 = NaiveBayesClassifier(test, feature_extractor=end_word_extractor)

In [86]:
blob = TextBlob("I'm excited to try my new classifier.", classifier=cl2)
blob.classify()

'pos'