## TextBlob: Simplified Text Processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

TextBlob stands on the giant shoulders of NLTK and pattern and plays nicely with both.

## Features
- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- N-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration

## Create a TextBlob

In [1]:
pip install textblob



In [2]:
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.


In [3]:
from textblob import TextBlob

In [4]:
wiki = TextBlob("I love Natural Language Processing, not you!")

## Part-of-Speech Tagging
Part-of-speech tags can be accessed through the tags property, which returns a list of (word, tag) tuples:

In [5]:
wiki.tags

[('I', 'PRP'),
 ('love', 'VBP'),
 ('Natural', 'JJ'),
 ('Language', 'NNP'),
 ('Processing', 'NNP'),
 ('not', 'RB'),
 ('you', 'PRP')]

## Noun Phrase Extraction
Similarly, noun phrases are accessed through the noun_phrases property:

In [6]:
wiki.noun_phrases

WordList(['language processing'])

## Sentiment Analysis
The sentiment property returns a named tuple of the form Sentiment(polarity, subjectivity). The polarity score is a float within the range [-1.0, 1.0], where -1.0 is most negative and 1.0 is most positive. The subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

In [7]:
testimonial = TextBlob("Textblob is amazingly simple to use. What great fun!")
testimonial.sentiment

Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)

In [8]:
testimonial.sentiment.subjectivity

0.4357142857142857
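
One way to work with these two scores is to map them to coarse labels. The sketch below uses a plain namedtuple shaped like the Sentiment tuple above, so it runs without TextBlob. The helper name describe_sentiment and the 0.1 and 0.5 cut-offs are assumptions chosen for illustration and are not part of the library.

```python
from collections import namedtuple

# Stand-in with the same two fields as the tuple returned by the sentiment property.
Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])


def describe_sentiment(sentiment, polarity_cutoff=0.1, subjectivity_cutoff=0.5):
    """Map (polarity, subjectivity) to two coarse labels (hypothetical helper)."""
    if sentiment.polarity > polarity_cutoff:
        tone = "positive"
    elif sentiment.polarity < -polarity_cutoff:
        tone = "negative"
    else:
        tone = "neutral"
    if sentiment.subjectivity >= subjectivity_cutoff:
        stance = "subjective"
    else:
        stance = "objective"
    return tone, stance


# Scores copied from the testimonial example above.
scores = Sentiment(polarity=0.39166666666666666, subjectivity=0.4357142857142857)
print(describe_sentiment(scores))
```

With a real TextBlob, the same helper could be called as describe_sentiment(testimonial.sentiment), since the property exposes the same two fields.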

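## Tokenisation
You can break a TextBlob into words or sentences. The words property returns the words with punctuation removed.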

In [9]:
zen = TextBlob("Beautiful is better than ugly. "
               "Explicit is better than implicit. "
               "Simple is better than complex.")
zen.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

## Get Sentences
Sentence objects have the same properties and methods as TextBlobs.

In [10]:
zen.sentences

[Sentence("Beautiful is better than ugly."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

In [11]:
for sentence in zen.sentences:
    print(sentence)

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
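
To make the two kinds of tokenisation concrete, here is a rough approximation using only regular expressions. It is a sketch, not TextBlob's method: the library relies on NLTK tokenizers, which handle abbreviations, contractions, and other cases that these two patterns do not.

```python
import re

text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")

# Words: runs of letters, digits, or apostrophes, so punctuation is dropped.
words = re.findall(r"[A-Za-z0-9']+", text)

# Sentences: split on the whitespace that follows '.', '!' or '?'.
sentences = re.split(r"(?<=[.!?])\s+", text)

print(words[:5])
for sentence in sentences:
    print(sentence)
```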


## Word Inflection and Lemmatization
Each word in TextBlob.words or Sentence.words is a Word object (a subclass of unicode, which is str in Python 3) with useful methods, e.g. for word inflection.

In [12]:
sentence = TextBlob('Use 4 spaces per indentation level.')
sentence.words

WordList(['Use', '4', 'spaces', 'per', 'indentation', 'level'])

In [13]:
sentence.words[2].singularize()

'space'

In [14]:
sentence.words[0].pluralize()

'Uses'

In [15]:
from textblob import Word
w = Word("lions")
w.lemmatize()

'lion'

In [16]:
q = Word("went")
q.lemmatize("v") # Pass in WordNet part of speech

'go'

## WordNet Integration
You can access the synsets for a Word via the synsets property or the get_synsets method, optionally passing in a part of speech.

## WordNet
WordNet is a lexical database of English that groups words by meaning. It is widely used in natural language processing.

## Synset
A synset is a group of synonymous words that express the same concept. NLTK provides a simple interface for looking up synsets in WordNet, and TextBlob exposes it through textblob.wordnet. Some words belong to only one synset and others to several.


In [17]:
Word("length").definitions

['the linear extent in space from one end to the other; the longest dimension of something that is fixed in place',
 'continuance in time',
 'the property of being the extent of something from beginning to end',
 'size of the gap between two places',
 'a section of something that is long and narrow']

In [18]:
from textblob.wordnet import Synset
octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)

0.1111111111111111
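
The score above can be read through the formula NLTK uses for path_similarity: 1 / (d + 1), where d is the length of the shortest path between the two synsets in the hypernym hierarchy. The sketch below only reproduces that arithmetic. The distance of 8 is inferred from the score shown above, not looked up in WordNet, and the function name is hypothetical.

```python
def path_similarity_from_distance(distance):
    """Path similarity for a given shortest-path length: 1 / (distance + 1)."""
    return 1.0 / (distance + 1)


# A distance of 8 edges is an inference from the 0.111... score above.
print(path_similarity_from_distance(8))

# Identical synsets are 0 edges apart, which gives the maximum score of 1.0.
print(path_similarity_from_distance(0))
```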

## WordLists
A WordList is a Python list with additional methods.

The words property returns a WordList holding the words of the text, without the whitespace between them.

In [19]:
animals = TextBlob("cow sheep octopus")
animals.words

WordList(['cow', 'sheep', 'octopus'])

In [20]:
animals.words.pluralize()  # Pluralize every word in the list

WordList(['kine', 'sheep', 'octopodes'])

## Spelling Correction
Use the correct() method to attempt spelling correction.

In [22]:
g = TextBlob(" Can you pronounce czechuslovakia?")
print(g.correct())

 An you pronounce czechoslovakia?


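Word objects have a spellcheck() method that returns a list of (word, confidence) tuples with spelling suggestions.

In [23]: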
from textblob import Word
w = Word('longitude')
w.spellcheck()

[('longitude', 1.0)]
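
TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". A minimal sketch of the candidate-generation step from that approach is shown below. It is not the library's code: the function name edits1 follows Norvig's essay, and the ranking of candidates by word frequency is left out.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """Return every string that is one edit away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


# The misspelling from the example above is one substitution from the target.
candidates = edits1("czechuslovakia")
print("czechoslovakia" in candidates)
```

A full corrector would keep only the candidates that appear in a known vocabulary and return the most frequent one.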

## Get Word and Noun Phrase Frequencies
There are two ways to get the frequency of a word or noun phrase in the TextBlob.

The first one is through the word_counts dictionary.

In [24]:
sent = TextBlob('She sells sea shells at the sea shore')
sent.word_counts['sea']

2

In [25]:
sent.word_counts['shore']

1

The second way is to use the count() method. Pass case_sensitive=True to make the match case-sensitive (the default is False).

In [26]:
sent.words.count('sea', case_sensitive=True)

2

In [27]:
sent.words.count('Sea', case_sensitive=True)

0
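
The difference between the two lookups can be mimicked with the standard library. This is a sketch under the assumption that word_counts lowercases words before counting, which matches the outputs above. It uses a plain list and collections.Counter rather than TextBlob objects.

```python
from collections import Counter

words = ["She", "sells", "sea", "shells", "at", "the", "sea", "shore"]

# Case-insensitive counts, similar in spirit to word_counts.
lower_counts = Counter(word.lower() for word in words)
print(lower_counts["sea"])
print(lower_counts["she"])

# Case-sensitive count, similar to words.count(..., case_sensitive=True).
print(words.count("Sea"))
```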

## Translation and Language Detection
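TextBlobs can be translated between languages. Translation and language detection call Google Translate, so they need network access. Both were deprecated in TextBlob 0.16.0 and removed in 0.18.0, so the cells below only run on older releases.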

In [28]:
blob = TextBlob("hello")
blob.translate(from_lang='en', to='fr')

TextBlob("Bonjour")

In [29]:
chinese_blob = TextBlob(u"有总比没有好")
chinese_blob.translate(from_lang="zh-CN", to='en')

TextBlob("There is always better than not")

In [30]:
d = TextBlob("Bonjour")
d.detect_language()  # detect_language is a method, so it must be called

## TextBlobs are like Python Strings!
You can use Python's substring syntax, as well as common string methods and comparisons.

In [31]:
zen[0:15]

TextBlob("Beautiful is be")

In [32]:
zen.upper()

TextBlob("BEAUTIFUL IS BETTER THAN UGLY. EXPLICIT IS BETTER THAN IMPLICIT. SIMPLE IS BETTER THAN COMPLEX.")

In [33]:
zen.find("than")

20

In [34]:
a_blob = TextBlob('apple')
s_blob = TextBlob('samsung')
a_blob < s_blob

True

In [35]:
a_blob == 'apple'

True

You can concatenate and interpolate TextBlobs and strings.

In [36]:
a_blob + ' and ' + s_blob

TextBlob("apple and samsung")

In [37]:
"{0} and {1}".format(a_blob, s_blob)

'apple and samsung'

## n-grams
The TextBlob.ngrams() method returns a list of WordLists, each holding n successive words.

In [38]:
blob = TextBlob("Now is better than never.")
blob.ngrams(n=3)

[WordList(['Now', 'is', 'better']),
 WordList(['is', 'better', 'than']),
 WordList(['better', 'than', 'never'])]
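
The sliding window behind n-grams is easy to reproduce on a plain list of words. The function below is an illustrative sketch, not TextBlob's implementation, and it returns ordinary lists instead of WordLists.

```python
def ngrams(words, n=3):
    """Return every run of n consecutive words as a list."""
    if n <= 0:
        return []
    # A text of k words has k - n + 1 windows of size n (none if k < n).
    return [words[i:i + n] for i in range(len(words) - n + 1)]


words = "Now is better than never".split()
for gram in ngrams(words, n=3):
    print(gram)
```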

## Get Start and End Indices of Sentences
Use sentence.start and sentence.end to get the indices where a sentence starts and ends within a TextBlob. The end index is exclusive, so slicing the text from start to end gives back the sentence.

In [39]:
for k in zen.sentences:
    print(k)
    print("---- Starts at index {}, Ends at index {}".format(k.start, k.end))

Beautiful is better than ugly.
---- Starts at index 0, Ends at index 30
Explicit is better than implicit.
---- Starts at index 31, Ends at index 64
Simple is better than complex.
---- Starts at index 65, Ends at index 95
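
The indices above behave like ordinary Python slice bounds. The sketch below recomputes them with str.find on a plain string, under the assumption that the sentences are already known. It does not use TextBlob.

```python
text = ("Beautiful is better than ugly. "
        "Explicit is better than implicit. "
        "Simple is better than complex.")
sentences = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
]

spans = []
cursor = 0
for sentence in sentences:
    # Search from the end of the previous sentence to keep matches in order.
    start = text.find(sentence, cursor)
    end = start + len(sentence)
    spans.append((start, end))
    cursor = end

print(spans)
# Slicing with a span returns the sentence itself.
print(text[spans[1][0]:spans[1][1]])
```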


## Let's start building the Text Classification system
The textblob.classifiers module makes it simple to create custom classifiers.

As an example, let's create a custom sentiment analyzer.

## Loading Data and Creating a Classifier
First, we’ll create some training and test data.

In [40]:
train = [
    ('I love this sandwich.', 'pos'),
    ('this is an amazing place!', 'pos'),
    ('I feel very good about these beers.', 'pos'),
    ('this is my best work.', 'pos'),
    ("what an awesome view", 'pos'),
    ('I do not like this restaurant', 'neg'),
    ('I am tired of this stuff.', 'neg'),
    ("I can't deal with this", 'neg'),
    ('he is my sworn enemy!', 'neg'),
]
test = [
    ('the beer was good.', 'pos'),
    ('I do not enjoy my job', 'neg'),
    ("I ain't feeling dandy today.", 'neg'),
    ("I feel amazing!", 'pos'),
    ('Gary is a friend of mine.', 'pos'),
    ("I can't believe I'm doing this.", 'neg'),
]

Now we'll create a Naive Bayes classifier, passing the training data into the constructor.

In [41]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(train)

## Classifying Text

Call the classify(text) method to use the classifier.

In [42]:
cl.classify("This is an amazing library!")

'pos'

You can get the label probability distribution with the prob_classify(text) method.

In [43]:
prob_dist = cl.prob_classify("This one's a doozy.")
prob_dist.max()

'pos'

In [44]:
prob_dist = cl.prob_classify("I am suffering from cold")
prob_dist.max()

'neg'

In [45]:
round(prob_dist.prob("pos"), 2)

0.31

In [46]:
round(prob_dist.prob("neg"), 2)

0.69
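
The probability distribution can be thought of as a mapping from label to probability. The plain-dictionary sketch below uses the rounded values from the output above to show what max() and prob() correspond to. It is an analogy, not the object TextBlob returns.

```python
# Rounded probabilities copied from the output above.
probabilities = {"pos": 0.31, "neg": 0.69}

# The most probable label, analogous to prob_dist.max().
best_label = max(probabilities, key=probabilities.get)
print(best_label)

# Probabilities over all labels sum to 1 (up to rounding).
print(abs(sum(probabilities.values()) - 1.0) < 1e-9)
```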

## Classifying TextBlobs
Another way to classify text is to pass a classifier into the constructor of TextBlob and call its classify() method.

In [47]:
from textblob import TextBlob
blob = TextBlob("Alcohol is good. But the hangover is horrible.", classifier=cl)
blob.classify()

'pos'

In [48]:
for s in blob.sentences:
    print(s)
    print(s.classify())

Alcohol is good.
pos
But the hangover is horrible.
pos


## Evaluating Classifiers
To compute the accuracy on our test set, use the accuracy(test_data) method.

In [49]:
cl.accuracy(test)

1.0
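
Accuracy here means the fraction of test examples whose predicted label matches the given label. The sketch below spells that out with a stand-in classifier. Both accuracy and toy_classify are hypothetical helpers written for this illustration and are unrelated to the trained cl object.

```python
def accuracy(classify, labeled_data):
    """Fraction of (text, label) pairs for which classify(text) == label."""
    if not labeled_data:
        raise ValueError("labeled_data must not be empty")
    correct = sum(1 for text, label in labeled_data if classify(text) == label)
    return correct / len(labeled_data)


def toy_classify(text):
    """Stand-in rule: 'neg' if the text contains the word "not", else 'pos'."""
    return "neg" if "not" in text.lower().split() else "pos"


sample = [
    ("the beer was good.", "pos"),
    ("I do not enjoy my job", "neg"),
    ("I ain't feeling dandy today.", "neg"),
]
# The toy rule gets the first two right and misses the third.
print(accuracy(toy_classify, sample))
```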

Use the show_informative_features() method to list the features that best separate the labels.

In [50]:
cl.show_informative_features(5)

Most Informative Features
             contains(I) = False             pos : neg    =      1.9 : 1.0
             contains(I) = True              neg : pos    =      1.7 : 1.0
            contains(an) = False             neg : pos    =      1.5 : 1.0
            contains(is) = True              pos : neg    =      1.4 : 1.0
          contains(this) = False             pos : neg    =      1.4 : 1.0


## Updating Classifiers with New Data
Use the update(new_data) method to update a classifier with new training data.

In [52]:
new_data = [('She is my best friend.', 'pos'),
            ("I'm happy to have a new friend.", 'pos'),
            ("Stay thirsty, my friend.", 'pos'),
            ("He ain't from around here.", 'neg')]

cl.update(new_data)

True

In [53]:
cl.accuracy(test)

1.0

## Feature Extractors
By default, the NaiveBayesClassifier uses a simple feature extractor that indicates which words in the training set are contained in a document.

For example, the sentence "I love" might have the features contains(love): True or contains(hate): False.
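
A contains-style extractor like the one just described can be sketched in a few lines. This is an approximation: the function name and the explicit vocabulary argument are assumptions for illustration, whereas TextBlob's default extractor derives its vocabulary from the training set and tokenizes more carefully.

```python
def contains_features(document, vocabulary):
    """Mark, for each vocabulary word, whether it occurs in the document."""
    tokens = set(document.split())
    return {"contains({0})".format(word): word in tokens for word in vocabulary}


print(contains_features("I love", ["love", "hate"]))
```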

You can override this feature extractor by writing your own. A feature extractor is simply a function with a document (the text to extract features from) as the first argument. The function may include a second argument, train_set (the training dataset), if necessary.

In [54]:
def end_word_extractor(document):
    # Split on whitespace and record the first token as the only feature.
    tokens = document.split()
    first_word = tokens[0]
    feats = {}
    feats["first({0})".format(first_word)] = True
    return feats

In [55]:
features = end_word_extractor("I love")

In [56]:
assert features == {'first(I)': True}

We can then use the feature extractor in a classifier by passing it as the second argument of the constructor.

In [57]:
cl2 = NaiveBayesClassifier(test, feature_extractor=end_word_extractor)

In [58]:
blob = TextBlob("I'm excited to try my new classifier.", classifier=cl2)
blob.classify()

'pos'
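
As a variation on end_word_extractor, an extractor can record both ends of the document. The sketch below is hypothetical (the name first_last_extractor is not part of TextBlob) and it guards against empty input, which the version above does not.

```python
def first_last_extractor(document):
    """Return features for the first and last whitespace-separated tokens."""
    tokens = document.split()
    if not tokens:
        # An empty document yields no features instead of raising IndexError.
        return {}
    return {
        "first({0})".format(tokens[0]): True,
        "last({0})".format(tokens[-1]): True,
    }


print(first_last_extractor("I love"))
```

It could be passed to a classifier in the same way, through the feature_extractor argument.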