
**TextBlob**

TextBlob is a library used in Natural Language Processing (NLP). Using TextBlob, we can perform NLP tasks such as tokenization, lemmatization, part-of-speech tagging, noun-phrase extraction, sentiment analysis and many more.

TextBlob is built on top of the NLTK library.

TextBlob stands on the giant shoulders of NLTK and pattern, and plays nicely with both.

**Features of TextBlob**

* Noun phrase extraction
* Part-of-speech tagging
* Sentiment analysis
* Classification (Naive Bayes, Decision Tree)
* Tokenization (splitting text into words and sentences)
* Word and phrase frequencies
* Parsing
* n-grams
* Word inflection (pluralization and singularization) and lemmatization
* Spelling correction
* Add new models or languages through extensions
* WordNet integration



**Install TextBlob**

In [None]:
!pip install textblob



**Installing/Upgrading from PyPI**

In [None]:
$ pip install -U textblob
$ python -m textblob.download_corpora

This will install TextBlob and download the necessary NLTK corpora. If you need to change the default download directory, set the NLTK_DATA environment variable.
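
As a rough sketch of the NLTK_DATA setting mentioned above, the variable can also be set from Python before NLTK is first imported. The temporary directory used here is only a stand-in (an assumption for illustration), not a path that TextBlob or NLTK requires.

```python
import os
import tempfile

# Assumption: a temporary directory stands in for whatever writable
# location you actually want to keep the corpora in.
corpora_dir = tempfile.mkdtemp(prefix="nltk_data_")

# Set the variable before NLTK is imported, because NLTK builds its
# data search path from NLTK_DATA when the package is loaded.
os.environ["NLTK_DATA"] = corpora_dir
print("NLTK_DATA =", os.environ["NLTK_DATA"])
```

In a real project you would point this at a persistent directory so the corpora are downloaded only once.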

**Downloading the minimum corpora**

If you only intend to use TextBlob’s default models (no model overrides), you can pass the lite argument. This downloads only those corpora needed for basic functionality.

In [None]:
$ python -m textblob.download_corpora lite

**Install with conda**

In [None]:
$ conda install -c conda-forge textblob
$ python -m textblob.download_corpora

**From Source**

You can clone the public repo:

https://github.com/sloria/TextBlob

In [None]:
$ git clone https://github.com/sloria/TextBlob.git

Once you have the source, you can install it into your site-packages with

In [None]:
$ python setup.py install

**Get the bleeding edge version**

To get the latest development version of TextBlob, run

In [None]:
$ pip install -U git+https://github.com/sloria/TextBlob.git@dev

**Migrating from older versions (<=0.7.1)**

As of TextBlob 0.8.0, TextBlob’s core package was renamed to textblob, whereas earlier versions used a package called text. Therefore, migrating to newer versions should be as simple as rewriting your imports, like so:

New:

In [None]:
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger

Old:

In [None]:
from text.blob import TextBlob, Word, Blobber
from text.classifiers import NaiveBayesClassifier
from text.taggers import NLTKTagger

**Python**

TextBlob supports Python >=2.7 or >=3.5.

**Dependencies**

TextBlob depends on NLTK 3. NLTK will be installed automatically when you run pip install textblob or python setup.py install.

Some features, such as the maximum entropy classifier, require numpy, but it is not required for basic usage.

**Create a TextBlob**

In [None]:
from textblob import TextBlob
text_blob = TextBlob('I am learning Natural Language Processing with Python, Learning Python also very easy')

In [None]:
text_blob

TextBlob("I am learning Natural Language Processing with Python, Learning Python also very easy")

**Part-of-speech Tagging**

NLTK is a library that is also used for text processing and text analysis.
TextBlob is built upon the NLTK library.

To get the tags, we need to download the 'punkt' and 'averaged_perceptron_tagger' resources from NLTK.

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
text_blob.tags

[('I', 'PRP'),
 ('am', 'VBP'),
 ('learning', 'VBG'),
 ('Natural', 'NNP'),
 ('Language', 'NNP'),
 ('Processing', 'VBG'),
 ('with', 'IN'),
 ('Python', 'NNP'),
 ('Learning', 'NNP'),
 ('Python', 'NNP'),
 ('also', 'RB'),
 ('very', 'RB'),
 ('easy', 'JJ')]
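
Because the tags are a plain list of (word, tag) tuples, they can be filtered with ordinary Python. The sketch below copies the output above into a literal so that it runs without TextBlob; in the notebook you would iterate over text_blob.tags directly. The name `nouns` is only illustrative.

```python
# The (word, tag) pairs printed above, copied as a literal.
tags = [
    ('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
    ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'VBG'),
    ('with', 'IN'), ('Python', 'NNP'), ('Learning', 'NNP'),
    ('Python', 'NNP'), ('also', 'RB'), ('very', 'RB'), ('easy', 'JJ'),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tags if tag.startswith('NN')]
print(nouns)
```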

**POS tagging Advanced**

TextBlob currently has two POS tagger implementations, located in textblob.taggers. The default is the PatternTagger which uses the same implementation as the pattern library.

The second implementation is NLTKTagger which uses NLTK’s TreeBank tagger. Numpy is required to use the NLTKTagger.

Similar to the tokenizers and noun phrase chunkers, you can explicitly specify which POS tagger to use by passing a tagger instance to the constructor.

In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
from textblob import TextBlob
from textblob.taggers import NLTKTagger
tagger = NLTKTagger()
blob = TextBlob("we are learning parts of speech taggers in textblob", pos_tagger=tagger)
blob.pos_tags

[('we', 'PRP'),
 ('are', 'VBP'),
 ('learning', 'VBG'),
 ('parts', 'NNS'),
 ('of', 'IN'),
 ('speech', 'NN'),
 ('taggers', 'NNS'),
 ('in', 'IN'),
 ('textblob', 'NN')]

**Noun Phrase Extraction**

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
text_blob.noun_phrases

WordList(['language processing', 'python', 'learning python'])

**Noun Phrase Extraction Advanced**

TextBlob currently has two noun phrase chunker implementations: textblob.np_extractors.FastNPExtractor (the default, based on Shlomi Babluki’s implementation from this blog post) and textblob.np_extractors.ConllExtractor, which uses the CoNLL 2000 corpus to train a tagger.

You can change the chunker implementation (or even use your own) by explicitly passing an instance of a noun phrase extractor to a TextBlob’s constructor.

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
nltk.download('conll2000')

[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.


True

In [None]:
from textblob import TextBlob
from textblob.np_extractors import FastNPExtractor
from textblob.np_extractors import ConllExtractor
fast_np_extract = FastNPExtractor()
blob1 = TextBlob("Python is High level programming language and easy to learn", np_extractor=fast_np_extract)
print(blob1.noun_phrases)
coll_extract = ConllExtractor()
blob2 = TextBlob("Python is High level programming language and easy to learn", np_extractor=coll_extract)
print(blob2.noun_phrases) 

['python', 'high level']
['python', 'high level programming language']


**Sentiment Analysis**

The sentiment property returns two values:
1. polarity
2. subjectivity

* polarity lies between -1 and 1, where -1 is the most negative sentiment and 1 is the most positive

* subjectivity lies between 0 and 1, where 0 indicates factual information and 1 indicates personal opinion

In [None]:
statement = TextBlob('Learning python is super fun and easy')
statement.sentiment

Sentiment(polarity=0.35555555555555557, subjectivity=0.5666666666666668)

In [None]:
statement.sentiment.polarity

0.35555555555555557

In [None]:
statement.sentiment.subjectivity

0.5666666666666668

In [None]:
statement2 = TextBlob('regular less sleep is not good for health')
print('Polarity: ', statement2.sentiment.polarity)
print('Subjectivity: ', statement2.sentiment.subjectivity)

Polarity:  -0.1722222222222222
Subjectivity:  0.24786324786324787


* The polarity is below 0 (about -0.17), so the sentiment is mildly negative
* The subjectivity (about 0.25) is closer to 0 than to 1, so the statement reads more like a fact than an opinion
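
To make this kind of reading repeatable, the two scores can be mapped to coarse labels. This is a minimal sketch: the function name `describe_sentiment` and the 0.5 subjectivity cut-off are assumptions made for illustration and are not part of TextBlob.

```python
def describe_sentiment(polarity, subjectivity):
    """Turn the two scores into coarse labels (thresholds are assumptions)."""
    if polarity > 0:
        tone = "positive"
    elif polarity < 0:
        tone = "negative"
    else:
        tone = "neutral"
    # 0.5 is an arbitrary midpoint chosen for illustration only.
    kind = "mostly opinion" if subjectivity >= 0.5 else "mostly factual"
    return tone, kind

# Scores printed above for statement2.
print(describe_sentiment(-0.1722222222222222, 0.24786324786324787))
```

In the notebook, the same helper could be called with statement2.sentiment.polarity and statement2.sentiment.subjectivity.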

**Advanced Sentiment Analyzers**

TextBlob allows you to specify which algorithms you want to use under the hood of its simple API.

The textblob.sentiments module contains two sentiment analysis implementations, PatternAnalyzer (based on the pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus).

The default implementation is PatternAnalyzer, but you can override the analyzer by passing another implementation into a TextBlob’s constructor.

For instance, the NaiveBayesAnalyzer returns its result as a namedtuple of the form: Sentiment(classification, p_pos, p_neg)

In [None]:
from textblob import TextBlob
from textblob.sentiments import NaiveBayesAnalyzer
blob = TextBlob('I love Learning TextBlob')
blob.sentiment

Sentiment(polarity=0.5, subjectivity=0.6)

Because NaiveBayesAnalyzer is trained on a movie reviews corpus, we need to download the movie_reviews dataset from NLTK before using it.

In [None]:
import nltk
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [None]:
blob = TextBlob('I love Learning TextBlob', analyzer = NaiveBayesAnalyzer())
blob.sentiment

Sentiment(classification='pos', p_pos=0.7085840931204725, p_neg=0.2914159068795271)
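
The result above is a namedtuple, so its fields can be read by name or unpacked by position. The sketch below rebuilds a stand-in tuple with the same field names and the values printed above so that it runs without TextBlob; the local `Sentiment` type is an illustration, not the class TextBlob defines.

```python
from collections import namedtuple

# Stand-in with the same field names as the result shown above.
Sentiment = namedtuple("Sentiment", ["classification", "p_pos", "p_neg"])
result = Sentiment(classification='pos',
                   p_pos=0.7085840931204725,
                   p_neg=0.2914159068795271)

# Fields can be read by name or unpacked positionally.
label, p_pos, p_neg = result
print(result.classification, label)

# The two probabilities sum to 1 up to floating-point error.
print(abs(p_pos + p_neg - 1.0) < 1e-9)
```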

In [None]:
blob = TextBlob("I'm not interested in movies", analyzer=NaiveBayesAnalyzer())
print(blob.sentiment.p_pos)
print(blob.sentiment.p_neg)



0.4747641843877517
0.5252358156122483


**Tokenization**

In [None]:
text = TextBlob("Beautiful is better than ugly. "
                "Explicit is better than implicit. "
                "Simple is better than complex.")

In [None]:
text.words

WordList(['Beautiful', 'is', 'better', 'than', 'ugly', 'Explicit', 'is', 'better', 'than', 'implicit', 'Simple', 'is', 'better', 'than', 'complex'])

In [None]:
for word in text.words:
  print(word)

Beautiful
is
better
than
ugly
Explicit
is
better
than
implicit
Simple
is
better
than
complex


In [None]:
text.sentences

[Sentence("Beautiful is better than ugly."),
 Sentence("Explicit is better than implicit."),
 Sentence("Simple is better than complex.")]

In [None]:
for sentence in text.sentences:
  print(sentence.sentiment)

Sentiment(polarity=0.2166666666666667, subjectivity=0.8333333333333334)
Sentiment(polarity=0.5, subjectivity=0.5)
Sentiment(polarity=0.06666666666666667, subjectivity=0.41904761904761906)


In [None]:
for sentence in text.sentences:
  print(sentence)

**Advanced Tokenizers**

The words and sentences properties are helpers that use the textblob.tokenizers.WordTokenizer and textblob.tokenizers.SentenceTokenizer classes, respectively.


You can use other tokenizers, such as those provided by NLTK, by passing them into the TextBlob constructor then accessing the tokens property.

In [None]:
from textblob import TextBlob
from nltk.tokenize import TabTokenizer
token = TabTokenizer()
blob = TextBlob("Beautiful\tis\tbetter\tthan\tugly.", tokenizer=token)
blob.tokens

WordList(['Beautiful', 'is', 'better', 'than', 'ugly.'])

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from textblob import TextBlob
from nltk.tokenize import BlanklineTokenizer
# BlanklineTokenizer splits on blank lines, so the tokens are separated by "\n\n"
tokenizer = BlanklineTokenizer()
blob = TextBlob("Beautiful\n\nis\n\nbetter\n\nthan\n\nugly.", tokenizer=tokenizer)
blob.tokens

WordList(['Beautiful', 'is', 'better', 'than', 'ugly.'])

**Blobber: A TextBlob Factory**

It can be tedious to repeatedly pass taggers, NP extractors, sentiment analyzers, classifiers, and tokenizers to multiple TextBlobs. To keep your code DRY, you can use the Blobber class to create TextBlobs that share the same models.

First, instantiate a Blobber with the tagger, NP extractor, sentiment analyzer, classifier, and/or tokenizer of your choice.

In [None]:
from textblob import Blobber
from textblob.taggers import NLTKTagger
tb = Blobber(pos_tagger=NLTKTagger())

In [None]:
blob1 = tb("this is a blob")
blob2 = tb("this is another blob")
blob1.pos_tagger is blob2.pos_tagger

True

**Word Inflection and Lemmatization**

In [None]:
text = TextBlob("Joe Biden is elected as president of US elections in 2020")
text.words

WordList(['Joe', 'Biden', 'is', 'elected', 'as', 'president', 'of', 'US', 'elections', 'in', '2020'])

In [None]:
text.words[8].singularize()

'election'

In [None]:
text.words[5].pluralize()

'presidents'

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
from textblob import Word
w = Word('octopi')
w.lemmatize()

'octopus'

In [None]:
speech_text = """Wednesday night, approaching his 100th day in office, 
                  President Joe Biden addresses the joint session of Congress for the first time. 
                  The full text of his prepared speech below was released by the White House.
                  Madame Speaker. Madame Vice President. 
                  No president has ever said those words from this podium, and it’s about time. The First Lady. 
                  The Second Gentleman. Mr. Chief Justice. 
                  Members of the United States Congress and the Cabinet – and distinguished guests."""

In [None]:
blob = TextBlob(speech_text)
lemma_words = [word.lemmatize('v') for word in blob.words]
lemma_words

['Wednesday',
 'night',
 'approach',
 'his',
 '100th',
 'day',
 'in',
 'office',
 'President',
 'Joe',
 'Biden',
 'address',
 'the',
 'joint',
 'session',
 'of',
 'Congress',
 'for',
 'the',
 'first',
 'time',
 'The',
 'full',
 'text',
 'of',
 'his',
 'prepare',
 'speech',
 'below',
 'be',
 'release',
 'by',
 'the',
 'White',
 'House',
 'Madame',
 'Speaker',
 'Madame',
 'Vice',
 'President',
 'No',
 'president',
 'have',
 'ever',
 'say',
 'those',
 'word',
 'from',
 'this',
 'podium',
 'and',
 'it',
 '’',
 's',
 'about',
 'time',
 'The',
 'First',
 'Lady',
 'The',
 'Second',
 'Gentleman',
 'Mr',
 'Chief',
 'Justice',
 'Members',
 'of',
 'the',
 'United',
 'States',
 'Congress',
 'and',
 'the',
 'Cabinet',
 '–',
 'and',
 'distinguish',
 'guests']

**WordNet Integration**

In [None]:
from textblob import Word
from textblob.wordnet import VERB
word = Word('octopus')
word.synsets


[Synset('octopus.n.01'), Synset('octopus.n.02')]

In [None]:
Word('hack').get_synsets(pos=VERB)

[Synset('chop.v.05'),
 Synset('hack.v.02'),
 Synset('hack.v.03'),
 Synset('hack.v.04'),
 Synset('hack.v.05'),
 Synset('hack.v.06'),
 Synset('hack.v.07'),
 Synset('hack.v.08')]

In [None]:
Word("octopus").definitions

['tentacles of octopus prepared as food',
 'bottom-living cephalopod having a soft oval body with eight long tentacles']

In [None]:
from textblob.wordnet import Synset
octopus = Synset('octopus.n.02')
shrimp = Synset('shrimp.n.03')
octopus.path_similarity(shrimp)

0.1111111111111111
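
The score above can be read as a function of distance in the WordNet hierarchy: path similarity is 1 divided by the shortest path length plus 1. The sketch below only illustrates that formula; the helper name is hypothetical, and the path length of 8 is inferred from the score rather than looked up in WordNet.

```python
def path_similarity_from_distance(distance):
    """WordNet-style path similarity: 1 / (shortest path length + 1)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (distance + 1)

# Assumption: a path length of 8 is inferred from the 0.111... score above.
print(path_similarity_from_distance(8))

# Identical synsets are 0 steps apart and get the maximum score.
print(path_similarity_from_distance(0))
```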

**WordLists**

In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
animals = TextBlob("cat dog octopus")
animals.words
animals.words.pluralize()

WordList(['cats', 'dogs', 'octopodes'])

In [None]:
animals = TextBlob("cats dogs octopodes")
animals.words
animals.words.singularize()

WordList(['cat', 'dog', 'octopus'])

**Spelling Correction**

In [None]:
from textblob import TextBlob
b = TextBlob("I am goinf to learb nayural procescing langyage")
print(b.correct())

I am going to learn natural processing language


In [None]:
from textblob import Word
w = Word('favorate')
w.spellcheck()

[('favorite', 1.0)]
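
TextBlob's documentation describes its spelling correction as based on Peter Norvig's "How to Write a Spelling Corrector". The following is a simplified sketch of that idea (single-edit candidates ranked by word frequency), not TextBlob's actual code. The `counts` table is made up for illustration; a real corrector learns frequencies from a large corpus.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Return the most frequent known word within one edit, else word itself."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Hypothetical word frequencies, used only to make the sketch runnable.
counts = Counter({"going": 3, "learn": 5, "natural": 2, "language": 4})
fixed = [correct(w, counts) for w in ["goinf", "learb", "nayural", "langyage"]]
print(fixed)
```

Words with no known candidate within one edit are returned unchanged, which is why a small vocabulary leaves most misspellings untouched.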

**Word and Noun Phrase frequencies**

In [None]:
text = TextBlob("Indian population is Very very very high compared to other countries population")
print('Population:',text.word_counts['population'])
print('very:',text.word_counts['very'])

Population: 2
very: 3


In [None]:
print('very:',text.words.count('very'))

very: 3


In [None]:
print('very:',text.words.count('very', case_sensitive=True))

very: 2
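
The difference between the two counts comes down to whether words are lowercased before counting. A plain-Python sketch with collections.Counter reproduces the numbers above; it splits on whitespace, which is a simplification of TextBlob's tokenizer that happens to be enough for this sentence.

```python
from collections import Counter

sentence = ("Indian population is Very very very high "
            "compared to other countries population")
words = sentence.split()

# Case-insensitive: lowercase every word before counting.
lower_counts = Counter(w.lower() for w in words)
# Case-sensitive: count the words exactly as written.
exact_counts = Counter(words)

print('population:', lower_counts['population'])
print('very (case-insensitive):', lower_counts['very'])
print('very (case-sensitive):', exact_counts['very'])
```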


**Parsing**

In [None]:
b = TextBlob("Now we are going to learn parsing")
print(b.parse())

Now/RB/B-ADVP/O we/PRP/B-NP/O are/VBP/B-VP/O going/VBG/I-VP/O to/TO/B-PP/O learn/VB/B-VP/O parsing/VBG/I-VP/O


**n-grams**

In [None]:
blob = TextBlob("Now is better than ever")
blob.ngrams(n=2)

[WordList(['Now', 'is']),
 WordList(['is', 'better']),
 WordList(['better', 'than']),
 WordList(['than', 'ever'])]

In [None]:
blob = TextBlob("every in the IT industry trying to upskill themselves")
blob.ngrams(n=3)

[WordList(['every', 'in', 'the']),
 WordList(['in', 'the', 'IT']),
 WordList(['the', 'IT', 'industry']),
 WordList(['IT', 'industry', 'trying']),
 WordList(['industry', 'trying', 'to']),
 WordList(['trying', 'to', 'upskill']),
 WordList(['to', 'upskill', 'themselves'])]
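
An n-gram list is a sliding window of n consecutive words. The pure-Python sketch below shows the idea with plain lists; it is an illustration of the technique under the assumption of whitespace tokenization, not TextBlob's implementation.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    if n <= 0:
        return []
    # A text of k words has k - n + 1 windows of length n.
    return [words[i:i + n] for i in range(len(words) - n + 1)]

bigrams = ngrams("Now is better than ever".split(), 2)
print(bigrams)
```

A text with fewer than n words produces an empty list, since no full window fits.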

**Text Classification System with TextBlob**

***Spam Classifier with TextBlob***

In [38]:
import pandas as pd
df = pd.read_csv('/content/SPAM text message 20170820 - Data.csv')
df.head(10)

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [58]:
data = zip(df.Message, df.Category)
data = list(data)

In [60]:
data[:10]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  'ham'),
 ('Ok lar... Joking wif u oni...', 'ham'),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  'spam'),
 ('U dun say so early hor... U c already then say...', 'ham'),
 ("Nah I don't think he goes to usf, he lives around here though", 'ham'),
 ("FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv",
  'spam'),
 ('Even my brother is not like to speak with me. They treat me like aids patent.',
  'ham'),
 ("As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune",
  'ham'),
 ('WINNER!! As a valued network customer you have been selected to receivea £900 prize 

In [66]:
len(data)

5572

In [69]:
term = round(len(data)*0.8)
term

4458

In [70]:
train = data[:term]
test = data[term:]
print('Train data Size:',len(train))
print('Test data Size:',len(test))

Train data Size: 4458
Test data Size: 1114
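
The split above takes the first 80% of rows in file order. If the file happens to be sorted in some way, shuffling before splitting may give a more representative test set. The sketch below is one way to do that; the helper name `shuffled_split`, the seed, and the stand-in rows are assumptions for illustration only.

```python
import random

def shuffled_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows, then cut it into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split repeatable
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Stand-in rows with the same length as the data list above.
rows = [("message %d" % i, "ham") for i in range(5572)]
train_rows, test_rows = shuffled_split(rows)
print('Train data Size:', len(train_rows))
print('Test data Size:', len(test_rows))
```

In the notebook, passing the real data list instead of rows would give shuffled train and test lists of the same sizes as before.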


In [71]:
from textblob.classifiers import NaiveBayesClassifier
classifier = NaiveBayesClassifier(train)

In [74]:
classifier.classify(data[term+1][0])

'ham'

In [81]:
data[term+1]

('Die... I accidentally deleted e msg i suppose 2 put in e sim archive. Haiz... I so sad...',
 'ham')

In [77]:
from textblob import TextBlob
blob = TextBlob("The beer is good. But the hangover is horrible.", classifier = classifier)
blob.classify()

'ham'

In [79]:
from textblob import TextBlob
blob = TextBlob("FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv", classifier = classifier)
blob.classify()

'spam'

In [82]:
classifier.classify(data[term+2][0])

'spam'

In [83]:
classifier.accuracy(test)

0.9829443447037702
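
Accuracy here is the fraction of test rows whose predicted label matches the true label. The sketch below computes the same kind of score by hand; `toy_classify` and `toy_rows` are hypothetical stand-ins so the example runs without a trained classifier.

```python
def accuracy(classify, labeled_rows):
    """Fraction of (text, label) rows for which classify(text) equals label."""
    if not labeled_rows:
        raise ValueError("labeled_rows must not be empty")
    correct = sum(1 for text, label in labeled_rows if classify(text) == label)
    return correct / len(labeled_rows)

# Hypothetical stand-in for classifier.classify, used only for illustration.
def toy_classify(text):
    return "spam" if "free" in text.lower() else "ham"

toy_rows = [("Free entry now", "spam"), ("See you at lunch", "ham"),
            ("Call me later", "ham"), ("You won a prize", "spam")]
score = accuracy(toy_classify, toy_rows)
print(score)
```

With the notebook's objects, accuracy(classifier.classify, test) would follow the same definition, although it classifies one row at a time and may be slow on large test sets.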

In [86]:
classifier.show_informative_features(10)

Most Informative Features
          contains(STOP) = True             spam : ham    =    177.0 : 1.0
            contains(16) = True             spam : ham    =    177.0 : 1.0
           contains(Txt) = True             spam : ham    =     90.3 : 1.0
        contains(Orange) = True             spam : ham    =     87.4 : 1.0
        contains(Mobile) = True             spam : ham    =     83.2 : 1.0
            contains(To) = True             spam : ham    =     83.2 : 1.0
           contains(txt) = True             spam : ham    =     80.3 : 1.0
        contains(camera) = True             spam : ham    =     66.1 : 1.0
         contains(award) = True             spam : ham    =     66.1 : 1.0
        contains(pounds) = True             spam : ham    =     61.8 : 1.0
