# TextBlob
TextBlob is a Python library for common natural language processing (NLP) tasks, including part-of-speech tagging, sentiment analysis, tokenization, and spelling correction. See the [TextBlob documentation](https://textblob.readthedocs.io/en/dev/) for details.

In [1]:
%%capture
# Install textblob
!pip install -U textblob


In [2]:
from textblob import TextBlob


## Corpora

In [3]:
%%capture
# Download corpora
!python -m textblob.download_corpora


In [4]:
import nltk
# nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

## TextBlobs

In [5]:
my_blob = TextBlob("There is more than one way to skin a cat.")


In [6]:
my_blob


TextBlob("There is more than one way to skin a cat.")

## Tagging Parts of Speech
TextBlob labels words with Penn Treebank part-of-speech tags. The tags are listed [here](https://www.geeksforgeeks.org/python-part-of-speech-tagging-using-textblob/) and summarized in the table below.

code | meaning | example
--- | --- | ---
CC | coordinating conjunction |
CD | cardinal number |
DT | determiner |
EX | existential there | "there is" (think "there exists")
FW | foreign word |
IN | preposition/subordinating conjunction |
JJ | adjective | big
JJR | adjective, comparative | bigger
JJS | adjective, superlative | biggest
LS | list marker | 1)
MD | modal | could, will
NN | noun, singular | desk
NNS | noun, plural | desks
NNP | proper noun, singular | Harrison
NNPS | proper noun, plural | Americans
PDT | predeterminer | "all" in "all the kids"
POS | possessive ending | parent's
PRP | personal pronoun | I, he, she
PRP\$ | possessive pronoun | my, his, her
RB | adverb | very, silently
RBR | adverb, comparative | better
RBS | adverb, superlative | best
RP | particle | give up
TO | to | go "to" the store
UH | interjection | errrrrrrrm
VB | verb, base form | take
VBD | verb, past tense | took
VBG | verb, gerund/present participle | taking
VBN | verb, past participle | taken
VBP | verb, present, non-3rd person singular | take
VBZ | verb, present, 3rd person singular | takes
WDT | wh-determiner | which
WP | wh-pronoun | who, what
WP\$ | possessive wh-pronoun | whose
WRB | wh-adverb | where, when






In [7]:
# Use the .tags attribute to see parts of speech
my_blob.tags


[('There', 'EX'),
 ('is', 'VBZ'),
 ('more', 'JJR'),
 ('than', 'IN'),
 ('one', 'CD'),
 ('way', 'NN'),
 ('to', 'TO'),
 ('skin', 'VB'),
 ('a', 'DT'),
 ('cat', 'NN')]
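
Because `.tags` is a plain list of `(word, tag)` pairs, ordinary Python can filter or group it. The sketch below copies the output above into a literal so that it runs without TextBlob. The helper name `words_with_tag_prefix` is made up for this illustration.

```python
# Output of my_blob.tags from the cell above, copied as a literal.
tags = [
    ("There", "EX"), ("is", "VBZ"), ("more", "JJR"), ("than", "IN"),
    ("one", "CD"), ("way", "NN"), ("to", "TO"), ("skin", "VB"),
    ("a", "DT"), ("cat", "NN"),
]


def words_with_tag_prefix(tagged_words, prefix):
    """Return the words whose tag starts with `prefix` (e.g. 'NN' for nouns)."""
    return [word for word, tag in tagged_words if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tags, "NN")  # matches NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tags, "VB")  # matches VB, VBD, VBG, VBN, VBP, VBZ
print(nouns)
print(verbs)
```

Matching on the tag prefix keeps the helper short, since all noun tags start with `NN` and all verb tags start with `VB`.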

## Sentiment Analysis

Sentiment analysis can be used to understand the feeling or emotion tied to the text. The sentiment attribute in TextBlob will return two values:
1. The **polarity score** (a float between -1.0 and 1.0). -1 is negative, 1 is positive.
2. The **subjectivity** (a float between 0.0 and 1.0). 0 is very objective, while 1 is very subjective.
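
To make the two scores easier to read, they can be mapped to text labels. This is a minimal sketch, not part of TextBlob: the function name `describe_sentiment` and both thresholds (`neutral_band`, `subjective_cutoff`) are assumptions chosen only for illustration.

```python
def describe_sentiment(polarity, subjectivity, neutral_band=0.1, subjective_cutoff=0.5):
    """Turn a (polarity, subjectivity) pair into a short text label.

    neutral_band and subjective_cutoff are assumed thresholds, picked only
    for illustration; tune them for your own data.
    """
    if polarity > neutral_band:
        tone = "positive"
    elif polarity < -neutral_band:
        tone = "negative"
    else:
        tone = "neutral"
    stance = "subjective" if subjectivity >= subjective_cutoff else "objective"
    return f"{tone}, {stance}"


print(describe_sentiment(0.8, 0.875))
print(describe_sentiment(0.0, 0.0))
print(describe_sentiment(-0.25, 0.55))
```

With a real blob, the inputs would come from `blob.sentiment.polarity` and `blob.sentiment.subjectivity`.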

In [8]:
neg_blob = TextBlob("I am so tired. Today was a long, hard day.")
neg_blob.sentiment


Sentiment(polarity=-0.24722222222222223, subjectivity=0.5472222222222222)

In [9]:
pos_blob = TextBlob("Today was a great day. I am so happy.")
pos_blob.sentiment


Sentiment(polarity=0.8, subjectivity=0.875)

In [10]:
obj_blob = TextBlob("The cat is gray.")
obj_blob.sentiment


Sentiment(polarity=0.0, subjectivity=0.0)

In [11]:
subj_blob = TextBlob("The cat is so cute and sweet.")
print(subj_blob.sentiment)
print(subj_blob.sentiment.subjectivity) # Only get the subjectivity


Sentiment(polarity=0.425, subjectivity=0.825)
0.825


### Sentiment Analysis of Multiple Sentences

In [12]:
my_poem = TextBlob('''
  Python is a great language to learn.
  You can easily do NLP; it's fab.
  It might take some getting used to.
  But it's definitely more gooder than Matlab.
''')


In [13]:
my_poem

TextBlob("
  Python is a great language to learn.
  You can easily do NLP; it's fab.
  It might take some getting used to.
  But it's definitely more gooder than Matlab.
")

In [14]:
my_poem.sentiment


Sentiment(polarity=0.5777777777777778, subjectivity=0.6944444444444445)

In [15]:
my_poem.sentences


[Sentence("
   Python is a great language to learn."),
 Sentence("You can easily do NLP; it's fab."),
 Sentence("It might take some getting used to."),
 Sentence("But it's definitely more gooder than Matlab.")]

In [16]:
for sentence in my_poem.sentences:
  print(sentence.sentiment)


Sentiment(polarity=0.8, subjectivity=0.75)
Sentiment(polarity=0.43333333333333335, subjectivity=0.8333333333333334)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=0.5, subjectivity=0.5)


### Your Turn
Create three TextBlobs with the following sentiments:
1. Negative, subjective
2. Positive, objective
3. Neutral


In [17]:
# Solution 1
text_ns = "It's a cruddy day."
neg_sub = TextBlob(text_ns)
neg_sub.sentiment


Sentiment(polarity=-0.9, subjectivity=0.9)

In [18]:
# Solution 1
text_ns = "Hitler was a terrible man."
neg_sub = TextBlob(text_ns)
neg_sub.sentiment


Sentiment(polarity=-1.0, subjectivity=1.0)

In [19]:
# Solution 2
text_po = "Bill is a nice guy. He won the race."
pos_obj = TextBlob(text_po)
pos_obj.sentiment
# No luck: the polarity is positive, but the subjectivity is 1.0 (fully subjective)


Sentiment(polarity=0.6, subjectivity=1.0)

In [20]:
# Solution 2
text_po = "My best friend had a baby boy."
pos_obj = TextBlob(text_po)
pos_obj.sentiment
# No luck: the subjectivity drops to 0.3, but it is still not 0.0


Sentiment(polarity=1.0, subjectivity=0.3)

In [21]:
# Solution 3
text_n = "One plus one is two."
neut = TextBlob(text_n)
neut.sentiment


Sentiment(polarity=0.0, subjectivity=0.0)

## Tokenization
Tokenization is the process of splitting long strings of text into small pieces (tokens).
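
As a rough illustration of the idea, the sketch below tokenizes text with two regular expressions. It is a simplification under stated assumptions, not TextBlob's tokenizer, which handles abbreviations and contractions more carefully. The function names `split_sentences` and `split_words` are made up for this example.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' (a rough heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Keep runs of letters, digits and apostrophes; drop other punctuation."""
    return re.findall(r"[A-Za-z0-9']+", sentence)


text = "Python is a great language to learn. You can easily do NLP; it's fab."
sentences = split_sentences(text)
print(sentences)
print(split_words(sentences[0]))
```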

In [22]:
my_poem.sentences


[Sentence("
   Python is a great language to learn."),
 Sentence("You can easily do NLP; it's fab."),
 Sentence("It might take some getting used to."),
 Sentence("But it's definitely more gooder than Matlab.")]

In [23]:
my_poem.sentences[0].words


WordList(['Python', 'is', 'a', 'great', 'language', 'to', 'learn'])

In [24]:
my_poem.words


WordList(['Python', 'is', 'a', 'great', 'language', 'to', 'learn', 'You', 'can', 'easily', 'do', 'NLP', 'it', "'s", 'fab', 'It', 'might', 'take', 'some', 'getting', 'used', 'to', 'But', 'it', "'s", 'definitely', 'more', 'gooder', 'than', 'Matlab'])

In [25]:
sorted(my_poem.word_counts.items(), key = lambda x: x[1], reverse=True)

[('it', 3),
 ('to', 2),
 ('s', 2),
 ('python', 1),
 ('is', 1),
 ('a', 1),
 ('great', 1),
 ('language', 1),
 ('learn', 1),
 ('you', 1),
 ('can', 1),
 ('easily', 1),
 ('do', 1),
 ('nlp', 1),
 ('fab', 1),
 ('might', 1),
 ('take', 1),
 ('some', 1),
 ('getting', 1),
 ('used', 1),
 ('but', 1),
 ('definitely', 1),
 ('more', 1),
 ('gooder', 1),
 ('than', 1),
 ('matlab', 1)]

## Singular & Plural

In [26]:
my_sent = TextBlob("The octopi went swimming in the dark ocean waters.")


In [27]:
my_sent.words

WordList(['The', 'octopi', 'went', 'swimming', 'in', 'the', 'dark', 'ocean', 'waters'])

In [28]:
my_sent.words[0]


'The'

In [29]:
# Singularize
my_sent.words[-1].singularize()


'water'

In [30]:
my_sent.words[1].singularize()


'octopus'

In [31]:
# Pluralize
my_sent.words[-2].pluralize()


'oceans'

In [32]:
foo = my_sent.words[-2]
foo == foo.singularize()

True

In [33]:
TextBlob("corpus").words.singularize(), TextBlob("corpus").words.pluralize()

(WordList(['corpu']), WordList(['corpora']))

In [34]:
my_sent.words[2:5]

WordList(['went', 'swimming', 'in'])

In [35]:
import numpy as np

np.array(my_sent.words)

array(['The', 'octopi', 'went', 'swimming', 'in', 'the', 'dark', 'ocean',
       'waters'], dtype='<U8')

## Stemming & Lemmatization

Stemming strips affixes (mostly suffixes) from a word by rule, leaving only the word "stem", which is not always a real word. Lemmatization is similar, but it maps a word to its dictionary form (its lemma) using a vocabulary, so it can handle irregular forms that simple rules miss.
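
The difference between the two approaches can be sketched with a toy example. Everything here is hypothetical and deliberately tiny: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize`. Real stemmers use many more rules, and real lemmatizers consult a full dictionary such as WordNet.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set
LEMMAS = {"octopi": "octopus", "went": "go", "waters": "water"}  # hypothetical lookup table


def toy_stem(word):
    """Chop one known suffix off the end of the word, purely by rule."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo a doubled final consonant: "swimm" -> "swim".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word itself."""
    return LEMMAS.get(word, word)


for w in ["swimming", "octopi", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based stemmer handles the regular form `swimming` but leaves `octopi` and `went` alone, while the lookup resolves the irregular forms only because they were listed in the table.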

In [36]:
my_sent


TextBlob("The octopi went swimming in the dark ocean waters.")

In [37]:
# Find the index of 'swimming'
my_sent.words.index('swimming')


3

In [38]:
# Stemming
print(my_sent.words[3].stem())
print(my_sent.words[1].stem())


swim
octopi


In [39]:
# Lemmatization
print(my_sent.words[3].lemmatize())
print(my_sent.words[1].lemmatize())


swimming
octopus


In [40]:
care = TextBlob("caring")

(
  care.words.stem(),
  care.words.lemmatize()
)


(WordList(['care']), WordList(['caring']))

## WordNet

In [41]:
my_sent

TextBlob("The octopi went swimming in the dark ocean waters.")

In [42]:
{ my_sent.words[-2] : my_sent.words[-2].definitions }


{'ocean': ['a large body of water constituting a principal part of the hydrosphere',
  'anything apparently limitless in quantity or volume']}

In [43]:
{"swimming", "tennis"} - set(my_sent.words)

{'tennis'}

## Spelling Correction

In [44]:
my_bad_spelling = TextBlob('Helllo, today is my birfday.')
my_bad_spelling.correct()


TextBlob("Hello, today is my birthday.")

## Counting Words

In [45]:
my_cheer = TextBlob('Data science is the best, data science is the coolest.')
my_cheer.words.count('data')


2

In [46]:
my_cheer.word_counts


defaultdict(int,
            {'data': 2,
             'science': 2,
             'is': 2,
             'the': 2,
             'best': 1,
             'coolest': 1})
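
The counts above treat `Data` and `data` as the same word and ignore punctuation. A similar result can be sketched with the standard library alone. This is an approximation for illustration, not how TextBlob computes `word_counts`, and the letters-only regular expression is a simplifying assumption.

```python
import re
from collections import Counter

text = "Data science is the best, data science is the coolest."

# Lowercase first so that "Data" and "data" are counted together,
# then keep only runs of letters (a simplification of real tokenization).
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)

print(counts["data"])
print(dict(counts))
```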

### Your Turn
1. Create a TextBlob called `message` and set it equal to `Good morning, todayy is going to be a fantastic day!`.
2. Correct the spelling in your TextBlob and set it equal to a new variable called `message_sp`.
3. Find the index of the word `fantastic`.
4. Look up the definition of the word `fantastic`.
5. Stem and lemmatize the word `fantastic`.

In [47]:
# Solution 1
message = TextBlob("Good morning, todayy is going to be a fantastic day!.")
message

TextBlob("Good morning, todayy is going to be a fantastic day!.")

In [48]:
# Solution 2
message_sp = message.correct()
message_sp

TextBlob("Good morning, today is going to be a fantastic day!.")

In [49]:
list(zip(message.words, message_sp.words ))

[('Good', 'Good'),
 ('morning', 'morning'),
 ('todayy', 'today'),
 ('is', 'is'),
 ('going', 'going'),
 ('to', 'to'),
 ('be', 'be'),
 ('a', 'a'),
 ('fantastic', 'fantastic'),
 ('day', 'day')]

In [50]:
[ (i,t) for i, t in enumerate(zip(message.words, message_sp.words )) if t[0] != t[1] ]

[(2, ('todayy', 'today'))]

In [51]:
# Solution 3
(
    message.index("fantastic"),
    message.words.index("fantastic")
)

(38, 8)

In [52]:
# Solution 4
TextBlob("fantastic").words[0].definitions
message.words[ message.words.index("fantastic") ].definitions

['ludicrously odd',
 'extraordinarily good or great ; used especially as intensifiers',
 'fanciful and unrealistic; foolish',
 'existing in fancy only; - Nathaniel Hawthorne',
 'extravagantly fanciful in design, construction, appearance']

In [53]:
# Solution 5
fan = message.words[ message.words.index("fantastic") ]
(
  fan.stem(),
  fan.lemmatize()
)

('fantast', 'fantastic')

## TextBlobs as Strings
TextBlobs act as strings, meaning you can use all of the normal string methods and you can index them as you would a string.

In [54]:
my_cheer


TextBlob("Data science is the best, data science is the coolest.")

In [55]:
my_cheer[0:6]


TextBlob("Data s")

In [56]:
my_cheer.upper()


TextBlob("DATA SCIENCE IS THE BEST, DATA SCIENCE IS THE COOLEST.")

In [57]:
my_cheer.lower()


TextBlob("data science is the best, data science is the coolest.")

## **n**-grams
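An n-gram is a run of n consecutive words. The `ngrams()` method slides a window of n words across the text one word at a time, so neighboring n-grams overlap.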
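
The sliding window can be sketched in a few lines of plain Python. The function below is an illustration written for this notebook, not TextBlob's implementation. A list of k words yields k - n + 1 n-grams, or none when n is larger than k.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, sliding one word at a time."""
    return [words[i : i + n] for i in range(len(words) - n + 1)]


words = ["italian", "pop", "rock"]
print(ngrams(words, 2))
print(ngrams(words, 3))
print(ngrams(words, 4))
```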

In [58]:
my_cheer


TextBlob("Data science is the best, data science is the coolest.")

In [59]:
my_cheer.words

WordList(['Data', 'science', 'is', 'the', 'best', 'data', 'science', 'is', 'the', 'coolest'])

In [60]:
my_cheer.ngrams(n=3)


[WordList(['Data', 'science', 'is']),
 WordList(['science', 'is', 'the']),
 WordList(['is', 'the', 'best']),
 WordList(['the', 'best', 'data']),
 WordList(['best', 'data', 'science']),
 WordList(['data', 'science', 'is']),
 WordList(['science', 'is', 'the']),
 WordList(['is', 'the', 'coolest'])]

In [61]:
[ " ".join(i) for i in my_cheer.ngrams(n=3) ]


['Data science is',
 'science is the',
 'is the best',
 'the best data',
 'best data science',
 'data science is',
 'science is the',
 'is the coolest']

In [62]:
my_cheer.split(",")


WordList(['Data science is the best', ' data science is the coolest.'])

In [63]:
my_cheer.words


WordList(['Data', 'science', 'is', 'the', 'best', 'data', 'science', 'is', 'the', 'coolest'])

In [64]:
[ " ".join(i) for i in TextBlob("italian pop rock").ngrams(n=2) ]


['italian pop', 'pop rock']

## Translation


In [65]:
# %%capture
!pip install googletrans==3.1.0a0 transformers sacremoses



### Google translate

In [66]:
from googletrans import Translator


In [67]:
translator = Translator()


In [68]:
result = translator.translate(
    'Hello, how are you?',
    src='en',
    dest='es',
)
print(result.text)


¿Hola, cómo estás?


In [69]:
result = translator.translate(
    'Hello, how are you?',
    src='en',
    dest='fr',
)
print(result.text)


Bonjour comment allez-vous?


In [70]:
result = translator.translate(
    'Hello, how are you?',
    src='en',
    dest='ar',
)
print(result.text)


مرحبا، كيف حالك؟


In [71]:
result = translator.translate(
    'Hello, how are you?',
    src='en',
    dest='de',
)
print(result.text)


Hallo, wie geht es dir?


All four translations in one loop:

In [72]:
langs = 'es fr ar de'.split()

for lang in langs:
  result = translator.translate(
      'Hello, how are you?',
      src='en',
      dest=lang,
  )
  print(result.text)


¿Hola, cómo estás?
Bonjour comment allez-vous?
مرحبا، كيف حالك؟
Hallo, wie geht es dir?


### Hugging Face Transformers (via pre-trained models)


In [78]:
# MarianTokenizer depends on the sentencepiece package, so install it before importing
%pip install sentencepiece
from transformers import MarianMTModel, MarianTokenizer
import sentencepiece as spm



Note: you may need to restart the kernel to use updated packages.


In [79]:
# Load pre-trained MarianMT model
model_name = 'Helsinki-NLP/opus-mt-en-es'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True )

text = "Hello, how are you?"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
result = tokenizer.decode(translated[0], skip_special_tokens=True)

print(result)


Hola, ¿cómo estás?


In [75]:
# Load pre-trained MarianMT model
model_name = 'Helsinki-NLP/opus-mt-en-fr'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

text = "Hello, how are you?"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
result = tokenizer.decode(translated[0], skip_special_tokens=True)

print(result)


Bonjour, comment allez-vous?


In [76]:
# Load pre-trained MarianMT model
model_name = 'Helsinki-NLP/opus-mt-en-ar'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

text = "Hello, how are you?"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
result = tokenizer.decode(translated[0], skip_special_tokens=True)

print(result)


مرحباً، كيف حالك؟


In [77]:
# Load pre-trained MarianMT model
model_name = 'Helsinki-NLP/opus-mt-en-de'
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

text = "Hello, how are you?"
translated = model.generate(**tokenizer(text, return_tensors="pt", padding=True))
result = tokenizer.decode(translated[0], skip_special_tokens=True)

print(result)


Hallo, wie geht's?
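
The four cells above differ only in the target language, so they can be folded into one helper. This is a sketch under stated assumptions: the names `marian_model_name` and `translate` are made up here, and the naming pattern is inferred from the four checkpoints used above, so a checkpoint may not exist for every language pair.

```python
def marian_model_name(src, dest):
    """Build a checkpoint name following the pattern used in the cells above."""
    return f"Helsinki-NLP/opus-mt-{src}-{dest}"


def translate(text, dest, src="en"):
    """Translate `text` with the MarianMT checkpoint for the src -> dest pair."""
    # Imported here so the naming helper also works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    model_name = marian_model_name(src, dest)
    model = MarianMTModel.from_pretrained(model_name)
    tokenizer = MarianTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)
    batch = tokenizer(text, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.decode(generated[0], skip_special_tokens=True)


# Only the checkpoint names are printed here; nothing is downloaded.
for lang in "es fr ar de".split():
    print(marian_model_name("en", lang))
```

Calling `translate("Hello, how are you?", dest="de")` would then load the German model, downloading it on first use. Reloading the model on every call is slow, so for repeated translations it would make sense to cache the loaded model and tokenizer per language.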

