# NLP with TextBlob

TextBlob is a Python library that provides a simple API for common NLP tasks and builds on the Natural Language Toolkit (NLTK) and the Pattern web mining library. It supports part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and translation, among other tasks.

## Imports & Settings

In [1]:
%matplotlib inline
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# textblob and nltk for language processing
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

# sklearn for feature extraction & modeling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import joblib  # standalone package; sklearn.externals.joblib was removed from scikit-learn

In [2]:
np.random.seed(42)
pd.set_option('float_format', '{:,.2f}'.format)

## Load BBC Data

To illustrate the use of TextBlob, we sample a BBC sports article with the headline ‘Robinson ready for difficult task’. Similar to spaCy and other libraries, the first step is to pass the document through a pipeline represented by the TextBlob object to assign annotations required for various tasks.

In [3]:
path = Path('data', 'bbc')
files = path.glob('**/*.txt')
doc_list = []
for i, file in enumerate(files):
    topic = file.parts[-2]
    article = file.read_text(encoding='latin1').split('\n')
    heading = article[0].strip()
    body = ' '.join([l.strip() for l in article[1:]]).strip()
    doc_list.append([topic, heading, body])

In [4]:
docs = pd.DataFrame(doc_list, columns=['topic', 'heading', 'body'])
docs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2225 entries, 0 to 2224
Data columns (total 3 columns):
topic      2225 non-null object
heading    2225 non-null object
body       2225 non-null object
dtypes: object(3)
memory usage: 52.2+ KB


## Introduction to TextBlob

The examples below assume that TextBlob and the NLTK corpora it relies on are already installed.

### Select random article

In [5]:
article = docs.sample(1).squeeze()

In [6]:
print(f'Topic:\t{article.topic.capitalize()}\n\n{article.heading}\n')
print(article.body.strip())

Topic:	Sport

Robinson ready for difficult task

England coach Andy Robinson faces the first major test of his tenure as he tries to get back to winning ways after the Six Nations defeat by Wales.  Robinson is likely to make changes in the back row and centre after the 11-9 loss as he contemplates Sunday's set-to with France at Twickenham. Lewis Moody and Martin Corry could both return after missing the game with hamstring and shoulder problems. And the midfield pairing of Mathew Tait and Jamie Noon is also under threat. Olly Barkley immediately allowed England to generate better field position with his kicking game after replacing debutant Tait just before the hour. The Bath fly-half-cum-centre is likely to start against France, with either Tait or Noon dropping out.  Tait, given little opportunity to shine in attack, received praise from Robinson afterwards, even if the coach admitted Cardiff was an "unforgiving place" for the teenage prodigy. Robinson now has a tricky decision over 

In [7]:
parsed_body = TextBlob(article.body)

### Tokenization

In [8]:
parsed_body.words

WordList(['England', 'coach', 'Andy', 'Robinson', 'faces', 'the', 'first', 'major', 'test', 'of', 'his', 'tenure', 'as', 'he', 'tries', 'to', 'get', 'back', 'to', 'winning', 'ways', 'after', 'the', 'Six', 'Nations', 'defeat', 'by', 'Wales', 'Robinson', 'is', 'likely', 'to', 'make', 'changes', 'in', 'the', 'back', 'row', 'and', 'centre', 'after', 'the', '11-9', 'loss', 'as', 'he', 'contemplates', 'Sunday', "'s", 'set-to', 'with', 'France', 'at', 'Twickenham', 'Lewis', 'Moody', 'and', 'Martin', 'Corry', 'could', 'both', 'return', 'after', 'missing', 'the', 'game', 'with', 'hamstring', 'and', 'shoulder', 'problems', 'And', 'the', 'midfield', 'pairing', 'of', 'Mathew', 'Tait', 'and', 'Jamie', 'Noon', 'is', 'also', 'under', 'threat', 'Olly', 'Barkley', 'immediately', 'allowed', 'England', 'to', 'generate', 'better', 'field', 'position', 'with', 'his', 'kicking', 'game', 'after', 'replacing', 'debutant', 'Tait', 'just', 'before', 'the', 'hour', 'The', 'Bath', 'fly-half-cum-centre', 'is', 'li

### Sentence boundary detection

In [9]:
parsed_body.sentences

[Sentence("England coach Andy Robinson faces the first major test of his tenure as he tries to get back to winning ways after the Six Nations defeat by Wales."),
 Sentence("Robinson is likely to make changes in the back row and centre after the 11-9 loss as he contemplates Sunday's set-to with France at Twickenham."),
 Sentence("Lewis Moody and Martin Corry could both return after missing the game with hamstring and shoulder problems."),
 Sentence("And the midfield pairing of Mathew Tait and Jamie Noon is also under threat."),
 Sentence("Olly Barkley immediately allowed England to generate better field position with his kicking game after replacing debutant Tait just before the hour."),
 Sentence("The Bath fly-half-cum-centre is likely to start against France, with either Tait or Noon dropping out."),
 Sentence("Tait, given little opportunity to shine in attack, received praise from Robinson afterwards, even if the coach admitted Cardiff was an "unforgiving place" for the teenage prodi

### Stemming

To perform stemming, we instantiate the SnowballStemmer from the nltk library, call its .stem() method on each token and display tokens that were modified as a result:

In [10]:
# Initialize stemmer.
stemmer = SnowballStemmer('english')

# Stem each word and keep only the tokens that the stemmer changed.
[(word, stemmer.stem(word)) for word in parsed_body.words
 if word.lower() != stemmer.stem(word)]

[('Andy', 'andi'),
 ('faces', 'face'),
 ('tenure', 'tenur'),
 ('tries', 'tri'),
 ('winning', 'win'),
 ('ways', 'way'),
 ('Nations', 'nation'),
 ('Wales', 'wale'),
 ('likely', 'like'),
 ('changes', 'chang'),
 ('centre', 'centr'),
 ('contemplates', 'contempl'),
 ('France', 'franc'),
 ('Lewis', 'lewi'),
 ('Moody', 'moodi'),
 ('Corry', 'corri'),
 ('missing', 'miss'),
 ('hamstring', 'hamstr'),
 ('problems', 'problem'),
 ('pairing', 'pair'),
 ('Jamie', 'jami'),
 ('Olly', 'olli'),
 ('immediately', 'immedi'),
 ('allowed', 'allow'),
 ('generate', 'generat'),
 ('position', 'posit'),
 ('kicking', 'kick'),
 ('replacing', 'replac'),
 ('debutant', 'debut'),
 ('before', 'befor'),
 ('fly-half-cum-centre', 'fly-half-cum-centr'),
 ('likely', 'like'),
 ('France', 'franc'),
 ('dropping', 'drop'),
 ('little', 'littl'),
 ('opportunity', 'opportun'),
 ('received', 'receiv'),
 ('praise', 'prais'),
 ('afterwards', 'afterward'),
 ('admitted', 'admit'),
 ('unforgiving', 'unforgiv'),
 ('teenage', 'teenag'),
 ('pr

### Lemmatization

In [11]:
[(word, word.lemmatize()) for word in parsed_body.words
 if word != word.lemmatize()]

[('faces', 'face'),
 ('as', 'a'),
 ('tries', 'try'),
 ('ways', 'way'),
 ('changes', 'change'),
 ('as', 'a'),
 ('problems', 'problem'),
 ('was', 'wa'),
 ('has', 'ha'),
 ('regards', 'regard'),
 ('as', 'a'),
 ('was', 'wa'),
 ('was', 'wa'),
 ('forwards', 'forward'),
 ('positives', 'positive')]

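Lemmatization relies on part-of-speech (POS) tagging. By default, `lemmatize()` treats every token as a noun, which explains artifacts such as ('was', 'wa') and ('as', 'a') above. `spaCy` performs POS tagging as part of its pipeline; here we instead make a simplifying assumption, e.g. that each token is a verb.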

In [12]:
[(word, word.lemmatize(pos='v')) for word in parsed_body.words
 if word != word.lemmatize(pos='v')]

[('faces', 'face'),
 ('tries', 'try'),
 ('winning', 'win'),
 ('is', 'be'),
 ('changes', 'change'),
 ('contemplates', 'contemplate'),
 ('missing', 'miss'),
 ('pairing', 'pair'),
 ('is', 'be'),
 ('allowed', 'allow'),
 ('kicking', 'kick'),
 ('replacing', 'replace'),
 ('is', 'be'),
 ('dropping', 'drop'),
 ('given', 'give'),
 ('received', 'receive'),
 ('admitted', 'admit'),
 ('was', 'be'),
 ('has', 'have'),
 ('firing', 'fire'),
 ('outing', 'out'),
 ('regards', 'regard'),
 ('appeared', 'appear'),
 ('punishing', 'punish'),
 ('dished', 'dish'),
 ('said', 'say'),
 ('selected', 'select'),
 ('were', 'be'),
 ('playing', 'play'),
 ('was', 'be'),
 ('thought', 'think'),
 ('defended', 'defend'),
 ('got', 'get'),
 ('covered', 'cover'),
 ('missed', 'miss'),
 ('conceded', 'concede'),
 ('was', 'be'),
 ('turned', 'turn'),
 ('fumbled', 'fumble'),
 ('improved', 'improve'),
 ('remains', 'remain'),
 ('came', 'come'),
 ('stepping', 'step'),
 ('is', 'be'),
 ('posed', 'pose'),
 ('forwards', 'forward'),
 ('justifi
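
Treating every token as a verb is a blunt assumption. A gentler alternative is to derive the WordNet POS code from a tagger's output. The sketch below is not part of the original notebook: `penn_to_wordnet` is a hypothetical helper, and the `(token, tag)` pairs are written by hand to stand in for Penn Treebank tags from a tagger.

```python
# Hypothetical helper (not from the original notebook): map Penn Treebank
# tags to the single-letter WordNet POS codes used by lemmatize(pos=...).
def penn_to_wordnet(tag):
    """Return 'a', 'v', 'r' or 'n' for a Penn Treebank POS tag."""
    if tag.startswith('JJ'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('VB'):
        return 'v'  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith('RB'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # everything else falls back to noun


# Hand-written (token, tag) pairs standing in for a tagger's output.
tagged = [('Robinson', 'NNP'), ('faces', 'VBZ'), ('the', 'DT'),
          ('first', 'JJ'), ('major', 'JJ'), ('test', 'NN')]

# Pair each token with the POS code a lemmatizer call could use.
wordnet_pos = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_pos)
```

In the notebook, the pairs could plausibly come from `parsed_body.tags`, and each code could then be passed on as `Word(token).lemmatize(pos=code)`. Treat this as one possible approach, not the method used in the remaining cells.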

### Sentiment & Polarity

TextBlob provides polarity and subjectivity estimates for parsed documents using a lexicon provided by the Pattern library. This lexicon maps adjectives frequently found in product reviews to sentiment polarity scores ranging from -1 to +1 (negative ↔ positive) and to subjectivity scores ranging from 0 to +1 (objective ↔ subjective).

The .sentiment attribute provides the average of each score over the relevant tokens, whereas the .sentiment_assessments attribute additionally lists the underlying values for each token:

In [15]:
parsed_body.sentiment

Sentiment(polarity=0.088031914893617, subjectivity=0.46456433637284694)

In [14]:
parsed_body.sentiment_assessments

Sentiment(polarity=0.088031914893617, subjectivity=0.46456433637284694, assessments=[(['first'], 0.25, 0.3333333333333333, None), (['major'], 0.0625, 0.5, None), (['tries'], -0.1, 0.4, None), (['back'], 0.0, 0.0, None), (['winning'], 0.5, 0.75, None), (['likely'], 0.0, 1.0, None), (['back'], 0.0, 0.0, None), (['missing'], -0.2, 0.05, None), (['game'], -0.4, 0.4, None), (['better'], 0.5, 0.5, None), (['game'], -0.4, 0.4, None), (['likely'], 0.0, 1.0, None), (['little'], -0.1875, 0.5, None), (['teenage'], 0.0, 0.0, None), (['central'], 0.0, 0.25, None), (['future'], 0.0, 0.125, None), (['least'], -0.3, 0.4, None), (['unaffected'], -0.05, 0.1, None), (['particular'], 0.16666666666666666, 0.3333333333333333, None), (['more'], 0.5, 0.5, None), (['definitely'], 0.0, 0.5, None), (['hard'], -0.2916666666666667, 0.5416666666666666, None), (['next'], 0.0, 0.0, None), (['own'], 0.6, 1.0, None), (['first'], 0.25, 0.3333333333333333, None), (['half'], -0.16666666666666666, 0.16666666666666666, None
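
To make the averaging concrete, the following minimal sketch recomputes both scores from the first four assessment tuples shown above. It assumes a plain unweighted mean, as the text describes. Because it uses only four of the tuples, the result will not match the full-document scores.

```python
from statistics import mean

# First four assessments copied from the output above. Each tuple holds
# the matched tokens, a polarity score, a subjectivity score, and a
# fourth field that is None throughout this output.
assessments = [
    (['first'], 0.25, 0.3333333333333333, None),
    (['major'], 0.0625, 0.5, None),
    (['tries'], -0.1, 0.4, None),
    (['back'], 0.0, 0.0, None),
]

# Unweighted mean of each score over the assessed tokens (an assumption
# based on the description in the text).
polarity = mean(p for _, p, _, _ in assessments)
subjectivity = mean(s for _, _, s, _ in assessments)

print(f'polarity={polarity:.4f}, subjectivity={subjectivity:.4f}')
```

Tokens without an entry in the lexicon do not appear in the assessments at all, so they leave the average unchanged.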

### Combine TextBlob Lemmatization with `CountVectorizer`

In [13]:
def lemmatizer(text):
    words = TextBlob(text.lower()).words
    return [word.lemmatize() for word in words]

In [14]:
vectorizer = CountVectorizer(analyzer=lemmatizer, decode_error='replace')