# Text analysis using Python's TextBlob package - Overview

*This notebook runs in Python 3. Last update: 29 April 2020.*

## Setup

Please make sure to install the packages TextBlob and pandas by opening the Anaconda prompt (see webclips) and typing:

```
pip install -U textblob
python -m textblob.download_corpora
pip install -U pandas

```

#### package 'textblob'

This package implements several text mining methods in Python. Optional tutorials are available at https://textblob.readthedocs.io/en/dev/quickstart.html. Note that several functions in the TextBlob version we are using are **only implemented for text written in English**. For a list of extensions for other languages (to date: French and German), check this page: https://textblob.readthedocs.io/en/dev/extensions.html#available-extensions.

#### package 'pandas'
This package adds "data preparation" capabilities to Python (think of it as an "Excel in Python"). A great cheat sheet with common commands is available at:
- https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

For those who would like to dive deeper into pandas, please follow this class at Datacamp.com: "Manipulating DataFrames with pandas", available at https://www.datacamp.com/courses/manipulating-dataframes-with-pandas. For the basics of pandas, this notebook will be sufficient.

In [1]:
# Load the TextBlob package; it may take a few seconds to process
from textblob import TextBlob

# (1) Get acquainted with TextBlob

TextBlob is a package for automated text analysis of any text data that you load into Python/Jupyter Notebook. Here, you first familiarize yourself with this package.

To use the functionality of this powerful TextBlob package, you need to convert a text (given as a character string) to a so-called TextBlob object.

In [2]:
text = "First let me explain that I have never played this game. Now with that said, let me say that this game is probably terrible. The only person who will play this game is Brian B because he eats and sleeps soccer. He is THE soccer enthusiast. He refuses to buy or rent any other ps2 games, showing severe disregard for his friends' at work professional advice. He also is a huge rem and live fan. God forbid."
text_blob = TextBlob(text)

Now, you can run pre-specified "functions" on the newly created text_blob object (recall: text_blob is a variable that can be renamed to anything).

### Sentiment analysis

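One attribute you can access is `.sentiment`. It returns a polarity score (from -1, negative, to 1, positive) and a subjectivity score (from 0, objective, to 1, subjective).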

In [3]:
text_blob.sentiment

Sentiment(polarity=-0.14386363636363636, subjectivity=0.5408333333333333)

...or directly accessing the polarity score...

In [4]:
text_blob.sentiment.polarity

-0.14386363636363636

# (2) Data preprocessing

## Tokenization

### Break text into words

In a similar way, you can access the object's words as a list

In [5]:
text_blob.words

WordList(['First', 'let', 'me', 'explain', 'that', 'I', 'have', 'never', 'played', 'this', 'game', 'Now', 'with', 'that', 'said', 'let', 'me', 'say', 'that', 'this', 'game', 'is', 'probably', 'terrible', 'The', 'only', 'person', 'who', 'will', 'play', 'this', 'game', 'is', 'Brian', 'B', 'because', 'he', 'eats', 'and', 'sleeps', 'soccer', 'He', 'is', 'THE', 'soccer', 'enthusiast', 'He', 'refuses', 'to', 'buy', 'or', 'rent', 'any', 'other', 'ps2', 'games', 'showing', 'severe', 'disregard', 'for', 'his', 'friends', 'at', 'work', 'professional', 'advice', 'He', 'also', 'is', 'a', 'huge', 'rem', 'and', 'live', 'fan', 'God', 'forbid'])

...which you can use to loop through them (remember the loop from the Twitter parsing exercise?)

In [6]:
for word in text_blob.words:
    print(word)

First
let
me
explain
that
I
have
never
played
this
game
Now
with
that
said
let
me
say
that
this
game
is
probably
terrible
The
only
person
who
will
play
this
game
is
Brian
B
because
he
eats
and
sleeps
soccer
He
is
THE
soccer
enthusiast
He
refuses
to
buy
or
rent
any
other
ps2
games
showing
severe
disregard
for
his
friends
at
work
professional
advice
He
also
is
a
huge
rem
and
live
fan
God
forbid


### Break text into sentences

You can also access sentences...

In [7]:
text_blob.sentences

[Sentence("First let me explain that I have never played this game."),
 Sentence("Now with that said, let me say that this game is probably terrible."),
 Sentence("The only person who will play this game is Brian B because he eats and sleeps soccer."),
 Sentence("He is THE soccer enthusiast."),
 Sentence("He refuses to buy or rent any other ps2 games, showing severe disregard for his friends' at work professional advice."),
 Sentence("He also is a huge rem and live fan."),
 Sentence("God forbid.")]

...or write a loop to analyze the sentiment of each of these sentences.

In [8]:
for sentence in text_blob.sentences:
    print(sentence.sentiment)

Sentiment(polarity=-0.07500000000000001, subjectivity=0.3666666666666667)
Sentiment(polarity=-0.7, subjectivity=0.7)
Sentiment(polarity=-0.2, subjectivity=0.7)
Sentiment(polarity=0.0, subjectivity=0.0)
Sentiment(polarity=-0.012499999999999997, subjectivity=0.2375)
Sentiment(polarity=0.2681818181818182, subjectivity=0.7)
Sentiment(polarity=0.0, subjectivity=0.0)
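
To summarise these per-sentence scores, one option is to average them and to pick out the most negative sentence. The sketch below hard-codes the polarity values printed above (rounded to four decimals) so that it runs without TextBlob; the variable names are illustrative assumptions.

```python
from statistics import mean

# (sentence, polarity) pairs copied from the output above.
# With a blob at hand, they could be built as:
# [(str(s), s.sentiment.polarity) for s in text_blob.sentences]
sentence_scores = [
    ("First let me explain that I have never played this game.", -0.075),
    ("Now with that said, let me say that this game is probably terrible.", -0.7),
    ("The only person who will play this game is Brian B because he eats and sleeps soccer.", -0.2),
    ("He is THE soccer enthusiast.", 0.0),
    ("He refuses to buy or rent any other ps2 games, showing severe disregard for his friends' at work professional advice.", -0.0125),
    ("He also is a huge rem and live fan.", 0.2682),
    ("God forbid.", 0.0),
]

# Average polarity across sentences
average_polarity = mean(score for _, score in sentence_scores)

# Sentence with the lowest (most negative) polarity
most_negative = min(sentence_scores, key=lambda pair: pair[1])

print(round(average_polarity, 3))
print(most_negative[0])
```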


### Break text into n-grams

N-grams are sequences of *n* consecutive words occurring in a text. For instance, the 2-grams of a sentence are all pairs of adjacent words in that sentence.

The 2-grams of "I do not like this" are:
- I do
- do not
- not like
- like this

These are typically better at capturing sentiment than, let's say, the 1-grams of the same sentence:
- I
- do
- not
- like
- this.

The 1-grams above may not sufficiently capture the negativity of that sentence (*not* is negative, *like* is positive). With 2-grams, *DO NOT* and *NOT LIKE* are both clearly negative.
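
To make the definition concrete, here is a minimal plain-Python sketch that builds n-grams by sliding a window of length `n` over a list of words. The helper name `ngrams` is our own choice for illustration and is not TextBlob's implementation.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "I do not like this".split()

print(ngrams(words, 2))  # the four 2-grams listed above
print(ngrams(words, 1))  # the five 1-grams
```

If `n` is larger than the number of words, the function simply returns an empty list.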

Let's convert a sentence to a TextBlob, and output the n-grams (here: 2-grams)... run the cell below!

In [9]:
text_blob = TextBlob("I really do not like this.")
text_blob.ngrams(n=2)

[WordList(['I', 'really']),
 WordList(['really', 'do']),
 WordList(['do', 'not']),
 WordList(['not', 'like']),
 WordList(['like', 'this'])]

You could also print these ngrams (or do anything else that you learnt before...)

In [10]:
for ngram in text_blob.ngrams(n=2):
    print(ngram)

['I', 'really']
['really', 'do']
['do', 'not']
['not', 'like']
['like', 'this']


This looks a little odd, though, as the words are kept separate. Suppose you want to connect them again with a space (e.g., to export them to a CSV file); this is how you do it:

In [11]:
for ngram in text_blob.ngrams(n=2):
    combined_ngram = ' '.join(ngram)
    print(combined_ngram)

I really
really do
do not
not like
like this


...or write it to a file

In [14]:
# Write one n-gram per line; "with" closes the file automatically
with open('output_file.csv', 'w') as f:
    for ngram in text_blob.ngrams(n=2):
        combined_ngram = ' '.join(ngram)
        f.write(combined_ngram + '\n')
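
As an alternative sketch, Python's built-in `csv` module can write the same n-grams and takes care of quoting (for example, if an n-gram ever contained a comma). The list of 2-grams is hard-coded from the output above, and the file name and the temporary folder are assumptions made for illustration.

```python
import csv
import os
import tempfile

# 2-grams copied from the output above
bigrams = [["I", "really"], ["really", "do"], ["do", "not"],
           ["not", "like"], ["like", "this"]]

# Hypothetical output location; any writable path works
output_path = os.path.join(tempfile.gettempdir(), "ngrams_example.csv")

with open(output_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["ngram"])  # header row
    for bigram in bigrams:
        writer.writerow([" ".join(bigram)])

# Read the file back to check what was written
with open(output_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows)
```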

## Cleaning

Especially when working with textual data from the Internet, you may be interested in cleaning out "bad" words without meaning, such as `<HTML>` tags.

In [15]:
delete_words = ['<HTML>', '<br>']

text_blob = TextBlob("This awesome review <HTML> has been <br> downloaded from Amazon.com.")

for d in delete_words:
    text_blob = text_blob.replace(d + ' ', ' ')

# Let's also tidy up double spaces
text_blob = text_blob.replace('  ', ' ')

# Show the result
print(text_blob)
print(text_blob.sentiment)

This awesome review has been downloaded from Amazon.com.
Sentiment(polarity=1.0, subjectivity=1.0)
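
Listing every tag by hand quickly becomes tedious. A possible alternative, sketched here with Python's built-in `re` module, is to remove anything that looks like `<...>` and then collapse repeated whitespace. The helper name `strip_tags` is an assumption, and the pattern is a rough heuristic rather than a full HTML parser.

```python
import re

def strip_tags(text):
    """Replace anything that looks like <tag> with a space, then collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(without_tags.split())

raw = "This awesome review <HTML> has been <br> downloaded from Amazon.com."
print(strip_tags(raw))
```

The cleaned string can then be passed to `TextBlob(...)` as before.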


## Removing stop words

Similarly, the code above can be used to eliminate common words such as "a" or "the" that appear in most documents. For many languages (including Dutch), such words are available in a pre-defined list.

In [16]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hannesdatta/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [17]:
stopwords.words()

['إذ',
 'إذا',
 'إذما',
 'إذن',
 'أف',
 'أقل',
 'أكثر',
 'ألا',
 'إلا',
 'التي',
 'الذي',
 'الذين',
 'اللاتي',
 'اللائي',
 'اللتان',
 'اللتيا',
 'اللتين',
 'اللذان',
 'اللذين',
 'اللواتي',
 'إلى',
 'إليك',
 'إليكم',
 'إليكما',
 'إليكن',
 'أم',
 'أما',
 'أما',
 'إما',
 'أن',
 'إن',
 'إنا',
 'أنا',
 'أنت',
 'أنتم',
 'أنتما',
 'أنتن',
 'إنما',
 'إنه',
 'أنى',
 'أنى',
 'آه',
 'آها',
 'أو',
 'أولاء',
 'أولئك',
 'أوه',
 'آي',
 'أي',
 'أيها',
 'إي',
 'أين',
 'أين',
 'أينما',
 'إيه',
 'بخ',
 'بس',
 'بعد',
 'بعض',
 'بك',
 'بكم',
 'بكم',
 'بكما',
 'بكن',
 'بل',
 'بلى',
 'بما',
 'بماذا',
 'بمن',
 'بنا',
 'به',
 'بها',
 'بهم',
 'بهما',
 'بهن',
 'بي',
 'بين',
 'بيد',
 'تلك',
 'تلكم',
 'تلكما',
 'ته',
 'تي',
 'تين',
 'تينك',
 'ثم',
 'ثمة',
 'حاشا',
 'حبذا',
 'حتى',
 'حيث',
 'حيثما',
 'حين',
 'خلا',
 'دون',
 'ذا',
 'ذات',
 'ذاك',
 'ذان',
 'ذانك',
 'ذلك',
 'ذلكم',
 'ذلكما',
 'ذلكن',
 'ذه',
 'ذو',
 'ذوا',
 'ذواتا',
 'ذواتي',
 'ذي',
 'ذين',
 'ذينك',
 'ريث',
 'سوف',
 'سوى',
 'شتان',
 'عدا',
 'عسى',
 'عل'

In [18]:
text_blob = TextBlob("This awesome review <HTML> has been <br> downloaded from Amazon.com").lower()

for d in stopwords.words():
    text_blob = text_blob.replace(d.lower() + ' ', ' ')

# Let's also tidy up double spaces
text_blob = text_blob.replace('  ', ' ')

# Show the result
print(text_blob)
print(text_blob.sentiment)

t aw review <html>  <br> down fr amazon.com
Sentiment(polarity=-0.15555555555555559, subjectivity=0.2888888888888889)


Funnily enough, not much of the text remains. It is probably better to use only English stop words!

In [19]:
text_blob = TextBlob("This awesome review has been downloaded from Amazon.com").lower()

for d in stopwords.words('english'):
    text_blob = text_blob.replace(d.lower() + ' ', ' ')

text_blob = text_blob.replace('  ', ' ')
print(text_blob)


 awe review  downloade amazon.com
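
Note that the output above still shows damaged words ("awe", "downloade"). The reason is that `replace` matches character sequences anywhere in the text, including inside longer words. A minimal sketch of a token-based alternative compares whole words against the stop-word list instead. The small stop-word set below is an assumption made so that the example runs without NLTK.

```python
# Small illustrative stop-word set (an assumption); in practice,
# set(stopwords.words('english')) from NLTK could be used instead
stop_words = {"this", "has", "been", "from", "a", "the"}

text = "This awesome review has been downloaded from Amazon.com"

# Keep only those whole words that are not in the stop-word set
kept = [word for word in text.lower().split() if word not in stop_words]
cleaned = " ".join(kept)

print(cleaned)
```

Because whole tokens are compared, "awesome" and "downloaded" stay intact.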


## Spelling

### Correct spelling mistakes

Have large bodies of text to correct? Do it with TextBlob!

In [20]:
b = TextBlob("This sentenc contains a typo!")
print(b.correct())

His sentence contains a type!


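Note that even this English example is not corrected perfectly ("This" becomes "His", and "typo" becomes "type"). It works even less well in other languages (e.g., German).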

In [21]:
b = TextBlob("Dieser Setz enthalt einen Fehler.")
print(b.correct())

Wiser Met enthalt linen Dealer.


...or Dutch...

In [22]:
b = TextBlob("Deze sin klopt niet.")
print(b.correct())

Were sin kept net.


### Translation to different languages (internet connection needed)

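To safeguard against wrong spelling corrections, we can at least translate a text (e.g., a Tweet) to English or identify its language. You need an active internet connection to run this (as TextBlob's translation is powered by Google's Translate API). Note that, to our knowledge, newer TextBlob releases have deprecated `translate()` and `detect_language()`, so these cells may only run with the version used at the time of writing.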

In [23]:
dutch_blob = TextBlob('Dit is echt een fraaie game!')
dutch_blob.translate(to='en')

TextBlob("This is a really nice game!")

You can also use TextBlob to detect a blob's language:

In [24]:
dutch_blob.detect_language()

'nl'

## Stemming and lemmatization

TextBlob has built-in methods to reduce words to their singular form (`singularize`) or to their lemma (`lemmatize`).

In [25]:
sentence = TextBlob("Classes at many universities are currently all given online.")

for w in sentence.words:
    print(w.singularize())

Class
at
many
university
are
currently
all
given
online


In [26]:
from textblob import Word
w = Word("octopi")
print(w.lemmatize())

w = Word("went")
print(w.lemmatize("v"))  # Pass in WordNet part of speech (verb)

octopus
go


# (3) Common Tools to Analyze Text

## Entity extraction

### Single word counts

Got a dictionary of words whose occurrence you want to check? You can use it in combination with the `.count` method to count how often a word occurs.

In [27]:
TextBlob('This is a good episode').words.count('good')

1

### Multiple word counts

When using multiple words, you have to write a small loop.

In [28]:
text = "First let me explain that I have never played this game. Now with that said, let me say that this game is probably terrible. The only person who will play this game is Brian B because he eats and sleeps soccer. He is THE soccer enthusiast. He refuses to buy or rent any other ps2 games, showing severe disregard for his friends' at work professional advice. He also is a huge rem and live fan. God forbid."
text_blob = TextBlob(text)

wordcount = 0
target_words = ['work','professional']

for target in target_words:
    wordcount+=text_blob.words.count(target)
    
print(wordcount)

2


For example, the words "work" and "professional" together occur twice in the TextBlob above (once each).
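
If you also want to know how often each individual target word occurs, one option is `collections.Counter` from the standard library. The sketch below uses a rough regular-expression tokenizer (an assumption that will not match TextBlob's tokenization in every case) on the one sentence that contains the target words.

```python
from collections import Counter
import re

text = ("He refuses to buy or rent any other ps2 games, showing severe "
        "disregard for his friends' at work professional advice.")

# Rough tokenizer (an assumption): lowercase runs of letters, digits and apostrophes
tokens = re.findall(r"[a-z0-9']+", text.lower())
counts = Counter(tokens)

target_words = ["work", "professional"]

# Count per target word, plus the overall total
per_word = {target: counts[target] for target in target_words}
print(per_word)
print(sum(per_word.values()))
```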

### Word count by sentence

We may also be interested in counting the occurrence of words by sentence in the blob above.

In [29]:
text = "First let me explain that I have never played this game. Now with that said, let me say that this game is probably terrible. The only person who will play this game is Brian B because he eats and sleeps soccer. He is THE soccer enthusiast. He refuses to buy or rent any other ps2 games, showing severe disregard for his friends' at work professional advice. He also is a huge rem and live fan. God forbid."
text_blob = TextBlob(text)

wordcount = []
target_words = ['work','professional']

for sentence in text_blob.sentences:
    cnt = 0
    for target in target_words:
        cnt+=sentence.words.count(target)
    wordcount.append(cnt)

wordcount

[0, 0, 0, 0, 2, 0, 0]

The list above is of length 7 (the number of sentences in the TextBlob), and shows - per sentence - the word count of our target list.
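
The nested loop can also be written more compactly as a list comprehension. The following plain-Python sketch works on three of the seven sentences and relies on a naive sentence splitter and tokenizer; both are assumptions and are cruder than what TextBlob does.

```python
import re

text = ("He is THE soccer enthusiast. He refuses to buy or rent any other ps2 games, "
        "showing severe disregard for his friends' at work professional advice. God forbid.")
target_words = ["work", "professional"]

# Naive sentence splitter (an assumption): break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# For each sentence, add up the counts of all target words
wordcount = [
    sum(re.findall(r"[a-z0-9']+", s.lower()).count(t) for t in target_words)
    for s in sentences
]

print(wordcount)
```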

### Occurrence of n-grams

Of course, you can also tie in the word count algorithm from above with these n-grams.

In [30]:
text_blob = TextBlob("I really do not like this.")

wordcount = 0
target_words = ['not like']

for ngram in text_blob.ngrams(n=2):
    # Join the words of the n-gram and compare case-insensitively
    if ' '.join(ngram).lower() in target_words:
        wordcount += 1

print(wordcount)

1


### TextBlob's built-in sentiment analysis

In [31]:
text_blob = TextBlob("I really do not like this.")
text_blob.sentiment

Sentiment(polarity=-0.1, subjectivity=0.2)

In [32]:
text_blob.sentiment.subjectivity

0.2

## Using VADER for social media data

Please first run `pip install -U vaderSentiment`!

In [33]:
# initialize vader (see https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

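# VADER expects a plain string, so define the sentence explicitly
sentence = "Classes at many universities are currently all given online."
score = analyser.polarity_scores(sentence)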


In [34]:
out = analyser.polarity_scores("The phone is super cool.")
out

{'neg': 0.0, 'neu': 0.326, 'pos': 0.674, 'compound': 0.7351}

We can also extract individual scores

In [35]:
print(out['neg'])
print(out['neu'])
print(out['pos'])
print(out['compound'])

0.0
0.326
0.674
0.7351
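
To turn the compound score into a label, a common convention is to treat values of at least 0.05 as positive and values of at most -0.05 as negative. The sketch below applies these cut-offs to the dictionary printed above. The helper name `label_compound` is our own, and the threshold should be treated as an adjustable assumption.

```python
def label_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Score dictionary copied from the output above
out = {'neg': 0.0, 'neu': 0.326, 'pos': 0.674, 'compound': 0.7351}
print(label_compound(out))
```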


VADER even works on emojis...

In [36]:
out = analyser.polarity_scores("I am 😄 today")
out

{'neg': 0.0, 'neu': 0.522, 'pos': 0.478, 'compound': 0.6705}

...something that TextBlob alone can't do!

In [37]:
text_blob = TextBlob("I am 😄 today")
text_blob.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

## More stuff (optional)

## Named entity recognition

https://pythonprogramming.net/named-entity-recognition-stanford-ner-tagger/