# _Initial Exploration with `TextBlob`_

So we figured out how to use the Twitter API to download 3,199 of Donald Trump's most recent tweets (as of late afternoon on October 14, 2019). This is a good start as far as data retrieval goes for this project; however, simply downloading tweets to a CSV is only the foundation. The goal for the web app is to analyze users' tweets in (near) real time, so we'll need to dive further into things like how to store these tweets (perhaps in a database of some sort) and how to stream new tweets, add them to a data set, and update that storage.

That will be for another day though. Today's focus will be on something called sentiment analysis. Here's the definition according to _Wikipedia_:

    Sentiment analysis (also known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information. Sentiment analysis is widely applied to voice of the customer materials such as reviews and survey responses, online and social media, and healthcare materials for applications that range from marketing to customer service to clinical medicine.
    
There are quite a few NLP tools available to us in Python, but the first one we'll be checking out is called [`TextBlob`](https://textblob.readthedocs.io/en/dev/). In my experience, it is fairly easy to get up and running with this library, and it offers a nice range of features, including:

- Noun phrase extraction
- Part-of-speech tagging
- Sentiment analysis
- Classification (Naive Bayes, Decision Tree)
- Language translation and detection powered by Google Translate
- Tokenization (splitting text into words and sentences)
- Word and phrase frequencies
- Parsing
- n-grams
- Word inflection (pluralization and singularization) and lemmatization
- Spelling correction
- Add new models or languages through extensions
- WordNet integration

We'll start by loading in the CSV containing the information pulled from Twitter and then do a simple example with `textblob` to show some of the basic functionality.

As a reminder, there are a few things we need to pass to our `read_csv` call so that the `pandas` DataFrame comes back in the appropriate format:

1. Create a list of column names
2. Create a dictionary that ensures the `id` column is a `str` type
3. Pass `['created_at']` to `parse_dates` and `True` to `infer_datetime_format` so that `pandas` converts the `created_at` column from a string type to a (hopefully correctly formatted) datetime type
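The three settings above can be tried on a tiny in-memory CSV before pointing them at the real file. This is only a sketch: the two sample rows reuse values from the DataFrame preview in this post, the tweet text is a placeholder, and `infer_datetime_format` is left out because `pandas` 2.0 and later deprecate it (a consistent date format is inferred by default).

```python
import io

import pandas as pd

# Two rows with values taken from the DataFrame preview in this post.
# The tweet text is a placeholder because the preview truncates it.
sample_csv = io.StringIO(
    "realDonaldTrump,1183908206088728576,2019-10-15 00:51:26,"
    "Twitter for iPhone,3183,7994,placeholder text\n"
    "realDonaldTrump,1183900672892309505,2019-10-15 00:21:30,"
    "Twitter for iPhone,8409,24211,placeholder text\n"
)

column_names = ['username', 'id', 'created_at', 'source',
                'retweet_count', 'favorite_count', 'tweet']

sample = pd.read_csv(
    sample_csv,
    names=column_names,          # 1. the sample has no header row, so supply names
    dtype={'id': str},           # 2. keep tweet IDs as strings
    parse_dates=['created_at'],  # 3. convert created_at to a datetime column
)

print(sample.dtypes)
```

If `created_at` still shows up as a plain string column on the real file, that usually means at least one value could not be parsed as a date.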

In [1]:
# ensure we're in correct directory
import os

print('The current working directory is {}.'.format(os.getcwd()))

The current working directory is /Users/jai/Documents/projects/twitter-politics.


In [4]:
# import Path to make working with directories more manageable
from pathlib import Path

# store our main directory path in a variable in case we need to access/download information in that location directly
PATH = Path(os.getcwd())
print(PATH)

/Users/jai/Documents/projects/twitter-politics


In [6]:
# import pandas
import pandas as pd
# change the default setting so that all columns are shown when a DataFrame is displayed
pd.set_option('display.max_columns', None)

# list containing the column names we want to assign the dataframe
column_names = ['username', 'id', 'created_at', 'source', 'retweet_count', 'favorite_count', 'tweet']

# create dictionary with column name as keys and data types as the values
dtypes = {'id': str}

# import the Twitter data
data = pd.read_csv('data/realDonaldTrump_tweets.csv', names=column_names, dtype=dtypes, 
                   parse_dates=['created_at'], infer_datetime_format=True)

# see first few rows
data.head()

| | username | id | created_at | source | retweet_count | favorite_count | tweet |
|---|---|---|---|---|---|---|---|
| 0 | realDonaldTrump | 1183908206088728576 | 2019-10-15 00:51:26 | Twitter for iPhone | 3183 | 7994 | `b'\xe2\x80\x9cProject Veritas-Obtained Underco...` |
| 1 | realDonaldTrump | 1183900672892309505 | 2019-10-15 00:21:30 | Twitter for iPhone | 8409 | 24211 | `b'A big scandal at @ABC News. They got caught ...` |
| 2 | realDonaldTrump | 1183899559124189184 | 2019-10-15 00:17:05 | Twitter for iPhone | 6571 | 21131 | `b'Shifty Schiff now seems to think they don\xe...` |
| 3 | realDonaldTrump | 1183873633057476609 | 2019-10-14 22:34:04 | Twitter Media Studio | 11785 | 29106 | `b'"The House gone rogue! I want to remind you ...` |
| 4 | realDonaldTrump | 1183869954640228352 | 2019-10-14 22:19:27 | Twitter Media Studio | 10572 | 27167 | `b'"It doesn\'t speak for the FULL HOUSE becaus...` |


In [7]:
# ensure data types in each column are appropriate
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 7 columns):
username          3199 non-null object
id                3199 non-null object
created_at        3199 non-null datetime64[ns]
source            3199 non-null object
retweet_count     3199 non-null int64
favorite_count    3199 non-null int64
tweet             3199 non-null object
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 175.0+ KB


### _Simple `TextBlob` Example_

We'll now take the text from the third tweet and showcase a few basic features from the `TextBlob` library, including:

- `tags`
- `noun_phrases`
- `polarity`

And we'll even see what happens when we try to translate the tweet into Spanish!

In [25]:
# import TextBlob
from textblob import TextBlob

# pull the third tweet
text = data['tweet'][2]

# create textblob object
blob = TextBlob(text)

# print out part-of-speech tags as (word, tag) pairs
print(blob.tags)

[("b'Shifty", 'JJ'), ('Schiff', 'NNP'), ('now', 'RB'), ('seems', 'VBZ'), ('to', 'TO'), ('think', 'VB'), ('they', 'PRP'), ('don\\xe2\\x80\\x99t', 'VBP'), ('need', 'VBP'), ('the', 'DT'), ('Whistleblower', 'NNP'), ('who', 'WP'), ('started', 'VBD'), ('the', 'DT'), ('whole', 'JJ'), ('Scam', 'NNP'), ('The', 'DT'), ('reason', 'NN'), ('is', 'VBZ'), ('that\\xe2\\x80\\xa6', 'JJ'), ('https', 'NN'), ('//t.co/nuvZedQ0M4', 'NN')]


In [26]:
# print out noun phrases
print(blob.noun_phrases)

['schiff', 'don\\xe2\\x80\\x99t need', 'whistleblower', 'scam', 'that\\xe2\\x80\\xa6 https']


In [27]:
# return polarity of the tweet, a float from -1.0 (most negative) to 1.0 (most positive)
print(blob.sentiment.polarity)

0.2


In [28]:
# translate to Spanish
blob.translate(to="es")

TextBlob("b 'Shifty Schiff ahora parece pensar que no \ xe2 \ x80 \ x99t necesitan al Denunciante, que comenzó toda la estafa. La razón es que \ xe2 \ x80 \ xa6 https://t.co/nuvZedQ0M4 '")
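The `b'` prefix and the `\xe2\x80\x99` sequences in the outputs above suggest that each tweet was written to the CSV as the printed form of a Python `bytes` object rather than as decoded text. Assuming that is the case, one possible cleanup (not something done in this notebook) is to parse the literal back into bytes with `ast.literal_eval` and decode it as UTF-8. The helper name `decode_tweet` is hypothetical.

```python
import ast


def decode_tweet(raw):
    # Hypothetical helper: assumes `raw` is the printed form of a UTF-8
    # encoded bytes object, e.g. the text b'...' stored as a plain string.
    # Anything that does not parse that way is returned unchanged.
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw
    if isinstance(value, bytes):
        return value.decode('utf-8', errors='replace')
    return raw


# A fragment of the third tweet, closed with a quote so it parses on its own.
raw = r"b'Shifty Schiff now seems to think they don\xe2\x80\x99t need the Whistleblower'"
print(decode_tweet(raw))
```

Applied to the whole column with something like `data['tweet'].apply(decode_tweet)` before building the `TextBlob`, this should keep the stray `b'` and byte escapes out of the tags, noun phrases, and translation.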