### LSE Data Analytics Online Career Accelerator 

# DA301:  Advanced Analytics for Organisational Impact

## Practical activity: Pre-processing textual data from social media

**This is the solution to the activity.**

We will continue working with the data analytics team at Tumble Confectionery. Recall that the product line includes a range of chocolate products in unusual flavour combinations, and the company is using social media to research potential new flavours.

The product manager has a hunch that a cheesecake flavour would be a good addition to the product line. You have been asked to research the sentiment towards cheesecake on Twitter. We will look at some tweets about cheesecake straight from Twitter and apply natural language processing steps in order to comprehend the data at scale. Your objective is to:

- identify positive and negative sentiments related to cheesecake
- use the polarity score function and identify related words
- visualise the output to present back to the business to help them decide on adding a flavour to their product line.

# Pre-processing
##  Prepare your workstation

In [None]:
# If needed, install the libraries.
!pip install pyyaml
!pip install twitter
!pip install pandas

In [None]:
# Copy the YAML file and your Twitter keys over to this Jupyter Notebook before you start to work.
import yaml
from yaml.loader import SafeLoader
from twitter import *
import time

# Import the yaml file – remember to specify the whole path and use / between directories.
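# Open the file in a with block so that it is closed once the credentials are loaded.
with open('twitter.yaml', 'r') as creds_file:
    twitter_creds = yaml.safe_load(creds_file)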
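
Before passing the credentials on, it can help to confirm that the loaded dictionary holds the four keys used in the next cell. The sketch below assumes the key names `access_token`, `access_token_secret`, `api_key` and `api_secret_key` used in this notebook. `missing_credentials` is a hypothetical helper, and the values shown are placeholders rather than real keys.

```python
# Key names taken from the OAuth call in this notebook.
REQUIRED_KEYS = {'access_token', 'access_token_secret', 'api_key', 'api_secret_key'}

def missing_credentials(creds):
    """Return the sorted list of required keys that are absent or empty."""
    return sorted(key for key in REQUIRED_KEYS if not creds.get(key))

# Placeholder values for illustration only.
example_creds = {'access_token': 'xxxx', 'access_token_secret': 'xxxx', 'api_key': 'xxxx'}

# View output.
print(missing_credentials(example_creds))
```

An empty list means that all four keys are present and non-empty.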

In [None]:
# Pass your Twitter credentials.
twitter_api = Twitter(auth=OAuth(twitter_creds['access_token'],
                                 twitter_creds['access_token_secret'], 
                                 twitter_creds['api_key'],
                                 twitter_creds['api_secret_key'] ))

In [None]:
# See whether you are connected.
print(twitter_api)

In [None]:
# Run a test with #python.
python_tweets = twitter_api.search.tweets(q='#python')

# View output.
print(python_tweets)

## 1. Test connection

In [None]:
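# Query the term cheesecake.
q = {'q':'cheesecake', 'count':100, 'result_type':'recent'}
results = []

# Collect up to 30 pages of results and stop early when there is no next page.
while len(results) < 30:
    query = twitter_api.search.tweets(**q)
    # Store the page before looking for the next one so that the last page is kept.
    results.append(query)
    try:
        q['max_id'] = query['search_metadata']['next_results'].split('&')[0].split('?max_id=')[1]
    except (KeyError, IndexError):
        break
    
# Determine the number of result pages (each page holds up to 100 tweets).
len(results)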
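
The loop above pulls `max_id` out of `next_results` with string splits, which relies on `max_id` being the first parameter. A minimal sketch of a more defensive alternative uses the standard library's `urllib.parse`. `extract_max_id` is a hypothetical helper, and the sample string is made up to resemble the query string the search endpoint returns.

```python
from urllib.parse import parse_qs

def extract_max_id(next_results):
    """Return the max_id value from a next_results query string, or None."""
    # parse_qs expects the query without the leading '?'.
    params = parse_qs(next_results.lstrip('?'))
    values = params.get('max_id')
    return values[0] if values else None

# Illustrative value only, not taken from a real response.
sample = '?max_id=1234567890&q=cheesecake&count=100&result_type=recent'

# View output.
print(extract_max_id(sample))
```

Returning `None` when the parameter is missing gives the calling loop an explicit signal to stop, instead of relying on an exception.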

## 2. Create DataFrames

In [None]:
# Import pandas to join the DataFrames.
import pandas as pd

# Concat DataFrames.
results_list_pd = pd.concat([pd.DataFrame(_['statuses']) for _ in results])

# View shape of output.
results_list_pd.shape

In [None]:
# Extract the tweet text as an array of values.
results_list_values = results_list_pd['text'].values

## 3. Investigate tweets

In [None]:
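# Import nltk and the required resources.
import nltk
from nltk.corpus import stopwords
from nltk.corpus import words
from nltk.tokenize import word_tokenize

# Download the stop words, the English word list and the tokeniser models.
# Recent NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('words')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))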

In [None]:
# Look at one raw tweet.
results_list_values[1]

In [None]:
# Split up each tweet into individual words.
results_list_values_token = [word_tokenize(_) for _ in results_list_values]

In [None]:
# Get a list of all English words so we can exclude anything that doesn't appear on the list.
all_english_words = set(words.words())

In [None]:
# Some pre-processing:
#-- Let's get every word.
#-- Let's convert it to lowercase.
#-- Only include the word if it is not a stop word, is alphabetic, and is in the list of English words.

results_list_values_token_nostop =\
[[y.lower() for y in x if y.lower() not in stop_words and y.isalpha() and y.lower() in all_english_words]\
 for x in results_list_values_token]
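
To see what the filter keeps, here is a small self-contained sketch of the same list comprehension. It assumes a tiny hand-made stop word set and vocabulary in place of the NLTK resources, and pre-split tokens in place of `word_tokenize`. All `toy_` names and the sample tweet are made up for illustration.

```python
# Stand-ins for the NLTK stop word list and English word list (assumptions).
toy_stop_words = {'i', 'this', 'is', 'the'}
toy_english_words = {'love', 'cheesecake', 'best', 'dessert'}

# One made-up tweet, already split into tokens.
toy_tokens = [['I', 'LOVE', 'this', '#', 'cheesecake', '!', 'sooo', 'yummy', '2day']]

# Same logic as the cell above: lowercase, drop stop words,
# and keep alphabetic tokens that appear in the vocabulary.
toy_clean = [[y.lower() for y in x
              if y.lower() not in toy_stop_words
              and y.isalpha()
              and y.lower() in toy_english_words]
             for x in toy_tokens]

# View output.
print(toy_clean)
```

Informal spellings such as "sooo" are dropped because they are not in the vocabulary, so some sentiment-bearing words can be lost at this step.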

In [None]:
# Let's have a look at the same tweet as above.
results_list_values_token_nostop[1]

# NLTK sentiment analysis 
## 1. Prepare your workstation
> Run the previous code snippets.

## 2. Import NLTK

In [None]:
# Import the prebuilt rules and values of the vader lexicon.
nltk.download('vader_lexicon')

In [None]:
# Import the vader class and create an object of the analyser called darth_vader.
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Create a variable to store the SentimentIntensityAnalyzer (sia).
darth_vader = SentimentIntensityAnalyzer()

In [None]:
# Run through a dictionary comprehension to take every cleaned tweet. 
# Next run the polarity score function on the string.
# This returns a dictionary with four values: neg, neu, pos and compound.

results_list_values_token_nostop_polarity =\
{" ".join(_) : darth_vader.polarity_scores(" ".join(_)) for _ in results_list_values_token_nostop}
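
Because the cleaned tweet is used as the dictionary key, tweets that reduce to the same string (retweets, for example) end up as a single entry. The sketch below shows this with a stand-in scoring function, `fake_scores`, which is hypothetical and only mimics the shape of the output. It is not VADER, and the cleaned tweets are made up.

```python
def fake_scores(text):
    """Stand-in for polarity_scores: returns a dictionary of the same shape."""
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

# Three made-up cleaned tweets, two of which are identical.
toy_clean = [['love', 'cheesecake'], ['love', 'cheesecake'], ['cheesecake']]

toy_polarity = {" ".join(_): fake_scores(" ".join(_)) for _ in toy_clean}

# View output: three tweets in, two keys out.
print(len(toy_clean), len(toy_polarity))
print(list(toy_polarity))
```

If every tweet needs to be counted, including duplicates, a list of results would preserve them where a dictionary does not.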

## 3. Create a Pandas DataFrame

In [None]:
# Convert the dictionary of polarity results to a pandas DataFrame. 
# The index is the cleaned tweet.
# Each row holds the four polarity scores for that tweet.

polarity_pd = pd.DataFrame(results_list_values_token_nostop_polarity).T
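
A small sketch of what the transpose does, using two made-up entries rather than real scores: the outer keys of a dictionary of dictionaries become columns, so `.T` is needed to turn each cleaned tweet into a row. `pd.DataFrame.from_dict` with `orient='index'` is shown as an alternative that gives the same layout.

```python
import pandas as pd

# Made-up polarity values for illustration only.
toy_polarity = {
    'love cheesecake': {'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.6},
    'cheesecake': {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
}

# Outer keys become columns, so transpose to get one row per cleaned tweet.
toy_pd = pd.DataFrame(toy_polarity).T

# Same layout without the transpose.
toy_pd_alt = pd.DataFrame.from_dict(toy_polarity, orient='index')

# View shape of output.
print(toy_pd.shape)
```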

In [None]:
# With the non-alphabetic tokens (the emojis, handles and hashtags) and the stopwords removed,
# some of the most positive cleaned tweets are single words.

# Get the top five most positive cleaned tweets related to cheesecake.
polarity_pd.sort_values('pos', ascending=False).head(5)

In [None]:
# Get the top five most negative cleaned tweets related to cheesecake.
polarity_pd.sort_values('neg', ascending=False).head(5)

In [None]:
# The describe function on the compound column shows the distribution and its moments. 
# In the data sampled here, the mean is about 0.1, so slightly positive.
polarity_pd['compound'].describe()
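
To turn compound scores into labels, a common convention is to treat scores of 0.05 and above as positive, scores of -0.05 and below as negative, and anything in between as neutral. The sketch below applies that convention to made-up scores. `label_compound` is a hypothetical helper, and the threshold is an assumption that can be adjusted.

```python
from collections import Counter

def label_compound(score, threshold=0.05):
    """Label a compound score as positive, negative or neutral."""
    if score >= threshold:
        return 'positive'
    if score <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores for illustration only.
toy_compound = [0.64, 0.0, 0.0, -0.71, 0.32, 0.0, 0.04]

# Count how many scores fall under each label.
toy_counts = Counter(label_compound(score) for score in toy_compound)

# View output.
print(toy_counts)
```

Counts like these are often easier to present back to the business than the raw distribution.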

## 4. Plot the output

In [None]:
# Sometimes the best way to understand the data is to plot it. 
# In the data sampled here many of the values are 0.
# There are fewer negative values than positive ones, but the negative values are highly negative.

%matplotlib inline
import matplotlib.pyplot as plt

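_plot = polarity_pd.reset_index()['compound'].sort_values()
# Keep the Axes returned by plot() so that the x-axis labels can be hidden.
# Calling plt.axes() here would add a new, empty Axes on top of the chart.
ax1 = _plot.plot(kind='bar')
ax1.get_xaxis().set_visible(False)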

plt.show()
plt.close()

In [None]:
# The boxplot is a nice way to see how many values sit on the edges as outliers.
_plot = polarity_pd.reset_index()['compound'].sort_values()
_plot.plot(kind='box')
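
By default, the whiskers of a Matplotlib boxplot extend to the last data point within 1.5 times the interquartile range (IQR) of the box, and points beyond that are drawn as outliers. The following is a minimal sketch of that rule with the standard library, using made-up compound scores. `toy_compound` and the other names are illustrative.

```python
import statistics

# Made-up compound scores for illustration only.
toy_compound = [-0.8, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4, 0.9]

# Quartiles with linear interpolation between data points.
q1, median, q3 = statistics.quantiles(toy_compound, n=4, method='inclusive')
iqr = q3 - q1

# Points outside these fences are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [score for score in toy_compound if score < lower_fence or score > upper_fence]

# View output.
print(outliers)
```

When many compound scores are exactly 0, the box becomes narrow and even moderately positive or negative tweets can appear as outliers.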