# Sentiment Analysis

This notebook revisits the PyOhio 2018 tutorial "Natural Language Processing in Python".
<br> Credit goes to https://github.com/adashofdata/nlp-in-python-tutorial

So far, all of the analysis we've done has been generic - counting, creating scatter plots and wordclouds, etc. 

When it comes to text data, there are a few popular techniques that are specific to text. One of them is sentiment analysis.
<br>
Below are a few key points to remember about sentiment analysis.

1. **TextBlob Module:** Linguistic researchers have labeled the sentiment of words based on their domain expertise. The sentiment of a word can vary based on where it appears in a sentence. The TextBlob module allows us to take advantage of these labels.
2. **Sentiment Labels:** Each word in a corpus is labeled in terms of polarity and subjectivity (there are more labels as well, but we're going to ignore them for now). A corpus' sentiment is the average of these.
   * **Polarity**: How positive or negative a word is. -1 is very negative. +1 is very positive.
   * **Subjectivity**: How subjective, or opinionated, a word is. 0 is fact. +1 is very much an opinion. Keep in mind that labeling subjectivity is itself somewhat subjective.

For more info, see this explanation of how TextBlob coded up its [sentiment function](https://planspace.org/20150607-textblob_sentiment/).
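
To make the "average of word labels" idea concrete, here is a minimal sketch built on a made-up two-word lexicon. The lexicon values and the `toy_sentiment` helper are assumptions for illustration only. TextBlob's real lexicon is far larger, and its scoring also handles cases such as negation and intensifiers, which this sketch ignores.

```python
# Hypothetical mini-lexicon: word -> (polarity, subjectivity).
# These values are invented for illustration; they are NOT TextBlob's labels.
TOY_LEXICON = {
    "great": (0.5, 0.5),
    "terrible": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the (polarity, subjectivity) labels of the words found in the lexicon."""
    # Lowercase each word and strip simple punctuation before the lookup
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    if not scores:
        # No labeled words: treat the text as neutral and factual
        return 0.0, 0.0
    polarity = sum(p for p, _ in scores) / len(scores)
    subjectivity = sum(s for _, s in scores) / len(scores)
    return polarity, subjectivity

print(toy_sentiment("The show was great, but the ending was terrible."))
# (-0.25, 0.75)
```

Only "great" and "terrible" are in the lexicon, so the result is the average of their two labels. Unlabeled words do not affect the score.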

Let's take a look at the sentiment of the various transcripts, both overall and throughout the comedy routine.

## Sentiment of Routine

In [None]:
# Start with reading the corpus
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

In [None]:
# Create lambda functions to find the polarity and subjectivity of each Transcript
# Reading for lambda function=> https://dbader.org/blog/python-lambda-functions
# Install textblob with Terminal/Anaconda Navigator: conda install -c conda-forge textblob
from textblob import TextBlob

pol = lambda x: TextBlob(x).sentiment.polarity
sub = lambda x: TextBlob(x).sentiment.subjectivity

# Apply the polarity and subjectivity to the data
data['Polarity'] = data['Transcript'].apply(pol)
data['Subjectivity'] = data['Transcript'].apply(sub)
data

In [None]:
# Let's plot the results
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [10, 8]

for comedian in data.index:
    # .loc selects by index label, so look up each comedian's values by name
    x = data.Polarity.loc[comedian]
    y = data.Subjectivity.loc[comedian]
    plt.scatter(x, y, color='blue')
    plt.text(x+.001, y+.001, data['Full_name'].loc[comedian], fontsize=10)

plt.xlim(-.01, .12) # Set the x limits of the current axes

plt.title('Sentiment Analysis', fontsize=15)
plt.xlabel('<-- Negative ----[Polarity]---- Positive -->', fontsize=15)
plt.ylabel('<-- Facts ----[Subjectivity]---- Opinions -->', fontsize=15)

plt.show()

## Sentiment of Routine Over Time

Instead of looking at the overall sentiment, let's see if there's anything interesting about the sentiment over time throughout each routine.

In [None]:
# Split each routine into 10 parts
import math
import numpy as np

def split_text(text, n=10):
    '''Takes in a string of text and splits it into n equally sized pieces,
       with a default of 10. Any leftover characters at the end
       (fewer than n) are dropped.'''

    # Calculate the length of the text and the size of each chunk
    length = len(text)
    size = math.floor(length / n)
    if size == 0:
        raise ValueError('text must contain at least n characters')

    # Starting position of each chunk; keep only the first n in case
    # the length is not an exact multiple of n
    start = np.arange(0, length, size)[:n]

    # Pull out equally sized pieces of text and put them into a list
    return [text[s:s + size] for s in start]
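
Because `split_text` floors the chunk size, up to n-1 trailing characters of a transcript are left out of the pieces. For long transcripts this hardly matters, but if every character should be kept, one possible alternative is sketched below. The name `split_evenly` is hypothetical and not part of the original tutorial.

```python
def split_evenly(text, n=10):
    """Split text into n contiguous pieces whose lengths differ by at most one character."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # Piece i spans [i * len // n, (i + 1) * len // n), so the pieces tile the whole string
    bounds = [i * len(text) // n for i in range(n + 1)]
    return [text[bounds[i]:bounds[i + 1]] for i in range(n)]

sample = "abcdefghijklmnopqrstuvw"  # 23 characters
pieces = split_evenly(sample, n=10)
print(len(pieces), [len(p) for p in pieces])
# 10 [2, 2, 2, 3, 2, 2, 3, 2, 2, 3]
```

Joining the pieces reproduces the original string, so no text is lost. The trade-off is that the pieces are no longer exactly the same length.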

In [None]:
# Let's take a look at our data again
data

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces_p = []
for tp in data.Transcript:
    splitp = split_text(tp)
    list_pieces_p.append(splitp)
    
list_pieces_p

In [None]:
len(list_pieces_p)
# The list has 12 elements (comedians), one for each transcript

In [None]:
#Each transcript has been split into 10 pieces of text
len(list_pieces_p[0])

### Calculate the polarity for each piece of text

In [None]:
# Calculate the polarity for each piece of text

polarity_transcript = []
for lp in list_pieces_p:
    polarity_piece = []
    for p in lp:
        polarity_piece.append(TextBlob(p).sentiment.polarity)
    polarity_transcript.append(polarity_piece)
    
polarity_transcript

In [None]:
# Show the plot of Ali Wong's polarity
plt.plot(polarity_transcript[0])
plt.title(data['Full_name'].iloc[0])  # .iloc[0] is the first comedian's full name, not the index label
plt.show()

In [None]:
# Show the plot for all comedians
plt.rcParams['figure.figsize'] = [16, 12]

for index, comedian in enumerate(data.index):    
    plt.subplot(3, 4, index+1)
    plt.plot(polarity_transcript[index])
    xmin, xmax = 0, 9
    plt.hlines([0], xmin, xmax, "orange", linestyles='solid') 
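    plt.title(data['Full_name'].iloc[index])  # positional lookup, since index is an integer position
    plt.ylim(-.2, .3)  # positional limits; the ymin/ymax keywords are not accepted by newer Matplotlib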
    
plt.show()

Ali Wong stays generally positive throughout her routine. Louis C.K. and Mike Birbiglia show a similar pattern.

On the other hand, we can see different patterns in Bo Burnham, who becomes more positive as his routine goes on, and Dave Chappelle, who has some pretty down moments in his routine.
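
Reading trends off small plots is subjective, so one way to back up a statement like "gets more positive over time" is to fit a straight line through the ten scores and look at its slope. The sketch below is a plain least-squares slope. The `trend_slope` helper and the sample values are assumptions for illustration, not results from the transcripts.

```python
def trend_slope(scores):
    """Least-squares slope of scores against their position (0, 1, 2, ...)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    # Covariance of position and score, divided by the variance of position
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

# Made-up polarity values for illustration only
rising = [0.0, 0.1, 0.2, 0.3, 0.4]
flat = [0.1, 0.1, 0.1, 0.1, 0.1]
print(trend_slope(rising) > 0, abs(trend_slope(flat)) < 1e-12)
# True True
```

In the notebook, something like `[trend_slope(p) for p in polarity_transcript]` would give one slope per comedian. A positive slope suggests the routine becomes more positive over time, though ten points is too few to read much into small values.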

### Calculate the subjectivity for each piece of text

In [None]:
# Let's create a list to hold all of the pieces of text
list_pieces_s = []
for ts in data.Transcript:
    splits = split_text(ts)
    list_pieces_s.append(splits)
    
list_pieces_s

In [None]:
subjectivity_transcript = []
for lp2 in list_pieces_s:
    subjectivity_piece = []
    for ps in lp2:
        subjectivity_piece.append(TextBlob(ps).sentiment.subjectivity)
    subjectivity_transcript.append(subjectivity_piece)
    
subjectivity_transcript

In [None]:
# Show the plot for all comedians
plt.rcParams['figure.figsize'] = [16, 12]

for index, comedian in enumerate(data.index):    
    plt.subplot(3, 4, index+1)
    plt.plot(subjectivity_transcript[index])
    xmin, xmax = 0, 9
    plt.hlines([0.5], xmin, xmax, "orange", linestyles='solid')  # reference line at 0.5
    plt.title(data['Full_name'].iloc[index])  # positional lookup, since index is an integer position
    plt.ylim(.2, .8)  # positional limits; the ymin/ymax keywords are not accepted by newer Matplotlib
    
plt.show()

We can see that Joe Rogan becomes more subjective as his routine progresses.