In [None]:
# Lesson 01 - Custom Dataset Exploration of Sentiment and Text Analytics
#
# Owner:  Lorrie Tomek
#
# In the code below, I load my dataset from a public GitHub repo that I created.
# This can be quite convenient.
#
# In this notebook, you will use your own sentiment dataset for your chosen bot
# domain.
#
# Explore your utterances using Text Analytics.
# Explore the sentiment predictions made by VADER as compared to your own "golden" labels.

In [None]:
# use pip to install the nltk library
! pip install nltk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# import python libraries
import pandas as pd
import nltk
import sys
from pprint import pprint
from random import shuffle

In [None]:
# download some of the many nltk resources, we need vader_lexicon, not necessarily others
nltk.download([
  "names",
  "stopwords",
  "state_union",
  "twitter_samples",
  "movie_reviews",
  "averaged_perceptron_tagger",
  "vader_lexicon",
  "punkt",
  "shakespeare"
])

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package state_union to /root/nltk_data...
[nltk_data]   Package state_union is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   

True

## Load your Dataset of Sentiment Data

Here's a link that describes 3 ways to access your datasets from within Google Colab:

https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

I am going to demonstrate using a CSV file that is checked into my own public GitHub repo, as it is convenient.

Find the URL for your CSV file in a GitHub repo. Make sure you choose the "Raw" view, or you will get a parse error when pandas tries to read_csv, because the regular page URL returns HTML rather than CSV.
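
If you only have the regular GitHub page URL for a file, the raw URL can usually be derived from it. The sketch below is a hypothetical helper (`to_raw_url` is not part of pandas or any GitHub tooling). It assumes the common `https://github.com/<user>/<repo>/blob/<branch>/<path>` layout and a branch name without slashes. The example page URL is an assumption reconstructed from the raw URL used in this notebook.

```python
from urllib.parse import urlparse


def to_raw_url(page_url):
    """Convert a GitHub "blob" page URL to its raw.githubusercontent.com form.

    Hypothetical helper; assumes the layout
    https://github.com/<user>/<repo>/blob/<branch>/<path>
    """
    parts = urlparse(page_url)
    if parts.netloc == "raw.githubusercontent.com":
        return page_url  # already a raw URL, nothing to do
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {page_url}")
    user, repo, _, branch = segments[:4]
    path = "/".join(segments[4:])
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"


# Assumed page URL, reconstructed from the raw URL used in this notebook
page_url = "https://github.com/lorrieteaching/sampledata/blob/main/sentiment.csv"
raw_url = to_raw_url(page_url)
print(raw_url)
```

The resulting string can be passed straight to `pd.read_csv`.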

In [None]:
# The URL of the CSV file in my GitHub repo. Use the "Raw" view, or read_csv will fail with a parse error.
# If you run into trouble, you can use the direct upload described in the link above.

url = 'https://raw.githubusercontent.com/lorrieteaching/sampledata/main/sentiment.csv'
df = pd.read_csv(url)

In [None]:
# or you can upload your data
#df2 = pd.read_csv("sentiment_v2.csv")

In [None]:
#df2[:5]

In [None]:
#from google.colab import drive
#drive.mount('/content/drive')

In [None]:
df[:5]

Unnamed: 0,utterance,pos_neg,pos_neg_neutral
0,What time does your store open?,positive,neutral
1,What locations do you have?,positive,neutral
2,Are you open on Christmas?,positive,neutral
3,What is your phone number?,positive,neutral
4,Your customer service is poor.,negative,negative


In [None]:
# What are the columns of my dataframe? 
df.columns

Index(['utterance', 'pos_neg', 'pos_neg_neutral'], dtype='object')

In [None]:
# load the VADER Sentiment Analyzer and make sure it works
from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sia.polarity_scores("Wow, NLTK is really powerful!")

{'neg': 0.0, 'neu': 0.295, 'pos': 0.705, 'compound': 0.8012}

In [None]:
# Create new columns holding VADER's predictions for the utterances in
# our custom dataset: the full score dict and the 'pos' score on its own.
def myfunc(utterance):
  return sia.polarity_scores(utterance)['pos']

df['vader_predict'] = df.utterance.map(sia.polarity_scores)
df['vader_predict_pos'] = df.utterance.map(myfunc)

In [None]:
# this shows the first 5
df[:5]

Unnamed: 0,utterance,pos_neg,pos_neg_neutral,vader_predict,vader_predict_pos
0,What time does your store open?,positive,neutral,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0
1,What locations do you have?,positive,neutral,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0
2,Are you open on Christmas?,positive,neutral,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0
3,What is your phone number?,positive,neutral,"{'neg': 0.0, 'neu': 0.755, 'pos': 0.245, 'comp...",0.245
4,Your customer service is poor.,negative,negative,"{'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'comp...",0.0


## Do Various Text Analytics on Your Dataset

Look back at the Text Analytics Notebook.  
Explore YOUR Dataset Using Text Analytics in this notebook in the section below. 

What did you find that was interesting or unexpected? 
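
One possible starting point is a simple word-frequency count. The sketch below is dependency-free and uses the five sample utterances shown earlier. The stopword set and the `tokenize` helper are illustrative assumptions, not part of NLTK; `nltk.corpus.stopwords` and `nltk.FreqDist` could be swapped in once the resources above are downloaded.

```python
import re
from collections import Counter

# Sample utterances copied from the first rows of the dataset shown earlier;
# in the notebook you would use df.utterance.tolist() instead.
utterances = [
    "What time does your store open?",
    "What locations do you have?",
    "Are you open on Christmas?",
    "What is your phone number?",
    "Your customer service is poor.",
]

# Tiny illustrative stopword list (an assumption, not NLTK's list)
STOPWORDS = {"what", "does", "your", "do", "you", "are", "on", "is", "have"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


# Flatten all utterances into one token list, dropping stopwords
tokens = [t for u in utterances for t in tokenize(u) if t not in STOPWORDS]
word_counts = Counter(tokens)
print(word_counts.most_common(5))
```

Frequent content words give a quick picture of what your users ask about, and they can hint at why VADER scores some utterances as neutral.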


In [None]:
# Write your Text Analytics Code Here

## Compare Your GOLDEN Labels for Sentiment to VADER's Predictions

Modify the Code Below to use YOUR dataset.

Calculate the number of times VADER predicted correctly and incorrectly for each of positive sentiment, negative sentiment, and neutral sentiment. (Your "Golden" labels are the correct sentiment, as they were labeled by a human expert, YOU).

Experiment with this. You can try other examples and add them to your dataset and re-run.

You will write a discussion board post about your experience.

In [None]:
df.shape

(5, 4)

In [None]:
# VADER scores each utterance as positive, negative, and neutral,
# so print its scores next to the golden label for comparison.
for _, row in df.iterrows():
   print(f"Utterance: {row.utterance}\nVADER Predict: {row.vader_predict}\nGolden Label:  {row.pos_neg_neutral}\n")

Utterance: What time does your store open?
VADER Predict: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Golden Label:  neutral

Utterance: What locations do you have?
VADER Predict: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Golden Label:  neutral

Utterance: Are you open on Christmas?
VADER Predict: {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Golden Label:  neutral

Utterance: What is your phone number?
VADER Predict: {'neg': 0.0, 'neu': 0.755, 'pos': 0.245, 'compound': 0.0772}
Golden Label:  neutral

Utterance: Your customer service is poor.
VADER Predict: {'neg': 0.437, 'neu': 0.563, 'pos': 0.0, 'compound': -0.4767}
Golden Label:  negative
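
The comparison printed above can be turned into counts of correct and incorrect predictions per golden label. The sketch below is one way to do it, not the only one. It assumes the commonly used compound-score cutoffs of +0.05 and -0.05 for positive and negative, and `compound_to_label` is a hypothetical helper name. The scores and labels are copied from the output above.

```python
from collections import Counter

# (compound score, golden label) pairs copied from the output above; in the
# notebook you could build them from df.vader_predict and df.pos_neg_neutral.
rows = [
    (0.0, "neutral"),
    (0.0, "neutral"),
    (0.0, "neutral"),
    (0.0772, "neutral"),
    (-0.4767, "negative"),
]


def compound_to_label(compound, threshold=0.05):
    """Map a VADER compound score to a label (assumed +/-0.05 cutoffs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Tally correct and incorrect predictions for each golden label
tally = Counter()
for compound, golden in rows:
    predicted = compound_to_label(compound)
    outcome = "correct" if predicted == golden else "incorrect"
    tally[(golden, outcome)] += 1

for (golden, outcome), count in sorted(tally.items()):
    print(f"{golden:<8} {outcome:<9} {count}")
```

With these cutoffs, "What is your phone number?" (compound 0.0772) is labeled positive, so it counts as an incorrect prediction against its neutral golden label. Adjusting `threshold` is an easy way to experiment.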

