<img src=".\CP_Logo.PNG" align="left" height="380" width="320" style="padding-right:5px">

# Sentiment Analysis in Marketing
---

Sentiment analysis (also known as opinion mining) is the use of Natural Language Processing (NLP), textual analysis, and other computational linguistic techniques to identify and categorize the author's attitude towards a particular topic, product, etc. as either positive, negative, or neutral. In terms of marketing, sentiment analysis provides a method to qualify textual responses to advertisement campaigns, product announcements, and social media posts.

## Required Imports:
---

<div class="alert alert-danger">

<font style="color:darkred"><b>Note:</b> If you have not previously installed these `packages`, you can use the cell below to perform the required `pip` installs.</font>

</div>

In [None]:
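# In case you still need to perform some pip installs:
! pip install --user pandas -q
! pip install --user seaborn -q
! pip install --user nltk -q
! pip install --user wordcloud -q
! pip install --user tensorflow -q
# tensorflow_datasets is imported below and ships as a separate package
! pip install --user tensorflow-datasets -q
! pip install --user transformers -q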

In [None]:
import pandas as pd
import numpy as np
import re

import seaborn as sns
sns.set(style="whitegrid")
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
# Newer NLTK releases look for the 'punkt_tab' resource when tokenizing
nltk.download('punkt_tab')
from nltk.corpus import stopwords
nltk.download('stopwords')

from wordcloud import STOPWORDS

import tensorflow as tf
import tensorflow_datasets as tfds

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import warnings
warnings.filterwarnings('ignore')

## Our Data
---

In 2018, Nike announced a partnership with Colin Kaepernick as the face for the 30th anniversary of their JustDoIt campaign along with the slogan, "Believe in something, even if it means sacrificing everything."

This campaign prompted a lot of controversy, which subsequently yielded a polarized social media response to the campaign. A snapshot of 5,000 tweets containing the #JustDoIt hashtag was captured days after the campaign launched, on September 7th, 2018.

We will be examining the `tweet_full_text` field in the dataset to assess the sentiment of the user's text.

<div class="alert alert-info">

This dataset is hosted on the popular Data Science competition and collaboration site, Kaggle:

https://www.kaggle.com/datasets/eliasdabbas/5000-justdoit-tweets-dataset

</div>

In [None]:
# Initialize a dataframe by reading in the csv
df = pd.read_csv('justdoit_tweets.csv')
# Keep only the 4 columns we need (copy so later assignments do not touch a view)
df = df[['tweet_created_at', 'tweet_full_text', 'user_screen_name', 'user_location']].copy()
# Impute missing values with the string 'Unknown'
df['user_location'] = df['user_location'].fillna('Unknown')
df['user_screen_name'] = df['user_screen_name'].fillna('Unknown')
# Convert the date column to a datetime dtype; unparseable values become NaT
# (errors='ignore' is deprecated in recent pandas releases)
df['tweet_created_at'] = pd.to_datetime(df['tweet_created_at'], errors='coerce')
df

# From the Top
---

As we have seen with our ad campaign dataset, our text samples do not come with labels that represent sentiment. To start building our sentiment analysis model, we need a dataset that does contain those labels so that we can train on it. One of the most popular starter datasets is the IMDB reviews dataset, which can be loaded directly from TensorFlow Datasets:

In [None]:
# Load the IMDB Reviews dataset
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

print(info)

## Textual Preprocessing
---

One of the more difficult tasks in any textual analysis is preprocessing: transforming the raw text into a format that machine learning algorithms can analyze and interpret. We will take a look at some basic preprocessing methods below:

In [None]:
# Get the train and test sets
train_data, test_data = imdb['train'], imdb['test']

# Initialize sentences and labels lists
training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

# Loop over all training examples and save the sentences and labels
for s,l in train_data:
    training_sentences.append(s.numpy().decode('utf8'))
    training_labels.append(l.numpy())

# Loop over all test examples and save the sentences and labels
for s,l in test_data:
    testing_sentences.append(s.numpy().decode('utf8'))
    testing_labels.append(l.numpy())

In [None]:
training_sentences[0]

In [None]:
training_labels[0]

In [None]:
# We will create a stopword list based on stopwords from two different sources
stopword_list = list(set(stopwords.words("english") + list(STOPWORDS)))
stopword_list

In [None]:
# Use a set so that each stopword lookup is fast
stopword_set = set(stopword_list)
# To remove the stopwords, we iterate over the training samples by index
for i in range(len(training_sentences)):
    # Tokenize the words of each sample as we loop
    word_tokens = word_tokenize(training_sentences[i])
    # Filter the sample to remove stopwords
    filtered_sentence = [w for w in word_tokens if w.lower() not in stopword_set]
    # Join the remaining tokens back into a single string
    s = " ".join(filtered_sentence)
    # Remove punctuation based on regex
    s = re.sub(r'[^\w\s]', '', s)
    # Collapse any runs of whitespace created as a result of our processes
    s = re.sub(r'\s+', ' ', s).strip()
    # Replace the original sample with our processed version
    training_sentences[i] = s

In [None]:
# Any new samples introduced must undergo the same preprocessing, so repeat for the testing/validation samples
stopword_set = set(stopword_list)
for i in range(len(testing_sentences)):
    word_tokens = word_tokenize(testing_sentences[i])
    filtered_sentence = [w for w in word_tokens if w.lower() not in stopword_set]
    # Join the remaining tokens back into a single string
    s = " ".join(filtered_sentence)
    s = re.sub(r'[^\w\s]', '', s)
    # Collapse any runs of whitespace left behind
    s = re.sub(r'\s+', ' ', s).strip()
    testing_sentences[i] = s

In [None]:
# Review the changes by viewing the first sample in training
training_sentences[0]
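
Since every new sample has to go through the same cleaning steps, it can help to wrap them in one reusable function. The block below is a minimal sketch of that idea, not part of the original notebook: `preprocess`, `simple_tokenize`, and `DEMO_STOPWORDS` are hypothetical names, and the regex tokenizer is only a rough stand-in for `word_tokenize` so that the example runs without NLTK data.

```python
import re

# Hypothetical, tiny stopword set for illustration only; in the notebook
# the combined NLTK + wordcloud stopword list would be passed in instead.
DEMO_STOPWORDS = {"the", "was", "and", "i", "it", "a", "to"}

def simple_tokenize(text):
    """Rough stand-in for a word tokenizer: words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess(text, stopword_set, tokenize=simple_tokenize):
    """Remove stopwords and punctuation from a single text sample."""
    tokens = tokenize(text)
    # Drop stopwords (case-insensitive comparison)
    kept = [t for t in tokens if t.lower() not in stopword_set]
    # Rebuild a string, strip punctuation, then collapse leftover whitespace
    s = " ".join(kept)
    s = re.sub(r"[^\w\s]", "", s)
    return re.sub(r"\s+", " ", s).strip()

cleaned = preprocess("The movie was great, and I loved it!", DEMO_STOPWORDS)
print(cleaned)  # movie great loved
```

In the notebook itself, the same function could be called with `word_tokenize` as the `tokenize` argument and applied to the training, testing, and prediction samples alike, which keeps the three code paths from drifting apart.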

In [None]:
# Now we will tokenize the samples for conversion to sequences
tokenizer = Tokenizer(num_words = 10000, # Size of the vocabulary we want our model to have
                      filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', # In case we have any remaining punctuation, etc.
                      lower=True, # Sets all samples to lowercase
                      oov_token = "<OOV>") # In case there is a word outside of our vocabulary present in the samples

# We only fit the tokenizer to the training portion of the data
tokenizer.fit_on_texts(training_sentences)

# Generate and pad the training sequences
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen = 120, truncating = 'post')

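# Generate and pad the test sequences
# Use the same truncation setting as the training sequences so both sets are cut the same way
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen = 120, truncating = 'post')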

In [None]:
# Convert labels lists to numpy array
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
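
To make the tokenizing and padding steps more concrete, here is a simplified pure-Python sketch of the idea. It is an illustration under stated assumptions, not the Keras implementation: `build_word_index`, `to_sequence`, and `pad` are hypothetical helpers, index 1 is reserved for the out-of-vocabulary token, more frequent words get smaller ids, and short sequences are padded with zeros at the front.

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids: 1 is reserved for OOV, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, oov_token="<OOV>"):
    """Replace each word with its id, falling back to the OOV id for unseen words."""
    return [word_index.get(w, word_index[oov_token]) for w in text.lower().split()]

def pad(seq, maxlen, truncating="post"):
    """Cut the sequence to maxlen, then left-pad with zeros up to maxlen."""
    if len(seq) > maxlen:
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Toy "training" texts: the vocabulary is built from these only
train_texts = ["great movie great cast great", "terrible movie"]
word_index = build_word_index(train_texts)

# "unknown" never appeared in the training texts, so it maps to the OOV id
seq = to_sequence("great unknown movie", word_index)
padded_seq = pad(seq, maxlen=5)
print(seq)         # [2, 1, 3]
print(padded_seq)  # [0, 0, 2, 1, 3]
```

Building the vocabulary from the training texts only mirrors the reason the tokenizer is fit on the training portion of the data: words first seen at test time should look unknown to the model.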

## Build a Model
---

The Sequential model is a linear stack of layers that takes a single input and produces a single output. Many layer types are available, and the layers you choose usually depend on the type of problem you are trying to solve. For the purposes of this demo, we will be using 3 different layer types:

In [None]:
# Build the model
model = tf.keras.Sequential([
    # Embedding layer enables us to convert each word into a fixed length vector of defined size.
    tf.keras.layers.Embedding(10000, 64), # Size of vocabulary & output dim length of vector for each word
    # RNN architecture for sequential prediction, reading past to present & present to past
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
    # Fully connected dense layer, transforms the weighted sum of its inputs with ReLU activation
    tf.keras.layers.Dense(64, activation='relu'),
    # Fully connected dense layer, logistic activation with binary output (positive or negative)
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Print the model summary
model.summary()

In [None]:
# Set the training parameters and optimization
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
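
For intuition about the `binary_crossentropy` loss chosen above, the following sketch computes it by hand for a single sample. It is a plain-Python illustration of the standard formula, `-(y*log(p) + (1-y)*log(1-p))`, not the Keras code path; the helper name and the `eps` clipping value are assumptions made for this example.

```python
import math

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Loss for one sample: y_true is 0 or 1, p_pred is the predicted probability of class 1."""
    # Clip the probability so that log(0) can never occur
    p = min(max(p_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

# A positive review (label 1) scored with high vs. low predicted probability
confident_right = binary_crossentropy(1, 0.9)
confident_wrong = binary_crossentropy(1, 0.1)
print(round(confident_right, 4))  # 0.1054
print(round(confident_wrong, 4))  # 2.3026
```

The loss is small when the predicted probability agrees with the label and grows sharply when the model is confidently wrong, which is what pushes the weights toward better predictions during training.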

In [None]:
# Train the model
model.fit(padded, training_labels_final, epochs=2, validation_data=(testing_padded, testing_labels_final))

## Export Model
---

In many cases, you will build and train a model for the purposes of making future predictions. An easy method to accomplish this is to wrap the trained model in an `export_model`. Because the final `Dense` layer of our model already applies a `sigmoid` activation, the wrapper must not add a second one: passing a probability through `sigmoid` again would squeeze every prediction into the range of roughly 0.5 to 0.73. This `export_model` will be compiled the same as our original model, but we won't have to fit it to any new data.

In [None]:
# Wrap the trained model in our new export_model schema.
# No extra Activation layer is added: the model's last Dense layer already outputs sigmoid probabilities.
export_model = tf.keras.Sequential([
  model
])

# Use the same compiling parameters as before
export_model.compile(
    loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']
)

In [None]:
# Take a list of new samples
examples = [
  "The show was great! I enjoyed the special effects and the acting was fantastic. I would watch this movie again and again!",
  "The movie was terrible, horrible acting and special effects. I couldn't follow the story and just wanted it to end."
]

In [None]:
# Don't forget to apply the same preprocessing
stopword_set = set(stopword_list)
for i in range(len(examples)):
    word_tokens = word_tokenize(examples[i])
    filtered_sentence = [w for w in word_tokens if w.lower() not in stopword_set]
    # Join the remaining tokens back into a single string
    s = " ".join(filtered_sentence)
    s = re.sub(r'[^\w\s]', '', s)
    # Collapse any runs of whitespace left behind
    s = re.sub(r'\s+', ' ', s).strip()
    examples[i] = s

# Generate and pad the new sequences (same maxlen and truncation as the training data)
example_sequences = tokenizer.texts_to_sequences(examples)
example_padded = pad_sequences(example_sequences, maxlen = 120, truncating = 'post')

In [None]:
# Make predictions based on the new samples
export_model.predict(example_padded)
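
The prediction call returns one number per sample. Assuming those numbers are probabilities between 0 and 1, with values near 1 meaning positive (the IMDB labels use 0 for negative and 1 for positive), they can be turned into readable labels with a simple threshold. The sketch below uses made-up probabilities rather than real model output, and `to_label` is a hypothetical helper.

```python
def to_label(probability, threshold=0.5):
    """Map a predicted probability to a sentiment label."""
    return "positive" if probability >= threshold else "negative"

# Hypothetical values shaped like a prediction result: one row per sample
demo_predictions = [[0.93], [0.08]]

labels = [to_label(row[0]) for row in demo_predictions]
print(labels)  # ['positive', 'negative']
```

A threshold of 0.5 is a common default; raising or lowering it trades off how readily a sample is called positive.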

# Conclusion
---

<div class="alert alert-success">
Congratulations, you have successfully preprocessed textual samples and trained a model to determine author sentiment. Now that you have an exported model, analyze the sentiment of the Nike ad campaign tweets and inspect your model's performance!
</div>
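
As a possible starting point for the tweets, note that they contain elements the movie reviews do not, such as links, @mentions, and hashtags. The sketch below shows one way to strip these before running the stopword and punctuation steps used earlier. The `clean_tweet` helper and the sample string are invented for illustration, and the regular expressions are an assumption about what counts as noise, not rules taken from the dataset.

```python
import re

def clean_tweet(text):
    """Strip retweet markers, links, and @mentions; keep hashtag words without the '#'."""
    text = re.sub(r"^\s*RT\s+@\w+:?\s*", " ", text)  # leading "RT @user:" marker
    text = re.sub(r"https?://\S+", " ", text)         # links
    text = re.sub(r"@\w+", " ", text)                 # remaining @mentions
    text = text.replace("#", " ")                     # keep the hashtag word itself
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover whitespace

# Invented sample string, not taken from the dataset
demo_tweet = "RT @some_user: Loving the new ad! #JustDoIt https://example.com/x"
print(clean_tweet(demo_tweet))  # Loving the new ad! JustDoIt
```

Each cleaned tweet would then need the same stopword removal, tokenizing, and padding as the training data before it is passed to the model.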