<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/twitter_emotion_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Emotion Analysis with Python
Goal: Do a subtask of sentiment analysis on some tweets!   

This Colab returns the "emotion" labels anger, joy, optimism, and sadness. It is a curious set of labels, and it is not clear whether the authors were exploring a particular question. The model will be most accurate on tweets that are lowercased and stripped of punctuation, and user handles will probably be meaningless to it. The model has been trained on tweets up to 2021, so it can handle a wide vernacular, but it will not perform well on language that has shifted away from the 2018-2021 norm.

To run the script, replace the `link` variable with a share link to your xlsx file of tweets. Run the code with Runtime -> Run all, and allow the Colab to access your personal Google account when prompted. The most likely error is that your xlsx file has different column names than the dataframe column names used in the code. When the code finishes, download the updated data from the left sidebar.
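
Because a column-name mismatch is the most likely failure, it can help to check the spreadsheet's headers before running the rest of the notebook. The sketch below uses a hypothetical helper, `missing_columns`, which is not part of the original notebook. It assumes the only column the script needs is `Clean Text`, the one passed to the `DataLoader` later on.

```python
def missing_columns(available, required=("Clean Text",)):
    """Return the required column names that are absent from `available`.

    `available` is any iterable of column names, for example `df.columns`.
    Hypothetical helper, not part of the original notebook.
    """
    present = set(available)
    return [name for name in required if name not in present]


# Example with made-up headers: the cleaned text column is missing.
example_headers = ["Date", "Full Text", "Author"]
print(missing_columns(example_headers))  # ['Clean Text']
```

Once `df` is loaded, `missing_columns(df.columns)` should return an empty list. If it does not, either rename the column in the spreadsheet or change the column name used in the code.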

[paper](https://arxiv.org/abs/2104.12250) | [model](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion)


In [1]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
### IF YOUR DATASET DOES NOT USE STANDARD BRANDWATCH COLUMN NAMES YOU WILL NEED TO CHANGE THE EXCEL NAMES OR THE DF NAMES BELOW

We will start by installing an extra tool, `datasets`, from Hugging Face. Then we will go through some steps we did on day one to get all our tools and data into the notebook.

In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
# '!' means these commands will execute on the command line, making changes outside of the notebook. 
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a Google Drive share link to your colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string describing the file, preferably ending in a file type, ex. 'data.csv'
     """
  file_id = share_link.split("/")[5] # separate the id from the link (avoids shadowing the builtin `id`)
  print("Using id", file_id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will copy the file from Drive to the workspace
  download_object = drive.CreateFile({'id': file_id})
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)
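
The function above finds the file id by position: it splits the link on `/` and takes the sixth piece, which works for links shaped like `https://docs.google.com/spreadsheets/d/<id>/edit...`. A slightly more defensive sketch is shown below. It assumes the id always follows a `/d/` segment, which holds for the example link in this notebook but is an assumption for other link styles. The name `extract_drive_file_id` is hypothetical.

```python
import re


def extract_drive_file_id(share_link):
    """Pull the file id out of a share link by looking for the `/d/<id>` segment.

    Hypothetical alternative to `share_link.split("/")[5]`.
    """
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"No file id found in link: {share_link}")
    return match.group(1)


example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_drive_file_id(example_link))  # 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO
```

Raising an error for an unrecognized link fails early with a readable message, instead of passing a wrong id on to Drive.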

In [5]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

In [6]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays
from numpy import mean
import pandas as pd

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import csv
from datetime import datetime
from matplotlib.dates import date2num

# more packages, tools for getting to google drive
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [7]:
### HERE IS THE CELL YOU NEED TO CHANGE
import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')

Using id 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO to find file on drive
Authenticated colab user
Added file to workspace with name tweets.xlsx


### Apply emotion analysis on tweets

Emotion analysis follows the same format as sentiment analysis. Someone has trained a model on examples of tweets, each labeled with an emotion like "joy" or "sadness". We will get this model from Hugging Face and infer labels for our current Twitter dataset.

Looking on Hugging Face, it turns out the authors of the last model [trained an emotion classifier](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion) as well. (It's not much work for them because the largest part of the model, the language model, is the same in both the sentiment and the emotion model. They differ only in the final classification task.)

Anyway, we'll look at their example code to see what we need to change in the code we previously used to instantiate the models in the notebook.

In [8]:
# 1. change the hugging face path to the correct model task
task='emotion'
MODEL = f"cardiffnlp/twitter-roberta-base-{task}"

# 2. reload the tokenizer based on the new path
tokenizer = AutoTokenizer.from_pretrained(MODEL)
# you can choose from many tokenizing model options as well, scroll over the method

# 3. reload the classification model based on the new path
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

# 4. Redefine our output labels - they should be ["anger", "joy", "optimism", "sadness"], but let's just copy and paste the authors' code, which fetches the label mapping from the tweeteval GitHub repository
# download label mapping
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

Downloading (…)lve/main/config.json:   0%|          | 0.00/768 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/150 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/499M [00:00<?, ?B/s]

In [9]:
print(labels)

['anger', 'joy', 'optimism', 'sadness']


Well, these labels don't exactly span the breadth of human emotions but it's a start.
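
The label-mapping step above can be tried offline to see what the parsing code does. The sketch below assumes the mapping file holds one `<index><tab><label>` pair per line. That layout is inferred from the parsing code and the printed labels, and the sample string is typed in by hand, not downloaded.

```python
import csv

# Assumed sample of the mapping file: one "<index>\t<label>" pair per line.
sample_mapping = "0\tanger\n1\tjoy\n2\toptimism\n3\tsadness\n"

# Split into lines, parse each line as tab-separated values,
# and keep the second field (the label) of every non-empty row.
rows = csv.reader(sample_mapping.split("\n"), delimiter="\t")
sample_labels = [row[1] for row in rows if len(row) > 1]
print(sample_labels)  # ['anger', 'joy', 'optimism', 'sadness']
```

The `len(row) > 1` check skips the empty row produced by the trailing newline.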

Since the authors keep everything besides the labels pretty consistent between models, all the code we used for sentiment analysis should work for emotion analysis.

I'll just copy and paste the section "Apply Roberta to tweets" from day 1 here and then rename the "output_df" to "emotion_df".

In [10]:
# deep learning toolkit
from torch.utils.data import DataLoader
from torch.nn import Softmax
import torch

In [11]:
CUDA = torch.cuda.is_available()
print("Using gpu:", CUDA)

Using gpu: True


In [12]:
# use cuda only if it's available, otherwise fall back to the cpu
processor = 'cuda' if CUDA else 'cpu'

We will make a `DataLoader` object that groups tweets into small batches, so that we can process them simultaneously.

The dataloader might seem arbitrary now but becomes useful once we want to control more factors of the dataset.
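
To make the batching idea concrete, here is a minimal pure-Python sketch of what grouping into batches amounts to when nothing is shuffled. The helpers `batches` and `number_of_batches` are hypothetical stand-ins for illustration, not the `DataLoader` implementation.

```python
import math


def batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def number_of_batches(n_items, batch_size):
    """The last batch may be smaller, so round up."""
    return math.ceil(n_items / batch_size)


example_tweets = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
for batch in batches(example_tweets, batch_size=2):
    print(batch)
# ['tweet a', 'tweet b']
# ['tweet c', 'tweet d']
# ['tweet e']

print(number_of_batches(18916, 32))  # 592
```

The last line uses the size of the example dataset in this notebook (18,916 rows), which gives 592 batches at a batch size of 32.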

In [13]:
# making some pytorch variables to assist us
batch_size = 32
dl = DataLoader(df['Clean Text'], batch_size=batch_size) # seems arbitrary now but becomes useful once we want to control more factors of the dataset.
print("Number of Batches", len(dl))

Number of Batches 592


In [14]:
# let's run it! It should take about a minute or so

# this line makes an empty dataframe to hold the scores we are about to calculate
output_df = pd.DataFrame()

# this moves the model to our chosen processor
model = model.to(processor)

for text_batch in progress_bar(dl): # we loop through every batch in our dataset

  encoded_input = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True) # turn each tweet string into tokens
  encoded_input.to(processor) # input must be on the same processor as the model

  output = model(**encoded_input) # apply the model

  embeddings = output[0].detach().cpu().numpy() # the raw scores (logits): detach from the graph, move to the cpu, convert to a numpy array

  # append new embeddings to current dataframe through pandas manipulations
  embed_df = pd.DataFrame(embeddings, columns=labels)
  output_df = pd.concat((output_df, embed_df), axis=0, ignore_index=True)
  
# convert the raw scores (one vector per tweet) to a probability distribution over the labels
output_df = output_df.apply(softmax, axis=1, result_type='broadcast') 

  0%|          | 0/592 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
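
The final step of the cell above turns each tweet's raw scores into probabilities with `softmax`. A minimal pure-Python sketch of that calculation is below. The function name `softmax_row` and the input values are made up for illustration, and the notebook itself uses `scipy.special.softmax`.

```python
import math


def softmax_row(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Equal scores give equal probabilities.
print(softmax_row([0.0, 0.0, 0.0, 0.0]))  # [0.25, 0.25, 0.25, 0.25]

# Unequal (made-up) scores: the largest score gets the largest probability.
example_probs = softmax_row([2.0, 0.5, -1.0, 0.1])
print(example_probs)
```

Softmax preserves the ordering of the scores, so the label with the highest raw score is also the label with the highest probability.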


In [15]:
emotion_df = output_df
#can add categorical label column
emotion_df

Unnamed: 0,anger,joy,optimism,sadness
0,0.265583,0.048415,0.456312,0.229690
1,0.084470,0.293997,0.436769,0.184764
2,0.126567,0.220660,0.400411,0.252362
3,0.041835,0.870179,0.041468,0.046518
4,0.909438,0.010088,0.030238,0.050235
...,...,...,...,...
18912,0.949816,0.004810,0.024483,0.020891
18913,0.846968,0.005877,0.038835,0.108320
18914,0.565558,0.141764,0.079879,0.212800
18915,0.737710,0.018656,0.111396,0.132238


In [16]:
def get_class(row):
  max_idx = np.argmax(row)
  return labels[max_idx]

emotion_df["predicted_emotion"] = emotion_df.apply(get_class, axis=1)
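
The `get_class` step picks, for each row, the label with the largest probability. The same logic in plain Python looks like the sketch below. `pick_label` and `example_labels` are hypothetical names, and the numbers are the first row of the score table shown earlier.

```python
def pick_label(probabilities, label_names):
    """Return the label whose probability is largest."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return label_names[best_index]


example_labels = ["anger", "joy", "optimism", "sadness"]
first_row = [0.265583, 0.048415, 0.456312, 0.229690]
print(pick_label(first_row, example_labels))  # optimism
```

A pandas idiom that should give the same result is `emotion_df[labels].idxmax(axis=1)`, assuming the four score columns are named after the labels. Selecting only the score columns also keeps the step safe to re-run after `predicted_emotion` has been added.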

In [17]:
final_df = pd.concat([df,emotion_df],axis=1)
final_df.head()

Unnamed: 0,Date,Full Text,Clean Text,Author,Url,Continent,Country,Region,Country Code,Continent Code,...,Twitter Following,Twitter Reply Count,Twitter Retweets,Twitter Verified,Reach (new),anger,joy,optimism,sadness,predicted_emotion
0,2022-10-01 23:40:00.000,"In Colorado Senate race, Michael Bennet still ...",in colorado senate race michael bennet still f...,Prison_Health,http://twitter.com/Prison_Health/statuses/1576...,North America,United States of America,Hawaii,USA,NORTH AMERICA,...,2715,0,0,False,7325,0.265583,0.048415,0.456312,0.22969,optimism
1,2022-10-01 23:27:28.000,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at 730pm we sit down with and abo...,BryanRAnderson,http://twitter.com/BryanRAnderson/statuses/157...,North America,United States of America,North Carolina,USA,NORTH AMERICA,...,1103,2,4,True,13263,0.08447,0.293997,0.436769,0.184764,optimism
2,2022-10-01 23:16:38.000,Summaries of high-profile Supreme Court cases:...,summaries of highprofile supreme court cases t...,January20th49,http://twitter.com/January20th49/statuses/1576...,North America,United States of America,Ohio,USA,NORTH AMERICA,...,300,0,1,False,0,0.126567,0.22066,0.400411,0.252362,optimism
3,2022-10-01 23:05:12.000,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,UsBurning,http://twitter.com/UsBurning/statuses/15763475...,North America,United States of America,Georgia,USA,NORTH AMERICA,...,34,0,0,False,0,0.041835,0.870179,0.041468,0.046518,joy
4,2022-10-01 22:02:12.000,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥38 days until midterms💥 what r u doing for de...,LeviFetterman,http://twitter.com/LeviFetterman/statuses/1576...,North America,United States of America,Pennsylvania,USA,NORTH AMERICA,...,1702,1,10,False,16039,0.909438,0.010088,0.030238,0.050235,anger


In [18]:
final_df.to_excel("predicted_emotion_scores.xlsx")