<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/twitter_emotion_sentiment_analysis_6emotions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Emotion Analysis with Python
Goal: Do a subtask of sentiment analysis on some tweets!   

This colab returns "emotion" labels of love, anger, joy, surprise, sadness, and fear. These are close to the six basic emotions commonly cited by psychologists (the set has love and is missing disgust). It will be most accurate on tweets that are lowercased and stripped of punctuation, and user handles will probably be meaningless to it. The model has been trained on tweets up to 2021, so it will be able to handle a wide vernacular. It will not perform well on language clusters that have shifted away from the 2018-2021 norm.
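
The cleaning described above (lowercasing, stripping punctuation, dropping user handles) could be sketched roughly as follows. This is only an illustration, not the preprocessing that produced the training data; `clean_tweet` is a hypothetical helper, and keeping hashtag words while dropping only the `#` is an assumption.

```python
import re
import string

# Hypothetical helper: an illustration of the cleaning described above,
# not the exact preprocessing used for the model's training data.
def clean_tweet(text):
    text = re.sub(r"@\w+", "", text)   # drop user handles such as @WRAL
    text = text.lower()                # lowercase everything
    # remove ASCII punctuation (assumption: hashtag words stay, only '#' is removed)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())      # collapse repeated whitespace

print(clean_tweet("COMING UP on @WRAL at 7:30pm: We sit down!"))
# coming up on at 730pm we sit down
```

Emoji and other non-ASCII symbols pass through this sketch unchanged.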

This is a fine-tuned checkpoint of our other emotion model, so it shares the same paper. 

To run the script, replace the `link` variable below with a share link to your xlsx file of Twitter instances. The code will execute with Runtime -> Run all. Allow the colab to access your personal Google account when prompted. The most likely error is that your xlsx has different column names than the ones the code expects. When the code finishes, download the updated data from the left sidebar.
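
Since mismatched column names are the most likely failure, a small check before running the model may save time. The sketch below is an assumption-laden illustration: `find_text_column` is a hypothetical helper that looks for the expected "Clean Text" column while ignoring case and stray spaces.

```python
# Hypothetical helper: find the column that holds the tweet text.
def find_text_column(columns, expected="Clean Text"):
    """Return the name in `columns` matching `expected`, ignoring case and surrounding spaces."""
    normalized = {str(name).strip().lower(): name for name in columns}
    key = expected.strip().lower()
    if key not in normalized:
        raise KeyError(f"No column like {expected!r}; found columns: {list(columns)}")
    return normalized[key]

# Example with made-up column names; in the notebook you would pass df.columns
text_column = find_text_column(["Date", "Full Text", "clean text "])
print(repr(text_column))
```

Once the dataframe is loaded, the match could be renamed with `df = df.rename(columns={text_column: "Clean Text"})` so the rest of the notebook runs unchanged.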

[paper](https://arxiv.org/abs/2104.12250) | [model](https://huggingface.co/02shanky/finetuned-twitter-xlm-roberta-base-emotion?text=I+like+you.+I+love+you)


In [1]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
### The DF expects a column named "Clean Text" to hold the text instances

We will start by installing an extra tool, `datasets`, from Hugging Face. Then we will go through some steps we did on day one to get all our tools and data into the notebook.

In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collec

In [3]:
# '!' means these commands will execute on the command line, making changes outside of the notebook. 
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.3 transformers-4.28.1


In [4]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a Google Drive share link into your colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string naming the file, preferably ending in a file type, ex. 'data.csv'
     """
  file_id = share_link.split("/")[5] # separate the id from the link (the part after "/d/")
  print("Using id", file_id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will copy the file from Drive to the workspace
  download_object = drive.CreateFile({'id': file_id})
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)
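
The function above finds the file id by position in the link. If a share link has a slightly different shape, a pattern-based lookup may fail more clearly. The sketch below is one possible alternative, not part of the original notebook; `extract_drive_id` is a hypothetical helper and it assumes the link contains `/d/<id>`.

```python
import re

# Hypothetical alternative to share_link.split("/")[5]
def extract_drive_id(share_link):
    """Pull the file id out of a share link of the form .../d/<id>/..."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError(f"Could not find a file id in: {share_link}")
    return match.group(1)

example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_drive_id(example_link))
# 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO
```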

In [5]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer
import datasets

In [6]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays
from numpy import mean
import pandas as pd

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax

In [7]:

import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')

Using id 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO to find file on drive
Authenticated colab user
Added file to workspace with name tweets.xlsx


### Apply emotion analysis on tweets

Emotion analysis follows the same format as sentiment analysis. Someone has trained a model on examples of tweets with a given emotion label like "joy" or "sadness". We will get this model from Hugging Face and infer labels for our current Twitter dataset.

Looking on Hugging Face, it turns out the authors of the last model [trained an emotion classifier](https://huggingface.co/cardiffnlp/twitter-roberta-base-emotion) as well. (It's not much work for them because the largest part of the model, the language model, is the same in both the sentiment and the emotion model. They only differ in the final classification layer.)

Anyway, we'll look at their example code to see what we need to change in the code we've previously used to instantiate the models in the notebook.

In [8]:
labels = [ "sadness","joy","love","anger","fear","surprise"]
tokenizer = AutoTokenizer.from_pretrained("02shanky/finetuned-twitter-xlm-roberta-base-emotion")
model = AutoModelForSequenceClassification.from_pretrained("02shanky/finetuned-twitter-xlm-roberta-base-emotion")
model.to('cuda')

XLMRobertaForSequenceClassification(
  (roberta): XLMRobertaModel(
    (embeddings): XLMRobertaEmbeddings(
      (word_embeddings): Embedding(250002, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): XLMRobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x XLMRobertaLayer(
          (attention): XLMRobertaAttention(
            (self): XLMRobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): XLMRobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768,

In [9]:
print(labels)

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


Well, these labels don't exactly span the breadth of human emotions but it's a start.

Since the authors keep everything besides labels pretty consistent between models, all our code used for sentiment analysis should work for emotion analysis.

I'll just copy and paste the section "Apply Roberta to tweets" from day 1 here and then rename the "output_df" to "emotion_df"

In [10]:
# deep learning toolkit
from torch.utils.data import DataLoader
from torch.nn import Softmax
import torch

In [11]:
CUDA = torch.cuda.is_available()
print("Using gpu:", CUDA)

Using gpu: True


In [12]:
# use cuda only if it's available, otherwise fall back to the cpu
processor = 'cuda' if CUDA else 'cpu'

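We will wrap the dataframe in a Hugging Face `Dataset` object so that we can loop over the tweets. In this notebook the loop processes one tweet at a time; a `DataLoader` could instead group tweets into small batches so that several are processed simultaneously.

The dataset wrapper might seem arbitrary now but becomes useful once we want to control more factors of the dataset.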
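
Grouping items into small batches can be sketched in plain Python. `make_batches` below is a hypothetical helper for illustration only; it is not part of PyTorch or `datasets`, and the tweets are made up.

```python
# Hypothetical helper: split a list into consecutive batches.
def make_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

tweets = ["tweet a", "tweet b", "tweet c", "tweet d", "tweet e"]
for batch in make_batches(tweets, batch_size=2):
    print(batch)
# ['tweet a', 'tweet b']
# ['tweet c', 'tweet d']
# ['tweet e']
```

If a whole batch of strings is passed to the tokenizer at once, it would typically also need `padding=True` so that all tweets in the batch end up the same length.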

In [13]:
# making some variables to assist us
batch_size = 32 # not used by the loop below, which processes one tweet at a time
dataset = datasets.Dataset.from_pandas(df) # wraps the dataframe so each row can be read as a dictionary



In [14]:
# this list holds the raw scores (logits) we are about to calculate
all_logits = []

for row in progress_bar(dataset): # iterating a Dataset yields one row (one tweet) at a time

  # turn the tweet string into tokens; truncation guards against texts longer than the model's limit
  encoded_input = tokenizer(row['Clean Text'], return_tensors='pt', truncation=True)
  encoded_input = encoded_input.to(processor) # input must be on the same processor as the model
  with torch.no_grad(): # we are only predicting, so no gradients are needed
    output = model(**encoded_input) # apply the model

  # output[0] holds the logits, shape (1, 6): one unnormalized score per emotion
  all_logits.append(output[0].cpu().numpy())

# stack the per-tweet scores into one dataframe with a column per emotion
emotion_df = pd.DataFrame(np.concatenate(all_logits, axis=0), columns=labels)

  0%|          | 0/18917 [00:00<?, ?it/s]

In [15]:
emotion_df = emotion_df.apply(softmax, axis=1, result_type='broadcast') 
emotion_df

|  | sadness | joy | love | anger | fear | surprise |
|---|---|---|---|---|---|---|
| 0 | 0.000859 | 0.007955 | 0.000510 | 0.975434 | 0.014818 | 0.000424 |
| 1 | 0.001722 | 0.142603 | 0.003991 | 0.666126 | 0.183467 | 0.002091 |
| 2 | 0.001506 | 0.937298 | 0.016740 | 0.040129 | 0.003514 | 0.000813 |
| 3 | 0.003799 | 0.000903 | 0.000235 | 0.951391 | 0.042651 | 0.001021 |
| 4 | 0.017389 | 0.574008 | 0.008602 | 0.381208 | 0.017640 | 0.001152 |
| ... | ... | ... | ... | ... | ... | ... |
| 18912 | 0.000700 | 0.000333 | 0.000233 | 0.997922 | 0.000696 | 0.000115 |
| 18913 | 0.001727 | 0.060150 | 0.002675 | 0.929363 | 0.005224 | 0.000861 |
| 18914 | 0.003283 | 0.060041 | 0.000972 | 0.869056 | 0.064330 | 0.002318 |
| 18915 | 0.057245 | 0.149950 | 0.001696 | 0.783301 | 0.006415 | 0.001393 |
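
The `softmax` step turns each row of raw model scores into probabilities that sum to 1, which is why every value in the output above lies between 0 and 1. A minimal pure-Python sketch of the same calculation, using made-up scores rather than real model output (`softmax_row` and the example values are illustrative assumptions):

```python
import math

# Illustrative re-implementation; the notebook itself uses scipy.special.softmax.
def softmax_row(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

example_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
raw_scores = [-1.2, 0.3, -2.0, 4.1, 1.0, -2.5]  # made-up logits for one tweet
probs = softmax_row(raw_scores)
for label, p in zip(example_labels, probs):
    print(f"{label}: {p:.3f}")
print("sum:", round(sum(probs), 6))
```

The largest raw score always receives the largest probability, so softmax changes the scale of the scores but not which emotion ranks first.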


In [16]:
def get_class(row):
  max_idx = np.argmax(row)
  return labels[max_idx]

emotion_df["predicted_emotion"] = emotion_df.apply(get_class, axis=1)
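
Taking only the top label hides how close the runner-up was. As an optional extra, not part of the original workflow, the sketch below reports the gap between the two highest probabilities. `top_two` is a hypothetical helper, and the example values are copied from row 4 of the probabilities shown earlier.

```python
# Hypothetical helper: report the winning label and its lead over the runner-up.
def top_two(scores):
    """Return (best_label, best_prob, margin) for a dict of label -> probability."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_label, best_prob), (_, second_prob) = ranked[0], ranked[1]
    return best_label, best_prob, best_prob - second_prob

# Probabilities from row 4 of the output shown earlier
row_4 = {"sadness": 0.017389, "joy": 0.574008, "love": 0.008602,
         "anger": 0.381208, "fear": 0.017640, "surprise": 0.001152}
best_label, best_prob, margin = top_two(row_4)
print(best_label, round(margin, 3))
```

A small margin, as in this row where joy only narrowly beats anger, suggests the prediction deserves a closer look.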

In [17]:
final_df = pd.concat([df,emotion_df],axis=1)
final_df.head()

|  | Date | Full Text | Clean Text | Author | Url | Continent | Country | Region | Country Code | Continent Code | ... | Twitter Retweets | Twitter Verified | Reach (new) | sadness | joy | love | anger | fear | surprise | predicted_emotion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-01 23:40:00.000 | In Colorado Senate race, Michael Bennet still ... | in colorado senate race michael bennet still f... | Prison_Health | http://twitter.com/Prison_Health/statuses/1576... | North America | United States of America | Hawaii | USA | NORTH AMERICA | ... | 0 | False | 7325 | 0.000859 | 0.007955 | 0.00051 | 0.975434 | 0.014818 | 0.000424 | anger |
| 1 | 2022-10-01 23:27:28.000 | COMING UP on @WRAL at 7:30pm: We sit down with... | coming up on at 730pm we sit down with and abo... | BryanRAnderson | http://twitter.com/BryanRAnderson/statuses/157... | North America | United States of America | North Carolina | USA | NORTH AMERICA | ... | 4 | True | 13263 | 0.001722 | 0.142603 | 0.003991 | 0.666126 | 0.183467 | 0.002091 | anger |
| 2 | 2022-10-01 23:16:38.000 | Summaries of high-profile Supreme Court cases:... | summaries of highprofile supreme court cases t... | January20th49 | http://twitter.com/January20th49/statuses/1576... | North America | United States of America | Ohio | USA | NORTH AMERICA | ... | 1 | False | 0 | 0.001506 | 0.937298 | 0.01674 | 0.040129 | 0.003514 | 0.000813 | joy |
| 3 | 2022-10-01 23:05:12.000 | Abortion Icon Emma Bonino Trounced in Italian ... | abortion icon emma bonino trounced in italian ... | UsBurning | http://twitter.com/UsBurning/statuses/15763475... | North America | United States of America | Georgia | USA | NORTH AMERICA | ... | 0 | False | 0 | 0.003799 | 0.000903 | 0.000235 | 0.951391 | 0.042651 | 0.001021 | anger |
| 4 | 2022-10-01 22:02:12.000 | 💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U... | 💥38 days until midterms💥 what r u doing for de... | LeviFetterman | http://twitter.com/LeviFetterman/statuses/1576... | North America | United States of America | Pennsylvania | USA | NORTH AMERICA | ... | 10 | False | 16039 | 0.017389 | 0.574008 | 0.008602 | 0.381208 | 0.01764 | 0.001152 | joy |


In [18]:
final_df.to_excel("predicted_emotion_scores.xlsx")