<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/twitter_basic_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Python
Goal: sentiment analysis on some tweets!   

This Colab returns one of three sentiment labels for each tweet: negative, neutral, or positive. It is most accurate on tweets that have been lowercased and stripped of punctuation; user handles will probably carry no meaning for the model. The model was trained on tweets up to 2022, so it handles a wide vernacular, but it will not perform well on language that has shifted away from the 2018-2022 norm.

To run the script, replace the `link` variable in the first code cell with a share link to your xlsx file of tweets, then choose Runtime -> Run all and allow the Colab to access your personal Google account when prompted. The most likely error is that your xlsx uses different column names than the DataFrame column names in the code. When the code finishes, download the updated data from the file browser in the left sidebar.
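
Because mismatched column names are the most likely failure, a quick check of the header row can save a failed run. The sketch below is not part of the original notebook: `missing_columns` is a hypothetical helper, and the default required list holds only `Full Text` on the assumption that it is the one input column the cleaning step reads.

```python
def missing_columns(columns, required=("Full Text",)):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Hypothetical header row; once the data is loaded you could pass df.columns instead.
header = ["Date", "Text", "Author"]
problems = missing_columns(header)
if problems:
    print("Rename or add these columns before running:", problems)
```

If the list comes back empty, the column names the code depends on are present.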

[paper](https://arxiv.org/abs/2202.03829) | [model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest?text=Covid+cases+are+increasing+fast%21)


In [1]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
### IF YOUR DATASET DOES NOT USE STANDARD BRANDWATCH COLUMN NAMES YOU WILL NEED TO CHANGE THE EXCEL NAMES OR THE DF NAMES BELOW

In [2]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)

In [3]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

In [4]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax # turns raw model scores into probabilities

In [5]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings data file from a google drive sharepoint to your colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string naming the file, preferably ending in a file type, ex. 'data.csv'
     """
  id = share_link.split("/")[5] # separate the id from the link
  print("Using id", id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will move the file from Drive to the workspace
  download_object = drive.CreateFile({'id':id}) 
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)

  return

### Load Data to DataFrame
We will load some tweets from a shared Google Drive file, which requires signing in to a Google account. 
[backup link to shared file](https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true)
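
The import helper finds the file ID by position, taking the sixth piece of the link after splitting on `/`. A slightly more forgiving approach is to search for the `/d/<id>` segment directly. This is only a sketch: `extract_drive_id` is a hypothetical name, and the pattern assumes IDs are made of letters, digits, underscores, and hyphens.

```python
import re

def extract_drive_id(share_link):
    """Return the file ID that follows '/d/' in a Google Drive share link."""
    match = re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
    if match is None:
        raise ValueError("No '/d/<id>' segment found in link: " + share_link)
    return match.group(1)

# The backup link from this notebook, with the query string shortened.
example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_drive_id(example_link))
```

Raising an error on an unexpected link shape fails early, rather than passing a wrong ID to the Drive download.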

<!-- # id = '1prY6_BgwYrUHCSJ0DYAB4KA7HcjTdDtU' -->

*key python package: pandas*   
[pandas documentation](https://pandas.pydata.org/docs/getting_started/index.html#getting-started). 

[visual cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

In [6]:
import pandas as pd # basically the excel of python

In [7]:
import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')
# df = pd.read_csv('tweets.xlsx')

Using id 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO to find file on drive
Authenticated colab user
Added file to workspace with name tweets.xlsx


In [8]:
# df is now a DataFrame object, with associated methods we can use
df.head(5) # let's look at the first five data samples
# you can even access the spreadsheet in colab...

Unnamed: 0,Date,Full Text,Clean Text,Author,Url,Continent,Country,Region,Country Code,Continent Code,Region Code,City Code,Twitter Followers,Twitter Following,Twitter Reply Count,Twitter Retweets,Twitter Verified,Reach (new)
0,2022-10-01 23:40:00.000,"In Colorado Senate race, Michael Bennet still ...",in colorado senate race michael bennet still f...,Prison_Health,http://twitter.com/Prison_Health/statuses/1576...,North America,United States of America,Hawaii,USA,NORTH AMERICA,USA.HI,USA.HI.Honolulu,19711,2715,0,0,False,7325
1,2022-10-01 23:27:28.000,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at 730pm we sit down with and abo...,BryanRAnderson,http://twitter.com/BryanRAnderson/statuses/157...,North America,United States of America,North Carolina,USA,NORTH AMERICA,USA.NC,USA.NC.Raleigh,3832,1103,2,4,True,13263
2,2022-10-01 23:16:38.000,Summaries of high-profile Supreme Court cases:...,summaries of highprofile supreme court cases t...,January20th49,http://twitter.com/January20th49/statuses/1576...,North America,United States of America,Ohio,USA,NORTH AMERICA,USA.OH,,39,300,0,1,False,0
3,2022-10-01 23:05:12.000,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,UsBurning,http://twitter.com/UsBurning/statuses/15763475...,North America,United States of America,Georgia,USA,NORTH AMERICA,USA.GA,USA.GA.Atlanta,360,34,0,0,False,0
4,2022-10-01 22:02:12.000,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥38 days until midterms💥 what r u doing for de...,LeviFetterman,http://twitter.com/LeviFetterman/statuses/1576...,North America,United States of America,Pennsylvania,USA,NORTH AMERICA,USA.PA,,33774,1702,1,10,False,16039


## Clean Data
Gameplan: define a function that cleans one tweet. Then apply this function to every tweet in our dataframe.

Regex (regular expressions) is a shorthand for writing patterns that match strings. We are just copying and pasting the regex today; it's not important to learn all the shorthand yet. But if you are interested:

[using regex to remove hashtags](https://catriscode.com/2021/03/02/extracting-or-removing-mentions-and-hashtags-in-tweets-using-python/)   
[regex cheatsheet](https://www.interviewbit.com/regex-cheat-sheet/)


*key python package: re (regular expressions)*
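
To see what each pattern picks out before anything is deleted, it can help to list the matches. The sketch below reuses the three patterns from the cleaning function in this notebook; the sample tweet is made up and `show_matches` is a hypothetical helper.

```python
import re

PATTERNS = {
    "mention": r"@[A-Za-z0-9_]+",   # an @ followed by handle characters
    "hashtag": r"#[A-Za-z0-9_]+",   # a # followed by tag characters
    "link": r"http\S+",             # "http" up to the next whitespace
}

def show_matches(text):
    """Return every substring each pattern would match in `text`."""
    return {name: re.findall(pattern, text) for name, pattern in PATTERNS.items()}

sample = "Thanks @WRAL! Watch at https://t.co/abc123 #ElectionDay"  # made-up tweet
print(show_matches(sample))
```

Swapping `re.findall` for `re.sub(pattern, "", text)` removes those same substrings, which is what the cleaning step does.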

In [9]:
import re # search through and clean text

In [10]:
# there is a problem with the data oh no!
print(df['Full Text'][0])
# Most language models probably don't know what the hell 'https://t.co/F5ak34HrCE' is

In Colorado Senate race, Michael Bennet still fights for child tax credit and immigration reform https://t.co/F5ak34HrCE


In [11]:
# running this cell defines the function, does not run the function

def clean(tweet):
  """Lowercase one tweet and strip mentions, hashtags, and links."""

  # convert uppercase letters to lowercase
  tweet = tweet.lower()

  # remove mentions
  tweet = re.sub(r"@[A-Za-z0-9_]+", "", tweet)
  # remove hashtags
  tweet = re.sub(r"#[A-Za-z0-9_]+", "", tweet)
  # remove links
  tweet = re.sub(r"http\S+", "", tweet)

  return tweet

In [12]:
df['Clean Text'] = df['Full Text'].apply(clean)

In [13]:
print(df['Clean Text'][0])

in colorado senate race, michael bennet still fights for child tax credit and immigration reform 


## Sentiment Classification
We will load a pretrained model into our workspace. The model first tokenizes each tweet, then maps the tokens to an embedding space where similar words sit near each other, and finally classifies each tweet as "positive", "negative", or "neutral". 

This model was trained on hand-labeled tweets until it reached a high score on the training task, so we can trust it to be pretty good as long as we only use it on tweet-like data.
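
The final classification step can be pictured in plain Python: the model produces one raw score per label, softmax turns those scores into probabilities that sum to one, and the label with the highest probability wins. The scores below are made up for illustration and `predict_label` is a hypothetical helper; the notebook itself uses `scipy.special.softmax` and `np.argmax` for the same job.

```python
import math

LABELS = ["negative", "neutral", "positive"]

def softmax(scores):
    """Convert raw scores to probabilities that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores):
    """Return the most probable label and the full probability list."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

label, probs = predict_label([-1.2, 2.5, 0.3])  # made-up raw scores
print(label, [round(p, 3) for p in probs])
```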

*key python package: transformers by Hugging Face*   
[our model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest?text=Covid+cases+are+increasing+fast%21)

[how large language models work (specifically chapter 2 but all chapters are good)](https://www.pinecone.io/learn/sentence-embeddings/)


First, let's understand where our model is coming from

[Hugging Face model repository](https://huggingface.co/models)


In [14]:
!rm -r cardiffnlp # just in case

rm: cannot remove 'cardiffnlp': No such file or directory


In [15]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# you can choose from many tokenizing model options as well, hover over the method to see them

model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL) # saves a local copy in a folder with the same name as the model

# our sentiment analysis model is going to classify every tweet as one of these labels
labels = ["negative", "neutral", "positive"]

Downloading (…)lve/main/config.json:   0%|          | 0.00/929 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/501M [00:00<?, ?B/s]

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
# deep learning toolkit
from torch.utils.data import DataLoader
from torch.nn import Softmax
import torch

In [17]:
CUDA = torch.cuda.is_available()
print("Using gpu:", CUDA)
processor = 'cuda' if CUDA else 'cpu' # fall back to the cpu when no gpu is available

Using gpu: True


In [18]:
# making some pytorch variables to assist us
batch_size = 32
dl = DataLoader(df['Clean Text'], batch_size=batch_size) # seems arbitrary now but becomes useful once we want to control more factors of the dataset.
print("Number of Batches", len(dl))

Number of Batches 592
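
The batch count is a ceiling division: every batch holds 32 tweets except possibly the last, which holds whatever is left over. The sketch below assumes the DataFrame has 18,916 rows, which is inferred from the last row index (18915) in the sentiment output further down; `batch_count` and `batches` are hypothetical helpers that mimic the default `DataLoader` behavior of keeping the final partial batch.

```python
import math

def batch_count(n_items, batch_size):
    """Number of batches needed when the last partial batch is kept."""
    return math.ceil(n_items / batch_size)

def batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

print(batch_count(18916, 32))
```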


In [19]:
# let's run it! It should take about a minute or so

# this line makes an empty dataframe to hold the scores we are about to calculate
output_df = pd.DataFrame()

# this moves the model to our chosen processor and switches off training-only behavior
model = model.to(processor)
model.eval()

for text_batch in progress_bar(dl): # we loop through every batch in our dataset
  # turn each tweet string into tokens
  # max_length=512 is set because this tokenizer has no predefined maximum,
  # so truncation=True alone does nothing (see the warning this cell printed without it)
  encoded_input = tokenizer(text_batch, return_tensors='pt', padding=True, truncation=True, max_length=512)
  encoded_input = encoded_input.to(processor) # input must be on the same processor as the model

  with torch.no_grad(): # we are only predicting, so skip gradient tracking
    output = model(**encoded_input) # apply the model

  embeddings = output[0].cpu().numpy() # raw class scores (logits) as a numpy array

  # append new scores to current dataframe through pandas manipulations
  embed_df = pd.DataFrame(embeddings, columns=labels)
  output_df = pd.concat((output_df, embed_df), axis=0, ignore_index=True)

  0%|          | 0/592 [00:00<?, ?it/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [20]:
# convert the raw scores (one vector per tweet) to a probability distribution
sentiment_df = output_df.apply(softmax, axis=1, result_type='broadcast') 
sentiment_df
# all rows should sum to one

Unnamed: 0,negative,neutral,positive
0,0.019214,0.897980,0.082806
1,0.009155,0.954184,0.036661
2,0.019073,0.949389,0.031538
3,0.367401,0.598292,0.034307
4,0.062576,0.763785,0.173639
...,...,...,...
18912,0.729563,0.253436,0.017001
18913,0.780095,0.205532,0.014372
18914,0.142689,0.676964,0.180346
18915,0.722581,0.259136,0.018283


In [21]:
# Transform scores to labels for easier analysis
def get_class(row):
  max_idx = np.argmax(row)
  return labels[max_idx]

sentiment_df["predicted_sentiment"] = sentiment_df.apply(get_class, axis=1)

In [22]:
final_df = pd.concat((df, sentiment_df), axis=1)

In [23]:
final_df.head()

Unnamed: 0,Date,Full Text,Clean Text,Author,Url,Continent,Country,Region,Country Code,Continent Code,...,Twitter Followers,Twitter Following,Twitter Reply Count,Twitter Retweets,Twitter Verified,Reach (new),negative,neutral,positive,predicted_sentiment
0,2022-10-01 23:40:00.000,"In Colorado Senate race, Michael Bennet still ...","in colorado senate race, michael bennet still ...",Prison_Health,http://twitter.com/Prison_Health/statuses/1576...,North America,United States of America,Hawaii,USA,NORTH AMERICA,...,19711,2715,0,0,False,7325,0.019214,0.89798,0.082806,neutral
1,2022-10-01 23:27:28.000,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at 7:30pm: we sit down with and...,BryanRAnderson,http://twitter.com/BryanRAnderson/statuses/157...,North America,United States of America,North Carolina,USA,NORTH AMERICA,...,3832,1103,2,4,True,13263,0.009155,0.954184,0.036661,neutral
2,2022-10-01 23:16:38.000,Summaries of high-profile Supreme Court cases:...,summaries of high-profile supreme court cases:...,January20th49,http://twitter.com/January20th49/statuses/1576...,North America,United States of America,Ohio,USA,NORTH AMERICA,...,39,300,0,1,False,0,0.019073,0.949389,0.031538,neutral
3,2022-10-01 23:05:12.000,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,UsBurning,http://twitter.com/UsBurning/statuses/15763475...,North America,United States of America,Georgia,USA,NORTH AMERICA,...,360,34,0,0,False,0,0.367401,0.598292,0.034307,neutral
4,2022-10-01 22:02:12.000,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥38 days until midterms💥 what r u doing for d...,LeviFetterman,http://twitter.com/LeviFetterman/statuses/1576...,North America,United States of America,Pennsylvania,USA,NORTH AMERICA,...,33774,1702,1,10,False,16039,0.062576,0.763785,0.173639,neutral


## Save your work

In [24]:
# let's save it
# this only saves to the Colab workspace; you then have to download it to your local computer or upload it to Drive
save_path = "predicted_sentiment_scores.xlsx"
final_df.to_excel(save_path)