<a href="https://colab.research.google.com/github/SEEsuite/colab_scripts/blob/main/english_basic_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Python

Goal: sentiment analysis on some tweets!   

This colab returns sentiment labels of negative, neutral, or positive. The model is trained on social media posts that are part of marketing campaigns. It may be confused by mentions, hashtags, very specific internet vernacular, etc., so consider removing blatant social media artifacts before running the script. It will be biased towards "proper" or "standard internet" English and will probably perform less well on minority dialects.
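The advice about removing social media artifacts can be sketched with a few regular expressions. This is a minimal sketch and not part of the original script: the helper name `strip_social_artifacts` is hypothetical, and keeping the hashtag word while dropping only the `#` is an assumption you may want to change.

```python
import re

# Hypothetical helper, not part of the original notebook.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def strip_social_artifacts(text):
    """Remove URLs and @mentions, and keep hashtag words without the '#'."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = HASHTAG_PATTERN.sub(r"\1", text)  # assumption: the hashtag word is worth keeping
    # collapse the whitespace left behind by the removals
    return " ".join(text.split())


example = "COMING UP on @WRAL at 7:30pm: details at https://example.com #Election"
print(strip_social_artifacts(example))
```

The mention pattern is deliberately simple, so it would also clip the part of an email address after the `@`. Check a few cleaned rows by eye before scoring a whole file.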

To run the script, replace the given link variable with a share link to your xlsx file of tweets. Run the code with Runtime -> Run all, and allow the colab to access your personal Google account when prompted. The most likely error is that your xlsx has different column names than the dataframe column names used in the code. When the code finishes, download the updated data from the left sidebar.
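Since mismatched column names are the most likely failure, a small check before scoring can give a clearer error message. The sketch below is an illustration only: `pick_text_column` is a hypothetical helper, and the candidate names "Clean Text" and "Full Text" are the text columns that appear in this notebook's example data.

```python
def pick_text_column(columns, candidates=("Clean Text", "Full Text")):
    """Return the first candidate column name present in `columns`.

    `columns` can be any iterable of names, for example `df.columns`.
    Raises a ValueError that lists the available columns otherwise.
    """
    available = [str(name) for name in columns]
    for name in candidates:
        if name in available:
            return name
    raise ValueError(
        f"None of {list(candidates)} found. Available columns: {available}"
    )


# A plain list stands in for df.columns here
print(pick_text_column(["Date", "Full Text", "Author"]))
```

With a loaded dataframe, the same call would be `pick_text_column(df.columns)`.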

[paper](https://journals.sagepub.com/doi/full/10.1177/00222437211037258#supplementary-materials) | [model](https://huggingface.co/j-hartmann/sentiment-roberta-large-english-3-classes?text=Oh+no.+This+is+bad.)




In [1]:
### HERE IS THE CELL YOU NEED TO CHANGE
link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing&ouid=101042095541764641159&rtpof=true&sd=true"
### Please make sure your text is in a column called "Clean Text", which is the column the code below reads

In [2]:
!pip install datasets
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.12.0-py3-none-any.whl (474 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Collec

In [3]:
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def import_data_from_drive(share_link, your_name_for_file="my_data"):
  """Brings a data file from a Google Drive share link into your colab workspace.
     It does not require you to host the dataset on your own account.

     Parameters:
     share_link: the link to view a file in Google Drive
     your_name_for_file: a string naming the file, preferably ending in a file type, e.g. 'data.csv'
     """
  id = share_link.split("/")[5] # separate the id from the link
  print("Using id", id, "to find file on drive")

  # use pydrive and colab modules to authenticate you
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("Authenticated colab user")

  # This step will move the file from Drive to the workspace
  download_object = drive.CreateFile({'id':id}) 
  download_object.GetContentFile(your_name_for_file)
  print("Added file to workspace with name", your_name_for_file)

  return
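The function above finds the file ID by splitting the share link on `/` and taking element 5, which depends on the exact shape of the URL. As a hedged alternative, the sketch below looks for the segment that follows `/d/`. The helper name `extract_file_id` and the regular expression are assumptions for illustration, not part of the original script.

```python
import re

# Hypothetical helper; the pattern assumes the ID sits right after "/d/".
FILE_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_file_id(share_link):
    """Return the file ID from a Google Drive or Docs share link."""
    match = FILE_ID_PATTERN.search(share_link)
    if match is None:
        raise ValueError(f"Could not find a file ID in: {share_link}")
    return match.group(1)


example_link = "https://docs.google.com/spreadsheets/d/1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO/edit?usp=sharing"
print(extract_file_id(example_link))
```

For the example link used in this notebook, both approaches give the same ID. The pattern-based version also raises a readable error when the link has an unexpected shape.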

In [4]:
# huggingface's tools for pretrained language models
from transformers import AutoModelForSequenceClassification
from transformers import TFAutoModelForSequenceClassification
from transformers import AutoTokenizer

In [5]:
# importing miscellaneous packages
import numpy as np # fast manipulation of multidimensional arrays
from numpy import mean
import pandas as pd

from tqdm.notebook import tqdm as progress_bar # a little visualization of how fast a loop is running
from scipy.special import softmax
import csv
from datetime import datetime
from matplotlib.dates import date2num

# more packages, tools for getting to google drive
import urllib.request
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

In [6]:

import_data_from_drive(link, your_name_for_file="tweets.xlsx")
df = pd.read_excel('tweets.xlsx')

Using id 1m1-qV00Qkm2m9Znypj_ORBZgAQ9yQ9eO to find file on drive
Authenticated colab user
Added file to workspace with name tweets.xlsx


### Use a sentiment analysis model trained on general text data


So, let's switch to a model that has been trained on more examples of language than just Twitter. Here is one by J. Hartmann that seems relatively popular and has an associated paper.

[model](https://huggingface.co/j-hartmann/sentiment-roberta-large-english-3-classes?text=Oh+no.+This+is+bad..)

The authors give a much smaller code example to get the model up and running. By putting the model in a pipeline, more of the model instantiation is done behind the scenes. 

In [7]:
from transformers import pipeline

# build all the components of the model (tokenizer and classifier) in one step
# device=0 places the model on the first GPU, so this needs a GPU runtime
general_model = pipeline("text-classification", model="j-hartmann/sentiment-roberta-large-english-3-classes", return_all_scores=True, device=0)

# infer scores for one sentence
general_model("This is so nice!") 


Some weights of the model checkpoint at j-hartmann/sentiment-roberta-large-english-3-classes were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).





[[{'label': 'negative', 'score': 0.00016451838018838316},
  {'label': 'neutral', 'score': 0.000174045650055632},
  {'label': 'positive', 'score': 0.9996614456176758}]]

The model output above is a little messy, but the highest score being positive is reasonable for that short sentence. Let's see how it does on the Amazon review.

This more general model seems to pick up on something the Twitter model didn't, correctly identifying the negative review. Of course, it's only one data sample. Continue following the rest of the code if you want to see how to apply this new model to our tweets (feel free to try this on another dataset!).

It looks like we'll have to take a few steps to adjust the code we already have for applying the model to all our tweets.

1) we have to use Hugging Face's specific dataset tool, not the more general data loader we were using before. It turns out the old tool just doesn't play nicely with J. Hartmann's code.

2) we pass the input directly into the pipeline, not through a tokenizer first

3) we have to restructure the intermediate data due to the authors' choice of output format.
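Step 3 can be illustrated without loading the model. The sketch below takes the nested output printed earlier for "This is so nice!" and flattens it into one row of scores. Note that it looks scores up by label name, which is a slightly different technique from the notebook's code (the notebook relies on the order in which the pipeline returns the labels). The helper name `scores_in_label_order` is hypothetical.

```python
LABELS = ["negative", "neutral", "positive"]


def scores_in_label_order(pipeline_output, labels=LABELS):
    """Turn the nested output for ONE text into a flat list of scores.

    `pipeline_output` is expected to look like
    [[{'label': ..., 'score': ...}, ...]], as printed earlier in the notebook.
    """
    per_label = {entry["label"]: entry["score"] for entry in pipeline_output[0]}
    return [per_label[name] for name in labels]


# Output copied from the "This is so nice!" example above
example_output = [[
    {"label": "negative", "score": 0.00016451838018838316},
    {"label": "neutral", "score": 0.000174045650055632},
    {"label": "positive", "score": 0.9996614456176758},
]]
print(scores_in_label_order(example_output))
```

Looking up by name keeps the columns correct even if the entries arrive in a different order.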

In [8]:
# 1
from datasets import Dataset

# I am just loading the tweets here, but this would be a good spot to go and get your own data, load it as a df
# and reference the text samples where I call df['Clean Text']
dataset = Dataset.from_pandas(df)
labels = ["negative", "neutral", "positive"]

In [None]:
# let's run it! It should take longer to process than the previous model
# this is not the best way to use a pipeline...

# this line makes an empty dataframe to hold the scores we are about to calculate
output_df = pd.DataFrame()

for text_batch in progress_bar(dataset): # iterating a Dataset yields one row (a dict of column values) at a time
  # 2.
  outputs = general_model(text_batch['Clean Text'])

  # 3.
  outputs = [ d['score']  for y in outputs for d in y]
  batch_df = pd.DataFrame(outputs).T
  batch_df.columns = labels

  output_df = pd.concat((output_df, batch_df), axis=0, ignore_index=True) # pandas way of appending
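The comment in the cell above notes that this loop is not the best way to use a pipeline. One possible alternative, sketched below, is to hand all texts to the classifier in a single call and build the rows afterwards. This is only a sketch under stated assumptions: `score_texts` and `fake_classifier` are hypothetical names, and the stand-in classifier exists so the example runs without downloading the model. It assumes the real pipeline accepts a list of strings together with a `batch_size` keyword.

```python
def score_texts(classifier, texts, labels, batch_size=16):
    """Score all texts with one classifier call and return rows of scores.

    `classifier` is any callable that accepts a list of strings plus a
    `batch_size` keyword and returns, for each text, a list of
    {'label': ..., 'score': ...} dictionaries.
    """
    outputs = classifier(list(texts), batch_size=batch_size)
    rows = []
    for per_text in outputs:
        by_label = {entry["label"]: entry["score"] for entry in per_text}
        rows.append([by_label[name] for name in labels])
    return rows


def fake_classifier(texts, batch_size=1):
    """Stand-in for the real pipeline so the sketch runs without a download."""
    return [
        [
            {"label": "negative", "score": 0.1},
            {"label": "neutral", "score": 0.7},
            {"label": "positive", "score": 0.2},
        ]
        for _ in texts
    ]


rows = score_texts(fake_classifier, ["first tweet", "second tweet"],
                   ["negative", "neutral", "positive"])
print(rows)
```

With the notebook's objects, a call might look like `score_texts(general_model, df['Clean Text'].tolist(), labels)`, followed by `pd.DataFrame(rows, columns=labels)`. Collecting plain lists first also avoids calling `pd.concat` once per row.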

In [11]:
# Transform scores to labels for easier analysis
def get_class(row):
  max_idx = np.argmax(row)
  return labels[max_idx]

output_df["predicted_class"] = output_df.apply(get_class, axis=1)
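To make the score-to-label step concrete, here is a plain-Python sketch of the same idea: pick the index of the largest score and look up the label at that index. The three score rows are copied from this notebook's output table. The helper name `label_from_scores` and the use of `Counter` for a quick tally are illustrative additions, not part of the original script.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def label_from_scores(scores, labels=LABELS):
    """Return the label whose score is largest (a plain-Python argmax)."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]


# Three score rows copied from the output table in this notebook
example_rows = [
    [0.000490, 0.998759, 0.000751],
    [0.000625, 0.013860, 0.985515],
    [0.997501, 0.002245, 0.000254],
]
predicted = [label_from_scores(row) for row in example_rows]
print(predicted)
print(Counter(predicted))
```

A tally like this is one simple starting point for the "easier analysis" mentioned in the cell comment.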

In [12]:
output_df

Unnamed: 0,negative,neutral,positive,predicted_class
0,0.000490,0.998759,0.000751,neutral
1,0.000198,0.998733,0.001069,neutral
2,0.000256,0.998489,0.001255,neutral
3,0.006774,0.992278,0.000948,neutral
4,0.000625,0.013860,0.985515,positive
...,...,...,...,...
1692,0.002256,0.987201,0.010543,neutral
1693,0.003129,0.996526,0.000345,neutral
1694,0.997501,0.002245,0.000254,negative
1695,0.000177,0.999118,0.000706,neutral


In [15]:
final_df = pd.concat([df,output_df],axis=1)
final_df.head()

Unnamed: 0,Date,Full Text,Clean Text,Author,Url,Continent,Country,Region,Country Code,Continent Code,...,Twitter Followers,Twitter Following,Twitter Reply Count,Twitter Retweets,Twitter Verified,Reach (new),negative,neutral,positive,predicted_class
0,2022-10-01 23:40:00.000,"In Colorado Senate race, Michael Bennet still ...",in colorado senate race michael bennet still f...,Prison_Health,http://twitter.com/Prison_Health/statuses/1576...,North America,United States of America,Hawaii,USA,NORTH AMERICA,...,19711,2715,0,0,False,7325,0.00049,0.998759,0.000751,neutral
1,2022-10-01 23:27:28.000,COMING UP on @WRAL at 7:30pm: We sit down with...,coming up on at 730pm we sit down with and abo...,BryanRAnderson,http://twitter.com/BryanRAnderson/statuses/157...,North America,United States of America,North Carolina,USA,NORTH AMERICA,...,3832,1103,2,4,True,13263,0.000198,0.998733,0.001069,neutral
2,2022-10-01 23:16:38.000,Summaries of high-profile Supreme Court cases:...,summaries of highprofile supreme court cases t...,January20th49,http://twitter.com/January20th49/statuses/1576...,North America,United States of America,Ohio,USA,NORTH AMERICA,...,39,300,0,1,False,0,0.000256,0.998489,0.001255,neutral
3,2022-10-01 23:05:12.000,Abortion Icon Emma Bonino Trounced in Italian ...,abortion icon emma bonino trounced in italian ...,UsBurning,http://twitter.com/UsBurning/statuses/15763475...,North America,United States of America,Georgia,USA,NORTH AMERICA,...,360,34,0,0,False,0,0.006774,0.992278,0.000948,neutral
4,2022-10-01 22:02:12.000,💥38 DAYS UNTIL #ELECTIONDAY MIDTERMS💥 WHAT R U...,💥38 days until midterms💥 what r u doing for de...,LeviFetterman,http://twitter.com/LeviFetterman/statuses/1576...,North America,United States of America,Pennsylvania,USA,NORTH AMERICA,...,33774,1702,1,10,False,16039,0.000625,0.01386,0.985515,positive


In [None]:
final_df.to_excel("predicted_sentiment_scores.xlsx")