### Categorizing the tweets using pre-trained language models

It is highly recommended to run this notebook with GPU acceleration
(for example, using Google Colab). Running on a CPU may take several days, whereas running on a GPU takes approximately 3-6 hours.

I recommend creating a separate virtual environment, since we will need many
one-off dependencies. Alternatively, use Google Colab.

When using free Colab, it is strongly recommended to download the file 'tweets_with_analysis.csv' after each
analysis run, since free Colab doesn't support long idle times.
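
On Colab, the download can be triggered from a cell instead of through the file browser. The following is a minimal sketch: `download_if_colab` is a hypothetical helper, not part of this notebook, and `google.colab.files.download` is only importable inside a Colab runtime, so the function does nothing elsewhere.

```python
def download_if_colab(path: str) -> bool:
    """Trigger a browser download of `path` when running on Colab.

    Returns True if a download was requested, False outside Colab.
    """
    try:
        from google.colab import files  # only importable inside a Colab runtime
    except ImportError:
        return False
    files.download(path)  # the file must already exist at this point
    return True
```

Call it only after an analysis step has written the CSV, for example `download_if_colab('../datasets/twitter/tweets_with_analysis.csv')`.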


In [42]:
run_on_gpu = False
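
Rather than setting the flag by hand, it can be derived from the runtime. This is a minimal sketch, assuming PyTorch is the backend; the helper name `detect_gpu` is hypothetical, and the function reports `False` when `torch` is not installed.

```python
def detect_gpu() -> bool:
    """Return True if a CUDA device is visible to PyTorch, else False."""
    try:
        import torch  # imported lazily so the cell also runs without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# Overrides the manual flag with whatever the current runtime provides.
run_on_gpu = detect_gpu()
print(f"GPU available: {run_on_gpu}")
```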


You will need the following packages (uncomment and run to install):


In [3]:
# !pip install -q transformers
# !pip install -q datasets
# !pip install sentencepiece
# !pip install emoji==0.6.0

If you are not on Colab, you will need to install the following packages as well (uncomment and run to install):

In [None]:
#!pip3 install torch
#!pip3 install protobuf==3.20.0
#!pip3 install tqdm

In [34]:
from datasets import Dataset
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm
import pandas as pd
from transformers import pipeline

kwargs_dict = {'device': 0} if run_on_gpu else dict()


In [17]:
df_tweets = pd.read_csv('../datasets/twitter/somalia_tweets.csv')
# For testing purposes, we only run with the first 50 observations!
# Remove this line for replication purposes
df_tweets = df_tweets.head(50)
dataset = Dataset.from_pandas(df_tweets[['text']])


### Sentiment Analysis


In [43]:
model_path = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
task = pipeline("sentiment-analysis", model=model_path,
                tokenizer=model_path, max_length=512, truncation=True, **kwargs_dict)


In [21]:
sentiment_labels = []
for out in tqdm(task(KeyDataset(dataset, "text"))):
    sentiment_labels.append(out)


100%|██████████| 50/50 [00:06<00:00,  7.85it/s]


In [44]:
sentiment_labels = pd.DataFrame(sentiment_labels)
sentiment_labels.columns = 'sentiment_' + sentiment_labels.columns

df_final = pd.concat([df_tweets, sentiment_labels], axis=1)
df_final.to_csv('../datasets/twitter/tweets_with_analysis.csv')


### Categorization


In [46]:
model_path = "cardiffnlp/tweet-topic-21-multi"
task = pipeline("text-classification", model=model_path,
                tokenizer=model_path, max_length=512, truncation=True, **kwargs_dict)


In [49]:
category_labels = []
for out in tqdm(task(KeyDataset(dataset, "text"))):
    category_labels.append(out)


100%|██████████| 50/50 [00:05<00:00,  8.72it/s]


In [54]:
category_labels = pd.DataFrame(category_labels)
# Prefix the category frame's own columns (not the sentiment frame's, which are already prefixed)
category_labels.columns = 'category_' + category_labels.columns

df_final = pd.concat([df_final, category_labels], axis=1)
df_final.to_csv('../datasets/twitter/tweets_with_analysis.csv')
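
The model name suggests a multi-label classifier, so a tweet may belong to several topics at once, while the loop above keeps a single label per tweet. Below is a hedged sketch of keeping every topic above a cutoff. It assumes the pipeline is called with `top_k=None`, so that each output is a list of `{'label', 'score'}` dicts. The helper `labels_above`, the 0.5 threshold, and the sample scores are illustrative assumptions, not values from this notebook.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def labels_above(scores: List[Score], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score reaches the threshold, highest first."""
    kept = [s for s in scores if float(s["score"]) >= threshold]
    kept.sort(key=lambda s: float(s["score"]), reverse=True)
    return [str(s["label"]) for s in kept]


# Illustrative output for one tweet (made-up scores, not real model output).
example = [
    {"label": "news_&_social_concern", "score": 0.91},
    {"label": "sports", "score": 0.03},
    {"label": "diaries_&_daily_life", "score": 0.62},
]
print(labels_above(example))
```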


### Emotion analysis

In [56]:
model_path = "cardiffnlp/bertweet-base-emotion"
task = pipeline("text-classification", model=model_path,
                tokenizer=model_path, max_length=128, truncation=True, **kwargs_dict)


Downloading: 100%|██████████| 873/873 [00:00<00:00, 434kB/s]
Downloading: 100%|██████████| 515M/515M [00:48<00:00, 11.0MB/s] 
Downloading: 100%|██████████| 318/318 [00:00<00:00, 159kB/s]
Downloading: 100%|██████████| 824k/824k [00:00<00:00, 1.75MB/s]
Downloading: 100%|██████████| 1.03M/1.03M [00:00<00:00, 1.62MB/s]
Downloading: 100%|██████████| 17.0/17.0 [00:00<00:00, 1.42kB/s]


In [65]:
emotion_labels = []
for out in tqdm(task(KeyDataset(dataset, "text"))):
    emotion_labels.append(out)

100%|██████████| 50/50 [00:05<00:00,  8.82it/s]


In [66]:
emotion_labels = pd.DataFrame(emotion_labels)
emotion_labels.columns = 'emotion_' + emotion_labels.columns
emotion_labels['emotion_label'] = emotion_labels['emotion_label'].replace({'LABEL_0': 'anger', 'LABEL_1': 'joy', 'LABEL_2': 'optimism', 'LABEL_3': 'sadness'})
# Extend df_final (not df_tweets) so the sentiment and category columns are kept
df_final = pd.concat([df_final, emotion_labels], axis=1)
df_final.to_csv('../datasets/twitter/tweets_with_analysis.csv')
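
The three analyses above share one pattern: iterate over the pipeline output, collect one dict per tweet, and prefix the keys. A minimal sketch of that pattern as a reusable function follows. `collect_labels` and `fake_task` are hypothetical names, and `fake_task` is only a stand-in for a real `pipeline(...)` object so the example runs without downloading a model.

```python
from typing import Callable, Dict, Iterable, List


def collect_labels(task: Callable[[Iterable[str]], Iterable[Dict]],
                   texts: Iterable[str], prefix: str) -> List[Dict]:
    """Run `task` over `texts` and prefix every output key with `prefix`."""
    rows = []
    for out in task(texts):
        rows.append({f"{prefix}{key}": value for key, value in out.items()})
    return rows


def fake_task(texts):
    """Stand-in for a transformers pipeline: yields one dict per text."""
    for text in texts:
        yield {"label": "positive" if "good" in text else "negative", "score": 1.0}


rows = collect_labels(fake_task, ["good news", "bad news"], prefix="sentiment_")
print(rows)
```

Passing `rows` to `pd.DataFrame` gives a frame with the same column naming the notebook builds by hand in each section.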