In [1]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Imports

In [2]:
import pandas as pd
import numpy as np
import torch
from transformers import pipeline
from transformers.pipelines.pt_utils import Dataset, KeyDataset
import time
from tqdm.auto import tqdm

In [3]:
class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)
    
    def __getitem__(self, i):
        return self.original_list[i]

In [31]:

# Heavily modified: the saved output keeps only the date and the preprocessed tweet text; username and language are excluded.
def run_sentiment_analysis_and_save(path_to_tweets, model_path="cardiffnlp/twitter-roberta-base-sentiment-latest"):
    input_csv = pd.read_csv(path_to_tweets)
    # Date column: use "tweetcreatedts" for questions 1 and 2, and "Date" for question 3
    date_column = 'tweetcreatedts'
    # Some tweets have no text (e.g. NaN); drop those rows so dates and texts stay aligned
    valid_rows = input_csv[input_csv['text'].apply(lambda text: isinstance(text, str))]
    tweet_date = valid_rows[date_column].to_list()
    filtered_tweets = valid_rows['text'].to_list()
    print('Removed ', len(input_csv) - len(filtered_tweets), 'invalid tweets')
    

    # Preprocessing: mask user mentions and links
    pre_processed = []  # preprocessed version of every filtered tweet
    for tweet in filtered_tweets:
        tweet_words = []
        for word in tweet.split(' '):
            if word.startswith('@') and len(word) > 1:  # a mention starts with @
                word = '@user'
            elif "http" in word:
                # keep any text before the link and cut the URL down to "http"
                i = word.index("http")
                word = word[:i] + "http"
            tweet_words.append(word)
        pre_processed.append(" ".join(tweet_words))

    


    # print(tweet_text[:10])
    # tweets_dataset = ListDataset(tweet_text)
    print('Loaded tweets at ' + path_to_tweets)

    sentiment_pipeline = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path, max_length=512, truncation=True)
    
    print('Running Sentiment Analysis...')
    start_time = time.time()
    result = sentiment_pipeline(pre_processed)
    end_time = time.time()
    print('Time elapsed: ', end_time - start_time, ' seconds')

    # Added by Sheikh; feel free to fix any issues you find
    data_given = pd.DataFrame({"Date": tweet_date, "text": pre_processed})
    # The pipeline returns one {"label", "score"} dict per tweet, in input order,
    # so joining on the default positional index lines the rows up
    result_df = data_given.join(pd.DataFrame(result))
    result_df.to_csv(path_to_tweets.split('.csv')[0] + '_with_sentiment.csv')
    return data_given
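
The preprocessing loop inside the function can be pulled out and tried on a single tweet. This is a minimal sketch of the same masking rule; the helper name `mask_mentions_and_links` and the sample tweet are made up for illustration and are not part of the notebook.

```python
def mask_mentions_and_links(tweet: str) -> str:
    """Replace @mentions with '@user' and cut each link down to 'http'."""
    masked_words = []
    for word in tweet.split(' '):
        if word.startswith('@') and len(word) > 1:
            word = '@user'
        elif 'http' in word:
            # Keep any text glued in front of the link, drop the URL itself
            word = word[:word.index('http')] + 'http'
        masked_words.append(word)
    return ' '.join(masked_words)


# Made-up sample tweet, only to show the effect of the masking
example = mask_mentions_and_links('@FoxNews see https://t.co/abc now')
print(example)
```

A lone `@` is left untouched because of the `len(word) > 1` check, and punctuation in front of a link (for example an opening parenthesis) is kept.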
    

## Q1

In [32]:
q1_path = 'data/q1/all_tweets.csv'
run_sentiment_analysis_and_save(q1_path)


Removed  0 invalid tweets
Loaded tweets at data/q1/all_tweets.csv


KeyboardInterrupt: 

## Q2

### English

In [8]:
q2_path_1 = 'data/q2/nato_english.csv'
run_sentiment_analysis_and_save(q2_path_1)

KeyError: 'Date'
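
The `KeyError: 'Date'` above is what a mismatch between the hard-coded date column and the file's actual columns looks like. One way to avoid editing the function for each question is to look the column up. The sketch below assumes the only two names in play are `tweetcreatedts` and `Date`, as the comment in the function says; `find_date_column` is a hypothetical helper, not something the notebook defines.

```python
# Column names taken from the comment in the function; extend if other files differ
DATE_COLUMN_CANDIDATES = ('tweetcreatedts', 'Date')


def find_date_column(columns, candidates=DATE_COLUMN_CANDIDATES) -> str:
    """Return the first candidate name that is present in `columns`."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f'None of {candidates} found in columns {list(columns)}')


# Column lists standing in for the Q1/Q2 files and the Q3 files
print(find_date_column(['tweetcreatedts', 'text']))
print(find_date_column(['Date', 'text']))
```

Inside the function this could be called as `find_date_column(input_csv.columns)`, since membership tests work on a pandas column index as well as on a plain list.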

In [20]:
q2_path_2 = 'data/q2/putin_english.csv'
run_sentiment_analysis_and_save(q2_path_2)

Loaded tweets at data/q2/putin_english.csv


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Running Sentiment Analysis...
Time elapsed:  1128.437345981598  seconds
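
This run took about 19 minutes and printed nothing until it finished. A possible way to get feedback along the way is to feed the tweets to the classifier in slices. The sketch below is an assumption about how that could look, not the notebook's method: `classify_in_chunks` is a hypothetical helper, and `fake_classifier` stands in for the real pipeline so the example runs without downloading a model.

```python
import time


def classify_in_chunks(classify, texts, chunk_size=1000):
    """Call `classify` on slices of `texts` and report progress after each slice."""
    results = []
    start_time = time.time()
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        results.extend(classify(chunk))
        done = min(start + chunk_size, len(texts))
        print(f'{done}/{len(texts)} tweets, {time.time() - start_time:.1f} s elapsed')
    return results


# Stand-in classifier (an assumption for this sketch): same output shape as the pipeline
def fake_classifier(batch):
    return [{'label': 'neutral', 'score': 1.0} for _ in batch]


labels = classify_in_chunks(fake_classifier, ['a', 'b', 'c'], chunk_size=2)
```

In the notebook the call would be `classify_in_chunks(sentiment_pipeline, pre_processed)`, because the pipeline accepts a list of strings and returns a list of dicts in the same order.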


In [26]:
q2_path_3 = 'data/q2/zelensky_english.csv'
run_sentiment_analysis_and_save(q2_path_3)

27507 27505
Loaded tweets at data/q2/zelensky_english.csv




Running Sentiment Analysis...
Time elapsed:  1356.382895231247  seconds


### Russian

In [6]:
model_path = "cointegrated/rubert-tiny-sentiment-balanced"
q2_path_4 = 'data/q2/nato_russian.csv'
run_sentiment_analysis_and_save(q2_path_4, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/nato_russian.csv
Running Sentiment Analysis...
Time elapsed:  0.08802032470703125  seconds


In [7]:
q2_path_5 = 'data/q2/putin_russian.csv'
run_sentiment_analysis_and_save(q2_path_5, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/putin_russian.csv
Running Sentiment Analysis...
Time elapsed:  232.93544363975525  seconds


In [8]:
q2_path_6 = 'data/q2/zelensky_russian.csv'
run_sentiment_analysis_and_save(q2_path_6, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/zelensky_russian.csv
Running Sentiment Analysis...
Time elapsed:  232.61395645141602  seconds


## Q3

I am making some significant changes here.
First, I am using data that I scraped myself, which contains only the news headlines and nothing else (no replies or retweets).
The files contain every tweet the channels posted from December to April, so I will later have to pick out the tweets related to the war in Ukraine manually.
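
A keyword pass could narrow the headlines down before the manual sorting. The sketch below is only a first filter under stated assumptions: the keyword list is hypothetical and not taken from the notebook, `mentions_war` is a made-up helper, and the sample titles are invented.

```python
import re

# Hypothetical keyword list, not from the notebook; adjust before relying on it
WAR_KEYWORDS = ('ukraine', 'ukrainian', 'russia', 'russian',
                'putin', 'zelensky', 'kyiv', 'mariupol')
WAR_PATTERN = re.compile(r'\b(' + '|'.join(WAR_KEYWORDS) + r')\b', flags=re.IGNORECASE)


def mentions_war(title) -> bool:
    """True if `title` is a string containing one of the keywords as a whole word."""
    return isinstance(title, str) and WAR_PATTERN.search(title) is not None


# Invented example titles, only to show the filter at work
titles = [
    'Fighting continues near Kyiv',
    'Met Gala theme gets mixed reviews',
]
war_titles = [title for title in titles if mentions_war(title)]
print(war_titles)
```

On the real data the same check could be applied to the `text` column, for example with `input_csv[input_csv['text'].apply(mentions_war)]`. A keyword match misses headlines that discuss the war without naming it, so a manual review would still be needed.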


In [29]:
q3_path_1 = 'data/q3/FoxNews_Sheikh.csv'
run_sentiment_analysis_and_save(q3_path_1)

Removed  0 invalid tweets
Loaded tweets at data/q3/FoxNews_Sheikh.csv




Running Sentiment Analysis...
Time elapsed:  728.9498617649078  seconds


Unnamed: 0,Date,text
0,2022-04-29 23:50:08+00:00,OPINION: @user @user Biden thinks student loan...
1,2022-04-29 23:40:07+00:00,Clinton campaign seeks to block Durham access ...
2,2022-04-29 23:30:00+00:00,NYC bystander stabbed by group outside club in...
3,2022-04-29 23:20:00+00:00,Met Gala 2022 'Gilded Glamour' theme gets mixe...
4,2022-04-29 23:10:00+00:00,Elon Musk and Amber Heard: What we learned thi...
...,...,...
22500,2021-12-01 00:57:39+00:00,Gutfeld: 'It's going to be a war' between arme...
22501,2021-12-01 00:47:31+00:00,Illegal immigrant posed as rideshare driver an...
22502,2021-12-01 00:35:05+00:00,Salvation Army pulls controversial racism guid...
22503,2021-12-01 00:20:03+00:00,'Sex and the City' spin-off releases full trai...


In [30]:
q3_path_2 = 'data/q3/NYT_Sheikh.csv'
run_sentiment_analysis_and_save(q3_path_2)

Removed  0 invalid tweets
Loaded tweets at data/q3/NYT_Sheikh.csv




Running Sentiment Analysis...
Time elapsed:  1622.3399460315704  seconds


Unnamed: 0,Date,text
0,2022-04-29 23:40:08+00:00,Linemen and receivers took center stage in the...
1,2022-04-29 23:00:10+00:00,"In Opinion\n\nJ.D. Vance's ""Trumpian turn has ..."
2,2022-04-29 22:40:05+00:00,The pandemic has upended the rigid 9-to-5 work...
3,2022-04-29 22:00:16+00:00,As a Manhattan grand jury wraps up its review ...
4,2022-04-29 21:53:06+00:00,"“If Mariupol is hell, Azovstal is worse.” The ..."
...,...,...
12388,2021-12-01 00:50:07+00:00,After thirteen cases of the Omicron variant we...
12389,2021-12-01 00:40:03+00:00,"""The last time I was inside the walls of Oxfor..."
12390,2021-12-01 00:30:09+00:00,Detectives investigating the deadly shooting o...
12391,2021-12-01 00:15:08+00:00,"Josh Duggar, who gained celebrity on the TLC r..."
