In [1]:
!jupyter nbextension enable --py widgetsnbextension

Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


## Imports

In [1]:
import pandas as pd
import numpy as np
import torch
from transformers import pipeline
from transformers.pipelines.pt_utils import Dataset, KeyDataset
import time
from tqdm.auto import tqdm

In [2]:
class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)
    
    def __getitem__(self, i):
        return self.original_list[i]

In [7]:

# Returns only the date and the preprocessed tweet text; username and language are excluded.
def run_sentiment_analysis_and_save(path_to_tweets, model_path="cardiffnlp/twitter-roberta-base-sentiment-latest", date_column="Date"):
    input_csv = pd.read_csv(path_to_tweets)
    # date_column should be "tweetcreatedts" for questions 1 and 2 and "Date" for question 3.
    # Some tweets have no text (e.g. NaN); drop those rows entirely so dates and texts stay aligned.
    valid_rows = input_csv[input_csv['Tweet'].apply(lambda text: isinstance(text, str))]
    filtered_tweets = valid_rows['Tweet'].to_list()
    tweet_date = valid_rows[date_column].to_list()
    print('Removed ', len(input_csv) - len(filtered_tweets), 'invalid tweets')

    # Preprocessing: replace user mentions with '@user' and truncate links to 'http'.
    pre_processed = []  # holds all the preprocessed tweets
    for tweet in filtered_tweets:
        tweet_words = []
        for word in tweet.split(' '):
            if word.startswith('@') and len(word) > 1:  # a mention starts with @
                word = '@user'
            elif "http" in word:
                i = word.index("http")
                word = word[:i] + "http"
            tweet_words.append(word)
        pre_processed.append(" ".join(tweet_words))

    print('Loaded tweets at ' + path_to_tweets)

    sentiment_pipeline = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path, max_length=512, truncation=True)

    print('Running Sentiment Analysis...')
    start_time = time.time()
    result = sentiment_pipeline(pre_processed)
    end_time = time.time()
    print('Time elapsed: ', end_time - start_time, ' seconds')

    # Added by Sheikh; feel free to fix any issues you find.
    data_given = pd.DataFrame({"Date": tweet_date, "text": pre_processed})
    result_df = data_given.join(pd.DataFrame(result))
    result_df.to_csv(path_to_tweets.split('.csv')[0] + '_with_sentiment.csv')
    return data_given
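
The mention and link normalization inside the function can be tried on its own before running the model. The sketch below pulls that step into a helper; the name `normalize_tweet` is hypothetical and not part of the notebook, and the sample tweet is made up for illustration.

```python
def normalize_tweet(text):
    """Replace user mentions with '@user' and truncate links to 'http'."""
    words = []
    for word in text.split(' '):
        if word.startswith('@') and len(word) > 1:
            # Any mention becomes the generic placeholder.
            word = '@user'
        elif 'http' in word:
            # Keep whatever precedes the link, then a bare 'http' marker.
            word = word[:word.index('http')] + 'http'
        words.append(word)
    return ' '.join(words)


# Made-up sample tweet, only for illustration.
print(normalize_tweet('@nytimes see https://t.co/abc now'))
```

A lone `@` is left untouched because of the `len(word) > 1` check, and any characters glued to the front of a link (such as an opening parenthesis) are kept.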
    

## Q1

In [32]:
q1_path = 'data/q1/all_tweets.csv'
run_sentiment_analysis_and_save(q1_path)


Removed  0 invalid tweets
Loaded tweets at data/q1/all_tweets.csv


KeyboardInterrupt: 
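
The function writes its CSV only after the whole pipeline call has finished, so an interrupted run like the one above saves nothing. One possible workaround, sketched here under assumptions, is to classify the tweets in chunks and hand each finished chunk to a callback that could persist it. The names `run_in_chunks`, `classify_batch`, and `on_chunk_done` are hypothetical, and `fake_classifier` is a stand-in for the real pipeline so that the sketch runs without downloading a model.

```python
def run_in_chunks(texts, classify_batch, chunk_size=1000, on_chunk_done=None):
    """Classify texts chunk by chunk, reporting each finished chunk."""
    results = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        chunk_results = classify_batch(chunk)
        results.extend(chunk_results)
        if on_chunk_done is not None:
            # A real callback could append the chunk to a CSV file here.
            on_chunk_done(start, chunk_results)
    return results


def fake_classifier(batch):
    # Stand-in for the sentiment pipeline: one dict per input text.
    return [{'label': 'neutral', 'score': 1.0} for _ in batch]


saved = []
labels = run_in_chunks(
    ['a', 'b', 'c', 'd', 'e'],
    fake_classifier,
    chunk_size=2,
    on_chunk_done=lambda start, res: saved.append((start, len(res))),
)
print(len(labels), saved)
```

In the notebook, `sentiment_pipeline` could presumably be passed as `classify_batch`, since it is already called on a list of strings and its result is turned into a DataFrame.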

## Q2

### English

In [8]:
q2_path_1 = 'data/q2/nato_english.csv'
run_sentiment_analysis_and_save(q2_path_1)

KeyError: 'Date'
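
This error is consistent with the comment in the function: the Q1 and Q2 files are said to store the timestamp under `tweetcreatedts`, not `Date`. A minimal sketch of choosing whichever column is present follows; the helper name `pick_date_column` is an assumption, not something the notebook defines.

```python
def pick_date_column(columns, candidates=('Date', 'tweetcreatedts')):
    """Return the first candidate column name that exists in columns."""
    for name in candidates:
        if name in columns:
            return name
    raise KeyError(f'None of {candidates} found in columns {list(columns)}')


# Column lists are illustrative; a real call would pass input_csv.columns.
print(pick_date_column(['Tweet', 'tweetcreatedts']))
print(pick_date_column(['Date', 'Tweet']))
```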

In [20]:
q2_path_2 = 'data/q2/putin_english.csv'
run_sentiment_analysis_and_save(q2_path_2)

Loaded tweets at data/q2/putin_english.csv


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Running Sentiment Analysis...
Time elapsed:  1128.437345981598  seconds


In [26]:
q2_path_3 = 'data/q2/zelensky_english.csv'
run_sentiment_analysis_and_save(q2_path_3)

27507 27505
Loaded tweets at data/q2/zelensky_english.csv


Running Sentiment Analysis...
Time elapsed:  1356.382895231247  seconds


### Russian

In [6]:
model_path = "cointegrated/rubert-tiny-sentiment-balanced"
q2_path_4 = 'data/q2/nato_russian.csv'
run_sentiment_analysis_and_save(q2_path_4, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/nato_russian.csv
Running Sentiment Analysis...
Time elapsed:  0.08802032470703125  seconds


In [7]:
q2_path_5 = 'data/q2/putin_russian.csv'
run_sentiment_analysis_and_save(q2_path_5, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/putin_russian.csv
Running Sentiment Analysis...
Time elapsed:  232.93544363975525  seconds


In [8]:
q2_path_6 = 'data/q2/zelensky_russian.csv'
run_sentiment_analysis_and_save(q2_path_6, model_path)

Removed  0 invalid tweets
Loaded tweets at data/q2/zelensky_russian.csv
Running Sentiment Analysis...
Time elapsed:  232.61395645141602  seconds


## Q3

I am making some significant changes here.
Running the analysis on the four datasets I scraped.
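
Since the four files share one directory, the input and output paths can be derived in a loop. The sketch below is only an illustration: `base_dir` is a shortened placeholder for the Google Drive folder, and `sentiment_output_path` is a hypothetical helper that mirrors the `_with_sentiment.csv` naming used by the function.

```python
from pathlib import PurePosixPath


def sentiment_output_path(path_to_tweets):
    """Build '<name>_with_sentiment.csv' next to the input file."""
    path = PurePosixPath(path_to_tweets)
    return str(path.with_name(path.stem + '_with_sentiment.csv'))


# Placeholder directory; the notebook uses an absolute Google Drive path.
base_dir = PurePosixPath('output/q3/May30Scrap')
file_names = ['foxtitle.csv', 'nytitle.csv', 'foxalltweets.csv', 'nytalltweets.csv']

for name in file_names:
    input_path = str(base_dir / name)
    print(input_path, '->', sentiment_output_path(input_path))
```

Unlike `path.split('.csv')[0]`, this only touches the final path component, so a `.csv` appearing earlier in a directory name would not truncate the path.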

In [8]:
q3_path_1 = "/Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/foxtitle.csv"
run_sentiment_analysis_and_save(q3_path_1)

Removed  0 invalid tweets
Loaded tweets at /Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/foxtitle.csv


Running Sentiment Analysis...
Time elapsed:  874.1434421539307  seconds


| | Date | text |
|---|---|---|
| 0 | 2022-05-23 23:50:02+00:00 | The Big Apple became a big headache for NY Dem... |
| 1 | 2022-05-23 23:41:05+00:00 | ENEMIES ARE WATCHING: @user explains why Biden... |
| 2 | 2022-05-23 23:34:41+00:00 | Hillary Clinton gave a green light to gaslight... |
| 3 | 2022-05-23 23:28:49+00:00 | TONIGHT: @user weighs in on Biden’s upside dow... |
| 4 | 2022-05-23 23:25:22+00:00 | ‘Jurassic Park’s’ Laura Dern, Sam Neill reflec... |
| ... | ... | ... |
| 25257 | 2021-12-24 01:08:42+00:00 | Karen Carpenter's brother Richard shares a fav... |
| 25258 | 2021-12-24 00:55:32+00:00 | George Floyd removed from consideration for Te... |
| 25259 | 2021-12-24 00:46:17+00:00 | McEnany: Biden won't be able to hide 'in the b... |
| 25260 | 2021-12-24 00:26:05+00:00 | Cuomo lashes out at AG James after prosecutors... |


In [9]:
q3_path_2 = "/Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/nytitle.csv"
run_sentiment_analysis_and_save(q3_path_2)

Removed  0 invalid tweets
Loaded tweets at /Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/nytitle.csv


Running Sentiment Analysis...
Time elapsed:  537.7459280490875  seconds


| | Date | text |
|---|---|---|
| 0 | 2022-05-23 23:40:03+00:00 | As more than a dozen countries grapple with ou... |
| 1 | 2022-05-23 23:20:03+00:00 | Major League Baseball suspended the Yankees th... |
| 2 | 2022-05-23 23:00:09+00:00 | Former Senator David Perdue ended his campaign... |
| 3 | 2022-05-23 22:40:03+00:00 | Addressing the World Economic Forum, President... |
| 4 | 2022-05-23 22:20:05+00:00 | The U.S. on Monday said it would supply Romani... |
| ... | ... | ... |
| 12259 | 2021-12-24 02:00:10+00:00 | Eudes Pierre, who was shot and killed by polic... |
| 12260 | 2021-12-24 01:50:03+00:00 | The fast spread of the Omicron variant has lef... |
| 12261 | 2021-12-24 01:00:06+00:00 | Inyoung You, a former Boston College student w... |
| 12262 | 2021-12-24 00:20:05+00:00 | Former Gov. Andrew Cuomo will not face crimina... |


In [10]:
q3_path_3 = "/Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/foxalltweets.csv"
run_sentiment_analysis_and_save(q3_path_3)

Removed  0 invalid tweets
Loaded tweets at /Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/foxalltweets.csv


Running Sentiment Analysis...
Time elapsed:  3576.2499661445618  seconds


| | Date | text |
|---|---|---|
| 0 | 2022-05-23 23:54:14+00:00 | @user Damn they sure forget the 4 years they g... |
| 1 | 2022-05-23 23:52:27+00:00 | @user Inflation at 8.4% 8 trillion dollars add... |
| 2 | 2022-05-23 23:51:52+00:00 | @user @user @user ENEMIES ARE WATCHING: AND TH... |
| 3 | 2022-05-23 23:43:17+00:00 | @user Russia, if you're listening... |
| 4 | 2022-05-23 23:35:34+00:00 | @user Trump asked for help from Russia on LIVE... |
| ... | ... | ... |
| 87522 | 2021-12-24 01:50:14+00:00 | @user How dare the west try to defend a sovere... |
| 87523 | 2021-12-24 01:42:31+00:00 | @user We need to give Ukraine our full militar... |
| 87524 | 2021-12-24 01:42:26+00:00 | @user Putin needs to look in the mirror to poi... |
| 87525 | 2021-12-24 01:41:43+00:00 | @user I'd say that the guy repeatedly threaten... |


In [11]:
q3_path_4 = "/Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/nytalltweets.csv"
run_sentiment_analysis_and_save(q3_path_4)

Removed  0 invalid tweets
Loaded tweets at /Volumes/GoogleDrive/My Drive/Spring 2022/Data Science Methodology/UkraineConflictOnTwitter/DataCollection/scrapping/output/q3/May30Scrap/nytalltweets.csv


Running Sentiment Analysis...
Time elapsed:  1689.477998971939  seconds


| | Date | text |
|---|---|---|
| 0 | 2022-05-23 23:55:22+00:00 | @user Screw Ukraine. |
| 1 | 2022-05-23 23:53:27+00:00 | @user Hunter??? Russia Russia Russia?? Clinton... |
| 2 | 2022-05-23 23:51:28+00:00 | @user @user Mind you, he demeans us white peop... |
| 3 | 2022-05-23 23:51:24+00:00 | @user Yeah let’s worry about this while inflat... |
| 4 | 2022-05-23 23:50:18+00:00 | @user What a douche. Yeah let's 3rd world our ... |
| ... | ... | ... |
| 41114 | 2021-12-26 07:00:06+00:00 | Nations have chosen their leaders from among m... |
| 41115 | 2021-12-25 20:49:45+00:00 | @user @user continues to promote china’s and R... |
| 41116 | 2021-12-25 13:29:18+00:00 | @user Nazi Russia |
| 41117 | 2021-12-24 16:35:36+00:00 | @user All democratic nations can halt trade wi... |
