# Final Project for CSC 440

US election 2020

## Before we start: check for deps and data paths

Dependencies to install:

- **jupyter**: install the packages before launching Jupyter Notebook; if you install them from a running notebook, you need to restart the kernel.
- **colab**: just uncomment the `%pip` line below.

In [2]:
# google colab
# %pip install datasets -q

In [3]:
# run this PRIOR to starting jupyter notebook
# pip3 install -q datasets sentencepiece protobuf==3.20
# pip3 install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# jupyter notebook
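
Before launching the notebook, it can help to confirm which of these packages are already installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list simply mirrors the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# distribution names taken from the pip commands above
REQUIRED = ["datasets", "sentencepiece", "protobuf", "torch"]

for name in REQUIRED:
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```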

Import the libraries we need.

In [4]:
# os commands
import os
import sys

# data basics
import numpy as np
import pandas as pd

# visualizations
import seaborn as sns
import matplotlib.pyplot as plt

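# pytorch and huggingface
import torch
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
# `Dataset` is the Hugging Face class; torch.utils.data.Dataset is not imported
# because the same name would be shadowed by this import anyway
from datasets import load_dataset, Dataset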

# progress bar
from tqdm.auto import tqdm

Make sure we are using a GPU; otherwise inference will be very slow.

In [5]:
device = 0 if torch.cuda.is_available() else -1
device

0
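
The integer follows the `device` argument convention of `transformers.pipeline`: `0` selects the first CUDA GPU and `-1` selects the CPU. A slightly more defensive variant could look like the sketch below; the function name `pick_device` is hypothetical, and the fallback to CPU when PyTorch is missing is an assumption, not something the notebook does.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if one is available, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # assumption: without PyTorch we can only fall back to the CPU
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("running on", "GPU" if device == 0 else "CPU")
```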

Check that our data is already in place.

In [6]:
# download the data with curl
# !curl -O some_url

# if already downloaded
DATA_ROOT = './'

# for google colab, files should be placed under data/cs440/ in My Drive
# DATA_ROOT = '/content/drive/MyDrive/data/cs440/'

if not os.path.exists(DATA_ROOT):
    print(f'error: {DATA_ROOT} does not exist', file=sys.stderr)
for dirname, _, filenames in os.walk(DATA_ROOT):
    for filename in filenames:
        print(os.path.join(dirname, filename))

./CSC440_Final_Project.ipynb
./desktop.ini
./hashtag_donaldtrump.csv
./hashtag_joebiden.csv
./.ipynb_checkpoints\CSC440_Final_Project-checkpoint.ipynb
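
The listing above is checked by eye. As a rough sketch, the same check can be made explicit by looking for the two CSV files the notebook reads later; `missing_files` and `REQUIRED_FILES` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# file names taken from the listing above
REQUIRED_FILES = ["hashtag_donaldtrump.csv", "hashtag_joebiden.csv"]


def missing_files(root, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files under `root`."""
    root = Path(root)
    return [name for name in required if not (root / name).is_file()]


missing = missing_files("./")  # "./" mirrors DATA_ROOT above
if missing:
    print("missing:", ", ".join(missing))
```

An empty list means both input files were found.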


## Reading the data

Hugging Face docs: [datasets.load_dataset](https://huggingface.co/docs/datasets/main/en/package_reference/loading_methods#datasets.load_dataset)

In [8]:
dataset = load_dataset(
    "csv",
    data_files={
        'trump': f'{DATA_ROOT}hashtag_donaldtrump.csv',
        'biden': f'{DATA_ROOT}hashtag_joebiden.csv'
    },
    lineterminator="\n"
)

In [9]:
dataset

DatasetDict({
    trump: Dataset({
        features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
        num_rows: 970919
    })
    biden: Dataset({
        features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
        num_rows: 776886
    })
})

## Predicting language

In [7]:
model_ckpt = "papluca/xlm-roberta-base-language-detection"
pipe = pipeline("text-classification", model=model_ckpt, device=device)

In [8]:
TASK = 'lang'
os.makedirs(f'{DATA_ROOT}/{TASK}', exist_ok=True)
for par_name, par_data in dataset.items():
    print(par_name, par_data)
    res = []
    for out in tqdm(pipe(KeyDataset(par_data, 'tweet'), batch_size=1024,
                         truncation=True, max_length=128),
                    total=len(par_data)):
        res.append(out['label'])
    pd.Series(res, name=TASK).to_csv(f'{DATA_ROOT}/{TASK}/{par_name}.csv', index=False)

trump Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 970919
})


  0%|          | 0/970919 [00:00<?, ?it/s]

biden Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 776886
})


  0%|          | 0/776886 [00:00<?, ?it/s]
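
Each saved file holds a single `lang` column with one predicted label per tweet. One possible way to summarize such a file is sketched below with the standard library only; `label_counts` is a hypothetical helper, and it assumes the header row written by `to_csv` above.

```python
import csv
from collections import Counter


def label_counts(csv_path, column="lang"):
    """Count how often each label occurs in one column of a saved CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[column] for row in csv.DictReader(f))
```

For example, `label_counts(f'{DATA_ROOT}/lang/trump.csv').most_common(5)` would list the five most frequent predicted languages once the loop above has finished.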

In [None]:
# alternative: classify through Dataset.map instead of iterating over the pipeline
# (use num_proc=16 on cpu to speed this up)
#def lang_cls(samples):
#    return {"lang": pipe(samples['tweet'], truncation=True, max_length=128)}
#dataset_lang = dataset.map(lang_cls, batched=True)
#dataset_lang['trump'].to_csv(f"{DATA_ROOT}hashtag_donaldtrump_lang.csv")

## Sentiment Analysis

In [10]:
model_ckpt = "cardiffnlp/twitter-xlm-roberta-base-sentiment"
pipe = pipeline("sentiment-analysis", model=model_ckpt, tokenizer=model_ckpt, device=device)

In [15]:
TASK = 'sent'
N_SAMPLES = 50
for par_name, par_data in dataset.items():
    print(par_name, par_data)
    for out in tqdm(pipe(KeyDataset(par_data.select(range(N_SAMPLES)), 'tweet'), batch_size=50,
                         truncation=True, max_length=128),
                    total=N_SAMPLES):
        print(out)

trump Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 970919
})


  0%|          | 0/50 [00:00<?, ?it/s]

{'label': 'neutral', 'score': 0.6742995381355286}
{'label': 'neutral', 'score': 0.7037839293479919}
{'label': 'negative', 'score': 0.492326945066452}
{'label': 'negative', 'score': 0.7795373797416687}
{'label': 'neutral', 'score': 0.7171657085418701}
{'label': 'negative', 'score': 0.8753815293312073}
{'label': 'negative', 'score': 0.9256624579429626}
{'label': 'negative', 'score': 0.9210783839225769}
{'label': 'positive', 'score': 0.49399882555007935}
{'label': 'neutral', 'score': 0.807226300239563}
{'label': 'negative', 'score': 0.7046650052070618}
{'label': 'negative', 'score': 0.9151006937026978}
{'label': 'neutral', 'score': 0.5873754024505615}
{'label': 'negative', 'score': 0.7721818685531616}
{'label': 'neutral', 'score': 0.5251723527908325}
{'label': 'negative', 'score': 0.653194785118103}
{'label': 'negative', 'score': 0.6693116426467896}
{'label': 'neutral', 'score': 0.545574426651001}
{'label': 'neutral', 'score': 0.7103788256645203}
{'label': 'negative', 'score': 0.830850422

  0%|          | 0/50 [00:00<?, ?it/s]

{'label': 'neutral', 'score': 0.6742995381355286}
{'label': 'neutral', 'score': 0.6340779662132263}
{'label': 'neutral', 'score': 0.5332580804824829}
{'label': 'positive', 'score': 0.7750991582870483}
{'label': 'negative', 'score': 0.8534130454063416}
{'label': 'negative', 'score': 0.7046650052070618}
{'label': 'negative', 'score': 0.9151006937026978}
{'label': 'neutral', 'score': 0.5251723527908325}
{'label': 'neutral', 'score': 0.7950451970100403}
{'label': 'negative', 'score': 0.9296641945838928}
{'label': 'neutral', 'score': 0.545574426651001}
{'label': 'neutral', 'score': 0.5850774645805359}
{'label': 'positive', 'score': 0.4592738151550293}
{'label': 'negative', 'score': 0.867538571357727}
{'label': 'negative', 'score': 0.8790665864944458}
{'label': 'negative', 'score': 0.9270439147949219}
{'label': 'neutral', 'score': 0.7250171303749084}
{'label': 'negative', 'score': 0.8513714671134949}
{'label': 'neutral', 'score': 0.666561484336853}
{'label': 'positive', 'score': 0.5026677846
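
Several predictions in the preview have scores near 0.5, so the top label is only weakly preferred. A minimal sketch of filtering by confidence follows; the function name `confident_only` and the threshold of 0.6 are arbitrary assumptions for illustration and are not part of the analysis in this notebook.

```python
def confident_only(predictions, threshold=0.6):
    """Keep predictions whose score is at least `threshold`."""
    return [p for p in predictions if p["score"] >= threshold]


# first three predictions from the trump preview above
preview = [
    {"label": "neutral", "score": 0.6742995381355286},
    {"label": "neutral", "score": 0.7037839293479919},
    {"label": "negative", "score": 0.492326945066452},
]
print(confident_only(preview))
```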

In [11]:
TASK = 'sent'
os.makedirs(f'{DATA_ROOT}/{TASK}', exist_ok=True)
for par_name, par_data in dataset.items():
    print(par_name, par_data)
    res = []
    for out in tqdm(pipe(KeyDataset(par_data, 'tweet'), batch_size=1024,
                         truncation=True, max_length=128),
                    total=len(par_data)):
        res.append(out['label'])
    pd.Series(res, name=TASK).to_csv(f'{DATA_ROOT}/{TASK}/{par_name}.csv', index=False)

trump Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 970919
})


  0%|          | 0/970919 [00:00<?, ?it/s]

biden Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 776886
})


  0%|          | 0/776886 [00:00<?, ?it/s]
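
The saved CSVs contain only the label column, so matching a label back to its tweet depends on the row order staying unchanged. As a sketch of a more robust layout, the labels could be stored next to the `tweet_id` column; `pair_with_ids` is a hypothetical helper, not something the notebook uses.

```python
def pair_with_ids(tweet_ids, labels):
    """Pair each tweet id with its predicted label, refusing mismatched lengths."""
    tweet_ids, labels = list(tweet_ids), list(labels)
    if len(tweet_ids) != len(labels):
        raise ValueError(
            f"{len(tweet_ids)} ids but {len(labels)} labels; rows would be misaligned"
        )
    return [{"tweet_id": i, "sent": label} for i, label in zip(tweet_ids, labels)]
```

Inside the loop above, something like `pd.DataFrame(pair_with_ids(par_data['tweet_id'], res))` could then be written out instead of the bare `pd.Series`.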

## Emotion

In [19]:
model_ckpt = "02shanky/finetuned-twitter-xlm-roberta-base-emotion"
pipe = pipeline("text-classification", model=model_ckpt, device=device)


In [31]:
TASK = 'emotion'
N_SAMPLES = 500

collect = {}
for par_name, par_data in dataset.items():
    print(par_name, par_data)
    res = []
    # take the first N_SAMPLES rows once so predictions and tweets share an index
    sample = par_data.select(range(N_SAMPLES))
    for i, out in enumerate(tqdm(pipe(KeyDataset(sample, 'tweet'), batch_size=50,
                                      truncation=True, max_length=128),
                                 total=N_SAMPLES)):
        out['tweet'] = sample[i]['tweet'].replace("\n", " ")
        res.append(out)
    collect[par_name] = pd.DataFrame(res)

trump Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 970919
})




  0%|          | 0/500 [00:00<?, ?it/s]

biden Dataset({
    features: ['created_at', 'tweet_id', 'tweet', 'likes', 'retweet_count', 'source', 'user_id', 'user_name', 'user_screen_name', 'user_description', 'user_join_date', 'user_followers_count', 'user_location', 'lat', 'long', 'city', 'country', 'continent', 'state', 'state_code', 'collected_at'],
    num_rows: 776886
})




  0%|          | 0/500 [00:00<?, ?it/s]

In [34]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

In [35]:
collect['trump']

| | label | score | tweet |
|---|---|---|---|
| 0 | anger | 0.75231 | #Elecciones2020 \| En #Florida: #JoeBiden dice que #DonaldTrump solo se preocupa por él mismo. El demócrata fue anfitrión de encuentros de electores en #PembrokePines y #Miramar. Clic AQUÍ ⬇️⬇️⬇️ ⠀ 🌐https://t.co/qhIWpIUXsT _ #ElSolLatino #yobrilloconelsol https://t.co/6FlCBWf1Mi |
| 1 | anger | 0.695926 | Usa 2020, Trump contro Facebook e Twitter: coprono Biden #donaldtrump https://t.co/6ceURhe1VP https://t.co/94jidLjoON |
| 2 | anger | 0.648726 | #Trump: As a student I used to hear for years, for ten years, I heard China! In 2019! And we have 1.5 and they don't know how many we have and I asked them how many do we have and they said 'sir we don't know.' But we have millions. Like 300 million. Um. What? |
| 3 | anger | 0.672886 | 2 hours since last tweet from #Trump! Maybe he is VERY busy. Tremendously busy. |
| 4 | joy | 0.432762 | You get a tie! And you get a tie! #Trump ‘s rally #Iowa https://t.co/jJalUUmh5D |
| 5 | anger | 0.980999 | @CLady62 Her 15 minutes were over long time ago. Omarosa never represented the black community! #TheReidOut She cried to #Trump begging for a job! |
| 6 | joy | 0.981912 | @richardmarx Glad u got out of the house! DICK!!#trump 2020💪🏽🇺🇸🇺🇸 |
| 7 | anger | 0.991769 | @DeeviousDenise @realDonaldTrump @nypost There won’t be many of them. Unless you all have been voting more than once again. But God prevails. BO was the most corrupt President ever. Dark to light. Your lies are all coming through. They wouldn’t last forever. #Trump |
| 8 | joy | 0.968813 | One of the single most effective remedies to eradicate another round of #Trump Plague in our #WhiteHouse. https://t.co/QGB9ODIVS8 |
| 9 | joy | 0.848548 | #Election2020 #Trump #FreedomOfSpeech https://t.co/9slOZFZNHJ |


In [36]:
collect['biden']

| | label | score | tweet |
|---|---|---|---|
| 0 | anger | 0.75231 | #Elecciones2020 \| En #Florida: #JoeBiden dice que #DonaldTrump solo se preocupa por él mismo. El demócrata fue anfitrión de encuentros de electores en #PembrokePines y #Miramar. Clic AQUÍ ⬇️⬇️⬇️ ⠀ 🌐https://t.co/qhIWpIUXsT _ #ElSolLatino #yobrilloconelsol https://t.co/6FlCBWf1Mi |
| 1 | anger | 0.782161 | #HunterBiden #HunterBidenEmails #JoeBiden #JoeBidenMustStepDown https://t.co/9enmxWvePm |
| 2 | anger | 0.628936 | @IslandGirlPRV @BradBeauregardJ @MeidasTouch This is how #Biden made his ! #TrumpIsNotAmerica ! https://t.co/uBqAFU86Ip |
| 3 | joy | 0.997446 | @chrislongview Watching and setting dvr. Let’s give him bonus ratings!! #JoeBiden |
| 4 | anger | 0.959217 | #censorship #HunterBiden #Biden #BidenEmails #BidenEmail #Corruption https://t.co/C6clrtshQl |
| 5 | anger | 0.990663 | "IS THIS WRONG??!!" Cory Booker's BRILLIANT Final Questioning of Trump Nominee Amy Coney Barrett https://t.co/gCTvVLl4CS #AmyConeyBarrett #CoryBooker #Barrett #Booker #Trump #KamalaHarris #JoeBiden #SCOTUS #SupremeCourtConfirmation |
| 6 | anger | 0.993187 | In 2020, #NYPost is being #censorship #CENSORED by Twitter to manipulate a US election in favor of #JoeBiden and against #Trump. but CCP from #China or porn on Twitter? That’s always been fine for @jack @vijaya @dickc @KatieS. @marciadorsey is @jack sick? |
| 7 | joy | 0.63912 | ►► Tell Politicians to STICK IT with this FREE Item! ►► https://t.co/sua2FPNJTS ►► #2020 #Biden #Deomocrat #Election #Politician #Politics #President #Republican #Trump #VPDebate ►► @FreebieDepot https://t.co/QTRW3eMtd0 |
| 8 | joy | 0.889988 | #Biden https://t.co/qMs0PmUev5 |
| 9 | anger | 0.987541 | Proof Bidens are crooked. Twitter will suspend me for sharing https://t.co/LUbPCbROEp #Biden #crookedBiden #ukraniancollusion #Democrats #HunterBidenEmails |
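
To compare the two partitions, the predicted labels can be reduced to shares per emotion. The sketch below does this for the ten rows displayed above for each partition; `label_shares` is a hypothetical helper, and ten rows are far too few to support any conclusion, so this only illustrates the calculation.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the total as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# labels of the ten rows displayed above for each partition
trump_labels = ["anger", "anger", "anger", "anger", "joy",
                "anger", "joy", "anger", "joy", "joy"]
biden_labels = ["anger", "anger", "anger", "joy", "anger",
                "anger", "anger", "joy", "joy", "anger"]

print("trump:", label_shares(trump_labels))
print("biden:", label_shares(biden_labels))
```

On the full sample, the same function could be applied to `collect['trump']['label']` and `collect['biden']['label']`.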
