# Nvidia Sentiment Analysis

This notebook derives sentiment scores from Stocktwits messages that mention Nvidia and aggregates them into daily averages for use in later analysis.  

The Stocktwits data was obtained from [this Kaggle dataset](https://www.kaggle.com/datasets/frankcaoyun/stocktwits-2020-2022-raw), which contains raw posts from 2020 to 2022.  

To score the messages, we use the **Twitter RoBERTa-base sentiment model** (`cardiffnlp/twitter-roberta-base-sentiment`), a pre-trained transformer fine-tuned for sentiment classification of tweets.


In [None]:
import pandas as pd
from transformers import pipeline
from tqdm import tqdm
import torch

# use CUDA when available to speed up inference; otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# load the raw Stocktwits messages that mention Nvidia
df = pd.read_csv("./datasets/stockTweetData.csv")

  df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/combined.csv")


In [None]:
# keep only the message text and timestamp, and drop rows without text;
# .copy() gives an independent frame, which avoids SettingWithCopyWarning
cleaned_df = df[["body", "created_at"]].dropna(subset=["body"]).copy()
# remove the first five characters of each message, then trim whitespace
cleaned_df["body"] = cleaned_df["body"].str[5:].str.strip()

print(cleaned_df.shape)
cleaned_df.head()


(584181, 2)


Unnamed: 0,body,created_at
0,Time to back up the truck people! Make it happ...,2021-11-17T20:35:15Z
1,This was same AFRM movement before earnings. H...,2021-11-17T20:35:14Z
2,"its not just about merging, great low recently.",2021-11-17T20:35:11Z
3,Some guys can&#39;t handle the stress. Its gon...,2021-11-17T20:35:08Z
4,seeing a lot of PUTS being slung around.. be c...,2021-11-17T20:34:54Z
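
The preview above still contains HTML entities such as `&#39;`. A minimal sketch of one way to decode them during cleaning, using the standard library's `html.unescape`. This step is not part of the original notebook; `clean_body` and its `prefix_len` parameter are hypothetical names, and the `$NVDA` prefix in the sample is an assumption about what the five sliced-off characters are.

```python
import html


def clean_body(text: str, prefix_len: int = 5) -> str:
    """Drop a fixed-length prefix, trim whitespace, and decode HTML entities."""
    # same slicing as the notebook cell: remove the first `prefix_len` characters
    trimmed = text[prefix_len:].strip()
    # turn entities such as &#39; or &amp; back into plain characters
    return html.unescape(trimmed)


# illustrative input; the "$NVDA" prefix is an assumption about the raw format
sample = "$NVDA Some guys can&#39;t handle the stress."
print(clean_body(sample))
```

On the DataFrame this could be applied with `cleaned_df["body"].map(clean_body)` in place of the slicing step, assuming missing bodies have already been dropped.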


In [None]:
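# parse the timestamps as timezone-aware UTC datetimes; unparseable values become NaT
cleaned_df = cleaned_df.assign(
    created_at=pd.to_datetime(cleaned_df["created_at"], errors="coerce", utc=True)
)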

# check if nans are present
nan_count = cleaned_df['body'].isna().sum()

# display the result
print(nan_count)

0
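
The timestamp conversion in this cell relies on `errors='coerce'`, which silently turns unparseable strings into `NaT`. A small sketch on made-up values of how that behaves and how the failures could be counted; the sample strings are illustrative, not rows from the dataset.

```python
import pandas as pd

# one well-formed ISO 8601 timestamp and one deliberately broken value
raw = pd.Series(["2021-11-17T20:35:15Z", "not a timestamp"])

# utc=True returns timezone-aware values; errors="coerce" maps failures to NaT
parsed = pd.to_datetime(raw, errors="coerce", utc=True)

# count how many rows could not be parsed
bad_rows = int(parsed.isna().sum())
print(parsed.dt.tz, bad_rows)
```

Checking `bad_rows` alongside the `body` NaN count would show whether any timestamps were lost before the daily aggregation.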


In [None]:
# define the pipeline
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment",
    device=device,
    batch_size=1024,
    truncation=True,
    max_length=512,
    padding=True,
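    # half precision speeds up GPU inference; use float32 when running on the CPU
    torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
)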

# map the model's labels to signed integers: negative -> -1, neutral -> 0, positive -> 1
label_to_score = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}

# score every message in the column
texts = cleaned_df["body"].tolist()
scores = []

# number of messages handed to the pipeline per loop iteration
batch_size = 1024

# progress bar around the batch loop
for start in tqdm(range(0, len(texts), batch_size)):
    batch_texts = texts[start:start + batch_size]
    batch_out   = sentiment_pipeline(batch_texts)

    scores.extend(
        round(label_to_score[r["label"]] * r["score"], 4)
        for r in batch_out
    )

# set sentiment scores
cleaned_df["sentiment_score"] = scores

Device set to use cuda
  2%|▏         | 9/571 [00:36<38:21,  4.10s/it]


KeyboardInterrupt: 
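
The scoring loop above does two things: it slices the messages into fixed-size batches, and it collapses each prediction into one signed number. A minimal sketch of both steps without the model, so it runs anywhere; `signed_score`, `iter_batches`, and `fake_results` are hypothetical names, and the fake results are hand-written stand-ins rather than real predictions.

```python
import math

# label order documented on the model card: 0 = negative, 1 = neutral, 2 = positive
LABEL_TO_SIGN = {"LABEL_0": -1, "LABEL_1": 0, "LABEL_2": 1}


def signed_score(result: dict) -> float:
    """Collapse one pipeline result into a single number in [-1, 1]."""
    return round(LABEL_TO_SIGN[result["label"]] * result["score"], 4)


def iter_batches(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# hand-written stand-ins for pipeline output, not real model predictions
fake_results = [
    {"label": "LABEL_2", "score": 0.91234},
    {"label": "LABEL_1", "score": 0.75},
    {"label": "LABEL_0", "score": 0.6},
]
print([signed_score(r) for r in fake_results])

# number of batches for the 584181 rows reported earlier in the notebook
print(math.ceil(584181 / 1024))
```

Two consequences are worth noting. A neutral prediction always maps to 0, whatever its confidence, so the score only carries magnitude for positive and negative labels. And 584181 rows at 1024 per batch gives 571 batches, which matches the total shown in the progress bar.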

In [None]:
from google.colab import files
cleaned_df.to_csv('cleaned_data.csv', index=False)

# download the CSV file
files.download('cleaned_data.csv')


In [16]:
df = pd.read_csv("/content/cleaned_data.csv")

In [None]:
# keep only the calendar date of each message
df['date'] = pd.to_datetime(df['created_at']).dt.date

# mean sentiment score and number of messages per day
daily_summary = (
    df.groupby('date')['sentiment_score']
      .agg(mean='mean', count='size')
      .reset_index()
)

print(daily_summary.head())

# save the results
daily_summary.to_csv('./datasets/dailyTweetSummary.csv', index=False)



         date      mean  count
0  2013-04-11  0.041931     16
1  2013-04-12  0.015730     44
2  2013-04-13  0.238400      4
3  2013-04-14  0.458900     11
4  2013-04-15  0.015864     11
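
Some of the days above rest on only a handful of messages (4 on 2013-04-13), so their means are noisy. A minimal sketch on made-up data that repeats the same named aggregation and then applies an optional filter; `MIN_MESSAGES` is a hypothetical threshold and the filter is not part of the original notebook.

```python
import pandas as pd

# toy data: two messages on one day, one on the next
toy = pd.DataFrame({
    "created_at": [
        "2021-11-17T20:35:15Z",
        "2021-11-17T09:00:00Z",
        "2021-11-18T01:00:00Z",
    ],
    "sentiment_score": [0.8, -0.4, 0.5],
})

# same steps as the cell above: calendar date, then mean and count per day
toy["date"] = pd.to_datetime(toy["created_at"], utc=True).dt.date
daily = (
    toy.groupby("date")["sentiment_score"]
       .agg(mean="mean", count="size")
       .reset_index()
)

# hypothetical threshold: keep only days with enough messages to trust the mean
MIN_MESSAGES = 2
reliable = daily[daily["count"] >= MIN_MESSAGES]
print(reliable)
```

Keeping the `count` column in the saved summary, as the notebook already does, leaves this choice to the downstream analysis.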

