# Sentiment Analysis Notebook

This Jupyter notebook runs sentiment analysis on Reddit comments and their replies. It relies on three libraries:

1. `polars`
2. `numpy`
3. `transformers`

The notebook imports the following modules from the `transformers` library:

- `pipeline`: A high-level, easy-to-use API for end-to-end NLP tasks.

- `AutoTokenizer`: Automatically selects the correct tokenizer for the specified model.

- `AutoModelForTokenClassification`: Automatically selects the correct model class for token classification tasks.

Please ensure that all these libraries are installed in your Python environment before running this notebook.


In [1]:
import polars as pl
import numpy as np

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

In [2]:
df = pl.read_parquet('../../reddit_subset_gm.parquet')

In [3]:
def add_reply_list(df: pl.DataFrame) -> pl.DataFrame:
    '''
    A function that adds a column called 'reply_ids' that contains a list of the 
    'reddit_name' ids that are replies to the post in that row. If there are 
    no replies, the value in this column is 'null'.

    Parameters:
    -----------
        df:pl.DataFrame
            A pl.DataFrame with a schema similar to the raw Aware data
    
    Returns:
    --------
        pl.DataFrame
            The input dataframe along with a new 'reply_ids' column, as 
            described above.
    '''

    ## Group the data by 'reddit_parent_id'
    group = df.group_by("reddit_parent_id")

    ## Aggregate the list of 'reddit_names' for each 'reddit_parent_id'
    replies = group.agg(pl.col("reddit_name"))
    ## Rename the columns so that they are aligned for joining. 
    ## We want to add the list of replies to a 'reddit_parent_id' to the
    ## row whose 'reddit_name' matches that value.
    new_names = {"reddit_name":"reply_ids","reddit_parent_id":"reddit_name"}
    replies = replies.rename(new_names)

    ## Join the list of 'reddit_name' values that are replies
    return df.join(replies, on="reddit_name", how="left")
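
To make the group-then-join logic of `add_reply_list` concrete, here is a minimal plain-Python sketch of the same two steps. It deliberately avoids `polars` so it can be run anywhere; the rows and ids are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Toy rows (hypothetical ids) with the two columns the function relies on.
rows = [
    {"reddit_name": "t1_a", "reddit_parent_id": "t3_post"},
    {"reddit_name": "t1_b", "reddit_parent_id": "t1_a"},
    {"reddit_name": "t1_c", "reddit_parent_id": "t1_a"},
    {"reddit_name": "t1_d", "reddit_parent_id": "t1_b"},
]

# Step 1: group by parent id and collect the ids of the replies.
replies_by_parent = defaultdict(list)
for row in rows:
    replies_by_parent[row["reddit_parent_id"]].append(row["reddit_name"])

# Step 2: left join back onto the rows, matching the parent id against
# each row's own 'reddit_name'. Rows without replies get None, which
# mirrors the null produced by the left join in the polars version.
with_replies = [
    {**row, "reply_ids": replies_by_parent.get(row["reddit_name"])}
    for row in rows
]
```

The key point is the rename before the join: the grouped key is a parent id, but it has to be matched against `reddit_name` so that each list of replies lands on the row being replied to.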

# Sentiment Analysis Functions

The next cell defines two functions: `load_sentiment_model` and `get_sentiment_list`.

## load_sentiment_model Function

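This function loads a pre-trained model and its tokenizer from the Hugging Face model hub. The checkpoint is "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis", a DistilRoBERTa model fine-tuned for financial news sentiment analysis. The function loads it with `AutoModelForTokenClassification` and returns a `pipeline` for the `"ner"` (token classification) task, so each call yields per-token labels rather than a single label per text. For this model class, `transformers` reports the classifier weights as newly initialized (see the warning in the output of the last cell), so the resulting labels should be treated with caution.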

## get_sentiment_list Function

This function takes a list of texts as input. It loads the pipeline with `load_sentiment_model`, runs each text through it, and takes the `entity` label of the first item in the result as that text's sentiment. If the pipeline returns no results for a text, the sentiment defaults to 'No Sentiment Detected'. The function returns a list of tuples, each containing the original text and its detected sentiment.

In [4]:
def load_sentiment_model():
    # Load the tokenizer and model and wrap them in a token-classification pipeline.
    # NOTE: transformers reports the classifier weights as newly initialized for
    # this model class (see the warning in the output of In [7]).
    model_name = "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForTokenClassification.from_pretrained(model_name)
    nlp = pipeline("ner", model=model, tokenizer=tokenizer)
    return nlp

def get_sentiment_list(texts):
    # Load the pipeline (this happens on every call to get_sentiment_list)
    nlp = load_sentiment_model()

    # List to hold the (text, sentiment) tuples
    results_list = []

    # Process each text in the list
    for text in texts:
        results = nlp(text)
        # Use the label of the first token-level result, if there is one
        sentiment = results[0]['entity'] if results else 'No Sentiment Detected'
        results_list.append((text, sentiment))

    return results_list
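
As written, `get_sentiment_list` calls `load_sentiment_model` each time it runs, so the model is reloaded on every call. One possible way to load it only once is to cache the loader with `functools.lru_cache`. The sketch below uses a stand-in loader so it runs without downloading anything; `load_model_stub`, `get_sentiment_list_cached`, and the `"stub-label"` value are hypothetical and exist only for illustration.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how often the (expensive) loader actually runs


@lru_cache(maxsize=1)
def load_model_stub():
    """Hypothetical stand-in for load_sentiment_model()."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    # Mimic the list-of-dicts shape of a token-classification pipeline;
    # an empty text yields no results, like a pipeline that finds nothing.
    return lambda text: [{"entity": "stub-label"}] if text else []


def get_sentiment_list_cached(texts):
    nlp = load_model_stub()  # loaded on the first call, cached afterwards
    results_list = []
    for text in texts:
        results = nlp(text)
        sentiment = results[0]["entity"] if results else "No Sentiment Detected"
        results_list.append((text, sentiment))
    return results_list


first = get_sentiment_list_cached(["some reply", ""])
second = get_sentiment_list_cached(["another reply"])
```

In the notebook itself, the same decorator could be applied to `load_sentiment_model`, which takes no arguments and is therefore straightforward to cache.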

In [5]:
df_with_replies = add_reply_list(df)

In [6]:
# Pick an example post (a fixed row index) that has a few replies
row_idx = 612
example_reddit_names = df_with_replies[row_idx]["reply_ids"]
example_replies = df.filter(pl.col("reddit_name").is_in(example_reddit_names[0]))["reddit_text"]
print(df_with_replies[row_idx]["reddit_text"][0])
print("Number of replies to this comment: ", example_reddit_names[0].len())

Easier now too. Ten years ago GM was handing out serious lowball offers coming out of the Great Recession and bankruptcy.

EDIT Why am I being downvoted for this? It's true. Ask anybody hired in 2013 what they got as a starting salary. TRACK engineers were below $60k for a while. Anyone thinking things are worse now than in the years after bankruptcy is dumb as fuck.
Number of replies to this comment:  4


In [7]:
get_sentiment_list(example_replies)

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[('This is true. If you work at a tech center or are a developer, you make more hopping companies. If you work at an area with a plant in engineering, other manufacturers pay $20k+ LESS, or they pay the same with half the vacation and benefits.',
  'positive'),
 ('i got a great offer ens of 2012, so it likely varies by area.', 'positive'),
 ('I was hired in 2014 and it’s true. I made $73k with 4 years of experience. It took leaving and coming back to get my salary on track. If I stayed the whole time I doubt very much I would have jumped up the way I did.',
  'positive'),
 ('Do you ever hear people talk about how Twitter is not an accurate representation of real people? Well GM reddit has an unhealthy balance of employees who hate GM with a passion. Its weird.',
  'positive')]
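
The warning above says that `classifier.bias` and `classifier.weight` were newly initialized, which means the token-classification head is untrained and the labels shown should not be read as real predictions. A possible fix, assuming (as the checkpoint name suggests) that the model was fine-tuned as a sequence classifier, is to load it with `AutoModelForSequenceClassification` and the `"text-classification"` pipeline task. This is a sketch, not the notebook's method: `load_sequence_classifier` and `label_texts` are hypothetical names, and the classifier is passed in as an argument so the labelling logic can be exercised with a stand-in.

```python
MODEL_NAME = "mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis"


def load_sequence_classifier(model_name=MODEL_NAME):
    """Assumed fix: load the checkpoint with a sequence-classification head."""
    # Imported lazily so the rest of this sketch runs without transformers.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)


def label_texts(texts, classifier):
    """Return (text, label) pairs; the classifier yields one dict per text."""
    texts = list(texts)
    if not texts:
        return []
    # truncation=True guards against posts longer than the model's max length.
    predictions = classifier(texts, truncation=True)
    return [(text, pred["label"]) for text, pred in zip(texts, predictions)]
```

In the notebook this could be called as `label_texts(example_replies.to_list(), load_sequence_classifier())`. The labels it returns may differ from the ones shown above and are not reproduced here.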