Title: Assignment: 7.1
Author: Sarah Yawn
Date: 17 July 2025
Modified By: Sarah Yawn
Description: Sentiment Analysis using NLTK and SpaCy
Data: https://www.kaggle.com/datasets/nelgiriyewithana/emotions

# Set-Up

Install necessary packages using pip
pip install nltk
pip install spacy

In [None]:
#Install Packages
#!pip install nltk
#!pip install spacy

Importing libraries

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import spacy
from spacy import displacy

import kagglehub
import pandas as pd
import random
import os

Downloading resources

In [None]:
# Downloading necessary NLTK resources and spaCy model
nltk.download('vader_lexicon')
spacy.cli.download("en_core_web_sm")

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
# Load spaCy English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

# Download latest version
path = kagglehub.dataset_download("nelgiriyewithana/emotions")

# Sentiment Analysis with NLTK
# Initialize the VADER sentiment intensity analyzer
sia = SentimentIntensityAnalyzer()

# NLTK

Load the dataset.
The downloaded path is a directory, so we first need to find the actual CSV file inside it.
Then we choose which text to analyse and grab its true sentiment.
The dataset uses six sentiment labels: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5).

In [None]:
csv_file_path = None
for root, _, files in os.walk(path):
    for file in files:
        if file.endswith('.csv'):
            csv_file_path = os.path.join(root, file)
            break
    if csv_file_path:
        break

if csv_file_path:
    df = pd.read_csv(csv_file_path)

    # Choose a random number between 1 and 500 (assuming IDs are 1-based)
    random_id = random.randint(1, 500)

    # Find the row with the corresponding id (assuming the dataset is 0-indexed in pandas)
    # Subtract 1 from random_id to get the 0-indexed position
    try:
      selected_row = df.iloc[random_id - 1]

      # Get the 'text' value
      text = selected_row['text']

      # Get the 'label' value and store it in 'sentiment'
      sentiment = selected_row['label']

      print(f"Selected ID: {random_id}")
      print(f"Text: {text}")
      print(f"Sentiment: {sentiment}")

    except IndexError:
      print(f"Error: ID {random_id} is out of bounds for the dataset.")
      # You might want to handle this case, perhaps by choosing a new random_id
      # or setting text and sentiment to default values.
      text = None
      sentiment = None
else:
    print(f"Error: No CSV file found in the downloaded directory: {path}")
    text = None
    sentiment = None

Selected ID: 204
Text: i couldnt feel any divine being in my own pulse
Sentiment: 1
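
The integer printed for the sentiment is easier to read next to its emotion name. The following is a minimal sketch, assuming the six-way mapping listed in the label description above (0 = sadness through 5 = surprise). `LABEL_NAMES` and `label_name` are illustrative names, not part of the dataset or of any library.

```python
# Mapping taken from the label description in this notebook.
# LABEL_NAMES and label_name are illustrative helpers, not a dataset API.
LABEL_NAMES = {
    0: "sadness",
    1: "joy",
    2: "love",
    3: "anger",
    4: "fear",
    5: "surprise",
}


def label_name(label):
    """Return the emotion name for an integer label, or 'unknown' if unmapped."""
    try:
        return LABEL_NAMES.get(int(label), "unknown")
    except (TypeError, ValueError):
        # Covers None (no row selected) and non-numeric values.
        return "unknown"


print(label_name(1))
```

With the row selected above, `label_name(sentiment)` would give "joy" for label 1.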


In [None]:
# Reuse the `text` value selected from the df DataFrame in the previous cell.

# Apply the VADER sentiment intensity analyzer (sia) to the selected text.
polarity_scores = sia.polarity_scores(text)

# Print the resulting polarity scores.
print("\nSentiment Analysis Results for Selected Text:")
for score in polarity_scores:
    print(f"{score}: {polarity_scores[score]}")

# Interpret the overall sentiment based on the compound score.
compound_score = polarity_scores['compound']
print(f"Compound Score: {compound_score}")

if compound_score >= 0.05:
    sentiment_interpretation = "Positive"
elif compound_score <= -0.05:
    sentiment_interpretation = "Negative"
else:
    sentiment_interpretation = "Neutral"

print(f"Interpretation: The overall sentiment is {sentiment_interpretation}")


Sentiment Analysis Results for Selected Text:
neg: 0.268
neu: 0.732
pos: 0.0
compound: -0.4449
Compound Score: -0.4449
Interpretation: The overall sentiment is Negative


## Response to NLTK
The text I am submitting this under is ID 204. It reads "i couldnt feel any divine being in my own pulse", and the sentiment it was labelled with is 1 (joy). Honestly, I, a human, would also have given this a negative sentiment if asked to analyse it. I understand that this is a complex dataset, but in this instance I feel that NLTK had a better understanding of the sentiment than the dataset creator.

FYI, here is the output from the previous two code cells:

Selected ID: 204
Text: i couldnt feel any divine being in my own pulse
Sentiment: 1

Sentiment Analysis Results for Selected Text:
neg: 0.268
neu: 0.732
pos: 0.0
compound: -0.4449
Compound Score: -0.4449
Interpretation: The overall sentiment is Negative
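
To make the comparison between VADER's result and the dataset label explicit, the six emotions can be collapsed into a coarse polarity. This is only a sketch. The grouping of emotions into positive and negative, with surprise left undecided, is an assumption made here and not something the dataset defines. `EMOTION_POLARITY`, `vader_polarity`, and `agrees` are hypothetical names. The ±0.05 cutoffs are the ones already used in the code above.

```python
# Assumed grouping of the six emotion labels into coarse polarity.
# Surprise can go either way, so it is left as None (undecided).
EMOTION_POLARITY = {
    0: "Negative",  # sadness
    1: "Positive",  # joy
    2: "Positive",  # love
    3: "Negative",  # anger
    4: "Negative",  # fear
    5: None,        # surprise
}


def vader_polarity(compound, threshold=0.05):
    """Bucket a VADER compound score with the same cutoffs used above."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def agrees(compound, label):
    """True/False if VADER and the label agree; None if the label is undecided."""
    expected = EMOTION_POLARITY.get(int(label))
    if expected is None:
        return None
    return vader_polarity(compound) == expected


# Values from the run shown above: compound -0.4449, label 1 (joy).
print(vader_polarity(-0.4449), agrees(-0.4449, 1))
```

For the run shown above this prints `Negative False`, which matches the disagreement described in the response.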

# SpaCy

In [None]:
# Topic Modeling with spaCy
# Example text for topic modeling
doc = nlp("spaCy is an industrial-strength natural language processing library.")

In [None]:
# Select a different random row from the dataframe df
random_id_topic = random.randint(1, len(df)) # Get a random ID within the bounds of the dataframe
topic_text = df.iloc[random_id_topic - 1]['text'] # Select the text using 0-indexing

print(f"Selected ID for Topic Modeling: {random_id_topic}")
print(f"Text for Topic Modeling: {topic_text}")

# Process the topic_text with the loaded spaCy model nlp
topic_doc = nlp(topic_text)

# Iterate through the named entities in the topic_doc object and print the text and label of each entity.
print("\nNamed Entities in Topic Text:")
if topic_doc.ents:
    for ent in topic_doc.ents:
        print(f"{ent.text} ({ent.label_})")
else:
    print("No named entities found.")

# Iterate through the noun chunks in the topic_doc object and print the text of each noun phrase.
print("\nNoun Phrases in Topic Text:")
if list(topic_doc.noun_chunks):
    for np in topic_doc.noun_chunks:
        print(np.text)
else:
    print("No noun phrases found.")

Selected ID for Topic Modeling: 33629
Text for Topic Modeling: i have avoided writing a post about marriage because i feel inadequate to write one

Named Entities in Topic Text:
one (CARDINAL)

Noun Phrases in Topic Text:
i
a post
marriage
i
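
Because the row is drawn at random, each rerun analyses a different text, which makes it hard to keep written commentary in sync with the output. One option, sketched here with Python's standard `random.Random`, is to draw the index from a seeded generator. `pick_row_index` and the seed value 42 are arbitrary illustrative choices.

```python
import random


def pick_row_index(n_rows, seed=42):
    """Return a repeatable 0-based row index in the range [0, n_rows)."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    rng = random.Random(seed)  # private generator; global random state is untouched
    return rng.randrange(n_rows)


# The same seed and size always give the same index.
first = pick_row_index(1000)
second = pick_row_index(1000)
print(first == second)
```

With the DataFrame above, `df.iloc[pick_row_index(len(df))]` would select the same row on every run. pandas' `DataFrame.sample(n=1, random_state=...)` is an alternative.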


Attempt to identify themes or topics prevalent in the text, based on the entities and noun phrases extracted using spaCy.


In [None]:
from collections import Counter

# Function to identify themes based on entities and noun phrases
def identify_themes(doc):
    # Extract entities
    entities = [ent.text for ent in doc.ents]

    # Extract noun chunks
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]

    # Combine entities and noun phrases
    keywords = entities + noun_phrases

    # Count the frequency of each keyword
    keyword_counts = Counter(keywords)

    # Get the most common keywords (can adjust the number as needed)
    most_common_keywords = keyword_counts.most_common(5) # Get top 5 keywords

    return most_common_keywords

# Assuming 'topic_doc' from the previous code is available and processed
if 'topic_doc' in locals() and topic_doc:
    themes = identify_themes(topic_doc)
    print("\nIdentified Themes/Topics:")
    if themes:
        for keyword, count in themes:
            print(f"- {keyword} (count: {count})")
    else:
        print("No significant themes identified.")
else:
    print("\n'topic_doc' is not available. Please ensure the spaCy document is processed.")
    print("Run the preceding code block to process the text with spaCy.")


Identified Themes/Topics:
- i (count: 2)
- one (count: 1)
- a post (count: 1)
- marriage (count: 1)


Token analysis and Parse Dependencies

In [None]:
# Print token-level analysis (Lemma, POS, Tag, Dep, Shape) for the topic_doc
print("\nToken-level analysis (Lemma, POS, Tag, Dep, Shape) for Topic Text:")
for token in topic_doc:
    print(f"{token.text} ({token.lemma_}, {token.pos_}, {token.tag_}, {token.dep_}, {token.shape_})")

# Visualize the dependency parse of the topic_doc
print("\nDependency Parsing Visualization for Topic Text:")
displacy.render(topic_doc, style='dep', options={'compact': True, 'bg': 'ghostwhite', 'color': '#000000'})


Token-level analysis (Lemma, POS, Tag, Dep, Shape) for Topic Text:
i (I, PRON, PRP, nsubj, x)
have (have, AUX, VBP, aux, xxxx)
avoided (avoid, VERB, VBN, ROOT, xxxx)
writing (write, VERB, VBG, xcomp, xxxx)
a (a, DET, DT, det, x)
post (post, NOUN, NN, dobj, xxxx)
about (about, ADP, IN, prep, xxxx)
marriage (marriage, NOUN, NN, pobj, xxxx)
because (because, SCONJ, IN, mark, xxxx)
i (I, PRON, PRP, nsubj, x)
feel (feel, VERB, VBP, advcl, xxxx)
inadequate (inadequate, ADJ, JJ, acomp, xxxx)
to (to, PART, TO, aux, xx)
write (write, VERB, VB, xcomp, xxxx)
one (one, NUM, CD, dobj, xxx)

Dependency Parsing Visualization for Topic Text:


I am fairly certain this visualization is so large because of the very complex sentence structure of the text that was pulled. I am hoping that when this is rerun, a less complex sentence will work just as well. While I was testing this, I occasionally received less complex sentences, which appeared to display effectively.
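
If the rendered parse is too wide, displaCy's `options` dictionary can shrink it. The sketch below builds such a dictionary from the sentence length. The keys `compact`, `distance`, `word_spacing`, and `arrow_spacing` are documented options of displaCy's dependency visualizer, but the specific numbers, the length cutoff, and the helper name `dep_options_for` are guesses to tune by eye.

```python
def dep_options_for(n_tokens, long_sentence=12):
    """Return displaCy 'dep' options, tightening spacing for long sentences."""
    # Same base options as the render call above.
    options = {"compact": True, "bg": "ghostwhite", "color": "#000000"}
    if n_tokens > long_sentence:
        # A smaller pixel distance between words narrows the whole figure.
        options.update({"distance": 90, "word_spacing": 30, "arrow_spacing": 10})
    return options


# Usage with the objects defined above:
# displacy.render(topic_doc, style="dep", options=dep_options_for(len(topic_doc)))
print(dep_options_for(15))
```

Short sentences keep the original settings, so only long parses are compressed.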

## Response to SpaCy

I used fairly small texts because the example used fairly small texts, and perhaps that is my misjudgement. I bring this up because, while I find SpaCy's functionality to be much broader than NLTK's, I am not sure what value there is in knowing all the topics of a single sentence. In a very large series of texts, I can imagine there being value, but in this instance it felt like meaningless clutter. NLTK, by contrast, gave me a clear number to compare with my own perspective on the text or, thanks to my dataset, with what the creator of the dataset viewed the sentiment to be. To my mind, that has more face value. Still, I understand that the work we did with SpaCy here has more potential value for long-term and in-depth analysis.

## Summary:

### Data Analysis Key Findings

*   Sentiment analysis of a selected text using NLTK's VADER resulted in a compound score of -0.4449, indicating a "Negative" sentiment.
*   Named Entity Recognition using spaCy on a different selected text ("i have avoided writing a post about marriage because i feel inadequate to write one") found a single entity, "one", labelled CARDINAL.
*   SpaCy identified the following noun phrases in the topic modeling text: "i", "a post", "marriage", and "i".
*   Token-level analysis provided detailed linguistic information (lemma, POS, tag, dependency, shape) for each word in the topic modeling text.
*   The dependency parse of the topic modeling text was successfully visualized, illustrating the grammatical relationships between words.

### Insights or Next Steps

*   While sentiment analysis provided an overall emotional tone, the scarcity of meaningful named entities in the topic modeling text suggests it might be highly subjective or personal, potentially focusing on internal states rather than external subjects or objects.
*   Further analysis could involve applying topic modeling techniques like LDA or NMF to a larger corpus of texts to identify broader themes and patterns across the dataset, as focusing on single sentences may not reveal significant topics.
