This prototype was created to test the concepts that were later adapted into the Python files organised under main.py. Confirming that the concepts worked on a smaller data set made it easier to migrate the code from this Jupyter notebook into main.py, ingest.py, preprocess.py, analyse.py and visualise.py.

In [None]:
import json
from pathlib import Path

file_path = Path("../data/sample_stream.jsonl")
data = []

if not file_path.exists():
    print(f"File not found! Looked in: {file_path.resolve()}")
else:
    print(f"Reading from: {file_path.resolve()}\n")

    with open(file_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            try:
                obj = json.loads(line)
                data.append(obj)
                print(f"Line {i} {json.dumps(obj, indent=2)}")
            except json.JSONDecodeError:
                print(f"Line {i}: (invalid JSON) {line.strip()}")

print("\nFinished reading file.")
print(f"Total valid JSON objects: {len(data)}")

Cell One: Reads the .jsonl file line by line, parsing each entry into a JSON object and storing it in a list. This preserves the structured format, which simplifies the data processing in later cells. Error handling was also added to help debug path issues and to report any lines that are not valid JSON.
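As a minimal sketch of how the loop in Cell One might be wrapped into a reusable function when migrating it out of the notebook: the function name read_jsonl, the blank-line handling and the demo entries below are assumptions for illustration, not the code in ingest.py.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Return (records, bad_line_numbers) for a JSON Lines file."""
    records, bad_lines = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, 1):
            if not line.strip():
                continue  # skip blank lines rather than reporting them as invalid
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                bad_lines.append(line_number)
    return records, bad_lines


# Demonstration on a throwaway file with made-up entries
# (not the real sample_stream.jsonl).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.jsonl"
    demo_path.write_text(
        '{"timestamp": "2024-01-01T10:00:00", "text": "Great!"}\n'
        "not valid json\n"
        "\n"
        '{"timestamp": "2024-01-01T10:05:00", "text": "Awful."}\n',
        encoding="utf-8",
    )
    demo_records, demo_bad_lines = read_jsonl(demo_path)

print(f"Valid records: {len(demo_records)}, invalid lines: {demo_bad_lines}")
```

Returning the invalid line numbers instead of printing them keeps the function quiet, so the caller can decide how to report problems.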

In [None]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

df = pd.DataFrame(data)
punct_to_remove = ''.join(p for p in string.punctuation if p not in ("!", "?", "'"))
# Build the translation table and stopword set once instead of once per token
punct_table = str.maketrans('', '', punct_to_remove)
stop_words = set(stopwords.words('english'))

def preprocess_text_vader(text):
    text = text.lower()
    text = text.translate(punct_table)
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return ' '.join(tokens)

df['processed_text'] = df['text'].apply(preprocess_text_vader)
# Copy the slice so later column assignments do not raise SettingWithCopyWarning
processed_data = df[['timestamp', 'processed_text']].copy()

print(processed_data)

Cell Two: Cleans and preprocesses the text data for the sentiment analysis in Cell Three. Each entry is converted to lowercase, and punctuation is removed except for "!", "?" and "'", which are kept to maintain emotional indicators and preserve word meaning. Stopwords are also filtered out. pandas applies the preprocessing function to the whole column in one call instead of looping over the lines manually (this matters more here than in Cell One, because several operations are applied to every entry). The string module simplifies the handling of punctuation, and nltk provides the stopword list. Preprocessing standardises the text, which should improve the consistency of VADER. The results are stored in a new column, leaving the original text unchanged for future reference. The timestamp and processed_text columns are then saved to processed_data.
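To make the effect of these rules concrete, here is a small sketch that runs the same steps on a single made-up sentence. To stay self-contained it uses a tiny hand-written stopword set (DEMO_STOPWORDS) as a stand-in for the NLTK list, so its output will differ from what Cell Two produces with the full list.

```python
import string

# Same rule as Cell Two: strip all punctuation except "!", "?" and "'".
KEEP = {"!", "?", "'"}
punct_to_remove = "".join(p for p in string.punctuation if p not in KEEP)
removal_table = str.maketrans("", "", punct_to_remove)

# Tiny stand-in for NLTK's English stopword list (assumption, illustration only).
DEMO_STOPWORDS = {"the", "is", "a", "and", "it", "was"}


def preprocess_demo(text):
    """Lowercase, strip most punctuation, then drop the demo stopwords."""
    tokens = text.lower().translate(removal_table).split()
    return " ".join(token for token in tokens if token not in DEMO_STOPWORDS)


sample = "The service was SLOW, and it's a shame... Really?!"
cleaned = preprocess_demo(sample)
print(cleaned)  # service slow it's shame really?!
```

The comma and full stops disappear, while the apostrophe in "it's" and the trailing "?!" survive, which is the behaviour the punctuation exceptions are meant to give.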

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
import pandas as pd
nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()

processed_data['vader_scores'] = processed_data['processed_text'].apply(lambda x: sia.polarity_scores(x))
processed_data[['neg', 'neu', 'pos', 'compound']] = processed_data['vader_scores'].apply(pd.Series)

def label_sentiment(compound):
    if compound >= 0.05:
        return 'Positive'
    elif compound <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

processed_data['sentiment'] = processed_data['compound'].apply(label_sentiment)
display_columns = ['timestamp', 'processed_text', 'neg', 'neu', 'pos', 'compound', 'sentiment']
vader_processed_data = processed_data[display_columns]

print(vader_processed_data)

Cell Three: Applies sentiment analysis to processed_data using VADER. Each entry is scored, producing negative, neutral and positive scores plus an overall compound score. Because the raw scores are hard to interpret without knowledge of the VADER documentation, the function label_sentiment applies thresholds to the compound score to label each entry: 0.05 or above is Positive, -0.05 or below is Negative, and anything in between is Neutral. The data now labelled by VADER is then saved to vader_processed_data.
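The thresholds used by label_sentiment are easiest to see at the boundaries. The sketch below repeats the function from Cell Three and applies it to hand-picked compound values; these values are made up for illustration and are not VADER output from the sample data.

```python
def label_sentiment(compound):
    """Map a compound score to a label using the +/-0.05 thresholds from Cell Three."""
    if compound >= 0.05:
        return "Positive"
    if compound <= -0.05:
        return "Negative"
    return "Neutral"


# Hand-picked values around the two thresholds (illustrative, not real scores).
example_compounds = [0.6249, 0.05, 0.049, 0.0, -0.049, -0.05, -0.7]
example_labels = [label_sentiment(c) for c in example_compounds]

for compound, label in zip(example_compounds, example_labels):
    print(f"{compound:+.4f} -> {label}")
```

Both thresholds are inclusive, so a compound score of exactly 0.05 counts as Positive and exactly -0.05 counts as Negative.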

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import os

sns.set_theme(style="whitegrid")

plot_data = vader_processed_data.copy()
plot_data['timestamp'] = pd.to_datetime(plot_data['timestamp'])

output_dir = "../output/prototype_outputs"
os.makedirs(output_dir, exist_ok=True)

#Line Plot showing sentiment changes over time
plt.figure(figsize=(12, 6))
plt.plot(plot_data['timestamp'], plot_data['compound'], marker= 'o', linewidth=2, markersize=6, color='steelblue', alpha=0.8)
plt.axhline(y=0, color= 'r', linestyle='--', alpha=0.3, label='Neutral Line')
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
plt.gca().xaxis.set_major_locator(mdates.MinuteLocator(interval=5))
plt.title('Sentiment Compound Score Over Time')
plt.xlabel('Time')
plt.ylabel('Compound Sentiment Score')
plt.xticks(rotation=45)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'sentiment_overtime.png'))
plt.show()

#Count Plot showing sentiment distribution
plt.figure(figsize=(6, 4))
sns.countplot(x='sentiment', data=vader_processed_data, hue='sentiment', palette= 'pastel', order=['Positive', 'Neutral', 'Negative'], legend=False)
plt.title("Sentiment Distribution (Count Plot)")
plt.xlabel("Sentiment")
plt.ylabel("Number of Reviews")
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'sentiment_distribution_count.png'))
plt.show()

#Average Sentiment scores by label
avg_scores = vader_processed_data.groupby('sentiment')[['neg', 'neu', 'pos']].mean().reset_index()
avg_scores_melted = avg_scores.melt(id_vars= 'sentiment', value_vars=['neg', 'neu', 'pos'], var_name='Score Type', value_name='Average')

#Bar plot showing the average sentiment scores by label
plt.figure(figsize=(7,4))
sns.barplot(x='sentiment', y='Average', hue='Score Type', data=avg_scores_melted, palette= 'muted')
plt.title("Average Sentiment Scores by label")
plt.ylabel("Average Score")
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'average_sentiment_scores.png'))
plt.show()

#Pie chart showing how sentiment is distributed
plt.figure(figsize=(6,6))
sentiment_counts = vader_processed_data['sentiment'].value_counts().reindex(['Positive', 'Neutral', 'Negative'], fill_value=0)
labels = sentiment_counts.index.tolist()
sizes = sentiment_counts.values
wedges, _, _ = plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=140, wedgeprops={'edgecolor':'w'})
plt.title("Sentiment Distribution (Pie Chart)")

# Legend showing the raw count behind each percentage
legend_labels = [f"{lbl}: {cnt}" for lbl, cnt in zip(labels, sizes)]
plt.legend(wedges, legend_labels, loc='best')
plt.axis('equal')
plt.tight_layout()
plt.savefig(os.path.join(output_dir, 'sentiment_distribution_pie.png'))
plt.show()

print(f"\nAll plots saved to {os.path.abspath(output_dir)}")

Cell Four: Visualisations are now implemented to help understand patterns and distributions within vader_processed_data. A variety of visualisations were chosen to make the data easier to explain. Line plot: shows the change in compound sentiment over time, revealing trends or spikes. Count plot: displays the frequency of each sentiment label, showing which sentiments are dominant. Bar plot: compares the average negative, neutral and positive scores across the sentiment labels to show their relative strength. Pie chart: illustrates the overall proportion of each sentiment category. matplotlib provided a flexible plotting framework for creating and saving these plots, and seaborn was used to improve their appearance so they are easier to read. The plots are saved as .png files, named after the plot they contain, in the prototype_outputs folder inside the output folder.
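The bar plot relies on reshaping the per-label averages from wide to long form before they are passed to seaborn. The sketch below shows that groupby and melt step on a few made-up rows (the values in demo_scores are assumptions, not results from the sample data), so the shape of avg_scores_melted in Cell Four is easier to picture.

```python
import pandas as pd

# Made-up scores for three entries (illustration only).
demo_scores = pd.DataFrame({
    "sentiment": ["Positive", "Positive", "Negative"],
    "neg": [0.0, 0.1, 0.6],
    "neu": [0.5, 0.5, 0.4],
    "pos": [0.5, 0.4, 0.0],
})

# Wide form: one row per sentiment label, one column per score type.
avg_wide = demo_scores.groupby("sentiment")[["neg", "neu", "pos"]].mean().reset_index()

# Long form: one row per (label, score type) pair, which is what the
# hue= argument of sns.barplot expects.
avg_long = avg_wide.melt(
    id_vars="sentiment",
    value_vars=["neg", "neu", "pos"],
    var_name="Score Type",
    value_name="Average",
)

print(avg_wide)
print(avg_long)
```

With two labels and three score types, the long table has six rows, one per bar in the grouped bar plot.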


In [None]:
import os

output_path = "../output/prototype_outputs/sentiment_analysis.csv"

output_dir = os.path.dirname(output_path)
# exist_ok=True makes a separate existence check unnecessary
os.makedirs(output_dir, exist_ok=True)

to_save = vader_processed_data.copy()
to_save.to_csv(output_path, index=False, encoding="utf-8")

print(f"\nFile successfully saved to: {os.path.abspath(output_path)}")
print(f"Total rows saved: {len(to_save)}")

Cell Five: Saves the processed sentiment data to a CSV file in the prototype_outputs folder alongside the visualisations. pandas exports the DataFrame efficiently while preserving the columns of vader_processed_data: the timestamp, preprocessed text, sentiment scores and sentiment labels. os ensures that the output directory exists before saving, which prevents a missing-folder error. This cell stores the analysis results persistently for future reference.
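One quick way to check that the export preserved the columns is to read the file back. The following is a sketch on a throwaway DataFrame and a temporary folder; the rows and the file name demo.csv are assumptions for illustration, not the real sentiment_analysis.csv.

```python
import os
import tempfile

import pandas as pd

# Made-up rows with a subset of the columns saved in Cell Five.
demo_results = pd.DataFrame({
    "timestamp": ["2024-01-01T10:00:00", "2024-01-01T10:05:00"],
    "processed_text": ["great service!", "awful wait"],
    "compound": [0.6588, -0.4588],
    "sentiment": ["Positive", "Negative"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_csv_path = os.path.join(tmp_dir, "nested", "demo.csv")
    # Create the missing parent folder first, as Cell Five does.
    os.makedirs(os.path.dirname(demo_csv_path), exist_ok=True)
    demo_results.to_csv(demo_csv_path, index=False, encoding="utf-8")
    reloaded = pd.read_csv(demo_csv_path)

print(list(reloaded.columns))
print(f"Rows read back: {len(reloaded)}")
```

Because index=False is used when saving, the reloaded file has the same columns as the DataFrame, with no extra index column.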