<a href="https://colab.research.google.com/github/Shirley-333/intro-final-project_Xin-Wen/blob/main/final_project_Xin_Wen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Abstract**

Lofi study music is one of the largest and most prominent music genres to emerge online in the past decade. Between the “Lofi Girl” streams and countless “chill beats” playlists, lofi music has become synonymous with late-night study sessions and the student experience. I wanted to investigate the dynamics of this community. Are students documenting their academic stress, nostalgia, or motivation? Is the comment section around lofi music uniquely supportive? To answer these questions, I collected comments from lofi study music videos on YouTube and analyzed them with several techniques explored in class (APIs, JSON, Pandas, sentiment analysis, and TF-IDF).



**Data Collection and Cleaning**

First, I used the YouTube Data API to collect the comments and metadata of videos returned by the search queries “lofi study music” and “lofi beats to relax.” A helper function sends each request to the API endpoint with the requests package and returns the parsed JSON response. I then used Pandas to combine the comment records from every page of results into a single DataFrame. Paging continued until nextPageToken came back empty or a video reached the cap of roughly 300 comments. Finally, I applied a series of cleaning functions to the DataFrame, which left several thousand comments across multiple videos for further analysis, including sentiment analysis and keyword frequency counts.
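The cleaning step can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact functions used in the project: the names `clean_comment` and `clean_comments` are hypothetical, the rows are made-up examples, and the sketch works on plain dictionaries (the same shape as the records collected from the API) so that it runs without a DataFrame.

```python
import re

def clean_comment(text):
    """Normalize one comment string; returns '' for missing values."""
    if not text:
        return ""
    text = re.sub(r"\s+", " ", text)  # collapse newlines and repeated spaces
    return text.strip()

def clean_comments(rows):
    """Clean the 'text' field, then drop empty and duplicate comments per video."""
    seen = set()
    cleaned = []
    for row in rows:
        text = clean_comment(row.get("text"))
        key = (row.get("video_id"), text.lower())  # case-insensitive duplicate check
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append({**row, "text": text})
    return cleaned

# Made-up rows for illustration only (not taken from the dataset).
sample = [
    {"video_id": "a1", "text": "  Thank you and good luck!\n"},
    {"video_id": "a1", "text": "thank you and good luck!"},
    {"video_id": "a1", "text": "   "},
    {"video_id": "b2", "text": None},
]
cleaned_sample = clean_comments(sample)
print(cleaned_sample)
```

Dropping empty and duplicate comments before scoring keeps blank rows from inflating the neutral category and keeps spam from being counted more than once. The same steps map onto Pandas string methods and `drop_duplicates` when the data is already in a DataFrame.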

In [None]:
!pip -q install --upgrade nltk requests tqdm

import os
import json
from urllib.parse import urlencode

import requests
import pandas as pd
from tqdm import tqdm

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer


nltk.download('vader_lexicon')


sia = SentimentIntensityAnalyzer()

In [None]:
from getpass import getpass

os.environ["YOUTUBE_API_KEY"] = getpass("Paste your YouTube API key: ")

API_KEY = os.environ.get("YOUTUBE_API_KEY")
assert API_KEY, "API key not set!"

In [None]:
BASE_URL = "https://www.googleapis.com/youtube/v3"

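def yt_get(resource: str, params: dict) -> dict:
    """Send a GET request to a YouTube Data API resource and return the parsed JSON."""
    q = {**params, "key": API_KEY}  # add the API key to every request
    url = f"{BASE_URL}/{resource}?{urlencode(q)}"
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # raise requests.HTTPError on 4xx/5xx responses
    return r.json()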

In [None]:
QUERY = "lofi study music"
TARGET_VIDEOS = 30
MAX_RESULTS = 50

video_hits = []
page_token = None

with tqdm(total=TARGET_VIDEOS, desc="Searching videos") as pbar:
    while len(video_hits) < TARGET_VIDEOS:
        params = {
            "part": "snippet",
            "q": QUERY,
            "type": "video",
            "maxResults": MAX_RESULTS,
            "order": "relevance",
        }
        if page_token:
            params["pageToken"] = page_token

        data = yt_get("search", params)
        items = data.get("items", [])

        for it in items:
            vid = it.get("id", {}).get("videoId")
            if not vid:
                continue

            snip = it.get("snippet", {})
            video_hits.append({
                "video_id": vid,
                "title": snip.get("title"),
                "channelTitle": snip.get("channelTitle"),
                "publishedAt": snip.get("publishedAt"),
            })
            pbar.update(1)
            if len(video_hits) >= TARGET_VIDEOS:
                break

        page_token = data.get("nextPageToken")
        if not page_token:
            break

videos_df = pd.DataFrame(video_hits)
videos_df.head()

In [None]:
all_comments = []

for vid in tqdm(videos_df["video_id"].tolist(), desc="Fetching comments"):
    page_token = None
    fetched = 0
    try:
        while True:
            params = {
                "part": "snippet",
                "videoId": vid,
                "maxResults": 100,
                "order": "relevance"
            }
            if page_token:
                params["pageToken"] = page_token

            data = yt_get("commentThreads", params)
            items = data.get("items", [])

            for it in items:
                top = it.get("snippet", {}).get("topLevelComment", {})
                s = top.get("snippet", {})
                all_comments.append({
                    "video_id": vid,
                    "author": s.get("authorDisplayName"),
                    "publishedAt": s.get("publishedAt"),
                    "likeCount": s.get("likeCount", 0),
                    "text": s.get("textOriginal", ""),
                })
                fetched += 1

            page_token = data.get("nextPageToken")
            if not page_token or fetched >= 300:
                break

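    except requests.HTTPError as e:
        # Print only the status code: str(e) includes the request URL, which contains the API key.
        status = e.response.status_code if e.response is not None else "unknown"
        print(f"Skipping {vid} due to HTTP error (status {status})")
        continue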

comments_df = pd.DataFrame(all_comments)
comments_df.head()

I used VADER for the sentiment analysis. VADER is designed for social media text such as YouTube comments and handles short text and emojis well. Each comment received a compound score between -1 and 1, which I labeled as positive (0.05 or above), negative (-0.05 or below), or neutral (anything in between). The results were heavily skewed: the majority of the comments were positive. Many users shared their personal thoughts publicly with statements such as “This helps me get through the day” or “Sending love to anyone studying right now.” Others left short and direct messages such as “beautiful,” “thank you,” and “I love this track.” This pattern was consistent across all the videos in the sample.

A basic sentiment analysis is a good place to start. The bar chart shows that most of the comments are positive, some are neutral, and very few are negative. Although this is a very simple visualization, the results are illuminating. Most lofi chillhop fans are using YouTube comments to express praise, excitement, and empathy. This can be explained by the purpose of the

In [None]:
def compound_score(text: str) -> float:
    return sia.polarity_scores(text or "")["compound"]

comments_df["compound"] = comments_df["text"].fillna("").apply(compound_score)
comments_df[["text", "compound"]].head()

In [None]:
def label_sentiment(c):
    if c >= 0.05:
        return "Positive"
    elif c <= -0.05:
        return "Negative"
    else:
        return "Neutral"

comments_df["sentiment"] = comments_df["compound"].apply(label_sentiment)
comments_df[["text", "compound", "sentiment"]].head()

In [None]:
import matplotlib.pyplot as plt

sent_counts = comments_df["sentiment"].value_counts().reindex(
    ["Positive", "Neutral", "Negative"]
)
sent_counts = sent_counts.fillna(0)

sent_share = sent_counts / sent_counts.sum()

print("Counts:\n", sent_counts)
print("\nShare:\n", sent_share.round(3))

plt.figure(figsize=(6, 4))
sent_counts.plot(kind="bar")

plt.title("Sentiment of YouTube Comments on Lofi Study Music")
plt.ylabel("Number of Comments")
plt.xlabel("Sentiment Category")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

I then examined how sentiment changed over time, grouping comments by the year they were posted, from 2019 to 2025. The earlier result holds in this distribution, with positive comments dominant in every year. The proportion of neutral comments fluctuates more from year to year, but it remains the second largest sentiment category. Negative comments are minimal to nonexistent in every year. This suggests that sentiment within the lofi community is stable: no matter what year a comment was posted, the reaction is the same. This is particularly interesting given the changes in academic calendars and the wider social events that might have affected people’s emotions over that period.

I ran a TF-IDF analysis on the comments to get more insight into their content. The highest-weighted words are love, music, thank, study, work, reading, and beautiful. These words suggest that lofi beats mean more to listeners than a music genre. In this context, the music speaks to a larger value system of productivity, daily rituals, and gratitude. The positive, repetitive, and familiar vocabulary matches the style of lofi music itself, which is familiar, warm, and recognizable.
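A TF-IDF ranking of this kind can be sketched as follows. This is a simplified, pure-Python illustration rather than the exact code used for the analysis: the stop-word list, the tokenizer, and the function names are all assumptions, and the sample comments are made up. Each comment is treated as one document, a term's weight is its in-document frequency multiplied by ln(N / df) + 1, and terms are ranked by their mean weight across comments.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"a", "an", "and", "the", "this", "to", "for", "is",
              "i", "me", "my", "you", "so", "of"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def tfidf_scores(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = ln(N / df) + 1, where df = number of documents containing the term
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty comments
        scores.append({
            term: (count / total) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return scores

def top_terms(docs, k=5):
    """Rank terms by their mean TF-IDF weight across all documents."""
    totals = Counter()
    for doc_scores in tfidf_scores(docs):
        totals.update(doc_scores)  # adds the weights term by term
    n = len(docs) or 1
    return [(term, weight / n) for term, weight in totals.most_common(k)]

# Made-up comments for illustration only (not taken from the dataset).
sample_comments = [
    "I love this music",
    "Thank you for this music",
    "This music helps me study",
]
ranked = top_terms(sample_comments, k=3)
print(ranked)
```

Libraries such as scikit-learn apply additional smoothing and vector normalization by default, so their weights will differ numerically from this sketch even when the ranking is similar.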

Another theme that I observed in these comments was the striking number of good wishes extended to strangers. Many viewers wrote things like “If you’re studying I hope you pass your exam!” The comment section of these lofi hip-hop videos serves as a micro-community for exchanging kind words, personal stories, and well-wishes. Perhaps these comments are also a reason why millions of people return to these videos time and time again. The videos provide more than tastefully chopped song samples paired with anime-like illustrations. They also give viewers a place to share intimate details about themselves in a non-intrusive and kind-spirited environment.
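One way to put a number on this theme is to flag comments that contain a well-wishing phrase and report their share. The sketch below is only an illustration: the phrase list and the function names are hypothetical, this count was not part of the analysis above, and the sample consists of the comments quoted in this report rather than the full dataset.

```python
import re

# Hypothetical phrase list; a real analysis would refine it against the data.
WELL_WISH_PATTERN = re.compile(
    r"\b(good luck|hope you|wish you|you can do it|you got this|sending love)\b",
    re.IGNORECASE,
)

def is_well_wish(text):
    """True if the comment contains one of the well-wishing phrases."""
    return bool(WELL_WISH_PATTERN.search(text or ""))

def well_wish_share(comments):
    """Fraction of comments flagged as well-wishes (0.0 for an empty list)."""
    if not comments:
        return 0.0
    return sum(is_well_wish(c) for c in comments) / len(comments)

# Comments quoted earlier in this report, used here as a small sample.
sample_comments = [
    "If you're studying I hope you pass your exam!",
    "Sending love to anyone studying right now.",
    "beautiful",
    "I love this track",
]
share = well_wish_share(sample_comments)
print(f"{share:.0%} of sample comments contain a well-wish")
```

A keyword match like this will miss paraphrases and can over-count phrases such as “hope you” when they appear in other contexts, so the resulting share is best read as a rough estimate.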


In [None]:
comments_df["publishedAt"] = pd.to_datetime(comments_df["publishedAt"], errors="coerce")
comments_df["year"] = comments_df["publishedAt"].dt.year
comments_df.head()

In [None]:
year_sent = (
    comments_df
    .groupby(["year", "sentiment"])
    .size()
    .reset_index(name="count")
)

year_totals = (
    comments_df.groupby("year")
    .size()
    .reset_index(name="total")
)

year_sent = year_sent.merge(year_totals, on="year")
year_sent["proportion"] = year_sent["count"] / year_sent["total"]
year_sent.head()

In [None]:
pivot = year_sent.pivot(index="year", columns="sentiment", values="proportion").fillna(0)
pivot

In [None]:
plt.figure(figsize=(8,5))

for col in ["Positive", "Neutral", "Negative"]:
    if col in pivot.columns:
        plt.plot(pivot.index, pivot[col], marker="o", label=col)

plt.title("Sentiment Trends in Lofi Study Music Comments Over Time")
plt.xlabel("Year")
plt.ylabel("Proportion of Comments")
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

This project has some limitations. First, because of the restrictions of the YouTube API, the comments and replies that I collected are not comprehensive. Second, VADER struggles to detect nuanced sentiment such as sarcasm, and it is limited when analyzing comments written in languages other than English. Lastly, TF-IDF is a limited tool for topic detection because it highlights individual words rather than the themes discussed within a corpus, and it gives no insight into the emotional content of the texts. Future studies could extend these findings by collecting a wider dataset of YouTube comments, by including other forms of engagement such as live chat, and by applying other machine learning approaches such as topic modeling to better understand the discussions and themes raised by content creators and viewers.

**Conclusion**

This project led me to reflect on how communities within YouTube can be a source of emotional support. The community of lofi study music listeners is ultimately filled with positive interactions between people who encourage one another and express shared appreciation for simple things. By applying the techniques learned in class to a trend that I wanted to understand, I learned about the technical side of collecting data, working with APIs and JSON, and analyzing what people write in comments, which reveals not only how they use the music but also the dynamics of the community around it.

GitHub URL: https://github.com/Shirley-333/intro-final-project_Xin-Wen/blob/main/final_project_Xin_Wen.ipynb