# YouTube Comments Topic Modeling with NLP (Python)

This notebook performs topic modeling on real YouTube comments using NLP.  
It fetches comments via the YouTube API, cleans and tokenizes text using NLTK, applies LDA using Gensim, and visualizes the topics.

**Business Use Case**: Extract audience insights from YouTube videos to help content creators, marketers, or researchers understand user sentiment and feedback themes.


### **Importing Necessary Libraries**

This cell imports the Python libraries the project needs for data manipulation, access to the YouTube API, natural language processing, and visualization.

*   **`os`**: Used for interacting with the operating system, such as creating directories.
*   **`pandas`**: A fundamental library for data analysis and manipulation, used here to manage the comments in a DataFrame.
*   **`googleapiclient.discovery`**: The official Google API Client Library for Python, enabling interaction with the YouTube Data API v3.
*   **`nltk`**: The Natural Language Toolkit is a comprehensive library for natural language processing (NLP). It's used for tasks like tokenization and removing stopwords.
*   **`re`**: The regular expression library, essential for cleaning text data by removing URLs and special characters.
*   **`wordcloud`**: Generates word-cloud images from text or word-frequency data; used here to draw one cloud per topic.
*   **`matplotlib.pyplot`**: A widely used plotting library for creating static, animated, and interactive visualizations in Python.
*   **`gensim`**: A robust library for topic modeling, document indexing, and similarity retrieval with large corpora.
*   **`pyLDAvis`**: A library for interactive topic model visualization.

The `nltk.download('stopwords')` command specifically downloads a list of common stopwords (e.g., "the," "a," "is") that are often filtered out during text preprocessing.

In [1]:
import os
import pandas as pd
from googleapiclient.discovery import build
import nltk
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from gensim import corpora, models
import pyLDAvis.gensim_models
import pyLDAvis

nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\soura\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **YouTube API Configuration**

This cell sets up the connection to the YouTube Data API.

*   An **`API_KEY`** authenticates requests to Google Cloud APIs. Replace the placeholder `'YOUR_API_KEY'` with your own valid YouTube Data API v3 key, and avoid committing a real key to a shared notebook.
*   The **`build`** function from `googleapiclient.discovery` initializes a service object that allows you to make API calls. Here, it's configured for the `youtube` service, version `v3`, using the provided `developerKey`.

In [2]:
API_KEY = 'YOUR_API_KEY'  # placeholder: substitute your own YouTube Data API v3 key
youtube = build('youtube', 'v3', developerKey=API_KEY)
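
One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `YOUTUBE_API_KEY` and a helper called `load_api_key`; both names are made up for this example and are not part of the original notebook.

```python
import os

def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this sketch; use whatever
    name fits your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key

# Possible usage in the notebook (requires the variable to be set beforehand):
# API_KEY = load_api_key()
# youtube = build('youtube', 'v3', developerKey=API_KEY)
```

Failing early with a clear message is usually easier to debug than an authentication error raised by the first API call.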

### **Function to Retrieve YouTube Comments**

This cell defines a function, `get_comments`, to programmatically fetch comments from a specific YouTube video.

*   **`get_comments(video_id, max_comments=100)`**:
    *   It takes a `video_id` (a unique string identifying a YouTube video) and an optional `max_comments` argument as input.
    *   It sends a request to the `commentThreads().list` endpoint of the YouTube API.
    *   The function requests up to 100 comment threads per page (`maxResults=100`) and follows `next_page_token` from one page to the next until it has collected `max_comments` comments or the API returns no further page.
    *   It extracts the plain text of each top-level comment and appends it to a list; replies to comments are not collected.
    *   Finally, it returns a list of comment strings.

In [3]:
def get_comments(video_id, max_comments=100):
    comments = []
    next_page_token = None
    while len(comments) < max_comments:
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,
            pageToken=next_page_token,
            textFormat="plainText"
        )
        response = request.execute()
        for item in response['items']:
            comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
            comments.append(comment)
        next_page_token = response.get('nextPageToken')
        if not next_page_token:
            break
    return comments[:max_comments]
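
The pagination logic can be tried offline by swapping the API call for a stand-in. In the sketch below, `paginate`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names invented for the illustration; only the `items` and `nextPageToken` keys mirror the response fields used by `get_comments`.

```python
def paginate(fetch_page, max_items=100):
    """Collect items page by page until max_items is reached or pages run out."""
    items = []
    token = None
    while len(items) < max_items:
        response = fetch_page(token)
        items.extend(response["items"])
        token = response.get("nextPageToken")
        if not token:  # last page: the response carries no token
            break
    return items[:max_items]

# Hypothetical stand-in for the API: three pages of two comments each.
FAKE_PAGES = {
    None: {"items": ["c1", "c2"], "nextPageToken": "p2"},
    "p2": {"items": ["c3", "c4"], "nextPageToken": "p3"},
    "p3": {"items": ["c5", "c6"]},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

print(paginate(fake_fetch, max_items=3))   # ['c1', 'c2', 'c3']
print(paginate(fake_fetch, max_items=50))  # all six comments
```

The final slice matters because a whole page is appended at once, so the list can overshoot the limit before the loop condition is checked again.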

### **Specifying Target YouTube Video IDs**

This cell initializes a list named `video_ids`. Each string in this list is the unique identifier for a YouTube video from which you want to collect comments for analysis.

In [4]:
video_ids = [
    "86Gy035z_KA",
    "enyrNQJFoi8",
    "O0cs8aIXgkc",
    "1BDqfPEhsCA",
    "NIgrGqmoeHs"
]
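
If you start from video URLs rather than IDs, the ID can be pulled out with the standard library. This is a minimal sketch that only handles the common `youtube.com/watch?v=...` and `youtu.be/...` forms; `extract_video_id` is a hypothetical helper, not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from a watch URL or short link, or None if not found."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.endswith("youtu.be"):
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip("/") or None
    if host.endswith("youtube.com"):
        # Watch URLs carry the ID in the "v" query parameter
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

print(extract_video_id("https://www.youtube.com/watch?v=86Gy035z_KA"))  # 86Gy035z_KA
print(extract_video_id("https://youtu.be/enyrNQJFoi8"))                 # enyrNQJFoi8
```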

### **Fetching Comments for All Specified Videos**

This cell iterates through the `video_ids` list defined in the previous cell. For each `vid`, it calls the `get_comments` function to retrieve up to 500 comments. All the collected comments are then aggregated into a single list called `all_comments`.

In [5]:
all_comments = []
for vid in video_ids:
    all_comments.extend(get_comments(vid, 500))

### **Creating and Saving a DataFrame of Comments**

This cell converts the raw list of comments into a more structured format using the pandas library.

*   A pandas DataFrame named `df` is created with a single column, 'comment'.
*   The `df.to_csv()` function saves the DataFrame to `../data/youtube_comments.csv`, that is, in a `data` directory one level above the notebook. `index=False` keeps the row index out of the file. Persisting the data lets you reload it later without re-fetching it from the API.
*   `df.head()` displays the first few rows of the DataFrame, providing a quick preview of the collected data.

In [6]:
df = pd.DataFrame({'comment': all_comments})
os.makedirs("../data", exist_ok=True)  # create the output folder if it is missing
df.to_csv("../data/youtube_comments.csv", index=False)
df.head()

| | comment |
|---|---|
| 0 | Meanwhile I reviewed a fake I watch which i pa... |
| 1 | Okay, let's be honest, the Meta Quest is techn... |
| 2 | Id you want to make sure you cover the entire ... |
| 3 | Yes sir, i will |
| 4 | I’ll stick with the Meta Quest 3! Apple could ... |


### **Loading and Pre-processing the Comment Data**

This cell focuses on loading the saved data and performing initial cleaning.

*   The comments are loaded from the `youtube_comments.csv` file into a pandas DataFrame.
*   `df.dropna(inplace=True)` removes any rows that have missing values (NaNs).
*   The subsequent line removes any rows where the 'comment' is just an empty string or whitespace. This ensures the dataset is clean and ready for further text processing.
*   `df.head()` is used again to preview the cleaned DataFrame.

In [7]:
df = pd.read_csv("../data/youtube_comments.csv")
df.dropna(inplace=True)
df = df[df["comment"].str.strip() != ""]  # remove empty strings
df.head()

| | comment |
|---|---|
| 0 | Meanwhile I reviewed a fake I watch which i pa... |
| 1 | Okay, let's be honest, the Meta Quest is techn... |
| 2 | Id you want to make sure you cover the entire ... |
| 3 | Yes sir, i will |
| 4 | I’ll stick with the Meta Quest 3! Apple could ... |


### **Text Cleaning and Tokenization**

This cell defines and applies a function to clean and tokenize the text comments, preparing them for topic modeling.

*   A set of English `stop_words` is loaded from NLTK's stopword corpus.
*   The **`clean_comment`** function performs several key NLP pre-processing steps:
    1.  Converts the comment to lowercase for uniformity.
    2.  Uses regular expressions (`re.sub`) to remove URLs and then every character that is not a lowercase letter or whitespace, so digits, punctuation, emoji, and non-Latin text are dropped.
    3.  Tokenizes the cleaned string into a list of words using `word_tokenize`, which needs NLTK's `punkt` tokenizer data (`punkt_tab` in recent NLTK releases) to be downloaded in addition to `stopwords`.
    4.  Removes stopwords and any words of two characters or fewer.
*   This function is applied to the 'comment' column of the DataFrame, and the resulting list of tokens is stored in a new 'tokens' column.

In [19]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re

stop_words = set(stopwords.words("english"))

def clean_comment(comment):
    comment = comment.lower()
    comment = re.sub(r"http\S+|www\S+|https\S+", '', comment)
    comment = re.sub(r'[^a-z\s]', '', comment)
    tokens = word_tokenize(comment)
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    return tokens

df["tokens"] = df["comment"].apply(clean_comment)
df.head()

| | comment | tokens |
|---|---|---|
| 0 | Meanwhile I reviewed a fake I watch which i pa... | [meanwhile, reviewed, fake, watch, paid, alot,... |
| 1 | Okay, let's be honest, the Meta Quest is techn... | [okay, lets, honest, meta, quest, technologica... |
| 2 | Id you want to make sure you cover the entire ... | [want, make, sure, cover, entire, floor, youre... |
| 3 | Yes sir, i will | [yes, sir] |
| 4 | I’ll stick with the Meta Quest 3! Apple could ... | [ill, stick, meta, quest, apple, could, never] |
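
To see what each cleaning step does without installing NLTK, the sketch below reproduces the same sequence with two simplifications: a tiny hand-written stopword set replaces NLTK's English list, and `str.split` replaces `word_tokenize`. The names `clean_comment_demo` and `DEMO_STOP_WORDS` are made up for this example, so results on real comments may differ slightly from the notebook's.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
DEMO_STOP_WORDS = {"the", "is", "a", "this", "and", "at", "it"}

def clean_comment_demo(comment, stop_words=DEMO_STOP_WORDS):
    comment = comment.lower()                          # 1. lowercase
    comment = re.sub(r"http\S+|www\S+", "", comment)   # 2a. drop URLs
    comment = re.sub(r"[^a-z\s]", "", comment)         # 2b. keep letters and whitespace only
    tokens = comment.split()                           # 3. whitespace split (stand-in for word_tokenize)
    # 4. drop stopwords and words of two characters or fewer
    return [word for word in tokens if word not in stop_words and len(word) > 2]

sample = "This camera is AMAZING!!! Check https://example.com at 4K :)"
print(clean_comment_demo(sample))  # ['camera', 'amazing', 'check']
```

Note that "4K" disappears entirely: the digit is removed by the character filter and the remaining "k" is too short to keep.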


### **Creating a Dictionary and Corpus for LDA**

This cell prepares the data for the Latent Dirichlet Allocation (LDA) model using the `gensim` library.

*   A **`dictionary`** is created from the 'tokens' column. This dictionary maps each unique word to an integer ID.
*   A **`corpus`** is then created using the `doc2bow` (document-to-bag-of-words) method. The corpus is a list of lists, where each inner list represents a document (a comment) and contains tuples of (word_id, word_frequency).
*   The dictionary and corpus are saved to files, which is good practice for reusing them later without needing to re-process the text.

In [20]:
from gensim import corpora

# Create dictionary and corpus
dictionary = corpora.Dictionary(df["tokens"])
corpus = [dictionary.doc2bow(tokens) for tokens in df["tokens"]]

# Save dictionary and corpus for reuse (create the output folder if it is missing)
os.makedirs("../lda_model", exist_ok=True)
dictionary.save("../lda_model/youtube_dict.dict")
corpora.MmCorpus.serialize("../lda_model/youtube_corpus.mm", corpus)
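
The shape of the dictionary and the bag-of-words corpus can be illustrated in plain Python. This is a sketch of the data structures only, not Gensim's implementation: `word_to_id` and `to_bow` are hypothetical names, and the IDs Gensim assigns to words may differ from the first-appearance order used here.

```python
from collections import Counter

docs = [
    ["meta", "quest", "better"],
    ["quest", "quest", "apple"],
]

# Assign each unique word an integer ID in order of first appearance.
word_to_id = {}
for doc in docs:
    for word in doc:
        word_to_id.setdefault(word, len(word_to_id))

def to_bow(tokens):
    """Return sorted (word_id, count) pairs for one document."""
    counts = Counter(word_to_id[word] for word in tokens)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]
print(word_to_id)  # {'meta': 0, 'quest': 1, 'better': 2, 'apple': 3}
print(bow_corpus)  # [[(0, 1), (1, 1), (2, 1)], [(1, 2), (3, 1)]]
```

Word order is discarded in this representation; only which words occur in a comment, and how often, reaches the LDA model.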

### **Training the LDA Topic Model**

This is where the core of the topic modeling happens. An LDA model is trained on the prepared corpus and dictionary.

*   **`LdaModel`** is instantiated with several key parameters:
    *   `corpus` and `id2word=dictionary`: The data the model will be trained on.
    *   `num_topics=8`: The number of topics the model should identify in the data. This is a parameter that can be tuned.
    *   `random_state=42`: Ensures reproducibility of the results.
    *   `passes=10`: The number of times the model will pass through the entire corpus during training.
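    *   `alpha='auto'`: `alpha` is the prior on each document's topic distribution, and smaller values push each comment toward fewer topics. With `'auto'`, Gensim learns an asymmetric prior from the corpus instead of using a fixed value.
    *   `per_word_topics=True`: Makes the model also compute, for each word, its most likely topics in descending order, which `get_document_topics` can return alongside a document's topic distribution.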
*   The code then iterates through the trained model and prints the top keywords for each of the 8 identified topics.
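
As a rough illustration of what the `alpha` prior controls, the sketch below samples document-topic mixtures from a symmetric Dirichlet using only the standard library. It is not how Gensim trains the model, and `alpha='auto'` learns an asymmetric prior rather than the single fixed value assumed here; the helper names are made up for this example.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics.

    Uses the standard construction: k independent Gamma(alpha, 1) draws,
    normalized so that they sum to 1.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=8, n=2000, seed=42):
    """Average weight of the dominant topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

# Small alpha: one topic tends to dominate each document (sparse mixtures).
print(f"alpha=0.1 -> mean largest topic share {mean_largest_share(0.1):.2f}")
# Large alpha: weight is spread more evenly across the 8 topics.
print(f"alpha=5.0 -> mean largest topic share {mean_largest_share(5.0):.2f}")
```

Short YouTube comments usually touch on one or two themes, which is one reason a sparse document-topic prior is a reasonable starting point for this kind of data.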

In [28]:
from gensim.models import LdaModel

# Train the LDA model
lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=8,           
                     random_state=42,
                     passes=10,
                     alpha='auto',
                     per_word_topics=True)

# Print top keywords for each topic
for idx, topic in lda_model.print_topics(-1):
    print(f"\n🔹 Topic {idx}:")
    print(topic)



🔹 Topic 0:
0.075*"fireship" + 0.067*"viewer" + 0.066*"people" + 0.065*"years" + 0.064*"created" + 0.063*"accounts" + 0.063*"many" + 0.062*"watching" + 0.061*"illusion" + 0.060*"message"

🔹 Topic 1:
0.014*"microsoft" + 0.011*"back" + 0.011*"always" + 0.010*"every" + 0.009*"access" + 0.009*"azure" + 0.008*"sponsor" + 0.008*"exactly" + 0.008*"wow" + 0.007*"thank"

🔹 Topic 2:
0.028*"video" + 0.025*"like" + 0.012*"time" + 0.012*"would" + 0.012*"looks" + 0.012*"love" + 0.010*"much" + 0.009*"shit" + 0.008*"microsoft" + 0.007*"car"

🔹 Topic 3:
0.029*"samsung" + 0.028*"iphone" + 0.026*"better" + 0.015*"ultra" + 0.009*"video" + 0.009*"camera" + 0.008*"waiting" + 0.008*"good" + 0.008*"competition" + 0.008*"one"

🔹 Topic 4:
0.024*"code" + 0.023*"copilot" + 0.023*"microsoft" + 0.022*"open" + 0.019*"source" + 0.015*"like" + 0.008*"dont" + 0.008*"github" + 0.007*"big" + 0.007*"google"

🔹 Topic 5:
0.009*"apple" + 0.008*"like" + 0.008*"see" + 0.008*"really" + 0.007*"know" + 0.007*"truck" + 0.007*"some

### **Generating and Saving Word Clouds for Each Topic**

This cell visualizes the results of the LDA model by creating a word cloud for each topic.

*   It first ensures that a `charts` directory exists to save the output images.
*   It then loops through each topic number from 0 to 7.
*   For each topic, it generates a `WordCloud` object from the topic's 30 highest-weighted words, using each word's probability under the LDA model as its frequency.
*   Each word cloud is then plotted using `matplotlib`, given a title, and saved as a PNG image in the `../charts/` directory.

In [27]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Ensure the folder exists
os.makedirs("../charts", exist_ok=True)

# Generate and save wordclouds
for topic_num in range(lda_model.num_topics):
    plt.figure(figsize=(10, 6))
    # Map the topic's top 30 words to their weights; larger weights are drawn larger
    weights = dict(lda_model.show_topic(topic_num, topn=30))
    wordcloud = WordCloud(background_color="white").generate_from_frequencies(weights)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(f"Topic {topic_num}", fontsize=16)
    plt.tight_layout()
    plt.savefig(f"../charts/topic_{topic_num}.png")
    plt.close()

### **Interactive Topic Model Visualization**

This cell generates an interactive visualization of the LDA model using the `pyLDAvis` library.

*   **`pyLDAvis.enable_notebook()`** enables the visualization to be displayed directly within the Jupyter Notebook.
*   **`gensimvis.prepare()`** takes the trained LDA model, corpus, and dictionary as input and creates the visualization data.
*   The resulting `vis` object is an interactive plot that allows you to explore the topics, their relationships, and the keywords associated with them.

In [23]:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

pyLDAvis.enable_notebook()
vis = gensimvis.prepare(lda_model, corpus, dictionary)
vis

### **Saving the Interactive Visualization**

This final cell saves the interactive `pyLDAvis` visualization to a standalone HTML file. People who do not have access to the Jupyter Notebook can open that file in any web browser to explore the topic model.

In [25]:
pyLDAvis.save_html(vis, "../charts/topic_model_visualization.html")