#**YouTube Comment Intelligence System – Data Preparation & Batching**

In [1]:
# Core data handling
!pip install -U pandas numpy openpyxl

# NLP utilities
!pip install -U scikit-learn sentence-transformers

# LlamaIndex core
!pip install -U llama-index

# NEW & OFFICIAL Groq integration for LlamaIndex
!pip install -U llama-index-llms-groq




###**Core Data Processing (pandas, numpy, openpyxl)**
These libraries are used to load and process the Excel-based YouTube comment dataset. They handle tasks such as reading .xlsx files, cleaning missing or empty values, extracting the comment column, converting data into Python lists, and computing basic statistics like average comment length and word counts.

###**Semantic NLP and Clustering (sentence-transformers, scikit-learn)**
These tools enable semantic understanding of comments. Sentence Transformers convert each comment into a numerical embedding that captures its meaning, while scikit-learn provides clustering algorithms such as K-Means to group similar comments together and discover hidden topics. This allows the system to work in an unsupervised way without requiring labeled data.

###**LLM-Powered Summarization (llama-index, llama-index-llms-groq)**
LlamaIndex provides the framework for organizing and summarizing large volumes of text using hierarchical tree-based methods. The Groq integration connects Groq-hosted models to this framework, allowing the system to generate summaries and insights from batches of comments while staying within model context limits.

In [2]:
import pandas as pd
import numpy as np
import re
import random

from llama_index.core import Document
from llama_index.core.indices.tree import TreeIndex
from llama_index.core import Settings

import google.generativeai as genai



All support for the `google.generativeai` package has ended. It will no longer be receiving 
updates or bug fixes. Please switch to the `google.genai` package as soon as possible.
See README for more details:

https://github.com/google-gemini/deprecated-generative-ai-python/blob/main/README.md

  loader.exec_module(module)


## **Import Required Libraries**

This cell imports all the Python libraries needed for our YouTube Comment Insight System.

#### **Data Processing**
- **pandas** → for loading and manipulating Excel and CSV files  
- **numpy** → for numerical operations and array handling  
- **re** → for text cleaning using regular expressions  
- **random** → for sampling and experimentation  

#### **LLM & Knowledge Indexing**
- **llama_index** → to convert comments into documents and build a tree-based knowledge index  
- **Document** → represents each comment as a document  
- **TreeIndex** → organizes documents in a hierarchical structure for efficient querying  
- **Settings** → controls model and embedding configurations  

#### **Google Generative AI**
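- **google.generativeai** → Google's legacy Gemini SDK. It is imported here but not used by the later cells, which run summarization through Groq. As the warning below the import cell states, the package is deprecated in favor of `google.genai`.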


In [3]:
# Upload your Excel file manually in Colab
from google.colab import files
uploaded = files.upload()

# Get uploaded file name
file_name = list(uploaded.keys())[0]

# Load Excel file
df = pd.read_excel(file_name)

# Name of the column that holds the comment text
COMMENT_COLUMN = "Comment"

# Keep only the comment column
df = df[[COMMENT_COLUMN]]

# Drop real NaN values
df = df.dropna()

# Convert to string (only after NaNs removed)
df[COMMENT_COLUMN] = df[COMMENT_COLUMN].astype(str)

# Remove truly empty rows (but NOT emojis, slang, mixed language, etc)
df = df[df[COMMENT_COLUMN].str.strip() != ""]

print("Total comments available:", len(df))
print("\nSample raw comments:")
for c in df[COMMENT_COLUMN].head(5):
    print("-", c)



Saving youtube_comments (1).xlsx to youtube_comments (1).xlsx
Total comments available: 1072

Sample raw comments:
- Honest feedback I want to share regarding @CodeWithHarry. I have completed your reactjs course long back, 2 front-end project and 3rd one complete mern stack. I had gone through multiple tutorials, but when I did complete your free tutorial I never looked at any other course. I started building products my own, in interview I was asked to write code using reactjs and currently I am pursuing my masters at IIT that too in Data Science yet I have taken your course. I truly believe your knowledge, and more than that the way you deliver the content. I was last in c programming exam back in first year of bachelor's but by completing the engineering I had done multiple internships cracked most of companies then I'm into iit now. There is a huge hand of yours behind my success. Thankyou Harry bhaiyya and just a suggestion I would request ki mathematics bhi depth mein sikhaye for

### **Excel Upload & Cleaning**
**df[[COMMENT_COLUMN]]** → Keeps only the YouTube comment text and removes all other columns

**dropna()** → Removes only real missing values without touching meaningful text

**astype(str)** → Converts all remaining comments into text format for LLM input

**str.strip() != ""** → Removes truly empty rows while preserving emojis, slang, and mixed language

**len(df)** → Shows how many valid, LLM-ready comments are available

**head(5)** → Displays sample raw comments to verify that meaning and emotion are preserved
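
To see what each cleaning step keeps and drops, the same chain can be run on a small in-memory table. This is only a minimal sketch: the toy rows and the extra `Likes` column are made up for illustration and stand in for the uploaded Excel file.

```python
import numpy as np
import pandas as pd

COMMENT_COLUMN = "Comment"

# Toy data (made up): one missing value, one whitespace-only row,
# one emoji-only row, and one numeric cell.
toy = pd.DataFrame({
    COMMENT_COLUMN: ["Great video!", np.nan, "   ", "❤❤", 12345],
    "Likes": [10, 2, 0, 7, 1],
})

# Keep one column and drop real NaN values; .copy() avoids chained-assignment warnings.
cleaned = toy[[COMMENT_COLUMN]].dropna().copy()
# Convert every remaining cell to text (12345 becomes "12345").
cleaned[COMMENT_COLUMN] = cleaned[COMMENT_COLUMN].astype(str)
# Drop rows that are empty after stripping whitespace.
cleaned = cleaned[cleaned[COMMENT_COLUMN].str.strip() != ""]

remaining = cleaned[COMMENT_COLUMN].tolist()
print(remaining)
```

The emoji-only row survives, while the missing value and the whitespace-only row are dropped. The order of the steps matters: converting to strings before `dropna()` can turn missing values into the literal text "nan", which the empty-string filter would not catch.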

In [4]:
def validate_text(text):
    """
    Only remove truly broken rows.
    Preserve all real language, emojis, links, slang.
    """
    if not isinstance(text, str):
        return False
    if text.strip() == "":
        return False
    return True



This function flags invalid or empty rows so they can be filtered out, while keeping all meaningful user-generated content intact, so the LLM receives authentic, emotion-rich YouTube comments.

**validate_text()** → Returns `False` for non-string or whitespace-only entries and `True` for everything else, so emojis, links, slang, and mixed-language text are kept as LLM input.
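
A quick check on a few made-up inputs shows which values pass. The function is repeated here in condensed form so the snippet runs on its own; the sample values are illustrative and not taken from the dataset.

```python
def validate_text(text):
    """Return True for any non-empty string, False otherwise."""
    if not isinstance(text, str):
        return False
    return text.strip() != ""

# Made-up inputs covering the cases the function is meant to separate.
samples = ["So true.", "😅", "https://example.com", "   ", "", None, 3.14]

kept = [s for s in samples if validate_text(s)]
print(kept)
```

Only the three real strings are kept. Whitespace-only strings, the empty string, `None`, and the float are rejected.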

In [6]:
# Fix random seed for reproducibility
random.seed(42)

# Start from the full list of comments (the subset is sampled further down)
comments = df[COMMENT_COLUMN].tolist()

# Keep only valid LLM-safe comments
comments = [c for c in comments if validate_text(c)]

# Randomly sample a small but representative subset for experiments
random.shuffle(comments)
comments = comments[:300]
print("Using comments for experiments:", len(comments))



print("Total comments used for LLM:", len(comments))
print("\nFirst 10 comments:")
for c in comments[:10]:
    print("-", c)


Using comments for experiments: 300
Total comments used for LLM: 300

First 10 comments:
- Bhaiya sorry to say but jab aap iit Kharagpur ka naam lete ho thodi si jal jati ha😅
- Sir youtube par ai ml ka full coes lalo na
- harry bhiya .. machine learning sikhne ke liye apke computer ki  minimum specification kya hona chaiye ?
- Please make a sigma ai/ml course on youtube
- So true.
- Just 12th passed should I do it?
- From Pakistan Learned Alot Of Python From Harry Bhai
Alots of Love from Pakistan>>>>>>❤❤❤❤
- Bhayiya mai robotic and automation ka student hu kya ap us pr bhi ek video bna skte hai kee kon kon see skills sikhne kee jrurat hai. Please bhayiya❤
- Free data science course playlist full plzzz
- Is anyone can tell me please I'm a mechanical engineer soo can I start my career in different domain as a fresher?


### **Comment Sampling**
**random.seed(42)** → Ensures the same random comments are selected every time for reproducible experiments

**df[COMMENT_COLUMN].tolist()** → Converts the comment column into a Python list for model input

**random.shuffle(comments)** and **comments[:300]** → Shuffle the list in place and keep the first 300 comments as the experimental subset

**len(comments)** → Confirms how many comments were successfully sampled

**comments[:10]** → Displays a few random samples to verify correct selection
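
An alternative way to get a reproducible subset is sketched below. It assumes a hypothetical helper, `sample_comments`, that uses its own `random.Random` instance instead of the global seed. It picks a random subset without replacement, like shuffle-then-slice, but it will not return the exact same 300 comments as the cell above.

```python
import random

def sample_comments(comments, k, seed=42):
    """Return up to k comments chosen without replacement, reproducibly.

    A private random.Random instance is used so the global random state
    (and any other code relying on it) is left untouched.
    """
    rng = random.Random(seed)
    k = min(k, len(comments))  # never ask for more items than exist
    return rng.sample(comments, k)

pool = [f"comment {i}" for i in range(1000)]  # made-up stand-in data
first = sample_comments(pool, 300)
second = sample_comments(pool, 300)
print(len(first), first == second)
```

Because the seed lives inside the helper, re-running this cell, or running other cells that call `random` in between, does not change which comments are selected.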

In [7]:
# Use RAW comments (no cleaning)
lengths = [len(c.split()) for c in comments]

print("Average raw comment length:", np.mean(lengths))
print("Max raw comment length:", np.max(lengths))
print("Min raw comment length:", np.min(lengths))


Average raw comment length: 18.576666666666668
Max raw comment length: 351
Min raw comment length: 1


### **Raw Text Statistics**
**len(c.split())** → Measures the word length of raw YouTube comments to understand their size and complexity before LLM processing.
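
The output above shows a mean of about 18.6 words next to a maximum of 351, which suggests a few very long comments pull the average up. A median is one way to check for that. The sketch below assumes a hypothetical helper, `word_length_stats`, and uses made-up toy comments rather than the real dataset.

```python
import statistics

def word_length_stats(comments):
    """Summarize whitespace-delimited word counts for a list of comments."""
    lengths = sorted(len(c.split()) for c in comments)
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": lengths[0],
        "max": lengths[-1],
    }

toy_comments = [
    "So true.",
    "Please make a course",
    "❤",
    "one two three four five six",
]
stats = word_length_stats(toy_comments)
print(stats)
```

Note that `split()` counts whitespace-separated tokens, so an emoji-only comment counts as one word.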

In [8]:
def batch_comments(comments, batch_size):
    return [comments[i:i+batch_size] for i in range(0, len(comments), batch_size)]

batch_sizes = [100, 200, 300]

all_batches = {}

for bs in batch_sizes:
    batches = batch_comments(comments, batch_size=bs)
    all_batches[bs] = batches

    print(f"\nBatch size = {bs}")
    print("Total batches:", len(batches))
    print("First batch size:", len(batches[0]))



Batch size = 100
Total batches: 3
First batch size: 100

Batch size = 200
Total batches: 2
First batch size: 200

Batch size = 300
Total batches: 1
First batch size: 300


### **Multi-Batch Experiment**

This step tests multiple batch sizes to find how many comments the LLM should see at once for the best insight quality.

**batch_comments()** → Splits the 300 sampled YouTube comments into chunks of different sizes (100, 200, 300) to compare how batch size affects LLM context use and summarization quality
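
The batch sizes tested above all divide 300 evenly, so every batch is full. The sketch below uses a made-up list of 250 items to show what the same slicing does when the total is not a multiple of the batch size.

```python
import math

def batch_comments(comments, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [comments[i:i + batch_size] for i in range(0, len(comments), batch_size)]

items = list(range(250))  # made-up stand-in for 250 comments
batches = batch_comments(items, batch_size=100)
sizes = [len(b) for b in batches]
print(sizes)

# The number of batches is the ceiling of len(items) / batch_size.
expected_batches = math.ceil(len(items) / 100)
print(len(batches) == expected_batches)
```

The last batch holds the remainder and is simply shorter. An empty input produces an empty list of batches, so code that reads `batches[0]` should guard against that case.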

In [9]:
from google.colab import userdata
from llama_index.llms.groq import Groq
from llama_index.core import Settings

# Get Groq API key from Colab Secrets
GROQ_API_KEY = userdata.get("GROQ_API_KEY")

# Configure Groq as the LLM
Settings.llm = Groq(
    model="llama-3.3-70b-versatile",   # Use a currently supported Groq model
    api_key=GROQ_API_KEY,
    temperature=0.2
)

print(" Groq LLM configured successfully")

 Groq LLM configured successfully


### **LLM Configuration**
**userdata.get("GROQ_API_KEY")** → Securely retrieves the Groq API key from Colab Secrets

**Groq()** → Connects LlamaIndex to Groq's high-performance LLM platform

**model="llama-3.3-70b-versatile"** → Selects a powerful large language model optimized for text analysis and summarization

**api_key=GROQ_API_KEY** → Authenticates requests to the Groq API

**temperature=0.2** → Keeps model outputs stable, focused, and less random

**Settings.llm** → Sets Groq as the global LLM engine for the entire pipeline
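
`google.colab.userdata` only exists inside Colab, so the key lookup fails when the notebook is run elsewhere. One possible workaround is sketched below. The helper name `get_api_key` and the environment-variable fallback are assumptions for illustration, not part of the original pipeline.

```python
import os

def get_api_key(name="GROQ_API_KEY"):
    """Look up a secret in Colab Secrets, falling back to an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not running in Colab, or the secret is missing or not shared
        # with this notebook; fall through to the environment lookup.
        pass
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} not found in Colab Secrets or environment variables")
    return value
```

Calling `get_api_key()` in place of `userdata.get("GROQ_API_KEY")` would keep the rest of the configuration cell unchanged, and a missing key fails early with a clear message instead of at the first LLM call.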

In [10]:
from llama_index.core import Document

all_documents = {}

for bs, batches in all_batches.items():
    documents = []

    for i, batch in enumerate(batches):
        batch_text = "\n".join(batch)
        doc = Document(
            text=batch_text,
            metadata={
                "batch_size": bs,
                "batch_id": i + 1
            }
        )
        documents.append(doc)

    all_documents[bs] = documents
    print(f"Batch size {bs}: Created {len(documents)} documents")



Batch size 100: Created 3 documents
Batch size 200: Created 2 documents
Batch size 300: Created 1 documents


### **Multi-Batch Document Creation**

This step converts each batch of YouTube comments into LLM-readable documents for every tested batch size.

**Document()** → Transforms comment batches of different sizes into structured LlamaIndex documents so each chunking strategy can be evaluated independently


In [11]:
from llama_index.core.indices.tree import TreeIndex

all_tree_indexes = {}

for bs, documents in all_documents.items():
    index = TreeIndex.from_documents(documents)
    all_tree_indexes[bs] = index
    print(f"TreeIndex created for batch size {bs}")



TreeIndex created for batch size 100
TreeIndex created for batch size 200
TreeIndex created for batch size 300


### **Multi-Batch Indexing**

This step builds a separate TreeIndex for each batch size so different chunking strategies can be compared fairly.

**TreeIndex.from_documents()** → Creates parallel knowledge trees for each batch size to evaluate which chunking strategy gives the best LLM reasoning and summaries.
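
The idea of a summary tree can be illustrated without any LLM. The sketch below is a conceptual toy, not LlamaIndex's implementation: `build_summary_tree`, `toy_summarize`, and the `num_children` parameter are hypothetical names, and the "summarizer" only records which pieces were merged.

```python
def build_summary_tree(chunks, summarize, num_children=2):
    """Build a summary tree bottom-up and return its levels, leaves first."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")
    levels = [list(chunks)]
    # Keep merging groups of nodes until a single root remains.
    while len(levels[-1]) > 1:
        current = levels[-1]
        parents = [
            summarize(current[i:i + num_children])
            for i in range(0, len(current), num_children)
        ]
        levels.append(parents)
    return levels

def toy_summarize(texts):
    # Stand-in for an LLM call: just records which pieces were merged.
    return "(" + " + ".join(texts) + ")"

levels = build_summary_tree(["A", "B", "C"], toy_summarize, num_children=2)
for depth, level in enumerate(levels):
    print(depth, level)
```

Each level is a coarser view of the one below it, and the last level holds a single root that covers every leaf. In the real pipeline each merge would be an LLM call, so larger batches mean fewer leaves but more text per call.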

In [12]:
!pip install nest_asyncio




### **Async Support**
**nest_asyncio** → Lets asyncio event loops be nested, which Colab needs because the notebook already runs its own event loop

In [13]:
import nest_asyncio
nest_asyncio.apply()

print(" Asyncio loop patched for Colab")


 Asyncio loop patched for Colab


### **Async Fix**
**nest_asyncio.apply()** → Patches asyncio so that LlamaIndex async operations can run inside the event loop Google Colab already has running, without raising errors
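
The problem being patched can be reproduced with the standard library alone. The sketch below is meant to be run as a plain Python script, outside a notebook and without `nest_asyncio` applied. The coroutine names `inner` and `outer` are made up for illustration.

```python
import asyncio

async def inner():
    return "done"

async def outer():
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)   # fails: a loop is already running at this point
        return "nested run succeeded"
    except RuntimeError as exc:
        coro.close()        # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"

message = asyncio.run(outer())
print(message)
```

A notebook cell is in the same position as `outer()`: a loop is already running, so library code that tries to start its own loop fails unless the loop has been patched to allow nesting.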

In [14]:
all_summaries = {}

for bs, tree_index in all_tree_indexes.items():
    query_engine = tree_index.as_query_engine(
        response_mode="tree_summarize"
    )

    prompt = """
    Summarize the main themes, questions, feedback, and repeated ideas
    present in these YouTube comments. Focus on:
    - Common questions
    - Content requests
    - Confusions
    - Praise or criticism
    """

    summary = query_engine.query(prompt)
    all_summaries[bs] = summary

    print(f"\n==============================")
    print(f"Batch size = {bs}")
    print("==============================")
    print(summary)



Batch size = 100
The main themes present in these comments include career development in AI and Machine Learning, requests for specific courses or content, and feedback on existing content. 

Common questions revolve around the suitability of certain courses or skills for future careers, such as MERN with generative AI, and whether non-CSE students can learn AI and ML after Python. Many users are also inquiring about the availability of courses in languages other than English, specifically Hindi.

Content requests are prominent, with users asking for a roadmap for Cloud and Security, tips for getting into the field, and recommendations for books on machine learning and data mining. Some users are also requesting free access to paid courses, citing financial difficulties.

Confusions arise from the suitability of certain laptops for machine learning and data analysis, and the process of enrolling in courses, particularly for international students.

Praise is given for the valuable inf

### **Multi-Batch Insight Generation**

This step runs the same insight-extraction prompt on each batch-size TreeIndex so we can compare how different chunk sizes affect the quality of LLM summaries.

**all_tree_indexes** → Stores a separate TreeIndex for each batch size

**tree_index.as_query_engine()** → Converts each TreeIndex into an LLM-powered question-answering and summarization engine

**response_mode="tree_summarize"** → Tells the response synthesizer to merge the retrieved comment text into a single answer by summarizing it recursively, level by level

**prompt** → Defines exactly what creator-level insights the LLM should extract

**query_engine.query(prompt)** → Runs the prompt on the full comment knowledge base for that batch size

**all_summaries[bs]** → Stores the summary output for each batch size so they can be compared


In [15]:
final_prompt = """
You are an AI assistant helping a YouTube creator.

Based on these comments, generate a clear and structured insight report with:
1. Top 5 content topics viewers want
2. Top 5 recurring questions
3. Main confusions or learning barriers
4. Overall audience sentiment
5. 3 recommended next video ideas
"""
prompt_P1 = """
You are an AI assistant helping a YouTube creator understand their audience deeply.

From these YouTube comments extract:
1. The biggest problems or struggles viewers are facing
2. The most requested content or topics
3. Frequently asked questions
4. Emotional signals (confusion, excitement, frustration, praise)
5. Opportunities for new videos, courses, or tutorials

Focus on repeated and high-impact signals. Ignore spam and one-off comments.
"""

prompt_P2 = """
You are a YouTube growth strategist.

Analyze these comments to identify:
1. Viewer segments (beginners, intermediate, advanced)
2. What is blocking viewers from learning or progressing
3. Language or format preferences (Hindi, English, short videos, full courses)
4. High-demand topics that can drive subscribers and revenue
5. A data-driven plan for the creator's next 3 videos

Give actionable insights that a YouTuber can use to grow their channel.
"""

prompts = {
    "P0_Baseline": final_prompt,
    "P1_Creator_Intelligence": prompt_P1,
    "P2_Growth_Strategy": prompt_P2
}

all_final_reports = {}

for bs, tree_index in all_tree_indexes.items():
    query_engine = tree_index.as_query_engine(response_mode="tree_summarize")

    print("\n" + "="*50)
    print(f"Batch size = {bs}")
    print("="*50)

    all_final_reports[bs] = {}

    for name, prompt in prompts.items():
        result = query_engine.query(prompt)
        all_final_reports[bs][name] = result

        print(f"\n--- {name} ---\n")
        print(result)




Batch size = 100

--- P0_Baseline ---

Insight Report:

**1. Top 5 Content Topics Viewers Want:**
1. AI and Machine Learning (ML) courses and tutorials
2. Data Science and related topics
3. Python programming and its applications (e.g., Django)
4. Cybersecurity courses
5. Android development and software engineering

**2. Top 5 Recurring Questions:**
1. What are the minimum computer specifications required for learning ML?
2. Can a non-technical person or someone from a different domain (e.g., mechanical engineering) start a career in AI/ML or software engineering?
3. What is the duration of AI and ML courses?
4. How to get started with coding and web development?
5. What are the skills required for robotic and automation?

**3. Main Confusions or Learning Barriers:**
- Confusion between AI and Data Science
- Difficulty in choosing the right career path (e.g., JEE, NEET, or coding)
- Technical issues (e.g., accessing purchased courses due to incorrect email)
- Lack of clear guidance o

### **Prompt & Chunking Optimization**

This step runs multiple prompt strategies on each batch-size TreeIndex to compare how different prompts and chunk sizes affect the quality, depth, and usefulness of YouTube creator insights.

**final_prompt** → Baseline prompt for generating standard creator insights

**prompt_P1** → Creator-intelligence prompt focused on viewer pain points, emotions, and content demand

**prompt_P2** → Growth-strategy prompt focused on audience segments, monetization, and channel expansion

**prompts** → Stores multiple prompt versions for systematic comparison

**all_tree_indexes** → Provides different TreeIndexes built using different batch sizes

**tree_index.as_query_engine()** → Converts each TreeIndex into an LLM-powered reasoning and summarization engine

**response_mode="tree_summarize"** → Tells the response synthesizer to merge the retrieved comment chunks into one structured report by summarizing them recursively

**query_engine.query(prompt)** → Runs each prompt on the full YouTube comment knowledge base

**all_final_reports[bs][name]** → Stores every output by batch size and prompt type for evaluation

**print(result)** → Displays the generated creator-ready intelligence reports
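
Because the reports are stored in a nested dictionary, they can be flattened into rows for side-by-side comparison. The sketch below assumes a hypothetical helper, `flatten_reports`, and uses made-up stand-in strings in place of real LLM outputs. Word count is only a rough size indicator here, not a measure of quality.

```python
def flatten_reports(reports):
    """Turn {batch_size: {prompt_name: text}} into a list of comparison rows."""
    rows = []
    for batch_size, by_prompt in sorted(reports.items()):
        for prompt_name, text in by_prompt.items():
            # str() gives the same text that print(result) shows in the notebook.
            text = str(text)
            rows.append({
                "batch_size": batch_size,
                "prompt": prompt_name,
                "words": len(text.split()),
                "preview": text[:60],
            })
    return rows

# Made-up stand-in outputs; real entries would be query_engine.query(...) results.
toy_reports = {
    200: {"P0_Baseline": "Viewers want ML courses."},
    100: {
        "P0_Baseline": "Top topics: AI, Python.",
        "P1_Creator_Intelligence": "Many beginners.",
    },
}
rows = flatten_reports(toy_reports)
for row in rows:
    print(row)
```

Passing `all_final_reports` to the same helper would give one row per batch size and prompt, which could then be loaded into a DataFrame with `pd.DataFrame(rows)` for sorting or export.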