# Sandbox

### imports

In [1]:
import os

from dotenv import load_dotenv
from ingestion.ingest import *

import sqlite3
import pandas as pd

from query_gen.functions import *

In [2]:
load_dotenv()
API_KEY = os.getenv("YOUTUBE_API_KEY")
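
If `.env` is missing or the variable name is misspelled, `os.getenv` returns `None` instead of raising, so the problem only surfaces at the first API call. A minimal sketch of a fail-fast check follows; the helper `require_env` and the variable `DEMO_KEY` are hypothetical and not part of this project.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a made-up variable so the sketch runs on its own
os.environ["DEMO_KEY"] = "abc123"
demo_key = require_env("DEMO_KEY")
```

In the notebook this would be called as `require_env("YOUTUBE_API_KEY")` after `load_dotenv()`.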

### ingestion

In [3]:
conn = sqlite3.connect('data/videos.db')

# View as DataFrame
df = pd.read_sql_query("SELECT * FROM videos", conn)
df.head()

| | video_id | title | transcript |
|---|---|---|---|
| 0 | 4oUOJ37GKYE | The more they hurt you, the stronger you get #... | story is the story of the Hydra monster which ... |
| 1 | r5qk3uIdkks | What is #ai? — Simply Explained | what is AI exactly I work in the industry and ... |
| 2 | WgmMK5fS0X0 | How to do MORE with LESS - multikills | multitasking isn't a good strategy how do we s... |
| 3 | 5zeqo-R12vk | How to STAY dumb | in a lot of situations we might feel the need ... |
| 4 | hB27yAkJLC8 | Being fragile means you have more downside tha... | first story is the sword of dimicles is the st... |


In [4]:
df.shape

(158, 3)
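
The connection opened in the ingestion cell stays open for the rest of the session. Below is a minimal sketch of the same read with the connection closed afterwards. It uses an in-memory database and made-up rows so that it runs on its own; the names `demo_conn` and `demo_df` are illustrative, and the real table lives in `data/videos.db`.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# closing() guarantees conn.close() is called when the block exits
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute(
        "CREATE TABLE videos (video_id TEXT, title TEXT, transcript TEXT)"
    )
    demo_conn.executemany(
        "INSERT INTO videos VALUES (?, ?, ?)",
        [
            ("id1", "First title", "first transcript"),
            ("id2", "Second title", "second transcript"),
        ],
    )
    # Read the whole table into a DataFrame while the connection is still open
    demo_df = pd.read_sql_query("SELECT * FROM videos", demo_conn)

print(demo_df.shape)
```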

In [5]:
video_id = 'vEvytl7wrGM'
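# Locate the row for this video; None if the id is not in the table
matches = df.index[df['video_id'] == video_id]
irow = matches[0] if len(matches) else None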
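
For reference, `idxmax()` on a boolean mask returns the first label even when no row matches, so a lookup written as `(df['video_id'] == video_id).idxmax()` cannot signal a missing id. The sketch below contrasts the two behaviours on a small made-up DataFrame; the helper `find_row` and the ids are hypothetical.

```python
import pandas as pd


def find_row(frame, video_id):
    """Return the index label of the first row with this video_id, or None."""
    matches = frame.index[frame["video_id"] == video_id]
    return matches[0] if len(matches) else None


toy = pd.DataFrame({"video_id": ["aaa", "bbb", "ccc"]})

found = find_row(toy, "bbb")      # label of the matching row
missing = find_row(toy, "zzz")    # None, because no row matches
silent = (toy["video_id"] == "zzz").idxmax()  # first label, despite no match
```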

### query gen

In [6]:
from ingestion.database import get_video_by_id
from query_gen.functions import get_all_comments, generate_queries

# Get video data
video_data = get_video_by_id(video_id)
print(f"Title: {video_data['title']}")

# Get comments
comments = get_all_comments(video_id)
print(f"Comments: {len(comments)}")

# Generate queries
result = generate_queries(
    video_title=video_data["title"],
    transcript=video_data["transcript"],
    comments=comments
)

# Display results
for i, q in enumerate(result.queries, 1):
    print(f"\n{i}. [{q.query_type.value} / {q.difficulty.value}]")
    print(f"   {q.query}")
    print(f"   → {q.grounding}")

Title: Claude Skills Explained in 23 Minutes
Comments: 6

1. [exact / grounded]
   What are Claude skills?
   → Core definition — "reusable instructions Claude can access when needed"

2. [exact / medium]
   claude skills folder structure
   → File/folder structure — "folder with a file called skill.md"

3. [exact / hard]
   does MCP work with other LLMs or is it just claude?
   → MCP comparison — "skills only work with claude while MCP is a universal open standard"

4. [conceptual / grounded]
   Why is progressive disclosure important in Claude skills?
   → Progressive disclosure — "skills give Claude just enough context for the next step"

5. [conceptual / medium]
   context window impact of many skills
   → Context/token limits — "metadata are so lightweight, we can have hundreds, if not thousands of skills and have a relatively small impact on the context window"

6. [conceptual / hard]
   MCP vs skills vs sub agents which to use?
   → Sub agents — "the main value of sub aents is t

In [7]:
result.queries

[Query(query_type=<QueryType.EXACT: 'exact'>, difficulty=<DifficultyLevel.GROUNDED: 'grounded'>, grounding='Core definition — "reusable instructions Claude can access when needed"', query='What are Claude skills?'),
 Query(query_type=<QueryType.EXACT: 'exact'>, difficulty=<DifficultyLevel.MEDIUM: 'medium'>, grounding='File/folder structure — "folder with a file called skill.md"', query='claude skills folder structure'),
 Query(query_type=<QueryType.EXACT: 'exact'>, difficulty=<DifficultyLevel.HARD: 'hard'>, grounding='MCP comparison — "skills only work with claude while MCP is a universal open standard"', query='does MCP work with other LLMs or is it just claude?'),
 Query(query_type=<QueryType.CONCEPTUAL: 'conceptual'>, difficulty=<DifficultyLevel.GROUNDED: 'grounded'>, grounding='Progressive disclosure — "skills give Claude just enough context for the next step"', query='Why is progressive disclosure important in Claude skills?'),
 Query(query_type=<QueryType.CONCEPTUAL: 'conceptual'

### answer

In [12]:
from utils.answer import generate_answer

response = generate_answer("why is overfitting a problem for decision trees")
print(response.answer)
for citation in response.citations:
    print(f"- {citation.title} (https://youtu.be/{citation.video_id})")

Overfitting is a problem for decision trees because they can become overly complex by creating too many splits that perfectly classify the training data, often resulting in a large and intricate tree. This leads the model to be overly optimized to the training data, capturing noise and specific patterns that do not generalize well to new, unseen data. As explained, "overfitting is when you learn a machine learning model based on some data set but your model becomes over optimized on the data set it was trained on and when you try to apply that model to new data that it's never seen before you'll find that your model is actually very inaccurate." Therefore, an overfit decision tree may have perfect prediction on the training data but poor performance on test data. Using hyperparameters to control the growth of the tree (like limiting depth or splits) can help mitigate overfitting and improve generalizability to new data.
- An Introduction to Decision Trees | Gini Impurity & Python Code 