<a href="https://colab.research.google.com/github/AmirJlr/AmirJlr/blob/main/04_recommender_system_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# A very high level overview

In this project, we build a knowledge graph-based recommendation system for educational videos where the nodes represent individual videos and edges capture their relationships. We start by processing video metadata **(titles, descriptions, durations)** using NLP techniques - specifically:

-  We use spaCy for extracting key concepts and SentenceTransformer for generating semantic embeddings. These embeddings help us understand video relationships as we can calculate similarity scores between them.

When building the graph with NetworkX,

- We create edges based on both semantic similarity and logical prerequisites, while nodes store various attributes like difficulty level, topics covered, and duration.

<p>

After building the graph,

- We perform various analyses and visualizations using NetworkX and PyVis - including generating learning paths, visualizing topic relationships, and analyzing video prerequisites.

Then, for demo purposes,
- We serialize and save this graph structure into a SQLite database, storing nodes, edges, and embeddings in separate tables.

This is just for learning purposes only so you can learn how to serialize the graph, load it back and build a graph again. After loading the database, to perform further analysis, we don't actually query the SQL database directly - instead, we load the data back into a NetworkX graph structure using load_recommendation_system().

We do have some basic SQL query functions (query_videos_by_topic, get_video_prerequisites), again for learning purposes. The main analytical work like finding learning paths and analyzing relationships still happens using NetworkX methods after reconstructing the graph from the database. This approach suggests we might want to reconsider either making better use of SQL capabilities or exploring graph databases that could maintain the graph structure natively.


# Building an Educational Content Recommender System Using Knowledge Graphs and NLP

This code is designed to illustrate the power of knowledge graphs especially when combined with NLP as a great recommender system for educational content.


## Initial Data Processing

The first stage involves loading and preprocessing the video data from the CSV file. The `load_video_data` function handles this initial step, where we:
1. Convert video durations from milliseconds to minutes for better readability
2. Convert timestamp strings to proper datetime objects for temporal analysis
3. Prepare the data for further processing


## Text Processing and Feature Extraction

The next stage is processing the video titles and descriptions using Natural Language Processing (NLP). This is where the automation magic begins to happen. If you checked out the previous tutorial, this is where we manually defined our graph using python learning topic names. Let us extract these details using NLP. In the `process_video_data` function, we use several NLP techniques:

### 1. Text Cleaning
- We clean all text by converting to lowercase and removing special characters
- Handle any missing values (NaN) in titles or descriptions
- Remove extra whitespace and standardize the text format

### 2. Semantic Understanding using Sentence Transformers
- We use the 'all-MiniLM-L6-v2' model from the sentence-transformers library
- This model converts our text into high-dimensional vector embeddings
- Each video's title and description are combined and transformed into a vector that captures its semantic meaning
- These embeddings allow us to measure semantic similarity between videos, even if they don't share exact words

### 3. Concept Extraction using spaCy
The `extract_key_concepts` function uses spaCy's powerful NLP capabilities to:
- Identify noun phrases that represent key concepts in the video content
- Extract named entities (like Python, OpenCV, etc.)
- Focus on phrases of 3 words or less to capture concise, meaningful concepts
- Remove duplicates and clean the extracted concepts

This process helps us understand what each video is about at a conceptual level. For example, from a video titled "Introduction to Python Image Processing with OpenCV", it might extract concepts like:
- python
- image processing
- opencv
- introduction
These concepts become crucial for building relationships between videos.

### 4. Topic Identification
Again, in the previous tutorial we manually defined a few topics but here we will assign our YouTube videos into one of these several topics. The `extract_video_topics` function categorizes videos into predefined topics like:
- Python basics
- Image processing
- Machine learning
- Computer vision
- Bio applications
This categorization is done by looking for specific keywords in the title and description, helping us organize content into meaningful groups.

### 5. Difficulty Assessment
We automatically infer video difficulty levels (beginner, intermediate, advanced) based on several factors:
- Explicit indicators in the title (like "introduction", "advanced")
- Video sequence number (earlier videos tend to be more basic)
- Number of technical concepts covered
- Complexity of the content

This way of automatic processing of the video content creates the foundation for building our knowledge graph. The extracted features, concepts, and relationships will determine how videos are connected and how we can navigate between them effectively.

## Building the Knowledge Graph

After processing our video content, we construct a knowledge graph that represents the relationships between videos. This is where our system becomes truly powerful, as it captures not just the content of videos, but how they relate to each other in meaningful ways.

### Graph Construction Process

The knowledge graph is built using NetworkX, with each video becoming a node in the graph. The `build_knowledge_graph` function handles this process:

1. **Node Creation**
   - Each video becomes a node in our graph
   - Node attributes include:
     - Title
     - Description
     - Difficulty level
     - Duration
     - Key concepts
     - Topics covered
     - Video number (if part of a series)
   These rich attributes allow us to understand each video's content and context thoroughly.

2. **Edge Creation**
   The most interesting part is how we automatically create edges between videos. We use multiple factors:

   a) **Semantic Similarity**
   - Use the cosine similarity between video embeddings
   - Connect videos if their similarity exceeds a threshold (default for now is 0.5)
   - This catches semantic relationships even when videos don't share exact keywords

   b) **Relationship Type Determination**
   The `determine_relationship` function assigns relationship types between videos:
   - **Prerequisite**: Video A should be watched before Video B
   - **Advanced**: Video B builds upon concepts in Video A
   - **Related**: Videos cover related topics but don't have a strict order

   Relationships are determined based on:
   - Video sequence numbers (if part of a series)
   - Difficulty levels (beginner → intermediate → advanced)
   - Topic relationships (e.g., Python basics are prerequisites for advanced topics)

### Graph Visualization

We create several types of visualizations to understand our knowledge graph:

1. **Complete Knowledge Graph** (`visualize_knowledge_graph`)
   - Shows all videos and their relationships
   - Color-coded by difficulty level:
     - Light green for beginner
     - Light blue for intermediate
     - Light pink for advanced
   - Interactive visualization where you can:
     - Hover over nodes to see video details
     - Drag nodes to explore relationships
     - Zoom in/out to focus on specific areas

2. **Topic Subgraphs** (`visualize_topic_subgraph`)
   - Shows videos related to specific topics (e.g., 'python_basics', 'image_processing')
   - Helps understand the structure within each topic area
   - Useful for seeing prerequisite chains within a topic

3. **Simplified Graph** (`visualize_simplified_graph`)
   - Two versions:
     - Strong connections only (high similarity weight)
     - Python basics structure (foundational content)
   - Helps see the core structure of the content
   - Reduces visual complexity for better understanding

4. **Learning Path Visualization** (`visualize_learning_path`)
   - Shows the recommended sequence of videos for a specific learning goal
   - Highlights the progression from prerequisites to advanced content
   - Color-coded to show difficulty progression

All visualizations are saved as interactive HTML files using the pyvis library, which allows for:
- Zooming and panning
- Node dragging for better arrangement
- Hovering for detailed information
- Physics-based layout for natural clustering

These visualizations not only help us understand the structure of our educational content but also validate the relationships our system has identified. They're particularly useful for:
- Identifying gaps in content
- Ensuring proper prerequisite chains
- Finding isolated content that needs better integration
- Understanding the overall structure of the educational material

## Learning Path Generation and Recommendations

Once our knowledge graph is built, we can use it to generate personalized learning paths and make intelligent recommendations. This is where the system demonstrates its real power in educational content organization.

### Learning Path Generation

The `find_learning_path` function is central to our recommendation system. Here's how it works:

1. **Goal Understanding**
   - Takes a learning goal as input (e.g., "Mastering Python for Bioimage analysis")
   - Converts the goal into an embedding using the same sentence transformer model
   - Compares this embedding with all video embeddings to find relevant content

2. **Topic Prerequisites**
   We define prerequisite relationships between topics. For example:
   ```
   'image_processing': needs {'python_basics', 'data_structures'}
   'machine_learning': needs {'python_basics', 'data_structures', 'data_analysis'}
   'computer_vision': needs {'python_basics', 'image_processing'}
   'bio_applications': needs {'python_basics', 'image_processing'}
   ```

3. **Path Construction**
   The system builds a path through the content by:
   a) Starting with foundational content
      - Automatically includes Python basics for technical topics
      - Sorts basics by video number to maintain logical progression
   
   b) Adding topic-specific content
      - Uses similarity scores to find most relevant videos
      - Checks and includes prerequisites before advanced content
      - Ensures proper skill progression

### Path Analysis and Validation

The `analyze_path_coverage` function examines generated paths to ensure quality:

1. **Topic Coverage**
   - Tracks which topics are covered in the path
   - Ensures all necessary prerequisite topics are included

2. **Prerequisite Validation**
   - Checks if advanced topics (like machine learning) have necessary basics
   - Issues warnings if prerequisites are missing
   - Helps maintain logical learning progression

3. **Path Statistics**
   We collect detailed statistics about each path:
   - Total number of videos
   - Total duration
   - Difficulty breakdown
   - Topic coverage
   - Concept progression

### Querying the System

Our system supports various types of queries:

1. **Topic-Based Queries** (`query_videos_by_topic`)
   - Find videos by topic and difficulty level
   - Useful for focused learning in specific areas
   - Can filter by beginner/intermediate/advanced content

2. **Prerequisite Queries** (`get_video_prerequisites`)
   - Find what videos should be watched before a specific video
   - Shows relevance scores for each prerequisite
   - Helps ensure proper preparation for advanced content

3. **Learning Path Queries** (`get_learning_path_from_db`)
   - Generate complete learning paths for specific goals
   - Includes:
     - Step-by-step video sequence
     - Estimated completion time
     - Difficulty progression
     - Topic coverage analysis

### System Persistence and Database Structure

Instead of using simple serialization (like pickle), we use SQLite because:
1. **Structured Storage**
   - Maintains relationships between different components
   - Allows for complex queries
   - Ensures data integrity

2. **Efficient Querying**
   - Fast retrieval of video information
   - Efficient path generation
   - Quick prerequisite lookups

3. **Data Organization**
   Our database has three main tables:
   - `nodes`: Stores video information
   - `edges`: Stores relationships between videos
   - `embeddings`: Stores video embeddings for similarity calculations

This structured storage allows us to:
- Quickly rebuild the knowledge graph
- Run complex queries efficiently
- Maintain relationship integrity
- Update content without rebuilding everything

## System Outputs and Practical Applications

Let's look at what we get from this system and how it can be practically used.

### Output Organization

All system outputs are organized in a structured directory:
```
/video_recommender/results/
├── database/
│   └── video_recommender.db
├── visualizations/
│   ├── knowledge_graph.html
│   ├── learning_path_*.html
│   ├── topic_subgraph_*.html
│   └── simplified_*.html
└── queries/
    ├── learning_path_*.txt
    └── topic_query_*.txt
```

### Practical Applications

1. **For Content Creators (like myself)**
   - **Content Gap Analysis**
     - Identify missing prerequisite content
     - Spot areas needing more advanced material
     - See which topics are under-represented

   - **Content Organization**
     - Verify logical progression of video series
     - Ensure proper coverage of prerequisites
     - Plan future content based on graph structure

   - **Channel Management**
     - Optimize playlist organization
     - Create better video descriptions
     - Link related videos effectively

2. **For Learners**
   - **Personalized Learning Paths**
     - Get customized pathways for specific goals
     - Understand prerequisites clearly
     - Track learning progress

   - **Topic Exploration**
     - Find related content easily
     - Understand topic relationships
     - Choose appropriate difficulty levels

3. **For Educational Institutions**
   - **Curriculum Planning**
     - Design coherent course sequences
     - Ensure proper skill progression
     - Create balanced learning paths

### Real-World Impact

In my YouTube channel (DigitalSreeni), this system helps:
- New viewers find appropriate starting points
- Regular viewers progress logically through topics
- Advanced viewers find specific content quickly
- Me (as a content creator) maintain content coherence

### Future Extensions and Possibilities

1. **Technical Enhancements**
   - Real-time content updates
   - User feedback integration
   - More sophisticated difficulty estimation
   - Advanced recommendation algorithms

2. **Feature Additions**
   - User progress tracking
   - Interactive learning path modification
   - Content engagement metrics
   - Multi-language support

3. **Integration Possibilities**
   - YouTube API integration
   - Learning Management Systems (LMS)
   - Course creation platforms
   - Social learning features



## Summary

The goal is to make this a comprehensive project on building a recommender system for learning content using knowledge graphs and NLP. While I've used it for my YouTube channel, the principles and techniques can be applied to any educational content platform.


In [1]:
!pip install sentence-transformers spacy gensim pandas numpy networkx pyvis
!python -m spacy download en_core_web_sm

Collecting gensim
  Downloading gensim-4.3.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.1 kB)
Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.0/61.0 kB[0m [31m579.0 kB/s[0m eta [36m0:00:00[0m
[?25hCollecting scipy (from sentence-transformers)
  Downloading scipy-1.13.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m60.6/60.6 kB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
Collecting jedi>=0.16 (from ipython>=5.3.0->pyvis)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
INFO: pip is looking at multiple versions of thinc to determine which version is compatible with other requirements. This could take a while.
Collecting thinc<8.4.0,>=8.

In [1]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import spacy
import re
import networkx as nx
from typing import List, Dict, Set, Tuple
from gensim import corpora, models
import sqlite3
import json
from pyvis.network import Network
import os

In [2]:
## Personal note, delete later: Add Doc strings to each function. May be chatGPT can help?

def load_video_data(csv_path: str = '04_combined_videos.csv') -> pd.DataFrame:
    """
    Load and preprocess the video data from CSV. Convert ms to minutes and
    date to proper datetime format. Not sure if we will use the date but why not
    make it ready for use.
    """
    df = pd.read_csv(csv_path)

    # Convert duration from milliseconds to minutes
    df['duration_minutes'] = df['Approx Duration (ms)'] / (1000 * 60)

    # Convert timestamp to datetime
    df['publish_date'] = pd.to_datetime(df['Video Publish Timestamp'])

    return df




def clean_text(text: str) -> str:
    """
    Clean and preprocess text.
    Takes a text string, converts it to lowercase, removes special characters,
    extra spaces, and returns the cleaned result.
    """
    if pd.isna(text):
        return ""

    # Convert to lowercase and remove special characters
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text



def extract_key_concepts(text: str, nlp) -> List[str]:
    """
    Extract key concepts from text using spaCy.
    This process helps us understand what each video is about at a conceptual level.
    For example, from a video titled "Introduction to Python Image Processing
    with OpenCV", it might extract concepts like:
        python
        image processing
        opencv
        introduction
    These concepts become important for building relationships between videos.
    """
    doc = nlp(text)
    concepts = []

    # Extract noun phrases and named entities
    for chunk in doc.noun_chunks:     # doc.noun_chunks is a generator provided by Spacy's Doc object and gives noun phrases from the text.
        if len(chunk.text.split()) <= 3:  # Limit to phrases of 3 words or less, just so we work with concise phrases
            concepts.append(chunk.text)

    for ent in doc.ents:
        if len(ent.text.split()) <= 3:
            concepts.append(ent.text)

    # Remove duplicates and clean
    concepts = list(set([clean_text(c) for c in concepts]))
    return [c for c in concepts if c]  # Remove empty strings




def process_video_data(video_data: pd.DataFrame) -> Tuple[pd.DataFrame, np.ndarray, np.ndarray, Dict]:
    """
    Process video data and create embeddings for semantic understanding

    We use the 'all-MiniLM-L6-v2' model from the sentence-transformers library
    This model converts our text into high-dimensional vector embeddings
    Each video's title and description are combined and transformed into a vector that captures its semantic meaning
    These embeddings allow us to measure semantic similarity between videos, even if they don't share exact words

    """
    # Initialize models
    transformer_model = SentenceTransformer('all-MiniLM-L6-v2')
    nlp = spacy.load('en_core_web_sm')

    print("Cleaning text...")
    # Clean titles and descriptions
    video_data['cleaned_title'] = video_data['Video Title (Original)'].apply(clean_text)
    video_data['cleaned_description'] = video_data['Video Description (Original)'].apply(clean_text)

    # Combine title and description for embedding
    video_data['combined_text'] = video_data['cleaned_title'] + " " + video_data['cleaned_description']

    print("Creating embeddings...")
    # Create embeddings
    embeddings = transformer_model.encode(
        video_data['combined_text'].tolist(),
        show_progress_bar=True
    )

    # Calculate similarity matrix
    similarity_matrix = cosine_similarity(embeddings)

    print("Extracting concepts...")
    # Extract concepts for each video - by building a dictionary that maps each video (by its index)
    # to a list of key concepts extracted from its combined title and description.
    #Example structure of the final result....
    # {
    #     0: ['python', 'image processing', 'opencv'],  # Concepts for video 0
    #     1: ['machine learning', 'classification', 'sklearn'],  # Concepts for video 1
    #     2: ['deep learning', 'segmentation', 'tensorflow']  # Concepts for video 2
    # }

    video_concepts = {}
    for idx, row in video_data.iterrows():
        concepts = extract_key_concepts(row['combined_text'], nlp)
        video_concepts[idx] = concepts

    return video_data, embeddings, similarity_matrix, video_concepts




def extract_video_topics(title: str, description: str) -> Set[str]:
    """
    Extract main topics from video title and description
    categorizes videos into predefined topics like:
        Python basics
        Image processing
        Machine learning
        Computer vision
        Bio applications
This categorization is done by looking for specific keywords in the title and
description, helping us organize content into meaningful groups.
Basically, the output looks like this:
Input: text = "introduction to python image processing learn how to use opencv for microscopy image analysis and visualization"
Output: {'python_basics', 'image_processing', 'computer_vision'}

    """
    # Handle NaN values
    title = str(title) if not pd.isna(title) else ""
    description = str(description) if not pd.isna(description) else ""

    text = (title + " " + description).lower()
    topics = set()

    # Define topic categories
    topic_keywords = {
        'python_basics': {'python basics', 'introduction', 'variables', 'functions', 'loops', 'conditionals'},
        'data_structures': {'lists', 'dictionaries', 'arrays', 'data structures'},
        'image_processing': {'image', 'microscopy', 'bioimage', 'visualization', 'processing'},
        'machine_learning': {'machine learning', 'deep learning', 'neural network', 'classification'},
        'data_analysis': {'data analysis', 'statistics', 'pandas', 'numpy'},
        'computer_vision': {'opencv', 'vision', 'object detection', 'segmentation'},
        'bio_applications': {'cell', 'tissue', 'microscopy', 'biology', 'medical'}
    }

    for topic, keywords in topic_keywords.items():
        if any(keyword in text for keyword in keywords):
            topics.add(topic)

    return topics




def infer_video_difficulty(title: str, concepts: List[str], video_number: int = None) -> str:
    """
    Infer difficulty level of a video
    by looking for some words that represent beginner or advanced
    by video number, as my first videos are basic intro videos
    also by number of concepts, like python, opencv, etc. etc.

    """
    title = title.lower()

    # Check for explicit difficulty indicators
    if any(word in title for word in ['introduction', 'basics', 'beginner', 'what is']):
        return 'beginner'
    elif any(word in title for word in ['advanced', 'expert', 'complex']):
        return 'advanced'

    # Consider video number in series
    if video_number is not None:
        if video_number <= 20:
            return 'beginner'
        elif video_number > 50:
            return 'advanced'

    # Check concept complexity
    concept_count = len(concepts)
    if concept_count < 5:
        return 'beginner'
    elif concept_count > 10:
        return 'advanced'

    return 'intermediate'

In [3]:
def build_knowledge_graph(video_data: pd.DataFrame,
                        similarity_matrix: np.ndarray,
                        video_concepts: Dict,
                        similarity_threshold: float = 0.6) -> nx.DiGraph:
    """
    Build a knowledge graph from video data using NetworkX
    Build nodes and edges. Nodes are basically videos and they have various attributes (metadata).

    Edges are added based on similarity we calculate (using cosine).
    Basically semantic Similarity between video embeddings
    Connect videos if their similarity exceeds a threshold (default 0.5 but changed to 0.6 to minimize the number of edges.)
    This catches semantic relationships even when videos don't share exact keywords

    """
    G = nx.DiGraph()

    # Add nodes (videos)
    for idx, row in video_data.iterrows():
        # Extract video number if present
        video_num = None
        match = re.search(r'(\d+)', row['Video Title (Original)'])
        if match:
            video_num = int(match.group(1))

        # Get video topics
        topics = extract_video_topics(row['Video Title (Original)'],
                                   row['Video Description (Original)'])

        # Infer difficulty
        difficulty = infer_video_difficulty(row['Video Title (Original)'],
                                         video_concepts[idx],
                                         video_num)

        # Add node with metadata
        G.add_node(idx,
                  title=row['Video Title (Original)'],
                  description=row['Video Description (Original)'],
                  concepts=video_concepts[idx],
                  topics=topics,
                  difficulty=difficulty,
                  duration=row['Approx Duration (ms)'],
                  video_number=video_num)

    # Add edges based on similarity and relationships
    for i in range(len(video_data)):
        for j in range(len(video_data)):
            if i != j and similarity_matrix[i][j] >= similarity_threshold:   #similarity_matrix is nothing but our cosine similarity
                # Determine edge type and direction
                edge_type = determine_relationship(G, i, j)  #determine_relationship is defined next

                G.add_edge(i, j,
                          weight=similarity_matrix[i][j],
                          type=edge_type)

    return G



def determine_relationship(G: nx.DiGraph, video1_idx: int, video2_idx: int) -> str:
    """
    Determine the relationship type between two videos

    Prerequisite: Video A should be watched before Video B
    Advanced: Video B builds upon concepts in Video A
    Related: Videos cover related topics but don't have a strict order

    Relationships are determined based on:
      Video sequence numbers (if part of a series)
      Difficulty levels (beginner → intermediate → advanced)
      Topic relationships (e.g., Python basics are prerequisites for advanced topics)

    """
    v1 = G.nodes[video1_idx]
    v2 = G.nodes[video2_idx]

    # Check if videos are part of a numbered series
    if (v1['video_number'] is not None and
        v2['video_number'] is not None):
        if v1['video_number'] < v2['video_number']:
            return 'prerequisite'
        elif v1['video_number'] > v2['video_number']:
            return 'advanced'

    # Compare difficulty levels
    diff_levels = ['beginner', 'intermediate', 'advanced']
    v1_diff_idx = diff_levels.index(v1['difficulty'])
    v2_diff_idx = diff_levels.index(v2['difficulty'])

    if v1_diff_idx < v2_diff_idx:
        return 'prerequisite'
    elif v1_diff_idx > v2_diff_idx:
        return 'advanced'

    # Check topic relationships
    v1_topics = v1['topics']
    v2_topics = v2['topics']

    if 'python_basics' in v1_topics and not 'python_basics' in v2_topics:
        return 'prerequisite'

    return 'related'

In [4]:
## We are using pyviz for visualization, make sit easy for a complex graph like this one.

def visualize_knowledge_graph(G: nx.DiGraph, filename='knowledge_graph.html'):
    """
    Create interactive visualization of the knowledge graph (we are using pyviz)
    """

    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)
    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 font_color='#000000', directed=True, notebook=True)

    # Set physics layout options
    net.force_atlas_2based()
    net.show_buttons(filter_=['physics'])

    # Color mapping for different difficulty levels
    color_map = {
        'beginner': '#90EE90',      # light green
        'intermediate': '#ADD8E6',   # light blue
        'advanced': '#FFB6C1'        # light pink
    }

    # Add nodes
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        title = node_data['title']
        difficulty = node_data['difficulty']
        duration = node_data['duration'] / (1000 * 60)  # Convert to minutes
        concepts = ', '.join(node_data['concepts'][:5])
        topics = ', '.join(node_data['topics'])

        hover_text = f"""
        Title: {title}
        Topics: {topics}
        Difficulty: {difficulty}
        Duration: {duration:.1f} min
        Key Concepts: {concepts}
        """

        net.add_node(
            str(node_id),
            label=title[:20] + "...",
            title=hover_text,
            color=color_map[difficulty],
            size=20
        )

    # Add edges
    for edge in G.edges(data=True):
        source, target, data = edge
        net.add_edge(
            str(source),
            str(target),
            value=data['weight'] * 2,
            title=f"Type: {data['type']}\nWeight: {data['weight']:.2f}",
            arrows='to'
        )

    # Save the network
    net.save_graph(output_path)
    print(f"Knowledge graph visualization saved to {output_path}")



def visualize_learning_path(G: nx.DiGraph, path: List[int], filename='learning_path.html'):
    """
    Visualize a specific learning path
    Create an interactive visualization of a learning path in a directed graph,
    highlighting nodes and edges with details like title, difficulty, and duration,
    and save it as an HTML file.

    """
    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)
    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 font_color='#000000', directed=True, notebook=True)

    # Color mapping
    color_map = {
        'beginner': '#90EE90',
        'intermediate': '#ADD8E6',
        'advanced': '#FFB6C1'
    }

    # Add nodes in path order
    for i, node_id in enumerate(path):
        node_data = G.nodes[node_id]
        title = node_data['title']
        difficulty = node_data['difficulty']
        duration = node_data['duration'] / (1000 * 60)

        hover_text = f"""
        Step {i+1}
        Title: {title}
        Difficulty: {difficulty}
        Duration: {duration:.1f} min
        Topics: {', '.join(node_data['topics'])}
        """

        net.add_node(
            str(node_id),
            label=f"{i+1}. {title[:20]}...",
            title=hover_text,
            color=color_map[difficulty],
            size=20
        )

    # Add edges between consecutive nodes in the path
    for i in range(len(path)-1):
        source = path[i]
        target = path[i+1]
        if G.has_edge(source, target):
            edge_data = G.get_edge_data(source, target)
            net.add_edge(
                str(source),
                str(target),
                value=edge_data['weight'] * 2,
                title=f"Type: {edge_data['type']}",
                arrows='to'
            )

    # Save the network
    net.save_graph(output_path)
    print(f"Learning path visualization saved to {output_path}")




def visualize_topic_subgraph(G: nx.DiGraph, topic: str, filename=None):
    """
    Visualize subgraph for a specific topic
    Generate an interactive visualization of a subgraph for a specific topic,
    highlighting related nodes and edges with difficulty-based colors.
    """
    if filename is None:
        filename = f'{topic}_subgraph.html'
    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    # Find nodes related to topic
    topic_nodes = [n for n, d in G.nodes(data=True) if topic in d['topics']]

    # Get subgraph
    subgraph = G.subgraph(topic_nodes)

    # Create visualization
    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 directed=True, notebook=True)

    # Add nodes with difficulty-based colors
    difficulty_colors = {
        'beginner': '#90EE90',
        'intermediate': '#ADD8E6',
        'advanced': '#FFB6C1'
    }

    for node_id in subgraph.nodes():
        node_data = subgraph.nodes[node_id]
        net.add_node(
            str(node_id),
            label=node_data['title'][:30] + "...",
            title=f"Title: {node_data['title']}\nDifficulty: {node_data['difficulty']}",
            color=difficulty_colors[node_data['difficulty']],
            size=20
        )

    # Add edges
    for u, v, data in subgraph.edges(data=True):
        net.add_edge(str(u), str(v), value=data['weight'] * 2)

    net.save_graph(output_path)
    print(f"Topic subgraph saved to {output_path}")




def visualize_simplified_graph(G: nx.DiGraph, min_weight=0.7, show_basics_only=False, filename=None):
    """
    Creates a simplified visualization of the knowledge graph,
    filtering nodes and edges by importance or topic.

    """
    if filename is None:
        filename = 'simplified_graph.html'
        if show_basics_only:
            filename = 'simplified_basics_only_graph.html'

    output_path = os.path.join(OUTPUT_DIR, 'visualizations', filename)

    net = Network(height='750px', width='100%', bgcolor='#ffffff',
                 directed=True, notebook=True)

    # Customize physics for better layout
    net.set_options("""
    const options = {
        "physics": {
            "forceAtlas2Based": {
                "gravitationalConstant": -100,
                "springLength": 100,
                "springConstant": 0.1
            },
            "maxVelocity": 50,
            "minVelocity": 0.1,
            "solver": "forceAtlas2Based"
        },
        "edges": {
            "smooth": {
                "type": "continuous",
                "forceDirection": "none"
            }
        }
    }
    """)

    # Filter nodes based on criteria
    nodes_to_include = []
    for node_id in G.nodes():
        node_data = G.nodes[node_id]
        if (show_basics_only and 'python_basics' in node_data['topics']) or \
           (not show_basics_only and len(list(G.neighbors(node_id))) > 3):
            nodes_to_include.append(node_id)

    # Create subgraph
    subgraph = G.subgraph(nodes_to_include)

    # Color scheme
    color_map = {
        'python_basics': '#90EE90',
        'image_processing': '#ADD8E6',
        'machine_learning': '#FFB6C1',
        'bio_applications': '#DDA0DD'
    }

    # Add nodes with simplified information
    for node_id in subgraph.nodes():
        node_data = subgraph.nodes[node_id]
        primary_topic = next(
            (topic for topic in ['python_basics', 'image_processing',
                               'machine_learning', 'bio_applications']
             if topic in node_data['topics']),
            'other'
        )

        net.add_node(
            str(node_id),
            label=node_data['title'][:30] + "..." if len(node_data['title']) > 30 else node_data['title'],
            title=f"Title: {node_data['title']}\nDifficulty: {node_data['difficulty']}",
            color=color_map.get(primary_topic, '#DCDCDC'),
            size=20
        )

    # Add important edges
    for u, v, data in subgraph.edges(data=True):
        if data['weight'] >= min_weight:
            net.add_edge(
                str(u),
                str(v),
                value=data['weight'] * 2,
                title=f"Relationship: {data['type']}"
            )

    net.save_graph(output_path)
    print(f"Simplified graph visualization saved to {output_path}")

# Analysis

In [5]:
def find_learning_path(G: nx.DiGraph,
                      goal: str,
                      embeddings: np.ndarray,
                      transformer_model) -> List[int]:
    """
    Find optimal learning path for a given goal

    This function is key to our recommendation system. Here's how it works:

    Goal Understanding:
    Takes a learning goal as input (e.g., "Mastering Python for Bioimage analysis")
    Converts the goal into an embedding using the same sentence transformer model
    Compares this embedding with all video embeddings to find relevant content

    Topic Prerequisites
    We manually define prerequisite relationships between topics. This ensures that
    before a learner is shown advanced videos on a topic (e.g., machine learning),
    they are first recommended videos covering the necessary foundational knowledge.

    For example:
    'image_processing': needs {'python_basics', 'data_structures'}
    'machine_learning': needs {'python_basics', 'data_structures', 'data_analysis'}
    'computer_vision': needs {'python_basics', 'image_processing'}
    'bio_applications': needs {'python_basics', 'image_processing'}

    Path Construction:
    The system builds a path through the content by:
    a) Starting with foundational content
    Automatically includes Python basics for technical topics
    Sorts basics by video number to maintain logical progression
    b) Adding topic-specific content
    Uses similarity scores to find most relevant videos
    Checks and includes prerequisites before advanced content
    Ensures proper skill progression

    Outputs a list of vidoes

    """
    # Get goal embedding
    goal_embedding = transformer_model.encode([goal])
    video_similarities = cosine_similarity(goal_embedding, embeddings)[0]

    # Define topic prerequisites
    topic_prerequisites = {
        'image_processing': {'python_basics', 'data_structures'},
        'machine_learning': {'python_basics', 'data_structures', 'data_analysis'},
        'computer_vision': {'python_basics', 'image_processing'},
        'bio_applications': {'python_basics', 'image_processing'}
    }

    # Initialize path
    path = []
    visited = set()

    # Start with basics if needed
    goal_lower = goal.lower()
    if any(topic in goal_lower for topic in ['bioimage', 'image', 'machine learning', 'computer vision']):
        # Add Python basics videos first
        basics_videos = [n for n, d in G.nodes(data=True)
                        if 'python_basics' in d['topics'] and
                        d['difficulty'] == 'beginner']
        basics_videos.sort(key=lambda x: G.nodes[x].get('video_number', float('inf'))) # Sort with a default value if video_number is None
        path.extend(basics_videos)
        visited.update(basics_videos)

    # Get top relevant videos
    relevant_indices = np.argsort(video_similarities)[::-1]

    # Add videos to path based on relevance and prerequisites
    for idx in relevant_indices:
        if idx in visited:
            continue

        node_data = G.nodes[idx]

        # Check if prerequisites are met
        node_topics = node_data['topics']
        prerequisites_needed = set()
        for topic in node_topics:
            if topic in topic_prerequisites:
                prerequisites_needed.update(topic_prerequisites[topic])

        # Add missing prerequisites first
        if prerequisites_needed:
            for prereq_topic in prerequisites_needed:
                if not any(prereq_topic in G.nodes[n]['topics'] for n in path):
                    prereq_videos = [n for n, d in G.nodes(data=True)
                                   if prereq_topic in d['topics'] and
                                   n not in visited]
                    prereq_videos.sort(key=lambda x: G.nodes[x]['video_number'] or float('inf'))
                    path.extend(prereq_videos)
                    visited.update(prereq_videos)

        # Add the video to path
        path.append(idx)
        visited.add(idx)

    return path



def format_learning_path(G: nx.DiGraph, path: List[int]) -> Dict:
    """
    Format a learning path by summarizing total videos, duration,
    difficulty breakdown, topics covered, and detailed video info.

    Basically, organize and present information about a learning path
    in a structured and detailed way.
    This function takes a directed graph and creates a summary with:

      - The total number of videos.
      - The total duration in minutes.
      - A breakdown of the difficulty levels.
      - A list of unique topics covered.
      - Detailed information for each video, like its title, difficulty, duration, associated topics, and key concepts.
    """
    formatted_path = {
        'total_videos': len(path),
        'total_duration_minutes': sum(G.nodes[n]['duration'] / (1000 * 60) for n in path),
        'difficulty_breakdown': {
            'beginner': sum(1 for n in path if G.nodes[n]['difficulty'] == 'beginner'),
            'intermediate': sum(1 for n in path if G.nodes[n]['difficulty'] == 'intermediate'),
            'advanced': sum(1 for n in path if G.nodes[n]['difficulty'] == 'advanced')
        },
        'topics_covered': set(),
        'videos': []
    }

    for i, node_id in enumerate(path):
        node_data = G.nodes[node_id]
        formatted_path['topics_covered'].update(node_data['topics'])

        video_info = {
            'id': node_id,  # Include the video ID
            'step': i + 1,
            'title': node_data['title'],
            'difficulty': node_data['difficulty'],
            'duration_minutes': node_data['duration'] / (1000 * 60),
            'topics': list(node_data['topics']),
            'key_concepts': node_data['concepts'][:5]
        }
        formatted_path['videos'].append(video_info)

    formatted_path['topics_covered'] = list(formatted_path['topics_covered'])
    return formatted_path



def print_learning_path(formatted_path: Dict):
    """
    Print formatted learning path in a readable way
    """
    print("\nLearning Path Summary")
    print("=" * 50)
    print(f"Total Videos: {formatted_path['total_videos']}")
    print(f"Total Duration: {formatted_path['total_duration_minutes']:.1f} minutes")
    print(f"                ({formatted_path['total_duration_minutes']/60:.1f} hours)")

    print("\nDifficulty Breakdown:")
    for level, count in formatted_path['difficulty_breakdown'].items():
        print(f"  {level.title()}: {count} videos")

    print("\nTopics Covered:")
    for topic in formatted_path['topics_covered']:
        print(f"  - {topic.replace('_', ' ').title()}")

    print("\nDetailed Video Path:")
    print("=" * 50)
    for video in formatted_path['videos']:
        print(f"\n{video['step']}. {video['title']}")
        print(f"   Difficulty: {video['difficulty']}")
        print(f"   Duration: {video['duration_minutes']:.1f} minutes")
        print(f"   Topics: {', '.join(t.replace('_', ' ').title() for t in video['topics'])}")
        if video['key_concepts']:
            print(f"   Key Concepts: {', '.join(video['key_concepts'])}")



def analyze_path_coverage(G: nx.DiGraph, path: List[int]):
    """
    Analyze topic coverage and prerequisites in the learning path

      Tracks which topics are covered in the path
      Ensures all necessary prerequisite topics are included
      Prerequisite Validation

      Checks if advanced topics (like machine learning) have necessary basics
      Issues warnings if prerequisites are missing
      Helps maintain logical learning progression
      Path Statistics We collect detailed statistics about each path:

      Total number of videos
      Total duration
      Difficulty breakdown
      Topic coverage
      Concept progression

Example output
{
    'covered_topics': {'python_basics', 'data_structures', 'machine_learning'},
    'prerequisite_issues': ["Warning: Advanced topic found without Python basics coverage"],
    'topic_sequence': [['python_basics'], ['data_structures'], ['machine_learning']]
}


    """
    covered_topics = set()
    for node_id in path:
        covered_topics.update(G.nodes[node_id]['topics'])

    # Check if basic prerequisites are included before advanced topics
    prerequisite_issues = []
    for i, node_id in enumerate(path):
        node_topics = G.nodes[node_id]['topics']
        if ('machine_learning' in node_topics or 'computer_vision' in node_topics) and \
           'python_basics' not in covered_topics:
            prerequisite_issues.append(f"Warning: Advanced topic found without Python basics coverage")

    return {
        'covered_topics': covered_topics,
        'prerequisite_issues': prerequisite_issues,
        'topic_sequence': [list(G.nodes[n]['topics']) for n in path]
    }



def create_recommendation_system(csv_path='combined_videos.csv'):
    """
    Create and initialize the complete recommendation system (by running previously defined functions)
    Basically return the knowledge graph, the transformer model and embeddings

    """
    # Initialize transformer model
    transformer_model = SentenceTransformer('all-MiniLM-L6-v2')

    # Load and process data
    print("Loading video data...")
    video_data = load_video_data(csv_path)

    print("Processing video content...")
    processed_data, embeddings, similarity_matrix, video_concepts = process_video_data(video_data)

    print("Building knowledge graph...")
    G = build_knowledge_graph(processed_data, similarity_matrix, video_concepts)

    print("System ready!")
    return G, transformer_model, embeddings

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
### Serialize to a SQL Database

OUTPUT_DIR = '/content/drive/MyDrive/recommender_systems/knowledge_graphs'
DB_PATH = os.path.join(OUTPUT_DIR, 'database', 'video_recommender.db')



def create_output_dirs():
    """
    Create output directories if they don't exist
    """
    subdirs = ['visualizations', 'queries', 'database']
    for subdir in subdirs:
        path = os.path.join(OUTPUT_DIR, subdir)
        os.makedirs(path, exist_ok=True)
    print(f"Created output directories in {OUTPUT_DIR}")



def create_database(db_name='video_recommender.db'):
    """
    Create SQLite database with necessary tables
    """
    try:
        conn = sqlite3.connect(db_name, timeout=10)
        conn.text_factory = str
        c = conn.cursor()

        # Drop existing tables to ensure clean slate
        c.execute('DROP TABLE IF EXISTS edges')
        c.execute('DROP TABLE IF EXISTS nodes')
        c.execute('DROP TABLE IF EXISTS embeddings')

        # Create tables with explicit types and add video_number
        c.execute('''
            CREATE TABLE nodes (
                id INTEGER PRIMARY KEY,
                title TEXT NOT NULL,
                description TEXT,
                difficulty TEXT NOT NULL,
                duration INTEGER NOT NULL,
                concepts TEXT NOT NULL,
                topics TEXT NOT NULL,
                video_number INTEGER
            )
        ''')

        c.execute('''
            CREATE TABLE edges (
                source INTEGER NOT NULL,
                target INTEGER NOT NULL,
                weight REAL NOT NULL,
                relationship_type TEXT NOT NULL,
                PRIMARY KEY (source, target),
                FOREIGN KEY (source) REFERENCES nodes(id),
                FOREIGN KEY (target) REFERENCES nodes(id)
            )
        ''')

        c.execute('''
            CREATE TABLE embeddings (
                id INTEGER PRIMARY KEY,
                embedding BLOB NOT NULL
            )
        ''')

        conn.commit()

    except Exception as e:
        print(f"Database error: {e}")
    finally:
        conn.close()



# Save the recommender system as serialized database
def save_recommendation_system(G, transformer_model, embeddings):
    """
    Save the recommendation system to the specified location

    """
    #db_path = os.path.join(OUTPUT_DIR, 'database', 'video_recommender.db')
    try:
        create_database(DB_PATH)

        conn = sqlite3.connect(DB_PATH, timeout=20)
        c = conn.cursor()

        # Save nodes with video_number
        for node_id, data in G.nodes(data=True):
            c.execute('''
                INSERT OR REPLACE INTO nodes
                (id, title, description, difficulty, duration, concepts, topics, video_number)
                VALUES (?, ?, ?, ?, ?, ?, ?, ?)
            ''', (
                node_id,
                data['title'],
                data.get('description', ''),
                data['difficulty'],
                data['duration'],
                json.dumps(data['concepts']),
                json.dumps(list(data['topics'])),
                data.get('video_number', None)  # Handle cases where video_number doesn't exist
            ))

        # Save edges with explicit float conversion
        c.execute('DELETE FROM edges')
        for source, target, data in G.edges(data=True):
            weight = float(data['weight'])
            c.execute('''
                INSERT INTO edges (source, target, weight, relationship_type)
                VALUES (?, ?, ?, ?)
            ''', (source, target, weight, data['type']))

        # Save embeddings
        c.execute('DELETE FROM embeddings')
        for i, emb in enumerate(embeddings):
            c.execute('INSERT INTO embeddings (id, embedding) VALUES (?, ?)',
                     (i, emb.tobytes()))

        conn.commit()
        print(f"System saved to {DB_PATH}")
    except Exception as e:
        print(f"Error saving to database: {e}")
        try:
            conn.rollback()
        except:
            pass
        raise

    finally:
        try:
            conn.close()
        except:
            pass



def save_query_results(results, filename):
    """
    Save query results to a text file
    """
    output_path = os.path.join(OUTPUT_DIR, 'queries', filename)
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(results)
    print(f"Query results saved to {output_path}")




def load_recommendation_system():
    """
    Load the recommendation system from SQLite database, for future use without
    building the knowledge graph again.
    Note that we are loading the database but then reconstructing the graph
    again using NetworkX.
    """

    if not os.path.exists(DB_PATH):
        raise FileNotFoundError(f"Database not found at {DB_PATH}")

    conn = sqlite3.connect(DB_PATH)
    conn.text_factory = str
    c = conn.cursor()

    # Create new graph
    G = nx.DiGraph()

    # Load nodes with video_number
    c.execute('SELECT * FROM nodes')
    for row in c.fetchall():
        node_id = row[0]
        G.add_node(
            node_id,
            title=str(row[1]) if isinstance(row[1], bytes) else row[1],
            description=str(row[2]) if isinstance(row[2], bytes) else row[2],
            difficulty=str(row[3]) if isinstance(row[3], bytes) else row[3],
            duration=row[4],
            concepts=json.loads(row[5]) if isinstance(row[5], str) else json.loads(row[5].decode()),
            topics=set(json.loads(row[6]) if isinstance(row[6], str) else json.loads(row[6].decode())),
            video_number=row[7]  # Add video_number to node attributes
        )

    # Load edges
    c.execute('SELECT source, target, CAST(weight AS REAL) as weight, relationship_type FROM edges')
    for row in c.fetchall():
        G.add_edge(
            row[0], row[1],
            weight=float(row[2]),
            type=str(row[3]) if isinstance(row[3], bytes) else row[3]
        )

    # Load embeddings
    c.execute('SELECT * FROM embeddings ORDER BY id')
    embeddings = []
    for row in c.fetchall():
        embedding = np.frombuffer(row[1], dtype=np.float32)
        embeddings.append(embedding)
    embeddings = np.array(embeddings)

    conn.close()

    # Recreate transformer model
    transformer_model = SentenceTransformer('all-MiniLM-L6-v2')

    return G, transformer_model, embeddings




def query_videos_by_topic(topic=None, difficulty=None):
    """
    Query videos based on topic and/or difficulty.
    Directly querying the databse and not the graph.
    Just to show as an example that you can do certain queries on the databse itself.

    """
    conn = sqlite3.connect(DB_PATH)
    c = conn.cursor()

    query = 'SELECT id, title, difficulty, duration FROM nodes WHERE 1=1'
    params = []

    if topic:
        query += ' AND topics LIKE ?'
        params.append(f'%{topic}%')

    if difficulty:
        query += ' AND difficulty = ?'
        params.append(difficulty)

    c.execute(query, params)
    results = c.fetchall()
    conn.close()

    return [{'id': r[0],
             'title': r[1],
             'difficulty': r[2],
             'duration_minutes': r[3]/1000/60}
            for r in results]




def get_video_prerequisites(video_id):
    """
    Get prerequisites for a specific video
    Directly querying the databse and not the graph.
    Just to show as an example that you can do certain queries on the databse itself.

    Basically querying source nodes (n.id, n.title, and n.difficulty) that are linked to the
    target video (e.target = video_id) through a relationship of type 'prerequisite'.
    """
    conn = sqlite3.connect(DB_PATH)
    conn.text_factory = str
    c = conn.cursor()

    c.execute('''
        SELECT n.id, n.title, n.difficulty, CAST(e.weight AS REAL) as weight
        FROM edges e
        JOIN nodes n ON e.source = n.id
        WHERE e.target = ? AND e.relationship_type = 'prerequisite'
    ''', (video_id,))

    results = c.fetchall()
    conn.close()

    return [{'id': r[0],
             'title': str(r[1]) if isinstance(r[1], bytes) else r[1],
             'difficulty': str(r[2]) if isinstance(r[2], bytes) else r[2],
             'relevance': float(r[3])}
            for r in results]




def get_learning_path_from_db(goal, db_name='video_recommender.db'):
    """
    Generate learning path for a specific goal from database
    Again, we are loading the database but converting back to a Digraph
    """
    # Load the system
    G, transformer_model, embeddings = load_recommendation_system()

    # Find the path
    path = find_learning_path(G, goal, embeddings, transformer_model)

    # Format and analyze the path
    formatted_path = format_learning_path(G, path)

    return formatted_path




def print_learning_path_from_db(path_info):
    """
    Print the learning path from database

    """
    print("\nLearning Path Summary")
    print("=" * 50)
    print(f"Total Videos: {path_info['total_videos']}")
    print(f"Total Duration: {path_info['total_duration_minutes']:.1f} minutes")
    print(f"                ({path_info['total_duration_minutes']/60:.1f} hours)")

    print("\nDifficulty Breakdown:")
    for level, count in path_info['difficulty_breakdown'].items():
        print(f"  {level.title()}: {count} videos")

    print("\nTopics Covered:")
    for topic in path_info['topics_covered']:
        print(f"  - {topic.replace('_', ' ').title()}")

    print("\nDetailed Learning Path:")
    print("=" * 50)
    for video in path_info['videos']:
        print(f"\n{video['step']}. {video['title']}")
        print(f"   Difficulty: {video['difficulty']}")
        print(f"   Duration: {video['duration_minutes']:.1f} minutes")
        print(f"   Topics: {', '.join(t.replace('_', ' ').title() for t in video['topics'])}")
        if video['key_concepts']:
            print(f"   Key Concepts: {', '.join(video['key_concepts'])}")

In [8]:
def initialize_system(csv_path):
    """
    Initialize and save the recommendation system
    """
    print("Initializing recommendation system...")
    create_output_dirs()  # Using the original function instead of setup_environment

    # Create and save the system
    G, transformer_model, embeddings = create_recommendation_system(csv_path)
    save_recommendation_system(G, transformer_model, embeddings)

    # Generate and save initial visualizations
    visualize_knowledge_graph(G)
    visualize_simplified_graph(G, min_weight=0.7, filename='simplified_strong_connections.html')
    visualize_simplified_graph(G, show_basics_only=True, filename='simplified_basics_only.html')

    return G, transformer_model, embeddings



# Getting Prerequisites for a Video
def check_prerequisites(video_id):
    """
    Check prerequisites for a specific video and save results
    """
    prereqs = get_video_prerequisites(video_id)

    # Prepare output string
    output = []
    output.append(f"Prerequisites for Video {video_id}:")
    output.append("=" * 50)

    if prereqs:
        for prereq in prereqs:
            output.append(f"Title: {prereq['title']}")
            output.append(f"Difficulty: {prereq['difficulty']}")
            output.append(f"Relevance Score: {prereq['relevance']:.2f}")
            output.append("-" * 30)
    else:
        output.append(f"No prerequisites found for Video {video_id}")

    # Save to file
    filename = os.path.join(OUTPUT_DIR, 'queries', f'prerequisites_video_{video_id}.txt')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(output))

    # Also print to console
    print('\n'.join(output))



# Getting Learning Path for a Goal
def get_path_for_goal(goal):
    """
    Get and display learning path for a specific goal
    """
    print(f"\nGenerating learning path for: {goal}")
    print("=" * 80)

    path_info = get_learning_path_from_db(goal)

    # Create output string (all the following code just to dump output to a file)
    output = []
    output.append(f"Learning Path for: {goal}")
    output.append("=" * 80)
    output.append(f"\nTotal Videos: {path_info['total_videos']}")
    output.append(f"Total Duration: {path_info['total_duration_minutes']:.1f} minutes")
    output.append(f"                ({path_info['total_duration_minutes']/60:.1f} hours)")

    output.append("\nDifficulty Breakdown:")
    for level, count in path_info['difficulty_breakdown'].items():
        output.append(f"  {level.title()}: {count} videos")

    output.append("\nTopics Covered:")
    for topic in path_info['topics_covered']:
        output.append(f"  - {topic.replace('_', ' ').title()}")

    output.append("\nDetailed Video Path:")
    output.append("=" * 50)
    for video in path_info['videos']:
        output.append(f"\n{video['step']}. {video['title']}")
        output.append(f"   Difficulty: {video['difficulty']}")
        output.append(f"   Duration: {video['duration_minutes']:.1f} minutes")
        output.append(f"   Topics: {', '.join(t.replace('_', ' ').title() for t in video['topics'])}")
        if video['key_concepts']:
            output.append(f"   Key Concepts: {', '.join(video['key_concepts'])}")

    # Save to file
    filename = os.path.join(OUTPUT_DIR, 'queries', f'learning_path_{goal.replace(" ", "_")}.txt')
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(output))

    # Also print to console
    print('\n'.join(output))

    return path_info




# Querying Videos by Topic and Difficulty
def explore_topics(topic=None, difficulty=None):
    """
    Explore videos by topic and/or difficulty
    """
    videos = query_videos_by_topic(topic=topic, difficulty=difficulty)

    # Prepare output string
    output = []
    output.append(f"\nFound {len(videos)} videos")
    if topic:
        output.append(f" for topic '{topic}'")
    if difficulty:
        output.append(f" with {difficulty} difficulty")
    output.append(":")
    output.append("=" * 50)

    for video in videos:
        output.append(f"\nID: {video['id']}")
        output.append(f"Title: {video['title']}")
        output.append(f"Difficulty: {video['difficulty']}")
        output.append(f"Duration: {video['duration_minutes']:.1f} minutes")

    # Save to file
    filename_parts = []
    if topic:
        filename_parts.append(topic)
    if difficulty:
        filename_parts.append(difficulty)
    filename = os.path.join(OUTPUT_DIR, 'queries', f'topic_query_{"_".join(filename_parts)}.txt')

    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(output))

    # Also print to console
    print('\n'.join(output))




def run_example_queries(G, transformer_model, embeddings):
    """
    Run and save example queries
    """
    create_output_dirs()  # Ensure directories exist

    # Example goals
    example_goals = [
        "Mastering python for Bioimage analysis",
        "Learning machine learning for microscopy",
       # "Understanding computer vision and image processing",
       #"Python programming basics for scientists"
    ]

    # Generate learning paths
    for goal in example_goals:
        print(f"\nProcessing goal: {goal}")
        try:
            path_info = get_path_for_goal(goal)
            if path_info and 'videos' in path_info:
                filename = f'learning_path_{goal.replace(" ", "_")}.html'
                path = [video['id'] for video in path_info['videos']]
                if path:
                    visualize_learning_path(G, path, filename)
                else:
                    print(f"Warning: No valid path found for goal: {goal}")
            else:
                print(f"Warning: Invalid path_info structure for goal: {goal}")
        except Exception as e:
            print(f"Error processing goal '{goal}': {str(e)}")

    # Explore topics
    for topic in ['python_basics', 'machine_learning', 'image_processing']:
        for difficulty in ['beginner', 'intermediate', 'advanced']:
            try:
                explore_topics(topic=topic, difficulty=difficulty)
            except Exception as e:
                print(f"Error exploring topic '{topic}' with difficulty '{difficulty}': {str(e)}")

    # Generate topic visualizations
    for topic in ['python_basics', 'image_processing', 'bio_applications']:
        try:
            visualize_topic_subgraph(G, topic)
        except Exception as e:
            print(f"Error visualizing topic '{topic}': {str(e)}")

    # Check prerequisites for some example videos
    for video_id in [42, 43, 44]:
        try:
            check_prerequisites(video_id)
        except Exception as e:
            print(f"Error checking prerequisites for video {video_id}: {str(e)}")

In [9]:
if __name__ == "__main__":
    csv_path = '/content/04_combined_videos.csv'

    # Create directories first
    create_output_dirs()

    # Initialize or load system
    try:
        print("Checking if system needs initialization...")
        G, transformer_model, embeddings = load_recommendation_system()
        print("System loaded from database successfully!")
    except Exception as e:
        print(f"System not found or error loading: {str(e)}")
        print("Initializing new system...")
        try:
            G, transformer_model, embeddings = initialize_system(csv_path)
        except Exception as e:
            print(f"Error initializing system: {str(e)}")
            raise

    # Run queries
    try:
        run_example_queries(G, transformer_model, embeddings)
        print("\nAll operations completed successfully!")
    except Exception as e:
        print(f"\nError during query execution: {str(e)}")
        raise

Created output directories in /content/drive/MyDrive/recommender_systems/knowledge_graphs
Checking if system needs initialization...
System not found or error loading: Database not found at /content/drive/MyDrive/recommender_systems/knowledge_graphs/database/video_recommender.db
Initializing new system...
Initializing recommendation system...
Created output directories in /content/drive/MyDrive/recommender_systems/knowledge_graphs


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Loading video data...
Processing video content...
Cleaning text...
Creating embeddings...


Batches:   0%|          | 0/14 [00:00<?, ?it/s]

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Total Duration: 8735.8 minutes
                (145.6 hours)

Difficulty Breakdown:
  Beginner: 119 videos
  Intermediate: 13 videos
  Advanced: 306 videos

Topics Covered:
  - Data Structures
  - Machine Learning
  - Computer Vision
  - Bio Applications
  - Python Basics
  - Image Processing
  - Data Analysis

Detailed Video Path:

1. 04 - What is a digital image?
   Difficulty: beginner
   Duration: 10.7 minutes
   Topics: Image Processing, Data Structures, Data Analysis, Python Basics
   Key Concepts: these images, you, a digital image, 04, it

2. 05 - What is Python?
   Difficulty: beginner
   Duration: 8.7 minutes
   Topics: Python Basics
   Key Concepts: you, 05, the video, introduction, a bit

3. 06 - Python basics - IDE & operators
   Difficulty: beginner
   Duration: 19.1 minutes
   Topics: Python Basics
   Key Concepts: different data types, you, various development environments, int float, 06

4. 07 - Python ba