### Task 1: Knowledge Graph Schema Design
Design a knowledge graph schema that connects:
Users and the content they’ve interacted with.
Content and their associated tags.
Tags and their related tags (based on semantic relationships).
Provide a short explanation (1-2 paragraphs or a simple diagram) of how your schema
supports content personalization.
### Answer
Schema for content personalization:
1. Users connect directly to content through an `INTERACTED_WITH` relationship that stores the user's rating.
2. Content links to tags through `HAS_TAG`, which supports content similarity analysis.
3. Tags link to related tags through `RELATED_TO`, which captures semantic similarity between related topics.
4. User properties (e.g. interests, preferences, location) support demographic-based recommendations.
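
As a rough illustration, the schema can also be written down as plain Python data. The labels, property names, and relationship types mirror those used in the loading code in this notebook; the `SCHEMA` dictionary and the `is_valid_edge` helper are hypothetical names introduced only for this sketch.

```python
# Sketch of the schema as plain data (illustrative only, not part of the Neo4j setup).
SCHEMA = {
    "nodes": {
        "User": ["user_id", "age", "location"],
        "Content": ["content_id", "title"],
        "Tag": ["name"],
    },
    # (source label, relationship type, target label) -> relationship properties
    "relationships": {
        ("User", "INTERACTED_WITH", "Content"): ["rating"],
        ("Content", "HAS_TAG", "Tag"): [],
        ("Tag", "RELATED_TO", "Tag"): [],
    },
}


def is_valid_edge(source_label, rel_type, target_label):
    """Return True if the schema allows this relationship between the two labels."""
    return (source_label, rel_type, target_label) in SCHEMA["relationships"]


print(is_valid_edge("User", "INTERACTED_WITH", "Content"))  # True
print(is_valid_edge("User", "HAS_TAG", "Tag"))              # False
```

Keeping the allowed triples in one place makes it easy to see that users reach tags only through content, which is the path the tag-based recommendations rely on.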

<img src="./graph_schema.png" width="800" alt="Content and User Properties Diagram">


### Task 2: Knowledge Graph Construction
Using the provided dataset, create a knowledge graph using any tool or library of your
choice (e.g., NetworkX, Neo4j, or Python dictionaries).
Populate the graph with the relationships defined in your schema.
Include at least one sample query using Python, Cypher or SPARQL (e.g., finding
similar users, recommended content based on tags).

In [2]:
from yfiles_jupyter_graphs_for_neo4j import Neo4jGraphWidget
from neo4j import GraphDatabase

In [3]:


class EduSphereGraphViz:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        self.widget = Neo4jGraphWidget(self.driver)

    def close(self):
        self.driver.close()

    def create_constraints(self):
        """Create uniqueness constraints"""
        with self.driver.session() as session:

            session.run("""
                CREATE CONSTRAINT user_id IF NOT EXISTS
                FOR (u:User) REQUIRE u.user_id IS UNIQUE
            """)
            
            session.run("""
                CREATE CONSTRAINT content_id IF NOT EXISTS
                FOR (c:Content) REQUIRE c.content_id IS UNIQUE
            """)
            
            session.run("""
                CREATE CONSTRAINT tag_name IF NOT EXISTS
                FOR (t:Tag) REQUIRE t.name IS UNIQUE
            """)

    def load_data(self):
        """Load data from CSV files in the Neo4j import directory.

        The same LOAD CSV statements can also be run directly in Neo4j Browser.
        """
        with self.driver.session() as session:
            
            print("Loading users...")
            session.run("""
                LOAD CSV WITH HEADERS FROM 'file:///Users.csv' AS row
                WITH row WHERE row.user_id IS NOT NULL
                MERGE (u:User {user_id: toInteger(row.user_id)})
                SET u.age = toInteger(row.age),
                    u.location = row.location
            """)

            # Load Content
            print("Loading content...")
            session.run("""
                LOAD CSV WITH HEADERS FROM 'file:///Content.csv' AS row
                WITH row WHERE row.content_id IS NOT NULL
                MERGE (c:Content {content_id: toInteger(row.content_id)})
                SET c.title = row.title
                WITH c, row
                UNWIND split(row.tags, ',') AS tag
                MERGE (t:Tag {name: trim(tag)})
                MERGE (c)-[:HAS_TAG]->(t)
            """)

            # Load Tags
            print("Loading tags...")
            session.run("""
                LOAD CSV WITH HEADERS FROM 'file:///Tags.csv' AS row
                WITH row WHERE row.tag IS NOT NULL
                MERGE (t1:Tag {name: row.tag})
                WITH t1, row
                UNWIND split(row.related_tags, ',') AS related
                MERGE (t2:Tag {name: trim(related)})
                MERGE (t1)-[:RELATED_TO]->(t2)
            """)

            # Load Interactions
            print("Loading interactions...")
            session.run("""
                LOAD CSV WITH HEADERS FROM 'file:///Interactions.csv' AS row
                WITH row WHERE row.user_id IS NOT NULL
                MATCH (u:User {user_id: toInteger(row.user_id)})
                MATCH (c:Content {content_id: toInteger(row.content_id)})
                MERGE (u)-[r:INTERACTED_WITH]->(c)
                SET r.rating = toInteger(row.rating)
            """)
            print("Data loading completed!")

    def visualize_full_graph(self, limit=50):
        """Visualize up to `limit` relationships together with their endpoint nodes."""
        query = """
        MATCH (n)-[r]->(m)
        WITH n, r, m
        LIMIT toInteger($limit)
        RETURN n, r, m
        """
        params = {"limit": limit}
        self.widget.show_cypher(query, parameters=params)

    def visualize_user_interactions(self, user_id):
        
        query = """
        MATCH (u:User {user_id: toInteger($user_id)})-[r:INTERACTED_WITH]->(c:Content)-[:HAS_TAG]->(t:Tag)
        RETURN u, r, c, t
        """
        params = {"user_id": user_id}
        self.widget.show_cypher(query, parameters=params)

    def visualize_tag_network(self):
        
        query = """
        MATCH (t1:Tag)-[r:RELATED_TO]->(t2:Tag)
        RETURN t1, r, t2
        """
        self.widget.show_cypher(query)

    def visualize_content_tags(self, content_id):
        """Visualize content and its associated tags"""
        query = """
        MATCH (c:Content {content_id: toInteger($content_id)})-[r:HAS_TAG]->(t:Tag)
        RETURN c, r, t
        """
        params = {"content_id": content_id}
        self.widget.show_cypher(query, parameters=params)

In [4]:
URI = "neo4j://localhost:7687"  # replace with your Neo4j URI
USER = "neo4j"                  # uses the default database; if you change it, make sure the configuration matches
PASSWORD = "123456789"          # replace with your password

In [5]:
graph = EduSphereGraphViz(URI, USER, PASSWORD)

In [6]:
# Create constraints and load data
graph.create_constraints()
graph.load_data()


Loading users...
Loading content...
Loading tags...
Loading interactions...
Data loading completed!
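
For readers without a Neo4j instance, the tag-splitting step of the Content load (`split(row.tags, ',')` followed by `trim`) can be mimicked in plain Python. This is only a sketch: the CSV rows are made up for illustration, only the column names `content_id`, `title`, and `tags` come from the Cypher above, and `load_content_tags` is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample in the shape the LOAD CSV statement expects.
SAMPLE_CONTENT_CSV = '''content_id,title,tags
1,Example Course A,"python, data"
2,Example Course B,"data,sql "
'''


def load_content_tags(csv_text):
    """Map each content_id to its set of trimmed tag names."""
    content_tags = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["content_id"]:
            continue  # roughly mirrors WHERE row.content_id IS NOT NULL
        # Unlike the Cypher above, empty tag strings are skipped here.
        tags = {tag.strip() for tag in row["tags"].split(",") if tag.strip()}
        content_tags[int(row["content_id"])] = tags
    return content_tags


content_tags = load_content_tags(SAMPLE_CONTENT_CSV)
print(content_tags[1] == {"python", "data"})  # True
print(content_tags[2] == {"data", "sql"})     # True
```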


In [7]:
print("Visualizing full graph (limited to 50 relationships)...")
graph.visualize_full_graph(limit=50)

Visualizing full graph (limited to 50 relationships)...


GraphWidget(layout=Layout(height='670px', width='100%'))

In [8]:
print("\nVisualizing user 1's interactions...")
graph.visualize_user_interactions(user_id=1)


Visualizing user 1's interactions...


GraphWidget(layout=Layout(height='500px', width='100%'))

In [9]:
print("\nVisualizing tag network...")
graph.visualize_tag_network()


Visualizing tag network...


GraphWidget(layout=Layout(height='500px', width='100%'))

In [10]:
print("\nVisualizing content ID 103 and its tags...")
graph.visualize_content_tags(content_id=103)


Visualizing content ID 103 and its tags...


GraphWidget(layout=Layout(height='500px', width='100%'))

### Finding similar users

1. Find similar users by counting the content items they have in common with the target user.
2. Show a visualization of the user network.




In [12]:
driver = GraphDatabase.driver(URI, auth=(USER, PASSWORD))
widget = Neo4jGraphWidget(driver)

def find_similar_users(user_id, limit=5):
    """Return other users ranked by how many content items they share with user_id."""
    with driver.session() as session:
        
        result = session.run("""
            // Find the target user and other users who watched same content
            MATCH (u1:User {user_id: $user_id})-[:INTERACTED_WITH]->(c:Content)
            MATCH (u2:User)-[:INTERACTED_WITH]->(c)
            WHERE u1 <> u2
            
            // Count how many same contents they watched
            WITH u2, count(c) as common_content
            
            // Return results
            RETURN 
                u2.user_id as similar_user_id,
                common_content
            ORDER BY common_content DESC
            LIMIT $limit
        """, user_id=user_id, limit=limit)
        
        return list(result)
    
def show_user_network(user_id):
    
    query = """
    MATCH (u1:User {user_id: $user_id})-[:INTERACTED_WITH]->(c:Content)
    MATCH (u2:User)-[:INTERACTED_WITH]->(c)
    WHERE u1 <> u2
    RETURN u1, u2, c
    LIMIT 10
    """
    widget.show_cypher(query, parameters={"user_id": user_id})
    
test_user = 1
print(f"\nFinding similar users for user {test_user}:")
similar_users = find_similar_users(test_user)
for user in similar_users:
    print(f"User {user['similar_user_id']} - {user['common_content']} common content")

    
print("\nShowing user network visualization...")
show_user_network(test_user)    


Finding similar users for user 1:
User 2 - 1 common content

Showing user network visualization...


GraphWidget(layout=Layout(height='500px', width='100%'))
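
The overlap count can be cross-checked without a database. The sketch below uses the five interactions that this notebook prints in its ratings DataFrame; `similar_users` is a hypothetical pure-Python stand-in for the Cypher query, not the notebook's own function.

```python
from collections import Counter

# (user_id, content_id, rating) rows as printed in the ratings DataFrame in this notebook.
INTERACTIONS = [(1, 101, 5), (1, 102, 4), (2, 102, 3), (2, 103, 5), (3, 104, 4)]


def similar_users(user_id, interactions, limit=5):
    """Rank other users by the number of content items shared with user_id."""
    seen = {}
    for uid, cid, _rating in interactions:
        seen.setdefault(uid, set()).add(cid)
    target = seen.get(user_id, set())
    overlap = Counter()
    for other, items in seen.items():
        if other == user_id:
            continue
        shared = len(target & items)
        if shared:
            overlap[other] = shared
    return overlap.most_common(limit)


print(similar_users(1, INTERACTIONS))  # [(2, 1)]
```

User 2 shares content 102 with user 1, which agrees with the single result returned by the query.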

### Task 3: Personalization Query
Use the knowledge graph to implement a basic recommendation system:

Recommend the top 3 pieces of content for a given user based on:

Content the user has interacted with and their associated tags.
Tags related to the user's past interactions.
Provide code and a brief explanation of your approach and how you would evaluate
this system.

### Answer: 

Basic recommendation logic:

1. Collect the tags of the content the user has interacted with.

2. Recommend content the user has not yet interacted with that shares those tags.

3. Use the number of tag matches as the relevance score.
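
The three steps can be sketched in plain Python. The content IDs and tags below are hypothetical (the real tags live in Content.csv), and `recommend_by_tags` is an illustrative helper. It counts a tag once per seen item that carries it, which is how `COUNT(t)` behaves in the Cypher query in this notebook.

```python
from collections import Counter

# Hypothetical content-to-tag mapping, made up for this sketch.
CONTENT_TAGS = {
    "A": {"python", "basics"},
    "B": {"python", "ml"},
    "C": {"ml", "statistics"},
    "D": {"sql"},
}
USER_HISTORY = {"A", "B"}  # content the user already interacted with


def recommend_by_tags(history, content_tags, limit=3):
    """Score unseen content by its number of (seen item, shared tag) pairs."""
    scores = Counter()
    for seen in history:
        for tag in content_tags[seen]:
            for candidate, tags in content_tags.items():
                if candidate not in history and tag in tags:
                    scores[candidate] += 1
    return scores.most_common(limit)


print(recommend_by_tags(USER_HISTORY, CONTENT_TAGS))  # [('C', 1)]
```

Content with no shared tag ("D" here) never appears in the result, so a user can receive fewer than `limit` recommendations.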

Evaluation:

1. Check whether the recommendations match the user's actual highly rated content (rating ≥ 4).

2. Calculate precision and recall to measure the accuracy of the recommendations.

3. Show the number of matches and the total counts, and compare the results across users.

The evaluation prints the number of recommendations, the number of highly rated items, the number of matches, precision, and recall.
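
A minimal sketch of the metric computation, including F1, is given here as a standalone helper. One caveat worth stating: the recommender excludes content the user has already interacted with, so comparing its output with rated items is only informative if some highly rated interactions are held out before recommendations are generated. The hold-out step is an assumption of this sketch, and the content IDs in the example are hypothetical.

```python
def precision_recall_f1(recommended, relevant):
    """Return (precision, recall, f1) for two collections of content IDs."""
    recommended, relevant = set(recommended), set(relevant)
    hits = len(recommended & relevant)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 3 recommendations, 2 held-out liked items, 1 hit.
p, r, f1 = precision_recall_f1([103, 104, 105], [103, 106])
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.33 0.5 0.4
```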


In [24]:
driver = GraphDatabase.driver(URI, auth=(USER, PASSWORD))

def get_recommendations(user_id, limit=3):
    """Recommend unseen content ranked by the number of tags shared with the user's history."""
    with driver.session() as session:
        
        result = session.run("""
            MATCH (u:User {user_id: $user_id})-[:INTERACTED_WITH]->(c:Content)-[:HAS_TAG]->(t:Tag)
            MATCH (new_content:Content)-[:HAS_TAG]->(t)
            WHERE NOT (u)-[:INTERACTED_WITH]->(new_content)
            WITH new_content, COUNT(t) as tag_matches
            RETURN 
                new_content.content_id as content_id,
                new_content.title as title,
                tag_matches as relevance_score
            ORDER BY relevance_score DESC
            LIMIT $limit
        """, user_id=user_id, limit=limit)
        
        return list(result)

def evaluate_simple(user_id):
    """Compare the top-3 recommendations with the user's highly rated content (rating >= 4)."""
    with driver.session() as session:
        
        actual = session.run("""
            MATCH (u:User {user_id: $user_id})-[r:INTERACTED_WITH]->(c:Content)
            WHERE r.rating >= 4
            RETURN c.content_id as content_id
        """, user_id=user_id)
        actual_content = set(record["content_id"] for record in actual)
        
        
        recommended = get_recommendations(user_id, limit=3)
        recommended_content = set(record["content_id"] for record in recommended)
        
        
        if len(actual_content) == 0:
            print(f"User {user_id} has no highly rated content for comparison")
            return None
            
        
        matches = len(actual_content.intersection(recommended_content))
        
        precision = matches / len(recommended_content) if recommended_content else 0
        recall = matches / len(actual_content) if actual_content else 0
        
        return {
            "user_id": user_id,
            "recommended_count": len(recommended_content),
            "actual_highly_rated": len(actual_content),
            "matching_recommendations": matches,
            "precision": precision,
            "recall": recall
        }

def print_recommendations(user_id):
    
    print(f"\nGetting recommendations for user {user_id}:")
    print("-" * 10)
    
    recommendations = get_recommendations(user_id)
    
    for i, rec in enumerate(recommendations, 1):
        print(f"\nRecommendation {i}:")
        print(f"Title: {rec['title']}")
        print(f"Content ID: {rec['content_id']}")
        print(f"Relevance Score: {rec['relevance_score']}")

def print_evaluation(user_id):
    
    print(f"\nEvaluating recommendations for user {user_id}:")
    print("-" * 10)
    
    results = evaluate_simple(user_id)
    if results:
        print(f"\nResults:")
        print(f"Total Recommendations: {results['recommended_count']}")
        print(f"User's Highly Rated Content: {results['actual_highly_rated']}")
        print(f"Matching Recommendations: {results['matching_recommendations']}")
        print(f"Precision: {results['precision']:.2f}")
        print(f"Recall: {results['recall']:.2f}")
    else:
        print("Could not evaluate - user has no highly rated content")



 
user_id = 1
print_recommendations(user_id)
    
print_evaluation(user_id)
test_users = [2, 3]
for test_user in test_users:
    print_evaluation(test_user)



Getting recommendations for user 1:
----------

Recommendation 1:
Title: Machine Learning 101
Content ID: 103
Relevance Score: 1

Evaluating recommendations for user 1:
----------

Results:
Total Recommendations: 1
User's Highly Rated Content: 2
Matching Recommendations: 0
Precision: 0.00
Recall: 0.00

Evaluating recommendations for user 2:
----------

Results:
Total Recommendations: 0
User's Highly Rated Content: 1
Matching Recommendations: 0
Precision: 0.00
Recall: 0.00

Evaluating recommendations for user 3:
----------

Results:
Total Recommendations: 0
User's Highly Rated Content: 1
Matching Recommendations: 0
Precision: 0.00
Recall: 0.00
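
The task statement also asks for tags related to the user's past interactions, which the query above does not traverse. One possible extension, sketched here under assumptions, gives full credit for a direct tag match and partial credit for a tag one `RELATED_TO` hop away. All data and the 0.5 weight are hypothetical, and `recommend_with_related` is an illustrative helper. Unlike the Cypher query, it counts each distinct tag once.

```python
# Hypothetical data: tag names and content IDs are made up for illustration.
CONTENT_TAGS = {"A": {"python"}, "B": {"ml"}, "C": {"sql"}}
RELATED_TAGS = {"python": {"ml"}}  # python -[:RELATED_TO]-> ml
RELATED_WEIGHT = 0.5  # assumed weight for a one-hop related-tag match


def recommend_with_related(history, content_tags, related_tags, limit=3):
    """Score unseen content: 1 per direct tag match, RELATED_WEIGHT per related-tag match."""
    user_tags = set().union(*(content_tags[c] for c in history)) if history else set()
    expanded = set()
    for tag in user_tags:
        expanded |= related_tags.get(tag, set())
    expanded -= user_tags  # do not double-count tags the user already has
    scores = {}
    for candidate, tags in content_tags.items():
        if candidate in history:
            continue
        score = len(tags & user_tags) + RELATED_WEIGHT * len(tags & expanded)
        if score > 0:
            scores[candidate] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]


print(recommend_with_related({"A"}, CONTENT_TAGS, RELATED_TAGS))  # [('B', 0.5)]
```

In this toy data, a user who has only seen "A" shares no tag with any other content, so the direct-match score alone would return nothing; the related-tag hop surfaces "B".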


### Task 4: Machine Learning Component (optional but encouraged if time permits)

Propose or implement a simple machine learning model to predict user preferences or ratings for new content. For example:

Use collaborative filtering or a regression model to predict ratings.
Explain how this model could complement the knowledge graph-based
approach.

 
### Answer 

### Simple user-based collaborative filtering approach

1. Read all ratings from the Neo4j data.

2. Build a user-item rating matrix.

3. Calculate cosine similarity between users to find similar users.

4. Predict ratings for unseen content based on the preferences of similar users.

5. Recommend the content with the highest predicted ratings.
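
To make the similarity step concrete, here is a small worked example using only the standard library. The vectors are built from the five ratings printed in this notebook, with 0 standing for "not rated"; `cosine` is a hypothetical helper that computes the same quantity as scikit-learn's `cosine_similarity` for a single pair of vectors.

```python
import math

# Rating vectors over items [101, 102, 103, 104], built from the five ratings
# printed in this notebook; 0 means "not rated".
RATINGS = {
    1: [5, 4, 0, 0],
    2: [0, 3, 5, 0],
    3: [0, 0, 0, 4],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(round(cosine(RATINGS[1], RATINGS[2]), 3))  # 0.321
print(cosine(RATINGS[1], RATINGS[3]))            # 0.0
```

Users 1 and 2 overlap only on content 102, giving 12 / (√41 · √34) ≈ 0.321, while users 1 and 3 share no rated item and have similarity 0.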

### How this complements the knowledge graph approach:

1. The graph approach uses content relationships (tags).

2. The machine learning approach captures user behavior patterns.

3. Combining both approaches can give more accurate and diverse recommendations.
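
One simple way to combine the two signals is a weighted blend of normalized scores. The sketch below is an assumption about how that could look, not something implemented in this notebook: `hybrid_scores`, the `alpha` weight, and the normalization choices are all hypothetical.

```python
def hybrid_scores(tag_scores, predicted_ratings, alpha=0.5, max_rating=5.0):
    """Blend normalized tag-match scores with normalized predicted ratings.

    alpha weights the graph signal; (1 - alpha) weights the collaborative signal.
    """
    max_tag = max(tag_scores.values(), default=0) or 1  # avoid division by zero
    combined = {}
    for content_id in set(tag_scores) | set(predicted_ratings):
        graph_part = tag_scores.get(content_id, 0) / max_tag
        cf_part = predicted_ratings.get(content_id, 0.0) / max_rating
        combined[content_id] = alpha * graph_part + (1 - alpha) * cf_part
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Inputs shaped like the two recommenders' outputs for user 1 in this notebook.
ranked = hybrid_scores({103: 1}, {103: 5.0, 104: 4.0})
print(ranked)  # [(103, 1.0), (104, 0.4)]
```

Content 103 is supported by both signals and ranks first, while 104 is supported only by the collaborative signal and receives a lower blended score.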



In [57]:
from neo4j import GraphDatabase
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

driver = GraphDatabase.driver(URI, auth=(USER, PASSWORD))

def get_user_ratings():
    
    with driver.session() as session:
        
        result = session.run("""
            MATCH (u:User)-[r:INTERACTED_WITH]->(c:Content)
            RETURN u.user_id as user_id, 
                   c.content_id as content_id,
                   r.rating as rating
        """)
        
       
        ratings = [
            {
                'user_id': record['user_id'],
                'content_id': record['content_id'],
                'rating': record['rating']
            }
            for record in result
        ]
        
        print(f"Found {len(ratings)} ratings")
        if ratings:
            print("Sample rating:", ratings[0])
        return ratings

def create_rating_matrix():
  
    ratings = get_user_ratings()
    df = pd.DataFrame(ratings)
    
    print("\nDataFrame head:")
    print(df.head())
    print("\nDataFrame columns:", df.columns.tolist())
    
    if len(df) == 0:
        print("Warning: No ratings found in database!")
        return None
    users = sorted(df['user_id'].unique())
    items = sorted(df['content_id'].unique())
    
    print(f"\nUnique users: {users}")
    print(f"Unique items: {items}")
    
  
    rating_matrix = pd.DataFrame(
        0, 
        index=users,
        columns=items
    )
    
    for _, row in df.iterrows():
        rating_matrix.loc[row['user_id'], row['content_id']] = row['rating']
    
    print("\nRating matrix shape:", rating_matrix.shape)
    print("Rating matrix sample:")
    print(rating_matrix.iloc[:3, :3])
    
    return rating_matrix

def get_ml_recommendations(user_id, n_recommendations=3):
    print(f"\nGetting recommendations for user {user_id}")
    
    rating_matrix = create_rating_matrix()
    
    if rating_matrix is None:
        print("No rating matrix available")
        return []
    
    if user_id not in rating_matrix.index:
        print(f"User {user_id} not found in rating matrix")
        return []
    
    # Calculate user similarity
    user_similarity = cosine_similarity(rating_matrix)
    
    user_similarity_df = pd.DataFrame(
        user_similarity,
        index=rating_matrix.index,
        columns=rating_matrix.index
    )
    
    # Exclude the target user explicitly, then keep the three most similar users
    similar_users = user_similarity_df.loc[user_id].drop(user_id).sort_values(ascending=False).head(3)
    print("\nSimilar users:", similar_users.index.tolist())
    
    # Find unwatched content
    user_ratings = rating_matrix.loc[user_id]
    unwatched = user_ratings[user_ratings == 0].index
    
    # Get predictions
    predictions = []
    for content_id in unwatched:
        similar_ratings = []
        for sim_user in similar_users.index:
            rating = rating_matrix.loc[sim_user, content_id]
            if rating > 0:
                similar_ratings.append(rating)
        
        if similar_ratings:
            pred_rating = sum(similar_ratings) / len(similar_ratings)
            predictions.append({
                'content_id': content_id,
                'predicted_rating': pred_rating
            })
    
    predictions.sort(key=lambda x: x['predicted_rating'], reverse=True)
    top_predictions = predictions[:n_recommendations]
    
    
    with driver.session() as session:
        for pred in top_predictions:
            result = session.run("""
                MATCH (c:Content {content_id: $content_id})
                RETURN c.title as title
            """, content_id=pred['content_id'])
            title_record = result.single()
            pred['title'] = title_record['title'] if title_record else f"Content {pred['content_id']}"
    
    return top_predictions


print("Starting recommendation test...")
with driver.session() as session:
        
    users = session.run("MATCH (u:User) RETURN count(u) as count").single()['count']
    print(f"\nNumber of users in database: {users}")
        
    
    content = session.run("MATCH (c:Content) RETURN count(c) as count").single()['count']
    print(f"Number of content items: {content}")
        
        
    interactions = session.run("MATCH ()-[r:INTERACTED_WITH]->() RETURN count(r) as count").single()['count']
    print(f"Number of interactions: {interactions}")
    
    
    user_id = 1
    recommendations = get_ml_recommendations(user_id)
    
    if recommendations:
        print("\nTop Recommendations:")
        for i, rec in enumerate(recommendations, 1):
            print(f"\nRecommendation {i}:")
            print(f"Title: {rec['title']}")
            print(f"Content ID: {rec['content_id']}")
            print(f"Predicted Rating: {rec['predicted_rating']:.2f}")
    else:
        print("\nNo recommendations found")

Starting recommendation test...

Number of users in database: 3
Number of content items: 5
Number of interactions: 5

Getting recommendations for user 1
Found 5 ratings
Sample rating: {'user_id': 1, 'content_id': 101, 'rating': 5}

DataFrame head:
   user_id  content_id  rating
0        1         101       5
1        1         102       4
2        2         102       3
3        2         103       5
4        3         104       4

DataFrame columns: ['user_id', 'content_id', 'rating']

Unique users: [1, 2, 3]
Unique items: [101, 102, 103, 104]

Rating matrix shape: (3, 4)
Rating matrix sample:
   101  102  103
1    5    4    0
2    0    3    5
3    0    0    0

Similar users: [2, 3]

Top Recommendations:

Recommendation 1:
Title: Machine Learning 101
Content ID: 103
Predicted Rating: 5.00

Recommendation 2:
Title: Advanced SQL
Content ID: 104
Predicted Rating: 4.00
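
The prediction above is an unweighted mean over the nearest users, whatever their similarity. With this data, user 3 has cosine similarity 0 to user 1 yet supplies the 4.00 prediction for content 104. A similarity-weighted mean is a common variant that avoids this. The sketch below is an assumption about how it could be done, with `weighted_prediction` as a hypothetical helper.

```python
def weighted_prediction(neighbors):
    """Similarity-weighted mean rating.

    neighbors: iterable of (similarity, rating) pairs. Pairs with non-positive
    similarity or a rating of 0 (unrated) are ignored. Returns None if nothing is left.
    """
    usable = [(sim, rating) for sim, rating in neighbors if sim > 0 and rating > 0]
    total_weight = sum(sim for sim, _ in usable)
    if total_weight == 0:
        return None
    return sum(sim * rating for sim, rating in usable) / total_weight


# Similarities to user 1 from the rating matrix: user 2 is about 0.321, user 3 is 0.0.
pred_103 = weighted_prediction([(0.321, 5), (0.0, 0)])
pred_104 = weighted_prediction([(0.321, 0), (0.0, 4)])
print(round(pred_103, 2))  # 5.0
print(pred_104)            # None
```

Under this variant, content 104 would receive no prediction for user 1, because its only rater has zero similarity to that user.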
