## Introduction

This notebook builds a social media knowledge graph in Neo4j. We define users, posts, topics, and platforms as nodes, then connect them with `FOLLOWS`, `CREATED`, `TAGGED_WITH`, `POSTED_ON`, and `ENGAGED_WITH` relationships.

In [1]:
# importing our libraries

from neo4j import GraphDatabase
from datetime import datetime, timedelta
import random

In [2]:
# Connect to a local Neo4j instance (replace these credentials with your own)
uri = "bolt://localhost:7687"
username = "neo4j"
password = "12345678"

driver = GraphDatabase.driver(uri, auth=(username, password))
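
The connection cell above hard-codes the password. A minimal sketch of one alternative, assuming the settings are supplied through environment variables: the names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` and the helper `load_neo4j_settings` are illustrative choices, not something the notebook defines.

```python
import os

def load_neo4j_settings(env=os.environ):
    """Read connection settings from a mapping (the process environment by default).

    The variable names are assumptions made for this sketch.
    """
    return {
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        # No default password, so a missing secret is visible instead of silently wrong.
        "password": env.get("NEO4J_PASSWORD"),
    }

settings = load_neo4j_settings()
if settings["password"] is None:
    print("NEO4J_PASSWORD is not set; export it before creating the driver.")
```

The resulting values can then be passed to `GraphDatabase.driver(settings["uri"], auth=(settings["username"], settings["password"]))` in place of the literals.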

### 🧾 Data Initialization

Here we define sample data for users, posts, topics, and platforms. Each entity carries properties such as an ID, timestamps, and engagement counts. Each record is then written to Neo4j with `MERGE` on its ID, so re-running a cell updates the existing node instead of creating a duplicate.
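
Before inserting records like these, it can help to check that IDs are unique and that date strings parse, since a malformed value would otherwise only surface as a Cypher error inside the transaction. The following is a sketch under that assumption; `validate_records` is a hypothetical helper, and the sample records are copied from the lists below.

```python
from datetime import date, datetime

def validate_records(records, id_key, date_key=None, datetime_key=None):
    """Return a list of problems found; an empty list means the records look sane."""
    problems = []
    seen = set()
    for record in records:
        record_id = record[id_key]
        if record_id in seen:
            problems.append(f"duplicate {id_key}: {record_id}")
        seen.add(record_id)
        try:
            if date_key:
                date.fromisoformat(record[date_key])          # e.g. "2020-01-15"
            if datetime_key:
                datetime.fromisoformat(record[datetime_key])  # e.g. "2023-01-05T09:30:00"
        except ValueError as exc:
            problems.append(f"{record_id}: {exc}")
    return problems

sample_users = [
    {"user_id": "U1", "username": "social_star", "join_date": "2020-01-15"},
    {"user_id": "U2", "username": "tech_guru", "join_date": "2019-05-22"},
]
sample_posts = [{"post_id": "P1", "timestamp": "2023-01-05T09:30:00"}]

print(validate_records(sample_users, "user_id", date_key="join_date"))
print(validate_records(sample_posts, "post_id", datetime_key="timestamp"))
```

Both calls print an empty list for these records; a repeated `user_id` or an invalid date would be reported as a string in the returned list.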

#### 1. Create Users (10 Users)

In [3]:
# === User Data ===
users = [
    {"user_id": "U1", "username": "social_star", "join_date": "2020-01-15", "follower_count": 12500, "verified": True},
    {"user_id": "U2", "username": "tech_guru", "join_date": "2019-05-22", "follower_count": 8700, "verified": False},
    {"user_id": "U3", "username": "fitness_fan", "join_date": "2021-03-10", "follower_count": 4300, "verified": False},
    {"user_id": "U4", "username": "data_dude", "join_date": "2018-11-18", "follower_count": 18000, "verified": True},
    {"user_id": "U5", "username": "gadget_queen", "join_date": "2020-09-05", "follower_count": 3000, "verified": False},
    {"user_id": "U6", "username": "health_blogger", "join_date": "2021-02-14", "follower_count": 2100, "verified": False},
    {"user_id": "U7", "username": "eco_activist", "join_date": "2019-08-22", "follower_count": 10000, "verified": True},
    {"user_id": "U8", "username": "fashionista", "join_date": "2020-04-28", "follower_count": 8000, "verified": False},
    {"user_id": "U9", "username": "travel_enthusiast", "join_date": "2022-06-12", "follower_count": 5400, "verified": False},
    {"user_id": "U10", "username": "bookworm", "join_date": "2021-12-07", "follower_count": 6700, "verified": True},
]

# === Create User Nodes ===
def create_users(tx, users):
    for user in users:
        # MERGE on user_id keeps the cell idempotent; SET refreshes the properties.
        tx.run("""
            MERGE (u:User {user_id: $user_id})
            SET u.username = $username,
                u.join_date = date($join_date),
                u.follower_count = $follower_count,
                u.verified = $verified
        """, **user)

# === Run it ===
with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in driver 5.x
    session.execute_write(create_users, users)

print("✅ 10 users created in Neo4j.")


✅ 10 users created in Neo4j.


#### 2. Create Posts (20 Posts)

In [4]:
# --- Post Data --- #
posts = [
    {"post_id": "P1", "timestamp": "2023-01-05T09:30:00", "content_type": "image", "like_count": 453, "share_count": 120},
    {"post_id": "P2", "timestamp": "2023-01-06T14:22:00", "content_type": "text", "like_count": 127, "share_count": 45},
    {"post_id": "P3", "timestamp": "2023-01-07T18:15:00", "content_type": "video", "like_count": 2145, "share_count": 610},
    {"post_id": "P4", "timestamp": "2023-01-08T11:30:00", "content_type": "text", "like_count": 987, "share_count": 312},
    {"post_id": "P5", "timestamp": "2023-01-09T13:00:00", "content_type": "image", "like_count": 203, "share_count": 70},
    {"post_id": "P6", "timestamp": "2023-01-10T15:30:00", "content_type": "video", "like_count": 1145, "share_count": 350},
    {"post_id": "P7", "timestamp": "2023-01-11T17:00:00", "content_type": "text", "like_count": 55, "share_count": 20},
    {"post_id": "P8", "timestamp": "2023-01-12T12:45:00", "content_type": "image", "like_count": 322, "share_count": 98},
    {"post_id": "P9", "timestamp": "2023-01-13T10:30:00", "content_type": "video", "like_count": 783, "share_count": 210},
    {"post_id": "P10", "timestamp": "2023-01-14T14:00:00", "content_type": "text", "like_count": 150, "share_count": 55},
    {"post_id": "P11", "timestamp": "2023-01-15T16:00:00", "content_type": "video", "like_count": 952, "share_count": 275},
    {"post_id": "P12", "timestamp": "2023-01-16T19:15:00", "content_type": "image", "like_count": 647, "share_count": 180},
    {"post_id": "P13", "timestamp": "2023-01-17T10:30:00", "content_type": "text", "like_count": 302, "share_count": 85},
    {"post_id": "P14", "timestamp": "2023-01-18T11:45:00", "content_type": "video", "like_count": 1280, "share_count": 420},
    {"post_id": "P15", "timestamp": "2023-01-19T13:00:00", "content_type": "image", "like_count": 200, "share_count": 65},
    {"post_id": "P16", "timestamp": "2023-01-20T14:15:00", "content_type": "text", "like_count": 75, "share_count": 28},
    {"post_id": "P17", "timestamp": "2023-01-21T15:30:00", "content_type": "video", "like_count": 56, "share_count": 17},
    {"post_id": "P18", "timestamp": "2023-01-22T17:00:00", "content_type": "image", "like_count": 300, "share_count": 92},
    {"post_id": "P19", "timestamp": "2023-01-23T18:15:00", "content_type": "text", "like_count": 190, "share_count": 60},
    {"post_id": "P20", "timestamp": "2023-01-24T19:00:00", "content_type": "video", "like_count": 245, "share_count": 88},
]

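# --- Function to Insert Posts --- #
# Note: share_count is present in the data above but is not written here,
# so p.share_count is null on every Post node.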
def create_posts(tx, posts):
    for post in posts:
        tx.run("""
            MERGE (p:Post {post_id: $post_id})
            SET p.timestamp = datetime($timestamp),
                p.content_type = $content_type,
                p.like_count = $like_count
        """, **post)

# --- Execute --- #
with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in driver 5.x
    session.execute_write(create_posts, posts)

print("✅ 20 posts created in Neo4j.")

✅ 20 posts created in Neo4j.


#### 3. Create Topics (8 Topics)

In [5]:
# --- Topic Data --- #
topics = [
    {"topic_id": "T1", "name": "Data Science", "popularity_score": 85},
    {"topic_id": "T2", "name": "Tech News", "popularity_score": 75},
    {"topic_id": "T3", "name": "Fitness", "popularity_score": 65},
    {"topic_id": "T4", "name": "Travel", "popularity_score": 80},
    {"topic_id": "T5", "name": "Health", "popularity_score": 70},
    {"topic_id": "T6", "name": "Fashion", "popularity_score": 60},
    {"topic_id": "T7", "name": "Food", "popularity_score": 72},
    {"topic_id": "T8", "name": "Education", "popularity_score": 68}
]

# --- Function to Insert Topic Nodes --- #
def create_topics(tx, topics):
    for topic in topics:
        tx.run("""
            MERGE (t:Topic {topic_id: $topic_id})
            SET t.name = $name,
                t.popularity_score = $popularity_score
        """, **topic)

# --- Run It --- #
with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in driver 5.x
    session.execute_write(create_topics, topics)

print("✅ 8 topics created in Neo4j.")


✅ 8 topics created in Neo4j.


#### 4. Create Platforms (3 Platforms)

In [6]:
# --- Platform Data --- #
platforms = [
    {"platform_id": "Pl1", "name": "Twitter", "monthly_active_users": 330_000_000},
    {"platform_id": "Pl2", "name": "Instagram", "monthly_active_users": 1_500_000_000},
    {"platform_id": "Pl3", "name": "YouTube", "monthly_active_users": 2_200_000_000}
]

# --- Function to Create Platforms --- #
def create_platforms(tx, platforms):
    for platform in platforms:
        tx.run("""
            MERGE (p:Platform {platform_id: $platform_id})
            SET p.name = $name,
                p.monthly_active_users = $monthly_active_users
        """, **platform)

# --- Run it --- #
with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in driver 5.x
    session.execute_write(create_platforms, platforms)

print("✅ 3 platforms created in Neo4j.")


✅ 3 platforms created in Neo4j.


#### 5. 🔗 Define and Create Relationships (50+ Relationships)

This section creates the `FOLLOWS`, `CREATED`, `TAGGED_WITH`, `POSTED_ON`, and `ENGAGED_WITH` relationships that describe how the entities interact. The first four come from fixed lists; `ENGAGED_WITH` is generated randomly from a user-to-topic mapping, so its contents differ between runs.
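
In Cypher, a `MATCH` that finds no node produces no rows, so the `MERGE` that follows it is skipped without an error. A mistyped ID in a relationship tuple would therefore go unnoticed. The sketch below checks the tuples against the known IDs before they reach the database; `find_dangling` is a hypothetical helper and the sample tuples are a small excerpt of the data in this section.

```python
def find_dangling(relations, source_ids, target_ids):
    """Return relation tuples whose first or second element is an unknown ID."""
    return [
        relation for relation in relations
        if relation[0] not in source_ids or relation[1] not in target_ids
    ]

user_ids = {f"U{i}" for i in range(1, 11)}   # U1..U10
post_ids = {f"P{i}" for i in range(1, 21)}   # P1..P20

created_sample = [("U1", "P1", "2023-01-05"), ("U2", "P2", "2023-01-06")]
bad_sample = created_sample + [("U11", "P3", "2023-01-07")]  # U11 does not exist

print(find_dangling(created_sample, user_ids, post_ids))
print(find_dangling(bad_sample, user_ids, post_ids))
```

The first call returns an empty list; the second returns the tuple that references `U11`.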

In [7]:
# === Sample relationship data ===
created_relations = [
    ("U1", "P1", "2023-01-05"), ("U2", "P2", "2023-01-06"), ("U3", "P3", "2023-01-07"),
    ("U4", "P4", "2023-01-08"), ("U5", "P5", "2023-01-09"), ("U6", "P6", "2023-01-10"),
    ("U7", "P7", "2023-01-11"), ("U8", "P8", "2023-01-12"), ("U9", "P9", "2023-01-13"),
    ("U10", "P10", "2023-01-14"), ("U1", "P11", "2023-01-15"), ("U2", "P12", "2023-01-16"),
    ("U3", "P13", "2023-01-17"), ("U4", "P14", "2023-01-18"), ("U5", "P15", "2023-01-19"),
    ("U6", "P16", "2023-01-20"), ("U7", "P17", "2023-01-21"), ("U8", "P18", "2023-01-22"),
    ("U9", "P19", "2023-01-23"), ("U10", "P20", "2023-01-24"),
]

tagged_relations = [
    ("P1", "T1", 0.95), ("P2", "T2", 0.88), ("P3", "T3", 0.92), ("P4", "T4", 0.85),
    ("P5", "T5", 0.91), ("P6", "T6", 0.89), ("P7", "T7", 0.86), ("P8", "T8", 0.87),
    ("P9", "T1", 0.94), ("P10", "T2", 0.90), ("P11", "T3", 0.93), ("P12", "T4", 0.84),
    ("P13", "T5", 0.83), ("P14", "T6", 0.92), ("P15", "T7", 0.91), ("P16", "T8", 0.85),
    ("P17", "T1", 0.87), ("P18", "T2", 0.93), ("P19", "T3", 0.95), ("P20", "T4", 0.89)
]

posted_relations = [
    ("P1", "Pl1"), ("P2", "Pl2"), ("P3", "Pl3"), ("P4", "Pl1"), ("P5", "Pl2"),
    ("P6", "Pl3"), ("P7", "Pl1"), ("P8", "Pl2"), ("P9", "Pl3"), ("P10", "Pl1"),
    ("P11", "Pl2"), ("P12", "Pl3"), ("P13", "Pl1"), ("P14", "Pl2"), ("P15", "Pl3"),
    ("P16", "Pl1"), ("P17", "Pl2"), ("P18", "Pl3"), ("P19", "Pl1"), ("P20", "Pl2")
]

follows_relations = [
    ("U1", "U2", "2022-01-01"), ("U2", "U3", "2022-02-01"), ("U3", "U1", "2022-03-01"),
    ("U4", "U5", "2022-04-01"), ("U6", "U7", "2022-05-01"), ("U8", "U9", "2022-06-01"),
    ("U10", "U1", "2022-07-01")
]

# Map each user_id to the topics that user engages with.
# U5 (gadget_queen) has no entry, so no engagements are generated for that user.
user_topic_map = {
    "U4": ["T1", "T2"],  # data_dude
    "U3": ["T3", "T5"],  # fitness_fan
    "U6": ["T5", "T3"],  # health_blogger
    "U7": ["T4", "T8"],  # eco_activist
    "U10": ["T8", "T1"],  # bookworm
    "U2": ["T2", "T1"],  # tech_guru
    "U1": ["T6", "T7"],  # social_star
    "U8": ["T6", "T7"],  # fashionista
    "U9": ["T4", "T7"],  # travel_enthusiast
}

tagged_lookup = {}
for post_id, topic_id, _ in tagged_relations:
    tagged_lookup.setdefault(topic_id, []).append(post_id)

# Build ENGAGED_WITH tuples: (user_id, post_id, engagement_type, engagement_date)
engaged_with_relations = []

for user_id, topic_ids in user_topic_map.items():
    used_posts = set()
    for topic in topic_ids:
        valid_posts = tagged_lookup.get(topic, [])
        random.shuffle(valid_posts)
        for post_id in valid_posts:
            if post_id not in used_posts:
                engaged_with_relations.append((
                    user_id,
                    post_id,
                    random.choice(["like", "comment", "share"]),
                    (datetime.today() - timedelta(days=random.randint(1, 90))).strftime('%Y-%m-%d')
                ))
                used_posts.add(post_id)
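                # This break leaves only the inner loop, so the next topic
                # can still contribute one more post for the same user.
                if len(used_posts) >= 2:
                    break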
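
The loop above draws from the global `random` state, so every run produces different `ENGAGED_WITH` tuples, and `random.shuffle` reorders the lists stored in `tagged_lookup` in place. With this data it also yields three posts per user (two from the first topic, one from the second), because the `break` exits only the inner loop. A minimal sketch of a reproducible variant follows. It is an assumption about the intent, not the notebook's method: `generate_engagements`, `seed`, `max_posts`, and `today` are hypothetical names, and the cap of `max_posts` applies across all of a user's topics.

```python
import random
from datetime import date, timedelta

def generate_engagements(user_topic_map, tagged_relations, seed=42,
                         max_posts=2, today=None):
    """Return (user_id, post_id, engagement_type, iso_date) tuples, reproducibly."""
    rng = random.Random(seed)            # private generator: same seed, same output
    today = today or date.today()

    posts_by_topic = {}
    for post_id, topic_id, _score in tagged_relations:
        posts_by_topic.setdefault(topic_id, []).append(post_id)

    engagements = []
    for user_id, topic_ids in user_topic_map.items():
        # Collect candidate posts across the user's topics without duplicates.
        candidates = []
        for topic_id in topic_ids:
            for post_id in posts_by_topic.get(topic_id, []):
                if post_id not in candidates:
                    candidates.append(post_id)
        # Sample without mutating posts_by_topic.
        for post_id in rng.sample(candidates, min(max_posts, len(candidates))):
            engaged_on = today - timedelta(days=rng.randint(1, 90))
            engagements.append((
                user_id,
                post_id,
                rng.choice(["like", "comment", "share"]),
                engaged_on.isoformat(),
            ))
    return engagements

sample_map = {"U4": ["T1", "T2"], "U3": ["T3"]}
sample_tags = [("P1", "T1", 0.95), ("P2", "T2", 0.88),
               ("P3", "T3", 0.92), ("P9", "T1", 0.94)]
first = generate_engagements(sample_map, sample_tags, seed=7, today=date(2023, 4, 1))
second = generate_engagements(sample_map, sample_tags, seed=7, today=date(2023, 4, 1))
print(first == second)  # True: identical seed and reference date
```

Fixing both the seed and the reference date makes the generated graph repeatable, which keeps the analytical results later in the notebook comparable between runs.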

In [8]:
# === Relationship Creator ===
def create_all_relationships(tx):
    for u1, u2, date in follows_relations:
        tx.run("""
            MATCH (a:User {user_id: $u1}), (b:User {user_id: $u2})
            MERGE (a)-[:FOLLOWS {follow_date: date($date)}]->(b)
        """, u1=u1, u2=u2, date=date)

    for u, p, date in created_relations:
        tx.run("""
            MATCH (u:User {user_id: $u}), (p:Post {post_id: $p})
            MERGE (u)-[:CREATED {creation_date: date($date)}]->(p)
        """, u=u, p=p, date=date)

    for p, t, score in tagged_relations:
        tx.run("""
            MATCH (p:Post {post_id: $p}), (t:Topic {topic_id: $t})
            MERGE (p)-[:TAGGED_WITH {relevance_score: $score}]->(t)
        """, p=p, t=t, score=score)

    for p, pl in posted_relations:
        tx.run("""
            MATCH (p:Post {post_id: $p}), (pl:Platform {platform_id: $pl})
            MERGE (p)-[:POSTED_ON]->(pl)
        """, p=p, pl=pl)

    for u, p, etype, date in engaged_with_relations:
        tx.run("""
            MATCH (u:User {user_id: $u}), (p:Post {post_id: $p})
            MERGE (u)-[:ENGAGED_WITH {
                engagement_type: $etype, 
                engagement_date: date($date)
            }]->(p)
        """, u=u, p=p, etype=etype, date=date)

# Run the transaction
with driver.session() as session:
    # execute_write replaces write_transaction, which is deprecated in driver 5.x
    session.execute_write(create_all_relationships)

"✅ All enriched relationships created (FOLLOWS, CREATED, TAGGED_WITH, POSTED_ON, ENGAGED_WITH)."


'✅ All enriched relationships created (FOLLOWS, CREATED, TAGGED_WITH, POSTED_ON, ENGAGED_WITH).'

### 📊 Analytical Queries

In this section, we write Cypher queries to analyze the graph:

- Find the most influential users based on followers and engagement.
- Identify the topics with the highest average engagement.
- Discover which content types perform best on each platform.
- Find communities of users engaging with similar topics.

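1. Who are the most influential users based on followers and engagement received? (Results are ranked by total engagement, with follower counts shown alongside. Because `share_count` is never written to the `Post` nodes, the totals below reflect likes only.)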

In [9]:
# --- Query Function --- #
def get_most_influential_users(tx):
    query = """
    MATCH (u:User)-[:CREATED]->(p:Post)
    RETURN u.username AS username, 
           u.follower_count AS followers, 
           SUM(coalesce(p.like_count, 0) + coalesce(p.share_count, 0)) AS total_engagement
    ORDER BY total_engagement DESC
    LIMIT 5
    """
    result = tx.run(query)
    return result.data()

# --- Run Query and Print Results --- #
with driver.session() as session:
    top_users = session.execute_read(get_most_influential_users)  # use updated method

# --- Display --- #
print("📊 Top 5 Most Influential Users:")
for user in top_users:
    print(f"👤 {user['username']} | Followers: {user['followers']} | Engagement: {user['total_engagement']}")



📊 Top 5 Most Influential Users:
👤 fitness_fan | Followers: 4300 | Engagement: 2447
👤 data_dude | Followers: 18000 | Engagement: 2267
👤 social_star | Followers: 12500 | Engagement: 1405
👤 health_blogger | Followers: 2100 | Engagement: 1220
👤 travel_enthusiast | Followers: 5400 | Engagement: 973
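
The totals above can be cross-checked in plain Python from the sample data. The sketch below sums `like_count` per creator and reproduces the same five figures, which is consistent with `share_count` not being stored on the `Post` nodes. The index arithmetic is an assumption that mirrors the pattern in `created_relations`.

```python
from collections import defaultdict

usernames = {
    "U1": "social_star", "U2": "tech_guru", "U3": "fitness_fan",
    "U4": "data_dude", "U5": "gadget_queen", "U6": "health_blogger",
    "U7": "eco_activist", "U8": "fashionista", "U9": "travel_enthusiast",
    "U10": "bookworm",
}

# like_count for P1..P20, copied from the posts list.
likes = [453, 127, 2145, 987, 203, 1145, 55, 322, 783, 150,
         952, 647, 302, 1280, 200, 75, 56, 300, 190, 245]

totals = defaultdict(int)
for index, like_count in enumerate(likes):
    # created_relations pairs P1..P10 with U1..U10, then P11..P20 with U1..U10 again.
    user_id = f"U{index % 10 + 1}"
    totals[usernames[user_id]] += like_count

top5 = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:5]
for username, total in top5:
    print(f"{username}: {total}")
```

Running it lists fitness_fan (2447), data_dude (2267), social_star (1405), health_blogger (1220), and travel_enthusiast (973), matching the query output.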


2. Which topics generate the highest average engagement per post?

In [10]:
# Query Function
def get_highest_engagement_topics(tx):
    query = """
    MATCH (p:Post)-[:TAGGED_WITH]->(t:Topic)
    RETURN t.name AS topic, 
           AVG(COALESCE(p.like_count, 0) + COALESCE(p.share_count, 0)) AS avg_engagement
    ORDER BY avg_engagement DESC;
    """
    result = tx.run(query)
    return result.data()

# --- Run Query and Display Results --- #
with driver.session() as session:
    top_topics = session.execute_read(get_highest_engagement_topics)

# --- Display Output --- #
print("📊 Topics with Highest Average Engagement:")
for topic in top_topics:
    avg = topic.get("avg_engagement")
    if avg is not None:
        print(f"🏷️ {topic['topic']} | Avg Engagement: {round(avg, 2)}")
    else:
        print(f"🏷️ {topic['topic']} | Avg Engagement: N/A")



📊 Topics with Highest Average Engagement:
🏷️ Fashion | Avg Engagement: 1212.5
🏷️ Fitness | Avg Engagement: 1095.67
🏷️ Travel | Avg Engagement: 626.33
🏷️ Data Science | Avg Engagement: 430.67
🏷️ Health | Avg Engagement: 252.5
🏷️ Education | Avg Engagement: 198.5
🏷️ Tech News | Avg Engagement: 192.33
🏷️ Food | Avg Engagement: 127.5
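
The averages above match likes alone. As a rough sketch of how much the figures would move if shares were also counted, the following recomputes the per-topic average from the sample data in both ways. `average_engagement_by_topic` is a hypothetical helper, and the index arithmetic assumes the cyclic post-to-topic pattern used in `tagged_relations`.

```python
from collections import defaultdict

topic_names = ["Data Science", "Tech News", "Fitness", "Travel",
               "Health", "Fashion", "Food", "Education"]  # T1..T8

# like_count and share_count for P1..P20, copied from the posts list.
likes = [453, 127, 2145, 987, 203, 1145, 55, 322, 783, 150,
         952, 647, 302, 1280, 200, 75, 56, 300, 190, 245]
shares = [120, 45, 610, 312, 70, 350, 20, 98, 210, 55,
          275, 180, 85, 420, 65, 28, 17, 92, 60, 88]

def average_engagement_by_topic(include_shares):
    """Average engagement per post for each topic, with or without shares."""
    per_topic = defaultdict(list)
    for index, (like_count, share_count) in enumerate(zip(likes, shares)):
        # tagged_relations cycles P1..P20 through T1..T8 in order.
        topic = topic_names[index % 8]
        per_topic[topic].append(like_count + (share_count if include_shares else 0))
    return {topic: sum(values) / len(values) for topic, values in per_topic.items()}

likes_only = average_engagement_by_topic(include_shares=False)
with_shares = average_engagement_by_topic(include_shares=True)

for topic in sorted(likes_only, key=likes_only.get, reverse=True):
    print(f"{topic}: {likes_only[topic]:.2f} (likes only) "
          f"vs {with_shares[topic]:.2f} (likes + shares)")
```

The likes-only column reproduces the query output, for example 1212.50 for Fashion, while the second column shows the value the query would return once `share_count` is stored.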


3. What content types perform best on each platform?

In [11]:
# Query: average likes per content type on each platform
query = """
MATCH (p:Post)-[:POSTED_ON]->(pl:Platform)
RETURN pl.name AS platform, p.content_type AS type, AVG(p.like_count) AS avg_likes
ORDER BY platform, avg_likes DESC
"""

with driver.session() as session:
    result = session.run(query)
    records = result.data()

print("📊 Best Performing Content Types per Platform:\n")
for row in records:
    print(f"🌐 {row['platform']:10s} | 📦 {row['type']:6s} | 👍 Avg Likes: {round(row['avg_likes'], 2)}")

📊 Best Performing Content Types per Platform:

🌐 Instagram  | 📦 video  | 👍 Avg Likes: 633.25
🌐 Instagram  | 📦 image  | 👍 Avg Likes: 262.5
🌐 Instagram  | 📦 text   | 👍 Avg Likes: 127.0
🌐 Twitter    | 📦 image  | 👍 Avg Likes: 453.0
🌐 Twitter    | 📦 text   | 👍 Avg Likes: 293.17
🌐 YouTube    | 📦 video  | 👍 Avg Likes: 1357.67
🌐 YouTube    | 📦 image  | 👍 Avg Likes: 382.33
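
The query lists every platform and content type pair; picking the single best type per platform is a small extra step. The sketch below does that in plain Python from the sample data and also reports how many posts back each average, since some groups contain a single post. The index arithmetic assumes the cyclic post-to-platform pattern in `posted_relations`.

```python
from collections import defaultdict

platform_names = ["Twitter", "Instagram", "YouTube"]  # Pl1..Pl3

# content_type and like_count for P1..P20, copied from the posts list.
content_types = ["image", "text", "video", "text", "image", "video", "text",
                 "image", "video", "text", "video", "image", "text", "video",
                 "image", "text", "video", "image", "text", "video"]
likes = [453, 127, 2145, 987, 203, 1145, 55, 322, 783, 150,
         952, 647, 302, 1280, 200, 75, 56, 300, 190, 245]

grouped = defaultdict(list)
for index, (content_type, like_count) in enumerate(zip(content_types, likes)):
    # posted_relations cycles P1..P20 through Pl1..Pl3 in order.
    platform = platform_names[index % 3]
    grouped[(platform, content_type)].append(like_count)

best = {}  # platform -> (content_type, average_likes, post_count)
for (platform, content_type), values in grouped.items():
    average = sum(values) / len(values)
    if platform not in best or average > best[platform][1]:
        best[platform] = (content_type, average, len(values))

for platform, (content_type, average, count) in sorted(best.items()):
    print(f"{platform}: {content_type} (avg likes {average:.2f}, {count} post(s))")
```

For this data the Twitter winner, image, rests on one post, so that ranking is less robust than the video results for Instagram and YouTube.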


4. Identify communities of users who engage with similar content. (`ENGAGED_WITH` is generated randomly, so the communities below will vary between runs.)

In [12]:
query = """
MATCH (u:User)-[:ENGAGED_WITH]->(p:Post)-[:TAGGED_WITH]->(t:Topic)
RETURN t.name AS topic, COLLECT(DISTINCT u.username) AS community
ORDER BY topic;
"""

with driver.session() as session:
    result = session.run(query)
    records = result.data()

print("🧩 Content-Based User Communities:\n")
for row in records:
    print(f"📌 Topic: {row['topic']}")
    print(f"👥 Users: {', '.join(row['community'])}\n")

🧩 Content-Based User Communities:

📌 Topic: Data Science
👥 Users: tech_guru, fitness_fan, data_dude, bookworm

📌 Topic: Education
👥 Users: eco_activist, bookworm

📌 Topic: Fashion
👥 Users: social_star, fashionista

📌 Topic: Fitness
👥 Users: fitness_fan, gadget_queen, health_blogger

📌 Topic: Food
👥 Users: social_star, fashionista, travel_enthusiast

📌 Topic: Health
👥 Users: fitness_fan, health_blogger, eco_activist

📌 Topic: Tech News
👥 Users: tech_guru, data_dude

📌 Topic: Travel
👥 Users: health_blogger, eco_activist, travel_enthusiast
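
Grouping by topic is one view of community. Another is to score how strongly pairs of users overlap. The sketch below applies Jaccard similarity to the deterministic `user_topic_map` from the relationship section rather than to the random `ENGAGED_WITH` data. This is an illustrative alternative, not the notebook's query, and `jaccard` is a helper defined only for this example.

```python
from itertools import combinations

# Copied from the relationship section.
user_topic_map = {
    "U4": ["T1", "T2"],   # data_dude
    "U3": ["T3", "T5"],   # fitness_fan
    "U6": ["T5", "T3"],   # health_blogger
    "U7": ["T4", "T8"],   # eco_activist
    "U10": ["T8", "T1"],  # bookworm
    "U2": ["T2", "T1"],   # tech_guru
    "U1": ["T6", "T7"],   # social_star
    "U8": ["T6", "T7"],   # fashionista
    "U9": ["T4", "T7"],   # travel_enthusiast
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

pairs = []
for (user_a, topics_a), (user_b, topics_b) in combinations(user_topic_map.items(), 2):
    score = jaccard(topics_a, topics_b)
    if score > 0:  # keep only users who share at least one topic
        pairs.append((user_a, user_b, round(score, 2)))

pairs.sort(key=lambda pair: pair[2], reverse=True)
for user_a, user_b, score in pairs:
    print(f"{user_a} ~ {user_b}: {score}")
```

Pairs with identical topic sets, such as `U1` and `U8`, score 1.0, while users sharing one of two topics each, such as `U7` and `U9`, score 0.33.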

