# Question 2: Neo4j - Movie Recommendation Graph Database

This notebook implements a movie recommendation system using Neo4j and the MovieLens 100K dataset.

## Property Graph Data Model Explanation

### Node Types:
1. **User**: Users with properties (userId, age, gender, zipCode)
2. **Movie**: Movies with properties (movieId, title, releaseYear)
3. **Genre**: Movie genres with property (name)
4. **Occupation**: User occupations with property (name)

### Relationship Types:
1. **RATED**: User → Movie (properties: rating, timestamp) - captures user ratings
2. **BELONGS_TO**: Movie → Genre - indicates movie genres
3. **HAS_OCCUPATION**: User → Occupation - links users to their occupations
4. **SIMILAR_TO**: Movie → Movie (properties: totalScore, sharedGenres) - computed similarity based on shared genres

### Design Decisions:
- **Occupation as separate node**: Enables efficient querying by occupation and occupation-based analytics.
- **Genre as separate node**: Supports multi-genre movies and efficient genre-based queries.
- **Rating as relationship property**: The RATED relationship captures rating value and timestamp, which is more natural than separate Rating nodes.
- **SIMILAR_TO computed relationship**: Created in section 2D for movie pairs that share at least 3 genres.


In [6]:
# Install required packages
%pip install neo4j pandas requests


Note: you may need to restart the kernel to use updated packages.


In [7]:
# Import required libraries
from neo4j import GraphDatabase
import pandas as pd
import requests
import zipfile
import os
from io import BytesIO

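# Neo4j AuraDB Connection
# Credentials are read from environment variables so the password is not stored in the notebook.
# The variable names NEO4J_URI, NEO4J_USER and NEO4J_PASSWORD are a convention assumed here.
URI = os.environ.get("NEO4J_URI", "neo4j+s://4bc64b91.databases.neo4j.io")
AUTH = (os.environ.get("NEO4J_USER", "neo4j"), os.environ["NEO4J_PASSWORD"])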

# Test connection
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()
print("Successfully connected to Neo4j AuraDB!")


Successfully connected to Neo4j AuraDB!


## 2A.2) Load the Database

Download and load the MovieLens 100K dataset into Neo4j.


In [8]:
# Download MovieLens 100K dataset
url = "https://files.grouplens.org/datasets/movielens/ml-100k.zip"
print("Downloading MovieLens 100K dataset...")

response = requests.get(url)
with zipfile.ZipFile(BytesIO(response.content)) as z:
    z.extractall("./")
    
print("Dataset downloaded and extracted successfully!")

# List extracted files
print("\nExtracted files:")
for f in os.listdir("./ml-100k"):
    print(f"  - {f}")


Downloading MovieLens 100K dataset...
Dataset downloaded and extracted successfully!

Extracted files:
  - u.item
  - u3.test
  - u1.base
  - u.info
  - u2.test
  - u5.test
  - u.genre
  - ub.test
  - ua.base
  - u.data
  - README
  - u4.test
  - u5.base
  - ub.base
  - ua.test
  - u4.base
  - u.user
  - allbut.pl
  - u3.base
  - u1.test
  - mku.sh
  - u2.base
  - u.occupation


In [9]:
# Load data files (column names taken from the dataset README)
# u.user: user info (userid | age | gender | occupation | zip code)
users_df = pd.read_csv("./ml-100k/u.user", sep="|", 
                       names=["userId", "age", "gender", "occupation", "zipCode"],
                       encoding="latin-1")

# u.item: movie info (movie id | title | release date | ... | genres)
# Genres are binary flags in columns 5-23
genre_names = ["unknown", "Action", "Adventure", "Animation", "Children's", "Comedy", 
               "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", 
               "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

movies_cols = ["movieId", "title", "releaseDate", "videoReleaseDate", "imdbUrl"] + genre_names
movies_df = pd.read_csv("./ml-100k/u.item", sep="|", names=movies_cols, encoding="latin-1")

# u.data: ratings (user id | item id | rating | timestamp)
ratings_df = pd.read_csv("./ml-100k/u.data", sep="\t", 
                         names=["userId", "movieId", "rating", "timestamp"])

# u.genre: genre names
genres_df = pd.read_csv("./ml-100k/u.genre", sep="|", names=["genre", "genreId"])

# u.occupation: occupation names
occupations_df = pd.read_csv("./ml-100k/u.occupation", names=["occupation"])

print("Data loaded successfully!")
print(f"\nUsers: {len(users_df)} records")
print(f"Movies: {len(movies_df)} records")
print(f"Ratings: {len(ratings_df)} records")
print(f"Genres: {len(genres_df)} records")
print(f"Occupations: {len(occupations_df)} records")


Data loaded successfully!

Users: 943 records
Movies: 1682 records
Ratings: 100000 records
Genres: 19 records
Occupations: 21 records


In [10]:
# Preview data
print("=== Users Sample ===")
print(users_df.head())
print("\n=== Movies Sample ===")
print(movies_df[["movieId", "title", "releaseDate"]].head())
print("\n=== Ratings Sample ===")
print(ratings_df.head())
print("\n=== Genres ===")
print(genres_df)
print("\n=== Occupations ===")
print(occupations_df)


=== Users Sample ===
   userId  age gender  occupation zipCode
0       1   24      M  technician   85711
1       2   53      F       other   94043
2       3   23      M      writer   32067
3       4   24      M  technician   43537
4       5   33      F       other   15213

=== Movies Sample ===
   movieId              title  releaseDate
0        1   Toy Story (1995)  01-Jan-1995
1        2   GoldenEye (1995)  01-Jan-1995
2        3  Four Rooms (1995)  01-Jan-1995
3        4  Get Shorty (1995)  01-Jan-1995
4        5     Copycat (1995)  01-Jan-1995

=== Ratings Sample ===
   userId  movieId  rating  timestamp
0     196      242       3  881250949
1     186      302       3  891717742
2      22      377       1  878887116
3     244       51       2  880606923
4     166      346       1  886397596

=== Genres ===
          genre  genreId
0       unknown        0
1        Action        1
2     Adventure        2
3     Animation        3
4    Children's        4
5        Comedy        5
6  

In [11]:
# Helper: run a Cypher query and return the result rows as a list of dicts
def run_query(query, parameters=None):
    with driver.session() as session:
        result = session.run(query, parameters or {})
        return [record.data() for record in result]

# Clear existing data (for re-running)
print("Clearing existing data...")
run_query("MATCH (n) DETACH DELETE n")
print("Database cleared.")


Clearing existing data...
Database cleared.


In [12]:
# Create constraints and indexes
print("Creating constraints and indexes...")

# Unique constraints (each is backed by an index, so no separate indexes are created)
run_query("CREATE CONSTRAINT user_id IF NOT EXISTS FOR (u:User) REQUIRE u.userId IS UNIQUE")
run_query("CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.movieId IS UNIQUE")
run_query("CREATE CONSTRAINT genre_name IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE")
run_query("CREATE CONSTRAINT occupation_name IF NOT EXISTS FOR (o:Occupation) REQUIRE o.name IS UNIQUE")

print("Constraints created successfully!")


Creating constraints and indexes...
Constraints created successfully!


In [13]:
# Create Genre nodes
print("Creating Genre nodes...")
for _, row in genres_df.iterrows():
    if pd.notna(row["genre"]) and row["genre"].strip():
        run_query(
            "MERGE (g:Genre {name: $name})",
            {"name": row["genre"]}
        )
print(f"Created {len(genres_df)} Genre nodes")


Creating Genre nodes...
Created 19 Genre nodes


In [14]:
# Create Occupation nodes
print("Creating Occupation nodes...")
for _, row in occupations_df.iterrows():
    run_query(
        "MERGE (o:Occupation {name: $name})",
        {"name": row["occupation"]}
    )
print(f"Created {len(occupations_df)} Occupation nodes")


Creating Occupation nodes...
Created 21 Occupation nodes


In [15]:
# Create User nodes and HAS_OCCUPATION relationships
print("Creating User nodes and HAS_OCCUPATION relationships...")

# Batch insert
batch_size = 100
users_data = users_df.to_dict('records')

for i in range(0, len(users_data), batch_size):
    batch = users_data[i:i+batch_size]
    run_query("""
        UNWIND $batch AS user
        MERGE (u:User {userId: user.userId})
        SET u.age = user.age,
            u.gender = user.gender,
            u.zipCode = user.zipCode
        WITH u, user
        MATCH (o:Occupation {name: user.occupation})
        MERGE (u)-[:HAS_OCCUPATION]->(o)
    """, {"batch": batch})
    
print(f"Created {len(users_df)} User nodes with HAS_OCCUPATION relationships")


Creating User nodes and HAS_OCCUPATION relationships...
Created 943 User nodes with HAS_OCCUPATION relationships


In [16]:
# Create Movie nodes and BELONGS_TO relationships
print("Creating Movie nodes and BELONGS_TO relationships...")

# Extract release year from title (format: Movie Title (1995))
import re

for _, row in movies_df.iterrows():
    # Extract year from title
    title = row["title"]
    year_match = re.search(r'\((\d{4})\)', title)
    release_year = int(year_match.group(1)) if year_match else None
    
    # Create movie node
    run_query("""
        MERGE (m:Movie {movieId: $movieId})
        SET m.title = $title,
            m.releaseYear = $releaseYear
    """, {
        "movieId": int(row["movieId"]),
        "title": title,
        "releaseYear": release_year
    })
    
    # Create BELONGS_TO relationships for genres
    for genre in genre_names:
        if row[genre] == 1:
            run_query("""
                MATCH (m:Movie {movieId: $movieId})
                MATCH (g:Genre {name: $genre})
                MERGE (m)-[:BELONGS_TO]->(g)
            """, {"movieId": int(row["movieId"]), "genre": genre})

print(f"Created {len(movies_df)} Movie nodes with BELONGS_TO relationships")


Creating Movie nodes and BELONGS_TO relationships...
Created 1682 Movie nodes with BELONGS_TO relationships
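
The cell above sends one query per movie and one per movie-genre pair (1,682 + 2,893 round trips, going by the counts in the graph statistics). A minimal sketch of how the same data could be prepared for a single batched `UNWIND`, as is already done for users and ratings, might look like this. The helper names `movie_rows` and `chunked`, the `BATCH_QUERY` string, the batch size of 500, and the toy records are all assumptions for illustration, not part of the original notebook.

```python
import re

GENRES = ["Action", "Comedy", "Drama"]  # toy subset; the notebook uses all 19 names

def movie_rows(records, genre_names):
    """Turn u.item-style records (dicts with 0/1 genre flags) into UNWIND-ready rows."""
    rows = []
    for rec in records:
        match = re.search(r"\((\d{4})\)", rec["title"])
        rows.append({
            "movieId": int(rec["movieId"]),
            "title": rec["title"],
            "releaseYear": int(match.group(1)) if match else None,
            # Keep only the genres whose flag is set to 1
            "genres": [g for g in genre_names if rec.get(g) == 1],
        })
    return rows

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical batched version of the per-movie queries (not executed here)
BATCH_QUERY = """
UNWIND $batch AS row
MERGE (m:Movie {movieId: row.movieId})
SET m.title = row.title, m.releaseYear = row.releaseYear
WITH m, row
UNWIND row.genres AS genreName
MATCH (g:Genre {name: genreName})
MERGE (m)-[:BELONGS_TO]->(g)
"""

# Toy records; the ids are made up
sample = [
    {"movieId": 2, "title": "GoldenEye (1995)", "Action": 1, "Comedy": 0, "Drama": 0},
    {"movieId": 999, "title": "unknown", "Action": 0, "Comedy": 0, "Drama": 0},
]
rows = movie_rows(sample, GENRES)
batches = list(chunked(rows, 500))
print(rows)
```

In the notebook, each batch would then be sent with something like `run_query(BATCH_QUERY, {"batch": batch})`, using `movies_df.to_dict('records')` as the input records.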


In [17]:
# Create RATED relationships
print("Creating RATED relationships...")

# Batch insert ratings
batch_size = 1000
ratings_data = ratings_df.to_dict('records')

for i in range(0, len(ratings_data), batch_size):
    batch = ratings_data[i:i+batch_size]
    run_query("""
        UNWIND $batch AS r
        MATCH (u:User {userId: r.userId})
        MATCH (m:Movie {movieId: r.movieId})
        MERGE (u)-[rated:RATED]->(m)
        SET rated.rating = r.rating,
            rated.timestamp = r.timestamp
    """, {"batch": batch})
    if (i + batch_size) % 10000 == 0:
        print(f"  Processed {min(i + batch_size, len(ratings_data))} ratings...")

print(f"Created {len(ratings_df)} RATED relationships")


Creating RATED relationships...
  Processed 10000 ratings...
  Processed 20000 ratings...
  Processed 30000 ratings...
  Processed 40000 ratings...
  Processed 50000 ratings...
  Processed 60000 ratings...
  Processed 70000 ratings...
  Processed 80000 ratings...
  Processed 90000 ratings...
  Processed 100000 ratings...
Created 100000 RATED relationships


In [18]:
# Verify graph structure
print("=== Graph Statistics ===")

# Count nodes
result = run_query("MATCH (n) RETURN labels(n)[0] AS label, count(*) AS count ORDER BY label")
print("\nNode counts:")
for r in result:
    print(f"  {r['label']}: {r['count']}")

# Count relationships
result = run_query("MATCH ()-[r]->() RETURN type(r) AS type, count(*) AS count ORDER BY type")
print("\nRelationship counts:")
for r in result:
    print(f"  {r['type']}: {r['count']}")


=== Graph Statistics ===

Node counts:
  Genre: 19
  Movie: 1682
  Occupation: 21
  User: 943

Relationship counts:
  BELONGS_TO: 2893
  HAS_OCCUPATION: 943
  RATED: 100000


In [19]:
# Show sample graph data
print("=== Sample User with Occupation ===")
result = run_query("""
    MATCH (u:User)-[:HAS_OCCUPATION]->(o:Occupation)
    RETURN u.userId AS userId, u.age AS age, u.gender AS gender, o.name AS occupation
    LIMIT 5
""")
for r in result:
    print(r)

print("\n=== Sample Movie with Genres ===")
result = run_query("""
    MATCH (m:Movie)-[:BELONGS_TO]->(g:Genre)
    RETURN m.title AS title, collect(g.name) AS genres
    LIMIT 5
""")
for r in result:
    print(r)

print("\n=== Sample Rating ===")
result = run_query("""
    MATCH (u:User)-[r:RATED]->(m:Movie)
    RETURN u.userId AS userId, m.title AS movie, r.rating AS rating
    LIMIT 5
""")
for r in result:
    print(r)


=== Sample User with Occupation ===
{'userId': 7, 'age': 57, 'gender': 'M', 'occupation': 'administrator'}
{'userId': 8, 'age': 36, 'gender': 'M', 'occupation': 'administrator'}
{'userId': 34, 'age': 38, 'gender': 'F', 'occupation': 'administrator'}
{'userId': 42, 'age': 30, 'gender': 'M', 'occupation': 'administrator'}
{'userId': 48, 'age': 45, 'gender': 'M', 'occupation': 'administrator'}

=== Sample Movie with Genres ===
{'title': 'unknown', 'genres': ['unknown']}
{'title': 'Good Morning (1971)', 'genres': ['unknown']}
{'title': 'GoldenEye (1995)', 'genres': ['Action', 'Adventure', 'Thriller']}
{'title': 'Get Shorty (1995)', 'genres': ['Action', 'Comedy', 'Drama']}
{'title': 'From Dusk Till Dawn (1996)', 'genres': ['Action', 'Comedy', 'Crime', 'Horror', 'Thriller']}

=== Sample Rating ===
{'userId': 1, 'movie': 'Toy Story (1995)', 'rating': 5}
{'userId': 1, 'movie': 'GoldenEye (1995)', 'rating': 3}
{'userId': 1, 'movie': 'Four Rooms (1995)', 'rating': 4}
{'userId': 1, 'movie': 'Get 

## 2B) Graph Traversal - Movie Recommendations

For a given user, recommend 10 movies they haven't watched yet but are highly rated by similar users.

**Threshold Decision**: "Highly rated" means ratings >= 4 stars (out of 5). This captures movies users actually enjoyed while filtering out mediocre ratings.

**Similar Users**: A user is similar to the target user if both rated at least one common movie highly (>= 4 stars). Since both ratings are then 4 or 5, they differ by at most 1 point.
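
As a rough illustration of this traversal, here is a pure-Python sketch on toy data. The function `recommend` and the dictionary `toy` are hypothetical and only mirror the idea, not the Cypher query itself. One difference is deliberate: here each similar user contributes one vote per movie, whereas the Cypher query in the next cell produces one row per shared movie, so its `avg()` gives more weight to users with a larger overlap.

```python
from collections import defaultdict

def recommend(ratings, target, threshold=4, limit=10):
    """ratings: {user: {movie: rating}}. Returns [(movie, avg_rating, n_users)]."""
    liked = {m for m, r in ratings[target].items() if r >= threshold}
    seen = set(ratings[target])
    # Similar users: anyone else who rated highly at least one movie the target liked
    similar = [
        u for u, rated in ratings.items()
        if u != target and any(rated.get(m, 0) >= threshold for m in liked)
    ]
    # Collect high ratings from similar users for movies the target has not rated
    votes = defaultdict(list)
    for u in similar:
        for m, r in ratings[u].items():
            if r >= threshold and m not in seen:
                votes[m].append(r)
    ranked = [(m, round(sum(v) / len(v), 2), len(v)) for m, v in votes.items()]
    # Highest average first, then most supporting users, then title for stability
    ranked.sort(key=lambda t: (-t[1], -t[2], t[0]))
    return ranked[:limit]

toy = {
    "u1": {"A": 5, "B": 4, "C": 2},
    "u2": {"A": 4, "D": 5, "E": 4},
    "u3": {"B": 5, "D": 5, "C": 5},
    "u4": {"C": 5, "F": 5},
}
print(recommend(toy, "u1"))  # [('D', 5.0, 2), ('E', 4.0, 1)]
```

User `u4` is not considered similar to `u1` because the only movie they share (`C`) was not rated highly by `u1`, so `F` is never recommended.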


In [20]:
# 2B: Movie Recommendations
# Threshold: rating >= 4 (highly rated)

def recommend_movies(user_id, high_rating_threshold=4, limit=10):
    """
    Recommend movies based on similar users' ratings.
    
    Algorithm:
    1. Find movies the target user rated highly (>= threshold)
    2. Find users who also rated those movies highly (similar users)
    3. Find movies similar users rated highly that the target user hasn't rated
    4. Rank by the average of those high ratings, then by number of similar users
    """
    query = """
    // Find the target user and movies they rated highly
    MATCH (targetUser:User {userId: $userId})-[r1:RATED]->(m1:Movie)
    WHERE r1.rating >= $threshold
    
    // Find similar users who also rated those movies highly
    MATCH (similarUser:User)-[r2:RATED]->(m1)
    WHERE similarUser <> targetUser 
      AND r2.rating >= $threshold
    
    // Find movies that similar users rated highly but target user hasn't seen
    MATCH (similarUser)-[r3:RATED]->(recommendedMovie:Movie)
    WHERE r3.rating >= $threshold
      AND NOT EXISTS {
          MATCH (targetUser)-[:RATED]->(recommendedMovie)
      }
    
    // Aggregate and rank
    RETURN recommendedMovie.title AS movieTitle,
           recommendedMovie.movieId AS movieId,
           round(avg(r3.rating), 2) AS avgRating,
           count(DISTINCT similarUser) AS numSimilarUsers
    ORDER BY avgRating DESC, numSimilarUsers DESC
    LIMIT $limit
    """
    
    return run_query(query, {
        "userId": user_id,
        "threshold": high_rating_threshold,
        "limit": limit
    })

# Test with user 1
print("=== Movie Recommendations for User 1 ===")
print("(Movies not watched by User 1, but highly rated by similar users)")
print(f"Threshold: rating >= 4 stars\n")

recommendations = recommend_movies(user_id=1)
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['movieTitle']}")
    print(f"   Average Rating: {rec['avgRating']}, Recommended by {rec['numSimilarUsers']} similar users")


=== Movie Recommendations for User 1 ===
(Movies not watched by User 1, but highly rated by similar users)
Threshold: rating >= 4 stars

1. Incognito (1997)
   Average Rating: 5.0, Recommended by 5 similar users
2. World of Apu, The (Apur Sansar) (1959)
   Average Rating: 5.0, Recommended by 4 similar users
3. Aparajito (1956)
   Average Rating: 5.0, Recommended by 4 similar users
4. Star Kid (1997)
   Average Rating: 5.0, Recommended by 3 similar users
5. Hugo Pool (1997)
   Average Rating: 5.0, Recommended by 3 similar users
6. Walking and Talking (1996)
   Average Rating: 5.0, Recommended by 3 similar users
7. Prefontaine (1997)
   Average Rating: 5.0, Recommended by 3 similar users
8. Santa with Muscles (1996)
   Average Rating: 5.0, Recommended by 2 similar users
9. Angel Baby (1995)
   Average Rating: 5.0, Recommended by 2 similar users
10. Saint of Fort Washington, The (1993)
   Average Rating: 5.0, Recommended by 2 similar users


In [21]:
# Test with user 100
print("=== Movie Recommendations for User 100 ===")
print("(Movies not watched by User 100, but highly rated by similar users)")
print(f"Threshold: rating >= 4 stars\n")

recommendations = recommend_movies(user_id=100)
for i, rec in enumerate(recommendations, 1):
    print(f"{i}. {rec['movieTitle']}")
    print(f"   Average Rating: {rec['avgRating']}, Recommended by {rec['numSimilarUsers']} similar users")


=== Movie Recommendations for User 100 ===
(Movies not watched by User 100, but highly rated by similar users)
Threshold: rating >= 4 stars

1. Incognito (1997)
   Average Rating: 5.0, Recommended by 5 similar users
2. Walking and Talking (1996)
   Average Rating: 5.0, Recommended by 3 similar users
3. Star Kid (1997)
   Average Rating: 5.0, Recommended by 3 similar users
4. Hugo Pool (1997)
   Average Rating: 5.0, Recommended by 3 similar users
5. Steel (1997)
   Average Rating: 5.0, Recommended by 3 similar users
6. Two or Three Things I Know About Her (1966)
   Average Rating: 5.0, Recommended by 2 similar users
7. Thin Line Between Love and Hate, A (1996)
   Average Rating: 5.0, Recommended by 2 similar users
8. Prefontaine (1997)
   Average Rating: 5.0, Recommended by 2 similar users
9. Maya Lin: A Strong Clear Vision (1994)
   Average Rating: 5.0, Recommended by 2 similar users
10. Letter From Death Row, A (1998)
   Average Rating: 5.0, Recommended by 2 similar users


## 2C) Pattern Matching - Movie Triangles

Find "movie triangles": triples of movies that share at least one genre and are connected through at least one user who rated all three movies highly.

**Requirements** (as per assignment):
- Find triples of movies (m1, m2, m3)
- All three movies must share at least one common genre
- Connected through users who rated ALL THREE movies highly

**Threshold Decision**: "Highly rated" means ratings > 4 stars (i.e., 5-star ratings), as suggested in the assignment. If no results are found, the code falls back to >= 4 stars.
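
The requirements can be illustrated with a small pure-Python sketch on toy data. The function `movie_triangles` and the toy dictionaries are hypothetical. For integer star ratings, `threshold=5` corresponds to "rating > 4".

```python
from itertools import combinations

def movie_triangles(genres, ratings, threshold=5):
    """genres: {movie: set of genre names}; ratings: {user: {movie: rating}}.
    Returns [(movie_triple, shared_genres, n_users)]."""
    # Movies each user rated at or above the threshold
    liked = {u: {m for m, r in rated.items() if r >= threshold}
             for u, rated in ratings.items()}
    triangles = []
    for triple in combinations(sorted(genres), 3):
        # Genres common to all three movies
        shared = set.intersection(*(genres[m] for m in triple))
        if not shared:
            continue
        # Users who rated all three movies highly
        n_users = sum(1 for movies in liked.values() if movies.issuperset(triple))
        if n_users:
            triangles.append((triple, sorted(shared), n_users))
    return triangles

toy_genres = {
    "A": {"Drama", "Crime"},
    "B": {"Drama"},
    "C": {"Drama", "War"},
    "D": {"Comedy"},
}
toy_ratings = {
    "u1": {"A": 5, "B": 5, "C": 5, "D": 5},
    "u2": {"A": 5, "B": 5, "C": 4},
    "u3": {"A": 5, "B": 5, "C": 5},
}
print(movie_triangles(toy_genres, toy_ratings))  # [(('A', 'B', 'C'), ['Drama'], 2)]
```

Lowering the threshold to 4 lets `u2` count as well, which is the same effect as the fallback to >= 4 stars.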


In [22]:
# 2C: Find Movie Triangles
# Threshold Decision: "highly rated" = rating > 4 (i.e., rating = 5, as per assignment suggestion)
# 
# Requirements (as per assignment):
# - Find triples of movies (m1, m2, m3)
# - All three movies share at least one common genre
# - Connected through users who rated ALL THREE movies highly (> 4 stars)
# - Output: genre and titles of all three movies

# Reconnect to Neo4j
print("Creating fresh Neo4j connection...")
try:
    driver.close()
except Exception:
    pass
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()
print("Connected successfully!\n")

print("=== Movie Triangles (Triples of Highly Rated Movies with Shared Genre) ===")
print("Threshold: rating > 4 stars (i.e., 5-star ratings)")
print("Requirements: Three movies sharing at least one genre,")
print("connected by users who rated all three movies > 4 stars\n")

# Strategy: Use smaller dataset to avoid Neo4j AuraDB free tier memory limit (250MB)
# (1) Find top 15 popular 5-star movies
# (2) Find triangles among them (without counting users to save memory)
# (3) Verify at least one user exists for each triangle

print("Step 1: Finding top 15 most popular highly-rated movies...")

popular_query = """
MATCH (u:User)-[r:RATED]->(m:Movie)
WHERE r.rating > 4
WITH m, count(DISTINCT u) AS fiveStarCount
WHERE fiveStarCount >= 15
RETURN m.movieId AS movieId, m.title AS title, fiveStarCount
ORDER BY fiveStarCount DESC
LIMIT 15
"""

try:
    popular_movies = run_query(popular_query)
    print(f"Found {len(popular_movies)} popular 5-star movies:")
    for m in popular_movies[:5]:
        print(f"  - {m['title']} ({m['fiveStarCount']} five-star ratings)")
    print("  ...")
    
    if len(popular_movies) >= 3:
        movie_ids = [m['movieId'] for m in popular_movies]
        
        print("\nStep 2: Finding movie triangles (3 movies sharing a genre)...")
        
        # Find all triangles (movies sharing genre) - memory efficient
        triangle_query = """
        MATCH (m1:Movie)-[:BELONGS_TO]->(g:Genre)<-[:BELONGS_TO]-(m2:Movie),
              (m3:Movie)-[:BELONGS_TO]->(g)
        WHERE m1.movieId IN $movieIds 
          AND m2.movieId IN $movieIds 
          AND m3.movieId IN $movieIds
          AND m1.movieId < m2.movieId AND m2.movieId < m3.movieId
        RETURN DISTINCT g.name AS genre,
               m1.movieId AS m1Id, m1.title AS movie1,
               m2.movieId AS m2Id, m2.title AS movie2,
               m3.movieId AS m3Id, m3.title AS movie3
        LIMIT 30
        """
        
        triangles = run_query(triangle_query, {"movieIds": movie_ids})
        print(f"Found {len(triangles)} potential triangles\n")
        
        if triangles:
            print("Step 3: Verifying triangles have users who rated all three > 4 stars...\n")
            
            valid_triangles = []
            for tri in triangles:
                # Check if at least one user rated all three movies > 4 stars
                verify_query = """
                MATCH (u:User)-[r1:RATED]->(m1:Movie {movieId: $m1Id}),
                      (u)-[r2:RATED]->(m2:Movie {movieId: $m2Id}),
                      (u)-[r3:RATED]->(m3:Movie {movieId: $m3Id})
                WHERE r1.rating > 4 AND r2.rating > 4 AND r3.rating > 4
                RETURN count(DISTINCT u) AS userCount
                LIMIT 1
                """
                result = run_query(verify_query, {
                    "m1Id": tri['m1Id'], 
                    "m2Id": tri['m2Id'], 
                    "m3Id": tri['m3Id']
                })
                
                if result and result[0]['userCount'] > 0:
                    tri['userCount'] = result[0]['userCount']
                    valid_triangles.append(tri)
                    
                if len(valid_triangles) >= 10:  # Found enough
                    break
            
            if valid_triangles:
                print(f"Found {len(valid_triangles)} valid movie triangles:\n")
                for i, tri in enumerate(valid_triangles, 1):
                    print(f"{i}. Genre: {tri['genre']}")
                    print(f"   Movie 1: {tri['movie1']}")
                    print(f"   Movie 2: {tri['movie2']}")
                    print(f"   Movie 3: {tri['movie3']}")
                    print(f"   Users who rated all three > 4 stars: {tri['userCount']}")
                    print()
            else:
                print("No triangles found with users rating all three > 4 stars.")
                print("Trying with >= 4 stars threshold...\n")
                
                for tri in triangles[:20]:
                    verify_query = """
                    MATCH (u:User)-[r1:RATED]->(m1:Movie {movieId: $m1Id}),
                          (u)-[r2:RATED]->(m2:Movie {movieId: $m2Id}),
                          (u)-[r3:RATED]->(m3:Movie {movieId: $m3Id})
                    WHERE r1.rating >= 4 AND r2.rating >= 4 AND r3.rating >= 4
                    RETURN count(DISTINCT u) AS userCount
                    LIMIT 1
                    """
                    result = run_query(verify_query, {
                        "m1Id": tri['m1Id'], 
                        "m2Id": tri['m2Id'], 
                        "m3Id": tri['m3Id']
                    })
                    
                    if result and result[0]['userCount'] > 0:
                        tri['userCount'] = result[0]['userCount']
                        valid_triangles.append(tri)
                        
                    if len(valid_triangles) >= 10:
                        break
                
                if valid_triangles:
                    print(f"Found {len(valid_triangles)} valid movie triangles (>= 4 stars):\n")
                    for i, tri in enumerate(valid_triangles, 1):
                        print(f"{i}. Genre: {tri['genre']}")
                        print(f"   Movie 1: {tri['movie1']}")
                        print(f"   Movie 2: {tri['movie2']}")
                        print(f"   Movie 3: {tri['movie3']}")
                        print(f"   Users who rated all three >= 4 stars: {tri['userCount']}")
                        print()
        else:
            print("No triangles found among popular movies.")
    else:
        print("Not enough popular movies to form triangles.")
        
except Exception as e:
    print(f"Error: {e}")


Creating fresh Neo4j connection...
Connected successfully!

=== Movie Triangles (Triples of Highly Rated Movies with Shared Genre) ===
Threshold: rating > 4 stars (i.e., 5-star ratings)
Requirements: Three movies sharing at least one genre,
connected by users who rated all three movies > 4 stars

Step 1: Finding top 15 most popular highly-rated movies...
Found 15 popular 5-star movies:
  - Star Wars (1977) (325 five-star ratings)
  - Fargo (1996) (227 five-star ratings)
  - Godfather, The (1972) (214 five-star ratings)
  - Raiders of the Lost Ark (1981) (202 five-star ratings)
  - Pulp Fiction (1994) (188 five-star ratings)
  ...

Step 2: Finding movie triangles (3 movies sharing a genre)...
Found 30 potential triangles

Step 3: Verifying triangles have users who rated all three > 4 stars...

Found 10 valid movie triangles:

1. Genre: Drama
   Movie 1: Braveheart (1995)
   Movie 2: Pulp Fiction (1994)
   Movie 3: Shawshank Redemption, The (1994)
   Users who rated all three > 4 stars: 

## 2D) Graph Algorithm Computation - Movie Similarity

Compute movie similarity based on shared genres. The MovieLens 100K dataset doesn't include actor information, so we focus on genre-based similarity.

**Similarity Score Calculation**:
- `genreScore`: Number of shared genres between two movies
- `totalScore = genreScore` (no actor data available)

**Threshold Decision**: A SIMILAR_TO relationship is created only when two movies share at least 3 genres. This keeps the similarity meaningful while avoiding an overly dense graph.
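
A minimal pure-Python sketch of this score on toy data, assuming each movie is represented by a set of genre names. The function `similar_pairs` and the toy movies are hypothetical.

```python
from itertools import combinations

def similar_pairs(genres, min_shared=3):
    """genres: {movie: set of genre names}.
    Returns {(m1, m2): sorted shared genres} for pairs at or above the threshold."""
    edges = {}
    # sorted() + combinations() yields each unordered pair once (like m1.movieId < m2.movieId)
    for m1, m2 in combinations(sorted(genres), 2):
        shared = genres[m1] & genres[m2]
        if len(shared) >= min_shared:  # genreScore >= threshold
            edges[(m1, m2)] = sorted(shared)
    return edges

toy_genres = {
    "X": {"Action", "Adventure", "Sci-Fi", "War"},
    "Y": {"Action", "Adventure", "Sci-Fi"},
    "Z": {"Action", "Comedy"},
}
print(similar_pairs(toy_genres))  # {('X', 'Y'): ['Action', 'Adventure', 'Sci-Fi']}
```

Here `totalScore` would be `len(shared)`, which is 3 for the pair `("X", "Y")`.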

### Exploratory Analysis
Count how many movie pairs share 1, 2, 3, or more genres.


In [23]:
# 2D: Genre overlap distribution analysis

# Check connection
print("Checking Neo4j connection...")
try:
    driver.verify_connectivity()
    print("Connection OK\n")
except Exception:
    print("Reconnecting...")
    driver = GraphDatabase.driver(URI, auth=AUTH)
    driver.verify_connectivity()
    print("Reconnected!\n")

print("=== Exploratory Analysis: Genre Overlap Distribution ===")
print("How many movie pairs share 1, 2, 3, 4+ genres?\n")

# Count movie pairs by shared genres
query = """
MATCH (m1:Movie)-[:BELONGS_TO]->(g:Genre)<-[:BELONGS_TO]-(m2:Movie)
WHERE m1.movieId < m2.movieId
WITH m1, m2, count(g) AS sharedGenres
RETURN sharedGenres, count(*) AS numPairs
ORDER BY sharedGenres
"""

try:
    result = run_query(query)
    if result:
        total_pairs = sum(r['numPairs'] for r in result)
        print(f"Total movie pairs with at least 1 shared genre: {total_pairs}\n")
        print("Distribution:")
        for r in result:
            percentage = (r['numPairs'] / total_pairs) * 100
            print(f"  {r['sharedGenres']} shared genres: {r['numPairs']} pairs ({percentage:.1f}%)")
    else:
        print("No results returned.")
except Exception as e:
    print(f"Query error: {e}")


Checking Neo4j connection...
Connection OK

=== Exploratory Analysis: Genre Overlap Distribution ===
How many movie pairs share 1, 2, 3, 4+ genres?

Total movie pairs with at least 1 shared genre: 489791

Distribution:
  1 shared genres: 458002 pairs (93.5%)
  2 shared genres: 30230 pairs (6.2%)
  3 shared genres: 1512 pairs (0.3%)
  4 shared genres: 44 pairs (0.0%)
  5 shared genres: 3 pairs (0.0%)
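
The distribution above makes it easy to check how many SIMILAR_TO edges each candidate threshold would produce. The following short calculation uses only the counts printed above; the helper name `edges_at` is made up for this sketch.

```python
# Pair counts by number of shared genres, copied from the output above
distribution = {1: 458002, 2: 30230, 3: 1512, 4: 44, 5: 3}
total = sum(distribution.values())

def edges_at(threshold):
    """Number of movie pairs sharing at least `threshold` genres."""
    return sum(n for shared, n in distribution.items() if shared >= threshold)

for t in range(1, 6):
    print(f">= {t} shared genres: {edges_at(t)} edges ({edges_at(t) / total:.2%})")
```

A threshold of 3 gives 1512 + 44 + 3 = 1559 edges, whereas a threshold of 2 would give 31789.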


In [24]:
# 2D: Create SIMILAR_TO relationships
# Based on analysis, we choose threshold >= 3 shared genres
# This filters out pairs with only superficial similarity

# Check connection
print("Checking Neo4j connection...")
try:
    driver.verify_connectivity()
    print("Connection OK\n")
except Exception:
    print("Reconnecting...")
    driver = GraphDatabase.driver(URI, auth=AUTH)
    driver.verify_connectivity()
    print("Reconnected!\n")

print("=== Creating SIMILAR_TO Relationships ===")
print("Threshold: >= 3 shared genres")
print("This creates meaningful similarity edges while controlling graph density\n")

try:
    # Remove existing SIMILAR_TO relationships
    run_query("MATCH ()-[r:SIMILAR_TO]->() DELETE r")
    print("Removed existing SIMILAR_TO relationships")
    
    # Create SIMILAR_TO for movies with >= 3 shared genres
    query = """
    MATCH (m1:Movie)-[:BELONGS_TO]->(g:Genre)<-[:BELONGS_TO]-(m2:Movie)
    WHERE m1.movieId < m2.movieId
    WITH m1, m2, count(g) AS genreScore, collect(g.name) AS sharedGenres
    WHERE genreScore >= 3
    MERGE (m1)-[s:SIMILAR_TO]->(m2)
    SET s.totalScore = genreScore,
        s.sharedGenres = sharedGenres
    RETURN count(*) AS relationshipsCreated
    """
    
    result = run_query(query)
    if result:
        print(f"Created {result[0]['relationshipsCreated']} SIMILAR_TO relationships")
    else:
        print("Query returned no results.")
except Exception as e:
    print(f"Error: {e}")
    print("Reconnecting and retrying...")
    
    driver = GraphDatabase.driver(URI, auth=AUTH)
    driver.verify_connectivity()
    
    try:
        run_query("MATCH ()-[r:SIMILAR_TO]->() DELETE r")
        result = run_query("""
            MATCH (m1:Movie)-[:BELONGS_TO]->(g:Genre)<-[:BELONGS_TO]-(m2:Movie)
            WHERE m1.movieId < m2.movieId
            WITH m1, m2, count(g) AS genreScore, collect(g.name) AS sharedGenres
            WHERE genreScore >= 3
            MERGE (m1)-[s:SIMILAR_TO]->(m2)
            SET s.totalScore = genreScore,
                s.sharedGenres = sharedGenres
            RETURN count(*) AS relationshipsCreated
        """)
        if result:
            print(f"Created {result[0]['relationshipsCreated']} SIMILAR_TO relationships")
    except Exception as e2:
        print(f"Retry error: {e2}")


Checking Neo4j connection...
Connection OK

=== Creating SIMILAR_TO Relationships ===
Threshold: >= 3 shared genres
This creates meaningful similarity edges while controlling graph density

Removed existing SIMILAR_TO relationships
Created 1559 SIMILAR_TO relationships
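
The reconnect-and-retry logic in the cell above duplicates the whole query in its `except` branch, and the same pattern recurs in later cells. One possible way to factor it out is a small retry helper, sketched below. The names `with_retry`, `flaky` and `fake_reconnect` are hypothetical, and the demo uses a fake function instead of a real Neo4j call.

```python
import time

def with_retry(fn, reconnect, attempts=2, delay=0.0):
    """Call fn(); on failure call reconnect() and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # a real notebook would likely narrow this to driver errors
            last_error = exc
            if attempt < attempts - 1:
                reconnect()
                time.sleep(delay)
    raise last_error

# Demo with a fake operation that fails once, then succeeds
calls = {"n": 0, "reconnects": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("session expired")
    return "ok"

def fake_reconnect():
    calls["reconnects"] += 1

result = with_retry(flaky, fake_reconnect)
print(result, calls)  # ok {'n': 2, 'reconnects': 1}
```

In the notebook, `fn` could be `lambda: run_query(query)` and `reconnect` a function that recreates the global `driver`.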


In [25]:
# Show at least 3 SIMILAR_TO relationships as required by assignment
print("=== Sample SIMILAR_TO Relationships (showing at least 3) ===\n")

query = """
MATCH (m1:Movie)-[s:SIMILAR_TO]->(m2:Movie)
RETURN m1.title AS movie1, 
       m2.title AS movie2, 
       s.totalScore AS totalScore,
       s.sharedGenres AS sharedGenres
ORDER BY s.totalScore DESC
LIMIT 10
"""

try:
    result = run_query(query)
    if result:
        for i, r in enumerate(result, 1):
            print(f"{i}. SIMILAR_TO Relationship:")
            print(f"   Movie 1: {r['movie1']}")
            print(f"   Movie 2: {r['movie2']}")
            print(f"   Total Score (shared genres): {r['totalScore']}")
            genres = r['sharedGenres'] if r['sharedGenres'] else []
            print(f"   Shared Genres: {', '.join(genres)}")
            print()
    else:
        print("No SIMILAR_TO relationships found.")
except Exception as e:
    print(f"Error: {e}")
    print("Reconnecting and retrying...")
    try:
        driver = GraphDatabase.driver(URI, auth=AUTH)
        result = run_query(query)
        if result:
            for i, r in enumerate(result, 1):
                print(f"{i}. {r['movie1']} <-> {r['movie2']} (Score: {r['totalScore']})")
    except Exception as e2:
        print(f"Retry error: {e2}")


=== Sample SIMILAR_TO Relationships (showing at least 3) ===

1. SIMILAR_TO Relationship:
   Movie 1: Star Wars (1977)
   Movie 2: Return of the Jedi (1983)
   Total Score (shared genres): 5
   Shared Genres: Action, Adventure, Romance, Sci-Fi, War

2. SIMILAR_TO Relationship:
   Movie 1: Star Wars (1977)
   Movie 2: Empire Strikes Back, The (1980)
   Total Score (shared genres): 5
   Shared Genres: Action, Adventure, Romance, Sci-Fi, War

3. SIMILAR_TO Relationship:
   Movie 1: Empire Strikes Back, The (1980)
   Movie 2: Return of the Jedi (1983)
   Total Score (shared genres): 5
   Shared Genres: Action, Adventure, Romance, Sci-Fi, War

4. SIMILAR_TO Relationship:
   Movie 1: 20,000 Leagues Under the Sea (1954)
   Movie 2: Kid in King Arthur's Court, A (1995)
   Total Score (shared genres): 4
   Shared Genres: Adventure, Children's, Fantasy, Sci-Fi

5. SIMILAR_TO Relationship:
   Movie 1: Aladdin (1992)
   Movie 2: Hercules (1997)
   Total Score (shared genres): 4
   Shared Genres: A

In [26]:
# Verify SIMILAR_TO distribution
print("=== SIMILAR_TO Relationships by Score ===\n")

query = """
MATCH ()-[s:SIMILAR_TO]->()
RETURN s.totalScore AS score, count(*) AS count
ORDER BY score DESC
"""

try:
    result = run_query(query)
    if result:
        for r in result:
            print(f"  Score {r['score']}: {r['count']} relationships")
    else:
        print("No SIMILAR_TO relationships found in the database.")
except Exception as e:
    print(f"Error: {e}")
    print("Reconnecting...")
    try:
        driver = GraphDatabase.driver(URI, auth=AUTH)
        result = run_query(query)
        if result:
            for r in result:
                print(f"  Score {r['score']}: {r['count']} relationships")
    except Exception as e2:
        print(f"Retry error: {e2}")


=== SIMILAR_TO Relationships by Score ===

  Score 5: 3 relationships
  Score 4: 44 relationships
  Score 3: 1512 relationships


In [27]:
# Final graph statistics
print("=== Final Graph Statistics ===")

# Check connection
try:
    driver.verify_connectivity()
except Exception:
    print("Reconnecting...")
    driver = GraphDatabase.driver(URI, auth=AUTH)
    driver.verify_connectivity()

try:
    # Count nodes and relationships
    result = run_query("MATCH (n) RETURN labels(n)[0] AS label, count(*) AS count ORDER BY label")
    print("\nNode counts:")
    if result:
        for r in result:
            print(f"  {r['label']}: {r['count']}")
    
    result = run_query("MATCH ()-[r]->() RETURN type(r) AS type, count(*) AS count ORDER BY type")
    print("\nRelationship counts:")
    if result:
        for r in result:
            print(f"  {r['type']}: {r['count']}")
except Exception as e:
    print(f"Error: {e}")


=== Final Graph Statistics ===

Node counts:
  Genre: 19
  Movie: 1682
  Occupation: 21
  User: 943

Relationship counts:
  BELONGS_TO: 2893
  HAS_OCCUPATION: 943
  RATED: 100000
  SIMILAR_TO: 1559


In [28]:
# Close the Neo4j connection
driver.close()
print("Neo4j connection closed.")


Neo4j connection closed.
