In [1]:
from langchain_community.graphs import Neo4jGraph
# from neo4j.debug import watch
import pandas as pd
from pyprojroot import here
# watch("neo4j")

In [2]:
NEO4J_URL = "bolt://localhost:7687"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "12345678"
NEO4J_DATABASE = 'neo4j'

In [3]:
graph = Neo4jGraph(url=NEO4J_URL, username=NEO4J_USERNAME, password=NEO4J_PASSWORD, database=NEO4J_DATABASE)

**Nodes:**
- `Movie`: Represents a movie. Each movie node has <u>attributes</u> such as **id** (a unique identifier for the movie), **released** (the release date of the movie), **title** (the movie's title), **tagline** (the movie's tagline), and **imdbRating** (the movie's rating on IMDb).
- `Person`: Represents an individual who can be an <u>actor</u>, a <u>director</u>, or both. Each person node has a <u>single attribute</u>, **name**, which is the name of the person.
- `Genre`: Represents a movie genre. Each genre node has a <u>single attribute</u>, **name**, which is the genre type (e.g., Action, Comedy, Drama, etc.).

**Relationships:**
- `:DIRECTED`: A directional relationship from a Person node to a Movie node, signifying that the person directed the movie.
- `:ACTED_IN`: A directional relationship from a Person node to a Movie node, signifying that the person acted in the movie.
- `:IN_GENRE`: A directional relationship from a Movie node to a Genre node, signifying that the movie belongs to that particular genre.

**Instructions in the script:**
- `LOAD CSV WITH HEADERS`: Loads a CSV file that contains the movie data with headers indicating each column's purpose.
- `MERGE`: Ensures that a node or relationship is created if it does not already exist; otherwise, it matches the existing node or relationship. This prevents duplication.
- `SET`: Assigns properties to the nodes after they've been created or matched.
- `FOREACH`: Executes the contained commands for each element in a list. This is used to iterate over the lists of directors, actors, and genres associated with each movie. It ensures that all the respective Person and Genre nodes are created and linked appropriately to the Movie nodes.

**The Graph Structure:**
- Each Movie node is connected to one or more Person nodes by either :DIRECTED or :ACTED_IN relationships, depending on whether the person is listed as a director or an actor of that movie.
- Each Movie node is also connected to one or more Genre nodes by the :IN_GENRE relationship, representing the genres that the movie is categorized under.

This script effectively takes movie data from a CSV and constructs a rich graph that interlinks movies with the people who directed and acted in them, as well as the genres to which they belong. It is a typical graph structure for a recommendation system or for analysis of relationships within the movie industry.
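
To make the `MERGE` and `FOREACH` behaviour concrete, here is a minimal pure-Python sketch of the same logic, with sets standing in for the graph. The `load_row` helper and the sample row are hypothetical, and a set only loosely approximates `MERGE`; the point is that repeated names collapse into a single node and reloading a row adds nothing.

```python
# Illustrative only: sets stand in for the graph, so adding an existing
# element is a no-op, loosely mirroring MERGE. `load_row` is a hypothetical helper.
nodes = set()   # (label, key) pairs
edges = set()   # (start_node, relationship_type, end_node) triples


def load_row(row):
    movie = ("Movie", row["movieId"])
    nodes.add(movie)
    # Each FOREACH in the Cypher script splits a '|'-separated column,
    # trims every entry, then merges the node and its relationship.
    for column, label, rel, outgoing in [
        ("director", "Person", "DIRECTED", False),
        ("actors", "Person", "ACTED_IN", False),
        ("genres", "Genre", "IN_GENRE", True),
    ]:
        for raw in row[column].split("|"):
            other = (label, raw.strip())
            nodes.add(other)
            edges.add((movie, rel, other) if outgoing else (other, rel, movie))


# Made-up sample row in the same column layout as the CSV
sample = {
    "movieId": "1",
    "director": "Jane Doe",
    "actors": "Jane Doe| John Roe",
    "genres": "Drama|Comedy",
}
load_row(sample)
load_row(sample)  # loading the same row twice adds nothing new

print(len(nodes), len(edges))
```

Because "Jane Doe" appears as both director and actor, she becomes one `Person` entry with two different relationships to the movie, which is the same outcome the Cypher script produces.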

**To load the data within a Cypher query:**

--------------------
- NOTE: Uncomment `dbms.security.allow_csv_import_from_file_urls=true` in `neo4j.conf` so that Neo4j can load the file from the local file system.
- NOTE: Pass the file location as an absolute path, and make sure the path contains no spaces.
--------------------
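
The two path constraints in the notes above can be checked before the query runs. The following is a small sketch, assuming a Windows-style path as in this notebook; `to_load_csv_path` is a hypothetical helper, not part of Neo4j or LangChain.

```python
from pathlib import PureWindowsPath


def to_load_csv_path(path):
    """Return a forward-slash absolute path suitable for appending to 'file:///'.

    Hypothetical helper: it only checks the two constraints from the notes.
    """
    p = PureWindowsPath(path)
    if not p.is_absolute():
        raise ValueError(f"expected an absolute path, got {path!r}")
    posix = p.as_posix()
    if " " in posix:
        raise ValueError(f"path contains spaces: {posix!r}")
    return posix


print(to_load_csv_path(r"c:\Users\froozitalab\movie.csv"))
```

The returned string could then be passed as the `movie_directory` parameter of the `LOAD CSV` query.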

In [4]:
movie_csv_path = here("data/movie_csv/movie.csv")
movie_csv_path

WindowsPath('c:/Users/froozitalab/OneDrive - R.W. Tomlinson Ltd/Documents/Codes/Advanced-QA-and-RAG-Series/KnowledgeGraph-Q&A-and-RAG-with-TabularData/data/movie_csv/movie.csv')

In [5]:
print(pd.read_csv('c:/Users/froozitalab/movie.csv').columns)

Index(['movieId', 'released', 'title', 'actors', 'director', 'genres',
       'imdbRating', 'tagline'],
      dtype='object')


In [7]:
movie_directory = "c:/Users/froozitalab/movie.csv"

![Alt Text](../../images/movie_KnowledgeGraph.png)

In [8]:
# import movie information from the CSV file with tagline
graph.query("""
LOAD CSV WITH HEADERS FROM 'file:///' + $movie_directory
AS row
MERGE (m:Movie {id:row.movieId})
SET m.released = date(row.released),
    m.title = row.title,
    m.tagline = row.tagline,
    m.imdbRating = toFloat(row.imdbRating)
FOREACH (director in split(row.director, '|') | 
    MERGE (p:Person {name:trim(director)})
    MERGE (p)-[:DIRECTED]->(m))
FOREACH (actor in split(row.actors, '|') | 
    MERGE (p:Person {name:trim(actor)})
    MERGE (p)-[:ACTED_IN]->(m))
FOREACH (genre in split(row.genres, '|') | 
    MERGE (g:Genre {name:trim(genre)})
    MERGE (m)-[:IN_GENRE]->(g))
""",
params={"movie_directory": movie_directory})

[]

In [9]:
graph.refresh_schema()
print(graph.schema)

Node properties are the following:
Movie {title: STRING, tagline: STRING, imdbRating: FLOAT, id: STRING, released: DATE},Person {name: STRING},Genre {name: STRING}
Relationship properties are the following:

The relationships are the following:
(:Movie)-[:IN_GENRE]->(:Genre),(:Person)-[:DIRECTED]->(:Movie),(:Person)-[:ACTED_IN]->(:Movie)


In [10]:
# Count all nodes in the graph
cypher = """
  MATCH (n) 
  RETURN count(n)
  """
result = graph.query(cypher)
result

[{'count(n)': 132}]

### **Create vector embeddings from the `tagline` column**

In [11]:
import os
import pandas as pd

In [12]:
movie_csv_path = here("data/movie_csv/movie.csv")
df = pd.read_csv(movie_csv_path)
df.columns

Index(['Unnamed: 0', 'movieId', 'released', 'title', 'actors', 'director',
       'genres', 'imdbRating', 'tagline', 'taglineEmbedding'],
      dtype='object')

In [13]:
model_name = "gpt-35-turbo-1106"
azure_openai_api_key = os.environ["OPENAI_API_KEY"]
azure_openai_endpoint = os.environ["OPENAI_API_BASE"]

**Load the Azure OpenAI Embedding Model**

In [14]:
from openai import AzureOpenAI

# Azure OpenAI client used to generate the tagline embeddings
client = AzureOpenAI(
    api_key=azure_openai_api_key,
    api_version="2023-07-01-preview",
    azure_endpoint=azure_openai_endpoint,
)


def embed_text(text):
    """Return the embedding vector for `text`."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-ada-002",
    )
    return response.data[0].embedding

In [15]:
embedding_list = [embed_text(i) for i in df["tagline"]]
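
The list comprehension above sends every value in the column to the embedding model. Whether this CSV contains missing taglines is not shown here, but pandas represents missing cells as `NaN`, which is not valid text input. A hedged sketch of a guard follows; `embed_taglines` and `fake_embed` are hypothetical names, and the stand-in embedder exists only so the example runs offline.

```python
import math


def embed_taglines(taglines, embed_fn):
    """Embed each tagline, returning None where the tagline is missing or blank."""
    embeddings = []
    for tagline in taglines:
        missing = tagline is None or (
            isinstance(tagline, float) and math.isnan(tagline)
        )
        if missing or not str(tagline).strip():
            embeddings.append(None)
        else:
            embeddings.append(embed_fn(tagline))
    return embeddings


# Stand-in embedder so the sketch runs offline; in the notebook, pass `embed_text`.
def fake_embed(text):
    return [float(len(text))]


print(embed_taglines(["The adventure takes off!", float("nan"), ""], fake_embed))
```

In the notebook this would be called as `embed_taglines(df["tagline"], embed_text)`; rows whose embedding is `None` would then need to be skipped when writing to the graph.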

In [16]:
df["taglineEmbedding"] = embedding_list

**Create a vector index**

In [17]:
graph.query("""
  CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
  FOR (m:Movie) ON (m.taglineEmbedding) 
  OPTIONS { indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }}"""
)

[]
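
The index is configured with 1536 dimensions, which matches the output size of `text-embedding-ada-002`, and with cosine similarity as the comparison function. As a reminder of what that function measures, here is a minimal pure-Python sketch; it illustrates the formula only and is not how Neo4j computes or scales its scores internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Vectors pointing in the same direction score highest regardless of their length, which is why cosine similarity is a common choice for text embeddings.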

In [18]:
graph.query("""
  SHOW VECTOR INDEXES
  """
)

[{'id': 3,
  'name': 'movie_tagline_embeddings',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Movie'],
  'properties': ['taglineEmbedding'],
  'indexProvider': 'vector-1.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0}]

**Populate the index**

In [19]:
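for index, row in df.iterrows():
    # Pass values as parameters instead of interpolating them into the Cypher text.
    # Movie ids were loaded from the CSV, so they are stored as strings.
    params = {"movie_id": str(row["movieId"]), "embedding": row["taglineEmbedding"]}
    graph.query("MATCH (m:Movie {id: $movie_id}) SET m.taglineEmbedding = $embedding", params=params)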
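
The loop above issues one query per movie. For larger datasets, a common alternative is to send all rows in a single parameterized query with `UNWIND`. The sketch below is one way this might look; `build_rows` and `BATCH_QUERY` are hypothetical names, and the sample records are made up so the payload-building part runs without a database.

```python
def build_rows(records):
    """Turn (movie_id, embedding) pairs into the parameter list for UNWIND."""
    return [
        {"id": str(movie_id), "embedding": list(embedding)}
        for movie_id, embedding in records
        if embedding is not None  # skip movies without an embedding
    ]


# Hypothetical batched query: one round trip instead of one per movie.
BATCH_QUERY = """
UNWIND $rows AS row
MATCH (m:Movie {id: row.id})
SET m.taglineEmbedding = row.embedding
"""

rows = build_rows([(1, [0.1, 0.2]), (2, None), ("3", (0.3, 0.4))])
print(rows)

# In the notebook this could then be run as:
# rows = build_rows(zip(df["movieId"], df["taglineEmbedding"]))
# graph.query(BATCH_QUERY, params={"rows": rows})
```

Ids are converted to strings because the `LOAD CSV` step stored `Movie.id` as a string, so an integer id would not match any node.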

**Verify that the `taglineEmbedding` property was stored successfully**

In [20]:
graph.refresh_schema()
print(graph.schema)

Node properties are the following:
Movie {title: STRING, taglineEmbedding: LIST, tagline: STRING, imdbRating: FLOAT, id: STRING, released: DATE},Person {name: STRING},Genre {name: STRING}
Relationship properties are the following:

The relationships are the following:
(:Movie)-[:IN_GENRE]->(:Genre),(:Person)-[:DIRECTED]->(:Movie),(:Person)-[:ACTED_IN]->(:Movie)


In [21]:
result = graph.query("""
    MATCH (m:Movie) 
    WHERE m.tagline IS NOT NULL
    RETURN m.tagline, m.taglineEmbedding
    LIMIT 1
    """
)

In [22]:
result[0]['m.tagline']

'The adventure takes off!'

In [23]:
result[0]['m.taglineEmbedding']

[0.02379636839032173,
 -0.036385245621204376,
 -0.006874361541122198,
 -0.012947257608175278,
 -0.02100752852857113,
 0.028044788166880608,
 -0.016055380925536156,
 -0.0235748253762722,
 0.002005293732509017,
 -0.014426385052502155,
 0.01529952697455883,
 0.021906733512878418,
 0.0055092633701860905,
 -0.02496924437582493,
 0.0266634002327919,
 -0.021281199529767036,
 0.026611272245645523,
 -0.0358639694750309,
 0.01058847177773714,
 -0.007793114986270666,
 -0.00776053499430418,
 -0.012373850680887699,
 0.009891261346638203,
 0.01376175507903099,
 -0.005189979914575815,
 -0.00039075533277355134,
 0.011350841261446476,
 -0.0024402353446930647,
 0.03338789567351341,
 -0.006789654027670622,
 0.013996330089867115,
 0.012289143167436123,
 0.005453877151012421,
 -0.011168394237756729,
 -0.008711868897080421,
 -0.00540174962952733,
 -0.010510279797017574,
 -0.0034404387697577477,
 -0.0008869881276041269,
 -0.01247159019112587,
 0.00840561743825674,
 0.015064951963722706,
 -0.01717613078653812