<a href="https://colab.research.google.com/github/QED137/cineBoat/blob/main/CompletefinalProjectWBS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The project outline**
The Multimodal RAG for Film and Entertainment project aims to build an intelligent system that provides insights and recommendations about movies and TV shows by integrating a knowledge graph with multimodal data. It combines structured relationships (e.g., directors, genres, cast) from a Neo4j knowledge graph with embeddings from trailers, reviews, and metadata for enhanced retrieval and contextualized responses.

## Core Features
* Integrate structured data using a Neo4j knowledge graph.
* Process multimodal unstructured data (text, video, audio) into embeddings.
* Use a Retrieval-Augmented Generation (RAG) pipeline for intelligent recommendations and insights.

### Tools and Technologies

| Component | Tools/Technologies |
|-----------|--------------------|
| Knowledge Graph | Neo4j, Cypher |
| Text Embeddings | Hugging Face (BERT, GPT), OpenAI |
| Video Embeddings | CLIP, VideoMAE |
| Vector Search | Pinecone, FAISS, Weaviate |
| Visualization | Neo4j Browser, NetworkX, matplotlib |
| Language Generation | GPT-4, Hugging Face Transformers |

## Workflow
Step-by-Step Implementation
1. **Define Scope and Data Sources**
   * Target Output: Deliver contextualized recommendations (e.g., "Find movies like Inception that share a similar visual style and have a compelling plot").
   * Data Requirements:
     * Structured Data: Movies, TV shows, directors, genres, cast relationships.
     * Unstructured Data:
       * Video: Trailers, key scenes.
       * Text: Reviews, summaries, audience feedback.
       * Metadata: IMDb/Rotten Tomatoes scores, release year.
2. **Set Up Knowledge Graph (Neo4j)**

   a. Install Neo4j
      * Install Neo4j locally or use Neo4j AuraDB for a cloud-hosted option.

   b. Define Schema
      * Nodes: Movie, Director, Actor, Genre.
      * Relationships (as created by the loading code later in this notebook):
        * (:Director)-[:DIRECTED]->(:Movie)
        * (:Movie)-[:HAS_GENRE]->(:Genre)
        * (:Actor)-[:ACTED_IN]->(:Movie)

   c. Populate Data

3. **Process Multimodal Data**

   a. Text Data (Reviews, Metadata)
      * Use embeddings from transformer models such as BERT or OpenAI embedding models for feature extraction.

   b. Video Data (Trailers)
      * Extract video embeddings using models like CLIP or VideoMAE.

   c. Audio Data (Soundtracks, Dialogues)
      * Use models like OpenL3 or Wav2Vec for embedding extraction.

4. **Build RAG Workflow**

   a. Retrieve Structured Data
      * Use Neo4j to query relationships and basic metadata.

   b. Retrieve Unstructured Data
      * Use a vector database (e.g., Pinecone, Weaviate, or FAISS) to perform similarity searches on multimodal embeddings.

   c. Combine Results
      * Merge Neo4j results with vector database results in a unified response.
      * Example: Find movies similar to Inception using graph metadata and embedding similarity.

   d. Generate Contextual Responses
      * Use a language model like GPT-4 for natural language output.
5. **User Interaction**

   a. Visualization and Exploration (Neo4j Browser)
      * Visualize nodes and relationships directly in the Neo4j interface.

   b. Local Interface
      * Use Jupyter Notebooks for an interactive educational experience.
      * Include Python libraries like matplotlib or plotly for graph visualization.

## **Workflow Demonstration**

1.   Input a query like "Find action movies directed by James Cameron with a tone similar to Avatar."

2.   Query Neo4j for action movies by James Cameron.

3. Search vector embeddings for movies with similar visual and auditory tone.

4. Combine and display the results.
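
Steps 2 to 4 of this demonstration can be sketched in plain Python. This is a minimal illustration rather than the project's implementation: the candidate titles stand in for a Neo4j result, the three-dimensional vectors are made-up stand-ins for real trailer embeddings, and `rank_candidates` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_candidates(graph_titles, embeddings, reference_title, top_k=3):
    """Keep only graph-filtered titles and order them by similarity to the reference."""
    reference = embeddings[reference_title]
    scored = [
        (title, cosine_similarity(embeddings[title], reference))
        for title in graph_titles
        if title != reference_title and title in embeddings
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical result of the graph filter (step 2).
graph_titles = ["Avatar", "Aliens", "True Lies", "Terminator 2: Judgment Day"]

# Made-up toy embeddings standing in for the vector search (step 3).
embeddings = {
    "Avatar": [0.9, 0.1, 0.3],
    "Aliens": [0.8, 0.2, 0.4],
    "True Lies": [0.1, 0.9, 0.2],
    "Terminator 2: Judgment Day": [0.7, 0.3, 0.5],
}

# Step 4: combine both signals and display the ranking.
ranked = rank_candidates(graph_titles, embeddings, "Avatar")
for title, score in ranked:
    print(f"{title}: {score:.3f}")
```

The graph query narrows the candidate set with exact, structured filters, and the embedding similarity then orders what is left; a real pipeline would replace the dictionary lookup with a vector-database query.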

# Essential Installations

In [3]:

!pip install -q neo4j pandas python-dotenv
!pip install -q langchain langchain_community langchain_openai openai
!pip install -q streamlit streamlit-chat
!pip install -q keybert transformers
!pip install -q fuzzywuzzy
!npm install -qqq -U localtunnel

In [None]:
!npm audit report

# npm audit report

axios  0.8.1 - 0.27.2
Severity: moderate
Axios Cross-Site Request Forgery Vulnerability - https://github.com/advisories/GHSA-wf5p-g6vw-rhxx
fix available via `npm audit fix --force`
Will install localtunnel@1.8.3, which is a breaking change
node_modules/axios
  localtunnel  >=1.9.0
  Depends on vulnerable versions of axios
  node_modules/localtunnel

2 moderate severity vulnerabilities

To address all issues (including breaking changes), run:
  npm audit fix --force

### Essential Imports

In [4]:
import os
import warnings

import kagglehub
import pandas as pd
import requests
import streamlit as st
from dotenv import load_dotenv
from google.colab import userdata
from langchain_community.graphs import Neo4jGraph
from neo4j import GraphDatabase
from requests.auth import HTTPBasicAuth


# Setting up the environment

### On Colab
If you are running this code only in Colab, set up the environment with your own keys and passwords.

### Environment for the Streamlit app
Write your own `secrets.toml` file and then upload it to Colab by running the following command. To use the same naming convention as this notebook, store your passwords and keys in **secrets.toml** under the key names read by the `st.secrets[...]` lookups further down.

In [5]:
!mkdir -p ~/.streamlit
from google.colab import files
uploaded = files.upload()  # Use the upload dialog to select `secrets.toml`
!mv secrets.toml ~/.streamlit/  # Streamlit reads passwords and keys from here

Saving secrets.toml to secrets.toml


### Set the passwords and environment for the Streamlit app

In [6]:
uri = st.secrets["NEO4J_URI"]
username = st.secrets["NEO4J_USERNAME"]
password = st.secrets["NEO4J_PASSWORD"]
TMDB_API_KEY = st.secrets["TMDB_API"]
database = st.secrets["NEO4J_DATABASE"]
openai_key = st.secrets["OPENAI_API_KEY"]
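
For reference, the cell above reads six keys from `secrets.toml`. The sketch below renders a template file body with placeholder values. The key names are taken from the `st.secrets[...]` lookups; every value is a placeholder to replace with your own credentials, and `render_secrets_template` is a hypothetical helper, not part of Streamlit.

```python
# Key names come from the st.secrets lookups; the values are placeholders only.
REQUIRED_KEYS = [
    "NEO4J_URI",
    "NEO4J_USERNAME",
    "NEO4J_PASSWORD",
    "NEO4J_DATABASE",
    "TMDB_API",
    "OPENAI_API_KEY",
]

def render_secrets_template(keys):
    """Return TOML text with one placeholder string per key."""
    return "\n".join(f'{key} = "<your-{key.lower()}>"' for key in keys) + "\n"

template = render_secrets_template(REQUIRED_KEYS)
print(template)
```

Each line of the printed text has the form `KEY = "value"`, which is the flat top-level layout that `st.secrets["KEY"]` expects.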

## Setting up the Neo4j database

In [7]:
# Set up the connection to Neo4j and verify it with a trivial query

driver = None  # Defined up front so the finally block is safe if the connection fails
try:
    driver = GraphDatabase.driver(uri, auth=(username, password))
    with driver.session() as session:
        result = session.run("RETURN 'Connected to Neo4j from Colab!' AS message")
        for record in result:
            print(record["message"])
except Exception as e:
    print("Failed to connect to Neo4j:", e)
finally:
    if driver is not None:
        driver.close()


Connected to Neo4j from Colab!


In [None]:
from dotenv import load_dotenv
import os

from langchain_community.graphs import Neo4jGraph

# Warning control
import warnings
warnings.filterwarnings("ignore")

In [8]:
# Initialize a knowledge graph instance using LangChain's Neo4j integration
kg = Neo4jGraph(
    url=uri, username=username, password=password, database=database
)


In [9]:
cypher = """
  MATCH (n)
  RETURN count(n)
  """

In [10]:
result = kg.query(cypher)
result

[{'count(n)': 3678}]

In [11]:
# Column aliases work as in SQL
cypher = """
  MATCH (n)
  RETURN count(n) AS numberOfNodes
  """

In [None]:
result = kg.query(cypher)
result

[{'numberOfNodes': 3678}]

In [19]:
# Example Cypher query: titles of movies released from 2006 through 2019
cypher = """
  MATCH (m:Movie)
  WHERE m.year >= 2006
    AND m.year < 2020
  RETURN m.title LIMIT 10
  """
kg.query(cypher)


[{'m.title': 'Harry Potter and the Deathly Hallows: Part 2'},
 {'m.title': 'Office Christmas Party'},
 {'m.title': 'The Neon Demon'},
 {'m.title': 'Dangal'},
 {'m.title': '10 Cloverfield Lane'},
 {'m.title': 'Finding Dory'},
 {'m.title': "Miss Peregrine's Home for Peculiar Children"},
 {'m.title': 'Divergent'},
 {'m.title': 'Mike and Dave Need Wedding Dates'},
 {'m.title': 'Boyka: Undisputed IV'}]

In [13]:
cypher = """
MATCH (m:Movie)
WHERE m.year <= 2006
RETURN m.year LIMIT 10
"""
kg.query(cypher)

[{'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006},
 {'m.year': 2006}]

In [None]:
cypher = """
MATCH (m:Movie)
WHERE m.year <= 2006
RETURN m.year, m.title
"""
kg.query(cypher)

[{'m.year': 2006, 'm.title': 'Casino Royale'},
 {'m.year': 2006, 'm.title': 'Cars'},
 {'m.year': 2006, 'm.title': "Pan's Labyrinth"},
 {'m.year': 2006, 'm.title': 'Apocalypto'},
 {'m.year': 2006, 'm.title': 'The Host'},
 {'m.year': 2006, 'm.title': 'Children of Men'},
 {'m.year': 2006, 'm.title': 'The Devil Wears Prada'},
 {'m.year': 2006, 'm.title': 'The Fast and the Furious: Tokyo Drift'},
 {'m.year': 2006, 'm.title': 'Step Up'},
 {'m.year': 2006, 'm.title': 'Silent Hill'},
 {'m.year': 2006, 'm.title': 'Marie Antoinette'},
 {'m.year': 2006, 'm.title': 'The Lives of Others'},
 {'m.year': 2006, 'm.title': 'A Good Year'},
 {'m.year': 2006, 'm.title': 'Deja Vu'},
 {'m.year': 2006, 'm.title': 'The Break-Up'},
 {'m.year': 2006, 'm.title': 'Idiocracy'},
 {'m.year': 2006, 'm.title': 'Little Miss Sunshine'},
 {'m.year': 2006, 'm.title': "She's the Man"},
 {'m.year': 2006, 'm.title': 'X-Men: The Last Stand'},
 {'m.year': 2006, 'm.title': 'The Pursuit of Happyness'},
 {'m.year': 2006, 'm.title'

In [None]:
# Number of movies
cypher = """
  MATCH (m:Movie)
  RETURN count(m) AS numberOfMovies
  """
kg.query(cypher)

[{'numberOfMovies': 999}]

### Playing with Neo4j

In [None]:
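movie_title = "Inception"
# Pass the title as a query parameter instead of interpolating it into the
# Cypher string; this avoids quoting problems and Cypher injection.
cypher = """
MATCH (m:Movie {title: $title})
RETURN m.title AS title
"""
result = kg.query(cypher, {"title": movie_title})
if result:
    print(f"The movie '{movie_title}' exists in the database.")
else:
    print(f"The movie '{movie_title}' is not in the database.")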


The movie 'Inception' exists in the database.


# Downloading Kaggle data

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')




Mounted at /content/drive


In [None]:
from google.colab import files
# Upload kaggle.json here. Assigning the result keeps the file contents
# (your Kaggle API key) out of the cell output.
_ = files.upload()


Saving kaggle.json to kaggle.json

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json


In [None]:
import os

# Create the Kaggle directory
os.makedirs('/root/.kaggle', exist_ok=True)

# Move the kaggle.json file to the .kaggle directory
!mv kaggle.json /root/.kaggle/

# Change permissions
!chmod 600 /root/.kaggle/kaggle.json


In [None]:
# Download movie dataset
!kaggle datasets download -d rounakbanik/the-movies-dataset

# Unzip the downloaded dataset
!unzip the-movies-dataset.zip


Dataset URL: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
License(s): CC0-1.0
Downloading the-movies-dataset.zip to /content
 99% 226M/228M [00:11<00:00, 20.9MB/s]
100% 228M/228M [00:11<00:00, 20.9MB/s]
Archive:  the-movies-dataset.zip
  inflating: credits.csv             
  inflating: keywords.csv            
  inflating: links.csv               
  inflating: links_small.csv         
  inflating: movies_metadata.csv     
  inflating: ratings.csv             
  inflating: ratings_small.csv       


In [None]:
df_credits = pd.read_csv("/content/credits.csv")
df_credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [None]:
# Another way of importing data from Kaggle
import kagglehub

# Download the latest version of the dataset
path = kagglehub.dataset_download("rishabjadhav/imdb-actors-and-movies")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/rishabjadhav/imdb-actors-and-movies?dataset_version_number=1...


100%|██████████| 453M/453M [00:05<00:00, 90.4MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/rishabjadhav/imdb-actors-and-movies/versions/1


In [None]:
!ls {path}

combined.csv  names.csv  titles.csv


In [None]:
csv_file_path = f"{path}/combined.csv"
df_one = pd.read_csv(csv_file_path)

# Check the size of the DataFrame (rows, columns)
df_one.shape

(546421, 5)

In [None]:
df_one.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 546421 entries, 0 to 546420
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   primaryName        546421 non-null  object 
 1   birthYear          546421 non-null  int64  
 2   deathYear          184608 non-null  float64
 3   primaryProfession  546421 non-null  object 
 4   knownForTitle      546421 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 20.8+ MB


In [None]:
# Optional: move the downloaded files to Google Drive
# from google.colab import drive
# drive.mount('/content/drive')
# import shutil
# shutil.move(path, '/content/drive/My Drive/imdb-dataset')


Mounted at /content/drive


'/content/drive/My Drive/imdb-dataset'

In [None]:
# # IMDb actor dataset
# !kaggle datasets download -d rishabjadhav/imdb-actors-and-movies
# !unzip imdb-actors-and-movies.zip


In [None]:
df_movie_meta = pd.read_csv("/content/movies_metadata.csv")


In [None]:
df_rating=pd.read_csv("/content/ratings.csv")

In [None]:
df_rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556


In [None]:
df_keywords= pd.read_csv("/content/keywords.csv")

In [None]:
# Read the actor data from the copy stored on Google Drive
df_actor=pd.read_csv('/content/drive/MyDrive/imdb-dataset/combined.csv')

In [None]:
df_actor

Unnamed: 0,primaryName,birthYear,deathYear,primaryProfession,knownForTitle
0,Fred Astaire,1899,1987.0,"actor,miscellaneous,producer",The Towering Inferno
1,Lauren Bacall,1924,2014.0,"actress,soundtrack,archive_footage",To Have and Have Not
2,Brigitte Bardot,1934,,"actress,music_department,producer",Contempt
3,John Belushi,1949,1982.0,"actor,writer,music_department",Saturday Night Live
4,Ingmar Bergman,1918,2007.0,"writer,director,actor",Wild Strawberries
...,...,...,...,...,...
546416,William Riva,1919,1999.0,set_decorator,The Paul Winchell Show
546417,Frank J. Gaily,1915,2008.0,sound_department,After Hours
546418,Ben Ray Lujan,1972,,archive_footage,CNN Newsroom
546419,Henry Lawfull,2006,,actor,A Boy Called Christmas


## Extracting data from the movie database

In [None]:
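# Parse JSON-like columns (genres, production_companies, etc.)
import ast

def parse_column(data):
    """Return the 'name' of every entry in a stringified list of dicts."""
    try:
        return [item['name'] for item in ast.literal_eval(data)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Missing or malformed values yield an empty list
        return []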

In [None]:
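# `movies` is the metadata table loaded earlier from movies_metadata.csv
movies = df_movie_meta.copy()
movies['genres'] = movies['genres'].apply(parse_column)
movies['production_companies'] = movies['production_companies'].apply(parse_column)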
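
To see what `parse_column` does to a single cell, here is a small sketch. The sample string mimics the stringified list-of-dicts layout of the `genres` column; the ids and names in it are illustrative, and the function is repeated only so the snippet runs on its own.

```python
import ast

def parse_column(data):
    """Same logic as the parsing cell, repeated so this snippet is self-contained."""
    try:
        return [item["name"] for item in ast.literal_eval(data)]
    except (ValueError, SyntaxError, TypeError, KeyError):
        return []

sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
print(parse_column(sample))        # ['Animation', 'Comedy']
print(parse_column("not a list"))  # [] (the string is not a valid Python literal)
print(parse_column(float("nan")))  # [] (missing values are not strings)
```

Returning an empty list for bad input keeps the column type uniform, at the cost of silently hiding malformed rows; counting the empty results afterwards is a cheap sanity check.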


In [None]:
movies['production_companies']

Unnamed: 0,production_companies
0,[Pixar Animation Studios]
1,"[TriStar Pictures, Teitler Film, Interscope Co..."
2,"[Warner Bros., Lancaster Gate]"
3,[Twentieth Century Fox Film Corporation]
4,"[Sandollar Productions, Touchstone Pictures]"
...,...
45461,[]
45462,[Sine Olivia]
45463,[American World Pictures]
45464,[Yermoliev]


### Complete movie dataset

In [None]:
# Another movie dataset, with about a million movies
import kagglehub

# Download latest version
path = kagglehub.dataset_download("alanvourch/tmdb-movies-daily-updates")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/alanvourch/tmdb-movies-daily-updates?dataset_version_number=299...


100%|██████████| 249M/249M [00:13<00:00, 19.1MB/s]

Extracting files...





Path to dataset files: /root/.cache/kagglehub/datasets/alanvourch/tmdb-movies-daily-updates/versions/299


In [None]:
!ls {path}

TMDB_all_movies.csv


In [None]:
movie_path= f"{path}/TMDB_all_movies.csv"

In [None]:
movie_df= pd.read_csv(movie_path)

In [None]:
# # Load the original dataset
# movie_df = pd.read_csv(movie_path)

# # Subset the first 45,000 rows using .iloc
# subset_df = movie_df.iloc[:44990]

# # Save the subset to a new CSV file
# subset_df.to_csv("subset_movie_data.csv", index=False, encoding='utf-8')


In [None]:
# with open('subset_movie_data.csv', 'r') as f:
#     for i in range(10):  # Print the first 10 lines
#         print(f.readline())


id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,original_language,original_title,overview,popularity,tagline,genres,production_companies,production_countries,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes,poster_path

2,Ariel,7.1,335.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,fi,Ariel,"After the coal mine he works at closes and his father commits suicide, a Finnish man leaves for the city to make a living but there, he is framed and imprisoned for various crimes.",11.915,,"Comedy, Drama, Romance, Crime",Villealfa Filmproductions,Finland,suomi,"Heikki Salomaa, Hanna Jokinen, Matti Pellonpää, Pekka Wilen, Timo Toikka, Tomi Salmela, Tarja Keinänen, Kari Helaseppä, Markku Rantala, Heikki Anttila, Esko Nikkari, Timo Harakka, Mikko Remes, Erkki Pajala, Eetu Hilkamo, Hannu Kivisalo, Matti Jaaranen, Esko Salminen, Jaakko Talaskivi, Turo Pajala, Susanna Haavisto, Merja Pulkkinen, Pentti Auer, J

### IMDb database

In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("yusufdelikkaya/imdb-movie-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/yusufdelikkaya/imdb-movie-dataset?dataset_version_number=1...


100%|██████████| 134k/134k [00:00<00:00, 37.7MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/yusufdelikkaya/imdb-movie-dataset/versions/1





In [None]:
!ls {path}

imdb_movie_dataset.csv


In [None]:
dat_path= f"{path}/imdb_movie_dataset.csv"

In [None]:
df_imdb = pd.read_csv(dat_path)
df_imdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Rank                1000 non-null   int64  
 1   Title               1000 non-null   object 
 2   Genre               1000 non-null   object 
 3   Description         1000 non-null   object 
 4   Director            1000 non-null   object 
 5   Actors              1000 non-null   object 
 6   Year                1000 non-null   int64  
 7   Runtime (Minutes)   1000 non-null   int64  
 8   Rating              1000 non-null   float64
 9   Votes               1000 non-null   int64  
 10  Revenue (Millions)  872 non-null    float64
 11  Metascore           936 non-null    float64
dtypes: float64(3), int64(4), object(5)
memory usage: 93.9+ KB


### IMDb data processing

In [None]:
def parse_actors(actor_data):
    if isinstance(actor_data, str):
        return actor_data.split(", ")  # Split actors by ", "
    return []

df_imdb['parsed_actors'] = df_imdb['Actors'].apply(parse_actors)

# Create actor_df with movie_id and individual actor names
actor_data = []
for _, row in df_imdb.iterrows():
    for actor in row['parsed_actors']:
        actor_data.append({'movie_id': row['Rank'], 'actor_name': actor})

actor_df = pd.DataFrame(actor_data)
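
One caveat with splitting on the exact string `", "`: the `Actors` field is not always delimited consistently. The entry for "Sing" in this dataset reads `Matthew McConaughey,Reese Witherspoon`, with no space after the comma, so the two names stay fused. A possible variant, sketched here under the assumption that actor names never contain commas, splits on the bare comma and strips whitespace. `split_names` is a hypothetical helper name.

```python
def split_names(raw):
    """Split a comma-separated field into trimmed, non-empty names."""
    if not isinstance(raw, str):
        return []
    return [name.strip() for name in raw.split(",") if name.strip()]

print(split_names("Matthew McConaughey,Reese Witherspoon"))
print(split_names("Matt Damon, Tian Jing, Willem Dafoe, Andy Lau"))
print(split_names(None))
```

The same helper would also work for the `Genre` column, which uses commas without spaces.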


In [None]:
df_imdb.head(10)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0
5,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0
6,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0
7,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0
8,9,The Lost City of Z,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0
9,10,Passengers,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0


In [None]:
actor_df.head(10)

Unnamed: 0,movie_id,actor_name
0,1,Chris Pratt
1,1,Vin Diesel
2,1,Bradley Cooper
3,1,Zoe Saldana
4,2,Noomi Rapace
5,2,Logan Marshall-Green
6,2,Michael Fassbender
7,2,Charlize Theron
8,3,James McAvoy
9,3,Anya Taylor-Joy


In [None]:
# parse director
director_data = []
for _, row in df_imdb.iterrows():
    director_data.append({'movie_id': row['Rank'], 'director_name': row['Director']})

director_df = pd.DataFrame(director_data)


In [None]:
director_df.head(10)

Unnamed: 0,movie_id,director_name
0,1,James Gunn
1,2,Ridley Scott
2,3,M. Night Shyamalan
3,4,Christophe Lourdelet
4,5,David Ayer
5,6,Yimou Zhang
6,7,Damien Chazelle
7,8,Sean Foley
8,9,James Gray
9,10,Morten Tyldum


In [None]:
def parse_genres(genre_data):
    if isinstance(genre_data, str):
        return genre_data.split(",")  # Split genres on "," (this column has no space after the comma)
    return []

df_imdb['parsed_genres'] = df_imdb['Genre'].apply(parse_genres)

# Create genre_df with movie_id and individual genre names
genre_data = []
for _, row in df_imdb.iterrows():
    for genre in row['parsed_genres']:
        genre_data.append({'movie_id': row['Rank'], 'genre_name': genre})

genre_df = pd.DataFrame(genre_data)


In [None]:
genre_df.head(10)

Unnamed: 0,movie_id,genre_name
0,1,Action
1,1,Adventure
2,1,Sci-Fi
3,2,Adventure
4,2,Mystery
5,2,Sci-Fi
6,3,Horror
7,3,Thriller
8,4,Animation
9,4,Comedy


In [None]:
df_imdb.head(10)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,parsed_actors,parsed_genres
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,"[Chris Pratt, Vin Diesel, Bradley Cooper, Zoe ...","[Action, Adventure, Sci-Fi]"
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,"[Noomi Rapace, Logan Marshall-Green, Michael F...","[Adventure, Mystery, Sci-Fi]"
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,"[James McAvoy, Anya Taylor-Joy, Haley Lu Richa...","[Horror, Thriller]"
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,"[Matthew McConaughey,Reese Witherspoon, Seth M...","[Animation, Comedy, Family]"
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,"[Will Smith, Jared Leto, Margot Robbie, Viola ...","[Action, Adventure, Fantasy]"
5,6,The Great Wall,"Action,Adventure,Fantasy",European mercenaries searching for black powde...,Yimou Zhang,"Matt Damon, Tian Jing, Willem Dafoe, Andy Lau",2016,103,6.1,56036,45.13,42.0,"[Matt Damon, Tian Jing, Willem Dafoe, Andy Lau]","[Action, Adventure, Fantasy]"
6,7,La La Land,"Comedy,Drama,Music",A jazz pianist falls for an aspiring actress i...,Damien Chazelle,"Ryan Gosling, Emma Stone, Rosemarie DeWitt, J....",2016,128,8.3,258682,151.06,93.0,"[Ryan Gosling, Emma Stone, Rosemarie DeWitt, J...","[Comedy, Drama, Music]"
7,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0,"[Essie Davis, Andrea Riseborough, Julian Barra...",[Comedy]
8,9,The Lost City of Z,"Action,Adventure,Biography","A true-life drama, centering on British explor...",James Gray,"Charlie Hunnam, Robert Pattinson, Sienna Mille...",2016,141,7.1,7188,8.01,78.0,"[Charlie Hunnam, Robert Pattinson, Sienna Mill...","[Action, Adventure, Biography]"
9,10,Passengers,"Adventure,Drama,Romance",A spacecraft traveling to a distant colony pla...,Morten Tyldum,"Jennifer Lawrence, Chris Pratt, Michael Sheen,...",2016,116,7.0,192177,100.01,41.0,"[Jennifer Lawrence, Chris Pratt, Michael Sheen...","[Adventure, Drama, Romance]"


In [None]:
movie_df = df_imdb.drop(columns=['Actors', 'Director', 'Genre', 'parsed_actors', 'parsed_genres'])


In [None]:
movie_df.head(10)

Unnamed: 0,Rank,Title,Description,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,A group of intergalactic criminals are forced ...,2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Following clues to the origin of mankind, a te...",2012,124,7.0,485820,126.46,65.0
2,3,Split,Three girls are kidnapped by a man with a diag...,2016,117,7.3,157606,138.12,62.0
3,4,Sing,"In a city of humanoid animals, a hustling thea...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,A secret government agency recruits some of th...,2016,123,6.2,393727,325.02,40.0
5,6,The Great Wall,European mercenaries searching for black powde...,2016,103,6.1,56036,45.13,42.0
6,7,La La Land,A jazz pianist falls for an aspiring actress i...,2016,128,8.3,258682,151.06,93.0
7,8,Mindhorn,A has-been actor best known for playing the ti...,2016,89,6.4,2490,,71.0
8,9,The Lost City of Z,"A true-life drama, centering on British explor...",2016,141,7.1,7188,8.01,78.0
9,10,Passengers,A spacecraft traveling to a distant colony pla...,2016,116,7.0,192177,100.01,41.0


### Writing IMDb data to Neo4j

In [None]:
!pip install neo4j --upgrade



In [None]:
import pandas as pd
from neo4j.exceptions import ConstraintError  # Raised when a uniqueness constraint is violated

for _, row in movie_df.iterrows():
    # Modified cypher query to use MERGE on title instead of id
    cypher = """
    MERGE (m:Movie {title: $title})
    ON CREATE SET m.id = $id,
                    m.description = $description,
                    m.year = $year,
                    m.runtime = $runtime,
                    m.rating = $rating,
                    m.votes = $votes,
                    m.revenue = $revenue,
                    m.metascore = $metascore
    ON MATCH SET m.description = $description,
                    m.year = $year,
                    m.runtime = $runtime,
                    m.rating = $rating,
                    m.votes = $votes,
                    m.revenue = $revenue,
                    m.metascore = $metascore
    """
    parameters = {
        "id": row["Rank"],
        "title": row["Title"],
        "description": row["Description"],
        "year": row["Year"],
        "runtime": row["Runtime (Minutes)"],
        "rating": row["Rating"],
        "votes": row["Votes"],
        "revenue": row["Revenue (Millions)"],
        "metascore": row["Metascore"]
    }

    # Try to create/update the node, catching ConstraintError
    try:
        kg.query(cypher, parameters)
    except ConstraintError as e:
        # Report the conflicting title, then skip to the next movie.
        # The DataFrame column is 'Title' (capitalized), not 'title'.
        print(f"ConstraintError: {e}")
        print(f"Conflicting movie title: {row['Title']}")
        continue
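
`Revenue (Millions)` and `Metascore` contain missing values (872 and 936 non-null entries out of 1000), which pandas represents as `NaN`. Passed through unchanged, those would be written as `NaN` floats. One option, sketched below, is to convert them to `None` first: in Cypher, setting a property to `null` leaves it unset. `clean_parameters` is a hypothetical helper, not part of the notebook.

```python
import math

def clean_parameters(parameters):
    """Replace float NaN values with None so Neo4j leaves those properties unset."""
    return {
        key: None if isinstance(value, float) and math.isnan(value) else value
        for key, value in parameters.items()
    }

# "Mindhorn" has no revenue figure in the dataset preview.
raw = {"title": "Mindhorn", "revenue": float("nan"), "metascore": 71.0}
print(clean_parameters(raw))  # {'title': 'Mindhorn', 'revenue': None, 'metascore': 71.0}
```

In the loop above this would be used as `kg.query(cypher, clean_parameters(parameters))`.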

### Relationships among movies, actors, directors, and genres in Neo4j

In [None]:
#Create Actor Nodes and ACTED_IN Relationships
for _, row in actor_df.iterrows():
    # Create Actor node
    actor_cypher = """
    MERGE (a:Actor {name: $actor_name})
    """
    kg.query(actor_cypher, {"actor_name": row["actor_name"]})

    # Create ACTED_IN relationship
    relationship_cypher = """
    MATCH (a:Actor {name: $actor_name}), (m:Movie {id: $movie_id})
    MERGE (a)-[:ACTED_IN]->(m)
    """
    kg.query(relationship_cypher, {"actor_name": row["actor_name"], "movie_id": row["movie_id"]})


In [None]:
# Create Director Nodes and DIRECTED Relationships
for _, row in director_df.iterrows():
    # Create Director node
    director_cypher = """
    MERGE (d:Director {name: $director_name})
    """
    kg.query(director_cypher, {"director_name": row["director_name"]})

    # Create DIRECTED relationship
    relationship_cypher = """
    MATCH (d:Director {name: $director_name}), (m:Movie {id: $movie_id})
    MERGE (d)-[:DIRECTED]->(m)
    """
    kg.query(relationship_cypher, {"director_name": row["director_name"], "movie_id": row["movie_id"]})


In [None]:
# Create Genre nodes and HAS_GENRE relationships
for _, row in genre_df.iterrows():
    # Create Genre node
    genre_cypher = """
    MERGE (g:Genre {name: $genre_name})
    """
    kg.query(genre_cypher, {"genre_name": row["genre_name"]})

    # Create HAS_GENRE relationship
    relationship_cypher = """
    MATCH (g:Genre {name: $genre_name}), (m:Movie {id: $movie_id})
    MERGE (m)-[:HAS_GENRE]->(g)
    """
    kg.query(relationship_cypher, {"genre_name": row["genre_name"], "movie_id": row["movie_id"]})
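
The three loops above issue two queries per row, which gets slow as the tables grow. A common alternative is to send rows in batches and let Cypher's `UNWIND` iterate over them on the server. The sketch below shows the idea for the genre table. `run_query` is a stand-in for something like `kg.query`, the batch size is arbitrary, and the helper names are hypothetical.

```python
# One query handles a whole batch: UNWIND turns the $rows list into individual rows.
GENRE_BATCH_CYPHER = """
UNWIND $rows AS row
MERGE (g:Genre {name: row.genre_name})
WITH row, g
MATCH (m:Movie {id: row.movie_id})
MERGE (m)-[:HAS_GENRE]->(g)
"""

def chunked(records, size):
    """Yield consecutive slices of `records` with at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

def write_in_batches(records, run_query, size=500):
    """Call run_query once per batch and return the number of batches sent."""
    batches = 0
    for batch in chunked(records, size):
        run_query(GENRE_BATCH_CYPHER, {"rows": batch})
        batches += 1
    return batches

# Demo with a fake runner so the snippet works without a database.
sent = []
records = [
    {"movie_id": 1, "genre_name": "Action"},
    {"movie_id": 1, "genre_name": "Adventure"},
    {"movie_id": 2, "genre_name": "Mystery"},
]
n = write_in_batches(records, lambda query, params: sent.append(params), size=2)
print(n, [len(p["rows"]) for p in sent])  # 2 [2, 1]
```

Against the real graph, the call would look like `write_in_batches(genre_df.to_dict("records"), kg.query)`; the actor and director tables would follow the same pattern with their own Cypher strings.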


## Data preprocessing

In [None]:
# import pandas as pd
# import ast

# # Load the full dataset
# #movie_df = pd.read_csv(movie_path, encoding='utf-8', on_bad_lines='skip')

# # Handle missing values
# movie_dat.fillna("Unknown", inplace=True)

# # Parse JSON-like columns (if necessary)
# def parse_column(data):
#     if isinstance(data, str):  # Ensure the value is a string
#         try:
#             # Safely evaluate JSON-like strings
#             return [item['name'] for item in ast.literal_eval(data)]
#         except (ValueError, SyntaxError, KeyError, TypeError):
#             # Return empty list if parsing fails
#             return []
#     return []  # Return empty list for non-string values


# movie_dat['genres'] = movie_dat['genres'].apply(parse_column)
# movie_dat['production_companies'] = movie_dat['production_companies'].apply(parse_column)
# movie_dat['cast'] = movie_dat['cast'].apply(lambda x: parse_column(x)[:10])  # Limit to top 10 actors

# # Validate the processed data
# print(movie_dat.head())


# movie_dat.to_csv('processed_movie_data.csv', index=False, encoding='utf-8')


In [None]:
# Drop the 'poster_path' column
movie_dat = movie_df.drop('poster_path', axis='columns')

In [None]:
movie_dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1021773 entries, 0 to 1021772
Data columns (total 28 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   id                       1021773 non-null  int64 
 1   title                    1021773 non-null  object
 2   vote_average             1021773 non-null  object
 3   vote_count               1021773 non-null  object
 4   status                   1021773 non-null  object
 5   release_date             1021773 non-null  object
 6   revenue                  1021773 non-null  object
 7   runtime                  1021773 non-null  object
 8   budget                   1021773 non-null  object
 9   imdb_id                  1021773 non-null  object
 10  original_language        1021773 non-null  object
 11  original_title           1021773 non-null  object
 12  overview                 1021773 non-null  object
 13  popularity               1021773 non-null  object
 14  ta

In [None]:
# Preprocessing
'''
movie_dat has cast and director columns, and each cell holds a list of values:
several actors or directors in one column.
We need to process these so that there is one item per row.
'''

import pandas as pd

# Parse the cast column into lists
def parse_cast(cast_data):
    if isinstance(cast_data, str):
        return cast_data.split(", ")  # Split comma-separated actors
    return []

# Apply the parsing function
movie_dat['parsed_cast'] = movie_dat['cast'].apply(parse_cast)

# Create actor_df with movie_id and individual actor names
actor_data = []
for _, row in movie_dat.iterrows():
    for actor in row['parsed_cast']:
        if actor.lower() != "unknown":  # Exclude 'unknown' values
            actor_data.append({'movie_id': row['id'], 'actor_name': actor})

actor_df = pd.DataFrame(actor_data)


In [None]:
actor_df.head(20)

Unnamed: 0,movie_id,actor_name
0,2,Heikki Salomaa
1,2,Hanna Jokinen
2,2,Matti Pellonpää
3,2,Pekka Wilen
4,2,Timo Toikka
5,2,Tomi Salmela
6,2,Tarja Keinänen
7,2,Kari Helaseppä
8,2,Markku Rantala
9,2,Heikki Anttila


In [None]:
#movie_dat['cast']
# Subset the first 45,995 rows for Neo4j
subset_df = movie_dat.iloc[:45995].copy()

# Save the subset as a new CSV file
subset_df.to_csv('movie_data_set.csv', index=False, encoding='utf-8')

# Verify the subset
print(subset_df.head())
print(f"Subset size: {subset_df.shape}")


   id                             title vote_average vote_count    status  \
0   2                             Ariel          7.1      335.0  Released   
1   3               Shadows in Paradise          7.3      369.0  Released   
2   5                        Four Rooms          5.8     2628.0  Released   
3   6                    Judgment Night          6.5      331.0  Released   
4   8  Life in Loops (A Megacities RMX)          7.5       27.0  Released   

  release_date     revenue runtime      budget    imdb_id  ...  \
0   1988-10-21         0.0    73.0         0.0  tt0094675  ...   
1   1986-10-17         0.0    74.0         0.0  tt0092149  ...   
2   1995-12-09   4257354.0    98.0   4000000.0  tt0113101  ...   
3   1993-10-15  12136938.0   109.0  21000000.0  tt0107286  ...   
4   2006-01-01         0.0    80.0     42000.0  tt0825671  ...   

       production_countries                        spoken_languages cast  \
0                   Finland                                   su

In [None]:
import pandas as pd

# `on_bad_lines` replaces the older `error_bad_lines` argument. It can be set to
# 'skip' to skip bad lines, 'warn' to issue a warning and continue, or 'error' to raise.
# engine='python' can be slower but may handle inconsistent rows better.
sub = pd.read_csv('movie_data_set.csv', on_bad_lines='skip', sep=',', engine='python')

In [None]:
sub['cast'][0]

'[]'
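
The value shown above is the string `'[]'`, not an empty list, so splitting it on commas would yield an actor literally named `[]`. A minimal sketch of a more defensive splitter follows. The function name `split_names` and the set of placeholder strings are assumptions made for illustration, not part of the notebook.

```python
from typing import List

# Placeholder strings treated as "no data" (an assumption; adjust to the dataset)
PLACEHOLDERS = {"", "[]", "unknown", "nan"}

def split_names(raw: object) -> List[str]:
    """Split a comma-separated name string into a clean list of names."""
    if not isinstance(raw, str):
        return []
    names = [part.strip() for part in raw.split(",")]
    return [name for name in names if name.lower() not in PLACEHOLDERS]

print(split_names("Heikki Salomaa, Hanna Jokinen,Matti Pellonpää"))
print(split_names("[]"))
print(split_names(None))
```

Stripping whitespace around each name also avoids creating separate nodes for `"Hanna Jokinen"` and `" Hanna Jokinen"`.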

In [None]:
sub.head(10)

Unnamed: 0,id,title,vote_average,vote_count,status,release_date,revenue,runtime,budget,imdb_id,...,production_countries,spoken_languages,cast,director,director_of_photography,writers,producers,music_composer,imdb_rating,imdb_votes
0,2,Ariel,7.1,335.0,Released,1988-10-21,0.0,73.0,0.0,tt0094675,...,Finland,suomi,[],Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Aki Kaurismäki,Unknown,7.4,8790.0
1,3,Shadows in Paradise,7.3,369.0,Released,1986-10-17,0.0,74.0,0.0,tt0092149,...,Finland,"suomi, English, svenska",[],Aki Kaurismäki,Timo Salminen,Aki Kaurismäki,Mika Kaurismäki,Unknown,7.5,7552.0
2,5,Four Rooms,5.8,2628.0,Released,1995-12-09,4257354.0,98.0,4000000.0,tt0113101,...,United States of America,English,[],"Quentin Tarantino, Robert Rodriguez, Alexandre...","Guillermo Navarro, Andrzej Sekula, Phil Parmet...","Quentin Tarantino, Robert Rodriguez, Alexandre...","Quentin Tarantino, Alexandre Rockwell, Lawrenc...",Combustible Edison,6.7,112688.0
3,6,Judgment Night,6.5,331.0,Released,1993-10-15,12136938.0,109.0,21000000.0,tt0107286,...,United States of America,English,[],Stephen Hopkins,Peter Levy,"Jere Cunningham, Lewis Colick","Gene Levy, Lloyd Segan, Marilyn Vance",Alan Silvestri,6.6,19322.0
4,8,Life in Loops (A Megacities RMX),7.5,27.0,Released,2006-01-01,0.0,80.0,42000.0,tt0825671,...,Austria,"English, हिन्दी, 日本語, Pусский, Español",[],Timo Novotny,Wolfgang Thaler,"Michael Glawogger, Timo Novotny","Ulrich Gehmacher, Timo Novotny",Unknown,8.2,284.0
5,9,Sunday in August,7.135,26.0,Released,2004-09-02,0.0,15.0,0.0,tt0425473,...,Germany,Deutsch,[],"Marc Meyer, Anna Haas",Peter Polsak-Lohmann,Marc Meyer,Marc Meyer,Christian Biegai,6.8,14.0
6,11,Star Wars,8.204,20568.0,Released,1977-05-25,775398007.0,121.0,11000000.0,tt0076759,...,United States of America,English,[],George Lucas,Gilbert Taylor,George Lucas,"Rick McCallum, George Lucas, Gary Kurtz",John Williams,8.6,1480052.0
7,12,Finding Nemo,7.819,19180.0,Released,2003-05-30,940335536.0,100.0,94000000.0,tt0266543,...,United States of America,English,[],Andrew Stanton,"Sharon Calahan, Jeremy Lasky","Jim Capobianco, Blake Tucker, Andrew Stanton, ...","Graham Walters, John Lasseter",Thomas Newman,8.2,1137252.0
8,13,Forrest Gump,8.471,27417.0,Released,1994-06-23,677387716.0,142.0,55000000.0,tt0109830,...,United States of America,English,[],Robert Zemeckis,Don Burgess,"Winston Groom, Eric Roth","Steve Tisch, Steve Starkey, Wendy Finerman",Alan Silvestri,8.8,2321918.0
9,14,American Beauty,8.0,11990.0,Released,1999-09-15,356296601.0,122.0,15000000.0,tt0169547,...,United States of America,English,[],Sam Mendes,Conrad L. Hall,Alan Ball,"Bruce Cohen, Dan Jinks",Thomas Newman,8.3,1230664.0


In [None]:
subset_df = movie_df.iloc[:45995].copy()


## Writing movie data to neo4j

In [None]:
import pandas as pd
from neo4j.exceptions import ConstraintError  # ConstraintError lives in neo4j.exceptions

# Use 'on_bad_lines' instead of 'error_bad_lines'
# Try using 'engine='python'' for more robust parsing
sub = pd.read_csv('movie_data_set.csv', on_bad_lines='skip', sep=',', engine='python')
#The on_bad_lines argument can be set to 'skip' to skip bad lines,
#'warn' to issue a warning and continue, or 'error' to raise an error.
# Using engine='python' can be slower but might handle inconsistencies better



for _, row in subset_df.iterrows():
    cypher = """
    MERGE (m:Movie {id: $id})
    ON CREATE SET m.title = $title,
                  m.release_date = $release_date,
                  m.vote_average = $vote_average,
                  m.vote_count = $vote_count,
                  m.revenue = $revenue,
                  m.budget = $budget,
                  m.runtime = $runtime
    ON MATCH SET m.release_date = COALESCE(m.release_date, $release_date),
                 m.vote_average = COALESCE(m.vote_average, $vote_average),
                 m.vote_count = COALESCE(m.vote_count, $vote_count),
                 m.revenue = COALESCE(m.revenue, $revenue),
                 m.budget = COALESCE(m.budget, $budget),
                 m.runtime = COALESCE(m.runtime, $runtime);
    """

    parameters = {
        "id": row["id"],
        "title": row["title"],
        "release_date": row["release_date"],
        "vote_average": row["vote_average"],
        "vote_count": row["vote_count"],
        "revenue": row["revenue"],
        "budget": row["budget"],
        "runtime": row["runtime"]
    }

    # Try to create/update the node, catching ConstraintError
    try:
        kg.query(cypher, parameters)
    except ConstraintError as e:
        # If ConstraintError, print the error message and the conflicting movie title
        print(f"ConstraintError: {e}")
        print(f"Conflicting movie title: {row['title']}")
        # You can choose to skip the current movie, log it, or handle it differently
        # For example, to skip:
        # continue

In [None]:
!pip install neo4j --upgrade



In [None]:
import pandas as pd
from neo4j.exceptions import ConstraintError  # Import ConstraintError from neo4j.exceptions

# Use 'on_bad_lines' instead of 'error_bad_lines'
# Try using 'engine='python'' for more robust parsing
sub = pd.read_csv('movie_data_set.csv', on_bad_lines='skip', sep=',', engine='python')
#The on_bad_lines argument can be set to 'skip' to skip bad lines,
#'warn' to issue a warning and continue, or 'error' to raise an error.
# Using engine='python' can be slower but might handle inconsistencies better



for _, row in subset_df.iterrows():
    cypher = """
    MERGE (m:Movie {id: $id})
    ON CREATE SET m.title = $title,
                  m.release_date = $release_date,
                  m.vote_average = $vote_average,
                  m.vote_count = $vote_count,
                  m.revenue = $revenue,
                  m.budget = $budget,
                  m.runtime = $runtime
    ON MATCH SET m.release_date = COALESCE(m.release_date, $release_date),
                 m.vote_average = COALESCE(m.vote_average, $vote_average),
                 m.vote_count = COALESCE(m.vote_count, $vote_count),
                 m.revenue = COALESCE(m.revenue, $revenue),
                 m.budget = COALESCE(m.budget, $budget),
                 m.runtime = COALESCE(m.runtime, $runtime);
    """

    parameters = {
        "id": row["id"],
        "title": row["title"],
        "release_date": row["release_date"],
        "vote_average": row["vote_average"],
        "vote_count": row["vote_count"],
        "revenue": row["revenue"],
        "budget": row["budget"],
        "runtime": row["runtime"]
    }

    # Try to create/update the node, catching ConstraintError
    try:
        kg.query(cypher, parameters)
    except ConstraintError as e:
        # If ConstraintError, print the error message and the conflicting movie title
        print(f"ConstraintError: {e}")
        print(f"Conflicting movie title: {row['title']}")
        # You can choose to skip the current movie, log it, or handle it differently
        # For example, to skip:
        # continue

### Adding poster and trailer link to the Neo4j database
Using the OMDb API, we fetch each movie's poster URL and plot description.

In [None]:
import os
import requests
from google.colab import userdata  # the API key is stored as a Colab secret

# Copy the OMDb API key from the Colab secret into the environment
os.environ["OMDBI_API_TOKEN"] = userdata.get('OMDB_api')
api_key = os.environ.get("OMDBI_API_TOKEN")

if not api_key:
    # Raise an error rather than exiting the interpreter
    raise RuntimeError("API key not found in the environment.")

# Function to fetch the poster URL and plot description from OMDb
def fetch_omdb_data(title, year=None):
    # Passing `params` lets requests URL-encode titles with spaces or symbols
    params = {"t": title, "apikey": api_key}
    if year:
        params["y"] = year
    response = requests.get("http://www.omdbapi.com/", params=params)
    if response.status_code == 200:
        data = response.json()
        if data.get("Response") == "True":
            return {
                "poster_url": data.get("Poster"),
                "description": data.get("Plot")
            }
        else:
            print(f"OMDb Error: {data.get('Error')} for title: {title}")
    else:
        print(f"HTTP Error: {response.status_code}")
    return None
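
Titles that contain spaces or characters such as `&` need to be percent-encoded before they are placed in a query string. The sketch below shows what that encoding looks like using only the standard library. The helper name `build_omdb_url` and the key `"DEMO_KEY"` are placeholders for illustration.

```python
from urllib.parse import urlencode

OMDB_ENDPOINT = "http://www.omdbapi.com/"

def build_omdb_url(title, api_key, year=None):
    """Return an OMDb lookup URL with all query values percent-encoded."""
    params = {"t": title, "apikey": api_key}
    if year is not None:
        params["y"] = year
    return f"{OMDB_ENDPOINT}?{urlencode(params)}"

# "DEMO_KEY" is a placeholder, not a real key
print(build_omdb_url("Fast & Furious", "DEMO_KEY", year=2009))
```

Without encoding, the `&` in the title would be read as a parameter separator and the lookup would search for `Fast ` only.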


In [None]:
result = fetch_omdb_data("Guardians of the Galaxy", year=2014)
print(result)


{'poster_url': 'https://m.media-amazon.com/images/M/MV5BM2ZmNjQ2MzAtNDlhNi00MmQyLWJhZDMtNmJiMjFlOWY4MzcxXkEyXkFqcGc@._V1_SX300.jpg', 'description': 'A group of intergalactic criminals must pull together to stop a fanatical warrior with plans to purge the universe.'}


Collecting keybert
  Downloading keybert-0.8.5-py3-none-any.whl.metadata (15 kB)
Downloading keybert-0.8.5-py3-none-any.whl (37 kB)
Installing collected packages: keybert
Successfully installed keybert-0.8.5


In [None]:
from keybert import KeyBERT

# Initialize KeyBERT model
kw_model = KeyBERT()

def extract_keywords(text):
    keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 2), stop_words="english")
    return [kw[0] for kw in keywords]

# Example: Use the description to generate keywords
movie_data = fetch_omdb_data("Guardians of the Galaxy", year=2014)
if movie_data:
    keywords = extract_keywords(movie_data["description"])
    print("Keywords:", keywords)


Keywords: ['intergalactic criminals', 'group intergalactic', 'intergalactic', 'purge universe', 'criminals pull']
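
Several of the keyphrases above overlap ("intergalactic" appears three times). One simple way to thin them out, sketched here as an assumption rather than something the notebook does, is to drop any phrase whose words are all contained in a phrase that was already kept. The helper name `drop_contained_phrases` is hypothetical.

```python
from typing import List

def drop_contained_phrases(phrases: List[str]) -> List[str]:
    """Keep a phrase only if its words are not all covered by an earlier kept phrase."""
    kept: List[str] = []
    for phrase in phrases:
        words = set(phrase.split())
        if not any(words <= set(other.split()) for other in kept):
            kept.append(phrase)
    return kept

# The keyword list printed by the cell above
keywords = ['intergalactic criminals', 'group intergalactic', 'intergalactic',
            'purge universe', 'criminals pull']
print(drop_contained_phrases(keywords))
```

KeyBERT's `extract_keywords` also accepts `use_mmr` and `diversity` arguments, which may be a better fit when more varied keyphrases are wanted directly from the model.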


### Creating taglines and keywords using KeyBERT
In this section we pull the `description` property of each movie from Neo4j, then generate a tagline and a few keywords for that movie. Finally, we write the keywords and taglines back to Neo4j.

In [None]:
from keybert import KeyBERT

# Initialize KeyBERT
kw_model = KeyBERT()

# Fetch movie descriptions from Neo4j
def get_movie_descriptions():
    cypher = """
    MATCH (m:Movie)
    RETURN m.id AS id, m.description AS description
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "description": record["description"]} for record in results]

# Generate keywords and tagline
def generate_keywords_and_tagline(description):
    keywords = kw_model.extract_keywords(description, keyphrase_ngram_range=(1, 2), stop_words="english")
    keywords = [kw[0] for kw in keywords]
    tagline = f"A story about {keywords[0]}" if keywords else "A story worth watching."
    return keywords, tagline

# Process all movie data
def process_and_update_movies():
    movie_data = get_movie_descriptions()
    for movie in movie_data:
        if movie["description"]:  # Ensure description is not None
            keywords, tagline = generate_keywords_and_tagline(movie["description"])

            # Update Neo4j
            cypher = """
            MATCH (m:Movie {id: $id})
            SET m.keywords = $keywords, m.tagline = $tagline
            """
            parameters = {
                "id": movie["id"],
                "keywords": keywords,
                "tagline": tagline
            }
            kg.query(cypher, parameters)

# Execute the processing
process_and_update_movies()


#### Importing taglines and keywords from another dataset


In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("juzershakir/tmdb-movies-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/juzershakir/tmdb-movies-dataset?dataset_version_number=1...


100%|██████████| 2.97M/2.97M [00:00<00:00, 116MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/juzershakir/tmdb-movies-dataset/versions/1





In [None]:
!ls {path}


tmdb_movies_data.csv


In [None]:
tag_path = f"{path}/tmdb_movies_data.csv"
tag_line_df = pd.read_csv(tag_path)  # Use the variable, not the literal string
#print(tag_path)

tag_line_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date       

Merging both datasets to get taglines and keywords

In [None]:
# Merge the two datasets on Title and Year
merged_df = pd.merge(
    df_imdb,  # Your 1000 movie entries
    tag_line_df,  # The DataFrame with taglines and keywords
    left_on=["Title", "Year"],
    right_on=["original_title", "release_year"],
    how="inner"
)

# Check the merged DataFrame
#print(merged_df[['Title', 'Year', 'tagline', 'keywords']].head())

merged_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 635 entries, 0 to 634
Data columns (total 33 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Rank                  635 non-null    int64  
 1   Title                 635 non-null    object 
 2   Genre                 635 non-null    object 
 3   Description           635 non-null    object 
 4   Director              635 non-null    object 
 5   Actors                635 non-null    object 
 6   Year                  635 non-null    int64  
 7   Runtime (Minutes)     635 non-null    int64  
 8   Rating                635 non-null    float64
 9   Votes                 635 non-null    int64  
 10  Revenue (Millions)    609 non-null    float64
 11  Metascore             606 non-null    float64
 12  id                    635 non-null    int64  
 13  imdb_id               635 non-null    object 
 14  popularity            635 non-null    float64
 15  budget                6
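
The inner merge keeps 635 rows, so any movie whose title or year differs between the two datasets is dropped. One possible cause (an assumption, not verified here) is a difference in case or surrounding whitespace. The sketch below uses plain dictionaries and toy records to show how a normalized join key could reveal which titles fail to match. `normalize_key` is a hypothetical helper.

```python
def normalize_key(title, year):
    """Build a join key that ignores case and surrounding whitespace in the title."""
    return (str(title).strip().casefold(), int(year))

# Toy records standing in for rows of df_imdb and tag_line_df
imdb_rows = [
    {"Title": "Prometheus", "Year": 2012},
    {"Title": "La La Land", "Year": 2016},
]
tmdb_rows = [
    {"original_title": " prometheus ", "release_year": 2012,
     "tagline": "The search for our beginning could lead to our end."},
]

tmdb_by_key = {
    normalize_key(row["original_title"], row["release_year"]): row
    for row in tmdb_rows
}

matched, unmatched = [], []
for row in imdb_rows:
    hit = tmdb_by_key.get(normalize_key(row["Title"], row["Year"]))
    (matched if hit else unmatched).append(row["Title"])

print("matched:", matched)
print("unmatched:", unmatched)
```

Listing the unmatched titles makes it easier to decide whether they need cleaning or a fallback such as an API lookup.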

In [None]:
import requests

def fetch_movie_data_from_tmdb(title, year=None):
    # TMDB_API_KEY must be defined before this function is called.
    # Search for the movie; `params` lets requests URL-encode the title.
    search_params = {"api_key": TMDB_API_KEY, "query": title}
    if year:
        search_params["year"] = year

    response = requests.get("https://api.themoviedb.org/3/search/movie", params=search_params)
    if response.status_code == 200:
        search_results = response.json().get("results", [])
        if search_results:
            # Fetch the first matching movie's details
            movie_id = search_results[0]["id"]
            movie_details_url = f"https://api.themoviedb.org/3/movie/{movie_id}?api_key={TMDB_API_KEY}"
            movie_details = requests.get(movie_details_url).json()

            # Extract tagline and keywords
            tagline = movie_details.get("tagline", "No tagline available")

            # Keywords are fetched from a different endpoint
            keywords_url = f"https://api.themoviedb.org/3/movie/{movie_id}/keywords?api_key={TMDB_API_KEY}"
            keywords_response = requests.get(keywords_url).json()
            keywords = [k["name"] for k in keywords_response.get("keywords", [])]

            return tagline, keywords
    return None, []


In [None]:
# Test the function above
title = "Prometheus"
year = 2012
tagline, keywords = fetch_movie_data_from_tmdb(title, year)
print(f"Tagline: {tagline}")
print(f"Keywords: {keywords}")

Tagline: The search for our beginning could lead to our end.
Keywords: ['android', 'alien', 'space', 'creature', 'spin off', 'creation', 'emergency surgery', 'stasis', 'archeological dig', 'god complex', 'cave drawing', 'prometheus', 'genetic mutation', 'origins of life', '2090s', 'objective']


Writing taglines and keywords to the Neo4j database

In [None]:
def update_movie_in_neo4j(title, tagline, keywords):
    query = """
    MATCH (m:Movie {title: $title})
    SET m.tagline = $tagline, m.keywords = $keywords
    """
    with driver.session() as session:
        session.run(query, {"title": title, "tagline": tagline, "keywords": keywords})

# Assuming `df_imdb` is your DataFrame with the 1000 movies

for _, row in df_imdb.iterrows():
    title = row['Title']  # Title of the movie
    #year = row['Year']    # Release year of the movie

    try:
        # Fetch data from TMDB
        tagline, keywords = fetch_movie_data_from_tmdb(title)

        # Update Neo4j
        update_movie_in_neo4j(title, tagline, keywords)
        print(f"Updated: {title} with tagline and keywords.")
    except Exception as e:
        print(f"Failed to update {title}: {e}")



Updated: Guardians of the Galaxy with tagline and keywords.
Updated: Prometheus with tagline and keywords.
Updated: Split with tagline and keywords.
Updated: Sing with tagline and keywords.
Updated: Suicide Squad with tagline and keywords.
Updated: The Great Wall with tagline and keywords.
Updated: La La Land with tagline and keywords.
Updated: Mindhorn with tagline and keywords.
Updated: The Lost City of Z with tagline and keywords.
Updated: Passengers with tagline and keywords.
Updated: Fantastic Beasts and Where to Find Them with tagline and keywords.
Updated: Hidden Figures with tagline and keywords.
Updated: Rogue One with tagline and keywords.
Updated: Moana with tagline and keywords.
Updated: Colossal with tagline and keywords.
Updated: The Secret Life of Pets with tagline and keywords.
Updated: Hacksaw Ridge with tagline and keywords.
Updated: Jason Bourne with tagline and keywords.
Updated: Lion with tagline and keywords.
Updated: Arrival with tagline and keywords.
Updated: Go

Some keywords and taglines are still null and need to be regenerated. We reuse the KeyBERT-based function defined above to fill them in.

In [None]:
"""
Replacing null values in tagline and keywords.
Taglines and keywords were written using the TMDB API, but some of them were empty,
so we generate the missing taglines and keywords with KeyBERT
from the description property.
"""
def update_tagline_and_key_movies():
    movie_data = get_movie_descriptions()  # Fetch movies from Neo4j
    for movie in movie_data:
        if movie["description"]:  # Ensure the description is valid
            keywords, tagline = generate_keywords_and_tagline(movie["description"])

            # Update only null fields in Neo4j
            cypher = """
            MATCH (m:Movie {id: $id})
            SET
                m.tagline = CASE
                               WHEN m.tagline IS NULL OR m.tagline = "" THEN $tagline
                               ELSE m.tagline
                           END,
                m.keywords = CASE
                               WHEN m.keywords IS NULL OR m.keywords = [] THEN $keywords
                               ELSE m.keywords
                           END
            """
            parameters = {
                "id": movie["id"],
                "keywords": keywords,
                "tagline": tagline
            }
            try:
                kg.query(cypher, parameters)
                print(f"Updated movie {movie['id']} with tagline and keywords.")
            except Exception as e:
                print(f"Failed to update movie {movie['id']}: {e}")


In [None]:
update_tagline_and_key_movies()

Updated movie 115 with tagline and keywords.
Updated movie 116 with tagline and keywords.
Updated movie 117 with tagline and keywords.
Updated movie 118 with tagline and keywords.
Updated movie 119 with tagline and keywords.
Updated movie 120 with tagline and keywords.
Updated movie 121 with tagline and keywords.
Updated movie 122 with tagline and keywords.
Updated movie 123 with tagline and keywords.
Updated movie 124 with tagline and keywords.
Updated movie 125 with tagline and keywords.
Updated movie 126 with tagline and keywords.
Updated movie 127 with tagline and keywords.
Updated movie 128 with tagline and keywords.
Updated movie 129 with tagline and keywords.
Updated movie 130 with tagline and keywords.
Updated movie 131 with tagline and keywords.
Updated movie 132 with tagline and keywords.
Updated movie 133 with tagline and keywords.
Updated movie 134 with tagline and keywords.
Updated movie 135 with tagline and keywords.
Updated movie 136 with tagline and keywords.
Updated mo

In [None]:
def update_keywords():
    movie_data = get_movie_descriptions()  # Fetch movies from Neo4j
    for movie in movie_data:
        if movie["description"]:  # Ensure the description is valid
            keywords, tagline = generate_keywords_and_tagline(movie["description"])
            cypher = """
            MATCH (m:Movie {id: $id})
            SET
                m.keywords = CASE
                               WHEN m.keywords IS NULL OR m.keywords = [] THEN $keywords
                               ELSE m.keywords
                           END
            """
            parameters = {
                "id": movie["id"],
                "keywords": keywords
            }
            try:
                kg.query(cypher, parameters)
                print(f"Updated movie {movie['id']} with keywords.")
            except Exception as e:
                print(f"Failed to update movie {movie['id']}: {e}")
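
Both update functions follow a fill-only-when-missing rule: an existing value is kept, and the generated value is written only when the property is empty. The sketch below expresses that rule in plain Python, treating `None`, an empty string, and an empty list as missing. Since `keywords` is stored as a list, an empty value there is `[]` rather than `""`. The helper names are hypothetical.

```python
def is_missing(value):
    """Treat None, an empty string, and an empty list as missing."""
    return value is None or value == "" or value == []

def fill_if_missing(existing, generated):
    """Return the existing value unless it is missing, else the generated one."""
    return generated if is_missing(existing) else existing

print(fill_if_missing(None, "A story about bilbo"))
print(fill_if_missing([], ["war", "president"]))
print(fill_if_missing(["android", "alien"], ["space"]))
```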



In [None]:
update_keywords()

Updated movie 115 with keywords.
Updated movie 116 with keywords.
Updated movie 117 with keywords.
Updated movie 118 with keywords.
Updated movie 119 with keywords.
Updated movie 120 with keywords.
Updated movie 121 with keywords.
Updated movie 122 with keywords.
Updated movie 123 with keywords.
Updated movie 124 with keywords.
Updated movie 125 with keywords.
Updated movie 126 with keywords.
Updated movie 127 with keywords.
Updated movie 128 with keywords.
Updated movie 129 with keywords.
Updated movie 130 with keywords.
Updated movie 131 with keywords.
Updated movie 132 with keywords.
Updated movie 133 with keywords.
Updated movie 134 with keywords.
Updated movie 135 with keywords.
Updated movie 136 with keywords.
Updated movie 137 with keywords.
Updated movie 138 with keywords.
Updated movie 139 with keywords.
Updated movie 140 with keywords.
Updated movie 141 with keywords.
Updated movie 142 with keywords.
Updated movie 143 with keywords.
Updated movie 144 with keywords.
Updated mo

In [None]:
# testing query
cypher = """
MATCH (m:Movie)
WHERE $keyword IN m.keywords
RETURN m.title, m.tagline, m.description
LIMIT 10;
"""
kg.query(cypher, {"keyword": "war"})


[{'m.title': 'The Hobbit: The Battle of the Five Armies',
  'm.tagline': 'A story about bilbo',
  'm.description': 'Bilbo and Company are forced to engage in a war against an array of combatants and keep the Lonely Mountain from falling into the hands of a rising darkness.'},
 {'m.title': 'Lincoln',
  'm.tagline': 'A story about president struggles',
  'm.description': "As the War continues to rage, America's president struggles with continuing carnage on the battlefield as he fights with many inside his own cabinet on the decision to emancipate the slaves."}]

# Preparing Text for RAG
In this section we create text embeddings for the RAG pipeline. We embed each movie's tagline so that the RAG system can retrieve movies by semantic similarity.

### Creating the vector index

In [None]:
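# all-MiniLM-L6-v2, the embedding model used below, produces 384-dimensional vectors,
# so the index dimension is set to 384 to match.
kg.query("""
  CREATE VECTOR INDEX movie_tagline_embeddings IF NOT EXISTS
  FOR (m:Movie) ON (m.taglineEmbedding)
  OPTIONS { indexConfig: {
    `vector.dimensions`: 384,
    `vector.similarity_function`: 'cosine'
  }}"""
)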


[]

In [None]:
kg.query("""
  SHOW VECTOR INDEXES
  """
)

[{'id': 6,
  'name': 'movie_tagline_embeddings',
  'state': 'ONLINE',
  'populationPercent': 100.0,
  'type': 'VECTOR',
  'entityType': 'NODE',
  'labelsOrTypes': ['Movie'],
  'properties': ['taglineEmbedding'],
  'indexProvider': 'vector-2.0',
  'owningConstraint': None,
  'lastRead': None,
  'readCount': 0}]
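
The index is configured with the `cosine` similarity function. As a reminder of what that measures, here is a minimal standard-library sketch of cosine similarity between two vectors. It illustrates the formula only and is not how Neo4j computes or scales its scores internally.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

# Identical directions score highest, orthogonal directions score zero
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because only the direction matters, two taglines whose embeddings differ in magnitude but point the same way are treated as equally similar.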

### Populating the vector index

#### Loading Hugging Face embedding models

In [None]:
import os
from google.colab import userdata # we stored our access token as a colab secret

os.environ["HUGGINGFACEHUB_API_TOKEN"] = userdata.get('new_token')

In [None]:
%%bash
pip install -qqq -U langchain-huggingface
pip install -qqq -U langchain
pip install -qqq -U langchain-community
pip install -qqq -U faiss-cpu
pip install -qU pymupdf



In [None]:
## Setting up LLM
from langchain_huggingface import HuggingFaceEndpoint

# The repo ID is shown at the top of each Hugging Face model page
hf_model = "mistralai/Mistral-7B-Instruct-v0.3"

llm = HuggingFaceEndpoint(repo_id = hf_model)

In [None]:
## embedding model
from langchain_huggingface import HuggingFaceEmbeddings

# embeddings
embedding_model = "sentence-transformers/all-MiniLM-l6-v2"
embeddings_folder = "/content/"

embeddings = HuggingFaceEmbeddings(model_name=embedding_model,
                                   cache_folder=embeddings_folder)


In [None]:
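# Using the Hugging Face model to create embeddings.
# Cypher cannot call Python objects, so each embedding is computed in Python
# with the `embeddings` object defined above and passed in as a query parameter.
rows = kg.query("""
  MATCH (m:Movie)
  WHERE m.tagline IS NOT NULL
  RETURN m.id AS id, m.tagline AS tagline
""")
for row in rows:
    kg.query(
        "MATCH (m:Movie {id: $id}) SET m.taglineEmbedding = $embedding",
        {"id": row["id"], "embedding": embeddings.embed_query(row["tagline"])},
    )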

### Implementing HuggingFace model for embedding

In [None]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

def fetch_movies_with_taglines():
    cypher = """
    MATCH (m:Movie) WHERE m.tagline IS NOT NULL
    RETURN m.id AS id, m.tagline AS tagline
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "tagline": record["tagline"]} for record in results]

def generate_embedding(tagline):
    return embedding_model.encode(tagline).tolist()

def update_movie_embedding(movie_id, embedding):
    cypher = """
    MATCH (m:Movie {id: $id})
    SET m.taglineEmbedding = $embedding
    """
    with driver.session() as session:
        session.run(cypher, {"id": movie_id, "embedding": embedding})

def process_and_update_embeddings():
    movies = fetch_movies_with_taglines()
    for movie in movies:
        tagline = movie["tagline"]
        movie_id = movie["id"]

        # Generate embedding
        embedding = generate_embedding(tagline)

        # Update Neo4j
        update_movie_embedding(movie_id, embedding)
        print(f"Updated movie ID {movie_id} with tagline embedding.")





In [None]:
# Execute the processing
process_and_update_embeddings()

Updated movie ID 115 with tagline embedding.
Updated movie ID 116 with tagline embedding.
Updated movie ID 117 with tagline embedding.
Updated movie ID 118 with tagline embedding.
Updated movie ID 119 with tagline embedding.
Updated movie ID 120 with tagline embedding.
Updated movie ID 121 with tagline embedding.
Updated movie ID 122 with tagline embedding.
Updated movie ID 123 with tagline embedding.
Updated movie ID 124 with tagline embedding.
Updated movie ID 125 with tagline embedding.
Updated movie ID 126 with tagline embedding.
Updated movie ID 127 with tagline embedding.
Updated movie ID 128 with tagline embedding.
Updated movie ID 129 with tagline embedding.
Updated movie ID 130 with tagline embedding.
Updated movie ID 131 with tagline embedding.
Updated movie ID 132 with tagline embedding.
Updated movie ID 133 with tagline embedding.
Updated movie ID 134 with tagline embedding.
Updated movie ID 135 with tagline embedding.
Updated movie ID 136 with tagline embedding.
Updated mo
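
The loop above issues one write per movie. A minimal sketch of a batched alternative follows, assuming a hypothetical `run_query(cypher, params)` callable (for example a thin wrapper around `kg.query`) and a hypothetical `encode` callable that maps a list of taglines to a list of vectors. `UNWIND` is standard Cypher; the batch size of 100 is an arbitrary choice, not something tuned for AuraDB.

```python
from typing import Iterator

# Cypher that writes many embeddings in one round trip
BATCH_UPDATE = """
UNWIND $rows AS row
MATCH (m:Movie {id: row.id})
SET m.taglineEmbedding = row.embedding
"""

def chunked(items: list, size: int) -> Iterator[list]:
    # Yield consecutive slices of at most `size` items
    for start in range(0, len(items), size):
        yield items[start:start + size]

def update_embeddings_in_batches(movies, encode, run_query, batch_size=100):
    # movies: list of {"id", "tagline"} dicts, as returned by fetch_movies_with_taglines()
    # encode: hypothetical callable, list of str -> list of vectors
    # run_query: hypothetical callable taking (cypher, params)
    written = 0
    for batch in chunked(movies, batch_size):
        vectors = encode([m["tagline"] for m in batch])
        rows = [
            {"id": m["id"], "embedding": [float(x) for x in vec]}
            for m, vec in zip(batch, vectors)
        ]
        run_query(BATCH_UPDATE, {"rows": rows})
        written += len(rows)
    return written
```

With the objects from this notebook, `encode` could plausibly be `embedding_model.encode` and `run_query` could be `lambda q, p: kg.query(q, params=p)`.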

#### Querying the embeddings from Neo4j
I am using the free tier of Neo4j AuraDB as the knowledge graph database. A vector database is not available on this tier, so I store each embedding vector as a list property and compute cosine similarity manually in Python to produce the recommendations.

In [None]:
def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

In [None]:
import numpy as np
from numpy.linalg import norm

# Cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Example function to recommend movies
def recommend_movies(query, top_k=5):
    # Generate embedding for the query using Hugging Face
    query_embedding = embedding_model.encode(query).tolist()

    # Fetch movie embeddings from Neo4j
    movies = fetch_movie_embeddings()

    # Calculate similarity scores
    similarities = [
        {"id": movie["id"], "title": movie["title"], "score": cosine_similarity(query_embedding, movie["embedding"])}
        for movie in movies
    ]

    # Sort by similarity score and return top_k results
    recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
    return recommendations


In [None]:

question = "A Love story."
recommendations = recommend_movies(question)

print("Top Recommendations:")
for rec in recommendations:
    print(f"Title: {rec['title']}, Similarity Score: {rec['score']}")


Top Recommendations:
Title: Blue Valentine, Similarity Score: 1.0000000000000002
Title: Anna Karenina, Similarity Score: 0.8451157761142856
Title: Cinderella, Similarity Score: 0.7187387364941482
Title: Tramps, Similarity Score: 0.6740038104071534
Title: All Good Things, Similarity Score: 0.6404637424966216
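
The manual ranking above computes one cosine similarity per movie in a Python loop. A minimal sketch of the same ranking in vectorized NumPy is shown below; `top_k_by_cosine` is a hypothetical helper, and the 2-D vectors are toy values rather than real tagline embeddings.

```python
import numpy as np

def top_k_by_cosine(query_vec, movie_vecs, titles, top_k=5):
    # Stack the stored embeddings into one (n_movies, dim) matrix
    matrix = np.asarray(movie_vecs, dtype=float)
    query = np.asarray(query_vec, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    # Indices of the highest scores, best first
    order = np.argsort(-scores)[:top_k]
    return [(titles[i], float(scores[i])) for i in order]

# Toy vectors, purely illustrative
print(top_k_by_cosine([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], ["A", "B", "C"], top_k=2))
```

A score marginally above 1, such as the `1.0000000000000002` in the output above, is floating-point rounding: cosine similarity is mathematically bounded by 1.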


# Improving Queries with an LLM and Prompt Engineering
A simple similarity query answers only one type of question at a time. We can leverage an LLM and other models, together with prompt engineering, to develop a more robust RAG pipeline that handles several kinds of questions. For example: "Which movies are highly rated, and what other movies are like them?", "Which movies were released between 2006 and 2008?", or "Can you recommend an adventure movie?"

### Writing Cypher with LLM

In [None]:
from dotenv import load_dotenv
import os

import textwrap

# Langchain
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI

# Warning control
import warnings
warnings.filterwarnings("ignore")

In [None]:
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to
query a graph database.
Instructions:
Use only the provided relationship types and properties in the
schema. Do not use any other relationship types or properties that
are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than
for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher
statements for particular questions:

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

The question is:
{question}"""


In [None]:
CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"],
    template=CYPHER_GENERATION_TEMPLATE
)

In [None]:
cypherChain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0),
    graph=kg,
    verbose=True,
    cypher_prompt=CYPHER_GENERATION_PROMPT,
    allow_dangerous_requests=True
)

In [None]:
def prettyCypherChain(question: str) -> None:
    # Run the chain and print the answer wrapped at 60 characters
    response = cypherChain.invoke({"query": question})["result"]
    print(textwrap.fill(response, 60))

In [None]:
prettyCypherChain("What are movies after 2007?")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Movie)
WHERE m.year > 2007
RETURN m.title, m.year
Full Context:
[{'m.title': 'Harry Potter and the Deathly Hallows: Part 2', 'm.year': 2011}, {'m.title': 'Office Christmas Party', 'm.year': 2016}, {'m.title': 'The Neon Demon', 'm.year': 2016}, {'m.title': 'Dangal', 'm.year': 2016}, {'m.title': '10 Cloverfield Lane', 'm.year': 2016}, {'m.title': 'Finding Dory', 'm.year': 2016}, {'m.title': "Miss Peregrine's Home for Peculiar Children", 'm.year': 2016}, {'m.title': 'Divergent', 'm.year': 2014}, {'m.title': 'Mike and Dave Need Wedding Dates', 'm.year': 2016}, {'m.title': 'Boyka: Undisputed IV', 'm.year': 2016}]

> Finished chain.
Harry Potter and the Deathly Hallows: Part 2, Office
Christmas Party, The Neon Demon, Dangal, 10 Cloverfield
Lane, Finding Dory, Miss Peregrine's Home for Peculiar
Children, Mike and Dave Need Wedding Dates, Boyka:
Undisputed IV.


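### Streamlit app with LLM-generated Cypher
Another way to query the graph: the chain below uses one model to generate the Cypher statement and a second model to phrase the answer.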

In [None]:
from langchain_core.prompts.prompt import PromptTemplate

CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# How many people played in Top Gun?
MATCH (m:Movie {{name:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title
The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # without this, the custom few-shot prompt above is never used
    verbose=True,
    allow_dangerous_requests=True,
)
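
The doubled braces in the Top Gun example of the template matter. `PromptTemplate` renders with Python format-string rules by default, so single braces mark input variables such as `{schema}` and `{question}`, while literal Cypher braces must be written as `{{` and `}}`. A small sketch with plain `str.format`, which follows the same rules (the shortened template string is illustrative only):

```python
# Doubled braces survive formatting as literal braces; {question} is substituted
template = 'MATCH (m:Movie {{name:"Top Gun"}}) RETURN m  // question: {question}'
rendered = template.format(question="Who acted in Top Gun?")
print(rendered)

# Single braces around the Cypher map are parsed as a placeholder called "name"
try:
    'MATCH (m:Movie {name:"Top Gun"})'.format(question="x")
    missing = None
except KeyError as err:
    missing = err.args[0]
print(f"Missing placeholder: {missing}")
```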

In [None]:
chain.invoke({"query": "movie A love story"})
#chain.invoke({"query":"What are movies after 2008"})





> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m:Movie {title: "A love story"})-[:HAS_GENRE]->(g:Genre)
RETURN m, g
Full Context:
[]

> Finished chain.


{'query': 'movie A love story',
 'result': 'I\'m sorry, but I don\'t have any information about a movie called "A Love Story."'}

### Checking the effect of query preprocessing

In [None]:
# import re
# def preprocess_query(query):
#     # Define a list of stopwords to remove
#     stopwords = ["recommend", "movie", "similar", "movies", "show", "find", "me", "please"]

#     # Lowercase the query for consistency
#     query = query.lower()

#     # Remove stopwords
#     for word in stopwords:
#         query = query.replace(word, "")

#     # Remove single characters
#     query = re.sub(r'\b\w\b', '', query)

#     # Remove extra spaces
#     query = re.sub(r'\s+', ' ', query).strip()

#     # Remove punctuation
#     query = re.sub(r'[^\w\s]', '', query)

#     return query
def preprocess_query(query):
    stopwords = ["recommend", "movie", "similar", "movies", "show", "find", "me", "please"]
    query = query.lower()
    # Remove only stopwords, not structural terms
    query_words = query.split()
    filtered_words = [word for word in query_words if word not in stopwords]
    return " ".join(filtered_words)


In [None]:
print(preprocess_query("Recommend movies like Interstellar"))
# Output: "like interstellar"


like interstellar
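
The commented-out draft removes stopwords with `str.replace`, while the active function filters whole words. A minimal sketch of why that difference matters, using the same stopword list; the two helper names and the sample sentence are made up for illustration:

```python
STOPWORDS = ["recommend", "movie", "similar", "movies", "show", "find", "me", "please"]

def strip_by_substring(query):
    # Removes every occurrence of each stopword, even inside longer words
    query = query.lower()
    for word in STOPWORDS:
        query = query.replace(word, "")
    return " ".join(query.split())

def strip_by_token(query):
    # Removes a stopword only when it is a whole whitespace-separated word
    return " ".join(w for w in query.lower().split() if w not in STOPWORDS)

sample = "Recommend a comedy movie"
print(strip_by_substring(sample))  # the "me" inside "comedy" is cut out
print(strip_by_token(sample))      # "comedy" is left intact
```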


In [None]:
# def recommend_movies(query, top_k=5):
#     # Step 1: Preprocess query
#     processed_query = preprocess_query(query)
#     print(f"Processed Query: {processed_query}")

#     # Step 2: Check if query is "movies like [X]"
#     if "like" in processed_query:
#         # Extract target movie
#         target_movie = processed_query.split("like")[-1].strip()
#         print(f"Target Movie: {target_movie}")

#         # Fetch target movie embedding
#         cypher = f"""
#         MATCH (m:Movie)
#         WHERE toLower(m.title) = toLower("{target_movie}")
#         RETURN m.id AS id, m.taglineEmbedding AS embedding
#         """
#         print(f"Cypher Query to Fetch Target Embedding: {cypher}")

#         target_embedding_result = kg.query(cypher)
#         if not target_embedding_result:
#             return [{"title": f"Movie '{target_movie}' not found", "score": 0.0}]

#         target_movie_id = target_embedding_result[0]["id"]
#         target_embedding = target_embedding_result[0]["embedding"]
#         print(f"Target Movie Embedding: {target_embedding}")

#         # Fetch all movie embeddings excluding the target movie
#         movies = fetch_movie_embeddings()
#         filtered_movies = [movie for movie in movies if movie["id"] != target_movie_id]

#         # Calculate similarities
#         similarities = [
#             {
#                 "id": movie["id"],
#                 "title": movie["title"],
#                 "score": cosine_similarity(target_embedding, movie["embedding"]),
#             }
#             for movie in filtered_movies
#         ]

#         # Sort and return top results
#         recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
#         return recommendations
#     else:
#         # Handle other queries with standard embedding
#         query_embedding = embedding_model.encode(processed_query).tolist()
#         movies = fetch_movie_embeddings()

#         similarities = [
#             {
#                 "id": movie["id"],
#                 "title": movie["title"],
#                 "score": cosine_similarity(query_embedding, movie["embedding"]),
#             }
#             for movie in movies
#         ]

#         recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
#         return recommendations
def recommend_movies(query, top_k=5):
    processed_query = preprocess_query(query)
    print(f"Processed Query: {processed_query}")

    if "like" in processed_query:
        target_movie = processed_query.split("like")[-1].strip()
        print(f"Target Movie: {target_movie}")

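        # Pass the title as a query parameter instead of interpolating it into the Cypher text,
        # so quotes in a title cannot break or alter the statement
        cypher = """
        MATCH (m:Movie)
        WHERE toLower(m.title) = toLower($title)
        RETURN m.id AS id, m.taglineEmbedding AS embedding
        """
        target_embedding_result = kg.query(cypher, params={"title": target_movie})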

        if not target_embedding_result:
            return [{"title": f"Movie '{target_movie}' not found", "score": 0.0}]

        target_movie_id = target_embedding_result[0]["id"]
        target_embedding = target_embedding_result[0]["embedding"]

        movies = fetch_movie_embeddings()
        filtered_movies = [movie for movie in movies if movie["id"] != target_movie_id]

        similarities = [
            {
                "id": movie["id"],
                "title": movie["title"],
                "score": cosine_similarity(target_embedding, movie["embedding"]),
            }
            for movie in filtered_movies
        ]

        recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
        return recommendations
    else:
        query_embedding = embedding_model.encode(processed_query).tolist()
        movies = fetch_movie_embeddings()

        similarities = [
            {
                "id": movie["id"],
                "title": movie["title"],
                "score": cosine_similarity(query_embedding, movie["embedding"]),
            }
            for movie in movies
        ]

        recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
        return recommendations



In [None]:
query5 = "recommend me A love story"
recommendations = recommend_movies(query5)

print("Top Recommendations:")
for rec in recommendations:
    print(f"Title: {rec['title']}, Similarity Score: {rec['score']}")


Processed Query: a love story
Top Recommendations:
Title: Blue Valentine, Similarity Score: 0.9503850611328301
Title: Anna Karenina, Similarity Score: 0.8133487662771884
Title: Tramps, Similarity Score: 0.7537903593103467
Title: The Danish Girl, Similarity Score: 0.7153557956078992
Title: Cinderella, Similarity Score: 0.7059203068077425
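
The "movies like X" branch checks for the substring `like`, which also matches inside words such as "unlikely". A minimal sketch of a stricter extraction with a word-boundary regular expression; `extract_like_target` is a hypothetical helper, and it still treats any whole-word "like" as the start of a title, so a query such as "I like war films" would be read as a request for movies like "war films".

```python
import re

def extract_like_target(processed_query):
    # Match "like" only as a whole word and capture everything after it
    match = re.search(r"\blike\s+(.+)$", processed_query)
    return match.group(1).strip() if match else None

print(extract_like_target("like interstellar"))
print(extract_like_target("an unlikely hero"))
```

When the helper returns `None`, the query can fall through to the plain embedding comparison in the `else` branch.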


### App with similarity search

In [None]:
%%writefile my_app.py
import os
import streamlit as st
from sentence_transformers import SentenceTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_core.prompts.prompt import PromptTemplate
from langchain_openai import ChatOpenAI
from neo4j import GraphDatabase
import numpy as np
from numpy.linalg import norm

# Neo4j credentials, read from the environment because this script runs outside the notebook
uri = os.environ.get("NEO_URL")
username = os.environ.get("NEO_USERNAME")
password = os.environ.get("NEO_PASSWORD")
database = os.environ.get("NEO_DATABASE")

# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Initialize LangChain Neo4jGraph integration
kg = Neo4jGraph(uri, username, password, database)

# Initialize embedding model
embedding_model2 = SentenceTransformer("all-MiniLM-L6-v2")

# Function for cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Fetch movie embeddings from Neo4j
def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

# Recommend movies based on cosine similarity
def recommend_movies(query, top_k=5):
    query_embedding = embedding_model2.encode(query).tolist()
    movies = fetch_movie_embeddings()
    similarities = [
        {
            "id": movie["id"],
            "title": movie["title"],
            "score": cosine_similarity(query_embedding, movie["embedding"]),
        }
        for movie in movies
    ]
    recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
    return recommendations

# Prompt template for Cypher query generation
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# How many people played in Top Gun?
MATCH (m:Movie {{name:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

# five Movies directed by Ridley Scott
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
where d.name = "Ridley Scott"
RETURN m.title, m.year LIMIT 5

# Who directed Iron Man?
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE m.title = "Iron Man"
RETURN  m.year, d.name;


The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # use the few-shot prompt defined above
    return_intermediate_steps=True,  # expose the generated Cypher in the response
    verbose=True,
    allow_dangerous_requests=True,
)

# Streamlit UI
st.title("🎥 Movie Query and Recommendation System")
st.subheader("Ask questions about movies or get recommendations!")

# User input
query = st.text_input("Enter your question or query (e.g., 'Recommend movies like Interstellar')")

if st.button("Submit"):
    if query:
        if "recommend" in query.lower() or "similar" in query.lower():
            # Handle embedding-based recommendation
            st.subheader("Recommended Movies Based on Similarity:")
            recommendations = recommend_movies(query)
            for rec in recommendations:
                st.write(f"{rec['title']} (Similarity: {rec['score']:.2f})")
        else:
            # Handle Cypher-based queries
            response = chain.invoke({"query": query})

            # response["query"] is the user's question; the generated Cypher
            # is stored in the first intermediate step
            steps = response.get("intermediate_steps", [])
            generated_query = steps[0]["query"] if steps else "No query generated."
            st.subheader("Generated Cypher Query:")
            st.code(generated_query, language="cypher")

            # Display the natural-language answer produced by the QA model
            st.subheader("Query Results:")
            st.write(response.get("result", "No results found."))
    else:
        st.warning("Please enter a valid query.")


Writing my_app.py


### (Enhanced) App with similarity search

In [None]:
%%writefile my_app.py
import streamlit as st
from sentence_transformers import SentenceTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from neo4j import GraphDatabase
import numpy as np
from numpy.linalg import norm
import os
import re

# Replace with your Neo4j credentials
from google.colab import userdata
os.environ["NEO_URL"] = userdata.get('NEO4J_URI')
os.environ["NEO_USERNAME"] = userdata.get('NEO4J_USERNAME')
os.environ["NEO_PASSWORD"] = userdata.get('NEO4J_PASSWORD')
os.environ["NEO_DATABASE"] = userdata.get('NEO4J_DATABASE')

uri = os.environ.get("NEO_URL")
username = os.environ.get("NEO_USERNAME")
password = os.environ.get("NEO_PASSWORD")
database = os.environ.get("NEO_DATABASE")

# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Initialize LangChain Neo4jGraph integration
kg = Neo4jGraph(uri, username, password, database)

# Initialize embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Function for cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Fetch movie embeddings from Neo4j
def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

# Preprocess user query
def preprocess_query(query):
    stopwords = {"recommend", "movie", "similar", "movies", "show", "find", "me", "please"}
    # Lowercase and drop punctuation before tokenizing
    query = re.sub(r'[^\w\s]', '', query.lower())
    # Filter whole words only, so a stopword is never cut out of a longer word
    # (for example "me" inside "comedy")
    return " ".join(word for word in query.split() if word not in stopwords)

# Recommend movies based on embeddings
def recommend_movies(query, top_k=5):
    processed_query = preprocess_query(query)
    print(f"Processed Query: {processed_query}")

    if "like" in processed_query:
        target_movie = processed_query.split("like")[-1].strip()
        print(f"Target Movie: {target_movie}")

        # Pass the title as a query parameter instead of interpolating it into the Cypher text
        cypher = """
        MATCH (m:Movie)
        WHERE toLower(m.title) = toLower($title)
        RETURN m.id AS id, m.taglineEmbedding AS embedding
        """
        target_embedding_result = kg.query(cypher, params={"title": target_movie})

        if not target_embedding_result:
            return [{"title": f"Movie '{target_movie}' not found", "score": 0.0}]

        target_movie_id = target_embedding_result[0]["id"]
        target_embedding = target_embedding_result[0]["embedding"]
        print(f"Target Movie Embedding: {target_embedding}")

        movies = fetch_movie_embeddings()
        filtered_movies = [movie for movie in movies if movie["id"] != target_movie_id]

        similarities = [
            {
                "id": movie["id"],
                "title": movie["title"],
                "score": cosine_similarity(target_embedding, movie["embedding"]),
            }
            for movie in filtered_movies
        ]

        recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
        return recommendations
    else:
        query_embedding = embedding_model.encode(processed_query).tolist()
        movies = fetch_movie_embeddings()

        similarities = [
            {
                "id": movie["id"],
                "title": movie["title"],
                "score": cosine_similarity(query_embedding, movie["embedding"]),
            }
            for movie in movies
        ]

        recommendations = sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]
        return recommendations

# Cypher Query Generation Prompt
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Examples:

# How many people played in Top Gun?
MATCH (m:Movie {{name:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

# Movies directed by Ridley Scott
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE d.name = "Ridley Scott"
RETURN m.title, m.year LIMIT 5

# Who directed Iron Man?
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE m.title = "Iron Man"
RETURN m.year, d.name

The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # use the few-shot prompt defined above
    return_intermediate_steps=True,  # expose the generated Cypher in the response
    verbose=True,
    allow_dangerous_requests=True,
)

# Streamlit UI
st.title("🎥 Movie Query and Recommendation System")
st.subheader("Ask questions about movies or get recommendations!")

query = st.text_input("Enter your query (e.g., 'Recommend movies like Interstellar')")

if st.button("Submit"):
    if query:
        if "recommend" in query.lower() or "similar" in query.lower():
            st.subheader("Recommended Movies Based on Similarity:")
            recommendations = recommend_movies(query)
            for rec in recommendations:
                st.write(f"{rec['title']} (Similarity: {rec['score']:.6f})")
        else:
            response = chain.invoke({"query": query})
            # response["query"] is the user's question; the generated Cypher
            # is stored in the first intermediate step
            steps = response.get("intermediate_steps", [])
            generated_query = steps[0]["query"] if steps else "No query generated."
            st.subheader("Generated Cypher Query:")
            st.code(generated_query, language="cypher")
            # Display the natural-language answer produced by the QA model
            st.subheader("Query Results:")
            st.write(response.get("result", "No results found."))
    else:
        st.warning("Please enter a valid query.")


Overwriting my_app.py


In [None]:
schema = kg.get_structured_schema  # a dict, so len() counts its top-level keys, not characters
print(f"Schema sections: {len(schema)}")


Schema sections: 4


In [None]:
!streamlit run my_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.133.75.130
[1G[0K⠙[1G[0Kyour url is: https://mean-poets-wait.loca.lt


### Adding Memory to the RAG app (with similarity search)

In [None]:
!pip install -q --upgrade streamlit


In [17]:
%%writefile my_app_with_mem.py
import streamlit as st
from sentence_transformers import SentenceTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from neo4j import GraphDatabase
import numpy as np
from numpy.linalg import norm
import re
# Read credentials and API keys from Streamlit secrets
uri = st.secrets["NEO4J_URI"]
username = st.secrets["NEO4J_USERNAME"]
password = st.secrets["NEO4J_PASSWORD"]
TMDB_API_KEY = st.secrets["TMDB_API"]
database = st.secrets["NEO4J_DATABASE"]
openai_key = st.secrets["OPENAI_API_KEY"]
# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Initialize LangChain Neo4jGraph integration
kg = Neo4jGraph(uri, username, password, database)

# Initialize embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize memory for conversational context
memory = ConversationBufferMemory(memory_key="chat_history",return_messages=True)
#memory = ConversationBufferMemory(k=3, return_messages=True)

# Function for cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

# Preprocess user query
def preprocess_query(query):
    stopwords = {"recommend", "movie", "similar", "movies", "show", "find", "me", "please"}
    # Lowercase and drop punctuation before tokenizing
    query = re.sub(r'[^\w\s]', '', query.lower())
    # Filter whole words only, so a stopword is never cut out of a longer word
    # (for example "me" inside "comedy")
    return " ".join(word for word in query.split() if word not in stopwords)

# Recommend movies based on embeddings
from fuzzywuzzy import process  # Install using pip install fuzzywuzzy

def recommend_movies(query, top_k=5):
    # Fetch all movies and their embeddings
    movies = fetch_movie_embeddings()
    available_titles = [movie["title"] for movie in movies]

    # Preprocess query
    processed_query = preprocess_query(query)

    # Extract the target movie from the query
    if "like" in processed_query:
        target_movie = processed_query.split("like")[-1].strip()
        # Use fuzzy matching to find the best match
        matched_title, similarity = process.extractOne(target_movie, available_titles)

        if similarity < 80:  # Threshold for a "good match"
            return [{"title": f"No close matches found for '{target_movie}'. Did you mean '{matched_title}'?", "score": 0.0}]

        # Fetch the embedding for the matched title
        target_embedding = next(
            (movie["embedding"] for movie in movies if movie["title"] == matched_title), None
        )
        if not target_embedding:
            return [{"title": f"Movie '{matched_title}' not found in the database.", "score": 0.0}]

        # Calculate similarity with other movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(target_embedding, movie["embedding"]),
            }
            for movie in movies if movie["title"] != matched_title
        ]

        # Sort and return top recommendations
        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]

    else:
        query_embedding = embedding_model.encode(processed_query).tolist()

        # Calculate similarity with all movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(query_embedding, movie["embedding"]),
            }
            for movie in movies
        ]

        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]


# Cypher Query Generation Prompt
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Examples:

# How many people played in Top Gun?
MATCH (m:Movie {{name:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

# Movies directed by Ridley Scott
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE d.name = "Ridley Scott"
RETURN m.title, m.year LIMIT 5

# Who directed Iron Man?
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE m.title = "Iron Man"
RETURN m.year, d.name

The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # Without this, the custom prompt above is never used
    memory=memory,
    verbose=True,
    allow_dangerous_requests=True,
)
#chain.invoke({"query":"Movie similar to Inferno"})
#chain.invoke({"query":"recommend movie like Intersteller"})
# Streamlit UI
st.title("🎥 Movie Chatbot with Memory")
st.subheader("Ask questions about movies or get recommendations!")
if "input" not in st.session_state:
    st.session_state.input = ""  # Initialize input during the first load


# Initialize session state variables
if "messages" not in st.session_state:
    st.session_state.messages = []

if "memory" not in st.session_state:
    st.session_state.memory = {"last_movie": None, "last_director": None}

# # Display past messages
# for message in st.session_state.messages:
#     if message["role"] == "user":
#         st.markdown(f"**You:** {message['content']}")
#     else:
#         st.markdown(f"**Bot:** {message['content']}")

# Function to extract and update context for movies or directors
def extract_and_update_context(query, response):
    """
    Extracts relevant context (e.g., director, movie) from the query and response
    and updates the memory.
    """
    # Handle director-related responses
    if "directed" in response.lower():
        # Extract the director's name from the response
        parts = response.split("directed")
        if len(parts) > 1:
            director = parts[0].strip()  # Extract name before "directed"
            st.session_state.memory["last_director"] = director

    # Handle movie recommendation queries
    if "recommend" in query.lower() or "similar" in query.lower():
        if "like" in query.lower():
            # Extract the target movie from the query
            target_movie = query.split("like")[-1].strip()
            st.session_state.memory["last_movie"] = target_movie



# Function to handle ambiguous queries
def handle_context_based_query(query):
    """
    Handles queries that rely on contextual memory, such as pronouns.
    """
    # Handle "like this/it" for movies
    if "like this" in query.lower() or "like it" in query.lower():
        last_movie = st.session_state.memory.get("last_movie", None)
        if last_movie:
            recommendations = recommend_movies(f"recommend me movie like {last_movie}")
            return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
        else:
            return "I'm sorry, I don't know what you're referring to. Could you specify the movie?"

    # Handle "from him" for directors
    if "from him" in query.lower() or "from her" in query.lower() or "from them" in query.lower():
        last_director = st.session_state.memory.get("last_director", None)
        if last_director:
            # Query Neo4j for movies directed by the last tracked director
            cypher = """
            MATCH (d:Director {name: $name})-[:DIRECTED]->(m:Movie)
            RETURN m.title AS title
            """
            # Pass the name as a query parameter so quotes in a name cannot break the query
            results = kg.query(cypher, params={"name": last_director})
            if results:
                return "\n".join([record["title"] for record in results])
            else:
                return f"I couldn't find any movies directed by {last_director}."
        else:
            return "I'm sorry, I don't know who you're referring to. Could you clarify?"

    return None  # Not a context-based query


# Function to handle query submission and clear input
def handle_query():
    """
    Handles user queries and updates session state with bot responses.
    """
    query = st.session_state.get("input", "").strip()  # Access user input
    if query:  # Proceed only if the input is not empty
        # Ensure query is added only once
        if not st.session_state.messages or st.session_state.messages[-1] != {"role": "user", "content": query}:
            st.session_state.messages.append({"role": "user", "content": query})

            # Generate bot response
            response = handle_context_based_query(query)
            if not response:  # If no context-based response, proceed as usual
                with st.spinner("Thinking..."):
                    if "recommend" in query.lower() or "similar" in query.lower():
                        recommendations = recommend_movies(query)
                        response = "\n".join(
                            [f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations]
                        )
                    else:
                        try:
                            result = chain.invoke({"query": query})
                            generated_query = result.get("query", "No query generated.")
                            st.code(generated_query, language="cypher")
                            response = result.get("result", "No results found.")
                        except Exception as e:
                            response = f"An error occurred: {e}"

            # Extract and store context for follow-up queries
            extract_and_update_context(query, response)

            # Append bot response only once
            if not st.session_state.messages or st.session_state.messages[-1] != {"role": "bot", "content": response}:
                st.session_state.messages.append({"role": "bot", "content": response})

        # Clear input field for next query
        st.session_state.input = ""


# def display_chat_history():
#     """
#     Renders the chat history stored in `st.session_state.messages`.
#     """
#     # Use a single loop to iterate and display messages
#     for i, message in enumerate(st.session_state.messages):
#         if message["role"] == "user":
#             st.markdown(f"**You:** {message['content']}")
#         else:
#             st.markdown(f"**Bot:** {message['content']}")


# # Main Streamlit app logic
# st.title("🎥 Movie Chatbot with Memory")
# st.subheader("Ask questions about movies or get recommendations!")

# # Initialize session state variables
# if "messages" not in st.session_state:
#     st.session_state.messages = []

# if "memory" not in st.session_state:
#     st.session_state.memory = {"last_movie": None, "last_director": None}

# Chat input field linked to `st.session_state.input`
# Display chat history


st.text_input(
    "Type your message:",
    placeholder="Type your question here...",
    value=st.session_state.get("input", ""),  # Default to empty string
    key="input",
    on_change=handle_query,  # Trigger `handle_query` when the input changes
)

for message in st.session_state.messages:
    if message["role"] == "user":
        st.markdown(f"**You:** {message['content']}")
    else:
        st.markdown(f"**Bot:** {message['content']}")
# Display chat history only once
#display_chat_history()

# Optional: Debugging information in the sidebar
st.sidebar.write("**Tracked Context (Debugging):**", st.session_state.memory)


Overwriting my_app_with_mem.py
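
The `extract_and_update_context` helper in the app above stores everything that precedes the word "directed" as the director's name, so a reply such as "Interstellar was directed by Christopher Nolan" would store "Interstellar was". A minimal sketch of a stricter heuristic follows. The function name `extract_director` and both regular expressions are assumptions for illustration, not part of the original app, and they only recognise capitalised names.

```python
import re

# Hypothetical patterns: a run of capitalised words is treated as a name.
_NAME = r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*"
_PASSIVE = re.compile(r"directed by\s+(" + _NAME + r")")  # "... was directed by X"
_ACTIVE = re.compile(r"(" + _NAME + r")\s+directed\b")    # "X directed ..."


def extract_director(response):
    """Return the director's name found in a reply, or None if no pattern matches."""
    # Try the passive form first so "Y was directed by X" does not return part of Y.
    match = _PASSIVE.search(response) or _ACTIVE.search(response)
    return match.group(1) if match else None


print(extract_director("Iron Man was directed by Jon Favreau."))
print(extract_director("Christopher Nolan directed Interstellar."))
```

Returning `None` when nothing matches lets the caller leave `last_director` unchanged instead of overwriting it with an unrelated fragment.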


In [18]:
!streamlit run my_app_with_mem.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.28.149.205
your url is: https://sixty-lies-arrive.loca.lt


In [None]:
!pip install -q fuzzywuzzy
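
The apps use `fuzzywuzzy`'s `process.extractOne` with a cutoff of 80 to map a misspelled title onto a known one. As a rough, dependency-free illustration of the same idea, the sketch below swaps in the standard-library `difflib.SequenceMatcher`. This is a different scoring method, so its 0 to 1 ratio is not comparable with fuzzywuzzy's 0 to 100 score. The function name `best_title_match` and the 0.8 threshold are assumptions made for this example.

```python
from difflib import SequenceMatcher


def best_title_match(query, titles, threshold=0.8):
    """Return (title, score) for the closest title, or (None, score) below the threshold."""
    best_title, best_score = None, 0.0
    for title in titles:
        # Compare case-insensitively; ratio() is 1.0 for identical strings.
        score = SequenceMatcher(None, query.lower(), title.lower()).ratio()
        if score > best_score:
            best_title, best_score = title, score
    if best_score >= threshold:
        return best_title, best_score
    return None, best_score


titles = ["Inception", "Interstellar", "Inferno"]
print(best_title_match("intersteller", titles))
print(best_title_match("zzzz", titles))
```

Returning the score alongside the title lets the caller phrase a "Did you mean ...?" reply when the match falls just short of the threshold.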

In [None]:
#chain.invoke({"query":"Movie similar to Inferno"})
chain.invoke({"query":"recommend movie like Intersteller"})

NameError: name 'chain' is not defined

### (c) Adding memory and fetching posters
This version is not working yet and needs to be fixed.

In [16]:
%%writefile my_app_third.py
import streamlit as st
from sentence_transformers import SentenceTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from neo4j import GraphDatabase
import numpy as np
from numpy.linalg import norm
import re
from fuzzywuzzy import process
import requests
uri = st.secrets["NEO4J_URI"]
username = st.secrets["NEO4J_USERNAME"]
password = st.secrets["NEO4J_PASSWORD"]
TMDB_API_KEY = st.secrets["TMDB_API"]
database = st.secrets["NEO4J_DATABASE"]
openai_key = st.secrets["OPENAI_API_KEY"]
# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Initialize LangChain Neo4jGraph integration
kg = Neo4jGraph(uri, username, password, database)

# Initialize embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize memory for conversational context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Function for cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

@st.cache_data
def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

# Preprocess user query
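def preprocess_query(query):
    stopwords = ["recommend", "movie", "similar", "movies", "show", "find", "me", "please"]
    query = query.lower()
    # Remove stopwords as whole words only, so "me" is not stripped out of "men" or "home"
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    query = re.sub(pattern, '', query)
    query = re.sub(r'\b\w\b', '', query)        # Drop single-character words
    query = re.sub(r'[^\w\s]', '', query)       # Drop punctuation
    query = re.sub(r'\s+', ' ', query).strip()  # Collapse whitespace last
    return query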

# Recommend movies based on embeddings
@st.cache_data
def recommend_movies(query, top_k=5):
    # Fetch all movies and their embeddings
    movies = fetch_movie_embeddings()
    available_titles = [movie["title"] for movie in movies]

    # Preprocess query
    processed_query = preprocess_query(query)

    # Extract the target movie from the query
    if "like" in processed_query:
        target_movie = processed_query.split("like")[-1].strip()
        # Use fuzzy matching to find the best match
        matched_title, similarity = process.extractOne(target_movie, available_titles)

        if similarity < 80:  # Threshold for a "good match"
            return [{"title": f"No close matches found for '{target_movie}'. Did you mean '{matched_title}'?", "score": 0.0}]

        # Fetch the embedding for the matched title
        target_embedding = next(
            (movie["embedding"] for movie in movies if movie["title"] == matched_title), None
        )
        if not target_embedding:
            return [{"title": f"Movie '{matched_title}' not found in the database.", "score": 0.0}]

        # Calculate similarity with other movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(target_embedding, movie["embedding"]),
            }
            for movie in movies if movie["title"] != matched_title
        ]

        # Sort and return top recommendations
        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]

    else:
        query_embedding = embedding_model.encode(processed_query).tolist()

        # Calculate similarity with all movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(query_embedding, movie["embedding"]),
            }
            for movie in movies
        ]

        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]

# Cypher Query Generation Prompt
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Examples:

# How many people played in Top Gun?
MATCH (m:Movie {{title:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

# Movies directed by Ridley Scott
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE d.name = "Ridley Scott"
RETURN m.title, m.year LIMIT 5

# Who directed Iron Man?
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE m.title = "Iron Man"
RETURN m.year, d.name

The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # Without this, the custom prompt above is never used
    memory=memory,
    verbose=True,
    allow_dangerous_requests=True,
)
# Fetch posters from TMDb
def fetch_movie_poster(movie_name):
    url = "https://api.themoviedb.org/3/search/movie"
    try:
        # Let requests URL-encode the title; the timeout keeps the app from hanging
        response = requests.get(
            url, params={"api_key": TMDB_API_KEY, "query": movie_name}, timeout=10
        )
        response.raise_for_status()
        results = response.json().get("results", [])
        # poster_path can be present but null, so check its value rather than the key
        poster_path = results[0].get("poster_path") if results else None
        if poster_path:
            return f"https://image.tmdb.org/t/p/w500{poster_path}"
        return None
    except Exception as e:
        print(f"Error fetching poster for {movie_name}: {e}")
        return None


# if recommendations:
#     for rec in recommendations:
#         poster_url = fetch_movie_poster(rec['title'])
#         st.image(poster_url, width=100, caption=rec['title'])


# Function to extract and update context for movies or directors
def extract_and_update_context(query, response):
    """
    Extracts relevant context (e.g., director, movie) from the query and response
    and updates the memory.
    """
    # Handle director-related responses
    if "directed" in response.lower():
        # Extract the director's name from the response
        parts = response.split("directed")
        if len(parts) > 1:
            director = parts[0].strip()  # Extract name before "directed"
            st.session_state.memory["last_director"] = director

    # Handle movie recommendation queries
    if "recommend" in query.lower() or "similar" in query.lower():
        if "like" in query.lower():
            # Extract the target movie from the query
            target_movie = query.split("like")[-1].strip()
            st.session_state.memory["last_movie"] = target_movie

# Function to handle ambiguous queries
def handle_context_based_query(query):
    """
    Handles queries that rely on contextual memory, such as pronouns.
    """
    # Handle "like this/it" for movies
    if "like this" in query.lower() or "like it" in query.lower():
        last_movie = st.session_state.memory.get("last_movie", None)
        if last_movie:
            recommendations = recommend_movies(f"recommend me movie like {last_movie}")
            return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
        else:
            return "I'm sorry, I don't know what you're referring to. Could you specify the movie?"

    # Handle "from him" for directors
    if "from him" in query.lower() or "from her" in query.lower() or "from them" in query.lower():
        last_director = st.session_state.memory.get("last_director", None)
        if last_director:
            # Query Neo4j for movies directed by the last tracked director
            cypher = """
            MATCH (d:Director {name: $name})-[:DIRECTED]->(m:Movie)
            RETURN m.title AS title
            """
            # Pass the name as a query parameter so quotes in a name cannot break the query
            results = kg.query(cypher, params={"name": last_director})
            if results:
                return "\n".join([record["title"] for record in results])
            else:
                return f"I couldn't find any movies directed by {last_director}."
        else:
            return "I'm sorry, I don't know who you're referring to. Could you clarify?"

    return None  # Not a context-based query

# Function to handle query submission and clear input
def handle_query():
    """
    Handles user queries and updates session state with bot responses.
    """
    query = st.session_state.get("input", "").strip()  # Access user input
    if not query:  # Proceed only if the input is not empty
        return

    # Ensure query is added only once
    if not st.session_state.messages or st.session_state.messages[-1] != {"role": "user", "content": query}:
        st.session_state.messages.append({"role": "user", "content": query})

        # Generate bot response
        response = handle_context_based_query(query)
        if not response:  # If no context-based response, proceed as usual
            with st.spinner("Thinking..."):
                response = generate_response(query)

        # Extract and store context for follow-up queries
        extract_and_update_context(query, response)

        # Append bot response only once
        if not st.session_state.messages or st.session_state.messages[-1] != {"role": "bot", "content": response}:
            st.session_state.messages.append({"role": "bot", "content": response})

    # Clear input field for next query
    st.session_state.input = ""

# def generate_response(query):
#     """
#     Generates a response based on the user query.
#     """
#     if "recommend" in query.lower() or "similar" in query.lower():
#         recommendations = recommend_movies(query)
#         return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
#     else:
#         try:
#             result = chain.invoke({"query": query})
#             generated_query = result.get("query", "No query generated.")
#             st.code(generated_query, language="cypher")
#             return result.get("result", "No results found.")
#         except Exception as e:
#             return f"An error occurred: {e}"
def generate_response(query):
    """
    Renders posters or the chain output and returns the reply as text, because
    handle_query stores the reply and calls response.lower() on it.
    """
    if "recommend" in query.lower() or "similar" in query.lower():
        recommendations = recommend_movies(query)
        cols = st.columns(5)
        for idx, rec in enumerate(recommendations):
            with cols[idx % 5]:
                poster_url = fetch_movie_poster(rec["title"])
                if poster_url:
                    st.image(poster_url, caption=f"{rec['title']} (Similarity: {rec['score']:.6f})", use_column_width=True)
                else:
                    st.write(f"{rec['title']} (Similarity: {rec['score']:.6f}) - No poster available")
        return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
    else:
        st.subheader("Generated Cypher Query:")
        try:
            response = chain.invoke({"query": query})
        except Exception as e:
            return f"An error occurred: {e}"
        generated_query = response.get("query", "No query generated.")
        st.code(generated_query, language="cypher")
        results = response.get("result", "No results found.")
        if results:
            st.write(results)
        return str(results) if results else "No results found."

# Streamlit UI
# Main Streamlit UI
st.title("🎥 Movie Chatbot with Memory")
st.markdown("**📢 Ask questions about movies or get recommendations!**")
if "input" not in st.session_state:
    st.session_state.input = ""  # Initialize input during the first load


# Initialize session state variables
if "messages" not in st.session_state:
    st.session_state.messages = []

if "memory" not in st.session_state:
    st.session_state.memory = {"last_movie": None, "last_director": None}
# Sidebar options
with st.sidebar:
    st.title("🎬 Options")
    if st.button("Clear Chat History"):
        st.session_state.messages = []
        st.rerun()
    st.markdown("**Instructions:**\n- Ask questions like:\n  - Who directed Interstellar?\n  - Recommend movies like Inception.")

# Chat input field
query = st.text_area(
    "💬 Your Question:",
    placeholder="Ask about movies, directors, or get recommendations...",
    value=st.session_state.get("input", ""),
    key="input",
    on_change=handle_query
)


# Display chat history
st.markdown("### 💬 Conversation History")
for message in st.session_state.messages:
    if message["role"] == "user":
        st.markdown(f'<div style="text-align: right; background-color: #dcf8c6; padding: 8px; border-radius: 10px; margin: 5px;">{message["content"]}</div>', unsafe_allow_html=True)
    else:
        st.markdown(f'<div style="text-align: left; background-color: #f1f0f0; padding: 8px; border-radius: 10px; margin: 5px;">{message["content"]}</div>', unsafe_allow_html=True)

# Debugging information
with st.expander("🔍 Debugging Information"):
    st.write("**Context Memory:**", st.session_state.memory)


Writing my_app_third.py
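
The recommendation path in the app above ranks movies by the cosine similarity between a target embedding and each movie's `taglineEmbedding`. A minimal sketch of that ranking, written without NumPy so it runs anywhere, is shown below. It adds a guard for zero-length vectors, which the app's own `cosine_similarity` does not have. The name `top_k_similar` and the two-dimensional toy vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # Avoid dividing by zero for empty or all-zero embeddings
    return dot / (norm1 * norm2)


def top_k_similar(target, movies, top_k=5):
    """Score every movie against the target embedding and keep the best top_k."""
    scored = [
        {"title": m["title"], "score": cosine_similarity(target, m["embedding"])}
        for m in movies
    ]
    return sorted(scored, key=lambda x: x["score"], reverse=True)[:top_k]


# Toy data standing in for rows returned by fetch_movie_embeddings()
movies = [
    {"title": "A", "embedding": [1.0, 0.0]},
    {"title": "B", "embedding": [0.0, 1.0]},
    {"title": "C", "embedding": [1.0, 1.0]},
    {"title": "D", "embedding": [0.0, 0.0]},
]
print(top_k_similar([1.0, 0.0], movies, top_k=2))
```

Scoring every movie on each request is linear in the catalogue size, which is acceptable for a small graph; a vector index would be the usual next step for larger ones.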


In [None]:
!streamlit run my_app_third.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.106.63.128
your url is: https://eighty-buttons-scream.loca.lt


In [None]:
#!pip install gradio
import gradio as gr

def greet(name):
    return f"Hello {name}!"

iface = gr.Interface(fn=greet, inputs="text", outputs="text")
iface.launch()


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://1e7ddc9b1ba7ebbf05.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## App without similarity search

In [None]:
%%writefile my_app.py

import streamlit as st
from sentence_transformers import SentenceTransformer
from neo4j import GraphDatabase
from langchain_community.graphs import Neo4jGraph
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chains import GraphCypherQAChain
from langchain_openai import ChatOpenAI
import numpy as np
from numpy.linalg import norm
import os
import textwrap

# Replace with your Neo4j credentials
uri = os.environ.get("NEO_URL")
username = os.environ.get("NEO_USERNAME")
password = os.environ.get("NEO_PASSWORD")
database = os.environ.get("NEO_DATABASE")

# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

#Initialize a knowledge graph instance using LangChain's Neo4j integration
kg = Neo4jGraph(
    uri, username, password, database
)

# Prompt Template for Cypher Query Generation
CYPHER_GENERATION_TEMPLATE = """Task:Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Do not respond to any questions that might ask anything else than for you to construct a Cypher statement.
Do not include any text except the generated Cypher statement.
Examples: Here are a few examples of generated Cypher statements for particular questions:

# How many people played in Top Gun?
MATCH (m:Movie {{title:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title
The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize the GraphCypherQAChain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # Without this, the custom prompt above is never used
    verbose=True,
    allow_dangerous_requests=True,
)

# Function to run the chain and format the response
def prettyCypherChain(question: str) -> str:
    """Runs the cypherChain and formats the response."""
    response = chain.run(question)

    if isinstance(response, str):  # Check if response is a string
        st.write(textwrap.fill(response, 60))
    elif isinstance(response, dict):  # Check if response is a dictionary
        for key, value in response.items():
            st.write(f"{key}: {value}")
    elif isinstance(response, list):  # Check if response is a list
        # Create a DataFrame for better display in Streamlit
        import pandas as pd
        df = pd.DataFrame(response)
        st.dataframe(df)
    else:
        st.write(response) # Print the response as is if it's not a string, dictionary, or list
# Streamlit App
def main():
    st.title("Neo4j Movie Recommendation App")
    question = st.text_input("Enter your question about movies:")

    if st.button("Get Answer"):
        if question:
            st.write("**Question:**", question)
            with st.spinner("Generating Cypher query and fetching results..."):
                try:
                    prettyCypherChain(question)
                except Exception as e:
                    st.error(f"An error occurred: {e}")


if __name__ == "__main__":
    main()

Overwriting my_app.py


In [None]:
#chain.invoke({"query":"Movie similar to Inferno"})
def prettyCypherChain(question: str) -> str:
    """Runs the cypherChain and formats the response."""
    response = cypherChain.run(question)

    if isinstance(response, str):  # Check if response is a string
        st.write(textwrap.fill(response, 60))
    elif isinstance(response, dict):  # Check if response is a dictionary
        for key, value in response.items():
            st.write(f"{key}: {value}")
    elif isinstance(response, list):  # Check if response is a list
        # Create a DataFrame for better display in Streamlit
        import pandas as pd
        df = pd.DataFrame(response)
        st.dataframe(df)
    else:
        st.write(response) # Print the response as is if it's not a string, dictionary, or list

In [None]:
prettyCypherChain("Movie similar to Iron Man")



> Entering new GraphCypherQAChain chain...
Generated Cypher:
MATCH (m1:Movie {title: "Iron Man"})-[:HAS_GENRE]->(g:Genre)<-[:HAS_GENRE]-(m2:Movie)
WHERE m1 <> m2
RETURN m2.title, COLLECT(DISTINCT g.name) as genres

Full Context:
[{'m2.title': 'Dangal', 'genres': ['Action']}, {'m2.title': 'Boyka: Undisputed IV', 'genres': ['Action']}, {'m2.title': 'The Dark Knight Rises', 'genres': ['Action']}, {'m2.title': 'Transformers: Age of Extinction', 'genres': ['Action', 'Adventure', 'Sci-Fi']}, {'m2.title': 'Furious 6', 'genres': ['Action']}, {'m2.title': 'Star Trek', 'genres': ['Action', 'Adventure', 'Sci-Fi']}, {'m2.title': 'Watchmen', 'genres': ['Action']}, {'m2.title': 'Inferno', 'genres': ['Action', 'Adventure']}, {'m2.title': 'Sicario', 'genres': ['Action']}, {'m2.title': 'Aliens vs Predator - Requiem', 'genres': ['Action', 'Sci-Fi']}]

> Finished chain.
Transformers: Age of Extinction, Star Trek, and Sicario are
movies similar to Iron M
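
The generated Cypher above returns every movie that shares at least one genre with Iron Man, in no particular order, and the final answer includes Sicario even though it shares only "Action". One possible refinement, sketched here on a few rows copied from the "Full Context" output, is to rank candidates by how many genres they share. The function name `rank_by_shared_genres` is an assumption for illustration and is not part of the notebook.

```python
def rank_by_shared_genres(rows):
    """Order result rows so movies sharing more genres with the target come first."""
    # sorted() is stable, so rows with equal counts keep their original order.
    return sorted(rows, key=lambda row: len(row["genres"]), reverse=True)


# A subset of the rows shown in the chain output above
rows = [
    {"m2.title": "Dangal", "genres": ["Action"]},
    {"m2.title": "Transformers: Age of Extinction", "genres": ["Action", "Adventure", "Sci-Fi"]},
    {"m2.title": "Star Trek", "genres": ["Action", "Adventure", "Sci-Fi"]},
    {"m2.title": "Inferno", "genres": ["Action", "Adventure"]},
    {"m2.title": "Sicario", "genres": ["Action"]},
    {"m2.title": "Aliens vs Predator - Requiem", "genres": ["Action", "Sci-Fi"]},
]

for row in rank_by_shared_genres(rows):
    print(len(row["genres"]), row["m2.title"])
```

The same ordering could instead be done in the query itself by counting the shared genres and sorting on that count, which avoids transferring rows that will be discarded.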

In [None]:
!streamlit run my_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.106.132.174
your url is: https://lemon-symbols-brush.loca.lt


# Creating RAG with Streamlit

In [None]:
%%writefile my_app.py
import streamlit as st
from sentence_transformers import SentenceTransformer
from langchain_community.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain_core.prompts.prompt import PromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from neo4j import GraphDatabase
import numpy as np
from numpy.linalg import norm
import re
from fuzzywuzzy import process

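# Read Neo4j credentials from Streamlit secrets (same key names as in my_app_third.py)
uri = st.secrets["NEO4J_URI"]
username = st.secrets["NEO4J_USERNAME"]
password = st.secrets["NEO4J_PASSWORD"]
database = st.secrets["NEO4J_DATABASE"]
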
# Initialize Neo4j driver
driver = GraphDatabase.driver(uri, auth=(username, password))

# Initialize LangChain Neo4jGraph integration
kg = Neo4jGraph(uri, username, password, database)

# Initialize embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Initialize memory for conversational context
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Function for cosine similarity
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

@st.cache_data
def fetch_movie_embeddings():
    cypher = """
    MATCH (m:Movie)
    WHERE m.taglineEmbedding IS NOT NULL
    RETURN m.id AS id, m.title AS title, m.taglineEmbedding AS embedding
    """
    results = kg.query(cypher)
    return [{"id": record["id"], "title": record["title"], "embedding": record["embedding"]} for record in results]

# Preprocess user query
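def preprocess_query(query):
    stopwords = ["recommend", "movie", "similar", "movies", "show", "find", "me", "please"]
    query = query.lower()
    # Remove stopwords as whole words only, so "me" is not stripped out of "men" or "home"
    pattern = r'\b(?:' + '|'.join(map(re.escape, stopwords)) + r')\b'
    query = re.sub(pattern, '', query)
    query = re.sub(r'\b\w\b', '', query)        # Drop single-character words
    query = re.sub(r'[^\w\s]', '', query)       # Drop punctuation
    query = re.sub(r'\s+', ' ', query).strip()  # Collapse whitespace last
    return query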

# Recommend movies based on embeddings
@st.cache_data
def recommend_movies(query, top_k=5):
    # Fetch all movies and their embeddings
    movies = fetch_movie_embeddings()
    available_titles = [movie["title"] for movie in movies]

    # Preprocess query
    processed_query = preprocess_query(query)

    # Extract the target movie from the query
    if "like" in processed_query:
        target_movie = processed_query.split("like")[-1].strip()
        # Use fuzzy matching to find the best match
        matched_title, similarity = process.extractOne(target_movie, available_titles)

        if similarity < 80:  # Threshold for a "good match"
            return [{"title": f"No close matches found for '{target_movie}'. Did you mean '{matched_title}'?", "score": 0.0}]

        # Fetch the embedding for the matched title
        target_embedding = next(
            (movie["embedding"] for movie in movies if movie["title"] == matched_title), None
        )
        if not target_embedding:
            return [{"title": f"Movie '{matched_title}' not found in the database.", "score": 0.0}]

        # Calculate similarity with other movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(target_embedding, movie["embedding"]),
            }
            for movie in movies if movie["title"] != matched_title
        ]

        # Sort and return top recommendations
        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]

    else:
        query_embedding = embedding_model.encode(processed_query).tolist()

        # Calculate similarity with all movies
        similarities = [
            {
                "title": movie["title"],
                "score": cosine_similarity(query_embedding, movie["embedding"]),
            }
            for movie in movies
        ]

        return sorted(similarities, key=lambda x: x["score"], reverse=True)[:top_k]

# Cypher Query Generation Prompt
CYPHER_GENERATION_TEMPLATE = """Task: Generate Cypher statement to query a graph database.
Instructions:
Use only the provided relationship types and properties in the schema.
Do not use any other relationship types or properties that are not provided.
Schema:
{schema}
Note: Do not include any explanations or apologies in your responses.
Examples:

# How many people played in Top Gun?
MATCH (m:Movie {{title:"Top Gun"}})<-[:ACTED_IN]-()
RETURN count(*) AS numberOfActors

# What are movies after year 2006?
Match (m:Movie)
where m.year > 2006
Return m.year, m.title

# Movies directed by Ridley Scott
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE d.name = "Ridley Scott"
RETURN m.title, m.year LIMIT 5

# Who directed Iron Man?
MATCH (d:Director)-[:DIRECTED]->(m:Movie)
WHERE m.title = "Iron Man"
RETURN m.year, d.name

The question is:
{question}"""

CYPHER_GENERATION_PROMPT = PromptTemplate(
    input_variables=["schema", "question"], template=CYPHER_GENERATION_TEMPLATE
)

# Initialize Cypher QA Chain
chain = GraphCypherQAChain.from_llm(
    graph=kg,
    cypher_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo"),
    qa_llm=ChatOpenAI(temperature=0, model="gpt-3.5-turbo-16k"),
    cypher_prompt=CYPHER_GENERATION_PROMPT,  # Without this, the custom prompt above is never used
    memory=memory,
    verbose=True,
    allow_dangerous_requests=True,
)

# Function to extract and update context for movies or directors
def extract_and_update_context(query, response):
    """
    Extracts relevant context (e.g., director, movie) from the query and response
    and updates the memory.
    """
    # Handle director-related responses
    if "directed" in response.lower():
        # Extract the director's name from the response
        parts = response.split("directed")
        if len(parts) > 1:
            director = parts[0].strip()  # Extract name before "directed"
            st.session_state.memory["last_director"] = director

    # Handle movie recommendation queries
    lowered = query.lower()
    if "recommend" in lowered or "similar" in lowered:
        if "like" in lowered:
            # Find the last "like" case-insensitively, then slice the original
            # query so the movie title keeps its casing
            idx = lowered.rfind("like")
            target_movie = query[idx + len("like"):].strip(" .?!")
            st.session_state.memory["last_movie"] = target_movie

# Function to handle ambiguous queries

def handle_context_based_query(query):
    """
    Handles queries that rely on contextual memory, such as pronouns.
    """
    # Handle "like this/it" for movies
    if "like this" in query.lower() or "like it" in query.lower():
        last_movie = st.session_state.memory.get("last_movie", None)
        if last_movie:
            recommendations = recommend_movies(f"recommend me movie like {last_movie}")
            return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
        else:
            return "I'm sorry, I don't know what you're referring to. Could you specify the movie?"

    # Handle "from him" for directors
    if "from him" in query.lower() or "from her" in query.lower() or "from them" in query.lower():
        last_director = st.session_state.memory.get("last_director", None)
        if last_director:
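            # Query Neo4j for movies directed by the last tracked director.
            # The name is passed as a query parameter rather than interpolated
            # into the Cypher string, which avoids quoting bugs and injection.
            cypher = """
            MATCH (d:Director {name: $name})-[:DIRECTED]->(m:Movie)
            RETURN m.title AS title
            """
            results = kg.query(cypher, params={"name": last_director})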
            if results:
                return "\n".join([record["title"] for record in results])
            else:
                return f"I couldn't find any movies directed by {last_director}."
        else:
            return "I'm sorry, I don't know who you're referring to. Could you clarify?"

    return None  # Not a context-based query

# Function to handle query submission and clear input
def handle_query():
    """
    Handles user queries and updates session state with bot responses.
    """
    query = st.session_state.get("input", "").strip()  # Access user input
    if not query:  # Proceed only if the input is not empty
        return

    # Ensure query is added only once
    if not st.session_state.messages or st.session_state.messages[-1] != {"role": "user", "content": query}:
        st.session_state.messages.append({"role": "user", "content": query})

        # Generate bot response
        response = handle_context_based_query(query)
        if not response:  # If no context-based response, proceed as usual
            with st.spinner("Thinking..."):
                response = generate_response(query)

        # Extract and store context for follow-up queries
        extract_and_update_context(query, response)

        # Append bot response only once
        if not st.session_state.messages or st.session_state.messages[-1] != {"role": "bot", "content": response}:
            st.session_state.messages.append({"role": "bot", "content": response})

    # Clear input field for next query
    st.session_state.input = ""

def generate_response(query):
    """
    Generates a response based on the user query.
    """
    if "recommend" in query.lower() or "similar" in query.lower():
        recommendations = recommend_movies(query)
        return "\n".join([f"- {rec['title']} (Similarity: {rec['score']:.6f})" for rec in recommendations])
    else:
        try:
            result = chain.invoke({"query": query})
            generated_query = result.get("query", "No query generated.")
            st.code(generated_query, language="cypher")
            return result.get("result", "No results found.")
        except Exception as e:
            return f"An error occurred: {e}"

# Custom CSS for background color
def add_custom_css():
    st.markdown(
        """
        <style>
        /* Background color for the entire app */
        .stApp {
            background-color: #f0f8ff;
        }
        /* Chat bubbles styling */
        .chat-bubble-user {
            text-align: right;
            background-color: #dcf8c6;
            padding: 8px;
            border-radius: 10px;
            margin: 5px;
        }
        .chat-bubble-bot {
            text-align: left;
            background-color: #f1f0f0;
            padding: 8px;
            border-radius: 10px;
            margin: 5px;
        }
        div.stButton > button {
            background-color: #4682b4;
            color: white;
            padding: 0.5rem 1rem;
            border-radius: 5px;
            border: none;
            font-size: 1rem;
            cursor: pointer;
        }
        div.stButton > button:hover {
            background-color: #5a9bd4;
        }
        </style>
        """,
        unsafe_allow_html=True,
    )

# Add custom CSS
add_custom_css()

# Main Streamlit UI
st.title("🎥 CineGraph: Movie Chatbot")
st.markdown("**📢 Ask questions about movies or get recommendations!**")

# Initialize session state variables
if "input" not in st.session_state:
    st.session_state.input = ""  # Initialize input during the first load

if "messages" not in st.session_state:
    st.session_state.messages = []

if "memory" not in st.session_state:
    st.session_state.memory = {"last_movie": None, "last_director": None}

# Sidebar options
with st.sidebar:
    st.title("🎬 Options")
    if st.button("Clear Chat History"):
        st.session_state.messages = []
        st.rerun()
    st.markdown("**Instructions:**\n- Ask questions like:\n  - Who directed Interstellar?\n  - Recommend movies like Inception.")

# Quick Question Buttons
st.markdown("### Quick Questions")
col1, col2 = st.columns(2)

with col1:
    if st.button("Who directed Interstellar?"):
        st.session_state.input = "Who directed Interstellar?"
        handle_query()

with col2:
    if st.button("Recommend movies like Inception"):
        st.session_state.input = "Recommend movies like Inception"
        handle_query()

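# Chat input field
# Pass the callback itself (no parentheses) so Streamlit calls it only when the text changes.
# The widget reads and writes its value through st.session_state["input"] via `key`.
query = st.text_area(
    "💬 Your Question:",
    placeholder="Ask about movies, directors, or get recommendations...",
    key="input",
    on_change=handle_query,
)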


# Display chat history
st.markdown("### 💬 Conversation History")
for message in reversed(st.session_state.messages):
    if message["role"] == "user":
        st.markdown(
            f'<div class="chat-bubble-user">{message["content"]}</div>',
            unsafe_allow_html=True,
        )
    else:
        st.markdown(
            f'<div class="chat-bubble-bot">{message["content"]}</div>',
            unsafe_allow_html=True,
        )

# Debugging information
with st.expander("🔍 Debugging Information"):
    st.write("**Context Memory:**", st.session_state.memory)


Overwriting my_app.py
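
The recommendation branch of `my_app.py` scores every stored movie embedding against a query (or matched-title) embedding and keeps the `top_k` highest scores. The `cosine_similarity` helper is defined outside this excerpt, so the sketch below assumes the usual dot-product-over-norms definition. The function name `rank_by_similarity` and the toy three-dimensional embeddings are hypothetical and only illustrate the ranking step.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (assumed definition)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, movies, top_k=5, exclude_title=None):
    """Score each movie against the query and return the top_k, best first."""
    scored = [
        {"title": m["title"], "score": cosine_similarity(query_embedding, m["embedding"])}
        for m in movies
        if m["title"] != exclude_title
    ]
    return sorted(scored, key=lambda rec: rec["score"], reverse=True)[:top_k]


# Toy data: "B" points almost the same way as "A"; "C" is orthogonal to "A".
toy_movies = [
    {"title": "A", "embedding": [1.0, 0.0, 0.0]},
    {"title": "B", "embedding": [0.9, 0.1, 0.0]},
    {"title": "C", "embedding": [0.0, 1.0, 0.0]},
]
top = rank_by_similarity(toy_movies[0]["embedding"], toy_movies, top_k=2, exclude_title="A")
print([rec["title"] for rec in top])  # ['B', 'C']
```

Excluding the matched title mirrors the "movies like X" branch, where the target movie itself should not appear in its own recommendations.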


In [None]:
!streamlit run my_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.106.132.174
your url is: https://chubby-eels-switch.loca.lt


In [None]:
import openai


def query_movies_by_genre(genre_name):
    cypher_query = """
    MATCH (m:Movie)-[:HAS_GENRE]->(g:Genre {name: $genre_name})
    RETURN m.title AS title, m.description AS description,
           m.tagline AS tagline, m.poster_path AS poster_path,
           m.trailer_url AS trailer_url, m.keywords AS keywords
    LIMIT 10;
    """
    with driver.session() as session:
        result = session.run(cypher_query, {"genre_name": genre_name})
        return [record.data() for record in result]


def query_movies_by_actor(actor_name):
    cypher_query = """
    MATCH (a:Actor {name: $actor_name})-[:ACTED_IN]->(m:Movie)
    RETURN m.title AS title, m.description AS description,
           m.tagline AS tagline, m.poster_path AS poster_path,
           m.trailer_url AS trailer_url, m.keywords AS keywords
    LIMIT 10;
    """
    with driver.session() as session:
        result = session.run(cypher_query, {"actor_name": actor_name})
        return [record.data() for record in result]

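def format_movie_data(movies):
    """Normalize Neo4j records into plain dicts with safe defaults."""
    formatted = []
    for movie in movies:
        # The Cypher RETURN clause always yields every key, so a missing
        # property arrives as None; `or` supplies the fallback in that case.
        formatted.append({
            "title": movie["title"],
            "description": movie["description"],
            "tagline": movie.get("tagline") or "No tagline available",
            "poster": movie.get("poster_path"),
            "trailer": movie.get("trailer_url"),
            "keywords": movie.get("keywords") or []
        })
    return formatted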

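def generate_recommendations(movies):
    """Build a prompt from the formatted movies and ask the chat model to respond."""
    prompt = "Here are some movie recommendations:\n\n"
    for movie in movies:
        prompt += f"Title: {movie['title']}\n"
        prompt += f"Tagline: {movie['tagline']}\n"
        prompt += f"Description: {movie['description']}\n"
        # Guard against keywords being None
        prompt += f"Keywords: {', '.join(movie['keywords'] or [])}\n\n"
    # openai>=1.0 client interface; the legacy openai.ChatCompletion was removed in 1.0.
    # The client reads the API key from the OPENAI_API_KEY environment variable.
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content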
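
The cell above turns Neo4j records into a text prompt. Optional properties such as `tagline` or `keywords` can come back as `None`, and `', '.join(None)` raises a `TypeError`. The sketch below shows one way to build the same prompt defensively without calling any API, so it can be checked offline. The helper name `build_prompt` and the sample record are hypothetical.

```python
def build_prompt(movies):
    """Assemble the recommendation prompt, substituting defaults for missing fields."""
    lines = ["Here are some movie recommendations:", ""]
    for movie in movies:
        keywords = movie.get("keywords") or []  # None becomes an empty list
        lines.append(f"Title: {movie['title']}")
        lines.append(f"Tagline: {movie.get('tagline') or 'No tagline available'}")
        lines.append(f"Description: {movie.get('description') or 'No description available'}")
        lines.append(f"Keywords: {', '.join(keywords)}")
        lines.append("")  # Blank line between movies
    return "\n".join(lines)


# Hypothetical record with missing optional properties
sample = [
    {"title": "Example Movie", "description": "A placeholder plot.", "tagline": None, "keywords": None}
]
prompt = build_prompt(sample)
print(prompt)
```

Building the prompt in a separate function keeps the string formatting testable on its own, independent of the model call.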



In [None]:
!streamlit run my_app.py &>/content/logs.txt & npx localtunnel --port 8501 & curl ipv4.icanhazip.com

34.106.253.37
your url is: https://young-mirrors-switch.loca.lt


In [None]:
#!cat my_app.py  # Ensure the content is correctly written
!cat /content/logs.txt



Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.


  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.106.253.37:8501

  Stopping...
