## Description

In this notebook:

- load TheBloke/Mixtral_7Bx2_MoE-GPTQ from Hugging Face,
- use the model to generate categories and sample question:cypher pairs.

These are exploratory tests only. Note that this model is a 7Bx2 MoE, not a 7Bx8 MoE.

## Workspace Setup

In [None]:
# Log in with a Hugging Face token (optional; this step is not required)
!huggingface-cli login

In [None]:
!pip3 install --upgrade transformers optimum
!pip3 install --upgrade auto-gptq

In [None]:
# Load and mount the drive helper
from google.colab import drive

# This will prompt for authorization
drive.mount('/content/drive')

# Set the working directory
%cd '/content/drive/MyDrive/cypherGen/'

# Create a path variable for the data folder
data_path = '/content/drive/MyDrive/cypherGen/datas/'

In [None]:
#import pandas as pd

# Import the local modules
from utils.utilities import *

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Mixtral_7Bx2_MoE-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             device_map="auto",
                                             trust_remote_code=False,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

In [None]:
# Inference can be done using transformers' pipeline

def gen_text(prompt_template):
    """Run the prompt through a text-generation pipeline and return the raw output."""
    print("*** Text Generation Pipeline:\n")
    # Uses the global model and tokenizer loaded in the previous cell
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.5,  # 0.7
        top_p=0.95,
        top_k=10,  # 40 on HF, 50 default
        repetition_penalty=1.1,
    )
    # Returns a list of dicts; the text is under [0]['generated_text']
    return pipe(prompt_template)

## Generate Categories of Queries

In [None]:
# The prompt must be wrapped in the meta-tokens this LLM expects (ChatML-style tokens here)
prompt_cat = """I have a knowledge graph for which I would like to generate
about 1000 very interesting questions which span 10 categories (or types) about the graph.
They should cover single-node questions, questions about two, three or more nodes, and questions about many relationships.
Please provide these 10 categories.

Here is the graph schema:
Node properties are the following:\n
Article {abstract: STRING, article_id: INTEGER, comments: STRING, title: STRING},
Keyword {name: STRING, key_id: STRING},
Topic {cluster: INTEGER, description: STRING, label: STRING},
Author {author_id: STRING, affiliation: STRING,first_name: STRING, last_name: STRING},
DOI {name: STRING, doi_id: STRING},
Categories {category_id: STRING, specifications: STRING},
Report {report_id: STRING, report_no: STRING},
UpdateDate {update_date: DATE},
Journal {name: STRING, journal_id: STRING}\n
Relationship properties are the following:\n
PUBLISHED_IN {meta: STRING, pages: STRING, year: INTEGER}\n
The relationships are the following:\n
(:Article)-[:HAS_KEY]->(:Keyword),
(:Article)-[:HAS_DOI]->(:DOI),
(:Article)-[:HAS_CATEGORY]->(:Categories),
(:Article)-[:WRITTEN_BY]->(:Author),
(:Article)-[:UPDATED]->(:UpdateDate),
(:Article)-[:PUBLISHED_IN]->(:Journal),
(:Article)-[:HAS_REPORT]->(:Report),
(:Keyword)-[:HAS_TOPIC]->(:Topic)
"""
system_message = "You are an experienced, very helpful Python and Neo4j/Cypher developer."
prompt_template_cat=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt_cat}<|im_end|>
<|im_start|>assistant
'''
print("\n\n*** Generate:")

In [None]:
# Generate categories using text generation pipeline
test = gen_text(prompt_template_cat)

In [None]:
print(test[0]['generated_text'])
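
By default the text-generation pipeline returns the prompt followed by the completion, so the printed text repeats the whole schema. The sketch below shows one way to keep only the model's answer; `extract_reply` is a hypothetical helper, and it assumes the answer may end with an `<|im_end|>` token.

```python
def extract_reply(generated_text, prompt):
    """Return only the model's continuation, without the prompt or end token.

    Hypothetical helper: if the text does not start with the prompt
    (for example because tokens were altered in decoding), the full text is kept.
    """
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    else:
        reply = generated_text
    # Drop anything after the end-of-turn token, if the model emitted one
    return reply.split("<|im_end|>")[0].strip()


# Example with a stand-in for the pipeline output
demo_prompt = "<|im_start|>user\nHi<|im_end|>\n<|im_start|>assistant\n"
demo_output = demo_prompt + "1. Single-node lookups\n2. Path queries<|im_end|>"
print(extract_reply(demo_output, demo_prompt))
```

In this notebook the call would be `extract_reply(test[0]['generated_text'], prompt_template_cat)`. Passing `return_full_text=False` to the pipeline is an alternative that avoids returning the prompt in the first place.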

In [None]:
# Generate categories using model generate
input_ids = tokenizer(prompt_template_cat, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids,
                        temperature=0.7,
                        do_sample=True,
                        top_p=0.95,
                        top_k=40,
                        max_new_tokens=512)
print(tokenizer.decode(output[0]))

## Generate Individual Question:Cypher Pairs

In [None]:
prompt_pair = """Generate 10 questions and their corresponding Cypher statements
about the Neo4j graph database with the following schema:
Node properties are the following:\n
Article {abstract: STRING, article_id: INTEGER, comments: STRING, title: STRING},
Keyword {name: STRING, key_id: STRING},
Topic {cluster: INTEGER, description: STRING, label: STRING},
Author {author_id: STRING, affiliation: STRING,first_name: STRING, last_name: STRING},
DOI {name: STRING, doi_id: STRING},
Categories {category_id: STRING, specifications: STRING},
Report {report_id: STRING,report_no: STRING},
UpdateDate {update_date: DATE},
Journal {name: STRING, journal_id: STRING}\n
Relationship properties are the following:\n
PUBLISHED_IN {meta: STRING, pages: STRING, year: INTEGER}\n
The relationships are the following:\n
(:Article)-[:HAS_KEY]->(:Keyword),
(:Article)-[:HAS_DOI]->(:DOI),
(:Article)-[:HAS_CATEGORY]->(:Categories),
(:Article)-[:WRITTEN_BY]->(:Author),
(:Article)-[:UPDATED]->(:UpdateDate),
(:Article)-[:PUBLISHED_IN]->(:Journal),
(:Article)-[:HAS_REPORT]->(:Report),
(:Keyword)-[:HAS_TOPIC]->(:Topic)
The questions should be article based and should be phrased in a natural conversational manner.
Make the questions diverse and interesting.
Make sure to use the latest Cypher version and that all the queries are working Cypher queries for the provided graph.
You may add values for the node attributes as needed.
Do not add any comments, do not label or number the questions."""
system_message = "You are an experienced and useful Python and Neo4j/Cypher developer."
prompt_template=f'''<|im_start|>system
{system_message}<|im_end|>
<|im_start|>user
{prompt_pair}<|im_end|>
<|im_start|>assistant
'''
print("\n\n*** Generate:")

In [None]:
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids,
                        temperature=0.1,
                        do_sample=True,
                        top_p=0.95,
                        top_k=10,
                        max_new_tokens=512) # increase to at least 1K
print(tokenizer.decode(output[0]))
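
The prompt asks for queries that work on the provided graph, but nothing in the notebook checks this. As a rough first filter, the sketch below flags node labels and relationship types that do not appear in the schema. It is a regex heuristic under stated assumptions, not a Cypher parser: it only sees the first label of a node pattern, it can misread list slices such as `[1:3]`, and it does not check properties or syntax. `unknown_schema_names` is a hypothetical name.

```python
import re

# Labels and relationship types copied from the schema in the prompt above
NODE_LABELS = {"Article", "Keyword", "Topic", "Author", "DOI",
               "Categories", "Report", "UpdateDate", "Journal"}
REL_TYPES = {"HAS_KEY", "HAS_DOI", "HAS_CATEGORY", "WRITTEN_BY",
             "UPDATED", "PUBLISHED_IN", "HAS_REPORT", "HAS_TOPIC"}


def unknown_schema_names(cypher):
    """Return (labels, rel_types) used in `cypher` that are not in the schema."""
    # Node patterns: "(" optional variable ":" Label, e.g. (a:Article) or (:Author)
    labels = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher))
    # Relationship patterns: "[" optional variable ":" TYPE, e.g. [:WRITTEN_BY]
    rels = set(re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher))
    return labels - NODE_LABELS, rels - REL_TYPES


good = "MATCH (a:Article)-[:WRITTEN_BY]->(au:Author {last_name: 'Smith'}) RETURN a.title"
bad = "MATCH (p:Paper)-[:CITES]->(a:Article) RETURN p"
print(unknown_schema_names(good))
print(unknown_schema_names(bad))
```

A query that passes this check can still fail in Neo4j, so running the generated statements against the actual database remains the real test.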