## Tutorial

### Install LangChain

Install the `langchain`, `langchain-experimental`, and `langchain-google-genai` libraries. You can install them directly into your Spark Serverless environment.

In [None]:
%pip install langchain langchain-experimental langchain-google-genai

### Create an API key

Create an API key using [Google AI Studio](https://aistudio.google.com/app/apikey). Run the next cell and paste the API key when prompted.

In [None]:
from getpass import getpass

api_key = getpass()
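
When the notebook runs unattended, an interactive prompt is not available. The sketch below is one possible alternative, not part of the original tutorial: it reads the key from an environment variable and falls back to `getpass` only when the variable is unset. The helper name `load_api_key` and the default variable name `GOOGLE_API_KEY` are assumptions chosen for illustration.

```python
import os
from getpass import getpass


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the key from the environment, prompting only when it is unset."""
    # env_var is an assumed name; use whatever your environment defines.
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fall back to the same interactive prompt used in the cell above.
        key = getpass(f"Enter {env_var}: ").strip()
    if not key:
        raise ValueError(f"No API key provided for {env_var}")
    return key
```

Calling `api_key = load_api_key()` then replaces the `getpass()` call, and the rest of the notebook can stay unchanged.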

### Import required libraries

In [None]:
from langchain_experimental.agents.agent_toolkits import create_spark_dataframe_agent
from langchain_google_genai import GoogleGenerativeAI
from pyspark.sql import SparkSession

### Create a connection to the Gemini model service

Create an LLM object using the `GoogleGenerativeAI` class, which creates a connection to the Gemini model service.

In [None]:
llm = GoogleGenerativeAI(model="gemini-pro", temperature=0.0, google_api_key=api_key)

Use `llm.invoke` to ask Gemini a question and confirm your connection to the service.

In [None]:
print(llm.invoke("What is the best programming language?"))

### Create a Spark Session

Get the active Spark session in your environment, or create one if none exists.

In [None]:
spark = SparkSession.builder.getOrCreate()

### Load data

Load your BigLake table `gcp_primary_staging.thelook_ecommerce_order_items` into your environment. This table contains ecommerce orders.

In [None]:
df = spark.read.format("bigquery").load("next-2024-spark-demo.gcp_primary_staging.thelook_ecommerce_order_items")

View the first 10 rows of the data.

In [None]:
df.show(10)

Use the `create_spark_dataframe_agent` function to configure a LangChain agent from the loaded DataFrame and the Gemini model. Setting `verbose=True` prints the steps the agent takes to standard output; omit the parameter to suppress this output.

In [None]:
agent = create_spark_dataframe_agent(llm=llm, df=df, verbose=True)

Use natural language to gain insights into your data. To start with something simple, ask for the `order_id` and the price of the most expensive order.

In [None]:
agent.invoke("what was the order id and the price of the most expensive order?")

With `verbose=True`, you can see exactly how the agent works: it generates code based on the schema of the DataFrame and executes it. The first attempt doesn't always succeed, but the agent reads the errors it gets back and adjusts the code until it lands on an acceptable answer.
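
The generate, execute, and retry cycle described above can be illustrated with a toy loop. This is a simplified sketch and not LangChain's actual implementation: `run_with_retries` and `fake_model` are hypothetical names, the "model" is a stub that picks the wrong column on its first try, and the data is made up for the example.

```python
from typing import Callable, Optional


def run_with_retries(
    generate: Callable[[str, Optional[str]], str],
    question: str,
    max_attempts: int = 3,
) -> object:
    """Ask `generate` for code, run it, and feed any error back on retry."""
    last_error: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(question, last_error)
        scope: dict = {}
        try:
            # Runs model-written code; only do this in a sandboxed environment.
            exec(code, scope)
            return scope["result"]
        except Exception as exc:
            # Keep the error text so the next attempt can react to it.
            last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"No working code after {max_attempts} attempts: {last_error}")


def fake_model(question: str, error: Optional[str]) -> str:
    """Hypothetical stand-in for the LLM: wrong column first, fixed after an error."""
    rows = "rows = [{'order_id': 7, 'sale_price': 12.5}, {'order_id': 9, 'sale_price': 99.0}]\n"
    column = "price" if error is None else "sale_price"
    return rows + f"result = max(rows, key=lambda r: r['{column}'])['order_id']"


answer = run_with_retries(fake_model, "Which order is the most expensive?")
print(answer)  # prints 9
```

The first attempt fails with a `KeyError` because the column `price` does not exist in the sample rows; the second attempt sees that error and uses `sale_price` instead.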

Next, make a request that requires the agent to import additional functions.

In [None]:
agent.invoke("What week of the year has the total highest sales overall?")

You probably don't want to embed a natural language prompt directly in a production environment. Instead, ask Gemini to generate PySpark code that produces the same output.

In [None]:
agent.invoke("Print the PySpark code that answers 'What week of the year has the total highest sales overall?' Include all necessary imports.")

As with any output from still-maturing LLM technology, review the generated code for accuracy before you use it.
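
One lightweight way to start such a review is to parse the generated code without running it. The sketch below uses Python's standard `ast` module to confirm the text is valid Python and to list the modules it imports and the functions it calls. The helper name `summarize_generated_code` and the sample snippet (including the column names `created_at` and `sale_price`) are assumptions for illustration, not real agent output. This check complements a manual read of the code and does not replace it.

```python
import ast


def summarize_generated_code(source: str) -> dict:
    """Parse code without running it and report its imports and called names."""
    tree = ast.parse(source)  # raises SyntaxError if the text is not valid Python
    imports, calls = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.add(node.module or "")
        elif isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                calls.add(func.attr)  # method or attribute call, e.g. df.groupBy
            elif isinstance(func, ast.Name):
                calls.add(func.id)  # plain function call
    return {"imports": sorted(imports), "calls": sorted(calls)}


# Hypothetical agent output; the column names are assumed, not read from the table.
generated = '''
from pyspark.sql import functions as F
weekly = df.groupBy(F.weekofyear("created_at").alias("week")).agg(F.sum("sale_price").alias("sales"))
weekly.orderBy(F.desc("sales")).show(1)
'''

report = summarize_generated_code(generated)
print(report)
```

An unexpected import or call in the report (for example, file or network access) is a signal to read that part of the generated code more closely before running it.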