# Data Dash

In this notebook, we will ingest data about car races from Bigtable and analyze it with various Spark tools.

## Connect to Bigtable and Spark

First, we create our Spark connection and include the Bigtable Spark connector jar.

In [None]:
from pyspark.sql import SparkSession
import os

spark = (SparkSession.builder
         .config('spark.jars', "gs://spark-bigtable-preview/jars/spark-bigtable-0.0.1-preview4-SNAPSHOT.jar")
         .getOrCreate())

bigtable_project_id = os.environ["BIGTABLE_PROJECT_ID"]
bigtable_instance_id = os.environ["BIGTABLE_INSTANCE_ID"]
bigtable_table_name="data_dash_test"

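# Catalog mapping Bigtable column families and qualifiers to Spark DataFrame columns.
# ''.join(... .split()) strips all whitespace, including the spaces around the table name.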
catalog = ''.join(("""{
      "table":{"namespace":"default", "name":" """ + bigtable_table_name + """
       ", "tableCoder":"PrimitiveType"},
      "rowkey":"rowkey",
      "columns":{
        "_rowkey":{"cf":"rowkey", "col":"rowkey", "type":"string"},
        "Car_ID":{"cf":"cf", "col":"car_id", "type":"string"},
        "Start":{"cf":"cf", "col":"t1_s", "type":"string"},
        "End":{"cf":"cf", "col":"t8_e", "type":"string"},
        "Checkpoint_1":{"cf":"cf", "col":"t1_s", "type":"string"},
        "Checkpoint_2":{"cf":"cf", "col":"t2_s", "type":"string"},
        "Checkpoint_3":{"cf":"cf", "col":"t3_s", "type":"string"},
        "Checkpoint_4":{"cf":"cf", "col":"t4_s", "type":"string"},
        "Checkpoint_5":{"cf":"cf", "col":"t5_s", "type":"string"},
        "Checkpoint_6":{"cf":"cf", "col":"t6_s", "type":"string"},
        "Checkpoint_7":{"cf":"cf", "col":"t7_s", "type":"string"},
        "Checkpoint_8":{"cf":"cf", "col":"t8_s", "type":"string"},
        "Checkpoint_1_end":{"cf":"cf", "col":"t1_e", "type":"string"},
        "Checkpoint_2_end":{"cf":"cf", "col":"t2_e", "type":"string"},
        "Checkpoint_3_end":{"cf":"cf", "col":"t3_e", "type":"string"},
        "Checkpoint_4_end":{"cf":"cf", "col":"t4_e", "type":"string"},
        "Checkpoint_5_end":{"cf":"cf", "col":"t5_e", "type":"string"},
        "Checkpoint_6_end":{"cf":"cf", "col":"t6_e", "type":"string"},
        "Checkpoint_7_end":{"cf":"cf", "col":"t7_e", "type":"string"},
        "Checkpoint_8_end":{"cf":"cf", "col":"t8_e", "type":"string"}
      }
      }""").split())
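
The catalog is a JSON document, so it can also be built from a Python dict and serialized with `json.dumps`. The following is a minimal sketch of that alternative, not the connector's required approach. The helper name `build_catalog` is hypothetical, and the column mapping is copied from the cell above.

```python
import json

def build_catalog(table_name):
    """Build the catalog JSON string from a dict (hypothetical helper)."""
    columns = {
        "_rowkey": {"cf": "rowkey", "col": "rowkey", "type": "string"},
        "Car_ID": {"cf": "cf", "col": "car_id", "type": "string"},
        "Start": {"cf": "cf", "col": "t1_s", "type": "string"},
        "End": {"cf": "cf", "col": "t8_e", "type": "string"},
    }
    # Checkpoint entry times map to t1_s..t8_s, exit times to t1_e..t8_e.
    for i in range(1, 9):
        columns[f"Checkpoint_{i}"] = {"cf": "cf", "col": f"t{i}_s", "type": "string"}
    for i in range(1, 9):
        columns[f"Checkpoint_{i}_end"] = {"cf": "cf", "col": f"t{i}_e", "type": "string"}

    catalog = {
        "table": {"namespace": "default", "name": table_name, "tableCoder": "PrimitiveType"},
        "rowkey": "rowkey",
        "columns": columns,
    }
    # Compact separators keep the string free of whitespace, like the join/split version.
    return json.dumps(catalog, separators=(",", ":"))

catalog_from_dict = build_catalog("data_dash_test")
```

Generating the checkpoint entries in a loop avoids copy-and-paste mistakes in the sixteen near-identical lines.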

## Reading the raw data

Here we read from our Bigtable table into a DataFrame and display it.

In [None]:
df = spark.read \
  .format('bigtable') \
  .option('spark.bigtable.project.id', bigtable_project_id) \
  .option('spark.bigtable.instance.id', bigtable_instance_id) \
  .options(catalog=catalog) \
  .load()

print('Reading the DataFrame from Bigtable:')
df.show()

## Extracting value with Spark SQL

Spark SQL gives us a SQL layer we can use on top of our data. 

> Note: for large Bigtable datasets, filter on the row key so that the query does not have to scan the whole table.
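
As a sketch of what such a filter could look like, the helper below turns a row key prefix into a range predicate that can be passed to `df.filter(...)` or used in a `WHERE` clause. The helper name `rowkey_prefix_filter` is hypothetical, and whether the predicate is pushed down to Bigtable as a row range scan depends on the connector version, so treat that as an assumption to verify.

```python
def rowkey_prefix_filter(prefix, column="_rowkey"):
    """Return a SQL predicate selecting row keys that start with `prefix`.

    The upper bound is the prefix with its last character incremented,
    so the predicate covers the half-open range [prefix, upper).
    Note: this sketch does not escape quotes in `prefix`.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return f"{column} >= '{prefix}' AND {column} < '{upper}'"

predicate = rowkey_prefix_filter("CAR0003")
print(predicate)  # _rowkey >= 'CAR0003' AND _rowkey < 'CAR0004'

# With the DataFrame loaded above, this would be used as:
# df.filter(predicate).show()
```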

### Query the total times for each race

In [None]:
df.createOrReplaceTempView("races")

totalTimes = spark.sql("SELECT _rowkey, bround((end - start)/1000,2) as duration_in_secs FROM races")
totalTimes.show()

### Query the total time per race and plot the average per car

In [None]:
averagePerCar = spark.sql("SELECT car_id, bround(avg((end - start)/1000),2) as duration_in_secs FROM races GROUP BY car_id ORDER BY car_id")
averagePerCar.toPandas().plot.bar(x='car_id')

# Can also do the same thing with a pure spark dataframe in more of a builder format.
# df.withColumn('TotalTime', (df.End - df.Start)/1000).groupBy('Car_ID').avg('TotalTime').orderBy('Car_ID').toPandas().plot.bar(x='Car_ID')

### Calculate speed for cars

Using a car length of 2.5 inches, we find the speed in **miles per hour** at each checkpoint from the time the car entered the checkpoint and the time it exited.

We approximate the conversion from inches per second to miles per hour as 1 in/s = 0.0568 mph.
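
The conversion factor and the per-checkpoint calculation can be checked in plain Python. This sketch assumes the timestamps are in milliseconds, as the division by 1000 in the queries suggests. The function name `checkpoint_speed_mph` and the sample timestamps are illustrative only.

```python
INCHES_PER_MILE = 63_360   # 5,280 feet * 12 inches
SECONDS_PER_HOUR = 3_600

# 1 in/s expressed in miles per hour.
IN_PER_SEC_TO_MPH = SECONDS_PER_HOUR / INCHES_PER_MILE

def checkpoint_speed_mph(enter_ms, exit_ms, car_length_in=2.5):
    """Speed in mph of a car that takes (exit_ms - enter_ms) to pass a checkpoint."""
    elapsed_s = (exit_ms - enter_ms) / 1000
    if elapsed_s <= 0:
        raise ValueError("exit time must be after entry time")
    return car_length_in / elapsed_s * IN_PER_SEC_TO_MPH

print(round(IN_PER_SEC_TO_MPH, 4))                 # 0.0568
print(round(checkpoint_speed_mph(1000, 1050), 5))  # 2.84091
```

A car that clears the sensor in 50 ms travels 2.5 inches in 0.05 s, or 50 in/s, which is about 2.84 mph.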

In [None]:
speeds = spark.sql(
    "SELECT _rowkey, car_id, "
    "bround(.0568*2.5/((Checkpoint_1_end - Checkpoint_1)/1000),5) as C1_speed, "
    "bround(.0568*2.5/((Checkpoint_2_end - Checkpoint_2)/1000),5) as C2_speed, "
    "bround(.0568*2.5/((Checkpoint_3_end - Checkpoint_3)/1000),5) as C3_speed, "
    "bround(.0568*2.5/((Checkpoint_4_end - Checkpoint_4)/1000),5) as C4_speed, "
    "bround(.0568*2.5/((Checkpoint_5_end - Checkpoint_5)/1000),5) as C5_speed, "
    "bround(.0568*2.5/((Checkpoint_6_end - Checkpoint_6)/1000),5) as C6_speed, "
    "bround(.0568*2.5/((Checkpoint_7_end - Checkpoint_7)/1000),5) as C7_speed, "
    "bround(.0568*2.5/((Checkpoint_8_end - Checkpoint_8)/1000),5) as C8_speed  "
    "FROM races "
    "ORDER BY start DESC "
    "LIMIT 2 "
)

speeds.show()
speeds.toPandas().plot.bar(x="car_id")

### Helper functions for graphing races

Now we can perform some math on each of the races and graph each one to see the results. We'll define a few helper functions here.

In [None]:
import pandas as pd
from IPython import display
import time

def graphRaces(races, key="_rowkey", live_refresh=False):    
    recentRacesWithDiffs = races

    # Find the diffs for each checkpoint
    checkpoint_cols = [col for col in recentRacesWithDiffs.columns if col.startswith('Checkpoint_')]
    for checkpoint in checkpoint_cols:
        recentRacesWithDiffs = recentRacesWithDiffs.withColumn(
            f"{checkpoint}_diff", 
            (recentRacesWithDiffs[checkpoint] - recentRacesWithDiffs.Start)/1000
        )
    
    # Create a new data structure to use the diffs
    data = {}
    checkpointDiffCols = [f"Checkpoint_{i}_diff" for i in range(1,9)]
    for row in recentRacesWithDiffs.collect():
        data[row[key]] = [row[col] for col in checkpointDiffCols]
    
    raceData = pd.DataFrame(data, index=range(1,9))
    if not live_refresh:
        display.display(raceData.plot.line())
    return raceData

def graphRacesByCar(races):
    # Same graph, but label each line by car instead of by row key.
    return graphRaces(races, "Car_ID")
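
To make the per-checkpoint calculation concrete, here is a plain-Python sketch of what `graphRaces` computes for a single race: the seconds elapsed between the start and each checkpoint entry. The function name `checkpoint_offsets` and the sample row are made up for illustration. Values are strings because the catalog declares every column as `string`.

```python
def checkpoint_offsets(row, n_checkpoints=8):
    """Seconds from race start to each checkpoint entry for one race row."""
    start_ms = int(row["Start"])
    return [
        (int(row[f"Checkpoint_{i}"]) - start_ms) / 1000
        for i in range(1, n_checkpoints + 1)
    ]

# Illustrative row: the car reaches a new checkpoint every 500 ms.
sample_row = {"Start": "1000"}
for i in range(1, 9):
    sample_row[f"Checkpoint_{i}"] = str(1000 + 500 * (i - 1))

print(checkpoint_offsets(sample_row))
# [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
```

Each list becomes one line in the plot, with the checkpoint number on the x-axis and the elapsed seconds on the y-axis.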

In [None]:
# Live refreshing graph
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

# races = spark.sql(
#         "SELECT *, FROM_UNIXTIME(start), random() as rand FROM races "
#         "ORDER BY rand DESC "
#         "LIMIT 2 "
#     )
# data = graphRaces(races)
# display.display(data.plot.line())
# clear

while True:
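    # Update the data. The formatted timestamp gets its own alias (start_time)
    # so that it does not clash with the existing Start column.
    races = spark.sql(
        "SELECT *, FROM_UNIXTIME(start) as start_time FROM races "
        "ORDER BY start DESC "
        "LIMIT 2 "
    )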
    data = graphRaces(races, live_refresh=True)
    plt.plot(data)
    
    legend = list(map(lambda x: x.split('#')[0], data.columns))
    plt.legend(legend)
    
    plt.show()

    display.clear_output(wait=True)  # Clear the previous output
    time.sleep(.2)

### Graph the two most recent races against each other

In [None]:
races = spark.sql(
    "SELECT *, FROM_UNIXTIME(start) FROM races "
    "ORDER BY start DESC "
    "LIMIT 2 "
)
graphRaces(races)

### Query the most recent race for each car and graph them

In [None]:
recentRaces = spark.sql(
    "SELECT * FROM races "
    "WHERE (_rowkey, car_id) IN ( "
    "   SELECT MAX(_rowkey), car_id "
    "   FROM races "
    "   GROUP BY car_id) "
)
# recentRaces.show()

graphRacesByCar(recentRaces)
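
The query relies on `MAX(_rowkey)` picking the latest race for each car. That holds only if row keys sort chronologically within a car, for example a key of the form `<car_id>#<timestamp>` with fixed-width timestamps. That key format is an assumption inferred from the `split('#')` and `LIKE 'CAR0003%'` usage in this notebook. The sketch below reproduces the grouping in plain Python with made-up keys.

```python
def latest_rowkey_per_car(rows):
    """Return {car_id: max row key}, mirroring MAX(_rowkey) ... GROUP BY car_id."""
    latest = {}
    for rowkey, car_id in rows:
        # String comparison is lexicographic, like MAX on a string column.
        if car_id not in latest or rowkey > latest[car_id]:
            latest[car_id] = rowkey
    return latest

# Illustrative (row key, car id) pairs; the timestamps are invented.
rows = [
    ("CAR0001#1700000000000", "CAR0001"),
    ("CAR0001#1700000005000", "CAR0001"),
    ("CAR0003#1700000002000", "CAR0003"),
]

print(latest_rowkey_per_car(rows))
# {'CAR0001': 'CAR0001#1700000005000', 'CAR0003': 'CAR0003#1700000002000'}
```

If the timestamps were not fixed-width, a key such as `CAR0001#999` would sort after `CAR0001#1000`, and the query would return the wrong race.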

### Graph all the races for one car

In [None]:
races = spark.sql(
    "SELECT *, FROM_UNIXTIME(start) FROM races "
    "WHERE _rowkey LIKE 'CAR0003%' "
)

graphRaces(races)

## AI Queries

LLMs let you ask natural-language questions of your data and have them converted into queries that run against it. Here we will use Google Gemini and LangChain.

### Set up connection

Make sure the environment variable **GOOGLE_API_KEY** is set. You can get a key from [Google AI Studio](https://aistudio.google.com/app/apikey).

Create a Spark dataframe agent with the dataframe and LLM specified.

In [None]:
from langchain_experimental.agents.agent_toolkits import create_spark_dataframe_agent
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro")

agent = create_spark_dataframe_agent(llm, df=df, verbose=True)

### Try counting the races for each car

In [None]:
agent.run("tell me the number of races for each car")

### Try graphing the total times for each car

In [None]:
agent.run("make a bar graph showing the average total time of each race per car id")

### Some additional queries for Gemini to try

In [None]:
# agent.run("list the races where the car got to checkpoint_1 in under 5 seconds")

# agent.run("write me the sparksql to list the races where the car got to checkpoint_1 in under 5 seconds")

# agent.run("write me the code to list the races where the car got to checkpoint_1 in under 5 seconds")

# agent.run("make a pandas graph showing the races where the car got to checkpoint_1 in under 5 seconds")