## Generative AI Workshop Companion Notebook

Fiddler AI for Mozilla's Responsible AI Challenge 2023

This notebook shows how to retrieve the workshop database and image files from AWS for additional, independent exploration. It is intended for use after completing the generative workflow and analysis pages of the Streamlit web app.

In [None]:
!pip install umap-learn

In [154]:
import os
import umap
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image, display

BUCKET_URL = 'https://ds-gen-ai-workshop.s3.us-west-2.amazonaws.com'
DATAFRAME_FILE = 'gen_workshop_dataframe.csv'

# SESSION is just a reference code to help isolate data from different workshop sessions and experiments.
# We'll use it below to select a subset of the complete database.   
SESSION = 'banana'
NUM_EMBEDDING_COMPONENTS = 64 # MAX 1536

### Define utility functions for data retrieval 

In [None]:
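import ast

def get_file_from_s3(filename, dest_path=None):
    """Download a file from the workshop bucket and save it locally."""
    dest = dest_path if dest_path else filename

    r = requests.get(BUCKET_URL + '/' + filename)
    r.raise_for_status()  # Fail loudly rather than saving an error page to disk
    with open(dest, 'wb') as f:
        f.write(r.content)

def get_db_data():
    """Download the workshop CSV and return it as a dataframe."""
    get_file_from_s3(DATAFRAME_FILE)

    df = pd.read_csv(DATAFRAME_FILE)

    # Lists get stringified in csv format; literal_eval parses them without executing code
    df['embedding'] = df['embedding'].apply(ast.literal_eval)

    return df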

df_raw = get_db_data()

df_raw
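As a minimal sketch of why the embedding column needs parsing after loading, the round trip below writes a tiny dataframe to CSV and reads it back. The two-row dataframe and its values are made up for illustration; only the 'embedding' and 'prompt_id' column names come from this notebook.

```python
import ast
import io

import pandas as pd

# Hypothetical miniature of the workshop CSV: one list per row in 'embedding'.
original = pd.DataFrame({
    'prompt_id': ['a1', 'b2'],
    'embedding': [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Writing to CSV stores each list as its string representation.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(buffer)
was_string = isinstance(restored['embedding'][0], str)

# ast.literal_eval accepts Python literals only, so it cannot run arbitrary code.
restored['embedding'] = restored['embedding'].apply(ast.literal_eval)
```

After the last line, each entry of restored['embedding'] is a list of floats again and can be stacked into a NumPy array.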

### Run UMAP dimensionality-reduction transformation on embeddings
Note that we keep only the first NUM_EMBEDDING_COMPONENTS components of each embedding before running UMAP. 64 or 128 components provide good results given the limited data available.

In [None]:
df = df_raw[df_raw['session_id']==SESSION].reset_index(drop=True)
embs = np.array(df['embedding'].tolist())
reducer = umap.UMAP(n_components=2, n_neighbors=3, random_state=42)
umap_coords = reducer.fit_transform(embs[:, :NUM_EMBEDDING_COMPONENTS])

df_umap = pd.DataFrame(umap_coords, columns=['UMAP_'+str(x) for x in range(umap_coords.shape[1])])

df = pd.concat([df, df_umap], axis=1)
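The cell above resets the index of the filtered dataframe before concatenating the UMAP columns. The sketch below, on made-up data, suggests why that step matters: pd.concat with axis=1 aligns rows by index label, so a filtered frame that keeps its original labels will not line up with a freshly built coordinate frame. The session names and zero-valued coordinates are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the full workshop table.
full = pd.DataFrame({
    'session_id': ['apple', 'banana', 'apple', 'banana'],
    'prompt_id': ['p0', 'p1', 'p2', 'p3'],
})

# Placeholder coordinates, indexed 0..1 like any newly created frame.
coords = pd.DataFrame(np.zeros((2, 2)), columns=['UMAP_0', 'UMAP_1'])

# Filtering keeps the original labels (1 and 3), so rows fail to line up.
subset_unreset = full[full['session_id'] == 'banana']
misaligned = pd.concat([subset_unreset, coords], axis=1)

# Resetting the index gives labels 0..1, matching the coordinate frame.
subset = subset_unreset.reset_index(drop=True)
aligned = pd.concat([subset, coords], axis=1)
```

Here misaligned ends up with three rows padded with NaN, while aligned has the expected two complete rows.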

### Plot the result and label according to a categorical column from the dataframe
This would typically be one of the feedback columns or, as set below, the category for which the user prompt was generated.

In [None]:
plt.figure(figsize=[6, 6])

col = 'category'

for val in df[col].unique():
    df_temp = df[df[col] == val]
    
    plt.plot(df_temp['UMAP_0'], df_temp['UMAP_1'], 'o', label=val)

plt.xlabel('UMAP_0')
plt.ylabel('UMAP_1')
plt.legend();

### Retrieve images from S3 and display them in the notebook
One might choose to isolate a cluster of interesting examples in the semantic UMAP space and show their prompts and images in a loop.

In [None]:
row_id = 1
data_row = df.iloc[row_id]

# Print some data
print(data_row)

# Display an Image
filename = data_row.prompt_id + '.png'
get_file_from_s3(filename)
display(Image(filename=filename))
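One possible way to isolate a cluster, sketched below, is to select rows whose UMAP coordinates fall inside a rectangle read off the scatter plot. The helper select_region, the coordinate ranges, and the demo dataframe are all hypothetical; only the UMAP_0, UMAP_1 and prompt_id column names come from this notebook.

```python
import pandas as pd

def select_region(frame, x_range, y_range):
    """Return rows whose UMAP coordinates lie inside a rectangle (bounds inclusive)."""
    in_x = frame['UMAP_0'].between(*x_range)
    in_y = frame['UMAP_1'].between(*y_range)
    return frame[in_x & in_y]

# Synthetic stand-in for the dataframe produced by the UMAP cell.
demo = pd.DataFrame({
    'prompt_id': ['p0', 'p1', 'p2', 'p3'],
    'UMAP_0': [0.5, 1.5, 4.0, 1.2],
    'UMAP_1': [0.5, 1.8, 4.2, 1.1],
})

cluster = select_region(demo, x_range=(1.0, 2.0), y_range=(1.0, 2.0))

# Image files are named after the prompt_id, as in the cell above.
image_files = [prompt_id + '.png' for prompt_id in cluster['prompt_id']]
```

In the notebook, each name in image_files could then be passed to get_file_from_s3 and shown with display(Image(filename=...)) inside a loop.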