<a href="https://colab.research.google.com/github/Eric-Prosser/Prompting-Insight/blob/main/Fantastic_Futures_2025_Prompting_Insight_Master.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Fantastic Futures 2025**

## **Prompting Insight: Using Generative AI and Python to Visualize Library Data**

Welcome! We hope you enjoy this workshop regardless of your experience with generative AI or with Python. Although some knowledge of each will be useful, we've designed the workshop so that you can jump in without prior experience with either.

📁 Before You Begin: Make Your Own Copy

* **Click *File* > *Save a copy in Drive*.**

This will create a copy of the notebook in your own Google Drive.
Rename the notebook if you'd like, then start working in your version!
All of your changes will be saved automatically to your Drive.



## **Getting Started**

If you're new to Google Colab or need a refresher on Google Colab notebooks, check out these helpful resources:

[Getting Started with Google Colab: A Beginner's Guide](https://www.google.com/url?q=https%3A%2F%2Fwww.marqo.ai%2Fblog%2Fgetting-started-with-google-colab-a-beginners-guide%2F)

[Google Colaboratory](https://www.google.com/url?q=https%3A%2F%2Fcolab.google%2F)

This Colab notebook contains everything you need to follow this workshop. You'll also use a generative AI tool (we are using ChatGPT) to help write Python code. You'll switch between this notebook and ChatGPT to generate and run code that explores your data and answers questions about the dataset.


---

## **Introduction**

*Dark data* describes data that organizations regularly collect and store but that remains unused, often due to a lack of capacity to evaluate the data or a lack of knowledge of its existence. Examples of dark data that might be collected by libraries include circulation statistics, gate counts, and chat reference transcripts. This study attempts to convert a small amount of the information collected by an academic library from dark data to tangible data.

What sort of areas might you explore with this data from libraries?

* Space utilization
* Collection usage
* Patron behaviors
* Instruction effects
* Workflow optimization
* Discovery layer insights

In this workshop, we are going to look at a curated sample of transcripts from a chat reference service at a large doctoral research institution in the United States.

---

### 📦 Setting Up Your Environment (First-Time Users Only)

💡 **Tip:** If you're running this notebook in Google Colab for the first time, run the cell below to install the libraries you'll need.

This installs:

- `pandas` for working with data
- `matplotlib` for creating plots
- `scipy` for calculating statistics
- `seaborn` for enhanced data visualizations

> You only need to run this cell once when opening the notebook. In Google Colab, you'll click on the arrow to run the code you've loaded. In this first example, the code is already loaded for you.

<p align="center">
  <img src="https://raw.githubusercontent.com/Eric-Prosser/Prompting-Insight/main/resources/FirstRun.png" alt="Click To Run" width="800">
</p>

---

💡 **Need something else?**  
Depending on the code you generate later, you might need additional packages.

If you see an error like `ModuleNotFoundError`, simply ask ChatGPT:
> "How do I install this package in Google Colab?"

ChatGPT will generate a one-line install command like `!pip install package-name` for you to copy and paste.
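
Before asking for an install command, it can help to check which packages are already available. The snippet below is a minimal sketch using only the Python standard library; `missing_packages` is a hypothetical helper written for illustration and is not part of the workshop materials.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The four libraries this notebook installs; add any others your generated code needs.
needed = ["pandas", "matplotlib", "scipy", "seaborn"]
print("Missing packages:", missing_packages(needed))
```

Keep in mind that the name you import is not always the name you install (for example, `sklearn` is installed as `scikit-learn`), so the check uses import names.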

In [1]:
# 📦 Install essential libraries for data analysis and visualization
!pip install --quiet pandas matplotlib scipy seaborn



---

### **Part 1: Loading the dataset**

Click the **RUN** arrow next to the code cell below to execute it. This will load the dataset and display the first few rows so you can confirm everything is working correctly.

<p align="center">
  <img src="https://raw.githubusercontent.com/Eric-Prosser/Prompting-Insight/main/resources/FirstCode.png" alt="Click To Run" width="800">
</p>

In [None]:
# This code imports the dataset that we'll analyze.

import pandas as pd

url = "https://raw.githubusercontent.com/Eric-Prosser/Prompting-Insight/main/data/FF2025_sample_data.csv"
df = pd.read_csv(url)

# Display the first few rows
df.head()

If you now see a table showing the first few rows of data, you've successfully loaded the dataset. Nice work!

Here's what the expected output looks like:

<p align="center">
  <img src="https://raw.githubusercontent.com/Eric-Prosser/Prompting-Insight/main/resources/DataHeader.png" alt="Expected Output" width="1000">
</p>

This dataset contains a small excerpt from the larger dataset available through Springshare's chat reference service with column titles renamed for ease of use. This dataset includes the following information:

* *Number* - A unique number for each chat reference session. These have been simplified for this workshop.

* *StudentLevel* - Students are identified as Graduate (post-baccalaureate) or Undergraduate (baccalaureate-seeking).

* *Timestamp* - This contains the starting date (in MM/DD/YYYY format) and starting time (24h format) of each chat reference session.
* *Transcript* - Anonymized transcripts of the chat reference sessions. Librarian statements are preceded by "Librarian:" and student statements are preceded by "Student:".
* *Duration* - The length of the chat reference session in seconds.
* *Rating* - An optional rating provided by the student at the end of the session, rated from 1 (worst) to 4 (best). An entry of 0 indicates no response to the survey.

Examine the variables listed above. Consider which ones might help you if you were examining differences in how students at different levels use chat reference, how well the help is rated, or how much time the typical student spends in a chat reference session.
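
A quick way to get a feel for these variables is to count the values in a column. The sketch below uses a tiny made-up table (`sample`, with invented values) that borrows three of the column names described above. In the notebook, you would apply the same calls to `df` instead.

```python
import pandas as pd

# Tiny made-up stand-in that reuses three of the workshop column names.
# The values are invented for illustration; in the notebook, use `df`.
sample = pd.DataFrame({
    "StudentLevel": ["Graduate", "Undergraduate", "Undergraduate", "Graduate"],
    "Duration": [420, 180, 300, 600],
    "Rating": [4, 0, 3, 0],
})

# How many sessions came from each student level?
level_counts = sample["StudentLevel"].value_counts()

# How many sessions have no survey response (a Rating of 0)?
unrated = int((sample["Rating"] == 0).sum())

print(level_counts)
print("Sessions without a rating:", unrated)
```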



### **Part 2: Priming the model**

To begin, let's prime the ChatGPT model with the background on our project. Ideally, your prompt will have sufficient structure and detail to minimize ambiguity and misunderstanding on the part of the model and to allow it to provide an output in the most useful format to you.

In the interest of time for this workshop, we've provided a context prompt that you can access at [this link](https://links.asu.edu/FF2025Prompt). Copy and paste that prompt into ChatGPT and ensure that it responds appropriately before moving on.

Note that the prompt describes the format of the dataset and also describes the actions we've taken during the prior steps of this notebook.

---

### **Part 3: Calculating statistical summaries**

Use ChatGPT (or your preferred generative AI model) to help you write Python code that calculates the **mean**, **median**, and **standard deviation** for the following columns:

- `Duration`
- `Rating`

💡 **Tips:**
* Remind ChatGPT that you've already loaded the dataset into a DataFrame called `df`, and that you'd like code you can copy and paste into your Colab notebook.
* Ask ChatGPT to ignore values of 0 in the `Rating` column. A 0 is not a true rating; it means the patron did not provide one.

Once you've generated the code, paste it into the cell below and run it to calculate the statistics.

**Optional:** Want to double-check your results? Download the dataset directly from GitHub and calculate the same statistics using Excel or another tool you're familiar with.
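
For comparison with what ChatGPT returns, here is one possible shape the code could take. It is a sketch rather than the expected answer, and it runs on a small made-up table (`sample`, invented values) so that it is self-contained. In the notebook, replace `sample` with `df`.

```python
import pandas as pd

# Made-up values for illustration only; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "Duration": [420, 180, 300, 600, 240],
    "Rating": [4, 0, 3, 0, 2],
})

# Duration: every session counts.
duration_stats = sample["Duration"].agg(["mean", "median", "std"])

# Rating: drop the 0 entries first, because 0 means "no survey response".
rated = sample.loc[sample["Rating"] != 0, "Rating"]
rating_stats = rated.agg(["mean", "median", "std"])

print("Duration (seconds):")
print(duration_stats)
print("Rating (0 values excluded):")
print(rating_stats)
```

If you cross-check in Excel, note that pandas computes the sample standard deviation by default, which corresponds to Excel's `STDEV.S` rather than `STDEV.P`.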


In [None]:
# Paste in your code here

In [None]:
"""
Paste in your ChatGPT prompt(s) here

""";

### **Part 4a: Statistical significance**

We have now used ChatGPT to give us simple statistical analyses, but deeper insights come from more complex examinations of the data. Let's examine whether student level is related to the time spent in each chat reference session. For example, we might hypothesize that graduate students' chats take longer to resolve than undergraduate students' chats because, we assume, graduate students ask more complex questions.

To examine this we will need to work with two columns:

- `Duration`
- `StudentLevel`

What might be measured to compare the two groups of students and how can we determine whether there is a statistically significant difference between the two? Ask ChatGPT for options. What statistical tests does it suggest? Select an appropriate test and then ask ChatGPT to generate code to calculate the mean for each group and perform the appropriate statistical test.

Once you've generated the code, paste it into the cell below and run it to calculate the statistics.

💡 **As before, remember these tips:**
* Remind ChatGPT that you've already loaded the dataset into a DataFrame called `df`, and that you'd like code you can copy and paste into your Colab notebook.
* If your analysis also uses the `Rating` column, ask ChatGPT to ignore values of 0. A 0 is not a true rating; it means the patron did not provide one.
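
One test ChatGPT may suggest is Welch's t-test, which compares two group means without assuming the groups have equal variances. The sketch below shows what that could look like with `scipy.stats.ttest_ind`. It is only one reasonable option, not the workshop's prescribed answer, and it uses a small made-up table (`sample`, invented values). In the notebook, replace `sample` with `df`.

```python
import pandas as pd
from scipy import stats

# Made-up values for illustration only; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "StudentLevel": ["Graduate"] * 4 + ["Undergraduate"] * 4,
    "Duration": [600, 540, 720, 660, 300, 240, 360, 420],
})

# Split the durations by student level.
grad = sample.loc[sample["StudentLevel"] == "Graduate", "Duration"]
undergrad = sample.loc[sample["StudentLevel"] == "Undergraduate", "Duration"]

print("Mean duration by level (seconds):")
print(sample.groupby("StudentLevel")["Duration"].mean())

# Welch's t-test: equal_var=False drops the equal-variance assumption.
result = stats.ttest_ind(grad, undergrad, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Session durations are often skewed by a few very long chats. If that is the case in your data, a rank-based test such as `scipy.stats.mannwhitneyu` may be a better fit, and it is worth asking ChatGPT to explain the trade-off.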


In [None]:
# Paste in your code here

In [None]:
"""
Paste in your ChatGPT prompt(s) here

""";

### **Part 4b: Visualizing the differences**

So far we have generated numerical answers, but visualizations can tell data stories better than numbers alone and can make the results more intuitive.

If you have some ideas on how to visualize the difference in means between undergraduate and graduate students, feel free to prompt ChatGPT for code to create that visualization. If you'd like to explore some options, ask ChatGPT for suggestions on what visualizations might be the most compelling. Choose one and ask ChatGPT for the code.

Once you've generated the code, paste it into the cell below and run it to generate the visualization.

💡 **Tips**
* Don't forget to specify titles, axis labels, and legends, if needed.
* Experiment with putting both groups on to one chart or using separate charts for each student group.
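
As a point of reference, a box plot is one common way to put both groups on one chart. The sketch below is an illustration under assumptions, not the expected answer: `plot_duration_by_level` is a hypothetical helper, and `sample` holds made-up values. In the notebook, pass `df` to the function instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; in the notebook, use `df` instead.
sample = pd.DataFrame({
    "StudentLevel": ["Graduate"] * 4 + ["Undergraduate"] * 4,
    "Duration": [600, 540, 720, 660, 300, 240, 360, 420],
})


def plot_duration_by_level(data):
    """Draw one box per student level on a single chart and return the figure."""
    levels = sorted(data["StudentLevel"].unique())
    groups = [
        data.loc[data["StudentLevel"] == level, "Duration"].to_numpy()
        for level in levels
    ]

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.boxplot(groups)  # boxes are drawn at x positions 1, 2, ...
    ax.set_xticks(range(1, len(levels) + 1))
    ax.set_xticklabels(levels)
    ax.set_title("Chat session duration by student level")
    ax.set_xlabel("Student level")
    ax.set_ylabel("Duration (seconds)")
    return fig


fig = plot_duration_by_level(sample)
plt.show()
```

A box plot shows the median and spread of each group rather than the mean, so a bar chart of means with error bars is a reasonable alternative if you want to mirror the statistical test directly.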

In [None]:
# Paste in your code here

In [None]:
"""
Paste in your ChatGPT prompt(s) here

""";

### **Part 5: Time series**

At this point, you should be feeling more familiar with using ChatGPT to generate code and explore data, so this will be a more challenging portion.

Let's say you want to examine whether there are differences in chat reference usage over time. When do more questions come in? Is there a particular time of day that is more common? Are there particular times of the year when more questions come in? Is there a difference between undergraduate and graduate students and the time they use the service?

Choose one of the following questions and use ChatGPT to generate code for an appropriate visualization. Compare the behaviors of undergraduate and graduate students.

* What time of the day is the most common for chat reference questions and does that vary based on the day of the week?
* What months of the year show the most and least usage of the chat reference service?


💡 **Background information**
* During the time frame of the data, the chat reference service was staffed and open from 8am to 5pm, Monday through Friday.
* The academic semesters generally last from mid-August through the first week of December and then mid-January through the first week of May. There is a reduced courseload available during the summer months.
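
Whichever question you choose, the first step is usually to turn the `Timestamp` text into real datetime values so that you can pull out the hour, weekday, or month. The sketch below shows that step on a small made-up table (`sample`, invented values). The exact timestamp layout, `MM/DD/YYYY HH:MM`, is an assumption here, so check a few rows of `df` and adjust or drop the `format` argument if the real data differs.

```python
import pandas as pd

# Made-up values for illustration only; in the notebook, use `df` instead.
# Assumption: timestamps look like "MM/DD/YYYY HH:MM" in 24-hour time.
sample = pd.DataFrame({
    "StudentLevel": ["Graduate", "Undergraduate", "Undergraduate",
                     "Undergraduate", "Graduate"],
    "Timestamp": ["01/15/2024 09:30", "01/15/2024 09:45", "01/16/2024 14:05",
                  "02/02/2024 14:20", "02/02/2024 09:10"],
})

# Convert the text column to datetimes, then derive the pieces we want to group by.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%m/%d/%Y %H:%M")
sample["Hour"] = sample["Timestamp"].dt.hour
sample["Weekday"] = sample["Timestamp"].dt.day_name()
sample["Month"] = sample["Timestamp"].dt.month_name()

# Count sessions per starting hour, with one column per student level.
hourly = sample.groupby(["Hour", "StudentLevel"]).size().unstack(fill_value=0)
print(hourly)
```

From a table like `hourly`, a call such as `hourly.plot(kind="bar")` gives a grouped bar chart. Grouping by `Weekday` or `Month` instead of `Hour` addresses the other questions in the same way.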

In [None]:
# Paste in your code here

In [None]:
"""
Paste in your ChatGPT prompt(s) here

""";