# AI Assistant
## Group Members:
 - Krylova Alena
 - Dudic Mateja
 - Saavedra Triana Erwin Omar
 - Maringer Kelvin

## Python Version:
 - 3.13

## Contributions:
 - Krylova Alena:
     - ...
 - Dudic Mateja:
     - ...
 - Saavedra Triana Erwin Omar:
     - ...
 - Maringer Kelvin:
     - ...

# FIRST TIME SETUP
# ----------------
# MAKE SURE THAT THIS CELL RUNS WITHOUT ERRORS BEFORE PROCEEDING
# ----------------

In [None]:
# All the packages that have to be installed should be listed here
%pip install numpy pandas matplotlib seaborn kagglehub ipywidgets --quiet
# This will filter out the output from Jupyter Notebooks when committing to git, so that diffs are cleaner
! git config filter.strip-notebook-output.clean 'jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to=notebook --stdin --stdout --log-level=ERROR'

import os
import kagglehub

# Download latest version and build the CSV path in an OS-independent way
dataset_path = os.path.join(
    kagglehub.dataset_download("prince7489/daily-ai-assistant-usage-behavior-dataset"),
    "Daily_AI_Assistant_Usage_Behavior_Dataset.csv",
)

print("Path to dataset files:", dataset_path)

# ----------------------
# ----------------------

In [None]:
#All the imports should be listed here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plot
import seaborn as sea


## *Dataset Overview*
The Daily AI Assistant Usage Behavior Dataset captures patterns of how users interact with AI assistants throughout their daily activities. It provides insights into when, how, and for what purposes people use AI tools, as well as session characteristics and user satisfaction.

The dataset is published on the Kaggle platform and is intended for researchers, developers, and data science practitioners interested in user behavior analysis, personalization systems, recommendation engines, and conversational AI. It covers a wide range of AI usage scenarios, including learning, productivity, research, and routine daily tasks.

The dataset contains 300 rows and 8 columns.

*Features (their meaning and data types):*  
1st column: timestamp - date and time when the interaction with an AI tool started, data type - string (converted to datetime during preprocessing)  
2nd column: device - type of device which was used to access an AI tool (desktop, mobile, smart speaker), data type - categorical (string)  
3rd column: usage_category - the purpose for which the user used an AI tool (education, daily tasks, research, etc.), data type - categorical (string)  
4th column: prompt_length - length of the user's prompt (measured in characters), data type - integer  
5th column: session_length_minutes - duration of the session in minutes, data type - float  
6th column: satisfaction_rating - user satisfaction score from 1 to 5, data type - integer  
7th column: assistant_model - which AI assistant model was used during the session, data type - categorical (string)  
8th column: tokens_used - number of tokens used during the session, data type - integer  

Half of the features are categorical or text-based (the other half are numerical), which makes the dataset suitable for analyzing usage patterns and segmenting user behavior (for example, the 'timestamp' feature shows whether people use AI tools more often on weekdays or weekends, in the mornings or in the evenings).
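
As a minimal sketch of the weekday/weekend idea mentioned above, the timestamp strings can be parsed and split by day of week. The rows in `toy` are made-up stand-ins for the real CSV, and the column names `is_weekend` and `hour` are only illustrative.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (assumption, not real data)
toy = pd.DataFrame({
    "timestamp": [
        "2025-01-04 09:15:00",  # Saturday
        "2025-01-05 21:40:00",  # Sunday
        "2025-01-06 08:05:00",  # Monday
        "2025-01-07 14:30:00",  # Tuesday
    ]
})

# Parse the strings once, then derive the features from the datetime values
parsed = pd.to_datetime(toy["timestamp"])
toy["is_weekend"] = parsed.dt.dayofweek >= 5  # Monday=0 ... Sunday=6
toy["hour"] = parsed.dt.hour

# Number of sessions on weekends vs. weekdays
weekend_counts = toy["is_weekend"].value_counts()
print(weekend_counts)
```

The same two derived columns could then be used for grouping, for example to compare session counts per hour.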

To obtain a statistical summary of the numerical features, the describe() method was used.
It provides key statistics such as the mean, standard deviation, minimum and maximum values, and the quartiles, which give a better picture of how the data is distributed.   
*Some observations from the describe() output:*   
The average prompt length is 129 characters, which suggests that users often submit fairly detailed prompts.  
The average session duration is about 7.7 minutes, indicating that most interactions with the AI assistant are relatively short.  
The average satisfaction rating is close to 3 (on a scale from 1 to 5), which shows that the user experience is neutral (or mildly positive) overall.  
Token usage varies significantly, reflecting differences in query complexity.  
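
One possible way to put a number on "varies significantly" is the coefficient of variation (standard deviation divided by mean), which can be read straight off the describe() output. This is only a sketch: the values in `toy` are invented, and the coefficient of variation is our own suggestion rather than something the assignment requires.

```python
import pandas as pd

# Invented values, only to illustrate the calculation (not the real data)
toy = pd.DataFrame({
    "prompt_length": [100, 120, 140],
    "tokens_used": [100, 500, 900],
})

summary = toy.describe()

# Coefficient of variation: relative spread, comparable across columns
cv = summary.loc["std"] / summary.loc["mean"]
print(cv.round(2))
```

A column with a larger value here is more spread out relative to its own average, which makes columns with different units comparable.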


In [None]:
## Code for the dataset overview Here
data = pd.read_csv(dataset_path)
print(data.head())
print(data.info())   # to get information about the dataframe (number of rows, columns, data types)
# to get basic statistics about the dataframe (mean, std, min, max, etc.)
print(data.describe())

## *Data Quality Check*
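
The checks in this section cover missing values, boxplots of the numerical columns, and the frequencies of the categorical values. As a small sketch of how rare or misspelled labels could be flagged automatically, the helper below lists categories whose share falls under a threshold. The function name `rare_categories`, the 5% threshold, and the toy values are all assumptions made for illustration.

```python
import pandas as pd

def rare_categories(values: pd.Series, min_share: float = 0.05) -> pd.Series:
    """Return the relative frequency of every label rarer than min_share."""
    shares = values.value_counts(normalize=True)
    return shares[shares < min_share]

# Toy column with one deliberately misspelled label ("desktop " with a space)
toy = pd.Series(["Desktop"] * 48 + ["Mobile"] * 50 + ["desktop "] * 2, name="device")

rare = rare_categories(toy)
print(rare)
```

An empty result would support the conclusion that there are no unusual or extremely rare entries.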



In [None]:
## Code for the data quality check Here
print(f"Missing values per column: \n {data.isnull().sum()}")
# The dataset contains no missing values
# one boxplot per numerical column, all in a single row
numeric_columns = data.select_dtypes(include=[np.number]).columns
fig, axes = plot.subplots(1, len(numeric_columns), figsize=(15, 5), squeeze=False)
axes = axes.flatten()

for ax, col in zip(axes, numeric_columns):
    sea.boxplot(y=data[col], ax=ax)
plot.suptitle("Boxplots for the numerical columns")
plot.show()

# The boxplots show no outliers in the numerical columns

print(data['device'].value_counts())
print(data['usage_category'].value_counts())
print(data['assistant_model'].value_counts())
# For the categorical columns, we examined the unique values and their frequencies. There are no unusual or extremely rare entries, so no outliers were detected in these columns either.


## *Data-Preprocessing*
 - Additional Notes etc...
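
The notes in this section propose keeping unusual rows and only marking them as outliers. A minimal sketch of that idea, using the same 1.5 × IQR rule, is shown below. The helper name `iqr_outlier_mask`, the flag column `tokens_used_is_outlier`, and the toy values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask that is True where a value lies outside Q1 - k*IQR .. Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy column with one obviously extreme value (not taken from the real data)
toy = pd.DataFrame({"tokens_used": [10, 12, 11, 13, 12, 100]})

# Keep every row and add a flag column instead of dropping anything
toy["tokens_used_is_outlier"] = iqr_outlier_mask(toy["tokens_used"])
print(toy)
```

Flagging instead of dropping keeps the original rows available, so later analysis steps can decide whether to include or exclude them.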

In [None]:
# The checks at the start of this cell can also be reused for the Data Quality Check
RAW_data = pd.read_csv(dataset_path)

# first we check the number of missing values in each column
print(RAW_data.isnull().sum())
# after checking we can see that there are no missing values in the dataset

# for outliers in this dataset, I think the best approach would be to keep them and just mark them as outliers. Here, outliers might be relevant data from users who behave differently from the average user, so removing them would mean losing relevant data.

# now we will check for outliers using the IQR method

numeric_columns = RAW_data.select_dtypes(include=[np.number]).columns
for column in numeric_columns:
    quartile_min = RAW_data[column].quantile(0.25)
    quartile_max = RAW_data[column].quantile(0.75)

    IQR = quartile_max - quartile_min

    lower_bound = quartile_min - 1.5 * IQR
    upper_bound = quartile_max + 1.5 * IQR

    # count the rows outside the bounds (one count per row, not per cell)
    is_outlier = (RAW_data[column] < lower_bound) | (RAW_data[column] > upper_bound)
    outlier_count = int(is_outlier.sum())

    print(f"the number of outliers in {column} is the following: \n{outlier_count}")

# Start of the data preprocessing

# after checking we can see that there are no outliers in the dataset, surprising but good.
# next we will check for duplicates in the dataset
x = RAW_data.duplicated().sum()
print(f"the number of duplicates in the dataset is the following: \n{x}")
# after checking we can see that there are no duplicates in the dataset

# so we continue with creating the required columns


# I wrote a function for this part so it is easier to read. Since the timestamp is still a string at this point, I slice out the hour part and convert it to int for the comparison
def timeOfDay(hour):
    if 5 <= hour <= 11:
        return "morning"
    elif 12 <= hour <= 17:
        return "afternoon"
    elif 18 <= hour <= 22:
        return "evening"
    else:
        return "night"

RAW_data["timeOfDay"] = RAW_data["timestamp"].apply(lambda x:timeOfDay(int(x[11:-6])))
RAW_data["year"] = RAW_data["timestamp"].apply(lambda x:int(x[0:4]))

# now we are going to convert the columns and timestamp to their proper datatypes

RAW_data["timestamp"] = pd.to_datetime(RAW_data["timestamp"])
RAW_data["device"] = RAW_data["device"].astype("category")
RAW_data["assistant_model"] = RAW_data["assistant_model"].astype("category")
RAW_data["timeOfDay"] = RAW_data["timeOfDay"].astype("category")
RAW_data["usage_category"] = RAW_data["usage_category"].astype("category")

# note that the numericals stay the same, and I decided to leave year as a number

RAW_data




## *Data Analysis*
 - Additional Notes etc...
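
Two pandas shortcuts may be useful for the count, percentage, and "category per model" questions in this section: `value_counts(normalize=True)` returns shares directly, and `pd.crosstab` builds a count table in which absent combinations show up as 0. The sketch below uses made-up labels ("A", "B") rather than the real model names.

```python
import pandas as pd

# Made-up rows; the labels are placeholders, not the real categories
toy = pd.DataFrame({
    "assistant_model": ["A", "A", "B", "A"],
    "usage_category": ["Coding", "Writing", "Coding", "Coding"],
})

# Count and percentage per assistant model
counts = toy["assistant_model"].value_counts()
share = (toy["assistant_model"].value_counts(normalize=True) * 100).round(2)
overview = pd.DataFrame({"count": counts, "percentage": share})
print(overview)

# Count of each usage category per assistant model
table = pd.crosstab(toy["assistant_model"], toy["usage_category"])
print(table)
```

Both results are indexed by the model label, so they can be compared or joined with the other per-model summaries.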

In [None]:
#1. Different AI Assistants used (count and percentage).

assistant = RAW_data['assistant_model'] #selecting the column assistant_model
count = assistant.value_counts() #counting how many times each assistant appears
percentage = ((count/RAW_data['assistant_model'].count())*100).round(2) #calculating the percentage of each assistants occurrence

different_assistants = pd.DataFrame({'count': count, 'percentage': percentage}) #making a dataframe with the results
print(different_assistants)
#There are 5 different AI models in the dataset

#2. Average session length per assistant model

average_session_length = RAW_data.groupby('assistant_model')['session_length_minutes'].mean().round(2)
average_session_length_output = pd.DataFrame({'Average length': average_session_length})
print(average_session_length_output)
#Average session length is relatively similar for all AI models, with the highest being GPT-5 at 8.15 and the lowest being o1 at 7.18 minutes

#3. Usage category per assistant model
pivotinho = pd.pivot_table(RAW_data, index='assistant_model', columns='usage_category', aggfunc='count', values='timestamp')
print(pivotinho)
# An interesting observation is that o1 is used more for Writing and Education than for the other categories, possibly because of stronger reasoning compared to other models.
# In general the most used models are the three GPT models, with GPT-4 being the most consistently used of the three

#4. Longest average prompt length and use time per task
longest_avg_prompt = RAW_data.groupby('usage_category')['prompt_length'].mean().round(2)
print(longest_avg_prompt)

longest_avg_time = RAW_data.groupby('usage_category')['session_length_minutes'].mean().round(2)
print(longest_avg_time)
#The longest average prompt length is in Research with 141.26 characters on average, while the longest average session length is for Coding with 8.51 minutes

#5. Usage category per time of day
usage_category_per_timeOfDay = pd.pivot_table(RAW_data, index='usage_category', columns='timeOfDay', aggfunc='count', values='timestamp')
print(usage_category_per_timeOfDay)
# Most categories have a specific time of day during which they occur least often. For example, education is used least at night, while writing is used least during the evening


#6. Popularity of assistants over time
assistant_model_per_year = pd.pivot_table(RAW_data, index='assistant_model', columns='year', aggfunc='count', values='timestamp')
print(assistant_model_per_year)
#Only one year is present in the data, so popularity cannot be compared over time; the most used assistant overall is GPT-4

In [None]:
## code for data analysis here

#Visualizations:
#– Plot distributions of key features using histograms, KDE plots, and boxplots.
#– Use color to distinguish individual assistants.

def rem_uscore(axis):
    """Tidy the axis labels: replace underscores with spaces and shorten 'minutes' to 'in min.'."""
    axis.set_xlabel(axis.get_xlabel().replace("_", " ").replace("minutes", "in min."))
    axis.set_ylabel(axis.get_ylabel().replace("_", " ").replace("minutes", "in min."))

#
# HISTOGRAMS
#

fg, axes = plot.subplots(1, 3, figsize=(30, 10))
fg.suptitle("Histograms (∑300 entries)")

#  sea.kdeplot(data=RAW_data, x="satisfaction_rating", hue="assistant_model", fill=True,alpha=.1,palette="muted")
#  plot.gca().axes.get_yaxis().set_visible(False) # Hide "Density" label since it's annoying


# Histograms are chosen to show the distribution amongst assistant models and devices.
# This helps to visualize which models and devices are actually being used. (NOTE: the sample is quite small (300 entries), so the results might not carry over to the real world)
# I cannot imagine that smart speakers are the most used platform for AI assistants, but hey, who knows! Maybe I just haven't kept in touch with the latest trends :')

sea.histplot(ax=axes[0],data=RAW_data, x="assistant_model",stat="count", palette="muted", hue="assistant_model",legend=False)
axes[0].set_title("Assistant Model count")
rem_uscore(axes[0])
print("Most people seem to use GPT-4o, which is interesting as it is not exactly the cheapest model available.")



#  axes[1].tick_params(left=False)  # Remove y-axis ticks

print("The biggest platform is relatively surprising: smart speakers. I assume that this means devices like Amazon Alexa and Google Home, which is interesting since these devices are not really known for their AI capabilities. Weird... and interesting!")


sea.histplot(ax=axes[1],data=RAW_data, x="usage_category", palette="muted", hue="usage_category",legend=False)


#sea.histplot(ax=axes[2],data=RAW_data, x="usage_category",palette="muted", hue="device",multiple="stack")


#sea.histplot(ax=axes[2],data=RAW_data, x="device",palette="muted",alpha=.5, hue="usage_category",multiple="stack")

axes[1].set_title("Usage Category count")
axes[1].set_ylabel('')  # Remove y-axis label
axes[1].tick_params(axis='x', labelrotation=45) # super Fancy 45° rotation for extra coolness
rem_uscore(axes[1])



sea.histplot(ax=axes[2],data=RAW_data, x="device", palette="muted", hue="device",legend=False,alpha=.25)
axes[2].set_title("Device count")

#  axes[2].set(yticklabels=[])
axes[2].set_ylabel('')  # Remove y-axis label
rem_uscore(axes[2])

axes_2_overlay = axes[2].twinx()# This combines two plots into one (so both the use counts of devices and the breakdown into usage categories can be seen AT THE SAME TIME.)

sea.countplot(ax=axes_2_overlay, data=RAW_data, x="device", hue="usage_category", palette="muted") # This is so cool. I love it.
axes_2_overlay.set_ylabel('Usage Count')  # Label the overlay's y-axis (count per usage category)
axes_2_overlay.get_xticklabels()[0].set_color(sea.color_palette("muted")[0])





print("The most common usage for AI seems to be education. Depending on the definition of this category, this could mean that a lot of people are using these assistants for homework and study help.")
print("This honestly makes me question where this data is from, since I would expect a LOT more people to use AI in a more professional setting (work, coding, writing etc.) rather than for education.")
print("Coding is one of the least common categories, which is very surprising to me, considering that AI coding assistants are one of the more ACTUALLY USEFUL applications of AI right now. Interesting!")

#desktop_writing
#print(RAW_data[(RAW_data["device"]=="Smart Speaker") & (RAW_data["usage_category"]=="Coding")].value_counts())# -> Why would someone use a smart speaker for coding? This suggests the dataset is not real. (Alexa, commit & push the code.)


plot.show()


#
# KDE PLOTS
#
kdeplots_defcon={"alpha":.25,"fill":True,"palette":"muted","common_norm":False} # Default parameters for all KDE plots via **UNPACKING MY BELOVED

def cts_kdeplot(ax, data, x, hue,title=None) -> None: # Helper function for creating KDE plots
    sea.kdeplot(ax=ax,data=data, x=x, hue=hue, **kdeplots_defcon) # Makes the KDE plot
    ax.set_xlim(data[x].min(),data[x].max()) # Set x-axis limits (removes tapering of the KDE at the edges)
    ax.set_title(f"{x} by {hue} type" if title is None else title) # Title formatting
    sea.move_legend(ax, "lower left") # self-explanatory
    ax.set(yticklabels=[]) # Remove y-axis numbers
    ax.set_ylabel('')  # Remove y-axis label
    ax.tick_params(left=False)  # Remove y-axis ticks
    rem_uscore(ax)
    
    
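# KDE plots are chosen to show the distribution of the continuous features
# (session length, prompt length, tokens used), split by device or assistant model.
#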
    
    
fg, axes = plot.subplots(1, 3, figsize=(30, 10))
fg.suptitle("KDE Plots")

cts_kdeplot(axes[0],RAW_data,"session_length_minutes","device","session length by device type")

print("This is actually quite interesting! The session length seems to be shorter on desktop devices compared to mobile devices, which is weird. (one would think that desktop users would spend more time because of work/study etc.)")
print("Similarly, tablets have the longest session lengths on average. This divide between mobile and desktop could stem from the different use cases for each device type. (or the time it takes to type / enter prompts)")
print("This points towards desktop users using the AI assistant for quick queries, while mobile/tablet users might be engaging in longer interactions. Strange!")

cts_kdeplot(axes[1],RAW_data,"prompt_length","device","prompt length by device type")

print("Looking at the prompt lengths, we can see that desktop users tend to have longer prompts on average compared to mobile and tablet users.")
print("This could be due to the ease of typing on a physical keyboard, allowing for longer and more complex prompts. This is pretty strange though, since the session lengths were shorter on desktop.")

cts_kdeplot(axes[2],RAW_data,"tokens_used","assistant_model","tokens used by model type")

print("As expected, the more advanced models like GPT-5 and GPT-4o tend to use more tokens on average compared to older models like o1.")
print("However, mini seems to use the most tokens on average, which could either stem from its use (longer prompts on average) or its complexity. (which is weird since mini is supposed to be a smaller model). Strange again!")

plot.show()


#
# BOXPLOTS LEZGO
#

fg, axes = plot.subplots(1, 2, figsize=(30, 10))
fg.suptitle("Box Plots")


#sea.boxplot(ax=axes[0],data=RAW_data, x="assistant_model",hue="assistant_model", y="satisfaction_rating", palette="muted",medianprops={"color": "r", "linewidth": 2},notch=True)

sea.boxplot(ax=axes[0],data=RAW_data, x="assistant_model",hue="assistant_model", y="tokens_used", palette="muted",medianprops={"color": "r", "linewidth": 2},notch=True)
axes[0].set_title("Tokens Used by Assistant Model")
rem_uscore(axes[0])

print("assistant model vs tokens used is actually quite interesting (genuinely!). One can see that the \"advancedness\" of the model doesn't really correlate with the number of tokens used. This could either point towards good optimization of the newer models, or simply that users are using the models in different ways.")
print("Mini seems to have the highest median token count, whilst having a relatively big spread. The token count seems to vary quite a lot.")

print("device vs session length is also notable, in that the session length seems to be quite stable across the different device types. The spread is also quite small (ranging from ~5 to ~11 minutes on most devices)")


#print("Interestingly enough, most models seem to have a nearly identical satisfaction rating distribution. This is quite surprising and implies that this data might be synthetic. (I calculated it myself and all models have a mean of about 3, with a max difference of .2). It seems like there should be a bigger difference between models, especially considering that each model should (at least in theory) improve on the last.")
#print("GPT-5 seems to have the lowest satisfaction, however, the 5.1 version has the highest satisfaction. This could point towards some issues with GPT-5 that were fixed in 5.1.")

# <ignore_this>
print(RAW_data[["assistant_model","satisfaction_rating"]].groupby("assistant_model").mean())
print(RAW_data[["assistant_model","satisfaction_rating"]].groupby("assistant_model").quantile(.75))
print(RAW_data[["assistant_model","satisfaction_rating"]].groupby("assistant_model").quantile(.25))
# </ignore_this>


sea.boxplot(ax=axes[1],data=RAW_data, x="device",hue="device", y="session_length_minutes", palette="muted",medianprops={"color": "r", "linewidth": 2},notch=True)
axes[1].set_title("Session Length by Device Type")
rem_uscore(axes[1])
plot.show()


#  sea.boxplot(data=RAW_data, x="assistant_model",hue="assistant_model", y="session_length_minutes", palette="muted")
#  plot.title("Boxplot of Session Length by Assistant Model")
#  plot.show()




In [None]:
# Are there any features that clearly differentiate device types?