## SpeakQL A/B User Study Overview and Discussion of Study Design Challenges

#### SpeakQL
SpeakQL is an extension of SQL syntax that is intended to make speech-driven querying more natural while retaining the correctness-by-construction benefit of a formal language.

##### Features:
- Synonyms (e.g. Get area from the room table)
- Alternate ordering (e.g. From the room table get area)
- Natural functions (e.g. What is the average of area in the room table)
- Unbundling (complex queries can be specified one table at a time)



### Study Design

#### Objective: 
Compare the SpeakQL dialect to the SQL dialect in terms of planning time, total time, and number of attempts needed to form a correct query.

#### Context:
The study was conducted online using Zoom and a custom-built user interface with audio recording and live speech-to-text transcription.

#### Participants:
- Graduate and undergraduate UCSD students from the CSE and Business Analytics programs.
- 23 participants total (one dropped from the data)

#### Design Considerations:
- Used counterbalancing to reduce learning effect.
  - SQL first, then SpeakQL or SpeakQL first, then SQL
- Practice session
  - 3 queries in dialect 1, then repeat in dialect 2
- Measured session
  - 12 queries in dialect 1, then repeat in dialect 2
- Compensation based on number of queries completed
- Queries increased in complexity as study progressed.
- SpeakQL feature usage was optional.
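
The counterbalanced assignment described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: the function name `assign_counterbalanced`, the fixed seed, and the alternating split are all assumptions.

```python
import random

def assign_counterbalanced(participant_ids, seed=0):
    """Randomly split participants into two order groups of near-equal size."""
    ids = list(participant_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(ids)
    orders = {
        1: ("speakql", "sql"),  # Group 1: SpeakQL first, then SQL
        2: ("sql", "speakql"),  # Group 2: SQL first, then SpeakQL
    }
    assignment = {}
    for position, pid in enumerate(ids):
        # Alternate groups over the shuffled list so sizes differ by at most one.
        group = 1 if position % 2 == 0 else 2
        assignment[pid] = {"group": group, "order": orders[group]}
    return assignment

# Hypothetical participant ids 1..23.
assignment = assign_counterbalanced(range(1, 24))
group_sizes = {
    g: sum(1 for a in assignment.values() if a["group"] == g) for g in (1, 2)
}
print(group_sizes)  # {1: 12, 2: 11}
```

Shuffling first and then alternating keeps the two orderings balanced even with an odd number of participants.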

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import scipy.stats as stats
from scipy.stats import normaltest
from scipy.stats import mannwhitneyu as mwu
from scipy.stats import wilcoxon as wil
from scipy.stats import ttest_ind
from scipy.ndimage import median
from scipy.ndimage import mean
from IPython.display import display
pd.options.display.float_format = '{:,.3f}'.format

df = pd.read_excel(
    './data/df/sample-distribution-and-tests-df.xlsx'
)
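# Keep measured (non-practice) steps only and exclude participant 18.
# The trailing .copy() guards against pandas' SettingWithCopyWarning when columns are added below.
df = df.query("ispractice == 0")
df = df.query("idparticipant != 18").copy()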
df['relative_step'] = df.apply(
    lambda row: row.idstep - 6 if row.idstep < 19 else row.step - 18, 
    axis = 1
)
df['second_half'] = df.apply(
    lambda row: 0 if row.idstep <= 18 else 1,
    axis = 1
)

### Results Summary

#### Overall Distribution
In aggregate, planning time was right-skewed (long right tail) for both SpeakQL and SQL.
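
In a right-skewed sample, a few slow attempts pull the mean above the median, which is one reason to summarize such data with medians. A minimal sketch of that check, using made-up planning times (not study data) and a hypothetical helper `sample_skewness`:

```python
import statistics

# Hypothetical planning times in seconds; illustrative values, not study data.
planning_times = [8, 9, 10, 11, 12, 12, 13, 15, 18, 25, 40, 75]

def sample_skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    n = len(values)
    center = statistics.fmean(values)
    m2 = sum((v - center) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - center) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

mean_pt = statistics.fmean(planning_times)
median_pt = statistics.median(planning_times)
skewness = sample_skewness(planning_times)

print(f"mean={mean_pt:.1f}, median={median_pt:.1f}, skewness={skewness:.2f}")
```

A mean well above the median together with a positive skewness value is consistent with the long right tail visible in the histograms.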

#### SpeakQL - First Attempt Planning Time:

In [None]:
df.query('language == "speakql"').first_pt.hist(bins = 30)

#### SQL - First Attempt Planning Time:

In [None]:
df.query('language == "sql"').first_pt.hist(bins = 30)

### First Planning Time Distribution by query

A Shapiro-Wilk test of normality indicates that most per-query distributions are not normal (query 10 is an exception).

In [None]:
df.columns

In [None]:
d = {"Query": [], "Stat": [], "P":[], "Normal": []}
for i in range(1, 13):
    stat, pval = stats.shapiro(df.query("relative_step == " + str(i)).first_pt)
    d["Query"].append(i)
    d["Stat"].append(stat)
    d["P"].append(pval)
    d["Normal"].append(pval > 0.05)
pd.DataFrame(data = d)

In [None]:
g = sns.FacetGrid(df, col='relative_step', col_wrap = 4, hue='language')
g.map_dataframe(
    sns.histplot, 
    'first_pt', 
    bins = 10, 
    discrete = False, 
    element = 'step',
    legend = True
)
g.add_legend()

### Number of Attempts Distribution by query


In [None]:
d = {"Query": [], "Stat": [], "P":[], "Normal": []}
for i in range(1, 13):
    stat, pval = stats.shapiro(df.query("relative_step == " + str(i)).attemptnum)
    d["Query"].append(i)
    d["Stat"].append(stat)
    d["P"].append(pval)
    d["Normal"].append(pval > 0.05)
pd.DataFrame(data = d)

In [None]:
g = sns.FacetGrid(df, col='relative_step', col_wrap = 4, hue='language')
g.map_dataframe(
    sns.histplot, 
    'attemptnum', 
    bins = 3, 
    discrete = True, 
    element = 'bars',
    legend = True
)
g.add_legend()

### Mann-Whitney U Test on Non-Normal Data
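
The per-query comparisons in this section use a rank-based test because most distributions are not normal. A minimal sketch of one such comparison is shown here on made-up values (not study data); the sample names `dialect_a_times` and `dialect_b_times` are hypothetical, and the two-sided alternative is an assumption about the intended hypothesis.

```python
from statistics import median
from scipy.stats import mannwhitneyu

# Hypothetical planning times (seconds) for one query under two dialects.
dialect_a_times = [12.0, 15.5, 9.8, 20.1, 14.2, 30.4, 11.7, 13.3]
dialect_b_times = [18.9, 22.4, 16.0, 35.2, 27.5, 19.8, 41.0, 24.6]

# Two-sided test: do the two samples come from the same distribution?
u_stat, p_value = mannwhitneyu(
    dialect_a_times, dialect_b_times, alternative="two-sided"
)

# Report medians alongside the p-value, as the result tables in this notebook do.
median_a = median(dialect_a_times)
median_b = median(dialect_b_times)
print(f"U={u_stat}, p={p_value:.4f}, median_a={median_a}, median_b={median_b}")
```

Because the test only uses ranks, a single very slow attempt shifts the result far less than it would shift a t-test.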

In [None]:

dep_vars = ['first_pt', 'tot_tt', 'attemptnum']
result_dfs = []
df_sql = df.query("language == 'sql'")
df_speakql = df.query("language == 'speakql'")

# For each dependent variable, compare SQL and SpeakQL query by query.
for dep_var in dep_vars:
    mwu_results = {}
    for idquery in df.idquery.unique():
        # Rows for this query in each dialect.
        subset_sql = df_sql[df_sql.idquery == idquery]
        subset_speakql = df_speakql[df_speakql.idquery == idquery]
        mwu_results[(dep_var, idquery)] = {
            # Index 1 of the test result is the p-value.
            'u_test_pval' : mwu(
                x = subset_sql[dep_var],
                y = subset_speakql[dep_var]
            )[1],
            'median_sql' : subset_sql[dep_var].median(),
            'median_speakql' : subset_speakql[dep_var].median()
        }
    result_dfs.append(pd.DataFrame(mwu_results))


In [None]:
for result in result_dfs:
    display(result)

#### Box and Whisker plots of performance by dialect


In [None]:
df_temp = df.rename(columns={
    'first_pt': 'First Attempt Planning Time',
    'relative_step': 'Query'
})
df_temp['Query'] = df_temp.apply(lambda row: str(int(row.Query)), axis=1)
df_temp['language'] = df_temp.apply(lambda row: {'sql': 'sql (L)', 'speakql': 'speakql (R)'}[row.language], axis=1)

In [None]:
for var in ['First Attempt Planning Time', 'tot_tt']:
    g = sns.catplot(
        kind='box',data = df_temp, 
        x = 'Query', 
        y = var, 
        hue = 'language', 
        orient = 'v',
        aspect = 2.4,
        height = 3
    )
    g.set_axis_labels("", var, fontsize='large')
    hatches = ['.', '+']

In [None]:
attempt_legend = {1: "1 *", 2: "2 +", 3: "3 X"}

df_temp['Number of Attempts'] = df.apply(lambda row: attempt_legend[int(row.attemptnum)], axis = 1)
df_temp['Dialect'] = df.apply(lambda row: 'SQL' if row.language == 'sql' else 'SPK', axis = 1)
df_temp['Q'] = df.apply(lambda row: str(int(row.relative_step)), axis = 1)
hatches = ['.', '.', '+', '+', 'x', 'x']

In [None]:
g = sns.displot(
    df_temp, 
    x='Dialect', 
    col='Q', 
    hue='Number of Attempts', 
    hue_order=['3 X', '2 +', '1 *'],
    col_wrap=12,
    height=2.5,
    aspect=.5,
    kind='hist',
    multiple='stack',
    stat='count'
)
g.set_axis_labels("")
g.set_titles("Q" + "{col_name}")


### Group Effect Analysis
- We used counterbalancing between two groups to offset learning effect.
- Participants were randomly assigned to either group 1 or group 2
  - Group 1: SpeakQL first, then SQL
  - Group 2: SQL first, then SpeakQL

#### Feature Usage by Group
Feature usage was optional, and we observed different usage rates between groups.
  - Participants in group 1 (SpeakQL First) used unbundling much more frequently
  - Participants in group 2 (SQL First) used other features slightly more frequently
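
The usage rate compared here is the number of attempts that used a feature divided by the number of attempts where the feature was possible, computed per group. A minimal sketch on made-up flags (not study data), assuming the `used_<feature>` and `<feature>_possible` column naming seen in this notebook, pairs the two columns by name:

```python
import pandas as pd

# Hypothetical per-attempt flags; illustrative values, not study data.
attempts = pd.DataFrame({
    "groupnum": [1, 1, 1, 2, 2, 2],
    "used_unbundling": [1, 1, 0, 0, 0, 1],
    "unbundling_possible": [1, 1, 1, 1, 1, 1],
    "used_synonyms": [0, 1, 0, 1, 1, 0],
    "synonyms_possible": [1, 1, 0, 1, 1, 1],
})

features = ["unbundling", "synonyms"]
sums = attempts.groupby("groupnum").sum()

# Pair each "used" column with its "possible" column explicitly by name,
# so the ratio never depends on the two tables sharing a row order.
rows = [
    {
        "groupnum": group,
        "feature": feature,
        "used": sums.loc[group, f"used_{feature}"],
        "possible": sums.loc[group, f"{feature}_possible"],
    }
    for group in sums.index
    for feature in features
]
usage = pd.DataFrame(rows)
usage["usage_rate"] = usage["used"] / usage["possible"]
print(usage)
```

Matching columns by feature name makes the calculation robust if features are later added or renamed.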

In [None]:
used_df = pd.melt(
    df, id_vars =['groupnum', 'idparticipant', 'idsession', 'idstep'], 
    value_vars = [
        'used_synonyms',
        'used_expression_ordering',
        'used_mod_ordering',
        'used_natural_functions',
        'used_unbundling'
    ]
)

possible_df = pd.melt(
    df, id_vars =['groupnum', 'idparticipant', 'idsession', 'idstep'], 
    value_vars = [
        'synonyms_possible',
        'expression_ordering_possible',
        'mod_ordering_possible',
        'natural_functions_possible',
        'unbundling_possible'
    ]
)

used_sum = used_df[['groupnum', 'variable', 'value']].groupby(
    ['groupnum', 'variable']
).sum()

possible_sum = possible_df[['groupnum', 'variable', 'value']].groupby(
    ['groupnum', 'variable']
).sum()

sum_compare = used_sum
sum_compare = sum_compare.rename(columns = {'value' : 'used'}).reset_index()
sum_compare['possible'] = possible_sum.reset_index().value

sum_compare['perc_usage'] = sum_compare.used / sum_compare.possible


In [None]:
g = sns.catplot(
    data = sum_compare,
    kind = 'bar',
    x = 'variable',
    y = 'perc_usage',
    hue = 'groupnum'
)
g.set_xticklabels(rotation = 45)
display(sum_compare.sort_values(by=['variable']))

#### Group Median and Distribution Observations by Dependent Variable

In [None]:
g = sns.catplot(x='groupnum', y='first_pt', data=df.sort_values(by='groupnum'), kind='box')
g3 = sns.catplot(x='groupnum', y='tot_tt', data=df.sort_values(by='groupnum'), kind='box')
g4 = sns.catplot(x='groupnum', y='attemptnum', data=df.sort_values(by='groupnum'), kind='box')


#### Observation of asymmetric learning
The following charts show performance by language between the first and second half of the experiment.

Group 1 (SpeakQL first) improved less in the second half, when using SQL, than Group 2 (SQL first) did when switching to SpeakQL.

This observation suggests that learning did not carry over symmetrically between the two orderings.
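
One way to quantify this asymmetry is to compare, for each group, the median of a dependent variable in the first half against the second half. The sketch below uses made-up planning times (not study data) and assumes the `groupnum`, `second_half`, and `first_pt` columns defined earlier in this notebook.

```python
import pandas as pd

# Hypothetical trials; illustrative values, not study data.
trials = pd.DataFrame({
    "groupnum":    [1, 1, 1, 1, 2, 2, 2, 2],
    "second_half": [0, 0, 1, 1, 0, 0, 1, 1],
    "first_pt":    [30, 34, 28, 30, 24, 26, 15, 17],
})

# Median planning time per group and half of the session.
medians = trials.pivot_table(
    index="groupnum", columns="second_half", values="first_pt", aggfunc="median"
)

# Positive values mean the group got faster in the second half.
improvement = medians[0] - medians[1]
print(improvement)
```

If both orderings produced the same carryover, the two improvement values would be similar; a large gap between them is the asymmetry discussed here.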

In [None]:
sns.lmplot(data = df,
           x = 'second_half',
           y = 'first_pt',
           hue = 'language'
          )

sns.lmplot(data = df,
           x = 'second_half',
           y = 'attemptnum',
           hue = 'language'
          )

sns.lmplot(data = df,
           x = 'second_half',
           y = 'tot_tt',
           hue = 'language'
          )

#### Reconsidering Performance Data based on Groups

- Overall, the SpeakQL-first group performed more slowly in their first dialect (SpeakQL) than the SQL-first group performed in their first dialect (SQL). This suggests a SpeakQL learning curve.

- Strangely, we observe the opposite for the second dialects: the SpeakQL-first group performed more slowly in its second dialect (SQL) than the SQL-first group did in its second dialect (SpeakQL).

In [None]:
df_temp = df.rename(columns={
    'tot_tt': 'Total Time',
    'relative_step': 'Query'
})
df_temp['Total Time'] = df_temp['Total Time'].astype(int)
df_temp['Query'] = df_temp.apply(lambda row: str(int(row.Query)), axis=1)
df_temp['Group + Language'] = df_temp.apply(lambda row: row.groupnum + " " + row.language, axis = 1) 

g = sns.catplot(
    kind='box',data = df_temp.sort_values(by=['groupnum', 'step']), 
    x = 'Query', 
    y = 'Total Time', 
    hue = 'Group + Language', 
    orient = 'v',
    aspect = 2,
    height = 6,
    legend = True
)

g.set_axis_labels("", "Total Time", fontsize='large')
g.set(title = "Group 1: SpeakQL first, Group 2: SQL first")

We observe this disparity in both first attempt planning time and total time.

In [None]:
df_temp = df.rename(columns={
    'first_pt': 'First Attempt Planning Time',
    'relative_step': 'Query'
})
df_temp['First Attempt Planning Time'] = df_temp['First Attempt Planning Time'].astype(int)
df_temp['Query'] = df_temp.apply(lambda row: str(int(row.Query)), axis=1)
df_temp['Group + Language'] = df_temp.apply(lambda row: row.groupnum + " " + row.language, axis = 1) 

g = sns.catplot(
    kind='box',data = df_temp.sort_values(by=['groupnum', 'step']), 
    x = 'Query', 
    y = 'First Attempt Planning Time', 
    hue = 'Group + Language', 
    orient = 'v',
    aspect = 2,
    height = 6,
    legend = True
)

g.set_axis_labels("", "First Attempt Planning Time", fontsize='large')
g.set(title = "Group 1: SpeakQL first, Group 2: SQL first")

#### Comparison between those who used unbundling and those who didn't

- When we split SpeakQL attempts by whether unbundling was used, performance is spread more evenly across groups 1 and 2 and their dialects. In the second half (SQL-first group), participants who avoided unbundling performed better than participants who used it.
- We think this is because participants who avoided unbundling were essentially restating the same SQL query from the first half.

In [None]:
df_temp = df.rename(columns={
    'first_pt': 'First Attempt Planning Time',
    'relative_step': 'Query'
})
df_temp['First Attempt Planning Time'] = df_temp['First Attempt Planning Time'].astype(int)
df_temp['query_int'] = df_temp.Query
df_temp['Query'] = df_temp.apply(lambda row: str(int(row.Query)), axis=1)
df_temp['Group + Language + Used Unbundling'] = df_temp.apply(lambda row: row.groupnum + " " + row.language + " " + str(row.used_unbundling), axis = 1) 

g = sns.catplot(
    kind='box',
    data = df_temp.sort_values(by=['groupnum', 'step']).query("query_int > 3"), 
    x = 'Query', 
    y = 'First Attempt Planning Time', 
    hue = 'Group + Language + Used Unbundling', 
    orient = 'v',
    aspect = 3,
    height = 6,
    legend = True
)

g.set_axis_labels("", "First Attempt Planning Time", fontsize='large')
g.set(title = "Group 1: SpeakQL first, Group 2: SQL first")

### Discussion

#### Some Ideas:
- Evaluate only one feature: Unbundling
- Make unbundling usage mandatory
- Switch to in-person sessions to eliminate covert behaviors
- Spread sessions apart by several days to facilitate "unlearning"