-
Notifications
You must be signed in to change notification settings - Fork 1
/
project_ideas.tex
150 lines (117 loc) · 8.77 KB
/
project_ideas.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
\documentclass[fleqn,reqno,10pt]{article}
\usepackage{CSPstyles}
\usepackage[margin=2cm]{geometry}
\usepackage[natbib=true,backend=biber]{biblatex}
\bibliography{project-bib.bib}
\newcommand{\pt}[1]{\textcolor{CSP-accent-1}{[PT: #1]}}
\newcommand{\faca}[1]{\textcolor{CSP-accent-2}{[FC: #1]}}
\title{Project ideas}
\author{CSP}
\date{Last compiled: \today}
\begin{document}
\maketitle
\section{LMs for prior elicitation for probabilistic cognitive models}
\begin{enumerate}
\item probabilistic cognitive models often rely on human data to estimate (statistical) world knowledge
\item running prior elicitation experiments is costly, and there are many degrees of freedom in eliciting responses and in interpreting raw data (how to scale / interpret slider ratings etc.)
\item this project could investigate if we can use LLMs, properly prompted, to elicit prior judgements instead of human participants
\begin{itemize}
\item try out different personalities in system prompt
\item check if judgements agree with human judgements
\item check if model predictions are worse or better than with human priors
\end{itemize}
\end{enumerate}
\section{Relevance judgements and LLMs}
\begin{enumerate}
\item probe the human-likeness of LLM's relevance judgements w/ and w/o prompting
\item compare to human data
\end{enumerate}
\section{How false beliefs spread through language}
\subsection{Argumentative RSA agents}
\begin{itemize}
\item run a multi-agent simulation, similar to prior literature \citep{Zollman2013:Network-Epistem}, on how false beliefs spread and amplify in a social network
\begin{itemize}
\item agents observe data from experiments
\item in previous work: agents showed their data to other agents
\item here: agents describe their findings in language
\item use RSA model with argumentative speakers
\item listeners can be naive or argumentation-aware
\item check impact on false belief propagation
\end{itemize}
\end{itemize}
\subsection{Argumentative RSA agents}
\begin{itemize}
\item as before but with griceChain agents instead of RSA
\end{itemize}
\section{LLMs and causal reasoning}
\subsection{Squeeze causal intuitions out of LLMs}
\begin{itemize}
\item use LLMs to ``construct'' intuitive causal models of everyday events
\item test whether there are biases (e.g., towards specific causal structures)
\end{itemize}
\section{LLM-chain related}
Below are some student project ideas which would help expand and substantiate the framework.
\subsection{Testing i/o}
Consists of testing griceChain on other datasets or evaluating the performance in general. Requires no or little change to the current griceChain codebase.
\begin{enumerate}
\item Applying the griceChain implementation of the contrastive reference game to more pre-existing datasets.
\begin{itemize}
\item To extend the reference game or grounded language use setting to more naturalistic datasets, Google search images or MS COCO images could be used.
\item Integrating the image captioning endpoint for the utterance proposal based on actual images can be part of the project or be implemented by us.
\end{itemize}
\item Collecting/constructing a dataset for semantic parsing
\begin{itemize}
\item Currently, the semantic parsing tests contain both examples from SuperGLUE and GLUE benchmarks as well as hand-created examples representative of the modeled communicative tasks. However, for the former, sometimes the annotated truth values need to be changed compared to the original benchmark depending on the prompt / question phrasing. For the latter, more examples could be created. Students could create larger versions of such parsing datasets.
\end{itemize}
\item Applying the pipeline with other LLM backbones to check for possible advantages for models weaker than GPT-X
\begin{itemize}
\item LLaMA, OpenAssistant, GPT4All...
\item involves benchmarking their instruction-following / few-shot learning abilities to GPT
\end{itemize}
\item Accuracy and variance estimation of the pipeline on a given task or for given sub-components
\begin{itemize}
\item for a module like the semantic parser, there is no available estimate of the variance of results across different calls of the API, as well as across different sampling temperature sampling or tested phenomena. Having this information based on rigorous testing would be very useful for estimating the performance of the overall system.
\item this requires sizable datasets of different phenomena as well as resources for running the evaluations.
\item one concern might be how generalizable / useful these results are across models, especially if we want to offer the pipeline as a cross-model solution and need to guarantee stable performance.
\end{itemize}
\end{enumerate}
\subsection{Extending the model to more tasks}
Require more substantial code extensions/revisions, but is based mostly on previous literature. E.g., implement some other RSA model with SIFD.
\begin{enumerate}
\item Extend griceChain to non-contrastive reference games.
\item Integrating new (e.g., more natural) communicative tasks beyond reference. Extending the pipeline to more complex or natural tasks could allow us to show the advantages of the controlled pipeline more clearly. This would require a bigger refactoring/extension of the current codebase.
\begin{itemize}
\item One idea that we discussed concerned the use of different utility functions.
\begin{itemize}
\item One natural extension here is to add argumentative strength to the utility function; the reason that this is natural is that argstrength in a Bayes factor/likratio implementation grounds out in a combo of semantic parser and state proposal. To implement this we need to add a state-proposer that proposes new states compatible with the utterance. For more look at the paper on argstrength RSA.
\item Another idea is to use relevance (TODO)
\end{itemize}
\item Another idea is to generate possible speaker goals/preferences based on the utterance. For instance, answering wh-questions (or questions in general) might require reasoning about the speaker's background knowledge (e.g., sensible information or goals the person might want to pursue). \pt{This is essentially a transfer / extension of the QA model to SIFD} Furthermore, the iterative pipeline might allow to model QA based on conversation history which updates goals etc rather than single-shot QA.
\item One other idea could be to zoom in on the process of reasoning about vague expressions like gradable adjectives. For instance, given descriptions of utterance context, the pipeline might trade off between contextual information and general world knowledge for inferring aspects like the comparison class. E.g. a griceChain implementation of Qing \& Franke (2014).
\end{itemize}
\end{enumerate}
\subsection{Improving current griceChain implementation for discriminative reference games}
This involves fairly substantial revisions to the code \& possibly even conceptual work. Student would change griceChain in some way to improve the performance on current task. Possibly extend SIFD paradigm.
\begin{enumerate}
\item Clarification questions: cf. self-critical and revising agents
\begin{itemize}
\item The pragmatic module could be augmented by reasoning about which action next to picking one of the available utterances to select. These alternative actions could be, e.g., asking clarification questions. This could be guided be reasoning about the uncertainty of the current information. \pt{uncertainty in LLMs is quite a big topic by itself; modeling the reasoning is also a non-trivial task, which could be solved most naively by recursively applying the SIFD pipeline to compare utilities od possible outcomes of different actions}
\end{itemize}
\item ``Loopy agent'' going back to the step of generating possible utterance proposals whenever current set of options isn't useful
\begin{itemize}
\item Involves the issue of putting a cap on the number of iterations the agent could loop for.
\item Otherwise, a relatively straightforward recurrence over our current approach.
\end{itemize}
\end{enumerate}
\subsection{Real world application}
\begin{enumerate}
\item Finding some real world use case for griceChain, exploiting its advantages over bare LLMs in terms of explainability and transparency.
\end{enumerate}
\section{Probing LLMs}
\begin{itemize}
\item Probing LLMs seems like one of the straightforward methods for investigating the representations they might be building up.
\item Following up on the discussion of whether LLMs do develop some form of understanding or reasoning and how to find that out, it would be great to get into the common methods hands-on.
\item \pt{concrete applications tbd.}
\end{itemize}
\printbibliography[heading=bibintoc]
\end{document}