/
HW4.tex
361 lines (283 loc) · 13.4 KB
/
HW4.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
\documentclass[11pt]{article}
\usepackage{writeup/common}
\usepackage{url}
\usepackage{listings}
\title{CS 287 \\ Assignment 4: Word Segmentation}
\date{}
\begin{document}
\maketitle{}
\begin{center}
\textbf{Due:} Friday, April 1st, 11:59 pm
\end{center}
Over the last couple years there has been tremendous interest in
recurrent neural networks for natural language tasks, and in
particular the long short-term memory (LSTM) network. The last
several classes have focused on the theory and structure of
recurrent networks. For this assignment you will actually
construct utilize an LSTM for a real-world problem.
The problem we will focus on is word segmentation, that is identifying
the spaces in sentence-based on the previous characters only. This is
a simplified version of word segmentation processes necessary for
processing languages written without spaces, such as Korean or
Chinese.
Given an input sequence of characters such as
\begin{center}
\texttt{t h e d o g w a l k s t o t h e p a r k}
\end{center}
\noindent You will attempt to give back the
original sequence of words:
\begin{center}
\texttt{the dog walks to the park}.
\end{center}
Unlike
past assignments, this assignment is not a research-level
task. However this will give you a chance to experiment with LSTM in a
clear domain. If you are interested in further experimentation, you
can tweak your code to produce a working LSTM language model and babbler.
Before you start we advise that you get very comfortable with the
notes from class, the papers on LSTMs, and especially the torch
\texttt{rnn} library.
As you complete this assignment, we ask that you submit your results
on the test data to the Kaggle competition website at
\url{https://inclass.kaggle.com/c/cs287-hw4} and that you compile your
experiences in a write-up based on the template at
\url{https://github.com/cs287/hw_template}.
\section{Data and Preprocessing}
\subsection{Data}
The data for this task is under the \texttt{data/} directory.
Our data is the same of the previous assignment, except that
now we have given you only the characters of the data set
separated with explicit space tokens.
\lstset{ basicstyle=\ttfamily, breaklines=true}
\begin{lstlisting}
> head data/valid_chars.txt
c o n s u m e r s <space> m a y <space> w a n t <space> t o <space> m o v e <space> t h e i r
<space> t e l e p h o n e s <space> a <space> l i t t l e <space> c l o s e r <space> t o <space>
t h e <space> t v <space> s e t </s> < u n k > <space> < u n k > <space> w a t c h i n g <space>
a b c <space> ' s <space> m o n d a y <space> n i g h t <space> f o o t b a l l <space> c a n
<space> n o w <space> v o t e <space> d u r i n g <space> < u n k > <space> f o r <space> t h e
<space> g r e a t e s t <space> p l a y <space> i n <space> N <space> y e a r s <space> f r o m
<space> a m o n g <space> f o u r <space> o r <space> f i v e <space> < u n k > <sp
\end{lstlisting}
Note that for this problem set we no longer print the data file in lines, we
simply concatenate together each one of the sentence separated with a $</s>$
symbol.
The test data is in file \texttt{test\_chars.txt}. This file consists
of one sentence per line and includes no explicit spaces. The aim of
the project is to re-insert spaces back into the sentences. We have
also provided this file for the validation data.
\begin{lstlisting}
> head data/valid_chars_kaggle.txt
c o n s u m e r s m a y w a n t t o m o v e t h e i r t e l e p h o n e s a l i t t l
e c l o s e r t o t h e t v s e t </s>
< u n k > < u n k > w a t c h i n g a b c ' s m o n d a y n i g h t f o o t b a l l c
a n n o w v o t e d u r i n g < u n k > f o r t h e g r e a t e s t p l a y i n N y
e a r s f r o m a m o n g f o u r o r f i v e < u n k > < u n k > </s>
t w o w e e k s a g o v i e w e r s o f s e v e r a l n b c < u n k > c o n s u m e r
s e g m e n t s s t a r t e d c a l l i n g a N n u m b e r f o r a d v i c e o n v
a r i o u s < u n k > i s s u e s </s>
a n d t h e n e w s y n d i c a t e d r e a l i t y s h o w h a r d c o p y r e c o r
d s v i e w e r s ' o p i n i o n s f o r p o s s i b l e a i r i n g o n t h e n e x
t d a y ' s s h o w </s>
i n t e r a c t i v e t e l e p h o n e t e c h n o l o g y h a s t a k e n a n e w l e
\end{lstlisting}
The Kaggle answer file should simply provide the number of spaces for each sentence.
We have provided you with a sample answer file for reference.
\begin{lstlisting}
> head data/valid_chars_kaggle_answer.txt
ID,Count
1,13
2,26
3,20
4,20
5,18
6,11
7,26
\end{lstlisting}
\subsection{Preprocessing}
For this assignment you will write your own preprocessing code
in \texttt{preprocessing.py}. Preprocessing for this assignment
means transforming the training corpus into a representation
usable for RNN training. As in previous assignments you
will need to create a map from characters
to indices. However, since there is no longer a fixed window size the layout is
a bit different. The following is a description of how to structure the
input data for a transducer RNN.
Say our training corpus consists of characters $w_1, \ldots, w_n$
In a perfect world,
we could have the RNN consume as one long matrix.
\[
\begin{bmatrix}
w_1 & w_2 & \ldots & w_n
\end{bmatrix}
\]
However, this would mean we would need to wait $n$ steps to backprop. Instead
it is more common to select a fixed \textit{sequence length} $l$ for doing
backpropagation. We remember the hidden state between
sequences ($\bolds_l$ combines with $w_{l+1}$), but only
backprop within a sequence length window.
\[
\begin{bmatrix}
w_1 & w_2 & \ldots & w_{l}
\end{bmatrix} \begin{bmatrix}
w_{l+1} & w_{l+2} & \ldots & w_{2l}
\end{bmatrix}
\ldots
\begin{bmatrix}
w_{n-l+1} & w_{n-l+2} & \ldots & w_{n}
\end{bmatrix}
\]
When you use this strategy, you much call the following rnn function to save states between
matrices (this only needs to be done once at initialization of the model).
\begin{lstlisting}
model:remember('both')
\end{lstlisting}
While this method has a fixed backprop length of $l$, with a reasonable
size $l$ it works well in practice. Unfortunately it is quite slow since
we cannot use a batch (sequential processing). To get around this issue
we split the sequence into $b$ parts and fold them over each other.
\[
\begin{bmatrix}
w_1 & w_2 & \ldots & w_{l}\\
w_{n/b+1} & w_{n/b+2} & \ldots & w_{n/b+l}\\
\vdots \\
w_{(b-1)n/b + 1} & w_{(b-1)n/b+2} & \ldots & w_{(b-1)n/b+l}
\end{bmatrix} \begin{bmatrix}
w_{l+1} & \ldots & w_{2l} \\
w_{n/b+l+1} & \ldots & w_{n/b+2l}\\
& \vdots & \\
& \ldots &
\end{bmatrix}
\ldots
\begin{bmatrix}
w_{n/b-l+1} & w_{n/b-l+2} & \ldots & w_{n/b}\\
& \ldots & \\
& \vdots & \\
& \ldots &
\end{bmatrix}
\]
This allows both batching and backprop with a fixed length. Note
though that the hidden states at first timestep ($w_{n/b+1}$)
will be incorrect since they get computed before their correct
previous hidden ($w_{n/b}$). However this is tolerable since it
only leads to $b$ incorrect hidden states during training.
Since we will be building a transducer, we also need the training
data to be in the same format. Say we are predicting $\hat{y}$ at
each step. Then our target output will be of the form,
\[
\begin{bmatrix}
y_1 & y_2 & \ldots & y_{l}\\
y_{n/b+1} & y_{n/b+2} & \ldots & y_{n/b+l}\\
\vdots \\
y_{(b-1)n/b + 1} & y_{(b-1)n/b+2} & \ldots & y_{(b-1)n/b+l}
\end{bmatrix} \begin{bmatrix}
y_{l+1} & \ldots & y_{2l} \\
y_{n/b+l+1} & \ldots & y_{n/b+2l}\\
& \vdots & \\
& \ldots &
\end{bmatrix}
\ldots
\begin{bmatrix}
y_{n/b-l+1} & y_{n/b-l+2} & \ldots & y_{n/b}\\
& \ldots & \\
& \vdots & \\
& \ldots &
\end{bmatrix}
\]
For language modeling the target output $y_i$ would be the next word
$w_{i+1}$. For this assignment, our goal is simply space
prediction. Therefore the target output $y_i$ will be $2$ if the next
word $w_{i+1}$ is a space and $1$ otherwise.
\section{Code Setup}
Write your main code in \texttt{HW4.lua}. For this assignment part,
you can (and should) use the \texttt{nn} and \texttt{rnn} library in addition to the
standard Torch library. You should not need to use any extra libraries.
\subsection{Hyperparameters}
Several of the models described have explicit hyperparameters that you will
need to tune. It is your responsibility to cleanly separate these out from
the models themselves and expose as command-line options. This makes it much
easier to run experiments and to utilize experimental scripts.
\subsection{Logging and Reporting}
As part of the write-up, you will need to report on the training and
predictive accuracy of your models. To make this possible, your code
should report on various metrics of the model both at training and
test time. We will leave it up to you on which metrics to log, but we
recommend reporting training speed, training set NLL, training set
predictive accuracy, and validation predictive accuracy. It is your
responsibility to convince us that the model is correctly training.
\section{Models}
For this assignment you will implement the three models. We will warm
up with a count-based model. Then we will
try our Bengio NNLM and an LSTM model.
\subsection{Count-Based Model}
To start, modify your code from HW3
to implement a count-based character n-gram model.
The model should only give the probability of the next
word being a space
\[ P(w_i =<\mathrm{space}> | w_{i-n+1}, \ldots w_{i-1}) \]
Test the model out by determining its perplexity on
the validation set and by running on the segmentation task.
For segmentation you should run two inference algorithms.
\begin{enumerate}
\item Greedily predict whether the next word is a space, and feedback
into the model. (i.e. if it is a space, add the $<$space$>$ to the context. Otherwise continue normally).
\item Dynamic programming over n-grams as described in class.
\end{enumerate}
\subsection{Neural Language Model}
As a second baseline, you will modify your neural language
model from the HW3 to predict whether the next character is a
space or not. This should be relatively similar to your previous
code, except that you will be prediction $1$ or $2$ instead
of a softmax over words. For this problem you can have a much smaller embedding
size of around $\approx 15$
As above you should test your model
by running greedy search and dynamic programming (for a small size).
\subsection{RNN Model}
The main portion of the assignment will be implementing the
RNN model using the Elements \texttt{rnn} library. Much of the
main portion of the work is done for you by the library. However
getting the details right is crucially important.
For this model you should implement a generic RNN transducer.
Given the current state of the RNN you are simply trying to predict
whether this is a space in the next word.
\paragraph{Necessary Hacks:}
In any torch model you can call the following function one time to get
a vector of all the params and gradients of the params.
\begin{lstlisting}
local params, grad_params = model:getParameters()
\end{lstlisting}
Once you have these vectors you should do the following:
\begin{itemize}
\item To begin you should initialize all the params to be uniform between $-0.05$ and $0.05$.
\item Before updating your parameters, you should perform gradient normalization with max norm $5$.
\item (Optional) Use the perplexity of the validation set as a guide for the learning rate. If the
validation perplexity does not decrease, then half the current value of the learning rate. This
can be useful for dynamically finding a good rate.
\end{itemize}
Once you have the model, use it in the greedy fashion described above.
At each time step predict whether the next word is a space. If it is, feed
in a $<$space$>$ to the model before continuing, otherwise continue onward.
Keep track of how many spaces you have seen so far.
\subsection{Additional Experiments}
Once these models are constructed, you should also report on
additional experiments on these data sets. We will leave this aspect
open-ended, but suggestion include:
\begin{itemize}
\item Compare performance between the LSTM, RNN and GRU on this task
\item Experiment with the stack LSTM and dropout. The rnn library provides details for how to implement this.
\item Try implementing a bidirectional LSTM for prediction. This model is provided by the rnn library. In this model you do not need to do greedy search, however you will
have to change your training such that you do not see the spaces in the data.
\item Try using your LSTM for language modeling on the previous
homework (HW3). You should be able to use basically the same setup
for this problem but with words instead of characters. Likely you will want to run this on the GPU.
\end{itemize}
\section{Report and Submission}
For your write-up, follow the report template at
\url{https://github.com/cs287/hw_template}. Be sure to include a link
to your code, Kaggle ID, and reports on your results.
In addition to submitting your Kaggle results, we also expect you to report on your
experimental process. This should include data tables, graphs and discussion of any
issues that you may run into.
\bibliographystyle{apalike}
\bibliography{HW4}
\end{document}