PPO-Seq2Seq

Train a Seq2Seq model on samples generated with PPO. The Seq2Seq model learns to reproduce the output of a Python REPL, while an RL model trained with PPO generates the samples.
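
The two ingredients named above can be sketched with the standard library alone. This is a minimal illustration, not this project's code: `repl_output` and `ppo_clipped_objective` are hypothetical helper names. It assumes the Seq2Seq target for a sample is whatever the interactive interpreter would print for it, and it uses the standard PPO clipped surrogate for a single action. How the project turns the Seq2Seq result into a reward for the generator is not stated here, so that step is left out.

```python
import contextlib
import io
import math
import traceback


def repl_output(source: str) -> str:
    """Return roughly what a Python REPL would print for one input line.

    Hypothetical helper. Compiling in "single" mode makes bare expressions
    echo their repr, as they do at the interactive prompt.
    NOTE: this runs arbitrary code; samples from a generator model
    should be executed in a sandbox, which this sketch does not provide.
    """
    buffer = io.StringIO()
    try:
        code_obj = compile(source, "<sample>", "single")
        with contextlib.redirect_stdout(buffer):
            exec(code_obj, {"__name__": "__console__"})
    except Exception as exc:
        # Keep only the final "ErrorType: message" part, without the stack.
        lines = traceback.format_exception_only(type(exc), exc)
        return "".join(lines).strip()
    return buffer.getvalue().strip()


def ppo_clipped_objective(log_prob_new: float, log_prob_old: float,
                          advantage: float, clip_eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one action (to be maximized)."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Each (sample, target) pair could serve as one Seq2Seq training example.
    for sample in ["1 + 2", "'ab' * 2", "1 / 0"]:
        print(repr(sample), "->", repr(repl_output(sample)))
```

Clipping the probability ratio keeps a single update from moving the generator too far from the policy that produced the samples, which is the usual reason for choosing PPO over a plain policy gradient.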