This repository has been archived by the owner on Feb 25, 2022. It is now read-only.

Finetuning doesn't run #232

Closed
SamyakDhole opened this issue Jul 4, 2021 · 4 comments
Labels
bug Something isn't working.

Comments

@SamyakDhole

Sorry, ML noob here. I'm trying to use the Colab guide, and the fine-tuning step doesn't seem to work.

2021-07-04 03:20:56.228130: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
Current step 400000
Saving config to gs://mcstories/GPT3_2-7B
2021-07-04 03:21:02.008931: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-04 03:21:02.020322: E tensorflow/stream_executor/cuda/cuda_driver.cc:328] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-07-04 03:21:02.020389: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (f01fb2d6dc70): /proc/driver/nvidia/version does not exist
Done!
params = defaultdict(<function fetch_model_params.<locals>.<lambda> at 0x7fdf753fd5f0>, {'n_head': 20, 'n_vocab': 50257, 'embed_dropout': 0, 'lr': 0.00016, 'lr_decay': 'cosine', 'warmup_steps': 3000, 'beta1': 0.9, 'beta2': 0.95, 'epsilon': 1e-08, 'ada_epsilon1': '1e-30', 'ada_epsilon2': 0.001, 'opt_name': 'adam', 'weight_decay': 0, 'train_batch_size': 8, 'attn_dropout': 0, 'train_steps': 401000, 'lr_decay_end': 300000, 'eval_steps': 0, 'predict_steps': 0, 'res_dropout': 0, 'eval_batch_size': 128, 'predict_batch_size': 8, 'iterations': 500, 'n_embd': 2560, 'datasets': [['mcstories', None, None, None]], 'model_path': 'gs://mcstories/GPT3_2-7B', 'n_ctx': 2048, 'n_layer': 32, 'scale_by_depth': True, 'scale_by_in': False, 'attention_types': ['global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local', 'global', 'local'], 'mesh_shape': 'x:4,y:2', 'layout': 'intermediate_expanded:x,heads:x,memory_length:y,embd:y', 'activation_function': 'gelu', 'recompute_grad': True, 'gradient_clipping': 1.0, 'tokens_per_mb_per_replica': 4096, 'padding_id': 50257, 'eos_id': 50256, 'dataset_configs': {'mcstories': {'path': 'gs://mcstories', 'eval_path': '', 'n_vocab': 50256, 'tokenizer_is_pretrained': True, 'tokenizer_path': 'gpt2', 'eos_id': 50256, 'padding_id': 50257}}, 'mlm_training': False, 'causal': True, 'num_cores': 8, 'auto_layout': False, 'auto_layout_and_mesh_shape': False, 'use_tpu': True, 'gpu_ids': ['device:GPU:0'], 'steps_per_checkpoint': 500, 'predict': False, 'model': 'GPT', 'export': False, 'sampling_use_entmax': False, 'moe_layers': None, 'slow_sampling': False})
Using config: {'_model_dir': 'gs://mcstories/GPT3_2-7B', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
cluster_def {
  job {
    name: "worker"
    tasks {
      key: 0
      value: "10.92.117.106:8470"
    }
  }
}
isolate_session_state: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_checkpoint_save_graph_def': True, '_service': None, '_cluster_spec': ClusterSpec({'worker': ['10.92.117.106:8470']}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://10.92.117.106:8470', '_evaluation_master': 'grpc://10.92.117.106:8470', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1, '_tpu_config': TPUConfig(iterations_per_loop=500, num_shards=8, num_cores_per_replica=1, per_host_input_for_training=4, tpu_job_name=None, initial_infeed_sleep_secs=None, input_partition_dims=None, eval_training_input_configuration=2, experimental_host_call_every_n_steps=1, experimental_allow_per_host_v2_parallel_get_next=False, experimental_feed_hook=None), '_cluster': <tensorflow.python.distribute.cluster_resolver.tpu.tpu_cluster_resolver.TPUClusterResolver object at 0x7fdf753db650>}
_TPUContext: eval_on_tpu True
Querying Tensorflow master (grpc://10.92.117.106:8470) for TPU system metadata.
2021-07-04 03:21:03.186786: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:373] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Initializing TPU system (master: grpc://10.92.117.106:8470) to fetch topology for model parallelism. This might take a while.
Found TPU system:
*** Num TPU Cores: 8
*** Num TPU Workers: 1
*** Num TPU Cores Per Worker: 8
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:CPU:0, CPU, -1, -1708599650092293211)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:0, TPU, 17179869184, 2923321451137325719)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:1, TPU, 17179869184, 1342907906313734704)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:2, TPU, 17179869184, -3495764244476789790)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:3, TPU, 17179869184, -3451536103897048572)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:4, TPU, 17179869184, -45557024542703693)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:5, TPU, 17179869184, -6138582724471757239)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:6, TPU, 17179869184, 8586638361773924849)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU:7, TPU, 17179869184, -281148374153625489)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:TPU_SYSTEM:0, TPU_SYSTEM, 8589934592, 7206878807369438684)
*** Available Device: _DeviceAttributes(/job:worker/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 4952312271246522530)
From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/training_util.py:236: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
Calling model_fn.
WARNING:root:Changing batch size with sequential_input() will result in some data being skipped or repeated. Please ensure your batch size stays constant throughout training.
training_loop marked as finished
Reraising captured error
Traceback (most recent call last):
  File "main.py", line 257, in <module>
    main(args)
  File "main.py", line 251, in main
    estimator.train(input_fn=partial(input_fn, global_step=current_step, eval=False), max_steps=params["train_steps"])
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3110, in train
    rendezvous.raise_errors()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/usr/local/lib/python3.7/dist-packages/six.py", line 703, in reraise
    raise value
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2942, in _call_model_fn
    config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3222, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1484, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1539, in _invoke_input_fn_and_record_structure
    num_hosts))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1143, in generate_broadcast_enqueue_ops_fn
    inputs = _Inputs.from_input_fn(input_fn(user_context))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3076, in _input_fn
    return input_fn(**kwargs)
  File "/content/GPTNeo/inputs.py", line 119, in sequential_input
    "train_batch_size"])  # TODO: fix for > 1 epoch
  File "/content/GPTNeo/inputs.py", line 52, in _get_skip_index
    return skip_idx, remainder
UnboundLocalError: local variable 'skip_idx' referenced before assignment

Is this a bug? Or am I just breaking something in the previous steps?
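
For reference, here is the dataset config pulled out of the params dump above, since it's hard to read inline:

# Copied from the params dump above for readability:
dataset_configs = {
    "mcstories": {
        "path": "gs://mcstories",
        "eval_path": "",
        "n_vocab": 50256,
        "tokenizer_is_pretrained": True,
        "tokenizer_path": "gpt2",
        "eos_id": 50256,
        "padding_id": 50257,
    }
}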

SamyakDhole added the bug label Jul 4, 2021
@shawwn

shawwn commented Jul 4, 2021

Almost certainly a bug, based on the backtrace. Sorry that your first ML experience ran into this; stuff like that always felt really bad back when I first got into ML.

Keep trying, though. I don't know if anyone will hunt down this bug. But you could e.g. open inputs.py and start looking around line 52 for clues, or search for a different fine-tuning Colab entirely. (I'm not sure if GPT-J can be fine-tuned without a TPU pod, but 1.5B certainly can. Though I don't know what the best fine-tuning notebook is as of today.)
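
As a guess at the mechanism (a sketch, not the repo's actual code): sequential_input works out how many files to skip when resuming from a checkpoint, and if the dataset path in your config matches zero tfrecord files, the loop that is supposed to bind skip_idx never runs, so the return at inputs.py:52 blows up. Something like:

def _get_skip_index(all_files, n_batches):
    # Walk the files, counting batches consumed so far, to find which file
    # (skip_idx) and offset within it (remainder) to resume training from.
    consumed = 0
    for count, _f in enumerate(all_files):
        batches_in_file = 1  # stand-in; the real code counts records per file
        if consumed + batches_in_file > n_batches:
            skip_idx = count
            remainder = n_batches - consumed
            break
        consumed += batches_in_file
    # If all_files is empty (the dataset path matched nothing), or if we
    # resume past the end of the data (the "> 1 epoch" TODO above), the loop
    # falls through without binding skip_idx and this raises UnboundLocalError.
    return skip_idx, remainder

If that's right, the error is really a missing-data problem wearing an UnboundLocalError costume.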

@bterrific2008

Hi! I ran into a similar issue, but resolved it after I realized the path in my dataset configuration file was incorrect. Can you provide additional information about what you're doing, @SamyakDhole?
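
One quick way to check is to see whether the configured path actually matches anything, e.g. from a Colab cell (assuming TensorFlow is already installed there):

import tensorflow as tf

# The "path" from the dataset config; the params dump above shows plain
# "gs://mcstories" with no *.tfrecords glob, so try the pattern you expect:
pattern = "gs://mcstories/*.tfrecords"  # adjust to your bucket layout
files = tf.io.gfile.glob(pattern)
print(len(files), "file(s) matched")
assert files, "no tfrecords matched -- fix 'path' in the dataset config"

If that prints 0, the input pipeline has nothing to read, and you'd expect exactly this kind of crash.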

@CupOfGeo

Hello, I was also having this issue, but I changed my path to the files to my_bucket/dataset/custom/my_dataset_*.tfrecords and it worked. Hope this helps.
But now I'm having bucket service issues; the fun never ends.
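
Concretely, my dataset config entry ended up looking roughly like this (the real file is JSON, shown here as a Python dict; bucket and file names are mine, and the other fields follow the same shape as the params dump above):

dataset_config = {
    "path": "gs://my_bucket/dataset/custom/my_dataset_*.tfrecords",
    "eval_path": "",
    "n_vocab": 50256,
    "tokenizer_is_pretrained": True,
    "tokenizer_path": "gpt2",
    "eos_id": 50256,
    "padding_id": 50257,
}

The important part is the *.tfrecords glob at the end; pointing "path" at the bucket or directory alone matches nothing.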

@StellaAthena
Member

Hello, I was also having this issue, but I changed my path to the files to my_bucket/dataset/custom/my_dataset_*.tfrecords and it worked. Hope this helps.

Is this due to a misspecified instruction in the Colab notebook, or did you not realize where you had saved the tfrecords?

But now I'm having bucket service issues; the fun never ends.

Unfortunately Google keeps changing how permissions work :( Six months ago this worked out of the box… I’ll see if I can find some time to hunt down the new way to handle permissions this weekend.
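
In the meantime, the usual first step is to re-run the standard Colab auth flow; this is the stock google.colab API, though whether it fixes your particular bucket error is a guess:

from google.colab import auth

# Re-authenticate the Colab runtime so it can read/write the GCS bucket.
auth.authenticate_user()

You may also need to grant the TPU's service account read access to the bucket in the GCS console.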
