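## **Lab Objectives**
**Hyperparameters** strongly affect the performance of a deep learning model, so finding suitable values matters.   
**Ray** provides, through its Ray Tune library, **tooling for efficiently searching for good hyperparameter values**, with parallel execution and a range of search and scheduling strategies. This can shorten tuning time and improve model performance.     
In this lab we use Ray together with PyTorch to tune an image classifier on the CIFAR-10 dataset.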

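### **When the Model Underperforms**
1. Change the model
    - In principle this has the largest impact, but many strong models are already available.
2. Change the data (add more data, or check the existing data for errors)
    - In practice this tends to yield the largest gains.
    - Generally, the more data the better.
3. Hyperparameter tuning
    - The effect is usually smaller than that of the first two.
    - It tends to be the last resort.
    - It matters less than it used to, but it is still done.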

**❓Quiz**     
### **Hyperparameter**
- **🖊 Answer:** A value that the model does not learn on its own
    - It has to be set by a person
    - Examples: learning rate, model size, choice of optimizer
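
To make the distinction concrete, here is a minimal sketch in plain Python (not from the lecture): the slope `w` is a parameter learned from data, while `lr` and `epochs` are hyperparameters chosen by hand. The function name `fit_slope` and the toy data are illustrative assumptions.

```python
def fit_slope(xs, ys, lr, epochs):
    """Fit y = w * x with plain gradient descent.

    lr and epochs are hyperparameters (set by a person);
    w is a parameter (learned from the data).
    """
    w = 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated with a true slope of 2

w_good = fit_slope(xs, ys, lr=0.05, epochs=200)    # reasonable learning rate
w_slow = fit_slope(xs, ys, lr=0.0001, epochs=200)  # learning rate far too small
print(w_good, w_slow)
```

With the same data and the same number of epochs, only the hand-picked learning rate decides whether the learned slope gets close to 2.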
    

### **Grid Layout & Random Layout**
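These are the most basic search methods (grid search and random search). More recently, methods based on Bayesian optimization have become dominant.    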
<br>

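**❓Quiz**     
#### **Grid Layout**
- Fix a range for each hyperparameter and cut it into evenly spaced values
- **🖊 Answer:** A method that tries every possible combination of hyperparameter values to find the best one
- Train with each combination in turn and pick the one that performs best.
- The learning rate is usually spaced on a log scale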
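
As a small illustration of spacing the learning rate on a log scale, the sketch below builds a grid whose points are evenly spaced in log10. The helper name `log_grid` and the chosen bounds are assumptions for illustration only.

```python
import math


def log_grid(low, high, num):
    """Return `num` values evenly spaced on a log10 scale between low and high."""
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


lr_grid = log_grid(1e-4, 1e-1, 4)
print(lr_grid)  # roughly 1e-4, 1e-3, 1e-2, 1e-1
```

A linear grid over the same range would place almost every point above 0.01, leaving the small learning rates nearly unexplored.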

#### **Random Layout**
Sample values at random, train with each sample, and use the best-performing one.
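
The two strategies can be compared side by side in a few lines of plain Python. This is only a sketch: `toy_validation_loss` is a hypothetical stand-in for "train a model and return its validation loss", and the value ranges are assumptions.

```python
import itertools
import math
import random


def toy_validation_loss(lr, hidden):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (math.log10(lr) + 2.7) ** 2 + ((hidden - 100) / 100) ** 2


def grid_search(lrs, hiddens):
    """Evaluate every combination on the grid and return the best one."""
    candidates = itertools.product(lrs, hiddens)
    return min(candidates, key=lambda c: toy_validation_loss(*c))


def random_search(num_trials, rng):
    """Evaluate randomly sampled combinations and return the best one."""
    candidates = [
        (10 ** rng.uniform(-4, -1), rng.randint(4, 256))  # lr sampled on a log scale
        for _ in range(num_trials)
    ]
    return min(candidates, key=lambda c: toy_validation_loss(*c))


best_grid = grid_search([1e-4, 1e-3, 1e-2, 1e-1], [4, 16, 64, 256])
best_random = random_search(16, random.Random(0))
print(best_grid, best_random)
```

Both searches spend 16 evaluations here. The grid only ever tries 4 distinct learning rates, while random search tries 16 different ones, which is the usual argument for preferring it when only a few hyperparameters really matter.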


### **Ray**
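- A widely used tool for hyperparameter tuning
- A module that supports multi-node, multi-process execution
- Developed for parallelizing ML/DL workloads
- Provides various modules for hyperparameter search
<br><br>

**❓Quiz:** Which statement about Ray is least accurate?       
**🖊 Answer:** It is specialized only for hyperparameter tuning.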

In [None]:
!pip install ray



In [None]:
# Ray uses the tensorboardX module internally
!pip install tensorboardX



In [None]:
!pip install wandb



### **1. Data Loading and Preprocessing**
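
The `load_data` function in this section normalizes each channel with mean 0.5 and standard deviation 0.5, and the training function later keeps 80% of the training set for training and 20% for validation. The arithmetic behind both steps can be checked in plain Python; the helper name `normalize` is an illustrative assumption, and 50,000 is the size of the CIFAR-10 training set.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel normalization: (x - mean) / std."""
    return (pixel - mean) / std


# ToTensor() scales pixels to [0, 1]; normalization then maps them to [-1, 1].
normalized = [normalize(p) for p in (0.0, 0.5, 1.0)]
print(normalized)  # [-1.0, 0.0, 1.0]

# 80/20 split of the 50,000 CIFAR-10 training images.
num_train_images = 50000
train_size = int(num_train_images * 0.8)
val_size = num_train_images - train_size
print(train_size, val_size)  # 40000 10000
```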

In [None]:
import os
import ray
import wandb
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from functools import partial
from torch.utils.data import random_split
from ray import tune
from ray.tune import CLIReporter
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.bayesopt import BayesOptSearch
from ray.tune.search.hyperopt import HyperOptSearch

In [None]:
def load_data(data_dir='./data'):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
        ])

    # Load the CIFAR-10 dataset
    trainset = torchvision.datasets.CIFAR10(
        root=data_dir, train=True, download=True, transform=transform
    )

    testset = torchvision.datasets.CIFAR10(
        root=data_dir, train=False, download=True, transform=transform
    )

    return trainset, testset

### **2. Defining the Neural Network**

In [None]:
class Net(nn.Module):
    # Initialize the model.
    # l1 and l2 set the widths of the two hidden fully connected layers.
    def __init__(self, l1=120, l2=84):
        super(Net, self).__init__()
        # Convolution and pooling layers
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, l1)
        self.fc2 = nn.Linear(l1, l2)
        self.fc3 = nn.Linear(l2, 10)

    # Forward pass
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
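
The `16 * 5 * 5` input size of `fc1` follows from how each layer shrinks a 32x32 CIFAR-10 image. A short sketch of that arithmetic, using the standard output-size formula for convolution and pooling (the helper name `conv_out` is an assumption for illustration):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


side = 32                                   # CIFAR-10 images are 32x32
side = conv_out(side, kernel=5)             # conv1 (5x5): 32 -> 28
side = conv_out(side, kernel=2, stride=2)   # max pool (2x2): 28 -> 14
side = conv_out(side, kernel=5)             # conv2 (5x5): 14 -> 10
side = conv_out(side, kernel=2, stride=2)   # max pool (2x2): 10 -> 5

flat_features = 16 * side * side            # conv2 has 16 output channels
print(side, flat_features)  # 5 400
```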

### **3. Model Training**
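
The training function in this section takes `config` plus an extra `data_dir` argument. Ray Tune calls a trainable with the `config` only, so extra arguments are typically bound in advance with `functools.partial`. A minimal sketch of what that binding does, with `train_stub` as a hypothetical stand-in for the real training function:

```python
from functools import partial


def train_stub(config, data_dir=None):
    """Hypothetical stand-in for the training function: reports what it received."""
    return f"lr={config['lr']} data_dir={data_dir}"


# partial returns a new callable with data_dir already filled in.
trainable = partial(train_stub, data_dir="/tmp/data")

# The caller now only has to pass the config.
message = trainable({"lr": 0.01})
print(message)  # lr=0.01 data_dir=/tmp/data
```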

In [None]:
# Ray Tune calls this function once per trial, so the entire training procedure must be defined inside it
def train_cifar(config, data_dir=None):
    # Initialize the model with the layer sizes sampled for this trial
    net = Net(config["l1"], config["l2"])

    # Pick the device (GPU if available, otherwise CPU)
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if torch.cuda.device_count() > 1:
            net = nn.DataParallel(net)  # data parallelism across multiple GPUs
    net.to(device)  # move the model to the selected device

    # Loss function: cross-entropy
    criterion = nn.CrossEntropyLoss()
    # Optimizer: SGD with momentum
    optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

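    # Restore model and optimizer state from a checkpoint, if one exists
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        # A Ray checkpoint is a directory; the state lives in a file named "checkpoint" inside it
        with checkpoint.as_directory() as checkpoint_dir:
            model_state, optimizer_state = torch.load(
                os.path.join(checkpoint_dir, "checkpoint"))
        net.load_state_dict(model_state)
        optimizer.load_state_dict(optimizer_state)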

    # Load the data and split the training set into training and validation subsets (80/20)
    trainset, testset = load_data(data_dir)
    test_abs = int(len(trainset) * 0.8)
    train_subset, val_subset = random_split(
        trainset, [test_abs, len(trainset) - test_abs])

    # Set up the data loaders
    trainloader = torch.utils.data.DataLoader(
        train_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)
    valloader = torch.utils.data.DataLoader(
        val_subset,
        batch_size=int(config["batch_size"]),
        shuffle=True,
        num_workers=8)

    # Monitor the training run with wandb
    wandb.init(project='torch-turn', entity='nayoungpark')
    wandb.watch(net)

    # Training loop
    for epoch in range(10):  # 10 passes over the training subset
        running_loss = 0.0
        epoch_steps = 0
        for i, data in enumerate(trainloader, 0):
            # Unpack inputs and labels and move them to the device
            inputs, labels = data
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # Accumulate the loss
            running_loss += loss.item()
            epoch_steps += 1

            # Print the running loss every 2000 mini-batches
            if i % 2000 == 1999:
                print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1,
                                                running_loss / epoch_steps))
                running_loss = 0.0

        # Compute loss and accuracy on the validation subset
        val_loss = 0.0
        val_steps = 0
        total = 0
        correct = 0
        for i, data in enumerate(valloader, 0):
            with torch.no_grad():
                inputs, labels = data
                inputs, labels = inputs.to(device), labels.to(device)

                outputs = net(inputs)
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()

                loss = criterion(outputs, labels)
                val_loss += loss.cpu().numpy()
                val_steps += 1

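        # Log validation metrics to wandb in one call so they share a step.
        # "val_loss" is the mean validation loss; "loss" is the loss of the last validation batch.
        wandb.log({"val_loss": val_loss / val_steps, "loss": loss.item()})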

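        # Save a checkpoint and report loss and accuracy to Ray Tune.
        # This part may differ from the lecture because of Ray version differences
        # (newer Ray releases may expect the ray.tune equivalents of these calls).
        # A fresh trial has no checkpoint yet, so ray.train.get_checkpoint() returns None here;
        # write the state to a temporary directory and hand that directory to Ray instead.
        import tempfile
        with tempfile.TemporaryDirectory() as checkpoint_dir:
            path = os.path.join(checkpoint_dir, "checkpoint")
            torch.save((net.state_dict(), optimizer.state_dict()), path)
            ray.train.report(
                {"loss": val_loss / val_steps, "accuracy": correct / total},
                checkpoint=ray.train.Checkpoint.from_directory(checkpoint_dir))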

    print("Finished Training")

### **4. Evaluating Model Performance**
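
The evaluation code in this section picks, for every image, the class with the highest score and counts how often that matches the label. The same bookkeeping in plain Python, on made-up scores (the helper name `batch_accuracy_counts` and the numbers are illustrative assumptions):

```python
def batch_accuracy_counts(logits, labels):
    """Return (number correct, batch size) for one batch of class scores."""
    # argmax over each row: index of the highest score.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct, len(labels)


batches = [
    ([[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]], [1, 0]),  # both predictions correct
    ([[0.0, 0.1, 0.9]], [1]),                       # predicted 2, label 1
]

correct, total = 0, 0
for logits, labels in batches:
    c, n = batch_accuracy_counts(logits, labels)
    correct += c
    total += n

print(correct, total, correct / total)
```

Accumulating counts and dividing once at the end keeps the result correct even when the last batch is smaller than the others.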

In [None]:
def test_accuracy(net, device='cpu'):
    trainset, testset = load_data()

    # Test data loader
    testloader = torch.utils.data.DataLoader(
        testset,
        batch_size=4,
        shuffle=False,
        num_workers=2
    )

    # Number of correctly classified images
    correct = 0

    # Total number of images
    total = 0

    # Gradients are not needed during evaluation, so disable gradient tracking
    with torch.no_grad():
        for data in testloader:  # iterate over the test set batch by batch
            images, labels = data
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            # Pick the class with the highest score
            _, predicted = torch.max(outputs.data, 1)
            # Update the running totals
            total += labels.size(0)  # size of the first dimension of labels, i.e. the batch size
            correct += (predicted == labels).sum().item()

    return correct/total

### **5. Main Entry Point**
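
The main routine in this section uses an ASHA scheduler, which stops unpromising trials early. The underlying idea, successive halving, can be sketched in plain Python. This is a simplified, synchronous illustration and not Ray's implementation: Ray's `ASHAScheduler` is asynchronous and its exact rung placement may differ. The function names and the loss curves are made up for the example.

```python
def rung_milestones(grace_period, reduction_factor, max_t):
    """Epoch counts at which trials are compared and pruned."""
    milestones = []
    t = grace_period
    while t < max_t:
        milestones.append(t)
        t *= reduction_factor
    return milestones


def successive_halving(losses_by_trial, grace_period=1, reduction_factor=2, max_t=10):
    """losses_by_trial maps a trial name to its per-epoch validation losses.

    Returns the names of the trials that survive every rung.
    """
    alive = list(losses_by_trial)
    for t in rung_milestones(grace_period, reduction_factor, max_t):
        # Rank surviving trials by their loss after t epochs (lower is better).
        alive.sort(key=lambda name: losses_by_trial[name][t - 1])
        keep = max(1, len(alive) // reduction_factor)
        alive = alive[:keep]
    return alive


# Hypothetical loss curves: each trial's loss decays at a different rate.
rates = {"a": 0.9, "b": 0.8, "c": 0.95, "d": 0.7}
curves = {
    name: [2.0 * rate ** (epoch + 1) for epoch in range(10)]
    for name, rate in rates.items()
}

survivors = successive_halving(curves)
print(rung_milestones(1, 2, 10), survivors)
```

Half of the trials are dropped after a single epoch, so most of the compute goes to the configurations that already look promising.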

In [None]:
def main(num_samples=10, max_num_epochs=10, gpus_per_trial=2):
    # Absolute path of the data directory
    data_dir = os.path.abspath("./data")

    # Download the dataset once up front
    load_data(data_dir)

    # Hyperparameter search space, passed to Tune as config
    config = {
        "l1": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),  # powers of two from 4 to 256
        "l2": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),  # powers of two from 4 to 256
        "lr": tune.loguniform(1e-4, 1e-1),  # sampled on a log scale
        "batch_size": tune.choice([2, 4, 8, 16])
    }

    # Scheduler: ASHA stops poorly performing trials early instead of training every
    # trial to the end, which saves the time that unpromising configurations would waste.
    # A search algorithm such as Bayesian optimization can be supplied separately via `search_alg`.
    scheduler = ASHAScheduler(
        metric="loss",
        mode="min",
        max_t=max_num_epochs,
        grace_period=1,
        reduction_factor=2)

    # Reporter: controls what is printed on the command line
    reporter = CLIReporter(
        metric_columns=["loss", "accuracy", "training_iteration"])

    # Run the trials with Ray Tune
    result = tune.run(  # launches the trials in parallel
        partial(train_cifar, data_dir=data_dir),  # partial pre-binds data_dir; Tune passes only config
        resources_per_trial={"cpu": 2, "gpu": gpus_per_trial},
        config=config,
        num_samples=num_samples,
        scheduler=scheduler,
        progress_reporter=reporter)  # trials are spread over the available CPUs/GPUs

    # Get the best trial
    best_trial = result.get_best_trial("loss", "min", "last")
    print("Best trial config: {}".format(best_trial.config))
    print("Best trial final validation loss: {}".format(
        best_trial.last_result["loss"]))
    print("Best trial final validation accuracy: {}".format(
        best_trial.last_result["accuracy"]))

    # Build a model with the best trial's layer sizes
    best_trained_model = Net(best_trial.config["l1"], best_trial.config["l2"])
    device = "cpu"
    if torch.cuda.is_available():
        device = "cuda:0"
        if gpus_per_trial > 1:
            best_trained_model = nn.DataParallel(best_trained_model)
    best_trained_model.to(device)

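    # Load the checkpoint reported by the best trial
    best_checkpoint = result.get_best_checkpoint(
        trial=best_trial, metric="loss", mode="min")
    with best_checkpoint.as_directory() as checkpoint_dir:
        model_state, optimizer_state = torch.load(
            os.path.join(checkpoint_dir, "checkpoint"))
    best_trained_model.load_state_dict(model_state)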

    # Compute accuracy on the test set
    test_acc = test_accuracy(best_trained_model, device)
    print("Best trial test set accuracy: {}".format(test_acc))


if __name__ == "__main__":
    # Log in to WandB and call main
    wandb.login(key="")
    main(num_samples=10, max_num_epochs=10, gpus_per_trial=0)

[34m[1mwandb[0m: Currently logged in as: [33mhcc9876[0m ([33mnayoungpark[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Files already downloaded and verified
Files already downloaded and verified


2024-06-17 02:55:07,651	INFO worker.py:1753 -- Started a local Ray instance.
2024-06-17 02:55:09,539	INFO tune.py:253 -- Initializing Ray automatically. For cluster usage or custom Ray initialization, call `ray.init(...)` before `tune.run(...)`.


+--------------------------------------------------------------------+
| Configuration for experiment     train_cifar_2024-06-17_02-55-09   |
+--------------------------------------------------------------------+
| Search algorithm                 BasicVariantGenerator             |
| Scheduler                        AsyncHyperBandScheduler           |
| Number of trials                 10                                |
+--------------------------------------------------------------------+

View detailed results here: /root/ray_results/train_cifar_2024-06-17_02-55-09
To visualize your results with TensorBoard, run: `tensorboard --logdir /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts`

Trial status: 10 PENDING
Current time: 2024-06-17 02:55:11. Total running time: 1s
Logical resource usage: 0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+

(func pid=5938) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=5938) wandb: Tracking run with wandb version 0.17.1
(func pid=5938) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00000_0_batch_size=16,lr=0.0469_2024-06-17_02-55-10/wandb/run-20240617_025522-ka506r1c
(func pid=5938) wandb: Run `wandb offline` to turn off syncing.
(func pid=5938) wandb: Syncing run eager-jazz-29
(func pid=5938) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=5938) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/ka506r1c



Trial status: 1 RUNNING | 9 PENDING
Current time: 2024-06-17 02:55:42. Total running time: 31s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00000   RUNNING    0.0468533               16 |
| train_cifar_0289e_00001   PENDING    0.0407774                8 |
| train_cifar_0289e_00002   PENDING    0.0111418                2 |
| train_cifar_0289e_00003   PENDING    0.000215367              4 |
| train_cifar_0289e_00004   PENDING    0.000454672              2 |
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar

2024-06-17 02:55:54,780	ERROR tune_controller.py:1331 -- Trial task failed for trial train_cifar_0289e_00000
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2613, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::ImplicitFunc.train() (pid=5938, ip=172.28.0.12, actor_i


Trial train_cifar_0289e_00000 errored after 0 iterations at 2024-06-17 02:55:54. Total running time: 44s
Error file: /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts/train_cifar_0289e_00000_0_batch_size=16,lr=0.0469_2024-06-17_02-55-10/error.txt

Trial train_cifar_0289e_00001 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_0289e_00001 config             |
+--------------------------------------------------+
| batch_size                                     8 |
| l1                                             4 |
| l2                                           256 |
| lr                                       0.04078 |
+--------------------------------------------------+
(func pid=6298) Files already downloaded and verified
(func pid=6298) Files already downloaded and verified


(func pid=6298) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=6298) wandb: Tracking run with wandb version 0.17.1
(func pid=6298) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00001_1_batch_size=8,lr=0.0408_2024-06-17_02-55-11/wandb/run-20240617_025603-h3o2bocd
(func pid=6298) wandb: Run `wandb offline` to turn off syncing.
(func pid=6298) wandb: Syncing run solar-deluge-30
(func pid=6298) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=6298) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/h3o2bocd



Trial status: 1 ERROR | 1 RUNNING | 8 PENDING
Current time: 2024-06-17 02:56:12. Total running time: 1min 1s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00001   RUNNING    0.0407774                8 |
| train_cifar_0289e_00002   PENDING    0.0111418                2 |
| train_cifar_0289e_00003   PENDING    0.000215367              4 |
| train_cifar_0289e_00004   PENDING    0.000454672              2 |
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_00009   PENDING    0.0305181                4 |

2024-06-17 02:56:52,639	ERROR tune_controller.py:1331 -- Trial task failed for trial train_cifar_0289e_00001
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2613, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::ImplicitFunc.train() (pid=6298, ip=172.28.0.12, actor_i


Trial train_cifar_0289e_00001 errored after 0 iterations at 2024-06-17 02:56:52. Total running time: 1min 42s
Error file: /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts/train_cifar_0289e_00001_1_batch_size=8,lr=0.0408_2024-06-17_02-55-11/error.txt

Trial train_cifar_0289e_00002 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_0289e_00002 config             |
+--------------------------------------------------+
| batch_size                                     2 |
| l1                                             8 |
| l2                                             4 |
| lr                                       0.01114 |
+--------------------------------------------------+
(func pid=6719) Files already downloaded and verified
(func pid=6719) Files already downloaded and verified


(func pid=6719) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=6719) wandb: Tracking run with wandb version 0.17.1
(func pid=6719) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00002_2_batch_size=2,lr=0.0111_2024-06-17_02-55-11/wandb/run-20240617_025702-uibxgaur
(func pid=6719) wandb: Run `wandb offline` to turn off syncing.
(func pid=6719) wandb: Syncing run fearless-universe-31
(func pid=6719) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=6719) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/uibxgaur



Trial status: 2 ERROR | 1 RUNNING | 7 PENDING
Current time: 2024-06-17 02:57:12. Total running time: 2min 1s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00002   RUNNING    0.0111418                2 |
| train_cifar_0289e_00003   PENDING    0.000215367              4 |
| train_cifar_0289e_00004   PENDING    0.000454672              2 |
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_00009   PENDING    0.0305181                4 |
| train_cifar_0289e_00000   ERROR      0.0468533               16 |

2024-06-17 02:59:04,702	ERROR tune_controller.py:1331 -- Trial task failed for trial train_cifar_0289e_00002
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2613, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::ImplicitFunc.train() (pid=6719, ip=172.28.0.12, actor_i


Trial train_cifar_0289e_00002 errored after 0 iterations at 2024-06-17 02:59:04. Total running time: 3min 54s
Error file: /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts/train_cifar_0289e_00002_2_batch_size=2,lr=0.0111_2024-06-17_02-55-11/error.txt

Trial train_cifar_0289e_00003 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_0289e_00003 config             |
+--------------------------------------------------+
| batch_size                                     4 |
| l1                                           128 |
| l2                                            32 |
| lr                                       0.00022 |
+--------------------------------------------------+
(func pid=7456) Files already downloaded and verified

Trial status: 3 ERROR | 1 RUNNING | 6 PENDING
Current time: 2024-06-17 02:59:12. Total running time: 4min 1s
Logical resour

(func pid=7456) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=7456) wandb: Tracking run with wandb version 0.17.1
(func pid=7456) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00003_3_batch_size=4,lr=0.0002_2024-06-17_02-55-11/wandb/run-20240617_025914-9gdc6qwn
(func pid=7456) wandb: Run `wandb offline` to turn off syncing.
(func pid=7456) wandb: Syncing run wild-valley-32
(func pid=7456) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=7456) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/9gdc6qwn


(func pid=7456) [1,  2000] loss: 2.303
(func pid=7456) [1,  4000] loss: 1.148
Trial status: 3 ERROR | 1 RUNNING | 6 PENDING
Current time: 2024-06-17 02:59:42. Total running time: 4min 31s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00003   RUNNING    0.000215367              4 |
| train_cifar_0289e_00004   PENDING    0.000454672              2 |
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_00009   PENDING    0.0305181                4 |
| train_cifar_0289e_00000   ERROR      

2024-06-17 03:00:29,017	ERROR tune_controller.py:1331 -- Trial task failed for trial train_cifar_0289e_00003
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2613, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::ImplicitFunc.train() (pid=7456, ip=172.28.0.12, actor_i


Trial train_cifar_0289e_00003 errored after 0 iterations at 2024-06-17 03:00:29. Total running time: 5min 18s
Error file: /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts/train_cifar_0289e_00003_3_batch_size=4,lr=0.0002_2024-06-17_02-55-11/error.txt

Trial train_cifar_0289e_00004 started with configuration:
+--------------------------------------------------+
| Trial train_cifar_0289e_00004 config             |
+--------------------------------------------------+
| batch_size                                     2 |
| l1                                            32 |
| l2                                             8 |
| lr                                       0.00045 |
+--------------------------------------------------+
(func pid=7993) Files already downloaded and verified
(func pid=7993) Files already downloaded and verified


(func pid=7993) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=7993) wandb: Tracking run with wandb version 0.17.1
(func pid=7993) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00004_4_batch_size=2,lr=0.0005_2024-06-17_02-55-11/wandb/run-20240617_030036-5j50jzlj
(func pid=7993) wandb: Run `wandb offline` to turn off syncing.
(func pid=7993) wandb: Syncing run super-surf-33
(func pid=7993) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=7993) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/5j50jzlj



Trial status: 4 ERROR | 1 RUNNING | 5 PENDING
Current time: 2024-06-17 03:00:42. Total running time: 5min 32s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00004   RUNNING    0.000454672              2 |
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_00009   PENDING    0.0305181                4 |
| train_cifar_0289e_00000   ERROR      0.0468533               16 |
| train_cifar_0289e_00001   ERROR      0.0407774                8 |
| train_cifar_0289e_00002   ERROR      0.0111418                2 

2024-06-17 03:02:38,240	ERROR tune_controller.py:1331 -- Trial task failed for trial train_cifar_0289e_00004
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2613, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 861, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::ImplicitFunc.train() (pid=7993, ip=172.28.0.12, actor_i


Trial train_cifar_0289e_00004 errored after 0 iterations at 2024-06-17 03:02:38. Total running time: 7min 27s
Error file: /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/driver_artifacts/train_cifar_0289e_00004_4_batch_size=2,lr=0.0005_2024-06-17_02-55-11/error.txt

Trial status: 5 ERROR | 5 PENDING
Current time: 2024-06-17 03:02:42. Total running time: 7min 32s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00005   PENDING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_

(func pid=8723) wandb: Currently logged in as: hcc9876 (nayoungpark). Use `wandb login --relogin` to force relogin
(func pid=8723) wandb: Tracking run with wandb version 0.17.1
(func pid=8723) wandb: Run data is saved locally in /tmp/ray/session_2024-06-17_02-55-02_238598_5248/artifacts/2024-06-17_02-55-09/train_cifar_2024-06-17_02-55-09/working_dirs/train_cifar_0289e_00005_5_batch_size=16,lr=0.0127_2024-06-17_02-55-11/wandb/run-20240617_030246-7i3peumu
(func pid=8723) wandb: Run `wandb offline` to turn off syncing.
(func pid=8723) wandb: Syncing run spring-tree-34
(func pid=8723) wandb: ⭐️ View project at https://wandb.ai/nayoungpark/torch-turn
(func pid=8723) wandb: 🚀 View run at https://wandb.ai/nayoungpark/torch-turn/runs/7i3peumu
2024-06-17 03:03:06,076	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/root/ray_results/train_cifar_2024-06-17_02-55-09' in 0.0083s.



Trial status: 5 ERROR | 1 RUNNING | 4 PENDING
Current time: 2024-06-17 03:03:06. Total running time: 7min 55s
Logical resource usage: 2.0/2 CPUs, 0/1 GPUs (0.0/1.0 accelerator_type:T4)
+-----------------------------------------------------------------+
| Trial name                status              lr     batch_size |
+-----------------------------------------------------------------+
| train_cifar_0289e_00005   RUNNING    0.0126617               16 |
| train_cifar_0289e_00006   PENDING    0.00133331               8 |
| train_cifar_0289e_00007   PENDING    0.000125546             16 |
| train_cifar_0289e_00008   PENDING    0.000782675              4 |
| train_cifar_0289e_00009   PENDING    0.0305181                4 |
| train_cifar_0289e_00000   ERROR      0.0468533               16 |
| train_cifar_0289e_00001   ERROR      0.0407774                8 |
| train_cifar_0289e_00002   ERROR      0.0111418                2 |
| train_cifar_0289e_00003   ERROR      0.000215367              4 

KeyboardInterrupt: 