# Problem 6.

## Part 1.
Episodic tasks consist of distinct episodes with a clear start and end, where what happens in one episode does not carry over to the next. In contrast, continuous tasks (usually called continuing tasks in reinforcement learning) proceed indefinitely or for an extended period without explicit resets, and decisions made at any point can influence the task's progression over time. These distinctions affect how problems are modeled and solved, particularly in reinforcement learning, where they determine how the return is defined.


Playing a game of chess is a classic example of an episodic task. Each game starts with the same board setup, and players make moves until the game ends in a win, loss, or draw. Once the game concludes, the outcome does not affect the next game, as each episode is treated independently of others.


Driving a car in a simulation exemplifies a continuous task. The simulation has no fixed endpoint and keeps running as long as the car operates. Each decision, such as accelerating, steering, or braking, affects the car's position, speed, and environment, influencing future decisions in a continuous loop.
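One practical consequence of the distinction is how the return is computed. The sketch below is illustrative only: the reward sequences and the discount factor `gamma=0.9` are assumptions chosen for the example, not values from the problem statement. An episodic task can sum rewards up to the terminal step, while a continuing task usually relies on discounting to keep the return finite.

```python
from typing import Sequence


def episodic_return(rewards: Sequence[float]) -> float:
    """Undiscounted return of one finished episode: the plain sum of rewards."""
    return float(sum(rewards))


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Discounted return r_0 + gamma * r_1 + gamma^2 * r_2 + ... over the given rewards."""
    g = 0.0
    # Work backwards through the rewards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Chess-like episode (assumed rewards): nothing until the final move, then +1 for a win.
chess_rewards = [0.0, 0.0, 0.0, 1.0]
# Driving-like stream (assumed rewards): +1 at every step, with no terminal step.
driving_rewards = [1.0] * 1000

print(episodic_return(chess_rewards))
# With gamma < 1 the discounted sum stays bounded, here close to 1 / (1 - 0.9) = 10.
print(discounted_return(driving_rewards, gamma=0.9))
```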


## Part 2.

#### 1.

Exploration refers to the agent’s strategy of trying new actions to discover their effects and potential rewards in the environment. The goal of exploration is to gather more information about the environment, which is crucial when the agent is uncertain about the outcomes of certain actions. By exploring, the agent can avoid getting stuck in a suboptimal solution and potentially find better policies. A classic example is a robot randomly choosing paths in a maze to map all possible routes.

Exploitation, on the other hand, involves the agent using the knowledge it has already acquired to select the action that it believes will maximize the immediate reward. This approach prioritizes using what the agent already knows to achieve higher rewards rather than learning new information. For instance, if a robot knows a particular path leads to a treasure, it will keep choosing that path to maximize its reward.



#### 2.

1.   Exploration: By choosing a random action with a probability of ϵ, the actor ensures that it occasionally explores less familiar or suboptimal actions.

2.   Exploitation: By selecting the action with the highest estimated value (based on the current Q-values or policy) with a probability of 1−ϵ, the actor leverages its existing knowledge to maximize immediate rewards.   
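The two branches above can be sketched in a few lines. This is a minimal illustration, assuming the Q-values for the current state are available as a plain list; the function name `epsilon_greedy` and its parameters are hypothetical.

```python
import random
from typing import Optional, Sequence


def epsilon_greedy(q_values: Sequence[float], epsilon: float,
                   rng: Optional[random.Random] = None) -> int:
    """Return an action index chosen by the epsilon-greedy rule."""
    rng = rng or random
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is always greedy: index 1 has the largest value.
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```

Note that the random branch can also land on the greedy action, so the greedy action is chosen slightly more often than 1−ϵ of the time.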

#### 3.

ϵ should, in general, follow a schedule rather than stay fixed. A fixed ϵ keeps the balance between exploration and exploitation constant, which is inefficient: if ϵ is high, the agent keeps taking random actions long after it has learned good value estimates, and if ϵ is low, it explores too little early on and may settle on a suboptimal policy. A schedule that starts with a high ϵ and decays it over training gives broad exploration (selecting actions to learn their rewards) at the start and mostly exploitation (selecting the best-learned action to maximize immediate rewards) later, which makes learning more effective.
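Two common ways to schedule ϵ are a linear ramp and an exponential decay. The sketch below is only an illustration; the start value 1.0, the floor 0.05, the 10,000-step horizon, and the decay rate 0.999 are assumed defaults, not values prescribed by the problem.

```python
def linear_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
                   decay_steps: int = 10_000) -> float:
    """Move epsilon linearly from eps_start to eps_end, then hold it at eps_end."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def exponential_epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
                        decay_rate: float = 0.999) -> float:
    """Shrink the gap to eps_end by a constant factor at every step."""
    return eps_end + (eps_start - eps_end) * decay_rate ** step


# Early steps explore almost always; late steps explore only about 5% of the time.
for step in (0, 5_000, 10_000):
    print(step, round(linear_epsilon(step), 3), round(exponential_epsilon(step), 3))
```

Keeping a small floor instead of decaying to zero lets the agent continue to explore occasionally late in training.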

#### 4.
When ϵ is high, the agent favors exploration by taking random actions more frequently. When ϵ is low, the agent favors exploitation, choosing the actions with the highest estimated value according to its current knowledge. As ϵ gradually decreases during learning, the agent shifts toward exploitation and focuses on maximizing reward using the knowledge it gained during exploration.

## Part 3.

Steps:

1. **Initialization of Replay Memory and Q-Network:** Initialize a replay memory and a Q-network with random weights. Q-Learning has no replay memory and no neural network; it maintains a Q-table instead.

2. **Select Action Using ϵ-Greedy Policy:** Generate actions by following the ϵ-greedy policy.

3. **Perform Action and Store Experience:** Execute the selected action and store the experience (state, action, reward, next state) in the replay memory. Q-Learning uses no replay memory; it updates the Q-table immediately after each experience.

4. **Sample Mini-batches:** From the replay memory, extract mini-batches of experiences used for training. In Q-Learning, updates will only depend on the very last experience seen.

5. **Calculate Target Q-Value and Minimize Loss:** Determine the target Q-value and update the weights of the network by minimizing the loss. Q-learning updates the Q-values directly through the Bellman equation.

6. **Periodically Update Target Network:** Periodically copy the weights of the main Q-network into the target network to keep updates stable. Q-Learning does not use a target network, so its updates are less stable.
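For comparison with the steps above, the tabular Q-Learning side of each contrast fits in one function. This is a minimal sketch, assuming hashable states and a small discrete action set; the names `make_q_table` and `q_learning_update` and the default `alpha` and `gamma` are hypothetical choices for illustration.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, List


def make_q_table(n_actions: int) -> DefaultDict[Hashable, List[float]]:
    """Q-table that returns a row of zeros for any state not seen before."""
    return defaultdict(lambda: [0.0] * n_actions)


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha: float = 0.1, gamma: float = 0.99) -> float:
    """Apply one Q-Learning update in place and return the TD error."""
    # The bootstrap term comes from the same table (there is no target network).
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    td_error = td_target - q[state][action]
    # The update uses only the most recent experience (there is no replay memory).
    q[state][action] += alpha * td_error
    return td_error


q = make_q_table(n_actions=2)
q_learning_update(q, "s0", 1, reward=1.0, next_state="s1", done=False, alpha=0.5, gamma=0.9)
print(q["s0"])  # [0.0, 0.5]
```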

## Part 4.
The target Q-network improves training stability by providing a fixed reference for the Q-value updates. Because the target is computed by a separate network that is decoupled from the online network, an update to the online weights does not immediately shift its own targets; this weakens the feedback loop that can otherwise cause oscillation or divergence. The target network is synchronized with the online network only occasionally, which gives smoother convergence in high-dimensional environments.
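The effect can be seen in a toy setting. The sketch below assumes a linear Q-function with one weight row per action instead of a deep network, and the helper names (`q_values`, `td_target`, `sync_target`) are hypothetical. It shows that the target stays fixed while the online weights change, until the next synchronization.

```python
import copy
from typing import List, Sequence

Weights = List[List[float]]  # one weight row per action (assumed linear Q-function)


def q_values(weights: Weights, state: Sequence[float]) -> List[float]:
    """Q(s, a) for every action a, as a dot product of weights and state features."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def td_target(target_weights: Weights, reward: float, next_state: Sequence[float],
              done: bool, gamma: float = 0.99) -> float:
    """Bootstrapped target computed with the frozen target weights."""
    if done:
        return reward
    return reward + gamma * max(q_values(target_weights, next_state))


def sync_target(online: Weights, target: Weights, step: int, sync_every: int) -> Weights:
    """Return a fresh copy of the online weights every sync_every steps, else the old target."""
    if step % sync_every == 0:
        return copy.deepcopy(online)
    return target


online = [[1.0, 0.0], [0.0, 1.0]]
target = copy.deepcopy(online)
online[0][0] = 5.0  # the online network has moved on after some gradient steps

# The target is still computed from the frozen copy, so it is unaffected.
print(td_target(target, 1.0, [1.0, 2.0], done=False, gamma=0.5))  # 2.0
# Bootstrapping from the online weights instead would chase the latest update.
print(td_target(online, 1.0, [1.0, 2.0], done=False, gamma=0.5))  # 3.5
```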

## Part 5.
Experience replay enhances Q-Learning by storing past experiences in a buffer and sampling them at random during training. Random sampling breaks up the strong correlation between consecutive transitions and lets the agent learn from a large, varied set of experiences. It also improves data efficiency, because each stored transition can be reused in many updates instead of requiring constant new interaction with the environment. Because Q-Learning is off-policy, the agent can learn from these older experiences even though they were generated under earlier policies, which makes learning both more stable and more efficient.
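A uniform replay buffer can be sketched with a bounded deque. This is an illustrative implementation, not the one from any particular DQN codebase; the class and field names are assumptions.

```python
import random
from collections import deque
from typing import Deque, List, NamedTuple, Optional


class Transition(NamedTuple):
    state: object
    action: int
    reward: float
    next_state: object
    done: bool


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transition when full."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._storage: Deque[Transition] = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, state, action, reward, next_state, done) -> None:
        self._storage.append(Transition(state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement breaks up temporal correlation.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.push(i, 0, 1.0, i + 1, False)
print(len(buffer))  # 3, because the two oldest transitions were evicted
```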

## Part 6.
Prioritized Experience Replay is an enhancement to the traditional experience replay mechanism in reinforcement learning, where experiences in the replay buffer are sampled based on their importance rather than uniformly at random.  This method ensures that the agent focuses more on learning from significant experiences, such as those with higher temporal-difference (TD) errors, which indicate greater discrepancies between the predicted Q-value and the target Q-value.  Prioritizing these experiences accelerates learning by addressing critical mistakes more frequently.

The priority of each experience in prioritized experience replay reflects how much the agent can learn from it, which is typically determined by how surprising the experience is. This is measured by the magnitude of the TD error: the difference between the Q-value the agent predicted for taking an action in a given state and the target computed from the observed reward and next state. Experiences with larger errors receive higher priority because they indicate areas where the agent's current understanding is incorrect or incomplete.
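One common way to turn TD errors into sampling probabilities is the proportional variant, where the priority is p_i = (|δ_i| + ε)^α and experience i is sampled with probability p_i / Σ_k p_k. The sketch below assumes that variant; the exponent `alpha=0.6` and the small constant `eps` are illustrative defaults, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sampling_probabilities(td_errors: Sequence[float], alpha: float = 0.6,
                           eps: float = 1e-6) -> List[float]:
    """Convert TD errors into sampling probabilities (proportional prioritization)."""
    # eps keeps zero-error experiences sampleable; alpha = 0 recovers uniform sampling.
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(td_errors: Sequence[float], batch_size: int, alpha: float = 0.6,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Draw buffer indices with probability proportional to priority."""
    rng = rng or random.Random()
    probs = sampling_probabilities(td_errors, alpha)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)


# The experience with the largest TD error gets the largest share of samples.
print(sampling_probabilities([0.1, 1.0, 2.0]))
```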

## Part 7.

GORILA and Ape-X share key similarities as distributed reinforcement learning architectures. Both employ multiple actors that interact with the environment in parallel, generating a diverse range of experiences that are stored in replay memory and consumed by the learning side of the system. This design accelerates training by collecting large volumes of data quickly. Both also separate acting from learning, so data collection and network updates proceed in parallel and can be scaled independently. They are highly scalable, capable of handling numerous distributed actors to tackle complex tasks through parallel computation.

The primary differences between GORILA and Ape-X lie in their experience replay and actor-learner interactions. GORILA samples uniformly from its replay memory, whereas Ape-X uses prioritized experience replay, focusing on experiences with higher TD errors to improve learning efficiency. GORILA runs multiple learners that send gradients to a parameter server, from which actors and learners synchronize their network copies; Ape-X uses a single learner, and its actors periodically refresh their network copy from that learner so their behavior stays close to the current policy. In terms of complexity, GORILA requires managing multiple learners and replay buffers, making its implementation more intricate than Ape-X, which employs a single learner and a centralized prioritized replay buffer for simplicity and scalability.

# Problem 7

## Part 1.
1. IBM Watson (watsonx.ai):

- TensorFlow: 2.14.1
- PyTorch: 2.1.2
- TensorBoard: 2.12.2
- torchvision: 0.15.2
- OpenCV: 4.7.0
- scikit-learn: 1.1.3
- XGBoost: 1.7.6
- ONNX: 1.13.1
- PyArrow: 11.0.0
- Python: 3.10.12

2. Google Cloud AI Platform (Vertex AI):
- Base: Versions up to CUDA 12.3
- TensorFlow: Versions up to 2.17.0
- PyTorch: Versions up to 2.3.0

3. Microsoft Azure AI:

- TensorFlow: 2.5
- PyTorch: 1.9.0
- CUDA, cuDNN, NVIDIA Driver: CUDA 11
- Horovod: 0.21.3
- scikit-learn: 0.20.3

4. Amazon Web Services (AWS) AI/ML:

- TensorFlow: Versions up to 2.6
- PyTorch: Versions up to 2.0
- Apache MXNet: Version 1.8
- XGBoost: Version 1.7.6
- SageMaker AI Generic Estimator

## Part 2.
1. IBM Watson:

Compute Units Offered: IBM's watsonx platform is designed to be flexible and can be deployed across multiple cloud environments, including Microsoft Azure and AWS. This flexibility allows users to leverage the GPU offerings of these cloud providers when deploying watsonx services.
- NVIDIA H100
- NVIDIA L40S
- NVIDIA L4
- NVIDIA P100
- NVIDIA T4


2. Google Cloud AI Platform:

Compute Units Offered: Google Cloud offers a range of GPUs, including NVIDIA's A100, V100, P100, and T4 Tensor Core GPUs, to cater to various machine learning and deep learning workloads. These GPUs are available across different machine types and can be customized based on performance requirements.
- NVIDIA A100
- NVIDIA V100
- NVIDIA P100
- NVIDIA T4
- NVIDIA P4
- NVIDIA K80
- Tensor Processing Units (TPUs)


3. Microsoft Azure AI:

Compute Units Offered: Azure provides a variety of GPU options, such as NVIDIA's A100, V100, P100, and K80 GPUs, through its NC, ND, and NV series virtual machines. These VMs are tailored for different AI and machine learning tasks, offering flexibility in performance and cost.
- NVIDIA A100
- NVIDIA V100
- NVIDIA T4
- NVIDIA P100
- NVIDIA K80
- Field-Programmable Gate Arrays (FPGAs)


4. Amazon Web Services (AWS) AI/ML:

Compute Units Offered: AWS offers a comprehensive selection of GPU instances, including NVIDIA's A100, V100, K80, and T4 Tensor Core GPUs, through its P4, P3, P2, and G4 instance families. Additionally, AWS has developed its own Trainium chips, designed specifically for high-performance AI training workloads, providing an alternative to traditional GPUs.
- NVIDIA A100
- NVIDIA V100 (P3 instances)
- NVIDIA T4 (G4 instances)
- NVIDIA K80
- AWS Trainium chips
- NVIDIA H100








## Part 3.

**IBM Watson (watsonx.ai):**
- IBM Watson Studio
- Watson Machine Learning
- ModelOps (for deployment and monitoring)
- AI Factsheets (for model documentation and governance)

**Google Cloud AI Platform (Vertex AI):**
- Vertex AI Model Registry
- Vertex AI Pipelines
- Vertex AI Monitoring
- Vertex Explainable AI

**Microsoft Azure AI:**
- Azure Machine Learning Studio
- Azure ML Pipelines
- Azure ML Model Registry
- Azure Monitor (for deployment and monitoring)

**Amazon Web Services (AWS) AI/ML:**
- Amazon SageMaker Model Registry
- Amazon SageMaker Pipelines
- Amazon SageMaker Clarify (for bias detection)
- Amazon SageMaker Model Monitor


## Part 4.


**IBM Watson (watsonx.ai):**
- Application logs accessible through Watson Studio and Watson Machine Learning.
- Resource usage monitoring for GPU, CPU, and memory via integrated dashboards.
- Alerts and notifications for anomalies in resource consumption or application performance.

**Google Cloud AI Platform (Vertex AI):**
- Application logs available via Google Cloud Logging.
- Resource usage (GPU, CPU, memory) monitoring through Google Cloud Monitoring.
- Customizable dashboards and alerts to track model and infrastructure performance.

**Microsoft Azure AI:**
- Application logs managed through Azure Monitor.
- Resource usage monitoring for GPU, CPU, and memory integrated with Azure Insights.
- Automated alerts and visualizations for performance metrics via Azure Portal.

**Amazon Web Services (AWS) AI/ML:**
- Application logs accessible via Amazon CloudWatch Logs.
- GPU, CPU, and memory usage monitoring provided through Amazon CloudWatch.
- Custom alerts and dashboards for detailed performance and resource tracking.

## Part 5.
### Visualization of Training Metrics by Major ML Cloud Platforms

**IBM Watson (watsonx.ai):**
- Performance metrics like accuracy, loss, and throughput visualized in real-time via Watson Studio.
- Customizable dashboards to monitor training progress and analyze key metrics.

**Google Cloud AI Platform (Vertex AI):**
- Metrics such as accuracy, precision, recall, and loss visualized during training through TensorBoard integration.
- Vertex AI Pipelines provide logs and visual summaries of model performance throughout the training process.

**Microsoft Azure AI:**
- Training metrics (e.g., accuracy and loss) visualized in real-time through Azure ML Studio dashboards.
- Support for TensorBoard to track detailed model training metrics and throughput.

**Amazon Web Services (AWS) AI/ML:**
- Metrics like accuracy, loss, and throughput visualized via Amazon SageMaker Debugger.
- Integration with TensorBoard for detailed real-time tracking of training progress and performance metrics.

## Part 6.



* Example Training Job: Image Classification
  - Task: Train a ResNet50 model on the CIFAR-10 dataset.
  - Configuration:
    - Dataset: cifar10
    - Framework: TensorFlow
    - Batch size: 32
    - Epochs: 10
    - Learning rate: 0.001
    - Compute: 1 NVIDIA Tesla V100 GPU
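As a quick sanity check on this configuration, the number of optimizer steps follows from the dataset size. The sketch below assumes the standard CIFAR-10 training split of 50,000 images and that the final partial batch is kept rather than dropped.

```python
import math

TRAIN_IMAGES = 50_000  # standard CIFAR-10 training split
BATCH_SIZE = 32
EPOCHS = 10

steps_per_epoch = math.ceil(TRAIN_IMAGES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
# Size of the final, partial batch in each epoch.
last_batch_size = TRAIN_IMAGES - (steps_per_epoch - 1) * BATCH_SIZE

print(steps_per_epoch)  # 1563
print(total_steps)      # 15630
print(last_batch_size)  # 16
```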



1. IBM Watson (watsonx.ai)

File Format: YAML

```yaml
training_job:
  name: "image-classification-job"
  framework: "tensorflow"
  framework_version: "2.14.1"
  runtime: "python3.10"
  compute:
    gpu: 1  # one GPU, matching the example job
    memory: "16GB"
  data:
    source: "cos://bucket-name/dataset/"
    target: "/mnt/dataset"
  hyperparameters:
    batch_size: 32
    epochs: 10
    learning_rate: 0.001
  output:
    model_path: "cos://bucket-name/model-output/"
```


2. Google Cloud AI Platform (Vertex AI)

File Format: JSON

```json
{
  "jobId": "image-classification-job",
  "trainingInput": {
    "scaleTier": "CUSTOM",
    "masterConfig": {
      "imageUri": "tensorflow:2.14.1",
      "acceleratorConfig": {
        "count": 1,
        "type": "NVIDIA_TESLA_V100"
      }
    },
    "workerConfig": {
      "replicaCount": 2,
      "machineType": "n1-standard-4"
    },
    "args": ["--batch_size=32", "--epochs=10", "--learning_rate=0.001"],
    "region": "us-central1"
  },
  "inputDataConfig": {
    "uri": "gs://bucket-name/dataset/"
  },
  "outputDataConfig": {
    "uri": "gs://bucket-name/model-output/"
  }
}
```


3. Microsoft Azure AI

File Format: YAML

```yaml
job:
  name: "image-classification-job"
  environment: "AzureML-TensorFlow-2.14.1"
  compute: "gpu-cluster"
  resources:
    gpu: 1  # one GPU, matching the example job
    memory: "16GB"
  input_data:
    - id: "dataset"
      path: "azureml://datastore/dataset/"
  hyperparameters:
    batch_size: 32
    epochs: 10
    learning_rate: 0.001
  output_data:
    model: "azureml://datastore/model-output/"
```


4. AWS SageMaker

File Format: JSON

```json
{
  "TrainingJobName": "image-classification-job",
  "AlgorithmSpecification": {
    "TrainingImage": "tensorflow:2.14.1",
    "TrainingInputMode": "File"
  },
  "ResourceConfig": {
    "InstanceType": "ml.p3.2xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 50
  },
  "HyperParameters": {
    "batch_size": "32",
    "epochs": "10",
    "learning_rate": "0.001"
  },
  "InputDataConfig": [
    {
      "ChannelName": "training",
      "DataSource": {
        "S3DataSource": {
          "S3Uri": "s3://bucket-name/dataset/",
          "S3DataType": "S3Prefix"
        }
      }
    }
  ],
  "OutputDataConfig": {
    "S3OutputPath": "s3://bucket-name/model-output/"
  }
}
```
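The four configurations above carry the same hyperparameters in slightly different shapes: typed values in the YAML files, command-line style `args` in the Vertex AI job, and string values in the SageMaker `HyperParameters` map. A small sketch of deriving those shapes from one shared dictionary might look like this; the helper names are hypothetical and nothing here calls a cloud API.

```python
from typing import Dict, List, Union

Number = Union[int, float]

# Shared settings for the example training job.
hyperparameters: Dict[str, Number] = {
    "batch_size": 32,
    "epochs": 10,
    "learning_rate": 0.001,
}


def as_cli_args(params: Dict[str, Number]) -> List[str]:
    """Render hyperparameters as --key=value arguments, as in the Vertex AI job."""
    return [f"--{key}={value}" for key, value in params.items()]


def as_sagemaker_hyperparameters(params: Dict[str, Number]) -> Dict[str, str]:
    """Render hyperparameters as a string-to-string map, as in the SageMaker job."""
    return {key: str(value) for key, value in params.items()}


print(as_cli_args(hyperparameters))
print(as_sagemaker_hyperparameters(hyperparameters))
```

Keeping a single source of truth for these values makes it less likely that the platform-specific files drift apart.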
