# K-armed Bandit Problem
The K-armed bandit problem is a classic problem in reinforcement learning and decision theory. It is a simplified model of the exploration-exploitation trade-off, where an agent must choose between different actions to maximize its total reward over time.
### Introduction
In the K-armed bandit problem, an agent chooses among K different actions (or arms). Each action yields a reward drawn from a probability distribution specific to that action. The agent's goal is to maximize the total reward over a series of trials by learning which actions pay off best and choosing them more often.
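
The goal above can be written compactly. The symbols below are introduced here for illustration only and are not part of the original text: $A_t$ is the arm chosen at step $t$, $R_t$ is the reward received, $T$ is the number of trials, and $q_*(a)$ is the unknown expected reward of arm $a$.

$$
q_*(a) = \mathbb{E}\left[R_t \mid A_t = a\right], \qquad a \in \{1, \dots, K\}
$$

$$
\text{maximize} \quad \mathbb{E}\left[\sum_{t=1}^{T} R_t\right]
$$

If the agent knew every $q_*(a)$, it would simply pull $\arg\max_a q_*(a)$ at each step. Because these values are unknown, it has to estimate them from the rewards it observes.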
### Defining K-armed
1. **K**: The number of different actions (or arms) available to the agent.
2. **Arms**: Each arm represents a different action that the agent can take. Each arm has an associated reward distribution.
---

### How it works
1. **Initialization**: The agent starts with no knowledge about the reward distributions of the arms.
2. **Action selection**: At each time step, the agent selects one of the K arms to pull based on a strategy or policy.
3. **Reward observation**: The agent observes the reward obtained from the selected arm.
4. **Update**: The agent updates its knowledge about the reward distribution of the selected arm based on the observed reward.
5. **Repeat**: Steps 2-4 are repeated for a specified number of trials or until a stopping condition is met.
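
The update in step 4 can take many forms. One common choice, and the one used in the example code later in this document, is the incremental sample average: after the n-th reward from an arm, move the estimate toward that reward by a fraction 1/n. The sketch below checks that this rule reproduces the ordinary mean without storing past rewards. The function name `incremental_mean`, the seed, and the reward stream are assumptions made for illustration.

```python
import random
from statistics import fmean


def incremental_mean(rewards):
    """Apply estimate += (reward - estimate) / n for each reward in turn."""
    estimate = 0.0
    for n, reward in enumerate(rewards, start=1):
        estimate += (reward - estimate) / n
    return estimate


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
rewards = [rng.gauss(1.0, 1.0) for _ in range(1000)]  # hypothetical rewards from one arm

# The two values should agree up to floating-point rounding.
print(incremental_mean(rewards), fmean(rewards))
```

Because only the current estimate and a count are kept per arm, memory use stays constant no matter how many trials are run.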

# Exploration vs Exploitation
The main challenge in the K-armed bandit problem is balancing:
- **Exploration**: Trying out different arms to learn their reward distributions.
- **Exploitation**: Selecting the arm that is currently estimated to provide the highest reward.
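
One simple way to mix the two is the epsilon-greedy rule: with probability epsilon pick an arm uniformly at random, otherwise pick the arm with the highest current estimate. The sketch below is a minimal illustration using only the standard library. The helper `epsilon_greedy`, the `estimates` list, and the seed are hypothetical and not taken from the original text.

```python
import random


def epsilon_greedy(estimates, epsilon, rng):
    """Return a random arm with probability epsilon, else the highest-estimate arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))  # explore: any arm, uniformly
    return max(range(len(estimates)), key=lambda a: estimates[a])  # exploit


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
estimates = [0.2, 0.9, 0.5]  # hypothetical current estimates; arm 1 looks best

for epsilon in (0.0, 0.1, 1.0):
    picks = [epsilon_greedy(estimates, epsilon, rng) for _ in range(10_000)]
    share_best = picks.count(1) / len(picks)
    print(f"epsilon={epsilon}: arm 1 chosen {share_best:.1%} of the time")
```

With `epsilon=0.0` the agent never explores and always returns arm 1. With `epsilon=1.0` it ignores its estimates entirely. Values in between trade some immediate reward for information about the other arms.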
---
# Example code 

import numpy as np
class EpsilonGreedyAgent:
    def __init__(self, num_actions, epsilon=0.1):
        self.num_actions = num_actions  # Number of actions (arms of the slot machine)
        self.epsilon = epsilon          # Probability of choosing a random action (exploration)
        self.action_values = np.zeros(num_actions)  # Estimated value of each action
        self.action_counts = np.zeros(num_actions)  # Number of times each action has been selected
    def select_action(self):
        if np.random.rand() < self.epsilon:
            # Choose at random (exploration)
            action = np.random.randint(self.num_actions)
        else:
            # Choose the action with the highest estimated value (exploitation)
            action = np.argmax(self.action_values)
        return action
    def update_value(self, action, reward):
        self.action_counts[action] += 1  # Increment the selection count for this action
        self.action_values[action] += (1 / self.action_counts[action]) * (reward - self.action_values[action])
class MultiArmedBandit:
    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.true_action_values = np.random.normal(0, 1, num_arms)
    def get_reward(self, action):
        return np.random.normal(self.true_action_values[action], 1)
num_arms = 5  # Number of arms
num_steps = 10000  # Number of steps to run
agent = EpsilonGreedyAgent(num_arms)  # Create the epsilon-greedy agent
bandit = MultiArmedBandit(num_arms)  # Create the environment
total_rewards = 0  # Running total of rewards
for step in range(num_steps):
    action = agent.select_action()            # The agent selects an action
    reward = bandit.get_reward(action)        # Receive a reward from the environment
    agent.update_value(action, reward)        # Update the value estimate
    total_rewards += reward                   # Accumulate the reward
print("Total rewards obtained:", total_rewards)
print("Estimated action values:", agent.action_values)
print("Number of times each action was selected:", agent.action_counts)

Output from one run (values vary between runs because the rewards are random):

Total rewards obtained: 9571.506450417282
Estimated action values: [ 0.68105166  0.80288199  0.56973275 -0.02326379  1.00141859]
Number of times each action was selected: [ 216.  376.  227.  196. 8985.]
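
The printed numbers can be sanity-checked with a little arithmetic. Once the estimates have settled, an epsilon-greedy agent picks its greedy arm with probability (1 - epsilon) + epsilon / K, since a random pick can also land on that arm. The sketch below compares that figure with the counts shown above. The variable names are chosen here for illustration, and the counts and total are copied from the output of that single run.

```python
epsilon = 0.1
num_arms = 5
num_steps = 10_000

# Values copied from the run shown above.
counts = [216, 376, 227, 196, 8985]
total_reward = 9571.506450417282

# Long-run share of pulls going to the greedy arm under epsilon-greedy.
expected_greedy_share = (1 - epsilon) + epsilon / num_arms
observed_share = max(counts) / sum(counts)
average_reward = total_reward / num_steps

print(round(expected_greedy_share, 4))  # 0.92
print(round(observed_share, 4))         # 0.8985
print(round(average_reward, 4))         # 0.9572
```

The observed share is somewhat below 0.92, which is plausibly due to early steps in which the estimates had not yet settled and another arm looked best. The average reward per step is close to the estimated value of the most-pulled arm (about 1.0). The true arm values are not printed in this run, so whether that arm is actually the best one cannot be confirmed from the output alone.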
