# Reinforcement Learning

CREDIT: This notebook was inspired by [this Kaggle notebook on Q-learning](https://www.kaggle.com/code/kemquiros/q-learning-using-btc-dataset/notebook)

## Imports

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os

## Dataset

### Observe the dataset

As usual, try to find out what you can about this dataset. [Here is the origin of the dataset](https://www.kaggle.com/datasets/billqi/binance-bitcoin-futures-price-10s-intervals).

In [9]:
prices = np.loadtxt('prices_btc_Jan_11_2020_to_May_22_2020.txt', dtype=float)

## Reinforcement Learning

We are going to define an environment from scratch. The objective is to make the right decisions so as to increase the amount of money you hold.

### Functions to buy, sell and wait

Following the hints in the comments, implement the functions below. These are the actions the agent can take.

In [16]:
def buy(btc_price, btc, money):
    # if there is money
    # spend all money to acquire bitcoin
    return btc, money


def sell(btc_price, btc, money):
    # if there is bitcoin
    # exchange all bitcoin for money
    return btc, money


def wait(btc_price, btc, money):
    # do nothing
    return btc, money
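
One possible way to fill in these stubs is sketched below. It assumes an all-in strategy with no trading fees and with fractional bitcoin allowed, none of which the notebook specifies. The `_sketch` suffix marks these names as illustrative, not as the expected solution.

```python
def buy_sketch(btc_price, btc, money):
    # Assumption: no fees, and fractional bitcoin can be bought.
    if money > 0:
        btc += money / btc_price  # convert all money at the current price
        money = 0
    return btc, money


def sell_sketch(btc_price, btc, money):
    # Assumption: no fees.
    if btc > 0:
        money += btc * btc_price  # convert all bitcoin at the current price
        btc = 0
    return btc, money


def wait_sketch(btc_price, btc, money):
    # Holdings are unchanged.
    return btc, money


# Buy at a price of 50, then sell at a price of 100: the money doubles.
demo_btc, demo_money = buy_sketch(50.0, 0, 100)
demo_btc, demo_money = sell_sketch(100.0, demo_btc, demo_money)
print(demo_btc, demo_money)
```

Note that buying with no money, or selling with no bitcoin, leaves the holdings untouched, so every action is always legal for the agent.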

### Create actions, states tables

We now create the lookup tables for actions and states. A state is a time step, i.e. an index into `prices`, so each state corresponds to one observed bitcoin price.

In [17]:
np.random.seed(1)

# set of actions that the user could do
actions = { 'buy' : buy, 'sell': sell, 'wait' : wait}

actions_to_number = { 'buy' : 0, 'sell' : 1, 'wait' : 2 }
number_to_actions = { k:v for (k,v) in enumerate(actions_to_number) }

number_actions = len(actions_to_number.keys())
number_states = len(prices)

# reference table for our agent to select the best action based on the q-value, initialized randomly
q_table = np.random.rand(number_states, number_actions)

### Functions to get rewards and act upon action

In [18]:
def get_reward(before_btc, btc, before_money, money):
    # if there is bitcoin, and if the amount of bitcoin increased
    # the reward should be 1

    # if there is money, and the amount of money increased
    # the reward should also be 1

    # if neither increased, the reward should be zero
    return reward
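
A minimal sketch of the reward rule described in the comments, read literally. The name `get_reward_sketch` is hypothetical, and other readings of the comments are possible.

```python
def get_reward_sketch(before_btc, btc, before_money, money):
    reward = 0
    if btc > 0 and btc > before_btc:
        # We hold bitcoin, and more of it than before the action.
        reward = 1
    elif money > 0 and money > before_money:
        # We hold money, and more of it than before the action.
        reward = 1
    return reward


print(get_reward_sketch(0, 2.0, 100, 0))    # bitcoin increased
print(get_reward_sketch(0, 0, 100, 100))    # nothing changed
```

Under this literal reading, any action that increases either holding earns a reward of 1, regardless of the price at which the trade happened. This is worth keeping in mind when interpreting the learning curves.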

In the following cell, implement the `choose_action` function with an epsilon-greedy policy. The function returns the number of the action to take, using the encoding in `actions_to_number`.

In [None]:
def choose_action(state):
    # with probability epsilon, choose a random action uniformly
    # with probability 1-epsilon, choose the action that maximises q 
    return ...
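
The epsilon-greedy rule can be sketched as follows. This version is an illustration only: it takes one row of Q-values and the exploration rate as explicit arguments (the notebook's `choose_action(state)` would read them from `q_table` and from an exploration rate defined elsewhere), and it uses the standard-library `random` module so that it runs on its own.

```python
import random


def choose_action_sketch(q_row, eps, rng=random):
    """Return an action index given the Q-values of a single state."""
    n_actions = len(q_row)
    if rng.random() < eps:
        # Explore: every action is equally likely.
        return rng.randrange(n_actions)
    # Exploit: index of the largest Q-value (the first one on ties).
    return max(range(n_actions), key=lambda a: q_row[a])


# With eps=0 the choice is always greedy.
print(choose_action_sketch([0.1, 0.9, 0.3], eps=0.0))
```

With a NumPy Q-table, the greedy branch could equally be written as `np.argmax(q_table[state])`.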

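Now, implement a function that uses the dictionaries defined above (`number_to_actions` and `actions`) to execute the action with the given number at the bitcoin price of the given state.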

In [None]:
def take_action(state, action):
    return ...
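
The dispatch from an action number to an action function might look like the sketch below. All `demo_` names are hypothetical stand-ins so the block runs on its own. Passing the holdings explicitly is also an assumption: the notebook's `take_action(state, action)` takes only two arguments, so a real implementation would have to read the current holdings from elsewhere, for example from the global `theta`.

```python
def demo_buy(price, btc, money):
    return (btc + money / price, 0) if money > 0 else (btc, money)


def demo_sell(price, btc, money):
    return (0, money + btc * price) if btc > 0 else (btc, money)


def demo_wait(price, btc, money):
    return btc, money


# Hypothetical stand-ins for `prices`, `actions` and `number_to_actions`.
demo_prices = [50.0, 100.0]
demo_actions = {'buy': demo_buy, 'sell': demo_sell, 'wait': demo_wait}
demo_number_to_actions = dict(enumerate(demo_actions))  # {0: 'buy', 1: 'sell', 2: 'wait'}


def take_action_sketch(state, action, holdings):
    btc, money = holdings
    name = demo_number_to_actions[action]  # action number -> action name
    price = demo_prices[state]             # state -> bitcoin price at that step
    return demo_actions[name](price, btc, money)


print(take_action_sketch(0, 0, (0, 100)))  # buy at the first price
```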

In this next cell, we define a function that takes an action, updates the amounts of money, and returns the new state and reward.

In [19]:
def act(state, action, theta):
    btc, money = theta
    
    done = False
    new_state = state + 1
    
    before_btc, before_money = btc, money
    btc, money = take_action(state, action)
    theta = btc, money
    
    reward = get_reward(before_btc, btc, before_money, money)
    
    if new_state == number_states:
        done = True
    
    return new_state, reward, theta, done

### Training the Q table

In [20]:
reward = 0
btc = 0
money = 100

theta = btc, money

In [21]:
# exploration rate (epsilon) for the epsilon-greedy policy
eps = 0.3

n_episodes = 5
min_alpha = 0.02

# learning rate for Q learning
alphas = np.linspace(1.0, min_alpha, n_episodes)

# discount factor, used to balance immediate and future reward
gamma = 1.0

#### Steps for Q-learning

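Here are the 3 basic steps:

- The agent starts in state 0, takes an action, and receives a reward.
- The agent selects each action either greedily, by taking the action with the highest value in the Q-table for the current state, or at random with probability epsilon (ε).
- The Q-values are updated after each action.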

Complete the cell underneath to update the q table (refer to the formula we saw in class!).

The training process can be pretty long, so do not be alarmed!
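
For reference, the standard tabular Q-learning update is $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$. Whether this is exactly the form seen in class is an assumption. The sketch below computes the new value for a single entry, and `q_update_sketch` and `demo_q` are hypothetical names.

```python
def q_update_sketch(q_table, state, action, reward, next_state, alpha, gamma):
    # Best value attainable from the next state under the current estimates.
    best_next = max(q_table[next_state])
    # Temporal-difference target and error.
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[state][action]
    # Move the old estimate a fraction alpha towards the target.
    return q_table[state][action] + alpha * td_error


demo_q = [[0.0, 0.0, 0.0], [0.5, 1.0, 0.2]]
demo_q[0][1] = q_update_sketch(demo_q, 0, 1, reward=1, next_state=1, alpha=0.5, gamma=1.0)
print(demo_q[0][1])  # 1.0
```

The function indexes the table as `q_table[state][action]`, so it works on a list of lists as well as on a NumPy array.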

In [None]:
rewards = {}

for e in range(n_episodes):
    
    total_reward = 0
    
    state = 0
    done = False
    alpha = alphas[e]
    
    while not done:

        action = choose_action(state)
        next_state, reward, theta, done = act(state, action, theta)
        
        total_reward += reward
        
        if(done):
            rewards[e] = total_reward
            print(f"Episode {e + 1}: total reward -> {total_reward}")
            break
        
        q_table[state][action] = ... 

        state = next_state

### Learning Analysis

In the cell below, plot the total reward as a function of the episode. What do you observe?

In [22]:
# Your code here

### Plot the Results

The following cells let you visualize the decisions taken during the learning process.

Find ways to interpret the results you obtained. Was your algorithm useful?

In [None]:
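state = 0
acts = np.zeros(number_states)
done = False
total_reward = 0  # reset so this run is not added to the last training episode's total

while not done:

    # choose_action is still epsilon-greedy here, so some actions are random
    action = choose_action(state)
    next_state, reward, theta, done = act(state, action, theta)

    # record the action taken at this time step
    acts[state] = action

    total_reward += reward

    if done:
        break

    state = next_state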

In [None]:
buys_idx = np.where(acts == 0)
wait_idx = np.where(acts == 2)
sell_idx = np.where(acts == 1)

In [None]:
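plt.figure(figsize=(15,15))
# one marker per time step, coloured by the action taken at that step
plt.plot(buys_idx[0], prices[buys_idx], 'bo', markersize=2, label='buy')
plt.plot(sell_idx[0], prices[sell_idx], 'ro', markersize=2, label='sell')
plt.plot(wait_idx[0], prices[wait_idx], 'yo', markersize=2, label='wait')
plt.xlabel('time step')
plt.ylabel('BTC price')
plt.legend()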