### **Reinforcement Learning**

#### Terminology

- states
- actions
- rewards
- discount factor (gamma or g)
- return
	- R1 + g\*R2 + g^2\*R3 + g^3\*R4 + ...
- policy (pi)
	- pi(state) => action
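
The return formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `discounted_return` and the example reward sequence are assumptions made for illustration, not part of the notes.

```python
def discounted_return(rewards, gamma):
    """Return R1 + gamma*R2 + gamma^2*R3 + ... for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Hypothetical episode: three zero rewards, then 100 on the final step.
example_rewards = [0, 0, 0, 100]
print(discounted_return(example_rewards, gamma=0.5))  # 0.5**3 * 100 = 12.5
```

With a discount factor below 1, a reward that arrives later contributes less to the return than the same reward arriving sooner.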

#### Optimal Q function

- Q(s, a) = Return if you
	- start in state s
	- take action a (once)
	- then behave optimally after that
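- Best possible return from state s is the maximum of Q(s, a) over all actions a
- Best possible action in state s is the action a that gives that maximum
- Bellman equation
	- Q(s, a) = R(s, a) + g * max Q(s', a'), where s' is the state reached by taking action a in state s and the max is over all actions a' available in s'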
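
One way to see the Bellman equation at work is to apply it repeatedly to a small table until the values stop changing. The sketch below assumes a toy environment of six states on a line, where action 0 moves left, action 1 moves right, states 0 and 5 are terminal, and the reward depends only on the current state (so R(s, a) reduces to R(s)). The variable names are illustrative.

```python
import numpy as np

# Assumed toy environment: 6 states on a line, action 0 = left, action 1 = right.
# States 0 and 5 are terminal; the reward depends only on the current state.
rewards = np.array([100.0, 0.0, 0.0, 0.0, 0.0, 40.0])
gamma = 0.5
num_states, num_actions = 6, 2
terminal = (0, num_states - 1)

q = np.zeros((num_states, num_actions))
for s in terminal:
    q[s, :] = rewards[s]  # nothing follows a terminal state, so Q is just its reward

# Apply the Bellman equation repeatedly until the table stops changing.
for sweep in range(100):
    new_q = q.copy()
    for s in range(1, num_states - 1):
        for a in range(num_actions):
            s_next = s - 1 if a == 0 else s + 1
            new_q[s, a] = rewards[s] + gamma * np.max(q[s_next])
    if np.allclose(new_q, q):
        break
    q = new_q

print(q)
```

Because this version knows the transitions and rewards up front, it needs no exploration. Q-learning reaches the same kind of table from sampled experience instead.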

#### Refinements to Algorithm: Mini Batching and Soft Updates

- Instead of running every training step of a supervised neural network over the whole dataset (say 1,000,000 examples), train on a small random mini-batch (say 1,000 examples) at each step to reduce the training time.


- Instead of completely overwriting the neural network's parameters on every iteration, use a small rate (like 0.001) so that the parameters change gradually, which tends to make training more stable.

Wrong way
- w = w_new

Correct way
- w = w + learning_rate * change_w  (change_w is w_new - w for a soft update, or -dJ/dw for a gradient descent step)
- w = w + learning_rate * (w_new - w)
- w = learning_rate * w_new + (1 - learning_rate) * w
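
Both refinements can be sketched in a few lines of NumPy. The helper names `sample_minibatch` and `soft_update`, the parameter name `tau` (standing in for the small rate in the formulas above), and the random data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_minibatch(features, targets, batch_size):
    """Pick batch_size random rows (without replacement) from the full dataset."""
    idx = rng.choice(len(features), size=batch_size, replace=False)
    return features[idx], targets[idx]

def soft_update(w, w_new, tau):
    """Move w a fraction tau of the way toward w_new instead of overwriting it."""
    return tau * w_new + (1.0 - tau) * w

# Hypothetical data: 10,000 examples with 4 features each.
features = rng.normal(size=(10_000, 4))
targets = rng.normal(size=10_000)
batch_x, batch_y = sample_minibatch(features, targets, batch_size=1000)
print(batch_x.shape, batch_y.shape)  # (1000, 4) (1000,)

w = np.array([1.0, 2.0])
w_new = np.array([3.0, 0.0])
print(soft_update(w, w_new, tau=0.01))  # [1.02 1.98]
```

Setting `tau=1.0` reproduces the "wrong way" (a full overwrite), while a small `tau` keeps most of the old parameters on each step.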

#### **Q Learning**

In [5]:
import numpy as np
import random

In [129]:
# Define the environment
num_states = 6
num_actions = 2
q_table = np.zeros((num_states, num_actions))

# Hyperparameters
learning_rate = 0.8
discount_factor = 0.5
exploration_prob = 0.2
num_episodes = 1000

# Define the reward for each state (R); only the terminal states 0 and 5 pay out
rewards = np.array([100, 0, 0, 0, 0, 40])

In [130]:
# Q-learning algorithm

# Terminal states: both actions are worth the state's own reward
q_table[0] = rewards[0]
q_table[num_states - 1] = rewards[num_states - 1]

for episode in range(num_episodes):
	state = random.randint(0, num_states - 1)
	
	while state not in [0, num_states - 1]:  # Continue until reaching a terminal state
		# Exploration vs. Exploitation
		if random.uniform(0, 1) < exploration_prob:
			action = random.randint(0, num_actions - 1)
		else:
			action = np.argmax(q_table[state, :])

		# Take the action: 0 moves one state to the left, 1 moves one state to the right
		next_state = (state - 1) if action == 0 else (state + 1)
		reward = rewards[state]

		# Update Q-value using the Q-learning formula
		td_target = reward + discount_factor * np.max(q_table[next_state, :])
		q_table[state, action] += learning_rate * (td_target - q_table[state, action])

		state = next_state

In [131]:
# Print the learned Q-table
print("Learned Q-table:")
print(q_table)

# Testing the learned policy
current_state = 3
path = [current_state]

while current_state not in [0, num_states-1]:
	action = np.argmax(q_table[current_state, :])
	current_state = (current_state - 1) if action == 0 else (current_state + 1)
	path.append(current_state)

print("Optimal path:", path)


Learned Q-table:
[[100.   100.  ]
 [ 50.    12.5 ]
 [ 25.     6.25]
 [ 12.5   10.  ]
 [  6.25  20.  ]
 [ 40.    40.  ]]
Optimal path: [3, 2, 1, 0]
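
The printed Q-table also gives the greedy action for every state, not just for the path starting at state 3. The sketch below copies the table from the output above. The labels "left" and "right" are assumed names for action 0 and action 1, based on how `next_state` is computed.

```python
import numpy as np

# Q-table copied from the printed output above.
learned_q = np.array([
    [100.0, 100.0],
    [50.0, 12.5],
    [25.0, 6.25],
    [12.5, 10.0],
    [6.25, 20.0],
    [40.0, 40.0],
])

action_names = ["left", "right"]  # assumed meaning of action 0 and action 1
greedy_actions = np.argmax(learned_q, axis=1)

# States 0 and 5 are terminal, so only the interior states need an action.
for state in range(1, len(learned_q) - 1):
    best = greedy_actions[state]
    print(f"state {state}: go {action_names[best]} (Q = {learned_q[state, best]})")
```

States 1 to 3 prefer moving left toward the reward of 100, while state 4 is close enough to the reward of 40 that moving right is better under a discount factor of 0.5.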


#### **Deep Q Learning**

- Train a deep neural network to approximate the optimal Q function when the state is described by continuous-valued features, so a lookup table is no longer practical.
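
As a rough illustration, a Q-network takes a state vector as input and outputs one Q estimate per action, and the Bellman equation supplies the training targets. The sketch below uses plain NumPy and shows only the forward pass and the target computation, with no gradient step. The layer sizes, the `done` flag for terminal transitions, and all names are assumptions, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 continuous state features, 4 discrete actions, 64 hidden units.
state_dim, hidden_dim, num_actions = 8, 64, 4
params = {
    "W1": rng.normal(scale=0.1, size=(state_dim, hidden_dim)),
    "b1": np.zeros(hidden_dim),
    "W2": rng.normal(scale=0.1, size=(hidden_dim, num_actions)),
    "b2": np.zeros(num_actions),
}

def q_values(params, states):
    """Forward pass: a batch of state vectors in, one Q estimate per action out."""
    hidden = np.maximum(0.0, states @ params["W1"] + params["b1"])  # ReLU
    return hidden @ params["W2"] + params["b2"]

def bellman_targets(params, rewards, next_states, done, gamma):
    """Targets y = R + gamma * max_a' Q(s', a'), with no bootstrap at terminal states."""
    best_next = q_values(params, next_states).max(axis=1)
    return rewards + gamma * best_next * (1.0 - done)

# Random stand-in batch of 5 transitions.
states = rng.normal(size=(5, state_dim))
next_states = rng.normal(size=(5, state_dim))
rewards = rng.normal(size=5)
done = np.array([0.0, 0.0, 1.0, 0.0, 1.0])

print(q_values(params, states).shape)  # (5, 4)
targets = bellman_targets(params, rewards, next_states, done, gamma=0.99)
print(targets.shape)  # (5,)
```

A full training loop would then fit the network's output for the action actually taken to these targets, typically on mini-batches and with soft updates as described earlier.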


Supervised (classification, regression), unsupervised (clustering, anomaly detection), CNN, computer vision, RNN (NLP, sentiment), reinforcement
