# &#x1F4D1; &nbsp; <span style="color:#338DD4"> Reflections. Machine Learning for Trading. Lessons 3</span>

##  &#x1F4CA; &nbsp; Links

A Tour of Machine Learning Algorithms: http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

Deep Learning for Multivariate Financial Time Series: http://www.math.kth.se/matstat/seminarier/reports/M-exjobb15/150612a.pdf

Parametric and Nonparametric Machine Learning Algorithms: http://machinelearningmastery.com/parametric-and-nonparametric-machine-learning-algorithms/

Nearest Neighbors: http://scikit-learn.org/stable/modules/neighbors.html

Kernel ridge regression: http://scikit-learn.org/stable/modules/kernel_ridge.html

Cross-validation: evaluating estimator performance: http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics

Cross-validation for time series: https://www.r-bloggers.com/cross-validation-for-time-series/

Ensemble methods: http://scikit-learn.org/stable/modules/ensemble.html

How to Build an Ensemble Of Machine Learning Algorithms in R (ready to use boosting, bagging and stacking): http://machinelearningmastery.com/machine-learning-ensembles-with-r/

Ensemble Machine Learning Algorithms in Python with scikit-learn: http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/

Reinforcement Learning Toolkit: http://incompleteideas.net/rlai.cs.ualberta.ca/RLAI/RLtoolkit/RLtoolkit1.0.html

Deep Learning - The Past, Present and Future of Artificial Intelligence: https://www.slideshare.net/LuMa921/deep-learning-the-past-present-and-future-of-artificial-intelligence

Reinforcement Learning in Online Stock Trading Systems: https://pdfs.semanticscholar.org/be8e/61fe568712c799219fb612d190b4e62642ae.pdf

Creating a Planning and Learning Algorithm: http://burlap.cs.brown.edu/tutorials/cpl/p3.html

UC Berkeley CS188 Project 3 Reinforcement Learning: http://ai.berkeley.edu/reinforcement.html

Integrating Planning, Acting, and Learning (Dyna): https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node96.html

Q-Learning Step-By-Step Tutorial: http://mnemstudio.org/path-finding-q-learning-tutorial.htm

UCL Course on RL: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html

MathQuill: http://mathquill.com/

### 03-01 How Machine Learning is used at a hedge fund

Examples of good inputs (predictive factors) are:
    
1. Price momentum
2. Bollinger value
3. Current price

while examples of outputs would be:
    
1. Future price
2. Future return
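As a rough illustration of how such inputs and outputs could be computed from a price series, here is a minimal sketch. The function names, the window length `n`, and the sample prices are assumptions made for this example; the Bollinger value is taken as the distance from the simple moving average in units of two standard deviations.

```python
from statistics import mean, pstdev

def momentum(prices, t, n):
    """n-day momentum at day t: relative price change over the last n days."""
    return prices[t] / prices[t - n] - 1.0

def bollinger_value(prices, t, n):
    """Distance of the price from its n-day moving average, in units of 2 standard deviations."""
    window = prices[t - n + 1 : t + 1]
    return (prices[t] - mean(window)) / (2.0 * pstdev(window))

def future_return(prices, t, n):
    """The label to predict: relative price change over the next n days."""
    return prices[t + n] / prices[t] - 1.0

# Made-up prices, for illustration only
prices = [100.0, 102.0, 101.0, 105.0, 107.0, 110.0, 108.0]
t, n = 3, 3
x = [momentum(prices, t, n), bollinger_value(prices, t, n), prices[t]]  # inputs (factors)
y = future_return(prices, t, n)                                         # output (label)
```

Each day `t` with enough history and enough future data yields one `(x, y)` training pair.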


### Algorithms
#### By Learning Style:

- ***Supervised Learning ( Labeled Data → Direct feedback → Predict outcome/future )***
 
Input data is called training data and has a known label or result such as spam/not-spam or a stock price at a time.

A model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.

Example problems are classification and regression.

Example algorithms include Logistic Regression and the Back Propagation Neural Network.

- ***Unsupervised Learning ( No labels → No feedback → Find hidden structure )***

Input data is not labeled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general rules. It may be through a mathematical process to systematically reduce redundancy, or it may be to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and k-Means.

- ***Semi-Supervised (Mixed) Learning***

Input data is a mixture of labeled and unlabelled examples.

There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions about how to model the unlabeled data.

- ***Reinforcement Learning ( Reward System → Decision Process → Learn series of actions )***

Input data is a set of states, actions and a reward function which is affected by states and actions.

A model is trained to find actions with maximum values of the reward function. 

Example problems are game playing and control problems. Canonical Example: Grid World.

Example frameworks and algorithms include Markov Decision Processes (the problem formulation), Temporal Difference (TD) learning, Q-learning, etc.

#### By Similarity:

- ***Regression Algorithms***

Regression is concerned with modeling the relationship between variables that is iteratively refined using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning. This may be confusing because we can use regression to refer to the class of problem and the class of algorithm. Really, regression is a process.

The most popular regression algorithms are:

      - Ordinary Least Squares Regression (OLSR)
      - Linear Regression
      - Logistic Regression
      - Stepwise Regression
      - Multivariate Adaptive Regression Splines (MARS)
      - Locally Estimated Scatterplot Smoothing (LOESS)    


 - ***Instance-based Algorithms***

An instance-based learning model is a decision problem built on the instances or examples of training data that are deemed important or required for the model.

Such methods typically build up a database of example data and compare new data to the database using a similarity measure in order to find the best match and make a prediction. For this reason, instance-based methods are also called winner-take-all methods and memory-based learning. Focus is put on the representation of the stored instances and similarity measures used between instances.

The most popular instance-based algorithms are:

      - k-Nearest Neighbor (kNN)
      - Learning Vector Quantization (LVQ)
      - Self-Organizing Map (SOM)
      - Locally Weighted Learning (LWL)
   
- ***Regularization Algorithms***

An extension made to another method (typically regression methods) that penalizes models based on their complexity, favoring simpler models that are also better at generalizing.

I have listed regularization algorithms separately here because they are popular, powerful and generally simple modifications made to other methods.

The most popular regularization algorithms are:

      - Ridge Regression
      - Least Absolute Shrinkage and Selection Operator (LASSO)
      - Elastic Net
      - Least-Angle Regression (LARS)  


- ***Decision Tree Algorithms***

Decision tree methods construct a model of decisions made based on actual values of attributes in the data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are trained on data for classification and regression problems. Decision trees are often fast and accurate and a big favorite in machine learning.

The most popular decision tree algorithms are:

      - Classification and Regression Tree (CART)
      - Iterative Dichotomiser 3 (ID3)
      - C4.5 and C5.0 (different versions of a powerful approach)
      - Chi-squared Automatic Interaction Detection (CHAID)
      - Decision Stump
      - M5
      - Conditional Decision Trees
      
- ***Bayesian Algorithms***

Bayesian methods are those that explicitly apply Bayes' Theorem for problems such as classification and regression.

The most popular Bayesian algorithms are:

      - Naive Bayes
      - Gaussian Naive Bayes
      - Multinomial Naive Bayes
      - Averaged One-Dependence Estimators (AODE)
      - Bayesian Belief Network (BBN)
      - Bayesian Network (BN)
      
- ***Clustering Algorithms***

Clustering, like regression, describes the class of problem and the class of methods.

Clustering methods are typically organized by modeling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.

The most popular clustering algorithms are:

      - k-Means
      - k-Medians
      - Expectation Maximisation (EM)
      - Hierarchical Clustering
      
- ***Association Rule Learning Algorithms***

Association rule learning methods extract rules that best explain observed relationships between variables in data.

These rules can discover important and commercially useful associations in large multidimensional datasets that can be exploited by an organization.

The most popular association rule learning algorithms are:

      - Apriori algorithm
      - Eclat algorithm

- ***Artificial Neural Network Algorithms***

Artificial Neural Networks are models that are inspired by the structure and/or function of biological neural networks.

They are a class of pattern-matching methods commonly used for regression and classification problems, but they really form an enormous subfield comprising hundreds of algorithms and variations for all manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the massive growth and popularity in the field. Here we are concerned with the more classical methods.

The most popular artificial neural network algorithms are:

      - Perceptron
      - Back-Propagation
      - Hopfield Network
      - Radial Basis Function Network (RBFN)
      
- ***Deep Learning Algorithms***

Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap computation.

They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with semi-supervised learning problems where large datasets contain very little labeled data.

The most popular deep learning algorithms are:

      - Deep Boltzmann Machine (DBM)
      - Deep Belief Networks (DBN)
      - Convolutional Neural Network (CNN)
      - Stacked Auto-Encoders
          
- ***Dimensionality Reduction Algorithms***

Like clustering methods, dimensionality reduction seeks and exploits the inherent structure in the data, in this case in an unsupervised manner, in order to summarize or describe the data using less information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.

      - Principal Component Analysis (PCA)
      - Principal Component Regression (PCR)
      - Partial Least Squares Regression (PLSR)
      - Sammon Mapping
      - Multidimensional Scaling (MDS)
      - Projection Pursuit
      - Linear Discriminant Analysis (LDA)
      - Mixture Discriminant Analysis (MDA)
      - Quadratic Discriminant Analysis (QDA)
      - Flexible Discriminant Analysis (FDA)

- ***Ensemble Algorithms***

Ensemble methods are models composed of multiple weaker models that are independently trained and whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to combine them. This is a very powerful class of techniques and as such is very popular.

      - Boosting
      - Bootstrapped Aggregation (Bagging)
      - AdaBoost
      - Stacked Generalization (blending)
      - Gradient Boosting Machines (GBM)
      - Gradient Boosted Regression Trees (GBRT)
      - Random Forest

**Deep Learning**

Part of the machine learning field of learning representations of data. 

Exceptionally effective at learning patterns. 

Utilizes learning algorithms that derive meaning out of data by using a hierarchy of multiple layers that mimic the neural networks of our brain. 

If you provide the system tons of information, it begins to understand it and respond in useful ways.

**AI**

- Artificial Narrow Intelligence (ANI): Machine intelligence that equals or exceeds human intelligence or efficiency at a specific task. 
- Artificial General Intelligence (AGI): A machine with the ability to apply intelligence to any problem, rather than just one specific problem (human-level intelligence). 
- Artificial Superintelligence (ASI): An intellect that is much smarter than the best human brains in practically every field, including scientific creativity, general wisdom and social skills.

### 03-02 Regression

***Parametric Machine Learning Algorithms***

The algorithms involve two steps:

- Select a form for the function.
- Learn the coefficients for the function from the training data.

Assuming the functional form of a line greatly simplifies the learning process. Now, all we need to do is estimate the coefficients of the line equation and we have a predictive model for the problem.
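The two steps can be sketched for the simplest case: the chosen form is a line, and the coefficients are learned by ordinary least squares. The helper name `fit_line` and the sample data are assumptions made for this illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the form y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Step 1: the form (a line) is fixed. Step 2: learn its two coefficients.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

def predict(x):
    return slope * x + intercept
```

Once the two coefficients are known, the training data can be discarded; that is what makes the model parametric.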

Examples of parametric machine learning algorithms include:

- Logistic Regression
- Linear Discriminant Analysis
- Perceptron
- Naive Bayes
- Simple Neural Networks

Benefits of Parametric Machine Learning Algorithms:

- Simpler: These methods are easier to understand and interpret results.
- Speed: Parametric models are very fast to learn from data.
- Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

Limitations of Parametric Machine Learning Algorithms:

- Constrained: By choosing a functional form these methods are highly constrained to the specified form.
- Limited Complexity: The methods are more suited to simpler problems.
- Poor Fit: In practice the methods are unlikely to match the underlying mapping function.

***Nonparametric Machine Learning Algorithms***

Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data. As such, they are able to fit a large number of functional forms.

Examples of popular nonparametric machine learning algorithms are:

- k-Nearest Neighbors
- Decision Trees like CART and C4.5
- Support Vector Machines

Benefits of Nonparametric Machine Learning Algorithms:

- Flexibility: Capable of fitting a large number of functional forms.
- Power: No assumptions (or weak assumptions) about the underlying function.
- Performance: Can result in higher performance models for prediction.

Limitations of Nonparametric Machine Learning Algorithms:

- More data: Require a lot more training data to estimate the mapping function.
- Slower: A lot slower to train as they often have far more parameters to train.
- Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made.

*Regression (more examples):*
- Lasso, Ridge regression (Regularized Linear Regression)
- Kernel Regression
- Regression Trees, Splines, Wavelet estimators, ...

*Nearest Neighbors*

The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these. The number of samples can be a user-defined constant (k-nearest neighbor learning), or vary based on the local density of points (radius-based neighbor learning). The distance can, in general, be any metric measure: standard Euclidean distance is the most common choice. Neighbors-based methods are known as non-generalizing machine learning methods, since they simply "remember" all of their training data.


*Kernel regression* is a method for nonlinear regression in which the target value for a test point is estimated using a weighted average of the surrounding training samples. The weights are typically obtained by applying a distance-based kernel function to each of the samples, which presumes the existence of a well-defined distance metric.
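A minimal sketch of this idea for one-dimensional inputs follows. It assumes a Gaussian kernel and a hand-picked `bandwidth`; both are choices made for this example, and this form of the estimator is usually called Nadaraya-Watson regression.

```python
import math

def kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Kernel-weighted average of the training targets around x_query."""
    # Gaussian kernel: nearby samples get weights near 1, distant ones near 0
    weights = [math.exp(-0.5 * ((x_query - x) / bandwidth) ** 2) for x in x_train]
    return sum(w * y for w, y in zip(weights, y_train)) / sum(weights)

x_train = [0.0, 1.0, 2.0, 3.0, 4.0]
y_train = [0.0, 1.0, 4.0, 9.0, 16.0]
estimate = kernel_regression(x_train, y_train, 2.0, bandwidth=0.5)
```

A smaller bandwidth makes the estimate follow the nearest samples more closely; a larger one smooths over more of the data.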

In [29]:
# KNN (k-Nearest Neighbors) regression: y is continuous, so use a regressor, not a classifier
from sklearn.neighbors import KNeighborsRegressor

X = [[0], [1.5], [2], [3.5], [4], [5.5], [6], [7.5], [8], [9.5], [10], [11.5], [12], [13.5], [14], [15.5]]
y = [0, 0.2, 1.8, 1, 0.6, 1.8, 2.5, 1.3, 3.9, 4.1, 4.6, 2.7, 5.9, 4.4, 6.8, 7.5]
neigh = KNeighborsRegressor(n_neighbors=3)
neigh.fit(X, y)
# Each prediction is the mean target of the 3 nearest training points
neigh.predict([[1.9], [6.9]])

array([1.        , 2.56666667])

In [None]:
%%R
# knn() lives in the class package; it classifies the test rows directly, with no separate fit step
library(class)
predict_y <- knn(train = x_train, test = x_test, cl = y_train, k = 3)
summary(predict_y)

*Cross-validation*, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is a prediction, and one wants to estimate how accurately a predictive model will perform in practice. In a prediction problem, a model is usually given a dataset of known data on which training is run (training dataset), and a dataset of unknown data (or first seen data) against which the model is tested (testing dataset). The goal of cross-validation is to define a dataset to "test" the model in the training phase (i.e., the validation dataset), in order to limit problems like overfitting, give an insight on how the model will generalize to an independent dataset (i.e., an unknown dataset, for instance from a real problem), etc.

One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.

One of the main reasons for using cross-validation instead of using the conventional validation (e.g. partitioning the dataset into two sets of 70% for training and 30% for test) is that there is not enough data available to partition it into separate training and test sets without losing significant modeling or testing capability. In these cases, a fair way to properly estimate model prediction performance is to use cross-validation as a powerful general technique.

In summary, cross-validation combines (averages) measures of fit (prediction error) to derive a more accurate estimate of model prediction performance.
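The partition-and-average procedure can be sketched in plain Python as follows. The helper names, the mean-predicting model, and the absolute-error metric are placeholders chosen for this illustration; any fit and error functions with the same shape would do.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size

def cross_validate(xs, ys, k, fit, error):
    """Average the validation error over the k rounds."""
    scores = []
    for train, validation in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(error(model, [xs[i] for i in validation], [ys[i] for i in validation]))
    return sum(scores) / len(scores)

def fit_mean(xs, ys):
    """Placeholder model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)

def mean_abs_error(model, xs, ys):
    return sum(abs(model - y) for y in ys) / len(ys)

score = cross_validate(list(range(6)), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, fit_mean, mean_abs_error)
```

Every sample is used for validation exactly once and for training in the other rounds.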

***Training set: to fit the parameters***

***Validation set: to tune the parameters***

***Test set: to evaluate the performance***


### 03-03 Assessing a learning algorithm
Two indicators are used to judge the effectiveness of the model.

The first is the *root mean square error (RMSE)*.

In [None]:
# Root Mean Squared Error (RMSE)
%R RMSE <- sqrt(mean((y-y_pred)^2))

from sklearn.metrics import mean_squared_error
RMSE = mean_squared_error(y, y_pred)**0.5

**Note:** For financial data, we don't want to accidentally look forward in time, so we would only use *roll forward cross validation*. This simply demands that all the training data comes before the test data.
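One possible way to generate roll-forward splits is sketched below. The function name and its parameters are assumptions for this example; the only property it is meant to show is that every training index precedes every test index.

```python
def roll_forward_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train = list(range(0, test_start))                      # everything before the test window
        test = list(range(test_start, test_start + test_size))  # the next block in time
        yield train, test

splits = list(roll_forward_splits(n_samples=10, n_splits=3, test_size=2))
```

Here the training window grows with each split; a fixed-length sliding window would be an equally valid variant.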

The second metric for how well an algorithm is working is the *correlation* between the test data and the predicted values. A strong positive correlation, close to +1, indicates a good algorithm; a correlation close to zero indicates a poor one, and a correlation close to -1 means the predictions move opposite to the actual values.
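The correlation can be computed by hand as the Pearson coefficient. The sketch below uses made-up test values and predictions purely for illustration.

```python
import math

def pearson_correlation(y_true, y_pred):
    """Pearson correlation coefficient between actual and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((a - mean_true) * (b - mean_pred) for a, b in zip(y_true, y_pred))
    var_true = sum((a - mean_true) ** 2 for a in y_true)
    var_pred = sum((b - mean_pred) ** 2 for b in y_pred)
    return cov / math.sqrt(var_true * var_pred)

y_test = [1.0, 2.0, 3.0, 4.0]
good = pearson_correlation(y_test, [1.1, 1.9, 3.2, 3.8])     # predictions track the data
inverse = pearson_correlation(y_test, [4.0, 3.0, 2.0, 1.0])  # predictions move the wrong way
```

Note that correlation ignores scale and offset, which is why it is usually reported alongside RMSE rather than instead of it.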

*Overfitting* sets in at the point where the error on the training data keeps decreasing while the error on the test data starts increasing.

### 03-04 Ensemble learners, bagging and boosting

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

Combine Model Predictions Into Ensemble Predictions.

- **Bagging**: building multiple models (typically of the same type) from different subsamples of the training dataset.
- **Boosting**: building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the chain.
- **Stacking**: building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.

Two families of ensemble methods are usually distinguished:

In **averaging methods**, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

Examples: Bagging methods, Forests of randomized trees, ...
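A bare-bones sketch of bagging is shown here. The base learner (a 1-nearest-neighbor lookup), the synthetic data, and the number of models are all assumptions made for the example.

```python
import random

def fit_nearest_neighbor(train):
    """Hypothetical base learner: 1-nearest-neighbor lookup on (x, y) pairs."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagging(train, n_models, rng):
    """Fit each model on a bootstrap sample (drawn with replacement) of the training data."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        models.append(fit_nearest_neighbor(sample))
    # The ensemble prediction is the average of the individual predictions
    return lambda x: sum(m(x) for m in models) / len(models)

rng = random.Random(0)
train = [(float(x), 2.0 * x + rng.gauss(0.0, 0.5)) for x in range(10)]  # noisy line
ensemble = bagging(train, n_models=25, rng=rng)
prediction = ensemble(4.2)
```

Because each model sees a slightly different resample, their individual errors partly cancel when averaged.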
    
By contrast, in **boosting methods**, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Examples: AdaBoost, Gradient Tree Boosting, ...
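The sequential idea can be sketched with decision stumps that are each fit to the residual errors of the ensemble so far. This is a simplified gradient-boosting-style loop for squared error, not AdaBoost; the names, data, and `learning_rate` are assumptions for this example.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split, by squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Each new stump is fit to what the current ensemble still gets wrong."""
    stumps = []

    def predict(x):
        return sum(learning_rate * stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
weak = boost(xs, ys, n_rounds=1)     # a single weak learner
strong = boost(xs, ys, n_rounds=50)  # many weak learners chained together
```

On the training data, `strong` fits far better than `weak`, which is the bias reduction described above; whether that carries over to test data still has to be checked separately.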

**Bayesian model combination (BMC)** is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results.

At least three R packages offer Bayesian model averaging tools, including the BMS (an acronym for Bayesian Model Selection) package, the BAS (an acronym for Bayesian Adaptive Sampling) package, and the BMA package.

### 03-05 Reinforcement Learning

**Reinforcement Learning** is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behaviour within a specific context, in order to maximize its performance. Simple reward feedback is required for the agent to learn its behaviour; this is known as the reinforcement signal.

Reinforcement learning addresses the problem of how to maximize reward. In the stock market, the reward is the return on trades, and we want to find out how to maximize returns. This problem is complicated by time constraints. The value of future gains diminishes with time, so it's unreasonable to use an infinite horizon on which to base returns. However, optimizing returns over too short a horizon may keep the agent from seeing a much larger overall gain.

A **Markov Decision Process (MDP)** is just like a Markov Chain, except the transition matrix depends on the action taken by the decision maker (agent) at each time step. The agent receives a reward, which depends on the action and the state. The goal is to find a function, called a policy, which specifies which action to take in each state, so as to maximize some function (e.g., the mean or expected discounted sum) of the sequence of rewards. One can formalize this in terms of Bellman's equation, which can be solved iteratively using value iteration or policy iteration. The unique fixed point of this equation is the optimal value function, from which the optimal policy follows.

A Markov Decision Process (MDP) model contains:
- A set of possible world states S
- A set of possible actions A
- A real valued reward function R(s,a)
- A description T of each action's effects in each state.

Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history.
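To make the four components concrete, here is a tiny made-up MDP solved with value iteration, one standard way of solving Bellman's equation iteratively. The states, actions, probabilities, rewards, and discount factor are all invented for this sketch.

```python
# Hypothetical two-state MDP.
# T[s][a] is a list of (probability, next_state); R[s][a] is the reward R(s, a).
T = {
    "low":  {"wait": [(1.0, "low")],  "invest": [(0.8, "high"), (0.2, "low")]},
    "high": {"wait": [(1.0, "high")], "invest": [(0.5, "high"), (0.5, "low")]},
}
R = {
    "low":  {"wait": 0.0, "invest": -1.0},
    "high": {"wait": 2.0, "invest": 1.0},
}
gamma = 0.9  # discount factor (assumed)

def q_value(V, s, a):
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])

# Repeatedly apply the Bellman update until the values stop changing
V = {s: 0.0 for s in T}
for _ in range(500):
    V = {s: max(q_value(V, s, a) for a in T[s]) for s in T}

# The policy picks, in each state, the action with the best value
policy = {s: max(T[s], key=lambda a: q_value(V, s, a)) for s in T}
```

The Markov property shows up in the data structures: `T` and `R` are indexed only by the current state and action, never by history.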

### 03-06 Q-Learning 

**Q-learning** is a **model-free reinforcement learning technique**. 

Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function that ultimately gives the expected utility of taking a given action in a given state and following the optimal policy thereafter. A policy is a rule that the agent follows in selecting actions, given the state it is in. When such an action-value function is learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. 

One of the strengths of Q-learning is that it is able to compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards, without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward return over all successive steps, starting from the current state, is the maximum achievable.

Q-Learning doesn't need to have any sort of model of the transition function T or the reward function R. It builds a table of utility values as the agent interacts with the world. These are the Q values. At each state, the agent can then use the Q values to select the best action. For a finite MDP, Q-Learning is proven to converge to an optimal policy, provided every state-action pair keeps being visited and the learning rate is decayed appropriately. Q[s, a] represents the value of taking action a in state s. This value includes the immediate reward for taking action a, and the discounted (future) reward for all optimal future actions after having taken a.

What we want to find for a particular state s is the policy π(s), that is, the action we should take. Using Q values, all we need to do is find the action with the maximum Q value for that state.

π(s) = argmax_a(Q[s, a])

So, we go through each action a and see which action has the maximum Q value for state s. Eventually, after learning enough, the agent will converge to the optimal policy, π∗(s), and the optimal Q table, Q∗[s, a].

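1. Initialize the Q table with small random values.
2. Observe the current state s.
3. Select an action a.
4. Observe the reward r and the new state s'.
5. Update Q with the experience tuple (s, a, s', r).
6. Step forward in time, then repeat from step 2.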
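The loop can be sketched on a toy problem. Everything about the environment below (a short corridor with a reward at the right end), the learning rate, the discount factor, and the fixed 30% random-action rate is an assumption made for this illustration; the update in step 5 blends the old Q value with the observed reward plus the discounted best Q value of the next state.

```python
import random

N_STATES, N_ACTIONS = 4, 2   # hypothetical corridor: action 0 = left, 1 = right
ALPHA, GAMMA = 0.2, 0.9      # learning rate and discount factor (assumed values)

def step(s, a):
    """Toy environment: reaching the right end pays 1.0 and restarts at state 0."""
    s_prime = max(0, s - 1) if a == 0 else s + 1
    if s_prime == N_STATES - 1:
        return 0, 1.0
    return s_prime, 0.0

rng = random.Random(0)
# 1. Initialize the Q table with small random values
Q = [[rng.uniform(-0.01, 0.01) for _ in range(N_ACTIONS)] for _ in range(N_STATES)]
s = 0                                                  # 2. Observe the current state
for _ in range(20000):
    # 3. Select an action: random 30% of the time, otherwise the best known one
    if rng.random() < 0.3:
        a = rng.randrange(N_ACTIONS)
    else:
        a = max(range(N_ACTIONS), key=lambda i: Q[s][i])
    s_prime, r = step(s, a)                            # 4. Observe r and s'
    # 5. Update Q from the experience tuple (s, a, s', r)
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * (r + GAMMA * max(Q[s_prime]))
    s = s_prime                                        # 6. Step forward and repeat

# Greedy policy for the states the agent can actually occupy
policy = [max(range(N_ACTIONS), key=lambda i: Q[state][i]) for state in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every state, which is the shortest route to the reward.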


In [None]:
"""
Update Rule

The formula for computing Q for any state-action pair <s, a>, given an experience tuple <s, a, s', r>, is:

Q'[s, a] = (1 - α) · Q[s, a] + α · (r + γ · Q[s', argmax_a'(Q[s', a'])])

Here:

- r = R[s, a] is the immediate reward for taking action a in state s,
- γ ∈ [0, 1] (gamma) is the discount factor used to progressively reduce the value of future rewards,
- s' is the resulting next state,
- argmax_a'(Q[s', a']) is the action that maximizes the Q-value among all possible actions a' from s', and,
- α ∈ [0, 1] (alpha) is the learning rate used to vary the weight given to new experiences compared with past Q-values.
"""

**Exploration**

The success of a Q-learning algorithm depends on the exploration of the state-action space. If you only explore a small subset of it, you might not find the best policies. One way to ensure that you explore as much as possible is to introduce randomness into selecting actions during the learning phase. At each step, you first decide whether to take the action with the maximal Q value or a random action; the probability of choosing a random action decreases over subsequent iterations.
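A minimal sketch of this selection rule follows. The starting rate of 0.5 and the decay factor of 0.99 are arbitrary values picked for the example, as are the Q values.

```python
import random

def choose_action(q_row, random_action_rate, rng):
    """Explore with probability random_action_rate, otherwise exploit the best known action."""
    if rng.random() < random_action_rate:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

rng = random.Random(42)
random_action_rate, decay = 0.5, 0.99   # hypothetical starting rate and decay factor
q_row = [0.1, 0.7, 0.3]                 # Q values of the three actions in some state
actions = []
for _ in range(1000):
    actions.append(choose_action(q_row, random_action_rate, rng))
    random_action_rate *= decay         # explore less as learning progresses
```

Early choices are often random; by the end almost every choice is the greedy action (index 1 here).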


**Q-Learning for Trading**

Now that we know what Q-learning is, we need to figure out how to apply it to the context of trading. That means that we need to define what state, action, and reward mean. 

Actions are straightforward, as there are basically three of them:
- BUY
- SELL
- NOTHING

Our rewards can be daily returns or cumulative returns after a trade cycle (buy → sell). However, using daily returns will allow the agent to converge on a Q value more quickly, because if it waited until a sell, then it would have to look at all of the actions backwards until the buy to get that reward.

Now, we just need to figure out how to determine state. Some good factors to determine state are:

- Adjusted Close/Simple Moving Average
- Bollinger Band value
- P/E ratio
- Holding stock (whether or not we're holding the stock)
- Return since entry

Our state must be a single number so we can look it up in the table easily. To make it simpler, we'll confine the state to be an integer, which means we need to discretize each factor and then combine them into an overall state. Our state space is discrete, so the combined value is the overall state.
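One simple way to combine the discretized factors is to treat them as the digits of a number. The sketch below assumes each factor has already been discretized into an integer from 0 to 9; the factor order and values are made up.

```python
def combine_state(digits, n_bins=10):
    """Pack discretized factors (each in 0..n_bins-1) into one integer, like digits of a number."""
    state = 0
    for d in digits:
        if not 0 <= d < n_bins:
            raise ValueError("each factor must be discretized into 0..n_bins-1")
        state = state * n_bins + d
    return state

# Hypothetical bins: price/SMA = 2, Bollinger value = 5, holding flag = 1, return since entry = 7
state = combine_state([2, 5, 1, 7])
```

With four factors and ten bins each, the Q table needs 10,000 rows, so the number of factors and bins should be kept small.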

**Discretization**

In statistics and machine learning, discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes, features, variables or intervals. This can be useful when creating probability mass functions (formally, in density estimation). It is a form of binning, as in making a histogram. Whenever continuous data is discretized, there is always some amount of discretization error. The goal is to reduce that error to a level considered negligible for the modeling purposes at hand.

To discretize, we take the data for a factor over its range, sort it, and divide it into n bins holding equal numbers of points. The step size is the number of data points divided by n. We then find the thresholds by stepping through the sorted data by the step size and taking the value at each position.
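A possible implementation of this procedure is sketched below. The exact indexing convention (taking the last value of each bin as its threshold) is a choice made for this example, and the data is made up.

```python
import bisect

def compute_thresholds(data, n_bins):
    """Return n_bins - 1 cut points so that each bin holds roughly the same number of samples."""
    ordered = sorted(data)
    step_size = len(ordered) / n_bins
    return [ordered[int((i + 1) * step_size) - 1] for i in range(n_bins - 1)]

def discretize(value, thresholds):
    """Map a value to its bin index, from 0 to len(thresholds)."""
    return bisect.bisect_left(thresholds, value)

data = [0.3, 1.2, 0.7, 2.5, 1.9, 0.1, 3.4, 2.2]
thresholds = compute_thresholds(data, n_bins=4)
bins = [discretize(v, thresholds) for v in data]
```

Because the thresholds come from the sorted data rather than from equal-width intervals, dense regions get narrower bins.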


One main problem with Q-Learning is that it takes a lot of experience tuples to converge to the optimal Q value. This means the agent has to take many real interactions with the world (execute trades) to learn.

**Advantages**

The main advantage of a model-free approach like Q-Learning over model-based techniques is that it can easily be applied to domains where all states and/or transitions are not fully defined.

As a result, we do not need additional data structures to store transitions T(s, a, s') or rewards R(s, a).

Also, the Q-value for any state-action pair takes into account future rewards. Thus, it encodes both the best possible value of a state (maxa Q(s, a)) as well as the best policy in terms of the action that should be taken (argmaxa Q(s, a)).

**Issues**

The biggest challenge is that the reward (e.g. for buying a stock) often comes in the future - representing that properly requires look-ahead and careful weighting.

Another problem is that taking random actions (such as trades) just to learn a good strategy is not really feasible (you'll end up losing a lot of money!).

In the next lesson, we will discuss an algorithm that tries to address this second problem by simulating the effect of actions based on historical data.

### 03-07 Dyna

Dyna-Q is a simple architecture integrating the major functions needed in an on-line planning agent. 

Dyna-Q includes the processes: planning, acting, model-learning, and direct RL (all occurring continually). 

- The planning method is the random-sample one-step tabular Q-planning method. 

- The direct RL method is one-step tabular Q-learning. 

- The model-learning method is also table-based and assumes the world is deterministic.

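Quiz: How To Evaluate T?

Here $T_c[s,a,s']$ counts how many times taking action a in state s led to state s'. Normalizing the counts gives the estimated transition probability:

$$T\left[s,a,s'\right] = \frac{T_c\left[s,a,s'\right]}{\sum_i T_c\left[s,a,i\right]}$$

The expected reward model is updated with each observed reward r, using the learning rate α:

R'[s, a] = (1 - α) · R[s, a] + α · r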
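Both model updates can be sketched with dictionaries. The names `T_count` and `R_model`, the learning rate, and the observed experiences are assumptions for this example.

```python
from collections import defaultdict

T_count = defaultdict(lambda: defaultdict(int))  # T_count[(s, a)][s'] = times observed
R_model = defaultdict(float)                     # R_model[(s, a)] = expected reward estimate

def update_model(s, a, s_prime, r, alpha=0.2):
    """Record one real experience (s, a, s', r) in the transition and reward models."""
    T_count[(s, a)][s_prime] += 1
    R_model[(s, a)] = (1 - alpha) * R_model[(s, a)] + alpha * r

def transition_probability(s, a, s_prime):
    """Observed count for s' divided by the total count for (s, a)."""
    total = sum(T_count[(s, a)].values())
    return T_count[(s, a)][s_prime] / total if total else 0.0

# Four made-up experiences from state 0, action 0
for observed_next in [1, 1, 2, 1]:
    update_model(s=0, a=0, s_prime=observed_next, r=1.0)
```

After these four experiences, the estimated probability of landing in state 1 is 3/4 and the reward estimate has moved most of the way from 0 toward 1.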

How Dyna-Q works:

- Q learning

  - init Q table
  - observe s
  - execute a, observe (s', r)
  - update Q with (s, a, s', r)
  
- Update model 

  - update T'[s, a, s']
  - update R'[s, a]
  
- Dyna Q

  - s = random
  - a = random
  - s' = infer from T
  - r = R[s,a]
  - update Q with new experience tuple
  - repeat many times (~100 to 200)
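The planning part of Dyna-Q can be sketched as follows. This version samples s and a only from pairs that have already been observed, which is one common variation; the model contents, table sizes, and parameter values are invented for the example.

```python
import random

def dyna_q_planning(Q, T_count, R_model, n_updates, rng, alpha=0.2, gamma=0.9):
    """Update Q from simulated experience drawn from the learned model."""
    seen = list(T_count)                          # (s, a) pairs observed so far
    for _ in range(n_updates):
        s, a = rng.choice(seen)                   # s = random, a = random
        next_states = list(T_count[(s, a)])
        counts = [T_count[(s, a)][s2] for s2 in next_states]
        s_prime = rng.choices(next_states, weights=counts)[0]  # infer s' from T
        r = R_model[(s, a)]                       # r = R[s, a]
        # Same Q update as for real experience
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (r + gamma * max(Q[s_prime]))

# Hypothetical learned model with two states and two actions
T_count = {(0, 1): {1: 3}, (1, 0): {0: 1, 1: 1}}
R_model = {(0, 1): 1.0, (1, 0): 0.0}
Q = {0: [0.0, 0.0], 1: [0.0, 0.0]}
dyna_q_planning(Q, T_count, R_model, n_updates=200, rng=random.Random(1))
```

Each real interaction with the world is followed by a batch of these cheap simulated updates, so the Q table improves with far fewer real trades.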
