<a href="https://colab.research.google.com/github/Amenasetheru/Stock-Market-trading-bot-using-Reinforcement-Learning/blob/master/Stock_Market_trading_bot_using_Reinforcement_Learning1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stock Market trading bot using Reinforcement Learning

**Stage-1: Installing dependencies and environment setup**

In [None]:
!pip install pandas-datareader



**Stage-2: Importing project dependencies**

In [None]:
# Importing project dependencies
import math
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import pandas_datareader as data_reader

from tqdm import tqdm_notebook, tqdm
from collections import deque



In [None]:
tf.__version__

'2.2.0'

In [None]:
# Define the algorithm
class AI_Trader():


  def __init__(self, state_size, action_space=3, model_name="AITrader"):  # Stay, Buy, Sell

    # Define all hyperparameters of the network.
    # The first two are the state size and the action space, stored as object attributes.
    self.state_size = state_size
    self.action_space = action_space

    # Experience replay memory. It stores at most the 2000 most recent elements.
    self.memory = deque(maxlen=2000)
    # Initialize an empty list called inventory. It holds the stocks we have
    # bought, since we cannot sell a stock that we have not bought before.
    self.inventory = []
    self.model_name = model_name

    # Gamma is the discount factor. It balances the current reward against
    # long-term rewards. Set it between 0.9 and 1.
    self.gamma = 0.95
    # Epsilon determines whether we choose a random action or use the model.
    # We start at 1.0, so at the very beginning of training, when the network
    # is not trained at all, all actions are random. Over time we decrease
    # epsilon so that we rely mostly on the trained network. Even with a fully
    # trained network we still want the agent to take some random actions to
    # explore the environment, which is where the next variable comes into play.
    self.epsilon = 1.0
    # Once epsilon is equal to or less than this number, we stop decreasing it.
    self.epsilon_final = 0.01
    # Epsilon decay must be less than 1. Here we set it to 0.995. It
    # determines how fast epsilon decreases.
    self.epsilon_decay = 0.995

    # model_builder creates the network and initializes it, and we store
    # the result in self.model.
    self.model = self.model_builder()
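
To get a feel for the exploration schedule defined above, here is a minimal sketch that counts how many decay steps it takes for epsilon to fall from 1.0 to `epsilon_final`. It assumes that epsilon is multiplied by `epsilon_decay` once per training step, as the training function later in this notebook does. The helper name `steps_until_final` is hypothetical and not part of the notebook.

```python
def steps_until_final(epsilon=1.0, epsilon_final=0.01, epsilon_decay=0.995):
    """Count multiplications by epsilon_decay until epsilon <= epsilon_final."""
    steps = 0
    while epsilon > epsilon_final:
        epsilon *= epsilon_decay
        steps += 1
    return steps, epsilon


steps, last_epsilon = steps_until_final()
print("decay steps:", steps)
print("final epsilon:", last_epsilon)
```

With these values the floor is reached after roughly 900 decay steps, so the agent acts mostly at random only during the early part of training.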

**Stage-3: Building the AI Trader Network** 

In [None]:
# Build the model
# This function doesn't take any arguments other than self.
def model_builder(self):
  model = tf.keras.models.Sequential()

  # Our state is nothing more than a vector of numbers derived from the
  # stock prices over the previous n days, so we can simply use a fully
  # connected network.
  model.add(tf.keras.layers.Dense(units=32, activation="relu", input_dim=self.state_size))
  model.add(tf.keras.layers.Dense(units=64, activation="relu"))
  model.add(tf.keras.layers.Dense(units=128, activation="relu"))

  # For the output layer, the number of units should be the same as the
  # number of actions. The activation function is linear because we use
  # mean squared error as our loss: we will overwrite the predicted action
  # values with our rewards, which are continuous numbers and not classes.
  model.add(tf.keras.layers.Dense(units=self.action_space, activation="linear"))

  model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(learning_rate=0.001))
  return model
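
In this notebook the methods are shown in separate cells as standalone functions that take `self`. One way to make such a function available on the class is to assign it as a class attribute before any instance is created. The sketch below uses a hypothetical stand-in class `Agent` with a dummy builder rather than the real `AI_Trader`, so that it runs without TensorFlow.

```python
class Agent:
    """Hypothetical stand-in for AI_Trader, used only to show the pattern."""

    def __init__(self, action_space=3):
        self.action_space = action_space
        # Works only if model_builder has been attached to the class
        self.model = self.model_builder()


def model_builder(self):
    # Dummy builder: returns a description instead of a Keras model
    return "model with {} outputs".format(self.action_space)


# Attach the standalone function as a method of the class
Agent.model_builder = model_builder

agent = Agent()
print(agent.model)
```

For the notebook, the analogous line would be `AI_Trader.model_builder = model_builder`, and likewise for `trade` and `batch_train`, placed before `AI_Trader(window_size)` is called. Alternatively, the functions can be indented inside the class body.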

### Building The Trade function

The role of the trade function is as follows.

It takes the state as an input and then generates a random number.

If that number is less than or equal to epsilon (at the beginning it always is, since epsilon starts at 1), the function returns a random integer between 0 and 2. Those are our actions.

Otherwise, the function calls our model, performs a prediction based on the input state, and returns only the action with the highest predicted value among those three actions.



In [None]:
# Build the trade function. It takes the state as its only argument and
# returns the action to perform in that state. For that state, we need to
# decide whether to use a randomly generated action or the model's action.
def trade(self, state):
    # If random.random() is less than or equal to epsilon, return a random
    # action. random.randrange can take multiple arguments, but here we only
    # provide the stop point, self.action_space, so it selects 0, 1 or 2.
    if random.random() <= self.epsilon:
        return random.randrange(self.action_space)
    # Otherwise, use the model to choose the action. self.model.predict
    # returns an array of shape (1, action_space), so we take actions[0] and
    # use np.argmax to return the single action with the highest predicted value.
    actions = self.model.predict(state)
    return np.argmax(actions[0])
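
The selection rule can be tried out without a trained network. The following sketch replaces the model's prediction with a hypothetical array `q_values` of shape (1, 3) and uses a standalone helper, `epsilon_greedy`, which is an illustrative name and not part of the notebook.

```python
import random

import numpy as np


def epsilon_greedy(q_values, epsilon, action_space=3):
    """Return a random action with probability epsilon, else the greedy one."""
    if random.random() <= epsilon:
        return random.randrange(action_space)
    return int(np.argmax(q_values[0]))


# Hypothetical model output of shape (1, 3): one value per action
q_values = np.array([[0.1, 0.7, 0.2]])

# With epsilon at 0 the highest-valued action (index 1) is chosen
greedy_action = epsilon_greedy(q_values, epsilon=0.0)

# With epsilon at 1 every action is drawn at random from 0, 1, 2
random_actions = [epsilon_greedy(q_values, epsilon=1.0) for _ in range(100)]

print(greedy_action, sorted(set(random_actions)))
```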

### Building the Custom Training Function

In [None]:
# This function takes a batch of saved data and trains the model on it.
# Its only argument is batch_size, which could be anywhere from 32 to 256
# or even more. It is an additional hyperparameter that we can play with.
def batch_train(self, batch_size):
  # First, select the data from the experience replay memory.
  batch = []

  # self.memory is the deque defined in __init__. Because we are dealing
  # with time-ordered data, we don't sample randomly from memory. We always
  # take the most recent entries. Note that this range starts at
  # len(self.memory) - batch_size + 1, so it yields the last
  # batch_size - 1 entries.
  for i in range(len(self.memory) - batch_size + 1, len(self.memory)):
    batch.append(self.memory[i])

  # Now iterate through the batch and train the model on each sample.
  # The order of the variables is very important: it must match the order
  # in which they were stored in memory.
  for state, action, reward, next_state, done in batch:
    # If the agent is in a terminal state, the current reward is used as is.
    # Otherwise there are more actions to be played, so we add the
    # discounted future reward: gamma multiplied by the highest predicted
    # value for the next state. np.amax returns the maximum of an array,
    # and the [0] index is needed because of the output shape.
    if not done:
      reward = reward + self.gamma * np.amax(self.model.predict(next_state)[0])

    # The target starts as the model's own prediction for the current state.
    # We then overwrite the entry of the action that was performed with the
    # reward computed above. Because we fit these continuous values, we use
    # mean squared error instead of a cross-entropy loss.
    target = self.model.predict(state)
    target[0][action] = reward

    # Fit the model with state as the features and target as the target.
    # epochs is set to 1 because we train the model very often, once for
    # each sample in the batch. verbose=0 suppresses the printed results.
    self.model.fit(state, target, epochs=1, verbose=0)

  # Decrease epsilon so that we perform fewer random actions over time.
  # We only do so while epsilon is still bigger than epsilon_final.
  if self.epsilon > self.epsilon_final:
    # Decrease it by multiplying it by epsilon_decay.
    self.epsilon *= self.epsilon_decay
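
The target computation inside the loop can be checked by hand on small arrays. This is a minimal sketch in which the two `predict` calls are replaced by hypothetical arrays `q_state` and `q_next`. The helper `q_target` is an illustrative name and not part of the notebook.

```python
import numpy as np


def q_target(q_state, q_next, action, reward, done, gamma=0.95):
    """Build the training target for one (state, action, reward) sample."""
    target = np.array(q_state, dtype=float)  # copy of the current predictions
    if not done:
        # Add the discounted best predicted value of the next state
        reward = reward + gamma * np.amax(q_next[0])
    target[0][action] = reward
    return target


# Hypothetical predictions of shape (1, 3) for the current and next state
q_state = np.array([[0.1, 0.2, 0.3]])
q_next = np.array([[1.0, 2.0, 0.5]])

mid_episode = q_target(q_state, q_next, action=2, reward=1.0, done=False)
terminal = q_target(q_state, q_next, action=2, reward=1.0, done=True)

print(mid_episode)  # entry 2 becomes 1.0 + 0.95 * 2.0
print(terminal)     # entry 2 becomes the plain reward
```

Only the entry of the performed action changes. The other two entries keep the model's own predictions, so they contribute no error to the mean squared error loss.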


**Stage-4: Dataset Preprocessing**

Defining helper functions

**Sigmoid**

The sigmoid is an activation function used mostly at the end of a network for binary classification, because it scales a number to the range from 0 to 1. Here it helps us scale our price differences.

We do this so that we can compare the changes between days on a common scale. For example, a stock can be at 200 one day and jump to 1000 the next, while another price jumps from 45 to 200.

Both are large jumps, but the raw price differences are on very different scales.

So we use the sigmoid function to represent such differences on the same scale.




In [None]:
def sigmoid(x):
  return 1/ (1 + math.exp(-x))
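
One caveat with this form is that `math.exp(-x)` raises `OverflowError` when `x` is a very large negative number, for instance a price drop of about a thousand. As a hedged alternative, the sketch below branches on the sign of `x` so that the exponent passed to `math.exp` is never positive. The name `stable_sigmoid` is hypothetical and not part of the notebook.

```python
import math


def stable_sigmoid(x):
    """Sigmoid that avoids overflow in math.exp for large negative x."""
    if x >= 0:
        return 1 / (1 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) lies in [0, 1) and cannot overflow
    return z / (1 + z)


print(stable_sigmoid(0))
print(stable_sigmoid(-1000))
print(stable_sigmoid(1000))
```

For moderate inputs it returns the same values as the `sigmoid` above, so it could be used as a drop-in replacement.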

**Price format function**

This function helps us print out the price of the stock we bought or sold.

If the number is negative (we lost some money, for example), it adds a minus sign in front of it. We use string formatting to limit the printed number to only 2 decimal places.

We do the same thing if the number is positive, except that we don't add the minus sign in front of it.





In [None]:
def stock_price_format(n):
  # "{0:.2f}" limits the printed number to 2 decimal places
  if n < 0:
    return "- $ {0:.2f}".format(abs(n))
  else:
    return "$ {0:.2f}".format(abs(n))
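
The format specifications `2f` and `.2f` are easy to confuse, so a short check may help. In this sketch the value 3.14159 is an arbitrary example: `2f` sets a minimum field width of 2 and keeps the default precision of 6, whereas `.2f` sets the precision to 2 decimal places.

```python
value = 3.14159

width_only = "{0:2f}".format(value)     # minimum width 2, default precision 6
two_decimals = "{0:.2f}".format(value)  # precision of 2 decimal places

print(width_only)
print(two_decimals)
```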


**Dataset loader**

Let's define a dataset variable and call the DataReader function from our data_reader library.

DataReader takes the name of a stock and returns its price information.

It can reach Google Finance, Yahoo Finance, or other providers of stock market information.

In our case, we are going to use Yahoo Finance. It provides the most relevant information for us.

The stock name is the symbol that represents the company on the stock exchange.

Let's use Apple for this example. To get Apple stock information, we write the string "AAPL".

The next argument that we have to specify is data_source, that is, which provider we ask for the information.

In our case, we set data_source equal to "yahoo".

We are going to copy this call into the function, but before we do that, let's check what it does.

It queries Yahoo, gets the Apple stock information, and saves it to our dataset variable.

It uses a pandas DataFrame object to store that information.

We don't have to use all of this information if we want to predict the stock market price.

Here we are going to use only the Close column for our example.

That is the column we are trying to predict, and from this data we are also going to build the states for our network.

The index of the DataFrame is the date, and from it we can read the starting date and the end date of the dataset.

If we take dataset.index, we get the dates of the whole dataset.

But if we take just the first element, we get our starting date.

We can convert this date to a string and split it on the space, so that we drop the time information and keep just the date.

This is exactly what happens in our function.

We take the first and the last date from our dataset, convert them to strings, and obtain the starting and the ending date of the dataset.

Instead of specifying Apple as a string, let's use the argument of the function.

In [None]:
def dataset_loader(stock_name):
  # Complete the dataset loader function
  dataset = data_reader.DataReader(stock_name, data_source="yahoo")

  start_date = str(dataset.index[0]).split()[0]
  end_date = str(dataset.index[-1]).split()[0]

  close = dataset["Close"]
  return close
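
The date handling can be tried offline. The sketch below builds a small hypothetical DataFrame with a date index in place of the downloaded data (the three prices and dates are made up) and applies the same string conversion and split.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded data: three trading days
dataset = pd.DataFrame(
    {"Close": [10.0, 10.5, 10.2]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

# str() of a timestamp gives "date time", and split()[0] keeps only the date
start_date = str(dataset.index[0]).split()[0]
end_date = str(dataset.index[-1]).split()[0]

close = dataset["Close"]

print(start_date, end_date)
print(list(close))
```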

In [None]:
dataset = data_reader.DataReader("AAPL", data_source="yahoo")


**State Creator**

This function takes data and generates states from it.

First, let's see how to transform the problem of stock market trading into a reinforcement learning setting, using a graph of the stock price.

We have information about the Apple stock price from 2010 until today.

On the x axis we have time, and on the y axis we have the price of the stock on a specific day (the exact values are not visible).

The data is taken from Yahoo Finance.

The blue line on the graph shows how the value of the company changes over time.

Let's see how to modify this data so that a reinforcement learning algorithm can understand it.

Each point on this graph is nothing more than a floating-point number that represents the stock price on a given day.

Our task is to predict what is going to happen next: is the price going up or down the next day? Based on that information, our agent will determine what to do: sell, buy, or do nothing.

Let's take six data points, or six days, for example, and transform this part of the graph into numbers. The red part is what we are trying to predict, and that is our target.

Let's take this red part and transform it into a number. Let's say it is 47.6. These numbers are made up for illustration. They are not taken from the graph.

In this example, window_size is equal to 5, which is also an argument of the state_creator function.

Based on this argument, we determine how many previous days we consider before predicting the current one.

Now we have our state made of 5 numbers, where each number represents one day in the past.

Set up like this, it is nothing more than a regression problem.

We have some numbers and we are predicting our target, which is also continuous.

Let's modify this setup so that we have actions instead of real-valued targets for our state.

We could still use the raw numbers, but that won't help us anymore.

Since our target is not a real number or a real price but an action, let's change our state to use the differences between days.

This information represents price changes over time and can potentially capture the future trend as well.

Now we have a state, and this new state has only 4 numbers, whereas our window size was 5. We will handle this in code.

This was just for a demonstration. Because we now have information about changes over time, our action at the new state is to buy, because the price was pretty high and we expect it to drop again.

So based on that information, we are going to buy stock at the new state.

Let's go back and implement this in code.

Let's implement this strategy. We have a function called state_creator that takes three arguments:
- data: the stock market data downloaded with the dataset_loader function.

- timestep: the day in the dataset that we want to predict for.
It can be anywhere from 0 to the length of the data.

- window_size: this argument determines how many previous days we want to consider in our state to predict the current one.

This argument can be anywhere from 1 to the length of the data.

We can play with the window_size parameter and see which size works best for the company that we are trying to trade.

In this section, let's work with a window size of 10.

The first thing to do is to calculate the starting ID, the index of the first day of our state: starting_id is equal to timestep minus window_size plus 1.

For example, when the timestep is zero (our agent is just starting) and the window size is 10, the starting ID is minus 9.

The plus 1 is added because of the way we create our state.

We don't want the prices on certain days but, as we explained, the differences between the current price and the previous price. To capture that change between dates, we add 1.

Now that we know our starting ID, we need to handle two different cases: when the starting ID is negative and when it is not.

If the starting ID is bigger than or equal to zero, the whole window lies inside the data.

In that situation, windowed_data is equal to data from starting_id up to timestep plus 1.

If that is not the case and our starting ID is negative, we prepend the first day's value as many times as we need to fill the window.

windowed_data is then equal to minus starting_id (negated because starting_id is negative at this point) multiplied by a list containing data[0].

This replicates that element as many times as needed, and then we append the rest of the elements to have the full window of data:

plus the list of data from 0 to timestep plus 1.

Now we have the data from which we can create our state.

Let's define an empty list called state.

After that, we iterate through the whole windowed_data list: for i in range(window_size minus 1).

The minus 1 is there because we take differences between the current element and the one after it.

Inside the loop we call state.append.

We have to normalize the difference between the next day and the current day, because prices can be very different.

We want the differences between prices on the same scale, no matter the price.

We do that with the sigmoid function.

So we append sigmoid of windowed_data[i + 1] minus windowed_data[i].

Finally, we return a NumPy array of the state, and we are done.

We have now completed the function that creates the state for us, using the same method that was explained on the graph.







In [None]:
def state_creator(data, timestep, window_size):
  starting_id = timestep - window_size + 1
  if starting_id >= 0:
    windowed_data = data[starting_id:timestep + 1]
  else:
    windowed_data = -starting_id * [data[0]] + list(data[0:timestep+1])
  
  state = []
  for i in range(window_size - 1):
    state.append(sigmoid(windowed_data[i+1] - windowed_data[i]))

  return np.array([state])
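
To see what the function produces, the following self-contained sketch runs the same logic on a short, made-up list of prices instead of the downloaded data. It repeats `sigmoid` and `state_creator` so that it can run on its own. The `prices` values and the window size of 5 are assumptions chosen for illustration.

```python
import math

import numpy as np


def sigmoid(x):
    return 1 / (1 + math.exp(-x))


def state_creator(data, timestep, window_size):
    starting_id = timestep - window_size + 1
    if starting_id >= 0:
        windowed_data = list(data[starting_id:timestep + 1])
    else:
        # Pad with the first price until the window is full
        windowed_data = -starting_id * [data[0]] + list(data[0:timestep + 1])
    state = []
    for i in range(window_size - 1):
        state.append(sigmoid(windowed_data[i + 1] - windowed_data[i]))
    return np.array([state])


prices = [10.0, 10.5, 10.2, 11.0, 11.3, 11.1]  # made-up prices

early_state = state_creator(prices, 0, 5)  # window padded with prices[0]
later_state = state_creator(prices, 5, 5)  # window fully inside the data

print(early_state.shape, early_state)
print(later_state.shape, later_state)
```

At timestep 0 the window holds only copies of the first price, so every difference is zero and every entry is sigmoid(0) = 0.5. At timestep 5 the entries are above 0.5 where the price rose and below 0.5 where it fell. In both cases a window of 5 prices gives a state of 4 numbers.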

**Loading a dataset**

In [None]:
stock_name = "AAPL"
data = dataset_loader(stock_name)

**Stage-5: Training the AI trader**

### Setting hyper parameters

The first hyperparameter is the window size. It is equal to 10, which means that we are going to use the 10 previous days to predict the current one, as described in the previous section.

We used the word epochs before, but in reinforcement learning we use the word episodes instead.

So we need to define how many times we are going to run through the whole dataset, or the whole environment.

In our case, we set episodes equal to 100.

The algorithm is going to run very slowly, and we have to wait for all of the episodes to pass.

Then we specify 32 for the batch size.

And at the end, data_samples is equal to len(data) minus 1.

Since we are trying to predict the next day, we can't use the last one.

In [None]:
window_size = 10 
episodes = 100
batch_size = 32
data_samples = len(data) - 1

### Defining the Trade Model

Now it is time to define our trader bot.

Let's call it trader. It is an instance of our AI_Trader class. Remember that the class takes several arguments.

But since we defined the action space to be 3 by default and the model name to be "AITrader" by default, we won't change them.

So the only thing that we have to specify is the state size.

For us, that is our window size.

Let's check the model structure by calling trader.model.summary().



In [None]:
trader = AI_Trader(window_size)

In [None]:
trader.model.summary()

### Training Loop

**Definition of the training loop**

As always, it is going to be a for loop that iterates through all the episodes that we defined.

We defined 100 episodes, so we write: for episode in range(1, episodes + 1).

The next thing to do is to print out the current episode so that we can keep track of the training process.

So let's print the episode using string formatting. We use the format function and need to specify 2 things, since we have 2 curly brackets to populate.

So we specify episode and episodes. This shows which episode we are on out of the total.

Let us define our initial state. The initial state is always the same.

At the very beginning of the episode, we define state as the result of state_creator, which takes 3 arguments: data, then 0, because at this point our timestep is zero, and lastly window_size plus 1. The plus 1 is needed because a window of window_size plus 1 prices yields window_size differences, which matches the state size of the network.

Then we are going to define 2 variables to keep track of the episode.

The first one is total_profit. Strictly speaking, we don't need it, but if we want to see how the model is progressing over time, it is recommended to have this variable.

The second one is trader.inventory, which we reset to an empty list.

Remember that our inventory is just a Python list that stores all the stocks that we bought.

Sometimes we can finish an episode without selling all the stocks.

So we want to start each episode clean, without any stock in the inventory.

At this point we are just making sure that we have a clean inventory before we start the episode.

After that, let's define our timestep loop. One timestep is one day, which is why the number of timesteps equals the number of samples we have.

Thus we write: for t in tqdm(range(data_samples)). tqdm is just used to display a progress bar, and data_samples is the hyperparameter that we defined above.

The first thing that we need is an action. That action is going to be chosen by the model.

So we say action is equal to trader.trade, and here we need to provide our state.

At first, these actions are selected totally at random. After some time, when the model is trained enough, it will choose these actions by itself.

Now that we have an action, it is time to perform it to get to the next state.

Let's define next_state as the result of state_creator, which takes data as always, then t plus 1, since we want the next state and not the current one, and again window_size plus 1.

Now define the reward. Because we haven't calculated anything at this point, the reward is zero.

Action 1 is buying and action 2 is selling.

We can only sell stocks that we have already bought.

The next thing to do is to define an if statement that checks whether the action performed right now by the model is 1, that is, whether we bought a stock.

If the action is equal to 1, the agent is buying.

In that case, the only thing that we have to do is to append the current stock price to the inventory.

To do that, we take trader.inventory, which is a Python list, and call append with data[t], because we want the current day's price to be recorded as our bought stock.

We can add a print statement here, something like: the agent bought the stock for x price.

We have checked whether the action was 1, meaning that the agent was buying stock rather than selling it.

Now we are going to check whether the agent is selling stock, meaning that the action is equal to 2.

We know that we cannot sell stocks if we haven't previously bought them, that is, if they are not in our inventory.

Thus we need to make sure that we already have something in our inventory.

To handle this situation, we introduce an additional condition.

We check that len of trader.inventory is bigger than 0. In other words, we have some stocks bought already.

If both conditions are true, we look up the buy price.

We set buy_price equal to trader.inventory.pop(0).

By doing this, we sell the stock that was bought first. The order in which stocks are sold is an additional strategy choice that we can use to improve this algorithm.

Now we calculate the reward as the max of data[t], the current price of our stock, minus buy_price, and 0.

This means that if data[t] is less than buy_price, we lost money and the reward is zero.

total_profit is increased here by the difference between the current price and buy_price.

The whole user experience can be improved with a simple print statement here, such as: the agent sold the stock for that price, and it earned or lost that amount of money.

Next, we check whether this is the last sample in our dataset. If that is the case, we are done, for we do not have any more steps to perform in the current episode. So if t is equal to data_samples minus 1, we say done is equal to True. Otherwise, we say done is equal to False.


The next thing to do is to append all the data to our trader's memory, the experience replay buffer.

To do this, we call trader.memory.append. The memory is a deque, which supports append just like a Python list, and we pass it a single tuple that contains several things.

The tuple holds state, action, reward, then the next_state we calculated, and lastly done.

This is all that we have to provide to our memory. Then we change the state to our next state so that we can iterate through the whole episode.

The next step is to print out the total profit once the episode is done.

Before we start the training process, there are two more things to do. The first is to check whether we have more entries in our memory than batch_size.

We check whether len of trader.memory is bigger than batch_size.
If that is the case, we call trader.batch_train, and the only argument that we need to provide is batch_size.

The second, in the main episode loop, is to check whether the episode number modulo 10 is equal to zero.

If that is the case, we save the model.

To save the model, we call trader.model.save, and the only argument to provide is the file name.
Let's use the name of our class, ai_trader, followed by curly brackets that we populate with the episode index, then .h5, which is the file extension of the saved model.
To populate the curly brackets, we call .format and provide episode.




In [None]:
for episode in range(1, episodes + 1):
  print("Episode: {}/{}".format(episode, episodes))

  # window_size + 1 prices yield window_size differences, matching state_size
  state = state_creator(data, 0, window_size + 1)
  total_profit = 0
  trader.inventory = []

  for t in tqdm(range(data_samples)):
    action = trader.trade(state)

    next_state = state_creator(data, t+1, window_size + 1)
    reward = 0

    if action == 1:  # Buy
      trader.inventory.append(data[t])
      print("AI Trader bought: ", stock_price_format(data[t]))

    elif action == 2 and len(trader.inventory) > 0:  # Sell
      buy_price = trader.inventory.pop(0)

      reward = max(data[t] - buy_price, 0)
      total_profit += data[t] - buy_price
      print("AI Trader sold: ", stock_price_format(data[t]), " Profit: " + stock_price_format(data[t] - buy_price))

    if t == data_samples - 1:
      done = True
    else:
      done = False

    # The replay memory stores one tuple per timestep
    trader.memory.append((state, action, reward, next_state, done))
    state = next_state

    if done:
      print("##############################")
      print("TOTAL PROFIT: {}".format(total_profit))
      print("##############################")

    if len(trader.memory) > batch_size:
      trader.batch_train(batch_size)

  # Save the model every 10 episodes
  if episode % 10 == 0:
    trader.model.save("ai_trader_{}.h5".format(episode))
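
The inventory and reward bookkeeping of the loop can be followed by hand on a tiny example. This sketch leaves out the model entirely and replays a fixed, made-up sequence of actions over four made-up prices. The names `prices`, `actions`, and `rewards` are assumptions for illustration.

```python
prices = [10.0, 12.0, 11.0, 9.0]  # made-up daily prices
actions = [1, 1, 2, 2]            # buy, buy, sell, sell

inventory = []
total_profit = 0.0
rewards = []

for t, action in enumerate(actions):
    reward = 0
    if action == 1:
        # Buy: remember the price paid
        inventory.append(prices[t])
    elif action == 2 and len(inventory) > 0:
        # Sell: pop(0) takes the stock that was bought first
        buy_price = inventory.pop(0)
        reward = max(prices[t] - buy_price, 0)  # losses give a reward of 0
        total_profit += prices[t] - buy_price   # profit can go negative
    rewards.append(reward)

print(rewards)
print(total_profit)
```

The first sale, at 11, is matched against the first purchase, at 10, which gives a reward of 1. The second sale, at 9, is matched against the purchase at 12, so the reward is clipped to 0 while the total profit drops by 3, ending at -2.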
