# Smart Charging Using Reinforcement Learning

A simple SimPy example from the workshop (ChargingEV.py):

```python
import simpy

env = simpy.Environment()
# Battery charging station (BCS) with two charging spots
bcs = simpy.Resource(env, capacity=2)


def car(env, name, bcs, driving_time, charge_duration):
    # Simulate driving to the BCS
    yield env.timeout(driving_time)

    # Request one of its charging spots
    print('%s arriving at %d' % (name, env.now))
    with bcs.request() as req:
        yield req

        # Charge the battery
        print('%s starting to charge at %s' % (name, env.now))
        yield env.timeout(charge_duration)
        print('%s leaving the bcs at %s' % (name, env.now))


# Car i arrives at time 2*i and charges for 5 time units
for i in range(20):
    env.process(car(env, 'Car %d' % i, bcs, i * 2, 5))

env.run()
```

```text
Car 0 arriving at 0
Car 0 starting to charge at 0
Car 1 arriving at 2
Car 1 starting to charge at 2
Car 2 arriving at 4
Car 0 leaving the bcs at 5
Car 2 starting to charge at 5
Car 3 arriving at 6
Car 1 leaving the bcs at 7
Car 3 starting to charge at 7
Car 4 arriving at 8
Car 5 arriving at 10
Car 2 leaving the bcs at 10
Car 4 starting to charge at 10
Car 6 arriving at 12
Car 3 leaving the bcs at 12
Car 5 starting to charge at 12
Car 7 arriving at 14
Car 4 leaving the bcs at 15
Car 6 starting to charge at 15
Car 8 arriving at 16
Car 5 leaving the bcs at 17
Car 7 starting to charge at 17
Car 9 arriving at 18
Car 10 arriving at 20
Car 6 leaving the bcs at 20
Car 8 starting to charge at 20
Car 11 arriving at 22
Car 7 leaving the bcs at 22
Car 9 starting to charge at 22
Car 12 arriving at 24
Car 8 leaving the bcs at 25
Car 10 starting to charge at 25
Car 13 arriving at 26
Car 9 leaving the bcs at 27
Car 11 starting to charge at 27
Car 14 arriving at 28
Car 15 arriving at 30
Car 10 leaving 
```

## Learning Resources and environments we can use

- SimPy (alternative)

- OpenAI Gym, now maintained by the Farama Foundation as Gymnasium
    - [Official Gymnasium GitHub](https://github.com/Farama-Foundation/Gymnasium)
    - [Gymnasium Docu](https://gymnasium.farama.org/)
        - [Creating a Gymnasium Environment](https://gymnasium.farama.org/tutorials/gymnasium_basics/environment_creation/)
        - [Lunar Lander example](https://gymnasium.farama.org/environments/box2d/lunar_lander/)
    - [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/index.html) (OpenAI's educational resource on deep RL)
        - [Spinning Up algorithm docs, e.g., DDPG](https://spinningup.openai.com/en/latest/algorithms/ddpg.html#background)

- [A good collection of deep-RL algorithm implementations in PyTorch](https://github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch/blob/master/README.md) (see the results folder)

## Finite Markov Decision Process (MDP) (M+B)

- Helper functions: `getState()`, `maxAction()`, `plotRunningAverage()`
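
These three helpers are only named in the notes, so the following is a minimal sketch of what they might do, assuming a tabular Q-table stored as a `defaultdict` keyed by `(state, action)` and a hypothetical 5 kWh battery bin width. All signatures are assumptions; `plotRunningAverage()` is reduced here to a hypothetical `runningAverage()` that computes the curve one would then plot (e.g., with matplotlib).

```python
from collections import defaultdict

import numpy as np

# Hypothetical discretisation parameter (not fixed in the notes).
BATTERY_STEP_KWH = 5.0  # width of one battery-level bin


def getState(time_step, battery_kwh):
    """Map a raw observation to a discrete, hashable state (hypothetical)."""
    battery_bin = int(battery_kwh // BATTERY_STEP_KWH)
    return (time_step, battery_bin)


def maxAction(Q, state, actions):
    """Return the action with the highest Q-value in `state` (ties -> first)."""
    values = [Q[(state, a)] for a in actions]
    return actions[int(np.argmax(values))]


def runningAverage(total_rewards, window=100):
    """Mean of the last `window` episode rewards, for each episode.

    plotRunningAverage() would plot this array against the episode index.
    """
    rewards = np.asarray(total_rewards, dtype=float)
    averages = np.empty(len(rewards))
    for i in range(len(rewards)):
        averages[i] = rewards[max(0, i - window + 1): i + 1].mean()
    return averages


Q = defaultdict(float)
actions = [0, 7, 14, 21]  # charging rates in kW
s = getState(0, 12.0)
Q[(s, 14)] = 1.5  # pretend this value was learned earlier
print(s, maxAction(Q, s, actions))  # (0, 2) 14
print(runningAverage([1, 2, 3, 4], window=2))
```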

A **Markov Decision Process** (MDP) is a tuple ⟨S, A, P, R⟩, often extended with a discount factor γ:
- S is a set of **states**
    - States: time, battery_level, (charging_rate), (atHome)
        - time: Discrete 15-minute intervals from 2 p.m. to 4 p.m. (count down?)
        - battery_level: the current battery level, from 0 kWh up to the battery capacity (in kWh)
        - charging_rate: the current charging rate, between 0 kW and a maximum rate still to be chosen (e.g., 22 kW)
        - atHome: indicator of whether the agent is at home or has departed
- A is a set of **actions**
    - Actions: {zero: 0 kW, low: 7 kW, medium: 14 kW, high: 21 kW}
        - The actions are the discrete charging rates that the agent can choose at each time step.
- P is a state **transition probability** function: $P^{a}_{ss'} = \mathbb{P}[S_t = s' \mid S_{t-1} = s, A_{t-1} = a]$
    - Transition Probability:
- R is a **reward** function of states and actions
    - Reward:
        - Running out of energy: e.g., -1000
        - Charging costs: $\text{charging cost} = \sum_{t \in T} \alpha_t \, e^{p_t}$, where $\alpha_t$ is the time coefficient and $p_t$ is the charging rate at time step $t$.
    - Reward function:
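
A minimal sketch of the state, action and transition parts of this MDP, together with the charging-cost formula. It assumes 15-minute steps from 2 p.m. to 4 p.m. (8 decisions), a hypothetical 60 kWh battery (the capacity is still open in these notes), a deterministic battery update with no driving while charging, and placeholder time coefficients α_t = 1. Taken literally with p in kW, the exponential term grows very fast (e^7 ≈ 1097 already exceeds the example penalty of 1000), so the sketch divides the rate by a hypothetical `RATE_SCALE` before exponentiating. Note also that e^0 = 1, so under this formula even the zero action costs α_t.

```python
import math

# Hypothetical parameters; the notes leave capacity and alpha_t open.
DT_H = 0.25               # one 15-minute step, in hours
N_STEPS = 8               # 2 p.m. to 4 p.m. in 15-minute steps
CAPACITY_KWH = 60.0       # assumed battery capacity
ACTIONS_KW = {"zero": 0.0, "low": 7.0, "medium": 14.0, "high": 21.0}
ALPHA = [1.0] * N_STEPS   # placeholder time coefficients alpha_t
RATE_SCALE = 21.0         # hypothetical: rescale p so e^p stays bounded


def transition(time_step, battery_kwh, rate_kw):
    """Deterministic battery update for one 15-minute charging step."""
    new_battery = min(CAPACITY_KWH, battery_kwh + rate_kw * DT_H)
    return time_step + 1, new_battery


def step_cost(time_step, rate_kw):
    """One summand alpha_t * e^(p_t) of the charging-cost formula."""
    return ALPHA[time_step] * math.exp(rate_kw / RATE_SCALE)


def total_cost(rates_kw):
    """Charging cost of a whole schedule: sum over t of alpha_t * e^(p_t)."""
    return sum(step_cost(t, p) for t, p in enumerate(rates_kw))


state = (0, 10.0)  # (time step, battery level in kWh)
for rate in [ACTIONS_KW["high"], ACTIONS_KW["high"], 0.0, ACTIONS_KW["low"]]:
    state = transition(*state, rate)
print(state)  # (4, 22.25)
```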
    
**Goal:** Find the optimal policy (which action to take in each state).  
**Trade-off:** The agent must avoid running out of energy (which should carry a very high penalty) while minimizing the charging cost.
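
One way to encode this trade-off as a per-step reward, assuming the charging cost of each step is computed separately and the large penalty (e.g., -1000) is applied only at departure, when the battery holds less energy than the trip requires. The function name and signature are hypothetical.

```python
def reward(step_cost, departing, battery_kwh, demand_kwh, penalty=-1000.0):
    """Per-step reward: negative charging cost, plus a penalty for running out.

    Hypothetical helper: `departing` marks the final step, when the sampled
    energy demand is compared with the energy stored in the battery.
    """
    r = -step_cost
    if departing and battery_kwh < demand_kwh:
        r += penalty  # the car would run out of energy on the trip
    return r


print(reward(2.0, departing=False, battery_kwh=20.0, demand_kwh=30.0))  # -2.0
print(reward(2.0, departing=True, battery_kwh=20.0, demand_kwh=30.0))   # -1002.0
```

Because the penalty only enters at departure, the agent has to learn to trade cheap (or no) charging early against the risk of a large terminal loss.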

**Further Assumptions:**
- Battery capacity: 
- Energy demand function: stochastic value following a normal distribution
    - Parameters: μ = 30 kWh, σ = 5 kWh (note: the demand must be sampled exactly when the driver wants to leave)
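
A sketch of the stochastic energy demand, drawn only at the moment of departure so that the agent cannot observe it while charging. It assumes Python's `random.Random.gauss` and clips negative draws to 0 kWh (a normal distribution can in principle go below zero); the clipping and the function names are assumptions.

```python
import random

MU_KWH, SIGMA_KWH = 30.0, 5.0  # demand parameters from the notes


def sample_demand(rng):
    """Draw the trip's energy demand (kWh) when the driver wants to leave."""
    return max(0.0, rng.gauss(MU_KWH, SIGMA_KWH))


def runs_out(battery_kwh, rng):
    """Sample the demand at departure and report whether the battery is short."""
    demand = sample_demand(rng)
    return battery_kwh < demand, demand


rng = random.Random(42)  # seeded for reproducibility
short, demand = runs_out(25.0, rng)
print(f"demand = {demand:.1f} kWh, ran out: {short}")
```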

## First RL model: SARSA (TD control, on-policy) (B)
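
SARSA updates $Q(S,A) \leftarrow Q(S,A) + \alpha\,[R + \gamma\, Q(S',A') - Q(S,A)]$, where $A'$ is the action the ε-greedy behaviour policy actually takes next (here α is the step size, unrelated to the time coefficient α_t). A minimal tabular sketch, assuming a dict-based Q-table, the four charging rates as actions and hypothetical hyperparameters α = 0.1, γ = 0.99, ε = 0.1; it is not tied to a particular environment.

```python
import random
from collections import defaultdict

ACTIONS = [0, 7, 14, 21]  # charging rates in kW


def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
    """On-policy TD(0) update: bootstrap from the action actually taken next."""
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
rng = random.Random(0)
s, a, s_next = (0, 2), 7, (1, 2)
a_next = epsilon_greedy(Q, s_next, epsilon=0.1, rng=rng)  # chosen by the policy
Q[(s_next, a_next)] = 10.0  # pretend this value was learned earlier
sarsa_update(Q, s, a, r=-1.0, s_next=s_next, a_next=a_next, done=False)
print(Q[(s, a)])  # approximately 0.1 * (-1 + 0.99 * 10) = 0.89
```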

## Second RL model: Q-learning (TD control, off-policy) (M)
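
Q-learning differs from SARSA only in its target: it bootstraps from the greedy action, $R + \gamma \max_{a} Q(S', a)$, regardless of which action the behaviour policy takes next. A minimal tabular sketch under the same assumptions as a dict-based Q-table with hypothetical α = 0.1 and γ = 0.99:

```python
from collections import defaultdict

ACTIONS = [0, 7, 14, 21]  # charging rates in kW


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Off-policy TD(0) update: bootstrap from the best action in s_next,
    whatever action the behaviour policy actually takes next."""
    best_next = max(Q[(s_next, b)] for b in ACTIONS)
    target = r if done else r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)
s, a, s_next = (0, 2), 7, (1, 2)
Q[(s_next, 0)], Q[(s_next, 21)] = 2.0, 10.0  # pretend these were learned
q_learning_update(Q, s, a, r=-1.0, s_next=s_next, done=False)
print(Q[(s, a)])  # approximately 0.89, using the max value 10.0
```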

## (Second RL model, advanced: Double Q-learning)
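
Double Q-learning keeps two tables and, on each update, lets one table choose the best next action while the other evaluates it, which reduces the over-estimation caused by the max in plain Q-learning. A minimal tabular sketch, assuming dict-based tables, a fair coin to pick which table to update, and hypothetical α = 0.1, γ = 0.99:

```python
import random
from collections import defaultdict

ACTIONS = [0, 7, 14, 21]  # charging rates in kW


def double_q_update(Q1, Q2, s, a, r, s_next, done, rng, alpha=0.1, gamma=0.99):
    """Update one table, using it to select a* and the other to evaluate a*."""
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1  # swap roles: update the second table this time
    best = max(ACTIONS, key=lambda b: Q1[(s_next, b)])  # selection
    target = r if done else r + gamma * Q2[(s_next, best)]  # evaluation
    Q1[(s, a)] += alpha * (target - Q1[(s, a)])


def behaviour_values(Q1, Q2, state):
    """The behaviour policy (e.g. epsilon-greedy) acts on Q1 + Q2."""
    return {b: Q1[(state, b)] + Q2[(state, b)] for b in ACTIONS}


Q1, Q2 = defaultdict(float), defaultdict(float)
s, a, s_next = (0, 2), 7, (1, 2)
Q1[(s_next, 21)] = 10.0  # Q1 over-estimates the "high" action
Q2[(s_next, 0)] = 5.0    # Q2 prefers the "zero" action
double_q_update(Q1, Q2, s, a, r=-1.0, s_next=s_next, done=False,
                rng=random.Random(0))
print(Q1[(s, a)] + Q2[(s, a)])  # -0.1 whichever table was updated
```

With plain Q-learning on Q1 alone, the target would have used the inflated value 10.0; here the action picked by one table is scored by the other, which values it at 0.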

## Third RL model: Deep Q-learning
Build this in Google Colab with a GPU runtime selected.
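
Two ingredients distinguish Deep Q-learning (DQN) from tabular Q-learning: an experience replay buffer and a separate target network that is synced only periodically. The framework-free sketch below shows both, assuming NumPy and a hypothetical linear Q-function standing in for the neural network (in Colab this would typically be a PyTorch or TensorFlow model). Buffer size, batch size and γ are placeholders, and the gradient step that would minimize the mean squared TD error with respect to the online weights is left out.

```python
import random
from collections import deque

import numpy as np

N_ACTIONS = 4   # zero, low, medium, high
STATE_DIM = 2   # e.g. (time step, battery level), normalised


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity=10_000, seed=0):
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.buffer), batch_size)
        s, a, r, s_next, done = map(np.array, zip(*batch))
        return s, a, r, s_next, done

    def __len__(self):
        return len(self.buffer)


def q_values(weights, states):
    """Hypothetical linear stand-in for the Q-network: (batch, N_ACTIONS)."""
    return states @ weights


def td_targets(target_weights, r, s_next, done, gamma=0.99):
    """y = r + gamma * max_a Q_target(s', a), with no bootstrap at episode end."""
    next_q = q_values(target_weights, s_next).max(axis=1)
    return r + gamma * next_q * (1.0 - done)


rng = np.random.default_rng(0)
online_w = rng.normal(size=(STATE_DIM, N_ACTIONS))
target_w = online_w.copy()  # synced copy, refreshed only every N steps

buffer = ReplayBuffer()
for _ in range(64):
    s, s_next = rng.random(STATE_DIM), rng.random(STATE_DIM)
    buffer.push(s, int(rng.integers(N_ACTIONS)), -1.0, s_next, False)

s, a, r, s_next, done = buffer.sample(32)
y = td_targets(target_w, r, s_next, done.astype(float))
td_error = y - q_values(online_w, s)[np.arange(len(a)), a]
print(y.shape, td_error.shape)  # (32,) (32,)
```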