#### Module 12: Reinforcement Learning

#### Case Study–1

Domain – Logistics

Focus – Optimal path

Business challenge/requirement

BluEx is a leading logistics company in India, known for efficient delivery of packets to customers. However, its van drivers often take suboptimal delivery paths, which causes delays and raises fuel costs.
As an ML expert, you have to build a Reinforcement Learning model that finds an efficient path programmatically.

Key issues 

The data has many attributes, so classification could be tricky.

Considerations

Reinforcement Learning is tricky, so the expectation is to come up with a sample flow. The team will do the full-fledged implementation later.
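
A sample flow of this kind is commonly built on tabular Q-learning, which is also what the program in this case study uses. As a hedged sketch, the update rule can be written as below, where $s$ is the current cell, $a$ the chosen move, $r$ the reward received, and $s'$ the resulting cell. The symbols $\alpha$ (learning rate) and $\gamma$ (discount factor) are standard notation, not names from the case study; the values 0.1 and 0.9 are the ones hard-coded in the program's update line.

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right],
\qquad \alpha = 0.1,\quad \gamma = 0.9

% Worked example: the first time the agent steps into the goal (r = 10)
% from a neighbouring cell, with every Q-value still at zero:
Q(s, a) \leftarrow 0 + 0.1 \left[ 10 + 0.9 \cdot 0 - 0 \right] = 1.0
```

Repeated over many episodes, these updates propagate the goal reward backwards through the grid, so cells closer to the goal end up with higher values.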

Data volume

- None. Sample data is hard-coded in the program.

Additional information

- NA

Business benefits

Taking the optimal path can save up to 15% of fuel costs.

In [1]:
import numpy as np
import random

# Grid world (0 = free, 1 = obstacle)
GRID = np.array([
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0]
])

START = (0, 0)
GOAL = (3, 3)
ACTIONS = [(0,1), (0,-1), (1,0), (-1,0)]  # Right, Left, Down, Up

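# Rewards
STEP_REWARD = -1        # cost of every ordinary move, which favors shorter paths
OBSTACLE_PENALTY = -5   # applied when a move hits an obstacle or leaves the grid
GOAL_REWARD = 10        # reward for reaching GOAL

# Q-table: one value per (row, column, action), initialized to zero
Q = np.zeros((GRID.shape[0], GRID.shape[1], len(ACTIONS)))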

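def step(state, action):
    """Apply an action and return (next_state, reward, done)."""
    r, c = state
    dr, dc = action
    nr, nc = r + dr, c + dc
    # Moves off the grid or into an obstacle leave the agent in place and are penalized
    out_of_bounds = nr < 0 or nr >= GRID.shape[0] or nc < 0 or nc >= GRID.shape[1]
    if out_of_bounds or GRID[nr, nc] == 1:
        return state, OBSTACLE_PENALTY, False
    if (nr, nc) == GOAL:
        return (nr, nc), GOAL_REWARD, True
    return (nr, nc), STEP_REWARD, False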

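# Training: tabular Q-learning with a purely random behavior policy
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor
for ep in range(500):
    state = START
    done = False
    while not done:
        a = random.randrange(len(ACTIONS))  # explore uniformly at random
        next_state, reward, done = step(state, ACTIONS[a])
        r, c = state
        nr, nc = next_state
        # Move Q[r, c, a] toward reward + discounted best value of the next state
        Q[r, c, a] += ALPHA * (reward + GAMMA * np.max(Q[nr, nc]) - Q[r, c, a])
        state = next_state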

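# Extract the greedy path from the learned Q-table
path = [START]
state = START
for _ in range(20):  # cap the walk in case the greedy policy loops
    r, c = state
    a = np.argmax(Q[r, c])  # best-known action in this cell
    state, reward, done = step(state, ACTIONS[a])
    path.append(state)
    if done:
        break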

print("Learned path:", path)

Learned path: [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
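
To check that the learned path is in fact optimal on this sample grid, it can be compared with a breadth-first search (BFS) baseline. The sketch below is an illustrative sanity check, not part of the original program: the helper names `shortest_path_length` and `is_valid_path` are hypothetical, and the grid is re-declared as a plain list so the block runs on its own.

```python
from collections import deque

# Same layout as the case-study grid (0 = free, 1 = obstacle)
GRID = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
START, GOAL = (0, 0), (3, 3)
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up


def shortest_path_length(grid, start, goal):
    """Return the minimum number of moves from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return dist[(r, c)]
        for dr, dc in MOVES:
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return None


def is_valid_path(grid, path, start, goal):
    """Check endpoints, single-cell moves, bounds, and obstacle avoidance."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == 1:
            return False
    # Consecutive cells must be exactly one horizontal or vertical move apart
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(path, path[1:]))


# Path copied from the printed output of the program
learned_path = [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]

best = shortest_path_length(GRID, START, GOAL)
learned_moves = len(learned_path) - 1
print("Shortest possible moves:", best)
print("Learned path moves:", learned_moves)
print("Learned path is optimal:",
      is_valid_path(GRID, learned_path, START, GOAL) and learned_moves == best)
```

BFS is only practical as a baseline here because the grid is tiny and fully known. Note also that several equally short paths exist on this grid, so a different run of the Q-learning program may print a different path with the same number of moves.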
