# Connect Three 

The primary description of this coursework is available on the CM20252 Moodle page. This is the Jupyter notebook you must complete and submit to receive marks. This notebook adds detail to the coursework specification but does not repeat information already provided there. 

You must follow all instructions given in this notebook precisely.

Restart the kernel and run all cells before submitting the notebook. This will guarantee that we will be able to run your code for testing. Remember to save your work regularly.

__You will develop players for Connect-Three on a grid that is 5 columns wide and 3 rows high. The example below shows a win for Player Red.__

<img src="images/connect3.png" style="width: 200px;"/>

## Preliminaries

For your reference, below is a visual depiction of the agent-environment interface in reinforcement learning. The interaction of the agent with its environment starts at decision stage $t=0$ with the observation of the current state $s_0$. (Notice that there is no reward at this initial stage.) The agent then chooses an action to execute. At decision stage $t=1$, the environment responds by changing its state to $s_1$ and returning the numerical reward signal $r_1$. 

<img src="images/agent-environment.png" style="width: 500px;"/>
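
The loop in the figure can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyCorridor`, `run_episode` and the always-move-right policy are hypothetical names invented for this example and are not part of the provided code. The only thing borrowed from the coursework is the shape of `act()`, which returns a reward and a game-over flag.

```python
class ToyCorridor:
    """Hypothetical stand-in environment: walk right from cell 0 to cell 3."""

    def __init__(self):
        self.state = 0

    def act(self, action):
        # Move by `action` cells, clamped to the corridor [0, 3].
        self.state = max(0, min(3, self.state + action))
        game_over = self.state == 3
        reward = 1 if game_over else 0
        return reward, game_over


def run_episode(env, policy):
    """Run one episode and return the list of (state, action, reward) steps."""
    trajectory = []
    state = env.state  # s_0 is observed before any reward exists
    game_over = False
    while not game_over:
        action = policy(state)               # agent chooses an action in s_t
        reward, game_over = env.act(action)  # environment returns r_{t+1}
        trajectory.append((state, action, reward))
        state = env.state                    # environment is now in s_{t+1}
    return trajectory


trajectory = run_episode(ToyCorridor(), policy=lambda state: 1)
print(trajectory)
```

Each tuple pairs the state in which the action was chosen with the reward that arrived one stage later, which is the indexing convention used in the figure.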

<br><br><br>

Below, we provide some code that will be useful for implementing parts of this interface. You are not obligated to use this code; please feel free to develop your own code from scratch. 

### Code details

We provide a `Connect` class that you can use to simulate Connect-Three games. The following cells in this section will walk you through the basic usage of this class by playing a couple of games.

We import the `connect` module and create a Connect-Three environment called `env`. The constructor method has one argument called `verbose`. If `verbose=True`, the `Connect` object will regularly print the progress of the game. This is useful for getting to know the provided code, debugging your code, or if you just want to play around. You will want to set `verbose=False` when you run hundreds of episodes to complete the marked exercises.

This `Connect` environment uses the strings `'o'` and `'x'` instead of different disk colors in order to distinguish between the two players. We can specify who should start the game using the `starting_player` argument.

In [1]:
import connect
env = connect.Connect(starting_player='x', verbose=True)

Game has been reset.
[[' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ']]



We can interact with the environment using the `act()` method. This method takes an `action` (an integer) as input and computes the response of the environment. An action is defined as the column index that a disk is dropped into. The `act()` method returns the `reward` for player `'o'` and a boolean, indicating whether the game is over (`True`) or not (`False`). 

In [2]:
reward, game_over = env.act(action=2)
print("reward =", reward)
print("game_over =", game_over)

[[' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'x' ' ' ' ']]
reward = 0
game_over = False


Because we set `verbose=True` when we created our environment, the grid is printed each time we call the `act()` method. You will probably want to set `verbose=False` when you run Q-learning for thousands of episodes. 

As expected, the `reward` is 0 and no one has won the game yet (`game_over` is `False`). Let us drop another disk into the same column.

In [3]:
reward, game_over = env.act(action=2)

[[' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'o' ' ' ' ']
 [' ' ' ' 'x' ' ' ' ']]


We see that the `Connect` environment automatically switches the active player after each move.

The `grid` is stored as a two-dimensional `numpy` array in the `Connect` class and you can easily access it by calling...

In [4]:
current_grid = env.grid
print(current_grid)

[[' ' ' ' 'x' ' ' ' ']
 [' ' ' ' 'o' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ']]


Note that the grid now appears to be "upside down" because `numpy` arrays are printed from "top to bottom".
We can also print it the way it is printed by the Connect class by calling...

In [5]:
print(current_grid[::-1])

[[' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'o' ' ' ' ']
 [' ' ' ' 'x' ' ' ' ']]


Let's make another move.

In [6]:
reward, game_over = env.act(action=2)

[[' ' ' ' 'x' ' ' ' ']
 [' ' ' ' 'o' ' ' ' ']
 [' ' ' ' 'x' ' ' ' ']]


Let us try to put another disk in the same column with `act(action=2)`. The environment will throw an error because that column is already filled.

In [7]:
# This cell should throw an IndexError!
env.act(action=2)

IndexError: index 3 is out of bounds for axis 0 with size 3

The `.available_actions` attribute of the `Connect` class contains a `numpy` array of all columns that are not yet full. Use it to avoid errors like the one we have just encountered.

In [None]:
print(env.available_actions)

Note that column index '2' is missing because this column is already filled.
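
To make the idea concrete, here is a minimal sketch of how the set of legal columns could be derived from a grid and used to pick a random legal move. It is not how the `Connect` class computes `available_actions`; `legal_columns` and the plain-list `grid` are hypothetical stand-ins, and the sketch assumes row 0 is the bottom row, as in the `env.grid` printout above.

```python
import random

EMPTY = ' '


def legal_columns(grid):
    """Return the indices of columns whose top cell is still empty.

    `grid` is a list of rows with row 0 at the bottom.
    """
    top_row = grid[-1]
    return [col for col, cell in enumerate(top_row) if cell == EMPTY]


# Stand-in for the position reached above: column 2 is full.
grid = [
    [' ', ' ', 'x', ' ', ' '],  # bottom row
    [' ', ' ', 'o', ' ', ' '],
    [' ', ' ', 'x', ' ', ' '],  # top row
]

columns = legal_columns(grid)
action = random.choice(columns)  # a uniformly random legal move
print(columns, action)
```

Restricting the choice to the legal columns is what prevents the `IndexError` seen earlier.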

Let's keep on playing until some player wins...

In [None]:
reward, game_over = env.act(action=3)
print("reward =", reward, "game_over =", game_over) 
reward, game_over = env.act(action=1)
print("reward =", reward, "game_over =", game_over)
reward, game_over = env.act(action=3)
print("reward =", reward, "game_over =", game_over)
reward, game_over = env.act(action=1)
print("reward =", reward, "game_over =", game_over)
reward, game_over = env.act(action=3)
print("reward =", reward, "game_over =", game_over)

#### Note that the `reward` returned by the `act()` method is the reward for player `'o'`.

You can reset the game using the `reset()` method. This method clears the grid and makes sure that it is the `starting_player`'s turn, as defined earlier.

In [None]:
env.reset()
reward, game_over = env.act(1)

Feel free to modify existing or add new methods to the `Connect` class.

## Q-learning

**Your opponent is always the first player. Your agent is always the second player.**

For your reference, the pseudo-code for Q-learning is reproduced below from the textbook (Reinforcement Learning, Sutton & Barto, 1998, Section 6.5).
<img src="images/q_learning.png" style="width: 600px;"/>
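
The core of the pseudo-code is the update $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma \max_{a'} Q(s',a') - Q(s,a)]$. The sketch below shows that single step for a tabular agent. The function name `q_learning_update`, the dictionary keyed by `(state, action)` pairs, and the default values of `alpha` and `gamma` are assumptions made for illustration; you are free to organise your own implementation differently.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, next_actions,
                      alpha=0.5, gamma=0.7):
    """Apply one tabular Q-learning update in place and return the new value."""
    # Value of the best action available in the next state.
    # A terminal state has no actions, so its value is taken to be 0.
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q[("s1", 0)] = 1.0

new_value = q_learning_update(q, "s0", 2, reward=0.0,
                              next_state="s1", next_actions=[0, 1])
print(new_value)
```

Note that the maximum is taken over the actions available in the next state, regardless of which action the agent actually takes there; this is what makes Q-learning an off-policy method.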

Prepare a **learning curve** following the directions below. We refer to this as Plot 1.

After $n$ steps of interaction with the environment, play $m$ games with the current policy of the agent (without modifying the policy). Think of this as interrupting the agent for a period of time to test how well it has learned so far. Your plot should show the total score obtained in these $m$ games as a function of $n, 2n, 3n, … kn$. The choices of $n$ and $k$ are up to you. They should be reasonable values that demonstrate the efficiency of the learning and how well the agent learns to play the game eventually. Use $m=10$. 

This plot should show the mean performance of `a` agents, not the performance of a single agent. Because of the stochasticity in the environment, you will obtain two different learning curves from two different agents even though they are using exactly the same algorithm. We suggest setting `a` to 20 or higher.

Present a single mean learning curve with your choice of parameters $\epsilon$ and $\alpha$. The plot should also show (as a baseline) the mean performance of a random agent that does not learn but chooses actions uniformly randomly from among the legal actions. Label this line “Random Agent”. 
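
The evaluation protocol described above can be sketched as follows. `DummyAgent`, `learning_curve` and `mean_curve` are hypothetical names, and the dummy agent's score is a deterministic placeholder rather than the result of real games; the point is only the structure of alternating training with testing and then averaging across agents.

```python
from statistics import mean


class DummyAgent:
    """Hypothetical agent whose test score grows with the amount of training."""

    def __init__(self, head_start):
        self.head_start = head_start
        self.trained = 0

    def train(self, n):
        self.trained += n

    def evaluate(self, m):
        # Deterministic stand-in for "play m test games and return the score".
        return min(m, self.head_start + self.trained // 1000)


def learning_curve(agent, n, k, m):
    """Alternate k times between n units of training and an m-game test."""
    scores = []
    for _ in range(k):
        agent.train(n)
        scores.append(agent.evaluate(m))
    return scores


def mean_curve(curves):
    """Average several agents' curves stage by stage."""
    return [mean(stage) for stage in zip(*curves)]


curves = [learning_curve(DummyAgent(head_start), n=1000, k=3, m=10)
          for head_start in (0, 2)]
print(mean_curve(curves))
```

In a real run, `evaluate` would play the $m$ test games without updating the Q-table, and the same averaging would be applied to the random baseline.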

Please include this plot as a static figure in the appropriate cell below. That is, compute the learning curve in the lab or at home (this may take a couple of minutes depending on your implementation) and save the figure in the same directory as your notebook. Import this figure in the appropriate answer cell under (A). You can look at the source code of this markdown cell (double click on it!) to find out how to embed figures using html. Do **not** use drag & drop to include figures; we would not be able to see them! Make sure to include the locally stored images in your submission. 

In [None]:
'''
import numpy as np
import matplotlib.pyplot as plt

n = 1000
k = 20
m = 10

a = 5

scores = np.zeros( (a, k) )
for A in range(a): #Each Agent
    
    print('Agent ', A)
    game = qLearning()
    for K in range(k): #Learning Stages
        #game.learn(n)
        
        _, score = game.rand_game(m)
        scores[A][K] = score
        print('Score: ', score, ' (', K, '/', k,')')
        
averageScores = np.zeros(k)
for C in range(k):
    total = 0
    for R in range(a):
        total = total + scores[R][C]
        
    averageScores[C] = total / a
    
print(scores)
print('')
print(averageScores)

plt.plot(averageScores)
plt.ylabel('Score')
plt.xlabel('# of games learnt from')
plt.show()
'''

(A) [continued] Insert your static learning curve here (Plot 1).

<img src="images/20_agents.png" style="width: 600px;"/>
<img src="images/20_agents_overlap.png" style="width: 600px;"/>

For these graphs, I used the following values:

- 20 agents
- n = 1000
- k = 20
- m = 10
- epsilon = 0.7
- alpha = 0.5
- gamma = 0.7 (discount factor)


(B) In 3 sentences or less, explain your conclusions from the plot above. How close does your (average) agent get to the best possible level of performance? How efficiently does your (average) agent learn? 

The plots above show that training clearly improves the agent's performance. The best possible score is 10 (winning all $m=10$ test games); by the end of training the agent does not reach this level but wins about 4 more games than the opponent on average, which corresponds to a win rate of roughly 70%. From what I read about Q-learning, around 100,000 training games is a common figure, whereas this agent reaches a 70% win rate after only 20,000, so it does not need many games, although training is computationally slow (about 30 minutes for a total of 20 × 20,000 = 400,000 games).
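
As a quick check of the 70% figure, assume every test game ends in either a win or a loss, and let $w$ and $l$ be the numbers of wins and losses in $m = 10$ games:

$$
w - l = 4, \qquad w + l = 10 \quad\Longrightarrow\quad w = \frac{10 + 4}{2} = 7, \qquad \frac{w}{m} = \frac{7}{10} = 70\%.
$$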


(C) In five sentences or less, explain the key aspects of your implementation. How many state-action pairs do you represent in your Q-table? Describe and justify your settings of $\alpha$ and $\epsilon$. Are there any things you tried out that are not in your final implementation?

My implementation is built around a game class and a qLearning class. The game class wraps connect.py and provides helper functions, such as making the opponent play a random valid move. The qLearning class is more detailed: it has an epsilon-greedy function that chooses the agent's move, functions that make it easy to update the Q-table, and so on.
I store the Q-table in a dictionary instead of a fixed-size array, because many states are impossible to reach (for example, a grid full of x) and others are reached only rarely. Each state has 5 possible actions, although some states cannot execute all 5 because some columns are full.
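
A minimal sketch of the lazy-dictionary idea is shown here, separate from the submitted code. The names `q_table`, `new_action_values` and `upper_bound` are illustrative only. The bound simply counts every way of filling 15 cells with three symbols, including arrangements that can never occur in a game, multiplied by 5 actions.

```python
from collections import defaultdict

N_COLUMNS = 5
N_CELLS = 15  # 5 columns x 3 rows


def new_action_values():
    """One Q-value per column, created only when a state is first seen."""
    return [0.0] * N_COLUMNS


q_table = defaultdict(new_action_values)

# States are hashable tuples, so they can be used directly as dictionary keys.
empty_state = (0,) * N_CELLS
q_table[empty_state][2] = 0.25

# Loose upper bound on the number of state-action pairs a full array would need.
upper_bound = 3 ** N_CELLS * N_COLUMNS
print(len(q_table), upper_bound)
```

Only visited states take up memory, so the dictionary stays far smaller than an array sized for the upper bound.
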
I set alpha to 0.5: in this game a terminal state is reached very quickly because the opponent has no strategy behind its moves, so the agent does not need a long memory, but winning is not instantaneous. Epsilon is set to 0.7, which I found gave a good balance between moving randomly, so that the Q-table is updated with a variety of values, and finding an optimal solution.
One thing I did not keep in my implementation (because of the coursework specification) was varying the epsilon value. At the beginning, when the agent knows nothing, the best way to get good Q-values is to try as many possibilities as possible (epsilon = 1); I would then slowly decrease it to around 0.45, because the more the agent learns, the more it pays to refine the Q-values by exploiting them.
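
The decaying-epsilon idea described above could look roughly like this. It is a sketch of one possible linear schedule, not the code that was removed from the submission; `epsilon_at` and its parameter names are hypothetical, and the endpoints 1.0 and 0.45 are the values mentioned in the text.

```python
def epsilon_at(episode, total_episodes, start=1.0, end=0.45):
    """Linearly anneal epsilon from `start` to `end` over `total_episodes`."""
    # Clamp so that episodes outside [0, total_episodes] stay within the range.
    fraction_done = min(max(episode / total_episodes, 0.0), 1.0)
    return start + (end - start) * fraction_done


schedule = [round(epsilon_at(e, 1000), 3) for e in (0, 500, 1000)]
print(schedule)
```

Early episodes are then almost fully exploratory, while later episodes rely increasingly on the learned Q-values.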


(D) In the cell below, make it possible for us to produce from scratch a learning curve similar to Plot 1 but for a single agent, for a $k$ value of your own choosing. You do not need to include the baseline for random play.  This code should run in less than 30 seconds (ours runs in 2 seconds). 


In [None]:
import numpy as np
import connect

class connect3_Game():
    
    def __init__(self, VB = False):
        self.env = connect.Connect(starting_player='x', verbose=VB)
        self.state = self.env.grid
        
        self.agent = 'o'
        self.opponent = 'x'
        self.winner = False
        self.totalMoves = 0
        
    def updateGame(self, reward, game_over):
        self.totalMoves = self.totalMoves + 1
        
        if game_over:
            self.winner = self.opponent
            if reward == 1:
                self.winner = self.agent
        
    def opponent_move(self):
        '''
            Has the opponent play in a random (available) location
        '''
        #print("Opponent Move")
        action = np.random.choice( self.env.available_actions )
        reward, game_over = self.env.act( action )
        self.updateGame(reward, game_over)
        
        return reward, action
        
    def agent_move(self, pos):
        #print("Agent Move")
        reward, game_over = self.env.act(pos)
        self.updateGame(reward, game_over)
        return reward
       
        

        
class qLearning():
    
    def __init__(self):
        self.qTable = dict()
        
        self.trueEpsilon = .7
        self.epsilon = self.trueEpsilon
        
        self.alpha = .5
        self.discountFactor = .7
        self.currentEpisode = 0
        self.lookAhead = 0 #How many actions to look forward when estimating max reward
        self.newGame = connect3_Game(VB = False)
        
        self.agent = 'o'
        self.opponent = 'x'
        
        
        
    def setupQValue(self, state):
        if state in self.qTable:
            return
        
        self.qTable[state] = np.zeros(5, dtype = float)
        
    def updateQValue(self, state, move, value):
        self.setupQValue(state)
        self.qTable[state][move] = value
        
    def getQValue(self, state, move):
        self.setupQValue(state)
        return self.qTable[state][move]
    
    
    def getBestMove(self, state, game):
        self.setupQValue(state)
        best = -(10 ** 10)
        usePos = 0
        i = 0
        for j in self.qTable[state]:
            if j >= best and np.isin(i, game.env.available_actions):
                best = j
                usePos = i
            
            i = i + 1
        
        #print("Best Q Value: ", best)
        return usePos
        
    
    def hashGrid(self, game):
        state = [0] * 15
        i = 0
        for row in game.env.grid:
            for bit in row:
                if bit == game.agent:
                    state[i] = 1
                elif bit == game.opponent:
                    state[i] = -1
                    
                i = i + 1
                
        return tuple(state)

        
    
    def eGreedy(self, game):
        r = np.random.uniform()
        best = self.getBestMove( self.hashGrid(game), game )
        
        #Explore with probability epsilon. Note: a best move of column 0 is also replaced by a random move.
        if r < self.epsilon or best == 0: #Choose a random move
            return np.random.choice( game.env.available_actions )
        else: #Choose best move
            return best
        
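    def epsilonFactor(self, factor):
        #Unused in the final version. Interpolates epsilon between minE (factor = 0) and maxE (factor = 1).
        #minE and maxE are not set in __init__, so they must be assigned before this method is called.
        self.epsilon = self.minE + (self.maxE - self.minE) * factor
        #print("epsilon: ", self.epsilon)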
        
    
    def getPossibleGames(self, game):
        games = [None] * 5
        for i in range(5):
            if np.isin(i, game.env.available_actions):
                tempGame = connect3_Game(VB = False)
                tempGame.env.grid = np.copy( game.env.grid )
                tempGame.env.active_player = game.env.active_player
                tempGame.env.act(i)
                
                games[i] = tempGame 
                
        return games
                
        
    def estimatedReward(self, game, level):
        trueMAX = -(10 ** 10)
        MAX = trueMAX
        
        possibleGames = self.getPossibleGames(game)
        i = 0
        for GAME in possibleGames:
            if GAME:
                HASH = self.hashGrid(GAME)
                qVal = self.getQValue(HASH, i)
                if qVal > MAX:
                    MAX = qVal
                    
            i = i + 1
            
        if MAX == trueMAX:
            MAX = 0
            
        #if MAX != 0:
            #print("Estimated: ", MAX)
            
        return MAX
                
        
        
    def rewardHandler(self, reward, game, state, action):
        currentQ = self.getQValue(state, action)
        
        estimateMax = self.estimatedReward(game, 0)
        #print(currentQ, reward, estimateMax)
        qValue = (1-self.alpha) * currentQ + self.alpha * (reward + self.discountFactor * estimateMax)
        #print("New Q value: ", qValue)
        
        #if currentQ != qValue:
            #print("Old: ", currentQ, " | New: ", qValue)
                
        self.updateQValue(state, action, qValue)

    def setupGame(self):
        self.newGame.env.reset()
        self.newGame.env.active_player = self.opponent
        self.newGame.winner = False
        return self.newGame
    
    
    def run_episode(self, showGrid):
        newGame = self.setupGame()
        self.currentEpisode = self.currentEpisode + 1
        
        moveRandom = False
        newGame.env.act(2) #Starts the opponent in the middle
        
        while not newGame.winner:
            
            if moveRandom:
                state = self.hashGrid(newGame)
                reward, action = newGame.opponent_move()
                self.rewardHandler(reward, newGame, state, action)
                if newGame.winner:
                    #if not showGrid and self.currentEpisode % 100 == 0:
                        #print('Episode ', self.currentEpisode, ' finished, winner: ', newGame.winner)
                    return newGame.winner
            else:
                moveRandom = True
            
            state, action = self.hashGrid(newGame), self.eGreedy(newGame) 
            reward = newGame.agent_move(action)
            self.rewardHandler(reward, newGame, state, action)
            if newGame.winner:
                #if not showGrid and self.currentEpisode % 100 == 0:
                    #print('Episode ', self.currentEpisode, ' finished, winner: ', newGame.winner)
                return newGame.winner
            
    def learn(self, n):
        self.epsilon = self.trueEpsilon
        plays = 0
        for i in range(n):
            #game.epsilonFactor(1 - plays/maxN) #Starts learning by doing it randomly, then allows some use of Q-Values
            self.run_episode(False)
            
            
        #print("Finished learning (", n, " games)")
        
        
        
    def test(self, n):
        self.epsilon = 0
        
        wins = 0
        score = 0
        for i in range(n):
            winner = self.run_episode(False)
            if winner == self.agent:
                wins = wins + 1
                score = score + 1
            elif winner == self.opponent:
                score = score - 1
                
        return wins, score
    
    def tester(self, n):
        best = -1000
        for j in range(1):
            _, score = self.test(n)
            if score > best:
                best = score
                
        return best
        
    
    
    
    def rand_episode(self, showGrid):
        newGame = self.setupGame()

        while not newGame.winner:
            
            reward, action = newGame.opponent_move()
            if newGame.winner:
                return newGame.winner
                
            reward = newGame.agent_move( np.random.choice( newGame.env.available_actions ) )
            if newGame.winner:
                return newGame.winner       
    
    def rand_game(self, n):
        wins = 0
        score = 0
        for i in range(n):
            winner = self.rand_episode(False)
            if winner == self.agent:
                wins = wins + 1
                score = score + 1
            elif winner == self.opponent:
                score = score - 1
                
        return wins, score        

    
    

    
    
    
import matplotlib.pyplot as plt

n = 500
k = 8
m = 10
a = 1

scores = np.zeros( (a, k) )
for A in range(a): #Each Agent
    
    print('Agent ', A)
    game = qLearning()
    for K in range(k): #Learning Stages
        game.learn(n)
        
        score = game.tester(m)
        scores[A][K] = score
        print('Score: ', score, ' (', K, '/', k,')')
        
averageScores = np.zeros(k)
for C in range(k):
    total = 0
    for R in range(a):
        total = total + scores[R][C]
        
    averageScores[C] = total / a
    
#print(scores)
#print('')
#print(averageScores)

plt.plot(np.arange(1, k + 1) * n, averageScores) #x-axis: total games learnt from after each stage
plt.ylabel('Score')
plt.xlabel('# of games learnt from')
plt.show()

# IMPORTANT: How to submit.

If any of the following instructions is not clear, please ask your tutors well ahead of the submission deadline.

### Before you submit
- We will not be able to mark your coursework if it takes more than 1 minute to execute your entire notebook. That is, comment out (but do not delete) the code that you used to produce Plot 1 (i.e., learning curve averaged across many agents). Do **not** comment out the code that you use to produce a learning curve for a single agent (Exercise D).
- Restart the kernel (_Kernel $\rightarrow$ Restart & Run All_) and make sure that you can run all cells from top to bottom without any errors.
- Make sure that your code is written in Python 3 (and not in Python 2!). You can check the Python version of the current session in the top-right corner below the Python logo.

### Submission file
- Please upload to Moodle a .zip file (**not** `.rar`, `.7z`, or any other archive format) that contains the completed Jupyter notebook (`ai4_connect_three.ipynb`) as well as the pre-computed figure(s). 
- **If** you change the `connect.py` file or write your own version of the environment, include the corresponding file in your submission, but give it any other name than `connect.py`. If you do not change its name, it will be overwritten  and we won't be able to execute your code! Make sure that you import the correct module when you rename your file, for example, use `import myConnect` if your file is called `myConnect.py`.
- Do not include any identifying information. Not in the code cells, not in the file names, nowhere! Marking is anonymous.

In [None]:
'''
game = qLearning()

game.learn(2000)
    
testGames = 100
wins, score = game.test(testGames)
print("Score: ", score)
print("Win rate: ", wins/testGames * 100)
'''