# Trexquant Interview Project (The Hangman Game)

* Copyright Trexquant Investment LP. All Rights Reserved. 
* Redistribution of this question without written consent from Trexquant is prohibited

## Instruction:
For this coding test, your mission is to write an algorithm that plays the game of Hangman through our API server. 

When a user plays Hangman, the server first selects a secret word at random from a list. The server then returns a row of space-separated underscores, one for each letter in the secret word, and asks the user to guess a letter. If the user guesses a letter that is in the word, the word is redisplayed with all instances of that letter shown in the correct positions, along with any letters correctly guessed on previous turns. If the letter does not appear in the word, the user is charged with an incorrect guess. The user keeps guessing letters until either (1) the user has correctly guessed all the letters in the word or (2) the user has made six incorrect guesses.
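
The rules above can be tried offline before touching the API. The following is a minimal sketch of a local simulator, not part of the provided code: the names `play_locally` and `alphabetical_guesser` are hypothetical, and counting a repeated guess as a miss is an assumption rather than documented server behaviour.

```python
def play_locally(secret_word, guess_fn, max_misses=6):
    """Play one Hangman game offline; return True if the word is fully revealed.

    guess_fn(masked_word, guessed_letters) must return a single lowercase letter.
    """
    guessed = []
    misses = 0
    while misses < max_misses:
        # Same display format as described above: "_ p p _ e"
        masked = " ".join(c if c in guessed else "_" for c in secret_word)
        if "_" not in masked:
            return True
        letter = guess_fn(masked, guessed)
        if letter in guessed:
            # Assumption: a repeated guess is charged as an incorrect guess
            misses += 1
            continue
        guessed.append(letter)
        if letter not in secret_word:
            misses += 1
    return False


def alphabetical_guesser(masked_word, guessed_letters):
    """Toy strategy: guess the first letter of the alphabet not tried yet."""
    for letter in "abcdefghijklmnopqrstuvwxyz":
        if letter not in guessed_letters:
            return letter
```

With this toy strategy, `play_locally("cab", alphabetical_guesser)` wins after three guesses, while a word such as `"zebra"` runs out of tries long before `z` is reached.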

You are required to write a "guess" function that takes the current word (with underscores) as input and returns a letter to guess. You will use the API code below to play 1,000 Hangman games. You may practice before you start recording your game results.

Your algorithm is permitted to use a training set of approximately 250,000 dictionary words. Your algorithm will be tested on an entirely disjoint set of 250,000 dictionary words. Please note that this means the words that you will ultimately be tested on do NOT appear in the dictionary that you are given. You are not permitted to use any dictionary other than the training dictionary we provided. This requirement will be strictly enforced by code review.

You are provided with a basic, working algorithm. This algorithm matches the provided masked string (e.g. a _ _ l e) against all possible words in the dictionary, tabulates the frequency of letters appearing in these possible words, and then guesses the letter with the highest frequency of appearance that has not already been guessed. If no remaining words match, it falls back to the character frequency distribution of the entire dictionary.

This benchmark strategy is successful approximately 18% of the time. Your task is to design an algorithm that significantly outperforms this benchmark.
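
As a compact illustration of the benchmark described above, here is a sketch of the same idea as a standalone function on a toy word list. The function name `baseline_guess` and the four-word dictionary are made up for this example; it is not the provided implementation.

```python
import collections
import re


def baseline_guess(masked_word, dictionary, guessed_letters):
    """Frequency-based guess in the spirit of the benchmark described above."""
    # "a _ _ l e" -> "a..le": drop the separating spaces, "_" becomes a regex wildcard
    pattern = masked_word[::2].replace("_", ".")
    candidates = [w for w in dictionary if re.fullmatch(pattern, w)]

    # Try the matching words first, then fall back to the whole dictionary
    for pool in (candidates, dictionary):
        counts = collections.Counter("".join(pool))
        for letter, _ in counts.most_common():
            if letter not in guessed_letters:
                return letter
    return None  # every letter in the dictionary has already been guessed


toy_dictionary = ["apple", "ample", "angle", "maple"]
next_letter = baseline_guess("a _ _ l e", toy_dictionary, ["a", "l", "e"])
```

Here three words match `a..le`, and among the letters not yet guessed `p` is the most frequent, so `next_letter` is `"p"`.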

# Imports

In [1]:
from sklearn.model_selection import train_test_split
from torch.nn import Embedding, Linear, ReLU, GRU
from torch.optim import Adam
from torch.nn.utils.rnn import pack_padded_sequence
import torch
from torch import nn
import numpy as np
import pickle
import math
import pandas as pd

import json
import requests
import random
import string
import secrets
import time
import re
import collections

try:
    from urllib.parse import parse_qs, urlencode, urlparse
except ImportError:
    # Python 2 fallback: these functions live in different modules there
    from urlparse import parse_qs, urlparse
    from urllib import urlencode

import warnings
warnings.filterwarnings("ignore")

In [2]:
class HangmanAPI(object):
    def __init__(self, access_token=None, session=None, timeout=None):
        self.hangman_url = self.determine_hangman_url()
        self.access_token = access_token
        self.session = session or requests.Session()
        self.timeout = timeout
        self.guessed_letters = []
        
        full_dictionary_location = "words_250000_train.txt"
        self.full_dictionary = self.build_dictionary(full_dictionary_location)        
        self.full_dictionary_common_letter_sorted = collections.Counter("".join(self.full_dictionary)).most_common()
        
        self.current_dictionary = []
        
    @staticmethod
    def determine_hangman_url():
        links = ['https://trexsim.com', 'https://sg.trexsim.com']

        data = {link: 0 for link in links}

        for link in links:

            requests.get(link)

            for i in range(10):
                s = time.time()
                requests.get(link)
                # accumulate latency over all probes instead of keeping only the last one
                data[link] += time.time() - s

        link = sorted(data.items(), key=lambda x: x[1])[0][0]
        link += '/trexsim/hangman'
        return link

    def guess(self, word): # word input example: "_ p p _ e "
        ###############################################
        # Replace with your own "guess" function here #
        ###############################################

        # clean the word so that we strip away the space characters
        # replace "_" with "." as "." indicates any character in regular expressions
        clean_word = word[::2].replace("_",".")
        
        # find length of passed word
        len_word = len(clean_word)
        
        # grab current dictionary of possible words from self object, initialize new possible words dictionary to empty
        current_dictionary = self.current_dictionary
        new_dictionary = []
        
        # iterate through all of the words in the old plausible dictionary
        for dict_word in current_dictionary:
            # continue if the word is not of the appropriate length
            if len(dict_word) != len_word:
                continue
                
            # if dictionary word is a possible match then add it to the current dictionary
            if re.match(clean_word,dict_word):
                new_dictionary.append(dict_word)
        
        # overwrite old possible words dictionary with updated version
        self.current_dictionary = new_dictionary
        
        
        # count occurrence of all characters in possible word matches
        full_dict_string = "".join(new_dictionary)
        
        c = collections.Counter(full_dict_string)
        sorted_letter_count = c.most_common()                   
        
        guess_letter = '!'
        
        # return most frequently occurring letter in all possible words that hasn't been guessed yet
        for letter,instance_count in sorted_letter_count:
            if letter not in self.guessed_letters:
                guess_letter = letter
                break
            
        # if no word matches in training dictionary, default back to ordering of full dictionary
        if guess_letter == '!':
            sorted_letter_count = self.full_dictionary_common_letter_sorted
            for letter,instance_count in sorted_letter_count:
                if letter not in self.guessed_letters:
                    guess_letter = letter
                    break            
        
        return guess_letter

    ##########################################################
    # You'll likely not need to modify any of the code below #
    ##########################################################
    
    def build_dictionary(self, dictionary_file_location):
        with open(dictionary_file_location, "r") as text_file:
            full_dictionary = text_file.read().splitlines()
        return full_dictionary
                
    def start_game(self, practice=True, verbose=True):
        # reset guessed letters to empty set and current plausible dictionary to the full dictionary
        self.guessed_letters = []
        self.current_dictionary = self.full_dictionary
                         
        response = self.request("/new_game", {"practice":practice})
        if response.get('status')=="approved":
            game_id = response.get('game_id')
            word = response.get('word')
            tries_remains = response.get('tries_remains')
            if verbose:
                print("Successfully start a new game! Game ID: {0}. # of tries remaining: {1}. Word: {2}.".format(game_id, tries_remains, word))
            while tries_remains>0:
                # get guessed letter from user code
                guess_letter = self.guess(word)
                    
                # append guessed letter to guessed letters field in hangman object
                self.guessed_letters.append(guess_letter)
                if verbose:
                    print("Guessing letter: {0}".format(guess_letter))
                    
                try:    
                    res = self.request("/guess_letter", {"request":"guess_letter", "game_id":game_id, "letter":guess_letter})
                except HangmanAPIError:
                    print('HangmanAPIError exception caught on request.')
                    continue
                except Exception as e:
                    print('Other exception caught on request.')
                    raise e
               
                if verbose:
                    print("Server response: {0}".format(res))
                status = res.get('status')
                tries_remains = res.get('tries_remains')
                if status=="success":
                    if verbose:
                        print("Successfully finished game: {0}".format(game_id))
                    return True
                elif status=="failed":
                    reason = res.get('reason', '# of tries exceeded!')
                    if verbose:
                        print("Failed game: {0}. Because of: {1}".format(game_id, reason))
                    return False
                elif status=="ongoing":
                    word = res.get('word')
        else:
            if verbose:
                print("Failed to start a new game")
            # `status` is never assigned on this path, so return explicitly
            return False
        return status=="success"
        
    def my_status(self):
        return self.request("/my_status", {})
    
    def request(
            self, path, args=None, post_args=None, method=None):
        if args is None:
            args = dict()
        if post_args is not None:
            method = "POST"

        # Add `access_token` to post_args or args if it has not already been
        # included.
        if self.access_token:
            # If post_args exists, we assume that args either does not exists
            # or it does not need `access_token`.
            if post_args and "access_token" not in post_args:
                post_args["access_token"] = self.access_token
            elif "access_token" not in args:
                args["access_token"] = self.access_token

        time.sleep(0.2)

        num_retry, time_sleep = 50, 2
        for it in range(num_retry):
            try:
                response = self.session.request(
                    method or "GET",
                    self.hangman_url + path,
                    timeout=self.timeout,
                    params=args,
                    data=post_args,
                    verify=False
                )
                break
            except requests.HTTPError as e:
                response = json.loads(e.read())
                raise HangmanAPIError(response)
            except requests.exceptions.SSLError as e:
                if it + 1 == num_retry:
                    raise
                time.sleep(time_sleep)

        headers = response.headers
        if 'json' in headers['content-type']:
            result = response.json()
        elif "access_token" in parse_qs(response.text):
            query_str = parse_qs(response.text)
            if "access_token" in query_str:
                result = {"access_token": query_str["access_token"][0]}
                if "expires" in query_str:
                    result["expires"] = query_str["expires"][0]
            else:
                raise HangmanAPIError(response.json())
        else:
            raise HangmanAPIError('Maintype was not text, or querystring')

        if result and isinstance(result, dict) and result.get("error"):
            raise HangmanAPIError(result)
        return result
    
class HangmanAPIError(Exception):
    def __init__(self, result):
        self.result = result
        self.code = None
        try:
            self.type = result["error_code"]
        except (KeyError, TypeError):
            self.type = ""

        try:
            self.message = result["error_description"]
        except (KeyError, TypeError):
            try:
                self.message = result["error"]["message"]
                self.code = result["error"].get("code")
                if not self.type:
                    self.type = result["error"].get("type", "")
            except (KeyError, TypeError):
                try:
                    self.message = result["error_msg"]
                except (KeyError, TypeError):
                    self.message = result

        Exception.__init__(self, self.message)

# Split Data into train and test dataset

In [3]:
development_words_location = "data/words_250000_train.txt"
with open(development_words_location, "r") as fp:
    development_words = fp.read().splitlines()

In [4]:
train_words, test_words = train_test_split(development_words, test_size=0.2, random_state=42, shuffle=True)
print(len(train_words), len(test_words))

181840 45460


In [5]:
print(train_words[:5])
print(test_words[:5])

['exhilarating', 'clonic', 'semiphenomenally', 'preascertaining', 'benoit']
['timpani', 'worsle', 'yinst', 'grangerized', 'matatua']


In [6]:
train_location = "data/train_words.txt"
with open(train_location, "w") as fp:
    for train_word in train_words:
        fp.write(f"{train_word}\n")

In [7]:
test_location = "data/test_words.txt"
with open(test_location, "w") as fp:
    for test_word in test_words:
        fp.write(f"{test_word}\n")

# Build model architecture

In [8]:
class HangmanGRU(nn.Module):
    def __init__(
        self, 
        vocab_size,
        gru_hidden_dim,
        gru_num_layers,
        char_embedding_dim,
        missed_char_linear_dim,
        nn_hidden_dim,
        gru_dropout,
        learning_rate
    ):
        super(HangmanGRU, self).__init__()

        ## Different model dimensions
        self.gru_hidden_dim = gru_hidden_dim 
        self.gru_num_layers = gru_num_layers

        ## Embedding layer for character input
        self.embedding = Embedding(vocab_size + 1, char_embedding_dim)

        ## Missed characters linear layer
        self.missed_characters_linear_layer = Linear(vocab_size, missed_char_linear_dim) 

        ## Declare GRU
        self.hangman_gru = GRU(
            input_size = char_embedding_dim, 
            hidden_size = self.gru_hidden_dim, 
            num_layers = self.gru_num_layers,
            dropout = gru_dropout,
            bidirectional=True, 
            batch_first=True
        )
            
        # NN after GRU output
        nn_in_features = missed_char_linear_dim + (self.gru_hidden_dim * 2)
        self.nn_hidden_layer = Linear(nn_in_features, nn_hidden_dim)
        self.relu = ReLU()
        self.nn_output_layer = Linear(nn_hidden_dim, vocab_size)

        ## Set up optimizer
        self.optimizer = Adam(self.parameters(), lr=learning_rate)

    def forward(self, x, x_lengths, missed_characters):
        x = self.embedding(x)
        batch_size, seq_len, _ = x.size()
        ## pack_padded_sequence expects the lengths tensor on the CPU
        x = pack_padded_sequence(x, x_lengths.cpu(), batch_first=True, enforce_sorted=False)
        
        ## Run through GRU
        output, hidden = self.hangman_gru(x)
        hidden = hidden.view(self.gru_num_layers, 2, -1, self.gru_hidden_dim)
        hidden = hidden[-1]
        hidden = hidden.permute(1, 0, 2)
        hidden = hidden.contiguous().view(hidden.shape[0], -1)

        ## Project missed_characters to higher dimension
        missed_characters = self.missed_characters_linear_layer(missed_characters)
        
        ## Concatenate GRU output and missed_characters
        concatenated = torch.cat((hidden, missed_characters), dim=1)
        
        ## Run NN after GRU
        nn_output = self.nn_hidden_layer(concatenated)
        nn_output = self.relu(nn_output)
        nn_output = self.nn_output_layer(nn_output)
        return nn_output

    def calculate_loss(self, model_out, labels, input_lengths, missed_characters, use_cuda):
        outputs = nn.functional.log_softmax(model_out, dim=1)
        
        ## Sum of log-probabilities assigned to already-missed characters, averaged over the batch
        miss_penalty = torch.sum((outputs * missed_characters), dim=(0,1))/outputs.shape[0]
        
        ## Convert input lengths to float
        input_lengths = input_lengths.float()
        
        ## Per-example weights are inversely proportional to word length and sum to 1
        ## This is because shorter words are harder to predict due to higher chances of missing a character
        weights = (1/input_lengths)/torch.sum(1/input_lengths)
        
        ## Shape (batch, 1) so the weights broadcast across the vocab dimension
        weights = weights.unsqueeze(-1)

        if use_cuda:
            weights = weights.cuda()
        
        ## Actual Loss
        loss_function = nn.BCEWithLogitsLoss(weight=weights, reduction='sum')
        actual_penalty = loss_function(model_out, labels)
        return actual_penalty, miss_penalty
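
To make the length weighting in `calculate_loss` concrete, the following plain-Python sketch reproduces only the weight arithmetic (inverse length, normalised to sum to 1). The helper name `length_weights` is hypothetical and the snippet deliberately leaves out the BCE loss itself.

```python
def length_weights(word_lengths):
    """Per-example weights proportional to 1/length, normalised to sum to 1."""
    inverse = [1.0 / n for n in word_lengths]
    total = sum(inverse)
    return [x / total for x in inverse]


# A 3-letter word gets four times the weight of a 12-letter word:
# the weights are 4/7, 2/7 and 1/7
weights = length_weights([3, 6, 12])
```

Because the weights are normalised per batch, the scale of the summed loss does not depend on batch size, only on the mix of word lengths in the batch.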

# Save Data Encoding Functions

In [9]:
def filter_and_encode(word, vocab_size, min_allowed_word_lengh, char_to_id):
	## Strip surrounding whitespace, lower-case the word, and skip words shorter than the minimum length
	word = word.strip().lower()
	if len(word) < min_allowed_word_lengh:
		return None, None, None

	encoded_word = np.zeros((len(word), vocab_size + 1))
	
	## Char location dict
	## For Ex 'goto', char_location_dict = {'g_id':[0], 'o_id':[1, 3], 't_id':[2]}
	char_location_dict = {k: [] for k in range(vocab_size)}

	for i, c in enumerate(word):
		idx = char_to_id[c]
		char_location_dict[idx].append(i)
		encoded_word[i][idx] = 1

	## Char location list
	## For Ex 'goto', char_location_list = [[0], [1, 3], [2]]
	char_location_list = [x for x in char_location_dict.values() if(len(x) > 0)]

	## word_set
	## For Ex 'goto', word_set = {'g', 'o', 't'}
	word_set = set(word)
	return encoded_word, char_location_list, word_set

def get_one_hot_encoded_words(
    word_list,
    vocab_size,
    min_allowed_word_lengh
):
    char_to_id = {chr(97+x): x for x in range(vocab_size)}
    char_to_id['BLANK'] = vocab_size
    encoded_word_list = []
    for word in word_list:
        encoded_word, char_location_list, word_set = filter_and_encode(word, vocab_size, min_allowed_word_lengh, char_to_id)
        if encoded_word is not None:
            encoded_word_list.append((encoded_word, char_location_list, word_set))
    return encoded_word_list
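
The bookkeeping done by `filter_and_encode` can be checked by hand on a short word. The sketch below is a simplified stand-in (the name `describe_word` is hypothetical) that returns character ids instead of one-hot rows and assumes the word contains only the letters a-z.

```python
def describe_word(word):
    """Return (char ids, grouped positions, letter set) for a lowercase a-z word."""
    char_ids = [ord(c) - ord("a") for c in word]

    # Group positions by character id, ordered by id as in char_location_list
    positions = {}
    for i, idx in enumerate(char_ids):
        positions.setdefault(idx, []).append(i)
    char_location_list = [positions[idx] for idx in sorted(positions)]

    return char_ids, char_location_list, set(word)


char_ids, char_location_list, word_set = describe_word("goto")
# char_ids           -> [6, 14, 19, 14]
# char_location_list -> [[0], [1, 3], [2]]
# word_set           -> {'g', 'o', 't'}
```

Grouping positions by letter matters later: when a letter is masked, every occurrence of it must be masked together, exactly as the game reveals all occurrences at once.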

## Save Encoded Train Words

In [10]:
## Save Encoded Train Words
train_words_location = "data/train_words.txt"
with open(train_words_location, "r") as fp:
    train_words = fp.read().splitlines()
train_words[:5]

['exhilarating', 'clonic', 'semiphenomenally', 'preascertaining', 'benoit']

In [11]:
encoded_train_words = get_one_hot_encoded_words(
    word_list = train_words,
    vocab_size = 26,
    min_allowed_word_lengh = 3
)
encoded_train_words[:1]

[(array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0

In [12]:
print(len(train_words), len(encoded_train_words))

181840 181622


In [13]:
encoded_train_words_location = "data/encoded_train_words.pickle"
with open(encoded_train_words_location, "wb") as fp:
    pickle.dump(encoded_train_words, fp)

## Save Encoded Test Words

In [14]:
## Save Encoded Test Words
test_words_location = "data/test_words.txt"
with open(test_words_location, "r") as fp:
    test_words = fp.read().splitlines()
test_words[:5]

['timpani', 'worsle', 'yinst', 'grangerized', 'matatua']

In [15]:
encoded_test_words = get_one_hot_encoded_words(
    word_list = test_words,
    vocab_size = 26,
    min_allowed_word_lengh = 3
)
print(len(test_words), len(encoded_test_words))
encoded_test_words[:1]

45460 45397


[(array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
  [[4], [1, 6], [2], [5], [3], [0]],
  {'a', 'i', 'm', 'n', 'p', 't'})]

In [16]:
encoded_test_words_location = "data/encoded_test_words.pickle"
with open(encoded_test_words_location, "wb") as fp:
    pickle.dump(encoded_test_words, fp)

# Get current epoch data function

In [17]:
def get_current_epoch_data(
	encoded_word_list, 
	epoch_number, 
	total_epochs,
	vocab_size
):
	## As training progresses, the probability of dropping chars increases:
	## sigmoid(epoch_number/total_epochs) rises from about 0.50 to about 0.73
	drop_char_probability = 1/(1+np.exp(-epoch_number/total_epochs))
	cur_epoch_data_list = []
	all_character_set = set([chr(97+x) for x in range(vocab_size)])
	char_to_id = {chr(97+x): x for x in range(vocab_size)}
	char_to_id['BLANK'] = vocab_size

	for i, (encoded_word, char_location_list, word_set) in enumerate(encoded_word_list):
		## Number of characters to drop
		num_char_to_drop = np.random.binomial(len(char_location_list), drop_char_probability)
		if num_char_to_drop == 0:
			num_char_to_drop = 1

		## Drop chars inversely proportional to number of occurrences of each character
		## For Ex: goto, char_location_list = [[0], [1, 3], [2]]
		## drop_char_probability_list = [0.4, 0.2, 0.4]
		## to_drop = [0, 1]
		drop_char_probability_list = [1/len(x) for x in char_location_list]
		drop_char_probability_list = [x/sum(drop_char_probability_list) for x in drop_char_probability_list]
		to_drop = np.random.choice(len(char_location_list), num_char_to_drop, p=drop_char_probability_list, replace=False)

		## Char positions to drop
		## For Ex: goto, char_location_list = [[0], [1, 3], [2]] and to_drop = [0, 1]
		## drop_char_idx = [0, 1, 3]
		drop_char_idx = []
		for char_group in to_drop:
			drop_char_idx += char_location_list[char_group]
		
		## The dropped characters form the model target
		## Assuming vocab_size = 4 (BLANK column omitted for brevity)
		## For Ex: goto, encoded_word = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
		## unclipped_target = [1, 0, 2, 0]
		## target = [1, 0, 1, 0]
		unclipped_target = np.sum(encoded_word[drop_char_idx], axis=0)
		target = np.clip(unclipped_target, 0, 1)

		## Remove blank in target
		target = target[:-1]
		
		## Drop chars and assign blank_character
		input_vec = np.copy(encoded_word)
		blank_vec = np.zeros((1, vocab_size + 1))
		blank_vec[0, vocab_size] = 1
		input_vec[drop_char_idx] = blank_vec

		## Provide character id instead of 1-hot encoded vector for embedding
		input_vec = np.argmax(input_vec, axis=1)
		## For Ex: goto, encoded_word = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
		## drop_char_idx = [0, 1, 3]
		## target = [1, 0, 1, 0]
		## input_vec = [26, 26, 19, 26] (26 = BLANK, 19 = t)
		
		## Randomly pick up to 9 characters that are not in the word to simulate earlier wrong guesses
		## (np.random.choice samples with replacement here, so duplicates can leave fewer distinct misses)
		not_present_char_sorted_array = np.array(sorted(list(all_character_set - word_set)))
		num_missed_chars = np.random.randint(0, 10)
		miss_char_sorted_array = np.random.choice(not_present_char_sorted_array, num_missed_chars)
		miss_char_id_sorted_list = [char_to_id[x] for x in miss_char_sorted_array]
		## Ex word is 'goto', num_missed_chars = 2, miss_char_id_sorted_list = [1, 3] 
		## (which correspond to the characters b and d)
		
		miss_vec = np.zeros(vocab_size)
		miss_vec[miss_char_id_sorted_list] = 1
		## If vocab_size = 6, b = 1, d = 3 and b, d are missed
		## miss_vec = [0, 1, 0, 1, 0, 0]
		
		## Append tuple to cur_epoch_data_list
		cur_epoch_data_list.append((input_vec, target, miss_vec))

	## Shuffle dataset before feeding batches to the model
	np.random.shuffle(cur_epoch_data_list)
	return cur_epoch_data_list
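
The masking schedule used by `get_current_epoch_data` is easy to tabulate on its own. This sketch recomputes only the drop probability with the standard library; the helper name `drop_probability` is made up for illustration.

```python
import math


def drop_probability(epoch_number, total_epochs):
    """Sigmoid of the training progress epoch_number / total_epochs."""
    return 1.0 / (1.0 + math.exp(-epoch_number / total_epochs))


# Probability at the start, middle and end of a 100-epoch run
schedule = {epoch: drop_probability(epoch, 100) for epoch in (1, 50, 100)}
```

Since the sigmoid argument only runs from 0 to 1, the probability stays between roughly 0.50 and 0.73, so even in the first epoch about half of the distinct letters are masked on average.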

## Test current epoch data function

In [18]:
encoded_train_words_location = "data/encoded_train_words.pickle"
with open(encoded_train_words_location, "rb") as fp:
    encoded_train_word_list = pickle.load(fp)
encoded_train_word_list[:1]

[(array([[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0

In [19]:
cur_epoch_train_data_list = get_current_epoch_data(
	encoded_word_list = encoded_train_word_list, 
	epoch_number = 24, 
	total_epochs = 100,
	vocab_size = 26
)
cur_epoch_train_data_list[:1]

[(array([26, 26, 22, 26, 26, 26, 18]),
  array([0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0.]),
  array([0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         1., 0., 0., 0., 0., 0., 0., 1., 1.]))]
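
The sample printed above is easier to read once it is mapped back to letters. The helper below is a hypothetical convenience (not part of the notebook) that assumes id 26 is the blank and ids 0 to 25 are `a` to `z`, as in the cells above.

```python
import string

BLANK_ID = 26  # id used for masked positions in the cells above


def decode_sample(input_ids, target, missed):
    """Render one (input_vec, target, miss_vec) triple as readable strings."""
    letters = string.ascii_lowercase
    masked = "".join("_" if i == BLANK_ID else letters[int(i)] for i in input_ids)
    hidden = "".join(letters[k] for k, flag in enumerate(target) if flag)
    wrong = "".join(letters[k] for k, flag in enumerate(missed) if flag)
    return masked, hidden, wrong


# The sample shown above, written as index lists for brevity
target = [1 if k in (3, 4, 11, 14) else 0 for k in range(26)]
missed = [1 if k in (1, 6, 17, 24, 25) else 0 for k in range(26)]
decoded = decode_sample([26, 26, 22, 26, 26, 26, 18], target, missed)
# decoded -> ('__w___s', 'delo', 'bgryz')
```

In words: the model sees `__w___s`, is told that `b`, `g`, `r`, `y` and `z` were wrong guesses, and should assign high scores to `d`, `e`, `l` and `o`.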

In [20]:
len(encoded_train_word_list), len(cur_epoch_train_data_list)

(181622, 181622)

# Get current batch data function

In [21]:
def batchify_words(input_vec_list, vocab_size):
    total_seq = len(input_vec_list)
    max_len = max([len(x) for x in input_vec_list])
    batched_input_list = []

    for word in input_vec_list:
        if max_len != len(word):
            ## Add blanks to get max len
            blank_vec = (vocab_size * np.ones((max_len - word.shape[0])))
            word = np.concatenate((word, blank_vec), axis=0)
        batched_input_list.append(word)

    return np.array(batched_input_list)

def get_cur_batch_data(
    cur_epoch_data_list, 
    batch_id, 
    batch_size,
    vocab_size
):
    ## The end index is clipped to the list length, so the last batch may be smaller
    start_index = (batch_id * batch_size)
    end_index = min((batch_id + 1) * batch_size, len(cur_epoch_data_list))
    cur_batch_data_list = cur_epoch_data_list[start_index: end_index]
    
    ## Convert to numpy arrays
    word_length_array = np.array([len(x[0]) for x in cur_batch_data_list])
    input_vec_list = [x[0] for x in cur_batch_data_list]
    batched_input_array = batchify_words(input_vec_list, vocab_size)
    batched_label_array = np.array([x[1] for x in cur_batch_data_list])
    batched_missed_char_array = np.array([x[2] for x in cur_batch_data_list])

    ## Return batch
    return batched_input_array, batched_label_array, batched_missed_char_array, word_length_array
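
The two ideas in this cell, right-padding with the blank id and slicing the epoch data into batches, can be sketched without NumPy. The helper names `pad_to_max_length` and `batch_bounds` are hypothetical and work on plain lists rather than arrays.

```python
import math


def pad_to_max_length(id_sequences, blank_id=26):
    """Right-pad every id sequence with blank_id up to the longest length."""
    max_len = max(len(seq) for seq in id_sequences)
    return [list(seq) + [blank_id] * (max_len - len(seq)) for seq in id_sequences]


def batch_bounds(num_examples, batch_size):
    """(start, end) slice bounds for every batch; the last one may be short."""
    num_batches = math.ceil(num_examples / batch_size)
    return [(b * batch_size, min((b + 1) * batch_size, num_examples))
            for b in range(num_batches)]


padded = pad_to_max_length([[19, 26], [6, 14, 19, 14]])
bounds = batch_bounds(181622, 4000)
```

With 181,622 encoded training words and a batch size of 4,000 this gives 46 batches, the last covering indices 180000 to 181622. Padding reuses the blank id, which is harmless here because the true lengths are passed to `pack_padded_sequence` in the model's `forward`.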

## Test Get current batch data function

In [22]:
batched_input_array, batched_label_array, batched_missed_char_array, word_length_array = get_cur_batch_data(
    cur_epoch_data_list = cur_epoch_train_data_list, 
    batch_id = 2, 
    batch_size = 4000,
    vocab_size = 26
)

In [23]:
print(
    len(batched_input_array),
    len(batched_label_array),
    len(batched_missed_char_array),
    len(word_length_array)
)

4000 4000 4000 4000


In [24]:
batched_input_array[:1]

array([[19., 26., 26., 26., 26., 20., 26., 26., 26., 26., 26., 26., 26.,
        26., 26., 26., 26., 26., 26., 26., 26., 26., 26.]])

In [25]:
batched_label_array[:1]

array([[1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]])

In [26]:
batched_missed_char_array[:1]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        1., 1., 0., 0., 0., 1., 0., 0., 0., 1.]])

In [27]:
word_length_array[:5]

array([ 6,  4, 14,  9,  9])

# Test Bidirectional GRU Function

In [28]:
def test(
	epoch,
	model,
	total_epochs,
	encoded_test_word_list,
	batch_size,
	vocab_size,
	cuda
):
	model.eval()

	## Initialize epoch loss
	test_loss = 0.0
	test_miss_penalty = 0.0

	## Without gradient update
	with torch.no_grad():
		## Get cur_epoch_test_data_list
		cur_epoch_test_data_list = get_current_epoch_data(
			encoded_word_list = encoded_test_word_list, 
			epoch_number = epoch,
			total_epochs = total_epochs,
			vocab_size = vocab_size
		)

		## Loop over batches
		no_batches = int(math.ceil(len(cur_epoch_test_data_list) / batch_size))
		for batch_id in range(no_batches):
			## Get batch
			inputs, labels, miss_chars, input_lengths = get_cur_batch_data(
				cur_epoch_data_list = cur_epoch_test_data_list, 
				batch_id = batch_id, 
				batch_size = batch_size,
				vocab_size = vocab_size
			)
			
			## Embeddings should be of dtype long
			inputs = torch.from_numpy(inputs).long()
			
			## Convert to torch tensors
			labels = torch.from_numpy(labels).float()
			miss_chars = torch.from_numpy(miss_chars).float()
			input_lengths = torch.from_numpy(input_lengths).long()

			if(cuda==True):
				inputs = inputs.cuda()
				labels = labels.cuda()
				miss_chars = miss_chars.cuda()
				input_lengths = input_lengths.cuda()

			# zero the parameter gradients
			model.optimizer.zero_grad()
			
			# Forward Pass
			outputs = model(inputs, input_lengths, miss_chars)
			loss, miss_penalty = model.calculate_loss(outputs, labels, input_lengths, miss_chars, cuda)
			test_loss += loss.item()
			test_miss_penalty += miss_penalty.item()

	# Average out the losses
	test_loss = (test_loss / no_batches)
	test_miss_penalty = (test_miss_penalty / no_batches)
	return test_loss, test_miss_penalty

# Train Bidirectional GRU Function

In [31]:
def train(
	total_epochs,
	encoded_train_words_location,
	encoded_test_words_location,
	batch_size,
	vocab_size,
	cuda,
	save_every,
	model_output_location,
	gru_hidden_dim = 512,
	gru_num_layers = 2,
	char_embedding_dim = 128,
	missed_char_linear_dim = 256,
	nn_hidden_dim = 256,
	gru_dropout = 0.3,
	learning_rate = 0.0005
):
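	## Build the model, move it to the GPU if requested, and set it to train mode
	model = HangmanGRU(
		vocab_size = vocab_size,
		gru_hidden_dim = gru_hidden_dim,
		gru_num_layers = gru_num_layers,
		char_embedding_dim = char_embedding_dim,
		missed_char_linear_dim = missed_char_linear_dim,
		nn_hidden_dim = nn_hidden_dim,
		gru_dropout = gru_dropout,
		learning_rate = learning_rate
	)
	if cuda:
		## The batches are moved to the GPU below, so the model must live there too
		model = model.cuda()
		## Rebuild the optimizer so it tracks the parameters on the GPU
		model.optimizer = Adam(model.parameters(), lr=learning_rate)
	model.train()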

	## Get encoded_train_word_list
	with open(encoded_train_words_location, "rb") as fp:
		encoded_train_word_list = pickle.load(fp)
	
	## Get encoded_test_word_list
	with open(encoded_test_words_location, "rb") as fp:
		encoded_test_word_list = pickle.load(fp)

	## Lists to store losses
	train_loss_list = []
	train_miss_penalty_list = []
	test_loss_list = []
	test_miss_penalty_list = []

	## Loop over Train Data
	for epoch in range(1, total_epochs+1):
		## Initialize epoch loss
		train_loss = 0.0
		train_miss_penalty = 0.0

		## Get cur_epoch_train_data_list
		cur_epoch_train_data_list = get_current_epoch_data(
			encoded_word_list = encoded_train_word_list, 
			epoch_number = epoch, 
			total_epochs = total_epochs,
			vocab_size = vocab_size
		)

		## Loop over batches
		no_batches = int(math.ceil(len(cur_epoch_train_data_list) / batch_size))
		for batch_id in range(no_batches):
			## Get batch
			inputs, labels, miss_chars, input_lengths = get_cur_batch_data(
				cur_epoch_data_list = cur_epoch_train_data_list, 
				batch_id = batch_id, 
				batch_size = batch_size,
				vocab_size = vocab_size
			)
			
			## Embeddings should be of dtype long
			inputs = torch.from_numpy(inputs).long()
			
			## Convert to torch tensors
			labels = torch.from_numpy(labels).float()
			miss_chars = torch.from_numpy(miss_chars).float()
			input_lengths = torch.from_numpy(input_lengths).long()

			if cuda:
				inputs = inputs.cuda()
				labels = labels.cuda()
				miss_chars = miss_chars.cuda()
				input_lengths = input_lengths.cuda()

			## Zero the parameter gradients
			model.optimizer.zero_grad()
			
			## Forward Pass, Loss calculation, Backward Pass, Optimize
			outputs = model(inputs, input_lengths, miss_chars)
			loss, miss_penalty = model.calculate_loss(outputs, labels, input_lengths, miss_chars, cuda)
			loss.backward()
			model.optimizer.step()

			## store loss
			train_loss += loss.item()
			train_miss_penalty += miss_penalty.item()

		# Test model after epoch
		test_loss, test_miss_penalty = test(
			epoch = epoch,
			model = model,
			total_epochs = total_epochs,
			encoded_test_word_list = encoded_test_word_list,
			batch_size = batch_size,
			vocab_size = vocab_size,
			cuda = cuda
		)
		model.train()

		# Store losses
		train_loss = (train_loss / no_batches)
		train_loss_list.append(train_loss)
		train_miss_penalty = (train_miss_penalty/ no_batches)
		train_miss_penalty_list.append(train_miss_penalty)
		test_loss_list.append(test_loss)
		test_miss_penalty_list.append(test_miss_penalty)

		# Save Losses
		df_losses = pd.DataFrame(
			{
				"train_loss": train_loss_list,
				"train_miss_penalty": train_miss_penalty_list,
				"test_loss": test_loss_list,
				"test_miss_penalty": test_miss_penalty_list
			}
		)
		df_losses_location = f"{model_output_location}/df_losses.csv"
		df_losses.to_csv(df_losses_location, index=False)

		# Save model
		if(epoch % save_every == 0):
			model_path = f"{model_output_location}/models"
			model_file_name = f"{model_path}/model_epoch_{str(epoch).zfill(4)}.pth"
			torch.save({
				'epoch': epoch,
				'model_state_dict': model.state_dict(),
				'optimizer_state_dict': model.optimizer.state_dict(),
				'train_loss': train_loss,
				'test_loss': test_loss,
			}, model_file_name)
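
The checkpoints saved above are named with a zero-padded epoch number (`model_epoch_0001.pth`, `model_epoch_0002.pth`, ...), so sorting the file names alphabetically also sorts them by epoch. The sketch below uses that to find the most recent checkpoint. `latest_checkpoint` is a hypothetical helper that is not part of the original notebook, and it assumes the `models` sub-directory layout used by `train`.

```python
from pathlib import Path


def latest_checkpoint(model_output_location):
    """Return the path of the newest checkpoint, or None if there is none.

    Hypothetical helper: relies on the 4-digit zero padding used in
    train(), so the ordering holds for epochs up to 9999.
    """
    models_dir = Path(model_output_location) / "models"
    if not models_dir.is_dir():
        return None
    # Lexicographic order equals epoch order because of the zero padding.
    checkpoints = sorted(models_dir.glob("model_epoch_*.pth"))
    return checkpoints[-1] if checkpoints else None
```

The returned path could then be passed to `torch.load`, and the `'model_state_dict'` entry of the resulting dictionary to `model.load_state_dict`, to restore the weights saved by `train`.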

# Train Bi Directional GRU

In [32]:
train(
	total_epochs = 10,
	encoded_train_words_location = "data/encoded_train_words.pickle",
	encoded_test_words_location = "data/encoded_test_words.pickle",
	batch_size = 250000,
	vocab_size = 26,
	cuda = False,
	save_every = 1,
	model_output_location = "model_output",
	gru_hidden_dim = 512,
	gru_num_layers = 2,
	char_embedding_dim = 128,
	missed_char_linear_dim = 256,
	nn_hidden_dim = 256,
	gru_dropout = 0.3,
	learning_rate = 0.0005
)

# API Usage Examples

## To start a new game:
1. Make sure you have implemented your own "guess" method.
2. Use the access_token that we sent you to create your HangmanAPI object. 
3. Start a game by calling the "start_game" method.
4. If you wish to test your function without being recorded, set the "practice" parameter to 1.
5. Note: You have a rate limit of 20 new games per minute. DO NOT start more than 20 new games within one minute.

In [None]:
api = HangmanAPI(access_token="INSERT_YOUR_TOKEN_HERE", timeout=2000)
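
One way to stay under the limit of 20 new games per minute is a small sliding-window limiter that is called before each game starts. The sketch below is an assumption about how this could be done on the client side, not part of the provided API: `GameRateLimiter` and `wait_for_slot` are hypothetical names, and the server may count the window differently.

```python
import time
from collections import deque


class GameRateLimiter:
    """Hypothetical helper: allow at most max_games starts per window."""

    def __init__(self, max_games=20, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_games = max_games
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._starts = deque()  # timestamps of recent game starts

    def wait_for_slot(self):
        """Block until a new game may start; return the seconds waited."""
        now = self._clock()
        # Forget starts that have already left the sliding window.
        while self._starts and now - self._starts[0] >= self.window_seconds:
            self._starts.popleft()
        waited = 0.0
        if len(self._starts) >= self.max_games:
            # Wait until the oldest recorded start leaves the window.
            waited = self.window_seconds - (now - self._starts[0])
            self._sleep(waited)
            now = self._clock()
            self._starts.popleft()
        self._starts.append(now)
        return waited
```

With `limiter = GameRateLimiter()`, calling `limiter.wait_for_slot()` immediately before each `api.start_game(...)` would keep the number of new games within the stated limit.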


## Playing practice games:
You can use the command below to play up to 100,000 practice games.

In [None]:
api.start_game(practice=1,verbose=True)
[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
practice_success_rate = total_practice_successes / total_practice_runs
print('run %d practice games out of an allotted 100,000. practice success rate so far = %.3f' % (total_practice_runs, practice_success_rate))


## Playing recorded games:
Please finalize your code before running the cell below. Once this code has executed successfully, your submission is finalized and our system will not allow you to run any additional games.

Note that after you successfully run this block of code, subsequent runs are expected to fail with the error message "Your account has been deactivated".

Once you've run this section of the code your submission is complete. Please send us your source code via email.

In [None]:
for i in range(1000):
    print('Playing game %d of 1000' % (i + 1))
    # Uncomment the following line to execute your final runs. Do not do this until you are satisfied with your submission
    #api.start_game(practice=0,verbose=False)
    
    # DO NOT REMOVE: otherwise the server may lock you out for sending requests too frequently
    time.sleep(0.5)

## To check your game statistics
1. Call the "my_status" method.
2. It returns your total number of practice and recorded games, and the number of wins for each.

In [None]:
[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
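# Guard against division by zero when no recorded games have been played yet
success_rate = total_recorded_successes / total_recorded_runs if total_recorded_runs else 0.0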
print('overall success rate = %.3f' % success_rate)
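
Since "my_status" returns four counters in a fixed order, both success rates can be derived in one place. The sketch below assumes the order shown in the cells above (practice runs, recorded runs, recorded successes, practice successes). `summarize_status` is a hypothetical helper, not part of the provided API, and it reports `None` for a rate whose run count is still zero.

```python
def summarize_status(status):
    """Turn the four my_status counters into success rates.

    Assumed order, taken from the cells above:
    [practice_runs, recorded_runs, recorded_successes, practice_successes]
    """
    practice_runs, recorded_runs, recorded_successes, practice_successes = status

    def rate(successes, runs):
        # None signals "no games played yet" instead of dividing by zero.
        return successes / runs if runs else None

    return {
        "practice_success_rate": rate(practice_successes, practice_runs),
        "recorded_success_rate": rate(recorded_successes, recorded_runs),
    }
```

For example, `summarize_status(api.my_status())` could be printed after a batch of practice games to track progress before any recorded game has been played.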