# Trexquant Interview Project (The Hangman Game)

* Copyright Trexquant Investment LP. All Rights Reserved. 
* Redistribution of this question without written consent from Trexquant is prohibited

## Instruction:
For this coding test, your mission is to write an algorithm that plays the game of Hangman through our API server. 

When a user plays Hangman, the server first selects a secret word at random from a list. The server then returns a row of space-separated underscores, one for each letter in the secret word, and asks the user to guess a letter. If the user guesses a letter that is in the word, the word is redisplayed with all instances of that letter shown in the correct positions, along with any letters correctly guessed on previous turns. If the letter does not appear in the word, the user is charged with an incorrect guess. The user keeps guessing letters until either (1) the user has correctly guessed all the letters in the word or (2) the user has made six incorrect guesses.
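
The rules above can be sketched as a small local simulator, which is handy for testing a strategy offline before spending practice games. This is only an illustration: `play_hangman`, `frequency_order_guess`, and `LETTER_ORDER` are hypothetical names, and charging a repeated guess as a miss is an assumption, not documented server behavior.

```python
# Hypothetical fixed guessing order (roughly English letter frequency).
LETTER_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def frequency_order_guess(masked, guessed):
    """Toy strategy: return the first letter in LETTER_ORDER not yet tried."""
    for letter in LETTER_ORDER:
        if letter not in guessed:
            return letter
    raise RuntimeError("all letters already guessed")

def play_hangman(secret, guess_fn, max_wrong=6):
    """Simulate one game locally; return True if the word is fully revealed."""
    guessed = set()
    wrong = 0
    while wrong < max_wrong:
        # Same display format as the server: space-separated characters.
        masked = " ".join(c if c in guessed else "_" for c in secret)
        if "_" not in masked:
            return True
        letter = guess_fn(masked, guessed)
        # Assumption: a repeated guess is charged like a miss.
        if letter in guessed or letter not in secret:
            wrong += 1
        guessed.add(letter)
    return False

print(play_hangman("tea", frequency_order_guess))
print(play_hangman("jazz", frequency_order_guess))
```

Any function with the same `(masked, guessed)` signature can be dropped in as `guess_fn` to compare strategies on held-out words.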

You are required to write a "guess" function that takes the current word (with underscores) as input and returns a guessed letter. You will use the API code below to play 1,000 Hangman games. You may practice as much as you like before you start recording your game results.

Your algorithm is permitted to use a training set of approximately 250,000 dictionary words. Your algorithm will be tested on an entirely disjoint set of 250,000 dictionary words. Please note that this means the words that you will ultimately be tested on do NOT appear in the dictionary that you are given. You are not permitted to use any dictionary other than the training dictionary we provided. This requirement will be strictly enforced by code review.

You are provided with a basic, working algorithm. This algorithm will match the provided masked string (e.g. a _ _ l e) to all possible words in the dictionary, tabulate the frequency of letters appearing in these possible words, and then guess the letter with the highest frequency of appearance that has not already been guessed. If there are no remaining words that match, then it will default back to the character frequency distribution of the entire dictionary.
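
A minimal sketch of the benchmark strategy described above, assuming the masked string uses `_` for unknown letters and spaces as separators. The function name `baseline_guess` and its arguments are illustrative and are not taken from the provided API code.

```python
import re
from collections import Counter

def baseline_guess(masked, dictionary, guessed):
    """Guess the most frequent unguessed letter among dictionary words
    that match the masked pattern; fall back to the whole dictionary."""
    pattern = masked.replace(" ", "").replace("_", ".")
    regex = re.compile(pattern)
    candidates = [w for w in dictionary if regex.fullmatch(w)]
    # First try the matching words, then the full dictionary as a fallback.
    for pool in (candidates, dictionary):
        counts = Counter("".join(pool))
        for letter, _ in counts.most_common():
            if letter not in guessed:
                return letter
    return None

toy_dictionary = ["apple", "ample", "angle", "maple"]
print(baseline_guess("a _ _ l e", toy_dictionary, {"a", "l", "e"}))
```

On the toy dictionary, three words match `a _ _ l e`, and `p` is the most frequent letter among them that has not been guessed yet.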

This benchmark strategy is successful approximately 18% of the time. Your task is to design an algorithm that significantly outperforms this benchmark.

In [1]:
from hangman_api import *
import model_jinna as model_jn
import bidirectional_lstm as bi_lstm
%load_ext autoreload
%autoreload 2

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from tensorflow import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding,Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

import torch
from torch.utils.data import Dataset
import torch.nn as nn
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split


In [2]:
# pip install torch

In [3]:
# conda install tensorflow

# API Usage Examples

## To start a new game:
1. Make sure you have implemented your own "guess" method.
2. Use the access_token that we sent you to create your HangmanAPI object. 
3. Start a game by calling "start_game" method.
4. If you wish to test your function without being recorded, set "practice" parameter to 1.
5. Note: You have a rate limit of 20 new games per minute. DO NOT start more than 20 new games within one minute.
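
One way to stay under the limit in step 5 is to space game starts at least 60/20 = 3 seconds apart. The sketch below is an assumption about how to do that on the client side; `GameRateLimiter` is a hypothetical helper, not part of `hangman_api`, and the server's exact windowing rule is not known here.

```python
import time

class GameRateLimiter:
    """Block until enough time has passed since the previous game start."""

    def __init__(self, max_per_minute=20, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self.clock = clock      # injectable for testing
        self.sleep = sleep      # injectable for testing
        self.last_start = None

    def wait(self):
        now = self.clock()
        if self.last_start is not None:
            remaining = self.min_interval - (now - self.last_start)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_start = now

# Usage (hypothetical): call limiter.wait() right before each api.start_game(...)
limiter = GameRateLimiter(max_per_minute=20)
limiter.wait()  # the first call returns immediately
```

Because a game usually takes longer than three seconds of requests, the limiter will often not sleep at all; it only matters when games end quickly.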

In [4]:
api = HangmanAPI(access_token="548234d05e76068d5ce791cbd1e644", timeout=2000)

In [8]:
# Design of the algorithm:
# 1. For the first guess: select the letter with the highest occurrence across all words whose length is within ±3 of the target length (lower-bounded by 1)
# 2. For the subsequent guesses: build and use n_gram dictionary to find the most probable letter 
# 3. check if LSTM is needed
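
Step 1 of the design could look roughly like the sketch below. It assumes "occurrence" means the number of words containing the letter (each letter counted once per word); the notes do not say which counting rule was actually used, and `first_guess` is a hypothetical name.

```python
from collections import Counter

def first_guess(word_length, dictionary, window=3):
    """Most common letter among words whose length is within
    +/- window of word_length (lower bound clamped to 1)."""
    lo = max(1, word_length - window)
    hi = word_length + window
    counts = Counter()
    for word in dictionary:
        if lo <= len(word) <= hi:
            counts.update(set(word))  # assumption: count each letter once per word
    return counts.most_common(1)[0][0] if counts else None

print(first_guess(3, ["bee", "see", "tea", "elephantine"], window=1))
```

Restricting the count to similar lengths matters because letter frequencies in short words differ noticeably from those in long ones.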

Testing LSTM 

In [7]:
import random  # needed by mask_random_letter below

# LSTM: check the most probable guesses based on an incomplete word
words = api.full_dictionary  
word_for_model = words[52423]
def mask_random_letter(word):
    if not word:
        return word  # handle empty string

    unique_letters = list(set(word))
    chosen_letter = random.choice(unique_letters)
    masked_word = word.replace(chosen_letter, '_')
    
    print(f"Original word: {word}")
    print(f"Chosen letter: '{chosen_letter}'")
    print(f"Masked word:   {masked_word}")
    return masked_word

# Example usage:
masked = mask_random_letter(word_for_model)


Original word: disembosom
Chosen letter: 'm'
Masked word:   dise_boso_


In [8]:
# lstm_model.save('my_lstm_model.keras')  # newer Keras format
# lstm_model = keras.models.load_model('my_lstm_model.keras')
model = model_jn.LSTMCharModel(vocab_size = 27)
model.load_state_dict(torch.load('lstm_model.pth'))
model.eval()

import json
with open("model_metadata.json", "r") as f:
    data = json.load(f)

# Extract specific parts
char_to_idx = data["char_to_idx"]
idx_to_char = {int(k): v for k, v in data["idx_to_char"].items()}  # Convert keys to int
max_len = data["max_len"]

ranked_predictions = model_jn.get_ranked_letter_probs(model, masked, char_to_idx, idx_to_char, max_len)
print(ranked_predictions)

# Show the top 6
for char, prob in ranked_predictions[:6]:
    print(f"{char}: {prob:.4f}")

[('c', 0.4223), ('k', 0.4164), ('f', 0.0883), ('d', 0.0355), ('b', 0.0225), ('m', 0.0109), ('l', 0.0026), ('j', 0.0012), ('h', 0.0001), ('i', 0.0001), ('_', 0.0), ('a', 0.0), ('e', 0.0), ('g', 0.0), ('n', 0.0), ('o', 0.0), ('p', 0.0), ('q', 0.0), ('r', 0.0), ('s', 0.0), ('t', 0.0), ('u', 0.0), ('v', 0.0), ('w', 0.0), ('x', 0.0), ('y', 0.0), ('z', 0.0)]
c: 0.4223
k: 0.4164
f: 0.0883
d: 0.0355
b: 0.0225
m: 0.0109


In [9]:
bi_lstm_model, vocab = bi_lstm.load_model()         
# print('top 6 letters predicted by lstm are: ')
most_common_by_model = bi_lstm.predict_missing(bi_lstm_model, vocab, masked)

Predictions (char: prob):
n: 0.9419
p: 0.0316
g: 0.0078
k: 0.0045
s: 0.0042
z: 0.0042
o: 0.0035


Comments
LSTM
- The one-directional LSTM works better in actual prediction.
- But it still mostly performs worse than the n-gram approach.
- When it predicts correctly, it also agrees with the n-gram result most of the time.
- It is often unable to find the last letter.

Ngram
- Works much better and faster when I changed the minimum gram length from 3 to len(word)-4 (this reduces the number of matched grams).
- But the issue is that the list of matched grams quickly drops to [].
- Too many matched grams is not good either.
- Sometimes there are matched grams, but they contain no unguessed letters.

Next steps:
1. [done] Set up a backup n_gram dictionary with a smaller n (only use n-grams).
2. Must we enforce matching lengths when comparing n_grams? Would a "contains" check also work, especially for shorter words?
    If the word length < x, cut the word by x-1, cut the dictionary by x-1 to x+3, and use the "contains" logic.
3. Use a pretrained language model to split the word into phonetic syllables and then use n_grams.
4. Train another neural network model (not an LSTM) with inputs word[:i], word[i+1:], masked_word, i, and i/len(), and output word[i].
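
The n-gram idea with a smaller-n backup can be sketched as follows. This is not the notebook's actual implementation (that code is not shown here); `build_ngrams` and `ngram_guess` are hypothetical names, and the voting rule is an assumption. Each window of the masked word that mixes known letters and blanks is matched against the n-gram table, and when the largest n yields no votes the function backs off to the next smaller n.

```python
import re
from collections import Counter

def build_ngrams(words, n):
    """Count every length-n substring across the training words."""
    grams = Counter()
    for word in words:
        for i in range(len(word) - n + 1):
            grams[word[i:i + n]] += 1
    return grams

def ngram_guess(masked, tables, guessed):
    """tables maps n -> Counter of n-grams; larger n is tried first."""
    pattern = masked.replace(" ", "")
    tried = set(guessed)
    for n in sorted(tables, reverse=True):
        votes = Counter()
        for i in range(len(pattern) - n + 1):
            window = pattern[i:i + n]
            blanks = [p for p, ch in enumerate(window) if ch == "_"]
            if not blanks or len(blanks) == n:
                continue  # nothing to predict, or no known context
            regex = re.compile(window.replace("_", "."))
            for gram, count in tables[n].items():
                if not regex.fullmatch(gram):
                    continue
                letters = {gram[p] for p in blanks}
                if letters & tried:
                    continue  # a blank cannot hide an already-guessed letter
                for letter in letters:
                    votes[letter] += count
        if votes:
            return votes.most_common(1)[0][0]
    return None  # caller should fall back to plain letter frequency

words = ["banana", "bandana", "cabana"]
tables = {3: build_ngrams(words, 3), 2: build_ngrams(words, 2)}
print(ngram_guess("b a n _ a n a", tables, {"b", "a", "n"}))
```

Scanning the whole table for every window is slow on millions of grams; an index keyed by the known letters of each window would be the obvious optimization.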


## Playing practice games:
You can use the command below to play up to 100,000 practice games.

In [None]:
test = 2
score = 0
check = []
for i in range(test):
    print('Playing ', i+1, ' th game')
    win = 0  # reset for every game so an earlier win is not carried over
    if api.start_game(practice=1,verbose=True):
        score += 1
        win = 1
    check.append([len(api.current_word),win])
    [total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)

    practice_success_rate = total_practice_successes / total_practice_runs
    print('run %d practice games out of an allotted 100,000. practice success rate so far = %.3f' % (total_practice_runs, practice_success_rate))

Playing  1  th game
Successfully start a new game! Game ID: 01cbc242e26c. # of tries remaining: 6. Word: _ _ _ _ _ _ _ _ _ _ .
Guessing letter: e
Sever response: {'game_id': '01cbc242e26c', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ _ e _ _ _ _ _ _ _ '}
Guessing letter: r
Sever response: {'game_id': '01cbc242e26c', 'status': 'ongoing', 'tries_remains': 6, 'word': '_ r e _ _ _ _ _ r _ '}
len(self.n_gram_dictionary) is 2138152
len(self.n_gram_backup) is 7468580
len(n_gram_dictionary) is 2138152
some matched grams are: 
['breviat', 'breviat', 'breviat', 'irepullers', 'dremembere', 'wreathwork']
letters predicted by n_grams are: 
[('a', 48615), ('o', 48458), ('i', 46073), ('t', 42365), ('n', 32968), ('s', 32785), ('p', 25334), ('c', 25025), ('l', 22102), ('u', 20524), ('d', 17494), ('m', 15762), ('h', 14445), ('g', 12451), ('b', 11434), ('f', 9089), ('y', 7362), ('v', 6649), ('w', 4700), ('k', 3340), ('x', 1707), ('q', 969), ('z', 850), ('j', 673)]
Guessing letter: a
Sever respons

In [9]:
print(score)
print(i+1)
print(score/(i+1))

0
2
0.0


In [10]:
check
# 115
# 200
# 0.575

[['_ _ _ _ _ _ _ _ _ _ ', 20, 0], ['_ _ _ _ _ _ ', 12, 0]]

## Playing recorded games:
Please finalize your code prior to running the cell below. Once this code executes once successfully your submission will be finalized. Our system will not allow you to rerun any additional games.

Please note that it is expected that after you successfully run this block of code that subsequent runs will result in the error message "Your account has been deactivated".

Once you've run this section of the code your submission is complete. Please send us your source code via email.

In [None]:
for i in range(1000):
    print('Playing ', i, ' th game')
    # Uncomment the following line to execute your final runs. Do not do this until you are satisfied with your submission
    #api.start_game(practice=0,verbose=False)
    
    # DO NOT REMOVE as otherwise the server may lock you out for too high frequency of requests
    time.sleep(0.5)

## To check your game statistics
1. Simply use "my_status" method.
2. Returns your total number of games, and number of wins.

In [None]:
[total_practice_runs,total_recorded_runs,total_recorded_successes,total_practice_successes] = api.my_status() # Get my game stats: (# of tries, # of wins)
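# Guard against division by zero when no recorded game has been played yet
success_rate = total_recorded_successes/total_recorded_runs if total_recorded_runs else 0.0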
print('overall success rate = %.3f' % success_rate)

In [27]:
words = api.full_dictionary  
X, y = model_jn.generate_training_data(words)

# Convert words and labels to numerical representations
X_numerical, y_numerical, char_to_index, index_to_char = model_jn.convert_to_numerical(X, y)
print(char_to_index)
print(index_to_char)

# Pad sequences to ensure equal length
X_padded = pad_sequences(X_numerical)
y_padded = pad_sequences([y_numerical], padding='post')[0]

{'w': 1, 'e': 2, 'r': 3, 'c': 4, '_': 5, 'p': 6, 'h': 7, 'v': 8, 'z': 9, 'y': 10, 'l': 11, 'k': 12, 't': 13, 'a': 14, 'n': 15, 'q': 16, 'i': 17, 'm': 18, 'b': 19, 'f': 20, 's': 21, 'x': 22, 'g': 23, 'u': 24, 'o': 25, 'j': 26, 'd': 27}
{1: 'w', 2: 'e', 3: 'r', 4: 'c', 5: '_', 6: 'p', 7: 'h', 8: 'v', 9: 'z', 10: 'y', 11: 'l', 12: 'k', 13: 't', 14: 'a', 15: 'n', 16: 'q', 17: 'i', 18: 'm', 19: 'b', 20: 'f', 21: 's', 22: 'x', 23: 'g', 24: 'u', 25: 'o', 26: 'j', 27: 'd'}
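
The index mapping printed above starts at 1, which leaves 0 free as the padding value; Keras `pad_sequences` pads and truncates at the front by default. The sketch below shows what that preprocessing amounts to in plain Python. `make_pairs` and `encode_and_pad` are hypothetical stand-ins, not the `model_jn` functions, whose internals are not shown here; masking one distinct letter at a time is an assumption borrowed from `mask_random_letter`.

```python
def make_pairs(word):
    """Return (masked_word, hidden_letter) for each distinct letter of word."""
    return [(word.replace(letter, "_"), letter) for letter in sorted(set(word))]

def encode_and_pad(sequences, char_to_index, max_len=None):
    """Map characters to integer ids and left-pad with 0 to a common length."""
    encoded = [[char_to_index[c] for c in s] for s in sequences]
    if max_len is None:
        max_len = max(len(e) for e in encoded)
    padded = []
    for e in encoded:
        e = e[-max_len:]  # keep the tail if a sequence is too long
        padded.append([0] * (max_len - len(e)) + e)
    return padded

print(make_pairs("aba"))
print(encode_and_pad(["ab", "a"], {"a": 1, "b": 2}))
```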


In [35]:
y_padded.shape[0]

1681209