# Paardensprong
The goal of this notebook is to do the exploratory analysis needed to build a model that predicts the probability of a correct and timely guess.
I want to use that model to select puzzles that I have a ~50% chance of solving.

* Theoretical model:
  * Basic puzzle:
    * Finding direction in the word
    * Finding starting point in the word

  * Recognizing the word
    * Word frequency in normal language
    * Having seen it recently (in a puzzle) - not implemented yet
    * Pronunciation matches writing it down (e.g. fauteuil is very hard) - not implemented yet

  * Puzzle (inadvertent biases from my starting point, so that I take too long to switch to the correct point)
    * Direction 
    * Starting point
    * And sometimes I lose track when a single letter occurs very frequently - not implemented yet
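
The selection goal stated at the top (puzzles with a ~50% chance of success) could be sketched as follows once a model exists. This is only a minimal illustration: the function name `pick_closest_to_target` and the probabilities are hypothetical and made up for the example; they are not output of any model in this notebook.

```python
def pick_closest_to_target(predicted, target=0.5, n=3):
    """Return the n words whose predicted success probability is closest to target."""
    ranked = sorted(predicted, key=lambda word: abs(predicted[word] - target))
    return ranked[:n]


# Hypothetical model output: word -> predicted probability of a correct, timely guess
predicted = {
    "zwerfkei": 0.55,
    "oxidator": 0.35,
    "fauteuil": 0.20,
    "computer": 0.95,
}
selection = pick_closest_to_target(predicted, n=2)
```

Ranking on the distance to the target keeps the choice of model open: any model that outputs a probability per word would fit.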

In [None]:
from dotenv import load_dotenv
import importlib.resources
import os

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
load_dotenv()

PLAYERNAME = os.getenv('playername')

database_url_prod = os.getenv('PROD_DATABASE_URL').replace('postgresql', 'postgresql+psycopg')
engine_prod = create_engine(database_url_prod)

database_url_dev = os.getenv('DATABASE_URL').replace('postgresql', 'postgresql+psycopg')
engine_dev = create_engine(database_url_dev)

In [None]:
with engine_prod.connect() as conn:
    games = pd.read_sql_query('SELECT * FROM paardensprong.games', con=conn, index_col='game_id')
    guesses = pd.read_sql_query('SELECT * FROM paardensprong.guesses', con=conn, index_col='guess_id')

guesses_relevant = (guesses.set_index('game_id')
                    .rename(columns={'correct': 'GuessCorrect'})
                    [['guess_time', 'GuessCorrect']]
                    )                           

df = (games
      # Drop games without a guess - these probably timed out because of long loading times
      .join(guesses_relevant, how='inner')
      .query('playername == @PLAYERNAME')
      .assign(PuzzleTimeSec = lambda df: (df['guess_time'] - df['start_time']).dt.seconds,
              # The on-time threshold is a bit strict, since you need a few seconds of typing time.
              # That is on purpose: it makes sense to train with a bit of spare time,
              # and it helps the model by giving it a few more unsuccessful attempts to train on.
              OnTime = lambda df: df['PuzzleTimeSec'].lt(30),
              Success = lambda df: df['GuessCorrect'] & df['OnTime'],
              )
      # A few answers were given extremely late; probably when reconnecting
      .query('PuzzleTimeSec < 120')
      )
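
One detail worth keeping in mind for `PuzzleTimeSec`: pandas `.dt.seconds` returns only the seconds component of a timedelta (between 0 and one day), just like `datetime.timedelta.seconds` in the standard library. The sketch below uses plain `datetime` objects with made-up timestamps to show the difference from `total_seconds()`. Whether this matters here depends on an assumption, namely that a reconnect can happen more than a day after the puzzle started.

```python
from datetime import datetime, timedelta

# Made-up timestamps for illustration
start_time = datetime(2024, 1, 1, 12, 0, 0)
guess_on_time = start_time + timedelta(seconds=12)
guess_next_day = start_time + timedelta(days=1, seconds=12)

short = guess_on_time - start_time
late = guess_next_day - start_time

# .seconds ignores whole days, which are stored separately in .days
short_component = short.seconds          # 12
late_component = late.seconds            # 12 as well: looks like a fast answer
late_elapsed = late.total_seconds()      # 86412.0: the actual elapsed time
```

In such a case the `PuzzleTimeSec < 120` filter would not remove the late answer, whereas a filter on the total elapsed seconds would.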

Here we immediately see the problem: ~90% of the puzzles are solved correctly and on time, so I want to select the most challenging puzzles.

In [None]:
df['Success'].value_counts(normalize=True)

In [None]:
df['PuzzleTimeSec'].hist(bins=range(0, df['PuzzleTimeSec'].max() + 5, 5))
df['PuzzleTimeSec'].describe()

In [None]:
DATA_PATH = importlib.resources.files('tweevoortwaalf.Data').joinpath('suitable_8_letter_words.txt')
eightletterwords = pd.read_csv(DATA_PATH, header=None).squeeze()

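While solving the puzzle, I often look for illogical consecutive letters. Such a pair can't be correct, so the solution should go the other way around. First, we generalize this by counting how often each pair of consecutive letters (bigram) occurs in the list of eight-letter words.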

In [None]:
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit(eightletterwords)
ngrams_occurences_total = vectorizer.transform(eightletterwords).toarray().sum(axis=0)

In [None]:
def easyness_score(woord, vectorizer=vectorizer):
    """Sum the corpus counts of all letter transitions in the word: the higher, the more logical"""
    ngrams_occurences_word = vectorizer.transform([woord])
    # Explicit matrix-vector product: bigram counts in the word times the corpus totals
    return (ngrams_occurences_word @ ngrams_occurences_total).sum()
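
To make the score concrete, here is a plain-Python sketch of the same idea using `collections.Counter` instead of `CountVectorizer`. The three-word corpus and the names `bigrams`, `bigram_totals` and `toy_easyness_score` are assumptions for this illustration only; the notebook itself uses the full eight-letter word list.

```python
from collections import Counter


def bigrams(word):
    """All pairs of consecutive letters in the word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


# Toy corpus, chosen only for illustration
toy_words = ["zwerfkei", "oxidator", "werkster"]
bigram_totals = Counter(bg for w in toy_words for bg in bigrams(w))


def toy_easyness_score(word):
    """Sum, over the bigrams in the word, of how often each occurs in the corpus."""
    # Counter returns 0 for bigrams that never occur in the corpus
    return sum(bigram_totals[bg] for bg in bigrams(word))


score_er = toy_easyness_score("er")   # "er" occurs three times in the toy corpus
score_wz = toy_easyness_score("wz")   # "wz" never occurs
```

A pair that never occurs in the corpus scores 0, which is what makes it a candidate for an "impossible" transition.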

## Direction
The theory is as follows:

* To find the logical direction, check the second least likely transition in each direction.
* The least likely transition between two characters should be the word boundary.
* If there is a second unlikely transition, then perhaps the word is not written in that direction.
* We therefore compare the second least likely transition of both directions.
* We add a compensation term because the wrong direction can contain 2 impossible transitions (a score of 0), which would otherwise lead to a division by zero.

This could probably be improved by explicitly checking that the least likely transition is also a plausible word boundary (e.g. zwerfkei read backwards: "wz" is an illogical transition AND "ew" is an impossible end of a word, even though "ziek" is a good start of a word).
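
The comparison described above can be worked through by hand on a single word. The sketch below is self-contained and uses a small dictionary of bigram counts that is entirely made up for the example; the helper names are hypothetical and only mirror the structure of the notebook's functions.

```python
def circular_transitions(word):
    """Letter pairs along the word, plus the pair that wraps from last to first letter."""
    circular = word + word[0]
    return [circular[i:i + 2] for i in range(len(word))]


def second_least_likely(word, counts):
    """Second-lowest bigram count over the circular transitions (unseen pairs count as 0)."""
    return sorted(counts.get(pair, 0) for pair in circular_transitions(word))[1]


def direction_ratio(word, counts, compensation=0.5):
    """Ratio above 1 means the actual direction looks more logical than the reverse."""
    forward = second_least_likely(word, counts)
    backward = second_least_likely(word[::-1], counts)
    return (forward + compensation) / (backward + compensation)


# Made-up bigram counts; every pair not listed is treated as never occurring
counts = {"zw": 4, "we": 9, "er": 20, "rf": 3, "fk": 1, "ke": 12, "ei": 7}
ratio = direction_ratio("zwerfkei", counts)
```

With these counts the forward direction has one impossible transition (the word boundary "iz"), so its second-lowest count is 1. In the reverse direction every transition is unseen, so the second-lowest count is 0 and the ratio becomes (1 + 0.5) / (0 + 0.5) = 3.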

In [None]:
def logical_single_direction(word):
    """Check the second least likely transition, including the transition across the word boundary"""
    circular_word = word + word[0]
    logical = []
    for i in range(len(circular_word) - 1):
        logical.append(easyness_score(circular_word[i] + circular_word[i + 1]))
    return sorted(logical)[1]


def logical_correct_direction(word, compensation=0.5):
    """Compare both directions"""
    logical_actual_direction = logical_single_direction(word)
    logical_wrong_direction = logical_single_direction(word[::-1])
    return (logical_actual_direction + compensation) / (logical_wrong_direction + compensation)

direction = df['answer'].apply(logical_correct_direction)
directionbins = pd.qcut(direction, q=5)
df.groupby(directionbins)['Success'].agg(['count', 'mean'])

Indeed, we see that words with a clear direction are solved successfully more often.

In [None]:
df.loc[direction.nlargest(5).index]

## Word boundary
Assuming the correct direction, how special is the actual transition across the word boundary compared to the other character transitions?
This is not perfect yet: e.g. ox and xi are special in oxidator, but that doesn't make the X the logical starting letter.


In [None]:
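def logical_word_boundary(word, compensation=0.5):
    """Share of the total inverse transition score that falls on the word boundary"""
    circular_word = word + word[0]
    logical = []
    for i in range(len(circular_word) - 1):
        logical.append(1 / (easyness_score(circular_word[i] + circular_word[i + 1]) + compensation))
    # The last transition goes from the final letter back to the first: the word boundary
    return logical[-1] / sum(logical)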

wordboundary = df['answer'].apply(logical_word_boundary)
wordboundarybins = pd.qcut(wordboundary, q=5)
df.groupby(wordboundarybins)['Success'].agg(['count', 'mean'])
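
As a small worked example of the word-boundary score: each transition gets an "unlikeliness" of 1 / (count + compensation), and the score is the share of the total that falls on the wrap-around transition. The three-letter toy word and the counts below are made up so the arithmetic can be checked by hand; `boundary_share` is a hypothetical stand-in for the notebook's function.

```python
def boundary_share(word, counts, compensation=0.5):
    """Share of the total unlikeliness carried by the last-to-first letter transition."""
    circular = word + word[0]
    unlikeliness = [
        1 / (counts.get(circular[i:i + 2], 0) + compensation)
        for i in range(len(word))
    ]
    return unlikeliness[-1] / sum(unlikeliness)


# Clear boundary: "ab" and "bc" are common, the wrap-around "ca" never occurs
clear = boundary_share("abc", {"ab": 2, "bc": 2})
# No clear boundary: all three transitions are equally common
flat = boundary_share("abc", {"ab": 2, "bc": 2, "ca": 2})
```

In the first case the unlikeliness values are 0.4, 0.4 and 2.0, so the boundary carries 2.0 / 2.8 = 5/7 of the total. In the second case every transition carries one third, which means the boundary does not stand out at all.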

In [None]:
wordboundaryscore = eightletterwords.apply(logical_word_boundary)
eightletterwords.loc[wordboundaryscore.nsmallest(10).index]

## Knowing word
I hypothesize that recognizing the word is easier if you have seen the word recently. That would be related to how often you see it in normal use of the language, and whether it was played recently (which is not implemented yet).

In [None]:
wordlist = pd.read_csv('../tweevoortwaalf/Data/wordlist.csv')
# There are some duplicates in Word for words including ij, where one occurs very infrequently
frequency = wordlist.query('Length == 8').groupby('Word')['Frequency'].max()

In [None]:
new = df.merge(frequency, left_on='answer', right_index=True)
frequencybins = pd.qcut(new['Frequency'], q=5)
# Group the merged frame, so the bins and the rows are guaranteed to line up
new.groupby(frequencybins)['Success'].agg(['count', 'mean'])

Indeed, the least frequent words are guessed less often. This is in line with the hypothesis that words you don't know are much harder.

## Puzzle: startpoint
I'm not convinced the puzzle characteristics will have a strong effect; I think the word is more important. But it would be silly to rule out my own biases.

In [None]:
df.groupby('startpoint')['Success'].value_counts(normalize=True).unstack()

From this data, it's definitely impossible to rule out the hypothesis.
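
One way to see why differences between groups are hard to interpret is to put an interval around each success rate. The sketch below computes a Wilson score interval for a binomial proportion in plain Python. The counts are hypothetical and not taken from the data in this notebook, and the Wilson interval is my own choice of technique for the illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts for two start points
low_a, high_a = wilson_interval(27, 30)   # 90% observed success
low_b, high_b = wilson_interval(24, 30)   # 80% observed success
overlap = low_a <= high_b and low_b <= high_a
```

With about 30 games per group, a gap of ten percentage points still leaves heavily overlapping intervals, so a real effect of the start point can neither be confirmed nor excluded at that sample size.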

## Puzzle: direction 

In [None]:
df.groupby('direction')['Success'].value_counts(normalize=True).unstack()