# Our Goal
Determine whether a given string of consecutive letters is pronounceable by English speakers.


# Problem statement
- We want to improve at classifying "words" as pronounceable or not
- We will measure our progress by the percentage of words correctly classified, based on our dataset of words from an actual dictionary and randomly-ish generated non-pronounceable words
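
As a minimal sketch of the metric described above: the helper `accuracy`, the stand-in classifier `has_vowel`, and the toy labeled sample are all hypothetical and exist only for illustration; none of them is part of the notebook's dataset or models.

```python
from typing import Callable, Iterable, Tuple


def accuracy(classifier: Callable[[str], bool],
             labeled_words: Iterable[Tuple[str, bool]]) -> float:
    """Return the percentage of words whose predicted label matches the true label."""
    labeled_words = list(labeled_words)
    if not labeled_words:
        raise ValueError("need at least one labeled word")
    correct = sum(1 for word, label in labeled_words if classifier(word) == label)
    return 100.0 * correct / len(labeled_words)


# Hypothetical stand-in classifier: "pronounceable" means "contains a vowel".
def has_vowel(word: str) -> bool:
    return any(ch in "aeiou" for ch in word)


# Toy labeled sample (word, is_pronounceable); an assumption, not real data.
sample = [("hello", True), ("world", True), ("xkzq", False), ("rhythm", True)]
score = accuracy(has_vowel, sample)  # "rhythm" is misclassified by this toy rule
```

Any of the models discussed below could be passed in place of `has_vowel`, as long as it maps a word to a boolean.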


# Methodology
Train several different models on our dataset to teach them what "pronounceable" words look like.
These will include a manually designed heuristic and several ML models.

## Input Formulation (ML Models)
We need to transform words into input vectors, since the models need quantifiable data.

Since bigrams capture which letters appear next to each other, we decided to define the features of our vectors as the list of all possible bigrams over the English alphabet. This results in a 26*26 = 676-dimensional space. Each bigram preserves the order of its own two letters, but we will not be able to encode where in the word each bigram occurs, because any meaningful encoding of this would result in either difficulty plotting the data or skewed data. For example, if the first bigram in the word were given a value of 1, the second a value of 2, et cetera, then the feature vectors of longer words would lie further and further from the origin along the dimensions of their later bigrams.

If our model seems to be less accurate than we would like, we will experiment with finding a way to encode the order.
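
To make the skew concrete, here is a minimal sketch of the position-valued encoding that we decided against. `positional_encoding` and `norm` are hypothetical helpers written only for this illustration, and `get_bigrams` is repeated so the snippet runs on its own.

```python
import math


def get_bigrams(word: str) -> list:
    return [a + b for a, b in zip(word, word[1:])]


def positional_encoding(word: str) -> dict:
    """Map each bigram to its 1-based position in the word.

    If a bigram repeats, the later position overwrites the earlier one.
    """
    return {bigram: position
            for position, bigram in enumerate(get_bigrams(word), start=1)}


def norm(vector: dict) -> float:
    """Euclidean distance from the origin, treating absent bigrams as 0."""
    return math.sqrt(sum(value * value for value in vector.values()))


short = positional_encoding("cat")       # bigrams at positions 1 and 2
long_ = positional_encoding("catalog")   # bigrams at positions 1 through 6

# The longer word lands much further from the origin purely because of its
# length, not because of anything related to pronounceability.
short_norm = norm(short)
long_norm = norm(long_)
```

With a presence-only (0/1) encoding, the distance from the origin instead grows only with the number of distinct bigrams, which is a much milder effect.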


### For example
Our feature vector will take the following shape:
`{"aa": int, "ab": int, "ac": int, "ad": int, ..., "zz": int}`

So for a word like "abba", which contains the bigrams `["ab", "bb", "ba"]`, our feature vector would be:

`{"aa": 0, "ab": 1, "ac": 0, ..., "ba": 1, "bb": 1, "bc": 0, ...}`
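
The same example can be viewed as a flat list of 676 entries. The sketch below assumes one particular layout (index = 26 * first letter + second letter); `bigram_index` and `presence_list` are hypothetical names used only here, and the notebook's own vectorization code uses a dictionary instead.

```python
import string


def bigram_index(bigram: str) -> int:
    """Position of a bigram in the flat 676-entry vector: 'aa' -> 0, 'zz' -> 675."""
    first, second = (string.ascii_lowercase.index(ch) for ch in bigram)
    return 26 * first + second


def presence_list(word: str) -> list:
    """Return a 676-entry list with 1 where the bigram occurs in word, else 0."""
    vector = [0] * (26 * 26)
    for a, b in zip(word, word[1:]):
        vector[bigram_index(a + b)] = 1
    return vector


abba = presence_list("abba")
# Indices of the entries that are set: "ab", "ba", and "bb"
nonzero = [i for i, value in enumerate(abba) if value]
```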





## Helper Methods

In [1]:
def get_bigrams(word: str) -> list:
    """Return every pair of adjacent letters in word, in order of appearance."""
    return [i + j for i, j in zip(word, word[1:])]



# Dataset Creation Code

In [43]:
import os
import re
import base64
import requests
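def return_pronounceable_words():
    """Return the set of lowercased, purely alphabetic words found in words.txt."""
    # open dataset (words.txt, one directory above the notebook's working directory)
    data_path = os.path.normpath(os.path.join(os.path.dirname(os.path.abspath("word_pronounceability.ipynb")), '..', 'words.txt'))
    data = set()
    with open(data_path, 'r') as f:
        for line in f:
            # treat the parts of hyphenated compounds as separate words
            for word in line.lower().replace("-", " ").split():
                # keep only words made up entirely of the letters a-z
                if re.fullmatch("[a-z]+", word):
                    data.add(word)
    return data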

            
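def return_unpronounceable_words():
    # Generate words that break rules of English
    # (not implemented yet; raising keeps this cell valid Python)
    raise NotImplementedError("unpronounceable word generation is not implemented yet")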



# Heuristic Model
Our initial hypothesis was that the pronounceability of a word correlates very strongly with its *likelihood*. That is to say, if it is statistically probable that a sequence of letters could make a word, it is also statistically probable that it can be pronounced. This is not without caveats, however: we have idiomatically accepted brand names like "Exxon", which contain strings of letters that no (or at least very few) dictionary words contain. Such strings would receive a low likelihood, yet we can pronounce them perfectly fine. Despite these caveats, we feel that this is a reasonable heuristic.
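
To illustrate the idea and the "Exxon" caveat, here is a minimal sketch that measures how often each bigram occurs across a word list. The helper `bigram_word_share` and the four-word `toy_words` list are assumptions made for this example; the real heuristic runs against our dictionary dataset.

```python
from collections import Counter


def bigram_word_share(words: list) -> dict:
    """Fraction of words that contain each bigram at least once."""
    counts = Counter()
    for word in words:
        # A set, so that a bigram repeated within one word is counted once.
        counts.update({a + b for a, b in zip(word, word[1:])})
    return {bigram: n / len(words) for bigram, n in counts.items()}


# Toy stand-in for the dictionary (an assumption, not the real dataset).
toy_words = ["hello", "help", "shell", "exit"]
share = bigram_word_share(toy_words)

common = share["he"]            # appears in three of the four words
unseen = share.get("xx", 0.0)   # never appears, so "exxon" would be penalized
```

A bigram that never appears gets the lowest possible likelihood even when, as with "xx" in "Exxon", people can pronounce the word that contains it.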

## Heuristic Model Code

In [3]:
from contextlib import contextmanager
from statistics import mean
def pronounceable_score_heuristic(letters: str) -> float:
    """ Generates a numerical score representing the likelihood that a bigram is pronounceable.

    Args:
        letters (str): string of length 2 (bigram) containing only alphabetical characters to check our dataset for occurrences of

    Returns:
        float: a score representing the likelihood that we can pronounce this string of letters. 0.5 is generally pronounceable, 0.2 is not.
    """
    assert len(letters) == 2
    # check dataset for occurrences of [letters].
    # If they never appear, the string almost certainly cannot be pronounced
    # If they appear, determine how often by dividing the number of times they were found by the number of words checked
    proportion = dict(in_line=0, not_in=0)

    # NOTE: the counting loop is missing from the original cell. This loop is an
    # assumed reconstruction based on the comments above, using the dataset helper.
    for word in return_pronounceable_words():
        proportion['in_line' if letters in word else 'not_in'] += 1

    # if the set of letters is never found, then it almost certainly can't be pronounced, or possibly is simply not in our dataset.
    # in that case, decrement the count to -1 so the returned score is negative rather than zero.
    if proportion['in_line'] == 0:
        proportion['in_line'] -= 1
    # return the number of times it was found divided by the total number of words checked (multiply by 10 to trim leading zeroes)
    return (proportion['in_line'] / sum(proportion.values())) * 10


def is_pronounceable_heuristic(word: str) -> bool:
    # temporary and pretty-good threshold value. Could use some fine-tuning.
    THRESHOLD = 0.35

    # extract each sequential pair of letters from the word, and get the pronounceability score for each pair:
    # for example:
    # "hello" ->  ["he", "el", "ll", "lo"] -> [0.45..., 0.62..., 0.57..., 0.44...]
    scores = [pronounceable_score_heuristic(pair) for pair in get_bigrams(word)]

    # if the average pronounceability score is too low, we assume it isn't pronounceable.
    return mean(scores) >= THRESHOLD


# this function allows for more concise and readable code in our main test flow.
# uses a contextlib contextmanager to implement __enter__ and __exit__ for our function
# so we can use it in 'with' statements.
@contextmanager
def heuristic_function():
    function = is_pronounceable_heuristic
    try:
        yield function
    finally:
        pass


# Vectorization Code

In [None]:
from itertools import product
import string
def generate_feature_vector(word_or_bigrams):
    # accept either a word or a precomputed list of its bigrams
    if isinstance(word_or_bigrams, str):
        return generate_feature_vector(get_bigrams(word_or_bigrams))
    elif isinstance(word_or_bigrams, list):
        present = set(word_or_bigrams)
        # product() yields every ordered letter pair ("aa" .. "zz", 676 keys);
        # combinations_with_replacement would skip pairs such as "ba"
        feature_vector = {
            first + second: 1 if first + second in present else 0
            for first, second in product(string.ascii_lowercase, repeat=2)
        }
        return feature_vector
    else:
        raise TypeError(f"Requires either 'str' or List[str] as input for generate_feature_vector(), found {type(word_or_bigrams)}.")



## ML Code


 Now that we have 