# Naive Sentence to Emoji Translation 
## Purpose
To workshop a naive version of an sentence to emoji translation algorithm. The general idea is that sentences can be "chuncked" out into n-grams that are more related to a single emoji. The related-ness of an n-gram to an emoji is directly related to the cosine similarity of the sent2vec representation of the sentence and the sent2vec representation of one of the emoji's definitions. The emoji definitons are gathered from the [emoji2vec](https://github.com/uclmr/emoji2vec) github repo and the sent2vec model is from the [sent2vec](https://github.com/epfml/sent2vec) github repo. 

## Issues
- The generation of all of the n-grams is so incredibly slow
- There are some issues with lemmatization (e.g. poop != pooped when lemmatized)

## Ideas
- Add bias for fewer emojis
- Turn the summarization into a class as to easily test new configurations of lemmatizers/stop-words

In [246]:
import sent2vec
from scipy.spatial.distance import cosine
from typing import List, Tuple
from itertools import combinations
import numpy as np
from nltk import word_tokenize
from functools import lru_cache
from nltk.stem import PorterStemmer, WordNetLemmatizer, SnowballStemmer
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
import spacy

In [170]:
# Initialize the sent2vec model
s2v = sent2vec.Sent2vecModel()
s2v.load_model('../models/wiki_unigrams.bin') # https://drive.google.com/open?id=0B6VhzidiLvjSa19uYWlLUEkzX3c

In [248]:
# Intitialize the NLTK lemmatizer
lemmatizerSpacy = spacy.load('en', disable=['parser', 'ner'])
ps = PorterStemmer()
sb = SnowballStemmer("english")
lemmatizerNLTK = WordNetLemmatizer()

## Sentence Cleaning
The general idea with sentence cleaning is that the sentences need to be put into the same "format" for better analysis. There are two main aspects of cleaning: 1) removal, and 2) modification. Removal is primarily for tokens that do not contribute to the sentence at all. These include ".", "and", "but". Normally this is a standard step in sentence cleaning but it has actually has zero effect on the output that I can see. However, token modification changes the spelling of tokens to uniform all tokens that use the same root. For example "rocked", "rock", "rocking" should all be reduced to their lemma of "rock". There are two different ways to do this: [stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html). 

In [239]:
lemma_func = stemmer.stem
keep_stop_words = False
def clean_sentence(sent: str) -> str:
    """
    Clean and lemmatize a sentence
    
    TODO: More complex cleaning when the dataset get's more messy
    
    Args:
        sent(str): Sentence to clean
    Rets:
        (str): Cleaned sentence
    """
    # Lemmatize each word in the sentence
    #return " ".join([token.lemma_ for token in lemmatizer(sent.lower())])
    return " ".join([lemma_func(token) for token in word_tokenize(sent.lower()) if token not in stopwords or keep_stop_words])

In [274]:
# Define the array to store the (emoji, repr) 2-tuple
emoji_embeddings = []
def generate_emoji_embeddings():
    emoji_embeddings = []
    # Open the file that stores the emoji, description 2-tuple list
    with open("emoji_joined.txt") as emojis:
        for defn in emojis:
            # The file is tab-delim
            split = defn.split("\t")

            # Get the emoji and the description from the current line
            emoji = split[-1].replace("\n", "")
            desc = clean_sentence(split[0])

            # Add each emoji and embedded description to the list
            emoji_embeddings.append((emoji, s2v.embed_sentence(desc)))

In [256]:
@lru_cache(maxsize=100)
def closest_emoji(sent: str) -> Tuple[str, int]:
    """
    Get the closest emoji to the given sentence
    
    Args:
        sent(List[str]): Sentence to check
    Ret:
        (Tuple[str, int]) Closest emoji, the respective cosine similarity
    
    """    
    # Embed the sentence using sent2vec 
    emb = s2v.embed_sentence(sent)

    # Start the lowest cosine at higher than it could ever be
    lowest_cos = 1_000_000

    # The best emoji starts as an empty string placeholder
    best_emoji = ""

    # Loop through the dictionary
    for emoji in emoji_embeddings:
        # Get the current emoji's embedding
        emoji_emb = emoji[1]

        # Check the cosine difference between the emoji's embedding and
        # the sentence's embedding
        curr_cos = cosine(emoji_emb, emb)

        # If it lower than the lowest then it is the new best
        if curr_cos < lowest_cos:
            lowest_cos = curr_cos
            best_emoji = emoji[0]

    # Return a 2-tuple containing the best emoji and its cosine differnece
    return best_emoji, lowest_cos

In [273]:
def combinations_of_sum(sum_to, combo=None):
    combos = []
    if combo is None:
        combo = [1 for x in range(sum_to)]
        combos.append(combo)
    
    if len(combo) == 0:
        return None
    
    for i in range(1, len(combo)):
        combo_to_query = combo[:i-1] + [sum(combo[i - 1:i + 1])] + combo[i+1:]
        combos.append(combo_to_query)
        [combos.append(combo) for combo in combinations_of_sum(sum_to, combo_to_query) if combo is not None]
            
    return combos
    
def combinations_of_sent(sent):
    sent_combos = []
    def combinations_of_sent_helper(sent):
        sent = word_tokenize(sent)
        combos = np.unique(combinations_of_sum(len(sent)))
        sent_combos = []
        for combo in combos:
            sent_combo = []
            curr_i = 0
            for combo_len in combo:
                space_joined = " ".join(sent[curr_i:combo_len + curr_i])
                if space_joined not in sent_combo:
                    sent_combo.append(space_joined) 
                curr_i += combo_len

            if sent_combo not in sent_combos:
                sent_combos.append(sent_combo)
        return sent_combos
    
    return combinations_of_sent_helper(sent)

[['1', '2', '3'], ['1', '2 3'], ['1 2', '3'], ['1 2 3']]


In [209]:
def summarize(sent:str) -> Tuple[List[str], List[float], List[str]]: 
    """
    Summarize the given sentence into emojis
    
    Args:
        sent(str): Sentence to summarize
    Rets:
        (Tuple[List[str], List[float], List[str]]): (Emoji Sentence, 
        List of Uncertainty values for the corresponding emoji,
        list of n-grams used to generate the corresponding emoji)
    """
    # Clean the sentence
    sent = clean_sentence(sent)
    
    # Generate all combinations of sentences
    sent_combos = combinations_of_sent(sent)
    # Init "best" datamembers as empty or exceedingly high
    best_emojis = ""
    best_n_grams = []
    best_uncertainties = [100_000_000]
    # Iterate through every combination of sentence combos
    for sent_combo in sent_combos:
        # Start the local data members as empty
        emojis = ""
        uncertainties = []
        # Iterate through each n_gram adding the uncertainty and emoji to the lists
        for n_gram in sent_combo:
            close_emoji, cos_diff = closest_emoji(n_gram)
            emojis += close_emoji
            uncertainties.append(cos_diff)

        # Check if the average uncertainty is less than the best
        # TODO: Maybe a median check would be helpful as well?
        if sum(uncertainties)/len(uncertainties) < sum(best_uncertainties)/len(best_uncertainties):
            # Update the best emojis
            best_emojis = emojis
            best_n_grams = sent_combo
            best_uncertainties = uncertainties[:]
            
    # Clear the function cache on closest_emoji because it is unlikely the next run will make use of them
    closest_emoji.cache_clear()
    
    # Return the emoji "sentence", list of all the cosine similarities, and all of the n-grams
    return (best_emojis, best_uncertainties, best_n_grams)

In [275]:
def format_summary(sents, p_lemma_func, p_keep_stop_words):
    lemma_func = p_lemma_func
    keep_stop_words = p_keep_stop_words
    generate_emoji_embeddings()
    for sent in sents:
        summarization_res = summarize(sent)
        print(sent, "|", round(1 - sum(summarization_res[1])/len(summarization_res[1]), 3), "|", [round(x, 3) for x in summarization_res[1]] ,"|", summarization_res[2], "|", summarization_res[0] + "|")

In [277]:
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
format_summary(sents, stemmer.stem, False)

christmas music rings from the clock tower | -999999.0 | [1000000, 1000000, 1000000, 1000000, 1000000] | ['christmas', 'music', 'rings', 'clock', 'tower'] | |
It isn't perfect but it is a start | -999999.0 | [1000000, 1000000, 1000000] | ["n't", 'perfect', 'start'] | |
The sun is rising over new york city | -999999.0 | [1000000, 1000000, 1000000, 1000000, 1000000] | ['sun', 'rising', 'new', 'york', 'city'] | |


In [237]:
lemma_func = stemmer.stem
keep_stop_wordsords = True
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['christma music', 'ring', 'clock', 'tower'] | 🎻💍🕰🏰|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['sun', 'rise', 'new york', 'citi'] | 🌄🌇🗽🚏|


[None, None, None]

In [242]:
lemma_func = lemmatizerNLTK.lemmatize
keep_stop_wordsords = False
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 1.0 | [0.0, 0.0, 0.0, 0.0, 0.0] | ['christmas', 'music', 'ring', 'clock', 'tower'] | 🎄🎻💍⏰🏰|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 0.927 | [0.219, 0.0, 0.0] | ['sun rising', 'new york', 'city'] | 🌄🗽🚏|


[None, None, None]

In [245]:
# SEEMS LIKE THIS ONE IS BEST
# Equal to the one above but less computationally intensive
lemma_func = lemmatizerNLTK.lemmatize
keep_stop_wordsords = True
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 1.0 | [0.0, 0.0, 0.0, 0.0, 0.0] | ['christmas', 'music', 'ring', 'clock', 'tower'] | 🎄🎻💍⏰🏰|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 0.927 | [0.219, 0.0, 0.0] | ['sun rising', 'new york', 'city'] | 🌄🗽🚏|


[None, None, None]

In [252]:
lemma_func = sb.stem
keep_stop_wordsords = True
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['christma music', 'ring', 'clock', 'tower'] | 🎻💍🕰🏰|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['sun', 'rise', 'new york', 'citi'] | 🌄🌇🗽🚏|


[None, None, None]

In [255]:
lemma_func = sb.stem
keep_stop_wordsords = False
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['christma music', 'ring', 'clock', 'tower'] | 🎻💍🕰🏰|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 1.0 | [0.0, 0.0, 0.0, 0.0] | ['sun', 'rise', 'new york', 'citi'] | 🌄🌇🗽🚏|


[None, None, None]

In [267]:
lemma_func = lambda x: x
keep_stop_wordsords = True
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 0.956 | [0.0, 0.0, 0.133] | ['christmas', 'music', 'rings clock tower'] | 🎄🎻🏫|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 0.927 | [0.219, 0.0, 0.0] | ['sun rising', 'new york', 'city'] | 🌄🗽🚏|


[None, None, None]

In [270]:
lemma_func = lambda x: x
keep_stop_wordsords = False
sents = ["christmas music rings from the clock tower", "It isn't perfect but it is a start", "The sun is rising over new york city"]
[format_summary(sent) for sent in sents]

christmas music rings from the clock tower | 0.956 | [0.0, 0.0, 0.133] | ['christmas', 'music', 'rings clock tower'] | 🎄🎻🏫|
It isn't perfect but it is a start | 0.823 | [0.353, 0.0] | ["n't perfect", 'start'] | 💯🌱|
The sun is rising over new york city | 0.927 | [0.219, 0.0, 0.0] | ['sun rising', 'new york', 'city'] | 🌄🗽🚏|


[None, None, None]