# Development Discussion for EmoPy

We will use this notebook to illustrate scraping the [Unicode Emoji List](https://unicode.org/emoji/charts/full-emoji-list.html). An easily accessible Python repository of emojis likely has many uses, but the original purpose was to build a set of tools that makes searching for and extracting emojis from common social media text easier.

### Acquiring

In [1]:
import requests as rq
from lxml import html

In [2]:
# Get the emojis and their descriptions
resp = rq.get("https://unicode.org/emoji/charts/full-emoji-list.html")
html_tree = html.fromstring(resp.text)
emojis = html_tree.xpath('//td[@class="chars"]//text()')
descrs = html_tree.xpath('//td[contains(@class,"name")]//text()')

for idx, (e, d) in enumerate(zip(emojis, descrs)):
    print(f'Emoji: {e} || Descr: {d}')
    if idx >= 5:
        break
print("...")
print(f"There are {len(emojis)} emojis in the list")

Emoji: 😀 || Descr: grinning face
Emoji: 😃 || Descr: grinning face with big eyes
Emoji: 😄 || Descr: grinning face with smiling eyes
Emoji: 😁 || Descr: beaming face with smiling eyes
Emoji: 😆 || Descr: grinning squinting face
Emoji: 😅 || Descr: grinning face with sweat
...
There are 1816 emojis in the list


### Sorting

On exploration, we quickly realize that emojis can be several characters long, and that the longer emojis (in number of characters) are often "glued" together versions of shorter ones, typically joined with the Unicode character `\u200d`. Since we want to find the actual, intended emoji, we will sort the list from longest to shortest.

To be clear, imagine that we had "emojis" `A+B`, `A`, and `B`. Then, suppose we were searching a text which includes the single emoji `A+B`. If we searched for just `A` or just `B`, we would find them; however, we really want to find the full emoji `A+B` and not give the limelight to the shorter "building blocks" of the true emoji. This leads us to search by first looking for `A+B`; if that's not there, then search for `A`; if not there, then `B`. This is the perspective of ordering longer-to-shorter when we search.
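
To see why the order matters, here is a minimal sketch using plain ASCII stand-ins (the strings `AB`, `A`, and `B` are placeholders, not real emojis). Python's `re` module tries alternatives from left to right and takes the first one that matches, so the longest candidate has to come first.

```python
import re

# Placeholder "emojis": "AB" plays the role of the glued emoji A+B.
candidates = ["A", "B", "AB"]
text = "xx AB yy"

# Unsorted: the shorter building blocks win and the glued form is never reached.
unsorted_re = re.compile("|".join(candidates))
short_first = unsorted_re.findall(text)

# Sorted longest to shortest: the glued form is tried first.
by_length = sorted(candidates, key=len, reverse=True)
sorted_re = re.compile("|".join(by_length))
long_first = sorted_re.findall(text)

print(short_first)  # ['A', 'B']
print(long_first)   # ['AB']
```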

In [3]:
emojis, descrs = zip(*sorted(list(zip(emojis, descrs)), key=lambda x: len(x[0]), reverse=True))

In [4]:
for idx, (e, d) in enumerate(zip(emojis, descrs)):
    print(f'Emoji: {e} || Length: {len(e)} || Descr: {d}')
    if idx >= 5:
        break

Emoji: 👩‍❤️‍💋‍👨 || Length: 8 || Descr: kiss: woman, man
Emoji: 👨‍❤️‍💋‍👨 || Length: 8 || Descr: kiss: man, man
Emoji: 👩‍❤️‍💋‍👩 || Length: 8 || Descr: kiss: woman, woman
Emoji: 👨‍👩‍👧‍👦 || Length: 7 || Descr: family: man, woman, girl, boy
Emoji: 👨‍👩‍👦‍👦 || Length: 7 || Descr: family: man, woman, boy, boy
Emoji: 👨‍👩‍👧‍👧 || Length: 7 || Descr: family: man, woman, girl, girl


### Searching via RegEx

We will create a regular expression of the form `(A+B|A|B)` so that if `A+B` appears, it matches as a whole rather than as a separate `A` and `B`, while standalone instances of `A` or `B` still match. Since we have already ordered the emojis from longest to shortest, you would think that all that remains is to compile the regular expression
```
emojis_re_str = "(" + "|".join(emojis) + ")"
emojis_re = re.compile(emojis_re_str)
```
However, you will run into an error if you try this. The cause is the asterisk character `*`, which the regex engine interprets as a repetition operator but which appears literally in the emoji `emojis[223]` (take a look!). So, we need to escape it with a backslash:

In [5]:
import re

In [6]:
# Wrap each emoji in its own group and escape the literal asterisk
emojis_re_str = "(" + "|".join(["(%s)" % el.replace("*", r"\*") for el in emojis]) + ")"
emojis_re = re.compile(emojis_re_str)
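
Escaping only `*` works for the current list, but it would fail again if a later version of the list contained another regex metacharacter. One more general option, sketched here as an assumption rather than as what `emopy.py` actually does, is to pass every entry through `re.escape`. The three-element `sample_emojis` list is a hand-written stand-in for the scraped `emojis`.

```python
import re

# Stand-in for the scraped list: keycap asterisk, keycap hash, grinning face.
sample_emojis = ["*\ufe0f\u20e3", "#\ufe0f\u20e3", "\U0001F600"]

# re.escape backslash-escapes any regex metacharacters in each entry.
escaped_re = re.compile(
    "(" + "|".join("(%s)" % re.escape(el) for el in sample_emojis) + ")"
)

# tup[0] is the full match for each emoji found in the text.
found = [tup[0] for tup in escaped_re.findall("press *\ufe0f\u20e3 then \U0001F600")]
print(found)
```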

When finding matches, this expression scans the text from left to right and skips any characters already consumed by an earlier match. Moreover, since it is analogous to `((A+B)|(A)|(B))`, the inner parentheses split up the result: `findall` returns a tuple `tup` for each emoji found, where the first element `tup[0]` is the matched emoji and the remaining entries `tup[1:]` correspond one-to-one to the inner parentheses, holding `''` where that alternative did not match and `'<emoji>'` where it did. For example (escaping the literal `+`, which is itself a regex operator), given the text `s="lorem ipsum A+B"`, `re.findall(r'((A\+B)|(A)|(B))', s)` returns the list of tuples `[('A+B', 'A+B', '', '')]`, whereas with the text `s2="lorem ipsum A B"`, `re.findall(r'((A\+B)|(A)|(B))', s2)` returns `[('A','','A',''), ('B','','','B')]`.
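
The toy case can be run directly. The sketch below uses ASCII placeholders and a hypothetical `names` list standing in for the descriptions; the literal `+` has to be escaped because it is itself a regex operator.

```python
import re

# Toy alternatives, longest first; the literal "+" is escaped.
pattern = re.compile(r"((A\+B)|(A)|(B))")
names = ["glued A+B", "plain A", "plain B"]  # hypothetical descriptions

for text in ["lorem ipsum A+B", "lorem ipsum A B"]:
    for tup in pattern.findall(text):
        # tup[0] is the overall match; tup[1:] lines up with the inner groups.
        position = tup[1:].index(tup[0])
        print(text, "->", tup, names[position])
```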

To show that this works as expected with respect to the gluing, consider the first emoji in our list, `emojis[0]`. This emoji shows a woman and a man kissing and is created by concatenating the emojis for a woman, a heart, kiss lips, and finally a man. Take a look, and notice that the "glue" is the Unicode character `\u200d`:

In [7]:
list(emojis[0])

['👩', '\u200d', '❤', '️', '\u200d', '💋', '\u200d', '👨']
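
To make the individual pieces easier to read, the standard library's `unicodedata` module can print the official name of each code point. This is a small sketch with the sequence written out as explicit escapes so it runs without the scraped list.

```python
import unicodedata

# The "kiss: woman, man" sequence written out as explicit code points.
kiss = "\U0001F469\u200d\u2764\ufe0f\u200d\U0001F48B\u200d\U0001F468"

for ch in kiss:
    # Fall back to a placeholder for any code point without a name.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch, '<unnamed>')}")
```

The `\u200d` entries show up as `ZERO WIDTH JOINER`, and the invisible character after the heart is `VARIATION SELECTOR-16`.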

Now, we will create a made-up sentence of the kind one might see on Twitter, containing the 8-character kiss emoji followed by separate woman, man, and heart emojis:

In [8]:
em_breakdown = list(emojis[0])
s = f'Hi!! :) I love you {emojis[0]}{em_breakdown[0]}{em_breakdown[-1]}{em_breakdown[2]}'
print(s)

Hi!! :) I love you 👩‍❤️‍💋‍👨👩👨❤


Let's extract the emojis that we find with this regex, and notice that we should be able to match them to the emoji descriptions based on the location of the emoji element in the tuple:

In [9]:
for e in emojis_re.findall(s):
    em = e[0]  # the full match (outer group)
    arg = e[1:].index(em)  # position of the inner group that matched
    print(em, descrs[arg])

👩‍❤️‍💋‍👨 kiss: woman, man
👩 woman
👨 man
❤ red heart


Bingo! Notice that we did not double-count any emojis that appear inside the "glued" expression, nor miss the ones standing on their own, and we were able to recover the descriptive text as well!
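
The positional lookup relies on one capture group per emoji, which means well over a thousand groups for the full list. An alternative, sketched here with a short hand-written list (and not confirmed to be what `emopy.py` does), is to use a single alternation without inner groups and map each match to its description through a dictionary. The names `sample_emojis`, `sample_descrs`, and `descr_of` are illustrative only.

```python
import re

# Hand-written stand-ins for the scraped, length-sorted lists.
sample_emojis = [
    "\U0001F469\u200d\u2764\ufe0f\u200d\U0001F48B\u200d\U0001F468",
    "\U0001F469",
    "\U0001F468",
    "\u2764",
]
sample_descrs = ["kiss: woman, man", "woman", "man", "red heart"]

# Map each emoji to its description for direct lookup.
descr_of = dict(zip(sample_emojis, sample_descrs))
lookup_re = re.compile("|".join(re.escape(el) for el in sample_emojis))

text = "I love you " + sample_emojis[0] + "\U0001F469\U0001F468\u2764"
pairs = [(em, descr_of[em]) for em in lookup_re.findall(text)]
for em, de in pairs:
    print(em, de)
```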

### Removing Emojis from a Document

We've successfully extracted emojis from a document in a desirable way. Now, let's focus on extracting the non-emoji text. We will be naive about this and simply create a negated character class from the emojis in our list.

In [10]:
not_emojis_re_str = "[^" + "".join([el.replace("*", r"\*") for el in emojis]) + "]"
not_emojis_re = re.compile(not_emojis_re_str)

In [11]:
# Test
"".join(not_emojis_re.findall(s))

'Hi!! :) I love you '

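Now, the truth is this approach could use some work. Because the negated character class operates on individual characters rather than whole emojis, it also drops any ordinary character that happens to be part of an emoji sequence, such as the digits, `#`, and `*` used in the keycap emojis, and it is not particularly efficient. For our purposes we'll use it for now, but remember to revisit the issue later!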
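
One possible refinement, shown here only as a sketch with a two-element stand-in list, is to delete whole emoji sequences with `sub` instead of filtering character by character. The comparison below uses a keycap emoji to show how the two approaches differ on ordinary digits.

```python
import re

# Stand-ins for the scraped list: keycap "1" and grinning face.
sample_emojis = ["1\ufe0f\u20e3", "\U0001F600"]
text = "I have 11 cats \U0001F600"

# Negated character class: works per character, so plain "1" is lost too.
char_class_re = re.compile("[^" + "".join(re.escape(el) for el in sample_emojis) + "]")
per_char = "".join(char_class_re.findall(text))

# Whole-emoji alternation: only complete emoji sequences are deleted.
whole_re = re.compile("|".join(re.escape(el) for el in sample_emojis))
per_emoji = whole_re.sub("", text)

print(repr(per_char))   # 'I have  cats '
print(repr(per_emoji))  # 'I have 11 cats '
```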

## Using the emopy Class

We create a class which does this scraping for us and also includes methods for extracting the emojis and the text with the emojis removed. Here is an example. The code in `emopy.py` agrees very closely with the notebook code above, with a slight modification to the emoji and description extraction to handle the case where the text contains no emojis at all.
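
The source of `emopy.py` is not reproduced in this notebook, so the following is only a rough sketch of how such a class might be organized. The class name `EmojiExtractorSketch` and its internals are hypothetical; it takes the emoji and description lists as arguments instead of scraping them, and only the two method names mirror the ones used in the cells below.

```python
import re


class EmojiExtractorSketch:
    """Hypothetical stand-in for the emopy class; takes lists instead of scraping."""

    def __init__(self, emojis, descrs):
        # Sort pairs longest emoji first so glued sequences win over their parts.
        pairs = sorted(zip(emojis, descrs), key=lambda p: len(p[0]), reverse=True)
        self.emojis = [e for e, _ in pairs]
        self.descrs = [d for _, d in pairs]
        self._descr_of = dict(pairs)
        self._emojis_re = re.compile("|".join(re.escape(e) for e in self.emojis))

    def emojis_descrs_text_from_doc(self, doc):
        found = self._emojis_re.findall(doc)
        # With no matches, both lists are simply empty.
        found_descrs = [self._descr_of[e] for e in found]
        text = self._emojis_re.sub("", doc)
        return found, found_descrs, text

    def emojis_descrs_text_from_docs(self, docs):
        # Lazily apply the single-document method to each string.
        for doc in docs:
            yield self.emojis_descrs_text_from_doc(doc)


sketch = EmojiExtractorSketch(["\u2764", "\U0001F469"], ["red heart", "woman"])
print(sketch.emojis_descrs_text_from_doc("Hi \U0001F469\u2764"))
print(list(sketch.emojis_descrs_text_from_docs(["Hi!", "ok \u2764"])))
```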

In [12]:
from emopy import emopy
emos = emopy()

In [13]:
# Helper function for inspecting the behavior below
def quick_peak(emojis, descrs, text = ''):
    if not emojis:
        print(f"No emojis found.")
    else:
        for em, de in zip(emojis, descrs):
            print(f"Emoji: {em} || Descr: {de}")
    if text: print(f"Text extracted: '{text}'")

In [14]:
quick_peak(emos.emojis[:5], emos.descrs[:5])

Emoji: 👩‍❤️‍💋‍👨 || Descr: kiss: woman, man
Emoji: 👨‍❤️‍💋‍👨 || Descr: kiss: man, man
Emoji: 👩‍❤️‍💋‍👩 || Descr: kiss: woman, woman
Emoji: 👨‍👩‍👧‍👦 || Descr: family: man, woman, girl, boy
Emoji: 👨‍👩‍👦‍👦 || Descr: family: man, woman, boy, boy


In [15]:
print(f"Applying to the string: {s}\n")
# Use emojis_descrs_text_from_doc method
e,d,t = emos.emojis_descrs_text_from_doc(s)
# Inspect
quick_peak(e,d,t)

Applying to the string: Hi!! :) I love you 👩‍❤️‍💋‍👨👩👨❤

Emoji: 👩‍❤️‍💋‍👨 || Descr: kiss: woman, man
Emoji: 👩 || Descr: woman
Emoji: 👨 || Descr: man
Emoji: ❤ || Descr: red heart
Text extracted: 'Hi!! :) I love you '


In [16]:
print(f"Applying to the string: 'Hi!'\n")
# Use emojis_descrs_text_from_doc method
e2,d2,t2 = emos.emojis_descrs_text_from_doc('Hi!')
# Inspect
quick_peak(e2,d2,t2)

Applying to the string: 'Hi!'

No emojis found.
Text extracted: 'Hi!'


### Extracting Emojis and Text From Many

We can also apply our method to an iterable of strings; note that the method name changes subtly, ending in `s` to mark the plural.

In [17]:
ss = [s, 'Hi!', f'Good boy, Gibson. {emos.emojis[150]}']
ss

['Hi!! :) I love you 👩\u200d❤️\u200d💋\u200d👨👩👨❤',
 'Hi!',
 'Good boy, Gibson. 😮\u200d💨']

In [18]:
for e,d,t in emos.emojis_descrs_text_from_docs(ss):
    quick_peak(e,d,t)
    print("")

Emoji: 👩‍❤️‍💋‍👨 || Descr: kiss: woman, man
Emoji: 👩 || Descr: woman
Emoji: 👨 || Descr: man
Emoji: ❤ || Descr: red heart
Text extracted: 'Hi!! :) I love you '

No emojis found.
Text extracted: 'Hi!'

Emoji: 😮‍💨 || Descr: ⊛ face exhaling
Text extracted: 'Good boy, Gibson. '

