In [1]:
%%capture
!pip install advertools

In [2]:
import re
from collections import namedtuple, Counter

with open('../input/emoji-data-descriptions-codepoints/emoji-test.txt', 'rt') as file:
    emoji_raw = file.read()
print(emoji_raw[:2800])

# emoji-test.txt
# Date: 2022-08-12, 20:24:39 GMT
# © 2022 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use, see https://www.unicode.org/terms_of_use.html
#
# Emoji Keyboard/Display Test Data for UTS #51
# Version: 15.0
#
# For documentation and usage, see https://www.unicode.org/reports/tr51
#
# This file provides data for testing which emoji forms should be in keyboards and which should also be displayed/processed.
# Format: code points; status # emoji name
#     Code points — list of one or more hex code points, separated by spaces
#     Status
#       component           — an Emoji_Component,
#                             excluding Regional_Indicators, ASCII, and non-Emoji.
#       fully-qualified     — a fully-qualified emoji (see ED-18 in UTS #51),
#                             excluding Emoji_Component
#       minimally-qualified — a minimally-qualified emoji (see ED-18a in UTS #51)
#    

The first few lines explain some details about the file and how the data are represented. The rest of the file follows one pattern: each line represents an emoji, and whenever a new group and/or sub-group starts, its name is listed on a line starting with `#`, showing which group/sub-group the following emoji belong to. 

We will go through the lines, one by one, and extract the information that we need and then put them in an easy-to-use format (`namedtuple`) so we can then use them to create the regex and the CSV file.  
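As a rough illustration of that plan, here is a minimal sketch that parses a single data line. The sample line is typed out by hand to mirror the file's format, and `ParsedLine` and `parse_line` are hypothetical names used only for this example; the full loop further down is what actually builds the entries.

```python
from collections import namedtuple

# Hypothetical, simplified record used only in this sketch
ParsedLine = namedtuple('ParsedLine', ['codepoint', 'status', 'emoji', 'name'])

# A hand-typed line in the same layout as the data lines of emoji-test.txt
sample = '1F600      ; fully-qualified     # \U0001F600 E1.0 grinning face'

def parse_line(line):
    """Split one data line into code point(s), status, emoji and name."""
    left, comment = line.split('#', 1)           # data part vs. comment part
    codepoint, status = [part.strip() for part in left.split(';')]
    emoji, _version, name = comment.strip().split(' ', 2)
    return ParsedLine(codepoint, status, emoji, name)

parsed = parse_line(sample)
print(parsed)
print(parsed.name)
```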

A few things about emoji need to be understood before we can get this done. 

# Single and multi code point emoji

Some emoji can simply be thought of as regular characters.

In [3]:
print('\U00000063')  # the lower-case letter "c" for example

c


In [4]:
print('\U0001F44D')

👍


But what about the similar emoji 👍🏿?  
Let's first compare the two.

In [5]:
len('👍'), len('👍🏿')

(1, 2)

# 🤔

In [6]:
print('👍🏿'[0], '👍🏿'[1])

👍 🏿


In [7]:
import unicodedata
unicodedata.name('👍'), unicodedata.name('👍🏿'[0]), unicodedata.name('👍🏿'[1])

('THUMBS UP SIGN', 'THUMBS UP SIGN', 'EMOJI MODIFIER FITZPATRICK TYPE-6')
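To look at the code points themselves rather than their names, a small helper built on the built-in `ord` can list them. `codepoints` is a name made up for this sketch.

```python
def codepoints(text):
    """Return the code points of a string in the usual U+XXXX notation."""
    return [f'U+{ord(ch):04X}' for ch in text]

thumbs_up = '\U0001F44D'
thumbs_up_dark = '\U0001F44D\U0001F3FF'

print(codepoints(thumbs_up))       # one code point
print(codepoints(thumbs_up_dark))  # two code points, the first one is shared
```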

In [8]:
s = 'The rest of my friends are at the restaurant.'
regex = re.compile('rest|restaurant')
regex.findall(s)

['rest', 'rest']

In [9]:
regex2 = re.compile('restaurant|rest')
regex2.findall(s)

['rest', 'restaurant']

In [10]:
thumbs_sentence = 'This is thumbs up: 👍, and this is thumbs up with dark skin tone: 👍🏿'
thumbs_regex = re.compile('👍|👍🏿')

thumbs_regex.findall(thumbs_sentence)

['👍', '👍']

In [11]:
thumbs_regex2 = re.compile('👍🏿|👍')
thumbs_regex2.findall(thumbs_sentence)

['👍', '👍🏿']

The dark skin tone thumbs up emoji is made up of two code points, while the plain thumbs up is made up of one, which is also the first code point of the other. So we are faced with the same case as "rest" and "restaurant". The regex engine tries the alternatives from left to right and returns the first one that matches. As in the previous example, putting the longer alternative first makes sure that it is checked first, which solves the issue. 
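The ordering rule can be wrapped in a small helper so that it does not depend on how the alternatives were typed. This is only a sketch, and `build_alternation` is a hypothetical name; it sorts the alternatives longest first before joining them with `|`.

```python
import re

def build_alternation(alternatives):
    """Join alternatives with '|' after sorting them longest first."""
    ordered = sorted(alternatives, key=len, reverse=True)
    return re.compile('|'.join(re.escape(alt) for alt in ordered))

# The order in which the alternatives are passed no longer matters
word_regex = build_alternation(['rest', 'restaurant'])
word_matches = word_regex.findall('The rest of my friends are at the restaurant.')
print(word_matches)

thumbs = '\U0001F44D'
thumbs_dark = '\U0001F44D\U0001F3FF'
thumbs_regex = build_alternation([thumbs, thumbs_dark])
thumbs_matches = thumbs_regex.findall(f'plain: {thumbs}, dark: {thumbs_dark}')
print(thumbs_matches)
```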


Here are the two emoji represented by code points. You can see that the first part of each of the 'words' is the same. 

In [12]:
print('\U0001F44D', '\U0001F44D\U0001F3FF')  # the U0001F44D code point exists in both

👍 👍🏿


There are five skin tones, as well as four hair types. All of those fall under the group "component". Those emoji are not supposed to appear on their own, because they don't mean anything by themselves. They function mainly as modifiers for another emoji, appearing right after it in the sequence.  
Here they are; we will be skipping them when creating the final regex. 

In [13]:
for i, line in enumerate(emoji_raw.splitlines()):
    if '; component' in line:
        print(i, line)

3287 1F3FB                                                  ; component           # 🏻 E1.0 light skin tone
3288 1F3FC                                                  ; component           # 🏼 E1.0 medium-light skin tone
3289 1F3FD                                                  ; component           # 🏽 E1.0 medium skin tone
3290 1F3FE                                                  ; component           # 🏾 E1.0 medium-dark skin tone
3291 1F3FF                                                  ; component           # 🏿 E1.0 dark skin tone
3294 1F9B0                                                  ; component           # 🦰 E11.0 red hair
3295 1F9B1                                                  ; component           # 🦱 E11.0 curly hair
3296 1F9B3                                                  ; component           # 🦳 E11.0 white hair
3297 1F9B2                                                  ; component           # 🦲 E11.0 bald
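To make the role of the skin tone components concrete, the sketch below appends each of the five modifiers (U+1F3FB to U+1F3FF, as listed above) to the thumbs up emoji. The variable names are made up for this example.

```python
import unicodedata

# The five skin tone modifiers, U+1F3FB through U+1F3FF
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]
base = '\U0001F44D'  # thumbs up

# A modified emoji is simply the base emoji followed by the modifier
variants = [base + tone for tone in SKIN_TONES]
for variant in variants:
    print(variant, [unicodedata.name(ch) for ch in variant])
```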


Now we create the data structure that will hold our emoji entries. We will use the `namedtuple` because it has a nice representation, telling us exactly what each element means, as well as giving us the ability to extract those elements by name, using dot notation `entry.name` or `entry.group` for example. 

In [14]:
EmojiEntry = namedtuple('EmojiEntry', ['codepoint', 'status', 'emoji', 'name', 'group', 'sub_group'])
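As a quick illustration of why a `namedtuple` is convenient here, the sketch below builds one entry by hand (the values are those of the grinning face emoji) and reads it back by name, by position, and as a dictionary.

```python
from collections import namedtuple

EmojiEntry = namedtuple('EmojiEntry',
                        ['codepoint', 'status', 'emoji', 'name', 'group', 'sub_group'])

# One entry typed by hand for illustration
entry = EmojiEntry(codepoint='1F600', status='fully-qualified', emoji='\U0001F600',
                   name='grinning face', group='Smileys & Emotion',
                   sub_group='face-smiling')

print(entry.name)            # access by field name
print(entry[3])              # the same field, by position
print(entry._asdict())       # handy when converting to other formats
```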

The following code goes through lines one by one, extracting the information that is needed, and appending each entry to `emoji_entries` which will be a list containing all of them.  
I have annotated the code with some comments, and below elaborated a little more to clarify.

In [15]:
E_regex = re.compile(r' ?E\d+\.\d+ ') # remove the pattern E<digit(s)>.<digit(s)>
emoji_entries = []

for line in emoji_raw.splitlines()[32:]:  # skip the explanation lines
    if line == '# Status Counts':  # the last line in the document
        break
    if 'subtotal:' in line:  # these are lines showing statistics about each group, not needed
        continue
    if not line:  # if it's a blank line
        continue
    if line.startswith('#'):  # these lines contain group and/or sub-group names
        if '# group:' in line:
            group = line.split(':')[-1].strip()
        if '# subgroup:' in line:
            subgroup = line.split(':')[-1].strip()
    if group == 'Component':  # skin tones, and hair types, skip, as mentioned above
        continue
    if re.search('^[0-9A-F]{3,}', line):  # if the line starts with a hexadecimal number (an emoji code point)
        # here we define all the elements that will go into emoji entries
        codepoint = line.split(';')[0].strip()  # in some cases it is one and in others multiple code points
        status = line.split(';')[-1].split()[0].strip() # status: fully-qualified, minimally-qualified, unqualified
        if line[-1] == '#':
            # The special case where the emoji is actually the hash sign "#". In this case manually assign the emoji
            if 'fully-qualified' in line:
                emoji = '#️⃣'
            else:
                emoji = '#⃣'  # they look the same, but are actually different 
        else:  # the default case
            emoji = line.split('#')[-1].split()[0].strip()  # the emoji character itself
        if line[-1] == '#':  # (the special case)
            name = '#'
        else:  # extract the emoji name
            split_hash = line.split('#')[1]
            rm_capital_E = E_regex.split(split_hash)[1]
            name = rm_capital_E
        templine = EmojiEntry(codepoint=codepoint,
                              status=status,
                              emoji=emoji,
                              name=name,
                              group=group,
                              sub_group=subgroup)
        emoji_entries.append(templine)
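The name extraction relies on `E_regex.split`, which may be easier to follow in isolation. The sketch below applies the same pattern to two hand-typed comment parts (the text after the `#` in a data line); the element at index 1 is the emoji name.

```python
import re

E_regex = re.compile(r' ?E\d+\.\d+ ')  # same pattern as in the cell above

# Hand-typed examples of what follows the '#' on a data line
comment_1 = ' \U0001F600 E1.0 grinning face'
comment_2 = ' \U0001F9B0 E11.0 red hair'

parts_1 = E_regex.split(comment_1)
parts_2 = E_regex.split(comment_2)
print(parts_1)  # the emoji (with a leading space), then the name
print(parts_2)
```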


In [16]:
emoji_dict = {x.emoji: x for x in emoji_entries}

In [17]:
emoji_dict['😆'].emoji

'😆'

In [18]:
emoji_entries[0]

EmojiEntry(codepoint='1F600', status='fully-qualified', emoji='😀', name='grinning face', group='Smileys & Emotion', sub_group='face-smiling')

In [19]:
emoji_entries[0].emoji

'😀'

In [20]:
emoji_entries[0].group, emoji_entries[0].sub_group

('Smileys & Emotion', 'face-smiling')

Here is a quick summary of the counts of the groups, sub-groups, and all group/sub-group combinations:

In [21]:
Counter([x.group for x in emoji_entries])

Counter({'People & Body': 2998,
         'Objects': 310,
         'Symbols': 304,
         'Flags': 275,
         'Travel & Places': 267,
         'Smileys & Emotion': 180,
         'Animals & Nature': 159,
         'Food & Drink': 135,
         'Activities': 96})

In [22]:
sorted(Counter([x.sub_group for x in emoji_entries]).items(), key=lambda x: x[1], reverse=True)[:30]

[('person-role', 635),
 ('family', 534),
 ('person-sport', 395),
 ('person-activity', 318),
 ('person-gesture', 300),
 ('country-flag', 258),
 ('person-fantasy', 245),
 ('person', 192),
 ('animal-mammal', 68),
 ('hand-fingers-open', 67),
 ('sky & weather', 65),
 ('hands', 62),
 ('hand-fingers-partial', 55),
 ('transport-ground', 55),
 ('clothing', 50),
 ('body-parts', 49),
 ('alphanum', 49),
 ('hand-single-finger', 43),
 ('person-resting', 42),
 ('geometric', 38),
 ('hand-fingers-closed', 36),
 ('tool', 35),
 ('arrow', 35),
 ('food-prepared', 34),
 ('time', 34),
 ('av-symbol', 34),
 ('other-symbol', 33),
 ('place-building', 32),
 ('office', 31),
 ('game', 30)]

In [23]:
Counter([' | '.join([x.group, x.sub_group]) for x in emoji_entries])

Counter({'People & Body | person-role': 635,
         'People & Body | family': 534,
         'People & Body | person-sport': 395,
         'People & Body | person-activity': 318,
         'People & Body | person-gesture': 300,
         'Flags | country-flag': 258,
         'People & Body | person-fantasy': 245,
         'People & Body | person': 192,
         'Animals & Nature | animal-mammal': 68,
         'People & Body | hand-fingers-open': 67,
         'Travel & Places | sky & weather': 65,
         'People & Body | hands': 62,
         'People & Body | hand-fingers-partial': 55,
         'Travel & Places | transport-ground': 55,
         'Objects | clothing': 50,
         'People & Body | body-parts': 49,
         'Symbols | alphanum': 49,
         'People & Body | hand-single-finger': 43,
         'People & Body | person-resting': 42,
         'Symbols | geometric': 38,
         'People & Body | hand-fingers-closed': 36,
         'Objects | tool': 35,
         'Symbols | arrow':

## Emoji status
In case you are wondering about the status column, this is the explanation from the
[Unicode official documentation:](http://unicode.org/reports/tr51/#def_qualified_emoji_character) 

>ED-17a. qualified emoji character — An emoji character in a string that (a) has default emoji presentation or (b) is the first character in an emoji modifier sequence or (c) is not a default emoji presentation character, but is the first character in an emoji presentation sequence.  
>ED-18. fully-qualified emoji — A qualified emoji character, or an emoji sequence in which each emoji character is qualified.  
>ED-18a. minimally-qualified emoji — An emoji sequence in which the first character is qualified but the sequence is not fully qualified.  
>ED-19. unqualified emoji — An emoji that is neither fully-qualified nor minimally qualified.
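A concrete case may help. The sketch below assumes the check mark, which in its fully-qualified form is U+2714 followed by the variation selector U+FE0F, and in its unqualified form is U+2714 alone. The two usually look alike on screen but are different strings.

```python
import unicodedata

unqualified = '\u2714'            # check mark without a variation selector
fully_qualified = '\u2714\uFE0F'  # check mark followed by U+FE0F

print(len(unqualified), len(fully_qualified))
print(unicodedata.name(fully_qualified[1]))
print(fully_qualified == unqualified)
```

This is one reason the longer forms need to come first in the regex: the unqualified form is a prefix of the fully-qualified one.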

As mentioned above, we need to handle single and multiple code point emoji slightly differently.  
We start by extracting the multi code point emoji.

In [24]:
multi_codepoint_emoji = []

for code in [c.codepoint.split() for c in emoji_entries]:
    if len(code) > 1:
        # turn to a hexadecimal number zfilled to 8 zeros e.g: '\U0001F44D'
        hexified_codes = [r'\U' + x.zfill(8) for x in code]  
        hexified_codes = ''.join(hexified_codes)  # join all hexadecimal components 
        multi_codepoint_emoji.append(hexified_codes)

# sorting by length in decreasing order is extremely important as demonstrated above
multi_codepoint_emoji_sorted = sorted(multi_codepoint_emoji, key=len, reverse=True)

# join with a "|" to function as an "or" in the regex
multi_codepoint_emoji_joined = '|'.join(multi_codepoint_emoji_sorted)  
multi_codepoint_emoji_joined[:400]  # sample

'\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FC|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FD|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FE|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F'
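To check that such an escaped string really works as a regex, here is a minimal sketch for a single entry. `codepoints_to_pattern` is a hypothetical helper that applies the same `zfill(8)` conversion to one `codepoint` field.

```python
import re

def codepoints_to_pattern(codepoint_field):
    r"""Turn a field like '1F44D 1F3FF' into '\U0001F44D\U0001F3FF'."""
    return ''.join(r'\U' + cp.zfill(8) for cp in codepoint_field.split())

pattern = codepoints_to_pattern('1F44D 1F3FF')
print(pattern)

# The re module understands \UXXXXXXXX escapes inside patterns
found = re.findall(pattern, 'dark thumbs up: \U0001F44D\U0001F3FF')
print(found)
```

Since every code point becomes exactly ten characters in the escaped form, sorting the escaped strings by `len` is equivalent to sorting by the number of code points.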

In [25]:
single_codepoint_emoji = []

for code in [c.codepoint.split() for c in emoji_entries]:
    if len(code) == 1:
        single_codepoint_emoji.append(code[0])

# Regex character ranges

Since the single code point emoji are basically one character each, they can be treated as normal letters or numbers in the regex.  
One important feature of character classes is their ability to contain character ranges. 
If I want to match a character that falls between A and F, there are two ways to define the character class: 

- `[ABCDEF]`
- `[A-F]`

They effectively mean the same thing. The advantage of the second is that it is much more readable. Imagine wanting to match the letters from A to T: with all of them listed, it would be difficult to see which letters are included, while `[A-T]` is very easy to read.  
I also believe there might be a slight performance boost with character ranges. Some regex engines do certain optimizations on their own, and I'm not aware of those details. But in general making two comparisons is far more efficient than making fifty.  
For example, you have the number 42, and want to check if it falls between 1 and 100. 
In the character range case, you make two comparisons. You check if 42 >= 1 and 42 <= 100.  
If you have all the numbers listed from 1 to 100, then you will have to make 42 comparisons to find out. On average, with a list of 100 numbers, you will be making about fifty comparisons. With larger ranges, this number can obviously get very large.  
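The equivalence of the two notations is easy to verify. The sketch below compares `[ABCDEF]` and `[A-F]` on every ASCII letter; it says nothing about speed, only that both classes accept the same characters.

```python
import re
import string

explicit = re.compile('[ABCDEF]')
ranged = re.compile('[A-F]')

# Both classes should agree on every letter
same_result = all(
    bool(explicit.fullmatch(ch)) == bool(ranged.fullmatch(ch))
    for ch in string.ascii_letters
)
matched = [ch for ch in string.ascii_uppercase if ranged.fullmatch(ch)]
print(same_result, matched)
```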

Below is the function `get_ranges`. It takes a list of integers, and returns a list of tuples, each representing the local minimum and maximum for any number of contiguous integers (numbers differing by 1).  
For example if I have the list `[1, 2, 3, 4, 6, 7, 8, 10, 20]`, it will return `[(1, 4), (6, 8), (10, 10), (20, 20)]`

The numbers 1, 2, 3, and 4 can be converted into a character range `[1-4]`, and so can the numbers 6, 7, and 8. 10 and 20 are not part of a series of integers differing by one, so they are represented as single-number ranges. Later they will be used as single characters in the regex.

In [26]:
def get_ranges(nums):
    """Reduce a list of integers to tuples of local maximums and minimums.

    :param nums: List of integers.
    :return ranges: List of tuples showing local minimums and maximums
    """
    nums = sorted(nums)
    lows = [nums[0]]
    highs = []
    if nums[1] - nums[0] > 1:
        highs.append(nums[0])
    for i in range(1, len(nums)-1):
        if (nums[i] - nums[i-1]) > 1:
            lows.append(nums[i])
        if (nums[i + 1] - nums[i]) > 1:
            highs.append(nums[i])
    highs.append(nums[-1])
    if len(highs) > len(lows):
        lows.append(highs[-1])
    return [(l, h) for l, h in zip(lows, highs)]
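For comparison, here is an alternative sketch of the same idea, not the function used in the rest of this notebook. `contiguous_ranges` is a hypothetical name; it walks the sorted values once and either extends the last range or starts a new one, which also covers empty and single-element inputs.

```python
def contiguous_ranges(nums):
    """Collapse integers into (low, high) tuples of contiguous runs."""
    ranges = []
    for n in sorted(set(nums)):
        if ranges and n == ranges[-1][1] + 1:
            # n continues the current run, so move its upper bound
            ranges[-1] = (ranges[-1][0], n)
        else:
            # n starts a new run
            ranges.append((n, n))
    return ranges

example = contiguous_ranges([1, 2, 3, 4, 6, 7, 8, 10, 20])
print(example)
```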

In [27]:
# We first convert single_codepoint_emoji to integers to make calculations easier
single_codepoint_emoji_int = [int(x, base=16) for x in single_codepoint_emoji]
single_codepoint_emoji_ranges = get_ranges(single_codepoint_emoji_int)
single_codepoint_emoji_ranges[:10]

[(169, 169),
 (174, 174),
 (8252, 8252),
 (8265, 8265),
 (8482, 8482),
 (8505, 8505),
 (8596, 8601),
 (8617, 8618),
 (8986, 8987),
 (9000, 9000)]

In [28]:
single_codepoint_emoji_raw = r''  # start with an empty raw string
for code in single_codepoint_emoji_ranges:
    if code[0] == code[1]:  # in this case make it a single hexadecimal character
        temp_regex =  r'\U' + hex(code[0])[2:].zfill(8)
        single_codepoint_emoji_raw += temp_regex
    else:
        # otherwise create a character range, joined by '-'
        temp_regex = '-'.join([r'\U' + hex(code[0])[2:].zfill(8), r'\U' + hex(code[1])[2:].zfill(8)])
        single_codepoint_emoji_raw += temp_regex

single_codepoint_emoji_raw[:100]  # sample

'\\U000000a9\\U000000ae\\U0000203c\\U00002049\\U00002122\\U00002139\\U00002194-\\U00002199\\U000021a9-\\U000021'
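To see that such a string behaves as expected once wrapped in square brackets, here is a small sketch using two of the ranges shown above, `(169, 169)` and `(8596, 8601)`. The helper names are made up for this example.

```python
import re

def to_escape(codepoint):
    r"""Format an integer code point as a \UXXXXXXXX regex escape."""
    return r'\U' + format(codepoint, '08x')

def ranges_to_char_class(ranges):
    """Build a character class from (low, high) tuples."""
    pieces = []
    for low, high in ranges:
        if low == high:
            pieces.append(to_escape(low))                          # single character
        else:
            pieces.append(to_escape(low) + '-' + to_escape(high))  # character range
    return '[' + ''.join(pieces) + ']'

char_class = ranges_to_char_class([(169, 169), (8596, 8601)])
print(char_class)

# chr(8602) falls just outside the second range and should not match
text = ' '.join([chr(169), chr(8596), chr(8601), 'a', chr(8602)])
class_matches = re.findall(char_class, text)
print(class_matches)
```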

# Final regex
Now that we have created our sorted multi-code point characters, and generated the ranges for the single-code point emoji, we need to combine them together.  
The regex will start with the longer 'words', which are the emoji represented by more than one code point. These have already been sorted by length, in descending order. 
Single-code point emoji have already been turned into the contents of a character class, where some values are single characters, and some are character ranges. 

The final regex will look something like this: 

`multi_code_point_emoji|[character_class_of_single_code_points]`

In more detail, this is what the first `multi_code_point_emoji` part will look like:

`longest_multi_code_point|shorter_multiple_code_point|...|shortest_multiple_code_point`

This is what the character class part `[character_class_of_single_code_points]` will look like: 
For simplicity I refer to `single_code_point` as `sp`. 

`[sp1sp2sp3sp4-sp20sp25sp500-sp600]` and so on. 

Below we concatenate both regexes into one, and show the first and last 500 characters as a sample. 

In [29]:
all_emoji_regex = re.compile(multi_codepoint_emoji_joined + '|' +  r'[' + single_codepoint_emoji_raw + r']')
all_emoji_regex.pattern[:500], all_emoji_regex.pattern[-500:]

('\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FC|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FD|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FE|\\U0001F9D1\\U0001F3FB\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001F3FF|\\U0001F9D1\\U0001F3FC\\U0000200D\\U00002764\\U0000FE0F\\U0000200D\\U0001F48B\\U0000200D\\U0001F9D1\\U0001',
 'U0001f5dc-\\U0001f5de\\U0001f5e1\\U0001f5e3\\U0001f5e8\\U0001f5ef\\U0001f5f3\\U0001f5fa-\\U0001f64f\\U0001f680-\\U0001f6c5\\U0001f6cb-\\U0001f6d2\\U0001f6d5-\\U0001f6d7\\U0001f6dc-\\U0001f6e5\\U0001f6e9\\U0001f6eb-\\U0001f6ec\\U0001f6f0\\U0001f6f3-\\U0001f6fc\\U0001f7e0-\\U0001f7eb\\U0001f7f0\\U0001f90c-\\U0001f93a\\U0001f93c-\\U0001f945\\U0001f947-\\U0001f9af\\U0001f9b4-\\U0001f9ff\\U0001fa70-\\U0001fa7c\\U0001fa80-\\U0001fa88\\U0001fa90-
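The same structure can be tried on a miniature version of the regex. The sketch below uses one multi-code point emoji and a character class of two single-code point emoji; it is only meant to show how the two parts cooperate.

```python
import re

multi_part = r'\U0001F44D\U0001F3FF'        # thumbs up + dark skin tone
single_part = r'[\U0001F44D\U0001F600]'     # thumbs up, grinning face

mini_regex = re.compile(multi_part + '|' + single_part)

text = 'dark: \U0001F44D\U0001F3FF plain: \U0001F44D face: \U0001F600'
mini_matches = mini_regex.findall(text)
print(mini_matches)
```

Because the multi-code point alternative comes first, the dark skin tone thumbs up is returned whole instead of being cut down to the plain thumbs up.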

# Testing
We need to know that our work is correct. It is easy to get it wrong, especially when we are dealing with thousands of emoji, many of which are combinations of others. 

As a quick sanity check, let's see how many emoji entries were actually in the initial text file. Each emoji entry contains a semicolon, so let's count those: 

![](https://drive.google.com/uc?id=1njXayyA6eOqFxeEamQwldH8z0MkFxR6h)  

* There are 4,734 semicolons in the file. One of them is part of the format explanation in the header (`# Format: code points; status # emoji name`), and remember that there were nine component characters that we omitted, because they are basically modifiers. So the final number should be 4,734 - 1 - 9 = 4,724. 
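The same arithmetic can be done in code instead of by eye. The sketch below runs on a three-line hand-typed sample rather than the real file, so the numbers are tiny, but the same steps would apply to the full text.

```python
# A tiny hand-typed sample in the layout of emoji-test.txt
sample = (
    '# Format: code points; status # emoji name\n'
    '1F3FB ; component           # \U0001F3FB E1.0 light skin tone\n'
    '1F600 ; fully-qualified     # \U0001F600 E1.0 grinning face\n'
)

total = sample.count(';')
# semicolons that sit in comment lines of the header
header = sum(';' in line for line in sample.splitlines() if line.startswith('#'))
# component entries that are left out of the regex
components = sum('; component' in line for line in sample.splitlines())

expected = total - header - components
print(total, header, components, expected)
```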

Now we run `findall` by the combined final regex on a string that we create.  
This string is all the emoji characters in `emoji_entries` separated by spaces. Their number needs to be exactly 4,724. 

In [30]:
len(all_emoji_regex.findall(' '.join([x.emoji for x in emoji_entries])))

4724

So far so good. Let's get some more assurance.

The code below goes through all the lines of the raw text file, as downloaded from the Unicode site.  
First we define `count` as zero, and increment its value every time we find a new match. This should add up to the same number, 4,724.  
We also create a set `found_emoji`, to which we add every emoji we find. If we match a certain emoji more than once, the set will still contain it only once, because sets only hold unique values. Again, the length of this set should be equal to our magic number. A lower number means we found duplicates, and a higher number means we are matching other things. 

The `if len(match) > 1` check stops the loop if the regex found more than one match in a line, which would mean we are wrongly matching something more than once. It actually broke a few times when I first ran it, until I fixed the issues.  
One final test asserts that the name of the emoji (which we extract from `emoji_entries`) is contained in the line in the raw text file, making sure that the names correspond to the correct emoji and were extracted correctly. 

In [31]:
count = 0
found_emoji = set()
for line in emoji_raw.splitlines()[30:]:
    match = all_emoji_regex.findall(line)
    if match:
        if len(match) > 1:
            break
        count += 1
        found_emoji.add(match[0])
        temp_name = [x.name for x in emoji_entries if x.emoji == match[0]][0]
        assert temp_name in line

count, len(found_emoji)

(4724, 4724)

## 🎉 🎉 🎉 🎊 🎊 🎊 👍 👏 😉

To save as a DataFrame, we can run the following code.  
I made it semicolon-separated, as there are commas in the descriptions, so this is easier. Then I let `pandas` do the heavy lifting of converting back to comma-separated format. 

In [32]:
with open('emoji_df.csv', 'wt') as file:
    print('emoji;name;group;sub_group;codepoints', file=file)
    for em in emoji_entries:
        print(f"{em.emoji};{em.name};{em.group};{em.sub_group};{em.codepoint}", file=file)
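As an aside, the standard library's `csv` module can also deal with commas inside names by quoting those fields. The sketch below is an alternative to the semicolon approach, not what this notebook uses; the two rows are typed by hand for illustration and written to an in-memory buffer instead of a file.

```python
import csv
import io

header = ['emoji', 'name', 'group', 'sub_group', 'codepoints']
# Hand-typed illustrative rows; the second name contains a comma
rows = [
    ['\U0001F600', 'grinning face', 'Smileys & Emotion', 'face-smiling', '1F600'],
    ['\U0001F469\u200D\u2764\uFE0F\u200D\U0001F468', 'couple with heart: woman, man',
     'People & Body', 'family', '1F469 200D 2764 FE0F 200D 1F468'],
]

buffer = io.StringIO(newline='')
writer = csv.writer(buffer)       # quotes fields that contain commas
writer.writerow(header)
writer.writerows(rows)

buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[2][1])            # the name survives the round trip intact
```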

In [33]:
import pandas as pd
pd.options.display.max_columns = None

emoji_df = pd.read_csv('emoji_df.csv', sep=';')
emoji_df.to_csv('emoji_df.csv', index=False)
emoji_df = pd.read_csv('emoji_df.csv')
emoji_df[:35]

Unnamed: 0,emoji,name,group,sub_group,codepoints
0,😀,grinning face,Smileys & Emotion,face-smiling,1F600
1,😃,grinning face with big eyes,Smileys & Emotion,face-smiling,1F603
2,😄,grinning face with smiling eyes,Smileys & Emotion,face-smiling,1F604
3,😁,beaming face with smiling eyes,Smileys & Emotion,face-smiling,1F601
4,😆,grinning squinting face,Smileys & Emotion,face-smiling,1F606
5,😅,grinning face with sweat,Smileys & Emotion,face-smiling,1F605
6,🤣,rolling on the floor laughing,Smileys & Emotion,face-smiling,1F923
7,😂,face with tears of joy,Smileys & Emotion,face-smiling,1F602
8,🙂,slightly smiling face,Smileys & Emotion,face-smiling,1F642
9,🙃,upside-down face,Smileys & Emotion,face-smiling,1F643


# Emoji in Real-life Data
Let's see how we can use this regex on a tweet dataset containing five thousand tweets that contain the hashtag #JustDoIt.

In [34]:
justdoit = pd.read_csv('../input/5000-justdoit-tweets-dataset/justdoit_tweets_2018_09_07_2.csv')
justdoit.head(3)

Unnamed: 0,tweet_contributors,tweet_coordinates,tweet_created_at,tweet_display_text_range,tweet_entities,tweet_extended_entities,tweet_favorite_count,tweet_favorited,tweet_full_text,tweet_geo,tweet_id,tweet_id_str,tweet_in_reply_to_screen_name,tweet_in_reply_to_status_id,tweet_in_reply_to_status_id_str,tweet_in_reply_to_user_id,tweet_in_reply_to_user_id_str,tweet_is_quote_status,tweet_lang,tweet_metadata,tweet_place,tweet_possibly_sensitive,tweet_quoted_status,tweet_quoted_status_id,tweet_quoted_status_id_str,tweet_retweet_count,tweet_retweeted,tweet_source,tweet_truncated,tweet_user,user_contributors_enabled,user_created_at,user_default_profile,user_default_profile_image,user_description,user_entities,user_favourites_count,user_follow_request_sent,user_followers_count,user_following,user_friends_count,user_geo_enabled,user_has_extended_profile,user_id,user_id_str,user_is_translation_enabled,user_is_translator,user_lang,user_listed_count,user_location,user_name,user_notifications,user_profile_background_color,user_profile_background_image_url,user_profile_background_image_url_https,user_profile_background_tile,user_profile_banner_url,user_profile_image_url,user_profile_image_url_https,user_profile_link_color,user_profile_sidebar_border_color,user_profile_sidebar_fill_color,user_profile_text_color,user_profile_use_background_image,user_protected,user_screen_name,user_statuses_count,user_time_zone,user_translator_type,user_url,user_utc_offset,user_verified
0,,,Fri Sep 07 16:25:06 +0000 2018,"[0, 75]","{'hashtags': [{'text': 'quote', 'indices': [47...","{'media': [{'id': 1038100853872197632, 'id_str...",0,False,Done is better than perfect. — Sheryl Sandberg...,,1038100857932394496,1038100857932394496,,,,,,False,en,"{'iso_language_code': 'en', 'result_type': 're...",,False,,,,0,False,"<a href=""https://statusbrew.com"" rel=""nofollow...",False,"{'id': 3188618684, 'id_str': '3188618684', 'na...",False,Fri May 08 10:27:51 +0000 2015,True,False,I share tips to achieve your health goals and ...,{'url': {'urls': [{'url': 'https://t.co/jGlJsw...,307.0,False,57983.0,False,48721.0,False,False,3188619000.0,3188619000.0,False,False,en,629.0,"California, USA",Ultra YOU Woman,False,C0DEED,http://abs.twimg.com/images/themes/theme1/bg.png,https://abs.twimg.com/images/themes/theme1/bg.png,False,https://pbs.twimg.com/profile_banners/31886186...,http://pbs.twimg.com/profile_images/5970009262...,https://pbs.twimg.com/profile_images/597000926...,1DA1F2,C0DEED,DDEEF6,333333,True,False,UltraYOUwoman,91870.0,,none,https://t.co/jGlJswxjwS,,False
1,,,Fri Sep 07 16:24:59 +0000 2018,"[0, 237]","{'hashtags': [{'text': 'hero', 'indices': [90,...",,0,False,Shout out to the Great Fire Department and the...,,1038100830807904256,1038100830807904256,,,,,,False,en,"{'iso_language_code': 'en', 'result_type': 're...",,False,,,,0,False,"<a href=""http://www.facebook.com/twitter"" rel=...",False,"{'id': 18387174, 'id_str': '18387174', 'name':...",False,Fri Dec 26 09:30:23 +0000 2008,False,False,All Business inquiries contact cluuxx@gmail.co...,{'url': {'urls': [{'url': 'http://t.co/lVm8vfD...,1178.0,False,13241.0,False,5489.0,False,False,18387170.0,18387170.0,False,False,en,150.0,"Miami, Florida",Yung Cut Up (Videos),False,131516,http://abs.twimg.com/images/themes/theme14/bg.gif,https://abs.twimg.com/images/themes/theme14/bg...,True,https://pbs.twimg.com/profile_banners/18387174...,http://pbs.twimg.com/profile_images/9453331145...,https://pbs.twimg.com/profile_images/945333114...,3B94D9,FFFFFF,EFEFEF,333333,True,False,yungcutup,618822.0,,none,http://t.co/lVm8vfDbfO,,False
2,,,Fri Sep 07 16:24:50 +0000 2018,"[0, 176]","{'hashtags': [{'text': 'JustDoIt', 'indices': ...","{'media': [{'id': 1038100773396041728, 'id_str...",0,False,There are some AMAZINGLY hilarious Nike Ad mem...,,1038100793147248640,1038100793147248640,,,,,,False,en,"{'iso_language_code': 'en', 'result_type': 're...",,False,,,,0,False,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 32645612, 'id_str': '32645612', 'name':...",False,Fri Apr 17 23:04:15 +0000 2009,False,False,Morning Traffic Reporter @CBS4Indy | Traffic A...,{'url': {'urls': [{'url': 'https://t.co/g9exqg...,11864.0,False,11377.0,False,2386.0,False,False,32645610.0,32645610.0,False,False,en,193.0,"Indianapolis, IN",Rachel Bogle,False,FFFAFF,http://abs.twimg.com/images/themes/theme1/bg.png,https://abs.twimg.com/images/themes/theme1/bg.png,False,https://pbs.twimg.com/profile_banners/32645612...,http://pbs.twimg.com/profile_images/9863459563...,https://pbs.twimg.com/profile_images/986345956...,050505,FFFFFF,FC6A71,50505,True,False,rachelbogle,48075.0,,none,https://t.co/g9exqgZp9x,,True


The `word_frequency` function in `advertools` extracts words and counts their occurrences on an absolute and weighted basis. The function takes an optional `regex` parameter, whereby the function counts occurrences of matches of the regex (and not all words).  
We can now use the regex created, to extract and count emoji in our dataset. 

In [35]:
import advertools as adv
justdoit_emoji_freq = (adv.word_frequency(justdoit['tweet_full_text'],
                                          justdoit['user_followers_count'],
                                          regex=all_emoji_regex.pattern))
justdoit_emoji_freq.head(15)

Unnamed: 0,word,abs_freq,wtd_freq,rel_value
0,🐒,1,2896006.0,2896006.0
1,😂,535,762943.0,1426.0
2,✔️,136,476351.0,3503.0
3,🤣,177,359634.0,2032.0
4,🇺🇸,48,322516.0,6719.0
5,🏈,80,288040.0,3600.0
6,🔥,122,232236.0,1904.0
7,😘,13,199547.0,15350.0
8,👮‍♂️,2,189520.0,94760.0
9,😃,13,173217.0,13324.0


The `abs_freq` column shows how many times each emoji was used (a simple count), while `wtd_freq` adds up the number of followers of the user who posted the tweet for each occurrence.  
In the sample above you can see the monkey emoji being used only once, but since the user who tweeted it has 2.9M followers, it has the highest `wtd_freq` of all emoji.  
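To make the difference between the two columns concrete, here is a minimal sketch of the counting logic as described above, on three made-up tweets. It is not the `advertools` implementation, only an illustration using `collections.Counter` and a two-emoji regex.

```python
import re
from collections import Counter

# Made-up data: tweet texts and the follower count of each author
tweets = ['go \U0001F525\U0001F525', 'nice \U0001F412', 'wow \U0001F525']
followers = [10, 1000, 5]

# A tiny stand-in for the full emoji regex: fire and monkey only
emoji_regex = re.compile('[\U0001F525\U0001F412]')

abs_freq = Counter()
wtd_freq = Counter()
for text, follower_count in zip(tweets, followers):
    for em in emoji_regex.findall(text):
        abs_freq[em] += 1               # one per occurrence
        wtd_freq[em] += follower_count  # weighted by the author's followers

print(abs_freq.most_common())
print(wtd_freq.most_common())
```

The fire emoji is used three times but the monkey emoji, used once by the account with 1,000 followers, comes out on top on a weighted basis, just like in the table above.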

Using the emoji_dict that we created we can show names, groups, and sub-groups of each emoji:

In [36]:
justdoit_emoji_freq['name'] = [emoji_dict[word].name if word != '️' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq['group'] = [emoji_dict[word].group if word != '️' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq['sub_group'] = [emoji_dict[word].sub_group if word != '️' else '' for word in justdoit_emoji_freq['word']]
justdoit_emoji_freq[:40]

Unnamed: 0,word,abs_freq,wtd_freq,rel_value,name,group,sub_group
0,🐒,1,2896006.0,2896006.0,monkey,Animals & Nature,animal-mammal
1,😂,535,762943.0,1426.0,face with tears of joy,Smileys & Emotion,face-smiling
2,✔️,136,476351.0,3503.0,check mark,Symbols,other-symbol
3,🤣,177,359634.0,2032.0,rolling on the floor laughing,Smileys & Emotion,face-smiling
4,🇺🇸,48,322516.0,6719.0,flag: United States,Flags,country-flag
5,🏈,80,288040.0,3600.0,american football,Activities,sport
6,🔥,122,232236.0,1904.0,fire,Travel & Places,sky & weather
7,😘,13,199547.0,15350.0,face blowing a kiss,Smileys & Emotion,face-affection
8,👮‍♂️,2,189520.0,94760.0,man police officer,People & Body,person-role
9,😃,13,173217.0,13324.0,grinning face with big eyes,Smileys & Emotion,face-smiling


The previous table shows the frequencies per emoji.  
What about the groups and sub-groups? 

We do this next: 

In [37]:
(justdoit_emoji_freq
 .groupby('group')
 .agg({'abs_freq': 'sum', 'wtd_freq': 'sum'})
 .sort_values('wtd_freq', ascending=False)
 .style.format({'wtd_freq': '{:,.0f}'}))

Unnamed: 0_level_0,abs_freq,wtd_freq
group,Unnamed: 1_level_1,Unnamed: 2_level_1
Animals & Nature,38,2976109
Smileys & Emotion,1440,2784394
People & Body,778,1528755
Symbols,246,689172
Travel & Places,241,597755
Activities,208,500055
Flags,63,341367
Objects,145,207072
Food & Drink,13,29744
,15,15484


Note here that again, even though "Smileys & Emotion" emoji have been used 1,440 times and "Animals & Nature" only 38, the latter still ranks higher on a weighted basis.  
This is typical on social media. We often get a dataset that gets skewed by one tweet/user. 

In [38]:
(justdoit_emoji_freq
 .groupby('sub_group')
 .agg({'abs_freq': 'sum', 'wtd_freq': 'sum'})
 .sort_values('wtd_freq', ascending=False)
 .head(20)
 .style.format({'wtd_freq': '{:,.0f}'}))

Unnamed: 0_level_0,abs_freq,wtd_freq
sub_group,Unnamed: 1_level_1,Unnamed: 2_level_1
animal-mammal,19,2940580
face-smiling,815,1412457
other-symbol,202,560994
sky & weather,177,407219
person-role,13,399416
hand-fingers-closed,177,363482
sport,137,338287
country-flag,57,331833
hands,225,315005
face-affection,44,222628
