# Twitter Sentiment Analysis

Sentiment analysis is the process of analyzing online pieces of writing to determine the emotional tone they carry, whether they’re positive, negative, or neutral.
In simple words, sentiment analysis helps to find the author’s attitude towards a topic.
We want to carry out a socially relevant study by looking at people's attitude toward climate change. This could be done by scraping tweets from Twitter.
Here we will use a pre-labeled dataset available on Kaggle: https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset


First of all, let us import all the packages that we are going to need in our project

In [2]:
import re

import matplotlib.pyplot as plt
import sklearn
import pandas as pd
import nltk
import warnings
warnings.filterwarnings('ignore')

# Loading the data

Let us first load the dataset to see what kind of data we are dealing with

In [3]:
df = pd.read_csv('twitter_sentiment_data.csv')
df.head()

Unnamed: 0,sentiment,message,tweetid
0,-1,@tiniebeany climate change is an interesting h...,792927353886371840
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...,793124211518832641
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...,793124402388832256
3,1,RT @Mick_Fanning: Just watched this amazing do...,793124635873275904
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ...",793125156185137153


In [4]:
df['message'][186]

"RT @TimotheusW: Harrowing read about the relentless pursuit of #CSG in #Australia - 'Australia isnÃ¢â‚¬â„¢t Ã¢â‚¬Å“tacklingÃ¢â‚¬ï†\x9d climate change, weÃ¢â‚¬Â¦"

Each row has two identifiers: the table index and the tweetid. We will keep only the index, which takes less space than the very large tweetid numbers.
We therefore remove the tweetid column

In [5]:
df = df.drop('tweetid', axis=1)
df.head()

Unnamed: 0,sentiment,message
0,-1,@tiniebeany climate change is an interesting h...
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...
3,1,RT @Mick_Fanning: Just watched this amazing do...
4,2,"RT @cnalive: Pranita Biswasi, a Lutheran from ..."


We will also drop the tweets labeled as news (2) or neutral (0), since we want our model to capture the general feeling of people about global warming.
For this reason, we will only keep the messages with sentiment 1 or -1

In [6]:
# Note: df.size counts cells (rows x columns), not rows
print(df.size)

df = df[(df.sentiment != 0) & (df.sentiment != 2)]
print(df.size)
df.head()

87886
53904


Unnamed: 0,sentiment,message
0,-1,@tiniebeany climate change is an interesting h...
1,1,RT @NatGeoChannel: Watch #BeforeTheFlood right...
2,1,Fabulous! Leonardo #DiCaprio's film on #climat...
3,1,RT @Mick_Fanning: Just watched this amazing do...
9,1,#BeforeTheFlood Watch #BeforeTheFlood right he...
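
Since df.size reports the number of cells rather than the number of rows, it can be useful to compare it with len(df). The sketch below uses a toy DataFrame (the rows are made up for illustration) and writes the same filter with isin, which is an equivalent alternative to the two inequality tests:

```python
import pandas as pd

# Toy data (made up for illustration) with the same two columns as df.
toy = pd.DataFrame({
    "sentiment": [-1, 0, 1, 2, 1],
    "message": ["a", "b", "c", "d", "e"],
})

# Keep only the rows with sentiment -1 or 1.
kept = toy[toy["sentiment"].isin([-1, 1])]

print(toy.size, len(toy))    # 10 cells, 5 rows
print(kept.size, len(kept))  # 6 cells, 3 rows
```

With two columns, size is always twice the number of rows, so either measure shows how much data the filter removes.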


We can now take a look at what a tweet looks like:

In [7]:
print(df['message'][0])

@tiniebeany climate change is an interesting hustle as it was global warming but the planet stopped warming for 15 yes while the suv boom


Many tweets are identical, so we should get rid of the duplicates.
Messages with the same text are likely to come from spamming bots

In [8]:
df.drop_duplicates(subset=['message'], inplace=True)
df['original_message'] = df['message']

## Preprocessing

With text data there are no ready-made features: all we have is the text itself.
From it we need to extract everything required for our classification task.

One of the main problems with text data is noise. Capital letters, punctuation, links and spelling errors are some examples of noise that is likely to worsen the performance of the model.
The text will be cleaned with the following steps:
1) Lowercase the text
2) Remove the punctuation
3) Remove the stop-words
4) Remove @mentions
5) Remove links
6) Spell check
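
Most of the steps above can be sketched as a single function. This is only a minimal illustration under a few assumptions: the names (URL_RE, MENTION_RE, RT_RE, clean_tweet) are hypothetical, the stop-word set is a tiny placeholder, spell checking is left out, and links and mentions are stripped before punctuation so that a URL is still recognizable when its regex runs.

```python
import re
import string

# Assumption: links and mentions are removed *before* punctuation,
# otherwise "https://t.co/x" would collapse into "httpstcox".
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RT_RE = re.compile(r"\brt\b")  # retweet marker as a whole word
# Mentions are already gone at this point, so all punctuation can be dropped.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)
# Tiny placeholder stop-word set, for illustration only.
STOPWORDS = {"as", "here", "the", "to"}

def clean_tweet(text):
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # split() also collapses the extra spaces left by the substitutions
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(clean_tweet("RT @NatGeoChannel: Watch #BeforeTheFlood right here, as "
                  "@LeoDiCaprio travels the world to tackle climate change "
                  "https://t.co/LkDehj3tNn"))
```

The order of the steps matters: once punctuation is gone, a link is no longer separated from ordinary words by "://" and ".", which makes it harder to match reliably.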

### Lowercasing

In tweets, capitalization carries little information because it is often used inconsistently, for example:
WHATTT? i CAnT BeLIEVE ITttt

In [9]:
before = df['message'][1]
df['message'] = df['message'].apply(str.lower)
after = df['message'][1]

print('The text before the transformation was:\n',before,'\nNow it is:\n',after)

The text before the transformation was:
 RT @NatGeoChannel: Watch #BeforeTheFlood right here, as @LeoDiCaprio travels the world to tackle climate change https://t.co/LkDehj3tNn httÃ¢â‚¬Â¦ 
Now it is:
 rt @natgeochannel: watch #beforetheflood right here, as @leodicaprio travels the world to tackle climate change https://t.co/lkdehj3tnn httã¢â‚¬â¦


### Punctuation Removal

The same reasoning applies to punctuation. Consider the message WHAT????????????????? R U SERIOUS ??? OMG!!!
As you can see, there is a lot of noise in the sentence.
While punctuation can help to understand the meaning of a sentence or to perform POS tagging, for our task it adds little, especially in tweets, where it is often used incorrectly (a tweet is not a scientific paper or an article).

In [10]:
import string

punctuation = string.punctuation
print(punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


The list contains the @, which is a special character on Twitter, so we should deal with it separately.
Usernames can also contain '_'

From twitter docs: "A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores"

In [11]:
punctuation = punctuation.replace('@','')
punctuation = punctuation.replace('_','')
print(punctuation)

!"#$%&'()*+,-./:;<=>?[\]^`{|}~


Ok, now we can move ahead by removing all the other elements. None of them seems to be relevant for our task

In [12]:
before = df['message'][1]
df['message'] = df['message'].apply(lambda x: x.translate(str.maketrans('','',punctuation)))
after = df['message'][1]

print('The text before the transformation was:\n',before,'\nNow it is:\n',after)

The text before the transformation was:
 rt @natgeochannel: watch #beforetheflood right here, as @leodicaprio travels the world to tackle climate change https://t.co/lkdehj3tnn httã¢â‚¬â¦ 
Now it is:
 rt @natgeochannel watch beforetheflood right here as @leodicaprio travels the world to tackle climate change httpstcolkdehj3tnn httã¢â‚¬â¦


### Mentions, rt and hashtag Removal

On Twitter, users can mention other people.
This information is irrelevant to the sentiment of a text, so we remove it together with the retweet marker 'rt' and any digits.

In [13]:
before = df['message'][1]
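# re.sub removes any substring that matches one of the following alternatives:
# an @ followed by one or more word characters (letters, digits or _)
# OR a # followed by word characters (a no-op here, since # was already removed with the punctuation)
# OR one or more digits
# OR the retweet marker 'rt' as a whole word (\b stops it from matching inside words such as 'earth')
df['message'] = df['message'].apply(lambda x: re.sub(r'@\w+|#\w+|\d+|\brt\b', '', x))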
after = df['message'][1]

print('The text before the transformation was:\n',before,'\nNow it is:\n',after)

The text before the transformation was:
 rt @natgeochannel watch beforetheflood right here as @leodicaprio travels the world to tackle climate change httpstcolkdehj3tnn httã¢â‚¬â¦ 
Now it is:
   watch beforetheflood right here as  travels the world to tackle climate change httpstcolkdehjtnn httã¢â‚¬â¦
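
One detail is worth checking when removing the retweet marker: a bare 'rt' alternative matches anywhere, including inside longer words, while a word boundary (\b) restricts it to the standalone token. The short sketch below compares the two on a made-up sentence:

```python
import re

text = "rt @user earth day: part 2 of the report"

# A bare 'rt' also matches inside 'earth', 'part' and 'report'.
naive = re.sub(r"@\w+|\d+|rt", "", text)
# \brt\b only matches 'rt' when it stands alone as a word.
bounded = re.sub(r"@\w+|\d+|\brt\b", "", text)

print(naive.split())
print(bounded.split())
```

In the first result 'earth' becomes 'eah' and 'report' becomes 'repo'; in the second the words are left intact and only the marker, the mention and the digit disappear.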


### Removal of links

Links are not very useful for understanding a text either: they are just a reference to something else.
That's why we will remove anything starting with www, http or https (what links usually start with)

In [14]:
before = df['message'][1]

# This regex matches 'htt' or 'www' followed by any run of non-space characters, so the whole token is removed up to the next space.
# Matching on 'htt' covers http, https and links that were truncated right after 'htt'.
df['message'] = df['message'].apply(lambda x: re.sub(r'htt\S+|www\S+', '', x))
after = df['message'][1]

print('The text before the transformation was:\n',before,'\nNow it is:\n',after)

The text before the transformation was:
   watch beforetheflood right here as  travels the world to tackle climate change httpstcolkdehjtnn httã¢â‚¬â¦ 
Now it is:
   watch beforetheflood right here as  travels the world to tackle climate change  


### Handling strange characters

While browsing the dataset, I found some strange characters that are very likely the result of the text being decoded with the wrong encoding.
The problem could not be solved by changing the encoding used to read the file, which suggests that the errors were introduced when the dataset was saved.
I copied as many of those characters as I could find while looking through the dataset.

We remove them by compiling all these special characters into a regex and applying it to the message column: every word that contains one of them is deleted

In [15]:
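# Characters that show up in the mis-encoded tweets (after lowercasing).
# '\x9d' is an assumption: that entry was empty in the exported list, and an empty
# alternative would make the regex match (and delete) every word.
MOJIBAKE_CHARS = ['ã', '¢', 'â', '‚', '¬', 'å', '“', 'ï', '†', '\x9d', '¦', '°', '³', '’', '„', '¹', 'è', 'é', 'ç', 'ò', 'à', '§', '€']
# Matches every word that contains at least one of those characters
MOJIBAKE_REGEX = re.compile(r'\w*(?:' + '|'.join(map(re.escape, MOJIBAKE_CHARS)) + r')\w*')

def handle_undefined_chars(text):
    # In Python 3 the messages are already str; only raw bytes need decoding
    # ("utf-8-sig" also drops a leading BOM)
    if isinstance(text, bytes):
        text = text.decode("utf-8-sig", errors="replace").replace("\ufffd", "?")
    return MOJIBAKE_REGEX.sub('', text)

df['message'] = df['message'].apply(handle_undefined_chars)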
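
As a possible alternative to deleting the affected words, the garbling can sometimes be reversed. The pattern seen in the tweets looks like UTF-8 text that was decoded as Windows-1252 (cp1252) twice; this is an assumption based on the sample shown earlier, not something stated by the dataset. Under that assumption, a minimal sketch of a repair function (repair_mojibake is a hypothetical name) is:

```python
def repair_mojibake(text, max_rounds=2):
    """Undo up to max_rounds of 'UTF-8 bytes decoded as cp1252' garbling."""
    for _ in range(max_rounds):
        try:
            fixed = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not (or no longer) garbled in this way: keep what we have.
            break
        if fixed == text:
            break
        text = fixed
    return text

# Reproduce the garbling on a known string, then repair it.
original = "isn\u2019t"  # "isn't" with a typographic apostrophe
garbled = (original.encode("utf-8").decode("cp1252")
                   .encode("utf-8").decode("cp1252"))
print(garbled)
print(repair_mojibake(garbled))
```

This would have to run on the original messages, before lowercasing, because lowercasing changes the garbled characters. Text containing characters that cp1252 cannot encode (such as '\x9d') is left unchanged by the sketch, so the regex-based removal is still needed as a fallback.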

### Removal of stopwords

Stopwords are very frequent words like 'the', 'a', 'about' which don't provide any useful information about the meaning of the text.

In [16]:
nltk.download('stopwords')
from nltk.corpus import stopwords

english_stopwords = stopwords.words('english')
print(english_stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This is too much: a lot of information would be lost in the process!
Words such as not, don't and no would be removed, and with them the negations that sentiment depends on.

For example:
"I am not happy" becomes "happy"!
The meaning of the sentence is completely reversed.
It's better to define our own set of stopwords.

In [17]:
my_stopwords=['a', 'about', 'above', 'after', 'again', 'all', 'am', 'an',
              'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before',
              'being', 'below', 'between', 'both', 'but', 'by', 'can', 'd', 'did', 'do',
              'does', 'doing', 'down', 'during', 'each', 'few', 'for', 'from',
              'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
              'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
              'into', 'is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
              'me', 'more', 'most', 'my', 'myself', 'now', 'o', 'of', 'on', 'once',
              'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're',
              's', 'same', 'she', "shes", 'should', "shouldve", 'so', 'some', 'such',
              't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
              'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
              'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
              'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom',
              'why', 'will', 'with', 'won', 'u', 'y', 'you', "youd", "youll", "youre",
              "youve", 'your', 'yours', 'yourself', 'yourselves']

# A set gives constant-time membership tests
my_stopwords_set = set(my_stopwords)

def cleaning_stopwords(text):
    return " ".join(w for w in str(text).split() if w not in my_stopwords_set)

before = df['message'][1]
df['message'] = df['message'].apply(lambda x: cleaning_stopwords(x))
after = df['message'][1]

print('The text before the transformation was:\n',before,'\nNow it is:\n',after)

The text before the transformation was:
   watch beforetheflood right here as  travels the world to tackle climate change   
Now it is:
 watch beforetheflood right travels world tackle climate change
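
The effect of the two lists on a negated sentence can be sketched as follows. The sets below are small excerpts chosen for illustration, and remove_stopwords is a hypothetical helper rather than the function used above:

```python
# Excerpts only: a few entries from each list, enough to show the effect.
nltk_excerpt = {"i", "am", "not", "no", "don't"}
custom_excerpt = {"i", "am"}  # the custom list keeps negations

def remove_stopwords(text, stopwords):
    return " ".join(w for w in text.split() if w not in stopwords)

sentence = "i am not happy"
print(remove_stopwords(sentence, nltk_excerpt))    # happy
print(remove_stopwords(sentence, custom_excerpt))  # not happy
```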


### Conversion of emoticons and emojis

When writing on the internet, people often use emojis and emoticons to convey their emotions.
Let us consider:
- "Today there's the snow :)"
- "Today there's the snow :("

The sentence is the same but the sentiment of the message is different: in the first case it's positive, while in the second it's negative.

Of course, a text analyzer cannot understand the meaning of a colon followed by a parenthesis, so the emoticon would just be treated as noise and discarded.
It is therefore important to transform emoticons in such a way that they can help with the task we're considering.
The same reasoning applies to emojis.

In the following <a href="https://github.com/NeelShah18/emot/blob/master">GitHub repo</a> we can find a list of emoticons and emojis together with their textual descriptions.

Let us first show some examples of how this actually works:

In [18]:
EMO_UNICODE = {
    u':OK_button:': u'\U0001F197',
    u':OK_hand:': u'\U0001F44C',
    u':ON!_arrow:': u'\U0001F51B',
    u':anger_symbol:': u'\U0001F4A2',
    u':angry_face:': u'\U0001F620',
    u':angry_face_with_horns:': u'\U0001F47F',
    u':anguished_face:': u'\U0001F627',
    u':ant:': u'\U0001F41C',
    u':antenna_bars:': u'\U0001F4F6',
    u':anticlockwise_arrows_button:': u'\U0001F504',
    u':articulated_lorry:': u'\U0001F69B',
    u':artist_palette:': u'\U0001F3A8',
    u':astonished_face:': u'\U0001F632',
    u':atom_symbol:': u'\U0000269B',
    u':backhand_index_pointing_down:': u'\U0001F447',
    u':backhand_index_pointing_left:': u'\U0001F448',
    u':backhand_index_pointing_right:': u'\U0001F449',
    u':backhand_index_pointing_up:': u'\U0001F446',
    u':beating_heart:': u'\U0001F493',
    u':biohazard:': u'\U00002623',
    u':black_heart:': u'\U0001F5A4',
    u':black_large_square:': u'\U00002B1B',
    u':black_medium-small_square:': u'\U000025FE',
    u':black_medium_square:': u'\U000025FC',
    u':black_nib:': u'\U00002712',
    u':black_small_square:': u'\U000025AA',
    u':black_square_button:': u'\U0001F532',
    u':blossom:': u'\U0001F33C',
    u':blowfish:': u'\U0001F421',
    u':blue_book:': u'\U0001F4D8',
    u':blue_circle:': u'\U0001F535',
    u':blue_heart:': u'\U0001F499',
    u':boar:': u'\U0001F417',
    u':bomb:': u'\U0001F4A3',
    u':bookmark:': u'\U0001F516',
    u':bookmark_tabs:': u'\U0001F4D1',
    u':books:': u'\U0001F4DA',
    u':bread:': u'\U0001F35E',
    u':bridge_at_night:': u'\U0001F309',
    u':briefcase:': u'\U0001F4BC',
    u':bright_button:': u'\U0001F506',
    u':broken_heart:': u'\U0001F494',
    u':bug:': u'\U0001F41B',
    u':building_construction:': u'\U0001F3D7',
    u':burrito:': u'\U0001F32F',
    u':bus:': u'\U0001F68C',
    u':bus_stop:': u'\U0001F68F',
    u':bust_in_silhouette:': u'\U0001F464',
    u':busts_in_silhouette:': u'\U0001F465',
    u':butterfly:': u'\U0001F98B',
    u':cactus:': u'\U0001F335',
    u':calendar:': u'\U0001F4C5',
    u':call_me_hand:': u'\U0001F919',
    u':camel:': u'\U0001F42A',
    u':camera:': u'\U0001F4F7',
    u':camera_with_flash:': u'\U0001F4F8',
    u':camping:': u'\U0001F3D5',
    u':candle:': u'\U0001F56F',
    u':candy:': u'\U0001F36C',
    u':canoe:': u'\U0001F6F6',
    u':card_file_box:': u'\U0001F5C3',
    u':cat:': u'\U0001F408',
    u':cat_face:': u'\U0001F431',
    u':cat_face_with_tears_of_joy:': u'\U0001F639',
    u':cat_face_with_wry_smile:': u'\U0001F63C',
    u':chains:': u'\U000026D3',
    u':chart_decreasing:': u'\U0001F4C9',
    u':chart_increasing:': u'\U0001F4C8',
    u':chart_increasing_with_yen:': u'\U0001F4B9',
    u':cheese_wedge:': u'\U0001F9C0',
    u':chequered_flag:': u'\U0001F3C1',
    u':cherries:': u'\U0001F352',
    u':cherry_blossom:': u'\U0001F338',
    u':chestnut:': u'\U0001F330',
    u':chicken:': u'\U0001F414',
    u':children_crossing:': u'\U0001F6B8',
    u':chipmunk:': u'\U0001F43F',
    u':chocolate_bar:': u'\U0001F36B',
    u':church:': u'\U000026EA',
    u':cigarette:': u'\U0001F6AC',
    u':cinema:': u'\U0001F3A6',
    u':circled_M:': u'\U000024C2',
    u':circus_tent:': u'\U0001F3AA',
    u':cityscape:': u'\U0001F3D9',
    u':cloud_with_lightning:': u'\U0001F329',
    u':cloud_with_lightning_and_rain:': u'\U000026C8',
    u':cloud_with_rain:': u'\U0001F327',
    u':cloud_with_snow:': u'\U0001F328',
    u':clown_face:': u'\U0001F921',
    u':coffin:': u'\U000026B0',
    u':confetti_ball:': u'\U0001F38A',
    u':confounded_face:': u'\U0001F616',
    u':confused_face:': u'\U0001F615',
    u':cucumber:': u'\U0001F952',
    u':curly_loop:': u'\U000027B0',
    u':currency_exchange:': u'\U0001F4B1',
    u':curry_rice:': u'\U0001F35B',
    u':custard:': u'\U0001F36E',
    u':customs:': u'\U0001F6C3',
    u':cyclone:': u'\U0001F300',
    u':dagger:': u'\U0001F5E1',
    u':dango:': u'\U0001F361',
    u':dark_skin_tone:': u'\U0001F3FF',
    u':dashing_away:': u'\U0001F4A8',
    u':deciduous_tree:': u'\U0001F333',
    u':deer:': u'\U0001F98C',
    u':delivery_truck:': u'\U0001F69A',
    u':department_store:': u'\U0001F3EC',
    u':derelict_house:': u'\U0001F3DA',
    u':desert:': u'\U0001F3DC',
    u':desert_island:': u'\U0001F3DD',
    u':disappointed_but_relieved_face:': u'\U0001F625',
    u':disappointed_face:': u'\U0001F61E',
    u':dizzy:': u'\U0001F4AB',
    u':dizzy_face:': u'\U0001F635',
    u':dollar_banknote:': u'\U0001F4B5',
    u':double_exclamation_mark:': u'\U0000203C',
    u':elephant:': u'\U0001F418',
    u':face_screaming_in_fear:': u'\U0001F631',
    u':face_with_cold_sweat:': u'\U0001F613',
    u':face_with_head-bandage:': u'\U0001F915',
    u':face_with_medical_mask:': u'\U0001F637',
    u':face_with_open_mouth:': u'\U0001F62E',
    u':face_with_open_mouth_&_cold_sweat:': u'\U0001F630',
    u':face_with_rolling_eyes:': u'\U0001F644',
    u':face_with_steam_from_nose:': u'\U0001F624',
    u':face_with_stuck-out_tongue:': u'\U0001F61B',
    u':face_with_stuck-out_tongue_&_closed_eyes:': u'\U0001F61D',
    u':face_with_stuck-out_tongue_&_winking_eye:': u'\U0001F61C',
    u':face_with_tears_of_joy:': u'\U0001F602',
    u':face_with_thermometer:': u'\U0001F912',
    u':face_without_mouth:': u'\U0001F636',
    u':fearful_face:': u'\U0001F628',
    u':fire:': u'\U0001F525',
    u':fire_engine:': u'\U0001F692',
    u':fireworks:': u'\U0001F386',
    u':fish:': u'\U0001F41F',
    u':fish_cake_with_swirl:': u'\U0001F365',
    u':fishing_pole:': u'\U0001F3A3',
    u':five-thirty:': u'\U0001F560',
    u':five_o’clock:': u'\U0001F554',
    u':flag_in_hole:': u'\U000026F3',
    u':flashlight:': u'\U0001F526',
    u':fleur-de-lis:': u'\U0000269C',
    u':flexed_biceps:': u'\U0001F4AA',
    u':floppy_disk:': u'\U0001F4BE',
    u':flower_playing_cards:': u'\U0001F3B4',
    u':flushed_face:': u'\U0001F633',
    u':fog:': u'\U0001F32B',
    u':foggy:': u'\U0001F301',
    u':folded_hands:': u'\U0001F64F',
    u':footprints:': u'\U0001F463',
    u':fork_and_knife:': u'\U0001F374',
    u':fork_and_knife_with_plate:': u'\U0001F37D',
    u':fountain:': u'\U000026F2',
    u':fountain_pen:': u'\U0001F58B',
    u':four-thirty:': u'\U0001F55F',
    u':four_leaf_clover:': u'\U0001F340',
    u':four_o’clock:': u'\U0001F553',
    u':fox_face:': u'\U0001F98A',
    u':framed_picture:': u'\U0001F5BC',
    u':french_fries:': u'\U0001F35F',
    u':fried_shrimp:': u'\U0001F364',
    u':frog_face:': u'\U0001F438',
    u':front-facing_baby_chick:': u'\U0001F425',
    u':frowning_face:': u'\U00002639',
    u':frowning_face_with_open_mouth:': u'\U0001F626',
    u':fuel_pump:': u'\U000026FD',
    u':full_moon:': u'\U0001F315',
    u':full_moon_with_face:': u'\U0001F31D',
    u':funeral_urn:': u'\U000026B1',
    u':game_die:': u'\U0001F3B2',
    u':gear:': u'\U00002699',
    u':gem_stone:': u'\U0001F48E',
    u':ghost:': u'\U0001F47B',
    u':girl:': u'\U0001F467',
    u':glass_of_milk:': u'\U0001F95B',
    u':glasses:': u'\U0001F453',
    u':globe_showing_Americas:': u'\U0001F30E',
    u':globe_showing_Asia-Australia:': u'\U0001F30F',
    u':globe_showing_Europe-Africa:': u'\U0001F30D',
    u':globe_with_meridians:': u'\U0001F310',
    u':glowing_star:': u'\U0001F31F',
    u':goal_net:': u'\U0001F945',
    u':goat:': u'\U0001F410',
    u':goblin:': u'\U0001F47A',
    u':gorilla:': u'\U0001F98D',
    u':graduation_cap:': u'\U0001F393',
    u':grapes:': u'\U0001F347',
    u':green_apple:': u'\U0001F34F',
    u':green_book:': u'\U0001F4D7',
    u':green_heart:': u'\U0001F49A',
    u':green_salad:': u'\U0001F957',
    u':grimacing_face:': u'\U0001F62C',
    u':grinning_cat_face_with_smiling_eyes:': u'\U0001F638',
    u':grinning_face:': u'\U0001F600',
    u':grinning_face_with_smiling_eyes:': u'\U0001F601',
    u':growing_heart:': u'\U0001F497',
    u':guard:': u'\U0001F482',
    u':guitar:': u'\U0001F3B8',
    u':hamburger:': u'\U0001F354',
    u':hammer:': u'\U0001F528',
    u':hammer_and_pick:': u'\U00002692',
    u':hammer_and_wrench:': u'\U0001F6E0',
    u':hamster_face:': u'\U0001F439',
    u':handbag:': u'\U0001F45C',
    u':handshake:': u'\U0001F91D',
    u':hatching_chick:': u'\U0001F423',
    u':headphone:': u'\U0001F3A7',
    u':hear-no-evil_monkey:': u'\U0001F649',
    u':heart_decoration:': u'\U0001F49F',
    u':heart_suit:': u'\U00002665',
    u':heart_with_arrow:': u'\U0001F498',
    u':heart_with_ribbon:': u'\U0001F49D',
    u':hibiscus:': u'\U0001F33A',
    u':high-heeled_shoe:': u'\U0001F460',
    u':high-speed_train:': u'\U0001F684',
    u':high-speed_train_with_bullet_nose:': u'\U0001F685',
    u':high_voltage:': u'\U000026A1',
    u':honey_pot:': u'\U0001F36F',
    u':honeybee:': u'\U0001F41D',
    u':horizontal_traffic_light:': u'\U0001F6A5',
    u':horse:': u'\U0001F40E',
    u':hospital:': u'\U0001F3E5',
    u':hot_beverage:': u'\U00002615',
    u':hot_dog:': u'\U0001F32D',
    u':hot_pepper:': u'\U0001F336',
    u':hot_springs:': u'\U00002668',
    u':hotel:': u'\U0001F3E8',
    u':hourglass:': u'\U0000231B',
    u':hourglass_with_flowing_sand:': u'\U000023F3',
    u':house:': u'\U0001F3E0',
    u':house_with_garden:': u'\U0001F3E1',
    u':hugging_face:': u'\U0001F917',
    u':hundred_points:': u'\U0001F4AF',
    u':hushed_face:': u'\U0001F62F',
    u':ice_cream:': u'\U0001F368',
    u':ice_hockey:': u'\U0001F3D2',
    u':ice_skate:': u'\U000026F8',
    u':inbox_tray:': u'\U0001F4E5',
    u':incoming_envelope:': u'\U0001F4E8',
    u':index_pointing_up:': u'\U0000261D',
    u':kick_scooter:': u'\U0001F6F4',
    u':kimono:': u'\U0001F458',
    u':kiss:': u'\U0001F48F',
    u':koala:': u'\U0001F428',
    u':label:': u'\U0001F3F7',
    u':lady_beetle:': u'\U0001F41E',
    u':leopard:': u'\U0001F406',
    u':level_slider:': u'\U0001F39A',
    u':locomotive:': u'\U0001F682',
    u':loudly_crying_face:': u'\U0001F62D',
    u':map_of_Japan:': u'\U0001F5FE',
    u':meat_on_bone:': u'\U0001F356',
    u':medical_symbol:': u'\U00002695',
    u':mount_fuji:': u'\U0001F5FB',
    u':mountain:': u'\U000026F0',
    u':mountain_cableway:': u'\U0001F6A0',
    u':mountain_railway:': u'\U0001F69E',
    u':mouth:': u'\U0001F444',
    u':necktie:': u'\U0001F454',
    u':nerd_face:': u'\U0001F913',
    u':neutral_face:': u'\U0001F610',
    u':no_smoking:': u'\U0001F6AD',
    u':non-potable_water:': u'\U0001F6B1',
    u':open_book:': u'\U0001F4D6',
    u':open_file_folder:': u'\U0001F4C2',
    u':open_hands:': u'\U0001F450',
    u':package:': u'\U0001F4E6',
    u':page_facing_up:': u'\U0001F4C4',
    u':page_with_curl:': u'\U0001F4C3',
    u':peace_symbol:': u'\U0000262E',
    u':pen:': u'\U0001F58A',
    u':pick:': u'\U000026CF',
    u':potable_water:': u'\U0001F6B0',
    u':pouting_face:': u'\U0001F621',
    u':prohibited:': u'\U0001F6AB',
    u':purple_heart:': u'\U0001F49C',
    u':question_mark:': u'\U00002753',
    u':rabbit:': u'\U0001F407',
    u':rabbit_face:': u'\U0001F430',
    u':radioactive:': u'\U00002622',
    u':railway_car:': u'\U0001F683',
    u':railway_track:': u'\U0001F6E4',
    u':rainbow:': u'\U0001F308',
    u':rainbow_flag:': u'\U0001F3F3 \U0000FE0F \U0000200D \U0001F308',
    u':raised_back_of_hand:': u'\U0001F91A',
    u':raised_fist:': u'\U0000270A',
    u':raised_hand:': u'\U0000270B',
    u':raised_hand_with_fingers_splayed:': u'\U0001F590',
    u':raising_hands:': u'\U0001F64C',
    u':record_button:': u'\U000023FA',
    u':recycling_symbol:': u'\U0000267B',
    u':red_heart:': u'\U00002764',
    u':relieved_face:': u'\U0001F60C',
    u':roller_coaster:': u'\U0001F3A2',
    u':rolling_on_the_floor_laughing:': u'\U0001F923',
    u':shallow_pan_of_food:': u'\U0001F958',
    u':shamrock:': u'\U00002618',
    u':shark:': u'\U0001F988',
    u':sign_of_the_horns:': u'\U0001F918',
    u':skull:': u'\U0001F480',
    u':skull_and_crossbones:': u'\U00002620',
    u':sleeping_face:': u'\U0001F634',
    u':sleepy_face:': u'\U0001F62A',
    u':slightly_frowning_face:': u'\U0001F641',
    u':slightly_smiling_face:': u'\U0001F642',
    u':smiling_face:': u'\U0000263A',
    u':smiling_face_with_halo:': u'\U0001F607',
    u':smiling_face_with_heart-eyes:': u'\U0001F60D',
    u':smiling_face_with_horns:': u'\U0001F608',
    u':smiling_face_with_open_mouth:': u'\U0001F603',
    u':smiling_face_with_open_mouth_&_closed_eyes:': u'\U0001F606',
    u':smiling_face_with_open_mouth_&_cold_sweat:': u'\U0001F605',
    u':smiling_face_with_open_mouth_&_smiling_eyes:': u'\U0001F604',
    u':smiling_face_with_smiling_eyes:': u'\U0001F60A',
    u':smiling_face_with_sunglasses:': u'\U0001F60E',
    u':smirking_face:': u'\U0001F60F',
    u':snail:': u'\U0001F40C',
    u':snake:': u'\U0001F40D',
    u':sneezing_face:': u'\U0001F927',
    u':snow-capped_mountain:': u'\U0001F3D4',
    u':snowboarder:': u'\U0001F3C2',
    u':snowman:': u'\U00002603',
    u':snowman_without_snow:': u'\U000026C4',
    u':sparkling_heart:': u'\U0001F496',
    u':speak-no-evil_monkey:': u'\U0001F64A',
    u':speaker_high_volume:': u'\U0001F50A',
    u':speaker_low_volume:': u'\U0001F508',
    u':speaker_medium_volume:': u'\U0001F509',
    u':speaking_head:': u'\U0001F5E3',
    u':speech_balloon:': u'\U0001F4AC',
    u':speedboat:': u'\U0001F6A4',
    u':spider:': u'\U0001F577',
    u':spider_web:': u'\U0001F578',
    u':spiral_calendar:': u'\U0001F5D3',
    u':spiral_notepad:': u'\U0001F5D2',
    u':spiral_shell:': u'\U0001F41A',
    u':spoon:': u'\U0001F944',
    u':sport_utility_vehicle:': u'\U0001F699',
    u':spouting_whale:': u'\U0001F433',
    u':stop_button:': u'\U000023F9',
    u':stop_sign:': u'\U0001F6D1',
    u':stopwatch:': u'\U000023F1',
    u':straight_ruler:': u'\U0001F4CF',
    u':strawberry:': u'\U0001F353',
    u':studio_microphone:': u'\U0001F399',
    u':stuffed_flatbread:': u'\U0001F959',
    u':sun:': u'\U00002600',
    u':sun_behind_cloud:': u'\U000026C5',
    u':sun_behind_large_cloud:': u'\U0001F325',
    u':sun_behind_rain_cloud:': u'\U0001F326',
    u':sun_behind_small_cloud:': u'\U0001F324',
    u':sun_with_face:': u'\U0001F31E',
    u':sunrise:': u'\U0001F305',
    u':sunrise_over_mountains:': u'\U0001F304',
    u':sunset:': u'\U0001F307',
    u':suspension_railway:': u'\U0001F69F',
    u':sweat_droplets:': u'\U0001F4A6',
    u':teacup_without_handle:': u'\U0001F375',
    u':tear-off_calendar:': u'\U0001F4C6',
    u':telephone:': u'\U0000260E',
    u':telephone_receiver:': u'\U0001F4DE',
    u':telescope:': u'\U0001F52D',
    u':television:': u'\U0001F4FA',
    u':ten-thirty:': u'\U0001F565',
    u':ten_o’clock:': u'\U0001F559',
    u':tennis:': u'\U0001F3BE',
    u':tent:': u'\U000026FA',
    u':thermometer:': u'\U0001F321',
    u':thinking_face:': u'\U0001F914',
    u':thought_balloon:': u'\U0001F4AD',
    u':three-thirty:': u'\U0001F55E',
    u':three_o’clock:': u'\U0001F552',
    u':thumbs_down:': u'\U0001F44E',
    u':thumbs_up:': u'\U0001F44D',
    u':ticket:': u'\U0001F3AB',
    u':tiger:': u'\U0001F405',
    u':tiger_face:': u'\U0001F42F',
    u':timer_clock:': u'\U000023F2',
    u':tired_face:': u'\U0001F62B',
    u':toilet:': u'\U0001F6BD',
    u':tomato:': u'\U0001F345',
    u':tongue:': u'\U0001F445',
    u':top_hat:': u'\U0001F3A9',
    u':tornado:': u'\U0001F32A',
    u':trackball:': u'\U0001F5B2',
    u':tractor:': u'\U0001F69C',
    u':trade_mark:': u'\U00002122',
    u':train:': u'\U0001F686',
    u':tram:': u'\U0001F68A',
    u':tram_car:': u'\U0001F68B',
    u':triangular_flag:': u'\U0001F6A9',
    u':triangular_ruler:': u'\U0001F4D0',
    u':trident_emblem:': u'\U0001F531',
    u':trolleybus:': u'\U0001F68E',
    u':trophy:': u'\U0001F3C6',
    u':tropical_drink:': u'\U0001F379',
    u':tropical_fish:': u'\U0001F420',
    u':trumpet:': u'\U0001F3BA',
    u':tulip:': u'\U0001F337',
    u':tumbler_glass:': u'\U0001F943',
    u':turkey:': u'\U0001F983',
    u':turtle:': u'\U0001F422',
    u':twelve-thirty:': u'\U0001F567',
    u':twelve_o’clock:': u'\U0001F55B',
    u':two-hump_camel:': u'\U0001F42B',
    u':two-thirty:': u'\U0001F55D',
    u':two_hearts:': u'\U0001F495',
    u':umbrella:': u'\U00002602',
    u':umbrella_on_ground:': u'\U000026F1',
    u':umbrella_with_rain_drops:': u'\U00002614',
    u':unamused_face:': u'\U0001F612',
    u':unicorn_face:': u'\U0001F984',
    u':unlocked:': u'\U0001F513',
    u':up-down_arrow:': u'\U00002195',
    u':up-left_arrow:': u'\U00002196',
    u':up-right_arrow:': u'\U00002197',
    u':up_arrow:': u'\U00002B06',
    u':up_button:': u'\U0001F53C',
    u':upside-down_face:': u'\U0001F643',
    u':vertical_traffic_light:': u'\U0001F6A6',
    u':vibration_mode:': u'\U0001F4F3',
    u':victory_hand:': u'\U0000270C',
    u':video_camera:': u'\U0001F4F9',
    u':video_game:': u'\U0001F3AE',
    u':videocassette:': u'\U0001F4FC',
    u':violin:': u'\U0001F3BB',
    u':volcano:': u'\U0001F30B',
    u':volleyball:': u'\U0001F3D0',
    u':vulcan_salute:': u'\U0001F596',
    u':waning_crescent_moon:': u'\U0001F318',
    u':waning_gibbous_moon:': u'\U0001F316',
    u':warning:': u'\U000026A0',
    u':wastebasket:': u'\U0001F5D1',
    u':watch:': u'\U0000231A',
    u':water_buffalo:': u'\U0001F403',
    u':water_closet:': u'\U0001F6BE',
    u':water_wave:': u'\U0001F30A',
    u':waving_hand:': u'\U0001F44B',
    u':wavy_dash:': u'\U00003030',
    u':waxing_crescent_moon:': u'\U0001F312',
    u':waxing_gibbous_moon:': u'\U0001F314',
    u':weary_cat_face:': u'\U0001F640',
    u':weary_face:': u'\U0001F629',
    u':whale:': u'\U0001F40B',
    u':wheel_of_dharma:': u'\U00002638',
    u':wheelchair_symbol:': u'\U0000267F',
    u':white_circle:': u'\U000026AA',
    u':white_exclamation_mark:': u'\U00002755',
    u':white_flag:': u'\U0001F3F3',
    u':white_flower:': u'\U0001F4AE',
    u':wilted_flower:': u'\U0001F940',
    u':wind_chime:': u'\U0001F390',
    u':wind_face:': u'\U0001F32C',
    u':wine_glass:': u'\U0001F377',
    u':winking_face:': u'\U0001F609',
    u':wolf_face:': u'\U0001F43A',
    u':world_map:': u'\U0001F5FA',
    u':worried_face:': u'\U0001F61F',
    u':yellow_heart:': u'\U0001F49B',
    u':yen_banknote:': u'\U0001F4B4',
    u':yin_yang:': u'\U0000262F',
    u':zipper-mouth_face:': u'\U0001F910',
    u':zzz:': u'\U0001F4A4',
}

UNICODE_EMOJI = {v: k for k, v in EMO_UNICODE.items()}

In [19]:
# We search the text for every emoji in the mapping; each match is replaced by its textual representation

def convert_emojis(text):
    for emot in UNICODE_EMOJI:
        # re.escape guards against characters that are special in a regex
        text = re.sub(re.escape(emot), UNICODE_EMOJI[emot], text)
    return text

text = "HAHAHA this is so funny 😂"
convert_emojis(text)

'HAHAHA this is so funny :face_with_tears_of_joy:'

One step is still missing to achieve the desired outcome: let us strip the colons at the beginning and end of the transcription and turn the underscores into spaces

In [20]:
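def convert_emojis(text):
    for emot in UNICODE_EMOJI:
        # Escape the key so that any regex metacharacter in it is matched literally
        regex = r'(' + re.escape(emot) + r')'
        # ':face_with_tears_of_joy:' -> 'face with tears of joy'
        text = re.sub(regex, ' '.join(UNICODE_EMOJI[emot].replace(':','').split('_')), text)
    return text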

text = "HAHAHA this is so funny 😂"
convert_emojis(text)

'HAHAHA this is so funny face with tears of joy'
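The loop above runs one substitution per dictionary entry for every tweet. As a possible alternative, here is a minimal sketch that compiles a single pattern once and replaces all emojis in one pass. `SAMPLE_UNICODE_EMOJI`, `EMOJI_PATTERN` and `emoji_to_words` are hypothetical names used only for this illustration; the small mapping stands in for the full `UNICODE_EMOJI` dictionary.

```python
import re

# Hypothetical two-entry mapping standing in for the full UNICODE_EMOJI dictionary.
SAMPLE_UNICODE_EMOJI = {
    "\U0001F602": ":face_with_tears_of_joy:",
    "\U0001F30A": ":water_wave:",
}

# Escape every key so it is matched literally, and try longer keys first
# so that multi-code-point emojis are not shadowed by a shorter prefix.
_keys = sorted(SAMPLE_UNICODE_EMOJI, key=len, reverse=True)
EMOJI_PATTERN = re.compile("|".join(re.escape(k) for k in _keys))


def emoji_to_words(text):
    """Replace every known emoji with its description in plain words."""
    def describe(match):
        name = SAMPLE_UNICODE_EMOJI[match.group(0)]
        # ':water_wave:' -> 'water wave'
        return " ".join(name.strip(":").split("_"))
    return EMOJI_PATTERN.sub(describe, text)


result = emoji_to_words("HAHAHA this is so funny \U0001F602")
```

With the full dictionary the pattern is still built only once, so the cost per tweet no longer grows with one `re.sub` call per emoji.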

##### Emoticon handling
Handling emoticons is similar, just a bit trickier: some emoticons have several meanings associated with them.

For example:
":‑D"  --->   "Laughing, big grin or laugh with glasses"

Since we want the replacements to look the same for emojis and emoticons, we will remove the commas from the descriptions and convert them to lowercase

In [21]:
EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, angry or pouting",
    u":-\(":"Frown, sad, angry or pouting",
    u":\(":"Frown, sad, angry or pouting",
    u":‑c":"Frown, sad, angry or pouting",
    u":c":"Frown, sad, angry or pouting",
    u":‑<":"Frown, sad, angry or pouting",
    u":<":"Frown, sad, angry or pouting",
    u":‑\[":"Frown, sad, angry or pouting",
    u":\[":"Frown, sad, angry or pouting",
    u":-\|\|":"Frown, sad, angry or pouting",
    u">:\[":"Frown, sad, angry or pouting",
    u":\{":"Frown, sad, angry or pouting",
    u":@":"Frown, sad, angry or pouting",
    u">:\(":"Frown, sad, angry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":\$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dumb",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad or Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(・・\?":"Confusion",
    u"\(\?_\?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surprised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [22]:
def convert_emoticon(text):
    for emot in EMOTICONS:
        regex = r'('+emot+')'
        text = re.sub(regex, EMOTICONS[emot].replace(':','').replace(',','').lower(), text)
    return text

text = "HAHAHA this is so funny XD"
convert_emoticon(text)

'HAHAHA this is so funny laughing big grin or laugh with glasses'
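One caveat: the patterns above match anywhere in the text, so "XP" also matches inside "EXPERT" and ":/" inside "http://" (if URLs are still present at this stage). The following is a minimal sketch of one way to avoid this, by accepting an emoticon only when it is surrounded by whitespace or the edges of the string. `SAMPLE_EMOTICONS`, `EMOTICON_PATTERN` and `emoticon_to_words` are hypothetical names, and the keys are written as plain strings rather than regexes.

```python
import re

# Hypothetical subset of EMOTICONS, with keys as plain strings (not regexes).
SAMPLE_EMOTICONS = {
    "XD": "Laughing, big grin or laugh with glasses",
    "XP": "Tongue sticking out, cheeky, playful or blowing a raspberry",
    ":/": "Skeptical, annoyed, undecided, uneasy or hesitant",
}

_alternatives = "|".join(
    re.escape(e) for e in sorted(SAMPLE_EMOTICONS, key=len, reverse=True)
)
# (?<!\S) and (?!\S) require whitespace or the string boundary on each side.
EMOTICON_PATTERN = re.compile(r"(?<!\S)(?:" + _alternatives + r")(?!\S)")


def emoticon_to_words(text):
    """Replace standalone emoticons with their lowercase description."""
    def describe(match):
        return SAMPLE_EMOTICONS[match.group(0)].replace(",", "").lower()
    return EMOTICON_PATTERN.sub(describe, text)


converted = emoticon_to_words("HAHAHA this is so funny XD")
untouched = emoticon_to_words("EXPERT opinion at http://t.co")
```

The trade-off is that emoticons glued to a word (for example "great:/") are no longer converted, so whether this is preferable depends on how the tweets are written.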

Now that we're done with our functions, we can perform the transformation over our dataset for both emojis and emoticons

In [23]:
df['message'] = df['message'].apply(convert_emojis)

In [24]:
df['message'] = df['message'].apply(convert_emoticon)

### Spellchecking

Another important text preprocessing step is spelling correction. Typos are common in text data, and we may want to correct them before running our analysis.
Consider the following example:
"I canqt be happy today".
Here the user wanted to write "can't". Because of the typo, our model won't be able to understand that the user is unhappy: it will see a token it doesn't know (canqt) and will draw its conclusions from the word happy, which is the opposite of what the sentence means.
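To make the idea concrete, here is a toy sketch of dictionary-based correction: every unknown word is replaced by the most similar word of a known vocabulary. It uses `difflib` from the standard library, which is an illustration only and not how a dedicated spell checker works internally. `VOCABULARY` and `naive_correct` are hypothetical names, and the six-word vocabulary is an assumption made for the example.

```python
import difflib

# Hypothetical mini-vocabulary; a real corrector relies on a large dictionary.
VOCABULARY = ["i", "can", "can't", "be", "happy", "today"]


def naive_correct(sentence, vocabulary=VOCABULARY):
    """Replace each unknown word with its closest match in the vocabulary."""
    corrected = []
    for word in sentence.split():
        if word.lower() in vocabulary:
            corrected.append(word)
            continue
        # Closest known word by character similarity; keep the word
        # unchanged when nothing is similar enough.
        candidates = difflib.get_close_matches(
            word.lower(), vocabulary, n=1, cutoff=0.7
        )
        corrected.append(candidates[0] if candidates else word)
    return " ".join(corrected)


fixed = naive_correct("I canqt be happy today")
```

A character-similarity lookup like this ignores context and word frequency, which is one reason to prefer a pretrained model for real data.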

For this task, the package spello comes in really handy.

In [25]:
! pip install spello



Once this is installed, load the pretrained English model used for fixing the spelling (the en.pkl file must already be available locally)

In [26]:
from spello.model import SpellCorrectionModel

sp = SpellCorrectionModel(language='en')
sp.load('./en.pkl/en.pkl')

<spello.model.SpellCorrectionModel at 0x1fbecdbc0a0>

Let's try the spelling checker function to see how it actually works:

In [27]:
text = "I canqt be happy today"
print(sp.spell_correct(text)['spell_corrected_text'])

I can't be happy today


And now we can run it over the whole dataset

In [28]:
df['message'] = df['message'].apply(lambda x: sp.spell_correct(x)['spell_corrected_text'])

### Handling of contractions

Once our data has been spell-checked, we can take everything one step further:
we will handle contractions by expanding them to their full form

In [29]:
# Defining the dictionary of negations that we want to handle
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                 "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                 "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                 "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                 "mustn't":"must not"}

# Compiling the regex for performance reasons
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def remove_contractions(text):
    return neg_pattern.sub(lambda x: negations_dic[x.group()], text)

df['message'] = df['message'].apply(remove_contractions)
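Note that the pattern above is case-sensitive and only knows the ASCII apostrophe, so "Can't" or "can’t" (with a typographic apostrophe) would be left untouched. If the text has not been lowercased and normalized beforehand, a variant along these lines may help. This is a minimal sketch: `SAMPLE_NEGATIONS`, `CONTRACTION_PATTERN` and `expand_contractions` are hypothetical names, and the short dictionary stands in for `negations_dic`.

```python
import re

# Hypothetical subset of negations_dic, kept short for the example.
SAMPLE_NEGATIONS = {"isn't": "is not", "can't": "can not", "won't": "will not"}

CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in SAMPLE_NEGATIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Expand known contractions regardless of case or apostrophe style."""
    # Map the typographic apostrophe (U+2019) to the ASCII one first.
    text = text.replace("\u2019", "'")
    # The dictionary keys are lowercase, so look the match up in lowercase.
    return CONTRACTION_PATTERN.sub(
        lambda m: SAMPLE_NEGATIONS[m.group(0).lower()], text
    )


expanded = expand_contractions("Can\u2019t stop, won't stop")
```

The replacement is always lowercase ("Can't" becomes "can not"), which is harmless if the text is lowercased anyway.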

### Tokenization of words

Before saving the file, we want to split each preprocessed message into tokens, since all the methods that we are going to use later work directly on individual words.

The TweetTokenizer from the package nltk.tokenize is designed for tweets, so it is a good match for our goal.

In [30]:
from nltk.tokenize import TweetTokenizer
nltk.download('punkt')

tok = TweetTokenizer()
df['message'] = df['message'].apply(tok.tokenize)
df.head()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,sentiment,message,original_message
0,-1,"[climate, change, interesting, hostile, global...",@tiniebeany climate change is an interesting h...
1,1,"[watch, beforetheflood, right, travels, world,...",RT @NatGeoChannel: Watch #BeforeTheFlood right...
2,1,"[fabulous, leonardo, decaprio, film, climate, ...",Fabulous! Leonardo #DiCaprio's film on #climat...
3,1,"[watched, amazing, documentary, leonardodicapr...",RT @Mick_Fanning: Just watched this amazing do...
9,1,"[beforetheflood, watch, beforetheflood, right,...",#BeforeTheFlood Watch #BeforeTheFlood right he...


### Lemmatization of words

Lemmatization is the process of transforming words to their base form (or lemma, as it is usually called in NLP).
For example: "loving" => "love"

This process is different from stemming, which simply strips the suffix of a word and can leave a string that is not a real word. I thought that using the lemma rather than a word without meaning may be helpful. Of course, lemmatization takes more time than stemming, but from what I found online it seems to be the better choice in my case.

In [31]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

To perform our task we need to download WordNet.
WordNet is a lexical database for the English language, created at Princeton University and available as one of the NLTK corpora.

You can use WordNet alongside the NLTK module to find the meanings of words, synonyms, antonyms, and more. We will use the Lemmatizer from this package.

In [32]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [33]:
lemmatizer = WordNetLemmatizer()

def lemmat(w_list):
    lemm_sentence = []
    for w in w_list:
        pos_tag = nltk.pos_tag([w])[0]
        # Adjective
        if pos_tag[1].startswith('J'):
            wtag = wordnet.ADJ
        # Noun
        elif pos_tag[1].startswith('N'):
            wtag = wordnet.NOUN
        # Adverb
        elif pos_tag[1].startswith('R'):
            wtag = wordnet.ADV
        # Verb
        elif pos_tag[1].startswith('V'):
            wtag = wordnet.VERB
        # Default to noun
        else:
            wtag = wordnet.NOUN

        # Lemmatize the word using the WordNet tag chosen above
        lemmatized_word = lemmatizer.lemmatize(w, pos=wtag)
        lemm_sentence.append(lemmatized_word)
    return lemm_sentence

df['message'] = df['message'].apply(lemmat)
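The chain of branches that maps a Penn Treebank tag to a WordNet part of speech can also be written as a lookup table. Here is a minimal sketch, assuming the single-letter codes that NLTK exposes as `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.ADV` and `wordnet.VERB` ('a', 'n', 'r', 'v'). `TREEBANK_TO_WORDNET` and `to_wordnet_pos` are hypothetical names introduced for the example.

```python
# First letter of the Penn Treebank tag -> WordNet part-of-speech code.
# The values are the same strings as wordnet.ADJ, wordnet.NOUN,
# wordnet.ADV and wordnet.VERB in NLTK.
TREEBANK_TO_WORDNET = {"J": "a", "N": "n", "R": "r", "V": "v"}


def to_wordnet_pos(treebank_tag, default="n"):
    """Map a tag such as 'VBG' to a WordNet POS code, defaulting to noun."""
    return TREEBANK_TO_WORDNET.get(treebank_tag[:1], default)


verb_code = to_wordnet_pos("VBG")
fallback_code = to_wordnet_pos("IN")
```

Inside `lemmat`, the result could be passed directly as the `pos` argument of `lemmatizer.lemmatize`.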

In [34]:
df['message']

0        [climate, change, interest, hostile, global, w...
1        [watch, beforetheflood, right, travel, world, ...
2        [fabulous, leonardo, decaprio, film, climate, ...
3        [watch, amaze, documentary, leonardodicaprio, ...
9        [beforetheflood, watch, beforetheflood, right,...
                               ...                        
43935             [american, scar, clown, climate, change]
43936    [aikbaatsunithi, global, warm, negative, effec...
43938    [dear, yeah, right, human, mediate, climate, c...
43939    [respective, pas, prevent, climate, change, gl...
43942    [wealthy, fossil, fuel, industry, know, climat...
Name: message, Length: 24463, dtype: object

For future use, we will also store the "preprocessed version" of the message as a string in a dedicated column.

In [35]:
df['preprocessed_text'] = df['message'].apply(lambda x: " ".join(x))

## Saving preprocessed data to a text file

There is more preprocessing that we could do, but since the remaining steps depend on the approach used, we will deal with them later (namely, the generation of count vectors with Bag of Words or TF-IDF).

For now, we just save our preprocessed data in a .csv file

In [36]:
df.to_csv('preprocessed_data.csv', index=False)
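One thing to keep in mind when loading this file again: a column that holds Python lists, like our token column, is written to CSV as the string representation of each list. The sketch below reproduces this with the standard library only (an in-memory file and a made-up row, both assumptions for the example) and shows how `ast.literal_eval` turns the string back into a list.

```python
import ast
import csv
import io

# Write one made-up row to an in-memory CSV; the token list is stored
# as its string representation, as happens with a list-valued column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["sentiment", "message"])
writer.writerow([-1, str(["climate", "change", "interest"])])

# Read the row back: the message comes back as a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_message = rows[0]["message"]

# literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw_message)
```

Since we also store the joined string in the preprocessed_text column, this conversion is only needed when the token lists themselves are required.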