# Emoji-Based Sentiment Analysis

Emoticons and emojis are icons widely used to express simple or complex emotions on most online websites, forums, and chats, including social media.

Sentiment analysis, also known as opinion mining, is a powerful tool for describing a user's emotional state based on the words they choose. Most sentiment analysis uses text rather than emoticons, but including emojis adds an extra layer of analysis that helps classify messages.

Data for sentiment analysis associate a polarity with each word, and these values are used to decide the overall sentiment of a sentence.
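
To make the idea of word-level polarity concrete, here is a minimal sketch of lexicon-based scoring. The tiny `WORD_POLARITY` table and the helpers `sentence_polarity` and `label` are made up for illustration; they are not part of the datasets or the classifier used in this notebook.

```python
# Hypothetical word-level polarity lexicon: +1 positive, -1 negative.
WORD_POLARITY = {"love": 1, "great": 1, "awesome": 1,
                 "hate": -1, "bummer": -1, "sad": -1}

def sentence_polarity(sentence):
    """Sum the polarity of known words; unknown words count as 0."""
    words = sentence.lower().split()
    return sum(WORD_POLARITY.get(word.strip(".,!?"), 0) for word in words)

def label(sentence):
    """Map the summed score to a label (a score of 0 falls to negative, an arbitrary choice)."""
    return "positive" if sentence_polarity(sentence) > 0 else "negative"

print(label("I love this, it is awesome!"))
print(label("What a bummer, I hate it."))
```

A real lexicon would be far larger, and the notebook itself learns word weights from labelled tweets instead of using a fixed table.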

In [1]:
import pandas as pd
import numpy as np

# Preprocess Data

### Emoji Dataset Preprocessing

Emojis are Unicode graphic symbols, used as a shorthand <b>to express concepts and ideas</b>. In contrast to the small number of well-known emoticons that carry clear emotional content, there are hundreds of emojis.

In [2]:
# Setup the data for emoji
df_emoji = pd.read_csv("dataset/Emoji_Sentiment_Data.csv", 
                       usecols = ['Emoji', 'Negative', 'Neutral', 'Positive'])
df_emoji

Unnamed: 0,Emoji,Negative,Neutral,Positive
0,😂,3614,4163,6845
1,❤,355,1334,6361
2,♥,252,1942,4950
3,😍,329,1390,4640
4,😭,2412,1218,1896
...,...,...,...,...
964,➛,0,1,0
965,♝,0,1,0
966,❋,0,1,0
967,✆,0,1,0


In [3]:
df_emoji.Emoji.values

array(['😂', '❤', '♥', '😍', '😭', '😘', '😊', '👌', '💕', '👏', '😁', '☺', '♡',
       '👍', '😩', '🙏', '✌', '😏', '😉', '🙌', '🙈', '💪', '😄', '😒', '💃', '💖',
       '😃', '😔', '😱', '🎉', '😜', '☯', '🌸', '💜', '💙', '✨', '😳', '💗', '★',
       '█', '☀', '😡', '😎', '😢', '💋', '😋', '🙊', '😴', '🎶', '💞', '😌', '🔥',
       '💯', '🔫', '💛', '💁', '💚', '♫', '😞', '😆', '😝', '😪', '�', '😫', '😅',
       '👊', '💀', '😀', '😚', '😻', '©', '👀', '💘', '🐓', '☕', '👋', '✋', '🎊',
       '🍕', '❄', '😥', '😕', '💥', '💔', '😤', '😈', '►', '✈', '🔝', '😰', '⚽',
       '😑', '👑', '😹', '👉', '🍃', '🎁', '😠', '🐧', '☆', '🍀', '🎈', '🎅', '😓',
       '😣', '😐', '✊', '😨', '😖', '💤', '💓', '👎', '💦', '✔', '😷', '⚡', '🙋',
       '🎄', '💩', '🎵', '➡', '😛', '😬', '👯', '💎', '🌿', '🎂', '🌟', '🔮', '❗',
       '👫', '🏆', '✖', '☝', '😙', '⛄', '👅', '♪', '🍂', '💏', '🔪', '🌴', '👈',
       '🌹', '🙆', '➜', '👻', '💰', '🍻', '🙅', '🌞', '🍁', '⭐', '▪', '🎀', '━',
       '☷', '🐷', '🙉', '🌺', '💅', '🐶', '🌚', '👽', '🎤', '👭', '🎧', '👆', '🍸',
       '🍷', '®', '🍉', '😇', '☑', '🏃', '😿', '│', '💣', '🍺', '▶', '😲

### Set to Binary Polarity and Normalize to 0 and 1

In [4]:
# compare the polarity of the dataset and turn the polarity to binary
# 0 = negative, 1= positive
polarity_ls = []
for index, row in df_emoji.iterrows():
    
    # polarity == sentiment
    # initial polarity is negative
    polarity = 0 
    
    # positive if positive value is greater than negative value
    arg_1 = row['Positive'] > row['Negative']
    
    # positive if neutral value is odd and positive and negative value are equal
    arg_2 = row['Positive'] == row['Negative'] and row['Neutral'] % 2 != 0 
    
    # positive if either of the two arguments are true
    if arg_1 or arg_2:
        polarity = 1
    polarity_ls.append(polarity)
    
# create new emoji dataset
new_df_emoji = pd.DataFrame(polarity_ls, columns=['sentiment'])
new_df_emoji['emoji'] = df_emoji['Emoji'].values
new_df_emoji

Unnamed: 0,sentiment,emoji
0,1,😂
1,1,❤
2,1,♥
3,1,😍
4,0,😭
...,...,...
964,1,➛
965,1,♝
966,1,❋
967,1,✆
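
The labelling rule used in the loop above can be isolated as a small function, which makes the tie-break easier to see. This is only a sketch: `binary_polarity` is a hypothetical helper name, and the counts are copied from the rows of the emoji table printed earlier.

```python
def binary_polarity(negative, neutral, positive):
    """Return 1 (positive) or 0 (negative) using the same rule as the loop above."""
    if positive > negative:
        return 1
    # tie-break: a tie counts as positive only when the neutral count is odd
    if positive == negative and neutral % 2 != 0:
        return 1
    return 0

# (negative, neutral, positive) counts taken from the table shown earlier
rows = {"😂": (3614, 4163, 6845), "😭": (2412, 1218, 1896), "➛": (0, 1, 0)}
for symbol, counts in rows.items():
    print(symbol, binary_polarity(*counts))
```

Rare symbols such as ➛, seen once and only as neutral, end up labelled positive purely through the tie-break, so their polarity should be read with caution.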


### Tweet Posts Dataset Preprocessing

A <b>10k-tweet dataset is provided</b> in the folder, but you can also download the full 1.6m-tweet dataset online.

To download the 1.6m tweet dataset (optional):
https://www.kaggle.com/kazanova/sentiment140


In [5]:
# 10k data preprocessing

# process 1.6m data to 10k data and save to csv
# df_posts = pd.read_csv("dataset/senti_text_data.csv", header = None, engine='python')
# get the smaller 10k dataset  from the 1.6m dataset
# df_posts = df_posts.iloc[::160]
# df_posts.to_csv("datasets/senti_text_data_10k.csv")

# read the included 10k data csv
# df_posts = pd.read_csv("dataset/10k_tweet_dataset.csv", header = None, engine='python')
# df_posts.drop(df_posts.columns[0], axis=1, inplace=True)
# df_posts = df_posts.iloc[1:]
# df_posts.columns = [0,1,2,3,4,5] # rename columns to numbers
# df_posts

In [5]:
df_posts = pd.read_csv("dataset/processed_tweet_dataset.csv")
df_posts = df_posts.drop([df_posts.columns[0]], axis=1)
df_posts

Unnamed: 0,sentiment,post
0,0,"- Awww, that's a bummer. You shoulda got David..."
1,0,Picked Mich St to win it all from the get go. ...
2,0,throat is closing up and i had some string che...
3,0,"If he doesn't get better in a few days, he cou..."
4,0,I'm sure everyone has ruined my gift to you Wh...
...,...,...
9995,1,- i know now what is that haha X)
9996,1,- had a great time with some of the best peopl...
9997,1,"Tyreseee, when you're heading to The Netherlan..."
9998,1,"don't know what you could possibly mean, dear ..."


In [7]:
# 1.6m data preprocessing
# df_posts = pd.read_csv("dataset/senti_text_data.csv", header = None, engine='python')
# df_posts

# Classification using Naive Bayes

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. 
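
As a rough illustration of how such a classifier assigns labels, the sketch below implements a tiny multinomial Naive Bayes by hand with Laplace smoothing. It is not scikit-learn's implementation; the function names `train_multinomial_nb` and `predict` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from tokenised docs."""
    vocab = sorted({word for doc in docs for word in doc})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, lab in zip(docs, labels):
        word_counts[lab].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(doc, log_prior, log_like):
    """Pick the class with the highest log score; words outside the vocabulary are ignored."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# toy training data: 1 = positive, 0 = negative
docs = [["good", "great", "love"], ["love", "awesome"],
        ["bad", "hate"], ["awful", "hate", "bad"]]
labels = [1, 1, 0, 0]
log_prior, log_like = train_multinomial_nb(docs, labels)
print(predict(["love", "great"], log_prior, log_like))
print(predict(["bad", "awful"], log_prior, log_like))
```

The "naive" part is the sum over words: each word contributes independently to the class score, regardless of the words around it.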

In [6]:
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

### tf–idf or TFIDF

tf–idf, short for <b>term frequency–inverse document frequency</b>, is a numerical statistic intended to reflect <b>how important a word is</b> to a document in a collection or corpus.
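
The weighting can be sketched in a few lines of plain Python. The `tfidf` helper and the three-sentence corpus are invented for illustration; the smoothed idf formula, ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, is assumed to match scikit-learn's documented default, and the whitespace tokenisation is deliberately simplified.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return one {term: weight} dict per document, L2-normalised."""
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    # smoothed idf: rarer terms get a larger value
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

corpus = ["the movie was great", "the movie was awful", "great great fun"]
for vec in tfidf(corpus):
    print({term: round(w, 3) for term, w in vec.items()})
```

In the second sentence, "awful" appears in only one document and therefore receives a higher weight than "the", which appears in two.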

In [13]:
# TFIDF vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True,
                            strip_accents='ascii', stop_words='english')

Check out the emojis' sentiments

Novak et al. (2015): "It turns out that <b>most of the emojis are positive</b>, especially the most popular ones"
https://doi.org/10.1371/journal.pone.0144296

In [14]:
# print out the emoticons and sentiment values
e_c, p = 0, 0
for index, row in new_df_emoji.iterrows():
    print(f"{row['emoji']} = {row['sentiment']}")
    p += 1 if row['sentiment'] else 0
    e_c += 1

😂 = 1
❤ = 1
♥ = 1
😍 = 1
😭 = 0
😘 = 1
😊 = 1
👌 = 1
💕 = 1
👏 = 1
😁 = 1
☺ = 1
♡ = 1
👍 = 1
😩 = 0
🙏 = 1
✌ = 1
😏 = 1
😉 = 1
🙌 = 1
🙈 = 1
💪 = 1
😄 = 1
😒 = 0
💃 = 1
💖 = 1
😃 = 1
😔 = 0
😱 = 1
🎉 = 1
😜 = 1
☯ = 1
🌸 = 1
💜 = 1
💙 = 1
✨ = 1
😳 = 1
💗 = 1
★ = 1
█ = 0
☀ = 1
😡 = 0
😎 = 1
😢 = 1
💋 = 1
😋 = 1
🙊 = 1
😴 = 0
🎶 = 1
💞 = 1
😌 = 1
🔥 = 1
💯 = 1
🔫 = 0
💛 = 1
💁 = 1
💚 = 1
♫ = 1
😞 = 0
😆 = 1
😝 = 1
😪 = 0
� = 1
😫 = 0
😅 = 1
👊 = 1
💀 = 0
😀 = 1
😚 = 1
😻 = 1
© = 1
👀 = 1
💘 = 1
🐓 = 1
☕ = 1
👋 = 1
✋ = 1
🎊 = 1
🍕 = 1
❄ = 1
😥 = 1
😕 = 0
💥 = 1
💔 = 0
😤 = 0
😈 = 1
► = 1
✈ = 1
🔝 = 1
😰 = 0
⚽ = 1
😑 = 0
👑 = 1
😹 = 1
👉 = 1
🍃 = 1
🎁 = 1
😠 = 0
🐧 = 1
☆ = 1
🍀 = 1
🎈 = 1
🎅 = 1
😓 = 0
😣 = 0
😐 = 0
✊ = 1
😨 = 0
😖 = 0
💤 = 1
💓 = 1
👎 = 0
💦 = 1
✔ = 1
😷 = 0
⚡ = 1
🙋 = 1
🎄 = 1
💩 = 0
🎵 = 1
➡ = 1
😛 = 1
😬 = 1
👯 = 1
💎 = 1
🌿 = 1
🎂 = 1
🌟 = 1
🔮 = 1
❗ = 1
👫 = 1
🏆 = 1
✖ = 1
☝ = 1
😙 = 1
⛄ = 1
👅 = 1
♪ = 1
🍂 = 1
💏 = 1
🔪 = 1
🌴 = 1
👈 = 1
🌹 = 1
🙆 = 1
➜ = 1
👻 = 1
💰 = 1
🍻 = 1
🙅 = 0
🌞 = 1
🍁 = 1
⭐ = 1
▪ = 1
🎀 = 1
━ = 1
☷ = 1
🐷 = 1
🙉 = 1
🌺 = 1
💅 = 1
🐶 = 1
🌚 = 1
👽 = 1
🎤 = 1
👭 = 1
🎧 = 

In [15]:
print(f'Total Positive Emojis are ({p}:{e_c}) or {round(p / e_c * 100)}%')

Total Positive Emojis are (795:969) or 82%


In [10]:
new_df_post = df_posts

In [16]:
# dependent variable will be linked as:
# 0 = negative, 1 = positive
y = new_df_post.sentiment
# convert 'sentence' from text to features
X = vectorizer.fit_transform(new_df_post.post)

print(y.shape)
print(X.shape)
print(f'{X.shape[0]} observations X {X.shape[1]} unique words')


(10000,)
(10000, 13201)
10000 observations X 13201 unique words


### Training
We will achieve a score (measured below as ROC AUC rather than plain accuracy) of roughly:

    ~85% if we used 1.6m dataset
    
    ~80% if we used the 10k dataset

In [17]:
# Test Train Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=None)

# we will train a naive bayes classifier
clf = naive_bayes.MultinomialNB()

# clf = naive_bayes.BernoulliNB()

clf.fit(X_train, y_train)

# evaluate the model: ROC AUC on the held-out test set
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])


0.7872704248868129
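
The value above is a ROC AUC, which is not the same thing as accuracy. The sketch below computes both by hand on four made-up predictions; `roc_auc` and `accuracy` are illustrative helpers, not scikit-learn functions. AUC depends only on how the scores rank positives against negatives, while accuracy depends on the chosen threshold.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(y_true, scores, threshold=0.5):
    """Share of examples whose thresholded score matches the true label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# made-up labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.2, 0.3, 0.4, 0.45]
print(roc_auc(y_true, scores))   # 1.0: every positive outranks every negative
print(accuracy(y_true, scores))  # 0.5: all scores fall below the 0.5 threshold
```

Because `random_state=None` is used in the split above, the reported score will also vary slightly from run to run.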

### Processing the inputs - Extraction of emoji and texts

In [36]:
import emoji
text = "#samplesenti @emojitweets i ❤❤❤ sentiment &quot; analysis &quot; http://senti.com/pic_01.jpg "
def extract_text_and_emoji(text = text):
    global allchars, emoji_list
    # remove tags, mentions, HTML entities and links; they are not needed for sentiment
    remove_keys = ('@', 'http://', 'https://', '&', '#')
    clean_text = ' '.join(txt for txt in text.split() if not txt.startswith(remove_keys))

    # set up the input: all characters, and the ones that are emoji
    # (emoji.UNICODE_EMOJI was removed in emoji 2.x; emoji.is_emoji replaces the lookup)
    allchars = list(text)
    emoji_list = [c for c in allchars if emoji.is_emoji(c)]

    # extract text: strip emoji characters so words glued to an emoji ("product❤") are kept
    clean_text = ' '.join(''.join(c for c in clean_text if not emoji.is_emoji(c)).split())

    # extract emoji: only the emoji characters, in order of appearance
    clean_emoji = ''.join(emoji_list)
    return (clean_text, clean_emoji)

allchars, emoji_list = 0, 0
(ct, ce) = extract_text_and_emoji()
print('\nAll Char:', allchars)
print('\nAll Emoji:', emoji_list)
print('\n', ct)
print('\n', ce)
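
A dependency-free way to separate emoji from text is to check each character against a set of known emoji. In the sketch below, `split_text_and_emoji` is a hypothetical helper and `known` is a stand-in for the emoji column of the dataset (for example `set(new_df_emoji['emoji'])`); it works one code point at a time, so multi-code-point emoji sequences are not handled.

```python
def split_text_and_emoji(text, known_emojis):
    """Return (text without emoji, emoji string), checking one code point at a time."""
    skip_prefixes = ("@", "#", "&", "http://", "https://")
    # drop mentions, hashtags, HTML entities and links first
    tokens = [tok for tok in text.split() if not tok.startswith(skip_prefixes)]
    kept = " ".join(tokens)
    emojis = "".join(ch for ch in kept if ch in known_emojis)
    plain = "".join(ch for ch in kept if ch not in known_emojis)
    # collapse the gaps left behind by removed emoji
    return " ".join(plain.split()), emojis

known = {"❤", "😍", "😭"}  # assumption: a tiny subset of the dataset's emoji column
sample = "#samplesenti @emojitweets i ❤❤❤ sentiment analysis http://senti.com/pic_01.jpg"
print(split_text_and_emoji(sample, known))
print(split_text_and_emoji("good product❤❤❤", known))
```

Checking against the dataset's own symbols has the side effect that every extracted emoji is guaranteed to have a polarity entry.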

In [21]:
pip list

Package              Version
-------------------- -----------
anyio                3.6.1
argon2-cffi          21.3.0
argon2-cffi-bindings 21.2.0
asttokens            2.0.5
async-generator      1.10
attrs                21.4.0
autopep8             2.0.1
Babel                2.10.1
backcall             0.2.0
beautifulsoup4       4.11.1
bleach               5.0.0
certifi              2021.10.8
cffi                 1.15.0
charset-normalizer   2.0.12
click                8.1.3
click-plugins        1.1.1
cligj                0.7.2
colorama             0.4.4
comtypes             1.1.10
cycler               0.11.0
debugpy              1.6.0
decorator            5.1.1
defusedxml           0.7.1
emoji                2.2.0
entrypoints          0.4
et-xmlfile           1.1.0
exceptiongroup       1.1.0
executing            0.8.3
fastjsonschema       2.15.3
Fiona                1.9.1
fonttools            4.33.3
geographiclib        2.0
geopandas            0.12.2
geopy                2.3.0
h11      



### Get the sentiments of the processed posts

In [29]:
def get_sentiment(s_input = 'good product'):
    # turn input into array
    input_array= np.array([s_input])
    # vectorize the input
    input_vector = vectorizer.transform(input_array)
    # predict the score of vector
    pred_senti = clf.predict(input_vector)

    return pred_senti[0]
print(get_sentiment())

1


In [28]:
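def get_emoji_sentiment(emoji_ls = '❤❤❤', emoji_df = new_df_emoji):
    # build the emoji -> polarity lookup once instead of scanning the frame per character
    polarity_by_emoji = dict(zip(emoji_df['emoji'], emoji_df['sentiment']))
    # characters missing from the emoji dataset are skipped instead of raising IndexError
    emoji_val_ls = [polarity_by_emoji[e] for e in emoji_ls if e in polarity_by_emoji]
    return emoji_val_ls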

ges = get_emoji_sentiment()
print('Sentiment value of each emoji:',ges)

Sentiment value of each emoji: [1, 1, 1]


### Building the sentiment analysis

In [38]:
def get_text_emoji_sentiment(input_test = 'good product❤❤❤'):
    # separate text and emoji
    (ext_text, ext_emoji) = extract_text_and_emoji(input_test)
    print(f'\tExtracted: "{ext_text}" , {ext_emoji}')

    # get text sentiment
    senti_text = get_sentiment(ext_text)
    print(f'\tText value: {senti_text}')

    # get emoji sentiment; count only the emojis that were actually scored
    emoji_values = get_emoji_sentiment(ext_emoji, new_df_emoji)
    senti_emoji_value = sum(emoji_values)
    print_emo_val_avg = senti_emoji_value / len(emoji_values) if emoji_values else 0
    print(f'\tEmoji average value: {print_emo_val_avg}')

    # average the text prediction with the emoji polarities (each one counts as one vote)
    senti_avg = (senti_emoji_value + senti_text) / (len(emoji_values) + 1)
    print(f'\tAverage value: {senti_avg}')

    # set value of avg sentiment to either pos or neg 
    senti_truth = "Positive" if senti_avg >= 0.5 else "Negative"
    
    return senti_truth

print(get_text_emoji_sentiment())

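Note: with `emoji` 2.x this cell raises `AttributeError: module 'emoji' has no attribute 'UNICODE_EMOJI'` unless `extract_text_and_emoji` checks characters with `emoji.is_emoji` instead of the removed `emoji.UNICODE_EMOJI` lookup.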

### Print the tweets with emoji

In [31]:
# print the sentiment of input
# further inputs to try:
# "I hate sentiment analysis 😘",
# "I love 😨 sentiment analysis 😩",
# "Naive Bayes is awesome 👌",
# "😔 Naive Bayes is cool 😖"

input_tests = [
    "i ❤❤❤ sentiment analysis",
    "Naive Bayes is awesome 😘😩😖😨"
]
def print_senti_status(test):
    print('========================================')
    print(f'Your input is "{test}" \n')
    sentiment = get_text_emoji_sentiment(test)
    print(f'\nYour input is of "{sentiment}" sentiment'.upper())
    print('========================================')
    
# for test in input_tests:
#     print_senti_status(test)
#     print('\n\n')

## Tweet Something

In [32]:
import ipywidgets as widgets
import warnings; warnings.simplefilter('ignore')

In [33]:
# for text area
l = widgets.Layout(flex='0 1 auto', height='50px',width='auto')
post_tweet = widgets.Textarea(value='🎶 Tweet 🐤 your feelings 😲 🎶', layout=l)
print(post_tweet.value)
# for button
button = widgets.Button(description="Say your Sentiments!")
output = widgets.Output()

def on_tweet_clicked(b):
    output.clear_output()
    with output:
        output.layout={'border': '1px solid black'}
        print_senti_status(post_tweet.value)


🎶 Tweet 🐤 your feelings 😲 🎶


In [34]:
# sample tweets with emoji
'''
"i ❤❤❤ sentiment analysis",
"I hate sentiment analysis 😘",
"I love 😨 sentiment analysis 😩",
"Naive Bayes is awesome 👌",
"😔 Naive Bayes is cool 😖"
"Naive Bayes is 😘😩😖😨"
'''

display(post_tweet,button, output)
button.on_click(on_tweet_clicked)

Textarea(value='🎶 Tweet 🐤 your feelings 😲 🎶', layout=Layout(flex='0 1 auto', height='50px', width='auto'))

Button(description='Say your Sentiments!', style=ButtonStyle())

Output()

### Conclusion
This is method 1, in which the classifier is trained on the tweet text separately from the emoticons. Each emoticon is assigned its own sentiment polarity. To analyze the sentiment of a tweet, we then average the sentiment values of the text and of its emoticons. With this method the emoticons have a strong influence on the result, and each one carries a fixed polarity value.
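
The averaging step can be worked through with a few numbers. In this sketch `combine_sentiment` is a hypothetical helper that mirrors the arithmetic described above; the emoji polarities come from the emoji table, while the text labels (1 for "Naive Bayes is awesome", 0 for "I hate sentiment analysis") are assumed predictions, not recorded outputs.

```python
def combine_sentiment(text_label, emoji_labels):
    """Average one text prediction with per-emoji polarities (each is 0 or 1)."""
    votes = [text_label] + list(emoji_labels)
    average = sum(votes) / len(votes)
    return average, ("Positive" if average >= 0.5 else "Negative")

# "Naive Bayes is awesome 😘😩😖😨": assumed text label 1, emoji polarities 1, 0, 0, 0
print(combine_sentiment(1, [1, 0, 0, 0]))   # (0.4, 'Negative')

# "I hate sentiment analysis 😘": assumed text label 0, emoji polarity 1
print(combine_sentiment(0, [1]))            # (0.5, 'Positive')
```

Both cases show how strongly the emoticons weigh in: the text counts as a single vote, so a handful of emojis can overturn it.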