# Emoji Sentiment

Are popular emojis generally associated with positive or negative sentiments?

The file `"emoji-sentiment.csv"` provides data on the sentiment associated with various emojis.

Researchers examined 1.6 million tweets across 13 European languages. Each tweet was labeled by annotators as positive (+1), negative (-1), or neutral (0). About 4% of these tweets included emojis.

Columns include:
- `Occurrences [5...max]`: Number of times the emoji appears in the dataset.
- `Position [0...1]`: Average position of the emoji in tweets, from start (0) to end (1).
- `Neg [0...1]`: Proportion of tweets containing the emoji that were labeled negative.
- `Neut [0...1]`: Proportion of tweets containing the emoji that were labeled neutral.
- `Pos [0...1]`: Proportion of tweets containing the emoji that were labeled positive.



In [22]:
import pandas as pd
df = pd.read_csv('emoji-sentiment.csv')
df.head(3)

Unnamed: 0,Char,Image [twemoji],Unicode codepoint,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1],Sentiment bar (c.i. 95%),Unicode name,Unicode block
0,😂,😂,0x1f602,14622,0.805,0.247,0.285,0.468,,FACE WITH TEARS OF JOY,Emoticons
1,❤,❤,0x2764,8050,0.747,0.044,0.166,0.79,,HEAVY BLACK HEART,Dingbats
2,♥,♥,0x2665,7144,0.754,0.035,0.272,0.693,,BLACK HEART SUIT,Miscellaneous Symbols
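
Since each tweet received exactly one of the three labels, the `Neg`, `Neut`, and `Pos` shares of an emoji should add up to roughly 1. The following is a minimal sanity-check sketch, not part of the original notebook. It uses only the three rows printed above, and the rounding tolerance of 0.005 is an assumption.

```python
import pandas as pd

# Three rows copied from the `df.head(3)` output above (illustrative subset only).
sample = pd.DataFrame({
    'Unicode codepoint': ['0x1f602', '0x2764', '0x2665'],
    'Neg [0...1]': [0.247, 0.044, 0.035],
    'Neut [0...1]': [0.285, 0.166, 0.272],
    'Pos [0...1]': [0.468, 0.790, 0.693],
})

# Sum the three sentiment shares for each emoji.
share_cols = ['Neg [0...1]', 'Neut [0...1]', 'Pos [0...1]']
row_sums = sample[share_cols].sum(axis=1)

# Assumed tolerance: the shares are rounded to three decimals in the file.
shares_add_up = (row_sums - 1).abs() < 0.005
print(shares_add_up.all())
```

On the full file, the same check could be run on `df` directly to spot rows whose shares do not add up.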


- Remove unnecessary columns that are not useful for your analysis.

In [23]:
# Drop unnecessary columns
columns_to_delete = ['Image [twemoji]', 'Sentiment bar (c.i. 95%)', 'Unicode name', 'Unicode block']
df = df.drop(columns=columns_to_delete)

- Rename the remaining columns using `snake_case` (all lowercase letters with underscores between words).

In [24]:
new_names = {
    'Char' : 'char',
    'Unicode codepoint' : 'unicode',
    'Occurrences [5...max]' : 'occurence',
    'Position [0...1]' : 'position',
    'Neg [0...1]' : 'negative',
    'Neut [0...1]' : 'neutral',
    'Pos [0...1]' : 'positive'
}
df = df.rename(columns=new_names)
df

Unnamed: 0,char,unicode,occurence,position,negative,neutral,positive
0,😂,0x1f602,14622,0.805,0.247,0.285,0.468
1,❤,0x2764,8050,0.747,0.044,0.166,0.790
2,♥,0x2665,7144,0.754,0.035,0.272,0.693
3,😍,0x1f60d,6359,0.765,0.052,0.219,0.729
4,😭,0x1f62d,5526,0.803,0.436,0.220,0.343
...,...,...,...,...,...,...,...
746,♮,0x266e,5,0.937,0.125,0.625,0.250
747,🅾,0x1f17e,5,0.977,0.375,0.375,0.250
748,🔄,0x1f504,5,0.971,0.125,0.750,0.125
749,☄,0x2604,5,0.435,0.125,0.750,0.125


- Add a new column called `sentiment`, where sentiment = (proportion of positive tweets) - (proportion of negative tweets), so values range from -1 to 1.

- Add a `positive_flag` column that is `True` if `sentiment > 0` (or above a set threshold), otherwise `False`.

In [25]:
# Sentiment
df['sentiment'] = df.eval('positive - negative')

# Positive flag
df['positive_flag'] = df['sentiment'] > 0
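
The prompt also allows a stricter cutoff than 0. As a hedged sketch of that variant, the helper below (`add_sentiment_flags` and its `threshold` parameter are hypothetical names, not part of the original notebook) computes both columns with a configurable threshold. It is shown on three rows copied from the table above.

```python
import pandas as pd

# Rows copied from the renamed table above (illustrative subset only).
sample = pd.DataFrame({
    'unicode': ['0x1f602', '0x2764', '0x1f62d'],
    'negative': [0.247, 0.044, 0.436],
    'positive': [0.468, 0.790, 0.343],
})

def add_sentiment_flags(frame, threshold=0.0):
    """Return a copy of `frame` with `sentiment` and `positive_flag` columns added."""
    out = frame.copy()
    out['sentiment'] = out['positive'] - out['negative']
    # An emoji counts as positive only if its sentiment exceeds the threshold.
    out['positive_flag'] = out['sentiment'] > threshold
    return out

default_flags = add_sentiment_flags(sample)                  # threshold of 0, as in the cell above
strict_flags = add_sentiment_flags(sample, threshold=0.25)   # assumed stricter cutoff
print(default_flags['positive_flag'].tolist())
print(strict_flags['positive_flag'].tolist())
```

With the stricter cutoff, an emoji such as `0x1f602` (sentiment 0.221) is no longer flagged as positive.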

- What percentage of emojis in the dataset have a positive sentiment?

In [26]:
# Each row is one emoji, so this is the share of emojis (not tweets) flagged positive
sum_positives = df['positive_flag'].sum()
total_emojis = df.shape[0]
percentage_positive = sum_positives / total_emojis * 100
print(f'Percentage of positive emojis : {percentage_positive}')

Percentage of positive emojis : 82.42343541944075


- What percentage of the top 20 most popular emojis are positive?

In [27]:
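# Keep only the emojis flagged as positive (reused in a later cell)
positive_tweets = df.query('positive_flag == True')
# Sort the positive emojis from most to least frequent
most_recurrent = positive_tweets.sort_values(by=['occurence'], ascending=False)
most_recurrent.head(10)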

Unnamed: 0,char,unicode,occurence,position,negative,neutral,positive,sentiment,positive_flag
0,😂,0x1f602,14622,0.805,0.247,0.285,0.468,0.221,True
1,❤,0x2764,8050,0.747,0.044,0.166,0.79,0.746,True
2,♥,0x2665,7144,0.754,0.035,0.272,0.693,0.658,True
3,😍,0x1f60d,6359,0.765,0.052,0.219,0.729,0.677,True
5,😘,0x1f618,3648,0.854,0.053,0.193,0.754,0.701,True
6,😊,0x1f60a,3186,0.813,0.06,0.237,0.704,0.644,True
7,👌,0x1f44c,2925,0.805,0.094,0.249,0.657,0.563,True
8,💕,0x1f495,2400,0.766,0.042,0.285,0.674,0.632,True
9,👏,0x1f44f,2336,0.787,0.104,0.271,0.624,0.52,True
10,😁,0x1f601,2189,0.796,0.127,0.296,0.577,0.45,True
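
The listing above shows the ten most frequent positive emojis rather than a percentage; row 4 is absent because its negative share (0.436) exceeds its positive share (0.343). One possible way to answer the question directly is to take the `n` most frequent emojis first and then average their flags. The helper name `positive_share_of_top` is hypothetical, and the sample holds only the eleven most frequent rows shown in this notebook, so the printed figure applies to that sample and is not the top-20 answer.

```python
import pandas as pd

# Rows 0-10 copied from the tables in this notebook (illustrative subset only).
sample = pd.DataFrame({
    'occurence': [14622, 8050, 7144, 6359, 5526, 3648, 3186, 2925, 2400, 2336, 2189],
    'negative': [0.247, 0.044, 0.035, 0.052, 0.436, 0.053, 0.060, 0.094, 0.042, 0.104, 0.127],
    'positive': [0.468, 0.790, 0.693, 0.729, 0.343, 0.754, 0.704, 0.657, 0.674, 0.624, 0.577],
})
sample['positive_flag'] = (sample['positive'] - sample['negative']) > 0

def positive_share_of_top(frame, n=20):
    """Percentage of the n most frequent emojis that are flagged positive."""
    top = frame.nlargest(n, 'occurence')
    # The mean of a boolean column is the fraction of True values.
    return top['positive_flag'].mean() * 100

# n=10 only because the sample has eleven rows; on the full `df`, n=20 would apply.
print(positive_share_of_top(sample, n=10))
```

On the full dataset, `positive_share_of_top(df)` would cover the top 20 asked for in the question.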


- Which emoji (with more than 500 mentions) is the most positive?

In [28]:
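# Restrict to emojis that occur more than 500 times
more_500_mentions = df.query('occurence > 500')
# Find the highest positive share, then select the row(s) that reach it
max_pos = more_500_mentions['positive'].max()
most_pos = more_500_mentions.query('positive == @max_pos')
most_pos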

Unnamed: 0,char,unicode,occurence,position,negative,neutral,positive,sentiment,positive_flag
1,❤,0x2764,8050,0.747,0.044,0.166,0.79,0.746,True


- Which emoji (with more than 500 mentions) is the most negative?

In [29]:
max_neg = more_500_mentions['negative'].max()
most_neg = more_500_mentions.query('negative == @max_neg')
most_neg

Unnamed: 0,char,unicode,occurence,position,negative,neutral,positive,sentiment,positive_flag
14,😩,0x1f629,1808,0.826,0.591,0.186,0.223,-0.368,False
23,😒,0x1f612,1385,0.858,0.591,0.192,0.217,-0.374,False
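
The two emojis above tie on their negative share (0.591). One way to break such ties, sketched here as an alternative and not as the notebook's method, is to rank by the `sentiment` score instead of the raw `negative` or `positive` share. The sample contains only four rows copied from the outputs above, so the result holds for this sample and may differ on the full data.

```python
import pandas as pd

# Four rows copied from the outputs above (illustrative subset only).
sample = pd.DataFrame({
    'unicode': ['0x2764', '0x1f60d', '0x1f629', '0x1f612'],
    'occurence': [8050, 6359, 1808, 1385],
    'negative': [0.044, 0.052, 0.591, 0.591],
    'positive': [0.790, 0.729, 0.223, 0.217],
})
sample['sentiment'] = sample['positive'] - sample['negative']

# Restrict to emojis with more than 500 mentions, as in the cells above.
frequent = sample[sample['occurence'] > 500]

# idxmax / idxmin return the index label of the first extreme value.
most_positive = frequent.loc[frequent['sentiment'].idxmax()]
most_negative = frequent.loc[frequent['sentiment'].idxmin()]
print(most_positive['unicode'], most_negative['unicode'])
```

Ranking by `sentiment` separates the tied pair because their positive shares differ (0.223 versus 0.217).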


- Where in the tweets are most emojis located (i.e. at the beginning or the end)?

In [40]:
# Emojis whose average position lies in the second half of the tweet
end_emojis = df.query('position > 0.5')
# Emojis whose average position lies in the first half (exactly 0.5 falls in neither group)
beginning_emojis = df.query('position < 0.5')
print(f'Number of emojis located closer to the end : {end_emojis.shape[0]}')
print(f'Number of emojis located closer to the beginning : {beginning_emojis.shape[0]}')

Number of emojis located closer to the end : 639
Number of emojis located closer to the beginning : 111
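
The two strict comparisons leave out any emoji whose average position is exactly 0.5. As a hedged sketch, the labeling below makes that third case explicit with `numpy.select`; the label names (`beginning`, `end`, `middle`) are assumptions, and the sample contains only five rows copied from the tables above.

```python
import numpy as np
import pandas as pd

# Five rows copied from the tables above (illustrative subset only).
sample = pd.DataFrame({
    'unicode': ['0x1f602', '0x2764', '0x2665', '0x266e', '0x2604'],
    'position': [0.805, 0.747, 0.754, 0.937, 0.435],
})

# Label each emoji by the half of the tweet its average position falls in.
conditions = [sample['position'] < 0.5, sample['position'] > 0.5]
choices = ['beginning', 'end']
labels = pd.Series(
    np.select(conditions, choices, default='middle'),  # 'middle' covers exactly 0.5
    index=sample.index,
)

counts = labels.value_counts()
print(counts)
```

Applied to the full `df`, the `counts` series would also report how many emojis sit exactly at the midpoint.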


- Is there a difference in the placement of positive versus negative emojis within a tweet?

In [43]:
# Average position of emojis with positive sentiment
position_pos = positive_tweets['position'].mean()
print(f'Average position for positive emojis : {position_pos}\n')

# Select all emojis that are not flagged positive (sentiment <= 0)
negative_tweets = df.query('positive_flag == False')
position_neg = negative_tweets['position'].mean()
print(f'Average position for negative emojis : {position_neg}\n')


Average position for positive emojis : 0.662248788368336

Average position for negative emojis : 0.6810227272727273
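
The means above give every emoji the same weight, whether it appears 5 times or 14,622 times. A possible refinement, sketched here as an assumption and not as the notebook's method, is to weight each emoji's position by its number of occurrences with `numpy.average`. The sample uses four rows copied from the outputs above, so the numbers it prints describe only that sample.

```python
import numpy as np
import pandas as pd

# Four rows copied from the outputs above (illustrative subset only).
sample = pd.DataFrame({
    'unicode': ['0x1f602', '0x2764', '0x1f629', '0x1f612'],
    'occurence': [14622, 8050, 1808, 1385],
    'position': [0.805, 0.747, 0.826, 0.858],
    'positive_flag': [True, True, False, False],
})

is_positive = sample['positive_flag']

# Occurrence-weighted mean position for positive and non-positive emojis.
weighted_pos = np.average(
    sample.loc[is_positive, 'position'],
    weights=sample.loc[is_positive, 'occurence'],
)
weighted_neg = np.average(
    sample.loc[~is_positive, 'position'],
    weights=sample.loc[~is_positive, 'occurence'],
)
print(f'Weighted position (positive emojis) : {weighted_pos:.3f}')
print(f'Weighted position (negative emojis) : {weighted_neg:.3f}')
```

Comparing the weighted and unweighted means on the full `df` would show whether the small gap between the two groups is driven by rarely used emojis.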

