# Emoji Sentiment

Are popular emojis generally associated with positive or negative sentiments?

The file `emoji-sentiment.csv` provides data on the sentiment associated with various emojis.

Researchers examined 1.6 million tweets across 13 European languages. Each tweet was labeled by annotators as positive (+1), negative (-1), or neutral (0). About 4% of these tweets included emojis.

Columns include:
- `Occurrences [5...max]`: Number of times the emoji appears in the dataset (minimum 5).
- `Position [0...1]`: Average position of the emoji in tweets, from start (0) to end (1).
- `Neg [0...1]`: Proportion of tweets containing the emoji that were labeled negative.
- `Neut [0...1]`: Proportion of tweets containing the emoji that were labeled neutral.
- `Pos [0...1]`: Proportion of tweets containing the emoji that were labeled positive.
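
If the three sentiment columns are shares of the same set of tweets, they should sum to roughly 1 for each emoji. The check below is a minimal sketch on five rows copied from the dataset preview in this notebook; the tolerance value is an assumption chosen to absorb rounding to three decimals.

```python
import pandas as pd

# Rows copied from the dataset preview shown in this notebook.
sample = pd.DataFrame({
    "Char": ["😂", "❤", "♥", "😍", "😭"],
    "Neg [0...1]": [0.247, 0.044, 0.035, 0.052, 0.436],
    "Neut [0...1]": [0.285, 0.166, 0.272, 0.219, 0.220],
    "Pos [0...1]": [0.468, 0.790, 0.693, 0.729, 0.343],
})

share_columns = ["Neg [0...1]", "Neut [0...1]", "Pos [0...1]"]
row_totals = sample[share_columns].sum(axis=1)

# Assumed tolerance: the published values are rounded to 3 decimals.
TOLERANCE = 0.005
rows_ok = (row_totals - 1).abs() <= TOLERANCE

print(row_totals.round(3).tolist())
print("All rows sum to ~1:", bool(rows_ok.all()))
```

Running the same check on the full `df` would flag any emoji whose shares do not add up, which may be worth inspecting before computing a sentiment score.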



In [3]:
# FOR GOOGLE COLAB ONLY.
# Uncomment and run the code below. A dialog will appear to upload files.
# Upload 'emoji-sentiment.csv'.

# from google.colab import files
# uploaded = files.upload()

In [4]:
import pandas as pd
df = pd.read_csv('emoji-sentiment.csv')
df.head(3)

Unnamed: 0,Char,Image [twemoji],Unicode codepoint,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1],Sentiment bar (c.i. 95%),Unicode name,Unicode block
0,😂,😂,0x1f602,14622,0.805,0.247,0.285,0.468,,FACE WITH TEARS OF JOY,Emoticons
1,❤,❤,0x2764,8050,0.747,0.044,0.166,0.79,,HEAVY BLACK HEART,Dingbats
2,♥,♥,0x2665,7144,0.754,0.035,0.272,0.693,,BLACK HEART SUIT,Miscellaneous Symbols


### Project Ideas:

Data Cleaning: 
- Remove unnecessary columns that are not useful for your analysis.

- Rename the remaining columns using `snake_case` (all lowercase letters with underscores between words).
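
Renaming can be done by hand with a dictionary, or programmatically. The sketch below uses a hypothetical helper, `to_snake_case`, that drops the bracketed range suffix (such as `[0...1]`) and converts what remains to `snake_case`. Note that it produces short names like `neg` and `pos`; if you prefer fuller names such as `negative`, a hand-written mapping is the simpler route.

```python
import re


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip bracketed/parenthesised suffixes, then snake_case."""
    # Remove suffixes such as "[0...1]" or "(c.i. 95%)".
    name = re.sub(r"\[.*?\]|\(.*?\)", "", name)
    # Lowercase and collapse every run of non-alphanumeric characters into "_".
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower())
    return name.strip("_")


columns = [
    "Char",
    "Occurrences [5...max]",
    "Position [0...1]",
    "Neg [0...1]",
    "Neut [0...1]",
    "Pos [0...1]",
    "Unicode name",
]
renamed = {col: to_snake_case(col) for col in columns}
print(renamed)
```

The resulting dictionary can be passed to `df.rename(columns=renamed)`.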

New Variables:
- Add a new column called `sentiment`, where sentiment = (proportion of positive tweets) - (proportion of negative tweets). The result ranges from -1 (all negative) to +1 (all positive).

- Add a `positive_flag` column that is `True` if `sentiment > 0` (or above a set threshold), otherwise `False`.
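
To see how the choice of threshold changes the flag, here is a minimal sketch on three rows copied from this dataset. The column names `positive` and `negative` are assumed `snake_case` names, and the threshold of 0.2 is an arbitrary illustration, not a recommended value.

```python
import pandas as pd

# Assumed cut-off for illustration only.
THRESHOLD = 0.2

# Rows copied from the dataset; column names are assumed snake_case versions.
sample = pd.DataFrame({
    "emoji": ["😂", "😭", "♮"],
    "positive": [0.468, 0.343, 0.250],
    "negative": [0.247, 0.436, 0.125],
})

sample["sentiment"] = sample["positive"] - sample["negative"]
# Flag with the simple rule (sentiment > 0) and with the stricter threshold.
sample["positive_flag"] = sample["sentiment"] > 0
sample["positive_flag_strict"] = sample["sentiment"] > THRESHOLD

print(sample[["emoji", "sentiment", "positive_flag", "positive_flag_strict"]])
```

An emoji with a small positive score passes the `> 0` rule but not the stricter one, so the percentage of "positive" emojis depends on which rule you pick.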

Types of questions you can now answer more easily:
- What percentage of emojis in the dataset have a positive sentiment?

- What percentage of the top 20 most popular emojis are positive?

- Which emoji (with more than 500 mentions) is the most positive?

- Which emoji (with more than 500 mentions) is the most negative?

- Where in the tweets are most emojis located (i.e. at the beginning or the end)?

- Is there a difference in the placement of positive versus negative emojis within a tweet?
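
For the placement questions, it matters whether every emoji counts equally or whether frequently used emojis count more. A plain mean of `Position [0...1]` treats a rare emoji the same as a very common one; weighting by `Occurrences [5...max]` approximates the average position across emoji uses. The sketch below compares the two on five rows copied from the dataset, using `numpy.average` for the weighted version.

```python
import numpy as np
import pandas as pd

# Rows copied from the dataset preview shown in this notebook.
sample = pd.DataFrame({
    "Char": ["😂", "❤", "♥", "😍", "😭"],
    "Occurrences [5...max]": [14622, 8050, 7144, 6359, 5526],
    "Position [0...1]": [0.805, 0.747, 0.754, 0.765, 0.803],
})

# Every emoji counts once.
simple_mean = sample["Position [0...1]"].mean()

# Every emoji counts in proportion to how often it was used.
weighted_mean = np.average(
    sample["Position [0...1]"],
    weights=sample["Occurrences [5...max]"],
)

print(f"Unweighted mean position: {simple_mean:.3f}")
print(f"Occurrence-weighted mean position: {weighted_mean:.3f}")
```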

In [None]:
# YOUR CODE HERE (add additional cells as needed)
# Data cleaning: inspect the shape, column names, and dtypes
# before removing columns that are not useful for the analysis.
print('Current dataset shape : ',df.shape)
print('\nAll columns: ')
print(df.columns.tolist())
print('\nDataset info:')
df.info()


Current dataset shape :  (751, 11)

All columns: 
['Char', 'Image [twemoji]', 'Unicode codepoint', 'Occurrences [5...max]', 'Position [0...1]', 'Neg [0...1]', 'Neut [0...1]', 'Pos [0...1]', 'Sentiment bar (c.i. 95%)', 'Unicode name', 'Unicode block']

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751 entries, 0 to 750
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Char                      751 non-null    object 
 1   Image [twemoji]           751 non-null    object 
 2   Unicode codepoint         751 non-null    object 
 3   Occurrences [5...max]     751 non-null    int64  
 4   Position [0...1]          751 non-null    float64
 5   Neg [0...1]               751 non-null    float64
 6   Neut [0...1]              751 non-null    float64
 7   Pos [0...1]               751 non-null    float64
 8   Sentiment bar (c.i. 95%)  0 non-null      float64
 9   Unicode name       

In [21]:
print(df.head())
for col in df.columns:
    print(f"- {col}:{df[col].dtype}")

  Char Image [twemoji] Unicode codepoint  Occurrences [5...max]  \
0    😂               😂           0x1f602                  14622   
1    ❤               ❤            0x2764                   8050   
2    ♥               ♥            0x2665                   7144   
3    😍               😍           0x1f60d                   6359   
4    😭               😭           0x1f62d                   5526   

   Position [0...1]  Neg [0...1]  Neut [0...1]  Pos [0...1]  \
0             0.805        0.247         0.285        0.468   
1             0.747        0.044         0.166        0.790   
2             0.754        0.035         0.272        0.693   
3             0.765        0.052         0.219        0.729   
4             0.803        0.436         0.220        0.343   

   Sentiment bar (c.i. 95%)                         Unicode name  \
0                       NaN               FACE WITH TEARS OF JOY   
1                       NaN                    HEAVY BLACK HEART   
2             

In [23]:
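# Keep only the columns needed for the analysis.
# .copy() gives an independent DataFrame, so later column assignments do not touch a view of the original.
useful_columns = ['Char','Occurrences [5...max]','Position [0...1]','Neg [0...1]','Neut [0...1]','Pos [0...1]']
df = df[useful_columns].copy()
df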

Unnamed: 0,Char,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1]
0,😂,14622,0.805,0.247,0.285,0.468
1,❤,8050,0.747,0.044,0.166,0.790
2,♥,7144,0.754,0.035,0.272,0.693
3,😍,6359,0.765,0.052,0.219,0.729
4,😭,5526,0.803,0.436,0.220,0.343
...,...,...,...,...,...,...
746,♮,5,0.937,0.125,0.625,0.250
747,🅾,5,0.977,0.375,0.375,0.250
748,🔄,5,0.971,0.125,0.750,0.125
749,☄,5,0.435,0.125,0.750,0.125


In [24]:
# Rename the remaining columns using snake_case (all lowercase letters with underscores between words).
# A separate name is used for the mapping so it does not overwrite the useful_columns list.
column_names = {'Char':'emoji','Occurrences [5...max]':'occurences','Position [0...1]':'position','Neg [0...1]':'negative','Neut [0...1]':'neutral','Pos [0...1]':'positive'}
df = df.rename(columns=column_names)
df

Unnamed: 0,emoji,occurences,position,negative,neutral,positive
0,😂,14622,0.805,0.247,0.285,0.468
1,❤,8050,0.747,0.044,0.166,0.790
2,♥,7144,0.754,0.035,0.272,0.693
3,😍,6359,0.765,0.052,0.219,0.729
4,😭,5526,0.803,0.436,0.220,0.343
...,...,...,...,...,...,...
746,♮,5,0.937,0.125,0.625,0.250
747,🅾,5,0.977,0.375,0.375,0.250
748,🔄,5,0.971,0.125,0.750,0.125
749,☄,5,0.435,0.125,0.750,0.125


In [26]:
# Add a new column called `sentiment`, where sentiment = (proportion positive) - (proportion negative).
df['sentiment'] = df['positive'] - df['negative']
df

Unnamed: 0,emoji,occurences,position,negative,neutral,positive,sentiment
0,😂,14622,0.805,0.247,0.285,0.468,0.221
1,❤,8050,0.747,0.044,0.166,0.790,0.746
2,♥,7144,0.754,0.035,0.272,0.693,0.658
3,😍,6359,0.765,0.052,0.219,0.729,0.677
4,😭,5526,0.803,0.436,0.220,0.343,-0.093
...,...,...,...,...,...,...,...
746,♮,5,0.937,0.125,0.625,0.250,0.125
747,🅾,5,0.977,0.375,0.375,0.250,-0.125
748,🔄,5,0.971,0.125,0.750,0.125,0.000
749,☄,5,0.435,0.125,0.750,0.125,0.000


In [27]:
# Add a `positive_flag` column that is True if sentiment > 0 (or above a set threshold), otherwise False.
df['positive_flag'] = df['sentiment'] > 0
df.head()


Unnamed: 0,emoji,occurences,position,negative,neutral,positive,sentiment,positive_flag
0,😂,14622,0.805,0.247,0.285,0.468,0.221,True
1,❤,8050,0.747,0.044,0.166,0.79,0.746,True
2,♥,7144,0.754,0.035,0.272,0.693,0.658,True
3,😍,6359,0.765,0.052,0.219,0.729,0.677,True
4,😭,5526,0.803,0.436,0.22,0.343,-0.093,False


In [28]:
# What percentage of emojis in the dataset have a positive sentiment?
percent_positive = df['positive_flag'].mean() * 100
print(f"Percentage of emojis with positive sentiment: {percent_positive:.2f}%")

# What percentage of the top 20 most popular emojis are positive?
top20 = df.nlargest(20, 'occurences')
percent_top20_positive = top20['positive_flag'].mean() * 100
print(f"Percentage of top 20 most popular emojis that are positive: {percent_top20_positive:.2f}%")

# Which emoji (with more than 500 mentions) is the most positive?
most_positive = df[df['occurences'] > 500].sort_values('sentiment', ascending=False).iloc[0]
print(f"Most positive emoji (>500 mentions): {most_positive['emoji']} (sentiment: {most_positive['sentiment']:.3f})")

# Which emoji (with more than 500 mentions) is the most negative?
most_negative = df[df['occurences'] > 500].sort_values('sentiment').iloc[0]
print(f"Most negative emoji (>500 mentions): {most_negative['emoji']} (sentiment: {most_negative['sentiment']:.3f})")

# Where in the tweets are most emojis located (i.e. at the beginning or the end)?
avg_position = df['position'].mean()
print(f"Average emoji position in tweets: {avg_position:.3f} (0=start, 1=end)")

# Is there a difference in the placement of positive versus negative emojis within a tweet?
# Note: ~positive_flag selects sentiment <= 0, so the "negative" group also includes emojis with sentiment exactly 0.
avg_pos_positive = df[df['positive_flag']]['position'].mean()
avg_pos_negative = df[~df['positive_flag']]['position'].mean()
print(f"Average position of positive emojis: {avg_pos_positive:.3f}")
print(f"Average position of negative emojis: {avg_pos_negative:.3f}")



Percentage of emojis with positive sentiment: 82.42%
Percentage of top 20 most popular emojis that are positive: 90.00%
Most positive emoji (>500 mentions): ❤ (sentiment: 0.746)
Most negative emoji (>500 mentions): 😒 (sentiment: -0.374)
Average emoji position in tweets: 0.666 (0=start, 1=end)
Average position of positive emojis: 0.662
Average position of negative emojis: 0.681
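
Because `positive_flag` is `False` for a sentiment of exactly 0 as well as for values below 0, a three-way split by the sign of `sentiment` gives a cleaner comparison of strictly positive and strictly negative emojis. The sketch below is one possible approach, shown on six rows copied from the tables above; the `sentiment_sign` column and its labels are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the tables above (position and sentiment columns).
sample = pd.DataFrame({
    "emoji": ["😂", "❤", "😭", "🅾", "🔄", "☄"],
    "position": [0.805, 0.747, 0.803, 0.977, 0.971, 0.435],
    "sentiment": [0.221, 0.746, -0.093, -0.125, 0.000, 0.000],
})

# np.sign returns -1, 0, or 1; map those to readable labels (illustrative names).
labels = {-1.0: "negative", 0.0: "zero", 1.0: "positive"}
sample["sentiment_sign"] = np.sign(sample["sentiment"]).map(labels)

# Mean position and number of emojis in each group.
summary = sample.groupby("sentiment_sign")["position"].agg(["mean", "count"])
print(summary)
```

On the full dataset the same grouping would show how many emojis sit at exactly zero and whether they shift the "negative" average.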
