<a href="https://colab.research.google.com/github/BhavikBuchke/Cisco-Data-science-program/blob/main/Data%20cleaning/emoji%20sentiment%20project/Emoji%20Sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Emoji Sentiment

Are popular emojis generally associated with positive or negative sentiments?

The file `"emoji-sentiment.csv"` provides data on the sentiment associated with various emojis.

Researchers examined 1.6 million tweets across 13 European languages. Each tweet was labeled by annotators as positive (+1), negative (-1), or neutral (0). About 4% of these tweets included emojis.

Columns include:
- `Occurrences [5...max]`: Number of times the emoji appears in the dataset.
- `Position [0...1]`: Average position of the emoji in tweets, from start (0) to end (1).
- `Neg [0...1]`: Proportion of tweets with the emoji that are 'negative'.
- `Neut [0...1]`: Proportion of tweets with the emoji that are 'neutral'.
- `Pos [0...1]`: Proportion of tweets with the emoji that are 'positive'.



In [1]:
# FOR GOOGLE COLAB ONLY.
# Run this cell in Colab; a dialog will appear to upload files.
# Upload 'emoji-sentiment.csv'. Skip this cell when running locally.

from google.colab import files
uploaded = files.upload()

Saving emoji-sentiment.csv to emoji-sentiment.csv


In [2]:
import pandas as pd
df = pd.read_csv('emoji-sentiment.csv')
df.head(3)

Unnamed: 0,Char,Image [twemoji],Unicode codepoint,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1],Sentiment bar (c.i. 95%),Unicode name,Unicode block
0,😂,😂,0x1f602,14622,0.805,0.247,0.285,0.468,,FACE WITH TEARS OF JOY,Emoticons
1,❤,❤,0x2764,8050,0.747,0.044,0.166,0.79,,HEAVY BLACK HEART,Dingbats
2,♥,♥,0x2665,7144,0.754,0.035,0.272,0.693,,BLACK HEART SUIT,Miscellaneous Symbols
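
Since `Neg`, `Neut`, and `Pos` are described as shares of the same set of tweets, each row should add up to roughly 1. The sketch below checks this on the three rows printed above, typed in by hand; the 0.01 tolerance is an assumption to allow for rounding in the CSV.

```python
import pandas as pd

# The three sample rows shown by df.head(3), copied by hand.
sample = pd.DataFrame({
    "Neg [0...1]": [0.247, 0.044, 0.035],
    "Neut [0...1]": [0.285, 0.166, 0.272],
    "Pos [0...1]": [0.468, 0.790, 0.693],
})

# Each row's three proportions should add up to roughly 1.
row_totals = sample.sum(axis=1)

# Assumed tolerance of 0.01 to absorb rounding in the source file.
rows_ok = (row_totals - 1).abs() < 0.01
print(rows_ok.all())
```

The same check could be run on the full `df` before dropping columns, to spot rows with missing or inconsistent values.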


### Project Ideas:

Data Cleaning:
- Remove unnecessary columns that are not useful for your analysis.

- Rename the remaining columns using `snake_case` (all lowercase letters with underscores between words).

New Variables:
- Add a new column called `sentiment`, where sentiment = (proportion of positive tweets) - (proportion of negative tweets), giving a score between -1 and +1.

- Add a `positive_flag` column that is `True` if `sentiment > 0` (or above a set threshold), otherwise `False`.

Types of questions you can now answer more easily:
- What percentage of emojis in the dataset have a positive sentiment?

- What percentage of the top 20 most popular emojis are positive?

- Which emoji (with more than 500 mentions) is the most positive?

- Which emoji (with more than 500 mentions) is the most negative?

- Where in the tweets are most emojis located (i.e. at the beginning or the end)?

- Is there a difference in the placement of positive versus negative emojis within a tweet?

In [3]:
df.columns

Index(['Char', 'Image [twemoji]', 'Unicode codepoint', 'Occurrences [5...max]',
       'Position [0...1]', 'Neg [0...1]', 'Neut [0...1]', 'Pos [0...1]',
       'Sentiment bar (c.i. 95%)', 'Unicode name', 'Unicode block'],
      dtype='object')

In [4]:
# Data Cleaning
# Removing Unnecessary Columns
df.drop(columns=['Char','Image [twemoji]','Unicode codepoint','Unicode block','Unicode name','Sentiment bar (c.i. 95%)'], inplace=True)
df.head(3)

Unnamed: 0,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1]
0,14622,0.805,0.247,0.285,0.468
1,8050,0.747,0.044,0.166,0.79
2,7144,0.754,0.035,0.272,0.693


In [5]:
# Rename the remaining columns using snake_case
import re

def snake_case(name):
    name = re.sub(r'\[.*?\]', '', name)
    name = re.sub(r'[^a-zA-Z0-9_\s]', '', name)
    name = name.lower()
    name = re.sub(r'\s+', '_', name)
    name = re.sub(r'_+$', '', name)
    return name

# Apply the function to rename columns
df.columns = [snake_case(col) for col in df.columns]
df.head(3)

Unnamed: 0,occurrences,position,neg,neut,pos
0,14622,0.805,0.247,0.285,0.468
1,8050,0.747,0.044,0.166,0.79
2,7144,0.754,0.035,0.272,0.693
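
To see what each regex step in `snake_case` does, the function can be run on a few of the original column names on their own. This is a standalone copy of the function from the cell above, with comments added; the list of names is taken from the `df.columns` output.

```python
import re

def snake_case(name):
    # Same steps as the notebook cell, repeated so this block runs on its own.
    name = re.sub(r'\[.*?\]', '', name)          # drop bracketed ranges such as [0...1]
    name = re.sub(r'[^a-zA-Z0-9_\s]', '', name)  # drop remaining punctuation
    name = name.lower()
    name = re.sub(r'\s+', '_', name)             # whitespace runs become underscores
    return re.sub(r'_+$', '', name)              # trim trailing underscores

original = ['Occurrences [5...max]', 'Position [0...1]', 'Neut [0...1]', 'Unicode name']
renamed = [snake_case(col) for col in original]
print(renamed)
```

The trailing-underscore step matters because removing `[0...1]` leaves a space at the end of the name, which would otherwise become `position_`.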


In [6]:
# New variables
# Adding a new column called sentiment: share of positive minus share of negative tweets
df['sentiment'] = (df['pos'] - df['neg']).round(2)
df.head(3)

Unnamed: 0,occurrences,position,neg,neut,pos,sentiment
0,14622,0.805,0.247,0.285,0.468,0.22
1,8050,0.747,0.044,0.166,0.79,0.75
2,7144,0.754,0.035,0.272,0.693,0.66
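
As a quick arithmetic check, the sentiment score can be recomputed for the three rows shown above using plain column subtraction. The small frame below is typed in by hand from that output; it is only a sketch and does not read the CSV.

```python
import pandas as pd

# neg and pos values for the first three emojis, copied from the output above.
sample = pd.DataFrame({
    "neg": [0.247, 0.044, 0.035],
    "pos": [0.468, 0.790, 0.693],
})

# sentiment = pos - neg, rounded to two decimals as in the notebook.
sample["sentiment"] = (sample["pos"] - sample["neg"]).round(2)
print(sample["sentiment"].tolist())
```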


In [7]:
# Adding positive_flag column: True when sentiment is above 0.
# A vectorized comparison replaces the row-by-row apply().
df['positive_flag'] = df['sentiment'] > 0
df.head(3)

Unnamed: 0,occurrences,position,neg,neut,pos,sentiment,positive_flag
0,14622,0.805,0.247,0.285,0.468,0.22,True
1,8050,0.747,0.044,0.166,0.79,0.75,True
2,7144,0.754,0.035,0.272,0.693,0.66,True
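
The project ideas also mention flagging emojis only when sentiment is above a set threshold. A minimal sketch of that variant follows; the cutoff of 0.05 and the third score (0.03) are hypothetical values chosen for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical cutoff: scores at or below 0.05 are not counted as positive.
THRESHOLD = 0.05

# Three scores from the outputs above plus one made-up borderline score (0.03).
scores = pd.DataFrame({"sentiment": [0.22, 0.75, 0.03, -0.09]})
scores["positive_flag"] = scores["sentiment"] > THRESHOLD
print(scores["positive_flag"].tolist())
```

With a threshold of 0 the borderline score would be flagged as positive; raising the cutoff treats near-neutral emojis as not positive.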


In [8]:
# Percentage of emojis in the dataset that have a positive sentiment.
# The mean of a boolean column is the fraction of True values.
positive_percentage = df['positive_flag'].mean() * 100
print(f"{round(positive_percentage,2)}% of emojis in the dataset have a positive sentiment")

82.29% of emojis in the dataset have a positive sentiment


In [9]:
# Percentage of the top 20 most popular emojis that are positive.
# nlargest picks exactly the 20 rows with the highest occurrence counts;
# filtering on occurrences >= mean would select a different number of rows.
top_20_emojis = df.nlargest(20, 'occurrences')
print(top_20_emojis)
positive_percentage = top_20_emojis['positive_flag'].mean() * 100
print(f"\n{round(positive_percentage,2)}% of the top 20 most popular emojis are positive")

    occurrences  position    neg   neut    pos  sentiment  positive_flag
0         14622     0.805  0.247  0.285  0.468       0.22           True
1          8050     0.747  0.044  0.166  0.790       0.75           True
2          7144     0.754  0.035  0.272  0.693       0.66           True
3          6359     0.765  0.052  0.219  0.729       0.68           True
4          5526     0.803  0.436  0.220  0.343      -0.09          False
5          3648     0.854  0.053  0.193  0.754       0.70           True
6          3186     0.813  0.060  0.237  0.704       0.64           True
7          2925     0.805  0.094  0.249  0.657       0.56           True
8          2400     0.766  0.042  0.285  0.674       0.63           True
9          2336     0.787  0.104  0.271  0.624       0.52           True
10         2189     0.796  0.127  0.296  0.577       0.45           True
11         2062     0.799  0.062  0.218  0.720       0.66           True
12         1975     0.764  0.052  0.227  0.721     
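
Selecting "the top 20" and selecting "everything at or above the mean count" are different filters and generally return different numbers of rows. The sketch below illustrates this on a made-up frame of 30 emojis; all counts and flags here are hypothetical.

```python
import pandas as pd

# Hypothetical data: 30 emojis with made-up counts and flags.
toy = pd.DataFrame({
    "occurrences": [100 * n for n in range(30, 0, -1)],
    "positive_flag": [n % 5 != 0 for n in range(30)],
})

# Filter 1: rows at or above the mean count (row count depends on the data).
above_average = toy[toy["occurrences"] >= toy["occurrences"].mean()]

# Filter 2: exactly the 20 rows with the largest counts.
top_20 = toy.nlargest(20, "occurrences")

print(len(above_average), len(top_20))
print(f"{top_20['positive_flag'].mean() * 100:.1f}% of the top 20 are positive")
```

Only the second filter guarantees that the percentage is computed over 20 emojis.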

In [10]:
# Most positive emoji (with more than 500 mentions), ranked by share of positive tweets (pos).
# Most negative emoji (with more than 500 mentions), ranked by share of negative tweets (neg).

popular_emojis = df[df['occurrences'] > 500]
most_positive = popular_emojis.sort_values(by='pos', ascending=False)
most_negative = popular_emojis.sort_values(by='neg', ascending=False)
print(f'The most positive emoji is:\n {most_positive.head(1)}')
print(f'The most negative emoji is:\n {most_negative.head(1)}')

The most positive emoji is:
    occurrences  position    neg   neut   pos  sentiment  positive_flag
1         8050     0.747  0.044  0.166  0.79       0.75           True
The most negative emoji is:
     occurrences  position    neg   neut    pos  sentiment  positive_flag
23         1385     0.858  0.591  0.192  0.217      -0.37          False
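
The cell above ranks emojis by the `pos` and `neg` columns separately. An alternative, sketched here, is to rank by the net `sentiment` score using `idxmax` and `idxmin`. Rows 0, 1, 4 and 23 are copied from the outputs above; the row with index 999 is a hypothetical rare emoji added only to show the occurrence filter at work.

```python
import pandas as pd

emojis = pd.DataFrame(
    {
        "occurrences": [14622, 8050, 5526, 1385, 12],
        "sentiment": [0.22, 0.75, -0.09, -0.37, 0.95],
    },
    index=[0, 1, 4, 23, 999],  # 999 is a made-up row, not from the dataset
)

# Keep only emojis with more than 500 mentions.
popular = emojis[emojis["occurrences"] > 500]

# idxmax / idxmin return the index label of the extreme sentiment score.
most_positive = popular.loc[popular["sentiment"].idxmax()]
most_negative = popular.loc[popular["sentiment"].idxmin()]
print(most_positive.name, most_negative.name)
```

On the full dataset the two rankings may or may not agree, since an emoji with a high `pos` share can also have a sizeable `neg` share.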


In [11]:
# Where in the tweets are most emojis located? Show the three most common position values.
emoji_position = df['position'].value_counts()
print(emoji_position.head(3))

position
0.739    8
0.794    6
0.814    5
Name: count, dtype: int64
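
Because `position` is a continuous value, counting exact matches only shows a handful of coincidentally equal numbers. One way to summarise placement, sketched here, is to bin positions into thirds of the tweet with `pd.cut`. The positions below are illustrative values, and the three equal-width bins are an assumption; in the notebook the input would be `df['position']`.

```python
import pandas as pd

# Illustrative positions; replace with df['position'] in the notebook.
positions = pd.Series([0.10, 0.45, 0.739, 0.794, 0.805, 0.814, 0.858])

# Split the 0-1 range into thirds: start, middle, and end of the tweet.
bins = pd.cut(
    positions,
    bins=[0, 1 / 3, 2 / 3, 1],
    labels=["start", "middle", "end"],
    include_lowest=True,
)

# Share of emojis falling in each third, in start/middle/end order.
share = bins.value_counts(normalize=True).sort_index()
print(share)
print(f"median position: {positions.median():.3f}")
```

A histogram or `describe()` on the column would give a similar overview.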


In [14]:
# Difference in the placement of positive versus negative emojis within a tweet.
# positive_flag is already boolean, so it can be used directly as a mask.
positive_placement = df[df['positive_flag']]['position'].mean()
negative_placement = df[~df['positive_flag']]['position'].mean()
print(f"The average position of positive emojis is: {round(positive_placement,3)} and negative emojis is: {round(negative_placement,2)} in tweets")

The average position of positive emojis is: 0.663 and negative emojis is: 0.68 in tweets
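
The same comparison can be written as a single `groupby`, which also makes it easy to report the median and group size alongside the mean. This sketch uses four rows copied from the outputs above (indices 0, 1, 4 and 23); the `group` label column is a helper introduced here for readability.

```python
import pandas as pd

# Four rows copied from the earlier outputs.
rows = pd.DataFrame({
    "position": [0.805, 0.747, 0.803, 0.858],
    "positive_flag": [True, True, False, False],
})

# Helper column (an addition for this sketch) with readable group names.
rows["group"] = rows["positive_flag"].map({True: "positive", False: "negative"})

# Mean, median, and count of position per group.
placement = rows.groupby("group")["position"].agg(["mean", "median", "count"])
print(placement.round(3))
```

Reporting the group sizes helps judge whether a small gap in average position is meaningful.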
