# Emoji Sentiment

Are popular emojis generally associated with positive or negative sentiments?

The file `"emoji-sentiment.csv"` provides data on the sentiment associated with various emojis.

Researchers examined 1.6 million tweets across 13 European languages. Each tweet was labeled by annotators as positive (+1), negative (-1), or neutral (0). About 4% of these tweets included emojis.

Columns include:
- `Occurrences [5...max]`: Number of times the emoji appears in the dataset.
- `Position [0...1]`: Average position of the emoji in tweets, from start (0) to end (1).
- `Neg [0...1]`: Proportion of tweets containing the emoji that were labeled negative.
- `Neut [0...1]`: Proportion of tweets containing the emoji that were labeled neutral.
- `Pos [0...1]`: Proportion of tweets containing the emoji that were labeled positive.
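
Because the negative, neutral, and positive columns are proportions of the same set of tweets, they should add up to roughly 1 for each emoji. The following is a minimal sanity-check sketch, not part of the original analysis. It assumes the column names from the CSV header and uses three rows copied from the notebook's own outputs as stand-in data; the tolerance of 0.005 is an assumption meant to absorb rounding to three decimals.

```python
import pandas as pd

# Stand-in rows copied from the notebook's outputs; the real data comes from the CSV.
sample = pd.DataFrame({
    "Neg [0...1]": [0.247, 0.044, 0.436],
    "Neut [0...1]": [0.285, 0.166, 0.220],
    "Pos [0...1]": [0.468, 0.790, 0.343],
})

# Each row's three proportions should add up to about 1.
row_totals = sample[["Neg [0...1]", "Neut [0...1]", "Pos [0...1]"]].sum(axis=1)

# Assumed tolerance: the published values are rounded to three decimals.
rows_ok = (row_totals - 1).abs() <= 0.005
print(rows_ok.all())
```

Rows that fail a check like this would be worth inspecting before computing any derived sentiment score.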



In [None]:
# FOR GOOGLE COLAB ONLY.
# Run the code below. A dialog will appear to upload files.
# Upload 'emoji-sentiment.csv'.

from google.colab import files
uploaded = files.upload()

Saving emoji-sentiment.csv to emoji-sentiment.csv


In [None]:
import pandas as pd
df = pd.read_csv('emoji-sentiment.csv')
df.head(3)

Unnamed: 0,Char,Image [twemoji],Unicode codepoint,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1],Sentiment bar (c.i. 95%),Unicode name,Unicode block
0,😂,😂,0x1f602,14622,0.805,0.247,0.285,0.468,,FACE WITH TEARS OF JOY,Emoticons
1,❤,❤,0x2764,8050,0.747,0.044,0.166,0.79,,HEAVY BLACK HEART,Dingbats
2,♥,♥,0x2665,7144,0.754,0.035,0.272,0.693,,BLACK HEART SUIT,Miscellaneous Symbols


### Project Ideas:

Data Cleaning:
- Remove unnecessary columns that are not useful for your analysis.

- Rename the remaining columns using `snake_case` (all lowercase letters with underscores between words).

New Variables:
- Add a new column called `sentiment`, where sentiment = (proportion of positive tweets) - (proportion of negative tweets).

- Add a `positive_flag` column that is `True` if `sentiment > 0` (or above a set threshold), otherwise `False`.

Types of questions you can now answer more easily:
- What percentage of emojis in the dataset have a positive sentiment?

- What percentage of the top 20 most popular emojis are positive?

- Which emoji (with more than 500 mentions) is the most positive?

- Which emoji (with more than 500 mentions) is the most negative?

- Where in the tweets are most emojis located (i.e. at the beginning or the end)?

- Is there a difference in the placement of positive versus negative emojis within a tweet?

## Remove unnecessary columns that are not useful for your analysis

In [None]:
df = df.drop(columns=['Char', 'Image [twemoji]', 'Unicode codepoint', 'Sentiment bar (c.i. 95%)'])
display(df.head())

Unnamed: 0,Occurrences [5...max],Position [0...1],Neg [0...1],Neut [0...1],Pos [0...1],Unicode name,Unicode block
0,14622,0.805,0.247,0.285,0.468,FACE WITH TEARS OF JOY,Emoticons
1,8050,0.747,0.044,0.166,0.79,HEAVY BLACK HEART,Dingbats
2,7144,0.754,0.035,0.272,0.693,BLACK HEART SUIT,Miscellaneous Symbols
3,6359,0.765,0.052,0.219,0.729,SMILING FACE WITH HEART-SHAPED EYES,Emoticons
4,5526,0.803,0.436,0.22,0.343,LOUDLY CRYING FACE,Emoticons


## Rename the remaining columns using snake_case

In [None]:
df = df.rename(columns={
    'Occurrences [5...max]': 'occurrences',
    'Position [0...1]': 'position',
    'Neg [0...1]': 'neg',
    'Neut [0...1]': 'neut',
    'Pos [0...1]': 'pos',
    'Unicode name': 'unicode_name',
    'Unicode block': 'unicode_block'
})
display(df.head())

Unnamed: 0,occurrences,position,neg,neut,pos,unicode_name,unicode_block
0,14622,0.805,0.247,0.285,0.468,FACE WITH TEARS OF JOY,Emoticons
1,8050,0.747,0.044,0.166,0.79,HEAVY BLACK HEART,Dingbats
2,7144,0.754,0.035,0.272,0.693,BLACK HEART SUIT,Miscellaneous Symbols
3,6359,0.765,0.052,0.219,0.729,SMILING FACE WITH HEART-SHAPED EYES,Emoticons
4,5526,0.803,0.436,0.22,0.343,LOUDLY CRYING FACE,Emoticons
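
An explicit mapping works well for a handful of columns. As an alternative, the renaming could be derived from the labels themselves. The helper below is a hypothetical sketch (the name `to_snake_case` is not from the original notebook); it assumes that any bracketed suffix such as `[0...1]` is a range annotation that can be dropped.

```python
import re

def to_snake_case(name: str) -> str:
    """Convert a column label such as 'Position [0...1]' to 'position'."""
    name = re.sub(r"\s*\[.*?\]", "", name)      # drop the bracketed range suffix
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)  # runs of other characters become underscores
    return name.strip("_").lower()

original = ["Occurrences [5...max]", "Position [0...1]", "Neut [0...1]", "Unicode name"]
renamed = [to_snake_case(label) for label in original]
print(renamed)
```

Since `DataFrame.rename` accepts a function for `columns`, a helper like this could be applied as `df.rename(columns=to_snake_case)`.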


## New Variables

In [None]:
df["sentiment"] = df["pos"] - df["neg"]
df["positive_flag"] = df["sentiment"] > 0
display(df.head())

Unnamed: 0,occurrences,position,neg,neut,pos,unicode_name,unicode_block,sentiment,positive_flag
0,14622,0.805,0.247,0.285,0.468,FACE WITH TEARS OF JOY,Emoticons,0.221,True
1,8050,0.747,0.044,0.166,0.79,HEAVY BLACK HEART,Dingbats,0.746,True
2,7144,0.754,0.035,0.272,0.693,BLACK HEART SUIT,Miscellaneous Symbols,0.658,True
3,6359,0.765,0.052,0.219,0.729,SMILING FACE WITH HEART-SHAPED EYES,Emoticons,0.677,True
4,5526,0.803,0.436,0.22,0.343,LOUDLY CRYING FACE,Emoticons,-0.093,False
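
The project ideas mention that `positive_flag` could also use a threshold other than 0. One way to sketch that is shown below. The function name `add_sentiment_columns` and the example threshold of 0.5 are illustrative assumptions, and the sample rows are copied from the output above rather than loaded from the CSV.

```python
import pandas as pd

def add_sentiment_columns(frame: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Return a copy with 'sentiment' and 'positive_flag' columns added."""
    out = frame.copy()
    out["sentiment"] = out["pos"] - out["neg"]
    # An emoji counts as positive only if its score exceeds the chosen threshold.
    out["positive_flag"] = out["sentiment"] > threshold
    return out

# Stand-in rows copied from the notebook output above.
sample = pd.DataFrame({
    "neg": [0.247, 0.044, 0.035, 0.052, 0.436],
    "pos": [0.468, 0.790, 0.693, 0.729, 0.343],
})

default_flags = add_sentiment_columns(sample)["positive_flag"].tolist()
strict_flags = add_sentiment_columns(sample, threshold=0.5)["positive_flag"].tolist()
print(default_flags)
print(strict_flags)
```

With a stricter threshold, mildly positive emojis such as FACE WITH TEARS OF JOY (sentiment 0.221) would no longer be flagged.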


## 1. What percentage of emojis in the dataset have a positive sentiment?

In [None]:
positive_sentiment = df['sentiment'] > 0
percentage_positive = (positive_sentiment.sum() / len(df)) * 100
print(f"Percentage of emojis with a positive sentiment: {percentage_positive:.2f}%")

Percentage of emojis with a positive sentiment: 82.42%


## 2. What percentage of the top 20 most popular emojis are positive?

In [None]:
popular_emojis = df.sort_values(by='occurrences', ascending=False).head(20)
percentage_positive = (popular_emojis['sentiment'] > 0).mean() * 100
print(f"Percentage of the top 20 most popular emojis that are positive: {percentage_positive:.2f}%")

Percentage of the top 20 most popular emojis that are positive: 90.00%


## 3. Which emoji (with more than 500 mentions) is the most positive?



In [None]:
# Keep only emojis with more than 500 occurrences, then pick the highest sentiment score.
frequent_emojis = df[df['occurrences'] > 500]
most_positive_emoji = frequent_emojis.loc[frequent_emojis['sentiment'].idxmax()]
print(f"The most positive emoji with more than 500 mentions is: {most_positive_emoji['unicode_name']}")

The most positive emoji with more than 500 mentions is: HEAVY BLACK HEART


## 4. Which emoji (with more than 500 mentions) is the most negative?

In [None]:
# Keep only emojis with more than 500 occurrences, then pick the lowest sentiment score.
frequent_emojis = df[df['occurrences'] > 500]
most_negative_emoji = frequent_emojis.loc[frequent_emojis['sentiment'].idxmin()]
print(f"The most negative emoji with more than 500 mentions is: {most_negative_emoji['unicode_name']}")

The most negative emoji with more than 500 mentions is: UNAMUSED FACE
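
Looking at a few emojis at each extreme, rather than a single one, can show how close the runners-up are. The sketch below uses `nlargest` and `nsmallest` on ten rows copied from outputs shown in this notebook. That subset is an assumption made for illustration, so the second and third places may differ on the full dataset.

```python
import pandas as pd

# Ten rows copied from this notebook's outputs; not the full dataset.
sample = pd.DataFrame({
    "unicode_name": [
        "FACE WITH TEARS OF JOY", "HEAVY BLACK HEART", "BLACK HEART SUIT",
        "SMILING FACE WITH HEART-SHAPED EYES", "LOUDLY CRYING FACE",
        "FACE THROWING A KISS", "WEARY FACE", "UNAMUSED FACE",
        "PENSIVE FACE", "FULL BLOCK",
    ],
    "occurrences": [14622, 8050, 7144, 6359, 5526, 3648, 1808, 1385, 1205, 798],
    "sentiment": [0.221, 0.746, 0.658, 0.677, -0.093, 0.701, -0.368, -0.374, -0.146, -0.033],
})

# Same filter as in the cells above.
frequent = sample[sample["occurrences"] > 500]

top3 = frequent.nlargest(3, "sentiment")["unicode_name"].tolist()
bottom3 = frequent.nsmallest(3, "sentiment")["unicode_name"].tolist()
print(top3)
print(bottom3)
```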


## 5. Where in the tweets are most emojis located (i.e. at the beginning or the end)?

In [None]:
# Unweighted mean of the per-emoji average positions (every emoji counts equally).
avg_position = df['position'].mean()

if avg_position < 0.5:
    print(f"On average, emojis are located towards the beginning of tweets (average position: {avg_position:.3f}).")
elif avg_position > 0.5:
    print(f"On average, emojis are located towards the end of tweets (average position: {avg_position:.3f}).")
else:
    print(f"On average, emojis are located in the middle of tweets (average position: {avg_position:.3f}).")

On average, emojis are located towards the end of tweets (average position: 0.666).
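
The mean above treats every emoji equally, regardless of how often it appears. If the question is about where emoji occurrences tend to fall, weighting each emoji's position by its `occurrences` may be a closer fit. The sketch below shows that variant on five rows copied from the earlier output; the weighting itself is an assumption about how to read the question, not part of the original analysis.

```python
import pandas as pd

# Stand-in rows copied from the notebook output; the real analysis would use df.
sample = pd.DataFrame({
    "occurrences": [14622, 8050, 7144, 6359, 5526],
    "position": [0.805, 0.747, 0.754, 0.765, 0.803],
})

# Every emoji counts once.
unweighted = sample["position"].mean()

# Every occurrence counts once: frequent emojis pull the average toward their position.
weighted = (sample["position"] * sample["occurrences"]).sum() / sample["occurrences"].sum()

print(f"Unweighted mean position: {unweighted:.3f}")
print(f"Occurrence-weighted mean position: {weighted:.3f}")
```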


## 6. Is there a difference in the placement of positive versus negative emojis within a tweet?

In [None]:
positive_emojis_df = df[df['positive_flag']]
display(positive_emojis_df.head())

Unnamed: 0,occurrences,position,neg,neut,pos,unicode_name,unicode_block,sentiment,positive_flag
0,14622,0.805,0.247,0.285,0.468,FACE WITH TEARS OF JOY,Emoticons,0.221,True
1,8050,0.747,0.044,0.166,0.79,HEAVY BLACK HEART,Dingbats,0.746,True
2,7144,0.754,0.035,0.272,0.693,BLACK HEART SUIT,Miscellaneous Symbols,0.658,True
3,6359,0.765,0.052,0.219,0.729,SMILING FACE WITH HEART-SHAPED EYES,Emoticons,0.677,True
5,3648,0.854,0.053,0.193,0.754,FACE THROWING A KISS,Emoticons,0.701,True


In [None]:
negative_emojis_df = df[~df['positive_flag']]
display(negative_emojis_df.head())

Unnamed: 0,occurrences,position,neg,neut,pos,unicode_name,unicode_block,sentiment,positive_flag
4,5526,0.803,0.436,0.22,0.343,LOUDLY CRYING FACE,Emoticons,-0.093,False
14,1808,0.826,0.591,0.186,0.223,WEARY FACE,Emoticons,-0.368,False
23,1385,0.858,0.591,0.192,0.217,UNAMUSED FACE,Emoticons,-0.374,False
27,1205,0.866,0.464,0.219,0.318,PENSIVE FACE,Emoticons,-0.146,False
39,798,0.634,0.09,0.853,0.057,FULL BLOCK,Block Elements,-0.033,False


In [None]:
avg_position_positive = positive_emojis_df['position'].mean()
avg_position_negative = negative_emojis_df['position'].mean()

print(f"Average position of positive emojis: {avg_position_positive:.3f}")
print(f"Average position of negative emojis: {avg_position_negative:.3f}")

if avg_position_positive > avg_position_negative:
    print("Positive emojis are, on average, located further towards the end of tweets than negative emojis.")
elif avg_position_negative > avg_position_positive:
    print("Negative emojis are, on average, located further towards the end of tweets than positive emojis.")
else:
    print("The average positions of positive and negative emojis are similar.")

Average position of positive emojis: 0.662
Average position of negative emojis: 0.681
Negative emojis are, on average, located further towards the end of tweets than positive emojis.
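
The gap between the two means is small, so it can help to see the group sizes and medians next to them before drawing a conclusion. A possible sketch using `groupby` is shown below. The `group` column name and the ten sample rows (copied from the outputs above) are assumptions for illustration, and the numbers it prints describe only that subset.

```python
import pandas as pd

# Five positive and five negative rows copied from the outputs above.
sample = pd.DataFrame({
    "position": [0.805, 0.747, 0.754, 0.765, 0.854, 0.803, 0.826, 0.858, 0.866, 0.634],
    "sentiment": [0.221, 0.746, 0.658, 0.677, 0.701, -0.093, -0.368, -0.374, -0.146, -0.033],
})

# Label each emoji by the sign of its sentiment score.
sample["group"] = sample["sentiment"].gt(0).map({True: "positive", False: "negative"})

# Count, mean, and median position per group in one table.
summary = sample.groupby("group")["position"].agg(["count", "mean", "median"])
print(summary)
```

A large difference in `count` between the groups, or a median that disagrees with the mean, would suggest treating the comparison of averages with some caution.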
