## Title: Unsupervised Sentiment Analysis with Real-World Data: 500,000 Tweets on Elon Musk

#### Group Member Names: Aaditya Trivedi:- 200483263

### INTRODUCTION:

The project, titled "Unsupervised Sentiment Analysis with Real-World Data: 500,000 Tweets on Elon Musk", analyzes a large dataset of tweets about Elon Musk to understand public sentiment and to address the challenges of sentiment analysis on real-world social media data.

#### AIM:

To develop and implement an unsupervised sentiment analysis method that effectively processes a large dataset of 500,000 tweets about Elon Musk, overcoming the limitations of previous approaches by addressing issues such as lack of labeled data, data noise, complexity in sentiment analysis, and potential biases.

*********************************************************************************************************************

#### Github Repo:

[Link to be added]

*********************************************************************************************************************

#### DESCRIPTION OF PAPER:

The paper discusses the development of an unsupervised sentiment analysis approach tailored to handle large-scale, real-world data from social media platforms. The analysis focuses on tweets mentioning Elon Musk, leveraging natural language processing techniques such as text preprocessing, word input, and topic modeling. The approach is designed to manage unstructured data, address noise, and capture the nuanced sentiment expressed in tweets, particularly considering sarcasm and subjective language.

*********************************************************************************************************************

#### PROBLEM STATEMENT:

The project addresses key challenges in sentiment analysis, particularly with respect to large datasets from social media. These challenges include the lack of labeled data, the presence of noise such as typos and slang, the complexity of analyzing sentiment expressed through sarcasm and subjective language, and the limitations of previous sentiment analysis methods that fail to fully capture the context and biases present in the data.

*********************************************************************************************************************

#### CONTEXT OF THE PROBLEM:

The unsupervised sentiment analysis of 500,000 tweets about Elon Musk is crucial for gaining insights into public sentiment, enhancing real-world data analysis techniques, and identifying relevant issues and conflicts. This analysis provides valuable input for businesses, investors, and decision-makers, and contributes to the development of more accurate and fair sentiment analysis methodologies that mitigate inherent biases.

*********************************************************************************************************************

#### SOLUTION:

The solution proposed in this project involves the use of an unsupervised sentiment analysis method that does not require manual data labeling. By applying advanced natural language processing techniques, the method effectively handles noisy and unstructured social media data, captures nuanced expressions of sentiment, and reduces the biases associated with subjective interpretations. This approach offers a scalable and robust alternative to traditional sentiment analysis methods, providing more accurate insights from large datasets.

# Background

#### Related Work

| Reference | Explanation | Dataset/Input | Weakness |
|-----------|-------------|---------------|----------|
| [1]       | A deep learning-based approach combining LSTM and attention mechanism to perform sentiment analysis on the "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle. | "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle | Limited labeled data, potential biases in Twitter data, and lack of ground truth for sentiment labels |
| [2]       | A transfer learning-based approach using pre-trained word embeddings and CNNs for sentiment analysis on the "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle. | "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle | Possible noise in Twitter data, limited interpretability of CNNs for sentiment analysis |
| [3]       | A rule-based approach with handcrafted features and lexical resources to perform sentiment analysis on the "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle. | "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle | Reliance on manual feature engineering, potential bias in handcrafted features, limited adaptability to different domains |
| [This Paper] | Unsupervised sentiment analysis leveraging natural language processing techniques like text preprocessing, word input, and topic modeling to analyze the "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle. | "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset from Kaggle | Future improvements include refining the model to better handle context and sarcasm, as well as enhancing the scalability and accuracy of sentiment classification across different social media platforms. |

# Methodology

This work builds on prior research on sentiment analysis of social media text and adapts it to the Kaggle dataset "500,000 Tweets on Elon Musk (Nov-Dec 2022)". The implementation below is lexicon-based and needs no labeled data: tweets are cleaned (links, mentions, and hashtags stripped, spam removed, contractions expanded) and then scored with VADER and TextBlob. The two polarity scores are compared over the whole dataset and by topic, mentioned account, and trigram.
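Lexicon-based scorers such as the VADER and TextBlob analyzers used in the Implementation section assign each known word a polarity and combine the word scores into one value per text. The sketch below illustrates only that core idea. The lexicon entries and the `toy_polarity` function are hypothetical, and the real analyzers add rules for negation, intensifiers, punctuation, and more.

```python
# Toy lexicon: word -> polarity in [-1, 1].
# These entries are illustrative assumptions, not values from VADER or TextBlob.
TOY_LEXICON = {"love": 0.75, "great": 0.5, "hate": -0.75, "terrible": -0.5}

def toy_polarity(text: str) -> float:
    """Average the lexicon scores of the words found in `text` (0.0 if none match)."""
    words = text.lower().split()
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_polarity("I love this great idea"))  # 0.625
print(toy_polarity("nothing to see here"))     # 0.0
```

A text with no lexicon words scores exactly 0.0, which is one reason lexicon-based scorers tend to produce a large spike at zero polarity on short texts.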

# Implementation

# Installing libraries

In [1]:
!pip install wordcloud
!pip install emot
!pip install TextBlob
!pip install contractions
!pip install chart-studio


# Importing important libraries

In [66]:
import chart_studio
import re
import string
import emot
import collections
import ipywidgets
import contractions
import cufflinks
import nltk.tokenize

import chart_studio.plotly as py
import chart_studio.tools as tls
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import numpy as np

from textblob import TextBlob
from google.colab import widgets
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

In [67]:
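import os

username = 'Aaditya1304'
# Read the Chart Studio API key from the environment instead of hard-coding it.
# The variable name CHART_STUDIO_API_KEY is a placeholder; use whatever name you export.
api_key = os.environ['CHART_STUDIO_API_KEY']

chart_studio.tools.set_credentials_file(username=username,
                                        api_key=api_key)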

pd.set_option('display.max_colwidth', None)

nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

sid = SentimentIntensityAnalyzer()
emot_obj = emot.core.emot()

cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')

pio.renderers.default = 'colab'

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


# First look at the data

In [68]:
df = pd.read_json('data_503986.json')
df.sample(n=10)

Unnamed: 0,id,text
304162,1602812995452674051,"@elonmusk How much debt did you acfrus for twitter when you ""bought"" it?"
130826,1600369301080313857,"-Oh, what a tangled web they weave, when first they practice to...\n\n-Oh, how beautiful this tangled web is! People love it, don't they? https://t.co/39Z0xJuqSM"
190802,1601368559036211201,But some “ dead “ reincarnate faster than we think . Ask Elon Musk . The tweets can be investigated and find out if The Bank dynasty member has reincarnated yet. https://t.co/a87Xa6wQXp
353067,1603565314607915014,@Content_Retired @elonmusk Yet the femboy bootlickers will still find a way to support him and keep on bootlicking!! 🤮 🤢 🤮
467448,1605296013295734785,"@quadcarl_carl @elonmusk @RepAdamSchiff And, dOeS tHeIr OwN rEsEaRcH"
165214,1600803483271454722,@VincentChan001 @TeslaBull10T @Teslaconomics @elonmusk So did Egon Spengler. Except Spengler actually made a difference.
453081,1605141910548631555,"@elonmusk I know your all about free speech, can we address free listen, watch or read? Meaning can we have a filter to remove negative political non sense, some of us do not want to read, watch or hear about it, we just don't care and it is negative."
398500,1604527791243808768,Funny listening to all these hollies w perceived journalistic ability cry foul when being bounced by the bird @elonmusk #taylorlorenz what about me &amp; my voice?! Doesn’t matter that my voice was silenced by the #LiberalHypocrisy not once but THREE times! #BouncedByTheBird
412790,1604645090240118786,@ginnygmc Take the poll. https://t.co/ruDHfrtFdO
467136,1605290944513085440,@Mojtabapacino @forparisaa a major disruption to internet service in #Iran as mobile internet is cut off for many users\nWe Need VPN or Proxy !\n@SpaceX @elonmusk @YourAnonOne @anonymousopiran @EdaalateAli1400 @CNN\n@PahlaviReza @alikarimi_ak8\n#OpIran‌‌ \n#MahsaAmini \n#Starlink


## Pre-processing

In [69]:
def pre_process(text):
    # Remove links
    text = re.sub(r'http\S+', '', text)

    # Decode HTML entities left in the tweet text (the trailing ';' is consumed too)
    text = re.sub(r'&amp;?', 'and', text)
    text = re.sub(r'&lt;?', '<', text)
    text = re.sub(r'&gt;?', '>', text)

    # Replace line breaks with a space
    text = re.sub(r'[\r\n]+', ' ', text)

    # Remove @mentions and #hashtags
    text = re.sub(r'@\w+', '', text)
    text = re.sub(r'#\w+', '', text)

    # Collapse runs of whitespace
    text = re.sub(r'\s+', ' ', text)

    text = text.lower()
    return text

In [70]:
df['processed_text'] = df['text'].apply(pre_process)

In [71]:
df.sample(n=5)

Unnamed: 0,id,text,processed_text
236409,1601740953211240448,"@elonmusk @Twitter Man, you had $44B but you didn't afford a graphic designer?","man, you had $44b but you didn't afford a graphic designer?"
497949,1605678076897394689,This is actually one of the reasons why green is @BillNye's favorite color. https://t.co/3d1JzKNEGY,this is actually one of the reasons why green is 's favorite color.
243137,1601976255636815875,"@elonmusk dear Elon, I believe you may have misspelled your pronouns. I believe the correct spelling is Execute/Fauci.....kidding! (sorta).","dear elon, i believe you may have misspelled your pronouns. i believe the correct spelling is execute/fauci.....kidding! (sorta)."
223436,1601574834185842688,"@elonmusk @ZacksJerryRig If your point is - Twitter could use more better, more transparent governance- u made the point. but u do realize - ur beginning to legitimize “stolen election” talking point WHY??????!","if your point is - twitter could use more better, more transparent governance- u made the point. but u do realize - ur beginning to legitimize “stolen election” talking point why??????!"
37165,1598854420757811200,"@MarcTheBulll @elonmusk @hodgetwins @micsolana That’s exactly what he just said, Einstein. 🤷🏼‍♀️","that’s exactly what he just said, einstein. 🤷🏼‍♀️"


## Implement Code

## N-Grams


In [72]:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['processed_text'], 20)

df1 = pd.DataFrame(common_words, columns = ['TweetText' , 'count'])
df1.groupby('TweetText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar',
    yTitle='Count',
    linecolor='black',
    title='Top 20 bigrams in tweets before removing spam')
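To make the counting step concrete, here is a minimal standard-library sketch of bigram counting on a toy corpus. It is a simplified stand-in for `CountVectorizer(ngram_range=(2, 2), stop_words='english')`, not its implementation: the `count_bigrams` helper, the toy corpus, and the two-word stop list are assumptions, and tokens are split on whitespace only. Like `CountVectorizer`, it drops stop words before pairing neighbours, which is how a phrase such as "freedom of speech" ends up counted as the bigram "freedom speech".

```python
from collections import Counter

def count_bigrams(texts, stop_words=frozenset()):
    """Count adjacent word pairs after dropping stop words."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stop_words]
        # zip pairs each token with its right-hand neighbour
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts

toy_corpus = [
    "free speech matters",
    "free speech is free",
    "hate speech is not free speech",
]
top = count_bigrams(toy_corpus, stop_words={"is", "not"}).most_common(2)
print(top)  # [('free speech', 3), ('speech free', 2)]
```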

## Removing spams

In [73]:
to_drop = ["LP LOCKED", "accumulated 1 ETH","This guy accumulated over $100K", "help me sell a nickname", "As A Big Fuck You To The SEC", "Wanna be TOP G", "#walv", "#NFTProject", "#1000xgem", "$GALI", "NFT", "What the Soul of USA is", "#BUSD", "$FXMS", "#fxms", "#Floki", "#FLOKIXMAS", "#memecoin", "#lowcapgem", "#frogxmas", "Xmas token", "crypto space", "Busd Rewards", "TRUMPLON", "NO PRESALE", "#MIKOTO", "$HATI", "$SKOLL", "#ebaydeals", "CHRISTMAS RABBIT", "@cz_binance", "NFT Airdrop", "#NFT"]

In [74]:
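# Escape each marker so that regex metacharacters such as '$' in '$GALI' match literally
df = df[~df['text'].str.contains('|'.join(map(re.escape, to_drop)))]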
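Because `str.contains` treats its pattern as a regular expression by default, markers such as `$GALI` match literally only when they are escaped. The sketch below shows the difference with the standard `re` module on a made-up tweet. The two markers come from the `to_drop` list, while the tweet text and variable names are assumptions for illustration.

```python
import re

spam_markers = ["$GALI", "NFT Airdrop"]  # two entries taken from to_drop
tweet = "Buy $GALI now before it moons"  # hypothetical spam tweet

# Unescaped: '$' is an end-of-string anchor, so '$GALI' can never match
raw_pattern = "|".join(spam_markers)
# Escaped: '$' is treated as a literal dollar sign
safe_pattern = "|".join(map(re.escape, spam_markers))

print(bool(re.search(raw_pattern, tweet)))   # False
print(bool(re.search(safe_pattern, tweet)))  # True
```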

## Expanding contractions + drop ID

In [75]:
def expand_contractions(text):
    # Fall back to the original text if the contractions library fails on an input
    try:
        return contractions.fix(text)
    except Exception:
        return text

In [76]:
df['expanded_text'] = df['text'].apply(expand_contractions)
df['processed_text'] = df['expanded_text'].apply(pre_process)
df = df.drop('id', axis=1)

In [77]:
df.sample(n=3)

Unnamed: 0,text,processed_text,expanded_text
388069,@whyisfreedomdy1 @elonmusk shaming like this is ridiculous... feel like not enough is getting done in YOUR community then DO MORE,shaming like this is ridiculous... feel like not enough is getting done in your community then do more,@whyisfreedomdy1 @elonmusk shaming like this is ridiculous... feel like not enough is getting done in YOUR community then DO MORE
38585,@JonathanLepick @MunkSwe88 @brearley103 @nessiejones22 @EmmanuelMacron @elonmusk I would also think that Europeans are more liberal than Americans so allowing ‘unfettered’ free speech may not be broadly popular. Time will tell,i would also think that europeans are more liberal than americans so allowing ‘unfettered’ free speech may not be broadly popular. time will tell,@JonathanLepick @MunkSwe88 @brearley103 @nessiejones22 @EmmanuelMacron @elonmusk I would also think that Europeans are more liberal than Americans so allowing ‘unfettered’ free speech may not be broadly popular. Time will tell
238208,"@elonmusk @SenatorSinema I'm not happy, you'd better come to Shanghai to accompany me.","i am not happy, you would better come to shanghai to accompany me.","@elonmusk @SenatorSinema I am not happy, you would better come to Shanghai to accompany me."


## Bi-grams without spam, stop words and Elon Musk

In [78]:
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_bigram(df['processed_text'], 21)
common_words = common_words[1:]

for word, freq in common_words:
    print(word, freq)

df4 = pd.DataFrame(common_words, columns = ['TweetText' , 'count'])
df4.groupby('TweetText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', title='Top 20 bigrams in tweets after removing spam')

free speech 10228
social media 3031
twitter files 2752
hate speech 1898
freedom speech 1888
just like 1850
right wing 1715
looks like 1650
hunter biden 1619
musk twitter 1527
people like 1382
sounds like 1208
mr musk 1177
does mean 1120
real time 1068
donald trump 1029
thank elon 942
like elon 913
twitter ceo 897
did know 890


## Tri-grams without spam and stop words

In [79]:
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_trigram(df['processed_text'], 20)

for word, freq in common_words:
    print(word, freq)

df6 = pd.DataFrame(common_words, columns = ['TweetText' , 'count'])
df6.groupby('TweetText').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar',
    yTitle='Count',
    linecolor='black',
    title='The 20 most frequent tri-grams in the dataset (without stop words and spam)')

elon musk twitter 1365
elon musk says 618
hunter biden laptop 557
twitter elon musk 520
like elon musk 508
elon musk just 507
social media platform 367
thank elon musk 339
real time location 298
free speech absolutist 276
think elon musk 275
mentioned million times 252
million times muted 252
elon musk neuralink 247
richest man world 243
free speech twitter 240
elon need help 239
elon musk does 232
need help maybe 228
help maybe funny 228


## Emojis

In [86]:
# Define a function to extract emoticons
def extract_emoticons(text):
  res = emot_obj.emoji(text)
  return res['value']

In [87]:
# Apply the function to each row of the 'text' column
df['emoticons'] = df['text'].apply(extract_emoticons)

Issue with internal pattern finding emoji: '🇬'


In [88]:
# Aggregate emoji counts across all tweets in a single pass
combined_counts = collections.Counter()
for emojis in df['emoticons']:
    combined_counts.update(emojis)
# most_common() returns (emoji, count) pairs sorted by descending count
sorted_emoji_dict = dict(combined_counts.most_common())

In [89]:
d = {k: v for i, (k, v) in enumerate(sorted_emoji_dict.items()) if i < 20}
df_emojis = pd.DataFrame(list(d.items()), columns=['Emojis', 'Count'])
df_emojis.at[5, 'Emojis'] = '❤️'
df_emojis.at[6, 'Emojis'] = '🤡'

In [90]:
df_emojis.groupby('Emojis').sum()['Count'].sort_values(ascending=False).iplot(
    kind='bar',xTitle='Emojis', yTitle='Count', linecolor='black', title='The 20 most used emojis after removing spam')

# Unsupervised Sentiment Analysis

## Getting polarities

In [91]:
df['vader_polarity'] = df['processed_text'].map(lambda text: sid.polarity_scores(text)['compound'])
df['blob_polarity'] = df['processed_text'].map(lambda text: TextBlob(text).sentiment.polarity)

In [92]:
new_df = df[['vader_polarity', 'blob_polarity']]
new_df = new_df.rename(columns={'vader_polarity': 'Vader', 'blob_polarity': 'TextBlob'})

## Comparison between Vader and TextBlob

In [93]:
new_df.iplot(
    kind='hist',
    bins=40,
    xTitle='Polarity',
    linecolor='black',
    yTitle='Count',
    title='Comparison of the distributions of sentiment polarities',
    colors = ['#1DA1F2', '#EB8C17'],
    barmode="group")

## Stats

In [94]:
new_df.describe()

Unnamed: 0,Vader,TextBlob
count,487047.0,487047.0
mean,0.040615,0.067221
std,0.456807,0.289077
min,-0.9996,-1.0
25%,-0.296,0.0
50%,0.0,0.0
75%,0.4019,0.2
max,0.9982,1.0
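The statistics above are continuous polarity scores. To read them as classes, one common convention for VADER's compound score treats values of at least 0.05 as positive and values of at most -0.05 as negative. The sketch below applies that convention. The ±0.05 cut-off is an assumption brought in from common VADER usage and is not used elsewhere in this notebook, and `label_from_compound` is a hypothetical helper.

```python
def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a compound polarity score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Quartiles of the Vader column reported by describe() above
for score in (-0.296, 0.0, 0.4019):
    print(score, label_from_compound(score))
```

With this cut-off, the 25th percentile falls in the negative class, the median in the neutral class, and the 75th percentile in the positive class.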


## Topic analysis

In [95]:
# A set gives constant-time membership checks when filtering every word of every tweet
stop_words = set(nltk.corpus.stopwords.words('english'))

def remove_stop_words(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

df['stop_text'] = df['processed_text'].apply(lambda x: remove_stop_words(x))

In [96]:
# We define a list of topics
topics = ['free speech',
          'hunter biden',
          'twitter files',
          'freedom speech', 
          'right wing',
          'donald trump']

# We create a new column Topic (a tweet matching several topics keeps the last match)
df['Topic'] = ""
for topic in topics:
    df.loc[df['stop_text'].str.contains(topic, regex=False), 'Topic'] = topic

# We create a new DataFrame with columns topic / sentiment / source
data = []
for topic in topics:
    topic_rows = df[df['Topic'] == topic]
    # Average sentiment per topic
    vader_sentiments = topic_rows['vader_polarity'].mean()
    textblob_sentiments = topic_rows['blob_polarity'].mean()
    # Append data
    data.append({'Topic': topic, 'Sentiment': vader_sentiments, 'Source': 'Vader'})
    data.append({'Topic': topic, 'Sentiment': textblob_sentiments, 'Source': 'TextBlob'})

df_new = pd.DataFrame(data)

# Plot the sentiment for each topic
fig = px.bar(df_new,
             x='Topic',
             y='Sentiment',
             color='Source',
             barmode='group',
             color_discrete_sequence = ['#1DA1F2', '#EB8C17'],
             title='Comparative sentiment analysis by topic',
             template='plotly_white')

fig.update_traces(marker_line_width=1,
                  marker_line_color="black")

fig.show()
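The loop above stores a single topic per tweet, so a tweet that mentions several topics is counted only under the last one that matches. As a possible alternative, the sketch below lets a tweet contribute to every topic it mentions. It uses plain Python on made-up rows, and `mean_sentiment_by_topic` and `toy_rows` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

def mean_sentiment_by_topic(rows, topics):
    """rows: iterable of (text, polarity); a row counts toward each topic it contains."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text, polarity in rows:
        for topic in topics:
            if topic in text:
                totals[topic] += polarity
                counts[topic] += 1
    # Topics with no matching row are left out instead of dividing by zero
    return {t: totals[t] / counts[t] for t in topics if counts[t]}

toy_rows = [
    ("free speech and donald trump", 0.5),
    ("free speech now", -0.25),
    ("donald trump tweets", 0.25),
]
result = mean_sentiment_by_topic(toy_rows, ["free speech", "donald trump", "twitter files"])
print(result)  # {'free speech': 0.125, 'donald trump': 0.375}
```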

## Personalities

In [97]:
usernames = ['@Tesla', '@TomFitton', '@FoxNews', '@realDonaldTrump' , '@TwitterSupport', '@nytimes']
# create a new column for the mentioned account (a tweet mentioning several keeps the last match)
df['Mention'] = ""
for username in usernames:
    df.loc[df['text'].str.contains(username, regex=False), 'Mention'] = username

# create a new dataframe with columns for username, sentiment, and sentiment source
data = []
for username in usernames:
    username_rows = df[df['Mention'] == username]
    # Average sentiment per mentioned account
    vader_sentiments = username_rows['vader_polarity'].mean()
    textblob_sentiments = username_rows['blob_polarity'].mean()
    data.append({'Mention': username, 'Sentiment': vader_sentiments, 'Source': 'Vader'})
    data.append({'Mention': username, 'Sentiment': textblob_sentiments, 'Source': 'TextBlob'})
df_new = pd.DataFrame(data)

# plot the sentiment for each username using Plotly
fig = px.bar(df_new,
             x='Mention',
             y='Sentiment',
             color='Source',
             barmode='group',
             color_discrete_sequence = ['#1DA1F2', '#EB8C17'],
             title='Comparative sentiment analysis by account',
             template='plotly_white')

fig.update_traces(marker_line_width=1,
                  marker_line_color="black")

fig.show()

## Trigrams

In [98]:
tri_grams = ['hunter biden laptop',
             'elon musk twitter',
             'real time location',
             'free speech absolutist',
             'free speech twitter']
# create a new column for the trigram; match against the lowercased, stop-word-free
# text, because the trigrams were extracted with stop words removed
df['Trigram'] = ""
for trigram in tri_grams:
    df.loc[df['stop_text'].str.contains(trigram, regex=False), 'Trigram'] = trigram

# create a new dataframe with columns for trigram, sentiment, and sentiment source
data = []
for trigram in tri_grams:
    trigram_rows = df[df['Trigram'] == trigram]
    # Average sentiment per trigram
    vader_sentiments = trigram_rows['vader_polarity'].mean()
    textblob_sentiments = trigram_rows['blob_polarity'].mean()
    data.append({'Trigram': trigram, 'Sentiment': vader_sentiments, 'Source': 'Vader'})
    data.append({'Trigram': trigram, 'Sentiment': textblob_sentiments, 'Source': 'TextBlob'})
df_new = pd.DataFrame(data)

# plot the sentiment for each username using Plotly
fig = px.bar(df_new,
             x='Trigram',
             y='Sentiment',
             color='Source',
             barmode='group',
             color_discrete_sequence = ['#1DA1F2', '#EB8C17'],
             title='Sentiment analysis of the most frequent tri-grams',
             template='plotly_white')

fig.update_traces(marker_line_width=1,
                  marker_line_color="black")

fig.show()

## Contribution code:

In [99]:
import pandas as pd
from textblob import TextBlob

# Load the dataset into a DataFrame
df = pd.read_json('/content/data_503986.json')

# Perform text sentiment analysis using TextBlob, scoring each tweet only once
sentiments = df['text'].apply(lambda x: TextBlob(x).sentiment)
df['sentiment_polarity'] = sentiments.apply(lambda s: s.polarity)
df['sentiment_subjectivity'] = sentiments.apply(lambda s: s.subjectivity)

# Define a function to categorize sentiment based on polarity score
def get_sentiment_category(score):
    if score > 0:
        return 'Positive'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Negative'

# Apply the sentiment category function to create a 'sentiment' column
df['sentiment'] = df['sentiment_polarity'].apply(get_sentiment_category)

# Calculate counts of different sentiments
sentiment_counts = df['sentiment'].value_counts()

# Print the sentiment counts
print('Sentiment Counts:')
print(sentiment_counts)


Sentiment Counts:
Neutral     199991
Positive    199961
Negative    104034
Name: sentiment, dtype: int64
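The raw counts are easier to compare as shares of the dataset. The short sketch below recomputes them from the counts printed above, and only the variable names are new.

```python
# Counts copied from the output above
sentiment_counts = {"Neutral": 199991, "Positive": 199961, "Negative": 104034}

total = sum(sentiment_counts.values())
shares = {label: round(100 * n / total, 2) for label, n in sentiment_counts.items()}

print(total)   # 503986
print(shares)  # {'Neutral': 39.68, 'Positive': 39.68, 'Negative': 20.64}
```

Roughly four in ten tweets are labeled neutral, four in ten positive, and two in ten negative by the TextBlob polarity sign.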


In [100]:
df

Unnamed: 0,id,text,sentiment_polarity,sentiment_subjectivity,sentiment
0,1596647314030231552,"@DonutOperator @elonmusk @stillgray It's fiery, mostly peaceful!🤷‍♂️",0.500000,0.500000,Positive
1,1596647313887346689,@SenMarkey @elonmusk Anti-freedom is anti-American,0.000000,0.000000,Neutral
2,1596647313719853056,@FoxNews Elon Musk voices support for Trump rival Ron DeSantis if Florida governor runs for president in 2024,0.000000,0.000000,Neutral
3,1596647313346215941,"@elonmusk @CollinRugg Having meetings about meetings and communicating across Vice Presidents, duh. https://t.co/FRod77kriA",-0.300000,0.600000,Negative
4,1596647312746754048,@GregA06555436 @elonmusk @TimRunsHisMouth Yes! We are all on a journey!,0.000000,0.000000,Neutral
...,...,...,...,...,...
503981,1615836342079954960,@DylanLeClair_ @elonmusk @joshhzimmer @stats_feed Still don't understand why Elon wouldn't just become a #Bitcoin maxi 🤔,0.000000,0.000000,Neutral
503982,1615836341660524564,"#pollen, #flu, #COVID19 They all look similar and all cause a #histamine (HIStiming) response. Some take #antihistamine (antiHIStamine) meds to relieve symptoms. Makes you wonder…@elonmusk @benshapiro @drsanjaygupta #Immunology #Immortality https://t.co/gApsLMwuqp",0.000000,0.400000,Neutral
503983,1615836340758904832,You people need to calm the fck down with the WEF conspiracy crapola. https://t.co/JVO0ZsFOoK,0.072222,0.519444,Positive
503984,1615836340490313759,@elonmusk I would love to know who voted yes on here,0.500000,0.600000,Positive


# Conclusion and Future Direction

*******************************************************************************************************************************
#### Learnings:
This project strengthened our skills in data analysis and natural language processing through work with real social media data. We applied lexicon-based sentiment scoring (VADER and TextBlob) together with n-gram and emoji frequency analysis, and saw how much cleaning noisy, informal Twitter data needs before it can be scored.

*******************************************************************************************************************************
#### Results Discussion:
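On the "500,000 Tweets on Elon Musk (Nov-Dec 2022)" dataset, both scorers place the average tweet close to neutral: after spam removal, the mean polarity over 487,047 tweets is 0.041 for VADER and 0.067 for TextBlob. Labeling the full 503,986 tweets by the sign of the TextBlob polarity gives 199,991 neutral, 199,961 positive, and 104,034 negative tweets. Because the dataset has no ground-truth labels, classification accuracy was not measured.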

*******************************************************************************************************************************
#### Limitations:
We encountered challenges such as the informal nature of Twitter language, the dependence on pre-built sentiment lexicons, and the lack of labeled data against which to validate the scores. These factors could affect the generalizability and accuracy of our sentiment analysis results.

*******************************************************************************************************************************
#### Future Extension:
Future work will focus on integrating domain-specific enhancements like specialized lexicons and advanced model stacking techniques. We also aim to minimize dataset biases and incorporate explainable AI practices to refine our methodology and improve the interpretability of our analysis results.

# References:
[1] Clément, D. (2021). Unsupervised Sentiment Analysis with Real-world Data: 500,000 Tweets on Elon Musk. Towards AI. Retrieved from https://pub.towardsai.net/unsupervised-sentiment-analysis-with-real-world-data-500-000-tweets-on-elon-musk-3f0653135558

[2] Susan, L. (2019). A Complete Exploratory Data Analysis and Visualization for Text Data. Towards Data Science. Retrieved from https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a