Load the tweet data from `tweets-data.csv` and draw a reproducible random sample of 500 rows (`random_state=42`) for analysis.



In [23]:
import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('tweets-data.csv')

# Sample 500 rows randomly from the DataFrame
df_sampled = df.sample(n=500, random_state=42)

# Display the first few rows of the sampled DataFrame
display(df_sampled.head())

Unnamed: 0.1,Unnamed: 0,Date Created,Number of Likes,Source of Tweet,Tweets,hashtag
2899,897,2023-06-25 11:06:23+00:00,2,,Le #DessinDePresse de Sanaga : ls sont morts c...,titan
594,594,2023-06-25 18:23:19+00:00,0,,#Russia #Wagner #RussiaCivilWar https://t.co/P...,wagner
2870,868,2023-06-25 11:32:00+00:00,1,,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,titan
52,52,2023-06-25 19:11:12+00:00,21,,Auch heute geht die politische Nachricht des T...,wagner
1391,390,2023-06-25 16:21:52+00:00,1,,@crazyclipsonly Same type that would take a ho...,titanic


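Define a function that returns a sentiment label and compound score using VADER. Compound scores of at least 0.05 are labelled POSITIVE, scores of at most -0.05 are labelled NEGATIVE, and anything in between is NEUTRAL.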



In [24]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# VADER needs its lexicon; this does nothing if it is already downloaded
nltk.download('vader_lexicon', quiet=True)

# Build the analyzer once rather than on every function call
analyzer = SentimentIntensityAnalyzer()

def get_vader_sentiment(tweet):
    """
    Calculates sentiment label and score using VADER.

    Args:
        tweet: The tweet text.

    Returns:
        A tuple containing the sentiment label and the compound score.
    """
    scores = analyzer.polarity_scores(tweet)
    compound_score = scores['compound']

    # Map the compound score to a label using the +/-0.05 thresholds
    if compound_score >= 0.05:
        sentiment_label = 'POSITIVE'
    elif compound_score <= -0.05:
        sentiment_label = 'NEGATIVE'
    else:
        sentiment_label = 'NEUTRAL'

    return sentiment_label, compound_score

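Define the function to clean tweet text for VADER (removing URLs, mentions, hashtags, and punctuation) and apply it to the DataFrame. Tweets that consist only of hashtags and links come out as empty strings.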



In [25]:
import re
import string

def clean_tweet_text(text):
    """
    Cleans tweet text by removing URLs, mentions, hashtags, punctuation,
    converting to lowercase, and removing extra whitespace.

    Args:
        text: The tweet text string.

    Returns:
        The cleaned tweet text string.
    """
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Remove hashtags
    text = re.sub(r'#\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'Tweets' column
df_sampled['cleaned_tweets'] = df_sampled['Tweets'].apply(clean_tweet_text)

# Display the first few rows with the new column
display(df_sampled[['Tweets', 'cleaned_tweets']].head())

Unnamed: 0,Tweets,cleaned_tweets
2899,Le #DessinDePresse de Sanaga : ls sont morts c...,le de sanaga ls sont morts comme ils ont vécu ...
594,#Russia #Wagner #RussiaCivilWar https://t.co/P...,
2870,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,exclusive content
52,Auch heute geht die politische Nachricht des T...,auch heute geht die politische nachricht des t...
1391,@crazyclipsonly Same type that would take a ho...,same type that would take a homemade playstati...


Apply the VADER sentiment analysis function to the cleaned tweets and store the results in new columns.



In [26]:
# Apply the get_vader_sentiment function to the 'cleaned_tweets' column
vader_results = df_sampled['cleaned_tweets'].apply(get_vader_sentiment)

# Create new columns for VADER sentiment label and score
df_sampled['vader sentiment label'] = vader_results.apply(lambda x: x[0])
df_sampled['vader sentiment score'] = vader_results.apply(lambda x: x[1])

# Display the first few rows with the original Tweets column and the new VADER sentiment columns
display(df_sampled[['Tweets', 'cleaned_tweets', 'vader sentiment label', 'vader sentiment score']].head())

Unnamed: 0,Tweets,cleaned_tweets,vader sentiment label,vader sentiment score
2899,Le #DessinDePresse de Sanaga : ls sont morts c...,le de sanaga ls sont morts comme ils ont vécu ...,NEUTRAL,0.0
594,#Russia #Wagner #RussiaCivilWar https://t.co/P...,,NEUTRAL,0.0
2870,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,exclusive content,POSITIVE,0.128
52,Auch heute geht die politische Nachricht des T...,auch heute geht die politische nachricht des t...,NEGATIVE,-0.5994
1391,@crazyclipsonly Same type that would take a ho...,same type that would take a homemade playstati...,NEUTRAL,0.0
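
The labels in the table above follow directly from the compound-score thresholds in `get_vader_sentiment`. As a minimal sketch, the threshold step can be isolated in a small helper and checked against three scores from that table. The name `label_from_compound` and its `threshold` parameter are hypothetical and not part of the notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label (hypothetical helper)."""
    if compound >= threshold:
        return 'POSITIVE'
    if compound <= -threshold:
        return 'NEGATIVE'
    return 'NEUTRAL'

# Scores taken from the output table above
for score in (0.128, -0.5994, 0.0):
    print(score, label_from_compound(score))
```

A score of exactly 0.0, which is what an empty cleaned tweet receives in the output above, always falls in the NEUTRAL band.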


Define a cleaning function for transformers that keeps hashtag text (the `#` symbol itself is still stripped along with other punctuation) and apply it to the 'Tweets' column to create a new column 'cleaned_tweets_transformers'. Then display the relevant columns to verify the cleaning.



In [27]:
import re
import string

def clean_tweet_text_transformer(text):
    """
    Cleans tweet text for transformer models by removing URLs, mentions,
    punctuation, converting to lowercase, and removing extra whitespace,
    while keeping hashtags.

    Args:
        text: The tweet text string.

    Returns:
        The cleaned tweet text string.
    """
    if not isinstance(text, str):
        return ""
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the cleaning function to the 'Tweets' column
df_sampled['cleaned_tweets_transformers'] = df_sampled['Tweets'].apply(clean_tweet_text_transformer)

# Display the first few rows with the original Tweets column and the new cleaned_tweets_transformers column
display(df_sampled[['Tweets', 'cleaned_tweets_transformers']].head())

Unnamed: 0,Tweets,cleaned_tweets_transformers
2899,Le #DessinDePresse de Sanaga : ls sont morts c...,le dessindepresse de sanaga ls sont morts comm...
594,#Russia #Wagner #RussiaCivilWar https://t.co/P...,russia wagner russiacivilwar
2870,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,exclusive content cosplay japan titan titanics...
52,Auch heute geht die politische Nachricht des T...,auch heute geht die politische nachricht des t...
1391,@crazyclipsonly Same type that would take a ho...,same type that would take a homemade playstati...
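
The two cleaning functions differ only in the hashtag step, so their effect is easiest to see side by side. The sketch below folds both into one function with a `keep_hashtags` flag. The name `clean_tweet` and the flag are hypothetical and used only for illustration; the notebook itself keeps two separate functions.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tweet(text, keep_hashtags=False):
    """Same steps as the notebook's cleaners, with the hashtag step optional."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r'http\S+', '', text)   # remove URLs
    text = re.sub(r'@\S+', '', text)      # remove mentions
    if not keep_hashtags:
        text = re.sub(r'#\S+', '', text)  # remove hashtags entirely
    text = text.translate(_PUNCT_TABLE)   # remove punctuation, including '#'
    text = text.lower()
    return re.sub(r'\s+', ' ', text).strip()

sample = "#Wagner #Russia"  # a hashtag-only tweet from the sample
print(repr(clean_tweet(sample)))                      # VADER-style cleaning
print(repr(clean_tweet(sample, keep_hashtags=True)))  # transformer-style cleaning
```

For a hashtag-only tweet, the first call returns an empty string while the second keeps the words, so the two models end up scoring different inputs for the same tweet.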


Import the `pipeline` function from the transformers library, create a sentiment analysis pipeline, and apply it to the transformer-cleaned tweets (capped at 512 characters), extracting the sentiment label and score into new columns. Finally, display the relevant columns to verify the results.



In [29]:
from transformers import pipeline

# Cap each tweet at 512 characters. The model's own limit is 512 tokens,
# not characters, so the pipeline call below also truncates at the token level.
max_sequence_length = 512

# Truncate the cleaned tweets to at most max_sequence_length characters
df_sampled['cleaned_tweets_truncated'] = df_sampled['cleaned_tweets_transformers'].apply(lambda x: x[:max_sequence_length])

# Create a sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Apply the pipeline to the truncated cleaned tweets
transformer_results = df_sampled['cleaned_tweets_truncated'].apply(lambda x: sentiment_pipeline(x, truncation=True)[0])

# Extract the sentiment label and score
df_sampled['sentiment label'] = transformer_results.apply(lambda x: x['label'])
df_sampled['sentiment score'] = transformer_results.apply(lambda x: x['score'])

# Display the first few rows with the original Tweets column and the new sentiment columns
display(df_sampled[['Tweets', 'cleaned_tweets_truncated', 'sentiment label', 'sentiment score']].head())

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


Unnamed: 0,Tweets,cleaned_tweets_truncated,sentiment label,sentiment score
2899,Le #DessinDePresse de Sanaga : ls sont morts c...,le dessindepresse de sanaga ls sont morts comm...,NEGATIVE,0.979517
594,#Russia #Wagner #RussiaCivilWar https://t.co/P...,russia wagner russiacivilwar,NEGATIVE,0.962062
2870,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,exclusive content cosplay japan titan titanics...,NEGATIVE,0.961531
52,Auch heute geht die politische Nachricht des T...,auch heute geht die politische nachricht des t...,NEGATIVE,0.977816
1391,@crazyclipsonly Same type that would take a ho...,same type that would take a homemade playstati...,NEGATIVE,0.994473


Create a new DataFrame containing only the rows where the sentiment labels from VADER and the transformers model disagree, display the first 10 rows, and then calculate and print the number and percentage of these disagreeing tweets.



In [30]:
# Create a new DataFrame with disagreeing tweets
disagreeing_tweets = df_sampled[df_sampled['vader sentiment label'] != df_sampled['sentiment label']].copy()

# Display the first 10 rows of the disagreeing_tweets DataFrame
print("First 10 rows of disagreeing tweets:")
display(disagreeing_tweets.head(10))

# Calculate the number of disagreeing tweets
num_disagreeing = len(disagreeing_tweets)
print(f"\nNumber of tweets where sentiment labels disagree: {num_disagreeing}")

# Calculate the percentage of disagreeing tweets
total_sampled_tweets = len(df_sampled)
percentage_disagreeing = (num_disagreeing / total_sampled_tweets) * 100
print(f"Percentage of tweets where sentiment labels disagree: {percentage_disagreeing:.2f}%")

First 10 rows of disagreeing tweets:


Unnamed: 0.1,Unnamed: 0,Date Created,Number of Likes,Source of Tweet,Tweets,hashtag,cleaned_tweets,vader sentiment label,vader sentiment score,cleaned_tweets_transformers,cleaned_tweets_truncated,sentiment label,sentiment score
2899,897,2023-06-25 11:06:23+00:00,2,,Le #DessinDePresse de Sanaga : ls sont morts c...,titan,le de sanaga ls sont morts comme ils ont vécu ...,NEUTRAL,0.0,le dessindepresse de sanaga ls sont morts comm...,le dessindepresse de sanaga ls sont morts comm...,NEGATIVE,0.979517
594,594,2023-06-25 18:23:19+00:00,0,,#Russia #Wagner #RussiaCivilWar https://t.co/P...,wagner,,NEUTRAL,0.0,russia wagner russiacivilwar,russia wagner russiacivilwar,NEGATIVE,0.962062
2870,868,2023-06-25 11:32:00+00:00,1,,Exclusive content -https://t.co/oEiSIIB2Z1\n.\...,titan,exclusive content,POSITIVE,0.128,exclusive content cosplay japan titan titanics...,exclusive content cosplay japan titan titanics...,NEGATIVE,0.961531
1391,390,2023-06-25 16:21:52+00:00,1,,@crazyclipsonly Same type that would take a ho...,titanic,same type that would take a homemade playstati...,NEUTRAL,0.0,same type that would take a homemade playstati...,same type that would take a homemade playstati...,NEGATIVE,0.994473
807,807,2023-06-25 18:08:26+00:00,0,,#SUGA_AgustD_TOUR_in_Seoul #SUGA_AgustD_TOUR #...,wagner,,NEUTRAL,0.0,sugaagustdtourinseoul sugaagustdtour glastonbu...,sugaagustdtourinseoul sugaagustdtour glastonbu...,NEGATIVE,0.955373
2761,759,2023-06-25 12:39:52+00:00,0,,#Titan mishap: #Implosion with incredible forc...,titan,mishap with incredible force amp speed crushin...,NEGATIVE,-0.5612,titan mishap implosion with incredible force a...,titan mishap implosion with incredible force a...,POSITIVE,0.996252
196,196,2023-06-25 18:59:25+00:00,0,,#Wagner #Russia,wagner,,NEUTRAL,0.0,wagner russia,wagner russia,POSITIVE,0.992612
1576,575,2023-06-25 14:51:24+00:00,0,,#merri le #titanic 2 le retour https://t.co/4s...,titanic,le 2 le retour via,NEUTRAL,0.0,merri le titanic 2 le retour via,merri le titanic 2 le retour via,POSITIVE,0.896984
670,670,2023-06-25 18:17:22+00:00,0,,"Il Segretario di Stato americano #Blinken: ""no...",wagner,il segretario di stato americano non credo che...,NEUTRAL,0.0,il segretario di stato americano blinken non c...,il segretario di stato americano blinken non c...,NEGATIVE,0.953018
679,679,2023-06-25 18:16:43+00:00,0,,#تنظيف_المكيفات #غسيل_مكيفات #تكييفات #مكيف #م...,wagner,غسيل مكيفات اسبيلت شباك داكت,NEUTRAL,0.0,تنظيفالمكيفات غسيلمكيفات تكييفات مكيف مكيفات ص...,تنظيفالمكيفات غسيلمكيفات تكييفات مكيف مكيفات ص...,NEGATIVE,0.680166



Number of tweets where sentiment labels disagree: 367
Percentage of tweets where sentiment labels disagree: 73.40%


## Conclusion

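*   The 73.40% disagreement rate (367 of 500 tweets) shows that VADER and the default transformers model (`distilbert-base-uncased-finetuned-sst-2-english`) label this dataset very differently. Part of the gap is structural: VADER assigns one of three labels, while the transformers model only returns POSITIVE or NEGATIVE, so every tweet that VADER calls NEUTRAL counts as a disagreement. Other likely factors are the nuances of tweet language, the training data of each model (the sample includes French, German, Italian, and Arabic tweets, while the transformers model is an English one), and the cleaning methods, which differ between the two approaches.
*   Further investigation into the types of tweets where the models disagree could reveal their respective strengths and weaknesses. Manually reviewing a subset of disagreeing tweets, for example the hashtag-only tweets that are empty after VADER cleaning, would help explain the discrepancies.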
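
One way to start that investigation is to count how often each (VADER label, model label) pair occurs, and to compute the disagreement rate again after setting aside tweets that VADER labels NEUTRAL. The sketch below is a rough illustration only: `disagreement_breakdown` is a hypothetical helper, and the five labels are copied from the `head()` outputs shown earlier, not from the full sample.

```python
from collections import Counter

def disagreement_breakdown(vader_labels, model_labels):
    """Count (VADER, model) label pairs and two disagreement rates."""
    pairs = Counter(zip(vader_labels, model_labels))
    total = sum(pairs.values())
    disagree = sum(n for (v, m), n in pairs.items() if v != m)
    # Restrict to tweets where VADER commits to POSITIVE or NEGATIVE
    non_neutral = sum(n for (v, _), n in pairs.items() if v != 'NEUTRAL')
    disagree_non_neutral = sum(
        n for (v, m), n in pairs.items() if v != 'NEUTRAL' and v != m
    )
    return {
        'pairs': pairs,
        'overall_rate': disagree / total if total else 0.0,
        'non_neutral_rate': (
            disagree_non_neutral / non_neutral if non_neutral else 0.0
        ),
    }

# Labels for the five rows shown in the head() outputs above
vader = ['NEUTRAL', 'NEUTRAL', 'POSITIVE', 'NEGATIVE', 'NEUTRAL']
model = ['NEGATIVE'] * 5

stats = disagreement_breakdown(vader, model)
print(stats['pairs'])
print(stats['overall_rate'], stats['non_neutral_rate'])
```

In the notebook, the same function could be called with `df_sampled['vader sentiment label']` and `df_sampled['sentiment label']` to see how much of the 73.40% comes from NEUTRAL tweets alone.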
