<a href="https://colab.research.google.com/github/LETIMEI/LeetCode_SQL/blob/main/Lecture_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Analysis

# Data Preprocessing

In [1]:
import re
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text  # Remove HTML tags
    text = re.sub(r'[\W]', ' ', text)  # Remove non-alphanumeric characters
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text.lower().strip()

# Example usage
dirty_text = "This is an <html>example</html> text with 😊"
cleaned_text = clean_text(dirty_text)
print(cleaned_text)


this is an example text with


# Sentiment Analysis Methods

# AFINN

The AFINN lexicon is used primarily in a sentiment analysis technique known as the lexicon-based approach. This approach involves assigning sentiment scores to individual words found in a text and then aggregating these scores to determine the overall sentiment of the text. The AFINN lexicon is particularly suited for this method due to its simplicity and direct mapping of words to numerical sentiment scores.

Key Characteristics of the AFINN Lexicon-Based Sentiment Analysis:
Scoring: Each word in the AFINN lexicon is assigned a score ranging from -5 to +5, where negative numbers represent negative sentiments and positive numbers represent positive sentiments. The score indicates the intensity of the sentiment.

Simplicity: The technique is straightforward to implement because it directly sums up the scores of the words that appear in the text. It does not require complex algorithms or machine learning models.

Efficiency: This approach can be very fast, making it suitable for applications that need to process large volumes of text quickly, such as real-time sentiment analysis on social media platforms.

Context Ignorance: While efficient and straightforward, the major drawback is that it generally ignores the context in which words are used. This can lead to inaccuracies, especially in texts where the sentiment is conveyed through sarcasm, irony, or context-dependent meanings.

AFINN is particularly favored for its ease of use and effectiveness in straightforward applications where context and linguistic subtlety are less critical.

In [2]:
pip install afinn

Collecting afinn
  Downloading afinn-0.1.tar.gz (52 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m52.6/52.6 kB[0m [31m461.0 kB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: afinn
  Building wheel for afinn (setup.py) ... [?25l[?25hdone
  Created wheel for afinn: filename=afinn-0.1-py3-none-any.whl size=53430 sha256=4a512b9b80beebc5dc89dc5089346d47ce38db09d0d37d9ed2e53898d9679b91
  Stored in directory: /root/.cache/pip/wheels/b0/05/90/43f79196199a138fb486902fceca30a2d1b5228e6d2db8eb90
Successfully built afinn
Installing collected packages: afinn
Successfully installed afinn-0.1


In [5]:
from afinn import Afinn

afinn = Afinn()

# Example usage
score = afinn.score('This is excellent!')
print(score)  # Positive score


3.0


# SentiWordNet

SentiWordNet is utilized in sentiment analysis through a lexicon-based approach similar to AFINN, but it provides a more nuanced analysis by incorporating multiple sentiment scores for words based on their usage in different contexts (synsets). SentiWordNet, an extension of the widely used WordNet database, provides scores for positivity, negativity, and objectivity for each synset (group of synonymous words that share a common meaning).

Characteristics of SentiWordNet-Based Sentiment Analysis:
Synset-Based Scoring: Each entry (or synset) in WordNet, which can be a word or a phrase, has associated sentiment scores in SentiWordNet:

Positivity Score: Indicates how positive a synset is.
Negativity Score: Indicates how negative a synset is.
Objectivity Score: Indicates how objective, or neutral, a synset is.
Contextual Sensitivity: Unlike simpler lexicon-based methods that assign a single sentiment score to a word regardless of usage, SentiWordNet's linkage to WordNet synsets allows it to provide different sentiment scores depending on the context in which a word is used.

Sentiment Calculation: The overall sentiment of a text can be calculated by analyzing the sentiment scores of its constituent words based on their relevant synsets, taking into account the words’ part of speech and sense.

Complexity: This approach is more complex than using a straightforward dictionary like AFINN because it requires determining the correct sense of a word within its context before applying the sentiment scores.

Beyond simply classifying sentiments as positive or negative, SentiWordNet allows for measuring the degree of sentiment and objectivity, supporting more detailed analyses.SentiWordNet provides a rich lexical resource for sentiment analysis, especially useful in applications requiring a deep understanding of the lexical semantics of the language.

In [6]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# Ensure that the necessary resources are downloaded
nltk.download('wordnet')
nltk.download('sentiwordnet')

def get_sentiment(word, pos=None):
    """Get the sentiment scores for the best sense of the word"""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return None

    # Choose the first synset as the most common usage
    synset = synsets[0]
    swn_synset = swn.senti_synset(synset.name())
    return swn_synset.pos_score(), swn_synset.neg_score(), swn_synset.obj_score()

# Example usage
word = 'happy'
pos_score, neg_score, obj_score = get_sentiment(word, pos=wn.ADJ)
print(f"Positive score: {pos_score}, Negative score: {neg_score}, Objective score: {obj_score}")


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


Positive score: 0.875, Negative score: 0.0, Objective score: 0.125


## TextBlob

TextBlob is another popular library for processing textual data in Python. It is particularly known for its simplicity and ease of use, providing a straightforward API for tackling common natural language processing (NLP) tasks, including part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

Overview of TextBlob in Sentiment Analysis:

TextBlob’s sentiment analysis is built upon the pattern library, which itself is based on a lexicon similar to AFINN but includes an assessment of both **polarity** and **subjectivity**:

* Polarity: Measures the positivity or negativity of the text. The polarity score is a float within the range [-1.0, 1.0].
* Subjectivity: Measures the degree of personal opinion, emotion, or judgment within the text. The subjectivity score is a float within the range [0.0, 1.0].

Key Features of TextBlob:

* Ease of Use: TextBlob's API is very user-friendly, making it easy to perform complex NLP tasks with only a few lines of code.

* Language Support and Translation: TextBlob supports multiple languages for basic NLP tasks and can leverage Google Translate for text translation.

TextBlob fits well into the sentiment analysis toolkit as a versatile, easy-to-use option suitable for various texts. TextBlob assesses both the emotional leaning and the objective vs. subjective nature of the text.

In [6]:
from textblob import TextBlob

def analyze_sentiment_textblob(text):
    """Function to analyze sentiment using TextBlob's built-in sentiment analyzer."""
    return TextBlob(text).sentiment

# Example usage
example_text = "This is a good product!"
sentiment = analyze_sentiment_textblob(example_text)
print(f"Polarity: {sentiment.polarity}, Subjectivity: {sentiment.subjectivity}")


Polarity: 0.875, Subjectivity: 0.6000000000000001


## VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It is different from other sentiment analysis tools like AFINN and SentiWordNet in several significant ways:

Key Features of VADER:

* Specifically Designed for Social Media: VADER was developed with a focus on social media contexts, taking into account the peculiarities and informal language used in tweets, Facebook posts, and other similar platforms.

* Rule-Based with a Sentiment Lexicon: VADER combines a sentiment lexicon (a list of lexical features e.g., words, emoji) that are generally labeled according to their semantic orientation as either positive or negative. What makes VADER particularly powerful is that it also incorporates grammatical and syntactical rules to determine sentiment, which helps capture more nuances than typical lexicons.

* Handles Slang and Emoticons: It understands modern slang used on social media (e.g., "sux", "lol") and includes a robust set of emoticons. This feature is crucial for analyzing contemporary sentiment expressions on social media effectively.

* Contextual Awareness: VADER not only scores words but also considers context. For example, it intensifies the sentiment if it detects an exclamation mark or diminishes it if it detects words like "kind of" or "sort of".

How VADER Differs from AFINN and SentiWordNet:

* AFINN: AFINN provides a list of words rated from -5 to +5. It is straightforward and simple, assigning scores to words without considering context or any modifiers. It does not account for slang, emoticons, or idiomatic phrases often found in social media, making it less effective for such platforms compared to VADER.

* SentiWordNet: This is an extension of WordNet which provides scores for positivity, negativity, and objectivity for each WordNet synset (group of synonymous words). While SentiWordNet is more sophisticated than AFINN in handling context due to its association with specific meanings of words, it lacks the built-in rules for handling the dynamics of sentence-level sentiment, modifiers, or contemporary slang and emojis, making it less suitable for on-the-fly social media sentiment analysis compared to VADER.

In [7]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores('This is amazingly good!'))


{'neg': 0.0, 'neu': 0.463, 'pos': 0.537, 'compound': 0.54}


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


# Transformer based Sentiment Analysis

BERT and its variants (like RoBERTa, DistilBERT, etc.) represent a significant shift in how machines understand human languages due to their architecture and training approaches, which allow them to capture the context of every word in a text in relation to all the other words in the sentence, rather than in one-direction at a time.

Key Features of BERT for Sentiment Analysis:

* Contextual Understanding: BERT’s most significant feature is its ability to consider the full context of a word by looking at the words that come before and after it. This is a considerable advantage for sentiment analysis, especially in handling sentences where the meaning can significantly change based on context or word placement.

* Pre-training on Large Corpuses: BERT is pre-trained on a vast corpus of text from the internet, which gives it a broad understanding of language and context before it is even fine-tuned for specific tasks like sentiment analysis.

Comparison with Other Sentiment Analysis Tools:

* TextBlob & AFINN: These tools use static lexicons or simple rule-based approaches for sentiment analysis. They do not account for the context in which a word appears, which can lead to inaccurate sentiment analysis in more complex sentences. BERT, by contrast, considers the entire sentence structure, which helps in understanding context-dependent meanings.


* VADER: VADER is highly tuned for social media and can interpret slang, emojis, and emoticons effectively. While BERT can also be fine-tuned for social media text, it would require specific training data reflecting these nuances. VADER is out-of-the-box ready for social media, whereas BERT would need some adaptation.

* SentiWordNet: Unlike SentiWordNet, which provides scores based on word senses, BERT evaluates the sentiment based on the overall sentence semantics, making it much more robust for sentences where multiple word senses are involved.

BERTSentiment, leveraging models like BERT, fits into sentiment analysis as a high-performance tool capable of understanding nuanced and context-dependent language, making it superior for applications requiring a deep understanding of text sentiment. It is particularly effective in environments where the context significantly impacts meaning, such as in customer feedback, movie reviews, and other forms of textual analysis where traditional lexicons might fall short. Its use, however, requires more computational resources compared to simpler tools like TextBlob or VADER, and it might be considered overkill for simpler applications.

In [9]:
%pip install transformers



In [10]:
from transformers import pipeline

# Load sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Analyze sentiment
result = sentiment_pipeline("I love using transformers for natural language processing!")
print(result)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

[{'label': 'POSITIVE', 'score': 0.9987679123878479}]


# Complexities of language


# Complex sentiments

In [11]:
from textblob import TextBlob
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
from transformers import pipeline
nlp = pipeline("sentiment-analysis")


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Let us try the following sentences and capture the sentiment.

"The movie isn't really all that great."

"Oh great, another rainy day."

 "I’m not unhappy with how things turned out."

 "is this good - more I think about it I do no think so"

 "This camera fails to impress me."

 "This product is barely functional"

 "What a great car, it did not start the first day."

 "Trying out Chrome because Firefox keeps crashing"

"For paintX, one coat can cover the wood color"

"For paintY, we need three coats to cover the wood color"

"hello :-) ;-)"

In [12]:
texts = [
    "No, I am good."
    "I am no good."
    "The movie isn't really all that great.",
    "Oh great, another rainy day.",
    "I’m not unhappy with how things turned out.",
    "is this good - more I think about it I do no think so",
    "This camera fails to impress me.",
    "This product is barely functional.",
    "What a great car, it did not start the first day.",
    "Trying out Chrome because Firefox keeps crashing.",
    "For paintX, one coat can cover the wood color.",
    "For paintY, we need three coats to cover the wood color.",
    "hello :-) ;-)"
]

In [13]:
import pandas as pd

# List to collect the sentiment data
data = []

# Iterate through each text to analyze sentiment
for text in texts:
    # TextBlob sentiment
    blob = TextBlob(text)
    tb_sentiment = blob.sentiment

    # VADER sentiment
    vader_sentiment = sid.polarity_scores(text)

    # BERT sentiment
    bert_result = nlp(text)

    # Append results to data list
    data.append({
        "Text": text,
        "TextBlob Sentiment": tb_sentiment[0],
        "VADER Compound Score": vader_sentiment['compound'],
        "BERT Sentiment": bert_result[0]['label'],
        "BERT Score": bert_result[0]['score'],
    })

# Create DataFrame from the collected data
sentiment_df = pd.DataFrame(data)
sentiment_df

Unnamed: 0,Text,TextBlob Sentiment,VADER Compound Score,BERT Sentiment,BERT Score
0,"No, I am good.I am no good.The movie isn't rea...",0.5,0.2415,NEGATIVE,0.999534
1,"Oh great, another rainy day.",0.8,0.5859,POSITIVE,0.985306
2,I’m not unhappy with how things turned out.,0.3,0.3252,POSITIVE,0.996785
3,is this good - more I think about it I do no t...,0.6,0.1779,POSITIVE,0.99983
4,This camera fails to impress me.,-0.5,0.0258,NEGATIVE,0.999775
5,This product is barely functional.,0.05,0.0,NEGATIVE,0.999817
6,"What a great car, it did not start the first day.",0.525,0.6249,NEGATIVE,0.977874
7,Trying out Chrome because Firefox keeps crashing.,0.0,0.0,NEGATIVE,0.996089
8,"For paintX, one coat can cover the wood color.",0.0,0.0,NEGATIVE,0.968502
9,"For paintY, we need three coats to cover the w...",0.0,0.0,NEGATIVE,0.999018


##Slang

In [14]:
# Read the slang lexicon
slang_df = pd.read_csv("slang_lexicon.csv")
slang_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'slang_lexicon.csv'

In [None]:
# Convert the DataFrame into a dictionary with lower case keys
slang_dict = pd.Series(slang_df['explanation'].values, index=slang_df['slang'].str.lower()).to_dict()


In [None]:
# Replace slang
import re

def replace_slang(text, slang_dict):
    # Use regex to extract words, keeping them only and converting text to lower case for matching
    words = re.findall(r'\b\w+\b', text.lower())  # Extracts words and converts them to lower case
    # Replace each word with the corresponding explanation if it exists in the dictionary
    return ' '.join([slang_dict.get(word, word) for word in words])

In [None]:
# Example usage
text_with_slang = "BRB, going to the store, LOL!"
cleaned_text = replace_slang(text_with_slang, slang_dict)
print(cleaned_text)

text_with_slang = "tbh, icymi i was like ok"
cleaned_text = replace_slang(text_with_slang, slang_dict)
print(cleaned_text)

# Your Turn
1. Conduct sentiment analysis on `sms_spam.csv`. Assess if the average sentiment of the text is related to whether the message is 'ham' or 'spam'.
2. Open the `kickstarter.csv` file, which contains information about various Kickstarter projects, such as their title, description, funding goal, number of backers, and state (successful, failed, canceled, etc.). The file was downloaded from https://webrobots.io/kickstarter-datasets/. Conduct sentiment analysis on the variable `blurb` and assess whether the sentiment is related to `state` which has values successful, failed, and canceled.


In [16]:
import pandas as pd
import re
import nltk

In [18]:
df = pd.read_csv('sms_spam.csv')
df

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will Ã¼ b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


**Data Preprocessing**

In [22]:
import re
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text  # Remove HTML tags
    text = re.sub(r'[\W]', ' ', text)  # Remove non-alphanumeric characters
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text.lower().strip()

# Example usage
#dirty_text = df['text']
#cleaned_text = clean_text(dirty_text)
df['cleaned_text'] = df['text'].apply(clean_text)
#print(cleaned_text)
df

  text = BeautifulSoup(text, "lxml").text  # Remove HTML tags


Unnamed: 0,type,text,cleaned_text
0,ham,"Go until jurong point, crazy.. Available only ...",go until jurong point crazy available only in ...
1,ham,Ok lar... Joking wif u oni...,ok lar joking wif u oni
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,free entry in 2 a wkly comp to win fa cup fina...
3,ham,U dun say so early hor... U c already then say...,u dun say so early hor u c already then say
4,ham,"Nah I don't think he goes to usf, he lives aro...",nah i don t think he goes to usf he lives arou...
...,...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...,this is the 2nd time we have tried 2 contact u...
5570,ham,Will Ã¼ b going to esplanade fr home?,will ã¼ b going to esplanade fr home
5571,ham,"Pity, * was in mood for that. So...any other s...",pity was in mood for that so any other suggest...
5572,ham,The guy did some bitching but I acted like i'd...,the guy did some bitching but i acted like i d...


**AFINN**

In [None]:
pip install afinn

In [25]:
from afinn import Afinn

afinn = Afinn()

score = df['cleaned_text'].apply(afinn.score)
print(score)

0       1.0
1       0.0
2       5.0
3       0.0
4       0.0
       ... 
5569    4.0
5570    0.0
5571   -2.0
5572    5.0
5573    6.0
Name: cleaned_text, Length: 5574, dtype: float64


**SentiWordNet**

In [30]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

# Ensure that the necessary resources are downloaded
nltk.download('wordnet')
nltk.download('sentiwordnet')

def get_sentiment(word, pos=None):
    """Get the sentiment scores for the best sense of the word"""
    synsets = wn.synsets(word, pos=pos)
    if not synsets:
        return None

    # Choose the first synset as the most common usage
    synset = synsets[0]
    swn_synset = swn.senti_synset(synset.name())
    return swn_synset.pos_score(), swn_synset.neg_score(), swn_synset.obj_score()

# Example usage
sentiment_scores = df['cleaned_text'].apply(lambda x: get_sentiment(x))

# Filter out rows where sentiment_scores are None
sentiment_scores_filtered = sentiment_scores.dropna()

# Create a DataFrame from the filtered sentiment_scores
df[['pos_score', 'neg_score', 'obj_score']] = pd.DataFrame(sentiment_scores_filtered.tolist(), index=sentiment_scores_filtered.index)

# Display the DataFrame with sentiment scores
print(df[['cleaned_text', 'pos_score', 'neg_score', 'obj_score']])



                                           cleaned_text  pos_score  neg_score  \
0     go until jurong point crazy available only in ...        NaN        NaN   
1                               ok lar joking wif u oni        NaN        NaN   
2     free entry in 2 a wkly comp to win fa cup fina...        NaN        NaN   
3           u dun say so early hor u c already then say        NaN        NaN   
4     nah i don t think he goes to usf he lives arou...        NaN        NaN   
...                                                 ...        ...        ...   
5569  this is the 2nd time we have tried 2 contact u...        NaN        NaN   
5570               will ã¼ b going to esplanade fr home        NaN        NaN   
5571  pity was in mood for that so any other suggest...        NaN        NaN   
5572  the guy did some bitching but i acted like i d...        NaN        NaN   
5573                          rofl its true to its name        NaN        NaN   

      obj_score  
0        

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


**TextBlob**

In [31]:
from textblob import TextBlob

def analyze_sentiment_textblob(text):
    """Function to analyze sentiment using TextBlob's built-in sentiment analyzer."""
    return TextBlob(text).sentiment

sentiment_scores = df['cleaned_text'].apply(analyze_sentiment_textblob)

# Split the sentiment scores into two separate columns
df[['polarity', 'subjectivity']] = pd.DataFrame(sentiment_scores.tolist(), index=df.index)

# Display the DataFrame with sentiment scores
print(df[['cleaned_text', 'polarity', 'subjectivity']])

                                           cleaned_text  polarity  \
0     go until jurong point crazy available only in ...  0.150000   
1                               ok lar joking wif u oni  0.500000   
2     free entry in 2 a wkly comp to win fa cup fina...  0.300000   
3           u dun say so early hor u c already then say  0.100000   
4     nah i don t think he goes to usf he lives arou...  0.000000   
...                                                 ...       ...   
5569  this is the 2nd time we have tried 2 contact u...  0.144444   
5570               will ã¼ b going to esplanade fr home  0.000000   
5571  pity was in mood for that so any other suggest... -0.112500   
5572  the guy did some bitching but i acted like i d...  0.216667   
5573                          rofl its true to its name  0.575000   

      subjectivity  
0         0.762500  
1         0.500000  
2         0.550000  
3         0.300000  
4         0.000000  
...            ...  
5569      0.611111  
557

**Combined**

In [32]:
import pandas as pd

# List to collect the sentiment data
data = []

# Iterate through each text to analyze sentiment
for text in df['cleaned_text']:
    # TextBlob sentiment
    blob = TextBlob(text)
    tb_sentiment = blob.sentiment

    # VADER sentiment
    vader_sentiment = sid.polarity_scores(text)

    # BERT sentiment
    bert_result = nlp(text)

    # Append results to data list
    data.append({
        "Text": text,
        "TextBlob Sentiment": tb_sentiment[0],
        "VADER Compound Score": vader_sentiment['compound'],
        "BERT Sentiment": bert_result[0]['label'],
        "BERT Score": bert_result[0]['score'],
    })

# Create DataFrame from the collected data
sentiment_df = pd.DataFrame(data)
sentiment_df

Unnamed: 0,Text,TextBlob Sentiment,VADER Compound Score,BERT Sentiment,BERT Score
0,go until jurong point crazy available only in ...,0.150000,0.4019,NEGATIVE,0.990238
1,ok lar joking wif u oni,0.500000,0.4767,NEGATIVE,0.982236
2,free entry in 2 a wkly comp to win fa cup fina...,0.300000,0.7964,NEGATIVE,0.986346
3,u dun say so early hor u c already then say,0.100000,0.0000,NEGATIVE,0.993365
4,nah i don t think he goes to usf he lives arou...,0.000000,-0.1027,NEGATIVE,0.997008
...,...,...,...,...,...
5569,this is the 2nd time we have tried 2 contact u...,0.144444,0.8720,NEGATIVE,0.995909
5570,will ã¼ b going to esplanade fr home,0.000000,0.0000,POSITIVE,0.865926
5571,pity was in mood for that so any other suggest...,-0.112500,-0.2960,NEGATIVE,0.999179
5572,the guy did some bitching but i acted like i d...,0.216667,0.8934,NEGATIVE,0.997138


**Kichstarter.csv**

In [3]:
df = pd.read_csv("Kickstarter.csv")

In [4]:
import re
from bs4 import BeautifulSoup

def clean_text(text):
    text = BeautifulSoup(text, "lxml").text  # Remove HTML tags
    text = re.sub(r'[\W]', ' ', text)  # Remove non-alphanumeric characters
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    return text.lower().strip()

# Example usage
#dirty_text = df['text']
#cleaned_text = clean_text(dirty_text)
df['cleaned_blurb'] = df['blurb'].apply(clean_text)
#print(cleaned_text)
head(df)

  text = BeautifulSoup(text, "lxml").text  # Remove HTML tags


Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,country_displayable_name,created_at,creator,currency,currency_symbol,...,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_exchange_rate,usd_pledged,usd_type,cleaned_blurb
0,7,Teen sensation Emma Bilyou is one of the most ...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",1500,US,the United States,1370358703,"{""id"":1720335385,""name"":""Dwaine Harris"",""is_re...",USD,$,...,True,False,successful,1375629060,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,1500.000000,international,teen sensation emma bilyou is one of the most ...
1,12,Thomas Nöla has recorded a new album. He would...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",581,US,the United States,1363533018,"{""id"":474097748,""name"":""Thomas Nöla"",""slug"":""t...",USD,$,...,True,False,successful,1373056493,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,581.660000,international,thomas nöla has recorded a new album he would ...
2,134,The world's first ever full-length album of br...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",15631,US,the United States,1369246990,"{""id"":1877703222,""name"":""Kathleen Smith"",""is_r...",USD,$,...,True,False,successful,1373036535,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,15631.000000,international,the world s first ever full length album of br...
3,76,"Love, laugh, dance, and cry, to ten new synthp...","{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",4177,US,the United States,1364398436,"{""id"":1680516234,""name"":""Kite Flying Robot"",""s...",USD,$,...,True,False,successful,1372680035,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,4177.000000,international,love laugh dance and cry to ten new synthpop a...
4,399,An all new recording from Christian music vete...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",41100,US,the United States,1366950463,"{""id"":363468529,""name"":""CRYSTAL LEWIS"",""slug"":...",USD,$,...,True,False,successful,1372658437,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,41100.250000,international,an all new recording from christian music vete...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3649,135,Tea reimagined: A chilled and bubbly new way t...,"{""id"":307,""name"":""Drinks"",""analytics_name"":""Dr...",10591,AU,Australia,1558443357,"{""id"":71436957,""name"":""East Forged"",""slug"":""ea...",AUD,$,...,True,False,successful,1568757563,0.678831,"{""web"":{""project"":""https://www.kickstarter.com...",0.683997,10511.913093,international,tea reimagined a chilled and bubbly new way to...
3650,82,Weathervane Music and WXPN created Shaking Thr...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",6631,US,the United States,1289416053,"{""id"":27389455,""name"":""Weathervane Music"",""is_...",USD,$,...,True,False,successful,1293856215,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,6631.000000,international,weathervane music and wxpn created shaking thr...
3651,70,Orange Drink has a new album! Fans of Daft Pu...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",4050,US,the United States,1288308969,"{""id"":650403848,""name"":""Hemlock Records"",""slug...",USD,$,...,True,False,successful,1293238812,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,4050.220000,international,orange drink has a new album fans of daft punk...
3652,31,Jackets is recording our first CD. We need yo...,"{""id"":42,""name"":""Pop"",""analytics_name"":""Pop"",""...",1210,US,the United States,1290225677,"{""id"":103412278,""name"":""Cary McCartin"",""is_reg...",USD,$,...,True,False,successful,1293847213,1.000000,"{""web"":{""project"":""https://www.kickstarter.com...",1.000000,1210.000000,international,jackets is recording our first cd we need your...




In [12]:
import pandas as pd

# List to collect the sentiment data
data = []

# Iterate through each text to analyze sentiment
for text in df['cleaned_blurb']:
    # TextBlob sentiment
    blob = TextBlob(text)
    tb_sentiment = blob.sentiment

    # VADER sentiment
    vader_sentiment = sid.polarity_scores(text)

    # BERT sentiment
    bert_result = nlp(text)

    # Append results to data list
    data.append({
        "Text": text,
        "State": df['state'],
        "TextBlob Sentiment": tb_sentiment[0],
        "VADER Compound Score": vader_sentiment['compound'],
        "BERT Sentiment": bert_result[0]['label'],
        "BERT Score": bert_result[0]['score'],
    })

# Create DataFrame from the collected data
sentiment_df = pd.DataFrame(data)
sentiment_df

Unnamed: 0,Text,State,TextBlob Sentiment,VADER Compound Score,BERT Sentiment,BERT Score
0,teen sensation emma bilyou is one of the most ...,0 successful 1 successful 2 ...,0.400000,0.7501,POSITIVE,0.999768
1,thomas nöla has recorded a new album he would ...,0 successful 1 successful 2 ...,0.005682,0.8360,NEGATIVE,0.985511
2,the world s first ever full length album of br...,0 successful 1 successful 2 ...,0.433333,0.8807,POSITIVE,0.999626
3,love laugh dance and cry to ten new synthpop a...,0 successful 1 successful 2 ...,0.234091,0.6908,POSITIVE,0.989056
4,an all new recording from christian music vete...,0 successful 1 successful 2 ...,0.045455,0.0000,POSITIVE,0.999853
...,...,...,...,...,...,...
3649,tea reimagined a chilled and bubbly new way to...,0 successful 1 successful 2 ...,0.250216,0.7003,POSITIVE,0.999585
3650,weathervane music and wxpn created shaking thr...,0 successful 1 successful 2 ...,0.400000,0.8625,POSITIVE,0.999722
3651,orange drink has a new album fans of daft punk...,0 successful 1 successful 2 ...,0.318182,0.8225,POSITIVE,0.998340
3652,jackets is recording our first cd we need your...,0 successful 1 successful 2 ...,0.250000,0.4019,NEGATIVE,0.971482


# Lexicons

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
# Initialize VADER
sid = SentimentIntensityAnalyzer()

# Access the lexicon (which is a dictionary of words and their scores)
vader_lexicon = sid.lexicon

# Print some entries from the lexicon
print({word: score for word, score in list(vader_lexicon.items())[:100]})

## Download the lexicons

In [None]:
pip install afinn

In [None]:
# AFINN
import csv
from afinn import Afinn

def export_afinn_to_csv(file_path):
    # Initialize Afinn and access the wordlist
    afinn = Afinn()

    # AFINN wordlist is accessible as a dictionary {word: score}
    wordlist = afinn._dict

    # Write to CSV
    with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Word', 'Score'])  # header row
        for word, score in wordlist.items():
            writer.writerow([word, score])

# Specify the path for your CSV file
export_afinn_to_csv('afinn_lexicon.csv')


In [None]:
# VADER
import csv

# Write the VADER lexicon to a CSV file
with open('vader_lexicon.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Token', 'Sentiment Score'])
    for key, value in vader_lexicon.items():
        writer.writerow([key, value])

In [None]:
# Sentiwordnet - takes long to run
import csv
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet as wn

nltk.download('sentiwordnet')
nltk.download('wordnet')

# Function to get sentiment data
def get_sentiwordnet_data():
    data = []
    # Iterate over all the synset terms in the WordNet
    for synset in wn.all_synsets():
        # Get SentiWordNet synset equivalent
        senti_synset = swn.senti_synset(synset.name())
        if senti_synset:
            # Prepare data that includes the synset name, its positive, negative, and objective scores
            data.append([
                synset.name(),
                synset.definition(),
                senti_synset.pos_score(),
                senti_synset.neg_score(),
                senti_synset.obj_score()
            ])
    return data

# Get the data from SentiWordNet
senti_data = get_sentiwordnet_data()

# Write the data to a CSV file
with open('sentiwordnet.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Synset', 'Definition', 'Positive Score', 'Negative Score', 'Objective Score'])
    writer.writerows(senti_data)

print("SentiWordNet data has been written to 'sentiwordnet.csv'.")
