# <font color='blue'>Project Overview: BERT and VADER for Sentiment Analysis</font>

## <font color='purple'>Step 1: Install Necessary Libraries</font>
- Install libraries: `transformers`, `vaderSentiment`, `pandas`, `datasets`
  - **transformers**: Used for accessing pre-trained language models like BERT.
  - **vaderSentiment**: Provides a rule-based sentiment analysis tool (VADER).
  - **pandas**: Essential for data manipulation and DataFrame operations.
  - **datasets**: Allows easy access to various datasets including IMDb for this project.

## <font color='purple'>Step 2: Import Libraries</font>
- Import necessary libraries for data handling, sentiment analysis, and evaluation:
  - **torch**: PyTorch, the backend the Transformers pipeline runs on (imported, but not called directly in this notebook).
  - **AutoTokenizer, pipeline**: From Transformers library for BERT integration.
  - **SentimentIntensityAnalyzer**: From vaderSentiment for rule-based sentiment analysis.
  - **pandas**: For creating and manipulating DataFrames.
  - **accuracy_score**: From sklearn.metrics to evaluate model performance.

## <font color='purple'>Step 3: Initialize IMDb Dataset</font>
- Load IMDb dataset using `load_dataset` from `datasets`:
  - IMDb dataset is a standard benchmark for sentiment analysis, containing movie reviews labeled as positive or negative.

## <font color='purple'>Step 4: Initialize Tokenizers and Sentiment Analyzers</font>
- Initialize BERT tokenizer (`AutoTokenizer.from_pretrained('bert-base-uncased')`):
  - Converts text into tokens suitable for BERT model input.
- Initialize VADER sentiment analyzer (`SentimentIntensityAnalyzer()`):
  - Utilizes lexicons and rules to assess sentiment polarity of text.
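- Initialize BERT sentiment classifier pipeline (`pipeline("sentiment-analysis", model="bert-base-uncased")`):
  - Wraps tokenization, inference, and classification in one call, so no explicit model handling is needed.
  - `bert-base-uncased` is the base checkpoint, not a sentiment model: its classification head is newly initialized, so predictions are not meaningful until the model is fine-tuned.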

## <font color='purple'>Step 5: Define Function for Sentiment Analysis using VADER</font>
- Define `analyze_sentiment_vader(review)` function:
  - Uses VADER to analyze sentiment of a given review text.
  - Assigns sentiment labels (positive, negative, neutral) based on VADER's compound score.

## <font color='purple'>Step 6: Process Example Reviews</font>
- Define a set of example reviews to demonstrate sentiment analysis:
  - These reviews cover a range of sentiments from very positive to negative.
- Perform sentiment analysis using both BERT and VADER on each review:
  - **BERT**: Utilizes the pre-trained BERT model to predict sentiment labels and scores.
  - **VADER**: Applies the `analyze_sentiment_vader` function to determine sentiment using rule-based analysis.

## <font color='purple'>Step 7: Convert Results to DataFrames</font>
- Convert results of BERT and VADER sentiment analyses to pandas DataFrames (`df_bert` and `df_vader`):
  - Organizes results into structured tabular format for easy analysis and comparison.

## <font color='purple'>Step 8: Print Results</font>
- Print BERT and VADER sentiment analysis results:
  - Displays the sentiment predictions and associated scores for each review using pandas DataFrames.

## <font color='purple'>Step 9: Calculate and Print Accuracies (Optional)</font>
- Calculate accuracies of BERT and VADER sentiment analyses against true sentiments:
  - Compares predicted sentiment labels with manually assigned true sentiments.
  - Uses `accuracy_score` from sklearn.metrics to quantify and compare model performance.

## <font color='purple'>Step 10: Save DataFrames to CSV (Optional)</font>
- Save results of BERT and VADER sentiment analyses to CSV files (`bert_sentiment_analysis_results.csv` and `vader_sentiment_analysis_results.csv`):
  - Persists the analysis results for further inspection or integration with other tools.

### <font color='green'>Conclusion</font>
This project compares BERT (a deep learning-based approach) and VADER (a rule-based approach) for sentiment analysis. It loads the IMDb dataset, runs both tools on a small set of hand-written example reviews, evaluates the predictions against manually assigned labels, and saves the results for further analysis. Comparing the two approaches highlights their strengths and limitations, including the need to fine-tune BERT before its predictions are usable.

# Step 1: <font color='blue'>Install Necessary Libraries</font>

<font color='purple'>

- Install libraries: `transformers`, `vaderSentiment`, `pandas`, `datasets`
  - **transformers**: Library for accessing pre-trained language models like BERT.
  - **vaderSentiment**: Provides a rule-based sentiment analysis tool (VADER) that needs no training and no deep learning model.
  - **pandas**: Used for data manipulation, here for holding the sentiment analysis results in DataFrames.
  - **datasets**: Simplifies access to various datasets, including IMDb, which is loaded in Step 3.
  - Step 2 also imports `torch` and `sklearn` (the `scikit-learn` package). Install them as well if your environment does not already provide them.

</font>

In [64]:
#  Install necessary libraries

In [27]:
!pip install transformers vaderSentiment pandas datasets torch scikit-learn



# Step 2: <font color='green'>Import Libraries</font>

<font color='purple'>

- Import necessary libraries for data handling, sentiment analysis, and evaluation:
  - **torch**: PyTorch, a foundational deep learning library. It is not called directly in this notebook, but the Transformers pipeline runs on it.
  - **AutoTokenizer, pipeline**: From the Transformers library, these components facilitate the integration of BERT (Bidirectional Encoder Representations from Transformers) into the project.
    - **AutoTokenizer**: Provides a way to tokenize input text suitable for BERT models.
    - **pipeline**: Simplifies the process of running pre-trained models like BERT for specific tasks, such as sentiment analysis in this case.
  - **SentimentIntensityAnalyzer**: Imported from vaderSentiment, this tool offers a rule-based approach to sentiment analysis.
    - It calculates sentiment scores based on lexicons and predefined rules, making it effective for quick sentiment assessment.
  - **pandas**: This library is essential for data manipulation and analysis, offering powerful data structures like DataFrames.
    - DataFrames are used extensively to organize and analyze structured data, including the sentiment analysis results in this project.
  - **accuracy_score**: From sklearn.metrics, this function evaluates the accuracy of predicted sentiment labels against true labels.
    - It provides a quantitative measure of how well the sentiment analysis models perform relative to ground truth.

</font>

In [65]:
# Import libraries

In [29]:
import torch  # Import PyTorch library

In [30]:
from transformers import AutoTokenizer, pipeline  # Import AutoTokenizer and pipeline from Transformers

In [31]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer  # Import SentimentIntensityAnalyzer from vaderSentiment


In [32]:
import pandas as pd  # Import pandas library

In [33]:
from sklearn.metrics import accuracy_score  # Import accuracy_score from sklearn

# Step 3: <font color='purple'>Initialize IMDb Dataset</font>

<font color='purple'>

- Load IMDb dataset using `load_dataset` from `datasets`:
  - The IMDb dataset is a widely used benchmark for sentiment analysis tasks.
  - It contains a collection of movie reviews labeled as positive or negative sentiment.
  - `load_dataset("imdb")` downloads the reviews and their labels. In this notebook the dataset is only loaded; the later steps run on a small set of hand-written example reviews.

</font>


In [66]:
#  Initialize IMDb dataset and load examples

In [35]:
from datasets import load_dataset  # Import load_dataset function from datasets

In [36]:
dataset = load_dataset("imdb")  # Load IMDb dataset
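
The loaded dataset is not used again in this notebook. As a rough sketch of how labeled reviews could be drawn from it, the snippet below samples a few rows from the `test` split and converts the integer labels to text. The helper names `imdb_label_name` and `sample_imdb_reviews` are hypothetical and not part of the original notebook; the sketch assumes the Hub's `imdb` dataset exposes `text` and `label` fields with 0 for negative and 1 for positive.

```python
# Hypothetical helpers; nothing here is part of the original notebook.
IMDB_LABEL_NAMES = {0: "negative", 1: "positive"}  # assumed label encoding of the Hub's "imdb" dataset


def imdb_label_name(label_id):
    """Map an IMDb integer label to a readable sentiment string."""
    return IMDB_LABEL_NAMES[label_id]


def sample_imdb_reviews(n=5, seed=0):
    """Return n shuffled test-split reviews as {'review', 'true_sentiment'} dicts."""
    from datasets import load_dataset  # imported lazily; downloads the data on first use

    test_split = load_dataset("imdb", split="test")
    subset = test_split.shuffle(seed=seed).select(range(n))
    return [
        {"review": row["text"], "true_sentiment": imdb_label_name(row["label"])}
        for row in subset
    ]
```

Calling `sample_imdb_reviews(5)` would give five real movie reviews with ground-truth labels. IMDb has no neutral class, so a three-way label scheme would need to be collapsed to two classes before comparing against it.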

# Step 4: <font color='orange'>Initialize Tokenizers and Sentiment Analyzers</font>

<font color='purple'>

- Initialize BERT tokenizer (`AutoTokenizer.from_pretrained('bert-base-uncased')`):
  - The BERT tokenizer converts text into the subword tokens (WordPiece units) that the BERT model takes as input.
  - It is initialized here for illustration; the pipeline below loads its own tokenizer, so `tokenizer_bert` is not used in later steps.

- Initialize VADER sentiment analyzer (`SentimentIntensityAnalyzer()`):
  - VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool.
  - It assesses the sentiment of text based on a lexicon of words that are scored according to their sentiment polarity (positive, negative, neutral).
  - VADER analyzes sentiment by considering the intensity of each sentiment (positive, negative, neutral) in the text and generates a compound score that summarizes overall sentiment.

- Initialize BERT sentiment classifier pipeline (`pipeline("sentiment-analysis", model="bert-base-uncased")`):
  - The pipeline encapsulates tokenization, model inference, and classification in a single easy-to-use call.
  - `bert-base-uncased` is the base pre-trained checkpoint and has not been fine-tuned for sentiment analysis. Its classification head is randomly initialized, as the warning printed when the pipeline is created states.
  - Until the model is fine-tuned, the pipeline returns generic `LABEL_0`/`LABEL_1` labels whose scores carry no sentiment information.

</font>

In [67]:
# Initialize tokenizers and sentiment analyzers

In [38]:
tokenizer_bert = AutoTokenizer.from_pretrained('bert-base-uncased')  # Initialize BERT tokenizer

In [39]:
vader_analyzer = SentimentIntensityAnalyzer()  # Initialize VADER sentiment analyzer

In [40]:
classifier_bert = pipeline("sentiment-analysis", model="bert-base-uncased")  # Initialize BERT sentiment classifier pipeline


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
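
The warning above means the classification head has random weights. One possible way around this, short of fine-tuning, is to load a checkpoint that has already been trained for sentiment. The sketch below assumes the publicly available `distilbert-base-uncased-finetuned-sst-2-english` checkpoint, which is DistilBERT rather than BERT and is not used anywhere in the original notebook. The helper names are hypothetical.

```python
# Assumption: this checkpoint is a swapped-in DistilBERT model, not the notebook's bert-base-uncased.
FINETUNED_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"


def build_finetuned_classifier(model_name=FINETUNED_MODEL):
    """Create a sentiment pipeline whose classification head has been trained."""
    from transformers import pipeline  # imported lazily; downloads weights on first call

    return pipeline("sentiment-analysis", model=model_name)


def normalize_label(raw_label):
    """Lower-case labels such as 'POSITIVE' so they line up with VADER-style labels."""
    return raw_label.lower()
```

With this checkpoint, `build_finetuned_classifier()("This movie is fantastic!")[0]` returns a dict with `label` and `score` keys, where the label is `POSITIVE` or `NEGATIVE`. The model is binary, so it never predicts a neutral class.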


# Step 5: <font color='red'>Define Function for Sentiment Analysis using VADER</font>

<font color='purple'>

- Define `analyze_sentiment_vader(review)` function:
  - This function is designed to perform sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner).
  - VADER is a rule-based sentiment analysis tool that evaluates the sentiment of text based on predefined lexicons and rules.
  - The function takes a `review` parameter as input, which represents the text to be analyzed for sentiment.
  - Inside the function:
    - It uses the module-level `vader_analyzer` created in Step 4 rather than initializing a new analyzer on each call.
    - Calculates sentiment scores (`neg`, `neu`, `pos`, and `compound`) using the `polarity_scores()` method from VADER for the input `review`.
    - Determines the sentiment label based on the compound score:
      - Assigns 'positive' if the compound score is greater than or equal to 0.05.
      - Assigns 'negative' if the compound score is less than or equal to -0.05.
      - Assigns 'neutral' if the compound score is strictly between -0.05 and 0.05.
    - Returns the sentiment label (`'positive'`, `'negative'`, `'neutral'`) and the compound score.

</font>

In [68]:
# Function to analyze sentiment using VADER

In [42]:
def analyze_sentiment_vader(review):
    scores = vader_analyzer.polarity_scores(review)  # Calculate sentiment scores using VADER
    if scores['compound'] >= 0.05:
        sentiment = 'positive'  # Assign sentiment as positive if compound score is >= 0.05
    elif scores['compound'] <= -0.05:
        sentiment = 'negative'  # Assign sentiment as negative if compound score is <= -0.05
    else:
        sentiment = 'neutral'  # Assign sentiment as neutral otherwise
    score = scores['compound']  # Get compound score
    return sentiment, score  # Return sentiment and score
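
To make the threshold behavior easier to check without running VADER, the labeling rule can be isolated as a pure function of the compound score. This is a minimal sketch; `label_from_compound` is a hypothetical helper name, and the 0.05 threshold is the one used in the function above.

```python
def label_from_compound(compound, threshold=0.05):
    """Apply the same cut-offs as analyze_sentiment_vader to a compound score."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Boundary values: the thresholds themselves are inclusive
for value in (0.05, 0.049, 0.0, -0.049, -0.05):
    print(f"{value:+.3f} -> {label_from_compound(value)}")
```

Scores exactly at 0.05 or -0.05 fall into the positive and negative classes respectively; only values strictly inside that band are labeled neutral.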

# Step 6: <font color='teal'>Process Example Reviews</font>

<font color='purple'>

- Define a set of example reviews to demonstrate sentiment analysis:
  - The five reviews cover positive, negative, and lukewarm statements. They are a small demonstration set, not a benchmark.

- Perform sentiment analysis using both BERT and VADER on each review:
  
  ### **BERT Analysis:**
  - Utilizes the initialized pipeline (`classifier_bert`) to predict a sentiment label and score for each review.
  - The `classifier_bert` pipeline automatically handles tokenization, model inference, and classification. Because the `bert-base-uncased` classification head is untrained, its predictions here are not reliable.
  - Results from BERT analysis are stored in the `results_bert` list, which includes:
    - Review text
    - Predicted label as returned by the pipeline (`LABEL_0` or `LABEL_1` for this checkpoint)
    - Confidence score associated with the predicted label

  ### **VADER Analysis:**
  - Applies the `analyze_sentiment_vader(review)` function to determine sentiment using VADER for each review.
  - VADER operates based on a lexicon and predefined rules to calculate sentiment scores:
    - It evaluates sentiment intensity (positive, negative, neutral) and assigns a compound score that summarizes the overall sentiment.
  - Results from VADER analysis are stored in the `results_vader` list, which includes:
    - Review text
    - Predicted sentiment label ('positive', 'negative', 'neutral')
    - Compound score calculated by VADER, indicating the intensity and direction of sentiment

</font>

In [69]:
# Process example reviews

In [44]:
reviews = [
    "This movie is fantastic!",
    "I hated the product, it's not worth the money.",
    "The service was okay, nothing special.",
    "The book was mediocre, didn't meet my expectations.",
    "The hotel stay was excellent, highly recommended!"
]

In [45]:
results_bert = []  # Initialize list for BERT results

In [46]:
results_vader = []  # Initialize list for VADER results

In [47]:
for review in reviews:
    # BERT
    bert_result = classifier_bert(review)[0]  # Perform sentiment analysis using BERT
    results_bert.append({"review": review, "sentiment": bert_result['label'], "score": bert_result['score']})  # Append BERT results to list

    # VADER
    sentiment_vader, score_vader = analyze_sentiment_vader(review)  # Perform sentiment analysis using VADER
    results_vader.append({"review": review, "sentiment": sentiment_vader, "score": score_vader})  # Append VADER results to list


# Step 7: <font color='purple'>Convert Results to DataFrames</font>

<font color='purple'>

- Convert results of BERT sentiment analysis to a pandas DataFrame (`df_bert`):
  - Organizes the results from BERT analysis into a structured tabular format using pandas.
  - Each row in `df_bert` corresponds to a review and includes:
    - Review text
    - Predicted label assigned by BERT (`LABEL_0` or `LABEL_1` for this checkpoint)
    - Confidence score associated with the prediction.

- Convert results of VADER sentiment analysis to a pandas DataFrame (`df_vader`):
  - Structures the results from VADER analysis into a DataFrame format for easy manipulation and analysis.
  - Each row in `df_vader` represents a review and contains:
    - Review text
    - Predicted sentiment label ('positive', 'negative', 'neutral') assigned by VADER
    - Compound score calculated by VADER, summarizing the overall sentiment intensity of the review.

</font>

In [70]:
#  Convert results to DataFrames

In [49]:
df_bert = pd.DataFrame(results_bert)  # Convert BERT results to DataFrame

In [50]:
df_vader = pd.DataFrame(results_vader)  # Convert VADER results to DataFrame
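
Since both DataFrames share the `review` column, they can also be joined into one table for side-by-side comparison. The sketch below is one way to do that with `DataFrame.merge`; the `_demo` frames are hard-coded stand-ins whose values are copied from the printed results in this notebook, and the `_bert`/`_vader` suffixes are an illustrative choice.

```python
import pandas as pd

# Stand-ins for df_bert and df_vader, using two rows from this notebook's printed output
df_bert_demo = pd.DataFrame([
    {"review": "This movie is fantastic!", "sentiment": "LABEL_0", "score": 0.545252},
    {"review": "The service was okay, nothing special.", "sentiment": "LABEL_0", "score": 0.512736},
])
df_vader_demo = pd.DataFrame([
    {"review": "This movie is fantastic!", "sentiment": "positive", "score": 0.5983},
    {"review": "The service was okay, nothing special.", "sentiment": "negative", "score": -0.0920},
])

# Join on the shared review text; suffixes disambiguate the overlapping column names
comparison = df_bert_demo.merge(df_vader_demo, on="review", suffixes=("_bert", "_vader"))
print(comparison)
```

The merged frame has one row per review with `sentiment_bert`, `score_bert`, `sentiment_vader`, and `score_vader` columns, which makes disagreements between the two tools easy to spot.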

# Step 8: <font color='blue'>Print Results</font>

<font color='purple'>

- Display BERT sentiment analysis results:
  - Print the contents of `df_bert`, one row per review, with these columns:
    - **review**: The original text of the review that was analyzed.
    - **sentiment**: The label returned by the pipeline (`LABEL_0` or `LABEL_1` for this checkpoint).
    - **score**: The confidence score associated with the predicted label.

- Display VADER sentiment analysis results:
  - Print the contents of `df_vader`, one row per review, with these columns:
    - **review**: The original review text processed by VADER.
    - **sentiment**: The predicted sentiment label ('positive', 'negative', 'neutral').
    - **score**: The compound score calculated by VADER, which summarizes the overall sentiment intensity of the review.

</font>

In [71]:
# Print results

In [52]:
print("BERT Sentiment Analysis Results:")  # Print BERT results header
print(df_bert)  # Print BERT results DataFrame

BERT Sentiment Analysis Results:
                                              review sentiment     score
0                           This movie is fantastic!   LABEL_0  0.545252
1     I hated the product, it's not worth the money.   LABEL_0  0.505604
2             The service was okay, nothing special.   LABEL_0  0.512736
3  The book was mediocre, didn't meet my expectat...   LABEL_1  0.507695
4  The hotel stay was excellent, highly recommended!   LABEL_0  0.599587


In [53]:
print("\nVADER Sentiment Analysis Results:")  # Print VADER results header
print(df_vader)  # Print VADER results DataFrame


VADER Sentiment Analysis Results:
                                              review sentiment   score
0                           This movie is fantastic!  positive  0.5983
1     I hated the product, it's not worth the money.  negative -0.7065
2             The service was okay, nothing special.  negative -0.0920
3  The book was mediocre, didn't meet my expectat...   neutral  0.0000
4  The hotel stay was excellent, highly recommended!  positive  0.7257


# Step 9: <font color='red'>Calculate and Print Accuracies (Optional)</font>

<font color='purple'>

- Compare predicted labels with manually assigned true sentiments:
  - `true_sentiments` holds one hand-assigned label ('positive', 'negative', 'neutral') per example review.
  - `accuracy_score` from `sklearn.metrics` returns the fraction of predictions that exactly match the true labels.
  - VADER's labels use the same vocabulary as `true_sentiments`, so they can be compared directly.
  - The BERT pipeline returns `LABEL_0`/`LABEL_1`, which never equal any of the text labels, so its accuracy is 0.00 regardless of what the model predicted.

</font>

In [72]:
# Calculate and print accuracies (Optional)

In [55]:
true_sentiments = ['positive', 'negative', 'neutral', 'negative', 'positive']  # Define true sentiments

In [56]:
accuracy_bert = accuracy_score(true_sentiments, df_bert['sentiment'])  # accuracy_score(y_true, y_pred); BERT labels are LABEL_0/LABEL_1, so none match the text labels

In [57]:
accuracy_vader = accuracy_score(true_sentiments, df_vader['sentiment'])  # Calculate accuracy of VADER against the true labels

In [58]:
print(f"\nAccuracy of BERT: {accuracy_bert:.2f}")  # Print accuracy of BERT


Accuracy of BERT: 0.00


In [59]:
print(f"Accuracy of VADER: {accuracy_vader:.2f}")  # Print accuracy of VADER

Accuracy of VADER: 0.60
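
To see where these two numbers come from, the exact-match rule behind `accuracy_score` can be written out by hand. The sketch below uses a hypothetical helper, `exact_match_accuracy`, with the predictions copied from the result tables printed earlier in this notebook.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


true_sentiments = ['positive', 'negative', 'neutral', 'negative', 'positive']
vader_predictions = ['positive', 'negative', 'negative', 'neutral', 'positive']
bert_predictions = ['LABEL_0', 'LABEL_0', 'LABEL_0', 'LABEL_1', 'LABEL_0']

print(f"VADER: {exact_match_accuracy(true_sentiments, vader_predictions):.2f}")  # 3 of 5 match
print(f"BERT:  {exact_match_accuracy(true_sentiments, bert_predictions):.2f}")   # no string ever matches
```

VADER agrees with the manual labels on reviews 0, 1, and 4. The BERT score of 0.00 reflects a label vocabulary mismatch, not a measurement of the model's sentiment predictions.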


# Step 10: <font color='orange'>Save DataFrames to CSV</font>

<font color='purple'>

- Save BERT Sentiment Analysis Results:
  - Use the `to_csv()` method of the pandas DataFrame (`df_bert`) to save the results of BERT sentiment analysis to a CSV file named `'bert_sentiment_analysis_results.csv'`.
  - This CSV file will store structured data including:
    - **Review text**: The original text of each review that was analyzed.
    - **Sentiment**: Labels returned by the BERT pipeline (`LABEL_0` or `LABEL_1` for this checkpoint).
    - **Score**: Confidence score associated with each prediction made by BERT.

- Save VADER Sentiment Analysis Results:
  - Utilize the `to_csv()` method of the pandas DataFrame (`df_vader`) to export the results of VADER sentiment analysis to a CSV file named `'vader_sentiment_analysis_results.csv'`.
  - The CSV file will contain structured data such as:
    - **Review text**: The original review text analyzed by VADER.
    - **Sentiment**: Predicted sentiment labels ('positive', 'negative', 'neutral') assigned by VADER.
    - **Score**: Compound score calculated by VADER, summarizing the overall sentiment intensity of each review.

- Print Confirmation:
  - Output a message once both `to_csv()` calls have returned, confirming that `'bert_sentiment_analysis_results.csv'` and `'vader_sentiment_analysis_results.csv'` were written.
  - The saved files can then be used for further analysis, visualization, or integration with other tools.

</font>

In [73]:
# Save DataFrames to CSV (Optional)

In [61]:
df_bert.to_csv('bert_sentiment_analysis_results.csv', index=False)  # Save BERT results to CSV

In [62]:
df_vader.to_csv('vader_sentiment_analysis_results.csv', index=False)  # Save VADER results to CSV

In [63]:
print("\nCSV files saved successfully.")  # Print confirmation of CSV saving


CSV files saved successfully.
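
A quick way to confirm that a saved file can be read back is a round trip through `read_csv`. The sketch below writes to an in-memory buffer instead of disk so it leaves no files behind; `df_demo` is a stand-in frame with two rows copied from the VADER results printed above.

```python
import io

import pandas as pd

# Stand-in for df_vader; note the commas inside the review text
df_demo = pd.DataFrame([
    {"review": "I hated the product, it's not worth the money.", "sentiment": "negative", "score": -0.7065},
    {"review": "The hotel stay was excellent, highly recommended!", "sentiment": "positive", "score": 0.7257},
])

buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)  # same call as in this step, but targeting memory
buffer.seek(0)                       # rewind before reading
df_loaded = pd.read_csv(buffer)

print(df_loaded)
```

Because `to_csv` quotes fields that contain the delimiter, review text with commas comes back as a single column value.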


# Project Conclusions and Final Remarks

<font color='purple'>

## <font color='blue'>Overview</font>

This project aimed to explore and compare sentiment analysis techniques using state-of-the-art models like BERT and traditional lexicon-based methods such as VADER. The goal was to analyze and classify sentiment from a set of example reviews, showcasing the capabilities and differences between these two approaches.

## <font color='blue'>Methodology</font>

### <font color='green'>Data and Preprocessing</font>
- The IMDb dataset was loaded with `load_dataset("imdb")`, but the analysis itself ran on five hand-written example reviews.
- No manual preprocessing was needed: the BERT pipeline tokenizes its input internally, and VADER takes raw text directly.

### <font color='green'>Sentiment Analysis Techniques</font>

#### <font color='orange'>BERT Sentiment Analysis</font>
- Utilized the `bert-base-uncased` checkpoint through the Transformers `sentiment-analysis` pipeline, without fine-tuning.
- Obtained a predicted label (`LABEL_0` or `LABEL_1`) and an associated confidence score for each review.
- Results were stored in a structured pandas DataFrame.

#### <font color='orange'>VADER Sentiment Analysis</font>
- Employed the VADER lexicon-based tool for sentiment analysis.
- Calculated sentiment scores (positive, negative, neutral) and compound scores for each review to determine overall sentiment intensity.
- Results were stored in a structured pandas DataFrame.

## <font color='blue'>Results and Analysis</font>

### <font color='green'>Performance Evaluation</font>
- **Accuracy Assessment**:
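  - BERT scored an accuracy of 0.00: the untuned `bert-base-uncased` head returned `LABEL_0`/`LABEL_1` with scores near 0.5, and these labels never match the text labels used as ground truth.
  - VADER scored 0.60, matching the manual label on 3 of the 5 example reviews.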

### <font color='green'>Comparison of Methods</font>
- **Strengths of BERT**:
  - BERT can contextualize language and capture nuanced sentiment, but only after it has been fine-tuned for the task.
  - The pipeline returns a confidence score with every prediction.

- **Advantages of VADER**:
  - VADER's simplicity and efficiency in sentiment analysis.
  - Suitable for quick sentiment assessments without deep contextual analysis.

## <font color='blue'>Conclusion</font>

This project showed that VADER works out of the box for quick sentiment assessments, reaching 0.60 accuracy on five example reviews, while `bert-base-uncased` produced near-random output because its classification head had not been trained. BERT's deep contextual understanding of language only pays off once the model is fine-tuned or replaced with a checkpoint already trained for sentiment analysis. With only five reviews, these numbers illustrate the workflow rather than measure either tool.

## <font color='blue'>Future Directions</font>

### <font color='green'>Model Fine-tuning and Evaluation</font>
- Future work could involve fine-tuning BERT on domain-specific data to further enhance its performance.
- Continual evaluation and comparison with emerging NLP models could provide insights into advancements in sentiment analysis.

### <font color='green'>Application and Deployment</font>
- Implementing these models in real-world applications such as customer feedback analysis or social media sentiment tracking could provide valuable insights.
- Deployment considerations, including optimization and scalability, are crucial for practical usage.

## <font color='blue'>Acknowledgements</font>

- We acknowledge the IMDb dataset and Hugging Face for providing access to models and resources.
- Special thanks to all contributors and supporters of this project.

## <font color='blue'>End Note</font>

This project explores the capabilities of advanced NLP models and underscores the importance of selecting appropriate tools for sentiment analysis tasks. By combining deep learning with traditional approaches, it lays a foundation for robust sentiment analysis solutions applicable across various domains.

</font>