# AI/ML with Python: Web Scraping & Sentiment Analysis
## Sentiment Analysis Tools

### Step 4 - Introduction to VADER

We will start by importing `nltk` (the Natural Language Toolkit), which gives us access to `SentimentIntensityAnalyzer`, the class that produces polarity scores for a piece of text (negative, neutral, positive, and compound). First, make sure `nltk` is installed on your local machine. If it isn't, open your terminal and run `pip install nltk`.

After importing `nltk`, ensure that you have the `vader_lexicon` resource downloaded. Once that is done, we import the `SentimentIntensityAnalyzer` class from `nltk.sentiment` and create an instance of it.

In [5]:
# pip3 install nltk or pip install nltk
import nltk

# Remember to download vader_lexicon if you haven't! Simply uncomment the line below and run it
# nltk.download('vader_lexicon')

from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

With our setup complete, we're now equipped to analyze the sentiment of various sentences. We will utilize the `polarity_scores` method to evaluate and display their sentiment metrics. Proceed with executing the following code to observe the breakdown of sentiment scores for each sentence provided.

In [16]:
# Sample texts that we will be using for sentiment analysis
texts = [
    "I love this product, it's absolutely amazing!",
    "This is the worst movie I have ever seen.",
    "I'm not sure how I feel about this new policy.",
    "Meh, it was okay, nothing special.",
    "Wow, this new update is fantastic! 😊"
]

for text in texts:
    scores = sia.polarity_scores(text)
    print(f"Text: {text}")
    print(f"Scores: {scores}")
    print()

Text: I love this product, it's absolutely amazing!
Scores: {'neg': 0.0, 'neu': 0.318, 'pos': 0.682, 'compound': 0.862}

Text: This is the worst movie I have ever seen.
Scores: {'neg': 0.369, 'neu': 0.631, 'pos': 0.0, 'compound': -0.6249}

Text: I'm not sure how I feel about this new policy.
Scores: {'neg': 0.197, 'neu': 0.803, 'pos': 0.0, 'compound': -0.2411}

Text: Meh, it was okay, nothing special.
Scores: {'neg': 0.421, 'neu': 0.355, 'pos': 0.225, 'compound': -0.1675}

Text: Wow, this new update is fantastic! 😊
Scores: {'neg': 0.0, 'neu': 0.342, 'pos': 0.658, 'compound': 0.8268}



The `polarity_scores` method of `SentimentIntensityAnalyzer` <b>computes sentiment scores</b> for a given piece of text. When you pass text to this method, it returns a dictionary with <b>four different scores</b>: negative, neutral, positive, and compound. Here is what each of them means:

<b>Negative:</b> This score indicates the proportion of the text that carries a negative sentiment. The value ranges from 0 to 1, where higher values indicate stronger negative sentiment.

<b>Neutral:</b> This score represents the proportion of the text that is considered neutral (lacking positive or negative sentiment). Like the negative score, this also ranges from 0 to 1.

<b>Positive:</b> This score reflects the proportion of the text that conveys a positive sentiment. It is also a value between 0 and 1, with higher values denoting stronger positive sentiment.

<b>Compound:</b> The compound score is a composite score. It is computed by summing the valence scores of the individual words in the text (after VADER's adjustments for things like negation and emphasis) and then normalizing that sum to lie between -1 (most extreme negative) and +1 (most extreme positive). It is not simply the sum of the three proportions above. This score attempts to represent the overall sentiment of the text in a single number.
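
To make the normalization step for the compound score more concrete, here is a minimal sketch of the squashing function applied to the summed word valences. The constant `alpha=15` is, to our knowledge, the default used by the VADER implementation; the function name `normalize_compound` and the example sums are our own illustration and are not taken from the lexicon.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


# Hypothetical valence sums, chosen only to show the shape of the curve
for total in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"sum = {total:+.1f} -> compound = {normalize_compound(total):+.4f}")
```

Notice that a sum of zero stays at zero, while large positive or negative sums approach +1 or -1 without ever reaching them.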

From this example, you can see how text and sentences are quantified based on how negative or positive they are. Feel free to try it out with some of your own sentences, and see its results!
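
If you want a single label rather than four numbers, a common convention is to threshold the compound score. The sketch below reuses the compound values printed above; the cut-off of 0.05 and the helper name `label_from_compound` are assumptions for illustration, so treat the threshold as something to tune for your own data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label (threshold is an assumed convention)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the output printed earlier in this step
printed_compounds = {
    "I love this product, it's absolutely amazing!": 0.862,
    "This is the worst movie I have ever seen.": -0.6249,
    "I'm not sure how I feel about this new policy.": -0.2411,
    "Meh, it was okay, nothing special.": -0.1675,
    "Wow, this new update is fantastic! 😊": 0.8268,
}

vader_labels = {text: label_from_compound(c) for text, c in printed_compounds.items()}
for text, label in vader_labels.items():
    print(f"{label:>8}: {text}")
```

With this threshold, the two hesitant sentences ("I'm not sure..." and "Meh...") end up labelled negative rather than neutral, which is a useful reminder that a lexicon-based score is only a rough guide.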

### Step 5 - Introduction to AFINN

AFINN is a lexicon-based approach: each word in the AFINN lexicon has been manually rated by humans for sentiment strength, known as its valence. For example, the word "happy" might have a positive valence of 3, while "sad" has a negative valence of -2. To score a text, a sentiment analysis algorithm looks up each word in the AFINN lexicon. If the word exists in the lexicon, its valence score contributes to the total sentiment score of the text.

The total sentiment score of a piece of text is calculated by summing the valence scores of all sentiment-bearing words found in the lexicon. If desired, the sum can be divided by the number of words in the text to give an average sentiment score per term. Before we start coding, ensure that you have `afinn` installed (for example, by running `pip3 install afinn` in your terminal). Then import `Afinn` from `afinn` and assign an instance to `afinn`, as seen in the example below.

In [17]:
# pip3 install afinn

from afinn import Afinn

# We initialize Afinn sentiment analyzer
afinn = Afinn()

Now, let's move on to examining a series of texts. We'll apply our analysis and you'll notice that each word is assigned a score from -5 to +5. However, the cumulative score for an entire sentence may exceed this range, as it represents the sum of the individual scores for all words contained in that sentence.

In [18]:
# List of sentences to analyze
texts = [
    "I really love the new design of your website!",
    "I hate waiting in long queues.",
    "This is utterly fantastic!",
    "It's raining again. This weather is depressing.",
    "I'm not sure how I feel about the new policy.",
    "I love the absolutely wonderful performance, it was simply perfect and made me incredibly happy!"
]

# Analyze the sentiment of each sentence
for text in texts:
    score = afinn.score(text)
    print(f"Text: {text}\nScore: {score}\n")

Text: I really love the new design of your website!
Score: 3.0

Text: I hate waiting in long queues.
Score: -3.0

Text: This is utterly fantastic!
Score: 4.0

Text: It's raining again. This weather is depressing.
Score: -2.0

Text: I'm not sure how I feel about the new policy.
Score: 0.0

Text: I love the absolutely wonderful performance, it was simply perfect and made me incredibly happy!
Score: 13.0



The <b>sentiment score</b> of a piece of text—such as a sentence or an entire document—is calculated by <b>summing the scores of all words</b> that appear in the text and are also present in the AFINN lexicon. In the case where a sentence has words not found in the AFINN lexicon, those words simply do not contribute to the score. 

The final score reflects the overall sentiment as quantified by the lexicon, with higher positive scores indicating more positive sentiment, scores around zero indicating neutrality, and negative scores indicating negative sentiment.
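
The summing rule is simple enough to sketch by hand. The toy lexicon below is made up for illustration (a handful of words with valences we picked ourselves rather than copied from the real AFINN list), and `toy_score` and `toy_average` are hypothetical helpers, not part of the `afinn` package.

```python
import re

# Made-up valences for illustration only; the real AFINN list is much larger
TOY_LEXICON = {"love": 3, "happy": 3, "perfect": 3, "sad": -2, "hate": -3}


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def toy_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word found in the lexicon; unknown words add 0."""
    return sum(lexicon.get(word, 0) for word in tokenize(text))


def toy_average(text, lexicon=TOY_LEXICON):
    """Length-adjusted variant: total score divided by the number of words."""
    words = tokenize(text)
    return toy_score(text, lexicon) / len(words) if words else 0.0


sentence = "I love it, it is perfect and I am happy"
print(toy_score(sentence))     # three lexicon words at +3 each
print(toy_average(sentence))   # same total spread over all ten words
print(toy_score("The table is brown"))  # no lexicon words at all
```

The first sentence scores 9, which already exceeds the per-word range of -5 to +5, while the last one scores 0 because none of its words are in the toy lexicon.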

### Step 7 - Sentiment Analysis with Naive Bayes Classifier

The objective of this exercise will be to build a Naive Bayes model that can classify text samples into either positive or negative sentiments. In this exercise, we'll be using an inbuilt dataset "movie_reviews" within the `nltk` library. As usual, ensure that you have the `nltk` and `scikit-learn` libraries installed. When you are done, proceed to import the following packages: 

In [19]:
# pip3 install scikit-learn

import nltk
from nltk.corpus import movie_reviews
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import re
import random

Now that we have imported all the packages, we can proceed to download the <b>movie_reviews</b> dataset from the `nltk` library, which is a collection of movie reviews that have been categorized as either positive or negative. 

It contains 2,000 movie reviews, with an equal number of positive and negative reviews. This balanced dataset is well suited to training and testing sentiment classifiers, in this case a Naive Bayes classifier that decides whether a new movie review is positive or negative.

In [34]:
# Download the movie review dataset from nltk
# nltk.download('movie_reviews')

# Run the code below to load the reviews and preprocess the text for modeling
documents = [(" ".join(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Split the dataset into the text and labels
texts, labels = zip(*documents)

# handle_negation preprocesses text so that negations are easier for the model to pick up.
# Negation words like "not", "no", "never", and "cannot" can completely change the sentiment of the phrase that follows them.
def handle_negation(text):
    # A simple way to handle negation: join each negation word to the word that follows it, e.g. "not good" -> "not_good"
    negation_re = re.compile(r"\b(not|no|never|cannot)\b[\s]+([a-z]+)", re.IGNORECASE)
    return negation_re.sub(lambda match: f"{match.group(1)}_{match.group(2)}", text)

# Apply the negation handling to your texts
texts = [handle_negation(text) for text in texts]

# This flattens a list of lists (or a mix of lists and strings) into a list of strings, needed for text processing tasks such as vectorization
# texts = [' '.join(text) if isinstance(text, list) else text for text in texts]

As in our past campaign, NLP with Python, the steps above clean and prepare the data for training. They separate the dataset into the texts and their associated labels (`pos` or `neg`) and apply a simple negation-handling pass to each text.

With that in place, we can use the pre-labelled `texts` and `labels` to train our model. Our next step is splitting the data and extracting features, a critical step that transforms raw text into a structured format that a machine learning model can understand and learn from. Run the following code to see it in action!
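
To see what the negation step does to a review, you can try the function on a few short strings. The sketch repeats `handle_negation` from the cell above so that it runs on its own; the sample sentences are ours.

```python
import re


def handle_negation(text):
    # Join a negation word to the word that follows it, e.g. "not good" -> "not_good"
    negation_re = re.compile(r"\b(not|no|never|cannot)\b[\s]+([a-z]+)", re.IGNORECASE)
    return negation_re.sub(lambda match: f"{match.group(1)}_{match.group(2)}", text)


samples = [
    "the plot was not good",
    "i would never recommend it",
    "a fun , clever film",  # no negation word, so it is left unchanged
]

for sample in samples:
    print(f"{sample!r} -> {handle_negation(sample)!r}")
```

Because the joined form such as `not_good` contains only word characters, a word-based vectorizer should treat it as a single token that is distinct from `good`.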

In [35]:
# Split data into training and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.25, random_state=42)

# Initialize a CountVectorizer for text vectorization
vectorizer = CountVectorizer(ngram_range=(1, 2))

# Fit and transform the training data
train_vectors = vectorizer.fit_transform(train_texts)

# Transform the test data
test_vectors = vectorizer.transform(test_texts)

We split the data into two parts: training data and test data. `train_texts` holds the texts used to train the machine learning model and `test_texts` holds the texts used to test its performance, while `train_labels` and `test_labels` hold the corresponding labels. `train_test_split()` is the function that performs the split.

Setting `test_size=0.25` means that 25% of the data is used for testing and the remaining 75% for training. Setting `random_state` fixes the random seed for experimental consistency: every time we run this code with the same input data and the same `random_state`, we get exactly the same split.
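
The idea behind a seeded split can be sketched in a few lines of plain Python. This is only an illustration of the concept, not scikit-learn's implementation: `simple_split` is a hypothetical helper, and the exact items it selects will differ from what `train_test_split` returns.

```python
import math
import random


def simple_split(items, test_size=0.25, seed=42):
    """Shuffle indices with a fixed seed, then cut off the first chunk as the test set."""
    rng = random.Random(seed)          # private generator, so the seed fully controls the shuffle
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


reviews = [f"review_{i}" for i in range(8)]
train_a, test_a = simple_split(reviews)
train_b, test_b = simple_split(reviews)

print(len(train_a), len(test_a))   # 6 training items and 2 test items
print(test_a == test_b)            # same seed, same split
```

Changing the `seed` argument gives a different but equally reproducible split.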

`CountVectorizer` is a feature extraction class from the scikit-learn library that converts a collection of text documents into a matrix of token counts, turning raw text into features that we can feed into the machine learning model.

As covered previously, this is where n-grams come in. In `ngram_range=(1, 2)`, the first element of the tuple, 1, sets the minimum n-gram size, so single words are included. The second element, 2, sets the maximum n-gram size, so pairs of consecutive words are included as well.

`fit_transform` first learns the vocabulary of the training data, deciding which tokens (here, words and word pairs) will be part of the text representation. After fitting, the vectorizer has a mapping from the tokens found in the training data to feature indices. It then transforms the text into a numerical representation: a sparse matrix in which each row corresponds to a document and each column to a token from the vocabulary. The test data is only passed through `transform`, so it reuses the vocabulary learned from the training set. With the features extracted, we will proceed next to pass them into the classifier.
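
The fit and transform steps described above can be sketched in plain Python. This is a simplified stand-in rather than scikit-learn's code: it splits on whitespace, returns dense lists instead of a sparse matrix, and the helper names `extract_ngrams`, `fit_vocabulary`, and `transform_counts` are hypothetical.

```python
from collections import Counter


def extract_ngrams(text, n_min=1, n_max=2):
    """Return all n-grams of size n_min..n_max as space-joined strings."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def fit_vocabulary(docs):
    """Learn a mapping from every n-gram seen in the training docs to a column index."""
    vocab = sorted({gram for doc in docs for gram in extract_ngrams(doc)})
    return {gram: index for index, gram in enumerate(vocab)}


def transform_counts(docs, vocabulary):
    """Count known n-grams per document; n-grams outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(g for g in extract_ngrams(doc) if g in vocabulary)
        row = [0] * len(vocabulary)
        for gram, count in counts.items():
            row[vocabulary[gram]] = count
        rows.append(row)
    return rows


train_docs = ["good movie", "bad movie"]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)
print(transform_counts(["good good film"], vocabulary))
```

In the last line, "film" and the pairs "good good" and "good film" never appeared in the training documents, so only the two occurrences of "good" are counted.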

We will initialize the Multinomial Naive Bayes classifier that we imported earlier and assign it to `classifier`. We then use its built-in `fit` method to train the model on the training vectors and labels.

In [36]:
# Initialize the Multinomial Naive Bayes classifier
classifier = MultinomialNB()

# Train the classifier
classifier.fit(train_vectors, train_labels)

By taking in both `train_vectors` (the count matrix) and `train_labels` (the labels, either `pos` or `neg`), the classifier learns the probability of each word given each label. This is done by counting the frequency of each word in documents with each label, and then calculating the likelihood of the word occurring in each class.
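
The counting described above can be sketched as a tiny multinomial Naive Bayes written from scratch. This is an illustration under simplifying assumptions (whitespace tokens, add-one smoothing via `alpha=1.0`, a four-document toy corpus we made up); `train_nb` and `predict_nb` are hypothetical helpers, not scikit-learn functions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts per label."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {word for counts in word_counts.values() for word in counts}

    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model


def predict_nb(model, doc):
    """Pick the label with the highest log prior plus summed word log likelihoods."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        known = [word for word in doc.split() if word in log_likelihood]
        scores[label] = log_prior + sum(log_likelihood[word] for word in known)
    return max(scores, key=scores.get)


toy_docs = ["great fun film", "great acting", "boring plot", "boring dull film"]
toy_labels = ["pos", "pos", "neg", "neg"]
toy_model = train_nb(toy_docs, toy_labels)

print(predict_nb(toy_model, "great film"))
print(predict_nb(toy_model, "dull plot"))
```

Here "great" appears only in positive documents and "dull" only in negative ones, so those words tip the two predictions towards `pos` and `neg` respectively.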

Once trained, the classifier can then be used to predict the sentiment labels of new, unseen texts by calling the predict method, as you will see in the following code.

In [37]:
# Predict sentiments for test data
predictions = classifier.predict(test_vectors)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.4f}")

Accuracy: 0.8180


`accuracy_score` will then calculate the accuracy of the predictions. Accuracy is a common metric for classification tasks and is defined as the proportion of true results (both true positives and true negatives) among the total number of cases examined.
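
As a quick illustration of that definition, accuracy can be computed by hand on a few made-up labels. The helper name `simple_accuracy` and the toy labels are ours, not part of the notebook's data.

```python
def simple_accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


toy_true = ["pos", "neg", "pos", "neg", "pos"]
toy_pred = ["pos", "neg", "neg", "neg", "pos"]

# Four of the five predictions match, so this prints "Accuracy: 0.8000"
print(f"Accuracy: {simple_accuracy(toy_true, toy_pred):.4f}")
```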

In [38]:
# Function to predict sentiment of a new review
def predict_sentiment(new_text):
    new_vector = vectorizer.transform([new_text])
    pred = classifier.predict(new_vector)
    return pred[0]

# Test the function
sample_review = "I absolutely loved this movie, the storyline was engaging from start to finish!"
print(f"The sentiment predicted by the model is: {predict_sentiment(sample_review)}")

The sentiment predicted by the model is: pos


Great! With our trained classifier ready, we have the capability to evaluate new reviews. The function `predict_sentiment` was crafted to convert any given text into a numerical format akin to the transformation applied to our training dataset. This numerical data is then presented to the trained classifier, which yields a sentiment prediction. From the example provided, the classifier was presented with an upbeat movie review, and it accurately returned the sentiment as positive.

However, it is important to remember that no model is perfect, and there will always be some instances where predictions are incorrect. Naive Bayes is a simple probabilistic classifier that doesn't understand context or word order; it only looks at how often each feature occurs (in our case, single words and word pairs).

It can't capture the meaning of phrases as a whole, which can lead to incorrect classifications at times. So don't be surprised if some of your inputs produce a wrong outcome. Keep in mind that the dataset contains only 2,000 reviews (1,000 pos and 1,000 neg), and only 75% of them are used for training, so a much larger dataset would likely be needed for a greater level of accuracy. Understanding and improving upon these errors is a key part of the machine-learning process.
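
The word-order limitation is easy to demonstrate with plain word counts. In this small sketch (our own example sentences, with a hypothetical `bag_of_words` helper), two sentences with opposite meanings produce exactly the same bag of words.

```python
from collections import Counter


def bag_of_words(text):
    """Count each whitespace-separated word, ignoring the order they appear in."""
    return Counter(text.lower().split())


first = "the film was good not bad"
second = "the film was bad not good"

print(bag_of_words(first))
print(bag_of_words(first) == bag_of_words(second))  # identical counts, opposite meaning
```

Adding word pairs as features and joining negations to the following word, as we did earlier, recovers some of this lost information, but only for words that sit right next to each other.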

### Step 8 - Sentiment Analysis with Logistic Regression

Similar to the Naive Bayes model, we will now make predictions with a Logistic Regression model using the same dataset. This time we also change the feature weighting: instead of raw counts we use TF-IDF, which can improve performance by down-weighting words that appear in most documents and highlighting more distinctive ones.
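
To give a feel for what TF-IDF weighting does, here is a minimal sketch in plain Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which we believe matches scikit-learn's defaults; the helper `tfidf_rows` and the three toy documents are our own.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document using smoothed IDF and L2 normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


toy_docs = ["the movie was great", "the movie was dull", "the ending was great"]
rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, "->", {term: round(weight, 3) for term, weight in row.items()})
```

Words such as "the" and "was" appear in every document and therefore receive the lowest weights, while "dull" and "ending", which each appear in only one document, receive the highest.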

We will also use n-grams (here, up to three consecutive words) to give the model more context. As with Naive Bayes, we create a `LogisticRegression()` instance and fit it to the training data before using it for prediction. Let's check it out below!

In [39]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Use TF-IDF Vectorization instead of simple counts
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.8)
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

# Initialize the Logistic Regression classifier
logistic_classifier = LogisticRegression()

# Train the classifier on the training data and labels
logistic_classifier.fit(train_vectors, train_labels)

# Predict sentiments for test data using the trained classifier
logistic_predictions = logistic_classifier.predict(test_vectors)

# Calculate accuracy of the classifier on the test data
logistic_accuracy = accuracy_score(test_labels, logistic_predictions)

print(f"Logistic Regression Accuracy: {logistic_accuracy:.4f}")

Logistic Regression Accuracy: 0.8320
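
The `min_df=5` and `max_df=0.8` arguments used in the cell above prune the vocabulary before any weighting happens: an integer `min_df` drops terms that appear in fewer than that many documents, and a fractional `max_df` drops terms that appear in more than that share of documents. The sketch below mimics this filtering on a made-up five-document corpus, using `min_df=2` because the toy corpus is so small; `filter_vocabulary` is a hypothetical helper.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms that occur in at least min_df docs and in at most a max_df share of docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc.lower().split()))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


toy_corpus = [
    "the plot was great",
    "the cast was great",
    "the plot was dull",
    "the score was forgettable",
    "the pacing felt slow",
]

print(filter_vocabulary(toy_corpus))
```

Here "the" is removed for being too common (it appears in every document), words that appear only once are removed for being too rare, and only "great", "plot", and "was" survive.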


With the model trained, we go on next to make predictions with sample texts:

In [41]:
# Function to predict sentiment of a new review
def logistic_predict_sentiment(new_text):
    new_vector = vectorizer.transform([new_text])
    pred = logistic_classifier.predict(new_vector)
    return pred[0]

# Test the function
sample_review = "The sun is shining and I'm so absolutely happy today!"
print(f"The sentiment predicted by the model is: {logistic_predict_sentiment(sample_review)}")

The sentiment predicted by the model is: pos


If you are wondering how the model decides whether the sample text is positive or negative, the code below shows what is happening under the hood. It prints the predicted probability of each class for the sample text.

In [42]:
# Check the prediction probability for a sample text
sample_prob = logistic_classifier.predict_proba(vectorizer.transform([sample_review]))
print("Prediction probability for the sample text: ", sample_prob)

Prediction probability for the sample text:  [[0.48429751 0.51570249]]


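The probability on the left represents negative sentiment, while the one on the right represents positive sentiment (the columns follow the order of `logistic_classifier.classes_`). The sentiment with the larger probability is the one the model predicts. Here the two values are close (about 0.48 versus 0.52), so the model is not very confident about this sample.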
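
The step from probabilities to a label can be sketched as follows. The probabilities are copied from the output above, and the class order `["neg", "pos"]` is an assumption based on scikit-learn sorting class labels alphabetically. For a two-class logistic regression, the positive probability is the sigmoid of a single raw score, so we can also recover that score (the log-odds) from the printed values.

```python
import math


def sigmoid(z):
    """Map a raw score (log-odds) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


classes = ["neg", "pos"]                  # assumed order of logistic_classifier.classes_
sample_prob = [0.48429751, 0.51570249]    # copied from the printed output above

# The predicted label is simply the class with the larger probability
best_index = max(range(len(classes)), key=lambda i: sample_prob[i])
predicted_label = classes[best_index]
print(f"Predicted label: {predicted_label}")

# Recover the raw score behind the positive probability and check the round trip
log_odds = math.log(sample_prob[1] / sample_prob[0])
print(f"Log-odds of 'pos': {log_odds:.4f}")
print(f"Sigmoid of log-odds: {sigmoid(log_odds):.8f}")
```

A log-odds value close to zero corresponds to a probability close to 0.5, which is another way of saying that the model is nearly undecided on this sample.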

With that, we come to the end and it's time to prepare your submission!

### Step 9 - Let's Ace Your Submissions! Preparing Your Submission!

Follow the instructions under Step 9 to complete this quest! We have provided instructions below to guide you along the way, so please refer to previous steps or check the web if you are uncertain!

In [4]:
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Download the twitter_samples dataset
# nltk.download('twitter_samples')

# Import twitter_samples dataset
from nltk.corpus import twitter_samples

# Load positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# Creating labelled data
documents = []

# Adding positive tweets
for tweet in positive_tweets:
    documents.append((tweet, "positive"))

# Adding negative tweets
for tweet in negative_tweets:
    documents.append((tweet, "negative"))

# Split the dataset into the text and labels
texts, labels = zip(*documents)

# Split data into training and test sets
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Begin text vectorization
vectorizer = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.8)

# Fit and transform the training data
X_train = vectorizer.fit_transform(texts_train)

# Transform the test data
X_test = vectorizer.transform(texts_test)

# Initialize the Logistic Regression classifier
classifier = LogisticRegression()

# Train the classifier
classifier.fit(X_train, labels_train)

# Function to predict the sentiment of a new tweet using the trained classifier
def predict_sentiment(text):
    vectorized_text = vectorizer.transform([text])
    prediction = classifier.predict(vectorized_text)[0]
    return prediction

# Test your results with the sample tweets below
sample_tweets = [
    "Absolutely loving the new update! Everything runs so smoothly and efficiently now. Great job! 👍",
    "Had an amazing time at the beach today with friends. The weather was perfect! ☀️ #blessed",
    "Extremely disappointed with the service at the restaurant tonight. Waited over an hour and still got the order wrong. 😡",
    "Feeling really let down by the season finale. It was so rushed and left too many unanswered questions. 😞 #TVShow",
    "My phone keeps crashing after the latest update. So frustrating dealing with these glitches! 😠",
]

# Test the function
for sentence in sample_tweets:
    predicted_sentiment = predict_sentiment(sentence)
    print(f"The sentiment predicted by the model is: {predicted_sentiment}")

The sentiment predicted by the model is: positive
The sentiment predicted by the model is: positive
The sentiment predicted by the model is: negative
The sentiment predicted by the model is: negative
The sentiment predicted by the model is: negative
