# Understanding Semantic Analysis

In this notebook, we'll look at how we can use a computer to perform *Semantic Analysis*: looking at some text and extracting information relating to the **meaning** of that text.

We're going to try to grasp what's really happening and why it works, to help us form accurate intuitions about how to use these tools.

We're going to look at a very common application of semantic analysis called *Sentiment Analysis* - trying to determine what opinions are being expressed about a thing.

First, we need to install some software libraries.

***sentence-transformers*** is a library that gives us models trained to turn sentences into a form that machines can work with. A model converts each sentence into a **vector**, which is basically a kind of position in language space. In that language space, texts that relate to similar things are closer to each other, and texts about totally different things are further apart.

***scikit-learn*** is a library that gives us the math tools to compare those positions. Language space has lots of dimensions! That's difficult for humans to visualise, since our brains are set up to deal with three-dimensional space, but in maths it all basically works the same way whether you have 3 dimensions or 384 dimensions.
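To make that last point a bit more concrete, here is a minimal sketch using only Python's standard library. The function name `euclidean_distance` and the example points are made up for illustration; they are not real embeddings. The thing to notice is that the same function handles 3 dimensions and 384 dimensions without any changes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of dimensions."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two points in ordinary 3-dimensional space.
point_3d_a = [0.0, 0.0, 0.0]
point_3d_b = [1.0, 2.0, 2.0]
print(euclidean_distance(point_3d_a, point_3d_b))  # 3.0

# Two made-up points in 384-dimensional space (illustrative values only).
point_384d_a = [0.0] * 384
point_384d_b = [0.5] * 384
print(euclidean_distance(point_384d_a, point_384d_b))
```

Nothing in the function cares how long the lists are, which is the sense in which the maths "works the same way" in any number of dimensions.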

In [7]:
%pip install sentence-transformers scikit-learn -q

We're going to analyse this small set of product reviews, and use some test phrases to compare against the reviews.

In [29]:
product_reviews = [
    "This product is excellent, very happy with my purchase.",
    "It broke after a week.",
    "Good value for money, works as expected.",
    "Absolutely terrible, a complete waste of money.",
    "I love this item, it's perfect for my needs.",
    "It's okay, nothing special but it gets the job done.",
    "Very durable and well-made, highly recommend."
]

test_phrases = [
    "the product is good quality",
    "the product is poor quality"
]

Here we load our model for turning reviews into positions.

In [9]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

print("Sentence Transformer model 'all-MiniLM-L6-v2' loaded successfully.")

Sentence Transformer model 'all-MiniLM-L6-v2' loaded successfully.


And here we use the model to get the positions of each of our reviews in our language space.

---

*Each "position" is a* ***vector*** *(a direction and a distance) describing where a text sits, compared to the centre of our language space.*

*A vector used for this purpose - to capture relationships and meaning - is also called an* ***embedding***.

In [30]:
product_review_embeddings = model.encode(product_reviews)
test_phrase_embeddings = model.encode(test_phrases)

print(f"Generated embeddings for {len(product_reviews)} reviews")
print(f"Generated embeddings for {len(test_phrases)} test phrases")

Generated embeddings for 7 reviews
Generated embeddings for 2 test phrases


Now we compare the positions of each of our reviews to the positions of our test phrases. We're looking at how similar their vectors are. This is called *Vector Similarity*.

And we'll be using a particular kind of vector similarity technique here called *Cosine Similarity*. Instead of looking at how *close* our positions are, we're looking to see if the vectors are *pointing in similar directions*.

The reason for this is that two positions can be far apart but similar in meaning. If we imagine two reviews:

    "I like the product packaging"
    "I LOVE THE PRODUCT PACKAGING ITS MY FAVORITE THING IN THE WORLD
     I LOVE IT SO MUCH I HAVE SOLD ALL MY WORLDLY POSSESSIONS
     IN ORDER TO OBTAIN MORE OF THE PRODUCT PACKAGING
     WHICH I LOVE"

These two reviews have similar meaning, but the second is far greater in *magnitude*. It will be *further out* in our language space, but both reviews should be in a *similar direction* from the centre of our space.

Here's a visual aid to help us understand these relationships of direction vs magnitude in our space:

![A plot visualising how vector metrics interpret similarity. Euclidean distance groups "cat" and "dog" because they are short, related words. Cosine similarity aligns "cat" with its long definition "felis cattus, the domestic cat" because they share the same semantic direction, demonstrating why angular similarity is better for matching meaning.](https://raw.githubusercontent.com/SimonPurdie/understanding-semantic-analysis/refs/heads/master/assets/semantic_cats_and_dogs.svg)
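Here is a small sketch of the same idea in numbers, using only the standard library. The 2-dimensional vectors and the function names (`magnitude`, `cosine_of_angle`, `euclidean_distance`) are invented for illustration and are not real embeddings. `intense` points the same way as `mild` but is ten times longer, while `different` is about the same length as `mild` but points off at a right angle.

```python
import math

def magnitude(v):
    """Length of a vector: its distance from the centre of the space."""
    return math.sqrt(sum(x * x for x in v))

def cosine_of_angle(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (magnitude(a) * magnitude(b))

def euclidean_distance(a, b):
    """Straight-line distance between the two positions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 2-D vectors, for illustration only.
mild = [1.0, 2.0]
intense = [10.0, 20.0]    # same direction as `mild`, ten times the magnitude
different = [2.0, -1.0]   # similar magnitude to `mild`, perpendicular direction

print(round(cosine_of_angle(mild, intense), 4))       # 1.0
print(round(cosine_of_angle(mild, different), 4))     # 0.0
print(round(euclidean_distance(mild, intense), 2))    # 20.12
print(round(euclidean_distance(mild, different), 2))  # 3.16
```

Measured by distance, `mild` looks closer to `different`. Measured by direction, `mild` lines up with `intense`, which is the behaviour we want when matching meaning.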

In [32]:
from sklearn.metrics.pairwise import cosine_similarity

cosine_scores = cosine_similarity(product_review_embeddings, test_phrase_embeddings)
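The result of `cosine_similarity` is a grid of numbers with one row per review and one column per test phrase. The sketch below mimics that layout with a hand-written helper, `pairwise_cosine`, which is a hypothetical stand-in and not scikit-learn's implementation. The toy 2-dimensional vectors are also invented for illustration.

```python
import math

def cosine_of_angle(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_cosine(rows, columns):
    """Build a grid: one row per vector in `rows`, one column per vector in `columns`."""
    return [[cosine_of_angle(r, c) for c in columns] for r in rows]

# Toy stand-ins for embeddings (illustrative values only).
toy_reviews = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]

toy_scores = pairwise_cosine(toy_reviews, toy_anchors)

for row in toy_scores:
    print([round(value, 4) for value in row])
# [1.0, 0.0]
# [0.0, 1.0]
# [0.7071, 0.7071]
```

With 3 toy reviews and 2 toy anchors we get 3 rows of 2 numbers. In the same way, our 7 reviews and 2 test phrases give 7 rows of 2 scores.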

So now we have simple numeric values showing how close in meaning our reviews are to our two test phrases, which act as our anchors:

    the product is good quality
    the product is poor quality

And we can use the difference between those values to determine which of our anchors the review is closer to, extracting the sentiment from it.

---

It would also be possible to use our values to check reviews were sufficiently *relevant* to either anchor, in case our dataset included reviews like:

    stop asking me questions
    siri play agadoo by black lace
    purple monkey dishwasher

but for this notebook we'll just assume our reviews are highly relevant and comprehensible, as I'm sure most online reviews would be.
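If we did want that relevance check, one possible shape for it is sketched below. This is an assumption-laden illustration rather than part of the notebook's method: the function name `classify_with_relevance`, the `min_relevance` threshold of 0.2, and the example scores are all hypothetical, and a sensible threshold would need tuning against real data.

```python
def classify_with_relevance(good_score, poor_score, min_relevance=0.2):
    """Label a review, or flag it when it is not close to either anchor.

    `min_relevance` is a hypothetical cut-off chosen for illustration.
    """
    # If the review is not similar to either anchor, don't guess a sentiment.
    if max(good_score, poor_score) < min_relevance:
        return "Irrelevant"
    # Otherwise fall back to the difference between the two similarities.
    if good_score - poor_score > 0:
        return "Positive"
    return "Negative"

# Made-up similarity scores, for illustration only.
print(classify_with_relevance(0.55, 0.40))  # Positive
print(classify_with_relevance(0.30, 0.45))  # Negative
print(classify_with_relevance(0.05, 0.08))  # Irrelevant
```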

In [33]:
sentiment_scores = []

for scores in cosine_scores:
    good_quality_score = scores[0]
    bad_quality_score = scores[1]

    # We calculate the score by finding the difference.
    # A positive number means it's leaning toward "Good", negative toward "Bad".
    score_diff = good_quality_score - bad_quality_score
    sentiment_scores.append(score_diff)

print("--- Sentiment Analysis Results ---")

for i, review in enumerate(product_reviews):
    current_score = sentiment_scores[i]

    # If the score is greater than 0, it's on the 'Positive' side of language space
    if current_score > 0:
        label = "Positive"
    else:
        label = "Negative"

    print(f"Review: {review}")
    print(f"Score: {current_score:.4f}")
    print(f"Result: {label}\n")

--- Sentiment Analysis Results ---
Review: This product is excellent, very happy with my purchase.
Score: 0.1565
Result: Positive

Review: It broke after a week.
Score: -0.0753
Result: Negative

Review: Good value for money, works as expected.
Score: 0.0713
Result: Positive

Review: Absolutely terrible, a complete waste of money.
Score: -0.0625
Result: Negative

Review: I love this item, it's perfect for my needs.
Score: 0.1284
Result: Positive

Review: It's okay, nothing special but it gets the job done.
Score: 0.0893
Result: Positive

Review: Very durable and well-made, highly recommend.
Score: 0.2622
Result: Positive



We can see the model successfully picked up on the sentiment in each review!

Of course, this method isn't perfect. Our results depend a lot on which anchor phrases we chose - different test phrases would give different scores.
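One way we might soften that dependence, offered here as an idea rather than something this notebook does, is to use several anchor phrases for each side and average the similarities. The sketch below assumes we already have a review's similarity to each anchor; the function name `sentiment_from_anchor_sets` and all the numbers are invented for illustration.

```python
from statistics import mean

def sentiment_from_anchor_sets(positive_similarities, negative_similarities):
    """Average similarity to the positive anchors minus average similarity to the negative ones."""
    return mean(positive_similarities) - mean(negative_similarities)

# Made-up similarities between one review and three anchors per side.
similarity_to_positive_anchors = [0.50, 0.40, 0.45]
similarity_to_negative_anchors = [0.30, 0.35, 0.25]

score = sentiment_from_anchor_sets(
    similarity_to_positive_anchors,
    similarity_to_negative_anchors,
)
print(round(score, 4))  # 0.15
```

Averaging means no single choice of wording decides the result on its own, although the scores still depend on the set of phrases we pick.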

Some scores don't completely match my intuitions either. I thought:

    This product is excellent, very happy with my purchase.

sounded a bit more enthusiastic and positive than

    Very durable and well-made, highly recommend.

but the second scored much higher (0.2622 against 0.1565). So the way the model parses meaning isn't necessarily the way I would.

And real language usage can get very messy and complicated. What about sarcasm or mixed feelings? Those are much trickier!

So this technique has imperfections, but it can also be an extremely powerful tool for extracting meaning from large natural language datasets.