![Data Dunkers Banner](https://github.com/PS43Foundation/data-dunkers/blob/main/docs/top-banner.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdata-dunkers%2Fdata-dunkers-modules&branch=main&subPath=AI/sentiment-analysis.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a><a href="https://colab.research.google.com/github/data-dunkers/data-dunkers-modules/blob/main/AI/sentiment-analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Colab"/></a>

# Artificial Intelligence - Sentiment Analysis

## Objectives

Students will be able to:

- use the [nltk](https://www.nltk.org/) library to implement sentiment analysis in Python
- identify ways to extract text from HTML and check that the correct data was extracted
- learn about different ways to implement sentiment analysis depending on the context of the problem

## Introduction

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is a way to figure out how people feel about something based on their words. For example, it can help us understand if a movie review is positive or negative, or if people are excited or unhappy about a product. 

In this notebook, you'll learn how to use a Python library called NLTK to perform sentiment analysis. We’ll also explore how to get data from websites and make sure we’re using the right information. Finally, you'll see how sentiment analysis can be applied in different situations to better understand opinions and feelings expressed in text.

We'll begin by importing the necessary libraries for this notebook.

## Import Libraries

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import requests
from bs4 import BeautifulSoup

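First, let's download the **VADER** lexicon. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a general-purpose, rule-based sentiment analysis model. It was designed with social media text in mind, but it works reasonably well on many kinds of everyday English text.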

In [None]:
nltk.download('vader_lexicon')

Next, let's see how sentiment analysis works! Below, we set up three sentences that are clearly positive, neutral, and negative in tone, and see how our model interprets each one.

In [None]:
sid = SentimentIntensityAnalyzer()

sentences = [
    "I am very extremely happy today because I scored a high score on my math test! I am feeling ecstatic!",
    "I am feeling okay today because I scored an average score on my science exam. I feel decent.",
    "I am feeling very sad today because I scored a low score on my history exam today. This sucks!",
]

for sentence in sentences:
    score = sid.polarity_scores(sentence)
    print(f"Sentence: {sentence}")
    print(f"Sentiment Scores: {score}")
    print("\n")

Overall, our model appears to have assessed each sentence correctly. One thing to note, though, is that it tends to rate sentences as neutral (`neu`) more often than it swings towards `pos` or `neg`. We can keep this in mind in our future uses of this model.
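
Each result from `polarity_scores` is a dictionary with four keys: `neg`, `neu`, and `pos` give the proportion of the text that falls in each category, while `compound` is a single normalized score between -1 and 1. As a minimal sketch, the helper below turns a compound score into a label. The name `label_sentiment` is made up for this example, and the 0.05 cut-off is a commonly suggested convention for VADER rather than a fixed rule, so treat it as an adjustable assumption.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a VADER compound score (between -1 and 1) into a label.

    The default threshold of 0.05 is an assumed convention; adjust as needed.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up compound values, chosen only to show each branch
for value in (0.62, 0.0, -0.41):
    print(value, "->", label_sentiment(value))
```

With the analyzer above, this could be called as `label_sentiment(sid.polarity_scores(sentence)["compound"])`.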

## Extracting Information - Web Scraping

**Web scraping** is a technique for automatically gathering data from websites.

It often involves using tools or scripts to navigate websites to pull out specific data like text, images, or links. 

In our case, we'll use [ASAP Sports](https://www.asapsports.com/) to extract interviews from the 2024 NBA Finals between the Boston Celtics and the Dallas Mavericks. Let's start with the interview for the Dallas Mavericks, the team that lost the championship.

**Note:** For your own projects, you *DO NOT* have to use web scraping to extract information. Web scraping requires some understanding of HTML, which is separate from Python. Instead, you can simply copy and paste the text you'd like to analyze directly into Python as a string.

In [None]:
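luka_url = "https://www.asapsports.com/show_interview.php?id=198323"

# Download the page and stop early if the request failed
response = requests.get(luka_url)
response.raise_for_status()
luka_html = response.content

# Parse the HTML and collect every paragraph (<p>) element
soup = BeautifulSoup(luka_html, "html.parser")
luka_paragraphs = soup.find_all('p')

# Keep only the text of each paragraph and join the pieces with newlines
temp = [para.get_text(strip=True) for para in luka_paragraphs]
luka_info_combined = "\n".join(temp)

print(luka_info_combined)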
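
Before analyzing scraped text, it is worth checking that the page returned what we expected. The sketch below counts how many lines start with a question marker or a speaker label. The function `summarize_transcript` and the sample text are invented for illustration, and the `Q.` and `NAME:` prefixes are assumptions about how the transcript is laid out.

```python
def summarize_transcript(text, speaker_label, question_prefix="Q."):
    """Count lines, questions, and answers to sanity-check scraped text."""
    lines = [line for line in text.split("\n") if line.strip()]
    return {
        "lines": len(lines),
        "questions": sum(line.startswith(question_prefix) for line in lines),
        "answers": sum(line.startswith(speaker_label) for line in lines),
    }


# Invented sample text standing in for a scraped interview
sample = "Q. How do you feel?\nPLAYER ONE: Tired.\nQ. What comes next?\nPLAYER ONE: Rest."
summary = summarize_transcript(sample, "PLAYER ONE:")
print(summary)
```

Running `summarize_transcript(luka_info_combined, "LUKA DONCIC:")` and seeing zero questions or answers would suggest that the request failed or that the page layout differs from what the code expects.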

## Sentiment Analysis

Now that we've obtained the interview for the Dallas Mavericks (primarily Luka Dončić), let's perform sentiment analysis on his words.

In [None]:
luka_score = sid.polarity_scores(luka_info_combined)
luka_score

Surprisingly, our model has rated Luka's speech as mostly neutral, and slightly more positive than negative.

However, our text also includes the interviewers' questions. Let's separate the questions from Luka's own answers to get a better understanding of his speech.

In [None]:
# Split the transcript into the interviewers' questions and Luka's answers

def separate_text_luka(text):
    interviewer_text = []
    luka_text = []

    lines = text.split('\n')
    current_speaker = None

    for line in lines:
        if line.startswith('Q.'):
            current_speaker = 'interviewer'
            interviewer_text.append(line[2:].strip())
        elif line.startswith('LUKA DONCIC:'):
            current_speaker = 'luka_doncic'
            luka_text.append(line[len('LUKA DONCIC:'):].strip())
        else:
            if current_speaker == 'interviewer':
                interviewer_text[-1] += ' ' + line.strip()
            elif current_speaker == 'luka_doncic':
                luka_text[-1] += ' ' + line.strip()

    return '\n'.join(interviewer_text), '\n'.join(luka_text)

interviewer, luka_doncic = separate_text_luka(luka_info_combined)

print("Interviewer:")
print(interviewer)
print("\nLuka Dončić:")
print(luka_doncic)

Now that we've separated Luka's text, let's perform sentiment analysis on both texts.

In [None]:
luka_interviewer = sid.polarity_scores(interviewer)
print(f"Luka Interviewer Score: {luka_interviewer}")

luka_individual = sid.polarity_scores(luka_doncic)
print(f"Luka Score: {luka_individual}")

Looking at both texts, we see that neutrality still dominates. Luka's score is slightly more negative than his interviewer's, but the difference appears too small to be considered meaningful.

Let's also get the interview for the Boston Celtics, the team that won the NBA championship.
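
VADER was designed with short texts in mind, so scoring a long transcript as a single block can hide how individual answers differ. A minimal sketch of an alternative is to score each answer on its own and then average the `compound` values. Here `average_compound` is a hypothetical helper, and `toy_scorer` is a stand-in used only so the example runs without NLTK; in this notebook, `sid.polarity_scores` would be passed instead.

```python
def average_compound(texts, score_fn):
    """Score each non-empty text separately and return the mean compound score."""
    scores = [score_fn(text)["compound"] for text in texts if text.strip()]
    if not scores:
        return 0.0
    return sum(scores) / len(scores)


def toy_scorer(text):
    """Stand-in scorer for illustration only, not a real sentiment model."""
    lowered = text.lower()
    if "great" in lowered:
        return {"compound": 1.0}
    if "bad" in lowered:
        return {"compound": -1.0}
    return {"compound": 0.0}


# Invented answers, not taken from a real interview
answers = ["We played great.", "That was a bad quarter.", "Great crowd tonight."]
print(average_compound(answers, toy_scorer))
```

In the notebook this could be called as `average_compound(luka_doncic.split("\n"), sid.polarity_scores)`, since `separate_text_luka` joins the answers with newlines.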

In [None]:
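jayson_tatum_url = "https://www.asapsports.com/show_interview.php?id=198332"

# Download the page and stop early if the request failed
response = requests.get(jayson_tatum_url)
response.raise_for_status()
tatum_html = response.content

# Parse the HTML and collect every paragraph (<p>) element
soup = BeautifulSoup(tatum_html, "html.parser")
tatum_paragraphs = soup.find_all('p')

# Keep only the text of each paragraph and join the pieces with newlines
temp = [para.get_text(strip=True) for para in tatum_paragraphs]
tatum_info_combined = "\n".join(temp)

print(tatum_info_combined)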

In [None]:
tatum_score = sid.polarity_scores(tatum_info_combined)
tatum_score

Like Luka, Jayson Tatum appears to be neutral in tone overall. Let's also separate his interview to get a more accurate analysis of his sentiment.

In [None]:
def separate_text_tatum(text):
    interviewer_text = []
    tatum_text = []

    lines = text.split('\n')
    current_speaker = None

    for line in lines:
        if line.startswith('Q.'):
            current_speaker = 'interviewer'
            interviewer_text.append(line[2:].strip())
        elif line.startswith('JAYSON TATUM:'):
            current_speaker = 'jayson_tatum'
            tatum_text.append(line[len('JAYSON TATUM:'):].strip())
        else:
            if current_speaker == 'interviewer':
                interviewer_text[-1] += ' ' + line.strip()
            elif current_speaker == 'jayson_tatum':
                tatum_text[-1] += ' ' + line.strip()

    return '\n'.join(interviewer_text), '\n'.join(tatum_text)

In [None]:
tatum_interviewer, tatum_individual = separate_text_tatum(tatum_info_combined)

print("Interviewer:")
print(tatum_interviewer)
print("\nJayson Tatum:")
print(tatum_individual)

In [None]:
tatum_interviewer_score = sid.polarity_scores(tatum_interviewer)
print(f"Tatum Interviewer Score: {tatum_interviewer_score}")

tatum_individual_score = sid.polarity_scores(tatum_individual)
print(f"Tatum Score: {tatum_individual_score}")

Looking at our separated scores, we see that both are still largely neutral. However, Tatum does appear to have a larger `pos` score than his interviewer. This makes sense, as we would expect Tatum to sound happier in his interview given that his team won the NBA championship.
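
The `separate_text_luka` and `separate_text_tatum` functions differ only in the speaker label, so they could be merged into one. The sketch below is one way to do that. The names `separate_text` and `speaker_label` are chosen for this example, the sample text is invented, and the code assumes the same `Q.` and `NAME:` layout as the transcripts above.

```python
def separate_text(text, speaker_label, question_prefix="Q."):
    """Split a transcript into interviewer text and one speaker's text."""
    interviewer_text = []
    speaker_text = []
    current = None  # the list that continuation lines get appended to

    for line in text.split("\n"):
        line = line.strip()
        if line.startswith(question_prefix):
            current = interviewer_text
            current.append(line[len(question_prefix):].strip())
        elif line.startswith(speaker_label):
            current = speaker_text
            current.append(line[len(speaker_label):].strip())
        elif current is not None and line:
            # Continuation of the previous question or answer
            current[-1] += " " + line

    return "\n".join(interviewer_text), "\n".join(speaker_text)


# Invented sample text, not taken from a real interview
sample = (
    "Q. How was the game?\n"
    "PLAYER ONE: It was tough.\n"
    "We kept fighting.\n"
    "Q. What is next?\n"
    "PLAYER ONE: Rest."
)
questions, answers = separate_text(sample, "PLAYER ONE:")
print(questions)
print(answers)
```

Calls such as `separate_text(luka_info_combined, "LUKA DONCIC:")` and `separate_text(tatum_info_combined, "JAYSON TATUM:")` could then stand in for the two separate functions.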

## Conclusion

**Sentiment analysis** is a powerful tool for understanding people's emotions and opinions from text. Using tools from NLTK, we can automatically estimate whether a piece of text expresses positive, negative, or neutral sentiment.

This ability to analyze and interpret large volumes of text data can provide valuable insights for various applications. As you explore sentiment analysis, remember that it can be tailored to different problems, from customer feedback to social media monitoring.