![DataDunkers.ca Banner](https://github.com/Data-Dunkers/lessons/blob/main/images/top-banner.jpg?raw=true)

# Artificial Intelligence - Sentiment Analysis

## Objectives

Students will be able to:

- use the [nltk](https://www.nltk.org/) library to implement sentiment analysis in Python
- identify ways to extract text from HTML and check that the correct data is extracted
- describe different ways to apply sentiment analysis depending on the context of the problem

## Introduction

[Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is a way to figure out how people feel about something based on their words. For example, it can help us understand if a movie review is positive or negative, or if people are excited or unhappy about a product.

In this notebook, you'll learn how to use a Python library called NLTK to perform sentiment analysis. We’ll also explore how to get data from websites and make sure we’re using the right information. Finally, you'll see how sentiment analysis can be applied in different situations to better understand opinions and feelings expressed in text.

We'll begin by importing the necessary libraries and downloading a sentiment analysis model.

## Import Libraries

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import requests
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
nltk.download('vader_lexicon')

Now we can try some sentiment analysis. Below are three sentences that are happy, neutral, and sad. Let's see how our model interprets each one.

In [None]:
sid = SentimentIntensityAnalyzer()

sentences = [
    "I am very extremely happy today because I scored a high score on my math test! I am feeling ecstatic!",
    "I am feeling okay today because I scored an average score on my science exam. I feel decent.",
    "I am feeling very sad today because I scored a low score on my history exam today. This sucks!"
    ]

for sentence in sentences:
    print(sentence)
    print(f"Sentiment Scores: {sid.polarity_scores(sentence)}")
    print("---")

Overall, our model appears to have assessed each sentence correctly. Note, though, that it tends to give more weight to neutral (`neu`) than to `pos` or `neg`, so scores often look fairly neutral. We can keep this in mind when we use the model later.
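Besides `neg`, `neu`, and `pos`, the dictionary returned by `polarity_scores` also has a `compound` entry, a single overall score between -1 and 1. Below is a minimal sketch of turning that score into a label. The helper name `label_from_scores` is hypothetical, the example dictionaries are hand-written rather than real model output, and the 0.05 cut-off is a commonly used convention, not something fixed by this lesson.

```python
def label_from_scores(scores, threshold=0.05):
    """Turn a VADER-style score dictionary into a simple label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dictionaries in the same shape that polarity_scores returns
example_scores = [
    {"neg": 0.0, "neu": 0.6, "pos": 0.4, "compound": 0.8},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.7},
]

for scores in example_scores:
    print(label_from_scores(scores))
```

In the notebook, this could be called as `label_from_scores(sid.polarity_scores(sentence))`.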

## Book Chapters

Next we will try some sentiment analysis on a book that we will download from [Project Gutenberg](https://www.gutenberg.org). We are going to use a public domain fiction book about basketball, called [The Girls of Central High at Basketball; Or, The Great Gymnasium Mystery](https://www.gutenberg.org/ebooks/37912).

In [None]:
gutenberg_text_link = 'https://www.gutenberg.org/cache/epub/37912/pg37912.txt'

r = requests.get(gutenberg_text_link) # get the online book file
r.encoding = 'utf-8' # specify the type of text encoding in the file
book = r.text.split('***')[2] # get the part after the header
book = book.replace("’","'").replace("“",'"').replace("”",'"') # replace any 'smart quotes'
book_title = r.text[r.text.index('Title:')+7:r.text.index('Author:')-4] # find the book title

chapter_list = [] # create a list to hold the chapter texts
for chapter in book.split('CHAPTER'):
    if len(chapter)>500: # so that we are getting actual book chapters
        chapter_text = chapter.replace('\r',' ').replace('\n',' ') # delete the 'new line' characters
        chapter_list.append(chapter_text) # add the chapter to the list
chapters = pd.DataFrame(chapter_list, columns=['Chapter Text']) # create a data frame from the list
chapters['Chapter'] = chapters.index+1 # add a column with the chapter number
chapters = chapters[['Chapter', 'Chapter Text']] # reorder the columns
chapters['Chapter Length'] = chapters['Chapter Text'].apply(len) # add a column with the length of each chapter
chapters
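To see why the cell above takes item `[2]` after splitting on `***`, here is a small sketch that uses a made-up miniature file in place of the real download. The `sample` text is invented for illustration and only imitates the general layout of a Project Gutenberg file.

```python
# Invented miniature "book" that imitates the layout of a Gutenberg text file
sample = (
    "Title: A Tiny Book\n\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***\n"
    "CHAPTER I\nThe game began.\n"
    "CHAPTER II\nThe game ended.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A TINY BOOK ***\n"
    "License text"
)

parts = sample.split("***")
# parts[0] = header, parts[1] = start marker, parts[2] = book body,
# parts[3] = end marker, parts[4] = footer
body = parts[2]

# Split the body on the word CHAPTER and drop the empty leftover pieces
chapter_texts = [piece.strip() for piece in body.split("CHAPTER") if piece.strip()]

print(len(parts))
print(chapter_texts)
```

In the lesson's cell, the `len(chapter) > 500` check plays the role of the `if piece.strip()` filter here: it drops pieces that are too short to be real chapters.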

Now that we have the book downloaded and split into chapters, we can calculate the sentiment scores for each chapter.

In [None]:
# score each chapter once, then copy each score into its own column
chapter_scores = chapters['Chapter Text'].apply(sid.polarity_scores)
chapters['Negative'] = chapter_scores.apply(lambda scores: scores['neg'])
chapters['Neutral'] = chapter_scores.apply(lambda scores: scores['neu'])
chapters['Positive'] = chapter_scores.apply(lambda scores: scores['pos'])

chapters
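Scoring a whole chapter as one long string is the simplest option. Another option, sketched below, is to score each sentence separately and average the results. This is only an illustration: `mean_compound` and `toy_scorer` are hypothetical names, the regular-expression sentence splitter is deliberately naive, and `toy_scorer` is a stand-in (not VADER) so that the example runs without NLTK.

```python
import re
from statistics import mean


def mean_compound(text, scorer):
    """Average the 'compound' score over the sentences in text."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return mean(scorer(sentence)["compound"] for sentence in sentences)


def toy_scorer(sentence):
    """Stand-in scorer (NOT VADER): +0.5 for 'won', -0.5 for 'lost'."""
    lowered = sentence.lower()
    if "won" in lowered:
        return {"compound": 0.5}
    if "lost" in lowered:
        return {"compound": -0.5}
    return {"compound": 0.0}


print(mean_compound("We won! Then we lost. We went home.", toy_scorer))
```

In the notebook, the real model could be passed in instead of the stand-in, for example `chapters['Chapter Text'].apply(lambda text: mean_compound(text, sid.polarity_scores))`.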

Now we can visualize the sentiment by chapter.

In [None]:
px.line(chapters, x='Chapter', y=['Negative','Positive'], title=f'Sentiment Analysis of {book_title}')

## Web Scraping

**Web scraping** is a technique for automatically gathering data from websites.

It often involves using tools or scripts to navigate pages and pull out specific data such as text, images, or links.

In our case, we'll use [ASAP Sports](https://www.asapsports.com/) to extract interviews from the 2024 NBA Finals between the Boston Celtics and the Dallas Mavericks. Let's start with the interview for the Dallas Mavericks, the team that lost the championship.

**Note:** For your own projects, you *DO NOT* have to use web scraping to extract information. Web scraping involves understanding HTML, which is separate from Python. Instead, you can simply copy and paste the text you'd like to analyze directly into Python.
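Pasting text directly can be as simple as the sketch below. The interview text here is made up for illustration. A triple-quoted string keeps the line breaks from the pasted text, and the `split`/`join` step collapses them into single spaces.

```python
# Made-up text standing in for something copied from a web page
pasted_text = """
Q. How do you feel about the season?
We worked hard all year, and I am proud of this team.
"""

# Collapse line breaks and repeated spaces into single spaces
cleaned = " ".join(pasted_text.split())
print(cleaned)
```

The `cleaned` string can then be passed to `sid.polarity_scores` like any other text in this notebook.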

In [None]:
luka_url = "https://www.asapsports.com/show_interview.php?id=198323"

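response = requests.get(luka_url) # download the interview page
response.raise_for_status() # stop with an error if the download failed
luka_html = response.content
soup = BeautifulSoup(luka_html, "html.parser") # parse the HTML
luka_paragraphs = soup.find_all('p') # find every paragraph (<p>) tag
temp = [para.get_text(strip=True) for para in luka_paragraphs] # keep only the text of each paragraph
luka_info_combined = "\n".join(temp) # put each paragraph on its own line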
print(luka_info_combined)
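Before analyzing scraped text, it helps to confirm that the page returned what we expected. A minimal sketch of such a check is below. The helper `looks_like_transcript`, its `min_length` default, and the sample strings are all hypothetical. The markers it looks for (`Q.` for questions and a speaker tag in capitals) are assumptions based on how this lesson splits the interview text.

```python
def looks_like_transcript(text, speaker_tag, min_length=200):
    """Rough check that scraped text contains questions and the expected speaker."""
    lines = text.split("\n")
    has_questions = any(line.startswith("Q.") for line in lines)
    has_speaker = any(line.startswith(speaker_tag) for line in lines)
    return len(text) >= min_length and has_questions and has_speaker


# Made-up samples: one shaped like a transcript, one like an error page
good_sample = "Q. How do you feel?\nLUKA DONCIC: Tired.\n" * 10
bad_sample = "404 Not Found"

print(looks_like_transcript(good_sample, "LUKA DONCIC:"))
print(looks_like_transcript(bad_sample, "LUKA DONCIC:"))
```

In the notebook, a call such as `looks_like_transcript(luka_info_combined, 'LUKA DONCIC:')` returning `False` would be a hint to print the text and inspect it before going further.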

## Sentiment Analysis

Now that we've obtained the interview for the Dallas Mavericks (primarily Luka Dončić), let's perform some sentiment analysis on it.

In [None]:
luka_score = sid.polarity_scores(luka_info_combined)
luka_score

Surprisingly, our model rates the interview as mostly neutral, and slightly more positive than negative.

However, the text also includes the interviewers' questions. Let's separate the questions from Luka's answers to get a better understanding of what he said.

In [None]:
interviewer_text = []
luka_text = []
current_speaker = None

lines = luka_info_combined.split('\n')
for line in lines:
    if line.startswith('Q.'):
        current_speaker = 'interviewer'
        interviewer_text.append(line[2:].strip())
    elif line.startswith('LUKA DONCIC:'):
        current_speaker = 'luka_doncic'
        luka_text.append(line[len('LUKA DONCIC:'):].strip())
    else:
        if current_speaker == 'interviewer':
            interviewer_text[-1] += ' ' + line.strip()
        elif current_speaker == 'luka_doncic':
            luka_text[-1] += ' ' + line.strip()

interviewer_text = ' '.join(interviewer_text)
luka_text = ' '.join(luka_text)

print("Interviewer:")
print(interviewer_text)
print("Luka Dončić:")
print(luka_text)

Now that we've separated Luka's text, let's perform sentiment analysis on both texts.

In [None]:
luka_interviewer = sid.polarity_scores(interviewer_text)
print(f"Luka Interviewer Score: {luka_interviewer}")

luka_individual = sid.polarity_scores(luka_text)
print(f"Luka Score: {luka_individual}")

Looking at both texts, we still see a dominant presence of neutrality. Luka's score is slightly more negative than his interviewer's, but the difference appears too small to be considered major.
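To put a number on "slightly more negative", the two score dictionaries can be subtracted entry by entry. The sketch below assumes a hypothetical helper, `score_difference`. The example numbers are made up and are not the actual output of the cell above.

```python
def score_difference(first, second, digits=3):
    """Return first minus second for every key the two score dictionaries share."""
    return {
        key: round(first[key] - second[key], digits)
        for key in first
        if key in second
    }


# Made-up scores in the same shape that polarity_scores returns
made_up_answers = {"neg": 0.08, "neu": 0.80, "pos": 0.12}
made_up_questions = {"neg": 0.05, "neu": 0.82, "pos": 0.13}

print(score_difference(made_up_answers, made_up_questions))
```

In the notebook, `score_difference(luka_individual, luka_interviewer)` would show how far apart the two sets of scores are.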

Let's also get the interview text for the Boston Celtics, the team that won the NBA championship.

In [None]:
jayson_tatum_url = 'https://www.asapsports.com/show_interview.php?id=198332'

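response = requests.get(jayson_tatum_url) # download the interview page
response.raise_for_status() # stop with an error if the download failed
tatum_html = response.content
soup = BeautifulSoup(tatum_html, "html.parser") # parse the HTML
tatum_paragraphs = soup.find_all('p') # find every paragraph (<p>) tag
temp = [para.get_text(strip=True) for para in tatum_paragraphs] # keep only the text of each paragraph
tatum_info_combined = "\n".join(temp) # put each paragraph on its own line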
print(tatum_info_combined)

tatum_score = sid.polarity_scores(tatum_info_combined)
tatum_score

Like Luka's interview, Jayson Tatum's appears to be neutral overall. Let's separate his interview as well to get a more accurate analysis of his own words.

In [None]:
interviewer_text = []
tatum_text = []
current_speaker = None

lines = tatum_info_combined.split('\n')
for line in lines:
    if line.startswith('Q.'):
        current_speaker = 'interviewer'
        interviewer_text.append(line[2:].strip())
    elif line.startswith('JAYSON TATUM:'):
        current_speaker = 'jayson_tatum'
        tatum_text.append(line[len('JAYSON TATUM:'):].strip())
    else:
        if current_speaker == 'interviewer':
            interviewer_text[-1] += ' ' + line.strip()
        elif current_speaker == 'jayson_tatum':
            tatum_text[-1] += ' ' + line.strip()
interviewer_text = ' '.join(interviewer_text)
tatum_text = ' '.join(tatum_text)

print("Interviewer:")
print(interviewer_text)
print(sid.polarity_scores(interviewer_text))
print("---")
print("Jayson Tatum:")
print(tatum_text)
print(sid.polarity_scores(tatum_text))

Looking at the separated scores, we see that both are still mostly neutral. However, Tatum has a larger `pos` score than his interviewer. This makes sense: we would expect Tatum to sound happier, since his team had just won the NBA championship.
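The two splitting cells for Luka and Tatum differ only in the speaker tag, so the logic could be folded into one reusable function. The sketch below uses a hypothetical helper, `split_speakers`, and a made-up `sample` transcript. Like the cells above, it attaches untagged lines to whoever spoke last, so lines from any other tagged speaker would end up in the wrong place.

```python
def split_speakers(transcript, speaker_tag):
    """Separate interviewer questions from one speaker's answers."""
    questions, answers = [], []
    current = None  # the list that belongs to whoever is speaking
    for line in transcript.split("\n"):
        line = line.strip()
        if line.startswith("Q."):
            current = questions
            current.append(line[2:].strip())
        elif line.startswith(speaker_tag):
            current = answers
            current.append(line[len(speaker_tag):].strip())
        elif current is not None and line:
            # Continuation line: attach it to the most recent speaker
            current[-1] += " " + line
    return " ".join(questions), " ".join(answers)


# Made-up transcript in the same shape as the scraped interviews
sample = (
    "Intro line\n"
    "Q. How was the game?\n"
    "JAYSON TATUM: It was great.\n"
    "We played hard.\n"
    "Q. What is next?\n"
    "JAYSON TATUM: Rest."
)

sample_questions, sample_answers = split_speakers(sample, "JAYSON TATUM:")
print(sample_questions)
print(sample_answers)
```

In the notebook, this could be called as `interviewer_text, tatum_text = split_speakers(tatum_info_combined, 'JAYSON TATUM:')`, and in the same way with `'LUKA DONCIC:'` for the first interview.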

## Your Turn

Now that you've seen some examples of sentiment analysis from pasted text, a downloaded book, and from web-scraped text, try it with some of your own text, or a different book from [Project Gutenberg](https://www.gutenberg.org), in the following code cell.

In [None]:
text = ""

sid.polarity_scores(text)

## Conclusion

**Sentiment analysis** is a powerful tool for understanding people's emotions and opinions from text. By using techniques from the Natural Language Toolkit, we can automatically determine whether a piece of text expresses positive, negative, or neutral sentiments.

This ability to analyze and interpret large volumes of text data can provide valuable insights for various applications. As you explore sentiment analysis, remember that it can be tailored to different problems, such as customer feedback or social media monitoring.

[![Data Dunkers License](https://github.com/Data-Dunkers/lessons/blob/main/images/bottom-banner.jpg?raw=true)](https://github.com/Data-Dunkers/lessons/blob/main/LICENSE.md)