# Introduction
Welcome to Part III! In Part II, you generated text-related metrics and visualized them with bar plots.

In this Part, you will perform sentiment analysis on the texts. More specifically, you will chart how sentiment changes over the course of each text.

You will do this by:
- splitting a text into sentences
- measuring the sentiment of each sentence
- plotting the sentiment of each sentence
- splitting the text into chapters
- measuring the aggregate sentiment of each chapter
- normalizing the chapter positions for cross-text comparison

### Step 1: Import libraries
First up, let's get our libraries.
- pandas as pd
- matplotlib.pyplot as plt
- SentimentIntensityAnalyzer from vaderSentiment.vaderSentiment
- sent_tokenize from nltk

In [None]:
# Step 1: Import libraries

### Step 2: Read CSV from Part I into DataFrame
Next, let's load up the CSV that we got from Part I into a DataFrame. 

In [None]:
# Step 2: Read the CSV from Part I

## Test sentiment analysis with one text
Before we analyze all seven texts, we should start with one text and identify the best strategy needed for our analysis.

We'll start with the first text.

### Step 3: Get the first text 
Declare a variable, and assign it the first text.

In [None]:
# Step 3: Assign the first text to a variable

### Step 4: Split the text by sentence
Declare another variable and assign it a list in which each item is a separate sentence.

You can use nltk's `sent_tokenize`, which tokenizes a text into sentences.

You should expect 6,394 sentences in the resulting list.

In [None]:
# Step 4: Split the text by sentences

### Step 5: Calculate the sentiment score for the sentences
Now that you have a list of sentences, let's loop through them and get the sentiment of each one.

The code examples in the documentation are useful for getting you started: https://github.com/cjhutto/vaderSentiment

Create a new list containing the <strong>compound</strong> scores of each sentence.

In [None]:
# Step 5: Create a list of compound scores

<details>
    <summary><strong>Click once to get pseudocode if you're stuck</strong></summary>
    <ol>
        <li>Declare a variable and assign an empty list to it</li>
        <li>Declare a variable containing a SentimentIntensityAnalyzer object</li>
        <li>Use a for loop to loop through the list from Step 4. In each loop:</li>
        <ul>
            <li>Use the SentimentIntensityAnalyzer object's .polarity_scores method to measure the current loop's sentence's score</li>
            <li>Get the value of the "compound" key of the .polarity_scores method result</li>
            <li>Append that value into the list declared above</li>
        </ul>
    </ol>
</details>

### Step 6: Create a DataFrame of sentiments for first text
Now that you have:
- a list of sentences
- a list of compound scores of each sentence

you can now create a DataFrame for them so that we can plot it later. 

![BookOneSentiment](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/BookOneSentiment.png)

This is what you'll get once you're done.

In [None]:
# Step 6: Create a sentiment DataFrame

### Step 7: Plot the sentiment 
The moment of truth...let's plot the sentiment in the "compound" column, and let's see what we get!

In [None]:
# Step 7: Plot compound

<details>
    <summary><strong>What <em>did</em> we get? Click once to see our plot</strong></summary>
    <img src="https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/BookOneSentenceSentimentPlot.png">
    <br>
    <div>It's a really messy plot that oscillates wildly. Not really helpful :/</div>
</details>

### [Optional] Apply a Savitzky-Golay filter to the sentiment data
The Savitzky-Golay filter removes noise from a signal through polynomial smoothing. If we assume that the signal is continuous and that the fluctuations are noise, we can use the filter to smooth the signal.

Reading: https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.savgol_filter.html

Here's what you need to do:
1. Import savgol_filter from scipy.signal
2. Transform 'compound' with the filter, using a window length of 151 and a polynomial order of 5
3. Plot the transformed signal

Feel free to change the window and polynomial order.
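
To get a feel for what the filter does before applying it to the book data, here is a small sketch on a synthetic signal. The sine wave plus random noise is our own stand-in for the 'compound' column, not data from the project; only the `savgol_filter` call itself carries over.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for the 'compound' column: a slow wave plus random noise
rng = np.random.default_rng(0)
x = np.linspace(0, 4 * np.pi, 1000)
clean = np.sin(x)
noisy = clean + rng.normal(scale=0.5, size=x.size)

# window_length must be larger than polyorder and no longer than the data
smoothed = savgol_filter(noisy, window_length=151, polyorder=5)

# The smoothed signal should sit closer to the underlying wave than the noisy one
print(np.abs(noisy - clean).mean(), np.abs(smoothed - clean).mean())
```

A wider window smooths more aggressively, while a higher polynomial order lets the curve follow sharper turns.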

In [None]:
# Import savgol filter

# Plot the original 'compound' from the DataFrame

# Declare a variable that contains the transformed signal through savgol filter

# Plot the transformed signal

<details>
    <summary><strong>Click once to see our plot</strong></summary>
    <img src="https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/BookOneSentenceSentimentPlotWithSavGolFilter.png">
</details>

### Measure sentiment by chapter
It seems that splitting the text by sentence led to too many values. Instead, we can try to split the text by chapter.

![ChapterSplitStrategy](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/ChapterSplitStrategy.png)

In the following approach, these are the steps:
1. <font color='red'>Split a full text into chapters</font>
2. <font color='green'>Split each chapter into sentences</font>
3. <font color='blue'>Measure the score for each sentence in a chapter</font>
4. <font color='orange'>Get the average of the sentiment in each chapter based on the sentences</font>

### Step 8: Split the text by chapter
Declare a variable and assign it the result of splitting the text on the string "CHAPTER".

Make sure your list has only 17 items: the first item after splitting is the text that comes before the first "CHAPTER", so remove it.
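
A minimal sketch of the split on a made-up miniature text shows why the first item needs to go. The text and variable names here are hypothetical; the real text has 17 chapters.

```python
# Hypothetical miniature text; the real one comes from Step 3
text = (
    "Title page and front matter. "
    "CHAPTER ONE The first chapter. "
    "CHAPTER TWO The second chapter. "
    "CHAPTER THREE The third chapter."
)

parts = text.split("CHAPTER")
print(len(parts))  # 4: the front matter plus three chapters

chapters = parts[1:]  # drop everything before the first "CHAPTER"
print(len(chapters))  # 3
```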

In [None]:
# Step 8: Split the text by "CHAPTER"

### Step 9: Get a list of lists of sentences
Loop through the list of chapters, and in each chapter use sent_tokenize to break the chapter into a list of sentences.

You will end up with a list containing 17 lists.

In [None]:
# Step 9: Get a list of lists of sentences

<details>
    <summary><strong>Click once to see the pseudocode</strong></summary>
    <ol>
        <li>Declare an empty list</li>
        <li>Use a for loop to loop through the list of chapters. In each loop:</li>
        <ul>
            <li>Use sent_tokenize on the current chapter to split it into a list of sentences and assign it to a variable</li>
            <li>Append the list of sentences to the empty list declared above</li>
        </ul>
    </ol>
</details>

### Step 10: Measure the sentiment for each sentence in the list of lists of sentences
More loops ahead! 

In the list that you got from Step 9, loop through each sentence in each of your 17 lists and measure its sentiment.

![ListOfListOfSentiments](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/ListOfListOfSentiments.png)

What you have at the end is a list of lists that contain scores.

<details>
    <summary><strong>Click once to see the pseudocode</strong></summary>
    <ol>
        <li>Declare an empty list (List 1)</li>
        <li>Declare a variable containing SentimentIntensityAnalyzer object</li>
        <li>Use a for loop to loop through the list of lists of sentences. In each loop:</li>
        <ul>
            <li>Declare a variable containing an empty list (List 2)</li>
            <li>Use a for loop to loop through the current list of sentences. In each loop:</li>
            <ul>
                <li>Use the SentimentIntensityAnalyzer to get the polarity scores of the current sentence and assign the results to a variable</li>
                <li>Get the value of the 'compound' key of the variable above</li>
                <li>Append that 'compound' score to List 2</li>
            </ul>
            <li>Append the List 2 into List 1</li>
        </ul>
    </ol>
</details>

In [None]:
# Step 10: Measure the sentiment for each sentence 

### Step 11: Average the list of lists of scores
Now that you have a list of lists of scores, get the average of each inner list.

![AveragedChapterSentiments](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/AveragedChapterSentiments.png)

Since you have 17 chapters, you should end up with 17 scores in the list.
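
The averaging itself can be sketched in plain Python. The scores below are invented for illustration and stand in for the list of lists from Step 10.

```python
# Hypothetical scores standing in for the list of lists from Step 10
chapter_scores = [
    [0.5, -0.5, 0.0],
    [0.25, 0.75],
    [-1.0, -0.5],
]

# One average per chapter: sum of the chapter's scores divided by their count
chapter_averages = [sum(scores) / len(scores) for scores in chapter_scores]

print(chapter_averages)  # [0.0, 0.5, -0.75]
```

Note that an empty inner list would raise a `ZeroDivisionError`, so it is worth checking that every chapter produced at least one sentence.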

In [None]:
# Step 11: Average the compound scores in each chapter

### Step 12: Create a DataFrame for the scores
Now that you have a list of averaged sentiment scores, let's create a DataFrame containing two columns:
- chapter
- compound

You'll see something like this.

![BookOneDataFrameCompoundScores](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/BookOneDataFrameCompoundScores.png)
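
One possible way to build such a DataFrame is sketched below. The averages are invented, and numbering the chapters from 1 is our assumption; adjust it if your expected output starts from 0.

```python
import pandas as pd

# Hypothetical averaged scores standing in for the list from Step 11
chapter_averages = [0.0, 0.5, -0.75]

book_df = pd.DataFrame({
    # Assumption: chapters are numbered starting from 1
    "chapter": range(1, len(chapter_averages) + 1),
    "compound": chapter_averages,
})

print(book_df)
```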

In [None]:
# Step 12: Create a DataFrame

### Step 13: Plot the scores
Now that you have a DataFrame of averaged scores, let's plot it!

Look out for how the sentiment changes throughout the book.

In [None]:
# Step 13: Plot the scores for Book 1

<details>
    <summary><strong>Click once to see what we got</strong></summary>
    <img src='https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/ChapterOneSentiments.png'>
</details>

## Sentiment analysis with all texts
Now that we have tried the analysis on one text, let's plot the rest! 

### Step 14: Repeat Steps 8-12 with all texts
Repeat Steps 8-12 for each of the remaining texts, then plot each result as you did in Step 13.

At the end, you'll have seven separate plots, one per book.

Make sure that you have the DataFrames for each text as well.

P.S. Bear in mind that not all texts have the same string to denote chapter, i.e. "CHAPTER". Try looking into each text to see how you'd like to split them.
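
If the chapter marker differs between texts, one option is to split on a regular expression instead of a fixed string. The helper name and both sample markers below are hypothetical; inspect each text to find the marker it actually uses.

```python
import re


def split_into_chapters(text, marker_pattern):
    """Split text on a chapter-marker regex and drop the text before the first marker."""
    parts = re.split(marker_pattern, text)
    return parts[1:]


# Hypothetical miniature texts with two different marker styles
upper_case_text = "Front matter CHAPTER ONE text CHAPTER TWO text"
title_case_text = "Front matter Chapter 1 text Chapter 2 text"

print(len(split_into_chapters(upper_case_text, r"CHAPTER")))      # 2
print(len(split_into_chapters(title_case_text, r"Chapter \d+")))  # 2
```

Be aware that a pattern which also matches the word in running prose would produce extra splits, so check the number of chapters you get back.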

In [None]:
# Step 14a: Repeat Steps 8-12 with Book 2

In [None]:
# Step 14b: Repeat Steps 8-12 with Book 3

In [None]:
# Step 14c: Repeat Steps 8-12 with Book 4

In [None]:
# Step 14d: Repeat Steps 8-12 with Book 5

In [None]:
# Step 14e: Repeat Steps 8-12 with Book 6

In [None]:
# Step 14f: Repeat Steps 8-12 with Book 7

### Step 15: Normalize the chapter lengths
You successfully made seven plots, but it is hard to compare them because the books have different numbers of chapters.

If you tried, you'd be faced with a really messy plot.

First things first: you'll need to normalize the "chapter" column in each DataFrame.

![ChapterNormalization](https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/ChapterNormalization.png)

We'll normalize the chapter positions by dividing each value in the "chapter" column by the column's maximum value, so that every book runs from just above 0 to 1.
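
The division can be sketched on a small made-up DataFrame. The four rows are invented, and the column names follow Step 12.

```python
import pandas as pd

# Hypothetical four-chapter book
book_df = pd.DataFrame({
    "chapter": [1, 2, 3, 4],
    "compound": [0.1, -0.2, 0.05, 0.3],
})

# Divide every chapter number by the largest chapter number
book_df["chapter"] = book_df["chapter"] / book_df["chapter"].max()

print(book_df["chapter"].tolist())  # [0.25, 0.5, 0.75, 1.0]
```

After this step the last chapter of every book sits at 1.0, regardless of how many chapters the book has.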

In [None]:
# Divide the "chapter" column for each DataFrame by the max number of chapters
# Step 15a: Normalize the chapters in Book 1 DataFrame

In [None]:
# Step 15b: Normalize the chapters in Book 2 DataFrame

In [None]:
# Step 15c: Normalize the chapters in Book 3 DataFrame

In [None]:
# Step 15d: Normalize the chapters in Book 4 DataFrame

In [None]:
# Step 15e: Normalize the chapters in Book 5 DataFrame

In [None]:
# Step 15f: Normalize the chapters in Book 6 DataFrame

In [None]:
# Step 15g: Normalize the chapters in Book 7 DataFrame

### Step 17: Plot the compound scores from all the texts
Now that you've normalized the chapters in all DataFrames, let's plot all scores on a single figure.
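
A rough sketch of drawing several books on one set of axes follows. The two small DataFrames, the labels, and the axis titles are all made up for illustration; with the real data you would loop over your seven normalized DataFrames.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical normalized DataFrames for two books
book_dfs = {
    "Book 1": pd.DataFrame({"chapter": [0.25, 0.5, 0.75, 1.0],
                            "compound": [0.1, -0.2, 0.05, 0.3]}),
    "Book 2": pd.DataFrame({"chapter": [0.2, 0.4, 0.6, 0.8, 1.0],
                            "compound": [0.0, 0.1, -0.1, -0.3, 0.2]}),
}

fig, ax = plt.subplots(figsize=(10, 5))
for label, df in book_dfs.items():
    # Each call adds one line to the same axes
    ax.plot(df["chapter"], df["compound"], label=label)

ax.set_xlabel("Normalized chapter position")
ax.set_ylabel("Average compound score")
ax.legend()
```

Because every book's x-values now end at 1.0, the lines share the same horizontal range and can be compared directly.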

In [None]:
# Step 17: Plot compound scores from all texts in a single plot

<details>
    <summary><strong>Click here once to see our plot</strong></summary>
    <img src="https://uplevelsg.s3-ap-southeast-1.amazonaws.com/ProjectHarryPotter/CompoundScoreVsAllBookChaptersNormalized.png">
    <br>
    <div>It seems that the sentiment oscillates throughout each book, dips towards the end, and then rises again.</div>
</details>

### End of Part III
What a Part!

In this Part, you calculated and plotted sentiment scores throughout the full texts.

We also had an optional part where you applied a Savitzky-Golay filter to smooth the scores.

In the next Part, you will analyze your text data in a different way, through word cloud visualization.