# SENTIMENT SCORING SOLUTION

**File:** SentimentSolution.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# CHALLENGE

In this challenge, I invite you to do the following:

1. Import the text `LittleWomen.txt` by Louisa May Alcott from the data folder (the text was downloaded from Project Gutenberg at https://www.gutenberg.org/ebooks/514).
1. Add section numbers for sections of 100 lines.
1. Tokenize the data.
1. Score the sentiments.
1. Calculate average sentiment scores for each section of 100 lines.
1. Graph the "sentiment arc" of the story.

# IMPORT LIBRARIES

In [None]:
# Import libraries
import re  # For regular expressions
import nltk  # For text functions
import matplotlib.pyplot as plt  # For plotting
import pandas as pd  # For dataframes
from afinn import Afinn  # For sentiment values

# Import corpora and functions from NLTK
from nltk.corpus import stopwords
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

# Download data for NLTK
nltk.download('stopwords', quiet=True)
nltk.download('opinion_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Needed by word_tokenize in newer NLTK releases

# Use Matplotlib style sheet
plt.style.use('ggplot')

# IMPORT DATA

In [None]:
df = pd.read_csv('data/LittleWomen.txt', sep='\t')\
    .dropna()\
    .drop(columns='gutenberg_id')  # Keep only the text column

df.head(10)
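
To see what this chain does without the full file, here is a minimal sketch on a made-up sample. It assumes the file is tab-separated with a `gutenberg_id` column and a `text` column, as the cell above implies; the sample rows and the name `toy_df` are illustrative only.

```python
import io
import pandas as pd

# Made-up sample in the assumed layout: gutenberg_id <TAB> text
sample = "gutenberg_id\ttext\n514\tLITTLE WOMEN\n514\t\n514\tby\n"

toy_df = pd.read_csv(io.StringIO(sample), sep='\t')\
    .dropna()\
    .drop(columns='gutenberg_id')  # Keep only the text column

print(toy_df)
```

The row with an empty `text` field is read as missing and removed by `dropna()`, so only lines that contain text remain.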

# PREPARE DATA


## Add Line Numbers

- These numbers will be used to divide the text into sections.

In [None]:
df['line'] = range(1, len(df) + 1)

df.head()

## Tokenize the Data

In [None]:
def clean_text(text):
    text = text.lower()  # Convert all text to lowercase
    text = text.replace("'", '')  # Remove apostrophes so contractions stay whole
    text = re.sub(r'[^\w]', ' ', text)  # Replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    text = text.strip()  # Trim leading and trailing spaces
    return text

df['text'] = df['text'].map(clean_text)  # Normalize each line
df['text'] = df['text'].map(word_tokenize)  # Split text into word tokens

df.head()
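
As a quick check of the cleaning steps, the sketch below runs them on one sample line adapted from the novel's opening. It repeats `clean_text` so that it runs on its own, and it uses `str.split` as a stand-in for `word_tokenize` (an assumption made to avoid the NLTK download; the two may differ for a few special cases).

```python
import re

def clean_text(text):
    text = text.lower()  # Convert all text to lowercase
    text = text.replace("'", '')  # Remove apostrophes
    text = re.sub(r'[^\w]', ' ', text)  # Replace non-word characters with spaces
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    return text.strip()

line = '"Christmas won\'t be Christmas without any presents," grumbled Jo,'
tokens = clean_text(line).split()  # str.split stands in for word_tokenize here

print(tokens)
```

Note that "won't" becomes the single token `wont` because the apostrophe is removed before the other punctuation is replaced with spaces.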

## Collect Tokens into a Single Series

In [None]:
# Lines with no tokens explode to NaN; drop them before scoring
df = df.explode('text')\
    .dropna(subset=['text'])\
    .rename(columns={'text': 'token'})

df.head(10)
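
The effect of `explode` is easiest to see on a tiny frame. This is a minimal sketch with made-up tokens; `toy` and `long` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Two lines of text, each already split into a list of tokens
toy = pd.DataFrame({'line': [1, 2],
                    'text': [['merry', 'christmas'], ['poor', 'jo']]})

# One row per token; the line number is repeated for each token
long = toy.explode('text')\
    .rename(columns={'text': 'token'})

print(long)
```

Each token keeps the line number of the row it came from, which is what later allows scores to be grouped by section.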

# SCORE SENTIMENTS

- Calculate sentiment scores using the AFINN lexicon, which scores words on a scale of -5 (most negative) to +5 (most positive).

In [None]:
afinn_scorer = Afinn()

df['score'] = df['token'].map(afinn_scorer.score).astype(int)
df = df[df['score'] != 0]
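
The score-then-filter pattern can be sketched without the AFINN package. The mini-lexicon below is hypothetical and its scores are made up for illustration; they are not AFINN values. Tokens that are missing from the lexicon get a score of 0 and are then dropped, which mirrors the filter in the cell above.

```python
import pandas as pd

# Hypothetical mini-lexicon; these scores are made up, not taken from AFINN
toy_lexicon = {'merry': 3, 'dreadful': -3, 'poor': -2}

toy = pd.DataFrame({'token': ['merry', 'christmas', 'dreadful', 'poor', 'jo']})

# Look up each token; unknown tokens score 0
toy['score'] = toy['token'].map(lambda t: toy_lexicon.get(t, 0)).astype(int)

# Keep only tokens that carry sentiment
scored = toy[toy['score'] != 0]

print(scored)
```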

- Show a frequency table for the sentiment scores.

In [None]:
score_freq = df.score.value_counts().sort_index().to_frame('n')

score_freq
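
For reference, here is the same `value_counts` chain on a short made-up series of scores, as a sketch of what the frequency table contains.

```python
import pandas as pd

# Made-up sentiment scores for six tokens
scores = pd.Series([2, -1, 2, 3, -1, 2])

# Count each distinct score, order by score, and name the count column 'n'
freq = scores.value_counts().sort_index().to_frame('n')

print(freq)
```

Sorting by index orders the rows from the most negative score to the most positive, rather than by frequency.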

## Graph Score Frequencies

In [None]:
score_freq.plot.bar(
    legend=False,
    figsize=(8, 4),
    grid=True,
    color='gray')
plt.xlabel('Sentiment Score')
plt.ylabel('Frequency of Words')
plt.title('Little Women: Sentiment Scores by Words', loc='left')
plt.xticks(rotation=0);

# SENTIMENT ARC

- Divide the text into sections of 100 lines and calculate a sentiment score for each section.

In [None]:
score_acc = df.groupby((df['line'] - 1) // 100)\
    .score.mean()\
    .to_frame('score')\
    .rename_axis('section')

score_acc.head(10)
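
Because line numbers start at 1, the choice between `line // size` and `(line - 1) // size` changes where section boundaries fall. The sketch below compares the two on made-up scores, using a toy section length of 3 instead of 100; all names and values are illustrative.

```python
import pandas as pd

# Made-up scores for six consecutive lines
toy = pd.DataFrame({'line': [1, 2, 3, 4, 5, 6],
                    'score': [3, -1, 1, -2, -2, 1]})

size = 3  # Toy section length (the notebook uses 100)

# With 1-based line numbers, line // size gives uneven sections
naive_counts = (toy['line'] // size).value_counts().sort_index()

# Subtracting 1 first gives sections of exactly `size` lines
by_section = toy.groupby((toy['line'] - 1) // size)\
    .score.mean()\
    .to_frame('score')\
    .rename_axis('section')

print(naive_counts.tolist())
print(by_section)
```

With `line // size`, the first section holds only `size - 1` lines and a short extra section appears at the end; with `(line - 1) // size`, every full section holds exactly `size` lines.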

## Plot Scores by Section to View Narrative Arc

In [None]:
ax = score_acc.plot.line(legend=False, figsize=(12, 6), grid=True, alpha=0.5, color='gray')
score_acc.rolling(10, min_periods=5).mean().plot.line(ax=ax, color='black')
plt.xlabel('Section of 100 Lines')
plt.ylabel('Mean Sentiment Score')
plt.title('Little Women: Mean Sentiment Score by Section', loc='left')
plt.axhline(0, color='red')
plt.xticks(rotation=0);
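
The black line in the plot is a trailing rolling mean. The sketch below shows how `min_periods` behaves on a short made-up series, using a window of 3 with `min_periods=2` in place of the notebook's 10 and 5.

```python
import pandas as pd

# Made-up section scores
section_scores = pd.Series([1.0, 2.0, 3.0, 4.0])

# Average over up to 3 trailing values, once at least 2 are available
smoothed = section_scores.rolling(3, min_periods=2).mean()

print(smoothed)
```

The first value is missing because only one observation is available there. By the same logic, `rolling(10, min_periods=5)` leaves the first four sections without a smoothed value, so the black line starts slightly later than the gray one.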

# CLEAN UP

- If desired, clear the results with Cell > All Output > Clear. 
- Save your work by selecting File > Save and Checkpoint.
- Shut down the Python kernel and close the file by selecting File > Close and Halt.