# BINARY CLASSIFICATION OF SENTIMENT

**File:** BinaryClassification.ipynb

**Course:** Data Science Foundations: Data Mining in Python

# INSTALL AND IMPORT LIBRARIES

The Python library `nltk`, for "Natural Language Toolkit," contains most of the functions we need for text mining. NLTK can be installed with Python's `pip` command. The installation only needs to be run once per Python environment.

The standard, shorter approach may work:

In [None]:
# pip install nltk

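If the command above didn't work, `pip` may be installing into a different Python installation than the one the notebook kernel uses. In that case, run the code below, which calls `pip` through the kernel's own interpreter.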

In [None]:
# import sys
# !{sys.executable} -m pip install nltk

Once `nltk` is installed, load the libraries and data below.

In [None]:
# Import libraries
import re  # For regular expressions
import nltk  # For text functions
import matplotlib.pyplot as plt  # For plotting
import pandas as pd  # For dataframes

# Import corpora and functions from NLTK
from nltk.corpus import stopwords
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

# Download data for NLTK
nltk.download('stopwords', quiet=True)
nltk.download('opinion_lexicon', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Tokenizer data used by newer NLTK releases

# Use Matplotlib style sheet
plt.style.use('ggplot')

# IMPORT DATA

In [None]:
df = pd.read_csv('data/Iliad.txt', sep='\t')\
    .dropna()\
    .drop(columns='gutenberg_id')  # Keyword form; pandas 2.0+ rejects a positional axis

df.head(10)

# PREPARE DATA


## Tokenize the Data

- A "token" is the unit of analysis in text mining.
- In this case, the tokens will be individual words, which is most common, but tokens can also be pairs or triplets of words, sentences, and so on.
- In the tokenization process, it is common to standardize capitalization and remove non-word characters.
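
To make the idea of different token sizes concrete, here is a minimal sketch in plain Python. The helper names `simple_tokens` and `ngrams` and the sample line are illustrative assumptions, not part of the notebook, and the cleaning rules are a simplified stand-in for what NLTK's `word_tokenize` does.

```python
import re

def simple_tokens(text):
    # Lowercase, drop apostrophes, and turn other non-word characters into spaces
    text = text.lower().replace("'", "")
    text = re.sub(r"[^\w]", " ", text)
    return text.split()  # split() also collapses repeated spaces

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

line = "Sing, O goddess, the anger of Achilles"  # illustrative sample line
words = simple_tokens(line)     # single-word tokens
pairs = ngrams(words, 2)        # pairs of adjacent words
triplets = ngrams(words, 3)     # triplets of adjacent words

print(words)
print(pairs[:3])
print(triplets[:2])
```

Single words are usually enough for lexicon-based sentiment work, while pairs and triplets keep some word order at the cost of many more distinct tokens.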

In [None]:
def clean_text(text):
    text = text.lower()  # Convert all text to lowercase
    text = text.replace("'", '')  # Remove apostrophes
    text = re.sub(r'[^\w]', ' ', text)  # Leave only word characters
    text = re.sub(r'\s+', ' ', text)  # Omit extra space characters
    text = text.strip()  # Trim leading and trailing spaces
    return text

df['text'] = df['text'].map(clean_text)  # Standardize each line of text
df['text'] = df['text'].map(word_tokenize)  # Split text into word tokens

df.head()

## Collect Tokens into a Single Series

In [None]:
df = df.text.explode().to_frame('token')
df.head(10)

## Sort Tokens by Frequency

In [None]:
df.token.value_counts().head(10)

## Remove Stop Words

- Stop words are common words such as "the," "and," and "a" that may interfere with the semantic analysis of text.
- It is common to use a lexicon or established list of stop words.
- However, different stop word lexicons contain different words, so the choice of lexicon can change the results.
- It is also possible to add specific words to a custom stop word list.
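
As a rough illustration of the last point, a custom list can be built by taking the union of an established list with words chosen for the corpus. Both word sets below are hypothetical stand-ins: the base set is a tiny substitute for a real lexicon such as NLTK's English list, and the archaic additions are only an example of corpus-specific choices.

```python
# Hypothetical base list standing in for an established stop word lexicon
base_stop_words = {"the", "and", "a", "of", "to"}

# Hypothetical corpus-specific additions (archaic pronouns)
custom_stop_words = {"thou", "thy", "thee"}

# Set union combines both lists without duplicates
stop_words = base_stop_words | custom_stop_words

tokens = ["the", "anger", "of", "achilles", "and", "thy", "ships"]

# Keep only the tokens that are not stop words
kept = [t for t in tokens if t not in stop_words]
print(kept)
```

Using a set rather than a list keeps each membership check fast, which matters once the token count reaches the size of a full book.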

In [None]:
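# Load stop words under a new name so the imported `stopwords` corpus
# is not overwritten and this cell can be run more than once
stop_words = set(stopwords.words('english'))

df = df[~df.token.isin(stop_words)]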

## Sort Revised Tokens by Frequency

In [None]:
df.token.value_counts().head(10)

# CLASSIFY SENTIMENTS

## Identify Valenced Words with the "Opinion" Lexicon

In [None]:
sentiment_lexicon = {
    **{w: 'positive' for w in opinion_lexicon.positive()},
    **{w: 'negative' for w in opinion_lexicon.negative()}
}

df['sentiment'] = df['token'].map(sentiment_lexicon)
df = df[~df.sentiment.isna()]  # Omit words not in the opinion lexicon

df.head(10)

## Sort Sentiment Words by Frequency

In [None]:
df.token.value_counts().head(10)

## Summarize the Sentiment Words

In [None]:
summary_df = df.sentiment.value_counts().to_frame('n')
summary_df['prop'] = summary_df['n'] / summary_df.n.sum()

summary_df.round(3)
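
The labeling and summary steps above can also be sketched without pandas, which may make the logic easier to follow. The mini-lexicons and token list here are hypothetical examples made up for illustration; the real word lists come from `opinion_lexicon`.

```python
from collections import Counter

# Hypothetical mini-lexicons; the notebook uses the full NLTK opinion lexicon
positive_words = ["brave", "noble", "glory"]
negative_words = ["anger", "death", "grief"]

# Merge both lists into one word-to-label dictionary
lexicon = {
    **{w: "positive" for w in positive_words},
    **{w: "negative" for w in negative_words},
}

tokens = ["anger", "achilles", "brave", "ships", "death", "grief"]

# Label tokens found in the lexicon and drop the rest
labeled = [(t, lexicon[t]) for t in tokens if t in lexicon]

# Count labels, then convert the counts to proportions
counts = Counter(label for _, label in labeled)
total = sum(counts.values())
props = {label: round(n / total, 3) for label, n in counts.items()}

print(counts)
print(props)
```

Because unmatched tokens are dropped before counting, the proportions describe only the valenced words, not the text as a whole.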

In [None]:
summary_df.n.plot.bar(legend=False, figsize=(8, 4), grid=True, color='gray')
plt.xlabel('Sentiment')
plt.ylabel('Frequency of Words')
plt.title('The Iliad: Frequency of positive and negative words', loc='left')
plt.xticks(rotation=0);

# CLEAN UP

- If desired, clear the results with Cell > All Output > Clear. 
- Save your work by selecting File > Save and Checkpoint.
- Shut down the Python kernel and close the file by selecting File > Close and Halt.