# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [18]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know!
# rstrip('\n') removes only the trailing newline, so a last line without one is not truncated
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know!
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [19]:
len(reviews)

25000

In [20]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [21]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [22]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
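
The six samples above hint at a simple theory: words such as "terrible" and "excellent" tend to show up under one label far more than the other. A minimal sketch of how one might check that idea, using a hypothetical `sample` list that copies (and slightly shortens) the previews printed above rather than the full dataset:

```python
from collections import Counter

# Hypothetical mini-sample: (label, preview) pairs copied from the output above.
sample = [
    ("NEGATIVE", "this movie is terrible but it has some good effects ."),
    ("POSITIVE", "adrian pasdar is excellent is this film . he makes a fascinating woman ."),
    ("NEGATIVE", "comment this movie is impossible . is terrible  very improbable  bad"),
    ("POSITIVE", "excellent episode movie ala pulp fiction ."),
    ("NEGATIVE", "if you haven  t seen this  it  s terrible . it is pure trash ."),
    ("POSITIVE", "this schiffer guy is a real genius  the movie is of excellent quality"),
]

# For each label, count how many reviews contain each word at least once.
reviews_containing = {"POSITIVE": Counter(), "NEGATIVE": Counter()}
for label, text in sample:
    reviews_containing[label].update(set(text.split()))

for word in ("excellent", "terrible"):
    print("{}: {} positive, {} negative".format(
        word,
        reviews_containing["POSITIVE"][word],
        reviews_containing["NEGATIVE"][word]))
```

Six reviews prove nothing on their own; the point is only that word presence can be counted per label, which is what Project 1 does at full scale.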


# Project 1: Quick Theory Validation

In [122]:
# Imports for the project
from collections import Counter
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing.
# Membership tests on a set are faster than on a list, so storing the stop words
# in a set speeds up the filtering below.
stops = set(stopwords.words("english"))
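
To illustrate why the stop words go into a set: a `set` answers membership tests by hashing, while a `list` is scanned item by item. Both give the same answers. A minimal sketch with a hypothetical toy stop list (not NLTK's actual list):

```python
# Hypothetical toy stop list standing in for stopwords.words("english").
stop_list = ["the", "is", "a", "it", "this"]
stop_set = set(stop_list)

tokens = "this movie is terrible but it has some good effects .".split(" ")

# Same filter, two containers: the results match, only the lookup cost differs.
kept_via_list = [t for t in tokens if t not in stop_list]
kept_via_set = [t for t in tokens if t not in stop_set]

print(kept_via_set)
print(kept_via_list == kept_via_set)
```

With only five stop words the difference is negligible; it starts to matter when the test runs once per token across millions of tokens.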

In [123]:
# Initialize the counters, then iterate through the positive and negative reviews.
# Each Counter maps a word token to its frequency.
# Stop words are counted in separate Counters for later analysis: it would be
# interesting to see whether stop-word frequency alone could indicate
# a positive or negative review.

positive_counts = Counter()
negative_counts = Counter()
clean_counts = Counter()
positive_stops = Counter()
negative_stops = Counter()

# Single pass over the data: pick the counters that match each review's label
for review, label in zip(reviews, labels):
    if label == 'POSITIVE':
        stop_counter, word_counter = positive_stops, positive_counts
    elif label == 'NEGATIVE':
        stop_counter, word_counter = negative_stops, negative_counts
    else:
        continue  # skip any unexpected label

    for word in review.split(" "):
        if word in stops:
            stop_counter[word] += 1
        else:
            word_counter[word] += 1
            clean_counts[word] += 1

In [124]:
print ('Number of entries in positive lexicon: {}'.format(len(positive_counts)))
print ('Number of entries in negative lexicon: {}'.format(len(negative_counts)))
print ('Number of entries in clean lexicon: {}'.format(len(clean_counts)))
print ('Number of entries in positive stops lexicon: {}'.format(len(positive_stops)))
print ('Number of entries in negative stops lexicon: {}'.format(len(negative_stops)))

Number of entries in positive lexicon: 55061
Number of entries in negative lexicon: 53483
Number of entries in clean lexicon: 73921
Number of entries in positive stops lexicon: 153
Number of entries in negative stops lexicon: 152


The vocabulary size is fairly consistent across the two buckets: about 55,000 distinct non-stop words in the positive reviews versus about 53,500 in the negative reviews.
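
One way to put a number on that consistency: since `clean_counts` receives every non-stop word from either bucket, its size is the size of the union of the two lexicons, and inclusion-exclusion gives the number of words the buckets share. A small sketch using only the lexicon sizes printed above (the variable names are illustrative):

```python
# Lexicon sizes copied from the output above.
n_positive = 55061   # distinct non-stop words in positive reviews
n_negative = 53483   # distinct non-stop words in negative reviews
n_union = 73921      # distinct non-stop words in either bucket (clean lexicon)

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
n_shared = n_positive + n_negative - n_union

print('Words appearing in both buckets: {}'.format(n_shared))
print('Share of the positive lexicon also in the negative one: {:.1%}'.format(
    n_shared / float(n_positive)))
```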

It would also be interesting to see how many words in total were used in each bucket.

In [125]:
print ('Number of stop words in negative reviews: {}'.format(sum(negative_stops.values())))
print ('Number of stop words in positive reviews: {}'.format(sum(positive_stops.values())))
print ('Number of words in positive review bucket: {}'.format(sum(positive_counts.values())))
print ('Number of words in negative review bucket: {}'.format(sum(negative_counts.values())))


Number of stop words in negative reviews: 1462007
Number of stop words in positive reviews: 1468827
Number of words in positive review bucket: 2284117
Number of words in negative review bucket: 2244367


Again, these totals are comparable: the positive bucket holds roughly 2% more non-stop words than the negative bucket, and the stop-word totals differ by less than 1%.
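
The comparison can be reproduced directly from the totals printed above. A short sketch, where `percent_increase` is a helper introduced here for illustration:

```python
# Totals copied from the output above.
pos_words, neg_words = 2284117, 2244367   # non-stop word tokens
pos_stops, neg_stops = 1468827, 1462007   # stop-word tokens

def percent_increase(new, old):
    """Relative increase from old to new, in percent."""
    return 100.0 * (new - old) / old

print('Non-stop words: {:.2f}% more in positive reviews'.format(
    percent_increase(pos_words, neg_words)))
print('Stop words: {:.2f}% more in positive reviews'.format(
    percent_increase(pos_stops, neg_stops)))
```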

In [126]:
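# Line continuations are not needed inside parentheses
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# Note: positive_counts is a Counter, so iterating it yields its keys and
# each distinct word is passed to the vectorizer as its own one-word document
train_data_features = vectorizer.fit_transform(positive_counts)
train_data_features = train_data_features.toarray()
print(train_data_features.shape)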

(55061, 5000)
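
The row count of 55061 equals the size of the positive lexicon printed earlier. That follows from how the data is passed in: iterating over a `Counter` yields its keys, and `fit_transform` treats each item of the iterable it receives as one document. A minimal sketch of that iteration behaviour, using a hypothetical toy `Counter` rather than the real `positive_counts`:

```python
from collections import Counter

# Hypothetical toy counter standing in for positive_counts.
toy_counts = Counter("excellent film excellent cast".split(" "))

# Iterating a Counter yields each distinct key once; the frequencies are ignored.
documents = list(toy_counts)

print(sorted(documents))
print(len(documents) == len(toy_counts))
```

If the goal is one row per review, a likely alternative is to pass the `reviews` list itself to `fit_transform`. That is an assumption about intent, not something this notebook does here.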


In [127]:
# Newer scikit-learn releases replace get_feature_names() with get_feature_names_out()
vocab = vectorizer.get_feature_names()
print(vocab)

[u'aa', u'orange', u'oranges', u'orangutan', u'orangutans', u'orations', u'orator', u'oratorio', u'orators', u'oratory', u'orazio', u'orb', u'orbach', u'orbit', u'orbital', u'orbiting', u'orbits', u'orc', u'orchard', u'orchestra', u'orchestral', u'orchestras', u'orchestrate', u'orchestrated', u'orchestrates', u'orchestrating', u'orchestration', u'orchidea', u'orchids', u'orco', u'orcs', u'ordained', u'ordeal', u'ordeals', u'order', u'ordered', u'orderedby', u'ordering', u'orderly', u'orders', u'ordinarily', u'ordinariness', u'ordinary', u'ordination', u'ore', u'oregon', u'oregonian', u'oreilles', u'orenji', u'oreo', u'oreos', u'org', u'organ', u'organic', u'organically', u'organisation', u'organisations', u'organise', u'organised', u'organisers', u'organising', u'organism', u'organisms', u'organist', u'organization', u'organizational', u'organizations', u'organize', u'organized', u'organizers', u'organizes', u'organizing', u'organs', u'orgasm', u'orgasmic', u'orgasms', u'orgazim', u'or