# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity lectures
- Read the recommended course material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of review i
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# reviews.txt holds one review per line: what we know!
with open('reviews.txt', 'r') as g:
    reviews = [line[:-1] for line in g]  # drop the trailing newline

# labels.txt holds the matching label per line: what we WANT to know!
with open('labels.txt', 'r') as g:
    labels = [line[:-1].upper() for line in g]
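
Because the two files are paired line by line, it can be worth confirming that they line up before going further. The sketch below is one possible sanity check, not part of the original notebook: `summarize_dataset` is a hypothetical helper, and the toy lists stand in for the real `reviews` and `labels`.

```python
from collections import Counter

def summarize_dataset(reviews, labels):
    """Return label counts after checking that the two lists line up."""
    if len(reviews) != len(labels):
        raise ValueError(
            "expected one label per review, got %d reviews and %d labels"
            % (len(reviews), len(labels))
        )
    return Counter(labels)

# Hypothetical toy data, not taken from reviews.txt or labels.txt
toy_reviews = ["a touching story", "a waste of time", "superb acting"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

label_counts = summarize_dataset(toy_reviews, toy_labels)
print(label_counts["POSITIVE"], label_counts["NEGATIVE"])
```

Calling `summarize_dataset(reviews, labels)` on the real data would show how many reviews carry each label.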

In [2]:
len(reviews)

25000

In [3]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [4]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [5]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [13]:
from collections import Counter
import numpy as np

In [12]:
n_counter = Counter()
p_counter = Counter()
total_counter = Counter()

for i in range(len(reviews)):
    words = reviews[i].split()
    if labels[i] == "POSITIVE":
        for word in words:
            p_counter[word] += 1
            total_counter[word] += 1
    else:
        for word in words:
            n_counter[word] += 1
            total_counter[word] += 1
# To inspect the most frequent words in each class, uncomment for example:
# print(p_counter.most_common(10))
# print(n_counter.most_common(10))
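
To see what the three counters end up holding, the same counting loop can be run on a handful of made-up reviews. This is a minimal sketch under stated assumptions: the `toy_` names and the data are hypothetical, and `Counter.update` is used as a shorthand for the word-by-word increments above.

```python
from collections import Counter

# Hypothetical toy data, not taken from reviews.txt or labels.txt
toy_reviews = ["great great film", "terrible film", "great cast"]
toy_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

toy_p_counter = Counter()      # word counts in positive reviews
toy_n_counter = Counter()      # word counts in negative reviews
toy_total_counter = Counter()  # word counts across all reviews

for review, label in zip(toy_reviews, toy_labels):
    words = review.split()
    # Pick the counter that matches the label, then add every word to it
    target = toy_p_counter if label == "POSITIVE" else toy_n_counter
    target.update(words)
    toy_total_counter.update(words)

print(toy_p_counter["great"])      # count in positive reviews
print(toy_n_counter["great"])      # a missing key in a Counter reads as 0
print(toy_total_counter["film"])   # count across both classes
```

Note that a `Counter` returns 0 for a word it has never seen, which is why the loop never needs to check whether a key already exists.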
        

In [26]:
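pos_neg_ratios = Counter()

# Only score words that appear more than 200 times in total
for term, cnt in total_counter.most_common():
    if cnt > 200:
        # The +1 in the denominator avoids dividing by zero
        pos_neg_ratio = p_counter[term] / float(n_counter[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio

# Take the log so that positive words score above 0 and negative words below 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 0:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # ratio == 0 (word never seen in a positive review): avoid log(0)
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))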
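
The reason for taking the log is easier to see on a few hand-picked counts. The sketch below applies the same formula to single words; `log_ratio` is a hypothetical helper, the counts are made up, and `math.log` stands in for `np.log` so the example needs only the standard library.

```python
import math

def log_ratio(pos_count, neg_count):
    """Apply the notebook's positive-to-negative ratio and log transform."""
    ratio = pos_count / float(neg_count + 1)  # +1 avoids dividing by zero
    if ratio > 0:
        return math.log(ratio)
    # ratio == 0: fall back to log(0.01) instead of log(0)
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts chosen so the raw ratios are easy to read
mostly_positive = log_ratio(200, 99)  # raw ratio 2.0
mostly_negative = log_ratio(50, 99)   # raw ratio 0.5
neutral = log_ratio(100, 99)          # raw ratio 1.0
never_positive = log_ratio(0, 99)     # raw ratio 0.0

print(mostly_positive, mostly_negative, neutral, never_positive)
```

A word twice as common in positive reviews and a word twice as common in negative reviews have raw ratios of 2.0 and 0.5, which are not equally far from the neutral value of 1.0. After the log they land at roughly +0.69 and -0.69, equally far from 0, so the two ends of the ranking become directly comparable.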

In [27]:
pos_neg_ratios.most_common()

[('victoria', 2.6810215287142909),
 ('captures', 2.0386195471595809),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.9783454248084671),
 ('refreshing', 1.8551812956655511),
 ('delightful', 1.8002701588959635),
 ('beautifully', 1.7626953362841438),
 ('underrated', 1.7197859696029656),
 ('superb', 1.7091514458966952),
 ('welles', 1.6677068205580761),
 ('sinatra', 1.6389967146756448),
 ('touching', 1.637217476541176),
 ('stewart', 1.6119987332957739),
 ('brilliantly', 1.5950491749820008),
 ('friendship', 1.5677652160335325),
 ('wonderful', 1.5645425925262093),
 ('magnificent', 1.54663701119507),
 ('finest', 1.5462590108125689),
 ('jackie', 1.5439233053234738),
 ('freedom', 1.5091151908062312),
 ('fantastic', 1.5048433868558566),
 ('terrific', 1.5026699370083942),
 ('noir', 1.493925025312256),
 ('outstanding', 1.4910053152089213),
 ('nancy', 1.488077055429833),
 ('marie', 1.4825711915553104),
 ('excellent', 1.4647538505723599),
 ('chan', 1.423108334242607),
 ('gem', 1.3932148039644643

In [28]:
list(reversed(pos_neg_ratios.most_common()))[:30]

[('unfunny', -2.6922395950755678),
 ('waste', -2.6193845640165536),
 ('pointless', -2.4553061800117097),
 ('redeeming', -2.3682390632154826),
 ('lousy', -2.3075726345050849),
 ('worst', -2.2869878961803778),
 ('laughable', -2.2643638801738479),
 ('awful', -2.2271942470274348),
 ('poorly', -2.2207550747464135),
 ('sucks', -1.9870682215488209),
 ('lame', -1.981767458946166),
 ('insult', -1.9783454248084671),
 ('horrible', -1.9102590939512902),
 ('amateurish', -1.9095425048844386),
 ('pathetic', -1.9003933102308506),
 ('wasted', -1.8382794848629478),
 ('crap', -1.8281271133989299),
 ('tedious', -1.802454758344803),
 ('dreadful', -1.7725281073001673),
 ('badly', -1.753626599532611),
 ('worse', -1.7372712839439852),
 ('terrible', -1.7291085042663878),
 ('embarrassing', -1.702147310538368),
 ('mess', -1.6900958154515549),
 ('garbage', -1.686913224602391),
 ('pile', -1.6682784124570338),
 ('stupid', -1.6552583827449077),
 ('vampires', -1.6191467265610613),
 ('dull', -1.5846733548097038),
 ('a