# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity lectures
- Read the recommended course material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [7]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of the review
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know!
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know!
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [8]:
len(reviews)

25000

In [9]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [10]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [11]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

In [12]:
positive_words = ['good', 'excellent', 'fascinating']
negative_words = ['terrible', 'bad', 'impossible', 'trash']

correct_label = 0
wrong_label = 0

n = len(reviews)

for review, label in zip(reviews, labels):
    score = 1  # starts at 1, so a review with no keyword hits leans POSITIVE
    for word in positive_words:
        if word in review:
            score += 1
    for word in negative_words:
        if word in review:
            score -= 1
    if (label == 'POSITIVE' and score > 0) or (label == 'NEGATIVE' and score < 0):
        correct_label += 1
    else:
        wrong_label += 1

print("Correct: {} / {} ({:.1f}%)".format(correct_label, n, correct_label / n * 100.0))
print("Wrong:   {} / {} ({:.1f}%)".format(wrong_label, n, wrong_label / n * 100.0))


Correct: 12064 / 25000 (48.3%)
Wrong:   12936 / 25000 (51.7%)
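
One detail of the quick check above is that `word in review` is a substring test on the whole review string, so `'bad'` also matches inside `'badly'`. The sketch below shows a whole-word variant. The helper `keyword_score` and the toy reviews are hypothetical (they are not taken from `reviews.txt`), and the score starts at 0 here, which is an assumption that differs from the cell above.

```python
def keyword_score(review, positive_words, negative_words):
    """Score a review by counting keyword hits on whole tokens only."""
    tokens = set(review.split())  # whitespace tokenization
    score = 0  # assumption: a review with no hits stays undecided at 0
    score += sum(1 for w in positive_words if w in tokens)
    score -= sum(1 for w in negative_words if w in tokens)
    return score

toy_positive = ['good', 'excellent']
toy_negative = ['bad', 'terrible']

# Hypothetical examples, not taken from the dataset
toy_reviews = [
    "an excellent film with good acting",
    "terrible plot and bad pacing",
    "she badly wanted the part",  # contains 'bad' only as a substring
]

toy_scores = [keyword_score(r, toy_positive, toy_negative) for r in toy_reviews]
print(toy_scores)
```

The third review scores 0 because `'badly'` is a different token from `'bad'`, whereas the substring test would have subtracted a point.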


## Solution

In [13]:
from collections import Counter
import numpy as np

In [14]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [15]:
for review, label in zip(reviews, labels):
    # Count each word under its review's label, and in the overall tally
    label_counts = positive_counts if label == 'POSITIVE' else negative_counts
    for word in review.split(" "):
        label_counts[word] += 1
        total_counts[word] += 1
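
To see what the counting loop produces, here is a small sketch of the same idea on a hypothetical two-review dataset. The `toy_` names and the data are made up for illustration and stand in for `reviews.txt` and `labels.txt`.

```python
from collections import Counter

# Hypothetical toy data, not from the real dataset
toy_reviews = ["great great film", "awful film"]
toy_labels = ["POSITIVE", "NEGATIVE"]

toy_positive_counts = Counter()
toy_negative_counts = Counter()
toy_total_counts = Counter()

for review, label in zip(toy_reviews, toy_labels):
    words = review.split(" ")
    # Pick the per-label counter, then update it and the overall tally
    target = toy_positive_counts if label == "POSITIVE" else toy_negative_counts
    target.update(words)
    toy_total_counts.update(words)

print(toy_positive_counts)
print(toy_negative_counts["great"])  # a word never seen reads as 0, no KeyError
print(toy_total_counts["film"])
```

Because a `Counter` returns 0 for missing keys, a word that appears only in positive reviews can still be looked up in the negative counter without special handling.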

In [27]:
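pos_neg_ratios = Counter()

# Ratio of positive to negative usage, for words seen more than 1000 times
for term, cnt in total_counts.most_common():
    if cnt > 1000:
        # +1 in the denominator avoids division by zero
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Take logs so positive and negative words land on a comparable scale around 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # +0.01 keeps the log finite when the ratio is 0
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))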
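
The effect of the log transform is easier to see on a few single words. The sketch below applies the same formula to hypothetical counts. The helper `log_ratio` is an illustrative name, and it uses `math.log` in place of `np.log` so it runs without NumPy.

```python
import math

def log_ratio(pos_count, neg_count):
    """Apply the transform from the cell above to one word's counts."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts chosen to give round ratios
mostly_positive = log_ratio(400, 99)   # ratio = 4.0
mostly_negative = log_ratio(100, 399)  # ratio = 0.25
neutral = log_ratio(200, 199)          # ratio = 1.0

print(mostly_positive, mostly_negative, neutral)
```

A word used four times as often in positive reviews and a word used four times as often in negative reviews end up at roughly opposite values, and a word used equally in both lands close to 0. Without the log, the same words would sit at 4.0, 0.25, and 1.0, which are much harder to compare.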

In [28]:
# words with the highest positive-to-negative log ratio (most "POSITIVE" first)
pos_neg_ratios.most_common()

[('wonderful', 1.5645425925262093),
 ('excellent', 1.4647538505723599),
 ('amazing', 1.3919815802404802),
 ('favorite', 1.2668956297860055),
 ('perfect', 1.246742480713785),
 ('brilliant', 1.2287554137664785),
 ('loved', 1.1563661500586044),
 ('highly', 1.1420208631618658),
 ('today', 1.1050431789984001),
 ('beautiful', 0.97326301262841053),
 ('heart', 0.95238806924516806),
 ('great', 0.88810470901464589),
 ('enjoyed', 0.87070195951624607),
 ('strong', 0.84167135777060931),
 ('performances', 0.74883252516063137),
 ('simple', 0.74641420974143258),
 ('best', 0.72347034060446314),
 ('love', 0.69198533541937324),
 ('performance', 0.68797386327972465),
 ('works', 0.67445504754779284),
 ('city', 0.66820823221269321),
 ('both', 0.66248336767382998),
 ('definitely', 0.66199789483898808),
 ('father', 0.65172321672194655),
 ('gives', 0.63383568159497883),
 ('classic', 0.62504956428050518),
 ('fine', 0.60496962268013299),
 ('job', 0.59845562125168661),
 ('always', 0.59470710774669278),
 ('lives',

In [29]:
# words with the lowest positive-to-negative log ratio (most "NEGATIVE" first)
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('waste', -2.4907515123361046),
 ('worst', -2.1930856334332267),
 ('awful', -2.1385076866397488),
 ('horrible', -1.844894301366784),
 ('crap', -1.7677639636718392),
 ('worse', -1.6820086052689358),
 ('terrible', -1.6742829939664696),
 ('stupid', -1.6042380193725321),
 ('boring', -1.4475226133603798),
 ('bad', -1.3181383703873577),
 ('supposed', -1.2447538467688914),
 ('poor', -1.2354574363960786),
 ('oh', -1.060145138351082),
 ('save', -0.96543738528113643),
 ('gore', -0.94853321111328437),
 ('minutes', -0.91926883385924085),
 ('decent', -0.84602416208146713),
 ('ok', -0.8411845531731027),
 ('attempt', -0.82556118527256561),
 ('nothing', -0.81049650887674252),
 ('couldn', -0.79696597303708028),
 ('money', -0.79601605681206233),
 ('unfortunately', -0.77048254067171962),
 ('script', -0.75553540480209191),
 ('seriously', -0.74913839263329918),
 ('guess', -0.72192630932305224),
 ('instead', -0.71048388522675199),
 ('none', -0.6890744434562236),
 ('mean', -0.66114247745030508),
 ('looked',