# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help If You Need It

- Re-watch previous Udacity lectures
- Read the recommended course material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [17]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line.
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, in the same order as the reviews.
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [18]:
len(reviews)

25000

In [19]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [20]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [21]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...
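
One theory these samples suggest is that individual words such as "terrible" and "excellent" correlate with the label. The sketch below checks that idea on a tiny hand-made dataset. `toy_reviews`, `toy_labels`, and `label_counts_for` are illustrative names and invented data, not part of the notebook or of `reviews.txt`.

```python
from collections import Counter

# Toy stand-ins for `reviews` and `labels`. These four examples are made up
# for illustration and are not taken from reviews.txt.
toy_reviews = [
    "this movie is terrible",
    "an excellent film",
    "terrible acting and a terrible plot",
    "excellent cast and excellent story",
]
toy_labels = ["NEGATIVE", "POSITIVE", "NEGATIVE", "POSITIVE"]

def label_counts_for(word, reviews, labels):
    """Count how many reviews of each label contain `word` at least once."""
    counts = Counter()
    for review, label in zip(reviews, labels):
        if word in review.split():
            counts[label] += 1
    return counts

terrible = label_counts_for("terrible", toy_reviews, toy_labels)
excellent = label_counts_for("excellent", toy_reviews, toy_labels)
print("terrible :", dict(terrible))
print("excellent:", dict(excellent))
```

If the theory holds on the real data, the same kind of count should be heavily skewed toward one label for strongly opinionated words and roughly even for neutral ones.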


# Project 1: Quick Theory Validation

In [22]:
from collections import Counter
import numpy as np

In [23]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [30]:
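# Count how often each word appears in positive reviews, in negative reviews,
# and overall. The counters accumulate, so re-running this cell without
# resetting them first adds to the existing counts.
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1

positive_counts.most_common()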

[('', 3853276),
 ('the', 1213268),
 ('.', 1117578),
 ('and', 628054),
 ('a', 585816),
 ('of', 537985),
 ('to', 467222),
 ('is', 400715),
 ('in', 351505),
 ('br', 344645),
 ('it', 336175),
 ('i', 285201),
 ('that', 249410),
 ('this', 245560),
 ('s', 236705),
 ('as', 184156),
 ('with', 162729),
 ('for', 156912),
 ('was', 153419),
 ('film', 146559),
 ('but', 145754),
 ('movie', 133518),
 ('his', 120589),
 ('on', 119056),
 ('you', 116767),
 ('he', 113974),
 ('are', 103649),
 ('not', 99904),
 ('t', 96040),
 ('one', 95585),
 ('have', 88109),
 ('be', 86912),
 ('by', 83979),
 ('all', 83594),
 ('who', 80248),
 ('an', 79058),
 ('at', 78638),
 ('from', 75369),
 ('her', 73318),
 ('they', 69265),
 ('has', 64302),
 ('so', 64078),
 ('like', 63266),
 ('about', 58191),
 ('very', 58135),
 ('out', 56938),
 ('there', 56399),
 ('she', 54453),
 ('what', 54159),
 ('or', 54124),
 ('good', 54040),
 ('more', 52647),
 ('when', 52192),
 ('some', 52087),
 ('if', 50995),
 ('just', 50064),
 ('can', 49007),
 ('story'
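
The empty string `''` at the top of this list comes from how `split(" ")` treats runs of spaces: each extra space yields an empty token. A minimal sketch of that behavior follows. The `sample` string is a fragment chosen for illustration, and using `split()` with no argument is one possible alternative, not what the notebook does.

```python
from collections import Counter

# A fragment with a double space, similar to the spacing seen in the reviews.
sample = "bromwell high  s satire"

# Splitting on a single space keeps an empty string for the extra space.
with_empty = sample.split(" ")

# Splitting on any whitespace run (no argument) drops the empty tokens.
without_empty = sample.split()

print(with_empty)
print(without_empty)
print(Counter(with_empty)[""])
```

Because the notebook counts tokens from `split(" ")`, the empty string is counted like any other word.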

In [None]:
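pos_neg_ratios = Counter()
for term, cnt in total_counts.most_common():
    # Skip rare words: their ratios are based on too few occurrences.
    if cnt > 50:
        # The +1 in the denominator avoids division by zero for words
        # that never appear in a negative review.
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio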
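
To see what these ratios look like, here is a small worked example using the same formula. The counts in `toy_positive` and `toy_negative` are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical counts, chosen so the arithmetic is easy to follow.
toy_positive = Counter({"excellent": 90, "terrible": 9, "the": 500})
toy_negative = Counter({"excellent": 9, "terrible": 99, "the": 499})

def toy_ratio(term):
    """Same formula as above: positive count over (negative count + 1)."""
    return toy_positive[term] / float(toy_negative[term] + 1)

toy_ratios = {term: toy_ratio(term) for term in toy_positive}

for term, ratio in toy_ratios.items():
    print(term, round(ratio, 2))
```

With these counts, "excellent" gets 90 / 10 = 9.0, "terrible" gets 9 / 100 = 0.09, and "the" gets 500 / 500 = 1.0. Words that lean positive land above 1, words that lean negative fall between 0 and 1, and neutral words sit near 1. The scale is lopsided: a word that is nine times more positive scores 9.0, while one that is roughly eleven times more negative scores only 0.09.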
