# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [15]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [16]:
len(reviews)

25000

In [17]:
reviews[1]

'story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turned into an insane  violent mob by the crazy chantings of it  s singers . unfortunately it stays absurd the whole time with no general narrative eventually making it just too off putting . even those from the era should be turned off . the cryptic dialogue would make shakespeare seem easy to a third grader . on a technical level it  s better than you might think with some good cinematography by future great vilmos zsigmond . future stars sally kirkland and frederic forrest can be seen briefly .  '

In [18]:
labels[1]

'NEGATIVE'

# Lesson: Develop a Predictive Theory

In [19]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [20]:
from collections import Counter
import numpy as np

In [21]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [48]:
for i in range(len(reviews)):
    if (labels[i] == 'POSITIVE'):
        for word in reviews[i].split(' '):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(' '):
            negative_counts[word] += 1
            total_counts[word] += 1


In [51]:
positive_counts.most_common()

[('', 2201872),
 ('the', 693296),
 ('.', 638616),
 ('and', 358888),
 ('a', 334752),
 ('of', 307420),
 ('to', 266984),
 ('is', 228980),
 ('in', 200860),
 ('br', 196940),
 ('it', 192100),
 ('i', 162972),
 ('that', 142520),
 ('this', 140320),
 ('s', 135260),
 ('as', 105232),
 ('with', 92988),
 ('for', 89664),
 ('was', 87668),
 ('film', 83748),
 ('but', 83288),
 ('movie', 76296),
 ('his', 68908),
 ('on', 68032),
 ('you', 66724),
 ('he', 65128),
 ('are', 59228),
 ('not', 57088),
 ('t', 54880),
 ('one', 54620),
 ('have', 50348),
 ('be', 49664),
 ('by', 47988),
 ('all', 47768),
 ('who', 45856),
 ('an', 45176),
 ('at', 44936),
 ('from', 43068),
 ('her', 41896),
 ('they', 39580),
 ('has', 36744),
 ('so', 36616),
 ('like', 36152),
 ('about', 33252),
 ('very', 33220),
 ('out', 32536),
 ('there', 32228),
 ('she', 31116),
 ('what', 30948),
 ('or', 30928),
 ('good', 30880),
 ('more', 30084),
 ('when', 29824),
 ('some', 29764),
 ('if', 29140),
 ('just', 28608),
 ('can', 28004),
 ('story', 27120),
 ('

In [52]:
negative_counts.most_common()


[('', 5581638),
 ('the', 1663695),
 ('.', 1651728),
 ('a', 806311),
 ('and', 789861),
 ('of', 713628),
 ('to', 683056),
 ('is', 522316),
 ('br', 516164),
 ('it', 482364),
 ('in', 456916),
 ('i', 450389),
 ('this', 391680),
 ('that', 370195),
 ('s', 322267),
 ('was', 249788),
 ('movie', 231977),
 ('as', 223299),
 ('for', 220737),
 ('with', 215887),
 ('but', 214933),
 ('film', 197337),
 ('t', 183687),
 ('you', 172886),
 ('on', 171368),
 ('not', 157294),
 ('are', 146782),
 ('he', 145838),
 ('have', 143769),
 ('be', 139035),
 ('his', 136710),
 ('one', 132903),
 ('they', 120762),
 ('all', 120078),
 ('at', 119655),
 ('by', 109834),
 ('so', 107703),
 ('like', 105780),
 ('an', 105744),
 ('who', 104175),
 ('from', 100418),
 ('there', 99596),
 ('just', 95789),
 ('or', 95100),
 ('if', 88481),
 ('about', 88366),
 ('out', 87255),
 ('her', 87051),
 ('what', 82165),
 ('has', 80786),
 ('some', 80465),
 ('good', 75121),
 ('can', 74574),
 ('no', 70723),
 ('more', 69673),
 ('when', 69450),
 ('even', 6870

In [66]:
pos_neg_ratios = Counter()

for term,cnt in total_counts.most_common():
    if (cnt > 1000):
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)
for word,ratio in pos_neg_ratios.most_common():
    if (ratio > 1):
        pos_neg_ratios[term] = np.log(ratio)
    else:
        pos_neg_ratios[term] = -np.log(1 / (ratio + 0.01))

In [67]:
pos_neg_ratios.most_common()

[('matthau', 1.1825396825396826),
 ('victoria', 1.1587301587301588),
 ('perfection', 1.0633946830265848),
 ('captures', 1.0308724832214766),
 ('wonderfully', 1.024085637823372),
 ('powell', 1.0162162162162163),
 ('lincoln', 0.9983792544570502),
 ('refreshing', 0.984869325997249),
 ('breathtaking', 0.984822934232715),
 ('bourne', 0.9836065573770492),
 ('flynn', 0.9731800766283525),
 ('delightful', 0.9682051282051282),
 ('andrews', 0.966542750929368),
 ('elvira', 0.9562043795620438),
 ('beautifully', 0.9557975656630365),
 ('gripping', 0.9499072356215214),
 ('underrated', 0.9469964664310954),
 ('superb', 0.9397192402972749),
 ('delight', 0.9366197183098591),
 ('welles', 0.9318681318681319),
 ('sinatra', 0.9237668161434978),
 ('unforgettable', 0.9219047619047619),
 ('touching', 0.9205548549810845),
 ('favorites', 0.9189985272459499),
 ('extraordinary', 0.9177215189873418),
 ('sullivan', 0.9169054441260746),
 ('stewart', 0.9130180969060129),
 ('brilliantly', 0.9108910891089109),
 ('friendsh

In [68]:
list(reversed(pos_neg_ratios.most_common()))[:30]

[('whelk', -4.2701204247134212),
 ('boll', 0.003980099502487562),
 ('seagal', 0.014856081708449397),
 ('mst', 0.029387755102040815),
 ('unfunny', 0.03773584905660377),
 ('waste', 0.04039167686658507),
 ('blah', 0.0425208807896735),
 ('pointless', 0.04739336492890995),
 ('atrocious', 0.04889228418640183),
 ('redeeming', 0.05158912943344081),
 ('lousy', 0.05475701574264202),
 ('worst', 0.05563835072031793),
 ('laughable', 0.05695977216091136),
 ('awful', 0.058926692388635564),
 ('poorly', 0.059334604789150244),
 ('wasting', 0.060544904137235116),
 ('remotely', 0.060897435897435896),
 ('existent', 0.06818181818181818),
 ('sucks', 0.07423580786026202),
 ('lame', 0.07445708376421924),
 ('insult', 0.07492795389048991),
 ('uninteresting', 0.07727975270479134),
 ('unconvincing', 0.07953603976801989),
 ('horrible', 0.07960965588084232),
 ('amateurish', 0.07994289793004997),
 ('pathetic', 0.08044840092317837),
 ('idiotic', 0.08085106382978724),
 ('stupidity', 0.08211143695014662),
 ('wasted', 0.