# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [1]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line (x[:-1] drops the trailing newline)
with open('reviews.txt', 'r') as g:
    reviews = list(map(lambda x: x[:-1], g.readlines()))

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = list(map(lambda x: x[:-1].upper(), g.readlines()))

In [2]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [3]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [4]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


### 1) Let's first compare the lengths of positive and negative reviews

In [5]:
positive_review_length = 0
negative_review_length = 0
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        positive_review_length += len(reviews[i])
    else:
        negative_review_length += len(reviews[i])
    
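# Note: both totals are divided by len(reviews) (all reviews), not by the
# number of reviews in each class, so these are not true per-class means.
# The two values are only comparable if both classes are the same size.
print("Positive review length mean: {}\nNegative review length mean: {}".format(
        positive_review_length / len(reviews),
        negative_review_length / len(reviews)
    ))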

Positive review length mean: 683.81168
Negative review length mean: 662.319


> Negative and positive reviews seem to have roughly the same length (about 684 vs. 662 characters by this measure). There doesn't seem to be any meaningful correlation between the length of a review and its sentiment.
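
The cell above divides both totals by `len(reviews)`, which is only a fair comparison when the two classes are the same size. A minimal sketch of a true per-class mean follows. It uses a small made-up sample rather than `reviews.txt` and `labels.txt`, and the helper name `mean_length_by_label` is hypothetical:

```python
from collections import defaultdict

def mean_length_by_label(reviews, labels):
    """Return {label: mean review length in characters} for each label."""
    total_chars = defaultdict(int)
    review_count = defaultdict(int)
    for review, label in zip(reviews, labels):
        total_chars[label] += len(review)
        review_count[label] += 1
    # Divide each class total by the number of reviews in that class
    return {label: total_chars[label] / review_count[label] for label in total_chars}

# Made-up sample data, for illustration only
sample_reviews = ["great film", "awful", "loved it", "boring and far too long"]
sample_labels = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]

means = mean_length_by_label(sample_reviews, sample_labels)
print(means)
```

Calling `mean_length_by_label(reviews, labels)` on the real data should give per-class means that stay correct even if the classes are unbalanced.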

### 2) Let's find which words appear most often in positive/negative reviews

In [20]:
from collections import Counter

positive_words_dict = {}
negative_words_dict = {}

def get_range(dictionary, begin, end):
    return {k: v for k, v in dictionary.items() if begin <= k <= end}

def add_words_to_dictionnary(sentence, words_dict):
    wordcount = Counter(sentence.split())
    for key, value in wordcount.items():
        if key in words_dict:
            words_dict[key] += wordcount[key]
        else:
            words_dict[key] = wordcount[key]
    
for i in range(len(reviews)):
    if labels[i] == 'POSITIVE':
        add_words_to_dictionnary(reviews[i], positive_words_dict)
    else:
        add_words_to_dictionnary(reviews[i], negative_words_dict)
   
sorted_positives = sorted(positive_words_dict.items(), key=lambda x: x[1], reverse=True)
sorted_negatives = sorted(negative_words_dict.items(), key=lambda x: x[1], reverse=True)
print_n = 100

print("Most positive common words:")
for i in range(0, print_n):
    print(sorted_positives[i][0], end=', ')
  
print("\n\nMost negative common words:")
for i in range(0, print_n):
    print(sorted_negatives[i][0], end=', ')

Most positive common words:
the, ., and, a, of, to, is, in, br, it, i, that, this, s, as, with, for, was, film, but, movie, his, on, you, he, are, not, t, one, have, be, by, all, who, an, at, from, her, they, has, so, like, about, very, out, there, she, what, or, good, more, when, some, if, just, can, story, time, my, great, well, up, which, their, see, also, we, really, would, will, me, had, only, him, even, most, other, were, first, than, much, its, no, into, people, best, love, get, how, life, been, because, way, do, made, films, them, after, many, two, 

Most negative common words:
., the, a, and, of, to, br, is, it, i, in, this, that, s, was, movie, for, but, with, as, t, film, you, on, not, have, are, be, he, one, they, at, his, all, so, like, there, just, by, or, an, who, from, if, about, out, what, some, no, her, even, can, has, good, bad, would, up, only, more, when, she, really, time, had, my, were, which, very, me, see, don, we, their, do, story, than, been, much, get, becau

This is still not very useful. As we could have guessed without even running the code, the most common words in both lists are what we call **stopwords** ('the', 'it', 'i', ...): the most frequent words in English, but also the ones we care about the least.
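
As a side note, the counting step can be written more compactly with `Counter.update` and `Counter.most_common`. This is only a sketch on made-up sample data, and the names `sample_reviews`, `sample_labels` and `word_counts` are assumptions, not part of the notebook:

```python
from collections import Counter

# Made-up sample data, for illustration only
sample_reviews = [
    "the movie was great",
    "the plot was bad",
    "great cast and the story",
]
sample_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

# One Counter per label; update() adds the word counts of each review
word_counts = {"POSITIVE": Counter(), "NEGATIVE": Counter()}
for review, label in zip(sample_reviews, sample_labels):
    word_counts[label].update(review.split())

# most_common(n) returns the n most frequent (word, count) pairs
top_positive = word_counts["POSITIVE"].most_common(3)
print(top_positive)
```

Even in this tiny sample, the stopword "the" is tied for the most frequent word in the positive reviews, which is the same effect seen in the full output above.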

### Let's remove the stopwords and only keep the relevant words


<font color="red">Careful: The operations in this section may take some time to execute</font>

For this task, let's use the nltk library, which ships with a list of English stopwords that we can use to filter our word lists.

In [18]:
from nltk.corpus import stopwords

# Build the stopword set once. Calling stopwords.words('english') inside the
# loop rebuilds the list for every word, which is what makes this cell slow.
# (The corpus must be available locally, e.g. via nltk.download('stopwords').)
english_stopwords = set(stopwords.words('english'))

# Keep the frequency ordering of the sorted lists, dropping stopwords
filtered_positive_words = [word for word, _ in sorted_positives
                           if word not in english_stopwords]
filtered_negative_words = [word for word, _ in sorted_negatives
                           if word not in english_stopwords]

print("Most positive common words:", filtered_positive_words[:50])
print("\nMost negative common words:", filtered_negative_words[:50])

Most positive common words: ['.', 'br', 'film', 'movie', 'one', 'like', 'good', 'story', 'time', 'great', 'well', 'see', 'also', 'really', 'would', 'even', 'first', 'much', 'people', 'best', 'love', 'get', 'life', 'way', 'made', 'films', 'many', 'two', 'think', 'movies', 'characters', 'character', 'man', 'show', 'watch', 'seen', 'little', 'still', 'make', 'could', 'never', 'know', 'years', 'ever', 'end', 'real', 'scene', 'back', 'though', 'new']

Most negative common words: ['.', 'br', 'movie', 'film', 'one', 'like', 'even', 'good', 'bad', 'would', 'really', 'time', 'see', 'story', 'much', 'get', 'people', 'make', 'could', 'made', 'first', 'well', 'plot', 'movies', 'acting', 'way', 'think', 'also', 'characters', 'watch', 'character', 'better', 'know', 'seen', 'ever', 'never', 'two', 'little', 'films', 'nothing', 'say', 'end', 'something', 'many', 'thing', 'show', 'scene', 'scenes', 'go', 'great']


Some words such as "great" now rank high in the list of positive words, and others such as "bad" rank high in the list of negative words. Unfortunately, many words that sound positive, such as "like" and "good", also appear near the top of the negative list, so raw counts alone do not separate the two classes well.
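
To make the overlap concrete, the two top-50 lists printed above can be compared directly. This is a minimal sketch that copies the lists from the output. The names `shared`, `only_positive` and `only_negative` are illustrative, not part of the notebook:

```python
# Top-50 lists copied from the output above
top_positive = ['.', 'br', 'film', 'movie', 'one', 'like', 'good', 'story', 'time',
                'great', 'well', 'see', 'also', 'really', 'would', 'even', 'first',
                'much', 'people', 'best', 'love', 'get', 'life', 'way', 'made',
                'films', 'many', 'two', 'think', 'movies', 'characters', 'character',
                'man', 'show', 'watch', 'seen', 'little', 'still', 'make', 'could',
                'never', 'know', 'years', 'ever', 'end', 'real', 'scene', 'back',
                'though', 'new']
top_negative = ['.', 'br', 'movie', 'film', 'one', 'like', 'even', 'good', 'bad',
                'would', 'really', 'time', 'see', 'story', 'much', 'get', 'people',
                'make', 'could', 'made', 'first', 'well', 'plot', 'movies', 'acting',
                'way', 'think', 'also', 'characters', 'watch', 'character', 'better',
                'know', 'seen', 'ever', 'never', 'two', 'little', 'films', 'nothing',
                'say', 'end', 'something', 'many', 'thing', 'show', 'scene', 'scenes',
                'go', 'great']

positive_set = set(top_positive)
negative_set = set(top_negative)

# Words in both lists carry little signal when ranked by raw count alone
shared = [w for w in top_positive if w in negative_set]
only_positive = [w for w in top_positive if w not in negative_set]
only_negative = [w for w in top_negative if w not in positive_set]

print("Shared:", shared)
print("Only in positive top 50:", only_positive)
print("Only in negative top 50:", only_negative)
```

Words such as "good" and "like" land in the shared list, while words such as "love" and "bad" appear in only one list, which hints at which words are more likely to be informative.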