# Sentiment Analysis on Movie Reviews

* [Kaggle - Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)
* [Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding](https://arxiv.org/pdf/1504.01255.pdf)

In [1]:
import pandas as pd
import numpy as np
import nltk
import csv

from keras.datasets import imdb
from keras.models import Sequential

Using TensorFlow backend.


In [2]:
(train_x, train_y), (test_x, test_y) = imdb.load_data()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz

In [3]:
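# Download every NLTK corpus and model (a large download).
# halt_on_error=False keeps going if an individual package fails.
nltk.download('all', halt_on_error=False)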

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to
...

[nltk_data]    |   Unzipping corpora/sentiwordnet.zip.
[nltk_data]    | Downloading package sentence_polarity to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/sentence_polarity.zip.
[nltk_data]    | Downloading package shakespeare to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/sinica_treebank.zip.
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/smultron.zip.
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/state_union.zip.
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Unzipping corpora/stopwords.zip.
...

True

### Data

This dataset collects Rotten Tomatoes reviews together with their sentiment scores. <br>
Each sentence has been parsed into phrases by the Stanford parser. <br>
Each phrase has a PhraseId, and each sentence has a SentenceId.

0 - negative<br>
1 - somewhat negative<br>
2 - neutral<br>
3 - somewhat positive<br>
4 - positive<br>
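
The label scheme above can be captured in a small lookup table. This is only an illustrative sketch: `SENTIMENT_NAMES` and `label_name` are hypothetical names, not part of the dataset or of any library. Because a TSV reader yields labels as strings, the helper accepts both integers and numeric strings.

```python
# Hypothetical helper: map the five sentiment classes to readable names.
SENTIMENT_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}


def label_name(label):
    """Return the class name for a label given as an int or a numeric string."""
    return SENTIMENT_NAMES[int(label)]


print(label_name("3"))
```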

In [30]:
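# Read the tab-separated training file once and keep the rows in memory.
# A csv reader is a one-shot iterator, so mapping over it twice would
# leave the second result (the labels) empty.
with open('/dataset/imdb-sentiment-analysis/train.tsv', 'rt') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)  # skip the header row
    rows = list(reader)

sentences = [row[2] for row in rows]  # phrase text
labels = [row[3] for row in rows]     # sentiment label (as a string)
sentences

# train_raw = pd.read_csv('/dataset/imdb-sentiment-analysis/train.tsv', delimiter='\t')
# train_raw['Phrase']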


['A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .',
 'A series of escapades demonstrating the adage that what is good for the goose',
 'A series',
 'A',
 'series',
 'of escapades demonstrating the adage that what is good for the goose',
 'of',
 'escapades demonstrating the adage that what is good for the goose',
 'escapades',
 'demonstrating the adage that what is good for the goose',
 'demonstrating the adage',
 'demonstrating',
 'the adage',
 'the',
 'adage',
 'that what is good for the goose',
 'that',
 'what is good for the goose',
 'what',
 'is good for the goose',
 'is',
 'good for the goose',
 'good',
 'for the goose',
 'for',
 'the goose',
 'goose',
 'is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .',
 'is also good for the gander , some of which occasionally amuses but none of 
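
The loading step can be tried without the Kaggle file. The sketch below assumes the column layout PhraseId, SentenceId, Phrase, Sentiment, which is consistent with the column indices 2 and 3 used above. The rows in `SAMPLE_TSV` are made up for illustration and are not records from the dataset.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed train.tsv layout (not real dataset records).
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\ta fine film\t3\n"
    "2\t1\ta fine\t3\n"
    "3\t1\tfilm\t2\n"
    "4\t2\tdull and slow\t1\n"
)

reader = csv.reader(io.StringIO(SAMPLE_TSV), delimiter="\t")
next(reader)         # skip the header row
rows = list(reader)  # materialize once; the reader cannot be iterated twice

sentences = [row[2] for row in rows]
labels = [int(row[3]) for row in rows]

# Group phrases by SentenceId to see how one sentence expands into phrases.
phrases_by_sentence = defaultdict(list)
for phrase_id, sentence_id, phrase, sentiment in rows:
    phrases_by_sentence[int(sentence_id)].append(phrase)

print(sentences)
print(labels)
print(dict(phrases_by_sentence))
```

Grouping by SentenceId makes it easy to see that the first phrase of each group is the full sentence and the rest are its sub-phrases, as in the output above.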

In [3]:
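# Load the training set with pandas, then POS-tag the first phrase.
train_raw = pd.read_csv('/dataset/imdb-sentiment-analysis/train.tsv', delimiter='\t')
d = train_raw['Phrase']
text = nltk.word_tokenize(d[0])
nltk.pos_tag(text)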

[('A', 'DT'),
 ('series', 'NN'),
 ('of', 'IN'),
 ('escapades', 'NNS'),
 ('demonstrating', 'VBG'),
 ('the', 'DT'),
 ('adage', 'NN'),
 ('that', 'IN'),
 ('what', 'WP'),
 ('is', 'VBZ'),
 ('good', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('goose', 'NN'),
 ('is', 'VBZ'),
 ('also', 'RB'),
 ('good', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('gander', 'NN'),
 (',', ','),
 ('some', 'DT'),
 ('of', 'IN'),
 ('which', 'WDT'),
 ('occasionally', 'RB'),
 ('amuses', 'VBZ'),
 ('but', 'CC'),
 ('none', 'NN'),
 ('of', 'IN'),
 ('which', 'WDT'),
 ('amounts', 'NNS'),
 ('to', 'TO'),
 ('much', 'JJ'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('story', 'NN'),
 ('.', '.')]
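
One possible use of these tags is to keep only adjectives and adverbs, on the assumption (made for this example, not stated in the notebook) that they carry much of the sentiment signal. The `tagged` list is copied from part of the output above so the snippet runs without NLTK; `words_with_tags` is a hypothetical helper.

```python
from collections import Counter

# A slice of the pos_tag output shown above.
tagged = [
    ("is", "VBZ"),
    ("also", "RB"),
    ("good", "JJ"),
    ("for", "IN"),
    ("the", "DT"),
    ("gander", "NN"),
]


def words_with_tags(tagged_tokens, prefixes=("JJ", "RB")):
    """Keep words whose Penn Treebank tag starts with one of the prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]


# Count how often each tag occurs in the slice.
tag_counts = Counter(tag for _, tag in tagged)

print(words_with_tags(tagged))
print(tag_counts)
```

Matching on a prefix rather than an exact tag also picks up comparative and superlative forms such as JJR, JJS, RBR and RBS.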

In [4]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or