# Sentiment Analysis on Movie Reviews

* [Kaggle - Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)


In [1]:
import pandas as pd
import numpy as np
import nltk

from keras.models import Sequential

Using TensorFlow backend.


In [2]:
# Download NLTK Corpus
nltk.download('all', halt_on_error=False)

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package biocreative_ppi is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package brown_tei is already up-to-date!
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/anderson/nltk_data...
[nltk_data]    |   Package cess_cat is already up-to-date!
...

True

### Data

This dataset collects Rotten Tomatoes movie reviews together with their sentiment scores. <br>
Each sentence has been parsed into phrases by the Stanford parser. <br>
Each phrase has a PhraseId, and each sentence has a SentenceId.
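To make the PhraseId / SentenceId relationship concrete, here is a minimal sketch that groups phrases under the sentence they belong to. It uses only the standard library and an inline sample instead of the real `train.tsv`, so it runs without the dataset. The helper name `phrases_by_sentence` and the SentenceId 2 rows are hypothetical, added purely for illustration; the SentenceId 1 rows are copied from the table shown in this notebook.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for train.tsv. The SentenceId 1 rows come from the table in
# this notebook; the SentenceId 2 rows are made up for illustration.
SAMPLE_TSV = (
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "3\t1\tA series\t2\n"
    "4\t1\tA\t2\n"
    "5\t1\tseries\t2\n"
    "100\t2\tgood fun\t3\n"
    "101\t2\tgood\t3\n"
)


def phrases_by_sentence(tsv_text):
    """Group (PhraseId, Phrase, Sentiment) tuples under their SentenceId."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups[int(row["SentenceId"])].append(
            (int(row["PhraseId"]), row["Phrase"], int(row["Sentiment"]))
        )
    return dict(groups)


groups = phrases_by_sentence(SAMPLE_TSV)
for sentence_id, phrases in groups.items():
    # Within this sample, the longest phrase stands in for the full sentence.
    longest = max(phrases, key=lambda p: len(p[1]))
    print(sentence_id, len(phrases), repr(longest[1]))
```

Every phrase carries its own sentiment label, so one sentence contributes many training rows: the sentence itself plus the sub-phrases produced by the parser.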

In [2]:
# train.tsv is tab-separated: PhraseId, SentenceId, Phrase, Sentiment
train_raw = pd.read_csv('/dataset/imdb-sentiment-analysis/train.tsv', delimiter='\t')
train_raw.head(10)

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |
| 5 | 6 | 1 | of escapades demonstrating the adage that what... | 2 |
| 6 | 7 | 1 | of | 2 |
| 7 | 8 | 1 | escapades demonstrating the adage that what is... | 2 |
| 8 | 9 | 1 | escapades | 2 |
| 9 | 10 | 1 | demonstrating the adage that what is good for ... | 2 |


In [3]:
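# Tokenize the first phrase (the full first sentence) and tag each token
# with its Penn Treebank part-of-speech tag
d = train_raw['Phrase']
text = nltk.word_tokenize(d[0])
nltk.pos_tag(text)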

[('A', 'DT'),
 ('series', 'NN'),
 ('of', 'IN'),
 ('escapades', 'NNS'),
 ('demonstrating', 'VBG'),
 ('the', 'DT'),
 ('adage', 'NN'),
 ('that', 'IN'),
 ('what', 'WP'),
 ('is', 'VBZ'),
 ('good', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('goose', 'NN'),
 ('is', 'VBZ'),
 ('also', 'RB'),
 ('good', 'JJ'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('gander', 'NN'),
 (',', ','),
 ('some', 'DT'),
 ('of', 'IN'),
 ('which', 'WDT'),
 ('occasionally', 'RB'),
 ('amuses', 'VBZ'),
 ('but', 'CC'),
 ('none', 'NN'),
 ('of', 'IN'),
 ('which', 'WDT'),
 ('amounts', 'NNS'),
 ('to', 'TO'),
 ('much', 'JJ'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('story', 'NN'),
 ('.', '.')]
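The tagger returns a plain list of `(word, tag)` tuples, so it can be summarized with ordinary Python. The sketch below copies the first tokens of the output above into a literal list (so it runs without NLTK data) and counts tags and pulls out adjectives and nouns. Matching on the `JJ` and `NN` prefixes is an illustrative choice here, not something the notebook itself does.

```python
from collections import Counter

# First tokens of the tagger output shown above, copied as (word, tag) pairs.
tagged = [
    ("A", "DT"), ("series", "NN"), ("of", "IN"), ("escapades", "NNS"),
    ("demonstrating", "VBG"), ("the", "DT"), ("adage", "NN"), ("that", "IN"),
    ("what", "WP"), ("is", "VBZ"), ("good", "JJ"), ("for", "IN"),
    ("the", "DT"), ("goose", "NN"),
]

# How often each tag occurs in the phrase.
tag_counts = Counter(tag for _, tag in tagged)

# Prefix matching also catches JJR/JJS (adjectives) and NNS/NNP/NNPS (nouns).
adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(tag_counts)
print(adjectives)
print(nouns)
```

Adjectives such as "good" tend to carry much of the sentiment signal, which is one reason to look at tags before building features.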

In [4]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or