# IMDb

## Data Preprocessing

In [0]:
import urllib.request
import os 
import tarfile
url = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
filepath = "data/"

if not os.path.isfile('aclImdb_v1.tar.gz'):
    result = urllib.request.urlretrieve(url, 'aclImdb_v1.tar.gz')
    print('download: ', result)

In [0]:
# Extract the archive unless it has already been unpacked
if not os.path.exists('aclImdb'):
    with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tfile:
        tfile.extractall('.')

In [0]:
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
import re

In [0]:
def rm_tags(text):
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub(' ', text)  
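
As a quick check of what `rm_tags` does, here is a minimal sketch that runs it on a made-up review string (the sample text is an assumption, not taken from the dataset). Each HTML tag is replaced by a single space, so two adjacent `<br />` tags leave two spaces behind.

```python
import re

def rm_tags(text):
    # Same pattern as above: anything between '<' and '>' becomes one space
    re_tag = re.compile(r'<[^>]+>')
    return re_tag.sub(' ', text)

sample = 'Great film.<br /><br />Would watch again.'  # hypothetical review text
cleaned = rm_tags(sample)
print(repr(cleaned))  # 'Great film.  Would watch again.'
```

The doubled space is harmless here because the tokenizer later splits on whitespace.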

In [13]:
def read_files(filetype):
    """Read every review under aclImdb/<filetype>/ and return (texts, labels)."""
    path = 'aclImdb/'
    file_list = []
    all_labels = []
    # Positive reviews are labeled 1, negative reviews 0
    for subdir, label in (('/pos/', 1), ('/neg/', 0)):
        dir_path = path + filetype + subdir
        files = [dir_path + f for f in os.listdir(dir_path)]
        file_list += files
        all_labels += [label] * len(files)  # one label per file actually found

    print('read', filetype, 'files:', len(file_list))

    all_texts = []
    for f in file_list:
        with open(f, encoding='utf8') as file_input:
            all_texts += [rm_tags(" ".join(file_input.readlines()))]

    return all_texts, all_labels


X_train, y_train = read_files('train')
X_test, y_test = read_files('test')

read train files: 25000
read test files: 25000


In [87]:
token = Tokenizer(num_words=4000)
token.fit_on_texts(X_train)
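print(token.document_count)  # number of documents read
# print(token.word_index)  # lower index = more frequent word
# Note: word_index lists every word seen, even with num_words=4000;
# the limit is only applied when texts are converted to sequences.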

25000
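
To illustrate why `word_index` contains every word even though `num_words` is set, here is a simplified pure-Python sketch of a frequency-ranked vocabulary with a cap. It is not Keras's implementation (it skips punctuation filtering, for example), and the helper names `build_word_index` and `to_sequence` and the toy documents are assumptions made for this example.

```python
from collections import Counter

def build_word_index(texts):
    """Rank every word by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count; break ties alphabetically to stay deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; drop the rest."""
    return [word_index[w] for w in text.lower().split()
            if w in word_index and word_index[w] < num_words]

docs = ['the film was good', 'the film was bad', 'the plot was thin']  # toy data
word_index = build_word_index(docs)
seq = to_sequence('the film was good', word_index, num_words=4)

print(len(word_index))  # 7: the full vocabulary is kept
print(seq)              # [1, 3, 2]: 'good' (index 5) is dropped by the cap
```

The cap is applied when converting text to indices, which is why printing the full index still shows rare words.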


In [0]:
X_train_seq = token.texts_to_sequences(X_train)
X_test_seq = token.texts_to_sequences(X_test)

In [94]:
# Pad the sequences to a fixed length:
# shorter ones are filled with 0, longer ones are truncated from the front
X_train_seq_padding = sequence.pad_sequences(X_train_seq, maxlen=300)
X_test_seq_padding = sequence.pad_sequences(X_test_seq, maxlen=300)
print(len(X_train_seq_padding[0]))

300
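
The padding rule described above (fill with zeros at the front, cut from the front when too long) can be sketched in plain Python. The helper `pad_pre` is a hypothetical name used only for illustration; it mimics the default behavior of `pad_sequences` for a single sequence rather than reproducing the library code.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]          # truncate from the front
    return [value] * (maxlen - len(seq)) + seq  # pad at the front

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

Truncating from the front keeps the end of each review, which is often where the verdict is stated.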


In [0]:
# Requires a trained `model`; call this only after training.
def display_test_Sentiment(i):
    SentimentDict = {1: '正面', 0: '負面'}  # 正面 = positive, 負面 = negative
    print(X_test[i])
    # Predict only review i instead of the whole test set
    predict = model.predict_classes(X_test_seq_padding[i:i + 1]).reshape(-1)
    print('label:', SentimentDict[y_test[i]], 'prediction:', SentimentDict[predict[0]])
# display_test_Sentiment(2)

## MLP

In [95]:
from keras.models import Sequential
from keras.layers import Dropout, Dense, Activation, Flatten, Embedding

model = Sequential()
model.add(Embedding(input_dim=4000, input_length=300, output_dim=32))
model.add(Dropout(0.2))

model.add(Flatten())

model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.35))

model.add(Dense(units=1, activation='sigmoid'))
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 300, 32)           128000    
_________________________________________________________________
dropout_10 (Dropout)         (None, 300, 32)           0         
_________________________________________________________________
flatten_6 (Flatten)          (None, 9600)              0         
_________________________________________________________________
dense_9 (Dense)              (None, 256)               2457856   
_________________________________________________________________
dropout_11 (Dropout)         (None, 256)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 1)                 257       
Total params: 2,586,113
Trainable params: 2,586,113
Non-trainable params: 0
_________________________________________________________________
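
The parameter counts in the summary can be checked by hand. The short calculation below reproduces them from the layer sizes used in the model definition; the variable names are chosen for this example only.

```python
vocab_size, seq_len, embed_dim, hidden = 4000, 300, 32, 256

embedding_params = vocab_size * embed_dim       # one 32-d vector per word index
flatten_units = seq_len * embed_dim             # 300 x 32 values per review
dense_params = flatten_units * hidden + hidden  # weights plus biases
output_params = hidden * 1 + 1                  # single sigmoid unit
total = embedding_params + dense_params + output_params

print(embedding_params)  # 128000
print(flatten_units)     # 9600
print(dense_params)      # 2457856
print(output_params)     # 257
print(total)             # 2586113
```

Almost all of the weights sit in the first dense layer, because flattening turns each review into a 9600-dimensional vector.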


In [96]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
train_history = model.fit(X_train_seq_padding, y_train, batch_size=100, epochs=10, verbose=2, validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
 - 3s - loss: 0.4574 - acc: 0.7707 - val_loss: 0.4155 - val_acc: 0.8176
Epoch 2/10
 - 2s - loss: 0.1838 - acc: 0.9317 - val_loss: 0.4232 - val_acc: 0.8242
Epoch 3/10
 - 2s - loss: 0.0710 - acc: 0.9765 - val_loss: 0.6172 - val_acc: 0.7936
Epoch 4/10
 - 2s - loss: 0.0250 - acc: 0.9925 - val_loss: 0.8191 - val_acc: 0.7866
Epoch 5/10
 - 2s - loss: 0.0114 - acc: 0.9971 - val_loss: 1.2736 - val_acc: 0.7302
Epoch 6/10
 - 2s - loss: 0.0078 - acc: 0.9980 - val_loss: 1.1843 - val_acc: 0.7636
Epoch 7/10
 - 2s - loss: 0.0076 - acc: 0.9977 - val_loss: 1.2086 - val_acc: 0.7752
Epoch 8/10
 - 2s - loss: 0.0116 - acc: 0.9962 - val_loss: 1.1279 - val_acc: 0.7952
Epoch 9/10
 - 2s - loss: 0.0159 - acc: 0.9940 - val_loss: 1.2891 - val_acc: 0.7686
Epoch 10/10
 - 2s - loss: 0.0174 - acc: 0.9941 - val_loss: 1.1154 - val_acc: 0.7966
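
One caveat when reading these validation numbers: Keras takes the `validation_split` fraction from the last samples of the data, before any shuffling. Since `read_files` returns all positive reviews first and all negative reviews second, the last 20% here would contain only negative reviews. The sketch below checks this on the label ordering alone and shows one possible fix, shuffling with a fixed seed; the seed and variable names are assumptions for illustration.

```python
import random

labels = [1] * 12500 + [0] * 12500        # same ordering as read_files('train')
split = len(labels) - len(labels) // 5    # validation slice starts at 20000

val_pos_before = sum(labels[split:])      # positives in the last 20%
train_pos_before = sum(labels[:split])    # positives in the first 80%

# Possible fix: shuffle indices once, then reorder inputs and labels together
order = list(range(len(labels)))
random.Random(0).shuffle(order)           # fixed seed, chosen arbitrarily
shuffled = [labels[i] for i in order]
val_pos_after = sum(shuffled[split:])

print(val_pos_before)    # 0: the validation slice has no positive reviews
print(train_pos_before)  # 12500
print(val_pos_after)     # roughly half of the 5000 validation samples
```

In the notebook, the same permutation would have to be applied to both `X_train_seq_padding` and `y_train` before calling `model.fit`, so that each review keeps its label.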


In [97]:
scores = model.evaluate(X_test_seq_padding, y_test, verbose=1)
scores



[0.7545211400258541, 0.84848]
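
The training log above shows validation loss at its lowest after the first epoch and rising afterwards while training loss keeps falling, which usually indicates overfitting. As a rough sketch, the values copied from that log can be used to find the epoch where training could have stopped.

```python
# val_loss and val_acc per epoch, copied from the training log above
val_loss = [0.4155, 0.4232, 0.6172, 0.8191, 1.2736,
            1.1843, 1.2086, 1.1279, 1.2891, 1.1154]
val_acc = [0.8176, 0.8242, 0.7936, 0.7866, 0.7302,
           0.7636, 0.7752, 0.7952, 0.7686, 0.7966]

# Epochs are 1-based in the log, list indices are 0-based
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_acc_epoch = max(range(len(val_acc)), key=val_acc.__getitem__) + 1

print(best_loss_epoch)  # 1
print(best_acc_epoch)   # 2
```

Training for one or two epochs, or using the `EarlyStopping` callback from `keras.callbacks` with `monitor='val_loss'`, would be one way to act on this. Whether it improves the test score has not been checked here.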

In [51]:
display_test_Sentiment(15002)

Essentially plotless action film has two good guys (Fong and Roundtree) pitted against two bad guys (Mitchell and Pierce). Fong is perhaps the most uncharismatic action lead of the 80s, Roundtree's small part is a far cry from his "Shaft" days, and Cameron Mitchell adds another shameful role to his career, one to sit right next to his laughable turn in "The Toolbox Murders" (this man was a respected actor once, now he has come down to wearing flowers in his hair and complaining about people bleeding on his carpet). Only Stack Pierce acts with some dignity. As for the violence, don't worry: most of it is too badly done to offend anyone. (*1/2)
label: 負面 prediction: 負面


## Predicting Beauty and the Beast
http://www.imdb.com/title/tt2771200/reviews

In [86]:
text = input()
# text

# texts_to_sequences expects a list, so wrap the string in one
token_text = token.texts_to_sequences([text])
# Pad to 300, the input length the model was trained on
token_text_padding = sequence.pad_sequences(token_text, maxlen=300)
predict_result = model.predict_classes(token_text_padding)
if predict_result[0][0] == 0:
    print('負面的')  # negative
else:
    print('正面的')  # positive

I was really looking forward to this film. Not only has Disney recently made excellent live-action versions of their animated masterpieces (Jungle Book, Cinderella), but the cast alone (Emma Watson, Ian McKellen, Kevin Kline) already seemed to make this one a sure hit. Well, not so much as it turns out.  Some of the animation is fantastic, but because characters like Cogsworth (the clock), Lumière (the candelabra) and Chip (the little tea cup) now look "realistic", they lose a lot of their animated predecessors' charm and actually even look kind of creepy at times. And ironically - unlike in the animated original - in this new realistic version they only have very limited facial expressions (which is a creative decision I can't for the life of me understand).   Even when it works: there can be too much of a good thing. The film is overstuffed with lush production design and cgi (which is often weirdly artificial looking though) but sadly lacking in charm and genuine emotion. If this we