### Sentiment movies reviews
The dataset is comprised of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each Sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId. Each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

train.tsv contains the phrases and their associated sentiment labels. We have additionally provided a SentenceId so that you can track which phrases belong to a single sentence.
test.tsv contains just phrases. You must assign a sentiment label to each phrase.
The sentiment labels are:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

the [data](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) splited into two section (train and test)

In [54]:
import pandas as pd
import numpy as np
import csv
import matplotlib.pyplot as plt
%matplotlib inline

In [55]:
train_data = []
with open("./data/train.tsv") as tsvfile:
    train = csv.reader(tsvfile,delimiter = '\t')
    for row in train:
        train_data.append(row)
train_data = pd.DataFrame(train_data[1:],columns=train_data[0])

In [56]:
test_data = []
with open("./data/test.tsv") as tsvfile:
    test = csv.reader(tsvfile,delimiter = '\t')
    for row in test:
        test_data.append(row)
test_data = pd.DataFrame(test_data[1:],columns=test_data[0])

In [57]:
solution = pd.read_csv('./data/sampleSubmission.csv')

In [5]:
# train_data.set_index('PhraseId',inplace=True)
# test_data.set_index('PhraseId',inplace=True)

In [59]:
train_data.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [60]:
test_data.head()

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine


In [95]:
solution.head()
# print(len(solution))

Unnamed: 0,PhraseId,Sentiment
0,156061,2
1,156062,2
2,156063,2
3,156064,2
4,156065,2


In [8]:
# train_data.drop(columns='PhraseId',inplace=True)
# test_data.drop(columns='PhraseId',inplace=True)

In [68]:
train_data['SentenceId'] = train_data.SentenceId.astype('int64',inplace = True)
test_data['SentenceId'] = test_data.SentenceId.astype('int64',inplace = True)
train_data['Sentiment'] = train_data.Sentiment.astype('int64',inplace = True)
train_data['PhraseId'] = train_data.SentenceId.astype('int64',inplace = True)
test_data['PhraseId'] = test_data.SentenceId.astype('int64',inplace = True)

In [12]:
# train_data['SentenceId'] = train_data.SentenceId.astype('int64')
# test_data['SentenceId'] = test_data.SentenceId.astype('int64')
# train_data['Sentiment'] = train_data.Sentiment.astype('int64')
# train_data['PhraseId'] = train_data.SentenceId.astype('int64')
# test_data['PhraseId'] = test_data.SentenceId.astype('int64')

In [69]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [70]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)
memory usage: 1.5+ MB


In [71]:
df_train = pd.DataFrame(columns=train_data.columns)
for senid in train_data['SentenceId'].unique():
    length = train_data[train_data.SentenceId == senid]['Phrase'].apply(lambda x:len(x)).max()
    idx = train_data[train_data.SentenceId == senid]['Phrase'].apply(lambda x:len(x)).idxmax()
    temp = train_data.iloc[[idx]]
    df_train = pd.concat([df_train,temp])
        
#     b = train_data[train_data.Phrase == "{}".format(senid)].Phrase.apply(lambda x:len(x))
#     for i,contex in enumerate(train_data[train_data.iloc[:,0]=="{}".format(senid)].iloc[:,1]):       

In [72]:
df_test = pd.DataFrame(columns=test_data.columns)
for senid in test_data['SentenceId'].unique():
    length = test_data[test_data.SentenceId == senid]['Phrase'].apply(lambda x:len(x)).max()
    idx = test_data[test_data.SentenceId == senid]['Phrase'].apply(lambda x:len(x)).idxmax()
    temp = test_data.iloc[[idx]]
    df_test = pd.concat([df_test,temp])

In [73]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8529 entries, 0 to 156039
Data columns (total 4 columns):
PhraseId      8529 non-null object
SentenceId    8529 non-null object
Phrase        8529 non-null object
Sentiment     8529 non-null object
dtypes: object(4)
memory usage: 333.2+ KB


In [74]:
df_train.head(10)

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
63,2,2,"This quiet , introspective and entertaining in...",4
81,3,3,"Even fans of Ismail Merchant 's work , I suspe...",1
116,4,4,A positively thrilling combination of ethnogra...,3
156,5,5,Aggressive self-glorification and a manipulati...,1
166,6,6,A comedy-drama of nearly epic proportions root...,4
198,7,7,"Narratively , Trouble Every Day is a plodding ...",1
213,8,8,"The Importance of Being Earnest , so thick wit...",3
247,9,9,But it does n't leave you with much .,1
259,10,10,You could hate it for the same reason .,1


In [93]:
len(test_data)

66292

In [75]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import regexp_tokenize
sw = stopwords.words('english')
stem = PorterStemmer()

In [76]:
def text_process(text):
    word = text.lower()
    word = word_tokenize(word)
    word = [words for words in word if word not in sw]
    word = [stem.stem(i) for i in word]
    word = " ".join(word)
    word = regexp_tokenize(word,'\w+')
    word = " ".join(word)
    return word

In [77]:
df_train['Phrase'] = df_train['Phrase'].apply(text_process)

In [78]:
df_test['Phrase'] = df_test['Phrase'].apply(text_process)

In [20]:
# from sklearn.feature_extraction.text import CountVectorizer
# from sklearn.naive_bayes import MultinomialNB
# from sklearn.metrics import classification_report

In [103]:
import tensorflow as tf
from tensorflow.keras.layers import Dense,Embedding
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer,text_to_word_sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
# max_feature = 2000

In [122]:
X_train = df_train['Phrase']
y_train = df_train['Sentiment']
X_test = df_test['Phrase']

In [104]:
max_length = max([len(s.split()) for s in X_train])

In [105]:
vocab_size = len(tokenize.word_index)+1

In [88]:
tokenize = Tokenizer(num_words=10000,oov_token='OOV')
tokenize.fit_on_texts(X_train)

In [123]:
X_train_seq = tokenize.texts_to_sequences(X_train)
X_train_seq = pad_sequences(X_train_seq,maxlen=max_length)
y_train = tf.keras.utils.to_categorical(y_train)
X_test_seq = tokenize.texts_to_sequences(X_test)
X_test_seq = pad_sequences(X_test_seq,maxlen=max_length)
y_test = tf.keras.utils.to_categorical(solution['Sentiment'])

In [129]:
model = Sequential()
model.add(Embedding(input_dim = vocab_size,output_dim=128,input_length = max_length))
model.add(LSTM(128,dropout=0.2,return_sequences=True))
model.add(LSTM(128,dropout=0.2))
model.add(Dense(5,activation='softmax'))

In [130]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])

In [131]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 48, 128)           1407232   
_________________________________________________________________
lstm_5 (LSTM)                (None, 48, 128)           131584    
_________________________________________________________________
lstm_6 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 645       
Total params: 1,671,045
Trainable params: 1,671,045
Non-trainable params: 0
_________________________________________________________________


In [132]:
model.fit(X_train_seq,y_train,
          batch_size=48,
          epochs=10)

Epoch 1/10
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x21bb755f160>