# Essential Imports
* `os` to check for cached files on disk
* `pandas` to read the CSV data files
* `tensorflow` (Keras) to tokenize and pad the text and to build the model
* `tensorboard` to visualize the learning process
* `numpy` to store and cache the processed arrays
* `hazm` to normalize Farsi text

In [1]:
import os
import tensorflow as tf
import tensorboard
import pandas as pd
import numpy as np
from datetime import datetime
from hazm import Normalizer, sent_tokenize, word_tokenize, Stemmer
from tensorflow.python.client import device_lib
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



# Tensorflow version check
* The latest TensorFlow version available at the time of writing was 1.13.1, but this notebook relies on 2.x-style behavior, so we call `tf.enable_eager_execution()` to run operations eagerly, as 2.x does.

**Note**: If you are using TensorFlow 2.x, eager execution is already on by default, so you don't need this line.
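
To let the same notebook run on either major version, the call can be guarded by a version check. The sketch below only parses the version string; `needs_eager_call` is a hypothetical helper name introduced for illustration, not part of TensorFlow.

```python
def needs_eager_call(version: str) -> bool:
    """Return True for TensorFlow 1.x version strings, where eager
    execution has to be switched on explicitly."""
    major = int(version.split(".")[0])
    return major < 2


# Intended use in the notebook (assumes `tf` is already imported):
#     if needs_eager_call(tf.__version__):
#         tf.enable_eager_execution()
print(needs_eager_call("1.13.1"))  # True
print(needs_eager_call("2.4.0"))   # False
```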

In [2]:
print(tf.__version__)
tf.enable_eager_execution()

print(device_lib.list_local_devices())

1.13.1
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 1858808684693237231
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 18013350971989020986
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 11015042985065621008
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3076849664
locality {
  bus_id: 1
  links {
  }
}
incarnation: 10039970566583395186
physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0"
]


# Reading data from csv files
* We use `pd.read_csv()` to read the comments file and `DataFrame.dropna()` to drop records with missing values.
* We use `DataFrame.sample(frac=1)` to shuffle the rows, then keep the first 130,000 rows for training and hold out the rest for testing.
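
The shuffle-then-slice idea can be sketched without pandas. This is a dependency-free illustration on a plain list, not the notebook's code; `shuffle_and_split` and its `seed` argument are assumptions made for the example. In pandas, passing `random_state` to `DataFrame.sample` plays the same role as the seed here and makes the split reproducible between runs.

```python
import random


def shuffle_and_split(rows, num_training, seed=42):
    """Shuffle a copy of `rows` and split it into (training, testing)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:num_training], shuffled[num_training:]


rows = list(range(10))            # stand-in for the DataFrame rows
train, test = shuffle_and_split(rows, num_training=7)
print(len(train), len(test))      # 7 3
```

Because the split is a plain slice of the shuffled rows, every row lands in exactly one of the two sets.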

In [3]:
data = pd.read_csv('./datasets/train_comments.csv').dropna()

data = data.sample(frac=1)

num_training_samples = 130000

training_data = data.iloc[:num_training_samples]
testing_data = data.iloc[num_training_samples:]

print(training_data.count())
training_data.head()

id                     130000
title                  130000
comment                130000
rate                   130000
verification_status    130000
dtype: int64


Unnamed: 0,id,title,comment,rate,verification_status
78116,86900,معیوب,علی رغم اینکه بسته بندی کالا خوب به نظر می رسی...,60.0,0
119968,133413,پوست من روشنه و این ضدآفتاب بسیار مناسب پوست‌ه...,پیشنهاد میکنم برای پوستهای روشن,0.0,0
46107,51367,کارآمد و مناسب,کارآمد و مناسب,0.0,0
3343,3752,خوشبو,واقعا خوشبو هست فقط ماندگاری نداره قیمتشم خوبه,52.0,0
25524,28424,فوق العاده زیبا و جذاب ولی قیمت بالا,من این لپ تاپ را در حدود سه روز هست که تحویل گ...,94.0,0


In [4]:
print(testing_data.count())
testing_data.head()

id                     27564
title                  27564
comment                27564
rate                   27564
verification_status    27564
dtype: int64


Unnamed: 0,id,title,comment,rate,verification_status
134230,149172,افتضاحه,قیچیش تا میشه اصلا کارکرد قیچی رو نداره یه ماک...,8.0,0
142564,158431,مناسب نیست,خوب نیست همش جدا میشه از گوشی,60.0,0
61750,68728,درجه بندی,بعداز یه مدت درجه بندیش پاک میشه و قابل استفاد...,35.0,0
57936,64533,عالیه,یک روزه دستم رسید عالیه ممنونم از دیجی کالا وا...,60.0,0
146577,162904,وسیله سرگرمی خوبیه,به عنوان هدیه گرفتم عالیه طرحش جنسش وقتی بسترو...,88.0,0


# Dataset summary
* To get an overview of both datasets, we use `df.describe(include='all')`.

In [5]:
training_data.describe(include='all')

Unnamed: 0,id,title,comment,rate,verification_status
count,130000.0,130000,130000,130000.0,130000.0
unique,,70257,125351,,
top,,عالیه,عالی,,
freq,,2347,229,,
mean,90083.573031,,,57.770243,0.170185
std,51933.709169,,,34.287678,0.375796
min,0.0,,,0.0,0.0
25%,45131.5,,,36.0,0.0
50%,90174.5,,,60.0,0.0
75%,135159.25,,,87.0,0.0


In [6]:
testing_data.describe(include='all')

Unnamed: 0,id,title,comment,rate,verification_status
count,27564.0,27564,27564,27564.0,27564.0
unique,,17830,27017,,
top,,عالیه,عالی,,
freq,,516,38,,
mean,89846.437491,,,57.711685,0.170512
std,52061.127631,,,34.219519,0.376089
min,11.0,,,0.0,0.0
25%,44560.75,,,36.0,0.0
50%,89717.0,,,60.0,0.0
75%,134733.5,,,86.0,0.0


# Extracting training sentences and labels
* We extract the comment text as sentences and the `verification_status` column as labels from the training and testing sets.

In [7]:
training_sentences = training_data.comment.astype(str).to_numpy()
# Labels are 0/1 flags; keep them numeric so binary cross-entropy can use them.
training_labels = training_data.verification_status.astype('float32').to_numpy()

testing_sentences = testing_data.comment.astype(str).to_numpy()
testing_labels = testing_data.verification_status.astype('float32').to_numpy()

# Normalizing data
* For text preprocessing, we set up the Keras `Tokenizer` from TensorFlow and the `Normalizer` and `Stemmer` from hazm.

In [8]:
vocab_size = 10000
embedding_dim = 16
max_length = 600
trunc_type='post'
oov_tok = "<OOV>"

tokenizer = Tokenizer(oov_token=oov_tok, num_words=vocab_size)
normalizer = Normalizer()
stemmer = Stemmer()

# Data preprocessing
* With the tokenizer and normalizer in place, we normalize every sentence with hazm and cache the result, so that later runs can load it from disk.

In [9]:
if os.path.exists('./model/training_sentences.npy'):
    training_sentences = np.load('./model/training_sentences.npy', allow_pickle=True)
    print("[Note]: Training sentences loaded successfully!")
else:
    for idx, sentence in enumerate(training_sentences):
        training_sentences[idx] = normalizer.normalize(sentence)
    training_sentences = np.asarray(training_sentences)
    np.save('./model/training_sentences.npy', training_sentences)
    print("[Note]: Training sentences saved successfully!")

if os.path.exists('./model/testing_sentences.npy'):
    testing_sentences = np.load('./model/testing_sentences.npy', allow_pickle=True)
    print("[Note]: Testing sentences loaded successfully!")
else:
    for idx, sentence in enumerate(testing_sentences):
        testing_sentences[idx] = normalizer.normalize(sentence)
    testing_sentences = np.asarray(testing_sentences)
    np.save('./model/testing_sentences.npy', testing_sentences)
    print("[Note]: Testing sentences saved successfully!")
    
print()
print("[Note]: Training sentences array shape:", training_sentences.shape)
print("[Note]: Testing sentences array shape:", testing_sentences.shape)

[Note]: Training sentences saved successfully!
[Note]: Testing sentences saved successfully!

[Note]: Training sentences array shape: (130000,)
[Note]: Testing sentences array shape: (27564,)


# Word to vector
* `Tokenizer.fit_on_texts()` builds the vocabulary from the sentences, and `Tokenizer.texts_to_sequences()` converts each sentence into a sequence of word indices, which `pad_sequences` then pads or truncates to a fixed length.
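
To make these three steps concrete, here is a simplified, dependency-free sketch of what they do. It is not the Keras implementation: the helper names are hypothetical, and it leaves out lowercasing, punctuation filtering, and the `num_words` cutoff. It does assume the Keras conventions that the OOV token gets index 1, that 0 is reserved for padding, and that padding is added at the front by default.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Map words to integer ids by descending frequency; id 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def texts_to_sequences(texts, word_index, oov_token="<OOV>"):
    """Replace each word with its id, falling back to the OOV id."""
    oov_id = word_index[oov_token]
    return [[word_index.get(w, oov_id) for w in t.split()] for t in texts]


def pad(sequences, maxlen, truncating="post"):
    """Left-pad with zeros and cut every sequence to `maxlen` ids."""
    padded = []
    for seq in sequences:
        seq = seq[:maxlen] if truncating == "post" else seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


word_index = build_word_index(["good product", "good price good quality"])
sequences = texts_to_sequences(["good unseen price quality product", "good product"],
                               word_index)
print(sequences)                 # [[2, 1, 4, 5, 3], [2, 3]]
print(pad(sequences, maxlen=4))  # [[2, 1, 4, 5], [0, 0, 2, 3]]
```

The word `unseen` is not in the vocabulary, so it maps to the OOV id 1, and the short sentence is padded with zeros at the front.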

# Saving processed data
* Processing and cleaning this much data takes a long time, so we save the result with `np.save()` for later use. If `train_padded.npy` or `test_padded.npy` exists in the `model` directory, it is loaded automatically.
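
The load-if-present, otherwise compute-and-save pattern can be factored into one helper. The sketch below uses `pickle` from the standard library as a stand-in for `np.save()`/`np.load()` so that it runs without NumPy; `load_or_compute` is a hypothetical name. It also creates the target directory first, since writing the file would otherwise fail when the `model` directory does not exist yet.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the cached object at `path`, or compute, cache, and return it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result


calls = []


def expensive():
    """Stand-in for the slow preprocessing step; records each invocation."""
    calls.append(1)
    return [[0, 0, 2, 3]]


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "model", "train_padded.pkl")
    first = load_or_compute(cache, expensive)    # computes and writes the file
    second = load_or_compute(cache, expensive)   # reads the file back

print(first == second, len(calls))  # True 1
```

One caveat of any such cache: if the preprocessing parameters change, the stale file has to be deleted by hand, because the helper only checks that the file exists.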

In [10]:
# Fit the tokenizer on the training sentences only, and do it unconditionally
# so that `word_index` exists even when the padded arrays come from the cache.
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

if os.path.exists('./model/train_padded.npy'):
    train_padded = np.load('./model/train_padded.npy')
    print("[Note]: Train padded loaded successfully!")
else:
    train_sequences = tokenizer.texts_to_sequences(training_sentences)
    train_padded = pad_sequences(train_sequences, truncating=trunc_type, maxlen=max_length)
    np.save('./model/train_padded.npy', arr=train_padded)
    print("[Note]: Train padded array saved in model directory.")

if os.path.exists('./model/test_padded.npy'):
    test_padded = np.load('./model/test_padded.npy')
    print("[Note]: Test padded loaded successfully!")
else:
    # Reuse the training vocabulary; refitting here could change the word ids.
    test_sequences = tokenizer.texts_to_sequences(testing_sentences)
    test_padded = pad_sequences(test_sequences, truncating=trunc_type, maxlen=max_length)
    np.save('./model/test_padded.npy', arr=test_padded)
    print("[Note]: Test padded array saved in model directory.")

print()
print("Train padded array shape:", train_padded.shape)
print("Test padded array shape:", test_padded.shape)

[Note]: Train padded array saved in model directory.
[Note]: Test padded array saved in model directory.

Train padded array shape: (130000, 600)
Test padded array shape: (27564, 600)


# Reverse Decode

In [11]:
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

print(decode_review(test_padded[1000]))
print("--------------")
print(testing_sentences[1000])

? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 
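
The `?` marks in the output above are most likely padding: index 0 is reserved for padding and never appears in `word_index`, so the lookup falls back to `'?'`. A row that decodes to nothing but `?` would then contain only zeros. The sketch below is a hypothetical variant of `decode_review` with a toy vocabulary that skips padding instead of printing it.

```python
def decode(ids, reverse_word_index, skip_padding=True):
    """Turn a padded id sequence back into text.

    Id 0 is the padding value and has no word, so it is skipped by default
    instead of being rendered as '?'.
    """
    words = []
    for i in ids:
        if skip_padding and i == 0:
            continue
        words.append(reverse_word_index.get(i, "?"))
    return " ".join(words)


reverse_word_index = {1: "<OOV>", 2: "good", 3: "product"}  # toy vocabulary
print(decode([0, 0, 0, 2, 3], reverse_word_index))                      # good product
print(decode([0, 0, 0, 2, 3], reverse_word_index, skip_padding=False))  # ? ? ? good product
```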

# Neural Network
* Now that we have our train and test data, we can begin training. We start with a neural network.
* The network stacks an embedding layer, a 1D convolution with max pooling, and a GRU layer, followed by a single sigmoid output unit.

In [12]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
#     tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.GRU(32),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Instructions for updating:
Colocations handled automatically by placer.


# Model Compile

In [13]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 600, 16)           160000    
_________________________________________________________________
conv1d (Conv1D)              (None, 596, 64)           5184      
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 149, 64)           0         
_________________________________________________________________
gru (GRU)                    (None, 32)                9312      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 174,529
Trainable params: 174,529
Non-trainable params: 0
_________________________________________________________________
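
The output shapes and parameter counts in the summary above can be reproduced by hand. The arithmetic below assumes the Keras defaults used in the model cell: `'valid'` padding for the convolution, a pooling stride equal to `pool_size`, and a GRU with a single bias vector per gate (the TensorFlow 1.x default).

```python
vocab_size, embedding_dim, max_length = 10000, 16, 600
filters, kernel_size, pool_size = 64, 5, 4
gru_units = 32

# Sequence length after a 'valid' convolution and non-overlapping pooling.
conv_len = max_length - kernel_size + 1          # 596
pool_len = conv_len // pool_size                 # 149

embedding_params = vocab_size * embedding_dim                      # 160000
conv_params = kernel_size * embedding_dim * filters + filters      # 5184
# GRU: three gates, each with input weights, recurrent weights, and one bias.
gru_params = 3 * (gru_units * (filters + gru_units) + gru_units)   # 9312
dense_params = gru_units * 1 + 1                                   # 33

total = embedding_params + conv_params + gru_params + dense_params
print(conv_len, pool_len, total)  # 596 149 174529
```

Almost all of the 174,529 parameters sit in the embedding table, so `vocab_size` and `embedding_dim` dominate the model size.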


# CallBack Class
* We subclass `tf.keras.callbacks.Callback` to inspect the logs, which contain accuracy and loss, at the end of each epoch and to stop training once the training accuracy exceeds 99%.

In [14]:
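class Callback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # TF 1.x logs the metric as 'acc'; newer versions use 'accuracy'.
        logs = logs or {}
        acc = logs.get('acc', logs.get('accuracy'))
        if acc is not None and acc > 0.99:
            self.model.stop_training = True
            print("[Note]: Reached training accuracy of 99%")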

mycallback = Callback()
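
The stopping rule itself can be tested without TensorFlow by pulling it into a plain function. `should_stop` is a hypothetical helper, not part of Keras. It assumes the accuracy metric is logged as `'acc'` (TensorFlow 1.x) or `'accuracy'` (TensorFlow 2.x), and it treats a missing metric as "keep training" rather than raising an error.

```python
def should_stop(logs, threshold=0.99):
    """Return True once the logged training accuracy exceeds `threshold`."""
    logs = logs or {}
    # The metric key depends on the TensorFlow version, so try both.
    accuracy = logs.get("acc", logs.get("accuracy"))
    return accuracy is not None and accuracy > threshold


print(should_stop({"acc": 0.995}))       # True
print(should_stop({"accuracy": 0.90}))   # False
print(should_stop({"loss": 0.3}))        # False
print(should_stop(None))                 # False
```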

# Visualizing learning process with Tensorboard

In [15]:
%load_ext tensorboard

logdir = "logs/scalars/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logdir)

# Model Training

In [None]:
if os.path.exists('./model/model.h5'):
    model = tf.keras.models.load_model('./model/model.h5')
    print("[Note]: Model loaded successfully!")
else:
    %tensorboard --logdir logs/scalars
    num_epochs = 20
    model.fit(train_padded, training_labels, epochs=num_epochs,
              validation_data=(test_padded, testing_labels),
              callbacks=[mycallback, tensorboard_callback])
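    try:
        model.save('./model/model.h5', overwrite=True)
        print("[Note]: Model saved successfully!")
    except Exception as err:
        # Report the failure instead of silently claiming success.
        print("[Warning]: Model could not be saved:", err)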
    
train_loss, train_acc = model.evaluate(train_padded, training_labels)
test_loss, test_acc = model.evaluate(test_padded, testing_labels)

print("Accuracy over train set:", train_acc)
print("Accuracy over test set:", test_acc)

Instructions for updating:
Use tf.cast instead.


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


[Note]: Model loaded successfully!
  2656/130000 [..............................] - ETA: 7:57 - loss: 1.8604 - acc: 0.7455- ETA: 8:17 - loss: 1