# Convolutional Neural Network
---

A convolutional neural network (CNN) is a neural network that takes an image as input and outputs a label.
Example: given an image, identify whether it contains an aeroplane.

Steps for CNN -
* **Convolution** : slide "feature detectors" (filters) across the image to produce a list of feature maps, each showing where its feature appears
* **Max Pooling** : take the maximum over small windows of each feature map, shrinking it so later layers are cheaper to compute
* **Flattening** : turn the pooled feature maps into a single vector
* **Full Connection** : feed that vector into a fully connected neural network
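
The four steps above can be sketched with plain NumPy on a toy 5x5 image. This is only an illustration under assumptions: one hand-written 2x2 filter, stride 1, no padding, 2x2 non-overlapping pooling, and fixed dense weights. The helper names `convolve2d_valid` and `max_pool2d` are hypothetical and not part of any library.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy image: dark on the left, bright on the right (a vertical edge).
image = np.array([[0, 0, 1, 1, 1]] * 5)

# Hand-written filter that responds to a dark-to-bright vertical edge.
kernel = np.array([[-1, 1],
                   [-1, 1]])

feature_map = convolve2d_valid(image, kernel)  # convolution -> (4, 4)
pooled = max_pool2d(feature_map)               # max pooling -> (2, 2)
flat = pooled.flatten()                        # flattening  -> (4,)

# Full connection: a single dense unit with made-up weights.
weights = np.full(flat.shape, 0.5)
score = flat @ weights

print(feature_map)
print(pooled)
print(flat, score)
```

The feature map is non-zero only in the column where the edge sits, and pooling keeps that response while halving each dimension.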

Links - 
- https://dennybritz.com/posts/wildml/understanding-convolutional-neural-networks-for-nlp/
- https://towardsdatascience.com/nlp-with-cnns-a6aa743bdc1e

## CNN for Text

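Text has no pixel grid, so each sentence must first be converted into a matrix: one row per token, where each row is that token's embedding vector.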
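
A minimal sketch of that conversion, assuming a toy word-level vocabulary and a randomly initialised embedding table (in the notebook the embedding is learned by the model; the names `vocab` and `embedding_table` are illustrative only):

```python
import numpy as np

sentence = "the movie was great"
words = sentence.split()

# Assign an integer ID to every distinct word; 0 is kept free for padding.
vocab = {word: idx for idx, word in enumerate(sorted(set(words)), start=1)}
token_ids = [vocab[word] for word in words]

# One row per vocabulary entry (plus the padding row), emb_dim columns.
emb_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab) + 1, emb_dim))

# Looking up each token ID gives the sentence matrix: (num_tokens, emb_dim).
sentence_matrix = embedding_table[token_ids]

print(token_ids)
print(sentence_matrix.shape)
```

A 1D convolution can then slide over the rows of this matrix the same way a 2D filter slides over image pixels.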

# Import Dependencies

In [1]:
import numpy as np
import math
import re
import pandas as pd

from bs4 import BeautifulSoup

from google.colab import drive

In [2]:
try:
    %tensorflow_version 2.x
except Exception:
    pass

import tensorflow as tf

from tensorflow import keras
# import layers from tf.keras so they match the tf.keras.Model used below
from tensorflow.keras import layers

import tensorflow_datasets as tfds


Colab only includes TensorFlow 2.x; %tensorflow_version has no effect.


# Data Processing: sentiment_140 dataset

## Load dataset file

In [3]:
drive.mount("/content/drive")

Mounted at /content/drive


In [26]:
cols = ["sentiment", "id", "date","query", "user", "text"]

data = pd.read_csv("/content/drive/MyDrive/sentiment_140.csv", header=None, names=cols, engine="python",encoding="latin1")

In [None]:
data.head()

# Preprocess data

In [27]:
# drop the columns that are not needed; keep only "sentiment" and "text"
data.drop(["id", "date", "query", "user"],
          axis=1,
          inplace=True)


In [None]:
data.head()

In [None]:
import seaborn as sns

# print the class counts; without print() this non-final expression shows nothing
print(np.unique(data['sentiment'], return_counts=True))
sns.countplot(x=data['sentiment']);

In [None]:

sentiment = data.iloc[:, 0].values
text = data.iloc[:, 1].values

print(text)


In [32]:
def clean_data(s):
  s = BeautifulSoup(s, "lxml").get_text()
  s = re.sub(r"@[A-Za-z0-9]+", ' ', s)
  s = re.sub(r"https?://[A-Za-z0-9./]+", ' ', s)
  s = re.sub(r"[^a-zA-Z.!?']", ' ', s)
  s = re.sub(r" +", ' ', s)

  return s

In [None]:
data_clean = [clean_data(s) for s in text]
print(data_clean[:5])
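
To make the effect of the regular expressions concrete, here is a small sketch that applies the same four patterns to a made-up tweet. It leaves out the BeautifulSoup step so that it runs with the standard library only; the function name `clean_tweet` and the sample strings are hypothetical.

```python
import re

def clean_tweet(s):
    s = re.sub(r"@[A-Za-z0-9]+", ' ', s)            # remove @mentions
    s = re.sub(r"https?://[A-Za-z0-9./]+", ' ', s)  # remove URLs
    s = re.sub(r"[^a-zA-Z.!?']", ' ', s)            # keep letters and . ! ? '
    s = re.sub(r" +", ' ', s)                       # collapse repeated spaces
    return s

sample = "@alice I loved it!!! see https://t.co/abc123 :) 100%"
print(repr(clean_tweet(sample)))          # ' I loved it!!! see '

# The mention pattern does not match underscores, so part of the handle survives.
print(repr(clean_tweet("@john_doe hi")))  # ' doe hi'
```

The result keeps a leading and trailing space because nothing strips them; calling `.strip()` afterwards would be one way to tidy that up.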

In [None]:
# map sentiment labels {0, 4} to {0, 1} (0 = negative, 1 = positive)
print("before",set(sentiment))
sentiment[sentiment == 4] = 1
print("after",set(sentiment))

# Tokenization

Tokenization turns each sentence into a list of integers, where each integer identifies a word or subword in the vocabulary.

In [37]:
tokenizer = tfds.deprecated.text.SubwordTextEncoder.build_from_corpus(
    data_clean, target_vocab_size=2**16
)

data_inputs = [tokenizer.encode(sentence) for sentence in data_clean]
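
The idea of subword encoding can be illustrated with a toy greedy longest-match splitter. This is a sketch under assumptions, not how `SubwordTextEncoder` works internally: the vocabulary `toy_vocab` is hand-written and the function `encode_subwords` is hypothetical.

```python
def encode_subwords(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    ids = []
    start = 0
    while start < len(word):
        # Try the longest remaining piece first, then shorten it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                ids.append(vocab[piece])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return ids

# Hand-written toy vocabulary; IDs start at 1 so that 0 stays free for padding.
toy_vocab = {"un": 1, "happy": 2, "happi": 3, "ness": 4, "ly": 5}

print(encode_subwords("unhappiness", toy_vocab))  # [1, 3, 4]
print(encode_subwords("unhappy", toy_vocab))      # [1, 2]
```

Because words are built from shared pieces, a fixed-size vocabulary can still encode words that never appeared as a whole during training.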

# Padding

The goal is to pad the tokenized sentences with 0 at the end so that all inputs have the same length.

During training, inputs are fed to the model in batches, and every input in a batch must have the same length.

0 is a safe padding value because the tokenizer does not assign it to any word.

In [38]:
MAX_LEN = max(len(sentence) for sentence in data_inputs)

data_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    data_inputs,
    value=0,
    padding="post",
    maxlen=MAX_LEN
)
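
What post-padding does can be shown with a few lines of NumPy. This is a simplified sketch of the behaviour for `padding="post"` only, with made-up token IDs; the helper `pad_post` is hypothetical.

```python
import numpy as np

def pad_post(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        padded[row, :len(seq)] = seq
    return padded

batch = pad_post([[5, 8, 2], [7], [3, 9]])
print(batch)
# [[5 8 2]
#  [7 0 0]
#  [3 9 0]]
```

Every row now has the same length, so the sentences can be stacked into a single array and fed to the model in batches.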

In [None]:
from sklearn.model_selection import train_test_split

# split the data into training and testing sets
# test_size=0.2 gives 80% training data and 20% testing data
x_train, x_test, y_train, y_test = train_test_split(data_inputs,sentiment,test_size=0.2)

print('tweets training',x_train.shape)
print('tweets testing',x_test.shape)
print('sentiment training',y_train.shape)
print('sentiment testing',y_test.shape)

print(x_train)

# Model building

In [40]:
class DeepCNN(tf.keras.Model):

  # the training flag is passed to call(), not to the constructor
  def __init__(self, vocab_size, emb_dim=128, nb_filters=50, FFN_units=512,
               nb_classes=2, dropout_rate=0.1, name="dcnn"):
    super().__init__(name=name)

    self.embedding = layers.Embedding(vocab_size, emb_dim)

    self.bigram = layers.Conv1D(filters=nb_filters, kernel_size=2, padding="valid", activation="relu")
    self.pool_1 = layers.GlobalMaxPool1D()

    self.trigram = layers.Conv1D(filters=nb_filters, kernel_size=3, padding="valid", activation="relu")
    self.pool_2 = layers.GlobalMaxPool1D()

    self.fourgram = layers.Conv1D(filters=nb_filters, kernel_size=4, padding="valid", activation="relu")
    self.pool_3 = layers.GlobalMaxPool1D()

    self.dense_1 = layers.Dense(units=FFN_units, activation="relu")

    self.dropout = layers.Dropout(rate=dropout_rate)

    if nb_classes == 2:
      self.last_dense = layers.Dense(units=1, activation="sigmoid")
    else:
      self.last_dense = layers.Dense(units=nb_classes, activation="softmax")


  def call(self, inputs, training=False):
    x = self.embedding(inputs)

    x_1 = self.bigram(x)
    x_1 = self.pool_1(x_1)

    x_2 = self.trigram(x)
    x_2 = self.pool_2(x_2)

    x_3 = self.fourgram(x)
    x_3 = self.pool_3(x_3)

    merged = tf.concat([x_1,x_2,x_3], axis=-1) # (batch_size, 3 * nb_filters)
    merged = self.dense_1(merged)
    merged = self.dropout(merged, training=training)

    output = self.last_dense(merged)

    return output
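
The shape flow through the three convolution branches can be traced with NumPy. This sketch makes several simplifying assumptions: a single sentence instead of a batch, random weights, no bias terms, and small made-up sizes. The helper `conv1d_valid_relu` is hypothetical and only mimics a `Conv1D` layer with `padding="valid"` and ReLU.

```python
import numpy as np

def conv1d_valid_relu(x, kernels):
    """x: (seq_len, emb_dim); kernels: (kernel_size, emb_dim, nb_filters)."""
    kernel_size, _, nb_filters = kernels.shape
    out_len = x.shape[0] - kernel_size + 1
    out = np.zeros((out_len, nb_filters))
    for t in range(out_len):
        window = x[t:t + kernel_size]  # kernel_size consecutive token vectors
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)        # ReLU

rng = np.random.default_rng(0)
seq_len, emb_dim, nb_filters = 10, 8, 5
x = rng.normal(size=(seq_len, emb_dim))  # stands in for the embedded sentence

pooled = []
for kernel_size in (2, 3, 4):            # bigram, trigram, fourgram
    kernels = rng.normal(size=(kernel_size, emb_dim, nb_filters))
    feature_map = conv1d_valid_relu(x, kernels)  # (seq_len - kernel_size + 1, nb_filters)
    pooled.append(feature_map.max(axis=0))       # global max pooling -> (nb_filters,)

merged = np.concatenate(pooled, axis=-1)         # (3 * nb_filters,)
print(merged.shape)
```

Global max pooling removes the sequence dimension, which is why branches with different kernel sizes can be concatenated even though their feature maps have different lengths.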



# Application



## Config
The goal is to define the hyperparameters as global variables.

In [41]:
VOCAB_SIZE = tokenizer.vocab_size

EMB_DIM = 200
NB_FILTERS = 100
FFN_UNITS = 256
NB_CLASSES = len(set(y_train))

DROPOUT_RATE = 0.2

BATCH_SIZE = 32
NB_EPOCHS = 5

## Train

In [43]:
dcnn = DeepCNN(vocab_size=VOCAB_SIZE,
                emb_dim=EMB_DIM,
                nb_filters=NB_FILTERS,
                FFN_units=FFN_UNITS,
                nb_classes=NB_CLASSES,
                dropout_rate=DROPOUT_RATE)

if NB_CLASSES == 2:
    dcnn.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
else:
    dcnn.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["sparse_categorical_accuracy"])

In [47]:
# create a checkpoint manager to save and restore the model
checkpoint_path = "./ckpt/"

ckpt = tf.train.Checkpoint(DeepCNN=dcnn)

ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=1)

if ckpt_manager.latest_checkpoint:
    ckpt.restore(ckpt_manager.latest_checkpoint)
    print("Latest checkpoint restored!")

In [None]:
dcnn.fit(x_train,
         y_train,
         batch_size=BATCH_SIZE,
         epochs=NB_EPOCHS)
ckpt_manager.save()

In [None]:
results = dcnn.evaluate(x_test, y_test, batch_size=BATCH_SIZE)
print(results)

In [None]:
dcnn(np.array([tokenizer.encode("You are so nice")]), training=False).numpy()
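
With two classes the model ends in a single sigmoid unit, so each prediction is a number between 0 and 1 that can be read as the probability of the positive class (label 1). A small sketch of turning such outputs into labels, assuming a 0.5 threshold and using made-up probabilities rather than real model output:

```python
import numpy as np

def to_label(probabilities, threshold=0.5):
    """Map sigmoid outputs of shape (batch, 1) to sentiment labels."""
    probabilities = np.asarray(probabilities).reshape(-1)
    return ["positive" if p >= threshold else "negative" for p in probabilities]

# Made-up outputs for two sentences, shaped like the model's (batch, 1) result.
example_outputs = [[0.93], [0.12]]
print(to_label(example_outputs))  # ['positive', 'negative']
```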