# Demo: Convolutional Neural Network with Trained Word2Vec Embeddings for Part-Of-Speech Tagging

### Natural language processing in artificial intelligence has advanced considerably, but it still has a long way to go.
### Are neural networks worth exploring for natural language processing in AI?
### Let's find out!

#### *Required Files: latinModel, englishModel, ttokenizer.json, ttokenizerlatin.json, wtokenizer.json, wtokenizerlatin.json*

_______________________________
### Step 1: Run the following three cells to import the required packages and prepare some information for the models.

##### This cell imports all of the required TensorFlow/Keras and Python packages:

In [None]:
import tensorflow.keras as keras
from tensorflow.keras.models import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras import backend as K

import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
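import nltk
from nltk.tokenize import word_tokenize
# word_tokenize needs NLTK's Punkt tokenizer data; newer NLTK releases name it 'punkt_tab'
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)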
import io
import json

##### The networks pad input sentences with zeros to a uniform length. Padding positions are trivial to predict, so counting them inflates the accuracy reported during evaluation. A masked accuracy that ignores padding is therefore used for a more precise evaluation.
##### This cell defines the function for the masked accuracy:

In [None]:
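def accuracy_masked(y_true, y_pred):
    # Convert one-hot labels and predicted probabilities to class indices
    y_true_class = K.argmax(y_true, axis=-1)
    y_pred_class = K.argmax(y_pred, axis=-1)

    # Class 0 is the padding label, so those positions are masked out
    ignore_mask = K.cast(K.not_equal(y_true_class, 0), 'int32')
    matches = K.cast(K.equal(y_true_class, y_pred_class), 'int32') * ignore_mask
    # Divide by the number of real tokens, guarding against an all-padding batch
    accuracy = K.sum(matches) / K.maximum(K.sum(ignore_mask), 1)
    return accuracy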
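
##### To see why padding inflates accuracy, here is a minimal plain-Python sketch of the same idea. It is not part of the demo: the helper names `plain_accuracy` and `masked_accuracy` and the label values are made up for illustration, and plain lists stand in for the class indices that `K.argmax` would produce.

```python
# Illustrative only: plain lists stand in for the argmax'd label tensors.
def plain_accuracy(true_ids, pred_ids):
    """Fraction of matching positions, padding included."""
    matches = sum(t == p for t, p in zip(true_ids, pred_ids))
    return matches / len(true_ids)


def masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Fraction of matching positions, counting only non-padding positions."""
    kept = [(t, p) for t, p in zip(true_ids, pred_ids) if t != pad_id]
    matches = sum(t == p for t, p in kept)
    # Guard against an all-padding input, as the Keras version does
    return matches / max(len(kept), 1)


# Hypothetical 4-word sentence padded to length 10 (0 = padding).
true_ids = [0, 0, 0, 0, 0, 0, 3, 5, 2, 7]
pred_ids = [0, 0, 0, 0, 0, 0, 3, 5, 4, 1]  # only 2 of the 4 real tags are right

print(plain_accuracy(true_ids, pred_ids))   # 0.8
print(masked_accuracy(true_ids, pred_ids))  # 0.5
```

##### The six padding positions count as correct in the plain version, which lifts the score from 0.5 to 0.8 even though only half of the real tags match.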

##### This cell creates two Python dictionaries, one for English and one for Latin, that map each tag abbreviation to the full name of its part of speech:

In [None]:
tagdictenglish = {
    "conj": "Conjunction",
    ".": "Punctuation",
    "propn": "Proper Noun",
    "num": "Numeral",
    "adv": "Adverb",
    "verb": "Verb",
    "noun": "Noun",
    "pron": "Pronoun",
    "adj": "Adjective",
    "part": "Particle",
    "det": "Determiner",
    "x": "Other",
    "adp": "Adposition",
}

tagdictlatin = {
    "propn": "Proper Noun",
    "adv": "Adverb",
    "x": "Other",
    "intj": "Interjection",
    "cconj": "Coordinating Conjunction",
    "punct": "Punctuation",
    "det": "Determiner",
    "adj": "Adjective",
    "pron": "Pronoun",
    "sconj": "Subordinating Conjunction",
    "num": "Numeral",
    "aux": "Auxiliary Verb",
    "noun": "Noun",
    "adp": "Adposition",
    "verb": "Verb",
    "part": "Particle"
}

________________________________
### Step 2: Run the next three cells to load the models and their Tokenizers.

##### This cell downloads the files for the models and Tokenizers:

In [None]:
!wget https://www.cs.mtsu.edu/~mam2hu/englishModel.zip
!unzip englishModel.zip
!wget https://www.cs.mtsu.edu/~mam2hu/latinModel.zip
!unzip latinModel.zip
!wget https://www.cs.mtsu.edu/~mam2hu/ttokenizer.json
!wget https://www.cs.mtsu.edu/~mam2hu/wtokenizer.json
!wget https://www.cs.mtsu.edu/~mam2hu/ttokenizerlatin.json
!wget https://www.cs.mtsu.edu/~mam2hu/wtokenizerlatin.json

##### This cell loads the two neural networks, one for English and one for Latin:

In [None]:
loaded_english_model_tf = tf.keras.models.load_model('englishModel', custom_objects={'accuracy_masked': accuracy_masked})
loaded_latin_model_tf = tf.keras.models.load_model('latinModel', custom_objects={'accuracy_masked': accuracy_masked})

##### The Keras Tokenizer converts text into sequences of integers that the network can use. The text is split on punctuation and whitespace into a sequence of words, and each word is then mapped to an integer token.
##### This cell loads the Tokenizers: 

In [None]:
with open('wtokenizer.json') as f:
    data = json.load(f)
    word_tokenizer_english = keras.preprocessing.text.tokenizer_from_json(data)
    
with open('ttokenizer.json') as f:
    data = json.load(f)
    tag_tokenizer_english = keras.preprocessing.text.tokenizer_from_json(data)
    
with open('wtokenizerlatin.json') as f:
    data = json.load(f)
    word_tokenizer_latin = keras.preprocessing.text.tokenizer_from_json(data)
    
with open('ttokenizerlatin.json') as f:
    data = json.load(f)
    tag_tokenizer_latin = keras.preprocessing.text.tokenizer_from_json(data)
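
##### As a rough illustration of what a word-index tokenizer does, here is a plain-Python sketch. It is not the Keras implementation: the helper names `build_word_index` and `encode` and the tiny corpus are hypothetical. It assumes indices start at 1 with more frequent words first, which leaves 0 free for padding. Unknown words are simply skipped here; whether the saved Tokenizers skip them or map them to a special token depends on how they were configured, which this demo does not show.

```python
from collections import Counter


def build_word_index(sentences):
    """Map each word to an integer, most frequent first; 0 stays free for padding."""
    counts = Counter(word.lower() for sentence in sentences for word in sentence)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sentence, word_index):
    """Replace words with their indices; unknown words are skipped in this sketch."""
    return [word_index[w.lower()] for w in sentence if w.lower() in word_index]


# Hypothetical, already-tokenized training sentences
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
word_index = build_word_index(corpus)

print(word_index)
print(encode(["The", "dog", "ran"], word_index))
print(encode(["the", "zebra"], word_index))  # "zebra" was never seen, so it is dropped
```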

______________________________
### Step 3: Run the next two cells to define the Part-of-Speech Tagging functions.

##### The following two cells define the functions used by the networks to preprocess the input sentences, make a prediction for the parts of speech, and display the results:

In [None]:
def output_prediction_english(text):
    
    # Preprocess data by tokenizing input sentences and padding with zeros
    text = [word_tokenize(text)]
    text_encoded = word_tokenizer_english.texts_to_sequences(text) 
    text_padded = pad_sequences(text_encoded, maxlen=50, padding='pre', truncating='post')
    
    # Make a prediction 
    ynew = np.argmax(loaded_english_model_tf.predict(text_padded), axis=-1)
    prediction = ynew[0]
    
    # Trim the padding zeros (np.trim_zeros strips zeros from both ends by default)
    prediction = np.trim_zeros(prediction)
    
    # Decode the prediction
    decoded = tag_tokenizer_english.sequences_to_texts([prediction])
    decoded = word_tokenize(decoded[0])
    decoded = [tagdictenglish[tag] for tag in decoded]
    
    # Display the inputs and predicted outputs
    print("      Sentence= %s\nPredicted Tags= %s" % (text, decoded))

In [None]:
def output_prediction_latin(text):
    
    # Preprocess data by tokenizing input sentences and padding with zeros
    text = [word_tokenize(text)]
    text_encoded = word_tokenizer_latin.texts_to_sequences(text) 
    text_padded = pad_sequences(text_encoded, maxlen=50, padding='pre', truncating='post')
    
    # Make a prediction
    ynew = np.argmax(loaded_latin_model_tf.predict(text_padded), axis=-1)
    prediction = ynew[0]
    
    # Trim the padding zeros (np.trim_zeros strips zeros from both ends by default)
    prediction = np.trim_zeros(prediction)
    
    # Decode the prediction
    decoded = tag_tokenizer_latin.sequences_to_texts([prediction])
    decoded = word_tokenize(decoded[0])
    decoded = [tagdictlatin[tag] for tag in decoded]
    
    # Display the inputs and predicted outputs
    print("      Sentence= %s\nPredicted Tags= %s" % (text, decoded))
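
##### The pad, predict, trim, and decode steps in the two functions above can be traced with a small plain-Python sketch. Everything here is a stand-in: `pad_pre`, `trim_leading`, the `index_to_tag` mapping, and the word and tag indices are hypothetical, and a hard-coded list replaces the output of `np.argmax(model.predict(...))`.

```python
def pad_pre(ids, maxlen, pad_id=0):
    """Left-pad with pad_id up to maxlen; if too long, drop tokens from the end."""
    ids = ids[:maxlen]
    return [pad_id] * (maxlen - len(ids)) + ids


def trim_leading(ids, pad_id=0):
    """Drop the run of pad_id values at the start of the sequence."""
    start = 0
    while start < len(ids) and ids[start] == pad_id:
        start += 1
    return ids[start:]


# Hypothetical mappings; the demo reads the real ones from the saved Tokenizers
index_to_tag = {1: "noun", 2: "verb", 3: "det"}
tag_names = {"noun": "Noun", "verb": "Verb", "det": "Determiner"}

# Three hypothetical word indices padded to length 8 (the demo uses maxlen=50)
padded = pad_pre([7, 12, 4], maxlen=8)
print(padded)  # [0, 0, 0, 0, 0, 7, 12, 4]

# Stand-in for the per-position class indices a model might predict
predicted = [0, 0, 0, 0, 0, 3, 1, 2]
tags = [tag_names[index_to_tag[i]] for i in trim_leading(predicted)]
print(tags)  # ['Determiner', 'Noun', 'Verb']
```

##### Because the padding sits in front of the sentence, trimming the leading zeros leaves one predicted tag per input word.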

___________________________
### Step 4: See the results!

##### Replace the sample sentences below with your own English and Latin sentences, or keep the samples if you aren't feeling creative. Then run the cells to see the predicted parts of speech:

##### *Note: Be sure to leave the quotation marks.*

In [None]:
english = "The red cat took a very long walk along the winding river."  
output_prediction_english(english)

In [None]:
latin = "Forsan et haec olim meminisse iuvabit."
output_prediction_latin(latin)

___________________________
### You have now successfully used neural networks to tag English and Latin sentences! The network model can be trained to tag sentences in any language, provided that a sufficiently large tagged dataset is available for training.

### This demo suggests that natural language processing and neural networks have a promising future together.