<a href="https://colab.research.google.com/github/Engineering-Geek/Pencil_Learning_Assessment/blob/master/Pencil_Learning_Polished.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pencil Learning Machine Learning Assignment


## Getting the Universal Sentence Encoder as per directions

In [0]:
# Getting Universal Sentence Encoder from Google

from absl import logging
import tensorflow as tf
import tensorflow_hub as hub

# cuz tensorflow loves to yell at us, I wanna just shut it up now and only
#    tell me when I'm literally making a mistake. I can tolerate warnings for now
logging.set_verbosity(logging.ERROR)
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" # @param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = hub.load(module_url)

## Creating a custom Neural Network 



*   Layer 1: String input
*   Layer 2: The universal sentence encoder
*   Layer 3: Two hidden layers (with dropout) for basic processing
*   Layer 4: Output layer



In [0]:
#@title 
def create_custom_model(n_outputs=1000):
  def UniversalEmbedding(x):
      return model(tf.squeeze(tf.cast(x, tf.string), axis=1))

  embed_size = 512 #@param {type:"raw"}

  input_text = tf.keras.Input(shape=(1,), dtype=tf.string, name="input")
  embedding = tf.keras.layers.Lambda(UniversalEmbedding, output_shape=(embed_size,))(input_text)
  hidden_layer_1 = tf.keras.layers.Dense(units=1000, activation="relu")(embedding)
  dropout1 = tf.keras.layers.Dropout(0.2)(hidden_layer_1)
  hidden_layer_2 = tf.keras.layers.Dense(units=1000, activation="relu")(dropout1)
  dropout2 = tf.keras.layers.Dropout(0.2)(hidden_layer_2)
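  # softmax gives a probability distribution over the vocabulary, matching the
  #   one-hot targets and the categorical cross-entropy loss used later
  output_layer = tf.keras.layers.Dense(units=n_outputs, activation="softmax")(dropout2)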
  

  custom_model = tf.keras.Model(inputs=input_text, outputs=output_layer)
  return custom_model

## Using a custom Dataset
For this, I'm using ["Pride and Prejudice"](https://www.gutenberg.org/files/1342/1342-0.txt) and many other books from that same site. However, any txt file should in theory suffice.

I also took a little snippet from [StackOverflow](https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences) that helped out a lot. I normally never copy + paste without reading/learning everything about the topic and figuring out my own roundabout way. But this snippet of code literally had everything I wanted.

In [0]:
# # Run this as many times as you want to get all your scripts in google colab
# from google.colab import files
# uploaded = files.upload()

In [0]:
# May or may not be copied from stack overflow...
# Credit where credit is due: https://stackoverflow.com/questions/4576077/how-can-i-split-a-text-into-sentences 
import re

def split_into_sentences(text):
    alphabets= "([A-Za-z])"
    prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
    suffixes = "(Inc|Ltd|Jr|Sr|Co)"
    starters = r"(Mr|Mrs|Ms|Dr|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
    acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
    websites = "[.](com|net|org|io|gov)"
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = sentences[:-1]
    sentences = [s.strip() for s in sentences]
    return sentences

In [0]:
# Get Clean sentences without commas or any punctuation
titles = [
          "war_and_peace.txt",
          "huckleberry_finn.txt",
          "adventures_of_tom_sawyer.txt",
          "pride_and_prejudice.txt"
          ]
giant_script = []
for title in titles:
    with open(title, "r", encoding="utf-8") as script:
        data = script.read()
        data = data.replace(",", "")
        data = data.replace("”", "\"")
        data = data.replace("“", "\"")
        data = split_into_sentences(data)
        for index in range(len(data)):
            data[index] = " ".join(data[index].strip().split()).replace("_", "")
            data[index] = data[index].replace("\"", "")
            data[index] = data[index].replace("!", "")
            data[index] = data[index].replace("?", "")
            data[index] = data[index].replace("(", "")
            data[index] = data[index].replace(")", "")
            data[index] = data[index].replace("—", " ")
    giant_script.extend(data)

#### Dictionary
I'm creating a dictionary in the colloquial sense.

```python
# NOT THIS
{}
```
but this: 

![Actual Dictionary](https://s.yimg.com/uu/api/res/1.2/bhi1gefWkG34xPYwl5GQsQ--~B/aD0yNTY7dz0yNTY7YXBwaWQ9eXRhY2h5b24-/https://www.blogcdn.com/www.tuaw.com/media/2009/09/dictionary-256.png)

Words repeat throughout the books. The set of unique words across all the books will be the vocabulary of this neural network.

In [0]:
# Create a dictionary with all the words [alphabetically]
import numpy as np

all_words = []
for sentence in giant_script:
    words = sentence.split(" ")
    for word in words:
        all_words.append(word)
unique_items, counts = np.unique(all_words, return_counts=True)
DICTIONARY = unique_items
print("RATIO OF (ALL WORDS : UNIQUE WORDS) = " + str(len(all_words) / len(DICTIONARY)))

RATIO OF (ALL WORDS : UNIQUE WORDS) = 21.398075043236332


#### Generating the dataset in general
This will split every sentence from the book longer than 2 words into two parts. Then the first part of the sentence becomes the "input" and the first word of the second part is the "answer".

For example: 
* according to all known laws of aviation there is no way that a bee should be able to fly
    * according to all known laws of 
    * aviation there is no way that a bee should be able to fly
* Input: "according to all known laws of"
* Output: "aviation"

Note that the output will be turned into a one-hot vector whose positions correspond to the entries of our dictionary. Also, because we are splitting each sentence at a random point, we can iterate over the whole script many times to artificially increase the size of our dataset.
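
To make the splitting rule concrete, here is a minimal standalone sketch of it. The helper name `make_example` and the seeded `random.Random` are my own additions for illustration; the notebook's actual loop is in the next cell.

```python
import random

def make_example(sentence, rng=random):
    """Split a sentence at a random point into (input_text, next_word)."""
    words = sentence.lower().split(" ")
    if len(words) <= 2:
        return None  # too short to split, same rule as "longer than 2 words"
    # index is the position of the target word; everything before it is input
    index = rng.randint(1, len(words) - 2)
    return " ".join(words[:index]), words[index]

sentence = ("according to all known laws of aviation there is no way "
            "that a bee should be able to fly")
rng = random.Random(0)  # seeded only so repeated runs give the same splits
for _ in range(3):
    print(make_example(sentence, rng))
```

Each call picks a new split point, which is why looping over the same books several times yields different training pairs.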
    

In [0]:
# For each sentence, split it at a random point and store the stuff to the left in an array, and the single next word to another array
from random import randint
X = []
y = []

all_words = []
id_matrix = np.identity(len(DICTIONARY) + 1, dtype=np.float32)

iterations = 1 #@param {type:"slider", min:1, max:20, step:1}
for _ in range(iterations):
    for sentence in giant_script:
        words = sentence.lower().split(" ")
        if len(words) > 2:
            index = randint(1, len(words) - 2)
            all_words.append(words)
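            word = words[index]  # the target: the word right after the split
            # DICTIONARY is case-sensitive while `words` is lowercased, so
            #   targets with no lowercase entry in DICTIONARY are skipped
            dictionary_index = np.where(DICTIONARY==word)
            if len(dictionary_index[0]) != 0:
                one_hot_array = id_matrix[dictionary_index][0]
                X.append(' '.join(words[:index]))
                y.append(one_hot_array)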

X = np.asarray(X)
y = np.asarray(y)

In [0]:
from sklearn.model_selection import train_test_split as split
# _tv   --> Train + Validation (keras.fit has a validation split option)
# _test --> duh
X_tv, X_test, y_tv, y_test = split(X, y, test_size=0.33, random_state=420)

## Customizing our Neural Network from earlier with specifications


In [0]:
from tensorflow.keras import backend as K
# Because we have an incredibly imbalanced dataset where words like "the", "or"
#   and such are disproportionately shown, tracking f1 as a metric tells us
#   more than accuracy would (the loss is still what gets optimized)
def f1(y_true, y_pred): # taken from old keras source code
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    recall = true_positives / (possible_positives + K.epsilon())
    f1_val = 2*(precision*recall)/(precision+recall+K.epsilon())
    return f1_val
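
For intuition about what this metric rewards, here is a rough pure-Python mirror of the same arithmetic on a single example. The name `f1_rounded` and the toy vectors are made up for illustration, and the clipping step is left out because the inputs are already between 0 and 1.

```python
def f1_rounded(y_true, y_pred, eps=1e-7):
    """F1 after rounding each predicted probability to 0 or 1."""
    true_positives = sum(round(t * p) for t, p in zip(y_true, y_pred))
    possible_positives = sum(round(t) for t in y_true)
    predicted_positives = sum(round(p) for p in y_pred)
    precision = true_positives / (predicted_positives + eps)
    recall = true_positives / (possible_positives + eps)
    return 2 * precision * recall / (precision + recall + eps)

y_true = [0, 0, 1, 0]              # one-hot target
confident = [0.1, 0.0, 0.8, 0.1]   # right word scored above 0.5
unsure = [0.3, 0.2, 0.4, 0.1]      # right word is the top guess, but below 0.5

print(f1_rounded(y_true, confident))  # very close to 1
print(f1_rounded(y_true, unsure))     # 0.0
```

Because of the rounding, a prediction only counts once its probability goes above 0.5, so a correct but unsure top guess still scores zero.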

In [0]:
custom_model = create_custom_model(n_outputs=len(y[0]))
custom_model.compile(
    optimizer="Adam", 
    loss="categorical_crossentropy",
    metrics=[f1]
    )

epochs = 100 #@param {type:"slider", min:1, max:100, step:1}
history = custom_model.fit(
    X_tv, 
    y_tv, 
    validation_split=0.3, 
    batch_size=128, 
    epochs=epochs
    )
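
Training for a fixed number of epochs can run well past the point where the validation score stops improving. Below is a small pure-Python sketch of patience-based early stopping. The function name and the scores are invented for illustration and are not measurements from this notebook. In Keras itself, the `tf.keras.callbacks.EarlyStopping` callback (passed to `fit` through `callbacks`) should cover the same idea.

```python
def best_epoch_with_patience(val_scores, patience=2):
    """Return (stop_epoch, best_epoch), 1-indexed, for a higher-is-better score."""
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # no improvement for `patience` epochs in a row: stop here
            return epoch, best_epoch
    return len(val_scores), best_epoch

# Made-up validation scores, only to show the mechanics
scores = [0.05, 0.08, 0.10, 0.11, 0.12, 0.12, 0.11, 0.10, 0.09]
print(best_epoch_with_patience(scores, patience=2))  # (7, 5)
```

With these toy numbers, training would stop at epoch 7 and the weights from epoch 5 would be the ones worth keeping.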

In [0]:
results = custom_model.evaluate(X_test, y_test, batch_size=128)

In [0]:
# Play with it just a bit and check it out!
sentence = "Hello, how are" #@param {type: "raw"}
result_vector = custom_model.predict(np.asarray([sentence]))
# argmax over the vocabulary axis; [0] picks the single input sentence
dictionary_index = int(np.argmax(result_vector, axis=-1)[0])
print(DICTIONARY[dictionary_index])

## CONCLUSION
Unsurprisingly, common words in the English language like "the" or "to" are suggested FAR more frequently than they should be. The neural network seems to default to them whenever it is uncertain, because they are the most likely to be right.
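
To see why always guessing a common word pays off, here is a quick sketch that measures how often the single most frequent word would be the right answer. The helper `majority_baseline` and the toy sentences are made up for illustration, so the number it prints says nothing about the actual books.

```python
from collections import Counter

def majority_baseline(sentences):
    """Return the most common word and the share of all words it accounts for."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

toy = ["the cat sat on the mat", "the dog ate the bone", "a bird sang"]
word, share = majority_baseline(toy)
print(word, round(share, 3))  # the 0.286
```

Running the same function over `giant_script` would give the share that a "guess the most common word every time" strategy gets for free.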

The optimal number of epochs before overfitting seems to be 5-6. The val_f1 score seems to peak at 0.12 with the books I'm using, but it goes up a little with every new "book" appended to the dataset. Given a big enough library, I believe we could get this neural network to a val_f1 of about 0.2-0.5 (can't tell without actually doing it). It likely won't go higher than that, because the network as it stands can't categorize words as "nouns", "prepositions", "verbs" and so on to optimize its prediction based on that.

The current neural network is only a proof of concept that it can learn SOMETHING with a standard training regime. There are NUMEROUS ways to improve this code. If I were given more time to pursue this task, here's how I would go about it.

First, here is the current model I have:

![Current Model](https://i.imgur.com/S4xQOd8.png)

As described earlier, there are simply too many words, of which only a handful are far more common than the rest, so even with cross entropy and f1 scoring it's hard to get good results. As such, we could make a better network by first creating a new neural network that tells us what kind of word should come next. For example:

- according to all known laws of
- aviation [noun]

After figuring out what kind of word comes next, we could route the input to another custom neural network that selects the best noun, verb, etc. given the Universal Sentence Encoder embedding, as shown below:

![Better Model](https://imgur.com/d83ffPJ.png)

Unfortunately, due to time constraints, I wasn't able to implement this methodology.
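
Still, the routing step on its own is easy to sketch. The snippet below is a toy under heavy assumptions: the function `predict_two_stage`, the word types, and every probability are hypothetical stand-ins for the outputs of networks that were never trained here.

```python
def predict_two_stage(pos_probs, word_probs_by_pos):
    """Pick the most likely word type, then the most likely word of that type."""
    pos = max(pos_probs, key=pos_probs.get)
    candidates = word_probs_by_pos[pos]
    word = max(candidates, key=candidates.get)
    return pos, word

# Hypothetical network outputs for the input "according to all known laws of"
pos_probs = {"noun": 0.7, "verb": 0.2, "preposition": 0.1}
word_probs_by_pos = {
    "noun": {"aviation": 0.6, "bee": 0.4},
    "verb": {"fly": 0.9, "be": 0.1},
    "preposition": {"of": 0.5, "to": 0.5},
}
print(predict_two_stage(pos_probs, word_probs_by_pos))  # ('noun', 'aviation')
```

The appeal of this layout is that each second-stage network only has to choose among words of one type, so very common words of other types can't crowd out the answer.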