# Assignment 4: Real-time text classification in the browser
### Jing Qian (jq2282)

## Part 1
Modify the starter code (7-colab-to-webpage.ipynb) to classify snippets of text from four books on Project Gutenberg.

Given a snippet of text (not necessarily a complete sentence), predict which book it belongs to.


### Step 1. Data preparation 
Use the method from https://www.nltk.org/book/ch02.html.

Get the texts of four books on Project Gutenberg: Emma, Paradise Lost, Hamlet, and Leaves of Grass.

Collect a training set of 1000 randomly selected sentences from each book.

In [0]:
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
gutenberg.fileids()

In [0]:
# Quick check of the word-level view of a book; it is not used below.
emma_words = gutenberg.words('austen-emma.txt')
len(emma_words)

In [3]:
# Load each book as a list of tokenized sentences.
nltk.download('punkt')
emma = gutenberg.sents('austen-emma.txt')
paradise = gutenberg.sents('milton-paradise.txt')
hamlet = gutenberg.sents('shakespeare-hamlet.txt')
leaves = gutenberg.sents('whitman-leaves.txt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [0]:
import numpy as np
nlines = 1000  # sentences sampled per book, as required
emma_pick = np.random.choice(emma, nlines, replace=False)
paradise_pick = np.random.choice(paradise, nlines, replace=False)
hamlet_pick = np.random.choice(hamlet, nlines, replace=False)
leaves_pick = np.random.choice(leaves, nlines, replace=False)
xs = np.vstack((emma_pick, paradise_pick, hamlet_pick, leaves_pick))
xs = xs.flatten()
# Labels: 0 = emma, 1 = paradise, 2 = hamlet, 3 = leaves
ys = [0]*nlines + [1]*nlines + [2]*nlines + [3]*nlines
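
The sampling above relies on NumPy turning each book's list of sentences into a one-dimensional array of objects, which can behave differently across NumPy versions. A minimal sketch of an alternative that uses only the standard library, assuming each book is a plain list of token lists (for example `list(gutenberg.sents(...))`). The `sample_sentences` helper and the toy corpus are hypothetical, and joining tokens into strings means the Keras tokenizer would apply its own punctuation filtering, unlike the token lists used above.

```python
import random

def sample_sentences(books, n_per_book, seed=0):
    """Draw n_per_book sentences without replacement from each book.

    books: list of books, each a list of sentences (lists of tokens).
    Returns (texts, labels); a label is the position of the book in `books`.
    """
    rng = random.Random(seed)
    texts, labels = [], []
    for label, sentences in enumerate(books):
        for tokens in rng.sample(sentences, n_per_book):
            texts.append(" ".join(tokens))
            labels.append(label)
    return texts, labels

# Toy stand-ins for the Gutenberg books (hypothetical data).
toy_books = [
    [["Emma", "was", "happy", "."], ["She", "smiled", "."], ["He", "left", "."]],
    [["Sing", ",", "Muse"], ["Hail", "holy", "light"], ["Of", "Man"]],
]
texts, labels = sample_sentences(toy_books, n_per_book=2)
print(labels)
```

The labels come out grouped by book, so the later `train_test_split` shuffle is still needed.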

In [0]:
from sklearn.model_selection import train_test_split
# shuffle=True by default, so the data is shuffled before the split
x_train, x_test, y_train, y_test = train_test_split(xs, ys, test_size=0.1)
# x_tv, x_test, y_tv, y_test = train_test_split(xs, ys, test_size=0.33)
# x_train, x_val, y_train, y_val = train_test_split(x_tv, y_tv, test_size=0.5)

In [0]:
max_len = 20
num_words = 10000
from tensorflow.keras.preprocessing.text import Tokenizer
# Fit the tokenizer on the training data
t = Tokenizer(num_words=num_words)
t.fit_on_texts(x_train)
#print(t.word_index)
vectorized = t.texts_to_sequences(emma_pick)
#print(vectorized)

from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(vectorized, maxlen=max_len, padding='post')
#print(padded)

metadata = {
  'word_index': t.word_index,
  'max_len': max_len,
  'vocabulary_size': num_words,
}

x_train = t.texts_to_sequences(x_train)
x_train = pad_sequences(x_train, maxlen=max_len, padding='post')
# print(x_train)
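
To make the preprocessing easier to reproduce in the browser, here is a rough pure-Python sketch of what `texts_to_sequences` followed by `pad_sequences(..., padding='post')` does with the settings above. It assumes no `oov_token` is set and simplifies tokenization to lowercasing and splitting on whitespace; `texts_to_padded` and `toy_index` are hypothetical names used only for illustration.

```python
def texts_to_padded(texts, word_index, num_words, max_len):
    """Mimic Tokenizer.texts_to_sequences + pad_sequences(padding='post').

    Words missing from word_index, or with an index >= num_words, are dropped.
    Sequences longer than max_len keep their last max_len tokens (the default
    truncating='pre'); shorter ones are padded with zeros on the right.
    """
    rows = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split()
               if w in word_index and word_index[w] < num_words]
        ids = ids[-max_len:]
        rows.append(ids + [0] * (max_len - len(ids)))
    return rows

toy_index = {"the": 1, "and": 2, "heaven": 3, "rare": 4}  # hypothetical
print(texts_to_padded(["The heaven and the unknown rare"],
                      toy_index, num_words=4, max_len=5))
# [[1, 3, 2, 1, 0]]
```

Dropped words shift the following words to the left rather than leaving a gap, which matters when the same lookup is repeated in JavaScript.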

### Step 4. Define a model, train and test

In [49]:
embedding_size = 8
n_classes = 4
epochs = 10
import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(num_words, embedding_size, input_shape=(max_len,)))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(n_classes, activation='softmax'))
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_2 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 644       
Total params: 80,644
Trainable params: 80,644
Non-trainable params: 0
_________________________________________________________________
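
The parameter counts in the summary can be checked by hand from the hyperparameters defined above. A short sketch of the arithmetic:

```python
num_words, embedding_size, max_len, n_classes = 10000, 8, 20, 4

embedding_params = num_words * embedding_size          # one vector per vocabulary entry
flatten_units = max_len * embedding_size               # Flatten has no weights
dense_params = flatten_units * n_classes + n_classes   # weights plus biases

print(embedding_params, flatten_units, dense_params,
      embedding_params + dense_params)
# 80000 160 644 80644
```

Almost all of the parameters sit in the embedding table, so `num_words` and `embedding_size` dominate the size of the model shipped to the browser.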


In [50]:
model.fit(x_train, y_train, epochs=epochs, validation_split=0.2)

Train on 2880 samples, validate on 720 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f033dcdb9e8>

In [52]:
test_example = paradise[1200]
x_test = t.texts_to_sequences([test_example])
x_test = pad_sequences(x_test, maxlen=max_len, padding='post')
print(x_test)

[[ 81   1   1   8 126  12 126  26  65   1  65  14   7  25  34   0   0   0
    0   0]]


In [53]:
preds = model.predict(x_test)
print(preds)
import numpy as np
print(np.argmax(preds))

[[0.24311227 0.26148778 0.24880302 0.24659693]]
1
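
The predicted class index can be mapped back to a book using the label order from the data-preparation cell. A small sketch using the probabilities printed above; the `label_names` list is an assumption based on that label comment.

```python
# Label order follows the comment in the data-preparation cell.
label_names = ["emma", "paradise", "hamlet", "leaves"]

# Probabilities copied from the printed prediction above.
probs = [0.24311227, 0.26148778, 0.24880302, 0.24659693]

best = max(range(len(probs)), key=probs.__getitem__)
ranked = sorted(probs, reverse=True)
margin = ranked[0] - ranked[1]  # gap between the top two classes
print(label_names[best], round(margin, 4))
```

The top class is `paradise`, which matches the book the test sentence was taken from, but the gap to the runner-up is only about 0.013, so the scores for this example are close to uniform.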


## To be moved below

### Step 1. Environment Preparation

In [0]:
!pip install tensorflow==2.0.0-alpha0

In [0]:
!pip install tensorflowjs==1.0.1

Connect to GitHub and use the applied-dl repository: https://github.com/fakeJQ/applied-dl.

The GitHub Pages site generated from the repository is at https://fakejq.github.io/applied-dl/

### Step 2. Clone the repository from GitHub

In [0]:
# your github username
USER_NAME = "fakeJQ" 

# the email associated with your commits
# (may not matter if you leave it as is)
USER_EMAIL = "tcqj_8758@163.com" 

# create a token by visiting https://github.com/settings/tokens
# choose public permissions
# important: treat this token like a password; do not commit it
# or submit it with your homework.
TOKEN = "<your-personal-access-token>"

# for example, if your user_name is "foo", then this notebook will create
# a site at "https://foo.github.io/hw4/"
SITE_NAME = "hw4"
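
Hardcoding the token in a cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative: read it from an environment variable and fall back to a hidden prompt. The variable name `GITHUB_TOKEN` and the helpers `get_token` and `remote_url` are assumptions for illustration, not part of the starter code.

```python
import os
from getpass import getpass

def get_token(env_var="GITHUB_TOKEN"):
    """Return the token from the environment, or prompt for it without echo."""
    token = os.environ.get(env_var)
    return token if token else getpass("GitHub token: ")

def remote_url(user, token, repo):
    """Build the authenticated HTTPS remote used for clone and push."""
    return "https://{}:{}@github.com/{}/{}.git".format(user, token, user, repo)

# Example with a placeholder token; never print a real one.
print(remote_url("foo", "<token>", "applied-dl"))
```

In the notebook, `TOKEN = get_token()` would then replace the literal assignment.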

In [0]:
!git config --global user.email {USER_EMAIL}
!git config --global user.name  {USER_NAME}

In [0]:
import os
# git clone creates a directory named after the repository.
repo_path = 'applied-dl'
if not os.path.exists(os.path.join(os.getcwd(), repo_path)):
  !git clone https://{USER_NAME}:{TOKEN}@github.com/{USER_NAME}/applied-dl


In [0]:
os.chdir('/content/applied-dl')
!git pull

Already up to date.


In [0]:
project_path = os.path.join(os.getcwd(), SITE_NAME)
if not os.path.exists(project_path): 
  os.mkdir(project_path)
os.chdir(project_path)

In [0]:
print(project_path)

/content/applied-dl/hw4


In [0]:
# DO NOT MODIFY
MODEL_DIR = os.path.join(project_path, "model_js")
if not os.path.exists(MODEL_DIR):
  os.mkdir(MODEL_DIR)
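
The check-then-create pattern used for the project and model directories can be collapsed into a single call. A small sketch, demonstrated in a temporary directory instead of `/content`; the `ensure_dir` helper is hypothetical.

```python
import os
import tempfile

def ensure_dir(*parts):
    """Create the directory (and any missing parents) and return its path."""
    path = os.path.join(*parts)
    os.makedirs(path, exist_ok=True)
    return path

# Demonstrated in a temporary directory rather than /content.
base = tempfile.mkdtemp()
model_dir = ensure_dir(base, "hw4", "model_js")
model_dir_again = ensure_dir(base, "hw4", "model_js")  # second call is a no-op
print(os.path.isdir(model_dir), model_dir == model_dir_again)
```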

### Step 5. Export the model and build the web page

In [0]:
import json
import tensorflowjs as tfjs

metadata_json_path = os.path.join(MODEL_DIR, 'metadata.json')
with open(metadata_json_path, 'wt') as f:
  json.dump(metadata, f)
tfjs.converters.save_keras_model(model, MODEL_DIR)
print('\nSaved model artifacts in directory: %s' % MODEL_DIR)


Saved model artifacts in directory: /content/applied-dl/hw4/model_js
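
`t.word_index` holds every word seen during fitting, while the tokenizer only emits indices below `num_words`. A sketch of trimming the index before export, so the file is smaller and the browser never looks up an index the embedding layer does not have. The `trim_word_index` helper and the toy values are hypothetical.

```python
import json

def trim_word_index(word_index, num_words):
    """Keep only the words whose index the embedding layer can look up."""
    return {w: i for w, i in word_index.items() if i < num_words}

# Hypothetical toy values standing in for t.word_index and num_words.
toy_word_index = {"the": 1, "and": 2, "heaven": 3, "rare": 4}
toy_metadata = {
    "word_index": trim_word_index(toy_word_index, num_words=4),
    "max_len": 20,
    "vocabulary_size": 4,
}
restored = json.loads(json.dumps(toy_metadata))  # JSON round trip
print(sorted(restored["word_index"]))
```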


In [0]:
index_html = """
<!doctype html>

<body>
  <style>
    #text-entry {
      font-size: 120%;
      width: 60%;
      height: 200px;
    }
  </style>
  <h1>
    Title
  </h1>
  <hr>
  <div class="create-model">
    <button id="load-model" style="display:none">Load model</button>
  </div>
  <div>
    <div>
      <span>Vocabulary size: </span>
      <span id="vocabularySize"></span>
    </div>
    <div>
      <span>Max length: </span>
      <span id="maxLen"></span>
    </div>
  </div>
  <hr>
  <div>
    <select id="example-select" class="form-control">
      <option value="example1">Alice's Adventures in Wonderland</option>
      <option value="example2">Dracula</option>
      <option value="example3">The Iliad</option>
    </select>
  </div>
  <div>
    <textarea id="text-entry"></textarea>
  </div>
  <hr>
  <div>
    <span id="status">Standing by.</span>
  </div>

  <script src='https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js'></script>
  <script src='index.js'></script>
</body>
"""

In [0]:
index_js = """
const HOSTED_URLS = {
  model:
      'model_js/model.json',
  metadata:
      'model_js/metadata.json'
};

const examples = {
  'example1':
      'Alice was beginning to get very tired of sitting by her sister on the bank.',
  'example2':
      'Buda-Pesth seems a wonderful place.',
  'example3':
      'Scepticism was as much the result of knowledge, as knowledge is of scepticism.'      
};

function status(statusText) {
  console.log(statusText);
  document.getElementById('status').textContent = statusText;
}

function showMetadata(metadataJSON) {
  document.getElementById('vocabularySize').textContent =
      metadataJSON['vocabulary_size'];
  document.getElementById('maxLen').textContent =
      metadataJSON['max_len'];
}

function settextField(text, predict) {
  const textField = document.getElementById('text-entry');
  textField.value = text;
  doPredict(predict);
}

function setPredictFunction(predict) {
  const textField = document.getElementById('text-entry');
  textField.addEventListener('input', () => doPredict(predict));
}

function disableLoadModelButtons() {
  document.getElementById('load-model').style.display = 'none';
}

function doPredict(predict) {
  const textField = document.getElementById('text-entry');
  const result = predict(textField.value);
  let scoreString = 'Class scores: ';
  for (const x in result.score) {
    scoreString += x + ' -> ' + result.score[x].toFixed(3) + ', ';
  }
  //console.log(scoreString);
  status(
      scoreString + '(elapsed: ' + result.elapsed.toFixed(3) + ' ms)');
}

function prepUI(predict) {
  setPredictFunction(predict);
  const testExampleSelect = document.getElementById('example-select');
  testExampleSelect.addEventListener('change', () => {
    settextField(examples[testExampleSelect.value], predict);
  });
  settextField(examples['example1'], predict);
}

async function urlExists(url) {
  status('Testing url ' + url);
  try {
    const response = await fetch(url, {method: 'HEAD'});
    return response.ok;
  } catch (err) {
    return false;
  }
}

async function loadHostedPretrainedModel(url) {
  status('Loading pretrained model from ' + url);
  try {
    const model = await tf.loadLayersModel(url);
    status('Done loading pretrained model.');
    disableLoadModelButtons();
    return model;
  } catch (err) {
    console.error(err);
    status('Loading pretrained model failed.');
  }
}

async function loadHostedMetadata(url) {
  status('Loading metadata from ' + url);
  try {
    const metadataJson = await fetch(url);
    const metadata = await metadataJson.json();
    status('Done loading metadata.');
    return metadata;
  } catch (err) {
    console.error(err);
    status('Loading metadata failed.');
  }
}

class Classifier {

  async init(urls) {
    this.urls = urls;
    this.model = await loadHostedPretrainedModel(urls.model);
    await this.loadMetadata();
    return this;
  }

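  async loadMetadata() {
    const metadata =
        await loadHostedMetadata(this.urls.metadata);
    showMetadata(metadata);
    this.maxLen = metadata['max_len'];
    console.log('maxLen = ' + this.maxLen);
    this.vocabularySize = metadata['vocabulary_size'];
    this.wordIndex = metadata['word_index'];
  }

  predict(text) {
    // Convert to lower case and strip the punctuation marks . , and !
    const inputText =
        text.trim().toLowerCase().replace(/(\.|\,|\!)/g, '').split(' ');
    // Look up word indices. Like the Keras Tokenizer, drop words that are
    // unknown or whose index falls outside the vocabulary.
    const indices = [];
    for (const word of inputText) {
      const index = this.wordIndex[word];
      if (typeof index === 'number' && index < this.vocabularySize) {
        indices.push(index);
      }
    }
    // pad_sequences truncates from the front by default, so keep the last
    // maxLen indices; the remaining positions stay zero (post-padding).
    const kept = indices.slice(-this.maxLen);
    const inputBuffer = tf.buffer([1, this.maxLen], 'float32');
    kept.forEach((index, i) => inputBuffer.set(index, 0, i));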
    const input = inputBuffer.toTensor();
    //console.log(input);

    status('Running inference');
    const beginMs = performance.now();
    const predictOut = this.model.predict(input);
    //console.log(predictOut.dataSync());
    const score = predictOut.dataSync();//[0];
    predictOut.dispose();
    const endMs = performance.now();

    return {score: score, elapsed: (endMs - beginMs)};
  }
};

async function setup() {
  if (await urlExists(HOSTED_URLS.model)) {
    status('Model available: ' + HOSTED_URLS.model);
    const button = document.getElementById('load-model');
    button.addEventListener('click', async () => {
      const predictor = await new Classifier().init(HOSTED_URLS);
      prepUI(x => predictor.predict(x));
    });
    button.style.display = 'inline-block';
  }

  status('Standing by.');
}

setup();
"""

In [0]:
with open('index.html','w') as f:
  f.write(index_html)
  
with open('index.js','w') as f:
  f.write(index_js)

In [0]:
!ls

index.html  index.js  model_js


In [0]:
# Stage and commit the generated files before pushing.
!git add .
!git commit -m "colab -> github"
!git push https://{USER_NAME}:{TOKEN}@github.com/{USER_NAME}/applied-dl.git master

Counting objects: 9, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 34.15 KiB | 11.38 MiB/s, done.
Total 9 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/fakeJQ/applied-dl.git
   f116369..0804b74  master -> master


In [0]:
print("Now, visit https://%s.github.io/applied-dl/%s/" % (USER_NAME, SITE_NAME))

Now, visit https://fakeJQ.github.io/applied-dl/hw4/
