# Assignment 3: Natural Language Processing and Classification
Name(s):

Note: the report with the following headings can be completed in this notebook, or in a separate document.

## 1. Dataset Description
A description of the dataset, including:
1. Biases and limitations of the dataset
2. Class (im)balance
3. A summary of your impressions of the dataset

## 2. Documentation of Experiments
A description of the various things you tried, including:
1. Changes to the vectorization/tokenization/other preprocessing
2. Changes to the model architecture
3. Changes to hyperparameters, loss functions, and optimizers
4. A brief summary of things you learned in the process

Some of this might be best presented in a table or list format.

## 3. Your Final Model
A description of the final "best" model you built, including:
1. The preprocessing pipeline
2. The architecture and model hyperparameters
3. A brief discussion/your best guess as to why this model outperformed the others

## 4. Discussion/Conclusion
A discussion/conclusion section describing:
1. Challenges, advantages, and limitations of the process or model
2. Your key takeaways from the process
3. What you would do if you had more time
4. Any other thoughts or observations you have about NLP and classification

# Starter Code
Here's some BigQuery-related setup code and a terrible network to get you started. Feel free to change just about anything in this notebook.

## Load the BigQuery data
I highly recommend using [Colab](https://colab.research.google.com/) for this assignment, as it involves some hefty computation. If you have a GPU that you've managed to convince TensorFlow to work with, you can also use the [BigQuery Python client library](https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-install-python) to load the data.

The following cell assumes you are using Colab and will prompt you for your Google credentials. You may need to activate the BigQuery API for your account first.


In [None]:
from google.cloud import bigquery
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

To get started with the BigQuery API, do the following:
1. Create a project in the [Google Cloud Console](https://console.cloud.google.com/). The free tier provides up to 1 TB of querying per month, so you should be fine, but you can also claim the free credit provided for this course (see instructions on D2L under "Course Info").
2. Enable the BigQuery API for your project.
3. Change the `"<your project id here>"` placeholder in the next cell to the project ID from your Google Cloud Console (probably a slugified version of your project name).

Here I'm loading 100k random rows from the [Stack Overflow dataset](https://console.cloud.google.com/marketplace/details/stack-exchange/stack-overflow), and limiting the selection to questions with a single tag taken from the top 10 most common tags.

In [None]:
project_id = "<your project id here>"

In [None]:
%%bigquery df --project $project_id
SELECT title, tags
FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1 AND tags in (
  SELECT tags
  FROM `bigquery-public-data.stackoverflow.posts_questions` as Posts
  WHERE ARRAY_LENGTH(SPLIT(Posts.tags, '|')) = 1
  GROUP BY tags
  ORDER BY COUNT(*) DESC
  LIMIT 10
)
ORDER BY RAND()
LIMIT 100000

In [None]:
# Look at the distribution of tags (one bar per tag, sorted by count)
df['tags'].value_counts().plot(kind='bar')

# Look at the first five rows of the dataframe
df.head()

You might want to do some class balancing or some other preprocessing, but you'll definitely need to do the following:
1. Split the data into training/validation/testing sets
2. Encode the labels as integers or one-hot vectors (there are lots of ways to do this; I chose to use Pandas' `get_dummies` function)
3. Define a `TextVectorization` layer to normalize, split, and map strings to integers - don't forget to take care of the padding values, either by masking or using `RaggedTensor`s
4. Define some kind of model, hopefully better than the one I use as an example
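
If you do look into class balancing, one common option is to weight each class inversely to its frequency. The following is a minimal plain-Python sketch of that idea, not part of the starter code: `toy_tags` is a made-up stand-in for `df["tags"]`, `class_weights` is a hypothetical helper, and the formula `n / (k * count)` is the widely used "balanced" heuristic.

```python
from collections import Counter

def class_weights(tags):
    """Return {tag: weight} using the 'balanced' heuristic n / (k * count)."""
    counts = Counter(tags)
    n, k = len(tags), len(counts)
    return {tag: n / (k * c) for tag, c in counts.items()}

# Hypothetical toy labels standing in for df["tags"]
toy_tags = ["python"] * 6 + ["java"] * 3 + ["c#"] * 1

# Fraction of the rows that belong to each tag
proportions = {t: c / len(toy_tags) for t, c in Counter(toy_tags).items()}

# Rare tags get larger weights, common tags get smaller ones
weights = class_weights(toy_tags)

print(proportions)
print(weights)
```

Keras' `Model.fit` accepts a `class_weight` dictionary keyed by integer class index, so the tag names would first need to be mapped to their positions in the label encoding.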

In [None]:
import pandas as pd

tv_thresh = int(.7 * len(df))
vt_thresh = int(.85 * len(df))

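# train/val/test split (e.g. 70/15/15)
# the rows were already shuffled by ORDER BY RAND() in the query
train_X = df["title"][:tv_thresh].values
val_X = df["title"][tv_thresh:vt_thresh].values
test_X = df["title"][vt_thresh:].values

# encode the output labels as one-hot float vectors
# (recent pandas versions return booleans unless a dtype is given)
labels = pd.get_dummies(df["tags"], dtype="float32")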
label_names = labels.columns
train_y = labels[:tv_thresh].values
val_y = labels[tv_thresh:vt_thresh].values
test_y = labels[vt_thresh:].values
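
For comparison, here is a rough plain-Python sketch of the two label encodings mentioned above (integers and one-hot vectors). The helper `encode_labels` and the toy tag list are hypothetical and only meant to show the relationship between the two forms; the starter code itself uses `get_dummies`.

```python
def encode_labels(tags):
    """Return (class names, integer labels, one-hot labels) for a list of tags."""
    names = sorted(set(tags))  # fixed, reproducible class order
    index = {name: i for i, name in enumerate(names)}
    int_labels = [index[t] for t in tags]
    one_hot = [
        [1.0 if i == j else 0.0 for j in range(len(names))]
        for i in int_labels
    ]
    return names, int_labels, one_hot

# Hypothetical toy tags standing in for df["tags"]
names, int_labels, one_hot = encode_labels(["python", "java", "python", "c#"])
print(names)
print(int_labels)
print(one_hot)
```

The choice affects the loss: in Keras, integer labels pair with `sparse_categorical_crossentropy`, while one-hot labels pair with `categorical_crossentropy`.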

The following cell defines the vectorization layer. There are lots of important decisions to make here, such as the vocabulary size and how the text is standardized.

In [None]:
import tensorflow as tf

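# define the tokenization/sequencing parameters
vocab_size = 2000  # vocabulary cap, including the padding and OOV tokens
vec_layer = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    standardize="lower",  # lowercase only; punctuation is kept
    ragged=True  # return RaggedTensors instead of zero-padded dense tensors
)
# build the vocabulary from the training titles only
vec_layer.adapt(train_X)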
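
To make the vectorization step less of a black box, here is a simplified plain-Python approximation of what a layer configured like this does: lowercase, split on whitespace, and map each token to an integer. It is a sketch for intuition, not the Keras implementation. The helper names and toy titles are hypothetical, and tie-breaking between equally frequent tokens may differ from Keras. Reserving index 0 for padding and index 1 for out-of-vocabulary tokens follows the layer's documented convention.

```python
from collections import Counter

PAD, OOV = 0, 1  # 0 is reserved for padding, 1 for out-of-vocabulary tokens

def build_vocab(texts, max_tokens):
    """Map the most frequent lowercase tokens to integers starting at 2."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    # Two of the max_tokens slots are taken by the padding and OOV entries
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: i + 2 for i, tok in enumerate(most_common)}

def vectorize(text, vocab):
    """Turn one title into a variable-length list of token ids."""
    return [vocab.get(tok, OOV) for tok in text.lower().split()]

def pad(seqs):
    """Right-pad every sequence with PAD to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s + [PAD] * (width - len(s)) for s in seqs]

# Hypothetical toy titles standing in for train_X
titles = ["How to sort a list", "Sort a dict by value", "How to parse JSON"]
vocab = build_vocab(titles, max_tokens=6)

ragged = [vectorize(t, vocab) for t in titles]  # like ragged=True: no padding
dense = pad(ragged)                             # like the dense, padded output
mask = [[tok != PAD for tok in row] for row in dense]  # True where tokens are real

print(ragged)
print(dense)
print(mask)
```

With `ragged=True` no padding is produced, so there is nothing to mask. If you switch to dense output instead, the `Embedding` layer's `mask_zero=True` option is one way to have downstream layers skip the padded positions.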

And finally, here's a sample model that actually managed to get worse as training progressed.

In [None]:
model = tf.keras.Sequential([
    vec_layer,
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.GRU(128),
    tf.keras.layers.Dense(len(label_names))
])

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(
    x=train_X, 
    y=train_y,
    validation_data=(val_X, val_y),
    epochs=5
)
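
One thing worth checking in the sample model: the final `Dense` layer has no activation, so it outputs raw scores (logits), while the `"categorical_crossentropy"` string selects a loss that by default expects probabilities. This may be one contributor to the poor training behaviour, though I have not verified that it is the only one. The plain-Python sketch below (hypothetical helper names, made-up numbers) shows the softmax step that turns logits into probabilities before the cross-entropy is computed.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    """Categorical cross-entropy for one example with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)

# Made-up logits for a 3-class example whose true class is index 0
logits = [2.0, -1.0, 0.5]
target = [1.0, 0.0, 0.0]

probs = softmax(logits)
loss = cross_entropy(probs, target)
print(probs, loss)
```

In Keras, the two usual ways to make the layer and the loss agree are to set `activation="softmax"` on the last `Dense` layer, or to keep the logits and pass `tf.keras.losses.CategoricalCrossentropy(from_logits=True)` as the loss.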