<h1 style="padding-top: 25px;padding-bottom: 25px;text-align: left; padding-left: 10px; background-color: #DDDDDD; 
    color: black;"> <img style="float: left; padding-right: 10px;" src="https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/iacs.png" height="50px"> <a href='https://harvard-iacs.github.io/2021-AC215/' target='_blank'><strong><font color="#A41034">AC215: Advanced Practical Data Science, MLOps</font></strong></a></h1>

# **<font color="#A41034">Exercise 4 - Language Models</font>**

**Harvard University**<br/>
**Fall 2021**<br/>
**Instructor:**<br/>
Pavlos Protopapas

<hr style="height:2pt">

## **<font color="#A41034">Competition</font>**

### **<font color="#f03b20">Due Date: Check Canvas</font>**

#### **[Join Competition](https://www.kaggle.com/t/a8ec65d928b645e596a071527b3b3d33)**

#### **[View Leaderboard](https://www.kaggle.com/c/ac215-fall-2021/leaderboard)**

Your task for this exercise is to build the best language model for classifying **abstracts** of astrophysics papers taken from arXiv. The labels for the classification dataset are as follows:

**Labels:**

* 0 = astro-ph.SR - Solar and Stellar Astrophysics
* 1 = astro-ph.GA - Astrophysics of Galaxies
* 2 = astro-ph.CO - Cosmology and Nongalactic Astrophysics
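
For convenience in code, the label encoding above can be kept as a small lookup table. The sketch below is one way to do it; the names `LABEL_NAMES` and `label_to_name` are illustrative and are not part of the provided dataset.

```python
# Illustrative lookup table for the three classes listed above.
# The variable and function names are assumptions, not part of the dataset.
LABEL_NAMES = {
    0: "astro-ph.SR",  # Solar and Stellar Astrophysics
    1: "astro-ph.GA",  # Astrophysics of Galaxies
    2: "astro-ph.CO",  # Cosmology and Nongalactic Astrophysics
}

def label_to_name(label):
    """Map an integer class label to its arXiv category string."""
    return LABEL_NAMES[int(label)]

print(label_to_name(1))
```

A table like this is handy when reading EDA plots or inspecting misclassified abstracts by category name.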

A good starting point for this exercise is a pre-trained language model. Because astrophysics text is highly domain-specific, there is a limit to the accuracy that a general-purpose pre-trained model can achieve. One way to build a better model is to fine-tune the language model on a large number of unlabeled astrophysics abstracts, and then use the fine-tuned model for classification.

Here are some techniques you can try:

* Transfer Learning using different pre-trained language models from [HuggingFace](https://huggingface.co/models)
* Fine-tune a language model using the abstract texts provided (`abstracts_train.txt` & `datasets/abstracts_validate.txt`). Here is a reference [link](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/language-modeling) on how to fine-tune a language model
* Any other modeling techniques you feel appropriate
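
One common way to fine-tune on unlabeled text is a masked language modeling (MLM) objective, which is among the approaches covered by the language-modeling examples linked above: a fraction of the input tokens is hidden and the model learns to predict them. The sketch below illustrates only the masking step on a list of token ids. It is simplified (every selected token is replaced by the mask token), and the 15% masking rate, the `mask_token_id` value, and the `-100` ignore label are assumptions that follow common convention. `mask_tokens` is a hypothetical helper, not a library function; in practice a library data collator would usually handle this step.

```python
import random

def mask_tokens(token_ids, mask_token_id, mask_prob=0.15, seed=None):
    """Return (masked_inputs, labels) for a simplified MLM objective.

    Positions chosen for masking are replaced by mask_token_id in the inputs
    and keep their original id in the labels. All other label positions are
    set to -100 so that a loss function can skip them.
    """
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in token_ids:
        if rng.random() < mask_prob:
            masked_inputs.append(mask_token_id)
            labels.append(token_id)
        else:
            masked_inputs.append(token_id)
            labels.append(-100)
    return masked_inputs, labels

# Toy example with made-up token ids; 0 stands in for the mask token.
inputs, labels = mask_tokens([11, 12, 13, 14, 15, 16, 17, 18], mask_token_id=0, seed=0)
print(inputs)
print(labels)
```

Because the labels come from the text itself, this objective needs no class annotations, which is why the unlabeled abstracts can be used for it.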

#### **Exercise Requirements:**
* At a minimum, beat the public leaderboard score of **0.64** (benchmark submission)



<br>

**Remember to upload your submission files to Kaggle and also submit your notebook to Canvas at the end.**

<br>

**<font color="#f03b20">The leaderboard for this competition will be computed based on a hidden test set.</font>**

## **<font color="#A41034">Setup Notebook</font>**

**Installs**

In [None]:
!pip install transformers datasets

**Imports**

In [None]:
import os
import requests
import zipfile
import tarfile
import shutil
import math
import json
import time
import sys
import cv2
import string
import re
import subprocess
import hashlib
import decimal
import numpy as np
import pandas as pd
from glob import glob
import collections
import unicodedata
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline

# Tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.models import Model, Sequential
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.utils.layer_utils import count_params

# sklearn
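from sklearn.model_selection import train_test_split

# Transformers (TFRobertaForSequenceClassification is referenced in save_model)
from transformers import TFRobertaForSequenceClassification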

**Verify Setup**

It is good practice to verify which versions of TensorFlow and Keras you are using, and to check whether a GPU is enabled and which GPU you have. Run the following cells to check.

References:
- [Eager Execution](https://www.tensorflow.org/guide/eager)
- [Data Performance](https://www.tensorflow.org/guide/data_performance)

In [None]:
# Enable/Disable Eager Execution
# Reference: https://www.tensorflow.org/guide/eager
# TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, 
# without building graphs

#tf.compat.v1.disable_eager_execution()
#tf.compat.v1.enable_eager_execution()

print("tensorflow version", tf.__version__)
print("keras version", tf.keras.__version__)
print("Eager Execution Enabled:", tf.executing_eagerly())

# Get the number of replicas 
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

devices = tf.config.get_visible_devices()
print("Devices:", devices)
print(tf.config.list_logical_devices('GPU'))

print("GPU Available: ", tf.config.list_physical_devices('GPU'))
print("All Physical Devices", tf.config.list_physical_devices())

# Better performance with the tf.data API
# Reference: https://www.tensorflow.org/guide/data_performance
AUTOTUNE = tf.data.AUTOTUNE

Run this cell to see which GPU you have. A P100 or T4 GPU is great. A K80 will still work, but training will be slow. Start this exercise early, since training can take time.

In [None]:
!nvidia-smi

**Utils**

Here are some utility functions that we will use in this notebook.

In [None]:
def download_file(packet_url, base_path="", extract=False, headers=None):
  # Create the target directory (including parents) if it does not exist yet
  if base_path != "":
    os.makedirs(base_path, exist_ok=True)
  packet_file = os.path.basename(packet_url)
  packet_path = os.path.join(base_path, packet_file)
  # Stream the download to disk in chunks instead of loading it into memory
  with requests.get(packet_url, stream=True, headers=headers) as r:
    r.raise_for_status()
    with open(packet_path, 'wb') as f:
      for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)

  # Optionally unpack the archive: zip files, otherwise assumed to be a tar archive
  if extract:
    if packet_file.endswith(".zip"):
      with zipfile.ZipFile(packet_path) as zfile:
        zfile.extractall(base_path)
    else:
      with tarfile.open(packet_path) as tfile:
        tfile.extractall(base_path)

class JsonEncoder(json.JSONEncoder):
  def default(self, obj):
    if isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, decimal.Decimal):
        return float(obj)
    elif isinstance(obj, np.ndarray):
        return obj.tolist()
    else:
        return super(JsonEncoder, self).default(obj)

experiment_name = "models"
if not os.path.exists(experiment_name):
  os.mkdir(experiment_name)

def save_data_details(data_details):
  with open(os.path.join(experiment_name,"data_details.json"), "w") as json_file:
    json_file.write(json.dumps(data_details,cls=JsonEncoder))

def save_model(model,model_name="model01"):

  if isinstance(model,TFRobertaForSequenceClassification):
    model.save_weights(os.path.join(experiment_name,model_name+".h5"))
  else:
    # Save the entire model (structure + weights)
    model.save(os.path.join(experiment_name,model_name+".hdf5"))

    # Save only the weights
    model.save_weights(os.path.join(experiment_name,model_name+".h5"))

    # Save the structure only
    model_json = model.to_json()
    with open(os.path.join(experiment_name,model_name+".json"), "w") as json_file:
        json_file.write(model_json)

def get_model_size(model_name="model01"):
  model_size = os.stat(os.path.join(experiment_name,model_name+".h5")).st_size
  return model_size

def evaluate_save_model(model,test_data, training_results,execution_time, learning_rate, batch_size, epochs, optimizer,save=True):
    
  # Get the model train history
  model_train_history = training_results.history
  # Get the number of epochs the training was run for
  num_epochs = len(model_train_history["loss"])

  # Plot training results
  fig = plt.figure(figsize=(15,5))
  axs = fig.add_subplot(1,2,1)
  axs.set_title('Loss')
  # Plot all metrics
  for metric in ["loss","val_loss"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()
  
  axs = fig.add_subplot(1,2,2)
  axs.set_title('Accuracy')
  # Plot all metrics
  for metric in ["accuracy","val_accuracy"]:
      axs.plot(np.arange(0, num_epochs), model_train_history[metric], label=metric)
  axs.legend()

  plt.show()
  
  # Evaluate on test data
  evaluation_results = model.evaluate(test_data)
  print(evaluation_results)
  
  if save:
    # Save model
    save_model(model, model_name=model.name)
    model_size = get_model_size(model_name=model.name)

    # Save model history
    with open(os.path.join(experiment_name,model.name+"_train_history.json"), "w") as json_file:
        json_file.write(json.dumps(model_train_history,cls=JsonEncoder))

    trainable_parameters = count_params(model.trainable_weights)
    non_trainable_parameters = count_params(model.non_trainable_weights)

    # Save model metrics
    metrics ={
        "trainable_parameters":trainable_parameters,
        "execution_time":execution_time,
        "loss":evaluation_results[0],
        "accuracy":evaluation_results[1],
        "model_size":model_size,
        "learning_rate":learning_rate,
        "batch_size":batch_size,
        "epochs":epochs,
        "name": model.name,
        "optimizer":type(optimizer).__name__
    }
    with open(os.path.join(experiment_name,model.name+"_model_metrics.json"), "w") as json_file:
        json_file.write(json.dumps(metrics,cls=JsonEncoder))

## **<font color="#A41034">Dataset</font>**

#### **Download**

In [None]:
start_time = time.time()
# Download the dataset
download_file("https://github.com/dlops-io/datasets/releases/download/v1.0/arxiv_astronmy_competition.zip", 
              base_path="datasets", extract=True)
execution_time = (time.time() - start_time)/60.0
print("Download execution time (mins)",execution_time)

#### **Load & EDA**

In [None]:
# Your Code Here

## **<font color="#A41034">Build Data Pipelines</font>**

In [None]:
# Your Code Here

## **<font color="#A41034">Build Text Classification Models</font>**

In [None]:
# Your Code Here

## **<font color="#A41034">Submit to Kaggle</font>**

* Make predictions on the test dataset
* Prepare the `submission.csv` file
* Upload it on the [Kaggle submission page](https://www.kaggle.com/c/ac215-fall-2021/submit)
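
The `submission.csv` step can be sketched with the standard `csv` module. The required column names are defined by the competition, not by this notebook, so the `id` and `label` header below is a placeholder to be checked against the competition page, and `write_submission` is a hypothetical helper.

```python
import csv

def write_submission(ids, predictions, path="submission.csv"):
    """Write one row per test example: an identifier and a predicted label."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        # Placeholder header: confirm the expected columns on the competition page
        writer.writerow(["id", "label"])
        for example_id, label in zip(ids, predictions):
            writer.writerow([example_id, int(label)])

# Toy usage with made-up ids and predicted class labels
write_submission([0, 1, 2], [2, 0, 1])
```

Casting each prediction to `int` keeps the file free of values such as `2.0` when the predictions come from an array of floats.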

In [None]:
# Your Code Here

## **<font color="#A41034">Kaggle Team Name</font>**

Please provide the Kaggle **Team Name** that you used to make your submissions.