<a href="https://colab.research.google.com/github/AirbornBird88/hmill-exper/blob/main/Arxiv_classification_with_Mill_Julia_multilabel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.5" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.5 on the current Colab Runtime...
2024-06-03 15:52:41 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz [130873886/130873886] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing IJulia kernel...
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mInstalling julia kernelspec in /root/.local/share/jupyter/kernels/julia-1.8

Successfully installed julia version 1.8.5!
Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then
jump to the 'Checking the Installation' section.




# Checking the Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [1]:
versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 2 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  JULIA_NUM_THREADS = 2




Have fun!

<img src="https://raw.githubusercontent.com/JuliaLang/julia-logo-graphics/master/images/julia-logo-mask.png" height="100" />

In [None]:
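using BenchmarkTools  # provides the @btime macro used below

try
    using CUDA
catch
    println("No GPU found.")
else
    run(`nvidia-smi`)
    # Create a new random matrix directly on the GPU:
    M_on_gpu = CUDA.CURAND.rand(2^11, 2^11)
    # Benchmark a matrix product on the GPU; `nothing` suppresses the result display
    @btime $M_on_gpu * $M_on_gpu; nothing
end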

No GPU found.


In [None]:
languages = ["Julia", "Python", "R"]

for lang in languages
  println(lang)
end

Julia
Python
R


In [2]:
# Julia code/packages
using Pkg

Pkg.add("IJulia")
Pkg.add("Mill")
Pkg.add("DataFrames")
Pkg.add("Flux")
Pkg.add("PyCall")
Pkg.add("JSON")
Pkg.add("MLUtils")
Pkg.add("Zygote")
Pkg.add("Statistics")
Pkg.add("JsonGrinder")

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m RealDot ───────────────────── v0.1.0
[32m[1m   Installed[22m[39m Crayons ───────────────────── v4.1.1
[32m[1m   Installed[22m[39m IrrationalConstants ───────── v0.2.2
[32m[1m   Installed[22m[39m GPUArraysCore ─────────────── v0.1.5
[32m[1m   Installed[22m[39m IRTools ───────────────────── v0.4.14
[32m[1m   Installed[22m[39m Scratch ───────────────────── v1.2.1
[32m[1m   Installed[22m[39m Transducers ───────────────── v0.4.80
[32m[1m   Installed[22m[39m Flux ──────────────────────── v0.13.17
[32m[1m   Installed[22m[39m DiffRules ─────────────────── v1.15.1
[32m[1m   Installed[22m[39m Adap

In [None]:
using PyCall

LoadError: ArgumentError: Package PyCall not found in current path.
- Run `import Pkg; Pkg.add("PyCall")` to install the PyCall package.

In [None]:
# Use PyCall's py"""...""" string macro (requires `using PyCall`) to run Python code that installs Kaggle
py"""
import subprocess
import sys

# Install Kaggle
subprocess.run([sys.executable, '-m', 'pip', 'install', 'kaggle'])
"""

In [None]:
!pip install kaggle



In [None]:
# Upload Kaggle API Key
from google.colab import files
files.upload()  # Upload the kaggle.json file

Saving kaggle_mill.json to kaggle_mill.json


{'kaggle_mill.json': b'{"username":"airbornbird88","key":"<redacted>"}'}

In [None]:
# Set up Kaggle configuration (the Kaggle CLI looks for the key at ~/.kaggle/kaggle.json)
!mkdir -p ~/.kaggle
!mv kaggle_mill.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Download the arXiv dataset
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 99% 1.27G/1.29G [00:18<00:00, 89.7MB/s]
100% 1.29G/1.29G [00:18<00:00, 76.5MB/s]


In [None]:
# Extract the dataset
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [None]:
ls /content

arxiv.zip  sample_data/


In [None]:
ls -lh arxiv-metadata-oai-snapshot.json

-rw-r--r-- 1 root root 3.9G May 25 23:53 arxiv-metadata-oai-snapshot.json


## Data Preprocessing

In [None]:
# List the contents of the /content directory
content_dir = "/content"
println("Contents of $content_dir:")
readdir(content_dir)

NameError: name 'println' is not defined

In [None]:
using Pkg

# Direct dependencies of the active environment (name => UUID).
# `Pkg.installed()` is deprecated, so query the project instead.
deps = Pkg.project().dependencies

# Check if Flux is installed
if haskey(deps, "Flux")
    println("Flux is installed.")
else
    println("Flux is not installed.")
end

# Check if Mill is installed
if haskey(deps, "Mill")
    println("Mill is installed.")
else
    println("Mill is not installed.")
end

Flux is installed.
Mill is installed.



## Schema Visualization
This example shows how a schema can be turned into an interactive HTML visualization, which makes it easier to examine the schema of large and heterogeneous data.

In [None]:
using JsonGrinder, JSON
import JsonGrinder: generate_html

In [None]:
using JSON

# File path
data_file = "/content/arxiv-metadata-oai-snapshot.json"

# Vector to store parsed JSON objects
samples = Vector{Dict}()

# Maximum number of samples to parse
max_samples = 5000

# Open the file
open(data_file) do file
    # Read and parse JSON objects line by line
    for line in eachline(file)
        # Parse each line as a JSON object and push it to the samples vector
        push!(samples, JSON.Parser.parse(line))

        # Check if we've reached the maximum number of samples
        if length(samples) >= max_samples
            break
        end
    end
end
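
The snapshot file is several gigabytes, so the loop above stops after `max_samples` lines rather than loading everything. The same early stop can be written lazily with `Iterators.take`. The sketch below is only an illustration: it uses an in-memory `IOBuffer` as a stand-in for the file and keeps the raw lines instead of parsing them, so it runs without any packages. `first_lines` is a hypothetical helper name, not part of JSON.jl or JsonGrinder.

```julia
# Hypothetical helper: collect at most `n` lines from any IO stream.
# `eachline` is lazy, so lines after the n-th are not consumed by `take`.
first_lines(io::IO, n::Integer) = collect(Iterators.take(eachline(io), n))

# Stand-in for the JSON-lines snapshot: one JSON object per line.
io = IOBuffer("""
{"id": "a", "categories": "hep-ph"}
{"id": "b", "categories": "math.CO cs.CG"}
{"id": "c", "categories": "gr-qc"}
""")

lines = first_lines(io, 2)
println(length(lines))  # number of lines actually kept
```

With a real file, each kept line could then be passed to `JSON.parse` as in the cell above.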


In [None]:
# Pretty-print the first few parsed JSON objects (indented by 2 spaces)
for i in 1:5  # Adjust the range as needed
    JSON.print(samples[i], 2)
end

In [None]:
# Fix the random seed so that every run gives the same results (for teaching purposes)
using Random

Random.seed!(42)

TaskLocalRNG()

In [None]:
# Define the number of samples, the validation set size, the minibatch size, and the number of iterations.
n_samples, n_val, minibatchsize, iterations = 10000, 200, 10, 20

(10000, 200, 10, 20)

In [None]:
targets = map(c -> c["categories"], samples)

5000-element Vector{String}:
 "hep-ph"
 "math.CO cs.CG"
 "physics.gen-ph"
 "math.CO"
 "math.CA math.FA"
 "cond-mat.mes-hall"
 "gr-qc"
 "cond-mat.mtrl-sci"
 "astro-ph"
 "math.CO"
 "math.NT math.AG"
 "math.NT"
 "math.NT"
 ⋮
 "astro-ph"
 "astro-ph"
 "astro-ph"
 "cond-mat.dis-nn cond-mat.soft"
 "cond-mat.str-el"
 "gr-qc cond-mat.str-el hep-ph physics.hist-ph"
 "math.OA math.FA"
 "astro-ph"
 "hep-th"
 "quant-ph"
 "astro-ph gr-qc hep-ph hep-th"
 "cond-mat.supr-con"

In [None]:
labels = sort(unique(targets))

943-element Vector{String}:
 "astro-ph"
 "astro-ph gr-qc"
 "astro-ph gr-qc hep-ph"
 "astro-ph gr-qc hep-ph hep-th"
 "astro-ph gr-qc hep-ph nucl-th"
 "astro-ph gr-qc hep-th"
 "astro-ph gr-qc math.AP"
 "astro-ph gr-qc nucl-th"
 "astro-ph hep-ex"
 "astro-ph hep-ph"
 "astro-ph hep-ph hep-th"
 "astro-ph hep-ph nucl-th"
 "astro-ph hep-ph physics.atom-ph physics.space-ph"
 ⋮
 "quant-ph math-ph math.LO math.MP math.PR"
 "quant-ph math-ph math.MP"
 "quant-ph math.PR"
 "quant-ph nucl-th"
 "quant-ph physics.optics"
 "stat.AP"
 "stat.AP cs.NE"
 "stat.AP math.ST stat.TH"
 "stat.AP stat.ME"
 "stat.ME"
 "stat.ME econ.EM math.ST stat.TH"
 "stat.ME physics.soc-ph stat.AP"

In [None]:
foreach(c -> delete!(c, "categories"), samples)
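
The targets extracted above are space-separated category strings, and each distinct combination is treated as its own class (943 classes for 5000 samples). One possible alternative, assuming a multilabel formulation is wanted, is to split each string into individual categories and encode every paper as a multi-hot column. The sketch below uses made-up example strings in the same format and plain Julia only; `example_targets`, `vocab` and `multihot` are illustrative names, not variables from this notebook.

```julia
# Example category strings in the same space-separated format as `targets`.
example_targets = ["hep-ph", "math.CO cs.CG", "astro-ph gr-qc hep-ph"]

# Split each string into its individual categories.
category_sets = [split(t) for t in example_targets]

# Vocabulary of individual categories, sorted for a stable row order.
vocab = sort(unique(Iterators.flatten(category_sets)))

# Multi-hot matrix: one row per category, one column per paper.
multihot = [c in cats for c in vocab, cats in category_sets]

println(vocab)
println(size(multihot))
```

With targets of this shape, a per-category binary cross-entropy would usually replace the softmax cross-entropy used later in this notebook.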

In [None]:
samples[1]

Dict{String, Any} with 13 entries:
  "journal-ref"    => "Phys.Rev.D76:013009,2007"
  "doi"            => "10.1103/PhysRevD.76.013009"
  "id"             => "0704.0001"
  "comments"       => "37 pages, 15 figures; published version"
  "update_date"    => "2008-11-26"
  "report-no"      => "ANL-HEP-PR-07-12"
  "versions"       => Any[Dict{String, Any}("created"=>"Mon, 2 Apr 2007 19:18:42 GMT", "version"=>"…
  "authors_parsed" => Any[Any["Balázs", "C.", ""], Any["Berger", "E. L.", ""], Any["Nadolsky", "P. …
  "submitter"      => "Pavel Nadolsky"
  "title"          => "Calculation of prompt diphoton production cross sections at Tevatron and\n  …
  "abstract"       => "  A fully differential calculation in perturbative quantum chromodynamics is…
  "authors"        => "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan"
  "license"        => nothing

In [None]:
using JsonGrinder

sch = JsonGrinder.schema(samples)

[Dict]  # updated = 5000
  ├───── update_date: [Scalar - String], 853 unique values  # updated = 5000
  ├──────── versions: [List]  # updated = 5000
  │                     ╰── [Dict]  # updated = 7915
  │                           ├── version: [Scalar - String], 34 unique values  # updated = 7915
  │                           ╰── created: [Scalar - String], 7907 unique values  # updated = 7915
  ├───────── license: [Scalar - String], 5 unique values  # updated = 343
  ├─────── report-no: [Scalar - String], 430 unique values  # updated = 430
  ├── authors_parsed: [List]  # updated = 5000
  │                     ╰── [List]  # updated = 15599
  │

In [None]:
delete!(sch.childs,:id)

Dict{Symbol, Any} with 12 entries:
  :update_date          => Entry
  :versions             => ArrayEntry
  :license              => Entry
  Symbol("report-no")   => Entry
  :authors_parsed       => ArrayEntry
  :authors              => Entry
  :title                => Entry
  Symbol("journal-ref") => Entry
  :comments             => Entry
  :abstract             => Entry
  :doi                  => Entry
  :submitter            => Entry

In [None]:
extractor = suggestextractor(sch)

Dict
  ├───── update_date: String
  ├──────── versions: Array of
  │                     ╰── Dict
  │                           ├── version: Categorical d = 35
  │                           ╰── created: String
  ├───────── license: Categorical d = 6
  ├─────── report-no: String
  ├── authors_parsed: Array of
  │                     ╰── Array of
  │                           ╰── String
  ├───────── authors: String
  ├─────────── title: String
  ├───── journal-ref: String
  ┊
  ├───────────── doi: String
  ╰─────── submitter: String

In [None]:
extractor(samples[1])

ProductNode  # 1 obs, 408 bytes
  ├───── update_date: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 130 bytes
  ├──────── comments: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs,  ⋯
  ├──────── versions: BagNode  # 1 obs, 120 bytes
  │                     ╰── ProductNode  # 2 obs, 48 bytes
  │                           ├── version: ArrayNode(35×2 OneHotArray with Bool elements)  # 2 obs ⋯
  │                           ╰── created: ArrayNode(2053×2 NGramMatrix with Int64 elements)  # 2  ⋯
  ├──────── abstract: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 1.077 KiB
  ├───────── license: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, B

In [None]:
data = map(extractor, samples)

5000-element Vector{Mill.ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:version, :created), Tuple{Mill.ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, Mill.AlignedBags{Int64}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}, Mill.ArrayNode{Mill.MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, Mill.BagNode{Mill.BagNode{Mill.Ar

## Defining the model reflecting the structure of data

In [None]:
using Mill, Flux

model = reflectinmodel(data[1],
	input_dim -> Dense(input_dim, 64, relu),
	Mill.SegmentedMeanMax,
	fsm = Dict("" => input_dim -> Chain(Dense(input_dim, 64, relu), Dense(64, length(labels))))
	)

ProductModel ↦ Chain(Dense(768 => 64, relu), Dense(64 => 943))  # 4 arrays, 110_511 params, 431.84 ⋯
  ├───── update_date: ArrayModel(Dense(2053 => 64, relu))  # 2 arrays, 131_456 params, 513.578 KiB
  ├──────── comments: ArrayModel([postimputing]Dense(2053 => 64, relu))  # 3 arrays, 131_520 param ⋯
  ├──────── versions: BagModel ↦ [SegmentedMean(64); SegmentedMax(64)] ↦ Dense(128 => 64, relu)  # ⋯
  │                     ╰── ProductModel ↦ Dense(128 => 64, relu)  # 2 arrays, 8_256 params, 32.32 ⋯
  │                           ├── version: ArrayModel(Dense(35 => 64, relu))  # 2 arrays, 2_304 pa ⋯
  │                           ╰── created: ArrayModel(Dense(2053 => 64, relu))  # 2 arrays, 131_45 ⋯

### Training the model

In [None]:
# Set a random seed for reproducibility
Random.seed!(42)

TaskLocalRNG()

In [None]:
using Random
using MLUtils

# Shuffle the data and targets together
shuffled_data_targets = shuffleobs((data, targets))

# Split the shuffled data into train and test sets
train, remaining = splitobs(shuffled_data_targets, at = 0.8)

# Split the remaining data into validation and test sets
val, test = splitobs(remaining, at = 0.5) # Split the remaining 20% into validation and test sets

((ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}, Nothing}, AlignedBags{Int64}, Nothin
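
The cell above shuffles the observations and cuts them into 80% training, 10% validation and 10% test data. As a rough illustration of what that split amounts to, the sketch below does the same on plain indices using only the `Random` standard library. `split_indices` is a hypothetical helper written for this note; it is not how `shuffleobs` and `splitobs` are implemented, and it will not reproduce the same permutation as MLUtils.

```julia
using Random

# Hypothetical helper: shuffle the indices 1:n and cut them into
# train / validation / test parts with the given fractions.
function split_indices(n::Integer; train=0.8, val=0.1, rng=Random.default_rng())
    idx = shuffle(rng, 1:n)
    n_train = round(Int, train * n)
    n_val = round(Int, val * n)
    return idx[1:n_train], idx[n_train+1:n_train+n_val], idx[n_train+n_val+1:end]
end

train_idx, val_idx, test_idx = split_indices(5000; rng=MersenneTwister(42))
println(length.((train_idx, val_idx, test_idx)))
```

For 5000 samples this gives 4000 training, 500 validation and 500 test indices, with every index used exactly once.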

In [None]:
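using Zygote  # needed for Zygote.@ignore

# Loss for one minibatch of samples `x` and category strings `y`
function loss(x, y)
  # Merge the samples into one batch and one-hot encode the targets;
  # both steps are excluded from differentiation
  x = Zygote.@ignore reduce(catobs, getobs(x))
  y = Zygote.@ignore Flux.onehotbatch(getobs(y), labels)
  Flux.Losses.logitcrossentropy(model(x), y)
end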

loss (generic function with 1 method)
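
The loss above feeds raw model outputs (logits) and one-hot targets to `logitcrossentropy`. As a worked illustration of that quantity, the sketch below computes a mean cross-entropy from logits in plain Julia, using a numerically stable log-softmax. It is an independent re-derivation under the usual definition, not Flux's source code; `logsoftmax_cols` and `cross_entropy_from_logits` are hypothetical names.

```julia
using Statistics

# Numerically stable log-softmax over each column of a score matrix.
function logsoftmax_cols(scores::AbstractMatrix)
    shifted = scores .- maximum(scores; dims=1)
    return shifted .- log.(sum(exp.(shifted); dims=1))
end

# Mean over observations (columns) of the negative log-probability
# assigned to the true class.
cross_entropy_from_logits(scores, onehot) =
    mean(-sum(onehot .* logsoftmax_cols(scores); dims=1))

# Three classes, two observations. All scores are equal, so every class
# gets probability 1/3 and the loss equals log(3).
scores = zeros(3, 2)
onehot = Bool[1 0; 0 0; 0 1]
println(cross_entropy_from_logits(scores, onehot))
```

The same reasoning suggests that a model with uninformative outputs over `K` classes has a loss close to `log(K)`.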

In [None]:
data[1]

ProductNode  # 1 obs, 408 bytes
  ├───── update_date: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 130 bytes
  ├──────── comments: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs,  ⋯
  ├──────── versions: BagNode  # 1 obs, 120 bytes
  │                     ╰── ProductNode  # 2 obs, 48 bytes
  │                           ├── version: ArrayNode(35×2 OneHotArray with Bool elements)  # 2 obs ⋯
  │                           ╰── created: ArrayNode(2053×2 NGramMatrix with Int64 elements)  # 2  ⋯
  ├──────── abstract: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 1.077 KiB
  ├───────── license: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, B

In [None]:
using Statistics

test_accuracy = mean(labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))] .== getobs(test[2]))

# Print test accuracy
println("Test Accuracy: ", test_accuracy)

Test Accuracy: 0.0
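
In the order the cells appear here, this accuracy is measured before any training step, so a value near zero over 943 classes is not surprising. The accuracy expression itself picks the highest-scoring class per column and compares it with the true label. The sketch below spells that out on made-up scores with plain Julia; `class_names`, `scores` and `truth` are illustrative values, not data from this notebook. Softmax is monotone within a column, so taking the argmax of the raw scores gives the same prediction.

```julia
using Statistics

# Class names, one per row of the score matrix.
class_names = ["astro-ph", "hep-ph", "math.CO"]

# Made-up scores: rows are classes, columns are observations.
scores = [0.1  2.0  0.3;
          1.5  0.2  0.1;
          0.2  0.1  0.9]

# Index of the highest-scoring class in each column.
predicted_idx = [argmax(col) for col in eachcol(scores)]
predicted = class_names[predicted_idx]

# Fraction of observations whose predicted class matches the true one.
truth = ["hep-ph", "astro-ph", "hep-ph"]
accuracy = mean(predicted .== truth)
println(predicted, " accuracy = ", accuracy)
```

Here the first two predictions match and the third does not, so the accuracy is 2/3.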


In [None]:
labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))] .== getobs(test[2])

500-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [None]:
labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))]

1000-element Vector{String}:
 "math.DG gr-qc"
 "math.GR math.FA"
 "math.AT"
 "math.DG gr-qc"
 "math.DG gr-qc"
 "math.DG gr-qc"
 "quant-ph math-ph math.CO math.MP"
 "math.AT math.DG"
 "quant-ph cond-mat.other physics.atom-ph"
 "math.AT"
 "math.AT math.DG"
 "math.DG gr-qc"
 "nlin.CD nlin.SI"
 ⋮
 "math.DG gr-qc"
 "math.AT"
 "math.DG gr-qc"
 "math.AT math.DG"
 "math.DG gr-qc"
 "math.AT"
 "math.AT math.DG"
 "math.CT math.GN"
 "quant-ph cs.ET"
 "math.AT"
 "cs.DS cs.PF"
 "math.DG gr-qc"

In [None]:
getobs(test[2])

1000-element Vector{String}:
 "astro-ph"
 "physics.soc-ph"
 "cond-mat.supr-con"
 "physics.gen-ph"
 "hep-th"
 "astro-ph"
 "cs.DM cs.LO"
 "cond-mat.mes-hall"
 "hep-ex"
 "stat.ME econ.EM math.ST stat.TH"
 "physics.optics"
 "hep-ph astro-ph hep-th"
 "math.DS cs.DM math.NT"
 ⋮
 "math.CV math.AG"
 "nlin.SI math-ph math.MP"
 "math.OA math.GN"
 "astro-ph"
 "astro-ph"
 "cs.CL cs.HC"
 "hep-ph"
 "math.SG math.CO"
 "hep-ph"
 "hep-th math.DG"
 "physics.atom-ph"
 "hep-th gr-qc math-ph math.MP"

In [None]:
train[2]

4000-element view(::Vector{String}, [3042, 4121, 4041, 479, 4439, 4292, 3816, 214, 1114, 582  …  3616, 1931, 1716, 3463, 2584, 2753, 4265, 1158, 231, 4311]) with eltype String:
 "nucl-ex"
 "math.RA"
 "math.CO"
 "math.AG math.KT"
 "physics.soc-ph physics.data-an"
 "cond-mat.mes-hall"
 "physics.optics physics.bio-ph"
 "quant-ph hep-th"
 "astro-ph"
 "math.PR math-ph math.MP"
 "astro-ph"
 "astro-ph"
 "math.AP"
 ⋮
 "cond-mat.mes-hall"
 "astro-ph"
 "cond-mat.str-el quant-ph"
 "quant-ph"
 "math.DS"
 "quant-ph"
 "cond-mat.soft"
 "astro-ph"
 "cs.GT"
 "cs.CY cs.IR physics.soc-ph"
 "math.CV"
 "hep-ph"

In [None]:
minibatches = Flux.DataLoader((train[1], train[2]), batchsize=minibatchsize, shuffle=true)

400-element DataLoader(::Tuple{SubArray{ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}

In [None]:
ps = Flux.params(model)

# `length(ps)` counts parameter arrays (weight matrices and bias vectors),
# not individual scalar parameters
n_param_arrays = length(ps)

println("Number of parameter arrays in the model: ", n_param_arrays)

Number of parameter arrays in the model: 49
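
Note that `length(ps)` counts parameter arrays, not scalar parameters. The difference can be checked by hand on the top-level part of the model printed earlier, `Chain(Dense(768 => 64, relu), Dense(64 => 943))`, which is reported there as 4 arrays and 110_511 params. The sketch below rebuilds arrays of the same shapes in plain Julia (zeros stand in for the actual weights) and counts both quantities.

```julia
# Arrays with the shapes of the two Dense layers:
# Dense(768 => 64) has a 64×768 weight matrix and a 64-element bias,
# Dense(64 => 943) has a 943×64 weight matrix and a 943-element bias.
arrays = [zeros(Float32, 64, 768), zeros(Float32, 64),
          zeros(Float32, 943, 64), zeros(Float32, 943)]

println(length(arrays))        # number of arrays: 4
println(sum(length, arrays))   # number of scalar parameters: 110511
```

Applied to the notebook's `ps`, `sum(length, ps)` would give the scalar count for the whole model in the same way.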


In [None]:
opt = Flux.Optimise.Adam()

Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}())

In [None]:
x, y = first(minibatches)

(ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}, Nothing}, AlignedBags{Int64}, Nothing

In [None]:
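# Evaluate the loss on one minibatch, compute its gradient with respect to
# the model parameters, and apply a single optimizer step
loss(x, y)
gs = gradient(() -> loss(x, y), ps)
Flux.Optimise.update!(opt, ps, gs)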

In [None]:
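# One pass over all minibatches, reusing the `opt` defined above
# (`ADAM` is a deprecated alias of `Adam`)
Flux.train!(loss, Flux.params(model),
    minibatches,
    opt)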