<a href="https://colab.research.google.com/github/AirbornBird88/hmill-exper/blob/main/Arxiv_classification_with_Mill_Julia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

# Installing Julia in the Colab environment


In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.8.5" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.8.5 on the current Colab Runtime...
2024-06-07 07:41:58 URL:https://storage.googleapis.com/julialang2/bin/linux/x64/1.8/julia-1.8.5-linux-x86_64.tar.gz [130873886/130873886] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...
Installing Julia package CUDA...


# Checking the Installation
The `versioninfo()` function should print your Julia version and some other info about the system:

In [1]:
versioninfo()

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 2 on 2 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/lib64-nvidia
  JULIA_NUM_THREADS = 2


In [None]:
using BenchmarkTools

M = rand(2^11, 2^11)

@btime $M * $M;

  305.271 ms (2 allocations: 32.00 MiB)


In [None]:
try
    using CUDA
catch
    println("No GPU found.")
else
    run(`nvidia-smi`)
    # Create a new random matrix directly on the GPU:
    M_on_gpu = CUDA.CURAND.rand(2^11, 2^11)
    @btime $M_on_gpu * $M_on_gpu; nothing
end

Thu Jun  6 18:22:10 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P8               9W /  70W |      3MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Need Help?

* Learning: https://julialang.org/learning/
* Documentation: https://docs.julialang.org/
* Questions & Discussions:
  * https://discourse.julialang.org/
  * http://julialang.slack.com/
  * https://stackoverflow.com/questions/tagged/julia

If you ever ask for help or file an issue about Julia, you should generally provide the output of `versioninfo()`.

Add new code cells by clicking the `+ Code` button (or _Insert_ > _Code cell_).

Have fun!

<img src="https://raw.githubusercontent.com/JuliaLang/julia-logo-graphics/master/images/julia-logo-mask.png" height="100" />

In [2]:
languages = ["Julia", "Python", "R"]

for lang in languages
  println(lang)
end

Julia
Python
R


# Installing Julia packages

In [None]:
# Julia code/packages
using Pkg

Pkg.add("IJulia")
Pkg.add("Mill")
Pkg.add("DataFrames")
Pkg.add("Flux")
Pkg.add("PyCall")
Pkg.add("JSON")
Pkg.add("MLUtils")
Pkg.add("Zygote")
Pkg.add("Statistics")
Pkg.add("JsonGrinder")

# Trying to call Python code from Julia

In [None]:
using PyCall

In [None]:
# Use PyCall's py"..." string macro to run Python code that installs the Kaggle package
py"""
import subprocess
import sys

# Install Kaggle
subprocess.run([sys.executable, '-m', 'pip', 'install', 'kaggle'])
"""

# Connecting to the Kaggle API and downloading the data into Colab

This part is done in Python, so the runtime must be switched to Python.

In [None]:
!pip install kaggle



In [None]:
# Upload Kaggle API Key
from google.colab import files
files.upload()  # Upload the kaggle.json file

Saving kaggle_mill.json to kaggle_mill.json


{'kaggle_mill.json': b'{"username":"airbornbird88","key":"<redacted>"}'}

In [None]:
# Set up Kaggle configuration (the Kaggle CLI looks for ~/.kaggle/kaggle.json)
!mkdir -p ~/.kaggle
!mv kaggle_mill.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
# Download the arXiv dataset
!kaggle datasets download -d Cornell-University/arxiv

Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv
License(s): CC0-1.0
Downloading arxiv.zip to /content
 99% 1.28G/1.29G [00:18<00:00, 27.0MB/s]
100% 1.29G/1.29G [00:18<00:00, 75.0MB/s]


In [None]:
# Extract the dataset
!unzip arxiv.zip

Archive:  arxiv.zip
  inflating: arxiv-metadata-oai-snapshot.json  


In [None]:
ls /content

arxiv-metadata-oai-snapshot.json  arxiv.zip  sample_data/


In [None]:
ls -lh arxiv-metadata-oai-snapshot.json

-rw-r--r-- 1 root root 4.0G Jun  1 23:53 arxiv-metadata-oai-snapshot.json


# Data preprocessing

Preprocessing only requires parsing the data downloaded via the Kaggle API into a suitable form. Parsing should split the JSON data into individual observations (one JSON object holding the metadata of a single arXiv article).
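
The parsing step described above can be sketched on a small in-memory string that holds one JSON object per line. The two toy records and the names `raw` and `records` are illustrative assumptions, not part of the dataset.

```julia
using JSON

# Two toy records in "one JSON object per line" form (illustrative, not real data).
raw = """
{"id": "a", "categories": "hep-ph"}
{"id": "b", "categories": "math.CO cs.CG"}
"""

# Parse every non-empty line into its own dictionary.
records = [JSON.parse(line) for line in eachline(IOBuffer(raw)) if !isempty(strip(line))]

println(length(records))
println(records[2]["categories"])
```

Each element of `records` is then one observation, which is the shape the rest of the notebook works with.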

In [4]:
# List the contents of the /content directory
content_dir = "/content"
println("Contents of $content_dir:")
readdir(content_dir)

Contents of /content:


4-element Vector{String}:
 ".config"
 "arxiv-metadata-oai-snapshot.json"
 "arxiv.zip"
 "sample_data"

In [5]:
# Check that the required Julia packages are available in the active environment.
# Base.find_package returns the package path, or `nothing` if the package is not found.
for pkg in ("Flux", "Mill")
    if Base.find_package(pkg) !== nothing
        println("$pkg is installed.")
    else
        println("$pkg is not installed.")
    end
end

Flux is installed.
Mill is installed.


## Schema Visualization
In this example we show how a schema can be turned into an interactive HTML visualization, which helps when examining the schema, especially for large and heterogeneous data.

In [None]:
# using JsonGrinder, JSON
# import JsonGrinder: generate_html

# Parsing the JSON data into a suitable representation

We define a vector of dictionaries. Each dictionary holds key-value pairs and corresponds to one observation (here, one arXiv article, i.e., one JSON object with metadata).

Next, the labels (the predicted variable `categories`) must be separated from the rest of the data (the predictors).

In [6]:
using JSON

# File path
data_file = "/content/arxiv-metadata-oai-snapshot.json"

# Vector to store parsed JSON objects
samples = Vector{Dict}()

# Maximum number of samples to parse
max_samples = 5000

# Open the file
open(data_file) do file
    # Read and parse JSON objects line by line
    for line in eachline(file)
        # Parse each line as a JSON object and push it to the samples vector
        push!(samples, JSON.Parser.parse(line))

        # Check if we've reached the maximum number of samples
        if length(samples) >= max_samples
            break
        end
    end
end


In [39]:
# Print example of multiple JSON objects in samples
for i in 1:3  # Adjust the range as needed
    JSON.print(samples[i], 2)
end

{
  "journal-ref": "Phys.Rev.D76:013009,2007",
  "doi": "10.1103/PhysRevD.76.013009",
  "id": "0704.0001",
  "comments": "37 pages, 15 figures; published version",
  "update_date": "2008-11-26",
  "report-no": "ANL-HEP-PR-07-12",
  "versions": [
    {
      "created": "Mon, 2 Apr 2007 19:18:42 GMT",
      "version": "v1"
    },
    {
      "created": "Tue, 24 Jul 2007 20:10:27 GMT",
      "version": "v2"
    }
  ],
  "authors_parsed": [
    [
      "Balázs",
      "C.",
      ""
    ],
    [
      "Berger",
      "E. L.",
      ""
    ],
    [
      "Nadolsky",
      "P. M.",
      ""
    ],
    [
      "Yuan",
      "C. -P.",
      ""
    ]
  ],
  "submitter": "Pavel Nadolsky",
  "title": "Calculation of prompt diphoton production cross sections at Tevatron and\n  LHC energies",
  "abstract": "  A fully differential calculation in perturbative quantum chromodynamics is\npresented for the production of massive photon pairs at hadron colliders. All\nnext-to-leading order perturbative co

## Separating the labels from the predictors
We select the labels from the data and store them as a separate vector of labels.

```
# labels = sort(unique(targets))
```

Then we remove the labels from the predictors.
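
As a toy illustration of how the label vector is built, the sketch below applies `sort(unique(...))` to a few category strings. The names `targets_demo` and `labels_demo` are hypothetical. Note that a combined string such as `"math.CO cs.CG"` is treated as a single class of its own.

```julia
# A handful of category strings in the same form as the dataset's `categories` field.
targets_demo = ["hep-ph", "math.CO cs.CG", "math.CO", "hep-ph"]

# Distinct label strings in sorted order.
labels_demo = sort(unique(targets_demo))

println(labels_demo)
println(length(labels_demo))
```

This is why the number of labels can become large: every distinct combination of categories becomes a separate class.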

In [10]:
# Select the labels from the data
targets = map(c -> c["categories"], samples)

5000-element Vector{String}:
 "hep-ph"
 "math.CO cs.CG"
 "physics.gen-ph"
 "math.CO"
 "math.CA math.FA"
 "cond-mat.mes-hall"
 "gr-qc"
 "cond-mat.mtrl-sci"
 "astro-ph"
 "math.CO"
 "math.NT math.AG"
 "math.NT"
 "math.NT"
 ⋮
 "astro-ph"
 "astro-ph"
 "astro-ph"
 "cond-mat.dis-nn cond-mat.soft"
 "cond-mat.str-el"
 "gr-qc cond-mat.str-el hep-ph physics.hist-ph"
 "math.OA math.FA"
 "astro-ph"
 "hep-th"
 "quant-ph"
 "astro-ph gr-qc hep-ph hep-th"
 "cond-mat.supr-con"

In [11]:
# Build the vector of unique labels

labels = sort(unique(targets))

943-element Vector{String}:
 "astro-ph"
 "astro-ph gr-qc"
 "astro-ph gr-qc hep-ph"
 "astro-ph gr-qc hep-ph hep-th"
 "astro-ph gr-qc hep-ph nucl-th"
 "astro-ph gr-qc hep-th"
 "astro-ph gr-qc math.AP"
 "astro-ph gr-qc nucl-th"
 "astro-ph hep-ex"
 "astro-ph hep-ph"
 "astro-ph hep-ph hep-th"
 "astro-ph hep-ph nucl-th"
 "astro-ph hep-ph physics.atom-ph physics.space-ph"
 ⋮
 "quant-ph math-ph math.LO math.MP math.PR"
 "quant-ph math-ph math.MP"
 "quant-ph math.PR"
 "quant-ph nucl-th"
 "quant-ph physics.optics"
 "stat.AP"
 "stat.AP cs.NE"
 "stat.AP math.ST stat.TH"
 "stat.AP stat.ME"
 "stat.ME"
 "stat.ME econ.EM math.ST stat.TH"
 "stat.ME physics.soc-ph stat.AP"

In [12]:
# Remove categories (the labels) from the set of predictors

foreach(c -> delete!(c, "categories"), samples)

In [13]:
samples[1]

Dict{String, Any} with 13 entries:
  "journal-ref"    => "Phys.Rev.D76:013009,2007"
  "doi"            => "10.1103/PhysRevD.76.013009"
  "id"             => "0704.0001"
  "comments"       => "37 pages, 15 figures; published version"
  "update_date"    => "2008-11-26"
  "report-no"      => "ANL-HEP-PR-07-12"
  "versions"       => Any[Dict{String, Any}("created"=>"Mon, 2 Apr 2007 19:18:42 GMT", "version"=>"…
  "authors_parsed" => Any[Any["Balázs", "C.", ""], Any["Berger", "E. L.", ""], Any["Nadolsky", "P. …
  "submitter"      => "Pavel Nadolsky"
  "title"          => "Calculation of prompt diphoton production cross sections at Tevatron and\n  …
  "abstract"       => "  A fully differential calculation in perturbative quantum chromodynamics is…
  "authors"        => "C. Bal\\'azs, E. L. Berger, P. M. Nadolsky, C.-P. Yuan"
  "license"        => nothing

# Automatic extraction of the JSON schema from the data with JsonGrinder.jl

In [14]:
# Call the schema function to extract the JSON schema automatically

using JsonGrinder

sch = JsonGrinder.schema(samples)

[Dict]  # updated = 5000
  ├───── update_date: [Scalar - String], 853 unique values  # updated = 5000
  ├──────── versions: [List]  # updated = 5000
  │                     ╰── [Dict]  # updated = 7915
  │                           ├── version: [Scalar - String], 34 unique values  # updated = 7915
  │                           ╰── created: [Scalar - String], 7907 unique values  # updated = 7915
  ├───────── license: [Scalar - String], 5 unique values  # updated = 343
  ├─────── report-no: [Scalar - String], 430 unique values  # updated = 430
  ├── authors_parsed: [List]  # updated = 5000
  │                     ╰── [List]  # updated = 15599
  │

In [15]:
# Remove the id field (article identifiers) from the set of predictors

delete!(sch.childs,:id)

Dict{Symbol, Any} with 12 entries:
  :update_date          => Entry
  :versions             => ArrayEntry
  :license              => Entry
  Symbol("report-no")   => Entry
  :authors_parsed       => ArrayEntry
  :authors              => Entry
  :title                => Entry
  Symbol("journal-ref") => Entry
  :comments             => Entry
  :abstract             => Entry
  :doi                  => Entry
  :submitter            => Entry

With `suggestextractor` we obtain suggested data types for the individual nodes (string, categorical, Int, Float, etc.). If we are satisfied with them, we can apply the extractor to the original JSON data, which converts the data into the representation (vectors and matrices) expected by a Mill.jl model.

In [16]:
extractor = suggestextractor(sch)

Dict
  ├───── update_date: String
  ├──────── versions: Array of
  │                     ╰── Dict
  │                           ├── version: Categorical d = 35
  │                           ╰── created: String
  ├───────── license: Categorical d = 6
  ├─────── report-no: String
  ├── authors_parsed: Array of
  │                     ╰── Array of
  │                           ╰── String
  ├───────── authors: String
  ├─────────── title: String
  ├───── journal-ref: String
  ┊
  ├───────────── doi: String
  ╰─────── submitter: String
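
In the outputs above, a field with 5 observed values is suggested as `Categorical d = 6`, and one with 34 values as `d = 35`. One common way to end up with one extra dimension is to reserve a slot for values that were not seen when the schema was built; whether JsonGrinder lays out its encoding exactly this way is an assumption here. A plain-Julia sketch with the hypothetical helper `onehot_with_unknown`:

```julia
# Hypothetical helper: one-hot encode `value` over `vocabulary`,
# with one extra trailing slot for values outside the vocabulary.
function onehot_with_unknown(value, vocabulary)
    encoded = falses(length(vocabulary) + 1)
    position = findfirst(==(value), vocabulary)
    index = position === nothing ? length(encoded) : position
    encoded[index] = true
    return encoded
end

vocabulary = ["v1", "v2", "v3"]   # illustrative category values

println(onehot_with_unknown("v2", vocabulary))   # known value
println(onehot_with_unknown("v9", vocabulary))   # unseen value lands in the last slot
```

The encoded vector always has `length(vocabulary) + 1` entries, which mirrors the `d` values reported by the extractor.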

In [17]:
extractor(samples[1])

ProductNode  # 1 obs, 408 bytes
  ├───── update_date: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 130 bytes
  ├──────── comments: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs,  ⋯
  ├──────── versions: BagNode  # 1 obs, 120 bytes
  │                     ╰── ProductNode  # 2 obs, 48 bytes
  │                           ├── version: ArrayNode(35×2 OneHotArray with Bool elements)  # 2 obs ⋯
  │                           ╰── created: ArrayNode(2053×2 NGramMatrix with Int64 elements)  # 2  ⋯
  ├──────── abstract: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 1.077 KiB
  ├───────── license: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, B

In [18]:
data = map(extractor, samples)

5000-element Vector{Mill.ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, Mill.BagNode{Mill.ProductNode{NamedTuple{(:version, :created), Tuple{Mill.ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, Mill.AlignedBags{Int64}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{String, Vector{String}, Int64}, Nothing}, Mill.ArrayNode{Mill.MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, Mill.ArrayNode{Mill.NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, Mill.BagNode{Mill.BagNode{Mill.Ar

## Defining the model reflecting the structure of data

We define the model based on the representation obtained in the previous step.
We also set the size of the output layer: each output unit corresponds to one label (a distinct `categories` string).

In [19]:
using Mill, Flux

model = reflectinmodel(data[1],
	input_dim -> Dense(input_dim, 64, relu),
	Mill.SegmentedMeanMax,
	fsm = Dict("" => input_dim -> Chain(Dense(input_dim, 64, relu), Dense(64, length(labels))))
	)

ProductModel ↦ Chain(Dense(768 => 64, relu), Dense(64 => 943))  # 4 arrays, 110_511 params, 431.84 ⋯
  ├───── update_date: ArrayModel(Dense(2053 => 64, relu))  # 2 arrays, 131_456 params, 513.578 KiB
  ├──────── comments: ArrayModel([postimputing]Dense(2053 => 64, relu))  # 3 arrays, 131_520 param ⋯
  ├──────── versions: BagModel ↦ [SegmentedMean(64); SegmentedMax(64)] ↦ Dense(128 => 64, relu)  # ⋯
  │                     ╰── ProductModel ↦ Dense(128 => 64, relu)  # 2 arrays, 8_256 params, 32.32 ⋯
  │                           ├── version: ArrayModel(Dense(35 => 64, relu))  # 2 arrays, 2_304 pa ⋯
  │                           ╰── created: ArrayModel(Dense(2053 => 64, relu))  # 2 arrays, 131_45 ⋯

## Training the model

In [None]:
# Set a random seed for reproducibility
using Random

Random.seed!(42)

TaskLocalRNG()

## Splitting the data into training, validation, and test sets

In [20]:
using MLUtils

# Shuffle the data and targets together
shuffled_data_targets = shuffleobs((data, targets))

# Split the shuffled data into a training set (80%) and the remainder (20%)
train, remaining = splitobs(shuffled_data_targets, at = 0.8)

# Split the remaining data into validation and test sets
val, test = splitobs(remaining, at = 0.5) # Split the remaining 20% into validation and test sets

((ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}, Nothing}, AlignedBags{Int64}, Nothin
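
To make the resulting set sizes concrete, here is a sketch of the same 80/10/10 split done with plain index arithmetic on 5000 observations. It is not how `shuffleobs` or `splitobs` are implemented; the names `idx`, `train_idx`, `val_idx`, and `test_idx` are illustrative.

```julia
using Random

rng = MersenneTwister(42)
idx = shuffle(rng, 1:5000)          # shuffled observation indices

n_train = round(Int, 0.8 * length(idx))               # 80% for training
n_val   = round(Int, 0.5 * (length(idx) - n_train))   # half of the rest for validation

train_idx = idx[1:n_train]
val_idx   = idx[n_train+1:n_train+n_val]
test_idx  = idx[n_train+n_val+1:end]

println((length(train_idx), length(val_idx), length(test_idx)))
```

With 5000 samples this yields 4000 training, 500 validation, and 500 test observations, and every index ends up in exactly one set.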

## Defining the loss function
We are comparing two probability distributions (the predicted and the true distribution over labels), so cross-entropy is a suitable loss function.
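
For intuition, the sketch below computes cross-entropy from raw model outputs (logits) for a single observation in plain Julia. The helper names `logsoftmax_sketch` and `crossentropy_sketch` are hypothetical; this is the textbook formula written out by hand, not Flux's implementation.

```julia
# Numerically stable log-softmax of a vector of logits.
function logsoftmax_sketch(logits)
    shifted = logits .- maximum(logits)
    return shifted .- log(sum(exp.(shifted)))
end

# Cross-entropy between a one-hot target and the softmax of the logits.
crossentropy_sketch(logits, onehot) = -sum(onehot .* logsoftmax_sketch(logits))

logits = [2.0, 0.5, -1.0]
target = [1.0, 0.0, 0.0]   # the true class is the first one

println(crossentropy_sketch(logits, target))
# Uniform logits over n classes give a loss of log(n).
println(crossentropy_sketch(zeros(3), target) ≈ log(3))
```

The loss is small when the logit of the true class dominates and grows as probability mass moves to other classes.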

In [22]:
# Definition of loss function
using Zygote

function loss(x,y)
  x = Zygote.@ignore reduce(catobs, getobs(x))
  y = Zygote.@ignore Flux.onehotbatch(getobs(y), labels)
  Flux.Losses.logitcrossentropy(model(x), y)
end

loss (generic function with 1 method)

In [23]:
# Let's look at the Mill representation of the first sample/observation
data[1]

ProductNode  # 1 obs, 408 bytes
  ├───── update_date: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 130 bytes
  ├──────── comments: ArrayNode(2053×1 NGramMatrix with Union{Missing, Int64} elements)  # 1 obs,  ⋯
  ├──────── versions: BagNode  # 1 obs, 120 bytes
  │                     ╰── ProductNode  # 2 obs, 48 bytes
  │                           ├── version: ArrayNode(35×2 OneHotArray with Bool elements)  # 2 obs ⋯
  │                           ╰── created: ArrayNode(2053×2 NGramMatrix with Int64 elements)  # 2  ⋯
  ├──────── abstract: ArrayNode(2053×1 NGramMatrix with Int64 elements)  # 1 obs, 1.077 KiB
  ├───────── license: ArrayNode(6×1 MaybeHotMatrix with Union{Missing, B

We check the accuracy of the untrained model on the test data.

The accuracy should be roughly 1/n, where n is the number of labels. For example, with only 2 labels (e.g., YES/NO) the accuracy would be 1/2 = 0.5 (50%).
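
A short sketch of the 1/n rule of thumb for the 943 labels in this notebook, plus how accuracy is computed as the mean of an element-wise comparison. The vectors `predicted` and `actual` are made-up toy values.

```julia
using Statistics

n_labels = 943                    # number of distinct label strings in this notebook
chance_accuracy = 1 / n_labels    # expected accuracy of uniform random guessing
println(chance_accuracy)

# Accuracy is the mean of an element-wise comparison of predictions and targets.
predicted = ["hep-ph", "astro-ph", "gr-qc", "hep-th"]
actual    = ["hep-ph", "math.CO", "gr-qc", "quant-ph"]
println(mean(predicted .== actual))
```

With a chance level of about 0.1%, an untrained model scoring zero correct on a few hundred test samples is not surprising.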

In [24]:
# Let's look at the true labels of each sample in the test data
getobs(test[2])

500-element Vector{String}:
 "cond-mat.mes-hall"
 "math.AG math.NT"
 "gr-qc"
 "math.RT math.AG math.CO"
 "cond-mat.mtrl-sci cond-mat.other"
 "astro-ph"
 "physics.soc-ph physics.comp-ph"
 "hep-ph"
 "q-bio.CB"
 "hep-th"
 "cs.IT math.IT"
 "math.DS math-ph math.MP"
 "nucl-th"
 ⋮
 "math.CV math.AG"
 "nlin.SI math-ph math.MP"
 "math.OA math.GN"
 "astro-ph"
 "astro-ph"
 "cs.CL cs.HC"
 "hep-ph"
 "math.SG math.CO"
 "hep-ph"
 "hep-th math.DG"
 "physics.atom-ph"
 "hep-th gr-qc math-ph math.MP"

In [25]:
# Let's look at the labels predicted for the test data
labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))]

500-element Vector{String}:
 "math.AG math.AC math.RT"
 "cs.RO"
 "cs.RO"
 "cs.RO"
 "hep-ph nucl-ex"
 "cs.RO"
 "physics.hist-ph"
 "hep-ph nucl-ex"
 "physics.class-ph physics.ins-det"
 "q-fin.GN physics.data-an physics.soc-ph"
 "hep-ph nucl-ex"
 "cs.HC cs.AI"
 "cs.RO"
 ⋮
 "cond-mat.dis-nn cs.AR"
 "cond-mat.stat-mech cond-mat.dis-nn physics.soc-ph"
 "cond-mat.stat-mech cond-mat.dis-nn physics.soc-ph"
 "cond-mat.dis-nn cs.AR"
 "math.AG math.DG"
 "math-ph math.AP math.MP"
 "nucl-th astro-ph hep-ph nucl-ex"
 "cs.RO"
 "cond-mat.mtrl-sci cond-mat.other"
 "math.QA math.GT"
 "cond-mat.other cond-mat.mtrl-sci"
 "math.PR math.OC"

In [26]:
# Let's see where the predictions match the true labels (1 = match)
labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))] .== getobs(test[2])

500-element BitVector:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0

In [27]:
using Statistics

test_accuracy = mean(labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))] .== getobs(test[2]))

# Print test accuracy
println("Test Accuracy: ", test_accuracy)

Test Accuracy: 0.0


We store the model parameters and display how many parameter arrays (weight matrices and bias vectors) the model has.

In [28]:
ps = Flux.params(model)

# length(ps) counts parameter arrays, not individual scalar parameters
params_count = length(ps)

println("Number of parameter arrays in the model: ", params_count)

Number of parameter arrays in the model: 49


We split the data into minibatches for faster training and optimization of the model (stochastic gradient descent via the Flux.jl package).
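
As a rough sketch of what minibatching means for this dataset, the snippet below partitions 4000 training indices into consecutive groups of 10. It ignores shuffling and is not how `DataLoader` works internally; `n_train`, `batch_size`, and `batches` are illustrative names.

```julia
n_train = 4000          # 80% of the 5000 parsed samples
batch_size = 10

# Consecutive index ranges, one per minibatch.
batches = collect(Iterators.partition(1:n_train, batch_size))

println(length(batches))    # number of minibatches per epoch
println(first(batches))     # indices in the first minibatch
```

One pass over all of these minibatches corresponds to one training epoch.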

In [29]:
# We define the minibatch size.
minibatchsize = 10

10

In [30]:
minibatches = Flux.DataLoader((train[1], train[2]), batchsize=minibatchsize, shuffle=true)

400-element DataLoader(::Tuple{SubArray{ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}

We choose the ADAM (Adaptive Moment Estimation) optimizer, which will optimize the model parameters based on the data.
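
To show what such an optimizer does with a gradient, here is a textbook Adam update for a single scalar parameter, with default hyperparameters 0.001, (0.9, 0.999), and 1e-8. The functions `adam_step` and `run_adam` are hypothetical helpers; Flux's own implementation may differ in detail.

```julia
# One textbook Adam step for a scalar parameter; returns the updated state.
function adam_step(θ, g, m, v, t; η=0.001, β1=0.9, β2=0.999, ϵ=1e-8)
    m = β1 * m + (1 - β1) * g          # running mean of gradients
    v = β2 * v + (1 - β2) * g^2        # running mean of squared gradients
    m_hat = m / (1 - β1^t)             # bias-corrected first moment
    v_hat = v / (1 - β2^t)             # bias-corrected second moment
    θ = θ - η * m_hat / (sqrt(v_hat) + ϵ)
    return θ, m, v
end

# Minimize f(θ) = θ^2, whose gradient is 2θ.
function run_adam(θ0, steps)
    θ, m, v = θ0, 0.0, 0.0
    for t in 1:steps
        θ, m, v = adam_step(θ, 2θ, m, v, t)
    end
    return θ
end

println(run_adam(1.0, 100))
```

Because the step is normalized by the running gradient magnitude, the first update moves the parameter by roughly the learning rate regardless of the gradient's scale.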

In [31]:
# We want ADAM optimizer
opt = Flux.Optimise.Adam()

Adam(0.001, (0.9, 0.999), 1.0e-8, IdDict{Any, Any}())

In [32]:
x, y = first(minibatches)

(ProductNode{NamedTuple{(:update_date, :comments, :versions, :abstract, :license, Symbol("report-no"), :authors_parsed, :doi, :authors, :submitter, :title, Symbol("journal-ref")), Tuple{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{ProductNode{NamedTuple{(:version, :created), Tuple{ArrayNode{OneHotArrays.OneHotMatrix{UInt32, Vector{UInt32}}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}}}, Nothing}, AlignedBags{Int64}, Nothing}, ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, ArrayNode{MaybeHotMatrix{Union{Missing, UInt32}, Int64, Union{Missing, Bool}}, Nothing}, ArrayNode{NGramMatrix{Union{Missing, String}, Vector{Union{Missing, String}}, Union{Missing, Int64}}, Nothing}, BagNode{BagNode{ArrayNode{NGramMatrix{String, Vector{String}, Int64}, Nothing}, AlignedBags{Int64}, Nothing}, AlignedBags{Int64}, Nothing

In [36]:
loss(x, y)
gs = gradient(() -> loss(x, y), ps)
Flux.Optimise.update!(opt, ps, gs)

Finally, we train the model on the data.

In [37]:
# Let's train and optimize our model
Flux.train!(loss, Flux.params(model),
    minibatches,
    Flux.Optimise.Adam())

We again verify the accuracy of the model, this time after training, on the test data.

In [38]:
test_accuracy = mean(labels[Flux.onecold(softmax(model(reduce(catobs, getobs(test[1])))))] .== getobs(test[2]))

# Print test accuracy
println("Test Accuracy: ", test_accuracy)

Test Accuracy: 0.374
