# FastAI with PyTorch
For our final hour, we look at training our own models on Sinopia's RDF for a couple of classification tasks. [FastAI][FASTAI] is a non-profit that provides artificial intelligence training and, for our use today, an easy-to-use Python software library. The library is built on [PyTorch](https://pytorch.org/), an open-source machine learning framework from Facebook that is widely used in industry and academic research.

[FASTAI]: https://www.fast.ai/

In [None]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
%reload_ext lab_black

from fastai.tabular.all import *

import pandas as pd

import kglab
import rdflib
import helpers
import widgets

## Loading Sinopia Stage RDF Text DataFrame

In [None]:
stage_kg = kglab.KnowledgeGraph()
stage_kg.load_jsonld("data/stage.json")

## Classifying RDF Resources by Their Template
In Sinopia, each resource uses at least one resource template to construct Sinopia's user interface. Currently, when a user imports RDF into the editor, either through the Questioning Authority search or through the **Load RDF** tab, the user is prompted to select the template to use, and Sinopia does its best to match the template's properties to the incoming RDF.

This classification task extends the initial work done in last year's LD4 presentation, 
[A Machine Learning Approach for Classifying Sinopia's RDF](https://ld4p.github.io/classify-rdf-2020/).

### Step One - Generate Pandas Dataframe
We will run a SPARQL query on our stage knowledge graph and iterate through the results, using `helpers.predicate_row` to generate a list of dictionaries.

In [None]:
data = []
for row in stage_kg.query(
    """
SELECT ?template ?url 
WHERE {
   ?url <http://sinopia.io/vocabulary/hasResourceTemplate> ?template .
   FILTER isIRI(?url)
} """
):
    # Skip if RDF resource is a Sinopia resource template
    if str(row[0]).startswith("sinopia:template:resource"):
        continue
    data.append(helpers.predicate_row(row[1], stage_kg.rdf_graph()))
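
The `helpers.predicate_row` function is not shown in this notebook. As a rough sketch of what such a helper might do, assuming it counts how often each predicate occurs for a given subject, the idea can be expressed in plain Python with triples as tuples. The name `predicate_row_sketch`, the sample triples, and the `uri` and `template` keys are illustrative assumptions, not the helper's actual implementation.

```python
from collections import Counter

TEMPLATE_PREDICATE = "http://sinopia.io/vocabulary/hasResourceTemplate"
LABEL_PREDICATE = "http://www.w3.org/2000/01/rdf-schema#label"


def predicate_row_sketch(uri, triples):
    """Count predicate usage for one subject (hypothetical stand-in)."""
    row = {"uri": uri}
    counts = Counter()
    for subject, predicate, obj in triples:
        if subject != uri:
            continue
        counts[predicate] += 1
        if predicate == TEMPLATE_PREDICATE:
            row["template"] = obj  # label used later for classification
    row.update(counts)
    return row


# Made-up triples for illustration only
triples = [
    ("https://example.org/r/1", TEMPLATE_PREDICATE, "example:template:Work"),
    ("https://example.org/r/1", LABEL_PREDICATE, "A"),
    ("https://example.org/r/1", LABEL_PREDICATE, "B"),
    ("https://example.org/r/2", TEMPLATE_PREDICATE, "example:template:Instance"),
]

example_row = predicate_row_sketch("https://example.org/r/1", triples)
print(example_row)
```

Each resource produces one dictionary, and resources that use different predicates produce dictionaries with different keys.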

From the list of dictionaries that hold the predicate frequencies, create a Pandas DataFrame and then replace missing values with zeros.

In [None]:
stage_df = pd.DataFrame(data)
stage_df = stage_df.fillna(0.0)

Shape and information about the `stage_df` DataFrame

In [None]:
print(stage_df.shape)
stage_df.info()  # info() prints its own summary and returns None

Generate a random sample of 10 rows to see examples of individual records in the DataFrame

In [None]:
stage_df.sample(10)

### Step Two - Split, Preprocess, and Load Data
First we will create a copy of the `stage_df` DataFrame without the `uri` column, as we don't want to train our model on this identifier. If needed later, we can look up the `uri` in the original `stage_df` DataFrame. We will also remove the `sinopia:hasResourceTemplate` column because it doesn't add any information. We will then drop rows whose template appears only once (as this will impact later training).

Next we split our data into training and validation sets, with the validation set containing 20% of the data.

In [None]:
stage_df_copy = stage_df.drop(
    columns=["uri", "http://sinopia.io/vocabulary/hasResourceTemplate"]
)
stage_df_clean = stage_df_copy[
    stage_df_copy.duplicated(subset=["template"], keep=False)
]
splits = helpers.create_splits(stage_df_clean)
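
The `helpers.create_splits` function is not shown here. The sketch below illustrates the two ideas from this step in plain Python, under the assumption that the helper performs a random 80/20 split of row positions. The function names `drop_singleton_labels` and `random_splits`, the seed, and the sample labels are all hypothetical.

```python
import random
from collections import Counter


def drop_singleton_labels(labels):
    """Return positions of labels that occur more than once."""
    counts = Counter(labels)
    return [i for i, label in enumerate(labels) if counts[label] > 1]


def random_splits(n_rows, valid_pct=0.2, seed=42):
    """Shuffle row positions and split them into (train, valid) lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    n_valid = int(n_rows * valid_pct)
    return positions[n_valid:], positions[:n_valid]


# Made-up template labels; "Item" occurs only once
labels = ["Work", "Work", "Instance", "Item", "Instance", "Work",
          "Work", "Instance", "Work", "Instance", "Work"]

kept = drop_singleton_labels(labels)  # the single "Item" row is dropped
train_idx, valid_idx = random_splits(len(kept))
print(len(kept), len(train_idx), len(valid_idx))
```

A template that occurs only once can land in just one of the two sets, so the model would either never train on it or never be validated against it. That is one reason to filter such rows before splitting.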

Using FastAI's `TabularPandas` class, we will pass in some parameters to preprocess `stage_df_clean`.

In [None]:
# Every column except the template label is treated as a continuous variable
continous = [col for col in stage_df_clean.columns if col != "template"]

In [None]:
stage_to = TabularPandas(
    stage_df_clean,
    procs=[Categorify],
    cont_names=continous,
    y_names="template",
    y_block=CategoryBlock,
    splits=splits,
)

In [None]:
stage_to.xs.iloc[:2]

Finally, we will create FastAI `DataLoaders` that we can pass to the `learner` object for training our model.

In [None]:
stage_data_loader = stage_to.dataloaders(bs=64)

The `TabularDataLoader` batches our data in groups of 64 (set by the `bs` parameter above), and we can see an example of a batch.

In [None]:
stage_data_loader.show_batch()
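
To make the batching concrete, here is a minimal sketch of slicing rows into groups of 64. It only illustrates the idea and is not how FastAI implements its loaders. A real training loader typically also shuffles the rows. The `batches` function and the row count of 200 are assumptions for illustration.

```python
def batches(items, batch_size=64):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


rows = list(range(200))  # stand-in for 200 training rows
batch_sizes = [len(batch) for batch in batches(rows)]
print(batch_sizes)  # [64, 64, 64, 8]
```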

### Step Three - Create Learner and Train Model

In [None]:
stage_learner = tabular_learner(stage_data_loader, metrics=accuracy)

With the `stage_learner` object, we can graphically estimate the learning rate that will be used in training our new model. First, we can examine the model by printing it out:

In [None]:
stage_learner.model

In our `TabularModel` neural net we have three layers. The first input layer does the following:
  1.  Applies a PyTorch [Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) transformation to the incoming data
  1.  Applies the PyTorch [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html) activation
  1.  Applies a [batch normalization](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html) on our batch of 64
  
The second, hidden layer does similar processing to the first, and the third layer produces the final results.
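
The three operations in the first layer can be sketched in plain Python to show what each one computes. This is a simplified illustration, not the PyTorch implementation: the batch normalization here omits the learnable scale and shift, and the weights, bias, and four-row batch are made-up values.

```python
import math


def linear(batch, weights, bias):
    """Compute x @ W^T + b for each row x in the batch."""
    return [
        [sum(w * x for w, x in zip(w_row, row)) + b
         for w_row, b in zip(weights, bias)]
        for row in batch
    ]


def relu(batch):
    """Replace negative values with zero."""
    return [[max(0.0, value) for value in row] for row in batch]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) over the batch to zero mean."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / math.sqrt(var + eps)
         for v, m, var in zip(row, means, variances)]
        for row in batch
    ]


# Made-up batch of 4 rows with 3 input features each
batch = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 3 inputs -> 2 outputs
bias = [0.1, -0.1]

hidden = batch_norm(relu(linear(batch, weights, bias)))
column_means = [sum(col) / len(hidden) for col in zip(*hidden)]
print(hidden)
```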

In [None]:
# In fastai v2, lr_find() plots loss against learning rate by default
stage_learner.lr_find()
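
A learning rate finder runs a short trial training pass while raising the learning rate exponentially and recording the loss at each step. The sketch below shows only the schedule of rates such a test would try. The function `lr_range` and its default values are assumptions chosen for illustration, not a copy of FastAI's internals.

```python
def lr_range(start_lr=1e-7, end_lr=10.0, num_it=100):
    """Learning rates spaced evenly on a log scale from start_lr to end_lr."""
    ratio = end_lr / start_lr
    return [start_lr * ratio ** (i / (num_it - 1)) for i in range(num_it)]


lrs = lr_range()
print(lrs[0], lrs[49], lrs[-1])
```

A common way to read the resulting plot is to pick a rate where the loss is still falling steeply, somewhat before the point where it starts to climb.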

In [None]:
stage_learner.fit_one_cycle(5)
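
`fit_one_cycle` trains with a one-cycle policy, in which the learning rate warms up to a maximum and then anneals to a much smaller value. Here is a minimal sketch of such a schedule using cosine annealing. The function names and the default values for `lr_max`, `div`, `div_final`, and `pct_start` are assumptions for illustration and may differ from what FastAI uses.

```python
import math


def cosine_anneal(start, end, pct):
    """Move from start to end along a half cosine as pct goes from 0 to 1."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))


def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pct (0 to 1) under one-cycle."""
    if pct < pct_start:
        # Warm-up phase: rise from lr_max / div to lr_max
        return cosine_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: fall from lr_max to lr_max / div_final
    return cosine_anneal(
        lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start)
    )


schedule = [one_cycle_lr(step / 100) for step in range(101)]
print(schedule[0], max(schedule), schedule[-1])
```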

In [None]:
stage_learner.show_results()

In [None]:
row, class_, probs = stage_learner.predict(stage_df_copy.iloc[23])
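
Judging by the variable names, `predict` returns the decoded row, the predicted class, and the per-class probabilities. The sketch below shows how raw model scores are commonly turned into probabilities with a softmax and then into a class label. The template names in `vocab` and the scores are made up for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical template labels and model outputs for one row
vocab = [
    "example:template:Instance",
    "example:template:Item",
    "example:template:Work",
]
scores = [0.4, -1.2, 2.3]

probabilities = softmax(scores)
best = max(range(len(vocab)), key=probabilities.__getitem__)
print(vocab[best], round(probabilities[best], 3))
```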

## Exercise 1
So far we have been using all of the RDF in Sinopia's stage environment. Repeat the steps above for Sinopia's production environment.