# Introduction to ML Workflow and Data Retrieval in Julia

## Setting up the project (environment) in Julia

Before proceeding to topics related to machine learning and data analysis, let's begin with a general software engineering idea - **environments**. It's a best practice to prepare an isolated project for every programming initiative, whether it is building a web application, running a complex scientific simulation or training an ML model.

The idea of a structured project is prevalent across programming languages, although the actual name may differ, e.g. [virtual environment](https://docs.python.org/3/library/venv.html) in Python or [build environment](https://docs.gradle.org/current/userguide/build_environment.html) in Gradle-based projects. The scope of information carried by a project differs between programming languages as well, but in general projects record:
* dependencies (libraries, packages) required by the project
* version of the project (see [semantic versioning](https://semver.org/))
* indication of development stage (development, test, production, etc.)
* author-related information (name, contact information, company affiliation)
* miscellaneous configurations (compiler flags, version control details, CI/CD parameters, IDE settings)

The main goals of a project are to:
* provide reproducibility and standardization (if it works on my machine, it should work on yours)
* enable collaboration (shared projects within teams)
* supply additional information about the piece of software (date of creation, sponsoring company name, author's email)

**Environments in Julia**

In Julia, a project is defined by two files: `Project.toml`, which lists the direct dependencies and project metadata, and `Manifest.toml`, which records the exact versions of all resolved dependencies. More information regarding both files can be found in the [Pkg.jl documentation](https://pkgdocs.julialang.org/v1/toml-files/).

For example, the extract below from the DataFrames.jl `Project.toml` contains the name of the package, its unique identifier, the current version and one dependency on the `DataAPI` package with an additional version restriction:

```toml
name = "DataFrames"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.4.1"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
(...)

[compat]
DataAPI = "1.12.0"
(...)
```
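
As a small illustration of how these sections are organized, the extract can be parsed with Julia's `TOML` standard library. This is only a sketch: the elided `(...)` lines are left out because they are not valid TOML, and the loop simply prints what the `[deps]` and `[compat]` tables contain.

```julia
using TOML

# The extract shown above, without the elided "(...)" lines
project = TOML.parse("""
name = "DataFrames"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.4.1"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"

[compat]
DataAPI = "1.12.0"
""")

println(project["name"], " v", project["version"])

# Each dependency maps a package name to its UUID;
# the [compat] table optionally restricts its version.
for (pkg, uuid) in project["deps"]
    bound = get(project["compat"], pkg, "no compat entry")
    println("  depends on ", pkg, " (", uuid, "), compat: ", bound)
end
```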

**Starting a new project**

You can start your own project in the current working directory by running the following code in the Julia REPL:
```julia
using Pkg
Pkg.activate(".")
```
or press `]` to enter the Pkg REPL mode, then run:
```julia
activate .
```
More information on setting up a project is available in the [Pkg.jl documentation](https://pkgdocs.julialang.org/v1/environments/).
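
To check which environment is currently active, a short sketch like the one below may help. It only inspects the active project and does not modify it; the output depends on your machine, so none is shown here.

```julia
using Pkg

# Path of the Project.toml that is currently active
println(Base.active_project())

# Direct dependencies recorded in that project
Pkg.status()
```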

**Adding packages**

While in Pkg REPL mode, add all required packages (dependencies) to the project using the `add` command:
```julia
add DataFrames, JSON, Plots
```
Alternatively, run in the Julia REPL:
```julia
Pkg.add(["DataFrames", "JSON", "Plots"])
```

**Recreating the project**

After finishing the work you can share the notebook or Julia scripts together with the corresponding TOML files. Users can then replicate the environment by running the following commands in the folder where `Project.toml` is present:
```julia
$ julia --project
julia> using Pkg
julia> Pkg.instantiate()
```
Note that if you are using the Jupyter Notebook, the project is activated automatically if the `Project.toml` is present in the notebook's directory - see [IJulia documentation](https://julialang.github.io/IJulia.jl/stable/manual/usage/#Julia-projects) for details.

The `instantiate()` function triggers the download and precompilation of the packages. After the process is finished, the user will be able to run the code without errors related to compatibility or missing libraries.

Having said that, you should have already used the `instantiate()` function while [setting up the course](https://github.com/KrainskiL/JuliaDataScienceTutorial#readme) to make your experience in upcoming exercises smooth and pleasant.

## Machine Learning workflow

![](Class1_ML_Workflow.png)
<div style="text-align: right">Source: Burkov Andriy, ML Engineering, 2020, CC BY-SA 4.0</div> 

The lifecycle of a machine learning project is a complex process involving multiple areas of expertise and sets of skills. To succeed with an enterprise-level ML project, we'll need:
* project managers, product owners and business analysts with a good understanding of the business problem and the ability to define the goal and manage the execution of the initiative
* data engineers, data analysts and data scientists with in-depth knowledge of the data, technical skills and statistical (modelling) expertise
* DevOps engineers, software engineers and application developers taking care of model deployment in a secure, robust and performant manner, often embedded in a larger application

The increasing adoption of ML models in business and advancements in deployment have created a new discipline of **Machine Learning engineering** and the corresponding position of **Machine Learning engineer**. Activities related to post-evaluation steps are also often referred to as **MLOps (Machine Learning Operations)**, by analogy with the term operations used for ongoing maintenance and monitoring of software applications.

___
We'll cover several elements of the process in the notebooks throughout the course, focusing mainly on the steps from Data collection up to Model evaluation. The final notebooks will cover the basics of deployment and serving.

Let's start with the first step in the ML journey - obtaining and loading data.

## Data Gathering and Retrieval

Building a machine learning model is a data-heavy exercise, and the success of the whole project is often determined by the quality and quantity of data. As for the former, the famous saying 'garbage in, garbage out' says it all - data may be cleaned and preprocessed, but there is hardly anything to be done if the inherent information carried by the dataset is weak (e.g. unrelated features) or scarce (e.g. many missing values).

There is an abundance of data sources and formats suitable for use in the ML process. Data can be categorized based on the presence (or lack) of structure into:
* **structured data** (fixed schema, often identified with tabular datasets)
* **semi-structured data** (varying schema between observations, usually stored in JSON or XML format)
* **unstructured data** (domain-specific data without exposed features, e.g. text, music, images)

The category a dataset belongs to determines which ML tasks are available, e.g. regression and classification are usually considered for structured data, object detection is a task specific to computer vision, while sentiment analysis is applied to text data containing natural language.

___
Another factor to consider when thinking about the data is its source. The majority of small-scale or Proof-of-Concept projects are based on **flat files** available in online repositories - either ones specialized for modelling purposes (e.g. [UCI](https://archive.ics.uci.edu/ml/index.php)) or more general ones like GitHub. Government agencies and international organizations also tend to share periodic reports online as flat files (e.g. [OECD](https://data.oecd.org/)).

For medium and large ML initiatives a more suitable source of data would be a **database**, **data warehouse** or **data lake**. Relational databases usually deliver structured data, however modern solutions support semi-structured and unstructured data as well (see [JSON support in PostgreSQL](https://www.postgresql.org/docs/current/datatype-json.html) for example). For big volumes of historical structured data a data warehouse is a common solution, while a data lake supports all three structure categories.

A data source commonly associated with semi-structured data is a **REST API**. It's a service that can be queried through exposed URL endpoints and typically responds with data in JSON format. Received records are then parsed and turned into structured data or used directly in ML models in a semi-structured form.

Public cloud platforms can also serve as a modern data source for modelling. **Object storage services** such as AWS S3 or Azure Blob Storage provide a convenient place to store and share large quantities of unstructured data.

After this short overview of data sources and categories of data structure, let's see how we can load various datasets into Julia for further processing. We start by loading the required packages.
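
To illustrate what turning semi-structured records into structured data can mean, here is a minimal sketch using only base Julia. The records and their field names (`name`, `stars`, `owner`, `login`) are made up for the example; fields absent from a record are filled with `missing` so that every row follows the same schema.

```julia
# Hypothetical records, as they might look after parsing a JSON response
records = [
    Dict("name" => "A", "stars" => 10, "owner" => Dict("login" => "alice")),
    Dict("name" => "B", "owner" => Dict("login" => "bob")),  # no "stars" field
    Dict("name" => "C", "stars" => 3),                       # no "owner" field
]

# Map every record onto one fixed schema (name, stars, owner)
rows = map(records) do r
    owner = get(r, "owner", nothing)
    (name  = r["name"],
     stars = get(r, "stars", missing),
     owner = owner === nothing ? missing : owner["login"])
end

foreach(println, rows)
```

A vector of named tuples with identical field names like `rows` is a row table in the sense of Tables.jl, so it could be passed to `DataFrame(rows)` once DataFrames.jl is loaded.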

In [None]:
using CSV
using DataFrames
using Downloads
using DuckDB
using ImageShow
using JSON
using MLDatasets
using Plots

## Flat files

We download the famous [Iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) from the UCI repository together with the file containing the dataset description. `iris.data` is saved to the `iris_data.csv` file and `iris.names` to `iris_names.txt`.

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Downloads.download(url, "iris_data.csv")
Downloads.download(replace(url,".data" => ".names"), "iris_names.txt");

We can confirm the flat files are present in the working directory with `readdir()`.

In [None]:
readdir()

Now we can load the data to a DataFrame for analysis.

In [None]:
iris = CSV.read("iris_data.csv", DataFrame; 
        header=["sepal_len", "sepal_wid", "petal_len", "petal_wid", "species"])

Now we can use the dataset to produce a meaningful plot or build a model.

In [None]:
scatter(iris.petal_len, iris.petal_wid, group=iris.species, 
    legend=:topleft, xlab="Petal length", ylab="Petal width")

In relational databases, tabular data manipulation is done with **Structured Query Language (SQL)**. SQL has become prevalent in data analysis and has also been adopted in non-relational analytical tools. Knowledge of SQL should be considered one of the core skills of a Data Analyst or Data Scientist.

In Julia, SQL queries can be run on top of a DataFrame with the DuckDB engine. DuckDB also has APIs for other programming languages, for instance Python and R. The tool is open-source and focuses on analytical workloads. For more details check the [DuckDB website](https://duckdb.org/).

We'll create an in-memory database with the `connect` function and register Iris DataFrame with `register_data_frame`.

In [None]:
con = DBInterface.connect(DuckDB.DB)
DuckDB.register_data_frame(con, iris, "iris")

Let's check the average of the four numeric features for each species of iris using an SQL query.

In [None]:
query = """
SELECT
    species,
    mean(sepal_len) AS avg_sepal_len,
    mean(sepal_wid) AS avg_sepal_wid,
    mean(petal_len) AS avg_petal_len,
    mean(petal_wid) AS avg_petal_wid
FROM iris
GROUP BY species
"""
# Materialize the query result as a DataFrame
results = DataFrame(DBInterface.execute(con, query))

The result is a standard DataFrame. We can plot the averages as bar plots and visually assess whether they differ between species.

In [None]:
gr()
plot(
    bar(results.species, results.avg_sepal_len, title="Sepal Length", legend=false),
    bar(results.species, results.avg_sepal_wid, title="Sepal Width", legend=false),
    bar(results.species, results.avg_petal_len, title="Petal Length", legend=false),
    bar(results.species, results.avg_petal_wid, title="Petal Width", legend=false),
    layout=(2,2), size=(800, 700)
    )
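
For comparison, the SQL aggregation above can also be written with DataFrames.jl alone, using `groupby` and `combine`. The sketch below runs on a tiny made-up table (`toy`) rather than the real `iris` data, so the numbers are purely illustrative.

```julia
using DataFrames
using Statistics  # provides `mean`

# Made-up measurements standing in for the `iris` DataFrame loaded earlier
toy = DataFrame(
    species   = ["setosa", "setosa", "virginica", "virginica"],
    petal_len = [1.4, 1.6, 5.0, 6.0],
    petal_wid = [0.2, 0.4, 2.0, 2.4],
)

# Equivalent of: SELECT species, mean(...) AS ... FROM toy GROUP BY species
averages = combine(
    groupby(toy, :species),
    :petal_len => mean => :avg_petal_len,
    :petal_wid => mean => :avg_petal_wid,
)
println(averages)
```

Which style to prefer is largely a matter of taste and team conventions: SQL is portable across tools, while the DataFrames.jl version stays within Julia and needs no database connection.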

## Interacting with an API

Let's query the GitHub API for a list of Julia-based repositories ranked by number of stargazers.

This endpoint is public, so there is no need to pass authentication information. Many other APIs, however, require a key, a token or a username/password pair.
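
For APIs that do require credentials, these are commonly passed as HTTP headers. The sketch below shows one way this could look with the `headers` keyword of `Downloads.download`. The helper `github_headers` and the `GITHUB_TOKEN` environment variable are assumptions made for this example, and the actual request is left commented out because it needs network access.

```julia
using Downloads

# Build request headers; `token` is a hypothetical personal access token
function github_headers(token::AbstractString="")
    headers = ["Accept" => "application/vnd.github+json"]
    # Add the Authorization header only when a token was provided
    isempty(token) || push!(headers, "Authorization" => "Bearer $token")
    return headers
end

# Read the token from a hypothetical environment variable, if it is set
headers = github_headers(get(ENV, "GITHUB_TOKEN", ""))

# Not executed here, since it requires network access:
# Downloads.download("https://api.github.com/search/repositories?q=language:julia&sort=stars",
#                    "gh_api.json"; headers=headers)
```

Reading the token from the environment rather than hard-coding it keeps credentials out of the notebook when it is shared.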

In [None]:
Downloads.download("https://api.github.com/search/repositories?q=language:julia&sort=stars", "gh_api.json");

The first 300 characters of the downloaded JSON file are shown below. The file contains an `items` array with objects representing repositories. Repositories also contain nested objects, e.g. `owner`.

In [None]:
julia_repos = read("gh_api.json", String);
first(julia_repos, 300)

A parsed JSON array of objects can be loaded into a `DataFrame` - nested objects are stored as `Dict` values and nested arrays as `Array` values within a column.

In [None]:
repos_df = DataFrame(JSON.parse(julia_repos)["items"])
first(repos_df, 2)

After inspecting the schema we can determine which fields are of composite type.

In [None]:
show(stdout, describe(repos_df, :eltype), allrows=true)

Let's extract the information about the top 10 repositories ordered by star count.

In [None]:
first(select(repos_df, :name, 
    :owner => ByRow(x -> x["login"]) => :owner, 
    :stargazers_count => :stars, 
    :created_at => :created, 
    :description), 10)

Let's create an interactive bar plot with the `Plotly` backend showing the top 20 repositories based on the stargazers count. `julia` itself is excluded because its star count is much higher than that of the other repositories.

In [None]:
plotly()
n=20
data = first(repos_df[2:end,:], n)
bar(1:n, data.stargazers_count, hovertext=data.description,
    size = (900, 500), xrotation=45, legend=false, xticks=(1:n, data.name), ylab="Stars")

## Unstructured data

Data can also be obtained directly through packages. The `MLDatasets` package makes it convenient to load well-known unstructured datasets, in particular from the computer vision area. We'll load and inspect observations from the [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which contains preprocessed images of clothes (28x28 pixels, single color channel).

**Note** When you run the cell below for the first time, `MLDatasets` will download the archived data to your machine. Before the download commences, you will need to accept the disclaimer. Read the message and accept by typing `y` in the input box and pressing Enter.

In [None]:
fmnist = FashionMNIST(split=:test)
fmnist.features[:, :, 2]

`MLDatasets` did part of our job and decoded the binary data into a numeric array; slicing out a single observation gives a 28x28 `Matrix`. Each cell contains a value between 0.0 and 1.0 corresponding to the brightness of a pixel in the image. There is no explicit structure in the form of named columns, hence the data is unstructured.

The `convert2image` function from `MLDatasets` renders the observation as an image, unveiling the piece of clothing hidden behind the numbers.

It looks like the second observation in the dataset is a sweater with an imprint.

In [None]:
convert2image(fmnist, 2)

Colors can be inverted easily with broadcasting - the first sample is apparently a boot.

In [None]:
1 .- convert2image(fmnist, 1)
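
To see what the broadcasted subtraction does numerically, here is a minimal sketch on a made-up 2x2 matrix of brightness values (plain `Float64` numbers rather than the image type returned by `convert2image`):

```julia
# Made-up 2x2 "image": 0.0 is black, 1.0 is white
img = [0.0 0.5;
       1.0 0.25]

# `1 .- img` subtracts every element from 1, so dark pixels become bright and vice versa
inverted = 1 .- img
println(inverted)
```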