# Introduction to ML Workflow and Data Retrieval in Julia

## Setting up the project (environment) in Julia

Before proceeding to topics related to machine learning and data analysis, let's begin with a general software engineering idea - **environments**. It's a best practice to prepare an isolated project for every programming initiative, whether it is building a web application, running a complex scientific simulation or training an ML model.

The idea of a structured project is prevalent across programming languages, although the actual name may differ, e.g. [virtual environment](https://docs.python.org/3/library/venv.html) in Python or [build environment](https://docs.gradle.org/current/userguide/build_environment.html) in Gradle, a build tool used with many compiled languages. The scope of information carried by a project differs between programming languages as well, but in general it records:
* dependencies (libraries, packages) required by the project
* version of the project (see [semantic versioning](https://semver.org/))
* indication of development stage (development, test, production, etc.)
* author-related information (name, contact information, company affiliation)
* miscellaneous configurations (compiler flags, version control details, CI/CD parameters, IDE settings)

The main goals of the project are to:
* provide reproducibility and standardization (if it works on my machine, it should work on yours)
* enable collaboration (shared projects within teams)
* supply additional information about the piece of software (date of creation, sponsoring company name, author's email)

In Julia, a project is defined by two files: `Project.toml` and `Manifest.toml`. More information on both can be found in the [Pkg.jl documentation](https://pkgdocs.julialang.org/v1/toml-files/).

For example, the extract below from the DataFrames.jl `Project.toml` contains the name of the package, its unique identifier, the current version and one dependency on the `DataAPI` package with an additional version restriction:

```toml
name = "DataFrames"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.4.1"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
(...)

[compat]
DataAPI = "1.12.0"
(...)
```
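Because `Project.toml` is plain TOML, it can also be inspected programmatically. The sketch below is only an illustration: it parses a shortened copy of the extract above with the `TOML` standard library and checks candidate versions against the `[compat]` entry. The helper `is_compatible` is a hypothetical name, not part of Pkg, and it only mimics how a bare entry such as `"1.12.0"` is interpreted for versions 1.0 and above (at least 1.12.0, below 2.0.0).

```julia
using TOML  # standard library, no installation needed

# Shortened copy of the extract above (the elided "(...)" parts are left out)
project_text = """
name = "DataFrames"
uuid = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
version = "1.4.1"

[deps]
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"

[compat]
DataAPI = "1.12.0"
"""

project = TOML.parse(project_text)

# Top-level keys hold plain values; tables such as [deps] become nested Dicts
pkg_version = VersionNumber(project["version"])
dep_names = collect(keys(project["deps"]))

# Hypothetical helper (not part of Pkg): accept versions from the compat
# entry up to, but excluding, the next major version
lower = VersionNumber(project["compat"]["DataAPI"])
is_compatible(v::VersionNumber) = lower <= v < VersionNumber(lower.major + 1)

println(project["name"], " ", pkg_version, " depends on ", dep_names)
println("DataAPI 1.13.0 allowed? ", is_compatible(v"1.13.0"))
println("DataAPI 2.0.0 allowed?  ", is_compatible(v"2.0.0"))
```

In day-to-day work Pkg resolves these constraints itself; the point of the sketch is only to show that the file is ordinary, human-readable data.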

You can start your own project by running **either of the two** cells below. More information on project initiation is available in the [Pkg.jl documentation](https://pkgdocs.julialang.org/v1/environments/).

In [None]:
using Pkg
Pkg.activate(".")

In [None]:
]activate .

Add all required packages (dependencies) to the project using the `]add` command. After finishing the work you can share the notebook (and the corresponding TOML files). With a high degree of confidence, recipients will be able to run the notebook without errors related to compatibility or missing libraries.

In [None]:
]add DataFrames, JSON, Plots

To replicate the environment, run the `]instantiate` command in the folder where `Project.toml` is present.

In [None]:
]instantiate

Having said that, **make sure to run the `]instantiate` command above**, so that the TOML files provided with the notebook are used to install all required packages in the expected versions. This will keep the upcoming exercises smooth and pleasant.

## Machine Learning workflow

![](Class1_ML_Workflow.png)
<div style="text-align: right">Source: Burkov Andriy, ML Engineering, 2020, CC BY-SA 4.0</div> 

The lifecycle of a Machine Learning project is a complex process involving multiple areas of expertise and a varied set of skills. To succeed with an enterprise-level ML project, we'll need:
* project managers, product owners and business analysts with a good understanding of the business problem and the ability to define the goal and manage the execution of the initiative
* data engineers, data analysts and data scientists with in-depth knowledge of the data, technical skills and statistical (modelling) expertise
* DevOps engineers, software engineers and application developers taking care of model deployment in a secure, robust and performant manner, often by embedding the model into a bigger application

The increasing penetration of ML models in business and advancements in deployment created a new field of **Machine Learning engineering** and the corresponding position of **Machine Learning engineer**. Activities related to post-evaluation steps are also often referred to as **MLOps (Machine Learning Operations)**, by analogy with the operations term used for ongoing maintenance and monitoring of software applications.

___
We'll cover several elements of the process in the notebooks throughout the course, focusing mainly on the steps from Data collection up to Model evaluation. The final notebooks will cover the basics of deployment and serving.

Let's start with the first step in the ML journey - obtaining and loading data.

## Data Gathering and Retrieval

Building a machine learning model is a data-heavy exercise, and the success of the whole project is often determined by the quality and quantity of data. As for the former, the famous saying 'garbage in, garbage out' says it all - data may be cleaned and preprocessed, but there is hardly anything to be done if the information carried by the dataset is weak (e.g. features unrelated to the target) or scarce (e.g. many missing values).

There is an abundance of data sources and formats suitable for use in the ML process. Data can be categorized based on the presence (or lack) of structure into:
* **structured data** (fixed schema, often identified with tabular datasets)
* **semi-structured data** (varying schema between observations, usually stored in JSON or XML format)
* **unstructured data** (domain-specific data without exposed features, e.g. text, music, images)

Membership in one of the categories above determines which ML tasks are available, e.g. regression and classification are usually considered for structured data, object detection is a task specific to computer vision, while sentiment analysis is applied to text data containing natural language.
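To make the three categories more tangible, here is a minimal sketch using only built-in Julia types. All values and the helper `same_keys` are made up for illustration; real datasets are of course much larger.

```julia
# Structured: every observation has the same fields with the same types
structured = [
    (sepal_len = 5.1, species = "setosa"),
    (sepal_len = 7.0, species = "versicolor"),
]

# Semi-structured: records share a format, but the set of keys varies
semi_structured = [
    Dict("name" => "repo-a", "owner" => Dict("login" => "alice")),
    Dict("name" => "repo-b", "license" => "MIT"),
]

# Unstructured: raw content without named features, e.g. text or pixel intensities
text = "The delivery was quick and the product works great."
image = [0.0 0.5; 1.0 0.25]   # 2x2 grayscale "image"

# Hypothetical helper: do all records expose exactly the same field names?
same_keys(records) = all(r -> Set(keys(r)) == Set(keys(first(records))), records)

println("structured records share one schema:      ", same_keys(structured))
println("semi-structured records share one schema: ", same_keys(semi_structured))
println("the image is just a ", size(image), " array of numbers")
```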

___
Another factor to consider when thinking about the data is its source. The majority of small-scale or proof-of-concept projects are based on **flat files** available in online repositories - either ones specialized for modelling purposes (e.g. [UCI](https://archive.ics.uci.edu/ml/index.php)) or more general ones like GitHub. Government agencies and international organizations also tend to share periodic reports online as flat files (e.g. [OECD](https://data.oecd.org/)).

For medium and large ML initiatives, a more suitable source of data would be a **database**, **data warehouse** or **data lake**. Relational databases usually deliver structured data, but modern solutions support semi-structured and unstructured data as well (see [JSON support in PostgreSQL](https://www.postgresql.org/docs/current/datatype-json.html) for example). For big volumes of historical structured data a data warehouse is a common solution, while a data lake supports all three structure categories.

A data source commonly associated with semi-structured data is a **REST API**. It's a service that can be queried through exposed URL endpoints and typically responds with data in JSON format. Received records are then parsed and turned into structured data or used directly in ML models in a semi-structured form.
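Turning a parsed record into structured data often means flattening nested objects into plain columns. The sketch below shows one possible approach using only Base Julia. The record and the helper `flatten_record` are hypothetical; the record merely imitates what one element of a parsed JSON response might look like.

```julia
# Hypothetical record, shaped like one element of a parsed JSON response
record = Dict{String,Any}(
    "name" => "example-repo",
    "stars" => 42,
    "owner" => Dict{String,Any}("login" => "alice", "id" => 7),
)

# Recursively copy nested objects into one flat Dict, joining key names with "."
function flatten_record(d::AbstractDict, prefix::String = "")
    flat = Dict{String,Any}()
    for (key, val) in d
        full_key = isempty(prefix) ? String(key) : prefix * "." * String(key)
        if val isa AbstractDict
            merge!(flat, flatten_record(val, full_key))
        else
            flat[full_key] = val
        end
    end
    return flat
end

flat = flatten_record(record)

# Print the flattened fields in alphabetical order
for key in sort(collect(keys(flat)))
    println(rpad(key, 15), ": ", flat[key])
end
```

After flattening, every record that shares the same keys can be placed in a table with one column per key. Arrays nested in a record are left untouched by this sketch.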

Public cloud platforms can also serve as a modern data source for modelling. **Object storage** services such as AWS S3 or Azure Blob Storage provide a convenient place to store and share large quantities of unstructured data.

After the short overview of sources of data and categories of data structure, let's see how we can load various datasets into Julia for further processing. 

In [None]:
using DataFrames: DataFrame
using Plots
using StatsPlots

## Flat files

In [None]:
using Downloads: download
using CSV

In [None]:
#We are downloading the famous Iris dataset from UCI repository and the file with dataset description
if !isfile("iris_data.csv")
    download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data","iris_data.csv");
    download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names","iris_names.txt");
end

In [None]:
#Flat files are present in the working directory
readdir()

In [None]:
#Now we can load the data to DataFrame for analysis
CSV.read("iris_data.csv",DataFrame; header=["sepal_len","sepal_wid","petal_len","petal_wid","species"])
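The `header` argument is needed because `iris.data` has no header line: every line, including the first one, is an observation. The sketch below mimics that idea with Base Julia only, on three rows written in the same layout as the file. It is a simplified illustration, not how `CSV.read` works internally (CSV.jl also detects types, handles quoting and missing values), and `parse_row` is a hypothetical helper.

```julia
# A few rows in the layout of iris.data: comma-separated values and no header line
raw = """
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Column names supplied by us, since the file does not carry them
header = (:sepal_len, :sepal_wid, :petal_len, :petal_wid, :species)

# Turn one text line into a named row: four Float64 measurements and a String label
function parse_row(line::AbstractString)
    fields = split(line, ',')
    measurements = parse.(Float64, fields[1:4])
    return NamedTuple{header}((measurements..., String(fields[5])))
end

# Skip empty lines and parse the rest
rows = [parse_row(line) for line in eachline(IOBuffer(raw)) if !isempty(line)]

println(first(rows))
println("species seen: ", unique(r.species for r in rows))
```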

In [None]:
#Or without saving a file
iris = CSV.read(download("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), DataFrame;
        header=["sepal_len","sepal_wid","petal_len","petal_wid","species"]);

In [None]:
#Now we can use the dataset to produce a meaningful plot or build a model
@df iris scatter(:petal_len, :petal_wid, group=:species, legend=:topleft, xlab="Petal length", ylab="Petal width")

## Interacting with API

In [None]:
using JSON

In [None]:
#`rpad` pads a string on the right to a given width; it is used in the helper below
rpad("name", 10, ".")

In [None]:
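#Auxiliary function to print a DataFrame schema (column name and element type)
function print_schema(df::DataFrame)
    #Iterating over the zip directly keeps the original column order (a Dict would not)
    for (key, val) in zip(names(df), eltype.(eachcol(df)))
        println(rpad(key, 30), ": ", val)
    end
end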

In [None]:
#Let's query GitHub API for list of Julia-based repositories ranked by number of stargazers
#It's a public API, so we don't need to pass any authentication information
download("https://api.github.com/search/repositories?q=language:julia&sort=stars","gh_api.json");

In [None]:
#First 300 characters of obtained JSON file
#JSON contains `items` array with objects representing repositories
#There are also nested objects in repositories, e.g. `owner`
julia_repos = read("gh_api.json", String);
print(first(julia_repos, 300)) #`first` counts characters, so it cannot cut a multi-byte character in half

In [None]:
#JSON can also fit into a DataFrame - nested objects are loaded as Dict columns and arrays as Array
repos_df = DataFrame(JSON.parse(julia_repos)["items"]);
first(repos_df,2)

In [None]:
#After inspecting the schema we can determine which fields are of composite type
print_schema(repos_df)

In [None]:
#Information about top 10 repositories ordered by stars count
for repo in eachrow(first(repos_df,10))
    println("""Name: $(repo["name"])
    Owner: $(repo["owner"]["login"])
    Stars: $(repo["stargazers_count"])
    Created: $(repo["created_at"])
    Description: $(repo["description"])\n""")
end

In [None]:
#Barplot based on semi-structured JSON data
#`julia` itself is excluded because its star count is much higher than that of the other repositories
plotly()
n=20
@df first(repos_df[2:end,:],n) bar(1:n,:stargazers_count, hovertext=:description,
    size = (900,500), xrotation=45, legend=false, xticks=(1:n, :name), ylab="Stars")

## Unstructured data

Data can also be obtained directly through packages. `MLDatasets.jl` allows us to conveniently load a few well-known unstructured datasets, in particular from the computer vision area. We'll load and inspect observations from the [FashionMNIST](https://github.com/zalandoresearch/fashion-mnist) dataset, which contains preprocessed images of clothes (28x28 pixels, single channel). You may need to accept a disclaimer when running the first function.

In [None]:
using MLDatasets
using ImageCore

In [None]:
#MLDatasets did part of the job for us and parsed the binary data into a matrix.
#Each cell contains a value between 0.0 and 1.0 corresponding to the 'brightness' of a pixel in the image.
#There is no explicit structure in the form of named columns of a specific type, hence the data is unstructured.
FashionMNIST.testtensor(1)

In [None]:
#For the sake of visualisation, we can interpret 0.0 as black and 1.0 as white
#MLDatasets provides an auxiliary function `convert2image` to produce a nice plot
#Looks like the first test observation is a shoe
FashionMNIST.convert2image(FashionMNIST.testtensor(1))

In [None]:
#Colors can be swapped easily with broadcasting
#Second sample is apparently a sweater with an imprint
FashionMNIST.convert2image(1 .- FashionMNIST.testtensor(2))
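The colour swap above relies only on broadcasting, so it can be tried without downloading any dataset. Below is a minimal sketch on a made-up 3x3 matrix standing in for an image; the pixel values are arbitrary.

```julia
# Hypothetical 3x3 grayscale "image": 0.0 is black, 1.0 is white
img = [0.0 0.5 1.0;
       0.2 0.8 0.0;
       1.0 1.0 0.3]

# Broadcasting applies the subtraction to every pixel, which swaps dark and bright
inverted = 1 .- img

# Inverting twice gives back the original image (up to floating-point rounding)
restored = 1 .- inverted

println("original top row: ", img[1, :])
println("inverted top row: ", inverted[1, :])
```

The same expression works unchanged on a 28x28 FashionMNIST observation, because broadcasting does not depend on the size of the array.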