# Lecture 22: Data Pipelines

![](https://www.tensorflow.org/images/colab_logo_32px.png)
[Run in colab](https://colab.research.google.com/drive/1udgk2h35b1ghLRSixSceEXocqT_Rw8f4)

Applying machine learning algorithms and approaches to real-world data
involves a number of practical challenges. In this series of lectures,
we will look at tools and ways of working that address questions like:

- How can I store and access large amounts of data remotely?
- How can I keep track of different versions of datasets?
- How can I share my results and make my analyses reproducible by others?

Before turning to these individual questions, let's look at the broader picture of how we handle data, both before and after running any analysis or inference method.

## Pipelines

More often than not, the data of interest will not be in a directly useable form, or perhaps not even immediately available to you.

Before we can train any model, we first need to make sure that the data is available and properly formatted. This can involve a number of steps:

- Accessing the data
- Cleaning and other preprocessing
- Transforming and generating features
- Making the data available to the model

After this, the data is ready to be used in our algorithm of choice.

![Image of a data pipeline comprising the steps of accessing, preprocessing, transforming, serving, modelling and publishing data. Each step is represented as a box with arrows linking them.](Lecture22_Images/pipeline.svg)

This sequence of steps is sometimes called a **data pipeline**. Another related term used to describe similar workflows is ETL: Extract-Transform-Load.
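To make the idea concrete, here is a minimal sketch of a pipeline written as a chain of small functions, one per step. The data, the function names and the choice of a list of floats as the "model input" are all made up for illustration; a real pipeline would read from an actual source and would usually be more involved.

```python
import csv
import io

# Hypothetical raw data; one record has a missing height.
RAW_CSV = "name,height_cm\nAda,170\nGrace,\nAlan,180\n"

def access():
    """Access: read raw records from a source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(RAW_CSV)))

def preprocess(records):
    """Preprocess: discard records with a missing height."""
    return [r for r in records if r["height_cm"]]

def transform(records):
    """Transform: derive a new feature, the height in metres."""
    return [{**r, "height_m": int(r["height_cm"]) / 100} for r in records]

def serve(records):
    """Serve: hand over the data in the form the model expects (a list of floats)."""
    return [r["height_m"] for r in records]

# Each step consumes the output of the previous one.
model_input = serve(transform(preprocess(access())))
print(model_input)
```

Keeping each step in its own function makes it easier to rerun, test or replace one stage without touching the others.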

## Accessing data

If we haven't collected the data ourselves, we will first need to get our (virtual) hands on it! This can be done in a number of ways. For example:
- A colleague gives us a file
- We connect to a web service that produces the data
- We "scrape" a web page or other source to extract the data ourselves
- We query a database for the particular data we want

Sometimes we will need to combine more than one source to get the full set of data that we require.
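As a rough illustration of combining sources, the sketch below joins two small tables with `pandas`. The tables, column names and values are invented: imagine that `measurements` came from a colleague's file and `sites` from a database query.

```python
import pandas as pd

# Hypothetical data from two different sources.
measurements = pd.DataFrame({
    "site_id": [1, 2, 3],
    "rainfall_mm": [12.5, 30.1, 7.8],
})
sites = pd.DataFrame({
    "site_id": [1, 2, 4],
    "region": ["North", "South", "East"],
})

# A left join keeps every measurement, even when no matching site exists.
combined = measurements.merge(sites, on="site_id", how="left")

# Measurements without a match end up with a missing region,
# which we may need to deal with during preprocessing.
unmatched = combined[combined["region"].isna()]
print(combined)
print(f"{len(unmatched)} measurement(s) had no matching site")
```

Whether a left join is the right choice depends on the analysis; an inner join (`how="inner"`) would instead drop the unmatched measurement.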

We will talk more about databases in the next lecture. 

A common element in the above examples is that the data can exist in some **remote** location. How we get it on our own computer will depend on the source, format and size of the data. However, this can often be done programmatically.

### Example: Downloading data from S3

S3 is a storage service offered by Amazon Web Services (AWS). Users can upload datasets which can then be accessed by others. For this example we will look at an open dataset that can be downloaded by anyone.

(To run this locally, you will need to install the Python package `boto3`, which lets you communicate with AWS.)

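```python
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# The dataset is public, so we send unsigned (anonymous) requests
# and do not need AWS credentials.
client = boto3.client('s3',
                      config=Config(signature_version=UNSIGNED))

# Request a single object (file) from the bucket by its key (path).
response = client.get_object(
    Bucket="first-street-climate-risk-statistics-for-noncommercial-use",
    Key="v1.0/CongDist_level_risk_FEMA_FSF.csv"
)

# Read the response body, decode it as text and save it to a local file.
with open('flood_data.csv', 'w', encoding='utf-8') as output_file:
    data = response['Body'].read().decode('utf-8')
    output_file.write(data)
```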

## Preprocessing

Getting hold of the data you want to work with is only the first step. Sometimes this raw or preliminary data has to be changed. There are many reasons why:

- Data may contain errors
- The dimensionality or size of the dataset is too large
- We want to focus only on a subset of interest
- Raw data does not directly contain variables of interest 
- Some algorithms are negatively impacted by e.g. imbalances in class frequencies or extreme values

Preprocessing steps can include:
- Replacing values that are incorrect or cause problems
- Filtering, subsampling (discarding samples) or oversampling (repeating samples)
- Removing outliers

Aspects of this are often referred to as **"cleaning"** the data. This is an important and often undervalued step in the pipeline. These transformations can be performed manually, although tools like [OpenRefine](https://openrefine.org/) can simplify and automate the process.
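Some of the preprocessing steps listed above can also be done programmatically. The sketch below uses `pandas` on an invented table. It assumes that `-999` is a placeholder for a missing reading and uses the common "1.5 × interquartile range" rule to flag outliers; both are illustrative choices and not the only way to clean data.

```python
import pandas as pd

# Hypothetical raw data: -999 is assumed to mean "missing".
raw = pd.DataFrame({
    "sample": ["a", "b", "c", "d", "e", "f"],
    "temperature": [20.1, -999.0, 21.3, 19.8, 85.0, 20.6],
    "label": ["control", "control", "treated", "control", "control", "treated"],
})

# 1. Replace problematic values: mask the placeholder, then drop those rows.
no_placeholder = raw.copy()
no_placeholder["temperature"] = no_placeholder["temperature"].mask(
    no_placeholder["temperature"] == -999.0
)
no_placeholder = no_placeholder.dropna(subset=["temperature"])

# 2. Remove outliers: keep values within 1.5 * IQR of the quartiles.
q1 = no_placeholder["temperature"].quantile(0.25)
q3 = no_placeholder["temperature"].quantile(0.75)
iqr = q3 - q1
within_range = no_placeholder["temperature"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = no_placeholder[within_range]

# 3. Filter: focus on the subset of interest.
controls = no_outliers[no_outliers["label"] == "control"]
print(controls)
```

Keeping each intermediate table under its own name makes it easy to check how many rows each step removed.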

## Transforming

Cleaning brings you one step closer to useable data inputs for your model. However, your analysis may rely on variables that are not directly present in the original data.

There is therefore a need for **feature generation**: extracting the variables of interest by combining existing ones.

For example, imagine that you are training a regression model to predict the effect of various factors on the growth of a colony of bacteria.

Your output variable is the rate of growth, i.e. how much the colony grew in a day during the experiment.

The data you have been given does not contain this directly; instead, it has counts of cells at the start and end of the experiment, as well as the start and end date. To use your model, you will first need to combine this information and extract the variable that the model needs.
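A minimal sketch of this feature generation step is shown below. The column names and values are invented, and the growth rate is taken to be the average number of cells gained per day; a different definition (for example, an exponential growth rate) may be more appropriate for a real experiment.

```python
import pandas as pd

# Hypothetical experiment records: dates and cell counts only.
experiments = pd.DataFrame({
    "start_date": ["2023-03-01", "2023-03-01"],
    "end_date": ["2023-03-05", "2023-03-11"],
    "start_count": [1000, 500],
    "end_count": [5000, 2500],
})

# Convert the date strings to timestamps so that we can subtract them.
start = pd.to_datetime(experiments["start_date"])
end = pd.to_datetime(experiments["end_date"])
experiments["duration_days"] = (end - start).dt.days

# New feature: average number of cells gained per day.
experiments["growth_per_day"] = (
    experiments["end_count"] - experiments["start_count"]
) / experiments["duration_days"]

print(experiments[["duration_days", "growth_per_day"]])
```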

## Serving

The code you have written may expect to read in data in a particular format, such as a CSV file, or a collection of files. The result of your preprocessing must therefore be made available in the same format.

![Diagram of a format mismatch between a data source and model](Lecture22_Images/mismatch.svg)

Note that the result of this step need not be a file on your computer: you could choose to serve your data through, for example, a web service. The important thing is that, at the end of this step, the data is ready to be fed into the model, matching what that code expects.

This preparation can often be done through a library. For instance, `pandas` offers several methods for writing out a data frame to a number of commonly-used formats:

```python
import pandas as pd
df = pd.read_json('my_data.json')
# ...Perform any transformations...
df.to_csv('my_ready_data.csv')
```

## Publishing

The lifecycle of data does not have to end when you feed it into a model!

You can even think of the model itself, and any analyses (e.g. predictions) you make based on it, as new data. These can in turn be made available to support further research or other work.

You may want to consider uploading them to cloud storage (like S3) or a provider like [Zenodo](https://zenodo.org/).

When releasing your datasets, it is useful to put them in a standard format (such as CSV or JSON) that will be easy for others to read. You should also include **metadata** that explains how the data was generated and what it contains.
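One lightweight way to attach metadata is to write a small JSON file alongside the dataset. The sketch below is only an illustration: the file name, field names and values are all hypothetical, and repositories or research communities often define their own metadata schemas that you should follow instead.

```python
import json

# Hypothetical description of a dataset we are about to publish.
metadata = {
    "title": "Example bacterial growth dataset",
    "description": "Cell counts at the start and end of each experiment.",
    "created": "2024-01-15",
    "license": "CC-BY-4.0",
    "columns": {
        "start_count": "Number of cells at the start of the experiment",
        "end_count": "Number of cells at the end of the experiment",
    },
    "processing": "Placeholder values removed; outliers filtered.",
}

# Write the metadata next to the data file so that the two travel together.
with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)

# Anyone receiving the dataset can read the description back.
with open("dataset_metadata.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(loaded["title"])
```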

Sharing the model itself is less straightforward, but there are still ways that you can make it available. You could:
- build an application that you can distribute
- create a web service that people can call with their own data to get your trained model's outputs
- **serialize** the model and share it

Serialization (converting the trained model into a sequence of bytes that can be stored and later loaded again) is perhaps the simplest option, but it is not always trivial.

Fortunately, Python's [`pickle`](https://docs.python.org/3/library/pickle.html) library can serialize a range of objects, including `scikit-learn` models, allowing you to [store and share trained models like static files](https://scikit-learn.org/stable/modules/model_persistence.html).
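As a small sketch of what this can look like, the example below trains a toy `scikit-learn` linear regression, saves it with `pickle` and loads it back. The data and the file name `model.pkl` are made up, and the example assumes that `scikit-learn` is installed.

```python
import pickle

from sklearn.linear_model import LinearRegression

# Toy training data following y = 2x + 1.
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]
model = LinearRegression().fit(X, y)

# Serialize the trained model to a file that can be stored or shared.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (or on another machine), load the model and use it directly.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

prediction = restored.predict([[4]])[0]
print(prediction)  # close to 9, since the data follows y = 2x + 1
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code; the linked `scikit-learn` page also discusses compatibility between library versions.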

The [**FAIR principles**](https://www.go-fair.org/fair-principles/) are a guide for sharing research data outputs. They encourage you to ensure that your data is:
- Findable
- Accessible
- Interoperable
- Reusable

## Summary
- Data needs to go through a number of steps before it can be used in a model.
- Doing this process programmatically, not manually, leaves a record and facilitates repetition and verification.
- Remote access to data is becoming increasingly important as datasets grow in size.