# First steps in a Machine Learning (ML) pipeline

In these notes, we will look at some general, initial steps needed to implement a Machine Learning approach.

In this regard, the question you should probably ask yourself is the following: Why is it called "Machine Learning"? What exactly is it that we are trying to learn? One practical **definition of learning is the following: to extract a pattern from some data, so that this pattern can be used to make predictions**.

Given this definition, it is clear that **to learn, we need some data as a source**.  
The quality of what we learn depends on the quality of the data, but also on how the data is presented to the learner. Thus, the first step of an ML pipeline concerns exactly this aspect: data preparation. We will cover the basics in these lectures, but for a more in-depth perspective you can find additional information in this [video](https://www.youtube.com/watch?v=DiKwYKmQzJc) and [document](https://cloud.google.com/blog/products/gcp/preparing-and-curating-your-data-for-machine-learning?utm_source=youtube&utm_medium=unpaidsocial&utm_campaign=mir-20191108-Prepare-Curate-Data) from Google.

Now, let us start with the simplest, almost trivial, first task in the pipeline: learning how to read data from (and write data to) files.

## Reading Files

In the last set of lectures, we focused on generating data using `numpy` and using a variety of tools such as `pandas` to process and visualise this artificially generated data. Naturally, when faced with real-world examples, it is important to be able to import data from external sources. This whole lecture focuses on what are referred to as I/O operations.

> I/O stands for input/output. Another Python module that uses the term is `asyncio`, which is used for asynchronous I/O operations.

```python
import pandas as pd
```

## `pandas`

We have seen how powerful `pandas` is when it comes to manipulating tabular data structures. Also included as part of its library is the ability to import a range of common data files. This includes CSV, Excel, JSON, XML, TSV, and space-delimited data files. In this section, we will go through how to read each of them and also cover some of the caveats.

### CSV

Reading CSV files (and most other files) is very simple, since `pandas` handles the complicated tasks of opening file handles, reading the lines, and parsing them into a table.

If we were trying to hard code the process of reading the CSV file and importing it to a `numpy` array, it would look something like this:
```python
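import numpy as np

fname = 'myfile.csv'
data = []
header = True  # the first line of the file holds the column names
with open(fname, 'r') as f:
    for i, line in enumerate(f):
        if i == 0 and header:
            continue  # skip the header row
        # remove the trailing newline before splitting on commas
        data.append(line.strip().split(','))
data = np.array(data)  # note: every value is stored as a string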
```
Instead, by using `pandas`, this whole process is simplified, and `pd.read_csv` also accepts a range of optional arguments that we will cover over the course of the lecture.
```python
fname = 'myfile.csv'
data = pd.read_csv(fname)
data = data.to_numpy()
```
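
To make the difference between the two approaches concrete, the following is a minimal sketch that parses the same small CSV text by hand and with `pd.read_csv`. The CSV content and the column names are made up for illustration, and `io.StringIO` is used only so that the example does not need a file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content, held in memory instead of in a file
csv_text = """areaName,date,newCases
United Kingdom,2021-02-02,2934
United Kingdom,2021-02-01,17616
"""

# Hand-written parser: every field stays a string
rows = [line.strip().split(',') for line in io.StringIO(csv_text)]
manual = np.array(rows[1:])  # drop the header row

# pandas parser: column names come from the header, types are inferred
df = pd.read_csv(io.StringIO(csv_text))

print(manual.dtype.kind)     # 'U': an array of unicode strings
print(df['newCases'].dtype)  # an integer dtype
print(df.to_numpy().dtype)   # object, because the columns have mixed types

# Converting a single numeric column gives a numeric array we can compute with
total_cases = df['newCases'].to_numpy().sum()
print(total_cases)
```

The hand-written version would still need an explicit conversion step before any arithmetic, whereas `pandas` infers the numeric column on its own. Calling `.to_numpy()` on the whole table yields a generic `object` array here, so it is often more useful to convert only the numeric columns.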



```python
fname = 'Lecture3Data/301-test.csv'
data = pd.read_csv(fname)
data
# data.to_numpy() would convert the data to a numpy array
```

| | areaType | areaName | areaCode | date | newCasesBySpecimenDate | cumCasesBySpecimenDate |
|---|---|---|---|---|---|---|
| 0 | overview | United Kingdom | K02000001 | 2021-02-02 | 2934 | 3871823 |
| 1 | overview | United Kingdom | K02000001 | 2021-02-01 | 17616 | 3868889 |
| 2 | overview | United Kingdom | K02000001 | 2021-01-31 | 14880 | 3851273 |
| 3 | overview | United Kingdom | K02000001 | 2021-01-30 | 16381 | 3836393 |
| 4 | overview | United Kingdom | K02000001 | 2021-01-29 | 21647 | 3820012 |
| ... | ... | ... | ... | ... | ... | ... |
| 365 | overview | United Kingdom | K02000001 | 2020-02-03 | 0 | 5 |
| 366 | overview | United Kingdom | K02000001 | 2020-02-02 | 0 | 5 |
| 367 | overview | United Kingdom | K02000001 | 2020-02-01 | 1 | 5 |
| 368 | overview | United Kingdom | K02000001 | 2020-01-31 | 2 | 4 |


### Excel

Being able to read Excel files is very useful. Writing the code to do this from scratch is difficult, so much so that we will not provide a comparison like the one we gave for CSV files. Excel files contain tabulated data, but they can also contain several sheets, and `pandas` can easily manage all of this.

The function used for importing Excel files is `pd.read_excel`. As before, the first argument is the name of the file. There is a new keyword argument called `engine`, which is set to `'openpyxl'`. We will not go into detail about why this is necessary, but at the time of writing it is needed to read `.xlsx` files, the format created by recent versions of Microsoft Excel. Note that `openpyxl` is a separate package and must be installed alongside `pandas`.

```python
data = pd.read_excel('Lecture3Data/301-test.xlsx', engine='openpyxl')
```

```python
data = pd.read_excel('Lecture3Data/301-test.xlsx', engine='openpyxl')
data
```

| | areaType | areaName | areaCode | date | newCasesBySpecimenDate | cumCasesBySpecimenDate |
|---|---|---|---|---|---|---|
| 0 | overview | United Kingdom | K02000001 | 2021-02-02 | 2934 | 3871823 |
| 1 | overview | United Kingdom | K02000001 | 2021-02-01 | 17616 | 3868889 |
| 2 | overview | United Kingdom | K02000001 | 2021-01-31 | 14880 | 3851273 |
| 3 | overview | United Kingdom | K02000001 | 2021-01-30 | 16381 | 3836393 |
| 4 | overview | United Kingdom | K02000001 | 2021-01-29 | 21647 | 3820012 |
| ... | ... | ... | ... | ... | ... | ... |
| 365 | overview | United Kingdom | K02000001 | 2020-02-03 | 0 | 5 |
| 366 | overview | United Kingdom | K02000001 | 2020-02-02 | 0 | 5 |
| 367 | overview | United Kingdom | K02000001 | 2020-02-01 | 1 | 5 |
| 368 | overview | United Kingdom | K02000001 | 2020-01-31 | 2 | 4 |
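
Since an Excel workbook can hold several sheets (its "pages"), it may help to see how they can be read in one go. The sketch below is only an illustration under stated assumptions: the two helper functions, the sheet names, and the data are hypothetical, and `openpyxl` is assumed to be installed. It relies on the `sheet_name` argument of `pd.read_excel`, which is not used elsewhere in these notes.

```python
import pandas as pd


def write_demo_workbook(path):
    """Write a small two-sheet workbook (hypothetical data) to `path`."""
    cases = pd.DataFrame({'date': ['2021-02-01', '2021-02-02'],
                          'newCases': [17616, 2934]})
    areas = pd.DataFrame({'areaCode': ['K02000001'],
                          'areaName': ['United Kingdom']})
    with pd.ExcelWriter(path, engine='openpyxl') as writer:
        cases.to_excel(writer, sheet_name='cases', index=False)
        areas.to_excel(writer, sheet_name='areas', index=False)


def read_all_sheets(path):
    """Return a dict mapping each sheet name to its DataFrame."""
    # sheet_name=None asks pandas for every sheet rather than only the first
    return pd.read_excel(path, sheet_name=None, engine='openpyxl')
```

After calling `write_demo_workbook('demo.xlsx')`, `read_all_sheets('demo.xlsx')['cases']` would give the first table. To read a single sheet, `sheet_name` can instead be set to its name or its position, for example `sheet_name='areas'`.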


### Other files separated by a string

TSV files are **T**ab-**S**eparated **V**alues files.  
We do not need a separate function to read TSV files: `pd.read_csv()` can be used, provided we tell it that the fields are separated by tabs.

In this section, we have also included some other keyword arguments that will become increasingly important as our use of `pandas` increases.

- `delimiter='\t'`: sets the character that separates the fields, here a tab (usually a comma in Comma-Separated Values (CSV) files)
- `index_col=None`: no column of the file is used as the row index, so `pandas` generates a 0, 1, 2,... index; pass a column number or name instead to use that column as the index
- `skiprows=1`: skips this many rows at the start of the file (usually 0)
- `header=None`: with header set to None, the columns of the resultant `pd.DataFrame` will be labelled 0, 1, 2,...; otherwise the first row will become the column names

> You can search for other keyword arguments on the `pd.read_csv` documentation page which can be found [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
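
The effect of these keyword arguments is easiest to see on a tiny example. The following sketch uses made-up tab-separated text held in memory (via `io.StringIO`), with a title line followed by data rows and no header. The `names` argument in the second call is not discussed in these notes; it is included as one possible way to label the columns.

```python
import io

import pandas as pd

# Hypothetical tab-separated text: one title line, then data rows with no header
tsv_text = (
    "exported table\n"
    "overview\tUnited Kingdom\t2934\n"
    "overview\tUnited Kingdom\t17616\n"
)

data = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter='\t',   # split fields on tabs instead of commas
    skiprows=1,       # ignore the title line
    header=None,      # no header row, so columns are labelled 0, 1, 2
    index_col=None,   # keep the generated 0, 1, 2, ... row index
)
print(data)

# Passing names= alongside header=None gives the columns readable labels
named = pd.read_csv(
    io.StringIO(tsv_text),
    delimiter='\t',
    skiprows=1,
    header=None,
    names=['areaType', 'areaName', 'newCases'],
)
print(named)
```

Without `skiprows=1` the title line would be parsed as data, and without `header=None` the first data row would be consumed as column names, so both settings depend on how the particular file is laid out.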

```python
data = pd.read_csv(
    'Lecture3Data/301-test.tsv',
    delimiter='\t',
    index_col=None,
    skiprows=10,
    header=None
)

data
# TODO: find a tab-separated file in a standard format
```

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | overview | United Kingdom | K02000001 | 24/01/2021 | 17189 | 3691867 |
| 1 | overview | United Kingdom | K02000001 | 23/01/2021 | 21836 | 3674678 |
| 2 | overview | United Kingdom | K02000001 | 22/01/2021 | 29581 | 3652842 |
| 3 | overview | United Kingdom | K02000001 | 21/01/2021 | 31723 | 3623261 |
| 4 | overview | United Kingdom | K02000001 | 20/01/2021 | 35140 | 3591538 |
| ... | ... | ... | ... | ... | ... | ... |
| 356 | overview | United Kingdom | K02000001 | 03/02/2020 | 0 | 5 |
| 357 | overview | United Kingdom | K02000001 | 02/02/2020 | 0 | 5 |
| 358 | overview | United Kingdom | K02000001 | 01/02/2020 | 1 | 5 |
| 359 | overview | United Kingdom | K02000001 | 31/01/2020 | 2 | 4 |


### Fixed-Width

PDB (Protein Data Bank) files are used in molecular simulations to store the atomic coordinates of molecules, most commonly proteins. The file in `Lecture3Data/301-test.pdb` is that of an ethanol molecule. PDB files use fixed-width columns, which requires a different function to read them: `pd.read_fwf`.

The function has very similar keyword arguments to `pd.read_csv`. In this case, we use a new argument called `nrows`, which ensures that only `nrows` rows of the file are read (counted after any skipped rows).

> In this case, we have provided the values for the keyword arguments. Unfortunately, it is rarely this simple: a user must first inspect the file, or a set of files with a similar format, to work out which arguments to pass to the `pandas` function.

```python
data = pd.read_fwf('Lecture3Data/301-test.pdb', header=None, skiprows=1, nrows=9)

# PDB file of a simple molecule
data
```

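| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ATOM | 1 | C | 1 | -0.426 | -0.115 | -0.147 | 1.0 | 0.0 |
| 1 | ATOM | 2 | O | 1 | -0.599 | 1.244 | -0.481 | 1.0 | 0.0 |
| 2 | ATOM | 3 | H | 1 | -0.75 | -0.738 | -0.981 | 1.0 | 0.0 |
| 3 | ATOM | 4 | H | 1 | -1.022 | -0.351 | 0.735 | 1.0 | 0.0 |
| 4 | ATOM | 5 | H | 1 | -1.642 | 1.434 | -0.689 | 1.0 | 0.0 |
| 5 | ATOM | 6 | C | 1 | 1.047 | -0.383 | 0.147 | 1.0 | 0.0 |
| 6 | ATOM | 7 | H | 1 | 1.37 | 0.24 | 0.981 | 1.0 | 0.0 |
| 7 | ATOM | 8 | H | 1 | 1.642 | -0.147 | -0.735 | 1.0 | 0.0 |
| 8 | ATOM | 9 | H | 1 | 1.18 | -1.434 | 0.405 | 1.0 | 0.0 |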
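
In the example above, `pd.read_fwf` infers where each column starts and ends. When that inference is not reliable, the column widths can be given explicitly. The sketch below illustrates this on a simplified, made-up fixed-width layout (it is not the real PDB column layout); the widths and the column names are assumptions chosen for the example, and the coordinates are borrowed from the first three atoms in the table above.

```python
import io

import pandas as pd

# Build hypothetical fixed-width text: 6 characters for the record type,
# 4 for the element, and 8 each for the x and y coordinates
atoms = [('C', -0.426, -0.115), ('O', -0.599, 1.244), ('H', -0.750, -0.738)]
fwf_text = ''.join(
    f"{'ATOM':<6}{element:<4}{x:>8.3f}{y:>8.3f}\n" for element, x, y in atoms
)

data = pd.read_fwf(
    io.StringIO(fwf_text),
    widths=[6, 4, 8, 8],                     # width of each field in characters
    header=None,                             # the text has no header row
    names=['record', 'element', 'x', 'y'],   # hypothetical column labels
    nrows=2,                                 # read only the first two rows
)
print(data)
```

Giving `widths` (or the related `colspecs` argument) removes the guesswork, at the cost of needing to know the layout of the file in advance.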
