# Lesson 3: Data Preparation

Data preparation is particularly important when doing Remote Data Science and Machine Learning. This is because the data scientist doesn't inherently have the ability to freely check how clean the data is, and whether it needs pre-processing. As such, the responsibility falls onto the Data Owner to ensure the data is clean, annotated, and usable.


In this notebook, you'll follow in the footsteps of a Data Owner (someone who has new and potentially sensitive information) and learn about:
<ul>
    <li> Data Acquisition </li>
    <li> Linking data from multiple sources </li>
    <li> Quality Checks </li>
    <li> Annotation </li>
    <li> Converting your data to a PyGrid compatible format </li>
    <li> How to load data into the node </li>
</ul>
    

## 3.1 Data Acquisition!

Data Acquisition focuses on generating and capturing data into a system. Broadly speaking, it's made up of two phases: <b>data harvesting</b> and <b>data ingestion</b>. We'll cover the former in this lesson and the latter over the next two lessons.

### 3.1.1 The 4 V's of Data

When it comes to thinking about data, there are usually four major things to think about, commonly referred to in the industry as the 4 V's of data. They include:

<b> Volume </b> refers to the quantity or amount of data in question.
An example of low volume would be a dataset of sensitive information about people with a rare condition.
An example of high volume would be most social media applications that you've heard of. For instance, Facebook has more users than China has people. And each of those people is making posts, uploading pictures, and liking content. That adds up to trillions of photos that they can use for data science and machine learning.

<b> Velocity </b> refers to the rate at which new data is being gathered or collected.
For instance, if you're a company, performance reviews might only come once a quarter.
But if you're YouTube, then in one day, you have over 700,000 hours of new videos added. For context, that's longer than the average human lifespan. So if a new person were born tomorrow, and they spent every moment of their life just trying to watch the YouTube videos uploaded on the day they were born, on average they wouldn't be able to get through them all.
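
As a rough sanity check of that comparison, the arithmetic can be sketched as follows. The 700,000-hour figure is the one quoted above. The year length of 365.25 days is an assumption made here for the conversion.

```python
# Convert one day's worth of uploads into years of nonstop viewing.
HOURS_UPLOADED_PER_DAY = 700_000   # figure quoted in the text above
HOURS_PER_YEAR = 24 * 365.25       # assumption: average year length

years_to_watch = HOURS_UPLOADED_PER_DAY / HOURS_PER_YEAR
print(f"{years_to_watch:.1f} years of nonstop viewing")
```

That works out to roughly 80 years of watching around the clock, with no time set aside for sleep.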

<b> Variety </b> refers to the diversity of the data that's being collected. For example, think of the difference between a dataset consisting of polls, and a dataset consisting of emails. No two emails are necessarily quite the same. They could contain quite literally anything: text about anything, pictures of anything, attachments of any kind, etc.


<b> Value </b> refers to the idea that not all kinds of data are of equal value. Let's say you're collecting medical images, and some of the images were corrupted during the acquisition process, and were blurry and grainy as a result. That data isn't quite as valuable as a pristine scan.

### 3.1.2 Hands-on
In our case, let's say we use data concerning the number of COVID cases per country. Let's load it and take a look!

In [2]:
# Load data
import pandas as pd
raw_data = pd.read_csv("dataset/first_draft_COVID_synthetic.csv")

In [3]:
raw_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,165,166,167,168,169,170,171,172,173,174
0,1140,1113,3099,92,621,344,283,284,1095,44,...,515,207,722,401,234,521,284,465,522,1176
1,1211,1378,2821,113,575,319,267,295,1167,52,...,577,231,655,449,224,600,224,577,492,1258
2,1238,1587,2356,107,520,265,296,316,1186,42,...,473,209,620,473,231,607,275,407,421,1109
3,1093,2075,2964,117,578,371,359,349,1151,48,...,550,227,648,447,232,563,188,338,432,1148
4,966,2269,3390,135,614,432,341,428,1107,49,...,285,189,640,398,250,535,183,382,459,1181


In this dataset, each column corresponds to a <b> country</b>, each row corresponds to a new <b> month </b> where data was collected, and each value in this DataFrame corresponds to the number of COVID-19 cases in the country at the start of that month. 

So for instance, Country 0 had 1140 COVID cases when data collection started (row 0), and only 966 by row 4, the last month shown above.
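
The `Unnamed: 0` column in the output above looks like a row index that was saved along with the data. A minimal sketch of reading it back as the month index, assuming that interpretation is correct, is shown below. The inline CSV reuses the first three countries from the table above, and the axis names `month` and `country` are labels chosen here for readability.

```python
import io

import pandas as pd

# First three countries from the table above, in the same CSV layout.
csv_text = """,0,1,2
0,1140,1113,3099
1,1211,1378,2821
2,1238,1587,2356
3,1093,2075,2964
4,966,2269,3390
"""

# index_col=0 uses the first CSV column as the row index instead of
# loading it as a data column called "Unnamed: 0".
cases = pd.read_csv(io.StringIO(csv_text), index_col=0)
cases.index.name = "month"      # label chosen for this sketch
cases.columns.name = "country"  # label chosen for this sketch

# Column labels are read as strings, so Country 0 is the column "0".
first_month = cases.loc[0, "0"]
last_shown_month = cases.loc[4, "0"]
print(first_month, last_shown_month)
```

Labelling the axes this way is a small first step towards annotation: anyone receiving the DataFrame can tell rows from columns without reading the accompanying text.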

## 3.2 Quality Check
Checking the quality of a dataset can involve finding missing values, identifying outliers and anomalies (using methods such as an Isolation Forest or k-Nearest Neighbours), or visualizing the dataset. 
It might also involve using external information that we know, for instance, about the sources or about how the data was collected.
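
As a minimal sketch of the first two checks, the snippet below counts missing values and flags outliers per column. It uses a simple interquartile-range (IQR) rule rather than an Isolation Forest or k-Nearest Neighbours, purely to keep the example dependency-free. The function name `quality_report`, the threshold `k`, and the toy data are all hypothetical.

```python
import pandas as pd


def quality_report(df, k=1.5):
    """Count missing values and IQR-rule outliers for each column."""
    missing = df.isna().sum()

    # IQR rule: a value is an outlier if it lies more than k * IQR
    # below the first quartile or above the third quartile.
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (df < q1 - k * iqr) | (df > q3 + k * iqr)

    return pd.DataFrame({"missing": missing, "outliers": is_outlier.sum()})


# Toy data (made up): column A has one extreme value, column B one gap.
toy = pd.DataFrame({
    "A": [10, 11, 12, 11, 10, 500],
    "B": [5.0, None, 6.0, 5.0, 6.0, 5.0],
})
report = quality_report(toy)
print(report)
```

A rule like this only flags candidates. Whether a flagged value is a real error still depends on what we know about how the data was collected.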

### 3.2.1. Description

### 3.2.2. Hands-on

Let's look at the dataset again:


In [4]:
raw_data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,165,166,167,168,169,170,171,172,173,174
0,1140,1113,3099,92,621,344,283,284,1095,44,...,515,207,722,401,234,521,284,465,522,1176
1,1211,1378,2821,113,575,319,267,295,1167,52,...,577,231,655,449,224,600,224,577,492,1258
2,1238,1587,2356,107,520,265,296,316,1186,42,...,473,209,620,473,231,607,275,407,421,1109
3,1093,2075,2964,117,578,371,359,349,1151,48,...,550,227,648,447,232,563,188,338,432,1148
4,966,2269,3390,135,614,432,341,428,1107,49,...,285,189,640,398,250,535,183,382,459,1181


Now let's say, for instance, that when we were given the dataset, we were given the following warnings:
* A lot of migration happened between Country 7 and Country 8, and as a result, around 500 people were double counted.
* There was a duplication error in the data, and the results for the last month are the same as those for the month before it.

Let's try to reason out how we would try to tackle these issues:
* Since the data was double counted, we could choose to simply offset both countries by the imbalance.
* We can simply disregard the last row of the data.

In [None]:
# Insert code
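
One possible way to apply both fixes is sketched below. The warnings don't say how the roughly 500 double-counted people are split between the two countries, or whether the figure applies to every month, so this sketch assumes it applies to each month and removes it from the two columns in proportion to their reported counts. The function name `apply_known_corrections` and the toy numbers are hypothetical.

```python
import pandas as pd


def apply_known_corrections(df, col_a, col_b, double_counted):
    """Return a corrected copy of df; the input is left untouched."""
    fixed = df.copy()

    # Assumption: remove the double-counted cases from both columns in
    # proportion to each country's reported count for that month.
    total = df[col_a] + df[col_b]
    for col in (col_a, col_b):
        share = double_counted * df[col] / total
        fixed[col] = (df[col] - share).round().astype(int)

    # The last month duplicates the one before it, so drop the last row.
    return fixed.iloc[:-1]


# Toy data (made up): two countries, last row duplicated.
toy = pd.DataFrame({"7": [300, 400, 400], "8": [700, 600, 600]})
corrected = apply_known_corrections(toy, "7", "8", double_counted=500)
print(corrected)
```

Working on a copy keeps `raw_data` intact, which makes it easy to compare the data before and after the correction.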

Now let's try to visualize the dataset and see if anything suspicious appears:

In [5]:
# Insert code to visualize
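
A minimal plotting sketch, assuming matplotlib is installed, could look like the following. The helper `plot_cases` is hypothetical, and the sample values are the first two countries from the `head()` output above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_cases(df, countries):
    """Draw one line per selected country, with months on the x-axis."""
    fig, ax = plt.subplots(figsize=(8, 4))
    for country in countries:
        ax.plot(df.index, df[country], marker="o", label=f"Country {country}")
    ax.set_xlabel("Month")
    ax.set_ylabel("Cases at start of month")
    ax.legend()
    return fig, ax


# Countries 0 and 1 from the head() output shown earlier.
sample = pd.DataFrame({
    "0": [1140, 1211, 1238, 1093, 966],
    "1": [1113, 1378, 1587, 2075, 2269],
})
fig, ax = plot_cases(sample, ["0", "1"])
```

With 175 countries, plotting a handful at a time keeps the chart readable. Sudden jumps or perfectly flat segments are the kind of pattern worth a second look.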

## 3.3 Data Annotation


## 3.4. Data & PyGrid

## 3.5 DP & Datasets

### 3.5.1. DP Primer
### 3.5.2. DP Metadata needed for PyGrid
### 3.5.3. Loading the data to PyGrid!

## 3.6 Linking Data From Multiple Sources

In this course, we'll be using the PySyft and PyGrid frameworks to link data from multiple sources (called nodes). This is both really cool and very useful because it lets us perform data science and machine learning on private data on someone else's machine or server, without compromising the privacy of anyone in the dataset.


In remote data science, it's likely that not all of our data comes from the same source, so proper data annotation and cleaning become particularly important. This matters because:
* Different sources may generate data at different rates; some sources stream data whereas others produce data in batches (i.e. in a periodic manner, or at a certain time interval)
* Different sources may also have different measuring capabilities, and this might affect the reliability of a dataset. 
* Different nodes may have different privacy budgets allotted to their respective datasets, which means some datasets may be seen and used much less than others.

<b> DID YOU KNOW? </b> A historical example of the second point is the Geiger counters used in Chernobyl. <p>Immediately after the Chernobyl nuclear accident, many people weren't too concerned because their Geiger counters showed a reading of 3.6 Roentgen/hour, the equivalent of 10 X-rays. However, it was later discovered that the Geiger counters in use had a maximum detection limit, which meant they couldn't read any value higher than 3.6 R/h. When new, higher-range Geiger counters were used, it was quickly (and shockingly) realized that the radiation being leaked wasn't 3.6 R/h, but around 5.6 Roentgen per SECOND. This was the equivalent of _ nuclear bombs' worth of radiation per hour.</p>

Although this is an extreme example, it shows the importance of proper data acquisition and data annotation.
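
One way to guard against this kind of problem is to record each instrument's detection limit as part of the annotation and flag readings that sit at that limit. The sketch below is only an illustration: the function name `flag_saturated` and the sample readings are hypothetical, and the limit of 3.6 is the one from the story above.

```python
import pandas as pd


def flag_saturated(readings, detection_limit):
    """Mark readings at or above the instrument's maximum readable value."""
    return readings >= detection_limit


# Made-up readings from an instrument that tops out at 3.6 R/h.
readings = pd.Series([1.2, 3.6, 3.6, 0.8])
saturated = flag_saturated(readings, detection_limit=3.6)
print(saturated.tolist())
```

A flagged reading should be read as "at least this much" rather than as an exact measurement, and the annotation should say so.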

## Additional Resources and References

<ul><li> Lyko K., Nitzschke M., Ngonga Ngomo AC. (2016) Big Data Acquisition. In: Cavanillas J., Curry E., Wahlster W. (eds) New Horizons for a Data-Driven Economy. Springer, Cham. <a href="https://doi.org/10.1007/978-3-319-21569-3_4"> https://doi.org/10.1007/978-3-319-21569-3_4 </a> </li>
    <li> <a href="https://www.oreilly.com/library/view/implementing-a-smart/9781491983492/"> Implementing a Smart Data Platform, O'Reilly Media 2017 </a> </li>
    <li> <a href="https://www.oreilly.com/library/view/python-data-cleaning/9781800565661/"> Python Data Cleaning, O'Reilly Media 2020 </a></li>
</ul>