## Intro

These instructions are intended for data owners and data managers. This tutorial covers generating a mock dataset that follows the schema of the real dataset, then annotating the mock dataset and uploading it to the server.

For this step, you should have a real dataset for which to generate the mock one.

## Define the schema for the mock dataset

First, we will define the data schema for the synthetic data so that it matches the structure of the original data.

For example, assume the real dataset has the columns:
- video_id: unique, int
- video_url: unique, string
- video_suggestive_score (0-1): float
- video_description: unique, string
- algorithm: categorical (label)
- recommendations_id: categorical (label)

Then, the synthetic one will have the same columns, with the same datatypes.
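
One way to keep the mock data aligned with this schema is to write the schema down in code and check generated rows against it. The sketch below is only an illustration: `MOCK_SCHEMA` and `check_rows` are hypothetical names, the score is assumed to be a real number between 0 and 1, and the two categorical columns are assumed to hold a string label and an integer ID respectively.

```python
import numbers

# Hypothetical schema description: column name -> (expected type, must be unique?)
# The numbers ABCs also accept NumPy scalars such as np.int64 and np.float64.
MOCK_SCHEMA = {
    "video_id": (numbers.Integral, True),
    "video_url": (str, True),
    "video_suggestive_score": (numbers.Real, False),
    "video_description": (str, True),
    "algorithm": (str, False),
    "recommendations_id": (numbers.Integral, False),
}


def check_rows(rows, schema=MOCK_SCHEMA):
    """Return a list of problems found in `rows` (a list of dicts)."""
    problems = []
    for i, row in enumerate(rows):
        # Every row must have exactly the columns named in the schema
        if set(row) != set(schema):
            problems.append(f"row {i}: columns differ from schema")
            continue
        for column, (expected_type, _) in schema.items():
            if not isinstance(row[column], expected_type):
                problems.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Columns marked unique must not contain repeated values
    for column, (_, unique) in schema.items():
        values = [row[column] for row in rows if column in row]
        if unique and len(values) != len(set(values)):
            problems.append(f"{column}: duplicate values")
    return problems
```

An empty list returned by `check_rows` means the rows have the expected columns, types, and uniqueness.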

### Generate numerical values

The function below is an example of how mock numerical values can be populated. In our example, `video_id` and `video_suggestive_score` both store numerical values. To generate random numeric values, we will use NumPy's `random` module together with Python's built-in `random` module.

In [None]:
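import random

import numpy as np


def generate_numeric_columns(num):
    # Candidate pools: num + 1000 distinct IDs and num random scores in [0, 1]
    video_id_list = np.arange(1000, num + 2000)
    suggestive_score_list = [random.uniform(0, 1) for _ in range(num)]

    # IDs are drawn without replacement so that they stay unique
    video_id = np.random.choice(video_id_list, size=num, replace=False)
    suggestive_score = np.random.choice(suggestive_score_list, size=num, replace=True)

    return video_id, suggestive_score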

### Generate string values

We will use a Python library called [Faker](https://faker.readthedocs.io/en/master/). `Faker` generates fake data based on the specifications you provide.

In [None]:
!pip install Faker

In [None]:
# random.seed(1234)
from faker import Faker

Faker.seed(0)
faker = Faker()

In [None]:
def make_fake_data(num):
    # One dict per row; an MD5 hex digest stands in for the URL
    fake_data = [
        {
            "video_url": faker.md5(raw_output=False),
            "video_description": faker.text(max_nb_chars=50),
        }
        for _ in range(num)
    ]

    return fake_data

### Generate categorical data

In [None]:
def make_algo_and_recommendations_id(num):
    # Pools of allowed labels; requires `import numpy as np`
    algo_list = ["A", "B", "C", "D", "E"]
    video_id_list = np.arange(2000, num + 3000)

    # Categorical values may repeat, so both columns are drawn with replacement
    algorithm = np.random.choice(algo_list, size=num, replace=True)
    recommendations_id = np.random.choice(video_id_list, size=num, replace=True)

    return algorithm, recommendations_id

Lastly, combine the generated columns as you see fit for your project's use case to complete the mock dataset.
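
As a rough sketch of one way to do the combining, the hypothetical helper `combine_columns` below zips the per-column outputs into a list of row dictionaries. It assumes the inputs have the shapes returned by the functions above (sequences of equal length, with `fake_data` being a list of dicts); the tiny hand-written inputs only stand in for real generator output.

```python
def combine_columns(video_id, suggestive_score, fake_data, algorithm, recommendations_id):
    """Zip the generated columns into one list of row dictionaries."""
    lengths = {len(video_id), len(suggestive_score), len(fake_data),
               len(algorithm), len(recommendations_id)}
    if len(lengths) != 1:
        raise ValueError("all columns must have the same number of rows")

    rows = []
    for vid, score, text, algo, rec in zip(
        video_id, suggestive_score, fake_data, algorithm, recommendations_id
    ):
        rows.append(
            {
                "video_id": vid,
                "video_url": text["video_url"],
                "video_suggestive_score": score,
                "video_description": text["video_description"],
                "algorithm": algo,
                "recommendations_id": rec,
            }
        )
    return rows


# Tiny hand-written inputs, standing in for the outputs of the generators above
rows = combine_columns(
    video_id=[1001, 1002],
    suggestive_score=[0.25, 0.75],
    fake_data=[
        {"video_url": "abc", "video_description": "First clip."},
        {"video_url": "def", "video_description": "Second clip."},
    ],
    algorithm=["A", "B"],
    recommendations_id=[2001, 2002],
)
```

If you work with pandas, a list of dictionaries like `rows` can be passed directly to `pandas.DataFrame(rows)`.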

If the real data changes, you can easily update the way the mock data is generated (for example, by updating the ranges of numerical values or by adding or removing columns).
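
One way to make such updates cheap is to keep the ranges and labels in a single settings dictionary. The sketch below is an assumption-laden illustration, not the notebook's own method: `MOCK_SETTINGS` and `generate_from_settings` are hypothetical names, and it uses Python's built-in `random` module instead of NumPy to stay dependency-free.

```python
import random

# Hypothetical settings; edit these when the real dataset changes
MOCK_SETTINGS = {
    "video_id_range": (1000, 5000),  # IDs are drawn from 1000 up to, but excluding, 5000
    "score_range": (0.0, 1.0),       # scores are drawn between 0.0 and 1.0
    "algorithms": ["A", "B", "C", "D", "E"],
}


def generate_from_settings(num, settings=MOCK_SETTINGS, seed=None):
    """Generate `num` values per column according to `settings`."""
    rng = random.Random(seed)  # local generator, so a fixed seed gives repeatable output
    low, high = settings["video_id_range"]
    score_low, score_high = settings["score_range"]
    return {
        # sample() never repeats a value, which keeps the IDs unique
        "video_id": rng.sample(range(low, high), num),
        "video_suggestive_score": [rng.uniform(score_low, score_high) for _ in range(num)],
        "algorithm": [rng.choice(settings["algorithms"]) for _ in range(num)],
    }
```

Widening the ID range or adding a new algorithm label then only requires a change to `MOCK_SETTINGS`.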

## Uploading the dataset

After the dataset is generated, it needs to be annotated before it can be uploaded. The next steps are identical to those described in the previous notebook. Please follow the steps there to log in to the domain node, annotate the data, and then upload it.

```hagrid quickstart https://github.com/Poppy22/notebooks-openmined/blob/master/DO-DM/annotating_data.ipynb```

## Final thoughts

Congratulations on generating mock data! In this tutorial you should have achieved:\
✅ generating mocked data by following the structure of the real data\
✅ dealing with various column types when generating mocked data\
✅ annotating and uploading the dataset