# Hero Data Mocking Workshop
## Task Description
In this workshop scenario, we synthesize two datasets that match the requirements agreed with the team.

#### Scenario 1: NYC Crisis Dataset
Internally, the DS and DL teams have aligned on a set of datasets that they will transform and manage. Both agreed that training the forecaster requires a dataset of daily disaster counts in NYC over the last several days. We will need to synthesize:
- Date
- Frequency of Events

#### Scenario 2: Superhero Timecards
After several rounds of back-and-forth, the team has settled on the data fields they would like to collect from prospective users, i.e. our superheroes:
- Full Name
- Date
- Active-on-duty (a boolean field indicating if the superhero is on duty)

## Exercises
We will generate all our data samples using Python code and a few easy-to-use packages.

In [1]:
import pandas as pd
import random
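
Because every sample below comes from the `random` module, two runs of the notebook will produce different datasets. If reproducible output matters for the workshop, one option is to seed the generator first. The following is a minimal sketch; the seed value `42` is an arbitrary choice and not something specified in the original notebook.

```python
import random

# Seed value 42 is an arbitrary choice for illustration.
random.seed(42)
first_draw = [random.choice([0, 1, 3, 5]) for _ in range(5)]

# Re-seeding with the same value replays the same sequence of draws.
random.seed(42)
second_draw = [random.choice([0, 1, 3, 5]) for _ in range(5)]

print(first_draw == second_draw)  # True
```

Calling `random.seed(...)` once at the top of the notebook is usually enough to make all later `random.choice` calls repeatable.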

#### Let's define the pools of values to sample from

In [2]:
names = [
    "Spiderman",
    "Daredevil",
    "Wolverine",
    "Deadpool",
    "Luke Cage",
    "Mr Fantastic",
    "Dr X",
    "The Comedian",
    "Peeping Tom"
] # static list of superhero names

availability = [0, 1] # 1 means the superhero is on duty, 0 means off duty

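# Inclusive range from 30 days ago through today (31 dates in total)
events_30_days = pd.date_range(
    pd.Timestamp.now().date() - pd.Timedelta(days=30),
    pd.Timestamp.now().date()
)

# Inclusive range from 2 days ago through today (3 dates in total)
events_3_days = pd.date_range(
    pd.Timestamp.now().date() - pd.Timedelta(days=2),
    pd.Timestamp.now().date()
)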

crisis = [0, 1, 3, 5] # possible crisis counts for a single sampled record
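
`random.choice` picks each value in `crisis` with equal probability. If the counts should follow a skewed distribution instead, the standard library's `random.choices` accepts per-value weights. The sketch below assumes quiet days are more common than severe ones; the `weights` values and the seed are hypothetical and not part of the original notebook.

```python
import random

crisis = [0, 1, 3, 5]  # possible crisis counts for a single sampled record

# Hypothetical weights: quiet records are assumed more likely than severe ones.
weights = [0.5, 0.3, 0.15, 0.05]

random.seed(0)  # arbitrary seed so the draw is repeatable
samples = random.choices(crisis, weights=weights, k=1000)

# Count how often each value was drawn to check the skew.
counts = {value: samples.count(value) for value in crisis}
print(counts)
```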

#### Creating synthetic data via repeated random sampling

In [5]:
# Draw 100 independent records; the same date can appear more than once
raw_df = pd.DataFrame({
    "date": random.choice(events_30_days),
    "crisis": random.choice(crisis),
} for _ in range(100))

#### Create more realistic data!
Applying multiple levels of transformation and randomization is good practice when synthesizing realistic data samples. Here, summing the sampled records per date turns 100 individual draws into daily totals.

In [6]:
# Sum the sampled counts per date; .sum() avoids passing the builtin sum to agg
df = raw_df.groupby("date").sum().reset_index().sort_values("date")
df

| | date | crisis |
|---|---|---|
| 0 | 2022-05-01 | 6 |
| 1 | 2022-05-02 | 10 |
| 2 | 2022-05-04 | 7 |
| 3 | 2022-05-05 | 7 |
| 4 | 2022-05-06 | 3 |
| 5 | 2022-05-07 | 15 |
| 6 | 2022-05-08 | 11 |
| 7 | 2022-05-09 | 11 |
| 8 | 2022-05-10 | 8 |
| 9 | 2022-05-11 | 6 |
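
Note that 2022-05-03 is missing from the output above: random sampling never drew that date, so the aggregate has no row for it. A forecaster trained on daily counts usually expects one row per day. The following sketch shows one way to fill such gaps with zeros using `reindex`; the toy frame only mirrors the first rows of the output, and treating an unsampled day as zero crises is an assumption.

```python
import pandas as pd

# Toy aggregate mirroring the first rows of the output above; 2022-05-03 is absent.
df = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-02", "2022-05-04"]),
    "crisis": [6, 10, 7],
})

# Build the full daily calendar for the observed span.
full_range = pd.date_range(df["date"].min(), df["date"].max(), freq="D")

# Reindex onto the calendar; days with no sampled records become 0 (assumption).
filled = (
    df.set_index("date")
    .reindex(full_range, fill_value=0)
    .rename_axis("date")
    .reset_index()
)
print(filled)
```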


#### Saving our dataset

In [7]:
df.to_csv("../../data/nyc_crisis_may.csv", index=False)
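
When the file is read back later, the `date` column comes back as plain strings unless the reader is told to parse it. The sketch below illustrates the round trip with an in-memory buffer standing in for the CSV file, so it does not depend on the `../../data/` directory existing; the two-row frame is a toy stand-in for the real dataset.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-05-01", "2022-05-02"]),
    "crisis": [6, 10],
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False keeps the row index out of the file
buffer.seek(0)

# parse_dates restores the datetime dtype of the date column.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```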

### Let's create another dataframe with more complex data!

In [8]:
# Draw 20 independent timecard records; a hero can appear several times per date
raw_df = pd.DataFrame({
    "superhero": random.choice(names),
    "date": random.choice(events_3_days),
    "active_on_duty": random.choice(availability),
} for _ in range(20))

raw_df

| | superhero | date | active_on_duty |
|---|---|---|---|
| 0 | Mr Fantastic | 2022-05-31 | 1 |
| 1 | Daredevil | 2022-05-29 | 1 |
| 2 | Spiderman | 2022-05-30 | 1 |
| 3 | Mr Fantastic | 2022-05-30 | 0 |
| 4 | Spiderman | 2022-05-30 | 0 |
| 5 | Daredevil | 2022-05-29 | 1 |
| 6 | Deadpool | 2022-05-29 | 1 |
| 7 | Dr X | 2022-05-31 | 0 |
| 8 | Peeping Tom | 2022-05-30 | 1 |
| 9 | Peeping Tom | 2022-05-29 | 0 |


#### Once again, applying an aggregation to collapse duplicate records

In [9]:
# Keep one row per (superhero, date); max marks a hero as on duty if any record that day was 1
df = raw_df.groupby(["superhero", "date"]).max().reset_index()
df

| | superhero | date | active_on_duty |
|---|---|---|---|
| 0 | Daredevil | 2022-05-29 | 1 |
| 1 | Deadpool | 2022-05-29 | 1 |
| 2 | Deadpool | 2022-05-30 | 0 |
| 3 | Deadpool | 2022-05-31 | 0 |
| 4 | Dr X | 2022-05-31 | 0 |
| 5 | Luke Cage | 2022-05-30 | 0 |
| 6 | Mr Fantastic | 2022-05-30 | 0 |
| 7 | Mr Fantastic | 2022-05-31 | 1 |
| 8 | Peeping Tom | 2022-05-29 | 0 |
| 9 | Peeping Tom | 2022-05-30 | 1 |
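
The aggregated timecards only contain the (superhero, date) pairs that happened to be sampled, so some heroes have no row for some days. If a complete timesheet is wanted, one possible approach is to reindex onto the full grid of heroes and dates. This sketch uses a shortened roster and toy records, and it assumes a missing record means the hero was off duty; none of this is prescribed by the original notebook.

```python
import pandas as pd

names = ["Spiderman", "Daredevil"]  # shortened roster for illustration
dates = pd.date_range("2022-05-29", "2022-05-31")

# Toy aggregated timecards with gaps: not every hero has a row for every day.
df = pd.DataFrame({
    "superhero": ["Spiderman", "Daredevil", "Daredevil"],
    "date": pd.to_datetime(["2022-05-30", "2022-05-29", "2022-05-31"]),
    "active_on_duty": [1, 1, 0],
})

# Every (superhero, date) combination that a complete timesheet would contain.
grid = pd.MultiIndex.from_product([names, dates], names=["superhero", "date"])

# Assumption: a missing record means the hero was off duty (0).
complete = (
    df.set_index(["superhero", "date"])
    .reindex(grid, fill_value=0)
    .reset_index()
)
print(complete)
```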


In [10]:
df.to_csv("../../data/superhero_timesheets.csv", index=False)

## In this section
- We fabricated two synthetic datasets to demonstrate how Python and its randomization libraries can be used to create data when none exists.
- In the next section, we will attempt to use these datasets for a simple predictive task.
