# Simulating Time Series Data

So far, we have seen where to find **time series** data and how to process it. Now let's look at how to create **time series** data through simulation.

We will divide the discussion into three parts. In the first, we will compare **time series** simulations with other types of data simulation, seeing which new areas of concern arise when we account for the passage of time. In the second part, we will look at some code-based simulations. Finally, in the third part, we will analyze some general trends in **time series** simulation.

We will work through three specific examples that generate different types of **time series** data:
- we will simulate the email opening and donation behavior of members of a non-profit organization over several years;
- we will simulate events in a taxi fleet of a thousand vehicles with various shift start times and passenger boarding frequencies;
- we will simulate, step by step, the evolution of the magnetic state of a solid at a given temperature and size using the relevant laws of physics.

These three examples correspond to three classes of **time series** simulations:
- *heuristic simulations:*
    - we decide how the world should work, check that the rules make sense, and code them one by one;
- *discrete event simulations (DES):*
    - we create individual actors that follow certain rules in our universe and then run these actors to see how the universe evolves over time;
- *simulations based on laws of physics:*
    - we apply the laws of physics to see how a system evolves over time.

### Why is Time Series Simulation Special?

Data simulation is an area of Data Science that is rarely taught, despite being an especially useful skill for **time series** data. This follows from one of the downsides of temporal data: no two data points in the same time series are exactly comparable, because they occur at different times. If we want to think about *what could have happened at a given time*, we enter the world of simulation.

### Simulation versus Prediction

Simulation and forecasting are similar practices. In both, we must formulate hypotheses about the dynamics and parameters of the underlying system and then extrapolate from these hypotheses in order to generate data points. However, there are important differences to consider when learning and developing simulations rather than predictions:
- it may be easier to integrate qualitative observations into a simulation than into a prediction;
- simulations are run at scale, so that we can analyze many alternative scenarios, while forecasts must be generated with more care;
- the stakes of a simulation are lower than those of a prediction, as there are no lives or resources on the line. Therefore, you can be more creative and exploratory in your initial rounds of simulations. Sooner or later, of course, you will want to be able to justify how you built your simulations, just as you justify your predictions.
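
To make the "run at scale" point concrete, here is a minimal sketch that simulates many alternative paths of one process under two scenarios and compares their spread. The random-walk model, the helper name `simulate_paths`, and every parameter value are illustrative assumptions; they are not part of the examples that follow.

```python
import numpy as np

def simulate_paths(num_paths, num_steps, drift=0.0, seed=0):
    """Simulate `num_paths` independent random walks of `num_steps` steps each."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(loc=drift, scale=1.0, size=(num_paths, num_steps))
    return steps.cumsum(axis=1)

# two hypothetical scenarios that differ only in their drift
baseline   = simulate_paths(num_paths=500, num_steps=100, drift=0.0)
optimistic = simulate_paths(num_paths=500, num_steps=100, drift=0.1)

# compare the spread of the final values across the two scenarios
for name, paths in [('baseline', baseline), ('optimistic', optimistic)]:
    low, high = np.percentile(paths[:, -1], [5, 95])
    print(name, round(low, 1), round(high, 1))
```

Because each path is cheap to generate, it is easy to rerun the same code with different assumptions and look at the whole distribution of outcomes rather than a single trajectory.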

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

#### Doing it ourselves

In this first simulation, we write all the rules ourselves, taking care not to specify anything illogical.

In [2]:
# user status
years      = ['2014', '2015', '2016', '2017', '2018']

userStatus = ['bronze', 'silver', 'gold', 'inactive']

userYears  = np.random.choice(years, 1000, 
                             p = [0.1, 0.1, 0.15, 0.30, 0.35])

userStats  = np.random.choice(userStatus, 1000, 
                             p = [0.5, 0.3, 0.1, 0.1])

yearJoined = pd.DataFrame({'yearJoined': userYears})

userJoined = pd.DataFrame({'userJoined': userStats})

yearJoined, userJoined

(    yearJoined
 0         2016
 1         2017
 2         2018
 3         2017
 4         2015
 ..         ...
 995       2017
 996       2018
 997       2018
 998       2017
 999       2018
 
 [1000 rows x 1 columns],
     userJoined
 0       bronze
 1       bronze
 2       silver
 3     inactive
 4       bronze
 ..         ...
 995     silver
 996       gold
 997     bronze
 998     silver
 999     bronze
 
 [1000 rows x 1 columns])

Note that there are already many rules/assumptions integrated into the simulation just in these lines of code. We stipulate probabilities specific to the years in which members joined. We also made the user's status completely independent of the year they joined.
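
One way to inspect these assumptions is to tabulate the simulated draws. The sketch below repeats the two draws in a single table (the name `sim` and the fixed seed are assumptions made only for this illustration) and then checks the yearly shares and the independence of status and join year.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # seed chosen only to make the sketch reproducible

years      = ['2014', '2015', '2016', '2017', '2018']
userStatus = ['bronze', 'silver', 'gold', 'inactive']

sim = pd.DataFrame({
    'yearJoined': np.random.choice(years, 1000,
                                   p = [0.1, 0.1, 0.15, 0.30, 0.35]),
    'userStatus': np.random.choice(userStatus, 1000,
                                   p = [0.5, 0.3, 0.1, 0.1]),
})

# empirical share of users per join year; should be close to `p`
year_share = sim.yearJoined.value_counts(normalize = True).sort_index()
print(year_share)

# share of each status within each join year; the rows should look alike
# because status was drawn independently of the year
status_by_year = pd.crosstab(sim.yearJoined, sim.userStatus,
                             normalize = 'index')
print(status_by_year.round(2))
```

If we later decide that status should depend on the join year, this table is where the change would show up.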

In the next step, we will create a table indicating when members opened emails each week. Here, we will define our organization's behavior: sending three emails per week. We will also define distinct patterns of user behavior in relation to email:
- never opens the email;
- constant level of engagement/email opening rate;
- increase or decrease in the level of engagement.

In [3]:
NUM_EMAILS_SENT_WEEKLY = 3

# defining multiple functions for different patterns
def never_opens(period_rng):
    return []

def constant_open_rate(period_rng):
    n, p = NUM_EMAILS_SENT_WEEKLY, np.random.uniform(0, 1)
    num_opened = np.random.binomial(n, p, len(period_rng))
    return num_opened

def increasing_open_rate(period_rng):
    return open_rate_with_factor_change(period_rng,
                                        np.random.uniform(1.01, 
                                                          1.30))

def decreasing_open_rate(period_rng):
    return open_rate_with_factor_change(period_rng,
                                       np.random.uniform(0.5,
                                                         0.99))

def open_rate_with_factor_change(period_rng, fac):
    if len(period_rng) < 1:
        return []
    # weeks (about 10% of the period) in which the user opens nothing
    times = np.random.randint(0, len(period_rng),
                              int(0.1 * len(period_rng)))
    num_opened = np.zeros(len(period_rng))
    # draw the starting open probability once, so that `fac` compounds
    # from block to block instead of being overwritten by a fresh draw
    n, p = NUM_EMAILS_SENT_WEEKLY, np.random.uniform(0, 1)
    for prd in range(0, len(period_rng), 2):
        # the last block holds a single week when the period length is odd
        size = min(2, len(period_rng) - prd)
        num_opened[prd:(prd + size)] = np.random.binomial(n, p, size)
        p = max(min(1, p * fac), 0)
    num_opened[times] = 0
    return num_opened

We defined functions to simulate four distinct types of behavior:

*Users who never open the emails*
- never_opens()

*Users who open approximately the same number of emails every week*
- constant_open_rate()

*Users who open a decreasing number of emails each week*
- decreasing_open_rate()

*Users who open an increasing number of emails each week*
- increasing_open_rate()


Users whose interest grows and users whose interest fades over time are simulated in the same way: the *increasing_open_rate()* and *decreasing_open_rate()* functions both delegate to *open_rate_with_factor_change()*, differing only in the factor they pass.
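
As a minimal sketch of the intended effect of the factor, the snippet below tracks only the open probability, scaled by the factor after each two-week block and clipped to the interval [0, 1]. The helper `open_probabilities` and the starting values are hypothetical and exist only for this illustration.

```python
def open_probabilities(p0, fac, num_blocks):
    """Open probability per two-week block when scaled by `fac` after each block."""
    probs = []
    p = p0
    for _ in range(num_blocks):
        probs.append(p)
        p = max(min(1, p * fac), 0)   # same clipping as in the simulation
    return probs

rising  = open_probabilities(0.2, 1.2, 10)   # factor above 1: growing interest
falling = open_probabilities(0.8, 0.8, 10)   # factor below 1: fading interest
print([round(p, 2) for p in rising])
print([round(p, 2) for p in falling])
```

A factor slightly above or below 1 is enough to move a user toward opening every email or almost none over a few years of weekly data.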

It is also necessary to create a system to model donation behavior. We don't want to be too naive, or our simulation won't provide insight into what we should expect. In other words, we want to build our existing hypotheses about user behavior into the model, and then test whether simulations based on these hypotheses match what we see in real data. Next, we'll adopt a donation behavior that is loosely, but not deterministically, related to the number of emails a user has opened:

In [4]:
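## donation behavior
def produce_donations(period_rng, user_behavior, num_emails,
                      user_id, user_join_year):
    donation_amounts = np.array([0, 25, 50, 75, 100, 250, 500,
                                  1000, 1500, 2000])
    ## stable amount the user has set aside for donations
    user_has = np.random.choice(donation_amounts)
    email_fraction = num_emails / (NUM_EMAILS_SENT_WEEKLY * len(period_rng))
    user_gives = user_has * email_fraction
    ## index of the largest donation tier not exceeding user_gives
    ## (this step was missing from the listing; reconstructed as an assumption)
    user_gives_idx = np.where(user_gives >= donation_amounts)[0][-1]
    user_gives_idx = max(min(user_gives_idx,
                             len(donation_amounts) - 2),
                         1)
    ## the number of donations grows with the length of membership
    ## (the Poisson rate of 2 per year is an assumption)
    num_times_gave = np.random.poisson(2) * (2018 - user_join_year)
    ## weeks in which the donations happen
    times = np.random.randint(0, len(period_rng), num_times_gave)
    rows = []
    for n in range(num_times_gave):
        donation = donation_amounts[user_gives_idx + np.random.binomial(1, .3)]
        ts       = str(period_rng[times[n]].start_time + random_weekly_time_delta())
        rows.append({'user': user_id, 'amount': donation, 'timestamp': ts})
    ## DataFrame.append was removed in pandas 2.0, so build the table once
    dons = pd.DataFrame(rows, columns = ['user', 'amount', 'timestamp'])
    ## We do not report the absence of a donation event,
    ## as this would not be recorded in a database
    dons = dons[dons.amount != 0]
    return dons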
        

We take a few steps so that the code generates realistic behavior:
- the total number of donations depends on how long someone has been a user;
- we generate a wealth status per user, based on the behavioral hypothesis that the amount donated is related to a stable amount that a person has set aside for donations.
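
The second step can be sketched in isolation. The lookup below maps a user's budget and email engagement to a donation tier; the helper name `donation_tier_index` and the `np.where` lookup are assumptions about one reasonable way to do this, not a definitive implementation.

```python
import numpy as np

donation_amounts = np.array([0, 25, 50, 75, 100, 250, 500,
                             1000, 1500, 2000])

def donation_tier_index(user_has, email_fraction):
    """Index of the largest tier not exceeding user_has * email_fraction,
    kept between 1 and len(donation_amounts) - 2."""
    user_gives = user_has * email_fraction
    idx = np.where(user_gives >= donation_amounts)[0][-1]
    return int(max(min(idx, len(donation_amounts) - 2), 1))

mid_tier  = donation_tier_index(500, 0.5)    # budget 500, half of the emails opened
top_tier  = donation_tier_index(2000, 1.0)   # largest budget, every email opened
zero_tier = donation_tier_index(0, 0.0)      # nothing set aside
print(mid_tier, top_tier, zero_tier)
```

Clamping the index leaves room to step up one tier at random when an individual donation is drawn, without running past the end of the array.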

Since user behaviors are tied to specific *timestamps*, we will need to choose the weeks each user donated and during which period of that week they donated:

In [5]:
## utility function to choose a random time during the week
def random_weekly_time_delta():
    days_of_week     = [d for d in range(7)]
    hours_of_day     = [h for h in range(11, 23)]
    minute_of_hour   = [m for m in range(60)]
    second_of_minute = [s for s in range(60)]
    
    ## one random offset from the start of the week
    return pd.Timedelta(days    = int(np.random.choice(days_of_week)),
                        hours   = int(np.random.choice(hours_of_day)),
                        minutes = int(np.random.choice(minute_of_hour)),
                        seconds = int(np.random.choice(second_of_minute)))
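
As a usage sketch, a random offset of this kind can be added to the start of a weekly period to obtain a concrete timestamp inside that week. The week chosen here is arbitrary, and the offset is drawn inline so that the snippet stands on its own.

```python
import numpy as np
import pandas as pd

week = pd.Period('2018-01-01', freq = 'W')      # an arbitrary weekly period

# same ranges as above: any day of the week, hours 11 to 22
offset = pd.Timedelta(days    = int(np.random.randint(0, 7)),
                      hours   = int(np.random.randint(11, 23)),
                      minutes = int(np.random.randint(0, 60)),
                      seconds = int(np.random.randint(0, 60)))

ts = week.start_time + offset
print(week.start_time, '->', ts)
```

Restricting the hours keeps simulated donations out of the middle of the night, which is one more behavioral assumption built into the data.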

Now we will combine the components developed so far to simulate a set of users and their associated events, ensuring that events occur only after a user joins and that a user's email events bear some relationship to their donation events:

In [None]:
behaviors         = [never_opens,
                     constant_open_rate,
                     increasing_open_rate,
                     decreasing_open_rate]
## the weights must be passed as `p`; the third positional
## argument of np.random.choice is `replace`
user_behaviors    = np.random.choice(behaviors, 1000,
                                    p = [0.2, 0.5, 0.1, 0.2])
rng               = pd.period_range('2015-02-14', '2018-06-01', freq = 'W')
## DataFrame.append was removed in pandas 2.0, so we collect the
## per-user tables in lists and concatenate them once at the end
email_frames      = []
donation_frames   = []

for idx in range(yearJoined.shape[0]):
    ## Randomly generates the date a user would have joined
    join_date = pd.Timestamp(yearJoined.iloc[idx].yearJoined) + pd.Timedelta(str(np.random.randint(0, 365)) + 'days')
    join_date = min(join_date, pd.Timestamp('2018-06-01'))

    ## User must not have action timestamps before joining
    user_rng  = rng[rng.start_time > join_date]

    if len(user_rng) < 1:
        continue

    info = user_behaviors[idx](user_rng)
    if len(info) == len(user_rng):
        email_frames.append(pd.DataFrame({'user'        : [idx] * len(info),
                                          'week'        : [str(r.start_time) for r in user_rng],
                                          'emailsOpened': info}))
        donation_frames.append(produce_donations(user_rng, user_behaviors[idx],
                                                 sum(info), idx, join_date.year))

emails    = pd.concat(email_frames, ignore_index = True)
donations = pd.concat(donation_frames, ignore_index = True)
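
The rule that a user has no events before joining can be checked on its own. This sketch filters the weekly range for a single hypothetical join date; comparing each week's `start_time` with the join timestamp is an assumption about how the restriction is meant to work.

```python
import pandas as pd

rng       = pd.period_range('2015-02-14', '2018-06-01', freq = 'W')
join_date = pd.Timestamp('2017-03-15')   # hypothetical join date

# keep only the weeks that start after the user joined
user_rng = rng[rng.start_time > join_date]

print(len(rng), len(user_rng))
print(user_rng[0].start_time)
```

Every simulated email or donation for this user is then drawn from `user_rng`, so none of them can carry a timestamp earlier than the join date.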

Next, we'll look at the temporal behavior of donations to get an idea of how we can test this in future analysis or predictions. We will plot the total sum of donations we received for each month in the dataset:

In [None]:
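## `df` is assumed to be the donations table built above
df = donations.copy()
df.set_index(pd.to_datetime(df.timestamp), inplace = True)
df.sort_index(inplace = True)
## monthly totals; newer pandas releases spell the month-end alias 'ME'
df.groupby(pd.Grouper(freq = 'M')).amount.sum().plot()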
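
The same aggregation can be tried on a handful of hand-written rows before running it on the full simulation. The table `toy` is invented for illustration, and grouping by `index.to_period('M')` is one alternative to `pd.Grouper` that reports only the months in which donations occurred.

```python
import pandas as pd

# a few hypothetical donation events
toy = pd.DataFrame({'user'     : [1, 2, 1, 3],
                    'amount'   : [25, 100, 50, 250],
                    'timestamp': ['2017-01-05 12:00:00',
                                  '2017-01-20 18:30:00',
                                  '2017-02-03 11:15:00',
                                  '2017-04-10 21:45:00']})
toy = toy.set_index(pd.to_datetime(toy.timestamp)).sort_index()

# monthly totals keyed by calendar month
monthly = toy.groupby(toy.index.to_period('M')).amount.sum()
print(monthly)
```

Here March does not appear at all, whereas a `pd.Grouper` with a monthly frequency would report it with a total of zero.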

#### Building a Self-Managed Simulation Universe

Often, you have a specific system and want to define the rules for that system and see how it runs. In some cases, you want to see what a collection of individual agents does in aggregate over time. To do this, we can use *generators*.

*Generators* enable us to create a series of independent (or dependent) actors and run them to observe what they do, without a lot of *boilerplate* code to keep track of everything.

In the next example, we will explore a taxi simulation. We want to imagine how a fleet of taxis, scheduled to start their shifts at different times, might behave together. To do this, we will need to create many individual taxis, release them into an imaginary city, and have them report their activities.

We will start by trying to understand what a *generator* is in **Python**:

In [9]:
def taxi_id_number(num_taxis):
    arr = np.arange(num_taxis)
    np.random.shuffle(arr)
    for i in range(num_taxis):
        yield arr[i]

In [10]:
ids = taxi_id_number(10)
print(next(ids))
print(next(ids))
print(next(ids))

1
4
2
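
The key property shown above is that a generator keeps its own state between calls to `next`. The small sketch below, with a hypothetical `countdown` generator, makes that explicit and also shows what happens once the generator is exhausted.

```python
def countdown(n):
    """Yield n, n-1, ..., 1, remembering its position between calls."""
    while n > 0:
        yield n
        n -= 1

gen = countdown(3)
first    = next(gen)         # runs the body up to the first yield
rest     = list(gen)         # resumes where it stopped and drains the rest
leftover = next(gen, None)   # an exhausted generator has nothing left to give
print(first, rest, leftover)
# prints: 3 [2, 1] None
```

Each actor in a simulation can be written this way: the generator holds the actor's state, and the simulation advances it one step at a time.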
