# Optional: Generate synthetic call data for reproducibility

To study this code without access to real-world data, this notebook generates synthetic data that resembles the original data. The data is organized in rows, where each row represents one connection of a mobile phone to an antenna. Each row consists of the pipe-separated fields `timestamp|userid|zip1|zip2|lat|lon`, where

* `timestamp` is the timestamp of the connection in the format `YYYYMMDDHHMMSS`
* `userid` is a hashed user-id
* `zip1` and `zip2` are ZIP-codes and are ignored in our analysis
* `lat` is the latitude of the antenna (in degrees)
* `lon` is the longitude of the antenna (in degrees)
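As a minimal sketch of how such a row could be read back, the following snippet parses one line into typed fields. It assumes the fields are separated by `|`, as written by the generator at the end of this notebook, and that they appear in the order listed above. The names `Connection` and `parse_row` and the example row are hypothetical and only serve as illustration.

```python
import datetime
from typing import NamedTuple


class Connection(NamedTuple):
    """One row of the call data (hypothetical helper type, not part of the notebook)."""

    timestamp: datetime.datetime
    userid: str
    lat: float
    lon: float


def parse_row(line: str) -> Connection:
    """Parse one pipe-separated row; zip1 and zip2 are read but discarded."""
    timestamp, userid, _zip1, _zip2, lat, lon = line.rstrip("\n").split("|")
    return Connection(
        timestamp=datetime.datetime.strptime(timestamp, "%Y%m%d%H%M%S"),
        userid=userid,
        lat=float(lat),
        lon=float(lon),
    )


# Made-up example row, for illustration only.
row = parse_row("20240102030405|c4ca4238a0b923820dcc509a6f75849b|00|000|51.5|9.9\n")
print(row.timestamp.isoformat(), row.lat, row.lon)
```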

These are the main parameters for the data generator in this notebook:

In [None]:
n_antennas = 20  # number of antennas
n_user = 1000  # number of users
delay_range = [5, 240]  # range [min, max) of the waiting period between connections, in minutes
total_time = 10080  # The total time covered by the dataset in minutes (here: 1 week)
data_directory = "../data/0_input_data/" # directory where synthetic data should be stored
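These parameters determine the approximate size of the generated file. The following back-of-the-envelope sketch assumes what the generator further down does: each waiting period is drawn uniformly from the integers in `delay_range`, with the upper bound excluded. The variable names `mean_delay`, `rows_per_user` and `estimated_rows` are introduced here for illustration only, and the result is a rough estimate, not an exact row count.

```python
n_user = 1000
delay_range = [5, 240]
total_time = 10080

# Integers are drawn from 5..239, so the mean delay is the midpoint of that range.
mean_delay = (delay_range[0] + delay_range[1] - 1) / 2

# Each user produces roughly one row per mean delay over the covered period.
rows_per_user = total_time / mean_delay
estimated_rows = n_user * rows_per_user

print(f"mean delay: {mean_delay} min")
print(f"about {rows_per_user:.0f} rows per user, {estimated_rows:.0f} rows in total")
```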

We build this on `numpy`, `geojson`, `requests`, and some standard library utilities:

In [None]:
import datetime
import geojson
import hashlib
import numpy as np
import os
import requests

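Make sure that the data directories exist:

In [None]:
os.makedirs(os.path.join(data_directory, "calldata/unzipped"), exist_ok=True)
os.makedirs(os.path.join(data_directory, "study_region"), exist_ok=True)  # needed for the GeoJSON file below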

We download a GeoJSON file with administrative regions and store it in the data directory:

In [None]:
# Example GeoJSON file for Germany
req = requests.get("https://raw.githubusercontent.com/isellsoap/deutschlandGeoJSON/main/3_regierungsbezirke/1_sehr_hoch.geo.json", timeout=60)
req.raise_for_status()  # fail early if the download did not succeed
regions = geojson.loads(req.content.decode())

In [None]:
with open(os.path.join(data_directory, "study_region","study_region_germany.geojson"), "w") as f:
    geojson.dump(regions, f)

We calculate the bounding box so that we can later distribute our antennas within that region of interest:

In [None]:
coords = np.array(list(geojson.utils.coords(regions)))
lon_range = coords[:, 0].min(), coords[:, 0].max()
lat_range = coords[:, 1].min(), coords[:, 1].max()

We start the generation of the raw call data by defining the mobility matrix between antennas. All of its rows are normalized so that we can use them as transition probabilities:

In [None]:
mobility_matrix = np.random.rand(n_antennas, n_antennas)
sum_of_rows = mobility_matrix.sum(axis=1)
mobility_matrix = mobility_matrix / sum_of_rows[:, np.newaxis]
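To illustrate why the row normalization matters, here is a small sketch that builds such a matrix for a hypothetical set of four antennas, checks that every row is a valid probability distribution, and then simulates a short random walk. The fixed seed and the names `matrix` and `path` are assumptions made for this example and are not used elsewhere in the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the example repeatable
n = 4  # small hypothetical number of antennas

matrix = rng.random((n, n))
matrix = matrix / matrix.sum(axis=1, keepdims=True)  # normalize each row to sum to 1

# Every row is now a valid probability distribution.
assert np.allclose(matrix.sum(axis=1), 1.0)
assert (matrix >= 0).all()

# Simulate a short random walk: the next antenna depends only on the current one.
antenna = 0
path = [antenna]
for _ in range(5):
    antenna = int(rng.choice(n, p=matrix[antenna]))
    path.append(antenna)
print(path)
```

Row `i` of the matrix holds the probabilities of moving from antenna `i` to every antenna, including staying at `i`.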

We now create artificial coordinates for our antennas:

In [None]:
antenna_coordinates = np.random.rand(n_antennas, 2)
antenna_coordinates[:, 0] = lon_range[0] + antenna_coordinates[:, 0] * (lon_range[1] - lon_range[0])
antenna_coordinates[:, 1] = lat_range[0] + antenna_coordinates[:, 1] * (lat_range[1] - lat_range[0])
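The same scaling can also be written with `Generator.uniform`, which draws directly from a given interval. The sketch below is an alternative formulation, not the notebook's own code. The bounding box values are illustrative placeholders rather than the values computed from the GeoJSON file, and the seed is an assumption made for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the example repeatable

# Placeholder bounding box for illustration; the notebook derives it from the GeoJSON file.
lon_range = (6.0, 15.0)
lat_range = (47.0, 55.0)
n_antennas = 20

# Column 0 holds longitudes, column 1 holds latitudes, as in the cell above.
antenna_coordinates = np.column_stack(
    [
        rng.uniform(lon_range[0], lon_range[1], size=n_antennas),
        rng.uniform(lat_range[0], lat_range[1], size=n_antennas),
    ]
)
print(antenna_coordinates[:3])
```

Because the points are drawn uniformly from the bounding box, some antennas may fall outside the actual borders of the study region.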

Next, we generate artificial antenna connections user by user:

In [None]:
rng = np.random.default_rng()  # create the random generator once and reuse it
with open(os.path.join(data_directory, "calldata/unzipped", "synthetic.txt"), "w") as f:
    start_time = datetime.datetime.now(datetime.timezone.utc)
    for i in range(n_user):
        user_hash = hashlib.md5(str(i).encode()).hexdigest()  # stand-in for a hashed user id
        current_time = start_time
        current_antenna = rng.integers(0, n_antennas)
        while current_time - start_time < datetime.timedelta(minutes=total_time):
            # wait a random number of minutes, then move to the next antenna
            current_time += datetime.timedelta(minutes=int(rng.integers(*delay_range)))
            current_antenna = rng.choice(n_antennas, p=mobility_matrix[current_antenna, :])
            # column 0 holds the longitude, column 1 the latitude
            lon, lat = antenna_coordinates[current_antenna]
            # fields: timestamp|userid|zip1|zip2|lat|lon
            f.write(
                f"{current_time.strftime('%Y%m%d%H%M%S')}|{user_hash}|00|000|{lat}|{lon}\n"
            )

You could also shuffle the rows of this file, since they are written user by user and in chronological order.
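One possible way to shuffle the rows is sketched below using only the standard library. The helper `shuffle_rows`, its `seed` parameter, and the example rows are hypothetical. For the generated file, the list of rows could be obtained with `readlines()` and written back with `writelines()`, assuming the file fits into memory.

```python
import random


def shuffle_rows(rows, seed=None):
    """Return a shuffled copy of the rows; the input list is left untouched."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Made-up rows standing in for the lines of synthetic.txt.
rows = [
    f"2024010100{minute:02d}00|user{minute % 3}|00|000|51.0|9.0\n"
    for minute in range(10)
]
shuffled_rows = shuffle_rows(rows, seed=42)
print("".join(shuffled_rows[:3]), end="")
```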