# Benchmarking Datasets Creation

- First, we import the required libraries.

In [1]:
import json
import numpy as np
import pandas as pd

np.random.seed(42) # For reproducibility

- We will create a list of the businesses (as Python dictionaries) from the [Yelp dataset](https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset).

In [2]:
with open('./data/yelp_academic_dataset_business.json', encoding='utf-8') as businesses_file:
    businesses = []
    for business in businesses_file:
        businesses.append(json.loads(business))
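
- The business file is read line by line because it stores one JSON object per line, so calling `json.load` on the whole file would not work. Below is a minimal sketch of the same pattern on an in-memory sample. The two records and their values are made up for illustration, and blank lines are skipped as an extra precaution that the cell above does not take.

```python
import io
import json

# Two made-up records, one JSON object per line.
sample_file = io.StringIO(
    '{"business_id": "a1", "city": "Tucson", "state": "AZ"}\n'
    '{"business_id": "b2", "city": "Affton", "state": "MO"}\n'
)

# Parse each non-empty line independently into a Python dictionary.
sample_businesses = [json.loads(line) for line in sample_file if line.strip()]

print(len(sample_businesses))        # number of parsed records
print(sample_businesses[0]["city"])  # field access on a parsed dictionary
```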

- We will create our stores dataset using the locations of the businesses from the Yelp dataset. This way, the store locations in our dataset follow the geographic distribution seen in real-world location-based applications.

In [3]:
stores_df = pd.DataFrame(businesses)
stores_df = stores_df[['address', 'city', 'state', 'postal_code', 'latitude', 'longitude']]
stores_df.head()

Unnamed: 0,address,city,state,postal_code,latitude,longitude
0,"1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197
1,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695
2,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452
3,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564
4,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659


- We will use the `info()` method to get an overview of the DataFrame and ensure that the datatypes are correct.

In [4]:
stores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150346 entries, 0 to 150345
Data columns (total 6 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   address      150346 non-null  object 
 1   city         150346 non-null  object 
 2   state        150346 non-null  object 
 3   postal_code  150346 non-null  object 
 4   latitude     150346 non-null  float64
 5   longitude    150346 non-null  float64
dtypes: float64(2), object(4)
memory usage: 6.9+ MB


- We will replace empty strings with `NaN`.

In [5]:
stores_df = stores_df.replace('', np.nan)

- We will check for missing values.

In [6]:
pd.concat([stores_df.isna().sum(), stores_df.isna().mean()],
          axis=1, keys=['missing_count', 'missing_ratio'])

Unnamed: 0,missing_count,missing_ratio
address,5127,0.034101
city,0,0.0
state,0,0.0
postal_code,73,0.000486
latitude,0,0.0
longitude,0,0.0
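
- Empty strings are not treated as missing by `isna()`, which is why they are converted to `NaN` before counting. A small sketch of both steps on a made-up frame (the values below are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one empty address and one empty postal code.
toy = pd.DataFrame({
    "address": ["935 Race St", "", "101 Walnut St"],
    "postal_code": ["19107", "85711", ""],
})

# Before the replacement, isna() finds nothing because '' is a valid string.
before = int(toy.isna().sum().sum())
toy = toy.replace("", np.nan)

# Same summary as in the notebook: absolute count and share of missing values.
summary = pd.concat([toy.isna().sum(), toy.isna().mean()],
                    axis=1, keys=["missing_count", "missing_ratio"])
print(before)
print(summary)
```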


- We will drop the rows containing missing values, which make up a small portion of the data (at most about 3.5% of the rows).

In [7]:
stores_df = stores_df.dropna().reset_index(drop=True)

- We will ensure that the latitudes fall within the range of `[-90, 90]` and the longitudes fall within the range of `[-180, 180]`.

In [8]:
stores_df = stores_df[(stores_df.latitude >= -90) & (stores_df.latitude <= 90)
                      & (stores_df.longitude >= -180) & (stores_df.longitude <= 180)]
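
- The same range check can also be written with `Series.between`, which is inclusive on both ends by default. This is an equivalent sketch on made-up coordinates, not a change to the notebook; the trailing `.copy()` is an optional addition that makes the filtered frame independent before new columns are added to it.

```python
import pandas as pd

# Made-up coordinates; the second and third rows are deliberately invalid.
coords = pd.DataFrame({
    "latitude": [34.43, 95.0, 40.34],
    "longitude": [-119.71, -90.34, 181.0],
})

# between() includes both bounds, matching [-90, 90] and [-180, 180].
valid = coords.latitude.between(-90, 90) & coords.longitude.between(-180, 180)

# .copy() returns an independent frame, so adding columns later is safe.
clean = coords[valid].copy()
print(len(clean))
```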

- We will generate an auto-increment id for each store.

In [9]:
stores_df['id'] = np.arange(1, len(stores_df) + 1)  # 1-based ids; avoids shadowing the built-in id()

- We will generate our own generic store names, as the names of the original businesses listed in the Yelp dataset are not relevant to the specific queries we have chosen for our benchmark.

In [10]:
store_names = []
for i in range(1, len(stores_df) + 1):
    store_names.append(f'store_{i}')

stores_df['name'] = store_names
stores_df.head()

Unnamed: 0,address,city,state,postal_code,latitude,longitude,id,name
0,"1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,1,store_1
1,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,2,store_2
2,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3,store_3
3,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4,store_4
4,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,5,store_5


- We will also generate generic store descriptions to make the size of the table more realistic.

In [11]:
store_desc = []
for i in range(1, len(stores_df) + 1):
    store_desc.append(f'This is store {i}')

stores_df['description'] = store_desc
stores_df.head()

Unnamed: 0,address,city,state,postal_code,latitude,longitude,id,name,description
0,"1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,1,store_1,This is store 1
1,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,2,store_2,This is store 2
2,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3,store_3,This is store 3
3,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4,store_4,This is store 4
4,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,5,store_5,This is store 5


- We will create two additional datasets, one containing 50k stores and another containing 100k stores, which will be subsets of the original dataset.

In [12]:
stores_df_50k = stores_df.sample(50_000)
stores_df_100k = stores_df.sample(100_000)
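
- `sample()` draws rows without replacement by default and, when no `random_state` is passed, relies on NumPy's global random state, which is why the seed set at the top of the notebook matters. The two calls are independent draws, so the 50k dataset is a subset of the full dataset but not necessarily of the 100k dataset. The sketch below uses a made-up frame and passes an explicit `random_state`, which is an optional choice and not what the notebook does. It also shows a nested alternative in case nested subsets are desired.

```python
import pandas as pd

toy_stores = pd.DataFrame({"id": range(1, 11)})  # ten made-up stores

# Two independent draws without replacement; random_state pins each draw.
small = toy_stores.sample(3, random_state=0)
large = toy_stores.sample(6, random_state=1)

# Both samples are subsets of the full frame and contain no duplicate ids.
print(small.id.isin(toy_stores.id).all(), small.id.is_unique)

# Nested alternative: draw the small sample from the large one.
nested_small = large.sample(3, random_state=0)
print(nested_small.id.isin(large.id).all())
```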

- We will save the `stores_df_50k`, `stores_df_100k` & `stores_df` DataFrames as CSV files.
- The columns are listed explicitly to define the order of the CSV columns.

In [13]:
columns = ['id', 'name', 'description', 'address', 'city', 'state', 'postal_code', 'latitude', 'longitude']
stores_df_50k[columns].to_csv('./datasets/stores_50k.csv', index=False)
stores_df_100k[columns].to_csv('./datasets/stores_100k.csv', index=False)
stores_df[columns].to_csv('./datasets/stores_full.csv', index=False)

- We will now read the column containing the food entries and their nutrients from the [MyFitnessPal dataset](https://www.kaggle.com/datasets/zvikinozadze/myfitnesspal-dataset).

In [14]:
mfp_df = pd.read_csv('./data/mfp-diaries.tsv', sep='\t', usecols=[2], header=None)
mfp_df.head()

Unnamed: 0,2
0,"[{""meal"": ""MY food"", ""dishes"": [{""nutritions"":..."
1,"[{""meal"": ""MY food"", ""dishes"": [{""nutritions"":..."
2,"[{""meal"": ""MY food"", ""dishes"": [{""nutritions"":..."
3,"[{""meal"": ""MY food"", ""dishes"": [{""nutritions"":..."
4,"[{""meal"": ""MY food"", ""dishes"": [{""nutritions"":..."


- Each diary record (`mfp_df` record) contains multiple food entries.
- We will create a list with the nutrients of each food entry in the MyFitnessPal dataset.

In [15]:
nutrients = []
for i in range(len(mfp_df)):
    for food_entry in json.loads(mfp_df.iloc[i, 0])[0]['dishes']:
        nutrients.append([int(nutritional_fact['value'].replace(',', ''))
                          for nutritional_fact in food_entry['nutritions']])
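
- To make the parsing step concrete, here is a sketch that runs the same logic on a single made-up diary cell. The record below is an assumption: only the `meal`, `dishes`, `nutritions` and `value` keys appear in the notebook, while the `name` keys and all values are invented for illustration.

```python
import json

# A made-up diary cell: a JSON list of meals, each holding a list of dishes.
diary_cell = json.dumps([
    {"meal": "MY food", "dishes": [
        {"nutritions": [{"name": "Calories", "value": "1,200"},
                        {"name": "Carbs", "value": "29"}]},
        {"nutritions": [{"name": "Calories", "value": "170"},
                        {"name": "Carbs", "value": "25"}]},
    ]},
])

rows = []
# [0] selects the first element of the parsed list, as in the cell above.
for dish in json.loads(diary_cell)[0]["dishes"]:
    # Values are strings that may contain thousands separators, e.g. "1,200".
    rows.append([int(fact["value"].replace(",", "")) for fact in dish["nutritions"]])

print(rows)
```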

- We will create our products dataset using the nutrients of the food entries from the MyFitnessPal dataset. This way, the products in our dataset mirror the nutritional composition of real food items. The relative distributions of the nutrients will therefore be realistic, which in turn makes the results of filtering in the benchmark more relevant.

In [16]:
products_df = pd.DataFrame(nutrients, columns=['calories', 'carbs', 'fat', 'protein', 'sodium', 'sugar'])
products_df = products_df[['calories', 'protein', 'carbs', 'fat']]
# Series.round is vectorized, so it avoids calling Python's round() once per element.
products_df.protein = products_df.protein.round(2)
products_df.carbs = products_df.carbs.round(2)
products_df.fat = products_df.fat.round(2)
products_df

Unnamed: 0,calories,protein,carbs,fat
0,412,21.0,29.0,24.0
1,170,20.0,25.0,5.0
2,176,5.0,33.0,1.0
3,342,24.0,34.0,12.0
4,180,21.0,22.0,7.0
...,...,...,...,...
1987588,263,11.0,23.0,14.0
1987589,120,0.0,22.0,5.0
1987590,263,11.0,23.0,14.0
1987591,180,18.0,0.0,12.0


- We will use the `describe()` method to check some descriptive statistics about our DataFrame.

In [17]:
from IPython.display import display

pd.set_option('display.float_format', lambda x: '%.4f' % x)
display(products_df.describe())
pd.reset_option('display.float_format')

Unnamed: 0,calories,protein,carbs,fat
count,1987593.0,1980099.0,1982793.0,1981778.0
mean,108.5035,6.6232,11.8055,5.2725
std,954.3548,37.6082,46.2222,98.3847
min,-500.0,-31.0,-80.0,-25.0
25%,30.0,0.0,0.0,0.0
50%,77.0,2.0,4.0,1.0
75%,140.0,7.0,17.0,5.0
max,1200800.0,25200.0,47000.0,132088.0


- We can observe that there are some rows with invalid nutrient values, which is expected since the data has been provided by the users of MyFitnessPal.
- We will only keep the rows where the values of the nutrients fall within plausible ranges.

In [18]:
products_df = products_df[(products_df.calories > 0) & (products_df.calories < 2000)]
products_df = products_df[(products_df.protein >= 0) & (products_df.protein < 500)]
products_df = products_df[(products_df.carbs >= 0) & (products_df.carbs < 500)]
products_df = products_df[(products_df.fat >= 0) & (products_df.fat < 200)]
len(products_df)

1854591

- We will check for missing values.

In [19]:
pd.concat([products_df.isna().sum(), products_df.isna().mean()],
          axis=1, keys=['missing_count', 'missing_ratio'])

Unnamed: 0,missing_count,missing_ratio
calories,0,0.0
protein,0,0.0
carbs,0,0.0
fat,0,0.0


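- No missing values remain, because the range filters of the previous step already excluded them: a comparison against `NaN` evaluates to `False`. We will still call `dropna()` as a safeguard and reset the index so that the rows are numbered consecutively.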

In [20]:
products_df = products_df.dropna().reset_index(drop=True)
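
- The `count` row of the `describe()` output shows that some nutrient values were missing before filtering, yet the missing-value check reports zero afterwards. The sketch below, on made-up values, shows why: any comparison involving `NaN` is `False`, so a range mask rejects missing rows together with out-of-range ones.

```python
import numpy as np
import pandas as pd

# Made-up values: one missing and one negative protein entry.
toy = pd.DataFrame({"protein": [21.0, np.nan, -31.0, 5.0]})

# NaN >= 0 evaluates to False, so the mask rejects the missing row as well.
mask = (toy.protein >= 0) & (toy.protein < 500)
filtered = toy[mask]

print(mask.tolist())
print(int(filtered.protein.isna().sum()))
```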

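- We will generate the prices of the products using a log-normal distribution whose underlying normal distribution has a mean of `1` and a standard deviation of `0.5`. These are the parameters that `np.random.lognormal` expects, not the mean and standard deviation of the prices themselves.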

In [21]:
mean, std_dev = 1, 0.5

prices = np.random.lognormal(mean, std_dev, size=len(products_df))
prices = np.vectorize(round)(prices, 2)
products_df['price'] = prices
products_df.head()

Unnamed: 0,calories,protein,carbs,fat,price
0,412,21.0,29.0,24.0,1.49
1,170,20.0,25.0,5.0,2.52
2,176,5.0,33.0,1.0,2.95
3,342,24.0,34.0,12.0,0.95
4,180,21.0,22.0,7.0,1.03
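
- For a log-normal distribution with underlying normal parameters `mu` and `sigma`, the median is `exp(mu)` and the mean is `exp(mu + sigma**2 / 2)`. The sketch below checks this empirically. It uses a local generator with an arbitrary seed, so its samples are unrelated to the prices generated in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # local generator; the seed is arbitrary
mu, sigma = 1.0, 0.5            # parameters of the underlying normal

sample_prices = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

# Theoretical values for a log-normal distribution.
expected_median = np.exp(mu)                # about 2.72
expected_mean = np.exp(mu + sigma**2 / 2)   # about 3.08

# The empirical values should land close to the theoretical ones.
print(float(np.median(sample_prices)), float(sample_prices.mean()))
```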


- We will generate an auto-increment id for each product.

In [22]:
products_df['id'] = np.arange(1, len(products_df) + 1)  # 1-based ids; avoids shadowing the built-in id()

- We will generate our own generic product names, as the names of the food entries listed in the MyFitnessPal dataset are not relevant to the specific queries we have chosen for our benchmark.

In [23]:
product_names = []
for i in range(1, len(products_df) + 1):
    product_names.append(f'product_{i}')

products_df['name'] = product_names
products_df.head()

Unnamed: 0,calories,protein,carbs,fat,price,id,name
0,412,21.0,29.0,24.0,1.49,1,product_1
1,170,20.0,25.0,5.0,2.52,2,product_2
2,176,5.0,33.0,1.0,2.95,3,product_3
3,342,24.0,34.0,12.0,0.95,4,product_4
4,180,21.0,22.0,7.0,1.03,5,product_5


- We will also generate generic product descriptions to make the size of the table more realistic.

In [24]:
product_desc = []
for i in range(1, len(products_df) + 1):
    product_desc.append(f'This is product {i}')

products_df['description'] = product_desc
products_df.head()

Unnamed: 0,calories,protein,carbs,fat,price,id,name,description
0,412,21.0,29.0,24.0,1.49,1,product_1,This is product 1
1,170,20.0,25.0,5.0,2.52,2,product_2,This is product 2
2,176,5.0,33.0,1.0,2.95,3,product_3,This is product 3
3,342,24.0,34.0,12.0,0.95,4,product_4,This is product 4
4,180,21.0,22.0,7.0,1.03,5,product_5,This is product 5


- We will assign the products to stores in a round-robin fashion to achieve an evenly distributed number of products at each store.

In [25]:
store_ids = []
for i in range(len(products_df)):
    store_ids.append((i % len(stores_df)) + 1)

products_df['store_id'] = store_ids
products_df.head()

Unnamed: 0,calories,protein,carbs,fat,price,id,name,description,store_id
0,412,21.0,29.0,24.0,1.49,1,product_1,This is product 1,1
1,170,20.0,25.0,5.0,2.52,2,product_2,This is product 2,2
2,176,5.0,33.0,1.0,2.95,3,product_3,This is product 3,3
3,342,24.0,34.0,12.0,0.95,4,product_4,This is product 4,4
4,180,21.0,22.0,7.0,1.03,5,product_5,This is product 5,5
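
- The round-robin rule maps the product at 0-based position `i` to store `(i % number_of_stores) + 1`, so the number of products per store differs by at most one. A small sketch with made-up sizes:

```python
from collections import Counter

n_products, n_stores = 11, 4  # made-up sizes

# Product i (0-based) goes to store (i % n_stores) + 1, so store ids start at 1.
store_ids = [(i % n_stores) + 1 for i in range(n_products)]

# Count the products per store and measure the gap between the extremes.
counts = Counter(store_ids)
spread = max(counts.values()) - min(counts.values())

print(store_ids)
print(spread)
```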


- We will create two more datasets, each containing the products of its corresponding stores dataset.

In [26]:
products_df_50k = products_df[products_df.store_id.isin(stores_df_50k.id)]
products_df_100k = products_df[products_df.store_id.isin(stores_df_100k.id)]
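
- Filtering with `isin` keeps exactly the products whose `store_id` exists in the corresponding stores subset, so every product in a subset still points to a store in that subset. A sketch on made-up frames:

```python
import pandas as pd

# Made-up data: four stores, a subset of two, and six products.
stores = pd.DataFrame({"id": [1, 2, 3, 4]})
stores_subset = stores[stores.id.isin([2, 4])]
products = pd.DataFrame({"id": range(1, 7), "store_id": [1, 2, 3, 4, 1, 2]})

# Keep only the products whose store is present in the subset.
products_subset = products[products.store_id.isin(stores_subset.id)]

print(products_subset.store_id.tolist())
```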

- We will save the `products_df_50k`, `products_df_100k` & `products_df` DataFrames as CSV files.
- The columns are listed explicitly to define the order of the CSV columns.

In [27]:
columns = ['id', 'name', 'description', 'price', 'calories', 'protein', 'carbs', 'fat', 'store_id']
products_df_50k[columns].to_csv('./datasets/products_50k.csv', index=False)
products_df_100k[columns].to_csv('./datasets/products_100k.csv', index=False)
products_df[columns].to_csv('./datasets/products_full.csv', index=False)

- We will create a DataFrame containing the unique combinations of the `city` & `state` columns for each of the `stores_df_50k`, `stores_df_100k` & `stores_df` DataFrames.
- Then we will add a `locations` column that concatenates the `city` & `state` columns of each `unique_combinations` DataFrame in the format 'city, state'.
- Finally, we will save the `locations` column of each `unique_combinations` DataFrame as a CSV file.
- The locations will later be geocoded to obtain their coordinates, which we will then use to produce relevant data for the load testing requests.

In [28]:
def locations_unique_combinations_to_csv(df, path):
    unique_combinations = df[['city', 'state']].drop_duplicates()
    unique_combinations['locations'] = unique_combinations['city'] + ', ' + unique_combinations['state']
    unique_combinations[['locations']].to_csv(path, index=False)

In [29]:
locations_unique_combinations_to_csv(stores_df_50k, './data/locations_50k.csv')
locations_unique_combinations_to_csv(stores_df_100k, './data/locations_100k.csv')
locations_unique_combinations_to_csv(stores_df, './data/locations_full.csv')
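
- As a quick illustration of what the helper writes, the sketch below applies the same two steps to a made-up frame and keeps the result in memory instead of saving it:

```python
import pandas as pd

# Made-up rows: "Tucson, AZ" appears twice and should be emitted once.
toy = pd.DataFrame({
    "city": ["Tucson", "Philadelphia", "Tucson"],
    "state": ["AZ", "PA", "AZ"],
})

# Deduplicate the (city, state) pairs, then join them as 'city, state'.
unique_combinations = toy[["city", "state"]].drop_duplicates()
locations = (unique_combinations["city"] + ", " + unique_combinations["state"]).tolist()

print(locations)
```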