The following libraries are required to run this notebook.

In [29]:
import pandas as pd
import random
import numpy as np
from datetime import datetime
import urllib.request
import os

Below, we use the 'maybe_download' function to download our chosen dataset from Kaggle if it is not already present locally. We then read the data and drop duplicate rows based on the 'itemDescription' column, which leaves one row per distinct product. Finally, we create a new dataframe called 'new_pl' that keeps only this column, renamed from 'itemDescription' to 'Product Name', and print it.

In [None]:
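def maybe_download(filename, url):
    """Download url to filename unless the file already exists locally."""
    if not os.path.exists(filename):
        print("Downloading file...")
        urllib.request.urlretrieve(url, filename)
    else:
        print("File already exists.")


# Note: this URL is the dataset's landing page, not a direct link to the CSV.
# If the saved file is not a valid CSV, download it manually from this page.
url = "https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset"
maybe_download("Groceries_dataset.csv", url)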
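Because the URL points at a web page, the saved file may turn out to be HTML rather than CSV. A minimal sketch of a sanity check follows, assuming only that the real CSV header contains the 'itemDescription' column used later in this notebook. The helper name `looks_like_csv` and the throwaway demo files (including the `id` column) are hypothetical.

```python
import os
import tempfile


def looks_like_csv(path, expected_column):
    """Return True if the file's first line lists expected_column as a CSV field."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        header = handle.readline().strip()
    columns = [name.strip() for name in header.split(",")]
    return expected_column in columns


# Demo on two throwaway files: one CSV-like, one HTML-like.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "ok.csv")
    html_path = os.path.join(tmp, "page.csv")
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write("id,itemDescription\n1,whole milk\n")  # illustrative rows only
    with open(html_path, "w", encoding="utf-8") as f:
        f.write("<!DOCTYPE html>\n<html></html>\n")
    csv_ok = looks_like_csv(csv_path, "itemDescription")
    html_ok = looks_like_csv(html_path, "itemDescription")

print(csv_ok, html_ok)
```

Calling `looks_like_csv("Groceries_dataset.csv", "itemDescription")` before `pd.read_csv` would flag a bad download early.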

In [30]:
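pl = pd.read_csv("Groceries_dataset.csv")

# Keep the first row for each distinct product name
pl = pl.drop_duplicates(subset = ['itemDescription'], keep = 'first')
# Keep only the product name column and give it a friendlier label
new_pl = pl[["itemDescription"]]
new_pl = new_pl.rename(columns={'itemDescription': 'Product Name'})
print(new_pl)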

                Product Name
0             tropical fruit
1                 whole milk
2                  pip fruit
3           other vegetables
5                 rolls/buns
...                      ...
18396         pudding powder
20424            ready soups
21177        make up remover
25545         toilet cleaner
33699  preservation products

[167 rows x 1 columns]
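The de-duplication step can be illustrated on a toy dataframe. This is a sketch with made-up rows (the `Member_number` column is a hypothetical placeholder), showing that the surviving rows keep their original index labels, which is why the index in the output above skips values such as 4.

```python
import pandas as pd

# Toy data: "whole milk" appears twice
toy = pd.DataFrame({
    "Member_number": [1, 2, 3, 4],  # hypothetical column, for illustration only
    "itemDescription": ["whole milk", "rolls/buns", "whole milk", "pip fruit"],
})

# Same steps as the notebook: drop duplicates, keep one column, rename it
unique_items = (
    toy.drop_duplicates(subset=["itemDescription"], keep="first")[["itemDescription"]]
    .rename(columns={"itemDescription": "Product Name"})
)
print(unique_items)
```

If a contiguous index is preferred, `unique_items.reset_index(drop=True)` renumbers the rows from 0.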


Next, we create 'product_info.csv', which stores a product ID, the product name, and a randomly chosen expiration date between 2023-01-01 and 2025-12-31 for each of the 167 products.

In [13]:
n_products = len(new_pl)  # 167 distinct products
product_id = np.arange(n_products)
product_name = new_pl["Product Name"]
# Draw one random expiration date per product from the 2023-2025 range
expiration_date = np.random.choice(pd.date_range('2023-01-01', '2025-12-31').strftime('%Y-%m-%d'), size = n_products)

df = pd.DataFrame({'Product ID': product_id, 'Product Name': product_name, 'Expiration Date': expiration_date})
df.to_csv('product_info.csv', index = False)
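The random draws above change on every run. If reproducible files are wanted, one option is a seeded NumPy generator. The sketch below is an assumption-laden variant rather than the notebook's own method: the helper name `sample_expiration_dates` and the seed value are hypothetical.

```python
import numpy as np
import pandas as pd


def sample_expiration_dates(n, seed):
    """Draw n date strings from 2023-01-01..2025-12-31 using a seeded generator."""
    rng = np.random.default_rng(seed)
    candidates = pd.date_range("2023-01-01", "2025-12-31").strftime("%Y-%m-%d").to_numpy()
    return rng.choice(candidates, size=n)


first = sample_expiration_dates(5, seed=0)
second = sample_expiration_dates(5, seed=0)
print(list(first) == list(second))  # same seed, same draws
```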


We then create 'user_info.csv', which stores an ID, age, gender, occupation, and location for each of 1,000 synthetic users.

In [14]:
user_id = np.arange(1000)
age = np.random.randint(18, 65, size = 1000)  # upper bound is exclusive, so ages range from 18 to 64
gender = np.random.choice(['Male', 'Female', 'Other'], size = 1000)
occupation = np.random.choice(['Student', 'Engineer', 'Teacher', 'Doctor', 'Businessman'], size = 1000)
location = np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], size = 1000)

df = pd.DataFrame({'User ID': user_id, 'Age': age, 'Gender': gender, 'Occupation': occupation, 'Location': location})
df.to_csv('user_info.csv', index = False)

Next, we create 'interaction_data.csv', which records 100,000 random interactions, each with a user ID, a product ID, an interaction type, and a timestamp.

In [12]:
user_id = np.random.randint(0, 1000, size = 100000)
product_id = np.random.randint(0, 167, size = 100000)
interaction_type = np.random.choice(['View', 'Purchase', 'Add to Cart', 'Wishlist'], size = 100000)
# Note: the loop runs in a fraction of a second, so the timestamps are nearly identical
interaction_timestamp = np.array([datetime.now().strftime('%Y-%m-%d %H:%M:%S') for _ in range(100000)])

df = pd.DataFrame({'User ID': user_id, 'Product ID': product_id, 'Interaction Type': interaction_type, 'Interaction Timestamp': interaction_timestamp})
df.to_csv("interaction_data.csv", index = False)
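Since `datetime.now()` is evaluated in a tight loop, almost every row receives the same timestamp. If timestamps spread over a period are wanted, one possible sketch (not the notebook's method) adds random second offsets to a start date. The start date, the 365-day window, the seed, and the small `n_rows` are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)           # assumed seed, for reproducibility
n_rows = 10                               # small size for the demo
start = pd.Timestamp("2023-01-01")        # assumed start of the window
window_seconds = 365 * 24 * 60 * 60       # assumed one-year window

# Random offset in seconds for each row, added to the start date
offsets = rng.integers(0, window_seconds, size=n_rows)
timestamps = (start + pd.to_timedelta(offsets, unit="s")).strftime("%Y-%m-%d %H:%M:%S")
print(timestamps[:3])
```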

Now we transform our data to create the final file, 'supermarket_data.csv'. First, we convert the 'Expiration Date' column from 'product_info.csv' to datetime format. Then we map the 'Interaction Type' column from 'interaction_data.csv' to integers from 1 to 4, where 1 represents the weakest form of user interaction (viewing the product) and 4 represents the strongest (purchasing it).

In [15]:
# Load the data
users = pd.read_csv('user_info.csv')
products = pd.read_csv('product_info.csv')
interactions = pd.read_csv('interaction_data.csv')

products['Expiration Date'] = pd.to_datetime(products['Expiration Date'])

category_to_number = {
    "View": 1,
    "Wishlist": 2,
    "Add to Cart": 3,
    "Purchase": 4
}

In [26]:
interactions["Interaction Type"] = interactions["Interaction Type"].map(category_to_number)
print(interactions["Interaction Type"])

0        3
1        3
2        2
3        4
4        4
        ..
99995    4
99996    3
99997    3
99998    3
99999    2
Name: Interaction Type, Length: 100000, dtype: int64
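One caveat with `map`: any value missing from the dictionary becomes NaN, so re-running the mapping cell on an already-encoded column turns every value into NaN. A minimal sketch on a toy series, with an explicit check for unmapped labels (the check itself is a suggested addition, not part of the original cell):

```python
import pandas as pd

category_to_number = {"View": 1, "Wishlist": 2, "Add to Cart": 3, "Purchase": 4}

raw = pd.Series(["View", "Purchase", "Add to Cart", "Wishlist", "Purchase"])
encoded = raw.map(category_to_number)

# Fail loudly if any label was not in the dictionary
unmapped = raw[encoded.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped interaction types: {list(unmapped)}")
encoded = encoded.astype("int64")

# Mapping a second time finds no matching keys, so every value becomes NaN
remapped = encoded.map(category_to_number)
print(encoded.tolist(), remapped.isna().all())
```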


Finally, we merge the three dataframes on 'User ID' and 'Product ID' and save the result as 'supermarket_data.csv'.

In [27]:
data = interactions.merge(users, on = 'User ID').merge(products, on = "Product ID")
data.to_csv("supermarket_data.csv", index = False)
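The merge relies on every 'User ID' and 'Product ID' appearing exactly once in its lookup table. As a hedged sketch on toy frames (the rows are made up), pandas' `validate` argument can assert that assumption, and a row-count comparison confirms that no interactions were dropped or duplicated.

```python
import pandas as pd

# Toy stand-ins for the three CSV files
interactions = pd.DataFrame({
    "User ID": [0, 1, 1],
    "Product ID": [0, 0, 1],
    "Interaction Type": [1, 4, 3],
})
users = pd.DataFrame({"User ID": [0, 1], "Age": [30, 45]})
products = pd.DataFrame({"Product ID": [0, 1], "Product Name": ["whole milk", "rolls/buns"]})

# validate="many_to_one" raises MergeError if a key repeats in the right-hand table
data = (
    interactions
    .merge(users, on="User ID", validate="many_to_one")
    .merge(products, on="Product ID", validate="many_to_one")
)

# An inner merge keeps every interaction only if all keys were found
rows_preserved = len(data) == len(interactions)
print(rows_preserved)
```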