# Worksheet 2: Data modelling with MongoDB

## Exercise 1: Identify documents and collections

#### Case Study: Online Bookstore

You are tasked with designing a MongoDB database for an online bookstore. The bookstore sells books, and each book can have multiple authors. Customers can purchase books, and each purchase can contain multiple books. Additionally, customers can leave reviews for books they have purchased.

#### Objectives:
1. Design the data model for the online bookstore.
2. Define the relationships between different entities.
3. Create sample documents for each collection.



#### Step-by-Step Instructions:

1. **Identify Entities and Relationships:**
    - **Books**: Each book has a title, ISBN, publication date, and a list of authors.
    - **Authors**: Each author has a name and a list of books they have written.
    - **Customers**: Each customer has a name, email, and a list of purchases.
    - **Purchases**: Each purchase has a date, customer reference, and a list of books purchased.
    - **Reviews**: Each review has a rating, comment, customer reference, and book reference.



2. **Design the Data Model:**
  
Create a template for each collection. The template should list the key-value pairs in JSON format, specify the data type of each value, and indicate possible relationships between collections.

Recall that MongoDB supports data types such as `ObjectId`, `String`, `Integer`, `ISODate`, etc.

For example

**Books Collection**:

```json
{
    "_id": ObjectId,
    "title": String,
    "ISBN": String,
    "publication_date": Date,
    "authors": [ObjectId]  // References to Authors
}
```
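One way to make a template like the one above concrete is to write it as a mapping from field name to expected type and check a candidate document against it. The sketch below is illustrative only: `BOOK_TEMPLATE` and `check_document` are hypothetical names, and plain strings stand in for `ObjectId` values so that the example runs without a MongoDB driver.

```python
from datetime import datetime

# Hypothetical template mirroring the Books collection above.
# Assumption: ObjectId values are represented as plain strings here.
BOOK_TEMPLATE = {
    "_id": str,
    "title": str,
    "ISBN": str,
    "publication_date": datetime,
    "authors": list,  # list of references to Authors
}


def check_document(doc, template):
    """Return a list of problems found when comparing doc to template."""
    problems = []
    for field, expected_type in template.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    return problems


book = {
    "_id": "60c72b2f9b1d8b3a4c8e4d1a",
    "title": "MongoDB Basics",
    "ISBN": "1234567890",
    "publication_date": datetime(2021, 1, 1),
    "authors": ["60c72b2f9b1d8b3a4c8e4d1b"],
}

print(check_document(book, BOOK_TEMPLATE))  # prints []
print(check_document({"title": 42}, BOOK_TEMPLATE))
```

The same helper could be reused with a template for each of the other collections once you have designed them.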


- **Authors Collection**:
> ENTER YOUR ANSWER HERE


- **Customers Collection**:
> ENTER YOUR ANSWER HERE

- **Purchases Collection**:
> ENTER YOUR ANSWER HERE

- **Reviews Collection**:
> ENTER YOUR ANSWER HERE


3. **Create Sample Documents:**

Now create a sample document for each collection. You can fill in values of your choice. The purpose of this exercise is to help you envision what the data will look like.

For example

**Books Collection**:
```json
        {
            "_id": ObjectId("60c72b2f9b1d8b3a4c8e4d1a"),
            "title": "MongoDB Basics",
            "ISBN": "1234567890",
            "publication_date": ISODate("2021-01-01T00:00:00Z"),
            "authors": [ObjectId("60c72b2f9b1d8b3a4c8e4d1b")]
        }
```


- **Authors Collection**:
> ENTER YOUR ANSWER HERE


- **Customers Collection**:
> ENTER YOUR ANSWER HERE

- **Purchases Collection**:
> ENTER YOUR ANSWER HERE

- **Reviews Collection**:
> ENTER YOUR ANSWER HERE

## Exercise 2: Identify database workload

In this exercise, you will extend your data modeling skills by identifying entities and attributes, quantifying entities, and analyzing read and write operations for different types of application users. 


### Quantifying Reads and Writes in the Online Bookstore Example

To quantify the read and write operations in the online bookstore example, we need to consider the different types of operations performed by various users (customers and admins) and the frequency of these operations.

Fill out the table below with possible operations by customers and admins. For each operation, specify the estimated frequency and whether it is a read or a write operation. You can make your own assumptions.

I have added one operation as an example.



#### Table: Quantifying Reads and Writes

| Operation                     | User Type | Read/Write | Frequency (per day) | Description                                                                 |
|-------------------------------|-----------|------------|----------------------|-----------------------------------------------------------------------------|
| Browse Books                  | Customer  | Read       | 100,000              | Customers browsing the list of available books.                             |
|                               |           |            |                      |                                                                             |
|                               |           |            |                      |                                                                             |
|                               |           |            |                      |                                                                             |
|                               |           |            |                      |                                                                             |
|                               |           |            |                      |                                                                             |
|                               |           |            |                      |                                                                             |
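Once the table is filled in, the daily frequencies can be turned into totals and average per-second rates, which makes it easier to see whether the workload is read-heavy or write-heavy. The sketch below assumes a simple tuple layout for each row. Only the "Browse Books" row comes from the table above; the "Purchase Books" row and its number are made up purely for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (operation, user type, read/write, frequency per day)
# The first row comes from the table above; the second is a made-up
# assumption included only to show how a write enters the totals.
operations = [
    ("Browse Books", "Customer", "Read", 100_000),
    ("Purchase Books", "Customer", "Write", 1_000),  # hypothetical estimate
]


def summarize(ops):
    """Total reads and writes per day and the average rate per second."""
    totals = {"Read": 0, "Write": 0}
    for _name, _user, kind, per_day in ops:
        totals[kind] += per_day
    return {
        "reads_per_day": totals["Read"],
        "writes_per_day": totals["Write"],
        "reads_per_second": totals["Read"] / SECONDS_PER_DAY,
        "writes_per_second": totals["Write"] / SECONDS_PER_DAY,
    }


summary = summarize(operations)
print(summary)
```

With these example numbers, reads average a little over one per second while writes are about a hundred times rarer, which would suggest optimizing the model for reads. Averages hide peaks, so treat them as a rough guide only.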


## Exercise 3: Identifying and Modeling Relationships in MongoDB

#### Case Study: Online Bookstore (Continued)

In this exercise, you will identify one-to-one, one-to-many, and many-to-many relationships between entities in an online bookstore. You will analyze these entities to determine whether to embed or reference them using common guidelines. Finally, you will model these relationships using both embedded and referenced approaches.

#### Objectives:
1. Identify one-to-one, one-to-many, and many-to-many relationships between entities.
2. Analyze entities to determine whether to embed or reference using common guidelines.
3. Model embedded and referenced one-to-one, one-to-many, and many-to-many relationships.

#### Step-by-Step Instructions:

For each of the following pairs of entities:
- Identify the type of relationship
- Determine whether to use embedding or referencing
- Give an example in JSON format


#### Example:

**Entities: Books and Reviews**

1. **Type of Relationship:** One-to-Many (e.g., one book has many reviews)
2. **Embed or Reference:** Embed
3. **Explanation:** Reviews are typically small in size and are frequently accessed together with the book details. Embedding reviews within the book document ensures that all related data can be retrieved in a single read operation, improving read performance.



##### JSON Example:



```json
{
    "_id": ObjectId("bookId1"),
    "title": "MongoDB Basics",
    "ISBN": "1234567890",
    "reviews": [
        {
            "rating": 5,
            "comment": "Great book on MongoDB!",
            "customer_id": ObjectId("customerId1")
        },
        {
            "rating": 4,
            "comment": "Very informative.",
            "customer_id": ObjectId("customerId2")
        }
    ]
}
```
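To illustrate why embedding helps here, the sketch below treats the book document above as a plain Python dictionary and computes a summary from the embedded reviews without any second lookup. The helper `average_rating` is a hypothetical name, and `ObjectId` values are replaced by plain strings so the example runs without a MongoDB driver.

```python
# The book document from the example above, with ObjectId values
# replaced by plain strings (an assumption made so the sketch runs
# without a MongoDB driver).
book = {
    "_id": "bookId1",
    "title": "MongoDB Basics",
    "ISBN": "1234567890",
    "reviews": [
        {"rating": 5, "comment": "Great book on MongoDB!", "customer_id": "customerId1"},
        {"rating": 4, "comment": "Very informative.", "customer_id": "customerId2"},
    ],
}


def average_rating(book_doc):
    """Average rating from the embedded reviews, or None if there are none."""
    ratings = [review["rating"] for review in book_doc.get("reviews", [])]
    return sum(ratings) / len(ratings) if ratings else None


print(average_rating(book))  # prints 4.5
```

Because the reviews travel with the book, everything needed for the book page is available after fetching one document.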


**YOUR TURN**

For each of the following pairs of entities, identify the type of relationship, decide whether to use embedding or referencing, explain why, and provide a JSON example.

1. **Entities: Books and Authors**
    - **Type of Relationship:** 
    - **Embed or Reference:** 
    - **Explanation:** 
    - **JSON Example:**


> YOUR ANSWER HERE


1. **Entities: Books and Authors**
    - **Type of Relationship:** Many-to-Many
    - **Embed or Reference:** Embed authors in the books collection, and reference books in the authors collection
    - **Explanation:** Each book can have multiple authors, and each author can write multiple books.
  
    - For the books collection, since the number of authors per book is relatively small, we can embed author information directly in each book document. This lets users retrieve author information faster while browsing a book.
    - For the authors collection, since each author can have a large number of books, embedding book information in author documents would likely result in unbounded (overly long) documents and slow down query time. Hence, it is better to reference book IDs in the authors collection.
    - We will also accept embedding in the authors collection if you assume each author publishes only a small, limited number of books.
  
    - **JSON Example:**


Books collection (embedded authors)
```json
{
    "book_id": "book_id_3",
    "title": "Good Omens",
    "ISBN": "978-0-06-085398-3",
    "publication_date": "1990-05-01",
    "authors": [
        {
            "author_id": "3",
            "name": "Neil Gaiman"
        },
        {
            "author_id": "4",
            "name": "Terry Pratchett"
        }
    ]
}
```

Authors collection (referencing books)

```json
{
    "author_id": "3",
    "name": "Neil Gaiman",
    "books": [
        "book_id_3",
        "book_id_6"
    ]
}
```
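With referencing, the application (or a second query) has to resolve the stored IDs into full documents. The sketch below mimics that step in memory, under the assumption that the books collection is represented by a dictionary keyed by `book_id`. The names `books_by_id` and `resolve_books` are hypothetical.

```python
# In-memory stand-in for the books collection, keyed by book_id.
# Only "book_id_3" appears in the example above; "book_id_6" is left
# out on purpose to show how a dangling reference is handled.
books_by_id = {
    "book_id_3": {"book_id": "book_id_3", "title": "Good Omens"},
}

author = {
    "author_id": "3",
    "name": "Neil Gaiman",
    "books": ["book_id_3", "book_id_6"],
}


def resolve_books(author_doc, books):
    """Split an author's book references into found documents and missing ids."""
    found, missing = [], []
    for book_id in author_doc["books"]:
        if book_id in books:
            found.append(books[book_id])
        else:
            missing.append(book_id)
    return found, missing


found, missing = resolve_books(author, books_by_id)
print([b["title"] for b in found])  # prints ['Good Omens']
print(missing)                      # prints ['book_id_6']
```

The extra lookup is the price of referencing; in exchange, the author document stays small no matter how many books the author writes.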


2. **Entities: Customers and Purchases**
    - **Type of Relationship:**
    - **Embed or Reference:** 
    - **Explanation:** 
    - **JSON Example:**


> YOUR ANSWER HERE


2. **Entities: Customers and Purchases**
    - **Type of Relationship:** One-to-many
    - **Embed or Reference:** We would reference purchase IDs in the customers collection. Each customer can have a large number of purchases, so embedding them would result in unbounded, extremely long documents and slow down query time. Referencing is the better choice here.
    - **Explanation:** Each customer can have multiple purchases, but each purchase order belongs to only one customer.
    - **JSON Example:**

```json
{
    "customer_id": "customer_id_2",
    "name": "Jane Smith",
    "email": "jane.smith@example.com",
    "purchases": [
        "purchase_id_1",
        "purchase_id_2",
        ...,
        "purchase_id_100"
    ]
}
```




3. **Entities: Books and Purchases**
    - **Type of Relationship:** 
    - **Embed or Reference:** 
    - **Explanation:** 
    - **JSON Example:**


> YOUR ANSWER HERE


3. **Entities: Books and Purchases**
    - **Type of Relationship:** Many-to-many
    - **Embed or Reference:** Reference purchases in the books collection; embed (or reference) books in the purchases collection
    - **Explanation:** Each book can be purchased multiple times, and each purchase order can contain multiple books.
    - For the books collection, we can reference purchase IDs, since a single popular book (e.g. Harry Potter) can have hundreds to millions of purchase orders.
    - For the purchases collection, we can use either embedding or referencing. If we assume that most purchase orders contain only a small number of books, which is the most likely case in real life, then we can use embedding. If we assume that most purchase orders contain a large number of books (bulk orders of hundreds of books), then we can use referencing.
    - **JSON Example:**

**books collection** (referencing purchase ids)
```json
{
    "book_id": "book_id_1",
    "title": "Harry Potter and the Sorcerer's Stone",
    "ISBN": "978-0-590-35340-3",
    "publication_date": "1997-06-26",
    "purchases": [
        "purchase_id_1",
        "purchase_id_2",
        ...,
        "purchase_id_100"
    ]
}
```

**purchase collection** (embed books info)
```json
{
    "purchase_id": "purchase_id_1",
    "date": "2023-10-01",
    "customer_id": "customer_id_1",
    "books": [
        {
            "book_id": "book_id_1",
            "title": "Harry Potter and the Sorcerer's Stone"
        },
        {
            "book_id": "book_id_2",
            "title": "The Hobbit"
        }
    ]
}
```
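The purchase document above embeds only a subset of each book's fields (`book_id` and `title`) rather than the whole book document. A minimal sketch of how such a document might be assembled in application code follows; `make_purchase` is a hypothetical helper, and the choice of which fields to copy is an assumption based on the example.

```python
def make_purchase(purchase_id, date, customer_id, book_docs):
    """Build a purchase document embedding only book_id and title of each book."""
    return {
        "purchase_id": purchase_id,
        "date": date,
        "customer_id": customer_id,
        "books": [
            {"book_id": b["book_id"], "title": b["title"]} for b in book_docs
        ],
    }


# Full book documents as they might come from the books collection.
full_books = [
    {
        "book_id": "book_id_1",
        "title": "Harry Potter and the Sorcerer's Stone",
        "ISBN": "978-0-590-35340-3",
        "publication_date": "1997-06-26",
    },
    {"book_id": "book_id_2", "title": "The Hobbit"},
]

purchase = make_purchase("purchase_id_1", "2023-10-01", "customer_id_1", full_books)
print(purchase["books"])
```

Copying only the fields needed to display an order keeps purchase documents small, at the cost of having to decide what happens if a book's title is later corrected.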


4. **Entities: Customers and Reviews**
    - **Type of Relationship:** 
    - **Embed or Reference:** 
    - **Explanation:** 
    - **JSON Example:**


> YOUR ANSWER HERE


4. **Entities: Customers and Reviews**
    - **Type of Relationship:** One-to-many
    - **Embed or Reference:** Reference reviews in customer collection
    - **Explanation:** Each customer can write multiple reviews, and each review belongs to a single customer. We should reference review IDs in the customer collection because the number of reviews per customer could grow quite large.
    - **JSON Example:**

```json
{
    "customer_id": "customer_id_1",
    "name": "John Doe",
    "email": "john.doe@example.com",
    "reviews": [
        "review_id_1",
        "review_id_2",
        "review_id_3",
        "review_id_4",
        "review_id_5",
        "review_id_6",
        "review_id_7",
        "review_id_8",
        "review_id_9",
        "review_id_10"
    ]
}
```


## Exercise 4: Delta Insertion in MongoDB

#### Case Study: Walmart Prices


When a new product is inserted into Collection A (historical prices) and the product does not yet exist in Collection B (latest prices), we want to insert it directly into Collection B as well. Additionally, we will ensure that the attributes (fields) of Collection A and Collection B are the same.

Changes to Implement:
- Ensure that Collection A and Collection B have the same attributes.
- If a product is new (not found in Collection B), insert it into both Collection A and Collection B.
- Modify the schema to make the attributes consistent across both collections.

**Step 1**: Define the Common Schema for Collection A and Collection B

Both Collection A and Collection B will have the following fields:

- `productID`: Unique ID for each product.
- `price`: The current price of the product.
- `changeDate`: The date of the price change.
- `lastUpdated`: The date when the product’s price was last updated.

Example Document Structure:
```json
{ 
    "productID": 101,
    "price": 12.99,
    "changeDate": "2024-01-15",
    "lastUpdated": "2024-01-15"
}
```
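Before connecting to a real database, the delta logic described above can be sketched with plain Python containers. This is only an in-memory illustration under stated assumptions: a list stands in for Collection A, a dictionary keyed by `productID` stands in for Collection B, and `apply_price_change` is a hypothetical helper. The second price change is made-up data.

```python
def apply_price_change(historical, latest, change):
    """Record a price change in `historical` and keep `latest` in sync.

    historical: list of every price change seen so far (Collection A)
    latest: dict mapping productID to its newest record (Collection B)
    """
    historical.append(dict(change))
    current = latest.get(change["productID"])
    if current is None:
        # New product: insert into the latest prices as well.
        latest[change["productID"]] = dict(change)
        return "inserted"
    # ISO dates (YYYY-MM-DD) compare correctly as strings.
    if change["changeDate"] > current["changeDate"]:
        latest[change["productID"]] = dict(change)
        return "updated"
    return "unchanged"


historical, latest = [], {}

first = {"productID": 101, "price": 12.99,
         "changeDate": "2024-01-15", "lastUpdated": "2024-01-15"}
# Hypothetical later change, invented for this illustration.
second = {"productID": 101, "price": 13.49,
          "changeDate": "2024-02-01", "lastUpdated": "2024-02-01"}

print(apply_price_change(historical, latest, first))   # prints inserted
print(apply_price_change(historical, latest, second))  # prints updated
print(len(historical), latest[101]["price"])           # prints 2 13.49
```

Both containers hold records with the same four fields, which mirrors the requirement that Collection A and Collection B share one schema.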
**Step 2**: Python Code to Insert New Products into Both Collections



In [1]:
# pip install pymongo

In [3]:
from pymongo import MongoClient # import mongo client to connect
import json # import json to load credentials
from bson.objectid import ObjectId
import urllib.parse

# load credentials from json file
with open('credentials_mongodb.json') as f:
    login = json.load(f)

# assign credentials to variables
username = login['username']
password = urllib.parse.quote(login['password'])
host = login['host']
url = "mongodb+srv://{}:{}@{}/?retryWrites=true&w=majority".format(username, password, host)

In [4]:
# connect to the database
client = MongoClient(url)

In [5]:
db = client['walmart_prices']  # Database name

In [59]:
# collection_a.delete_many({})
# collection_b.delete_many({})
# collection_c.delete_many({})


In [6]:
collection_a = db["historical_prices"]
collection_b = db["latest_prices"]
collection_c = db["logs"]

In [9]:
historical_data = [
    {
        "_id": ObjectId(),
        "productID": 101,
        "Price": 10.99,
        "changeDate": "2024-01-15"
    },
    {
        "_id": ObjectId(),
        "productID": 102,
        "Price": 12.49,
        "changeDate": "2024-01-20"
    },
    {
        "_id": ObjectId(),
        "productID": 103,
        "Price": 8.79,
        "changeDate": "2024-02-05"
    },
    {
        "_id": ObjectId(),
        "productID": 104,
        "Price": 15.99,
        "changeDate": "2024-03-10"
    },
    {
        "_id": ObjectId(),
        "productID": 105,
        "Price": 7.49,
        "changeDate": "2024-04-18"
    }
]

# Dummy data for 5 records in latest_prices collection (updated prices)
latest_data = [
    {
        "_id": ObjectId(),
        "productID": 101,
        "Price": 10.99,
        "changeDate": "2024-01-15"
    },
    {
        "_id": ObjectId(),
        "productID": 102,
        "Price": 12.49,
        "changeDate": "2024-01-20"
    },
    {
        "_id": ObjectId(),
        "productID": 103,
        "Price": 8.79,
        "changeDate": "2024-02-05"
    },
    {
        "_id": ObjectId(),
        "productID": 104,
        "Price": 15.99,
        "changeDate": "2024-03-10"
    },
    {
        "_id": ObjectId(),
        "productID": 105,
        "Price": 7.49,
        "changeDate": "2024-04-18"
    }
]

log_data = [
    {
        "_id": ObjectId(),
        "operationID": 1,
        "fromTable": "historical_prices",
        "toTable": "latest_prices",
        "Execution_Date": "2024-01-20",
        "Today_Date": "2024-01-20"
    }]
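# Insert the data into collections.
# Note: historical_data is inserted into both price collections here,
# so latest_data defined above is not used.
collection_a.insert_many(historical_data)
collection_b.insert_many(historical_data)
collection_c.insert_many(log_data)

# Example of a later price change for productID 105
# (uncomment to add it to historical_prices):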
# collection_a.insert_many([{
    #     "_id": ObjectId(),
    #     "productID": 105,
    #     "Price": 8.00,
    #     "changeDate": "2024-04-20"
    # }])

InsertManyResult([ObjectId('67171d84856307621de5be44'), ObjectId('67171d84856307621de5be45'), ObjectId('67171d84856307621de5be46'), ObjectId('67171d84856307621de5be47'), ObjectId('67171d84856307621de5be48')], acknowledged=True)

In [10]:
print(list(collection_a.find()))
print(list(collection_b.find()))
print(list(collection_c.find()))


[{'_id': ObjectId('67171d84856307621de5be44'), 'productID': 101, 'Price': 10.99, 'changeDate': '2024-01-15'}, {'_id': ObjectId('67171d84856307621de5be45'), 'productID': 102, 'Price': 12.49, 'changeDate': '2024-01-20'}, {'_id': ObjectId('67171d84856307621de5be46'), 'productID': 103, 'Price': 8.79, 'changeDate': '2024-02-05'}, {'_id': ObjectId('67171d84856307621de5be47'), 'productID': 104, 'Price': 15.99, 'changeDate': '2024-03-10'}, {'_id': ObjectId('67171d84856307621de5be48'), 'productID': 105, 'Price': 7.49, 'changeDate': '2024-04-18'}]
[{'_id': ObjectId('67171d84856307621de5be44'), 'productID': 101, 'Price': 10.99, 'changeDate': '2024-01-15'}, {'_id': ObjectId('67171d84856307621de5be45'), 'productID': 102, 'Price': 12.49, 'changeDate': '2024-01-20'}, {'_id': ObjectId('67171d84856307621de5be46'), 'productID': 103, 'Price': 8.79, 'changeDate': '2024-02-05'}, {'_id': ObjectId('67171d84856307621de5be47'), 'productID': 104, 'Price': 15.99, 'changeDate': '2024-03-10'}, {'_id': ObjectId('67

In [55]:
from datetime import datetime  # required for datetime.now() below

# Define the pipeline function
def update_prices_for_all_logs():
    # Get current date for the log
    current_date = datetime.now().strftime("%Y-%m-%d")
    collection_c = db['logs']
    # Fetch all log entries from collection_c
    all_log_entries = list(collection_c.find())
    
    # Loop through each log entry and execute the pipeline
    for log_entry in all_log_entries:
        operation_id = log_entry['operationID']
        from_table_name = log_entry['fromTable']
        to_table_name = log_entry['toTable']
        
        print(f"Processing operationID {operation_id} from {from_table_name} to {to_table_name}...")
        
        # Fetch the collections dynamically using the names from the logs
        from_collection = db[from_table_name]  # Source collection, e.g., historical_prices
        to_collection = db[to_table_name]      # Destination collection, e.g., latest_prices

        # Fetch data from the 'fromTable'
        historical_data = list(from_collection.find())
        
        # Loop through each record in 'fromTable'
        for record in historical_data:
            product_id = record['productID']
            historical_price = record['Price']
            historical_change_date = record['changeDate']

            # Check if the productID exists in the 'toTable'
            latest_record = to_collection.find_one({"productID": product_id})

            if latest_record:
                # If the productID exists, check the changeDate
                latest_change_date = latest_record['changeDate']

                if historical_change_date > latest_change_date:
                    # Update the record if historical changeDate is newer
                    to_collection.update_one(
                        {"productID": product_id},
                        {"$set": {"Price": historical_price, "changeDate": historical_change_date}}
                    )
                    print(f"Updated productID {product_id} in operationID {operation_id} with new price {historical_price}.")
            else:
                # Insert new record if productID doesn't exist in 'toTable'
                to_collection.insert_one(record)
                print(f"Inserted new productID {product_id} in operationID {operation_id} with price {historical_price}.")

        # Update the log_data for this operation after execution
        collection_c.update_one(
            {"operationID": operation_id},
            {
                "$set": {
                    "Execution_Date": current_date,
                    "Today_Date": current_date
                }
            }
        )

        print(f"OperationID {operation_id} complete, logs updated.")

# Call the pipeline function for all log entries
update_prices_for_all_logs()

Processing operationID 1 from historical_prices to latest_prices...
{'_id': ObjectId('6716bf3dd63c865d55b2e58d'), 'productID': 101, 'Price': 10.99, 'changeDate': '2024-01-15'}
{'_id': ObjectId('6716bf3dd63c865d55b2e58e'), 'productID': 102, 'Price': 12.49, 'changeDate': '2024-01-20'}
{'_id': ObjectId('6716bf3dd63c865d55b2e58f'), 'productID': 103, 'Price': 8.79, 'changeDate': '2024-02-05'}
{'_id': ObjectId('6716bf3dd63c865d55b2e590'), 'productID': 104, 'Price': 15.99, 'changeDate': '2024-03-10'}
{'_id': ObjectId('6716bf3dd63c865d55b2e591'), 'productID': 105, 'Price': 7.49, 'changeDate': '2024-04-18'}
{'_id': ObjectId('6716bf3dd63c865d55b2e591'), 'productID': 105, 'Price': 7.49, 'changeDate': '2024-04-18'}
Updated productID 105 in operationID 1 with new price 8.0.
OperationID 1 complete, logs updated.
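The pipeline compares `changeDate` values as strings. That works here because the dates are stored in ISO format (`YYYY-MM-DD`), where alphabetical order matches chronological order. The short sketch below illustrates this and shows a day-first format, which is hypothetical and not used in this worksheet, where string comparison gives the wrong answer. `parse_day_first` is an illustrative helper name.

```python
from datetime import datetime

# ISO 8601 strings (YYYY-MM-DD) sort in chronological order.
print("2024-04-20" > "2024-04-18")  # prints True

# A day-first format (hypothetical, not used in this worksheet) does not:
# 1 May 2024 is later than 20 April 2024, but "0" sorts before "2".
print("01-05-2024" > "20-04-2024")  # prints False


def parse_day_first(text):
    """Parse a DD-MM-YYYY string into a datetime for a safe comparison."""
    return datetime.strptime(text, "%d-%m-%Y")


print(parse_day_first("01-05-2024") > parse_day_first("20-04-2024"))  # prints True
```

Storing dates as real date values rather than strings would avoid relying on the string format at all.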


## Submission instructions

{rubric: mechanics = 5}

- Make sure the notebook can run from top to bottom without any error. Restart the kernel and run all cells.
- Commit and push your notebook to the GitHub repo.
- Double check that your notebook is rendered properly on GitHub and that you can see all the outputs clearly.
- Submit a URL to the GitHub repo that contains this worksheet to Moodle.