# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2025 &ndash; Week 10 &ndash; ETH Zurich</center>


## Introduction

This exercise sheet covers document stores. A number of different document store technologies exist. For the purpose of this course, we will look at MongoDB, a _“source-available, cross-platform, document-oriented database program.”_ ([Wikipedia](https://en.wikipedia.org/wiki/MongoDB))

A record in a document store is a *document*. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects.  Documents are composed of field-value pairs and have the following structure:

![`insertOne()` Illustration](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-insertOne.bakedsvg.svg)

The values of fields may include other documents, arrays, and arrays of documents. Collections in MongoDB have a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
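The flexible schema can be illustrated with plain JavaScript objects, which closely resemble MongoDB documents. The following is only a sketch with made-up data (it does not talk to a database): the array stands in for a collection, and the field names are chosen for illustration.

```js
// Two documents that could live in the same collection (made-up data).
const collection = [
  { name: "Cafe One", borough: "Queens", zipcode: "11101" },
  { name: "Diner Two", zipcode: 11225, grades: [{ grade: "A", score: 9 }] },
];

// The sets of fields differ between the two documents ...
const fieldSets = collection.map((doc) => Object.keys(doc).sort());
console.log(fieldSets);

// ... and the shared field `zipcode` holds a string in one document
// and a number in the other.
const zipcodeTypes = collection.map((doc) => typeof doc.zipcode);
console.log(zipcodeTypes);
```

A relational table would reject the second row unless the schema were changed first; a document store accepts both documents as they are.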

## 1. Understanding Document Stores

To get started with the concept of document stores, provide a brief answer for each of the following questions:

1. What are advantages of document stores over relational databases?

2. Can the data in document stores be normalized?

3. How does denormalization affect performance?

4. How does a large number of small documents affect performance?

5. What makes document stores different from key-value stores?

## 2. MongoDB

We will now get started with setting up MongoDB, so that you can see it in action yourself.

This setup is also the one that is required for the quiz of this week (Quiz 10.2). Make sure that when you are doing the quiz, you start with a fresh `restaurants` collection without any manually added documents.

### 2.1 Starting MongoDB

To start the MongoDB container run the following command in the directory for this week's exercise:

```sh
docker-compose up -d
```

### 2.2 Importing the dataset

To download and import the dataset, run the following command on your local machine: 

```sh
curl -O https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json && \
docker cp primer-dataset.json mongo:/primer-dataset.json && \
docker exec mongo sh -c "mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json"
```

This command downloads the dataset, copies it to the Docker container, and inserts the documents into the ```restaurants``` collection in the ```test``` database. If the collection already exists in the ```test``` database, the ```--drop``` option makes the operation drop the ```restaurants``` collection first.

You should see something similar to the following output:

```text
25359 document(s) imported successfully. 0 document(s) failed to import.
```

> Note:
> To get a "fresh" version of the `restaurants` collection for the quiz, you can simply re-run the import command in your local shell:
> ```sh
> docker exec mongo sh -c "mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json"
> ```

### 2.3 Using the Mongo Shell 

The mongo shell is an interactive JavaScript interface to MongoDB. You can use the mongo shell to query and update data, as well as to perform administrative operations.

To start mongosh, use the following command:

```sh
docker exec -it mongo sh -c "mongosh"
```

You should now have a Mongo Shell (mongosh) within the MongoDB Docker container. If you want to exit, press Ctrl + C or type `.exit`.

You can query all documents in the `restaurants` collection:
```js
db.restaurants.find()
```

You can retrieve a single document:
```js
db.restaurants.findOne()
```

To format the printed result nicely, you can add ```.pretty()``` to the operation, as in the following:
```js
db.restaurants.find().limit(1).pretty()
```

You can also add a new document to the collection:
```js
db.restaurants.insertOne(
   {
      "address" : {
         "street" : "W 55th St",
         "zipcode" : "10019",
         "building" : "240",
         "coord" : [ -73.98299163967829, 40.76483853555562 ]
      },
      "borough" : "Manhattan",
      "cuisine" : "Irish",
      "grades" : [
         {
            "date" : ISODate("2005-09-19T00:00:00Z"),
            "grade" : "A",
            "score" : 208
         },
      ],
      "name" : "MacLaren's Pub",
      "restaurant_id" : "41704620"
   }
)
```

### 2.3b Mongo Shell from Notebook (Optional)

If you prefer writing your `mongosh` queries in a notebook cell, you can use the format as shown in the example below:

In [None]:
!mongosh "mongodb://mongo:27017" --eval '''\
    db.restaurants.find({}).limit(5) \
'''

The above example works when you are using the kernel from the given Jupyter server. If you are running the notebook with a local kernel, the following format should work:

In [None]:
!docker exec -i mongo mongosh --eval '''\
    db.restaurants.find({}).limit(5) \
'''

In either case, make sure to escape all line breaks within the query (using the `\` character at the end of the line).

### 2.4 Querying Documents

For simple queries, you can use the ```db.collection.find()``` method, where you can specify the following (optional) components:

1. a query **filter** to specify which documents to return,

1. a query **projection** to specify which fields from the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),

1. optionally, a cursor modifier to impose limits, skips, and sort orders.

![query](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-find.bakedsvg.svg)
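As a minimal sketch of how the three parts fit together, the snippet below builds a filter, a projection, and two cursor modifiers as separate objects before passing them to `find()`. The chosen values (borough "Bronx", the projected fields, and the limit of three) are assumptions picked for illustration and are unrelated to the exercises. The `typeof db` guard is only there so that the snippet also runs outside of mongosh, where no `db` object exists.

```js
// Query filter: only documents whose `borough` field equals "Bronx".
const filter = { borough: "Bronx" };

// Projection: return `name` and `cuisine`, and suppress the default `_id`.
const projection = { name: 1, cuisine: 1, _id: 0 };

// Cursor modifiers: sort by name (ascending) and keep the first three results.
const sortOrder = { name: 1 };
const maxResults = 3;

// `db` is only defined inside mongosh, so the query is skipped elsewhere.
if (typeof db !== "undefined") {
  const cursor = db.restaurants
    .find(filter, projection)
    .sort(sortOrder)
    .limit(maxResults);
  printjson(cursor.toArray());
}
```

In everyday use the objects are usually written inline, for example `db.restaurants.find({ borough: "Bronx" }, { name: 1 })`; naming them separately here just makes the roles of the arguments explicit.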

For more complex queries, you can use so-called _aggregation pipelines_. To read more about these pipelines, please consider the [MongoDB Documentation](https://www.mongodb.com/docs/manual/core/aggregation-pipeline/).

**Exercise**: Write MongoDB queries to answer the following requests:

1. All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".

2. The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".

3. All restaurants with zipcode 11225.

4. Names of restaurants with zipcode 11225 that have at least one grade "C".

5. Names of restaurants with zipcode 11225 that have as first grade "C" and as second grade "A".

6. Names and streets of restaurants that don't have an "A" grade.

7. All restaurants for which at least one rating has a grade C **with** a score greater than 50.

8. All restaurants with a grade C or a score greater than 50.

9. A table with zipcode and number of restaurants that are in the borough "Queens" and have "Brazilian" cuisine.

10. (Optional) Find the top 5 restaurants in the borough “Brooklyn” with the cuisine “Hamburgers” based on the highest average grade scores.

Hint: for Question 10, [$unwind](https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/) might be useful.

### Further Reading and Documentation

You can read more about MongoDB here: 

https://www.mongodb.com/docs/mongodb-shell/

https://www.mongodb.com/docs/manual/reference/operator/

https://www.mongodb.com/docs/manual/aggregation/

## 3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document in a collection to select those documents that match the query. Scans can be highly inefficient and require MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.
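The idea of an ordered index can be sketched in a few lines of plain JavaScript. This is a conceptual toy, not MongoDB's implementation (MongoDB uses B-tree-based indexes): the "index" here is simply a sorted array of (value, position) entries that is searched with binary search. All data and function names are made up for illustration.

```js
// Toy collection: plain objects standing in for documents (made-up data).
const docs = [
  { name: "Cafe One", borough: "Queens" },
  { name: "Diner Two", borough: "Brooklyn" },
  { name: "Grill Three", borough: "Manhattan" },
  { name: "Bistro Four", borough: "Brooklyn" },
];

// Without an index: inspect every document (a full collection scan).
function scan(collection, field, value) {
  return collection.filter((doc) => doc[field] === value);
}

// Build a toy "index": (key, position) entries sorted by key.
function buildIndex(collection, field) {
  return collection
    .map((doc, position) => ({ key: doc[field], position }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// With the index: binary search for the first matching entry, then
// read consecutive entries for as long as the key still matches.
function lookup(collection, index, value) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].key < value) lo = mid + 1;
    else hi = mid;
  }
  const result = [];
  for (let i = lo; i < index.length && index[i].key === value; i++) {
    result.push(collection[index[i].position]);
  }
  return result;
}

const boroughIndex = buildIndex(docs, "borough");
console.log(lookup(docs, boroughIndex, "Brooklyn").map((doc) => doc.name));
```

Both `scan` and `lookup` return the same documents, but `lookup` only touches a logarithmic number of index entries plus the matches, whereas `scan` touches every document. The price is that the index has to be stored and kept up to date on every write.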

MongoDB supports indexes on a single field as well as compound indexes on multiple fields. Which kind is appropriate depends on the queries that the index should support.

By default,  MongoDB creates the ```_id``` index, which is an ascending unique index on the ```_id``` field, for all collections when the collection is created. You cannot remove the index on the ```_id``` field.

### Managing indexes in MongoDB

The ```explain()``` method provides information on the query plan. It returns a document that describes the steps and indexes used to answer the query. This can provide useful insight when optimizing a query.

```js
db.restaurants.find({"borough" : "Brooklyn"}).explain()
```

In the mongo shell, you can create an index by calling the ```createIndex()``` method.  

```js
db.restaurants.createIndex({ "borough" : 1 })
```

Now you can retrieve the query plan again to see how the new index changes it.

```js
db.restaurants.find({"borough" : "Brooklyn"}).explain()
```
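The `explain()` output is a fairly large nested document, so it can help to reduce it to the names of the plan stages. The helper below is a hypothetical convenience function (not part of MongoDB) that walks a plan and collects its `stage` values. It assumes that child plans are nested under `inputStage`, `inputStages`, or `queryPlan`; the exact shape of the output depends on the server version. The two sample plans are hand-written and heavily abbreviated, not real server output.

```js
// Collect the stage names of a (possibly nested) query plan, outermost first.
function planStages(plan) {
  if (plan === null || typeof plan !== "object") return [];
  const stages = typeof plan.stage === "string" ? [plan.stage] : [];
  // Assumption: child plans sit under `queryPlan`, `inputStage`, or
  // `inputStages`, depending on the stage type and server version.
  const children = [plan.queryPlan, plan.inputStage, ...(plan.inputStages || [])];
  for (const child of children) stages.push(...planStages(child));
  return stages;
}

// Hand-written, heavily abbreviated sample plans (not real server output).
const planWithoutIndex = { stage: "COLLSCAN" };
const planWithIndex = {
  stage: "FETCH",
  inputStage: { stage: "IXSCAN", indexName: "borough_1" },
};

console.log(planStages(planWithoutIndex));
console.log(planStages(planWithIndex));

// Inside mongosh (where `db` is defined), inspect the real plan as well.
if (typeof db !== "undefined") {
  const explained = db.restaurants.find({ borough: "Brooklyn" }).explain();
  console.log(planStages(explained.queryPlanner.winningPlan));
}
```

A `COLLSCAN` stage indicates that the whole collection is scanned, while an `IXSCAN` stage indicates that an index is used to find the matching documents.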

The value of the field in the index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order.

To remove all indexes, you can use ```db.collection.dropIndexes()```. To remove a specific index you can use ```db.collection.dropIndex()```, such as ```db.restaurants.dropIndex({ borough : 1 })```.  

### 3.1 Writing indexes

1. Write an index that will speed up the following query:
    ```js
    db.restaurants.find({"borough" : "Brooklyn"})
    ```

2. We have an index on the `address` field as follows:
    ```js
    db.restaurants.createIndex( { "address" : -1 })
    ```

    Will the query 
    ```js
    db.restaurants.find({"address.zipcode" : "11225"  })
    ```
    use that index? If not, explain why and provide an example that could use that index.

3. Write a command for creating an index on the zipcode field.

4. Is it possible to create the index below? Why?/Why not?
    ```js
    db.restaurants.createIndex({ "address.coord": 1, "grades": -1})
    ```

5. Write an index to speed up the following query:
    ```js
    db.restaurants.find({"grades.grade" : { $ne : "A"}} , {"name" : 1 , "address.street": 1})
    ```

6. Write an index to speed up the following query:
    ```js
    db.restaurants.find({"grades.score" : {$gt : 50} , "grades.grade" : "C"})
    ```

7. What are the differences between the two index strategies below?

    > **Strategy A:**
    > ```js
    > db.restaurants.createIndex({ "borough": 1, "cuisine": -1})
    > ```  
        
    > **Strategy B:**
    > ```js
    > db.restaurants.createIndex({ "borough": 1})
    > db.restaurants.createIndex({ "cuisine": -1})
    > ``` 

### 3.2 Using Indexes

Consider the following indices:

> **(A)**
> ```js
> db.restaurants.createIndex({ "borough": 1,  "cuisine": -1 })
> ```
 
> **(B)**
> ```js
> db.restaurants.createIndex({"borough": 1, "cuisine": -1, "name" : -1})
> ```

> **(C)**
> ```js
> db.restaurants.createIndex({ "borough": -1,  "cuisine": 1 })
> ```
 
For each of the queries, indicate which indices it would benefit from:

1. ```js
    db.restaurants.find({"borough" : "Brooklyn"})
    ```

2. ```js
    db.restaurants.find({"cuisine" : "Hamburgers"})
    ```

3. ```js
    db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" })
    ```

4. ```js
    db.restaurants.find().sort({"borough" : -1})
    ```

5. ```js
    db.restaurants.find().sort({"borough" : 1, "cuisine" : 1})
    ```

6. ```js
    db.restaurants.find().sort({"borough" : -1, "cuisine" : 1 })
    ```

7. ```js
    db.restaurants.find().sort({"cuisine" : 1, "borough" : -1 })
    ```

## 4. SQL to MongoDB Mapping 

Create a one-to-one mapping between the following SQL and MongoDB queries.

### SQL Queries

> **(1)**
> ```sql
> INSERT INTO users(user_id,  age,  status)  
> VALUES ("bcd001", 45, "A")
> ```

> **(2)**
> ```sql
> SELECT * FROM users
> ```

> **(3)**
> ```sql
> SELECT user_id, status FROM users
> ```

> **(4)**
> ```sql
> SELECT * FROM users  
> WHERE age > 25 AND age <= 50
> ```

> **(5)**
> ```sql
> SELECT * FROM users  
> WHERE status = "A" OR age = 50
> ``` 

> **(6)**
> ```sql
> CREATE TABLE users (     
>     id MEDIUMINT NOT NULL AUTO_INCREMENT,     
>     user_id Varchar(30),     
>     age Number,     
>     status char(1),     
>     PRIMARY KEY (id)     
> )  
> ```  

> **(7)**
> ```sql  
> SELECT COUNT(user_id)   
> FROM users
> ```

### MongoDB Queries

> **(a)**
> ```js
> db.users.find(
>     { "age": { $gt: 25, $lte: 50 } }
> )
> ```

> **(b)**
> ```js
> db.users.find(
>     { },
>     { "user_id": 1, "status": 1, "_id": 0 }
> )
> ```

> **(c)**
> ```js
> db.createCollection("users")
> ```

> **(d)**
> ```js
> db.users.insert(
>     { "user_id": "bcd001", "age": 45, "status": "A" }
> )
> ```

> **(e)**
> ```js
> db.users.count({ "user_id": {$exists: true}}) 
> ```

> **(f)**
> ```js
> db.users.find(   
>     { $or: [ { "status": "A" } ,       
>     { "age": 50 } ] }    
> ) 
> ```

> **(g)**
> ```js
> db.users.find()
> ```

## 5. True or False

For each of the following statements, argue whether it is *true* or *false*. 
If it is false, give at least one justification for your answer.

1. In document stores, you must determine and declare a table's schema before inserting data. 

2. Document stores are not subject to data modeling and support only one denormalized data model.

3. Different relationships between data can be represented by references and embedded documents.

4. There are no joins in MongoDB.

## 6. Choosing the Right Technology

For each of the following situations, state which of the technologies (either a *document store* or a *relational database*) would be more suitable.

1. You are mostly working with semistructured or unstructured data.

2. Your application writes hundreds of records every few seconds but does not update them very often.

3. You have a well-defined schema with clear constraints.

4. You want the queries written for your DB to be easily readable.

5. Your schema has a lot of relations.

6. Your application frequently updates and modifies large volumes of records.

## 7. Simple Query Comparison

Assume you have an SQL database/MongoDB database with a table/collection called `users`. You want to query the users that have registered in the last 24 hours. Assume that each row/document has an attribute/field called `signup_time`. Write the corresponding queries in SQL and MongoDB. Which one do you think is more readable?

## 8. A Bit of Foreshadowing (Optional)

The MongoDB API, while very convenient and suitable for developing data management layers on top of MongoDB, is still "just" an API in an imperative host language and thus, among other drawbacks, likely unsatisfactory for complex analytics use cases.

An alternative for querying denormalized data is the JSONiq query language, which will be properly introduced in the next chapter of this course.
Once you have read a bit more about the language, you may naturally think that it would be well suited to querying the data of a MongoDB document store.

In this exercise, we will thus show you how to use JSONiq to query MongoDB, using the same dataset as before.

First, we have to set up a Rumble session, which will be the engine running the JSONiq queries. Note that if you are running this locally (so not with the provided Jupyter kernel), you will have to install the `jsoniq` Python package with pip (or the Python package manager of your choice).

In [None]:
from jsoniq import RumbleSession
rumble = RumbleSession.builder.withMongo().getOrCreate()
%load_ext jsoniqmagic

You can ignore the warnings that may be shown when executing the above cell, as long as there are no errors.

JSONiq currently can't handle the date format used in MongoDB. We will thus copy our `restaurants` collection to a new collection `restaurantsStripped` and remove the `grades.date` fields for this collection. For this, run the following two commands in your Mongo Shell or execute one of the cells below (depending on whether you are using the Jupyter or a local kernel).

```js
db.restaurantsStripped.drop()
```
```js
db.restaurants.aggregate([ { $project: {"grades.date": 0} }, { $out: "restaurantsStripped" } ])
```


In [None]:
# Jupyter Kernel

!mongosh "mongodb://mongo:27017" --eval 'db.restaurantsStripped.drop()'
        
!mongosh "mongodb://mongo:27017" --eval ''' \
    db.restaurants.aggregate([ \
        { $project: {"grades.date": 0} }, \
        { $out: "restaurantsStripped" } \
    ]); \
'''

In [None]:
# Local Kernel

!docker exec -i mongo mongosh "mongodb://mongo:27017" --eval 'db.restaurantsStripped.drop()'
        
!docker exec -i mongo mongosh "mongodb://mongo:27017" --eval ''' \
    db.restaurants.aggregate([ \
        { $project: {"grades.date": 0} }, \
        { $out: "restaurantsStripped" } \
    ]); \
'''

Great, now you should have a new collection `restaurantsStripped` where all the date fields have been removed.

To use this collection with JSONiq, simply specify the `mongodb-collection` as the source of your queries, as shown in the example below.

In [None]:
%%jsoniq

for $i in mongodb-collection("mongodb://mongo:27017/test", "restaurantsStripped")
where $i.borough eq "Manhattan" and $i.cuisine eq "American"
return $i