# <center>Big Data for Engineers &ndash; Exercises</center>
## <center>Spring 2025 &ndash; Week 10 &ndash; ETH Zurich</center>

## Introduction

This exercise covers document stores. MongoDB serves as the representative document store for the practical exercises.

## 1. Document stores

Document stores are collections of *documents*. Documents can come in many formats: JSON, XML, YAML, or binary formats such as BSON, PDF, or Microsoft Office files. For instance, MongoDB documents are logically structured as JSON but are stored and transmitted as BSON (Binary JSON). Documents are composed of field-value pairs and have the following structure:

![Example of InsertOne](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-insertOne.bakedsvg.svg)

The values of fields may include other documents, arrays, and arrays of documents, so documents can be nested arbitrarily deep. Data in MongoDB does not need to adhere to a single schema, even within the same collection: not all documents need to have the same set of fields or structure, and fields shared by a collection's documents may hold different types of data.
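As a small illustration of this flexibility, the sketch below models two documents of one hypothetical collection as plain JavaScript objects. All field names and values are made up for this example and are not part of the dataset used later.

```js
// Two documents of the same (hypothetical) collection with different shapes.
const docA = {
  name: "Alice",
  age: 30,                                        // a number
  address: { city: "Zurich", zipcode: "8001" },   // a nested document
  tags: ["staff", "admin"]                        // an array
};
const docB = {
  name: "Bob",
  age: "unknown",                                 // same field, different type
  phones: [                                       // an array of documents
    { kind: "mobile", number: "000-0000" }
  ]                                               // no address or tags field
};

const people = [docA, docB];

// Top-level field names of each document, sorted alphabetically.
const fieldSets = people.map(doc => Object.keys(doc).sort());
console.log(fieldSets);

// JavaScript type of the shared "age" field in each document.
const ageTypes = people.map(doc => typeof doc.age);
console.log(ageTypes);
```

Both objects could live in the same collection, even though they share only some fields and store `age` with different types.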

### Questions
1. What are the advantages of document stores over relational databases?
2. Is data normalization *possible* in document stores? 
3. How does denormalization affect performance? 
4. How does a large number of small documents affect performance? 
5. What makes document stores different from key-value stores?

## 2. MongoDB

### 2.1 Install MongoDB
MongoDB is an open-source document database. To install it, run the following command if you haven't already:

```bash
docker-compose up -d
```

### 2.2 Import the dataset

Next, to import the dataset, run

```bash
wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
```

in the current folder or, alternatively, run the following cell:

```python
!wget https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
```

Next, run

```bash
docker cp primer-dataset.json mongo:/primer-dataset.json
```

in the current directory to copy the dataset to the docker container.

Finally, use `mongoimport` to insert the documents into the `restaurants` collection in the `test` database by running the following command. If the `restaurants` collection already exists in the `test` database, the `--drop` option drops it before the import.
  
```bash
docker exec mongo mongoimport --db test --collection restaurants --drop --file /primer-dataset.json
```

### 2.3 Mongo shell 

The mongo shell is an interactive JavaScript interface to MongoDB. You can use the mongo shell to query and update data, as well as to perform administrative operations.

To start the mongo shell, run the following command:

```bash
docker exec -it mongo mongosh
```

In the mongo shell connected to a running MongoDB instance, switch to the `test` database:

```js
use test
```

You can inspect the structure of the documents in the collection with the following commands.

Query all documents in a collection:
```js
db.restaurants.find()
```

Query one document in a collection:
```js
db.restaurants.findOne()
```

To format the printed result, you can add `.pretty()` to the operation, as in the following:

```js
db.restaurants.find().limit(1).pretty()
```

### 2.4 Query Documents
For the `db.collection.find()` method, you can specify the following optional arguments:
- a query filter that specifies which documents to return and
- a query projection that specifies which fields of the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network).

In addition, cursor modifiers can be used to impose limits, skips, and sort orders.

![query](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-find.bakedsvg.svg)
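To make the roles of filter, projection, and cursor modifiers concrete, here is a plain JavaScript analogy over an in-memory array. It is only a sketch: the sample documents and the helper `findLike` are invented for this illustration and are not part of MongoDB, which evaluates queries on the server and supports far richer filters.

```js
// Hypothetical sample data; not taken from the restaurants dataset.
const sampleDocs = [
  { _id: 1, name: "Cafe B", borough: "Bronx", cuisine: "Coffee" },
  { _id: 2, name: "Cafe A", borough: "Bronx", cuisine: "Coffee" },
  { _id: 3, name: "Diner C", borough: "Queens", cuisine: "American" }
];

// In-memory analogy of find(filter, projection).sort(...).limit(...).
// Supports equality filters and ascending sort on one field only.
function findLike(collection, { filter, fields, sortBy, limit }) {
  return collection
    // filter: keep documents whose fields equal the requested values
    .filter(doc => Object.entries(filter).every(([k, v]) => doc[k] === v))
    // sort: order the matching documents (filter() returned a new array)
    .sort((a, b) => (a[sortBy] < b[sortBy] ? -1 : a[sortBy] > b[sortBy] ? 1 : 0))
    // limit: keep only the first few documents
    .slice(0, limit)
    // projection: return only the requested fields
    .map(doc => Object.fromEntries(fields.map(f => [f, doc[f]])));
}

const result = findLike(sampleDocs, {
  filter: { borough: "Bronx" },
  fields: ["name", "cuisine"],
  sortBy: "name",
  limit: 1
});
console.log(result); // [ { name: 'Cafe A', cuisine: 'Coffee' } ]
```

In the mongo shell, a call of the same shape would be `db.restaurants.find({"borough": "Bronx"}, {"name": 1, "cuisine": 1, "_id": 0}).sort({"name": 1}).limit(1)`.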

Here are some useful links to learn more about MongoDB: 

https://www.mongodb.com/docs/manual/tutorial/query-documents/

https://docs.mongodb.com/manual/aggregation/

### Questions
Write queries in MongoDB that return the following:

1. All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".
2. The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".
3. All restaurants with zipcode 11225.
4. Names of restaurants with zipcode 11225 that have at least one grade "C".
5. Names of restaurants with zipcode 11225 whose first grade is "C" and whose second grade is "A".  
    **Hint**: You can use "." to select an element of an array, e.g. "grades.0" accesses the first element of grades.
6. Names and streets of restaurants that don't have an "A" grade.
7. All restaurants for which at least one rating has a grade C **with** a score greater than 50.  
    **Hint**: Since the grade and the score must belong to the same grading, you should look up `$elemMatch`.
8. All restaurants with a grade C or a score greater than 50.
9. A table with zipcode and number of restaurants that have that zipcode for the restaurants that are in the borough "Queens" and have "Brazilian" cuisine.  
    **Hint**: Research how to write an aggregation pipeline.
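The `$elemMatch` hint above comes down to whether two conditions must hold for the same array element or may be satisfied by different elements. The plain JavaScript sketch below illustrates that distinction on a made-up document; it is an analogy for the matching semantics, not MongoDB query syntax.

```js
// Hypothetical document: a "C" grading with a low score and an "A" grading with a high score.
const restaurant = {
  name: "Example Diner",
  grades: [
    { grade: "C", score: 12 },
    { grade: "A", score: 60 }
  ]
};

// Conditions checked independently: each one may be satisfied by a different element.
const anyElement =
  restaurant.grades.some(g => g.grade === "C") &&
  restaurant.grades.some(g => g.score > 50);

// Conditions checked together: one single element must satisfy both.
const sameElement =
  restaurant.grades.some(g => g.grade === "C" && g.score > 50);

console.log(anyElement);  // true
console.log(sameElement); // false
```

Two separate conditions on dotted array fields behave like the first check, whereas `$elemMatch` behaves like the second.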

### 2.5 Create, Update and Delete

In the previous subsection, we covered the query command `.find()`. In this subsection, we go over the commands required to create, update, and delete documents in MongoDB.

**Create.** To insert documents into a collection, use the `.insertOne(document)` or `.insertMany([document1, ...])` commands.

**Update.** Similarly, documents can be updated using `.updateOne()` and `.updateMany()`. The general syntax consists of a filter criterion and a document of update operator expressions, e.g.
```js
db.scientists.updateMany(
    {"Last": "Einstein"},
    {"$set": {"Century": 20}}
)
```
This applies the update to all documents matching the filter criterion (only the first matching document for `.updateOne()`). See [the manual](https://www.mongodb.com/docs/manual/reference/operator/update/#std-label-update-operators) for more information.

**Delete.** Documents can be deleted using `.deleteOne()` and `.deleteMany()`. These commands again take a filter criterion as argument and delete the first matching document or all matching documents, respectively.

More information can be found in the MongoDB documentation: https://www.mongodb.com/docs/manual/crud/

### Questions

1. Add the following entry for *Sternen Grill* into the `restaurants` collection.

    ```js
    {
        "address": {
            "building": "22",
            "coord": [8.545590, 47.367341],
            "street": "Theaterstrasse",
            "zipcode": "8001"
        },
        "borough": "Kreis 1",
        "cuisine": "Swiss",
        "grades": [],
        "name": "Sternen Grill",
        "restaurant_id": "99999999"
    }
    ```
    
2. Write a query that returns the `name` and `_id` of restaurants with the name `Sternen Grill`.
3. Add a review to *Sternen Grill* with `score` 5 and `grade` A.

    **Hint.** You can create a date object representing the current time with `new Date()`. Also see [the manual](https://www.mongodb.com/docs/manual/reference/operator/update/#std-label-update-operators) for the different update operators; you might find `$push` helpful.
4. Write an update statement that adds an `avgScore` field to each restaurant document. This field should contain the average of the scores from the `grades` array.

    **Hint.** You can use the aggregation pipeline syntax in an update by passing an array instead of an object (see [here](https://www.mongodb.com/docs/manual/reference/method/db.collection.updateMany/#std-label-updateMany-behavior-aggregation-pipeline)). You might find `$set` and `$avg` useful.
5. Using the `avgScore` field you added in the previous question, write a query that returns a table with each unique cuisine and the average of all `avgScore` values for that cuisine, sorted by that average in ascending order.
6. Remove `Sternen Grill` from the collection again.

## 3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document in a collection to select those documents that match the query. Scans can be highly inefficient and require MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.
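The idea can be sketched in plain JavaScript: an index is kept as a list of (field value, document position) entries sorted by value, so that a lookup can use binary search instead of scanning every document. This is a simplified illustration with made-up data and hypothetical helper names (`buildIndex`, `lookup`); MongoDB's actual indexes are B-tree structures managed by the server.

```js
// Hypothetical mini-collection.
const miniCollection = [
  { name: "R1", borough: "Queens" },
  { name: "R2", borough: "Bronx" },
  { name: "R3", borough: "Brooklyn" },
  { name: "R4", borough: "Bronx" }
];

// Build a list of [value, position] entries for one field, sorted by value
// (ties are ordered by position).
function buildIndex(collection, field) {
  return collection
    .map((doc, pos) => [doc[field], pos])
    .sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : a[1] - b[1]));
}

// Binary search for the first entry whose value is >= key,
// then collect the positions of all entries equal to key.
function lookup(index, key) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < key) lo = mid + 1;
    else hi = mid;
  }
  const positions = [];
  for (let i = lo; i < index.length && index[i][0] === key; i++) {
    positions.push(index[i][1]);
  }
  return positions;
}

const boroughIndex = buildIndex(miniCollection, "borough");
const hits = lookup(boroughIndex, "Bronx").map(pos => miniCollection[pos].name);
console.log(hits); // [ 'R2', 'R4' ]
```

The lookup inspects only a logarithmic number of index entries plus the matching ones, while a scan would have to look at every document.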

MongoDB supports single-field indexes as well as compound indexes that contain multiple fields. Which kind is appropriate depends on the operations the index is meant to support.

By default, MongoDB creates the `_id` index, which is an ascending unique index on the `_id` field, for all collections upon creation. You cannot remove the index on the `_id` field.

### Managing indexes in MongoDB

The `explain()` method provides information on the query plan. It returns a document that describes the stages and indexes used to answer the query. This may provide useful insight when attempting to optimize a query. Have a look at the "winningPlan" field of the result and at how it changes once an index is created.

```js
    db.restaurants.find({"borough" : "Brooklyn"}).explain()
```

In the mongo shell, you can create an index by calling the `createIndex()` method.  

```js
    db.restaurants.createIndex({ "borough" : 1 })
```

Now, retrieve a new query plan for indexed data and again have a look at the "winningPlan" field of the result and how it changed.
```js
    db.restaurants.find({"borough" : "Brooklyn"}).explain()
```
The value assigned to each field in the index specification determines the sort order. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order.
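As a rough illustration of how such a specification translates into an ordering, the sketch below builds a comparator from a key specification and sorts a few made-up entries with it. The helper `comparatorFor` is hypothetical and only mimics the ordering for plain strings; it is not how MongoDB implements its indexes.

```js
// Build a comparator from an index-like specification such as { borough: 1, cuisine: -1 }.
function comparatorFor(spec) {
  const keys = Object.entries(spec); // keeps the order in which the fields were listed
  return (a, b) => {
    for (const [field, direction] of keys) {
      if (a[field] < b[field]) return -direction; // 1: smaller first, -1: larger first
      if (a[field] > b[field]) return direction;
    }
    return 0; // equal on all listed fields
  };
}

// Made-up entries for the example.
const entries = [
  { borough: "Queens", cuisine: "Pizza" },
  { borough: "Bronx", cuisine: "Bakery" },
  { borough: "Bronx", cuisine: "Pizza" }
];

// Ascending by borough, then descending by cuisine within the same borough.
const ordered = [...entries].sort(comparatorFor({ borough: 1, cuisine: -1 }));
console.log(ordered.map(e => `${e.borough}/${e.cuisine}`));
// [ 'Bronx/Pizza', 'Bronx/Bakery', 'Queens/Pizza' ]
```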

To remove all indexes except the one on `_id`, you can use `db.collection.dropIndexes()`. To remove a specific index, you can use `db.collection.dropIndex()`, such as `db.restaurants.dropIndex({ borough : 1 })`.

### Questions

1. Write an index that will speed up the following query:

    ```js
        db.restaurants.find({"borough" : "Brooklyn"})
    ```

2. Suppose we have an index on the `address` field as follows:
    
    ```js
        db.restaurants.createIndex( {"address" : -1 })
    ```
    
    Will the query below use that index?
    
    ```js
        db.restaurants.find({"address.zipcode" : "11225"  })
    ```

3. Write a command for creating an index on the `zipcode` field.

4. Consider the following compound index:

    ```js
        db.restaurants.createIndex({ "borough": 1,  "cuisine": -1 })
    ```
    
    Which of the following queries use the index above?
    

    ```js
        a) db.restaurants.find({"borough" : "Brooklyn"})
        b) db.restaurants.find({"cuisine" : "Hamburgers"})
        c) db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" })
        d) db.restaurants.find().sort({"borough" : -1})
        e) db.restaurants.find().sort({"borough" : 1, "cuisine" : 1})
        f) db.restaurants.find().sort({"borough" : -1, "cuisine" : 1 })
        g) db.restaurants.find().sort({"cuisine" : 1, "borough" : -1 })
    ```
 
5. Answer Question 4, but for the following index:

    ```js
        db.restaurants.createIndex({"borough": 1, "cuisine": -1, "name" : -1})
    ```

6. Is it possible to create the index below? Why or why not?

    ```js
        db.restaurants.createIndex({ "address.coord": 1, "grades": -1})
    ```

7. Write an index to speed up the following query:

    ```js
        db.restaurants.find({"grades.grade" : { "$ne" : "A"}} , {"name" : 1 , "address.street": 1})
    ```

8. Write an index to speed up the following query:

    ```js
        db.restaurants.find({"grades.score" : {"$gt" : 50} , "grades.grade" : "C"})
    ```
    
9. What are the differences between the two index strategies below?

    ```js
        a)  db.restaurants.createIndex({ "borough": 1, "cuisine": -1})
    ```  
    ```js
        b)  db.restaurants.createIndex({ "borough": 1})
            db.restaurants.createIndex({ "cuisine": -1})
    ``` 
    
10. How are sparse indexes different from normal ones?

11. How are hashed indexes used in MongoDB? 

## 4. SQL to MongoDB Mapping 

Create a one-to-one mapping between the following SQL and MongoDB queries.

1.
```sql
INSERT INTO users(user_id,  age,  status)  
VALUES ("bcd001", 45, "A")
```

2.
```sql
SELECT * FROM users
```

3.
```sql
SELECT user_id, status FROM users
```

4.
```sql
SELECT * FROM users  
WHERE age > 25 AND age <= 50
```

5.
```sql
SELECT * FROM users  
WHERE status = "A" OR age = 50
``` 

6.
```sql
CREATE TABLE users (     
            id MEDIUMINT NOT NULL AUTO_INCREMENT,     
            user_id Varchar(30),     
            age Number,     
            status char(1),     
            PRIMARY KEY (id)     
)  
```  

7.
```sql
SELECT COUNT(user_id)   
FROM users
```


a.
```js
db.users.find(
    {"age": { "$gt": 25, "$lte": 50 }}
)
```
b. 
```js
db.users.find(
    { },
    {"user_id": 1, "status": 1, "_id": 0 }
)
```
c.
```js
db.createCollection("users")
```
d. 
```js
db.users.insertOne(
   {"user_id": "bcd001", "age": 45, "status": "A" }
)
```
e.

```js
db.users.countDocuments(
    {"user_id": {"$exists": true}}
) 

``` 
f.   
```js
db.users.find(   
    {"$or": [
        {"status": "A"},       
        {"age": 50}
    ]}    
) 
```
g. 
```js
db.users.find()
```

## 5. True or False
Decide whether the following statements are *true* or *false*.

1. In document stores, you must determine and declare a collection's schema before inserting data. 
2. Document stores are not subject to strict schema requirements and support only one denormalized data model.
3. Different relationships between data can be represented by references and embedded documents.
4. MongoDB provides the capability to validate documents during updates and insertions.
5. MongoDB uses sharding to replicate data across multiple nodes for high availability and fault tolerance.

## 6. Choose the right technology
For each of the following situations, state which technology (a *document store* or a *relational database*) would be more suitable.
1. You are mostly working with semistructured or unstructured data.
2. Your application writes hundreds of records every few seconds but does not update them very often.
3. You have a well-defined schema with clear constraints.
4. You want the queries written for your DB to be easily readable.
5. Your schema has a lot of relations.
6. Your application frequently updates and modifies a large volume of records.