# <center>Big Data &ndash; Exercises &ndash; Solution</center>
## <center>Fall 2024 &ndash; Week 10 &ndash; ETH Zurich</center>


## Introduction

This exercise will cover document stores. As a representative of document stores, MongoDB was chosen for the practical exercises.

## 1. Document stores

A record in a document store is a *document*. Document encoding schemes include XML, YAML, JSON, and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). MongoDB documents are similar to JSON objects.  Documents are composed of field-value pairs and have the following structure:

![123](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-insertOne.bakedsvg.svg)

The values of fields may include other documents, arrays, and arrays of documents. Data in MongoDB has a flexible schema: documents in the same collection do not need to have the same set of fields or structure, and fields that are common to a collection's documents may hold different types of data.
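
As a minimal sketch of what a flexible schema means, the plain JavaScript array below stands in for a collection. The documents and field names are made up for illustration and are not taken from the exercise dataset.

```javascript
// Hypothetical documents standing in for one MongoDB collection.
const collection = [
  { _id: 1, name: "Vella", cuisine: "Italian", zipcode: "10075" },
  // Has a nested document, but no "cuisine" and no "zipcode" field.
  { _id: 2, name: "Burger Place", address: { street: "Flatbush Avenue" } },
  // Same field name as in document 1, but a number instead of a string.
  { _id: 3, name: "Riviera", zipcode: 11224 },
];

// Collect, for every field name, the set of value types that occur.
const typesByField = {};
for (const doc of collection) {
  for (const [field, value] of Object.entries(doc)) {
    if (!typesByField[field]) typesByField[field] = new Set();
    typesByField[field].add(typeof value);
  }
}

for (const [field, types] of Object.entries(typesByField)) {
  console.log(field, "->", [...types].join(" | "));
}
```

The `zipcode` field ends up with two different types, and `cuisine` and `address` each appear in only one document, which a relational table with a fixed schema would not allow without extra modeling.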

### Questions
1. What are advantages of document stores over relational databases?
2. Can the data in document stores be normalized? 
3. How does denormalization affect performance? 
4. How does a large number of small documents affect performance? 
5. What makes document stores different from key-value stores?

###  Solution

1. Flexibility. Not every record needs to store the same properties. New properties can be added on the fly. 

2. Yes. References can be used for data normalization (check included figure). 

3. All data for an object is stored in a single record. In general, it provides better performance for read operations (since expensive joins can be omitted), as well as the ability to request and retrieve related data in a single database operation. In addition, embedded data models make it possible to update related data in a single atomic write operation.

4. It degrades performance since document stores are basically key-value stores. You should consider *embedding* for performance reasons if you have a collection with a large number of small documents. If you can group these small documents by some logical relationship and you frequently retrieve the documents by this grouping, you might consider "rolling-up" the small documents into larger documents that contain an array of embedded documents.

    Rolling up these small documents into logical groupings means that queries to retrieve a group of documents involve sequential reads and fewer random disk accesses. Additionally, rolling up documents and moving common fields to the larger document benefits the creation of an index on these fields. There would be fewer copies of the common fields and there would be fewer associated key entries in the corresponding index. See Section 3 for more information on indexes.

    However, if you often only need to retrieve a subset of the documents within the group, then rolling up the documents may not provide better performance. Furthermore, if small, separate documents represent the natural model for the data, you should maintain that model.
    

5. Document-oriented databases are inherently a subclass of key-value stores. The difference lies in the way the data is processed: in a key-value store, the data is considered inherently opaque to the database, whereas a document-oriented system relies on the internal structure of the documents to extract metadata that the database engine uses for further optimization. Although in practice the difference often lies mostly in the tooling of the systems, conceptually a document store is designed to offer a richer experience with modern programming techniques.



Illustration for answer (2):

<img src="https://docs.mongodb.com/manual/images/data-model-normalized.bakedsvg.svg" style="width: 500px;"/>
<img src="https://docs.mongodb.com/manual/images/data-model-denormalized.bakedsvg.svg" style="width: 500px;"/>

From [the MongoDB official documentation](https://docs.mongodb.com/manual/core/data-model-design/)
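
The two data models in the figures can be sketched in plain JavaScript, with arrays standing in for collections. The identifiers and values are assumptions chosen for illustration, and the "join" here is done by hand in the application.

```javascript
// Normalized model: the contact lives in its own "collection" and
// points back to the user through a reference (user_id).
const users = [{ _id: "u1", username: "123xyz" }];
const contacts = [{ _id: "c1", user_id: "u1", phone: "123-456-7890" }];

// Reading a user together with its contact needs a second lookup.
function readNormalized(userId) {
  const user = users.find((u) => u._id === userId);
  const contact = contacts.find((c) => c.user_id === userId);
  return { ...user, contact };
}

// Denormalized model: the contact is embedded in the user document,
// so a single read returns everything.
const usersEmbedded = [
  { _id: "u1", username: "123xyz", contact: { phone: "123-456-7890" } },
];

function readEmbedded(userId) {
  return usersEmbedded.find((u) => u._id === userId);
}

console.log(readNormalized("u1").contact.phone);
console.log(readEmbedded("u1").contact.phone);
```

Both reads return the same phone number. The normalized version avoids duplicating the contact if several documents refer to it, while the embedded version needs only one lookup.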

## 2. MongoDB

### 2.1 Install MongoDB
MongoDB is an open-source document database. To start it:

```docker-compose up```

### 2.2 Import the dataset

On your local machine run: 

``` curl -O https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json```

and

```docker cp primer-dataset.json mongo:/primer-dataset.json```

to copy it to the docker container.

Then, open a shell terminal inside the `mongo` container using:

```docker exec -it mongo sh```

Using `ls`, you should now see the `primer-dataset.json` file.

Use `mongoimport` to insert the documents into the ```restaurants``` collection in the ```test``` database. If the collection already exists in the ```test``` database, the operation will drop the ```restaurants``` collection first.
  
```mongoimport --db test --collection restaurants --drop --file ./primer-dataset.json```

You should see something similar to the following output: 
```25359 document(s) imported successfully. 0 document(s) failed to import.```

### 2.3 Mongo shell 

The mongo shell is an interactive JavaScript interface to MongoDB. You can use the mongo shell to query and update data, as well as to perform administrative operations.

To start mongo use:

```mongosh```

In the mongo shell connected to a running MongoDB instance, switch to the ```test``` database.
```use test```

Try to insert a document into the ```restaurants``` collection. The example below also shows the structure of the documents in the collection.

```json
db.restaurants.insertOne(
   {
      "address" : {
         "street" : "2 Avenue",
         "zipcode" : "10075",
         "building" : "1480",
         "coord" : [ -73.9557413, 40.7720266 ]
      },
      "borough" : "Manhattan",
      "cuisine" : "Italian",
      "grades" : [
         {
            "date" : ISODate("2014-10-01T00:00:00Z"),
            "grade" : "A",
            "score" : 11
         },
         {
            "date" : ISODate("2014-01-16T00:00:00Z"),
            "grade" : "A",
            "score" : 17
         }
      ],
      "name" : "Vella",
      "restaurant_id" : "41704620"
   }
)
```

Query all documents in a collection:

```db.restaurants.find()```

Query one document in a collection:

```db.restaurants.findOne()```

To format the printed result, you can add ```.pretty()``` to the operation, as in the following:

```db.restaurants.find().limit(1).pretty()```

### Query Documents
For the ```db.collection.find()``` method, you can specify the following optional arguments:
- a query filter that specifies which documents to return,
- a query projection that specifies which fields of the matching documents to return (the projection limits the amount of data that MongoDB returns to the client over the network),
- optionally, a cursor modifier to impose limits, skips, and sort orders.

![query](https://docs.mongodb.com/manual/images/crud-annotated-mongodb-find.bakedsvg.svg)


### 2.4 Questions
Write queries in MongoDB that return the following:

1. All restaurants in borough (a town) "Brooklyn" and cuisine (a style of cooking) "Hamburgers".
2. The number of restaurants in the borough "Brooklyn" and cuisine "Hamburgers".
3. All restaurants with zipcode 11225.
4. Names of restaurants with zipcode 11225 that have at least one grade "C".
5. Names of restaurants with zipcode 11225 that have as first grade "C" and as second grade "A".
6. Names and streets of restaurants that don't have an "A" grade.
7. All restaurants for which at least one rating has a grade C **with** a score greater than 50.
8. All restaurants with a grade C or a score greater than 50.
9. A table with zipcode and number of restaurants that are in the borough "Queens" and have "Brazilian" cuisine.
10. (Optional) Find the top 5 restaurants in the borough “Brooklyn” with the cuisine “Hamburgers” based on the highest average grade scores.


You can read more about MongoDB here: 

https://www.mongodb.com/docs/mongodb-shell/

https://www.mongodb.com/docs/manual/reference/operator/

https://www.mongodb.com/docs/manual/aggregation/

Hint: for Question 10, [$unwind](https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/) might be useful.
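
To get a feeling for what `$unwind` does, here is a rough plain-JavaScript imitation. The `unwind` helper and the sample document are assumptions for illustration only; the real stage has more options and runs inside the aggregation pipeline.

```javascript
// Hypothetical document with an array field.
const doc = { name: "Vella", grades: [{ score: 11 }, { score: 17 }] };

// Imitates { $unwind: "$grades" }: one output document per array element,
// with the array field replaced by that single element.
function unwind(d, field) {
  return d[field].map((element) => ({ ...d, [field]: element }));
}

const unwound = unwind(doc, "grades");
console.log(unwound);

// After unwinding, averaging the scores is a simple reduction.
const avg =
  unwound.reduce((sum, d) => sum + d.grades.score, 0) / unwound.length;
console.log(avg); // 14
```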

### 2.4 Solution


1. ```db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" }) ```
  
2. ```db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" }).count()```

3. ```db.restaurants.find({"address.zipcode" : "11225" })```

4. ```db.restaurants.find({"address.zipcode" : "11225" , "grades.grade" : "C" } , {"name" : 1 })```

5. ```db.restaurants.find({"address.zipcode" : "11225" , "grades.0.grade" : "C", "grades.1.grade" : "A"  },{"name" : 1 })```

6. ```db.restaurants.find({"grades.grade" : { $ne : "A"}} , {"name" : 1 , "address.street": 1})```

7. ```db.restaurants.find({"grades" : {$elemMatch : {"grade" : "C", "score" : {$gt : 50}}}})```
    
    An alternative way of phrasing the query (but not the correct one):
``` db.restaurants.find( {$and: [{"grades.score" : {$gt : 50}}, { "grades.grade" : "C"}]} ).count()``` or ```db.restaurants.find( {"grades.score" : {$gt : 50}, "grades.grade" : "C"} ).count()```

    In the query with `$elemMatch`, the filter only matches if a single element of the array satisfies **both criteria** (meaning, we get a grade C **with** a score greater than 50). In the alternative version the two conditions are checked independently, so we get some extra matches in cases where one rating has a grade C and a different rating has a score greater than 50. 
    
8. ```db.restaurants.find( {$or: [{"grades.score" : {$gt : 50}}, { "grades.grade" : "C"}]} )```

9.  ```db.restaurants.aggregate([{ $match: { "borough": "Queens", "cuisine": "Brazilian" } }, { $group: { "_id": "$address.zipcode" , "count": { $sum: 1 } } }])```

10. ```db.restaurants.aggregate([{ $match: { borough: "Brooklyn", cuisine: "Hamburgers" } }, { $unwind: "$grades" }, { $group: {_id: "$name", avgScore: { $avg: "$grades.score" }}}, { $sort: { avgScore: -1 } }, { $limit: 5 }])```
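
The difference discussed in answer 7 can be reproduced without a database. The sketch below imitates the two filters with `Array.prototype.some`; the two sample restaurants are hypothetical.

```javascript
// One rating has grade "C", a different rating has a score above 50.
const mixed = {
  name: "Mixed",
  grades: [
    { grade: "C", score: 10 },
    { grade: "A", score: 60 },
  ],
};
// A single rating satisfies both conditions.
const both = { name: "Both", grades: [{ grade: "C", score: 55 }] };

// Imitates { grades: { $elemMatch: { grade: "C", score: { $gt: 50 } } } }.
const elemMatch = (doc) =>
  doc.grades.some((g) => g.grade === "C" && g.score > 50);

// Imitates { "grades.grade": "C", "grades.score": { $gt: 50 } }.
const independent = (doc) =>
  doc.grades.some((g) => g.grade === "C") &&
  doc.grades.some((g) => g.score > 50);

console.log(elemMatch(mixed), independent(mixed)); // false true
console.log(elemMatch(both), independent(both)); // true true
```

The restaurant `mixed` is exactly the kind of extra match that the version without `$elemMatch` returns.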


## 3. Indexing in MongoDB

Indexes support the efficient resolution of queries. Without indexes, MongoDB must scan every document in a collection to select those documents that match the query. Scans can be highly inefficient and require MongoDB to process a large volume of data.

Indexes are special data structures that store a small portion of the data set in an easy-to-traverse form. The index stores the value of a specific field or set of fields, ordered by the value of the field as specified in the index.
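
The idea of an ordered index can be sketched in a few lines of JavaScript. This is only an illustration under simplifying assumptions: a sorted array stands in for the tree structure a real database would use, and array positions stand in for document locations.

```javascript
// Hypothetical mini collection.
const restaurants = [
  { name: "R0", borough: "Queens" },
  { name: "R1", borough: "Brooklyn" },
  { name: "R2", borough: "Manhattan" },
  { name: "R3", borough: "Brooklyn" },
  { name: "R4", borough: "Bronx" },
];

// Ascending index on "borough": (key, position) pairs sorted by key.
const index = restaurants
  .map((doc, position) => ({ key: doc.borough, position }))
  .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));

// Binary search for the first index entry whose key is >= target.
function lowerBound(entries, target) {
  let lo = 0;
  let hi = entries.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (entries[mid].key < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Index lookup: jump to the first match, then read the following
// entries for as long as the key still matches.
function findByBorough(target) {
  const result = [];
  for (
    let i = lowerBound(index, target);
    i < index.length && index[i].key === target;
    i++
  ) {
    result.push(restaurants[index[i].position]);
  }
  return result;
}

console.log(findByBorough("Brooklyn").map((d) => d.name));
```

Without the index, finding the Brooklyn restaurants means checking all five documents; with it, the search touches only a logarithmic number of index entries plus the matches themselves.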

MongoDB supports single-field indexes as well as compound indexes that span multiple fields; which one is appropriate depends on the queries the index should support. 

By default,  MongoDB creates the ```_id``` index, which is an ascending unique index on the ```_id``` field, for all collections when the collection is created. You cannot remove the index on the ```_id``` field.

### Managing indexes in MongoDB

The ```explain``` operator provides information on the query plan. It returns a document that describes the steps and indexes used to compute the query result. This may provide useful insight when attempting to optimize a query.

 ```db.restaurants.find({"borough" : "Brooklyn"}).explain()```

In the mongo shell, you can create an index by calling the ```createIndex()``` method.  

 ```db.restaurants.createIndex({ "borough" : 1 })```

Now retrieve the query plan again; it should use the new index.
```db.restaurants.find({"borough" : "Brooklyn"}).explain()```
The value of the field in the index specification describes the kind of index for that field. For example, a value of 1 specifies an index that orders items in ascending order. A value of -1 specifies an index that orders items in descending order.

To remove all indexes, you can use ```db.collection.dropIndexes()```. To remove a specific index you can use ```db.collection.dropIndex()```, such as ```db.restaurants.dropIndex({ borough : 1 })```.  

### 3.1 Writing indexes

1. Write an index that will speed up the following query:

 ```db.restaurants.find({"borough" : "Brooklyn"})```

2. We have an index on the `address` field as follows:

 ```db.restaurants.createIndex( { "address" : -1 })```

   Will the query 

 ```db.restaurants.find({"address.zipcode" : "11225"  })```

   use that index? If not, explain why and provide an example that could use that index.

3. Write a command for creating an index on the zipcode field.

4. Is it possible to create the index below? Why?/Why not?

 ```db.restaurants.createIndex({ "address.coord": 1, "grades": -1})```

5. Write an index to speed up the following query:

  ```db.restaurants.find({"grades.grade" : { $ne : "A"}} , {"name" : 1 , "address.street": 1})```

6. Write an index to speed up the following query:

 ```db.restaurants.find({"grades.score" : {$gt : 50} , "grades.grade" : "C"})```

7. What are the differences between the two index strategies below?  

   A. ```db.restaurants.createIndex({ "borough": 1, "cuisine": -1})```  
        
   B. ```db.restaurants.createIndex({ "borough": 1})```    ```db.restaurants.createIndex({ "cuisine": -1})``` 

### Solution

1. ```db.restaurants.createIndex({"borough" : 1})```
2. No, because `address` is an embedded document and the index is on the entire document, not on its subfields. Queries such as the following, however, would benefit from the index: <br> ``` db.restaurants.find({ address: { building: '319', coord: [-73.9852422, 40.7471677], street: '5 Avenue', zipcode: '10016' } })``` <br> Or more simply: <br> ```db.restaurants.find().sort({ address: -1 })``` <br> Using `.explain()` on the first query, you should see the following attribute in the `winningPlan` field: <br> 
```
inputStage: {
    stage: 'IXSCAN',
    keyPattern: { address: -1 },
    indexName: 'address_-1',
    isMultiKey: false,
    multiKeyPaths: { address: [] },
    isUnique: false,
    isSparse: false,
    isPartial: false,
    indexVersion: 2,
    direction: 'forward',
    indexBounds: {
      address: [
        '[{ building: "319", coord: [ -73.9852422, 40.7471677 ], street: "5 Avenue", zipcode: "10016" }, { building: "319", coord: [ -73.9852422, 40.7471677 ], street: "5 Avenue", zipcode: "10016" }]'
      ]
    }
}
```  
<br> Notice the keys `indexName` and `indexBounds`. You should observe 2 things: first that the index was indeed used, and secondly, that the whole subdocument was used as a single string to match the index. For more information on using subdocuments as indexes, check this [article](https://www.percona.com/blog/mongodb-utilization-of-an-index-on-subdocuments/).

3. ```db.restaurants.createIndex({"address.zipcode" : 1})```
4. No. For a compound multikey index, each indexed document can have at most one indexed field whose value is an array. As such, you cannot create a compound multikey index if more than one to-be-indexed field of a document is an array. Here, both `address.coord` and `grades` are arrays.
5. Just ```db.restaurants.createIndex({ "grades.grade": 1 })```, since ```{"name" : 1 , "address.street": 1}``` is a projection. Note that negation is in general inefficient: `$ne` queries can use an index, but not very well. They must look at all index entries other than the one specified by `$ne`, so they basically have to scan the entire index. Therefore, although the total number of objects scanned and their size are reduced, the index is not fully efficient.
6.  ```db.restaurants.createIndex({"grades.score": 1 , "grades.grade": 1 })```  
    As a further example, the same index [is not used efficiently](https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/) by the following query: 
    
    ```db.restaurants.find({"grades.score" : {$gt : 50} }).sort({"grades.grade": 1})```, since the range condition on "grades.score" selects many index keys, and across those keys the "grades.grade" values are not in sorted order, so the results of the filter still have to be sorted separately.
7. B covers more cases: the single-field index on `cuisine` also supports queries on `cuisine` alone, and for queries on both fields MongoDB can use index intersection. In that sense B includes A. However, the compound index is faster than two single-field indexes when you query on both fields. 
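
One way to picture the restriction in answer 4 is to count the index entries a single document would need. The sketch below is an illustration of the combinatorial growth under simplifying assumptions, not a description of MongoDB's internals; the flattened document and the helper names are hypothetical.

```javascript
// A multikey index holds one entry per element of an indexed array.
function keysFor(value) {
  return Array.isArray(value) ? value : [value];
}

// Entries a compound index on (fieldA, fieldB) would need for one
// document: every combination of a key from fieldA with one from fieldB.
function compoundEntries(doc, fieldA, fieldB) {
  const entries = [];
  for (const a of keysFor(doc[fieldA])) {
    for (const b of keysFor(doc[fieldB])) {
      entries.push([a, b]);
    }
  }
  return entries;
}

// Hypothetical, flattened document (field names simplified).
const doc = {
  borough: "Manhattan",
  coord: [-73.9557413, 40.7720266],
  grades: ["A", "B", "C"],
};

// One array field: the entry count grows linearly with the array length.
console.log(compoundEntries(doc, "borough", "coord").length); // 2
// Two array fields: the entry count is the product of both lengths.
console.log(compoundEntries(doc, "coord", "grades").length); // 6
```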

### 3.2 Index use
 
Consider the following queries:
 
   A. ```db.restaurants.find({"borough" : "Brooklyn"})```

   B. ```db.restaurants.find({"cuisine" : "Hamburgers"})```

   C. ```db.restaurants.find({"borough" : "Brooklyn", "cuisine" : "Hamburgers" })```

   D. ```db.restaurants.find().sort({"borough" : -1})```

   E. ```db.restaurants.find().sort({"borough" : 1, "cuisine" : 1})```

   F. ```db.restaurants.find().sort({"borough" : -1, "cuisine" : 1 })```
        
   G. ```db.restaurants.find().sort({"cuisine" : 1, "borough" : -1 })```



   
For every index, explain which queries would benefit from it:

1. ```db.restaurants.createIndex({ "borough": 1,  "cuisine": -1 })```
 
2. ```db.restaurants.createIndex({"borough": 1, "cuisine": -1, "name" : -1})```

3. ```db.restaurants.createIndex({ "borough": -1,  "cuisine": 1 })```

### Solution 


1. The following queries benefit from the index:
  - A (since ```borough``` is an index prefix of the compound index), 
  - C (equality match for compound index), 
  - D (since ```borough``` is an index prefix of the compound index, and an index can be traversed in either direction),
  - F (inverse indexes (multiplying each direction by -1) are equivalent).

2. The same answer, since ```{ "borough": 1, "cuisine": -1, "name" : -1}``` has prefix ```{ "borough": 1, "cuisine": -1 }```

3. Same answer (inverse indexes are equivalent)
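
The reasoning for the sort queries D to G can be written down as a small check. This is a simplified rule for illustration: it only looks at the sort specification and ignores cases where equality conditions in the filter cover leading index fields. The function name is an assumption.

```javascript
// Returns true if an index with the given key pattern can deliver the
// requested sort order directly. Simplified rule: the sort fields must
// be a prefix of the index keys, and the directions must either all
// match the index or all be inverted (the index is read backwards).
function indexSupportsSort(indexKeys, sortSpec) {
  const indexEntries = Object.entries(indexKeys);
  const sortEntries = Object.entries(sortSpec);
  if (sortEntries.length === 0 || sortEntries.length > indexEntries.length) {
    return false;
  }

  let sameDirection = true;
  let inverseDirection = true;
  for (let i = 0; i < sortEntries.length; i++) {
    const [sortField, sortDir] = sortEntries[i];
    const [indexField, indexDir] = indexEntries[i];
    if (sortField !== indexField) return false;
    if (sortDir !== indexDir) sameDirection = false;
    if (sortDir !== -indexDir) inverseDirection = false;
  }
  return sameDirection || inverseDirection;
}

const index1 = { borough: 1, cuisine: -1 };
console.log("D", indexSupportsSort(index1, { borough: -1 })); // true
console.log("E", indexSupportsSort(index1, { borough: 1, cuisine: 1 })); // false
console.log("F", indexSupportsSort(index1, { borough: -1, cuisine: 1 })); // true
console.log("G", indexSupportsSort(index1, { cuisine: 1, borough: -1 })); // false
```

Running the same check with `{ borough: -1, cuisine: 1 }` as the index gives identical results, which matches answer 3.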

## 5. SQL to MongoDB Mapping 

Create a one-to-one mapping between the following SQL and MongoDB queries.

1.
```
    INSERT INTO users(user_id,  age,  status)  
    VALUES ("bcd001", 45, "A")
```
2.
```
    SELECT * FROM users
```
3.
```
    SELECT user_id, status FROM users
```
4.
```
    SELECT * FROM users  
    WHERE age > 25 AND age <= 50
```
5.
```
    SELECT * FROM users  
    WHERE status = "A" OR age = 50
``` 
6.
``` 
    CREATE TABLE users (     
                id MEDIUMINT NOT NULL AUTO_INCREMENT,     
                user_id Varchar(30),     
                age Number,     
                status char(1),     
                PRIMARY KEY (id)     
    )  
```  
7.
```  
    SELECT COUNT(user_id)   
    FROM users
```


a.
```
    db.users.find(
       { "age": { $gt: 25, $lte: 50 } }
    )
```

b. 
```json
    db.users.find(
        { },
        { "user_id": 1, "status": 1, "_id": 0 }
    )
```
c.
```json
    db.createCollection("users")
```
d. 
```json
    db.users.insert(
       { "user_id": "bcd001", "age": 45, "status": "A" }
    )
```
e.

```json
    db.users.count({ "user_id": {$exists: true}}) 

``` 
f.   
```json
    db.users.find(   
        { $or: [ { "status": "A" } ,       
        { "age": 50 } ] }    
    ) 
```

g. 
```json
    db.users.find()
```

### Solution
1-d  
2-g  
3-b  
4-a  
5-f  
6-c  
7-e  
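
To check that the paired queries really select the same data, the predicates can be evaluated on a small in-memory array. The sample rows are hypothetical, and the JavaScript expressions only imitate the semantics of the SQL and MongoDB queries.

```javascript
// Hypothetical rows/documents of the users table/collection.
const users = [
  { user_id: "abc123", age: 25, status: "B" },
  { user_id: "bcd001", age: 45, status: "A" },
  { user_id: "cde002", age: 50, status: "C" },
  { user_id: "def003", age: 60, status: "B" },
];

// SQL 3: SELECT user_id, status FROM users
// MongoDB b: projection { user_id: 1, status: 1, _id: 0 }
const query3 = users.map(({ user_id, status }) => ({ user_id, status }));

// SQL 4: WHERE age > 25 AND age <= 50
// MongoDB a: { age: { $gt: 25, $lte: 50 } }
const query4 = users.filter((u) => u.age > 25 && u.age <= 50);

// SQL 5: WHERE status = "A" OR age = 50
// MongoDB f: { $or: [ { status: "A" }, { age: 50 } ] }
const query5 = users.filter((u) => u.status === "A" || u.age === 50);

console.log(query3);
console.log(query4.map((u) => u.user_id));
console.log(query5.map((u) => u.user_id));
```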

## 6. True or False
Say if the following statements are *true* or *false*.

1. In document stores, you must determine and declare a table's schema before inserting data. 
2. Document stores are not subject to data modeling and support only one denormalized data model.
3. Different relationships between data can be represented by references and embedded documents.
4. There are no joins in MongoDB.

### Solution 
1. False. Document stores have a flexible schema, which does not require specifying the schema as in relational databases.
2. False. Document stores can be used with a wide range of data models. See Section 2.
3. True. These two tools allow applications to represent different data models. See Section 2.
4. True. Nonetheless, starting in version 3.2, MongoDB supports aggregations with the `$lookup` stage, which can perform a ```LEFT OUTER JOIN```.
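
What a left outer join means for documents can be sketched in plain JavaScript. The `lookup` helper, the two sample collections, and the field names are assumptions for illustration; the sketch imitates the behaviour of a join on a single field and is not MongoDB code.

```javascript
// Hypothetical collections.
const orders = [
  { _id: 1, item: "almonds" },
  { _id: 2, item: "pecans" },
  { _id: 3, item: "walnuts" }, // has no matching inventory document
];
const inventory = [
  { sku: "almonds", instock: 120 },
  { sku: "pecans", instock: 70 },
];

// Every document from the left side is kept, and its matches from the
// right side are attached as an array (empty if there is no match).
// Keeping unmatched left documents is what makes this a left outer join.
function lookup(left, right, localField, foreignField, as) {
  return left.map((doc) => ({
    ...doc,
    [as]: right.filter((other) => other[foreignField] === doc[localField]),
  }));
}

const joined = lookup(orders, inventory, "item", "sku", "inventory_docs");
console.log(JSON.stringify(joined, null, 2));
```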

## 7. Choose the right technology
In the following situations state which of the technologies (either a *document store* or a *relational database*) would be more suitable.
1. You are mostly working with semistructured or unstructured data.
2. Your application writes hundreds of records every few seconds but does not update them very often.
3. You have a well defined schema with clear constraints.
4. You want the queries written for your DB to be easily readable.
5. Your schema has a lot of relations.
6. Your application frequently updates and modifies a large volume of records.

### Solution 
1. Document Store
2. Document Store
3. Relational Database
4. Relational Database
5. Relational Database
6. Relational Database

## 8. Simple query comparison

Assume you have an SQL database/MongoDB database that has a table/collection called `users`. You want to query the users that have registered in the last 24 hours. Assume that the table/collection has an attribute/field called `signup_time`. Write the corresponding queries in SQL and MongoDB. Which one do you think is more readable?

### Solution

MySQL: 
```
SELECT  *
FROM    users
WHERE   signup_time >= NOW() - INTERVAL 1 DAY
``` 

MongoDB:

```
db.users.find({
  "signup_time": {
    $gt: new Date(Date.now() - 24*60*60 * 1000)
  }
})
```
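
The date arithmetic in the MongoDB query is ordinary JavaScript and can be tried outside the shell. In the sketch below the two users are hypothetical, and an array filter stands in for the `$gt` comparison.

```javascript
// The cutoff is "now minus 24 hours", expressed in milliseconds.
const DAY_MS = 24 * 60 * 60 * 1000;
const now = Date.now();
const cutoff = new Date(now - DAY_MS);

// Hypothetical users; signup_time is a Date object.
const users = [
  { name: "recent", signup_time: new Date(now - 60 * 60 * 1000) }, // 1 hour ago
  { name: "old", signup_time: new Date(now - 3 * DAY_MS) }, // 3 days ago
];

// Same comparison as { signup_time: { $gt: cutoff } }.
const recentUsers = users.filter((u) => u.signup_time > cutoff);
console.log(recentUsers.map((u) => u.name)); // [ 'recent' ]
```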

When it comes to the readability of queries, SQL is the better choice in most cases. In general MongoDB is not a good choice for most web applications as it has several disadvantages. For more info read [here](https://stateofprogress.blog/2017/03/13/choose-sql).