# Lab 3: Data Management - MongoDB

MongoDB is a Document-oriented Database Management System. For this lab, you can again either use the notebook provided or install a MongoDB server on your own laptop - it's up to you! Below are some helpful tutorials and references you might want to take a look at: 

Tutorials:

* [Official MongoDB tutorials](https://university.mongodb.com/)

References: 

* [MongoDB Manual](https://www.mongodb.com/docs/manual/)
* [PyMongo documentation](https://pymongo.readthedocs.io/en/stable/index.html)

As before, we can interact with the mongo shell both interactively and non-interactively. We can also use the PyMongo distribution, which is the recommended way to work with MongoDB from Python. This lab is split into three sections: Section 1 covers the interactive shell, Section 2 covers the PyMongo distribution, and Section 3 covers Bash scripts for MongoDB.

## Section 1: The MongoDB Shell

The `mongo` command-line tool is a simple command shell that acts as a JavaScript interface to MongoDB and ships as part of the MongoDB distribution. You can use the mongo shell to query and update data, perform administrative operations, and run batches of JavaScript commands.

First let's install our MongoDB server:

In [None]:
!apt install -y mongodb mongodb-server &>/dev/null    # Runs in the foreground so the install finishes before the version check
!mongod --version    # Checking the version to make sure everything has installed ok

db version v3.6.3
git version: 9586e557d54ef70f9ca4b43c26892cd55257e1a5
OpenSSL version: OpenSSL 1.1.1  11 Sep 2018
allocator: tcmalloc
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64


In [None]:
!service mongodb start    # Starting our server

 * Starting database mongodb
   ...done.


## 1.1: Working with the mongo shell:

You can get help for the `mongo` command-line tool by running `mongo --help`, and connect to your server by running the `mongo` command.

You'll notice though (as with the interactive `mysql` shell from last week!), the browser unfortunately interprets our input as a password so you can't see what you've typed until you've hit enter. This can be solved by using the browser dev tools to change the input type from "password" to "text".

In the `mongo` command-line tool, you can display the databases available to your user, the collections in each database, and the documents in each collection. Ending a command with `;` is optional unless you want to put more than one command on a single line.

You can display the general help page with `help`, and the help page for database methods with `db.help()`.

You can show the databases available for your user with `show databases`, or select a database with `use <databaseName>`. We can also show the collections in a database using `show collections`. 

We can also display the documents in a collection, using the alias `db` for the current database. For example: `db.collectionName.find()`.

To learn about the commands used in collections, we can use `db.collectionName.help()`. 

## 1.2: Basic Data Manipulation Language - insert, update, delete:

In this subsection, we'll create a new database of students (similar to the one we made last week) and explore some of the basic DML. 


1. Create (and switch to) a new database. MongoDB does not actually create it until the first document is inserted:

    `>` use myDB

2. Insert some new documents into the `students` collection in our new database. We can do this with either `insert` or `insertMany`:

    `>` db.students.insert({name:"James Nicholas Gray", dept_name:"Computer Science", tot_cred:10, age:63, term_address: {city: "Dublin", street: "Student Village", number: 109}})

    `>` db.students.insertMany([{name:"Alan Mathison Turing", dept_name:"Mathematics", tot_cred:10 , age:41, term_address: {city: "Dublin", street: "Merrion Row", number: 89}},

    {name:"Claude Elwood Shannon", dept_name:"Electrical Engineering", tot_cred:10, age:84, term_address: {city: "Dublin", street: "Student Village", number: 203}},

    {name:"Grace Brewster Murray Hopper", dept_name:"Computer Science", tot_cred:15, age:85, term_address: {city: "Dublin", street: "Rathmines Road", number: 36}}])

3. Show one document (usually the first), then all the documents, then all the documents in a "pretty" (indented, hierarchical) layout:
    
    `>` db.students.findOne()

    `>` db.students.find()

    `>` db.students.find().pretty()

4. Display the students of the Computer Science department: 

    `>` db.students.find({dept_name:"Computer Science"})

5. Display the names of the students under the age of 50:
  
    `>` db.students.find({age:{$lt:50}}, {name:1})

6. Update the department name of Alan Turing to Computer Science:

    `>` db.students.update({name:"Alan Mathison Turing"}, {$set: {dept_name:"Computer Science"}})

    `>` db.students.find({name:"Alan Mathison Turing"}).pretty()

7. Delete the students of the Electrical Engineering department:

    `>` db.students.remove({dept_name:"Electrical Engineering"})

    `>` db.students.find().pretty()

In [None]:
# TODO: Start your shell and run the commands
!mongo

## 1.3: Basic DML - retrieval of documents:

The `find()` function is not as powerful as the SQL `SELECT` command, but it has several options. If `find()` is not enough, the aggregation framework and map-reduce functions extend what you can query.

* Retrieve only the fields name and age from all documents in the collection `students`:

    `>` db.students.find({}, {name:1, age:1})

* Order the documents by a field or set of fields in forward or reverse order:

    `>` db.students.find().sort({name:1})

    `>` db.students.find().sort({dept_name: 1, name:1})

    `>` db.students.find().sort({age:-1})

* Order the documents by a field inside another field:

    `>` db.students.find({}, {name:1, "term_address.street": 1}).sort({"term_address.street": 1})

    In this case, we're ordering them by the alphabetical order of the street names of their term address. 

* Filter the documents using a condition based on the values of the fields: 

    `>` db.students.find({age: {$gte:60}}, {name:1, age:1})

    `>` db.students.find({age: {$gte:60, $lte:70}}, {name:1, age:1})

* Filter documents by matching part of a string with a regular expression (this carries a performance penalty):

    `>` db.students.find({name: /^Alan./})

    `>` db.students.find({name: /Nicholas/})

* Count how many documents a query returns:

    `>` db.students.find().count()


Exercise: Read about the "Aggregation Framework" in the MongoDB Manual and write the following queries:

1. Retrieve the average age from students:

    SOLUTION:
    `>` db.students.aggregate([{$group: {_id: "1", avgAge: {$avg: "$age"}}}])

2. Retrieve the sum of credits of all students: 

    SOLUTION:
    `>` db.students.aggregate([{$group: {_id: "1", sumCreds: {$sum: "$tot_cred"}}}])
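To see what these `$group` stages compute, here is a rough plain-Python equivalent that needs no server. It assumes the four documents inserted in step 2 of Section 1.2 are all still in the collection (that is, before the delete in step 7), and the variable names are illustrative only. Using a constant `_id` puts every document into a single group, so the accumulators run over the whole collection.

```python
# Plain-Python sketch of the two $group stages (not how MongoDB implements them).
# The values are the age and tot_cred fields of the four documents inserted earlier.
students = [
    {"name": "James Nicholas Gray", "age": 63, "tot_cred": 10},
    {"name": "Alan Mathison Turing", "age": 41, "tot_cred": 10},
    {"name": "Claude Elwood Shannon", "age": 84, "tot_cred": 10},
    {"name": "Grace Brewster Murray Hopper", "age": 85, "tot_cred": 15},
]

# {$avg: "$age"} over a single group: the mean of the age field.
avg_age = sum(doc["age"] for doc in students) / len(students)

# {$sum: "$tot_cred"} over a single group: the total of the tot_cred field.
sum_creds = sum(doc["tot_cred"] for doc in students)

print({"_id": "1", "avgAge": avg_age})      # avgAge is 68.25
print({"_id": "1", "sumCreds": sum_creds})  # sumCreds is 45
```

If your collection holds a different set of documents when you run the shell queries, the numbers will differ accordingly.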

In [None]:
# TODO: Start your shell and run the commands
!mongo

## Section 2: PyMongo

PyMongo is the official driver that connects to and interacts with MongoDB through Python. The syntax will differ slightly, but in general this is a nice way of interacting with your MongoDB server. 

In [None]:
# Install and import pymongo & the pymongo client
!python -m pip install pymongo

import pymongo
from pymongo import MongoClient

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Start our client and list the current databases 
client = MongoClient()
client.list_database_names()    # We should see the new database we created using the interactive shell listed below

['admin', 'local', 'myDB']

We can access our new database in two ways. Either with `mydb = client.myDB` or we can treat it like a dictionary with `mydb = client["myDB"]`.

In [None]:
mydb = client.myDB
mydb.list_collection_names()    # Listing out the collections we have in the myDB database

['students']

## 2.1 Basic DML - insert, update, delete:

First, let's add our Electrical Engineering student back into our collection:

In [3]:
# Import pprint so we can print our results in a "pretty" format
from pprint import pprint

# Access our collection:
students = mydb.students

# Create our new document: 
ee_student = {"name":"Claude Elwood Shannon", "dept_name":"Electrical Engineering", "tot_cred":10, "age":84, "term_address": {"city": "Dublin", "street": "Student Village", "number": 203}}

# Insert our document:
students.insert_one(ee_student)

# Display our updated collection: 
for result in students.find():
  pprint(result)

We can also use the `insert_many([list, of, documents])` method to add multiple documents at once. 

Now let's update Alan Turing's total credits to 15. For this we'll use the `$set` operator. Notice that in Python we have to wrap operators in quotes, because dictionary keys such as `$set` must be strings.

In [None]:
students.update_one({"name": "Alan Mathison Turing"}, {"$set":{"tot_cred": 15}})

for result in students.find({"name" : "Alan Mathison Turing"}, {"name": 1, "tot_cred": 1}): 
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884861'),
 'name': 'Alan Mathison Turing',
 'tot_cred': 15}


We can delete documents with the `delete_one()` or `delete_many()` methods:

In [None]:
students.delete_one({"name": "James Nicholas Gray"})

for result in students.find({}, {"name": 1, "dept_name": 1}):
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884861'),
 'dept_name': 'Computer Science',
 'name': 'Alan Mathison Turing'}
{'_id': ObjectId('6333155d31e544a77d884863'),
 'dept_name': 'Computer Science',
 'name': 'Grace Brewster Murray Hopper'}
{'_id': ObjectId('6333221630961b2d12fbf8b4'),
 'dept_name': 'Electrical Engineering',
 'name': 'Claude Elwood Shannon'}


## 2.2 Basic DML - retrieval of documents:

The find methods are similar to those in the mongo shell, though the syntax differs slightly:

In [None]:
pprint(students.find_one())

{'_id': ObjectId('6333155d31e544a77d884861'),
 'age': 41.0,
 'dept_name': 'Computer Science',
 'name': 'Alan Mathison Turing',
 'term_address': {'city': 'Dublin', 'number': 89.0, 'street': 'Merrion Row'},
 'tot_cred': 15}


In [None]:
for result in students.find({}, {"name": 1, "age": 1}):
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884861'),
 'age': 41.0,
 'name': 'Alan Mathison Turing'}
{'_id': ObjectId('6333155d31e544a77d884863'),
 'age': 85.0,
 'name': 'Grace Brewster Murray Hopper'}
{'_id': ObjectId('6333221630961b2d12fbf8b4'),
 'age': 84,
 'name': 'Claude Elwood Shannon'}


We can still order our documents in a similar fashion (i.e. by field or set of fields in forward or reverse order):

In [None]:
for result in students.find({}, {"name": 1, "dept_name": 1}).sort([("dept_name", -1), ("name", -1)]): 
  pprint(result)

{'_id': ObjectId('6333221630961b2d12fbf8b4'),
 'dept_name': 'Electrical Engineering',
 'name': 'Claude Elwood Shannon'}
{'_id': ObjectId('6333155d31e544a77d884863'),
 'dept_name': 'Computer Science',
 'name': 'Grace Brewster Murray Hopper'}
{'_id': ObjectId('6333155d31e544a77d884861'),
 'dept_name': 'Computer Science',
 'name': 'Alan Mathison Turing'}


And we can still order the documents by a field inside another field.


In [None]:
for result in students.find({}, {"name": 1, "term_address.street": 1}).sort([("term_address.street", -1)]): 
  pprint(result)

{'_id': ObjectId('6333221630961b2d12fbf8b4'),
 'name': 'Claude Elwood Shannon',
 'term_address': {'street': 'Student Village'}}
{'_id': ObjectId('6333155d31e544a77d884863'),
 'name': 'Grace Brewster Murray Hopper',
 'term_address': {'street': 'Rathmines Road'}}
{'_id': ObjectId('6333155d31e544a77d884861'),
 'name': 'Alan Mathison Turing',
 'term_address': {'street': 'Merrion Row'}}
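Dot notation such as `term_address.street` simply walks into the embedded document. As a rough illustration (plain Python, not PyMongo's implementation; `get_path` is a hypothetical helper), the same descending sort can be reproduced on ordinary dictionaries:

```python
from functools import reduce

def get_path(doc, dotted_path):
    """Follow a dotted path such as 'term_address.street' into nested dicts."""
    return reduce(lambda value, key: value[key], dotted_path.split("."), doc)

# Names and streets copied from the documents shown above.
docs = [
    {"name": "Alan Mathison Turing", "term_address": {"street": "Merrion Row"}},
    {"name": "Grace Brewster Murray Hopper", "term_address": {"street": "Rathmines Road"}},
    {"name": "Claude Elwood Shannon", "term_address": {"street": "Student Village"}},
]

# reverse=True corresponds to the -1 direction passed to .sort().
ordered = sorted(docs, key=lambda d: get_path(d, "term_address.street"), reverse=True)
for doc in ordered:
    print(doc["name"], "-", doc["term_address"]["street"])
```

The printed order matches the query output: Student Village, then Rathmines Road, then Merrion Row.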


Filtering the documents using a condition based on values of the fields (again note we have to wrap our operators in quotes):

In [None]:
for result in students.find({"age": {"$gte": 60, "$lte": 90}}, {"name": 1, "age": 1}): 
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884863'),
 'age': 85.0,
 'name': 'Grace Brewster Murray Hopper'}
{'_id': ObjectId('6333221630961b2d12fbf8b4'),
 'age': 84,
 'name': 'Claude Elwood Shannon'}


When filtering documents using parts of strings with PyMongo, we'll need to use the `$regex` operator to tell MongoDB that the string we give it should be treated as a regular expression:

In [None]:
for result in students.find({"name": {"$regex": "^Alan.*$"}}, {"name": 1}): 
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884861'), 'name': 'Alan Mathison Turing'}


In [None]:
for result in students.find({"name": {"$regex": "Brewster"}}, {"name": 1}): 
  pprint(result)

{'_id': ObjectId('6333155d31e544a77d884863'),
 'name': 'Grace Brewster Murray Hopper'}
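If you want to check a pattern before sending it to the server, Python's built-in `re` module behaves the same way for simple patterns like these (MongoDB uses a Perl-compatible regex engine, so more exotic syntax may differ). This is only a local sketch; the list of names is copied from the documents above.

```python
import re

names = [
    "Alan Mathison Turing",
    "Grace Brewster Murray Hopper",
    "Claude Elwood Shannon",
]

# "^Alan.*$" is anchored: the name must start with "Alan".
starts_with_alan = [n for n in names if re.search(r"^Alan.*$", n)]

# "Brewster" is unanchored: it may appear anywhere in the name.
contains_brewster = [n for n in names if re.search(r"Brewster", n)]

print(starts_with_alan)
print(contains_brewster)
```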


Counting how many documents match a query:

In [None]:
students.count_documents({})

3

Now try some aggregation using PyMongo: 

1. Retrieve the sum of credits of all students, and then of only the students in the Computer Science department:

In [None]:
# Solution
results = students.aggregate([{"$group":{"_id": 1, 
                                         "sumCreds": {"$sum": "$tot_cred"}}}])
for result in results: 
  print(result)

{'_id': 1, 'sumCreds': 40.0}


In [None]:
# Solution
results = students.aggregate([{"$match": {"dept_name": "Computer Science"}},
                              {"$group":{"_id": 1, 
                                         "sumCreds": {"$sum": "$tot_cred"}}}])
for result in results: 
  print(result)

{'_id': 1, 'sumCreds': 30.0}


2. Retrieve the sum of the students' credits, grouped by department:


In [None]:
# Solution
results = students.aggregate([{"$group":{"_id": "$dept_name", 
                                         "sumCreds": {"$sum": "$tot_cred"}}}])
for result in results: 
  print(result)

{'_id': 'Electrical Engineering', 'sumCreds': 10}
{'_id': 'Computer Science', 'sumCreds': 30.0}


## Section 3: Bash Scripts for MongoDB

As with MySQL, MongoDB allows you to send commands to the client from a Bash script, submitting one line at a time with the `--eval` option.

`mongo yourDB --eval "db.hostInfo()"`

`mongo yourDB --eval "printjson(db.getCollectionNames())"`

You can find more information at: 

* [How to execute MongoDB commands from the Linux shell](https://mobilemonitoringsolutions.com/execute-mongodb-commands-linux-shell/)


## Exercise:

We'll use the same Titanic dataset from last week to do a similar exercise. Create and populate a database (1 collection is enough - e.g. use a Bash script that will read the file and create Mongo queries to insert values in the database). 

**Hint:** 

This can be written very similarly to the script from last week. However, remember to replace the write portion with mongo shell commands - e.g. where we had `mysql -D ...`, we will now have `mongo yourDB --eval "STUFF"`.

**NB:** This is a point where bash quotation marks can get a bit sticky - you will have to wrap single quotes in double quotes in a relatively unintuitive manner to pass things the way you want. To save you some grief, here is an example of the correct way to wrap them in this instance:

`mongo test --eval 'db.titanic.insert({class:"'"$class"'"})'`

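Note that the argument is really three quoted pieces placed side by side, which the shell joins into one string: a single-quoted piece (taken literally) that ends in `"`, then `"$class"` in double quotes (so the variable is expanded), then another single-quoted piece that begins with `"`. The double quotes inside the single-quoted pieces are passed through to mongo as the JavaScript string delimiters. This is quite confusing, so don't worry too much - feel free to experiment with `echo` and various combinations to get a feel for this.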
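One way to see how the shell reads that command is Python's standard `shlex` module, which splits a string using POSIX shell quoting rules. It does not expand variables, so `$class` stays visible where the shell would substitute its value; the `replace` call below stands in for that expansion, and the value `1st` is a made-up example.

```python
import shlex

command = """mongo test --eval 'db.titanic.insert({class:"'"$class"'"})'"""

# shlex applies POSIX quoting rules: the three adjacent quoted pieces
# become a single argument, and the outer quote characters are removed.
argv = shlex.split(command)
print(argv[3])
# db.titanic.insert({class:"$class"})

# A real shell would also expand $class inside the double-quoted piece.
# Simulate that here with a hypothetical value.
expanded = argv[3].replace("$class", "1st")
print(expanded)
# db.titanic.insert({class:"1st"})
```

The final string is what mongo receives as its `--eval` argument: valid JavaScript with the value wrapped in double quotes.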

In [None]:
!wget http://jse.amstat.org/datasets/titanic.dat.txt

--2022-09-27 16:47:20--  http://jse.amstat.org/datasets/titanic.dat.txt
Resolving jse.amstat.org (jse.amstat.org)... 107.180.48.28
Connecting to jse.amstat.org (jse.amstat.org)|107.180.48.28|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 63829 (62K) [text/plain]
Saving to: ‘titanic.dat.txt’


2022-09-27 16:47:21 (323 KB/s) - ‘titanic.dat.txt’ saved [63829/63829]



In [None]:
# Solution: see titanic.sh
!bash titanic.sh
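If the Bash quoting proves too fiddly, the read-and-build step can also be prototyped in Python. The sketch below is not the contents of `titanic.sh`. It assumes the file holds four whitespace-separated columns in the order class, age, sex, survived, which you should verify against the downloaded file. `line_to_document` and the sample lines are hypothetical.

```python
# Sketch only: the column order (class, age, sex, survived) is an assumption
# about titanic.dat.txt; check the file before relying on it.
COLUMNS = ["class", "age", "sex", "survived"]

def line_to_document(line):
    """Turn one whitespace-separated line into a dict of string values."""
    values = line.split()
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    # Values stay as strings, matching queries such as {"survived": "1"}.
    return dict(zip(COLUMNS, values))

# Two made-up lines in the assumed layout, standing in for the real file.
sample_lines = ["  1  1  1  1", "  3  0  0  0"]
docs = [line_to_document(line) for line in sample_lines if line.strip()]
print(docs[0])

# With the PyMongo client from Section 2, the documents could then be
# inserted in one call: mydb.titanic.insert_many(docs)
```

Keeping the values as strings mirrors what a shell script that quotes each variable would store, which is why the solutions below query `"1"` rather than `1`.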

Now answer the following questions (you can use either the shell or PyMongo for this):

1. How many passengers were there on the Titanic?

2. How many passengers survived?

3. What percentage of passengers survived?

MONGO SHELL SOLUTIONS:

  `>` db.yourCollection.count({})

  `>` db.yourCollection.count({survived: "1"})

  `>` db.yourCollection.count({survived: "1"})/db.yourCollection.count()*100

In [None]:
# PyMongo Solutions: 

titanic = mydb.titanic
mydb.list_collection_names()

['titanic', 'students']

In [None]:
# How many passengers were there on the Titanic:
titanic.count_documents({})

2201

In [None]:
# How many passengers survived: 
titanic.count_documents({"survived": "1"})

711

In [None]:
# What % of passengers survived: 
titanic.count_documents({"survived": "1"})/titanic.count_documents({})*100

32.30349840981372