# MongoDB

From Introduction to MongoDB at https://docs.mongodb.com/manual/introduction/:  
"MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling."

While relational database management systems have been in use since the 1970s, 
technology and the ways we use it continue to evolve, so we need new ways to handle the growing volume of data and how we interact with it. 

A "document database" stores each entry as a self-contained unit called a document. 
These documents are then organized into what are called "collections".


While joins are possible in MongoDB, 
they are a bit of an "anti-pattern" in the context of a document store. 



**At the most basic level,
a document is similar to a row in a relational database, 
and a collection is analogous to a table.** 
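
To make the analogy concrete, here is a minimal sketch in plain Python (no database connection needed). The `movie_row` tuple, the nested `movie_document`, and all values in them are made-up illustrations, not records from the lab database.

```python
# A relational row: a fixed set of columns; related data lives in other tables.
movie_columns = ("movie_id", "title", "year")
movie_row = (1, "Example Movie", 1999)

# A document: a self-contained unit that can nest related data directly.
movie_document = {
    "movie_id": 1,
    "title": "Example Movie",
    "year": 1999,
    "genres": ["Comedy", "Drama"],             # would be a join table in SQL
    "reviews": [{"user_id": 5, "rating": 4}],  # embedded instead of joined
}

# A collection is analogous to a table: a group of documents.
# Unlike rows, documents in one collection need not share the same fields.
movies_collection = [movie_document, {"movie_id": 2, "title": "Another Movie"}]

# Converting the row to a flat document shows where the two models overlap.
row_as_document = dict(zip(movie_columns, movie_row))
print(row_as_document)
```

The embedded `genres` and `reviews` fields are why joins are needed less often in a document store: the related data travels with the document.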


## Node Assignments 

* Server `mongodb-1.dsa.lan`: Last name A - G 
* Server `mongodb-2.dsa.lan`: Last name H - M
* Server `mongodb-3.dsa.lan`: Last name N - S
* Server `mongodb-4.dsa.lan`: Last name T - Y


## Readings
* [Why NoSQL Database](https://www.couchbase.com/resources/why-nosql)
* [What is MongoDB](https://intellipaat.com/blog/what-is-mongodb/)
* [Introduction to MongoDB and Python](https://realpython.com/introduction-to-mongodb-and-python/)

## Intro to MongoDB + Python with a movie review dataset

Let's start exploring MongoDB with Python to see how things compare to PostgreSQL. 

Full documentation on `pymongo` can be found here: https://api.mongodb.com/python/current/api/pymongo/

In [1]:
from pymongo import MongoClient
import pymongo

# Initialize a Mongo Client
#################################################
# Replace the host name in the connection code with your assigned node
#################################################
# Node 1 - mongodb-1.dsa.lan
# Node 2 - mongodb-2.dsa.lan
# Node 3 - mongodb-3.dsa.lan
# Node 4 - mongodb-4.dsa.lan
#################################################
#
client = MongoClient('mongodb-1.dsa.lan',
                     username='ml_small_reader',
                     password='mlsmall.read',
                     authSource='ml_small')

In [2]:
# connect to the ml_small database from the connection
db = client.ml_small

In [3]:
# Get a list of the available collections in this DB.
db.list_collection_names()

['item', 'user', 'genre', 'data']

In [4]:
# Get a count of the number of documents in the item collection
print(f"num of items = {db.item.count_documents({})}")
print(f"num of users = {db.user.count_documents({})}")
print(f"num of genres = {db.genre.count_documents({})}")
print(f"num of user ratings = {db.data.count_documents({})}")

num of items = 1681
num of users = 943
num of genres = 19
num of user ratings = 100000


**Note**: in the function calls above, we passed in an empty dictionary, which PyMongo converts to an empty BSON document (MongoDB's binary, JSON-like format).
This _document_ is the query filter, i.e., it describes which documents to match and count; an empty filter matches every document.
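
The sketch below mimics, in plain Python, how an equality filter selects documents. It only illustrates the idea and is not how MongoDB or PyMongo evaluates queries. The `matches` and `count_matching` helpers are hypothetical; the three sample documents are copied from the `genre` collection in this notebook (without `_id`).

```python
def matches(document, query):
    """Return True if every key/value pair in `query` equals the document's value."""
    return all(document.get(key) == value for key, value in query.items())


def count_matching(documents, query):
    """Count documents that satisfy the (equality-only) query document."""
    return sum(1 for doc in documents if matches(doc, query))


genres = [
    {"genre": "Children's", "genre_id": 4},
    {"genre": "Documentary", "genre_id": 7},
    {"genre": "Drama", "genre_id": 8},
]

print(count_matching(genres, {}))                  # 3: empty filter matches everything
print(count_matching(genres, {"genre": "Drama"}))  # 1: only one document matches
```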

With some basic syntax examples, let's explore the data.

In [5]:
db.item.find_one()

{'_id': ObjectId('5b231f5ad698289b415e67f5'),
 'movie_id': 8,
 'movie_title': 'Babe (1995)',
 'release_date': datetime.datetime(1995, 1, 1, 6, 0),
 'video_release_date': '',
 'IMDb_URL': 'http://us.imdb.com/M/title-exact?Babe%20(1995)',
 'unknown': 0,
 'Action': 0,
 'Adventure': 0,
 'Animation': 0,
 "Children's": 1,
 'Comedy': 1,
 'Crime': 0,
 'Documentary': 0,
 'Drama': 1,
 'Fantasy': 0,
 'Film-Noir': 0,
 'Horror': 0,
 'Musical': 0,
 'Mystery': 0,
 'Romance': 0,
 'Sci-Fi': 0,
 'Thriller': 0,
 'War': 0,
 'Western': 0}

In [6]:
db.user.find_one()

{'_id': ObjectId('5b18e2b3d698289b41f3cf1a'),
 'user_id': 5,
 'age': 33,
 'gender': 'F',
 'occupation': 'other',
 'zip_code': 15213}

In [7]:
db.genre.find_one()

{'_id': ObjectId('5b18e2abd698289b41f3ceff'),
 'genre': "Children's",
 'genre_id': 4}

In [8]:
db.data.find_one()

{'_id': ObjectId('5b18e05dd698289b41f236d2'),
 'user_id': 166,
 'item_id': 346,
 'rating': 1,
 'timestamp': 886397596}
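
The `timestamp` field looks like a Unix epoch value (seconds since 1970-01-01 UTC). That is an assumption; the notebook does not state the unit. Under that assumption, the standard library can turn it into a readable date:

```python
from datetime import datetime, timezone

# Copy of the rating document shown above, without `_id`.
rating = {"user_id": 166, "item_id": 346, "rating": 1, "timestamp": 886397596}

# Assumption: `timestamp` is seconds since the Unix epoch, in UTC.
rated_at = datetime.fromtimestamp(rating["timestamp"], tz=timezone.utc)
print(rated_at.isoformat())
```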

Now that we've seen our data, let's make sense of it. 

This data is a movie rating dataset for 1681 movies [`db.item.count_documents({})`]. 
There are 943 users [`db.user.count_documents({})`] who have submitted, 
in this dataset, 100k reviews [`db.data.count_documents({})`]. 

`db.genre` is a listing of all of the available genres, along with their unique IDs, 
which are not used elsewhere in this dataset.
This is a leftover from the relational schema that the dataset was exported from.

Each document in `db.item` describes a single movie. 
Each movie could fit into any number of genres, 
so all genres are listed and a binary yes (1) or no (0) is recorded if the movie fit that genre. 
We also have some other data available, including the movie title. 
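
As a small sketch of how these flags can be read, the snippet below collects the genres of one movie. The dictionary is an abbreviated copy of the `db.item.find_one()` result shown earlier; the `non_genre_fields` set is a helper introduced here for illustration.

```python
# Abbreviated copy of the `db.item.find_one()` result shown earlier
# (only a few of the genre flag fields are kept to stay short).
movie = {
    "movie_id": 8,
    "movie_title": "Babe (1995)",
    "Action": 0,
    "Children's": 1,
    "Comedy": 1,
    "Drama": 1,
    "Western": 0,
}

# Fields in this abbreviated document that are not genre flags.
non_genre_fields = {"movie_id", "movie_title"}

movie_genres = [
    field for field, value in movie.items()
    if field not in non_genre_fields and value == 1
]
print(movie_genres)  # ["Children's", 'Comedy', 'Drama']
```

The same flags work as query filters: with the connection from this notebook, a filter such as `{"Comedy": 1}` would select every movie flagged as a comedy.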

We'll dive into the data more with the Aggregates and Joins labs, 
but first let's look at some examples of the things you would naturally want to try.

## Returning a number of documents

In [9]:
# get an iterator for all genres.
for genre in db.genre.find():
    print(genre)
    

{'_id': ObjectId('5b18e2abd698289b41f3ceff'), 'genre': "Children's", 'genre_id': 4}
{'_id': ObjectId('5b18e2abd698289b41f3cf00'), 'genre': 'Documentary', 'genre_id': 7}
{'_id': ObjectId('5b18e2abd698289b41f3cf01'), 'genre': 'Drama', 'genre_id': 8}
{'_id': ObjectId('5b18e2abd698289b41f3cf02'), 'genre': 'Fantasy', 'genre_id': 9}
{'_id': ObjectId('5b18e2abd698289b41f3cf03'), 'genre': 'Horror', 'genre_id': 11}
{'_id': ObjectId('5b18e2abd698289b41f3cf04'), 'genre': 'Musical', 'genre_id': 12}
{'_id': ObjectId('5b18e2abd698289b41f3cf05'), 'genre': 'Film-Noir', 'genre_id': 10}
{'_id': ObjectId('5b18e2abd698289b41f3cf06'), 'genre': 'Mystery', 'genre_id': 13}
{'_id': ObjectId('5b18e2abd698289b41f3cf07'), 'genre': 'Adventure', 'genre_id': 2}
{'_id': ObjectId('5b18e2abd698289b41f3cf08'), 'genre': 'Sci-Fi', 'genre_id': 15}
{'_id': ObjectId('5b18e2abd698289b41f3cf09'), 'genre': 'Romance', 'genre_id': 14}
{'_id': ObjectId('5b18e2abd698289b41f3cf0a'), 'genre': 'Thriller', 'genre_id': 16}
{'_id': Objec

In [10]:
# Or get the iterator and immediately make it a list...
genres = list(db.genre.find())
print(genres)
print('-'*50)
print(len(genres))


[{'_id': ObjectId('5b18e2abd698289b41f3ceff'), 'genre': "Children's", 'genre_id': 4}, {'_id': ObjectId('5b18e2abd698289b41f3cf00'), 'genre': 'Documentary', 'genre_id': 7}, {'_id': ObjectId('5b18e2abd698289b41f3cf01'), 'genre': 'Drama', 'genre_id': 8}, {'_id': ObjectId('5b18e2abd698289b41f3cf02'), 'genre': 'Fantasy', 'genre_id': 9}, {'_id': ObjectId('5b18e2abd698289b41f3cf03'), 'genre': 'Horror', 'genre_id': 11}, {'_id': ObjectId('5b18e2abd698289b41f3cf04'), 'genre': 'Musical', 'genre_id': 12}, {'_id': ObjectId('5b18e2abd698289b41f3cf05'), 'genre': 'Film-Noir', 'genre_id': 10}, {'_id': ObjectId('5b18e2abd698289b41f3cf06'), 'genre': 'Mystery', 'genre_id': 13}, {'_id': ObjectId('5b18e2abd698289b41f3cf07'), 'genre': 'Adventure', 'genre_id': 2}, {'_id': ObjectId('5b18e2abd698289b41f3cf08'), 'genre': 'Sci-Fi', 'genre_id': 15}, {'_id': ObjectId('5b18e2abd698289b41f3cf09'), 'genre': 'Romance', 'genre_id': 14}, {'_id': ObjectId('5b18e2abd698289b41f3cf0a'), 'genre': 'Thriller', 'genre_id': 16}, 

In [11]:
# Find a specific genre, knowing the query will always return at most one document.
print(db.genre.find_one({"genre": "Children's"}))
print(db.genre.find_one({"genre": "Sci-Fi"}))

{'_id': ObjectId('5b18e2abd698289b41f3ceff'), 'genre': "Children's", 'genre_id': 4}
{'_id': ObjectId('5b18e2abd698289b41f3cf08'), 'genre': 'Sci-Fi', 'genre_id': 15}


In [12]:
# Find any single document, with no concern for which.
print(db.genre.find_one())

{'_id': ObjectId('5b18e2abd698289b41f3ceff'), 'genre': "Children's", 'genre_id': 4}


## Sorting

In [13]:
# Find the last genre, in alphabetical order.
print(db.genre.find_one(sort=[('genre', pymongo.DESCENDING)]))

# Likewise the first one, in alphabetical order.
print(db.genre.find_one(sort=[('genre', pymongo.ASCENDING)]))

{'_id': ObjectId('5b18e2abd698289b41f3cf11'), 'genre': 'unknown', 'genre_id': 0}
{'_id': ObjectId('5b18e2abd698289b41f3cf0c'), 'genre': 'Action', 'genre_id': 1}
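
Note that `'unknown'` sorts after every capitalized genre. Assuming the collection uses MongoDB's default string comparison (no collation configured), strings are compared by their binary values, so uppercase letters order before lowercase ones. Python's built-in string comparison orders by code point, which gives the same result for these ASCII names. The `sort_names` helper is hypothetical and only mirrors a single-key sort:

```python
# pymongo.ASCENDING is 1 and pymongo.DESCENDING is -1; plain ints are used here
# so the sketch runs without a database connection.
ASCENDING, DESCENDING = 1, -1

genre_names = ["Western", "unknown", "Action", "Sci-Fi"]


def sort_names(names, direction):
    """Sort by code point, mirroring a single-key sort specification."""
    return sorted(names, reverse=(direction == DESCENDING))


print(sort_names(genre_names, ASCENDING)[0])   # Action
print(sort_names(genre_names, DESCENDING)[0])  # unknown
```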


## Sorting with a limit 

In [14]:
for movie in db.item.find(sort=[('movie_title', pymongo.ASCENDING)], limit=20):
    print(movie['movie_title'])

'Til There Was You (1997)
1-900 (1994)
101 Dalmatians (1996)
12 Angry Men (1957)
187 (1997)
2 Days in the Valley (1996)
20,000 Leagues Under the Sea (1954)
2001: A Space Odyssey (1968)
3 Ninjas: High Noon At Mega Mountain (1998)
39 Steps, The (1935)
8 1/2 (1963)
8 Heads in a Duffel Bag (1997)
8 Seconds (1994)
A Chef in Love (1996)
Above the Rim (1994)
Absolute Power (1997)
Abyss, The (1989)
Ace Ventura: Pet Detective (1994)
Ace Ventura: When Nature Calls (1995)
Across the Sea of Time (1995)


## Sorting with a limit and skipping documents

In [15]:
# Skip the first 13 titles, which begin with a digit or punctuation
for movie in db.item.find(sort=[('movie_title', pymongo.ASCENDING)], 
                          limit=20, skip=13):
    print(movie['movie_title'])

A Chef in Love (1996)
Above the Rim (1994)
Absolute Power (1997)
Abyss, The (1989)
Ace Ventura: Pet Detective (1994)
Ace Ventura: When Nature Calls (1995)
Across the Sea of Time (1995)
Addams Family Values (1993)
Addicted to Love (1997)
Addiction, The (1995)
Adventures of Pinocchio, The (1996)
Adventures of Priscilla, Queen of the Desert, The (1994)
Adventures of Robin Hood, The (1938)
Affair to Remember, An (1957)
African Queen, The (1951)
Afterglow (1997)
Age of Innocence, The (1993)
Aiqing wansui (1994)
Air Bud (1997)
Air Force One (1997)
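
Together, `skip` and `limit` give simple pagination. The `page_window` helper below is a hypothetical sketch that computes the two values for a 1-based page number. It demonstrates the effect on a plain list of made-up titles, since the slice `titles[skip:skip + limit]` selects the same window that `skip` and `limit` select from a sorted cursor.

```python
def page_window(page, page_size):
    """Return (skip, limit) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return (page - 1) * page_size, page_size


titles = sorted(f"Movie {n:02d}" for n in range(1, 46))  # 45 made-up titles

skip, limit = page_window(page=3, page_size=20)
page_items = titles[skip:skip + limit]

print(skip, limit)      # 40 20
print(len(page_items))  # 5 (the last page is short)
```

With the connection from this notebook, the same values could be passed along as `db.item.find(sort=[('movie_title', pymongo.ASCENDING)], skip=skip, limit=limit)`.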


## Summary

As you can see, 
while the syntax for interacting with MongoDB is a little different from what you might 
expect if you are used to working with SQL statements, 
the interface can be picked up relatively quickly. 
Additionally, most things that you would want to do in SQL have a counterpart in MongoDB. 

We skip over the data modification statements in this lab since we are connected to a read-only database; 
however, they work much as you would expect: 
 * `db.collection.insert_*()`
 * `db.collection.update_*()`
 * `db.collection.delete_*()`

are all available, as well as some less obvious DML such as the `db.collection.find_one_and_*()` functions. 

In the first three, `*` is a placeholder for `one` or `many`; in `find_one_and_*()`, it stands for `delete`, `replace`, or `update`. 
See the documentation for more specifics. 
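
Since the lab database is read-only, the following sketch only builds the argument documents these methods take; nothing is sent to a server. The genre values are invented for illustration, and the commented calls assume a collection with write access, which this lab does not provide.

```python
# Document to insert: any dictionary of field/value pairs.
new_genre = {"genre": "Biography", "genre_id": 19}

# Filter selecting which document(s) to change or remove.
genre_filter = {"genre": "Biography"}

# Update document: operators such as `$set` describe the change to apply.
rename_update = {"$set": {"genre": "Biopic"}}

# With write access, these would be passed along as:
#   db.genre.insert_one(new_genre)
#   db.genre.update_one(genre_filter, rename_update)
#   db.genre.delete_one({"genre": "Biopic"})
print(new_genre, genre_filter, rename_update)
```

Note that the filter uses the same document syntax as the queries in the earlier cells, so anything you can find you can also target for an update or delete.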


## <span style="background:yellow">Your Turn</span>

Complete the following lab work to gain hands-on experience with PyMongo and MongoDB.
**Make sure that you have run the above cells and have a connection to the DB!**

### Task 1

#### Query the database to get the movies that start with the letter "a", case-insensitive. 
Specifically, the regex pattern is `'^a'` with option `'i'`
 * refer to https://docs.mongodb.com/manual/reference/operator/query/regex/ for help.
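
For a simple pattern like this one, Python's `re` module behaves the same way as the `$regex` operator, so you can try the pattern locally before querying. The sample titles below are taken from outputs earlier in this notebook.

```python
import re

titles = [
    "Babe (1995)",
    "Air Bud (1997)",
    "12 Angry Men (1957)",
    "Abyss, The (1989)",
]

# '^a' anchors the match at the start; IGNORECASE plays the role of option 'i'.
pattern = re.compile(r"^a", re.IGNORECASE)

starts_with_a = [title for title in titles if pattern.search(title)]
print(starts_with_a)  # ['Air Bud (1997)', 'Abyss, The (1989)']
```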

In [27]:
# Add your code below
# -------------------------
    
for movie in db.item.find({"movie_title": {'$regex': '^a', '$options': 'i'}}):
    print(movie['movie_title'])

Antonia's Line (1995)
Apollo 13 (1995)
Angels and Insects (1995)
Ace Ventura: Pet Detective (1994)
Aladdin (1992)
Aristocats, The (1970)
All Dogs Go to Heaven 2 (1996)
Abyss, The (1989)
Aliens (1986)
Apocalypse Now (1979)
Army of Darkness (1993)
Alien (1979)
Amadeus (1984)
Akira (1988)
Austin Powers: International Man of Mystery (1997)
Air Bud (1997)
Absolute Power (1997)
Air Force One (1997)
Apt Pupil (1998)
As Good As It Gets (1997)
Alien: Resurrection (1997)
Apostle, The (1997)
Assignment, The (1997)
Ace Ventura: When Nature Calls (1995)
Adventures of Priscilla, Queen of the Desert, The (1994)
Age of Innocence, The (1993)
Addams Family Values (1993)
Apple Dumpling Gang, The (1975)
Alice in Wonderland (1951)
Aladdin and the King of Thieves (1996)
American Werewolf in London, An (1981)
Amityville 3-D (1983)
Amityville II: The Possession (1982)
Amityville: A New Generation (1993)
Amityville 1992: It's About Time (1992)
Amityville Horror, The (1979)
Amityville Curse, The (1990)
Apartmen

### Task 2

#### Query to find movies that were released after January 1, 1995. 

As an example, and to aid you, `datetime.datetime` has been imported 
and the `afterDate` variable has been created for use in your query. 

Hint: `afterDate` is January 2, so use greater than or equal to.
 * https://docs.mongodb.com/manual/reference/method/db.collection.find/ 
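
Why January 2 with "greater than or equal to" rather than January 1 with "greater than"? The stored dates carry a time component, as the `release_date` of the `Babe (1995)` document shown earlier does. A minimal sketch with plain `datetime` comparisons, which follow the same ordering as the `$gt` and `$gte` operators:

```python
from datetime import datetime

babe_release = datetime(1995, 1, 1, 6, 0)  # value shown by db.item.find_one()
after_date = datetime(1995, 1, 2)

# A strict "greater than midnight on January 1" test would still include Babe...
print(babe_release > datetime(1995, 1, 1))  # True
# ...whereas "greater than or equal to January 2" excludes it.
print(babe_release >= after_date)           # False
```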

In [28]:
from datetime import datetime
afterDate = datetime(1995, 1, 2)

# Add your code below
# -------------------------

for movie in db.item.find({
    "release_date": {"$gte": afterDate}}, 
    sort=[('release_date', pymongo.ASCENDING)]):
    print(movie['movie_title'], ":", movie['release_date'])



Usual Suspects, The (1995) : 1995-08-14 05:00:00
Persuasion (1995) : 1995-09-25 05:00:00
Mighty Aphrodite (1995) : 1995-10-30 06:00:00
Othello (1995) : 1995-12-18 06:00:00
Diabolique (1996) : 1996-01-01 06:00:00
Bed of Roses (1996) : 1996-01-01 06:00:00
Bio-Dome (1996) : 1996-01-01 06:00:00
Aladdin and the King of Thieves (1996) : 1996-01-01 06:00:00
Children of the Corn: The Gathering (1996) : 1996-01-01 06:00:00
Daniel Defoe's Robinson Crusoe (1996) : 1996-01-01 06:00:00
Juror, The (1996) : 1996-01-01 06:00:00
Queen Margot (Reine Margot, La) (1994) : 1996-01-01 06:00:00
Lawnmower Man 2: Beyond Cyberspace (1996) : 1996-01-01 06:00:00
Amityville: Dollhouse (1996) : 1996-01-01 06:00:00
White Squall (1996) : 1996-01-01 06:00:00
Eye for an Eye (1996) : 1996-01-01 06:00:00
Two if by Sea (1996) : 1996-01-01 06:00:00
Don't Be a Menace to South Central While Drinking Your Juice in the Hood (1996) : 1996-01-01 06:00:00
Two Much (1996) : 1996-01-01 06:00:00
Big Bully (1996) : 1996-01-01 06:00:0

---

In [29]:
# Be sure to run this cell when you are finished. Thank you.
client.close()

# Save your notebook, then `File > Close and Halt`

---