# MongoDB - Putting It All Together

The goal of this exercise is to use joins (`$lookup`) and the aggregation pipeline together to get a complete overview of how querying a NoSQL database such as MongoDB works. We will use the MovieLens data from the labs as well as a new Taxi dataset. You are expected to create cells for data exploration as needed.

In [1]:
from pprint import pprint
from pymongo import MongoClient
import pymongo

# Initialize a Mongo Client
#################################################
# Set the host in the connection code to one of the servers below
#################################################
# Client 1 - mongodb-1.dsa.missouri.edu
# Client 2 - mongodb-2.dsa.missouri.edu
# Client 3 - mongodb-3.dsa.missouri.edu
# Client 4 - mongodb-4.dsa.missouri.edu
#################################################
#
client = MongoClient('mongodb-2.dsa.missouri.edu',
                     username='mlLargeReader',
                     password='mlLargeReader.create.role',
                     authSource='mlLarge')

# retrieve the mlLarge database from the connection
db = client.mlLarge

In [2]:
db.list_collection_names()

['movies', 'links', 'tags', 'ratings', 'genome-scores', 'genome-tags', 'users']

In [3]:
db.movies.find_one()

{'_id': ObjectId('5b18b242d698289b410b776c'),
 'genres': 'Comedy|Drama|Romance',
 'movieId': 4,
 'title': 'Waiting to Exhale (1995)'}

In [4]:
db.tags.find_one()

{'_id': ObjectId('5b18b23cd698289b41045af7'),
 'movieId': 4141,
 'tag': 'Mark Waters',
 'timestamp': 1240597180,
 'userId': 18}

In [27]:
db.ratings.find_one()

{'_id': ObjectId('5b18b242d698289b410be21d'),
 'movieId': 32,
 'rating': 3.5,
 'timestamp': 1112484819,
 'userId': 1}

In [28]:
db.users.find_one()

{'_id': ObjectId('5b18b23cd698289b41045af7'),
 'userId': 18,
 'username': 'Sandy'}

## Task 1

Using the aggregation pipeline, get all movies and their tags, sorted by title and limited to 10 results.

In [2]:
# Add your code below
# -------------------------
documents = db.movies.aggregate([{
    # Join each movie to its tag documents on movieId
    "$lookup": {
        "from": "tags",
        "localField": "movieId",
        "foreignField": "movieId",
        "as": "tags"
    }
},{
    # Keep only the title and the list of tag strings
    "$group": {"_id": {"title": "$title", "tags": "$tags.tag"}}
},{
    "$sort": {"_id.title": 1}
},{
    "$limit": 10
}])

for document in documents:
    pprint(document)



{'_id': {'tags': ['BD-R'], 'title': '"Great Performances" Cats (1998)'}}
{'_id': {'tags': [],
         'title': '#chicagoGirl: The Social Network Takes on a Dictator '
                  '(2013)'}}
{'_id': {'tags': ['BD-R'], 'title': '$ (Dollars) (1971)'}}
{'_id': {'tags': [], 'title': '$5 a Day (2008)'}}
{'_id': {'tags': ['Australia', 'claymation'], 'title': '$9.99 (2008)'}}
{'_id': {'tags': ['BD-R'], 'title': '$ellebrity (Sellebrity) (2012)'}}
{'_id': {'tags': ['Yann Demange'], 'title': "'71 (2014)"}}
{'_id': {'tags': [], 'title': "'Hellboy': The Seeds of Creation (2004)"}}
{'_id': {'tags': [],
         'title': "'Human' Factor, The (Human Factor, The) (1975)"}}
{'_id': {'tags': ['youtube', 'based on a short story', 'based on a book'],
         'title': "'Neath the Arizona Skies (1934)"}}


## Task 2

Using the aggregation pipeline, get the top 10 movies by average rating, selecting the title and the average rating.

Note that because of the __[default behavior](https://stackoverflow.com/questions/45724785/aggregate-lookup-total-size-of-documents-in-matching-pipeline-exceeds-maximum-d1)__ of `$lookup`, you'll want to follow it with an `$unwind` stage.

To paraphrase, `$lookup` produces a target array within the parent document. If there are too many matches, this can cause the document to exceed the 16 MB size limit.

When `$unwind` immediately follows `$lookup`, the pipeline instead produces one document per element of the would-be array, with that element in the array's place.
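
To make the change in document shape concrete, here is a minimal plain-Python sketch that mimics the two stages on in-memory lists. It only illustrates the shapes involved and is not how MongoDB implements them; the sample documents and the helper names `lookup` and `unwind` are made up for this example.

```python
# Made-up sample data standing in for the movies and ratings collections.
movies = [
    {"movieId": 1, "title": "Movie A"},
    {"movieId": 2, "title": "Movie B"},  # has no ratings
]
ratings = [
    {"movieId": 1, "rating": 4.0},
    {"movieId": 1, "rating": 5.0},
]


def lookup(docs, foreign, local_field, foreign_field, as_field):
    """Mimic $lookup: attach an array of matching foreign documents."""
    return [
        {**doc, as_field: [f for f in foreign
                           if f[foreign_field] == doc[local_field]]}
        for doc in docs
    ]


def unwind(docs, field):
    """Mimic $unwind's default behavior: one output document per array
    element, so documents whose array is empty produce no output."""
    return [{**doc, field: element}
            for doc in docs
            for element in doc[field]]


joined = lookup(movies, ratings, "movieId", "movieId", "ratings")
unwound = unwind(joined, "ratings")

for doc in unwound:
    print(doc)
```

After the join, each movie carries a `ratings` array (empty for "Movie B"). After the unwind, `ratings` holds a single rating document, and "Movie B" no longer appears because its array was empty.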

In [3]:
# Add your code below
# -------------------------
documents = db.movies.aggregate([{
    # Join each movie to its ratings on movieId
    "$lookup": {
        "from": "ratings",
        "localField": "movieId",
        "foreignField": "movieId",
        "as": "ratings"
    }
},{
    # One document per rating. By default $unwind drops movies whose
    # ratings array is empty, so no separate $match on [] is needed.
    "$unwind": "$ratings"
},{
    "$group": {"_id": {"title": "$title"}, "Average": {"$avg": "$ratings.rating"}}
},{
    "$sort": {"Average": -1}
},{
    "$limit": 10
}])

for document in documents:
    pprint(document)


{'Average': 5.0, '_id': {'title': 'The Wrecking Crew (2008)'}}
{'Average': 5.0, '_id': {'title': 'Mutantes (2009)'}}
{'Average': 5.0, '_id': {'title': 'The Color of Milk (2004)'}}
{'Average': 5.0, '_id': {'title': 'A Blank on the Map (1971)'}}
{'Average': 5.0, '_id': {'title': 'The Beautiful Story (1992)'}}
{'Average': 5.0, '_id': {'title': 'Slingshot Hip Hop (2008)'}}
{'Average': 5.0, '_id': {'title': 'A Gun for Jennifer (1997)'}}
{'Average': 5.0, '_id': {'title': 'Freeheld (2007)'}}
{'Average': 5.0, '_id': {'title': 'Bill Hicks: Sane Man (1989)'}}
{'Average': 5.0,
 '_id': {'title': 'The Garden of Sinners - Chapter 5: Paradox Paradigm '
                  '(2008)'}}


## Task 3

Using the aggregation pipeline, get the movie with the highest average rating among titles starting with the letter M, selecting the title, the movieId, and the average rating.

In [5]:
# Add your code below
# -------------------------
documents = db.movies.aggregate([{
    # Join each movie to its ratings on movieId
    "$lookup": {
        "from": "ratings",
        "localField": "movieId",
        "foreignField": "movieId",
        "as": "ratings"
    }
},{
    # One document per rating; movies with no ratings are dropped here
    "$unwind": "$ratings"
},{
    # Keep titles that start with "m" or "M"
    "$match": {"title": {"$regex": "^m", "$options": "i"}}
},{
    "$group": {"_id": {"title": "$title", "movieId": "$movieId"},
               "Average": {"$avg": "$ratings.rating"}}
},{
    "$sort": {"Average": -1}
},{
    # Show the top 5, since several titles tie at 5.0
    "$limit": 5
}])

for document in documents:
    pprint(document)

    

{'Average': 5.0, '_id': {'movieId': 116387, 'title': 'Muddy River (1981)'}}
{'Average': 5.0, '_id': {'movieId': 126219, 'title': 'Marihuana (1936)'}}
{'Average': 5.0, '_id': {'movieId': 81117, 'title': 'Moth, The (Cma) (1980)'}}
{'Average': 5.0, '_id': {'movieId': 114214, 'title': 'Mishen (Target) (2011)'}}
{'Average': 5.0, '_id': {'movieId': 129741, 'title': 'Mutantes (2009)'}}
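
The pipeline above filters on the title only after every movie has been joined to its ratings. As a possible variant, the sketch below builds the same stages as a plain Python list with the title `$match` placed first, which may reduce the number of movies that reach `$lookup`. The function name `build_top_rated_pipeline` and its parameters are hypothetical, and the secondary sort key on the title is an addition made here so that ties come back in a stable order.

```python
import re


def build_top_rated_pipeline(prefix, limit=5):
    """Hypothetical helper: return aggregation stages for the highest
    average ratings among movies whose title starts with `prefix`."""
    return [
        # Filter movies first so fewer documents enter the join.
        # re.escape keeps characters such as "$" from acting as regex syntax.
        {"$match": {"title": {"$regex": "^" + re.escape(prefix),
                              "$options": "i"}}},
        {"$lookup": {
            "from": "ratings",
            "localField": "movieId",
            "foreignField": "movieId",
            "as": "ratings",
        }},
        # One document per rating; movies with no ratings are dropped.
        {"$unwind": "$ratings"},
        {"$group": {
            "_id": {"title": "$title", "movieId": "$movieId"},
            "Average": {"$avg": "$ratings.rating"},
        }},
        # Highest average first; title breaks ties (an assumption added here).
        {"$sort": {"Average": -1, "_id.title": 1}},
        {"$limit": limit},
    ]


pipeline = build_top_rated_pipeline("M")

# With the connection from the first cell still open, this could be run as:
# for document in db.movies.aggregate(pipeline):
#     pprint(document)
```

Because the pipeline is an ordinary list of dictionaries, it can be printed and inspected before it is sent to the server.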


In [2]:
# Be sure to run this cell when you are finished. Thank you.
client.close()

# Save your notebook, then `File > Close and Halt`

---