# Aggregates in MongoDB

In MongoDB, aggregations are handled via the so-called "aggregation pipeline". 


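An aggregation pipeline is a sequence of stages, each of which transforms the documents passed on by the previous stage. 
Two of the most commonly used stages are the match stage (`$match`) and the group stage (`$group`). 
Combined with stages such as `$sort`, `$limit`, and `$project`, they allow for powerful aggregate expressions. 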
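
To make the idea of stages feeding into one another concrete, here is a minimal pure-Python sketch of what a match, a group, and a sort do conceptually. It does not touch MongoDB: the `docs` list and the helper names `match`, `group_count`, and `sort_by` are made up for illustration, and the real server works differently under the hood.

```python
from collections import Counter

# Hypothetical sample documents, loosely shaped like the movie data used below.
docs = [
    {"title": "A", "year": 1995, "Comedy": 1},
    {"title": "B", "year": 1995, "Comedy": 0},
    {"title": "C", "year": 1996, "Comedy": 1},
    {"title": "D", "year": 1996, "Comedy": 1},
]

def match(documents, field, value):
    """Rough analogue of {"$match": {field: value}}: keep matching documents."""
    return [d for d in documents if d.get(field) == value]

def group_count(documents, field):
    """Rough analogue of {"$group": {"_id": "$<field>", "count": {"$sum": 1}}}."""
    counts = Counter(d[field] for d in documents)
    return [{"_id": key, "count": n} for key, n in counts.items()]

def sort_by(documents, field, direction=1):
    """Rough analogue of {"$sort": {field: direction}} (1 ascending, -1 descending)."""
    return sorted(documents, key=lambda d: d[field], reverse=(direction == -1))

# Each stage consumes the output of the previous one, just like a pipeline.
result = sort_by(group_count(match(docs, "Comedy", 1), "year"), "count", -1)
print(result)
```

The nesting of the three calls mirrors the order of the stages in a pipeline list: the innermost call runs first.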

Documentation for the aggregation pipeline can be found at https://docs.mongodb.com/v3.4/core/aggregation-pipeline/, which in turn links to several related pages. 
If you are looking for more information, we suggest reading these pages. 

The examples we provide below are largely based on the examples in the documentation, 
adapted for our data and extended where relevant. 

In [1]:
from bson.son import SON # for ordering by more than one column
from pprint import pprint
from pymongo import MongoClient
import pymongo

# Initialize a Mongo Client
#################################################
# Update UPDATE-ME in the connection code
#################################################
# Client 1 - mongodb-1.dsa.missouri.edu
# Client 2 - mongodb-2.dsa.missouri.edu
# Client 3 - mongodb-3.dsa.missouri.edu
# Client 4 - mongodb-4.dsa.missouri.edu
#################################################
#
client = MongoClient('mongodb-2.dsa.missouri.edu',
                     username='ml_small_reader',
                     password='mlsmall.read',
                     authSource='ml_small')
# retrieve the ml_small database from the connection
db = client.ml_small

### Number of movies released per year

Let's jump into the aggregate pipeline with a few examples, 
beginning with the number of movies released each year, ordered by date ascending.

In [2]:
db.item.find_one()

{'Action': 0,
 'Adventure': 0,
 'Animation': 0,
 "Children's": 1,
 'Comedy': 1,
 'Crime': 0,
 'Documentary': 0,
 'Drama': 1,
 'Fantasy': 0,
 'Film-Noir': 0,
 'Horror': 0,
 'IMDb_URL': 'http://us.imdb.com/M/title-exact?Babe%20(1995)',
 'Musical': 0,
 'Mystery': 0,
 'Romance': 0,
 'Sci-Fi': 0,
 'Thriller': 0,
 'War': 0,
 'Western': 0,
 '_id': ObjectId('5b231f5ad698289b415e67f5'),
 'movie_id': 8,
 'movie_title': 'Babe (1995)',
 'release_date': datetime.datetime(1995, 1, 1, 6, 0),
 'unknown': 0,
 'video_release_date': ''}

In [3]:
pipeline = [
    {"$group" : { "_id" : {"$year": "$release_date"} , "count" : { "$sum" : 1 }}},
    {"$sort": { "_id": 1 }}
]
for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': 1922, 'count': 1}
{'_id': 1926, 'count': 1}
{'_id': 1930, 'count': 1}
{'_id': 1931, 'count': 1}
{'_id': 1932, 'count': 1}
{'_id': 1933, 'count': 2}
{'_id': 1934, 'count': 4}
{'_id': 1935, 'count': 4}
{'_id': 1936, 'count': 2}
{'_id': 1937, 'count': 4}
{'_id': 1938, 'count': 3}
{'_id': 1939, 'count': 7}
{'_id': 1940, 'count': 8}
{'_id': 1941, 'count': 5}
{'_id': 1942, 'count': 2}
{'_id': 1943, 'count': 4}
{'_id': 1944, 'count': 5}
{'_id': 1945, 'count': 4}
{'_id': 1946, 'count': 5}
{'_id': 1947, 'count': 5}
{'_id': 1948, 'count': 3}
{'_id': 1949, 'count': 4}
{'_id': 1950, 'count': 7}
{'_id': 1951, 'count': 5}
{'_id': 1952, 'count': 3}
{'_id': 1953, 'count': 2}
{'_id': 1954, 'count': 7}
{'_id': 1955, 'count': 5}
{'_id': 1956, 'count': 4}
{'_id': 1957, 'count': 8}
{'_id': 1958, 'count': 9}
{'_id': 1959, 'count': 4}
{'_id': 1960, 'count': 5}
{'_id': 1961, 'count': 3}
{'_id': 1962, 'count': 5}
{'_id': 1963, 'count': 6}
{'_id': 1964, 'count': 2}
{'_id': 1965, 'count': 5}
{'_id': 1966

### Number of movies released per year: Top 10 years, ordered by count.


In [4]:
pipeline = [
    {"$group" : { "_id" : {"$year": "$release_date"} , "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }},
    {"$limit": 10}
]

for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': 1996, 'count': 355}
{'_id': 1997, 'count': 286}
{'_id': 1995, 'count': 219}
{'_id': 1994, 'count': 214}
{'_id': 1993, 'count': 126}
{'_id': 1998, 'count': 65}
{'_id': 1992, 'count': 37}
{'_id': 1990, 'count': 24}
{'_id': 1991, 'count': 22}
{'_id': 1989, 'count': 15}


**The above query:**
 1. Groups the documents by year of release, calculated by applying the `$year` operator to the `release_date` field.  
 1. Creates a new field called "count" whose value comes from the `$sum` operator, adding 1 for each document in the group.
 1. Sorts the groups by the count field in descending order.
 1. Limits the results to 10 documents.
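
The four steps above can be wrapped in a small helper so the same pattern is reusable for other date parts. This is only a sketch: `count_by_date_part` and its parameters are hypothetical names introduced here, and the function merely builds the list of stages without contacting the database.

```python
def count_by_date_part(operator, date_field="release_date", limit=None):
    """Build a pipeline that counts documents per date part.

    `operator` is a MongoDB date operator name such as "$year", "$month",
    or "$dayOfMonth". Hypothetical helper: it only assembles the stages.
    """
    pipeline = [
        # Group on the extracted date part and add 1 per document.
        {"$group": {"_id": {operator: "$" + date_field}, "count": {"$sum": 1}}},
        # Largest groups first.
        {"$sort": {"count": -1}},
    ]
    if limit is not None:
        # Keep only the first `limit` groups.
        pipeline.append({"$limit": limit})
    return pipeline

top_years = count_by_date_part("$year", limit=10)
print(top_years)
```

With the connection from the first cell, `db.item.aggregate(count_by_date_part("$year", limit=10))` should be equivalent to the query above.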

### Number of movies released for each month, sorted most to fewest


In [5]:
pipeline = [
    {"$group" : { "_id" : {"$month": "$release_date"} , "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }}
]

for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': 1, 'count': 1140}
{'_id': 3, 'count': 77}
{'_id': 2, 'count': 67}
{'_id': 4, 'count': 62}
{'_id': 5, 'count': 59}
{'_id': 8, 'count': 48}
{'_id': 6, 'count': 48}
{'_id': 10, 'count': 42}
{'_id': 12, 'count': 38}
{'_id': 9, 'count': 38}
{'_id': 7, 'count': 33}
{'_id': 11, 'count': 29}


As you can see, most of the documents in our dataset were released in January. 
Now is a good time to highlight the fact that our data may not reflect reality.
Such a strong bias toward the first month of the year suggests that movies for which the data
collectors could not find a specific release date probably ended up with January as the default. 

Let's explore this idea by looking at the same breakdown, but based on day.

In [6]:
pipeline = [
    {"$group" : { "_id" : {"$dayOfMonth": "$release_date"} , "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }}
]

for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': 1, 'count': 1121}
{'_id': 16, 'count': 32}
{'_id': 23, 'count': 28}
{'_id': 20, 'count': 28}
{'_id': 25, 'count': 26}
{'_id': 13, 'count': 26}
{'_id': 27, 'count': 25}
{'_id': 9, 'count': 24}
{'_id': 14, 'count': 23}
{'_id': 26, 'count': 23}
{'_id': 6, 'count': 23}
{'_id': 22, 'count': 22}
{'_id': 10, 'count': 22}
{'_id': 8, 'count': 21}
{'_id': 30, 'count': 21}
{'_id': 2, 'count': 20}
{'_id': 11, 'count': 19}
{'_id': 4, 'count': 18}
{'_id': 18, 'count': 17}
{'_id': 19, 'count': 17}
{'_id': 29, 'count': 15}
{'_id': 7, 'count': 15}
{'_id': 12, 'count': 15}
{'_id': 28, 'count': 14}
{'_id': 21, 'count': 13}
{'_id': 17, 'count': 9}
{'_id': 15, 'count': 9}
{'_id': 5, 'count': 9}
{'_id': 3, 'count': 9}
{'_id': 31, 'count': 9}
{'_id': 24, 'count': 8}


With 1121 entries having a release date on the first day of the month, 
it would seem our suspicions are well founded. 
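
To put these counts in proportion, the short calculation below uses only numbers copied from the month and day-of-month outputs above. It assumes every document has a release date, so that the month counts add up to the size of the collection; the variable names are ours.

```python
# Counts copied from the per-month output above.
month_counts = [1140, 77, 67, 62, 59, 48, 48, 42, 38, 38, 33, 29]
january_count = 1140      # documents with release month 1
first_day_count = 1121    # documents with release day 1 (any month)

# Assumption: each document falls into exactly one month group.
total = sum(month_counts)

print(total)
print(f"January share:   {january_count / total:.1%}")
print(f"First-day share: {first_day_count / total:.1%}")
```

If release dates were spread evenly, we would expect roughly one twelfth of the documents in any given month and about one thirtieth on any given day, so shares of around two thirds are far out of line.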

As one more example of aggregation with filtering, 
let's count the number of movies per release day, where the release month is January.

In [7]:
pipeline = [
    {"$project": {
        "_id": {"$dayOfMonth": "$release_date"}, 
        "monthReleased": {"$month": "$release_date"}
    }},
    {"$match": {"monthReleased": 1}},
    {"$group" : { "_id" : "$_id" , "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }}
]

for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': 1, 'count': 1097}
{'_id': 10, 'count': 7}
{'_id': 16, 'count': 7}
{'_id': 24, 'count': 5}
{'_id': 30, 'count': 5}
{'_id': 31, 'count': 4}
{'_id': 9, 'count': 4}
{'_id': 17, 'count': 4}
{'_id': 29, 'count': 2}
{'_id': 23, 'count': 2}
{'_id': 15, 'count': 1}
{'_id': 21, 'count': 1}
{'_id': 22, 'count': 1}


**The above pipeline:**
 1. Uses `$project` to set the \_id field to the result of the `$dayOfMonth` operation on the `release_date` field.
 1. In the same `$project` stage, calculates a new field, `monthReleased`, that holds the month of the release date and is used for filtering.
 1. Limits our results using `$match` to those documents where `monthReleased` is 1 (January).
 1. Groups the remaining documents by their release day (of the month) and adds a count field. 
 1. Sorts the groups in descending order based on the count field.
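
The same January breakdown can probably also be obtained without a `$project` stage, by grouping on a compound key first and filtering afterwards. The pipeline below is a sketch of that alternative, not the approach used in this notebook; the key names `month` and `day` are our own choice.

```python
# Hypothetical variant: group on a compound (month, day) key first,
# then keep only the January groups.
pipeline = [
    {"$group": {
        "_id": {
            "month": {"$month": "$release_date"},
            "day": {"$dayOfMonth": "$release_date"},
        },
        "count": {"$sum": 1},
    }},
    # Dot notation reaches into the compound _id produced by $group.
    {"$match": {"_id.month": 1}},
    {"$sort": {"count": -1}},
]

# The stage names, in execution order.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
```

Passing this list to `db.item.aggregate(...)` should give the same counts as above, with `_id` now a sub-document holding both the month and the day. Note the trade-off: here the filter runs after grouping, so every month is grouped before the non-January groups are discarded.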

There are even more powerful things you can do with the aggregation pipeline.
In fact, if you can think of something you want to know about the data, 
there is almost definitely a way to get the answer using the aggregate pipeline. 

## <span style="background:yellow">Your Turn</span>

Complete the following lab work to gain hands-on experience with the aggregation pipeline.
**Make sure that you have run the above cells and have a connection to the DB!**

### Task 1

#### Get counts of movies based on combinations of (Action, Adventure) genres. 
This means that your result should have 4 documents.

In [27]:
# Add your code below
# -------------------------


pipeline = [
    
    #{"$match": {"Action": 1,"Adventure": 1}},
    {"$group" : { "_id" : { 'Action': "$Action", 'Adventure': "$Adventure" }, "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }}
]

for doc in db.item.aggregate(pipeline):
    print(doc)

{'_id': {'Action': 0, 'Adventure': 0}, 'count': 1370}
{'_id': {'Action': 1, 'Adventure': 0}, 'count': 176}
{'_id': {'Action': 1, 'Adventure': 1}, 'count': 75}
{'_id': {'Action': 0, 'Adventure': 1}, 'count': 60}


In [None]:
db.item.find_one()

### Task 2

#### Get counts of reviewers based on gender.
Use the `user` collection. Explore this collection using find_one() if necessary.

In [21]:
# Add your code below
# -------------------------
pipeline = [
    
    {"$group" : { "_id" : "$gender" , "count" : { "$sum" : 1 }}},
    {"$sort": { "count": -1 }}
]

for doc in db.user.aggregate(pipeline):
    print(doc)




{'_id': 'M', 'count': 670}
{'_id': 'F', 'count': 273}


In [20]:
db.user.find_one()

{'_id': ObjectId('5b18e2b3d698289b41f3cf1a'),
 'age': 33,
 'gender': 'F',
 'occupation': 'other',
 'user_id': 5,
 'zip_code': 15213}

### Task 3

#### Get counts of reviewers, grouped by gender and age.
To sort by both gender and age, you will need to use __[SON](https://stackoverflow.com/questions/36566166/sort-the-result-from-a-pymongo-query#36566229)__.
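
The reason an ordered mapping is needed is that the first key in a `$sort` specification is the primary sort key, so the order of the keys changes the result. The sketch below illustrates this with the standard library's `collections.OrderedDict`, which the PyMongo documentation mentions alongside SON for this purpose. The `rows` list is a hand-made stand-in for grouped results, and the sort itself is done in plain Python rather than by MongoDB.

```python
from collections import OrderedDict

# Hypothetical grouped results, shaped like the output of a $group stage.
rows = [
    {"_id": {"gender": "M", "age": 20}, "count": 22},
    {"_id": {"gender": "F", "age": 20}, "count": 10},
    {"_id": {"gender": "M", "age": 13}, "count": 3},
    {"_id": {"gender": "F", "age": 13}, "count": 2},
]

# Order-sensitive sort specification: age is the primary key, gender breaks ties.
sort_spec = OrderedDict([("_id.age", 1), ("_id.gender", 1)])
sort_stage = {"$sort": sort_spec}

# Pure-Python equivalent of that stage: sort by age first, then by gender.
ordered = sorted(rows, key=lambda r: (r["_id"]["age"], r["_id"]["gender"]))
for row in ordered:
    print(row)
```

On Python 3.7 and later, plain dictionaries also preserve insertion order, so a regular dict should generally work as well; using SON or `OrderedDict` simply makes the intent explicit.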

In [5]:
# Add your code below
# -------------------------
from bson.son import SON
pipeline = [
    
    {"$group" : { 
                    "_id" : {"gender":"$gender","age":"$age" } , 
                    "count" : { "$sum" : 1 }
                }
    },
    {"$sort": SON([( "_id.age", 1 ),( "_id.gender", 1 )])}
]

for doc in db.user.aggregate(pipeline):
    print(doc)



{'_id': {'age': 7, 'gender': 'M'}, 'count': 1}
{'_id': {'age': 10, 'gender': 'M'}, 'count': 1}
{'_id': {'age': 11, 'gender': 'M'}, 'count': 1}
{'_id': {'age': 13, 'gender': 'F'}, 'count': 2}
{'_id': {'age': 13, 'gender': 'M'}, 'count': 3}
{'_id': {'age': 14, 'gender': 'F'}, 'count': 3}
{'_id': {'age': 15, 'gender': 'F'}, 'count': 3}
{'_id': {'age': 15, 'gender': 'M'}, 'count': 3}
{'_id': {'age': 16, 'gender': 'F'}, 'count': 2}
{'_id': {'age': 16, 'gender': 'M'}, 'count': 3}
{'_id': {'age': 17, 'gender': 'F'}, 'count': 3}
{'_id': {'age': 17, 'gender': 'M'}, 'count': 11}
{'_id': {'age': 18, 'gender': 'F'}, 'count': 11}
{'_id': {'age': 18, 'gender': 'M'}, 'count': 7}
{'_id': {'age': 19, 'gender': 'F'}, 'count': 8}
{'_id': {'age': 19, 'gender': 'M'}, 'count': 15}
{'_id': {'age': 20, 'gender': 'F'}, 'count': 10}
{'_id': {'age': 20, 'gender': 'M'}, 'count': 22}
{'_id': {'age': 21, 'gender': 'F'}, 'count': 7}
{'_id': {'age': 21, 'gender': 'M'}, 'count': 20}
{'_id': {'age': 22, 'gender': 'F'},

In [None]:
# Be sure to run this cell when you are finished. Thank you.
client.close()

# Save your notebook, then `File > Close and Halt`

---