# Tutorial 3: NoSQL

__The goal of this assignment is to create 10 "queries" based on 2 NoSQL databases.__

I put queries in quotes, as these databases do not provide a declarative query language such as SQL.

Instead, you must rely on the database API to extract information and combine the results with some Python code.

During this tutorial, I encourage you to consider the pros and cons of each database compared to relational databases.

__The first database (imdb_basics.shelve) contains the movie records of IMDB (basics)__.

This database is stored using `shelve`, a module of the Python standard library.

`shelve` is a simple Key-Value store that acts as a persistent dictionary.

This is the same model as:
- Redis
- Memcached
- Google LevelDB
- Amazon DynamoDB
- Facebook RocksDB
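
As a minimal sketch of the persistent-dictionary model, the snippet below writes one record to a temporary shelf, closes it, and reads the record back from disk. The file name `demo.shelve` and the record itself are made up for illustration; they are not part of the assignment data.

```python
import os
import shelve
import tempfile

# hypothetical record, shaped like the entries of imdb_basics.shelve
record = {'_id': 'tt0000001', 'primaryTitle': 'Example Movie', 'runtimeMinutes': 90}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.shelve')

    # write: keys must be strings, values can be any picklable object
    with shelve.open(path, 'c') as db:
        db[record['_id']] = record

    # reopen read-only: the shelf is accessed like a regular dict
    with shelve.open(path, 'r') as db:
        restored = db['tt0000001']
        key_count = len(db)
```

The only access path is the key: there is no way to ask the store for "all movies from 2017" without scanning every value.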

__The second database: (imdb.json) contains the person records of IMDB (names)__.

This database is stored using `tinymongo`, a drop-in replacement for MongoDB.

`tinymongo` is a Document store that provides query methods on JSON docs.

This is the same model as:
- MongoDB
- CouchDB
- ArangoDB
- RethinkDB
- ElasticSearch

__These two data models are [among the most popular NoSQL models](https://db-engines.com/en/ranking), and they are also used to build more [advanced data models](https://github.com/datathings/greycat/tree/master/plugins)__.

__Grade scale__: 20 points
- correct query: 2 points
- incorrect query: 0 points

__Further documentation__:
* https://www.imdb.com/interfaces/
* https://learnxinyminutes.com/docs/python/
* https://github.com/schapman1974/tinymongo/
* https://docs.python.org/3.6/library/shelve.html

# Core

In [1]:
# import shelve from standard library
import shelve

# open a connection to shelve database
basics = shelve.open('imdb_basics.shelve', 'r')

In [2]:
# import tinymongo from external dependency
from tinymongo import TinyMongoClient

# open a connection to tinymongo database
names = TinyMongoClient('.').imdb.names

# Examples

In [3]:
# get the first 20 keys from basics collection
list(basics.keys())[:20]

['tt0100275',
 'tt0112502',
 'tt0137204',
 'tt0331314',
 'tt0339736',
 'tt0366415',
 'tt0368133',
 'tt0451201',
 'tt0451279',
 'tt0460890',
 'tt0491175',
 'tt0491203',
 'tt0493405',
 'tt0498381',
 'tt0835802',
 'tt0874268',
 'tt0937225',
 'tt0974015',
 'tt1020581',
 'tt1024855']

In [4]:
# get the first item from the basics collection
basics['tt0100275']

{'_id': 'tt0100275',
 'titleType': 'movie',
 'primaryTitle': 'The Wandering Soap Opera',
 'originalTitle': 'La Telenovela Errante',
 'isAdult': 0,
 'startYear': 2017,
 'endYear': None,
 'runtimeMinutes': 80,
 'genres': ['Comedy', 'Drama', 'Fantasy']}

In [5]:
# get the first item from the names collection
names.find_one()

{'_id': 'nm7691973',
 'primaryName': 'Baste Granfon',
 'birthYear': 2012,
 'deathYear': None,
 'knownForTitles': ['tt7736492', 'tt5577086', 'tt6766834', 'tt8207358'],
 'primaryProfession': ['actor']}

# Queries

__1. How many movies are in the `basics` collection?__
- __hint__: you don't have to use a loop to answer this query
- __return__: Count (where Count = int)

In [6]:
def Q1():
    # a shelf supports len() directly, so no loop or key list is needed
    return len(basics)

Q1()

18199

In [7]:
assert isinstance(Q1(), int)

__2. Select the record associated to the movie whose primaryTitle is 'Blade Runner 2049'__
* __hint__: you have to use the methods the database provides
* __return__: Record (where Record = Dict[str, Any])

In [8]:
def Q2():
    # shelve has no secondary index, so scan every record for a matching title
    for record in basics.values():
        if record["primaryTitle"] == "Blade Runner 2049":
            return record

Q2()

{'_id': 'tt1856101',
 'titleType': 'movie',
 'primaryTitle': 'Blade Runner 2049',
 'originalTitle': 'Blade Runner 2049',
 'isAdult': 0,
 'startYear': 2017,
 'endYear': None,
 'runtimeMinutes': 164,
 'genres': ['Drama', 'Mystery', 'Sci-Fi']}

In [9]:
assert isinstance(Q2(), dict)
assert Q2()['primaryTitle'] == 'Blade Runner 2049'

__3. Select the primary title and runtime of every movie longer than 300 minutes (excluded)__
* __hint__: you have to construct your own return value
* __return__: List[Tuple[primaryTitle, runtimeMinutes]] (where primaryTitle = str, runtimeMinutes=int)

In [10]:
def Q3():
    final_list = list()
    for record in basics.values():
        runtime = record["runtimeMinutes"]
        # runtime can be None, so check it before comparing
        if runtime and runtime > 300:
            final_list.append((record["primaryTitle"], runtime))
    return final_list

Q3()

[('Who was Hitler', 450),
 ('Painting', 360),
 ('Next Stop', 406),
 ('Mazwara', 788),
 ('h36:', 2160),
 ('Stuck', 360),
 ('Nari', 6017),
 ('Europa: The Last Battle', 746),
 ('Sakhi', 1179),
 ('An Infants Journey: Reggio Emilia Approach', 912),
 ('1998: The Deadliest Year For Children In American History', 396),
 ('Laundry', 359),
 ('Make Me Fly', 623),
 ('Bullfighting Memories', 1100)]

In [11]:
assert len(Q3()) == 14
assert all(len(row) == 2 and row[1] > 300 for row in Q3())

__4. Select the record in the `names` collection associated to \_id 'nm0705356'__
* __hint__: use `find_one` to return only one record
* __return__: Record (where Record = Dict[str, Any])

In [12]:
def Q4():
    return names.find_one({ "_id": "nm0705356"} )

Q4()

{'_id': 'nm0705356',
 'primaryName': 'Daniel Radcliffe',
 'birthYear': 1989,
 'deathYear': None,
 'knownForTitles': ['tt0373889', 'tt4034354', 'tt0926084', 'tt1201607'],
 'primaryProfession': ['actor', 'soundtrack', 'producer']}

In [13]:
assert isinstance(Q4(), dict)
assert Q4()['_id'] == 'nm0705356'

__5. Select the primaryName of the first 20 persons born in 2000, sorted by name (descending)__
* __hint__: use the `find` method to return multiple results
* __return__: List[Name] (where Name = str)

In [14]:
def Q5():
    final_list = list()
    for name in names.find({"birthYear":2000}, sort=[('primaryName', -1)], skip=0, limit=20):
        final_list.append(name["primaryName"])
    return final_list

Q5()

['Zaira Wasim',
 'Willow Shields',
 'Viktor Derek',
 'Tobias Nikolai Haugland',
 'Stanislaw Cywka',
 'Shelby Lyon',
 'Ryan Henry Knight',
 'Rachelle Henry',
 'Pedro Diego',
 'Na-Na OuYang',
 'Moka Kamishiraishi',
 'Minami Hamabe',
 'Mima Ito',
 'Max Baissette de Malglaive',
 'Margaret Manousos',
 'Ludvig Fahlstedt',
 'Joshua Latorro',
 'Jonah Bryson',
 'Jackson Yee',
 'G.M. Whiting']

In [15]:
assert len(Q5()) == 20
assert all(isinstance(x, str) for x in Q5())

__6. Select the primaryName and birthYear of persons born after 2000 (excluded) and whose name starts with the letter 'M'__
* __hint__: use the `$and`, `$gt` and `$regex` operators of MongoDB
* __return__: List[Tuple[primaryName, birthYear]] (where primaryName = str, birthYear = int)

In [16]:
def Q6():
    final_list = list()
    bornafter = names.find({'birthYear': {'$gt': 2000}})
    for born in bornafter:
        if born["primaryName"].startswith("M"):
            final_list.append((born["primaryName"], born["birthYear"]))
    return final_list

Q6()

[('Mckenna Grace', 2006),
 ('Moxie Jillette', 2005),
 ('Maddie Dixon-Poirier', 2005),
 ('Mana Ashida', 2004),
 ('Mujtuba Ahmed', 2004),
 ('Mariangeli Collado', 2003),
 ('Madison Wolfe', 2002),
 ('Maisa Silva', 2002),
 ('Milo Parker', 2002)]

In [17]:
assert len(Q6()) == 9
assert all(len(row) == 2 for row in Q6())
assert all(row[1] > 2000 for row in Q6())
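
The solution above filters the first letter in Python, while the hint points at MongoDB-style operators. As a rough sketch of what such a filter document could look like, the snippet below builds one and evaluates it on a few in-memory records taken from the outputs above. The `matches` helper is hypothetical and only covers the three operators of the hint; it is not part of `tinymongo`, whose operator support should be checked against its own documentation.

```python
import re

# filter document in MongoDB query syntax:
# birthYear > 2000 AND primaryName starts with 'M'
query = {'$and': [{'birthYear': {'$gt': 2000}},
                  {'primaryName': {'$regex': '^M'}}]}

def matches(doc, query):
    """Hypothetical evaluator for $and, $gt and $regex only."""
    for field, cond in query.items():
        if field == '$and':
            # every sub-query has to match
            if not all(matches(doc, sub) for sub in cond):
                return False
            continue
        value = doc.get(field)
        for op, arg in cond.items():
            if op == '$gt' and not (value is not None and value > arg):
                return False
            if op == '$regex' and not (isinstance(value, str) and re.search(arg, value)):
                return False
    return True

# sample documents, reduced to the two fields the query needs
people = [
    {'primaryName': 'Mckenna Grace', 'birthYear': 2006},
    {'primaryName': 'Minami Hamabe', 'birthYear': 2000},
    {'primaryName': 'Zaira Wasim', 'birthYear': 2000},
    {'primaryName': 'Daniel Radcliffe', 'birthYear': 1989},
]

selected = [(p['primaryName'], p['birthYear']) for p in people if matches(p, query)]
```

With a document store that supports these operators, the same `query` dictionary would be passed to `find`, so the filtering happens inside the database instead of in a Python loop.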

__7. Compute the average movie runtime in minutes__
* __hint__: aggregation has to be performed with code
* __return__: Average (where Average = float)

In [18]:
def Q7():
    total_minutes = 0
    total_runtimes = 0
    for key in basics:
        runtime = basics[key]["runtimeMinutes"]
        # a missing runtime (None) counts as 0 minutes, so it still lowers the average
        total_minutes += runtime or 0
        total_runtimes += 1
    return total_minutes / total_runtimes

Q7()

61.72877630639046
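
The value above treats movies without a runtime as 0 minutes long. A possible alternative is to average only over the movies whose runtime is known. The sketch below compares both variants on a made-up in-memory dictionary; `sample` is a hypothetical stand-in for the `basics` collection.

```python
# made-up records standing in for the basics collection
sample = {
    'tt1': {'runtimeMinutes': 80},
    'tt2': {'runtimeMinutes': 164},
    'tt3': {'runtimeMinutes': None},  # runtime unknown
}

# variant 1: unknown runtimes count as 0 minutes
total = sum(record['runtimeMinutes'] or 0 for record in sample.values())
avg_with_zeros = total / len(sample)

# variant 2: unknown runtimes are left out of the average
known = [record['runtimeMinutes'] for record in sample.values()
         if record['runtimeMinutes'] is not None]
avg_known_only = sum(known) / len(known)
```

Which variant is the right one depends on how the question is read; a relational database would behave like variant 2, since SQL's `AVG` ignores `NULL` values.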

In [19]:
assert isinstance(Q7(), float)

__8. Select the primary name and the primary titles that the first 20 persons are known for__
* __hint__: you have to join the two database collections
* __return__: List[Tuple[primaryName, List[primaryTitle]]] (where primaryName = primaryTitle = str)

In [20]:
def Q8():
    final_list = list()
    # "first 20 persons" is read here as the first 20 by descending primaryName
    for record in names.find(sort=[('primaryName', -1)], skip=0, limit=20):
        # join: resolve each title id against basics, skipping ids with no movie record
        titles = [basics[title]["primaryTitle"]
                  for title in (record["knownForTitles"] or [])
                  if title in basics]
        final_list.append((record["primaryName"], titles))
    return final_list

Q8()

In [21]:
assert len(Q8()) == 20
assert all(len(row) == 2 for row in Q8())
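
Neither database can join the two collections, so the join has to happen in application code: for each person, every id in `knownForTitles` is looked up as a key in the key-value store. A minimal sketch on made-up in-memory data follows; `movies` and `people` are hypothetical stand-ins for `basics` and `names`.

```python
# hypothetical stand-ins for the two collections
movies = {
    'tt1': {'primaryTitle': 'First Movie'},
    'tt2': {'primaryTitle': 'Second Movie'},
}
people = [
    {'primaryName': 'Alice', 'knownForTitles': ['tt1', 'tt2']},
    {'primaryName': 'Bob', 'knownForTitles': ['tt2', 'tt9']},  # tt9 has no movie record
]

# for each person, resolve title ids to titles and skip ids without a movie record
joined = [
    (person['primaryName'],
     [movies[tid]['primaryTitle'] for tid in person['knownForTitles'] if tid in movies])
    for person in people
]
```

Here `joined` pairs each name with the list of titles that could be resolved, which is the shape the question asks for.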

__9. Select a sorted (ascending) and distinct list of movie genres__
* __hint__: Python provides a `set` structure and `sorted` function
* __return__: List[Genre] (where Genre = str)

In [22]:
def Q9():
    final_set = set()
    for record in basics.values():
        # a set ignores duplicates, so no membership test is needed
        final_set.update(record["genres"])
    return sorted(final_set)

Q9()

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Biography',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Game-Show',
 'History',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'News',
 'Reality-TV',
 'Romance',
 'Sci-Fi',
 'Short',
 'Sport',
 'Talk-Show',
 'Thriller',
 'War',
 'Western']

In [23]:
assert len(Q9()) == 27
assert all(isinstance(x, str) for x in Q9())

__10. Select the number of distinct movies that a person is known for and that exist in the `basics` collection__
- __hint__: pay attention to the words "distinct" and "exist" in the question
- __return__: Count (where Count = int)

In [None]:
def Q10():
    # distinct: collect every title id referenced by a person into a set
    known_titles = set()
    for name in names.find():
        known_titles.update(name["knownForTitles"] or [])
    # exists: count only the ids that are keys of the basics collection
    return sum(1 for title in known_titles if title in basics)

Q10()

In [None]:
assert isinstance(Q10(), int)
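
The two key words of the question map onto set operations: "distinct" is a set union over all `knownForTitles` lists, and "exists" is an intersection with the set of movie ids. A small sketch on made-up ids, where `movie_ids` and `known_for` are hypothetical stand-ins for the keys of `basics` and the lists stored in `names`:

```python
# hypothetical movie ids and known-for lists
movie_ids = {'tt1', 'tt2', 'tt3'}
known_for = [['tt1', 'tt2'], ['tt2', 'tt9'], ['tt1']]

# distinct: merge all lists into one set, duplicates collapse automatically
distinct_known = set()
for titles in known_for:
    distinct_known.update(titles)

# exists: keep only the ids that are also movie ids
existing = distinct_known & movie_ids
count = len(existing)
```

Set membership tests take constant time on average, whereas testing membership in a list scans the list on every lookup.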