<If there are any questions, reach out via winston.vargo@mongodb.com>

[Find Me on LinkedIn!](https://www.linkedin.com/in/winston-vargo/)

# Welcome to Session 4: Driver Considerations and Data Modelling
As the title says, welcome to session 4. This session will be slightly different from the previous 3 sessions.
The first session was a conceptual overview of MongoDB and "getting started."
Sessions 2 and 3 were content heavy and introduced many of the specifics of working with MongoDB as a developer.
This session takes the concepts from sessions 1-3 and applies them to actually _developing applications against MongoDB_.

More concretely, we will be taking many of the concepts from (2) and (3) which were primarily in the mongo shell, and showing those concepts using a MongoDB driver.

**THIS SESSION USES THE PYMONGO DRIVER, WHICH IS THE OFFICIAL MONGODB PYTHON DRIVER. DRIVERS DIFFER IN THEIR IMPLEMENTATION FROM LANGUAGE TO LANGUAGE.**

## Drivers
A _driver_ is a program (or SDK) that implements a protocol for a database connection. More practically, it's the functionality that allows a developer to connect to a database, read and write data, etc... USE THE DATABASE!

MongoDB maintains and supports drivers in many different languages. However, the certification is only offered in Python, Node.js, Java, and C#.

Let's look at some basic python use of the MongoDB driver:

In [12]:
!python --version

Python 2.7.18


In [3]:
#driver import
import sys
!{sys.executable} -m pip install --upgrade pip
!{sys.executable} -m pip install --upgrade "pymongo[srv]"
import pymongo

DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Requirement already up-to-date: pip in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (20.3.4)
Requirement already up-to-date: pymongo[srv] in /Library/Frameworks/Python.framewo

In [1]:
#Variables for Python use:
mongodb_connection_string = ""
mongodb_username = ""
mongodb_password = ""

MDB_AUTH_URI = "mongodb+srv://" + mongodb_username + ":" + mongodb_password + mongodb_connection_string.replace("mongodb+srv://<username>:<password>", "") + "&tlsAllowInvalidCertificates=true"

print(MDB_AUTH_URI)

mongodb+srv://:&tlsAllowInvalidCertificates=true
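The concatenation above assumes the username and password contain no characters that are reserved in a URI. As a minimal sketch, assuming Python 3 (the notebook itself ran on 2.7) and made-up credentials, the standard library's `urllib.parse.quote_plus` can percent-encode them first. The `build_uri` helper and its default options are hypothetical and only for illustration:

```python
from urllib.parse import quote_plus


def build_uri(username, password, host, options="retryWrites=true&w=majority"):
    """Assemble a mongodb+srv URI with percent-encoded credentials.

    Hypothetical helper for illustration only; not part of PyMongo.
    """
    return "mongodb+srv://{}:{}@{}/?{}".format(
        quote_plus(username), quote_plus(password), host, options
    )


# Made-up credentials containing characters that are reserved in URIs.
example_uri = build_uri("app_user", "p@ss:word/1", "devcert.6ngvd.mongodb.net")
print(example_uri)
```

Without the encoding step, the `@`, `:` and `/` in the password would be read as URI delimiters.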


## Driver Connection String
We need _some_ way to tell the driver (or shell, or... whatever) _where_ the MongoDB database is hosted. For that, we have the **connection string**, a URI to describe where the mongo database is hosted and how to authenticate against it.

Regardless of the language used, the format of the connection string is similar. The cell above builds and prints one.

It has the following components:
- `mongodb+srv` is the URI scheme. It tells the driver to use the DNS seed list format: the driver performs a DNS SRV lookup on the hostname to discover the actual MongoDB nodes.
- `<username>:<password>@` is used for auth
- `devcert.6ngvd.mongodb.net` is the resource identifier for the hosts
- `?retryWrites=true&w=majority` are options for the connection. In this case, retryable writes are turned on and write concern is majority.
- `&tlsAllowInvalidCertificates=true` is another option that **was only added because these Jupyter Notebooks are pinned to Python 2.7.18, which does not have a valid certificate to authenticate against MongoDB. Don't do it in real life.**
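The components listed above can be pulled apart with generic URI tooling. This is a minimal sketch, assuming Python 3 and placeholder credentials (`user` / `secret`). It uses the standard library's `urllib.parse`, not the driver's own parser, so it only splits the string and performs no DNS lookup or option validation:

```python
from urllib.parse import urlsplit, parse_qs

# Placeholder credentials; the host and options come from the list above.
uri = "mongodb+srv://user:secret@devcert.6ngvd.mongodb.net/?retryWrites=true&w=majority"

parts = urlsplit(uri)
options = parse_qs(parts.query)

print(parts.scheme)    # the URI scheme
print(parts.username)  # auth: username
print(parts.hostname)  # resource identifier for the hosts
print(options)         # connection options as a dict of lists
```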

The `mongodb+srv` format is relatively new and becoming more common, but for educational purposes, let's look at a "traditional" connection string:

`mongodb://<username>:<password>@devcert-shard-00-00.6ngvd.mongodb.net:27017,devcert-shard-00-01.6ngvd.mongodb.net:27017,devcert-shard-00-02.6ngvd.mongodb.net:27017/?ssl=true&replicaSet=atlas-vnu7kz-shard-0&authSource=admin&retryWrites=true&w=majority`

Things to note:
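- Each host and port is listed explicitly when not using the `+srv` format
- More options are required, notably `ssl=true` and `replicaSet=atlas-vnu7kz-shard-0`. With a `+srv` connection string these are implied: TLS is enabled by default, and options such as `replicaSet` and `authSource` can be supplied through a DNS TXT record.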
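To make the "each host and port is exposed" point concrete, here is a small sketch, assuming Python 3 and placeholder credentials, that extracts the seed list from the traditional connection string with plain string handling. It is illustrative only and is not how the driver parses the URI:

```python
from urllib.parse import urlsplit

# The traditional connection string from above, with placeholder credentials.
uri = (
    "mongodb://user:secret@"
    "devcert-shard-00-00.6ngvd.mongodb.net:27017,"
    "devcert-shard-00-01.6ngvd.mongodb.net:27017,"
    "devcert-shard-00-02.6ngvd.mongodb.net:27017"
    "/?ssl=true&replicaSet=atlas-vnu7kz-shard-0&authSource=admin"
    "&retryWrites=true&w=majority"
)

netloc = urlsplit(uri).netloc
host_list = netloc.rpartition("@")[2]  # drop the "user:secret@" prefix
seeds = [tuple(entry.split(":")) for entry in host_list.split(",")]

for host, port in seeds:
    print(host, port)
```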

## Driver Behavior in Contrast to the Shell
Using the shell, a single user expects an interactive experience.
An application that uses MongoDB via the driver has very different access patterns:
- Often, more than one user is using the application at the same time
- Interactions with the database are ephemeral

So, the way that drivers are used differs slightly from the shell.

Let's take a look at an example of pymongo use to demonstrate this:

In [7]:
client = pymongo.MongoClient(MDB_AUTH_URI)

#the try block
try:
    
    
    db = client['sample_mflix']
    collection = db['movies']
    print(collection.find_one())
    
    
except pymongo.errors.ConnectionFailure as e:
    print("Could not connect to MongoDB:" + str(e))
except Exception as e:
    print("An error occurred:"+ str(e))
finally:
    client.close()


{u'plot': u'A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.', u'genres': [u'Short', u'Western'], u'runtime': 11, u'title': u'The Great Train Robbery', u'num_mflix_comments': 0, u'poster': u'https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI@._V1_SY1000_SX677_AL_.jpg', u'countries': [u'USA'], u'fullplot': u"Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.", u'languages': [u'English'], u'cast': [u'A.C. Abadie', u"Gilbert M. 'Broncho Billy' Anderson", u'George Barnes', u'Justus D. Barnes'], u'directors': [u'Edwin S. Porter'], u'rated': u'TV-G', u'awards': {u'wins': 1, u'nominations': 0, u'text': u'1 win.'}, u'lastupdated': 

In the above code snippet, we use the imported `pymongo` module to create a client for the database and assign it to the variable `client` with `client = pymongo.MongoClient(MDB_AUTH_URI)`.

This is an instance of `MongoClient`. Basic `MongoClient` methods may show up on the certification test, so they are worth knowing.

Then, in the try block, 3 lines are used to make a call against a collection:
1. `db = client['sample_mflix']` <-- this creates a variable `db` that references the "sample_mflix" database
2. `collection = db['movies']` <-- this creates a reference to the collection "movies"
3. `print(collection.find_one())` <-- this is the pymongo implementation of findOne() in the shell which returns a document, not a cursor.

Note that the returned object that was printed is not a Python string. It is a Python `dict` that represents the MongoDB document.
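Because PyMongo hands back each document as a Python `dict` by default, fields are read with ordinary dictionary access. The sketch below uses a trimmed, hand-copied stand-in for the `find_one()` result so that it runs without a database connection. The variable name `movie` is illustrative:

```python
# Stand-in for the value returned by collection.find_one(): a trimmed copy
# of the printed output above, so no database connection is needed.
movie = {
    "title": "The Great Train Robbery",
    "runtime": 11,
    "genres": ["Short", "Western"],
    "awards": {"wins": 1, "nominations": 0, "text": "1 win."},
}

print(type(movie).__name__)     # dict
print(movie["title"])           # top-level field
print(movie["genres"][0])       # element of an array field
print(movie["awards"]["wins"])  # field of an embedded document
print(movie.get("metacritic"))  # .get() returns None instead of raising KeyError
```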

## Drivers & Connection Parameters:
Take a look at this:

In [22]:
client = pymongo.MongoClient(
    host=MDB_AUTH_URI,
    w=1,
    maxPoolSize=50
)

#the try block
try:
    
    
    db = client['sample_mflix']
    collection = db['movies']
    print(collection.find_one())
    
    
except pymongo.errors.ConnectionFailure as e:
    print("Could not connect to MongoDB:" + str(e))
except Exception as e:
    print("An error occurred:"+ str(e))
finally:
    client.close()

{u'plot': u'A group of bandits stage a brazen train hold-up, only to find a determined posse hot on their heels.', u'genres': [u'Short', u'Western'], u'runtime': 11, u'title': u'The Great Train Robbery', u'num_mflix_comments': 0, u'poster': u'https://m.media-amazon.com/images/M/MV5BMTU3NjE5NzYtYTYyNS00MDVmLWIwYjgtMmYwYWIxZDYyNzU2XkEyXkFqcGdeQXVyNzQzNzQxNzI@._V1_SY1000_SX677_AL_.jpg', u'countries': [u'USA'], u'fullplot': u"Among the earliest existing films in American cinema - notable as the first film that presented a narrative story to tell - it depicts a group of cowboy outlaws who hold up a train and rob the passengers. They are then pursued by a Sheriff's posse. Several scenes have color included - all hand tinted.", u'languages': [u'English'], u'cast': [u'A.C. Abadie', u"Gilbert M. 'Broncho Billy' Anderson", u'George Barnes', u'Justus D. Barnes'], u'directors': [u'Edwin S. Porter'], u'rated': u'TV-G', u'awards': {u'wins': 1, u'nominations': 0, u'text': u'1 win.'}, u'lastupdated': 

The only difference between this example and the previous one is how `pymongo.MongoClient` is called: options are now passed as keyword arguments.

Note how the `w` and `maxPoolSize` options were set. `w` refers to write concern. Wait a sec... My connection string has a write concern configured (`w=majority`) and the `MongoClient` call passes a different one (`w=1`)!! What gives? Well, the option passed to `MongoClient` as a keyword argument overrides the one in the connection string. PyMongo applies the same precedence to any option that is supplied both ways.

Generally, it is safe to assume most parameters can be configured at the connection string level, the connection level, and even the query level.
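The precedence rule can be pictured as a dictionary merge in which client-level keyword arguments win over connection string options. The `effective_options` helper below is hypothetical and written only to mirror that rule, assuming Python 3 and placeholder credentials. It is not how PyMongo parses or validates options:

```python
from urllib.parse import urlsplit, parse_qsl


def effective_options(uri, **client_kwargs):
    """Merge URI options with keyword arguments; keyword arguments win.

    Hypothetical helper that mirrors the precedence rule only.
    """
    merged = dict(parse_qsl(urlsplit(uri).query))
    merged.update({key: str(value) for key, value in client_kwargs.items()})
    return merged


opts = effective_options(
    "mongodb+srv://user:secret@devcert.6ngvd.mongodb.net/?retryWrites=true&w=majority",
    w=1,
    maxPoolSize=50,
)
print(opts)
```

Here `w` ends up as `1` even though the connection string asked for `majority`, while `retryWrites` is carried over unchanged.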

What about `maxPoolSize`? Well that takes us to...

## Connection Pools

We talked briefly earlier about the difference in behavior between the shell and a driver. One difference is around "connections." When a user or an app connects to MongoDB, it needs to create a _connection_, which is essentially one TCP connection to the port that MongoDB is listening on.

There is a lot of functionality hidden behind the first line, `client = pymongo.MongoClient(MDB_AUTH_URI)`. It actually creates something called a **Connection Pool**, which is a pool of cached connections _on the driver_ that are reused when handling many concurrent requests to the database. This can be configured in the options (at the connection or connection string level).

In our example we set the maximum number of connections to `50`. The default is `100`.
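To build intuition for what a maximum pool size does, here is a toy pool, assuming Python 3. It is a sketch of the general idea only and not PyMongo's implementation: the "connections" are plain integers, they are all created up front, and `ToyPool` and `fake_query` are made-up names. Twenty worker threads share five connections, and the pool never hands out more than five at once:

```python
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyPool:
    """A bounded pool of reusable 'connections' (plain integers here)."""

    def __init__(self, max_pool_size):
        self._idle = queue.Queue()
        for connection_id in range(max_pool_size):
            self._idle.put(connection_id)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak_in_use = 0

    def run(self, work):
        connection = self._idle.get()  # blocks while every connection is busy
        with self._lock:
            self.in_use += 1
            self.peak_in_use = max(self.peak_in_use, self.in_use)
        try:
            return work(connection)
        finally:
            with self._lock:
                self.in_use -= 1
            self._idle.put(connection)  # hand the connection back for reuse


def fake_query(connection):
    time.sleep(0.01)  # pretend to wait on the database
    return connection


pool = ToyPool(max_pool_size=5)
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(lambda _: pool.run(fake_query), range(40)))

print(len(results), pool.peak_in_use)
```

All 40 requests complete, but they are served by at most five connections, which is the trade-off a pool size limit controls.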

**Testing Tip**
- Knowing all of the options is less important for the developer exam than knowing how to _use_ them, i.e. recognizing valid configurations of connection strings, connections, and queries

## Drivers and CRUD

The final aspect of applying our MongoDB knowledge to drivers is translating our previous conversations about MQL and the aggregation framework to the specific driver.

**Again, this demonstration is in Python but EACH language will have its own version of this**

We've already talked about sessions and connections. We also have to consider:
- driver specific syntax (findOne v find_one, capitalization, quotes, etc)
- data types and typing
- iterating over a cursor
- async versus sync operations

Let's look at an example:

In [50]:
### ~~~~~~~~~~~ COMMENT 1 ~~~~~~~~~~~~ ###
from datetime import datetime
from bson import ObjectId
client = pymongo.MongoClient(MDB_AUTH_URI)

### ~~~~~~~~~~~ COMMENT 2 ~~~~~~~~~~~~ ###
doc_to_insert = {
    "a":1,
    "example":True
}


try:

    db = client['testDb']
    collection = db['python']
    
    
### ~~~~~~~~~~~ COMMENT 3 ~~~~~~~~~~~~ ###
    collection.delete_many({})  # an empty filter matches every document
    
### ~~~~~~~~~~~ COMMENT 4 ~~~~~~~~~~~~ ###
    insertResult = collection.insert_one(doc_to_insert)
    print("INSERTED ID IS "+str(insertResult.inserted_id))
    
### ~~~~~~~~~~~ COMMENT 5 ~~~~~~~~~~~~ ###
    updateResult1 = collection.update_many({"_id":str(insertResult.inserted_id)},{"$set":{"updated": datetime.now()}})
    updateResult2 = collection.update_many({"_id":ObjectId(str(insertResult.inserted_id))},{"$set":{"updated": datetime.now()}})
    
    print("first update updated " + str(updateResult1.modified_count) + " documents")
    print("second update updated " + str(updateResult2.modified_count) + " documents")
    
### ~~~~~~~~~~~ COMMENT 6 ~~~~~~~~~~~~ ###
    findResults1 = collection.find({})
    for document in findResults1:
        print(document)
        
    findResults2 = list(collection.find({}))
    print(str(len(findResults2)))
    
    
except pymongo.errors.ConnectionFailure as e:
    print("Could not connect to MongoDB:" + str(e))
except Exception as e:
    print("An error occurred:"+ str(e))
finally:
    client.close()


INSERTED ID IS 64ff99d277444e6fd4b71e6f
first update updated 0 documents
second update updated 1 documents
{u'a': 1, u'updated': datetime.datetime(2023, 9, 11, 18, 50, 58, 513000), u'_id': ObjectId('64ff99d277444e6fd4b71e6f'), u'example': True}
1


## Notes and Reflections from the Above Example

**Comment 1:** This shows how Python deals with data typing in MongoDB.
- `datetime` is a standard Python module, and the PyMongo driver stores Python datetimes as BSON dates, shown as ISODates in the shell (remember BSON types from session 1?)
- `bson` is a package that is installed when installing the PyMongo driver. In this case, I use it later to query a document based on an `ObjectId` property.

**Comment 2:** This block shows how a _document_ is created in Python. A Python dictionary is essentially saved as a document. Generally, for all languages, the standard "object" is what is stored to MongoDB: JavaScript objects, POJOs, Python dicts, Go structs, etc. **This is a core feature of the database.** Also notice that `True` has a capital T; in JavaScript, it is `true`. These are the language-specific changes that might come up. Another Python/JavaScript difference is that Python requires the key to be wrapped in quotation marks, whereas JS does not.

**Comment 3:** This is a standard `delete_many` with an empty filter (`{}`), which deletes every document and resets the demo. Note that Python method names use `snake_case`, not `camelCase` like the shell and JavaScript.

**Comment 4:** A few things to note here. First, this is a fairly typical pattern for writes: a variable is assigned the output of an insert/update/delete. The call performs that action against the database, and the variable becomes a `result object` holding metadata about the write operation (whether it succeeded, the ids of inserted documents, etc). Also, a note on timing. The PyMongo calls used here are synchronous, which means the assignment of the variable `insertResult` _waits_ for the call to the database to complete. **The Node.js driver is asynchronous. So if we ran a similar function in JS, the variable `insertResult` would be a JavaScript `Promise`. The developer needs to use callbacks, async/await, etc. to deal with this.**

**Comment 5:** Basically, the point here is that a string doesn't match an `ObjectId`. The first update doesn't match anything for that reason. So I use the `bson` library's `ObjectId` to filter on the real `ObjectId` value. The update statement itself is essentially the same as any update statement in the shell.

**Comment 6:** This is all about _cursor iteration_. It shows 2 ways that Python can access documents from a cursor (remember `find()` returns a `cursor`, not documents).
1. `for <document> in <cursor>` allows the application to iterate through a cursor, performing actions on each document returned
2. `list(<cursor>)` is another (less common) way to exhaust a cursor. It tells the application to take the result set and create a list (array) out of that result set. **This can be a heavy operation for the app on a large read**

The point is, each language has its own ways for iterating over a cursor. Knowing them is important for the test.
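The difference between the two approaches can be seen without a database by using a Python generator as a stand-in for a cursor. This is only an analogy, assuming Python 3: `fake_cursor` is a made-up helper, and a real PyMongo cursor also fetches documents from the server in batches, which the sketch does not model:

```python
def fake_cursor(batch):
    """Yield documents one at a time, like iterating over a cursor."""
    for document in batch:
        yield document


documents = [{"a": 1}, {"a": 2}, {"a": 3}]

# 1. Iterate: only one document is handled at a time.
total = 0
for document in fake_cursor(documents):
    total += document["a"]

# 2. Materialize: every document is held in memory at once.
as_list = list(fake_cursor(documents))

# A pass consumes the stream: a second pass over the same object is empty.
cursor = fake_cursor(documents)
first_pass = list(cursor)
second_pass = list(cursor)

print(total, len(as_list), len(first_pass), len(second_pass))
```

The `for` loop keeps memory use flat regardless of result size, while `list(...)` grows with the number of documents returned.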

## Drivers and Aggregations

You can also perform aggregations via the driver. The syntax should be fairly familiar:

In [8]:
client = pymongo.MongoClient(MDB_AUTH_URI)
try:
    
    pipeline = [
        {
            "$match": {
                "year": 1991
            }
        },
        {
            "$unwind": "$countries"
        },
        {
            "$group": {
                "_id": "$countries",
                "average_runtime": {
                    "$avg": "$runtime"
                }
            }
        },
        {"$limit":3}
    ]
    
    db = client['sample_mflix']
    collection = db['movies']
    aggResult = list(collection.aggregate(pipeline))
    
    for document in aggResult:
        print(document)
    
except pymongo.errors.ConnectionFailure as e:
    print("Could not connect to MongoDB:" + str(e))
except Exception as e:
    print("An error occurred:"+ str(e))
finally:
    client.close()


{u'average_runtime': 98.0, u'_id': u'Mexico'}
{u'average_runtime': 96.33333333333333, u'_id': u'Denmark'}
{u'average_runtime': 87.0, u'_id': u'Sweden'}
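To see what each stage of that pipeline does, the same steps can be written in plain Python over a handful of in-memory documents. This is a simplified sketch, assuming Python 3: the three documents are invented (not real `sample_mflix` data), and it ignores details such as how `$avg` treats missing or non-numeric values:

```python
from collections import defaultdict

# Invented documents shaped like `movies` documents; not real sample data.
movies = [
    {"year": 1991, "countries": ["USA", "France"], "runtime": 100},
    {"year": 1991, "countries": ["USA"], "runtime": 90},
    {"year": 1990, "countries": ["USA"], "runtime": 300},
]

# $match: keep only documents from 1991.
matched = [m for m in movies if m["year"] == 1991]

# $unwind: emit one (country, runtime) pair per element of `countries`.
unwound = [(country, m["runtime"]) for m in matched for country in m["countries"]]

# $group with $avg: collect runtimes per country, then average them.
runtimes = defaultdict(list)
for country, runtime in unwound:
    runtimes[country].append(runtime)
averages = [
    {"_id": country, "average_runtime": sum(values) / len(values)}
    for country, values in runtimes.items()
]

# $limit: keep at most three groups.
print(averages[:3])
```

In the real pipeline all of this work happens on the server, so only the final grouped documents travel back to the application.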


... So that's pretty much it when it comes to taking the knowledge that we have learned in the past sessions and applying it to drivers.

## Data Modelling

We now know how to use the MongoDB language, and drivers, etc to perform operations against the database. As a "real" developer, this is just the beginning. **The point of learning MongoDB is to use it to create solutions**. To that end, a developer often has to choose how they relate the problem they are trying to solve to a data model.

We have learned how flexible MongoDB is when it comes to data modelling. In the relational world, the heuristic that developers use is **third normal form** (3NF). With MongoDB, there are many ways to model data, so there is no single "rule" like 3NF. Instead, there are guidelines:

- Data that is accessed together should be stored together
- The "N" of 1:N and N:N relations _matters_
- Consider the fields that are used to filter and fields that are often modified

**Embedding versus Linking**

In our previous sessions, we were also exposed to the aggregation framework `$lookup` stage, and many were surprised to find out that MongoDB _does_ have the ability to perform operations like `SQL JOINs`.

So a fundamental question when designing a data model is whether to **embed**:
```
// people collection
{
    _id:1,
    name:"Winston",
    cars:[{year:2023,model:"Tesla Model Y"},{year:1963,model:"Pontiac Firebird"}]
}
```

or **link**:
```
// people collection
{
    _id:1,
    name:"Winston"
}


// cars collection
{
    _id:2,
    year:2023,
    model:"Tesla Model Y",
    ownerID:1
},
{
    _id:3,
    year:1963,
    model:"Pontiac Firebird",
    ownerID:1
}
```

In the **embedding** paradigm, 1:N relationships are modelled via arrays within a document. When **linking**, 1:N relationships are modelled via a foreign key relationship in a separate collection.
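The two shapes hold the same information, which a few lines of Python can make explicit. The sketch below, assuming Python 3, represents the linked collections as plain lists of dicts and folds them into the embedded shape. It is roughly the work a `$lookup` would otherwise do at query time. `embed_cars` is a made-up helper for illustration:

```python
# Linked model: two collections, represented here as plain lists of dicts.
people = [{"_id": 1, "name": "Winston"}]
cars = [
    {"_id": 2, "year": 2023, "model": "Tesla Model Y", "ownerID": 1},
    {"_id": 3, "year": 1963, "model": "Pontiac Firebird", "ownerID": 1},
]


def embed_cars(people, cars):
    """Fold each person's linked cars into an embedded `cars` array."""
    embedded = []
    for person in people:
        owned = [
            {"year": car["year"], "model": car["model"]}
            for car in cars
            if car["ownerID"] == person["_id"]
        ]
        embedded.append(dict(person, cars=owned))
    return embedded


embedded_people = embed_cars(people, cars)
print(embedded_people[0])
```

With the embedded shape this join is paid once at write time, whereas with the linked shape it is paid on every read that needs both people and cars.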

_Neither of these paradigms is wrong._ It depends on the nature of the use case:

**Embedding** would support highly efficient computing for the following types of use cases:
- Using `name` as a filter, and returning the associated `cars`
- Situations in which people rarely have a lot of cars (unbounded arrays are bad)
- Generally this has high performance reads, potentially at the expense of writes

**Linking** would support the following types of use cases:
- Frequent updates on `car` data (for instance, if there was a field "mileage" which is often updated)
- Queries based on car parameters in which the owner of the car is the desired output
- Generally this has balanced read/write performance.

The `sample_mflix` sample dataset has an example of how a developer might reason about embedding versus linking. The `cast` and `genres` fields are embedded inside documents in the `movies` collection. There is also a separate collection for `comments`. This makes sense because a movie will rarely have more than 10 actors or associated genres. However, there could be an effectively limitless number of comments, and these comments might be frequently edited.

other things:
- shell toArray
- getSiblingDB
- findAnd_
- atlas Data Explorer
- search indexes