# MongoDB Examples

This notebook uses the Titanic data introduced in the RelationalDB notebook, plus some Twitter data from #bbcnews collected on 10 and 11 January 2023.

Activity 13.2 introduces *Seven Databases in Seven Weeks* (Redmond 2012).

The most common NoSQL databases introduced are:

- Riak - key-value
- HBase - wide column
- MongoDB - document
- CouchDB - document
- Neo4j - graph
- Redis - key-value

This notebook will look at the MongoDB NoSQL document database.

In [1]:
# Import the required libraries

import pymongo
import datetime
import collections

import pandas as pd
# better for printing JSON data: p(retty)print
from pprint import pprint

# Print out the version of pymongo
print(pymongo.version)

3.12.0


In [2]:
#SET DATABASE CONNECTION STRINGS
MONGOHOST='localhost'
MONGOPORT=27017
MONGOCONN='mongodb://{MONGOHOST}:{MONGOPORT}/'.format(MONGOHOST=MONGOHOST,MONGOPORT=MONGOPORT)

In [3]:
# MongoDB version
! mongod --version

db version v4.4.8
Build Info: {
    "version": "4.4.8",
    "gitVersion": "83b8bb8b6b325d8d8d3dfd2ad9f744bdad7d6ca0",
    "openSSLVersion": "OpenSSL 1.1.1f  31 Mar 2020",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "ubuntu2004",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}


In [4]:
client = pymongo.MongoClient(MONGOCONN)

In [5]:
# Drop the existing titanic database. Unlike SQL, the command will not generate an error if it does not already exist
client.drop_database('titanicDB')
client.list_database_names()

['accidents', 'admin', 'config', 'local', 'mongoExample', 'twitterDB']

Import the Titanic data 

In [6]:
! mongoimport --db titanicDB --type=csv --headerline --file data/train.csv --collection training
! mongoimport --db titanicDB --type=csv --headerline --file data/test.csv --collection testset
! mongoimport --db titanicDB --type=csv --headerline --file data/gender_submission.csv --collection gender_submission

2023-01-12T20:51:26.977+0000	connected to: mongodb://localhost/
2023-01-12T20:51:27.000+0000	891 document(s) imported successfully. 0 document(s) failed to import.
2023-01-12T20:51:27.190+0000	connected to: mongodb://localhost/
2023-01-12T20:51:27.205+0000	418 document(s) imported successfully. 0 document(s) failed to import.
2023-01-12T20:51:27.393+0000	connected to: mongodb://localhost/
2023-01-12T20:51:27.405+0000	418 document(s) imported successfully. 0 document(s) failed to import.


In [7]:
# Check the database has been added
client.list_database_names()

['accidents',
 'admin',
 'config',
 'local',
 'mongoExample',
 'titanicDB',
 'twitterDB']

In [8]:
# titanicDB is a database that contains 3 collections (similar to tables)
db = client.titanicDB

# setup variables for the various collections
training = db.training
testset = db.testset
gender_submission = db.gender_submission

In [9]:
# Show one record - can be any one from the collection
training.find_one()

{'_id': ObjectId('63c072ce73246b274e3e29cd'),
 'PassengerId': 1,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Cabin': '',
 'Embarked': 'S'}

The variables save us from writing db.collectionName.function() in each query, although that form still works, as seen below.

Be careful if you swap databases in the same notebook, as we do later: you could end up referencing a collection in the wrong database. MongoDB will not report this as an error; it treats the missing collection as empty and returns nothing, a consequence of a schemaless database.
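
One way to guard against querying a collection that does not exist is to check its name against `list_collection_names()` first. The sketch below is only an illustration: `get_existing_collection` is a hypothetical helper, not part of pymongo, and it assumes `db` is a pymongo `Database` (or any object offering the same two operations).

```python
def get_existing_collection(db, name):
    """Return db[name], but raise if the collection is not in the database.

    Hypothetical helper: pymongo on its own would silently hand back an
    empty collection for an unknown (or mis-cased) name.
    """
    existing = db.list_collection_names()
    if name not in existing:
        raise KeyError(f"No collection {name!r}; found: {sorted(existing)}")
    return db[name]
```

With the Titanic database selected, `get_existing_collection(db, 'training')` would be expected to return the collection, while `get_existing_collection(db, 'Training')` would raise a `KeyError` rather than quietly returning nothing.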

In [10]:
db.training.find_one()

{'_id': ObjectId('63c072ce73246b274e3e29cd'),
 'PassengerId': 1,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Cabin': '',
 'Embarked': 'S'}

In [11]:
testset.find_one()

{'_id': ObjectId('63c072cfabda8afc4c88d14a'),
 'PassengerId': 893,
 'Pclass': 3,
 'Name': 'Wilkes, Mrs. James (Ellen Needs)',
 'Sex': 'female',
 'Age': 47,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 363272,
 'Fare': 7,
 'Cabin': '',
 'Embarked': 'S'}

In [12]:
gender_submission.find_one()

{'_id': ObjectId('63c072cf47fc72847e329f70'),
 'PassengerId': 905,
 'Survived': 0}

# Querying 

MongoDB data is stored as JSON-like documents, which means it uses the format of *{key: value}* for most things.

The *find()* function is the equivalent of the SQL SELECT statement.

Instead of a *WHERE* clause, you provide a JSON-style document (a Python dictionary in pymongo) describing what you want to find.

For example, the following is the equivalent of *SELECT * FROM training WHERE sex = 'male';*

In [13]:
# find() returns a Cursor instead of displaying the results (in Python)
training.find({"Sex" : "male"})

<pymongo.cursor.Cursor at 0x7f85f54d0a90>

In [14]:
# create some functions to print the individual documents
# using pretty print
def printDocs(documents):
    for doc in documents:
        pprint(doc)

# ordinary print
def printDoc(documents):
    for doc in documents:
        print(doc)

In [15]:
# This means an iterator is needed to display the results
# find the male passengers
docs = training.find({"Sex" : "male"})
printDoc(docs)

{'_id': ObjectId('63c072ce73246b274e3e29cd'), 'PassengerId': 1, 'Survived': 0, 'Pclass': 3, 'Name': 'Braund, Mr. Owen Harris', 'Sex': 'male', 'Age': 22, 'SibSp': 1, 'Parch': 0, 'Ticket': 'A/5 21171', 'Fare': 7.25, 'Cabin': '', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e29d0'), 'PassengerId': 5, 'Survived': 0, 'Pclass': 3, 'Name': 'Allen, Mr. William Henry', 'Sex': 'male', 'Age': 35, 'SibSp': 0, 'Parch': 0, 'Ticket': 373450, 'Fare': 8.05, 'Cabin': '', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e29d1'), 'PassengerId': 6, 'Survived': 0, 'Pclass': 3, 'Name': 'Moran, Mr. James', 'Sex': 'male', 'Age': '', 'SibSp': 0, 'Parch': 0, 'Ticket': 330877, 'Fare': 8.4583, 'Cabin': '', 'Embarked': 'Q'}
{'_id': ObjectId('63c072ce73246b274e3e29d2'), 'PassengerId': 7, 'Survived': 0, 'Pclass': 1, 'Name': 'McCarthy, Mr. Timothy J', 'Sex': 'male', 'Age': 54, 'SibSp': 0, 'Parch': 0, 'Ticket': 17463, 'Fare': 51.8625, 'Cabin': 'E46', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e

{'_id': ObjectId('63c072ce73246b274e3e2cb4'), 'PassengerId': 746, 'Survived': 0, 'Pclass': 1, 'Name': 'Crosby, Capt. Edward Gifford', 'Sex': 'male', 'Age': 70, 'SibSp': 1, 'Parch': 1, 'Ticket': 'WE/P 5735', 'Fare': 71, 'Cabin': 'B22', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e2cb5'), 'PassengerId': 744, 'Survived': 0, 'Pclass': 3, 'Name': 'McNamee, Mr. Neal', 'Sex': 'male', 'Age': 24, 'SibSp': 1, 'Parch': 0, 'Ticket': 376566, 'Fare': 16.1, 'Cabin': '', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e2cb6'), 'PassengerId': 745, 'Survived': 1, 'Pclass': 3, 'Name': 'Stranden, Mr. Juho', 'Sex': 'male', 'Age': 31, 'SibSp': 0, 'Parch': 0, 'Ticket': 'STON/O 2. 3101288', 'Fare': 7.925, 'Cabin': '', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e2cb7'), 'PassengerId': 749, 'Survived': 0, 'Pclass': 1, 'Name': 'Marvin, Mr. Daniel Warner', 'Sex': 'male', 'Age': 19, 'SibSp': 1, 'Parch': 0, 'Ticket': 113773, 'Fare': 53.1, 'Cabin': 'D30', 'Embarked': 'S'}
{'_id': ObjectId(
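
Printing every matching document quickly fills the screen. A minimal sketch for showing only the first few results is given below, using `itertools.islice`, which works on any iterable, including a pymongo cursor. The name `first_docs` and the stand-in data are made up for this illustration.

```python
from itertools import islice

def first_docs(documents, n=5):
    """Return at most the first n documents from any iterable (e.g. a Cursor)."""
    return list(islice(documents, n))

# Stand-in data so the sketch runs without a database connection
sample = [{"PassengerId": i, "Sex": "male"} for i in range(1, 21)]
for doc in first_docs(sample, 3):
    print(doc)
```

With a live connection, `first_docs(training.find({"Sex": "male"}), 3)` should behave the same way. pymongo cursors also offer `limit()`, as in `training.find({"Sex": "male"}).limit(3)`, which asks the server to return fewer documents in the first place.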

In [16]:
# alternatively use a dataframe to make it more like a relational table
# find the female passengers embarking in Southampton
pd.DataFrame(training.find({"Sex" : "female", "Embarked" : "S"}))

Unnamed: 0,_id,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,63c072ce73246b274e3e29ce,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.9250,,S
1,63c072ce73246b274e3e29cf,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1000,C123,S
2,63c072ce73246b274e3e29d4,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
3,63c072ce73246b274e3e29d6,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7000,G6,S
4,63c072ce73246b274e3e29d9,15,0,3,"Vestrom, Miss. Hulda Amanda Adolfina",female,14,0,0,350406,7.8542,,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...
198,63c072ce73246b274e3e2d33,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47,1,1,11751,52.5542,D35,S
199,63c072ce73246b274e3e2d3d,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25,0,1,230433,26.0000,,S
200,63c072ce73246b274e3e2d3f,883,0,3,"Dahlberg, Miss. Gerda Ulrika",female,22,0,0,7552,10.5167,,S
201,63c072ce73246b274e3e2d42,888,1,1,"Graham, Miss. Margaret Edith",female,19,0,0,112053,30.0000,B42,S


In [17]:
# Or can convert the Cursor to a list
# Find the 1st class female passengers who survived (0 = No, 1 = Yes)
list(training.find({"Pclass" : 1, "Sex" : "female", "Survived" : 1}))

[{'_id': ObjectId('63c072ce73246b274e3e29cf'),
  'PassengerId': 4,
  'Survived': 1,
  'Pclass': 1,
  'Name': 'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
  'Sex': 'female',
  'Age': 35,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 113803,
  'Fare': 53.1,
  'Cabin': 'C123',
  'Embarked': 'S'},
 {'_id': ObjectId('63c072ce73246b274e3e29d5'),
  'PassengerId': 2,
  'Survived': 1,
  'Pclass': 1,
  'Name': 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
  'Sex': 'female',
  'Age': 38,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 'PC 17599',
  'Fare': 71.2833,
  'Cabin': 'C85',
  'Embarked': 'C'},
 {'_id': ObjectId('63c072ce73246b274e3e29df'),
  'PassengerId': 12,
  'Survived': 1,
  'Pclass': 1,
  'Name': 'Bonnell, Miss. Elizabeth',
  'Sex': 'female',
  'Age': 58,
  'SibSp': 0,
  'Parch': 0,
  'Ticket': 113783,
  'Fare': 26.55,
  'Cabin': 'C103',
  'Embarked': 'S'},
 {'_id': ObjectId('63c072ce73246b274e3e29e9'),
  'PassengerId': 32,
  'Survived': 1,
  'Pclass': 1,
  'Name': 'Spencer, Mrs. Will

In [18]:
# find_one() also accepts search criteria
training.find_one({"Sex" : "female"})

{'_id': ObjectId('63c072ce73246b274e3e29ce'),
 'PassengerId': 3,
 'Survived': 1,
 'Pclass': 3,
 'Name': 'Heikkinen, Miss. Laina',
 'Sex': 'female',
 'Age': 26,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': 'STON/O2. 3101282',
 'Fare': 7.925,
 'Cabin': '',
 'Embarked': 'S'}
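
The filters so far test only for equality. Other SQL comparisons map onto query operators such as `$gte`, `$lt` and `$in`, written as nested documents. The sketch below builds such a filter as a plain Python dictionary; the helper name `age_range_filter` and the chosen age range and ports are illustrative assumptions.

```python
def age_range_filter(min_age, max_age, ports):
    """Build a filter roughly equivalent to:
    WHERE Age >= min_age AND Age < max_age AND Embarked IN (ports)
    """
    return {
        "Age": {"$gte": min_age, "$lt": max_age},
        "Embarked": {"$in": list(ports)},
    }

query = age_range_filter(18, 30, ["S", "C"])
print(query)
```

The resulting dictionary can be passed to `find()` or `count_documents()`, for example `training.count_documents(query)`.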

# Data Dictionary



One consequence of being schemaless is that there are no conventional data dictionary tables to check whether a collection or key name exists. This means that MongoDB will not generate an error message if either does not exist. Note that the names are all case sensitive.

Why will the following return no records?

In [19]:
training.find_one({"passengerid" : 500})

In [20]:
db.Training.find_one({"PassengerId" : 500})

But Python will generate an error message if it cannot find a variable or function:

In [21]:
Training.find_one({"passengerid" : 500})

NameError: name 'Training' is not defined

In [22]:
training.find_One({"passengerid" : 500})

TypeError: 'Collection' object is not callable. If you meant to call the 'find_One' method on a 'Collection' object it is failing because no such method exists.

In [23]:
# Let's find our passenger
training.find_one({"PassengerId" : 500})

{'_id': ObjectId('63c072ce73246b274e3e2bbe'),
 'PassengerId': 500,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Svensson, Mr. Olof',
 'Sex': 'male',
 'Age': 24,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': 350035,
 'Fare': 7.7958,
 'Cabin': '',
 'Embarked': 'S'}

There may not be a data dictionary collection to query, but you can find the keys in a collection, which are similar to the column names in a relational database. Be aware though, that the structure can vary from document to document in a given collection.


In [24]:
training.find_one().keys()

dict_keys(['_id', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])

For example, we may decide to remove the Age key from any documents where the Age is blank.

In [25]:
training.find_one({"Age": ""})

{'_id': ObjectId('63c072ce73246b274e3e29d1'),
 'PassengerId': 6,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Moran, Mr. James',
 'Sex': 'male',
 'Age': '',
 'SibSp': 0,
 'Parch': 0,
 'Ticket': 330877,
 'Fare': 8.4583,
 'Cabin': '',
 'Embarked': 'Q'}

In [26]:
# Remove the empty Age keys, but note the document will remain
training.update_many({"Age" : ""}, { "$unset": {"Age" : 1 }});

In [27]:
# Now there are no empty age fields
training.find_one({"Age": ""})

In [28]:
# But there are still other ages available:
printDoc(training.find({"Age": 50}))

{'_id': ObjectId('63c072ce73246b274e3e2a93'), 'PassengerId': 178, 'Survived': 0, 'Pclass': 1, 'Name': 'Isham, Miss. Ann Elizabeth', 'Sex': 'female', 'Age': 50, 'SibSp': 0, 'Parch': 0, 'Ticket': 'PC 17595', 'Fare': 28.7125, 'Cabin': 'C49', 'Embarked': 'C'}
{'_id': ObjectId('63c072ce73246b274e3e2abd'), 'PassengerId': 260, 'Survived': 1, 'Pclass': 2, 'Name': 'Parrish, Mrs. (Lutie Davis)', 'Sex': 'female', 'Age': 50, 'SibSp': 0, 'Parch': 1, 'Ticket': 230433, 'Fare': 26, 'Cabin': '', 'Embarked': 'S'}
{'_id': ObjectId('63c072ce73246b274e3e2af8'), 'PassengerId': 300, 'Survived': 1, 'Pclass': 1, 'Name': 'Baxter, Mrs. James (Helene DeLaudeniere Chaput)', 'Sex': 'female', 'Age': 50, 'SibSp': 0, 'Parch': 1, 'Ticket': 'PC 17558', 'Fare': 247.5208, 'Cabin': 'B58 B60', 'Embarked': 'C'}
{'_id': ObjectId('63c072ce73246b274e3e2b91'), 'PassengerId': 435, 'Survived': 0, 'Pclass': 1, 'Name': 'Silvey, Mr. William Baird', 'Sex': 'male', 'Age': 50, 'SibSp': 1, 'Parch': 0, 'Ticket': 13507, 'Fare': 55.9, 'Cabi

In [29]:
# The consequence of this is that the keys will be slightly different for the records that still contain the Age key
# Without the age:
training.find_one({"PassengerId" : 6}).keys()

dict_keys(['_id', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])

In [30]:
# With the Age key:
training.find_one({"PassengerId" : 7}).keys()

dict_keys(['_id', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'])
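
Since the keys can now differ between documents, it can help to survey which keys appear and how often. The following is a small sketch using `collections.Counter`; the function name `key_counts` and the two stand-in documents are assumptions made for the example.

```python
from collections import Counter

def key_counts(documents):
    """Count how many documents contain each top-level key."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    return counts

# Stand-in documents mirroring passengers 6 (no Age) and 7 (with Age)
docs = [
    {"PassengerId": 6, "Name": "Moran, Mr. James"},
    {"PassengerId": 7, "Name": "McCarthy, Mr. Timothy J", "Age": 54},
]
print(key_counts(docs))
```

Against the live collection this could be called as `key_counts(training.find())`; any key whose count is below the number of documents is missing from some of them. Documents lacking a key can also be found directly with the `$exists` operator, for example `training.count_documents({"Age": {"$exists": False}})`.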

In [31]:
# how many documents in the collection
db.training.count()

891

In [32]:
# count() has been deprecated since pymongo 3.7; use count_documents() instead
db.training.count_documents({})

891

In [33]:
# can access via the index (starts at 0)
training.find()[0]

{'_id': ObjectId('63c072ce73246b274e3e29cd'),
 'PassengerId': 1,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Braund, Mr. Owen Harris',
 'Sex': 'male',
 'Age': 22,
 'SibSp': 1,
 'Parch': 0,
 'Ticket': 'A/5 21171',
 'Fare': 7.25,
 'Cabin': '',
 'Embarked': 'S'}

In [34]:
# second record
training.find()[1]

{'_id': ObjectId('63c072ce73246b274e3e29ce'),
 'PassengerId': 3,
 'Survived': 1,
 'Pclass': 3,
 'Name': 'Heikkinen, Miss. Laina',
 'Sex': 'female',
 'Age': 26,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': 'STON/O2. 3101282',
 'Fare': 7.925,
 'Cabin': '',
 'Embarked': 'S'}

In [35]:
# Last one (avoid naming the variable len, which would shadow Python's built-in len())
last_index = training.count_documents({}) - 1
training.find()[last_index]

{'_id': ObjectId('63c072ce73246b274e3e2d47'),
 'PassengerId': 878,
 'Survived': 0,
 'Pclass': 3,
 'Name': 'Petroff, Mr. Nedelio',
 'Sex': 'male',
 'Age': 19,
 'SibSp': 0,
 'Parch': 0,
 'Ticket': 349212,
 'Fare': 7.8958,
 'Cabin': '',
 'Embarked': 'S'}

Count can be used with queries to count the results, rather than listing them.

In [36]:
training.find({"Age": 22}).count()

27

In [37]:
training.count_documents({"Age": 50})

10

# Part 15: Complex queries and analysis
# Aggregation Pipeline

More complex processing, including grouping, aggregation functions, and data renaming is achieved through MongoDB’s aggregation pipeline.

For example a query can involve several stages:
                                                
First stage: filter out documents that do not match some criterion<br>
Second stage: group those documents<br>
Third stage: select only groups that match another criterion<br>
Fourth stage: group summaries would then be returned to the client<br>

By building up a pipeline in stages, complex data processing tasks can be built from simple components.
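
As a rough illustration of the stages just described, the pipeline below is written as a Python list with one dictionary per stage. The choice of fields (`Survived`, `Pclass`, `Fare`), the threshold of 100 passengers and the final `$sort` are assumptions made for this sketch.

```python
def survivors_by_class_pipeline(min_group_size=100):
    """Build a staged pipeline: filter, group, filter the groups, then sort."""
    return [
        # Stage 1: keep only documents matching a criterion (survivors)
        {"$match": {"Survived": 1}},
        # Stage 2: group those documents by passenger class
        {"$group": {"_id": "$Pclass",
                    "count": {"$sum": 1},
                    "avg_fare": {"$avg": "$Fare"}}},
        # Stage 3: keep only groups matching another criterion
        {"$match": {"count": {"$gte": min_group_size}}},
        # Optional: order the group summaries before they reach the client
        {"$sort": {"_id": 1}},
    ]

pipeline = survivors_by_class_pipeline()
print([list(stage)[0] for stage in pipeline])
```

With the Titanic collections loaded, it would be run with `list(training.aggregate(pipeline))`.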

<img src="pipeline.png">

<img src="pipeline_functions.png">

Further examples can be found in *Notebook 15.3 Introducing aggregation pipelines.*

The examples below and in the practical activities all use small data sets that can be processed locally. With huge datasets, the processing may be spread over many computers to speed it up. Data processing tools (such as the aggregation pipeline and MapReduce) keep the processing close to the data itself, reducing the work required of the client and the amount of data that has to be moved across the network from server to client.

In [38]:
# Equivalent to SELECT COUNT(*) FROM Training;
# Need to group by an _id
pipeline = [
     {"$group": {"_id": 0, "No of passengers": {"$sum": 1}}},
]

list(training.aggregate(pipeline))


[{'_id': 0, 'No of passengers': 891}]

In [39]:
# SELECT embarked, COUNT(*) FROM training GROUP BY embarked;
printDoc(db.training.aggregate( [ { "$group" : { "_id" : "$Embarked", "count": {"$sum": 1} }} ] ))

{'_id': 'C', 'count': 168}
{'_id': 'S', 'count': 644}
{'_id': '', 'count': 2}
{'_id': 'Q', 'count': 77}


In [40]:
# SELECT age, COUNT(*) FROM training GROUP BY age ORDER BY age;
printDoc(db.training.aggregate( [ { "$group" : { "_id" : "$Age", "count": {"$sum": 1} }},
                               { "$sort" : {"_id" : 1}} ] ))


{'_id': None, 'count': 177}
{'_id': 0.42, 'count': 1}
{'_id': 0.67, 'count': 1}
{'_id': 0.75, 'count': 2}
{'_id': 0.83, 'count': 2}
{'_id': 0.92, 'count': 1}
{'_id': 1, 'count': 7}
{'_id': 2, 'count': 10}
{'_id': 3, 'count': 6}
{'_id': 4, 'count': 10}
{'_id': 5, 'count': 4}
{'_id': 6, 'count': 3}
{'_id': 7, 'count': 3}
{'_id': 8, 'count': 4}
{'_id': 9, 'count': 8}
{'_id': 10, 'count': 2}
{'_id': 11, 'count': 4}
{'_id': 12, 'count': 1}
{'_id': 13, 'count': 2}
{'_id': 14, 'count': 6}
{'_id': 14.5, 'count': 1}
{'_id': 15, 'count': 5}
{'_id': 16, 'count': 17}
{'_id': 17, 'count': 13}
{'_id': 18, 'count': 26}
{'_id': 19, 'count': 25}
{'_id': 20, 'count': 15}
{'_id': 20.5, 'count': 1}
{'_id': 21, 'count': 24}
{'_id': 22, 'count': 27}
{'_id': 23, 'count': 15}
{'_id': 23.5, 'count': 1}
{'_id': 24, 'count': 30}
{'_id': 24.5, 'count': 1}
{'_id': 25, 'count': 23}
{'_id': 26, 'count': 18}
{'_id': 27, 'count': 18}
{'_id': 28, 'count': 25}
{'_id': 28.5, 'count': 2}
{'_id': 29, 'count': 20}
{'_id': 3

In [41]:
# SELECT age, COUNT(*) FROM training GROUP BY age HAVING age >= 60 ORDER BY age DESC;
# This pipeline involves 3 stages: $group, $match and $sort
docs = db.training.aggregate( [ { "$group" : { "_id" : "$Age", "count": {"$sum": 1} }},
                                { "$match": { "_id": { "$gte": 60 } }},
                                { "$sort" : {"_id" : -1}} 
                              ] )
printDoc(docs)

{'_id': 80, 'count': 1}
{'_id': 74, 'count': 1}
{'_id': 71, 'count': 2}
{'_id': 70.5, 'count': 1}
{'_id': 70, 'count': 2}
{'_id': 66, 'count': 1}
{'_id': 65, 'count': 3}
{'_id': 64, 'count': 2}
{'_id': 63, 'count': 2}
{'_id': 62, 'count': 4}
{'_id': 61, 'count': 3}
{'_id': 60, 'count': 4}
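
The pipeline above groups every age and then discards the groups below 60, which is closer to a SQL `HAVING` clause. A sketch of the alternative ordering, with `$match` placed first so that fewer documents reach `$group`, is shown below. The function name is hypothetical, and the results should be the same for this query because the filter is on the grouping key.

```python
def older_passengers_pipeline(min_age=60):
    """Filter on Age first, then group and sort (WHERE before GROUP BY)."""
    return [
        {"$match": {"Age": {"$gte": min_age}}},
        {"$group": {"_id": "$Age", "count": {"$sum": 1}}},
        {"$sort": {"_id": -1}},
    ]

pipeline = older_passengers_pipeline()
print(pipeline[0])
```

`printDoc(training.aggregate(pipeline))` would be expected to list the same age groups as before.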


In [42]:
# SELECT sex, pclass, count(*) FROM training GROUP BY sex, pclass ORDER BY count DESC;
list(training.aggregate([
    {"$group" : {"_id":{"sex":"$Sex", "class":"$Pclass"}, "count":{"$sum":1}}},
    {"$sort" : {"count" : -1}} 
]))


[{'_id': {'sex': 'male', 'class': 3}, 'count': 347},
 {'_id': {'sex': 'female', 'class': 3}, 'count': 144},
 {'_id': {'sex': 'male', 'class': 1}, 'count': 122},
 {'_id': {'sex': 'male', 'class': 2}, 'count': 108},
 {'_id': {'sex': 'female', 'class': 1}, 'count': 94},
 {'_id': {'sex': 'female', 'class': 2}, 'count': 76}]
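
Grouping on more than one key returns a compound `_id` sub-document, which is awkward to tabulate. One possible way to flatten each result into a single-level dictionary is sketched below; `flatten_group` is a name invented for this example, and the sample data is copied from the first two results above.

```python
def flatten_group(doc):
    """Merge a compound _id sub-document into the top level of the result."""
    flat = dict(doc["_id"])
    flat.update({k: v for k, v in doc.items() if k != "_id"})
    return flat

# Two of the results shown above
results = [
    {"_id": {"sex": "male", "class": 3}, "count": 347},
    {"_id": {"sex": "female", "class": 3}, "count": 144},
]
rows = [flatten_group(doc) for doc in results]
print(rows)
```

The flattened rows can then be passed to `pd.DataFrame(rows)` to give one column per grouping key.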

# Joins

The EMA_MongoDB_joins notebook found with the TMA02 assessment gives some examples of using joins in MongoDB. As mentioned in that notebook, joining documents was not possible in earlier versions of MongoDB; later versions introduced something similar to a simple left join using the pipeline `$lookup` operator ([Mongo docs](https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/)).

MongoDB provides joins as a stage in the aggregation pipeline.

These examples use the testset and gender_submission collections. You still join on a common field, just as you would join on a common column in a relational database.

In this case *PassengerId* is the common field in both collections.

In [43]:
# SELECT * FROM testset t LEFT OUTER JOIN gender_submission gs ON t.PassengerId = gs.PassengerId
list(testset.aggregate([
   {
     "$lookup":
       {
         "from": "gender_submission",
         "localField": "PassengerId",
         "foreignField": "PassengerId",
         "as": "joined"
       }
  }
]))

[{'_id': ObjectId('63c072cfabda8afc4c88d14a'),
  'PassengerId': 893,
  'Pclass': 3,
  'Name': 'Wilkes, Mrs. James (Ellen Needs)',
  'Sex': 'female',
  'Age': 47,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 363272,
  'Fare': 7,
  'Cabin': '',
  'Embarked': 'S',
  'joined': [{'_id': ObjectId('63c072cf47fc72847e329fb4'),
    'PassengerId': 893,
    'Survived': 1}]},
 {'_id': ObjectId('63c072cfabda8afc4c88d14b'),
  'PassengerId': 894,
  'Pclass': 2,
  'Name': 'Myles, Mr. Thomas Francis',
  'Sex': 'male',
  'Age': 62,
  'SibSp': 0,
  'Parch': 0,
  'Ticket': 240276,
  'Fare': 9.6875,
  'Cabin': '',
  'Embarked': 'Q',
  'joined': [{'_id': ObjectId('63c072cf47fc72847e329fb3'),
    'PassengerId': 894,
    'Survived': 0}]},
 {'_id': ObjectId('63c072cfabda8afc4c88d14c'),
  'PassengerId': 895,
  'Pclass': 3,
  'Name': 'Wirz, Mr. Albert',
  'Sex': 'male',
  'Age': 27,
  'SibSp': 0,
  'Parch': 0,
  'Ticket': 315154,
  'Fare': 8.6625,
  'Cabin': '',
  'Embarked': 'S',
  'joined': [{'_id': ObjectId('63c07

In [44]:
pd.DataFrame(testset.aggregate([
   {
     "$lookup":
       {
         "from": "gender_submission",
         "localField": "PassengerId",
         "foreignField": "PassengerId",
         "as": "joined"
       }
  }
]))

Unnamed: 0,_id,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,joined
0,63c072cfabda8afc4c88d14a,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S,"[{'_id': 63c072cf47fc72847e329fb4, 'PassengerI..."
1,63c072cfabda8afc4c88d14b,894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q,"[{'_id': 63c072cf47fc72847e329fb3, 'PassengerI..."
2,63c072cfabda8afc4c88d14c,895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S,"[{'_id': 63c072cf47fc72847e329fde, 'PassengerI..."
3,63c072cfabda8afc4c88d14d,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S,"[{'_id': 63c072cf47fc72847e329fe2, 'PassengerI..."
4,63c072cfabda8afc4c88d14e,897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S,"[{'_id': 63c072cf47fc72847e329f98, 'PassengerI..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,63c072cfabda8afc4c88d2e7,1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S,"[{'_id': 63c072cf47fc72847e32a10d, 'PassengerI..."
414,63c072cfabda8afc4c88d2e8,1306,1,"Oliva y Ocana, Dona. Fermina",female,39,0,0,PC 17758,108.9,C105,C,"[{'_id': 63c072cf47fc72847e32a10e, 'PassengerI..."
415,63c072cfabda8afc4c88d2e9,1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S,"[{'_id': 63c072cf47fc72847e32a10f, 'PassengerI..."
416,63c072cfabda8afc4c88d2ea,1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S,"[{'_id': 63c072cf47fc72847e32a110, 'PassengerI..."


In [45]:
# Or vice-versa
pd.DataFrame(gender_submission.aggregate([
   {
     "$lookup":
       {
         "from": "testset",
         "localField": "PassengerId",
         "foreignField": "PassengerId",
         "as": "joined"
       }
  }
]))

Unnamed: 0,_id,PassengerId,Survived,joined
0,63c072cf47fc72847e329f70,905,0,"[{'_id': 63c072cfabda8afc4c88d155, 'PassengerI..."
1,63c072cf47fc72847e329f71,907,1,"[{'_id': 63c072cfabda8afc4c88d157, 'PassengerI..."
2,63c072cf47fc72847e329f72,908,0,"[{'_id': 63c072cfabda8afc4c88d159, 'PassengerI..."
3,63c072cf47fc72847e329f73,909,0,"[{'_id': 63c072cfabda8afc4c88d15a, 'PassengerI..."
4,63c072cf47fc72847e329f74,910,1,"[{'_id': 63c072cfabda8afc4c88d15b, 'PassengerI..."
...,...,...,...,...
413,63c072cf47fc72847e32a10d,1305,0,"[{'_id': 63c072cfabda8afc4c88d2e7, 'PassengerI..."
414,63c072cf47fc72847e32a10e,1306,1,"[{'_id': 63c072cfabda8afc4c88d2e8, 'PassengerI..."
415,63c072cf47fc72847e32a10f,1307,0,"[{'_id': 63c072cfabda8afc4c88d2e9, 'PassengerI..."
416,63c072cf47fc72847e32a110,1308,0,"[{'_id': 63c072cfabda8afc4c88d2ea, 'PassengerI..."


418 rows are returned for both joins. This is because `$lookup` returns one output document for each input document, and both collections contain the same number of documents.

In [46]:
print(testset.count_documents({}))
print(gender_submission.count_documents({}))

418
418


Let's create the embark collection, similar to the embark table:
<pre>
CREATE TABLE embark (embarked CHAR(1) PRIMARY KEY, port VARCHAR(15));
INSERT INTO embark VALUES ('C', 'Cherbourg');
INSERT INTO embark VALUES ('Q', 'Queenstown');
INSERT INTO embark VALUES ('S', 'Southampton');
INSERT INTO embark VALUES ('L', 'Liverpool');
</pre>

We then return to the training data set, since it has some passengers whose port of embarkation is not known.

In [47]:
#db = client.titanicDB
embark = db["embark"]

# And add the four ports
embark.insert_many([{'embarked': 'C', 'port': 'Cherbourg'},
                    {'embarked': 'Q', 'port': 'Queenstown'},
                    {'embarked': 'S', 'port': 'Southampton'},
                    {'embarked': 'L', 'port': 'Liverpool'}])

<pymongo.results.InsertManyResult at 0x7f85f4ad5380>

In [48]:
# Include all ports, with a group of associated passengers from the training data
list(embark.aggregate([
   {
     "$lookup":
       {
         "from": "training",
         "localField": "embarked",
         "foreignField": "Embarked",
         "as": "Passengers"
       }
  }
]))

[{'_id': ObjectId('63c072dc2dbff6fb26d5626d'),
  'embarked': 'C',
  'port': 'Cherbourg',
  'Passengers': [{'_id': ObjectId('63c072ce73246b274e3e29d5'),
    'PassengerId': 2,
    'Survived': 1,
    'Pclass': 1,
    'Name': 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Sex': 'female',
    'Age': 38,
    'SibSp': 1,
    'Parch': 0,
    'Ticket': 'PC 17599',
    'Fare': 71.2833,
    'Cabin': 'C85',
    'Embarked': 'C'},
   {'_id': ObjectId('63c072ce73246b274e3e29e1'),
    'PassengerId': 20,
    'Survived': 1,
    'Pclass': 3,
    'Name': 'Masselmani, Mrs. Fatima',
    'Sex': 'female',
    'SibSp': 0,
    'Parch': 0,
    'Ticket': 2649,
    'Fare': 7.225,
    'Cabin': '',
    'Embarked': 'C'},
   {'_id': ObjectId('63c072ce73246b274e3e29e2'),
    'PassengerId': 10,
    'Survived': 1,
    'Pclass': 2,
    'Name': 'Nasser, Mrs. Nicholas (Adele Achem)',
    'Sex': 'female',
    'Age': 14,
    'SibSp': 1,
    'Parch': 0,
    'Ticket': 237736,
    'Fare': 30.0708,
    'Cabin': '',
 

In [49]:
# A dataframe might be easier to view, but the details of the subdocuments are not shown in full
pd.DataFrame(embark.aggregate([
   {
     "$lookup":
       {
         "from": "training",
         "localField": "embarked",
         "foreignField": "Embarked",
         "as": "Passengers"
       }
  }
]))

Unnamed: 0,_id,embarked,port,Passengers
0,63c072dc2dbff6fb26d5626d,C,Cherbourg,"[{'_id': 63c072ce73246b274e3e29d5, 'PassengerI..."
1,63c072dc2dbff6fb26d5626e,Q,Queenstown,"[{'_id': 63c072ce73246b274e3e29d1, 'PassengerI..."
2,63c072dc2dbff6fb26d5626f,S,Southampton,"[{'_id': 63c072ce73246b274e3e29cd, 'PassengerI..."
3,63c072dc2dbff6fb26d56270,L,Liverpool,[]
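
Rather than returning every passenger, the joined array can be summarised. The sketch below adds a `$project` stage that uses the `$size` operator to report how many passengers matched each port; the function name and the output field name `passenger_count` are assumptions chosen for the example.

```python
def passengers_per_port_pipeline():
    """Join embark to training, then keep only the port and the array length."""
    return [
        {"$lookup": {"from": "training",
                     "localField": "embarked",
                     "foreignField": "Embarked",
                     "as": "Passengers"}},
        {"$project": {"_id": 0,
                      "port": 1,
                      "passenger_count": {"$size": "$Passengers"}}},
    ]

pipeline = passengers_per_port_pipeline()
print(pipeline[1])
```

Run as `list(embark.aggregate(pipeline))`, this should give one small document per port, with Liverpool reporting a count of 0 because its `Passengers` array is empty.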


In [50]:
# Include all Passengers
list(training.aggregate([
   {
     "$lookup":
       {
         "from": "embark",
         "localField": "Embarked",
         "foreignField": "embarked",
         "as": "port info"
       }
  }
]))

[{'_id': ObjectId('63c072ce73246b274e3e29cd'),
  'PassengerId': 1,
  'Survived': 0,
  'Pclass': 3,
  'Name': 'Braund, Mr. Owen Harris',
  'Sex': 'male',
  'Age': 22,
  'SibSp': 1,
  'Parch': 0,
  'Ticket': 'A/5 21171',
  'Fare': 7.25,
  'Cabin': '',
  'Embarked': 'S',
  'port info': [{'_id': ObjectId('63c072dc2dbff6fb26d5626f'),
    'embarked': 'S',
    'port': 'Southampton'}]},
 {'_id': ObjectId('63c072ce73246b274e3e29ce'),
  'PassengerId': 3,
  'Survived': 1,
  'Pclass': 3,
  'Name': 'Heikkinen, Miss. Laina',
  'Sex': 'female',
  'Age': 26,
  'SibSp': 0,
  'Parch': 0,
  'Ticket': 'STON/O2. 3101282',
  'Fare': 7.925,
  'Cabin': '',
  'Embarked': 'S',
  'port info': [{'_id': ObjectId('63c072dc2dbff6fb26d5626f'),
    'embarked': 'S',
    'port': 'Southampton'}]},
 {'_id': ObjectId('63c072ce73246b274e3e29cf'),
  'PassengerId': 4,
  'Survived': 1,
  'Pclass': 1,
  'Name': 'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
  'Sex': 'female',
  'Age': 35,
  'SibSp': 1,
  'Parch': 0,
  'Ticket'

In [51]:
# Hard to find our two passengers, so add an initial step to just show the passengers where the port of embarkation is blank
pd.DataFrame(training.aggregate([
   { "$match": { "Embarked": { "$eq": '' }}},
   { "$lookup":
       {
         "from": "embark",
         "localField": "Embarked",
         "foreignField": "embarked",
         "as": "port info"
       }
  }
]))

Unnamed: 0,_id,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,port info
0,63c072ce73246b274e3e2a1c,62,1,1,"Icard, Miss. Amelie",female,38,0,0,113572,80,B28,,[]
1,63c072ce73246b274e3e2d0b,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62,0,0,113572,80,B28,,[]
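
Because `$lookup` places the matches in an array, a result closer to a SQL left outer join needs that array unpacked. One possible sketch uses `$unwind` with `preserveNullAndEmptyArrays` set, so that passengers with no matching port, such as the two above, are kept. The function name and the fields chosen in the `$project` stage are illustrative.

```python
def passengers_with_port_pipeline():
    """Left-join training to embark and flatten the single matching port."""
    return [
        {"$lookup": {"from": "embark",
                     "localField": "Embarked",
                     "foreignField": "embarked",
                     "as": "port_info"}},
        # Keep passengers even when port_info is an empty array
        {"$unwind": {"path": "$port_info",
                     "preserveNullAndEmptyArrays": True}},
        {"$project": {"_id": 0,
                      "PassengerId": 1,
                      "Name": 1,
                      "port": "$port_info.port"}},
    ]

pipeline = passengers_with_port_pipeline()
print([list(stage)[0] for stage in pipeline])
```

Passed to `training.aggregate(pipeline)`, passengers without a known port would be expected to appear without a `port` field, much as a SQL left outer join would show NULL.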


# Semi Structured data

The Titanic dataset is an example of structured data, in that it is very uniform, with the same data types in each column.

The power of NoSQL databases is in coping with semi-structured data, such as JSON data, where the values may not be straightforward strings and numbers, but could be nested documents.

One example is Twitter data. Below is an example of tweets extracted on the 10th and 11th January 2023. 

In [52]:
# Needed for Twitter data
import string
import operator
import re

In [53]:
# Good practice to examine the data before importing it
! head data/BBCNews-230110-2118.json

{"id": 1612920079967985706, "text": "Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun", "edit_history_tweet_ids": [1612920079967985706], "author_id": 612473, "context_annotations": [{"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "66", "name": "Interests and Hobbies Category", "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"}, "entity": {"id": "1237472346560053249", "name": "Firefighting"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "12374723465600

In [54]:
! tail data/BBCNews-230111-2214.json

{"id": 1612933985436409857, "text": "Tudor Bible sells for \u00a320k in Belfast auction https://t.co/sVMypOIW2I", "edit_history_tweet_ids": [1612933985436409857], "author_id": 612473, "context_annotations": [{"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "69", "name": "News Vertical", "description": "News Categories like Entertainment or Technology"}, "entity": {"id": "1331946773263253506", "name": "Northern Ireland national news"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "1331946773263253506", "name": "Northern Ire

In [55]:
client.drop_database('twitterDB')
client.list_database_names()

['accidents', 'admin', 'config', 'local', 'mongoExample', 'titanicDB']

In [56]:
# Load the tweets from both files (10th and 11th January) into the bbcnews collection of twitterDB
! mongoimport --db twitterDB  --file data/BBCNews-230110-2118.json --collection bbcnews
! mongoimport --db twitterDB  --file data/BBCNews-230111-2214.json --collection bbcnews

2023-01-12T20:51:41.728+0000	connected to: mongodb://localhost/
2023-01-12T20:51:41.762+0000	100 document(s) imported successfully. 0 document(s) failed to import.
2023-01-12T20:51:41.945+0000	connected to: mongodb://localhost/
2023-01-12T20:51:41.964+0000	100 document(s) imported successfully. 0 document(s) failed to import.
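
By default `mongoimport` reads a JSON file as a sequence of separate documents, typically one per line, which matches the layout seen in the `tail` output above. The sketch below shows that layout from the Python side. The two sample lines are shortened, made-up stand-ins for the real tweet file, not its actual contents.

```python
import io
import json

# Two shortened, made-up lines standing in for the real tweet file
sample = io.StringIO(
    '{"id": 1, "text": "First headline", "lang": "en"}\n'
    '{"id": 2, "text": "Second headline", "lang": "en"}\n'
)

# One JSON document per non-empty line
docs = [json.loads(line) for line in sample if line.strip()]

print(len(docs))
print(docs[0]["text"])

# With a live connection, a list like this could be loaded with
# client.twitterDB.bbcnews.insert_many(docs) instead of calling mongoimport.
```

Parsing in Python first is mainly useful if the documents need cleaning or reshaping before they are stored.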


In [57]:
client.list_database_names()

['accidents',
 'admin',
 'config',
 'local',
 'mongoExample',
 'titanicDB',
 'twitterDB']

In [58]:
# Change database
db = client.twitterDB
bbcnews = db.bbcnews
bbcnews.find_one()

{'_id': ObjectId('63c072dda4443d61145b18da'),
 'id': 1612920079967985706,
 'text': 'Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun',
 'edit_history_tweet_ids': [1612920079967985706],
 'author_id': 612473,
 'context_annotations': [{'domain': {'id': '46',
    'name': 'Business Taxonomy',
    'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
   'entity': {'id': '1557697121477832705',
    'name': 'Publisher & News Business',
    'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books'}},
  {'domain': {'id': '66',
    'name': 'Interests and Hobbies Category',
    'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'},
   'entity': {'id': '1237472346560053249', 'name': 'Firefighting'}},
  {'domain': {'id': '131',
    'name': 'Unified Twitter Taxon

In [59]:
# What columns/keys does a document have?
# Some of these keys hold subdocuments, such as the entities one seen above
bbcnews.find_one().keys()

dict_keys(['_id', 'id', 'text', 'edit_history_tweet_ids', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'edit_controls', 'entities', 'lang', 'public_metrics', 'reply_settings'])

In [60]:
# Is Harry mentioned at all?!
# $regex allows pattern matching. The 'i' option makes the search case insensitive
# "_id": 0 suppresses showing the object id
# SELECT text FROM bbcnews WHERE LOWER(text) LIKE '%harry%';

tweets = bbcnews.find({'text':{'$regex':'Harry', '$options': 'i'}}, {"_id":0,'text': 1})
printDoc(tweets)

{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB"}
{'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1"}
{'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC"}
{'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t'}
{'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64'}
{'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc'}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/XmfSToqCxu"}
{'text': "Who is Harry's ghostwriter, JD Moehringer - and how much did he make? https://t.co/P2TgFcLz8U"}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/Ff92m8HC4g"}
{'text': "Newspaper headlines: 'No way back' says Harry and ho
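
When the search term comes from a user rather than being typed into the notebook, characters such as `.` or `?` would be read as regex syntax. A minimal sketch of one way to guard against that is shown below. `text_contains` is a hypothetical helper, not part of pymongo, and it assumes that Python's `re.escape` output is also acceptable to MongoDB's regex engine for ordinary punctuation.

```python
import re

def text_contains(term, field="text"):
    """Build a case-insensitive 'contains' filter for a literal search term."""
    # re.escape backslash-escapes regex metacharacters such as '.' and '?'
    return {field: {"$regex": re.escape(term), "$options": "i"}}

print(text_contains("Harry"))
print(text_contains("t.co"))
```

The returned dictionary has the same shape as the filter used above, so it could be passed as the first argument to `bbcnews.find`.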

In [61]:
# Is Harry mentioned at all?!
# $regex can be used on more than one field: the "$or" operator returns documents that match either condition.
# Just make sure the brackets are the correct ones and lined up correctly!
# SELECT text, created_at FROM bbcnews WHERE LOWER(text) LIKE '%harry%' OR created_at LIKE '%Wednesday%';

list(bbcnews.find({
    "$or": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))

[{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB",
  'created_at': 'Tuesday 10-Jan-2023 21:02:03'},
 {'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1",
  'created_at': 'Tuesday 10-Jan-2023 20:22:04'},
 {'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC",
  'created_at': 'Tuesday 10-Jan-2023 19:08:58'},
 {'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t',
  'created_at': 'Tuesday 10-Jan-2023 14:42:05'},
 {'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64',
  'created_at': 'Tuesday 10-Jan-2023 12:56:42'},
 {'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc',
  'created_at': 'Tuesday 10-Jan-2023 09:46:49'},
 {'text': "Prince Harry's book officially hits shops after days 

In [62]:
# Or use the "$and" operator to return only documents that match both conditions.
# SELECT text, created_at FROM bbcnews WHERE LOWER(text) LIKE '%harry%' AND created_at LIKE '%Wednesday%';

list(bbcnews.find({
    "$and": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))


[{'text': "Which Royal has come out best in the fallout from Prince Harry's book? https://t.co/zvF9dTJNKW",
  'created_at': 'Wednesday 11-Jan-2023 14:58:43'},
 {'text': "Prince Harry condemns 'dangerous spin' about his Taliban comments https://t.co/V2u2hELxp4",
  'created_at': 'Wednesday 11-Jan-2023 02:04:29'}]
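
To see why `$or` returns more documents than `$and`, the sketch below evaluates the two filters in plain Python against three small documents. `matches` is a hypothetical helper that understands only `$or`, `$and` and `$regex`. It illustrates the logic and is not how MongoDB is implemented. The third document is made up.

```python
import re

def matches(doc, query):
    """Evaluate a small subset of MongoDB filter syntax against a dict."""
    for key, cond in query.items():
        if key == "$or":
            ok = any(matches(doc, sub) for sub in cond)
        elif key == "$and":
            ok = all(matches(doc, sub) for sub in cond)
        else:
            # Only {'$regex': ..., '$options': ...} conditions are handled here
            flags = re.IGNORECASE if "i" in cond.get("$options", "") else 0
            ok = re.search(cond["$regex"], str(doc.get(key, "")), flags) is not None
        if not ok:
            return False
    return True

docs = [
    # The first two are taken from the outputs above
    {"text": "Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc",
     "created_at": "Tuesday 10-Jan-2023 09:46:49"},
    {"text": "Prince Harry condemns 'dangerous spin' about his Taliban comments https://t.co/V2u2hELxp4",
     "created_at": "Wednesday 11-Jan-2023 02:04:29"},
    # Made-up document for illustration
    {"text": "An example headline", "created_at": "Wednesday 11-Jan-2023 10:00:00"},
]

harry = {"text": {"$regex": "Harry", "$options": "i"}}
wednesday = {"created_at": {"$regex": "Wednesday"}}

either = [d for d in docs if matches(d, {"$or": [harry, wednesday]})]
both = [d for d in docs if matches(d, {"$and": [harry, wednesday]})]
# Listing both conditions in one filter document is an implicit AND
implicit = [d for d in docs if matches(d, {**harry, **wednesday})]

print(len(either), len(both), both == implicit)
```

The last comparison reflects a MongoDB convention: conditions on different fields placed in a single filter document are combined with AND, so an explicit `$and` is only needed when the same field or operator has to appear more than once.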

In [63]:
# show the distinct languages found in the tweets
# SELECT DISTINCT lang FROM bbcnews;
# The supported languages can be found here: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages 
db.bbcnews.distinct("lang")

['ca', 'en', 'fr', 'tl']
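
`distinct` lists the values but not how often each one occurs, which in SQL would need a `GROUP BY`. The sketch below shows the idea in plain Python, using made-up language codes rather than the real collection, followed by an aggregation pipeline of the shape that could ask MongoDB to do the same grouping. The pipeline is not executed here.

```python
from collections import Counter

# Made-up language codes standing in for the 'lang' field of each tweet
langs = ["en", "en", "fr", "en", "tl", "ca"]

# Equivalent of SELECT DISTINCT lang
distinct_langs = sorted(set(langs))

# Equivalent of SELECT lang, COUNT(*) ... GROUP BY lang ORDER BY COUNT(*) DESC
counts = Counter(langs).most_common()

# A pipeline of this shape could be passed to db.bbcnews.aggregate(...)
pipeline = [
    {"$group": {"_id": "$lang", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

print(distinct_langs)
print(counts)
```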

In [64]:
# How many tweets have been retweeted more than 100 times?
# Use dot notation to reference keys in any subdocument
db.bbcnews.count_documents({"public_metrics.retweet_count": { '$gt' : 100 }})    

15
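
Dot notation can be read as "follow each key in turn". The sketch below mimics that lookup and the `$gt` comparison in plain Python. `get_path` is a hypothetical helper and the documents are made up. Only the field names mirror the tweet structure.

```python
def get_path(doc, path):
    """Follow a dotted path such as 'public_metrics.retweet_count' through nested dicts."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None  # the path is missing in this document
        value = value[key]
    return value

# Made-up documents; only the field names mirror the tweet structure
docs = [
    {"id": 1, "public_metrics": {"retweet_count": 250, "like_count": 900}},
    {"id": 2, "public_metrics": {"retweet_count": 12, "like_count": 40}},
    {"id": 3},  # no public_metrics at all
]

# Equivalent of {"public_metrics.retweet_count": {"$gt": 100}}
popular = [d for d in docs
           if (get_path(d, "public_metrics.retweet_count") or 0) > 100]

print(len(popular))
```

A document that lacks the path simply fails the comparison, which mirrors how a MongoDB filter skips documents without the field instead of raising an error.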

In [65]:
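# Dot notation also reaches into arrays of subdocuments: find tweets whose linked page title mentions "Firefighter"
# The _id is shown here because the projection does not suppress it
tweets = db.bbcnews.find({'entities.urls.title':{'$regex':'Firefighter'}}, {'entities.urls.title': 1})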

printDoc(tweets)

{'_id': ObjectId('63c072dda4443d61145b18da'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}
{'_id': ObjectId('63c072dd24cadb86afa53b92'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}
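
In the output above `urls` is a list, yet the path `entities.urls.title` still works: the filter matches if any element of the list has a matching title. A rough Python sketch of that fan-out is given below. `collect` is a hypothetical helper, and the second URL entry is made up to show an element without a title.

```python
import re

def collect(value, keys):
    """Return every value reached by following keys, fanning out over lists."""
    if isinstance(value, list):
        found = []
        for item in value:
            found.extend(collect(item, keys))
        return found
    if not keys:
        return [value]
    if isinstance(value, dict) and keys[0] in value:
        return collect(value[keys[0]], keys[1:])
    return []  # the path is missing on this branch

doc = {
    "entities": {
        "urls": [
            {"title": "Firefighters face higher cancer risk, Scottish study finds"},
            {"display_url": "example.invalid/picture"},  # made-up entry with no title
        ]
    }
}

titles = collect(doc, "entities.urls.title".split("."))
matched = any(re.search("Firefighter", t) for t in titles)

print(titles)
print(matched)
```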


In [66]:
# Show some fields from the entities subdocument.
# When showing the subdocuments pretty print makes the tweets more readable
tweets = db.bbcnews.find({}, {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'They are also more likely to die from '
                                       'heart attacks and strokes, researchers '
                                       'say.',
                        'title': 'Firefighters face higher cancer risk, '
                                 'Scottish study finds'}]}}
{'entities': {'urls': [{'description': "A bookseller placed Prince Harry's "
                                       "memoir Spare beside Bella Mackie's "
                                       'novel How to Kill Your Family.',
                        'title': "Harry's memoir Spare displayed beside How to "
                                 'Kill Your Family novel'}]}}
{'entities': {'urls': [{'description': 'Zholia Alemi is a "most accomplished '
                                       'fraudster" who forged a certificate to '
                                       'get work, a jury hears.',
                        'title': 'Unqualified doctor who faked

                        'title': 'Parents pay tribute to son who drowned '
                                 'saving family'}]}}
{'entities': {'urls': [{'description': 'Six-year-old Ella Henderson died a day '
                                       'after being struck by a falling tree '
                                       'in Newcastle.',
                        'title': 'Newcastle girl died after decaying tree '
                                 'collapsed at school'}]}}
{'entities': {'urls': [{'description': 'The club says it cannot continue adult '
                                       'cricket at the ground because of some '
                                       'of its neighbours.',
                        'title': 'Dorset village club ends adult cricket '
                                 "matches after 'constant complaints'"}]}}
{'entities': {'urls': [{'description': 'An orange light streaking across the '
                                       'night sky is captured on mobil

{'entities': {'urls': [{'description': 'What you need to know about emergency '
                                       'care during the ambulance strike.',
                        'title': 'Strike: What will ambulances respond to on '
                                 'Wednesday?'}]}}
{'entities': {'urls': [{'description': 'The economic secretary to the Treasury '
                                       'says the UK is committed to becoming a '
                                       'world crypto hub.',
                        'title': 'Cryptocurrency: UK Treasury considers plan '
                                 'for digital pound'}]}}
{'entities': {'urls': [{'description': 'In extensive BBC interviews, Shamima '
                                       'Begum also reveals the detailed '
                                       'planning before she joined IS at 15.',
                        'title': 'Shamima Begum accepts she joined a terror '
                                 'group'}]}}
{'

In [67]:
# These can be searched too - find the Seal story
tweets = db.bbcnews.find({"entities.urls.title": {"$regex": "seal", '$options': 'i' }}, 
                         {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'Being in a fishing lake is like "being '
                                       'in a branch of Waitrose" for a hungry '
                                       'seal, an expert says.',
                        'title': 'Seal stuck in Rochford lake munching its way '
                                 'through fish stock'}]}}


In [68]:
# Another way to unpack the nested documents
# https://stackoverflow.com/questions/25909927/mongodb-how-to-get-a-field-sub-document-from-a-document
tweets=db.bbcnews.aggregate([
    # "entities" is a subdocument rather than an array, so this stage passes each tweet through unchanged
    { "$unwind": "$entities" },

    # De-normalize the inner "urls" array: one output document per URL
    { "$unwind": "$entities.urls" },

    # Group by tweet, collecting its distinct URL subdocuments into "entities"
    { "$group": {
        "_id": "$_id",
        "entities": { "$addToSet": "$entities.urls" }
    }}
])
printDocs(tweets)

{'_id': ObjectId('63c072dda4443d61145b18ff'),
 'entities': [{'description': 'An orange light streaking across the night sky '
                              'was captured on mobile phones and doorbell '
                              'cameras.',
               'display_url': 'bbc.in/3vSiAPT',
               'end': 61,
               'expanded_url': 'https://bbc.in/3vSiAPT',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1612799697290248193/uUu5ud6O?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1612799697290248193/uUu5ud6O?format=jpg&name=150x150',
                           'width': 150}],
               'start': 38,
               'status': 200,
               'title': 'Meteor lights up skies over England',
               'unwound_url': 'https://www.bbc.co.uk/news/av/uk-england-64220964?at_campaign=So

{'_id': ObjectId('63c072dd24cadb86afa53b47'),
 'entities': [{'display_url': 'pic.twitter.com/ISyyY9Pp6M',
               'end': 206,
               'expanded_url': 'https://twitter.com/BBCNews/status/1613218944994709505/video/1',
               'media_key': '13_1613144785967104000',
               'start': 183,
               'url': 'https://t.co/ISyyY9Pp6M'},
              {'description': 'A London schoolgirl disappears. Four years '
                              "later she's found in Syria with Isis",
               'display_url': 'bbc.in/3Znoqqh',
               'end': 182,
               'expanded_url': 'https://bbc.in/3Znoqqh',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1613218948257878016/My_0ew0u?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1613218948257878016/My_0ew0u?format=jpg&name=150

                           'url': 'https://pbs.twimg.com/news_img/1613040202838151168/c5qjtqMG?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1613040202838151168/c5qjtqMG?format=jpg&name=150x150',
                           'width': 150}],
               'start': 52,
               'status': 200,
               'title': 'Golden Globes 2023: The Banshees of Inisherin wins '
                        'big',
               'unwound_url': 'https://www.bbc.com/news/entertainment-arts-64226565?at_format=link&at_medium=social&at_link_id=E71EB97A-916D-11ED-913A-645C16F31EAE&at_campaign_type=owned&at_link_origin=BBCNews&at_link_type=web_link&at_campaign=Social_Flow&at_ptr_name=twitter&at_bbc_team=editorial',
               'url': 'https://t.co/8BQ2a9Dp5t'}]}
{'_id': ObjectId('63c072dda4443d61145b1928'),
 'entities': [{'display_url': 'bbc.in/3GO5HM7',
               'end':

                           'width': 150}],
               'start': 121,
               'status': 200,
               'title': 'Home - BBC News',
               'unwound_url': 'https://www.bbc.co.uk/news?at_ptr_name=twitter&at_link_type=web_link&at_format=image&at_campaign=Social_Flow&at_bbc_team=editorial&at_medium=social&at_campaign_type=owned&at_link_origin=BBCNews&at_link_id=F54B8C6A-9061-11ED-8209-A07A96E8478F',
               'url': 'https://t.co/eZsqA2JMKS'}]}
{'_id': ObjectId('63c072dda4443d61145b190e'),
 'entities': [{'description': "The capital's roads are more congested than "
                              'they were before the pandemic, researchers '
                              'find.',
               'display_url': 'bbc.in/3jZ7jut',
               'end': 80,
               'expanded_url': 'https://bbc.in/3jZ7jut',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1612740840815828992/8HEGbxPL?format=jpg&name=orig',


               'start': 40,
               'status': 200,
               'title': 'Hair and a spare: Prince Harry and the power of the '
                        'beard',
               'unwound_url': 'https://www.bbc.com/news/newsbeat-64209376?xtor=AL-72-%5Bpartner%5D-%5Bbbc.news.twitter%5D-%5Bheadline%5D-%5Bnews%5D-%5Bbizdev%5D-%5Bisapi%5D&at_campaign_type=owned&at_ptr_name=twitter&at_link_origin=BBCWorld&at_bbc_team=editorial&at_medium=social&at_format=link&at_link_type=web_link&at_campaign=Social_Flow&at_link_id=769AC3F2-90C2-11ED-9AC9-BFAB4744363C',
               'url': 'https://t.co/TQE8AQO6Vc'}]}
{'_id': ObjectId('63c072dda4443d61145b1933'),
 'entities': [{'display_url': 'bbc.in/3QtjDzp',
               'end': 156,
               'expanded_url': 'https://bbc.in/3QtjDzp',
               'start': 133,
               'unwound_url': 'https://www.bbc.co.uk/news/live/uk-64215127?at_link_origin=BBCNews&at_medium=social&at_campaign_type=owned&at_link_type=web_link&at_format=link&at_bbc_

                           'width': 150}],
               'start': 49,
               'status': 200,
               'title': 'Gwynedd slate mine seeks young workers as demand '
                        'rockets',
               'unwound_url': 'https://www.bbc.com/news/uk-wales-64223863?xtor=AL-72-%5Bpartner%5D-%5Bbbc.news.twitter%5D-%5Bheadline%5D-%5Bnews%5D-%5Bbizdev%5D-%5Bisapi%5D&at_link_origin=BBCNews&at_ptr_name=twitter&at_medium=social&at_bbc_team=editorial&at_format=link&at_link_id=8BB6B9F8-9185-11ED-AD6F-2CC34744363C&at_campaign=Social_Flow&at_link_type=web_link&at_campaign_type=owned',
               'url': 'https://t.co/HYygFle2qn'}]}
{'_id': ObjectId('63c072dd24cadb86afa53b8d'),
 'entities': [{'description': 'The copy of the Geneva Bible from 1615 sold '
                              'well above its £10,000 valuation in east '
                              'Belfast.',
               'display_url': 'bbc.in/3k0nQ1j',
               'end': 69,
               'expanded_url': 'htt

               'expanded_url': 'https://bbc.in/3itjOhy',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1612947962585391105/jTMGFjY9?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1612947962585391105/jTMGFjY9?format=jpg&name=150x150',
                           'width': 150}],
               'start': 49,
               'status': 200,
               'title': 'Rhyl: Care worker struck off for stealing from '
                        'patient',
               'unwound_url': 'https://www.bbc.com/news/uk-wales-64227410?xtor=AL-72-%5Bpartner%5D-%5Bbbc.news.twitter%5D-%5Bheadline%5D-%5Bnews%5D-%5Bbizdev%5D-%5Bisapi%5D&at_campaign_type=owned&at_campaign=Social_Flow&at_link_type=web_link&at_link_id=704E0DC0-9107-11ED-9AC9-BFAB4744363C&at_ptr_name=twitter&at_link_origin=BBCNews&at_bbc_team=editorial&at_format=li

               'status': 200,
               'title': 'Home - BBC News',
               'unwound_url': 'https://www.bbc.co.uk/news?at_format=image&at_link_type=web_link&at_ptr_name=twitter&at_medium=social&at_bbc_team=editorial&at_campaign_type=owned&at_campaign=Social_Flow&at_link_id=39C59B1E-9135-11ED-B199-B4162152A482&at_link_origin=BBCNews',
               'url': 'https://t.co/Xq77CR0Dfw'},
              {'display_url': 'pic.twitter.com/BdyOsXczXK',
               'end': 139,
               'expanded_url': 'https://twitter.com/BBCNews/status/1612945412318453762/photo/1',
               'media_key': '3_1612945408698847232',
               'start': 116,
               'url': 'https://t.co/BdyOsXczXK'}]}
{'_id': ObjectId('63c072dda4443d61145b18e7'),
 'entities': [{'description': 'The firm, formally known as Hermes, says staff '
                              'shortages, Royal Mail strikes and bad weather '
                              'have caused disruption.',
               'display
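
The `$unwind` and `$group` stages used in the aggregation above can be hard to picture from the output alone. The sketch below reproduces the two steps in plain Python on made-up documents. It illustrates the idea only and is not MongoDB's implementation.

```python
from collections import defaultdict

# Made-up documents shaped like the tweets: 'entities.urls' is an array
docs = [
    {"_id": "a", "entities": {"urls": [{"title": "Story one"}, {"title": "Picture"}]}},
    {"_id": "b", "entities": {"urls": [{"title": "Story two"}]}},
]

# $unwind on "$entities.urls": one output document per array element
unwound = [
    {"_id": d["_id"], "entities": {"urls": url}}
    for d in docs
    for url in d["entities"]["urls"]
]

# $group by _id, gathering the unwound elements back into one list per tweet
grouped = defaultdict(list)
for d in unwound:
    url = d["entities"]["urls"]
    if url not in grouped[d["_id"]]:  # $addToSet keeps only distinct values
        grouped[d["_id"]].append(url)

print(len(unwound))
print(dict(grouped))
```

Unlike this sketch, `$addToSet` does not guarantee the order of the collected elements, which is one reason the URLs in the output above may appear in a different order from the original tweet.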

# Summary

This notebook and the RelationalDB notebook give you a flavour of two types of database management system.

What are the differences?

Some things to think about:

*Relational*
- the schema is fixed
- the data is normalised, with less duplication
- constraints can be enforced
- ACID transaction support (Atomicity, Consistency, Isolation and Durability)

*NoSQL (Document)*
- flexible schema, optional data can be easily incorporated.
- can support agile development
- data is often denormalised, which can mean more duplication
- constraints are generally not enforced
- BASE transaction support (Basically Available, Soft state, Eventual consistency)
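
The schema and constraint points in the two lists above can be made concrete with a small sketch. It assumes an in-memory SQLite table as a stand-in for the relational side and plain Python dictionaries as a stand-in for MongoDB documents. The table and field names are made up.

```python
import sqlite3

# Relational: the schema is declared up front and constraints are enforced
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweet (id INTEGER PRIMARY KEY, text TEXT NOT NULL)")
conn.execute("INSERT INTO tweet (id, text) VALUES (1, 'First headline')")

try:
    conn.execute("INSERT INTO tweet (id, text) VALUES (2, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the NOT NULL constraint blocked the row

# Document style: each record carries its own keys, so optional data just appears
documents = [
    {"id": 1, "text": "First headline"},
    {"id": 2, "lang": "en", "entities": {"urls": []}},  # no 'text', extra keys
]
keys_per_doc = [sorted(d) for d in documents]

print(rejected)
print(keys_per_doc)
```

The relational table refuses the incomplete row, while the document list accepts records with different shapes. That is the trade-off between enforced structure and flexibility described above.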

Bear in mind that NoSQL is a relatively new technology, so it can be seen as immature: compared with relational systems, it has tended to offer weaker support for transaction handling and access control. It could be argued, however, that this is not the market it is aimed at.

