# MongoDB Examples

## MongoDB and CSV files

This notebook uses the UK Baby Names dataset introduced in my TMA01 Preparation Tutorial (available on GitHub: https://github.com/MaryGarvey/TM351). The second half of the notebook looks at using JSON data.

Activity 13.2 introduces *Seven Databases in Seven Weeks* (Redmond 2012).

The most common NoSQL databases introduced are:

- Riak - key-value
- HBase - wide column
- MongoDB - document
- CouchDB - document
- Neo4j - graph
- Redis - key-value

This notebook will look at the MongoDB NoSQL document database.

# UK Baby Names 👶 (1996-2021)

## Introduction (from the Kaggle Website)

<i>Baby name statistics are compiled from first names recorded when live births are registered in England and Wales as part of civil registration, a legal requirement.
The statistics are based only on live births which occurred in the calendar year, as there is no public register of stillbirths.</i>

<i>Babies born in England and Wales to women whose usual residence is outside England and Wales are included in the statistics for England and Wales as a whole, but excluded from any sub-division of England and Wales.
The statistics are based on the exact spelling of the name given on the birth certificate. Grouping names with similar pronunciation would change the rankings. Exact names are given so users can group if they wish.</i>

<i>The dataset contains records of around 16k boy names and 22k girl names.</i>

You can get further information and the datasets from: 
https://www.kaggle.com/datasets/johnsmith44/uk-baby-names-1996-2021

In [None]:
# Import the required libraries

import pymongo
import datetime
import collections
#import Object

import pandas as pd
# better for printing JSON data: p(retty)print
from pprint import pprint

# Print out the version of pymongo 
print (pymongo.version)

In [None]:
#SET DATABASE CONNECTION STRINGS
MONGOHOST='localhost'
MONGOPORT=27017
MONGOCONN='mongodb://{MONGOHOST}:{MONGOPORT}/'.format(MONGOHOST=MONGOHOST,MONGOPORT=MONGOPORT)

In [None]:
# MongoDB version
! mongod --version

In [None]:
client = pymongo.MongoClient(MONGOCONN)

In [None]:
# Drop the tutorial databases so we start with a clean sheet
# Unlike SQL, the command will not generate an error if it does not already exist
client.drop_database('babyNamesDB')
client.drop_database('politicsDB')
client.drop_database('twitterDB')
client.list_database_names()

In [None]:
# Check the start and end of the file for any issues
!head data/UKGirlNames1996-2021.csv

In [None]:
!tail data/UKGirlNames1996-2021.csv

In [None]:
# babyNamesDB is a database that contains 2 collections (similar to tables)
db = client.babyNamesDB

There are two ways to import the CSV dataset.

- use the `mongoimport` command
- import into a dataframe as normal, then convert to a MongoDB collection

Both methods will be shown here for information.
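
To make the idea of "rows become documents" concrete, here is a minimal sketch using only the standard library. The sample rows are made up for illustration and are not taken from the dataset; the point is simply that each CSV row turns into one dictionary keyed by the header line, which is the shape a MongoDB document takes.

```python
import csv
import io

# Made-up sample in the same wide layout as the baby names files
sample = io.StringIO(
    "Name,2021 Rank,2021 Count\n"
    "Olivia,1,3649\n"
    "Marvi,,\n"
)

# Each CSV row becomes one dictionary keyed by the header line
documents = list(csv.DictReader(sample))

for doc in documents:
    print(doc)
```

Note that `csv.DictReader` keeps every value as a string and blank cells as empty strings, whereas pandas infers numeric types and represents blank cells as `NaN`. The two import routes can therefore store missing values differently.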

1. Using mongoimport

In [None]:
# 1. using mongoimport
! mongoimport --db babyNamesDB --type=csv --headerline --file data/UKGirlNames1996-2021.csv --collection girls

In [None]:
# 2. importing via a data frame
names_df = pd.read_csv("data/UKBoyNames1996-2021.csv")
db.boys.insert_many(names_df.to_dict('records'))

In [None]:
# Check the database has been added (babyNamesDB)
client.list_database_names()

In [None]:
# and it contains the two collections
db.list_collection_names()

In [None]:
# setup variables for the two collections
boys = db.boys
girls = db.girls

In [None]:
# how many documents does each collection have:
print("Girls:\t{}".format(girls.count_documents({})))
print("Boys:\t{}".format(boys.count_documents({})))

The variables save us having to use `db.collectionName.function()` in the queries. For example, you can use `girls.find()` instead of `db.girls.find()`, although the longer format still works.

Just be careful if you swap databases in the same notebook, as we do later: you could end up referencing a collection in the wrong database. Mongo will not warn you that this is an error. It just assumes the collection does not exist and returns nothing, a consequence of a schemaless database.
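
One way to guard against this is to check the collection name before using it. The sketch below is only a suggestion: `require_collection` is a hypothetical helper written for this notebook, not part of pymongo. It relies on the `list_collection_names()` method used elsewhere in this notebook.

```python
def require_collection(db, name):
    """Hypothetical helper: return db[name], but fail loudly if it is missing."""
    existing = db.list_collection_names()
    if name not in existing:
        raise LookupError(
            "Collection '{}' not found; available: {}".format(name, sorted(existing))
        )
    return db[name]
```

Once the collections have been imported, `girls = require_collection(db, "girls")` would behave like `girls = db.girls`, while a misspelling such as `require_collection(db, "Girls")` would raise an error instead of silently returning nothing.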

In [None]:
# Show one record - can be any one from the collection
girls.find_one()

We can see there are a lot of missing values, which will be removed later.

# Querying 

MongoDB stores data as JSON-like documents (BSON internally), which means it uses the format *{key: value}* for most things.

The *find()* function is the equivalent of the SQL SELECT statement.

Instead of a *WHERE* clause you provide a JSON-style document (a Python dictionary in pymongo) describing what you want to find.

For example, the following is the equivalent of *SELECT * FROM girls WHERE Name = 'Mary';*

In [None]:
girls.find({'Name': 'Mary'})

In [None]:
# Can specify a search criteria with find_one too (could be the only one)
girls.find_one({'Name': 'Mary-Beth'})

The difference between `find()` and `find_one()` is that the former returns a cursor over all the documents matching the criteria, whereas the latter returns just one of the documents, which can be used to inspect the structure of the data. Do bear in mind that, since MongoDB can store semi-structured data, different documents could have a different structure, unlike a relational database, where records in a table all have the same structure.

To see what is returned in the cursor, let's create some functions to print the individual documents from the cursor.

In [None]:
# This means an iterator is needed to display the results
# using pretty print
def printDocs(documents):
    for doc in documents:
        pprint(doc)

# ordinary print
def printDoc(documents):
    for doc in documents:
        print(doc)

In [None]:
# find the Marys
docs = girls.find({'Name': 'Mary'})
printDoc(docs)

In [None]:
# alternatively use a dataframe to make it more like a relational table
# find the girls names in 2021 with a count more than 2000
pd.DataFrame(girls.find({"2021 Count" : {"$gt": 2000}}))

In [None]:
# Or can convert the Cursor to a list
# find the girls names in 2021 with a count more than 2000, but were ranked in the top 10 in 2020
list(girls.find({"2021 Count" : {"$gt": 2000}, "2020 Rank": {"$lte": 10}}))
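
To show how a filter document is read, here is a toy matcher written in plain Python. It is an illustration only, not how MongoDB evaluates queries, and it supports just equality, `$gt` and `$lte`. The sample documents and the `matches` function are made up for this sketch.

```python
# Toy illustration only: this is NOT how MongoDB evaluates queries
OPERATORS = {
    "$gt": lambda value, target: value > target,
    "$lte": lambda value, target: value <= target,
}

def matches(doc, query):
    """Return True if doc satisfies every condition in query (implicit AND)."""
    for key, condition in query.items():
        if key not in doc:
            return False  # a missing key matches none of these conditions
        value = doc[key]
        if isinstance(condition, dict):
            # operator form, e.g. {"$gt": 2000}
            if not all(OPERATORS[op](value, target) for op, target in condition.items()):
                return False
        elif value != condition:
            # plain equality, e.g. {"Name": "Mary"}
            return False
    return True

# Made-up documents
docs = [
    {"Name": "Aaa", "2021 Count": 2500, "2020 Rank": 3},
    {"Name": "Bbb", "2021 Count": 2100, "2020 Rank": 12},
    {"Name": "Ccc", "2020 Rank": 5},
]
query = {"2021 Count": {"$gt": 2000}, "2020 Rank": {"$lte": 10}}
print([d["Name"] for d in docs if matches(d, query)])
```

Listing several keys in one filter document means all of the conditions must hold, which is why only the first sample document is selected.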

# Data Dictionary



One consequence of being schemaless is that there are no conventional data dictionary tables to check whether the collection or key names exist. This means that no error message is generated if either does not exist. Do note that the names are all case sensitive.

Why will the following return no records?

In [None]:
girls.find_one({"Name" : "Fred"})

In [None]:
db.girls.find_one({"name" : "Susan"})

But Python will generate an error message if it cannot find the variables or functions:

In [None]:
Girls.find_one({"Name" : "Susan"})

In [None]:
girls.find_One({"Name" : "Susan"})

In [None]:
girls.find_One({"Name" : Susan})

In [None]:
girls.find_One({Name : "Susan"})

In [None]:
# Let's find our girl
girls.find_one({"Name" : "Susan"})

There may not be a data dictionary collection to query, but you can find the keys in a collection, which are similar to the column names in a relational database. Be aware though, that the structure can vary from document to document in a given collection.


In [None]:
girls.find_one().keys()

As seen previously, there are a lot of fields with no data. One advantage of a NoSQL database is that documents do not all have to have the same structure, so if a value is blank there is no need to store the key.

For example, let's remove the "2021 Rank" key from any document where it is blank (stored as an empty string):

In [None]:
girls.update_many({"2021 Rank" : ""}, { "$unset": {"2021 Rank" : 1 }});

In [None]:
girls.find_one()

Given the number of empty keys, it would be tedious to remove each one separately, so let's find what keys each record has and then loop through them, removing any blanks.

Do note that `find_one()` could retrieve any record. If the data were semi-structured, each document could have a different structure. In this case, the data came from a CSV file, so every document has the same structure.

In [None]:
keys = girls.find_one({}).keys()
keys

In [None]:
for k in keys:
    girls.update_many({ k : ""}, { "$unset": { k : 1 }});
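
An alternative is to drop the blank values on the client before the documents are inserted. The sketch below assumes blanks arrive as empty strings, `None`, or `NaN` (the latter is what pandas produces for empty cells); `strip_blanks` is a hypothetical helper and the sample row is made up.

```python
import math

def strip_blanks(doc):
    """Hypothetical helper: copy doc without keys whose value is '', None or NaN."""
    def is_blank(value):
        if value is None or value == "":
            return True
        return isinstance(value, float) and math.isnan(value)
    return {key: value for key, value in doc.items() if not is_blank(value)}

# Made-up wide record with two blank values
row = {"Name": "Marvi", "2020 Rank": "", "2021 Rank": float("nan"), "2021 Count": 3}
print(strip_blanks(row))
```

It could be applied to each dictionary produced by `to_dict('records')` before calling `insert_many()`. When importing from the command line, `mongoimport` also has an `--ignoreBlanks` option that skips empty CSV fields.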

In [None]:
# note, the above has removed any empty keys, but the document will still exist
girls.find_one()

In [None]:
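# do the same to the boys names
# boys was loaded via pandas, so blank cells may be stored as NaN rather than ""
keys = boys.find_one({}).keys()
for k in keys:
    boys.update_many({k: ""}, {"$unset": {k: 1}})
    boys.update_many({k: float("nan")}, {"$unset": {k: 1}})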

In [None]:
boys.find_one()

In [None]:
# The consequence of this is that the keys will be slightly different for the records that have more complete data
# one with sparse data
girls.find_one({"Name" : "Marvi"}).keys()

In [None]:
# one more complete:
girls.find_one({"Name" : "Martina"}).keys()
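
Because the keys now vary between documents, a single `find_one()` no longer tells the whole story. A small sketch of how the keys could be tallied across many documents is given below; `key_counts` is a hypothetical helper and the sample documents are made up.

```python
from collections import Counter

def key_counts(documents):
    """Hypothetical helper: count how many documents contain each key."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    return counts

# Made-up documents with different structures
sample = [
    {"Name": "Aaa", "2020 Rank": 10, "2021 Rank": 8},
    {"Name": "Bbb", "2021 Rank": 950},
]
print(key_counts(sample))
```

Since a cursor yields dictionaries, the same function should work on `girls.find()`, although it reads every document, so it is best kept for small collections.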

In [None]:
# how many documents in the collection
db.girls.count_documents({})

In [None]:
# can access via the index (starts at 0)
girls.find()[0]

In [None]:
# second record
girls.find()[1]

In [None]:
# Last one (avoid calling the variable len, which would hide Python's built-in len())
last = girls.count_documents({}) - 1
girls.find()[last]

`count_documents()` can be used with queries to count the result, rather than listing them

In [None]:
girls.count_documents({"Name": "Mary"})

In [None]:
# how many documents have a count more than 1500 in 2021
girls.count_documents({"2021 Count": {"$gt" : 1500} })

# Part 15: Complex queries and analysis
# Aggregation Pipeline

More complex processing, including grouping, aggregation functions, and data renaming is achieved through MongoDB’s aggregation pipeline.

For example a query can involve several stages:
                                                
First stage: filter out documents that do not match some criterion<br>
Second stage: group those documents<br>
Third stage: select only groups that match another criterion<br>
Fourth stage: group summaries would then be returned to the client<br>

By building up a pipeline in stages, complex data processing tasks can be built from simple components.
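
The four stages described above could be written as the following pipeline. This is a sketch under assumptions: it uses the `Name` and `2021 Count` keys of the girls collection, groups by the first letter of the name, and the thresholds are arbitrary. The function name `top_initials_pipeline` is made up for this example.

```python
def top_initials_pipeline(min_count, min_total):
    """Build an illustrative four-stage pipeline for the girls collection."""
    return [
        # Stage 1: keep only names given to more than min_count babies in 2021
        {"$match": {"2021 Count": {"$gt": min_count}}},
        # Stage 2: group by the first letter of the name, totalling the counts
        {"$group": {
            "_id": {"$substrCP": ["$Name", 0, 1]},
            "total": {"$sum": "$2021 Count"},
            "names": {"$sum": 1},
        }},
        # Stage 3: keep only the groups whose total exceeds min_total
        {"$match": {"total": {"$gt": min_total}}},
        # Stage 4: order the group summaries that go back to the client
        {"$sort": {"total": -1}},
    ]

pipeline = top_initials_pipeline(min_count=100, min_total=5000)
print([next(iter(stage)) for stage in pipeline])
```

A pipeline is just a Python list of dictionaries, so it can be built and inspected before it is sent to the server with `list(girls.aggregate(pipeline))`.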

<img src="pipeline.png">

<img src="pipeline_functions.png">

Further examples can be found in *Notebook 15.3 Introducing aggregation pipelines.*

The examples below and in the practical activities all use small datasets that can be used locally. With huge datasets, the processing may be spread over many computers to improve speed. Data processing tools (such as the aggregation pipeline and MapReduce) keep the processing of data near the data itself, reducing the work required by the client and the amount of data to be moved across the network from server to client.

In [None]:
# Equivalent to SELECT COUNT(*) AS count FROM girls;
# $group needs an _id; a constant puts every document into a single group
pipeline = [
     {"$group": {"_id": 0, "count": {"$sum": 1}}},
]

list(girls.aggregate(pipeline))

In [None]:
# SELECT "2021 Count", count(*) FROM girls GROUP BY "2021 Count" ORDER BY "2021 Count";
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$2021 Count", "count": {"$sum": 1} }},
                               { "$sort" : {"_id" : 1}} ] ))


In [None]:
# SELECT Name, count(*) FROM girls GROUP BY Name;
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$Name", "count": {"$sum": 1} }} ] ))

## Reshaping

To do statistics on this data we want to use information held in the keys as values, e.g., extract the year from `2020 Count`. In Tutorial 2 we did some processing to do this, so let's reuse the code to reshape our data:

In [None]:
def updateFile(fileType):
    # remove rows with any missing data permanently
    filename = 'data/UK'+fileType+'Names1996-2021.csv'
    print("Importing: '"+filename+"'")
    names_df = pd.read_csv(filename)
    names_df = names_df.dropna(how='any')
    # unpivot the dataframe from a wide to long format
    names2_df = pd.melt(names_df, id_vars="Name")
    # split the two values in variable: year and the type (count or rank)
    names2_df[['Year','Type']] = names2_df['variable'].str.split(' ', expand = True)
    # convert year to a number
    names2_df['Year'] = names2_df['Year'].astype(str).astype(int)
    # the variable column is no longer needed
    names2_df.drop('variable', axis=1, inplace=True)
    # save the changes
    names2_df.to_csv('data/'+fileType+'Updated.csv')
    return names2_df
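
To see what the reshaping does to a single record, here is a plain Python sketch of the same wide-to-long idea. It is not the pandas implementation: `wide_to_long` is a hypothetical helper, the sample record is made up, and `pd.melt` returns its rows in a different order.

```python
def wide_to_long(row, id_key="Name"):
    """Hypothetical helper: one wide record becomes one long record per year and type."""
    long_rows = []
    for key, value in row.items():
        if key == id_key:
            continue
        year, kind = key.split(" ")  # "2020 Count" -> "2020", "Count"
        long_rows.append(
            {id_key: row[id_key], "value": value, "Year": int(year), "Type": kind}
        )
    return long_rows

# Made-up wide record
wide = {"Name": "Aaa", "2020 Rank": 12, "2020 Count": 900, "2021 Rank": 10, "2021 Count": 950}
for record in wide_to_long(wide):
    print(record)
```

Each of the four year/type keys becomes its own record, so the year and the type can be queried as ordinary values.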

In [None]:
boys_df = updateFile("Boy")
boys_df.head()

In [None]:
girls_df = updateFile("Girl")
girls_df.head()

In [None]:
# next import the girls_df dataframe into a collection (girlsUpdate)
db.girlsUpdate.insert_many(girls_df.to_dict('records'))

In [None]:
# ditto the boys_df dataframe
db.boysUpdate.insert_many(boys_df.to_dict('records'))

In [None]:
# check they are now in the baby names database (babyNamesDB)
db.list_collection_names()

In [None]:
db.girlsUpdate.find_one()

In [None]:
db.boysUpdate.find_one()

In [None]:
# how many documents does each collection have:
print("Girls: \t\t{}".format(girls.count_documents({})))
print("Girls Update: \t{}".format(db.girlsUpdate.count_documents({})))
print("Boys: \t\t{}".format(boys.count_documents({})))
print("Boys Update: \t{}".format(db.boysUpdate.count_documents({})))

In [None]:
# SELECT Year, count(*) as count FROM boysUpdate GROUP BY Year;
printDoc(db.boysUpdate.aggregate( [ { "$group" : { "_id" : "$Year", "count": {"$sum": 1} }} ] ))

In [None]:
# SELECT Year, sum(value) as "Sum of values" FROM boysUpdate GROUP BY Year ORDER BY Year DESC;
printDoc(db.boysUpdate.aggregate( [ { "$group" : { "_id" : "$Year", "Sum of values": {"$sum": "$value"}}},
                                                 { "$sort" : {"_id" : -1}}  
                                     ] ))

In [None]:
# SELECT Name, avg(value) as "Average Rank" FROM girlsUpdate WHERE Type = 'Rank' GROUP BY Name ORDER BY "Average Rank";
# This pipeline involves 3 stages: $match, $group and $sort
printDoc(db.girlsUpdate.aggregate( [ { "$match" : {"Type": "Rank"} },
                                     { "$group" : { "_id" : "$Name", "Average Rank": {"$avg": "$value"} }},
                                     { "$sort" : {"Average Rank" : 1}}
                                     ] ))

# Joins

Joining documents was not possible in early versions of MongoDB. Version 3.2 introduced the `$lookup` pipeline stage, which behaves like a simple left outer join ([Mongo docs](https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/)).

MongoDB provides joins as stages of the aggregation pipeline.

These examples use the boys and girls collections. You still join on a common field, just as you join on a common column in a relational database.

In this case *Name* is the common field in both collections.

First, convert the collections to dataframes and do a quick check to see whether there are common names in the two datasets.

In [None]:
boys_df = pd.DataFrame(boys.find())
girls_df = pd.DataFrame(girls.find())

In [None]:
boys_df[boys_df['Name'].isin(girls_df['Name'])]

In [None]:
girls_df[girls_df['Name'].isin(boys_df['Name'])]

There appear to be quite a few matching names in both datasets.

In [None]:
# Check to see if Alex appears in both, plus only show some fields
girls.find_one({"Name" : "Alex"}, {"_id":0, "Name": 1, "2021 Rank": 1, "2021 Count": 1})

In [None]:
boys.find_one({"Name" : "Alex"}, {"_id":0, "Name": 1, "2021 Rank": 1, "2021 Count": 1})

By default `$lookup` carries out a left outer join: `as` names an array field that is added to each document, and for names with no match that array is empty. Really we want an inner join, so further work is needed on the `as` array to keep only the documents where it is not empty. Unwinding the array with `preserveNullAndEmptyArrays` set to false does this.

See https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/ for further information on the `preserveNullAndEmptyArrays` field.

Thanks to https://stackoverflow.com/questions/37575722/how-to-do-inner-joining-in-mongodb for an example (accessed 08/01/2024).

In [None]:
list(girls.aggregate([
   {
     "$lookup":
       {
         "from": "boys",
         "localField": "Name",
         "foreignField": "Name",
         "as": "joined"        
       }
   }, 
    {"$unwind": {
           "path": "$joined",
           "preserveNullAndEmptyArrays": False
   }}    
  ]))
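
The effect of the two stages can be mimicked in plain Python. This is a simplified sketch of the semantics, not MongoDB's implementation: the `lookup` and `unwind` functions and the sample data are made up for illustration.

```python
def lookup(left_docs, right_docs, local_field, foreign_field, as_field):
    """Mimic $lookup: attach an array of matching right-hand documents."""
    joined = []
    for left in left_docs:
        found = [r for r in right_docs if r.get(foreign_field) == left.get(local_field)]
        joined.append({**left, as_field: found})
    return joined

def unwind(docs, field, preserve_empty=False):
    """Mimic $unwind: emit one document per element of the array in field."""
    result = []
    for doc in docs:
        items = doc.get(field) or []
        if not items and preserve_empty:
            # keep the document, but without the empty array
            result.append({k: v for k, v in doc.items() if k != field})
        for item in items:
            result.append({**doc, field: item})
    return result

# Made-up data
girls_sample = [{"Name": "Alex", "2021 Count": 30}, {"Name": "Mary", "2021 Count": 180}]
boys_sample = [{"Name": "Alex", "2021 Count": 600}]

outer = lookup(girls_sample, boys_sample, "Name", "Name", "joined")
inner = unwind(outer, "joined")
print(len(outer), len(inner))
```

After the lookup, every girl's name is still present, with an empty `joined` array where there was no match. Unwinding without preserving empty arrays drops those documents, which leaves the inner join.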

# Semi-Structured data

The Baby Names dataset is an example of structured data, in that it is very uniform, with the same data types in each column.

The power of NoSQL databases is in coping with semi-structured data, such as JSON data, where the values may not be straightforward strings and numbers, but could be nested documents.



## USA Government data

A lot of publicly available data is in JSON format, for example, government agencies:

- https://catalog.data.gov/dataset?res_format=JSON

- https://github.com/jdorfman/awesome-json-datasets#government

The examples below use the US government politician datasets found via the last link.

- Current US Senators: role.json
- Current US Representatives: role-reps.json

Plus a list of USA States and abbreviations:
- states_titlecase.json

found here: https://gist.github.com/mshafrir/2646763

All downloaded: 09/01/2024

In [None]:
# let's have a look at the data
!head data/role.json

In [None]:
!tail data/role.json

In [None]:
! head data/role-reps.json

In [None]:
!tail data/role-reps.json

In [None]:
# note, this file was amended to remove the commas between each document (otherwise would not import)
!head data/states_titlecase.json

In [None]:
!tail data/states_titlecase.json

The politicians data looks to be in JSON format and appears to have some metadata at the start.

In [None]:
client.drop_database('politicsDB')

In [None]:
# import the data using mongoimport, note the type is now json
! mongoimport --db politicsDB --type=json --file data/role.json  --collection senators
! mongoimport --db politicsDB --type=json --file data/role-reps.json --collection reps
! mongoimport --db politicsDB --type=json --file data/states_titlecase.json --collection states

In [None]:
# Change database
db = client.politicsDB
senators = db.senators
reps = db.reps
states = db.states

In [None]:
# check a document in each collection
senators.find_one()

In [None]:
reps.find_one()

In [None]:
states.find_one()

We can see that the politician details have nested documents, where a document (or array) is nested within other information. This is an example of semi-structured data.

The dot syntax can be used to search nested documents. For example, a snippet of information from above for the person sub-document, shows what keys are available within it: 

<pre>
'person': {'bioguideid': 'A000055',
    'birthday': '1965-07-22',
    'cspanid': 45516,
    'fediverse_webfinger': None,
    'firstname': 'Robert',
    'gender': 'male',
    'gender_label': 'Male',
    'lastname': 'Aderholt',
    'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
    'middlename': 'B.',
    'name': 'Rep. Robert Aderholt [R-AL4]',
    'namemod': '',
    'nickname': '',
    'osid': 'N00003028',
    'pvsid': None,
    'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
    'twitterid': 'Robert_Aderholt',
    'youtubeid': 'RobertAderholt'},
</pre>    

To pull out the full information for this representative, I'm assuming a dot-notation query on the nested key will do:

In [None]:
reps.find_one({"objects.person.bioguideid": 'A000055' })

Hmmm, this has found the representative. However, all the politicians are stored in one document, rather than one document per politician, so when the query matches, all the data in that document is returned!

Extracting individual items from the array requires the `$unwind` stage.

For example, display just the firstnames, suppressing the generated id:

In [None]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1}},
                        {"$unwind":"$objects"} ]) 
printDocs(docs)

In [None]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

Do make sure the pipeline stages are in the right order. If the `$match` comes before the `$unwind`, it is applied to the whole document, so every representative is returned again whenever the criterion matches:

In [None]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"}
                      ])
printDocs(docs)
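
The reason the order matters can be sketched in plain Python. The code below is a simplified illustration, not MongoDB's implementation: the helper `any_element_matches` and the stand-in document are made up, and the identifiers in it are not real bioguide ids.

```python
def any_element_matches(doc, array_field, path, target):
    """Mimic a dot-notation match through an array: True if ANY element matches."""
    for element in doc.get(array_field, []):
        value = element
        for part in path.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value == target:
            return True
    return False

# Made-up stand-in for the single document holding every representative
reps_doc = {"objects": [
    {"person": {"bioguideid": "X1", "firstname": "Ann"}},
    {"person": {"bioguideid": "X2", "firstname": "Bob"}},
]}

# $match before $unwind: the whole document passes, so both people survive
if any_element_matches(reps_doc, "objects", "person.bioguideid", "X2"):
    before = reps_doc["objects"]
else:
    before = []

# $unwind then $match: each element is tested on its own
after = [o for o in reps_doc["objects"] if o["person"]["bioguideid"] == "X2"]

print(len(before), len(after))
```

Matching first only answers "does this document contain the person anywhere?", whereas matching after the unwind tests each person separately.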

In [None]:
# or if in doubt duplicate the $match as discussed here:
# https://stackoverflow.com/questions/54030089/how-to-use-unwind-and-match-with-mongodb

docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

In [None]:
# show all the person details, here with the duplicated $match
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

In [None]:
# the same query with a single $match placed after the $unwind
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

`person` is part of the details stored for each representative.

In [None]:
# show all the details for this representative 
docs = reps.aggregate([{"$project" : {"_id": 0, "objects" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

In [None]:
# if you don't know your American states, join up the states collection
joined = reps.aggregate([
     {"$unwind":"$objects"},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }
  }
])
printDocs(joined)

In [None]:
# so what does AL mean for our representative
joined = reps.aggregate([
     {"$unwind":"$objects"},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }},
       {"$match": { "objects.person.bioguideid": 'A000055' }}
])
printDocs(joined)


In [None]:
# the senator data is similarly structured, let's find any female senators
docs = senators.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.gender":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.gender": 'female' }}
                      ])
printDocs(docs)

In [None]:
# find the Senior Senator for Michigan

docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)

In [None]:
# Finally remember, because there is no schema you can query an unknown field; MongoDB simply finds no match and does not warn you!
docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)

## Twitter Data

Another example of semi-structured data is Twitter, or X, data.

Unfortunately, X now charges at least $100 a month if you want to extract tweets (creating them is still free!). Below are some examples of tweets extracted on the 10th and 11th January 2023.

In [None]:
# Needed for Twitter data
import string
import operator
import re

In [None]:
# Good practice to examine the data before importing it
! head data/BBCNews-230110-2118.json

In [None]:
! tail data/BBCNews-230111-2214.json

In [None]:
client.drop_database('twitterDB')
client.list_database_names()

In [None]:
# load the tweets from 10th and 11th January into the bbcnews collection of a twitterDB database
! mongoimport --db twitterDB  --file data/BBCNews-230110-2118.json --collection bbcnews
! mongoimport --db twitterDB  --file data/BBCNews-230111-2214.json --collection bbcnews

In [None]:
client.list_database_names()

In [None]:
# Change database
db = client.twitterDB
bbcnews = db.bbcnews
bbcnews.find_one()

In [None]:
# What columns/keys does it have
# Some of these keys are subdocuments, such as the entities one seen above
bbcnews.find_one().keys()

In [None]:
# Prince Harry was topical this time last year! Is he mentioned at all?!
# $regex allows pattern matching. The 'i' option makes the search case insensitive
# "_id": 0 suppresses showing the object id
# SELECT text FROM bbcnews WHERE LOWER(text) LIKE '%harry%';

tweets = bbcnews.find({'text':{'$regex':'Harry', '$options': 'i'}}, {"_id":0,'text': 1})
printDoc(tweets)
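
The same kind of matching can be tried out in plain Python with the `re` module before running it against the collection. This is a small sketch with made-up tweet texts. Like an unanchored `$regex`, `search()` finds the pattern anywhere in the string.

```python
import re

# Made-up tweet texts
texts = [
    "Prince Harry interview airs tonight",
    "HARRY tops the charts",
    "Weather warning issued",
]

# re.IGNORECASE plays the role of the 'i' option
pattern = re.compile("Harry", re.IGNORECASE)
print([t for t in texts if pattern.search(t)])
```

The first two texts match regardless of case, which is the behaviour the SQL comment approximates with `LOWER(text) LIKE '%harry%'`.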

In [None]:
# $regex can be used on more than one field. Use the "$or" operator to match either condition.
# Just make sure the brackets are the correct ones and lined up correctly!
# SELECT text, created_at FROM bbcnews WHERE LOWER(text) LIKE '%harry%' OR created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$or": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))

In [None]:
# or use the "$and" operator to require both conditions.
# SELECT text, created_at FROM bbcnews WHERE LOWER(text) LIKE '%harry%' AND created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$and": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))


In [None]:
# show the distinct languages found in the tweets
# SELECT DISTINCT lang FROM bbcnews;
# The supported languages can be found here: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages 
db.bbcnews.distinct("lang")

In [None]:
# How many tweets have been retweeted more than 100 times
# Use the dot notation to reference keys in any subdocument
db.bbcnews.count_documents({"public_metrics.retweet_count": { '$gt' : 100 }})    

In [None]:
tweets = db.bbcnews.find({'entities.urls.title':{'$regex':'Firefighter'}}, {'entities.urls.title': 1})

printDoc(tweets)

In [None]:
# Show some fields from the entities subdocument.
# When showing the subdocuments pretty print makes the tweets more readable
tweets = db.bbcnews.find({}, {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

In [None]:
# These can be searched too - find the Seal story
tweets = db.bbcnews.find({"entities.urls.title": {"$regex": "seal", '$options': 'i' }}, 
                         {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

In [None]:
# Another way to unpack the nested documents
# https://stackoverflow.com/questions/25909927/mongodb-how-to-get-a-field-sub-document-from-a-document
tweets=db.bbcnews.aggregate([
    # De-normalize the array content first
    { "$unwind": "$entities" },

    # De-normalize the content from the inner array as well
    { "$unwind": "$entities.urls" },

    # Group the "entities" per document
    { "$group": {
        "_id": "$_id",
        "entities": { "$addToSet": "$entities.urls" }
    }}
])
printDocs(tweets)

# Summary

This and the relationalDB Notebooks give you a flavour of the two types of database management system. 

What are the differences?

Some things to think about:

*Relational*
- fixed schema
- the data is normalised, with less duplication
- constraints can be enforced
- ACID transaction support (Atomicity, Consistency, Isolation and Durability)

*NoSQL (Document)*
- flexible schema, optional data can be easily incorporated.
- can support agile development
- data is denormalised, so can mean more duplication
- constraints not enforced
- BASE transaction support (Basically Available, Soft state, Eventual consistency!)

Bear in mind that NoSQL is a relatively new technology, so it can be seen as immature in that it does not provide good support for transaction handling or access control, but it could be argued that this is not the market it is aimed at.

