# MongoDB Examples

## MongoDB and CSV files

This notebook uses the UK Baby Names dataset introduced in my TMA01 Preparation Tutorial (available on GitHub: https://github.com/MaryGarvey/TM351). The second half of the notebook looks at using JSON data.

Activity 13.2 introduces *Seven Databases in Seven Weeks* (Redmond and Wilson, 2012).

The NoSQL databases introduced are:

- Riak: key-value
- HBase: wide column
- MongoDB: document
- CouchDB: document
- Neo4j: graph
- Redis: key-value

This notebook will look at the MongoDB NoSQL document database.

# UK Baby Names 👶 (1996-2021)

## Introduction (from the Kaggle Website)

<i>Baby name statistics are compiled from first names recorded when live births are registered in England and Wales as part of civil registration, a legal requirement.
The statistics are based only on live births which occurred in the calendar year, as there is no public register of stillbirths.</i>

<i>Babies born in England and Wales to women whose usual residence is outside England and Wales are included in the statistics for England and Wales as a whole, but excluded from any sub-division of England and Wales.
The statistics are based on the exact spelling of the name given on the birth certificate. Grouping names with similar pronunciation would change the rankings. Exact names are given so users can group if they wish.</i>

<i>The dataset contains records of around 16k boy names and 22k girl names.</i>

You can get further information and the datasets from: 
https://www.kaggle.com/datasets/johnsmith44/uk-baby-names-1996-2021

In [1]:
# Import the required libraries

import pymongo
import datetime
import collections
#import Object

import pandas as pd
# better for printing JSON data: p(retty)print
from pprint import pprint

# Print out the version of pymongo 
print (pymongo.version)

4.5.0


In [2]:
#SET DATABASE CONNECTION STRINGS
MONGOHOST='localhost'
MONGOPORT=27017
MONGOCONN='mongodb://{MONGOHOST}:{MONGOPORT}/'.format(MONGOHOST=MONGOHOST,MONGOPORT=MONGOPORT)

In [3]:
# MongoDB version
! mongod --version

db version v6.0.10
Build Info: {
    "version": "6.0.10",
    "gitVersion": "8e4b5670df9b9fe814e57cb5f3f8ee9407237b5a",
    "openSSLVersion": "OpenSSL 3.0.2 15 Mar 2022",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "ubuntu2204",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}


In [4]:
client = pymongo.MongoClient(MONGOCONN)

In [5]:
# Drop the tutorial databases so we start with a clean sheet
# Unlike SQL's DROP DATABASE, this will not raise an error if the database does not exist
client.drop_database('babyNamesDB')
client.drop_database('politicsDB')
client.drop_database('twitterDB')
client.list_database_names()

['accidents', 'admin', 'config', 'local']

In [6]:
# Check the start and end of the file for any issues
!head data/UKGirlNames1996-2021.csv

Name,2021 Rank,2021 Count,2020 Rank,2020 Count,2019 Rank,2019 Count,2018 Rank,2018 Count,2017 Rank,2017 Count,2016 Rank,2016 Count,2015 Rank,2015 Count,2014 Rank,2014 Count,2013 Rank,2013 Count,2012 Rank,2012 Count,2011 Rank,2011 Count,2010 Rank,2010 Count,2009 Rank,2009 Count,2008 Rank,2008 Count,2007 Rank,2007 Count,2006 Rank,2006 Count,2005 Rank,2005 Count,2004 Rank,2004 Count,2003 Rank,2003 Count,2002 Rank,2002 Count,2001 Rank,2001 Count,2000 Rank,2000 Count,1999 Rank,1999 Count,1998 Rank,1998 Count,1997 Rank,1997 Count,1996 Rank,1996 Count
A'Idah,,,,,4686.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A'Isha,5581.0,3.0,,,2458.0,10.0,,,4763.0,4.0,2757.0,9.0,2328.0,11.0,2659.0,9.0,4050.0,5.0,4171.0,5.0,4764.0,4.0,3533.0,6.0,3936.0,5.0,4524.0,4.0,2233.0,10.0,,,4798.0,3.0,3725.0,4.0,4373.0,3.0,,,2023.0,8.0,,,,,3142.0,4.0,,,,
A'Ishah,4634.0,4.0,,,,,,,5765.0,3.0,,,,,3160.0,7.0,5742.0,3.0,4171.0,5.0,5785.0,3.0,2589.0,9.0,,,4524.0,4.0,2895.0,7.0,5061.0,3.0,3382.0,5.0,2802.0,6.0

In [7]:
!tail data/UKGirlNames1996-2021.csv

Zyanna,,,5493.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zyla,2711.0,9.0,3117.0,7.0,3541.0,6.0,5666.0,3.0,,,5785.0,3.0,5730.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zylah,4634.0,4.0,5493.0,3.0,,,,,5765.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zymal,,,,,,,,,5765.0,3.0,,,,,,,4739.0,4.0,,,,,,,,,2487.0,9.0,2627.0,8.0,,,,,,,,,,,,,,,,,,,,,,
Zynab,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3900.0,3.0,,,,,,
Zynah,3961.0,5.0,,,,,,,4763.0,4.0,,,2705.0,9.0,5691.0,3.0,2887.0,8.0,4171.0,5.0,4764.0,4.0,5707.0,3.0,,,5545.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,
Zyra,2341.0,11.0,2449.0,10.0,4001.0,5.0,2901.0,8.0,4063.0,5.0,,,4736.0,4.0,,,5742.0,3.0,5876.0,3.0,,,4688.0,4.0,,,5545.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,
Zyrah,5581.0,3.0,4535.0,4.0,,,,,4763.0,4.0,4096.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zysha,4634.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zyva,2711.0,9.0,2449.0,10.0,3541.0,6.0,3985.0,5.0,4063.0,5.0,,,5730.0,3.0,,,,,,,,,,,3936.0,5.0,5545.0,3.0,,,

In [8]:
# babyNamesDB is a database that contains 2 collections (similar to tables)
db = client.babyNamesDB

There are two ways to import the CSV dataset.

- use the `mongoimport` command
- import into a dataframe as normal, then convert to a MongoDB collection

Both methods will be shown here for information.

1. Using mongoimport

In [9]:
# 1. using mongoimport
! mongoimport --db babyNamesDB --type=csv --headerline --file data/UKGirlNames1996-2021.csv --collection girls

2024-01-09T22:28:49.129+0000	connected to: mongodb://localhost/
2024-01-09T22:28:49.639+0000	21958 document(s) imported successfully. 0 document(s) failed to import.


In [10]:
# 2. importing via a data frame
names_df = pd.read_csv("data/UKBoyNames1996-2021.csv")
db.boys.insert_many(names_df.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f2d7df40a90>

In [11]:
# Check the database has been added (babyNamesDB)
client.list_database_names()

['accidents', 'admin', 'babyNamesDB', 'config', 'local']

In [12]:
# and it contains the two collections
db.list_collection_names()

['boys', 'girls']

In [13]:
# setup variables for the two collections
boys = db.boys
girls = db.girls

In [14]:
# how many documents does each collection have:
print("Girls:\t{}".format(girls.count_documents({})))
print("Boys:\t{}".format(boys.count_documents({})))

Girls:	21958
Boys:	16777


The variables save us having to write `db.collectionName.function()` in the queries. For example, you can use `girls.find()` instead of `db.girls.find()`. You can still use the longer format.

Just be careful if you swap databases in the same notebook, as we do later: you could end up referencing a collection in the wrong database. Mongo will not report this as an error. It simply finds no such collection and returns nothing, a consequence of a schemaless database.
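
One way to guard against this silent failure is to check that a collection exists before using it. The following is a minimal sketch, not part of the original notebook: the helper name `require_collection` is hypothetical, and it relies only on `list_collection_names()` and the `db[name]` lookup, both of which are already used or available on a pymongo database object.

```python
def require_collection(db, name):
    """Return db[name], but raise if the collection does not exist yet.

    `db` is assumed to be a pymongo Database (or anything offering
    list_collection_names() and item access by collection name).
    """
    if name not in db.list_collection_names():
        raise LookupError(f"collection {name!r} not found in this database")
    return db[name]
```

In this notebook it could be used as `girls = require_collection(db, "girls")`, which would fail loudly if `db` pointed at a different database.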

In [15]:
# Show one record - can be any one from the collection
girls.find_one()

{'_id': ObjectId('659dc8a1d6a53dd7f677f532'),
 'Name': 'Aabida',
 '2021 Rank': 5581.0,
 '2021 Count': 3.0,
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': '',
 '2019 Count': '',
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': '',
 '2011 Count': '',
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': '',
 '2005 Count': '',
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': '',
 '2003 Count': '',
 '2002 Rank': '',
 '2002 Count': '',
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': '',
 '1999 Count': '',
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': '',
 '1997 Count': '',
 '199

We can see there are a lot of missing values, which will be removed later.

# Querying 

MongoDB stores data as JSON-like documents (BSON internally), so most things are written in the form *{key: value}*.

The `find()` function is the equivalent of the SQL SELECT statement.

Instead of a *WHERE* clause, you provide a query document (a Python dictionary in pymongo) describing what you want to find.

For example, the following is the equivalent of `SELECT * FROM girls WHERE Name = 'Mary';`

In [16]:
girls.find({'Name': 'Mary'})

<pymongo.cursor.Cursor at 0x7f2d71120910>
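
The call above returns a cursor rather than the documents themselves. To make the SELECT/WHERE analogy more concrete, here is a simplified pure-Python sketch of how a query document filters a list of documents. It is an illustration only, not how MongoDB is implemented: the function `matches` is hypothetical and handles just equality and four comparison operators.

```python
import operator

# Simplified stand-ins for a few MongoDB comparison operators
_OPS = {"$gt": operator.gt, "$gte": operator.ge,
        "$lt": operator.lt, "$lte": operator.le}

def _is_number(v):
    return isinstance(v, (int, float)) and not isinstance(v, bool)

def matches(doc, query):
    """Return True if doc satisfies every condition in query (an implicit AND)."""
    for key, cond in query.items():
        if key not in doc:
            return False              # a missing key never matches these conditions
        value = doc[key]
        if isinstance(cond, dict):
            for op, target in cond.items():
                # compare numbers only, so '' never satisfies a numeric test
                if not (_is_number(value) and _is_number(target)):
                    return False
                if not _OPS[op](value, target):
                    return False
        elif value != cond:
            return False
    return True

# Three cut-down documents using values shown elsewhere in this notebook
docs = [
    {"Name": "Mary", "2021 Count": 148.0},
    {"Name": "Olivia", "2021 Count": 3649.0},
    {"Name": "Mary-Beth", "2021 Count": ""},
]

print([d["Name"] for d in docs if matches(d, {"Name": "Mary"})])
print([d["Name"] for d in docs if matches(d, {"2021 Count": {"$gt": 2000}})])
```

The first query is the equality test from the cell above; the second uses the `{"$gt": ...}` operator form, which plays the role of `WHERE "2021 Count" > 2000`.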

In [17]:
# find_one also accepts search criteria (the match may be the only one)
girls.find_one({'Name': 'Mary-Beth'})

{'_id': ObjectId('659dc8a1d6a53dd7f67829ad'),
 'Name': 'Mary-Beth',
 '2021 Rank': '',
 '2021 Count': '',
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': '',
 '2019 Count': '',
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': 5785.0,
 '2011 Count': 3.0,
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': 3970.0,
 '2005 Count': 4.0,
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': 4373.0,
 '2003 Count': 3.0,
 '2002 Rank': 4137.0,
 '2002 Count': 3.0,
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': 2444.0,
 '1999 Count': 6.0,
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': 2738.

The difference between `find()` and `find_one()` is that the former returns a cursor over all the documents matching the criteria, whereas the latter returns just one matching document (or `None` if there is no match), which can be used to inspect the structure of the data. Do bear in mind that, since MongoDB can store semi-structured data, different documents can have different structures, unlike a relational database, where all records in a table share the same structure.

To see what is returned in the cursor, let's create some functions to print the individual documents from the cursor.

In [18]:
# This means an iterator is needed to display the results
# using pretty print
def printDocs(documents):
    for doc in documents:
        pprint(doc)

# ordinary print
def printDoc(documents):
    for doc in documents:
        print(doc)

In [19]:
# find the Marys
docs = girls.find({'Name': 'Mary'})
printDoc(docs)

{'_id': ObjectId('659dc8a1d6a53dd7f67829b8'), 'Name': 'Mary', '2021 Rank': 318.0, '2021 Count': 148.0, '2020 Rank': 291.0, '2020 Count': 160.0, '2019 Rank': 289.0, '2019 Count': 170.0, '2018 Rank': 259.0, '2018 Count': 189.0, '2017 Rank': 320.0, '2017 Count': 150.0, '2016 Rank': 250.0, '2016 Count': 204.0, '2015 Rank': 249.0, '2015 Count': 198.0, '2014 Rank': 225.0, '2014 Count': 229.0, '2013 Rank': 244.0, '2013 Count': 203.0, '2012 Rank': 241.0, '2012 Count': 209.0, '2011 Rank': 250.0, '2011 Count': 200.0, '2010 Rank': 213.0, '2010 Count': 237.0, '2009 Rank': 227.0, '2009 Count': 213.0, '2008 Rank': 179.0, '2008 Count': 292.0, '2007 Rank': 177.0, '2007 Count': 300.0, '2006 Rank': 170.0, '2006 Count': 310.0, '2005 Rank': 151.0, '2005 Count': 339.0, '2004 Rank': 164.0, '2004 Count': 325.0, '2003 Rank': 162.0, '2003 Count': 298.0, '2002 Rank': 146.0, '2002 Count': 310.0, '2001 Rank': 145.0, '2001 Count': 315.0, '2000 Rank': 146.0, '2000 Count': 313.0, '1999 Rank': 139.0, '1999 Count': 33

In [20]:
# alternatively use a dataframe to make it more like a relational table
# find the girls names in 2021 with a count more than 2000
pd.DataFrame(girls.find({"2021 Count" : {"$gt": 2000}}))

Unnamed: 0,_id,Name,2021 Rank,2021 Count,2020 Rank,2020 Count,2019 Rank,2019 Count,2018 Rank,2018 Count,...,2000 Rank,2000 Count,1999 Rank,1999 Count,1998 Rank,1998 Count,1997 Rank,1997 Count,1996 Rank,1996 Count
0,659dc8a1d6a53dd7f677fab2,Amelia,2.0,3164.0,2.0,3319.0,2.0,3712.0,2.0,3941.0,...,35.0,1489.0,37.0,1511.0,48.0,1249.0,49.0,1145.0,63.0,929.0
1,659dc8a1d6a53dd7f677ffa9,Ava,4.0,2576.0,4.0,2679.0,4.0,2946.0,3.0,3110.0,...,291.0,129.0,429.0,70.0,530.0,51.0,500.0,55.0,753.0,30.0
2,659dc8a1d6a53dd7f67810fd,Florence,8.0,2180.0,14.0,1963.0,15.0,2025.0,15.0,2062.0,...,163.0,274.0,166.0,268.0,177.0,244.0,175.0,257.0,194.0,228.0
3,659dc8a1d6a53dd7f678114e,Freya,6.0,2187.0,12.0,1982.0,10.0,2264.0,18.0,1921.0,...,75.0,686.0,92.0,589.0,93.0,563.0,113.0,450.0,118.0,394.0
4,659dc8a1d6a53dd7f6781756,Isabella,13.0,2010.0,8.0,2052.0,6.0,2398.0,7.0,2369.0,...,64.0,796.0,57.0,883.0,90.0,594.0,107.0,470.0,106.0,441.0
5,659dc8a1d6a53dd7f678176e,Isla,3.0,2683.0,3.0,2749.0,3.0,2981.0,4.0,3046.0,...,297.0,125.0,286.0,124.0,369.0,87.0,380.0,84.0,382.0,87.0
6,659dc8a1d6a53dd7f67817e8,Ivy,5.0,2245.0,6.0,2166.0,12.0,2158.0,14.0,2104.0,...,1033.0,20.0,1216.0,16.0,2165.0,7.0,1666.0,10.0,1222.0,15.0
7,659dc8a1d6a53dd7f678245a,Lily,7.0,2182.0,7.0,2150.0,9.0,2285.0,13.0,2184.0,...,45.0,1124.0,53.0,1007.0,61.0,873.0,75.0,721.0,85.0,651.0
8,659dc8a1d6a53dd7f6782b71,Mia,9.0,2168.0,5.0,2303.0,5.0,2500.0,6.0,2490.0,...,43.0,1149.0,54.0,980.0,75.0,721.0,89.0,583.0,116.0,397.0
9,659dc8a1d6a53dd7f67831ea,Olivia,1.0,3649.0,1.0,3640.0,1.0,4082.0,1.0,4598.0,...,8.0,4546.0,4.0,5250.0,16.0,3550.0,19.0,2789.0,24.0,2456.0


In [21]:
# Or can convert the Cursor to a list
# find the girls names with a 2021 count of more than 2000 that were also ranked in the top 10 in 2020
list(girls.find({"2021 Count" : {"$gt": 2000}, "2020 Rank": {"$lte": 10}}))

[{'_id': ObjectId('659dc8a1d6a53dd7f677fab2'),
  'Name': 'Amelia',
  '2021 Rank': 2.0,
  '2021 Count': 3164.0,
  '2020 Rank': 2.0,
  '2020 Count': 3319.0,
  '2019 Rank': 2.0,
  '2019 Count': 3712.0,
  '2018 Rank': 2.0,
  '2018 Count': 3941.0,
  '2017 Rank': 2.0,
  '2017 Count': 4358.0,
  '2016 Rank': 2.0,
  '2016 Count': 4777.0,
  '2015 Rank': 1.0,
  '2015 Count': 5158.0,
  '2014 Rank': 1.0,
  '2014 Count': 5327.0,
  '2013 Rank': 1.0,
  '2013 Count': 5570.0,
  '2012 Rank': 1.0,
  '2012 Count': 7061.0,
  '2011 Rank': 1.0,
  '2011 Count': 5054.0,
  '2010 Rank': 5.0,
  '2010 Count': 4227.0,
  '2009 Rank': 9.0,
  '2009 Count': 3625.0,
  '2008 Rank': 9.0,
  '2008 Count': 3440.0,
  '2007 Rank': 10.0,
  '2007 Count': 3250.0,
  '2006 Rank': 16.0,
  '2006 Count': 2907.0,
  '2005 Rank': 15.0,
  '2005 Count': 2976.0,
  '2004 Rank': 18.0,
  '2004 Count': 2649.0,
  '2003 Rank': 21.0,
  '2003 Count': 2299.0,
  '2002 Rank': 25.0,
  '2002 Count': 1973.0,
  '2001 Rank': 32.0,
  '2001 Count': 1709.0,
  

# Data Dictionary



One consequence of being schemaless is that there are no conventional data dictionary tables to check whether a collection or key name exists. This means that no error message is generated if a name does not exist; the query simply returns nothing. Note also that all names are case sensitive.

Why will the following return no records?

In [22]:
girls.find_one({"Name" : "Fred"})

In [23]:
db.girls.find_one({"name" : "Susan"})
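
Both queries above return `None` without complaint. In the second, the key is spelled `name`, which does not match the stored key `Name`; the first uses a valid key, so presumably there is simply no girl named Fred in the data. A minimal sketch of a helper that flags query keys missing from a sample document (the name `unknown_keys` is hypothetical, not part of pymongo):

```python
def unknown_keys(query, sample_doc):
    """Return the query keys that are absent from sample_doc (case sensitive)."""
    return [key for key in query if key not in sample_doc]

# A cut-down sample document using values shown later in this notebook
sample = {"Name": "Susan", "2021 Rank": 1692.0, "2021 Count": 17.0}

print(unknown_keys({"name": "Susan"}, sample))
print(unknown_keys({"Name": "Susan"}, sample))
```

With a live collection, the sample could come from `girls.find_one()`. Bear in mind that a single sample document may lack keys that other documents have, so an empty result is a hint rather than a guarantee.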

But an error is raised if Python cannot resolve a variable or method name:

In [24]:
Girls.find_one({"Name" : "Susan"})

NameError: name 'Girls' is not defined

In [25]:
girls.find_One({"Name" : "Susan"})

TypeError: 'Collection' object is not callable. If you meant to call the 'find_One' method on a 'Collection' object it is failing because no such method exists.

In [26]:
girls.find_one({"Name" : Susan})

NameError: name 'Susan' is not defined

In [27]:
girls.find_one({Name : "Susan"})

NameError: name 'Name' is not defined

In [28]:
# Let's find our girl
girls.find_one({"Name" : "Susan"})

{'_id': ObjectId('659dc8a1d6a53dd7f67841e8'),
 'Name': 'Susan',
 '2021 Rank': 1692.0,
 '2021 Count': 17.0,
 '2020 Rank': 2042.0,
 '2020 Count': 13.0,
 '2019 Rank': 3151.0,
 '2019 Count': 7.0,
 '2018 Rank': 3518.0,
 '2018 Count': 6.0,
 '2017 Rank': 1512.0,
 '2017 Count': 21.0,
 '2016 Rank': 1525.0,
 '2016 Count': 21.0,
 '2015 Rank': 1601.0,
 '2015 Count': 19.0,
 '2014 Rank': 1882.0,
 '2014 Count': 15.0,
 '2013 Rank': 1433.0,
 '2013 Count': 22.0,
 '2012 Rank': 1130.0,
 '2012 Count': 30.0,
 '2011 Rank': 1043.0,
 '2011 Count': 33.0,
 '2010 Rank': 1257.0,
 '2010 Count': 25.0,
 '2009 Rank': 865.0,
 '2009 Count': 39.0,
 '2008 Rank': 878.0,
 '2008 Count': 37.0,
 '2007 Rank': 836.0,
 '2007 Count': 38.0,
 '2006 Rank': 770.0,
 '2006 Count': 40.0,
 '2005 Rank': 951.0,
 '2005 Count': 28.0,
 '2004 Rank': 883.0,
 '2004 Count': 30.0,
 '2003 Rank': 844.0,
 '2003 Count': 31.0,
 '2002 Rank': 761.0,
 '2002 Count': 33.0,
 '2001 Rank': 612.0,
 '2001 Count': 42.0,
 '2000 Rank': 582.0,
 '2000 Count': 46.0,
 '

There may not be a data dictionary collection to query, but you can find the keys in a collection, which are similar to the column names in a relational database. Be aware though, that the structure can vary from document to document in a given collection.


In [29]:
girls.find_one().keys()

dict_keys(['_id', 'Name', '2021 Rank', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])

As seen previously, many fields hold no data. One advantage of a NoSQL database is that documents do not all need the same structure, so if a value is blank there is no need to store the key.

For example, let's remove the "2021 Rank" key from every document where its value is an empty string:

In [30]:
girls.update_many({"2021 Rank" : ""}, { "$unset": {"2021 Rank" : 1 }});

In [31]:
girls.find_one()

{'_id': ObjectId('659dc8a1d6a53dd7f677f532'),
 'Name': 'Aabida',
 '2021 Rank': 5581.0,
 '2021 Count': 3.0,
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': '',
 '2019 Count': '',
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': '',
 '2011 Count': '',
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': '',
 '2005 Count': '',
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': '',
 '2003 Count': '',
 '2002 Rank': '',
 '2002 Count': '',
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': '',
 '1999 Count': '',
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': '',
 '1997 Count': '',
 '199

Given the number of empty keys, it would be tedious to remove each one separately, so let's find which keys a record has and then loop through them, removing any blanks.

Do note that `find_one()` could retrieve any record, and with semi-structured data each document could have a different structure. In this case the data came from a CSV file, so every document starts with the same set of keys.

In [32]:
keys = girls.find_one({}).keys()
keys

dict_keys(['_id', 'Name', '2021 Rank', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])

In [33]:
for k in keys:
    girls.update_many({ k : ""}, { "$unset": { k : 1 }});

In [34]:
# note, the above has removed any empty keys, but the document will still exist
girls.find_one()

{'_id': ObjectId('659dc8a1d6a53dd7f677f532'),
 'Name': 'Aabida',
 '2021 Rank': 5581.0,
 '2021 Count': 3.0}

In [35]:
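# do the same to the boys names
# note: boys was loaded via pandas, so its missing values are NaN rather than '',
# and this filter does not match them (see the nan values in the output below)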
keys = boys.find_one({}).keys()
for k in keys:
    boys.update_many({ k : ""}, { "$unset": { k : 1 }});

In [36]:
boys.find_one()

{'_id': ObjectId('659dc8a1fea004febb02fd2c'),
 'Name': 'A',
 '2021 Rank': 3451.0,
 '2021 Count': 5.0,
 '2020 Rank': 3848.0,
 '2020 Count': 4.0,
 '2019 Rank': 2104.0,
 '2019 Count': 10.0,
 '2018 Rank': 3959.0,
 '2018 Count': 4.0,
 '2017 Rank': 3996.0,
 '2017 Count': 4.0,
 '2016 Rank': 2335.0,
 '2016 Count': 9.0,
 '2015 Rank': 2020.0,
 '2015 Count': 11.0,
 '2014 Rank': 2964.0,
 '2014 Count': 6.0,
 '2013 Rank': 2660.0,
 '2013 Count': 7.0,
 '2012 Rank': 1589.0,
 '2012 Count': 15.0,
 '2011 Rank': 2613.0,
 '2011 Count': 7.0,
 '2010 Rank': 2941.0,
 '2010 Count': 6.0,
 '2009 Rank': nan,
 '2009 Count': nan,
 '2008 Rank': 3158.0,
 '2008 Count': 5.0,
 '2007 Rank': 2741.0,
 '2007 Count': 6.0,
 '2006 Rank': 2870.0,
 '2006 Count': 5.0,
 '2005 Rank': nan,
 '2005 Count': nan,
 '2004 Rank': nan,
 '2004 Count': nan,
 '2003 Rank': nan,
 '2003 Count': nan,
 '2002 Rank': nan,
 '2002 Count': nan,
 '2001 Rank': 3134.0,
 '2001 Count': 3.0,
 '2000 Rank': nan,
 '2000 Count': nan,
 '1999 Rank': nan,
 '1999 Count
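
The boys output above still contains `nan` values. That collection was loaded through pandas, which represents missing cells as NaN rather than `''`, so the `""` filter in the loop does not match them. One option, sketched below under that assumption, is to drop the missing keys from each record before it is inserted. The helper `strip_empty` is hypothetical and not part of the original notebook.

```python
import math

def strip_empty(doc):
    """Return a copy of doc without keys whose value is '', None or NaN."""
    def is_missing(value):
        if value is None or value == "":
            return True
        return isinstance(value, float) and math.isnan(value)
    return {key: value for key, value in doc.items() if not is_missing(value)}

# A cut-down record using values from the boys output above
record = {"Name": "A", "2010 Count": 6.0,
          "2009 Rank": float("nan"), "2009 Count": float("nan")}

print(strip_empty(record))
```

Applied at import time, this might look like `db.boys.insert_many([strip_empty(r) for r in names_df.to_dict('records')])`, so that no empty keys are stored in the first place.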

In [37]:
# The consequence of this is that the keys will be slightly different for the records that have more complete data
# one with sparse data
girls.find_one({"Name" : "Marvi"}).keys()

dict_keys(['_id', 'Name', '2000 Rank', '2000 Count'])

In [38]:
# one more complete:
girls.find_one({"Name" : "Martina"}).keys()

dict_keys(['_id', 'Name', '2021 Rank', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])
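
Because `find_one().keys()` inspects only a single document, it can miss keys once documents differ like this. A minimal sketch of collecting every key seen across a set of documents (the helper name `all_keys` is hypothetical):

```python
def all_keys(documents):
    """Collect every key that appears in at least one document, in first-seen order."""
    seen = {}
    for doc in documents:
        for key in doc:
            seen.setdefault(key, None)
    return list(seen)

# Two cut-down documents using values shown in this notebook
sample = [
    {"Name": "Aabida", "2021 Rank": 5581.0, "2021 Count": 3.0},
    {"Name": "A'Niyah", "2016 Rank": 5785.0, "2016 Count": 3.0},
]

print(all_keys(sample))
```

A pymongo cursor is an iterable of dictionaries, so `all_keys(girls.find())` should work the same way, although it has to read every document in the collection.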

In [39]:
# how many documents in the collection
db.girls.count_documents({})

21958

In [40]:
# can access via the index (starts at 0)
girls.find()[0]

{'_id': ObjectId('659dc8a1d6a53dd7f677f532'),
 'Name': 'Aabida',
 '2021 Rank': 5581.0,
 '2021 Count': 3.0}

In [41]:
# second record
girls.find()[1]

{'_id': ObjectId('659dc8a1d6a53dd7f677f533'),
 'Name': "A'Niyah",
 '2016 Rank': 5785.0,
 '2016 Count': 3.0}

In [42]:
# Last one
last_index = girls.count_documents({}) - 1  # avoid shadowing the built-in len()
girls.find()[last_index]

{'_id': ObjectId('659dc8a1d6a53dd7f6784af7'),
 'Name': 'Zyrah',
 '2021 Rank': 5581.0,
 '2021 Count': 3.0,
 '2020 Rank': 4535.0,
 '2020 Count': 4.0,
 '2017 Rank': 4763.0,
 '2017 Count': 4.0,
 '2016 Rank': 4096.0,
 '2016 Count': 5.0}

`count_documents()` can be used with a query to count the matching documents, rather than listing them:

In [43]:
girls.count_documents({"Name": "Mary"})

1

In [44]:
# how many documents have a count more than 1500 in 2021
girls.count_documents({"2021 Count": {"$gt" : 1500} })

25

# Part 15: Complex queries and analysis
# Aggregation Pipeline

More complex processing, including grouping, aggregation functions, and data renaming is achieved through MongoDB’s aggregation pipeline.

For example, a query can involve several stages:

First stage: filter out documents that do not match some criterion<br>
Second stage: group those documents<br>
Third stage: select only groups that match another criterion<br>
Fourth stage: group summaries would then be returned to the client<br>

By building up a pipeline in stages, complex data processing tasks can be built from simple components.
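
The stages described above can be sketched as a pipeline definition, which in pymongo is simply a list of dictionaries. This is an illustration under assumptions: the function name and thresholds are hypothetical, the field names are those of the `girls` collection, and the final `$sort` is an addition that orders the group summaries before they are returned.

```python
def build_pipeline(min_count, min_names):
    """Build a staged pipeline over the girls collection (hypothetical thresholds)."""
    return [
        # Stage 1: filter out documents whose 2021 count is below min_count
        {"$match": {"2021 Count": {"$gte": min_count}}},
        # Stage 2: group by 2021 count and tally the names in each group
        {"$group": {"_id": "$2021 Count", "names": {"$sum": 1}}},
        # Stage 3: keep only groups containing more than min_names names
        {"$match": {"names": {"$gt": min_names}}},
        # Stage 4: order the group summaries returned to the client
        {"$sort": {"_id": 1}},
    ]

pipeline = build_pipeline(min_count=3, min_names=100)
print([next(iter(stage)) for stage in pipeline])
```

The list would then be passed to `girls.aggregate(pipeline)`. Because each stage is an ordinary dictionary, stages can be added, removed or reordered while a query is being developed.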

<img src="pipeline.png">

<img src="pipeline_functions.png">

Further examples can be found in *Notebook 15.3 Introducing aggregation pipelines.*

The examples below and in the practical activities all use small datasets that can be handled locally. With huge datasets, the work may be spread across many computers to speed it up. Data processing tools (such as the aggregation pipeline and MapReduce) keep the processing close to the data itself, reducing both the work required of the client and the amount of data moved across the network from server to client.

In [45]:
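# Equivalent to SELECT COUNT(*) FROM girls;
# $group needs an _id; a constant _id puts every document into one group
# (the count is stored under the key "Name" here, but any key name would do)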
pipeline = [
     {"$group": {"_id": 0, "Name": {"$sum": 1}}},
]

list(girls.aggregate(pipeline))

[{'_id': 0, 'Name': 21958}]
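
As an aside, the same total can be expressed in two ways. The sketch below only defines the pipelines; the key name `total` is a hypothetical choice. `$count` is a dedicated aggregation stage that avoids the need for a dummy `_id`.

```python
# Grouping everything under one constant _id and summing 1 per document
count_by_group = [{"$group": {"_id": None, "total": {"$sum": 1}}}]

# The $count stage returns a single document holding the total
count_stage = [{"$count": "total"}]

print(count_by_group)
print(count_stage)
```

Passing either list to `girls.aggregate(...)` should give one result document with a `total` field; the `$group` version also carries the `_id` it grouped on.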

In [47]:
# SELECT "2021 Count", COUNT(*) FROM girls GROUP BY "2021 Count" ORDER BY "2021 Count";
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$2021 Count", "count": {"$sum": 1} }},
                               { "$sort" : {"_id" : 1}} ] ))


{'_id': None, 'count': 14628}
{'_id': 3.0, 'count': 1750}
{'_id': 4.0, 'count': 947}
{'_id': 5.0, 'count': 673}
{'_id': 6.0, 'count': 442}
{'_id': 7.0, 'count': 327}
{'_id': 8.0, 'count': 250}
{'_id': 9.0, 'count': 231}
{'_id': 10.0, 'count': 212}
{'_id': 11.0, 'count': 158}
{'_id': 12.0, 'count': 140}
{'_id': 13.0, 'count': 128}
{'_id': 14.0, 'count': 111}
{'_id': 15.0, 'count': 96}
{'_id': 16.0, 'count': 81}
{'_id': 17.0, 'count': 93}
{'_id': 18.0, 'count': 64}
{'_id': 19.0, 'count': 59}
{'_id': 20.0, 'count': 64}
{'_id': 21.0, 'count': 51}
{'_id': 22.0, 'count': 52}
{'_id': 23.0, 'count': 51}
{'_id': 24.0, 'count': 40}
{'_id': 25.0, 'count': 25}
{'_id': 26.0, 'count': 27}
{'_id': 27.0, 'count': 35}
{'_id': 28.0, 'count': 29}
{'_id': 29.0, 'count': 34}
{'_id': 30.0, 'count': 33}
{'_id': 31.0, 'count': 26}
{'_id': 32.0, 'count': 33}
{'_id': 33.0, 'count': 30}
{'_id': 34.0, 'count': 32}
{'_id': 35.0, 'count': 24}
{'_id': 36.0, 'count': 26}
{'_id': 37.0, 'count': 21}
{'_id': 38.0, 'coun

In [46]:
# SELECT Name, COUNT(*) FROM girls GROUP BY Name;
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$Name", "count": {"$sum": 1} }} ] ))

{'_id': 'Iseult', 'count': 1}
{'_id': 'Sevyn', 'count': 1}
{'_id': 'Layaan', 'count': 1}
{'_id': 'Ethel', 'count': 1}
{'_id': 'Reyna', 'count': 1}
{'_id': 'Nasriya', 'count': 1}
{'_id': 'Bella-Renee', 'count': 1}
{'_id': 'Nuzhat', 'count': 1}
{'_id': 'Ezmae-Rose', 'count': 1}
{'_id': 'Rivkah', 'count': 1}
{'_id': 'Shamira', 'count': 1}
{'_id': 'Yaren', 'count': 1}
{'_id': 'Arshin', 'count': 1}
{'_id': 'Sukhleen', 'count': 1}
{'_id': 'Roberta', 'count': 1}
{'_id': 'Zaraya', 'count': 1}
{'_id': 'Zaynab', 'count': 1}
{'_id': 'Kharis', 'count': 1}
{'_id': 'Novella', 'count': 1}
{'_id': 'Deia', 'count': 1}
{'_id': 'Lyara', 'count': 1}
{'_id': 'Lima', 'count': 1}
{'_id': 'Prianna', 'count': 1}
{'_id': 'Nava', 'count': 1}
{'_id': 'Shanela', 'count': 1}
{'_id': 'Sharlotte', 'count': 1}
{'_id': 'Tayia', 'count': 1}
{'_id': 'Katisha', 'count': 1}
{'_id': 'Arya-Grace', 'count': 1}
{'_id': 'Alphonsa', 'count': 1}
{'_id': 'Arnika', 'count': 1}
{'_id': 'Jannath', 'count': 1}
{'_id': 'Odessa', 'count

## Reshaping

To do statistics on this data we want to use information held in the keys as values, e.g., extract the year from `2020 Count`. In Tutorial 2 we did some processing to do this, so let's reuse that code to reshape our data:

In [48]:
def updateFile(fileType):
    # permanently remove rows with missing data
    filename = 'data/UK'+fileType+'Names1996-2021.csv'
    print("Importing: '"+filename+"'")
    names_df = pd.read_csv(filename)
    names_df = names_df.dropna(how='any')
    # unpivot the dataframe from a wide to long format
    names2_df = pd.melt(names_df, id_vars="Name")
    # split the two values in variable: year and the type (count or rank)
    names2_df[['Year','Type']] = names2_df['variable'].str.split(' ', expand = True)
    # convert year to a number
    names2_df['Year'] = names2_df['Year'].astype(str).astype(int)
    # the variable column is no longer needed
    names2_df.drop('variable', axis=1, inplace=True)
    # save the changes 
    names2_df.to_csv('data/'+fileType+'Updated.csv')
    return names2_df

In [49]:
boys_df = updateFile("Boy")
boys_df.head()

Importing: 'data/UKBoyNames1996-2021.csv'


Unnamed: 0,Name,value,Year,Type
0,Aadam,457.0,2021,Rank
1,Aadil,1448.0,2021,Rank
2,Aamir,2301.0,2021,Rank
3,Aaran,1860.0,2021,Rank
4,Aaron,119.0,2021,Rank


In [50]:
girls_df = updateFile("Girl")
girls_df.head()

Importing: 'data/UKGirlNames1996-2021.csv'


Unnamed: 0,Name,value,Year,Type
0,Aaisha,1569.0,2021,Rank
1,Aaishah,2942.0,2021,Rank
2,Aaliya,1402.0,2021,Rank
3,Aaliyah,132.0,2021,Rank
4,Aamina,1785.0,2021,Rank


In [51]:
# next import the girls_df dataframe into a collection (girlsUpdate)
db.girlsUpdate.insert_many(girls_df.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f2d71122560>

In [52]:
# ditto the boys_df dataframe
db.boysUpdate.insert_many(boys_df.to_dict('records'))

<pymongo.results.InsertManyResult at 0x7f2d71030d60>

In [53]:
# check they are now in the baby names database (babyNamesDB)
db.list_collection_names()

['boys', 'boysUpdate', 'girlsUpdate', 'girls']

In [54]:
db.girlsUpdate.find_one()

{'_id': ObjectId('659dc8bafea004febb033eb5'),
 'Name': 'Aaisha',
 'value': 1569.0,
 'Year': 2021,
 'Type': 'Rank'}
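
To see what the reshaping does to a single record, here is a pure-Python sketch of the same wide-to-long idea applied to one document. It is an illustration rather than the method used above (which relies on `pd.melt`), and the function name `melt_document` is hypothetical.

```python
def melt_document(doc):
    """Turn one wide document into a list of long-format records."""
    records = []
    for key, value in doc.items():
        if key in ("_id", "Name"):
            continue                      # identifiers, not measurements
        year, kind = key.split(" ", 1)    # "2021 Rank" -> "2021", "Rank"
        records.append({"Name": doc["Name"], "value": value,
                        "Year": int(year), "Type": kind})
    return records

# A cut-down wide document using values from the girls collection
wide = {"Name": "Aabida", "2021 Rank": 5581.0, "2021 Count": 3.0}

for rec in melt_document(wide):
    print(rec)
```

One difference to keep in mind: `updateFile()` calls `dropna(how='any')`, so it keeps only names with a value in every year, whereas this sketch converts whatever keys a document happens to have.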

In [55]:
db.boysUpdate.find_one()

{'_id': ObjectId('659dc8bbfea004febb046efd'),
 'Name': 'Aadam',
 'value': 457.0,
 'Year': 2021,
 'Type': 'Rank'}

In [56]:
# how many documents does each collection have:
print("Girls: \t\t{}".format(girls.count_documents({})))
print("Girls Update: \t{}".format(db.girlsUpdate.count_documents({})))
print("Boys: \t\t{}".format(boys.count_documents({})))
print("Boys Update: \t{}".format(db.boysUpdate.count_documents({})))

Girls: 		21958
Girls Update: 	77896
Boys: 		16777
Boys Update: 	72124


In [57]:
# SELECT Year, COUNT(*) AS count FROM boysUpdate GROUP BY Year;
printDoc(db.boysUpdate.aggregate( [ { "$group" : { "_id" : "$Year", "count": {"$sum": 1} }} ] ))

{'_id': 2010, 'count': 2774}
{'_id': 2009, 'count': 2774}
{'_id': 2020, 'count': 2774}
{'_id': 1999, 'count': 2774}
{'_id': 1998, 'count': 2774}
{'_id': 2012, 'count': 2774}
{'_id': 2013, 'count': 2774}
{'_id': 2008, 'count': 2774}
{'_id': 2002, 'count': 2774}
{'_id': 2000, 'count': 2774}
{'_id': 1996, 'count': 2774}
{'_id': 2007, 'count': 2774}
{'_id': 2004, 'count': 2774}
{'_id': 2017, 'count': 2774}
{'_id': 2005, 'count': 2774}
{'_id': 2019, 'count': 2774}
{'_id': 2011, 'count': 2774}
{'_id': 2003, 'count': 2774}
{'_id': 2018, 'count': 2774}
{'_id': 2001, 'count': 2774}
{'_id': 2016, 'count': 2774}
{'_id': 2021, 'count': 2774}
{'_id': 1997, 'count': 2774}
{'_id': 2014, 'count': 2774}
{'_id': 2015, 'count': 2774}
{'_id': 2006, 'count': 2774}


In [58]:
# SELECT Year, SUM(value) AS "Sum of values" FROM boysUpdate GROUP BY Year ORDER BY Year DESC;
printDoc(db.boysUpdate.aggregate( [ { "$group" : { "_id" : "$Year", "Sum of values": {"$sum": "$value"}}},
                                                 { "$sort" : {"_id" : -1}}  
                                     ] ))

{'_id': 2021, 'Sum of values': 2022505.0}
{'_id': 2020, 'Sum of values': 1951612.0}
{'_id': 2019, 'Sum of values': 1918836.0}
{'_id': 2018, 'Sum of values': 1863635.0}
{'_id': 2017, 'Sum of values': 1821542.0}
{'_id': 2016, 'Sum of values': 1813214.0}
{'_id': 2015, 'Sum of values': 1776329.0}
{'_id': 2014, 'Sum of values': 1711269.0}
{'_id': 2013, 'Sum of values': 1701228.0}
{'_id': 2012, 'Sum of values': 1671824.0}
{'_id': 2011, 'Sum of values': 1657461.0}
{'_id': 2010, 'Sum of values': 1622126.0}
{'_id': 2009, 'Sum of values': 1586253.0}
{'_id': 2008, 'Sum of values': 1578758.0}
{'_id': 2007, 'Sum of values': 1546028.0}
{'_id': 2006, 'Sum of values': 1529725.0}
{'_id': 2005, 'Sum of values': 1472112.0}
{'_id': 2004, 'Sum of values': 1469449.0}
{'_id': 2003, 'Sum of values': 1468664.0}
{'_id': 2002, 'Sum of values': 1445779.0}
{'_id': 2001, 'Sum of values': 1448069.0}
{'_id': 2000, 'Sum of values': 1471350.0}
{'_id': 1999, 'Sum of values': 1489661.0}
{'_id': 1998, 'Sum of values': 151

In [59]:
# SELECT Name, avg(value) as "Average Rank" FROM girlsUpdate WHERE Type = 'Rank' GROUP BY Name ORDER BY "Average Rank";
# This pipeline involves 3 stages: $match, $group and $sort
printDoc(db.girlsUpdate.aggregate( [ { "$match" : {"Type": "Rank"} },
                                     { "$group" : { "_id" : "$Name", "Average Rank": {"$avg": "$value"} }},
                                     { "$sort" : {"Average Rank" : 1}}
                                     ] ))

{'_id': 'Emily', 'Average Rank': 4.461538461538462}
{'_id': 'Olivia', 'Average Rank': 5.076923076923077}
{'_id': 'Sophie', 'Average Rank': 9.23076923076923}
{'_id': 'Jessica', 'Average Rank': 9.923076923076923}
{'_id': 'Charlotte', 'Average Rank': 13.461538461538462}
{'_id': 'Chloe', 'Average Rank': 14.0}
{'_id': 'Grace', 'Average Rank': 15.192307692307692}
{'_id': 'Amelia', 'Average Rank': 15.73076923076923}
{'_id': 'Ella', 'Average Rank': 18.23076923076923}
{'_id': 'Lily', 'Average Rank': 21.884615384615383}
{'_id': 'Mia', 'Average Rank': 24.076923076923077}
{'_id': 'Lucy', 'Average Rank': 26.115384615384617}
{'_id': 'Ellie', 'Average Rank': 30.423076923076923}
{'_id': 'Isabella', 'Average Rank': 30.615384615384617}
{'_id': 'Alice', 'Average Rank': 31.692307692307693}
{'_id': 'Phoebe', 'Average Rank': 32.15384615384615}
{'_id': 'Hannah', 'Average Rank': 32.65384615384615}
{'_id': 'Daisy', 'Average Rank': 32.88461538461539}
{'_id': 'Holly', 'Average Rank': 33.53846153846154}
{'_id': '
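
The sorted output is long, so it may help to keep only the first few names. A minimal sketch, assuming a `$limit` stage is appended to the same three stages; `top_average_rank_pipeline` is a hypothetical helper name, not something defined in this notebook.

```python
def top_average_rank_pipeline(n=10):
    """Build the $match/$group/$sort pipeline used above, plus a $limit stage."""
    return [
        {"$match": {"Type": "Rank"}},
        {"$group": {"_id": "$Name", "Average Rank": {"$avg": "$value"}}},
        {"$sort": {"Average Rank": 1}},
        # Keep only the first n documents after sorting
        {"$limit": n},
    ]

pipeline = top_average_rank_pipeline(5)
print(pipeline)
# With a live connection this could be run as:
#   printDoc(db.girlsUpdate.aggregate(pipeline))
```

The `$limit` has to come after the `$sort`; placed earlier it would cut the input before the names are ranked.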

# Joins

Early versions of MongoDB could not join documents. Version 3.2 introduced the `$lookup` aggregation stage, which performs a left outer join between two collections ([Mongo docs](https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/)).

MongoDB provides joins as a stage in the aggregation pipeline.

These examples use the boys and girls collections. As in a relational database, you join on a field that is common to both collections.

In this case *Name* is the common field in both collections.

First, convert each collection to a dataframe and do a quick check for names that appear in both datasets.

In [60]:
boys_df = pd.DataFrame(boys.find())
girls_df = pd.DataFrame(girls.find())

In [61]:
boys_df[boys_df['Name'].isin(girls_df['Name'])]

Unnamed: 0,_id,Name,2021 Rank,2021 Count,2020 Rank,2020 Count,2019 Rank,2019 Count,2018 Rank,2018 Count,...,2000 Rank,2000 Count,1999 Rank,1999 Count,1998 Rank,1998 Count,1997 Rank,1997 Count,1996 Rank,1996 Count
27,659dc8a1fea004febb02fd47,Aadya,,,,,,,,,...,,,,,,,,,,
100,659dc8a1fea004febb02fd90,Aarya,,,,,,,,,...,,,,,,,,,,
331,659dc8a1fea004febb02fe77,Abeer,4789.0,3.0,4608.0,3.0,3937.0,4.0,3959.0,4.0,...,,,,,,,,,,
332,659dc8a1fea004febb02fe78,Abel,178.0,301.0,210.0,242.0,224.0,226.0,220.0,235.0,...,794.0,21.0,1281.0,10.0,924.0,16.0,1010.0,14.0,1337.0,9.0
355,659dc8a1fea004febb02fe8f,Abie,4789.0,3.0,,,4702.0,3.0,,,...,,,,,2901.0,3.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16728,659dc8a1fea004febb033e84,Zola,,,,,,,,,...,,,,,,,,,,
16731,659dc8a1fea004febb033e87,Zora,1688.0,14.0,2404.0,8.0,2728.0,7.0,3412.0,5.0,...,,,,,,,,,,
16760,659dc8a1fea004febb033ea4,Zuri,2730.0,7.0,3848.0,4.0,,,,,...,,,,,,,,,,
16761,659dc8a1fea004febb033ea5,Zuriel,1250.0,21.0,1431.0,17.0,2488.0,8.0,3412.0,5.0,...,,,,,,,,,,


In [62]:
girls_df[girls_df['Name'].isin(boys_df['Name'])]

Unnamed: 0,_id,Name,2021 Rank,2021 Count,2016 Rank,2016 Count,2017 Rank,2017 Count,2015 Rank,2015 Count,...,2006 Rank,2006 Count,2000 Rank,2000 Count,1999 Rank,1999 Count,1997 Rank,1997 Count,1996 Rank,1996 Count
29,659dc8a1d6a53dd7f677f54f,Aadya,1785.0,16.0,1241.0,27.0,1911.0,15.0,1601.0,19.0,...,,,,,,,,,,
138,659dc8a1d6a53dd7f677f5bc,Aarya,581.0,67.0,615.0,69.0,701.0,58.0,602.0,70.0,...,3116.0,6.0,,,,,,,,
239,659dc8a1d6a53dd7f677f621,Abeer,,,5785.0,3.0,4063.0,5.0,4736.0,4.0,...,1862.0,12.0,,,2444.0,6.0,,,,
243,659dc8a1d6a53dd7f677f625,Abel,5581.0,3.0,,,,,,,...,,,,,,,,,,
249,659dc8a1d6a53dd7f677f62b,Abie,,,,,,,,,...,,,,,3225.0,4.0,3824.0,3.0,3795.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21877,659dc8a1d6a53dd7f6784aa7,Zola,1454.0,21.0,2367.0,11.0,1995.0,14.0,3615.0,6.0,...,2509.0,8.0,,,3225.0,4.0,2738.0,5.0,,
21881,659dc8a1d6a53dd7f6784aab,Zora,1195.0,28.0,4763.0,4.0,3583.0,6.0,2328.0,11.0,...,4157.0,4.0,,,3900.0,3.0,,,,
21928,659dc8a1d6a53dd7f6784ada,Zuri,478.0,88.0,1172.0,29.0,1371.0,24.0,2070.0,13.0,...,,,,,,,,,,
21938,659dc8a1d6a53dd7f6784ae4,Zuriel,4634.0,4.0,4763.0,4.0,4763.0,4.0,4736.0,4.0,...,,,,,,,,,,


There appear to be quite a few matching names in both datasets.
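
To put a number on the overlap, the names can be compared as sets. This is a sketch under assumptions: `common_names` is a hypothetical helper, and the two sample lists are made up so the block runs without a database.

```python
def common_names(boys_docs, girls_docs):
    """Return the sorted names that occur in both iterables of documents."""
    boys_names = {doc["Name"] for doc in boys_docs}
    girls_names = {doc["Name"] for doc in girls_docs}
    return sorted(boys_names & girls_names)

# Hypothetical samples; with a live connection you could instead pass
# boys.find({}, {"Name": 1}) and girls.find({}, {"Name": 1})
boys_sample = [{"Name": "Alex"}, {"Name": "Abel"}, {"Name": "Oliver"}]
girls_sample = [{"Name": "Alex"}, {"Name": "Abel"}, {"Name": "Olivia"}]

shared = common_names(boys_sample, girls_sample)
print(len(shared), shared)
```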

In [63]:
# Check to see if Alex appears in both, plus only show some fields
girls.find_one({"Name" : "Alex"}, {"_id":0, "Name": 1, "2021 Rank": 1, "2021 Count": 1})

{'Name': 'Alex', '2021 Rank': 1866.0, '2021 Count': 15.0}

In [64]:
boys.find_one({"Name" : "Alex"}, {"_id":0, "Name": 1, "2021 Rank": 1, "2021 Count": 1})

{'Name': 'Alex', '2021 Rank': 122.0, '2021 Count': 474.0}

`$lookup` carries out a left outer join: every document from the input collection is kept, and `as` names a new array field holding the matching documents from the other collection. Where a name has no match, that array is empty. Really we want an inner join, so further work is needed to keep only the documents whose `as` array is not empty. Following `$lookup` with `$unwind` does this, because by default `$unwind` drops documents whose array is empty.

See https://www.mongodb.com/docs/manual/reference/operator/aggregation/unwind/ for further information on the `preserveNullAndEmptyArrays` option.

Thanks to https://stackoverflow.com/questions/37575722/how-to-do-inner-joining-in-mongodb for an example (accessed 08/01/2024).
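
The difference between the outer and the inner join can be sketched in plain Python. The `lookup` and `unwind` functions below are simplified emulations I wrote for illustration (they are not pymongo calls), run on a two-document sample that reuses the 2021 counts for Alex shown above; the Olivia document is made up.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Emulate $lookup: attach an array of matching foreign documents (left outer join)."""
    out = []
    for doc in local_docs:
        found = [f for f in foreign_docs if f.get(foreign_field) == doc.get(local_field)]
        out.append({**doc, as_field: found})
    return out

def unwind(docs, field, preserve_empty=False):
    """Emulate $unwind: one output document per array element (simplified)."""
    out = []
    for doc in docs:
        items = doc.get(field) or []
        if not items and preserve_empty:
            # Keep the document, without the empty array field
            out.append({k: v for k, v in doc.items() if k != field})
        for item in items:
            out.append({**doc, field: item})
    return out

girls_sample = [{"Name": "Alex", "2021 Count": 15.0}, {"Name": "Olivia", "2021 Count": 1.0}]
boys_sample = [{"Name": "Alex", "2021 Count": 474.0}]

outer = lookup(girls_sample, boys_sample, "Name", "Name", "joined")
inner = unwind(outer, "joined")
print(len(outer), len(inner))
```

Here `outer` still contains the unmatched name with an empty `joined` array, while `inner` keeps only the name found in both samples.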

In [65]:
list(girls.aggregate([
   {
     "$lookup":
       {
         "from": "boys",
         "localField": "Name",
         "foreignField": "Name",
         "as": "joined"        
       }
   }, 
    {"$unwind": {
           "path": "$joined",
           "preserveNullAndEmptyArrays": False
   }}    
  ]))

[{'_id': ObjectId('659dc8a1d6a53dd7f677f54f'),
  'Name': 'Aadya',
  '2021 Rank': 1785.0,
  '2021 Count': 16.0,
  '2020 Rank': 2042.0,
  '2020 Count': 13.0,
  '2019 Rank': 1497.0,
  '2019 Count': 20.0,
  '2018 Rank': 1393.0,
  '2018 Count': 23.0,
  '2017 Rank': 1911.0,
  '2017 Count': 15.0,
  '2016 Rank': 1241.0,
  '2016 Count': 27.0,
  '2015 Rank': 1601.0,
  '2015 Count': 19.0,
  '2014 Rank': 1501.0,
  '2014 Count': 20.0,
  '2013 Rank': 1253.0,
  '2013 Count': 27.0,
  '2012 Rank': 2215.0,
  '2012 Count': 12.0,
  '2011 Rank': 2432.0,
  '2011 Count': 10.0,
  '2010 Rank': 4688.0,
  '2010 Count': 4.0,
  '2009 Rank': 5556.0,
  '2009 Count': 3.0,
  '2008 Rank': 3860.0,
  '2008 Count': 5.0,
  '2007 Rank': 2895.0,
  '2007 Count': 7.0,
  'joined': {'_id': ObjectId('659dc8a1fea004febb02fd47'),
   'Name': 'Aadya',
   '2021 Rank': nan,
   '2021 Count': nan,
   '2020 Rank': nan,
   '2020 Count': nan,
   '2019 Rank': nan,
   '2019 Count': nan,
   '2018 Rank': nan,
   '2018 Count': nan,
   '2017 Rank

# Semi-Structured data

The Baby Names dataset is an example of structured data: it is very uniform, with the same data types in each column.

The power of NoSQL databases is in coping with semi-structured data, such as JSON data, where the values may not be straightforward strings and numbers but could be nested documents.



## USA Government data

A lot of publicly available data is in JSON format, for example data published by government agencies:

- https://catalog.data.gov/dataset?res_format=JSON

- https://github.com/jdorfman/awesome-json-datasets#government

The examples below use the USA government politician datasets found via the last link.

- Current US Senators: role.json
- Current US Representatives: role-reps.json

Plus a list of USA States and abbreviations:
- states_titlecase.json

found here: https://gist.github.com/mshafrir/2646763

All downloaded: 09/01/2024

In [66]:
# let's have a look at the data
!head data/role.json

{
 "meta": {
  "limit": 100,
  "offset": 0,
  "total_count": 100
 },
 "objects": [
  {
   "caucus": null,
   "congress_numbers": [


In [67]:
!tail data/role.json

   "senator_rank": "junior",
   "senator_rank_label": "Junior",
   "startdate": "2023-10-03",
   "state": "CA",
   "title": "Sen.",
   "title_long": "Senator",
   "website": "https://www.butler.senate.gov"
  }
 ]
}

In [68]:
! head data/role-reps.json

{
 "meta": {
  "limit": 438,
  "offset": 0,
  "total_count": 439
 },
 "objects": [
  {
   "caucus": null,
   "congress_numbers": [


In [69]:
!tail data/role-reps.json

   "senator_class": null,
   "senator_rank": null,
   "startdate": "2023-11-07",
   "state": "RI",
   "title": "Rep.",
   "title_long": "Representative",
   "website": ""
  }
 ]
}

In [70]:
# note, this file was amended to remove the commas between each document (otherwise it would not import)
!head data/states_titlecase.json

{
"name": "Alabama",
"abbreviation": "AL"
}
{
"name": "Alaska",
"abbreviation": "AK"
}
{
"name": "American Samoa",


In [71]:
!tail data/states_titlecase.json

"abbreviation": "WV"
}
{
"name": "Wisconsin",
"abbreviation": "WI"
}
{
"name": "Wyoming",
"abbreviation": "WY"
}

The politicians data looks to be in JSON format and appears to have some metadata at the start.

In [72]:
client.drop_database('politicsDB')

In [73]:
# import the data using mongoimport, note the type is now json
! mongoimport --db politicsDB --type=json --file data/role.json  --collection senators
! mongoimport --db politicsDB --type=json --file data/role-reps.json --collection reps
! mongoimport --db politicsDB --type=json --file data/states_titlecase.json --collection states

2024-01-09T22:30:34.757+0000	connected to: mongodb://localhost/
2024-01-09T22:30:34.780+0000	1 document(s) imported successfully. 0 document(s) failed to import.
2024-01-09T22:30:34.908+0000	connected to: mongodb://localhost/
2024-01-09T22:30:34.957+0000	1 document(s) imported successfully. 0 document(s) failed to import.
2024-01-09T22:30:35.089+0000	connected to: mongodb://localhost/
2024-01-09T22:30:35.105+0000	59 document(s) imported successfully. 0 document(s) failed to import.
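
The log reports a single document for each politician file, because each file is one JSON object wrapping an `objects` array. An alternative, not used in this notebook, would be to split that array into one JSON document per line before importing. The sketch below assumes that approach; `split_objects` and the sample file names are hypothetical, and the demonstration runs on a tiny made-up file rather than on `data/role.json`.

```python
import json
import os
import tempfile

def split_objects(src_path, dst_path, array_key="objects"):
    """Write each element of the top-level array as its own JSON line."""
    with open(src_path, encoding="utf-8") as src:
        wrapper = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for item in wrapper[array_key]:
            dst.write(json.dumps(item) + "\n")
    return len(wrapper[array_key])

# Demonstration on a small hypothetical file with the same wrapper shape
tmp_dir = tempfile.mkdtemp()
src_file = os.path.join(tmp_dir, "role_sample.json")
dst_file = os.path.join(tmp_dir, "role_sample_lines.json")
with open(src_file, "w", encoding="utf-8") as f:
    json.dump({"meta": {"total_count": 2},
               "objects": [{"state": "CA"}, {"state": "RI"}]}, f)

written = split_objects(src_file, dst_file)
print(written)
```

Importing a file prepared this way should give one document per politician. The notebook keeps the original single-document layout, which is what makes `$unwind` necessary later on.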


In [74]:
# Change database
db = client.politicsDB
senators = db.senators
reps = db.reps
states = db.states

In [75]:
# check a document in each collection
senators.find_one()

{'_id': ObjectId('659dc90a1d4f949af6581722'),
 'meta': {'limit': 100, 'offset': 0, 'total_count': 100},
 'objects': [{'caucus': None,
   'congress_numbers': [116, 117, 118],
   'current': True,
   'description': 'Junior Senator for Washington',
   'district': None,
   'enddate': '2025-01-03',
   'extra': {'address': '511 Hart Senate Office Building Washington DC 20510',
    'contact_form': 'https://www.cantwell.senate.gov/public/index.cfm/email-maria',
    'office': '511 Hart Senate Office Building',
    'rss_url': 'http://www.cantwell.senate.gov/public/index.cfm/rss/feed'},
   'leadership_title': None,
   'party': 'Democrat',
   'person': {'bioguideid': 'C000127',
    'birthday': '1958-10-13',
    'cspanid': 26137,
    'fediverse_webfinger': None,
    'firstname': 'Maria',
    'gender': 'female',
    'gender_label': 'Female',
    'lastname': 'Cantwell',
    'link': 'https://www.govtrack.us/congress/members/maria_cantwell/300018',
    'middlename': '',
    'name': 'Sen. Maria Cantwell 

In [76]:
reps.find_one()

{'_id': ObjectId('659dc90a9bf569e5749cf8fe'),
 'meta': {'limit': 438, 'offset': 0, 'total_count': 439},
 'objects': [{'caucus': None,
   'congress_numbers': [118],
   'current': True,
   'description': "Representative for Alabama's 4th congressional district",
   'district': 4,
   'enddate': '2025-01-03',
   'extra': {'address': '266 Cannon House Office Building Washington DC 20515-0104',
    'office': '266 Cannon House Office Building',
    'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
   'leadership_title': None,
   'party': 'Republican',
   'person': {'bioguideid': 'A000055',
    'birthday': '1965-07-22',
    'cspanid': 45516,
    'fediverse_webfinger': None,
    'firstname': 'Robert',
    'gender': 'male',
    'gender_label': 'Male',
    'lastname': 'Aderholt',
    'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
    'middlename': 'B.',
    'name': 'Rep. Robert Aderholt [R-AL4]',
    'namemod': '',
    'nickname': '',
    'osid': 'N

In [77]:
states.find_one()

{'_id': ObjectId('659dc90b603c8c105c204428'),
 'name': 'Florida',
 'abbreviation': 'FL'}

We can see that the politician details have nested documents, where a document (or array) is nested within other information. This is an example of semi-structured data.

The dot syntax can be used to search nested documents. For example, the snippet below, taken from the output above, shows which keys are available within the `person` sub-document:

<pre>
'person': {'bioguideid': 'A000055',
    'birthday': '1965-07-22',
    'cspanid': 45516,
    'fediverse_webfinger': None,
    'firstname': 'Robert',
    'gender': 'male',
    'gender_label': 'Male',
    'lastname': 'Aderholt',
    'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
    'middlename': 'B.',
    'name': 'Rep. Robert Aderholt [R-AL4]',
    'namemod': '',
    'nickname': '',
    'osid': 'N00003028',
    'pvsid': None,
    'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
    'twitterid': 'Robert_Aderholt',
    'youtubeid': 'RobertAderholt'},
</pre>    

To pull out the full information for this representative, I'm assuming a dot-notation query on `bioguideid` will do:

In [78]:
reps.find_one({"objects.person.bioguideid": 'A000055' })

{'_id': ObjectId('659dc90a9bf569e5749cf8fe'),
 'meta': {'limit': 438, 'offset': 0, 'total_count': 439},
 'objects': [{'caucus': None,
   'congress_numbers': [118],
   'current': True,
   'description': "Representative for Alabama's 4th congressional district",
   'district': 4,
   'enddate': '2025-01-03',
   'extra': {'address': '266 Cannon House Office Building Washington DC 20515-0104',
    'office': '266 Cannon House Office Building',
    'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
   'leadership_title': None,
   'party': 'Republican',
   'person': {'bioguideid': 'A000055',
    'birthday': '1965-07-22',
    'cspanid': 45516,
    'fediverse_webfinger': None,
    'firstname': 'Robert',
    'gender': 'male',
    'gender_label': 'Male',
    'lastname': 'Aderholt',
    'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
    'middlename': 'B.',
    'name': 'Rep. Robert Aderholt [R-AL4]',
    'namemod': '',
    'nickname': '',
    'osid': 'N

Hmmm, this has found the representative, but all the politicians are stored in one document rather than one document per politician. The consequence is that when the query matches, all the data in that document is returned!
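
Why the whole document comes back can be sketched in plain Python. The functions below are a simplified emulation of dot-notation matching that I wrote for illustration (real MongoDB matching has more rules); the sample document holds just two of the representatives seen in this notebook.

```python
def values_at(node, path):
    """Yield every value reached by a list of keys, descending into arrays on the way."""
    if not path:
        yield node
        return
    if isinstance(node, list):
        # An array is searched element by element, as dot notation does
        for element in node:
            yield from values_at(element, path)
    elif isinstance(node, dict) and path[0] in node:
        yield from values_at(node[path[0]], path[1:])

def matches(doc, dotted_path, wanted):
    """True if any value at the dotted path equals the wanted value."""
    return any(v == wanted for v in values_at(doc, dotted_path.split(".")))

reps_doc = {"objects": [
    {"person": {"bioguideid": "A000055", "firstname": "Robert"}},
    {"person": {"bioguideid": "B001257", "firstname": "Gus"}},
]}

found = [d for d in [reps_doc] if matches(d, "objects.person.bioguideid", "A000055")]
print(len(found), len(found[0]["objects"]))
```

The query is a yes/no test on the document as a whole: one array element satisfies it, so the single document is returned with every element of `objects` still inside.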

Extracting individual items from the array requires the `$unwind` stage.

For example, display just the first names, suppressing the generated `_id`:

In [79]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1}},
                        {"$unwind":"$objects"} ]) 
printDocs(docs)

{'objects': {'person': {'firstname': 'Robert'}}}
{'objects': {'person': {'firstname': 'Gus'}}}
{'objects': {'person': {'firstname': 'Sanford'}}}
{'objects': {'person': {'firstname': 'Earl'}}}
{'objects': {'person': {'firstname': 'Vern'}}}
{'objects': {'person': {'firstname': 'Larry'}}}
{'objects': {'person': {'firstname': 'Michael'}}}
{'objects': {'person': {'firstname': 'Ken'}}}
{'objects': {'person': {'firstname': 'André'}}}
{'objects': {'person': {'firstname': 'John'}}}
{'objects': {'person': {'firstname': 'Kathy'}}}
{'objects': {'person': {'firstname': 'Judy'}}}
{'objects': {'person': {'firstname': 'Yvette'}}}
{'objects': {'person': {'firstname': 'Emanuel'}}}
{'objects': {'person': {'firstname': 'James'}}}
{'objects': {'person': {'firstname': 'Steve'}}}
{'objects': {'person': {'firstname': 'Tom'}}}
{'objects': {'person': {'firstname': 'Gerald'}}}
{'objects': {'person': {'firstname': 'Jim'}}}
{'objects': {'person': {'firstname': 'Joe'}}}
{'objects': {'person': {'firstname': 'Eric'}}

In [80]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'firstname': 'Robert',
                        'lastname': 'Aderholt'}}}


Do make sure the pipeline stages are in the right order. If the `$match` is done too soon (before the `$unwind`), it is applied to the whole document, so when the criteria are met all the representatives are returned again:

In [81]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"}
                      ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'firstname': 'Robert',
                        'lastname': 'Aderholt'}}}
{'objects': {'person': {'bioguideid': 'B001257',
                        'firstname': 'Gus',
                        'lastname': 'Bilirakis'}}}
{'objects': {'person': {'bioguideid': 'B000490',
                        'firstname': 'Sanford',
                        'lastname': 'Bishop'}}}
{'objects': {'person': {'bioguideid': 'B000574',
                        'firstname': 'Earl',
                        'lastname': 'Blumenauer'}}}
{'objects': {'person': {'bioguideid': 'B001260',
                        'firstname': 'Vern',
                        'lastname': 'Buchanan'}}}
{'objects': {'person': {'bioguideid': 'B001275',
                        'firstname': 'Larry',
                        'lastname': 'Bucshon'}}}
{'objects': {'person': {'bioguideid': 'B001248',
                        'firstname': 'Michael',
                        'lastname'

In [82]:
# or if in doubt duplicate the $match as discussed here:
# https://stackoverflow.com/questions/54030089/how-to-use-unwind-and-match-with-mongodb

docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'firstname': 'Robert',
                        'lastname': 'Aderholt'}}}
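
The effect of stage order can be reproduced in plain Python. This is a rough emulation under my own assumptions (`unwind_objects` and `match_bioguideid` are hypothetical helpers, not pymongo calls), using a sample document with two of the representatives shown above.

```python
def unwind_objects(docs):
    """Emulate {"$unwind": "$objects"}: one document per array element."""
    return [{**doc, "objects": item} for doc in docs for item in doc["objects"]]

def match_bioguideid(docs, wanted):
    """Emulate $match on objects.person.bioguideid, before or after unwinding."""
    kept = []
    for doc in docs:
        objs = doc["objects"]
        # Before $unwind "objects" is a list; afterwards it is a single sub-document
        people = objs if isinstance(objs, list) else [objs]
        if any(o["person"]["bioguideid"] == wanted for o in people):
            kept.append(doc)
    return kept

reps_doc = {"objects": [
    {"person": {"bioguideid": "A000055", "firstname": "Robert"}},
    {"person": {"bioguideid": "B001257", "firstname": "Gus"}},
]}

match_then_unwind = unwind_objects(match_bioguideid([reps_doc], "A000055"))
unwind_then_match = match_bioguideid(unwind_objects([reps_doc]), "A000055")
print(len(match_then_unwind), len(unwind_then_match))
```

Matching first keeps the whole document and then unwinds every element, whereas unwinding first lets the match filter individual representatives. On a collection with many documents, an early `$match` can still be worthwhile to discard non-matching documents before the `$unwind`, which is the reason for duplicating it.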


In [83]:
# show all the person details, here with the duplicate $match
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Aderholt',
                        'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
                        'middlename': 'B.',
                        'name': 'Rep. Robert Aderholt [R-AL4]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00003028',
                        'pvsid': None,
                        'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
                        'twitterid': 'Robert_Aderholt',
                        'youtubeid': 'RobertAderholt'}}}


In [84]:
# show all the person details again; a single $match after the $unwind gives the same result
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Aderholt',
                        'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
                        'middlename': 'B.',
                        'name': 'Rep. Robert Aderholt [R-AL4]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00003028',
                        'pvsid': None,
                        'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
                        'twitterid': 'Robert_Aderholt',
                        'youtubeid': 'RobertAderholt'}}}


`person` is only part of the details stored for each representative.

In [85]:
# show all the details for this representative 
docs = reps.aggregate([{"$project" : {"_id": 0, "objects" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Alabama's 4th congressional "
                            'district',
             'district': 4,
             'enddate': '2025-01-03',
             'extra': {'address': '266 Cannon House Office Building Washington '
                                  'DC 20515-0104',
                       'office': '266 Cannon House Office Building',
                       'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                 

In [86]:
# if you don't know your American states, join up the states collection
joined = reps.aggregate([
     {"$unwind":"$objects"},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }
  }
])
printDocs(joined)

{'_id': ObjectId('659dc90a9bf569e5749cf8fe'),
 'meta': {'limit': 438, 'offset': 0, 'total_count': 439},
 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Alabama's 4th congressional "
                            'district',
             'district': 4,
             'enddate': '2025-01-03',
             'extra': {'address': '266 Cannon House Office Building Washington '
                                  'DC 20515-0104',
                       'office': '266 Cannon House Office Building',
                       'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
   

                        'name': 'Rep. Mike Levin [D-CA49]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00040667',
                        'pvsid': None,
                        'sortname': 'Levin, Mike (Rep.) [D-CA49]',
                        'twitterid': 'RepMikeLevin',
                        'youtubeid': None},
             'phone': '202-225-3906',
             'role_type': 'representative',
             'role_type_label': 'Representative',
             'senator_class': None,
             'senator_rank': None,
             'startdate': '2023-01-03',
             'state': 'CA',
             'title': 'Rep.',
             'title_long': 'Representative',
             'website': 'https://mikelevin.house.gov'},
 'stateInfo': [{'_id': ObjectId('659dc90b603c8c105c20444c'),
                'abbreviation': 'CA',
                'name': 'California'}]}
{'_id': ObjectId('659dc90a9bf569e5749cf8fe'),
 'meta': {'limit': 438, 'o

 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Indiana's 6th congressional "
                            'district',
             'district': 6,
             'enddate': '2025-01-03',
             'extra': {'address': '404 Cannon House Office Building Washington '
                                  'DC 20515-1406',
                       'office': '404 Cannon House Office Building'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'P000615',
                        'birthday': '1956-11-14',
                        'cspanid': None,
                        'fediverse_webfinger': None,
                        'firstname': 'Greg',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Pence',
                        'link': 'https://www.govtrack.us/congress/m

                        'link': 'https://www.govtrack.us/congress/members/daniel_meuser/412811',
                        'middlename': '',
                        'name': 'Rep. Daniel Meuser [R-PA9]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00029416',
                        'pvsid': None,
                        'sortname': 'Meuser, Daniel (Rep.) [R-PA9]',
                        'twitterid': 'RepMeuser',
                        'youtubeid': None},
             'phone': '202-225-6511',
             'role_type': 'representative',
             'role_type_label': 'Representative',
             'senator_class': None,
             'senator_rank': None,
             'startdate': '2023-01-03',
             'state': 'PA',
             'title': 'Rep.',
             'title_long': 'Representative',
             'website': 'https://meuser.house.gov'},
 'stateInfo': [{'_id': ObjectId('659dc90b603c8c105c204452'),
             

 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Georgia's 9th congressional "
                            'district',
             'district': 9,
             'enddate': '2025-01-03',
             'extra': {'address': '445 Cannon House Office Building Washington '
                                  'DC 20515-1009',
                       'office': '445 Cannon House Office Building'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'C001116',
                        'birthday': '1963-11-22',
                        'cspanid': None,
                        'fediverse_webfinger': None,
                        'firstname': 'Andrew',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Clyde',
                        'link': 'https://www.govtrack.us/congress

 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Texas's 6th congressional "
                            'district',
             'district': 6,
             'enddate': '2025-01-03',
             'extra': {'address': '1721 Longworth House Office Building '
                                  'Washington DC 20515-4306',
                       'office': '1721 Longworth House Office Building'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'E000071',
                        'birthday': '1970-01-24',
                        'cspanid': None,
                        'fediverse_webfinger': None,
                        'firstname': 'Jake',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Ellzey',
                        'link': 'https://www.govtrack.us/con

             'person': {'bioguideid': 'C001132',
                        'birthday': '1980-01-03',
                        'cspanid': None,
                        'fediverse_webfinger': None,
                        'firstname': 'Eli',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Crane',
                        'link': 'https://www.govtrack.us/congress/members/eli_crane/456879',
                        'middlename': '',
                        'name': 'Rep. Eli Crane [R-AZ2]',
                        'namemod': '',
                        'nickname': '',
                        'osid': None,
                        'pvsid': None,
                        'sortname': 'Crane, Eli (Rep.) [R-AZ2]',
                        'twitterid': 'RepEliCrane',
                        'youtubeid': None},
             'phone': '202-225-3361',
             'role_type': 'representative',
             'role_type_label': 'Rep

             'extra': {'address': '443 Cannon House Office Building Washington '
                                  'DC 20515-2404',
                       'office': '443 Cannon House Office Building'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'E000235',
                        'birthday': '1959-04-06',
                        'cspanid': None,
                        'fediverse_webfinger': None,
                        'firstname': 'Mike',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Ezell',
                        'link': 'https://www.govtrack.us/congress/members/mike_ezell/456911',
                        'middlename': '',
                        'name': 'Rep. Mike Ezell [R-MS4]',
                        'namemod': '',
                        'nickname': '',
                        'osid': None,
                        'pvsid': None,

                        'gender': 'female',
                        'gender_label': 'Female',
                        'lastname': 'Kiggans',
                        'link': 'https://www.govtrack.us/congress/members/jennifer_kiggans/456947',
                        'middlename': 'Ann',
                        'name': 'Rep. Jennifer Kiggans [R-VA2]',
                        'namemod': '',
                        'nickname': '',
                        'osid': None,
                        'pvsid': None,
                        'sortname': 'Kiggans, Jennifer (Rep.) [R-VA2]',
                        'twitterid': None,
                        'youtubeid': None},
             'phone': '202-225-4215',
             'role_type': 'representative',
             'role_type_label': 'Representative',
             'senator_class': None,
             'senator_rank': None,
             'startdate': '2023-01-03',
             'state': 'VA',
             'title': 'Rep.',
             'title_long': 'Repre

In [87]:
# So what does AL mean for our representative? Join to the states collection to find out
joined = reps.aggregate([
     {"$unwind":"$objects"},
     # Filter first, so that $lookup only runs for the one representative we want
     {"$match": { "objects.person.bioguideid": 'A000055' }},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }}
])
printDocs(joined)


{'_id': ObjectId('659dc90a9bf569e5749cf8fe'),
 'meta': {'limit': 438, 'offset': 0, 'total_count': 439},
 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Alabama's 4th congressional "
                            'district',
             'district': 4,
             'enddate': '2025-01-03',
             'extra': {'address': '266 Cannon House Office Building Washington '
                                  'DC 20515-0104',
                       'office': '266 Cannon House Office Building',
                       'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
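
The `$lookup` stage above behaves like a left outer join: every input document is kept, and the matching documents from the other collection are attached as an array. The following is a minimal plain-Python sketch of that idea, not MongoDB's implementation. The sample documents, the `name` key in the states data and the `lookup` helper are all assumptions made for illustration.

```python
# Hypothetical sample data, shaped like the unwound reps documents and the states collection
reps_unwound = [
    {"objects": {"state": "AL", "person": {"lastname": "Aderholt"}}},
    {"objects": {"state": "ZZ", "person": {"lastname": "Nobody"}}},  # no matching state
]
states = [
    {"abbreviation": "AL", "name": "Alabama"},   # "name" is an assumed key
    {"abbreviation": "VA", "name": "Virginia"},
]

def lookup(left_docs, right_docs, local_path, foreign_field, as_field):
    """Attach to each left-hand document the list of right-hand documents that match."""
    joined = []
    for doc in left_docs:
        # Follow the dotted path, e.g. "objects.state"
        value = doc
        for key in local_path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
        matches = [r for r in right_docs if r.get(foreign_field) == value]
        # The left-hand document is kept even when matches is empty (left outer join)
        joined.append({**doc, as_field: matches})
    return joined

result = lookup(reps_unwound, states, "objects.state", "abbreviation", "stateInfo")
for doc in result:
    print(doc["objects"]["person"]["lastname"], doc["stateInfo"])
```

The second sample document has no matching state, so its `stateInfo` is an empty list rather than the document being dropped.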
   

In [88]:
# The senator data is similarly structured; let's find any female senators
docs = senators.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.gender":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.gender": 'female' }}
                      ])
printDocs(docs)

{'objects': {'person': {'firstname': 'Maria',
                        'gender': 'female',
                        'lastname': 'Cantwell'}}}
{'objects': {'person': {'firstname': 'Debbie',
                        'gender': 'female',
                        'lastname': 'Stabenow'}}}
{'objects': {'person': {'firstname': 'Tammy',
                        'gender': 'female',
                        'lastname': 'Baldwin'}}}
{'objects': {'person': {'firstname': 'Marsha',
                        'gender': 'female',
                        'lastname': 'Blackburn'}}}
{'objects': {'person': {'firstname': 'Mazie',
                        'gender': 'female',
                        'lastname': 'Hirono'}}}
{'objects': {'person': {'firstname': 'Kirsten',
                        'gender': 'female',
                        'lastname': 'Gillibrand'}}}
{'objects': {'person': {'firstname': 'Amy',
                        'gender': 'female',
                        'lastname': 'Klobuchar'}}}
{'objects': {'per

In [89]:
# find the Senior Senator for Michigan

docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)

{'objects': {'caucus': None,
             'congress_numbers': [116, 117, 118],
             'current': True,
             'description': 'Senior Senator for Michigan',
             'district': None,
             'enddate': '2025-01-03',
             'extra': {'address': '731 Hart Senate Office Building Washington '
                                  'DC 20510',
                       'contact_form': 'https://www.stabenow.senate.gov/contact',
                       'office': '731 Hart Senate Office Building',
                       'rss_url': 'http://stabenow.senate.gov/rss/?p=news'},
             'leadership_title': 'Senate Democratic Policy & Communications '
                                 'Committee Chair',
             'party': 'Democrat',
             'person': {'bioguideid': 'S000770',
                        'birthday': '1950-04-29',
                        'cspanid': 45451,
                        'fediverse_webfinger': None,
                        'firstname': 'Debbie',
     

In [90]:
# Finally, remember that because there is no schema you can query a field that does not exist:
# MongoDB simply returns no matches and does not warn you! (description is not a key of objects.person)
docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)
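
One way to guard against a misspelt or misplaced field is to check a sample document for the dotted path before querying. The sketch below does this in plain Python. `has_path` is a hypothetical helper, and the sample document is a cut-down assumption based on the senator output shown earlier.

```python
def has_path(doc, dotted_path):
    """Return True if the dotted path can be followed through nested dicts and lists of dicts."""
    current = [doc]
    for key in dotted_path.split("."):
        next_level = []
        for item in current:
            # A list is searched element by element, as MongoDB does for arrays
            candidates = item if isinstance(item, list) else [item]
            for candidate in candidates:
                if isinstance(candidate, dict) and key in candidate:
                    next_level.append(candidate[key])
        if not next_level:
            return False
        current = next_level
    return True

# Hypothetical, cut-down senator document
sample = {"objects": [{"description": "Senior Senator for Michigan",
                       "person": {"firstname": "Debbie"}}]}

print(has_path(sample, "objects.description"))         # the path used in cell 89
print(has_path(sample, "objects.person.description"))  # the path used in cell 90
```

On the server side, a filter using the `$exists` operator, for example `{"objects.person.description": {"$exists": True}}` passed to `count_documents`, should give a similar check across the whole collection.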

## Twitter Data

Another example of semi-structured data is Twitter (now X) data. 

Unfortunately, X now imposes a cost of at least $100 a month if you want to extract tweets (creating is still free!). Below are some examples of tweets extracted on the 10th and 11th January 2023. 

In [91]:
# Needed for Twitter data
import string
import operator
import re

In [92]:
# Good practice to examine the data before importing it
! head data/BBCNews-230110-2118.json

{"id": 1612920079967985706, "text": "Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun", "edit_history_tweet_ids": [1612920079967985706], "author_id": 612473, "context_annotations": [{"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "66", "name": "Interests and Hobbies Category", "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"}, "entity": {"id": "1237472346560053249", "name": "Firefighting"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "12374723465600

In [93]:
! tail data/BBCNews-230111-2214.json

{"id": 1612933985436409857, "text": "Tudor Bible sells for \u00a320k in Belfast auction https://t.co/sVMypOIW2I", "edit_history_tweet_ids": [1612933985436409857], "author_id": 612473, "context_annotations": [{"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "69", "name": "News Vertical", "description": "News Categories like Entertainment or Technology"}, "entity": {"id": "1331946773263253506", "name": "Northern Ireland national news"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "1331946773263253506", "name": "Northern Ire

In [94]:
client.drop_database('twitterDB')
client.list_database_names()

['accidents', 'admin', 'babyNamesDB', 'config', 'local', 'politicsDB']

In [95]:
# Load the BBC News tweets from 10th and 11th January into a twitterDB database
! mongoimport --db twitterDB  --file data/BBCNews-230110-2118.json --collection bbcnews
! mongoimport --db twitterDB  --file data/BBCNews-230111-2214.json --collection bbcnews

2024-01-09T22:30:36.403+0000	connected to: mongodb://localhost/
2024-01-09T22:30:36.431+0000	100 document(s) imported successfully. 0 document(s) failed to import.
2024-01-09T22:30:36.557+0000	connected to: mongodb://localhost/
2024-01-09T22:30:36.580+0000	100 document(s) imported successfully. 0 document(s) failed to import.


In [96]:
client.list_database_names()

['accidents',
 'admin',
 'babyNamesDB',
 'config',
 'local',
 'politicsDB',
 'twitterDB']

In [97]:
# Change database
db = client.twitterDB
bbcnews = db.bbcnews
bbcnews.find_one()

{'_id': ObjectId('659dc90c557014bee011a695'),
 'id': 1612920079967985706,
 'text': 'Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun',
 'edit_history_tweet_ids': [1612920079967985706],
 'author_id': 612473,
 'context_annotations': [{'domain': {'id': '46',
    'name': 'Business Taxonomy',
    'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
   'entity': {'id': '1557697121477832705',
    'name': 'Publisher & News Business',
    'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books'}},
  {'domain': {'id': '66',
    'name': 'Interests and Hobbies Category',
    'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'},
   'entity': {'id': '1237472346560053249', 'name': 'Firefighting'}},
  {'domain': {'id': '131',
    'name': 'Unified Twitter Taxon

In [98]:
# What columns/keys does it have
# Some of these keys are subdocuments, such as the entities one seen above
bbcnews.find_one().keys()

dict_keys(['_id', 'id', 'text', 'edit_history_tweet_ids', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'edit_controls', 'entities', 'lang', 'public_metrics', 'reply_settings'])

In [99]:
# Prince Harry was topical this time last year! Is he mentioned at all?!
# $regex allows pattern matching. The 'i' option makes the search case insensitive
# "_id": 0 suppresses showing the object id
# SELECT text FROM bbcnews WHERE LOWER(text) LIKE '%harry%';

tweets = bbcnews.find({'text':{'$regex':'Harry', '$options': 'i'}}, {"_id":0,'text': 1})
printDoc(tweets)

{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB"}
{'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1"}
{'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC"}
{'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t'}
{'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64'}
{'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc'}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/XmfSToqCxu"}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/Ff92m8HC4g"}
{'text': "Who is Harry's ghostwriter, JD Moehringer - and how much did he make? https://t.co/P2TgFcLz8U"}
{'text': "Newspaper headlines: 'No way back' says Harry and ho
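
The SQL comments in these cells use `LIKE`, while MongoDB uses regular expressions. As a rough sketch of how the two relate, the hypothetical helper `like_to_regex` below translates a `LIKE` pattern into an anchored regular expression and tests it with Python's `re` module. It assumes only the `%` and `_` wildcards and ignores SQL escape characters.

```python
import re

def like_to_regex(like_pattern):
    """Translate a SQL LIKE pattern into an anchored regular expression string."""
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")           # % matches any run of characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is matched literally
    return "^" + "".join(parts) + "$"

pattern = like_to_regex("%harry%")
print(pattern)

# IGNORECASE corresponds to the 'i' option; DOTALL (the 's' option) lets '.' match newlines,
# which matters because some tweets contain line breaks
flags = re.IGNORECASE | re.DOTALL
print(bool(re.search(pattern, "Prince Harry and the power of the beard", flags)))

# A filter that could be passed to bbcnews.find(...)
query = {"text": {"$regex": pattern, "$options": "is"}}
```

For a pattern of the form `'%harry%'` the unanchored `{'$regex': 'Harry', '$options': 'i'}` used above is simpler and does the same job; the translation is more useful for patterns such as `'harry%'`, which must match from the start of the text.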

In [100]:
# $regex can be used on more than one field. Use the "$or" operator to match documents that satisfy either condition.
# Just make sure the brackets are the correct ones and lined up correctly!
# SELECT text, created_at from bbcnews WHERE LOWER(text) LIKE '%harry%' OR created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$or": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))

[{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB",
  'created_at': 'Tuesday 10-Jan-2023 21:02:03'},
 {'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1",
  'created_at': 'Tuesday 10-Jan-2023 20:22:04'},
 {'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC",
  'created_at': 'Tuesday 10-Jan-2023 19:08:58'},
 {'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t',
  'created_at': 'Tuesday 10-Jan-2023 14:42:05'},
 {'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64',
  'created_at': 'Tuesday 10-Jan-2023 12:56:42'},
 {'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc',
  'created_at': 'Tuesday 10-Jan-2023 09:46:49'},
 {'text': "Prince Harry's book officially hits shops after days 

In [101]:
# Or use the "$and" operator to match only documents that satisfy both conditions.
# SELECT text, created_at from bbcnews WHERE LOWER(text) LIKE '%harry%' AND created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$and": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))


[{'text': "Which Royal has come out best in the fallout from Prince Harry's book? https://t.co/zvF9dTJNKW",
  'created_at': 'Wednesday 11-Jan-2023 14:58:43'},
 {'text': "Prince Harry condemns 'dangerous spin' about his Taliban comments https://t.co/V2u2hELxp4",
  'created_at': 'Wednesday 11-Jan-2023 02:04:29'}]
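
To see why `$or` returns so many more tweets than `$and`, the two conditions can be applied to a handful of documents in plain Python. This is only a sketch of the semantics: the first two sample tweets are taken from the output above, the third is hypothetical, and the helper functions are assumptions for illustration.

```python
import re

tweets = [
    {"text": "Prince Harry and the power of the beard",
     "created_at": "Tuesday 10-Jan-2023 09:46:49"},
    {"text": "Prince Harry condemns 'dangerous spin' about his Taliban comments",
     "created_at": "Wednesday 11-Jan-2023 02:04:29"},
    # Hypothetical tweet: posted on Wednesday, no mention of Harry
    {"text": "Example headline about something else",
     "created_at": "Wednesday 11-Jan-2023 10:00:00"},
]

def mentions_harry(doc):
    """Case-insensitive match, like {'$regex': 'Harry', '$options': 'i'}."""
    return re.search("Harry", doc["text"], re.IGNORECASE) is not None

def on_wednesday(doc):
    """Case-sensitive match, like {'$regex': 'Wednesday'}."""
    return re.search("Wednesday", doc["created_at"]) is not None

either = [d for d in tweets if mentions_harry(d) or on_wednesday(d)]   # $or
both = [d for d in tweets if mentions_harry(d) and on_wednesday(d)]    # $and
print(len(either), len(both))

# When the conditions are on different fields, MongoDB combines them with an implicit AND,
# so this shorter filter should be equivalent to the explicit "$and" form
implicit_and = {
    "text": {"$regex": "Harry", "$options": "i"},
    "created_at": {"$regex": "Wednesday"},
}
```

The explicit `$and` form is still needed when the same field or operator has to appear more than once in a filter, because a Python dictionary cannot hold duplicate keys.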

In [102]:
# show the distinct languages found in the tweets
# SELECT DISTINCT lang FROM bbcnews;
# The supported languages can be found here: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages 
db.bbcnews.distinct("lang")

['ca', 'en', 'fr', 'tl']
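
`distinct` lists the languages but not how often each occurs. A natural follow-up is a count per language, the equivalent of `SELECT lang, COUNT(*) FROM bbcnews GROUP BY lang`. The sketch below shows an aggregation pipeline that could be passed to `bbcnews.aggregate(...)`, alongside the same calculation in plain Python with `collections.Counter`. The sample documents are hypothetical and do not reflect the real counts.

```python
from collections import Counter

# Hypothetical sample: only the lang key matters here
sample_tweets = [{"lang": "en"}, {"lang": "en"}, {"lang": "fr"},
                 {"lang": "en"}, {"lang": "tl"}]

# Group by language, count one per document, then sort by descending count
pipeline = [
    {"$group": {"_id": "$lang", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]

# The same grouping done client-side
lang_counts = Counter(doc["lang"] for doc in sample_tweets)
print(lang_counts.most_common())
```

Running the pipeline on the server avoids pulling every tweet into Python, which matters more as the collection grows.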

In [103]:
# How many tweets have been retweeted more than 100 times
# Use the dot notation to reference keys in any subdocument
db.bbcnews.count_documents({"public_metrics.retweet_count": { '$gt' : 100 }})    

15

In [104]:
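# Dot notation also reaches into arrays of subdocuments: find tweets whose linked page title mentions 'Firefighter'
tweets = db.bbcnews.find({'entities.urls.title':{'$regex':'Firefighter'}}, {'entities.urls.title': 1})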

printDoc(tweets)

{'_id': ObjectId('659dc90c557014bee011a695'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}
{'_id': ObjectId('659dc90cee653d4a269bc8e1'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}


In [105]:
# Show some fields from the entities subdocument.
# When showing the subdocuments pretty print makes the tweets more readable
tweets = db.bbcnews.find({}, {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'They are also more likely to die from '
                                       'heart attacks and strokes, researchers '
                                       'say.',
                        'title': 'Firefighters face higher cancer risk, '
                                 'Scottish study finds'}]}}
{'entities': {'urls': [{'description': "A bookseller placed Prince Harry's "
                                       "memoir Spare beside Bella Mackie's "
                                       'novel How to Kill Your Family.',
                        'title': "Harry's memoir Spare displayed beside How to "
                                 'Kill Your Family novel'}]}}
{'entities': {'urls': [{'description': 'Zholia Alemi is a "most accomplished '
                                       'fraudster" who forged a certificate to '
                                       'get work, a jury hears.',
                        'title': 'Unqualified doctor who faked

In [106]:
# These can be searched too - find the Seal story
tweets = db.bbcnews.find({"entities.urls.title": {"$regex": "seal", '$options': 'i' }}, 
                         {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'Being in a fishing lake is like "being '
                                       'in a branch of Waitrose" for a hungry '
                                       'seal, an expert says.',
                        'title': 'Seal stuck in Rochford lake munching its way '
                                 'through fish stock'}]}}


In [107]:
# Another way to unpack the nested documents
# https://stackoverflow.com/questions/25909927/mongodb-how-to-get-a-field-sub-document-from-a-document
tweets=db.bbcnews.aggregate([
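    # entities is a subdocument rather than an array, so this first $unwind passes it through unchanged
    { "$unwind": "$entities" },

    # De-normalize the inner urls array: one output document per URL
    { "$unwind": "$entities.urls" },

    # Collect the distinct URL subdocuments back into one array per tweet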
    { "$group": {
        "_id": "$_id",
        "entities": { "$addToSet": "$entities.urls" }
    }}
])
printDocs(tweets)

{'_id': ObjectId('659dc90c557014bee011a6ce'),
 'entities': [{'description': 'Under the proposals, some trade union members '
                              'would be required to continue working during a '
                              'strike.',
               'display_url': 'bbc.in/3Cze5gQ',
               'end': 72,
               'expanded_url': 'https://bbc.in/3Cze5gQ',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1612730048758382592/ge4C7cFJ?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1612730048758382592/ge4C7cFJ?format=jpg&name=150x150',
                           'width': 150}],
               'start': 49,
               'status': 200,
               'title': 'Anti-strikes bill to be introduced to Parliament',
               'unwound_url': 'https://www.bbc.com/news/uk-64219016?xtor=AL-72-%
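
The `$unwind` and `$group` stages in this pipeline can be pictured with a plain-Python sketch. It is an illustration of the idea only: the sample documents and the `unwind` helper are hypothetical, and MongoDB's `$addToSet` does not guarantee the order of the collected values.

```python
# Hypothetical sample documents with the same nesting as the tweets
sample = [
    {"_id": 1, "entities": {"urls": [{"title": "A"}, {"title": "B"}]}},
    {"_id": 2, "entities": {"urls": [{"title": "C"}]}},
    {"_id": 3, "entities": {}},  # no urls array
]

def unwind(docs, outer, inner):
    """Yield one copy of each document per element of doc[outer][inner]."""
    for doc in docs:
        # Documents whose array is missing or empty produce no output,
        # which is also the default behaviour of $unwind
        for element in doc.get(outer, {}).get(inner, []):
            yield {**doc, outer: {**doc[outer], inner: element}}

unwound = list(unwind(sample, "entities", "urls"))

# Regroup by _id, keeping each distinct value once (the role played by $addToSet)
grouped = {}
for doc in unwound:
    items = grouped.setdefault(doc["_id"], [])
    url = doc["entities"]["urls"]
    if url not in items:
        items.append(url)

print(len(unwound))
print(grouped)
```

The net effect is to lift the `urls` array out of the `entities` subdocument, dropping any document that has no URLs.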

# Summary

This and the relationalDB Notebooks give you a flavour of the two types of database management system. 

What are the differences?

Some things to think about:

*Relational*
- relational has a fixed schema
- the data is normalised, with less duplication
- constraints can be enforced
- ACID transaction support (Atomicity, Consistency, Isolation and Durability)

*NoSQL (Document)*
- flexible schema, optional data can be easily incorporated.
- can support agile development
- data is denormalised, so can mean more duplication
- constraints not enforced
- BASE consistency model rather than ACID (Basically Available, Soft state, Eventual consistency!)

Bear in mind that NoSQL is a relatively new technology, so it can be seen as immature in that it does not provide such good support for transaction handling or access control. It could be argued, however, that this is not the market it is aimed at. 

