# MongoDB Examples

This notebook uses three different data sets:

* the UK Baby Names dataset introduced in my TMA01 Preparation Tutorial (available on GitHub: https://github.com/MaryGarvey/TM351/tree/main/TMA01_Preparation)
* JSON data from the USA Government (from before the current, 2025, administration)
* Former Twitter data (#BBCNews)

These will be stored in three separate databases: babyNamesDB, politicsDB and twitterDB.

Activity 13.2 introduces *Seven Databases in Seven Weeks* (Redmond 2012).

The most common NoSQL databases introduced are:

- Riak - key-value
- HBase - wide column
- MongoDB - document
- CouchDB - document
- Neo4j - graph
- Redis - key-value

This notebook will look at the MongoDB NoSQL document database.

# UK Baby Names 👶 (1996-2021)

## Introduction (from the Kaggle Website)

<i>Baby name statistics are compiled from first names recorded when live births are registered in England and Wales as part of civil registration, a legal requirement.
The statistics are based only on live births which occurred in the calendar year, as there is no public register of stillbirths.</i>

<i>Babies born in England and Wales to women whose usual residence is outside England and Wales are included in the statistics for England and Wales as a whole, but excluded from any sub-division of England and Wales.
The statistics are based on the exact spelling of the name given on the birth certificate. Grouping names with similar pronunciation would change the rankings. Exact names are given so users can group if they wish.</i>

<i>The dataset contains records of around 16k boy names and 22k girl names.</i>

You can get further information and the datasets from: 
https://www.kaggle.com/datasets/johnsmith44/uk-baby-names-1996-2021

In [1]:
# Import the required libraries

import pymongo
import datetime
import collections
#import Object

import pandas as pd
# better for printing JSON data: p(retty)print
from pprint import pprint

# Print out the version of pymongo 
print (pymongo.version)

4.10.1


In [2]:
#SET DATABASE CONNECTION STRINGS
MONGOHOST='localhost'
MONGOPORT=27017
MONGOCONN=f'mongodb://{MONGOHOST}:{MONGOPORT}/'

In [3]:
# MongoDB version
! mongod --version

db version v7.0.14
Build Info: {
    "version": "7.0.14",
    "gitVersion": "ce59cfc6a3c5e5c067dca0d30697edd68d4f5188",
    "openSSLVersion": "OpenSSL 3.0.14 4 Jun 2024",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "ubuntu2204",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}


In [4]:
client = pymongo.MongoClient(MONGOCONN)

In [5]:
# Drop the tutorial databases so we start with a clean sheet
# Unlike SQL's DROP DATABASE, drop_database() does not raise an error if the database does not exist
client.drop_database('babyNamesDB')
client.drop_database('politicsDB')
client.drop_database('twitterDB')
client.list_database_names()

['accidents', 'admin', 'bbc_db', 'config', 'fsa', 'local']

In [6]:
# Check the start and end of the file for any issues
!head data/UKGirlNames1996-2021.csv

Name,2021 Rank,2021 Count,2020 Rank,2020 Count,2019 Rank,2019 Count,2018 Rank,2018 Count,2017 Rank,2017 Count,2016 Rank,2016 Count,2015 Rank,2015 Count,2014 Rank,2014 Count,2013 Rank,2013 Count,2012 Rank,2012 Count,2011 Rank,2011 Count,2010 Rank,2010 Count,2009 Rank,2009 Count,2008 Rank,2008 Count,2007 Rank,2007 Count,2006 Rank,2006 Count,2005 Rank,2005 Count,2004 Rank,2004 Count,2003 Rank,2003 Count,2002 Rank,2002 Count,2001 Rank,2001 Count,2000 Rank,2000 Count,1999 Rank,1999 Count,1998 Rank,1998 Count,1997 Rank,1997 Count,1996 Rank,1996 Count
A'Idah,,,,,4686.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
A'Isha,5581.0,3.0,,,2458.0,10.0,,,4763.0,4.0,2757.0,9.0,2328.0,11.0,2659.0,9.0,4050.0,5.0,4171.0,5.0,4764.0,4.0,3533.0,6.0,3936.0,5.0,4524.0,4.0,2233.0,10.0,,,4798.0,3.0,3725.0,4.0,4373.0,3.0,,,2023.0,8.0,,,,,3142.0,4.0,,,,
A'Ishah,4634.0,4.0,,,,,,,5765.0,3.0,,,,,3160.0,7.0,5742.0,3.0,4171.0,5.0,5785.0,3.0,2589.0,9.0,,,4524.0,4.0,2895.0,7.0,5061.0,3.0,3382.0,5.0,2802.0,6.0,2727.

In [7]:
!tail data/UKGirlNames1996-2021.csv

Zyanna,,,5493.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zyla,2711.0,9.0,3117.0,7.0,3541.0,6.0,5666.0,3.0,,,5785.0,3.0,5730.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zylah,4634.0,4.0,5493.0,3.0,,,,,5765.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zymal,,,,,,,,,5765.0,3.0,,,,,,,4739.0,4.0,,,,,,,,,2487.0,9.0,2627.0,8.0,,,,,,,,,,,,,,,,,,,,,,
Zynab,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,3900.0,3.0,,,,,,
Zynah,3961.0,5.0,,,,,,,4763.0,4.0,,,2705.0,9.0,5691.0,3.0,2887.0,8.0,4171.0,5.0,4764.0,4.0,5707.0,3.0,,,5545.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,
Zyra,2341.0,11.0,2449.0,10.0,4001.0,5.0,2901.0,8.0,4063.0,5.0,,,4736.0,4.0,,,5742.0,3.0,5876.0,3.0,,,4688.0,4.0,,,5545.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,
Zyrah,5581.0,3.0,4535.0,4.0,,,,,4763.0,4.0,4096.0,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zysha,4634.0,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Zyva,2711.0,9.0,2449.0,10.0,3541.0,6.0,3985.0,5.0,4063.0,5.0,,,5730.0,3.0,,,,,,,,,,,3936.0,5.0,5545.0,3.0,,,,,,,,,,,,,,,,,,,,,

In [8]:
# babyNamesDB is a database that will contain 2 collections (similar to tables)
db = client.babyNamesDB

There are two ways to import the CSV dataset.

- use the `mongoimport` command
- import into a dataframe as normal, then convert to a MongoDB collection

Both methods will be shown here for information.

1. Using mongoimport

In [9]:
# 1. using mongoimport
! mongoimport --db babyNamesDB --type=csv --headerline --file data/UKGirlNames1996-2021.csv --collection girls

2025-01-24T12:40:27.540+0000	connected to: mongodb://localhost/
2025-01-24T12:40:28.030+0000	21958 document(s) imported successfully. 0 document(s) failed to import.


In [10]:
! mongoimport --db babyNamesDB --type=csv --headerline --file data/UKBoyNames1996-2021.csv --collection boys

2025-01-24T12:40:28.148+0000	connected to: mongodb://localhost/
2025-01-24T12:40:28.526+0000	16777 document(s) imported successfully. 0 document(s) failed to import.


Code for importing via a data frame:
<pre>
# importing via a data frame
names_df = pd.read_csv("data/UKBoyNames1996-2021.csv")
db.boys.insert_many(names_df.to_dict('records'))
</pre>
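
One caveat with the data frame route: pandas reads blank CSV cells as NaN, and `to_dict('records')` keeps them, so every document would store NaN values rather than leaving the key out. Below is a minimal sketch of a cleaning step that could run before `insert_many()`. It uses plain dictionaries in place of the data frame rows so it runs without pandas or a server, and the helper name `drop_missing` is my own, not part of pandas or PyMongo.

```python
import math

def drop_missing(record):
    """Return a copy of record without keys whose value is NaN, None or ''."""
    cleaned = {}
    for key, value in record.items():
        if value is None or value == "":
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned

# Rows shaped like the output of names_df.to_dict('records')
rows = [
    {"Name": "A'Idah", "2021 Rank": float("nan"), "2019 Rank": 4686.0, "2019 Count": 4.0},
    {"Name": "Zyva", "2021 Rank": 2711.0, "2021 Count": 9.0},
]
documents = [drop_missing(row) for row in rows]
print(documents)
# With a live connection: db.boys.insert_many(documents)
```

Cleaning at this point means the empty keys never reach the collection, so no later `$unset` pass would be needed for data loaded this way.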

In [11]:
# Check the database has been added (babyNamesDB)
client.list_database_names()

['accidents', 'admin', 'babyNamesDB', 'bbc_db', 'config', 'fsa', 'local']

In [12]:
# and it contains the two collections
db.list_collection_names()

['boys', 'girls']

In [13]:
# setup variables for the two collections
boys = db.boys
girls = db.girls

In [14]:
# how many documents does each collection have:
print("Girls:\t{}".format(girls.count_documents({})))
print("Boys:\t{}".format(boys.count_documents({})))

Girls:	21958
Boys:	16777


The variables save us from having to write db.collectionName.function() in the queries; for example, you can use `girls.find()` instead of `db.girls.find()`. You can still use the longer form.

Just be careful if you switch databases in the same notebook, as we do later: a variable such as `girls` stays bound to the database it was created from, so you could end up referencing a collection in the wrong database. MongoDB will not flag this as an error. If the collection does not exist, it simply returns nothing, a consequence of being schemaless.
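
One way to guard against this is to check which database a collection variable belongs to before using it. The sketch below relies on the `database.name` and `full_name` attributes that PyMongo collections expose. The helper name `check_database` is hypothetical, and a stand-in object is used so the example runs without a server.

```python
from types import SimpleNamespace

def check_database(collection, expected_db_name):
    """Raise ValueError if a collection handle belongs to a different database."""
    actual = collection.database.name
    if actual != expected_db_name:
        raise ValueError(
            f"{collection.full_name} is in '{actual}', not '{expected_db_name}'"
        )
    return collection

# Stand-in with the same attributes as a PyMongo Collection (no server needed)
fake_girls = SimpleNamespace(
    full_name="babyNamesDB.girls",
    database=SimpleNamespace(name="babyNamesDB"),
)

check_database(fake_girls, "babyNamesDB")  # passes silently

try:
    check_database(fake_girls, "twitterDB")
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

With a live connection the call would be `check_database(girls, "babyNamesDB")`.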

In [15]:
# Show one record - can be any one from the collection
girls.find_one()

{'_id': ObjectId('67938a3b1d753da705dd2ed6'),
 'Name': "A'Idah",
 '2021 Rank': '',
 '2021 Count': '',
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': 4686.0,
 '2019 Count': 4.0,
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': '',
 '2011 Count': '',
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': '',
 '2005 Count': '',
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': '',
 '2003 Count': '',
 '2002 Rank': '',
 '2002 Count': '',
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': '',
 '1999 Count': '',
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': '',
 '1997 Count': '',
 '199

We can see there are a lot of missing values, which will be removed later.

# Querying 

MongoDB stores data as JSON-like documents (held internally in a binary form called BSON), so most things are expressed as *{key: value}* pairs.

The *find()* function is the equivalent of the SQL SELECT statement.

Instead of a *WHERE* clause, you provide a query document (a Python dictionary in PyMongo) describing what you want to find.

For example, the following is the equivalent of *SELECT * FROM girls WHERE Name = 'Mary';*

In [16]:
girls.find({'Name': 'Mary'})

<pymongo.synchronous.cursor.Cursor at 0x7fbbeca6da50>
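
Filters are not limited to equality. As a rough guide, here are a few query documents alongside the SQL conditions they correspond to. They are ordinary Python dictionaries, so the block runs without a server. The field names come from the girls collection, while the thresholds are arbitrary values chosen for illustration.

```python
# Each filter is an ordinary Python dict that can be passed to find()
filters = {
    # WHERE Name = 'Mary'
    "equality": {"Name": "Mary"},
    # WHERE "2021 Count" > 2000
    "comparison": {"2021 Count": {"$gt": 2000}},
    # WHERE Name IN ('Mary', 'Susan')
    "membership": {"Name": {"$in": ["Mary", "Susan"]}},
    # WHERE "2021 Count" > 100 AND "2021 Rank" <= 500
    "conjunction": {"2021 Count": {"$gt": 100}, "2021 Rank": {"$lte": 500}},
}

for label, query in filters.items():
    print(label, query)
# With a live connection: girls.find(filters["membership"])
```

Several keys in one filter are combined with an implicit AND, which is why the last example needs no explicit operator.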

In [17]:
# find_one() also accepts search criteria (the match could be the only one)
girls.find_one({'Name': 'Mary-Beth'})

{'_id': ObjectId('67938a3b1d753da705dd634f'),
 'Name': 'Mary-Beth',
 '2021 Rank': '',
 '2021 Count': '',
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': '',
 '2019 Count': '',
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': 5785.0,
 '2011 Count': 3.0,
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': 3970.0,
 '2005 Count': 4.0,
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': 4373.0,
 '2003 Count': 3.0,
 '2002 Rank': 4137.0,
 '2002 Count': 3.0,
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': 2444.0,
 '1999 Count': 6.0,
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': 2738.

The difference between `find()` and `find_one()` is that the former returns a cursor over all the documents matching the criteria, whereas the latter returns just one of them, which is handy for inspecting the structure of the data. Bear in mind that, because MongoDB can store semi-structured data, different documents in a collection can have different structures, unlike a relational table, where every record has the same structure.

To see what the cursor holds, let's create some functions that print the individual documents from it.

In [18]:
# find() returns a cursor, so we need to iterate over it to display the results
# using pretty print
def printDocs(documents):
    for doc in documents:
        pprint(doc)

# ordinary print
def printDoc(documents):
    for doc in documents:
        print(doc)

# ordinary print of just the values in each document
def printDocValues(documents):
    for doc in documents:
        values = list(doc.values())
        print(values)

In [19]:
# find the Marys
docs = girls.find({'Name': 'Mary'})
printDoc(docs)

{'_id': ObjectId('67938a3b1d753da705dd634d'), 'Name': 'Mary', '2021 Rank': 318.0, '2021 Count': 148.0, '2020 Rank': 291.0, '2020 Count': 160.0, '2019 Rank': 289.0, '2019 Count': 170.0, '2018 Rank': 259.0, '2018 Count': 189.0, '2017 Rank': 320.0, '2017 Count': 150.0, '2016 Rank': 250.0, '2016 Count': 204.0, '2015 Rank': 249.0, '2015 Count': 198.0, '2014 Rank': 225.0, '2014 Count': 229.0, '2013 Rank': 244.0, '2013 Count': 203.0, '2012 Rank': 241.0, '2012 Count': 209.0, '2011 Rank': 250.0, '2011 Count': 200.0, '2010 Rank': 213.0, '2010 Count': 237.0, '2009 Rank': 227.0, '2009 Count': 213.0, '2008 Rank': 179.0, '2008 Count': 292.0, '2007 Rank': 177.0, '2007 Count': 300.0, '2006 Rank': 170.0, '2006 Count': 310.0, '2005 Rank': 151.0, '2005 Count': 339.0, '2004 Rank': 164.0, '2004 Count': 325.0, '2003 Rank': 162.0, '2003 Count': 298.0, '2002 Rank': 146.0, '2002 Count': 310.0, '2001 Rank': 145.0, '2001 Count': 315.0, '2000 Rank': 146.0, '2000 Count': 313.0, '1999 Rank': 139.0, '1999 Count': 33
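
Documents this wide are hard to read in full. Just as SELECT can name columns, `find()` and `find_one()` accept a projection as their second argument to return only selected keys. The following is a small sketch: the helper name `count_query` is my own, and it only builds the two dictionaries, so it runs without a server.

```python
def count_query(name, year):
    """Build the filter and projection for one name's count in one year."""
    field = f"{year} Count"
    query = {"Name": name}
    # 1 includes a key; _id is returned by default unless set to 0
    projection = {"_id": 0, "Name": 1, field: 1}
    return query, projection

query, projection = count_query("Mary", 2021)
print(query)
print(projection)
# With a live connection: girls.find_one(query, projection)
```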

In [20]:
# alternatively use a dataframe to make it more like a relational table
# other approaches are to convert the Cursor to a list
# find the girls names in 2021 with a count more than 2000
pd.DataFrame(girls.find({"2021 Count" : {"$gt": 2000}}))

| | _id | Name | 2021 Rank | 2021 Count | 2020 Rank | 2020 Count | 2019 Rank | 2019 Count | 2018 Rank | 2018 Count | ... | 2000 Rank | 2000 Count | 1999 Rank | 1999 Count | 1998 Rank | 1998 Count | 1997 Rank | 1997 Count | 1996 Rank | 1996 Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 67938a3b1d753da705dd3448 | Amelia | 2.0 | 3164.0 | 2.0 | 3319.0 | 2.0 | 3712.0 | 2.0 | 3941.0 | ... | 35.0 | 1489.0 | 37.0 | 1511.0 | 48.0 | 1249.0 | 49.0 | 1145.0 | 63.0 | 929.0 |
| 1 | 67938a3b1d753da705dd3943 | Ava | 4.0 | 2576.0 | 4.0 | 2679.0 | 4.0 | 2946.0 | 3.0 | 3110.0 | ... | 291.0 | 129.0 | 429.0 | 70.0 | 530.0 | 51.0 | 500.0 | 55.0 | 753.0 | 30.0 |
| 2 | 67938a3b1d753da705dd4a96 | Florence | 8.0 | 2180.0 | 14.0 | 1963.0 | 15.0 | 2025.0 | 15.0 | 2062.0 | ... | 163.0 | 274.0 | 166.0 | 268.0 | 177.0 | 244.0 | 175.0 | 257.0 | 194.0 | 228.0 |
| 3 | 67938a3b1d753da705dd4af6 | Freya | 6.0 | 2187.0 | 12.0 | 1982.0 | 10.0 | 2264.0 | 18.0 | 1921.0 | ... | 75.0 | 686.0 | 92.0 | 589.0 | 93.0 | 563.0 | 113.0 | 450.0 | 118.0 | 394.0 |
| 4 | 67938a3b1d753da705dd50f2 | Isabella | 13.0 | 2010.0 | 8.0 | 2052.0 | 6.0 | 2398.0 | 7.0 | 2369.0 | ... | 64.0 | 796.0 | 57.0 | 883.0 | 90.0 | 594.0 | 107.0 | 470.0 | 106.0 | 441.0 |
| 5 | 67938a3b1d753da705dd512e | Isla | 3.0 | 2683.0 | 3.0 | 2749.0 | 3.0 | 2981.0 | 4.0 | 3046.0 | ... | 297.0 | 125.0 | 286.0 | 124.0 | 369.0 | 87.0 | 380.0 | 84.0 | 382.0 | 87.0 |
| 6 | 67938a3b1d753da705dd51a1 | Ivy | 5.0 | 2245.0 | 6.0 | 2166.0 | 12.0 | 2158.0 | 14.0 | 2104.0 | ... | 1033.0 | 20.0 | 1216.0 | 16.0 | 2165.0 | 7.0 | 1666.0 | 10.0 | 1222.0 | 15.0 |
| 7 | 67938a3b1d753da705dd5dff | Lily | 7.0 | 2182.0 | 7.0 | 2150.0 | 9.0 | 2285.0 | 13.0 | 2184.0 | ... | 45.0 | 1124.0 | 53.0 | 1007.0 | 61.0 | 873.0 | 75.0 | 721.0 | 85.0 | 651.0 |
| 8 | 67938a3b1d753da705dd6504 | Mia | 9.0 | 2168.0 | 5.0 | 2303.0 | 5.0 | 2500.0 | 6.0 | 2490.0 | ... | 43.0 | 1149.0 | 54.0 | 980.0 | 75.0 | 721.0 | 89.0 | 583.0 | 116.0 | 397.0 |
| 9 | 67938a3b1d753da705dd6b8f | Olivia | 1.0 | 3649.0 | 1.0 | 3640.0 | 1.0 | 4082.0 | 1.0 | 4598.0 | ... | 8.0 | 4546.0 | 4.0 | 5250.0 | 16.0 | 3550.0 | 19.0 | 2789.0 | 24.0 | 2456.0 |


# Data Dictionary



One consequence of being schemaless is that there are no conventional data dictionary tables against which to check whether a collection or key name exists. A query that names a collection or key that does not exist therefore raises no error; it simply returns nothing. Note that all names are case sensitive.

Why will the following return no records?

In [21]:
girls.find_one({"Name" : "Fred"})

In [22]:
db.girls.find_one({"name" : "Susan"})

Python, however, will raise an error if it cannot find a variable or method:

In [23]:
Girls.find_one({"Name" : "Susan"})

NameError: name 'Girls' is not defined

In [None]:
girls.find_One({"Name" : "Susan"})

In [None]:
girls.find_One({"Name" : Susan})

In [None]:
girls.find_One({Name : "Susan"})

In [24]:
# Let's find our girl
girls.find_one({"Name" : "Susan"})

{'_id': ObjectId('67938a3b1d753da705dd7b8e'),
 'Name': 'Susan',
 '2021 Rank': 1692.0,
 '2021 Count': 17.0,
 '2020 Rank': 2042.0,
 '2020 Count': 13.0,
 '2019 Rank': 3151.0,
 '2019 Count': 7.0,
 '2018 Rank': 3518.0,
 '2018 Count': 6.0,
 '2017 Rank': 1512.0,
 '2017 Count': 21.0,
 '2016 Rank': 1525.0,
 '2016 Count': 21.0,
 '2015 Rank': 1601.0,
 '2015 Count': 19.0,
 '2014 Rank': 1882.0,
 '2014 Count': 15.0,
 '2013 Rank': 1433.0,
 '2013 Count': 22.0,
 '2012 Rank': 1130.0,
 '2012 Count': 30.0,
 '2011 Rank': 1043.0,
 '2011 Count': 33.0,
 '2010 Rank': 1257.0,
 '2010 Count': 25.0,
 '2009 Rank': 865.0,
 '2009 Count': 39.0,
 '2008 Rank': 878.0,
 '2008 Count': 37.0,
 '2007 Rank': 836.0,
 '2007 Count': 38.0,
 '2006 Rank': 770.0,
 '2006 Count': 40.0,
 '2005 Rank': 951.0,
 '2005 Count': 28.0,
 '2004 Rank': 883.0,
 '2004 Count': 30.0,
 '2003 Rank': 844.0,
 '2003 Count': 31.0,
 '2002 Rank': 761.0,
 '2002 Count': 33.0,
 '2001 Rank': 612.0,
 '2001 Count': 42.0,
 '2000 Rank': 582.0,
 '2000 Count': 46.0,
 '

There may not be a data dictionary collection to query, but you can find the keys in a collection, which are similar to the column names in a relational database. Be aware though, that the structure can vary from document to document in a given collection.


In [25]:
girls.find_one().keys()

dict_keys(['_id', 'Name', '2021 Rank', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])
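
Because `find_one()` shows the keys of a single document only, a collection whose documents differ in structure needs a pass over all of them to list every key in use. Here is a minimal sketch of that idea. The helper name `all_keys` is hypothetical, and the sample list stands in for a cursor so the block runs without a server.

```python
def all_keys(documents):
    """Collect every key that appears in at least one document, in first-seen order."""
    seen = {}
    for doc in documents:
        for key in doc:
            seen.setdefault(key, None)
    return list(seen)

# Two documents with different structures (the values for Marvi are made up)
sample = [
    {"_id": 1, "Name": "Marvi", "2000 Rank": 5000.0, "2000 Count": 3.0},
    {"_id": 2, "Name": "A'Idah", "2019 Rank": 4686.0, "2019 Count": 4.0},
]
print(all_keys(sample))
# With a live connection: all_keys(girls.find())
```

This reads every document, which is fine for a collection of this size but would be slow on a very large one.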

As seen previously, there are a lot of fields with no data. One advantage of a NoSQL database is that documents do not all have to share the same structure, so if a value is blank there is no need to store the key.

For example, let's remove the "2021 Rank" key from every document where it holds an empty string:

In [26]:
girls.update_many({"2021 Rank" : ""}, { "$unset": {"2021 Rank" : 1 }});

In [27]:
girls.find_one()

{'_id': ObjectId('67938a3b1d753da705dd2ed6'),
 'Name': "A'Idah",
 '2021 Count': '',
 '2020 Rank': '',
 '2020 Count': '',
 '2019 Rank': 4686.0,
 '2019 Count': 4.0,
 '2018 Rank': '',
 '2018 Count': '',
 '2017 Rank': '',
 '2017 Count': '',
 '2016 Rank': '',
 '2016 Count': '',
 '2015 Rank': '',
 '2015 Count': '',
 '2014 Rank': '',
 '2014 Count': '',
 '2013 Rank': '',
 '2013 Count': '',
 '2012 Rank': '',
 '2012 Count': '',
 '2011 Rank': '',
 '2011 Count': '',
 '2010 Rank': '',
 '2010 Count': '',
 '2009 Rank': '',
 '2009 Count': '',
 '2008 Rank': '',
 '2008 Count': '',
 '2007 Rank': '',
 '2007 Count': '',
 '2006 Rank': '',
 '2006 Count': '',
 '2005 Rank': '',
 '2005 Count': '',
 '2004 Rank': '',
 '2004 Count': '',
 '2003 Rank': '',
 '2003 Count': '',
 '2002 Rank': '',
 '2002 Count': '',
 '2001 Rank': '',
 '2001 Count': '',
 '2000 Rank': '',
 '2000 Count': '',
 '1999 Rank': '',
 '1999 Count': '',
 '1998 Rank': '',
 '1998 Count': '',
 '1997 Rank': '',
 '1997 Count': '',
 '1996 Rank': '',
 '199

Given the number of empty keys, it would be tedious to remove each one separately, so let's find which keys a document has and then loop through them, removing any blanks.

Note that `find_one()` could retrieve any document, and with semi-structured data each document could have a different structure. In this case the data came from a CSV file, so every document started with the same keys.

In [28]:
keys = girls.find_one({}).keys()
keys

dict_keys(['_id', 'Name', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])

In [29]:
for k in keys:
    girls.update_many({ k : ""}, { "$unset": { k : 1 }});

In [30]:
# note, the above has removed any empty keys, but the document will still exist
girls.find_one()

{'_id': ObjectId('67938a3b1d753da705dd2ed6'),
 'Name': "A'Idah",
 '2019 Rank': 4686.0,
 '2019 Count': 4.0}

In [31]:
# do the same to the boys names
keys = boys.find_one({}).keys()
for k in keys:
    boys.update_many({ k : ""}, { "$unset": { k : 1 }});

In [32]:
boys.find_one()

{'_id': ObjectId('67938a3c72aab1e80d595bc7'),
 'Name': 'A.J.',
 '2009 Rank': 4527.0,
 '2009 Count': 3.0,
 '1999 Rank': 2943.0,
 '1999 Count': 3.0}

In [33]:
# The consequence of this is that the keys will be slightly different for the records that have more complete data
# one with sparse data
girls.find_one({"Name" : "Marvi"}).keys()

dict_keys(['_id', 'Name', '2000 Rank', '2000 Count'])

In [34]:
# one more complete:
girls.find_one({"Name" : "Martina"}).keys()

dict_keys(['_id', 'Name', '2021 Rank', '2021 Count', '2020 Rank', '2020 Count', '2019 Rank', '2019 Count', '2018 Rank', '2018 Count', '2017 Rank', '2017 Count', '2016 Rank', '2016 Count', '2015 Rank', '2015 Count', '2014 Rank', '2014 Count', '2013 Rank', '2013 Count', '2012 Rank', '2012 Count', '2011 Rank', '2011 Count', '2010 Rank', '2010 Count', '2009 Rank', '2009 Count', '2008 Rank', '2008 Count', '2007 Rank', '2007 Count', '2006 Rank', '2006 Count', '2005 Rank', '2005 Count', '2004 Rank', '2004 Count', '2003 Rank', '2003 Count', '2002 Rank', '2002 Count', '2001 Rank', '2001 Count', '2000 Rank', '2000 Count', '1999 Rank', '1999 Count', '1998 Rank', '1998 Count', '1997 Rank', '1997 Count', '1996 Rank', '1996 Count'])

In [35]:
# how many documents in the collection
db.girls.count_documents({})

21958

In [36]:
# can access via the index (starts at 0)
girls.find()[0]

{'_id': ObjectId('67938a3b1d753da705dd2ed6'),
 'Name': "A'Idah",
 '2019 Rank': 4686.0,
 '2019 Count': 4.0}

In [37]:
# second record
girls.find()[1]

{'_id': ObjectId('67938a3b1d753da705dd2ed7'),
 'Name': "A'Ishah",
 '2021 Rank': 4634.0,
 '2021 Count': 4.0,
 '2017 Rank': 5765.0,
 '2017 Count': 3.0,
 '2014 Rank': 3160.0,
 '2014 Count': 7.0,
 '2013 Rank': 5742.0,
 '2013 Count': 3.0,
 '2012 Rank': 4171.0,
 '2012 Count': 5.0,
 '2011 Rank': 5785.0,
 '2011 Count': 3.0,
 '2010 Rank': 2589.0,
 '2010 Count': 9.0,
 '2008 Rank': 4524.0,
 '2008 Count': 4.0,
 '2007 Rank': 2895.0,
 '2007 Count': 7.0,
 '2006 Rank': 5061.0,
 '2006 Count': 3.0,
 '2005 Rank': 3382.0,
 '2005 Count': 5.0,
 '2004 Rank': 2802.0,
 '2004 Count': 6.0,
 '2003 Rank': 2727.0,
 '2003 Count': 6.0,
 '2002 Rank': 2868.0,
 '2002 Count': 5.0,
 '2001 Rank': 2023.0,
 '2001 Count': 8.0,
 '2000 Rank': 3912.0,
 '2000 Count': 3.0,
 '1999 Rank': 3225.0,
 '1999 Count': 4.0,
 '1998 Rank': 3142.0,
 '1998 Count': 4.0}

In [38]:
# Last one (avoid calling the variable len, which would shadow Python's built-in len())
last_index = girls.count_documents({}) - 1
girls.find()[last_index]

{'_id': ObjectId('67938a3c1d753da705dd849b'),
 'Name': 'Zynab',
 '1999 Rank': 3900.0,
 '1999 Count': 3.0}

`count_documents()` can be used with a query to count the matching documents, rather than listing them.

In [39]:
girls.count_documents({"Name": "Mary"})

1

In [40]:
# how many documents have a count more than 1500 in 2021
girls.count_documents({"2021 Count": {"$gt" : 1500} })

25

# Part 15: Complex queries and analysis
# Aggregation Pipeline

More complex processing, including grouping, aggregation functions, and data renaming is achieved through MongoDB’s aggregation pipeline.

For example a query can involve several stages:
                                                
First stage: filter out documents that do not match some criterion<br>
Second stage: group those documents<br>
Third stage: select only groups that match another criterion<br>
Fourth stage: group summaries would then be returned to the client<br>

By building up a pipeline in stages, complex data processing tasks can be built from simple components.
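
The four stages above could be written as the following pipeline against the `girls` collection. This is only a sketch: the thresholds are arbitrary, and a final `$sort` stands in for "returning the summaries", since returning results to the client is not itself a pipeline stage. The pipeline is a plain Python list, so the block runs without a server.

```python
pipeline = [
    # Stage 1: keep only documents with a 2021 count of at least 3
    {"$match": {"2021 Count": {"$gte": 3}}},
    # Stage 2: group those documents by their 2021 count and tally the names
    {"$group": {"_id": "$2021 Count", "names": {"$sum": 1}}},
    # Stage 3: keep only the groups that contain more than 100 names
    {"$match": {"names": {"$gt": 100}}},
    # Stage 4: order the group summaries before they are returned
    {"$sort": {"_id": 1}},
]

# Each stage is a one-key dict whose key is the stage operator
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)
# With a live connection: list(girls.aggregate(pipeline))
```

Note that `$match` appears twice: once to filter documents and once to filter groups, much like WHERE and HAVING in SQL.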

![](images/pipeline.png)

![](images/pipeline_functions.png)

Further examples can be found in *Notebook 15.3 Introducing aggregation pipelines.*

The examples below and in the practical activities all use small datasets that can be processed locally. With very large datasets, the work may be spread across many machines to speed it up. Data processing tools (such as the aggregation pipeline and MapReduce) keep the processing close to the data itself, reducing both the work required of the client and the amount of data that has to travel across the network from server to client.

In [41]:
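# Equivalent to SELECT COUNT(*) FROM girls;
# $group needs an _id; a constant _id puts every document into a single group
# (the total is returned under the key "Name" here, though any key name would do)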
pipeline = [
     {"$group": {"_id": 0, "Name": {"$sum": 1}}},
]

list(girls.aggregate(pipeline))

[{'_id': 0, 'Name': 21958}]

In [42]:
# SELECT Name, count(*) FROM girls GROUP BY Name LIMIT 50;
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$Name", "count": {"$sum": 1} }},
                               {"$limit" : 50}] ))

{'_id': 'Vriti', 'count': 1}
{'_id': 'Divyanshi', 'count': 1}
{'_id': 'Libby-Ann', 'count': 1}
{'_id': 'Shanyce', 'count': 1}
{'_id': 'Sharia', 'count': 1}
{'_id': 'Willamina', 'count': 1}
{'_id': 'Zarmisha', 'count': 1}
{'_id': 'Amber-Rae', 'count': 1}
{'_id': 'Vestina', 'count': 1}
{'_id': 'Preksha', 'count': 1}
{'_id': 'Ruby-Sue', 'count': 1}
{'_id': 'Aml', 'count': 1}
{'_id': 'Eithne', 'count': 1}
{'_id': 'Alexa-Rose', 'count': 1}
{'_id': 'Korin', 'count': 1}
{'_id': 'Fabeeha', 'count': 1}
{'_id': 'Kaydi', 'count': 1}
{'_id': 'Isla-Jade', 'count': 1}
{'_id': 'Kiera-May', 'count': 1}
{'_id': 'Myla-Leigh', 'count': 1}
{'_id': 'Bobbie-Ann', 'count': 1}
{'_id': 'Safiah', 'count': 1}
{'_id': 'Ugonna', 'count': 1}
{'_id': 'Nihira', 'count': 1}
{'_id': 'Aaminah', 'count': 1}
{'_id': 'Lilimae', 'count': 1}
{'_id': 'Lanna', 'count': 1}
{'_id': 'Medeea', 'count': 1}
{'_id': 'Ottoline', 'count': 1}
{'_id': 'Ronnie-May', 'count': 1}
{'_id': 'Malieka', 'count': 1}
{'_id': 'Annisa', 'count': 1}


## Reshaping

To do statistics on this data we want to use information held in the keys as values, e.g., extract the year from `2020 Count`. In Tutorial 2 we did some processing to achieve this, so let's reuse that code to reshape our data:

In [43]:
def updateFile(fileType):
    # drop any name with a missing value in any year, so only names present in every year remain
    filename = 'data/UK'+fileType+'Names1996-2021.csv'
    print("Importing: '"+filename+"'")
    names_df = pd.read_csv(filename)
    names_df = names_df.dropna(how='any')
    # unpivot the dataframe from a wide to long format
    namesUpd_df = pd.melt(names_df, id_vars="Name", value_name='Value')
    # split the two values in variable: year and the type (count or rank)
    namesUpd_df[['Year','Type']] = namesUpd_df['variable'].str.split(' ', expand = True)
    # convert year to a number
    namesUpd_df['Year'] = namesUpd_df['Year'].astype(str).astype(int)
    # the variable column is no longer needed
    namesUpd_df.drop('variable', axis=1, inplace=True)
    namesUpd_df.head()
    # add a gender field
    namesUpd_df["Gender"] = fileType
    # save the changes 
    namesUpd_df.to_csv('data/'+fileType+'Updated.csv', index=False)
    return namesUpd_df
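
To make the unpivot step concrete, here is a rough plain-Python equivalent of what `pd.melt` plus the string split do to a single wide record. The helper name `melt_record` is my own; the actual function above works on the whole data frame at once.

```python
def melt_record(record):
    """Turn one wide document into long rows: one row per '<year> <type>' key."""
    rows = []
    for key, value in record.items():
        if key in ("_id", "Name"):
            continue
        year, kind = key.split(" ")  # e.g. '2019 Rank' -> '2019', 'Rank'
        rows.append(
            {"Name": record["Name"], "Value": value, "Year": int(year), "Type": kind}
        )
    return rows

wide = {"Name": "A'Idah", "2019 Rank": 4686.0, "2019 Count": 4.0}
long_rows = melt_record(wide)
for row in long_rows:
    print(row)
```

Each wide document becomes several narrow ones, with the year and the type (rank or count) now available as values that can be matched and grouped on.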

In [44]:
# SELECT "2021 Count", count(*) FROM girls GROUP BY "2021 Count" ORDER BY "2021 Count" LIMIT 20;
printDoc(db.girls.aggregate( [ { "$group" : { "_id" : "$2021 Count", "count": {"$sum": 1} }},
                               { "$sort" : {"_id" : 1}},
                                {"$limit" : 20}
                             ] ))

{'_id': None, 'count': 14628}
{'_id': 3.0, 'count': 1750}
{'_id': 4.0, 'count': 947}
{'_id': 5.0, 'count': 673}
{'_id': 6.0, 'count': 442}
{'_id': 7.0, 'count': 327}
{'_id': 8.0, 'count': 250}
{'_id': 9.0, 'count': 231}
{'_id': 10.0, 'count': 212}
{'_id': 11.0, 'count': 158}
{'_id': 12.0, 'count': 140}
{'_id': 13.0, 'count': 128}
{'_id': 14.0, 'count': 111}
{'_id': 15.0, 'count': 96}
{'_id': 16.0, 'count': 81}
{'_id': 17.0, 'count': 93}
{'_id': 18.0, 'count': 64}
{'_id': 19.0, 'count': 59}
{'_id': 20.0, 'count': 64}
{'_id': 21.0, 'count': 51}


Update the two files to remove the blanks and unpivot the data.

Let's also keep both files in one collection, adding a "Gender" field so we know which gender each name belongs to.

In [45]:
boys_df = updateFile("Boy")
boys_df.head()

Importing: 'data/UKBoyNames1996-2021.csv'


| | Name | Value | Year | Type | Gender |
|---|---|---|---|---|---|
| 0 | Aadam | 457.0 | 2021 | Rank | Boy |
| 1 | Aadil | 1448.0 | 2021 | Rank | Boy |
| 2 | Aamir | 2301.0 | 2021 | Rank | Boy |
| 3 | Aaran | 1860.0 | 2021 | Rank | Boy |
| 4 | Aaron | 119.0 | 2021 | Rank | Boy |


In [46]:
girls_df = updateFile("Girl")
girls_df.head()

Importing: 'data/UKGirlNames1996-2021.csv'


| | Name | Value | Year | Type | Gender |
|---|---|---|---|---|---|
| 0 | Aaisha | 1569.0 | 2021 | Rank | Girl |
| 1 | Aaishah | 2942.0 | 2021 | Rank | Girl |
| 2 | Aaliya | 1402.0 | 2021 | Rank | Girl |
| 3 | Aaliyah | 132.0 | 2021 | Rank | Girl |
| 4 | Aamina | 1785.0 | 2021 | Rank | Girl |


In [47]:
# check CSV files have been created
!ls data/*.csv

data/BoyUpdated.csv   data/UKBoyNames1996-2021.csv
data/GirlUpdated.csv  data/UKGirlNames1996-2021.csv


In [48]:
# use mongoimport to import the updates
!mongoimport --db babyNamesDB --type=csv --headerline --file data/GirlUpdated.csv --collection namesUpdate
!mongoimport --db babyNamesDB --type=csv --headerline --file data/BoyUpdated.csv --collection namesUpdate

2025-01-24T12:40:54.892+0000	connected to: mongodb://localhost/
2025-01-24T12:40:55.359+0000	77896 document(s) imported successfully. 0 document(s) failed to import.
2025-01-24T12:40:55.473+0000	connected to: mongodb://localhost/
2025-01-24T12:40:55.933+0000	72124 document(s) imported successfully. 0 document(s) failed to import.


In [49]:
# check they are now in the baby names database (babyNamesDB)
db.list_collection_names()

['namesUpdate', 'boys', 'girls']

In [50]:
db.namesUpdate.find_one()

{'_id': ObjectId('67938a56f8e2a59ba267e3b1'),
 'Name': 'Abida',
 'Value': 3192.0,
 'Year': 2021,
 'Type': 'Rank',
 'Gender': 'Girl'}

In [51]:
# how many documents does each collection have:
print("Girls: \t\t{}".format(girls.count_documents({})))
print("Girls Update: \t{}".format(db.namesUpdate.count_documents({"Gender": "Girl"})))
print("Boys: \t\t{}".format(boys.count_documents({})))
print("Boys Update: \t{}".format(db.namesUpdate.count_documents({"Gender": "Boy"})))

Girls: 		21958
Girls Update: 	77896
Boys: 		16777
Boys Update: 	72124


In [52]:
# SELECT Year, count(*) as count FROM namesUpdate GROUP BY Year;
printDoc(db.namesUpdate.aggregate( [ 
    { "$group" : { "_id" : "$Year", "count": {"$sum": 1} }},
    ] ))

{'_id': 1997, 'count': 5770}
{'_id': 2008, 'count': 5770}
{'_id': 2017, 'count': 5770}
{'_id': 2015, 'count': 5770}
{'_id': 2003, 'count': 5770}
{'_id': 2001, 'count': 5770}
{'_id': 1998, 'count': 5770}
{'_id': 1996, 'count': 5770}
{'_id': 2002, 'count': 5770}
{'_id': 2000, 'count': 5770}
{'_id': 2013, 'count': 5770}
{'_id': 2009, 'count': 5770}
{'_id': 2010, 'count': 5770}
{'_id': 2012, 'count': 5770}
{'_id': 2007, 'count': 5770}
{'_id': 2018, 'count': 5770}
{'_id': 2016, 'count': 5770}
{'_id': 1999, 'count': 5770}
{'_id': 2014, 'count': 5770}
{'_id': 2019, 'count': 5770}
{'_id': 2004, 'count': 5770}
{'_id': 2005, 'count': 5770}
{'_id': 2011, 'count': 5770}
{'_id': 2006, 'count': 5770}
{'_id': 2021, 'count': 5770}
{'_id': 2020, 'count': 5770}


In [53]:
# SELECT Year, sum(Value) as "Sum of values" FROM namesUpdate GROUP BY Year ORDER BY Year DESC;  (Year is the _id)
printDoc(db.namesUpdate.aggregate( [ { "$group" : { "_id" : "$Year", "Sum of values": {"$sum": "$Value"}}},
                                                 { "$sort" : {"_id" : -1}}  
                                     ] ))

{'_id': 2021, 'Sum of values': 4484454.0}
{'_id': 2020, 'Sum of values': 4326296.0}
{'_id': 2019, 'Sum of values': 4242305.0}
{'_id': 2018, 'Sum of values': 4097775.0}
{'_id': 2017, 'Sum of values': 4010175.0}
{'_id': 2016, 'Sum of values': 3945549.0}
{'_id': 2015, 'Sum of values': 3862790.0}
{'_id': 2014, 'Sum of values': 3758010.0}
{'_id': 2013, 'Sum of values': 3709197.0}
{'_id': 2012, 'Sum of values': 3650358.0}
{'_id': 2011, 'Sum of values': 3596210.0}
{'_id': 2010, 'Sum of values': 3528470.0}
{'_id': 2009, 'Sum of values': 3425876.0}
{'_id': 2008, 'Sum of values': 3399294.0}
{'_id': 2007, 'Sum of values': 3322128.0}
{'_id': 2006, 'Sum of values': 3276744.0}
{'_id': 2005, 'Sum of values': 3210746.0}
{'_id': 2004, 'Sum of values': 3216501.0}
{'_id': 2003, 'Sum of values': 3205482.0}
{'_id': 2002, 'Sum of values': 3206279.0}
{'_id': 2001, 'Sum of values': 3209147.0}
{'_id': 2000, 'Sum of values': 3258815.0}
{'_id': 1999, 'Sum of values': 3316023.0}
{'_id': 1998, 'Sum of values': 342

In [54]:
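# SELECT Name, avg(Value) as "Average Rank" FROM namesUpdate WHERE Type = 'Rank' and Gender = 'Girl'
#   GROUP BY Name ORDER BY "Average Rank" LIMIT 50;
# This pipeline involves 5 stages: $match, $group, $sort, $limit and $project
# $trunc cuts the average to 2 decimal places (it truncates rather than rounds)
# NB: $sort names "Average Rank", but at that stage the field is still called "Average",
#     so the results below come back unordered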
printDoc(db.namesUpdate.aggregate( [ { "$match" : {"Type": "Rank", "Gender": "Girl"} },
                                     { "$group" : { "_id" : "$Name", "Average": {"$avg": "$Value"} }},
                                     { "$sort" : {"Average Rank" : 1}},
                                     { "$limit" : 50},
                                     { "$project" : 
                                          { "Average Rank": { "$trunc": [ "$Average", 2 ] } } }
                                     ] ))

{'_id': 'Umaymah', 'Average Rank': 1614.65}
{'_id': 'Azra', 'Average Rank': 1615.8}
{'_id': 'Georgia', 'Average Rank': 44.3}
{'_id': 'Darcey', 'Average Rank': 219.5}
{'_id': 'Selin', 'Average Rank': 1302.5}
{'_id': 'Safiyah', 'Average Rank': 1010.3}
{'_id': 'Cassia', 'Average Rank': 1331.65}
{'_id': 'Nerys', 'Average Rank': 2587.15}
{'_id': 'Kayleigh', 'Average Rank': 394.03}
{'_id': 'Emilia', 'Average Rank': 114.84}
{'_id': 'Remy', 'Average Rank': 1467.26}
{'_id': 'Nisa', 'Average Rank': 2165.96}
{'_id': 'Dawn', 'Average Rank': 2339.88}
{'_id': 'Alya', 'Average Rank': 1895.5}
{'_id': 'Gwenno', 'Average Rank': 2439.0}
{'_id': 'Halle', 'Average Rank': 598.11}
{'_id': 'Ayah', 'Average Rank': 841.23}
{'_id': 'Honor', 'Average Rank': 610.92}
{'_id': 'Maisha', 'Average Rank': 1106.8}
{'_id': 'Elsie', 'Average Rank': 272.5}
{'_id': 'Rabia', 'Average Rank': 1132.5}
{'_id': 'Lucia', 'Average Rank': 230.65}
{'_id': 'Emilee', 'Average Rank': 1423.5}
{'_id': 'Yumna', 'Average Rank': 1617.3}
{'_id
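Each stage of a pipeline only sees the field names produced by the stage before it: after `$group` the average is called `Average`, and `Average Rank` does not exist until `$project` runs. The sketch below is one possible way to build the pipeline so that `$sort` uses the grouped field name, and so that `$round` (available from MongoDB 4.2) is used rather than `$trunc`. The function name `average_rank_pipeline` and its parameters are illustrative assumptions, not part of the original notebook.

```python
def average_rank_pipeline(gender="Girl", limit=50):
    """Build an aggregation pipeline listing names by average rank (best first)."""
    return [
        # Keep only the rank rows for the requested gender
        {"$match": {"Type": "Rank", "Gender": gender}},
        # One output document per name, holding the mean of its yearly ranks
        {"$group": {"_id": "$Name", "Average": {"$avg": "$Value"}}},
        # Sort on "Average": the field name created by $group in the previous stage
        {"$sort": {"Average": 1}},
        {"$limit": limit},
        # Rename the field and round it to 2 decimal places for display
        {"$project": {"Average Rank": {"$round": ["$Average", 2]}}},
    ]

pipeline = average_rank_pipeline()
# With a live connection this could be run as:
# printDoc(db.namesUpdate.aggregate(pipeline))
```

Building the pipeline as an ordinary Python list makes it easy to print and check each stage before sending it to the server.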

# Semi-Structured data

The Baby Names dataset is an example of structured data, in that it is very uniform, with the same data types in each column.

The power of NoSQL databases is in coping with semi-structured data, such as JSON data, where the values may not be straightforward strings and numbers, but could be nested documents.



## USA Government data

A lot of publicly available data is in JSON format, for example, government agencies:

- https://catalog.data.gov/dataset?res_format=JSON

- https://github.com/jdorfman/awesome-json-datasets#government

The examples below use the USA government politician datasets found in the last link.

- Current US Senators: role.json
- Current US Representatives: role-reps.json

Plus a list of USA States and abbreviations:
- states_titlecase.json

found here: https://gist.github.com/mshafrir/2646763

All downloaded: 09/01/2024 - before the changes from the 2024 USA elections!

The USA government data is public domain, provided under the `CC0 1.0 Universal` licence: https://creativecommons.org/publicdomain/zero/1.0/


In [55]:
# let's have a look at the data
!head data/role.json

{
 "meta": {
  "limit": 100,
  "offset": 0,
  "total_count": 100
 },
 "objects": [
  {
   "caucus": null,
   "congress_numbers": [


In [56]:
!tail data/role.json

   "senator_rank": "junior",
   "senator_rank_label": "Junior",
   "startdate": "2023-10-03",
   "state": "CA",
   "title": "Sen.",
   "title_long": "Senator",
   "website": "https://www.butler.senate.gov"
  }
 ]
}

In [57]:
! head data/role-reps.json

{
 "meta": {
  "limit": 438,
  "offset": 0,
  "total_count": 439
 },
 "objects": [
  {
   "caucus": null,
   "congress_numbers": [


In [58]:
!tail data/role-reps.json

   "senator_class": null,
   "senator_rank": null,
   "startdate": "2023-11-07",
   "state": "RI",
   "title": "Rep.",
   "title_long": "Representative",
   "website": ""
  }
 ]
}

In [59]:
# note, this file was amended to remove the commas between each document (otherwise would not import)
!head data/states_titlecase.json

{
"name": "Alabama",
"abbreviation": "AL"
}
{
"name": "Alaska",
"abbreviation": "AK"
}
{
"name": "American Samoa",


In [60]:
!tail data/states_titlecase.json

"abbreviation": "WV"
}
{
"name": "Wisconsin",
"abbreviation": "WI"
}
{
"name": "Wyoming",
"abbreviation": "WY"
}
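The states file had to be edited by hand before `mongoimport` would accept it. As an alternative, a file that holds a single JSON array can be rewritten as one document per line with a few lines of Python. The sketch below assumes the original download is a valid JSON array; the function name `json_array_to_lines` is hypothetical. (`mongoimport` also has a `--jsonArray` option for files that contain a single JSON array.)

```python
import json

def json_array_to_lines(array_text):
    """Convert the text of a JSON array into one JSON document per line."""
    documents = json.loads(array_text)
    if not isinstance(documents, list):
        raise ValueError("expected a JSON array at the top level")
    return "\n".join(json.dumps(doc) for doc in documents)

# A two-state sample in the same shape as states_titlecase.json
sample = '[{"name": "Alabama", "abbreviation": "AL"}, {"name": "Alaska", "abbreviation": "AK"}]'
converted = json_array_to_lines(sample)
print(converted)
```

The converted text could then be written to a new file and imported in the usual way.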

The politicians data looks to be in JSON format and appears to have some metadata at the start.

In [61]:
client.drop_database('politicsDB')

In [62]:
! mongoimport --db politicsDB --type=json --file data/role.json  --collection senators
! mongoimport --db politicsDB --type=json --file data/role-reps.json --collection reps
! mongoimport --db politicsDB --type=json --file data/states_titlecase.json --collection states

2025-01-24T12:40:57.039+0000	connected to: mongodb://localhost/
2025-01-24T12:40:57.058+0000	1 document(s) imported successfully. 0 document(s) failed to import.
2025-01-24T12:40:57.183+0000	connected to: mongodb://localhost/
2025-01-24T12:40:57.231+0000	1 document(s) imported successfully. 0 document(s) failed to import.
2025-01-24T12:40:57.354+0000	connected to: mongodb://localhost/
2025-01-24T12:40:57.368+0000	59 document(s) imported successfully. 0 document(s) failed to import.


In [63]:
# Change database
db = client.politicsDB
senators = db.senators
reps = db.reps
states = db.states

In [64]:
states.find_one()

{'_id': ObjectId('67938a59875d20be66a7189c'),
 'name': 'Kansas',
 'abbreviation': 'KS'}

The politician details have nested documents, where a document (or array) is nested within other information. This is an example of semi-structured data.

The dot syntax can be used to search nested documents. For example, this snippet from the representatives data shows the keys that are available within the `person` sub-document:

<pre>
'person': {'bioguideid': 'A000055',
    'birthday': '1965-07-22',
    'cspanid': 45516,
    'fediverse_webfinger': None,
    'firstname': 'Robert',
    'gender': 'male',
    'gender_label': 'Male',
    'lastname': 'Aderholt',
    'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
    'middlename': 'B.',
    'name': 'Rep. Robert Aderholt [R-AL4]',
    'namemod': '',
    'nickname': '',
    'osid': 'N00003028',
    'pvsid': None,
    'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
    'twitterid': 'Robert_Aderholt',
    'youtubeid': 'RobertAderholt'},
</pre>   

There is an issue with the way the data has been provided: all the politicians have been stored in one document, rather than one document per politician.

For example, 
<pre>
senators.find_one()
</pre>

This will show all the politicians, which is not very helpful when there are lots of them. Really this data could do with some cleaning first, so that each politician is stored in a separate document. This is beyond the scope of what you need to do with JSON data for TM351, so something for another day....
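To give an idea of what such cleaning might involve, here is a minimal sketch that splits the wrapper document into one document per politician using plain Python. The function name `split_objects`, the miniature sample data and the `reps_flat` collection name are all assumptions made for illustration; nothing later in the notebook relies on them.

```python
def split_objects(wrapper):
    """Return one document per politician from a {'meta': ..., 'objects': [...]} wrapper."""
    return [dict(politician) for politician in wrapper.get("objects", [])]

# Made-up miniature of the real file: the same shape, but only two entries
wrapper = {
    "meta": {"limit": 2, "offset": 0, "total_count": 2},
    "objects": [
        {"state": "AL", "person": {"firstname": "Robert", "lastname": "Aderholt"}},
        {"state": "FL", "person": {"firstname": "Gus", "lastname": "Bilirakis"}},
    ],
}

politicians = split_objects(wrapper)
print(len(politicians), "documents")

# With a live connection, these could be stored one per document, for example:
# db.reps_flat.insert_many(politicians)   # 'reps_flat' is a hypothetical collection name
```

With one document per politician, a query such as `find_one({"person.lastname": "Aderholt"})` would return only that politician.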

Being stored in just one document also causes issues when querying the data. If the query matches, all the data in that document is returned! For example, try this to see just one of the politicians (uncomment the cell below):

<pre>
reps.find_one({"objects.person.bioguideid": 'A000055' })
</pre>


In [65]:
#reps.find_one({"objects.person.bioguideid": 'A000055' })

Extracting individual items from the array requires the `$unwind` operator.

For example, display the first and last names of 50 politicians, suppressing the generated id:

In [66]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname" : 1}},
                        {"$unwind":"$objects"},
                        {"$limit": 50}]) 
printDoc(docs)

{'objects': {'person': {'firstname': 'Robert', 'lastname': 'Aderholt'}}}
{'objects': {'person': {'firstname': 'Gus', 'lastname': 'Bilirakis'}}}
{'objects': {'person': {'firstname': 'Sanford', 'lastname': 'Bishop'}}}
{'objects': {'person': {'firstname': 'Earl', 'lastname': 'Blumenauer'}}}
{'objects': {'person': {'firstname': 'Vern', 'lastname': 'Buchanan'}}}
{'objects': {'person': {'firstname': 'Larry', 'lastname': 'Bucshon'}}}
{'objects': {'person': {'firstname': 'Michael', 'lastname': 'Burgess'}}}
{'objects': {'person': {'firstname': 'Ken', 'lastname': 'Calvert'}}}
{'objects': {'person': {'firstname': 'André', 'lastname': 'Carson'}}}
{'objects': {'person': {'firstname': 'John', 'lastname': 'Carter'}}}
{'objects': {'person': {'firstname': 'Kathy', 'lastname': 'Castor'}}}
{'objects': {'person': {'firstname': 'Judy', 'lastname': 'Chu'}}}
{'objects': {'person': {'firstname': 'Yvette', 'lastname': 'Clarke'}}}
{'objects': {'person': {'firstname': 'Emanuel', 'lastname': 'Cleaver'}}}
{'object

To extract one particular politician:

In [67]:
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'firstname': 'Robert',
                        'lastname': 'Aderholt'}}}


Do make sure the pipeline stages are in the right order: if the `$match` is done too soon, it will again return all the representatives whenever the query criteria are matched.

If you try this, you will see all the data returned:

<pre>
docs = reps.aggregate([{"$project" : {"_id": 0, 
            "objects.person.firstname" : 1, 
            "objects.person.lastname": 1,
            "objects.person.bioguideid":1}},
        {"$match": { "objects.person.bioguideid": 'A000055' }}, 
        {"$unwind":"$objects"}
])
printDocValues(docs)
</pre>

See the discussion on duplicating the $match here:
https://stackoverflow.com/questions/54030089/how-to-use-unwind-and-match-with-mongodb
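To see why the order matters, the following plain-Python sketch imitates `$match` and `$unwind` on a miniature wrapper document. The helper functions `values_at`, `match` and `unwind` are rough stand-ins written for this illustration (they are not pymongo functions), and the second bioguide id is made up.

```python
def values_at(value, keys):
    """Yield every value found at a dotted path, looking inside lists on the way."""
    if isinstance(value, list):
        for item in value:
            yield from values_at(item, keys)
    elif not keys:
        yield value
    elif isinstance(value, dict) and keys[0] in value:
        yield from values_at(value[keys[0]], keys[1:])

def match(docs, path, wanted):
    """Rough stand-in for $match: keep documents where ANY value at path equals wanted."""
    return [d for d in docs if wanted in values_at(d, path.split("."))]

def unwind(docs, field):
    """Rough stand-in for $unwind: one output document per element of the array."""
    return [{**d, field: item} for d in docs for item in d[field]]

# Miniature version of the single wrapper document in the reps collection
reps_docs = [{"objects": [{"person": {"bioguideid": "A000055"}},
                          {"person": {"bioguideid": "X000001"}}]}]  # second id is invented
path, wanted = "objects.person.bioguideid", "A000055"

match_first = unwind(match(reps_docs, path, wanted), "objects")
unwind_first = match(unwind(reps_docs, "objects"), path, wanted)
print(len(match_first), len(unwind_first))   # prints: 2 1
```

Matching first keeps the whole wrapper document (because one of its array elements matches), so unwinding afterwards still produces every politician. Unwinding first gives one document per politician, so the match can then pick out just the one wanted.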

In [68]:
# retrieving just A000055
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.bioguideid":1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                         ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'firstname': 'Robert',
                        'lastname': 'Aderholt'}}}


In [69]:
# show all the person details, which needs a duplicate $match 
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}, 
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Aderholt',
                        'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
                        'middlename': 'B.',
                        'name': 'Rep. Robert Aderholt [R-AL4]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00003028',
                        'pvsid': None,
                        'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
                        'twitterid': 'Robert_Aderholt',
                        'youtubeid': 'RobertAderholt'}}}


In [70]:
# though it seems OK with just the one $match (after the $unwind)
docs = reps.aggregate([{"$project" : {"_id": 0, "objects.person" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                        'lastname': 'Aderholt',
                        'link': 'https://www.govtrack.us/congress/members/robert_aderholt/400004',
                        'middlename': 'B.',
                        'name': 'Rep. Robert Aderholt [R-AL4]',
                        'namemod': '',
                        'nickname': '',
                        'osid': 'N00003028',
                        'pvsid': None,
                        'sortname': 'Aderholt, Robert (Rep.) [R-AL4]',
                        'twitterid': 'Robert_Aderholt',
                        'youtubeid': 'RobertAderholt'}}}


`person` is part of the details stored for each representative.

In [71]:
# show all the details for this representative 
docs = reps.aggregate([{"$project" : {"_id": 0, "objects" : 1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.bioguideid": 'A000055' }}
                      ])
printDocs(docs)

{'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Alabama's 4th congressional "
                            'district',
             'district': 4,
             'enddate': '2025-01-03',
             'extra': {'address': '266 Cannon House Office Building Washington '
                                  'DC 20515-0104',
                       'office': '266 Cannon House Office Building',
                       'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
                        'gender': 'male',
                        'gender_label': 'Male',
                 

## Joins

At some point you may need to join documents from two collections, which was not possible in early versions of MongoDB. Version 3.2 introduced the `$lookup` pipeline stage ([Mongo docs](https://www.mongodb.com/docs/manual/reference/operator/aggregation/lookup/)), which provides something similar to a simple left outer join.

MongoDB provides joins as a stage in the aggregation pipeline.
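Conceptually, a simple equality `$lookup` keeps every document from the collection being aggregated and attaches any matching documents from the other collection as an array. The plain-Python sketch below imitates that behaviour on made-up sample rows; the function `lookup` and its parameter names are assumptions chosen to mirror the `$lookup` options, not a pymongo API.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Plain-Python imitation of a simple equality $lookup (a left outer join)."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs
                   if f.get(foreign_field) == doc.get(local_field)]
        # Every local document is kept; 'matches' is simply empty when nothing joins
        joined.append({**doc, as_field: matches})
    return joined

# Made-up sample rows: the second one has no matching state
reps_sample = [{"name": "Aderholt", "state": "AL"}, {"name": "Nobody", "state": "ZZ"}]
states_sample = [{"name": "Alabama", "abbreviation": "AL"}]

for row in lookup(reps_sample, states_sample, "state", "abbreviation", "stateInfo"):
    print(row)
```

Note that the joined data arrives as a list, even when there is only one match, and that unmatched documents are kept with an empty list.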


In [72]:
# if you don't know your American states, join up the states collection
joined = reps.aggregate([
     {"$unwind":"$objects"},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }
  }
])


In [73]:
# This will generate a lot of data, so only uncomment if you want to see it all:
#printDocs(joined)

In [74]:
# Instead, let's just home in on our representative
joined = reps.aggregate([
     {"$unwind":"$objects"},
     {"$lookup":
       {
         "from": "states",
         "localField": "objects.state",
         "foreignField": "abbreviation",
         "as": "stateInfo"
       }},
       {"$match": { "objects.person.bioguideid": 'A000055' }}
])
printDocs(joined)


{'_id': ObjectId('67938a590744cfd02c19a756'),
 'meta': {'limit': 438, 'offset': 0, 'total_count': 439},
 'objects': {'caucus': None,
             'congress_numbers': [118],
             'current': True,
             'description': "Representative for Alabama's 4th congressional "
                            'district',
             'district': 4,
             'enddate': '2025-01-03',
             'extra': {'address': '266 Cannon House Office Building Washington '
                                  'DC 20515-0104',
                       'office': '266 Cannon House Office Building',
                       'rss_url': 'http://aderholt.house.gov/common/rss//index.cfm?rss=20'},
             'leadership_title': None,
             'party': 'Republican',
             'person': {'bioguideid': 'A000055',
                        'birthday': '1965-07-22',
                        'cspanid': 45516,
                        'fediverse_webfinger': None,
                        'firstname': 'Robert',
   

In [75]:
# the senator data is similarly structured, let's find any female senators
docs = senators.aggregate([{"$project" : {"_id": 0, "objects.person.firstname" : 1, "objects.person.lastname": 1,
                                     "objects.person.gender":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.gender": 'female' }}
                      ])
printDocValues(docs)

[{'person': {'firstname': 'Maria', 'gender': 'female', 'lastname': 'Cantwell'}}]
[{'person': {'firstname': 'Debbie', 'gender': 'female', 'lastname': 'Stabenow'}}]
[{'person': {'firstname': 'Tammy', 'gender': 'female', 'lastname': 'Baldwin'}}]
[{'person': {'firstname': 'Marsha', 'gender': 'female', 'lastname': 'Blackburn'}}]
[{'person': {'firstname': 'Mazie', 'gender': 'female', 'lastname': 'Hirono'}}]
[{'person': {'firstname': 'Kirsten', 'gender': 'female', 'lastname': 'Gillibrand'}}]
[{'person': {'firstname': 'Amy', 'gender': 'female', 'lastname': 'Klobuchar'}}]
[{'person': {'firstname': 'Kyrsten', 'gender': 'female', 'lastname': 'Sinema'}}]
[{'person': {'firstname': 'Elizabeth', 'gender': 'female', 'lastname': 'Warren'}}]
[{'person': {'firstname': 'Deb', 'gender': 'female', 'lastname': 'Fischer'}}]
[{'person': {'firstname': 'Jacky', 'gender': 'female', 'lastname': 'Rosen'}}]
[{'person': {'firstname': 'Susan', 'gender': 'female', 'lastname': 'Collins'}}]
[{'person': {'firstname': 'Jea

In [76]:
# find the Senior Senator for Michigan

docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)

{'objects': {'caucus': None,
             'congress_numbers': [116, 117, 118],
             'current': True,
             'description': 'Senior Senator for Michigan',
             'district': None,
             'enddate': '2025-01-03',
             'extra': {'address': '731 Hart Senate Office Building Washington '
                                  'DC 20510',
                       'contact_form': 'https://www.stabenow.senate.gov/contact',
                       'office': '731 Hart Senate Office Building',
                       'rss_url': 'http://stabenow.senate.gov/rss/?p=news'},
             'leadership_title': 'Senate Democratic Policy & Communications '
                                 'Committee Chair',
             'party': 'Democrat',
             'person': {'bioguideid': 'S000770',
                        'birthday': '1950-04-29',
                        'cspanid': 45451,
                        'fediverse_webfinger': None,
                        'firstname': 'Debbie',
     

In [77]:
# Finally, remember that because there is no schema you can query on a field that does not exist: MongoDB will simply return no matches and will not warn you!
docs = senators.aggregate([{"$project" : {"_id": 0, "objects":1}},
                       {"$unwind":"$objects"},
                       {"$match": { "objects.person.description": "Senior Senator for Michigan"}}
                      ])
printDocs(docs)
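Because a misspelt or misplaced field name fails silently, it can help to check a dotted path against a sample document before trusting an empty result. The following is a minimal sketch of such a check; `path_exists` is a hypothetical helper written for this illustration, and the sample document is a cut-down imitation of the senators data.

```python
def path_exists(doc, dotted_path):
    """Return True if the dotted path can be followed through nested dicts and lists."""
    current = [doc]
    for key in dotted_path.split("."):
        next_level = []
        for value in current:
            # Look inside arrays, as MongoDB's dot notation does
            items = value if isinstance(value, list) else [value]
            next_level.extend(item[key] for item in items
                              if isinstance(item, dict) and key in item)
        if not next_level:
            return False
        current = next_level
    return True

sample = {"objects": [{"description": "Senior Senator for Michigan",
                       "person": {"firstname": "Debbie"}}]}

print(path_exists(sample, "objects.description"))          # True
print(path_exists(sample, "objects.person.description"))   # False
```

With a live connection, the same check could be applied to a real document, for example `path_exists(senators.find_one(), "objects.person.description")`.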

## Twitter Data

Another example of semi-structured data is Twitter, or X data. 

Unfortunately, X now imposes a cost of at least $100 a month if you want to extract tweets (creating is still free!). Below are some examples of historical tweets extracted on the 10th and 11th January 2023 from `#BBCNews`. This was when you could still obtain the data for free under their `academic research product track` and `developer application` account. 

Discussion of using Twitter data can be found in Ahmed 2021's paper:
Wasim Ahmed (2021), *Using Twitter as a data source an overview of social media research tools (2021)*, available at: https://blogs.lse.ac.uk/impactofsocialsciences/2021/05/18/using-twitter-as-a-data-source-an-overview-of-social-media-research-tools-2021/ accessed 21/01/2025.

See this website for more information on costs:

https://docs.x.com/x-api/getting-started/about-x-api

The examples below show how easily MongoDB can work with semi-structured data.

In [78]:
# Needed for Twitter data
import string
import operator
import re

In [79]:
# Good practice to examine the data before importing it
! head -2 data/BBCNews-230110-2118.json

{"id": 1612920079967985706, "text": "Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun", "edit_history_tweet_ids": [1612920079967985706], "author_id": 612473, "context_annotations": [{"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "66", "name": "Interests and Hobbies Category", "description": "A grouping of interests and hobbies entities, like Novelty Food or Destinations"}, "entity": {"id": "1237472346560053249", "name": "Firefighting"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "12374723465600

In [80]:
! tail -2 data/BBCNews-230111-2214.json

{"id": 1612907650672369668, "text": "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1", "edit_history_tweet_ids": [1612907650672369668], "author_id": 612473, "context_annotations": [{"domain": {"id": "10", "name": "Person", "description": "Named people in the world like Nelson Mandela"}, "entity": {"id": "934100345080262656", "name": "Prince Harry", "description": "Prince Harry"}}, {"domain": {"id": "46", "name": "Business Taxonomy", "description": "Categories within Brand Verticals that narrow down the scope of Brands"}, "entity": {"id": "1557697121477832705", "name": "Publisher & News Business", "description": "Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books"}}, {"domain": {"id": "131", "name": "Unified Twitter Taxonomy", "description": "A taxonomy of user interests. "}, "entity": {"id": "847878884917886977", "name": "Politics

In [81]:
client.drop_database('twitterDB')
client.list_database_names()

['accidents',
 'admin',
 'babyNamesDB',
 'bbc_db',
 'config',
 'fsa',
 'local',
 'politicsDB']

In [82]:
# load the news from the 10th and 11th January files into the bbcnews collection of a twitterDB database
! mongoimport --db twitterDB  --file data/BBCNews-230110-2118.json --collection bbcnews
! mongoimport --db twitterDB  --file data/BBCNews-230111-2214.json --collection bbcnews

2025-01-24T12:40:57.943+0000	connected to: mongodb://localhost/
2025-01-24T12:40:57.974+0000	100 document(s) imported successfully. 0 document(s) failed to import.
2025-01-24T12:40:58.101+0000	connected to: mongodb://localhost/
2025-01-24T12:40:58.123+0000	100 document(s) imported successfully. 0 document(s) failed to import.


In [83]:
client.list_database_names()

['accidents',
 'admin',
 'babyNamesDB',
 'bbc_db',
 'config',
 'fsa',
 'local',
 'politicsDB',
 'twitterDB']

In [84]:
# Change database
db = client.twitterDB
bbcnews = db.bbcnews
bbcnews.find_one()

{'_id': ObjectId('67938a59cac9e7819e94e20a'),
 'id': 1612920079967985706,
 'text': 'Firefighters face higher cancer risk, study finds https://t.co/EMGsCJ0yun',
 'edit_history_tweet_ids': [1612920079967985706],
 'author_id': 612473,
 'context_annotations': [{'domain': {'id': '46',
    'name': 'Business Taxonomy',
    'description': 'Categories within Brand Verticals that narrow down the scope of Brands'},
   'entity': {'id': '1557697121477832705',
    'name': 'Publisher & News Business',
    'description': 'Brands, companies, advertisers and every non-person handle with the profit intent related to  marketing and advertiser agencies, publishers of magazines, newspapers, blogs, books'}},
  {'domain': {'id': '66',
    'name': 'Interests and Hobbies Category',
    'description': 'A grouping of interests and hobbies entities, like Novelty Food or Destinations'},
   'entity': {'id': '1237472346560053249', 'name': 'Firefighting'}},
  {'domain': {'id': '131',
    'name': 'Unified Twitter Taxon

In [85]:
# What columns/keys does it have
# Some of these keys are subdocuments, such as the entities one seen above
bbcnews.find_one().keys()

dict_keys(['_id', 'id', 'text', 'edit_history_tweet_ids', 'author_id', 'context_annotations', 'conversation_id', 'created_at', 'edit_controls', 'entities', 'lang', 'public_metrics', 'reply_settings'])

In [86]:
# Prince Harry was topical when this data was generated! Is he mentioned at all?!
# $regex allows pattern matching. The 'i' option makes the search case insensitive
# "_id": 0 suppresses showing the object id
# SELECT text FROM bbcnews WHERE LOWER(text) LIKE '%harry%';

tweets = bbcnews.find({'text':{'$regex':'Harry', '$options': 'i'}}, {"_id":0,'text': 1})
printDoc(tweets)

{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB"}
{'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1"}
{'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC"}
{'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t'}
{'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64'}
{'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc'}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/XmfSToqCxu"}
{'text': "Prince Harry's book officially hits shops after days of leaks https://t.co/Ff92m8HC4g"}
{'text': "Who is Harry's ghostwriter, JD Moehringer - and how much did he make? https://t.co/P2TgFcLz8U"}
{'text': "Newspaper headlines: 'No way back' says Harry and ho

In [87]:
# $regex can be used on more than one field. Use "$or" to return documents that match either condition.
# Just make sure the brackets are the correct ones and lined up correctly!
# SELECT text, created_at from bbcnews WHERE LOWER(text) LIKE '%harry%' OR created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$or": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))

[{'text': "Harry's memoir Spare displayed beside How to Kill Your Family novel https://t.co/jL5JlFeAgB",
  'created_at': 'Tuesday 10-Jan-2023 21:02:03'},
 {'text': "Things you might have missed from Prince Harry's book https://t.co/efU3aR5fg1",
  'created_at': 'Tuesday 10-Jan-2023 20:22:04'},
 {'text': "Prince Harry's publisher says book sales 'beyond expectations' https://t.co/tt8qR8WxwC",
  'created_at': 'Tuesday 10-Jan-2023 19:08:58'},
 {'text': 'Harry Styles and Top Gun Maverick boost entertainment industry with record sales https://t.co/leC2wtd94t',
  'created_at': 'Tuesday 10-Jan-2023 14:42:05'},
 {'text': '"I want to hear his story in his words"\n\nPrince Harry\'s book officially hits shops \n\nhttps://t.co/fYy7DUko83 https://t.co/9Z5fJOcT64',
  'created_at': 'Tuesday 10-Jan-2023 12:56:42'},
 {'text': 'Prince Harry and the power of the beard https://t.co/TQE8AQO6Vc',
  'created_at': 'Tuesday 10-Jan-2023 09:46:49'},
 {'text': "Prince Harry's book officially hits shops after days 

In [88]:
# or use "$and" to return only the documents that match both conditions.
# SELECT text, created_at from bbcnews WHERE LOWER(text) LIKE '%harry%' AND created_at LIKE '%Wednesday%'

list(bbcnews.find({
    "$and": 
    [ {'text': {'$regex':'Harry', '$options': 'i'}},  
      {"created_at" : {'$regex': 'Wednesday'}} 
    ]
    }, 
    {"_id":0,'created_at': 1, 'text': 1}))


[{'text': "Which Royal has come out best in the fallout from Prince Harry's book? https://t.co/zvF9dTJNKW",
  'created_at': 'Wednesday 11-Jan-2023 14:58:43'},
 {'text': "Prince Harry condemns 'dangerous spin' about his Taliban comments https://t.co/V2u2hELxp4",
  'created_at': 'Wednesday 11-Jan-2023 02:04:29'}]
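The difference between the two queries can be checked without a database. The sketch below applies the same two regular expressions to a small sample with Python's `re` module; two of the sample tweets are taken from the outputs above and the middle one is invented. The helper names `mentions_harry` and `on_wednesday` are assumptions for illustration.

```python
import re

tweets_sample = [
    {"text": "Prince Harry and the power of the beard",
     "created_at": "Tuesday 10-Jan-2023 09:46:49"},
    {"text": "Weather warning issued",                      # invented tweet
     "created_at": "Wednesday 11-Jan-2023 08:00:00"},
    {"text": "Prince Harry condemns 'dangerous spin' about his Taliban comments",
     "created_at": "Wednesday 11-Jan-2023 02:04:29"},
]

def mentions_harry(doc):
    # Same idea as {'$regex': 'Harry', '$options': 'i'}: case-insensitive search
    return re.search("harry", doc["text"], re.IGNORECASE) is not None

def on_wednesday(doc):
    # Same idea as {'$regex': 'Wednesday'}: case-sensitive search
    return re.search("Wednesday", doc["created_at"]) is not None

either = [d for d in tweets_sample if mentions_harry(d) or on_wednesday(d)]
both = [d for d in tweets_sample if mentions_harry(d) and on_wednesday(d)]
print(len(either), len(both))   # prints: 3 1
```

In MongoDB itself, listing several fields in one filter document is already an implicit "AND", so `{"text": {...}, "created_at": {...}}` should behave the same as the explicit `$and` form when the conditions are on different fields.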

In [89]:
# show the distinct languages found in the tweets
# SELECT DISTINCT lang FROM bbcnews;
# The supported languages can be found here: https://developer.twitter.com/en/docs/twitter-for-websites/supported-languages 
db.bbcnews.distinct("lang")

['ca', 'en', 'fr', 'tl']

In [90]:
# How many tweets have been retweeted more than 100 times
# Use the dot notation to reference keys in any subdocument
db.bbcnews.count_documents({"public_metrics.retweet_count": { '$gt' : 100 }})    

15

In [91]:
tweets = db.bbcnews.find({'entities.urls.title':{'$regex':'Firefighter'}}, {'entities.urls.title': 1})

printDoc(tweets)

{'_id': ObjectId('67938a59cac9e7819e94e20a'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}
{'_id': ObjectId('67938a5a31ccc2bc25ffdf1f'), 'entities': {'urls': [{'title': 'Firefighters face higher cancer risk, Scottish study finds'}]}}


In [92]:
# Show some fields from the entities subdocument.
# When showing the subdocuments pretty print makes the tweets more readable
tweets = db.bbcnews.find({}, {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'They are also more likely to die from '
                                       'heart attacks and strokes, researchers '
                                       'say.',
                        'title': 'Firefighters face higher cancer risk, '
                                 'Scottish study finds'}]}}
{'entities': {'urls': [{'description': "A bookseller placed Prince Harry's "
                                       "memoir Spare beside Bella Mackie's "
                                       'novel How to Kill Your Family.',
                        'title': "Harry's memoir Spare displayed beside How to "
                                 'Kill Your Family novel'}]}}
{'entities': {'urls': [{'description': 'Zholia Alemi is a "most accomplished '
                                       'fraudster" who forged a certificate to '
                                       'get work, a jury hears.',
                        'title': 'Unqualified doctor who faked

In [93]:
# These can be searched too - find the Seal story that was topical at the time
tweets = db.bbcnews.find({"entities.urls.title": {"$regex": "seal", '$options': 'i' }}, 
                         {"_id":0, "entities.urls.title": 1, "entities.urls.description": 1})

printDocs(tweets)

{'entities': {'urls': [{'description': 'Being in a fishing lake is like "being '
                                       'in a branch of Waitrose" for a hungry '
                                       'seal, an expert says.',
                        'title': 'Seal stuck in Rochford lake munching its way '
                                 'through fish stock'}]}}


In [94]:
# Another way to unpack the nested documents
# https://stackoverflow.com/questions/25909927/mongodb-how-to-get-a-field-sub-document-from-a-document
tweets=db.bbcnews.aggregate([
    # De-normalize the array content first
    { "$unwind": "$entities" },
    # De-normalize the content from the inner array as well
    { "$unwind": "$entities.urls" },

    # Group the "entities" per document
    { "$group": {
        "_id": "$_id",
        "entities": { "$addToSet": "$entities.urls" }
    }},
    { "$limit": 5}
])
printDocs(tweets)

{'_id': ObjectId('67938a5a31ccc2bc25ffdec6'),
 'entities': [{'description': 'Maddi Neale-Shankster, 21, was paralysed from '
                              'the waist down, after the accident on New '
                              "Year's Eve.",
               'display_url': 'bbc.in/3XozAJn',
               'end': 122,
               'expanded_url': 'https://bbc.in/3XozAJn',
               'images': [{'height': 576,
                           'url': 'https://pbs.twimg.com/news_img/1613280140687036425/EbNhZSfY?format=jpg&name=orig',
                           'width': 1024},
                          {'height': 150,
                           'url': 'https://pbs.twimg.com/news_img/1613280140687036425/EbNhZSfY?format=jpg&name=150x150',
                           'width': 150}],
               'start': 99,
               'status': 200,
               'title': 'Appeal to bring Thailand balcony fall woman home '
                        'raises £73k',
               'unwound_url': 'https://ww

# Summary

This notebook gives you a flavour of the document type of database management system. 

What are the differences between MongoDB and a relational database, such as PostgreSQL?

Some things to think about:

*Relational*
- relational has a fixed schema
- the data is normalised, with less duplication
- constraints can be enforced
- ACID transaction support (Atomicity, Consistency, Isolation and Durability)

*NoSQL (Document)*
- flexible schema, optional data can be easily incorporated.
- can support agile development
- data is denormalised, so can mean more duplication
- constraints not enforced
- BASE transaction support (Basically Available, Soft state, Eventual consistency!)

Bear in mind that NoSQL is a relatively new technology, so it can be seen as immature in that it does not provide good support for transaction handling or access control, although it could be argued that this is not the market it is aimed at. 

