## MongoDB 1

In [1]:
# import statements
import os
from pymongo import MongoClient
import bson.json_util  # json_util is a submodule; import it explicitly so bson.json_util.loads is available
from datetime import datetime

### Connection establishment

In [2]:
client = MongoClient('mongodb://localhost:27017/')

In [3]:
client.server_info()

{'version': '8.0.0',
 'gitVersion': 'd7cd03b239ac39a3c7d63f7145e91aca36f93db6',
 'modules': [],
 'allocator': 'tcmalloc-google',
 'javascriptEngine': 'mozjs',
 'sysInfo': 'deprecated',
 'versionArray': [8, 0, 0, 0],
 'openssl': {'running': 'OpenSSL 3.0.13 30 Jan 2024',
  'compiled': 'OpenSSL 3.0.13 30 Jan 2024'},
 'buildEnvironment': {'distmod': 'ubuntu2404',
  'distarch': 'x86_64',
  'cc': '/opt/mongodbtoolchain/v4/bin/gcc: gcc (GCC) 11.3.0',
  'ccflags': '-Werror -include mongo/platform/basic.h -ffp-contract=off -fasynchronous-unwind-tables -g2 -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -gdwarf-5 -fno-omit-frame-pointer -fno-strict-aliasing -O2 -march=sandybridge -mtune=generic -mprefer-vector-width=128 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-const-variable -Wno-unused-but-set-variable -Wno-missing-braces -Wno-psabi -fstack-protector-strong -gdwarf64 -Wa,--nocompress-debug-sections -fno-builtin-memcmp -Wimplicit-fallthroug

### MongoDB sample datasets

- Source: https://www.mongodb.com/docs/atlas/sample-data/sample-training/

### Lazy database creation

- `client.sample_training` only creates a reference to the database
- the database is actually created on the first write operation, such as inserting a document
- until then, the database does not physically exist on the server
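The lazy behaviour described above can be checked with a small helper. This is a minimal sketch: the name `database_exists` is illustrative, and `client` is assumed to be a connected `MongoClient`.

```python
def database_exists(client, name):
    """Return True if a database called `name` is currently stored on the server."""
    # list_database_names() only reports databases that already hold data
    return name in client.list_database_names()
```

On a fresh server, `database_exists(client, 'sample_training')` would be expected to return `False` until the first document has been inserted.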

In [4]:
db = client.sample_training

In [None]:
# directory where the JSON files are stored
json_dir = 'sample_training'
json_files = [f for f in os.listdir(json_dir) if f.endswith(".json")]
collections = [f.replace(".json", "") for f in json_files]
collections

In [None]:
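# Each file is read line by line, one Extended JSON document per line
for json_file, collection in zip(json_files, collections):
    with open(os.path.join(json_dir, json_file), 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines, which json_util cannot parse
            data = bson.json_util.loads(line)
            db[collection].insert_one(data)

    print(f"Loaded {json_file} into the '{collection}' collection.")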

### Verify collection names

In [5]:
db.list_collection_names()

['inspections', 'trips']

### MongoDB API: Querying documents

#### Select all documents in a collection `db.collection.find(query, projection, options)`

- with an empty or omitted `query`, retrieves all documents from a collection
- equivalent to the `SELECT * FROM <TABLE>` SQL query
- returns a cursor that can be used to iterate over the results
- `query`:
    - selection filter
    - `{ <field1>: <value>, <field2>: {conditions} ... }`
- `projection`:
    - determines which fields are returned in the matching documents
    - `{ <field1>: <value>, <field2>: <value> ... }`, where each value is typically `1` (include the field) or `0` (exclude it)
- documentation: https://www.mongodb.com/docs/manual/reference/method/db.collection.find/ 

Let's explore the `trips` collection.

In [7]:
cursor = db.trips.find()
cursor

<pymongo.synchronous.cursor.Cursor at 0x7be7835b2f20>

In [10]:
cursor = db.trips.find()
trips = list(cursor)
trips[:3]

[{'_id': ObjectId('572bb8222b288919b68abf5a'),
  'tripduration': 379,
  'start station id': 476,
  'start station name': 'E 31 St & 3 Ave',
  'end station id': 498,
  'end station name': 'Broadway & W 32 St',
  'bikeid': 17827,
  'usertype': 'Subscriber',
  'birth year': 1969,
  'gender': 1,
  'start station location': {'type': 'Point',
   'coordinates': [-73.97966069, 40.74394314]},
  'end station location': {'type': 'Point',
   'coordinates': [-73.98808416, 40.74854862]},
  'start time': datetime.datetime(2016, 1, 1, 0, 0, 45),
  'stop time': datetime.datetime(2016, 1, 1, 0, 7, 4)},
 {'_id': ObjectId('572bb8222b288919b68abf5b'),
  'tripduration': 889,
  'start station id': 268,
  'start station name': 'Howard St & Centre St',
  'end station id': 3002,
  'end station name': 'South End Ave & Liberty St',
  'bikeid': 22794,
  'usertype': 'Subscriber',
  'birth year': 1961,
  'gender': 2,
  'start station location': {'type': 'Point',
   'coordinates': [-73.99973337, 40.71910537]},
  'end

Let's explore the `inspections` collection.

In [11]:
cursor = db.inspections.find()
inspections = list(cursor)
inspections[:3]

[{'_id': ObjectId('56d61033a378eccde8a8354f'),
  'id': '10021-2015-ENFO',
  'certificate_number': 9278806,
  'business_name': 'ATLIXCO DELI GROCERY INC.',
  'date': 'Feb 20 2015',
  'result': 'No Violation Issued',
  'sector': 'Cigarette Retail Dealer - 127',
  'address': {'city': 'RIDGEWOOD',
   'zip': 11385,
   'street': 'MENAHAN ST',
   'number': 1712}},
 {'_id': ObjectId('56d61033a378eccde8a83550'),
  'id': '10057-2015-ENFO',
  'certificate_number': 6007104,
  'business_name': 'LD BUSINESS SOLUTIONS',
  'date': 'Feb 25 2015',
  'result': 'Violation Issued',
  'sector': 'Tax Preparers - 891',
  'address': {'city': 'NEW YORK',
   'zip': 10030,
   'street': 'FREDERICK DOUGLASS BLVD',
   'number': 2655}},
 {'_id': ObjectId('56d61033a378eccde8a83551'),
  'id': '10084-2015-ENFO',
  'certificate_number': 9278914,
  'business_name': 'MICHAEL GOMEZ RANGHALL',
  'date': 'Feb 10 2015',
  'result': 'No Violation Issued',
  'sector': 'Locksmith - 062',
  'address': {'city': 'QUEENS VLG',
   'zi

#### Q1: Find all trips taken by passengers born in 1988.

- equivalent to `SELECT * FROM <TABLE> WHERE <SOME COLUMN> = <SOME VALUE>` SQL query

In [None]:
trips = db.trips.find({
    'birth year': 1988
})
trips = list(trips)


#### Q2: Find all inspection sectors.

- equivalent to `SELECT <SPECIFIC COLUMN> FROM <TABLE>` SQL query

What if you don't want your output to be cluttered with "_id" field values?
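One possible answer to Q2, sketched with a projection: the second argument to `find` keeps only `sector`, and setting `'_id': 0` suppresses the `_id` field, which is otherwise returned by default. The helper name `find_sectors` and its `hide_id` parameter are illustrative, and `db` is assumed to be the database handle created earlier.

```python
def find_sectors(db, hide_id=True):
    """Return the `sector` field of every inspection document."""
    projection = {'sector': 1}
    if hide_id:
        # _id is included by default and has to be excluded explicitly
        projection['_id'] = 0
    # An empty filter ({}) selects every document in the collection
    return list(db.inspections.find({}, projection))
```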

#### Q3: Find all inspections that occurred in "Home Improvement Contractor - 100" and "Home Improvement Salesperson - 101" sectors.

- equivalent to `SELECT * FROM <TABLE NAME> WHERE <SOME COLUMN> IN (<VALUE1>, <VALUE2>)`
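A sketch of Q3 using the `$in` operator, which plays the role of SQL's `IN (...)`. The names `SECTORS` and `find_home_improvement` are illustrative, and `db` is assumed to be the database handle created earlier.

```python
SECTORS = [
    'Home Improvement Contractor - 100',
    'Home Improvement Salesperson - 101',
]


def find_home_improvement(db):
    """Return inspections whose sector is one of SECTORS."""
    # $in matches documents whose field equals any value in the list
    return list(db.inspections.find({'sector': {'$in': SECTORS}}))
```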

#### Q4: Find all trips with a duration between 200 and 4000 (inclusive) taken by riders with gender 1.

- equivalent to:
```
    SELECT * FROM <TABLE NAME>
    WHERE <SOME COLUMN1> = <SOME VALUE> AND
        <SOME COLUMN2> >= <SOME VALUE1> AND <SOME COLUMN2> <= <SOME VALUE2>
```
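A sketch of Q4: fields listed side by side in one filter document are combined with an implicit AND, and the two range operators on `tripduration` share a single sub-document. The names `Q4_FILTER` and `find_trips_q4` are illustrative, and `db` is assumed to be the database handle created earlier.

```python
Q4_FILTER = {
    'gender': 1,                                   # equality condition
    'tripduration': {'$gte': 200, '$lte': 4000},   # inclusive range
}


def find_trips_q4(db):
    """Return trips by gender 1 with 200 <= tripduration <= 4000."""
    return list(db.trips.find(Q4_FILTER))
```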

#### Q5: Find all inspections that occurred in either Manhattan or Brooklyn.

- equivalent to:
```
    SELECT * FROM <TABLE NAME>
    WHERE <SOME COLUMN> = <SOME VALUE1> OR
        <SOME COLUMN> = <SOME VALUE2>
```
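A sketch of Q5 with the `$or` operator. It rests on two assumptions that should be checked against the data: that the borough is stored in the nested `address.city` field, and that Manhattan addresses are recorded as `'NEW YORK'` (as in the sample output above) while Brooklyn ones are recorded as `'BROOKLYN'`. The names `Q5_FILTER` and `find_inspections_q5` are illustrative.

```python
Q5_FILTER = {
    '$or': [
        {'address.city': 'NEW YORK'},   # assumed spelling for Manhattan
        {'address.city': 'BROOKLYN'},   # assumed spelling for Brooklyn
    ]
}


def find_inspections_q5(db):
    """Return inspections whose address.city matches either value."""
    # Dot notation ('address.city') reaches into the embedded address document
    return list(db.inspections.find(Q5_FILTER))
```

Because both conditions test the same field, `{'address.city': {'$in': ['NEW YORK', 'BROOKLYN']}}` would express the same query.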

### MongoDB comparison operators

- `$eq`: Matches values that are equal to a specified value.
- `$gt`: Matches values that are greater than a specified value.
- `$gte`: Matches values that are greater than or equal to a specified value.
- `$in`: Matches any of the values specified in an array.
- `$lt`: Matches values that are less than a specified value.
- `$lte`: Matches values that are less than or equal to a specified value.
- `$ne`: Matches all values that are not equal to a specified value.
- `$nin`: Matches none of the values specified in an array.

Documentation: https://www.mongodb.com/docs/manual/reference/operator/query-comparison/

### `limit()` method

- specify the maximum number of documents the cursor will return
- documentation: https://www.mongodb.com/docs/manual/reference/method/cursor.limit/#mongodb-method-cursor.limit

#### Q6: Find the first five trips.

- equivalent to: `SELECT * FROM <TABLE NAME> LIMIT <N>`
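A sketch of Q6: `limit()` is chained onto the cursor returned by `find()`. The helper name `first_trips` is illustrative, and `db` is assumed to be the database handle created earlier.

```python
def first_trips(db, n=5):
    """Return at most `n` documents from the trips collection."""
    # limit() caps how many documents the cursor will yield
    return list(db.trips.find().limit(n))
```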

### `sort()` method

- specify the field or fields to sort by, with a value of `1` for ascending or `-1` for descending order
- documentation: https://www.mongodb.com/docs/manual/reference/method/cursor.sort/#mongodb-method-cursor.sort

### `$regex`

- selects documents where a string field matches a regular expression pattern
- documentation: https://www.mongodb.com/docs/manual/reference/operator/query/regex/

#### Q7: Find all inspections that occurred in 2015 and sort them by ascending order of `id`.

- equivalent to: `SELECT * FROM <TABLE NAME> WHERE <SOME COL> LIKE <SOME SEARCH TERM> ORDER BY <SOME COL> ASC`

Sort the same results in descending order.
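A sketch of Q7 covering both sort directions. It assumes that `date` is stored as a string ending in the year, as in the sample output (`'Feb 20 2015'`), so the pattern `2015$` plays the role of SQL's `LIKE '%2015'`. The names `YEAR_2015` and `inspections_2015` are illustrative.

```python
YEAR_2015 = {'date': {'$regex': '2015$'}}  # date strings that end in 2015


def inspections_2015(db, direction=1):
    """Return 2015 inspections sorted by `id`: 1 = ascending, -1 = descending."""
    cursor = db.inspections.find(YEAR_2015).sort('id', direction)
    return list(cursor)
```

Calling `inspections_2015(db, -1)` gives the descending order. Since `id` is a string in the sample documents, the ordering is lexicographic rather than numeric.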

#### Q8: Find all inspections of incorporated businesses.
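A sketch of Q8 under a stated assumption: an incorporated business is taken to be one whose `business_name` ends in `INC` or `INC.`, as in `'ATLIXCO DELI GROCERY INC.'` from the sample output. Other spellings (for example `INCORPORATED`) would need a broader pattern. The names `INCORPORATED` and `find_incorporated` are illustrative.

```python
# \b requires INC to start a new word; \.? allows an optional trailing dot
INCORPORATED = {'business_name': {'$regex': r'\bINC\.?$'}}


def find_incorporated(db):
    """Return inspections of businesses whose name ends in INC or INC."""
    return list(db.inspections.find(INCORPORATED))
```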

### `findOne(query, projection, options)`

- Fetches the first document that matches the query
- documentation: https://www.mongodb.com/docs/manual/reference/method/db.collection.findOne/
- **IMPORTANT**: PyMongo uses snake_case method names instead of camelCase, so in Python the method is `find_one`.

#### Q9: Find the first trip.
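A sketch of Q9 with PyMongo's `find_one`. The helper name `first_trip` is illustrative, and `db` is assumed to be the database handle created earlier.

```python
def first_trip(db):
    """Return the first document of the trips collection, or None if it is empty."""
    # Unlike find(), find_one() returns a single document (a dict), not a cursor
    return db.trips.find_one()
```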

### MongoDB shell `mongosh`

```
docker exec -it <container name> mongosh
show dbs
use sample_training
show collections
db.trips.find().limit(5).pretty()
```

### `db.collection.countDocuments(query, options)`

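- returns the number of documents in the collection or view that match the query
- in PyMongo the method is `count_documents`, and the query argument is required (pass `{}` to count every document)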
- documentation: https://www.mongodb.com/docs/manual/reference/method/db.collection.countDocuments/

#### Q10: How many trips are in the trips collection?
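A sketch of Q10 with PyMongo's `count_documents`. The helper name `count_trips` is illustrative, and `db` is assumed to be the database handle created earlier.

```python
def count_trips(db):
    """Return the total number of documents in the trips collection."""
    # count_documents needs a filter argument; {} matches every document
    return db.trips.count_documents({})
```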

#### Q11: How many trips were taken by people born after the year 1988?
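A sketch of Q11 that reads "born after 1988" as a birth year strictly greater than 1988, hence `$gt` rather than `$gte`. The helper name `count_born_after` is illustrative, and `db` is assumed to be the database handle created earlier.

```python
def count_born_after(db, year=1988):
    """Count trips whose rider was born strictly after `year`."""
    # $gt compares against numbers only, so non-numeric birth years are not counted
    return db.trips.count_documents({'birth year': {'$gt': year}})
```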