## MongoDB 1

In [None]:
# import statements
import os
from pymongo import MongoClient
import bson.json_util  # import the submodule explicitly so bson.json_util.loads is available
from datetime import datetime, timedelta

### Connection establishment

In [None]:
client = MongoClient('mongodb://localhost:27017/')

In [None]:
db = client.sample_airbnb

In [None]:
db.list_collection_names()

In [None]:
#db.listingsAndReviews.find_one({}, {"_id": 0})

### MongoDB evaluation query operators

- `$regex`: Selects documents where values match a specified regular expression.
- `$expr`: Allows use of aggregation expressions within the query language.
- `$mod`: Performs a modulo operation on the value of a field and selects documents with a specified result.

Documentation: https://www.mongodb.com/docs/manual/reference/operator/query-evaluation/
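As a rough illustration, the filter documents below show the general shape of each operator as a plain Python dictionary. The field names (`name`, `beds`, `bedrooms`, `price`) are assumed to exist in `listingsAndReviews`, and the helper `matches_mod` is hypothetical: it only mimics what `$mod` checks for non-negative integers and needs no database connection.

```python
# Filter documents are plain dicts; each would be passed as the first argument to find().
regex_filter = {"name": {"$regex": "^Ocean", "$options": "i"}}  # name starts with "ocean", case-insensitive
expr_filter = {"$expr": {"$gt": ["$beds", "$bedrooms"]}}         # compare two fields of the same document
mod_filter = {"price": {"$mod": [4, 0]}}                         # [divisor, remainder]

def matches_mod(value, divisor, remainder):
    """Hypothetical helper: mimics the $mod check for non-negative integers."""
    return value % divisor == remainder

print(matches_mod(80, 4, 0))  # True
print(matches_mod(81, 4, 0))  # False
```

Note that `$expr` refers to fields with a leading `$` (for example `"$beds"`), while ordinary query operators use the bare field name as the dictionary key.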

#### Q: Find all listings where `extra_people` is more than twice `guests_included`.

In [None]:
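cursor = db.listingsAndReviews.find(
    {
        # $expr lets the filter compare two fields of the same document
        "$expr": {"$gt": ["$extra_people", {"$multiply": [2, "$guests_included"]}]}
    },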
    {"name": 1, "extra_people": 1, "guests_included": 1, "_id": 0}
)
listings = list(cursor)
listings[:1]

#### Q: Find listings where the last_review date is within the last 30 days.
For the purpose of this question, let's assume the current day is March 11th, 2019, the most recent `last_review` date in the dataset.

**Self-assessment**: Try writing code to figure this out!

In [None]:
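march_11_2019 = datetime(2019, 3, 11)
thirty_days_ago = march_11_2019 - timedelta(days=30)

cursor = db.listingsAndReviews.find(
    {
        "last_review": {"$gte": thirty_days_ago, "$lte": march_11_2019}
    },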
    {"name": 1, "last_review": 1, "_id": 0}
)
listings = list(cursor)
listings[:5]

#### Q: Find listings where the price is a multiple of 5.

In [None]:
cursor = db.listingsAndReviews.find(
    {"price": {"$mod": [5, 0]}},  # [divisor, remainder]
    {"name": 1, "price": 1, "_id": 0}
)
listings = list(cursor)
listings[:3]

### Array Query Operators

- `$all`: Matches arrays that contain all elements specified in the query.
- `$elemMatch`: Selects documents if at least one element in the array field matches all the specified $elemMatch conditions.
- `$size`: Selects documents if the array field is a specified size.

- documentation: https://www.mongodb.com/docs/manual/reference/operator/query-array/
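To make the difference between these operators concrete, here is a small plain-Python sketch of their matching rules. The helper functions and the sample `amenities` list are hypothetical and only approximate how MongoDB evaluates the operators; `in_op` mimics the `$in` comparison operator applied to an array field.

```python
def all_op(array, values):
    """Mimics $all: every listed value must appear in the array."""
    return all(v in array for v in values)

def in_op(array, values):
    """Mimics $in on an array field: at least one listed value appears."""
    return any(v in array for v in values)

def size_op(array, n):
    """Mimics $size: the array has exactly n elements."""
    return len(array) == n

amenities = ["Wifi", "Kitchen", "Heating"]  # made-up sample data
print(all_op(amenities, ["Wifi", "Kitchen"]))  # True
print(all_op(amenities, ["Wifi", "Pool"]))     # False
print(in_op(amenities, ["Wifi", "Pool"]))      # True
print(size_op(amenities, 3))                   # True
```

`$size` only accepts an exact number, not a range, so a condition such as "at least N elements" has to be expressed another way.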

#### Q: Find the name and amenities of listings where the number of amenities is exactly 5.

In [None]:
cursor = db.listingsAndReviews.find(
    {"amenities": {"$size": 5}},
    {"name": 1, "amenities": 1, "_id": 0}
)
listings = list(cursor)
listings[:3]

#### Q: Find the name and amenities of all listings that have "Pack ’n Play/travel crib".

#### Review: `$in` comparison operator: Matches any of the values specified in an array.

In [None]:
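cursor = db.listingsAndReviews.find(
    {
        # a scalar value matches when any element of the array equals it
        "amenities": "Pack ’n Play/travel crib"
    },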
    {"name": 1, "amenities": 1, "_id": 0}
)
listings = list(cursor)
listings[:1]

#### Q: Find the name and amenities of all listings that have "Pack ’n Play/travel crib" and "High chair".

In [None]:
cursor = db.listingsAndReviews.find(
    {
        "amenities": {"$all": ["Pack ’n Play/travel crib", "High chair"]}
    },
    {"name": 1, "amenities": 1, "_id": 0}
)
listings = list(cursor)
listings[:1]

#### Q: Find the name and amenities of listings that have at least one of: "Pack ’n Play/travel crib", "High chair".

In [None]:
cursor = db.listingsAndReviews.find(
    {
        "amenities": {"$in": ["Pack ’n Play/travel crib", "High chair"]}
    },
    {"name": 1, "amenities": 1, "_id": 0}
)
listings = list(cursor)
listings[:1]

#### Q: Find all listings with at least 10 amenities.

In [None]:
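cursor = db.listingsAndReviews.find(
    {
        # $size cannot express a range; instead check that a 10th element (index 9) exists
        "amenities.9": {"$exists": True}
    },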
    {"name": 1, "amenities": 1, "_id": 0}
)
listings = list(cursor)
listings[:2]

### MongoDB projection operators

- `$slice`: Limits the number of elements in an array that appear in the query results.
    - Positive \<N\>: Slices first N elements.
    - Negative \<N\>: Slices last N elements.
    - **IMPORTANT NOTE:** Slicing is applied in the projection, not in the query filter (selection).
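The sign convention is close to Python list slicing, which the following sketch uses as an analogy. The sample `amenities` list is made up, and the two projection dictionaries only show where `$slice` would sit in the second argument to `find()`.

```python
amenities = ["TV", "Wifi", "Kitchen", "Heating", "Washer"]  # made-up sample data

first_two = amenities[:2]   # comparable to {"$slice": 2}
last_two = amenities[-2:]   # comparable to {"$slice": -2}

# Projection documents that would be passed as the second argument to find()
first_two_projection = {"name": 1, "amenities": {"$slice": 2}, "_id": 0}
last_two_projection = {"name": 1, "amenities": {"$slice": -2}, "_id": 0}

print(first_two)  # ['TV', 'Wifi']
print(last_two)   # ['Heating', 'Washer']
```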

#### Q: Find the first 3 amenities for each listing.

In [None]:
cursor = db.listingsAndReviews.find({},
    {"name": 1, "amenities": {"$slice": 3}, "_id": 0}
)
listings = list(cursor)
listings[:1]

#### Q: Find the last 3 amenities for each listing.

In [None]:
cursor = db.listingsAndReviews.find({},
    {"name": 1, "amenities": {"$slice": -3}, "_id": 0}  # negative N slices the last N elements
)
listings = list(cursor)
listings[:1]

### Analytics dataset

Source: https://www.mongodb.com/docs/atlas/sample-data/sample-analytics/

In [None]:
db = client.sample_analytics

In [None]:
# directory where the JSON files are stored
json_dir = 'sample_analytics'
json_files = [f for f in os.listdir(json_dir) if f.endswith(".json")]
collections = [f.replace(".json", "") for f in json_files]
collections

In [None]:
for idx, json_file in enumerate(json_files):
    with open(os.path.join(json_dir, json_file), 'r') as f:
        for line in f:
            data = bson.json_util.loads(line.strip())
            db[collections[idx]].insert_one(data)
        
        print(f"Loaded {json_file} into the '{collections[idx]}' collection.")

### Combining information from multiple collections

#### Q: Find all transactions made by customers born in 1988.

Let's first find the relevant information in the `customers` collection.

In [None]:
start_date = datetime(1988, 1, 1)
end_date = datetime(1989, 1, 1)

born_in_1988 = db.customers.find({"birthdate": {"$gte": start_date, "$lt": end_date}})
born_in_1988 = list(born_in_1988)

In [None]:
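all_accounts = []
for customer in born_in_1988:
    all_accounts.extend(customer["accounts"])  # each customer holds a list of account ids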

In [None]:
transactions = db.transactions.find({"account_id": {"$in": all_accounts}})
transactions = list(transactions)
transactions[:1]

### Aggregation Operations

- Use cases:
    1. Group values from multiple documents together.
    2. Perform operations on the grouped data to return a single result.
    3. Analyze data changes over time.
- documentation: https://www.mongodb.com/docs/manual/aggregation/
- Two types:
    1. Single Purpose Aggregation Methods
    2. Aggregation Pipelines

### 1. Single Purpose Aggregation Methods

- `db.collection.estimatedDocumentCount(options)`: Returns an approximate count of the documents in a collection or a view.
    - does not take a query filter and instead uses metadata to return the count for a collection.
    - more efficient than the countDocuments() method because it does not scan the documents in the collection; instead, it returns the count based on the collection metadata.
- `db.collection.countDocuments(query, options)`: Returns a count of the number of documents in a collection or a view.
- `db.collection.distinct(field, query, options)`: Returns an array of the distinct values of the specified field across the documents that match the query.
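The method names above are the `mongosh` names. In PyMongo the corresponding `Collection` methods are `estimated_document_count()`, `count_documents(filter)` and `distinct(key, filter)`. The sketch below uses a hypothetical in-memory `FakeCollection` with the same method names so that it runs without a server; it supports only simple equality filters and is not how MongoDB implements these methods.

```python
class FakeCollection:
    """Hypothetical in-memory stand-in that mirrors three PyMongo Collection method names."""

    def __init__(self, docs):
        self.docs = docs

    def estimated_document_count(self):
        return len(self.docs)  # the real method reads the count from collection metadata

    def count_documents(self, filter):
        # Only simple equality filters are supported in this sketch.
        return sum(all(d.get(k) == v for k, v in filter.items()) for d in self.docs)

    def distinct(self, key, filter=None):
        filter = filter or {}
        values = {d[key] for d in self.docs
                  if key in d and all(d.get(k) == v for k, v in filter.items())}
        return sorted(values)

coll = FakeCollection([  # made-up sample data
    {"property_type": "House", "country": "US"},
    {"property_type": "Apartment", "country": "US"},
    {"property_type": "House", "country": "CA"},
])
print(coll.estimated_document_count())                    # 3
print(coll.count_documents({"country": "US"}))            # 2
print(coll.distinct("property_type"))                     # ['Apartment', 'House']
print(coll.distinct("property_type", {"country": "CA"}))  # ['House']
```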

In [None]:
# fast estimate of count of documents
db = client.sample_airbnb  # switch back from sample_analytics
db.listingsAndReviews.estimated_document_count()

#### Q: What are the distinct property types?

In [None]:
db.listingsAndReviews.distinct("property_type")

#### Q: What are the distinct property types in United States?

In [None]:
db.listingsAndReviews.distinct("property_type", {"address.country": "United States"})

#### Q: What are all the suburbs in the United States where we have property listings?

In [None]:
db.listingsAndReviews.distinct("address.suburb", {"address.country": "United States"})

### 2. Aggregation Pipelines

- An aggregation pipeline consists of one or more stages that process documents:
    - Each stage performs an operation on the input documents. For example, a stage can filter documents, group documents, and calculate values.
    - The documents that are output from a stage are passed to the next stage.
    - An aggregation pipeline can return results for groups of documents. For example, return the total, average, maximum, and minimum values.

### `db.collection.aggregate(pipeline, options)`

- Calculates aggregate values for the data in a collection or a view.
- Returns:	
    - A cursor for the documents produced by the final stage of the aggregation pipeline.
    - If the pipeline includes the `explain` option, the query returns a document that provides details on the processing of the aggregation operation.

### Building a pipeline

### 1. `$match`
- Filters documents based on a specified query predicate. Matched documents are passed to the next pipeline stage.
- Syntax: `{ $match: { <query predicate> } }`

### 2. `$group`
- The `$group` stage separates documents into groups according to a "group key". The output is one document for each unique group key.
- A group key is often a field, or group of fields. The group key can also be the result of an expression. Use the `_id` field in the `$group` pipeline stage to set the group key. 
- In the `$group` stage output, the `_id` field is set to the group key for that document.
- Syntax:
```
{
 $group:
   {
     _id: <expression>, // Group key
     <field1>: { <accumulator1> : <expression1> },
     ...
   }
 }
```

### 3. `$project`
- Passes along the documents with the requested fields to the next stage in the pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
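The three stages can be sketched on a few in-memory documents. The data and field names (`city`, `kind`, `price`) are made up for illustration, and the plain-Python code only approximates what the server does for this particular pipeline.

```python
from collections import defaultdict

docs = [  # made-up sample data
    {"city": "Lisbon", "kind": "flat", "price": 50},
    {"city": "Lisbon", "kind": "flat", "price": 70},
    {"city": "Porto", "kind": "flat", "price": 40},
    {"city": "Porto", "kind": "house", "price": 90},
]

# The pipeline as it would be passed to collection.aggregate(pipeline)
pipeline = [
    {"$match": {"kind": "flat"}},
    {"$group": {"_id": "$city", "avg_price": {"$avg": "$price"}}},
    {"$project": {"_id": 0, "city": "$_id", "avg_price": 1}},
]

# Plain-Python equivalent of the three stages
matched = [d for d in docs if d["kind"] == "flat"]        # $match
groups = defaultdict(list)
for d in matched:                                         # $group by city
    groups[d["city"]].append(d["price"])
result = [{"city": city, "avg_price": sum(p) / len(p)}    # $avg, then $project
          for city, p in groups.items()]

print(result)
# [{'city': 'Lisbon', 'avg_price': 60.0}, {'city': 'Porto', 'avg_price': 40.0}]
```

Placing `$match` first keeps the later stages from processing documents that would be discarded anyway.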

#### Q: Find the average price (rounded to two decimal places) of all "Entire home/apt" (`room_type`) listings. 

What kind of grouping do we want to create here? All documents should be part of a single group because we are already filtering on a specific room type. How do we express this?

- `_id: None` (Python's `None` is sent to MongoDB as `null`): all documents are treated as belonging to a single group, so the accumulators aggregate over every document that reaches the `$group` stage.

In [None]:
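pipeline = [
    {"$match": {"room_type": "Entire home/apt"}},
    {"$group": {"_id": None, "avg_price": {"$avg": "$price"}}},
]
avg_price = list(db.listingsAndReviews.aggregate(pipeline))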
avg_price

In [None]:
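pipeline = [
    {"$match": {"room_type": "Entire home/apt"}},
    {"$group": {"_id": None, "avg_price": {"$avg": "$price"}}},
    {"$project": {"_id": 0, "avg_price": {"$round": ["$avg_price", 2]}}},
]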
avg_price = list(db.listingsAndReviews.aggregate(pipeline))
avg_price

#### Q: Find the average price (rounded to two decimal places) of listings for each `room_type`.

In [None]:
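pipeline = [
    {"$group": {"_id": "$room_type", "avg_price": {"$avg": "$price"}}},
    {"$project": {"_id": 0, "room_type": "$_id", "avg_price": {"$round": ["$avg_price", 2]}}},
]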
avg_price = list(db.listingsAndReviews.aggregate(pipeline))
avg_price

#### Q: Find the top 2 hosts who have the most listings.

How can you explore a complex document?

### More pipeline stages

4. `$sort`: Reorders the document stream by a specified sort key. Only the order changes; the documents remain unmodified. For each input document, outputs one document.
5. `$limit`: Passes the first n documents unmodified to the pipeline where n is the specified limit. For each input document, outputs either one document (for the first n documents) or zero documents (after the first n documents).
- documentation: https://www.mongodb.com/docs/manual/reference/operator/aggregation-pipeline/#std-label-aggregation-pipeline-operator-reference
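Combined, `$sort` followed by `$limit` gives a "top N" result. The sketch below shows the two stages next to a plain-Python equivalent; the `counts` list is a made-up stand-in for the output of an earlier `$group` stage.

```python
counts = [  # made-up output of an earlier $group stage
    {"_id": "a", "n": 3},
    {"_id": "b", "n": 7},
    {"_id": "c", "n": 5},
]

# Stages as they would appear in a pipeline: -1 sorts descending, 1 ascending
stages = [{"$sort": {"n": -1}}, {"$limit": 2}]

# Plain-Python equivalent: sort descending by n, then keep the first two
top_two = sorted(counts, key=lambda d: d["n"], reverse=True)[:2]
print(top_two)  # [{'_id': 'b', 'n': 7}, {'_id': 'c', 'n': 5}]
```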

In [None]:
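pipeline = [
    {"$group": {"_id": "$host.host_id", "host_name": {"$first": "$host.host_name"}, "listings": {"$sum": 1}}},
    {"$sort": {"listings": -1}},
    {"$limit": 2},
]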
top_hosts = list(db.listingsAndReviews.aggregate(pipeline))
top_hosts

### More pipeline stages

6. `$lookup`: Performs a left outer join to another collection in the same database to filter in documents from the "joined" collection for processing.
- Syntax:
```
{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}
```

7. `$unwind`: Deconstructs an array field from the input documents to output a document for each element. Each output document replaces the array with an element value. For each input document, outputs n documents where n is the number of array elements and can be zero for an empty array.
- Syntax: `{ $unwind: <field path> }`
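The interaction between `$lookup` and `$unwind` is easier to see on a tiny in-memory example. The helpers `lookup` and `unwind` below are hypothetical plain-Python approximations, and the data is made up. By default `$unwind` drops documents whose array is empty; the document form `{"$unwind": {"path": ..., "preserveNullAndEmptyArrays": True}}` keeps them, which the `preserve_empty` flag imitates.

```python
customers = [  # made-up sample data
    {"name": "Ana", "accounts": [1, 2]},
    {"name": "Bo", "accounts": []},
]
accounts = [
    {"account_id": 1, "limit": 9000},
    {"account_id": 2, "limit": 10000},
]

def lookup(docs, foreign, local_field, foreign_field, as_field):
    """Mimics $lookup: attach every matching foreign document as an array."""
    out = []
    for d in docs:
        local = d[local_field]
        wanted = local if isinstance(local, list) else [local]
        matches = [f for f in foreign if f[foreign_field] in wanted]
        out.append({**d, as_field: matches})  # unmatched documents keep an empty array
    return out

def unwind(docs, field, preserve_empty=False):
    """Mimics $unwind; preserve_empty plays the role of preserveNullAndEmptyArrays."""
    out = []
    for d in docs:
        if d[field]:
            out.extend({**d, field: item} for item in d[field])
        elif preserve_empty:
            out.append({k: v for k, v in d.items() if k != field})
    return out

joined = lookup(customers, accounts, "accounts", "account_id", "account_details")
print(len(joined))                                                  # 2
print(len(unwind(joined, "account_details")))                       # 2
print(len(unwind(joined, "account_details", preserve_empty=True)))  # 3
```

Because the join is a left outer join, "Bo" survives `$lookup` with an empty `account_details` array and is only lost if a default `$unwind` follows.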

#### Q: List all customers along with their transactions. For each customer, include their account name and the details of each transaction they made. If a customer has no transactions, still include them in the results.

In [None]:
db = client.sample_analytics

In [None]:
db.accounts.find_one()

In [None]:
db.customers.find_one()

#### Q: List all accounts with customer names and corresponding limits.

In [None]:
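pipeline = [
    {"$lookup": {"from": "accounts", "localField": "accounts",
                 "foreignField": "account_id", "as": "account_details"}},
    {"$unwind": "$account_details"},  # one output document per (customer, account) pair
]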

merged_results = list(db.customers.aggregate(pipeline))

first = merged_results[0]
print(f"Customer Name: {first['name']}, Account ID: {first['account_details']['account_id']}, Limit: {first['account_details']['limit']}")

### More pipeline stages

8. `$addFields`: Adds new fields to documents. Similar to `$project`, `$addFields` reshapes each document in the stream; specifically, by adding new fields to output documents that contain both the existing fields from the input documents and the newly added fields.
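A minimal sketch of the idea, using made-up account documents: the stage dictionary shows how a computed field could be declared, and the list comprehension is a plain-Python approximation of the result. The field name `product_count` is hypothetical.

```python
accounts = [  # made-up sample data
    {"account_id": 1, "products": ["Brokerage", "Commodity"]},
    {"account_id": 2, "products": ["Derivatives"]},
]

# Stage as it would appear in a pipeline
stage = {"$addFields": {"product_count": {"$size": "$products"}}}

# Plain-Python equivalent: keep every existing field and add one more
with_counts = [{**a, "product_count": len(a["products"])} for a in accounts]
print(with_counts[0])
# {'account_id': 1, 'products': ['Brokerage', 'Commodity'], 'product_count': 2}
```

Unlike `$project`, the existing fields do not have to be listed to be kept.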

#### Q: Add a field called "account_status" with value "High" if account limit is greater than 9000 and "Low" otherwise.

In [None]:
pipeline = [
    {"$addFields": {"account_status": {"$cond": [{"$gt": ["$limit", 9000]}, "High", "Low"]}}},
]

# Execute the aggregation query
accounts = list(db.accounts.aggregate(pipeline))
accounts[:2]

### More about array methods

- `$addToSet`: Adds a value to an array unless the value is already present, in which case `$addToSet` does nothing to that array.
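The behaviour can be imitated in plain Python as follows. The helper `add_to_set` and the `products` list are hypothetical, and the `update` dictionary only shows the shape that would be passed as the second argument to `update_one(filter, update)`.

```python
def add_to_set(array, value):
    """Mimics $addToSet: append only when the value is not already present."""
    if value not in array:
        array.append(value)
    return array

products = ["Brokerage", "Commodity"]  # made-up sample data
add_to_set(products, "Commodity")      # already present, no change
add_to_set(products, "Derivatives")    # new value, appended
print(products)  # ['Brokerage', 'Commodity', 'Derivatives']

# Update document that would be passed to update_one(filter, update)
update = {"$addToSet": {"products": "Derivatives"}}
```

`$addToSet` is also available as an accumulator in the `$group` stage, where it collects the unique values seen within each group.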

### Geospatial Query Operators

- documentation: https://www.mongodb.com/docs/manual/reference/operator/query-geospatial/

### Query selectors

- `$geoIntersects`: Selects geometries that intersect with a GeoJSON geometry. The 2dsphere index supports `$geoIntersects`.
- `$geoWithin`: Selects geometries within a bounding GeoJSON geometry. The 2dsphere and 2d indexes support `$geoWithin`.
- `$near`: Returns geospatial objects in proximity to a point. Requires a geospatial index. The 2dsphere and 2d indexes support `$near`.
- `$nearSphere`: Returns geospatial objects in proximity to a point on a sphere. Requires a geospatial index. The 2dsphere and 2d indexes support `$nearSphere`.

### Geometry Specifiers

- `$geometry`: The `$geometry` operator specifies a GeoJSON geometry for use with the following geospatial query operators: `$geoWithin`, `$geoIntersects`, `$near`, and `$nearSphere`. `$geometry` uses EPSG:4326 as the default coordinate reference system (CRS).
- `$maxDistance`: Specifies a maximum distance to limit the results of `$near` and `$nearSphere` queries. The 2dsphere and 2d indexes support `$maxDistance`.
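As a sketch of how a selector and the specifiers fit together, the dictionaries below build a `$near` filter. The field `address.location` is assumed to hold a GeoJSON point and to carry a 2dsphere index, and the coordinates are made up. Nothing here contacts a database.

```python
# GeoJSON points list longitude first, then latitude.
center = {"type": "Point", "coordinates": [-73.9667, 40.78]}  # made-up location

near_filter = {
    "address.location": {          # assumed GeoJSON field with a 2dsphere index
        "$near": {
            "$geometry": center,
            "$maxDistance": 1000,  # meters when $geometry is a GeoJSON point
        }
    }
}

# Index specification that would be passed to create_index(...)
index_spec = [("address.location", "2dsphere")]
print(near_filter["address.location"]["$near"]["$maxDistance"])  # 1000
```

Results of `$near` come back ordered from nearest to farthest, so no separate sort is needed for a "closest listings" query.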