# Theoretical Questions

1. What are the key differences between SQL and NoSQL databases?

Answer ->

SQL (relational) databases store data in structured tables (rows and columns) with predefined schemas. They use Structured Query Language (SQL) to define and manipulate data and are traditionally scaled vertically (adding more power to a single server). NoSQL (non-relational) databases store data in other formats, such as key-value pairs, documents, wide-column stores, or graphs. They allow dynamic schemas, which suits semi-structured and unstructured data, and are designed to scale horizontally (adding more servers).


2. What makes MongoDB a good choice for modern applications?

Answer ->

MongoDB is ideal for modern applications because of its flexibility and scalability. It uses a JSON-like format (BSON) that maps directly to objects in application code, allowing developers to iterate faster. Its ability to handle large volumes of unstructured data and scale horizontally across distributed clusters makes it suitable for high-traffic, real-time applications.


3. Explain the concept of collections in MongoDB.

Answer ->

A collection in MongoDB is the equivalent of a table in an SQL database. It is a grouping of MongoDB documents. Unlike SQL tables, collections do not enforce a rigid schema, meaning documents within the same collection can have different fields or structures (though they typically share a similar purpose).


4. How does MongoDB ensure high availability using replication?

Answer ->

MongoDB uses Replica Sets to ensure high availability. A replica set is a group of mongod processes that maintain the same data set. It consists of one Primary node (receives all write operations) and multiple Secondary nodes (replicate the primary's data). If the primary node fails, an automated election process promotes a secondary node to primary, ensuring the system remains online.


5. What are the main benefits of MongoDB Atlas?

Answer ->

MongoDB Atlas is the fully managed cloud database service for MongoDB. Its main benefits include:

Automated Management: Handles provisioning, patching, and backups automatically.

Global Scalability: Allows you to deploy across AWS, Google Cloud, and Azure easily.

Built-in Security: Default encryption, network isolation, and access controls.

Serverless Options: Scales infrastructure up or down based on demand.

6. What is the role of indexes in MongoDB, and how do they improve performance?

Answer ->

Indexes are special data structures that store a small portion of the collection's data in an easy-to-traverse form (usually B-Trees). Without indexes, MongoDB must perform a collection scan, checking every document to find a match. Indexes drastically improve query performance by allowing MongoDB to limit the number of documents it inspects.
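
As a minimal sketch of the idea, assuming a PyMongo collection whose documents have a `Region` field (as the `Orders` collection in the practical section does), an index could be created with `create_index`. The helper name `ensure_region_index` is hypothetical and only for illustration.

```python
def ensure_region_index(collection):
    """Create an ascending single-field index on "Region" (1 = ascending).

    `collection` is expected to be a PyMongo Collection. The helper name is
    hypothetical and used only for this illustration.
    """
    # create_index returns the index name; creating an identical index
    # a second time does not build it again.
    return collection.create_index([("Region", 1)])
```

With such an index in place, a query like `collection.find({"Region": "West"})` can be served from the index rather than a full collection scan.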


7. Describe the stages of the MongoDB aggregation pipeline.

Answer ->

The aggregation pipeline is a framework for data processing. Documents pass through a series of stages, where each stage transforms the documents. Common stages include:


$match : Filters documents (like SQL WHERE).

$group : Groups documents by a specified key (like SQL GROUP BY).

$project : Reshapes documents (selecting or adding fields).

$sort : Sorts the documents.

$limit / $skip : Control how many documents are passed to the next stage ($limit caps the count, $skip drops the first N).
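
The stages listed above can be chained in a single pipeline. The sketch below assumes documents with `Region`, `Category`, and `Sales` fields, as in the Superstore data used later; the function name `build_top_categories_pipeline` is hypothetical.

```python
def build_top_categories_pipeline(region, top_n=3):
    """Return aggregation stages as plain dicts, in execution order."""
    return [
        # Filter first so later stages see fewer documents
        {"$match": {"Region": region}},
        # One output document per category, with summed sales
        {"$group": {"_id": "$Category", "totalSales": {"$sum": "$Sales"}}},
        # Rename _id to a friendlier field name
        {"$project": {"_id": 0, "category": "$_id", "totalSales": 1}},
        # Highest totals first
        {"$sort": {"totalSales": -1}},
        # Keep only the top N categories
        {"$limit": top_n},
    ]


pipeline = build_top_categories_pipeline("West")
for stage in pipeline:
    print(stage)
```

The resulting list can be passed to `collection.aggregate(pipeline)` on a PyMongo collection.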

8. What is sharding in MongoDB? How does it differ from replication?

Answer ->

Sharding is the method for distributing data across multiple machines to support deployments with very large datasets and high throughput operations (Horizontal Scaling).

Difference: Replication copies the same data to multiple servers for safety/availability. Sharding partitions different chunks of data across multiple servers to increase storage capacity and write performance.

9. What is PyMongo, and why is it used?

Answer ->

PyMongo is the official MongoDB driver for the Python programming language. It is used to interact with MongoDB databases from Python applications, allowing developers to insert, query, update, and delete data using Python code.


10. What are the ACID properties in the context of MongoDB transactions?

Answer ->

ACID stands for Atomicity, Consistency, Isolation, and Durability. Writes to a single document have always been atomic in MongoDB; since version 4.0 it also supports multi-document ACID transactions on replica sets, and since 4.2 on sharded clusters. A transaction ensures that operations across multiple documents or collections either all succeed or all fail (Atomicity), maintaining data integrity much as traditional relational databases do.
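
A minimal sketch of a multi-document transaction with PyMongo follows. It assumes a replica set deployment (transactions are not available on a standalone server); the `ArchivedOrders` collection, the `Order ID` field, and the function name `archive_order` are assumptions made for illustration.

```python
def archive_order(client, order_id):
    """Move one order into an archive collection atomically.

    `client` is expected to be a pymongo.MongoClient connected to a
    replica set. Collection and field names are illustrative assumptions.
    """
    db = client["SuperstoreDB"]
    with client.start_session() as session:
        # Leaving this block normally commits; an exception aborts.
        with session.start_transaction():
            doc = db["Orders"].find_one({"Order ID": order_id}, session=session)
            if doc is None:
                return False
            db["ArchivedOrders"].insert_one(doc, session=session)
            db["Orders"].delete_one({"_id": doc["_id"]}, session=session)
    return True
```

Either both the insert and the delete take effect, or neither does, which is the atomicity guarantee described above.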


11. What is the purpose of MongoDB’s explain() function?

Answer ->

The explain() function is a diagnostic tool used to analyze how MongoDB executes a query. It provides details on the query plan, such as whether an index was used, how many documents were scanned, and the time taken. This is essential for optimizing query performance.
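
As a rough sketch, the dictionary returned by PyMongo's `cursor.explain()` can be reduced to the few fields mentioned above. The helper name `summarize_explain` is hypothetical, the sample plan is hand-written rather than real server output, and the exact layout of explain results varies between server versions.

```python
def summarize_explain(plan):
    """Extract commonly inspected fields from an explain() result dict."""
    winning = plan.get("queryPlanner", {}).get("winningPlan", {})
    stats = plan.get("executionStats", {})
    return {
        "stage": winning.get("stage"),  # e.g. COLLSCAN means no index was used
        "docs_examined": stats.get("totalDocsExamined"),
        "docs_returned": stats.get("nReturned"),
        "millis": stats.get("executionTimeMillis"),
    }


# Hand-written sample shaped like an explain() result (not real output)
sample_plan = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {
        "nReturned": 3,
        "totalDocsExamined": 100,
        "executionTimeMillis": 2,
    },
}
print(summarize_explain(sample_plan))
```

Against a live collection, the input would come from something like `collection.find({"Region": "West"}).explain()`.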


12. How does MongoDB handle schema validation?

Answer ->

Although MongoDB is "schema-less," it supports Schema Validation rules. You can define validation rules (using JSON Schema syntax) on a collection during creation or update. This enforces data integrity by rejecting inserts or updates that do not meet specific criteria (e.g., checking if a field exists or if a value is of a certain type).
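
The sketch below shows what such a rule might look like in PyMongo, assuming documents with `Region` and `Sales` fields. The collection name `ValidatedOrders` and the helper `create_validated_collection` are hypothetical.

```python
# Validation rule in MongoDB's $jsonSchema syntax (illustrative fields)
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Region", "Sales"],
        "properties": {
            "Region": {"bsonType": "string"},
            "Sales": {"bsonType": ["int", "long", "double"], "minimum": 0},
        },
    }
}


def create_validated_collection(db, name="ValidatedOrders"):
    """Create a collection that rejects documents violating the rule.

    `db` is expected to be a PyMongo Database; the names are assumptions.
    """
    return db.create_collection(name, validator=order_validator)
```

With this validator, an insert that omits `Sales` or stores it as a string would be rejected by the server.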


13. What is the difference between a primary and a secondary node in a replica set?

Answer ->

Primary Node: The only node that can accept write operations. It records all changes in its operation log (oplog).

Secondary Node: Replicates the oplog from the primary and applies the operations to its own data set. Secondaries do not accept writes. By default clients read from the primary, but a read preference can route reads to secondaries, which may return slightly stale data (eventual consistency).

14. What security mechanisms does MongoDB provide for data protection?

Answer ->

MongoDB provides robust security features including:

Authentication: Verifying user identity (SCRAM, LDAP, Kerberos).

Authorization: Role-Based Access Control (RBAC) to define what users can do.

Encryption: Data encryption in transit (TLS/SSL) and at rest.

Auditing: Tracking database activities for compliance.

15. Explain the concept of embedded documents and when they should be used.

Answer ->

Embedded documents (or nested documents) capture relationships between data by storing related data in a single document structure rather than separate tables. They should be used when:

You have "contains" relationships (e.g., a Person document contains an Address object).

You have one-to-few relationships.

You frequently need to retrieve related data together in a single query.
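
To make the contrast concrete, here is a small sketch of the same data modeled both ways as Python dictionaries. All names and values are made up for illustration.

```python
# Embedded: the address is stored inside the person document,
# so one read returns everything.
person_embedded = {
    "name": "Asha",
    "address": {"city": "Pune", "zip": "411001"},
}

# Referenced: the address is a separate document, and the person
# stores only its _id, so a second query (or $lookup) is needed.
address = {"_id": 101, "city": "Pune", "zip": "411001"}
person_referenced = {"name": "Asha", "address_id": 101}

print(person_embedded["address"]["city"])
```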

16. What is the purpose of MongoDB’s $lookup stage in aggregation?

Answer ->

The $lookup stage performs a left outer join to a collection in the same database. It allows you to combine documents from two different collections based on a specific field, which is useful for modeling relational patterns in NoSQL when referencing is used instead of embedding.
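
A minimal sketch of a `$lookup` stage follows. It assumes a hypothetical `Customers` collection whose `_id` matches a `Customer ID` field on each order; neither name comes from this document, and the helper `build_lookup_stage` is illustrative.

```python
def build_lookup_stage(from_collection, local_field, foreign_field, as_field):
    """Return a $lookup stage joining the input documents to another collection."""
    return {
        "$lookup": {
            "from": from_collection,        # collection to join with
            "localField": local_field,      # field on the input documents
            "foreignField": foreign_field,  # field on the joined collection
            "as": as_field,                 # output array of matched documents
        }
    }


lookup_pipeline = [
    build_lookup_stage("Customers", "Customer ID", "_id", "customer"),
]
print(lookup_pipeline)
```

Because it is a left outer join, orders with no matching customer still appear in the output, with an empty `customer` array.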

17. What are some common use cases for MongoDB?

Answer ->

Content Management Systems (CMS): Storing varied content types.

IoT Applications: Handling high-velocity streams of sensor data.

Real-time Analytics: Processing operational data instantly.

E-commerce catalogs: Managing products with different attributes.

Mobile Apps: Syncing data across devices with flexible schemas.

18. What are the advantages of using MongoDB for horizontal scaling?

Answer ->

MongoDB was designed with horizontal scaling (sharding) in mind. It allows you to add cheaper commodity servers to a cluster to handle increased load, rather than upgrading to expensive, high-end hardware. Throughput and storage capacity can therefore grow by adding shards, generally without taking the cluster offline.


19. How do MongoDB transactions differ from SQL transactions?

Answer ->

In relational databases, normalized data is spread across many tables, so multi-row, multi-table transactions are routine. MongoDB transactions (available since version 4.0) offer similar guarantees but operate on documents. Because single-document writes are already atomic and related data is often embedded in one document, multi-document transactions are typically reserved for cases where atomicity across several documents or collections is strictly required.

20. What are the main differences between capped collections and regular collections?

Answer ->

Capped collections are fixed-size collections that support high-throughput insert and retrieve operations based on insertion order.

Size: Capped collections have a maximum size/document count. Once full, the oldest documents are automatically overwritten (FIFO - First In, First Out).


Use case: Ideal for logging, caching, or storing recent event streams. Regular collections grow dynamically and do not overwrite old data automatically.
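
As a sketch, a capped collection could be created from PyMongo as shown below. The collection name `RecentEvents`, the size limits, and the helper name are assumptions chosen for illustration.

```python
def create_capped_collection(db, name="RecentEvents",
                             size_bytes=1024 * 1024, max_docs=1000):
    """Create a capped collection limited by bytes and by document count.

    `db` is expected to be a PyMongo Database. `size` is required for
    capped collections; `max` is an optional additional document limit.
    """
    return db.create_collection(
        name, capped=True, size=size_bytes, max=max_docs
    )
```

Once either limit is reached, new inserts displace the oldest documents.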

21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

Answer ->

The $match stage is used to filter the documents that pass through the aggregation pipeline. It selects only the documents that match the specified condition(s) to pass to the next stage. It is usually placed as early as possible in the pipeline to reduce the amount of data processed in subsequent stages.


22. How can you secure access to a MongoDB database?

Answer ->

You can secure access by:

Enabling Authentication (requiring username/password).

Configuring Role-Based Access Control (RBAC) to give users strictly the permissions they need (Principle of Least Privilege).

Binding MongoDB to specific IP addresses (the bindIp setting) to limit network exposure.

Setting up Firewall rules to allow traffic only from trusted app servers.

23. What is MongoDB’s WiredTiger storage engine, and why is it important?

Answer ->

WiredTiger is the default storage engine for MongoDB (since version 3.2). It is important because it provides:

Document-level concurrency: Allows multiple clients to modify different documents in a collection simultaneously (unlike older engines which locked the whole database or collection).

Compression: Reduces storage costs and I/O usage by compressing data and indexes on disk.

Checkpoints: Ensures data consistency and recovery.

# Practical Questions

In [2]:
import pandas as pd
from pymongo import MongoClient

# Establish connection
client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['Orders']

1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB

In [None]:
# Load the CSV into a DataFrame. The file name is an assumption;
# point it at wherever the Superstore CSV is stored.
df = pd.read_csv("superstore.csv")

# Convert each row to a dict and insert the rows as documents
records = df.to_dict(orient="records")

if records:
    result = collection.insert_many(records)
    print(f"Documents inserted: {len(result.inserted_ids)}")

2. Retrieve and print all documents from the Orders collection

In [None]:
# An empty filter matches every document in the collection
all_orders = collection.find({})

for order in all_orders:
    print(order)

3. Count and display the total number of documents in the Orders collection

In [None]:
total_count = collection.count_documents({})
print(f"Total documents: {total_count}")

4. Write a query to fetch all orders from the "West" region

In [None]:
west_orders = collection.find({"Region": "West"})

for order in west_orders:
    print(order)

5. Write a query to find orders where Sales is greater than 500

In [None]:
# Filter: Sales > 500
high_sales = collection.find({"Sales": {"$gt": 500}})

for order in high_sales:
    print(order)

6. Fetch the top 3 orders with the highest Profit

In [None]:
# Sort by 'Profit' in descending order (-1) and take top 3
top_profits = collection.find().sort("Profit", -1).limit(3)

for order in top_profits:
    print(order)

7. Update all orders with Ship Mode as "First Class" to "Premium Class"


In [None]:
result = collection.update_many(
    {"Ship Mode": "First Class"},       # Filter criteria
    {"$set": {"Ship Mode": "Premium Class"}} # Update action
)

print(f"Documents updated: {result.modified_count}")

8. Delete all orders where Sales is less than 50

In [None]:
result = collection.delete_many({"Sales": {"$lt": 50}})

print(f"Documents deleted: {result.deleted_count}")

9. Use aggregation to group orders by Region and calculate total sales per region

In [None]:
pipeline = [
    {
        "$group": {
            "_id": "$Region",           # Group by Region
            "totalSales": {"$sum": "$Sales"} # Sum the Sales field
        }
    }
]

results = collection.aggregate(pipeline)

for result in results:
    print(f"Region: {result['_id']}, Total Sales: {result['totalSales']}")

10. Fetch all distinct values for Ship Mode from the collection

In [None]:
distinct_ship_modes = collection.distinct("Ship Mode")
print(distinct_ship_modes)

11. Count the number of orders for each category

In [None]:
pipeline = [
    {
        "$group": {
            "_id": "$Category",        # Group by Category
            "count": {"$sum": 1}       # Add 1 for each document
        }
    }
]

results = collection.aggregate(pipeline)

for result in results:
    print(f"Category: {result['_id']}, Count: {result['count']}")