---
# **MongoDB - Theoretical Questions:-**

---

### **1. What are the key differences between SQL and NoSQL databases?**

- **SQL and NoSQL databases** are designed for different data models, application types, and scalability needs. The key differences are as follows:

| Feature            | SQL (Relational)                | NoSQL (Non-relational like MongoDB)     |
| ------------------ | ------------------------------- | --------------------------------------- |
| **Structure**      | Tables with rows and columns    | Collections with JSON-like documents    |
| **Schema**         | Fixed schema (predefined)       | Dynamic schema                          |
| **Scalability**    | Mainly vertical (scale up)      | Horizontal (scale out)                  |
| **Query Language** | Structured Query Language (SQL) | Varies (e.g., MongoDB Query Language)   |
| **Joins**          | Supports joins                  | Limited join support                    |
| **Transactions**   | ACID-compliant                  | Varies by system; MongoDB supports multi-document ACID transactions since 4.0 |

- Use cases:

| SQL                           | NoSQL                                  |
| ----------------------------- | -------------------------------------- |
| Banking, ERP, CRM, HR systems | E-commerce, real-time analytics, IoT   |
| Structured & consistent data  | Flexible & evolving data               |
| Strong consistency needed     | High scalability & availability needed |

- Examples of Databases:

| SQL Databases        | NoSQL Databases         |
| -------------------- | ----------------------- |
| MySQL                | MongoDB (Document DB)   |
| PostgreSQL           | Cassandra (Wide Column) |
| Oracle               | Redis (Key-Value Store) |
| Microsoft SQL Server | Neo4j (Graph DB)        |

- Summary:

| Feature        | SQL                           | NoSQL                                            |
| -------------- | ----------------------------- | ------------------------------------------------ |
| Data Format    | Tables (rows & columns)       | Documents, key-value, graphs                     |
| Schema         | Fixed & predefined            | Dynamic & flexible                               |
| Query Language | SQL                           | JSON-like (MongoDB), others                      |
| Joins          | Supported                     | Limited or avoided                               |
| Scalability    | Vertical                      | Horizontal (sharding)                            |
| Transactions   | Full ACID support             | Varies by system; MongoDB supports ACID transactions |
| Performance    | Strong for complex queries and joins on structured data | Built for high throughput on large, distributed datasets |
| Best For       | Structured data & consistency | High-speed, semi-structured or unstructured data |


- **Example**:

 - SQL: `SELECT * FROM users WHERE age > 30;`

 - MongoDB: `db.users.find({ age: { $gt: 30 } })`
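
- To make the comparison concrete, here is a minimal Python sketch. The MongoDB filter is an ordinary dictionary (the same shape a Python driver accepts), and a toy `matches` helper shows what `$gt` means. The helper and the sample data are hypothetical illustrations, not MongoDB's query engine.

```python
# The MongoDB filter from the example, written as a Python dict.
query = {"age": {"$gt": 30}}

# Hypothetical sample data standing in for the `users` collection.
users = [
    {"name": "Asha", "age": 28},
    {"name": "Ravi", "age": 35},
    {"name": "Meera", "age": 41},
]


def matches(doc, query):
    """Toy matcher that supports only {field: {"$gt": value}} filters."""
    for field, condition in query.items():
        value = doc.get(field)
        if value is None or not value > condition["$gt"]:
            return False
    return True


# Equivalent in spirit to: SELECT * FROM users WHERE age > 30;
result = [doc for doc in users if matches(doc, query)]
print([doc["name"] for doc in result])
```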

---

### **2. What makes MongoDB a good choice for modern applications?**
- **Modern applications** demand speed, scalability, flexibility, and developer productivity — and MongoDB is designed to meet these needs.

- It’s one of the most widely adopted NoSQL databases because it supports a wide range of use cases, from web and mobile apps to IoT, analytics, and real-time systems.

- The main reasons **MongoDB** suits modern applications are as follows:

 - Flexible schema: Documents in the same collection can have different fields.

 - JSON/BSON format: Works well with JavaScript and REST APIs.

 - Horizontal Scaling: Supports sharding for large-scale apps.

 - High Availability: Through replication.

 - Integration: Works well with Python, Node.js, Java, etc.

- **Example**: MongoDB fits perfectly in MEAN/MERN stacks for real-time apps like chats, e-commerce, IoT apps, etc.

-  Ideal for Modern App Use Cases:

| App Type                | Why MongoDB Works                       |
| ----------------------- | --------------------------------------- |
|  Web apps             | Flexible schema, fast queries           |
|  Mobile apps          | Offline sync with Realm                 |
|  Analytics dashboards | Aggregation + indexing                  |
|  E-commerce           | Scales with millions of products/orders |
|  IoT apps             | Handles high write rates from sensors   |
|  Financial apps       | Secure and compliant                    |

- Summary:

| Feature                 | Why It Matters for Modern Apps      |
| ----------------------- | ----------------------------------- |
|  Flexible Schema      | Rapid prototyping, schema evolution |
|  High Performance      | Fast reads/writes, ideal for scale  |
|  High Availability    | Uptime through replication          |
|  Cloud Deployment     | MongoDB Atlas – easy & scalable     |
|  Secure & Compliant   | Ready for regulated environments    |
|  Aggregation Pipeline | Real-time analytics & reporting     |
|  Language Support     | Python, JS, Java, etc.              |
|  Dev Tools            | CLI, Compass, Atlas UI, APIs        |
|  Community & Docs     | Easy to learn, quick help           |



---

### **3. Explain the concept of collections in MongoDB.**

- In MongoDB, a **collection** is a group of documents — it’s the equivalent of a table in relational databases like MySQL or PostgreSQL.

🔹 Documents = Rows, Collections = Tables, Fields = Columns

- Each document in a collection is stored in BSON format (binary JSON) and can have a flexible schema, meaning documents within a collection don’t need to have the same fields or data types.

- Key Properties of Collections:

| Property                 | Description                             |
| ------------------------ | --------------------------------------- |
|  **Flexible Schema**   | No need to define columns in advance    |
|  **Efficient Storage** | BSON is a compact binary encoding; the storage engine compresses data on disk |
|  **Indexable**         | Can create indexes on any field         |
|  **Document-oriented** | Stores semi-structured data (JSON/BSON) |
|  **CRUD Supported**    | Supports create, read, update, delete   |

- Difference from SQL Table:

| Feature      | SQL Table                  | MongoDB Collection         |
| ------------ | -------------------------- | -------------------------- |
| Schema       | Fixed (columns must match) | Flexible (fields can vary) |
| Data Format  | Rows and Columns           | JSON-like Documents        |
| Join Support | Native SQL joins           | Manual or via `$lookup`    |
| Scaling      | Vertical                   | Horizontal (with sharding) |

- Types of Collections:

| Type               | Description                                    |
| ------------------ | ---------------------------------------------- |
|  **Standard**    | Default type, no size limit                    |
|  **Capped**      | Fixed-size collections with circular overwrite |
|  **Time-series** | Optimized for timestamped data (logs, sensors) |

- Summary:

| Term           | Description                                         |
| -------------- | --------------------------------------------------- |
| **Collection** | Container for related documents (like a SQL table)  |
| **Document**   | JSON-like data entry (like a SQL row)               |
| **Schema**     | Dynamic — each document can differ                  |
| **Use Case**   | Group data like "orders", "products", "users", etc. |
| **Command**    | `db.collectionName` or `db["collectionName"]`       |


- **Example**:

```json
{
  "_id": 1,
  "name": "John",
  "age": 30
}
```

- All documents in a collection can have **different structures**, unlike SQL.
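
- As a small illustration of that flexibility, the sketch below models one collection as a Python list of dictionaries with differing fields. The documents are hypothetical; the point is only that code reading such data should tolerate missing fields.

```python
# Hypothetical documents that could live in one `users` collection.
users = [
    {"_id": 1, "name": "John", "age": 30},
    {"_id": 2, "name": "Priya", "email": "priya@example.com"},
    {"_id": 3, "name": "Lee", "age": "unknown", "tags": ["beta"]},
]

# Each document carries its own set of fields.
field_sets = [sorted(doc.keys()) for doc in users]

# dict.get returns None for a missing field instead of raising KeyError.
ages = [doc.get("age") for doc in users]

print(field_sets)
print(ages)
```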

---

### **4. How does MongoDB ensure high availability using replication?**

- **Replication** in MongoDB is the process of copying data from one server (primary) to one or more secondary servers in a group called a replica set.

- If the primary fails, the remaining members hold an election and automatically promote a secondary to primary. Writes pause briefly during the election, so failover keeps downtime short rather than eliminating it entirely.

- MongoDB ensures high availability by using **Replica Sets**.

- A **Replica Set** contains:

  * **1 Primary node** (read/write)

  * **1 or more Secondary nodes** (replicate the primary; read-only)

- **Example**: Used in production for automatic failover and data redundancy.

- Summary:

| Feature                | Description                                      |
| ---------------------- | ------------------------------------------------ |
|  Replication Type    | Asynchronous                                     |
|  Structure           | Primary + Secondaries                            |
|  High Availability    | Yes, via automatic failover                      |
|  Data Redundancy    | Yes, across multiple servers                     |
|  Read Scalability    | Read from secondaries                            |
|  Manual Intervention | Not required for failover                        |
|  Tools Used          | `rs.initiate()`, `rs.status()`, PyMongo, Compass |
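
- Automatic failover depends on a majority of voting members being able to reach each other. The sketch below captures only that arithmetic; the function names are hypothetical, and real elections also take member priority and replication state into account.

```python
def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """Return True if the reachable members form a strict majority."""
    return reachable_members > voting_members // 2


def fault_tolerance(voting_members: int) -> int:
    """Number of members that can fail while a majority still remains."""
    majority = voting_members // 2 + 1
    return voting_members - majority


for n in (3, 4, 5):
    print(n, "members tolerate", fault_tolerance(n), "failure(s)")
```

- This is why replica sets are usually deployed with an odd number of voting members: a fourth member adds cost without tolerating an extra failure.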


---

### **5. What are the main benefits of MongoDB Atlas?**

- **MongoDB Atlas** is MongoDB's fully managed cloud Database-as-a-Service (DBaaS). It provides a complete, secure, and scalable platform for hosting and managing MongoDB databases in the cloud — with no server setup required.

- It is available on major cloud providers like AWS, Google Cloud Platform (GCP), and Microsoft Azure.

- The main benefits of **MongoDB Atlas** are as follows:

 - Auto-scaling and backup

 - Global cluster deployment

 - Performance monitoring

 - Security controls

 - Fully managed service (no server setup needed)

- Summary:

| Feature                | Benefit                               |
| ---------------------- | ------------------------------------- |
| Fully Managed          | No manual setup or admin work         |
| Secure & Compliant     | Built-in encryption & access control  |
| Globally Available     | Deploy anywhere with low latency      |
| Scalable               | Auto-scaling, sharding, & replication |
| Backup & Restore       | Automated snapshots & recovery        |
| Performance Monitoring | Real-time dashboards & Performance Advisor |
| Rich Ecosystem         | Charts, Triggers, Atlas Search, Realm |
| Developer-Friendly     | API access, CLI tools, integrations   |
| Free Tier              | Ideal for learning & prototyping      |


---

### **6. What is the role of indexes in MongoDB, and how do they improve performance?**

- **Indexes in MongoDB** are special data structures that store a small portion of a collection’s data in an easy-to-search format.

- Think of an index like the index of a book — it helps you find specific topics quickly without scanning the whole book.

- Without an index, MongoDB must perform a collection scan (COLLSCAN), which checks every document to find matching data — slow for large datasets.

- The role of **indexes** is to let MongoDB locate matching documents directly, so filters and sorts on indexed fields avoid a full **collection scan**.

- **Types**:

 - Single field

 - Compound

 - Multikey

 - Text

 - Geospatial

- Summary:

| Feature               | Description                                        |
| --------------------- | -------------------------------------------------- |
| **Purpose**           | Improve query performance                          |
| **Works Like**        | Book index (helps you find fast)                   |
| **Used For**          | Filtering, sorting, searching                      |
| **Common Types**      | Single, compound, text, multikey                   |
| **Performance Boost** | Reduces scan time from full collection to few docs |
| **Commands**          | `createIndex()`, `dropIndex()`, `getIndexes()`     |


- **Example**:

```js
db.orders.createIndex({ "customer_name": 1 });
```

- This speeds up queries filtering on `customer_name`.
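
- The effect of an index can be imitated in plain Python. In this sketch a dictionary plays the role of the index; this is only an analogy (MongoDB's indexes are B-tree structures), and the data and helper names are hypothetical. The second number in each result is how many documents had to be examined.

```python
from collections import defaultdict

# Hypothetical `orders` data.
orders = [
    {"_id": i, "customer_name": name}
    for i, name in enumerate(["Ana", "Bo", "Ana", "Cy", "Bo", "Ana"])
]


def collection_scan(docs, name):
    """Examine every document, like a COLLSCAN."""
    examined = 0
    hits = []
    for doc in docs:
        examined += 1
        if doc["customer_name"] == name:
            hits.append(doc["_id"])
    return hits, examined


# Build a toy index: customer_name -> positions of matching documents.
index = defaultdict(list)
for position, doc in enumerate(orders):
    index[doc["customer_name"]].append(position)


def index_lookup(docs, index, name):
    """Jump straight to the matching documents via the index."""
    positions = index.get(name, [])
    return [docs[p]["_id"] for p in positions], len(positions)


print(collection_scan(orders, "Bo"))
print(index_lookup(orders, index, "Bo"))
```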

---

### **7. Describe the stages of the MongoDB aggregation pipeline.**

- The **Aggregation Pipeline** in MongoDB is a data processing framework that transforms and analyzes documents step by step, similar in spirit to chained DataFrame operations in Pandas or to `GROUP BY` queries in SQL.

- Think of it as a pipeline of stages, where the output of one stage becomes the input to the next.

- Aggregation is like **SQL GROUP BY + HAVING + JOIN**. The stages of the **MongoDB aggregation pipeline** are as follows:

 - `$match`: Filter

 - `$group`: Group and aggregate

 - `$project`: Select fields

 - `$sort`: Sort

 - `$lookup`: Join

 - `$limit`: Limit results

- Summary:

| Stage        | Purpose                        |
| ------------ | ------------------------------ |
| `$match`     | Filter documents               |
| `$project`   | Shape or modify documents      |
| `$group`     | Aggregate values by group      |
| `$sort`      | Sort results                   |
| `$limit`     | Limit result count             |
| `$skip`      | Skip some results              |
| `$unwind`    | Flatten arrays                 |
| `$lookup`    | Join collections               |
| `$addFields` | Add computed fields            |
| `$out`       | Write to new collection        |
| `$merge`     | Merge into existing collection |


- **Example**:

```js
db.orders.aggregate([
  { $match: { region: "West" } },
  { $group: { _id: "$category", totalSales: { $sum: "$sales" } } }
]);
```
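
- The same pipeline can be written in Python as a list of stage dictionaries, which is the form a Python driver would receive. The sketch below also imitates the two stages in plain Python on hypothetical sample data, to show how the output of `$match` feeds `$group`; it is an illustration, not how MongoDB executes the pipeline.

```python
from collections import defaultdict

# The pipeline from the example as Python data. With PyMongo this list
# would be passed to a collection's aggregate() method.
pipeline = [
    {"$match": {"region": "West"}},
    {"$group": {"_id": "$category", "totalSales": {"$sum": "$sales"}}},
]

# Hypothetical sample data.
orders = [
    {"region": "West", "category": "Furniture", "sales": 200},
    {"region": "East", "category": "Furniture", "sales": 500},
    {"region": "West", "category": "Technology", "sales": 300},
    {"region": "West", "category": "Furniture", "sales": 100},
]

# Stage 1 ($match): keep only documents from the West region.
matched = [doc for doc in orders if doc["region"] == "West"]

# Stage 2 ($group): sum sales per category.
totals = defaultdict(int)
for doc in matched:
    totals[doc["category"]] += doc["sales"]

result = [{"_id": category, "totalSales": total} for category, total in totals.items()]
print(result)
```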

---

### **8. What is sharding in MongoDB? How does it differ from replication?**

- **Sharding** is MongoDB’s method of horizontal scaling — it splits large datasets across multiple machines (called shards) to handle:

 - High data volumes

 - High read/write throughput

 - Big user traffic

- **Sharding**: Distributes **data** across multiple servers (horizontal scaling).

- **Replication**: Copies the **same data** to multiple nodes for redundancy.

- Sharding in MongoDB differs from replication in the following ways:

| Feature             | Sharding                                      | Replication                        |
| ------------------- | --------------------------------------------- | ---------------------------------- |
|  Purpose          | **Scalability** (distribute load)             | **Availability** (fault tolerance) |
|  Data Stored      | Different data on each shard                  | Same data on all replica nodes     |
|  Query Routing    | By **shard key** via `mongos`                 | Reads go to primary/secondaries    |
|  Use Case         | Handle **large datasets**, scale horizontally | Handle **failover**, ensure uptime |
|  Failover Support | Provided by each shard's replica set          |  Automatic                        |
|  Performance      | Improves **write & read scaling**             | Improves **read reliability**      |

- Summary:

| Feature           | Sharding                          | Replication                       |
| ----------------- | --------------------------------- | --------------------------------- |
| Main Goal         | **Scale** data across servers     | **Duplicate** data across servers |
| Data Split        | Yes (based on shard key)          | No (same data on all nodes)       |
| Adds Performance  |  Yes (scales reads/writes)       | Limited (mostly read scaling)     |
| Adds Availability | Yes, because each shard is a replica set | Yes (automatic failover)        |
| Use With          | Large, distributed systems        | High-availability systems         |
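
- The contrast can be sketched in a few lines of Python: sharding sends each document to one shard, while replication gives every member a copy. The shard names and helpers are hypothetical, and the modulo-of-a-hash routing is a simplification: MongoDB assigns ranges of shard key values (or of their hashes) to shards.

```python
import hashlib

SHARDS = ["shardA", "shardB", "shardC"]  # hypothetical shard names


def route_to_shard(shard_key_value: str) -> str:
    """Pick exactly one shard for a key: a toy stand-in for hashed sharding."""
    digest = hashlib.md5(shard_key_value.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


def replicate(document: dict, members: int = 3) -> list:
    """Replication: every member stores its own copy of the same document."""
    return [dict(document) for _ in range(members)]


customer_ids = ["C001", "C002", "C003", "C004", "C005", "C006"]
placement = {cid: route_to_shard(cid) for cid in customer_ids}
copies = replicate({"_id": "C001", "name": "John"})

print(placement)  # each key maps to a single shard
print(copies)     # the same document repeated on every member
```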



---

### **9. What is PyMongo, and why is it used?**

- **PyMongo** is the official Python driver for interacting with MongoDB databases.

- It allows Python developers to:

 - Connect to MongoDB

 - Perform CRUD operations (Create, Read, Update, Delete)

 - Run queries, aggregations, and transactions

 - Manage indexes, collections, and databases

- PyMongo communicates with MongoDB using MongoDB's native wire protocol.

| Purpose                        | Benefit                                        |
| ------------------------------ | ---------------------------------------------- |
|  Object-oriented interface   | Easy to use in Python scripts & apps           |
|  Document-based syntax       | Use Python dictionaries like JSON              |
|  Full MongoDB support        | Works with queries, aggregations, transactions |
|  Ideal for data science & ML | Can be combined with Pandas, NumPy, etc.       |
|  Plug-and-play integration   | Works with Flask, Django, FastAPI, etc.        |

- Use cases:

| Application Area          | How PyMongo Helps                      |
| ------------------------- | -------------------------------------- |
|  Data Retrieval         | Run complex queries using Python       |
|  Data Analytics         | Fetch MongoDB data into Pandas         |
|  Machine Learning       | Store training results or logs         |
|  Web Development        | Build APIs with Flask/Django + MongoDB |
|  Automation & ETL Jobs | Automate updates and data migration    |

- Advantages:

| Feature                           | Benefit                                 |
| --------------------------------- | --------------------------------------- |
| Works with native Python types    | Easy to learn and use                   |
| Full MongoDB functionality        | Aggregation, transactions, indexing     |
| Asynchronous support (with Motor) | Use in async applications               |
| Cross-platform                    | Works in Colab, Jupyter, terminal, etc. |

- Summary:

| Feature    | Description                         |
| ---------- | ----------------------------------- |
| Tool       | Python driver for MongoDB           |
| Interface  | Pythonic (uses dicts like JSON)     |
| Common Use | CRUD operations, queries, analytics |
| Ideal For  | Python apps, data science, APIs     |
| Install    | `pip install pymongo`               |


- **Example**:

```python
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["superstore"]
```

---

### **10. What are the ACID properties in the context of MongoDB transactions?**

- **ACID** is an acronym for the four core principles that guarantee the reliability of database transactions:

| Letter | Property        | Description                                                       |
| ------ | --------------- | ----------------------------------------------------------------- |
| A      | **Atomicity**   | All operations in a transaction are completed, or none are        |
| C      | **Consistency** | Data must be valid according to rules and constraints             |
| I      | **Isolation**   | Transactions do not interfere with each other                     |
| D      | **Durability**  | Once a transaction is committed, it remains even in case of crash |

-  MongoDB and ACID Support:

 - Historically, MongoDB guaranteed atomicity only for operations on a single document. Starting from version 4.0, it also supports multi-document ACID transactions, as traditional RDBMS (SQL databases) do.

- Multi-document transactions are available on:

 - Replica sets (since v4.0)

 - Sharded clusters (since v4.2)

- Summary:

| Property        | MongoDB Support                             |
| --------------- | ------------------------------------------- |
| **Atomicity**   |  Yes, multi-document transactions          |
| **Consistency** |  Yes, via validations and indexes          |
| **Isolation**   |  Yes, uncommitted changes are not visible outside the transaction |
| **Durability**  |  Yes, via journaling (a write-ahead log) and write concern |


- **Example**:

```js
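const session = db.getMongo().startSession();
session.startTransaction();
// run operations on collections obtained via session.getDatabase("<dbName>")
session.commitTransaction();
session.endSession();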
```
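
- In Python, the same flow might look like the sketch below. It assumes `client` is a `pymongo.MongoClient` connected to a replica set or sharded cluster; the `bank` database, the `accounts` collection, and the `transfer` function are hypothetical names chosen for illustration.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two accounts inside one transaction.

    `client` is expected to be a pymongo.MongoClient; the `bank` database
    and `accounts` collection are hypothetical.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # The context manager commits on success and aborts on an exception.
        with session.start_transaction():
            accounts.update_one(
                {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
            )
```

- Passing `session=session` to each operation is what ties it to the transaction; an operation issued without the session runs outside it.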

---

### **11. What is the purpose of MongoDB’s `explain()` function?**

- The **explain() function** in MongoDB is a powerful diagnostic tool used to understand how a query or aggregation is executed internally.

- It helps analyze performance by showing:

 - Whether indexes are used

 - How many documents were scanned

 - The time taken for execution

 - The execution plan (query planner decisions)

| Purpose                       | Benefit                                       |
| ----------------------------- | --------------------------------------------- |
|  Understand query behavior  | See how MongoDB fetches data                  |
|  Optimize performance       | Identify slow queries or missing indexes      |
|  Compare query strategies   | Test impact of different filters and indexes  |
|  Debug complex aggregations | Track resource usage in aggregation pipelines |

- Summary:

| Feature        | Description                                           |
| -------------- | ----------------------------------------------------- |
| Tool Type      | Query analyzer                                        |
| Returns        | Execution plan, performance metrics                   |
| Used For       | Optimization, debugging, index usage                  |
| Common Methods | `find()`, `aggregate()`, `update()`                   |
| Modes          | `queryPlanner`, `executionStats`, `allPlansExecution` |


- **Example**:

```js
db.orders.find({ region: "West" }).explain("executionStats");
```
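
- The fields most often checked in an `executionStats` result can be pulled out with a small helper. This is a sketch: `summarize_explain` is a hypothetical function, the sample document is hand-written with illustrative values rather than measured output, and the exact layout of explain output varies between server versions.

```python
def summarize_explain(explain_output: dict) -> dict:
    """Pull the most commonly inspected fields out of an explain document."""
    stats = explain_output["executionStats"]
    plan = explain_output["queryPlanner"]["winningPlan"]
    return {
        "stage": plan["stage"],
        "returned": stats["nReturned"],
        "docs_examined": stats["totalDocsExamined"],
        "keys_examined": stats["totalKeysExamined"],
        "used_collection_scan": plan["stage"] == "COLLSCAN",
    }


# Hand-written sample shaped like an executionStats result
# (values are illustrative, not measured).
sample = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {
        "nReturned": 120,
        "totalKeysExamined": 0,
        "totalDocsExamined": 10000,
        "executionTimeMillis": 12,
    },
}

print(summarize_explain(sample))
```

- A large gap between documents examined and documents returned, together with a `COLLSCAN` stage, usually suggests that an index on the filtered field would help.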

---

### **12. How does MongoDB handle schema validation?**

- MongoDB is a NoSQL document database, and by default, it has a flexible schema. This means that documents in a collection can have different fields, types, or structures. However, MongoDB also allows you to enforce **schema validation** rules to ensure data consistency and quality.

- MongoDB uses **JSON Schema validation** to define rules about:

 - Required fields

 - Data types

 - Value constraints (e.g., min/max)

 - Nested objects/arrays

> These rules are applied using the validator option when creating or modifying a collection.

- **Example**:

```js
db.createCollection("users", {
  validator: {
    $jsonSchema: {
      bsonType: "object",
      required: ["name", "email"],
      properties: {
        name: { bsonType: "string" },
        email: { bsonType: "string" }
      }
    }
  }
});
```
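
- To show what these rules accept and reject, the sketch below re-implements a tiny part of the check in plain Python. The `validate` helper and the type mapping are hypothetical and cover only `required` and `bsonType`; MongoDB performs the real validation on the server when documents are inserted or updated.

```python
# The same rules as the $jsonSchema above, as a Python dict.
schema = {
    "bsonType": "object",
    "required": ["name", "email"],
    "properties": {
        "name": {"bsonType": "string"},
        "email": {"bsonType": "string"},
    },
}

# Toy mapping from BSON type names to Python types (only what is needed here).
PYTHON_TYPES = {"string": str, "int": int, "object": dict}


def validate(doc: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    for field in schema.get("required", []):
        if field not in doc:
            problems.append(f"missing required field: {field}")
    for field, rule in schema.get("properties", {}).items():
        expected = PYTHON_TYPES[rule["bsonType"]]
        if field in doc and not isinstance(doc[field], expected):
            problems.append(f"wrong type for field: {field}")
    return problems


print(validate({"name": "John", "email": "john@example.com"}, schema))
print(validate({"name": 42}, schema))
```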

- Summary:

| Feature               | Description                                    |
| --------------------- | ---------------------------------------------- |
| Schema Type           | JSON Schema-based                              |
| Flexible              | Only enforces rules you define                 |
| Required Fields       | Yes, supported                                 |
| Data Type Enforcement | Yes, using `bsonType`                          |
| Default Behavior      | No validation unless explicitly defined        |
| Best Used For         | Critical collections where consistency matters |

- Real-World Use Cases:

| Use Case              | Why Schema Validation Helps                        |
| --------------------- | -------------------------------------------------- |
|  E-commerce orders  | Ensure every order has Order ID, total, and status |
|  Analytics logs     | Avoid corrupt data formats                         |
|  Student records | Enforce types for marks, class, and IDs            |


---

### **13. What is the difference between a primary and a secondary node in a replica set?**

- A replica set in MongoDB is a group of mongod processes that maintain the same data set to ensure high availability and data redundancy.

- Each replica set consists of:

 - 1 Primary Node

 - 1 or more Secondary Nodes

- The differences between a **primary** and a **secondary node** in a **replica set** are as follows:

| Feature                  | Primary Node              | Secondary Node                                  |
| ------------------------ | ------------------------- | ----------------------------------------------- |
| **Role**                 | Main node handling writes | Backup node replicating primary                 |
| **Writes Allowed**       |  Yes                     |  No (replicates only)                          |
| **Reads Allowed**        |  Yes (default)           |  No (unless configured using `readPreference`) |
| **Failover Eligibility** | Steps down if it loses contact with a majority | Can become primary if elected |
| **Data Source**          | Direct writes             | Replicates from primary                         |
| **Use Case**             | Main transaction handler  | Redundancy, backups, read scaling               |


- If the primary fails, a secondary can become the new primary.

- Summary:

| Aspect           | Primary Node        | Secondary Node             |
| ---------------- | ------------------- | -------------------------- |
| Accepts Writes   |  Yes               |  No                       |
| Accepts Reads    |  Yes (default)     |  Yes (with read preference configured) |
| Role in Failover | Replaced if it fails or steps down | Eligible to become Primary |
| Use Case         | Transaction hub     | Redundancy, analytics      |
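
- Reads from secondaries are typically enabled through the `readPreference` connection option. The sketch below only builds such a connection string; the host names and the replica set name `rs0` are hypothetical. Writes still go to the primary regardless of this setting.

```python
from urllib.parse import urlencode

# Hypothetical host names and replica set name.
hosts = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]
options = {"replicaSet": "rs0", "readPreference": "secondaryPreferred"}

# Listing several members lets the driver discover the current primary.
uri = "mongodb://" + ",".join(hosts) + "/?" + urlencode(options)
print(uri)
```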


---

### **14. What security mechanisms does MongoDB provide for data protection?**

- MongoDB offers a variety of built-in security features to protect data at rest, in transit, and during access. These are essential for securing your database in both on-premise and cloud (MongoDB Atlas) environments.

- The **security mechanisms** that MongoDB provides for **data protection** are as follows:

 - **Authentication** (e.g., SCRAM, LDAP)

 - **Authorization** (role-based access)

 - **Encryption** (at rest and in transit)

 - **TLS/SSL** support

 - **Auditing**

- Summary Table: MongoDB Security Features

| Security Mechanism         | Purpose                              |
| -------------------------- | ------------------------------------ |
| **Authentication**         | Verify user identity                 |
| **Authorization (RBAC)**   | Control access levels                |
| **TLS/SSL Encryption**     | Encrypts data in transit             |
| **Encryption at Rest**     | Protects stored data                 |
| **IP Whitelisting**        | Restricts external access            |
| **Auditing**               | Tracks access and changes            |
| **Client-side Encryption** | Secures sensitive fields end-to-end  |
| **Field-level Redaction**  | Hides sensitive fields based on user |


---

### **15. Explain the concept of embedded documents and when they should be used.**

- An **embedded document** is a document nested inside another document in a MongoDB collection.

- MongoDB is a document-oriented database, and its BSON (Binary JSON) format allows hierarchical, structured data. This means you can store related data within a single document, rather than across multiple collections.

- MongoDB allows **documents within documents**.

- Use Case Examples of Embedded Documents:

| Use Case     | Embedded Field       |
| ------------ | -------------------- |
| User profile | Address, preferences |
| Order        | List of items        |
| Blog post    | Comments             |
| Customer     | Payment methods      |
| IoT device   | Sensor readings      |

- Use Embedded Documents When:

| Scenario                              | Why Embed?                         |
| ------------------------------------- | ---------------------------------- |
| One-to-One or One-to-Few Relationship | Simple and contained               |
| Always fetched together               | Reduces queries and improves speed |
| Limited number of embedded items      | Prevents document size bloat       |
| Data is tightly coupled               | Like user + address                |

- Avoid Embedding When:

| Scenario                             | Why Not Embed?                            |
| ------------------------------------ | ----------------------------------------- |
| One-to-Many with large arrays        | Can exceed MongoDB's 16 MB document limit |
| Data needs frequent updates          | Might require rewriting large document    |
| Embedded data is accessed separately | Better to reference in another collection |

- Embedded vs Referenced Data:

| Feature          | Embedded Document          | Referenced Document            |
| ---------------- | -------------------------- | ------------------------------ |
| Speed            |  Faster                   |  Slower (needs joins/lookups) |
| Structure        | Nested                     | Flat with references           |
| Query Simplicity | Simple                     | More complex                   |
| Flexibility      | Less (fixed structure)     | More flexible                  |
| Use Case         | Small, tight relationships | Large, loosely related data    |

- Summary:

| Concept           | Description                                                   |
| ----------------- | ------------------------------------------------------------- |
| Embedded Document | Document inside another document                              |
| Best Use Case     | One-to-one or one-to-few, accessed together                   |
| Advantage         | Faster queries, better data locality                          |
| Limitation        | Size limit (16 MB per document), can become complex to update |


- **Example**:

```json
{
  "name": "John",
  "address": {
    "street": "1st Street",
    "city": "Mumbai"
  }
}
```
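
- The sketch below contrasts the embedded form with a referenced alternative and shows how a dotted path such as `address.city` reaches into the nested document. The `get_path` helper and the separate `addresses` mapping are hypothetical illustrations in plain Python.

```python
def get_path(doc: dict, dotted_path: str):
    """Follow a dotted path such as 'address.city' into nested dicts."""
    current = doc
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Embedded: the address lives inside the user document.
embedded_user = {
    "name": "John",
    "address": {"street": "1st Street", "city": "Mumbai"},
}

# Referenced alternative: the address sits in a separate (hypothetical) collection.
referenced_user = {"name": "John", "address_id": 101}
addresses = {101: {"street": "1st Street", "city": "Mumbai"}}

# A MongoDB filter on an embedded field uses the same dotted path.
query = {"address.city": "Mumbai"}

print(get_path(embedded_user, "address.city"))            # one document read
print(addresses[referenced_user["address_id"]]["city"])   # needs a second lookup
```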

---

### **16. What is the purpose of MongoDB’s `$lookup` stage in aggregation?**

- The `$lookup` stage in MongoDB's aggregation pipeline performs a left outer join between documents in one collection and documents in another, similar to `LEFT OUTER JOIN` in SQL.

🔹 It combines data from two collections based on a matching field.

- The purpose of MongoDB's **`$lookup`** stage in aggregation:

 - To combine data from multiple collections.

 - Lets related data stay in separate collections (referenced rather than duplicated) and still be combined at query time.

 - Helps in reporting, analytics, and relationship-based queries.


- **Example**:

```js
db.orders.aggregate([
  {
    $lookup: {
      from: "customers",
      localField: "customer_id",
      foreignField: "_id",
      as: "customer_info"
    }
  }
]);
```
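
- The left-outer-join behavior can be imitated in plain Python. In this sketch the `lookup` helper and the sample data are hypothetical; it mirrors the stage above by attaching every matching customer as an array, and leaving the array empty when nothing matches.

```python
# Hypothetical sample data for the two collections.
customers = [
    {"_id": 1, "name": "John"},
    {"_id": 2, "name": "Priya"},
]
orders = [
    {"_id": 100, "customer_id": 1, "total": 250},
    {"_id": 101, "customer_id": 2, "total": 90},
    {"_id": 102, "customer_id": 9, "total": 40},  # no matching customer
]


def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Toy version of $lookup: attach all matching foreign docs as an array."""
    joined = []
    for doc in local_docs:
        matches = [
            f for f in foreign_docs if f.get(foreign_field) == doc.get(local_field)
        ]
        joined.append({**doc, as_field: matches})
    return joined


result = lookup(orders, customers, "customer_id", "_id", "customer_info")
for doc in result:
    print(doc["_id"], doc["customer_info"])
```

- Every order appears in the result, including order 102, whose `customer_info` array is empty. That is what makes the join a left outer join rather than an inner join.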

- Summary:

| Feature         | Value                                      |
| --------------- | ------------------------------------------ |
| Purpose         | Join two collections                       |
| SQL Equivalent  | `LEFT OUTER JOIN`                          |
| Output          | Array field (`as`)                         |
| Required Fields | `localField`, `foreignField`, `from`, `as` |
| Advanced        | Use `$lookup` with `let` and `pipeline`    |


---

### **17. What are some common use cases for MongoDB?**

- MongoDB is a popular NoSQL document-oriented database known for its flexibility, scalability, and performance. It is used in a wide variety of modern applications where JSON-like dynamic data and rapid development are critical.

- Some common **use cases for MongoDB** are as follows:

 - Real-time analytics

 - Content management

 - Product catalogs

 - Mobile/web apps

 - Internet of Things (IoT)

 - Chat & messaging apps

- Summary:

| Use Case                     | Why MongoDB Works Well                       |
| ---------------------------- | -------------------------------------------- |
| E-commerce catalogs          | Flexible schema, handles varied product fields |
| CMS & blogs                  | Dynamic content structure                    |
| Real-time apps (chat, games) | Low latency, fast writes, capped collections |
| Mobile/web app backend       | Easy JSON integration with APIs              |
| IoT & time-series data       | Efficient time-stamped storage               |
| Analytics dashboards         | Powerful aggregation pipeline                |
| Financial systems            | Multi-document ACID transactions, fast indexing |
| Microservices & containers   | Works well in distributed cloud environments |


---

### **18. What are the advantages of using MongoDB for horizontal scaling?**

- **Horizontal Scaling** (also known as scaling out) means adding more machines (nodes/servers) to distribute the data and handle increased load, instead of upgrading a single machine's resources (vertical scaling).

- MongoDB uses sharding to split large datasets across multiple shards (servers). A router (mongos) directs client queries to the appropriate shard.

- The advantages of using MongoDB for **horizontal scaling** through **sharding** include:

 - Handling large datasets

 - Distributing load

 - Scaling out by adding more nodes

 - Higher total capacity than a single vertically scaled server can offer

- Summary:

| Feature           | MongoDB Horizontal Scaling Benefit |
| ----------------- | ---------------------------------- |
| Data Distribution | Uses sharding to split data        |
| Performance       | Parallel query processing          |
| Cost              | Uses multiple cheaper servers      |
| Fault Tolerance   | Shards are replicated              |
| Scalability       | Add servers as needed              |
| Load Balancing    | Built-in query router (mongos)     |
| Downtime          | None (online scaling supported)    |
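- **Example (illustrative sketch)**: The routing idea described above can be imitated in plain Python. This is only a toy model, not MongoDB's implementation: MongoDB assigns ranges of shard-key values (chunks) to shards and uses its own hashing, while the shard names and the `route` helper below are hypothetical.

```python
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def route(shard_key_value, shards=SHARDS):
    """Pick a shard for a shard-key value by hashing it (illustration only)."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


# Route 9,000 hypothetical customer IDs and count how many land on each shard.
distribution = Counter(route(f"CUST-{i}") for i in range(9000))
print(distribution)
```

- Because the hash spreads keys roughly evenly, each shard receives about a third of the documents, which is the effect a well-chosen shard key aims for.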

- Conclusion:

> MongoDB’s horizontal scaling architecture using sharding makes it ideal for:

 - Handling massive datasets

 - Serving millions of users

 - Powering real-time analytics and global-scale applications

---

### **19. How do MongoDB transactions differ from SQL transactions?**

- A **transaction** is a sequence of one or more operations that are executed as a single unit. It ensures ACID properties:

> A - Atomicity

> C - Consistency

> I - Isolation

> D - Durability

- Both MongoDB and SQL databases support transactions, but they handle them differently.

- **MongoDB transactions** differ from **SQL transactions** in the following ways:

| Feature                 | SQL Databases (MySQL, PostgreSQL)        | MongoDB (NoSQL)                                                         |
| ----------------------- | ---------------------------------------- | ----------------------------------------------------------------------- |
| **Support Level**       | Fully supported for decades              | Multi-document transactions since 4.0 (replica sets) and 4.2 (sharded)  |
| **Default Use**         | Frequently used (standard)               | Optional; only used when needed                                         |
| **Atomicity Scope**     | Across multiple rows and tables          | Originally atomic only at single document level; now supports multi-doc |
| **Data Structure**      | Relational: tables, rows                 | Document-oriented: collections, documents                               |
| **Complex Joins**       | Fully supported                          | Limited join capability (`$lookup`)                                     |
| **Performance**         | Optimized for heavy transaction loads    | Slightly slower due to added overhead                                   |
| **Concurrency Control** | MVCC (Multi-Version Concurrency Control) | WiredTiger engine with MVCC and document-level concurrency control      |
| **Usage Scenario**      | Banking, billing, enterprise apps        | Web apps, e-commerce, IoT, analytics                                    |

- Summary:

| Aspect            | MongoDB                    | SQL Databases                       |
| ----------------- | -------------------------- | ----------------------------------- |
| Default Atomicity | Per document               | Per transaction block               |
| Multi-document TX | Yes (from v4.0+)           | Yes (built-in)                      |
| Performance       | Good but has some overhead | Very fast with optimized engines    |
| Transaction API   | Manual using sessions      | Built-in syntax (`BEGIN`, `COMMIT`) |
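- **Example (illustrative sketch)**: The "manual using sessions" row can be sketched with PyMongo's session API. The `bank` database, `accounts` collection, and `transfer` helper are hypothetical names chosen for illustration, and the code assumes `client` is a `MongoClient` connected to a replica set or sharded cluster, since standalone servers do not support transactions.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents in one transaction."""
    accounts = client["bank"]["accounts"]  # hypothetical database/collection

    def callback(session):
        # Passing the session ties both updates to the same transaction,
        # so they commit together or not at all.
        accounts.update_one(
            {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
        )
        accounts.update_one(
            {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
        )

    with client.start_session() as session:
        # with_transaction starts the transaction, runs the callback,
        # and commits, retrying on transient errors.
        session.with_transaction(callback)
```

- In SQL the same unit of work would sit between `BEGIN` and `COMMIT`; here the session object plays that role.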

---

### **20. What are the main differences between capped collections and regular collections?**

- MongoDB supports two primary types of collections:

 - Regular Collections

 - Capped Collections

- Each has its own use cases, characteristics, and limitations.

- The main differences between **capped collections and regular collections** are as follows:

| Feature                       | Regular Collection | Capped Collection                              |
| ----------------------------- | ------------------ | ---------------------------------------------- |
| **Growth**                    | Unlimited          | Fixed-size (circular buffer)                   |
| **Insertion Order Preserved** | Not guaranteed     | Yes                                            |
| **Deletion Allowed**          | Yes                | No (overwrites oldest docs automatically)      |
| **Update Support**            | Any size           | Only if updated doc is same size               |
| **Best Use Case**             | General storage    | Logging, Sensor Data, Real-Time Streams        |
| **Indexing**                  | Full support       | `_id` index by default; secondary indexes work |
| **Performance**               | Normal             | Faster (pre-allocated space, no fragmentation) |

- Summary:

| Property            | Capped Collection | Regular Collection     |
| ------------------- | ----------------- | ---------------------- |
| Size Limit          | Yes (required)    | No (grows as needed)   |
| Overwrites Old Docs | Yes               | No                     |
| Deletions Allowed   | No                | Yes                    |
| Use Case            | Logs, FIFO Queues | General-purpose data   |
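- **Example (illustrative sketch)**: The circular-buffer behaviour can be imitated with a bounded `deque` from Python's standard library. This is an analogy only: a real capped collection is limited by a byte size (and optionally a document count), whereas the cap of 3 items below is a hypothetical value picked to keep the output short.

```python
from collections import deque

# A deque with maxlen behaves like a circular buffer: once it is full,
# appending a new item silently drops the oldest one.
capped_log = deque(maxlen=3)  # hypothetical cap of 3 documents

for n in range(1, 6):
    capped_log.append({"seq": n, "msg": f"event {n}"})

# Only the three most recent documents remain, in insertion order.
print([doc["seq"] for doc in capped_log])  # → [3, 4, 5]
```

- With PyMongo, an actual capped collection is created by passing the options to `create_collection`, for example `db.create_collection("logs", capped=True, size=1_000_000, max=1000)`, where the collection name and limits are placeholders.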

---

### **21. What is the purpose of the `$match` stage in MongoDB’s aggregation pipeline?**

- The **$match stage** in MongoDB’s aggregation pipeline is used to filter documents — just like the WHERE clause in SQL.

- It only passes documents that meet the specified criteria to the next stage in the pipeline.

- The primary purpose of the **$match** stage in MongoDB's aggregation pipeline:

| Purpose                           | Benefit                               |
| --------------------------------- | ------------------------------------- |
| Filter data early in the pipeline | Reduces processing load               |
| Speeds up performance             | Processes fewer documents             |
| Applies complex conditions        | Uses logical and comparison operators |

- Summary:

| Feature        | Description                                     |
| -------------- | ----------------------------------------------- |
| Stage Type     | Filtering                                       |
| Equivalent To  | SQL `WHERE` clause                              |
| Common Use     | Select subset of documents                      |
| Operators Used | `$eq`, `$gt`, `$lt`, `$in`, `$and`, `$or`, etc. |
| Position       | Early in pipeline (best practice)               |


- **Example**:

```js
{ $match: { category: "Furniture" } }
```

- Improves performance by reducing data before aggregation.
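- **Example (illustrative sketch)**: A slightly fuller pipeline in Python shows why `$match` belongs at the front. It assumes the field names of the Superstore dataset (`Category`, `Region`, `Sales`); the `sales_by_region` helper is a hypothetical name.

```python
def sales_by_region(category):
    """Build a pipeline that filters by Category before grouping by Region."""
    return [
        # Stage 1: keep only the matching documents (like SQL WHERE).
        {"$match": {"Category": category}},
        # Stage 2: group the remaining documents and sum their sales.
        {"$group": {"_id": "$Region", "TotalSales": {"$sum": "$Sales"}}},
        # Stage 3: largest totals first.
        {"$sort": {"TotalSales": -1}},
    ]


pipeline = sales_by_region("Furniture")
print(pipeline)
# With a live connection: results = list(collection.aggregate(pipeline))
```

- Because the filter runs first, the `$group` stage only has to process documents from one category, and a `$match` at the start of a pipeline can also use an index on the filtered field.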

---

### **22. How can you secure access to a MongoDB database?**

- Securing a MongoDB database is critical for protecting sensitive data from unauthorized access, data breaches, and misuse. MongoDB provides a multi-layered security model that includes authentication, authorization, encryption, auditing, and network access control.

- Best practices to **secure access** to a MongoDB database:

 - Enable **authentication**

 - Use **role-based access control**

 - Run MongoDB behind a **firewall**

 - Enable **TLS/SSL**

 - Use **network whitelisting**

- Summary:

| Security Feature     | Purpose                       |
| -------------------- | ----------------------------- |
| Authentication       | Verify users                  |
| Authorization (RBAC) | Control access based on roles |
| TLS/SSL Encryption   | Secure network traffic        |
| IP Whitelisting      | Restrict external access      |
| Encryption at Rest   | Protect stored data           |
| Audit Logs           | Track activity                |
| Network Firewall     | Block unwanted traffic        |
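- **Example (illustrative sketch)**: Role-based access control can be sketched with PyMongo by creating a user that holds only the built-in `read` role. The helper name `create_read_only_user` is hypothetical, and the code assumes `db` is a PyMongo `Database` obtained from a connection that is itself authenticated with enough privileges to create users.

```python
def create_read_only_user(db, username, password):
    """Create a user who can only read from `db` (least privilege)."""
    return db.command(
        "createUser",
        username,
        pwd=password,
        # The built-in "read" role allows queries but no writes.
        roles=[{"role": "read", "db": db.name}],
    )
```

- A reporting script could then connect as this user and would be refused any insert, update, or delete, which limits the damage if its credentials leak.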


---

### **23. What is MongoDB’s WiredTiger storage engine, and why is it important?**

- **WiredTiger** is the default storage engine used by MongoDB since version 3.2.

- A storage engine is the low-level component of a database that manages how data is stored, accessed, compressed, cached, and written to disk.

- Key Features of WiredTiger:

| Feature                  | Description                                                               |
| ------------------------ | ------------------------------------------------------------------------- |
| Document-Level Locking   | Improves performance by allowing concurrent writes to different documents |
| Compression              | Reduces storage space using block compression (Snappy, Zlib, etc.)        |
| Write-Ahead Logging      | Ensures durability in case of crashes                                     |
| Checkpoints              | Periodic snapshots to minimize data loss                                  |
| Caching                  | Uses RAM to speed up reads/writes                                         |
| Journaling               | Maintains recovery logs for safe crash recovery                           |

- Importance:

| Reason                     | Impact                                                       |
| -------------------------- | ------------------------------------------------------------ |
| High Concurrency           | Many users can write/read at the same time without waiting   |
| Storage Efficiency         | Compression saves disk space and cost                        |
| Fast Caching and Access    | Data is held in memory for faster performance                |
| Crash Recovery             | Journals and checkpoints ensure data is not lost             |
| Fine-Tuned Performance     | You can tweak cache size, compression, etc. for optimization |

- Summary:

| Feature          | WiredTiger Advantage                |
| ---------------- | ----------------------------------- |
| Locking          | Document-level (better concurrency) |
| Compression      | Yes (Snappy, Zlib)                  |
| Durability       | Journaling + Checkpoints            |
| Cache Management | Tunable memory usage                |
| Default Engine   | Yes (MongoDB ≥ 3.2)                 |
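- **Example (illustrative sketch)**: The compression setting can be chosen per collection when the collection is created. The helper name `create_compressed_collection` is hypothetical, and the code assumes `db` is a PyMongo `Database` on a server running WiredTiger.

```python
def create_compressed_collection(db, name, compressor="zlib"):
    """Create a collection whose WiredTiger blocks use `compressor`."""
    return db.create_collection(
        name,
        # Storage-engine options are passed through to the server's
        # create command as a WiredTiger configuration string.
        storageEngine={
            "wiredTiger": {"configString": f"block_compressor={compressor}"}
        },
    )
```

- Snappy is the default block compressor; switching to zlib usually trades more CPU time for a smaller footprint on disk.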



---

# **Practical Questions:-**

---

In [None]:
!pip install pymongo



In [None]:
# 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.

# ----------------------------------------
# STEP 1: Install Required Packages
# ----------------------------------------
!pip install pymongo dnspython --quiet

# ----------------------------------------
# STEP 2: Upload the CSV File
# ----------------------------------------
from google.colab import files
uploaded = files.upload()  # Upload 'superstore_db.csv'

# ----------------------------------------
# STEP 3: Import Modules
# ----------------------------------------
from getpass import getpass

import pandas as pd
from pymongo import MongoClient

# ----------------------------------------
# STEP 4: Read the CSV File
# ----------------------------------------
df = pd.read_csv("superstore_db.csv", encoding='latin1')
print(" Dataset Loaded Successfully")
print(df.head(3))

# ----------------------------------------
# STEP 5: Connect to MongoDB Atlas
# ----------------------------------------

# Never hardcode credentials in a notebook. Prompt for the connection string
# instead; it has the form
# "mongodb+srv://<username>:<password>@<cluster-host>/?retryWrites=true&w=majority"
mongo_uri = getpass("MongoDB connection string: ")

# TLS certificate validation stays enabled (the PyMongo default).
client = MongoClient(mongo_uri)

# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)


db = client["superstore"]              # Database name
collection = db["orders"]              # Collection name

# -------------------------------------------------------------
# STEP 6: Convert DataFrame to Dictionary & Insert into MongoDB
# -------------------------------------------------------------
data = df.to_dict("records")           # Convert to list of dictionaries
collection.insert_many(data)           # Insert all records

print(f" Inserted {len(data)} records into MongoDB 'superstore.orders' collection.")

Saving superstore_db.csv to superstore_db (1).csv
 Dataset Loaded Successfully
   Row ID        Order ID Order Date   Ship Date     Ship Mode Customer ID  \
0       1  CA-2016-152156  11/8/2016  11/11/2016  Second Class    CG-12520   
1       2  CA-2016-152156  11/8/2016  11/11/2016  Second Class    CG-12520   
2       3  CA-2016-138688  6/12/2016   6/16/2016  Second Class    DV-13045   

     Customer Name    Segment        Country         City  ... Postal Code  \
0      Claire Gute   Consumer  United States    Henderson  ...       42420   
1      Claire Gute   Consumer  United States    Henderson  ...       42420   
2  Darrin Van Huff  Corporate  United States  Los Angeles  ...       90036   

   Region       Product ID         Category Sub-Category  \
0   South  FUR-BO-10001798        Furniture    Bookcases   
1   South  FUR-CH-10000454        Furniture       Chairs   
2    West  OFF-LA-10000240  Office Supplies       Labels   

                                        Product Name  

In [None]:
# 2. Retrieve and print all documents from the Orders collection.

from pprint import pprint

print(" Printing all documents from the 'orders' collection:")
for doc in collection.find():
    pprint(doc)

Streaming output truncated to the last 5000 lines.
 'Segment': 'Consumer',
 'Ship Date': '12/15/2017',
 'Ship Mode': 'Standard Class',
 'State': 'Michigan',
 'Sub-Category': 'Binders',
 '_id': ObjectId('687228594a9ac9c641f1df3d')}
{'Category': 'Office Supplies',
 'City': 'San Francisco',
 'Country': 'United States',
 'Customer ID': 'TC-21535',
 'Customer Name': 'Tracy Collins',
 'Discount': 0.0,
 'Order Date': '12/7/2017',
 'Order ID': 'CA-2017-142328',
 'Postal Code': 94122,
 'Product ID': 'OFF-PA-10000380',
 'Product Name': 'REDIFORM Incoming/Outgoing Call Register, 11" X 8 1/2", 100 '
                 'Messages',
 'Profit': 25.02,
 'Quantity': 6,
 'Region': 'West',
 'Row ID': 9769,
 'Sales': 50.04,
 'Segment': 'Home Office',
 'Ship Date': '12/14/2017',
 'Ship Mode': 'Standard Class',
 'State': 'California',
 'Sub-Category': 'Paper',
 '_id': ObjectId('687228594a9ac9c641f1df3e')}
{'Category': 'Furniture',
 'City': 'Hialeah',
 'Country': 'United States',
 'Customer ID': '

In [None]:
# 3. Count and display the total number of documents in the Orders collection.

total_docs = collection.count_documents({})
print(f" Total number of documents in 'orders' collection: {total_docs}")

 Total number of documents in 'orders' collection: 25429


In [None]:
# 4. Write a query to fetch all orders from the "West" region.

print("\n Orders from the 'West' region:")
west_orders = collection.find({"Region": "West"})
for order in west_orders:
    pprint(order)

Streaming output truncated to the last 5000 lines.
 'State': 'New Mexico',
 'Sub-Category': 'Art',
 '_id': ObjectId('687228594a9ac9c641f1dd50')}
{'Category': 'Furniture',
 'City': 'Seattle',
 'Country': 'United States',
 'Customer ID': 'HJ-14875',
 'Customer Name': 'Heather Jas',
 'Discount': 0.0,
 'Order Date': '9/20/2016',
 'Order ID': 'CA-2016-166772',
 'Postal Code': 98105,
 'Product ID': 'FUR-BO-10002853',
 'Product Name': "O'Sullivan 5-Shelf Heavy-Duty Bookcases",
 'Profit': 40.97,
 'Quantity': 2,
 'Region': 'West',
 'Row ID': 9281,
 'Sales': 163.88,
 'Segment': 'Home Office',
 'Ship Date': '9/24/2016',
 'Ship Mode': 'Standard Class',
 'State': 'Washington',
 'Sub-Category': 'Bookcases',
 '_id': ObjectId('687228594a9ac9c641f1dd56')}
{'Category': 'Office Supplies',
 'City': 'Fort Collins',
 'Country': 'United States',
 'Customer ID': 'LW-17215',
 'Customer Name': 'Luke Weiss',
 'Discount': 0.2,
 'Order Date': '9/2/2017',
 'Order ID': 'CA-2017-102218',
 'Postal Code':

In [None]:
# 5. Write a query to find orders where Sales is greater than 500.

print("\n Orders with Sales greater than 500:")
high_sales_orders = collection.find({"Sales": {"$gt": 500}})
for order in high_sales_orders:
    pprint(order)


Streaming output truncated to the last 5000 lines.
 'Country': 'United States',
 'Customer ID': 'CW-11905',
 'Customer Name': 'Carl Weiss',
 'Discount': 0.0,
 'Order Date': '12/8/2017',
 'Order ID': 'CA-2017-152436',
 'Postal Code': 2920,
 'Product ID': 'OFF-ST-10000036',
 'Product Name': 'Recycled Data-Pak for Archival Bound Computer Printouts, '
                 '12-1/2 x 12-1/2 x 16',
 'Profit': 160.0398,
 'Quantity': 6,
 'Region': 'East',
 'Row ID': 7821,
 'Sales': 592.74,
 'Segment': 'Home Office',
 'Ship Date': '12/10/2017',
 'Ship Mode': 'Second Class',
 'State': 'Rhode Island',
 'Sub-Category': 'Storage',
 '_id': ObjectId('687228594a9ac9c641f1d7a2')}
{'Category': 'Technology',
 'City': 'Chicago',
 'Country': 'United States',
 'Customer ID': 'CY-12745',
 'Customer Name': 'Craig Yedwab',
 'Discount': 0.2,
 'Order Date': '10/31/2017',
 'Order ID': 'CA-2017-117114',
 'Postal Code': 60610,
 'Product ID': 'TEC-PH-10004042',
 'Product Name': 'ClearOne Communications CHAT

In [None]:
# 6. Fetch the top 3 orders with the highest Profit.

print("\n Top 3 orders with the highest Profit:")
top_profit_orders = collection.find().sort("Profit", -1).limit(3)
for order in top_profit_orders:
    pprint(order)


 Top 3 orders with the highest Profit:
{'Category': 'Technology',
 'City': 'Lafayette',
 'Country': 'United States',
 'Customer ID': 'TC-20980',
 'Customer Name': 'Tamara Chand',
 'Discount': 0.0,
 'Order Date': '10/2/2016',
 'Order ID': 'CA-2016-118689',
 'Postal Code': 47905,
 'Product ID': 'TEC-CO-10004722',
 'Product Name': 'Canon imageCLASS 2200 Advanced Copier',
 'Profit': 8399.976,
 'Quantity': 5,
 'Region': 'Central',
 'Row ID': 6827,
 'Sales': 17499.95,
 'Segment': 'Corporate',
 'Ship Date': '10/9/2016',
 'Ship Mode': 'Standard Class',
 'State': 'Indiana',
 'Sub-Category': 'Copiers',
 '_id': ObjectId('687228094a9ac9c641f1acb5')}
{'Category': 'Technology',
 'City': 'Lafayette',
 'Country': 'United States',
 'Customer ID': 'TC-20980',
 'Customer Name': 'Tamara Chand',
 'Discount': 0.0,
 'Order Date': '10/2/2016',
 'Order ID': 'CA-2016-118689',
 'Postal Code': 47905,
 'Product ID': 'TEC-CO-10004722',
 'Product Name': 'Canon imageCLASS 2200 Advanced Copier',
 'Profit': 8399.976,


In [None]:
# 7. Update all orders with Ship Mode as "First Class" to "Premium Class".

print("\n Updating Ship Mode from 'First Class' to 'Premium Class':")
update_result = collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)
print(f" Matched {update_result.matched_count} documents.")
print(f" Modified {update_result.modified_count} documents.")

# Verify the update (optional)
print("\n Verifying updates for Ship Mode 'Premium Class':")
premium_orders = collection.find({"Ship Mode": "Premium Class"})
for order in premium_orders:
    pprint(order)

Streaming output truncated to the last 5000 lines.
 '_id': ObjectId('687228094a9ac9c641f1b3c3')}
{'Category': 'Office Supplies',
 'City': 'Bethlehem',
 'Country': 'United States',
 'Customer ID': 'OT-18730',
 'Customer Name': 'Olvera Toch',
 'Discount': 0.2,
 'Order Date': '7/28/2016',
 'Order ID': 'CA-2016-107783',
 'Postal Code': 18018,
 'Product ID': 'OFF-AP-10004052',
 'Product Name': 'Hoover Replacement Belts For Soft Guard & Commercial '
                 'Ltweight Upright Vacs, 2/Pk',
 'Profit': 0.711,
 'Quantity': 3,
 'Region': 'East',
 'Row ID': 8634,
 'Sales': 9.48,
 'Segment': 'Consumer',
 'Ship Date': '7/29/2016',
 'Ship Mode': 'Premium Class',
 'State': 'Pennsylvania',
 'Sub-Category': 'Appliances',
 '_id': ObjectId('687228094a9ac9c641f1b3c4')}
{'Category': 'Office Supplies',
 'City': 'Louisville',
 'Country': 'United States',
 'Customer ID': 'JM-16195',
 'Customer Name': 'Justin MacKendrick',
 'Discount': 0.0,
 'Order Date': '4/30/2014',
 'Order ID': 'CA-2014

In [None]:
# 8. Delete all orders where Sales is less than 50.

print("\n Deleting orders where Sales is less than 50:")
delete_result = collection.delete_many({"Sales": {"$lt": 50}})
print(f" Deleted {delete_result.deleted_count} documents.")

# Verify the deletion (optional)
print("\n Counting documents after deletion:")
total_docs_after_delete = collection.count_documents({})
print(f" Total number of documents in 'orders' collection after deletion: {total_docs_after_delete}")


 Deleting orders where Sales is less than 50:
 Deleted 4849 documents.

 Counting documents after deletion:
 Total number of documents in 'orders' collection after deletion: 15435


In [None]:
# 9. Use aggregation to group orders by Region and calculate total sales per region.

print("\n Total Sales per Region using Aggregation:")
pipeline = [
    {"$group": {"_id": "$Region", "TotalSales": {"$sum": "$Sales"}}},
    {"$sort": {"TotalSales": -1}} # Sort by total sales in descending order
]
region_sales = list(collection.aggregate(pipeline))
for result in region_sales:
    pprint(result)


 Total Sales per Region using Aggregation:
{'TotalSales': 2084059.8585, '_id': 'West'}
{'TotalSales': 1953413.115, '_id': 'East'}
{'TotalSales': 1438835.5374, '_id': 'Central'}
{'TotalSales': 1128069.936, '_id': 'South'}


In [None]:
# 10. Fetch all distinct values for Ship Mode from the collection.

print("\n Distinct Ship Modes:")
distinct_ship_modes = collection.distinct("Ship Mode")
pprint(distinct_ship_modes)


 Distinct Ship Modes:
['Premium Class', 'Same Day', 'Second Class', 'Standard Class']


In [None]:
# 11. Count the number of orders for each category.

print("\n Number of Orders per Category using Aggregation:")
pipeline_category_count = [
    {"$group": {"_id": "$Category", "OrderCount": {"$sum": 1}}},
    {"$sort": {"OrderCount": -1}} # Sort by order count in descending order
]
category_order_counts = list(collection.aggregate(pipeline_category_count))
for result in category_order_counts:
    pprint(result)


 Number of Orders per Category using Aggregation:
{'OrderCount': 6228, '_id': 'Office Supplies'}
{'OrderCount': 4719, '_id': 'Furniture'}
{'OrderCount': 4488, '_id': 'Technology'}
