# Theoretical Questions

Question 1. What are the key differences between SQL and NoSQL databases?

Answer:

SQL and NoSQL databases differ in structure, scalability, and use cases. Here's a concise breakdown of their key differences:

*   Data Structure:

    * SQL: Relational databases use structured tables with predefined schemas, organizing data into rows and columns. Data is stored in a tabular format with fixed fields.
    * NoSQL: Non-relational databases support flexible, schema-less structures. They handle various data types like key-value, document, column-family, or graph formats.


*   Schema:

    * SQL: Requires a rigid, predefined schema. Changes to the schema (e.g., adding columns) can be complex and may require downtime.
    * NoSQL: Dynamic schema allows adding fields on the fly, making it adaptable to evolving data needs without major restructuring.


*   Scalability:

    * SQL: Scales vertically (adding more power to a single server). Scaling horizontally (across multiple servers) is possible but complex and less common.
    * NoSQL: Designed for horizontal scaling, easily distributing data across multiple servers or nodes, ideal for large-scale, distributed systems.


*   Query Language:

    * SQL: Uses SQL (Structured Query Language), an ISO/ANSI standard. The core syntax is shared across relational databases like MySQL, PostgreSQL, or Oracle, although each adds its own dialect extensions.
    * NoSQL: Query mechanisms vary by type (e.g., MongoDB uses JSON-like queries, Cassandra uses CQL). No universal standard exists.


*   Consistency vs. Availability:

    * SQL: Emphasizes ACID (Atomicity, Consistency, Isolation, Durability) properties, ensuring strong consistency but potentially sacrificing availability in distributed systems.
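    * NoSQL: Often follows BASE (Basically Available, Soft state, Eventual consistency), prioritizing availability and partition tolerance over immediate consistency, suitable for high-availability systems. This is a tendency rather than a rule: some NoSQL databases, including MongoDB, also support ACID transactions.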


*   Use Cases:

    * SQL: Best for structured data and applications requiring complex queries, transactions, or joins, like financial systems, ERP, or CRM.
    * NoSQL: Suited for unstructured or semi-structured data, big data, real-time analytics, or applications like social media, IoT, or content management.


*   Examples:

    * SQL: MySQL, PostgreSQL, Oracle Database, Microsoft SQL Server.
    * NoSQL: MongoDB (document), Cassandra (column-family), Redis (key-value), Neo4j (graph).
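
The schema difference above can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module for the relational side and a plain list of dictionaries as a stand-in for a document collection; the table, field names, and sample values are made up for illustration, and no MongoDB server is involved.

```python
import sqlite3

# Relational side: the table's columns are declared up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO users (id, name, age) VALUES (?, ?, ?)", (1, "Alice", 25))

# Writing to a column that is not part of the schema is rejected.
try:
    conn.execute(
        "INSERT INTO users (id, name, email) VALUES (?, ?, ?)",
        (2, "Bob", "bob@example.com"),
    )
    schema_rejected = False
except sqlite3.OperationalError:
    schema_rejected = True
conn.close()

# Document side: a list of dicts stands in for a collection.
# Each "document" can carry its own set of fields.
users_collection = [
    {"_id": 1, "name": "Alice", "age": 25},
    {"_id": 2, "name": "Bob", "email": "bob@example.com"},
]
field_sets = [sorted(doc.keys()) for doc in users_collection]

print(schema_rejected)
print(field_sets)
```

In the relational case, storing Bob's email would first need an `ALTER TABLE`; in the document case, the new field simply appears on the documents that have it.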

Question 2. What makes MongoDB a good choice for modern applications?


Answer:

MongoDB, a leading NoSQL database, is a strong choice for modern applications due to its flexibility, scalability, and developer-friendly features. Here’s a concise overview of what makes MongoDB well-suited for contemporary use cases:

*   Flexible Document Model:

    * MongoDB uses a JSON-like (BSON) document structure, allowing storage of diverse data types without a rigid schema. This flexibility supports rapid development and accommodates evolving data needs, ideal for applications like e-commerce, content management, or IoT.


*   Scalability:

    * Designed for horizontal scaling, MongoDB distributes data across multiple servers via sharding, handling large-scale, high-traffic applications efficiently. Its architecture supports seamless scaling for modern cloud-based or distributed systems.


*   High Performance:

    * MongoDB's default storage engine (WiredTiger) keeps frequently accessed data in an in-memory cache. Combined with indexing and the aggregation framework, this enables fast read/write operations, making it suitable for real-time applications like analytics, social media, or recommendation engines.


*   Rich Query Language:

    * Supports powerful queries, including filtering, sorting, and aggregations, as well as geospatial and full-text search. This allows developers to handle complex data operations without needing additional tools.


*   Developer Productivity:

    * Its JSON-like documents align closely with modern programming languages (e.g., JavaScript, Python), reducing data mapping overhead. Native drivers for multiple languages and a low learning curve streamline development.


*   Cloud-Native and Managed Services:

    * MongoDB Atlas, its fully managed cloud service, simplifies deployment, scaling, and maintenance across AWS, Azure, and Google Cloud. Features like automated backups, monitoring, and global clusters support modern DevOps workflows.


*   Support for Diverse Workloads:

    * Handles unstructured, semi-structured, or structured data, making it versatile for use cases like real-time analytics, mobile apps, personalization engines, or event-driven architectures.


*   Ecosystem and Community:

    * MongoDB offers tools like MongoDB Compass (GUI), MongoDB Charts, and integration with frameworks like Node.js, Spring, or Django. A large community and extensive documentation ensure robust support.

Question 3. Explain the concept of collections in MongoDB.

Answer:

In MongoDB, a collection is a fundamental concept that serves as a container for storing documents, which are the basic units of data in this NoSQL database. Collections are analogous to tables in relational databases (e.g., SQL), but they differ significantly due to MongoDB's schema-less, document-oriented design. Below is a clear and concise explanation of collections in MongoDB:

Key Characteristics of Collections

*   Group of Documents:

    A collection holds multiple documents, where each document is a JSON-like (BSON) record containing key-value pairs. For example, a document might look like:

        {
          "_id": 1,
          "name": "Alice",
          "age": 25,
          "city": "New York"
        }

    Documents within a collection can have different structures, unlike the fixed schema of SQL tables.


*   Schema-Less Design:

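    By default, collections do not enforce a predefined schema. This means documents in the same collection can have varied fields, data types, or structures, enabling flexibility for evolving application needs. When stricter structure is needed, optional schema validation rules can be attached to a collection.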

    Example: One document in a users collection might have an age field, while another might include an email field instead, with no conflict.


*   No Fixed Size or Structure:

    Collections are dynamic, allowing documents to be added, updated, or removed without predefined constraints on the collection's size or format (each individual document is limited to 16 MB), making them ideal for handling diverse or unstructured data.


*   Organized Within Databases:

    Collections reside within a MongoDB database. A single database can contain multiple collections, each with a unique name (e.g., users, products, orders).

    Example: A database named shop might have collections like customers, orders, and inventory.


*   Automatic Creation:

    Collections are created implicitly when you insert a document into a non-existent collection. No explicit creation command is needed, simplifying development.

    Example: Running db.users.insertOne({"name": "Bob"}) creates the users collection if it doesn’t already exist.


*   Indexing Support:

    Collections support indexes to optimize query performance. You can create indexes on fields (e.g., name, _id) to enable faster searches, sorting, or filtering.

    The _id field is automatically indexed as a unique identifier for each document.


*   Capped Collections:

    MongoDB supports a special type called capped collections, which have a fixed size and automatically overwrite old data when full. These are useful for logging or caching scenarios where only recent data is needed.
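
The overwrite-when-full behavior of a capped collection can be sketched with a bounded queue. This is only an analogy: the `CappedLog` class below is hypothetical and limits the number of documents, whereas a real capped collection is created explicitly and is bounded primarily by a size in bytes.

```python
from collections import deque


class CappedLog:
    """Toy stand-in for a capped collection, bounded by document count."""

    def __init__(self, max_docs):
        self._docs = deque(maxlen=max_docs)

    def insert_one(self, doc):
        # A full deque silently drops its oldest entry on append.
        self._docs.append(doc)

    def find(self):
        # Documents come back in insertion order.
        return list(self._docs)


log = CappedLog(max_docs=3)
for i in range(1, 6):
    log.insert_one({"event": i})

recent = log.find()
print(recent)
```

After five inserts into a log capped at three documents, only the three most recent events remain.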

Question 4.  How does MongoDB ensure high availability using replication?

Answer:

MongoDB ensures high availability through replication, a process that maintains multiple copies of data across different servers to prevent data loss and ensure continuous access, even during failures. This is primarily achieved using replica sets, MongoDB’s replication mechanism. Below is a concise explanation of how MongoDB uses replication to ensure high availability:

1. Replica Sets: The Core of Replication

    A replica set is a group of MongoDB servers (nodes) that maintain identical copies of data. Typically, it consists of:

    Primary Node: Handles all write operations and read operations (by default). Clients interact primarily with this node.

    Secondary Nodes: Replicate data from the primary node and can serve read operations (if configured). They maintain a synchronized copy of the data.

    Arbiter Node (optional): A lightweight node that doesn’t store data but participates in elections to help select a new primary during failover.


    A typical replica set has three or five members; an odd number of voting members helps elections reach a clear majority without excessive overhead.

2. Data Synchronization

    Oplog (Operation Log): The primary node records all write operations in a capped collection called the oplog. Secondary nodes continuously replicate this oplog and apply the operations to their own data copies, ensuring data consistency.

    Asynchronous Replication: By default, replication is asynchronous, meaning secondaries may lag slightly behind the primary. However, this lag is typically minimal (milliseconds) under normal conditions.

    Write Concern: MongoDB allows configuration of write concern to control how many nodes must acknowledge a write before it's considered successful. For example, setting w: "majority" ensures writes are acknowledged by a majority of nodes, enhancing durability.

3. Automatic Failover

    If the primary node fails (e.g., due to hardware issues or network partitions), the replica set triggers an automatic failover:

    Election Process: Surviving nodes communicate to elect a new primary from the secondaries. The election is based on factors like node priority, recency of data, and network reachability.

    Seamless Transition: Once a new primary is elected, clients redirect their operations to it, minimizing downtime. This process typically completes in seconds.

    Arbiter Role: If an arbiter is present, it votes in elections to break ties, ensuring a majority decision even in smaller replica sets.



4. High Availability Features

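    Redundancy: Multiple nodes keep copies of the data, so it survives the loss of individual members. A 3-node replica set tolerates the failure of one node, because the remaining two still form a majority and can elect a primary. If two nodes fail, the survivor cannot become primary, so the set accepts no writes (it can still serve reads as a secondary) until a majority is restored.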

    Data Durability: With appropriate write concern (e.g., w: "majority"), data is persisted across multiple nodes, reducing the risk of data loss.

    Read Availability: Clients can read from secondary nodes (if configured), distributing read load and maintaining availability during primary node maintenance or failure.

    Geographic Distribution: Replica sets can be spread across data centers (e.g., using MongoDB Atlas), ensuring availability during regional outages and reducing latency for global users.

5. Consistency vs. Availability Trade-Off

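    With its default settings, MongoDB favors consistency and partition tolerance (CP in CAP terms): all writes go through a single primary, and during a network partition only the side that can reach a majority of voting members keeps or elects a primary. Members on the minority side cannot accept writes, although they can still serve reads if the read preference allows it.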

    Eventual Consistency: Secondary nodes eventually catch up with the primary via oplog replication, ensuring data consistency over time.

    For applications needing stronger consistency, MongoDB supports read concern (e.g., majority) to ensure reads reflect writes acknowledged by a majority of nodes.

6. Practical Implementation

    Configuration: A replica set is initialized with a configuration specifying member nodes, their roles, and settings like priority or read preferences.

    Example: rs.initiate({"_id": "rs0", "members": [{"_id": 0, "host": "server1:27017"}, {"_id": 1, "host": "server2:27017"}, {"_id": 2, "host": "server3:27017"}]})

    MongoDB Atlas: The managed cloud service simplifies replica set setup, maintenance, and failover, with automated backups and global distribution options.

    Monitoring: Tools like MongoDB Compass, Atlas monitoring, or rs.status() provide insights into replica set health, lag, and election status.

7. Use Case Benefits

    Fault Tolerance: Replica sets ensure applications like e-commerce platforms or real-time analytics remain operational during server failures.

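    Maintenance With Minimal Downtime: Nodes can be upgraded one at a time, stepping down the primary last. The step-down triggers a short election during which writes pause briefly; drivers with retryable writes typically hide this from the application.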

    Load Balancing: Read-heavy applications (e.g., social media feeds) can distribute queries to secondaries, improving performance.

8. Limitations to Consider

    Replication Lag: Asynchronous replication may cause slight delays in secondary data, which can affect applications requiring immediate consistency.

    Resource Overhead: Maintaining multiple nodes increases storage and operational costs.

    Complex Setup: While Atlas simplifies management, manual replica set configuration requires careful planning for network, priority, and failover settings.

9. Example Scenario
    
    In an e-commerce application, a products collection is hosted in a 3-node replica set:

    The primary node in New York handles writes (e.g., updating stock levels).

    Two secondaries (one in London, one in Singapore) replicate the data, serving reads for local users and ensuring availability if the primary fails.
    
    If the New York node goes offline, the two remaining members still form a majority, so one of them (for example, London) becomes the new primary after an election, and the application continues with minimal disruption.
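
The replication flow described in this answer (oplog, majority acknowledgement, and election) can be sketched as a toy in-memory model. Everything here is a simplification under stated assumptions: the `Member` and `ReplicaSetSketch` classes are hypothetical, replication is applied immediately rather than asynchronously, and the election only considers reachability and how up to date a member is (real elections also weigh priority and other factors).

```python
class Member:
    def __init__(self, name):
        self.name = name
        self.data = {}      # this member's copy of the data
        self.applied = 0    # how many oplog entries this member has applied
        self.up = True


class ReplicaSetSketch:
    def __init__(self, names):
        self.members = [Member(n) for n in names]
        self.primary = self.members[0]
        self.oplog = []     # ordered (key, value) writes recorded by the primary

    def majority(self):
        return len(self.members) // 2 + 1

    def write(self, key, value):
        """Record a write, replicate it, and report whether a majority
        of members acknowledged it (similar in spirit to w: "majority")."""
        if self.primary is None or not self.primary.up:
            raise RuntimeError("no primary: writes are not accepted")
        self.oplog.append((key, value))
        acks = 0
        for m in self.members:
            if m.up:
                for k, v in self.oplog[m.applied:]:
                    m.data[k] = v
                m.applied = len(self.oplog)
                acks += 1
        return acks >= self.majority()

    def fail(self, name):
        for m in self.members:
            if m.name == name:
                m.up = False

    def elect(self):
        """Pick the most up-to-date reachable member, but only if a
        majority of members is reachable."""
        reachable = [m for m in self.members if m.up]
        if len(reachable) < self.majority():
            self.primary = None
        elif self.primary not in reachable:
            self.primary = max(reachable, key=lambda m: m.applied)
        return self.primary


rs = ReplicaSetSketch(["new-york", "london", "singapore"])
first_write_ok = rs.write("stock:sku-1", 10)

rs.fail("new-york")                 # the primary goes offline
new_primary = rs.elect()            # two of three members remain: a majority
second_write_ok = rs.write("stock:sku-1", 9)

rs.fail("london")                   # only one member left: no majority
primary_after_second_failure = rs.elect()
```

With one member down, the set elects a new primary and keeps accepting majority-acknowledged writes; with two members down, no primary can be elected and `write` would raise an error.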

Question 5. What are the main benefits of MongoDB Atlas?

Answer:

MongoDB Atlas is a fully managed cloud database service that simplifies the deployment, management, and scaling of MongoDB databases. It offers numerous benefits tailored to modern application needs, making it a popular choice for developers and businesses. Below is a concise overview of the main benefits of MongoDB Atlas:

1. Fully Managed Service

    Automation: Atlas handles infrastructure management tasks like provisioning, patching, backups, and monitoring, freeing developers to focus on application development.

    Ease of Use: Simplifies setup with an intuitive UI, CLI, or API, allowing users to deploy clusters in minutes without managing servers.

2. Scalability

    Horizontal Scaling: Supports sharding and replica sets for seamless scaling across multiple nodes to handle growing data and traffic.

    Elastic Scaling: Automatically scales compute and storage resources up or down based on demand, optimizing costs and performance.

    Global Clusters: Enables geographically distributed deployments for low-latency access and high availability across regions.

3. High Availability

    Replica Sets: Automatically configures replica sets with primary and secondary nodes, ensuring data redundancy and automatic failover for minimal downtime.

    Cross-Region Replication: Distributes replica set members across multiple regions, and optionally across cloud providers (AWS, Azure, Google Cloud), to withstand regional outages.

4. Security Features

    Built-In Security: Offers encryption at rest (using AWS KMS, Azure Key Vault, or Google Cloud KMS), TLS/SSL for data in transit, and fine-grained access control with role-based permissions.

    Authentication and Authorization: Supports integration with LDAP, AWS IAM, and other identity providers for secure access management.

    Network Isolation: Provides VPC peering, private endpoints, and IP access lists (allowlisting) to secure database access.

5. Global Deployment and Performance

    Multi-Cloud Support: Runs on AWS, Azure, and Google Cloud, allowing flexibility to choose or combine providers without vendor lock-in.

    Low-Latency Reads/Writes: Global clusters and read replicas enable data locality, reducing latency for users worldwide.

    Performance Optimization: Built-in tools like Performance Advisor and automated indexing suggestions enhance query efficiency.

6. Automated Backups and Recovery


    Continuous Backups: Provides automated, continuous backups with point-in-time recovery to restore data to a specific moment.

    On-Demand Snapshots: Allows manual snapshots for additional control, with minimal impact on performance.

    Disaster Recovery: Supports quick recovery from accidental deletions or outages, ensuring data durability.

7. Monitoring and Analytics


    Real-Time Monitoring: Offers detailed metrics on database performance, query execution, and resource usage through an intuitive dashboard.

    Alerts and Insights: Configurable alerts for issues like high latency or resource limits, with proactive recommendations via Performance Advisor.

    Integration with Analytics: Seamlessly integrates with tools like MongoDB Charts, BI Connector, or external platforms for data visualization and analysis.

8. Developer Productivity

    Native Tools: Includes tools like MongoDB Compass (GUI), MongoDB Shell, and drivers for popular languages (e.g., Node.js, Python), streamlining development.

    Serverless and Free Tier: Offers serverless instances and a free tier (M0) for prototyping or small-scale applications, reducing costs for developers.

    Data API: Provided HTTPS endpoints for reading and writing data without a driver. MongoDB has since announced the deprecation of this feature, so new projects should connect through an official driver instead.

9. Cost Efficiency


    Pay-as-You-Go Pricing: Charges based on usage, with options to scale resources dynamically to match workload demands.

    Free Tier and Low-Cost Options: The M0 free tier and affordable shared clusters make it accessible for startups and small projects.

    Resource Optimization: Auto-scaling and workload isolation prevent over-provisioning, balancing performance and cost.

10. Ecosystem and Community


    Integration with Modern Stacks: Works seamlessly with frameworks like React, Node.js, and cloud services like AWS Lambda or Kubernetes.

    Extensive Support: Backed by MongoDB’s documentation, community forums, and enterprise-grade support for critical applications.

    MongoDB University: Offers free training and certifications to help teams master Atlas and MongoDB.

11. Example Use Cases

    E-commerce: Scales to handle peak traffic (e.g., Black Friday) and supports global customers with low-latency access via global clusters.

    IoT Applications: Manages high-velocity, unstructured data from devices with flexible schemas and automated scaling.
  
    Real-Time Analytics: Powers dashboards and analytics with fast queries and integration with MongoDB Charts or external BI tools.
  
    Startups: Enables rapid prototyping with the free tier and serverless options, scaling as the business grows.

Question 6. What is the role of indexes in MongoDB, and how do they improve performance?

Answer:

In MongoDB, indexes are special data structures that optimize the speed of data retrieval operations on collections by reducing the amount of data scanned during queries. They play a critical role in improving query performance, especially for large datasets, by enabling efficient access to documents. Below is a concise explanation of the role of indexes in MongoDB and how they enhance performance.

Role of Indexes in MongoDB

*   Speed Up Query Execution:

    Indexes allow MongoDB to locate documents quickly without scanning every document in a collection (a process called a collection scan). This is similar to using an index in a book to find specific topics without reading every page.

    For example, querying a users collection for a specific email is faster with an index on the email field.


*   Support Efficient Sorting and Filtering:

    Indexes enable efficient sorting (e.g., sort({ age: 1 })) and filtering (e.g., find({ city: "New York" })) by organizing data in a way that minimizes computational overhead.

    They also support range queries, regular expressions, and geospatial queries.


*   Enforce Uniqueness:

    Indexes can enforce unique constraints on fields (e.g., email or username), preventing duplicate values in a collection.

    Example: db.users.createIndex({ "email": 1 }, { unique: true }) ensures no two documents have the same email.


*   Optimize Aggregation and Joins:

    Indexes improve the performance of aggregation pipelines and lookup operations by reducing the data processed during complex queries.


*   Support Specialized Queries:

    MongoDB supports various index types (e.g., text, geospatial, hashed) to optimize specific use cases like full-text search, location-based queries, or load balancing in sharded clusters.



How Indexes Improve Performance:

Indexes improve performance by reducing the time and resources required for queries. Here's how:

*   Reduced Data Scanning:

    Without an index, MongoDB performs a collection scan, examining every document in a collection, which is slow for large datasets (O(n) complexity).

    With an index, MongoDB uses the index's ordered structure (typically a B-tree or B+ tree) to locate documents in O(log n) time, significantly speeding up queries.


*   Efficient Data Access:

    Indexes store references to documents' locations, allowing MongoDB to fetch only the relevant documents. For example, an index on { age: 1 } organizes documents by age, enabling quick lookups for queries like db.users.find({ age: 25 }).


*   Minimized Resource Usage:

    By avoiding full collection scans, indexes reduce CPU, memory, and disk I/O usage, improving performance for high-traffic applications.


*   Support for Covered Queries:

    A covered query is fully resolved using an index (i.e., all queried fields are in the index, and no document data needs to be fetched). This minimizes disk access, boosting performance.

    Example: If an index exists on { name: 1, age: 1 }, a query like db.users.find({ name: "Alice" }, { age: 1, _id: 0 }) can be covered.
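
The difference between a collection scan and an index lookup can be sketched in plain Python. This is a conceptual model, not MongoDB's implementation: a sorted list searched with `bisect` stands in for the B-tree, and the `users` data is generated for illustration.

```python
import bisect

# 1,000 generated "documents" with ages between 18 and 77.
users = [{"_id": i, "age": (i * 7) % 60 + 18} for i in range(1000)]


def collection_scan(docs, age):
    """Examine every document: O(n)."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc["age"] == age:
            matches.append(doc["_id"])
    return matches, examined


# The "index": (age, position) pairs kept in sorted order.
age_index = sorted((doc["age"], pos) for pos, doc in enumerate(users))


def index_scan(docs, index, age):
    """Binary-search the sorted index, then fetch only matching documents."""
    lo = bisect.bisect_left(index, (age, -1))
    hi = bisect.bisect_right(index, (age, len(docs)))
    matches = [docs[pos]["_id"] for _, pos in index[lo:hi]]
    return matches, hi - lo


scan_matches, docs_examined = collection_scan(users, 25)
index_matches, keys_examined = index_scan(users, age_index, 25)

print(docs_examined, keys_examined)
```

Both approaches return the same documents, but the scan touches every document while the index lookup touches only the matching index entries. The trade-off is that the index must be maintained on every write.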

Question 7. Describe the stages of the MongoDB aggregation pipeline.

Answer:

The stages of the MongoDB aggregation pipeline are a sequence of data processing steps, where each stage transforms the documents and passes the results to the next stage. The most commonly used stages include:

*   $match: Filters the documents, passing only those that meet the specified condition to the next pipeline stage.

*   $group: Groups documents by a specified key and applies accumulator operators such as $sum, $avg, $min, or $max to each group.

*   $sort: Sorts documents according to specified fields in ascending or descending order.

*   $project: Specifies which fields to include or exclude in the resulting documents, or can create new fields by transforming existing data.

*   $set: Adds new fields or modifies existing fields in documents.

*   $unset: Removes specified fields from documents.

*   $unwind: Deconstructs an array field from the input documents to output one document per array element.

*   $limit: Limits the number of documents passed to the next stage.

*   $skip: Skips a specified number of documents before passing the remaining documents along.

*   $count: Counts the number of documents passing through the pipeline.

*   $sortByCount: Groups documents by a field and sorts them by the count of documents in each group, descending by default.

*   $lookup: Performs joins with other collections.

*   $facet: Processes multiple pipelines within a single stage on the same input, useful for multi-faceted aggregations.

Each stage is a document whose single key is a stage operator starting with $ (for example, { $match: { ... } }), and the pipeline is an ordered array of these stage documents. The stages run in the order they are listed, with the output of one stage becoming the input for the next.
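
To make the stage-by-stage flow concrete, here is a small pure-Python sketch that mimics `$match`, `$group`, `$sort`, and `$limit` on an in-memory list. The `orders` data and the helper functions are hypothetical; the `pipeline` list only shows how the same stages would be written for MongoDB and is not executed here.

```python
from collections import defaultdict

orders = [
    {"customer": "alice", "status": "shipped", "total": 30},
    {"customer": "bob", "status": "shipped", "total": 50},
    {"customer": "alice", "status": "pending", "total": 20},
    {"customer": "alice", "status": "shipped", "total": 45},
]

# The equivalent MongoDB pipeline, shown for reference only.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer", "spent": {"$sum": "$total"}}},
    {"$sort": {"spent": -1}},
    {"$limit": 1},
]


def match_stage(docs, condition):
    """Keep documents whose fields equal every value in the condition."""
    return [d for d in docs if all(d.get(k) == v for k, v in condition.items())]


def group_sum_stage(docs, key, field, out_name):
    """Group by one field and sum another, like $group with $sum."""
    totals = defaultdict(int)
    for d in docs:
        totals[d[key]] += d[field]
    return [{"_id": k, out_name: v} for k, v in totals.items()]


def sort_stage(docs, field, direction):
    """Sort ascending for 1 and descending for -1, like $sort."""
    return sorted(docs, key=lambda d: d[field], reverse=(direction == -1))


def limit_stage(docs, n):
    return docs[:n]


# Each stage consumes the previous stage's output.
stage1 = match_stage(orders, {"status": "shipped"})
stage2 = group_sum_stage(stage1, "customer", "total", "spent")
stage3 = sort_stage(stage2, "spent", -1)
result = limit_stage(stage3, 1)

print(result)
```

Placing the filtering stage first keeps later stages working on fewer documents, which is also why `$match` is usually put as early as possible in a real pipeline.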

Question 8.  What is sharding in MongoDB? How does it differ from replication?

Answer:

In MongoDB, sharding and replication are two distinct mechanisms designed to enhance scalability and availability, respectively. While both are critical for managing large-scale data and high-traffic applications, they serve different purposes. Below is a concise explanation of sharding, its role in MongoDB, and how it differs from replication.

*   What is Sharding in MongoDB?

    Sharding is a method for distributing data across multiple servers, called shards, to improve scalability and performance in MongoDB. It enables horizontal scaling by partitioning a large collection into ranges of shard key values, called chunks, and spreading those chunks across the shards. Each shard holds a subset of the data, allowing MongoDB to handle large datasets and high-throughput workloads efficiently.

Key Components of Sharding

*   Shard:

    A replica set that stores a portion of the collection's data. Current MongoDB versions require every shard to be deployed as a replica set.
    
    Example: A users collection might be split so one shard holds users with IDs 1-1000, another holds 1001-2000, etc.


*   Shard Key:

    A field (or combination of fields) used to determine how data is distributed across shards.
    
    Example: Choosing userId as a shard key splits documents based on userId values.
    
    Strategies: Ranged (divides data into contiguous ranges of shard key values, e.g., userId 1-1000) or Hashed (distributes documents by a hash of the shard key for more even distribution). Zones can additionally pin specific shard key ranges to specific shards.


*   Config Servers:

    Store metadata about the sharded cluster, including the shard key ranges and shard locations.

    Run as a replica set in production to ensure reliability.


*   Mongos (Query Router):

    Acts as the interface between clients and the sharded cluster, routing queries to the appropriate shards based on the shard key.

    Clients interact with mongos, not individual shards.


*   Chunks:

    Subsets of data within a shard, defined by ranges of the shard key.

    MongoDB automatically balances chunks across shards to prevent any single shard from becoming a bottleneck.

*   How Sharding Works:

    A collection is sharded by selecting a shard key, which determines how documents are distributed.

    MongoDB splits the data into chunks based on the shard key and distributes them across shards.

    The mongos router directs queries to the relevant shards, aggregates results if needed, and returns them to the client.

    Balancer: A background process redistributes chunks to maintain even data distribution across shards as data grows or nodes are added/removed.

*   Benefits of Sharding:

    Scalability: Handles large datasets and high write/read throughput by distributing data across multiple servers.

    Performance: Parallelizes queries across shards, reducing response times for large-scale applications.

    Storage Capacity: Increases storage by adding more shards, unlike vertical scaling, which is limited by single-server capacity.

    Use Cases: Ideal for applications with massive data growth, such as social media platforms, e-commerce systems, or IoT data storage.

*   Example

    For a products collection with millions of documents:

    Shard key: productId (hashed for even distribution).

    Data is split into chunks (e.g., 64 MB each, the default before MongoDB 6.0; newer versions default to 128 MB) and distributed across 3 shards.

    A query like db.products.find({ productId: "ABC123" }) is routed by mongos to the shard containing the relevant chunk.
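
The routing idea in this example can be sketched in a few lines of Python. This is a deliberate simplification: MongoDB uses its own hash function and assigns ranges of hashed values to chunks, whereas this sketch uses MD5 and a modulo over a fixed shard list. The shard names and product IDs are made up.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]


def shard_for(product_id):
    """Route a shard key value to a shard by hashing it (toy version)."""
    digest = hashlib.md5(product_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# Each shard stores only the documents routed to it.
cluster = {name: {} for name in SHARDS}
for n in range(3000):
    pid = f"P{n:05d}"
    cluster[shard_for(pid)][pid] = {"productId": pid}


def find_product(pid):
    """Like a query that includes the shard key: only one shard is consulted."""
    return cluster[shard_for(pid)].get(pid)


distribution = {name: len(docs) for name, docs in cluster.items()}
found = find_product("P00042")

print(distribution)
```

Because the query includes the shard key, the router can target a single shard. A query without the shard key would have to be sent to every shard and the results merged.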


How does sharding differ from replication:

*   Purpose:

    Sharding is for horizontal scaling and increasing capacity, while replication is for high availability and data redundancy.

*   Data Distribution:

    Sharding splits different data across multiple servers, while replication copies the same data across multiple servers.

*   Performance Goal:

    Sharding is designed to increase read/write throughput and storage capacity, while replication is designed to provide failover and read scalability.

*   Data on Each Server:

    Sharding means each shard contains only a subset of the total data, while replication means each replica contains the complete dataset.

*   Write Operations:

    Sharding distributes writes across multiple shards based on the shard key, while replication sends all writes to the primary node only.

*   Read Operations:

    Sharding routes reads to the appropriate shard(s) containing the data, while replication allows reads from primary or secondary nodes.

*   Failure Handling:

    Sharding means if one shard fails, only that portion of data is unavailable, while replication means if the primary fails, a secondary is automatically elected as the new primary.

*   Complexity:

    Sharding requires careful shard key selection and is more complex to manage, while replication is simpler to configure and maintain.

*   Scalability Type:

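    Sharding provides horizontal scaling of storage and write capacity by adding more shards, while replication adds no write or storage capacity (every member holds the full dataset); it only allows reads to be spread across members.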

*   Primary Use Case:

    Sharding is used when your dataset or workload exceeds a single server's capacity, while replication is used when you need continuous availability and data protection.
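
The core contrast, copying the same data versus splitting different data, fits in a short sketch. The functions below are hypothetical helpers operating on plain lists, and the modulo split stands in for a real shard key strategy.

```python
docs = [{"_id": i} for i in range(6)]


def replicate(documents, n_members):
    """Replication: every member holds a full copy of the data."""
    return [list(documents) for _ in range(n_members)]


def shard(documents, n_shards):
    """Sharding: each shard holds a disjoint subset of the data."""
    shards = [[] for _ in range(n_shards)]
    for doc in documents:
        shards[doc["_id"] % n_shards].append(doc)
    return shards


replicas = replicate(docs, 3)
shards = shard(docs, 3)

# In production the two are combined: each shard is itself a replica set.
sharded_cluster = [replicate(s, 3) for s in shards]

print([len(r) for r in replicas])
print([len(s) for s in shards])
```

Three replicas store 18 document copies in total for 6 documents, while three shards store 6; the combined cluster stores each shard's subset three times.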


Question 9. What is PyMongo, and why is it used?

Answer:

*   PyMongo is the official Python driver and toolkit for working with MongoDB databases from Python applications.

*   It provides the necessary modules and functions to connect, query, manipulate, and manage MongoDB data directly from Python code. PyMongo allows developers to perform database operations like CRUD (create, read, update, delete), aggregation, indexing, and specialized data operations through a simple and Pythonic API.

*   PyMongo is used to:

    * Establish connections to MongoDB servers or clusters.

    * Execute queries, insert and update documents, and manage collections or databases.

    * Perform aggregation operations and utilize advanced MongoDB capabilities (e.g., GridFS, monitoring, encryption).

    * Enable application development that leverages MongoDB's NoSQL architecture from Python, a widely used language for data science and backend programming.

*   In summary, PyMongo is essential for any Python project requiring direct, efficient, and flexible access to MongoDB databases, and is recommended by MongoDB for native Python integration.
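
A minimal sketch of the CRUD operations listed above follows. It assumes the `pymongo` package is installed and that a MongoDB server is reachable at the given URI; the URI, the `demo_db` database, and the `users` collection are placeholder names chosen for illustration. The import is done inside the function so that defining it does not require the driver.

```python
def crud_demo(uri="mongodb://localhost:27017"):
    """Insert, update, read, and delete one document, returning what was read."""
    # Imported lazily so this module loads even where pymongo is absent.
    from pymongo import MongoClient

    # Fail fast (2 seconds) if no server is reachable at the URI.
    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        users = client["demo_db"]["users"]  # placeholder database and collection

        users.insert_one({"name": "Alice", "age": 25})               # create
        users.update_one({"name": "Alice"}, {"$set": {"age": 26}})   # update
        doc = users.find_one({"name": "Alice"}, {"_id": 0})          # read
        users.delete_one({"name": "Alice"})                          # delete
        return doc
    finally:
        client.close()
```

Calling `crud_demo()` against a running server should return the updated document without its `_id` field. Documents and filters are ordinary Python dictionaries, which is what makes the API feel natural from Python.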

Question 10. What are the ACID properties in the context of MongoDB transactions?

Answer:

In the context of databases, ACID properties (Atomicity, Consistency, Isolation, Durability) are a set of principles that ensure reliable and predictable transaction processing. MongoDB, traditionally a NoSQL database optimized for flexibility and scalability, introduced support for multi-document ACID transactions starting with version 4.0 (for replica sets) and version 4.2 (for sharded clusters). These transactions allow MongoDB to provide the same reliability guarantees as traditional relational databases for specific use cases. Below is a concise explanation of the ACID properties in the context of MongoDB transactions.

ACID Properties in MongoDB Transactions

*   Atomicity:

    * Definition: Ensures that all operations within a transaction are completed successfully as a single, indivisible unit. If any operation fails, the entire transaction is rolled back, and no changes are applied to the database.

    * In MongoDB: A transaction involving multiple document updates (e.g., updating orders and inventory collections) is treated as a single unit. If any operation fails (e.g., due to a network error or constraint violation), MongoDB rolls back all changes made in the transaction.

    * Example: Transferring funds between two accounts:

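          // Node.js driver; `client` is a connected MongoClient, "bank" is a placeholder database name.
          const session = client.startSession();
          const accounts = client.db("bank").collection("accounts");
          await session.withTransaction(async () => {
            await accounts.updateOne({ _id: "account1" }, { $inc: { balance: -100 } }, { session });
            await accounts.updateOne({ _id: "account2" }, { $inc: { balance: 100 } }, { session });
          });
          await session.endSession();

      If the second update fails, the transaction aborts and the first update is undone, ensuring no partial changes.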


*   Consistency:

    * Definition: Guarantees that a transaction brings the database from one valid state to another, maintaining all predefined rules, constraints, and data integrity.

    * In MongoDB: Transactions ensure that the database remains consistent with its schema (if any), indexes, and unique constraints. For example, a transaction respects unique indexes and ensures data integrity across multiple documents.

    * Example: If a unique index exists on the email field in a users collection, a transaction attempting to insert duplicate emails will fail, preserving consistency.
    
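    * Outside transactions, reads from the primary reflect the latest acknowledged writes; stale (eventually consistent) reads arise mainly when reading from secondaries. Within a transaction, appropriate read and write concerns (e.g., majority) ensure that it reads and commits majority-acknowledged data.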


*   Isolation:

    * Definition: Ensures that transactions are executed in isolation from one another, preventing partial changes from being visible to other operations until the transaction is complete.

    * In MongoDB: MongoDB uses snapshot isolation for transactions. Each transaction operates on a consistent snapshot of the data, ensuring that reads and writes are isolated from concurrent operations. Other transactions or queries cannot see uncommitted changes.

    * Example: If one transaction updates a user's balance, another transaction cannot read the intermediate state until the first transaction commits.

    * Implementation: MongoDB's snapshot isolation prevents issues like dirty reads or non-repeatable reads, though long-running transactions may encounter conflicts in high-concurrency scenarios.


*   Durability:

    * Definition: Guarantees that once a transaction is committed, its changes are permanently saved to the database, even in the event of a system failure.

    * In MongoDB: Transactions are durable when committed with a write concern of majority, ensuring changes are replicated to a majority of nodes in a replica set. This guarantees that committed changes persist even if a server crashes.

    * Example: After committing a transaction that updates inventory, the changes are written to the primary node’s journal and replicated to secondaries (with w: "majority"), ensuring they survive a crash.
    
    * MongoDB Atlas enhances durability with automated backups and cross-region replication.
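
The all-or-nothing and snapshot behavior described above can be imitated in a few lines of plain Python. This is a toy model under stated assumptions, not MongoDB's implementation: the `transfer` function, the `InsufficientFunds` error, and the in-memory `accounts` dictionary are hypothetical names made up for illustration.

```python
import copy


class InsufficientFunds(Exception):
    """Raised when a transfer would leave the source account negative."""


def transfer(accounts, source, target, amount):
    """Apply both balance changes to a private copy, then publish them together."""
    working = copy.deepcopy(accounts)    # private snapshot; readers of `accounts` never see it
    working[source]["balance"] -= amount
    if working[source]["balance"] < 0:
        raise InsufficientFunds(source)  # abort: the copy is discarded, nothing is applied
    working[target]["balance"] += amount
    accounts.clear()
    accounts.update(working)             # "commit": both changes become visible together


accounts = {"account1": {"balance": 150}, "account2": {"balance": 20}}
transfer(accounts, "account1", "account2", 100)

try:
    transfer(accounts, "account1", "account2", 100)  # only 50 left, so this aborts
except InsufficientFunds:
    pass

print(accounts)
```

After the failed second call the balances are the same as after the first transfer, which is the outcome a rolled-back transaction gives.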

Question 11. What is the purpose of MongoDB's explain() function?

Answer:

The MongoDB explain() function is a diagnostic tool used to analyze and understand how MongoDB executes a query or operation. It provides detailed information about the query execution plan, including how MongoDB processes the query, which indexes (if any) are used, and the performance characteristics of the operation. This helps developers optimize queries, identify performance bottlenecks, and improve database efficiency.

Purpose of the explain() Function

The primary purposes of explain() are:

*   Query Performance Analysis:

    * Reveals how MongoDB retrieves data, whether it uses an index, performs a collection scan, or requires multiple steps.
    
    * Helps identify slow queries that may need optimization, such as adding indexes or restructuring queries.


*   Index Usage Verification:

    * Shows whether an index is used for a query and, if so, which one. This ensures queries are leveraging indexes to avoid inefficient full collection scans.

    * Example: Confirm if a query on db.users.find({ age: 25 }) uses an index on the age field.


*   Execution Statistics:

    * Provides metrics like the number of documents scanned, returned, or modified, as well as execution time, helping to quantify query efficiency.

    * Useful for comparing different query strategies or index configurations.


*   Troubleshooting and Debugging:

    * Helps diagnose issues like why a query is slow, why it returns unexpected results, or why it consumes excessive resources.

    * Identifies issues like missing indexes or suboptimal shard key usage in sharded clusters.


*   Optimization Guidance:

    * Informs decisions about creating or modifying indexes, rewriting queries, or adjusting data models to improve performance.

    * Works with MongoDB Atlas's Performance Advisor, which uses explain()-like insights to recommend optimizations.



*   How explain() Works:

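    The explain() method can be chained onto a find() cursor (e.g., db.users.find(...).explain()) or called on the collection before the operation (e.g., db.collection.explain().aggregate(...) or db.collection.explain().update(...)).

    In the default "queryPlanner" mode, it returns a document describing the query plan without executing the query; the "executionStats" and "allPlansExecution" modes run the query and add runtime statistics.
    
    Syntax: db.collection.find(<query>).explain(<mode>) or db.collection.explain(<mode>).<operation>(...)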

    Example: db.users.find({ age: 25 }).explain("executionStats")
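
One way to read an explain() result programmatically is to walk the winning plan and pull out a few counters. The sketch below assumes the classic output shape (`queryPlanner.winningPlan` with nested `inputStage` entries, plus `executionStats`); the exact layout varies between server versions. The `sample` document is hand-written for illustration, not real server output, and `summarize_explain` is a hypothetical helper name.

```python
def summarize_explain(explain_doc):
    """Collect the stage names in the winning plan and a few execution counters."""
    stages = []
    node = explain_doc["queryPlanner"]["winningPlan"]
    while node is not None:
        stages.append(node["stage"])
        node = node.get("inputStage")  # follow the single-child chain, e.g. FETCH -> IXSCAN
    stats = explain_doc.get("executionStats", {})
    return {
        "stages": stages,
        "uses_index": "IXSCAN" in stages,
        "returned": stats.get("nReturned"),
        "docs_examined": stats.get("totalDocsExamined"),
        "keys_examined": stats.get("totalKeysExamined"),
    }


# Hand-written sample, trimmed to the fields used above (an assumption, not server output).
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "age_1"},
        }
    },
    "executionStats": {"nReturned": 12, "totalKeysExamined": 12, "totalDocsExamined": 12},
}

summary = summarize_explain(sample)
print(summary)
```

A plan whose only stage is COLLSCAN, or one where the number of documents examined is far larger than the number returned, is usually the signal to look at indexes.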

Question 12. How does MongoDB handle schema validation?

Answer:

MongoDB handles schema validation by allowing developers to define explicit validation rules on collections to ensure that inserted and updated documents conform to a specified structure. While MongoDB's default mode is schemaless, schema validation enables enforcement of field requirements, data types, value ranges, and custom expressions to maintain data consistency and integrity within a collection.

Mechanism of Schema Validation:

*   Schema validation rules are typically specified using the $jsonSchema operator either when creating a new collection or by modifying an existing collection using the collMod command.

*   The schema can enforce required fields, allowed data types, minimum/maximum values, array constraints, and complex validation expressions using JSON Schema draft specifications.

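*   The validation level can be strict (the default: rules are checked on every insert and update) or moderate (rules are checked on inserts and on updates to documents that already satisfy the rules). Documents that existed before the rules were added are not checked until they are modified.

*   Validation Control Options

    * validationLevel: Determines which writes are checked. strict (default) checks all inserts and updates; moderate skips updates to existing documents that are already invalid.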

    * validationAction: Specifies whether to reject documents that fail validation (error, default), or to allow but log such violations (warn).

    * Example:

      When creating a collection with validation, the command might look like:

          db.createCollection("posts", {
            validator: {
              $jsonSchema: {
                bsonType: "object",
                required: [ "title", "body" ],
                properties: {
                  title: { bsonType: "string" },
                  body: { bsonType: "string" },
                  likes: { bsonType: "int", minimum: 0 }
                }
              }
            }
          })

      This ensures every inserted or updated document in the "posts" collection has "title" and "body" as strings and that "likes", when present, is a non-negative integer.

*   By utilizing schema validation, MongoDB supports enforcement of structure and quality within otherwise flexible collections, providing important control over data integrity in NoSQL applications.
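
To make the effect of the example rules concrete, here is a plain-Python check that mirrors them for the "posts" documents. It is only a sketch: the server evaluates `$jsonSchema` itself, the helper name `violations` is hypothetical, and Python's `int` is used as a stand-in for the BSON "int" type without modeling its 32-bit range.

```python
def violations(doc):
    """Return the reasons a document would break the example "posts" rules."""
    problems = []
    for field in ("title", "body"):
        if field not in doc:
            problems.append(f"missing required field: {field}")
        elif not isinstance(doc[field], str):
            problems.append(f"{field} must be a string")
    if "likes" in doc:  # optional field, checked only when present
        likes = doc["likes"]
        # bool is a subclass of int in Python, so exclude it explicitly
        if not isinstance(likes, int) or isinstance(likes, bool):
            problems.append("likes must be an int")
        elif likes < 0:
            problems.append("likes must be >= 0")
    return problems


print(violations({"title": "Hello", "body": "First post", "likes": 3}))
print(violations({"title": "Hello"}))
print(violations({"title": "Hello", "body": "text", "likes": -1}))
```

With validationAction set to error, a document with a non-empty list of problems would be rejected; with warn, it would be stored and the violation logged.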

Question 13. What is the difference between a primary and a secondary node in a replica set?

Answer:

In MongoDB, a replica set is a group of servers (nodes) that maintain identical copies of data to ensure high availability and fault tolerance. The nodes in a replica set are categorized primarily as primary and secondary nodes, each with distinct roles. Below is a concise explanation of the differences between primary and secondary nodes in a MongoDB replica set.


*    Primary node processes all write operations from clients, while secondary nodes do not process writes but replicate the primary's oplog to maintain a synchronized data copy.

*    Primary node maintains the authoritative copy of the data, while secondary nodes hold a copy that may have slight replication lag due to asynchronous replication.

*    Primary node records all changes in the oplog (operation log), a special capped collection, while secondary nodes apply these oplog entries to stay synchronized with the primary.

*    Primary node handles read operations by default for strong consistency, ensuring clients see the latest data, while secondary nodes can serve read operations only if configured with read preferences like 'secondary' or 'secondaryPreferred', potentially providing eventual consistency.

*   Primary node serves as the default client interaction point for both reads and writes, while secondary nodes only serve reads when explicitly targeted via read preference settings.

*   Primary node can become unavailable, triggering a replica set election, while secondary nodes can become the new primary during an election if they have up-to-date data and sufficient priority.

*   Primary node processes critical transactions, such as updating account balances in a banking application, while secondary nodes might serve read requests, like a user's transaction history, to reduce load on the primary.

*   Primary node typically requires higher resources to handle write workloads, while secondary nodes offload read-heavy workloads, such as analytics or reporting, to improve performance.

*   Example Context

    In a banking application:

    Primary node processes a transaction to update account balances, while secondary nodes replicate this change from the primary’s oplog.
    
    Primary node serves a read request for the latest balance to ensure strong consistency, while secondary nodes might serve a read for transaction history, potentially with slight replication lag.
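
The write path and the replication lag described above can be sketched with a toy model in Python. Everything here is an illustrative assumption (the `Node` class, the list used as an oplog, the `write` helper); real replica sets add elections, acknowledgements, and networking that this sketch leaves out.

```python
class Node:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of oplog entries applied so far

    def apply(self, oplog):
        """Replay any oplog entries this node has not applied yet."""
        for key, value in oplog[self.applied:]:
            self.data[key] = value
        self.applied = len(oplog)


oplog = []  # ordered log of writes, recorded by the primary
primary, secondary = Node(), Node()


def write(key, value):
    """All writes go to the primary, which records them in the oplog."""
    oplog.append((key, value))
    primary.apply(oplog)


write("balance", 100)
secondary.apply(oplog)             # the secondary catches up
write("balance", 40)               # not replicated yet

stale = secondary.data["balance"]  # read from the lagging secondary
fresh = primary.data["balance"]    # read from the primary
secondary.apply(oplog)             # replication catches up
caught_up = secondary.data["balance"]
print(stale, fresh, caught_up)
```

The gap between `stale` and `fresh` is the reason reads that must see the latest balance go to the primary, while reads that tolerate a short delay can be sent to a secondary.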


Question 14.  What security mechanisms does MongoDB provide for data protection?

Answer:

MongoDB provides multiple robust security mechanisms to protect data throughout its lifecycle, ensuring confidentiality, integrity, and availability.

Key Security Mechanisms in MongoDB for Data Protection

*   Authentication:

    Validates identity of users or applications accessing the database. Supported methods include SCRAM, X.509 certificate authentication, LDAP proxy, and Kerberos, enabling integration with existing identity management systems.

*   Authorization and Role-Based Access Control (RBAC):

    Controls user permissions based on roles assigned, restricting access to database resources and operations to only authorized users. Fine-grained access can be set to cater to different roles within an organization.

*   Encryption:
    
    MongoDB can encrypt data at all stages:

    * Data at rest:

      AES-256 encryption in the WiredTiger storage engine (available in MongoDB Enterprise and Atlas) secures data files on disk.

    * Data in transit:

      TLS/SSL encrypts data traveling between clients and servers.

    * Data in use:

      Client-Side Field Level Encryption (CSFLE) allows encrypting specific fields before sending to MongoDB, protecting sensitive data from internal or external threats.

    * Queryable Encryption lets clients query encrypted data without decrypting it, enhancing security without losing query functionality.

*   Auditing and Monitoring:

    MongoDB Enterprise enables detailed logging and auditing of database activities, including authentication, authorization, and data operations, facilitating security forensics and compliance.

*   Network Security:

    Features include firewall integration, private networking, IP whitelisting, and network isolation to protect clusters from unauthorized network access.

*   Data Resiliency and Availability:

    Replication with automatic failover ensures high availability and data durability, while encrypted backups and point-in-time recovery enhance data protection.

* MongoDB Atlas, the managed cloud service, further integrates these security features with automated compliance, governance controls, and continuous monitoring to enable secure and compliant data management.

*   Together, these mechanisms create a defense-in-depth approach, effectively safeguarding MongoDB data in diverse deployment environments and meeting stringent regulatory compliance requirements.

Question 15. Explain the concept of embedded documents and when they should be used?

Answer:

Embedded documents in MongoDB are documents contained within other documents, allowing for a nested or hierarchical data structure. This design embeds related data directly inside a parent document, rather than using references to link separate documents.

Concept of Embedded Documents:

*   An embedded document is stored as a value in a field of a parent document.

*   They support nesting up to 100 levels deep, with a maximum document size of 16MB.

*   Embedded documents can contain various data types, including other embedded documents and arrays, enabling complex data models.

When to Use Embedded Documents:

*   Related Data Access:

    When related data is frequently accessed or updated together, embedding reduces the need for multiple queries or joins, improving performance.

*   Schema Flexibility:

    Embedding allows for flexible, denormalized schemas suitable for hierarchical data structures or one-to-many relationships, such as orders with multiple items or user profiles with addresses.

*   Atomic Operations:

    When updates or deletions need to be atomic across related data, embedding ensures these operations can be performed on a single document.

*   Practical Examples:

    A Passenger document may embed an address document for quick retrieval.

    A Car document might embed an engine sub-document describing engine specifications.

    A Customer document could include an array of orders, each containing embedded order details.

In short, embedded documents should be used when data entities are tightly related, frequently accessed together, or when atomic updates for related data are needed, optimizing read performance and simplifying data management.
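
As a small illustration, this is how the Customer example might look as a Python dictionary, which is the form PyMongo accepts for documents. The field names and values are made up for the sketch, and the `depth` helper is a hypothetical way of counting nesting levels that may not match how the server counts them.

```python
customer = {
    "_id": "C-1001",  # hypothetical identifier
    "name": "Ada",
    "address": {"city": "Memphis", "state": "Tennessee"},  # embedded document
    "orders": [  # array of embedded documents
        {"order_id": "O-1", "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
        {"order_id": "O-2", "items": [{"sku": "C", "qty": 5}]},
    ],
}


def depth(value):
    """Nesting depth, counting each dict or list as one level."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0


# Everything needed is already inside the one document: no second query, no join.
city = customer["address"]["city"]
total_qty = sum(item["qty"] for order in customer["orders"] for item in order["items"])
print(city, total_qty, depth(customer))
```

Because the address and orders live inside the customer document, a single read returns all of them, and a single-document update can change them atomically.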

Question 16.  What is the purpose of MongoDB's $lookup stage in aggregation?

Answer:

The purpose of MongoDB's $lookup stage in the aggregation pipeline is to perform a left outer join between two collections in the same database. It allows documents from the input (local) collection to be enriched by joining related documents from a foreign (lookup) collection based on matching fields.

Key Points about $lookup:

*   It joins documents from two collections using a specified local field and foreign field for matching.

*   The joined data from the foreign collection is added as an array in a new field within each input document.

*   Even if no matching documents are found, the input document is included with an empty array for the joined field.

*   This stage enables relational-like join queries in the flexible, schema-less environment of MongoDB.

    Common use cases include combining data for reporting, analytics, and denormalizing related information for easier access.

    Syntax:

        {
          $lookup: {
            from: "foreignCollection",
            localField: "localField",
            foreignField: "foreignField",
            as: "outputArrayField"
          }
        }

* Example:
    
  If there is an orders collection with a customer_id field, and a customers collection where _id corresponds to customer IDs, $lookup can embed matching customer documents inside each order:



      db.orders.aggregate([
        {
          $lookup: {
            from: "customers",
            localField: "customer_id",
            foreignField: "_id",
            as: "customer_details"
          }
        }
      ])

    This will add the array field customer_details containing customer info to each order document.

*   In short, $lookup is used to combine and enrich data from multiple collections with a single aggregation query, enabling more powerful and expressive queries in MongoDB.
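
The left-outer-join behavior can be reproduced on plain Python lists to see what the output looks like, including the empty array for an order with no matching customer. This is a sketch of the semantics only: the `lookup` function and the sample documents are hypothetical, and the server performs the join itself rather than in client code.

```python
def lookup(local_docs, foreign_docs, local_field, foreign_field, as_field):
    """Attach to each local document the list of foreign documents whose key matches."""
    joined = []
    for doc in local_docs:
        matches = [f for f in foreign_docs if f.get(foreign_field) == doc.get(local_field)]
        joined.append({**doc, as_field: matches})  # no match -> empty list, the doc is kept
    return joined


orders = [
    {"_id": 1, "customer_id": "c1", "total": 20.5},
    {"_id": 2, "customer_id": "c9", "total": 7.0},  # no such customer
]
customers = [{"_id": "c1", "name": "Robert"}]

result = lookup(orders, customers, "customer_id", "_id", "customer_details")
for order in result:
    print(order)
```

Both orders appear in the result; only the first carries a non-empty `customer_details` array.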

Question 17. What are some common use cases for MongoDB?

Answer:

Some common use cases for MongoDB include:

*   Content Management Systems (CMS):

    MongoDB's flexible document model is ideal for storing varied content types such as text, images, videos, and metadata. It enables platforms to handle complex and evolving content schemas efficiently in real time.

*   E-commerce Platforms:

    MongoDB supports dynamic product catalogs, customer profiles, and transaction histories. Its horizontal scalability helps manage unpredictable traffic spikes and real-time analytics for customer behavior and inventory management. Companies like eBay use it to handle massive volumes of product listings and user data.

*   Real-Time Analytics:

    MongoDB's aggregation framework and scalability support fast data aggregation and insights from large, continuously changing datasets. It is used by businesses to monitor customer behavior, optimize operations, and serve personalized experiences.

*   Internet of Things (IoT):

    IoT generates huge volumes of semi-structured data from sensors and devices. MongoDB handles this variety and scale effectively, allowing real-time data capture, trend analysis, and predictive insights for connected systems such as smart homes and industrial IoT.

*   Gaming Applications:

    MongoDB effectively stores dynamic player profiles, game state, achievements, and real-time interaction data. Gaming companies like Electronic Arts use MongoDB to support fast data updates and personalized gaming experiences.

*   Customer Relationship Management (CRM):

    MongoDB allows flexible storage of customer profiles, interactions, and transaction history. It supports evolving data models as business requirements change, used by companies like LinkedIn for managing large-scale user data.

*   Social Networks:

    The database manages large volumes of unstructured user-generated content such as posts, comments, relationships, and multimedia. MongoDB enables rapid scaling and real-time workflows needed for social platforms' dynamic feeds and interactions.

*   Financial Services and Modernization:

    MongoDB powers scalable, real-time financial platforms, credit card processing, and digital channels modernization efforts, supporting millions of daily transactions with high availability and performance.

Question 18. What are the advantages of using MongoDB for horizontal scaling?

Answer:

The advantages of using MongoDB for horizontal scaling are primarily achieved through its sharding capability, which distributes data and workload across multiple servers. Key benefits include:

*   Increased Capacity:

    Horizontal scaling enables MongoDB to handle very large datasets by partitioning data across many servers (shards), preventing any single server from becoming a bottleneck and allowing continuous growth.

*   Improved Performance:

    By distributing read and write operations across shards, horizontal scaling enhances overall query throughput and response times, making it suitable for high-traffic applications.

*   High Availability and Fault Tolerance:

    If one shard fails, others continue operating, reducing downtime and providing resilience through built-in replication within each shard.

*   Cost Efficiency:

    Instead of investing heavily in more powerful hardware, horizontal scaling allows using multiple commodity servers, which can be more budget-friendly as the data or traffic grows.

*   Near-Continuous Availability:

    Sharded clusters can be maintained with minimal downtime, supporting applications that require high availability.

*   Flexibility for Large-Scale Applications:

    Horizontal scaling supports diverse use cases like social media, e-commerce, IoT, and analytics that require rapid, unpredictable data growth handling.

*   Load Balancing:

    MongoDB automatically balances data and workload among shards to optimize resource usage and prevent any shard from being overloaded.
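
The idea of spreading documents across shards by shard key can be sketched with a stable hash. This is a deliberate simplification: MongoDB's hashed sharding assigns ranges of hashed values to chunks and a balancer moves chunks between shards, whereas the hypothetical `shard_for` function below just takes a hash modulo the shard count.

```python
import hashlib
from collections import Counter


def shard_for(shard_key_value, shard_count):
    """Map a shard key value to a shard index using a stable hash."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


customer_ids = [f"customer-{n}" for n in range(1000)]  # made-up shard key values
placement = Counter(shard_for(cid, 3) for cid in customer_ids)
print(sorted(placement.items()))
```

Each shard ends up with a share of the keys, so both storage and write load are divided among the servers instead of concentrating on one.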

Question 19.  How do MongoDB transactions differ from SQL transactions?

Answer:

MongoDB and SQL databases both support transactions for data integrity, but their approaches differ due to MongoDB's NoSQL document model and SQL's relational structure. MongoDB introduced multi-document ACID transactions in v4.0 (replica sets) and v4.2 (sharded clusters), while SQL transactions are core to relational databases. Below is a concise comparison.

Key Differences

*   Scope:

    * MongoDB: Operations are atomic at the single-document level by default; multi-document transactions are opt-in.

    * SQL: Multi-row, multi-table transactions are standard.


*   Atomicity:

    * MongoDB: Single-document operations are atomic; multi-document transactions ensure all operations succeed or fail.

    * SQL: All operations within a transaction are atomic, covering rows and tables.


*   Consistency:

    * MongoDB: Uses snapshot isolation with readConcern: "snapshot" for strong consistency; non-transactional operations may be eventually consistent.

    * SQL: Enforces strict consistency, often using locking or MVCC.


*   Isolation:

    * MongoDB: Snapshot isolation prevents concurrent changes from being visible.
    
    * SQL: Offers multiple isolation levels (e.g., Read Committed, Serializable) for flexibility.


*   Durability:

    * MongoDB: Requires writeConcern: "majority" for durability.
    
    * SQL: Commits are durable by default, written to disk.


*   Performance:

    * MongoDB: Transactions are resource-heavy, best for specific use cases.

    * SQL: Optimized for relational workloads, performance varies by isolation level.


*   Data Model:

    * MongoDB: Flexible, schema-less documents reduce join complexity.

    * SQL: Normalized tables require joins and strict schemas.


*   Implementation:

    * MongoDB: Requires explicit sessions (startSession()).
    
    * SQL: Uses simple BEGIN, COMMIT, ROLLBACK.
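
For the SQL side of this comparison, a rollback can be shown with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for a relational database; the table layout and the CHECK constraint are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("account1", 50), ("account2", 20)])
conn.commit()

try:
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 'account2'")
        # The next statement would make account1 negative and violates the CHECK constraint.
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE id = 'account1'")
except sqlite3.IntegrityError:
    pass

balances = dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
print(balances)  # {'account1': 50, 'account2': 20}
```

The credit to account2 is undone together with the failed debit. No session object is involved, which is the contrast with MongoDB's explicit startSession() noted above.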

Question 20. What are the main differences between capped collections and regular collections?

Answer:

In MongoDB, capped collections and regular collections are two types of collections used to store documents, but they serve different purposes and have distinct characteristics. Capped collections are designed for specific use cases requiring fixed-size, high-performance storage with automatic data expiration, while regular collections offer greater flexibility for general-purpose use. Below is a concise comparison of the main differences between capped collections and regular collections in MongoDB.

*   Capped collections have a fixed size with automatic overwriting of old documents, while regular collections grow dynamically without size limits.


*   Capped collections evict the oldest documents when full, while regular collections retain all documents until explicitly deleted.

*   Capped collections restrict updates to maintain document size and prohibit deletions, while regular collections allow unrestricted CRUD operations.


*   Capped collections optimize for high-throughput writes and sequential reads, while regular collections support diverse access patterns with potential fragmentation.


*   Capped collections have limited indexing flexibility due to their design, while regular collections support extensive indexing for varied queries.

*   Capped collections are ideal for transient data like logs or caching, while regular collections suit persistent storage for applications like e-commerce.


*   Capped collections retain schema-less nature but limit operational flexibility, while regular collections offer full schema and operational flexibility.


*   Capped collections cannot be sharded, while regular collections can be distributed across shards.

*   Example:

    Capped collection: db.createCollection("logs", { capped: true, size: 1048576 }); – Overwrites old logs when full.

    Regular collection: db.createCollection("users"); – Grows dynamically, supports full CRUD.
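
The overwrite-oldest behavior is the same idea as a bounded queue. The sketch below uses Python's `collections.deque` as an analogy only: a capped collection is limited by total size in bytes (with an optional maximum document count), whereas this deque is limited by entry count.

```python
from collections import deque

logs = deque(maxlen=3)  # keeps at most 3 entries, like a very small capped collection
for n in range(1, 6):
    logs.append(f"event {n}")  # appending to a full deque drops the oldest entry

print(list(logs))  # ['event 3', 'event 4', 'event 5']
```

Insertion order is preserved and old entries disappear without any explicit delete, which is why this pattern suits logs and other transient data.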

Question 21. What is the purpose of the $match stage in MongoDB's aggregation pipeline?

Answer:

The $match stage in MongoDB's aggregation pipeline filters documents in a collection to include only those that meet specified criteria, passing them to the next stage. It is similar to the find() method but operates within the aggregation framework, enabling efficient data processing by reducing the dataset early in the pipeline. Below is a concise explanation of its purpose, functionality, and use cases.

Purpose of the $match Stage:

*  Filter Documents:
    
    $match selects documents based on conditions, such as equality, range, or logical operators, narrowing down the dataset for subsequent pipeline stages.

    Example: Filter users aged 25 or older:
    
    { $match: { age: { $gte: 25 } } }



*   Optimize Pipeline Performance:

    By placing this stage early in the pipeline, it reduces the number of documents processed by later stages (e.g., $group, $sort), improving efficiency.

    Example: Filtering before grouping saves computational resources.


*   Enable Complex Queries:

    Supports MongoDB's query operators (e.g., $eq, $gt, $in, $regex) to express complex conditions, making it versatile for analytics and reporting.

    Example: Match documents with specific categories:
    
    { $match: { category: { $in: ["electronics", "books"] } } }



*   Prepare Data for Aggregation:

    Acts as the first step in data transformation, ensuring only relevant documents are processed for tasks like grouping, joining, or sorting.

*   In short, $match serves as the primary mechanism to narrow down datasets in aggregation pipelines, enabling focused, performant downstream data processing and transformation.
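
To show what the two $match examples above select, here is a small Python evaluator for a handful of query operators. It is a sketch of the filtering semantics, not how the server works: `matches`, `match_stage`, and the sample documents are hypothetical, and only `$eq`, `$gt`, `$gte`, and `$in` are handled.

```python
OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    "$gt": lambda value, arg: value is not None and value > arg,
    "$gte": lambda value, arg: value is not None and value >= arg,
    "$in": lambda value, arg: value in arg,
}


def matches(doc, criteria):
    """True when the document satisfies every field condition (implicit AND)."""
    for field, condition in criteria.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            if not all(OPERATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:  # a bare value means equality
            return False
    return True


def match_stage(docs, criteria):
    """Keep only the documents that satisfy the criteria, like a $match stage."""
    return [doc for doc in docs if matches(doc, criteria)]


docs = [
    {"name": "Ann", "age": 31, "category": "books"},
    {"name": "Bo", "age": 19, "category": "electronics"},
    {"name": "Cy", "age": 25, "category": "toys"},
]
adults = match_stage(docs, {"age": {"$gte": 25}})
shoppers = match_stage(docs, {"category": {"$in": ["electronics", "books"]}})
print([d["name"] for d in adults], [d["name"] for d in shoppers])
```

Later stages such as $group or $sort would then receive only the filtered documents, which is where the performance benefit of an early $match comes from.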

Question 22. How can you secure access to a MongoDB database?

Answer:

To secure access to a MongoDB database, several key mechanisms and best practices should be implemented:

*   Authentication:

    * Use MongoDB's built-in authentication mechanisms to verify the identity of users and applications. The default method is SCRAM (Salted Challenge Response Authentication Mechanism), which uses SHA-256 hashing (SCRAM-SHA-256).

    * Other supported authentication methods include X.509 certificate authentication, Kerberos, LDAP proxy, and OpenID Connect for integration with enterprise identity providers.

    * Create separate user accounts for different roles and entities instead of shared accounts, easing access management and auditing.

*   Authorization and Role-Based Access Control (RBAC):

    * Define granular user roles with specific privileges to minimize the risk of excessive permissions.

    * Assign roles based on the principle of least privilege, allowing users access only to the databases and operations required for their tasks.

*   Network Security:

    * Use firewalls, IP whitelisting, and Virtual Private Network (VPN) or private networking to restrict network access to the MongoDB server.

    * Enable TLS/SSL to encrypt data in transit between clients and the database server.

*   Encryption:

    * Enable encryption at rest using AES-256 encryption in the WiredTiger storage engine (available in MongoDB Enterprise and Atlas).

    * Use client-side field-level encryption (CSFLE) for encrypting sensitive fields before data is sent to the server for added security.

*   Auditing and Monitoring:

    * Enable auditing features to log authentication attempts, authorization changes, and data modifications.

    * Use monitoring tools to detect unusual access patterns and potential security incidents.

*   Configuration Best Practices:

    * Enable authorization in the MongoDB configuration file (security.authorization: enabled) so that access control is enforced.

    * Regularly update MongoDB and its components to the latest stable releases to patch security vulnerabilities.

    * Use strong passwords and enforce password policies.
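
On the client side, several of these points come together in the connection string. The sketch below builds one with percent-escaped credentials and TLS switched on; the user name, password, and host are placeholders, and `build_mongo_uri` is a hypothetical helper. In practice the credentials would come from environment variables or a secrets manager rather than source code.

```python
from urllib.parse import quote_plus, urlencode


def build_mongo_uri(username, password, host, port=27017, auth_source="admin"):
    """Build a connection string with escaped credentials and TLS turned on."""
    options = urlencode({"authSource": auth_source, "tls": "true"})
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}/?{options}"


# Placeholder values for illustration only.
uri = build_mongo_uri("app_user", "p@ss:w/rd", "db.example.com")
print(uri)
```

Escaping matters because characters such as `@`, `:`, and `/` in a password would otherwise be read as part of the URI structure.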

Question 23. What is MongoDB’s WiredTiger storage engine, and why is it important?

Answer:

MongoDB's WiredTiger storage engine is the default and most widely used storage engine since MongoDB version 3.2. It is important for several reasons related to performance, scalability, and data management:

*   What is WiredTiger?

    * WiredTiger is a modern, high-performance storage engine that uses a document-level concurrency model, meaning multiple write operations can occur simultaneously on different documents within the same collection, improving throughput.

    * It combines advanced concepts from B-Tree and Log-Structured Merge (LSM) tree storage engines, providing efficient indexing and access to data.

    * WiredTiger uses multiversion concurrency control (MVCC), enabling snapshot isolation for readers without blocking writers.

    * It employs write-ahead logging (WAL) and checkpointing to ensure data durability and crash recovery.

    * Supports data compression (e.g., Snappy, zlib) to reduce storage space and I/O bandwidth usage.

*   Why is WiredTiger Important?

    Improved Concurrency:
    
    * Document-level locking reduces contention, allowing multiple clients to read and write concurrently, which significantly enhances performance on multi-core systems.

    Better Compression:
    
    * Compression reduces disk space usage and enhances I/O efficiency, enabling more data to be stored and accessed with less resource usage.

    Durability and Reliability:
    
    * Write-ahead logging and checkpoint mechanisms ensure that data is safely written to disk and recoverable after crashes.

    Optimized for Modern Hardware:
    
    * WiredTiger leverages modern operating system page caches and manages both internal cache and filesystem cache effectively.

    Scalability:
    
    * Supports large datasets and high-throughput workloads, making it suitable for many enterprise applications.
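
The storage savings from compression are easy to see on repetitive document data. The sketch below compresses a JSON-encoded batch of made-up documents with Python's `zlib` module. It only illustrates the general effect: WiredTiger compresses on-disk blocks of BSON data, not JSON text, and the ratio on real collections will differ.

```python
import json
import zlib

# Made-up documents with repeated field names and values, as is typical for a collection.
docs = [{"Region": "West", "Category": "Furniture", "Quantity": n % 5} for n in range(500)]
raw = json.dumps(docs).encode("utf-8")
packed = zlib.compress(raw)

print(len(raw), len(packed))
assert zlib.decompress(packed) == raw  # compression is lossless
```

Repeated field names are a large part of why document data compresses well, since every document stores its own keys.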

# Practical Questions

Question 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB

Answer:


In [2]:
!pip install pymongo



In [3]:
!pip install python-dotenv



**Please note that I am using MongoDB Atlas, since Google Colab couldn't access the MongoDB server on my local machine.**

In [32]:
import pandas as pd
from pymongo import MongoClient
import os
from dotenv import load_dotenv

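# Load environment variables from a .env file in the current directory.
# load_dotenv() expects a local file path, not a URL, so upload the .env file
# to the working directory before running this cell.
load_dotenv()


# Get MongoDB URI from environment variable (no quotes needed in .env)
mongo_uri = os.getenv('MONGODB_URI')
if not mongo_uri:
    raise RuntimeError("MONGODB_URI is not set; check the .env file")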

# Connect to MongoDB Atlas
client = MongoClient(mongo_uri)

# Access database and collection
db = client['superstore_db']
collection = db['Orders']

# Example CSV file URL
csv_file_path = 'https://drive.google.com/uc?export=download&id=1bJ-X2ONfnE5YbsNe2bCK39IfoBHexYQO'

# Read CSV with proper encoding
df = pd.read_csv(csv_file_path, encoding='latin1')

# Convert date columns
df['Order Date'] = pd.to_datetime(df['Order Date'], errors='coerce')
df['Ship Date'] = pd.to_datetime(df['Ship Date'], errors='coerce')

# Convert DataFrame to list of dicts
records = df.to_dict(orient='records')

# Insert records into MongoDB collection
result = collection.insert_many(records)

print(f"Inserted {len(result.inserted_ids)} records into MongoDB collection 'Orders'.")


Inserted 9994 records into MongoDB collection 'Orders'.


Question 2. Retrieve and print all documents from the Orders collection

Answer:

In [33]:
# Retrieve and print all documents from orders collection

# Accessing database and collection
db = client['superstore_db']
collection = db['Orders']

for document in collection.find():
    print(document)

Streaming output truncated to the last 5000 lines.
{'_id': ObjectId('68e606399817aaf8632f71a1'), 'Row ID': 4995, 'Order ID': 'CA-2015-153038', 'Order Date': datetime.datetime(2015, 12, 18, 0, 0), 'Ship Date': datetime.datetime(2015, 12, 25, 0, 0), 'Ship Mode': 'Standard Class', 'Customer ID': 'RB-19645', 'Customer Name': 'Robert Barroso', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Memphis', 'State': 'Tennessee', 'Postal Code': 38109, 'Region': 'South', 'Product ID': 'FUR-FU-10000221', 'Category': 'Furniture', 'Sub-Category': 'Furnishings', 'Product Name': 'Master Caster Door Stop, Brown', 'Sales': 20.32, 'Quantity': 5, 'Discount': 0.2, 'Profit': 3.556}
{'_id': ObjectId('68e606399817aaf8632f71a2'), 'Row ID': 4996, 'Order ID': 'CA-2014-132227', 'Order Date': datetime.datetime(2014, 11, 4, 0, 0), 'Ship Date': datetime.datetime(2014, 11, 10, 0, 0), 'Ship Mode': 'Standard Class', 'Customer ID': 'SZ-20035', 'Customer Name': 'Sam Zeldin', 'Segment': 'Home Offic

Question 3. Count and display the total number of documents in the Orders collection.

Answer:

In [34]:
# Count all documents
total_docs = collection.count_documents({})

# Display the count
print(f"Total number of documents in the collection 'Orders': {total_docs}")

Total number of documents in the collection 'Orders': 9994


Question 4. Write a query to fetch all orders from the "West" region.

Answer:

In [35]:
# Query to fetch all orders where Region is "West"
query = {"Region": "West"}

# Count documents matching the query directly
count = collection.count_documents(query)

print(f"Number of orders where region is west : {count}")
print("---------------------------------------------")

# Execute the query
west_orders = collection.find(query)

# Print the results
for order in west_orders:
    print(order)

Number of orders where region is west : 3203
---------------------------------------------
{'_id': ObjectId('68e606399817aaf8632f5e21'), 'Row ID': 3, 'Order ID': 'CA-2016-138688', 'Order Date': datetime.datetime(2016, 6, 12, 0, 0), 'Ship Date': datetime.datetime(2016, 6, 16, 0, 0), 'Ship Mode': 'Second Class', 'Customer ID': 'DV-13045', 'Customer Name': 'Darrin Van Huff', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Los Angeles', 'State': 'California', 'Postal Code': 90036, 'Region': 'West', 'Product ID': 'OFF-LA-10000240', 'Category': 'Office Supplies', 'Sub-Category': 'Labels', 'Product Name': 'Self-Adhesive Address Labels for Typewriters by Universal', 'Sales': 14.62, 'Quantity': 2, 'Discount': 0.0, 'Profit': 6.8714}
{'_id': ObjectId('68e606399817aaf8632f5e24'), 'Row ID': 6, 'Order ID': 'CA-2014-115812', 'Order Date': datetime.datetime(2014, 6, 9, 0, 0), 'Ship Date': datetime.datetime(2014, 6, 14, 0, 0), 'Ship Mode': 'Standard Class', 'Customer ID': 'BH-11710', 'Cust

Question 5. Write a query to find orders where Sales is greater than 500.

Answer:

In [36]:
query = {"Sales": {"$gt": 500}}

# Count documents matching the query directly
count = collection.count_documents(query)

print(f"Number of orders with Sales > 500: {count}")
print("---------------------------------------------")

# Fetch and print the matching documents
high_sales_orders = collection.find(query)
for order in high_sales_orders:
    print(order)


Number of orders with Sales > 500: 1162
---------------------------------------------
{'_id': ObjectId('68e606399817aaf8632f5e20'), 'Row ID': 2, 'Order ID': 'CA-2016-152156', 'Order Date': datetime.datetime(2016, 11, 8, 0, 0), 'Ship Date': datetime.datetime(2016, 11, 11, 0, 0), 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub-Category': 'Chairs', 'Product Name': 'Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back', 'Sales': 731.94, 'Quantity': 3, 'Discount': 0.0, 'Profit': 219.582}
{'_id': ObjectId('68e606399817aaf8632f5e22'), 'Row ID': 4, 'Order ID': 'US-2015-108966', 'Order Date': datetime.datetime(2015, 10, 11, 0, 0), 'Ship Date': datetime.datetime(2015, 10, 18, 0, 0), 'Ship Mode': 'Standard Class', 'Customer ID': 'SO-20335', 'Customer Name':
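
The filters in Questions 4 and 5 use two forms: plain equality (`{"Region": "West"}`) and an operator document (`{"Sales": {"$gt": 500}}`). The sketch below is an in-memory illustration of how such a filter can be read, not how MongoDB evaluates it. The function `matches`, the operator table, and the sample rows are assumptions for illustration, and only four comparison operators are covered.

```python
import operator

# Comparison operators covered by this sketch (MongoDB supports many more).
_OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        value = document.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"Sales": {"$gt": 500}}
            for op, operand in condition.items():
                if value is None or not _OPERATORS[op](value, operand):
                    return False
        elif value != condition:
            # Plain equality, e.g. {"Region": "West"}
            return False
    return True


# Illustrative sample rows, not the full dataset.
orders = [
    {"Region": "West", "Sales": 14.62},
    {"Region": "South", "Sales": 731.94},
    {"Region": "West", "Sales": 907.152},
]

west = [o for o in orders if matches(o, {"Region": "West"})]
high = [o for o in orders if matches(o, {"Sales": {"$gt": 500}})]
print(len(west), len(high))
```

Listing several fields in one filter combines the conditions with a logical AND, so `{"Region": "West", "Sales": {"$gt": 500}}` would select only West orders with Sales above 500.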

Question 6. Fetch the top 3 orders with the highest Profit.

Answer:

In [37]:
# Sort by Profit in descending order, then keep only the top 3
top_3_profit_orders = collection.find().sort("Profit", -1).limit(3)

# Print the top 3 orders
for order in top_3_profit_orders:
    print(order)

{'_id': ObjectId('68e606399817aaf8632f78c9'), 'Row ID': 6827, 'Order ID': 'CA-2016-118689', 'Order Date': datetime.datetime(2016, 10, 2, 0, 0), 'Ship Date': datetime.datetime(2016, 10, 9, 0, 0), 'Ship Mode': 'Standard Class', 'Customer ID': 'TC-20980', 'Customer Name': 'Tamara Chand', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Lafayette', 'State': 'Indiana', 'Postal Code': 47905, 'Region': 'Central', 'Product ID': 'TEC-CO-10004722', 'Category': 'Technology', 'Sub-Category': 'Copiers', 'Product Name': 'Canon imageCLASS 2200 Advanced Copier', 'Sales': 17499.95, 'Quantity': 5, 'Discount': 0.0, 'Profit': 8399.976}
{'_id': ObjectId('68e606399817aaf8632f7df8'), 'Row ID': 8154, 'Order ID': 'CA-2017-140151', 'Order Date': datetime.datetime(2017, 3, 23, 0, 0), 'Ship Date': datetime.datetime(2017, 3, 25, 0, 0), 'Ship Mode': 'First Class', 'Customer ID': 'RB-19360', 'Customer Name': 'Raymond Buch', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Seattle', 'State': 'W
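
Sorting in descending order and limiting to three is a "top N" selection. The sketch below shows the same idea on an in-memory list, assuming a hypothetical helper `top_n_by` and a handful of Profit values taken from the outputs in this notebook. If two documents share the same Profit, MongoDB does not guarantee their relative order; passing a second sort key, for example `.sort([("Profit", -1), ("_id", 1)])`, makes the result deterministic.

```python
import heapq


def top_n_by(documents, field, n=3):
    """Return the n documents with the largest value of `field`, largest first."""
    return heapq.nlargest(n, documents, key=lambda doc: doc[field])


# A few rows with Profit values seen elsewhere in this notebook.
sample = [
    {"Row ID": 4995, "Profit": 3.556},
    {"Row ID": 3, "Profit": 6.8714},
    {"Row ID": 2, "Profit": 219.582},
    {"Row ID": 6827, "Profit": 8399.976},
]

for order in top_n_by(sample, "Profit", n=3):
    print(order)
```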

Question 7. Update all orders with Ship Mode as "First Class" to "Premium Class."

Answer:

In [38]:
# Filter and update documents

result = collection.update_many({"Ship Mode": "First Class"}, {"$set": {"Ship Mode": "Premium Class"}})

print(f"Documents matched: {result.matched_count}")
print(f"Documents modified: {result.modified_count}")

Documents matched: 1538
Documents modified: 1538
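
The two counts are equal here because every matched document actually changed. Running the same cell again would report 0 for both, since no document has "First Class" any more. The sketch below illustrates the difference between matched and modified on an in-memory list. The function `update_many_in_memory` is a hypothetical stand-in for illustration, not the PyMongo API.

```python
def update_many_in_memory(documents, filter_, new_values):
    """Apply new_values to every document equal to filter_ on all its fields.

    Returns (matched, modified): a document is matched if it satisfies the
    filter, and modified only if at least one value actually changed.
    """
    matched = modified = 0
    for doc in documents:
        if all(doc.get(key) == value for key, value in filter_.items()):
            matched += 1
            if any(doc.get(key) != value for key, value in new_values.items()):
                doc.update(new_values)
                modified += 1
    return matched, modified


orders = [
    {"Ship Mode": "First Class"},
    {"Ship Mode": "Same Day"},
    {"Ship Mode": "First Class"},
]

first_run = update_many_in_memory(
    orders, {"Ship Mode": "First Class"}, {"Ship Mode": "Premium Class"}
)
second_run = update_many_in_memory(
    orders, {"Ship Mode": "First Class"}, {"Ship Mode": "Premium Class"}
)
print(first_run)   # (2, 2)
print(second_run)  # (0, 0)
```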


Question 8. Delete all orders where Sales is less than 50.

Answer:

In [39]:
# Delete all documents with Sales less than 50
delete_query = {"Sales": {"$lt": 50}}

result = collection.delete_many(delete_query)

print(f"Deleted {result.deleted_count} documents with Sales less than 50.")

Deleted 4849 documents with Sales less than 50.
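
This delete removes almost half of the collection and cannot be undone, so it can be worth counting the matches before deleting. The sketch below is one possible guard, assuming a hypothetical helper `delete_if_expected` and a caller-chosen limit `max_expected`. It relies only on `count_documents()` and `delete_many()`, which are the same calls used in this notebook.

```python
def delete_if_expected(collection, query, max_expected):
    """Delete matching documents only if the match count is within max_expected.

    `collection` is any object exposing count_documents() and delete_many(),
    such as a PyMongo collection. Returns the number of deleted documents.
    """
    to_delete = collection.count_documents(query)
    if to_delete > max_expected:
        raise ValueError(
            f"Refusing to delete {to_delete} documents (limit {max_expected})."
        )
    return collection.delete_many(query).deleted_count


# Usage in the notebook (assumes `collection` from the earlier cells):
# deleted = delete_if_expected(collection, {"Sales": {"$lt": 50}}, max_expected=5000)
```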


Question 9. Use aggregation to group orders by Region and calculate total sales per region.

Answer:

Note: The regional totals below exclude the 4849 documents with Sales below 50 that were deleted in Question 8, so they are lower than the totals for the full dataset.

In [40]:
# Aggregation pipeline to group by Region and sum Sales
pipeline = [
    {
        "$group": {
            "_id": "$Region",
            "total_sales": {"$sum": "$Sales"}
        }
    }
]

# Execute aggregation
results = collection.aggregate(pipeline)

# Print total sales per region
for result in results:
    print(f"Region: {result['_id']}, Total Sales: {result['total_sales']}")

Region: Central, Total Sales: 479611.8458
Region: West, Total Sales: 694686.6195
Region: East, Total Sales: 651137.705
Region: South, Total Sales: 376023.312
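
A `$group` stage does not guarantee any output order, and the summed values carry floating-point noise. The sketch below extends the pipeline with a rounding step and a sort, assuming MongoDB 4.2 or later for the `$round` operator. The name `region_sales_pipeline` is an assumption for illustration.

```python
# Group by Region, round each total to 2 decimals, then sort largest first.
region_sales_pipeline = [
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}},
    {"$project": {"total_sales": {"$round": ["$total_sales", 2]}}},
    {"$sort": {"total_sales": -1}},
]

# Usage in the notebook (assumes `collection` from the earlier cells):
# for row in collection.aggregate(region_sales_pipeline):
#     print(f"Region: {row['_id']}, Total Sales: {row['total_sales']}")
```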


Question 10. Fetch all distinct values for Ship Mode from the collection.

Answer:

Note: "First Class" does not appear in the result because Question 7 renamed it to "Premium Class".

In [41]:
# Get distinct Ship Mode values
distinct_ship_modes = collection.distinct('Ship Mode')

# Print distinct values
print("Distinct Ship Mode values:")
for mode in distinct_ship_modes:
    print(mode)

Distinct Ship Mode values:
Premium Class
Same Day
Second Class
Standard Class


Question 11. Count the number of orders for each category.

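Answer:

Note: These counts cover only the 5145 documents that remain after the deletion in Question 8 (9994 - 4849 = 5145).
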
In [42]:
# Aggregation pipeline to group by Category and count orders
pipeline = [
    {
        "$group": {
            "_id": "$Category",
            "total_orders": {"$sum": 1}
        }
    }
]

# Execute the aggregation
results = collection.aggregate(pipeline)

# Print the total orders per category
for result in results:
    print(f"Category: {result['_id']}, Total Orders: {result['total_orders']}")

Category: Furniture, Total Orders: 1573
Category: Technology, Total Orders: 1496
Category: Office Supplies, Total Orders: 2076
