# Theoretical Questions
### 1. What are the key differences between SQL and NoSQL databases?
    -> SQL and NoSQL databases differ primarily in their structure, flexibility, and scalability. SQL databases are relational and use a fixed schema, organizing data into tables with rows and columns. They rely on Structured Query Language (SQL) for defining and manipulating data, making them suitable for complex queries and transactions where data integrity is essential. In contrast, NoSQL databases are non-relational and support a variety of data models, such as document, key-value, column-family, or graph. They offer a dynamic schema, allowing for flexible and rapid changes to data structures. This makes NoSQL ideal for handling large volumes of unstructured or semi-structured data. Additionally, SQL databases typically scale vertically by upgrading server resources, while NoSQL databases are designed for horizontal scaling across multiple servers, making them more suitable for distributed and high-traffic applications.

### 2. What makes MongoDB a good choice for modern applications?
    -> MongoDB is well-suited for modern applications because it provides a flexible, schema-less document model that allows developers to store and query data in a natural, JSON-like format, making it easy to handle evolving or unstructured data. Its horizontal scalability through built-in sharding enables it to efficiently manage large volumes of data and high traffic loads. MongoDB also supports rich querying and indexing capabilities, including full-text search and geospatial queries, which help build complex, performant applications. Additionally, its strong ecosystem, easy integration with popular programming languages, and support for real-time analytics and distributed data make it ideal for agile development, rapid iteration, and cloud-native architectures common in modern software projects.

### 3. Explain the concept of collections in MongoDB?
    -> In MongoDB, a collection is a grouping of related documents, similar to a table in relational databases. Unlike tables, collections in MongoDB do not enforce a fixed schema, so documents within the same collection can have different fields and structures. Collections serve as containers for documents, which are JSON-like objects that store data. This flexible design allows for easy storage of varied and evolving data formats. Collections are created implicitly when you insert the first document, and they help organize data logically within a database, enabling efficient querying, indexing, and management of related information.
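The flexible-schema point can be illustrated with a minimal sketch. It assumes `collection` is a PyMongo `Collection` object; the function name and the sample field values are hypothetical and not part of the Superstore dataset.

```python
def insert_mixed_documents(collection):
    """Insert two documents with different fields into the same collection."""
    docs = [
        {"name": "Alice", "email": "alice@example.com"},
        {"name": "Bob", "tags": ["vip", "wholesale"], "address": {"city": "Seattle"}},
    ]
    # insert_many accepts documents with differing fields because a
    # collection does not enforce a fixed schema unless validation is configured.
    return collection.insert_many(docs)
```

With a real connection this would be called as `insert_mixed_documents(db["Customers"])`, where `Customers` is a hypothetical collection name; the collection is created implicitly on the first insert.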

### 4. How does MongoDB ensure high availability using replication?
    -> MongoDB ensures high availability through a feature called replication, which involves maintaining multiple copies of data across different servers in a cluster known as a replica set. A replica set typically consists of one primary node that handles all write operations and multiple secondary nodes that replicate data from the primary asynchronously. If the primary node fails or becomes unavailable, the replica set automatically holds an election among the secondary nodes to select a new primary, ensuring the database remains operational without manual intervention. This automatic failover mechanism, combined with data redundancy across multiple nodes, helps MongoDB provide continuous uptime, fault tolerance, and protection against data loss, making it highly reliable for modern applications.

### 5. What are the main benefits of MongoDB Atlas?
    -> MongoDB Atlas offers several key benefits, including fully managed cloud hosting that eliminates the need for manual database setup and maintenance, allowing developers to focus on building applications. It provides automated backups, monitoring, and security features like encryption and access controls out of the box. Atlas also supports easy scalability with just a few clicks, enabling seamless handling of growing data and traffic demands. Additionally, it offers global distribution with multi-region clusters for low-latency access and disaster recovery. Integrated tools for performance optimization and real-time analytics further enhance application reliability and efficiency, making Atlas a powerful and convenient solution for deploying MongoDB in the cloud.

### 6. What is the role of indexes in MongoDB, and how do they improve performance?
    -> Indexes in MongoDB play a crucial role in speeding up query performance by allowing the database to quickly locate and access the data without scanning every document in a collection. They work like indexes in a book, providing a fast lookup mechanism for specific fields or combinations of fields. By creating indexes on frequently queried fields, MongoDB can efficiently filter and sort results, significantly reducing query execution time. Without indexes, queries require a full collection scan, which becomes slow and resource-intensive as data grows. Additionally, MongoDB supports various types of indexes, such as single-field, compound, text, and geospatial indexes, enabling optimized performance for diverse query patterns and application needs.
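As a small sketch of how an index might be created from Python, the helper below builds a compound index. It assumes `collection` is a PyMongo `Collection` holding the Superstore fields `Region` and `Sales`; the function name and the index name are hypothetical.

```python
def create_region_sales_index(collection):
    """Create a compound index on Region (ascending) and Sales (descending)."""
    # 1 means ascending and -1 means descending; these are the values of
    # pymongo.ASCENDING and pymongo.DESCENDING.
    return collection.create_index(
        [("Region", 1), ("Sales", -1)],
        name="region_sales_idx",
    )
```

A compound index like this one can serve queries that filter on `Region` and sort on `Sales`; whether it is actually used for a given query should be checked with `explain()`.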

### 7. Describe the stages of the MongoDB aggregation pipeline.
    -> The MongoDB aggregation pipeline processes data through a series of **stages**. Each stage transforms the documents it receives and passes its output to the next stage. The main stages include:

    1. $match — Filters documents to pass only those that meet specific criteria, similar to a query’s WHERE clause.
    2. $group — Groups documents by a specified key and performs aggregations like sum, average, or count on grouped data.
    3. $project — Reshapes each document by including, excluding, or adding new fields, essentially controlling which data fields appear in the output.
    4. $sort — Orders documents based on one or more fields.
    5. $limit — Restricts the number of documents passed to the next stage.
    6. $skip — Skips a specified number of documents before passing the rest along.

    There are many other stages like $unwind (to deconstruct arrays), $lookup (to perform joins), and $addFields (to add new fields), all allowing flexible and powerful data transformation and analysis within the database. The pipeline processes documents sequentially through these stages to produce aggregated results efficiently.
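The six stages listed above can be combined into one pipeline. The sketch below only builds the pipeline as plain Python data; the function name, its parameters, and the output field names `region`, `total`, and `avg` are hypothetical, while `Region` and `Sales` are fields of the Superstore data.

```python
def build_top_regions_pipeline(min_sales, skip, limit):
    """Build a pipeline that totals Sales per Region for sufficiently large orders."""
    return [
        # Keep only orders whose Sales value is at least min_sales.
        {"$match": {"Sales": {"$gte": min_sales}}},
        # Group by Region and compute the sum and the average of Sales.
        {"$group": {"_id": "$Region",
                    "total": {"$sum": "$Sales"},
                    "avg": {"$avg": "$Sales"}}},
        # Rename _id to region and keep the two computed fields.
        {"$project": {"_id": 0, "region": "$_id", "total": 1, "avg": 1}},
        # Largest totals first.
        {"$sort": {"total": -1}},
        # Paginate over the sorted groups.
        {"$skip": skip},
        {"$limit": limit},
    ]
```

The result would be passed to `collection.aggregate(build_top_regions_pipeline(500, 0, 3))`. Placing `$match` first is a deliberate choice: it reduces the number of documents that every later stage has to process.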

### 8. What is sharding in MongoDB? How does it differ from replication?
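    -> **Sharding** in MongoDB is a method of **horizontal scaling** where data is distributed across multiple servers or shards, allowing the database to handle large datasets and high throughput by splitting the data into smaller, more manageable pieces. Each shard holds a subset of the data, and a special component called the **mongos** router directs queries to the appropriate shard(s) based on the shard key. This enables MongoDB to scale out by adding more servers as data grows. **Replication**, by contrast, keeps full copies of the same data on several nodes of a replica set to provide redundancy and automatic failover; it improves availability rather than storage or write capacity. The two are complementary: in production, each shard is typically deployed as its own replica set.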
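A rough sketch of how a collection might be sharded from Python follows. It assumes `client` is a PyMongo `MongoClient` connected to a `mongos` router of an existing sharded cluster; the function name is hypothetical, and the choice of a hashed shard key is an assumption made only for illustration.

```python
def shard_collection(client, namespace, shard_key_field):
    """Enable sharding for a database and shard one collection on a hashed key.

    namespace is "<database>.<collection>", for example "SuperstoreDB.SalesData".
    """
    db_name = namespace.split(".", 1)[0]
    # Both commands are admin commands and must be sent to a mongos router.
    client.admin.command("enableSharding", db_name)
    return client.admin.command(
        "shardCollection",
        namespace,
        key={shard_key_field: "hashed"},
    )
```

A hashed key spreads writes evenly across shards but makes range queries on that field touch every shard, so the right shard key depends on the query pattern.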

### 9. What is PyMongo, and why is it used?
    -> PyMongo is the official Python driver for MongoDB, providing a way for Python applications to interact with MongoDB databases. It allows developers to connect to a MongoDB server, perform database operations like inserting, querying, updating, and deleting documents, as well as managing indexes and running aggregation pipelines—all using Python code. PyMongo is widely used because it offers a simple and efficient interface to work with MongoDB’s flexible document model, making it easy to integrate MongoDB into Python-based projects such as web applications, data analysis, and automation scripts.

### 10. What are the ACID properties in the context of MongoDB transactions?
    -> MongoDB supports ACID (Atomicity, Consistency, Isolation, Durability) properties for multi-document transactions, starting with version 4.0 on replica sets and version 4.2 on sharded clusters. Writes to a single document have always been atomic. Within a transaction, all operations either commit together or roll back together (Atomicity), the data stays consistent with the defined rules (Consistency), uncommitted changes are not visible to other operations (Isolation), and committed changes persist even after a crash (Durability).
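A minimal sketch of a multi-document transaction with PyMongo is shown below. It assumes `client` is a `MongoClient` connected to a replica set or sharded cluster (transactions are not available on a standalone server) and that `collection` belongs to that client; the function name and the idea of moving `Quantity` between two documents are hypothetical.

```python
def transfer_quantity(client, collection, from_id, to_id, amount):
    """Move `amount` units of Quantity between two documents atomically."""
    with client.start_session() as session:
        # The transaction commits when the block exits normally and is
        # aborted if an exception escapes the block.
        with session.start_transaction():
            collection.update_one(
                {"_id": from_id}, {"$inc": {"Quantity": -amount}}, session=session
            )
            collection.update_one(
                {"_id": to_id}, {"$inc": {"Quantity": amount}}, session=session
            )
```

Every operation that should be part of the transaction has to receive the same `session` argument; an operation called without it runs outside the transaction.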

### 11. What is the purpose of MongoDB’s explain() function?
    -> The explain() function provides detailed information on how MongoDB executes a query or aggregation, including which indexes are used, query plans, and execution statistics. It helps optimize query performance by revealing bottlenecks.
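As a hedged sketch, the helper below extracts the plan that the query planner selected. It assumes `collection` is a PyMongo `Collection`; the function name is hypothetical, and the exact layout of the returned plan varies between MongoDB server versions, so it is returned as-is for inspection.

```python
def winning_plan(collection, query):
    """Return the plan MongoDB's query planner chose for `query`."""
    # Cursor.explain() asks the server to describe the query instead of
    # returning the matching documents.
    explanation = collection.find(query).explain()
    return explanation["queryPlanner"]["winningPlan"]
```

Calling `winning_plan(collection, {"Region": "West"})` and looking for a `COLLSCAN` stage is a quick way to spot a query that is not using an index.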

### 12. How does MongoDB handle schema validation?
    -> MongoDB allows optional schema validation rules at the collection level using JSON Schema. This lets you enforce constraints on document structure, types, and required fields, while still retaining flexibility.
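The following sketch shows one way a validator might be attached when a collection is created. It assumes `db` is a PyMongo `Database`; the function name, the collection name `ValidatedOrders`, and the specific rules are hypothetical examples rather than requirements of the Superstore data.

```python
def create_validated_orders(db, name="ValidatedOrders"):
    """Create a collection that rejects documents violating a JSON Schema."""
    validator = {
        "$jsonSchema": {
            "bsonType": "object",
            # Documents missing either field are rejected on insert.
            "required": ["Order ID", "Sales"],
            "properties": {
                "Order ID": {"bsonType": "string"},
                "Sales": {"bsonType": ["double", "int"], "minimum": 0},
            },
        }
    }
    return db.create_collection(name, validator=validator)
```

Fields that the schema does not mention are still allowed, which is how validation and a flexible schema can coexist.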

### 13. What is the difference between a primary and a secondary node in a replica set?
    -> The primary node receives all write operations and replicates data to secondary nodes, which maintain copies of the primary’s data and can serve read queries (depending on read preferences). If the primary fails, a secondary is elected as the new primary.

### 14. What security mechanisms does MongoDB provide for data protection?
    -> MongoDB offers authentication, role-based access control (RBAC), encryption at rest and in transit (TLS/SSL), auditing, IP whitelisting, and integration with LDAP and Kerberos for secure access.

### 15. Explain the concept of embedded documents and when they should be used.
    -> Embedded documents store related data within a single document as nested objects. Use them when related data is accessed together frequently, enabling faster reads and fewer joins.
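To make the idea concrete, here is a small sketch of an order document that embeds its customer and its line items. All field names and values are hypothetical; the document is an ordinary Python dict of the kind PyMongo would store.

```python
# One order document with the customer and the line items embedded in it.
order = {
    "order_id": "ORD-1001",
    "customer": {"name": "Alice", "segment": "Consumer"},
    "items": [
        {"product": "Stapler", "quantity": 2, "unit_price": 10},
        {"product": "Paper", "quantity": 3, "unit_price": 5},
    ],
}


def order_total(order_doc):
    """Sum quantity * unit_price over the embedded line items."""
    return sum(item["quantity"] * item["unit_price"] for item in order_doc["items"])
```

Because the items live inside the order, a single read returns everything needed to compute the total. Embedding is a weaker fit when the nested data grows without bound or is shared by many parent documents.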

### 16. What is the purpose of MongoDB’s $lookup stage in aggregation?
    -> $lookup performs a left outer join between collections, allowing you to combine data from different collections within the aggregation pipeline.
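A sketch of a `$lookup` stage is given below. It assumes a second, hypothetical collection named `Customers` whose documents share a `Customer ID` field with the orders; the function name and the output field `customer` are also hypothetical.

```python
def build_customer_lookup_pipeline(customers_collection="Customers"):
    """Build a pipeline that attaches matching customer documents to each order."""
    return [
        {
            "$lookup": {
                "from": customers_collection,    # collection to join with
                "localField": "Customer ID",     # field in the input documents
                "foreignField": "Customer ID",   # field in the joined collection
                "as": "customer",                # output array of matches
            }
        }
    ]
```

Since this is a left outer join, orders without a matching customer are kept and receive an empty `customer` array.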

### 17. What are some common use cases for MongoDB?
    -> Use cases include content management, real-time analytics, IoT applications, catalogs, mobile apps, social networks, and applications needing flexible schemas or horizontal scaling.

### 18. What are the advantages of using MongoDB for horizontal scaling?
    -> MongoDB’s sharding enables data distribution across multiple servers, allowing it to handle large datasets and high throughput by adding nodes seamlessly, supporting growth without downtime.

### 19. How do MongoDB transactions differ from SQL transactions?
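    -> In MongoDB, writes to a single document are always atomic, so many operations that need a transaction in a normalized SQL schema need none when the related data is embedded in one document. Multi-document ACID transactions arrived later (version 4.0 for replica sets, version 4.2 for sharded clusters, where they can span shards), whereas SQL databases have long had mature multi-statement transaction support. MongoDB transactions are designed for flexibility and scalability in distributed environments and are best reserved for the cases the document model does not already cover.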

### 20. What are the main differences between capped collections and regular collections?
    -> Capped collections are fixed-size, high-performance collections that maintain insertion order and automatically overwrite oldest entries when full. Regular collections have no size limit and allow flexible CRUD operations.
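As a brief sketch, a capped collection might be created like this. It assumes `db` is a PyMongo `Database`; the function name, the collection name `RecentEvents`, and the size limits are hypothetical.

```python
def create_recent_events(db, name="RecentEvents"):
    """Create a capped collection limited to 1 MiB or 1000 documents."""
    return db.create_collection(
        name,
        capped=True,
        size=1024 * 1024,  # maximum size in bytes (required for capped collections)
        max=1000,          # optional maximum number of documents
    )
```

Once either limit is reached, the oldest documents are overwritten, which suits rolling logs and similar recent-history data.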

### 21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?
    -> $match filters documents early in the pipeline to pass only those matching specified criteria, improving efficiency by reducing subsequent workload.

### 22. How can you secure access to a MongoDB database?
    -> Secure access by enabling authentication, using strong passwords, configuring role-based access control, encrypting data in transit and at rest, limiting network exposure with firewalls and IP whitelisting, and auditing access.

### 23. What is MongoDB’s WiredTiger storage engine, and why is it important?
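    -> WiredTiger has been MongoDB’s default storage engine since version 3.2. It provides document-level concurrency control, so writes to different documents do not block each other, and it compresses data and indexes on disk. It is also the engine on which multi-document transactions are built, which makes it central to both the performance and the consistency guarantees of the database.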

# Practical Questions

In [None]:
from pymongo import MongoClient
import json
import pprint

In [None]:
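# 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.
import pandas as pd

# NOTE: the CSV path is an assumption; point it at your copy of the dataset
# (and pass `encoding=...` to read_csv if the file is not UTF-8).
df = pd.read_csv('superstore.csv')

client = MongoClient('mongodb://localhost:27017/')
db = client['SuperstoreDB']
collection = db['SalesData']

# Convert each row to a dict and insert all rows in one batch
data_dict = df.to_dict(orient='records')
collection.insert_many(data_dict)

print("Data inserted into MongoDB successfully!")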

Data inserted into MongoDB successfully!


In [None]:
# 2. Retrieve and print all documents from the Orders collection.
# (The orders were loaded into `SalesData` above, so `collection` refers to that collection.)
all_orders = collection.find()

for order in all_orders:
    pprint.pprint(order)

In [None]:
# 3. Count and display the total number of documents in the Orders collection.
total_documents = collection.count_documents({})

# Print the count
print(f"Total documents in the collection: {total_documents}")

Total documents in the collection: 9994


In [None]:
# 4. Write a query to fetch all orders from the "West" region

west_orders = collection.find({'Region': 'West'})

for order in west_orders:
    pprint.pprint(order)

      Row ID        Order ID Order Date  Ship Date       Ship Mode  \
2          3  CA-2016-138688  6/12/2016  6/16/2016    Second Class   
5          6  CA-2014-115812   6/9/2014  6/14/2014  Standard Class   
6          7  CA-2014-115812   6/9/2014  6/14/2014  Standard Class   
7          8  CA-2014-115812   6/9/2014  6/14/2014  Standard Class   
8          9  CA-2014-115812   6/9/2014  6/14/2014  Standard Class   
...      ...             ...        ...        ...             ...   
9986    9987  CA-2016-125794  9/29/2016  10/3/2016  Standard Class   
9990    9991  CA-2017-121258  2/26/2017   3/3/2017  Standard Class   
9991    9992  CA-2017-121258  2/26/2017   3/3/2017  Standard Class   
9992    9993  CA-2017-121258  2/26/2017   3/3/2017  Standard Class   
9993    9994  CA-2017-119914   5/4/2017   5/9/2017    Second Class   

     Customer ID    Customer Name    Segment        Country         City  ...  \
2       DV-13045  Darrin Van Huff  Corporate  United States  Los Angeles  ... 

In [None]:
# 5. Write a query to find orders where Sales is greater than 500.
high_sales_orders = collection.find({'Sales': {'$gt': 500}})

for order in high_sales_orders:
    pprint.pprint(order)

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
1          2  CA-2016-152156   11/8/2016  11/11/2016    Second Class   
3          4  US-2015-108966  10/11/2015  10/18/2015  Standard Class   
7          8  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
10        11  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
11        12  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
...      ...             ...         ...         ...             ...   
9931    9932  CA-2015-104948  11/13/2015  11/17/2015  Standard Class   
9942    9943  CA-2014-143371  12/28/2014    1/3/2015  Standard Class   
9947    9948  CA-2017-121559    6/1/2017    6/3/2017    Second Class   
9948    9949  CA-2017-121559    6/1/2017    6/3/2017    Second Class   
9968    9969  CA-2017-153871  12/11/2017  12/17/2017  Standard Class   

     Customer ID    Customer Name    Segment        Country             City  \
1       CG-12520      Claire Gute   Consumer  United St

In [None]:
# 6. Fetch the top 3 orders with the highest Profit.
top_3_profit_orders = collection.find().sort('Profit', -1).limit(3)

# Display results
for order in top_3_profit_orders:
    pprint.pprint(order)

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
6826    6827  CA-2016-118689   10/2/2016   10/9/2016  Standard Class   
8153    8154  CA-2017-140151   3/23/2017   3/25/2017     First Class   
4190    4191  CA-2017-166709  11/17/2017  11/22/2017  Standard Class   

     Customer ID Customer Name    Segment        Country       City  ...  \
6826    TC-20980  Tamara Chand  Corporate  United States  Lafayette  ...   
8153    RB-19360  Raymond Buch   Consumer  United States    Seattle  ...   
4190    HL-15040  Hunter Lopez   Consumer  United States     Newark  ...   

     Postal Code   Region       Product ID    Category Sub-Category  \
6826       47905  Central  TEC-CO-10004722  Technology      Copiers   
8153       98115     West  TEC-CO-10004722  Technology      Copiers   
4190       19711     East  TEC-CO-10004722  Technology      Copiers   

                               Product Name     Sales  Quantity  Discount  \
6826  Canon imageCLASS 2200 Advanced Copier 

In [None]:
# 7. Update all orders with Ship Mode as "First Class" to "Premium Class."
result = collection.update_many(
    {'Ship Mode': 'First Class'},      # Filter
    {'$set': {'Ship Mode': 'Premium Class'}}  # Update
)

print(f"Modified results:- {result.modified_count} documents.")

Modified results:-  Ship Mode
Standard Class    5968
Second Class      1945
Premium Class     1538
Same Day           543
Name: count, dtype: int64


In [None]:
# 8. Delete all orders where Sales is less than 50.
result = collection.delete_many({'Sales': {'$lt': 50}})

print(f"Deleted {result.deleted_count} documents where Sales < 50.")

In [None]:
# 9. Use aggregation to group orders by Region and calculate total sales per region.
pipeline = [
    {
        '$group': {
            '_id': '$Region',
            'TotalSales': {'$sum': '$Sales'}
        }
    },
    {
        '$sort': {'TotalSales': -1}  # Sort descending by total sales
    }
]

results = collection.aggregate(pipeline)

print("Total Sales by Region:")
for res in results:
    print(f"Region: {res['_id']}, Total Sales: {res['TotalSales']:.2f}")

In [None]:
# 10. Fetch all distinct values for Ship Mode from the collection.

distinct_ship_modes = collection.distinct('Ship Mode')

print("🚢 Distinct Ship Modes:")
for mode in distinct_ship_modes:
    print(mode)

In [None]:
# 11. Count the number of orders for each category.

pipeline = [
    {
        '$group': {
            '_id': '$Category',
            'OrderCount': {'$sum': 1}
        }
    },
    {
        '$sort': {'OrderCount': -1}
    }
]

results = collection.aggregate(pipeline)

print("Number of orders per Category:")
for res in results:
    print(f"Category: {res['_id']}, Orders: {res['OrderCount']}")

Number of orders per Category:
Category
Office Supplies    6026
Furniture          2121
Technology         1847
Name: count, dtype: int64
