# **THEORETICAL QUESTIONS**

## 1. What are the key differences between SQL and NoSQL databases?
| Feature                  | SQL (Relational Databases)                          | NoSQL (Non-Relational Databases)                          |
|--------------------------|-----------------------------------------------------|------------------------------------------------------------|
| **Data Model**           | Structured (tables with rows and columns)           | Unstructured, semi-structured (JSON, key-value, graph, etc.) |
| **Schema**               | Fixed schema (predefined)                           | Dynamic or flexible schema                                 |
| **Examples**             | MySQL, PostgreSQL, Oracle, SQLite                   | MongoDB, Cassandra, Redis, CouchDB                         |
| **Query Language**       | SQL (Structured Query Language)                     | Varies by database (e.g., MongoDB uses JSON-like query documents, MQL) |
| **Scalability**          | Primarily vertical (scale up)                       | Primarily horizontal (scale out)                           |
| **ACID Compliance**      | Strong ACID compliance (Atomicity, Consistency, Isolation, Durability) | Varies; many systems favor eventual consistency, though some (e.g., MongoDB) support ACID transactions |
| **Joins**                | Supports JOIN operations                            | Limited or no JOIN support (MongoDB offers `$lookup`)      |
| **Best Use Case**        | Complex queries, transactional systems (banking, ERP) | Big data, real-time apps, content management, IoT          |
| **Data Integrity**       | High                                                 | Moderate (depends on type)                                |
| **Storage Format**       | Relational tables                                   | Document, key-value, column-family, graph formats          |


## 2. What makes MongoDB a good choice for modern applications?
- MongoDB is a great choice for modern applications because it offers a flexible, document-based schema (JSON-like format) that allows for rapid development and scalability. It handles large volumes of unstructured data efficiently and supports horizontal scaling, making it ideal for real-time, cloud-based, and big data applications.

## 3. Explain the concept of collections in MongoDB.
- In MongoDB, a collection is a group of documents, similar to a table in a relational database. Each document is a JSON-like object, stored internally in a binary format called BSON, and can have a different structure from other documents in the same collection. Collections do not require a fixed schema, which makes them flexible for storing varied or evolving data.
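
As a minimal sketch of that flexibility, the two documents below have different fields yet could be stored in the same collection. All names and values are hypothetical, and `collection` is assumed to be a PyMongo `Collection`.

```python
# Two documents with different shapes; the names and fields are hypothetical.
book = {"name": "Intro to MongoDB", "price": 25.0, "author": "A. Writer"}
laptop = {"name": "Laptop X", "price": 900.0, "specs": {"ram_gb": 16, "cpu": "8-core"}}


def insert_mixed(collection):
    """Insert both documents into one collection (assumed PyMongo Collection)."""
    result = collection.insert_many([book, laptop])
    return result.inserted_ids


# Fields that appear in only one of the two documents.
only_in_one = set(book) ^ set(laptop)
print(sorted(only_in_one))
```

No schema change is needed before inserting the second document; the collection simply accepts both shapes.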

## 4. How does MongoDB ensure high availability using replication?
- MongoDB ensures high availability through replica sets, which are groups of servers where one acts as the primary and the others as secondaries. The secondaries replicate data asynchronously by applying the operations recorded in the primary's oplog (operation log). If the primary becomes unavailable, the remaining members hold an election and promote a secondary to primary, keeping downtime minimal. This provides fault tolerance, data redundancy, and automatic failover.
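
Failover depends on a majority of voting members being able to reach each other. The helper below is only an arithmetic illustration of that majority rule, not MongoDB's election protocol; the function name is hypothetical.

```python
def can_elect_primary(voting_members: int, reachable_members: int) -> bool:
    """Return True if the reachable members form a strict majority of voters."""
    return reachable_members > voting_members // 2


# A three-member replica set survives the loss of one member, but not two.
print(can_elect_primary(3, 2))  # one member down
print(can_elect_primary(3, 1))  # two members down
```

This is why replica sets are usually deployed with an odd number of voting members.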

## 5. What are the main benefits of MongoDB Atlas?
| Benefit                  | Description                                                                 |
|--------------------------|-----------------------------------------------------------------------------|
| **Fully Managed Service** | Automates deployment, scaling, backups, and updates                        |
| **High Availability**     | Built-in replication and automatic failover ensure minimal downtime        |
| **Scalability**           | Easily scales vertically and horizontally across cloud regions             |
| **Integrated Security**   | Includes encryption, authentication, and role-based access control         |
| **Performance Monitoring**| Real-time metrics, slow query analysis, and automated alerts               |
| **Multi-Cloud Support**   | Deploys seamlessly on AWS, Azure, and Google Cloud                         |


## 6. What is the role of indexes in MongoDB, and how do they improve performance?
### Role of Indexes in MongoDB

Indexes in MongoDB are used to improve the speed and efficiency of query operations. Without indexes, MongoDB must scan every document in a collection to find matching results, which is slow for large datasets.

Indexes act like a roadmap, allowing MongoDB to quickly locate the required data. They significantly reduce the time taken for read operations such as `find()`, `sort()`, and `aggregate()`.

#### Benefits of Indexes:
- **Faster Query Performance:** Speeds up data retrieval by avoiding full collection scans.
- **Efficient Sorting:** Helps in executing sorted queries quickly.
- **Unique Constraints:** Enforces uniqueness on specific fields like `email` or `username`.
- **Support for Complex Queries:** Enables optimization of compound queries and range queries.
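
A short sketch of how such indexes might be created with PyMongo's `create_index`. The field names are borrowed from the Superstore data used in the practical section, `collection` is assumed to be a PyMongo `Collection`, and the choice of indexes is illustrative rather than a recommendation.

```python
def create_order_indexes(collection):
    """Create example indexes; `collection` is assumed to be a PyMongo Collection."""
    # Single-field index to speed up filters such as {"Region": "West"}.
    region_idx = collection.create_index("Region")
    # Compound index: filter by Category, then sort by Sales (1 = ascending, -1 = descending).
    compound_idx = collection.create_index([("Category", 1), ("Sales", -1)])
    # Unique index: rejects a second document with the same "Row ID".
    unique_idx = collection.create_index("Row ID", unique=True)
    return [region_idx, compound_idx, unique_idx]
```

Each call returns the name of the index it created. Indexes speed up reads but add work to every write, so they are usually limited to fields that queries actually filter or sort on.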

## 7. Describe the stages of the MongoDB aggregation pipeline.

The MongoDB aggregation pipeline processes data through a sequence of stages, transforming documents step-by-step.

### Key Stages:

1. **$match**  
   Filters documents based on specified criteria (like `WHERE` in SQL).  
   Example: `{ $match: { status: "active" } }`

2. **$group**  
   Groups documents by a field and performs operations like sum, avg, count, etc.  
   Example: `{ $group: { _id: "$department", total: { $sum: 1 } } }`

3. **$project**  
   Reshapes documents, includes/excludes fields, or creates computed fields.  
   Example: `{ $project: { name: 1, totalSales: { $multiply: ["$price", "$quantity"] } } }`

4. **$sort**  
   Sorts documents in ascending or descending order.  
   Example: `{ $sort: { totalSales: -1 } }`

5. **$limit**  
   Limits the number of documents in the output.  
   Example: `{ $limit: 5 }`

6. **$skip**  
   Skips a specified number of documents (useful for pagination).  
   Example: `{ $skip: 10 }`

7. **$lookup**  
   Performs a left outer join with another collection.  
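   Example: `{ $lookup: { from: "users", localField: "user_id", foreignField: "_id", as: "user" } }` joins orders with users on `user_id` (assuming `users` documents are keyed by `_id`).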
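
The stages above are normally chained. The sketch below builds one such pipeline as a plain Python list, using field names from the Superstore data in the practical section; the function name and defaults are hypothetical.

```python
def top_categories_pipeline(region, skip=0, limit=5):
    """Build a pipeline: filter, group, reshape, sort, then paginate."""
    return [
        {"$match": {"Region": region}},
        {"$group": {"_id": "$Category", "total_sales": {"$sum": "$Sales"}, "orders": {"$sum": 1}}},
        {"$project": {"_id": 0, "category": "$_id", "total_sales": 1, "orders": 1}},
        {"$sort": {"total_sales": -1}},
        {"$skip": skip},
        {"$limit": limit},
    ]


pipeline = top_categories_pipeline("West", limit=3)
print([next(iter(stage)) for stage in pipeline])
# With PyMongo this would be run as: list(collection.aggregate(pipeline))
```

Placing `$match` first keeps the later stages working on as few documents as possible.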


## 8. What is sharding in MongoDB? How does it differ from replication?
- Sharding in MongoDB is a method of distributing large datasets across multiple machines to improve scalability and performance. It splits data into chunks using a shard key and stores them across different shards. In contrast, replication is used to ensure high availability by copying the same data across multiple servers (replica set). While sharding distributes data, replication duplicates it for fault tolerance.
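
The contrast can be illustrated with a toy simulation in plain Python. This is a conceptual sketch only: MongoDB actually partitions shard key ranges into chunks and balances them across shards, whereas the modulo hashing and all function names here are simplifying assumptions.

```python
import hashlib
from collections import defaultdict


def pick_shard(shard_key_value, shard_count):
    """Map a shard key value to a shard index via a stable hash (illustration only)."""
    digest = hashlib.sha256(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, shard_key, shard_count):
    """Sharding: each document is stored on exactly one shard."""
    shards = defaultdict(list)
    for doc in documents:
        shards[pick_shard(doc[shard_key], shard_count)].append(doc)
    return dict(shards)


def replicate(documents, member_count):
    """Replication: every member holds a full copy of the same data."""
    return [list(documents) for _ in range(member_count)]


docs = [{"customer_id": f"C{i}"} for i in range(10)]
sharded = distribute(docs, "customer_id", 3)
replicated = replicate(docs, 3)
print(sum(len(part) for part in sharded.values()))  # each document counted once
print([len(copy) for copy in replicated])           # every copy is complete
```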

## 9. What is PyMongo, and why is it used?
- PyMongo is the official Python driver for MongoDB. It allows Python applications to connect to, query, insert, update, and manage data in MongoDB databases. PyMongo is used because it provides a simple and efficient interface for interacting with MongoDB directly from Python code, making it ideal for building data-driven applications.

## 10. What are the ACID properties in the context of MongoDB transactions?

MongoDB supports ACID transactions for safe, consistent, and reliable data operations. Single-document operations have always been atomic, and multi-document transactions spanning several documents and collections are available from version 4.0.

- **Atomicity**  
  All operations within a transaction are completed successfully or none are applied at all.

- **Consistency**  
  The database remains in a valid state before and after the transaction, preserving defined rules and constraints.

- **Isolation**  
  Transactions are isolated from each other, meaning their intermediate states are not visible to others.

- **Durability**  
  Once a transaction is committed, its changes are permanently saved—even in the event of a crash or power failure.
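
A minimal sketch of a multi-document transaction with PyMongo's session API. It assumes `client` is a `MongoClient` connected to a replica set; the `bank` database, `accounts` collection, and `transfer` function are hypothetical.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two accounts atomically.

    `client` is assumed to be a PyMongo MongoClient connected to a replica set;
    the `bank` database and `accounts` collection are hypothetical.
    """
    accounts = client["bank"]["accounts"]
    with client.start_session() as session:
        # The block commits on normal exit and aborts if an exception is raised.
        with session.start_transaction():
            accounts.update_one({"_id": from_id}, {"$inc": {"balance": -amount}}, session=session)
            accounts.update_one({"_id": to_id}, {"$inc": {"balance": amount}}, session=session)
```

Passing `session=session` to each operation is what ties it to the transaction; an operation without it would run outside the transaction.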

## 11. What is the purpose of MongoDB’s `explain()` method?
- The purpose of MongoDB’s explain() function is to analyze and understand how a query is executed. It provides detailed information about the query plan, such as whether indexes are used, how many documents were scanned, and the time taken. This helps developers optimize performance by identifying inefficient queries or missing indexes.
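
With PyMongo, `collection.find(query).explain()` returns the plan as a dictionary. The helper below is a sketch of how a few commonly inspected fields could be read from it; the field names follow the classic explain layout and may differ between server versions, and the sample document is hand-written rather than real server output.

```python
def summarize_explain(plan):
    """Pull a few commonly inspected fields out of an explain document."""
    stages = []
    node = plan.get("queryPlanner", {}).get("winningPlan", {})
    # Walk down the single-input chain of plan stages (e.g. FETCH -> IXSCAN).
    while node:
        if "stage" in node:
            stages.append(node["stage"])
        node = node.get("inputStage")
    stats = plan.get("executionStats", {})
    return {
        "stages": stages,
        "uses_index": "IXSCAN" in stages,
        "docs_examined": stats.get("totalDocsExamined"),
        "returned": stats.get("nReturned"),
        "millis": stats.get("executionTimeMillis"),
    }


# Hand-written sample shaped like explain output (not taken from a real server).
sample = {
    "queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}},
    "executionStats": {"nReturned": 12, "totalDocsExamined": 5000, "executionTimeMillis": 7},
}
print(summarize_explain(sample))
```

A `COLLSCAN` stage with far more documents examined than returned is the usual sign that an index is missing.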

## 12. How does MongoDB handle schema validation?
- MongoDB handles schema validation using JSON Schema-based rules defined at the collection level through the `$jsonSchema` operator in the collection's validator. This lets developers enforce structure, data types, required fields, and custom rules for documents inserted into or updated in a collection.
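
A sketch of what such a validator might look like when passed to PyMongo's `create_collection`. The `users` collection and its fields are hypothetical, and `db` is assumed to be a PyMongo `Database`.

```python
# Validator for a hypothetical `users` collection.
user_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["name", "email"],
        "properties": {
            "name": {"bsonType": "string", "description": "must be a string"},
            "email": {"bsonType": "string", "pattern": "^.+@.+$"},
            "age": {"bsonType": "int", "minimum": 0},
        },
    }
}


def create_users_collection(db):
    """Create the collection with the validator; `db` is assumed to be a PyMongo Database."""
    return db.create_collection("users", validator=user_validator)
```

With this validator in place, an insert that omits `email` or supplies a negative `age` would be rejected by the server.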

## 13. What is the difference between a primary and a secondary node in a replica set?

| Feature               | Primary Node                               | Secondary Node                             |
|-----------------------|---------------------------------------------|---------------------------------------------|
| **Role**              | Handles all writes and, by default, all reads | Replicates data from the primary            |
| **Writes**            | Allowed                                     | Not allowed; only applies operations replicated from the primary |
| **Reads**             | Default read target                         | Can serve reads if the read preference allows it |
| **Failover**          | Holds the primary role until it fails or steps down | Can be elected primary during failover |
| **Data Replication**  | Source of truth                             | Continuously syncs from primary             |
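
Whether reads go to the primary or to secondaries is controlled by the read preference, which can be set in the connection string. The sketch below only builds such a string; the host names and replica set name are hypothetical.

```python
def replica_set_uri(hosts, replica_set, read_preference="primary"):
    """Build a connection string listing several replica set members."""
    return (
        "mongodb://" + ",".join(hosts)
        + f"/?replicaSet={replica_set}&readPreference={read_preference}"
    )


# Hypothetical host names and replica set name.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    "rs0",
    read_preference="secondaryPreferred",
)
print(uri)
```

Because replication is asynchronous, reads served by a secondary may return slightly stale data.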

## 14. What security mechanisms does MongoDB provide for data protection?

MongoDB provides several built-in security features to protect data:

- **Authentication**  
  Verifies the identity of users and applications using credentials or external systems (LDAP, Kerberos, etc.).

- **Authorization**  
  Role-Based Access Control (RBAC) allows fine-grained permissions on databases, collections, and operations.

- **Encryption**  
  - **At Rest**: The WiredTiger storage engine can encrypt data files on disk (native encryption at rest is a MongoDB Enterprise and Atlas feature).
  - **In Transit**: TLS/SSL ensures secure data transfer between clients and servers.

- **Auditing**  
  Tracks and logs database activity for compliance and security analysis.

- **IP Whitelisting & Network Rules**  
  Limits access to trusted IP addresses or internal networks.

- **Field-Level Redaction (Enterprise)**  
  Controls visibility of sensitive fields in documents.

## 15. Explain the concept of embedded documents and when they should be used.
- Embedded documents in MongoDB are documents stored inside other documents. They keep related data together in a single document, so it can be read in one operation. Use embedded documents when the data is closely related, such as a user and their address, when both are usually accessed together, and when the embedded data stays bounded in size (a single document cannot exceed 16 MB). This reduces the number of queries and improves read performance.
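
A small sketch of the user-and-address case. All values are hypothetical, and `collection` is assumed to be a PyMongo `Collection` of user documents.

```python
# A user document with an embedded address (all values are hypothetical).
user = {
    "name": "Asha",
    "email": "asha@example.com",
    "address": {"street": "12 Park Road", "city": "Pune", "zip": "411001"},
}

# Dot notation reaches into the embedded document in a query filter.
city_filter = {"address.city": "Pune"}


def find_users_in_city(collection, city):
    """`collection` is assumed to be a PyMongo Collection of user documents."""
    return list(collection.find({"address.city": city}))


# One read returns the user and the address together.
print(user["address"]["city"])
```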

## 16. What is the purpose of MongoDB’s `$lookup` stage in aggregation?
- The purpose of MongoDB’s $lookup stage in aggregation is to perform a left outer join between documents from two collections. It allows you to combine related data, similar to SQL joins. For example, you can join an orders collection with a users collection to get user details for each order. This is useful when data is stored in separate collections but needs to be queried together.
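
The orders-and-users join described above might be written as follows. The collection names, the `user_id` field, and the assumption that `users` is keyed by `_id` are illustrative.

```python
def orders_with_users_pipeline():
    """Join each order to its user; collection and field names are assumptions."""
    return [
        {
            "$lookup": {
                "from": "users",            # collection to join with
                "localField": "user_id",    # field in the orders documents
                "foreignField": "_id",      # matching field in the users documents
                "as": "user",               # output array of matching users
            }
        },
        # Optional: turn the one-element array into a single embedded document.
        # preserveNullAndEmptyArrays keeps orders that have no matching user.
        {"$unwind": {"path": "$user", "preserveNullAndEmptyArrays": True}},
    ]


pipeline = orders_with_users_pipeline()
# With PyMongo: results = list(db["orders"].aggregate(pipeline))
print(pipeline[0]["$lookup"]["from"])
```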

## 17. What are some common use cases for MongoDB?
- MongoDB is commonly used for applications that require flexible, scalable, and high-performance data storage. It’s ideal for content management systems, real-time analytics, IoT applications, catalogs, e-commerce platforms, and mobile or social apps. Its flexible schema and ability to handle large volumes of unstructured data make it perfect for rapidly changing and data-rich environments.

## 18. What are the advantages of using MongoDB for horizontal scaling?
MongoDB supports horizontal scaling through **sharding**, allowing data to be distributed across multiple servers.

### Key Advantages:

- **Handles Large Datasets**  
  Sharding allows MongoDB to store and process huge volumes of data by spreading it across shards.

- **Improves Performance**  
  Queries can be parallelized across shards, reducing response times.

- **High Availability**  
  Each shard can be replicated, ensuring fault tolerance and minimal downtime.

- **Flexible Growth**  
  Easily add more shards (servers) as data or traffic grows, without downtime.

- **Cost-Effective Scaling**  
  Use commodity hardware instead of expensive high-end servers.

## 19. How do MongoDB transactions differ from SQL transactions?

| Feature                | MongoDB Transactions                           | SQL Transactions                                |
|------------------------|--------------------------------------------------|--------------------------------------------------|
| **Data Model**         | Document-based (NoSQL)                          | Table-based (Relational)                         |
| **Transaction Scope**  | Single-document operations are always atomic; multi-document transactions since v4.0 | Multiple rows across multiple tables             |
| **ACID Support**       | Multi-document ACID from v4.0 (replica sets) and v4.2 (sharded clusters) | Fully supported                                  |
| **Schema Flexibility** | Schema-less, uses embedded documents            | Fixed schema, normalized structure               |
| **Use Case**           | Used when operations span multiple documents    | Commonly used in multi-table operations          |
| **Performance**        | Optimized for most operations without transactions | Designed around transactional integrity       |

## 20. What are the main differences between capped collections and regular collections?

| Feature                  | Capped Collections                                 | Regular Collections                         |
|--------------------------|----------------------------------------------------|---------------------------------------------|
| **Storage Size**         | Fixed size (pre-allocated)                         | Grows dynamically                            |
| **Document Removal**     | Oldest documents automatically overwritten         | Documents stay until deleted explicitly or by a TTL index |
| **Insertion Order**      | Preserves insertion order                          | No guarantee of insertion order              |
| **Update Behavior**      | Cannot increase document size after insertion      | Allows document size to change               |
| **Use Case**             | Ideal for logs, real-time data, and circular queues | Suitable for general-purpose data storage    |
| **Performance**          | High performance for inserts and reads (circular buffer) | Performance depends on workload         |
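
A sketch of creating a capped collection with PyMongo, followed by a pure-Python analogy of the circular-buffer behaviour. The collection name and limits are hypothetical, and the `deque` only mimics the overwrite rule; it is not how MongoDB stores the data.

```python
from collections import deque


def create_log_collection(db):
    """Create a capped collection; `db` is assumed to be a PyMongo Database.

    `size` is the maximum size in bytes and `max` an optional document limit.
    The collection name and limits here are hypothetical.
    """
    return db.create_collection("app_logs", capped=True, size=1024 * 1024, max=1000)


# Pure-Python analogy of the circular-buffer behaviour: a bounded deque
# silently drops its oldest entry once it is full.
recent = deque(maxlen=3)
for event in ["e1", "e2", "e3", "e4", "e5"]:
    recent.append(event)
print(list(recent))
```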

## 21. What is the purpose of the `$match` stage in MongoDB’s aggregation pipeline?
- The purpose of the $match stage in MongoDB’s aggregation pipeline is to filter documents based on specific conditions, similar to the WHERE clause in SQL. It allows only those documents that meet the criteria to pass to the next stage in the pipeline, improving performance by reducing the amount of data processed in later stages.

## 22. How can you secure access to a MongoDB database?

To protect a MongoDB database from unauthorized access, follow these key security practices:

- **Enable Authentication**  
  Require users to log in with a username and password.

- **Use Role-Based Access Control (RBAC)**  
  Assign specific roles and permissions to users based on their responsibilities.

- **Enable TLS/SSL Encryption**  
  Encrypt data in transit to prevent interception during communication.

- **Use Firewalls and IP Whitelisting**  
  Restrict database access to trusted IP addresses or networks only.

- **Enable Auditing**  
  Track and log database activities for monitoring and compliance.

- **Keep MongoDB Updated**  
  Regularly update MongoDB to patch known vulnerabilities.
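
On the client side, authentication and TLS are typically requested through the connection string. The sketch below only builds such a string; the credentials and host are hypothetical, and the server must already have authentication and TLS configured.

```python
from urllib.parse import quote_plus


def secure_uri(user, password, host, auth_db="admin"):
    """Build a connection string with credentials and TLS enabled."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URI.
    return (
        f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}"
        f"/?authSource={auth_db}&tls=true"
    )


# Hypothetical credentials and host; real secrets belong in environment variables.
print(secure_uri("app_user", "p@ss:word", "db.example.net:27017"))
```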

## 23. What is MongoDB’s WiredTiger storage engine, and why is it important?
- WiredTiger is MongoDB’s default storage engine that provides high performance and efficient data management. It supports features like document-level locking, compression, and concurrent read/write operations, which help improve throughput and reduce storage space. WiredTiger is important because it enables scalability, faster access, and better resource utilization, especially for modern, high-load applications.
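
Compression can be tuned per collection through the `storageEngine` option of the create command. The sketch below assumes `db` is a PyMongo `Database`; the collection name and the choice of `zlib` are illustrative.

```python
# Per-collection WiredTiger settings; "zlib" trades CPU time for smaller files.
wiredtiger_options = {"wiredTiger": {"configString": "block_compressor=zlib"}}


def create_compressed_collection(db, name="archive"):
    """`db` is assumed to be a PyMongo Database; the collection name is hypothetical."""
    return db.create_collection(name, storageEngine=wiredtiger_options)
```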



# **PRACTICAL QUESTIONS**

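## 1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB

In [3]:
import pandas as pd
from pymongo import MongoClient

# Read the latin1-encoded CSV file into a DataFrame
file_path = "superstore.csv"
df = pd.read_csv(file_path, encoding="latin1")

# Convert each row into a dictionary (one MongoDB document per row)
data = df.to_dict(orient="records")

# Connect to the local MongoDB server and select the database and collection
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
# Collection names are case-sensitive; the later queries read from "Orders"
collection = db["Orders"]

collection.insert_many(data)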

print("Superstore dataset successfully loaded into MongoDB!")


Superstore dataset successfully loaded into MongoDB!


## 2. Retrieve and print all documents from the Orders collection

In [4]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]

all_docs = collection.find()

for doc in all_docs:
    print(doc)


{'_id': ObjectId('6891bd4ab81df53b58179719'), 'Row ID': 1, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-BO-10001798', 'Category': 'Furniture', 'Sub-Category': 'Bookcases', 'Product Name': 'Bush Somerset Collection Bookcase', 'Sales': 261.96, 'Quantity': 2, 'Discount': 0, 'Profit': 41.9136}
{'_id': ObjectId('6891bd4ab81df53b5817971a'), 'Row ID': 2, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub-C

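## 3. Count and display the total number of documents in the Orders collection

In [15]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]
# An empty filter matches every document in the collection
total_documents = collection.count_documents({})

print("Total number of documents in the collection:", total_documents)


Total number of documents in the collection: 5145

Note: the execution counter (15) shows this cell was run after the delete in question 8 (counter 10), so the count reflects the documents remaining at that point.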


## 4. Write a query to fetch all orders from the "West" region

In [6]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]                
collection = db["Orders"]      
west_orders = collection.find({"Region": "West"})

# Print all results
for order in west_orders:
    print(order)


{'_id': ObjectId('6891bd4ab81df53b5817971b'), 'Row ID': 3, 'Order ID': 'CA-2016-138688', 'Order Date': '6/12/2016', 'Ship Date': '6/16/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'DV-13045', 'Customer Name': 'Darrin Van Huff', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Los Angeles', 'State': 'California', 'Postal Code': 90036, 'Region': 'West', 'Product ID': 'OFF-LA-10000240', 'Category': 'Office Supplies', 'Sub-Category': 'Labels', 'Product Name': 'Self-Adhesive Address Labels for Typewriters by Universal', 'Sales': 14.62, 'Quantity': 2, 'Discount': 0, 'Profit': 6.8714}
{'_id': ObjectId('6891bd4ab81df53b5817971e'), 'Row ID': 6, 'Order ID': 'CA-2014-115812', 'Order Date': '6/9/2014', 'Ship Date': '6/14/2014', 'Ship Mode': 'Standard Class', 'Customer ID': 'BH-11710', 'Customer Name': 'Brosina Hoffman', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Los Angeles', 'State': 'California', 'Postal Code': 90032, 'Region': 'West', 'Product ID': 'FUR-FU-100

## 5. Write a query to find orders where Sales is greater than 500

In [7]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]

high_sales_orders = collection.find({"Sales": {"$gt": 500}})


for order in high_sales_orders:
    print(order)


{'_id': ObjectId('6891bd4ab81df53b5817971a'), 'Row ID': 2, 'Order ID': 'CA-2016-152156', 'Order Date': '11/8/2016', 'Ship Date': '11/11/2016', 'Ship Mode': 'Second Class', 'Customer ID': 'CG-12520', 'Customer Name': 'Claire Gute', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Henderson', 'State': 'Kentucky', 'Postal Code': 42420, 'Region': 'South', 'Product ID': 'FUR-CH-10000454', 'Category': 'Furniture', 'Sub-Category': 'Chairs', 'Product Name': 'Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back', 'Sales': 731.94, 'Quantity': 3, 'Discount': 0, 'Profit': 219.582}
{'_id': ObjectId('6891bd4ab81df53b5817971c'), 'Row ID': 4, 'Order ID': 'US-2015-108966', 'Order Date': '10/11/2015', 'Ship Date': '10/18/2015', 'Ship Mode': 'Standard Class', 'Customer ID': 'SO-20335', 'Customer Name': "Sean O'Donnell", 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Fort Lauderdale', 'State': 'Florida', 'Postal Code': 33311, 'Region': 'South', 'Product ID': 'FUR-TA-10000577

## 6. Fetch the top 3 orders with the highest Profit

In [8]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]
top_profit_orders = collection.find().sort("Profit", -1).limit(3)

for order in top_profit_orders:
    print(order)


{'_id': ObjectId('6891bd4bb81df53b5817b1c3'), 'Row ID': 6827, 'Order ID': 'CA-2016-118689', 'Order Date': '10/2/2016', 'Ship Date': '10/9/2016', 'Ship Mode': 'Standard Class', 'Customer ID': 'TC-20980', 'Customer Name': 'Tamara Chand', 'Segment': 'Corporate', 'Country': 'United States', 'City': 'Lafayette', 'State': 'Indiana', 'Postal Code': 47905, 'Region': 'Central', 'Product ID': 'TEC-CO-10004722', 'Category': 'Technology', 'Sub-Category': 'Copiers', 'Product Name': 'Canon imageCLASS 2200 Advanced Copier', 'Sales': 17499.95, 'Quantity': 5, 'Discount': 0, 'Profit': 8399.976}
{'_id': ObjectId('6891bd4cb81df53b5817b6f2'), 'Row ID': 8154, 'Order ID': 'CA-2017-140151', 'Order Date': '3/23/2017', 'Ship Date': '3/25/2017', 'Ship Mode': 'First Class', 'Customer ID': 'RB-19360', 'Customer Name': 'Raymond Buch', 'Segment': 'Consumer', 'Country': 'United States', 'City': 'Seattle', 'State': 'Washington', 'Postal Code': 98115, 'Region': 'West', 'Product ID': 'TEC-CO-10004722', 'Category': 'Tech

## 7. Update all orders with Ship Mode "First Class" to "Premium Class".

In [9]:
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]
result = collection.update_many(
    {"Ship Mode": "First Class"},        
    {"$set": {"Ship Mode": "Premium Class"}}  
)

print("Total documents updated:", result.modified_count)


Total documents updated: 1538


## 8. Delete all orders where Sales is less than 50

In [10]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]

delete_result = collection.delete_many({"Sales": {"$lt": 50}})
print("Total documents deleted:", delete_result.deleted_count)


Total documents deleted: 4849


## 9. Use aggregation to group orders by Region and calculate total sales per region

In [11]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]

pipeline = [
    {
        "$group": {
            "_id": "$Region",
            "total_sales": {"$sum": "$Sales"}
        }
    }
]

# Run aggregation and print results
region_sales = collection.aggregate(pipeline)
for region in region_sales:
    print(region)


{'_id': 'East', 'total_sales': 651137.705}
{'_id': 'Central', 'total_sales': 479611.8458}
{'_id': 'South', 'total_sales': 376023.312}
{'_id': 'West', 'total_sales': 694686.6195}


## 10. Fetch all distinct values for Ship Mode from the collection

In [12]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]

ship_modes = collection.distinct("Ship Mode")

print("Distinct Ship Modes:")
for mode in ship_modes:
    print("-", mode)


Distinct Ship Modes:
- Premium Class
- Same Day
- Second Class
- Standard Class


## 11. Count the number of orders for each category.

In [13]:
from pymongo import MongoClient
client = MongoClient("mongodb://localhost:27017/")
db = client["STORE"]
collection = db["Orders"]
pipeline = [
    {
        "$group": {
            "_id": "$Category",
            "order_count": {"$sum": 1}
        }
    }
]

category_counts = collection.aggregate(pipeline)
print("Order counts by category:")
for category in category_counts:
    print(f"{category['_id']}: {category['order_count']}")


Order counts by category:
Furniture: 1573
Technology: 1496
Office Supplies: 2076
