#### Theoretical Questions

1. What are the key differences between SQL and NoSQL databases?

  SQL and NoSQL databases differ primarily in structure, data model, and scalability. SQL databases are relational: they store data in tables with a predefined schema, are queried with SQL, and typically scale vertically. NoSQL databases are non-relational, use various data models (document, key-value, wide-column, graph), and typically scale horizontally.

2. What makes MongoDB a good choice for modern applications?


  MongoDB is a good choice for modern applications due to its flexibility, scalability, and performance characteristics. Its document-oriented data model and ability to handle unstructured data make it well-suited for agile development and evolving business needs.

3. Explain the concept of collections in MongoDB.

  In MongoDB, a collection is a grouping of documents, similar to a table in a relational database. It is the structure within a database that stores and organizes documents. Unless schema validation is configured, each document in a collection can have a different structure, which gives flexibility in how data is stored.
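As a minimal sketch of that flexibility, the two documents below have different fields yet can be stored in the same collection. The field names are hypothetical, and `collection` is assumed to be a PyMongo `Collection` object supplied by the caller.

```python
# Two documents with different fields that can live in the same collection.
# The field names are illustrative only, not taken from a real dataset.
products = [
    {"name": "Desk", "price": 250.0, "dimensions": {"w": 120, "d": 60}},
    {"name": "E-book", "price": 9.99, "download_url": "https://example.com/book"},
]


def insert_products(collection):
    """Insert both documents and return the ids MongoDB assigned to them.

    `collection` is assumed to be a pymongo Collection.
    """
    result = collection.insert_many(products)
    return result.inserted_ids
```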


4. How does MongoDB ensure high availability using replication?

  MongoDB ensures high availability through replica sets, which are groups of mongod instances that maintain the same data set. This redundancy allows for automatic failover, meaning that if the primary node fails, a secondary node is elected to take over, minimizing downtime and ensuring continuous operation.

5. What are the main benefits of MongoDB Atlas?

  MongoDB Atlas offers several key benefits, making it a popular choice for modern application development. These include ease of use, scalability, security, and automated management. It simplifies database deployment and maintenance, allowing developers to focus on building applications rather than managing infrastructure.

6. What is the role of indexes in MongoDB, and how do they improve performance?

  In MongoDB, indexes improve query performance by letting the database locate matching documents without scanning every document in a collection. An index is an ordered data structure over one or more fields, so MongoDB can jump directly to the entries that match the query criteria. Indexes matter most in large collections, where they can significantly reduce query execution time and resource consumption, at the cost of extra storage and some overhead on writes.
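A short sketch of creating an index with PyMongo follows. It assumes `collection` is a PyMongo `Collection` and that documents carry a `Region` field, as in the Superstore data used later in this notebook.

```python
def index_region(collection):
    """Create an ascending index on the Region field and return its name.

    `collection` is assumed to be a pymongo Collection. Calling this again
    with the same specification does not create a second index.
    """
    return collection.create_index([("Region", 1)])
```

After this call, a filter such as `{"Region": "West"}` can be served from the index instead of a full collection scan.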


7. Describe the stages of the MongoDB aggregation pipeline.

  The MongoDB aggregation pipeline processes documents through a series of stages, each performing a specific operation and transforming the data along the way. The output of one stage becomes the input of the next. Key stages include $match to filter documents, $group to aggregate data, and $sort to order results.
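The three stages named above can be chained as in this sketch. The field names (`Region`, `Category`, `Sales`) are assumed to match the Superstore data, and `collection` is assumed to be a PyMongo `Collection`.

```python
# Each dict is one stage; documents flow through the list from top to bottom.
pipeline = [
    {"$match": {"Region": "West"}},  # keep only orders from one region
    {"$group": {"_id": "$Category", "total_sales": {"$sum": "$Sales"}}},  # sum sales per category
    {"$sort": {"total_sales": -1}},  # largest total first
]


def sales_by_category(collection):
    """Run the pipeline and return the result documents as a list."""
    return list(collection.aggregate(pipeline))
```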
  
8. What is sharding in MongoDB? How does it differ from replication?

  In MongoDB, sharding and replication serve different purposes in scaling and data management. Sharding distributes data across multiple servers (shards) to handle large datasets and high throughput, effectively scaling horizontally. Replication, on the other hand, creates copies of the entire dataset on multiple servers, primarily for high availability and read scalability.

9. What is PyMongo, and why is it used?

  PyMongo is the official MongoDB driver for Python. It lets a Python program connect to a MongoDB deployment and work with databases, collections, and documents using ordinary Python dictionaries. It is used to run CRUD operations, aggregation pipelines, index management, and transactions from Python code.

10. What are the ACID properties in the context of MongoDB transactions?

  In MongoDB, ACID properties (Atomicity, Consistency, Isolation, and Durability) ensure the reliability and integrity of database transactions. They guarantee that a transaction is treated as a single, indivisible unit of work, with operations either completing successfully or not at all, and that the database remains in a consistent state throughout the process.
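The all-or-nothing behavior can be sketched with PyMongo's session API. This is an illustration under assumptions: the `bank` database, the `accounts` collection, and the `balance` field are hypothetical, `client` is assumed to be a `MongoClient`, and multi-document transactions require a replica set or sharded cluster.

```python
def transfer(client, from_id, to_id, amount):
    """Move `amount` between two account documents inside one transaction.

    The database, collection, and field names are hypothetical.
    """
    accounts = client["bank"]["accounts"]

    def apply_transfer(session):
        # Both updates commit together, or neither is applied.
        accounts.update_one(
            {"_id": from_id}, {"$inc": {"balance": -amount}}, session=session
        )
        accounts.update_one(
            {"_id": to_id}, {"$inc": {"balance": amount}}, session=session
        )

    with client.start_session() as session:
        # with_transaction starts the transaction, runs the callback,
        # and commits; it aborts if the callback raises.
        session.with_transaction(apply_transfer)
```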

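11. What is the purpose of MongoDB’s explain() function?

  explain() returns information about how MongoDB executes a query. In the default queryPlanner mode, MongoDB runs the query optimizer and reports the winning plan without executing it. In executionStats mode, MongoDB also executes the winning plan to completion and returns statistics describing that execution. This output shows, for example, whether a query uses an index or scans the whole collection.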
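A small sketch of reading the plan from Python, assuming `collection` is a PyMongo `Collection`. `Cursor.explain()` returns the explain output as a dictionary, and the exact shape of the `winningPlan` entry varies between server versions.

```python
def winning_plan(collection, query):
    """Return the winning plan that the query optimizer chose for `query`."""
    explanation = collection.find(query).explain()
    return explanation["queryPlanner"]["winningPlan"]
```

In the plan, a `COLLSCAN` stage typically indicates a full collection scan, while `IXSCAN` indicates that an index was used.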
  

12. How does MongoDB handle schema validation?

  MongoDB schema validation ensures that documents inserted or updated in a collection conform to a predefined structure and set of rules, enhancing data consistency and integrity. This is achieved by associating a JSON Schema with a collection, which specifies the required fields, data types, and other constraints for documents within that collection.
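As a sketch, a JSON Schema validator can be attached when a collection is created. The collection name `ValidatedOrders`, the required fields, and the list of allowed regions are assumptions chosen for illustration, and `db` is assumed to be a PyMongo `Database`.

```python
# A validator expressed with the $jsonSchema operator.
# Field names and allowed values are assumptions for illustration.
order_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["Order ID", "Region", "Sales"],
        "properties": {
            "Order ID": {"bsonType": "string"},
            "Region": {"enum": ["West", "East", "Central", "South"]},
            "Sales": {"bsonType": ["double", "int"], "minimum": 0},
        },
    }
}


def create_validated_orders(db):
    """Create a collection whose inserts and updates are checked against the schema."""
    return db.create_collection("ValidatedOrders", validator=order_validator)
```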

13. What is the difference between a primary and a secondary node in a replica set?

  In a MongoDB replica set, the primary node handles all write operations and acts as the source of truth for data, while secondary nodes maintain copies of the data and can handle read operations, but cannot accept writes. If the primary node fails, a secondary node can be elected as the new primary to ensure continuous operation.

14. What security mechanisms does MongoDB provide for data protection?

  MongoDB offers a comprehensive suite of security mechanisms for data protection, including encryption, authentication, authorization, and auditing. These features help protect data both in transit and at rest, and ensure that only authorized users can access and modify sensitive information.

15. Explain the concept of embedded documents and when they should be used.

  Embedded documents, also known as nested documents or sub-documents, store related data inside a single document, creating a hierarchical structure. They are useful when data is closely related and frequently accessed together, because one read returns everything and no join is needed.
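A minimal illustration with a hypothetical order document: the customer and the line items are embedded, so they are read together with the order.

```python
# A hypothetical order with an embedded customer and embedded line items.
order = {
    "order_id": "A-1001",
    "customer": {"name": "Ada", "segment": "Consumer"},
    "items": [
        {"product": "Chair", "quantity": 2, "price": 75.0},
        {"product": "Lamp", "quantity": 1, "price": 40.0},
    ],
}


def order_total(doc):
    """Sum quantity * price over the embedded line items."""
    return sum(item["quantity"] * item["price"] for item in doc["items"])


# Dot notation reaches into an embedded document in a query filter.
consumer_filter = {"customer.segment": "Consumer"}
```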

16. What is the purpose of MongoDB’s $lookup stage in aggregation?

  The $lookup stage performs a left outer join with another collection in the same database. For each input document, it adds an array field containing the documents from the other collection that match the specified condition.
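A sketch of a `$lookup` stage follows. It assumes a hypothetical `Customers` collection whose `_id` matches the `Customer ID` field of each order; `orders` is assumed to be a PyMongo `Collection`.

```python
# Join each order with its customer document (left outer join).
# "Customers" and the field names are hypothetical.
lookup_pipeline = [
    {
        "$lookup": {
            "from": "Customers",          # collection to join with
            "localField": "Customer ID",  # field in the input (order) documents
            "foreignField": "_id",        # field in the Customers documents
            "as": "customer",             # output array field added to each order
        }
    },
]


def orders_with_customers(orders):
    """Return orders, each with a `customer` array of matching documents."""
    return list(orders.aggregate(lookup_pipeline))
```

Orders with no matching customer are still returned, with an empty `customer` array.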

17. What are some common use cases for MongoDB?

  MongoDB is a versatile NoSQL database well-suited for a variety of use cases, particularly those involving large, diverse, and evolving datasets. It excels in scenarios requiring high performance, scalability, and flexibility in data modeling, such as content management, real-time analytics, and mobile applications. Its document-oriented structure also makes it a good fit for handling semi-structured and unstructured data commonly found in areas like IoT, social media, and gaming.

18. What are the advantages of using MongoDB for horizontal scaling?

  MongoDB's sharding feature enables efficient horizontal scaling, offering several advantages: increased capacity, improved performance, enhanced fault tolerance, and cost-effectiveness. By distributing data across multiple servers (shards), MongoDB can handle large datasets and high traffic loads, preventing any single server from becoming a bottleneck.

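19. How do MongoDB transactions differ from SQL transactions?

  MongoDB and SQL databases both offer ACID transactions, but they differ in how much applications rely on them. In SQL databases, related data is usually normalized across several tables, so multi-statement transactions are routine. In MongoDB, operations on a single document are always atomic, and related data is often embedded in one document, so many workloads need no explicit transaction. When a change spans several documents or collections, MongoDB supports multi-document transactions (on replica sets since version 4.0 and on sharded clusters since version 4.2). These run inside a client session and generally cost more than single-document writes.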

20. What are the main differences between capped collections and regular collections?

  Capped collections in MongoDB are fixed-size collections that automatically overwrite the oldest documents when the collection reaches its maximum size, while regular collections grow dynamically and have no such behavior. Capped collections also preserve insertion order. They restrict operations that regular collections allow: updates that change a document's size are limited, and individual documents generally cannot be deleted.
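A sketch of creating a capped collection with PyMongo. The collection name `event_log` and the limits are arbitrary choices for illustration, and `db` is assumed to be a PyMongo `Database`.

```python
def create_event_log(db):
    """Create a capped collection limited to 1 MiB or 1000 documents.

    `size` is given in bytes and is required for capped collections;
    `max` optionally caps the number of documents as well.
    """
    return db.create_collection(
        "event_log", capped=True, size=1024 * 1024, max=1000
    )
```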

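21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?

  The $match stage filters the documents in the pipeline so that only those matching the specified query predicate pass to the next stage. A document is filtered out if the predicate returns 0, null, or false for that document, or if the predicate uses a field that is missing from that document. Placing $match early in a pipeline reduces the number of documents that later stages must process and lets the stage take advantage of indexes.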

22. How can you secure access to a MongoDB database?

  To secure a MongoDB database effectively, you should focus on enabling authentication and authorization, encrypting data both in transit and at rest, restricting network access, and implementing robust auditing and monitoring. These measures prevent unauthorized access, protect sensitive data, and help detect and respond to security breaches.

23. What is MongoDB’s WiredTiger storage engine, and why is it important?

  MongoDB's WiredTiger storage engine is a core component that manages how data is stored and accessed. It replaced the older MMAPv1 engine as the default in MongoDB 3.2 and is crucial for performance, scalability, and data integrity. WiredTiger's key features include document-level concurrency, checkpointing, compression, and encryption (for enterprise versions), making it suitable for a wide range of workloads.


#### Practical Questions

1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB.

# Task
Write a Python script to load the Superstore dataset from a CSV file into MongoDB.

## Install libraries

### Subtask:
Install the `pymongo` and `pandas` libraries.


**Reasoning**:
The subtask requires installing the `pymongo` and `pandas` libraries. I will use pip to install both libraries in a single cell.



In [2]:
%pip install pymongo pandas



## Import libraries

### Subtask:
Import the necessary libraries.


**Reasoning**:
Import the necessary libraries for interacting with MongoDB and handling data.



In [3]:
import pandas as pd
from pymongo import MongoClient

## Load data

### Subtask:
Load the Superstore dataset from a CSV file into a pandas DataFrame.


**Reasoning**:
The current subtask is to load the data from the CSV file into a pandas DataFrame. The `pd.read_csv()` function is suitable for this purpose. After loading, displaying the head and info of the DataFrame is necessary to inspect the data structure and types as required by the instructions.



In [4]:
df_superstore = pd.read_csv('Superstore.csv')
display(df_superstore.head())
df_superstore.info()  # info() prints its summary and returns None, so display() is not needed

FileNotFoundError: [Errno 2] No such file or directory: 'Superstore.csv'

**Reasoning**:
The previous command failed because the file 'Superstore.csv' was not found. I need to check the available files in the current directory to confirm the file name and its location.



In [5]:
%ls

[0m[01;34msample_data[0m/


**Reasoning**:
The previous command showed that 'Superstore.csv' is not in the current directory. I will assume the file is located in the 'sample_data' directory based on the output of the `%ls` command. I will try to read the CSV file from the 'sample_data' directory.



In [6]:
df_superstore = pd.read_csv('sample_data/Superstore.csv')
display(df_superstore.head())
df_superstore.info()  # info() prints its summary and returns None, so display() is not needed

FileNotFoundError: [Errno 2] No such file or directory: 'sample_data/Superstore.csv'

**Reasoning**:
The 'Superstore.csv' file was not found in the current directory or the 'sample_data' directory. I need to explore the contents of the 'sample_data' directory to see if the file exists there with a different name or if there are other files.



In [7]:
%ls sample_data

[0m[01;32manscombe.json[0m*                mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  [01;32mREADME.md[0m*



**Reasoning**:
Since the previous attempt to locate the 'Superstore.csv' file failed, I will try another common location for sample datasets in environments like this, which is often a dedicated datasets directory. I will list the files in the root directory and a potential 'datasets' directory to locate the file.



In [8]:
%ls /
%ls /datasets

[0m[01;36mbin[0m@                        [01;36mlib32[0m@                    [01;34mroot[0m/
[01;34mboot[0m/                       [01;36mlib64[0m@                    [01;34mrun[0m/
[01;34mcontent[0m/                    [01;36mlibx32[0m@                   [01;36msbin[0m@
cuda-keyring_1.1-1_all.deb  [01;34mmedia[0m/                    [01;34msrv[0m/
[01;34mdatalab[0m/                    [01;34mmnt[0m/                      [01;34msys[0m/
[01;34mdev[0m/                        NGC-DL-CONTAINER-LICENSE  [30;42mtmp[0m/
[01;34metc[0m/                        [01;34mopt[0m/                      [01;34mtools[0m/
[01;34mhome[0m/                       [01;34mproc[0m/                     [01;34musr[0m/
[01;34mkaggle[0m/                     [01;34mpython-apt[0m/               [01;34mvar[0m/
[01;36mlib[0m@                        [01;32mpython-apt.tar.xz[0m*
ls: cannot access '/datasets': No such file or directory


**Reasoning**:
The previous attempt to find the file in `/datasets` failed. I will try listing the files in the `/content` directory, which is another common location for user-uploaded files or datasets in this environment.



In [9]:
%ls /content

[0m[01;34msample_data[0m/


**Reasoning**:
The 'Superstore.csv' file was not found in the root, `/datasets`, or `/content` directories. Since the `/content` directory contains a `sample_data` subdirectory which I have already checked, I will try listing the contents of the root directory again, looking for any directories that might contain datasets. I will then check the contents of any promising directories.
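The cells above never locate the file, so the load itself is not shown. The following is a minimal sketch of that step, assuming the CSV has been uploaded and its path is passed in, and that `collection` is a PyMongo `Collection`. It uses the standard `csv` module rather than pandas. The names in `NUMERIC_FIELDS` and the UTF-8 encoding are assumptions about the dataset.

```python
import csv

# Columns converted from text to numbers so that range queries such as
# {"Sales": {"$gt": 500}} compare numerically. These names are assumptions.
NUMERIC_FIELDS = ("Sales", "Profit", "Quantity", "Discount")


def read_orders(csv_path):
    """Read the CSV into a list of dicts, one per row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    for row in rows:
        for field in NUMERIC_FIELDS:
            # Skip columns that are absent or empty in this row.
            if row.get(field) not in (None, ""):
                row[field] = float(row[field])
    return rows


def load_orders(csv_path, collection):
    """Insert every CSV row as one document and return the number of rows."""
    records = read_orders(csv_path)
    if records:  # insert_many rejects an empty list
        collection.insert_many(records)
    return len(records)
```

With pandas, the equivalent record list would come from `df.to_dict("records")` on the DataFrame loaded by `pd.read_csv()`.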



2. Retrieve and print all documents from the Orders collection.

## Retrieve data

### Subtask:
Retrieve and print all documents from the Orders collection.

**Reasoning**:
The current subtask is to retrieve and print all documents from the 'Orders' collection. I will use the `find()` method on the collection to get a cursor of all documents. Then I will iterate through the cursor and print each document.

In [None]:
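# Connect to MongoDB (assumes a local instance on the default port; adjust the URI as needed)
client = MongoClient("mongodb://localhost:27017/")

# Access the "superstore" database and the "Orders" collection
db = client["superstore"]
collection = db["Orders"]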

# Retrieve all documents from the collection
for document in collection.find():
    display(document)

3. Count and display the total number of documents in the Orders collection.


## Count documents

### Subtask:
Count and display the total number of documents in the Orders collection.

**Reasoning**:
To count the documents in the 'Orders' collection, I will use the `count_documents()` method. This method efficiently returns the number of documents that match a given query (in this case, an empty query `{}` to count all documents). I will then display the count.

In [None]:
# Count the number of documents in the collection
count = collection.count_documents({})

# Display the count
print(f"Total number of documents in the Orders collection: {count}")

4. Write a query to fetch all orders from the "West" region.

## Fetch orders from "West" region

### Subtask:
Write a query to fetch all orders from the "West" region.

**Reasoning**:
To fetch documents where the 'Region' field is 'West', I will use the `find()` method with a query document specifying the condition `{"Region": "West"}`. I will then iterate through the cursor and display each matching document.

In [None]:
# Fetch orders from the "West" region
west_orders = collection.find({"Region": "West"})

# Display the fetched orders
for order in west_orders:
    display(order)

5. Write a query to find orders where Sales is greater than 500.

## Find orders with Sales greater than 500

### Subtask:
Write a query to find orders where Sales is greater than 500.

**Reasoning**:
To find documents where the 'Sales' field is greater than 500, I will use the `find()` method with a query document using the `$gt` operator: `{"Sales": {"$gt": 500}}`. I will then iterate through the cursor and display each matching document.

In [None]:
# Find orders where Sales is greater than 500
high_sales_orders = collection.find({"Sales": {"$gt": 500}})

# Display the fetched orders
for order in high_sales_orders:
    display(order)

6. Fetch the top 3 orders with the highest Profit.


## Fetch top 3 orders by Profit

### Subtask:
Fetch the top 3 orders with the highest Profit.

**Reasoning**:
To fetch the top 3 orders with the highest profit, I will use the `find()` method. I will sort the results by the 'Profit' field in descending order using `sort("Profit", -1)` and limit the results to 3 using `limit(3)`. I will then iterate through the cursor and display each matching document.

In [None]:
# Fetch the top 3 orders with the highest Profit
top_profit_orders = collection.find().sort("Profit", -1).limit(3)

# Display the fetched orders
for order in top_profit_orders:
    display(order)

7. Update all orders with Ship Mode as "First Class" to "Premium Class".


## Update Ship Mode

### Subtask:
Update all orders with Ship Mode as "First Class" to "Premium Class".

**Reasoning**:
To update all documents where the 'Ship Mode' field is "First Class", I will use the `update_many()` method. The filter document will be `{"Ship Mode": "First Class"}` and the update document will use the `$set` operator to change the 'Ship Mode' to "Premium Class". I will then print the number of documents modified.

In [None]:
# Update orders with Ship Mode "First Class" to "Premium Class"
update_result = collection.update_many(
    {"Ship Mode": "First Class"},
    {"$set": {"Ship Mode": "Premium Class"}}
)

# Print the number of documents modified
print(f"Number of documents modified: {update_result.modified_count}")

8. Delete all orders where Sales is less than 50.

## Delete orders with Sales less than 50

### Subtask:
Delete all orders where Sales is less than 50.

**Reasoning**:
To delete documents where the 'Sales' field is less than 50, I will use the `delete_many()` method with a query document using the `$lt` operator: `{"Sales": {"$lt": 50}}`. I will then print the number of documents deleted.

In [None]:
# Delete orders where Sales is less than 50
delete_result = collection.delete_many({"Sales": {"$lt": 50}})

# Print the number of documents deleted
print(f"Number of documents deleted: {delete_result.deleted_count}")

9. Use aggregation to group orders by Region and calculate total sales per region.

## Aggregate sales by Region

### Subtask:
Use aggregation to group orders by Region and calculate total sales per region.

**Reasoning**:
To group orders by 'Region' and calculate the total sales for each region using aggregation, I will use the `aggregate()` method. The pipeline will include a `$group` stage to group by 'Region' and a `$sum` accumulator to calculate the total 'Sales' for each group. I will then iterate through the results and display them.

In [None]:
# Aggregate sales by Region
pipeline = [
    {"$group": {"_id": "$Region", "total_sales": {"$sum": "$Sales"}}}
]

sales_by_region = collection.aggregate(pipeline)

# Display the results
print("Total sales per region:")
for result in sales_by_region:
    display(result)

10. Fetch all distinct values for Ship Mode from the collection.

## Fetch distinct Ship Modes

### Subtask:
Fetch all distinct values for Ship Mode from the collection.

**Reasoning**:
To fetch all distinct values for the 'Ship Mode' field, I will use the `distinct()` method on the collection, specifying the field name 'Ship Mode'. I will then print the list of distinct values.

In [None]:
# Fetch all distinct values for Ship Mode
distinct_ship_modes = collection.distinct("Ship Mode")

# Display the distinct values
print("Distinct Ship Modes:")
for ship_mode in distinct_ship_modes:
    print(ship_mode)

11. Count the number of orders for each category.

## Count orders by Category

### Subtask:
Count the number of orders for each category.

**Reasoning**:
To count the number of orders for each category using aggregation, I will use the `aggregate()` method. The pipeline will include a `$group` stage to group by 'Category' and a `$sum` accumulator set to 1 (`{"$sum": 1}`) to count the documents in each group. I will then iterate through the results and display them.

In [None]:
# Count orders by Category
pipeline = [
    {"$group": {"_id": "$Category", "count": {"$sum": 1}}}
]

orders_by_category = collection.aggregate(pipeline)

# Display the results
print("Number of orders per category:")
for result in orders_by_category:
    display(result)