**THEORETICAL QUESTIONS**

**1. What are the key differences between SQL and NoSQL databases?**

Ans:
SQL and NoSQL databases have distinct approaches to storing and managing data. Here are the key differences:
Schema: SQL databases use a fixed schema, whereas NoSQL databases have dynamic or flexible schemas.
Data Structure: SQL databases use tables, whereas NoSQL databases use documents, key-value pairs, graphs, or wide-column stores.
Scalability: SQL databases scale vertically, whereas NoSQL databases scale horizontally.
Data Relationships: SQL databases are better suited for complex relationships, whereas NoSQL databases are better for simple or no relationships.
ACID Compliance: SQL databases follow ACID principles, whereas NoSQL databases often sacrifice some ACID properties for higher performance and scalability.
Query Language: SQL databases use Structured Query Language (SQL), whereas NoSQL databases use query languages specific to their data model.

**2.  What makes MongoDB a good choice for modern applications?**

Ans:
MongoDB is a popular NoSQL database that offers several benefits for modern applications. Here are some reasons why MongoDB is a good choice:
Flexible Schema: MongoDB's dynamic schema allows for easy adaptation to changing data structures, reducing the need for costly schema migrations.
Scalability: MongoDB scales horizontally, making it suitable for high-traffic applications and large datasets.
High Performance: MongoDB's document-based data model and indexing capabilities enable fast query performance.
Easy Data Integration: MongoDB supports various data formats, including JSON, making it easy to integrate with modern web and mobile applications.
Real-time Data Processing: MongoDB's support for real-time data processing and analytics enables applications to respond quickly to changing data.
Cloud-Native: MongoDB is designed for cloud environments, offering flexible deployment options and seamless scalability.
Large Community: MongoDB has a large and active community, ensuring there, are many resources available for learning and troubleshooting.

**3. Explain the concept of collections in MongoDB.**

Ans:
In MongoDB, a collection is a group of documents that are stored together in a database. Collections are similar to tables in relational databases, but they don't enforce a fixed schema. Here's how collections work:
Document grouping: Collections store documents that share a common purpose or theme.
Schema flexibility: Each document in a collection can have a different structure, allowing for flexible data modeling.
No strict schema: Unlike relational databases, MongoDB collections don't enforce a strict schema, giving you more freedom to adapt to changing data structures.
Indexing and querying: You can create indexes on collections to improve query performance and use MongoDB's query language to retrieve specific documents.

**4.  How does MongoDB ensure high availability using replication?**

Ans:
MongoDB ensures high availability using replication, which involves maintaining multiple copies of data across different nodes. Here's how it works:
Replica Set: A replica set is a group of MongoDB nodes that maintain the same data set. One node is designated as the primary node, and the others are secondary nodes.
Primary Node: The primary node accepts write operations and replicates the data to secondary nodes.
Secondary Nodes: Secondary nodes replicate the data from the primary node and can become the new primary if the current primary fails.
Automatic Failover: If the primary node fails, the replica set automatically elects a new primary node, ensuring minimal downtime.
Data Redundancy: Replication provides data redundancy, which helps protect against data loss due to hardware failure or other issues.

**5. What are the main benefits of MongoDB Atlas **

Ans:
MongoDB Atlas is a cloud-based MongoDB service that offers several benefits, including:
Managed Service: Atlas handles database management tasks, such as provisioning, patching, and scaling, freeing up your team to focus on application development.
Scalability: Atlas allows you to easily scale your database up or down to match changing application demands.
High Availability: Atlas provides built-in replication and automatic failover, ensuring high availability and data protection.
Security: Atlas offers advanced security features, such as encryption at rest and in transit, network access controls, and identity and access management.
Monitoring and Alerts: Atlas provides real-time monitoring and alerting, helping you identify and resolve issues quickly.
Global Clusters: Atlas supports global clusters, enabling you to distribute data across multiple regions and improve application performance.
Integration with MongoDB Tools: Atlas integrates seamlessly with MongoDB's suite of tools, including Compass, Stitch, and Charts.
Cost-Effective: Atlas offers a pay-as-you-go pricing model, allowing you to only pay for the resources you use.

**6. What is the role of indexes in MongoDB, and how do they improve performance?**

Ans:
In MongoDB, indexes play a crucial role in improving query performance. Here's how:
Faster Query Execution: Indexes allow MongoDB to quickly locate specific data, reducing the time it takes to execute queries.
Reduced Data Scanning: By using an index, MongoDB can avoid scanning entire collections, which can be time-consuming and resource-intensive.
Improved Query Optimization: Indexes help MongoDB's query optimizer choose the most efficient query plan, leading to better performance.

**7.  Describe the stages of the MongoDB aggregation pipeline.**

Ans:
The MongoDB aggregation pipeline is a powerful framework for data processing and analysis. Here are the main stages:

$match: Filters documents based on specified conditions.
$project: Reshapes documents, including adding, removing, or renaming fields.

$group: Groups documents by specified fields and applies aggregation operators (e.g., $sum, $avg).

$sort: Sorts documents in ascending or descending order.
$limit: Limits the number of documents returned.

$skip: Skips a specified number of documents.
$unwind: Deconstructs array fields into separate documents.

$lookup: Performs left outer join with another collection.
$addFields: Adds new fields to documents.

$bucket: Groups documents into buckets based on specified boundaries.

**8.  What is sharding in MongoDB? How does it differ from replication?**

Ans:
In MongoDB, sharding is a technique for distributing data across multiple servers or nodes to improve scalability and performance. Here's how it works:
Horizontal partitioning: Sharding involves dividing data into smaller chunks, called shards, and distributing them across multiple nodes.
Shard key: A shard key is used to determine which shard a document belongs to.
Scalability: Sharding allows MongoDB to scale horizontally, handling large amounts of data and high traffic.
Sharding differs from replication in several ways:
Purpose: Replication is used for high availability and data protection, while sharding is used for scalability and performance.
Data distribution: Replication duplicates data across nodes, while sharding distributes data across nodes.
Node roles: In replication, nodes can be primary or secondary, while in sharding, nodes are shards that store a portion of the data.

**9. What is PyMongo, and why is it used?**

Ans:
PyMongo is a Python distribution containing tools for working with MongoDB. It's a popular driver that allows Python developers to interact with MongoDB databases.
PyMongo is used for:
Connecting to MongoDB: PyMongo provides a way to connect to MongoDB instances, including replica sets and sharded clusters.
Performing CRUD operations: PyMongo allows you to create, read, update, and delete documents in MongoDB collections.
Querying data: PyMongo supports various query methods, including filtering, sorting, and aggregating data.
Working with MongoDB features: PyMongo provides access to MongoDB features like GridFS, MapReduce, and aggregation framework.

**10. What are the ACID properties in the context of MongoDB transactions?**

Ans:
In the context of MongoDB transactions, ACID properties refer to a set of guarantees that ensure database transactions are processed reliably. ACID stands for:
Atomicity: Ensures that transactions are treated as a single, indivisible unit. If any part of the transaction fails, the entire transaction is rolled back.
Consistency: Ensures that the database remains in a consistent state, even after multiple transactions have been applied.
Isolation: Ensures that transactions are executed independently, without interference from other transactions.
Durability: Ensures that once a transaction is committed, its effects are permanent and survive even in the event of a failure.

**11. What is the purpose of MongoDB’s explain() function?**

Ans:
The explain() function in MongoDB is used to:
Analyze query performance: explain() provides detailed information about how MongoDB executes a query, including the query plan, index usage, and execution statistics.
Optimize queries: By analyzing the output of explain(), you can identify performance bottlenecks and optimize your queries for better performance.
Understand index usage: explain() shows whether MongoDB is using an index to fulfill a query, and if so, which index is being used.

**12.  How does MongoDB handle schema validation?**

Ans;
MongoDB provides schema validation, which allows you to enforce structure and rules on your data. Here's how it works:
JSON Schema: MongoDB uses JSON Schema to define validation rules for documents in a collection.
Validation rules: You can specify rules for fields, such as data type, format, and allowed values.
Validation levels: MongoDB provides different validation levels, including:
Off: No validation is performed.
Strict: Validation is enforced for all documents.
Moderate: Validation is enforced for inserts and updates, but existing documents are not validated.
Validation actions: You can specify actions to take when validation fails, such as:
Error: Reject the operation and return an error.
Warn: Log a warning message, but allow the operation to proceed.

**13.  What is the difference between a primary and a secondary node in a replica set?**

Ans:
In a MongoDB replica set:
Primary node: The primary node is the main node that accepts write operations. All writes are directed to the primary node, and it is responsible for replicating the data to secondary nodes.
Secondary node: Secondary nodes replicate the data from the primary node and can serve read traffic. They can become the new primary node if the current primary fails.
Key differences:
Write operations: Only the primary node accepts write operations. Secondary nodes replicate the data from the primary node.
Read operations: Both primary and secondary nodes can serve read traffic, but secondary nodes may lag behind the primary node due to replication delay.
Failover: If the primary node fails, a secondary node can be elected as the new primary node, ensuring high availability.

**14. What security mechanisms does MongoDB provide for data protection?**

Ans:
MongoDB provides several security mechanisms for data protection, including:
Authentication: MongoDB supports various authentication mechanisms, such as username/password, LDAP, and Kerberos.
Authorization: Role-Based Access Control (RBAC) allows you to define roles and permissions for users and applications.
Encryption: MongoDB supports encryption at rest (using WiredTiger storage engine) and in transit (using TLS/SSL).
Auditing: MongoDB provides auditing capabilities to track database events and changes.
Network isolation: MongoDB supports network isolation using firewalls, IP whitelisting, and VPC peering.
Data encryption: MongoDB supports field-level encryption, allowing you to encrypt specific fields in your documents.

**15.  Explain the concept of embedded documents and when they should be used.**

Ans:
In MongoDB, embedded documents are documents that are nested inside other documents. They are used to store related data in a single document, reducing the need for separate collections and joins.
Embedded documents are useful when:
One-to-one relationships: When a document has a single, tightly coupled relationship with another document.
One-to-few relationships: When a document has a small number of related documents.
Data locality: When related data is frequently accessed together.
Benefits of embedded documents:
Improved performance: Reduced need for joins and separate queries.
Simplified data model: Related data is stored in a single document.
Easier data retrieval: Related data can be retrieved in a single query.
When to use embedded documents:
Frequently accessed together: When related data is often accessed together.
Small amounts of data: When the embedded data is relatively small.
No need for separate queries: When you don't need to query the embedded data separately.

**16. What is the purpose of MongoDB’s $lookup stage in aggregation?**

Ans:
The $lookup stage in MongoDB's aggregation framework is used to:
Perform left outer join: Combine data from two collections based on a common field.
Enrich documents: Add fields from another collection to the current documents.
The $lookup stage allows you to:
Join collections: Combine data from multiple collections.
Retrieve related data: Fetch data from another collection based on a common field.

**17.  What are some common use cases for MongoDB?**

Ans:
MongoDB is a versatile database that can be used in a variety of scenarios. Some common use cases include:
Real-time analytics: MongoDB's high performance and scalability make it well-suited for real-time analytics and reporting.
Content management: MongoDB's flexible schema and document-based data model make it a good fit for content management systems.
IoT data storage: MongoDB's ability to handle large amounts of semi-structured data makes it a popular choice for IoT data storage and analysis.
Mobile apps: MongoDB's scalability and performance make it a good choice for mobile apps that require a robust backend database.
E-commerce platforms: MongoDB's flexibility and scalability make it a popular choice for e-commerce platforms that require a robust product catalog and order management system.
Log and event data storage: MongoDB's ability to handle large amounts of semi-structured data makes it a popular choice for log and event data storage and analysis.
Personalization and recommendation engines: MongoDB's ability to handle large amounts of data and perform complex queries makes it a good fit for personalization and recommendation engines.

**18. What are the advantages of using MongoDB for horizontal scaling?**

Ans:
MongoDB provides several advantages for horizontal scaling:
Sharding: MongoDB's sharding feature allows you to distribute data across multiple servers, making it easy to scale horizontally.
Automatic data distribution: MongoDB automatically distributes data across shards, reducing the complexity of scaling.
Load balancing: MongoDB's sharding feature includes load balancing, ensuring that no single server is overwhelmed.
Increased storage capacity: By adding more shards, you can increase storage capacity and handle larger amounts of data.
Improved performance: Horizontal scaling with MongoDB can improve performance by distributing the load across multiple servers.
Easy addition of new nodes: MongoDB makes it easy to add new nodes to a shard, allowing you to scale your database as needed.

**19. How do MongoDB transactions differ from SQL transactions?**

Ans:
MongoDB transactions and SQL transactions share some similarities, but they also have some key differences:
Multi-document transactions: MongoDB supports multi-document transactions, which allow you to perform atomic operations across multiple documents.
Document-level atomicity: In MongoDB, single-document operations are atomic by default, whereas SQL databases typically require explicit transactions for atomicity.
Snapshot isolation: MongoDB transactions use snapshot isolation, which ensures that transactions see a consistent view of the data.
Distributed transactions: MongoDB supports distributed transactions, which allow you to perform transactions across multiple shards

**20. What are the main differences between capped collections and regular collections?**

Ans:
The main differences between capped collections and regular collections in MongoDB are:
Fixed size: Capped collections have a fixed size, whereas regular collections can grow dynamically.
FIFO behavior: Capped collections follow a First-In-First-Out (FIFO) behavior, where the oldest documents are automatically removed when the collection reaches its maximum size.
Insertion order: Capped collections maintain the insertion order of documents, whereas regular collections do not.
High performance: Capped collections are optimized for high-performance logging and queuing applications.
Limited updates: Capped collections have limited update capabilities, as documents cannot be resized or updated in a way that would change their size.

**21. What is the purpose of the $match stage in MongoDB’s aggregation pipeline?**

Ans:
The $match stage in MongoDB's aggregation pipeline is used to:
Filter documents: Select only the documents that match a specified condition.
Reduce data: Reduce the amount of data that needs to be processed in subsequent stages.
The $match stage allows you to:
Specify conditions: Use query operators to specify conditions for document selection.
Use indexes: Take advantage of indexes to improve performance.

**22. How can you secure access to a MongoDB database?**

Ans:
To secure access to a MongoDB database:
Enable authentication: Require users to authenticate with a username and password.
Use role-based access control: Assign roles to users that define their permissions and access levels.
Use encryption: Enable encryption at rest and in transit to protect data from unauthorized access.
Configure network access: Limit network access to the database using firewalls, IP whitelisting, and VPC peering.
Use strong passwords: Enforce strong password policies for all users.
Monitor database activity: Use auditing and logging to monitor database activity and detect potential security threats.

**23. What is MongoDB’s WiredTiger storage engine, and why is it important?**

Ans:
WiredTiger is a storage engine in MongoDB that provides:
High performance: WiredTiger is designed for high-performance and concurrency.
Document-level concurrency: WiredTiger allows for document-level concurrency, reducing contention and improving performance.
Compression: WiredTiger provides compression, reducing storage costs and improving data transfer times.
Encryption: WiredTiger supports encryption at rest, providing an additional layer of security.
Checkpointing: WiredTiger uses checkpointing to ensure data consistency and durability.
WiredTiger is important because it:
Improves performance: WiredTiger's concurrency and compression features improve overall database performance.
Enhances security: WiredTiger's encryption feature provides an additional layer of security for sensitive data.
Reduces storage costs: WiredTiger's compression feature reduces storage costs and improves data transfer times.

***PRACTICAL QUESTIONS***

**Q1. Write a Python script to load the Superstore dataset from a CSV file into MongoDB?**

In [1]:
! pip install pymongo

Collecting pymongo
  Downloading pymongo-4.13.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (22 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading pymongo-4.13.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.4/1.4 MB[0m [31m3.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dnspython-2.7.0-py3-none-any.whl (313 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m313.6/313.6 kB[0m [31m19.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dnspython, pymongo
Successfully installed dnspython-2.7.0 pymongo-4.13.2


In [16]:
import pandas as pd
from pymongo import MongoClient

df= pd.read_csv('superstore.csv', encoding= 'windows-1252')

# connect to MongoDB
client= MongoClient("mongodb+srv://abhinavsk5899:3nrstpbOWZ9U6qfZ@cluster0.45hnw79.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0")

db= client['RetailStore'] # creating database
collection= db['Superstore'] # creating collection

# convert datafram to dictionary formate
data_dict= df.to_dict(orient='records')

# insert into mongodb
collection.insert_many(data_dict)

# extractig a particular object/document for just checking
from bson import ObjectId
print(collection.find_one({'_id':ObjectId('688266d8bdafbb3e5b9c0ba8')}))

None


**Q2. Retrieve and print all documents from the Orders collection.**

In [5]:
result= collection.find()

#for documents in  result:
  #print(documents)


**3. Count and display the total number of documents in the Orders collection.**

In [6]:
count= collection.count_documents({})
print(f"Total number of documents in 'Superstore' collection are : {count}")

Total number of documents in 'Superstore' collection are : 40568


**4. Write a query to fetch all orders from the "West" region.**

In [7]:
import pprint
region_west= collection.find({'Region':'West'})
#for documents in region_west:
 #pprint.pprint(documents)

**5. Write a query to find orders where Sales is greater than 500.**

In [8]:
sales= collection.find({'Sales':{'$gt':500}})
#for documents in sales:
 # pprint.pprint(documents)

In [9]:
# checking how many documents following the above conditions
total_documents= collection.count_documents({'Sales':{'$gt':500}})
total_documents

6972

**6. Fetch the top 3 orders with the highest Profit.**

In [10]:
result= collection.find().sort('Profit',-1).limit(3)
for documents in result:
  pprint.pprint(documents)


{'Category': 'Technology',
 'City': 'Lafayette',
 'Country': 'United States',
 'Customer ID': 'TC-20980',
 'Customer Name': 'Tamara Chand',
 'Discount': 0.0,
 'Order Date': '10/2/2016',
 'Order ID': 'CA-2016-118689',
 'Postal Code': 47905,
 'Product ID': 'TEC-CO-10004722',
 'Product Name': 'Canon imageCLASS 2200 Advanced Copier',
 'Profit': 8399.976,
 'Quantity': 5,
 'Region': 'Central',
 'Row ID': 6827,
 'Sales': 17499.95,
 'Segment': 'Corporate',
 'Ship Date': '10/9/2016',
 'Ship Mode': 'Standard Class',
 'State': 'Indiana',
 'Sub-Category': 'Copiers',
 '_id': ObjectId('68834f0da79e53d565b8dcd8')}
{'Category': 'Technology',
 'City': 'Lafayette',
 'Country': 'United States',
 'Customer ID': 'TC-20980',
 'Customer Name': 'Tamara Chand',
 'Discount': 0.0,
 'Order Date': '10/2/2016',
 'Order ID': 'CA-2016-118689',
 'Postal Code': 47905,
 'Product ID': 'TEC-CO-10004722',
 'Product Name': 'Canon imageCLASS 2200 Advanced Copier',
 'Profit': 8399.976,
 'Quantity': 5,
 'Region': 'Central',
 '

**7. Update all orders with Ship Mode as "First Class" to "Premium Class.**

In [11]:
update= collection.update_many({'Ship Mode':'First Class'},
                               {'$set':{'Ship Mode':'First Class'}}
                               )
print(f'Number of total value updated are: {update.matched_count}')

Number of total value updated are: 6272


**8. Delete all orders where Sales is less than 50.**

In [12]:
delete= collection.delete_many({'Sales':{'$lt':50}})
print(f'Total deleted documents are {delete.deleted_count}')

Total deleted documents are 9698


**9. Use aggregation to group orders by Region and calculate total sales per region.**

In [13]:
# aggregation pipeline
pipeline=[
    {
        '$group':{
            '_id':'$Region',
            'total_sales_per_region':{'$sum':'$Sales'}
        }
    },
    {
        '$sort':{'total_sales_per_region':1} # sorting in ascending order
    }

]

# run the aggregation
results= collection.aggregate(pipeline)

# display the result
for region in results:
    print(f"Region: {region['_id']}, Total Sales: {region['total_sales_per_region']:.2f}")

Region: South, Total Sales: 2256139.87
Region: Central, Total Sales: 2877671.07
Region: East, Total Sales: 3906826.23
Region: West, Total Sales: 4168119.72


**10. Fetch all distinct values for Ship Mode from the collection.**

In [14]:
distinct_ship_mode= collection.distinct('Ship Mode')
print(distinct_ship_mode)

['First Class', 'Same Day', 'Second Class', 'Standard Class']


**11. Count the number of orders for each category.**

In [15]:
pipeline=[
        {'$group':{'_id':'$Category','order_count':{'$sum':1}}},
        {'$sort':{'ordercount':-1}}
]

# run the aggregation
result= collection.aggregate(pipeline)

# display the result
for docs in result:
  print(docs)

{'_id': 'Technology', 'order_count': 8976}
{'_id': 'Furniture', 'order_count': 9438}
{'_id': 'Office Supplies', 'order_count': 12456}
