In [11]:
"""
MongoDB Assignment Solution - PWSkills
Complete solution for both theoretical and practical questions
"""

# First, install required packages (run this in Colab)
# !pip install pymongo pandas

import pymongo
import pandas as pd
from pymongo import MongoClient
import json

print("=== MONGODB ASSIGNMENT SOLUTIONS ===\n")

# ==========================================
# THEORETICAL QUESTIONS - ANSWERS
# ==========================================

theoretical_answers = {
    "Q1": """Key differences between SQL and NoSQL databases:
    - Structure: SQL uses structured tables with fixed schema, NoSQL uses flexible document/key-value structures
    - Schema: SQL requires predefined schema, NoSQL is schema-less or schema-flexible
    - Scaling: SQL scales vertically, NoSQL scales horizontally
    - ACID: SQL guarantees full ACID properties, NoSQL may sacrifice some for performance
    - Query Language: SQL uses structured query language, NoSQL uses various query methods
    - Relationships: SQL uses JOINs, NoSQL embeds documents or uses references""",

    "Q2": """What makes MongoDB a good choice for modern applications:
    - Flexible schema design accommodates evolving data requirements
    - Horizontal scaling capabilities handle large datasets
    - Rich query capabilities with aggregation pipeline
    - Built-in replication for high availability
    - JSON-like document storage matches application objects
    - Strong community and ecosystem support""",

    "Q3": """Collections in MongoDB:
    Collections are groups of MongoDB documents, equivalent to tables in SQL databases.
    They don't enforce a schema by default and can contain documents with different structures.
    Documents in a collection should be related but don't need identical fields.""",

    "Q4": """MongoDB high availability using replication:
    - Uses replica sets with primary and secondary nodes
    - Primary handles all write operations
    - Secondaries replicate primary's data
    - Automatic failover if primary fails
    - Reads can be distributed across secondaries""",

    "Q5": """Main benefits of MongoDB Atlas:
    - Fully managed cloud database service
    - Automated backups and monitoring
    - Built-in security features
    - Global clusters and multi-region deployment
    - Easy scaling and performance optimization
    - Integration with cloud providers (AWS, Azure, GCP)""",

    "Q6": """Role of indexes in MongoDB:
    Indexes improve query performance by letting MongoDB locate matching documents without scanning the whole collection.
    Benefits: Faster query execution, reduced disk I/O, efficient sorting
    Trade-off: Every index adds storage and write overhead
    Types: Single field, compound, multikey, text, geospatial indexes""",

    "Q7": """MongoDB aggregation pipeline stages:
    1. $match - Filter documents
    2. $group - Group documents and perform calculations
    3. $project - Reshape documents
    4. $sort - Sort documents
    5. $limit - Limit number of documents
    6. $lookup - Join collections
    7. $unwind - Deconstruct arrays""",

    "Q8": """Sharding vs Replication in MongoDB:
    Sharding: Horizontal partitioning of data across multiple servers for scalability
    Replication: Creating copies of data across multiple servers for availability
    Sharding splits data, replication duplicates it""",

    "Q9": """PyMongo:
    PyMongo is the official Python driver for MongoDB.
    Used for connecting Python applications to MongoDB databases,
    enabling CRUD operations, aggregation, and database administration.""",

    "Q10": """ACID properties in MongoDB transactions:
    - Atomicity: All operations succeed or fail together
    - Consistency: Database remains in valid state
    - Isolation: Transactions don't interfere with each other
    - Durability: Committed changes persist
    MongoDB supports multi-document ACID transactions on replica sets since version 4.0 and on sharded clusters since version 4.2""",

    "Q11": """MongoDB's explain() function:
    Returns the query plan and, in 'executionStats' mode, execution statistics, including:
    - Execution plan and stages
    - Number of documents examined
    - Index usage information
    - Query execution time
    - Performance optimization insights""",

    "Q12": """MongoDB schema validation:
    Uses JSON Schema to validate document structure on insert/update.
    Can specify required fields, data types, value ranges, and custom validation rules.
    Validation levels: strict (default; checks all inserts and updates) or moderate (checks inserts, and updates only to documents that already satisfy the rules)""",

    "Q13": """Primary vs Secondary nodes in replica set:
    Primary: Receives all write operations, single primary per replica set
    Secondary: Replicate primary's data, can serve read operations if configured
    Only primary can accept writes, secondaries maintain data copies""",

    "Q14": """MongoDB security mechanisms:
    - Authentication (SCRAM, x.509 certificates)
    - Authorization (role-based access control)
    - Encryption (at rest and in transit)
    - Network security (IP whitelisting, VPC)
    - Auditing and monitoring capabilities""",

    "Q15": """Embedded documents:
    Documents nested within other documents. Use when:
    - Data has one-to-one or one-to-few relationships
    - Embedded data is accessed together with parent
    - Document size remains reasonable
    - No need to query embedded data independently""",

    "Q16": """MongoDB's $lookup stage:
    Performs a left outer join with another collection in the same database.
    Combines documents from different collections based on specified conditions.
    Similar to SQL JOIN operations""",

    "Q17": """Common MongoDB use cases:
    - Content management systems
    - Real-time analytics
    - IoT data storage
    - Product catalogs
    - Social media applications
    - Mobile applications
    - Gaming applications""",

    "Q18": """MongoDB advantages for horizontal scaling:
    - Automatic sharding distributes data
    - No single point of failure when each shard is deployed as a replica set
    - Near-linear scalability as servers are added
    - Maintains performance under load
    - Cost-effective scaling on commodity hardware""",

    "Q19": """MongoDB vs SQL transactions:
    MongoDB: Single-document operations are always atomic; multi-document transactions (since 4.0) can span collections
    SQL: Multi-statement ACID transactions with commit and rollback are the default model
    MongoDB optimizes for document-level operations
    SQL optimizes for relational integrity""",

    "Q20": """Capped vs Regular collections:
    Capped: Fixed size, FIFO ordering, high throughput
    Regular: Dynamic size, flexible operations
    Capped collections are ideal for logging and caching""",

    "Q21": """$match stage in aggregation:
    Filters documents based on specified criteria.
    Similar to WHERE clause in SQL.
    Should be placed early in pipeline for performance""",

    "Q22": """Securing MongoDB access:
    - Enable authentication and authorization
    - Use TLS/SSL for connections
    - Configure firewall rules
    - Implement role-based access control
    - Regular security audits
    - Keep MongoDB updated""",

    "Q23": """WiredTiger storage engine:
    Default storage engine since MongoDB 3.2.
    Features: Document-level concurrency, compression,
    encryption at rest (MongoDB Enterprise), checkpointing, and improved performance"""
}
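
# A minimal sketch of the schema validation described in Q12. It assumes the
# Superstore field names used later in this notebook, and the list of regions
# is an assumption about the dataset. build_orders_validator is a hypothetical
# helper: it only builds the validator document and never contacts a server.
def build_orders_validator():
    """Return an illustrative $jsonSchema validator for order documents."""
    return {
        "$jsonSchema": {
            "bsonType": "object",
            # Documents missing any of these fields would be rejected
            "required": ["Order ID", "Region", "Sales"],
            "properties": {
                "Order ID": {"bsonType": "string"},
                "Region": {"enum": ["West", "East", "Central", "South"]},
                "Sales": {"bsonType": ["double", "int", "long"], "minimum": 0},
            },
        }
    }

# Possible usage (needs a running server, so it is left commented out;
# "orders_validated" is a hypothetical collection name):
# db.create_collection("orders_validated",
#                      validator=build_orders_validator(),
#                      validationLevel="moderate")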

# Print all theoretical answers
print("THEORETICAL ANSWERS:")
print("=" * 50)
for q, answer in theoretical_answers.items():
    print(f"{q}: {answer}\n")
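
# The stages listed in Q7 run in order, each consuming the previous stage's
# output. The sketch below chains four of them, assuming the Superstore fields
# "Region", "Category" and "Sales"; top_categories_pipeline is a hypothetical
# helper that only builds the pipeline and does not run it.
def top_categories_pipeline(region, limit=3):
    """Build a pipeline for the best-selling categories within one region."""
    return [
        {"$match": {"Region": region}},  # filter first, as Q21 recommends
        {"$group": {"_id": "$Category", "total_sales": {"$sum": "$Sales"}}},
        {"$sort": {"total_sales": -1}},  # highest total first
        {"$limit": limit},
    ]

# Possible usage once `collection` is connected:
# list(collection.aggregate(top_categories_pipeline("West")))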

# ==========================================
# PRACTICAL QUESTIONS - MONGODB OPERATIONS
# ==========================================

print("\n" + "=" * 50)
print("PRACTICAL QUESTIONS - CODE SOLUTIONS")
print("=" * 50)

# Note: For Google Colab, you'll need to either:
# 1. Use MongoDB Atlas (cloud)
# 2. Install MongoDB locally
# 3. Use a Docker container

# Connection setup (modify as needed)
print("\n1. MongoDB Connection Setup:")
print("# Replace with your MongoDB connection string")
connection_code = '''
# For local MongoDB
client = MongoClient('mongodb://localhost:27017/')

# For MongoDB Atlas (replace with your connection string)
# client = MongoClient('mongodb+srv://username:password@cluster.mongodb.net/')

db = client['superstore_db']
collection = db['orders']
'''
print(connection_code)
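
# Credentials in a connection string must be percent-encoded when they contain
# reserved characters such as "@" or "/". A small sketch using only the
# standard library; build_atlas_uri and its arguments are hypothetical, and
# the host is a placeholder like the one in the template above.
from urllib.parse import quote_plus

def build_atlas_uri(username, password, host):
    """Return a mongodb+srv URI with safely escaped credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}/"

# Possible usage:
# client = MongoClient(build_atlas_uri("username", "p@ss/word", "cluster.mongodb.net"))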

# Sample code for each practical question
practical_solutions = {
    "P1": '''
# 1. Load Superstore dataset from CSV into MongoDB
import pandas as pd

def load_csv_to_mongodb():
    # Load CSV file
    df = pd.read_csv('superstore.csv')  # Replace with your CSV path

    # Convert DataFrame to dictionary records
    records = df.to_dict('records')

    # Insert into MongoDB
    result = collection.insert_many(records)
    print(f"Inserted {len(result.inserted_ids)} documents")
    return result

# load_csv_to_mongodb()
''',

    "P2": '''
# 2. Retrieve and print all documents from Orders collection
def get_all_orders():
    # Materialize the cursor once so the data is not fetched twice
    orders = list(collection.find())
    for order in orders:
        print(order)
    return orders

# all_orders = get_all_orders()
''',

    "P3": '''
# 3. Count total number of documents
def count_documents():
    count = collection.count_documents({})
    print(f"Total documents: {count}")
    return count

# document_count = count_documents()
''',

    "P4": '''
# 4. Fetch all orders from "West" region
def get_west_orders():
    west_orders = collection.find({"Region": "West"})
    result = list(west_orders)
    print(f"Found {len(result)} orders from West region")
    for order in result[:5]:  # Print first 5
        print(order)
    return result

# west_orders = get_west_orders()
''',

    "P5": '''
# 5. Find orders where Sales is greater than 500
def get_high_sales_orders():
    high_sales = collection.find({"Sales": {"$gt": 500}})
    result = list(high_sales)
    print(f"Found {len(result)} orders with sales > 500")
    for order in result[:5]:
        print(f"Order ID: {order.get('Order ID')}, Sales: {order.get('Sales')}")
    return result

# high_sales_orders = get_high_sales_orders()
''',

    "P6": '''
# 6. Fetch top 3 orders with highest Profit
def get_top_profit_orders():
    top_orders = collection.find().sort("Profit", -1).limit(3)
    result = list(top_orders)
    print("Top 3 orders by profit:")
    for i, order in enumerate(result, 1):
        print(f"{i}. Order ID: {order.get('Order ID')}, Profit: {order.get('Profit')}")
    return result

# top_profit_orders = get_top_profit_orders()
''',

    "P7": '''
# 7. Update Ship Mode from "First Class" to "Premium Class"
def update_ship_mode():
    result = collection.update_many(
        {"Ship Mode": "First Class"},
        {"$set": {"Ship Mode": "Premium Class"}}
    )
    print(f"Updated {result.modified_count} documents")
    return result

# update_result = update_ship_mode()
''',

    "P8": '''
# 8. Delete orders where Sales < 50
def delete_low_sales():
    result = collection.delete_many({"Sales": {"$lt": 50}})
    print(f"Deleted {result.deleted_count} documents")
    return result

# delete_result = delete_low_sales()
''',

    "P9": '''
# 9. Aggregation: Group by Region and calculate total sales
def sales_by_region():
    pipeline = [
        {
            "$group": {
                "_id": "$Region",
                "total_sales": {"$sum": "$Sales"},
                "order_count": {"$sum": 1}
            }
        },
        {"$sort": {"total_sales": -1}}
    ]

    result = list(collection.aggregate(pipeline))
    print("Sales by Region:")
    for region_data in result:
        print(f"Region: {region_data['_id']}, Total Sales: ${region_data['total_sales']:.2f}, Orders: {region_data['order_count']}")
    return result

# region_sales = sales_by_region()
''',

    "P10": '''
# 10. Fetch distinct Ship Mode values
def get_distinct_ship_modes():
    distinct_modes = collection.distinct("Ship Mode")
    print("Distinct Ship Modes:")
    for mode in distinct_modes:
        print(f"- {mode}")
    return distinct_modes

# ship_modes = get_distinct_ship_modes()
''',

    "P11": '''
# 11. Count orders by Category
def count_by_category():
    pipeline = [
        {
            "$group": {
                "_id": "$Category",
                "count": {"$sum": 1}
            }
        },
        {"$sort": {"count": -1}}
    ]

    result = list(collection.aggregate(pipeline))
    print("Orders by Category:")
    for category_data in result:
        print(f"Category: {category_data['_id']}, Count: {category_data['count']}")
    return result

# category_counts = count_by_category()
'''
}
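
# P1 inserts every CSV row in a single insert_many call. For larger files it
# can help to insert in batches, mainly for progress reporting and because
# insert_many rejects an empty list. chunked is a hypothetical helper, and the
# default batch size is an arbitrary assumption.
def chunked(records, size=1000):
    """Yield consecutive slices of `records` with at most `size` items each."""
    for start in range(0, len(records), size):
        yield records[start:start + size]

# Possible usage inside load_csv_to_mongodb():
# for batch in chunked(records):
#     collection.insert_many(batch)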

# Print all practical solutions
for p_num, solution in practical_solutions.items():
    print(f"\n{p_num}:")
    print(solution)
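
# P7 and P8 change or remove documents permanently. One cautious pattern,
# sketched here under the assumption that `coll` is a PyMongo collection, is
# to count the matching documents first and delete only on explicit
# confirmation. delete_with_preview is a hypothetical helper.
def delete_with_preview(coll, query, confirm=False):
    """Return the number of matching documents; delete them only if confirmed."""
    matching = coll.count_documents(query)
    if not confirm:
        return matching  # dry run: nothing is deleted
    return coll.delete_many(query).deleted_count

# Possible usage:
# delete_with_preview(collection, {"Sales": {"$lt": 50}})                # preview
# delete_with_preview(collection, {"Sales": {"$lt": 50}}, confirm=True)  # delete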

# Complete working example
complete_example = '''

# COMPLETE WORKING EXAMPLE - Copy this to run in Colab

# Step 1: Install required packages
# !pip install pymongo pandas

import pymongo
import pandas as pd
from pymongo import MongoClient
import json

# Step 2: Connect to MongoDB (modify connection string as needed)
collection = None  # stays None if the connection fails
try:
    # Local MongoDB; fail after 5 s instead of the default 30 s
    client = MongoClient('mongodb://localhost:27017/', serverSelectionTimeoutMS=5000)

    # Test connection
    client.server_info()
    print("Connected to MongoDB successfully!")

    db = client['superstore_db']
    collection = db['orders']

except Exception as e:
    print(f"Connection failed: {e}")
    print("Please check your MongoDB connection string and ensure MongoDB is running")

# Step 3: Sample data creation (if you don't have the CSV)
sample_data = [
    {
        "Order ID": "CA-2016-152156",
        "Order Date": "2016-11-08",
        "Ship Date": "2016-11-11",
        "Ship Mode": "Second Class",
        "Customer ID": "CG-12520",
        "Customer Name": "Claire Gute",
        "Segment": "Consumer",
        "Country": "United States",
        "City": "Henderson",
        "State": "Kentucky",
        "Postal Code": 42420,
        "Region": "South",
        "Product ID": "FUR-BO-10001798",
        "Category": "Furniture",
        "Sub-Category": "Bookcases",
        "Product Name": "Bush Somerset Collection Bookcase",
        "Sales": 261.96,
        "Quantity": 2,
        "Discount": 0.00,
        "Profit": 41.9136
    },
    {
        "Order ID": "CA-2016-152156",
        "Order Date": "2016-11-08",
        "Ship Date": "2016-11-11",
        "Ship Mode": "Second Class",
        "Customer ID": "CG-12520",
        "Customer Name": "Claire Gute",
        "Segment": "Consumer",
        "Country": "United States",
        "City": "Henderson",
        "State": "Kentucky",
        "Postal Code": 42420,
        "Region": "South",
        "Product ID": "FUR-CH-10000454",
        "Category": "Furniture",
        "Sub-Category": "Chairs",
        "Product Name": "Hon Deluxe Fabric Upholstered Stacking Chairs",
        "Sales": 731.94,
        "Quantity": 3,
        "Discount": 0.00,
        "Profit": 219.582
    }
]

# Insert sample data
try:
    collection.insert_many(sample_data)
    print("Sample data inserted successfully!")
except Exception as e:
    print(f"Error inserting data: {e}")
'''
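
# The sample documents above store "Order Date" and "Ship Date" as strings.
# Date operators in aggregation (for example $year) expect BSON dates, which
# PyMongo writes when it is given datetime objects. A sketch of the
# conversion, assuming the YYYY-MM-DD format shown in the sample data;
# parse_order_dates is a hypothetical helper.
from datetime import datetime

def parse_order_dates(doc, fields=("Order Date", "Ship Date")):
    """Return a copy of `doc` with the given string fields parsed as datetimes."""
    parsed = dict(doc)  # shallow copy so the caller's document is unchanged
    for field in fields:
        if isinstance(parsed.get(field), str):
            parsed[field] = datetime.strptime(parsed[field], "%Y-%m-%d")
    return parsed

# Possible usage before inserting:
# collection.insert_many([parse_order_dates(d) for d in sample_data])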

=== MONGODB ASSIGNMENT SOLUTIONS ===

THEORETICAL ANSWERS:
Q1: Key differences between SQL and NoSQL databases:
    - Structure: SQL uses structured tables with fixed schema, NoSQL uses flexible document/key-value structures
    - Schema: SQL requires predefined schema, NoSQL is schema-less or schema-flexible
    - Scaling: SQL scales vertically, NoSQL scales horizontally
    - ACID: SQL guarantees full ACID properties, NoSQL may sacrifice some for performance
    - Query Language: SQL uses structured query language, NoSQL uses various query methods
    - Relationships: SQL uses JOINs, NoSQL embeds documents or uses references

Q2: What makes MongoDB a good choice for modern applications:
    - Flexible schema design accommodates evolving data requirements
    - Horizontal scaling capabilities handle large datasets
    - Rich query capabilities with aggregation pipeline
    - Built-in replication for high availability
    - JSON-like document storage matches application objects
    - S