<h1 align="center" style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">MSc in Data Analytics: Big Data Storage and Processing</h1>

### Table of Contents

- [Introduction](#Introduction)
    - [Assessment Overview](#Assessment-Overview)
    - [Project Summary](#Project-Summary)

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Introduction</h2>

### Assessment Overview

### Project Summary

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Install and Import Required Libraries</h2>

In [1]:
# !pip install -q pyspark pymongo

In [2]:
import glob
import os
from datetime import datetime
import logging

import numpy as np
import pandas as pd

import pymongo
from pymongo import MongoClient


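from pyspark.sql import SparkSession
# NOTE: these wildcard imports shadow Python built-ins such as sum, min, max and round.
# lit is already covered by the wildcard import, so it is not imported separately.
from pyspark.sql.functions import *
from pyspark.sql.types import *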

from kaggle_secrets import UserSecretsClient

import warnings

In [3]:
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s — %(levelname)s %(message)s', force=True)
logger = logging.getLogger(__name__)

# Disable warnings
warnings.filterwarnings(action='ignore')

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Define Data Paths</h2>

In [4]:
STOCKPRICE_FOLDER = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stockprice"
STOCKTWEET_CSV = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stocktweet/stocktweet.csv"

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Set MongoDB Connection</h2>

In [5]:
user_secrets = UserSecretsClient()
mongodb_uri = user_secrets.get_secret("mongodb-atlas-uri")

In [6]:
def create_mongodb_connection(uri, db_name='stock_analytics'):
    """
    Connect to MongoDB and return database instance
    
    Args:
        uri (str): Database URI
        db_name (str): Database name
    
    Returns:
        pymongo.database.Database: MongoDB database instance
    """
    try:
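        # MongoClient connects lazily, so creating it does not yet contact the server
        client = MongoClient(uri)
        db = client[db_name]
        # server_info() forces a round trip, so a bad URI or unreachable cluster fails here
        print("MongoDB version:", client.server_info()["version"])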
        logger.info(f"Connected to MongoDB database: {db_name}")
        return db
    except Exception as e:
        logger.error(f"Failed to connect to MongoDB: {e}")
        raise

db = create_mongodb_connection(mongodb_uri, 'stock_analytics')

2025-07-14 18:39:34,449 — INFO Connected to MongoDB database: stock_analytics


MongoDB version: 8.0.11


### MongoDB Selection Rationale

**Why MongoDB was chosen over other NoSQL options:**


* **Document-oriented structure**

  * MongoDB’s JSON-like document model naturally accommodates the semi-structured `stocktweet.csv` data (e.g., tweet text, ticker, and timestamp).
  * No need for complex schema definitions or rigid table structures.

* **Ease of use and developer-friendliness**

  * MongoDB is easy to get started with: both installation and connecting to the database are straightforward.
  * MongoDB’s query language (MQL) is intuitive and closely resembles JSON syntax, which makes queries easy to write.
  * Developer tools such as MongoDB Compass help me visualize the database, for example to confirm that a collection was created or that documents were inserted.
  * Extensive support for multiple programming languages (e.g., Python, Java, Scala).

* **Seamless integration with Apache Spark**

  * The `mongo-spark-connector` enables direct read/write access between MongoDB and Spark DataFrames.
  * Simplifies data ingestion, distributed transformation, and analytics within the Spark environment.

* **Efficient for read-heavy analytical workloads**

  * MongoDB supports secondary indexes, which speed up queries that filter tweets by ticker symbol or date, and text indexes, which enable keyword search over tweet text.
  * Aggregation pipelines allow for efficient summarization and filtering of large datasets.

* **Cloud accessibility and scalability**

  * MongoDB Atlas provides a fully managed cloud database service that allows me to connect securely from anywhere.
  * I can easily deploy and scale MongoDB clusters in the cloud and integrate them with my local Spark environment.
  * Atlas provides high availability, backups, and monitoring as managed features, which is useful when handling large volumes of tweet and stock data.
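
The document model, indexing, and aggregation points above can be illustrated with a minimal sketch. The field names (`ticker`, `tweet`, `date`), the ISO date format, and the collection name `stocktweet` are assumptions made for illustration only; they are not taken from the actual `stocktweet.csv` schema. The helpers accept any PyMongo collection object, so nothing here connects to a server by itself.

```python
from datetime import datetime


def tweet_row_to_document(row):
    """Turn one CSV row (a dict) into a MongoDB-ready document.

    Assumed, hypothetical columns: 'ticker', 'tweet', 'date' (YYYY-MM-DD).
    """
    return {
        "ticker": row["ticker"].upper(),
        "tweet": row["tweet"],
        # Store a real datetime so range filters and sorting work natively
        "date": datetime.strptime(row["date"], "%Y-%m-%d"),
    }


def ensure_indexes(collection):
    """Create the indexes that back ticker/date filters and keyword search."""
    # 1 is the value of pymongo.ASCENDING; this is a compound secondary index
    collection.create_index([("ticker", 1), ("date", 1)])
    # "text" is the value of pymongo.TEXT; this is a text index on the tweet body
    collection.create_index([("tweet", "text")])


def tweets_per_ticker_pipeline(start, end):
    """Build an aggregation pipeline counting tweets per ticker in [start, end)."""
    return [
        {"$match": {"date": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$ticker", "tweet_count": {"$sum": 1}}},
        {"$sort": {"tweet_count": -1}},
    ]


# Usage against a live database (assumes `db` from the connection cell
# and a hypothetical list of CSV rows called `rows`):
# tweets = db["stocktweet"]
# tweets.insert_many([tweet_row_to_document(r) for r in rows])
# ensure_indexes(tweets)
# pipeline = tweets_per_ticker_pipeline(datetime(2020, 1, 1), datetime(2021, 1, 1))
# for result in tweets.aggregate(pipeline):
#     print(result["_id"], result["tweet_count"])
```

Placing `$match` first lets MongoDB use the compound index to narrow the documents before grouping, which is generally where the efficiency of an aggregation pipeline comes from.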