<h1 align="center" style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">MSc in Data Analytics: Big Data Storage and Processing</h1>

### Table of Contents

- [Introduction](#Introduction)
    - [Assessment Overview](#Assessment-Overview)
    - [Project Summary](#Project-Summary)

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Introduction</h2>

### Assessment Overview

### Project Summary

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Install and Import Required Libraries</h2>

In [1]:
# !pip install -q pyspark pymongo

In [2]:
# Standard library
import glob
import logging
import os
import warnings
from datetime import datetime

# Third-party
import numpy as np
import pandas as pd

import pymongo
from pymongo import MongoClient

from pyspark.sql import SparkSession
# Note: the wildcard imports below shadow Python built-ins such as sum, min, max, and round.
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import lit

from kaggle_secrets import UserSecretsClient

In [3]:
# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s — %(levelname)s %(message)s', force=True)
logger = logging.getLogger(__name__)

# Disable warnings
warnings.filterwarnings(action='ignore')

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Define Data Paths</h2>

In [4]:
STOCKPRICE_FOLDER = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stockprice"
STOCKTWEET_CSV = "/kaggle/input/stock-tweet-and-price/stock-tweet-and-price/stocktweet/stocktweet.csv"

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Set MongoDB Connection</h2>

In [5]:
user_secrets = UserSecretsClient()
mongodb_uri = user_secrets.get_secret("mongodb-atlas-uri")

In [6]:
def create_mongodb_connection(uri, db_name='stock_analytics'):
    """
    Connect to MongoDB and return database instance
    
    Args:
        uri (str): Database URI
        db_name (str): Database name
    
    Returns:
        pymongo.database.Database: MongoDB database instance
    """
    try:
        client = MongoClient(uri)
        db = client[db_name]
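        # MongoClient connects lazily; server_info() forces a round trip,
        # so connection errors surface here rather than on first use.
        print("MongoDB version:", client.server_info()["version"])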
        logger.info(f"Connected to MongoDB database: {db_name}")
        return db
    except Exception as e:
        logger.error(f"Failed to connect to MongoDB: {e}")
        raise

db = create_mongodb_connection(mongodb_uri, 'stock_analytics')

2025-07-14 18:39:34,449 — INFO Connected to MongoDB database: stock_analytics


MongoDB version: 8.0.11


### MongoDB Selection Rationale

**Why MongoDB was chosen over other NoSQL options:**


* **Document-oriented structure**

  * MongoDB’s JSON-like document model naturally accommodates the semi-structured `stocktweet.csv` data (e.g., tweet text, ticker, and timestamp).
  * No need for complex schema definitions or rigid table structures.

* **Ease of use and developer-friendliness**
  * MongoDB is easy to get started with: both installation and connecting to the database are straightforward.
  * MongoDB’s query language (MQL) is intuitive and uses JSON-like syntax, which makes queries easy to write.
  * Rich developer tools such as MongoDB Compass help me visualize the database, for example to confirm that a collection was created or that documents were inserted.
  * Extensive support for multiple programming languages (e.g., Python, Java, Scala).

* **Seamless integration with Apache Spark**

  * The `mongo-spark-connector` enables direct read/write access between MongoDB and Spark DataFrames.
  * Simplifies data ingestion, distributed transformation, and analytics within the Spark environment.

* **Efficient for read-heavy analytical workloads**

  * MongoDB supports secondary indexes and text search, which suit queries that filter tweets by ticker symbol or date.
  * Aggregation pipelines allow for efficient summarization and filtering of large datasets.

* **Cloud accessibility and scalability**

  * MongoDB Atlas provides a fully managed cloud database service that allows me to connect securely from anywhere.
  * I can easily deploy and scale MongoDB clusters in the cloud and integrate them with my local Spark environment.
  * Atlas provides high availability, backups, and monitoring, which suits large volumes of tweet and stock data.
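
The indexing and aggregation points above can be made concrete with a small sketch. It only builds the index specification and the pipeline as plain Python objects, so it runs without a database connection. The field names `ticker` and `date` and the symbol `"AAPL"` are assumptions for illustration; the actual column names in `stocktweet.csv` may differ.

```python
from typing import Any, Dict, List


def build_daily_tweet_count_pipeline(ticker: str) -> List[Dict[str, Any]]:
    """Return an aggregation pipeline counting one ticker's tweets per date.

    The field names "ticker" and "date" are assumed, not confirmed
    column names of the stock_tweets collection.
    """
    return [
        {"$match": {"ticker": ticker}},                            # keep one ticker
        {"$group": {"_id": "$date", "tweet_count": {"$sum": 1}}},  # count per date
        {"$sort": {"_id": 1}},                                     # ascending by date
    ]


# Compound index on the same assumed fields, in the (key, direction)
# form accepted by pymongo's Collection.create_index.
TICKER_DATE_INDEX = [("ticker", 1), ("date", 1)]

pipeline = build_daily_tweet_count_pipeline("AAPL")
```

With a live connection, these objects could be passed to `db["stock_tweets"].create_index(TICKER_DATE_INDEX)` and `db["stock_tweets"].aggregate(pipeline)`.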

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Create Spark Session</h2>

**Note:** The original MongoDB URI had the format `mongodb+srv://<user>:<password>@cluster.mongodb.net/?...`, which does not name a target database. Because the MongoDB Spark connector needs to know which database to read from and write to, the string was programmatically updated to include the `/stock_analytics` path before the query string.

In [7]:
# Insert 'stock_analytics' between the host and the query string.
# Matching ".net/?" (rather than ".net/") avoids producing a doubled "??".
spark_mongodb_uri = mongodb_uri.replace(".net/?", ".net/stock_analytics?", 1)
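
String replacement depends on the exact shape of the URI. A more defensive variant, sketched here with the standard library only, parses the URI and sets its path component while leaving credentials, host, and query options untouched. The helper name `with_database` and the example URI are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlsplit, urlunsplit


def with_database(uri: str, db_name: str) -> str:
    """Return `uri` with its path set to `/db_name`, keeping the query string."""
    parts = urlsplit(uri)
    return urlunsplit(parts._replace(path=f"/{db_name}"))


# Placeholder URI with the same shape as an Atlas connection string.
example_uri = (
    "mongodb+srv://user:password@cluster.example.mongodb.net/"
    "?retryWrites=true&w=majority"
)
example_spark_uri = with_database(example_uri, "stock_analytics")
```

This also covers URIs that have no trailing `/?...` part, which a plain `replace` would silently leave unchanged.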

In [8]:
def create_spark_session(app_name="StockAnalytics"):
    """
    Create Spark session with MongoDB connector
    
    Args:
        app_name (str): Spark application name
    
    Returns:
        pyspark.sql.SparkSession: Spark session instance
    """
    try:
        spark = SparkSession.builder \
            .appName(app_name) \
            .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
            .config("spark.mongodb.input.uri", spark_mongodb_uri) \
            .config("spark.mongodb.output.uri", spark_mongodb_uri) \
            .getOrCreate()
        
        logger.info(f"Spark session created: {app_name}")
        return spark
    except Exception as e:
        logger.error(f"Failed to create Spark session: {e}")
        raise

In [9]:
spark = create_spark_session()

# Suppress Warnings in PySpark
spark.sparkContext.setLogLevel("ERROR")

:: loading settings :: url = jar:file:/usr/local/lib/python3.11/dist-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
org.mongodb.spark#mongo-spark-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-e3ef89fe-374c-4de1-a492-75380e04545c;1.0
	confs: [default]
	found org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 in central
	found org.mongodb#mongodb-driver-sync;4.0.5 in central
	found org.mongodb#bson;4.0.5 in central
	found org.mongodb#mongodb-driver-core;4.0.5 in central
:: resolution report :: resolve 317ms :: artifacts dl 18ms
	:: modules in use:
	org.mongodb#bson;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-core;4.0.5 from central in [default]
	org.mongodb#mongodb-driver-sync;4.0.5 from central in [default]
	org.mongodb.spark#mongo-spark-connector_2.12;3.0.1 from central in [default]

### Spark Selection Rationale

**Why Spark instead of Hadoop MapReduce:**

* Spark keeps intermediate results in memory instead of writing them to disk between stages, which makes it faster than Hadoop MapReduce, especially for repeated operations such as filtering, joining, and aggregation.
* It provides simpler, higher-level APIs (e.g., DataFrames and SQL) that are easier to use and more readable than the verbose MapReduce code.
* Spark includes powerful built-in libraries for SQL, machine learning, and streaming, which are not natively available in MapReduce.

**Why PySpark:**

* PySpark allows me to use Python, which makes it easier to integrate with pandas, matplotlib, and other familiar libraries.
* Its syntax is concise and readable, improving development speed and code clarity.
* Strong community support and documentation simplify implementation and troubleshooting.

**MongoDB Integration:**

* The Spark session is configured with the MongoDB connector, allowing direct read/write access between Spark and MongoDB.
* This setup creates a seamless pipeline from storage to distributed processing without intermediate conversions.
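
As an illustration of the read side of this integration, the sketch below loads a collection back into a Spark DataFrame. It assumes the session was created with `spark.mongodb.input.uri` set (as in `create_spark_session`) and uses the same `mongo` format and `collection` option as the write path in this notebook. The helper name `read_collection` is hypothetical.

```python
def read_collection(spark, collection):
    """Load a MongoDB collection as a Spark DataFrame.

    Relies on the session-level `spark.mongodb.input.uri` for the
    connection string and target database.
    """
    return (
        spark.read.format("mongo")
        .option("collection", collection)
        .load()
    )
```

For example, `tweets_df = read_collection(spark, "stock_tweets")` would return a DataFrame whose schema the connector infers by sampling documents.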

<h2 style="background-color:#2c3e54;color:#ecf0f1;border-radius: 8px; padding:15px">Store Source Datasets into NoSQL Database using Spark</h2>

In [10]:
# Clear existing data
for collection_name in db.list_collection_names():
    db[collection_name].drop()

In [11]:
def store_stock_tweets_to_mongodb_spark(spark, csv_path):
    """
    Load stock tweet data from CSV to MongoDB using Spark
    
    Args:
        spark (SparkSession): Spark session instance
        csv_path (str): Path to stocktweet.csv file
    """
    try:
        df = spark.read.option("header", True).csv(csv_path)
        df.write.format("mongo") \
            .mode("overwrite") \
            .option("collection", "stock_tweets") \
            .save()
        
        logger.info("Loaded stock tweets into MongoDB using Spark")
    except Exception as e:
        logger.error(f"Failed to load stock tweets via Spark: {e}")
        raise

store_stock_tweets_to_mongodb_spark(spark, STOCKTWEET_CSV)

2025-07-14 18:40:33,264 — INFO Loaded stock tweets into MongoDB using Spark     


In [12]:
def store_stock_prices_to_mongodb_spark(spark, stockprice_folder):
    """
    Load stock price data from multiple CSVs to MongoDB using Spark
    
    Args:
        spark (SparkSession): Spark session instance
        stockprice_folder (str): Path to folder containing stock price CSV files
    """
    try:
        # Load all CSVs from the folder
        df = spark.read.option("header", True).csv(f"{stockprice_folder}/*.csv")
        df.write.format("mongo") \
            .mode("overwrite") \
            .option("collection", "stock_prices") \
            .save()

        logger.info("Loaded stock prices into MongoDB using Spark")
    except Exception as e:
        logger.error(f"Failed to load stock prices via Spark: {e}")
        raise

store_stock_prices_to_mongodb_spark(spark, STOCKPRICE_FOLDER)

2025-07-14 18:40:46,366 — INFO Loaded stock prices into MongoDB using Spark     


### Rationale for Using Spark to Populate MongoDB

The source datasets (`stocktweet.csv` and stock price files) were loaded using **PySpark**, and written directly into a MongoDB NoSQL database using the **MongoDB Spark Connector**. This approach leverages Spark’s distributed capabilities to handle large data volumes efficiently and meets the requirement to populate the NoSQL database using a big data processing tool. Each dataset was written to a separate collection (`stock_tweets`, `stock_prices`) for streamlined querying and integration with further Spark-based analytics.
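
One way to check that a load completed is to compare the number of rows Spark reads from the source files with the number of documents in the target collection. The helper below is a minimal sketch of that check, not part of the original pipeline; the name `counts_match` is hypothetical, and it assumes the CSVs are read with the same options used for the load.

```python
def counts_match(spark, db, csv_path, collection):
    """Compare Spark's CSV row count with the MongoDB document count.

    Returns a tuple: (counts are equal, source row count, stored document count).
    """
    # Re-read the source with the same header option used for the load.
    source_rows = spark.read.option("header", True).csv(csv_path).count()
    # count_documents({}) counts every document in the collection.
    stored_docs = db[collection].count_documents({})
    return source_rows == stored_docs, source_rows, stored_docs
```

A call such as `counts_match(spark, db, STOCKTWEET_CSV, "stock_tweets")` would flag a partial or failed write before any downstream analytics run.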