# Jupyter Notebook: Docker Setup and Financial Metrics Exploration
This notebook provides a step-by-step guide for setting up Docker to containerize the project and explores financial metrics computation using PySpark and Delta Lake.

# Setup Dockerfile
Write a Dockerfile to containerize the project, including dependencies and environment setup.
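
The notebook does not include the Dockerfile itself, so the following is only a minimal sketch of what one might look like, generated from Python to stay in the notebook's language. Everything in it is an assumption rather than the project's actual setup: the `python:3.11-slim` base image, a `requirements.txt` at the project root, a Streamlit entry point named `app.py`, and the hypothetical helper name `render_dockerfile`. A Java runtime is installed because PySpark needs one.

```python
def render_dockerfile(python_version="3.11", port=8501, entrypoint="app.py"):
    """Return Dockerfile text for a Streamlit + PySpark project (all defaults are assumptions)."""
    lines = [
        f"FROM python:{python_version}-slim",
        "# PySpark needs a Java runtime inside the image",
        "RUN apt-get update \\",
        "    && apt-get install -y --no-install-recommends default-jre-headless \\",
        "    && rm -rf /var/lib/apt/lists/*",
        "WORKDIR /app",
        "# Install dependencies first so this layer is cached between code changes",
        "COPY requirements.txt .",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",
        f"EXPOSE {port}",
        f'CMD ["streamlit", "run", "{entrypoint}", '
        f'"--server.port={port}", "--server.address=0.0.0.0"]',
    ]
    return "\n".join(lines) + "\n"


print(render_dockerfile())
```

To use it, the returned text could be saved with `pathlib.Path("Dockerfile").write_text(render_dockerfile())` in the project root before building the image.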

# Build Docker Image
Use the `docker build` command to create a Docker image for the project.

In [None]:
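# Build the Docker image
import os

# os.system returns the command's exit status; report success only when it is 0
exit_code = os.system("docker build -t stock-market-analysis .")
if exit_code == 0:
    print("✅ Docker image built successfully.")
else:
    print(f"❌ Docker build failed (exit status {exit_code}).")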

# Run Docker Container
Run the Docker container using the `docker run` command, exposing necessary ports.

In [None]:
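# Run detached (-d) so the cell returns while the container keeps serving the dashboard
if os.system("docker run -d -p 8501:8501 stock-market-analysis") == 0:
    print("✅ Docker container is running. Access the dashboard at http://localhost:8501.")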

# Access the Dashboard
Once the container is running, open the Streamlit dashboard in a web browser:

1. Open your web browser.
2. Navigate to `http://localhost:8501`.
3. Explore the Streamlit dashboard for stock market analysis.
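
The container can take a few seconds to start serving, so the first page load may fail. As a rough sketch, the dashboard URL can be polled from Python until it answers. The helper name `wait_for_dashboard` and its timeout values are illustrative, and only the standard library is used.

```python
import time
import urllib.error
import urllib.request


def wait_for_dashboard(url="http://localhost:8501", timeout=30.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval)
    return False
```

Calling `wait_for_dashboard()` after starting the container returns `True` once the page responds and `False` if it never does within the timeout.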

# Import Required Libraries
Import PySpark, Delta Lake, pandas, and Plotly.

In [2]:
# Import libraries
import pyspark
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import pandas as pd
import plotly.express as px
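
Each Delta Lake release targets specific Spark versions, so it can help to confirm what is installed before starting a session. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and `delta-spark` is assumed to be the pip distribution that provides the `delta` module.

```python
from importlib import metadata


def installed_versions(packages=("pyspark", "delta-spark", "pandas", "plotly")):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```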

# Load Data from Delta Lake
Load cleaned data from Delta Lake for financial metrics computation.

In [3]:
# Initialize Spark session
spark = (SparkSession.builder
    .appName("FinancialMetrics")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Load data from Delta Lake
delta_table_path = "/home/abdelhalim/Desktop/Temp /StockMarketAnalysis/data/delta_tables/cleaned_tech_stocks"
df = spark.read.format("delta").load(delta_table_path)
df.show(5)

25/05/01 22:27:48 WARN Utils: Your hostname, abdelhalim resolves to a loopback address: 127.0.1.1; using 192.168.1.6 instead (on interface enp8s0)
25/05/01 22:27:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/05/01 22:27:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/05/01 22:27:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/05/01 22:27:50 WAR

Py4JJavaError: An error occurred while calling o30.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: delta. Please find packages at `https://spark.apache.org/third-party-projects.html`.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.dataSourceNotFoundError(QueryExecutionErrors.scala:738)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:647)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:697)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: delta.DefaultSource
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:633)
	at scala.util.Try$.apply(Try.scala:213)
	at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:633)
	at scala.util.Failure.orElse(Try.scala:224)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:633)
	... 15 more
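
The `DATA_SOURCE_NOT_FOUND` error above, together with `ClassNotFoundException: delta.DefaultSource`, means Spark could not find the Delta Lake classes on its classpath. The session sets the Delta extension and catalog options, but nothing loads the Delta JARs. One common remedy is sketched here, on the assumption that the `delta-spark` pip package is installed in a version compatible with the installed PySpark. It wraps the builder with `configure_spark_with_delta_pip`, which adds the matching Delta Maven coordinates to `spark.jars.packages`. The names `DELTA_CONF` and `build_delta_session` are hypothetical.

```python
# The same two settings the notebook already passes to the builder
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_delta_session(app_name="FinancialMetrics"):
    """Create a Spark session with the Delta Lake JARs on the classpath."""
    # Imported lazily so the helper can be defined before the packages are installed
    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in DELTA_CONF.items():
        builder = builder.config(key, value)
    # Adds the Delta package to spark.jars.packages so Spark downloads it on startup
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

Because `getOrCreate()` returns any session that is already running, call `spark.stop()` or restart the kernel before `spark = build_delta_session()`, then rerun `spark.read.format("delta").load(delta_table_path)`.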


# Compute Financial Metrics
Calculate metrics like RSI, Moving Averages, and Sharpe Ratio using PySpark.

In [None]:
from pyspark.sql.functions import col, avg, stddev, lag, when
from pyspark.sql.window import Window

# Define a window specification: one ordered window per ticker, plus trailing row frames
window_spec = Window.partitionBy("Ticker").orderBy("Date")
window_14 = window_spec.rowsBetween(-13, 0)  # current row and the 13 before it
window_20 = window_spec.rowsBetween(-19, 0)  # current row and the 19 before it
prev_close = lag("Close", 1).over(window_spec)

# Compute Moving Average (20-day); the first 19 rows per ticker average over fewer observations
df = df.withColumn("MA_20", avg("Close").over(window_20))

# Compute RSI (Relative Strength Index) over 14 rows, using simple averages of gains and losses
df = df.withColumn("Change", col("Close") - prev_close)
df = df.withColumn("Gain", when(col("Change") > 0, col("Change")).otherwise(0))
df = df.withColumn("Loss", when(col("Change") < 0, -col("Change")).otherwise(0))
df = df.withColumn("Avg_Gain", avg("Gain").over(window_14))
df = df.withColumn("Avg_Loss", avg("Loss").over(window_14))
df = df.withColumn("RS", col("Avg_Gain") / col("Avg_Loss"))
# A window with gains but no losses has an undefined RS; RSI is conventionally 100 there
df = df.withColumn(
    "RSI",
    when((col("Avg_Loss") == 0) & (col("Avg_Gain") > 0), 100.0)
    .otherwise(100 - (100 / (1 + col("RS")))),
)

# Compute a rolling 20-day Sharpe Ratio (risk-free rate taken as zero, not annualized)
df = df.withColumn("Daily_Return", col("Change") / prev_close)
df = df.withColumn("Mean_Return", avg("Daily_Return").over(window_20))
df = df.withColumn("Std_Dev_Return", stddev("Daily_Return").over(window_20))
df = df.withColumn("Sharpe_Ratio", col("Mean_Return") / col("Std_Dev_Return"))

df.show(5)
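
To sanity-check the window logic without a Spark session, the same moving-average and RSI arithmetic can be reproduced in plain Python on a short price list. This is a reference sketch, not the notebook's implementation. The helper names `rolling_mean` and `simple_rsi` are hypothetical. The first change is treated as 0, as the `when(...).otherwise(0)` columns do. A window with gains but no losses is reported as 100, the usual convention, where a plain division would leave it undefined.

```python
def rolling_mean(values, window):
    """Trailing mean over up to `window` values, like rowsBetween(-(window - 1), 0)."""
    means = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        means.append(sum(frame) / len(frame))
    return means


def simple_rsi(closes, window=14):
    """RSI from simple (not Wilder-smoothed) averages of gains and losses."""
    changes = [0.0] + [later - earlier for earlier, later in zip(closes, closes[1:])]
    gains = [max(change, 0.0) for change in changes]
    losses = [max(-change, 0.0) for change in changes]
    rsi = []
    for avg_gain, avg_loss in zip(rolling_mean(gains, window), rolling_mean(losses, window)):
        if avg_loss == 0:
            # No losses in the window: 100 if there were gains, undefined otherwise
            rsi.append(100.0 if avg_gain > 0 else None)
        else:
            rsi.append(100 - 100 / (1 + avg_gain / avg_loss))
    return rsi


closes = [10.0, 11.0, 10.0, 12.0]
print(rolling_mean(closes, window=2))
print(simple_rsi(closes, window=3))
```

A short window is used here only to keep the example readable; the notebook itself uses 20 rows for the moving average and 14 for the RSI.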

# Visualize Financial Metrics
Create visualizations for the computed metrics using Plotly.

In [None]:
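# Convert Spark DataFrame to Pandas DataFrame for visualization
pandas_df = df.select("Date", "Ticker", "Close", "MA_20", "RSI", "Sharpe_Ratio").toPandas()
# Sort explicitly: Spark does not guarantee row order, and px.line connects points in row order
pandas_df = pandas_df.sort_values(["Ticker", "Date"])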

# Plot Moving Average
fig_ma = px.line(pandas_df, x="Date", y="MA_20", color="Ticker", title="20-Day Moving Average")
fig_ma.show()

# Plot RSI
fig_rsi = px.line(pandas_df, x="Date", y="RSI", color="Ticker", title="RSI (Relative Strength Index)")
fig_rsi.show()

# Plot Sharpe Ratio
fig_sharpe = px.line(pandas_df, x="Date", y="Sharpe_Ratio", color="Ticker", title="Sharpe Ratio")
fig_sharpe.show()

# Insights and Observations
The points below summarize how each metric is usually read.

- **Moving Average**: The 20-day moving average smooths out short-term fluctuations and highlights longer-term trends.
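- **RSI**: An RSI above 70 is conventionally read as overbought and an RSI below 30 as oversold. These thresholds are rules of thumb, not guarantees of a reversal.
- **Sharpe Ratio**: A higher Sharpe Ratio indicates better risk-adjusted returns. The ratio computed here is a rolling 20-day mean of daily returns divided by their standard deviation, with no risk-free rate subtracted and no annualization.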
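
The reading rules above can be written out as a small plain-Python sketch. The helper names `classify_rsi` and `sharpe_ratio` are hypothetical, the 70/30 thresholds are the conventional ones from the list, and the `risk_free` parameter is an addition for illustration that defaults to zero. The sample standard deviation is used, which is also what Spark's `stddev` computes.

```python
import statistics


def classify_rsi(rsi, overbought=70.0, oversold=30.0):
    """Label an RSI value using the conventional 70/30 thresholds."""
    if rsi is None:
        return "undefined"
    if rsi > overbought:
        return "overbought"
    if rsi < oversold:
        return "oversold"
    return "neutral"


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the sample standard deviation of the returns."""
    if len(returns) < 2:
        return None  # the sample standard deviation needs at least two observations
    spread = statistics.stdev(returns)
    if spread == 0:
        return None  # constant returns: the ratio is undefined
    return (statistics.mean(returns) - risk_free) / spread


print(classify_rsi(75.0), classify_rsi(50.0), classify_rsi(20.0))
print(sharpe_ratio([0.01, 0.03]))
```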