# Stonks!
## A realtime stocks analyzer.

Coded by **Alfio Cardillo**
<br>
For the subject **"Technologies for Advanced Programming"**
<br>
University of **Catania**, Department of Mathematics and Computer Science. **13 June 2023**

# What is it?
This project is a realtime analyzer of the three most valued stocks on the market as of June 2023: Apple (**AAPL**), Alphabet Inc. (**GOOGL**) and Microsoft (**MSFT**).
<br>

<div>
<img src="images/logos.png" width="800"/>
</div>
   

# The structure
The pipeline combines several technologies, each dockerized into a **small micro-service**; the services communicate with one another through **docker networks**.
<br><br>
Some time was spent optimizing the containers by choosing, when possible, both an **official image** for each technology and its **smallest version**.

<div>
<img src="images/pipeline.png" width="1000"/>
</div>

# Input


```python
import json
import rich

# Sample of one Alpha Vantage intraday response (AAPL, 5-minute bars)
data = {
    "Meta Data": {"1. Information": "Intraday (5min) open, high, low, close prices and volume", "2. Symbol": "AAPL", "3. Last Refreshed": "2023-06-09 19:55:00", "4. Interval": "5min", "5. Output Size": "Compact", "6. Time Zone": "US/Eastern"},
    "Time Series (5min)": {"2023-06-09 19:55:00": {"1. open": "180.7800", "2. high": "180.8700", "3. low": "180.7700", "4. close": "180.8100", "5. volume": "3131"}, "2023-06-09 19:50:00": {"1. open": "180.7900", "2. high": "180.8000", "3. low": "180.7700", "4. close": "180.7700", "5. volume": "1178"}}
}
rich.print_json(json.dumps(data, indent=2))
```
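
The sample above has the shape returned by Alpha Vantage's `TIME_SERIES_INTRADAY` query. As a rough sketch of how the request for each symbol could be built (the helper name `intraday_url` and the `"demo"` key are placeholders, not taken from the project's script):

```python
from urllib.parse import urlencode

# Public Alpha Vantage query endpoint.
BASE_URL = "https://www.alphavantage.co/query"
SYMBOLS = ("AAPL", "GOOGL", "MSFT")


def intraday_url(symbol: str, api_key: str, interval: str = "5min") -> str:
    """Build the request URL for one symbol's intraday time series."""
    params = {
        "function": "TIME_SERIES_INTRADAY",
        "symbol": symbol,
        "interval": interval,
        "apikey": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # "demo" is a placeholder key, not the project's real one.
    for symbol in SYMBOLS:
        print(intraday_url(symbol, api_key="demo"))
```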

# Script
A Python script downloads the data from [Alpha Vantage](https://www.alphavantage.co/) for the three symbols, then builds an internal **timeline** and places every event on it according to the timestamp in its record.
<br><br>
Lastly, it starts an **internal timer**; whenever the timer reaches a point on the timeline where an event occurred, it sends that event to Fluentbit with an HTTP request on Fluentbit's port **9880**.
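
The timeline-and-timer idea can be sketched as follows. This is a minimal illustration, not the project's actual script: the host name `fluentbit`, the function names and the `speed` parameter are all assumptions made for the example.

```python
import json
import time
import urllib.request
from datetime import datetime

# Assumed address: a Fluentbit container reachable as "fluentbit" on the docker network.
FLUENTBIT_URL = "http://fluentbit:9880"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_timeline(responses):
    """Flatten Alpha Vantage responses into one list of events sorted by time."""
    events = []
    for response in responses:
        symbol = response["Meta Data"]["2. Symbol"]
        for stamp, bar in response["Time Series (5min)"].items():
            moment = datetime.strptime(stamp, TIME_FORMAT)
            events.append((moment, {"symbol": symbol, "time": stamp, **bar}))
    events.sort(key=lambda item: item[0])
    return events


def send_to_fluentbit(event, url=FLUENTBIT_URL):
    """POST one event as JSON to Fluentbit's HTTP input."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as reply:
        return reply.status


def replay(timeline, send=send_to_fluentbit, sleep=time.sleep, speed=1.0):
    """Emit each event after the gap that separates it from the previous one."""
    previous = None
    for moment, event in timeline:
        if previous is not None:
            # speed > 1 shortens the wait (hypothetical knob for demos).
            sleep((moment - previous).total_seconds() / speed)
        send(event)
        previous = moment
```

Passing `send` and `sleep` as arguments keeps the replay logic testable without a running Fluentbit container.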

## Why
- It is **hard to find realtime stock market data** that is **free to use**, so downloaded data is replayed as if it were live
- Python is an easy language and, especially for low-resource tasks like this one, offers a simple and effective solution

# Fluentbit
Fluentbit receives the data from the Python script and **parses** it, extracting the **event time** and converting it into a **Unix timestamp**.
<div>
<img src="images/parse.png" width="500"/>
</div>
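
For readers unfamiliar with the conversion, here is what that parsing step amounts to, written in Python rather than as Fluentbit configuration. The time zone the parser assumes is not stated here, so the sketch defaults to UTC as an explicit assumption; a `zoneinfo.ZoneInfo("US/Eastern")` object could be passed instead to match the zone declared in the feed.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_unix_timestamp(event_time: str, tz=timezone.utc) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and return seconds since the epoch."""
    # Assumption: the string is interpreted in `tz` (UTC unless told otherwise).
    parsed = datetime.strptime(event_time, TIME_FORMAT).replace(tzinfo=tz)
    return int(parsed.timestamp())


if __name__ == "__main__":
    first = to_unix_timestamp("2023-06-09 19:50:00")
    second = to_unix_timestamp("2023-06-09 19:55:00")
    print(first, second, second - first)  # consecutive 5-minute bars are 300 s apart
```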

## Why
- **Smaller** and **faster** than other ingestion methods
- Still has an easy syntax
<div>
<img src="images/fluentlogstash.jpg" width="300"/>
</div>

## Difficulties
- The **documentation** is **not very clear** on how to parse data

# Kafka
Kafka was chosen as our **Data Streaming** tool; it passes all the data received from **Fluentbit** to the next component, **Spark**.
<br><br>
This micro-service uses an image that supports **KRaft**, so **Kafka**'s nodes do not have to be managed with **Zookeeper**.

## Why
- Lightweight, more stable and simpler
<div>
<img src="images/nozookeeper.gif" width="200"/>
</div>

## Difficulties
- **KRaft** is still not widely adopted, hence information on how to set it up is scarce.

# Spark
**Spark** is responsible for data processing and enrichment.
<br>
In particular, it applies a pretrained **Linear Regression** model to a feature vector made of:
- open
- close
- low
- high
- volume


to predict the **next close** value. 
<br><br>
The model can be built from the **training** folder, which contains the code for both **downloading the datasets** and **training the model**.
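
To make the idea concrete, here is a plain-Python sketch (deliberately not Spark ML code) of how each bar's five features can be paired with the following bar's close for training, and how a fitted linear model turns a feature vector into a prediction. The simplified field names, the function names and the weights in the usage lines are illustrative assumptions, not values from the project's trained model.

```python
FEATURES = ("open", "close", "low", "high", "volume")


def to_vector(bar):
    """Turn one bar into a feature vector in the fixed FEATURES order."""
    return [float(bar[name]) for name in FEATURES]


def training_rows(bars):
    """Pair each bar's features with the close of the bar that follows it."""
    ordered = sorted(bars, key=lambda bar: bar["time"])
    return [
        (to_vector(current), float(following["close"]))
        for current, following in zip(ordered, ordered[1:])
    ]


def predict_next_close(bar, weights, intercept):
    """Apply a fitted linear model: intercept + sum(weight * feature)."""
    return intercept + sum(w * x for w, x in zip(weights, to_vector(bar)))


if __name__ == "__main__":
    bars = [
        {"time": "2023-06-09 19:50:00", "open": "180.79", "close": "180.77",
         "low": "180.77", "high": "180.80", "volume": "1178"},
        {"time": "2023-06-09 19:55:00", "open": "180.78", "close": "180.81",
         "low": "180.77", "high": "180.87", "volume": "3131"},
    ]
    print(training_rows(bars))
    # Hypothetical weights: this toy model just repeats the current close.
    print(predict_next_close(bars[-1], weights=[0, 1, 0, 0, 0], intercept=0.0))
```

The last bar in a series has no successor, so it yields no training row; it can only be used for prediction.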

## Why
- Allows **complex operations on Streaming DataFrames** in a matter of seconds

## Difficulties? Many
- **No lag** function for streaming DataFrames
- Crashes with **no error code**
- Mapping to Elasticsearch is a **great adventure**
<div>
<img src="images/sparkworking.gif" width="300"/>
</div>

<div>
<img src="images/sparklag.png" width="800"/>
</div>

# Elasticsearch
**Elasticsearch** is the store where our processed data is kept.

## Why
- Perfect storage for **real-time** data
- Easy to query with **Kibana**
<div>
<img src="images/kibanaelastic.jpg" width="300"/>
</div>

## Difficulties
- Even inside a container, it may behave differently depending on the hardware, so it needs to be **properly configured "ad hoc"**
<div>
<img src="images/elasticlogs.gif" width="300"/>
</div>

# Kibana
**Kibana** is used to visualize the data stored in Elasticsearch, refreshing continuously to show incoming **real-time** data.

## Why
- Ease of use

## Difficulties
- Some problems reading **the timestamp**
<div>
<img src="images/kibanatime.gif" width="300"/>
</div>
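
One possible cause of timestamp trouble (not necessarily the one hit in this project) is the index mapping: Kibana can only use a field mapped as `date` as its time field, and Elasticsearch's default date format reads numbers as epoch milliseconds, so a Unix timestamp in seconds needs the `epoch_second` format. A sketch of such a mapping as a Python dict, with hypothetical field names:

```python
import json

# Hypothetical field names; the project's real index layout is not shown here.
MAPPING = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date", "format": "epoch_second"},
            "symbol": {"type": "keyword"},
            "close": {"type": "double"},
            "prediction": {"type": "double"},
        }
    }
}


def date_fields(mapping):
    """Return the names of fields that Kibana could use as a time field."""
    properties = mapping["mappings"]["properties"]
    return [name for name, spec in properties.items() if spec["type"] == "date"]


if __name__ == "__main__":
    # This JSON body could be supplied when the index is created.
    print(json.dumps(MAPPING, indent=2))
    print(date_fields(MAPPING))
```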

# The end
Now let's try the project by creating a file containing our **API key** and starting the services with **docker compose**.