# StackOverflow - Analyzer

Developed by **Simone Torrisi**, Computer Science student at the University of Catania

## Project Goal

The goal of this project is to analyze real-time questions from <a href="https://stackoverflow.com/">**Stack Overflow**</a> and cluster them based on the title, body and tags associated with each question.<br>The results are then displayed on dashboards.


## Project structure

The project follows the structure below:
<img src="images/project-structure.svg" width="1000" height="600"/>

## What about Stack Overflow?

<a href="https://stackoverflow.com/" target="_blank">Stack Overflow</a> is the most popular Q&A web platform for programming, and it is part of the Stack Exchange network.

"*Questions are everywhere, answers are on Stack Overflow*"

<img src="images/meme.png" width="400" height="400"/>

## Data Source

The data are obtained by connecting to the Stack Exchange WebSocket.

The WebSocket address is "wss://qa.sockets.stackexchange.com/", and the first message sent after connecting is "*155-questions-active*".
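As a minimal sketch of what happens after the subscription message is sent, the snippet below decodes one incoming frame. It assumes each frame is a JSON envelope with an `action` field and a `data` field that is itself a JSON-encoded string. The envelope layout, the field name `siteBaseHostAddress`, and the helper `decode_frame` are assumptions made for illustration, not taken from this project, and the sample frame is hand-written rather than captured from the live socket.

```python
import json

WS_URL = "wss://qa.sockets.stackexchange.com/"
SUBSCRIBE_MESSAGE = "155-questions-active"


def decode_frame(raw):
    """Return the question payload of a frame, or None for any other frame.

    Assumption: the frame is a JSON envelope {"action": ..., "data": ...}
    and "data" holds a second, JSON-encoded string.
    """
    envelope = json.loads(raw)
    if envelope.get("action") != SUBSCRIBE_MESSAGE:
        return None  # not a question event (for example a keep-alive frame)
    return json.loads(envelope["data"])


# Hand-written sample frame, used only to exercise the function.
sample = json.dumps({
    "action": SUBSCRIBE_MESSAGE,
    "data": json.dumps({"siteBaseHostAddress": "stackoverflow.com", "id": 1}),
})
question = decode_frame(sample)
```

Opening the socket itself needs a WebSocket client library, which is left out here so the sketch stays within the standard library.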

## Data example

<img src="images/data-example.png" width="600" height="500"/>

## Data Ingestion

A **Kafka Connector** handles data ingestion. A WebSocket module connects to the Stack Exchange WebSocket and keeps only the data coming from the Stack Overflow domain. Each record is then written to a blocking queue named *QuestionsQueue*. The Connector reads records from *QuestionsQueue* and writes them to the Kafka topic named *stackoverflow*.

<img src="images/data-ingestion-schema.svg" width="1000" height="800"/>
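The hand-off between the WebSocket module and the Connector can be sketched as a producer and a consumer sharing a bounded blocking queue. This is an illustration in Python, not the project's connector code: `on_question`, `poll`, the queue size, and the `siteBaseHostAddress` field are assumptions.

```python
import queue

STACK_OVERFLOW_HOST = "stackoverflow.com"

# Plays the role of QuestionsQueue; the capacity is an arbitrary choice.
questions_queue = queue.Queue(maxsize=1000)


def on_question(question):
    """WebSocket side: enqueue the question only if it is from Stack Overflow."""
    if question.get("siteBaseHostAddress") != STACK_OVERFLOW_HOST:
        return False
    questions_queue.put(question)  # blocks while the queue is full
    return True


def poll(max_records=100):
    """Connector side: drain up to max_records without blocking."""
    records = []
    while len(records) < max_records:
        try:
            records.append(questions_queue.get_nowait())
        except queue.Empty:
            break
    return records


on_question({"siteBaseHostAddress": "stackoverflow.com", "id": 1})
on_question({"siteBaseHostAddress": "superuser.com", "id": 2})  # filtered out
batch = poll()
```

The bounded queue decouples the two sides: if the Connector falls behind, the WebSocket side waits instead of growing memory without limit.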

## Data Streaming

**Apache Kafka** is a distributed streaming platform that handles a constant real-time data stream.

Kafka runs as a cluster. Each node of the cluster, called a Kafka broker, manages Kafka topics and divides them into multiple partitions.

This project uses one Kafka topic, named "*stackoverflow*", with a single partition.

<img src="images/kafka-schema.svg" width="600" height="500"/>
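To illustrate how records are spread over partitions, here is a small sketch of key-based partitioning: a hash of the record key, modulo the number of partitions. Kafka's default partitioner uses a murmur2 hash for keyed records; `zlib.crc32` is swapped in here as a stand-in so the example needs only the standard library, and `choose_partition` is a hypothetical helper.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index (crc32 stands in for murmur2)."""
    return zlib.crc32(key) % num_partitions


keys = [b"question-1", b"question-2", b"question-3"]

# With a single partition, as in this project, every record lands in partition 0.
single = [choose_partition(k, 1) for k in keys]

# With more partitions, the same key always maps to the same partition.
stable = choose_partition(b"question-1", 3) == choose_partition(b"question-1", 3)
```

A single partition also means that all records are consumed in the order they were written.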

## Spark Streaming

Spark Streaming is part of the Spark API and takes as input a data stream from various sources, including Kafka.

It generates a DStream as output that is available for the Spark application.

DStream represents a stream of data divided into small batches.

<img src="images/spark-streaming-schema.png" width="600" height="400"/>

<img src="images/dstream.png" width="600" height="500"/>
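The idea of dividing a stream into small batches can be sketched without Spark: records are grouped by the batch interval their timestamp falls into. The function `micro_batches` and the sample events are hypothetical. Unlike this simplified version, which skips intervals with no records, Spark still emits a batch for an empty interval.

```python
from collections import defaultdict


def micro_batches(records, batch_interval):
    """Group (timestamp, value) records into consecutive batch intervals."""
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    # Return the batches in time order; intervals with no records are skipped.
    return [batches[index] for index in sorted(batches)]


events = [(0.2, "q1"), (0.9, "q2"), (1.4, "q3"), (3.1, "q4")]
batches = micro_batches(events, batch_interval=1.0)
```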

## Data Processing

Data are processed with **Apache Spark**, using its machine learning library (MLlib) and applying a clustering algorithm named **K-Means**.

<img src="images/data-processing-schema.svg" width="800" height="600"/>

### Pipeline

A pipeline is a specified sequence of stages, where each one is a **Transformer** or an **Estimator**.

Stages are run in order.

For Transformer stages, the transform method is called.
For Estimator stages, the fit method is called to produce a Transformer.

<img src="images/pipeline-schema.svg" width="800" height="600"/>
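The Transformer/Estimator contract described above can be mimicked in plain Python. This is a toy model of the idea, not the Spark MLlib API: all class names and the row format (a list of dictionaries) are assumptions made for the example.

```python
class Tokenizer:
    """Transformer: adds a 'tokens' field with the lowercase words of 'text'."""

    def transform(self, rows):
        return [{**row, "tokens": row["text"].lower().split()} for row in rows]


class VocabularyModel:
    """Transformer produced by fitting a VocabularyEstimator."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def transform(self, rows):
        return [
            {**row, "counts": [row["tokens"].count(t) for t in self.vocabulary]}
            for row in rows
        ]


class VocabularyEstimator:
    """Estimator: learns the vocabulary from the data and returns a model."""

    def fit(self, rows):
        return VocabularyModel(sorted({t for row in rows for t in row["tokens"]}))


class PipelineModel:
    """Fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:  # stages are run in order
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # Estimator -> Transformer
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


training = [{"text": "Spark streaming"}, {"text": "Kafka streaming"}]
model = Pipeline([Tokenizer(), VocabularyEstimator()]).fit(training)
result = model.transform([{"text": "Kafka Kafka Spark"}])
```

Here the learned vocabulary is `["kafka", "spark", "streaming"]`, so the new row gets the counts `[2, 1, 0]`.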

### K-Means

K-Means is an **unsupervised** clustering algorithm that partitions data points into a predefined number of clusters.
<img src="images/k-means.png" width="600" height="400"/>
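For intuition, the classic K-Means iteration (assign each point to the nearest centroid, then move each centroid to the mean of its points) can be written in a few lines. This is a plain-Python sketch with hand-picked initial centroids and toy 2-D points, not the MLlib implementation used in the project.

```python
import math


def k_means(points, initial_centroids, iterations=10):
    """Run a fixed number of assignment/update rounds and return the result."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = k_means(points, initial_centroids=[(0.0, 0.0), (10.0, 10.0)])
```

On this toy data the two left points and the two right points end up in separate clusters, with centroids `(1.0, 1.5)` and `(8.5, 8.0)`.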

### Finding the value of K

The algorithm needs a predefined number of clusters.

How can this value be found? Two common techniques are the elbow method and silhouette analysis.

### Elbow method

<img src="images/elbow.png" width="600" height="500"/>
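The elbow method plots a clustering cost against K and picks the K where the curve stops dropping sharply. A common cost is the within-cluster sum of squares (WCSS). The sketch below computes it for hand-assigned labels on toy data; in practice the labels would come from running K-Means once per candidate K. The function name `wcss` and the data are assumptions for illustration.

```python
def wcss(points, labels):
    """Within-cluster sum of squared distances to each cluster's mean."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(axis) / len(members) for axis in zip(*members)]
        total += sum(
            sum((x - c) ** 2 for x, c in zip(point, centroid))
            for point in members
        )
    return total


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
cost_by_k = {
    1: wcss(points, [0, 0, 0, 0]),
    2: wcss(points, [0, 0, 1, 1]),
    3: wcss(points, [0, 0, 1, 2]),
}
```

The cost falls from 99.5 at K = 1 to 1.0 at K = 2, then only to 0.5 at K = 3, so the elbow for this toy data is at K = 2.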

### Silhouette analysis

The silhouette score measures how close each point in one cluster is to the points in the neighboring clusters.<br>It ranges over [-1, 1]: values near 1 indicate well-separated clusters, while values near 0 indicate overlapping clusters.

A trade-off was found at **K = 10**, which gives **silhouette = 0.40**.
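As a rough sketch of how the score is computed: for each point, let `a` be the mean distance to the other points of its own cluster and `b` the lowest mean distance to the points of any other cluster; the point's silhouette is `(b - a) / max(a, b)`, and the overall score is the mean over all points. The version below uses plain Euclidean distance on toy data, which is an assumption and may differ from the distance measure used in the project.

```python
import math


def mean_distance(point, others):
    return sum(math.dist(point, other) for other in others) / len(others)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:
            scores.append(0.0)  # usual convention for single-point clusters
            continue
        a = mean_distance(point, same)
        b = min(
            mean_distance(point, [q for q, l in zip(points, labels) if l == other])
            for other in set(labels) - {label}
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # deliberately mixed grouping
```

The natural grouping scores close to 1, while the mixed grouping scores much lower, which is the signal used to compare candidate values of K.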

## Data Indexing

<img src="images/elasticsearch-logo.png" width="400" height="200"/>

<a href="https://www.elastic.co/what-is/elasticsearch">Elasticsearch</a> is a distributed, open source search and analytics engine for all types of data, including textual, numerical, geospatial, structured, and unstructured. Elasticsearch is built on Apache Lucene and was first released in 2010 by Elasticsearch N.V. (now known as Elastic). Known for its simple REST APIs, distributed nature, speed, and scalability, Elasticsearch is the central component of the Elastic Stack, a set of open source tools for data ingestion, enrichment, storage, analysis, and visualization.
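As an illustration of the REST interface, documents can be sent in batches through the Elasticsearch bulk endpoint, whose body is newline-delimited JSON: one action line followed by one document line, with a trailing newline at the end. The sketch below only builds that body; the index name, the document fields, and the helper `bulk_body` are hypothetical and not taken from this project.

```python
import json


def bulk_body(index_name, documents):
    """Build a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc in documents:
        # Action line: index this document under its own id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc["id"])}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


docs = [
    {"id": 1, "title": "How to read a Kafka topic?", "cluster": 3},
    {"id": 2, "title": "Spark DStream vs DataFrame", "cluster": 7},
]
body = bulk_body("stackoverflow-questions", docs)
```

The resulting string would be sent with a POST request to the cluster's `_bulk` endpoint using the `application/x-ndjson` content type.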

## Data Visualization

<img src="images/kibana-logo.png" width="300" height="100"/>

<a href="https://www.elastic.co/what-is/kibana">Kibana</a> is an open source frontend application that sits on top of the Elastic Stack, providing search and data visualization capabilities for data indexed in Elasticsearch. Commonly known as the charting tool for the Elastic Stack, Kibana also acts as the user interface for monitoring, managing, and securing an Elastic Stack cluster. 

# Let's go to the live demo