# ChessTV_Analytics

* **Subject**: Technologies for Advanced Programming, 2022/23
* **Prof**: Salvatore Nicotra
* **Student**: Fabrizio Iuvara

## Introduction

ChessTV_Analytics is a project designed and implemented for the TAP (Technologies for Advanced Programming) course at the University of Catania.

## Motivation

Chess has recently seen a notable surge of interest, driven by several factors: the popularity of TV series such as "The Queen's Gambit", weekly grandmaster tournaments such as _Titled Tuesday_, and high-profile cheating scandals.
This led me to propose a real-time analysis tool for games played by the best players in the world. It makes available statistics that are otherwise hard to obtain without dedicated big data tools such as the ones used in this project.

## Project Pipeline

![pipeline](img/pipeline2.png)

## Data Source

https://lichess.org/about

_lichess.org is a free/libre, open-source chess server powered by volunteers and donations._

_Today, Lichess users play more than five million games every day. Lichess is one of the most popular chess websites in the world while remaining 100% free. Most “free” websites subsist by selling ads or selling user data. Others do it by putting all the good stuff behind paywalls. Lichess does none of these things and never will_


![partite](img/partite_lichess.png)


* The data comes from the Lichess site through its API. There are two categories of games:
    * **real-time games** - games currently in progress on Lichess TV (channels _blitz_ and _rapid_)
    * **ended games** - games that were previously shown on Lichess TV but have since finished and made way for other live games; they are no longer available on TV and are therefore hard to find. This is one of the main reasons that inspired me to design this application.

I created a Python script that fetches this data from the Lichess API and sends it to Logstash, which I discuss in more detail in the next section.
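
A minimal sketch of what such a script could look like is shown below. It is not the project's actual code: it assumes the Lichess TV feed endpoint `https://lichess.org/api/tv/feed`, which streams one JSON document per line, and a Logstash TCP input with a `json_lines` codec listening on `localhost:5000`. The names `parse_ndjson`, `forward`, and `main` are hypothetical.

```python
import json
import socket
from typing import Iterable, Iterator
from urllib.request import urlopen

# Assumed values: adjust the endpoint and the Logstash address to your setup.
TV_FEED_URL = "https://lichess.org/api/tv/feed"
LOGSTASH_ADDR = ("localhost", 5000)


def parse_ndjson(lines: Iterable[bytes]) -> Iterator[dict]:
    """Yield one dict per non-empty line; blank keep-alive lines are skipped."""
    for raw in lines:
        line = raw.strip()
        if line:
            yield json.loads(line)


def forward(events: Iterable[dict], sock) -> None:
    """Send each event to Logstash as one JSON document per line."""
    for event in events:
        sock.sendall(json.dumps(event).encode("utf-8") + b"\n")


def main() -> None:
    # The HTTP response is a file-like object that can be read line by line.
    with urlopen(TV_FEED_URL) as feed, socket.create_connection(LOGSTASH_ADDR) as sock:
        forward(parse_ndjson(feed), sock)
```

Calling `main()` keeps the connection open and forwards events until the stream is interrupted. Keeping parsing and sending in separate functions makes it easy to test the parsing step without a network connection.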

## Technologies Used

### Docker

Every component runs in a container built from an image available on Docker Hub and customized with my own settings. Docker also sets up the network that lets these containers communicate.

### Logstash

Logstash is an open-source data processing pipeline tool that facilitates the collection, processing, and transformation of data from various sources in real time. 
It's part of the Elastic Stack, which also includes Elasticsearch (for data storage and search) and Kibana (for data visualization and exploration).

I used Logstash to ingest the data from the Python script and forward it to Kafka, using two different topics to keep the two categories of games in separate channels.

### Apache Kafka

Apache Kafka is an open-source distributed streaming platform and messaging system. It is designed for managing and processing real-time data streams from various sources and destinations. Kafka is particularly useful for handling massive amounts of data reliably, at scale and with high performance.

ZooKeeper plays an important role in Kafka's data flow: it coordinates the cluster and manages the configuration and metadata related to topics and partitions.

* Topics used:
    * all_games: this topic contains the completed games
    * real_time: this topic contains the games currently shown on Lichess TV

### Apache Spark

_Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters._ 



* In this project, Spark reads data from the two Kafka topics. I wrote the following scripts using PySpark:
    * _live\_chess.py_ - transfers game data from the all_games topic to Elasticsearch.
    * _classifier.py_ - handles the games from the real_time topic. Using Spark MLlib (a logistic regression model), it predicts the winner of each game and enriches the raw data with this prediction; the updated data is then sent to Elasticsearch.
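
The flow of the classifier script can be sketched roughly as follows. This is an illustration under several assumptions, not the project's code: the feature columns are hypothetical, the model is assumed to have been saved as a `PipelineModel` (a `VectorAssembler` followed by `LogisticRegression`), and the Spark session is assumed to have the Kafka and Elasticsearch connector packages available and `es.nodes` configured.

```python
# Hypothetical feature columns; the real script may use different ones.
FEATURES = ["white_rating", "black_rating", "white_clock", "black_clock"]


def game_schema(features=FEATURES):
    """Build a DDL schema string: the game id plus one INT column per feature."""
    return ", ".join(["id STRING"] + [f"{name} INT" for name in features])


def start_classifier(spark, model_path, bootstrap_servers, es_index):
    """Read games from Kafka, add the model's prediction, stream to Elasticsearch."""
    # Imported here so game_schema() stays usable without Spark installed.
    from pyspark.ml import PipelineModel
    from pyspark.sql.functions import col, from_json

    # Assumption: a previously fitted and saved pipeline exists at model_path.
    model = PipelineModel.load(model_path)

    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", "real_time")
        .load()
    )
    # Kafka delivers the payload as bytes in the "value" column.
    games = raw.select(
        from_json(col("value").cast("string"), game_schema()).alias("game")
    ).select("game.*")

    # transform() appends a numeric "prediction" column; how it maps to a
    # winner depends on how the labels were encoded during training.
    enriched = model.transform(games).select("id", *FEATURES, "prediction")

    return (
        enriched.writeStream.format("es")
        .option("checkpointLocation", "/tmp/checkpoints/classifier")
        .start(es_index)
    )
```

`start_classifier` returns a streaming query; calling `awaitTermination()` on it keeps the job running. Only plain columns are selected before writing, so the vector columns added by the pipeline are not sent to Elasticsearch.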

### Elasticsearch

_Elasticsearch is a distributed, RESTful search and analytics engine capable of addressing a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data for lightning fast search, fine‑tuned relevancy, and powerful analytics that scale with ease._

In this project, Elasticsearch stores and indexes the game data and provides fast, flexible search over it.
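
As an illustration of the kind of search this enables, the sketch below builds a `match` query and posts it to the `_search` endpoint of an index. The address `http://localhost:9200` and any index or field names passed in are assumptions; the helper names `build_search` and `search` are hypothetical.

```python
import json
from urllib.request import Request, urlopen

ES_URL = "http://localhost:9200"  # assumed Elasticsearch address


def build_search(field, value, size=10):
    """Return a query body that matches documents whose field contains value."""
    return {"size": size, "query": {"match": {field: value}}}


def search(index, body, base_url=ES_URL):
    """POST the query body to the index's _search endpoint and decode the reply."""
    request = Request(
        f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request) as response:
        return json.load(response)
```

For example, `search("my_index", build_search("some_field", "some value"))` would return the matching documents under `hits.hits` in the response, provided the index and field exist.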

### Kibana

Kibana is an open-source data visualization and exploration platform designed to work alongside Elasticsearch. It allows users to interact with data stored in Elasticsearch through a user-friendly web interface.

In this project I used Kibana to build a dashboard that visualizes statistics about the chess games. Here is a screenshot of the dashboard:

![dashboard](img/dash.png)

## Potential updates

* Improve the data source by considering additional APIs
* Increase the number of statistics shown in the dashboard