# TAP x Pokemon

TAP x Pokemon is a project that links together many technologies and applies them to the world of competitive Pokemon battling, processing battle data logs to extract useful information and trying to predict the outcome of a match using machine learning.

<img src="./img/pikachu_meme.png" alt="Best Pikachu Meme" title="Wow" width="800" />

# Project Introduction
### Technicalities First

<img src="./img/data_pipeline.png" title="Data Pipeline" width="1000" />

During this journey we are going to explore each stage of the data pipeline and its components, from *Ingestion* to *Visualization*.
But first...

# Data Source
## Where everything starts

# Pokemon Showdown
## [Showdown Battle Simulator](https://play.pokemonshowdown.com/)



As mentioned, we are going to collect data on competitive Pokemon battles, so we chose the best and most played competitive simulator: Pokemon Showdown (aka SD).
The website doesn't offer any API, and the only way to obtain data is to deal with the messages passing through the websocket. In short: a nightmare!

Thankfully a hero did an excellent job, not just with the host-to-server communication but also by creating a battle bot with 3 different AIs (distinguished by their behavior). His name is pmariglia and his work is available at [pmariglia/showdown](https://github.com/pmariglia/showdown).
All I had to do was fix the output, keeping only the fields useful for the project and creating a JSON for every battle. Eventually everything is sent to our Data Ingestion component.

As an example:

***

    {
      "battle": [
        {
          "bid": 3,
          "player": "Norne42",
          "type": "gen8ou",
          "turns": 44
        }
      ],
      "pokemon": [
        {
          "name": "barraskewda",
          "ability": "swiftswim",
          "types": [
            "water"
          ],
          "item": "choiceband",
          "win": 1
        },
        ...
      ]
    }

***

Shortened for obvious reasons.

In detail we are extracting data such as:
 - Battle Info (Player name, Turns, Type of Battle)
 - Pokemon Info (Name, Ability, Types, Moves, etc...)
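
As a rough illustration of the "keep only the useful fields" step, here is a minimal Python sketch that packs one battle into the JSON shape from the example. The function name `build_battle_record`, its parameters and the extra `level` field are hypothetical; only the output field names come from the example above, and the actual bot code may be organized differently.

```python
import json


def build_battle_record(bid, player, battle_type, turns, pokemon):
    """Pack one finished battle into the JSON layout from the example.

    `pokemon` is a list of dicts, one per Pokemon seen in the battle.
    """
    wanted = ("name", "ability", "types", "item", "win")
    return {
        "battle": [
            {"bid": bid, "player": player, "type": battle_type, "turns": turns}
        ],
        # Keep only the fields used later in the pipeline, drop the rest.
        "pokemon": [{k: p[k] for k in wanted if k in p} for p in pokemon],
    }


record = build_battle_record(
    bid=3,
    player="Norne42",
    battle_type="gen8ou",
    turns=44,
    pokemon=[
        {
            "name": "barraskewda",
            "ability": "swiftswim",
            "types": ["water"],
            "item": "choiceband",
            "win": 1,
            "level": 100,  # hypothetical extra field, dropped by the filter
        }
    ],
)
print(json.dumps(record, indent=2))
```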
 
<img src="./img/more_data.jpg" title="More Data" width="500" />

# Data Ingestion
## Where everything goes

Logstash is our choice for this project, and there are many reasons why:
 - Ease of use, in both coding and "Dockerizing"
 - Filters and parses data on the fly
 - Flume didn't work...
 
 <img src="./img/flume_logstash.jpg" title="Flume vs Logstash" width="300" />
 

 It just takes a bunch of Logstash plugins (available on a unified Elastic GitHub for input and output), the python-logstash module and a few lines of code to send JSON data from the battle bots to Logstash.
  <img src="./img/python_logstash.png" title="Flume vs Logstash" width="500" align="center"/>
  Logstash then forwards the data to the Kafka server, creating a "showdown" topic.
 <img src="./img/logstash_code.png" title="Flume vs Logstash" width="350" align="center" />
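
For readers who prefer text to screenshots, here is a minimal sketch of the bot-to-Logstash hop. It is not the project's code: it swaps the python-logstash module for the Python standard library, and it assumes a Logstash `tcp` input using the `json_lines` codec. The host, port and function names are placeholders.

```python
import json
import socket


def encode_event(event):
    """Serialize one event as a single newline-terminated JSON line."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")


def send_event(event, host="localhost", port=5000, timeout=5.0):
    """Open a TCP connection, send one JSON line, and close it.

    Host and port are placeholders for wherever Logstash is listening.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(encode_event(event))


# Placeholder event; a real one would be the full battle JSON.
payload = encode_event({"battle": [{"bid": 3, "type": "gen8ou", "turns": 44}]})
print(payload)
# send_event(...) would push the same bytes to a running Logstash instance.
```

One JSON document per line keeps the framing trivial: the receiver only has to split on newlines to recover individual events.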
 


#### Maybe not that simple...
 <img src="./img/conf_meme.png" title="Flume vs Logstash" width="350" align="center" />


# Data Streaming
## Apache Zookeeper & Kafka


<img src="./img/zk_meme.png" title="ZkLogoMeme" width="350" align="right" />


**ZooKeeper** is a top-level Apache project, originally developed at Yahoo, that acts as a centralized service for maintaining naming and configuration data and for providing flexible, robust synchronization within distributed systems.

It keeps track of the status of the Kafka cluster nodes, as well as Kafka topics, partitions, etc. ZooKeeper lets multiple clients perform simultaneous reads and writes and acts as a shared configuration service within the system.

Briefly, ZK is the service Kafka needs to stay online. It is on its way out, though: newer Kafka releases can run without it using the built-in KRaft mode. Meanwhile Confluent, the enterprise solution, has left the open-source formula for an open-core one.






#### [Source1](https://zookeeper.apache.org/) [Source2](https://www.cloudkarafka.com/blog/2018-07-04-cloudkarafka_what_is_zookeeper.html)  [Source3](https://www.confluent.io/confluent-community-license-faq/)



In a few words, **Kafka** is a distributed event streaming platform used for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications (well, maybe not mission-critical in this case). Here its usage is limited to a data pipeline and streaming service that passes events from Logstash to Spark in an easy and reliable way. It offers three key capabilities for event streaming:
1) To *publish* (write) and *subscribe* to (read) streams of events, including continuous import/export of your data from other systems. This can be done using a Producer and Consumer written in any supported programming language or, as in this case, with a **Direct Approach** that bypasses any overhead and avoids an additional point of failure.

2)  To *store* streams of events durably and reliably for as long as you want.

3)  To *process* streams of events as they occur or retrospectively, an added value to this already versatile platform.
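
To make the three capabilities a bit more tangible, here is a toy in-memory model of a single-partition topic. It is only an illustration of the publish/store/process idea and has nothing to do with the real Kafka client API; the class `ToyTopic` and its methods are invented for this sketch.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic (not the Kafka API)."""

    def __init__(self, name):
        self.name = name
        self._log = []       # append-only list of events (the "store" part)
        self._offsets = {}   # next offset to read, per consumer group

    def publish(self, event):
        """Append an event and return its offset."""
        self._log.append(event)
        return len(self._log) - 1

    def poll(self, group, max_events=10):
        """Return unread events for `group` and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def replay(self, from_offset=0):
        """Re-read stored events without touching any consumer offset."""
        return list(self._log[from_offset:])


topic = ToyTopic("showdown")
topic.publish({"bid": 1, "turns": 20})
topic.publish({"bid": 2, "turns": 31})

first = topic.poll("spark")        # both events, read as they arrive
topic.publish({"bid": 3, "turns": 44})
second = topic.poll("spark")       # only the new event
history = topic.replay()           # all three, processed retrospectively
print(len(first), len(second), len(history))  # prints: 2 1 3
```

The point of the per-group offset is that each subscriber reads at its own pace while the log itself stays untouched, which is what makes retrospective processing possible.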

#### [Source1](https://kafka.apache.org) [Source2](https://kafka.apache.org/documentation/#gettingStarted)

# Data Processing
## Let's get Started!


# Apache Spark

**Spark** is where everything takes shape, going from a simple JSON-formatted event to a dataframe used to train an ML model that tries to predict the battle outcome on the fly, based on the opponent's team. Let's go in order.

## Spark Dataframe (dataframe.py)
Implemented using the *pyspark.sql* module, specifically designed for structured data processing, it is fundamentally used to create multiple dataframes all over this project (each of them stained with a little of my inexperienced Python programmer blood).

The **first** use case (dataframe.py) presents a simple dataframe that consists of two columns, one for the opponent team and the second for the battle score.
The **second** and more complex one (showdown_es.py) is filled with all the useful data collected and processed along the pipeline. 
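
To make the first use case concrete, the sketch below reduces one battle event to a (team, label) pair in plain Python. It assumes the team column is a comma-separated string of Pokemon names and that the score is the `win` flag from the JSON example; the real dataframe.py may encode either differently. The helper name `to_training_row` and the sample event are hypothetical.

```python
def to_training_row(event):
    """Reduce one battle event to a (team, label) tuple.

    Assumes every entry in event["pokemon"] carries the same `win` flag.
    """
    mons = event["pokemon"]
    team = ",".join(p["name"] for p in mons)
    label = float(mons[0]["win"])
    return team, label


# Hypothetical, heavily shortened event.
event = {
    "battle": [{"bid": 3, "player": "Norne42", "type": "gen8ou", "turns": 44}],
    "pokemon": [
        {"name": "barraskewda", "win": 1},
        {"name": "pelipper", "win": 1},
    ],
}
rows = [to_training_row(event)]
print(rows)  # [('barraskewda,pelipper', 1.0)]
```

With PySpark available, rows like these can be turned into a two-column dataframe with `spark.createDataFrame(rows, ["team", "label"])`.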


## Spark MLlib (training.py)
Advertised as easy to use... surprisingly it is! In this instance we built a Pipeline which consists of 3 stages.

 - **RegexTokenizer**: takes a string as input and returns as many tokens as there are Pokemon in the input string.
 - **Word2Vec**: creates a word embedding for each token. Word2vec takes a corpus of text as its input and produces a vector space of configurable dimension, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another.
 - **Logistic Regression**: used as a form of binary regression to predict one of the two possible labels, "win/lose" (1/0), for the word embeddings in input.

The dataframe generated in the previous step is then split into two for model training. The model is then evaluated and "optimized", and eventually saved so it can be used later without the training overhead.
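
The three stages and the split/evaluate/save steps could be wired together roughly as follows. This is a sketch under assumptions, not the contents of training.py: the column names `team` and `label`, the comma separator, the vector size, the 80/20 split and the choice of area under ROC as the metric are all guesses made for illustration.

```python
def build_pipeline(vector_size=50):
    """Assemble the three stages: tokenizer, Word2Vec, logistic regression."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import RegexTokenizer, Word2Vec

    # Split the team string on commas: one token per Pokemon (assumed format).
    tokenizer = RegexTokenizer(inputCol="team", outputCol="tokens", pattern=",")
    # Turn the token list into a single fixed-size feature vector.
    word2vec = Word2Vec(
        inputCol="tokens", outputCol="features",
        vectorSize=vector_size, minCount=0,
    )
    # Binary classifier on top of the embedding: win (1) or lose (0).
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    return Pipeline(stages=[tokenizer, word2vec, lr])


def train_and_evaluate(df, seed=42):
    """Split the data, fit the pipeline, and score it on the held-out part."""
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    model = build_pipeline().fit(train)
    evaluator = BinaryClassificationEvaluator(labelCol="label")
    auc = evaluator.evaluate(model.transform(test))  # area under ROC
    return model, auc
```

A fitted model can then be persisted with `model.write().overwrite().save(path)` and reloaded later with `PipelineModel.load(path)`, which is what avoids retraining on every run.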
 
 #### [Source1](https://en.wikipedia.org/wiki/Word2vec) [Source2](https://en.wikipedia.org/wiki/Logistic_regression) [Example](https://blog.insightdatascience.com/hero2vec-d42d6838c941)

## Spark x Pokemon  (showdown_es.py)
The final step is where everything gets combined: the model trained on the original dataframe is used to make actual predictions on live data coming from the battling bots, expanding the incoming RDD and sending everything to Elasticsearch.

# Elasticsearch & Kibana
## The worst is over...or maybe not

Both are part of the Elastic Stack and are essential to give our data expressive power, but not without some trouble!


**Elasticsearch** is used for storing and indexing data, providing an extremely fast search engine. All we have to do is send a very simple and intuitive mapping...




 <img src="./img/mapping.png" title="Mapping" width="1000" align="center" />


...to create an incredibly flexible index...
 <img src="./img/MemeAttack.webp" title="Mapping meme" width="800" align="top" />
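
Jokes aside, a mapping is just a JSON document describing field types. The sketch below shows a hypothetical one, together with a helper that builds the newline-delimited body used by the Elasticsearch `_bulk` endpoint. The field names, their types, the `showdown` index name and the `bulk_body` helper are assumptions for illustration and not the mapping from the screenshot.

```python
import json

# Hypothetical mapping: the field names loosely follow the battle JSON,
# the types are a guess.
mapping = {
    "mappings": {
        "properties": {
            "player": {"type": "keyword"},
            "type": {"type": "keyword"},
            "turns": {"type": "integer"},
            "team": {"type": "keyword"},
            "prediction": {"type": "float"},
        }
    }
}


def bulk_body(index, docs):
    """Build the newline-delimited body expected by the `_bulk` endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # source line
    return "\n".join(lines) + "\n"  # the body must end with a newline


docs = [{"player": "Norne42", "type": "gen8ou", "turns": 44, "prediction": 1.0}]
print(json.dumps(mapping, indent=2))
print(bulk_body("showdown", docs))
```

The mapping would be sent once when creating the index (`PUT /<index>`), while bodies like the one above are posted to `/_bulk` with the `application/x-ndjson` content type.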


## Last but not least
**Kibana** is the only component that doesn't need a thousand words to describe. A very powerful UI to visualize and navigate our data.

**A picture's worth a thousand log lines**

[source](https://www.elastic.co/kibana)

 <img src="./img/dashboard.png" title="Kibana Dashboard" width="1500" align="center" />


# Conclusions
## That's it!

This is the end of our journey. The whole project has been a great introduction to the everyday technologies that make up a typical data pipeline. During this month I had the chance to learn a lot more than I expected, so I would like to thank my professor Salvatore Nicotra and a lot of people I don't even know who shared (and solved!) the problems I encountered.

 <img src="./img/thanks.jpg" title="Thanks" width="500" align="center" />


# Future Improvements

Sadly this project is far from perfect, but luckily there is room for a lot of improvements:

 - More variables taken into account to predict a win or a loss, and therefore better accuracy.
 - More data from the source about full Pokemon sets, not just what a bot sees in a battle.
 - Live prediction during battle.
 - Recommendation system for a competitive team.
