# AbadIA

This notebook contains a few simple examples for exploring the datasets from the AbadIA project.
Some of the steps use Apache Spark.

## Spark Bootstrapping for Google Colab

Run this before you start working with the notebooks from the Spark course.
When you start a new notebook in Colab, Google Cloud creates a new Docker container just for your use.

Executing this notebook installs the required software (Java 8, Spark 2.4.0, and findspark) into that container. The container is reused until it is destroyed due to inactivity.


In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
!tar xf spark-2.4.0-bin-hadoop2.7.tgz
!pip install -q findspark

## Set Environment Variables
Set the locations where Spark and Java are installed.

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.0-bin-hadoop2.7"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"
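
If the install cell failed or the container was recycled, the paths above may point at directories that no longer exist, and Spark will then fail later with a less obvious error. The following is a minimal sketch of a sanity check; the helper name `missing_dirs` is hypothetical and not part of the AbadIA repository.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return {name: value} for variables that are unset or
    do not point to an existing directory (hypothetical helper)."""
    problems = {}
    for name in var_names:
        value = environ.get(name)
        if value is None or not os.path.isdir(value):
            problems[name] = value
    return problems

# Report any variable that does not resolve to a directory.
problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
for name, value in problems.items():
    print(f"{name} is unset or not a directory: {value!r}")
```

If this prints nothing, both variables point at existing directories; otherwise rerunning the install cell is probably the simplest fix.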

## Start a SparkSession
This will start a local Spark session.

In [0]:
import findspark
# findspark.init() returns None, so there is nothing useful to print.
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext


## Cloning our GitHub repo

In [0]:
!rm -rf /content/abadia-gym
!git clone https://github.com/LaAbadIAdelCrimen/abadia-gym  

## Download your favorite datasets 


In [0]:
!mkdir -p /content/abadia-gym/datasets
!cd /content/abadia-gym/datasets && wget https://storage.googleapis.com/abadia-data/datasets/actions-2019030x.tgz 
!cd /content/abadia-gym/datasets && tar xvzf *.tgz   
!ls -ltr /content/abadia-gym/datasets
!#/content/abadia-gym/tools/download_list.sh https://storage.googleapis.com/abadia-data/last_5000_actions_list.txt 2>/dev/null& 

In [0]:
!# mv /content/abadia-gym/datasets/actions-20190326.tgz /content/abadia-gym/datasets/actions-20190326.tar.gz
!# cd /content/abadia-gym/datasets && gunzip actions-20190326.tar.gz
!#rm /content/abadia-gym/datasets/*2019* 
!# ls -ltr /content/abadia-gym/datasets
!#rm /content/abadia-gym/datasets/*


In [0]:
!#tar tvf /content/abadia-gym/datasets/actions-test.tar
!#ls -lt /content/abadia-gym/datasets
!# cd /content/abadia-gym/datasets && tar xvf actions-test.tar

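# Read line-delimited JSON (one record per line) and silently drop
# records that cannot be parsed.
# Note: badRecordsPath is documented for Databricks runtimes; the
# open-source Spark 2.4 used here may ignore it.
df = (
    spark.read
    .option("multiLine", False)
    .option("mode", "DROPMALFORMED")
    .option("badRecordsPath", "/tmp/bad-format")
    .json("/content/abadia-gym/datasets/*actions*.json")
)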
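
Because malformed records are dropped silently, it can be useful to estimate how many lines are affected before relying on `df.count()`. The sketch below uses only the Python standard library and treats every non-empty line that `json.loads` rejects as malformed. This is an approximation: Spark's own rules for what counts as a malformed record may differ, and the helper name `count_json_lines` is hypothetical.

```python
import glob
import json

def count_json_lines(paths):
    """Return (valid, malformed) counts over the non-empty lines
    of the given files (hypothetical helper)."""
    valid = malformed = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    json.loads(line)
                    valid += 1
                except ValueError:  # json.JSONDecodeError is a ValueError
                    malformed += 1
    return valid, malformed

# Same glob pattern as the Spark read in this notebook.
files = glob.glob("/content/abadia-gym/datasets/*actions*.json")
valid, malformed = count_json_lines(files)
print(f"{len(files)} files, {valid} parseable lines, {malformed} malformed lines")
```

A large malformed count would suggest checking the downloaded archives before drawing conclusions from the DataFrame.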

In [0]:
!mkdir -p /tmp/bad-format
!ls -lt /tmp/bad-format
!# tar tvf /content/abadia-gym/datasets/actions-test.tar
# printSchema() prints the schema itself and returns None.
df.printSchema()

In [0]:
df.count()

In [0]:
df.cache()

In [0]:
df.select("action.*").show(10, False)

In [0]:
# show() prints the table itself and returns None.
df.describe("action.nextstate.*").show()

In [0]:
df.filter("action is not null and porcentaje is not null").show(20, False)