#**Introduction to Big Data**
##Playing with Colab & Spark (Python)

A fairly practical way to use PySpark in Colab is to run the commands below in Colab cells.

We will do the following:

*   Quick installation of Java and Spark
*   Quick configuration of Java and Spark
*   Examples using PySpark


##Install Java and Spark
We can quickly add Java to our runtime environment.

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null


We go to the Apache Spark download site, fetch the desired version, and unpack it.

In [16]:
import os  # operating system interface library
# Older releases such as 2.4.5 are kept under archive.apache.org/dist/spark/
# in case this mirror no longer serves them.
os.system("wget -q https://www-us.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz")
os.system("tar -xzvf spark-2.4.5-bin-hadoop2.7.tgz")
!ls  # list the files

sample_data		       spark-2.4.5-bin-hadoop2.7.tgz.1
spark-2.4.5-bin-hadoop2.7      spark-2.4.5-bin-hadoop2.7.tgz.2
spark-2.4.5-bin-hadoop2.7.tgz
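
The `.tgz.1` and `.tgz.2` files in the listing appear because the download cell was run more than once and `wget` saved each new copy under a numbered name. A minimal sketch of a re-runnable alternative follows, using only the standard library. The helper names `fetch_once` and `extract_once` are hypothetical, and the code assumes that an archive named `X.tgz` unpacks into a folder named `X`, as the Spark archive does here.

```python
import os
import tarfile
import urllib.request


def fetch_once(url, dest_dir="."):
    """Download url into dest_dir unless the file is already there.

    Returns the local path of the archive.
    """
    filename = os.path.basename(url)
    path = os.path.join(dest_dir, filename)
    if not os.path.exists(path):
        # Only hit the network when the archive is missing, so re-running
        # the cell does not pile up duplicate copies.
        urllib.request.urlretrieve(url, path)
    return path


def extract_once(archive_path, dest_dir="."):
    """Unpack a .tgz archive unless its top-level folder already exists."""
    # Assumption: "name.tgz" unpacks into a folder called "name".
    folder = os.path.basename(archive_path)[: -len(".tgz")]
    target = os.path.join(dest_dir, folder)
    if not os.path.isdir(target):
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(dest_dir)
    return target
```

In the notebook this could be called as `extract_once(fetch_once(url, "/content"), "/content")`, with `url` set to the same Spark archive address used above.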


Now we install **pyspark**.

In [0]:
!pip install -q pyspark


##Configure the Java and Spark variables


In [21]:
!rm spark-2.4.5-bin-hadoop2.7.tgz.1
!rm spark-2.4.5-bin-hadoop2.7.tgz.2
!ls /content/

sample_data  spark-2.4.5-bin-hadoop2.7	spark-2.4.5-bin-hadoop2.7.tgz


In [0]:
# Environment variables
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
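
If either path is wrong, the failure usually shows up only later, when the Spark session is created. As a hedged sketch, the check below reports variables that are unset or that do not point to an existing directory. The function name `missing_env_dirs` is hypothetical and not part of Spark or Colab.

```python
import os


def missing_env_dirs(names=("JAVA_HOME", "SPARK_HOME"), environ=None):
    """Return the variables that are unset or do not point to a directory."""
    environ = os.environ if environ is None else environ
    problems = {}
    for name in names:
        value = environ.get(name)
        if value is None:
            problems[name] = "not set"
        elif not os.path.isdir(value):
            problems[name] = "not a directory: " + value
    return problems


# An empty dict means both paths look usable.
print(missing_env_dirs())
```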

##Start and use PySpark

We begin by importing the required libraries and creating a Spark session.

In [0]:
# Load PySpark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test_spark").master("local[*]").getOrCreate()

We display the session details.

In [26]:
spark

In the sample_data folder, Colab provides a few sample datasets. Let's take a quick look at the file **./sample_data/california_housing_train.csv**.


In [28]:
!head -n 10 ./sample_data/california_housing_train.csv

"longitude","latitude","housing_median_age","total_rooms","total_bedrooms","population","households","median_income","median_house_value"
-114.310000,34.190000,15.000000,5612.000000,1283.000000,1015.000000,472.000000,1.493600,66900.000000
-114.470000,34.400000,19.000000,7650.000000,1901.000000,1129.000000,463.000000,1.820000,80100.000000
-114.560000,33.690000,17.000000,720.000000,174.000000,333.000000,117.000000,1.650900,85700.000000
-114.570000,33.640000,14.000000,1501.000000,337.000000,515.000000,226.000000,3.191700,73400.000000
-114.570000,33.570000,20.000000,1454.000000,326.000000,624.000000,262.000000,1.925000,65500.000000
-114.580000,33.630000,29.000000,1387.000000,236.000000,671.000000,239.000000,3.343800,74000.000000
-114.580000,33.610000,25.000000,2907.000000,680.000000,1841.000000,633.000000,2.676800,82400.000000
-114.590000,34.830000,41.000000,812.000000,168.000000,375.000000,158.000000,1.708300,48500.000000
-114.590000,33.610000,34.000000,4789.000000,1175.000000,3134.000000

In [29]:
# Load the file with Spark (in memory)
archivo = './sample_data/california_housing_train.csv'
df_spark = spark.read.csv(archivo, inferSchema=True, header=True)

# Print the type of the loaded object
print(type(df_spark))

<class 'pyspark.sql.dataframe.DataFrame'>


###How many records does this DataFrame have?

In [30]:
df_spark.count()

17000

### What is the structure of the DataFrame?


In [31]:
df_spark.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)
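
Every column is reported as `double`, including ones such as `housing_median_age` that hold whole numbers. This is most likely because the CSV stores every value with a decimal point (for example `15.000000`), which does not parse as an integer. The sketch below is a simplified, pure-Python illustration of the idea behind `inferSchema=True`. It is not Spark's actual implementation, and `infer_type` and `infer_schema` are hypothetical helper names.

```python
import csv
import io


def infer_type(values):
    """Pick the narrowest type name that fits every value in a column."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    return "string"


def infer_schema(csv_text):
    """Map each header name to an inferred type name."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Three columns and two rows taken from the file preview.
sample = (
    '"longitude","housing_median_age","total_rooms"\n'
    "-114.310000,15.000000,5612.000000\n"
    "-114.470000,19.000000,7650.000000\n"
)
print(infer_schema(sample))
```

All three sample columns come out as `double`, matching the schema Spark printed.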



###What are the column names?


In [32]:
df_spark.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

###Show the first records

In [34]:
# Show the first 20 records of the DataFrame
df_spark.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|    

###Statistical description of the DataFrame

In [35]:
df_spark.describe().toPandas().transpose()


| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| summary | count | mean | stddev | min | max |
| longitude | 17000 | -119.56210823529375 | 2.0051664084260357 | -124.35 | -114.31 |
| latitude | 17000 | 35.6252247058827 | 2.1373397946570867 | 32.54 | 41.95 |
| housing_median_age | 17000 | 28.58935294117647 | 12.586936981660406 | 1.0 | 52.0 |
| total_rooms | 17000 | 2643.664411764706 | 2179.947071452777 | 2.0 | 37937.0 |
| total_bedrooms | 17000 | 539.4108235294118 | 421.4994515798648 | 1.0 | 6445.0 |
| population | 17000 | 1429.5739411764705 | 1147.852959159527 | 3.0 | 35682.0 |
| households | 17000 | 501.2219411764706 | 384.5208408559016 | 1.0 | 6082.0 |
| median_income | 17000 | 3.883578100000021 | 1.9081565183791036 | 0.4999 | 15.0001 |
| median_house_value | 17000 | 207300.91235294117 | 115983.76438720895 | 14999.0 | 500001.0 |


In [43]:
df_spark.describe().show()

+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|       households|     median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|  count|              17000|             17000|             17000|            17000|            17000|             17000|            17000|             17000|             17000|
|   mean|-119.56210823529375|  35.6252247058827| 28.58935294117647|2643.664411764706|539.4108235294118|1429.5739411764705|501.2219411764706| 3.883578100000021|207300.91235294117|
| stddev| 2.0051664084260357|2.1373397946570867|12.586936981660406|2179.947071452777|421.4994515798648| 1

In [36]:
df_spark.describe(['median_house_value']).show()

+-------+------------------+
|summary|median_house_value|
+-------+------------------+
|  count|             17000|
|   mean|207300.91235294117|
| stddev|115983.76438720895|
|    min|           14999.0|
|    max|          500001.0|
+-------+------------------+
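
To make the five summary rows concrete, here is a small worked example in plain Python that computes the same quantities for the first five `median_house_value` entries shown in the file preview. It assumes that the `stddev` row is the sample standard deviation (dividing by n - 1), which is how Spark's `stddev` aggregate is documented to behave.

```python
import statistics

# First five median_house_value entries from the file preview.
values = [66900.0, 80100.0, 85700.0, 73400.0, 65500.0]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    # Sample standard deviation (divides by n - 1).
    "stddev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(name, value)
```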



###Random data generation
Random data generation is useful for testing existing algorithms and for implementing randomized algorithms such as random projection. **Spark** provides functions that generate columns of values drawn from a distribution, for example uniform (rand) and standard normal (randn).

In [37]:
df = spark.range(0, 10)
df.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+



In [39]:
from pyspark.sql.functions import rand, randn

df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal")).show()

+---+-------------------+--------------------+
| id|            uniform|              normal|
+---+-------------------+--------------------+
|  0|0.41371264720975787|  0.5888539012978773|
|  1| 0.7311719281896606|  0.8645537008427937|
|  2| 0.9031701155118229|  1.2524569684217643|
|  3|0.09430205113458567|  -2.573636861034734|
|  4|0.38340505276222947|  0.5469737451926588|
|  5| 0.1982919638208397| 0.06157382353970104|
|  6|0.12714181165849525|  0.3623040918178586|
|  7| 0.7604318153406678|-0.49575204523675975|
|  8|   0.83487085888236|   1.022815424084479|
|  9| 0.3142596916968412|   2.750429557170309|
+---+-------------------+--------------------+
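
As a rough illustration of what the two distributions and the `seed` argument mean, the sketch below draws a larger sample with Python's standard `random` module. This is an analogy, not Spark code: the generators use different algorithms, so the individual values will not match the table above.

```python
import random
import statistics

rng = random.Random(10)  # fixed seed, so every run yields the same numbers

uniform = [rng.random() for _ in range(10_000)]        # like rand(): uniform on [0, 1)
normal = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # like randn(): standard normal

print("uniform: mean", statistics.mean(uniform), "min", min(uniform), "max", max(uniform))
print("normal:  mean", statistics.mean(normal), "stddev", statistics.stdev(normal))
```

The uniform sample should average close to 0.5 and stay within [0, 1), while the normal sample should average close to 0 with a standard deviation close to 1.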



This way, Spark can be installed automatically in Google Colab and used for free.

The free tier provides only one CPU; increasing the processing capacity requires a paid plan.
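
The CPU count matters because the `local[*]` master used when creating the session asks Spark to run with as many worker threads as there are logical cores. A quick way to see what the current runtime exposes, using only the standard library:

```python
import os

# cpu_count() can return None on unusual platforms, so fall back to 1.
cores = os.cpu_count() or 1
print("Logical CPU cores visible to this runtime:", cores)
```

In the notebook, `spark.sparkContext.defaultParallelism` should normally report a matching value for a `local[*]` session.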