<a href="https://colab.research.google.com/github/AjeetSingh02/spark_learn/blob/master/InstallingAndGettingStartedWithApacheSparkOnGoogleColab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Older Spark releases such as 2.4.5 are served from archive.apache.org
!wget -q https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark

In [2]:
!ls /usr/lib/jvm

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


In [3]:
# This installation is optional.
# It is useful when converting a Spark DataFrame into a pandas DataFrame.
# By default the conversion pickles and unpickles the data, which is quite slow.
# With pyarrow installed and Arrow enabled in the Spark configuration,
# the data is transferred in Arrow's columnar format instead, with no pickling.
!pip install pyarrow
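
Installing pyarrow does not by itself change how `toPandas()` behaves. In Spark 2.4 the Arrow path is switched on through the `spark.sql.execution.arrow.enabled` setting (Spark 3.x renames it to `spark.sql.execution.arrow.pyspark.enabled`). A minimal sketch, assuming a SparkSession named `spark` like the one created later in this notebook; the helper name `to_pandas_with_arrow` is hypothetical:

```python
# Setting name used by Spark 2.4; Spark 3.x uses a different key
ARROW_KEY = "spark.sql.execution.arrow.enabled"


def to_pandas_with_arrow(spark, df):
    """Convert a Spark DataFrame to pandas with Arrow transfer enabled.

    `spark` is an existing SparkSession and `df` a Spark DataFrame.
    The helper itself is hypothetical, not part of PySpark.
    """
    # This is a runtime SQL setting, so it can be changed after start-up
    spark.conf.set(ARROW_KEY, "true")
    return df.toPandas()
```

Once the session and a DataFrame exist, `to_pandas_with_arrow(spark, credit_df)` would return a pandas DataFrame.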



In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"
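
If either path is wrong (for example, the tarball did not download or extract), the failure only shows up later when Spark starts. A small sketch of a check that can be run first; the function name `missing_spark_paths` is an assumption made for illustration:

```python
import os


def missing_spark_paths(env=os.environ, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]


problems = missing_spark_paths()
if problems:
    print("Check these variables before calling findspark.init():", problems)
```

An empty list means both directories exist.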

In [0]:
import findspark
findspark.init()   # Locates Spark via SPARK_HOME and adds PySpark to sys.path

In [0]:
# To work with Spark we need a SparkSession
from pyspark.sql import SparkSession

# master("local[*]") runs Spark locally since we don't have a distributed environment:
# both the driver and the executors live in the local Colab environment.
# The other settings are the ones we would usually pass to spark-submit.
# Memory settings are read at start-up, so they go on the builder before getOrCreate().
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.executor.memory", "4g")   # Memory allocated to each executor
    .config("spark.driver.memory", "4g")     # Memory allocated to the driver
    .config("spark.memory.fraction", "0.9")  # Fraction of heap used for execution and storage
    .getOrCreate()
)
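
Memory settings such as `spark.driver.memory` are read when the JVM starts, so it can be worth checking what the running context actually holds. A rough sketch, assuming the `spark` session from the cell above; the helper name `effective_settings` is hypothetical:

```python
def effective_settings(spark, keys):
    """Map each key to the value held by the running SparkContext's SparkConf."""
    conf = spark.sparkContext.getConf()
    # SparkConf.get accepts a default that is returned for keys that were never set
    return {key: conf.get(key, "<not set>") for key in keys}


# Usage, once `spark` exists:
# effective_settings(spark, ["spark.driver.memory",
#                            "spark.executor.memory",
#                            "spark.memory.fraction"])
```

A key reported as `<not set>` was not part of the configuration the context started with.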

Other files can be supplied as well:

For JAR files or any other spark-submit arguments
* os.environ['PYSPARK_SUBMIT_ARGS'] = 

To add a Python file to the Spark context
* spark.sparkContext.addPyFile('../')
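
As an illustration of the first option, here is one way the `PYSPARK_SUBMIT_ARGS` value could be assembled. Two points to keep in mind: the variable has to be set before the SparkSession is created, and the string has to end with `pyspark-shell`. The helper name `build_submit_args` and the JAR path are assumptions, not values from this notebook:

```python
import os


def build_submit_args(jars=(), packages=()):
    """Build a PYSPARK_SUBMIT_ARGS string from JAR paths and Maven coordinates."""
    parts = []
    if jars:
        parts += ["--jars", ",".join(jars)]
    if packages:
        parts += ["--packages", ",".join(packages)]
    # PySpark expects "pyspark-shell" as the final token
    parts.append("pyspark-shell")
    return " ".join(parts)


# Hypothetical JAR path, for illustration only
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars=["/content/example.jar"])
```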

**Spark is installed. Now let's test it.**

In [0]:
credit_df = spark.read.option("inferSchema", "true").csv("/content/sample_data/california_housing_test.csv", header=True)
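
`inferSchema` makes Spark take an extra pass over the file to guess the column types. When the columns are known up front, a schema can be passed instead. A sketch, assuming every column in this file is a double (which is what the values printed by `credit_df.show()` suggest); the helper name `read_housing_csv` is hypothetical:

```python
COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "median_house_value",
]

# Assumption: all nine columns are doubles
SCHEMA_DDL = ", ".join(f"{name} DOUBLE" for name in COLUMNS)


def read_housing_csv(spark, path="/content/sample_data/california_housing_test.csv"):
    """Read the CSV with a fixed DDL schema instead of inferSchema."""
    return spark.read.csv(path, header=True, schema=SCHEMA_DDL)
```

`read_housing_csv(spark)` would then give a DataFrame with the same columns as `credit_df`, without the inference pass.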

In [8]:
credit_df.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -122.05|   37.37|              27.0|     3885.0|         661.0|    1537.0|     606.0|       6.6085|          344700.0|
|   -118.3|   34.26|              43.0|     1510.0|         310.0|     809.0|     277.0|        3.599|          176500.0|
|  -117.81|   33.78|              27.0|     3589.0|         507.0|    1484.0|     495.0|       5.7934|          270500.0|
|  -118.36|   33.82|              28.0|       67.0|          15.0|      49.0|      11.0|       6.1359|          330000.0|
|  -119.67|   36.33|              19.0|     1241.0|         244.0|     850.0|     237.0|       2.9375|           81700.0|
|  -119.56|   36.51|    

In [9]:
credit_df.count()

3000