
# **Running PySpark in Colab**

To run Spark in Colab, we first need to install its dependencies in the Colab environment: Apache Spark 3.0.0 with Hadoop 3.2, Java 8, and findspark, which locates the Spark installation on the system. The installation can be carried out directly inside the Colab notebook. One important note: if you are new to Spark, it is better to avoid Spark 2.4.0, since some users have reported compatibility issues between that version and Python.
Follow these steps to install the dependencies:

In [6]:
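# Install Java 8 (headless JDK), suppressing the installer output
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and unpack Spark 3.0.0 built for Hadoop 3.2
!wget -q https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0-bin-hadoop3.2.tgz
!tar xf spark-3.0.0-bin-hadoop3.2.tgz
# Install findspark so Python can locate the Spark installation
!pip install -q findspark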

Now that Spark and Java are installed in Colab, set the environment variables that allow PySpark to find them. Set the locations of Java and Spark by running the following code:

In [7]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-bin-hadoop3.2"
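
Before initializing Spark, it can help to confirm that both variables point to directories that actually exist, since a wrong path tends to surface later as a less obvious error. The following is a minimal sketch; the helper name `missing_paths` is hypothetical and not part of any library.

```python
import os

def missing_paths(var_names):
    """Return {variable: value} for variables that are unset
    or whose value is not an existing directory."""
    missing = {}
    for name in var_names:
        value = os.environ.get(name)
        if value is None or not os.path.isdir(value):
            missing[name] = value
    return missing

# Check the two variables set in the cell above.
problems = missing_paths(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these settings:", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```

If the Spark archive was extracted somewhere other than `/content`, the reported value for `SPARK_HOME` shows which path needs adjusting.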

Run a local Spark session to test your installation:

In [8]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [9]:
import pandas as pd


In [10]:
from google.colab import files
files.upload()

Saving data1.csv to data1 (1).csv
Saving data2.csv to data2 (1).csv


{'data1.csv': b'\xef\xbb\xbfID,Price\r\n1488566728,39.89\r\n1603768865,51.99\r\n1542813125,40.02\r\n...

Check that the datasets were uploaded correctly by loading them with pandas and displaying them:

In [26]:
data4 = pd.read_csv("data1.csv")

In [27]:
data4

                ID  Price
0     1.488567e+09  39.89
1     1.603769e+09  51.99
2     1.542813e+09  40.02
3     1.298955e+09  50.31
4     1.168517e+09  46.26
...            ...    ...
9794  3.068345e+09    NaN
9795  3.059270e+09    NaN
9796  3.060378e+09    NaN
9797  2.892216e+09    NaN


In [29]:
data5 = pd.read_csv("data2.csv")

In [30]:
data5

                ID   Price
0     1.488567e+09  135.98
1     1.603769e+09  147.49
2     1.542813e+09  146.93
3     1.298955e+09  139.89
4     1.168517e+09  125.60
...            ...     ...
9794  3.068345e+09   52.48
9795  3.059270e+09   52.36
9796  3.060378e+09   52.17
9797  2.892216e+09   52.03
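
The upload output shows that `data1.csv` begins with a UTF-8 byte order mark (`\xef\xbb\xbf`) and uses `\r\n` line endings. As a lightweight check that does not depend on pandas, the header and row count can be read with the standard `csv` module. This is a sketch only: the helper name `summarize_csv` is hypothetical, and the small sample file written below is a stand-in so that the example runs on its own.

```python
import csv
import os
import tempfile

def summarize_csv(path):
    """Return (header, number_of_data_rows) for a CSV file,
    ignoring a UTF-8 byte order mark if one is present."""
    # "utf-8-sig" strips a leading BOM; newline="" lets csv handle \r\n itself.
    with open(path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count

# Stand-in file with the same BOM and line endings as the uploaded data.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "wb") as f:
    f.write(b"\xef\xbb\xbfID,Price\r\n1488566728,39.89\r\n1603768865,51.99\r\n")

print(summarize_csv(sample_path))
```

In the Colab session, calling `summarize_csv("data1.csv")` would report the real header and row count for the uploaded file.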


In [34]:
print(pd.merge(data4, data5, on='ID'))

                ID Price_x  Price_y
0     1.488567e+09   39.89   135.98
1     1.603769e+09   51.99   147.49
2     1.542813e+09   40.02   146.93
3     1.298955e+09   50.31   139.89
4     1.168517e+09   46.26   125.60
...            ...     ...      ...
9529  3.068345e+09     NaN    52.48
9530  3.059270e+09     NaN    52.36
9531  3.060378e+09     NaN    52.17
9532  2.892216e+09     NaN    52.03
9533  2.902793e+09     NaN    49.48

[9534 rows x 3 columns]
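
By default, `pd.merge` performs an inner join on the key column and appends `_x` and `_y` to column names that appear in both frames, which is where `Price_x` and `Price_y` come from. The sketch below uses small made-up frames (the IDs and prices are illustrative, not taken from the uploaded files) to show how the `suffixes`, `how`, and `indicator` parameters make the result easier to read and audit.

```python
import pandas as pd

# Small illustrative frames; the IDs and prices are made up for this sketch.
left = pd.DataFrame({"ID": [1, 2, 3], "Price": [39.89, 51.99, 40.02]})
right = pd.DataFrame({"ID": [2, 3, 4], "Price": [147.49, 146.93, 139.89]})

# Default inner join: only IDs present in both frames survive.
inner = pd.merge(left, right, on="ID", suffixes=("_data1", "_data2"))

# Outer join with an indicator column showing which side each row came from.
outer = pd.merge(left, right, on="ID", how="outer",
                 suffixes=("_data1", "_data2"), indicator=True)

print(inner)
print(outer)
```

Rows marked `left_only` or `right_only` in the `_merge` column are the ones an inner join would drop, which is one way to account for a merged frame having fewer rows than its inputs.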


In [32]:
!ls

'data1 (1).csv'   data2.csv		      spark-3.0.0-bin-hadoop3.2.tgz
 data1.csv	  sample_data
'data2 (1).csv'   spark-3.0.0-bin-hadoop3.2
