
Checking Java Version

In [None]:
!java -version

openjdk version "11.0.9.1" 2020-11-04
OpenJDK Runtime Environment (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04)
OpenJDK 64-Bit Server VM (build 11.0.9.1+1-Ubuntu-0ubuntu1.18.04, mixed mode, sharing)
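
The version can also be checked programmatically, which is useful for confirming later that the expected Java release is active. The following is a minimal sketch that assumes the output format shown above; `parse_java_major` is a hypothetical helper, not part of any library.

```python
import re

def parse_java_major(version_output):
    """Return the major Java version from `java -version` output (hypothetical helper)."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        raise ValueError("no version string found")
    parts = match.group(1).split(".")
    # Java 8 and earlier report "1.8.0_..."; Java 9 and later report "11.0.9.1"
    return int(parts[1]) if parts[0] == "1" else int(parts[0])

# First line of the output shown above
sample = 'openjdk version "11.0.9.1" 2020-11-04'
print(parse_java_major(sample))  # 11
```

Note that `java -version` writes to stderr rather than stdout, so when capturing it with `subprocess.run`, read the `stderr` attribute of the result.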


Installing Java 8 (headless JDK)

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

Downloading Spark

In [None]:
!wget -q http://apache.osuosl.org/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz

Extracting Spark Files

In [None]:
!tar xf spark-3.0.1-bin-hadoop3.2.tgz

Installing FindSpark

In [None]:
!pip install -q findspark

Listing JVM folders

In [None]:
!ls /usr/lib/jvm/

default-java		   java-11-openjdk-amd64     java-8-openjdk-amd64
java-1.11.0-openjdk-amd64  java-1.8.0-openjdk-amd64


PyArrow
- PyArrow is the Python API for Apache Arrow, a columnar in-memory data format used to build the internals of data frame libraries and other data processing applications.
- It is not an end-user library in the way pandas is.
- It also provides tools for integration and interoperability with pandas, NumPy, and other software in the Python ecosystem. PySpark can use it to speed up conversion between Spark and pandas DataFrames.

Installing PyArrow

In [None]:
!pip install -U pyarrow

Requirement already up-to-date: pyarrow in /usr/local/lib/python3.6/dist-packages (2.0.0)


Setting the JAVA_HOME and SPARK_HOME environment variables

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop3.2"
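
A wrong path here tends to surface only later, as an obscure error when the session starts, so it can help to validate the directories first. The sketch below uses a hypothetical helper, `set_home_dir`; the paths are the ones used in this notebook.

```python
import os

def set_home_dir(var_name, path):
    """Set an environment variable only if `path` is an existing directory (hypothetical helper)."""
    if not os.path.isdir(path):
        raise FileNotFoundError(f"{var_name}: directory not found: {path}")
    os.environ[var_name] = path
    return path

# Paths from this notebook; a missing directory is reported instead of being set.
for name, path in [
    ("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64"),
    ("SPARK_HOME", "/content/spark-3.0.1-bin-hadoop3.2"),
]:
    try:
        set_home_dir(name, path)
    except FileNotFoundError as err:
        print(err)
```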

Creating Spark Session

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Stopping the session

In [None]:
# spark.stop()  # uncomment to stop the session when finished

Check the pyspark version

In [None]:
import pyspark
print(pyspark.__version__)

3.0.1


Importing PySpark
- Creating a SparkConf object and setting its master and application name.

In [None]:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setMaster('local')
conf.setAppName('spark-basis')

<pyspark.conf.SparkConf at 0x7f1848414518>

In [None]:
sc = spark.sparkContext  # `sc` was never defined; use the context behind the existing session
sc.getConf().getAll()

[('spark.master', 'local'),
 ('spark.app.id', 'local-1610647032541'),
 ('spark.rdd.compress', 'True'),
 ('spark.executor.memory', '4g'),
 ('spark.app.name', 'spark-basis'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.submit.pyFiles', ''),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.port', '36003'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.driver.host', 'a65bcac3d49a')]

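A configuration object can also be built with several settings at once using setAll. It can then be passed to SparkContext, or to SparkSession.builder.config(conf=...), when a new context is created.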

In [None]:
config = pyspark.SparkConf().setAll([('spark.executor.memory', '4g'), ('spark.driver.memory','4g'), ('spark.memory.fraction', '0.9')])

sys
- Provides access to system-specific parameters and functions.

tempfile
- Creates temporary files and directories. It works on all supported platforms.

urllib
- A package for working with URLs; its urllib.request submodule opens and retrieves them.

In [None]:
import sys, tempfile, urllib.request
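
As a quick illustration of these modules (not part of the original workflow), the sketch below writes to a temporary directory that is removed automatically, then reads one `sys` attribute. The file name `scratch.txt` is arbitrary.

```python
import os
import sys
import tempfile

# TemporaryDirectory deletes the directory and its contents when the block exits.
with tempfile.TemporaryDirectory() as tmp_dir:
    scratch_path = os.path.join(tmp_dir, "scratch.txt")
    with open(scratch_path, "w") as handle:
        handle.write("hello")
    existed_inside = os.path.exists(scratch_path)

print(existed_inside)                # True: the file existed inside the block
print(os.path.exists(scratch_path))  # False: cleaned up after the block
print(sys.version_info.major)        # major version of the running interpreter
```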

Creating a base directory and an output file path to store the data

In [None]:
BASE_DIR = '/tmp'
OUTPUT_FILE = os.path.join(BASE_DIR, 'autos_data.csv')

Downloading the data from the UCI Machine Learning Repository

In [None]:
autos_data = urllib.request.urlretrieve('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', OUTPUT_FILE)
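
Re-running the cell above downloads the file again each time. A minimal sketch of a guard that skips the download when the file already exists follows; `fetch_once` is a hypothetical helper, and the demonstration uses a local `file://` URL as a stand-in so it runs without network access.

```python
import os
import pathlib
import tempfile
import urllib.request

def fetch_once(url, dest):
    """Download `url` to `dest` unless `dest` already exists (hypothetical helper).

    Returns True if a download happened, False if the existing file was kept.
    """
    if os.path.exists(dest):
        return False
    urllib.request.urlretrieve(url, dest)
    return True

# Demonstration with a local file standing in for the remote dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    source = pathlib.Path(tmp_dir) / "source.csv"
    source.write_text("a,b\n1,2\n")
    dest = os.path.join(tmp_dir, "copy.csv")
    first = fetch_once(source.as_uri(), dest)   # copies the file
    second = fetch_once(source.as_uri(), dest)  # skipped, file already present
    with open(dest) as handle:
        copied = handle.read()

print(first, second)  # True False
```

In this notebook the call would be `fetch_once('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', OUTPUT_FILE)`.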

In [None]:
!ls /tmp

autos_data.csv
blockmgr-7ab72995-0857-4805-9291-2e2443f76e79
credit_data.csv
dap_multiplexer.a65bcac3d49a.root.log.INFO.20210114-172326.51
dap_multiplexer.INFO
debugger_29cke107i9
hsperfdata_root
initgoogle_syslog_dir.0
liblz4-java-6750267683785321441.so
liblz4-java-6750267683785321441.so.lck
spark-3cf3a23b-7244-4929-8be1-76ee8734cf27
spark-a12e9b4b-0bb8-4478-82ef-dfdb30c1e044


inferSchema
- With inferSchema set to true, Spark makes an extra pass over the CSV file and infers the type of each column. Without it, every column is read as a string.
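
To make the idea concrete, here is a simplified, pure-Python illustration of type inference over column values. It is not Spark's actual algorithm, only a sketch of the principle that every value in a column must fit the chosen type. `infer_column_type` is a hypothetical name, and treating empty strings as missing is an assumption.

```python
def _fits(value, caster):
    """Return True if `caster` (int or float) accepts the string `value`."""
    try:
        caster(value)
        return True
    except ValueError:
        return False

def infer_column_type(values):
    """Pick the narrowest of int, double, string that fits all non-empty values."""
    present = [v for v in values if v != ""]
    if present and all(_fits(v, int) for v in present):
        return "int"
    if present and all(_fits(v, float) for v in present):
        return "double"
    return "string"

# Values taken from columns of the autos dataset shown in this notebook
print(infer_column_type(["3", "1", "2"]))      # int
print(infer_column_type(["88.60", "94.50"]))   # double
print(infer_column_type(["164", "?", "158"]))  # string
```

The last case mirrors the `?` placeholders in this dataset: a single non-numeric marker makes the whole column a string. Spark's CSV reader has a `nullValue` option that can be set to `?` so that such markers are read as nulls.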

Importing the SparkSession

In [None]:
from pyspark.sql import SparkSession

Initialize SparkSession object

In [None]:
spark = SparkSession \
    .builder \
    .master("local") \
    .appName("Spark CSV Reader") \
    .getOrCreate()

Reading the CSV file

In [None]:
autos_df = spark.read.option("inferSchema", "true").csv("/tmp/autos_data.csv", header=False)

Showing the first rows of the DataFrame

In [None]:
autos_df.show()

+---+---+----+---+-----+----+-----------+----+-----+------+----+-----+-----+----+----+-----+
|_c0|_c1| _c2|_c3|  _c4| _c5|        _c6| _c7|  _c8|   _c9|_c10| _c11| _c12|_c13|_c14| _c15|
+---+---+----+---+-----+----+-----------+----+-----+------+----+-----+-----+----+----+-----+
|  3|  ?|null|gas|  std| two|convertible|null|front| 88.60|null|64.10|48.80|2548|null| four|
|  3|  ?|null|gas|  std| two|convertible|null|front| 88.60|null|64.10|48.80|2548|null| four|
|  1|  ?|null|gas|  std| two|  hatchback|null|front| 94.50|null|65.50|52.40|2823|null|  six|
|  2|164|null|gas|  std|four|      sedan|null|front| 99.80|null|66.20|54.30|2337|null| four|
|  2|164|null|gas|  std|four|      sedan|null|front| 99.40|null|66.40|54.30|2824|null| five|
|  2|  ?|null|gas|  std| two|      sedan|null|front| 99.80|null|66.30|53.10|2507|null| five|
|  1|158|null|gas|  std|four|      sedan|null|front|105.80|null|71.40|55.70|2844|null| five|
|  1|  ?|null|gas|  std|four|      wagon|null|front|105.80|null|71.40|

Counting the number of rows in the file

In [None]:
autos_df.count()

205

# COMPLETED