<a href="https://colab.research.google.com/github/DileepNalle78/pyspark__DileepNalle/blob/main/5th_aug.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Initialize the Spark and Scala Environment

In [1]:
# Mount Google Drive to store cached files
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [2]:
# Set variables
strBasePath="/content/drive/MyDrive/IBM-DE-Spark-Scala"
scala_deb_path = strBasePath+"/scala-2.12.18.deb"
spark_tgz_path = strBasePath+"/spark-3.4.1-bin-hadoop3.tgz"

!mkdir -p /content/tmp
import os
# Download Scala .deb if not cached
if not os.path.exists(scala_deb_path):
    !wget -O "{scala_deb_path}" https://github.com/scala/scala/releases/download/v2.12.18/scala-2.12.18.deb

# Download Spark tgz if not cached
if not os.path.exists(spark_tgz_path):
    !wget -O "{spark_tgz_path}" https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz

# Copy cached files to working dir
!cp "{scala_deb_path}" /content/tmp/scala-2.12.18.deb
!cp "{spark_tgz_path}" /content/tmp/spark-3.4.1-bin-hadoop3.tgz

# Install Java if not already present
!java -version || apt-get install openjdk-11-jdk-headless -qq > /dev/null

# Install Scala
!dpkg -i /content/tmp/scala-2.12.18.deb

# Extract Spark
!tar xf /content/tmp/spark-3.4.1-bin-hadoop3.tgz -C /content

# Set environment variables (os is already imported earlier in this cell)
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"
os.environ["PATH"] += f":{os.environ['SPARK_HOME']}/bin"

# Confirm installation
!java -version
!scala -version
!scalac -version
!echo "Spark path: $SPARK_HOME"
!ls $SPARK_HOME

openjdk version "11.0.28" 2025-07-15
OpenJDK Runtime Environment (build 11.0.28+6-post-Ubuntu-1ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.28+6-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)
Selecting previously unselected package scala.
(Reading database ... 126284 files and directories currently installed.)
Preparing to unpack /content/tmp/scala-2.12.18.deb ...
Unpacking scala (2.12.18-400) ...
Setting up scala (2.12.18-400) ...
Creating system group: scala
Creating system user: scala in scala with scala daemon-user and shell /bin/false
Processing triggers for man-db (2.10.2-1) ...
openjdk version "11.0.28" 2025-07-15
OpenJDK Runtime Environment (build 11.0.28+6-post-Ubuntu-1ubuntu122.04.1)
OpenJDK 64-Bit Server VM (build 11.0.28+6-post-Ubuntu-1ubuntu122.04.1, mixed mode, sharing)
Scala code runner version 2.12.18 -- Copyright 2002-2023, LAMP/EPFL and Lightbend, Inc.
Scala compiler version 2.12.18 -- Copyright 2002-2023, LAMP/EPFL and Lightbend, Inc.
Spark path: /content/
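
The `Spark path:` line in the output above is cut short, so it can help to confirm the install from Python instead of relying on the echoed value. The following is a minimal sketch and not part of the original notebook: `check_spark_install` is a hypothetical helper name, and it assumes only the usual layout of a Spark binary distribution (`bin/spark-submit` and a `jars/` directory).

```python
import os
from pathlib import Path


def check_spark_install(spark_home):
    """Report whether spark_home looks like a Spark binary distribution.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    root = Path(spark_home)
    jars_dir = root / "jars"
    return {
        "home_exists": root.is_dir(),
        "has_spark_submit": (root / "bin" / "spark-submit").is_file(),
        "jar_count": len(list(jars_dir.glob("*.jar"))) if jars_dir.is_dir() else 0,
    }


if __name__ == "__main__":
    # Fall back to the path used in this notebook if SPARK_HOME is unset.
    home = os.environ.get("SPARK_HOME", "/content/spark-3.4.1-bin-hadoop3")
    print(check_spark_install(home))
```

A `jar_count` of zero or a missing `spark-submit` would suggest the `tar xf` step did not complete or that `SPARK_HOME` points at the wrong directory.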

## Test Hello World in Java

In [5]:
%%writefile CreateDataFrame.java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.Encoders;
import java.util.Arrays;
import java.util.List;

public class CreateDataFrame {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CreateDataFrameExample")
                .master("local[*]")
                .getOrCreate();

        // Sample data
        List<String> data = Arrays.asList("Java", "Python", "Scala");

        // Create a typed Dataset<String> from the list (a DataFrame would be Dataset<Row>)
        Dataset<String> df = spark.createDataset(data, Encoders.STRING());

        // Show the DataFrame content
        df.show();

        spark.stop();
    }
}

Writing CreateDataFrame.java


In [6]:
import os
spark_home = os.environ.get("SPARK_HOME")
!javac -cp "$spark_home/jars/*" CreateDataFrame.java

In [7]:
import os
spark_home = os.environ.get("SPARK_HOME")
!java -cp "$spark_home/jars/*:." CreateDataFrame

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/05 09:35:11 INFO SparkContext: Running Spark version 3.4.1
25/08/05 09:35:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/05 09:35:12 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/05 09:35:12 INFO SparkContext: Submitted application: CreateDataFrameExample
25/08/05 09:35:12 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/05 09:35:12 INFO ResourceProfile: Limiting resource is cpu
25/08/05 09:35:12 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/05 09:35:12 INFO SecurityManager: Changing view acls to: root
25/08/05 09:35

In [8]:
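# Set variables
# Note: each `!` line runs in its own subshell, so these assignments do not
# persist into later cells. SPARK_HOME keeps the value set through os.environ
# earlier, and JARS stays unset.
!SPARK_HOME=/content/spark-3.4.1-bin-hadoop3
!JARS=$(echo $SPARK_HOME/jars/*.jar | tr ' ' ':')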


In [10]:
!echo $SPARK_HOME
!echo $JARS

/content/spark-3.4.1-bin-hadoop3
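
`!echo $JARS` prints nothing in this cell, which is consistent with how `!` commands behave in IPython: each line runs in its own subshell, so a shell assignment made on one line is gone by the next. One way around this, sketched below under that assumption, is to build the classpath in Python and store it in `os.environ`, which later `!` commands inherit. `build_classpath` is a hypothetical helper, not something the notebook defines.

```python
import os
from pathlib import Path


def build_classpath(spark_home, extra=(".",)):
    """Join extra entries and every JAR under spark_home/jars into one classpath.

    Hypothetical helper; entries are joined with os.pathsep (":" on Linux).
    """
    jars = sorted(str(p) for p in (Path(spark_home) / "jars").glob("*.jar"))
    return os.pathsep.join(list(extra) + jars)


if __name__ == "__main__":
    spark_home = os.environ.get("SPARK_HOME", "/content/spark-3.4.1-bin-hadoop3")
    # Values placed in os.environ are visible to subsequent `!` shell commands.
    os.environ["JARS"] = build_classpath(spark_home)
    print(len(os.environ["JARS"].split(os.pathsep)), "classpath entries")
```

With `JARS` exported this way, a later cell could plausibly run `!java -cp "$JARS" CreateDataFrame` without repeating the `echo ... | tr` pipeline.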



In [12]:
# Java 11+ source-file mode: passing a .java file compiles it in memory and runs it
!java -cp "$(echo $SPARK_HOME/jars/*.jar | tr ' ' ':')" CreateDataFrame.java

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/05 10:48:55 INFO SparkContext: Running Spark version 3.4.1
25/08/05 10:48:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/05 10:48:56 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/05 10:48:56 INFO SparkContext: Submitted application: CreateDataFrameExample
25/08/05 10:48:56 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/05 10:48:56 INFO ResourceProfile: Limiting resource is cpu
25/08/05 10:48:56 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/05 10:48:56 INFO SecurityManager: Changing view acls to: root
25/08/05 10:48

In [14]:
# Run the class compiled earlier with javac; "." puts CreateDataFrame.class on the classpath
!java -cp ".:$(echo $SPARK_HOME/jars/*.jar | tr ' ' ':')" CreateDataFrame

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/05 10:50:20 INFO SparkContext: Running Spark version 3.4.1
25/08/05 10:50:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/05 10:50:21 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/05 10:50:21 INFO SparkContext: Submitted application: CreateDataFrameExample
25/08/05 10:50:21 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/05 10:50:21 INFO ResourceProfile: Limiting resource is cpu
25/08/05 10:50:21 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/05 10:50:21 INFO SecurityManager: Changing view acls to: root
25/08/05 10:50
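
The last two `java` invocations differ in one detail: passing `CreateDataFrame.java` uses the Java 11+ source-file launcher, while passing `CreateDataFrame` runs the already compiled class and therefore needs `.` on the classpath. The sketch below captures that rule as an argument-list builder. `java_command` is a hypothetical helper, and it hard-codes `:` as the separator on the assumption of a Linux runtime such as Colab.

```python
def java_command(target, classpath_entries):
    """Build the argv list for a `java` invocation.

    Hypothetical helper. A target ending in ".java" is left to the source-file
    launcher; any other target is treated as a class name, so "." is added to
    the classpath to locate the compiled .class file in the current directory.
    """
    entries = list(classpath_entries)
    if not target.endswith(".java") and "." not in entries:
        entries.insert(0, ".")
    # ":" is the classpath separator on Linux and macOS.
    return ["java", "-cp", ":".join(entries), target]


if __name__ == "__main__":
    jars = ["/content/spark-3.4.1-bin-hadoop3/jars/*"]
    print(java_command("CreateDataFrame.java", jars))
    print(java_command("CreateDataFrame", jars))
```

The returned list can be handed to `subprocess.run` directly, which avoids shell quoting issues around the `*` wildcard.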