<a href="https://colab.research.google.com/github/Kiran45181/Pyspark/blob/main/scala_basic_to_advance_programs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Prepare Scala Environment

*   Running A Job using Classpath



In [1]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.4.1/spark-3.4.1-bin-hadoop3.tgz
!wget -q https://github.com/scala/scala/releases/download/v2.12.18/scala-2.12.18.deb
!tar xf spark-3.4.1-bin-hadoop3.tgz
!dpkg -i scala-2.12.18.deb

# Set environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.4.1-bin-hadoop3"

!curl -Lo coursier https://git.io/coursier-cli && chmod +x coursier
!./coursier launch --fork almond --scala 2.12.10 -- --install

Selecting previously unselected package scala.
(Reading database ... 126284 files and directories currently installed.)
Preparing to unpack scala-2.12.18.deb ...
Unpacking scala (2.12.18-400) ...
Setting up scala (2.12.18-400) ...
Creating system group: scala
Creating system user: scala in scala with scala daemon-user and shell /bin/false
Processing triggers for man-db (2.10.2-1) ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 42577  100 42577    0     0   162k      0 --:--:-- --:--:-- --:--:--  162k
Downloaded 1 missing file(s) / 58
Downloaded 2 missing file(s) / 58
Downloaded 3 missing file(s) / 58
Downloaded 4 missing file(s) / 58
Downloaded 5 missing file(s) / 58
Down

In [2]:
!ls /usr/lib/jvm/java-11-openjdk-amd64
!ls /content/spark-3.4.1-bin-hadoop3
!ls /content/scala-2.12.18.deb
!echo $JAVA_HOME
!echo $SPARK_HOME

bin  conf  docs  include  jmods  legal	lib  man  release
bin   data	jars	    LICENSE   NOTICE  R		 RELEASE  yarn
conf  examples	kubernetes  licenses  python  README.md  sbin
/content/scala-2.12.18.deb
/usr/lib/jvm/java-11-openjdk-amd64
/content/spark-3.4.1-bin-hadoop3


In [3]:
!pip install -q findspark
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Colab Scala Spark") \
    .getOrCreate()

spark

In [7]:
%%writefile HelloWorld.scala
import org.apache.spark.sql.SparkSession

object HelloWorld {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalaAdvancedTransformations")
      .master("local[*]")
      .config("spark.driver.memory", "1g")  // Requests 1 GB; the driver JVM is already running here, so the heap is really set by -J-Xmx1g at launch
      .getOrCreate()

    val sc = spark.sparkContext

    // Sample data
    val rdd = sc.parallelize(Seq(
      ("apple", 2),
      ("banana", 3),
      ("apple", 4),
      ("banana", 1),
      ("orange", 5),
      ("apple", 6)
    ))

    // Using combineByKey (Advanced Transformation)
    val combined = rdd.combineByKey(
      (value: Int) => (value, 1), // CreateCombiner
      (acc: (Int, Int), value: Int) => (acc._1 + value, acc._2 + 1), // MergeValue
      (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2) // MergeCombiners
    )

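    // Compute average per key: sum / count
    val averages = combined.mapValues { case (sum, count) => sum.toDouble / count }

    // Print results, one pair per line; key order may vary between runs:
    // (apple,4.0), (banana,2.0), (orange,5.0)
    averages.collect().foreach(println)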

    spark.stop()
  }
}


Overwriting HelloWorld.scala
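
To see what the three functions passed to `combineByKey` do, the following plain-Scala sketch replays them by hand on two made-up partitions, with no Spark involved. The partition split and the helper names (`combinePartition`, `mergeAll`, `averages`) are assumptions for illustration; in a real job Spark decides the partitioning.

```scala
object CombineByKeySketch {
  type Acc = (Int, Int) // (running sum, running count)

  // The same three functions that HelloWorld passes to combineByKey
  val createCombiner: Int => Acc = v => (v, 1)
  val mergeValue: (Acc, Int) => Acc = (acc, v) => (acc._1 + v, acc._2 + 1)
  val mergeCombiners: (Acc, Acc) => Acc = (a, b) => (a._1 + b._1, a._2 + b._2)

  // Hypothetical helper: fold one partition into a per-key accumulator map.
  // The first value seen for a key goes through createCombiner, later ones through mergeValue.
  def combinePartition(part: Seq[(String, Int)]): Map[String, Acc] =
    part.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, m.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // Hypothetical helper: merge the per-partition results with mergeCombiners.
  def mergeAll(parts: Seq[Map[String, Acc]]): Map[String, Acc] =
    parts.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(mergeCombiners)
    }

  def averages(partitions: Seq[Seq[(String, Int)]]): Map[String, Double] =
    mergeAll(partitions.map(combinePartition)).map { case (k, (sum, count)) =>
      k -> sum.toDouble / count
    }

  def main(args: Array[String]): Unit = {
    // Assumed split of the sample data into two partitions
    val partitions = Seq(
      Seq(("apple", 2), ("banana", 3), ("apple", 4)),
      Seq(("banana", 1), ("orange", 5), ("apple", 6))
    )
    averages(partitions).toSeq.sortBy(_._1).foreach(println)
    // (apple,4.0)
    // (banana,2.0)
    // (orange,5.0)
  }
}
```

Carrying a (sum, count) pair instead of a finished average is what makes the merge step safe: averages of partial results cannot be combined directly, but sums and counts can.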


In [8]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" HelloWorld.scala

In [9]:
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" HelloWorld

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 04:53:52 INFO SparkContext: Running Spark version 3.4.1
25/08/01 04:53:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 04:53:53 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 04:53:53 INFO SparkContext: Submitted application: ScalaAdvancedTransformations
25/08/01 04:53:53 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 04:53:53 INFO ResourceProfile: Limiting resource is cpu
25/08/01 04:53:53 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 04:53:53 INFO SecurityManager: Changing view acls to: root
25/08/01

In [10]:
%%writefile BasicTransformation.scala
import org.apache.spark.sql.SparkSession

object BasicTransformation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalaAdvancedTransformations")
      .master("local[*]")
      .config("spark.driver.memory", "1g")  // Requests 1 GB; the driver JVM is already running here, so the heap is really set by -J-Xmx1g at launch
      .getOrCreate()

    val sc = spark.sparkContext

    // Plain Scala List, not an RDD: filter and map run locally on the driver
    val nums = List(1, 2, 3, 4, 5)
    val evenSquares = nums.filter(n => n % 2 == 0).map(n => n * n)
    evenSquares.foreach(println) // Output: 4, 16 (one per line)

    spark.stop()
  }
}


Writing BasicTransformation.scala


In [12]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" BasicTransformation.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" BasicTransformation

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 05:11:56 INFO SparkContext: Running Spark version 3.4.1
25/08/01 05:11:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 05:11:56 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 05:11:56 INFO SparkContext: Submitted application: ScalaAdvancedTransformations
25/08/01 05:11:56 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 05:11:56 INFO ResourceProfile: Limiting resource is cpu
25/08/01 05:11:56 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 05:11:57 INFO SecurityManager: Changing view acls to: root
25/08/01

##Filter Transformation

In [14]:
%%writefile FilterExample.scala
import org.apache.spark.sql.SparkSession

object FilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    // RDD of numbers
    val nums = sc.parallelize(List(1, 2, 3, 4, 5))

    // Filter even numbers
    val evenNums = nums.filter(n => n % 2 == 0)

    evenNums.collect().foreach(println)  // Output: 2, 4

    spark.stop()
  }
}

Writing FilterExample.scala


In [15]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" FilterExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" FilterExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 05:24:54 INFO SparkContext: Running Spark version 3.4.1
25/08/01 05:24:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 05:24:55 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 05:24:55 INFO SparkContext: Submitted application: FilterExample
25/08/01 05:24:55 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 05:24:55 INFO ResourceProfile: Limiting resource is cpu
25/08/01 05:24:55 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 05:24:55 INFO SecurityManager: Changing view acls to: root
25/08/01 05:24:55 INFO 

##Map Transformation

In [16]:
%%writefile MapExample.scala
import org.apache.spark.sql.SparkSession

object MapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MapExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    // RDD of numbers
    val nums = sc.parallelize(List(1, 2, 3, 4, 5))

    // Square each number
    val squares = nums.map(n => n * n)

    squares.collect().foreach(println)  // Output: 1, 4, 9, 16, 25

    spark.stop()
  }
}

Writing MapExample.scala
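
The filter and map transformations shown above can be chained on a single RDD. A minimal sketch, assuming the same local Spark setup as the earlier cells; the object name `FilterMapChain` and the helper `evenSquares` are made up for this example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object FilterMapChain {
  // Both calls are lazy transformations; nothing runs until an action such as collect().
  def evenSquares(nums: RDD[Int]): RDD[Int] =
    nums.filter(_ % 2 == 0).map(n => n * n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FilterMapChain")
      .master("local[*]")
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(List(1, 2, 3, 4, 5))
    evenSquares(nums).collect().foreach(println) // prints 4, then 16

    spark.stop()
  }
}
```

It can be compiled and run with the same `scalac` / `scala` classpath commands used for the other examples.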


In [17]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" MapExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" MapExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 05:26:14 INFO SparkContext: Running Spark version 3.4.1
25/08/01 05:26:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 05:26:14 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 05:26:14 INFO SparkContext: Submitted application: MapExample
25/08/01 05:26:14 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 05:26:14 INFO ResourceProfile: Limiting resource is cpu
25/08/01 05:26:14 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 05:26:14 INFO SecurityManager: Changing view acls to: root
25/08/01 05:26:14 INFO Sec

##Distinct Transformation

In [19]:
%%writefile DistinctExample.scala
import org.apache.spark.sql.SparkSession

object DistinctExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    // RDD with duplicates
    val nums = sc.parallelize(List(1, 2, 2, 3, 3, 3, 4, 5))

    // Get distinct elements
    val uniqueNums = nums.distinct()

    uniqueNums.collect().foreach(println)  // Output: 1, 2, 3, 4, 5 in some order (distinct() shuffles, so order is not guaranteed)

    spark.stop()
  }
}


Writing DistinctExample.scala
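
`distinct()` needs a shuffle, because equal values can sit in different partitions. Conceptually it behaves like the map / reduceByKey / map chain sketched below, which is also why the result order is not fixed; sorting after `collect()` makes the printed output deterministic. The object name `DistinctSketch` and the helper `distinctByHand` are illustrative and not part of Spark.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DistinctSketch {
  // Hypothetical helper: de-duplicate by keying each element on itself.
  def distinctByHand(nums: RDD[Int]): RDD[Int] =
    nums.map(n => (n, ()))       // the element becomes the key
      .reduceByKey((a, _) => a)  // one record survives per key (this step shuffles)
      .map(_._1)                 // drop the placeholder value

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctSketch")
      .master("local[*]")
      .getOrCreate()

    val nums = spark.sparkContext.parallelize(List(1, 2, 2, 3, 3, 3, 4, 5))

    val byHand = distinctByHand(nums).collect().sorted
    val builtIn = nums.distinct().collect().sorted

    println(byHand.mkString(", "))        // 1, 2, 3, 4, 5
    println(byHand.sameElements(builtIn)) // true

    spark.stop()
  }
}
```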


In [20]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" DistinctExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" DistinctExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 05:27:26 INFO SparkContext: Running Spark version 3.4.1
25/08/01 05:27:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 05:27:26 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 05:27:26 INFO SparkContext: Submitted application: DistinctExample
25/08/01 05:27:26 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 05:27:26 INFO ResourceProfile: Limiting resource is cpu
25/08/01 05:27:26 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 05:27:26 INFO SecurityManager: Changing view acls to: root
25/08/01 05:27:26 INF

##Accumulator Program

In [21]:
%%writefile AccumulatorProgram.scala
import org.apache.spark.sql.SparkSession

object AccumulatorProgram {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DistinctExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    // Create an accumulator for counting errors
    val errorCount = sc.longAccumulator("Error Counter")

    // Sample RDD with some log lines
    val data = sc.parallelize(Seq("INFO Start", "ERROR Failure1", "ERROR Failure2", "INFO End"))

    // Use the accumulator
    data.foreach(line => {
      if (line.contains("ERROR")) {
        errorCount.add(1)  // increment by 1
      }
    })

    // Print the result
    println(s"Total Errors : ${errorCount.value}") // Output: Total Errors : 2

    spark.stop()
  }
}


Writing AccumulatorProgram.scala
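
`AccumulatorProgram` updates the accumulator inside `foreach`, which is an action. Inside a transformation such as `map`, the update is lazy and is applied again every time the RDD is recomputed. A small sketch of that behaviour, assuming a local run without task retries; the names `AccumulatorPitfall`, `counts` and `seen` are made up for this example.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object AccumulatorPitfall {
  // Returns the accumulator total before any action, after one action, and after two.
  def counts(sc: SparkContext): (Long, Long, Long) = {
    val seen = sc.longAccumulator("seen")
    val data = sc.parallelize(Seq("INFO Start", "ERROR Failure1", "ERROR Failure2", "INFO End"))

    // map is a transformation, so this update does not run yet
    val tagged = data.map { line => seen.add(1); line }
    val before = seen.sum

    tagged.count()            // first action: map runs once per line
    val afterFirst = seen.sum

    tagged.count()            // tagged is not cached, so map runs again
    val afterSecond = seen.sum

    (before, afterFirst, afterSecond)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AccumulatorPitfall")
      .master("local[*]")
      .getOrCreate()

    println(counts(spark.sparkContext)) // (0,4,8)

    spark.stop()
  }
}
```

Keeping accumulator updates inside actions, or caching the RDD before running several actions on it, avoids the double counting.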


In [22]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" AccumulatorProgram.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" AccumulatorProgram

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 06:35:30 INFO SparkContext: Running Spark version 3.4.1
25/08/01 06:35:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 06:35:30 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 06:35:30 INFO SparkContext: Submitted application: DistinctExample
25/08/01 06:35:30 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 06:35:30 INFO ResourceProfile: Limiting resource is cpu
25/08/01 06:35:30 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 06:35:30 INFO SecurityManager: Changing view acls to: root
25/08/01 06:35:30 INF

##Broadcast

In [23]:
%%writefile BroadcastExample.scala
import org.apache.spark.sql.SparkSession

object BroadcastExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    // 1. Create lookup Map
    val countryLookup = Map(
      "US" -> "United States",
      "IN" -> "India",
      "UK" -> "United Kingdom"
    )

    // 2. Broadcast the Map
    val broadcastLookup = sc.broadcast(countryLookup)

    // 3. Sample RDD of country codes
    val countryCodes = sc.parallelize(Seq("US", "IN", "UK", "CA"))

    // 4. Map codes to full names using broadcast variable
    val mappedCountries = countryCodes.map(code =>
      broadcastLookup.value.getOrElse(code, s"Unknown ($code)")
    )

    // 5. Print results
    mappedCountries.collect().foreach(println)
    // Output:
    // United States
    // India
    // United Kingdom
    // Unknown (CA)

    spark.stop()
  }
}

Writing BroadcastExample.scala
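
A broadcast variable works for filtering as well as for lookups. The sketch below broadcasts only the set of known codes and releases it once the job is done. The object `BroadcastFilterSketch` and the helper `knownCodes` are hypothetical names, and calling `destroy()` right after the action is one possible clean-up choice, not something the example above requires.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object BroadcastFilterSketch {
  // Keep only the codes that appear as keys in the lookup map.
  def knownCodes(sc: SparkContext, codes: Seq[String], lookup: Map[String, String]): List[String] = {
    val known = sc.broadcast(lookup.keys.toSet) // small, read-only set
    try {
      sc.parallelize(codes)
        .filter(code => known.value.contains(code))
        .collect()
        .toList
    } finally {
      known.destroy() // the broadcast cannot be used after this call
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastFilterSketch")
      .master("local[*]")
      .getOrCreate()

    val lookup = Map("US" -> "United States", "IN" -> "India", "UK" -> "United Kingdom")
    val result = knownCodes(spark.sparkContext, Seq("US", "IN", "UK", "CA"), lookup)
    println(result.mkString(", ")) // US, IN, UK

    spark.stop()
  }
}
```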


In [24]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" BroadcastExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" BroadcastExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 06:42:24 INFO SparkContext: Running Spark version 3.4.1
25/08/01 06:42:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 06:42:24 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 06:42:24 INFO SparkContext: Submitted application: BroadcastExample
25/08/01 06:42:24 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 06:42:24 INFO ResourceProfile: Limiting resource is cpu
25/08/01 06:42:24 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 06:42:24 INFO SecurityManager: Changing view acls to: root
25/08/01 06:42:24 IN

##Caching vs Persistence in Spark

##Caching Example

In [25]:
%%writefile CachingExample.scala
import org.apache.spark.sql.SparkSession

object CachingExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BroadcastExample")
      .master("local[*]")
      .config("spark.driver.memory", "1g")
      .getOrCreate()

    val sc = spark.sparkContext

    val data = sc.parallelize(1 to 1000000)

    // Cache the filtered RDD in memory (cache() is lazy; nothing is stored yet)
    val evenNumbers = data.filter(_ % 2 == 0).cache()

    // First action triggers the computation and fills the cache
    println("Count: " + evenNumbers.count())

    // Second action reuses the cached partitions instead of re-running the filter
    println("Sum: " + evenNumbers.sum())


    spark.stop()
  }
}

Writing CachingExample.scala
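
On an RDD, `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`, while `persist` accepts any storage level. The sketch below checks this through `getStorageLevel`; the object name `StorageLevelSketch` and the helper `levels` are assumptions made for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Returns the level after cache(), after persist(MEMORY_AND_DISK), and after unpersist().
  def levels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val data = sc.parallelize(1 to 1000)

    val cached = data.filter(_ % 2 == 0).cache()
    val persisted = data.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK)

    val afterCache = cached.getStorageLevel
    val afterPersist = persisted.getStorageLevel

    cached.unpersist()    // marks the RDD as no longer persistent
    persisted.unpersist()
    val afterUnpersist = cached.getStorageLevel

    (afterCache, afterPersist, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StorageLevelSketch")
      .master("local[*]")
      .getOrCreate()

    val (afterCache, afterPersist, afterUnpersist) = levels(spark.sparkContext)
    println(afterCache == StorageLevel.MEMORY_ONLY)       // true
    println(afterPersist == StorageLevel.MEMORY_AND_DISK) // true
    println(afterUnpersist == StorageLevel.NONE)          // true

    spark.stop()
  }
}
```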


In [26]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" CachingExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" CachingExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 06:59:54 INFO SparkContext: Running Spark version 3.4.1
25/08/01 06:59:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 06:59:55 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 06:59:55 INFO SparkContext: Submitted application: BroadcastExample
25/08/01 06:59:55 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 06:59:55 INFO ResourceProfile: Limiting resource is cpu
25/08/01 06:59:55 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 06:59:55 INFO SecurityManager: Changing view acls to: root
25/08/01 06:59:55 IN

##Persistence Example

In [27]:
%%writefile PersistenceExample.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistenceExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistenceExample")
      .master("local[*]")
      .getOrCreate()

    val sc = spark.sparkContext

    val bigData = sc.parallelize(1 to 100000000)

    // Persist to memory and spill to disk if needed.
    // Widen to Long before squaring: x * x overflows Int for x > 46340.
    val squares = bigData.map(x => x.toLong * x).persist(StorageLevel.MEMORY_AND_DISK)

    // take(5) computes and persists only the partitions it needs to read
    println("First 5: " + squares.take(5).mkString(", "))

    // count() computes the remaining partitions and reuses the persisted ones
    println("Count: " + squares.count())

    spark.stop()
  }
}


Writing PersistenceExample.scala
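
`PersistenceExample` squares values up to 100000000. The square of anything above 46340 does not fit in a 32-bit `Int`, so the multiplication has to be widened to `Long` to give correct results. A plain-Scala sketch of the difference, with made-up helper names:

```scala
object SquareOverflowSketch {
  def squareInt(x: Int): Int = x * x           // wraps around silently on overflow
  def squareLong(x: Int): Long = x.toLong * x  // widened before multiplying

  def main(args: Array[String]): Unit = {
    println(squareInt(46340))  // 2147395600, still fits in an Int
    println(squareInt(46341))  // -2147479015, overflowed
    println(squareLong(46341)) // 2147488281
  }
}
```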


In [28]:
!scalac -classpath "$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" PersistenceExample.scala
!scala -J-Xmx1g -classpath ".:$(echo /content/spark-3.4.1-bin-hadoop3/jars/*.jar | tr ' ' ':')" PersistenceExample

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/01 07:10:20 INFO SparkContext: Running Spark version 3.4.1
25/08/01 07:10:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/08/01 07:10:21 INFO ResourceUtils: No custom resources configured for spark.driver.
25/08/01 07:10:21 INFO SparkContext: Submitted application: PersistenceExample
25/08/01 07:10:21 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
25/08/01 07:10:21 INFO ResourceProfile: Limiting resource is cpu
25/08/01 07:10:21 INFO ResourceProfileManager: Added ResourceProfile id: 0
25/08/01 07:10:21 INFO SecurityManager: Changing view acls to: root
25/08/01 07:10:21 