<a href="https://colab.research.google.com/github/Akshatpattiwar512/Big-data-ml/blob/main/Apache_Spark_Part_1_RDD_and_Data_Frame_components.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Getting Started with Apache Spark - Part 1: RDD and Data Frame components

This notebook is based on the article [Apache Spark - Part 1: RDD and Data Frame components](https://www.linkedin.com/pulse/apache-spark-akshat-pattiwar/).

##Installing PySpark using pip

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/89/db/e18cfd78e408de957821ec5ca56de1250645b05f8523d169803d8df35a64/pyspark-3.1.2.tar.gz (212.4MB)
     |████████████████████████████████| 212.4MB 61kB/s
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
     |████████████████████████████████| 204kB 17.8MB/s
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.1.2-py2.py3-none-any.whl size=212880768 sha256=8b3198bbbb0f1093afecd70c4d732292ccc5cc2f985628b16a35fecbb3cf80a2
  Stored in directory: /root/.cache/pip/wheels/40/1b/2c/30f43be2627857ab80062bef1527c0128f7b4070b6b2d02139
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.1.2


##Importing required modules

In [4]:
from pyspark.sql import SparkSession

##Create SparkSession and SparkContext

In [5]:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

##Create an RDD from a collection

In [6]:
num = [1,2,3,4,5]
num_rdd = sc.parallelize(num)
num_rdd.collect()

[1, 2, 3, 4, 5]

##Transformations

###filter()

filter() returns a new RDD containing only the elements for which the given function returns True.

In [7]:
even_rdd = num_rdd.filter(lambda x : x % 2 == 0)
even_rdd.collect()

[2, 4]

###map()

map() applies the given function to every element and returns a new RDD with exactly one output element per input element.

In [8]:
double_rdd = num_rdd.map(lambda x : x * 2)
double_rdd.collect()

[2, 4, 6, 8, 10]

###flatMap()

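flatMap() is similar to map(), but the function returns an iterable for each input element, and the results are flattened into a single RDD. Each input can therefore produce zero, one, or many output elements. In the example, range(1, x) yields the integers from 1 up to x - 1, so the input 1 contributes nothing to the output.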

In [9]:
flat_rdd = num_rdd.flatMap(lambda x : range(1,x))
flat_rdd.collect()

[1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
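
To see the difference between map() and flatMap() without running Spark, the same lambda can be applied to a plain Python list. This is only a sketch of the semantics, not of how Spark executes the operation; the variable names `mapped` and `flattened` are illustrative.

```python
from itertools import chain

nums = [1, 2, 3, 4, 5]

# map-like: exactly one output per input, so the result is a list of lists.
mapped = [list(range(1, x)) for x in nums]

# flatMap-like: the same function, but the per-element results are flattened
# into a single sequence.
flattened = list(chain.from_iterable(range(1, x) for x in nums))

print(mapped)     # [[], [1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
print(flattened)  # [1, 1, 2, 1, 2, 3, 1, 2, 3, 4]
```

The empty list produced for the input 1 disappears after flattening, which is why the flattened result has ten elements rather than one per input.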

###distinct()

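distinct() returns a new RDD containing the unique elements of the source RDD. The order of the result is not guaranteed, as the output below shows.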

In [10]:
rdd1 = sc.parallelize([10, 11, 10, 11, 12, 11])
dist_rdd = rdd1.distinct()
dist_rdd.collect()

[10, 12, 11]
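
Because the collected order is not guaranteed, it can help to sort the result before displaying or comparing it. The following plain-Python sketch models distinct() with a set, which is likewise unordered; it is an analogy for the behavior, not Spark's implementation.

```python
values = [10, 11, 10, 11, 12, 11]

# A set keeps one copy of each element, with no guaranteed order.
unique = set(values)

# Sort when a stable, comparable order is needed.
ordered = sorted(unique)
print(ordered)  # [10, 11, 12]
```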

###reduceByKey()

reduceByKey() operates on an RDD of (key, value) pairs and merges the values of each key using the given function, which should be associative and commutative.

In [11]:
pairs = [ ("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
pair_rdd = sc.parallelize(pairs)
output = pair_rdd.reduceByKey(lambda x, y : x + y)
result = output.collect()
print(*result, sep='\n')

('b', 8)
('c', 6)
('a', 8)
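
The per-key merging can be modeled in plain Python with a dictionary. The helper `reduce_by_key` below is a hypothetical name introduced for illustration; it sketches what the operation computes on a single machine and ignores partitioning and shuffling.

```python
def reduce_by_key(pairs, func):
    """Fold the values of each key together with func (single-machine model)."""
    merged = {}
    for key, value in pairs:
        # First value for a key is stored as-is; later ones are combined.
        merged[key] = func(merged[key], value) if key in merged else value
    return merged

pairs = [("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)  # {'a': 8, 'b': 8, 'c': 6}
```

The totals match the Spark output above; only the order in which the keys are listed differs.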


###groupByKey()

groupByKey() is another ByKey function that operates on a (key, value) pair RDD. Unlike reduceByKey(), it does not combine the values; it only groups them, producing one (key, iterable of values) pair per key.

In [12]:
grp_out = pair_rdd.groupByKey()
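
The cell above does not display anything, because groupByKey() is a transformation and the grouped values are returned as iterables; in PySpark, calling mapValues(list) before collect() makes them readable. As a rough single-machine model of the grouping itself, the sketch below uses a hypothetical helper `group_by_key`; in Spark, the order of values within a group is not guaranteed.

```python
from collections import defaultdict

def group_by_key(pairs):
    """Collect the values of each key into a list (single-machine model)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("a", 5), ("b", 7), ("c", 2), ("a", 3), ("b", 1), ("c", 4)]
grouped = group_by_key(pairs)
print(grouped)  # {'a': [5, 3], 'b': [7, 1], 'c': [2, 4]}

# Summing each group afterwards gives the same totals as reduceByKey.
group_totals = {key: sum(values) for key, values in grouped.items()}
print(group_totals)  # {'a': 8, 'b': 8, 'c': 6}
```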

###sortByKey()

sortByKey() sorts a (key, value) pair RDD by its keys. By default, the sort is in ascending order.

In [13]:
pairs = [ ("a", 5), ("d", 7), ("c", 2), ("b", 3)]
raw_rdd = sc.parallelize(pairs)
sortkey_rdd = raw_rdd.sortByKey()
result = sortkey_rdd.collect()
print(*result,sep='\n')

('a', 5)
('b', 3)
('c', 2)
('d', 7)


###sortBy()

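sortBy() is a more general sorting function: it sorts an RDD by the value that a user-supplied key function returns for each element, so the elements do not have to be (key, value) pairs.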

In [14]:
# Create RDD.
pairs = [ ("a", 5, 10), ("d", 7, 12), ("c", 2, 11), ("b", 3, 9)]
raw_rdd = sc.parallelize(pairs)

# Sort by the 3rd element of each tuple (index 2).
sort_out = raw_rdd.sortBy(lambda x : x[2])
result = sort_out.collect()
print(*result, sep='\n')

('b', 3, 9)
('a', 5, 10)
('c', 2, 11)
('d', 7, 12)
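
The key function passed to sortBy() plays the same role as the key argument of Python's built-in sorted(). The plain-Python sketch below, which runs without Spark, shows how changing the key function changes the ordering; the variable names are illustrative.

```python
records = [("a", 5, 10), ("d", 7, 12), ("c", 2, 11), ("b", 3, 9)]

# Sort by the 3rd element, as in the sortBy() example above.
by_third = sorted(records, key=lambda x: x[2])

# Sorting by the 1st element gives the ordering sortByKey() would use.
by_first = sorted(records, key=lambda x: x[0])

# Descending order by the 2nd element.
by_second_desc = sorted(records, key=lambda x: x[1], reverse=True)

print(*by_third, sep='\n')
print([r[0] for r in by_first])        # ['a', 'b', 'c', 'd']
print([r[0] for r in by_second_desc])  # ['d', 'a', 'b', 'c']
```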


##Actions

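Actions are operations that trigger computation on an RDD and execute immediately. Transformations, by contrast, are lazy: they return another RDD and are evaluated only when an action needs the result. Actions return language-native data structures, such as a Python list or integer.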
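
The distinction can be illustrated without Spark. Python's built-in map() is lazy in a similar spirit to a transformation: it records what to do but does nothing until the result is consumed, which here stands in for an action such as collect(). This is an analogy only, and the `calls` list is a hypothetical device for observing when the function actually runs.

```python
calls = []  # records every element the function is actually applied to

def double(x):
    calls.append(x)
    return x * 2

nums = [1, 2, 3, 4, 5]

# Like a transformation: builds a pipeline, computes nothing yet.
pipeline = map(double, nums)
calls_before = list(calls)
print(calls_before)  # []

# Like an action: consuming the pipeline forces the computation.
result = list(pipeline)
print(calls)   # [1, 2, 3, 4, 5]
print(result)  # [2, 4, 6, 8, 10]
```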

###count()

This will count the number of elements in the given RDD.

In [15]:
num = sc.parallelize([1,2,3,4,2])
num.count()

5

###first()

first() returns the first element of the given RDD.

In [16]:
num.first()

1

###collect()

collect() returns all the elements of the given RDD as a list. Because every element is brought back to the driver, it is best suited to small RDDs.

In [17]:
num.collect()

[1, 2, 3, 4, 2]

###take()

take(n) returns the first n elements of the RDD as a list.

In [None]:
num.take(3)
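
As a plain-Python model of this behavior, the sketch below uses a hypothetical helper named `take` built on itertools.islice. Like the Spark action, it returns at most n elements, so asking for more than the collection holds simply returns everything.

```python
from itertools import islice

def take(iterable, n):
    """Return up to the first n elements as a list (single-machine model)."""
    return list(islice(iterable, n))

nums = [1, 2, 3, 4, 2]
first_three = take(nums, 3)
all_of_them = take(nums, 10)
print(first_three)  # [1, 2, 3]
print(all_of_them)  # [1, 2, 3, 4, 2]
```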