# Filter and Count

- The RDD transformation `filter()` returns a new RDD containing only the elements for which a given function returns `True`. It is useful for narrowing a large dataset down to the records that match a keyword. In this exercise, you'll select the lines containing the keyword `Spark` from `fileRDD`, an RDD made up of the lines of text in the `README.md` file. Next, you'll count the lines that contain the keyword `Spark`, and finally print the first `4` lines of the filtered RDD.

- Remember, you already have a `SparkContext` `sc`, `file_path` and `fileRDD` available in your workspace.



## Instructions
- Create a `filter()` transformation to select the lines containing the keyword `Spark`.
- How many lines in `fileRDD_filter` contain the keyword `Spark`?
- Print the first four lines of the resulting RDD.
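
The three steps above (filter, count, take the first few) can be sketched in plain Python before running them on Spark. This is only an illustration of the semantics, not the Spark API: the `lines` list is hypothetical sample data standing in for `fileRDD`, and `contains_keyword` is a made-up helper playing the role of the function passed to `filter()`.

```python
# Plain-Python stand-in for the filter -> count -> take pipeline (no Spark needed).
# `lines` is hypothetical sample data, not the real README.md contents.
lines = [
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system for Big Data.",
    "It provides high-level APIs in Scala, Java, Python, and R.",
    "## Building Spark",
]


def contains_keyword(line, keyword="Spark"):
    """Predicate: True when the (case-sensitive) keyword occurs in the line."""
    return keyword in line


# Analogue of fileRDD.filter(...): keep only lines where the predicate is True.
filtered = [line for line in lines if contains_keyword(line)]

# Analogue of .count(): number of lines that survived the filter.
match_count = len(filtered)

# Analogue of .take(2): the first two surviving lines, in original order.
first_two = filtered[:2]

print("Lines containing the keyword:", match_count)
for line in first_two:
    print(line)
```

One difference worth keeping in mind: the list comprehension runs immediately, whereas an RDD `filter()` is lazy and does no work until an action such as `count()` or `take()` is called.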

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [22]:
file_path = 'file:///home/talentum/spark/README.md'

# Create a fileRDD from file_path
fileRDD = sc.textFile(file_path)

# Filter the fileRDD to select lines with Spark keyword
fileRDD_filter = fileRDD.filter(lambda line: 'Spark' in line)

# How many lines are there in fileRDD_filter?
print("The total number of lines with the keyword Spark is", fileRDD_filter.count())

# Print the first four lines of fileRDD_filter
for line in fileRDD_filter.take(4):
    print(line)

The total number of lines with the keyword Spark is 19
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
and Spark Streaming for stream processing.


In [25]:
print(fileRDD_filter.collect())

['# Apache Spark', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,', 'and Spark Streaming for stream processing.', 'You can find the latest Spark documentation, including a programming', '## Building Spark', 'Spark is built using [Apache Maven](http://maven.apache.org/).', 'To build Spark and its example programs, run:', '["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 'For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).', 'The easiest way to start using Spark is through the Scala shell:', 'Spark also comes with several sample programs in the `examples` directory.', '    ./bin/run-example SparkPi', '    MASTER=spark://host:7077 ./bin/run-example SparkPi', 'Testing first requires [building Spark](#building-spark). Once Spark is built, tests', 'S

In [26]:
type(fileRDD_filter.collect())

list

In [None]:
# terminal command:
#     talentum@talentum-virtual-machine:~$ cat /home/talentum/spark/README.md | grep -c  'Spark'

# output: 19
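
The `grep -c` check agrees with `fileRDD_filter.count()` because both count matching lines, not occurrences of the keyword. A minimal sketch of that distinction in plain Python follows; the function names and the `sample` text are hypothetical and chosen only for illustration.

```python
def count_matching_lines(text, keyword):
    """Count lines containing the keyword at least once (like `grep -c`)."""
    return sum(1 for line in text.splitlines() if keyword in line)


def count_occurrences(text, keyword):
    """Count every non-overlapping occurrence of the keyword in the text."""
    return text.count(keyword)


# Hypothetical sample: the first line mentions the keyword twice.
sample = "Spark and more Spark\nno match here\nSparkPi example\n"

line_matches = count_matching_lines(sample, "Spark")
total_occurrences = count_occurrences(sample, "Spark")

print("matching lines:", line_matches)
print("total occurrences:", total_occurrences)
```

Both counts are case-sensitive substring matches, so a line such as `./bin/run-example SparkPi` counts as a match, while a line containing only lowercase `spark` does not.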
