# Loading data in PySpark shell

- In PySpark, computations are expressed as operations on distributed collections that are automatically parallelized across the cluster. The previous exercise loaded a Python list as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.

- Remember that a `SparkContext` named `sc` and a `file_path` variable (the path to the README.md file) are already available in your workspace.

## Instructions

- Load the local text file README.md in the PySpark shell.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you need here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
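
The initialization cell hard-codes the Py4J archive name (`py4j-0.10.7-src.zip`), which only matches one Spark release. As a minimal sketch, assuming the same `python/lib` layout under `SPARK_HOME`, the archive could be located by pattern instead. The helper name `add_pyspark_to_path` is hypothetical and not part of PySpark.

```python
import glob
import os
import sys


def add_pyspark_to_path(spark_home):
    """Put pyspark.zip and any bundled py4j-*-src.zip on sys.path.

    Hypothetical helper: assumes the archives live in <spark_home>/python/lib.
    Returns the list of archive paths that were newly added.
    """
    lib_dir = os.path.join(spark_home, "python", "lib")
    # Match whichever Py4J version ships with this Spark install
    # instead of hard-coding 0.10.7.
    archives = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    archives.append(os.path.join(lib_dir, "pyspark.zip"))

    added = []
    for archive in archives:
        # Skip archives that are missing or already on the path.
        if os.path.exists(archive) and archive not in sys.path:
            sys.path.insert(0, archive)
            added.append(archive)
    return added
```

In the cell above, this would be called as `add_pyspark_to_path(os.environ["SPARK_HOME"])` in place of the two `sys.path.insert` lines.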

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On YARN, specify .master("yarn"):
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()

sc = spark.sparkContext

In [3]:
file_path = 'file:////home/talentum/spark/README.md'
# Load a local file into PySpark shell
lines = sc.textFile(file_path)
lines

file:////home/talentum/spark/README.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0
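
The `file:` prefix tells Spark to read from the local file system. A path with no scheme is resolved against the cluster's default file system, which is presumably HDFS in this setup. The following sketch uses only the Python standard library to build the conventional three-slash form of the URI and to inspect the scheme of a path. The variable names are illustrative.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Build a file URI from an absolute POSIX path.
local_path = PurePosixPath("/home/talentum/spark/README.md")
file_uri = local_path.as_uri()
print(file_uri)

# Inspect the scheme: an empty scheme means the path carries no
# file-system prefix and is resolved against the default file system.
for candidate in (file_uri, "README.md"):
    scheme = urlparse(candidate).scheme or "(none)"
    print(candidate, "->", scheme)
```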

In [4]:
print(type(lines))
print(lines.count())
print(lines.take(5))

<class 'pyspark.rdd.RDD'>
104
['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a']


In [5]:
file_path1 = 'README.md'
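# Load a file into PySpark shell
# NOTE: this call passes file_path (the file: URI from In [3]), not file_path1,
# so the output below still shows the local README.md.
# Pass file_path1 to load the relative path instead.
lines1 = sc.textFile(file_path)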
lines1

file:////home/talentum/spark/README.md MapPartitionsRDD[5] at textFile at NativeMethodAccessorImpl.java:0

## Commands in Linux

**1. $ wc -l /home/talentum/spark/README.md**

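- `wc -l` prints the number of lines in the file (`wc` stands for "word count"; the `-l` flag restricts the count to lines). It is the shell counterpart of `lines.count()`.

- An RDD created with `sc.textFile` stores each line of the file as one element.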

**2. $ head -n 5 /home/talentum/spark/README.md**

- This prints the first 5 lines of the file. It is the shell counterpart of `lines.take(5)`.
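
To cross-check the RDD results without leaving Python, the two shell commands can be approximated with the standard library. This is a rough sketch, not part of the exercise; the function names `count_lines` and `head` are made up for illustration.

```python
from itertools import islice


def count_lines(path):
    """Count newline characters, which is what `wc -l` reports.

    Reads the whole file at once, so this is only meant for small files.
    """
    with open(path, "rb") as handle:
        return handle.read().count(b"\n")


def head(path, n=5):
    """Return the first n lines without trailing newlines, like `head -n`."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `count_lines("/home/talentum/spark/README.md")` can be compared with `lines.count()`, and `head("/home/talentum/spark/README.md")` with `lines.take(5)`. Because `wc -l` counts newline characters, a file whose last line has no trailing newline reports one line fewer than the RDD has elements.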


In [6]:
help(sc.textFile)

Help on method textFile in module pyspark.context:

textFile(name, minPartitions=None, use_unicode=True) method of pyspark.context.SparkContext instance
    Read a text file from HDFS, a local file system (available on all
    nodes), or any Hadoop-supported file system URI, and return it as an
    RDD of Strings.
    
    If use_unicode is False, the strings will be kept as `str` (encoding
    as `utf-8`), which is faster and smaller than unicode. (Added in
    Spark 1.2)
    
    >>> path = os.path.join(tempdir, "sample-text.txt")
    >>> with open(path, "w") as testFile:
    ...    _ = testFile.write("Hello world!")
    >>> textFile = sc.textFile(path)
    >>> textFile.collect()
    ['Hello world!']
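
The signature above includes a `minPartitions` parameter that the help text does not describe. It is a suggested minimum number of partitions for the resulting RDD. As a small sketch, assuming `sc` is a live `SparkContext`, the hypothetical helper below passes that hint through and reports how many partitions Spark actually created.

```python
def load_with_partitions(sc, path, min_partitions=None):
    """Load a text file and report the resulting partition count.

    Hypothetical helper: `sc` is expected to be a SparkContext, and
    `min_partitions` is forwarded to textFile's minPartitions parameter.
    """
    rdd = sc.textFile(path, minPartitions=min_partitions)
    # getNumPartitions() returns the number of partitions in the RDD.
    return rdd, rdd.getNumPartitions()
```

In the notebook this could be called as `rdd, n = load_with_partitions(sc, file_path, min_partitions=4)`. The value is a hint, so the partition count Spark reports may differ from the number requested.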



In [7]:
file_path1 = 'constitution.txt'
# File is present on HDFS

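# Load a file into PySpark shell
# NOTE: this call passes file_path (the local README.md), not file_path1,
# which is why the output below repeats the README.md results.
# Pass file_path1 to read constitution.txt.
lines1 = sc.textFile(file_path)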

print(type(lines1))
print(lines1.count())
print(lines1.take(5))

<class 'pyspark.rdd.RDD'>
104
['# Apache Spark', '', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'high-level APIs in Scala, Java, Python, and R, and an optimized engine that', 'supports general computation graphs for data analysis. It also supports a']
