# Spark Features

- In-memory computation
- Distributed processing using parallelize
- Works with several cluster managers (Spark standalone, YARN, Mesos, etc.)
- Fault-tolerant
- Immutable data structures (RDDs and DataFrames)
- Lazy evaluation
- Cache & persistence
- Built-in optimization when using DataFrames
- Supports ANSI SQL

# Advantages of PySpark
- PySpark is the Python API for Apache Spark, a general-purpose, in-memory, distributed processing engine that lets you process data efficiently in a distributed fashion.
- Applications running on PySpark can be up to 100x faster than traditional disk-based systems such as Hadoop MapReduce, mainly for workloads that fit in memory.
- PySpark is well suited to data ingestion pipelines.
- PySpark can process data from Hadoop HDFS, AWS S3, and many other file systems.
- PySpark can also process real-time data using Spark Streaming and Kafka.
- With PySpark streaming you can also stream files from a file system or read a stream from a socket.
- PySpark ships with a machine learning library (MLlib); graph processing is available through the separate GraphFrames package.

# PySpark DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.


In [2]:
from pyspark.sql import SparkSession

# Start (or reuse) a local session that uses all available cores.
spark = (SparkSession.builder.appName("spark_explorer").master("local[*]")
         .config("spark.driver.bindAddress", "127.0.0.1").getOrCreate())

In [7]:
# Read the CSV file, treating the first line as a header and inferring column types
df = spark.read.csv("sales_info.csv", inferSchema=True, header=True)

In [8]:
# Print the schema of the DataFrame
df.printSchema()

root
 |-- Company: string (nullable = true)
 |-- Person: string (nullable = true)
 |-- Sales: double (nullable = true)



In [9]:
# Return the first five rows as a list of Row objects
df.head(5)

[Row(Company='GOOG', Person='Sam', Sales=200.0),
 Row(Company='GOOG', Person='Charlie', Sales=120.0),
 Row(Company='GOOG', Person='Frank', Sales=340.0),
 Row(Company='MSFT', Person='Tina', Sales=600.0),
 Row(Company='MSFT', Person='Amy', Sales=124.0)]