# Apache Spark

**Analyze large datasets on clusters using Apache Spark**

This notebook is intended to run in Google Colab.
Skip the next code block if you are running it locally.

## Colab Setup

In [None]:
# Find the latest version of Spark at http://www-us.apache.org/dist/spark/
# (older releases are moved to https://archive.apache.org/dist/spark/)
spark_version = 'spark-3.0.1'

In [None]:
# Set Environment Variables
import os
os.environ['SPARK_VERSION'] = spark_version
os.environ['BASE_URL'] = 'http://www-us.apache.org/dist/spark'
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = f"/content/{spark_version}-bin-hadoop2.7"

# Install spark, java, and findspark
!apt-get update
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q $BASE_URL/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

# Make the downloaded PySpark importable (findspark adds it to sys.path)
import findspark
findspark.init()
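
The setup cell builds the download URL and `SPARK_HOME` from the version string. A minimal sketch of that derivation, plus a guard for the "skip if local" advice, is shown here. The helper names `running_in_colab` and `spark_paths` are hypothetical and not part of the original notebook; the defaults simply mirror the values used above.

```python
def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only present inside Colab)
        return True
    except ImportError:
        return False


def spark_paths(spark_version,
                base_url="http://www-us.apache.org/dist/spark",
                hadoop_suffix="bin-hadoop2.7",
                install_dir="/content"):
    """Derive the archive URL and SPARK_HOME for a given Spark version."""
    archive = f"{spark_version}-{hadoop_suffix}.tgz"
    return {
        "archive_url": f"{base_url}/{spark_version}/{archive}",
        "spark_home": f"{install_dir}/{spark_version}-{hadoop_suffix}",
    }


paths = spark_paths("spark-3.0.1")
print(paths["archive_url"])
print(paths["spark_home"])
print("Running in Colab:", running_in_colab())
```

Wrapping the install commands in `if running_in_colab():` would let the same notebook run unchanged on a local machine that already has Spark installed.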

## Introduction to Spark

### Start session

Start a Spark session with PySpark.

In [None]:
from pyspark.sql import SparkSession


# Start spark session
spark = SparkSession\
    .builder\
    .appName("DataFrameBasics")\
    .getOrCreate()

### Create a DataFrame

Create an example DataFrame from a list of tuples (the rows) and a list of two column names.

In [None]:
# Example dataframe
df = spark.createDataFrame(
    [(0, 'A'),
     (1, 'B'), 
     (2, 'C'),], 
    ['id', 'words'])

# Return the first row of the dataframe as a Row object
df.head()

### SparkFiles
Load a file hosted on Amazon S3 into the session using `SparkFiles`.

#### Food data

In [None]:
from pyspark import SparkFiles


# Url for data in s3
url = 'https://s3.amazonaws.com/datavix-curriculum/day_1/food.csv'

# Add file to session
spark.sparkContext.addFile(url)

# create dataframe from csv
df = spark.read.csv(
    SparkFiles.get('food.csv'), 
    header=True)


In [None]:
# Show dataframe
df.show()

In [None]:
# Print schema
df.printSchema()

In [None]:
# Show columns
df.columns

In [None]:
# Show summary statistics (count, mean, stddev, min, max)
df.describe().show()

##### Transform data
Change the data type of the `price` column to integer by supplying an explicit schema when reading the CSV.

##### Change schema

In [None]:
from pyspark.sql.types import StructField as Field
from pyspark.sql.types import StringType, IntegerType, StructType


# Create a list of structure fields
schema = StructType(fields=[
    Field("food", StringType(), True), 
    Field("price", IntegerType(), True),
])

# Load dataframe with correct schema
dataframe = spark.read.csv(
    SparkFiles.get('food.csv'),
    schema=schema,
    header=True)

# Print schema
dataframe.printSchema()

### Other DataFrame methods

Add or replace columns using the `.withColumn()` method, and rename them with `.withColumnRenamed()`.

In [None]:
# Add a 'newprice' column that copies 'price'
dataframe.withColumn(
    'newprice', dataframe['price']
).show()

In [None]:
# Rename the column so that its name is capitalized
dataframe.withColumnRenamed('price','Price').show()

In [None]:
# Use an arithmetic expression to derive a new column
# (the rename above was not assigned, so the column is still 'price')
dataframe.withColumn(
    'DoublePrice', dataframe['price'] * 2
).show()