## Introduction to Spark

Spark is an extremely powerful cluster computing framework used for data workloads that require more resources than a single node can provide. There is a wealth of information about Spark available online, as it is one of the most in-demand data engineering tools today.

In this lab, you'll install Spark on your local machine and then do some basic tasks. In reality, running Spark on a laptop is a bit silly, as it's designed for larger workloads. That said, the beauty of Spark is that you can write your code on a local machine, then deploy it to a massive cluster and scale your workload seamlessly.

You may notice that parts of Spark feel a bit like pandas or SQL. This is because much of Spark's feel has been inherited from those tools. In general, Spark is a little more verbose than pandas, but in return for that verbosity, you get incredible computing power.

If you'd like to read up in great depth, this [e-book](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf) is excellent.

Before we get started, you'll need to install pyspark with pip or pip3:

*pip3 install --user pyspark*

In [1]:
#freebie code to confirm your spark is running correctly

import pandas as pd
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.sql("select 'spark' as hello_zipcoder ")

df.show()

+--------------+
|hello_zipcoder|
+--------------+
|         spark|
+--------------+
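
As a small illustration of the pandas-like and SQL-like feel mentioned earlier, here is a minimal sketch that computes the same total two ways. The function name `pills_by_state`, the view name `shipments`, and the sample rows are all hypothetical, and the sketch assumes you pass in the `spark` session created above.

```python
# Made-up rows for illustration only; these are not figures from the DEA data.
SAMPLE_ROWS = [("PA", 100), ("PA", 50), ("DE", 30)]


def pills_by_state(spark, rows=SAMPLE_ROWS):
    """Total pills per state, computed two ways; `spark` is an existing SparkSession."""
    df = spark.createDataFrame(rows, ["state", "pills"])

    # pandas-style: chain methods on the DataFrame
    by_api = (
        df.groupBy("state")
        .sum("pills")
        .withColumnRenamed("sum(pills)", "total_pills")
    )

    # SQL-style: register the DataFrame as a temporary view and query it
    df.createOrReplaceTempView("shipments")
    by_sql = spark.sql(
        "SELECT state, SUM(pills) AS total_pills FROM shipments GROUP BY state"
    )
    return by_api, by_sql
```

Calling `by_api, by_sql = pills_by_state(spark)` and then `.show()` on each result should print the same totals, which is a quick way to see that the two styles are interchangeable.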



Even though we're only running this locally, let's get a really big dataset to work with. Through a long legal battle and countless hours of data cleaning, The Washington Post gained access to, cleaned, and made publicly available 6 years of the DEA's data on opioids in America. This was a big deal when it happened in the summer of 2019, and it is still highly relevant. To get started, download the dataset from Kaggle [here](https://www.kaggle.com/paultimothymooney/pain-pills-in-the-usa).

This is a nearly 12GB compressed file. Once you download it, it will need to be unzipped/uncompressed. In its uncompressed form, the file is about 80GB. Below, try reading it in using pandas, then try Spark.

In [2]:
#try it with pandas

try_pandas = pd.read_csv('arcos_all_washpost.tsv', sep='\t')

KeyboardInterrupt: 

## Hint. . . .


pandas tries to load the whole file into memory at once, so it will never get through an 80GB dataset on a laptop 🤣
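
If you want to stay in pandas, a common workaround is to read the file in pieces. The sketch below assumes you only need an aggregate (here, a row count) and uses the `chunksize` argument of `pd.read_csv`, which makes it return an iterator of smaller DataFrames. The helper name `count_rows_in_chunks` and the tiny sample file are made up for illustration; the two column names are taken from the dataset's header.

```python
import os
import tempfile

import pandas as pd


def count_rows_in_chunks(path, chunksize=100_000):
    """Count data rows in a tab-separated file without loading it all at once."""
    total = 0
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
        total += len(chunk)
    return total


# Tiny stand-in file so the sketch runs anywhere (made-up values, not ARCOS data)
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.tsv")
    with open(sample_path, "w") as f:
        f.write("BUYER_STATE\tQUANTITY\n")
        for i in range(10):
            f.write(f"PA\t{i}\n")
    sample_rows = count_rows_in_chunks(sample_path, chunksize=4)

print(sample_rows)
```

This keeps memory use bounded, but everything still runs on a single machine and each new question means another full pass over the file, which is the gap Spark is meant to fill.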

In [4]:
%%time
#read in your dataset using Spark and be sure to print how long it takes
#note: no sep option is set here, so Spark uses its default comma delimiter

data = (
    spark.read.format('csv')
    .options(header='true', inferSchema='true')
    .load("arcos_all_washpost.tsv")
)

CPU times: user 43.7 ms, sys: 20.6 ms, total: 64.3 ms
Wall time: 6min 28s
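
Much of that wall time is probably schema inference: when it is enabled, Spark makes an extra pass over the data to guess each column's type. Here is a minimal sketch of a variation that sets the tab separator explicitly (the file is a `.tsv`, and Spark's CSV reader defaults to commas) and turns inference off. The names `READ_OPTIONS` and `read_arcos` are hypothetical, and the sketch assumes an existing `spark` session is passed in.

```python
# Reader options for the tab-separated file; the keys are Spark CSV reader options.
READ_OPTIONS = {
    "sep": "\t",             # fields are separated by tabs, not commas
    "header": "true",        # the first line holds the column names
    "inferSchema": "false",  # read every column as a string, skipping the type-guessing pass
}


def read_arcos(spark, path="arcos_all_washpost.tsv"):
    """Return a lazily loaded DataFrame; `spark` is an existing SparkSession."""
    return spark.read.options(**READ_OPTIONS).csv(path)
```

With `data = read_arcos(spark)`, every column comes back as a string, so you would cast the ones you need later, for example `data.withColumn("QUANTITY", data["QUANTITY"].cast("double"))`. The trade-off is a faster load in exchange for doing the typing yourself.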


In [5]:
#show the first few rows of data

data.show(5)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|REPORTER_DEA_NO	REPORTER_BUS_ACT	REPORTER_NAME	REPORTER_ADDL_CO_INFO	REPORTER_ADDRESS1	REPORTER_ADDRESS2	REPORTER_CITY	REPORTER_STATE	REPORTER_ZIP	REPORTER_COUNTY	BUYER_DEA_NO	BUYER_BUS_ACT	BUYER_NAME	BUYER_ADDL_CO_INFO	BUYER_ADDRESS1	BUYER_ADDRESS2	BUYER_CITY	BUYER_STATE	BUYER_ZIP	BUYER_COUNTY	TRANSACTION_CODE	DRUG_CODE	NDC_NO	DRUG_NAME	QUANTITY	UNIT	ACTION_INDICATOR	ORDER_FORM_NO	CORRECTION_NO	S