**This tutorial will cover**

* Setting up a PySpark session
* Comparing PySpark to Pandas
* Some basic PySpark functions

In [50]:
# PySpark is already preinstalled in the Docker container, so there is no need to install it here.
# To run a shell command from a notebook cell, prefix it with an exclamation mark.
# !pip install pyspark

In [51]:
# If the package has been installed properly, the import will not raise an error.
import pyspark

In [52]:
# How to read data with Pandas.
import pandas as pd
df = pd.read_csv("test1.csv")
df

    Name  Age
0  Steve   30
1   Bill   31
2   John   32


In [53]:
# Now I will show you how to read data with PySpark.
# Before you can read data with PySpark you need to start a Spark session first.
from pyspark.sql import SparkSession

In [54]:
# Set a name for the application and create a SparkSession. If a SparkSession already exists, getOrCreate() returns it instead of creating a new one.
spark = SparkSession.builder.appName("Practice").getOrCreate()

In [55]:
spark

In [56]:
# Read the CSV file into a DataFrame. `spark.read` returns a DataFrameReader, an interface for loading data from external sources; its csv() method returns the DataFrame.
df_pyspark_no_header = spark.read.csv("test1.csv")

# Tip: you can type `spark.read.` and press the Tab key to see a list of possible read methods.

In [57]:
# View the DataFrame's column names and types.
df_pyspark_no_header

DataFrame[_c0: string, _c1: string]

In [58]:
# Print the rows as a table. By default, show() prints only the first 20 rows.
df_pyspark_no_header.show()

+-----+----+
|  _c0| _c1|
+-----+----+
| Name| Age|
|Steve|  30|
| Bill|  31|
| John|  32|
+-----+----+



Spark read the header line as data and assigned the default column names `_c0` and `_c1`. We would rather use the first row as the header. This is how we do that:

In [59]:
# Set the reader's "header" option to "true" so that the first line supplies the column names.
df_pyspark = spark.read.option("header", "true").csv("test1.csv")

# Tip: You can right-click on any word in the chain above and select "Show Contextual Help" (or press Ctrl+I) to view some code hints and tips.

In [60]:
df_pyspark.show()

+-----+----+
| Name| Age|
+-----+----+
|Steve|  30|
| Bill|  31|
| John|  32|
+-----+----+



In [61]:
# You can also view the type of df_pyspark.
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

Let's compare that to the Pandas dataframe.

In [62]:
type(df)

pandas.core.frame.DataFrame

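Both objects are called DataFrame, but they are different classes from different libraries: a Pandas DataFrame is held in the memory of a single machine, whereas a PySpark DataFrame is distributed and evaluated lazily. The two nevertheless share some similar API features.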

In [63]:
# head(n) returns the first n rows as a list of Row objects (in Pandas it returns a DataFrame).
df_pyspark.head(3)

[Row(Name='Steve',  Age=' 30'),
 Row(Name='Bill',  Age=' 31'),
 Row(Name='John',  Age=' 32')]

In [64]:
# Show only the first two rows instead of the entire dataset.
df_pyspark.show(2)

+-----+----+
| Name| Age|
+-----+----+
|Steve|  30|
| Bill|  31|
+-----+----+
only showing top 2 rows



In [65]:
# View the DataFrame's schema in tree format.
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |--  Age: string (nullable = true)

