# Introduction

**This tutorial will cover:**

* Setting up a PySpark session and some basic PySpark functions
* Comparing PySpark to Pandas

## Setting up a PySpark session and some basic PySpark functions

In [1]:
# The Docker container already has PySpark installed, so there is no need to install it here.
# Prefix shell commands with an exclamation mark.
# !pip install pyspark

In [2]:
# If the package is installed properly, the import will not raise an error.
import pyspark

In [3]:
# How to read data with PySpark.
# Before you can read data with PySpark you need to start a Spark session first.
from pyspark.sql import SparkSession

In [4]:
# Set a name for the application and create a SparkSession. If a SparkSession already exists, then getOrCreate() will retrieve it.
spark = SparkSession.builder.appName("Practice").getOrCreate()

In [5]:
spark

In [6]:
# Read a CSV file into a DataFrame. `spark.read` returns a DataFrameReader, which is an interface for reading data from external data sources; its csv() method returns the DataFrame.
df_pyspark_no_header = spark.read.csv("test1.csv")

# Tip: you can type `spark.read.` and press the Tab key to see a list of possible read methods.

In [7]:
# View the dataframe's schema. 
df_pyspark_no_header

DataFrame[_c0: string, _c1: string]

That schema looks strange. Let's print the rows with the `show()` method (it displays up to 20 rows by default) to see what is going on:

In [8]:
df_pyspark_no_header.show()

+-----+---+
|  _c0|_c1|
+-----+---+
| Name|Age|
|Steve| 30|
| Bill| 31|
| John| 32|
+-----+---+



By default, PySpark does not treat the first row of a CSV file as a header, so it names the columns `_c0`, `_c1`, etc. and reads the header row as data. To use the first row as the header, set `option("header", True)` on the reader:

In [9]:
df_pyspark = spark.read.option("header", True).csv("test1.csv")

# Tip: You can right-click on any word in the chain above and select "Show Contextual Help" (or press Ctrl+I) to view some code hints and tips.

Now the dataframe's schema shows the real column names:

In [10]:
df_pyspark

DataFrame[Name: string, Age: string]

In [11]:
df_pyspark.show()

+-----+---+
| Name|Age|
+-----+---+
|Steve| 30|
| Bill| 31|
| John| 32|
+-----+---+
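To make the difference between the two reads concrete, here is a minimal plain-Python sketch of the naming rule. It is only an illustration, not how Spark implements its reader: `CSV_TEXT` is an inlined stand-in for `test1.csv`, and `read_rows` is a hypothetical helper written for this example.

```python
import csv
import io

# Hypothetical stand-in for test1.csv, inlined so the sketch is self-contained.
CSV_TEXT = "Name,Age\nSteve,30\nBill,31\nJohn,32\n"


def read_rows(text, header=False):
    """Return (column_names, data_rows) for CSV text.

    With header=False, names are generated as _c0, _c1, ... and every row
    (including the first) is treated as data. With header=True, the first
    row supplies the column names and is excluded from the data.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if header:
        return rows[0], rows[1:]
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows


names_default, rows_default = read_rows(CSV_TEXT)
names_header, rows_header = read_rows(CSV_TEXT, header=True)

print(names_default, len(rows_default))
print(names_header, len(rows_header))
```

Without the header flag, the sketch yields four data rows under generated names; with it, three data rows under `Name` and `Age`, which mirrors the two `show()` outputs above.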



## Comparing PySpark to Pandas

In [12]:
# How to read data with Pandas.
import pandas as pd
df = pd.read_csv("test1.csv")
df

    Name  Age
0  Steve   30
1   Bill   31
2   John   32


In [13]:
# View the type of a Pandas dataframe.
type(df)

pandas.core.frame.DataFrame

In [14]:
# View the type of a PySpark dataframe.
type(df_pyspark)

pyspark.sql.dataframe.DataFrame

Both classes are named `DataFrame`, but they are different types from different libraries: `pandas.core.frame.DataFrame` and `pyspark.sql.dataframe.DataFrame`. They nonetheless share some similar API features, such as `head()`.

In [15]:
# Return the first 3 rows of a PySpark dataframe as a list of Row objects.
df_pyspark.head(3)

[Row(Name='Steve', Age='30'),
 Row(Name='Bill', Age='31'),
 Row(Name='John', Age='32')]
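For comparison, the Pandas method of the same name returns a new DataFrame rather than a list of rows. A small sketch, assuming the same three records inlined through `io.StringIO` as a stand-in for `test1.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for test1.csv, inlined so the sketch is self-contained.
CSV_TEXT = "Name,Age\nSteve,30\nBill,31\nJohn,32\n"

df = pd.read_csv(io.StringIO(CSV_TEXT))

# In Pandas, head(3) returns another DataFrame holding the first 3 rows.
first_rows = df.head(3)
print(type(first_rows).__name__)

# Convert to a list of dicts to get something closer to PySpark's list of Rows.
records = first_rows.to_dict("records")
print(records)
```

So the method names match, but the return types differ: a list of `Row` objects in PySpark, a `DataFrame` in Pandas.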

In [16]:
# Show only the first two rows (instead of the entire dataset) as a dataframe.
df_pyspark.show(2)

+-----+---+
| Name|Age|
+-----+---+
|Steve| 30|
| Bill| 31|
+-----+---+
only showing top 2 rows



In [17]:
# View more details of the dataframe's schema.
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
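Note that `Age` is a string here: PySpark's CSV reader does not infer column types unless asked to (the reader has an `inferSchema` option for that). Pandas, by contrast, infers types while reading. A minimal sketch of the Pandas side, again assuming the three records inlined as a stand-in for `test1.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for test1.csv, inlined so the sketch is self-contained.
CSV_TEXT = "Name,Age\nSteve,30\nBill,31\nJohn,32\n"

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Pandas inspects the values while parsing and picks a dtype per column.
print(df.dtypes)

# Check the inferred type of Age without depending on the exact dtype name.
age_is_integer = pd.api.types.is_integer_dtype(df["Age"])
print(age_is_integer)
```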

