# Tutorial 2

**This tutorial will cover:**

* PySpark DataFrames
* Reading datasets
* Checking the data types of columns (i.e. checking the schema)
* Selecting columns and indexing
* Viewing descriptive statistics of a DataFrame
* Adding columns
* Dropping columns
* Renaming columns

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("DataFrame").getOrCreate()

In [3]:
spark

## PySpark DataFrames and Reading datasets

In [4]:
# Read the dataset.
df_pyspark = spark.read.option("header", True).csv("test-data-2.csv")
df_pyspark

DataFrame[Name: string, Age: string, Experience: string]

In [5]:
# View the dataset (`show()` prints the first 20 rows by default).
df_pyspark.show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
|Steve| 30|        10|
| Bill| 31|         8|
| John| 32|         4|
+-----+---+----------+



## Checking the data types of columns (i.e. checking the schema)

In [6]:
# View the schema.
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Experience: string (nullable = true)



The `Age` and `Experience` columns should be `integer` data types, but they are strings. We can fix that with the `inferSchema=True` option in the `csv()` method:

In [7]:
# Note that `header=True` can also be passed as a keyword argument to the `csv()` method.
df_pyspark = spark.read.csv("test-data-2.csv", header=True, inferSchema=True)
df_pyspark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Experience: integer (nullable = true)



## Selecting columns and indexing

In [8]:
# Get the column names.
df_pyspark.columns

['Name', 'Age', 'Experience']

In [9]:
# Get a specific number of rows in list format.
df_pyspark.head(3)

# NOTE: This is similar to the Pandas `head()` method, except that PySpark returns a list of `Row` objects whereas Pandas returns a DataFrame.

[Row(Name='Steve', Age=30, Experience=10),
 Row(Name='Bill', Age=31, Experience=8),
 Row(Name='John', Age=32, Experience=4)]

In [10]:
# Select a column and view its schema.
df_pyspark.select("Name")

DataFrame[Name: string]

In [11]:
# Select a column and view all the elements as a dataframe.
df_pyspark.select("Name").show()

+-----+
| Name|
+-----+
|Steve|
| Bill|
| John|
+-----+



In [12]:
# Select multiple columns and view all the elements as a dataframe.
df_pyspark.select(["Name", "Experience"]).show()

+-----+----------+
| Name|Experience|
+-----+----------+
|Steve|        10|
| Bill|         8|
| John|         4|
+-----+----------+



In [13]:
# Check the data types of the columns.
df_pyspark.dtypes

[('Name', 'string'), ('Age', 'int'), ('Experience', 'int')]

## Viewing descriptive statistics of a DataFrame

In [14]:
# Use the `describe()` method to view descriptive statistics (similar to Pandas).
df_pyspark.describe().show()

+-------+-----+----+-----------------+
|summary| Name| Age|       Experience|
+-------+-----+----+-----------------+
|  count|    3|   3|                3|
|   mean| null|31.0|7.333333333333333|
| stddev| null| 1.0|3.055050463303893|
|    min| Bill|  30|                4|
|    max|Steve|  32|               10|
+-------+-----+----+-----------------+



## Adding columns

You can add columns with the `withColumn()` method. View the docs for more details: [pyspark.sql.DataFrame.withColumn](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html?highlight=withcolumn#pyspark.sql.DataFrame.withColumn).

In [15]:
df_pyspark.withColumn("Experience After 2 Years", df_pyspark["Experience"]+2).show()

+-----+---+----------+------------------------+
| Name|Age|Experience|Experience After 2 Years|
+-----+---+----------+------------------------+
|Steve| 30|        10|                      12|
| Bill| 31|         8|                      10|
| John| 32|         4|                       6|
+-----+---+----------+------------------------+



## Dropping columns

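You can drop columns with the `drop()` method. Note that the earlier `withColumn()` call was not assigned back to `df_pyspark`, so the column named below is not actually present; `drop()` silently ignores names that are not in the schema, which is why the output shows the original three columns.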

In [16]:
df_pyspark.drop("Experience After 2 Years").show()

+-----+---+----------+
| Name|Age|Experience|
+-----+---+----------+
|Steve| 30|        10|
| Bill| 31|         8|
| John| 32|         4|
+-----+---+----------+



## Renaming columns

You can rename a column with the `withColumnRenamed()` method, which takes the existing column name followed by the new one.

In [17]:
df_pyspark.withColumnRenamed("Name", "New Name").show()

+--------+---+----------+
|New Name|Age|Experience|
+--------+---+----------+
|   Steve| 30|        10|
|    Bill| 31|         8|
|    John| 32|         4|
+--------+---+----------+

