# Introduction to this tutorial

This tutorial introduces how to load data into Spark. We will use the following Data Set:  
__ratings.csv__: _100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users._

This Data Set is in CSV format.

## SparkSession and Settings
Before we continue, set up a SparkSession.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("31LoadingDataFromCSV").getOrCreate()

Additionally, we define one setting that the rest of this notebook relies on:
 - `RATINGS_CSV_LOCATION` is used to tell our Spark Application where to find the ratings.csv file

In [None]:
# Location of the ratings.csv file
RATINGS_CSV_LOCATION = "/home/jovyan/data-sets/ml-latest-small/ratings.csv"

# Part 1: Loading Data from a CSV file
Take a moment to study the readme file that belongs to the MovieLens-latest-small dataset.

[`pyspark.sql.DataFrameReader.csv`](https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) is used for reading CSV files.

We can access it through our `SparkSession`, which we created in a previous cell as an object named `spark`. Its `read` attribute returns a `DataFrameReader`, so [`spark.read.csv()`](https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) tells Spark to read a CSV file from a given location.

So let's try this.

In [None]:
df = spark.read.csv(RATINGS_CSV_LOCATION)

You've now told Spark to load the data from the given CSV file. 
Because Spark evaluates lazily, we have to explicitly ask it to show us something. 

Let's see the content by running `.show()` on our new DataFrame.  
Let's also check the schema of what we loaded, by using `.printSchema()`.

In [None]:
df.show()
df.printSchema()

As you can see, the data is loaded, but not quite correctly: the header row is treated as data, and the columns get default names such as `_c0` and `_c1`. Additionally, every column is read as `StringType`, which is not ideal.

### Parsing the CSV file correctly and DataTypes()


We can fix the aforementioned issues by giving the `read.csv()` method the correct settings.

To quote the `README.txt` that belongs to the MovieLens data:
> The dataset files are written as [__comma__-separated values](http://en.wikipedia.org/wiki/Comma-separated_values) files with a __single header row__. Columns that contain commas (`,`) are __escaped using double-quotes__ (`"`). These files are __encoded as UTF-8__.

*__NOTE__: It is a good idea to read the full `README.txt`, since it explains in detail how the data should be interpreted.*  
*__NOTE__: Take a moment to study the [documentation for `read.csv()`](https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader) to learn about which possible things we can set.*
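
The quoting rule from the README excerpt is easiest to see on a field that contains a comma. The sketch below uses Python's standard `csv` module rather than Spark, purely to illustrate what a quote-aware parser does. The sample row and the `title` column are made up for illustration and are not taken from `ratings.csv`.

```python
import csv
import io

# Made-up sample: a header row plus one row whose second field contains a comma.
raw = 'movieId,title\n1,"Example, The (2000)"\n'

# Naive splitting on commas breaks the quoted field into two pieces.
naive_fields = raw.splitlines()[1].split(",")

# A quote-aware parser keeps the comma inside the double-quoted field.
parsed_rows = list(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))

print(naive_fields)
print(parsed_rows[1])
```

The naive split yields three pieces for a two-column row, while the quote-aware parser returns the two fields intact. The `sep` and `quote` settings of `read.csv()` express the same delimiter and quote character to Spark.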

To parse the CSV correctly, we are going to need to set the following on our `read.csv()` method:
 
 1. We leave the same `path` as before, referring to `RATINGS_CSV_LOCATION` that we set previously.
 2. Since we have __comma-separated values__, we need to set `sep` to `','`.
 3. Since we have a __single header row__, we need to set `header` to `True`.
 4. Since columns that contain commas (`,`) are __escaped using double-quotes__ (`"`), we set `quote` to `'"'`.
 5. Since the files are __encoded as UTF-8__, we set `encoding` to `UTF-8`.
 6. Additionally, since we observed that all values are cast to `StringType` by default, we set `inferSchema` to `True`.

In [None]:
# Loading CSV file with proper parsing and inferSchema
df = spark.read.csv(
    path=RATINGS_CSV_LOCATION,
    sep=",",
    header=True,
    quote='"',
    encoding="UTF-8",
    inferSchema=True,
)

# Displaying results of the load
df.show()
df.printSchema()

Looking at the output we can notice a few things:

 - The header now appears properly parsed, no more `_c0`, `_c1`, etc.
 - The numeric value columns are cast to `IntegerType` and `DoubleType` thanks to `inferSchema`
 
Using `inferSchema`, Spark inferred the following types for our schema:

> `|-- userId: integer (nullable = true)`  
> `|-- movieId: integer (nullable = true)`  
> `|-- rating: double (nullable = true)`  
> `|-- timestamp: integer (nullable = true)`  

In short, our data now has a correctly parsed schema, with DataTypes that match the current data. 

### Type Safety

`inferSchema` is a great way to quickly set the schema for the data we are using. It is, however, good practice to be as explicit as possible when it comes to DataTypes and schema; we call this [Type Safety](https://en.wikipedia.org/wiki/Type_safety).  
Applying a proper schema and ensuring Type Safety is extra important once we start using more than one Data Source. For example, when trying to join two datasets, the join will not work as expected if the DataTypes of the join columns are not set correctly.

*__NOTE__: Take a moment to read about [Spark DataTypes](https://spark.apache.org/docs/latest/sql-reference.html#data-types).*

Let's now set our schema to an explicit value. We will do this by using the `schema` option belonging to [`read.csv()`](https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader).

`schema`'s description reads:

> *an optional `pyspark.sql.types.StructType` for the input schema or a DDL-formatted string (For example `col0 INT, col1 DOUBLE`).*

We won't cover `StructType` at this point; instead, we will use a `DDL-formatted string`. Spark uses Apache Hive's DDL syntax.

*__NOTE__: Take a moment to read about [Compatibility with Apache Hive](https://spark.apache.org/docs/2.4.3/sql-migration-guide-hive-compatibility.html#supported-hive-features) if you want to learn more about the Apache Hive DDL syntax that Spark uses.*

In this case we will define our `DDL-formatted string` as:  
`'userId INT, movieId INT, rating DOUBLE, timestamp INT'`

In [None]:
# Type-safe loading of the ratings.csv file
df = spark.read.csv(
    path=RATINGS_CSV_LOCATION,
    sep=",",
    header=True,
    quote='"',
    encoding="UTF-8",
    schema="userId INT, movieId INT, rating DOUBLE, timestamp INT",
)

# Displaying results of the load
df.show()
df.printSchema()
df.describe().show()
df.explain()

The data and schema are the same as before, but because the schema is now explicit, we no longer depend on Spark's inference and can ensure Type Safety.

## What we've learned so far:

- How to use `read.csv()` to load CSV files, and how to control the settings of this method
- By default, CSVs are parsed with all columns being cast to `StringType`
- `inferSchema` allows Spark to guess what schema should be used
- To ensure proper Type Safety, we can set an explicit schema using a DDL-formatted string (Hive DDL syntax)

---

In [None]:
spark.stop()

End of part 1