In [1]:
import pyspark
import pylint

## Using DataFrames
Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built on top of RDDs.

The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns and observations in the rows). DataFrames are not only easier to understand than RDDs, they are also better optimized for complicated operations.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the same result, but some often take much longer than others. When using RDDs, it's up to the data scientist to figure out the right way to optimize the query, but the DataFrame implementation has much of this optimization built in!

To start working with Spark DataFrames, you first have to create a `SparkSession` object from your `SparkContext`. You can think of the `SparkContext` as your connection to the cluster and the `SparkSession` as your interface with that connection.

Remember, for the rest of this course you'll have a `SparkSession` called `spark` available in your workspace!

In [2]:
sc = pyspark.SparkContext()

In [3]:
sc.version

'3.4.0'

In [4]:
sc

## Creating a SparkSession
We've already created a `SparkSession` for you called `spark`, but what if you're not sure there already is one? Creating multiple `SparkSession`s and `SparkContext`s can cause issues, so it's best practice to use the `SparkSession.builder.getOrCreate()` method. This returns an existing `SparkSession` if there's already one in the environment, or creates a new one if necessary!

* Import `SparkSession` from `pyspark.sql`.
* Make a new `SparkSession` called `my_spark` using `SparkSession.builder.getOrCreate()`.
* Print `my_spark` to the console to verify it's a `SparkSession`.

In [5]:
from pyspark.sql import SparkSession

# Create my_spark
my_spark = SparkSession.builder.getOrCreate()

# Print my_spark
print(my_spark)

<pyspark.sql.session.SparkSession object at 0x0000020BC237BA30>


In [6]:
my_spark

## Viewing tables
Once you've created a `SparkSession`, you can start poking around to see what data is in your cluster!

Your `SparkSession` has an attribute called `catalog` which lists all the data inside the cluster. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the `.listTables()` method, which returns a list describing every table in your cluster, including each table's name.

* See what tables are in your cluster by calling `spark.catalog.listTables()` and printing the result!

### Problems running Hadoop on Windows
Hadoop requires native libraries on Windows to work properly. That includes access to the file:// filesystem, where Hadoop uses some Windows APIs to implement POSIX-like file access permissions.

This is implemented in HADOOP.DLL and WINUTILS.EXE.

In particular, %HADOOP_HOME%\BIN\WINUTILS.EXE must be locatable.

If it is not, Hadoop or an application built on top of Hadoop will fail.

#### How to fix a missing WINUTILS.EXE
You can fix this problem in two ways:

* Install a full native Windows Hadoop version. The ASF does not currently (September 2015) release such a version; releases are available externally.
* Or: get the WINUTILS.EXE binary from a Hadoop redistribution. There is a repository of these for some Hadoop versions on GitHub.

Then:

* Set the environment variable %HADOOP_HOME% to point to the directory above the BIN dir containing WINUTILS.EXE.
* Or: run the Java process with the system property hadoop.home.dir set to the home directory.

Refer to the [WindowsProblems](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems) page on Confluence.

#### Steps to resolve:
* Clone [this repo](https://github.com/steveloughran/winutils) into a folder using:
`git clone https://github.com/steveloughran/winutils.git`
* Add a new system variable called `HADOOP_HOME` whose value is the path you cloned the repo to, followed by `\hadoop-3.0.0` (or the folder for the latest version you can find in the repo).
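To confirm the variable is set correctly before starting Spark, a small check along these lines may help. It is only a sketch: `find_winutils` is a hypothetical helper written for this note, not part of Hadoop or PySpark, and it just looks for `bin\winutils.exe` under `HADOOP_HOME`.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_winutils(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the path to winutils.exe under HADOOP_HOME, or None if it is missing."""
    env = os.environ if env is None else env
    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # The variable is unset or empty, so there is nowhere to look.
        return None
    candidate = Path(hadoop_home) / "bin" / "winutils.exe"
    return candidate if candidate.is_file() else None


# Prints the resolved path, or None when HADOOP_HOME is unset or incomplete.
print(find_winutils())
```

If this prints `None`, recheck the value of `HADOOP_HOME` and restart the notebook kernel so the new environment variable is picked up.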
   
##### You can try this also:
* run: `pip install pyhadoop`

In [9]:
# Print the tables in the catalog
print(my_spark.catalog.listTables())
my_spark.catalog.listTables()

[]


[]