# Data Exploration

In this notebook we'll look at one of the first steps in any data engineering project: getting to know what the inputs look like.

### Step 1 - What inputs do you have?

In [None]:
%%bash
ls ../data

### Step 2 - Get your tools set up

We will use [PySpark](http://spark.apache.org/docs/2.1.1/api/python/index.html) to inspect the data on hand and gather some basic details about it.

In [None]:
import os
from IPython.display import display, HTML
import pandas as pd

#Locating where pyspark is installed
import findspark
findspark.init()
import pyspark

#Settings for PySpark to work
driver_memory = '4g'
num_executors = 2
executor_memory = '1g'
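#Alternative that also passes the executor settings (num_executors is an int, so it needs str() before concatenation)
#pyspark_submit_args = ' --driver-memory ' + driver_memory + ' --executor-memory ' + executor_memory + ' --num-executors ' + str(num_executors) + ' pyspark-shell'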
pyspark_submit_args = ' --driver-memory ' + driver_memory + ' pyspark-shell'

#Setting the required parameters to start up PySpark
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

#Import Modules Needed for PySpark
from pyspark.sql import SparkSession
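
The cell above builds `PYSPARK_SUBMIT_ARGS` by string concatenation, which is easy to get wrong once numeric settings are involved. As a minimal sketch, the hypothetical helper `build_submit_args` below (not part of PySpark) assembles the same kind of string from optional settings. Note that `--num-executors` is aimed at cluster managers such as YARN, so it is assumed here to have little effect in a local session.

```python
def build_submit_args(driver_memory, executor_memory=None, num_executors=None):
    """Build a PYSPARK_SUBMIT_ARGS string (hypothetical helper, not a PySpark API)."""
    parts = ["--driver-memory", driver_memory]
    if executor_memory is not None:
        parts += ["--executor-memory", executor_memory]
    if num_executors is not None:
        # Convert to str so integer settings can be joined safely
        parts += ["--num-executors", str(num_executors)]
    # The value must end with 'pyspark-shell' for PySpark to launch correctly
    parts.append("pyspark-shell")
    return " ".join(parts)

# Usage (set this before the SparkSession is created):
# import os
# os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args('4g')
```

Building the list first and joining it once avoids both missing spaces and the `str + int` error that plain concatenation would raise.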

In [None]:
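#Helper for pretty formatting for Spark DataFrames
def showDF(df, limitRows =  20, truncate = True):
    #Show at most limitRows rows; truncate = False displays full cell contents
    if(truncate):
        pd.set_option('display.max_colwidth', 50)
    else:
        #None removes the width limit (pandas 1.0+; the older -1 is deprecated)
        pd.set_option('display.max_colwidth', None)
    pd.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    #Restore the pandas defaults so later cells are not affected
    pd.reset_option('display.max_rows')
    pd.reset_option('display.max_colwidth')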

In [None]:
#Creating a spark session
spark = SparkSession.builder.appName("Data Exploration").getOrCreate()

### Step 3 - Look inside your data

We need to look at how our data is composed:
1. Format
2. Structure
3. Size
4. Dimensions

In this example our input is a CSV file with a header. Let's see what the data looks like.
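
Before handing the file to Spark, it can help to confirm the format by looking at the first few raw lines. The sketch below uses only the Python standard library; `peek_lines` is a hypothetical helper name, and the UTF-8 encoding is an assumption about the file.

```python
from itertools import islice

def peek_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file without loading the whole file."""
    # errors="replace" keeps the peek from failing on unexpected bytes
    with open(path, encoding=encoding, errors="replace") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Usage (path taken from the read cell in this notebook):
# for line in peek_lines("../data/WDICountry.csv", n=3):
#     print(line)
```

The first line returned should be the header row, which also shows which delimiter and quoting the file uses.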

#### Read The Data

In [None]:
#Read the file into a Spark Data Frame
country = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("../data/WDICountry.csv")

#### Inspect the schema of the file you just read

In [None]:
country.printSchema()

#### Take a look at some sample data

You can run `<dataframe>.show()` to look at sample data. However, its output is not well formatted, so we will use our helper function instead.

In [None]:
showDF(country, truncate = False)

#### Get Some Basic Stats

In [None]:
#Count the number of records in the dataframe
country.count()

#### Examining Dimensions
##### How many different regions do the various countries belong to?

In [None]:
showDF(country.select('Region').distinct(), truncate = False)

##### How many different income groups do we have across countries?

In [None]:
showDF(country.select('Income Group').distinct(), truncate = False)
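
The two questions above ask "how many", while the cells list the distinct values themselves. A minimal sketch of getting the number directly is shown below; `distinct_count` is a hypothetical helper that relies only on the standard DataFrame methods `select`, `dropna`, `distinct` and `count`. Spark's `distinct()` treats null as a value of its own, so the optional `drop_nulls` flag excludes missing entries from the count.

```python
def distinct_count(df, column, drop_nulls=False):
    """Return the number of distinct values in one column of a Spark DataFrame."""
    selected = df.select(column)
    if drop_nulls:
        # Remove rows where the selected column is null before counting
        selected = selected.dropna()
    return selected.distinct().count()

# Usage in this notebook (assumes the `country` DataFrame read earlier):
# distinct_count(country, 'Region')
# distinct_count(country, 'Income Group', drop_nulls=True)
```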

#### By applying the same steps as we did for the "WDICountry.csv" dataset, we can see what the rest of the datasets look like

###### WDISeries.csv

In [None]:
series = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("../data/WDISeries.csv")

In [None]:
series.printSchema()

In [None]:
showDF(series)

In [None]:
series.count()

#### Examining Dimensions
##### What are the different periodicities or aggregation methods we might expect to see in the data?

In [None]:
showDF(series.select('Periodicity').distinct(), truncate = False)

In [None]:
showDF(series.select('Aggregation Method').distinct(), truncate = False)

## Exercise

Repeat the same steps for the `WDIData.csv` file and read it into a dataframe called `indicators`.

In [None]:
# Read the data


In [None]:
# Inspect the schema


In [None]:
# Look at sample records


In [None]:
# Get some basic stats
