# SPARK101 EXERCISES

## 1. **Create a Spark dataframe that contains your favorite programming languages.**

* The name of the column should be language  
* View the schema of the dataframe  
* Output the shape of the dataframe  
* Show the first 5 records in the dataframe

In [1]:
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import SparkSession
from pydataset import data
# from vega_datasets import data

# Note: The pyspark avg and mean functions are aliases of each other
from pyspark.sql.functions import col, expr, concat, sum, avg, min, max, count, mean, lit, regexp_extract, regexp_replace, when, asc, desc, month, year, quarter

### Visualization (or Lack Thereof)

Spark dataframes have no built-in plotting methods. To visualize data from
Spark, call the `.toPandas` method on a Spark dataframe to convert it to a
pandas dataframe, then visualize it as you normally would.

!!!warning "Converting to a Pandas Dataframe"
    Converting a Spark dataframe to a pandas dataframe pulls all of the data into the driver's memory, so make sure enough memory is available before you convert.

## References

- [PySpark API Docs](https://spark.apache.org/docs/latest/api/python/index.html)
- [Spark SQL Programming Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html): the docs here show examples in several programming languages, so make sure you choose Python.
- [DataFrame class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame)
- [Column class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column)
- [pyspark.sql.functions module](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)
- `df.na`: [DataFrameNaFunctions class](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions)

In [2]:
from env import host, username, password

def get_connection(database, host=host, user=username, password=password):
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'

In [3]:
# query = """SELECT * FROM source"""
# url = get_connection("311_data")
# source_df = pd.read_sql(query, url)
# source_df = spark.createDataFrame(source_df)
# source_df.show(4)

In [4]:
spark = SparkSession.builder.getOrCreate()



In [10]:
# utilized demo code to populate mpg
mpg = spark.createDataFrame(data("mpg"))

mpg.write.json("data/mpg_json", mode="overwrite")

# like much else in spark, there are multiple ways we could do this:
(
    mpg.write.format("csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/mpg_csv")
)


## 2. **Load the mpg dataset as a spark dataframe.**

$a.$ Create one column of output that contains, for each vehicle, a message like the one below:

 `The 1999 audi a4 has a 4 cylinder engine.`

In [11]:
# display the first 20 vehicles (show() prints 20 rows by default)
# concat can be used to create a message like the one in the example.
mpg.show()

+------------+------------------+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|             model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+------------------+-----+----+---+----------+---+---+---+---+-------+
|        audi|                a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|                a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|                a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|                a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|                a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
|        audi|                a4|  2.8|1999|  6|manual(m5)|  f| 18| 26|  p|compact|
|        audi|                a4|  3.1|2008|  6|  auto(av)|  f| 18| 27|  p|compact|
|        audi|        a4 quattro|  1.8|1999|  4|manual(m5)|  4| 18| 26|  p|compact|
|        audi|        a4 quattro|  1.8|1999|  4|  auto(l5)|  4| 16| 25|  p|c

In [None]:
mpg.select(
    co

$b.$ Transform the trans column so that it only contains either manual or auto.

## 3. **Load the tips dataset as a spark dataframe.**

$a.$ What percentage of observations are smokers?  

$b.$ Create a column that contains the tip percentage  

$c.$ Calculate the average tip percentage for each combination of sex and smoker.  

## 4. **Use the Seattle weather dataset referenced in the lesson to answer the questions below.**

* Convert the temperatures to Fahrenheit.

* Which month has the most rain, on average?
   
* Which year was the windiest?

* What is the most frequent type of weather in January?
  
* What is the average high and low temperature on sunny days in July in 2013 and 2014?
  
* What percentage of days were rainy in Q3 of 2015?
  
* For each year, find what percentage of days it rained (had non-zero precipitation).