# Spark 101 Exercises
### Kwame V. Taylor

1. Create a Spark dataframe that contains your favorite programming languages.
  * The name of the column should be ```language```
  * View the schema of the dataframe
  * Output the shape of the dataframe
  * Show the first 5 records in the dataframe

**Imports**

In [1]:
import pandas as pd
import numpy as np

np.random.seed(666)

import pyspark
import pyspark.sql.functions as F
from pyspark.sql.functions import col, expr

spark = pyspark.sql.SparkSession.builder.getOrCreate()

**Create custom pyspark shape function**

In [13]:
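def spark_shape(self):
    """Return (row count, column count), similar to the pandas shape attribute."""
    return (self.count(), len(self.columns))

# Attach the helper to the Spark DataFrame class so it can be called as df.shape()
pyspark.sql.dataframe.DataFrame.shape = spark_shape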

**Create Spark dataframe**

In [2]:
# Create a pandas dataframe column-wise from a dictionary
pd_df = pd.DataFrame({'language': ['Python', 'Java', 'HTML', 'CSS', 'JavaScript']},
                     index=[1, 2, 3, 4, 5])
pd_df

     language
1      Python
2        Java
3        HTML
4         CSS
5  JavaScript


In [4]:
# Convert pandas dataframe to spark dataframe
df = spark.createDataFrame(pd_df)
df

DataFrame[language: string]

**View schema of dataframe**

In [8]:
df.printSchema()

root
 |-- language: string (nullable = true)



**Print shape of dataframe**

In [15]:
print((df.count(), len(df.columns)))

(5, 1)


Alternatively, I can call the custom ```spark_shape``` function defined earlier, which was attached to the Spark dataframe class as ```shape```.

In [14]:
df.shape()

(5, 1)
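
Because the helper is attached as a plain method, it is called as ```df.shape()```, whereas pandas exposes ```shape``` as an attribute without parentheses. The sketch below illustrates the difference. It uses a hypothetical stand-in class (```FakeFrame```, not part of PySpark) that models only ```count()``` and ```columns```, so it runs without a Spark session. Attaching a property to the real Spark class in the same way is an assumption that has not been tested here.

```python
# Hypothetical stand-in for a Spark dataframe: only count() and columns are modeled.
class FakeFrame:
    def __init__(self, rows, columns):
        self._rows = rows
        self.columns = columns

    def count(self):
        return len(self._rows)


def frame_shape(self):
    """Return (row count, column count), like the spark_shape helper."""
    return (self.count(), len(self.columns))


# Attached as a plain method: callers must write frame.shape_method()
FakeFrame.shape_method = frame_shape
# Attached as a property: callers write frame.shape, as in pandas
FakeFrame.shape = property(frame_shape)

frame = FakeFrame(rows=["Python", "Java", "HTML", "CSS", "JavaScript"],
                  columns=["language"])
print(frame.shape_method())  # method call, parentheses required
print(frame.shape)           # property access, no parentheses
```

Either form triggers ```count()``` on every access, which on a real Spark dataframe launches a job, so the method form arguably makes that cost more visible.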

**Print first 5 records**

In [5]:
df.show(5)

+----------+
|  language|
+----------+
|    Python|
|      Java|
|      HTML|
|       CSS|
|JavaScript|
+----------+



**Describe the dataframe**

In [16]:
df.describe().show()

+-------+--------+
|summary|language|
+-------+--------+
|  count|       5|
|   mean|    null|
| stddev|    null|
|    min|     CSS|
|    max|  Python|
+-------+--------+
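
The ```mean``` and ```stddev``` rows are null because ```language``` is a string column, while ```min``` and ```max``` look like the first and last values in lexicographic order. A quick plain-Python check (not Spark code, and assuming ordinary string comparison behaves the same way for these ASCII values) is consistent with that reading:

```python
# The five values from the language column above
languages = ["Python", "Java", "HTML", "CSS", "JavaScript"]

# Plain Python compares strings character by character
smallest = min(languages)
largest = max(languages)
ordered = sorted(languages)

print(smallest, largest)
print(ordered)
```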



2. Load the ```mpg``` dataset as a Spark dataframe.

    a. For each vehicle, create one column of output that contains a message like the one below:
        The 1999 audi a4 has a 4 cylinder engine.
    
    b. Transform the ```trans``` column so that it contains only ```manual``` or ```auto```.

**Load mpg dataset**

In [17]:
from pydataset import data

mpg = spark.createDataFrame(data("mpg"))
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows
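
Before writing Spark expressions, the row-level logic for parts a and b can be prototyped in plain Python on the five rows shown above. This is only a sketch: the function names ```vehicle_message``` and ```simplify_trans``` are hypothetical, and the regular expression assumes every ```trans``` value has the form ```auto(...)``` or ```manual(...)```, as in the rows displayed. In Spark, one possible translation would use column functions such as ```F.concat``` with ```F.lit``` for the message and ```F.regexp_replace``` for the transmission type.

```python
import re

# Sample rows copied from the mpg.show(5) output above (subset of columns)
rows = [
    {"manufacturer": "audi", "model": "a4", "year": 1999, "cyl": 4, "trans": "auto(l5)"},
    {"manufacturer": "audi", "model": "a4", "year": 1999, "cyl": 4, "trans": "manual(m5)"},
    {"manufacturer": "audi", "model": "a4", "year": 2008, "cyl": 4, "trans": "manual(m6)"},
    {"manufacturer": "audi", "model": "a4", "year": 2008, "cyl": 4, "trans": "auto(av)"},
    {"manufacturer": "audi", "model": "a4", "year": 1999, "cyl": 6, "trans": "auto(l5)"},
]


def vehicle_message(row):
    """Part a: build the sentence describing one vehicle."""
    return (f"The {row['year']} {row['manufacturer']} {row['model']} "
            f"has a {row['cyl']} cylinder engine.")


def simplify_trans(trans):
    """Part b: drop the parenthesized suffix, e.g. 'auto(l5)' -> 'auto'."""
    return re.sub(r"\(.*\)$", "", trans)


messages = [vehicle_message(row) for row in rows]
trans_types = [simplify_trans(row["trans"]) for row in rows]

for message, trans in zip(messages, trans_types):
    print(message, "|", trans)
```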



**Create new column**