# Spark API Exercises

## Exercise 1
1. Create a Spark dataframe that contains your favorite programming languages, with one column named `language`.
<br>*Hint: Start with a pandas dataframe. Maybe use a dictionary?*
2. View the schema of the dataframe
3. Output the shape of the dataframe
4. Show the first 5 records in the dataframe

In [1]:
# standard python imports
import pandas as pd
import numpy as np

In [4]:
# testing code output to create df
pd.DataFrame({'languages': ['CSS', 'Python', 'JavaScript', 'HTML']})

    languages
0         CSS
1      Python
2  JavaScript
3        HTML


In [5]:
# storing pandas df in variable
fav_languages = pd.DataFrame({'languages': ['CSS', 'Python', 'JavaScript', 'HTML']})

In [6]:
# importing pyspark library
import pyspark

# creating the spark object that activates the spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# testing code to transform pandas df to spark df
spark.createDataFrame(fav_languages)

DataFrame[languages: string]

In [7]:
# storing spark df in variable
spark_fav_languages = spark.createDataFrame(fav_languages)

In [10]:
# 1. Create Spark df
spark_fav_languages.show()

+----------+
| languages|
+----------+
|       CSS|
|    Python|
|JavaScript|
|      HTML|
+----------+



In [12]:
# 2. View schema of the df (first try: describe() returns summary statistics, not the schema)
spark_fav_languages.describe().show()

+-------+---------+
|summary|languages|
+-------+---------+
|  count|        4|
|   mean|     null|
| stddev|     null|
|    min|      CSS|
|    max|   Python|
+-------+---------+



In [13]:
# printSchema() shows each column's name, type, and nullability
spark_fav_languages.printSchema()

root
 |-- languages: string (nullable = true)



In [18]:
# 3. Output the shape of the df
print('There are',spark_fav_languages.count(),'rows and',len(spark_fav_languages.columns),'columns.')

There are 4 rows and 1 columns.


In [22]:
# 4. Show the first 5 records in the dataframe
spark_fav_languages.show(5)

+----------+
| languages|
+----------+
|       CSS|
|    Python|
|JavaScript|
|      HTML|
+----------+



## Exercise 2
Load the `mpg` dataset as a spark dataframe.

a. Create 1 column of output that contains a message like the one below for each record:

    The 1999 audi a4 has a 4 cylinder engine.

> Hint: You will need to concatenate values that already exist in the data with string literals

b. Transform the trans column so that it only contains either manual or auto.

> Hint: Consider spark string methods and `when().otherwise()` chaining

In [23]:
# importing the data loader from pydataset
from pydataset import data

In [25]:
# creating spark df from mpg data
spark_mpg = spark.createDataFrame(data('mpg'))

spark_mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [29]:
# a. Create 1 column of output that contains a message like the one below for each record:
#     The 1999 audi a4 has a 4 cylinder engine.

# first attempt: print() only shows the Column expressions, not the per-row values
print('The',spark_mpg.year,spark_mpg.manufacturer,'has a',spark_mpg.cyl,'cylinder engine.')

The Column<'year'> Column<'manufacturer'> has a Column<'cyl'> cylinder engine.


In [31]:
# importing spark sql functions and string manipulation functions
from pyspark.sql.functions import regexp_extract, regexp_replace

In [59]:
# importing concat and lit functions
from pyspark.sql.functions import concat, lit

# creating the msg_txt column (note: the model is not included yet)
spark_mpg.select(concat(lit('The '), spark_mpg.year, lit(' '), spark_mpg.manufacturer,
                       lit(' has a '), spark_mpg.cyl, lit(' cylinder engine.')).alias('msg_txt')
                ).show(5, truncate = False)

+--------------------------------------+
|msg_txt                               |
+--------------------------------------+
|The 1999 audi has a 4 cylinder engine.|
|The 1999 audi has a 4 cylinder engine.|
|The 2008 audi has a 4 cylinder engine.|
|The 2008 audi has a 4 cylinder engine.|
|The 1999 audi has a 6 cylinder engine.|
+--------------------------------------+
only showing top 5 rows



In [108]:
# b.  Transform the trans column so that it only contains either manual or auto.
spark_mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



In [115]:
# # my attempt (fails: `when` was never imported, and `in` cannot be applied to a Column)
# spark_mpg.select(spark_mpg.trans,
#                 when('auto' in spark_mpg.trans, 'auto')
#                 .otherwise('manual')
#                 .alias('trans_2'),
#                 ).show(5)

NameError: name 'when' is not defined

In [113]:
spark_mpg.select('trans', 
           regexp_replace('trans', r'.{4}$', '').alias('trans_new')).show(5)

+----------+---------+
|     trans|trans_new|
+----------+---------+
|  auto(l5)|     auto|
|manual(m5)|   manual|
|manual(m6)|   manual|
|  auto(av)|     auto|
|  auto(l5)|     auto|
+----------+---------+
only showing top 5 rows



## Exercise 3
Load the `tips` dataset as a spark dataframe.

a. What percentage of observations are smokers?
> Hint: `.groupBy()` and `.withColumn()` are useful functions here

b. Create a column that contains the tip percentage
> Hint: `.withColumn()` is useful here

c. Calculate the average tip percentage for each combination of sex and smoker.
> Hint: Chain additional functions off the answer to part b 

In [117]:
# loading the tips data and creating spark df for exercise 3
spark_tips = spark.createDataFrame(data('tips'))

spark_tips.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [132]:
# a. What percentage of observations are smokers?

# looking at the different classes for smoker
# (rollup adds a null row that holds the grand total)
smoker = spark_tips.rollup('smoker').count()
smoker.show()

+------+-----+
|smoker|count|
+------+-----+
|  null|  244|
|    No|  151|
|   Yes|   93|
+------+-----+



In [141]:
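# getting an aggregate sum of the counts in `smoker` (the rollup counts shown above)
# note: the result is 488, not 244, because the rollup's null (grand total) row is summed too
smoker.agg({'count':'sum'}).show()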

+----------+
|sum(count)|
+----------+
|       488|
+----------+



In [148]:
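# using .withColumn to create a new column with percentages
# (fails: `smoker.count` is the DataFrame.count method, not the `count` column,
# and a column cannot be divided by a dataframe such as smoker.agg(...))
smoker.select(smoker.smoker,
             smoker.count).withColumn('percentage', col('count') / smoker.agg({'count':'sum'}))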

TypeError: Invalid argument, not a string or column: <bound method DataFrame.count of DataFrame[smoker: string, count: bigint]> of type <class 'method'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

In [None]:
# b. Create a column that contains the tip percentage
