## Exercises

Using the [repo setup directions](https://ds.codeup.com/fundamentals/git/), set up a new local and remote repository named `spark-exercises`. The local version of your repo should live inside of `~/codeup-data-science`.

Save this work in your `spark-exercises` repo. Then add, commit, and push your changes.

Create a Jupyter notebook or Python script named `spark101` for this exercise.

In [1]:
import pandas as pd
import numpy as np
import pyspark

1. Create a Spark dataframe that contains your favorite programming languages.

In [2]:
import pyspark.sql.functions as F

In [3]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [4]:
pd_df = pd.DataFrame({'language': ['python', 'r',
 'scala', 'java', 'c', 'c++']})

In [5]:
pd_df

  language
0   python
1        r
2    scala
3     java
4        c
5      c++


    - The name of the column should be `language`
    

In [6]:
df = spark.createDataFrame(pd_df)

In [7]:
df.show()

+--------+
|language|
+--------+
|  python|
|       r|
|   scala|
|    java|
|       c|
|     c++|
+--------+



    - View the schema of the dataframe

In [8]:
df.schema

StructType([StructField('language', StringType(), True)])

    - Output the shape of the dataframe

In [9]:
df.count(), len(df.columns)

(6, 1)

In [10]:
df.describe().show()

+-------+--------+
|summary|language|
+-------+--------+
|  count|       6|
|   mean|    null|
| stddev|    null|
|    min|       c|
|    max|   scala|
+-------+--------+



     - Show the first 5 records in the dataframe

In [11]:
df.show(5)

+--------+
|language|
+--------+
|  python|
|       r|
|   scala|
|    java|
|       c|
+--------+
only showing top 5 rows



2. Load the `mpg` dataset as a spark dataframe.

In [12]:
from pydataset import data
mpg_pandas = data('mpg')

In [13]:
mpg = spark.createDataFrame(mpg_pandas)

In [14]:
mpg.show(5)

+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|manufacturer|model|displ|year|cyl|     trans|drv|cty|hwy| fl|  class|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
|        audi|   a4|  1.8|1999|  4|  auto(l5)|  f| 18| 29|  p|compact|
|        audi|   a4|  1.8|1999|  4|manual(m5)|  f| 21| 29|  p|compact|
|        audi|   a4|  2.0|2008|  4|manual(m6)|  f| 20| 31|  p|compact|
|        audi|   a4|  2.0|2008|  4|  auto(av)|  f| 21| 30|  p|compact|
|        audi|   a4|  2.8|1999|  6|  auto(l5)|  f| 16| 26|  p|compact|
+------------+-----+-----+----+---+----------+---+---+---+---+-------+
only showing top 5 rows



    - Create 1 column of output that contains a message like the one below, for each vehicle:

            The 1999 audi a4 has a 4 cylinder engine.

In [17]:
# concatenate literals and columns; the model is included to match the target message
mpg.select(
    F.concat(
        F.lit('The '),
        mpg.year,
        F.lit(' '),
        mpg.manufacturer,
        F.lit(' '),
        mpg.model,
        F.lit(' has a '),
        mpg.cyl,
        F.lit(' cylinder engine.')
    ).alias('description')
).show(5, truncate=False)

+-----------------------------------------+
|description                              |
+-----------------------------------------+
|The 1999 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 2008 audi a4 has a 4 cylinder engine.|
|The 1999 audi a4 has a 6 cylinder engine.|
+-----------------------------------------+
only showing top 5 rows


     - Transform the `trans` column so that it only contains either `manual` or `auto`.

In [18]:
# mpg.select(F.when)

In [None]:
mpg.select(mpg.trans).show(10)

In [19]:
# mpg with the new column, tran
# F.when function
# mpg.trans.like('auto%'), 'auto'
# chain that with otherwise 'manual'
# after the withColumn, select my old and new column
mpg.withColumn('tran', 
               F.when(
                mpg.trans.like('auto%'), 'auto'
                ).otherwise(
                    'manual'
                    )
            ).select('trans','tran').show(5)

+----------+------+
|     trans|  tran|
+----------+------+
|  auto(l5)|  auto|
|manual(m5)|manual|
|manual(m6)|manual|
|  auto(av)|  auto|
|  auto(l5)|  auto|
+----------+------+
only showing top 5 rows



3. Load the `tips` dataset as a spark dataframe.

    1. What percentage of observations are smokers?
    

In [21]:
# load up tips from pydataset, feed it into createDataFrame
tips = spark.createDataFrame(data('tips'))

In [22]:
tips.count()

244

In [24]:
tips.groupby('smoker').count().show()

+------+-----+
|smoker|count|
+------+-----+
|    No|  151|
|   Yes|   93|
+------+-----+



In [25]:

# group by smoker column,
# grab the counts of each subpopulation,
# make a new column (withColumn) called percent
# reference the new aggregated column count, divide by the length of the df
# multiply by 100 to get the percentage, round the whole thing
# then show
tips.groupby('smoker').count().withColumn(
    'percent', F.round(
        F.col('count') / tips.count() * 100
        )
        ).show()

+------+-----+-------+
|smoker|count|percent|
+------+-----+-------+
|    No|  151|   62.0|
|   Yes|   93|   38.0|
+------+-----+-------+
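
Because `F.round` is called without a scale, the percentages above are rounded to whole numbers. The plain-Python sketch below repeats the same arithmetic on the counts from the output above, and also shows what two decimal places would give. The variable names are illustrative only.

```python
# Counts copied from the groupby output above.
counts = {"No": 151, "Yes": 93}
total = sum(counts.values())  # should match tips.count()

# Same arithmetic as the Spark expression: count / total * 100, then round.
percent = {k: round(v / total * 100) for k, v in counts.items()}

# Keeping two decimal places instead of rounding to a whole number.
precise = {k: round(v / total * 100, 2) for k, v in counts.items()}

print(total)
print(percent)
print(precise)
```

In Spark, the two-decimal version would correspond to passing a scale as the second argument, as in `F.round(expr, 2)`.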



    2. Create a column that contains the tip percentage
    

In [26]:
tips.columns

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [27]:
tips.withColumn(
    'tip_percentage', tips.tip / tips.total_bill
    ).show(5)

+----------+----+------+------+---+------+----+-------------------+
|total_bill| tip|   sex|smoker|day|  time|size|     tip_percentage|
+----------+----+------+------+---+------+----+-------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|0.05944673337257211|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|0.16054158607350097|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|0.16658733936220846|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2| 0.1397804054054054|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|0.14680764538430255|
+----------+----+------+------+---+------+----+-------------------+
only showing top 5 rows



In [29]:
tips.select(
    tips.tip,
    tips.total_bill,
    F.round(
        (tips.tip / tips.total_bill), 4
        ).alias('tip_percentage')
        ).show(5)

+----+----------+--------------+
| tip|total_bill|tip_percentage|
+----+----------+--------------+
|1.01|     16.99|        0.0594|
|1.66|     10.34|        0.1605|
| 3.5|     21.01|        0.1666|
|3.31|     23.68|        0.1398|
|3.61|     24.59|        0.1468|
+----+----------+--------------+
only showing top 5 rows



    3. Calculate the average tip percentage for each combination of sex and smoker.

In [30]:
# make the same tip_percentage column as above
# from that point, chain a groupby
# on sex and smoker
# apply the aggregate function mean to tip_percentage (within each group)
tips.withColumn(
    'tip_percentage',
    tips.tip / tips.total_bill
    ).groupby(
    'sex',
    'smoker').agg(
        F.round(
            F.mean('tip_percentage'),4).alias(
                'avg_tip_p')
                ).show()

+------+------+---------+
|   sex|smoker|avg_tip_p|
+------+------+---------+
|  Male|    No|   0.1607|
|Female|    No|   0.1569|
|  Male|   Yes|   0.1528|
|Female|   Yes|   0.1822|
+------+------+---------+



In [31]:
# pivot version:
tips.groupby(
    'sex').pivot(
        'smoker').agg(
            F.round(F.mean(tips.tip / tips.total_bill),4)).show()

+------+------+------+
|   sex|    No|   Yes|
+------+------+------+
|Female|0.1569|0.1822|
|  Male|0.1607|0.1528|
+------+------+------+



4. Use the seattle weather dataset referenced in the lesson to answer the questions below.

    - Convert the temperatures to Fahrenheit.
    

In [32]:
from vega_datasets import data
weather = data.seattle_weather()
weather = spark.createDataFrame(weather)

In [None]:
# c to f: (0°C × 9/5) + 32 = 32°F

In [34]:
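# this cell was run after the conversion cell below (In [33]),
# so the temperatures shown here are already in Fahrenheit
weather.show(5)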

+-------------------+-------------+--------+--------+----+-------+
|               date|precipitation|temp_max|temp_min|wind|weather|
+-------------------+-------------+--------+--------+----+-------+
|2012-01-01 00:00:00|          0.0|   55.04|    41.0| 4.7|drizzle|
|2012-01-02 00:00:00|         10.9|   51.08|   37.04| 4.5|   rain|
|2012-01-03 00:00:00|          0.8|   53.06|   44.96| 2.3|   rain|
|2012-01-04 00:00:00|         20.3|   53.96|   42.08| 4.7|   rain|
|2012-01-05 00:00:00|          1.3|   48.02|   37.04| 6.1|   rain|
+-------------------+-------------+--------+--------+----+-------+
only showing top 5 rows



In [33]:
# reassign weather to the converted dataframe, then preview with weather.show(5)
# steps:
# overwrite temp_max with temp_max * 9 / 5 + 32
# chain a second withColumn that does the same for temp_min
# note: re-running this cell would convert the already-converted values again
weather = (
    weather
    .withColumn("temp_max", F.col("temp_max") * 9 / 5 + 32)
    .withColumn("temp_min", F.col("temp_min") * 9 / 5 + 32)
)
weather.show(5)

+-------------------+-------------+--------+--------+----+-------+
|               date|precipitation|temp_max|temp_min|wind|weather|
+-------------------+-------------+--------+--------+----+-------+
|2012-01-01 00:00:00|          0.0|   55.04|    41.0| 4.7|drizzle|
|2012-01-02 00:00:00|         10.9|   51.08|   37.04| 4.5|   rain|
|2012-01-03 00:00:00|          0.8|   53.06|   44.96| 2.3|   rain|
|2012-01-04 00:00:00|         20.3|   53.96|   42.08| 4.7|   rain|
|2012-01-05 00:00:00|          1.3|   48.02|   37.04| 6.1|   rain|
+-------------------+-------------+--------+--------+----+-------+
only showing top 5 rows
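
A quick way to sanity-check the conversion is to invert it in plain Python. The sketch below uses two hypothetical helpers, `c_to_f` and `f_to_c`, and recovers the Celsius values implied by the first row shown above (55.04 and 41.0).

```python
import math


def c_to_f(celsius):
    """Celsius to Fahrenheit: multiply by 9/5, then add 32."""
    return celsius * 9 / 5 + 32


def f_to_c(fahrenheit):
    """Fahrenheit to Celsius: subtract 32, then multiply by 5/9."""
    return (fahrenheit - 32) * 5 / 9


# Converted temp_max and temp_min from the first row of the output above.
first_row_f = (55.04, 41.0)

# Invert the conversion, rounding away floating-point noise.
first_row_c = tuple(round(f_to_c(t), 2) for t in first_row_f)
print(first_row_c)

# Round trip: converting back should land on the displayed value again.
print(math.isclose(c_to_f(first_row_c[0]), first_row_f[0]))
```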



In [None]:
weather.columns

In [35]:
weather.count(), len(weather.columns)

(1461, 6)

    - Which month has the most rain, on average?
    

In [38]:
# group on: the month extracted from the date
# aggregate function: average precipitation, then take the highest value
# steps:
# make a column (withColumn) called month,
# using the F.month function on the weather.date col
# group by that new month column (using F.col('month'))
# aggregate the mean of weather.precipitation
# alias it as avg_rainfall
# sort descending,
# then take the month from the first row
weather.withColumn(
    'month', F.month(weather.date)
    ).groupby(
    F.col('month')
    ).agg(
        F.mean(
            weather.precipitation
            ).alias('avg_rainfall')
            ).sort(
                F.col('avg_rainfall').desc()
                ).first()[0]

11

    - Which year was the windiest?
    

In [39]:
# we want to aggregate on the years instead of the months
# same kind of process there --> we create a column called year,
# we use weather.date passed into the F.year() function to grab that
# group by the new year column,
# get the average wind value for each year,
# sort by the average wind column that we created via alias, (inside the agg())
# chain a desc() inside my sort()
# grab the first entry
weather.withColumn(
    'year',
    F.year(weather.date)
    ).groupby(
    F.col('year')
    ).agg(
        F.mean(weather.wind).alias('avg_wind')
        ).sort(
            F.col('avg_wind').desc()
            ).first()

Row(year=2012, avg_wind=3.400819672131148)

    - What is the most frequent type of weather in January?
    

In [40]:
# narrow down the month, using the F.month() function
# df[df.month == 1] ==> .filter(F.month(weather.date) == 1)
# aggregate weather 
# aggregation function: frequency -> take the count
# sort by the count, descending
weather.filter(
    F.month(weather.date) == 1
    ).groupby(
    weather.weather
    ).count().sort(
        F.col('count').desc()
        ).show()

+-------+-----+
|weather|count|
+-------+-----+
|    fog|   38|
|   rain|   35|
|    sun|   33|
|drizzle|   10|
|   snow|    8|
+-------+-----+



    - What is the average high and low temperature on sunny days in July in 2013 and 2014?
    

In [41]:
# filter on month as july
# keep rows where the year is greater than 2012 and less than 2015
# and where weather equals sun
# pass these filters in a chain,
# aggregate on the entire dataframe:
# average max temp, aliased as average_high_temp
# average min temp, aliased as average_low_temp
(
    weather.filter(F.month("date") == 7)
    .filter(F.year("date") > 2012)
    .filter(F.year("date") < 2015)
    .filter(F.col("weather") == F.lit("sun"))
    .agg(
        F.avg("temp_max").alias("average_high_temp"),
        F.avg("temp_min").alias("average_low_temp"),
    )
    .show()
)

+-----------------+-----------------+
|average_high_temp| average_low_temp|
+-----------------+-----------------+
|80.29192307692308|57.52884615384615|
+-----------------+-----------------+



    - What percentage of days were rainy in Q3 of 2015?
    

In [42]:
# in pandas -- (df.weather == "rain").mean()
# measure a rainy day by weather == rain
# the mean of the 0/1 flag is a proportion; multiply by 100 for a percentage (about 2.17%)
(
    weather.filter(F.year('date') == 2015)
    .filter(F.quarter('date') == 3)
    .select(
        F.when(
            F.col('weather') == 'rain', 1
            ).otherwise(0).alias('rain'))
    .agg(F.mean('rain'))
    .show()
)

+--------------------+
|           avg(rain)|
+--------------------+
|0.021739130434782608|
+--------------------+
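
The proportion above can be turned back into a day count with a little arithmetic. This plain-Python sketch assumes every day of the quarter is present in the data (consistent with the 1461 rows counted earlier, which is four full years) and uses illustrative variable names.

```python
from datetime import date
import math

# Number of days in Q3 2015 (July 1 through September 30).
q3_days = (date(2015, 10, 1) - date(2015, 7, 1)).days

# Mean of the 0/1 rain flag, copied from the output above.
reported_mean = 0.021739130434782608

# Implied number of days labeled "rain", assuming one row per day.
rainy_days = round(reported_mean * q3_days)

# The implied count should reproduce the reported mean.
print(math.isclose(rainy_days / q3_days, reported_mean))

# Expressed as a percentage, rounded to two decimals.
percent = round(reported_mean * 100, 2)
print(q3_days, rainy_days, percent)
```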



    - For each year, find what percentage of days it rained (had non-zero precipitation).

In [44]:
# measure a rainy day by precipitation > 0
# for each year:
# make a column called year, grab that with F.year('date')
# pass a new select on that withColumn version of the dataframe
# inside of that select,
# pass a when ==> when precipitation is greater than zero, give a 1
# otherwise, give me a zero
# aliasing that column as 'did_rain'
# select only the new did_rain column and year
# group by the year column (created in the first withColumn)
# aggregate the mean of did_rain for each year
(
    weather.withColumn(
        'year', F.year('date')
        )
    .select(
        F.when(
            F.col('precipitation') > 0, 1
            ).otherwise(0).alias('did_rain'), 'year'
            )
    .groupby('year')
    .agg(F.mean('did_rain'))
    .show()
)

+----+-------------------+
|year|      avg(did_rain)|
+----+-------------------+
|2012|0.48360655737704916|
|2013|0.41643835616438357|
|2014|  0.410958904109589|
|2015|0.39452054794520547|
+----+-------------------+
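
The values above are proportions rather than percentages. The plain-Python sketch below converts them and, as a cross-check, recovers the implied number of rainy days per year, assuming one row per calendar day. The names are illustrative only.

```python
import calendar

# Proportions copied from the output above.
fractions = {
    2012: 0.48360655737704916,
    2013: 0.41643835616438357,
    2014: 0.410958904109589,
    2015: 0.39452054794520547,
}

# Days per year, accounting for the 2012 leap year.
days = {y: 366 if calendar.isleap(y) else 365 for y in fractions}

# Percentages rounded to one decimal place.
percents = {y: round(f * 100, 1) for y, f in fractions.items()}

# Implied count of days with non-zero precipitation.
rainy = {y: round(f * days[y]) for y, f in fractions.items()}

print(percents)
print(rainy)
print(sum(days.values()))  # should equal weather.count()
```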

