# Graded Lab : Exploring USA weather data

In this lab you will explore weather data from National Oceanic and Atmospheric Administration (NOAA).

>If your are curious, they are avalable here https://noaa-isd-pds.s3.amazonaws.com/index.html, or directly in FTP access here : ftp.ncdc.noaa.gov. You can find weather data from 1901 to today.

Raw data are stored in ISD (Integrated Surface Data) format. It's a strange format, with a mandatory section with positional fields, and a additional section with variable fields. For this lab, the dataset has been transform into json. But no other processing had been done. For exemple, missing temperature are not filtered and are coded with 999.9. The lab's dataset contains more than 100 years of weather data. Its total size is about 4Go once compressed (and around 40Go uncrompressed).

For instance, here is an exemple of a reccord

```js
{
   "weather_station":"010040",
   "latitude":78.933,
   "longitude":11.883,
   "elevation":42,
   "time":"1975-03-04T18:00:00+00:00",
   "air_temperature":{
      "value":-24.0,
      "quality":"1"
   },
   "dew_point":{
      "value":-27.0,
      "quality":"1"
   },
   "wind_speed":{
      "value":1.0,
      "quality":"1"
   },
   "wind_direction":{
      "value":"160",
      "quality":"1"
   },
   "sea_level_pressure":{
      "value":1002.1,
      "quality":""
   },
   "sky_ceiling":{
      "value":22000,
      "quality":"1"
   },
   "visibility_distance":{
      "value":50000,
      "quality":"1"
   },
   "liquid_precip":[
      {
         "hours":99,
         "depth":0.0
      }
   ],
  "sky_cover_condition":[
      {
         "base_height":50000,
         "cloud_type":"Cirrus and/or Cirrocumulus"
		 "coverage":8
      }
   ],
   "extreme_temperature":[
      {
         "hours":999,
         "code":"M",
         "temperature":{
            "value":-23.0,
            "quality":"1"
         }
      }
   ]
}
```

## Setup the lab

Is spark running ? You can start the lab once you get a message like `SparkSession available as 'spark'.`

In [1]:
#Spark session
spark

# Configuraion
spark._jsc.hadoopConfiguration().set("fs.s3.useRequesterPaysHeader","true")

VBox()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
0,application_1649920197808_0001,pyspark,idle,Link,Link,,✔


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

SparkSession available as 'spark'.


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

<pyspark.sql.session.SparkSession object at 0x7efef21d6750>

Usefull import for the lab. You can import more functions if you want

In [2]:
from pyspark.sql.window import Window
from pyspark.sql.functions import count, min, max, mean, exp, first, from_json, window, col, expr, year, month, explode, sum, row_number, avg, abs
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, ArrayType, TimestampType, BooleanType, LongType, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

Data schema

In [9]:
schema = StructType([
    StructField("air_temperature",StructType([
        StructField("quality",StringType(),True)
        ,StructField("value",DoubleType(),True)]),True)
    ,StructField("dew_point",StructType([
        StructField("quality",StringType(),True)
        ,StructField("value",DoubleType(),True)]),True)
    ,StructField("wind_speed",StructType([
        StructField("quality",StringType(),True)
        ,StructField("value",DoubleType(),True)]),True)
    ,StructField("elevation",LongType(),True)
    ,StructField("extreme_temperature",ArrayType(StructType([
        StructField("code",StringType(),True)
        ,StructField("hours",LongType(),True)
        ,StructField("temperature",StructType([
            StructField("quality",StringType(),True)
            ,StructField("value",DoubleType(),True)]),True)]),True),True)
    ,StructField("latitude",DoubleType(),True)
    ,StructField("liquid_precip",ArrayType(StructType([
        StructField("depth",StringType(),True)
        ,StructField("hours",LongType(),True)]),True),True)
    ,StructField("longitude",DoubleType(),True)
    ,StructField("sea_level_pressure",StructType([
        StructField("quality",StringType(),True),
        StructField("value",DoubleType(),True)]),True)
    ,StructField("sky_ceiling",StructType([
        StructField("quality",StringType(),True)
        ,StructField("value",LongType(),True)]),True)
    ,StructField("sky_cover_condition",ArrayType(StructType([
        StructField("base_height",LongType(),True)
        ,StructField("cloud_type",StringType(),True)
        ,StructField("coverage",LongType(),True)]),True),True)
    ,StructField("time",TimestampType(),True)
    ,StructField("visibility_distance",StructType([
        StructField("quality",StringType(),True)
        ,StructField("value",LongType(),True)]),True)
    ,StructField("weather_station",StringType(),True)
    ,StructField("wind_direction",StructType([
        StructField("quality",StringType(),True),
        StructField("value",StringType(),True)]),True)])

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

## Student part

1. Import the dataset and apply the schema.

In [10]:
meteo = spark.read.json("s3://spark-lab-input-data-ensai20212022/weather_data/", schema=schema)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

2. Print the dataframe schema. **What type of variable is the "quality" of the different records?**

In [30]:
meteo.printSchema()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

root
 |-- air_temperature: struct (nullable = true)
 |    |-- quality: string (nullable = true)
 |    |-- value: double (nullable = true)
 |-- dew_point: struct (nullable = true)
 |    |-- quality: string (nullable = true)
 |    |-- value: double (nullable = true)
 |-- elevation: long (nullable = true)
 |-- extreme_temperature: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- hours: long (nullable = true)
 |    |    |-- temperature: struct (nullable = true)
 |    |    |    |-- quality: string (nullable = true)
 |    |    |    |-- value: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- liquid_precip: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- depth: string (nullable = true)
 |    |    |-- hours: long (nullable = true)
 |-- longitude: double (nullable = true)
 |-- sea_level_pressure: struct (nullable = true)
 |    |-- quality: string (nullab

3. Print the first 5 rows. You can use `vertical=True`to get a prettier result 

In [11]:
meteo.show(5,vertical=True)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

-RECORD 0-----------------------------------
 air_temperature     | [1, -11.0]           
 dew_point           | [1, -16.0]           
 wind_speed          | [1, 13.4]            
 elevation           | 9                    
 extreme_temperature | null                 
 latitude            | 70.933               
 liquid_precip       | [["NaN", 6]]         
 longitude           | -8.667               
 sea_level_pressure  | [, 1000.5]           
 sky_ceiling         | [1, 22000]           
 sky_cover_condition | null                 
 time                | 1976-01-01 00:00:00  
 visibility_distance | [1, 30000]           
 weather_station     | 010010               
 wind_direction      | [1, 350]             
-RECORD 1-----------------------------------
 air_temperature     | [1, -11.0]           
 dew_point           | [1, -16.0]           
 wind_speed          | [1, 10.3]            
 elevation           | 9                    
 extreme_temperature | null                 
 latitude 

4. How many lines the dataset contains ? BEWARE: CAN take time

In [8]:
meteo.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

14506002

5. The dataset is way too big to work on it as is (it would take minutes to run each instruction). Create a new dataset called meteo_small containing a sample at a rate of a 1000th.
How would you tell Spark to perform the computation only once, and to actually compute and store the dataset meteo_small?



In [12]:
meteo_small = meteo\
  .sample(fraction=0.0001)\
  .cache()

meteo_small.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

1414

6. Filter missing air temperature or bad quality air temperature. How many observations do you get?"

- missing : temperature = 999,9
- Good quality : quality in ["0","1","4","5","9"]

In [13]:
meteo_small = meteo_small\
  .filter(meteo.air_temperature.quality.isin(["0","1","4","5","9"]))\
  .filter(meteo.air_temperature.value !=  999.9)\
  .cache()
    
meteo_small.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

448

7. Compute average air temperature

In [36]:
meteo_small.select(mean("air_temperature.value")).show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+--------------------------+
|avg(air_temperature.value)|
+--------------------------+
|         8.066714165083384|
+--------------------------+

8. Compute the **NUMBER OF RECORDS** AND **THE AVERAGE TEMPERATURE** by METEO station. Order you results by temperature. Repeat the exercice by year instead of by station.
    
    To get the year from the time column, you can use the year function like that `year(time)`

In [45]:
meteo_small.groupBy("weather_station").agg(count("*").alias("count"), mean("air_temperature.value").alias("mean_temperature")).orderBy(col("mean_temperature").desc()).show()

meteo_small.groupBy(year("time").alias("year")).agg(count("*").alias("count"), mean("air_temperature.value").alias("mean_temperature")).orderBy(col("mean_temperature").desc()).show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------------+-----+----------------+
|weather_station|count|mean_temperature|
+---------------+-----+----------------+
|         606560|    1|            44.6|
|         422650|    1|            41.1|
|         403400|    1|            40.0|
|         424350|    1|            39.4|
|         406700|    1|            38.9|
|         411960|    2|            38.9|
|         417100|    1|            37.8|
|         431370|    1|            37.2|
|         421220|    1|            37.2|
|         433250|    1|            36.7|
|         624050|    2|            35.7|
|         488526|    1|            35.7|
|         433290|    1|            35.6|
|         783134|    1|            35.6|
|         749094|    1|            35.2|
|         381490|    1|            35.0|
|         480330|    1|            34.6|
|         722577|    1|            34.6|
|         408310|    1|            34.4|
|         749243|    1|            34.1|
+---------------+-----+----------------+
only showing top

9. Compute the min, mean, max temperature and count for each **(year, station)** possible combinations. You output should be like:

| weather_station | year | temp_min | temp_max | teamp_mean | reccords_count |
| --------------- | ---- | -------- | -------- | ---------- | -------------- |
| 036830          | 1992 | a        | b        | c          | d              |
| 033730          | null | e        | f        | g          | h              |
| 010010          | 1992 | i        | j        | k          | l              |
| null            | 1992 | m        | n        | o          | p              |
| 061000          | 1991 | q        | r        | s          | t              |

A `null` value mean this dimension wasn't use for this row. For instance the row 2 gives the min, max, mean and reccord count for the station all data related to the weather station 033730 

In [39]:
meteo_small.cube("weather_station", year("time").alias("year"))\
.agg(\
     min("air_temperature.value").alias("temp_min")\
     , max("air_temperature.value").alias("temp_max")\
    ,mean("air_temperature.value").alias("temp_mean"), \
    count("*").alias("reccords_count")).show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------------+----+--------+--------+------------------+--------------+
|weather_station|year|temp_min|temp_max|         temp_mean|reccords_count|
+---------------+----+--------+--------+------------------+--------------+
|         036830|1992|     9.0|    13.8|              11.4|             2|
|         033730|null|    -3.5|    16.9| 8.440000000000001|            10|
|         066330|null|     6.6|     6.6|               6.6|             1|
|         031710|null|     0.0|    22.0| 8.751612903225807|            31|
|         067440|1992|     0.2|     0.2|               0.2|             1|
|         061930|null|    -0.2|    18.9|10.738461538461538|            13|
|         039740|null|    -1.0|    19.0|             8.296|            25|
|         030880|1992|     1.1|    15.6|7.3999999999999995|             3|
|         010010|1992|    -8.4|    -8.4|              -8.4|             1|
|         037260|null|     5.8|    20.8|12.694444444444446|            18|
|         038840|1992|   

9b. **Briefly explain why computing a maximum is well-suited to the "reduce" step of map-and-reduce algorithm. What does the "map" step correponds to?**

9c. Print the 5 stations with the more precipitation for each year  

In [85]:
meteo.withColumn("liquid_precip_exploded", explode("liquid_precip")).select("liquid_precip_exploded", "weather_station", year("time").alias("year"))\
.filter(col("liquid_precip_exploded.depth") != "NaN")\
.withColumn("depth_cleaned", col("liquid_precip_exploded.depth").cast('float'))\
.groupBy("year", "weather_station")\
.agg(sum("depth_cleaned").alias("tot_depth"))\
.withColumn("order",row_number().over(Window.partitionBy("year").orderBy(col("tot_depth").desc())))\
.filter(col("order") < 5).drop("order").show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+----+---------------+------------------+
|year|weather_station|         tot_depth|
+----+---------------+------------------+
|1975|         037910|3442.9000000059605|
|1975|         037760|3354.8000010997057|
|1975|         043800|2932.9000006243587|
|1975|         042720|2748.0000007003546|
|1973|         035951|            5060.0|
|1973|         033990|5021.4000000059605|
|1973|         060300|4483.0000007376075|
|1973|         067000|3658.9000002518296|
|1976|         061800| 20852.50000232458|
|1976|         066700|   14628.500002563|
|1976|         033230| 5057.200000524521|
|1976|         037760| 4676.600001722574|
|1977|         066700| 9944.700001128018|
|1977|         061800| 8644.700000949204|
|1977|         066800| 6176.600000299513|
|1977|         067620| 4575.100000195205|
|1974|         035951|           25420.0|
|1974|         066700|  9111.80000115186|
|1974|         020910| 8761.200000271201|
|1974|         035961|            8729.0|
+----+---------------+------------

10. Compute the partial humidty with this formula
$$RH = e^{\frac{17,625*T_{DP}}{2430.4+T_{DP}}}/e^{\frac{17,625*T_{air}}{2430.4+T_{air}}}$$

and the approximate partial humidty with this formula

$$RH_{approx} = 100 - 5 (T_{air} -T_{DP}) $$

With $T_{DP}$ the dew point temperature and $T_{air}$ the air temperature.
Compute average difference between the two values.
Cautious : some dew point temperature can be missing or of bad quality. The rules to filter those value are the same as air_temperature 

In [137]:
meteo_small\
  .filter(meteo.dew_point.quality.isin(["0","1","4","5","9"]))\
  .filter(meteo.dew_point.value !=  999.9)\
  .withColumn(
    "partial_humidity", 
    100*(exp((17.625*meteo_small.dew_point.value)/(243.04+meteo_small.dew_point.value))/exp((17.625*meteo_small.air_temperature.value)/(243.04+meteo_small.air_temperature.value))))\
  .withColumn(
    "partial_humidity_approx", 
    100 - 5*(meteo_small.air_temperature.value - meteo_small.dew_point.value))\
  .withColumn("diff_humidity", abs(col("partial_humidity") - col("partial_humidity_approx")) )\
  .select(min(col("diff_humidity")), max(col("diff_humidity")),  avg(col("diff_humidity")))\
.show()


    


VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+------------------+------------------+------------------+
|min(diff_humidity)|max(diff_humidity)|avg(diff_humidity)|
+------------------+------------------+------------------+
|               0.0| 132.0305717067268| 3.228647474053264|
+------------------+------------------+------------------+

11. Count how many max temperature were reccorded (extreme temperature > element > code = "M")

In [139]:
meteo.select("extreme_temperature")\
.withColumn("extreme_temperature_exploded", explode("extreme_temperature"))\
.select(col("extreme_temperature_exploded"))\
.filter(col("extreme_temperature_exploded.code")=="M")\
.count()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

732000

12. Compute for each weather station the max difference between an extreme maximal temperature reccord and the air temperature of the reccord

In [51]:
meteo_small.select("extreme_temperature", "air_temperature", "weather_station")\
.withColumn("extreme_temperature_exploded", explode("extreme_temperature"))\
.select(col("extreme_temperature_exploded"), "air_temperature", "weather_station")\
.filter(col("extreme_temperature_exploded.code")=="N", )\
.groupBy("weather_station")\
.agg(max(col("air_temperature.value")-col("extreme_temperature_exploded.temperature.value")))\
.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+---------------+-----------------------------------------------------------------------------+
|weather_station|max((air_temperature.value - extreme_temperature_exploded.temperature.value))|
+---------------+-----------------------------------------------------------------------------+
|         029630|                                                                          3.0|
|         064780|                                                                          0.0|
|         010140|                                                                          5.0|
|         043200|                                                                          3.0|
|         080830|                                                                          6.0|
|         029130|                                                                          4.0|
|         020570|                                                                          3.0|
|         012480|                       

13. For this question, you will do a linear regression to model the relationship between the air temperature and year, elevation, logitude, latitude, and the sea level pressure.
  - Filter to keep only the good value of temperature and pressure (9999.9 = missing pressure)
  - Create a vector assembler
  - Vectorize
  - Do the regression
  - Explore the results

In [29]:
from pyspark.ml.feature import VectorAssembler
vectorizer = VectorAssembler(
    inputCols     = ["year", "latitude", "longitude", "elevation", "sea_level_pressure"], # the columns we want to put in the features column
    outputCol     = "features",                    # the name of the column ("features")
    handleInvalid = 'skip'                         # skip rows with missing / invalid values
)
meteo_small_agg=meteo_small\
    .filter(meteo_small.sea_level_pressure.value!=9999.9)\
    .withColumn("year", year("time"))\
    .select("year", "latitude", "longitude", "elevation", col("sea_level_pressure.value").alias("sea_level_pressure"), col("air_temperature.value").alias("temperature"))

meteo_small_agg.show()

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+----+--------+---------+---------+------------------+-----------+
|year|latitude|longitude|elevation|sea_level_pressure|temperature|
+----+--------+---------+---------+------------------+-----------+
|1976|  70.933|   -8.667|        9|             976.8|        0.0|
|1976|  70.333|   21.467|       10|            1021.4|       -2.0|
|1976|    71.1|     24.0|       13|            1011.5|       -6.0|
|1976|  70.067|   25.117|       34|             986.3|        1.0|
|1976|  67.883|    13.05|       31|            1004.8|        2.0|
|1976|  66.267|   13.983|       33|             983.2|        1.0|
|1976|  68.633|   14.467|       11|            1021.9|       15.0|
|1976|    60.2|   11.083|      204|            1015.6|        1.0|
|1976|  65.833|    24.15|        7|            1003.4|        3.0|
|1976|    64.5|   14.133|      318|             983.9|       -1.0|
|1976|    63.6|   20.767|        6|            1015.7|       11.0|
|1976|    63.6|   20.767|        6|            1024.7|       -

In [30]:
meteo_small_vec = vectorizer.transform(meteo_small_agg).select("temperature", "features")
meteo_small_vec.show(5) 

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

+-----------+--------------------+
|temperature|            features|
+-----------+--------------------+
|        0.0|[1976.0,70.933,-8...|
|       -2.0|[1976.0,70.333,21...|
|       -6.0|[1976.0,71.1,24.0...|
|        1.0|[1976.0,70.067,25...|
|        2.0|[1976.0,67.883,13...|
+-----------+--------------------+
only showing top 5 rows

In [31]:
regressor = LinearRegression(featuresCol="features", labelCol="temperature")
model     = regressor.fit(meteo_small_vec)

print(model.coefficients)
print(model.intercept)

listings_pred = model.transform(meteo_small_vec)
listings_pred.show() # model and predictions from the regression

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

[-0.1346702770812204,-0.5119035991474302,0.1348505415414194,0.003921682892383677,-0.0008263033996835437]
301.1073641053976
+-----------+--------------------+------------------+
|temperature|            features|        prediction|
+-----------+--------------------+------------------+
|        0.0|[1976.0,70.933,-8...| -3.25254906373749|
|       -2.0|[1976.0,70.333,21...|1.0852478658265454|
|       -6.0|[1976.0,71.1,24.0...|1.0541396793389595|
|        1.0|[1976.0,70.067,25...| 1.836742338572094|
|        2.0|[1976.0,67.883,13...|1.3004466527584668|
|        1.0|[1976.0,66.267,13...|2.2791899434568563|
|       15.0|[1976.0,68.633,14...|1.0150387247797994|
|        1.0|[1976.0,60.2,11.0...| 5.637678053461968|
|        3.0|[1976.0,65.833,24...|3.7537264774628056|
|       -1.0|[1976.0,64.5,14.1...| 4.321052396331027|
|       11.0|[1976.0,63.6,20.7...|4.4265226176159445|
|       -5.0|[1976.0,63.6,20.7...| 4.419085887018753|
|        5.0|[1976.0,58.75,17....| 6.633951524987538|
|       15.0|

In [51]:
# Summarize the model over the training set and print out some metrics
trainingSummary = model.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

VBox()

FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

numIterations: 1
objectiveHistory: [0.0]
+-------------------+
|          residuals|
+-------------------+
| 15.662577714614216|
| -2.152160545167817|
| 1.3001159114415888|
|  10.97982916293978|
| -4.238685349999116|
| -6.281033164943909|
|  7.727358081670303|
|  4.195742441205965|
| 11.999879599651777|
| 1.5719335270613932|
| 3.1058200988040596|
|  11.72182735639375|
|-1.1134484757643435|
|  6.856031124912192|
| -3.213379778206712|
|  7.303058017421179|
|-10.357344399242132|
| -10.64355592594158|
|  -9.84205389814286|
|  8.157028210144006|
+-------------------+
only showing top 20 rows

RMSE: 9.923333
r2: 0.241735

14. **Does the regression fall in the category of "embarassingly parallel problems"? (Explain) Is it a "inherently sequential problem" ? (Explain) Give an exemple for each.**

15. Now, consider your data are not at rest, but they are stream **from** an s3 bucket. Compute the average temperature for each station in a streaming context.

16. Compute the maximum and the first quartile of recorded temperature in station XXXXXXX with `summary()`. Is the computation exact? Why is it interestind to use summary instead of running two different instructions?

17. Compute the sum of XXXXXX with Spark SQL (with method summary()), then with Spark ML (with classes Vectorizer and Summarizer) and then by creating a local version of the column and performing the computation locally (with Pandas). Is there a clear fastest solution? (Try to count separately the overhead of creating the vectorized copy of the dataset for the second solution and of downloading the column for the third. Beware: because Spark is so lazy, you have to pay attention to how exactly you measure time!)

18. Repeat the regression for every year. You will use a `Pipeline` object.

19. What is the difference between the `fit()` and `transform()` methods? Is each of them called eagerly or lazily?

20. Down-sample your data set to n=100000. Save it on S3, then download it on your computer. Run the regression locally on your computer in R. In your opinion, is the extra precision (in term of R2) is worth the extra computation time?

21. Count how many different stations there are with `count_distinct()`, while measuring execution time. What is the interest of `approx_count_distinct()`? (You will demonstrate this interest by timing two other examples.)