# Lecture 4: Summary Statistics and Boulder Weather Data
***

In this notebook we'll: 
- Compute summary statistics on Boulder weather data 
- Figure out how summary statistics like mean and standard deviation change under transformations of the data
 

The data we'll explore in this notebook concerns temperatures and other weather observations in Boulder County over the month of July 2017.  The data was obtained from the National Oceanic and Atmospheric Administration's [Climate.gov](https://www.climate.gov/) website.  You can find and download loads of climate-related data from NOAA [here](https://www.climate.gov/maps-data/datasets).   

The data is stored in a .csv file called clean_boulder_weather.csv.  

In [2]:
# First we read in the data
# 1) Code Here
weather_data = read.csv("clean_boulder_weather.csv")
weather_data

STATION,NAME,DATE,PRCP,TMAX,TMIN
USW00094075,"BOULDER 14 W, CO US",2017-07-01,0.00,68,31
USW00094075,"BOULDER 14 W, CO US",2017-07-02,0.00,73,35
USW00094075,"BOULDER 14 W, CO US",2017-07-03,0.00,68,46
USW00094075,"BOULDER 14 W, CO US",2017-07-04,0.05,68,43
USW00094075,"BOULDER 14 W, CO US",2017-07-05,0.01,73,40
USW00094075,"BOULDER 14 W, CO US",2017-07-06,0.00,76,48
USW00094075,"BOULDER 14 W, CO US",2017-07-07,0.02,74,43
USW00094075,"BOULDER 14 W, CO US",2017-07-08,0.00,65,44
USW00094075,"BOULDER 14 W, CO US",2017-07-09,0.01,73,39
USW00094075,"BOULDER 14 W, CO US",2017-07-10,0.01,75,44


In [3]:
# Next we examine the first few rows of the data file using the head() function.
# 2) Code Here
head(weather_data,10)

STATION,NAME,DATE,PRCP,TMAX,TMIN
USW00094075,"BOULDER 14 W, CO US",2017-07-01,0.0,68,31
USW00094075,"BOULDER 14 W, CO US",2017-07-02,0.0,73,35
USW00094075,"BOULDER 14 W, CO US",2017-07-03,0.0,68,46
USW00094075,"BOULDER 14 W, CO US",2017-07-04,0.05,68,43
USW00094075,"BOULDER 14 W, CO US",2017-07-05,0.01,73,40
USW00094075,"BOULDER 14 W, CO US",2017-07-06,0.0,76,48
USW00094075,"BOULDER 14 W, CO US",2017-07-07,0.02,74,43
USW00094075,"BOULDER 14 W, CO US",2017-07-08,0.0,65,44
USW00094075,"BOULDER 14 W, CO US",2017-07-09,0.01,73,39
USW00094075,"BOULDER 14 W, CO US",2017-07-10,0.01,75,44


Investigate what the very useful `summary()` function does.

In [4]:
# 3) Code Here
summary(weather_data)

        STATION                          NAME            DATE    
 USC00050848:30   BOULDER 14 W, CO US      :31   2017-07-02:  7  
 USC00053629:31   BOULDER, CO US           :30   2017-07-03:  7  
 USC00055984:30   GROSS RESERVOIR, CO US   :31   2017-07-04:  7  
 USC00056816:30   NIWOT, CO US             :31   2017-07-05:  7  
 USR0000CBDR:31   NORTHGLENN, CO US        :30   2017-07-06:  7  
 USS0005J42S:31   RALSTON RESERVOIR, CO US :30   2017-07-07:  7  
 USW00094075:31   SUGARLOAF COLORADO, CO US:31   (Other)   :172  
      PRCP              TMAX             TMIN      
 Min.   :0.00000   Min.   : 54.00   Min.   :31.00  
 1st Qu.:0.00000   1st Qu.: 74.00   1st Qu.:47.00  
 Median :0.00000   Median : 83.00   Median :55.00  
 Mean   :0.04346   Mean   : 81.48   Mean   :53.59  
 3rd Qu.:0.02750   3rd Qu.: 89.00   3rd Qu.:60.75  
 Max.   :0.69000   Max.   :101.00   Max.   :68.00  
 NA's   :32                                        

From this you should see that each row in the DataFrame refers to a particular weather station / date combination.  The columns of the DataFrame are as follows: 

- **STATION**: The unique identification code for each weather station 
- **NAME**: The location / name of the weather station 
- **DATE**: The date of the observation 
- **PRCP**: The precipitation (in inches)
- **TMAX**: The daily maximum temperature (in Fahrenheit)
- **TMIN**: The daily minimum temperature (in Fahrenheit)

From the printed DataFrame above you can see that we actually have data from multiple weather stations.  To see how many, we can pass the **NAME** column (or the **STATION** column) into R's unique function.

In [10]:
# 4) Code Here
## unique(weather_data$NAME)
unique(weather_data["NAME"])

Unnamed: 0,NAME
1,"BOULDER 14 W, CO US"
32,"GROSS RESERVOIR, CO US"
63,"SUGARLOAF COLORADO, CO US"
94,"NIWOT, CO US"
125,"BOULDER, CO US"
155,"RALSTON RESERVOIR, CO US"
185,"NORTHGLENN, CO US"


It looks like we have data from seven different weather stations.  For consistency, let's reduce the data to just the reports from the weather station in Niwot.  
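
As a minimal sketch of the counting step, the snippet below uses a made-up miniature data frame (the `toy` object and its station names are hypothetical, not taken from the weather file) to show how `unique()` and `length()` combine to count distinct stations.

```r
# Hypothetical miniature data frame; the names and values are invented for illustration
toy <- data.frame(
  NAME = c("STATION A", "STATION A", "STATION B", "STATION C", "STATION B"),
  TMAX = c(70, 72, 65, 80, 66),
  stringsAsFactors = FALSE
)

# Distinct station names, in order of first appearance
station_names <- unique(toy$NAME)

# Number of distinct stations
n_stations <- length(station_names)

print(station_names)
print(n_stations)

# Number of rows recorded for each station
print(table(toy$NAME))
```

The same two calls should work on `weather_data$NAME`, assuming the column was read in as shown earlier.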

### Exercise 1
***
Extract the rows of the DataFrame concerned with the Niwot weather station.  Store this data in a new DataFrame called dfNiwot.

In [13]:
# 5) Code Here
dfNiwot = weather_data[weather_data$NAME =='NIWOT, CO US',]

In [14]:
dfNiwot

Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN
94,USS0005J42S,"NIWOT, CO US",2017-07-01,0.0,69,32
95,USS0005J42S,"NIWOT, CO US",2017-07-02,0.0,73,37
96,USS0005J42S,"NIWOT, CO US",2017-07-03,0.0,68,47
97,USS0005J42S,"NIWOT, CO US",2017-07-04,0.1,70,41
98,USS0005J42S,"NIWOT, CO US",2017-07-05,0.0,74,40
99,USS0005J42S,"NIWOT, CO US",2017-07-06,0.0,78,47
100,USS0005J42S,"NIWOT, CO US",2017-07-07,0.0,74,44
101,USS0005J42S,"NIWOT, CO US",2017-07-08,0.0,67,43
102,USS0005J42S,"NIWOT, CO US",2017-07-09,0.0,75,41
103,USS0005J42S,"NIWOT, CO US",2017-07-10,0.0,75,44


### Exercise 2  
***

R has canned functions that compute each of the summary statistics discussed in lecture.  We'll use the mean( ) function as an example.  

Using the mean() function, find the sample mean of the maximum daily temperature in Niwot. 

In [16]:
# 6) Code Here
mean_tmax = mean(dfNiwot$TMAX)
mean_tmax

Let's see what happens if we call mean( ) on the entire DataFrame. 

In [17]:
# 7) Code Here
mean_dfNiwot = mean(dfNiwot)
mean_dfNiwot

"argument is not numeric or logical: returning NA"

In [19]:
# And the fix...
library(dplyr)
dfNiwot %>% 
summarize_if(is.numeric, mean, na.rm=T)

PRCP,TMAX,TMIN
0.06129032,69.83871,43.54839
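
If you would rather not load dplyr, a base R alternative is sketched below. It runs on a small made-up data frame (`toy` and its values are hypothetical): it first finds the numeric columns, then applies `mean()` to each of them.

```r
# Hypothetical data frame with one text column and two numeric columns
toy <- data.frame(
  NAME = c("A", "A", "B"),
  PRCP = c(0.0, NA, 0.3),
  TMAX = c(70, 74, 66),
  stringsAsFactors = FALSE
)

# Logical vector marking which columns are numeric
numeric_cols <- sapply(toy, is.numeric)

# Mean of each numeric column, ignoring missing values
col_means <- sapply(toy[numeric_cols], mean, na.rm = TRUE)
print(col_means)
```

Calling `mean()` directly on a data frame fails because a data frame is a list of columns rather than a single numeric vector, which is why the column-by-column approach is needed.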


The functions for the other summary statistics are as follows: 

\begin{array}{l|l}
\textrm{Function} & \textrm{Statistic} \\
\hline
\textrm{var()} & \textrm{variance} \\
\textrm{sd()} & \textrm{standard deviation} \\
\textrm{min()} & \textrm{minimum value} \\
\textrm{max()} & \textrm{maximum value} \\
\textrm{median()} & \textrm{median value} \\
\textrm{quantile(data, probs=c(...))} & \textrm{quantile, where data is the desired input and probs specifies the desired quantile(s) as decimals between 0 and 1} \\
\end{array}

Your job is to use these functions to compute the 5-number summary for the maximum daily temperature for the Niwot weather station. 

In [22]:
# 8) Code Here
minimum = min(dfNiwot$TMAX)
Q1 = quantile(dfNiwot$TMAX,.25)
Q2 = median(dfNiwot$TMAX)
Q3 = quantile(dfNiwot$TMAX,.75)
maximum = max(dfNiwot$TMAX)

cat("Five Number summary : ", minimum, Q1, Q2, Q3, maximum)

Five Number summary :  54 66.5 70 74 80
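
Since `probs` accepts a vector, all five numbers can also be requested in a single `quantile()` call. Here is a small sketch on a made-up vector (the values in `x` are hypothetical, not the Niwot data):

```r
# Hypothetical temperatures, invented for illustration
x <- c(54, 61, 66, 67, 70, 72, 74, 78, 80)

# Minimum, quartiles, and maximum in one call
five <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five)
```

The result is a named vector whose names are the percentages (`0%`, `25%`, and so on), which makes the output easier to read than five separate variables.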

### Exercise 3 
***
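It turns out that R has a nice function called fivenum( ) that computes the five-number summary (minimum, lower hinge, median, upper hinge, maximum) in one call.  The hinges are Tukey's version of the quartiles, so for some sample sizes they can differ slightly from the values returned by quantile( ).  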

Run the fivenum( ) function on the **TMAX** column of your DataFrame, and check that the results agree with your computations from Exercise 2. 

In [23]:
# 9) Code Here
fivenum(dfNiwot$TMAX)
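
The two approaches do not always agree exactly. The sketch below uses a tiny made-up vector (hypothetical values chosen to expose the difference) to compare `fivenum()` with `quantile()` under its default settings.

```r
# Hypothetical four-point data set, chosen so the two methods disagree
x <- c(1, 2, 3, 4)

# Tukey's five-number summary (uses hinges)
tukey <- fivenum(x)

# Default quantile() interpolation at the same five positions
quart <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1))

print(tukey)   # 1.0 1.5 2.5 3.5 4.0
print(quart)   # 1.00 1.75 2.50 3.25 4.00
```

The minimum, median, and maximum match, but the lower and upper values differ because the two functions use different interpolation rules.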

### Exercise 4 
***
In this exercise we'll explore how the mean and the standard deviation change when we perform basic transformations on the data.  In particular, we're interested in what happens if we 

1. Add or subtract some value from every entry in the data set 
1. Multiply every entry in the data set by some value 

We know from above that the mean and standard deviation of the Niwot **TMAX** values are 69.83871 and 5.621962.  Experiment by adding and multiplying nice integer values with the **TMAX** column and then recomputing the statistics.  From your observations, can you guess how the mean and standard deviation change under these transformations? 


In [37]:
# 10) Code Here
# Checking the mean after modifying the data
mean(dfNiwot$TMAX)
mean(dfNiwot$TMAX * 2)   # 2 is an arbitrary example factor; try other integers too

# Checking the standard deviation after modifying the data
sd(dfNiwot$TMAX)
sd(dfNiwot$TMAX - 3)
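
For a more systematic experiment that does not depend on the weather file, here is a sketch on a made-up vector. The data `x`, the shift `a`, and the factor `b` are all arbitrary assumptions; the code tabulates the mean and standard deviation before and after each transformation so the pattern is easy to spot.

```r
# Hypothetical toy data and arbitrary constants, invented for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
a <- 3   # arbitrary shift
b <- 2   # arbitrary scale factor

shifted <- x + a   # add a constant to every entry
scaled  <- b * x   # multiply every entry by a constant

# Side-by-side comparison of the two statistics
results <- data.frame(
  data = c("x", "x + a", "b * x"),
  mean = c(mean(x), mean(shifted), mean(scaled)),
  sd   = c(sd(x), sd(shifted), sd(scaled))
)
print(results)
```

Try changing `a` and `b` (including negative values of `b`) and rerunning to see which statistics move and which do not.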

See if you can prove that your guess works in general mathematically using the formulas for the two statistics: 

$$
\bar{x} = \frac{1}{n} \displaystyle\sum_{k=1}^n x_k \quad \quad \textrm{and} \quad \quad s = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( x_k - \bar{x}\right)^2} 
$$

$$\bar{y} = \frac{1}{n} \sum_{k=1}^n (x_k + a) $$
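
One way the argument might be completed is sketched here. The names $y_k = x_k + a$ and $z_k = b\,x_k$ are notation assumed for this sketch, with $a$ and $b$ arbitrary constants, and $s_x$, $s_y$, $s_z$ denote the standard deviations of the corresponding data sets. For the shift, split the sum:

$$
\bar{y} = \frac{1}{n} \sum_{k=1}^n (x_k + a) = \frac{1}{n} \sum_{k=1}^n x_k + \frac{1}{n} \cdot n a = \bar{x} + a
$$

The shift cancels inside each deviation, so the standard deviation is unchanged:

$$
s_y = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( (x_k + a) - (\bar{x} + a) \right)^2} = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( x_k - \bar{x}\right)^2} = s_x
$$

For the scaling, the constant factors out of both sums:

$$
\bar{z} = \frac{1}{n} \sum_{k=1}^n b\,x_k = b\,\bar{x} \quad \quad \textrm{and} \quad \quad s_z = \sqrt{\frac{1}{n-1} \sum_{k=1}^n \left( b\,x_k - b\,\bar{x}\right)^2} = \sqrt{b^2}\, s_x = |b|\, s_x
$$

Note the absolute value: multiplying by a negative constant flips the sign of the mean but cannot make the standard deviation negative.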


### Exercise 5 
***
OK, let's apply a common transformation to the **TMAX** and **TMIN** columns by converting the temperatures from Fahrenheit to Celsius.  Remember that the transformation is given by 

$$
\textrm{CELSIUS} = \frac{5}{9} (\textrm{FAHRENHEIT}-32) 
$$

First, use the Fahrenheit data in columns **TMAX** and **TMIN** to create Celsius columns in the Niwot DataFrame called **TMAX_C** and **TMIN_C**.

In [42]:
# 11) Code Here
# This is the code for adding new columns
dfNiwot$TMAX_C = (5/9) * (dfNiwot$TMAX - 32)
dfNiwot$TMIN_C = (5/9) * (dfNiwot$TMIN - 32)
dfNiwot

Unnamed: 0,STATION,NAME,DATE,PRCP,TMAX,TMIN,TMAX_C,TMIN_C
94,USS0005J42S,"NIWOT, CO US",2017-07-01,0.0,69,32,20.55556,0.0
95,USS0005J42S,"NIWOT, CO US",2017-07-02,0.0,73,37,22.77778,2.777778
96,USS0005J42S,"NIWOT, CO US",2017-07-03,0.0,68,47,20.0,8.333333
97,USS0005J42S,"NIWOT, CO US",2017-07-04,0.1,70,41,21.11111,5.0
98,USS0005J42S,"NIWOT, CO US",2017-07-05,0.0,74,40,23.33333,4.444444
99,USS0005J42S,"NIWOT, CO US",2017-07-06,0.0,78,47,25.55556,8.333333
100,USS0005J42S,"NIWOT, CO US",2017-07-07,0.0,74,44,23.33333,6.666667
101,USS0005J42S,"NIWOT, CO US",2017-07-08,0.0,67,43,19.44444,6.111111
102,USS0005J42S,"NIWOT, CO US",2017-07-09,0.0,75,41,23.88889,5.0
103,USS0005J42S,"NIWOT, CO US",2017-07-10,0.0,75,44,23.88889,6.666667


Based on what we proved in **Exercise 4**, what do you expect the mean and the standard deviation of the daily maximum temperature to be in Celsius? 

In [48]:
# Apply the conversion formula to the Fahrenheit mean, then compare with the mean of the converted column
cat("Guess for the mean TMAX in Celsius: ", (5/9)*(mean(dfNiwot$TMAX)-32), "\n")
cat("Computed mean of TMAX_C: ", mean(dfNiwot$TMAX_C), "\n")

Guess for the mean TMAX in Celsius:  21.02151
Computed mean of TMAX_C:  21.02151

Once you've made your guess, see if you're right by applying the mean( ) and sd( ) functions to **TMAX_C** and **TMIN_C**. 
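
The same kind of check can be done for the standard deviation. The sketch below uses made-up Fahrenheit values (the vector `temps_f` is hypothetical) and compares the predictions from the transformation rules with the statistics computed directly on the converted data.

```r
# Hypothetical Fahrenheit temperatures, invented for illustration
temps_f <- c(68, 73, 68, 70, 74, 78)

# Convert every value to Celsius
temps_c <- (5/9) * (temps_f - 32)

# Predictions: the mean follows the full formula,
# while the standard deviation only picks up the 5/9 factor
predicted_mean <- (5/9) * (mean(temps_f) - 32)
predicted_sd   <- (5/9) * sd(temps_f)

cat("Mean: predicted", predicted_mean, "computed", mean(temps_c), "\n")
cat("SD:   predicted", predicted_sd, "computed", sd(temps_c), "\n")
```

Subtracting 32 shifts every value by the same amount, so it has no effect on the spread; only the multiplication by 5/9 changes the standard deviation.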

### Exercise 6 
***

Compute the daily temperature range (max minus min) for each row in the Niwot DataFrame and store it in a column called **TDIFF**.  Then answer these questions.  

- What is the mean temperature difference over the month of July? 
- What is the difference between the means of the max and min daily temperatures? 
- Do you see a relationship between these two quantities?  If so, can you prove that it's always the case for mean difference and difference of means? 

In [13]:
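# 12) Code Here
# Daily temperature range: maximum minus minimum, row by row
dfNiwot$TDIFF = dfNiwot$TMAX - dfNiwot$TMIN

To calculate the mean temperature difference, we need to get the mean of the **TDIFF** column we just created:

In [14]:
# 13) Code Here
mean_diff = mean(dfNiwot$TDIFF)
cat("The mean of the diffs is ", mean_diff)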

The difference between the means of the max and min temperatures is:

In [52]:
diff_means = mean(dfNiwot$TMAX)-mean(dfNiwot$TMIN)
cat("The diff of the means is ", diff_means)

The diff of the means is  26.29032

It looks like they're the same! Can we show mathematically that this is true in general? 
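
Here is a sketch of one possible argument. The symbols are assumptions introduced for this sketch: $M_k$ and $m_k$ are the maximum and minimum temperatures on day $k$, $d_k = M_k - m_k$ is the daily range, and every day has both values recorded. Splitting the sum gives

$$
\bar{d} = \frac{1}{n} \sum_{k=1}^n (M_k - m_k) = \frac{1}{n} \sum_{k=1}^n M_k - \frac{1}{n} \sum_{k=1}^n m_k = \bar{M} - \bar{m}
$$

so the mean of the differences always equals the difference of the means, as long as both means are taken over the same $n$ days.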