#1) Load the tidyverse and datasets packages. Explore the documentation to find a dataset.

In [None]:
# note: if you don't already have the tidyverse installed, run the following line first
#install.packages("tidyverse")
# the datasets package ships with R, so it does not need to be installed

#load packages
library(tidyverse)
library(datasets)

#explore documentation
?datasets()
library(help = "datasets")   #lists every dataset in the package with a short description

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



#2) Explore the dataset (I will use 'trees') - print the dataset, access and print a column, access and print a row, edit a value. Rename a column.

In [None]:
#print whole dataset
trees           #could also run 'print(trees)'

Girth,Height,Volume
<dbl>,<dbl>,<dbl>
8.3,70,10.3
8.6,65,10.3
8.8,63,10.2
10.5,72,16.4
10.7,81,18.8
10.8,83,19.7
11.0,66,15.6
11.0,75,18.2
11.1,80,22.6
11.2,75,19.9


In [None]:
#access and print a column - there are several ways to do this, but they do not all return the same type
trees$Girth     #returns a vector containing the entries of the Girth column
trees[1]        #returns a data frame containing the Girth column only
trees[[1]]      #returns a vector containing the entries of the Girth column

Girth
<dbl>
8.3
8.6
8.8
10.5
10.7
10.8
11.0
11.0
11.1
11.2
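
The three access forms above differ in what they return, and checking the class of each result makes that visible. This is a minimal sketch using only base R and the built-in `trees` data frame; the names `by_dollar`, `by_single`, and `by_double` are illustrative.

```r
# Compare what each column-access form returns for the built-in trees data
by_dollar <- trees$Girth    # vector
by_single <- trees[1]       # one-column data frame
by_double <- trees[[1]]     # vector

class(by_dollar)
class(by_single)
class(by_double)

# The two vector forms hold exactly the same values
identical(by_dollar, by_double)
```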


In [None]:
#access and print a row - there is one main way to do this
trees[4,]       #the number before the comma indicates which row to access. 
                #the blank after the comma indicates that all columns should be reported

,Girth,Height,Volume
,<dbl>,<dbl>,<dbl>
4,10.5,72,16.4


In [None]:
#edit a value
trees_edited <- trees      #note I saved this into a new variable so I still have the original copy
                           #in case I need to go back
trees_edited[4,2] <- 73    #changes the value stored in row 4 column 2 from 72 to 73

trees[4,2]                 #the original is unchanged (72)
trees_edited[4,2]          #the edited copy holds the new value (73)

In [None]:
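#rename a column - the dplyr package (loaded as part of the tidyverse) provides rename() for this
?rename()
rename(trees_edited,Girth_inches = Girth)   #syntax is new_name = old_name; this returns a renamed copy,
                                            #so assign the result if you want to keep it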

Girth_inches,Height,Volume
<dbl>,<dbl>,<dbl>
8.3,70,10.3
8.6,65,10.3
8.8,63,10.2
10.5,73,16.4
10.7,81,18.8
10.8,83,19.7
11.0,66,15.6
11.0,75,18.2
11.1,80,22.6
11.2,75,19.9


#3) Create a tibble with date and price columns. Add 730 days (two years from 1/1/2018 - 12/31/2019). Simulate prices with normal distribution (mean 100, stdev 10) for 2018, and uniform distribution (100 to 200) for 2019.

In [None]:
?as.Date()
?seq()

In [None]:
#Create the dataset containing dates and prices. The seq() function can handle dates. 
#I initialized the prices to all be 1 and then will go back and change them for the proper time frames
#but there are many other ways to do this.

prices <- tibble(
  date = seq(from = as.Date("2018/1/1"),to = as.Date("2019/12/31"),by = "day"),
  price = 1
)

In [None]:
?rnorm()
?runif()

In [None]:
#go back and simulate the prices by changing the appropriate entries
prices[1:365,2] <- rnorm(365,100,10)      #change rows 1 through 365 in column 2 to be samples from rnorm()
prices[366:730,2] <- runif(365,100,200)   #change rows 366 through 730 in column 2 to be samples from runif()

#print to check it out
prices

date,price
<date>,<dbl>
2018-01-01,99.91366
2018-01-02,110.50391
2018-01-03,91.57210
2018-01-04,100.88456
2018-01-05,107.72822
2018-01-06,104.67736
2018-01-07,85.34799
2018-01-08,99.87349
2018-01-09,94.33546
2018-01-10,110.01404
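
As the comments above note, there are other ways to build this table. One alternative, sketched here, reads the year from the date itself instead of relying on hard-coded row numbers. It uses a base R data frame so that it runs without the tidyverse; `prices_alt`, `dates`, and `is_2018` are illustrative names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed so repeated runs give the same draws

# Every day from 1 Jan 2018 through 31 Dec 2019
dates <- seq(from = as.Date("2018-01-01"), to = as.Date("2019-12-31"), by = "day")

# TRUE for dates in 2018, FALSE for dates in 2019
is_2018 <- format(dates, "%Y") == "2018"

# Fill each year's prices from its own distribution
price <- numeric(length(dates))
price[is_2018]  <- rnorm(sum(is_2018), mean = 100, sd = 10)
price[!is_2018] <- runif(sum(!is_2018), min = 100, max = 200)

prices_alt <- data.frame(date = dates, price = price)
head(prices_alt)
```

Selecting rows by year keeps the code correct if the date range changes, for example when one of the years is a leap year.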


#4) If these were real price fluctuations (i.e. gathered data not simulated data) how would you analyze them to determine if there was a significant change in pricing between 2018 and 2019?

I would calculate the average price in 2018 and the average price in 2019 and compare them. If the difference is sizeable, I might run a t-test to test for significance.

#5) Calculate average price in 2018, 2019, compare. Is this evidence of a price increase? What test could you run to be sure?

In [None]:
avg_price_2018 <- mean(prices$price[1:365])
avg_price_2019 <- mean(prices$price[366:730])

print(c('2018 avg price:',avg_price_2018))
print(c('2019 avg price:',avg_price_2019))
print(c('difference in average price:',avg_price_2019 - avg_price_2018))

[1] "2018 avg price:"  "100.103712354207"
[1] "2019 avg price:"  "150.063392110934"
[1] "difference in average price:" "49.9596797567276"            


Yes, the average price rose by about 50 between 2018 and 2019, which is evidence of a price increase. You could run a two-sample t-test to check whether the increase is statistically significant.
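
The t-test mentioned above can be sketched in base R with `t.test()`. Because the notebook's draws were made without a seed, this block re-simulates the two years; the seed and the names `price_2018`, `price_2019`, and `result` are illustrative. By default `t.test()` runs Welch's version of the test, which does not assume equal variances in the two groups.

```r
set.seed(1)  # arbitrary seed, not part of the original notebook

# Re-simulate the two years so the block runs on its own
price_2018 <- rnorm(365, mean = 100, sd = 10)
price_2019 <- runif(365, min = 100, max = 200)

# Welch two-sample t-test comparing the 2019 mean with the 2018 mean
result <- t.test(price_2019, price_2018)

result$estimate   # the two sample means
result$conf.int   # confidence interval for the difference in means
result$p.value    # a small value suggests the difference is not due to chance
```

With real daily prices, consecutive days are often correlated, which the independence assumption of the t-test does not account for, so the p-value should be read with that caveat.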

#6) Create a new column that lists the mean price difference for each date. Does this increase or decrease from 2018 to 2019? You could also look at the variance in price.

In [None]:
#first calculate the overall average price
avg_price_all <- mean(prices$price)

#then add a new column to the tibble containing the difference between the price on each day and the average
prices <- add_column(prices,mean_difference = prices$price - avg_price_all)

In [None]:
#mean squared deviation from the overall average price (not from each year's own mean)
variance_2018 <- mean((prices$mean_difference[1:365])^2)
variance_2019 <- mean((prices$mean_difference[366:730])^2)

variance_2018
variance_2019

The prices in 2018 were not only below average but also less variable than the prices in 2019. 
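
The quantities above are squared deviations from the overall average, so they mix the shift in the yearly mean with the spread inside each year. To look at the spread alone, one option is the within-year sample variance from base R's `var()`. This sketch re-simulates the prices with an arbitrary seed, since the notebook's own draws are not reproducible; `var_2018` and `var_2019` are illustrative names.

```r
set.seed(1)  # arbitrary seed, not part of the original notebook

# Re-simulate the two years so the block runs on its own
price_2018 <- rnorm(365, mean = 100, sd = 10)
price_2019 <- runif(365, min = 100, max = 200)

# Sample variance around each year's own mean
var_2018 <- var(price_2018)   # theoretical value: 10^2 = 100
var_2019 <- var(price_2019)   # theoretical value: (200 - 100)^2 / 12, about 833

c(var_2018 = var_2018, var_2019 = var_2019)
```

Measured this way, the 2019 prices are still clearly more variable than the 2018 prices, which matches the conclusion above.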

---

