# Title (Group 78 Project Proposal) *(we'll edit in later!)*

## Introduction

### Background Information

Vancouver is one of the rainiest cities in Canada, averaging over 1 metre of rainfall per year (*[Government of Canada 1981-2010 Climate Normals & Averages](https://climate.weather.gc.ca/climate_normals/index_e.html)*). Daily weather forecasts have become essential for residents planning everything from what to wear to which activities to schedule for the day. Meteorologists use complex computations based on Lewis Fry Richardson's numerical process for weather prediction (1922). However, a 2021 study by Colorado State University showed that machine-learning-based forecast systems can predict the weather just as well, but at a _fraction_ of the cost.

How much will it rain on a given day? We will try to answer this question using meteorological measurements and a regression model, and then evaluate the accuracy of the result. We will train our model on a [dataset](https://vancouver.weatherstats.ca/download.html) downloaded from _Vancouver Weather Stats_, which contains daily climate data for Vancouver for the past 12 years.

**TODO**
- our research question is..
- describe why the data set is important
- briefly describe our approach, e.g. using regression to predict the actual amount of rainfall and then evaluating the model's accuracy
- include some info on what columns this dataset contains in the wrangling section



## Preliminary Exploratory Data Analysis

In [1]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
options(repr.matrix.max.rows = 6)

Inspecting the data file shows that it is in csv format, so we load it with `read_csv`.

In [None]:
rain_data <- read_csv("data/weatherstats_vancouver_daily.csv")

Rows: 4000 Columns: 70
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (62): max_temperature, avg_hourly_temperature, avg_temperature, min_tem...
lgl   (5): solar_radiation, max_cloud_cover_4, avg_hourly_cloud_cover_4, avg...
date  (1): date
time  (2): sunrise, sunset

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


Taking a glimpse of our data shows that it is ___ tidy. We will remove the `dew_point` column, since dew point is just a calculation involving ___, ___, and ___.

### Wrangling

We start by wrangling the data into a ___ format. ___ says that the three factors that primarily influence rain are **temperature, humidity, and wind speed**. Our dataset provides **daily averages** for each.

We select only the columns that are most useful for prediction and remove the rows where any of these measurements is `NA`, since missing values could affect the k-nn algorithm's accuracy. Since we have over ___ and only a few columns have `NA` values, this should not bias our model too much.
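Before dropping rows, it may help to check how many missing values each column actually has. The following is a minimal sketch on a made-up toy tibble (`toy_weather` and its values are hypothetical stand-ins, not the real dataset); the same calls should work on the selected columns of `rain_data`.

```r
library(tidyverse)

# Toy stand-in for the real data; these values are made up for illustration.
toy_weather <- tibble(
  avg_temperature       = c(10.4, 7.6, NA, 4.2),
  avg_relative_humidity = c(86.0, 83.0, 85.5, 83.5),
  avg_wind_speed        = c(18.5, NA, 6.5, 24.0),
  rain                  = c(13.0, 6.7, 0.2, 11.8)
)

# Count the missing values in every column.
na_counts <- toy_weather |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Compare the number of rows before and after dropping incomplete rows.
rows_before <- nrow(toy_weather)
rows_after  <- nrow(drop_na(toy_weather))
```

If `rows_after` is much smaller than `rows_before` on the real data, dropping rows could bias the model and would be worth revisiting.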

In [None]:
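# Keep only days with measurable rain, then the columns we plan to use
rain_data |>
  filter(rain > 0) |>
  select(date, avg_temperature, avg_relative_humidity, avg_wind_speed, rain) |>
  # rain_flag is TRUE for every row here, because of the filter above
  mutate(rain_flag = (rain > 0)) |>
  # drop rows with a missing value in any of the selected columns
  na.omit()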

date,avg_temperature,avg_relative_humidity,avg_wind_speed,rain,rain_flag
<date>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>
2022-10-24,10.35,86.0,18.5,13.0,TRUE
2022-10-23,7.60,83.0,9.5,6.7,TRUE
2022-10-22,9.05,85.5,6.5,0.2,TRUE
⋮,⋮,⋮,⋮,⋮,⋮
2011-11-18,1.95,91.5,9.5,1.2,TRUE
2011-11-17,4.39,75.5,24.0,1.4,TRUE
2011-11-16,4.15,83.5,24.0,11.8,TRUE


### Visualization

In [1]:
options(repr.plot.width = 8, repr.plot.height = 7)

Before proceeding with visualization, let us first split our data into training and testing sets, so that accuracy is judged fairly on data the model has not seen. We have chosen to allocate 75% of the data for training. This balances having enough data to train the model well against having enough data to judge its accuracy. We stratify on the `rain` column so that its values are distributed evenly across both sets.

In [7]:
small_weather <- slice_sample(rain_data, n = 100)

In [None]:
rain_split <- initial_split(rain_data, prop = 0.75, strata = rain)
rain_training <- training(rain_split)
rain_testing <- testing(rain_split)
rain_training

Rows: 2,999
Columns: 70
$ date                          <date> 2022-10-22, 2022-10-21, 2022-10-19, 202…
$ max_temperature               <dbl> 12.3, 13.0, 13.9, 16.6, 18.2, 20.2, 17.7…
$ avg_hourly_temperature        <dbl> 8.88, 9.98, 10.46, 11.88, 13.54, 12.84, …
$ avg_temperature               <dbl> 9.05, 10.30, 9.85, 11.80, 14.14, 13.10, …
$ min_temperature               <dbl> 5.8, 7.6, 5.8, 7.0, 10.1, 6.0, 6.4, 6.4,…
$ max_humidex                   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ min_windchill                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ max_relative_humidity         <dbl> 100, 100, 100, 100, 100, 100, 100, 100, …
$ avg_hourly_relative_humidity  <dbl> 92.0, 84.1, 100.0, 94.0, 87.5, 82.3, 92.…
$ avg_relative_humidity         <dbl> 85.5, 83.0, 99.5, 87.5, 84.5


date,max_temperature,avg_hourly_temperature,avg_temperature,min_temperature,max_humidex,min_windchill,max_relative_humidity,avg_hourly_relative_humidity,avg_relative_humidity,⋯,avg_cloud_cover_4,min_cloud_cover_4,max_cloud_cover_8,avg_hourly_cloud_cover_8,avg_cloud_cover_8,min_cloud_cover_8,max_cloud_cover_10,avg_hourly_cloud_cover_10,avg_cloud_cover_10,min_cloud_cover_10
<date>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<lgl>,<lgl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2022-10-22,12.3,8.88,9.05,5.8,,,100,92.0,85.5,⋯,,,8,5.6,4.5,1,,,,
2022-10-21,13.0,9.98,10.30,7.6,,,100,84.1,83.0,⋯,,,8,7.4,6.0,4,,,,
2022-10-20,17.1,13.11,13.15,9.2,,,100,84.5,78.5,⋯,,,8,6.1,5.5,3,,,,
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
2011-11-23,7.4,5.10,5.20,3.0,,,98,92.1,90.5,⋯,,,,,,,,,,
2011-11-22,9.8,7.33,7.50,5.2,,,98,90.1,83.5,⋯,,,,,,,,,,
2011-11-21,8.1,5.26,4.39,0.7,,,97,87.0,85.5,⋯,,,,,,,,,,




We then plot different variables against rainfall to analyze trends and identify which predictors we can use. We know that rainfall ___. Let's start with some values that are relatively easy to measure (CITATION??): temperature, air pressure, relative humidity, and ___.
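One way to compare several candidate predictors against rainfall at once is to reshape the data to long format and facet the scatter plots. This is only a sketch on randomly generated toy data (`toy_training` and its column values are hypothetical); on the real data the training set would be used in its place, with whichever predictor columns we settle on.

```r
library(tidyverse)

set.seed(1)  # arbitrary seed so the toy data is reproducible

# Randomly generated stand-in for the training data (not real measurements).
toy_training <- tibble(
  avg_temperature       = runif(50, 0, 25),
  avg_relative_humidity = runif(50, 50, 100),
  avg_wind_speed        = runif(50, 0, 30),
  rain                  = runif(50, 0, 40)
)

# One row per (observation, predictor) pair, keeping rain alongside.
long_weather <- toy_training |>
  pivot_longer(cols = -rain, names_to = "predictor", values_to = "value")

# One scatter plot panel per predictor, each with its own x-axis range.
predictor_plot <- ggplot(long_weather, aes(x = value, y = rain)) +
  geom_point(alpha = 0.5) +
  facet_wrap(vars(predictor), scales = "free_x") +
  labs(x = "Predictor value", y = "Daily rainfall")
```

Printing `predictor_plot` draws the panels; free x-axis scales are used because the predictors are measured in different units.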

## Methods

We start by splitting our data into training and testing data, and ensure we have an even distribution of rainfall values.
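The remaining steps are not written yet, but a k-nn regression workflow in `tidymodels` might look roughly like the sketch below. Everything here is an assumption for illustration: the toy data is randomly generated, `neighbors = 5` is an arbitrary placeholder rather than a tuned value, and the `kknn` package must be installed for the engine to work.

```r
library(tidymodels)

set.seed(2022)  # arbitrary seed so the sketch is reproducible

# Randomly generated toy data; rain loosely increases with humidity.
toy_weather <- tibble(
  avg_temperature       = runif(200, 0, 25),
  avg_relative_humidity = runif(200, 50, 100),
  avg_wind_speed        = runif(200, 0, 30)
) |>
  mutate(rain = 0.3 * (avg_relative_humidity - 50) + abs(rnorm(200, sd = 3)))

# 75/25 split, stratified on the response.
toy_split    <- initial_split(toy_weather, prop = 0.75, strata = rain)
toy_training <- training(toy_split)
toy_testing  <- testing(toy_split)

# Standardize predictors, since k-nn relies on distances.
rain_recipe <- recipe(
  rain ~ avg_temperature + avg_relative_humidity + avg_wind_speed,
  data = toy_training
) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())

# k-nn regression specification; neighbors = 5 is a placeholder value.
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
  set_engine("kknn") |>
  set_mode("regression")

# Fit on the training set only.
knn_fit <- workflow() |>
  add_recipe(rain_recipe) |>
  add_model(knn_spec) |>
  fit(data = toy_training)

# Evaluate on the held-out testing set (rmse, rsq, mae).
rain_metrics <- knn_fit |>
  predict(toy_testing) |>
  bind_cols(toy_testing) |>
  metrics(truth = rain, estimate = .pred)
```

In the actual analysis the number of neighbors would presumably be chosen by cross-validation on the training set rather than fixed in advance.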


## Expected Outcomes and Significance

(quick rough draft notes to think about)
- we expect heavier rainfall in the winter months than in the summer (there would be a strong relationship between day of the year and rainfall amount)
- temperature may be another indicator of rain: the lower the temperature, the higher the chance/amount of rain?
- extension: impacts of climate change over recent years
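The seasonal expectation in the notes above could be checked with a simple monthly summary. The sketch below uses a tiny made-up tibble (`toy_rain` is hypothetical); the real data has `date` and `rain` columns with the same names.

```r
library(tidyverse)

# Made-up daily values for two months, for illustration only.
toy_rain <- tibble(
  date = as.Date(c("2021-01-05", "2021-01-20", "2021-07-04", "2021-07-18")),
  rain = c(12.0, 8.0, 0.0, 1.0)
)

# Average daily rainfall for each calendar month.
monthly_rain <- toy_rain |>
  mutate(month = lubridate::month(date)) |>
  group_by(month) |>
  summarise(mean_rain = mean(rain), .groups = "drop")
```

A clear winter peak in `mean_rain` on the real data would support using the day or month of the year as a predictor.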





