# Group Proposal - Predicting Following Day Rainfall in Australia


## 1.0 Introduction

#### 1.1 Background

Meteorological data has been collected and used to predict weather conditions for as long as we have had the means to do so, since 1869! Forecasts help people plan their day ahead and deal with the expected weather conditions accordingly. Predicting a rainy day requires considering many variables, such as wind, humidity, and temperature.

#### 1.2 Central Question

Will it rain tomorrow in Australia based on a set of meteorological characteristics from the previous day?

#### 1.3 Dataset

The dataset that we will use is the “Rain in Australia” dataset by Joe Young and Adam Young. It contains roughly 10 years of daily meteorological data, from 2007/10/31 to 2017/6/24, collected by weather stations in various regions across Australia. The dataset includes weather variables such as wind speed, wind direction, and temperature, as well as the amount of precipitation in the form of rain on any given day.

## 2.0 Preliminary Exploratory Data Analysis

#### 2.1 Wrangling

To aid in our choice of predictor variables, we can check which columns contain the most valid data (the fewest NA values). A larger sample of data would allow us to reduce the impact of factors such as random error in the observation process and improve the overall quality of the analysis.

In [18]:
# Load tidyverse (wrangling, plotting) and tidymodels (data splitting)
library(tidyverse)
library(tidymodels)

In [19]:
#load data into r
weather_data_raw <- read_csv("weatherAUS.csv")

Rows: 145460 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): Location, WindGustDir, WindDir9am, WindDir3pm, RainToday, RainTom...
dbl  (16): MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustSpeed,...
date  (1): Date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [20]:
# Set the default plot size for the notebook
options(repr.plot.width = 14, repr.plot.height = 8)

# Wrangle: drop the categorical predictors, the date and location identifiers,
# and the variables with too many NA values
weather_data_clean <- weather_data_raw |>
                            select(-WindGustDir, -WindDir9am, -WindDir3pm, -RainToday, -Date, -Location) |>
                            select(-Sunshine, -Evaporation, -Cloud3pm, -Cloud9am)

In [None]:
# Split training and testing sets
weather_split <- initial_split(weather_data_clean, prop = 0.75, strata = RainTomorrow)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)
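
The split above is random, so its results change from run to run unless a seed is fixed. The sketch below shows one way to make a stratified split repeatable and to check that stratification keeps the class proportions similar in both sets. The toy data frame `toy_weather`, its values, and the seed `2023` are assumptions made for illustration; they are not part of the proposal's data.

```r
library(rsample)

# Arbitrary seed (an assumption); any fixed value makes the split repeatable
set.seed(2023)

# Hypothetical toy data: 200 days, roughly 22% labelled "Yes"
toy_weather <- data.frame(
  Humidity3pm  = runif(200, min = 10, max = 100),
  RainTomorrow = factor(sample(c("No", "Yes"), 200, replace = TRUE,
                               prob = c(0.78, 0.22)))
)

# Same call pattern as the split above: 75% training, stratified by the label
toy_split <- initial_split(toy_weather, prop = 0.75, strata = RainTomorrow)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# With stratification, the share of "Yes" days should be close in both sets
prop.table(table(toy_train$RainTomorrow))
prop.table(table(toy_test$RainTomorrow))
```

Calling `set.seed()` once before `initial_split()` would be enough to make the real split reproducible as well.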

#### 2.2 Summarizing Non-NA Observations
We can now check how much usable data each column holds by counting the non-NA values in each column of our training data.

In [23]:
# Count the non-NA cells in each column, convert the result to a data frame, then rename the variables
no_na_data <- as.data.frame((colSums(!is.na(weather_train)))) 
no_na_data <- cbind(rownames(no_na_data), no_na_data)
rownames(no_na_data) <- NULL
colnames(no_na_data) <- c("measurement","count")

count_tbl <- arrange(no_na_data, desc(count))
count_tbl

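| measurement  | count  |
|--------------|--------|
| `<chr>`      | `<dbl>` |
| MaxTemp      | 108159 |
| MinTemp      | 107996 |
| WindSpeed9am | 107820 |
| Temp9am      | 107791 |
| Humidity9am  | 107130 |
| WindSpeed3pm | 106814 |
| Rainfall     | 106676 |
| RainTomorrow | 106646 |
| Temp3pm      | 106382 |
| Humidity3pm  | 105717 |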
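
The counts in the table above can also be compared visually. The following is a minimal sketch that re-enters the ten counts shown above by hand and draws them as an ordered bar chart; the object name `non_na_plot` and the axis labels are our own choices for illustration.

```r
library(ggplot2)

# Non-NA counts copied from the table above (training set)
count_tbl <- data.frame(
  measurement = c("MaxTemp", "MinTemp", "WindSpeed9am", "Temp9am",
                  "Humidity9am", "WindSpeed3pm", "Rainfall", "RainTomorrow",
                  "Temp3pm", "Humidity3pm"),
  count = c(108159, 107996, 107820, 107791, 107130,
            106814, 106676, 106646, 106382, 105717)
)

# Horizontal bars, ordered so the column with the most valid data is on top
non_na_plot <- ggplot(count_tbl,
                      aes(x = count, y = reorder(measurement, count))) +
  geom_col() +
  labs(x = "Number of non-NA observations", y = "Measurement")

non_na_plot
```

In the notebook itself, the existing `count_tbl` object could be passed to `ggplot()` directly instead of retyping the values.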


## 3.0 Methods

#### 3.1 Explain how you will conduct either your data analysis and which variables/columns you will use

The columns we plan to use are those that quantify the day’s weather and have the fewest NA observations. These include minimum temperature, maximum temperature, rainfall level, humidity, and wind speed. We confirmed this by counting the non-NA values in each column with `colSums(!is.na(...))`. We will also plot facet_grid histograms to compare the distributions of each predictor variable.
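
The faceted histograms mentioned above could be built roughly as follows. This is only a sketch: `toy_train` is a hypothetical stand-in for `weather_train` with made-up values, and the choice to split the rows of the grid by `RainTomorrow` is an assumption about the layout rather than a settled design.

```r
library(tidyverse)

# Hypothetical toy rows standing in for weather_train; the values are made up
toy_train <- tibble(
  MinTemp      = c(13.4, 7.4, 12.9, 9.2, 17.5, 14.6),
  MaxTemp      = c(22.9, 25.1, 25.7, 28.0, 32.3, 29.7),
  Rainfall     = c(0.6, 0.0, 0.0, 0.0, 1.0, 0.2),
  Humidity3pm  = c(22, 25, 30, 16, 33, 23),
  RainTomorrow = c("No", "No", "No", "No", "Yes", "No")
)

# Reshape to long format: one row per (day, predictor) pair
toy_long <- toy_train |>
  pivot_longer(cols = -RainTomorrow,
               names_to = "predictor",
               values_to = "value")

# One column of panels per predictor, one row of panels per class label;
# free x scales because the predictors are measured in different units
predictor_hist <- ggplot(toy_long, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_grid(rows = vars(RainTomorrow), cols = vars(predictor),
             scales = "free_x") +
  labs(x = "Measured value", y = "Number of days")

predictor_hist
```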


#### 3.2 Describe at least one way that you will visualize the results

We will make scatterplots to visualize the relationship between pairs of variables, such as “rainfall level” and “humidity”, with points colored by the target variable, to see whether any pattern distinguishes days that are followed by rain from days that are not.
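
A scatterplot of this kind might look like the sketch below, which plots rainfall against 3pm humidity and colors each point by `RainTomorrow`. The data frame `toy_train` and its values are hypothetical and serve only to make the example run on its own; in the analysis, `weather_train` would take its place.

```r
library(ggplot2)

# Hypothetical toy rows standing in for weather_train; the values are made up
toy_train <- data.frame(
  Rainfall     = c(0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 4.8, 0.0),
  Humidity3pm  = c(22, 25, 30, 16, 33, 23, 71, 19),
  RainTomorrow = c("No", "No", "No", "No", "Yes", "No", "Yes", "No")
)

# Color by the target variable so any separation between classes is visible
rain_scatter <- ggplot(toy_train,
                       aes(x = Humidity3pm, y = Rainfall,
                           color = RainTomorrow)) +
  geom_point(alpha = 0.6) +
  labs(x = "Humidity at 3pm (%)", y = "Rainfall (mm)",
       color = "Rain tomorrow?")

rain_scatter
```

With the full training set, a low `alpha` helps because many points overlap.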

## 4.0 Expected Outcomes and Significance:

#### 4.1 Expected Outcomes

We expect to build a model that uses a specific date’s meteorological information to determine whether it rains on the next day. The expected outcome is therefore an accurate prediction of whether rain occurs on the following day.

#### 4.2 Significance of Investigation

The significance of this analysis lies in the immense impact that weather, and rain in particular, has on society. Being able to predict rain is not only beneficial for day-to-day life but also essential for industries such as agriculture, tourism, and urban development.

#### 4.3 Extended/Further Questions

Investigating precipitation further can prompt inquiries into how rain patterns have evolved over the last decade. This investigation can also prompt further analysis of how accurate our model remains when applied to present-day conditions.