# Weather precipitation forecast for the following day

Goal: Your task is to train a model that predicts the amount of next-day precipitation based on present-day weather.

In [None]:
# dependencies

import pandas as pd
import seaborn as sns

# setup options

pd.set_option('display.max_columns', 30)
sns.set()

# load data

df = pd.read_parquet('../data/australia_weather.parquet')

## Exploratory data analysis

Questions:
- What features could we use?
- How do we deal with missing values?
- How do variables correlate with each other?
- How is location relevant to the prediction?
    - Are we predicting precipitation for the entire continent (mean) or for a specific location?
    - Should we filter out some locations (keep only the ones that are in close proximity)?

In [None]:
df.info()

In [None]:
df.head(10)

In [None]:
df.describe()

In [None]:
df.groupby('Location')['Location'].count()

Observations:

- there are 156412 entries (rows) and 24 variables (columns)
- two of the variables are derived from others (`RainTomorrow` depends on `RISK_MM`, `RainToday` depends on `Rainfall`)
- the data is mostly float values, but there are also string values (`Location`, `WindGustDir`, `WindDir9am`, `WindDir3pm`), which can be converted to categorical values, and boolean values (`RainToday`, `RainTomorrow`), which can be converted to integer values
- there are many columns with missing values (some columns, `Evaporation` and `Sunshine`, have close to 50% missing values)
- maximum atmospheric pressure is over 10000 hPa, which is not possible (at least not on Earth)
- maximum cloud cover is 9, which is not possible (at most 8 oktas)
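
The share of missing values per column can be quantified directly instead of being read off `df.info()`. A minimal sketch, assuming a hypothetical helper `missing_share` and a made-up toy frame standing in for the real data:

```python
import numpy as np
import pandas as pd


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return frame.isna().mean().mul(100).sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Evaporation': [np.nan, 4.2, np.nan, 6.0],
    'Sunshine': [np.nan, np.nan, 8.1, np.nan],
    'MinTemp': [12.0, 14.5, 9.8, 11.1],
})

shares = missing_share(toy)
print(shares)
```

On the real data the call would be `missing_share(df)`; columns near the top of the result are candidates for dropping rather than imputing.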

**Questions about the data**:

- Why is some data missing?
- What does sunshine mean? Is it the number of hours of sunshine? What exactly does "bright sunshine" mean?
- How is it determined whether it rained on a given day? Is it determined the same way as the `RainToday` column?

In [None]:
# correlation matrix

sns.set(rc={'figure.figsize': (11.7, 8.27)})
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f")

Observations:

- strong positive correlation between
    - `Temp9am` and `MinTemp` (0.9)
    - `Temp3pm` and `MaxTemp` (0.98)
    - `Temp3pm` and `Temp9am` (0.86)
- moderate negative correlation between
    - `Cloud3pm` and `Sunshine` (-0.7)
    - `Cloud9am` and `Sunshine` (-0.68)
- no correlation between
    - `Pressure9am` and `Pressure3pm` (should have strong positive correlation, probably caused by incorrect data)
    - `Humidity` and `Pressure*` (pressure should be inversely proportional to humidity, probably caused by incorrect data)

For each pair of strongly correlated variables, one of the two should be removed from the data.
Cloud cover is preferable to sunshine, because sunshine has more missing values.
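
The strongly correlated pairs can also be listed programmatically rather than read off the heatmap. A minimal sketch, assuming a hypothetical helper `correlated_pairs`, an arbitrary threshold of 0.85, and made-up toy values:

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame: pd.DataFrame, threshold: float = 0.85) -> pd.Series:
    """List column pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr(numeric_only=True).abs()
    # Keep only the upper triangle so that each pair is reported once
    # and the diagonal (self-correlation) is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Temp9am': [10.0, 12.0, 14.0, 16.0, 18.0],
    'Temp3pm': [15.0, 17.5, 19.0, 21.5, 23.0],
    'Humidity3pm': [50.0, 80.0, 40.0, 70.0, 45.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```

Which member of each pair to drop remains a judgement call; the number of missing values per column is one possible criterion.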

In [None]:
from scipy.stats import normaltest

min_temp = df['MinTemp'].dropna()

_, p = normaltest(min_temp)
print(f'p-value: {p}')

sns.histplot(data=min_temp, bins=50)

In [None]:
sns.scatterplot(data=df, x='MinTemp', y='MaxTemp', hue='RainToday')

In [None]:
sns.pairplot(df, vars=['Temp3pm', 'Humidity3pm', 'Pressure3pm'], hue='RainToday')

# Preprocessing

**TODO**:

- filter out columns that are not needed
- combine 9am and 3pm data into a single column
- remove correlated variables
- subset dataset
- convert types
- remove incorrect data (pressure, cloud cover)
- imputation of missing values
- normalize data
- use one-hot encoding for categorical variables
- use label encoding for boolean variables
- split data into train and test set
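
The "remove incorrect data" step could be sketched as follows. The helper `mask_implausible` and the pressure limits are assumptions made for illustration (the limits are not taken from the dataset documentation); the 8-okta limit follows the observation above. Implausible readings are replaced by missing values rather than dropping whole rows, so they can be handled by the later imputation step.

```python
import numpy as np
import pandas as pd

# Assumed plausibility limits; adjust after inspecting the real data.
PRESSURE_RANGE_HPA = (850.0, 1100.0)
MAX_OKTAS = 8


def mask_implausible(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with out-of-range pressure and cloud readings set to NaN."""
    cleaned = frame.copy()
    low, high = PRESSURE_RANGE_HPA
    for col in ('Pressure9am', 'Pressure3pm'):
        if col in cleaned:
            cleaned[col] = cleaned[col].where(cleaned[col].between(low, high))
    for col in ('Cloud9am', 'Cloud3pm'):
        if col in cleaned:
            cleaned[col] = cleaned[col].where(cleaned[col].between(0, MAX_OKTAS))
    return cleaned


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Pressure9am': [1013.2, 10150.0, 998.7],
    'Cloud3pm': [3.0, 9.0, 8.0],
})

cleaned = mask_implausible(toy)
print(cleaned)
```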

In [None]:
# type conversion

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

obj_cols = df.select_dtypes("object").columns
df[obj_cols] = df[obj_cols].astype("category")

df.info()
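
The label and one-hot encoding items from the TODO list could look roughly like this. It is a sketch under assumptions: the helper `encode_features` is hypothetical, the boolean columns are assumed to hold the strings `'Yes'` and `'No'`, and only `WindGustDir` is one-hot encoded in the example.

```python
import pandas as pd


def encode_features(
    frame: pd.DataFrame,
    yes_no_cols=('RainToday', 'RainTomorrow'),
    one_hot_cols=('WindGustDir',),
) -> pd.DataFrame:
    """Map Yes/No columns to 1/0 and one-hot encode the given categorical columns."""
    encoded = frame.copy()
    for col in yes_no_cols:
        # Assumes raw values are 'Yes'/'No'; anything else becomes missing.
        # Nullable Int64 keeps integers even when some rows are missing.
        mapped = encoded[col].astype('object').map({'No': 0, 'Yes': 1})
        encoded[col] = mapped.astype('Int64')
    return pd.get_dummies(encoded, columns=list(one_hot_cols), dtype=int)


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Rainfall': [0.0, 5.2, 0.0],
    'RainToday': ['No', 'Yes', None],
    'RainTomorrow': ['Yes', 'No', 'No'],
    'WindGustDir': ['N', 'SW', 'N'],
})

encoded = encode_features(toy)
print(encoded)
```

One-hot encoding treats the wind directions as unordered labels; whether a cyclical encoding would suit compass directions better is left open here.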

In [None]:
# TODO: choose subset of columns

subset_columns = [
    'Date',
    # 'Location',
    # 'MinTemp',
    # 'MaxTemp',
    'Rainfall',
    'Evaporation',
    # 'Sunshine',
    # 'WindGustDir',
    # 'WindGustSpeed',
    # 'WindDir9am',
    # 'WindDir3pm',
    # 'WindSpeed9am',
    # 'WindSpeed3pm',
    'Humidity9am',
    'Humidity3pm',
    'Pressure9am',
    'Pressure3pm',
    'Cloud9am',
    'Cloud3pm',
    'Temp9am',
    'Temp3pm',
    'RainToday',
    'RISK_MM',
    'RainTomorrow',
]

df_subset = df[subset_columns]

In [None]:
# statistical imputation

# TODO

# Model

## Model selection

## Model explanation

## Model training

In [None]:
from sklearn.model_selection import train_test_split

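# hold out 20% of the rows for testing; the fixed seed makes the split reproducible
# note: this splits the full `df`, not the preprocessed `df_subset`
train_set, test_set = train_test_split(df, test_size=0.2, random_state=123)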
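
Before training, the target has to be separated from the features. The sketch below assumes that `RISK_MM` holds the next-day rainfall amount and is therefore the target, and that `RainTomorrow` must be excluded from the features because it is derived from `RISK_MM` (see the observations above). The helper `split_features_target` is hypothetical.

```python
import numpy as np
import pandas as pd


def split_features_target(
    frame: pd.DataFrame,
    target: str = 'RISK_MM',
    leaky=('RainTomorrow',),
):
    """Return (X, y), dropping rows without a target and columns that leak it."""
    rows = frame.dropna(subset=[target])
    X = rows.drop(columns=[target, *leaky], errors='ignore')
    y = rows[target]
    return X, y


# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    'Humidity3pm': [40.0, 75.0, 60.0],
    'Pressure3pm': [1015.0, 1002.3, 1009.8],
    'RISK_MM': [0.0, 12.4, np.nan],
    'RainTomorrow': ['No', 'Yes', None],
})

X, y = split_features_target(toy)
print(X)
print(y)
```

The resulting `X` and `y` can be passed together to `train_test_split(X, y, ...)`, which splits both consistently.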

## Model interpretation

## Model evaluation

# Results summary