# 1. Weather dataset introduction

This notebook downloads the data and formats it for later use.
The dataset is the [Rain in Australia](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package) dataset from Kaggle. It contains about 10 years of daily weather observations from many locations across Australia.


#### Index
1. [Download dataset from Kaggle](#1.-Download-dataset-from-Kaggle)
2. [Apply proper format to the dataset](#2.-Apply-proper-format-to-the-dataset)
3. [Dataset profiling](#3.-Dataset-profiling)

In [3]:
!pip install pandas_profiling


In [156]:
import os
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport  # requires: pip install pandas_profiling
import time

## 1. Download dataset from Kaggle

In [5]:
!pip install kaggle



In [6]:
# Set the credentials before importing kaggle. Never commit a real API key.
os.environ['KAGGLE_USERNAME'] = "ginesmeca"  # Kaggle user name
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"  # Created from the Kaggle account page (create key)

import kaggle

kaggle.api.authenticate()
kaggle.api.dataset_download_files('jsphyg/weather-dataset-rattle-package', path='../data/01_raw', unzip=True)
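
The Kaggle client can also pick up credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables set outside the notebook, or from a `kaggle.json` token file in `~/.kaggle`, so the key does not have to appear in the code. A minimal sketch of a pre-flight check, assuming one of those two mechanisms is used; the helper name `kaggle_credentials_available` and its `config_dir` parameter are hypothetical and not part of the Kaggle package:

```python
import json
import os
from pathlib import Path


def kaggle_credentials_available(config_dir=None):
    """Return True if Kaggle credentials can be found without hardcoding them."""
    # Option 1: both environment variables are already set outside the notebook.
    if os.environ.get("KAGGLE_USERNAME") and os.environ.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file exists (default location: ~/.kaggle).
    config_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    token_file = config_dir / "kaggle.json"
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    # The token file is expected to hold a "username" and a "key" entry.
    return isinstance(token, dict) and bool(token.get("username")) and bool(token.get("key"))
```

If the check returns `True`, `kaggle.api.authenticate()` should be able to run without the two `os.environ[...]` assignments.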

## 2. Apply proper format to the dataset

### 2.1 Read the dataset

In [108]:
weather = pd.read_csv('../data/01_raw/weatherAUS.csv')
weather.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [101]:
weather.dtypes

Date              object
Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
dtype: object

### 2.2 Apply the proper format

In [172]:
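# Parse the date strings into datetime values
weather['Date'] = pd.to_datetime(weather.Date, format='%Y-%m-%d')

# Re-create RainToday as a 0/1 flag: 1 when Rainfall exceeds 1.0 mm.
# Rows with a missing Rainfall also get 1, because NaN <= 1.0 evaluates to False.
weather['RainToday'] = np.where(weather.Rainfall<=1.0, 0, 1)

# Drop the original RainTomorrow column; it is rebuilt in section 2.4
weather.drop('RainTomorrow',axis=1, inplace=True)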

weather.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,0
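
The `np.where` expression used for `RainToday` labels a row as 1 whenever `Rainfall <= 1.0` is false, which includes rows where `Rainfall` is missing. The sketch below shows this on a small invented series (not taken from the dataset) and one possible variant that keeps missing values as `NaN`; whether that variant is preferable depends on how missing rainfall should be treated downstream.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only; the last entry is missing.
rainfall = pd.Series([0.6, 1.0, 2.4, np.nan])

# Same expression as in the notebook: NaN <= 1.0 is False, so NaN maps to 1.
flag = np.where(rainfall <= 1.0, 0, 1)

# Variant: keep the flag only where Rainfall is known, NaN elsewhere.
# The result becomes a float column because NaN cannot be stored as an integer.
flag_keep_nan = pd.Series(flag, index=rainfall.index).where(rainfall.notna())

print(flag.tolist())
print(flag_keep_nan.tolist())
```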


### 2.3 Save the dataset

In [173]:
weather.to_parquet('../data/02_intermediate/weather_intermediate.parquet')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
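
The `ImportError` above means that neither parquet engine is installed in this environment, so the file was not written. A minimal sketch of a more defensive save, assuming a CSV fallback is acceptable; the helper `save_intermediate` and the fallback behaviour are hypothetical and not part of the original notebook:

```python
from pathlib import Path

import pandas as pd


def save_intermediate(df, stem):
    """Write df to <stem>.parquet, or to <stem>.csv when no parquet engine is usable."""
    stem = Path(stem)
    # Create the output folder if it does not exist yet.
    stem.parent.mkdir(parents=True, exist_ok=True)
    try:
        target = stem.with_suffix(".parquet")
        df.to_parquet(target)
    except ImportError:
        # pandas raises ImportError when pyarrow and fastparquet are both missing.
        target = stem.with_suffix(".csv")
        df.to_csv(target, index=False)
    return target
```

A call such as `save_intermediate(weather, '../data/02_intermediate/weather_intermediate')` would return the path that was actually written. Installing `pyarrow` or `fastparquet`, as the error message suggests, lets the parquet branch succeed.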

### 2.4 Creating the new RainTomorrow column

#### 2.4.1 Using if/for

In [174]:
t0 = time.perf_counter()

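# Build RainTomorrow row by row from the next row's RainToday value
RainT = []
for n in range(len(weather) - 1):
    if weather.Location[n] == weather.Location[n+1]:
        # The next row belongs to the same location: use its RainToday value
        RainT.append(weather.RainToday[n+1])
    else:
        # The next row belongs to another location: tomorrow is unknown
        RainT.append(np.nan)

# The last row has no following day
RainT.append(np.nan)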

#Adding RainT to weather:
weather['RainTomorrow'] = RainT

t = time.perf_counter() - t0

# Keep an independent copy of this result (plain assignment would only alias `weather`):
WEATHER = weather.copy()
WEATHER.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,0,0.0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,0,0.0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,0,0.0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,0,0.0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,0,0.0


In [175]:
#Running time if/for method:
t

2.765478899998925

#### 2.4.2 Using np.where

First, reset the dataset as in section 2.2, so that the `RainTomorrow` column created above is dropped before it is rebuilt.

In [176]:
# Format the date
weather['Date'] = pd.to_datetime(weather.Date, format='%Y-%m-%d')

#Format RainToday by re-creating it
weather['RainToday'] = np.where(weather.Rainfall<=1.0, 0, 1)

#Remove columns we are not interested in
weather.drop('RainTomorrow',axis=1, inplace=True)

weather.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,0


In [177]:
t0 = time.perf_counter()

# Two frames offset by one row, so that row n of RT2 holds the next row's values
RT1 = weather[['Location', 'RainToday']]
RT2 = weather[['Location', 'RainToday']].iloc[1:].reset_index(drop=True)
# Pad RT2 with an empty last row so that both frames share the same index
RT2.loc[len(weather) - 1] = [np.nan, np.nan]

#Creating the new column:
RainT = np.where(RT1.Location == RT2.Location, RT2.RainToday, np.nan)

#Adding RainT to weather:
weather['RainTomorrow'] = RainT

t = time.perf_counter() - t0

weather.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,...,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,0,0.0
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,...,44.0,25.0,1010.6,1007.8,,,17.2,24.3,0,0.0
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,...,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,0,0.0
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,...,45.0,16.0,1017.6,1012.8,,,18.1,26.5,0,0.0
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,...,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,0,0.0


In [178]:
#Running time np.where method:
t

0.03576649999922665

Both methods give the same values in the rows shown, but the np.where version ran in about 0.036 s against 2.77 s for the loop, roughly 77 times faster.
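
The same "next row within the same location" logic can also be written with a grouped shift in pandas. A minimal sketch on a toy frame (invented values, not taken from the dataset), assuming each location's rows are contiguous and sorted by date, as the rows displayed above appear to be; this is an alternative formulation, not the method used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: two locations, rows ordered by date within each.
toy = pd.DataFrame({
    "Location": ["Albury", "Albury", "Albury", "Perth", "Perth"],
    "RainToday": [0, 1, 0, 1, 1],
})

# shift(-1) moves each group's values up by one row, so every row receives
# the next row's RainToday; the last row of each location becomes NaN.
toy["RainTomorrow"] = toy.groupby("Location")["RainToday"].shift(-1)

print(toy)
```

Unlike the row-offset comparison, the grouped shift never mixes values across locations, even when the same location appears in more than one block of rows.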

## 3. Dataset profiling

In [84]:
profile = ProfileReport(weather, title="Weather dataset profiling report", explorative=True)
profile.to_file("../data/reporting/weather_report.html")

#profile.to_notebook_iframe()


### 3.1 Report insights