# Rain Analysis
### Purpose
This notebook examines volunteer trends in reporting rain, addressing GitHub issue #54.

### Author: 
Hamza El-Saawy
### Date: 
2020-06-14
### Update Date: 
2020-06-14

### Inputs 
 - `1.1-circles_to_many_stations_usa_weather_data_20200424213015`

### Output Files
`2.1-cbc_prcp_1900-2018.csv`: A reduced CBC dataset consisting of only rain (precipitation) data and an analysis of that data compared to the NOAA GHCN data

## Steps or Procedures in the Notebook
 - Clean the CBC data
 - Compare to NOAA data
 - Make some plots

## Where the Data will Be Saved 
The project Google Drive, at: https://drive.google.com/drive/folders/1Nlj9Nq-_dPFTDbrSDf94XMritWYG6E2I

## Notes
The flattened NOAA BigQuery table drops the `QFLAG` column, so we cannot filter out erroneous measurements. It also lacks the `WT**` `element` values, which could be used alongside the `PRCP` fields to determine precipitation.
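
For context, here is a minimal sketch of the filtering that the missing columns would allow. It assumes an unflattened, long-format GHCN-Daily table in which `qflag` is empty for values that passed quality checks; the toy rows, the `ghcn` variable, and the lowercase column names are illustrative assumptions, not data from this project.

```python
import numpy as np
import pandas as pd

# Toy long-format rows (made up for illustration, not real station data).
ghcn = pd.DataFrame({
    'id': ['STATION_A', 'STATION_A', 'STATION_A', 'STATION_A'],
    'element': ['PRCP', 'PRCP', 'WT16', 'TMAX'],
    'value': [5.0, 9999.0, 1.0, 12.0],
    'qflag': [np.nan, 'X', np.nan, np.nan],  # assumption: non-empty means a failed check
})

# Keep only values that did not fail a quality check.
passed_qa = ghcn[ghcn.qflag.isna()]

# Keep precipitation amounts plus any weather-type (WT**) elements.
is_precip = passed_qa.element.eq('PRCP') | passed_qa.element.str.startswith('WT')
precip_rows = passed_qa[is_precip]
```

With the flattened table used in this notebook, neither step is possible, so every `precipitation_value` has to be taken at face value.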

Additionally, 1.1 drops rows where `temp_min_value`, `temp_max_value`, `temp_avg`, and `snow` are `nan`, but those rows could still have usable values for `[am|pm]_[rain|snow]`, since it is much easier to note whether weather occurred than to take measurements.
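
As a rough illustration of what is lost, the sketch below finds rows that have no measurements but still carry at least one AM/PM flag. The toy frame is made up, and it assumes the filter in 1.1 applies when *all* four measurement columns are missing; the actual condition in 1.1 may differ.

```python
import numpy as np
import pandas as pd

MEASUREMENT_COLS = ['temp_min_value', 'temp_max_value', 'temp_avg', 'snow']
FLAG_COLS = ['am_rain', 'pm_rain', 'am_snow', 'pm_snow']

# Toy rows (made up for illustration, not real CBC data).
toy = pd.DataFrame({
    'temp_min_value': [1.0, np.nan, np.nan],
    'temp_max_value': [5.0, np.nan, np.nan],
    'temp_avg': [3.0, np.nan, np.nan],
    'snow': [0.0, np.nan, np.nan],
    'am_rain': ['3', '1', np.nan],
    'pm_rain': ['3', '2', np.nan],
    'am_snow': ['3', '3', np.nan],
    'pm_snow': ['3', '3', np.nan],
})

no_measurements = toy[MEASUREMENT_COLS].isna().all(axis=1)
has_flags = toy[FLAG_COLS].notna().any(axis=1)

# Rows a measurement-based filter would discard even though they carry flags.
recoverable = toy[no_measurements & has_flags]
```

Here only the second toy row is recoverable: the first has measurements, and the third has neither measurements nor flags.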

In [2]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The cleaned data set, `1.0-rec-initial-data-cleaning.txt`, drops circles with "impossible" temperature, wind, and snow values. We still find those circles valuable here, since we assume that records with mistaken or erroneous temperature/wind data can still have valid precipitation data.

In [123]:
DATA_PATH = '../data/Cloud_Data'
RAW_DATA_PATH = os.path.join(DATA_PATH, 'cbc_effort_weather_1900-2018.txt')
CLN_DATA_PATH = os.path.join(DATA_PATH, '1.0-rec-initial-data-cleaning.txt')
NOAA_DATA_PATH = os.path.join(DATA_PATH, '1.1-circles_to_many_stations_usa_weather_data_20200424213015.csv')
CBC_PRCP_PATH = os.path.join(DATA_PATH, 'cbc_prcp_1900-2018.csv')

In [124]:
raw_data = pd.read_csv(RAW_DATA_PATH, encoding = "ISO-8859-1", sep="\t")

In [125]:
clean_data = pd.read_csv(CLN_DATA_PATH, encoding = "ISO-8859-1", sep="\t")


In [155]:
noaa_data = pd.read_csv(NOAA_DATA_PATH).rename(columns={'int64_field_0': 'orig_id', 'id': 'station_id'})

In [191]:
fields_to_keep = ['orig_id', 'circle_name', 'country_state', 'count_year', 'count_date',
                  'min_snow_metric', 'am_rain', 'pm_rain', 'am_snow', 'pm_snow',
                  'station_id',
                  'precipitation_value', 'snow', 'snwd',
                 ]

# copy so the column assignments below do not write into a view of noaa_data
prcp_data = noaa_data.loc[:, fields_to_keep].copy()

In [218]:
# flag: CBC min_snow_metric > 0 (1.0/0.0), NaN where the value is missing
prcp_data['r_sd_gtz'] = np.where(prcp_data.min_snow_metric.isna(), np.nan, prcp_data.min_snow_metric > 0)

In [193]:
# flag: NOAA station snwd > 0 (1.0/0.0), NaN where the value is missing
prcp_data['s_sd_gtz'] = np.where(prcp_data.snwd.isna(), np.nan, prcp_data.snwd > 0)

In [220]:
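# convert the non-missing codes to strings (e.g. 3.0 -> '3'), leaving NaN in place
for c in ['am_rain', 'pm_rain', 'am_snow', 'pm_snow']:
    i = prcp_data[c].notna()
    prcp_data[c] = prcp_data[c].astype(object)  # let strings sit alongside NaN
    prcp_data.loc[i, c] = prcp_data.loc[i, c].astype('int64').astype('str')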

In [219]:
# blank out any entry that contains code 4
for c in ['am_rain', 'pm_rain', 'am_snow', 'pm_snow']:
    prcp_data.loc[prcp_data[c].str.contains('4', na=False), c] = np.nan

In [221]:
prcp_data

| | orig_id | circle_name | country_state | count_year | count_date | min_snow_metric | am_rain | pm_rain | am_snow | pm_snow | station_id | precipitation_value | snow | snwd | r_sd_gtz | s_sd_gtz |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32617 | Amchitka Island | US-AK | 1980 | 1979-12-18 | 0.00 | 3 | 3 | 2 | 2 | USC00500252 | 5.0 | 3.0 | 0.0 | 0.0 | 0.0 |
| 1 | 52625 | Amchitka Island | US-AK | 1993 | 1992-12-20 | 0.00 | NaN | NaN | NaN | NaN | USC00500252 | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 90930 | Caribou | US-ME | 2012 | 2011-12-28 | 10.16 | 321 | 1 | 2 | 3 | USW00014607 | 71.0 | 8.0 | 25.0 | 1.0 | 1.0 |
| 3 | 93245 | Caribou | US-ME | 2013 | 2012-12-29 | 10.16 | 3 | 3 | 3 | 3 | USW00014607 | 0.0 | 0.0 | 229.0 | 1.0 | 1.0 |
| 4 | 95653 | Caribou | US-ME | 2014 | 2014-01-01 | 38.10 | 3 | 3 | 3 | 3 | USW00014607 | 0.0 | 3.0 | 460.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 109385 | 37123 | Kaua'i: Waimea | US-HI | 1983 | 1982-12-26 | 0.00 | NaN | NaN | NaN | NaN | USC00518205 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 109386 | 37123 | Kaua'i: Waimea | US-HI | 1983 | 1982-12-26 | 0.00 | NaN | NaN | NaN | NaN | USC00519253 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 109387 | 37123 | Kaua'i: Waimea | US-HI | 1983 | 1982-12-26 | 0.00 | NaN | NaN | NaN | NaN | USC00514272 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 109388 | 37123 | Kaua'i: Waimea | US-HI | 1983 | 1982-12-26 | 0.00 | NaN | NaN | NaN | NaN | USW00022501 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


If either the AM or PM field shows rain/snow (code 1 or 2), then it rained. If *both* AM and PM are "3", then it did not rain. Otherwise, the result is NaN.
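
The rule above can be sketched as a small helper. This is one possible reading, not the notebook's final implementation: `rain_flag` is a hypothetical name, the toy rows are made up, and "both are 3" is interpreted as exact equality with the string `'3'`.

```python
import numpy as np
import pandas as pd

def rain_flag(am, pm):
    """Combine AM/PM code strings into 1.0 (rain), 0.0 (no rain), or NaN (unknown)."""
    # na=False treats a missing code as "no evidence of rain" for that half-day
    am_wet = am.str.contains('[12]', regex=True, na=False)
    pm_wet = pm.str.contains('[12]', regex=True, na=False)
    both_dry = (am == '3') & (pm == '3')

    out = pd.Series(np.nan, index=am.index, dtype='float64')
    out[both_dry] = 0.0          # dry only when both halves say "3"
    out[am_wet | pm_wet] = 1.0   # any 1 or 2 in either half means rain
    return out

# Toy rows (made up for illustration).
toy = pd.DataFrame({
    'am_rain': ['3', '1', np.nan, '3', '321'],
    'pm_rain': ['3', '3', '2', np.nan, '3'],
})
toy['rained'] = rain_flag(toy.am_rain, toy.pm_rain)
```

Starting from an all-NaN result and filling in only the two definite cases keeps a row such as AM = "3" with a missing PM as unknown rather than dry.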

In [244]:
# rows where exactly one of am_rain / pm_rain is missing
ooo = (prcp_data.am_rain.str.contains('[12]').isna() ^ prcp_data.pm_rain.str.contains('[12]').isna())

In [256]:
prcp_data.am_rain[ooo].str.contains('[12]') | prcp_data.pm_rain[ooo].str.contains('[12]')

9         False
703        True
704        True
772       False
818       False
          ...  
109311     True
109312     True
109313     True
109316    False
109317    False
Length: 457, dtype: bool

In [257]:
prcp_data.am_rain[ooo].str.contains('[12]') 

9         False
703        True
704        True
772       False
818       False
          ...  
109311     True
109312     True
109313     True
109316    False
109317    False
Name: am_rain, Length: 457, dtype: object