# Problem 4 (optional)

This is an optional task for more advanced students. 

## What to do

1. Start by downloading your own data (daily summaries from **January 1959 to August 2018**) for **Sodankyla Lokka** (type the place name without the letter `ä`) from the [NOAA Climate Data Online Search](https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND). After changing the year in the date selection panel, make sure to also click the starting day (and ending day). Once the search has finished, click “Add to cart” for the selected station, then go to the cart. Select the ``Custom GHCN-Daily Text`` format for the output file and click Continue.

    - From the `Station Detail & Data Flag Options`, choose the following two attributes: Station Name, Geographic Location. **Notice:** Do **NOT** include data flags because they make the data difficult to read. Use **Standard** units.
    - Also select Precipitation and Temperature, which are under a separate button below.
    - On the next page, enter your own email address; the weather data will be sent there after a short moment.

2. After you have downloaded the data, you should first:

    - Calculate the average temperature using columns `TMAX` and `TMIN` and insert those values into a new column called `TAVG`.

3. Next, you should use the approaches learned during this week and used in Problem 3 to answer / do the following:

    - Calculate the temperature anomalies in Sodankyla, i.e. the difference between `reference_temps` and the average temperature for each month (see Problem 3).
    - Calculate the monthly temperature differences between Sodankyla and Helsinki stations
        - How different are the summer temperatures (June, July, August) between Helsinki (used in Problems 1-3) and Sodankyla station?
        - What were the summer mean temperatures for both of these stations?
        - What were the summer standard deviations for both of these stations?
    - Calculate the monthly differences in a DataFrame and save it (as a `CSV` file) in your own Exercise repository for this week.
4. Upload your script and data to GitHub.
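
The station comparison in step 3 can be sketched as follows. This is only a minimal illustration under stated assumptions: the two small DataFrames, their values, and the column names `YEAR`, `TEMP_C_HEL` and `TEMP_C_SOD` are hypothetical stand-ins for the real Helsinki and Sodankyla monthly data.

```python
import pandas as pd

# Hypothetical monthly mean temperatures (deg C). Real values would come from
# the Helsinki data of Problems 1-3 and the Sodankyla data downloaded above.
helsinki = pd.DataFrame({
    'YEAR': [2017, 2017, 2017, 2018, 2018, 2018],
    'MONTH': [6, 7, 8, 6, 7, 8],
    'TEMP_C': [13.0, 16.0, 16.0, 15.0, 21.0, 18.0],
})
sodankyla = pd.DataFrame({
    'YEAR': [2017, 2017, 2017, 2018, 2018, 2018],
    'MONTH': [6, 7, 8, 6, 7, 8],
    'TEMP_C': [9.0, 14.0, 11.0, 11.0, 19.0, 12.0],
})

# Align the two stations on year and month, then take the difference.
monthly = pd.merge(helsinki, sodankyla, on=['YEAR', 'MONTH'],
                   suffixes=('_HEL', '_SOD'))
monthly['DIFF'] = monthly['TEMP_C_HEL'] - monthly['TEMP_C_SOD']

# Keep only the summer months and compute mean and standard deviation per station.
summer = monthly[monthly['MONTH'].isin([6, 7, 8])]
summer_stats = summer[['TEMP_C_HEL', 'TEMP_C_SOD']].agg(['mean', 'std'])

# Without a path, to_csv returns the CSV text; pass a file name to write to disk.
csv_text = monthly.to_csv(index=False)
```

Merging on both `YEAR` and `MONTH` keeps each month of each year as its own row, so `DIFF` is a true month-by-month difference rather than a difference of long-term averages.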

In [19]:
import pandas as pd

In [21]:
fp = 'data/sodlokka.csv'
# The raw string avoids Python's invalid-escape warning for '\s';
# skiprows=[1] skips the second line of the file
sodlokka = pd.read_csv(fp, sep=r'\s+', skiprows=[1], na_values=-9999)


In [23]:
sodlokka.head()

Unnamed: 0,Unnamed: 1,STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TMAX,TMIN
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,19590101,0.03,,9.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,19590102,0.0,,6.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,19590103,0.02,,-9.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,19590104,0.08,,10.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,19590105,0.09,,13.0


In [25]:
sodlokka['TAVG'] = (sodlokka['TMAX'] + sodlokka['TMIN']) / 2

In [27]:
sodlokka.tail(10)

Unnamed: 0,Unnamed: 1,STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TMAX,TMIN,TAVG
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180822,0.05,60.0,31.0,45.5
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180823,0.02,53.0,39.0,46.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180824,0.42,61.0,33.0,47.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180825,0.09,63.0,49.0,56.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180826,0.19,62.0,48.0,55.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180827,0.04,55.0,43.0,49.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180828,0.0,59.0,31.0,45.0
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180829,0.0,65.0,32.0,48.5
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180830,0.02,65.0,48.0,56.5
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180831,0.0,59.0,46.0,52.5


In [29]:
def fahr_to_cel(temp_f):
    temp_c = 5/9 * (temp_f - 32)
    return temp_c

In [31]:
sodlokka['TEMP_C'] = sodlokka['TAVG'].apply(fahr_to_cel).round(2)
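
As a side note, the same conversion can probably be applied to the whole column at once, because the arithmetic in the formula works element-wise on a pandas Series. The sketch below uses the same formula as `fahr_to_cel`; the sample values in `temps_f` are made up for illustration.

```python
import pandas as pd

def fahr_to_cel(temp_f):
    """Convert Fahrenheit to Celsius; accepts a scalar or a pandas Series."""
    return 5 / 9 * (temp_f - 32)

# Made-up Fahrenheit values, including a missing observation.
temps_f = pd.Series([32.0, 49.0, 212.0, float('nan')])

# No .apply() needed: the subtraction and multiplication act on every element.
temps_c = fahr_to_cel(temps_f).round(2)
```

Missing values stay missing (`NaN`) after the conversion, which matches what happens to the early rows where `TMAX` is not available.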

In [33]:
sodlokka.tail()

Unnamed: 0,Unnamed: 1,STATION,STATION_NAME,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,TMAX,TMIN,TAVG,TEMP_C
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180827,0.04,55.0,43.0,49.0,9.44
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180828,0.0,59.0,31.0,45.0,7.22
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180829,0.0,65.0,32.0,48.5,9.17
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180830,0.02,65.0,48.0,56.5,13.61
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,240,67.8206,27.7503,20180831,0.0,59.0,46.0,52.5,11.39


In [40]:
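# Copy the selection so that adding new columns later does not trigger a SettingWithCopyWarning
sodlokka = sodlokka[['STATION', 'STATION_NAME', 'DATE', 'PRCP', 'TMAX', 'TMIN', 'TEMP_C']].copy()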
sodlokka.tail()

Unnamed: 0,Unnamed: 1,STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN,TEMP_C
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180827,0.04,55.0,43.0,9.44
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180828,0.0,59.0,31.0,7.22
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180829,0.0,65.0,32.0,9.17
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180830,0.02,65.0,48.0,13.61
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180831,0.0,59.0,46.0,11.39


In [45]:
sodlokka['DATE_STR'] = sodlokka['DATE'].astype(str)
sodlokka.tail()

Unnamed: 0,Unnamed: 1,STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN,TEMP_C,DATE_STR
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180827,0.04,55.0,43.0,9.44,20180827
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180828,0.0,59.0,31.0,7.22,20180828
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180829,0.0,65.0,32.0,9.17,20180829
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180830,0.02,65.0,48.0,13.61,20180830
GHCND:FIE00146538,SODANKYLA,LOKKA,FI,20180831,0.0,59.0,46.0,11.39,20180831


In [53]:
sodlokka.dtypes

STATION          object
STATION_NAME     object
DATE              int64
PRCP            float64
TMAX            float64
TMIN            float64
TEMP_C          float64
DATE_STR         object
dtype: object

In [59]:
sodlokka['MONTH'] = sodlokka['DATE_STR'].str.slice(start=4, stop=6).astype(int)


KeyError: 36
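
An alternative to slicing the date string is to parse the dates and read the month from the datetime accessor. This is a sketch of that approach, not the method used in this notebook; the `dates` values are made up, in the same `YYYYMMDD` integer form as the `DATE` column.

```python
import pandas as pd

# Made-up dates stored as integers in YYYYMMDD form.
dates = pd.Series([19590101, 20180827, 20181231])

# Convert to strings first so that the explicit format can be applied.
parsed = pd.to_datetime(dates.astype(str), format='%Y%m%d')

# The .dt accessor exposes the calendar components as integer columns.
months = parsed.dt.month
years = parsed.dt.year
```

Parsing also rejects malformed dates with an error, whereas string slicing would silently produce a wrong month.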

In [79]:
grouped = sodlokka.groupby(by=['MONTH'])

In [95]:
# Long-term mean temperature for each calendar month
reference_temps = grouped['TEMP_C'].mean()


In [97]:
reference_temps

MONTH
1    -14.682880
2    -14.127404
3     -9.549348
4     -3.118256
5      3.883856
6     10.388167
7     13.543868
8     10.975940
9      5.746526
10    -0.856893
11    -7.617076
12   -12.353497
Name: TEMP_C, dtype: float64

In [99]:
reference_temps.reset_index()

Unnamed: 0,MONTH,TEMP_C
0,1,-14.68288
1,2,-14.127404
2,3,-9.549348
3,4,-3.118256
4,5,3.883856
5,6,10.388167
6,7,13.543868
7,8,10.97594
8,9,5.746526
9,10,-0.856893



In [107]:
merged = pd.merge(sodlokka, reference_temps, on='MONTH')

In [109]:
merged

Unnamed: 0,STATION,STATION_NAME,DATE,PRCP,TMAX,TMIN,TEMP_C_x,DATE_STR,MONTH,TEMP_C_y
0,LOKKA,FI,19590101,0.03,,9.0,,19590101,1,-14.68288
1,LOKKA,FI,19590102,0.00,,6.0,,19590102,1,-14.68288
2,LOKKA,FI,19590103,0.02,,-9.0,,19590103,1,-14.68288
3,LOKKA,FI,19590104,0.08,,10.0,,19590104,1,-14.68288
4,LOKKA,FI,19590105,0.09,,13.0,,19590105,1,-14.68288
...,...,...,...,...,...,...,...,...,...,...
20907,LOKKA,FI,20180827,0.04,55.0,43.0,9.44,20180827,8,10.97594
20908,LOKKA,FI,20180828,0.00,59.0,31.0,7.22,20180828,8,10.97594
20909,LOKKA,FI,20180829,0.00,65.0,32.0,9.17,20180829,8,10.97594
20910,LOKKA,FI,20180830,0.02,65.0,48.0,13.61,20180830,8,10.97594


In [137]:
merged.rename(columns={'TEMP_C_x': 'TEMP_C', 'TEMP_C_y': 'REFERENCE_TEMP'}, inplace=True)

In [139]:
print(merged)

      STATION STATION_NAME      DATE  PRCP  TMAX  TMIN  TEMP_C  DATE_STR  \
0       LOKKA           FI  19590101  0.03   NaN   9.0     NaN  19590101   
1       LOKKA           FI  19590102  0.00   NaN   6.0     NaN  19590102   
2       LOKKA           FI  19590103  0.02   NaN  -9.0     NaN  19590103   
3       LOKKA           FI  19590104  0.08   NaN  10.0     NaN  19590104   
4       LOKKA           FI  19590105  0.09   NaN  13.0     NaN  19590105   
...       ...          ...       ...   ...   ...   ...     ...       ...   
20907   LOKKA           FI  20180827  0.04  55.0  43.0    9.44  20180827   
20908   LOKKA           FI  20180828  0.00  59.0  31.0    7.22  20180828   
20909   LOKKA           FI  20180829  0.00  65.0  32.0    9.17  20180829   
20910   LOKKA           FI  20180830  0.02  65.0  48.0   13.61  20180830   
20911   LOKKA           FI  20180831  0.00  59.0  46.0   11.39  20180831   

       MONTH  REFERENCE_TEMP  
0          1       -14.68288  
1          1       -14.68

In [141]:
merged['DIFF'] = merged['TEMP_C'] - merged['REFERENCE_TEMP']

In [143]:
print(merged)

      STATION STATION_NAME      DATE  PRCP  TMAX  TMIN  TEMP_C  DATE_STR  \
0       LOKKA           FI  19590101  0.03   NaN   9.0     NaN  19590101   
1       LOKKA           FI  19590102  0.00   NaN   6.0     NaN  19590102   
2       LOKKA           FI  19590103  0.02   NaN  -9.0     NaN  19590103   
3       LOKKA           FI  19590104  0.08   NaN  10.0     NaN  19590104   
4       LOKKA           FI  19590105  0.09   NaN  13.0     NaN  19590105   
...       ...          ...       ...   ...   ...   ...     ...       ...   
20907   LOKKA           FI  20180827  0.04  55.0  43.0    9.44  20180827   
20908   LOKKA           FI  20180828  0.00  59.0  31.0    7.22  20180828   
20909   LOKKA           FI  20180829  0.00  65.0  32.0    9.17  20180829   
20910   LOKKA           FI  20180830  0.02  65.0  48.0   13.61  20180830   
20911   LOKKA           FI  20180831  0.00  59.0  46.0   11.39  20180831   

       MONTH  REFERENCE_TEMP     DIFF  
0          1       -14.68288      NaN  
1      
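
The merge-and-subtract steps above could also be written with `groupby(...).transform('mean')`, which returns the group mean aligned to every original row. This is a sketch on made-up data; the `daily` frame is hypothetical, and only the column names follow the notebook.

```python
import pandas as pd

# Made-up daily temperatures for two calendar months.
daily = pd.DataFrame({
    'MONTH': [1, 1, 7, 7],
    'TEMP_C': [-10.0, -14.0, 12.0, 16.0],
})

# transform('mean') broadcasts each month's mean back onto its rows,
# so no separate merge with the reference temperatures is needed.
daily['REFERENCE_TEMP'] = daily.groupby('MONTH')['TEMP_C'].transform('mean')

# Anomaly: observed temperature minus the monthly reference temperature.
daily['DIFF'] = daily['TEMP_C'] - daily['REFERENCE_TEMP']
```

Unlike the merge, this keeps the original row order and column names, so no renaming of `TEMP_C_x` and `TEMP_C_y` is required.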

In [149]:
print(merged['DIFF'].idxmax())

14835


In [151]:
# Select the row by its index label; merged[14835] would look for a column named 14835
print(merged.loc[14835])
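
Plain square brackets on a DataFrame select columns, so an index label such as the one returned by `idxmax()` has to go through `.loc` to retrieve a row. A minimal sketch with a made-up frame (`frame` and its values are hypothetical):

```python
import pandas as pd

# Made-up anomalies with non-default index labels.
frame = pd.DataFrame({'DIFF': [0.5, 3.2, -1.0]}, index=[10, 20, 30])

# idxmax() returns the index label of the largest value, not its position.
label = frame['DIFF'].idxmax()

# .loc looks the label up in the row index and returns that row as a Series.
row = frame.loc[label]

# frame[label] would instead search the column names and raise a KeyError.
try:
    frame[label]
    raised = False
except KeyError:
    raised = True
```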