# Real world data - TAWES

## TAWES weather data example

TAWES weather data for Kapfenberg from 1997 to 2018 is stored in `data/tawes/Messstationen Tagesdaten v2 Datensatz_19970101_20241001_kapfenberg.csv`.
CSV (comma-separated values) files, among many other file formats, can be read with pandas.

In [None]:
import pandas
df = pandas.read_csv('../data/tawes/Messstationen Tagesdaten v2 Datensatz_19970101_20241001_kapfenberg.csv')
df.head()

`rr` is the total precipitation for the day, `cglo_j` is the global irradiance, `tl_mittel` is the average air temperature, `tlmin` and `tlmax` are the temperature extremes of the day, `vv_mittel` is the average wind speed, and `p_mittel` is the average air pressure.
However, the first rows contain no values. NaN is short for Not a Number: where a cell in an otherwise numeric column is empty, pandas simply fills it with this NaN marker.
The first row in the CSV file looks like this:
`1997-01-01T00:00+00:00,13305,,,,,`
because the weather station only became operational later that year, and it stayed operational only until 2018.
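
To see how empty CSV fields turn into NaN, and how many there are per column, here is a minimal sketch on a toy CSV. The column names (including `station`) and all values are made up for illustration and are not taken from the real file.

```python
import io
import pandas

# Toy CSV text (made-up values); empty fields between commas have no value.
csv_text = """time,station,rr,tl_mittel,tlmax,tlmin
1997-01-01T00:00+00:00,13305,,,,
1997-03-01T00:00+00:00,13305,0.0,4.1,9.3,-0.5
1997-03-02T00:00+00:00,13305,1.2,,8.0,1.1
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
toy = pandas.read_csv(io.StringIO(csv_text))

# isna() marks missing cells with True; summing counts them per column.
nan_counts = toy.isna().sum()
print(nan_counts)
```

Running `df.isna().sum()` on the real DataFrame gives the same kind of overview and shows which columns are affected most.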

So how do we get rid of those NaNs?


In [None]:
df.dropna().head()

However, this drops every row that contains a NaN in any column. `cglo_j` was not available until later (2002), so we lose a lot of data. We want to know at least when all temperature values were available.

In [None]:
df.dropna(subset=['tl_mittel', 'tlmax', 'tlmin']).head()

In [None]:
df.dropna(subset=['tl_mittel', 'tlmax', 'tlmin']).tail()

So rows 59 through 7810 seem to cover the period in which the weather station was operational. We can slice the DataFrame by row position like we would a Python list; the end of the slice is exclusive, hence 7811.

In [None]:
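# Keep rows 59..7810; copy() so that adding columns later does not
# trigger a SettingWithCopyWarning on the slice.
df = df[59:7811].copy()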

There are more elaborate ways to do this, such as indexing the DataFrame by time instead of by row position, but more on that later.

In [None]:
df
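
The row bounds used for the slice above were read off the `head()` and `tail()` output by hand. As a hedged alternative, they can also be found programmatically. The sketch below uses a toy frame with made-up values; `complete`, `first` and `last` are illustrative names.

```python
import numpy as np
import pandas

# Toy frame (made-up values): NaN before and after the "operational" period,
# plus one gap inside it.
toy = pandas.DataFrame({
    'tl_mittel': [np.nan, np.nan, 3.0, 4.5, np.nan, 5.0, np.nan],
    'tlmax':     [np.nan, np.nan, 7.0, 9.0, 8.0, 9.5, np.nan],
    'tlmin':     [np.nan, np.nan, -1.0, 0.5, 1.0, 2.0, np.nan],
})

# True for rows where all three temperature columns hold a value.
complete = toy[['tl_mittel', 'tlmax', 'tlmin']].notna().all(axis=1)

# Index labels of the first and last complete row.
valid_labels = complete[complete].index
first, last = valid_labels[0], valid_labels[-1]

# Unlike positional slicing, .loc includes both end labels.
trimmed = toy.loc[first:last].copy()
print(first, last, len(trimmed))
```

Note that rows with gaps inside the operational period are kept; only the leading and trailing empty rows are cut off.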


## Accessing columns

Columns can be accessed by name, one at a time or several at once, much like keys in a dictionary:

In [None]:
df['tl_mittel']

In [None]:
df[['tl_mittel', 'tlmin', 'tlmax']]

## Basic maths operations

Let's do some basic operations. Make a new column with the temperature difference tlmax - tlmin.

In [None]:
df['tl_diff'] = df['tlmax'] - df['tlmin']
df.head()

**Exercise** *: Create a new column named 'freezing' that contains True if the minimum temperature was below 0 and False otherwise.

**Exercise** **: Instead of True and False, this freezing column should contain 1 and 0 (1 if True, 0 if False).

## Basic stats

Get some basic statistics on the data using describe().

In [None]:
df.describe()

## Conditional slicing, finding and counting occurrences

When was the coldest day in Kapfenberg? We can see above that tlmin had a lowest value of -20, but on which day?
We can use conditions that evaluate to True or False (like the freezing one above) as indexers.

In [None]:
df['tlmin'] == -20.0

In [None]:
df[df['tlmin'] == -20.0]
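
Comparing against a hard-coded `-20.0` only works because the minimum was read off `describe()` first. A possible alternative, sketched here on a toy frame with made-up dates and values, is to let pandas locate the minimum itself with `idxmin()`, or to compare against `min()`.

```python
import pandas

# Toy data (made-up dates and temperatures).
toy = pandas.DataFrame({
    'time': ['2012-02-01', '2012-02-02', '2012-02-03', '2012-02-04'],
    'tlmin': [-12.5, -18.0, -15.5, -3.0],
})

# idxmin() returns the index label of the (first) smallest value.
coldest_label = toy['tlmin'].idxmin()
coldest_row = toy.loc[coldest_label]

# Comparing against min() returns every row that ties for the minimum.
all_coldest = toy[toy['tlmin'] == toy['tlmin'].min()]
print(coldest_row)
```

`idxmin()` reports only the first occurrence, so the boolean mask is the safer choice when several days might share the same minimum.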


We can use this to select whole ranges of data where some condition applies. E.g. select all data where it was freezing.

In [None]:
df['freezing'] = (df['tlmin'] < 0).astype(int)
df[df['freezing'] == 1]

We can use value_counts to see how often each value occurs in a column.

In [None]:
df['freezing'].value_counts()
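
`value_counts` can also report relative frequencies instead of absolute counts via its `normalize` parameter. A small sketch on a made-up 0/1 series:

```python
import pandas

# Made-up 0/1 flags standing in for the 'freezing' column.
freezing = pandas.Series([1, 0, 0, 1, 0, 0, 0, 1], name='freezing')

# Absolute counts per distinct value.
counts = freezing.value_counts()

# Fractions of the total; they sum to 1.
shares = freezing.value_counts(normalize=True)

print(counts.loc[1], shares.loc[1])  # 3 of 8 entries are 1
```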

**Exercise** **: On how many days was the temperature below freezing all day (look at tlmax), and what percentage of all days does this represent?

## Time indexed DataFrames

Pandas supports datetime indices. To use one, convert the time column to datetimes and set it as the index.

In [None]:
df['time'] = pandas.to_datetime(df['time'], utc=True)
df = df.set_index('time', drop=True)

In [None]:
df.head()


Now we can index rows based on times.

In [None]:
df['2011-03-01 00:00': '2011-03-02 00:00']
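
With a datetime index, the strings do not have to be full timestamps. As a sketch on a toy frame with made-up values, a partial date string such as a month selects everything that falls inside it, and `.loc` makes it explicit that rows are being selected.

```python
import pandas

# Six made-up daily values from 2011-02-27 to 2011-03-04 (UTC).
idx = pandas.date_range('2011-02-27', periods=6, freq='D', tz='UTC')
toy = pandas.DataFrame({'tl_mittel': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

# A month string selects all rows within that month.
march = toy.loc['2011-03']

# Label-based slices include both end points.
two_days = toy.loc['2011-03-01':'2011-03-02']

print(len(march), len(two_days))
```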

We can now resample the DataFrame to some other resolution. Resampling to a lower frequency is called downsampling; resampling to a higher frequency is called upsampling.

When resampling, one has to specify a frequency and a method.

For example, yearly averages:


In [None]:
df.resample('1YS').mean() 
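
The mean is not the most meaningful aggregate for every column: for `tlmax` the maximum within each period may be more interesting, and for `tlmin` the minimum. One way to do this, sketched on a toy frame with made-up values, is to pass a column-to-method mapping to `agg`.

```python
import pandas

# Four made-up daily rows spanning the end of January and start of February.
idx = pandas.date_range('2011-01-30', periods=4, freq='D', tz='UTC')
toy = pandas.DataFrame({
    'tl_mittel': [0.0, 2.0, 4.0, 6.0],
    'tlmax': [3.0, 5.0, 9.0, 8.0],
    'tlmin': [-4.0, -1.0, 1.0, 2.0],
}, index=idx)

# 'MS' groups by calendar month, labelled with the month start.
# Each column gets its own aggregation method.
monthly = toy.resample('MS').agg({
    'tl_mittel': 'mean',
    'tlmax': 'max',
    'tlmin': 'min',
})
print(monthly)
```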

We can also resample to a higher frequency than the original data.

For example, upsampling to hourly frequency using linear interpolation:

In [None]:
small_df = df['2011-03-01 00:00': '2011-03-07 00:00'].copy() 
small_df.resample('1h').interpolate()
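
Upsampling creates new time stamps for which no measurement exists, and the chosen method decides how they are filled. The sketch below compares three options on a made-up two-day series, using a 6-hour frequency to keep the output short.

```python
import pandas

# Two made-up daily values.
idx = pandas.date_range('2011-03-01', periods=2, freq='D', tz='UTC')
daily = pandas.Series([0.0, 12.0], index=idx, name='tl_mittel')

# asfreq() only inserts the new time stamps and leaves them as NaN.
gaps = daily.resample('6h').asfreq()

# ffill() repeats the last known value until the next measurement.
held = daily.resample('6h').ffill()

# interpolate() draws a straight line between measurements by default.
linear = daily.resample('6h').interpolate()

print(pandas.DataFrame({'asfreq': gaps, 'ffill': held, 'interpolate': linear}))
```

None of these methods adds information: the interpolated values are plausible guesses between the daily measurements, not hourly observations.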


**Exercise** **: Resample the 'freezing' column to yearly frequency, computing not the mean (as in the examples above) but the sum within each year.