## Problem 1

The data for this problem comes from a subset of the National Centers for Environmental Information (NCEI) [Global Historical Climatology Network daily (GHCNd)](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily) (GHCN-Daily). GHCN-Daily comprises daily climate records from thousands of land surface stations across the globe - it's a wonderfully large dataset to play with! In particular, you will be asked to use data from the Ann Arbor, Michigan location, which is stored in the file: [assets/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv](assets/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv).

Each row in this datafile corresponds to a single observation from a weather station, and has the following variables:

- __id__: station identification code
- __date__: date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
- __element__: indicator of element type
    - _TMAX_: Maximum temperature (tenths of degrees F)
    - _TMIN_: Minimum temperature (tenths of degrees F)
- __value__: data value for element (tenths of degrees F)

For this assignment, you must:

1. Read the documentation and familiarize yourself with the dataset, then write a Python notebook that plots line graphs of the record high and record low temperatures by day of the year over the period 2005-2014. The area between the record high and record low temperatures for each day should be shaded.
2. Overlay a scatter plot of the 2015 data for any points (highs and lows) for which the ten-year (2005-2014) record high or record low was broken in 2015. (Based on the graph, do you think extreme weather was more frequent in 2015?)
3. Watch out for leap days (i.e. February 29th); it is reasonable to remove these points from the dataset for the purpose of this visualization.
4. Make the visual nice! Leverage principles from the first module in this course when developing your solution. Consider issues such as legends, labels, and chart junk.

Your final output should look like the following:

![](chris_sketch.png)

# Workshop 3

After learning about _regular expressions_ and _data manipulation_, it is time to check how much you have learned about __data visualization__.

### Step 1.1

Load the dataset and transform the data into _Celsius_ (refer to documentation). Then extract all of the rows which have minimum or maximum temperatures, each one in a separate dataframe.

__Hint: when I did this step I had two DataFrame objects, each with ~80,000 entries in it__
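
The conversion and split described in this step can be sketched on a tiny made-up table. This is only an illustration under assumptions: it takes the variable list above at face value (values stored in tenths of degrees Fahrenheit), so it divides by 10 before converting, and the GHCN-Daily documentation remains the authority on the actual unit. The helper name `tenths_fahrenheit_to_celsius`, the station codes, and the sample values are hypothetical; the column names follow the printed output of the dataset.

```python
import pandas as pd


def tenths_fahrenheit_to_celsius(tenths_f):
    """Convert tenths of degrees Fahrenheit to degrees Celsius (assumed unit)."""
    fahrenheit = tenths_f / 10  # tenths of a degree -> whole degrees
    return (fahrenheit - 32) * 5 / 9


# Made-up readings from two hypothetical stations
sample = pd.DataFrame({
    "ID": ["S1", "S1", "S2", "S2"],
    "Date": ["2005-01-01"] * 4,
    "Element": ["TMAX", "TMIN", "TMAX", "TMIN"],
    "Data_Value": [500, 320, 2120, -400],
})

# The arithmetic is vectorized, so the whole column converts in one call
sample["Data_Value"] = tenths_fahrenheit_to_celsius(sample["Data_Value"])

# One DataFrame per element type
tmax = sample[sample["Element"] == "TMAX"]
tmin = sample[sample["Element"] == "TMIN"]
```

Passing the whole column to the function avoids a row-by-row `apply` and gives the same values.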

In [123]:
import pandas as pd

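def farenheit_to_celsius(farenheit):
    # YOUR CODE HERE
    # Standard Fahrenheit -> Celsius conversion. Note: the variable list above
    # gives values in tenths of a degree; this function does not divide by 10.
    return ((farenheit-32)*5)/9

# load the data
data = pd.read_csv("assets/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv")
# convert the whole column at once (the arithmetic is vectorized over the Series)
data["Data_Value"] = farenheit_to_celsius(data["Data_Value"])
# YOUR CODE HERE
# split the rows by element type into separate DataFrames
temperature_min_df = data[data["Element"]=="TMIN"]
temperature_max_df = data[data["Element"]=="TMAX"]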

print(temperature_max_df)

                 ID        Date Element  Data_Value
0       USW00094889  2014-11-12    TMAX   -5.555556
2       USC00200032  2008-05-26    TMAX  136.666667
3       USC00205563  2005-11-11    TMAX   59.444444
4       USC00200230  2014-02-27    TMAX  -76.666667
5       USW00014833  2010-10-01    TMAX   90.000000
...             ...         ...     ...         ...
165074  USW00094889  2009-07-09    TMAX  127.222222
165076  USC00205050  2013-09-29    TMAX  127.222222
165080  USC00205822  2015-06-09    TMAX  124.444444
165081  USC00205822  2009-10-06    TMAX   75.000000
165082  USC00205050  2014-07-14    TMAX  139.444444

[83063 rows x 4 columns]


In [None]:
# space for professor's tests


### Step 1.2

In order to visualize the data we would _plot_ the __min__ and _max_ data for each day of the year between the years $2005$ and $2014$ across all weather stations. But we also need to find out _when_ the __min__ or __max__ temperature in $2015$ falls below the min or rises above the max for the _previous decade_.

If you did _Step 1.1_ you have two _DataFrame_ objects with the _min_ and _max_ temperatures for the years $2005$ through $2015$. You can use Pandas `groupby` to create _max_ and _min_ temperature _Series_ objects across all weather stations for each day of these years, and you can deal with the records for _February 29_ (the leap day) by dropping them.

__Hint: When I finished this step, I had two DataFrame objects, each with exactly 4015 observations in them__
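
As a rough illustration of this step, the sketch below drops February 29 and then takes the highest reading across stations for each date. The station codes and values are made up, and the variable names are hypothetical; only the column names come from the dataset's printed output. The same pattern with `.min()` would apply to the minimum temperatures.

```python
import pandas as pd

# Made-up readings from two hypothetical stations, including a leap day
tmax = pd.DataFrame({
    "ID": ["S1", "S2", "S1", "S2", "S1"],
    "Date": ["2012-02-28", "2012-02-28", "2012-02-29", "2012-02-29", "2012-03-01"],
    "Element": ["TMAX"] * 5,
    "Data_Value": [1.0, 3.5, 2.0, 4.0, 5.5],
})

# Drop the leap-day records (dates are YYYY-MM-DD strings)
no_leap = tmax[~tmax["Date"].str.endswith("02-29")]

# Highest reading across all stations for each remaining date
max_per_day = no_leap.groupby("Date", as_index=False)["Data_Value"].max()
```

With leap days removed, eleven years of 365 days give the 4015 rows mentioned in the hint.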

In [130]:
# create a DataFrame of maximum temperature by date
# YOUR CODE HERE
temperature_max_df = temperature_max_df[~temperature_max_df['Date'].str.endswith('02-29')]
#temperature_max_per_day_df = temperature_max_df.reset_index().groupby(["Date","Data_Value"],observed=False)

# create a DataFrame of minimum temperatures by date
# YOUR CODE HERE    
temperature_min_per_day_df = ...

                 ID        Date Element  Data_Value
123332  USC00205563  2005-05-12    TMAX   77.777778
123350  USC00200842  2005-05-12    TMAX   99.444444
123382  USC00208080  2005-05-12    TMAX   62.222222
124468  USC00205451  2005-05-12    TMAX   99.444444
124768  USC00200230  2005-05-12    TMAX   59.444444
126240  USW00094889  2005-05-12    TMAX   56.111111
126872  USW00004848  2005-05-12    TMAX   41.111111
127264  USC00207308  2005-05-12    TMAX   99.444444
127799  USC00200228  2005-05-12    TMAX   96.666667
127875  USC00207312  2005-05-12    TMAX   96.666667
128367  USC00208202  2005-05-12    TMAX   96.666667
135921  USC00205050  2005-05-12    TMAX   62.222222
145603  USW00014833  2005-05-12    TMAX   62.222222
145705  USW00014853  2005-05-12    TMAX   59.444444
146025  USC00202308  2005-05-12    TMAX  102.777778
146364  USC00207320  2005-05-12    TMAX  108.888889
146536  USC00205822  2005-05-12    TMAX  111.666667


In [None]:
# space for professor's tests

### Step 1.3
Now that you have grouped the _daily_ __max__ and __min__ temperatures for each day of the years $2005$ through $2015$, you can separate out the data for $2015$. Then you can use the Pandas `groupby` function to find the __max__ and __min__ of the temperature data for each __day of the year__ for the $2005$ through $2014$ data.

__Hint: At the end of this step I had two DataFrames, one of maximum and the other of minimum values, which each had 365 observations in them. I also had another pair of similar DataFrames but only for the year 2015.__
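
One possible way to approach this step is sketched below on made-up daily maxima: split off 2015, then group the remaining years by a month-day key. The `Day` column and all variable names are hypothetical, and the month-day string is just one choice of key, not necessarily the intended one.

```python
import pandas as pd

# Made-up daily maxima (one row per date, as produced by the previous step)
daily_max = pd.DataFrame({
    "Date": ["2005-01-01", "2010-01-01", "2014-01-02", "2015-01-01", "2015-01-02"],
    "Data_Value": [1.0, 4.0, 2.5, 6.0, 2.0],
})

dates = pd.to_datetime(daily_max["Date"])
daily_max["Day"] = dates.dt.strftime("%m-%d")  # month-day key, e.g. "01-01"
is_2015 = dates.dt.year == 2015

# Record high per day of the year over 2005 through 2014
record_max = daily_max[~is_2015].groupby("Day")["Data_Value"].max()

# The 2015 values, indexed the same way for easy comparison
max_2015 = daily_max[is_2015].set_index("Day")["Data_Value"]

# 2015 days on which the ten-year record high was broken
broken_high = max_2015[max_2015.gt(record_max)]
```

A month-day key is used here instead of `dt.dayofyear` because the day-of-year number shifts by one after February in leap years, even once February 29 itself has been dropped.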

In [None]:
# calculate the minimum and maximum values for the day of the year for 2005 through 2014
# YOUR CODE HERE
temperature_max_day_year_df = ...

temperature_min_day_year_df = ...

# calculate the minimum and maximum values for the year 2015
# YOUR CODE HERE
temperature_max_per_day_2015_df = ...

# YOUR CODE HERE
temperature_min_per_day_2015_df = ...

In [None]:
# space for professor's tests

### Step 1.4

Now it's time to plot! You need to explore _matplotlib_ in order to plot line graphs of the _min_ and _max_ temperatures for the years $2005$ through $2014$ and to scatter plot __only__ the daily $2015$ temperatures that exceeded those values.
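
The plot described here can be sketched with synthetic data, as a starting point rather than a finished solution. The sine-shaped records and random 2015 values are invented purely for illustration, and the colors, labels, and variable names are arbitrary choices. The `matplotlib.use("Agg")` line is only there so the sketch runs without a display; it can be removed inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from calendar import month_abbr

days = np.arange(1, 366)

# Synthetic stand-ins for the 2005-2014 records and the 2015 daily values
record_high = 10 + 15 * np.sin((days - 100) * 2 * np.pi / 365)
record_low = record_high - 20
rng = np.random.default_rng(0)
high_2015 = record_high + rng.normal(-3, 3, size=365)
low_2015 = record_low + rng.normal(3, 3, size=365)

# Keep only the 2015 points that broke a ten-year record
broke_high = high_2015 > record_high
broke_low = low_2015 < record_low

fig, ax = plt.subplots(figsize=(15, 10))
ax.plot(days, record_high, color="tab:red", label="Record high 2005-2014")
ax.plot(days, record_low, color="tab:blue", label="Record low 2005-2014")
ax.fill_between(days, record_low, record_high, color="lightgray", alpha=0.5)
ax.scatter(days[broke_high], high_2015[broke_high], color="darkred", s=15,
           label="2015 record high broken")
ax.scatter(days[broke_low], low_2015[broke_low], color="navy", s=15,
           label="2015 record low broken")

# Label the x axis with month names at the first day of each month (non-leap year)
month_starts = [1, 32, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335]
ax.set_xticks(month_starts)
ax.set_xticklabels(list(month_abbr)[1:])
ax.set_xlabel("Day of the year")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Record temperatures 2005-2014 and records broken in 2015")
ax.legend()
```

Using month names on the x axis and a single light shade between the two lines keeps the chart readable without adding chart junk.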

In [3]:
import matplotlib.pyplot as plt
from calendar import month_abbr

fig = plt.figure(figsize=(15, 10))
# YOUR CODE HERE

TypeError: arrow() missing 4 required positional arguments: 'x', 'y', 'dx', and 'dy'

<Figure size 1500x1000 with 0 Axes>