# Exercise 6 - Data analysis with Pandas

In this week's exercise we will continue developing our skills using Pandas to analyze climate data.

After making your changes, you will need to upload your changes to GitHub.
The answers to the questions in this week's exercise should be given by modifying the document in the requested places.

If you are uncertain about **the style of your code**, take a look at the **[PEP 8 - Style guide for Python code](https://www.python.org/dev/peps/pep-0008/)**.  

 - **Exercise 6 is due by 16:00 on 17.10.**
 - Don't forget to check out the [hints for this week's exercise](https://geo-python.github.io/2018/lessons/L6/exercise-6.html) if you're having trouble.
 - Scores on this exercise are out of 20 points.
 - There are altogether 3 problems that you should solve. Problem 4 is optional, intended for more advanced students, and does not affect grading.

## Data

For problems 1-3 in this exercise we will be using climate data from the Helsinki-Vantaa airport station.
For these problems, we have daily observations obtained from the [NOAA Global Historical Climatology Network](https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND).
The file was downloaded using the "Custom GHCN-Daily Text" output format, including the following attributes:

| Attribute                | Description                      |
|--------------------------|----------------------------------|
| `STATION`                | Unique ID of the weather station |
| `ELEVATION`              | Elevation of the station         |
| `LATITUDE` , `LONGITUDE` | Coordinates of the station       |
| `DATE`                   | Date of the measurement          |
| `PRCP`                   | Precipitation                    |
| `TAVG`                   | Average temperature              |
| `TMAX`                   | Maximum temperature              |
| `TMIN`                   | Minimum temperature              |

The file for this problem is exactly as available from the NOAA website. You can take a [look at the data](data/1091402.txt).

**Note** once again that temperatures in this dataset are given in degrees Fahrenheit.

Additional information about the data format can be found in the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6-hints.html).


# Problem 1 - Reading in a tricky data file (5 points)

#### Overview

Your first task for this exercise is to read the data file (`data/1091402.txt`) into a variable called **`data`**.
This should be done using the `read_csv()` function in Pandas, and the resulting DataFrame should have the following attributes:

  - The numerical values for rainfall and temperature read in as numbers
  - The second row of the data file should be skipped, but the text labels for the columns should be taken from the first row
  - The no-data values (assigned with value **`-9999`**) should properly be converted to `NaN`

After successfully reading the data file, you should find answers programmatically to the specific questions below, and upload your notebook to **your own repository** for this week's exercise.

You can find hints about how to do these things in the [description of Exercise 5 Problem 1](https://github.com/Geo-Python-2018/Exercise-5/blob/master/Pandas/Exercise-5-problem1.ipynb) and the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6.html).


1. Read the file into variable **data**
   - Skip the second row
   - Convert the no-data values into `NaN` (values -9999)
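As a minimal sketch of the reading options listed above, the snippet below parses a tiny in-memory stand-in for the data file. The sample text, the station ID `X1`, and the variable name `demo` are made up for illustration; only the whitespace separator, `skiprows`, and `na_values` follow the task description.

```python
import io

import pandas as pd

# Hypothetical miniature file: header row, a dashed second row, then data rows
sample = io.StringIO(
    "STATION DATE PRCP TAVG\n"
    "------- ---- ---- ----\n"
    "X1 19520101 0.31 37\n"
    "X1 19520102 -9999 35\n"
    "X1 19520103 0.14 -9999\n"
)

# Split on whitespace, skip the second row (index 1), and treat -9999 as NaN
demo = pd.read_csv(sample, sep=r"\s+", skiprows=[1], na_values=[-9999])

print(demo.dtypes)
print(demo.isna().sum())
```

Because the dashed row is skipped and `-9999` is mapped to `NaN`, the `PRCP` and `TAVG` columns are parsed as numbers rather than text.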

In [1]:
# YOUR CODE HERE
# raise NotImplementedError()
import pandas as pd

In [2]:
fp = r"data/1091402.txt"
data = pd.read_csv(fp, sep=r'\s+', skiprows=[1], na_values=-9999)

# Data columns
print(data.columns)
# dataframe shape
print(data.shape)

Index(['STATION', 'ELEVATION', 'LATITUDE', 'LONGITUDE', 'DATE', 'PRCP', 'TAVG',
       'TMAX', 'TMIN'],
      dtype='object')
(23716, 9)


In [3]:
# Test print that should work
print(data.head(1))


             STATION  ELEVATION  LATITUDE  LONGITUDE      DATE  PRCP  TAVG  \
0  GHCND:FIE00142080         51   60.3269    24.9603  19520101  0.31  37.0   

   TMAX  TMIN  
0  39.0  34.0  


- How many no-data values (NaN) are there for **`TAVG`**?
  - Assign your answer into a variable called **`tavg_nodata_count`**
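One way to count missing values, sketched here on a made-up Series (the values and the names `temps`, `n_missing`, and `n_valid` are illustrative assumptions), is to combine `isna()` with `sum()`. The `count()` method gives the complementary number of valid values.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature column with two missing observations
temps = pd.Series([37.0, np.nan, 33.0, np.nan, 27.0], name="TAVG")

n_missing = temps.isna().sum()  # number of NaN entries
n_valid = temps.count()         # number of non-NaN entries

print(n_missing, n_valid, len(temps))
```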


In [4]:
# How many no-data values?
tavg_nodata_count = data.TAVG.isna().sum()
# YOUR CODE HERE
# raise NotImplementedError()

In [5]:
# This test print should print a number
print(tavg_nodata_count)


3308


- How many no-data values (NaN) are there for `TMIN`?
  - Assign your answer into a variable called **`tmin_nodata_count`**


In [6]:
# How many no-data values?
tmin_nodata_count = data['TMIN'].isna().sum()

# YOUR CODE HERE
# raise NotImplementedError()

In [7]:
# This test print should print a number
print(tmin_nodata_count)


365


- How many days total are covered by this data file?
  - Assign your answer into a variable called **`day_count`**


In [8]:
# How many days?
day_count = data.DATE.count()

# YOUR CODE HERE
# raise NotImplementedError()

In [9]:
# This test print should print a number
print(day_count)


23716


- When was the first observation made (i.e. the oldest)?
  - Assign your answer into a variable called **`first_obs`**


In [10]:
# YOUR CODE HERE
# raise NotImplementedError()
# Sort the data in ascending order
data = data.sort_values(by = 'DATE', ascending = True)

# Use iloc[<index_num>, dataframe.columns.get_loc(<'column'>)] to return the value of specific column
first_obs = data.iloc[0, data.columns.get_loc('DATE')]

In [11]:
# This test print should print a number
print(first_obs)


19520101


- When was the last observation made (i.e. the most recent)?
  - Assign your answer into a variable called **`last_obs`**
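Because the dates are stored as `YYYYMMDD` integers, numeric order matches chronological order. As a sketch under that assumption, the oldest and most recent observation can also be found with `min()` and `max()` instead of sorting. The Series below and the names `first_demo` and `last_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical, unsorted dates in YYYYMMDD integer form
dates = pd.Series([19520103, 19520101, 20171004, 19991231], name="DATE")

first_demo = dates.min()  # oldest observation
last_demo = dates.max()   # most recent observation

print(first_demo, last_demo)
```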

In [12]:
# YOUR CODE HERE
# raise NotImplementedError()

# Sort the data in descending order
last_obs = data.sort_values(by='DATE', ascending=False).iloc[0, data.columns.get_loc('DATE')]

In [13]:
# This test print should print a number
print(last_obs)


20171004


- What was the average temperature of the whole data file (all years)?
  - Assign your answer into a variable called **`avg_temp`**

In [14]:
# YOUR CODE HERE
# raise NotImplementedError()
avg_temp = data.TAVG.mean()

In [15]:
# This test print should print a number
print(avg_temp)


41.32408859270874


- What was the average **`TMAX`** temperature of the ``Summer 69`` (i.e. the months May, June, July, and August of the year 1969)?
  - Assign your answer into a variable called **`avg_temp_69`**

In [16]:
# YOUR CODE HERE
# raise NotImplementedError()

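# Filter the rows whose DATE falls between 1 May 1969 and 31 August 1969
avg_temp_69 = data.loc[(data.DATE >= 19690501) & (data.DATE <= 19690831)]

# Assign the mean value of column TAVG to a variable
# Note: the question above asks about TMAX, not TAVG. This TAVG mean comes out
# as NaN, which suggests TAVG has no valid values in this period.
avg_temp_69 = avg_temp_69.TAVG.mean()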

In [17]:
# This test print should print a number
print(avg_temp_69)


nan
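The `nan` above comes from averaging a column with no valid values in the selected period. The sketch below uses made-up numbers (the DataFrame `demo` and its values are assumptions, not taken from the data file) to show the same date filter applied to `TMAX`, the column the question asks about.

```python
import numpy as np
import pandas as pd

# Hypothetical rows around the summer of 1969; TAVG is missing throughout
demo = pd.DataFrame({
    "DATE": [19690430, 19690501, 19690715, 19690831, 19690901],
    "TAVG": [np.nan] * 5,
    "TMAX": [50.0, 60.0, 80.0, 70.0, 55.0],
})

# between() is inclusive at both ends by default
summer = demo.loc[demo["DATE"].between(19690501, 19690831)]

tavg_mean = summer["TAVG"].mean()  # NaN, because every value is missing
tmax_mean = summer["TMAX"].mean()  # mean of the three summer rows

print(tavg_mean, tmax_mean)
```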


# Problem 2 - Calculating monthly average temperatures (7.5 points)

For this problem your goal is to calculate monthly average temperature values in degrees Celsius from the daily values we have in the data file. You can use the approaches taught in Lessons 4, 5 and 6 to solve this.
You can again consult the [hints for Exercise 6](https://geo-python.github.io/2018/lessons/L6/exercise-6-hints.html) if you are stuck.

**You can continue working with the same data that you used in Problem 1.**

#### For this problem you should:

1. Calculate the monthly average temperatures for the entire data file (i.e. for each month of each year separately) using the approach taught in the lecture.
    - You should store the average temperatures into a new Pandas DataFrame called **`monthly_data`**.
2. Create a new column called **`temp_celsius`** in the **`monthly_data`** DataFrame that has the monthly temperatures in Celsius.   
   - Also store the information about the date in column **`DATE_m`** (which should be a string column with month and year info) and the **`TAVG`** values in the `monthly_data` DataFrame.
3. Update and commit your changes to the notebook in your **own repository** of this week's exercise.
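As a rough sketch of the first preparation steps, the snippet below derives a year-and-month key from integer dates and converts Fahrenheit values to Celsius for a whole column at once. The DataFrame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical daily observations in Fahrenheit
demo = pd.DataFrame({
    "DATE": [19520101, 19520102, 19520201],
    "TAVG": [32.0, 50.0, 41.0],
})

# Keep the first six characters (YYYYMM) of the date as a string key
demo["DATE_m"] = demo["DATE"].astype(str).str.slice(start=0, stop=6)

# Column-wise conversion from Fahrenheit to Celsius
demo["temp_celsius"] = (demo["TAVG"] - 32) / 1.8

print(demo)
```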

In [18]:
# YOUR CODE HERE
# raise NotImplementedError()

# convert DATE column into string to slice date into year, month, day
data['DATE_str'] = data['DATE'].astype(str)

# Check data type
print(data['DATE_str'].dtype)

# Return the data type of the first value of the column DATE_str
print('\nData type of the column DATE_str is:')
print(type(data.iloc[0, data.columns.get_loc('DATE_str')]))

# Slice string column DATE_str
data['DATE_m'] = data['DATE_str'].str.slice(start=0, stop=6)

# Print head of the new column
print(data['DATE_m'].head())

# Check that the data type for the column 'DATE_m' is string
print('\nData type of the column DATE_m is:')
print(type(data.iloc[0, data.columns.get_loc('DATE_m')]))

object

Data type of the column DATE_str is:
<class 'str'>
0    195201
1    195201
2    195201
3    195201
4    195201
Name: DATE_m, dtype: object

Data type of the column DATE_m is:
<class 'str'>


In [19]:
# Create a function to convert Fahrenheit to Celsius
def fahrToCelsius(temp_fahrenheit):
    """
    Function to convert Fahrenheit temperature into Celsius.

    Parameters
    ----------

    temp_fahrenheit: int | float
        Input temperature in Fahrenheit (should be a number)
        
    Returns
    -------
    
    Temperature in Celsius (float)
    """

    # Convert the Fahrenheit into Celsius and return it
    converted_temp = (temp_fahrenheit - 32) / 1.8
    return converted_temp
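A quick way to gain confidence in a conversion function is to check it against well-known reference points. The sketch below repeats the same formula under the hypothetical name `fahr_to_celsius` so that it runs on its own, and compares with a tolerance because floating-point division is not always exact.

```python
import math


def fahr_to_celsius(temp_fahrenheit):
    """Same formula as the function above, repeated so this snippet runs alone."""
    return (temp_fahrenheit - 32) / 1.8


# Freezing point, boiling point, and the value where both scales meet
checks = {32: 0.0, 212: 100.0, -40: -40.0}

for fahrenheit, celsius in checks.items():
    result = fahr_to_celsius(fahrenheit)
    print(fahrenheit, "F ->", result, "C")
    assert math.isclose(result, celsius, abs_tol=1e-9)
```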
    

In [20]:
""" Create a new column called temp_celsius containing monthly temperatures in Celsius """

# Name of the new column holding the daily average temperature in Celsius
col_name = 'temp_celsius'

# Convert the whole TAVG column in one step. fahrToCelsius uses only
# arithmetic, so it also works on a pandas Series, and NaN values stay NaN.
# This avoids mixing index labels with iloc positions in a row-by-row loop.
data[col_name] = fahrToCelsius(data['TAVG'])



In [21]:
# Create a new empty Dataframe
monthly_data = pd.DataFrame()

In [22]:
# Group the values by month; each DATE_m value becomes the key of one group
grouped = data.groupby('DATE_m')

# let's check what we have after grouping
# what's the type?
print('Type:\n', type(grouped))

# how many groups do we have?
print('Length:\n', len(grouped))

Type:
 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Length:
 790


In [23]:
# let's specify the first month of the data (as text)
time1 = '195201'

# Select a specific group using the key defined above
group1 = grouped.get_group(time1) # returns a DataFrame for that group

# let's see what we have
print(group1)

              STATION  ELEVATION  LATITUDE  LONGITUDE      DATE  PRCP  TAVG  \
0   GHCND:FIE00142080         51   60.3269    24.9603  19520101  0.31  37.0   
1   GHCND:FIE00142080         51   60.3269    24.9603  19520102   NaN  35.0   
2   GHCND:FIE00142080         51   60.3269    24.9603  19520103  0.14  33.0   
3   GHCND:FIE00142080         51   60.3269    24.9603  19520104  0.05  29.0   
4   GHCND:FIE00142080         51   60.3269    24.9603  19520105  0.06  27.0   
5   GHCND:FIE00142080         51   60.3269    24.9603  19520106   NaN  26.0   
6   GHCND:FIE00142080         51   60.3269    24.9603  19520107  0.03  39.0   
7   GHCND:FIE00142080         51   60.3269    24.9603  19520108   NaN  33.0   
8   GHCND:FIE00142080         51   60.3269    24.9603  19520109  0.18  38.0   
9   GHCND:FIE00142080         51   60.3269    24.9603  19520111  0.10  33.0   
10  GHCND:FIE00142080         51   60.3269    24.9603  19520112  0.28  29.0   
11  GHCND:FIE00142080         51   60.3269    24.960

Ahaa! As we can see, a single group contains a **DataFrame** with values only for that specific month.
This is really useful, because now we can calculate e.g. the average values for all the weather measurements that we have for that month (you can use any of the statistical functions that we have seen already, e.g. mean, std, min, max, median, etc.).

We can do that by using the **`mean()`** function that we already used in Lesson 5.

- Let's calculate the mean for the following attributes (let's see how to do them all at once!):
   - ``PRCP``,
   - ``TAVG``,
   - ``TMAX``,
   - ``TMIN``,
   - ``temp_celsius``.

In [25]:
# Specify the columns which will be used in the calculation
mean_cols = ['PRCP', 'TAVG','TMAX', 'TMIN', 'temp_celsius']

# Calculate the mean values all in one go
mean_values = group1[mean_cols].mean()

# print mean values for all the columns
print(mean_values)


PRCP             0.174545
TAVG            29.478261
TMAX            33.263158
TMIN            27.545455
temp_celsius    -1.400966
dtype: float64


In [26]:
# Add the date (year and month) of this group into the pandas Series of mean values
mean_values['DATE_m'] = time1

# Print the mean values for that specific year and month
print(mean_values)

PRCP            0.174545
TAVG             29.4783
TMAX             33.2632
TMIN             27.5455
temp_celsius    -1.40097
DATE_m            195201
dtype: object


<b>Great!...</b>

Let's add these values to the dataframe created a few steps above.

In [27]:
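# Add the values into our DataFrame (monthly_data)
# Note: DataFrame.append() was removed in pandas 2.0, so this cell assumes an older pandas version
monthly_data = monthly_data.append(mean_values, ignore_index=True)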

# Print dataframe
print(monthly_data)

   DATE_m      PRCP       TAVG       TMAX       TMIN  temp_celsius
0  195201  0.174545  29.478261  33.263158  27.545455     -1.400966
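`DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0. A minimal alternative sketch, using made-up values and the hypothetical names `rows` and `monthly_demo`, is to collect one dictionary per row in a list and build the DataFrame once at the end.

```python
import pandas as pd

# Hypothetical aggregated values, one (key, mean) pair per month
aggregated = [("195201", -1.4), ("195202", -4.0)]

rows = []
for key, mean_temp in aggregated:
    # Each dictionary becomes one row of the final DataFrame
    rows.append({"DATE_m": key, "temp_celsius": mean_temp})

# Build the DataFrame in a single step instead of appending repeatedly
monthly_demo = pd.DataFrame(rows)

print(monthly_demo)
```

Building the DataFrame once also avoids copying the growing table on every loop iteration.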


In [28]:
# Iterate over the groups, stopping after the first one
for key, group in grouped:
    # Print the key and the group
    print("key:\n", key)
    print("Group:\n", group)
    
    # Stop the iteration with break command
    break


key:
 195201
Group:
               STATION  ELEVATION  LATITUDE  LONGITUDE      DATE  PRCP  TAVG  \
0   GHCND:FIE00142080         51   60.3269    24.9603  19520101  0.31  37.0   
1   GHCND:FIE00142080         51   60.3269    24.9603  19520102   NaN  35.0   
2   GHCND:FIE00142080         51   60.3269    24.9603  19520103  0.14  33.0   
3   GHCND:FIE00142080         51   60.3269    24.9603  19520104  0.05  29.0   
4   GHCND:FIE00142080         51   60.3269    24.9603  19520105  0.06  27.0   
5   GHCND:FIE00142080         51   60.3269    24.9603  19520106   NaN  26.0   
6   GHCND:FIE00142080         51   60.3269    24.9603  19520107  0.03  39.0   
7   GHCND:FIE00142080         51   60.3269    24.9603  19520108   NaN  33.0   
8   GHCND:FIE00142080         51   60.3269    24.9603  19520109  0.18  38.0   
9   GHCND:FIE00142080         51   60.3269    24.9603  19520111  0.10  33.0   
10  GHCND:FIE00142080         51   60.3269    24.9603  19520112  0.28  29.0   
11  GHCND:FIE00142080         5

From this iteration we can see that the **`key`** contains the value **`195201`**, which is the same value found in the **`DATE_m`** column, meaning that we grouped the values based on that column.

- Now let's create the DataFrame that will hold the monthly mean values. Let's repeat the steps above, but for the entire dataset.
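The following cells build the result one group at a time. As a compact alternative sketch, `groupby()` followed by `mean()` aggregates every group in one call, and `reset_index()` turns the group key back into a regular column. The DataFrame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical daily values already tagged with a year-month key
demo = pd.DataFrame({
    "DATE_m": ["195201", "195201", "195202", "195202"],
    "temp_celsius": [-2.0, 0.0, -5.0, -3.0],
})

# Mean of temp_celsius per DATE_m, with DATE_m restored as a column
monthly_demo = demo.groupby("DATE_m")[["temp_celsius"]].mean().reset_index()

print(monthly_demo)
```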

In [29]:
# Create a DataFrame
monthly_data  = pd.DataFrame()

# the columns that we want to aggregate
mean_cols = ['temp_celsius']

# Iterate over the groups and update the new DataFrame called 'monthly_data'
for key, group in grouped:
    # aggregate the data
    mean_values = group[mean_cols].mean()
    
    # Add the key (year and month information) into the aggregated values
    mean_values['DATE_m'] = key
    
    # Append the aggregated values into the dataframe
    monthly_data = monthly_data.append(mean_values, ignore_index=True)

In [30]:
# Print the monthly_data dataframe

# define Column index order
col_order = ['DATE_m', 'temp_celsius']
# change column index order
monthly_data = monthly_data.reindex(columns=col_order)

# print dataframe
monthly_data

   DATE_m  temp_celsius
0  195201     -1.400966
1  195202     -4.000000
2  195203    -10.106838
3  195204      4.226190
4  195205      7.037037
5  195206     13.611111
6  195207     16.230159
7  195208     14.157706
8  195209      8.461538
9  195210      2.162698


In [31]:
# Create an empty DataFrame for the aggregated values
monthly_data = pd.DataFrame()

# The columns that we want to aggregate
mean_cols = ['temp_celsius']

# Iterate over the groups
for key, group in grouped:
   # Aggregate the data
   mean_values = group[mean_cols].mean()

   # Add the `key` (i.e. the year and month information) into the aggregated values
   mean_values['DATE_m'] = key

   # Append the aggregated values into the DataFrame
   monthly_data = monthly_data.append(mean_values, ignore_index=True)

In [32]:
# define Column index order
col_order = ['DATE_m', 'temp_celsius']
# change column index order
monthly_data = monthly_data.reindex(columns=col_order)

# print the dataframe monthly_data
print(monthly_data)

     DATE_m  temp_celsius
0    195201     -1.400966
1    195202     -4.000000
2    195203    -10.106838
3    195204      4.226190
4    195205      7.037037
5    195206     13.611111
6    195207     16.230159
7    195208     14.157706
8    195209      8.461538
9    195210      2.162698
10   195211     -2.380952
11   195212     -2.911111
12   195301     -5.396825
13   195302     -8.662551
14   195303     -0.483092
15   195304      4.423868
16   195305      9.265233
17   195306     17.037037
18   195307     17.066667
19   195308     15.114943
20   195309      9.650206
21   195310      7.222222
22   195311      1.408730
23   195312     -1.290323
24   195401     -7.072650
25   195402    -11.527778
26   195403     -1.269841
27   195404      1.412037
28   195405     11.455939
29   195406     13.910256
..      ...           ...
760  201505      9.587814
761  201506     13.537037
762  201507     16.075269
763  201508     16.702509
764  201509     12.537037
765  201510      5.179211
766  201511 

# Problem 3 - Calculating temperature anomalies (7.5 points)

Our goal in this problem is to calculate monthly temperature anomalies in order to see how temperatures have changed over time, relative to the 1952-1980 observation period.

We will again continue working with this same notebook.

In order to complete the problem, you must do the following things:

- You need to calculate a mean temperature ***for each month*** over the period 1952-1980 using the data in the data file.
 As a result, you should end up with 12 values, 1 mean temperature for each month in that period, and store them in a new Pandas DataFrame called **`reference_temps`**.  
   - The columns in the new DataFrame should be titled `Month` and `ref_temp`. 
   
For example, your `reference_temps` data should look something like the table below, with 1 value for each month of the year (12 total):
   
| Month    | ref_temp         |
|----------|------------------|
| 01       | -5.350916        |
| 02       | -5.941307        |
| 03       | -2.440364        |
| ...      | ...              |
   
*Remember, these temperatures should be in degrees Celsius.*

- Once you have the monthly mean values for each of the 12 months, you can then calculate a temperature anomaly for every month in the `monthly_data` DataFrame.
- The temperature anomaly we want to calculate is simply the temperature for one month in `monthly_data` (`temp_celsius` -column) minus the corresponding monthly reference temperature in `ref_temp` column of `reference_temps` DataFrame. 
    - Hint: You need to make a table join (see hints for this week)
- You should thus end up with three new columns in the `monthly_data` DataFrame: 

    1. **`Diff`**  showing the temperature anomaly, the difference in temperature for a given month (e.g., February 1960) compared to the average (e.g., for February 1952-1980), 
    2. **`Month`** indicating the month, and 
    3. **`ref_temp`** indicating the (monthly) reference temperature.
- Update and commit your changes to the notebook in your **own repository** of this week's exercise.
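The table join and anomaly calculation described above can be sketched on toy data as follows. All values and the names `monthly_demo`, `reference_demo`, and `merged` are assumptions made for illustration, and `how="left"` is one possible choice that keeps every row of the monthly table in its original order.

```python
import pandas as pd

# Hypothetical monthly means with a YYYYMM string key
monthly_demo = pd.DataFrame({
    "DATE_m": ["195201", "195202", "196001"],
    "temp_celsius": [-1.0, -4.0, -10.0],
})

# Characters 4-5 of the key hold the two-digit month
monthly_demo["Month"] = monthly_demo["DATE_m"].str.slice(start=4, stop=6)

# Hypothetical reference temperatures, one row per month
reference_demo = pd.DataFrame({"Month": ["01", "02"], "ref_temp": [-5.0, -6.0]})

# Attach the matching reference temperature to every monthly row
merged = monthly_demo.merge(reference_demo, on="Month", how="left")

# Anomaly: observed monthly mean minus the reference mean for that month
merged["Diff"] = merged["temp_celsius"] - merged["ref_temp"]

print(merged)
```

The join key must have the same type and formatting in both tables, for example the strings `"01"` and `"01"`, otherwise no rows will match.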


### Get the month from the year-month column **`monthly_data['DATE_m']`**

In [None]:
# # create new column called 'month' that will contain month for each year
# monthly_data['DATE_m'] = monthly_data['DATE_m'].astype(str)
# # check data type
# print('Data type for the column:')
# print(monthly_data['DATE_m'].dtype)

# print('\nData type of the first value in the column:')
# print(type(monthly_data.loc[0, 'DATE_m']))

In [36]:
# Slice the string column to obtain the month value
monthly_data['month'] = monthly_data['DATE_m'].str.slice(start=4, stop=6)

# Convert DATE_m back to integer form; keep month as a zero-padded string
monthly_data['DATE_m'] = monthly_data['DATE_m'].astype(int)
monthly_data['month'] = monthly_data['month'].astype(str)
# Let's check that we sliced the data correctly
print(monthly_data.head())

   DATE_m  temp_celsius month
0  195201     -1.400966    01
1  195202     -4.000000    02
2  195203    -10.106838    03
3  195204      4.226190    04
4  195205      7.037037    05


In [34]:
# YOUR CODE HERE
# raise NotImplementedError()

# Create new dataframe
reference_temp = pd.DataFrame()

### <i>Let's filter the data to be only between 1952 and 1980...</i>

In [37]:
# Create a temporary DataFrame (not used further below)
monthly_data_temp = pd.DataFrame()

# Filter the data into a new DataFrame covering 1952-1980
md_sample = monthly_data.loc[(monthly_data['DATE_m'] >= 195201) & (monthly_data['DATE_m'] <= 198012)]

# Let's print our filtered data
print(md_sample)

     DATE_m  temp_celsius month
0    195201     -1.400966    01
1    195202     -4.000000    02
2    195203    -10.106838    03
3    195204      4.226190    04
4    195205      7.037037    05
5    195206     13.611111    06
6    195207     16.230159    07
7    195208     14.157706    08
8    195209      8.461538    09
9    195210      2.162698    10
10   195211     -2.380952    11
11   195212     -2.911111    12
12   195301     -5.396825    01
13   195302     -8.662551    02
14   195303     -0.483092    03
15   195304      4.423868    04
16   195305      9.265233    05
17   195306     17.037037    06
18   195307     17.066667    07
19   195308     15.114943    08
20   195309      9.650206    09
21   195310      7.222222    10
22   195311      1.408730    11
23   195312     -1.290323    12
24   195401     -7.072650    01
25   195402    -11.527778    02
26   195403     -1.269841    03
27   195404      1.412037    04
28   195405     11.455939    05
29   195406     13.910256    06
..      

In [38]:
# Use groupby() to group the data based on a key value
grouped = md_sample.groupby('month')

In [39]:
# What's the type of the grouped data?
print('Type:\n', type(grouped))

# How many groups does it contain?
print('Length:\n', len(grouped))

Type:
 <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
Length:
 12


Now we can see that we have a new object called `grouped` of type **`DataFrameGroupBy`**. It contains 12 individual groups, <b>one group for each month</b>.

Let's see what the grouped variable holds for the first month, **`01`**. We can get the rows for that month from the **`DataFrameGroupBy`** object with the **`get_group()`** function.

In [48]:
# Specify the month, key value to make the first selection
month_1 = '01'

# Select the group
group_1 = grouped.get_group(month_1) # returns a DataFrame for that group

# Let's see what we have
print(group_1)

     DATE_m  temp_celsius month
0    195201     -1.400966    01
12   195301     -5.396825    01
24   195401     -7.072650    01
36   195501     -5.473251    01
48   195601     -8.133333    01
60   195701     -2.440476    01
72   195801     -8.315412    01
84   195901     -5.148148    01
96   196001    -10.229885    01
108  196101     -3.648148    01
120  196201     -3.655914    01
132  196301     -9.892473    01
144  196401           NaN    01
156  196501           NaN    01
168  196601           NaN    01
180  196701           NaN    01
192  196801           NaN    01
204  196901           NaN    01
216  197001           NaN    01
228  197101           NaN    01
240  197201           NaN    01
252  197301     -1.971326    01
264  197401     -2.921147    01
276  197501     -1.379928    01
288  197601     -9.946237    01
300  197701     -6.451613    01
312  197801     -5.806452    01
324  197901     -8.655914    01
336  198001     -8.835125    01


Great... we can see that a single group contains a <b>DataFrame</b> with values only for a specific month. Now we can calculate the average temperature for each month from the values that we have.

Let's calculate the mean values for all the groups (1 month per group) using the monthly temperatures between 1952 and 1980.
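As a sketch of this step on made-up numbers, the reference temperatures can be computed by filtering to the reference period, grouping by month, and averaging. The DataFrame `md_demo`, its values, and the name `reference_demo` are hypothetical. Note that `mean()` skips `NaN` values by default, so months with missing data are averaged over the remaining years.

```python
import pandas as pd

# Hypothetical monthly means; the 1981 row lies outside the reference period
md_demo = pd.DataFrame({
    "DATE_m": [195201, 195202, 195301, 195302, 198101],
    "temp_celsius": [-2.0, -4.0, -6.0, -8.0, 3.0],
    "month": ["01", "02", "01", "02", "01"],
})

# Keep only January 1952 through December 1980 (inclusive)
in_period = md_demo["DATE_m"].between(195201, 198012)

# One mean value per calendar month, with the requested column names
reference_demo = (
    md_demo.loc[in_period]
    .groupby("month")["temp_celsius"]
    .mean()
    .reset_index()
    .rename(columns={"month": "Month", "temp_celsius": "ref_temp"})
)

print(reference_demo)
```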

In [49]:
# Create new DataFrame called reference_temps
reference_temps = pd.DataFrame()

# the columns that we want to aggregate
mean_cols = ['temp_celsius']

# Iterate over the groups
for key, group in grouped:
    mean_values = group[mean_cols].mean()

    # Add the 'key' (i.e. the month information) into the aggregated values
    mean_values['month']= key

    # Append the aggregated values into the DataFrame
    reference_temps = reference_temps.append(mean_values, ignore_index= True)

In [50]:
# Create a dictionary with old and new column names
name_convention_dict = {'month': 'month', 'temp_celsius': 'ref_temp'}

# Rename the columns
reference_temps = reference_temps.rename(columns=name_convention_dict)

# Print the new DataFrame with only the month and the ref_temp columns
print(reference_temps.head())

# Re-order the columns
col_order = ['month', 'ref_temp']
reference_temps = reference_temps.reindex(columns=col_order)

print('\n',reference_temps)

  month  ref_temp
0    01 -5.838761
1    02 -7.064088
2    03 -3.874213
3    04  2.370749
4    05  9.482356

    month   ref_temp
0     01  -5.838761
1     02  -7.064088
2     03  -3.874213
3     04   2.370749
4     05   9.482356
5     06  14.661728
6     07  16.520986
7     08  15.045650
8     09   9.934222
9     10   4.952240
10    11   0.245195
11    12  -4.165641


In [47]:
# Let's check the column data types of both DataFrames
print(monthly_data.dtypes)

print('\n', reference_temps.dtypes)

DATE_m            int32
temp_celsius    float64
month            object
dtype: object

 month        object
ref_temp    float64
dtype: object


- What is the highest value in `Diff` column?
   - Print the answer in the cell below

In [45]:
# Spot check of a single anomaly: February 1960 minus the February reference temperature
# m_data = monthly_data.loc[monthly_data['DATE_m'] == 196002, 'temp_celsius']
# r_temp = reference_temps.loc[reference_temps['month'] == '02', 'ref_temp']

# Row 97 of monthly_data should be 196002; row 1 of reference_temps is month '02'
m_data = monthly_data.iloc[97]['temp_celsius']
r_temp = reference_temps.iloc[1]['ref_temp']


print(m_data)
print(r_temp)

# let's see the result
diff = m_data - r_temp
print('diff:\n',diff)

-8.927203065134101
-7.064087935088894
diff:
 -1.8631151300452071


In [52]:
# Join the DataFrames (a table join) using the shared 'month' column as the key

join = monthly_data.merge(reference_temps, on='month')

# Let's see what the join looks like
print(join.head())

# Let's organize the data in a better way
# Re-order the columns
col_order = ['DATE_m', 'month', 'temp_celsius', 'ref_temp']
join = join.reindex(columns=col_order)
# Let's print it again
print('\n',join)

   DATE_m  temp_celsius month  ref_temp
0  195201     -1.400966    01 -5.838761
1  195301     -5.396825    01 -5.838761
2  195401     -7.072650    01 -5.838761
3  195501     -5.473251    01 -5.838761
4  195601     -8.133333    01 -5.838761

      DATE_m month  temp_celsius  ref_temp
0    195201    01     -1.400966 -5.838761
1    195301    01     -5.396825 -5.838761
2    195401    01     -7.072650 -5.838761
3    195501    01     -5.473251 -5.838761
4    195601    01     -8.133333 -5.838761
5    195701    01     -2.440476 -5.838761
6    195801    01     -8.315412 -5.838761
7    195901    01     -5.148148 -5.838761
8    196001    01    -10.229885 -5.838761
9    196101    01     -3.648148 -5.838761
10   196201    01     -3.655914 -5.838761
11   196301    01     -9.892473 -5.838761
12   196401    01           NaN -5.838761
13   196501    01           NaN -5.838761
14   196601    01           NaN -5.838761
15   196701    01           NaN -5.838761
16   196801    01           NaN -5.838761
17

In [55]:
# Let's add a new column 'Diff' holding the temperature anomaly for each given month
join['Diff'] = join['temp_celsius'] - join['ref_temp']

# Now print the dataframe with the new column 
print(join.head())

   DATE_m month  temp_celsius  ref_temp      Diff
0  195201    01     -1.400966 -5.838761  4.437795
1  195301    01     -5.396825 -5.838761  0.441936
2  195401    01     -7.072650 -5.838761 -1.233888
3  195501    01     -5.473251 -5.838761  0.365510
4  195601    01     -8.133333 -5.838761 -2.294572
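To answer the question about the highest value in the `Diff` column, one option is `max()`, with `idxmax()` locating the row it belongs to. The sketch below uses a small made-up table (`join_demo` and its values are assumptions), so its numbers are not the answer for the real data.

```python
import pandas as pd

# Hypothetical anomalies for three months
join_demo = pd.DataFrame({
    "DATE_m": [195201, 195301, 195401],
    "Diff": [4.4, 0.4, -1.2],
})

# Largest anomaly and the row where it occurs
max_diff = join_demo["Diff"].max()
max_row = join_demo.loc[join_demo["Diff"].idxmax()]

print(max_diff)
print(max_row)
```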


#### Done!

That's it. You are now done with Problems 1-3. If you want, you can still continue with the optional [Problem 4](Exercise-6-problem-4.ipynb).