<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Libraries" data-toc-modified-id="Libraries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Libraries</a></span></li><li><span><a href="#General-info" data-toc-modified-id="General-info-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>General info</a></span></li><li><span><a href="#Load-data-and-cleaning" data-toc-modified-id="Load-data-and-cleaning-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Load data and cleaning</a></span></li><li><span><a href="#Visualization" data-toc-modified-id="Visualization-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Visualization</a></span><ul class="toc-item"><li><span><a href="#By-month" data-toc-modified-id="By-month-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>By month</a></span></li></ul></li><li><span><a href="#Preparing-data" data-toc-modified-id="Preparing-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Preparing data</a></span><ul class="toc-item"><li><span><a href="#Create-new-dataframe-with-an-index-for-each-month" data-toc-modified-id="Create-new-dataframe-with-an-index-for-each-month-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Create new dataframe with an index for each month</a></span></li><li><span><a href="#Merge-datasets" data-toc-modified-id="Merge-datasets-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Merge datasets</a></span></li><li><span><a href="#Visualize-the-data" data-toc-modified-id="Visualize-the-data-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Visualize the data</a></span></li></ul></li></ul></div>

# Libraries

In [1]:
# Adding the parent directory to the path so that we can import the cleaning module.
import sys
sys.path.append("../")

import pandas as pd
import numpy as np
import calendar
from datetime import datetime
import plotly.express as px
from plotly.subplots import make_subplots # to make subplots
import plotly.graph_objects as go # trace objects (e.g. go.Scatter) added to the subplots

import src.cleaning as cl

# General info 

The temperature data are monthly and seasonal temperature anomalies, that is, differences from the mean (expected) value (DJF = Dec-Feb, MAM = Mar-May, JJA = Jun-Aug, SON = Sep-Nov). We work with anomalies rather than absolute temperatures because climate change studies focus on how temperature departs from a baseline, not on its absolute value.
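
As a rough illustration of what an anomaly is, the sketch below subtracts a baseline mean from a few absolute temperatures. The numbers are made up for the example, and the 1951-1980 window is used because GISTEMP reports its anomalies relative to that base period; the variable names are hypothetical and not part of this notebook.

```python
from statistics import mean

# Illustrative (made-up) absolute July temperatures in degrees Celsius, keyed by year.
july_temps = {1951: 15.0, 1965: 15.2, 1980: 15.4, 2018: 16.1}

# Baseline: mean over the years that fall inside the 1951-1980 base period.
baseline_years = range(1951, 1981)
baseline = mean(t for year, t in july_temps.items() if year in baseline_years)

# Anomaly: observed value minus the baseline mean.
anomalies = {year: round(t - baseline, 2) for year, t in july_temps.items()}
print(anomalies)
```

A positive anomaly means the month was warmer than the baseline average, and a negative one means it was cooler.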

For more information, see the [GISTEMP page](https://data.giss.nasa.gov/gistemp/).

# Load data and cleaning

In [2]:
temp = pd.read_csv("../Data/SST_Global.csv")
temp.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Land-Ocean: Global Means
Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1880,-.29,-.18,-.11,-.20,-.12,-.23,-.21,-.09,-.16,-.23,-.20,-.23,-.19,***,***,-.14,-.18,-.20
1881,-.16,-.17,.04,.04,.02,-.20,-.07,-.03,-.14,-.21,-.22,-.11,-.10,-.11,-.18,.03,-.10,-.19
1882,.14,.15,.03,-.19,-.16,-.26,-.21,-.06,-.10,-.25,-.16,-.25,-.11,-.10,.06,-.10,-.17,-.17
1883,-.32,-.39,-.13,-.17,-.20,-.13,-.08,-.15,-.21,-.14,-.22,-.16,-.19,-.20,-.32,-.17,-.12,-.19


In [3]:
df = cl.cleaning_temp_data(temp)
df.to_csv("temp_complete_clean.csv")

In [4]:
df.head(2)

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec,J-D,D-N,DJF,MAM,JJA,SON
1,1880,-0.29,-0.18,-0.11,-0.2,-0.12,-0.23,-0.21,-0.09,-0.16,-0.23,-0.2,-0.23,-0.19,***,***,-0.14,-0.18,-0.2
2,1881,-0.16,-0.17,0.04,0.04,0.02,-0.2,-0.07,-0.03,-0.14,-0.21,-0.22,-0.11,-0.1,-.11,-.18,0.03,-0.1,-0.19
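
The `cl.cleaning_temp_data` helper lives in `src/cleaning.py` and is not shown in this notebook. The sketch below is only one possible way to get a similar result at load time, under the assumption that the file starts with a one-field title line followed by the real header row. It also converts the `***` placeholders to `NaN`, which the output above suggests the original helper does not do. The inline sample stands in for the CSV file.

```python
import io
import pandas as pd

# A small stand-in for the raw file: a title line, then the real header, then data.
raw = io.StringIO(
    "Land-Ocean: Global Means\n"
    "Year,Jan,Feb,J-D,D-N\n"
    "1880,-.29,-.18,-.19,***\n"
    "1881,-.16,-.17,-.10,-.11\n"
)

# header=1 uses the second line as column names and skips the title line;
# na_values turns the "***" placeholders into NaN so the columns stay numeric.
clean = pd.read_csv(raw, header=1, na_values="***")
print(clean)
print(clean.dtypes)
```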


# Visualization 

See the [Plotly axes documentation](https://plotly.com/python/axes/) for more options.

To draw the axis line:
```python
fig.update_xaxes(showline = True, linewidth = 1, linecolor = "black")
```

See also the [Plotly subplots documentation](https://plotly.com/python/subplots/).

## By month 

In [5]:
fig = make_subplots(rows = 4, cols = 3, subplot_titles=('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep',
       'Oct', 'Nov', 'Dec'))
# Fill the 4 x 3 grid row by row: position 0 -> (row 1, col 1), position 3 -> (row 2, col 1), ...
for position, month in enumerate(df.columns[1:13]):
    row, col = divmod(position, 3)
    fig.add_trace(
        go.Scatter(x = df.Year, y = df[month]), row = row + 1, col = col + 1)
        
fig.update_layout(height=1000, width=1000, title_text="Evolution of temperature", showlegend = False)
fig.update_layout(
    font_family="Garamond",
    font_color="black",
    font_size = 16,
    title_font_family="Times New Roman",
    title_font_color="black",
    legend_title_font_color="black"
)

# to change the line color and width
fig.update_traces(line=dict(color = "Black", width = 0.3))

# to change the axes 
fig.update_xaxes(tickangle = -45, ticks = "outside", showgrid = False, showline = True, linewidth = 1, linecolor = "black", mirror = True)
fig.update_yaxes(ticks = "outside", showgrid = False, showline = True, linewidth = 1, linecolor = "black", mirror = True)

fig.show()

# Preparing data 

## Create new dataframe with an index for each month

**Frequency Aliases**

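Some of the most common pandas frequency aliases are:

- "M" : Month end (used below)
- "D" : Day
- "W" : Week
- "H" : Hour
- "T" : Minute
- "S" : Second
- "L" : Millisecond

Note that pandas 2.2 and later deprecate several of these spellings (for example "M" in favour of "ME" and "H" in favour of "h").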
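
A minimal sketch of how a frequency changes the output of `pd.date_range`. It sticks to "D" and "W", and passes month end as a `pd.offsets.MonthEnd()` object instead of a string alias, on the assumption that this avoids differences in alias spelling between pandas versions.

```python
import pandas as pd

# Three consecutive days starting on 1 January 2019.
daily = pd.date_range(start="2019-01-01", periods=3, freq="D")

# "W" is anchored on Sundays by default, so the range starts on the
# first Sunday on or after the start date.
weekly = pd.date_range(start="2019-01-01", periods=3, freq="W")

# Month-end frequency given as an offset object rather than a string alias.
month_end = pd.date_range(start="1880-01-01", periods=3, freq=pd.offsets.MonthEnd())

print(daily, weekly, month_end, sep="\n")
```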

In [8]:
# create the date range
date_rng = pd.date_range(start='1/1/1880', end='1/03/2019', freq='M')
date_rng[1]

Timestamp('1880-02-29 00:00:00', freq='M')

In [9]:
# Next, create a DataFrame holding only the dates; we will populate it with the actual data
new_df = pd.DataFrame(date_rng, columns = ["date"])
new_df.head()

Unnamed: 0,date
0,1880-01-31
1,1880-02-29
2,1880-03-31
3,1880-04-30
4,1880-05-31


In [10]:
# Create a column for the anomaly values
new_df["Avg_anomalies"] = None
new_df.head()

Unnamed: 0,date,Avg_anomalies
0,1880-01-31,
1,1880-02-29,
2,1880-03-31,
3,1880-04-30,
4,1880-05-31,


In [11]:
# We may not know the frequency of our data; pandas can also infer it
dates = pd.date_range(start='1/1/1880', end='1/03/2019',periods=29)
freq = pd.infer_freq(dates)
print(freq)

43518H


## Merge datasets 

In [12]:
# First select only the data that we want
df2 = df.iloc[:,:13]
df2.head()

Unnamed: 0,Year,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
1,1880,-0.29,-0.18,-0.11,-0.2,-0.12,-0.23,-0.21,-0.09,-0.16,-0.23,-0.2,-0.23
2,1881,-0.16,-0.17,0.04,0.04,0.02,-0.2,-0.07,-0.03,-0.14,-0.21,-0.22,-0.11
3,1882,0.14,0.15,0.03,-0.19,-0.16,-0.26,-0.21,-0.06,-0.1,-0.25,-0.16,-0.25
4,1883,-0.32,-0.39,-0.13,-0.17,-0.2,-0.13,-0.08,-0.15,-0.21,-0.14,-0.22,-0.16
5,1884,-0.16,-0.08,-0.37,-0.43,-0.37,-0.41,-0.35,-0.26,-0.27,-0.24,-0.3,-0.29


In [13]:
final = cl.merge_data(df2, new_df)
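
`cl.merge_data` is defined in `src/cleaning.py` and its implementation is not shown here. As a hedged illustration of the general idea, the sketch below reshapes a small wide table (one row per year, one column per month) into a month-end time series with `DataFrame.melt`. The sample values are taken from the first rows above; the names `wide`, `long_df` and `month_number` are hypothetical, and the real helper may work differently.

```python
import pandas as pd

# A small wide table in the same layout as df2 (first three months only).
wide = pd.DataFrame({
    "Year": [1880, 1881],
    "Jan": [-0.29, -0.16],
    "Feb": [-0.18, -0.17],
    "Mar": [-0.11, 0.04],
})
months = ["Jan", "Feb", "Mar"]
month_number = {name: i + 1 for i, name in enumerate(months)}

# Wide -> long: one row per (year, month) pair.
long_df = wide.melt(id_vars="Year", value_vars=months,
                    var_name="month", value_name="Avg_anomalies")

# Build the first day of each month, then roll forward to the month end
# so the dates match a month-end date range.
first_of_month = pd.to_datetime(pd.DataFrame({
    "year": long_df["Year"],
    "month": long_df["month"].map(month_number),
    "day": 1,
}))
long_df["date"] = first_of_month + pd.offsets.MonthEnd(0)

# Chronological order, indexed by date, keeping only the anomaly column.
long_df = long_df.sort_values("date").set_index("date")[["Avg_anomalies"]]
print(long_df)
```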

In [14]:
# Apply cl.clean_anomaly_value (defined in src/cleaning.py) to every anomaly value in the DataFrame
new_df['Avg_anomalies'] = new_df['Avg_anomalies'].apply(cl.clean_anomaly_value)
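
The body of `cl.clean_anomaly_value` is not shown in this notebook. A plausible sketch, assuming its job is to turn raw strings such as `"-.29"` into floats and placeholders such as `"***"` into missing values, could look like this (the function name below is hypothetical):

```python
import math

def clean_anomaly_value_sketch(raw_value):
    """Return raw_value as a float, or NaN if it cannot be parsed."""
    try:
        return float(raw_value)
    except (TypeError, ValueError):
        # Covers None (TypeError) and placeholders such as "***" (ValueError).
        return math.nan

print(clean_anomaly_value_sketch("-.29"))
print(clean_anomaly_value_sketch("***"))
```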


In [15]:
new_df.head()

date,Avg_anomalies
1880-01-31,-0.29
1880-02-29,-0.18
1880-03-31,-0.11
1880-04-30,-0.2
1880-05-31,-0.12


## Visualize the data

In [16]:
fig = px.line(new_df)
fig.show()

In [17]:
# Annual mean of the monthly anomalies; keep the full result and preview the first rows
new_df2 = new_df.resample('A').mean()
new_df2.head()

date,Avg_anomalies
1880-12-31,-0.1875
1881-12-31,-0.100833
1882-12-31,-0.11
1883-12-31,-0.191667
1884-12-31,-0.294167
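
To make the annual aggregation concrete, the sketch below recomputes the 1880 value from the twelve monthly anomalies listed earlier. It groups on the calendar year of the index instead of calling `resample`, which is an assumption made here to stay independent of the year-end alias spelling used by a given pandas version; the result is indexed by year rather than by 31 December.

```python
import pandas as pd

# Monthly anomalies for 1880, indexed by month-end dates.
idx = pd.date_range(start="1880-01-01", periods=12, freq=pd.offsets.MonthEnd())
monthly = pd.Series(
    [-0.29, -0.18, -0.11, -0.20, -0.12, -0.23,
     -0.21, -0.09, -0.16, -0.23, -0.20, -0.23],
    index=idx, name="Avg_anomalies",
)

# Mean of the monthly values within each calendar year.
annual = monthly.groupby(monthly.index.year).mean()
print(annual)
```

The mean for 1880 agrees with the first row of the resampled output above.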


In [18]:
new_df.to_csv("temp_clean.csv")