![](../../docs/ae_logo.png "Adapt & Enable")
# AE workshop 2023 - Data science

## Part 1 - Exploratory Data Analysis (EDA)

In this notebook, we'll talk about time series using a contemporary issue as a use case: global warming. Let's kick off by importing the packages we'll need down the line.

In [15]:
import os
import datetime
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf
import statsmodels.api as sm

%matplotlib inline
import plotly.offline as py
import plotly.io as pio

pio.templates.default = "plotly_white"
pio.renderers.default = "iframe"

ae_orange = "#FD9129"
ae_orange2 = "#FFD580"
ae_gold = "#FFD700"

### Read data

First of all, we have to load the data from disk into memory. A great library for handling tabular data is **pandas**. The data was downloaded from https://www.kaggle.com/datasets/berkeleyearth/climate-change-earth-surface-temperature-data; feel free to explore!

In [16]:
df = pd.read_csv("../../data/GlobalTemperatures.csv")

Let's check what the first rows look like:

In [17]:
df.head(3)

Unnamed: 0,dt,LandAverageTemperature,LandAverageTemperatureUncertainty,LandMaxTemperature,LandMaxTemperatureUncertainty,LandMinTemperature,LandMinTemperatureUncertainty,LandAndOceanAverageTemperature,LandAndOceanAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574,,,,,,
1,1750-02-01,3.083,3.702,,,,,,
2,1750-03-01,5.626,3.076,,,,,,
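
A side note on the `Unnamed: 0` column: it usually appears when a row index was saved along with the data. The sketch below is not part of the workshop code; it rebuilds the three rows shown above as an in-memory CSV (the `csv_text` and `sample` names are made up for illustration) and shows how `index_col` and `parse_dates` could handle the index and the date column at load time.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV file (only the three rows printed above)
csv_text = """,dt,LandAverageTemperature,LandAverageTemperatureUncertainty
0,1750-01-01,3.034,3.574
1,1750-02-01,3.083,3.702
2,1750-03-01,5.626,3.076
"""

# index_col=0 uses the unnamed first column as the row index,
# parse_dates converts "dt" to datetime64 while reading
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["dt"])

print(sample.dtypes)
```

Whether this is preferable to converting the column afterwards is mostly a matter of taste; the cells below do the conversion explicitly.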


We will only use three of these columns: the date, the average land temperature and its associated uncertainty measure.

In [18]:
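# keep only the columns we need; .copy() avoids a SettingWithCopyWarning when we modify "dt" below
df = df[["dt", "LandAverageTemperature", "LandAverageTemperatureUncertainty"]].copy()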

# change the datatype of date
df["dt"] = pd.to_datetime(df["dt"])

# remove NaN
df = df[~df["LandAverageTemperature"].isnull()]

# use shorter names because we're lazy (careful not to get confused by your own simplifications though)
df = df.rename(
    columns={
        "dt": "date",
        "LandAverageTemperature": "temperature",
        "LandAverageTemperatureUncertainty": "uncertainty",
    }
)

df.dtypes

date           datetime64[ns]
temperature           float64
uncertainty           float64
dtype: object

### EDA

The first thing to do in *any* analysis is to explore the data so we can see what we're working with. This is the **E**xploratory **D**ata **A**nalysis part.

**<span style="color:#FD9129">How would you do this? What sort of visuals would you try? Give it a try below.</span>**

<span style="color:#FD9129">* **What do you expect?**</span>

<span style="color:#FD9129">* **See if you spot any irregularities.**</span>

<span style="color:#FD9129">* **Can you spot any patterns?**</span>


In [19]:
### Code goes brrr ###

### Patterns

Let's look at our time series through another lens. Depending on the type of data and the range of the values, certain types of patterns can appear. In particular, we can look for the following:

* <span style="color:darkorange">**Trend**</span> - this is a general, long-term increase or decrease in our data (it does not have to be linear or strictly monotonic).
* <span style="color:darkorange">**Seasonality**</span> - sometimes our values fluctuate in a predictable, calendar-bound manner (e.g., with the season, the month, or the day of the week).
* <span style="color:darkorange">**Cyclicality**</span> - when you notice a repeating pattern that cannot be attributed to seasonality (i.e., there does not appear to be a straightforward link with the calendar), we call this a cycle.
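
To make these definitions a bit more tangible, here is a minimal sketch on a *synthetic* monthly series (not the workshop data). The series `toy` and all its parameters are assumptions chosen for illustration: a slow linear trend, a yearly sine wave and some noise. A 12-month rolling mean smooths out the seasonal swing and leaves the trend, the per-month average exposes the seasonal profile, and the autocorrelation at lag 12 is high because values one year apart look alike.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly seasonality + noise (illustrative only)
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2000-01-01", periods=240, freq="MS")
months_elapsed = np.arange(len(dates))
trend = 0.01 * months_elapsed
seasonal = 5.0 * np.sin(2 * np.pi * months_elapsed / 12)
noise = rng.normal(scale=0.3, size=len(dates))
toy = pd.Series(trend + seasonal + noise, index=dates, name="toy_temperature")

# Trend: a centered 12-month rolling mean averages over one full seasonal cycle
rolling_trend = toy.rolling(window=12, center=True).mean()

# Seasonality: the average value per calendar month
monthly_profile = toy.groupby(toy.index.month).mean()

# Autocorrelation: compare the series with itself shifted by 12 and by 6 months
autocorr_12 = toy.autocorr(lag=12)
autocorr_6 = toy.autocorr(lag=6)

print(rolling_trend.dropna().iloc[[0, -1]])
print(monthly_profile.round(2))
print(f"lag 12: {autocorr_12:.2f}, lag 6: {autocorr_6:.2f}")
```

A cycle would show up differently: the repeating pattern would not line up with a fixed calendar lag such as 12 months.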

<span style="color:#FD9129">**Which of these do you expect? Can you visualize the data to make a slightly more educated guess?**</span>

In [20]:
### Code goes brrr ###

We also want to know whether our time series is *stationary*. <span style="color:#FD9129">**Any idea what that means? Try and find out if it is.**</span>

In [21]:
from statsmodels.tsa.stattools import adfuller

### Code goes brrr ###

We can try and isolate different components as well, using `statsmodels`' time series analysis API. <span style="color:#FD9129">**Check it out (you can use the function below)!**</span>

In [22]:
def decompose_signal(time_series: pd.Series, period=12):
    # Perform seasonal decomposition on the series that was passed in (not the global df)
    decomposition = sm.tsa.seasonal_decompose(time_series, period=period)

    # Create a figure with 4 subplots
    fig = make_subplots(rows=4, cols=1, shared_xaxes=True, vertical_spacing=0.05)

    # Add the original signal component to the top subplot
    fig.add_trace(
        go.Scatter(x=time_series.index, y=time_series, name="Original"), row=1, col=1
    )

    # Add the seasonal component
    fig.add_trace(
        go.Scatter(
            x=decomposition.seasonal.index, y=decomposition.seasonal, name="Seasonal"
        ),
        row=2,
        col=1,
    )

    # Add the trend component
    fig.add_trace(
        go.Scatter(x=decomposition.trend.index, y=decomposition.trend, name="Trend"),
        row=3,
        col=1,
    )

    # Add the residuals component
    fig.add_trace(
        go.Scatter(
            x=decomposition.resid.index, y=decomposition.resid, name="Residuals"
        ),
        row=4,
        col=1,
    )

    # Customize the figure layout
    fig.update_layout(
        height=800, width=1200, title_text="Decomposition", showlegend=False
    )
    fig.update_traces(line_color=ae_orange, row=1)
    fig.update_xaxes(title_text="")
    fig.update_xaxes(tickformat="%b\n%Y")
    fig.update_yaxes(title_text="Original", row=1)
    fig.update_yaxes(title_text="Seasonal", row=2)
    fig.update_yaxes(title_text="Trend", row=3)
    fig.update_yaxes(title_text="Residuals", row=4)

    fig.show()

In [23]:
### Code goes brrr ###

Let's move on to another topic: time series forecasting! [Head over to the next notebook](./2_forecasting_mlflow-student.ipynb).