<a href="https://colab.research.google.com/github/Amirhatamian/Statistical-Models-For-Data-Science/blob/main/Lesson2_21_11_2023_ToDo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Write your own Google drive path to files
DrivePath = "/content/drive/My Drive/Colab Notebooks"

# Link to Google drive
from google.colab import drive
drive.mount('/content/drive',force_remount=True)

In [None]:
import numpy as np
import pandas as pd
import math
import statistics
from scipy import stats

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn; seaborn.set()

#**1. Working with Time Series - Dates and Times**
Every observation in a time series has an associated date or time. As such, we need specific objects for storing and working with dates and related measures.

##**1.1 Dates and Times - Native Python**

The basic objects for working with dates and times can be created using the built-in `datetime` module.  

In [None]:
from datetime import datetime # Provides classes for manipulating dates and times
from dateutil import parser

In [None]:
# To manually build a date, specifying the different inputs
date = datetime(year=2021, month=11, day=29,hour=11, minute=30, second=25) # the last three are optional, default is 00:00:00
print(date)
print(type(date))

# To parse a string into a datetime object  (strptime)
xmas_day = '2021-12-25'
print(datetime.strptime(xmas_day, '%Y-%m-%d'))


# Note: the dateutil module provides the parser.parse function that can automatically parse dates from a variety of string formats:

first_day = '1st of January, 2022'
last_day = '31/12/21'
random_day = 'Nov 08, 1999 10:32 AM'
random_day2 = '20180803213450'

print(parser.parse(first_day))
print(parser.parse(last_day))
print(parser.parse(random_day))
print(parser.parse(random_day2))

##**1.2 Dates and Times - Pandas**

Pandas was developed in the context of financial modeling, thus it contains several tools for working with dates, times, and time-indexed data. It provides three important data structures for working with these data types:


*   *Timestamp*: this represents a particular moment in time (e.g., October 31st, 2010 at 8am). It is the Pandas replacement for Python's native `datetime`, and it is based on the more efficient `numpy.datetime64` type. Pandas represents timestamps using instances of `Timestamp` and sequences of timestamps using instances of `DatetimeIndex`;
*   *Period*: this allows working with periods, i.e. intervals of time (e.g., a 24-hour period). It is useful, for example, to check whether a specific event occurs within a certain period. The associated index structure is *PeriodIndex*;
*   *Timedelta*: this allows working with time deltas (or durations, e.g., a duration of 3 minutes or 45 seconds). The associated index structure is *TimedeltaIndex*.


Note: it is essential to save the index of a DataFrame as a DatetimeIndex and not as strings!
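
A minimal sketch of why the index type matters, using made-up values (the labels and numbers below are purely illustrative): the same dates stored as strings do not support date-aware selection or resampling, while a `DatetimeIndex` does.

```python
import pandas as pd

# Hypothetical values, for illustration only
labels = ['2021-01-15', '2021-02-10', '2021-02-20', '2022-03-05']
s_str = pd.Series([1, 2, 3, 4], index=labels)                 # index of plain strings
s_dt = pd.Series([1, 2, 3, 4], index=pd.to_datetime(labels))  # DatetimeIndex

print(type(s_str.index))
print(type(s_dt.index))

# Partial-string selection of a whole month relies on the DatetimeIndex
feb = s_dt.loc['2021-02']
print(feb)

# Time-aware operations such as resampling also need the DatetimeIndex
yearly_sum = s_dt.resample('YS').sum()  # 'YS' = year start frequency
print(yearly_sum)
```

Calling `s_str.resample('YS')` on the string-indexed Series would raise an error instead, because the index is not datetime-like.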



###**1.2.1. Timestamp and DatetimeIndex**

In [None]:
 # Creating a timestamp object - Example 1
xmas_day = pd.to_datetime('25th of Dec, 2021') # an alternative: pd.to_datetime('12/25/21')
print(xmas_day)
print(type(xmas_day)) # Timestamp type

# If I want to convert to a string:
Year = xmas_day.strftime('%Y')
print('Year:', Year)

In [None]:
# When passing a series of dates, pd.to_datetime() returns a DatetimeIndex, i.e. a group of Timestamp objects
dates = pd.to_datetime([datetime(2020, 12, 25), '4th of July, 2020',
                       '2018-Oct-21', '20200508', '1982/1/22'])
print(dates)
print(type(dates))
print('-----')

# Alternative way to create a DatetimeIndex
D = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
                          '2015-07-04', '2015-07-22'])
print(D)

A `DatetimeIndex` is very useful because it can be used as the index of a Series object:

In [None]:
index = pd.DatetimeIndex(['2014-07-04', '2014-03-04', '2015/07/22',
                          '7/4/99', '01/01/1900'])
data = pd.Series([10, 21, 32, 43,99], index=index)
print(index)
print('-----')
print(data)

print('-----')
print('Single element:', data.iloc[0]) # or data[0]
print('Single element with explicit indexing:', data.loc['1999-07-04'])
print('Specific year: \n', data['2014'])
print('Specific year/month: \n', data['2015-07'])

###**1.2.2. Period and PeriodIndex**

In [None]:
# To convert a DatetimeIndex to a PeriodIndex, the frequency has to be specified
dates = pd.to_datetime(['2013-02-02', '2012-01-02', '2015-11-30'])
print(dates) # DatetimeIndex object
print('-------------')

period_daily = dates.to_period('D') # create daily time periods, PeriodIndex object
print('Day:', period_daily)
print('-------------')
period_weekly = dates.to_period('W') # create weekly time periods
print('Week:', period_weekly)
print('-------------')
period_monthly = dates.to_period('M') # create monthly time periods
print('Month:', period_monthly)
print('-------------')
period_yearly = dates.to_period('Y') # create yearly time periods
print('Year:', period_yearly)

# Start/end time of a Period or other operations can be done on these PeriodIndex objects
Stime = period_weekly.start_time # becomes a DatetimeIndex object
Etime = period_monthly.end_time


###**1.2.3 Timedelta and TimedeltaIndex**

In [None]:
# A TimedeltaIndex is given by the temporal difference between a DatetimeIndex and Timestamp objects
dates = pd.to_datetime([datetime(2020, 12, 25), '31st of December, 1990',
                       '2018-Oct-6', '07-07-2017', '20200508', '20200422T203448']) #DatetimeIndex
dates_v2 = pd.to_datetime('2019-09-15') # Timestamp

Difference = dates-dates_v2
print(Difference)
print('----------')
Difference_2 = dates[2] - dates_v2
print(Difference_2)
print(type(Difference_2))



##**1.3 Creating Date Sequences**

Regular date sequences can be automatically created using functions, such as `pd.date_range()` for sequences of timestamps, `pd.period_range()` for periods, and `pd.timedelta_range()` for time deltas.
The frequency can also be changed to create more specific sequences, depending on our purposes, as we will see below.

###**1.3.1 Sequence of dates: `pd.date_range()`**

In [None]:
# 1. Simple sequence of Dates by specifying start/end: by default, the frequency is daily (output type is DatetimeIndex)
day = pd.date_range('2021-08-01', '2021-08-10', freq='B') # B for business day only
week = pd.date_range('2021-08-01', '2021-08-10',freq='W') # W weekly frequency
month = pd.date_range('2021-08-01', '2021-10-31',freq='M') # M monthly frequency

print(day)
print(week)
print(month)

# Note: a complete list of frequencies that can be used is provided at this link: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases

In [None]:
# 2. Sequence of Dates by specifying Period: date range is specified with a start point and number of periods (to set the number of samples)
date_rng_d = pd.date_range('01-01-1900 10:15', periods=5) # default daily frequency
print(date_rng_d)

date_rng_m = pd.date_range('01-01-1900 10:15', periods=5, freq='M') #M = last day of the month
print(date_rng_m)

date_rng_ms = pd.date_range('01-01-1900 10:15', periods=5, freq='MS') # MS = month start
print(date_rng_ms)

test = pd.date_range('2020-02-03 10:15', periods=8, freq = 'W')
print(test)

In [None]:
# Note: date_range() output can be used as index in a Series object or in a DataFrame
# Series example
rng = pd.date_range('2016 Jul 1', periods = 10, freq = 'D')
rng
Series1 = pd.Series(list(range(len(rng))), index = rng)
print(Series1)
print('-------')

# DataFrame example
np.random.seed(42)
DF1 = pd.DataFrame(60*np.random.rand(10,1), # random samples from a uniform distribution over [0, 1), the dimension of the output has to be specified
             columns=['Values'], index=rng)
DF1

# As alternative, we can have our default values for index, and then set the index to the rng values
# DF1 = pd.DataFrame(60*np.random.rand(10,1),
#             columns=['Values'])
# DF1.set_index(rng, inplace=True)
# DF1.index # DatetimeIndex

###**1.3.2 Sequence of periods: `pd.period_range()`**

In [None]:
# Create a period (interval) specifying the starting point and the number of periods
A = pd.period_range('2020-02-03', periods=8, freq='D')
print('A (Day):',A)
print('-----------')
B = pd.period_range('2020-02-03', periods=8, freq='M')
print('B (Month):',B)
print('-----------')
C = pd.period_range('2020-02-03', periods=8, freq='W')
print('C (Week):',C)
print('-----------')
D = pd.period_range('2020-02-03', periods=8, freq='10H')
print('D (10 Hours):',D)
print('-----------')


# Remember: Period represents an interval in time, whereas Timestamp/DatetimeIndex represents a point in time.

In [None]:
# As these are PeriodIndex object, we can print start and end times:
print('Start time period at index 0: ', A[0].start_time) # if I do not specify the index this operation will be repeated for all the elements
print('End time period at index 0: ', A[0].end_time)

###**1.3.3 Sequence of time intervals (deltas): `pd.timedelta_range()`**

In [None]:
A = pd.timedelta_range(start='10 days', periods=5)
C = pd.timedelta_range('2 hours', freq='30T', periods=10)
print(A)
print(C)

#**2. How to deal with tabular data**

Besides manually creating DataFrame and Series objects, most of the time we will load the data we want to analyse directly from a file. For example, to read a CSV file in Pandas we call the *read_csv* function. Besides the name of the file, we pass the *na_values* keyword argument with the character that represents "not available" data in the file. As most CSV files have a header with the names of the columns, the *usecols* parameter can be used to select which columns will be loaded. This avoids loading every column from the file and thus saves memory.
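
As a self-contained sketch of these arguments, the snippet below reads a small in-memory CSV instead of a file on Drive. The content of `csv_text`, including the extra `Flag` column, is made up for illustration and is not the course dataset.

```python
import io
import pandas as pd

# Made-up CSV content: ';' is the delimiter and ':' marks missing values
csv_text = """TIME;GEO;Value;Flag
2000;Austria;5.2;a
2001;Austria;:;b
2000;Cyprus;6.1;a
"""

toy = pd.read_csv(io.StringIO(csv_text), sep=';', na_values=':',
                  usecols=['TIME', 'GEO', 'Value'])  # 'Flag' is never loaded
print(toy)
print(toy.dtypes)  # 'Value' is float because ':' was read as NaN
```

Without `na_values=':'`, the `Value` column would be read as text, since `':'` cannot be converted to a number.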

In [None]:
# These example data are part of the European Commission database. They are government data, related in particular to
# education funding by the member states.
# In a delimiter-separated values file, such as a CSV file, each line is a data record and each record consists of one or more fields, separated by the
# delimiter character (usually a comma or semicolon).

edu = pd.read_csv(DrivePath +'/Data/education_Data.csv', na_values=':',sep=';',usecols=['TIME','GEO','Value'])
display(edu)
print('-------')
display(type(edu)) # DataFrame
display(edu.dtypes) # To check the data type of each column

# Remember: if we have loaded several columns and want to delete some of them:
# edu.drop(columns=['GEO'],inplace=True)

##**2.1 First check of the Data**

To see what the data look like, we can use the *head()* method, which shows just the first five rows. If we pass a number N as an argument, the first N rows are listed instead. Similarly, the *tail()* method shows the last N rows (five by default).

In [None]:
edu.head()
# edu.tail(7)

*Columns*, *index* and *values* attributes can be used to retrieve information about the content of our DataFrame object.


In [None]:
print('Columns:', edu.columns)
print('')
print('Indexes:', edu.index)
print('')
print('Values:', edu.values)

##**2.2 Data Selection**

All the ways to access DataFrame objects that we have seen before can now be applied to data loaded from external files (e.g., CSV or Excel files) or from websites, for example to select a single column or to filter the data.

In [None]:
# Single Column -> result will be a Series data structure, not a DataFrame, because only one column is retrieved.
T = edu['Value']
display(T)
display(type(T))

In [None]:
# Implicit (position-based) slicing with iloc: the end position is excluded, so this returns rows 10 to 13
edu.iloc[10:14]

In [None]:
# Explicit (label-based) slicing with loc: the end label is included, so this returns rows 10 to 14
edu.loc[10:14]

In [None]:
# Portion of the DataFrame
edu.loc[0:4,'TIME':'GEO']

In [None]:
# Filtering the DataFrame to select a subset of data (Boolean indexing)
mask = edu['Value'] > 6.5
edu[mask].head()
# or better: edu[edu['Value'] > 6.5].head()

In [None]:
# Storing given information from the DataFrame in a NumPy Array
data_array = edu['Value'].to_numpy() # To transform a specific column of the DataFrame into a NumPy array
display(data_array)

<u>Important Note</u>: Pandas uses the value NaN to represent missing values, which is a special floating-point value. A subtle feature of NaN values is that two NaN are never equal. Thus, the only safe way to tell whether or not a value is missing in a DataFrame is by using the *isnull()* function. Other useful functions are:
1.   *notnull()*: Opposite of isnull()
2.   *dropna()*: Return a filtered version of the data
3.   *fillna()*: Return a copy of the data with missing values filled or imputed

These functions can be used to filter rows with missing values.
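
The following sketch, on a toy Series rather than on `edu`, illustrates why an equality test cannot find missing values and how the functions listed above behave.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.5, np.nan, 3.0])  # toy data with one missing value

print(np.nan == np.nan)   # NaN is not equal to itself
print(values == np.nan)   # so an equality test cannot detect missing values
print(values.isnull())    # True only where the value is missing

n_missing = values.isnull().sum()      # count of missing values
kept = values.dropna()                 # filtered copy without NaN
filled = values.fillna(values.mean())  # NaN replaced by the mean of the valid values
print(n_missing, len(kept))
print(filled)
```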

In [None]:
# To identify the NaN values
null_elem = edu['Value'].isnull()
display(edu[null_elem].head(5))

In [None]:
# To discard the NaN values
edu_drop = edu.dropna() # To directly drop the entire rows with NaN
edu_drop.head(5)

In [None]:
# If we aim at filling NaN values, we can choose 0 with the fillna(0) function
edu_filled = edu.fillna(0)
edu_filled.head(5)

In [None]:
# Alternatively, we can specify a method to fill the values by propagating the previous or subsequent values (forward or backward-fill):
# ffill: propagate last valid observation forward to next valid
# bfill: use next valid observation to fill gap
edu_filled = edu.bfill(axis=0) # equivalent to fillna(method='bfill'), which is deprecated in recent pandas versions
edu_filled.head(5)

# axis can be used to define along which axis to fill missing values (0 or ‘index’, 1 or ‘columns’)
#edu_filled = edu.ffill(axis=1)
#edu_filled.head(5)

##**2.3 Sorting Data**
Another important way to inspect data is to order them according to a given column. This can be achieved by sorting on any column with *sort_values()*.

In [None]:
# Data sorted in descending order by 'Value' (i.e., from the largest to the smallest values):
edu_ordered = edu.sort_values(by='Value', ascending=False, inplace=False)
edu_ordered.head()

# Note 1: the 'inplace = True' keyword means that the DataFrame will be overwritten, and hence no new DataFrame is returned.
# Note 2: if 'ascending = True' the values are shown in ascending order

##**2.4 Simple Descriptive Statistics**
When loading the data in a dataframe, there is a convenience method `describe()` that computes several common aggregates for each column and returns the result. This can be a useful way to begin understanding the overall properties of a dataset.

The main aggregating functions that can be used for Series and DataFrame objects are reported below:
<figure>
<center>
<img src=https://drive.google.com/uc?id=1BuxNv3DbiljZSDSXCQvAx0E8xBdK8Bb0 width="350"/>
</center>
</figure>

In [None]:
# This is applied on columns with numerical variables only
display(edu.head(10))
display(edu.dtypes)
print('----------------')
edu['Value'].describe() # To avoid results from the TIME column, which are not meaningful in this case (TIME is int, not yet a DatetimeIndex)

##**2.5 Rearranging Data**

So far, the index of the DataFrame created from the imported CSV file has been a simple sequence of row numbers (e.g., from 0 to 383) without much meaning. However, we can rearrange the data, redistributing the index and columns so that they are easier to manipulate and to use in further operations. \\
The *pivot_table()* function is a useful tool for this purpose. A pivot table takes simple column-wise data as input and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. The *pivot_table()* function in Pandas requires specifying the columns to be used as the new index, the new columns, and the values to aggregate.
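
A minimal sketch of this reshaping on a toy long-format table; the column names mirror the education data, but the numbers are made up for illustration.

```python
import pandas as pd

# Toy long-format table (made-up numbers), one row per measurement
long_df = pd.DataFrame({
    'GEO':   ['Austria', 'Austria', 'Austria', 'Cyprus', 'Cyprus'],
    'TIME':  [2000, 2000, 2001, 2000, 2001],
    'Value': [5.0, 7.0, 6.0, 4.0, 8.0],
})

# Rows become countries, columns become years; duplicated (GEO, TIME) pairs are averaged
wide = pd.pivot_table(long_df, values='Value', index='GEO', columns='TIME', aggfunc='mean')
print(wide)
```

The two Austria rows for 2000 collapse into a single cell holding their mean, which is the "summarization" step performed by the aggregating function.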

In [None]:
# Additional data organizations with pivot tables
display(edu.head(4))
display(edu.dtypes)

In [None]:
piv_edu = pd.pivot_table(edu, values='Value',
                        index=['GEO'], columns=['TIME'],aggfunc='median') # default aggfunc='mean'
piv_edu.head(3) # It appears in ascending order

# Note: if values are not specified, the pivot table will include the summary measures for all the variables with numerical values

In [None]:
# Accessing the data stored in the pivot table
display(piv_edu[2003]) # To extract an entire Column - Alternative: piv_edu.loc[:,2003]
# Note: I can not use '2003' (string) since the column names are Integer numbers, as confirmed by piv_edu.columns

print('------------')
display(piv_edu.loc[['Austria','Cyprus']]) # To extract all the values for two separate rows

print('------------')
display(piv_edu.iloc[0:2,0:3]) # To extract all the values for a part of the table (first two rows, three columns)

In [None]:
# Important, as we will see better later:
# when loading data from external sources, parse_dates allows the date column to be imported as timestamps (DatetimeIndex),
# while index_col specifies the index of our dataframe
edu_up = pd.read_csv(DrivePath +'/Data/education_Data.csv', na_values=':',sep=';',usecols=['TIME','GEO','Value'],parse_dates=['TIME'],index_col='TIME')
display(edu_up)
edu_up.index # DatetimeIndex

# If parse_dates is not included, index will be a simple series of Int numbers rather than a DatetimeIndex

##**Exercise 1:**

Using the data stored in the "sample_pivot.xlsx" file, explore pivot tables.
As a first step, load the data (hint: the function is `pd.read_excel()`, with parse_dates and index_col as for `pd.read_csv()`), setting the time information as the index.

In [None]:
# Load the data as required and visualise the DataFrame, including the types for the different columns



In [None]:
# Verify whether NaN values are present in the "Units" columns and count how many they are


# Fill the NaN values with the median value for that entire column



In [None]:
# Create a pivot table from the filled DataFrame using Type and Region columns of interest, Units as values and mean as aggregating function


In [None]:
 # Create a pivot table from the filled DataFrame using Type and Region columns of interest.
 # For the aggregating function, use mean for Sales and sum for Units.


##**2.6 Resampling**
The process of converting a time series from one frequency to another is called *resampling*. Aggregating higher-frequency data to a lower frequency is called **downsampling**; converting lower-frequency data to a higher frequency is called **upsampling**. \\
This can be done using the `resample()` method or the `asfreq()` method. The primary difference between the two is that `resample()` is a data aggregation method, while `asfreq()` is a data selection method. \\
Some options for the resampling period:
> W: weekly frequency \\
> M: month end frequency \\
> SM: semi-month end frequency (15th and end of month) \\
> Q: quarter end frequency \\
> BA: business year end frequency
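
The difference between aggregation and selection can be sketched on a small synthetic series (ten consecutive days with values 0 to 9, chosen only so that the results are easy to read):

```python
import pandas as pd

# Synthetic daily series: values 0..9 on ten consecutive days
idx = pd.date_range('2021-01-01', periods=10, freq='D')
daily = pd.Series(range(10), index=idx)

# Downsampling with aggregation: mean of each 3-day bin
mean_3d = daily.resample('3D').mean()
# Selection: the value found at every third day, with no aggregation
pick_3d = daily.asfreq('3D')
# Upsampling: the new 12-hour timestamps have no data, so they are NaN unless filled
up_12h = daily.asfreq('12h')
up_12h_filled = daily.asfreq('12h', method='ffill')

print(mean_3d)
print(pick_3d)
print(up_12h.head(4))
print(up_12h_filled.head(4))
```

Both downsampled results share the same index, but `resample().mean()` summarises every observation in a bin, whereas `asfreq()` simply keeps the observation at the start of each bin.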


In order to investigate these aspects, we will load some stock price data from the Yahoo Finance API. The download call used below takes three main arguments: \\
> 1) Tickers (i.e., the name of the stock we want to load); \\
> 2) Start date + End date or Period; \\
> 3) Interval (i.e. the frequency/time frame we want to inspect the prices).

Valid intervals are: 1m, 2m, 5m, 15m, 30m, 60m, 90m, 1h, 1d, 5d, 1wk, 1mo, 3mo

In [None]:
!pip install yfinance
import yfinance as yf

# Option 1 -> specifying 'period'
data = yf.download(tickers='AAPL', period='2d', interval='1h', progress=False, auto_adjust=False) # This returns a DataFrame
display(data.head())

print(data.shape)
print(data.index) # DatetimeIndex object

In [None]:
data.info() # To have some summary general information

In [None]:
# Option 2 -> specifying start/end time
data = yf.download(tickers='AAPL', start='2021-01-01', end='2021-12-31', interval='1wk', progress=False, auto_adjust=False) # auto_adjust=False keeps the 'Adj Close' column used below
data.head(7)

In [None]:
# For accessing the data: different options with loc and iloc attributes as before
print('Indexing (implicit):')
print(data.iloc[0])
print('----------')
print('Indexing (explicit):')
print(data.loc['2021-01-01'])
print('----------')
print('Slicing (explicit):')
print(data.loc['2021-04-01':'2021-06-1'])
print('----------')


In [None]:
# Select a single column from the Dataframe storing the retrieved information
data_closing = data['Adj Close']
display(data_closing.head(12))

print(data_closing.index) # DatetimeIndex object
print(type(data_closing)) # Series
print(data_closing.shape)

data_closing.plot(); # First (basic!) plot for visualising the time series

In [None]:
# Resample() - Example
Maximum = data_closing.resample('M').max()
Maximum.plot(style=':')

Mean = data_closing.resample('M').mean()
Mean.plot(style='--')

plt.legend(['Maximum', 'Mean'],loc='upper left');

display('Original data (W freq):', data_closing.head(7))
display('Resample data (M freq):', Maximum.head(7))

# Note from the resample() documentation:
# The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex),
# or the caller must pass the label of a datetime-like series/index to the "on" parameter.


In [None]:
# Asfreq() - Example
data_TEST = yf.download(tickers='TSLA', start='2021-01-01', end='2021-11-30', interval='1d', progress=False, auto_adjust=False) # DataFrame; auto_adjust=False keeps 'Adj Close'
display(data_TEST.head(6))
data_5d = data_TEST['Adj Close'].asfreq('5d') # The values corresponding to any timesteps in the new index which were not present in the original index will be NaN
print(data_5d)

data_5d.plot(style='--')
plt.legend(['5d freq'],loc='upper left');

#data_5d_filled = data_TEST['Adj Close'].asfreq('5d', method='ffill') # To fill the NaN values
#data_5d_filled.plot(style='-')
#plt.legend(['5d freq filled'],loc='upper left');

# I could combine the two, but it will only select specific values and fill the others with NaN
#T = data_closing.resample('M').asfreq()
#T

In [None]:
# Upsampling with Resample - In this case we have to add an interpolation step, rather than an aggregation function
Up = data_closing.resample('3d').interpolate(method='linear')
display(Up)

fig,axs = plt.subplots(1,2,figsize=(15,5))
data_closing.plot(ax=axs[0],marker = 'o', ms=3)
Up.plot(ax=axs[1],marker = 'o', ms=3)

###**Example 1 - Real Data**

The dataset in the Temp_Prep_Data.csv file contains the daily maximum temperature (in Fahrenheit) and the total precipitation (in inches) for July 2018 in Boulder, Colorado. Data were provided by the National Oceanic and Atmospheric Administration. Here we will see why it is important to handle dates as DatetimeIndex objects rather than strings.

In [None]:
# Loading data from csv
temp_prep_data_orig = pd.read_csv(DrivePath +'/Data/Temp_Prep_Data.csv', na_values='-999',sep=',')
display(temp_prep_data_orig.head(6))
print(temp_prep_data_orig.dtypes) # date is object type
print(temp_prep_data_orig.shape) # 31 rows, 3 columns

In [None]:
# Initial visualisation
fig, ax = plt.subplots(figsize=(10, 4));
ax.plot(temp_prep_data_orig['date'],
        temp_prep_data_orig['precip'],
        color='green');

ax.set(xlabel="Date",
       ylabel="Precipitation",
       title="Daily Total Precipitation\nBoulder - Jul 2018 (Colorado)");


In [None]:
# NaN values could be filled
temp_prep_data_filled = temp_prep_data_orig.bfill() # backward fill (fillna(method='bfill') is deprecated); other filling strategies could be used, as seen before

fig, ax = plt.subplots(figsize=(10, 4));
ax.plot(temp_prep_data_filled['date'],
        temp_prep_data_filled['precip'],
        color='green');

ax.set(xlabel="Date",
       ylabel="Precipitation",
       title="Daily Total Precipitation\nBoulder - Jul 2018 (Colorado)");


If we look at the x-axis, Matplotlib struggles to draw all of the date labels: each value is read as a string, so every label is treated as a separate category and the labels cannot be arranged efficiently along the axis. It is therefore important to parse the time information as datetime objects during the import phase, and possibly to set it as the index, which makes all these operations easier.
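
When the data have already been loaded with dates as strings, the conversion can also be done afterwards. The sketch below uses a toy table with made-up precipitation values, not the actual file.

```python
import pandas as pd

# Toy table with dates stored as strings (made-up precipitation values)
df = pd.DataFrame({'date': ['2018-07-01', '2018-07-02', '2018-07-03'],
                   'precip': [0.0, 0.2, 0.1]})
print(df['date'].dtype)  # a text dtype: each date is just a string

# Convert the column to datetime and use it as the index
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
print(df.index)  # DatetimeIndex
```

Plotting `df['precip']` now lets Matplotlib choose sensible date ticks instead of printing one label per row.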


In [None]:
# Better way to import the data - 1
temp_prep_data = pd.read_csv(DrivePath +'/Data/Temp_Prep_Data.csv', na_values='-999',sep=',',parse_dates=['date'])
display(temp_prep_data.head(6))
print(temp_prep_data.dtypes) # date is datetime

In [None]:
# Better way to import the data - 2
temp_prep_data = pd.read_csv(DrivePath +'/Data/Temp_Prep_Data.csv', na_values='-999',sep=',',parse_dates=['date'], index_col = 'date')
display(temp_prep_data.head(6))
print(temp_prep_data.dtypes)
print(temp_prep_data.index) # datetime is now the index of the dataframe (DatetimeIndex)

In [None]:
# Resample with 3 days period
Mean_3d = temp_prep_data['precip'].resample('3d').mean()
display(Mean_3d)
fig, ax = plt.subplots(figsize=(10, 4));
ax.plot(Mean_3d,
       color='green');

ax.set(xlabel="Date",
       ylabel="Precipitation",
       title="3-Day Mean Precipitation\nBoulder - Jul 2018 (Colorado)");

# Question:  what happens if I apply the resample function to the original "temp_prep_data_orig" data?

In [None]:
# Alternative with asfreq (note that the output is different)
Asfreq_3d = temp_prep_data['precip'].asfreq('3d')

fig, ax = plt.subplots(figsize=(6, 6));
ax.plot(Asfreq_3d,
        color='green');

ax.set(xlabel="Date",
       ylabel="Precipitation",
       title="Daily Total Precipitation\nBoulder - Jul 2018 (Colorado)");

display(Asfreq_3d)

In [None]:
display(Asfreq_3d)
print('---------')
display(Mean_3d)
print('---------')
display(temp_prep_data.head(20))

###**Example 2 - Real Data**

The data we are going to analyse are related to earthquakes, in different periods and locations.

In [None]:
# Import version 1
earthquake_data = pd.read_csv(DrivePath +'/Data/Data_earthquakes.csv', na_values='',sep=',')
display(earthquake_data.head())
print(earthquake_data.dtypes) # datetime is object type
#print(earthquake_data.shape) # 27 rows, 5 columns

In [None]:
# Import version 2 - compare this to the previous one
earthquake_data = pd.read_csv(DrivePath +'/Data/Data_earthquakes.csv', na_values='',sep=',', parse_dates=['datetime'])
display(earthquake_data.head())
print(earthquake_data.dtypes) # datetime is datetime64 type


To split a column with date and time information into separate columns, `Series.dt` can be used to access the values of the series such as year, month, day etc.

In [None]:
# Splitting the date/time information
earthquake_data['date'] = earthquake_data['datetime'].dt.date
earthquake_data['time'] = earthquake_data['datetime'].dt.time
earthquake_data['year'] = earthquake_data['datetime'].dt.year

earthquake_data['month'] = earthquake_data['datetime'].dt.month
earthquake_data['day'] = earthquake_data['datetime'].dt.day

earthquake_data['hour'] = earthquake_data['datetime'].dt.hour
earthquake_data['minute'] = earthquake_data['datetime'].dt.minute
earthquake_data['second'] = earthquake_data['datetime'].dt.second

# Drop the unnecessary columns (redundant)
earthquake_data.drop(columns=['datetime'],inplace=True)


In [None]:
display(earthquake_data.head())
print(earthquake_data.dtypes) # Date is an object
print(earthquake_data['date'][0])

# To convert Date to a Datetime object:
earthquake_data['date'] = pd.to_datetime(earthquake_data['date'])
print(earthquake_data.dtypes) # Date is a Datetime
print(earthquake_data['date'][0])

To perform the inverse operation, i.e. to merge the individual columns for year, month, day, etc. into a single datetime column, the `pd.to_datetime` function seen before can be used:

In [None]:
earthquake_data['new_datetime'] = pd.to_datetime(earthquake_data[["year", "month", "day", "hour", "minute", "second"]])
earthquake_data.head()


In [None]:
earthquake_data.info()

##**Exercise 2 - Real Data**

The data we will analyse in this exercise are measures of the global-scale temperature (global_temperature.csv), as provided by two different centers (from 1880 up to 2016).

In [None]:
# 1. Load the data and set the "Date" column to a DatetimeIndex object.
# Visualise the different types of data + the dimension of the dataframe.


In [None]:
# 2. Type the following command:
temperature_data.Mean[:350].plot();

# Is this operation correct? What are we looking at?

In [None]:
# 3. What are the steps you can apply to obtain a new table with Dates as index and two columns, each one representing the temperatures for one of the two centers?


In [None]:
# 4. Visualise the temperature data from the two centers in a single figure


In [None]:
# 5. Downsample the temperature data with year frequency, using mean as aggregator.
# Visualise in a single figure the new temperature data for the two centers (yearly frequency)


#**3. Time Series Data Visualization in Python - Part 1**

Data visualization is an essential task across projects and domains, as it conveys what the information means by presenting it visually through maps or graphs.
Among the different options for data visualization, the Matplotlib library is one of the best known in Python. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch for enabling interactive MATLAB-style plotting.
While it has been widely used for years, newer packages have since been developed (e.g., Seaborn, ggpy, HoloViews, Altair).

##**3.1. Time plots**

For time series data, the obvious graph to start with is a time plot (as briefly seen above). In this graph, the observations are plotted against the time of observation, with consecutive observations joined by straight lines.

In [None]:
# General importing
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
# Simple Time Plot
y = np.linspace(0, 25, 2000)
x =  pd.date_range('01-01-1850', periods=2000, freq='MS')
fig = plt.figure()
plt.plot(x, y**2, '-', label='time series');

# To adjust the axis limits
plt.xlim([x.min(), x.max()])
plt.ylim(0, (y**2).max()+100);

# To add labels and title
plt.title("Example of a simple time plot")
plt.xlabel("Time [months]")
plt.ylabel("Number of observations");

# To add a legend
plt.legend(frameon=True, loc='upper left');


# plt.style.available # to check the available style that can be used

In [None]:
# General command for saving figures to files, directly in Google Drive
fig.savefig(DrivePath +'/Data/my_time-plot.png',dpi=300)

# To download to the local computer
#from google.colab import files
#files.download(DrivePath +'/Data/my_time-plot.png')



In [None]:
# EXTRA - Save all the figures directly in a pdf file
!pip install fpdf
from fpdf import FPDF
from datetime import date

pdf = FPDF(format='A4', unit='mm')
pdf.add_page()
pdf.set_font("Arial", size=12)
today = date.today()
d2 = today.strftime("%B %d, %Y")
pdf.cell(200, 10, txt="Lesson 3 -  " + str(d2), ln=1, align="C")
stringa = "Simple time plot"
pdf.multi_cell(0,15,stringa)

def add_image(image_path,pdf_file, w):
    pdf_file.image(image_path, w=w)

add_image(DrivePath +'/Data/my_time-plot.png',pdf, w=120)

pdf.output(DrivePath + '/Data/'+'TEST_time-plot'+'.pdf');

For any scientific measurement, accounting for errors is as important as accurately reporting the number itself. When visualising data and results, showing these errors effectively conveys a more complete picture.  \\
A basic error bar plot can be created by calling the Matplotlib function `errorbar()`. This is similar to a line plot, except that each data point comes with an error bar that quantifies the uncertainty or variance of that datum.

In [None]:
# Example 1
from matplotlib.dates import DateFormatter

date_form = DateFormatter("%Y")
np.random.seed(42)

date_rng =  pd.date_range('01-01-1850', periods=100, freq='M')
x = np.linspace(0, 10, 100)
dy = 0.8
y = np.sin(x) + dy * np.random.random(100) # to add noise to each point of a sinusoidal signal

fig, ax = plt.subplots()
ax.errorbar(date_rng, y, yerr = dy, ecolor='lightgrey', elinewidth=2, capsize=2, fmt='o-'); # yerr specifies the size of the error bars in the y direction
ax.xaxis.set_major_formatter(date_form)
ax.set_xlim(date_rng.min(), date_rng.max());
ax.set_xlabel('Time')
ax.set_ylabel('Values');

In [None]:
# Example 2
# In some cases, it might be useful to visualise the errorbar as shaded area across the time plot:
mean_1 = np.array([10, 20, 30, 25, 32, 43])
std_1 = np.array([2.2, 2.3, 1.2, 2.2, 1.8, 3.5])

mean_2 = np.array([12, 22, 30, 13, 33, 39])
std_2 = np.array([2.4, 1.3, 2.2, 1.2, 1.9, 3.5])

date_form = DateFormatter("%b-%Y")
x = pd.date_range('01-2022', periods = len(mean_1), freq='MS')

fig, ax = plt.subplots()
ax.plot(x, mean_1, 'b-', label='Signal 1')
ax.fill_between(x, mean_1 - std_1, mean_1 + std_1, color='b', alpha=0.2)
ax.plot(x, mean_2, 'r-', label='Signal 2')
ax.fill_between(x, mean_2 - std_2, mean_2 + std_2, color='r', alpha=0.2);
ax.xaxis.set_major_formatter(date_form)
ax.set_xlim(x.min(), x.max());
ax.legend(frameon=False, loc='upper left', ncol=1);


In [None]:
# Example 3
# Representative fMRI time series (dataset already available in seaborn)
#plt.style.use('seaborn')

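import seaborn as sns # the alias `sns` is needed below (seaborn was imported without an alias at the top)
fmri = sns.load_dataset("fmri")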
fmri_pivot = pd.pivot_table(fmri, index='timepoint',columns=['event','subject'],values='signal')
display(fmri_pivot.tail(6))

A = fmri_pivot['cue'].mean(axis=1) # A.shape = (19,)
B = fmri_pivot['stim'].mean(axis=1)

Astd = fmri_pivot['cue'].std(axis=1)
Bstd = fmri_pivot['stim'].std(axis=1)

x = list(range(0,len(A)))
plt.plot(x,A, 'g-', label='Cue')
plt.fill_between(x, A - Astd, A + Astd, color='g', alpha=0.2);
plt.plot(x,B, 'b-', label='Stim')
plt.fill_between(x,B - Bstd, B + Bstd, color='b', alpha=0.2);
plt.legend(frameon=True, loc='upper right', ncol=1);
plt.xlim([0,len(A)-1]);


In [None]:
# Alternative using the seaborn functionalities
# Note: seaborn >= 0.12 uses errorbar='sd'; older versions use ci='sd' instead
sns.lineplot(data=fmri, x='timepoint', y='signal', hue='event', errorbar='sd', estimator='mean');

##**3.2 Scatter plots**
Time plots are useful for visualising individual time series. However, it is often necessary to explore relationships between multiple variables (time series in our case). A scatter plot shows the relationship between two series by plotting one against the other.
Instead of being joined by line segments, the points are drawn individually as a dot, circle, or other shape. That is, each (x,y) coordinate pair is represented by a symbol.

In [None]:
# Example 1 on a real scenario: stock prices time series
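#!pip install yfinance # uncomment if yfinance is not installed
import yfinance as yf
plt.style.use('seaborn-v0_8') # this style is named 'seaborn' in Matplotlib < 3.6
stocks = ['GOOG', 'AMZN']
# auto_adjust=False keeps the 'Adj Close' column (recent yfinance versions adjust prices by default)
data_dw = yf.download(tickers=stocks, start = '2021-01-01', auto_adjust=False, progress=False)
data = data_dw['Adj Close']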
display(data.head(6))

# Simple time plots
from matplotlib import rcParams
rcParams['figure.figsize'] = 12,6
plt.plot(data.AMZN,label='Amazon')
plt.plot(data.GOOG, label='Google')
plt.grid(True, color='k', linestyle=':')
plt.title("Amazon & Google Prices")
plt.xlabel("Date")
#plt.yticks([500, 1000,1500,2000,2500])
plt.legend(frameon=False, loc='lower center', ncol=2);

In [None]:
# Simple scatter plot -> The correlation does not seem high
plt.scatter(data.GOOG, data.AMZN)
plt.xlabel('Google')
plt.ylabel('Amazon');

In [None]:
# Scatter plot of the first differences (day-to-day price changes)
returns = data.diff()
display(returns.head(10))
returns.dropna(inplace=True)

rcParams['figure.figsize'] = 6,6
plt.scatter(returns.GOOG, returns.AMZN, c = 'r', edgecolor='k')

plt.axvline(0, c=(.5, .5, .5), ls='--')
plt.axhline(0, c=(.5, .5, .5), ls='--')
plt.xlabel('Google')
plt.ylabel('Amazon')
plt.xlim((-20,20))
plt.ylim((-20,20));


In [None]:
# Example 2 - Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.keys())
print('Features:',iris.feature_names)
print('Species:',iris.target_names)

features = iris.data.T
rcParams['figure.figsize'] = 5,5 # must be set before the figure is created to take effect
fig, ax = plt.subplots()

scatter = ax.scatter(features[0], features[1], alpha=0.4, s=100*features[3], c=iris.target, cmap='viridis')
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1]);
legend1 = ax.legend(*scatter.legend_elements(), loc = 'center', bbox_to_anchor=(1.1, 0.5), title='Species')
ax.add_artist(legend1);
# Note: legend_elements(prop ='sizes') produces a legend with a cross section of sizes from the scatter
# legend2 = ax.legend(*scatter.legend_elements(prop='sizes'), loc= 'upper left', title='Sizes')

# --> This type of visualization lets us explore four different features of the data simultaneously:
# the (x, y) location of each point corresponds to the sepal length and width, the size of the point is related to the petal width,
# and the color is related to the particular species of flower.


###**3.2.1. Correlation**
It is common to compute correlation coefficients to measure the strength of the linear relationship between two variables. The correlation between variables x and y is given by:
<figure>
<center>
<img src="https://drive.google.com/uc?id=1bgp74cqZXyKwpQNDoZHsHdoLNojMqhPA" width="350"/>
</center>
</figure>


The value of r always lies between -1 and 1, with negative values indicating a negative relationship (data are anticorrelated) and positive values indicating a positive relationship. Values close to 0 indicate little or no linear relationship.
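
For reference, the standard definition of the Pearson correlation coefficient can be written out as below. This is a transcription of the textbook formula (assumed to match the figure above), and it is the expression implemented by the `correlation` function in the next cell; $\bar{x}$ and $\bar{y}$ denote the sample means and $n$ the number of observations.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Dividing the numerator and each sum in the denominator by $n-1$ leaves $r$ unchanged, which is why the same value can be obtained as the sample covariance divided by the product of the two sample standard deviations.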

In [None]:
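# Simple function to calculate the Pearson correlation value given two arrays of equal size
# Toy example (overwritten just below by the real data):
x = np.array([-2, -1, 0, 1, 2])
y = np.array([5, 1, 3, 2, 0])

# Real data: day-to-day price changes computed in the scatter plot cell
x = returns['AMZN']
y = returns['GOOG']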

def correlation(x,y):
  A = np.sum((x-np.mean(x))*(y-np.mean(y)))
  B = np.sum((x-np.mean(x))**2) *np.sum((y-np.mean(y))**2)
  corr = A/(B)**0.5
  return(corr)

A = correlation(x,y)
print('Correlation value [Code] is:',A)


In [None]:
# Simple function to calculate the covariance value given two arrays of equal size
def covariance(x,y):
  A = np.sum((x-np.mean(x))*(y-np.mean(y)))
  cov_val = A/(len(x)-1) # sample covariance
  return(cov_val)

A = covariance(x,y)
print('Covariance value is:',A)

In [None]:
# Note: We can use the covariance value as numerator for the correlation equation:
def correlation_v2(x,y):
  A = covariance(x,y)
  # B is the product of the two sample variances (each sum of squares divided by n-1)
  B = np.sum((x-np.mean(x))**2)/(len(x)-1) *np.sum((y-np.mean(y))**2)/(len(y)-1)
  corr = A/(B)**0.5
  return(corr)

A = correlation_v2(x,y)
print('Correlation value is:',A)

In [None]:
# To verify these values with Python available functions:

# Pearson Correlation coefficient - Option 1
my_corrcoef = np.corrcoef(x, y) # This returns a 2x2 matrix
print('Correlation value [Python] is:', my_corrcoef[0,1])

# Pearson Correlation coefficient - Option 2
PC_val = stats.pearsonr(x,y) # This returns two values
print('Correlation value [Python] is:',PC_val[0])
print('The associated p-value is:',PC_val[1])

COV_val = np.cov(x,y) # By default np.cov computes the sample covariance (divides by n-1). Returns a 2x2 matrix
print('Covariance value [Python] is:',COV_val[0,1])
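
As a small sanity check, the same comparison can be run on the toy arrays defined earlier, where the numbers are easy to verify by hand. The sketch below is self-contained and uses hypothetical variable names (`x_toy`, `y_toy`, `dev_x`, `dev_y`) so that it does not overwrite the stock data.

```python
import numpy as np
from scipy import stats

# Toy arrays from the correlation cell (names are illustrative)
x_toy = np.array([-2, -1, 0, 1, 2])
y_toy = np.array([5, 1, 3, 2, 0])

# Deviations from the sample means
dev_x = x_toy - x_toy.mean()
dev_y = y_toy - y_toy.mean()

# Pearson correlation from the definition
r_manual = np.sum(dev_x * dev_y) / np.sqrt(np.sum(dev_x**2) * np.sum(dev_y**2))

# Sample covariance, and correlation as covariance / (std_x * std_y)
cov_manual = np.sum(dev_x * dev_y) / (len(x_toy) - 1)
r_from_cov = cov_manual / (x_toy.std(ddof=1) * y_toy.std(ddof=1))

# Library results for comparison
r_numpy = np.corrcoef(x_toy, y_toy)[0, 1]
r_scipy = stats.pearsonr(x_toy, y_toy)[0]
cov_numpy = np.cov(x_toy, y_toy)[0, 1]

print('Manual r:', r_manual, '| NumPy r:', r_numpy, '| SciPy r:', r_scipy)
print('Manual covariance:', cov_manual, '| NumPy covariance:', cov_numpy)
```

By hand, the numerator is -9, the sums of squares are 10 and 14.8, so r = -9 / sqrt(148), and the sample covariance is -9 / 4 = -2.25.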

Another measure of correlation we might use with time series data is the *Spearman rank correlation*, a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables).
While the Pearson correlation assesses linear relationships, the Spearman correlation assesses monotonic relationships, whether linear or not.

In [None]:
# The Spearman rank-order correlation coefficient is a nonparametric measure of the monotonicity of the relationship between two datasets.
# Unlike the Pearson correlation, it does not assume that both datasets are normally distributed.

c = stats.spearmanr(x,y)
print('Spearman Correlation value [Python] is:',c[0])
print('The associated p-value is:',c[1])
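
To illustrate the difference between linear and monotonic association, here is a minimal sketch on synthetic data (not part of the stock example; `x_mono` and `y_mono` are illustrative names). The relationship y = exp(x) is strictly increasing but far from linear, so the two coefficients disagree. The last lines also show that the Spearman coefficient is the Pearson coefficient computed on the ranks.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly increasing but non-linear relationship
x_mono = np.linspace(0, 5, 50)
y_mono = np.exp(x_mono)

r_pearson = stats.pearsonr(x_mono, y_mono)[0]
rho_spearman = stats.spearmanr(x_mono, y_mono)[0]

# Spearman correlation = Pearson correlation applied to the ranks
rank_x = stats.rankdata(x_mono)
rank_y = stats.rankdata(y_mono)
rho_from_ranks = stats.pearsonr(rank_x, rank_y)[0]

print('Pearson :', r_pearson)         # clearly below 1: the relationship is not linear
print('Spearman:', rho_spearman)      # equal to 1 up to rounding: the relationship is perfectly monotonic
print('Pearson on ranks:', rho_from_ranks)
```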