# Pandas (Basic)

`Pandas` is a fast, powerful, flexible, and easy-to-use open-source tool for data preparation, analysis, and manipulation, built on top of the `Python` programming language. This tutorial has two parts, **Basic** and **Advanced**. The former is mandatory for this course and the latter is optional. The following content is the **Basic** part.

## Install Pandas
If you run the code on your own computer, you need to install pandas first. Open a console and enter ```pip install pandas```.

In [1]:
# Import pandas and print its version. The 'as' keyword binds pandas to the shorter alias 'pd'.
import pandas as pd
print(pd.__version__)

1.1.5


## Introduction to pandas Data Structures
The two primary data structures of pandas are
+ Series
+ DataFrame

The `Series` is designed to hold a one-dimensional sequence of data, and the `DataFrame` is designed to hold data with several dimensions.
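
A quick way to see this difference is to inspect the `ndim` and `shape` attributes of each object. The following is a minimal sketch; the names `s_demo` and `df_demo` and their values are made up for illustration.

```python
import pandas as pd

# A Series holds one-dimensional labelled data
s_demo = pd.Series([-1, 3, 8])
print(s_demo.ndim, s_demo.shape)    # 1 (3,)

# A DataFrame holds two-dimensional labelled data (rows and columns)
df_demo = pd.DataFrame({"x": [-1, 3, 8], "y": [10, 20, 30]})
print(df_demo.ndim, df_demo.shape)  # 2 (3, 2)

# Each column of a DataFrame is itself a Series
print(isinstance(df_demo["x"], pd.Series))  # True
```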

## Series
As shown below, the internal structure of a `Series` object is simple: it is composed of two arrays associated with each other. The main array holds the data (of any `NumPy` type), and each of its elements is associated with a label held in the other array, called the ```index```.

|index|Value|
|:---|--:|
|0|-1|
|1|3|
|2|8|

### Defining a Series

To create the series shown above, simply call the `Series()` constructor and pass as an argument an array containing the values to be included in it.

In [2]:
s = pd.Series([-1,3,8]) # list
print(s)
s1 = pd.Series(s) # from Other Series
print(s1)

0   -1
1    3
2    8
dtype: int64
0   -1
1    3
2    8
dtype: int64


As you can see from the output of the series, the index (a series of labels) is on the left and the corresponding values are on the right.


```{margin}
If you do not specify any index during the definition of the series, by default, pandas will assign numerical values increasing from 0 as labels. In this case, the labels correspond to the indexes (position in the array) of the elements in the series object.
```

In general, it is best to create a series with meaningful labels to distinguish and identify each item. In the constructor call, the labels are added by specifying the `index` option or by passing a `dict` object.

In [3]:
# when declaring a Series, you can specify the index
s2 = pd.Series({"a":-1,"b":3,"c":8}) # dictionary
print(s2)
s3 = pd.Series([-1,3,8], index=['x','y','z']) # specify the index by 'index' option
print(s3)

a   -1
b    3
c    8
dtype: int64
x   -1
y    3
z    8
dtype: int64


In [4]:
import numpy as np
arr = np.array([100,20,-3]) # from NumPy Arrays
s4 = pd.Series(arr, index=['x','y','z'] )
print(s4)

x    100
y     20
z     -3
dtype: int32


```{margin}
Keep in mind that, in the pandas version used here, the values contained in the `NumPy` array or in the
original series are not copied but passed by reference. The new series object refers to the same data
as the source object, so if an element of the source changes in value, that change is also
present in the new series object.
```

In [5]:
arr[2] = -40
arr
s4

x    100
y     20
z    -40
dtype: int32

As you can see in this example, by changing the third element of the `arr` array, we
also modified the corresponding element in the `s4` series.
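
If this sharing is not wanted, one option is to request an explicit copy when building the series. The sketch below assumes the `copy` argument of the `Series` constructor; the names `arr_demo` and `s_copy` are illustrative. Whether a series shares memory with its source can depend on the pandas version and settings, so an explicit copy is the safer choice when independence matters.

```python
import numpy as np
import pandas as pd

arr_demo = np.array([100, 20, -3])

# copy=True asks pandas to store its own copy of the data
s_copy = pd.Series(arr_demo, index=["x", "y", "z"], copy=True)

arr_demo[2] = -40      # modify the original array afterwards
print(s_copy["z"])     # still -3, the copy is unaffected
```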

If you want to see the two arrays that make up this data structure individually, you
can call the two attributes of the series: `index` and `values`.

In [6]:
print(s.values)
print(s.index)

[-1  3  8]
RangeIndex(start=0, stop=3, step=1)


### Selecting the Internal Elements

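You can select individual elements by position, as in ordinary NumPy arrays. (Recent pandas versions warn about plain integer positions on a labelled index; `s2.iloc[1]` is the explicit positional form.)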

In [7]:
s2 = pd.Series({"a":-1,"b":3,"c":8})
s2[1]

3

Or you can specify the label of the element in the index.

In [8]:
s2['a']

-1

In the same way you select multiple items in a NumPy array, you can select a range
by position, a range by label, or a list of labels:

In [9]:
s2[0:2]

a   -1
b    3
dtype: int64

In [10]:
s2['a':'c']

a   -1
b    3
c    8
dtype: int64

In [11]:
s2[['a','c']]

a   -1
c    8
dtype: int64
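
The selections above mix positions and labels inside plain brackets. A more explicit alternative is to use `.iloc` for positions and `.loc` for labels, as in this minimal sketch (the name `s_sel` is illustrative). Note that position slices exclude the end point while label slices include it.

```python
import pandas as pd

s_sel = pd.Series({"a": -1, "b": 3, "c": 8})

# .iloc always selects by integer position
print(s_sel.iloc[1])        # 3
# .loc always selects by label
print(s_sel.loc["a"])       # -1

# Position slices exclude the end point, label slices include it
print(list(s_sel.iloc[0:2].index))      # ['a', 'b']
print(list(s_sel.loc["a":"c"].index))   # ['a', 'b', 'c']
```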

### Assigning Values to the Elements

In [12]:
# 'a' is not yet a label of s1, so the assignment adds a new element
s1['a'] = 100
s1

0     -1
1      3
2      8
a    100
dtype: int64

### Filtering Values

If you need to know which elements in the series are greater than 4, you
write the following:

In [13]:
s = pd.Series([1,3,5,2,10])
s[s>4] # greater than 4

2     5
4    10
dtype: int64

In [14]:
# Filter with a Boolean mask: isin() marks the elements whose value is in the list
s = pd.Series([1,3,5,2,10])
print(s.isin([2,5]))
print(s[s.isin([2,5])])

0    False
1    False
2     True
3     True
4    False
dtype: bool
2    5
3    2
dtype: int64
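
Several conditions can be combined into one Boolean mask. The sketch below uses the element-wise operators `&` (and), `|` (or), and `~` (not); the variable names are illustrative. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

s_f = pd.Series([1, 3, 5, 2, 10])

# & is element-wise AND, | is element-wise OR, ~ is element-wise NOT
between = s_f[(s_f > 1) & (s_f < 10)]
outside = s_f[(s_f <= 1) | (s_f >= 10)]
not_in = s_f[~s_f.isin([2, 5])]

print(between.tolist())  # [3, 5, 2]
print(outside.tolist())  # [1, 10]
print(not_in.tolist())   # [1, 3, 10]
```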


### Operations and Mathematical Functions

In [15]:
s*2.5

0     2.5
1     7.5
2    12.5
3     5.0
4    25.0
dtype: float64

In [16]:
np.exp(s)

0        2.718282
1       20.085537
2      148.413159
3        7.389056
4    22026.465795
dtype: float64

### NaN Value

`NaN` stands for `Not a Number` and is generally used to mark a missing value. Before data analysis, `NaN` values need to be addressed.

In [17]:
import numpy as np
# Declaring a Series with NaN values (np.nan is the spelling that works across NumPy versions)
s = pd.Series([1,np.nan,10,9,-2,np.nan])
s

0     1.0
1     NaN
2    10.0
3     9.0
4    -2.0
5     NaN
dtype: float64

Call the `isnull()` or `notnull()` function to generate Boolean values that identify the index entries with and without a NaN value.

In [18]:
print(s.isnull())
print(s.notnull())

0    False
1     True
2    False
3    False
4    False
5     True
dtype: bool
0     True
1    False
2     True
3     True
4     True
5    False
dtype: bool


Based on these Boolean values, you can extract the part of the series that contains only NaN values and the part that contains none.

In [19]:
print(s[s.isnull()])
print(s[s.notnull()])

1   NaN
5   NaN
dtype: float64
0     1.0
2    10.0
3     9.0
4    -2.0
dtype: float64
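
Besides Boolean masks, a series offers shortcuts for the most common ways of handling NaN. The following sketch counts, drops, and replaces missing values; the name `s_nan` is illustrative.

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1, np.nan, 10, 9, -2, np.nan])

print(s_nan.isnull().sum())          # number of missing values: 2
print(s_nan.dropna().tolist())       # [1.0, 10.0, 9.0, -2.0]
print(s_nan.fillna(0).tolist())      # [1.0, 0.0, 10.0, 9.0, -2.0, 0.0]
# Reductions such as mean() skip NaN by default
print(s_nan.mean())                  # 4.5
```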


### Operation of multiple Series

In [20]:
s = pd.Series({"Singapore":30,"Malaysia":23,"Vietnam":36,"Cambodia":41})
s1 = pd.Series({"China":51,"Japan":73,"Vietnam":36,"Laos":31})
s*s1

Cambodia        NaN
China           NaN
Japan           NaN
Laos            NaN
Malaysia        NaN
Singapore       NaN
Vietnam      1296.0
dtype: float64

As you can see, the operation is carried out only for labels that appear in all of the series; every other label yields `NaN`.
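
When a missing label should be treated as a neutral value rather than produce `NaN`, the method forms of the operators (`add`, `sub`, `mul`, `div`) accept a `fill_value` argument. A minimal sketch with made-up series `a` and `b`:

```python
import pandas as pd

a = pd.Series({"Singapore": 30, "Malaysia": 23, "Vietnam": 36})
b = pd.Series({"China": 51, "Vietnam": 36})

# The operator form leaves NaN wherever a label is missing on one side
plain = a + b
print(plain)

# The method form uses fill_value in place of the missing operand
total = a.add(b, fill_value=0)
print(total)
```

Here `total` keeps all four labels, and only `Vietnam` is an actual sum of two values.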

## DataFrame

Compared with the `Series`, the `DataFrame` can contain multi-dimensional data. Its first column and first row are the `index` and the `columns`, respectively. (This holds only for a `DataFrame` without a multiple index; a `DataFrame` with multiple indexes will be introduced later.) Each column must hold a single data type (numeric, string, boolean, etc.), but different columns can have different data types.

|index|numeric|string|boolean|
|:--|:--:|:--:|--:|
|0|-1|Singapore|True|
|1|3|China|True|
|2|8|Japan|False|
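
The per-column data types can be checked with the `dtypes` attribute. The sketch below rebuilds the table above under the illustrative name `df_types`; the exact dtype reported for the string column can differ between pandas versions.

```python
import pandas as pd

df_types = pd.DataFrame({
    "numeric": [-1, 3, 8],
    "string": ["Singapore", "China", "Japan"],
    "boolean": [True, True, False],
})

# dtypes lists the data type of every column
print(df_types.dtypes)
```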


### Defining a DataFrame

Call the `DataFrame()` function to create a `DataFrame`. An `Array`, a `List`, or a `dict` can be taken as the input of the `DataFrame()` function.

In [21]:
# array
df = pd.DataFrame(np.array([[14, 35, 35, 35],
                            [19, 34, 57, 34],
                            [42, 74, 49, 59]]))
print(df)
# list: use the 'columns' and 'index' parameters to specify the column and row labels of the generated dataframe
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
print(df)
# dict
df = pd.DataFrame({"Country":["Malaysia","Singapore","Vietnam"],
                   "Capital":["Kuala Lumpur","Singapore","Hanoi"],
              "Population":[32365999,5850342,97338579],
              "Isdeveloped":[False,True,True]})
df

    0   1   2   3
0  14  35  35  35
1  19  34  57  34
2  42  74  49  59
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True


Unnamed: 0,Country,Capital,Population,Isdeveloped
0,Malaysia,Kuala Lumpur,32365999,False
1,Singapore,Singapore,5850342,True
2,Vietnam,Hanoi,97338579,True


### Selecting the Internal Elements

Similar to `Series`, there are two ways to select elements from a `DataFrame`: call `iloc[]` to select by position and `loc[]` to select by label.

In [22]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
df

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True
c,Vietnam,Hanoi,97338579,True


In [23]:
# use ':' to select all rows; 0:2 selects the first two columns by position
df.iloc[:,0:2]

Unnamed: 0,Country,Capital
a,Malaysia,Kuala Lumpur
b,Singapore,Singapore
c,Vietnam,Hanoi


In [24]:
df.loc[:,"Country":"Population"]

Unnamed: 0,Country,Capital,Population
a,Malaysia,Kuala Lumpur,32365999
b,Singapore,Singapore,5850342
c,Vietnam,Hanoi,97338579


In [25]:
df.loc["a",["Country","Population"]]

Country       Malaysia
Population    32365999
Name: a, dtype: object

In [26]:
df.iloc[[0,1]] # If you omit the column selector, all columns are selected

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True


Use the ```columns```, ```index```, and ```values``` attributes to obtain the corresponding objects.

In [27]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [28]:
df.columns

Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')

In [29]:
df.values

array([['Malaysia', 'Kuala Lumpur', 32365999, False],
       ['Singapore', 'Singapore', 5850342, True],
       ['Vietnam', 'Hanoi', 97338579, True]], dtype=object)

Select a column by its label; a slice inside the brackets selects rows instead.

In [30]:
df["Country"]

a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object

In [31]:
df[["Country","Population"]] # Use list to select multiple columns  

Unnamed: 0,Country,Population
a,Malaysia,32365999
b,Singapore,5850342
c,Vietnam,97338579


In [32]:
df.Country # A column can also be selected as an attribute

a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object

In [33]:
df["a":"b"] # Select multiple rows with a slice, not a list

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True


In [34]:
df[0:2] # Select multiple rows with a slice, not a list

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True
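
To summarise the two access styles, the sketch below picks a single value and a row range with both `loc` and `iloc`. The dataframe `df_sel` is a reduced, illustrative version of the one used above.

```python
import pandas as pd

df_sel = pd.DataFrame(
    {"Country": ["Malaysia", "Singapore", "Vietnam"],
     "Population": [32365999, 5850342, 97338579]},
    index=["a", "b", "c"],
)

# Single value: row and column label, or row and column position
print(df_sel.loc["b", "Population"])   # 5850342
print(df_sel.iloc[1, 1])               # 5850342

# Label slices include the end label, position slices exclude the end position
print(df_sel.loc["a":"b"].shape)       # (2, 2)
print(df_sel.iloc[0:1].shape)          # (1, 2)
```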


### Assigning value

In [35]:
df.loc["c","Country"] = "Japan"
df.loc["c","Capital"] = "Tokyo"
df.loc["c","Population"] = 126476461
df.loc["c","Isdeveloped"] = True
df

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True
c,Japan,Tokyo,126476461,True


In [36]:
df.loc["c"] = ["Japan", "Tokyo", 126476461, True]
df

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True
c,Japan,Tokyo,126476461,True


### Assigning index, columns, and name of index and columns

In [37]:
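df.index = ["e", "f", "g"]
df.index.name = "label"
df.columns.name = "attributes"
# Note: assigning new column labels replaces the columns object, so the name set above is not kept
df.columns = ["Coun", "Cap", "Pop", "ID"]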
df

label,Coun,Cap,Pop,ID
e,Malaysia,Kuala Lumpur,32365999,False
f,Singapore,Singapore,5850342,True
g,Japan,Tokyo,126476461,True


### Delete columns from dataframe

In [38]:
del df["ID"]
df

label,Coun,Cap,Pop
e,Malaysia,Kuala Lumpur,32365999
f,Singapore,Singapore,5850342
g,Japan,Tokyo,126476461
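
The `del` statement modifies the dataframe in place. When the original should be kept, `drop()` is an alternative that returns a new dataframe. A minimal sketch with an illustrative dataframe `df_drop`:

```python
import pandas as pd

df_drop = pd.DataFrame({"Coun": ["Malaysia", "Singapore"],
                        "Pop": [32365999, 5850342],
                        "ID": [False, True]})

# drop() returns a new DataFrame and leaves the original untouched
without_id = df_drop.drop(columns=["ID"])
print(list(without_id.columns))   # ['Coun', 'Pop']
print(list(df_drop.columns))      # ['Coun', 'Pop', 'ID']

# Rows are dropped by index label
without_first = df_drop.drop(index=[0])
print(len(without_first))         # 1
```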


### Filtering
Filtering works the same way as for `Series`, described above.

In [39]:
df = pd.DataFrame(np.array([[14, 35, 35, 35],
                            [19, 34, 57, 34],
                            [42, 74, 49, 59]]))
# keep the values less than 30; all other cells become NaN
df[df<30]

Unnamed: 0,0,1,2,3
0,14.0,,,
1,19.0,,,
2,,,,


In [40]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
# Filtering according to a condition on one column
df[df["Population"]<50000000]

Unnamed: 0,Country,Capital,Population,Isdeveloped
a,Malaysia,Kuala Lumpur,32365999,False
b,Singapore,Singapore,5850342,True


In [41]:
# Filtering according to conditions on multiple columns
df[(df["Population"]<50000000) & (df["Isdeveloped"]==True)]

Unnamed: 0,Country,Capital,Population,Isdeveloped
b,Singapore,Singapore,5850342,True


### Transposition of a Dataframe
Like an `Array` from `NumPy`, a dataframe can be transposed: the `Columns` become the `Index` and the `Index` becomes the `Columns`.

In [42]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
df1 = df.T
df1

Unnamed: 0,a,b,c
Country,Malaysia,Singapore,Vietnam
Capital,Kuala Lumpur,Singapore,Hanoi
Population,32365999,5850342,97338579
Isdeveloped,False,True,True


In [43]:
df1.index

Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')

In [44]:
df1.columns

Index(['a', 'b', 'c'], dtype='object')
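
One side effect of transposing is worth knowing: if the original columns have different data types, each transposed column mixes those types and falls back to the generic `object` dtype. A minimal sketch with an illustrative dataframe `df_t`:

```python
import pandas as pd

df_t = pd.DataFrame({"Population": [32365999, 5850342],
                     "Isdeveloped": [False, True]},
                    index=["a", "b"])

print(df_t.dtypes)      # one dtype per column: integer and bool
print(df_t.T.dtypes)    # after transposing, every column is object
# Transposing twice restores the original shape and labels
print(df_t.T.T.shape)   # (2, 2)
```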

### Merge of dataframe
Dataframes can be combined with `pd.concat()` (stacking along an axis) and `join()` (combining on the index).

In [45]:
df1 = pd.DataFrame(np.random.rand(3,4))
df2 = pd.DataFrame(np.random.rand(3,4))
df3 = pd.DataFrame(np.random.rand(6,4))
df4 = pd.DataFrame(np.random.rand(3,6))

In [46]:
df1

Unnamed: 0,0,1,2,3
0,0.206051,0.992769,0.123265,0.647524
1,0.634294,0.658906,0.177446,0.618687
2,0.646614,0.595673,0.90152,0.54474


In [47]:
df2

Unnamed: 0,0,1,2,3
0,0.055476,0.905927,0.534322,0.204508
1,0.723764,0.618826,0.542614,0.850142
2,0.393729,0.919337,0.793761,0.629774


In [48]:
df3

Unnamed: 0,0,1,2,3
0,0.26945,0.578978,0.103235,0.55402
1,0.453324,0.080696,0.182836,0.390433
2,0.155646,0.036758,0.271894,0.828253
3,0.355042,0.464576,0.086404,0.395181
4,0.330333,0.750446,0.389247,0.967997
5,0.236876,0.465663,0.967985,0.929154


In [49]:
df4

Unnamed: 0,0,1,2,3,4,5
0,0.522044,0.593614,0.484898,0.864356,0.552645,0.505305
1,0.684472,0.177255,0.342014,0.125136,0.557648,0.386877
2,0.182415,0.178786,0.490618,0.167139,0.674949,0.089527


In [50]:
pd.concat([df1,df2])

Unnamed: 0,0,1,2,3
0,0.206051,0.992769,0.123265,0.647524
1,0.634294,0.658906,0.177446,0.618687
2,0.646614,0.595673,0.90152,0.54474
0,0.055476,0.905927,0.534322,0.204508
1,0.723764,0.618826,0.542614,0.850142
2,0.393729,0.919337,0.793761,0.629774


In [51]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() gives the same result
result = pd.concat([df1, df2])
result

Unnamed: 0,0,1,2,3
0,0.206051,0.992769,0.123265,0.647524
1,0.634294,0.658906,0.177446,0.618687
2,0.646614,0.595673,0.90152,0.54474
0,0.055476,0.905927,0.534322,0.204508
1,0.723764,0.618826,0.542614,0.850142
2,0.393729,0.919337,0.793761,0.629774
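
In the results above the row labels 0 to 2 appear twice. The sketch below shows two common variations of `pd.concat()`: renumbering the rows with `ignore_index=True`, and placing the frames side by side with `axis=1`. The frames `left` and `right` are illustrative and use fixed values instead of random numbers.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.zeros((3, 4)))
right = pd.DataFrame(np.ones((3, 4)))

# Stack rows and renumber the index 0..5 instead of repeating 0..2
rows = pd.concat([left, right], ignore_index=True)
print(rows.shape)             # (6, 4)
print(list(rows.index))       # [0, 1, 2, 3, 4, 5]

# Place the frames side by side, aligning on the row index
cols = pd.concat([left, right], axis=1)
print(cols.shape)             # (3, 8)
```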


### `join`, `inner` and `merge`
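
As a hedged illustration of these operations, the sketch below combines two small, made-up tables (`capitals` and `population`) that share a `Country` column. `pd.merge()` matches rows on a column, the `how` argument chooses between `inner` (labels in both tables) and `outer` (labels in either table), and `join()` matches on the index.

```python
import pandas as pd

capitals = pd.DataFrame({"Country": ["Malaysia", "Singapore", "Vietnam"],
                         "Capital": ["Kuala Lumpur", "Singapore", "Hanoi"]})
population = pd.DataFrame({"Country": ["Singapore", "Vietnam", "Japan"],
                           "Population": [5850342, 97338579, 126476461]})

# Inner merge: keep only countries present in both tables
inner = pd.merge(capitals, population, on="Country", how="inner")
print(inner)

# Outer merge: keep all countries, filling the gaps with NaN
outer = pd.merge(capitals, population, on="Country", how="outer")
print(outer)

# join() combines on the index rather than on a column
joined = capitals.set_index("Country").join(
    population.set_index("Country"), how="inner")
print(joined)
```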

### View data

In [52]:
df = pd.DataFrame(np.random.rand(100,4))
df.head()

Unnamed: 0,0,1,2,3
0,0.961074,0.914723,0.489604,0.968548
1,0.647185,0.904352,0.022548,0.129909
2,0.710488,0.584896,0.576213,0.272517
3,0.448723,0.398865,0.768406,0.337896
4,0.057121,0.300043,0.442141,0.460806


In [53]:
df.tail()

Unnamed: 0,0,1,2,3
95,0.93813,0.843183,0.242784,0.805869
96,0.624455,0.835685,0.849114,0.241616
97,0.996938,0.151783,0.834767,0.107647
98,0.467609,0.814447,0.354351,0.114249
99,0.252274,0.385371,0.037245,0.891316


### Computational tools

### Covariance

In [54]:
df = pd.DataFrame(np.random.rand(5,5))
df.cov()

Unnamed: 0,0,1,2,3,4
0,0.06797,0.002293,-0.014832,0.020889,-0.046879
1,0.002293,0.029578,0.018176,0.015083,0.051378
2,-0.014832,0.018176,0.062296,0.026519,-0.001599
3,0.020889,0.015083,0.026519,0.061488,0.011392
4,-0.046879,0.051378,-0.001599,0.011392,0.182713


### Correlation

In [55]:
df.corr() # pearson (default), kendall, spearman

Unnamed: 0,0,1,2,3,4
0,1.0,0.051142,-0.227939,0.323127,-0.420664
1,0.051142,1.0,0.423431,0.35367,0.698891
2,-0.227939,0.423431,1.0,0.428483,-0.014987
3,0.323127,0.35367,0.428483,1.0,0.107477
4,-0.420664,0.698891,-0.014987,0.107477,1.0
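
The comment above mentions alternative correlation methods. The following sketch compares the default Pearson coefficient with the rank-based Spearman coefficient on made-up data where `y` grows monotonically but not linearly with `x`.

```python
import pandas as pd

# y = x**2: monotonic but not linear
df_c = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                     "y": [1, 4, 9, 16, 25]})

pearson = df_c.corr()                      # default: method="pearson"
spearman = df_c.corr(method="spearman")    # based on ranks

print(pearson.loc["x", "y"])    # slightly below 1
print(spearman.loc["x", "y"])   # 1.0, because the ranks agree exactly
```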


### `mean()`, `sum()`, `describe()`

In [56]:
df.mean()

0    0.377023
1    0.375848
2    0.614063
3    0.373357
4    0.387859
dtype: float64

In [57]:
df.sum()

0    1.885115
1    1.879239
2    3.070315
3    1.866785
4    1.939296
dtype: float64

In [58]:
df.describe()

Unnamed: 0,0,1,2,3,4
count,5.0,5.0,5.0,5.0,5.0
mean,0.377023,0.375848,0.614063,0.373357,0.387859
std,0.260711,0.171983,0.249592,0.247967,0.427449
min,0.041826,0.100235,0.299637,0.017899,0.010373
25%,0.209401,0.344873,0.403286,0.284336,0.022208
50%,0.432463,0.436393,0.693698,0.353977,0.223428
75%,0.481803,0.437813,0.824877,0.56217,0.752823
max,0.719622,0.559926,0.848816,0.648402,0.930463
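
By default these reductions work down each column. Passing `axis=1` (or `axis="columns"`) applies them across each row instead, as in this sketch with an illustrative dataframe `df_s`:

```python
import pandas as pd

df_s = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_mean = df_s.mean()               # one value per column
row_mean = df_s.mean(axis=1)         # one value per row
row_sum = df_s.sum(axis="columns")   # axis=1 and axis="columns" are equivalent

print(col_mean.tolist())   # [2.0, 20.0]
print(row_mean.tolist())   # [5.5, 11.0, 16.5]
print(row_sum.tolist())    # [11, 22, 33]
```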


### Data ranking

In [59]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True],
                  ["Japan","Tokyo",None,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c",'d'])
df.sort_values(by=['Population','Country'], ascending=False, na_position='first')

Unnamed: 0,Country,Capital,Population,Isdeveloped
d,Japan,Tokyo,,True
c,Vietnam,Hanoi,97338579.0,True
a,Malaysia,Kuala Lumpur,32365999.0,False
b,Singapore,Singapore,5850342.0,True
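
`sort_values()` reorders the rows. If the goal is to attach a rank to each row while keeping the original order, `rank()` can be used instead. A minimal sketch with an illustrative dataframe `df_r`:

```python
import pandas as pd

df_r = pd.DataFrame({"Country": ["Malaysia", "Singapore", "Vietnam"],
                     "Population": [32365999, 5850342, 97338579]})

# sort_values reorders the rows (ascending by default)
print(df_r.sort_values(by="Population")["Country"].tolist())
# ['Singapore', 'Malaysia', 'Vietnam']

# rank keeps the row order; with ascending=False the largest value gets rank 1
df_r["PopRank"] = df_r["Population"].rank(ascending=False)
print(df_r["PopRank"].tolist())   # [2.0, 3.0, 1.0]
```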


## Missing data (NaN value)

In [60]:
df = pd.DataFrame(np.random.rand(5,5))
df.iloc[0,1] = np.nan
df.iloc[2,2] = np.nan
df.iloc[3,1] = np.nan
df.iloc[3,3] = np.nan
df

Unnamed: 0,0,1,2,3,4
0,0.456593,,0.997167,0.070873,0.946465
1,0.593342,0.194045,0.889637,0.077502,0.799251
2,0.542118,0.739005,,0.041342,0.685254
3,0.553801,,0.705614,,0.840214
4,0.963496,0.311629,0.919142,0.768203,0.525202


In [61]:
# detecting nan value
df.isnull()

Unnamed: 0,0,1,2,3,4
0,False,True,False,False,False
1,False,False,False,False,False
2,False,False,True,False,False
3,False,True,False,True,False
4,False,False,False,False,False


In [62]:
# detecting non-NaN values
df.notnull()

Unnamed: 0,0,1,2,3,4
0,True,False,True,True,True
1,True,True,True,True,True
2,True,True,False,True,True
3,True,False,True,False,True
4,True,True,True,True,True


In [63]:
df.isna()

Unnamed: 0,0,1,2,3,4
0,False,True,False,False,False
1,False,False,False,False,False
2,False,False,True,False,False
3,False,True,False,True,False
4,False,False,False,False,False


In [64]:
# fill NaN with a specified value
df.fillna(value=0)

Unnamed: 0,0,1,2,3,4
0,0.456593,0.0,0.997167,0.070873,0.946465
1,0.593342,0.194045,0.889637,0.077502,0.799251
2,0.542118,0.739005,0.0,0.041342,0.685254
3,0.553801,0.0,0.705614,0.0,0.840214
4,0.963496,0.311629,0.919142,0.768203,0.525202


In [65]:
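# fill NaN by propagating the last valid value down each column (forward fill)
# the call returns a new dataframe; df itself is unchanged unless you reassign the result
df.ffill() # same as the older df.fillna(method="ffill"); backward fill: df.bfill()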
df

Unnamed: 0,0,1,2,3,4
0,0.456593,,0.997167,0.070873,0.946465
1,0.593342,0.194045,0.889637,0.077502,0.799251
2,0.542118,0.739005,,0.041342,0.685254
3,0.553801,,0.705614,,0.840214
4,0.963496,0.311629,0.919142,0.768203,0.525202


In [66]:
df.ffill() # same as the older df.fillna(method="pad")

Unnamed: 0,0,1,2,3,4
0,0.456593,,0.997167,0.070873,0.946465
1,0.593342,0.194045,0.889637,0.077502,0.799251
2,0.542118,0.739005,0.889637,0.041342,0.685254
3,0.553801,0.739005,0.705614,0.041342,0.840214
4,0.963496,0.311629,0.919142,0.768203,0.525202


In [67]:
# delete NaN value
# ‘any’ : If any NA values are present, drop that row or column.
# ‘all’ : If all values are NA, drop that row or column.

# 0, or ‘index’ : Drop rows which contain missing values.
# 1, or ‘columns’ : Drop columns which contain missing value.
df.dropna(axis="index",how="any")

Unnamed: 0,0,1,2,3,4
1,0.593342,0.194045,0.889637,0.077502,0.799251
4,0.963496,0.311629,0.919142,0.768203,0.525202
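
`dropna()` has a few more options that are often useful, and `fillna()` also accepts one value per column. The sketch below uses a small, made-up dataframe `df_m` so that the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

df_m = pd.DataFrame({"A": [1.0, np.nan, 3.0, np.nan],
                     "B": [np.nan, np.nan, 6.0, 8.0],
                     "C": [1.0, 2.0, 3.0, 4.0]})

# Drop columns that contain any NaN
print(list(df_m.dropna(axis="columns", how="any").columns))   # ['C']
# Keep rows with at least 2 non-NaN values
print(list(df_m.dropna(thresh=2).index))                      # [0, 2, 3]
# Only look at column A when deciding which rows to drop
print(list(df_m.dropna(subset=["A"]).index))                  # [0, 2]

# Fill the NaN of each column with the mean of that column
filled_m = df_m.fillna(df_m.mean())
print(filled_m)
```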


## Date index

In [68]:
dti = pd.date_range("2018-01-01", periods=3, freq="H")
print(dti)
dti = pd.date_range(start = "2021-09-28",end="2021-09-30", freq="10H")
print(dti)

DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
               '2018-01-01 02:00:00'],
              dtype='datetime64[ns]', freq='H')
DatetimeIndex(['2021-09-28 00:00:00', '2021-09-28 10:00:00',
               '2021-09-28 20:00:00', '2021-09-29 06:00:00',
               '2021-09-29 16:00:00'],
              dtype='datetime64[ns]', freq='10H')


Date times can also be manipulated and converted together with their time zone information.

In [69]:
dti = pd.date_range(start = "2021-09-28",end="2021-09-30", freq="10H")
dti = dti.tz_localize("UTC")
dti

DatetimeIndex(['2021-09-28 00:00:00+00:00', '2021-09-28 10:00:00+00:00',
               '2021-09-28 20:00:00+00:00', '2021-09-29 06:00:00+00:00',
               '2021-09-29 16:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='10H')

In [70]:
dti = pd.date_range(start = "2021-09-28",end="2021-09-30", freq="10H")
dti = dti.tz_localize("Asia/Singapore")
dti

DatetimeIndex(['2021-09-28 00:00:00+08:00', '2021-09-28 10:00:00+08:00',
               '2021-09-28 20:00:00+08:00', '2021-09-29 06:00:00+08:00',
               '2021-09-29 16:00:00+08:00'],
              dtype='datetime64[ns, Asia/Singapore]', freq=None)
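
The cells above attach a time zone with `tz_localize()`. To translate time-zone-aware timestamps into another zone, `tz_convert()` can be used. The sketch below uses a short daily range only to keep the output small; the variable names are illustrative.

```python
import pandas as pd

naive = pd.date_range("2021-09-28", periods=2, freq="D")

# tz_localize attaches a time zone to naive timestamps (wall-clock time unchanged)
utc = naive.tz_localize("UTC")
# tz_convert translates aware timestamps to another zone (same instants in time)
sgt = utc.tz_convert("Asia/Singapore")

print(utc[0])   # 2021-09-28 00:00:00+00:00
print(sgt[0])   # 2021-09-28 08:00:00+08:00
```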

Using the `origin` option, one can specify an alternative starting point for creation of a `DatetimeIndex`. For example, to use 1900-01-01 00:00:00 as the starting time and hour as the unit period length:

In [71]:
pd.to_datetime([100, 101, 102], unit="h", origin=pd.Timestamp("1900-01-01 00:00:00"))

DatetimeIndex(['1900-01-05 04:00:00', '1900-01-05 05:00:00',
               '1900-01-05 06:00:00'],
              dtype='datetime64[ns]', freq=None)

Commonly used units include `D` (day), `h` (hour), `m` (minute), and `s` (second).

```{tip}
The time labels of climate products are usually given as a start time point plus a discrete time interval, and are then represented by a column of integers. In this case, we can use the approach above to construct time labels.
```
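
As an illustration of the tip, the sketch below assumes a hypothetical time axis stored as integer days counted from 1981-01-01 and turns it into a `DatetimeIndex`. The offsets and rainfall values are made up.

```python
import pandas as pd

# Hypothetical time axis: integer days counted from 1981-01-01
offsets = [0, 1, 2, 31]
times = pd.to_datetime(offsets, unit="D", origin=pd.Timestamp("1981-01-01"))
print(times)

# Attach the time labels to some made-up values
rain = pd.Series([0.0, 1.2, 0.0, 5.4], index=times)
print(rain.loc[pd.Timestamp("1981-02-01")])   # 5.4
```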

## Upsampling and Downsampling

* Upsampling: Increase the frequency of the samples, such as from minutes to seconds, filling the new time points by interpolation or by propagating existing values.
* Downsampling: Decrease the frequency of the samples by aggregation, such as from months to years.
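
Both directions can be tried on a small synthetic series before moving to real data. The sketch below is a minimal example under made-up values: six daily samples are aggregated into 2-day sums (downsampling), and the result is then brought back to daily steps and filled by linear interpolation (upsampling).

```python
import numpy as np
import pandas as pd

# Made-up daily series: values 0..5 on six consecutive days
idx = pd.date_range("2021-01-01", periods=6, freq="D")
daily = pd.Series(np.arange(6, dtype=float), index=idx)

# Downsampling: aggregate every 2 days into one value
two_day_sum = daily.resample("2D").sum()
print(two_day_sum.tolist())        # [1.0, 5.0, 9.0]

# Upsampling: go back to daily steps; the new rows start as NaN
back_to_daily = two_day_sum.resample("D").asfreq()
# The NaN rows can then be filled, here by linear interpolation
filled = back_to_daily.interpolate()
print(filled.tolist())             # [1.0, 3.0, 5.0, 7.0, 9.0]
```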


In [72]:
# Prepare the data; reading files will be introduced in the next tutorial
# Data Source: http://www.weather.gov.sg/climate-historical-daily/
data = pd.read_csv('../../../assets/data/Changi_daily_rainfall.csv', index_col=0,header=0,parse_dates=True)
data.head()

Date,Daily Rainfall Total (mm)
1981-01-01,0.0
1981-01-02,0.0
1981-01-03,0.0
1981-01-04,0.0
1981-01-05,0.0


In [73]:
# Downsampling: Convert daily data to yearly data by sum and max
df = data
dfsum = df.resample("Y").sum()
dfsum.head()

Date,Daily Rainfall Total (mm)
1981-12-31,1336.3
1982-12-31,1581.7
1983-12-31,1866.5
1984-12-31,2686.7
1985-12-31,1483.9


In [74]:
dfmax = df.resample("Y").max()
dfmax.head()

Date,Daily Rainfall Total (mm)
1981-12-31,71.5
1982-12-31,109.0
1983-12-31,181.8
1984-12-31,154.4
1985-12-31,86.8


In [75]:
# Upsampling: Convert yearly data to 10-day intervals; asfreq() leaves the new time points as NaN
dfmax.resample('10D').asfreq()[0:5]

Date,Daily Rainfall Total (mm)
1981-12-31,71.5
1982-01-10,
1982-01-20,
1982-01-30,
1982-02-09,


In [76]:
dfmax.resample('10D').ffill()[0:5] # forward fill; the older alias pad() was removed in pandas 2.0

Date,Daily Rainfall Total (mm)
1981-12-31,71.5
1982-01-10,71.5
1982-01-20,71.5
1982-01-30,71.5
1982-02-09,71.5


In [77]:
dfmax.resample('D').ffill(limit=2)[0:5]

Date,Daily Rainfall Total (mm)
1981-12-31,71.5
1982-01-01,71.5
1982-01-02,71.5
1982-01-03,
1982-01-04,
