# Pandas (basic)

`Pandas` is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of the Python programming language, used for data preparation and data analysis.

## Install Pandas
To run the code on your own computer, first install pandas: open a console and enter ```pip install pandas```.

In [1]:
# Import pandas and print its version. The 'as' keyword binds the module to the shorter alias 'pd'.
import pandas as pd
print(pd.__version__)

1.1.5


## Introduction to pandas Data Structures
The two primary data structures of pandas are
+ Series
+ DataFrame

A `Series` holds a one-dimensional sequence of data, while a `DataFrame` holds tabular data made up of several columns.

## Series
As shown below, the internal structure of a `Series` object is simple: it is composed of two arrays associated with each other. The main array holds the data (of any `NumPy` type), and each element is associated with a label held in the other array, called the ```index```.

|index|Value|
|:---|--:|
|0|-1|
|1|3|
|2|8|

### Defining a Series

To create the series shown above, call the `Series()` constructor and pass it an array containing the values to be included.

In [2]:
s = pd.Series([-1,3,8]) # list
print(s)
import numpy as np
arr = np.array([100,20,-3]) # from NumPy Arrays
s1 = pd.Series(arr)
print(s1)
s2 = pd.Series(s) # from Other Series
s2

# when declaring a Series, you can specify the index
s2 = pd.Series({"a":-1,"b":3,"c":8}) # dictionary
print(s2)
s3 = pd.Series([-1,3,8], index=['x','y','z']) # specify the index by 'index' option
print(s3)

0   -1
1    3
2    8
dtype: int64
0    100
1     20
2     -3
dtype: int32
a   -1
b    3
c    8
dtype: int64
x   -1
y    3
z    8
dtype: int64


As you can see from the output of the series, on the left there are the values in the index, which is a series of labels, and on the right are the corresponding values.

If you do not specify any index during the definition of the series, by default,
pandas will assign numerical values increasing from 0 as labels. In this case, the labels correspond to the indexes (position in the array) of the elements in the series object.

Often, however, it is preferable to create a series using meaningful labels in order to distinguish and identify each item regardless of the order in which they were inserted into the series.

In this case it will be necessary, during the constructor call, to include the `index` option and assign an array of strings containing the labels.

```{margin}
Always keep in mind that the values contained in the NumPy array or in the
original series are not copied, but are passed by reference. That is, the object is inserted
dynamically within the new series object. If it changes, for example its internal element
varies in value, then those changes will also be present in the new series object.
```

In [3]:
arr[2] = -40
arr
s1

0    100
1     20
2    -40
dtype: int32

As you can see in this example, by changing the third element of the `arr` array, we
also modified the corresponding element in the `s1` series.
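
If the series should not follow later changes to the array, one option is to pass a copy of the array instead. The following is a minimal sketch; the names `arr_src` and `s_copy` are illustrative and not part of the example above.

```python
import numpy as np
import pandas as pd

arr_src = np.array([100, 20, -3])     # illustrative array with the same values as above
s_copy = pd.Series(arr_src.copy())    # copy the data so the series owns its own values

arr_src[2] = -40                      # modify the original array afterwards
print(s_copy)                         # the series built from the copy is unaffected
```

Copying costs extra memory, so it is mainly worthwhile when the original array will be modified later.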

If you want to see the two arrays that make up this data structure individually, use the
`index` and `values` attributes of the series:

In [4]:
print(s.values)
print(s.index)

[-1  3  8]
RangeIndex(start=0, stop=3, step=1)


### Selecting the Internal Elements

You can select individual elements as you would in an ordinary NumPy array, by specifying the position.

In [5]:
s2 = pd.Series({"a":-1,"b":3,"c":8})
s2[1] # positional access; newer pandas versions prefer s2.iloc[1]

3

Alternatively, you can specify the label that the index holds for that element.

In [6]:
s2['a']

-1

Multiple items can be selected with slices or lists, much as in a NumPy array. Note that a slice by
position excludes the end position, while a slice by label includes the end label:

In [7]:
s2[0:2]

a   -1
b    3
dtype: int64

In [8]:
s2['a':'c']

a   -1
b    3
c    8
dtype: int64

In [9]:
s2[['a','c']]

a   -1
c    8
dtype: int64
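
The difference between positional and label-based selection can be made explicit with the `iloc` and `loc` accessors. The sketch below repeats the selections above under that assumption; `s_sel` and the other variable names are illustrative.

```python
import pandas as pd

s_sel = pd.Series({"a": -1, "b": 3, "c": 8})   # same data as s2 above

by_position = s_sel.iloc[1]        # second element, whatever its label is
by_label = s_sel.loc["b"]          # element whose label is "b"
pos_slice = s_sel.iloc[0:2]        # positional slice: end position excluded
label_slice = s_sel.loc["a":"b"]   # label slice: end label included

print(by_position, by_label)
print(pos_slice)
print(label_slice)
```

Here both slices return the elements labelled `a` and `b`, but for different reasons: the first stops before position 2, the second stops at label `b`.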

### Assigning Values to the Elements

In [10]:
s1['a'] = 100
s1

0    100
1     20
2    -40
a    100
dtype: int64
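
As the output shows, assigning to a label that does not exist yet appends a new element instead of raising an error. The following sketch contrasts the two cases; it uses the `loc` accessor and an illustrative series named `s_assign`.

```python
import pandas as pd

s_assign = pd.Series([100, 20, -40])

s_assign.loc[1] = 0        # label 1 exists, so its value is overwritten
s_assign.loc["a"] = 100    # label "a" does not exist, so a new element is added

print(s_assign)
```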

### Filtering Values

If you need to know which elements in the series are greater than 4, pass the
condition inside the square brackets:

In [11]:
s = pd.Series([1,3,5,2,10])
s[s>4] # greater than 4

2     5
4    10
dtype: int64

In [12]:
# Filter with a Boolean mask: isin() marks the elements whose value is in the given list
s = pd.Series([1,3,5,2,10])
print(s.isin([2,5]))
print(s[s.isin([2,5])])

0    False
1    False
2     True
3     True
4    False
dtype: bool
2    5
3    2
dtype: int64


### Operations and Mathematical Functions

In [13]:
s*2.5

0     2.5
1     7.5
2    12.5
3     5.0
4    25.0
dtype: float64

In [14]:
np.exp(s)

0        2.718282
1       20.085537
2      148.413159
3        7.389056
4    22026.465795
dtype: float64

### NaN Value

`NaN` stands for `Not a Number` and usually indicates a missing value. `NaN` values need to be addressed before the data are analysed.

In [15]:
import numpy as np
# Declare a Series containing NaN values (use np.nan; the np.NaN alias was removed in NumPy 2.0)
s = pd.Series([1,np.nan,10,9,-2,np.nan])
s

0     1.0
1     NaN
2    10.0
3     9.0
4    -2.0
5     NaN
dtype: float64

Call the `isnull()` or `notnull()` method to generate a Boolean series that identifies which index entries hold NaN values.

In [16]:
print(s.isnull())
print(s.notnull())

0    False
1     True
2    False
3    False
4    False
5     True
dtype: bool
0     True
1    False
2     True
3     True
4     True
5    False
dtype: bool


Using these Boolean series as filters, you can extract the elements that are NaN and the elements that are not.

In [17]:
print(s[s.isnull()])
print(s[s.notnull()])

1   NaN
5   NaN
dtype: float64
0     1.0
2    10.0
3     9.0
4    -2.0
dtype: float64


### Operations on Multiple Series

In [18]:
s = pd.Series({"Singapore":30,"Malaysia":23,"Vietnam":36,"Cambodia":41})
s1 = pd.Series({"China":51,"Japan":73,"Vietnam":36,"Laos":31})
s*s1

Cambodia        NaN
China           NaN
Japan           NaN
Laos            NaN
Malaysia        NaN
Singapore       NaN
Vietnam      1296.0
dtype: float64

As you can see, the operation is aligned on the index labels: only labels present in both series produce a value, and all other labels yield `NaN`.
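
When `NaN` is not the desired result for unmatched labels, the arithmetic methods accept a `fill_value`, or the unmatched labels can be dropped afterwards. A minimal sketch with the same data follows; `sa`, `sb`, `total` and `common` are illustrative names.

```python
import pandas as pd

sa = pd.Series({"Singapore": 30, "Malaysia": 23, "Vietnam": 36, "Cambodia": 41})
sb = pd.Series({"China": 51, "Japan": 73, "Vietnam": 36, "Laos": 31})

# Treat a label missing from one series as 0 instead of producing NaN
total = sa.add(sb, fill_value=0)
print(total)

# Keep only the labels that are present in both series
common = (sa * sb).dropna()
print(common)
```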

## DataFrame

Compared with a `Series`, a `DataFrame` holds two-dimensional (tabular) data. Its first column and first row display the `index` and the `columns` labels, respectively. (This applies to a `DataFrame` with a single-level index; a `DataFrame` with a multi-level index is introduced later.) All values in a column share the same data type (numeric, string, boolean, etc.), but different columns can have different data types.

|index|numeric|string|boolean|
|:--|:--:|:--:|--:|
|0|-1|Singapore|True|
|1|3|China|True|
|2|8|Japan|False|
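
The data type of each column can be inspected through the `dtypes` attribute. A small sketch using the values from the table above (the name `df_types` is illustrative):

```python
import pandas as pd

df_types = pd.DataFrame({
    "numeric": [-1, 3, 8],
    "string": ["Singapore", "China", "Japan"],
    "boolean": [True, True, False],
})

# One data type per column; the exact type reported for text depends on the pandas version
print(df_types.dtypes)
```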


### Defining a DataFrame

Call the `DataFrame()` constructor to create a `DataFrame`. A NumPy array, a `list`, or a `dict` can be passed as its input.

In [19]:
# array
df = pd.DataFrame(np.array([[14, 35, 35, 35],
                            [19, 34, 57, 34],
                            [42, 74, 49, 59]]))
print(df)
# list: use the 'columns' and 'index' parameters to set the column and row labels of the DataFrame
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
print(df)
# dict
df = pd.DataFrame({"Country":["Malaysia","Singapore","Vietnam"],
                   "Capital":["Kuala Lumpur","Singapore","Hanoi"],
              "Population":[32365999,5850342,97338579],
              "Isdeveloped":[False,True,True]})
df

    0   1   2   3
0  14  35  35  35
1  19  34  57  34
2  42  74  49  59
     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True


     Country       Capital  Population  Isdeveloped
0   Malaysia  Kuala Lumpur    32365999        False
1  Singapore     Singapore     5850342         True
2    Vietnam         Hanoi    97338579         True


### Selecting the Internal Elements

As with a `Series`, elements of a `DataFrame` can be selected in two ways: use `iloc[]` to select by position and `loc[]` to select by label.

In [20]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
df

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c    Vietnam         Hanoi    97338579         True


In [21]:
# use ':' to select everything along an axis (here: all rows)
df.iloc[:,0:2]

     Country       Capital
a   Malaysia  Kuala Lumpur
b  Singapore     Singapore
c    Vietnam         Hanoi


In [22]:
df.loc[:,"Country":"Population"]

     Country       Capital  Population
a   Malaysia  Kuala Lumpur    32365999
b  Singapore     Singapore     5850342
c    Vietnam         Hanoi    97338579


In [23]:
df.loc["a",["Country","Population"]]

Country       Malaysia
Population    32365999
Name: a, dtype: object

In [24]:
df.iloc[[0,1]] # If the column selector is omitted, all columns are selected

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True


Use the ```columns```, ```index``` and ```values``` attributes to obtain the corresponding objects.

In [25]:
df.index

Index(['a', 'b', 'c'], dtype='object')

In [26]:
df.columns

Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')

In [27]:
df.values

array([['Malaysia', 'Kuala Lumpur', 32365999, False],
       ['Singapore', 'Singapore', 5850342, True],
       ['Vietnam', 'Hanoi', 97338579, True]], dtype=object)

Columns can be selected by label with `[]`; a slice inside `[]` selects rows instead.

In [28]:
df["Country"]

a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object

In [29]:
df[["Country","Population"]] # Use list to select multiple columns  

     Country  Population
a   Malaysia    32365999
b  Singapore     5850342
c    Vietnam    97338579


In [30]:
df.Country # A column can also be selected as an attribute

a     Malaysia
b    Singapore
c      Vietnam
Name: Country, dtype: object

In [31]:
df["a":"b"] # To select multiple rows, use a slice rather than a list

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True


In [32]:
df[0:2] # To select multiple rows, use a slice rather than a list

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True


### Assigning value

In [33]:
df.loc["c","Country"] = "Japan"
df.loc["c","Capital"] = "Tokyo"
df.loc["c","Population"] = 126476461
df.loc["c","Isdeveloped"] = True
df

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c      Japan         Tokyo   126476461         True


In [34]:
df.loc["c"] = ["Japan", "Tokyo", 126476461, True]
df

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True
c      Japan         Tokyo   126476461         True


### Assigning index, columns, and name of index and columns

In [35]:
df.index = ["e", "f", "g"]
df.index.name = "label"
df.columns.name = "attributes"
# note: the next line assigns a new Index, so the columns name set above is discarded
df.columns = ["Coun", "Cap", "Pop", "ID"]
df

            Coun           Cap        Pop     ID
label
e       Malaysia  Kuala Lumpur   32365999  False
f      Singapore     Singapore    5850342   True
g          Japan         Tokyo  126476461   True


### Delete columns from dataframe

In [36]:
del df["ID"]
df

            Coun           Cap        Pop
label
e       Malaysia  Kuala Lumpur   32365999
f      Singapore     Singapore    5850342
g          Japan         Tokyo  126476461
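
`del` modifies the DataFrame in place. When the original should be kept, the `drop()` method returns a new DataFrame without the given columns or rows. A minimal sketch, using an illustrative frame named `df_drop`:

```python
import pandas as pd

df_drop = pd.DataFrame(
    {"Coun": ["Malaysia", "Singapore", "Japan"],
     "Cap": ["Kuala Lumpur", "Singapore", "Tokyo"],
     "Pop": [32365999, 5850342, 126476461]},
    index=["e", "f", "g"],
)

smaller = df_drop.drop(columns=["Pop"])   # new DataFrame without the "Pop" column
fewer_rows = df_drop.drop(index=["e"])    # new DataFrame without row "e"

print(smaller)
print(fewer_rows)
print(df_drop.columns)                    # the original still has all three columns
```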


### Filtering
Filtering works the same way as for a `Series`, as described above.

In [37]:
df = pd.DataFrame(np.array([[14, 35, 35, 35],
                            [19, 34, 57, 34],
                            [42, 74, 49, 59]]))
# keep the values less than 30; all other cells become NaN
df[df<30]

      0   1   2   3
0  14.0 NaN NaN NaN
1  19.0 NaN NaN NaN
2   NaN NaN NaN NaN


In [38]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
# Filter rows according to a condition on one column
df[df["Population"]<50000000]

     Country       Capital  Population  Isdeveloped
a   Malaysia  Kuala Lumpur    32365999        False
b  Singapore     Singapore     5850342         True


In [39]:
# Filter rows according to conditions on multiple columns (combine them with & and parentheses)
df[(df["Population"]<50000000) & (df["Isdeveloped"]==True)]

     Country    Capital  Population  Isdeveloped
b  Singapore  Singapore     5850342         True


### Transposition of a DataFrame
Like a NumPy array, a `DataFrame` can be transposed: the `columns` become the `index` and the `index` becomes the `columns`.

In [40]:
df = pd.DataFrame([["Malaysia","Kuala Lumpur",32365999,False],
              ["Singapore","Singapore",5850342,True],
              ["Vietnam","Hanoi",97338579,True]],
              columns = ["Country", "Capital", "Population", "Isdeveloped"],
              index=["a","b","c"])
df1 = df.T
df1

                        a          b         c
Country          Malaysia  Singapore   Vietnam
Capital      Kuala Lumpur  Singapore     Hanoi
Population       32365999    5850342  97338579
Isdeveloped         False       True      True


In [41]:
df1.index

Index(['Country', 'Capital', 'Population', 'Isdeveloped'], dtype='object')

In [42]:
df1.columns

Index(['a', 'b', 'c'], dtype='object')

### Merge of dataframe
DataFrames can be combined with `pd.concat()` and `DataFrame.join()`.

In [43]:
df1 = pd.DataFrame(np.random.rand(3,4))
df2 = pd.DataFrame(np.random.rand(3,4))
df3 = pd.DataFrame(np.random.rand(6,4))
df4 = pd.DataFrame(np.random.rand(3,6))

In [44]:
df1

          0         1         2         3
0  0.833208  0.954361  0.590362  0.432176
1  0.340572   0.79039  0.902075  0.427525
2    0.3883  0.764727  0.259507  0.152453


In [45]:
df2

          0         1         2         3
0   0.74686   0.12566  0.039486   0.04097
1  0.732137  0.350305  0.412614  0.915573
2  0.942568  0.498923  0.925797  0.200364


In [46]:
df3

          0         1         2         3
0  0.687669  0.701052  0.642234  0.546058
1  0.458649  0.179672  0.572648  0.576272
2  0.957388  0.202772  0.393197  0.047321
3  0.890684  0.371428  0.902146   0.28455
4  0.883339  0.555326  0.250109  0.331097
5  0.690945  0.206538  0.404107  0.272376


In [47]:
df4

          0         1         2         3         4         5
0  0.666505  0.206574  0.616027  0.859374   0.25248  0.292821
1   0.17123  0.857158  0.127751  0.132093  0.442879  0.324312
2  0.029849  0.809577  0.275834  0.313976   0.47189  0.712662


In [48]:
pd.concat([df1,df2])

          0         1         2         3
0  0.833208  0.954361  0.590362  0.432176
1  0.340572   0.79039  0.902075  0.427525
2    0.3883  0.764727  0.259507  0.152453
0   0.74686   0.12566  0.039486   0.04097
1  0.732137  0.350305  0.412614  0.915573
2  0.942568  0.498923  0.925797  0.200364


In [49]:
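# DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0;
# on newer versions use pd.concat([df1, df2]) as in the previous cell
result = df1.append(df2)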
result

          0         1         2         3
0  0.833208  0.954361  0.590362  0.432176
1  0.340572   0.79039  0.902075  0.427525
2    0.3883  0.764727  0.259507  0.152453
0   0.74686   0.12566  0.039486   0.04097
1  0.732137  0.350305  0.412614  0.915573
2  0.942568  0.498923  0.925797  0.200364
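
Note that the row labels 0, 1, 2 are repeated in the concatenated result. The sketch below shows two commonly used options of `pd.concat()`: `ignore_index=True` to renumber the rows and `axis=1` to place the frames side by side. The names `left` and `right` are illustrative.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(np.random.rand(3, 4))
right = pd.DataFrame(np.random.rand(3, 4))

stacked = pd.concat([left, right], ignore_index=True)   # rows renumbered 0..5
side_by_side = pd.concat([left, right], axis=1)          # columns placed next to each other

print(stacked.index)
print(side_by_side.shape)
```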


### `join`, `inner` and `merge`
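
This heading has no example in the original notebook. As a hedged illustration, the sketch below combines two small, made-up tables on a shared `Country` column with `pd.merge()` (inner and outer) and on the index with `DataFrame.join()`. The frames `population` and `capital` are assumptions introduced for this example.

```python
import pandas as pd

population = pd.DataFrame({
    "Country": ["Malaysia", "Singapore", "Vietnam"],
    "Population": [32365999, 5850342, 97338579],
})
capital = pd.DataFrame({
    "Country": ["Singapore", "Vietnam", "Japan"],
    "Capital": ["Singapore", "Hanoi", "Tokyo"],
})

# inner: keep only the countries present in both tables
inner = pd.merge(population, capital, on="Country", how="inner")

# outer: keep every country; missing entries become NaN
outer = pd.merge(population, capital, on="Country", how="outer")

# join: combine on the index, keeping all rows of the left table
joined = population.set_index("Country").join(capital.set_index("Country"), how="left")

print(inner)
print(outer)
print(joined)
```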

### View data

In [50]:
df = pd.DataFrame(np.random.rand(100,4))
df.head()

          0         1         2         3
0   0.29588  0.101316  0.378205  0.362995
1  0.034059  0.156049  0.878324  0.046279
2  0.271878  0.511757  0.069545  0.309593
3  0.506662  0.948239  0.587523  0.602291
4  0.628921  0.771021  0.951579  0.103621


In [51]:
df.tail()

           0         1         2         3
95  0.236849  0.049237   0.19378  0.680461
96  0.438473  0.382579  0.680808  0.753565
97  0.673163  0.086541  0.860865   0.83576
98  0.654595  0.069207  0.078337   0.00091
99  0.936152   0.14161  0.300204  0.086048


### Computational tools

### Covariance

In [52]:
df = pd.DataFrame(np.random.rand(5,5))
df.cov()

           0         1         2         3         4
0    0.03019  0.011332 -0.009482  0.048757  0.010887
1   0.011332  0.095083  0.066258  0.038145  0.024341
2  -0.009482  0.066258  0.099028  0.032475 -0.005049
3   0.048757  0.038145  0.032475  0.108102  0.010197
4   0.010887  0.024341 -0.005049  0.010197  0.053393


### Correlation

In [53]:
df.corr() # pearson (default), kendall, spearman

           0         1         2         3         4
0        1.0  0.211515 -0.173412  0.853461  0.271176
1   0.211515       1.0  0.682821  0.376241  0.341621
2  -0.173412  0.682821       1.0  0.313872 -0.069442
3   0.853461  0.376241  0.313872       1.0  0.134218
4   0.271176  0.341621 -0.069442  0.134218       1.0


### `mean()`, `sum()`, `describe()`

In [54]:
df.mean()

0    0.592261
1    0.363090
2    0.609037
3    0.538903
4    0.411179
dtype: float64

In [55]:
df.sum()

0    2.961306
1    1.815452
2    3.045184
3    2.694513
4    2.055895
dtype: float64

In [56]:
df.describe()

              0         1         2         3         4
count       5.0       5.0       5.0       5.0       5.0
mean   0.592261   0.36309  0.609037  0.538903  0.411179
std    0.173753  0.308355  0.314687  0.328789   0.23107
min    0.367331  0.018761  0.169957  0.108121  0.137269
25%    0.547973  0.265859  0.451434  0.301511  0.307734
50%    0.574291  0.290495  0.608332  0.641031  0.309945
75%    0.620433  0.380516  0.903033  0.722305  0.611272
max    0.851278  0.859821  0.912428  0.921545  0.689674


### Data ranking
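
This heading has no example in the original notebook. A minimal sketch of ranking with the `rank()` method follows, assuming a small made-up series named `s_rank`; tied values share a rank according to the chosen `method`.

```python
import pandas as pd

s_rank = pd.Series([30, 23, 36, 23],
                   index=["Singapore", "Malaysia", "Vietnam", "Cambodia"])

ranks = s_rank.rank()                                    # ascending; ties get the average rank
ranks_desc = s_rank.rank(ascending=False, method="min")  # descending; ties get the lowest rank
ordered = s_rank.sort_values()                           # sort by value instead of ranking

print(ranks)
print(ranks_desc)
print(ordered)
```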

## Missing data (NaN value)

In [57]:
df = pd.DataFrame(np.random.rand(5,5))
df.iloc[0,1] = np.nan
df.iloc[2,2] = np.nan
df.iloc[3,1] = np.nan
df.iloc[3,3] = np.nan
df

          0         1         2         3         4
0  0.906246       NaN  0.212382  0.111568   0.20005
1  0.510668  0.151645  0.932783  0.632681   0.26722
2  0.552283  0.780018       NaN  0.298506  0.493491
3  0.048853       NaN  0.577482       NaN  0.237624
4  0.384103  0.675192  0.875547  0.035895  0.708006


In [58]:
# detecting nan value
df.isnull()

       0      1      2      3      4
0  False   True  False  False  False
1  False  False  False  False  False
2  False  False   True  False  False
3  False   True  False   True  False
4  False  False  False  False  False


In [59]:
# detecting non-NaN values
df.notnull()

      0      1      2      3     4
0  True  False   True   True  True
1  True   True   True   True  True
2  True   True  False   True  True
3  True  False   True  False  True
4  True   True   True   True  True


In [60]:
df.isna()

       0      1      2      3      4
0  False   True  False  False  False
1  False  False  False  False  False
2  False  False   True  False  False
3  False   True  False   True  False
4  False  False  False  False  False


In [61]:
# fill NaN with a specified value
df.fillna(value=0)

          0         1         2         3         4
0  0.906246       0.0  0.212382  0.111568   0.20005
1  0.510668  0.151645  0.932783  0.632681   0.26722
2  0.552283  0.780018       0.0  0.298506  0.493491
3  0.048853       0.0  0.577482       0.0  0.237624
4  0.384103  0.675192  0.875547  0.035895  0.708006


In [62]:
# fill NaN using a method
# fillna() returns a new DataFrame; df itself changes only if inplace=True is passed,
# which is why the df displayed below still contains NaN
# (newer pandas versions prefer df.ffill() / df.bfill() over the 'method' argument)
df.fillna(method="ffill") # other methods: ‘backfill’, ‘bfill’, ‘pad’
df

          0         1         2         3         4
0  0.906246       NaN  0.212382  0.111568   0.20005
1  0.510668  0.151645  0.932783  0.632681   0.26722
2  0.552283  0.780018       NaN  0.298506  0.493491
3  0.048853       NaN  0.577482       NaN  0.237624
4  0.384103  0.675192  0.875547  0.035895  0.708006


In [63]:
df.fillna(method="pad")

          0         1         2         3         4
0  0.906246       NaN  0.212382  0.111568   0.20005
1  0.510668  0.151645  0.932783  0.632681   0.26722
2  0.552283  0.780018  0.932783  0.298506  0.493491
3  0.048853  0.780018  0.577482  0.298506  0.237624
4  0.384103  0.675192  0.875547  0.035895  0.708006


In [64]:
# delete NaN value
# ‘any’ : If any NA values are present, drop that row or column.
# ‘all’ : If all values are NA, drop that row or column.

# 0, or ‘index’ : Drop rows which contain missing values.
# 1, or ‘columns’ : Drop columns which contain missing value.
df.dropna(axis="index",how="any")

          0         1         2         3         4
1  0.510668  0.151645  0.932783  0.632681   0.26722
4  0.384103  0.675192  0.875547  0.035895  0.708006


## Date index

In [65]:
# pd.read_csv("")
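
The cell above is left as a placeholder in the original notebook. As a hedged illustration of a date index, the sketch below builds a `DatetimeIndex` with `pd.date_range()` and selects by date; the dates and the name `ts` are made up for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2021-01-01", periods=6, freq="D")   # six consecutive days
ts = pd.Series(np.arange(6), index=dates)

one_day = ts.loc[pd.Timestamp("2021-01-03")]     # a single date
few_days = ts.loc["2021-01-02":"2021-01-04"]     # date strings work in slices; both ends included

print(one_day)
print(few_days)
print(ts.index.month)                            # date components are available on the index
```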

## Upsampling and Downsampling

* Upsampling: Increase the frequency of the samples by interpolation, such as from minutes to seconds.
* Downsampling: Decrease the frequency of the samples by aggregation, such as from months to years.
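
Both directions can be tried on a small synthetic series before moving to real data. The sketch below is a minimal illustration with made-up values; the names `daily`, `two_day_sum` and `half_day` are assumptions for this example.

```python
import pandas as pd

idx_days = pd.date_range("2021-01-01", periods=6, freq="D")
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx_days)

# Downsampling: aggregate every two days into one value
two_day_sum = daily.resample("2D").sum()

# Upsampling: go from daily to 12-hourly; the new time slots are NaN until filled
half_day = daily.resample("12h").asfreq()
half_day_interp = half_day.interpolate()   # fill the gaps by linear interpolation

print(two_day_sum)
print(half_day.head())
print(half_day_interp.head())
```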


In [95]:
# prepare data, this section will be introduced in the next tutorial
# Source: http://www.weather.gov.sg/climate-historical-daily/
data = pd.read_csv('../../assets/data/Chang_daily_rainfall.csv',index_col=0,header=0,parse_dates=True)
data.head()

            Daily Rainfall Total (mm)
Date
1981-01-01                        0.0
1981-01-02                        0.0
1981-01-03                        0.0
1981-01-04                        0.0
1981-01-05                        0.0


In [102]:
# Downsampling: convert daily data to yearly data by sum and max
# (pandas 2.2 and later prefer the alias "YE" over "Y" for year-end frequency)
df = data
dfsum = df.resample("Y").sum()
dfsum.head()

            Daily Rainfall Total (mm)
Date
1981-12-31                     1336.3
1982-12-31                     1581.7
1983-12-31                     1866.5
1984-12-31                     2686.7
1985-12-31                     1483.9


In [103]:
dfmax = df.resample("Y").max()
dfmax.head()

            Daily Rainfall Total (mm)
Date
1981-12-31                       71.5
1982-12-31                      109.0
1983-12-31                      181.8
1984-12-31                      154.4
1985-12-31                       86.8


In [104]:
# Upsampling: convert the yearly data to a 10-day frequency; new rows are NaN until they are filled
dfmax.resample('10D').asfreq()[0:5]

            Daily Rainfall Total (mm)
Date
1981-12-31                       71.5
1982-01-10                        NaN
1982-01-20                        NaN
1982-01-30                        NaN
1982-02-09                        NaN


In [105]:
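# forward-fill the new rows; Resampler.pad() was removed in pandas 2.0, where ffill() is used instead
dfmax.resample('10D').pad()[0:5]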

            Daily Rainfall Total (mm)
Date
1981-12-31                       71.5
1982-01-10                       71.5
1982-01-20                       71.5
1982-01-30                       71.5
1982-02-09                       71.5


In [106]:
dfmax.resample('D').ffill(limit=2)[0:5]

            Daily Rainfall Total (mm)
Date
1981-12-31                       71.5
1982-01-01                       71.5
1982-01-02                       71.5
1982-01-03                        NaN
1982-01-04                        NaN


## Hierarchical indexing (MultiIndex)

### Creating a MultiIndex

In [73]:
iterables = [
    ["temperature","rainfall","runoff"],
    ["max","mean","min"],
]
idx = pd.MultiIndex.from_product(iterables, names=["factor", "method"])
idx

MultiIndex([('temperature',  'max'),
            ('temperature', 'mean'),
            ('temperature',  'min'),
            (   'rainfall',  'max'),
            (   'rainfall', 'mean'),
            (   'rainfall',  'min'),
            (     'runoff',  'max'),
            (     'runoff', 'mean'),
            (     'runoff',  'min')],
           names=['factor', 'method'])

In [74]:
df = pd.DataFrame(np.random.randn(9, 4), index=idx)
df

                            0         1         2         3
factor      method
temperature max      -1.73873 -0.517419 -0.838424 -0.308485
            mean     1.285246 -1.162235  0.631022 -1.081682
            min     -0.697316 -0.362244 -1.886194  0.606675
rainfall    max     -0.410116  0.502862 -0.117059  0.389544
            mean      0.48418  0.464016 -0.359899  -1.26161
            min     -1.430331  0.350857 -1.106481 -0.067302
runoff      max      0.552249 -0.943803 -1.182916 -1.046358
            mean    -0.057589     1.263 -0.877389   1.42498
            min     -0.421448 -0.093601  0.878896 -0.380458


In [75]:
idx = pd.MultiIndex.from_arrays(iterables, names=["factor", "method"])
idx

MultiIndex([('temperature',  'max'),
            (   'rainfall', 'mean'),
            (     'runoff',  'min')],
           names=['factor', 'method'])

In [76]:
df = pd.DataFrame(np.random.randn(3, 4), index=idx)
df

                            0         1         2         3
factor      method
temperature max      1.633299  0.490486  0.611913   0.59244
rainfall    mean    -1.057055 -0.798911  0.080047 -1.192749
runoff      min      0.427018 -1.678173 -0.018338  1.838073


A `MultiIndex` can also be created with ```pd.MultiIndex.from_tuples``` and ```pd.MultiIndex.from_frame```.
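
These two constructors are not demonstrated in the original notebook. The following sketch shows one possible use of each; the tuples and the helper frame `frame` are made-up examples.

```python
import pandas as pd

# from_tuples: each tuple is one complete (factor, method) label
tuples = [("temperature", "max"), ("temperature", "min"), ("rainfall", "mean")]
idx_t = pd.MultiIndex.from_tuples(tuples, names=["factor", "method"])

# from_frame: each row of the frame is one label; column names become level names
frame = pd.DataFrame({"factor": ["runoff", "runoff"], "method": ["max", "min"]})
idx_f = pd.MultiIndex.from_frame(frame)

print(idx_t)
print(idx_f)
```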

### Get index for multiindex

In [77]:
iterables = [
    ["temperature","rainfall","runoff"],
    ["max","mean","min"],
]
idx = pd.MultiIndex.from_product(iterables, names=["factor", "method"])
df = pd.DataFrame(np.random.randn(9, 4), index=idx)
df.index

MultiIndex([('temperature',  'max'),
            ('temperature', 'mean'),
            ('temperature',  'min'),
            (   'rainfall',  'max'),
            (   'rainfall', 'mean'),
            (   'rainfall',  'min'),
            (     'runoff',  'max'),
            (     'runoff', 'mean'),
            (     'runoff',  'min')],
           names=['factor', 'method'])

In [78]:
df.index.get_level_values(0)

Index(['temperature', 'temperature', 'temperature', 'rainfall', 'rainfall',
       'rainfall', 'runoff', 'runoff', 'runoff'],
      dtype='object', name='factor')

In [79]:
df.index.get_level_values(1)

Index(['max', 'mean', 'min', 'max', 'mean', 'min', 'max', 'mean', 'min'], dtype='object', name='method')
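
Beyond reading the level values, rows can be selected by one or both levels. The sketch below uses `loc` and `xs()` on a frame with the same index structure; the frame `df_m` is filled with consecutive integers (an assumption made so the result is easy to follow).

```python
import numpy as np
import pandas as pd

idx_m = pd.MultiIndex.from_product(
    [["temperature", "rainfall", "runoff"], ["max", "mean", "min"]],
    names=["factor", "method"],
)
df_m = pd.DataFrame(np.arange(36).reshape(9, 4), index=idx_m)

rain = df_m.loc["rainfall"]                   # all rows under one outer label
rain_max = df_m.loc[("rainfall", "max"), :]   # one row, addressed by both levels
all_mean = df_m.xs("mean", level="method")    # cross-section on the inner level

print(rain)
print(rain_max)
print(all_mean)
```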

## Apply and Applymap
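* Apply: Apply a function along an axis of the DataFrame.
* Applymap: Apply a function to a DataFrame elementwise, so that each element can be handled according to specific requirements. (In pandas 2.1 and later, `DataFrame.applymap` is deprecated in favour of `DataFrame.map`.)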
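
The "along an axis" part of `apply` can be illustrated as follows. This is a minimal sketch with a made-up frame `df_a`: with the default `axis=0` the function receives one column at a time, and with `axis=1` it receives one row at a time.

```python
import pandas as pd

df_a = pd.DataFrame([[1, -2], [3, -4], [5, -6]], columns=["x", "y"])

# axis=0 (default): the function is called once per column
col_range = df_a.apply(lambda col: col.max() - col.min())

# axis=1: the function is called once per row
row_sum = df_a.apply(lambda row: row["x"] + row["y"], axis=1)

print(col_range)
print(row_sum)
```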

In [80]:
df = pd.DataFrame(np.random.randn(3, 4))
df

          0         1         2         3
0  -1.99388  0.207029 -0.094932   1.70426
1  1.771246 -0.838005   0.80204 -0.175127
2  -0.49841  -1.39579  1.391114  0.617431


In [81]:
df.apply(np.abs)

          0         1         2         3
0   1.99388  0.207029  0.094932   1.70426
1  1.771246  0.838005   0.80204  0.175127
2   0.49841   1.39579  1.391114  0.617431


In [82]:
func_x3 = lambda x: x**3 # lambda function
df.apply(func_x3)

           0         1         2         3
0  -7.926787  0.008873 -0.000856  4.950025
1   5.556947 -0.588491  0.515926 -0.005371
2  -0.123811 -2.719322  2.692079  0.235378


In [83]:
# This function has no specific meaning.
# It only defines a more complex operation to apply to each element of the DataFrame.
def func_range(x):
    if x > 1:
        return 1
    elif x< -1:
        return -1
    else:
        return np.abs(x)
df.applymap(func_range)

         0         1         2         3
0     -1.0  0.207029  0.094932       1.0
1      1.0  0.838005   0.80204  0.175127
2  0.49841      -1.0       1.0  0.617431


## Groupby
`groupby()` can be used to split large amounts of data into groups and compute operations on each group.



In [84]:
iterables = [
    ["temperature","rainfall","runoff"],
    ["site1","site2","site3"],
]
idx = pd.MultiIndex.from_product(iterables, names=["factor", "method"])
df = pd.DataFrame(np.random.randn(9, 4), index=idx)
# group by a single level name; passing a one-element list makes newer pandas return tuple keys
for n,subdf in df.groupby(by="factor"):
    print(n)
    print(subdf)

rainfall
                        0         1         2         3
factor   method                                        
rainfall site1   1.550864  1.387094  0.074766 -0.190843
         site2  -1.317083  1.175969 -1.678688  0.640268
         site3  -0.700120  0.267763 -0.193350 -0.026873
runoff
                      0         1         2         3
factor method                                        
runoff site1  -0.987159 -2.523512  1.006537 -0.495098
       site2  -0.852092 -1.714281 -2.049509  1.030161
       site3  -1.456943  0.901729  0.126591  1.702993
temperature
                           0         1         2         3
factor      method                                        
temperature site1  -1.125165  0.672000  0.588356 -0.592670
            site2   0.582469 -0.433644  0.886921  0.429850
            site3   0.740374 -0.603608  1.159948 -0.029015


In [85]:
df.groupby(by=["factor"]).mean()

                    0         1         2         3
factor
rainfall    -0.155447  0.943608 -0.599091  0.140851
runoff      -1.098731 -1.112021  -0.30546  0.746019
temperature  0.065893  -0.12175  0.878408 -0.063945
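
Several statistics can be computed per group in one call with `agg()`. The sketch below groups by an ordinary column instead of an index level; the frame `records` and its values are made up for this example.

```python
import pandas as pd

records = pd.DataFrame({
    "factor": ["rainfall", "rainfall", "runoff", "runoff"],
    "site": ["site1", "site2", "site1", "site2"],
    "value": [10.0, 20.0, 1.0, 3.0],
})

# One row per group, one column per requested statistic
summary = records.groupby("factor")["value"].agg(["mean", "max", "count"])
print(summary)
```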


## Advanced (operational)

## Table Visualization

In [86]:
np.random.seed(0)
df2 = pd.DataFrame(np.random.randn(10,4), columns=['A','B','C','D'])
df2.style
def style_negative(v, props=''):
    return props if v < 0 else None
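# color negative values red and fade values close to zero
# (in pandas 2.1 and later, Styler.applymap is deprecated in favour of Styler.map)
s2 = df2.style.applymap(style_negative, props='color:red;')\
              .applymap(lambda v: 'opacity: 20%;' if (v < 0.3) and (v > -0.3) else None)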
s2

           A          B          C          D
0   1.764052   0.400157   0.978738   2.240893
1   1.867558  -0.977278   0.950088  -0.151357
2  -0.103219   0.410599   0.144044   1.454274
3   0.761038   0.121675   0.443863   0.333674
4   1.494079  -0.205158   0.313068  -0.854096
5   -2.55299   0.653619   0.864436  -0.742165
6   2.269755  -1.454366   0.045759  -0.187184
7   1.532779   1.469359   0.154947   0.378163
8  -0.887786  -1.980796  -0.347912   0.156349
9   1.230291    1.20238  -0.387327  -0.302303


In [87]:
def highlight_max(s, props=''):
    return np.where(s == np.nanmax(s.values), props, '')
s2.apply(highlight_max, props='color:white;background-color:darkblue', axis=0)

           A          B          C          D
0   1.764052   0.400157   0.978738   2.240893
1   1.867558  -0.977278   0.950088  -0.151357
2  -0.103219   0.410599   0.144044   1.454274
3   0.761038   0.121675   0.443863   0.333674
4   1.494079  -0.205158   0.313068  -0.854096
5   -2.55299   0.653619   0.864436  -0.742165
6   2.269755  -1.454366   0.045759  -0.187184
7   1.532779   1.469359   0.154947   0.378163
8  -0.887786  -1.980796  -0.347912   0.156349
9   1.230291    1.20238  -0.387327  -0.302303


## Tooltips and Captions

In [88]:
# s.set_caption("Confusion matrix for multiple cancer prediction models.")\
#  .set_table_styles([{
#      'selector': 'caption',
#      'props': 'caption-side: bottom; font-size:1.25em;'
#  }], overwrite=False)
