## 2. Data Structures

This notebook introduces the two most important data structures in Pandas:

+ 2.1 Series

+ 2.2 DataFrame

We will demonstrate how to create these data structures manually. In practice, Series and DataFrame objects are rarely created by hand, so this section aims only to give a basic understanding of these data structures as a foundation for working with them. For more detailed information, please refer to the official Pandas documentation:

+ [Intro to data structures](https://pandas.pydata.org/docs/user_guide/dsintro.html)

+ [Pandas.Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

+ [Pandas.DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

In [1]:
%%html
<style>
    table { display: inline-block }
</style>

In [2]:
import numpy as np
import pandas as pd

---
### 2.1 Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type. The axis labels are collectively referred to as the index. 


The components of a Series object are:

+ index / axis 

+ data

+ name 
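
The three components listed above can be inspected directly as attributes of a Series. The following is a minimal sketch; the values, labels, and the name `'example'` are made up for illustration.

```python
import pandas as pd

# A small Series with explicit labels and a name (illustrative values).
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'], name='example')

print(s.index)   # the axis labels
print(s.values)  # the underlying data as a NumPy array
print(s.name)    # the name of the Series
```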


#### 2.1.1 Basic Form of Creating a Pandas Series

The simplest way to create a Series is to call:

In [3]:
s = pd.Series((1, 2, 3))

print(s)

0    1
1    2
2    3
dtype: int64


The first column of the series object `s` represents the index (0, 1, 2) and the second column the data (1, 2, 3). 

The basic form to create a `Series` is to call

```python
    s = pd.Series(data, index=index)
```

where `data` can be 

+ a Python `dict`

+ a NumPy `ndarray`

+ a Python list or tuple

+ a scalar value (like 2)

The passed `index` is a list of axis labels. 

<br>

**Note:**

+ passing an `index` is optional. If no index is passed, Pandas will create the default index 0, 1, 2, ...

+ for list, tuple, or ndarray data, `index` must have the same length as `data`

+ axis labels do not need to be unique

+ axis labels must be hashable (e.g. int, float, string but not dict)
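
Two of the notes above can be checked with a short experiment. The sketch below uses made-up values and labels: it builds a Series with a repeated label, then shows that a list and an index of different lengths are rejected.

```python
import pandas as pd

# Duplicate labels are allowed; selecting 'a' returns every matching entry.
s = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
print(s['a'])
print(s.index.is_unique)

# A list of data and an index of a different length raise a ValueError.
try:
    pd.Series([1, 2, 3], index=['x', 'y'])
except ValueError as err:
    print('ValueError:', err)
```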

#### 2.1.2 Creating a Series from a List

In the following example, the data is passed as a Python **list**. In addition, an **index** and a **name** are provided.

In [4]:
rivers = pd.Series([6300, 6650, 6275, 5539, 6400],
                   index=['Yangtze', 'Nile', 'Mississippi', 'Yenisei', 'Amazon'], # optional
                   name='Rivers') # optional

print(rivers)

Yangtze        6300
Nile           6650
Mississippi    6275
Yenisei        5539
Amazon         6400
Name: Rivers, dtype: int64


Instead of a Python list, a **tuple** or an **ndarray** can be passed in the same way to create a Series object.
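
As a brief sketch of this, the river data from above can be passed as a tuple or as a NumPy array. The variable names `labels`, `lengths`, `from_tuple`, and `from_array` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

labels = ['Yangtze', 'Nile', 'Mississippi', 'Yenisei', 'Amazon']
lengths = [6300, 6650, 6275, 5539, 6400]

# Same data, once as a tuple and once as an ndarray.
from_tuple = pd.Series(tuple(lengths), index=labels, name='Rivers')
from_array = pd.Series(np.array(lengths), index=labels, name='Rivers')

# Both Series hold the same values under the same labels.
print((from_tuple == from_array).all())
```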

#### 2.1.3 Creating a Series from a `dict`

In [6]:
data = dict(
    Yangtze = 6300, 
    Nile = 6650, 
    Mississippi = 6275, 
    Yenisei = 5539, 
    Amazon = 6400)

rivers = pd.Series(data, name='Rivers')

print(rivers)

Yangtze        6300
Nile           6650
Mississippi    6275
Yenisei        5539
Amazon         6400
Name: Rivers, dtype: int64
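
When a `dict` is passed without an index, its keys become the axis labels, as in the cell above. If an `index` is passed as well, it selects and orders the entries by label. The sketch below assumes a shortened version of the river data; the label `'Congo'` is made up and deliberately absent from the `dict`.

```python
import pandas as pd

data = {'Yangtze': 6300, 'Nile': 6650, 'Amazon': 6400}

# The index picks entries by key and fixes their order.
# 'Congo' is not a key in data, so its value is missing (NaN).
rivers = pd.Series(data, index=['Amazon', 'Nile', 'Congo'], name='Rivers')

print(rivers)
```

Keys that do not appear in the index (here `'Yangtze'`) are dropped from the result.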


#### 2.1.4 Creating a Series from a Scalar

When passing a **scalar**, an **index** must be provided:

In [7]:
pd.Series(2.0, index=['a', 'b', 'c'])

a    2.0
b    2.0
c    2.0
dtype: float64

---
### 2.2 DataFrame

A `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. It is the most commonly used pandas object. You can think of a `DataFrame` as a tabular (rows, columns) representation of the data.

<div style="text-align: center;">
<img src="./figs/df.png" width="600">
</div>

#### 2.2.1 Basic Form of Creating a DataFrame

The simplest way to create a DataFrame is to call:

In [8]:
# data
d = np.arange(8).reshape(-1, 2)

# DataFrame
df = pd.DataFrame(d) 

# print
print(df)

   0  1
0  0  1
1  2  3
2  4  5
3  6  7


The basic form to create a `DataFrame` is to call

```python
    df = pd.DataFrame(data, index=index, columns=columns)
```

where `data` can be 

+ a `dict` of 1D ndarrays, lists, dicts, or Series

+ another `DataFrame`

+ ...

The `index` (row labels) and `columns` (column labels) arguments are optional. If they are not passed, the labels are taken from the input data where it provides them (for example, the keys of a `dict` become column labels); otherwise the default labels 0, 1, ... are used.
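
The effect of the optional labels can be seen by building the same array twice. This is a minimal sketch; the labels `'w'` to `'z'` and `'a'`, `'b'` are made up for illustration.

```python
import numpy as np
import pandas as pd

d = np.arange(8).reshape(-1, 2)

# Without labels, rows and columns are numbered from 0.
df_default = pd.DataFrame(d)
print(df_default.index.tolist())    # row labels
print(df_default.columns.tolist())  # column labels

# The same data with explicit (made-up) row and column labels.
df_labeled = pd.DataFrame(d, index=['w', 'x', 'y', 'z'], columns=['a', 'b'])
print(df_labeled)
```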

#### 2.2.2 Creating a DataFrame from a Dictionary

In [9]:
# index labels
mat_num = [317, 312, 310]

# data dictionary
data = {
        'name' : ['bob', 'ann', 'cat'],
        'age'  : [20, 21, 22],
        'program' : ['math', 'art', 'physics'] 
        }

# dataframe object
df = pd.DataFrame(data, index=mat_num)

print(df)

    name  age  program
317  bob   20     math
312  ann   21      art
310  cat   22  physics
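
The two data structures are closely related: selecting a single column of a DataFrame yields a Series that shares the DataFrame's index. The sketch below rebuilds the table from the cell above and inspects one column; the variable name `ages` is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['bob', 'ann', 'cat'],
     'age': [20, 21, 22],
     'program': ['math', 'art', 'physics']},
    index=[317, 312, 310],
)

ages = df['age']            # selecting one column yields a Series
print(type(ages).__name__)
print(ages.name)            # the column label becomes the Series name
print(df.dtypes)            # each column has its own dtype
```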


#### 2.2.3 Creating a DataFrame from a Sequence of Rows

In [10]:
# index labels
rows = [317, 312, 310]

# column labels
cols = ['name', 'age', 'program']

# one list per row (a plain tuple of lists, not a NumPy structured array)
data = (['bob', 20, 'math'], ['ann', 21, 'art'], ['cat', 22, 'physics'])

# dataframe object
df = pd.DataFrame(data, index=rows, columns=cols)

print(df)

    name  age  program
317  bob   20     math
312  ann   21      art
310  cat   22  physics
