## **Pandas** - Panel Data | Python Data Analysis

**Pandas** *(styled as **pandas**)* is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, as well as a play on the phrase "Python data analysis". Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.

The development of Pandas introduced into Python many comparable features of working with DataFrames that were established in the R programming language. **The library is built upon another library, NumPy.** [pandas-wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))

---
### Resources
* [Official Page](https://pandas.pydata.org/)
* [Official Docs](https://pandas.pydata.org/docs/)
* Lecture Notes *(refer to ../LectureNotes(HKUST)/13-pandas.pdf)*

### Pandas Series

A Pandas Series is a one-dimensional **labeled array** in the Pandas library for Python. It is a fundamental data structure in Pandas and can be thought of as a single column of data in a spreadsheet or a single column within a Pandas DataFrame. 

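p.s. **labeled array** means that, besides selecting an element by its integer position, we can also select it by its label, e.g. `series['name']`. The explicit accessors `.loc[]` (labels) and `.iloc[]` (positions) are the unambiguous way to do each; recent pandas versions deprecate using a plain integer in `[x]` as a position on a label-indexed Series.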

In [1]:
import pandas
import numpy

# Let's create a pandas series out of a numpy array
data:numpy.ndarray = numpy.array(object = ['LI', 'Hantang', 'Male', 19])
series = pandas.Series(data = data, index = ['family_name', 'given_name', 'gender', 'age'])

# Let's print the series to have a peek into it
print("--------- Our First Series ---------")
print(series)

--------- Our First Series ---------
family_name         LI
given_name     Hantang
gender            Male
age                 19
dtype: object
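
A side note on the cell above: a NumPy array holds a single element type, so building the Series from `numpy.array([...])` turns the age into a string. The sketch below checks this and compares it with passing the list to `pandas.Series` directly. The variable names `arr` and `direct` are only for illustration.

```python
import numpy
import pandas

# NumPy arrays are homogeneous: mixing str and int coerces everything to str
arr = numpy.array(['LI', 'Hantang', 'Male', 19])
print(arr.dtype.kind)     # 'U' marks a Unicode string array
print(arr[3] == '19')     # the age is now the string '19'

# Passing the list directly lets pandas keep each element's Python type
direct = pandas.Series(['LI', 'Hantang', 'Male', 19],
                       index=['family_name', 'given_name', 'gender', 'age'])
print(direct.loc['age'] + 1)   # arithmetic works because the age is still an int
```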


In [2]:
# Indexing
print("------------- Indexing -------------")
print(f"Your last name is: {series.loc['family_name']} (using named index)")
print(f"Your first name is: {series.iloc[1]} (using numerical index)")
"""
Note that:
1. <series object name>.loc[] takes the NAMED index (the label);
2. <series object name>.iloc[] takes the numerical index (just like a normal array).
"""
pass

------------- Indexing -------------
Your last name is: LI (using named index)
Your first name is: Hantang (using numerical index)


In [3]:
# Slicing (just like Python list, tuple and numpy ndarray)
print("------------- Slicing --------------")
name_series = series.loc['family_name' : 'given_name']
print(f"Series containing your name:\n{name_series}")
name_series.iloc[0] = "LIAN"
name_series.iloc[1] = "TANG"
print(f"Your 'real?' info:\n{series}")
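print(f"Modified info:\n{name_series}") # the original series changed too, so this slice is a view (in pandas versions with Copy-on-Write enabled, the original would stay unchanged)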
"""
!!! Note that, unlike lists and ndarrays,
.loc includes BOTH the start and the stop label in the slice.
.iloc behaves like NumPy arrays and lists: the start position
is included and the stop position is excluded.
"""
pass

------------- Slicing --------------
Series containing your name:
family_name         LI
given_name     Hantang
dtype: object
Your 'real?' info:
family_name    LIAN
given_name     TANG
gender         Male
age              19
dtype: object
Modified info:
family_name    LIAN
given_name     TANG
dtype: object
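
To make the inclusive/exclusive difference concrete, here is a minimal sketch on a fresh Series (the name `person` is just for illustration). It also shows `.copy()`, which is one way to get a slice that never writes back to the original, whichever pandas version is in use.

```python
import pandas

person = pandas.Series(['LI', 'Hantang', 'Male', 19],
                       index=['family_name', 'given_name', 'gender', 'age'])

by_label = person.loc['family_name':'gender']   # stop label IS included
by_position = person.iloc[0:2]                  # stop position is NOT included
print(len(by_label), len(by_position))          # 3 2

# An explicit copy is independent of the original
safe = person.loc['family_name':'given_name'].copy()
safe.iloc[0] = 'LIAN'
print(person.loc['family_name'])                # LI
```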


In [4]:
# Masking
print("------------- Masking --------------")
numerical = pandas.Series(data = [1, 2, 3, 4, 5, 6]) # if no index is specified, the default integer labels 0, 1, ... are used
mask = numerical > 3
print(f"Mask:\n{mask}")
print(f"Masked Series:\n{numerical[mask]}") # we don't use .loc[] and .iloc[] here

------------- Masking --------------
Mask:
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool
Masked Series:
3    4
4    5
5    6
dtype: int64
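
Masks can also be combined. The sketch below (with an illustrative Series called `numbers`) uses the element-wise operators `&`, `|` and `~`; each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas

numbers = pandas.Series([1, 2, 3, 4, 5, 6])

# Combine conditions with & (and), | (or), ~ (not)
between = numbers[(numbers > 2) & (numbers < 6)]
print(between.tolist())                    # [3, 4, 5]

not_even = numbers[~(numbers % 2 == 0)]
print(not_even.tolist())                   # [1, 3, 5]

# .loc accepts the same kind of boolean mask
print(numbers.loc[numbers > 3].tolist())   # [4, 5, 6]
```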


### Pandas DataFrame

A DataFrame is a powerful 2-dimensional data structure in Pandas, similar to a spreadsheet or SQL table.

It consists of **rows and columns**, where **each column is a Series object**. Different columns can hold different data types, but all columns share the same index.

Each column in a DataFrame has a unique name, which allows for easy access and manipulation of the data.

DataFrames are ideal for handling structured data and performing complex data analysis and manipulation tasks.
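
A quick way to check the "each column is a Series" statement is to pull one column out and inspect it. This is only a small sketch; the frame contents and the row labels `'a'` and `'b'` are made up for illustration.

```python
import pandas

frame = pandas.DataFrame({'given_name': ['Hantang', 'Zhengyang'],
                          'age': [19, 20]},
                         index=['a', 'b'])

age_column = frame['age']
print(type(age_column).__name__)               # Series
print(age_column.index.equals(frame.index))    # True: the column shares the frame's index
print(frame.dtypes)                            # one dtype per column
```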

*Parameters (of the `pandas.DataFrame` constructor):*
```python
pandas.DataFrame
```
> **data** : *ndarray (structured or homogeneous), Iterable, dict, or DataFrame*
>     Dict can contain Series, arrays, constants, dataclass or list-like objects. If
>     data is a dict, column order follows insertion-order. If a dict contains Series
>     which have an index defined, it is aligned by its index. This alignment also
>     occurs if data is a Series or a DataFrame itself. Alignment is done on
>     Series/DataFrame inputs.
>     If data is a list of dicts, column order follows insertion-order.
>
> **index** : *Index or array-like*
>     Index to use for resulting frame. Will default to RangeIndex if
>     no indexing information part of input data and no index provided.
>
> **columns** : *Index or array-like*
>     Column labels to use for resulting frame when data does not have them,
>     defaulting to RangeIndex(0, 1, 2, ..., n). If data contains column labels,
>     will perform column selection instead.
>
> **dtype** : *dtype, default None*
>     Data type to force. Only a single dtype is allowed. If None, infer.
>
> **copy** : *bool or None, default None*
>     Copy data from inputs.
>     For dict data, the default of None behaves like ``copy=True``.  For DataFrame
>     or 2d ndarray input, the default of None behaves like ``copy=False``.
>     If data is a dict containing one or more Series (possibly of different dtypes),
>     ``copy=False`` will ensure that these inputs are not copied.
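
The parameters above may be easier to read next to a small example. The sketch below uses `index`, `columns` and `dtype` with a 2-D ndarray, and then shows the index alignment that happens when `data` is a dict of Series. All labels (`'r0'`, `'x'`, `'a'`, ...) are made up for illustration.

```python
import numpy
import pandas

# 2-D ndarray input: row labels come from index, column labels from columns
grid = pandas.DataFrame(data=numpy.array([[1, 2], [3, 4]]),
                        index=['r0', 'r1'],
                        columns=['x', 'y'],
                        dtype=float)
print(grid.loc['r1', 'y'])          # 4.0

# Dict of Series: values are aligned on index labels, not on position
left = pandas.Series([1, 2], index=['a', 'b'])
right = pandas.Series([20, 10], index=['b', 'a'])
aligned = pandas.DataFrame({'left': left, 'right': right})
print(aligned.loc['a', 'right'])    # 10
```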

In [5]:
import pandas

# some data first
mr_candy = pandas.Series(data = ['LEE', 'Hantang', 'Male', 19],
                         index = ['family_name', 'given_name', 'gender', 'age'])
mr_joggy = pandas.Series(data = ['WONG', 'Zhengyang', 'Male', 20],
                         index = ['family_name', 'given_name', 'gender', 'age'])
mr_misty = pandas.Series(data = ['N/a', 'カスミ', 'Female', 10],
                         index = ['family_name', 'given_name', 'gender', 'age'])

# let's create a dataframe from scratch
dataframe = pandas.DataFrame(data = [mr_candy, mr_joggy, mr_misty])

# let's visualize
dataframe

|   | family_name | given_name | gender | age |
|---|---|---|---|---|
| 0 | LEE | Hantang | Male | 19 |
| 1 | WONG | Zhengyang | Male | 20 |
| 2 | N/a | カスミ | Female | 10 |
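
When a DataFrame is built from a list of Series like this, each Series becomes one row and its index becomes the column labels. If the Series also carry a `name`, that name is used as the row label instead of 0, 1, 2. A minimal sketch, with made-up names `'candy'` and `'joggy'`:

```python
import pandas

columns = ['family_name', 'given_name', 'age']
row_a = pandas.Series(['LEE', 'Hantang', 19], index=columns, name='candy')
row_b = pandas.Series(['WONG', 'Zhengyang', 20], index=columns, name='joggy')

labeled = pandas.DataFrame([row_a, row_b])
print(labeled.index.tolist())                # ['candy', 'joggy']
print(labeled.columns.tolist())              # ['family_name', 'given_name', 'age']
print(labeled.loc['joggy', 'given_name'])    # Zhengyang
```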


In [6]:
import pandas

# you can also build the DataFrame column by column (the previous cell built it row by row)
family_names = pandas.Series(data = ['LEE', 'WONG', 'N/a'])
given_names  = pandas.Series(data = ['Hantang', 'Zhengyang', 'カスミ'])
genders      = pandas.Series(data = ['Male', 'Male', 'Female'])
ages         = pandas.Series(data = [19, 20, 10])

dataframe = pandas.DataFrame(data = {'family_name' : family_names,
                                     'given_name' : given_names,
                                     'gender' : genders,
                                     'age' : ages})

# visualize
dataframe

|   | family_name | given_name | gender | age |
|---|---|---|---|---|
| 0 | LEE | Hantang | Male | 19 |
| 1 | WONG | Zhengyang | Male | 20 |
| 2 | N/a | カスミ | Female | 10 |
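
Column-wise construction also works with plain lists as the dict values. The sketch below builds a small frame that way (the name `people` is illustrative) and then reads a column and a row back out; both come back as Series, which ties this section to the Series section above.

```python
import pandas

people = pandas.DataFrame({'family_name': ['LEE', 'WONG'],
                           'age': [19, 20]})
print(people.shape)                    # (2, 2)

# A column is a Series indexed by the row labels
print(people['age'].tolist())          # [19, 20]

# A row is a Series indexed by the column names
first_row = people.iloc[0]
print(first_row.index.tolist())        # ['family_name', 'age']
print(first_row.loc['family_name'])    # LEE
```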
