## **Pandas** - Panel Data | Python Data Analysis

**Pandas** *(styled as **pandas**)* is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals, as well as a play on the phrase "Python data analysis". Wes McKinney started building what would become Pandas at AQR Capital while he was a researcher there from 2007 to 2010.

Pandas brought to Python many of the DataFrame features that were already established in the R programming language. **The library is built on top of another library, NumPy.** (Source: [pandas-wikipedia](https://en.wikipedia.org/wiki/Pandas_(software)))

---
### Resources
* [Official Page](https://pandas.pydata.org/)
* [Official Docs](https://pandas.pydata.org/docs/)
* Lecture Notes *(refer to ../LectureNotes(HKUST)/13-pandas.pdf)*

### Pandas Series

A Pandas Series is a one-dimensional **labeled array** in the Pandas library for Python. It is a fundamental data structure in Pandas and can be thought of as a single column of data in a spreadsheet or a single column within a Pandas DataFrame. 

Note: **labeled array** means that, in addition to indexing by integer position (e.g. `series.iloc[0]`), each element can be accessed by its label (e.g. `series['name']` or `series.loc['name']`).
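
As a small sketch of what "labeled" means in practice, the snippet below reads the same element by label and by position. The series `scores` and its labels are made up for illustration.

```python
import pandas

# Hypothetical data, used only to illustrate label-based access
scores = pandas.Series(data=[90, 85, 77], index=['math', 'physics', 'history'])

by_label = scores['physics']      # plain [] with a label
by_loc = scores.loc['physics']    # explicit label-based access
by_position = scores.iloc[1]      # explicit position-based access

print(by_label, by_loc, by_position)  # all three refer to the same element
print('physics' in scores)            # membership tests check the labels, not the values
```

Using `.loc` and `.iloc` explicitly is usually the clearer choice, because it states whether a label or a position is meant.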

In [1]:
import pandas
import numpy

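# Let's create a pandas Series out of a NumPy array.
# Note: a NumPy array holds a single type, so mixing strings with the integer 19
# makes NumPy store 19 as the string '19'.
data: numpy.ndarray = numpy.array(object = ['LI', 'Hantang', 'Male', 19])
series = pandas.Series(data = data, index = ['family_name', 'given_name', 'gender', 'age'])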

# Let's print the series to have a peek into it
print("--------- Our First Series ---------")
print(series)

# Indexing
print("------------- Indexing -------------")
print(f"Your last name is: {series.loc['family_name']} (using named index)")
print(f"Your first name is: {series.iloc[1]} (using numerical index)")
"""
Note that:
1. <series object name>.loc[] takes the index LABEL (the named index);
2. <series object name>.iloc[] takes the integer position (just like a normal array).
"""

# Slicing (just like Python list, tuple and numpy ndarray)
print("------------- Slicing --------------")
name_series = series.loc['family_name' : 'given_name']
print(f"Series containing your name:\n{name_series}")
name_series.iloc[0] = "LIAN"
name_series.iloc[1] = "TANG"
print(f"Your 'real?' info:\n{series}")
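print(f"Modified info:\n{name_series}")
# In the output shown here the original series changed too, so the slice acted as a view.
# This depends on the pandas version: with Copy-on-Write (the default from pandas 3.0),
# the original series is left unchanged.
"""
!!! Note that, unlike lists and ndarrays, slicing with .loc includes BOTH
the start and the stop labels.
.iloc behaves like NumPy arrays and lists: the start position is included
and the end position is excluded.
"""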

# Masking
print("------------- Masking --------------")
numerical = pandas.Series(data = [1, 2, 3, 4, 5, 6]) # if no index is specified, the default integer index 0, 1, ... is used
mask = numerical > 3
print(f"Mask:\n{mask}")
print(f"Masked Series:\n{numerical[mask]}") # plain [] accepts a boolean mask; .loc[mask] works as well

--------- Our First Series ---------
family_name         LI
given_name     Hantang
gender            Male
age                 19
dtype: object
------------- Indexing -------------
Your last name is: LI (using named index)
Your first name is: Hantang (using numerical index)
------------- Slicing --------------
Series containing your name:
family_name         LI
given_name     Hantang
dtype: object
Your 'real?' info:
family_name    LIAN
given_name     TANG
gender         Male
age              19
dtype: object
Modified info:
family_name    LIAN
given_name     TANG
dtype: object
------------- Masking --------------
Mask:
0    False
1    False
2    False
3     True
4     True
5     True
dtype: bool
Masked Series:
3    4
4    5
5    6
dtype: int64
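
Whether a slice shares data with the original series depends on the pandas version and its settings, so code that must not touch the original is safer with an explicit copy. A minimal sketch, assuming a small made-up series `person`:

```python
import pandas

# Hypothetical data, used only for illustration
person = pandas.Series(data=['LI', 'Hantang', 'Male'],
                       index=['family_name', 'given_name', 'gender'])

# .copy() returns an independent Series, so edits never reach `person`
name_copy = person.loc['family_name':'given_name'].copy()
name_copy.loc['family_name'] = 'LIAN'

print(person.loc['family_name'])     # the original value is untouched
print(name_copy.loc['family_name'])  # only the copy was modified
```

The extra copy costs some memory, but it makes the intent explicit and behaves the same way regardless of how pandas handles views.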


### Pandas DataFrame

A DataFrame is a powerful 2-dimensional data structure in Pandas, similar to a spreadsheet or SQL table.

It consists of **rows and columns**, where **each column is a Series object**. Different columns can hold different data types, but all columns share the same row index.

Each column in a DataFrame has a name (a label, normally unique), which allows for easy access and manipulation of the data.

DataFrames are ideal for handling structured data and performing complex data analysis and manipulation tasks.
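
A minimal sketch of these points, using a small made-up table (the column names and values are hypothetical): each column comes back as a Series, columns may have different dtypes, and all of them share the row index.

```python
import pandas

# Hypothetical table, used only for illustration
students = pandas.DataFrame(
    data={
        'name': ['Alice', 'Bob', 'Carol'],
        'age': [19, 21, 20],
        'gpa': [3.7, 3.2, 3.9],
    },
    index=['s1', 's2', 's3'],
)

age_column = students['age']                     # selecting one column gives a Series
print(type(age_column).__name__)                 # the class name of the selected column
print(students.dtypes)                           # one dtype per column
print(age_column.index.equals(students.index))   # columns share the row index

# Boolean masks select rows of a DataFrame, just as they select elements of a Series
older_students = students[students['age'] > 19]
print(older_students)
```

Building the DataFrame from a dict of lists is only one of several options; it is used here because it makes the column-by-column structure easy to see.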