# Python Pandas
## 1 - Introduction to data structures

**Tool - pandas**

A library that provides high-level data structures and functions designed to work with tabular data. The main data structure is the `DataFrame`, which can be viewed as an in-memory 2D table (like a spreadsheet with rows and columns). Useful features include:
* Features similar to Excel (pivot tables, computing columns from other columns, etc.)
* Group rows by value
* Join tables like SQL

[Official Documentation](https://pandas.pydata.org/pandas-docs/stable/index.html)

Topics to be covered:
* 1 - Object creation
  * 1.1 - Series
  * 1.2 - DataFrame
* 2 - Explore object
  * 2.1 - Data type `.dtypes`
  * 2.2 - Data size `.shape`
  * 2.3 - Descriptive statistics `.describe`
  * 2.4 - Transpose `.T`
* 3 - Export DataFrame
  * 3.1 - Write to `.xlsx` format
    * 3.1.1 - Single tab
    * 3.1.2 - Multiple tabs
    * 3.1.3 - Save with a time stamp
  * 3.2 - Write to `.csv`
  * 3.3 - Save to `.zip`

# 1 - Object Creation
There are two main *workhorse* data structures, commonly referred to as `objects`:
* 1 - `Series` - 1D labeled array
* 2 - `DataFrame` - 2D table, similar to a spreadsheet

## 1.1 - Series
A one-dimensional, array-like object that holds a sequence of values together with an index of labels. A `Series` resembles a list, but every value carries a label.

In [1]:
import pandas as pd

In [2]:
bps = pd.Series([23, 85, -45, 32])
print(bps)

0    23
1    85
2   -45
3    32
dtype: int64
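
What sets a `Series` apart from a plain Python list is that arithmetic applies to every element at once and the result keeps the index. A minimal sketch using the same values; the names `scaled` and `total` are illustrative only:

```python
import pandas as pd

bps = pd.Series([23, 85, -45, 32])

scaled = bps / 100   # element-wise division; the index 0..3 is preserved
total = bps.sum()    # aggregate over all values

print(scaled)
print(total)
```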


Below, the same values are written again, this time with dates as the index labels.

**Index labels**

An alternative to the code below:
```python
pd.Series({'3 Aug': 23 , 
           '4 Aug': 85, 
           '5 Aug': -45, 
           '6 Aug': 32})
```


ibps = pd.Series([23, 85, -45, 32], 
                 index = ['3 Aug', '4 Aug', '5 Aug', '6 Aug'])
print(ibps)
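
Once a Series has labels, values can be selected by label as well as by position. The sketch below reuses the values above; the variable names `by_label`, `by_position`, `negatives` and `from_dict` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Same values and labels as the example above
ibps = pd.Series([23, 85, -45, 32],
                 index=['3 Aug', '4 Aug', '5 Aug', '6 Aug'])

by_label = ibps.loc['4 Aug']     # select by index label
by_position = ibps.iloc[1]       # select by integer position
negatives = ibps[ibps < 0]       # boolean filtering keeps the labels

# Building the Series from a dict gives the same result
from_dict = pd.Series({'3 Aug': 23, '4 Aug': 85, '5 Aug': -45, '6 Aug': 32})

print(by_label, by_position)
print(negatives)
print(from_dict.equals(ibps))
```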

Create a Series of strings and name it `fruit`.

In [3]:
Fruit = pd.Series(['Apple', 'Banana', 'Orange', 'Melon'], 
                 name = 'fruit')
print(Fruit)

0     Apple
1    Banana
2    Orange
3     Melon
Name: fruit, dtype: object


## 1.2 - DataFrame
Represents a rectangular table of data containing an ordered collection of columns, each of which can have a different type (numeric, string, boolean, etc.). It is two-dimensional, with both a row index and column labels.

In [4]:
Countries = pd.DataFrame({'Country':       ['Portugal', 'Poland', 'Germany', 'United Kingdom', 'France', 'Spain'],
                          'SovereignDebt': [127.7,      46.2,     64.1,      87.0,             98.5,     96.7]
                         })
Countries

,Country,SovereignDebt
0,Portugal,127.7
1,Poland,46.2
2,Germany,64.1
3,United Kingdom,87.0
4,France,98.5
5,Spain,96.7
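
Numbers typed in quotes are stored as text, so arithmetic and statistics will not work on them until they are converted. A short sketch of one way to convert, assuming the debt figures arrive as strings; `debt_text` and `debt` are hypothetical names:

```python
import pandas as pd

# Figures entered as strings: pandas treats them as text, not numbers
debt_text = pd.Series(['127.7', '46.2', '64.1', '87', '98.5', '96.7'])
print(debt_text.dtype)

# pd.to_numeric parses the strings into floating-point values
debt = pd.to_numeric(debt_text)
print(debt.dtype)
print(debt.mean())
```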


Create a DataFrame whose **index** holds the fund names

In [5]:
pd.DataFrame({'NPV': [105, 67],
              'NAV': [102, 68]},
             index = ['Fund A', 'Fund B'])

,NPV,NAV
Fund A,105,102
Fund B,67,68
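
With a labeled index, rows and single cells can be looked up by name, and new columns can be computed from existing ones. A minimal sketch on the same data; the frame name `funds` and the `Premium` column are illustrative assumptions:

```python
import pandas as pd

funds = pd.DataFrame({'NPV': [105, 67],
                      'NAV': [102, 68]},
                     index=['Fund A', 'Fund B'])

fund_a = funds.loc['Fund A']           # one row, returned as a Series
npv_b = funds.loc['Fund B', 'NPV']     # one cell, selected by row and column label

# Computed column based on other columns (hypothetical column name)
funds['Premium'] = funds['NPV'] - funds['NAV']

print(fund_a)
print(npv_b)
print(funds)
```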


# Data Structures - Reading
## Reading from `.csv`
Read data from a `.csv` file with the `pd.read_csv()` function.

In [6]:
pd.read_csv("Market.csv")

,Fund,NPV,Market Value,Fund NAV
0,Fund A,11.85,11.85,10.0
1,Fund B,7.29,7.29,7.1
2,Fund C,2.25,2.25,2.19
3,Fund D,5.15,5.15,5.07
4,Fund E,2352.82,2319.32,2333.72
5,Fund F,40.08,40.07,39.83
6,Fund H,307.36,303.47,305.55
7,Fund I,80.42,80.42,80.03
8,Fund J,178.83,178.82,178.09
9,Fund K,29.14,29.13,29.03


**Index** the `Fund` column

An alternative to the code below:
```python
pd.read_csv("Market.csv", index_col = 0, header = 0)
```


In [7]:
pd.read_csv("Market.csv", index_col = 'Fund')

Fund,NPV,Market Value,Fund NAV
Fund A,11.85,11.85,10.0
Fund B,7.29,7.29,7.1
Fund C,2.25,2.25,2.19
Fund D,5.15,5.15,5.07
Fund E,2352.82,2319.32,2333.72
Fund F,40.08,40.07,39.83
Fund H,307.36,303.47,305.55
Fund I,80.42,80.42,80.03
Fund J,178.83,178.82,178.09
Fund K,29.14,29.13,29.03
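
The two ways of choosing the index column (by name or by position) can be compared without a file on disk. The sketch below feeds a small in-memory sample through `io.StringIO`; the sample text stands in for `Market.csv` and is an assumption about its layout:

```python
import io
import pandas as pd

# Two rows in the same layout as the table above
csv_text = """Fund,NPV,Market Value,Fund NAV
Fund A,11.85,11.85,10.0
Fund B,7.29,7.29,7.1
"""

by_name = pd.read_csv(io.StringIO(csv_text), index_col='Fund')
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0, header=0)

# Both calls produce the same DataFrame
print(by_name.equals(by_position))
print(by_name.loc['Fund B', 'NPV'])
```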


`pd.read_csv` can also be used to read delimited `.txt` files; pass the delimiter through the `sep` argument when it is not a comma.
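
As an illustration, the sketch below reads tab-separated text. The sample content is made up for the example and read from memory rather than from a real `.txt` file:

```python
import io
import pandas as pd

# Tab-separated text, as it might appear in a .txt export
txt = "Fund\tNPV\nFund A\t11.85\nFund B\t7.29\n"

# sep tells read_csv which delimiter separates the columns
market = pd.read_csv(io.StringIO(txt), sep='\t', index_col='Fund')
print(market)
```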

## Reading from `.xlsx` file

```python
pd.read_excel
```

The code below reads the table from the sheet named 'Sheet1', sets the 'Country' column (column position 0) as the index, and stores the result in a DataFrame named `Countries`.

In [8]:
Countries = pd.read_excel('Countries.xlsx', sheet_name = 'Sheet1', 
              index_col = 0)
Countries

Country,European_Region,GDPperCapita(PPP),Population_Millions
Portugal,Southern,33.4,10.3
Poland,Eastern,33.5,38.4
Germany,Western,55.0,82.9
United Kingdom,Northern,47.0,66.4
France,Western,47.1,67.3
Spain,Southern,42.1,46.8
Italy,Southern,40.7,60.5
Belgium,Western,49.7,11.5
Norway,Northern,76.6,5.4
Switzerland,Western,66.8,8.6
Azerbaijan,Eastern,18.8,9.9


# 2 - Object exploration
## 2.1 - Data type
Inspect the data type of each column: numeric, string, category, etc. Use the `.dtypes` attribute. The `Countries` DataFrame is used here.

In [9]:
Countries.dtypes

European_Region         object
GDPperCapita(PPP)      float64
Population_Millions    float64
dtype: object
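
Data types can also be used to filter or convert columns. A small sketch on a three-row stand-in for `Countries` (the frame `sample` is an assumption, built by hand rather than read from `Countries.xlsx`):

```python
import pandas as pd

sample = pd.DataFrame(
    {'European_Region': ['Southern', 'Eastern', 'Western'],
     'GDPperCapita(PPP)': [33.4, 33.5, 55.0],
     'Population_Millions': [10.3, 38.4, 82.9]},
    index=pd.Index(['Portugal', 'Poland', 'Germany'], name='Country'))

# Keep only the numeric columns
numeric = sample.select_dtypes(include='number')
print(numeric.dtypes)

# Convert a text column with few distinct values to the category type
sample['European_Region'] = sample['European_Region'].astype('category')
print(sample.dtypes)
```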

## 2.2 - Data Shape
Tells how many rows and columns the DataFrame has. Use the `.shape` attribute. `(11, 3)` means 11 rows and 3 columns.

**Question:** Why 3 columns and not 4?

In [10]:
Countries.shape

(11, 3)
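
One way to explore the question above is to compare the shape before and after moving the index back into a regular column. The sketch uses a hand-built three-row stand-in named `sample`, which is an assumption and not the original data file:

```python
import pandas as pd

sample = pd.DataFrame(
    {'European_Region': ['Southern', 'Eastern', 'Western'],
     'GDPperCapita(PPP)': [33.4, 33.5, 55.0],
     'Population_Millions': [10.3, 38.4, 82.9]},
    index=pd.Index(['Portugal', 'Poland', 'Germany'], name='Country'))

print(sample.shape)                 # (3, 3): the index is not counted as a column
print(sample.reset_index().shape)   # (3, 4): 'Country' is now a regular column

rows, cols = sample.shape           # shape is a plain tuple and can be unpacked
```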

## 2.3 - Describe data
The `.describe()` method provides a statistical summary of the numeric columns.

In [11]:

Countries.describe()

,GDPperCapita(PPP),Population_Millions
count,11.0,11.0
mean,46.427273,37.090909
std,16.006379,29.055273
min,18.8,5.4
25%,37.1,10.1
50%,47.0,38.4
75%,52.35,63.45
max,76.6,82.9
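
By default the summary leaves out text columns such as `European_Region`. The sketch below, on a hand-built stand-in named `sample` (an assumption), shows how `include='all'` brings them in with count, unique, top and freq:

```python
import pandas as pd

sample = pd.DataFrame(
    {'European_Region': ['Southern', 'Eastern', 'Southern'],
     'GDPperCapita(PPP)': [33.4, 33.5, 42.1],
     'Population_Millions': [10.3, 38.4, 46.8]},
    index=pd.Index(['Portugal', 'Poland', 'Spain'], name='Country'))

numeric_summary = sample.describe()             # numeric columns only
full_summary = sample.describe(include='all')   # text columns as well

print(numeric_summary)
print(full_summary)
```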


## 2.4 - Transpose DataFrame
Use the `.T` attribute to transpose the DataFrame, turning rows into columns and columns into rows.

In [12]:
Countries.T

Country,Portugal,Poland,Germany,United Kingdom,France,Spain,Italy,Belgium,Norway,Switzerland,Azerbaijan
European_Region,Southern,Eastern,Western,Northern,Western,Southern,Southern,Western,Northern,Western,Eastern
GDPperCapita(PPP),33.4,33.5,55.0,47.0,47.1,42.1,40.7,49.7,76.6,66.8,18.8
Population_Millions,10.3,38.4,82.9,66.4,67.3,46.8,60.5,11.5,5.4,8.6,9.9
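
Transposing swaps the two dimensions of the shape, and transposing twice restores the original layout. A minimal sketch on a two-row stand-in named `sample` (an assumption):

```python
import pandas as pd

sample = pd.DataFrame(
    {'European_Region': ['Southern', 'Eastern'],
     'GDPperCapita(PPP)': [33.4, 33.5],
     'Population_Millions': [10.3, 38.4]},
    index=pd.Index(['Portugal', 'Poland'], name='Country'))

transposed = sample.T

print(sample.shape)        # (2, 3)
print(transposed.shape)    # (3, 2)
print(transposed.columns.tolist())
```

Because each transposed column now mixes text and numbers, it is usually better to transpose for display only and keep the original orientation for calculations.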


# 3 - Write a dataframe

## 3.1 - Write to `.xlsx` file

Use the `to_excel` method.

### 3.1.1 - Single Tab

In [13]:
with pd.ExcelWriter('SavedPython.xlsx') as writer:
    Countries.to_excel(writer, sheet_name='Countries')

### 3.1.2 - Multiple Tabs

In [14]:
with pd.ExcelWriter('SavedPython.xlsx') as writer:
    Countries.to_excel(writer, sheet_name='Countries')
    Fruit.to_excel(writer, sheet_name='Fruit')

### 3.1.3 - Save with Time Stamp

In [15]:
from datetime import datetime

current_time = datetime.now().strftime('%Y%m%d_%H-%M-%S')
filename = f"SavedPython_{current_time}.xlsx"

with pd.ExcelWriter(filename) as writer:
    Countries.to_excel(writer, sheet_name='Countries')  # index kept: it holds the Country names

print(f"Excel file '{filename}' created successfully.")

Excel file 'SavedPython_20250415_22-13-11.xlsx' created successfully.
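
The file name pattern can be checked on its own with a fixed date, without writing any file. In this sketch the helper `timestamped_name` is a hypothetical function introduced for illustration:

```python
from datetime import datetime

def timestamped_name(stem, when, ext='.xlsx'):
    """Build a file name such as SavedPython_20250415_22-13-11.xlsx."""
    return f"{stem}_{when.strftime('%Y%m%d_%H-%M-%S')}{ext}"

fixed = datetime(2025, 4, 15, 22, 13, 11)
name = timestamped_name('SavedPython', fixed)
print(name)   # SavedPython_20250415_22-13-11.xlsx

# The stamp can be parsed back into a datetime with the same format string
stamp = name[len('SavedPython_'):-len('.xlsx')]
parsed = datetime.strptime(stamp, '%Y%m%d_%H-%M-%S')
```

Putting year, month and day first means the files sort chronologically when sorted by name.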


## 3.2 - Write to `.csv`

In [16]:
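# index expects a boolean; the string 'False' is truthy. The index is kept because it holds the Country names.
Countries.to_csv('SavedCountries.csv', index = True)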
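
The `index` argument takes a boolean, and a non-empty string such as `'False'` counts as true in Python. The sketch below shows the effect of each boolean value on the output, using a small hand-built frame named `sample` (an assumption) and no file on disk:

```python
import pandas as pd

sample = pd.DataFrame({'Population_Millions': [10.3, 38.4]},
                      index=pd.Index(['Portugal', 'Poland'], name='Country'))

# Without a path, to_csv returns the CSV text as a string
with_index = sample.to_csv(index=True)
without_index = sample.to_csv(index=False)

print(with_index)
print(without_index)    # the Country names are missing here

print(bool('False'))    # True: a non-empty string is truthy
```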

## 3.3 - Save to `.zip`

In [17]:
Countries.to_csv('SavedCountries.zip', 
                 index = True,  # boolean, not the string 'False'; keeps the Country names
                 compression = dict(method='zip', 
                                    archive_name='SavedCountries.csv')
                )
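
To confirm what ends up inside the archive, the file can be written to a temporary folder and read back. This is a sketch under stated assumptions: `sample` is a hand-built stand-in for `Countries`, and the temporary directory is used only to avoid leaving files behind:

```python
import os
import tempfile
import zipfile

import pandas as pd

sample = pd.DataFrame({'Population_Millions': [10.3, 38.4]},
                      index=pd.Index(['Portugal', 'Poland'], name='Country'))

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, 'SavedCountries.zip')
    sample.to_csv(zip_path,
                  compression=dict(method='zip',
                                   archive_name='SavedCountries.csv'))

    # List the files stored in the archive
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()

    # read_csv infers zip compression from the .zip extension
    restored = pd.read_csv(zip_path, index_col='Country')

print(members)
print(restored)
```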