# Pandas 
> [Main Table of Contents](../README.md)

High performance tabular data


## In this notebook
- Note: Python's logical operators (`and`, `or`, `not`) don't work element-wise on a DataFrame or Series; use NumPy logical functions or the bitwise operators (`&`, `|`, `~`) instead
- Axis
- DataFrame
- DataFrame Creation
- DataFrame Properties
- Filter DataFrame
	- with one boolean array
	- with numpy logical functions
	- with bitwise operation
	- with `isin` method
- Filter / Subset rows in multiIndex dataframe
- Get/ Set(Add) Columns
- Pivot
- Read Files
- Plotting (Avail out of the box)
- Explicit Index
- Index Methods
	- sort_index
- Subset Date Index

In [1]:
import pandas as pd

In [2]:
# NOTEBOOK DATA
twod_dict = {'cars_per_cap': [809, 200, 70, 45, 150],
             'num_peeps': [400, 100, pd.NA, 3, 5],
             'country': ['United States', 'Russia', 'Morocco', 'Egypt', 'China'],
             'drives_right': [True, False, False, True, True]}
labels = ['US', 'RU', 'MO', 'EG', 'CH']
df = pd.DataFrame(twod_dict, index=labels)

## Axis

- Many built-in Series and DataFrame methods take an `axis` keyword argument

Axis | Description | Default
--- | --- | ---
0 | Runs down the rows, so the function is applied to each column (Series) | Yes
1 | Runs across the columns, so the function is applied to each row | 
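A minimal sketch of the two axis values, using `sum` on a small made-up frame (the `scores` data is hypothetical and not part of the notebook data):

```python
import pandas as pd

# Hypothetical data, for illustration only
scores = pd.DataFrame({'math': [90, 80], 'art': [70, 60]}, index=['ann', 'bob'])

col_totals = scores.sum(axis=0)  # collapses the rows: one result per column
row_totals = scores.sum(axis=1)  # collapses the columns: one result per row

print(col_totals)
print(row_totals)
```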

## DataFrame
- A DataFrame is a collection of Series (one per column) that share the same index

## DataFrame Creation
- Many ways to create a DataFrame
- Two of the most common ways are below

In [3]:
# Common Creation Method (dictionary)
# Named `prices` rather than `dict` to avoid shadowing the built-in
prices = {'AAPL': [143.5,  144.09, 142.73, 144.18, 143.77],    # col 1
          'GOOG': [898.7,  911.71, 906.69, 918.59, 926.99],    # col 2
          'IBM':  [155.58, 153.67, 152.36, 152.94, 153.49]}    # col 3
dates = pd.date_range('2017-07-03', periods=5, freq='D')
pd.DataFrame(prices, index=dates)

Unnamed: 0,AAPL,GOOG,IBM
2017-07-03,143.5,898.7,155.58
2017-07-04,144.09,911.71,153.67
2017-07-05,142.73,906.69,152.36
2017-07-06,144.18,918.59,152.94
2017-07-07,143.77,926.99,153.49


In [4]:
# Common Creation Method (list of lists)
# Manually add column labels
# TODO:
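
One way the list-of-lists method might look, as a sketch rather than the notebook's own cell: each inner list is one row, and the column labels are passed through the `columns` argument. The variable names `rows` and `stock_df` are placeholders chosen for this example.

```python
import pandas as pd

# Each inner list is one row; column labels are supplied separately
rows = [[143.5, 898.7, 155.58],
        [144.09, 911.71, 153.67],
        [142.73, 906.69, 152.36]]
dates = pd.date_range('2017-07-03', periods=3, freq='D')

stock_df = pd.DataFrame(rows, columns=['AAPL', 'GOOG', 'IBM'], index=dates)
print(stock_df)
```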

## DataFrame Properties

Property | Description
--- | ---
.shape | tuple (num_rows, num_cols)
.columns | Index object with column names<br>Index object is an iterable
.index | Index object with row names or row numbers
.values | NumPy array holding just the values of the dataframe (`.to_numpy()` is the recommended equivalent)
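
A short sketch of these properties on a small made-up frame (`frame` is a hypothetical name, not the notebook's `df`):

```python
import pandas as pd

# Hypothetical frame, for illustration only
frame = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}, index=['x', 'y', 'z'])

print(frame.shape)          # (3, 2)
print(list(frame.columns))  # ['a', 'b']
print(list(frame.index))    # ['x', 'y', 'z']
print(frame.values)         # 2-D NumPy array of the cell values
```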

## Filter DataFrame 

### with one boolean array

In [5]:
# Comparing a pandas Series with a scalar gives a boolean Series (True, True, False, False, True)
bool_arr = df['cars_per_cap'] > 100
df[bool_arr]

Unnamed: 0,cars_per_cap,num_peeps,country,drives_right
US,809,400,United States,True
RU,200,100,Russia,False
CH,150,5,China,True


### with numpy logical functions

In [6]:
import numpy as np
# Element-wise OR of two boolean Series; drives_right is already boolean, so `== True` is not needed
np_logical_fn = np.logical_or(df['cars_per_cap'] > 100, df['drives_right'])
df[np_logical_fn]

Unnamed: 0,cars_per_cap,num_peeps,country,drives_right
US,809,400,United States,True
RU,200,100,Russia,False
EG,45,3,Egypt,True
CH,150,5,China,True


### with bitwise operation

In [7]:
# Parentheses are required because | binds more tightly than comparison operators
bit_wise_op = (df['cars_per_cap'] > 100) | df['drives_right']
df[bit_wise_op]

Unnamed: 0,cars_per_cap,num_peeps,country,drives_right
US,809,400,United States,True
RU,200,100,Russia,False
EG,45,3,Egypt,True
CH,150,5,China,True
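
The note at the top of the notebook says Python's logical operators do not work on a DataFrame. The sketch below shows what happens with `and`, then the element-wise `&` and `~` equivalents. It rebuilds a cut-down copy of the notebook data under the hypothetical name `cars` so that it runs on its own.

```python
import pandas as pd

# Cut-down copy of the notebook data so the block runs on its own
cars = pd.DataFrame({'cars_per_cap': [809, 200, 70, 45, 150],
                     'drives_right': [True, False, False, True, True]},
                    index=['US', 'RU', 'MO', 'EG', 'CH'])

try:
    cars[(cars['cars_per_cap'] > 100) and cars['drives_right']]
except ValueError as err:
    # `and` asks for the truth value of a whole Series, which is ambiguous
    print('and failed:', err)

# Element-wise AND: more than 100 cars per capita and drives on the right
both = cars[(cars['cars_per_cap'] > 100) & cars['drives_right']]
# Element-wise NOT of an OR: rows that satisfy neither condition
neither = cars[~((cars['cars_per_cap'] > 100) | cars['drives_right'])]

print(both)
print(neither)
```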


### with `isin` method
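- Alternative to chaining many `==` comparisons with `|`
- Checks whether each element is contained in the given values
- More concise
- Called on a DataFrame it returns a boolean DataFrame; called on a Series it returns a boolean Series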

In [8]:
long_list_of_values_of_interest = [True, 45, 'United States']
bool_df = df.isin(long_list_of_values_of_interest)  # boolean dataframe
df[bool_df]  # keeps the shape; cells that do not match become NaN

Unnamed: 0,cars_per_cap,num_peeps,country,drives_right
US,,,United States,1.0
RU,,,,
MO,,,,
EG,45.0,,,1.0
CH,,,,1.0
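
Indexing with a boolean DataFrame masks cells rather than dropping rows. To filter rows instead, `isin` can be called on a single column, which gives a boolean Series. A minimal sketch on a cut-down copy of the notebook data (the names `countries`, `wanted` and `subset` are placeholders):

```python
import pandas as pd

# Cut-down copy of the notebook data so the block runs on its own
countries = pd.DataFrame({'cars_per_cap': [809, 200, 70, 45, 150],
                          'country': ['United States', 'Russia', 'Morocco', 'Egypt', 'China']},
                         index=['US', 'RU', 'MO', 'EG', 'CH'])

# Series.isin returns a boolean Series, so indexing with it drops non-matching rows
wanted = countries['country'].isin(['Egypt', 'China', 'Morocco'])
subset = countries[wanted]
print(subset)
```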


## Filter / Subset rows in multiIndex dataframe   
TODO: Come back to this; I don't know most of it yet
 - [Link to stackoverflow comprehensive reference](https://stackoverflow.com/questions/53927460/select-rows-in-pandas-multiindex-dataframe)
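
Until this section is filled in, here is a minimal sketch of two common ways to select rows from a MultiIndex frame: `.loc` with an outer label or a tuple, and `.xs` on a named level. The `dogs` data is a hypothetical stand-in, and the linked reference covers many more cases.

```python
import pandas as pd

# Hypothetical data, for illustration only
dogs = pd.DataFrame({'color': ['Brown', 'Brown', 'Black', 'Brown'],
                     'breed': ['Beagle', 'Mixed', 'Lab', 'Corgi'],
                     'height': [1.0, 1.5, 2.0, 1.0]})
dogs = dogs.set_index(['color', 'breed']).sort_index()

brown = dogs.loc['Brown']               # all rows under the outer label 'Brown'
one_row = dogs.loc[('Brown', 'Corgi')]  # a single (outer, inner) pair
labs = dogs.xs('Lab', level='breed')    # select on the inner level by name

print(brown)
print(one_row)
print(labs)
```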

## Get/ Set(Add) Columns
- Similar to adding to a dictionary
- Use square brackets to set new column name and new values
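
A short sketch of getting and setting columns with square brackets. The frame is a cut-down copy of the notebook data, and the new column names `many_cars` and `region_code` are made up for the example.

```python
import pandas as pd

cars = pd.DataFrame({'cars_per_cap': [809, 200, 70],
                     'country': ['United States', 'Russia', 'Morocco']},
                    index=['US', 'RU', 'MO'])

per_cap = cars['cars_per_cap']                # one column -> Series
two_cols = cars[['country', 'cars_per_cap']]  # list of columns -> DataFrame

cars['many_cars'] = cars['cars_per_cap'] > 100  # new column from a computed Series
cars['region_code'] = [1, 2, 3]                 # new column from a list (length must match)
print(cars)
```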

## Read Files

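For CSV files, use `pandas.read_csv(path, chunksize=1000)`, where `chunksize` is the number of rows per chunk. With `chunksize` set, the call returns an iterator of DataFrames instead of one DataFrame.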
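
A minimal sketch of chunked reading. An in-memory string stands in for a file on disk, and the column names `id` and `value` are invented for the example.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file on disk (hypothetical data)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

total = 0
chunk_sizes = []
# With chunksize set, read_csv yields DataFrames of up to 4 rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += chunk['value'].sum()

print(chunk_sizes)  # [4, 4, 2]
print(total)        # 450
```

Processing one chunk at a time keeps memory use bounded by the chunk size rather than the file size.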

## Plotting (Avail out of the box)
- Pandas is built on top of NumPy, and its plotting methods use matplotlib as the default backend
- You still need to import `matplotlib.pyplot` to call `show()`


```python
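import matplotlib.pyplot as plt

# General plotting: kind is e.g. 'line', 'bar', 'scatter', 'hist', 'box'
df.plot(kind='scatter', x='x_column_name', y='y_column_name')

# Histogram of one column (DataFrame.hist takes `column`, not kind/x/y)
df.hist(column='column_name', bins=10)

plt.show()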
```


## Index Methods

Method | Description | Use Cases
--- | --- | ---
set_index() | Explicitly set the index using existing columns | Easily grab named rows with `.loc`
reset_index() | Move the current index (or one level of it) back into the columns<br>The default integer index takes its place |
sort_index() | Sort by index labels | Slicing + `.loc` is a power combo<br>Label slicing generally needs a sorted index
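
The three methods can be sketched together on a small made-up frame (the `dogs` data here is hypothetical):

```python
import pandas as pd

# Hypothetical data, for illustration only
dogs = pd.DataFrame({'breed': ['Mixed', 'Beagle', 'Lab', 'Corgi'],
                     'weight': [45, 25, 65, 27]})

by_breed = dogs.set_index('breed')     # existing column becomes the index
print(by_breed.loc['Lab', 'weight'])   # 65

sorted_df = by_breed.sort_index()      # Beagle, Corgi, Lab, Mixed
print(sorted_df.loc['Beagle':'Lab'])   # label slice; both ends are included

restored = sorted_df.reset_index()     # 'breed' is a column again, index is 0..3
print(restored)
```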

In [9]:
from datetime import datetime, timezone, timedelta
twod_dict = {'breed': ['Beagle', 'Mixed', 'Lab', 'Lab', 'Corgi'],
             'color': ['Brown', 'Brown', 'Black', 'Black', 'Brown'],
             'height': [1, 1.5, 2, 2, 1],
             'weight': [25, 45, 65, pd.NA, 27]}
# labels = ['US', 'RU', 'MO', 'EG', 'CH']
labels = [ datetime.now(tz=timezone.utc) - timedelta(days=r) for r in range(5)]
df = pd.DataFrame(twod_dict, index=labels)
df

Unnamed: 0,breed,color,height,weight
2022-10-28 01:59:45.915928+00:00,Beagle,Brown,1.0,25.0
2022-10-27 01:59:45.915939+00:00,Mixed,Brown,1.5,45.0
2022-10-26 01:59:45.915941+00:00,Lab,Black,2.0,65.0
2022-10-25 01:59:45.915942+00:00,Lab,Black,2.0,
2022-10-24 01:59:45.915943+00:00,Corgi,Brown,1.0,27.0


### sort_index
Option | Description 
--- | ---
axis | 0 (default) sorts by row labels, 1 sorts by column labels
level | Level of the multi-index to sort by, where 0 is the outermost<br>Level names also work
ascending | True (default)
kind | 'quicksort' (default), 'mergesort', 'heapsort', 'stable'
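
A brief sketch of the `axis` and `ascending` options on a made-up frame with unsorted row and column labels (`frame` is a hypothetical name):

```python
import pandas as pd

# Hypothetical frame with unsorted labels on both axes
frame = pd.DataFrame({'b': [1, 2, 3], 'a': [4, 5, 6]}, index=[2, 0, 1])

rows_desc = frame.sort_index(ascending=False)  # row labels become 2, 1, 0
cols_sorted = frame.sort_index(axis=1)         # column labels become a, b

print(rows_desc)
print(cols_sorted)
```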

In [10]:
# color is level 0 (outer), breed is level 1 (inner)
new_df = df.set_index(['color', 'breed'])
new_df = new_df.sort_index(level=1)  # sort by breed
new_df

color,breed,height,weight
Brown,Beagle,1.0,25.0
Brown,Corgi,1.0,27.0
Black,Lab,2.0,65.0
Black,Lab,2.0,
Brown,Mixed,1.5,45.0


## Subset Date Index
- When the index holds datetimes (a `DatetimeIndex`), rows can be selected with a partial date string such as `'2022-10'`

In [11]:
# Index is in descending order here, so the slice bounds run from latest to earliest
df = df.loc['2022-10':'2022-08']  # TODO: come back to why this isn't working
df

Unnamed: 0,breed,color,height,weight
2022-10-28 01:59:45.915928+00:00,Beagle,Brown,1.0,25.0
2022-10-27 01:59:45.915939+00:00,Mixed,Brown,1.5,45.0
2022-10-26 01:59:45.915941+00:00,Lab,Black,2.0,65.0
2022-10-25 01:59:45.915942+00:00,Lab,Black,2.0,
2022-10-24 01:59:45.915943+00:00,Corgi,Brown,1.0,27.0
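
For comparison, a reproducible sketch of partial-date selection on an index sorted in ascending order, where slice bounds run from earliest to latest. The `readings` frame and its fixed dates are hypothetical and replace the `datetime.now()` labels used above.

```python
import pandas as pd

# Hypothetical daily readings: 2022-08-30 through 2022-10-08, ascending
idx = pd.date_range('2022-08-30', periods=40, freq='D')
readings = pd.DataFrame({'value': range(40)}, index=idx)

september = readings.loc['2022-09']             # every row in September 2022
aug_to_sep = readings.loc['2022-08':'2022-09']  # month-level slice; both ends included

print(len(september))   # 30
print(len(aug_to_sep))  # 32
```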
