# Pandas

* [10 Minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/10min.html)

Pandas is the main library for processing vector/tabular data in Python. It seems to borrow a lot of useful ideas from R, which makes sense: R is a language designed for processing data.

By convention, pandas is imported under the shortened name ````pd````. The library also plays well with, and draws a lot of functionality from, NumPy:

In [1]:
import pandas as pd
import numpy as np

## Quick Start

The two main object types to be aware of are ````pd.Series```` (a one-dimensional set of data, typically one variable) and ````pd.DataFrame```` (a two-dimensional set of data). Both have an 'index' that serves as an identifier for each entry (each item in a Series and each row in a DataFrame).

By default, the index is autogenerated as a sequence of integers, though it can hold any data type. It acts as a (poorly constrained) primary key for the data.
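To make the "poorly constrained primary key" idea concrete, here is a minimal sketch. The variable names and values are illustrative only and are not part of the original notebook.

```python
import pandas as pd

# Labels other than integers can serve as the index.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])  # label-based lookup

# pandas does not enforce uniqueness: duplicate labels are accepted,
# and looking up a duplicated label returns every matching entry.
dup = pd.Series([1, 2, 3], index=["x", "x", "y"])
print(dup.index.is_unique)
print(dup["x"])
```

If unique labels matter for later steps, checking ````index.is_unique```` before relying on label lookups is one way to catch duplicates early.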

### Object Creation

Series are created with a list of values:

In [2]:
s = pd.Series([1,3,5,np.nan,6,8])
print(s)

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64


DataFrames can be created from a dictionary of values, each of which can be converted to a Series. Columns can have different [data types](https://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes), and NumPy matrices (2-D arrays) can be converted seamlessly to a DataFrame too:

In [3]:
dates = pd.date_range('20130101', periods=4)

df = pd.DataFrame({'A' : 1.,
                   'B' : pd.Timestamp('20130102'),
                   'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                   'D' : np.array([3] * 4,dtype='int32'),
                   'E' : pd.Categorical(["test","train","test","train"]),
                   'F' : 'foo' 
                   },
                  index = dates
                 )
print(df.dtypes)
df

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object


              A          B   C  D      E    F
2013-01-01  1.0 2013-01-02 NaN  3   test  foo
2013-01-02  1.0 2013-01-02 NaN  3  train  foo
2013-01-03  1.0 2013-01-02 NaN  3   test  foo
2013-01-04  1.0 2013-01-02 NaN  3  train  foo
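Two details of the constructor are worth a short sketch: converting a NumPy array, and why column ````C```` in the output above contains only missing values. The array contents and the names ````from_array````, ````misaligned```` and ````aligned```` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=4)

# A 2-D NumPy array converts directly; row and column labels are supplied separately.
arr = np.arange(12).reshape(4, 3)
from_array = pd.DataFrame(arr, index=dates, columns=["X", "Y", "Z"])
print(from_array)

# A Series passed as a column is aligned on its labels, not its position.
# The labels 0..3 do not match the dates, so every value becomes NaN.
misaligned = pd.DataFrame(
    {"C": pd.Series(1, index=range(4), dtype="float32")}, index=dates
)
print(misaligned["C"].isna().all())

# Giving the Series the same index as the DataFrame keeps the values.
aligned = pd.DataFrame(
    {"C": pd.Series(1, index=dates, dtype="float32")}, index=dates
)
print(aligned["C"].tolist())
```

Label alignment is deliberate behaviour in pandas, so when a column built from a Series comes out empty, comparing the two indexes is a reasonable first check.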


### DataFrame Properties

In [4]:
print(df.index, '\n')
print(df.columns)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04'], dtype='datetime64[ns]', freq='D') 

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')


### Summarising DataFrames

Data can be subset to either the ````head```` or the ````tail```` of the DataFrame:

In [5]:
df.head()

              A          B   C  D      E    F
2013-01-01  1.0 2013-01-02 NaN  3   test  foo
2013-01-02  1.0 2013-01-02 NaN  3  train  foo
2013-01-03  1.0 2013-01-02 NaN  3   test  foo
2013-01-04  1.0 2013-01-02 NaN  3  train  foo


In [6]:
df.tail(2)

              A          B   C  D      E    F
2013-01-03  1.0 2013-01-02 NaN  3   test  foo
2013-01-04  1.0 2013-01-02 NaN  3  train  foo


Row and column names are available from the ````index```` and ````columns```` attributes (similar to ````rownames```` and ````colnames```` in R), as shown under DataFrame Properties.

A quick statistical summary of the numeric columns can be accessed with ````describe````:

In [7]:
print(df.describe())

         A    C    D
count  4.0  0.0  4.0
mean   1.0  NaN  3.0
std    0.0  NaN  0.0
min    1.0  NaN  3.0
25%    1.0  NaN  3.0
50%    1.0  NaN  3.0
75%    1.0  NaN  3.0
max    1.0  NaN  3.0
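The ````count```` of 0.0 for column ````C```` above reflects that ````describe```` skips missing values, and the non-numeric columns are left out by default. A small sketch of both behaviours follows, using a made-up frame called ````small```` that is not part of the original notebook.

```python
import pandas as pd

small = pd.DataFrame({
    "value": [1.0, 2.0, float("nan"), 4.0],
    "label": ["test", "train", "test", "train"],
})

# By default only numeric columns are summarised, and count ignores NaN.
numeric_summary = small.describe()
print(numeric_summary)

# include="all" also summarises non-numeric columns (count, unique, top, freq).
full_summary = small.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column's type show up as ````NaN```` in the combined summary.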


### Accessing Data

Data can be accessed or set with expressions like ````df['A']````, ````df[0:2]```` or ````df['A'][0:2]````, which is convenient for interactive work, but it's advised to use pandas' optimised access methods like ````.at````, ````.iat````, ````.loc```` and ````.iloc```` for anything beyond that.

* ````.loc```` - Accesses a row by reference to its index.
    * Can use the following syntax:
        * A single label (this is *not* converted to a position, for that use ````iloc````).
        * A list / array of labels.
        * A slice of labels (with the results being inclusive at both ends).
        * A boolean array.
        * A function with one argument (the DataFrame / Series calling it) that produces one of the above.
    * Will also take a second argument, which narrows down the column by label.
* ````.iloc```` - Works the same as ````.loc````, but is based on integer positions rather than labels (slices exclude the end position, as in standard Python).
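The selector forms listed above can be sketched on a small frame with string labels. The frame and its values are illustrative and not part of the original notebook.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["w", "x", "y", "z"],
)

print(frame.loc["x"])                     # single label: one row, returned as a Series
print(frame.loc[["w", "z"]])              # list of labels
print(frame.loc["x":"z"])                 # label slice, inclusive at both ends
print(frame.loc[frame["A"] > 2])          # boolean array
print(frame.loc[lambda d: d["B"] < 30])   # callable that returns a boolean array
print(frame.loc["x":"z", "B"])            # second argument narrows the columns by label

print(frame.iloc[1:3])                    # positional slice, end position excluded
```

Note the difference in slice length: ````frame.loc["x":"z"]```` returns three rows, while ````frame.iloc[1:3]```` returns two.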


In [8]:
print(df.loc[dates[0]], '\n')
print(df.iloc[0])

A                      1
B    2013-01-02 00:00:00
C                    nan
D                      3
E                   test
F                    foo
Name: 2013-01-01 00:00:00, dtype: object 

A                      1
B    2013-01-02 00:00:00
C                    nan
D                      3
E                   test
F                    foo
Name: 2013-01-01 00:00:00, dtype: object


In [9]:
print(df.loc[dates[0], 'A'], '\n')
print(df.iloc[0, 0])

1.0 

1.0
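For single-value lookups like the ones above, ````.at```` (by label) and ````.iat```` (by position) are the scalar counterparts of ````.loc```` and ````.iloc````. A minimal sketch on a reduced, illustrative version of the frame:

```python
import pandas as pd

dates = pd.date_range("20130101", periods=4)
frame = pd.DataFrame({"A": 1.0, "D": 3}, index=dates)

# .at takes a row label and a column label; .iat takes integer positions.
print(frame.at[dates[0], "A"])
print(frame.iat[0, 0])

# Both can also assign a single value.
frame.at[dates[1], "A"] = 5.0
frame.iat[2, 1] = 7
print(frame)
```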


In [10]:
print(df.loc[:,['A', 'B']], '\n')
print(df.iloc[:,[0, 1]], '\n')

              A          B
2013-01-01  1.0 2013-01-02
2013-01-02  1.0 2013-01-02
2013-01-03  1.0 2013-01-02
2013-01-04  1.0 2013-01-02 

              A          B
2013-01-01  1.0 2013-01-02
2013-01-02  1.0 2013-01-02
2013-01-03  1.0 2013-01-02
2013-01-04  1.0 2013-01-02 



In [11]:
print(df.loc['20130102':'20130104',['A','C']], '\n')
print(df.iloc[1:4,[0, 2]])

              A   C
2013-01-02  1.0 NaN
2013-01-03  1.0 NaN
2013-01-04  1.0 NaN 

              A   C
2013-01-02  1.0 NaN
2013-01-03  1.0 NaN
2013-01-04  1.0 NaN


### Boolean Indexing

Boolean values can also be used to select rows in a DataFrame:

In [14]:
print(df.A > 0, '\n')
print(df[df.A > 0])

2013-01-01    True
2013-01-02    True
2013-01-03    True
2013-01-04    True
Freq: D, Name: A, dtype: bool 

              A          B   C  D      E    F
2013-01-01  1.0 2013-01-02 NaN  3   test  foo
2013-01-02  1.0 2013-01-02 NaN  3  train  foo
2013-01-03  1.0 2013-01-02 NaN  3   test  foo
2013-01-04  1.0 2013-01-02 NaN  3  train  foo
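Because every value of ````A```` above is positive, the filter keeps all rows. The sketch below uses made-up values (not from the original notebook) so that the mask actually removes rows, and shows how conditions can be combined.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": [1.0, -2.0, 3.0, -4.0],
    "E": ["test", "train", "test", "train"],
})

# Combine conditions with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses.
negative_train = frame[(frame["A"] < 0) & (frame["E"] == "train")]
print(negative_train)

# isin builds a mask from a collection of allowed values.
in_set = frame[frame["E"].isin(["test"])]
print(in_set)

# A mask combined with .loc can also assign to the selected cells.
frame.loc[frame["A"] < 0, "A"] = 0.0
print(frame)
```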
