# Pandas
- Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
- It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
- Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

## What does the Pandas library have to offer
1. High performance, easy to use data structures/Data model
    - 1 D Index: object
    - 1 D Series: Column
    - 2 D DataFrame: Table of Rows and Columns(Sheet)
    - 3 D Panel: Multiple Sheets
2. Functions and methods for
    - Data reads and writes: CSV, JSON, HTML, MS Excel, HDF5, SQL, ASCII, etc.
    - Data wrangling: joins, aggregation, filtering, etc.
    - Data analysis and visualization
3. Implementation perspective
    - Built on top of NumPy, with performance-critical parts written in Cython
    - Less memory overhead
    - Acts like an in-memory NoSQL database
    - Quicker than pure Python but slower than NumPy
    - Vectorized operations
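
The points above can be sketched in a few lines. This is a minimal illustration, not a benchmark: the CSV text, the column names and the `1.5` threshold are made up for the example, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content, used only for illustration.
csv_text = "item,units,price\napple,3,0.5\npear,2,0.75\n"

# Read: parse CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Vectorized operation: multiply two whole columns without a Python loop.
df["total"] = df["units"] * df["price"]

# Wrangle: filter rows with a boolean condition, then aggregate.
big_orders = df[df["total"] >= 1.5]
grand_total = df["total"].sum()

# Write: serialize the DataFrame back to CSV text.
csv_out = df.to_csv(index=False)
print(csv_out)
```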

## Pandas Resources

In [8]:
import pandas as pd

In [2]:
pd.__version__

'0.23.4'

In [None]:
pd.<TAB>

In [4]:
pd?

In [5]:
pd.read_csv?

## Pandas Data structure and Model

- #### 1D - Index object
    - Type: Index
    - Immutable 1D ndarray
    - ordered
    - sliceable
    - Only hashable objects
    - Stores (for all Pandas objects)
        - Axis Labels
        - Row/Column names
        
- #### 1D - Series object
    - Type: Series
    - 1D ndarray with axis labels as pd.Index objects
    - Labels
        - Not unique
        - must be hashable
    - Indexing
        - Integer based
        - Label based
    - Missing data
        - NaN
        
- #### 2D - DataFrame
    - Type: DataFrame
    - Mutable
    - Potentially heterogeneous data
    - Dict of Series objects structured with labelled axes
    
- #### 3D - Panel
    - Type: Panel
    - 3D ndarray / Dict of DataFrames
    - Deprecated since pandas 0.20 and removed in 0.25; still present in the 0.23.4 release used here
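
A small sketch of the properties listed above, assuming a reasonably recent pandas. The labels, values and column names are invented for the example; Panel is left out because it no longer exists in current releases.

```python
import numpy as np
import pandas as pd

# Index: immutable, ordered, sliceable.
idx = pd.Index(["a", "b", "c"])
try:
    idx[0] = "z"              # item assignment is not supported
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
first_two = idx[:2]           # slicing returns another Index

# Series: labelled 1D data; missing values are stored as NaN.
s = pd.Series([1.0, np.nan, 3.0], index=idx, name="score")
n_missing = int(s.isna().sum())

# DataFrame: columns may have different dtypes and share one row index.
df = pd.DataFrame({"score": s, "tag": pd.Series(["x", "y", "z"], index=idx)})
df["passed"] = df["score"] > 2   # mutable: a column can be added in place
```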

## Index

In [6]:
pd.Index?

In [7]:
col_labels = pd.Index(['name','age','salary'])

In [8]:
print(col_labels)

Index(['name', 'age', 'salary'], dtype='object')


In [9]:
type(col_labels)

pandas.core.indexes.base.Index

In [11]:
row_labels = pd.Index(["Id"+str(i) for i in range(10)])

In [12]:
row_labels

Index(['Id0', 'Id1', 'Id2', 'Id3', 'Id4', 'Id5', 'Id6', 'Id7', 'Id8', 'Id9'], dtype='object')

## Series

In [13]:
pd.Series?

In [25]:
row_labels = pd.Index(["ID"+str(i) for i in range(1, 3)])

In [17]:
row_labels

Index(['ID1', 'ID2'], dtype='object')

In [18]:
lname = pd.Series(data=['x','y'],
                 name = "Last Name")

In [19]:
lname

0    x
1    y
Name: Last Name, dtype: object

In [20]:
lname.index

RangeIndex(start=0, stop=2, step=1)

In [26]:
lname = pd.Series(data=['x','y'],name = "Last Name", index=row_labels)

In [22]:
lname

ID1    x
ID2    y
Name: Last Name, dtype: object

In [27]:
lname.index

Index(['ID1', 'ID2'], dtype='object')

In [28]:
type(lname.index)

pandas.core.indexes.base.Index

### NOTE: Label based indexing supports:
- Dates
- Strings
- Hashable objects
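
As a rough illustration of the note above, the sketch below looks values up by date labels and by string labels. The numbers are sample values only, and the variable names are made up for the example.

```python
import pandas as pd

# Date labels: a DatetimeIndex built from strings.
dates = pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-03"])
by_date = pd.Series([20.2, 8.0, 13.0], index=dates)
jan_2 = by_date.loc[pd.Timestamp("2014-01-02")]

# Label slices include both end points.
first_two_days = by_date.loc["2014-01-01":"2014-01-02"]

# String labels.
by_name = pd.Series([16, 33], index=["jon", "ned"])
ned_age = by_name.loc["ned"]
```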

## DATAFRAME

### Python builtin aggregation with zip() and unzip with zip(*zipped_obj)

- zip() zips multiple sequence objects
- in conjunction with list(), it creates a list of tuple records
- zip() in conjunction with the * operator can be used to unzip a list
- each sequence object can be viewed as a column of tabular data
- each record (tuple) will be a row
- this tabular data of rows and columns will be used to create a DataFrame object

In [1]:
fnames = ['jon', 'ned']
lnames = ['snow', 'stark' ]
ages = [16, 33]

zipped = list(zip(fnames, lnames, ages))

In [2]:
zipped

[('jon', 'snow', 16), ('ned', 'stark', 33)]

In [3]:
# Unzip

a, b, c = zip(*zipped)

In [4]:
a

('jon', 'ned')

In [5]:
b

('snow', 'stark')

In [6]:
c

(16, 33)

In [7]:
# Zipping tuples is ok too

a = (1, 2)
b = (3, 4)
list(zip(a,b))

[(1, 3), (2, 4)]

### DF from a list of tuples/records

In [11]:
row_labels = pd.Index(['ID'+ str(i) for i in range(1,3)])

In [12]:
row_labels

Index(['ID1', 'ID2'], dtype='object')

In [13]:
column_labels = pd.Index(['f','l','a'])

In [14]:
column_labels

Index(['f', 'l', 'a'], dtype='object')

In [15]:
df = pd.DataFrame(data=zipped, index=row_labels, columns=column_labels)

In [16]:
df

Unnamed: 0,f,l,a
ID1,jon,snow,16
ID2,ned,stark,33


In [17]:
type(df)

pandas.core.frame.DataFrame

In [18]:
type(df.columns)

pandas.core.indexes.base.Index

In [19]:
type(df.index)

pandas.core.indexes.base.Index

### DF with default index and columns

In [20]:
df = pd.DataFrame(data = zipped)

In [21]:
df

Unnamed: 0,0,1,2
0,jon,snow,16
1,ned,stark,33


In [22]:
type(df.columns)

pandas.core.indexes.range.RangeIndex

In [23]:
df.columns

RangeIndex(start=0, stop=3, step=1)

## Pandas ADT: Attributes and Core Methods

Most of them apply to both the series and data frames

To examine attributes of DF, series
- .shape : number of rows and columns
- .axes : the axes of data frame
    - a list that contains both the index and columns
- .index : index of df
- .columns : columns of df
- .name : name of the series
    - Does not apply to DF
    
To examine some info about the df/series
- .head(n) returns the first n rows (5 by default)
- .tail(n) returns the last n rows (5 by default)
- .info() for basic info about the df
    - summary of the types and columns
    - does not apply to series
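
The DataFrame side of these attributes is shown in the cells that follow. As a complement, here is a minimal sketch of how the same attributes behave on a Series compared with a one-column DataFrame; the values and the name `Units` are sample data.

```python
import pandas as pd

units = pd.Series([5.0, 2.0, 3.0], name="Units")  # sample values
frame = units.to_frame()                          # one-column DataFrame

series_shape = units.shape        # a 1-tuple: rows only
frame_shape = frame.shape         # rows and columns
n_series_axes = len(units.axes)   # a Series has only the index
n_frame_axes = len(frame.axes)    # a DataFrame has index and columns

series_name = units.name
frame_has_name = hasattr(frame, "name")  # .name is a Series attribute

first_two = units.head(2)
last_one = units.tail(1)
```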

In [26]:
sales_df = pd.read_csv('data_raw/sales.csv')

In [27]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
3,789,1.0,2.0,1/1/2014
4,789,2.0,3.8,1/2/2014
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [28]:
sales_df.shape

(7, 4)

In [30]:
sales_df.axes

[RangeIndex(start=0, stop=7, step=1),
 Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')]

In [31]:
sales_df.axes[0]

RangeIndex(start=0, stop=7, step=1)

In [32]:
sales_df.axes[1]

Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')

In [33]:
sales_df.columns

Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')

In [34]:
sales_df.index

RangeIndex(start=0, stop=7, step=1)

In [35]:
sales_df.head(3)

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014


In [37]:
sales_df.tail(2)

Unnamed: 0,UPS,Units,Sales,Date
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [38]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
UPS      7 non-null int64
Units    6 non-null float64
Sales    6 non-null float64
Date     7 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 304.0+ bytes


In [39]:
sales_df['Date'] = pd.to_datetime(sales_df['Date'], infer_datetime_format=True)

In [40]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
UPS      7 non-null int64
Units    6 non-null float64
Sales    6 non-null float64
Date     7 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 304.0 bytes


In [41]:
sales_df.head()

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,2014-01-01
1,1234,2.0,8.0,2014-01-02
2,1234,3.0,13.0,2014-01-03
3,789,1.0,2.0,2014-01-01
4,789,2.0,3.8,2014-01-02
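
The conversion above relies on format inference (pandas 2.x deprecates the `infer_datetime_format` argument). A hedged alternative is to pass an explicit format string. The sketch below assumes the dates are month/day/year, which is a guess from the sample values; the strings are copied from the Date column shown earlier.

```python
import pandas as pd

# Sample values in the same style as the Date column above.
raw_dates = pd.Series(["1/1/2014", "1/2/2014", "1/5/2014"])

# Explicit format: month/day/year (an assumption about the data).
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")

# Once parsed, the .dt accessor exposes date components.
days = parsed.dt.day.tolist()
months = parsed.dt.month.tolist()
```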


# INDEXING & SLICING

- Each Pandas object (Series, DataFrame, Panel) contains, besides its data, an ordered sequence of Index objects describing its axes.

- Each Index object is an ordered sequence of hashable, not necessarily unique objects (e.g., strings, dates, numbers):
    - the default index is a RangeIndex, the equivalent of np.arange(n)
    
- Therefore, two kinds of indexing and slicing are supported along each axis:
    - integer-based or position-based due to the ordered nature of the Index
        - .iloc[...] or .iat[...]
    - label-based due to the arbitrary hashable objects as Index elements
        - .loc[...] or .at[...]
        
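- Return Object:
    - Scalar value: with .iat[...] and .at[...] (in the pandas version used here, .at returns a NumPy array when the label is duplicated)
    - Pandas object, or a scalar when a single cell is addressed: with .iloc[...] and .loc[...]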
    
- What can be passed to index operation for .iloc
    - a single integer as an index location
    - range of integers for index locations: start:end (half-open interval, end excluded)
    - a list of index locations
    - a Boolean mask
    
- What can be passed to index operation for .loc
    - a single label as an index location like a dictionary-based indexing
    - boolean arrays (as masks)
    - slices: label_start:label_end (closed interval)
    - a list of labels
    
- Chaining Index Operations
    - If different axes need to be sliced by different methods (position-based vs. label-based), then index operations should be chained:
        - pd_obj.iloc[2:4].loc[ :, ['Name']] or
        - pd_obj.loc[:, ['Name']].iloc[-4:]
    - Errors are raised if
        - there is no row or column at that position (IndexError, with .iloc)
        - the label is not in the index (KeyError, with .loc)
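
The rules above can be condensed into one self-contained sketch. The frame is hypothetical (the last two rows are invented), and it only illustrates the half-open vs. closed slice behaviour, scalar access, and the two error types.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame(
    {"Name": ["jon", "ned", "arya", "sansa"], "Age": [16, 33, 11, 13]},
    index=["a", "b", "c", "d"],
)

by_position = df.iloc[1:3]       # positions 1 and 2; the end is excluded
by_label = df.loc["b":"c"]       # labels 'b' through 'c'; the end is included

scalar_pos = df.iat[0, 1]        # single value by position
scalar_lab = df.at["a", "Age"]   # single value by label

# A missing position raises IndexError; a missing label raises KeyError.
try:
    df.iloc[10]
except IndexError as exc:
    position_error = type(exc).__name__

try:
    df.loc["z"]
except KeyError as exc:
    label_error = type(exc).__name__
```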

In [44]:
sales_df = pd.read_csv('data_raw/sales.csv')

In [45]:
sales_df.index

RangeIndex(start=0, stop=7, step=1)

In [46]:
sales_df.index = pd.Index(['a','b','c','d','e','f','g'])

In [47]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
f,789,,,1/3/2014
g,789,1.0,1.8,1/5/2014


## index-based Indexing and Slicing with .iloc and .iat

What can be passed to index operation for .iloc:

- a single integer as an index location
- range of integers for index locations: start:end (half-open interval, end excluded)
- a list of index locations
- a Boolean mask

In [48]:
sales_df.iloc[2]

UPS          1234
Units           3
Sales          13
Date     1/3/2014
Name: c, dtype: object

In [49]:
sales_df.iloc[2:4]

Unnamed: 0,UPS,Units,Sales,Date
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014


In [50]:
sales_df[2:4]

Unnamed: 0,UPS,Units,Sales,Date
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014


In [101]:
sales_df.iloc[[1,2,4]]

Unnamed: 0,UPS,Units,Sales,Date
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
e,789,2.0,3.8,1/2/2014


### Boolean mask

In [57]:
boolean_mask = [True,False,True,True,True,False,True]

In [58]:
sales_df.iloc[boolean_mask]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
g,789,1.0,1.8,1/5/2014


In [59]:
sales_df[2:4, 1:3]

TypeError: unhashable type: 'slice'

In [60]:
sales_df.iloc[2:4, 1:3]

Unnamed: 0,Units,Sales
c,3.0,13.0
d,1.0,2.0


In [61]:
sales_df.iat[4,2]

3.8

In [62]:
sales_df.iat[2:4, 1:3]

ValueError: iAt based indexing can only have integer indexers

### Position Index for Integer Index Labels and Default Integer Index
- If the index is explicitly defined using integer labels, then position-based indexing without .iloc raises a KeyError, because the integer is interpreted as a label
    - Fix: use .iloc for positions or .loc for labels
- If the index is a default index (i.e. of type RangeIndex), then position-based indexing without .iloc works fine, because positions and labels coincide

In [63]:
data = [1,2,3,4,5]
index = [200,201,202,203,204]
z = pd.Series(data=data, index =index)

In [64]:
z

200    1
201    2
202    3
203    4
204    5
dtype: int64

In [65]:
z[0]

KeyError: 0

In [69]:
# KeyError: index-based access without `.iloc` does not work
# must use `.iloc` or `.loc`

z.iloc[0]

1

In [68]:
z.loc[200]

1

## Label-based Indexing and Slicing with .loc and .at
- What can be passed to index operation for .loc
    - a single label as an index location like a dictionary-based indexing
    - slices: label_start:label_end (closed interval)
    - a list of labels
    - boolean arrays (as masks)

In [71]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
f,789,,,1/3/2014
g,789,1.0,1.8,1/5/2014


In [72]:
sales_df.loc['d']

UPS           789
Units           1
Sales           2
Date     1/1/2014
Name: d, dtype: object

In [74]:
# Slices are closed intervals

sales_df.loc['d':'e','Units':'Date']

Unnamed: 0,Units,Sales,Date
d,1.0,2.0,1/1/2014
e,2.0,3.8,1/2/2014


In [75]:
sales_df.loc[['a','e','g']]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
e,789,2.0,3.8,1/2/2014
g,789,1.0,1.8,1/5/2014


In [76]:
bool_mask = [True, False, True, 
             False, False, False, True]
sales_df.loc[bool_mask]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
c,1234,3.0,13.0,1/3/2014
g,789,1.0,1.8,1/5/2014


In [78]:
sales_df.at['d','Sales']

2.0

In [79]:
sales_df.at['d']

TypeError: _get_value() missing 1 required positional argument: 'col'

In [80]:
s = pd.Series ([10, 7, 1, 22], 
              index = ['1968', '1969', '1970', '1970'])

In [81]:
s[0]

IndexError: 0

In [82]:
s.at['1970']

array([ 1, 22])

In [83]:
s.loc['1970']

1970     1
1970    22
dtype: int64

In [84]:
s.at[ ['1968','1970']]

ValueError: Invalid call for scalar access (getting)!

In [85]:
s.loc[ ['1968','1970']]

1968    10
1970     1
1970    22
dtype: int64

In [86]:
s.at[ '1968':'1970']

InvalidIndexError: slice('1968', '1970', None)

In [87]:
s.loc[ '1968':'1970']

1968    10
1969     7
1970     1
1970    22
dtype: int64

In [88]:
s.loc['1970']

1970     1
1970    22
dtype: int64

In [89]:
s.loc['1972']

KeyError: 'the label [1972] is not in the [index]'

In [90]:
s.loc[ ['1970', '1972'] ]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


1970     1.0
1970    22.0
1972     NaN
dtype: float64
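
The warning above points to `.reindex()` as the replacement, and later pandas releases do raise a KeyError for this lookup. Two possible ways to handle a list that may contain missing labels are sketched below. Note that `.reindex()` needs unique labels, so the sketch drops the duplicate `'1970'` entry first; that step is an assumption made for the example, not something the warning prescribes.

```python
import pandas as pd

s = pd.Series([10, 7, 1, 22], index=["1968", "1969", "1970", "1970"])
wanted = ["1970", "1972"]

# Option 1: keep only the labels that exist (works with duplicate labels).
present = s[s.index.isin(wanted)]

# Option 2: reindex inserts NaN for missing labels but requires a unique
# index, so keep only the first occurrence of each label beforehand.
unique_s = s[~s.index.duplicated(keep="first")]
filled = unique_s.reindex(wanted)
```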

### Indexing with Non-unique Index
- If the index labels are not unique, then the index operation returns a sub-series or sub-df rather than a scalar value

In [91]:
s = pd.Series ([10, 7, 1, 22], 
              index = ['1968', '1969', '1970', '1970'])

In [92]:
s.loc['1970']

1970     1
1970    22
dtype: int64

### Chaining Index Operations
If different axes need to be sliced by different methods (position-based vs. label-based), then index operations should be chained.

In [94]:
sales_df.loc[ :, ['UPS', 'Sales']].iloc[-4:]

Unnamed: 0,UPS,Sales
d,789,2.0
e,789,3.8
f,789,
g,789,1.8
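
Chaining is not the only option. One alternative, sketched here under the assumption that a single indexer call is preferred, is to translate labels to positions (or positions to labels) first. `Index.get_indexer` is an existing pandas method; the small frame is sample data in the style of `sales_df`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "UPS": [1234, 1234, 789, 789],
        "Units": [5.0, 2.0, 1.0, 2.0],
        "Sales": [20.2, 8.0, 2.0, 3.8],
    },
    index=["a", "b", "c", "d"],
)

# Chained form, as in the cell above: labels for columns, positions for rows.
chained = df.loc[:, ["UPS", "Sales"]].iloc[-2:]

# Single .iloc call: convert the column labels to positions first.
col_positions = df.columns.get_indexer(["UPS", "Sales"])
single_iloc = df.iloc[-2:, col_positions]

# Single .loc call: convert the row positions to labels first.
single_loc = df.loc[df.index[-2:], ["UPS", "Sales"]]
```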


### Index Attribute and Operations
.index.is_unique: Boolean attribute, True when all index labels are unique and False otherwise

In [95]:
s.index.is_unique

False

In [96]:
sales_df.index.is_unique

True

In [98]:
sales_df.iloc?

In [None]:
s