# Pandas
- Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive.
- It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python.
- Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language.

## What the Pandas library has to offer
1. High performance, easy to use data structures/Data model
    - 1 D Index: object
    - 1 D Series: Column
    - 2 D DataFrame: Table of Rows and Columns(Sheet)
    - 3 D Panel: Multiple Sheets (deprecated in pandas 0.20 and removed in 0.25)
2. Functions and methods for
    - Data reads and writes: CSV, JSON, HTML, MS Excel, HDF5, SQL, ASCII, etc.
    - Data wrangling: joins, aggregation, filtering, etc.
    - Data analysis and visualization
3. Implementation perspective
    - Built on top of NumPy, with performance-critical parts written in Cython
    - Less memory overhead
    - Acts like an in-memory NoSQL database
    - Quicker than pure Python but slower than NumPy
    - Vectorized operations

## Pandas Resources

In [1]:
import pandas as pd

In [2]:
pd.__version__

'0.23.4'

In [None]:
pd.<TAB>

In [4]:
pd?

In [5]:
pd.read_csv?

## Pandas Data structure and Model

- #### 1D - Index object
    - Type: Index
    - Immutable 1D ndarray
    - ordered
    - sliceable
    - Only hashable objects
    - Stores(for all Pandas objects)
        - Axis Labels
        - Row/Column names
        
- #### 1D - Series object
    - Type: Series
    - 1D ndarray with axis labels as pd.Index objects
    - Labels
        - Not necessarily unique
        - Must be hashable
    - Indexing
        - Integer based
        - Label based
    - Missing data
        - NaN
        
- #### 2D - Data frame
    - Type: DataFrame
    - Mutable
    - Potentially heterogeneous data
    - dict of series objects structured with labelled axes
    
- #### 3D - Panel
    - Type: Panel
    - 3D ndarray/ Dict of Data frames
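
The points above can be sketched in a few lines. This is only an illustration: the names `labels`, `ages`, and `people` and their values are made up and do not come from the notebook. It shows that an Index rejects item assignment, that a Series carries an Index, and that a DataFrame behaves like a dict of Series sharing one index.

```python
import pandas as pd

# An Index is immutable: item assignment raises TypeError.
labels = pd.Index(['ID1', 'ID2'])
try:
    labels[0] = 'ID9'
    index_is_mutable = True
except TypeError:
    index_is_mutable = False

# A Series is a 1D array of values plus an Index of labels.
ages = pd.Series([16, 33], index=labels, name='age')

# A DataFrame can be viewed as a dict of Series that share the same index;
# each column may hold a different dtype.
people = pd.DataFrame({'name': pd.Series(['jon', 'ned'], index=labels),
                       'age': ages})

print(index_is_mutable)
print(type(people['age']))   # selecting one column gives back a Series
```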

## Index

In [6]:
pd.Index?

In [7]:
col_labels = pd.Index(['name','age','salary'])

In [8]:
print(col_labels)

Index(['name', 'age', 'salary'], dtype='object')


In [9]:
type(col_labels)

pandas.core.indexes.base.Index

In [11]:
row_labels = pd.Index(["Id"+str(i) for i in range(10)])

In [12]:
row_labels

Index(['Id0', 'Id1', 'Id2', 'Id3', 'Id4', 'Id5', 'Id6', 'Id7', 'Id8', 'Id9'], dtype='object')

## Series

In [13]:
pd.Series?

In [25]:
row_labels = pd.Index(["ID"+str(i) for i in range(1, 3)])

In [17]:
row_labels

Index(['ID1', 'ID2'], dtype='object')

In [18]:
lname = pd.Series(data=['x','y'],
                 name = "Last Name")

In [19]:
lname

0    x
1    y
Name: Last Name, dtype: object

In [20]:
lname.index

RangeIndex(start=0, stop=2, step=1)

In [26]:
lname = pd.Series(data=['x','y'],name = "Last Name", index=row_labels)

In [22]:
lname

ID1    x
ID2    y
Name: Last Name, dtype: object

In [27]:
lname.index

Index(['ID1', 'ID2'], dtype='object')

In [28]:
type(lname.index)

pandas.core.indexes.base.Index

### NOTE: Label based indexing supports:
- Dates
- Strings
- Hashable objects
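
As a small sketch of date labels (the Series `units` and its values are hypothetical, chosen only for illustration), a Series indexed by dates can be queried with date strings, and a label slice includes both endpoints:

```python
import pandas as pd

# Hypothetical daily values, labelled by date instead of by position.
dates = pd.date_range('2014-01-01', periods=5, freq='D')
units = pd.Series([5, 2, 3, 1, 2], index=dates)

# Label-based lookup with a date string.
one_day = units.loc['2014-01-03']

# Label-based slices include both endpoints.
window = units.loc['2014-01-02':'2014-01-04']

print(one_day)
print(window)
```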

## DATAFRAME

### Python built-in aggregation with zip() and unzipping with zip(*zipped_obj)

- zip() zips multiple sequence objects
- In conjunction with list(), it creates a list of tuple records
- zip() in conjunction with the * operator can be used to unzip a list
- Each sequence object can be viewed as a column of tabular data
- Each record (tuple) will be a row
- This tabular data of rows and columns will be used to create a DataFrame object

In [1]:
fnames = ['jon', 'ned']
lnames = ['snow', 'stark' ]
ages = [16, 33]

zipped = list(zip(fnames, lnames, ages))

In [2]:
zipped

[('jon', 'snow', 16), ('ned', 'stark', 33)]

In [3]:
# Unzip

a, b, c = zip(*zipped)

In [4]:
a

('jon', 'ned')

In [5]:
b

('snow', 'stark')

In [6]:
c

(16, 33)

In [7]:
# Zipping tuples is ok too

a = (1, 2)
b = (3, 4)
list(zip(a,b))

[(1, 3), (2, 4)]

### DF from  a list of tuples/records

In [11]:
row_labels = pd.Index(['ID'+ str(i) for i in range(1,3)])

In [12]:
row_labels

Index(['ID1', 'ID2'], dtype='object')

In [13]:
column_labels = pd.Index(['f','l','a'])

In [14]:
column_labels

Index(['f', 'l', 'a'], dtype='object')

In [15]:
df = pd.DataFrame(data=zipped, index=row_labels, columns=column_labels)

In [16]:
df

Unnamed: 0,f,l,a
ID1,jon,snow,16
ID2,ned,stark,33


In [17]:
type(df)

pandas.core.frame.DataFrame

In [18]:
type(df.columns)

pandas.core.indexes.base.Index

In [19]:
type(df.index)

pandas.core.indexes.base.Index

### DF with default index and columns

In [20]:
df = pd.DataFrame(data = zipped)

In [21]:
df

Unnamed: 0,0,1,2
0,jon,snow,16
1,ned,stark,33


In [22]:
type(df.columns)

pandas.core.indexes.range.RangeIndex

In [23]:
df.columns

RangeIndex(start=0, stop=3, step=1)

## Pandas ADT: Attributes and Core Methods

Most of them apply to both the series and data frames

To examine attributes of DF, series
- .shape : number of rows and columns
- .axes : the axes of data frame
    - a list that contains both the index and columns
- .index : index of df
- .columns : columns of df
- .name : name of the series
    - Does not apply to DF
    
To examine some info about the df/series
- .head(n) returns the first n rows (5 by default)
- .tail(n) returns the last n rows (5 by default)
- .info() for basic info about the df
    - summary of the types and columns
    - does not apply to series

In [26]:
sales_df = pd.read_csv('data_raw/sales.csv')

In [27]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
3,789,1.0,2.0,1/1/2014
4,789,2.0,3.8,1/2/2014
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [28]:
sales_df.shape

(7, 4)

In [30]:
sales_df.axes

[RangeIndex(start=0, stop=7, step=1),
 Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')]

In [31]:
sales_df.axes[0]

RangeIndex(start=0, stop=7, step=1)

In [32]:
sales_df.axes[1]

Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')

In [33]:
sales_df.columns

Index(['UPS', 'Units', 'Sales', 'Date'], dtype='object')

In [34]:
sales_df.index

RangeIndex(start=0, stop=7, step=1)

In [35]:
sales_df.head(3)

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014


In [37]:
sales_df.tail(2)

Unnamed: 0,UPS,Units,Sales,Date
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [38]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
UPS      7 non-null int64
Units    6 non-null float64
Sales    6 non-null float64
Date     7 non-null object
dtypes: float64(2), int64(1), object(1)
memory usage: 304.0+ bytes


In [39]:
sales_df['Date'] = pd.to_datetime(sales_df['Date'], infer_datetime_format=True)

In [40]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
UPS      7 non-null int64
Units    6 non-null float64
Sales    6 non-null float64
Date     7 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(1)
memory usage: 304.0 bytes


In [41]:
sales_df.head()

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,2014-01-01
1,1234,2.0,8.0,2014-01-02
2,1234,3.0,13.0,2014-01-03
3,789,1.0,2.0,2014-01-01
4,789,2.0,3.8,2014-01-02
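
The date conversion can also be written with an explicit format string instead of format inference. The sketch below is self-contained and uses a small stand-in frame (`raw` is a hypothetical name); it assumes the dates are month/day/year, as the values in sales.csv suggest.

```python
import pandas as pd

# A small stand-in for the Date column of sales.csv (month/day/year strings).
raw = pd.DataFrame({'Date': ['1/1/2014', '1/2/2014', '1/3/2014']})

# An explicit format avoids any ambiguity between month-first and day-first.
raw['Date'] = pd.to_datetime(raw['Date'], format='%m/%d/%Y')

print(raw['Date'].dtype)
print(raw['Date'].dt.day.tolist())
```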


# INDEXING & SLICING

- Each Pandas object (Series, DataFrame, Panel) contains, besides its data, an ordered sequence of Index objects describing its axes.

- Each Index object is an ordered sequence of hashable, not necessarily unique objects (e.g., strings, dates, numbers):
    - the default index is a RangeIndex, equivalent to np.arange(n)
    
- Therefore, two kinds of indexing and slicing are supported along each axis:
    - integer-based or position-based due to the ordered nature of the Index
        - .iloc[...] or .iat[...]
    - label-based due to the arbitrary hashable objects as Index elements
        - .loc[...] or .at[...]
        
- Return Object:
    - Scalar value: with .iat[...] and .at[...] (in this pandas version, .at returns a NumPy array for a duplicated label)
    - Pandas Object: with .iloc[...] and .loc[...]
    
- What can be passed to index operation for .iloc
    - a single integer as an index location
    - range of integers for index locations: start:end (half-open interval, end excluded)
    - a list of index locations
    - a Boolean mask
    
- What can be passed to index operation for .loc
    - a single label, as in dictionary-style lookup
    - boolean arrays (as masks)
    - slices: label_start:label_end (closed interval, both ends included)
    - a list of labels
    
- Chaining Index Operations
    - If different axes need to be sliced by different methods (index-based vs. label-based), then index operations should be chained:
        - pd_obj.iloc[2:4].loc[ :, ['Name']] or
        - pd_obj.loc[:, ['Name']].iloc[-4:]
    - Errors are raised if
        - there is no item at that position (IndexError) or no such label (KeyError)
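
The difference between the two families of indexers can be sketched on a toy frame. The frame `df`, its labels, and its values are made up for illustration and are not the sales data used in the notebook.

```python
import pandas as pd

# Illustrative frame with string row labels (values are made up).
df = pd.DataFrame({'Name': ['a1', 'b1', 'c1', 'd1', 'e1'],
                   'Score': [10, 20, 30, 40, 50]},
                  index=['a', 'b', 'c', 'd', 'e'])

by_position = df.iloc[1:3]      # rows at positions 1 and 2 (end excluded)
by_label = df.loc['b':'d']      # rows 'b', 'c' and 'd' (end included)

# Scalar access by position and by label.
cell_by_position = df.iat[0, 1]
cell_by_label = df.at['a', 'Score']

# Mixing positions and labels: chain the two indexers.
chained = df.iloc[2:4].loc[:, ['Name']]

print(by_position)
print(by_label)
print(chained)
```

Note that the same-looking slice selects two rows with `.iloc` and three rows with `.loc`, because only the label-based slice includes its end point.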

In [44]:
sales_df = pd.read_csv('data_raw/sales.csv')

In [45]:
sales_df.index

RangeIndex(start=0, stop=7, step=1)

In [46]:
sales_df.index = pd.Index(['a','b','c','d','e','f','g'])

In [47]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
f,789,,,1/3/2014
g,789,1.0,1.8,1/5/2014


## Position-based Indexing and Slicing with .iloc and .iat

What can be passed to index operation for .iloc:

- a single integer as an index location
- range of integers for index locations: start:end (half-open interval, end excluded)
- a list of index locations
- a Boolean mask

In [48]:
sales_df.iloc[2]

UPS          1234
Units           3
Sales          13
Date     1/3/2014
Name: c, dtype: object

In [49]:
sales_df.iloc[2:4]

Unnamed: 0,UPS,Units,Sales,Date
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014


In [50]:
sales_df[2:4]

Unnamed: 0,UPS,Units,Sales,Date
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014


In [101]:
sales_df.iloc[[1,2,4]]

Unnamed: 0,UPS,Units,Sales,Date
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
e,789,2.0,3.8,1/2/2014


### Boolean mask

In [57]:
boolean_mask = [True,False,True,True,True,False,True]

In [58]:
sales_df.iloc[boolean_mask]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
g,789,1.0,1.8,1/5/2014


In [59]:
sales_df[2:4, 1:3]

TypeError: unhashable type: 'slice'

In [60]:
sales_df.iloc[2:4, 1:3]

Unnamed: 0,Units,Sales
c,3.0,13.0
d,1.0,2.0


In [61]:
sales_df.iat[4,2]

3.8

In [62]:
sales_df.iat[2:4, 1:3]

ValueError: iAt based indexing can only have integer indexers

### Position Index for Integer Index Labels and Default Integer Index
- If the index is explicitly defined using integer labels, then position-based indexing without .iloc raises a KeyError
    - Fix: use .iloc or .loc
- If the index is a default index (i.e. of type RangeIndex), then position-based indexing without using .iloc works fine

In [63]:
data = [1,2,3,4,5]
index = [200,201,202,203,204]
z = pd.Series(data=data, index =index)

In [64]:
z

200    1
201    2
202    3
203    4
204    5
dtype: int64

In [65]:
z[0]

KeyError: 0

In [69]:
# KeyError: index-based access without `.iloc` does not work
# must use `.iloc` or `.loc`

z.iloc[0]

1

In [68]:
z.loc[200]

1

## Label-based Indexing and Slicing with .loc and .at
- What can be passed to index operation for .loc
    - a single label, as in dictionary-style lookup
    - slices: label_start:label_end (closed interval)
    - a list of labels
    - boolean arrays (as masks)

In [71]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
b,1234,2.0,8.0,1/2/2014
c,1234,3.0,13.0,1/3/2014
d,789,1.0,2.0,1/1/2014
e,789,2.0,3.8,1/2/2014
f,789,,,1/3/2014
g,789,1.0,1.8,1/5/2014


In [72]:
sales_df.loc['d']

UPS           789
Units           1
Sales           2
Date     1/1/2014
Name: d, dtype: object

In [74]:
# Slices are closed intervals

sales_df.loc['d':'e','Units':'Date']

Unnamed: 0,Units,Sales,Date
d,1.0,2.0,1/1/2014
e,2.0,3.8,1/2/2014


In [75]:
sales_df.loc[['a','e','g']]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
e,789,2.0,3.8,1/2/2014
g,789,1.0,1.8,1/5/2014


In [76]:
bool_mask = [True, False, True, 
             False, False, False, True]
sales_df.loc[bool_mask]

Unnamed: 0,UPS,Units,Sales,Date
a,1234,5.0,20.2,1/1/2014
c,1234,3.0,13.0,1/3/2014
g,789,1.0,1.8,1/5/2014


In [78]:
sales_df.at['d','Sales']

2.0

In [79]:
sales_df.at['d']

TypeError: _get_value() missing 1 required positional argument: 'col'

In [80]:
s = pd.Series ([10, 7, 1, 22], 
              index = ['1968', '1969', '1970', '1970'])

In [81]:
s[0]

IndexError: 0

In [82]:
s.at['1970']

array([ 1, 22])

In [83]:
s.loc['1970']

1970     1
1970    22
dtype: int64

In [84]:
s.at[ ['1968','1970']]

ValueError: Invalid call for scalar access (getting)!

In [85]:
s.loc[ ['1968','1970']]

1968    10
1970     1
1970    22
dtype: int64

In [86]:
s.at[ '1968':'1970']

InvalidIndexError: slice('1968', '1970', None)

In [87]:
s.loc[ '1968':'1970']

1968    10
1969     7
1970     1
1970    22
dtype: int64

In [88]:
s.loc['1970']

1970     1
1970    22
dtype: int64

In [89]:
s.loc['1972']

KeyError: 'the label [1972] is not in the [index]'

In [90]:
s.loc[ ['1970', '1972'] ]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


1970     1.0
1970    22.0
1972     NaN
dtype: float64

### Indexing with Non-unique Index
- If the index labels are not unique, then the index operation returns a sub-series or sub-df rather than a scalar value

In [91]:
s = pd.Series ([10, 7, 1, 22], 
              index = ['1968', '1969', '1970', '1970'])

In [92]:
s.loc['1970']

1970     1
1970    22
dtype: int64

### Chaining Index Operations
If different axes need to be sliced by different methods (position-based vs. label-based), then index operations should be chained.

In [94]:
sales_df.loc[ :, ['UPS', 'Sales']].iloc[-4:]

Unnamed: 0,UPS,Sales
d,789,2.0
e,789,3.8
f,789,
g,789,1.8


### Index Attribute and Operations
.index.is_unique: Boolean attribute that is True only when every index label is unique

In [95]:
s.index.is_unique

False

In [96]:
sales_df.index.is_unique

True

In [98]:
sales_df.iloc?

In [None]:
s

# PANDAS CRUD

## CREATION
#### - Series:
    - list, tuple, or array-like ADT
    - NumPy ndarray
    - dictionary
    - scalar value
    
#### - DataFrame:
    - numpy ndarray (structured or homogeneous)
    - dict of Series, arrays, lists, tuples, or list-like objects, constants
    - keys are unique in a dict, so only unique indices are possible when creating from a dict
    - Pandas ADTs in general allow non-unique indices
    - DataFrame
    - CSV or JSON file: pd.read_csv(), pd.read_json()
    - from a serialized object, such as a serialized dict: .from_dict(serialized_dict)
    - Other: SQL, HDF5, XML, etc
    
#### - Note: Dictionary-based creation
    - Keys are unique in dicts
    - Creating Pandas ADTs from dicts will force indices to be unique
    - However, indices do not have to be unique
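
A short sketch of this note, with made-up labels and values: a dict literal silently keeps only the last value for a repeated key, while passing values and labels separately preserves duplicates. The names `from_dict` and `with_duplicates` are illustrative only.

```python
import pandas as pd

# A dict literal keeps only the last value for a repeated key...
from_dict = pd.Series({'2007': 'first', '2014': 'second', '2014': 'third'})

# ...whereas passing values and labels separately keeps every entry.
with_duplicates = pd.Series(['first', 'second', 'third'],
                            index=['2007', '2014', '2014'])

print(from_dict.index.is_unique)
print(with_duplicates.index.is_unique)
print(with_duplicates.loc['2014'])
```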

## Creation - Series
- list, tuple, or array-like ADT
- NumPy ndarray
- dictionary
- scalar value

In [2]:
# LIST

pd.Series(['bilbo','frodo'])

0    bilbo
1    frodo
dtype: object

In [5]:
# TUPLE

pd.Series(('bilbo', 'tuple'))

0    bilbo
1    tuple
dtype: object

In [7]:
# ndarray

import numpy as np
pd.Series(np.random.random(3))

0    0.166197
1    0.039394
2    0.759404
dtype: float64

In [12]:
# Dictionary

a = pd.Series({'2007':'t20 worldcup', '2014':'ODI world cup', '2014':'odi wc'})

In [13]:
a

2007    t20 worldcup
2014          odi wc
dtype: object

In [14]:
a.index

Index(['2007', '2014'], dtype='object')

In [15]:
# Possible fix for non unique index

a = pd.Series({'2007':'t20 worldcup', '2014':['ODI world cup', 'odi wc']})

In [16]:
a

2007               t20 worldcup
2014    [ODI world cup, odi wc]
dtype: object

## Creation: DataFrame
- columns: dicts of lists, dicts of Series, dicts of arrays
- rows: lists of dicts
- CSV file: pd.read_csv()
- from a serialized object, such as a serialized dict: .from_dict(serialized_dict)
- and more: SQL, HDF5, JSON, XML, etc.

In [17]:
# List of dicts

pd.DataFrame([{'name':'jon','age':20},{'name':'ned','age':40}])

Unnamed: 0,age,name
0,20,jon
1,40,ned


In [20]:
# Dict of lists

pd.DataFrame({'name':['jon','ned'], 'age':[20, 40]})

Unnamed: 0,name,age
0,jon,20
1,ned,40


In [21]:
# Dict of Tuples

pd.DataFrame({'name':('jon','ned'), 'age':(20, 40)})

Unnamed: 0,name,age
0,jon,20
1,ned,40


In [23]:
# Dict of Series
s1 = pd.Series(['jon','ned'])
s2 = pd.Series([20,40])

pd.DataFrame({'name':s1, 'age':s2})

Unnamed: 0,name,age
0,jon,20
1,ned,40


In [24]:
# dict of arrays

pd.DataFrame({'v1': np.random.random(5),
              'v2': np.random.random(5)})

Unnamed: 0,v1,v2
0,0.80528,0.756964
1,0.235036,0.068562
2,0.470031,0.897905
3,0.651277,0.957129
4,0.673471,0.425219


In [25]:
# Numpy array

pd.DataFrame(np.random.randn(5,3),
            columns = ['v1','v2','v3'])

Unnamed: 0,v1,v2,v3
0,1.069996,0.921928,-1.434164
1,1.649852,0.362521,-0.192048
2,0.39126,0.63797,0.226528
3,-1.067709,0.013022,0.385158
4,0.245901,-0.40699,0.280505


In [27]:
# From a serialized dict

sales_df = pd.read_csv('data_raw/sales.csv')

In [28]:
serialized_dict = sales_df.to_dict()

In [30]:
serialized_dict

{'UPS': {0: 1234, 1: 1234, 2: 1234, 3: 789, 4: 789, 5: 789, 6: 789},
 'Units': {0: 5.0, 1: 2.0, 2: 3.0, 3: 1.0, 4: 2.0, 5: nan, 6: 1.0},
 'Sales': {0: 20.2, 1: 8.0, 2: 13.0, 3: 2.0, 4: 3.8, 5: nan, 6: 1.8},
 'Date': {0: '1/1/2014',
  1: '1/2/2014',
  2: '1/3/2014',
  3: '1/1/2014',
  4: '1/2/2014',
  5: '1/3/2014',
  6: '1/5/2014'}}

In [31]:
pd.DataFrame.from_dict(serialized_dict)

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
3,789,1.0,2.0,1/1/2014
4,789,2.0,3.8,1/2/2014
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [32]:
# From a serialized json object

serialized_json = sales_df.to_json()

In [33]:
serialized_json

'{"UPS":{"0":1234,"1":1234,"2":1234,"3":789,"4":789,"5":789,"6":789},"Units":{"0":5.0,"1":2.0,"2":3.0,"3":1.0,"4":2.0,"5":null,"6":1.0},"Sales":{"0":20.2,"1":8.0,"2":13.0,"3":2.0,"4":3.8,"5":null,"6":1.8},"Date":{"0":"1\\/1\\/2014","1":"1\\/2\\/2014","2":"1\\/3\\/2014","3":"1\\/1\\/2014","4":"1\\/2\\/2014","5":"1\\/3\\/2014","6":"1\\/5\\/2014"}}'

In [34]:
pd.read_json(serialized_json)

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,2014-01-01
1,1234,2.0,8.0,2014-01-02
2,1234,3.0,13.0,2014-01-03
3,789,1.0,2.0,2014-01-01
4,789,2.0,3.8,2014-01-02
5,789,,,2014-01-03
6,789,1.0,1.8,2014-01-05


# READ

### - Read:
   - Indexing and Slicing (see another notebook)
   - Getting Values
        - .get(label, [default]):
            - Returns a scalar (or Series if duplicate indexes) for label
            - Returns default on failed lookup
        - .get_value(label) (deprecated since pandas 0.21 in favor of .at[label]):
             - Returns a scalar (or Series if duplicate indexes) for label
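
Since `.get_value()` was deprecated in pandas 0.21 and removed in 1.0, a version-independent way to read single values is `.get()` for a lookup with a default and `.at[...]` for strict scalar access. The sketch below uses a hypothetical Series `names`:

```python
import pandas as pd

names = pd.Series(['jon', 'ned'], index=['snow', 'stark'])

# .get() returns a default instead of raising on a missing label.
found = names.get('snow')
missing = names.get('dayne', 'unknown')

# .at gives scalar access by label and raises KeyError when the label is absent.
scalar = names.at['stark']
try:
    names.at['dayne']
    raised = False
except KeyError:
    raised = True

print(found, missing, scalar, raised)
```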
         
### - Series Iterations over:
- values:
    - for val in ser:
    - for val in ser.values:
    
- index:
    - for idx in ser.keys():
    
- index, value (unpacked tuples):
    - for idx, val in ser.iteritems():
    
- (index, value) tuples:
    - for item in ser.iteritems():
    
### - DataFrame Iterations over:
- column names:
    - for col in df:
    - for col in df.keys():
    
- column names and columns as a Series:
    - for col, ser in df.iteritems():    
- rows, as (index, Series) tuples:
    - for idx, row in df.iterrows():
- named tuples containing the index and row values:
    - for row in df.itertuples():
    
### - Operations performed during iteration are not vectorized in Pandas and have overhead
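
To make the overhead concrete, here is a minimal sketch that computes a price per unit twice, once by looping over rows and once as a single column expression. The frame `sales` is a made-up stand-in for the sales data; no timing figures are claimed here.

```python
import pandas as pd

sales = pd.DataFrame({'Units': [5.0, 2.0, 3.0],
                      'Sales': [20.2, 8.0, 13.0]})

# Row-by-row: each iteration builds a Python-level named tuple.
looped = [row.Sales / row.Units for row in sales.itertuples()]

# Vectorized: one expression over whole columns.
vectorized = sales['Sales'] / sales['Units']

print(looped)
print(vectorized.tolist())
```

Both forms give the same numbers; the vectorized form avoids the per-row Python work and is usually the one to prefer on large frames.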

### Getting values

In [35]:
a = pd.Series(['jon','ned'],
             index = ['snow','stark'])

In [36]:
a

snow     jon
stark    ned
dtype: object

In [37]:
a.get('snow')

'jon'

In [44]:
# returns default value x
a.get('jon','x')

'x'

In [45]:
a.get_value('stark')

'ned'

In [None]:
# Raises KeyError: 'dayne' is not in the index
a.get_value('dayne')

### Iteration

#### Over series

In [48]:
s = pd.Series(['jon','ned'], 
             index = ['snow','stark'])

In [49]:
s

snow     jon
stark    ned
dtype: object

In [50]:
# Access Values

for value in s:
    print(value)

jon
ned


In [51]:
# Access Index

for index in s.index:
    print(index)

snow
stark


In [53]:
# Access Values

for value in s.values:
    print(value)

jon
ned


In [54]:
# Access Index

for index in s.keys():
    print(index)

snow
stark


In [55]:
# Access key value pairs - tuple unpacking

for index, value in s.iteritems():
    print('{}::{}'.format(index, value))

snow::jon
stark::ned


In [57]:
# Access key value pairs - tuple 

for item in s.iteritems():
    print('{}'.format(item))

('snow', 'jon')
('stark', 'ned')


#### Over DataFrame

In [58]:
a = pd.DataFrame({'name':['jon','ned'], 'age':[20,40]})

In [59]:
a

Unnamed: 0,name,age
0,jon,20
1,ned,40


In [62]:
# column names
for column in a:
    print(column)

name
age


In [63]:
# column names
for column in a.keys():
    print(column)

name
age


In [65]:
# column names and columns as series

for col, ser in a.iteritems():
    print(ser)

0    jon
1    ned
Name: name, dtype: object
0    20
1    40
Name: age, dtype: int64


In [66]:
# rows

for row in a.iterrows():
    print(row)

(0, name    jon
age      20
Name: 0, dtype: object)
(1, name    ned
age      40
Name: 1, dtype: object)


In [67]:
# over named tuples containing row and index values

for row in a.itertuples():
    print(row)

Pandas(Index=0, name='jon', age=20)
Pandas(Index=1, name='ned', age=40)


# UPDATE

### - Series:
- Overwrite: Label-based assignment with = for an existing index label:
    - in-place overwriting of the existing value, i.e., mutating the original value for this index
    
- Overwrite: Position-based update with .iloc
    - the existing value will be overwritten
    - IndexError: if the integer position is out of bounds
    
- Add: Label-based assignment with = for a new, non-existing index label:
    - new item is added to the original series
    
- .append(another_series): expects another Series to append (removed in pandas 2.0, where pd.concat takes its place)
    - the original series is intact
    - a new Series is returned
    - ok to add an item with the same index label
    
- .set_value(label, value) (deprecated since pandas 0.21 in favor of .at[label] = value):
    - adds a new item for non-existing label
    - overwrites the value for the existing label
    - returns a modified series
    - replaces all occurrences of a non-unique index label with a new value
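
For readers on newer pandas releases, where `.append()` and `.set_value()` are no longer available, the same updates can be sketched with `pd.concat` and label-based assignment. The Series names and values below are illustrative.

```python
import pandas as pd

houses = pd.Series(['jon', 'ned'], index=['snow', 'stark'])
extra = pd.Series({'dayne': 'arthur', 'tully': 'brynden'})

# pd.concat returns a new Series; `houses` is left unchanged.
combined = pd.concat([houses, extra])

# Label-based assignment adds a new item or overwrites an existing one in place.
combined.loc['martell'] = 'oberyn'
combined.loc['snow'] = 'aegon'

print(houses)
print(combined)
```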

In [68]:
s = pd.Series(['jon','ned'], 
             index = ['snow','stark'])

In [71]:
s['snow'] = 'Aegon'

In [72]:
s

snow     Aegon
stark      ned
dtype: object

In [73]:
s['Targaryen'] = 'Jaeherys'

In [74]:
s

snow            Aegon
stark             ned
Targaryen    Jaeherys
dtype: object

In [75]:
s[0]

'Aegon'

In [78]:
# Position based overwriting

s[0] = 'jon'

In [79]:
s

snow              jon
stark             ned
Targaryen    Jaeherys
dtype: object

In [81]:
# append

b = s.append(pd.Series({'dayne':'arthur','tully':'brynden'}))

In [82]:
b

snow              jon
stark             ned
Targaryen    Jaeherys
dayne          arthur
tully         brynden
dtype: object

In [83]:
b.set_value('martell','oberyn')

snow              jon
stark             ned
Targaryen    Jaeherys
dayne          arthur
tully         brynden
martell        oberyn
dtype: object

In [84]:
ser = pd.Series(['A', 'B', 'C'], 
                index = ['H', 'G', 'H'])

In [85]:
ser.set_value('H','XX')

H    XX
G     B
H    XX
dtype: object

### - DataFrame:
- Combine rows: pd.concat(list_of_dfs): a function that combines multiple data frames from a list
    - the original df's are not modified
    - each df keeps its own index labels, so the labels of the second df start again at zero in the new df (pass ignore_index=True to renumber)
- Add a column: Label-based assignment of a Series with a new column name
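
A compact sketch of both operations on made-up frames (`first` and `second` are hypothetical names): concatenation with and without `ignore_index=True`, followed by adding columns from a Series and from a constant.

```python
import pandas as pd

first = pd.DataFrame({'UPS': [1234, 789], 'Units': [5.0, 1.0]})
second = pd.DataFrame({'UPS': [1234], 'Units': [4.0]})

# Default: every frame keeps its own index labels (0, 1, 0 here).
kept_labels = pd.concat([first, second])

# ignore_index=True renumbers the rows of the result (0, 1, 2 here).
fresh_labels = pd.concat([first, second], ignore_index=True)

# Adding a column: assign a Series (aligned on the index) or a constant.
fresh_labels['Priority'] = pd.Series([1, 2, 3])
fresh_labels['Category'] = 'Food'

print(kept_labels)
print(fresh_labels)
```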

#### NOTE: pd.concat([df1, df2,...]): combines a list of dfs by rows

In [87]:
sales_df = pd.read_csv("data_raw/sales.csv")
sales_df.tail(2)

Unnamed: 0,UPS,Units,Sales,Date
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


In [88]:
a = pd.DataFrame([(1234, 4.0, 10.0, '1/4/2014')], 
            columns = ['UPS', 'Units', 'Sales', 'Date'])

In [89]:
a

Unnamed: 0,UPS,Units,Sales,Date
0,1234,4.0,10.0,1/4/2014


In [91]:
new_df = pd.concat([sales_df, a])

In [92]:
new_df

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
3,789,1.0,2.0,1/1/2014
4,789,2.0,3.8,1/2/2014
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014
0,1234,4.0,10.0,1/4/2014


#### Add  a column

In [93]:
# The Series is aligned on the index, so its extra labels 7 and 8 are dropped
sales_df['Priority'] = pd.Series(range(9))

In [94]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date,Priority
0,1234,5.0,20.2,1/1/2014,0
1,1234,2.0,8.0,1/2/2014,1
2,1234,3.0,13.0,1/3/2014,2
3,789,1.0,2.0,1/1/2014,3
4,789,2.0,3.8,1/2/2014,4
5,789,,,1/3/2014,5
6,789,1.0,1.8,1/5/2014,6


In [95]:
sales_df['Category'] = 'Food'

In [96]:
sales_df.head(3)

Unnamed: 0,UPS,Units,Sales,Date,Priority,Category
0,1234,5.0,20.2,1/1/2014,0,Food
1,1234,2.0,8.0,1/2/2014,1,Food
2,1234,3.0,13.0,1/3/2014,2,Food


# DELETION

### - Delete: Series:
- Uncommon: instead use filtering and masking to create the copy with the desired elements
- Label-based deletion using del
- Caution: deletion based on non-unique index leads to unpredictable results
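
The filtering approach recommended above can be sketched as follows, using a small Series with a duplicated label. Both variants return a new Series and leave the original untouched, which sidesteps the in-place `del` on a non-unique index.

```python
import pandas as pd

ser = pd.Series(['A', 'B', 'C'], index=['H', 'G', 'H'])

# Boolean mask over the index: keep everything whose label is not 'H'.
without_h = ser[ser.index != 'H']

# .drop(label) is an alternative; it removes every row carrying that label.
dropped = ser.drop('H')

print(without_h)
print(len(ser))   # the original Series still has all three items
```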

### Position-based or Label-based Deletion with del

In [97]:
ser = pd.Series(['jon', 'ned'], 
                index = ['snow', 'stark'])
ser

snow     jon
stark    ned
dtype: object

In [98]:
del ser['snow']

In [99]:
ser

stark    ned
dtype: object

#### Non-unique indices

In [101]:
ser = pd.Series(['A', 'B', 'C'], index = ['H', 'G', 'H'])
ser

H    A
G    B
H    C
dtype: object

In [102]:
del ser['H']



In [104]:
# unpredictable behavior
ser

H    C
dtype: object

### - Delete Rows: DataFrame:
- .drop(list_of_row_index_values, axis=0): returns all the row(s) except for the ones listed as the .drop() arguments
    - the original df is NOT changed; does not work in-place
    - can drop more than one row
    - KeyError (ValueError in older pandas versions): if the list contains non-existent row index values
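
A minimal sketch of the error case, on a made-up frame: dropping a missing label raises, while `errors='ignore'` drops whatever exists and skips the rest. The `except` clause catches both exception types because the type raised depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'UPS': [1234, 789, 789]}, index=[0, 1, 2])

# Dropping a label that is not in the index raises (KeyError in recent pandas,
# ValueError in some older releases).
try:
    df.drop([1, 99], axis=0)
    raised = False
except (KeyError, ValueError):
    raised = True

# errors='ignore' drops the labels that exist and skips the rest.
remaining = df.drop([1, 99], axis=0, errors='ignore')

print(raised)
print(remaining)
```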

In [105]:
sales_df = pd.read_csv("data_raw/sales.csv")
sales_df.index

RangeIndex(start=0, stop=7, step=1)

In [106]:
remaining_rows = sales_df.drop([0,3,5], axis = 0)

In [107]:
remaining_rows

Unnamed: 0,UPS,Units,Sales,Date
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
4,789,2.0,3.8,1/2/2014
6,789,1.0,1.8,1/5/2014


In [108]:
sales_df

Unnamed: 0,UPS,Units,Sales,Date
0,1234,5.0,20.2,1/1/2014
1,1234,2.0,8.0,1/2/2014
2,1234,3.0,13.0,1/3/2014
3,789,1.0,2.0,1/1/2014
4,789,2.0,3.8,1/2/2014
5,789,,,1/3/2014
6,789,1.0,1.8,1/5/2014


### - Delete Columns: DataFrame:
- .pop(column_label): in-place operation: returns the column removed from the original df
    - the original df gets changed
    - cannot pop more than one column
    - KeyError: if the non-existent column label is passed as the argument
    
- .drop(list_of_col_labels, axis=1): returns all the column(s) except for the ones listed as the .drop() arguments
    - the original df is NOT changed
    - can drop more than one column
    - KeyError (ValueError in older pandas versions): if the list contains non-existent column labels
    
- .reindex(columns=list_of_col_labels): returns a new df containing only the listed columns
    - the original df is NOT changed
- Indexing with a list of columns: returns the columns selected in the index list
    - the original df is NOT changed
    - KeyError: if a non-existent column label is passed as an index label

In [116]:
### .pop() changes original Df

In [None]:
sales_only = sales_df.pop('Sales')

In [114]:
sales_only

0    20.2
1     8.0
2    13.0
3     2.0
4     3.8
5     NaN
6     1.8
Name: Sales, dtype: float64

In [115]:
sales_df

Unnamed: 0,UPS,Units,Date
0,1234,5.0,1/1/2014
1,1234,2.0,1/2/2014
2,1234,3.0,1/3/2014
3,789,1.0,1/1/2014
4,789,2.0,1/2/2014
5,789,,1/3/2014
6,789,1.0,1/5/2014


In [117]:
sales_df['Sales'] = sales_only

In [118]:
sales_df

Unnamed: 0,UPS,Units,Date,Sales
0,1234,5.0,1/1/2014,20.2
1,1234,2.0,1/2/2014,8.0
2,1234,3.0,1/3/2014,13.0
3,789,1.0,1/1/2014,2.0
4,789,2.0,1/2/2014,3.8
5,789,,1/3/2014,
6,789,1.0,1/5/2014,1.8


### .drop(axis = 1)

In [123]:
no_sales = sales_df.drop(['Sales'], axis = 1)

In [124]:
no_sales

Unnamed: 0,UPS,Units,Date
0,1234,5.0,1/1/2014
1,1234,2.0,1/2/2014
2,1234,3.0,1/3/2014
3,789,1.0,1/1/2014
4,789,2.0,1/2/2014
5,789,,1/3/2014
6,789,1.0,1/5/2014


### .reindex()

In [125]:
cols = ['Sales','Date']
sales_date_only = sales_df.reindex(columns=cols)

In [126]:
sales_date_only

Unnamed: 0,Sales,Date
0,20.2,1/1/2014
1,8.0,1/2/2014
2,13.0,1/3/2014
3,2.0,1/1/2014
4,3.8,1/2/2014
5,,1/3/2014
6,1.8,1/5/2014


### Indexing with a list of new columns

In [127]:
sales_units = sales_df[['Units','Sales']]
sales_units

Unnamed: 0,Units,Sales
0,5.0,20.2
1,2.0,8.0
2,3.0,13.0
3,1.0,2.0
4,2.0,3.8
5,,
6,1.0,1.8
