### 10 Pandas

### Introduction to pandas Data Structures

In [20]:
from pandas import Series, DataFrame
import pandas as pd

In [21]:
obj=Series([4,5,7,3])
obj

0    4
1    5
2    7
3    3
dtype: int64

* A Series is a one-dimensional array-like object containing an array of data (of any
  NumPy data type) and an associated array of data labels, called its index. 
  
* The string representation of a Series displayed interactively shows the index on the left
  and the values on the right. Since we did not specify an index for the data, a default
  one consisting of the integers 0 through N - 1 (where N is the length of the data) is
  created. You can get the array representation and index object of the Series via its values
  and index attributes, respectively.

In [22]:
print(obj.values,type(obj.values))
print(obj.index,type(obj.index))

[4 5 7 3] <class 'numpy.ndarray'>
RangeIndex(start=0, stop=4, step=1) <class 'pandas.indexes.range.RangeIndex'>


In [23]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

* Unlike a regular NumPy array, a Series lets you use the values in its index when selecting
  single values or a set of values:

In [24]:
print(obj2[['d','c','a']].values,obj2[['d','c','a']].values.shape)

[ 4  3 -5] (3,)


Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be substituted into many functions that expect a
dict:



In [25]:
dicti={"ohio":10,"Gerogia":20,"Virginia":3,"Oregon":50}
obj3=Series(dicti)
print(obj3)

Gerogia     20
Oregon      50
Virginia     3
ohio        10
dtype: int64
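The dict-like behavior described above can be sketched as follows. This is a minimal illustration, not an exhaustive list of dict-compatible operations; the variable names are made up for the example.

```python
import pandas as pd

# Hypothetical example data, mirroring the labeled Series shown earlier
s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests look at the index labels, just as `in` looks at dict keys
has_b = 'b' in s
has_e = 'e' in s

# get() returns a default for a missing label, like dict.get
missing = s.get('e', 0)

# A Series can be converted back to a plain dict of label -> value
as_dict = s.to_dict()

print(has_b, has_e, missing)
print(as_dict)
```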


In [26]:
states=["ohio","Virginia","washington","Gerogia"]
obj4=Series(dicti,states)
print(obj4)

ohio          10.0
Virginia       3.0
washington     NaN
Gerogia       20.0
dtype: float64


In [27]:
print(pd.isnull(obj4))
print(pd.notnull(obj4))
obj4.notnull()

ohio          False
Virginia      False
washington     True
Gerogia       False
dtype: bool
ohio           True
Virginia       True
washington    False
Gerogia        True
dtype: bool


ohio           True
Virginia       True
washington    False
Gerogia        True
dtype: bool

* Both the Series object itself and its index have a name attribute, and a Series's index can be altered in place by assignment.
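A short sketch of both points, using a small made-up Series (the labels and names here are assumptions chosen for illustration):

```python
import pandas as pd

# Hypothetical data for illustration
s = pd.Series({'Ohio': 10, 'Oregon': 50, 'Virginia': 3})

# Assigning a new sequence of the same length replaces the index in place
s.index = ['OH', 'OR', 'VA']

# Both the Series and its index carry a name attribute
s.name = 'population'
s.index.name = 'state'

print(s)
```

The values are untouched; only the labels change, so s['OR'] still refers to the value that was stored under 'Oregon'.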

### DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered
collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series that all share the same index.

In [28]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'popof': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
frame2=DataFrame(data,columns=['year','popof','state'])
print(frame2)


   popof   state  year
0    1.5    Ohio  2000
1    1.7    Ohio  2001
2    3.6    Ohio  2002
3    2.4  Nevada  2001
4    2.9  Nevada  2002
   year  popof   state
0  2000    1.5    Ohio
1  2001    1.7    Ohio
2  2002    3.6    Ohio
3  2001    2.4  Nevada
4  2002    2.9  Nevada


A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
attribute:

In [29]:
print(frame2.popof.name)
print(frame2['state'].name)

popof
state


* Columns can be modified by assignment. For example, a new 'debt' column can
  be assigned a scalar value or an array of values:

In [30]:
frame2['debt'] = 16.5
frame2

   year  popof   state  debt
0  2000    1.5    Ohio  16.5
1  2001    1.7    Ohio  16.5
2  2002    3.6    Ohio  16.5
3  2001    2.4  Nevada  16.5
4  2002    2.9  Nevada  16.5


* When assigning lists or arrays to a column, the value’s length must match the length
  of the DataFrame. If you assign a Series instead, it is aligned to the
  DataFrame’s index, and any index labels it does not cover are filled with missing values:

 

In [31]:
val = Series([-1.2, -1.5, -1.7], index=[1,3,4])
frame2['debt'] = val
frame2

   year  popof   state  debt
0  2000    1.5    Ohio   NaN
1  2001    1.7    Ohio  -1.2
2  2002    3.6    Ohio   NaN
3  2001    2.4  Nevada  -1.5
4  2002    2.9  Nevada  -1.7
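The length rule for lists can be contrasted with the alignment rule for a Series in a small sketch. The frame and column names below are hypothetical and exist only to show the two behaviors side by side.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({'year': [2000, 2001, 2002]})

# A list must supply exactly one value per row
df['debt'] = [1.0, 2.0, 3.0]

# A list of the wrong length is rejected with a ValueError
try:
    df['bad'] = [1.0, 2.0]
    raised = False
except ValueError:
    raised = True

# A Series is aligned on the index instead; rows it does not cover become NaN
df['partial'] = pd.Series([9.5], index=[1])

print(raised)
print(df)
```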


* Assigning a column that doesn’t exist will create a new column. The del keyword will
  delete columns as with a dict:

In [32]:
frame2['eastern'] = frame2.state == 'Ohio'
print(frame2)
del frame2['eastern']
frame2.columns

   year  popof   state  debt eastern
0  2000    1.5    Ohio   NaN    True
1  2001    1.7    Ohio  -1.2    True
2  2002    3.6    Ohio   NaN    True
3  2001    2.4  Nevada  -1.5   False
4  2002    2.9  Nevada  -1.7   False


Index(['year', 'popof', 'state', 'debt'], dtype='object')

In [33]:
frame2.values  # the data as a 2D ndarray; mixed column types give dtype=object

array([[2000, 1.5, 'Ohio', nan],
       [2001, 1.7, 'Ohio', -1.2],
       [2002, 3.6, 'Ohio', nan],
       [2001, 2.4, 'Nevada', -1.5],
       [2002, 2.9, 'Nevada', -1.7]], dtype=object)

### Index Objects

In [34]:

obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
print(index)

Index(['a', 'b', 'c'], dtype='object')


* pandas’s Index objects are responsible for holding the axis labels and other metadata
    (like the axis name or names). Any array or other sequence of labels used when constructing
    a Series or DataFrame is internally converted to an Index

* Index objects are immutable and thus can’t be modified by the user:

In [35]:
index[1] = 'd'  # raises TypeError: Index objects are immutable


TypeError: Index does not support mutable operations

* In addition to being array-like, an Index also functions as a fixed-size set
* Each Index has a number of methods and properties for set logic and answering other
  common questions about the data it contains.
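The set-like behavior can be sketched with a few Index methods. The two indexes below are made up for illustration; the methods shown are only a sample of what an Index offers.

```python
import pandas as pd

# Two hypothetical indexes with partly overlapping labels
idx1 = pd.Index(['a', 'b', 'c'])
idx2 = pd.Index(['b', 'c', 'd'])

# Membership test, as with a set
contains_a = 'a' in idx1

# Set logic returns new Index objects; the originals are never modified
common = idx1.intersection(idx2)     # labels present in both
combined = idx1.union(idx2)          # labels present in either
only_first = idx1.difference(idx2)   # labels in idx1 but not in idx2

# A property answering a common question about the labels
all_unique = idx1.is_unique

print(contains_a, list(common), list(combined), list(only_first), all_unique)
```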

### Essential Functionality

* Dropping entries from an axis is easy. If A is a DataFrame, A.drop(name, axis=1) drops a column
  and A.drop(name, axis=0) drops a row by its index label; A.drop([col1, col2], axis=1) drops
  multiple columns. By default, drop returns a new object and leaves A unchanged.
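A minimal sketch of these drop calls, using a small made-up frame (the row and column labels are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 frame for illustration
A = pd.DataFrame(np.arange(9).reshape((3, 3)),
                 index=['r1', 'r2', 'r3'],
                 columns=['c1', 'c2', 'c3'])

no_c2 = A.drop('c2', axis=1)              # drop one column
no_r1 = A.drop('r1', axis=0)              # drop one row by its index label
no_c1_c3 = A.drop(['c1', 'c3'], axis=1)   # drop several columns at once

# Each call returns a new DataFrame; A itself still has all rows and columns
print(no_c2.columns.tolist(), no_r1.index.tolist(), no_c1_c3.columns.tolist())
print(A.shape)
```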

### Indexing, Selection, Filtering
* Series indexing works similarly to NumPy array indexing
* You can also index a Series with its index labels instead of only integers
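The two points above can be sketched on a small Series. The data is made up for illustration, and positional access is written with iloc here to keep it unambiguous.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with string labels
s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = s['b']              # single value by label
by_labels = s[['b', 'd']]      # several values by a list of labels
by_mask = s[s < 2]             # boolean filtering, as with a NumPy array
by_position = s.iloc[1:3]      # positional slice, end exclusive
label_slice = s['b':'d']       # label slice, end inclusive

print(by_label)
print(by_position.index.tolist(), label_slice.index.tolist())
```

One difference from NumPy worth noting: slicing with labels includes the endpoint, while slicing by position does not.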


In [41]:
import numpy as np

data = DataFrame(np.arange(16).reshape((4, 4)),
                 index=['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [38]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


In [40]:
data[data['three']>5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [44]:
data[1]  # doesn't work: [] looks up column labels, and no column is labeled 1

KeyError: 1

In [48]:
data.iloc[:, [1, 2]]  # this works: iloc selects by integer position

          two  three
Ohio        1      2
Colorado    5      6
Utah        9     10
New York   13     14
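As a rough sketch of how the same selection can be written by position or by label, the example below rebuilds the frame from this section and uses iloc and loc. The result variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Same frame as in the cells above
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_position = data.iloc[:, [1, 2]]          # all rows, columns at positions 1 and 2
by_label = data.loc[:, ['two', 'three']]    # all rows, the same columns by label
cell = data.loc['Utah', 'three']            # a single value by row and column label
second_row = data.iloc[1]                   # the second row as a Series

print(by_position.equals(by_label))
print(cell, second_row.tolist())
```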


### Arithmetic operations between different indices

In [54]:
df1 = DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                index=['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(df1,"\n")
print(df2,"\n")

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0 

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0 



In [52]:
df1 + df2


            b    c     d    e
Colorado  NaN  NaN   NaN  NaN
Ohio      3.0  NaN   6.0  NaN
Oregon    NaN  NaN   NaN  NaN
Texas     9.0  NaN  12.0  NaN
Utah      NaN  NaN   NaN  NaN


In [55]:
df1.add(df2, fill_value=0)

            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0


### Operations between DataFrame and Series

In [56]:
frame = DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]  # first row as a Series, indexed by the column labels
frame - series


          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
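In the subtraction above, the Series index is matched against the frame's columns and the operation is repeated for every row. A minimal sketch of the other direction, matching on the row index, assuming the sub method with axis=0 is the intended tool:

```python
import numpy as np
import pandas as pd

# Same frame as in the cell above
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched on the columns, broadcasting down the rows
row = frame.iloc[0]
by_row = frame - row

# axis=0: the Series index is matched on the row labels, broadcasting across columns
col = frame['d']
by_col = frame.sub(col, axis=0)

print(by_row)
print(by_col)
```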


### Function application and mapping