## Introduction to pandas data structures (Series and DataFrames)

1. Series: a Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its *index*.


In [1]:
import numpy as np
import pandas as pd

In [2]:
# Example: no index is specified for the data, so pandas creates a default one consisting of
# the integers 0 through N - 1 (where N is the length of the data).
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
#You can get the array representation and index object of the Series via its values and index attributes, respectively:

print(obj.values)
print(obj.index)

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


In [4]:
#Often it will be desirable to create a Series with an index identifying each data point with a label:
obj2=pd.Series([4,7,-5,3],index=["a","b","c","d"])
obj2

a    4
b    7
c   -5
d    3
dtype: int64

In [5]:
obj2.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
#Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
# Example
print(obj2['b'])
print(obj2[['a', 'b']])  # list of index labels

7
a    4
b    7
dtype: int64


In [7]:
#Using NumPy functions or NumPy-like operations, such as filtering with a boolean
#array, scalar multiplication, or applying math functions, will preserve the index-value link
obj2[obj2>0]

a    4
b    7
d    3
dtype: int64

In [8]:
np.exp(obj2)

a      54.598150
b    1096.633158
c       0.006738
d      20.085537
dtype: float64
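
The comment in cell 7 also mentions scalar multiplication, which is not shown above. A minimal sketch, rebuilding the same `obj2` so the block runs on its own (the name `doubled` is only illustrative):

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["a", "b", "c", "d"])

# Scalar multiplication is applied element-wise; each value keeps its label.
doubled = obj2 * 2
print(doubled)
```

The result is a new Series with the same index as `obj2`; the original object is left unchanged.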

In [9]:
'e' in obj2

False

In [10]:
#Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3=pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [11]:
# When you pass only a dict, the index of the resulting Series takes the dict's keys in
# insertion order, as the output above shows (older pandas versions sorted them). You can
# override this by passing an index with the labels in the order you want them to appear:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
# No value for 'California' was found in sdata, so it appears as NaN (not a number),
# which pandas uses to mark missing or NA values. 'Utah' is not in states, so it is excluded.
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [12]:
#Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.
#Setting it helps identify what the data in the Series pertains to, which is especially useful
#when working with multiple Series or when the Series is converted into a DataFrame.
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
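
Since NaN marks missing values, it can help to see how they are detected. The sketch below rebuilds a Series like `obj4` and uses `isna` and `dropna`; the variable names `missing` and `present` are illustrative only.

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj4 = pd.Series(sdata, index=["California", "Ohio", "Oregon", "Texas"])

# Boolean Series: True where the value is missing (NaN).
missing = obj4.isna()
print(missing)

# A new Series with the missing entries removed; obj4 itself is not modified.
present = obj4.dropna()
print(present)
```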

## DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). It can be thought of as a dict of Series all sharing the same index.

In [13]:
#example 
data ={'state':['Ohio','Luisianna','New York','New Jersey','Nevada','Pennyslvania'],
       'year':[2000,2001,2002,2003,2004,2005],
       'pop':[1.5,1.7,3.6,2.4,2.9,2.9]}
frame=pd.DataFrame(data)
frame

          state  year  pop
0          Ohio  2000  1.5
1     Luisianna  2001  1.7
2      New York  2002  3.6
3    New Jersey  2003  2.4
4        Nevada  2004  2.9
5  Pennyslvania  2005  2.9


In [19]:
#If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order
pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year         state  pop
0  2000          Ohio  1.5
1  2001     Luisianna  1.7
2  2002      New York  3.6
3  2003    New Jersey  2.4
4  2004        Nevada  2.9
5  2005  Pennyslvania  2.9


In [14]:
#Columns can be modified by assignment. For example, a 'debt' column can be assigned
#a scalar value or an array of values (here the assignment also creates the column):
frame['debt']=np.arange(6.)
frame

          state  year  pop  debt
0          Ohio  2000  1.5   0.0
1     Luisianna  2001  1.7   1.0
2      New York  2002  3.6   2.0
3    New Jersey  2003  2.4   3.0
4        Nevada  2004  2.9   4.0
5  Pennyslvania  2005  2.9   5.0


In [15]:
#When you are assigning lists or arrays to a column, the value’s length must match the
#length of the DataFrame. If you assign a Series, its labels will be realigned exactly to
#the DataFrame’s index, inserting missing values in any holes:
val=pd.Series([-1.3,-4.2,-1.4],index=[1,3,4])
frame['debt'] = val

In [16]:
frame

          state  year  pop  debt
0          Ohio  2000  1.5   NaN
1     Luisianna  2001  1.7  -1.3
2      New York  2002  3.6   NaN
3    New Jersey  2003  2.4  -4.2
4        Nevada  2004  2.9  -1.4
5  Pennyslvania  2005  2.9   NaN
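
The length rule from cell 15 can be checked directly. This is a small sketch on a hypothetical three-row frame (not the `frame` above): a list of the wrong length is rejected, while a Series is aligned by label and the gaps are filled with NaN.

```python
import pandas as pd

# Hypothetical three-row frame used only for this illustration.
small = pd.DataFrame({"state": ["Ohio", "Nevada", "Texas"],
                      "year": [2000, 2001, 2002]})

error_message = None
try:
    # Two values for three rows: pandas refuses the assignment.
    small["debt"] = [1.0, 2.0]
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# A Series is matched on its index labels instead; row 1 has no label, so it becomes NaN.
small["debt"] = pd.Series([1.0, 2.0], index=[0, 2])
print(small)
```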


In [17]:
#Assigning a column that doesn’t exist will create a new column.
#Here a new 'color' column is set to 'red' for even row labels and 'blue' for odd ones:
frame['color'] = np.where(frame.index % 2 == 0, 'red', 'blue')
frame

          state  year  pop  debt  color
0          Ohio  2000  1.5   NaN    red
1     Luisianna  2001  1.7  -1.3   blue
2      New York  2002  3.6   NaN    red
3    New Jersey  2003  2.4  -4.2   blue
4        Nevada  2004  2.9  -1.4    red
5  Pennyslvania  2005  2.9   NaN   blue


In [18]:
frame.columns

Index(['state', 'year', 'pop', 'debt', 'color'], dtype='object')

In [20]:
#A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
frame.year # or frame["year"]

0    2000
1    2001
2    2002
3    2003
4    2004
5    2005
Name: year, dtype: int64
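
Attribute access is convenient but has a catch that this very dataset exposes: a column named `pop` collides with the `DataFrame.pop` method, so only the dict-like notation returns the column. A minimal sketch on a cut-down frame (the names `by_key` and `by_attr` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"state": ["Ohio", "Nevada"],
                      "year": [2000, 2001],
                      "pop": [1.5, 2.9]})

by_key = frame["pop"]   # the 'pop' column, as a Series
by_attr = frame.pop     # the DataFrame.pop method, not the column

print(type(by_key).__name__)
print(callable(by_attr))
```

For that reason, `frame[column]` is the safer choice whenever a column name might clash with an existing attribute or is not a valid Python identifier.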

In [29]:
#The del keyword deletes columns, as with a dict. As an example, first add a new
#column of boolean values that is True where the state column equals 'Ohio':
frame['eastern'] = frame.state == 'Ohio'
frame

          state  year  pop  debt  color  eastern
0          Ohio  2000  1.5   NaN    red     True
1     Luisianna  2001  1.7  -1.3   blue    False
2      New York  2002  3.6   NaN    red    False
3    New Jersey  2003  2.4  -4.2   blue    False
4        Nevada  2004  2.9  -1.4    red    False
5  Pennyslvania  2005  2.9   NaN   blue    False


In [30]:
del frame["eastern"]
frame 

          state  year  pop  debt  color
0          Ohio  2000  1.5   NaN    red
1     Luisianna  2001  1.7  -1.3   blue
2      New York  2002  3.6   NaN    red
3    New Jersey  2003  2.4  -4.2   blue
4        Nevada  2004  2.9  -1.4    red
5  Pennyslvania  2005  2.9   NaN   blue


In [31]:
#Another form of data is nested dict of dicts: 
pop={'nevada':{2001:2.4,2002:2.9},
     'ohio':{2000:1.5, 2001:1.7,2002:3.6}}

In [32]:
#If the nested dict is passed to the DataFrame, pandas will interpret the outer dict keys
#as the columns and the inner keys as the row indices:
frame3=pd.DataFrame(pop)
frame3

      nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [35]:
#The keys in the inner dicts are combined to form the index in the result. Older pandas
#versions sorted them; the output above keeps them in order of first appearance.
#This isn’t true if an explicit index is specified:
frame4=pd.DataFrame(pop, index=[2001, 2002, 2003])
frame4


      nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
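
If a sorted row index is wanted regardless of how pandas ordered the inner keys, it can be requested explicitly. A minimal sketch using `sort_index` on the same nested dict (the name `sorted_frame` is illustrative):

```python
import pandas as pd

pop = {"nevada": {2001: 2.4, 2002: 2.9},
       "ohio": {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

# sort_index returns a new DataFrame with rows ordered by label; frame3 is unchanged.
sorted_frame = frame3.sort_index()
print(sorted_frame)
```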


In [33]:
#A DataFrame can be transposed (rows and columns swapped) with syntax similar to a NumPy array:
frame3.T

        2001  2002  2000
nevada   2.4   2.9   NaN
ohio     1.7   3.6   1.5


In [38]:
frame3

      nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [39]:
#If a DataFrame’s index and columns have their name attributes set, these will also be displayed.
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

state  nevada  ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [42]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])
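
The array above is `float64` because every column of `frame3` is numeric. When columns have different types, the resulting array has to accommodate all of them. The sketch below uses two small hypothetical frames and `to_numpy()`, which returns the same kind of array as `.values`:

```python
import pandas as pd

# Hypothetical frames used only for this illustration.
homogeneous = pd.DataFrame({"nevada": [2.4, 2.9], "ohio": [1.7, 3.6]})
mixed = pd.DataFrame({"state": ["Ohio", "Nevada"], "pop": [1.5, 2.9]})

# All-float columns give a float64 array.
print(homogeneous.to_numpy().dtype)

# Strings and floats together fall back to the generic object dtype.
print(mixed.to_numpy().dtype)
```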

## Index Objects 
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels you use when constructing a Series or DataFrame is internally converted to an Index. Some users will not often take advantage of the capabilities provided by indexes, but because some operations will yield results containing indexed data, it’s important to understand how they work.


Some Index methods and properties

| Method | Description |
| ------ | ----------- |
| append | Concatenate with additional Index objects, producing a new Index |
| difference | Compute set difference as an Index |
| intersection | Compute set intersection |
| union | Compute set union |
| isin | Compute boolean array indicating whether each value is contained in the passed collection |
| delete | Compute new Index with element at index i deleted |
| drop | Compute new Index by deleting passed values |
| insert | Compute new Index by inserting element at index i |
| is_monotonic_increasing | Returns True if each element is greater than or equal to the previous element (named is_monotonic in older pandas versions) |
| is_unique | Returns True if the Index has no duplicate values |
| unique | Compute the array of unique values in the Index |
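
A few of the methods in the table can be tried on two small Index objects. This is only a sketch; the names `left` and `right` and their labels are made up for illustration.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

combined = left.union(right)          # labels found in either Index
shared = left.intersection(right)     # labels found in both
only_left = left.difference(right)    # labels in left but not in right
membership = left.isin(["a", "d"])    # boolean array, one entry per label in left
appended = left.append(right)         # plain concatenation, duplicates are kept

print(combined, shared, only_left, membership, appended, sep="\n")
print(appended.is_unique)
```

Unlike a Python set, an Index may contain duplicate labels, which is why `append` followed by `is_unique` reports `False` here.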





In [44]:
obj=pd.Series(range(3), index=['a', 'b', 'c'])

In [45]:
index=obj.index

In [46]:
index

Index(['a', 'b', 'c'], dtype='object')

In [47]:
index[1:]

Index(['b', 'c'], dtype='object')

In [48]:
#Index objects are immutable and thus can’t be modified by the user:
index[1] = 'd'  # raises a TypeError


TypeError: Index does not support mutable operations
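
Although a single label cannot be changed in place, the labels of a Series can still be replaced by assigning a whole new index. A minimal sketch, assuming the same three-element Series as above (the flag `raised` is illustrative):

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

raised = False
try:
    # Item assignment on an Index is rejected.
    obj.index[1] = "d"
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Assigning a full sequence of labels builds a new Index and attaches it to the Series.
obj.index = ["a", "d", "c"]
print(obj)
```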

In [50]:
#Immutability makes it safer to share Index objects among data structures
labels=pd.Index(np.arange(3))
labels

Index([0, 1, 2], dtype='int32')

In [52]:
obj2=pd.Series([1.5,2.5,0], index=labels)
obj2

0    1.5
1    2.5
2    0.0
dtype: float64

In [53]:
obj2.index is labels

True