# Chapter 5: Getting Started With Pandas
## Introduction to Pandas Data Structures
### Series

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [4]:
obj = pd.Series([4, 7, -5, 3])
obj * 2

0     8
1    14
2   -10
3     6
dtype: int64

In [13]:
# create series from python dict
sdata = {'ohio': 35000, 'texas': 71000, 'oregon': 16000, 'utah': 5000}
obj3 = pd.Series(sdata)
obj3

ohio      35000
texas     71000
oregon    16000
utah       5000
dtype: int64

In [21]:
# You can override the default index built from the dict keys by passing your own index.
# The result follows the order of the passed index: labels with no matching dict key get NaN,
# and dict keys that are not in the passed index are excluded.
states = ['california', 'ohio', 'oregon', 'texas']
obj4 = pd.Series(sdata, index=states)
obj4

california        NaN
ohio          35000.0
oregon        16000.0
texas         71000.0
dtype: float64

In [23]:
# you can find null values or not null values
obj4.isnull(), obj4.notnull()

(california     True
 ohio          False
 oregon        False
 texas         False
 dtype: bool,
 california    False
 ohio           True
 oregon         True
 texas          True
 dtype: bool)

In [24]:
# arithmetic operations align on the index
obj3 + obj4

california         NaN
ohio           70000.0
oregon         32000.0
texas         142000.0
utah               NaN
dtype: float64

In [27]:
# It is possible to modify the index

obj.index = ['a', 'b', 'c', 'd']
obj

a    4
b    7
c   -5
d    3
dtype: int64

### DataFrame

In [31]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


You can get columns out as Series via either:
- dictionary-like notation: `frame['state']`
- attribute access: `frame.state`

and you can get rows out via the `loc` attribute:
- `frame.loc[3]`

In [33]:
frame['state'], frame.year, frame.loc[3]

(0      Ohio
 1      Ohio
 2      Ohio
 3    Nevada
 4    Nevada
 5    Nevada
 Name: state, dtype: object,
 0    2000
 1    2001
 2    2002
 3    2001
 4    2002
 5    2003
 Name: year, dtype: int64,
 state    Nevada
 year       2001
 pop         2.4
 Name: 3, dtype: object)
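
One caveat with attribute access is worth a quick sketch: it only works when the column name is a valid Python identifier and does not collide with an existing DataFrame attribute. In the frame above, `pop` is also the name of the `DataFrame.pop` method, so `frame.pop` gives back the method rather than the column. The small frame `demo` below is a made-up stand-in used only for illustration.

```python
import pandas as pd

# A small illustrative frame (values mirror two rows of the example above)
demo = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                     'year': [2000, 2001],
                     'pop': [1.5, 2.4]})

# Attribute access works for 'year' because nothing else on DataFrame has that name
year_by_attr = demo.year
year_by_key = demo['year']

# 'pop' collides with the DataFrame.pop method, so attribute access
# returns the bound method, not the column
pop_attr_is_method = callable(demo.pop)

# Bracket notation always returns the column
pop_column = demo['pop']

# loc selects a row by its index label and returns it as a Series
row = demo.loc[1]
```

Bracket notation is the safer default when column names come from external data.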

You can add a new column by assigning a Series. Its values are aligned to the DataFrame's index by label, and any rows without a matching label are filled with NaN.

In [37]:
val = pd.Series([-1.2, 3.2, 2.1, -0.9], index=[2, 3, 5, 0])
frame['debt'] = val
frame

    state  year  pop  debt
0    Ohio  2000  1.5  -0.9
1    Ohio  2001  1.7   NaN
2    Ohio  2002  3.6  -1.2
3  Nevada  2001  2.4   3.2
4  Nevada  2002  2.9   NaN
5  Nevada  2003  3.2   2.1
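
The alignment rule can be checked on a smaller frame. In the sketch below (the names `small` and `vals` are made up for illustration), the Series is matched to the frame by index label: rows with no matching label get NaN, and Series labels that are absent from the frame are ignored. Assigning a plain list, by contrast, requires the length to match the frame exactly.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'x': [10, 20, 30]})  # default index 0, 1, 2

# Labels 2 and 0 exist in the frame; label 7 does not and is ignored
vals = pd.Series([1.5, 2.5, 9.9], index=[2, 0, 7])
small['from_series'] = vals

# A list is assigned by position, so its length must equal len(small)
small['from_list'] = [100, 200, 300]

# A list of the wrong length is rejected with a ValueError
try:
    small['too_short'] = [1, 2]
    length_error = False
except ValueError:
    length_error = True
```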


When reading from a nested set of dictionaries, the outer dict keys will be seen as the columns and the inner keys as row indices:

In [44]:
state_pop = {'nevada': {2001: 2.4, 2002: 2.9},
             'ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame2 = pd.DataFrame(state_pop)
frame2

      nevada  ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [45]:
frame2.T

        2001  2002  2000
nevada   2.4   2.9   NaN
ohio     1.7   3.6   1.5


In [49]:
frame2.index.name = 'year'; frame2.columns.name = 'state'
frame2

state  nevada  ohio
year
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5


In [53]:
frame2.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

*Table 5-1* found on pg 134 shows various possible data type inputs to DataFrames:
![image.png](table51.PNG)

### Index Objects

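Index objects hold the axis labels and other metadata (such as the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index. Index objects are immutable, so their elements cannot be modified by the user.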

In [64]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')
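
As a minimal sketch of what immutability means in practice (the variable names are illustrative): assigning to a single element of an Index raises a `TypeError`, while replacing the whole `index` attribute with a new sequence, as done earlier with `obj.index = ['a', 'b', 'c', 'd']`, is allowed.

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])

# Element assignment on an Index is not supported
try:
    s.index[1] = 'z'
    mutated = True
except TypeError:
    mutated = False

# Replacing the entire index is fine; pandas builds a new Index object
s.index = ['x', 'y', 'z']
```

Immutability is what makes it safe to share one Index object among several data structures.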

In [65]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [67]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [68]:
obj2.index is labels

True

In [70]:
frame2.columns

Index(['nevada', 'ohio'], dtype='object', name='state')

Unlike a Python set, a pandas Index can contain duplicate labels. Selecting a duplicated label returns all occurrences of that label.
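
A short sketch of how duplicate labels behave, using a made-up Series `dup`: the `is_unique` property reports whether labels repeat, and selecting a repeated label returns a Series rather than a scalar.

```python
import pandas as pd

dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# False here, because 'a' appears twice
unique_flag = dup.index.is_unique

both_a = dup['a']   # a Series holding every entry labelled 'a'
only_b = dup['b']   # a scalar, since 'b' occurs once
```

Because the return type depends on whether the label repeats, code that assumes a scalar can break on data with duplicates.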

## Essential Functionality
### Reindexing
Reindexing creates a *new* object with the data *conformed* to a new index. The values are rearranged to follow the new index, and missing values (NaN) are introduced for any index labels that were not already present.

In [71]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [72]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

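For ordered data such as time series, the `method` option fills in values when reindexing. For example, `ffill` forward-fills each gap with the last preceding value: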

In [75]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [77]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
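
Reindexing also applies to DataFrames, where it can change the rows, the columns, or both. The sketch below uses a made-up frame `df`; it shows row reindexing, the `columns` keyword, and `fill_value` as an alternative to NaN for newly introduced labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)),
                  index=['a', 'c', 'd'],
                  columns=['ohio', 'texas', 'california'])

# Reindex the rows; the new label 'b' has no data, so its row is all NaN
by_rows = df.reindex(['a', 'b', 'c', 'd'])

# Reindex the columns with the columns keyword
by_cols = df.reindex(columns=['texas', 'utah', 'california'])

# fill_value replaces the NaN that reindexing would otherwise introduce
filled = df.reindex(['a', 'b', 'c', 'd'], fill_value=0)
```

In every case `df` itself is left unchanged, since `reindex` returns a new object.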

### Dropping Entries from an Axis
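One way to drop entries is to reindex with an index array or list that leaves out the unwanted labels. The `drop` method is simpler: it returns a new object with the indicated value or values deleted from an axis.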

In [78]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [79]:
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

With a DataFrame you can drop either axis:

In [80]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                   index=['ohio', 'colorado', 'utah', 'new york'],
                   columns=['one','two','three','four'])

data

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15


In [82]:
# drop rows
data.drop(['colorado', 'ohio'])

          one  two  three  four
utah        8    9     10    11
new york   12   13     14    15


In [83]:
# drop columns
data.drop('two', axis=1)

          one  three  four
ohio        0      2     3
colorado    4      6     7
utah        8     10    11
new york   12     14    15


`drop` and similar methods that change the size or shape of a Series or DataFrame can, when passed `inplace=True`, modify the object *in place* without returning a new one. Be careful with this option, as it destroys any data that is dropped.

In [84]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [85]:
'c' in obj

False
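
The difference between the default and in-place behaviour can be sketched as follows, on a fresh Series `s` (an illustrative name) so the notebook's `obj` is not disturbed.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default behaviour: a new object is returned and s is untouched
trimmed = s.drop(['d', 'c'])
still_has_c = 'c' in s

# inplace=True modifies s itself and returns None
result = s.drop('c', inplace=True)

# Dropping a label that is no longer present raises KeyError
try:
    s.drop('c')
    missing_raises = False
except KeyError:
    missing_raises = True
```

Because the in-place call returns `None`, writing `s = s.drop('c', inplace=True)` would discard the Series.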

### Indexing, Selection, and Filtering
Series indexing (`obj[...]`) works like NumPy array indexing, except that you can use the Series's index labels instead of only integers. Slicing with labels differs from normal Python slicing in that the endpoint is inclusive.
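
The inclusive endpoint is easiest to see next to a positional slice. This is a small sketch on a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = s['b':'c'].copy()   # label slice: includes both 'b' and 'c'
by_position = s.iloc[1:2]      # positional slice: excludes the stop position

# Assigning through a label slice modifies the matching section of s
s['b':'c'] = 5
```

The `.copy()` call keeps `by_label` independent of the later assignment.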

In [88]:
obj['c'] = 2.0
obj['b']

1.0

In [87]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [91]:
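# integers are treated as positions here because the index holds strings;
# newer pandas versions deprecate this fallback in favor of obj.iloc[[1, 3]]
obj[[1,3]]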

b    1.0
e    4.0
dtype: float64

In [93]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

In [95]:
data < 5

            one    two  three   four
ohio       True   True   True   True
colorado   True  False  False  False
utah      False  False  False  False
new york  False  False  False  False


In [97]:
data[data < 5] = 0
data

          one  two  three  four
ohio        0    0      0     0
colorado    0    5      6     7
utah        8    9     10    11
new york   12   13     14    15


#### Selection with `loc` and `iloc`
`loc` and `iloc` select a subset of rows and columns from a DataFrame: `loc` uses axis labels and `iloc` uses integer positions.

In [98]:
data.loc['colorado', ['two', 'three']]

two      5
three    6
Name: colorado, dtype: int32

In [99]:
data.iloc[2, [3, 0, 1]]

four    11
one      8
two      9
Name: utah, dtype: int32
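
Both indexers also accept slices, and `loc` accepts boolean masks. The sketch below rebuilds the original 4x4 frame under the illustrative name `df` (without the zeroing applied to `data` above) and shows a few combinations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['ohio', 'colorado', 'utah', 'new york'],
                  columns=['one', 'two', 'three', 'four'])

# Label slice on rows (inclusive of 'utah'), single column by label
upto_utah = df.loc[:'utah', 'two']

# Positional slices: first two rows, last two columns
corner = df.iloc[:2, 2:]

# A boolean mask can be used as the row selector in loc
big_three = df.loc[df['three'] > 5, ['one', 'three']]
```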

*Table 5-4* on pg 144 presents various Indexing options with DataFrame.

### Integer Indexes
When you have integer indexes, you can end up with subtle errors, as it can be difficult to determine if the user meant to use label-based indexing or position-based indexing. Having a non-integer index does not leave any ambiguity.

In [5]:
ser = pd.Series(np.arange(3.))
ser[-1]

KeyError: -1

In [6]:
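# but with a non-integer index, integer keys are treated as positions
# (newer pandas versions deprecate this fallback; ser2.iloc[-1] is the explicit form)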
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1]

2.0

To keep things consistent, when an axis index contains integers, selection with `[]` is always label-oriented. For more precise handling, use `loc` for label selection and `iloc` for integer position selection.
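
A minimal sketch of the explicit indexers on an integer-labelled Series follows. The labels 10, 20, 30 are chosen for illustration so that labels and positions cannot be confused.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.), index=[10, 20, 30])

by_label = ser.loc[20]       # value stored under label 20
by_position = ser.iloc[-1]   # value at the last position

# Plain [] on an integer index is label-based, so position-style keys fail
try:
    ser[-1]
    plain_lookup_failed = False
except KeyError:
    plain_lookup_failed = True

# loc slices by label and includes the end label;
# iloc slices by position and excludes the stop position
loc_slice = ser.loc[10:20]
iloc_slice = ser.iloc[0:1]
```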

### Arithmetic and Data Alignment

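Arithmetic between objects aligns on index labels. The result's index is the union of the two indexes, and labels that do not appear in both objects produce NaN. The same holds for DataFrames, where alignment is performed on both rows and columns. If two DataFrames share no row or column labels, the result contains only NaN values.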

In [7]:
s1 = pd.Series([7.2, -2.8, 3.1, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([1.3, 2.5, -1.3, 4, 5.2], index=['a', 'c', 'e', 'f', 'g'])

s1 + s2

a    8.5
c   -0.3
d    NaN
e    0.2
f    NaN
g    NaN
dtype: float64
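
The DataFrame case can be sketched with two small made-up frames, `df1` and `df2`, that overlap in only one cell. The example also shows the `add` method with `fill_value`, which treats a value missing from one operand as the fill value instead of producing NaN.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)),
                   index=['ohio', 'texas'], columns=['b', 'c'])
df2 = pd.DataFrame(np.arange(1., 5.).reshape((2, 2)),
                   index=['ohio', 'utah'], columns=['b', 'd'])

# + aligns on both rows and columns; only ('ohio', 'b') exists in both frames
total = df1 + df2

# add() with fill_value treats a value missing from one operand as 0;
# cells missing from both operands remain NaN
filled = df1.add(df2, fill_value=0)
```

Here `total` has rows `ohio`, `texas`, `utah` and columns `b`, `c`, `d`, with a single non-null cell, while `filled` keeps every value that exists in at least one of the two frames.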