## Pandas

### The Series Data Structure

The NumPy array has an implicitly defined integer index used to access the values;
the Pandas Series has an explicitly defined index associated with the values.


In [1]:
import pandas as pd
# pd.Series?


In [2]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)


0    Tiger
1     Bear
2    Moose
dtype: object

In [3]:
# specify index
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data['b']


0.5

In [4]:
data.values


array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
data.index


Index(['a', 'b', 'c', 'd'], dtype='object')

In [6]:
numbers = [1, 2, 3]
pd.Series(numbers)


0    1
1    2
2    3
dtype: int64

In [7]:
animals = ['Tiger', 'Bear', None]
pd.Series(animals)


0    Tiger
1     Bear
2     None
dtype: object

NaN ("not a number") is a special floating-point value used to represent any result that is undefined or unrepresentable. <br>
NaN is also used as a placeholder for values that are missing or have not yet been computed.


In [8]:
numbers = [1, 2, None]
pd.Series(numbers)


0    1.0
1    2.0
2    NaN
dtype: float64

Be mindful that in Python (and NumPy) NaN values never compare equal, not even to themselves, whereas None does compare equal to None. Use `np.isnan` to test for NaN.


In [9]:
import numpy as np
np.nan == None


False

In [10]:
np.nan == np.nan


False

In [11]:
np.isnan(np.nan)


True

In [12]:
None == None


True
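
Because `np.nan == np.nan` is False, an equality test cannot locate missing entries in a Series. The sketch below uses `pd.isna`, which treats both NaN and None as missing; the names `values`, `eq_mask`, and `mask` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, np.nan, None])   # None is stored as NaN in a float Series

# Equality never matches NaN, so this mask is False everywhere.
eq_mask = values == np.nan

# pd.isna flags every missing entry, whether it started as NaN or None.
mask = pd.isna(values)

print(eq_mask.tolist())
print(mask.tolist())
```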

### Think of a Pandas Series like a specialized Python dictionary:

A dictionary is a structure that maps arbitrary keys to a set of arbitrary values; <br>
a Series is a structure that maps typed keys to a set of typed values.

**type**: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
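
As a rough illustration of what typed values make possible, the sketch below doubles every value first with a dictionary comprehension and then with a single vectorized Series expression. The data and the names `prices`, `doubled_dict`, and `doubled_series` are hypothetical; no timing is claimed here.

```python
import pandas as pd

prices = {'apple': 1.5, 'pear': 2.0, 'plum': 0.5}   # hypothetical data

# With a dictionary, element-wise arithmetic needs an explicit Python loop.
doubled_dict = {key: value * 2 for key, value in prices.items()}

# With a Series, the same operation is one vectorized expression
# applied to the typed values (float64 here).
price_series = pd.Series(prices)
doubled_series = price_series * 2

print(price_series.dtype)
print(doubled_series.to_dict() == doubled_dict)
```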


In [13]:
# constructing a Series object directly from
# a Python dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [14]:
# the index is drawn from the keys
population.index


Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [15]:
# Unlike a dictionary, the Series also supports
# array-style operations such as slicing
population['California':'Florida']


California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64

The general constructor is `pd.Series(data, index=index)`, where `index` is optional and `data` can be one of several kinds of object:


In [16]:
# list or NumPy array; index defaults to an integer sequence:
pd.Series([2, 4, 6])


0    2
1    4
2    6
dtype: int64

In [17]:
# data can be a scalar, which is repeated to fill
# specified index
pd.Series(121, index=[100, 200, 300])


100    121
200    121
300    121
dtype: int64

In [18]:
# the index can be explicitly set if
# a different result is preferred:
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 6])


3      c
2      a
6    NaN
dtype: object
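
The three kinds of `data` shown above can be put side by side. This is a minimal recap with made-up labels (`'x'`, `'y'`, `'z'`, `'w'`) and illustrative variable names.

```python
import numpy as np
import pandas as pd

# From a NumPy array with an explicit index.
from_array = pd.Series(np.array([2, 4, 6]), index=['x', 'y', 'z'])

# From a scalar: the value is repeated for every index label.
from_scalar = pd.Series(5, index=['x', 'y', 'z'])

# From a dictionary: only the labels listed in `index` are kept,
# and labels missing from the dictionary become NaN.
from_dict = pd.Series({'x': 1.0, 'y': 2.0}, index=['y', 'w'])

print(from_array['y'], from_scalar['z'])
print(from_dict)
```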

### Querying a Series


In [25]:
# dictionary-like expressions
data['b']


0.5

In [26]:
'a' in data


True

In [27]:
data.keys()


Index(['a', 'b', 'c', 'd'], dtype='object')

In [28]:
data.index


Index(['a', 'b', 'c', 'd'], dtype='object')

In [29]:
data.values


array([0.25, 0.5 , 0.75, 1.  ])

In [30]:
list(data.items())


[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [31]:
# modifiable as well
data['e'] = 1.25
data['a'] = 1
data


a    1.00
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [32]:
# slicing by explicit index
data['a':'c']


a    1.00
b    0.50
c    0.75
dtype: float64

In [33]:
# slicing by implicit integer index
data[0:2]


a    1.0
b    0.5
dtype: float64

Note: <br>
When slicing with an explicit index (e.g., data['a':'c']), the final index is included in the slice; <br>
when slicing with an implicit index (e.g., data[0:2]), the final index is **NOT** included in the slice.
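
The difference between the two slice styles can be checked directly by looking at which labels each one selects. The Series `s` below is a made-up example with a string index.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

# Label-based slice: both endpoints, 'a' and 'c', are included.
by_label = s['a':'c']

# Position-based slice: stops before position 2, like a Python list.
by_position = s[0:2]

print(list(by_label.index))      # labels selected by the explicit slice
print(list(by_position.index))   # labels selected by the implicit slice
```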


In [34]:
# selection
data[(data > 0.3) & (data < 0.8)]


b    0.50
c    0.75
dtype: float64

In [35]:
# fancy indexing: select specific ones
data[['a', 'e']]


a    1.00
e    1.25
dtype: float64

Be careful if your Series has an explicit integer index: <br>
An indexing operation such as data[1] will use the explicit indices; <br>
a slicing operation like data[1:3] will use the implicit Python-style index.


In [36]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data


1    a
3    b
5    c
dtype: object

In [37]:
# explicit index when indexing
data[1]


'a'

In [38]:
# implicit index when slicing
data[1:3]  # data[0:2]


3    b
5    c
dtype: object

**To avoid confusion, use the special indexers `loc` and `iloc`:**


In [39]:
# the loc attribute always references the explicit index
print(data.loc[1], '\n')
print(data.loc[1:3])


a 

1    a
3    b
dtype: object


In [40]:
# the iloc attribute always references the implicit (positional) index
print(data.iloc[1], '\n')
print(data.iloc[1:3])


b 

3    b
5    c
dtype: object
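
One more way to see the distinction is to ask for something that exists as a position but not as a label. The sketch below reuses the same integer-indexed data under the illustrative name `s`.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# iloc counts positions, so position 2 is the third element.
third = s.iloc[2]

# loc looks up labels; 2 is not a label in this index, so it raises KeyError.
try:
    s.loc[2]
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

# loc slices are label-based and include the end label.
label_slice = s.loc[1:5]

print(third, missing_label_raises, list(label_slice.index))
```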


### The DataFrame Data Structure

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

A two-dimensional array can be viewed as an ordered sequence of aligned one-dimensional columns; a DataFrame can be viewed as a
sequence of aligned Series objects sharing the same index.
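
To make "aligned" concrete, the sketch below builds a DataFrame from two Series whose labels are written in different orders. The data and the names `height`, `weight`, and `people` are hypothetical.

```python
import pandas as pd

# Two hypothetical Series whose labels appear in different orders.
height = pd.Series({'ann': 160, 'bob': 175, 'cat': 168})
weight = pd.Series({'cat': 60, 'ann': 55, 'bob': 80})

# The DataFrame matches values by index label,
# not by the order in which they were written.
people = pd.DataFrame({'height': height, 'weight': weight})

print(people)
print(people.loc['cat', 'weight'])
```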


In [41]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
states

# change orders


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [42]:
states.index


Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [43]:
# additional columns attribute
states.columns


Index(['population', 'area'], dtype='object')

In [44]:
# specialized dictionary
states['area']


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [45]:
# in contrast, indexing a two-dimensional NumPy array with [0] returns a row
import numpy as np
a = np.random.randint(0, 10, (2, 3))
print(a)
a[0]


[[5 2 1]
 [4 3 5]]


array([5, 2, 1])

### Construct DataFrame objects


In [46]:
# From a single Series object
pd.DataFrame(population, columns=['population'])


            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


In [47]:
pd.DataFrame(population)


                   0
California  38332521
Texas       26448193
New York    19651127
Florida     19552860
Illinois    12882135


In [48]:
# From a list of dictionaries
# list comprehension
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data


[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [49]:
pd.DataFrame(data)


   a  b
0  0  0
1  1  2
2  2  4


In [50]:
# if some keys in the dictionary are missing
# Pandas will fill them in with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])


     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


In [51]:
# From a dictionary of Series objects
pd.DataFrame({'population': population, 'area': area})


            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995


In [52]:
{'population': population, 'area': area}


{'population': California    38332521
 Texas         26448193
 New York      19651127
 Florida       19552860
 Illinois      12882135
 dtype: int64,
 'area': California    423967
 Texas         695662
 New York      141297
 Florida       170312
 Illinois      149995
 dtype: int64}

In [53]:
# From a two-dimensional NumPy array
# with specified column and index names
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])


        foo       bar
a  0.481472  0.179710
b  0.280539  0.306987
c  0.376678  0.204490


In [54]:
# if omitted, an integer index will be used
pd.DataFrame(np.random.rand(3, 2))


          0         1
0  0.473364  0.618177
1  0.203586  0.477366
2  0.525021  0.479171


### Selection in DataFrame


In [55]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data


              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [56]:
# check the first few rows
data.head(2)


              area       pop
California  423967  38332521
Texas       695662  26448193


In [57]:
# dictionary-style indexing of the column name
data['area']


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [58]:
# attribute-style access with column names that are strings:
data.area


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [59]:
# avoid data.pop as pop() is a method for DataFrame
data.pop is data['pop']


False
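
The reason `data.pop is data['pop']` comes out False is that the attribute name resolves to the `DataFrame.pop` method rather than to the column. A small check on made-up data (the frame `df` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'area': [10, 20], 'pop': [100, 400]})

# 'area' does not clash with any DataFrame attribute, so both forms agree.
same_area = df.area.equals(df['area'])

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the bound method instead of the column.
pop_is_method = callable(df.pop)
pop_column = df['pop']

print(same_area, pop_is_method, pop_column.tolist())
```

Dictionary-style access (`df['pop']`) works for every column name, which makes it the safer default.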

In [60]:
# DataFrame allows modification/addition
data['density'] = data['pop'] / data['area']
data


              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [61]:
# the underlying two-dimensional NumPy array
data.values


array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [62]:
# transpose
data.T


         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [63]:
# difference between array and DataFrame
# index accesses a row for DataFrame.values
data.values[0]


array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [64]:
# "index" to a DataFrame accesses a column:
data['area']


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [65]:
# loc, iloc again
data.loc[:'Florida', :'pop']


              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860


In [66]:
# iloc uses the implicit (positional) indices
data.iloc[:3, :2]


              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [67]:
# selection using > < etc. (row); fancy indexing (column)
data['density'] = data['pop'] / data['area']
data.loc[data.density > 100, ['pop', 'density']]


               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [68]:
# Note for []:
# indexing refers to columns, slicing refers to rows
print(data['Florida':'Illinois'], '\n')
print(data['area'])


            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763 

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64


In [69]:
# Note for []:
# refer to rows by implicit number rather than by index
data[1:3]


            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


In [70]:
# Note for []:
# a boolean mask (built with >, <, etc.) selects rows
data[data.density > 100]


            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121


In [71]:
# mutable:
data.iloc[0, 2] = 90
data


              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [72]:
copy_df = data.drop('Florida')
copy_df


              area       pop     density
California  423967  38332521   90.000000
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Illinois    149995  12882135   85.883763


In [73]:
# introduce a missing value first (None is stored as NaN)
copy_df.iloc[1, 1] = None
copy_df


              area         pop     density
California  423967  38332521.0   90.000000
Texas       695662         NaN   38.018740
New York    141297  19651127.0  139.076746
Illinois    149995  12882135.0   85.883763


In [74]:
copy_df.dropna()


              area         pop     density
California  423967  38332521.0   90.000000
New York    141297  19651127.0  139.076746
Illinois    149995  12882135.0   85.883763


In [75]:
del copy_df['pop']
copy_df


              area     density
California  423967   90.000000
Texas       695662   38.018740
New York    141297  139.076746
Illinois    149995   85.883763


In [76]:
# assigning None empties the column but does not remove it
copy_df['density'] = None
copy_df


              area  density
California  423967     None
Texas       695662     None
New York    141297     None
Illinois    149995     None
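
The removal operations above differ in whether they modify the original object. The sketch below, on a small made-up frame `df`, shows that `drop` and `dropna` return new DataFrames and leave the original intact. The `columns=` keyword of `drop` is not used in the cells above and is added here only for comparison with `del`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'area': [1, 2, 3], 'pop': [10.0, np.nan, 30.0]},
                  index=['a', 'b', 'c'])

# drop returns a new DataFrame; the original keeps all of its rows.
without_a = df.drop('a')

# drop can also remove a column when given the columns keyword.
without_pop = df.drop(columns='pop')

# dropna removes every row that contains at least one missing value.
complete_rows = df.dropna()

print(df.shape, without_a.shape, without_pop.shape, complete_rows.shape)
```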


### Index alignment


#### Series

When two Series are combined in an arithmetic operation, Pandas aligns them on their indices. The resulting Series contains the union of the indices of the two inputs; <br>
any item for which one or the other does not have an entry is marked with NaN, or "Not a Number".


In [77]:
import pandas as pd
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
population / area


Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

In [78]:
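# check on the indices
# (| works as a set union on older pandas only;
# area.index.union(population.index) is the portable form)
area.index | population.index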


Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [79]:
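# (& works as a set intersection on older pandas only;
# area.index.intersection(population.index) is the portable form)
area.index & population.index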


Index(['Texas', 'California'], dtype='object')

In [80]:
area.index.union(population.index)


Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [81]:
area.index.intersection(population.index)


Index(['Texas', 'California'], dtype='object')

In [82]:
# another example:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B


0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If using NaN values is not the desired behavior, we can modify the fill value using appropriate object methods in place of the operators. For example, calling A.add(B) is equivalent to calling A + B, but allows optional explicit specification of the fill value for any elements in A or B that might be missing:


In [83]:
A.add(B, fill_value=0)


0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
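
The other arithmetic operators have method forms that accept `fill_value` in the same way. As a minimal sketch, here is subtraction with `Series.sub` on the same `A` and `B`; the names `plain` and `filled` are illustrative.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Plain operator: labels present on only one side give NaN.
plain = A - B

# Method form: a missing operand is replaced by fill_value before subtracting.
filled = A.sub(B, fill_value=0)

print(plain.tolist())
print(filled.tolist())
```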