In [1]:
import numpy as np
import pandas as pd
print("Set up completed")

Set up completed


How can I represent two-dimensional data in a one-dimensional Series?
Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [3]:
pop[('California', 2010):('Texas', 2000)]

(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

But the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:

In [4]:
pop[1:3]

(California, 2010)    37253956
(New York, 2000)      18976457
dtype: int64

In [7]:
pop[[i for i in index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

In [8]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

In [10]:
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [14]:
pop["Texas",2010]

25145561

In [18]:
pop["Texas"][2010]

25145561

In [27]:
pop[:,2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64

In [23]:
pop_df = pop.unstack()
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [36]:
pop_df.iloc[1:3]

              2000      2010
New York  18976457  19378102
Texas     20851820  25145561


In [33]:
pop_df.columns

Int64Index([2000, 2010], dtype='int64')

In [37]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014


In [39]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18

California  2000    0.273594
            2010    0.249211
New York    2000    0.247010
            2010    0.222831
Texas       2000    0.283251
            2010    0.273568
dtype: float64

In [40]:
f_u18.unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568
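
The `unstack()` calls above move the inner index level (the year) into the columns. As a minimal sketch, assuming the same population figures used earlier, `stack()` moves the columns back into the row index, so the two calls round-trip. The name `restored` is introduced here only for illustration.

```python
import pandas as pd

# Rebuild the multiply indexed population Series from the earlier cells.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

pop_df = pop.unstack()     # years become columns
restored = pop_df.stack()  # years move back into the row index

# Compare values and index labels of the round-tripped Series.
print(list(restored.values) == list(pop.values))
print(list(restored.index) == list(pop.index))
```

Depending on the installed pandas version, `stack()` may emit a deprecation warning about its implementation; the result for this data is the same either way.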


## Methods of MultiIndex creation:
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:

In [41]:
index_row = [['a','a','b','b'],[1,2,1,2]]
data = pd.DataFrame(np.random.rand(4,2),index=index_row,columns=['data1','data2'])
data

        data1     data2
a 1  0.961426  0.539300
  2  0.571847  0.073238
b 1  0.582962  0.183043
  2  0.231493  0.235568


Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:

In [42]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
New York    2000    18976457
            2010    19378102
dtype: int64

### Explicit MultiIndex constructors:
For more flexibility in how the index is constructed, you can instead use the class method constructors available on pd.MultiIndex. For example, as we did before, you can construct the MultiIndex from a simple list of arrays giving the index values within each level:

In [47]:
index = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
data = pd.DataFrame(np.random.rand(4,2),index=index,columns=['data1','data2'])
data

        data1     data2
a 1  0.489145  0.322199
  2  0.256559  0.694371
b 1  0.287676  0.688867
  2  0.614994  0.079839


In [51]:
index_1 = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)],names=["state", "serial"])
data = pd.DataFrame(np.random.rand(4,2),index=index_1,columns=['data1','data2'])
data

                 data1     data2
state serial
a     1       0.000312  0.702673
      2       0.268184  0.985062
b     1       0.497601  0.057174
      2       0.007754  0.668750


In [55]:
index_2 = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
data = pd.DataFrame(np.random.rand(4,2),index=index_2,columns=['data1','data2'])
data.index.names = ["state","serial"]
type(data["data1"])

pandas.core.series.Series
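
The three constructors used above (`from_arrays`, `from_tuples`, `from_product`) are different routes to the same index. The following sketch, which reuses the small `a`/`b` example and introduces the variable names only for illustration, checks that they agree.

```python
import pandas as pd

# Three ways to build the same two-level index.
from_arrays = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
from_tuples = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
from_product = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# MultiIndex.equals compares the labels of the two indexes.
same = from_arrays.equals(from_tuples) and from_tuples.equals(from_product)
print(same)
```

`from_product` is the most compact choice when every combination of the level values is wanted; `from_tuples` or `from_arrays` are needed when some combinations are missing.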

### MultiIndex for columns:
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:

In [52]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      39.0  36.2  39.0  36.1  44.0  34.8
     2      31.0  38.0  29.0  35.8  17.0  36.7
2014 1      37.0  37.1  37.0  38.4  34.0  36.4
     2      21.0  35.0  38.0  36.6  16.0  38.1


In [54]:
type(health_data['Guido'])

pandas.core.frame.DataFrame

In [59]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      39.0  36.1
     2      29.0  35.8
2014 1      37.0  38.4
     2      38.0  36.6


In [61]:
pop.index.names = ['state','year']

In [62]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [63]:
# We can access single elements by indexing with multiple terms:
pop["Texas",2000]

20851820

In [64]:
pop['California']

year
2000    33871648
2010    37253956
dtype: int64

In [67]:
pop[2000]

IndexError: index out of bounds
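
The failure above happens because a single key such as `2000` is not looked up in the inner `year` level; the exact exception raised may differ between pandas versions. As a hedged sketch, two ways to select by the inner level are `Series.xs` with a `level` argument and `.loc` with an empty slice for the outer level. The names `by_year` and `by_year_loc` are illustrative only.

```python
import pandas as pd

# Rebuild the named, multiply indexed population Series.
index = pd.MultiIndex.from_product(
    [['California', 'New York', 'Texas'], [2000, 2010]],
    names=['state', 'year'])
pop = pd.Series([33871648, 37253956,
                 18976457, 19378102,
                 20851820, 25145561], index=index)

# Cross-section on the inner level; the 'year' level is dropped from the result.
by_year = pop.xs(2000, level='year')

# The same selection with .loc: every state, year 2000.
by_year_loc = pop.loc[:, 2000]

print(by_year)
```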

In [68]:
pop.loc["California":"Texas"]

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [71]:
pop.iloc[0:3]

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
dtype: int64

In [73]:
pop.values

array([33871648, 37253956, 18976457, 19378102, 20851820, 25145561])

In [74]:
pop.index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['state', 'year'])

In [77]:
[i for i in pop.items()]

[(('California', 2000), 33871648),
 (('California', 2010), 37253956),
 (('New York', 2000), 18976457),
 (('New York', 2010), 19378102),
 (('Texas', 2000), 20851820),
 (('Texas', 2010), 25145561)]

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:

In [81]:
pop[:,2000]

state
California    33871648
New York      18976457
Texas         20851820
dtype: int64
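
The "sorted indices" condition matters for slicing. The sketch below uses a small hypothetical Series whose outer labels are deliberately out of order, checks whether the index is sorted, and sorts it with `sort_index()` before taking a label slice.

```python
import numpy as np
import pandas as pd

# Outer labels deliberately out of lexical order: a, c, b.
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.arange(6), index=index)

# False here, because 'c' comes before 'b'.
print(data.index.is_monotonic_increasing)

# Sort the index first, then slice on the outer level.
data_sorted = data.sort_index()
subset = data_sorted.loc['a':'b']
print(subset)
```

Slicing the unsorted `data` directly is likely to raise an error, so sorting first is the safer habit.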

In [82]:
pop[pop > 22000000]

state       year
California  2000    33871648
            2010    37253956
Texas       2010    25145561
dtype: int64

In [83]:
pop[['California', 'Texas']]

state       year
California  2000    33871648
            2010    37253956
Texas       2000    20851820
            2010    25145561
dtype: int64

### Multiply indexed DataFrames

In [84]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      39.0  36.2  39.0  36.1  44.0  34.8
     2      31.0  38.0  29.0  35.8  17.0  36.7
2014 1      37.0  37.1  37.0  38.4  34.0  36.4
     2      21.0  35.0  38.0  36.6  16.0  38.1


In [90]:
print(type(health_data["Bob","HR"]))
print(type(health_data["Bob"]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


In [88]:
health_data["Bob","HR"]

year  visit
2013  1        39.0
      2        31.0
2014  1        37.0
      2        21.0
Name: (Bob, HR), dtype: float64

In [89]:
health_data["Bob"]

type          HR  Temp
year visit
2013 1      39.0  36.2
     2      31.0  38.0
2014 1      37.0  37.1
     2      21.0  35.0


In [91]:
health_data["Bob"].values

array([[39. , 36.2],
       [31. , 38. ],
       [37. , 37.1],
       [21. , 35. ]])

In [92]:
health_data.iloc[:2, :2]

subject      Bob
type          HR  Temp
year visit
2013 1      39.0  36.2
     2      31.0  38.0
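
Positional selection with `iloc` ignores the labels entirely. As a sketch of the label-based counterpart, `.loc` accepts a tuple for a single multi-level column, and `pd.IndexSlice` lets you slice within each level. The data below is a freshly generated stand-in for `health_data` (seeded random values, not the numbers shown above), and the names `bob_hr` and `first_visit_hr` are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in for health_data with the same row and column structure.
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
rng = np.random.default_rng(0)
health_data = pd.DataFrame(np.round(rng.normal(37, 1, (4, 6)), 1),
                           index=index, columns=columns)

# A tuple selects one column across both column levels.
bob_hr = health_data.loc[:, ('Bob', 'HR')]

# IndexSlice: first visit of every year, HR column of every subject.
idx = pd.IndexSlice
first_visit_hr = health_data.loc[idx[:, 1], idx[:, 'HR']]
print(first_visit_hr)
```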
