<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST3512/blob/main/CST3512_Class19_Indexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##CST3512 Class 19    
**Row and Column Indexing**    


1. **Hierarchical Indexing**    
2. **Dewey Decimal Dictionary**     



This notebook is based on [Section 8.1 Hierarchical Indexing](https://wesmckinney.com/book/data-wrangling.html) from Chapter 8, 'Data Wrangling: Join, Combine, and Reshape', in Wes McKinney's 'Python for Data Analysis'.



In many applications, data may be spread across a number of files or databases or be arranged in a form that is not convenient to analyze. That chapter focuses on tools to help combine, join, and rearrange data.    

This notebook introduces the concept of **hierarchical indexing** in pandas, which is used extensively in some of these operations. Chapter 8 of the book then digs into the particular data manipulations. Various applied usages of these tools can be seen in [Data Analysis Examples](https://wesmckinney.com/book/data-wrangling.html#data-analysis-examples).



---



##**Housekeeping**    

Import required modules    


In [1]:
# Import pandas 
import pandas as pd

# Import numpy   
import numpy as np


##**1. Hierarchical Indexing**    


**Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:

###First, using the default index values...

In [7]:
data_0 = pd.Series(np.random.randn(9))

print(data_0)  

0   -0.412033
1    0.339982
2   -0.549759
3   -0.964211
4    0.904280
5    0.798179
6   -1.941745
7    0.099181
8    1.156431
dtype: float64


In [8]:
print(data_0[5])

0.7981794250776526


###Then, using a list of unique values as row indices...   

In [10]:
data_1 = pd.Series(np.random.randn(9),
       index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'])

print(data_1)    

a   -0.083988
b    2.120044
c    0.202272
d   -0.491024
e    0.510354
f   -0.307540
g    1.228234
h   -1.990508
i    0.478982
dtype: float64


In [11]:
print(data_1['d'])

-0.4910244679893649


###Of course, repeated labels no longer identify a single row...

In [12]:
data_2 = pd.Series(np.random.randn(9),
       index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])

print(data_2)    


a    0.698414
a   -0.535246
a    1.872640
b   -0.070535
b    1.491435
c   -0.988773
c    0.022553
d    0.205425
d    1.204814
dtype: float64


In [13]:
print(data_2['b'])

b   -0.070535
b    1.491435
dtype: float64
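
A minimal sketch of how repeated labels can be detected before relying on them as keys. The Series `s` and its values are hypothetical; only the labels mirror `data_2`.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the labels mirror `data_2` above.
s = pd.Series(np.arange(9.0),
              index=['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'])

# False here, because at least one label appears more than once.
print(s.index.is_unique)

# Boolean mask marking every occurrence of a repeated label.
repeated = s.index.duplicated(keep=False)
print(s[repeated])

# A repeated label returns a Series rather than a single scalar.
b_values = s['b']
print(type(b_values).__name__, len(b_values))
```

Checking `is_unique` first makes it clear whether a lookup such as `s['b']` will return one value or several.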


###But, what does the following do?   

In [14]:
data_3 = pd.Series(np.random.randn(9),
       index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
       [1, 2, 3, 1, 3, 1, 2, 2, 3]])

print(data_3)    


a  1   -0.730263
   2    0.803144
   3    0.705221
b  1    0.564720
   3   -0.444616
c  1   -0.102345
   2   -0.373366
d  2    0.291703
   3    2.392430
dtype: float64


In [15]:
corp = pd.Series(['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth'], 
index=[['Americas', 'Americas', 'Americas', 'EMEA', 'EMEA', 'AsiaPac', 'AsiaPac', 'Corp', 'Corp'], 
[101, 201, 301, 101, 301, 101, 201, 201, 301]])

print(corp)    

Americas  101      first
          201     second
          301      third
EMEA      101     fourth
          301      fifth
AsiaPac   101      sixth
          201    seventh
Corp      201     eighth
          301      ninth
dtype: object


What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:

In [17]:
print(data_1.index)

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'], dtype='object')


In [16]:
print(data_3.index)

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )


In [18]:
print(corp.index)

MultiIndex([('Americas', 101),
            ('Americas', 201),
            ('Americas', 301),
            (    'EMEA', 101),
            (    'EMEA', 301),
            ( 'AsiaPac', 101),
            ( 'AsiaPac', 201),
            (    'Corp', 201),
            (    'Corp', 301)],
           )
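
The pieces of a `MultiIndex` can also be inspected individually. The following is a small sketch with hypothetical values (only the index mirrors `data_3`), using the `nlevels`, `levels`, and `get_level_values` members of the index.

```python
import pandas as pd

# Hypothetical values; the two-level index mirrors `data_3` above.
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

idx = s.index
print(idx.nlevels)               # number of index levels
print(idx.levels[0])             # unique labels of the outer level
print(idx.levels[1])             # unique labels of the inner level
print(idx.get_level_values(0))   # outer label of every row, with the display gaps filled in
```

`get_level_values(0)` shows explicitly what the blank cells in the printed Series stand for.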


With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:

In [19]:
data_3['b']

1    0.564720
3   -0.444616
dtype: float64

In [20]:
corp['EMEA']

101    fourth
301     fifth
dtype: object

In [21]:
data_3['c']

1   -0.102345
2   -0.373366
dtype: float64

In [22]:
data_3['b':'c']

b  1    0.564720
   3   -0.444616
c  1   -0.102345
   2   -0.373366
dtype: float64

In [23]:
data_3['b':'d']

b  1    0.564720
   3   -0.444616
c  1   -0.102345
   2   -0.373366
d  2    0.291703
   3    2.392430
dtype: float64

In [24]:
data_3.loc[['b', 'd']]

b  1    0.564720
   3   -0.444616
d  2    0.291703
   3    2.392430
dtype: float64

In [25]:
corp.loc[['EMEA', 'Americas', 'AsiaPac']]

EMEA      101     fourth
          301      fifth
Americas  101      first
          201     second
          301      third
AsiaPac   101      sixth
          201    seventh
dtype: object

Selection is even possible from an “inner” level. Here I select all of the values whose label in the second index level is `2`:

In [26]:
data_3.loc[:, 2]

a    0.803144
c   -0.373366
d    0.291703
dtype: float64

In [27]:
corp.loc[:,201]

Americas     second
AsiaPac     seventh
Corp         eighth
dtype: object

Hierarchical indexing plays an important role in reshaping data and group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `unstack` method:

In [28]:
data_3.unstack()

          1         2         3
a -0.730263  0.803144  0.705221
b  0.564720       NaN -0.444616
c -0.102345 -0.373366       NaN
d       NaN  0.291703  2.392430


In [29]:
corp.unstack()

             101      201    301
Americas   first   second  third
AsiaPac    sixth  seventh    NaN
Corp         NaN   eighth  ninth
EMEA      fourth      NaN  fifth


The inverse operation of unstack is stack:

In [31]:
data_3.unstack().stack()

a  1   -0.730263
   2    0.803144
   3    0.705221
b  1    0.564720
   3   -0.444616
c  1   -0.102345
   2   -0.373366
d  2    0.291703
   3    2.392430
dtype: float64

In [32]:
corp.unstack().stack()

Americas  101      first
          201     second
          301      third
AsiaPac   101      sixth
          201    seventh
Corp      201     eighth
          301      ninth
EMEA      101     fourth
          301      fifth
dtype: object

`stack` and `unstack` are explored in more detail in [Chapter 8 of Wes McKinney's Python for Data Analysis](https://wesmckinney.com/book/data-wrangling.html).
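
`unstack` also accepts a `level` argument and a `fill_value`. The sketch below uses hypothetical values on the same index shape as `data_3` to show which level moves into the columns and how missing combinations can be filled.

```python
import pandas as pd

# Hypothetical values; the two-level index mirrors `data_3` above.
s = pd.Series(range(9),
              index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                     [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Default: the innermost level (level=-1) becomes the columns.
wide_inner = s.unstack()

# level=0 moves the outer level into the columns instead.
wide_outer = s.unstack(level=0)

# fill_value replaces the NaN produced by missing (outer, inner) combinations.
filled = s.unstack(fill_value=0)

print(wide_outer)
print(filled)
```

Here the pair `('b', 2)` does not exist in the Series, so it is `NaN` in `wide_inner` and `0` in `filled`.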

---

With a DataFrame, either axis can have a hierarchical index:


In [33]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
            index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
            columns=[['Ohio', 'Ohio', 'Colorado'],
            ['Green', 'Red', 'Green']])

In [34]:
print(frame)

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:


In [42]:
# Assign key1 and key2 as `frame` index hierarchy names, respectively   
frame.index.names = ['key1', 'key2'] 

# Assign state and color as `frame` column hierarchy names, respectively 
frame.columns.names = ['state', 'color']


In [36]:
print(frame)

state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [46]:
# Rename the `frame` index hierarchy levels to Region and Product, respectively   
frame.index.names = ['Region', 'Product'] 

# Keep state and color as the `frame` column hierarchy names, respectively 
frame.columns.names = ['state', 'color']

In [47]:
print(frame)

state           Ohio     Colorado
color          Green Red    Green
Region Product                   
a      1           0   1        2
       2           3   4        5
b      1           6   7        8
       2           9  10       11
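
Level names can also be set without modifying the original object. This is a sketch using `rename_axis`, which returns a new DataFrame; the variable name `named` is only illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

# rename_axis returns a copy with the level names set on each axis.
named = frame.rename_axis(index=['Region', 'Product'],
                          columns=['state', 'color'])

print(named.index.names)
print(named.columns.names)
print(frame.index.names)   # the original frame still has unnamed levels
```

Compared with assigning to `frame.index.names`, this form fits method chains because nothing is changed in place.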


***Caution***    
*Be careful to note that 'state' and 'color' are the names of the column levels; they are not part of the row labels (the `frame.index` values).*

With partial column indexing you can similarly select groups of columns:

In [48]:
frame['Ohio']

color           Green  Red
Region Product            
a      1            0    1
       2            3    4
b      1            6    7
       2            9   10
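
Beyond partial indexing, a full key can be given as a tuple with one label per level. The following sketch rebuilds the same `frame` so that it runs on its own; the result names `col`, `cell`, and `rows_a` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['Region', 'Product']
frame.columns.names = ['state', 'color']

# A full column key (one label per level) selects a single column as a Series.
col = frame[('Ohio', 'Red')]

# Full row and column keys together select a single cell.
cell = frame.loc[('b', 1), ('Ohio', 'Red')]

# Partial row indexing: all rows whose outer label is 'a'.
rows_a = frame.loc['a']

print(col)
print(cell)
print(rows_a)
```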


A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [40]:
pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                          ['Green', 'Red', 'Green']],
                          names=['state', 'color'])

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
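
A short sketch of what reuse can look like: one `MultiIndex` object shared by two DataFrames, plus `MultiIndex.from_product` for building every combination of labels. The frames `sales` and `costs` and their values are hypothetical.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                  ['Green', 'Red', 'Green']],
                                 names=['state', 'color'])

# Two hypothetical frames that share the same column index.
sales = pd.DataFrame(np.zeros((2, 3)), columns=cols)
costs = pd.DataFrame(np.ones((2, 3)), columns=cols)
print(sales.columns.equals(costs.columns))

# from_product builds the full cross product of the given label lists.
grid = pd.MultiIndex.from_product([['a', 'b'], [1, 2]],
                                  names=['key1', 'key2'])
print(grid)
```

`from_arrays` suits an index where only some combinations occur, while `from_product` suits a complete grid.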

###Reordering and Sorting Levels    



At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):

In [49]:
# frame.swaplevel('key1', 'key2')  # not applicable due to `Region` `Product` assignment earlier
frame.swaplevel('Region', 'Product')  

state           Ohio     Colorado
color          Green Red    Green
Product Region                   
1       a          0   1        2
2       a          3   4        5
1       b          6   7        8
2       b          9  10       11


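`sort_index`, on the other hand, sorts the data by its index labels. By default it sorts lexicographically using all of the index levels; passing the `level` argument sorts by the indicated level first. When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by the indicated level: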

In [50]:
frame.sort_index(level=1)

state           Ohio     Colorado
color          Green Red    Green
Region Product                   
a      1           0   1        2
b      1           6   7        8
a      2           3   4        5
b      2           9  10       11


In [51]:
frame.swaplevel(0, 1).sort_index(level=0)

state           Ohio     Colorado
color          Green Red    Green
Product Region                   
1       a          0   1        2
        b          6   7        8
2       a          3   4        5
        b          9  10       11


***Note:***    

*Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level—that is, the result of calling `sort_index(level=0)` or `sort_index()`.*    
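
One way to check whether an index is already sorted is the `is_monotonic_increasing` property. This is a sketch on a small hypothetical frame whose rows are deliberately out of order.

```python
import pandas as pd

# Hypothetical frame with a two-level index that is NOT sorted.
frame = pd.DataFrame({'x': [0, 1, 2, 3]},
                     index=[['b', 'a', 'b', 'a'], [2, 1, 1, 2]])

print(frame.index.is_monotonic_increasing)         # unsorted index

# sort_index() with no arguments sorts by all levels, outermost first.
sorted_frame = frame.sort_index()
print(sorted_frame.index.is_monotonic_increasing)  # sorted index

# Selection on the sorted frame.
print(sorted_frame.loc['a'])
```

Sorting once up front and then selecting repeatedly is usually the simpler pattern.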



###Summary Statistics by Level    



Many descriptive and summary statistics on DataFrame and Series can be computed per index level: group by the level you want to aggregate with `groupby(level=...)` and then apply the statistic. Consider the above DataFrame; we can aggregate by level on either the rows or columns like so:

In [52]:
frame

state           Ohio     Colorado
color          Green Red    Green
Region Product                   
a      1           0   1        2
       2           3   4        5
b      1           6   7        8
       2           9  10       11


In [53]:
# frame.groupby(level='key2').sum()   # not applicable due to `Region` `Product` assignment earlier
frame.groupby(level='Product').sum()

state    Ohio     Colorado
color   Green Red    Green
Product                   
1           6   8       10
2          12  14       16


In [54]:
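# Group the columns by `color`: transpose, group the rows, then transpose back.
# (`groupby(..., axis=1)` is deprecated in recent pandas releases.)
frame.T.groupby(level='color').sum().T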

color           Green  Red
Region Product            
a      1            2    1
       2            8    4
b      1           14    7
       2           20   10


Internally, this utilizes pandas’s `groupby` machinery, which is discussed in more detail in the book [Python for Data Analysis](https://wesmckinney.com/book/data-aggregation.html).
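
The same pattern extends to the other levels. The sketch below rebuilds the frame so it runs on its own and aggregates by the outer row level and by the outer column level; the result names `by_region` and `by_state` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['Region', 'Product']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same outer row label.
by_region = frame.groupby(level='Region').sum()

# Sum the columns that share the same state: group the transposed
# frame by level, then transpose the result back.
by_state = frame.T.groupby(level='state').sum().T

print(by_region)
print(by_state)
```

Grouping the transposed frame is one way to aggregate along the columns using only the row-wise `groupby`.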

---

###Indexing with a DataFrame's Columns    



It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:

In [55]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
            'c': ['one', 'one', 'one', 'two', 'two',
                 'two', 'two'],
            'd': [0, 1, 2, 0, 1, 2, 3]})


In [56]:
print(frame)

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


DataFrame’s `set_index` method will create a new DataFrame using one or more of its columns as the index:

In [57]:
frame2 = frame.set_index(['c', 'd'])

In [58]:
print(frame2)

       a  b
c   d      
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


By default the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`:

In [59]:
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d              
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


`reset_index`, on the other hand, does the opposite of `set_index`; the hierarchical index levels are moved into the columns:

In [60]:
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
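
`reset_index` can also move just one level back into the columns. The following sketch shows a partial reset and a full round trip; the names `partial` and `round_trip` are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

frame2 = frame.set_index(['c', 'd'])

# Move only level 'd' back into the columns; 'c' stays as the index.
partial = frame2.reset_index(level='d')
print(partial)

# Full round trip: reset both levels, then restore the original column order.
round_trip = frame2.reset_index()[list(frame.columns)]
print(round_trip)
```

After a full reset the former index levels come first, so reordering the columns is needed to match the original layout exactly.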




---



####**Related Exercise**


*See the following section of this notebook or the notebook ['Dewey_Dictionary'](https://bit.ly/dewey_notebook) for a related exercise on hierarchical indexing using the Dewey Decimal System.* 



---



##**2. Dewey Dictionary**    

Loads the Dewey Decimal System **codes** and **categories** to a dictionary using a [reference from the University of Illinois Library](https://www.library.illinois.edu/infosci/research/guides/dewey/).   

Makes the Dewey `code:category` dictionary available as a `pandas` DataFrame or as a Python Dictionary.     



First, copy the `.csv` file to the current working directory from the ProfessorPatrickSlatraigh GitHub repository at:     
* https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/dewey_codes_categories.csv

In [61]:
!curl 'https://raw.githubusercontent.com/ProfessorPatrickSlatraigh/data/main/dewey_codes_categories.csv' -o dewey_dictionary.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 33258  100 33258    0     0   178k      0 --:--:-- --:--:-- --:--:--  179k


###Housekeeping    

Import modules required:    
* **csv** - to read a csv file into a variable    
* **pandas** - for dataframes, etc.
* **numpy** - for arrays   


In [62]:
import csv    

import pandas as pd    

import numpy as np    


###Read the `CSV` into a Dictionary    

Create an empty `dewey_dict` dictionary and populate it with `key:value` pairs read from the two columns in the `.csv` file, excluding the header row.    


In [63]:
# Create an empty dictionary for Dewey code:category pairs   
dewey_dict = {}

try: 
    # newline='' lets the csv module handle line endings itself
    with open('dewey_dictionary.csv', mode='r', newline='') as source:
        csv_read = csv.reader(source)
        next(csv_read)              # to skip the header row in the csv_read file
        for line in csv_read: 
            # print(line)           # scaffolding to peek at lines in csv_read
            # print(line[0])        # scaffolding to peek at col 0 in csv_read
            # print(type(line[0]))  # scaffolding to peek at col 0 type in csv_read
            # wait = input('Hit Enter to continue.') # wait for scaffolding output
            dewey_dict[line[0]] = line[1] # dict entry (key=1st col, value=2nd col)
 
    print('Created the `dewey_dict` Dictionary.')
except (OSError, StopIteration, IndexError, csv.Error) as err:
    # Missing or unreadable file, empty file, short row, or malformed CSV
    print('Error encountered attempting to create `dewey_dict`:', err)

Created the `dewey_dict` Dictionary.
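
An alternative sketch uses `csv.DictReader`, which reads the header row itself and exposes each row by column name. The three sample rows are taken from the output shown in this notebook; the header names `dewey_code` and `dewey_category` are assumed to match the downloaded file, and `io.StringIO` stands in for the file so the example runs on its own.

```python
import csv
import io

# Small in-memory sample in the assumed two-column layout of the file.
sample = io.StringIO(
    "dewey_code,dewey_category\n"
    "000,Generalities\n"
    "001,Knowledge\n"
    "002,The book\n"
)

# DictReader consumes the header row and yields one dict per data row.
reader = csv.DictReader(sample)
sample_dict = {row['dewey_code']: row['dewey_category'] for row in reader}

print(sample_dict)
```

Because the `csv` module reads every field as text, the keys keep their leading zeros (`'000'`, not `0`).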


Descriptive information on the `dewey_dict` Dictionary. 

In [64]:
# print the length of the dictionary (# of key:value pairs)
print(len(dewey_dict))

1000


In [65]:
# print the populated dictionary   
print(dewey_dict)

{'000': 'Generalities', '001': 'Knowledge', '002': 'The book', '003': 'Systems', '004': 'Data processing Computer science', '005': 'Computer programming, programs, data', '006': 'Special computer methods', '007': 'Not assigned or no longer used', '008': 'Not assigned or no longer used', '009': 'Not assigned or no longer used', '010': 'Bibliography', '011': 'Bibliographies', '012': 'Bibliographies of individuals', '013': 'Bibliographies of works by specific classes of authors', '014': 'Bibliographies of anonymous and pseudonymous works', '015': 'Bibliographies of works from specific places', '016': 'Bibliographies of works from specific subjects', '017': 'General subject catalogs', '018': 'Catalogs arranged by author & date', '019': 'Dictionary catalogs', '020': 'Library & information sciences', '021': 'Library relationships', '022': 'Administration of the physical plant', '023': 'Personnel administration', '024': 'Not assigned or no longer used', '025': 'Library operations', '026': 'Li



---



###Read the .CSV into a DataFrame then Create a Dictionary    

Use pandas' `read_csv()` function with a few parameters that describe the `.csv` file format, then convert the result to a dictionary using the `to_dict()` method.


* `header` specifies which row holds the column names; `header=0` uses the first row of the file.    
* `index_col` specifies which column is used as the labels for the DataFrame object that the `read_csv()` function returns. In this case, the first column (position 0) is the key.    
* `squeeze('columns')` converts a DataFrame that has a single column into a Series. In this case, only one value column remains because the first column is used as the index. (The older `squeeze=True` parameter of `read_csv()` was removed in pandas 2.0.)    


In [66]:
try: 
    # Use pandas `read_csv` to read the file, then squeeze the one value column into a Series
    df_dewey = pd.read_csv('dewey_dictionary.csv', header=0, index_col=0).squeeze('columns')
    
    # Use pandas `to_dict()` to assign the Series index:value pairs to a dictionary
    dict_dewey = df_dewey.to_dict()
    
    print('Created `df_dewey` DataFrame and `dict_dewey` Dictionary.')
except (OSError, ValueError) as err:
    # Missing or unreadable file, or a file that pandas cannot parse
    print('Error attempting to create `df_dewey` and/or `dict_dewey`:', err)

Created `df_dewey` DataFrame and `dict_dewey` Dictionary.


Descriptive information on `df_dewey` (a Series, with `dewey_code` as the index).

In [67]:
print(df_dewey)

dewey_code
0                                           Generalities
1                                              Knowledge
2                                               The book
3                                                Systems
4                       Data processing Computer science
                             ...                        
995    General history of other areas Melanesia New G...
996    General history of other areas Other parts of ...
997    General history of other areas Atlantic Ocean ...
998    General history of other areas Arctic islands ...
999                              Extraterrestrial worlds
Name: dewey_category, Length: 1005, dtype: object

In [68]:
df_dewey.head()

dewey_code
0                        Generalities
1                           Knowledge
2                            The book
3                             Systems
4    Data processing Computer science
Name: dewey_category, dtype: object
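
In the output above the codes appear as integers (`0`, `1`, ...) rather than three-character strings, because `read_csv()` infers a numeric type for that column. The sketch below shows two ways to keep or restore the leading zeros. It uses a three-row in-memory sample taken from the output in this notebook, and the column names are assumed to match the file.

```python
import io
import pandas as pd

# Small in-memory sample in the assumed layout of the downloaded file.
csv_text = (
    "dewey_code,dewey_category\n"
    "000,Generalities\n"
    "001,Knowledge\n"
    "020,Library & information sciences\n"
)

# Default type inference turns the codes into integers.
as_int = pd.read_csv(io.StringIO(csv_text), index_col=0).squeeze('columns')
print(list(as_int.index))

# Reading every column as text keeps the codes exactly as written.
as_str = (pd.read_csv(io.StringIO(csv_text), dtype=str)
            .set_index('dewey_code')['dewey_category'])
print(list(as_str.index))

# Integer codes can also be padded back to three characters afterwards.
restored = as_int.index.astype(str).str.zfill(3)
print(list(restored))
```

Keeping the codes as three-character strings matters whenever individual characters of the code are needed.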

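Descriptive information on the `dict_dewey` Dictionary. The dictionary has 1000 entries although the Series has 1005 rows, because dictionary keys are unique and repeated `dewey_code` values collapse into a single key. The keys here are integers (`0`), not the three-character strings (`'000'`) found in `dewey_dict`.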

In [69]:
print(len(dict_dewey))

1000


In [70]:
print(dict_dewey)

{0: 'Generalities', 1: 'Knowledge', 2: 'The book', 3: 'Systems', 4: 'Data processing Computer science', 5: 'Computer programming, programs, data', 6: 'Special computer methods', 7: 'Not assigned or no longer used', 8: 'Not assigned or no longer used', 9: 'Not assigned or no longer used', 10: 'Bibliography', 11: 'Bibliographies', 12: 'Bibliographies of individuals', 13: 'Bibliographies of works by specific classes of authors', 14: 'Bibliographies of anonymous and pseudonymous works', 15: 'Bibliographies of works from specific places', 16: 'Bibliographies of works from specific subjects', 17: 'General subject catalogs', 18: 'Catalogs arranged by author & date', 19: 'Dictionary catalogs', 20: 'Library & information sciences', 21: 'Library relationships', 22: 'Administration of the physical plant', 23: 'Personnel administration', 24: 'Not assigned or no longer used', 25: 'Library operations', 26: 'Libraries for specific subjects', 27: 'General libraries', 28: 'Reading, use of other informa



---



###**<u>Exercise</u>**    

Add your code below to take a DataFrame with Dewey Decimal System `dewey_code` and `dewey_category` columns and transform the DataFrame to include a hierarchical structure of the following columns, derived from the `dewey_code` column:    
* **dewey_level1** - based on the **first** character in `dewey_code`    
* **dewey_level2** - based on the **second** character in `dewey_code`    
* **dewey_level3** - based on the **third** character in `dewey_code`    



*Note:*    

*Refer to the first section of this notebook or the Colab notebook on [**Hierarchical Indexing**](https://bit.ly/hierarchical_indexing) for reference/refresher information.*    

In [None]:
### YOUR CODE HERE ###
### add snippets below, if you like ###