# Multiindexing and Lord of the Rings


Data: https://www.kaggle.com/mokosan/lord-of-the-rings-character-data?select=WordsByCharacter.csv

Reference article: https://towardsdatascience.com/how-to-use-multiindex-in-pandas-to-level-up-your-analysis-aeac7f451fce

### Create a multi-indexed data frame from a standard-style data frame

This is a table of the number of words spoken by each character in the "Lord of the Rings" films.

We will load it, look at its starting structure, and then convert it into a multi-indexed form.

Note: The bottom of the file shows how to turn off multi-indexing for both rows (index) and columns (columns).

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import pandas as pd
# load data

infile="/content/drive/MyDrive/Colab Notebooks/Spring 2024/DAT_512-Stat.-Approaches-to-Big-Data/Data/WordsByCharacter.csv"

df = pd.read_csv(infile)

This loads up as a very standard style of data frame

In [5]:
df.head()

Unnamed: 0,Film,Chapter,Character,Race,Words
0,The Fellowship Of The Ring,01: Prologue,Bilbo,Hobbit,4
1,The Fellowship Of The Ring,01: Prologue,Elrond,Elf,5
2,The Fellowship Of The Ring,01: Prologue,Galadriel,Elf,460
3,The Fellowship Of The Ring,01: Prologue,Gollum,Gollum,20
4,The Fellowship Of The Ring,02: Concerning Hobbits,Bilbo,Hobbit,214


In [6]:
df.index.names

FrozenList([None])

A FrozenList is immutable: we cannot set its items in place, but we can still read items from it, and because it is hashable it can be used as a lookup key.
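
A small sketch of what "frozen" means in practice, using a throwaway frame (`toy` is a hypothetical name, not part of this notebook). It suggests that assigning to a single element is rejected, while replacing the whole list of names through the index is allowed.

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate FrozenList behaviour.
toy = pd.DataFrame({"Words": [4, 5]})

names = toy.index.names          # a FrozenList holding the index level names
try:
    names[0] = "row"             # in-place mutation is not supported
    mutated = True
except TypeError:
    mutated = False

# Assigning a whole new list of names through the index is allowed.
toy.index.names = ["row"]
new_names = list(toy.index.names)
```

So renaming index levels goes through `index.names = [...]` rather than through item assignment.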

In [7]:
# Make a copy of the original version of the data frame.

# This example shows a number of interesting ways of using multiindexing.
# I also wanted to see how difficult the same searches are on a flat data table,
# so I duplicated some multi-index queries using just the standard index and column forms.

df_orig=df.copy()

Here comes the multi-level index on the rows, using several of the columns as the MultiIndex.

These columns are all really identifiers or categoricals.

Notice that we specify which columns should be set as the index of the data table.

This is an example of a highly composite key, with 4 elements in this case.
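
As a minimal sketch of the same idea on made-up data (the frame `toy` and its values are hypothetical), the following checks that each key column becomes one index level and that the composite key identifies rows uniquely.

```python
import pandas as pd

# Hypothetical miniature of the words table (values are made up).
toy = pd.DataFrame({
    "Film": ["F1", "F1", "F2"],
    "Chapter": ["01", "01", "01"],
    "Race": ["Elf", "Hobbit", "Elf"],
    "Character": ["A", "B", "A"],
    "Words": [10, 20, 30],
})

key_columns = ["Film", "Chapter", "Race", "Character"]
toy_multi = toy.set_index(key_columns)

n_levels = toy_multi.index.nlevels                # one level per key column
remaining = list(toy_multi.columns)               # only the measured value is left
key_is_unique = bool(toy_multi.index.is_unique)   # True when no key combination repeats
```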

In [8]:
multi = df.set_index(['Film', 'Chapter', 'Race', 'Character'])

In [9]:
multi.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Words
Film,Chapter,Race,Character,Unnamed: 4_level_1
The Fellowship Of The Ring,01: Prologue,Hobbit,Bilbo,4
The Fellowship Of The Ring,01: Prologue,Elf,Elrond,5
The Fellowship Of The Ring,01: Prologue,Elf,Galadriel,460
The Fellowship Of The Ring,01: Prologue,Gollum,Gollum,20
The Fellowship Of The Ring,02: Concerning Hobbits,Hobbit,Bilbo,214


This operation converted the columns into index levels and removed them from the data frame's columns.

In [10]:
multi.index.names

FrozenList(['Film', 'Chapter', 'Race', 'Character'])

In [11]:
multi.index.values[0:10]

array([('The Fellowship Of The Ring', '01: Prologue', 'Hobbit', 'Bilbo'),
       ('The Fellowship Of The Ring', '01: Prologue', 'Elf', 'Elrond'),
       ('The Fellowship Of The Ring', '01: Prologue', 'Elf', 'Galadriel'),
       ('The Fellowship Of The Ring', '01: Prologue', 'Gollum', 'Gollum'),
       ('The Fellowship Of The Ring', '02: Concerning Hobbits', 'Hobbit', 'Bilbo'),
       ('The Fellowship Of The Ring', '03: The Shire', 'Hobbit', 'Bilbo'),
       ('The Fellowship Of The Ring', '03: The Shire', 'Hobbit', 'Frodo'),
       ('The Fellowship Of The Ring', '03: The Shire', 'Ainur', 'Gandalf'),
       ('The Fellowship Of The Ring', '03: The Shire', 'Hobbit', 'Hobbit Kids'),
       ('The Fellowship Of The Ring', '03: The Shire', 'Hobbit', 'Hobbits')],
      dtype=object)

In [12]:
# Use loc to find specific events, like: who spoke in the Prologue?

multi.loc[('The Fellowship Of The Ring','01: Prologue'),:]


Unnamed: 0_level_0,Unnamed: 1_level_0,Words
Race,Character,Unnamed: 2_level_1
Hobbit,Bilbo,4
Elf,Elrond,5
Elf,Galadriel,460
Gollum,Gollum,20


In [13]:
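# Could we do this without multiindexing? Filter on both Film and Chapter to match the lookup above.
df_orig[(df_orig.Film=="The Fellowship Of The Ring") & (df_orig.Chapter=="01: Prologue")]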

Unnamed: 0,Film,Chapter,Character,Race,Words
0,The Fellowship Of The Ring,01: Prologue,Bilbo,Hobbit,4
1,The Fellowship Of The Ring,01: Prologue,Elrond,Elf,5
2,The Fellowship Of The Ring,01: Prologue,Galadriel,Elf,460
3,The Fellowship Of The Ring,01: Prologue,Gollum,Gollum,20


Notice that in the multi-indexed result from `multi.loc`, the film and the chapter are not shown: they are fixed by the lookup key, while all the other index levels and values are still shown.

In [14]:
# Who were the first three elves to speak in the movie?

multi.loc[("The Fellowship Of The Ring",slice(None),'Elf'),:].head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Words
Film,Chapter,Race,Character,Unnamed: 4_level_1
The Fellowship Of The Ring,01: Prologue,Elf,Elrond,5
The Fellowship Of The Ring,01: Prologue,Elf,Galadriel,460
The Fellowship Of The Ring,21: Flight To The Ford,Elf,Arwen,131


In [15]:
# Note that we needed to skip a level in the index hierarchy (Chapter); slice(None) places no restriction on that level, so all of its entries are collected.
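
For comparison, here is a hedged sketch on made-up data (`toy` and its values are hypothetical) showing that `pd.IndexSlice` can express the same "skip this level" selection with `:` in place of `slice(None)`. The index is sorted first, which is a cautious choice for slicing on a MultiIndex.

```python
import pandas as pd

# Hypothetical miniature of the multi-indexed words table.
toy = pd.DataFrame({
    "Film": ["F1", "F1", "F1", "F2"],
    "Chapter": ["01", "01", "02", "01"],
    "Race": ["Elf", "Hobbit", "Elf", "Elf"],
    "Character": ["A", "B", "C", "A"],
    "Words": [10, 20, 30, 40],
}).set_index(["Film", "Chapter", "Race", "Character"]).sort_index()

# slice(None) leaves the Chapter level unrestricted.
with_slice = toy.loc[("F1", slice(None), "Elf"), :]

# pd.IndexSlice builds the same key using ":" notation.
idx = pd.IndexSlice
with_indexslice = toy.loc[idx["F1", :, "Elf"], :]

elf_words = with_slice["Words"].tolist()
```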

In [16]:
# How much do Gandalf and Saruman talk in each chapter of The Two Towers?

multi.loc[('The Two Towers',slice(None),slice(None),['Gandalf','Saruman']), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Words
Film,Chapter,Race,Character,Unnamed: 4_level_1
The Two Towers,01: The Foundations Of Stone,Ainur,Gandalf,39
The Two Towers,15: The White Rider,Ainur,Gandalf,298
The Two Towers,17: The Heir Of Númenor,Ainur,Gandalf,226
The Two Towers,20: The King Of The Golden Hall,Ainur,Gandalf,151
The Two Towers,22: Simbelmynë on the Burial Mounds,Ainur,Gandalf,28
The Two Towers,23: The King's Decision,Ainur,Gandalf,165
The Two Towers,58: Forth Eorlingas,Ainur,Gandalf,21
The Two Towers,65: The Battle For Middle Earth Is About To Begin,Ainur,Gandalf,36
The Two Towers,06: The Burning of the Westfold,Ainur,Saruman,187
The Two Towers,25: The Ring Of Barahir,Ainur,Saruman,68


In [17]:
# How much does Isildur talk in all the films?

# The xs method lets us search the level="Character" index only, so all films and chapters are included.

multi.xs('Isildur', level='Character').sum()

Words    1
dtype: int64
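
A brief sketch of how `xs` treats the selected level, on made-up data (`toy` is hypothetical). By default the matched level is dropped from the result; passing `drop_level=False` keeps it.

```python
import pandas as pd

# Hypothetical three-level miniature of the words table.
toy = pd.DataFrame({
    "Film": ["F1", "F1", "F2"],
    "Chapter": ["01", "02", "01"],
    "Character": ["A", "B", "A"],
    "Words": [10, 20, 30],
}).set_index(["Film", "Chapter", "Character"])

# Cross-section on one level: every Film and Chapter where character "A" appears.
dropped = toy.xs("A", level="Character")                  # Character level removed
kept = toy.xs("A", level="Character", drop_level=False)   # Character level retained

dropped_levels = list(dropped.index.names)
kept_levels = list(kept.index.names)
total_words = int(dropped["Words"].sum())
```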

# Question/Actions

Answer the following questions using multi-indexing ideas

1.) Find a list of all the Chapters

2.) Find the Chapter with the least words spoken by an Ainur

3.) Which chapter(s) does Treebeard speak in?

4.) Are there any chapters in which only Sam and/or Frodo speak?

In [18]:
# Can we do this the old-school way, with a boolean mask on the flat table?
df_orig[df_orig.Character=="Isildur"].Words

101    1
Name: Words, dtype: int64

In [19]:
# How much does each character, Hobbit or otherwise, talk?

# Use a pivot table with the words aggregated across all films for each character.

pivoted = df.pivot_table(index = ['Race','Character'],
                         columns = 'Film',
                         values = ['Words'], # aggregate only Words; list form keeps the ('Words', film) column levels
                         aggfunc = 'sum',
                         margins = True, # adds the total row and column
                         margins_name = 'All Films',
                         fill_value = 0).sort_index()
order = [('Words', 'The Fellowship Of The Ring'),
         ('Words', 'The Two Towers'),
         ('Words', 'The Return Of The King'),
         ('Words', 'All Films')]
pivoted = pivoted.sort_values(by=('Words', 'All Films'), ascending=False)
pivoted = pivoted.reindex(order, axis=1)



In [20]:
pivoted

Unnamed: 0_level_0,Unnamed: 1_level_0,Words,Words,Words,Words
Unnamed: 0_level_1,Film,The Fellowship Of The Ring,The Two Towers,The Return Of The King,All Films
Race,Character,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
All Films,,11225,11169,9575,31969
Ainur,Gandalf,2360,964,1504,4828
Hobbit,Sam,557,1044,924,2525
Men,Aragorn,920,822,580,2322
Hobbit,Frodo,967,664,650,2281
Hobbit,...,...,...,...,...
Hobbit,Mrs. Bracegirdle,2,0,0,2
Orc,Mauhur,0,2,0,2
Men,Eothain,0,2,0,2
Hobbit,Proudfoot,1,0,0,1


# Question/Action

Repeat any two questions from the previous question/action using classic dataframe methods.

Which approach seems easier? Is this just a matter of being more familiar with the classic approach?

# Question/Action

Try to come up with a question of your own that is easier to answer using multi-indexing

# More multiindexing: simple examples of setting up and altering a multiindex

From "Pandas in Action" by Boris Paskhaver, Chapter 7

In [21]:
addresses = [
            ("8809 Flair Square", "Toddside", "IL", "37206"),
            ("9901 Austin Street", "Toddside", "IL", "37206"),
            ("905 Hogan Quarter", "Franklin", "IL", "37206"),
        ]

In [22]:
row_index = pd.MultiIndex.from_tuples(
            tuples = addresses,
            names = ["Street", "City", "State", "Zip"]
        )

row_index

MultiIndex([( '8809 Flair Square', 'Toddside', 'IL', '37206'),
            ('9901 Austin Street', 'Toddside', 'IL', '37206'),
            ( '905 Hogan Quarter', 'Franklin', 'IL', '37206')],
           names=['Street', 'City', 'State', 'Zip'])

In [23]:
column_index = pd.MultiIndex.from_tuples(
             [
                 ("Culture", "Restaurants"),
                 ("Culture", "Museums"),
                 ("Services", "Police"),
                 ("Services", "Schools"),
             ]
         )

column_index

MultiIndex([( 'Culture', 'Restaurants'),
            ( 'Culture',     'Museums'),
            ('Services',      'Police'),
            ('Services',     'Schools')],
           )

In [24]:
data = [
            ["C-", "B+", "B-", "A"],
            ["D+", "C", "A", "C+"],
            ["A-", "A", "D+", "F"]
        ]

In [25]:
neighborhoods=pd.DataFrame(
             data = data, index = row_index, columns = column_index
         )
neighborhoods

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Restaurants,Museums,Police,Schools
Street,City,State,Zip,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A
9901 Austin Street,Toddside,IL,37206,D+,C,A,C+
905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F


In [26]:
neighborhoods.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 3 entries, ('8809 Flair Square', 'Toddside', 'IL', '37206') to ('905 Hogan Quarter', 'Franklin', 'IL', '37206')
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   (Culture, Restaurants)  3 non-null      object
 1   (Culture, Museums)      3 non-null      object
 2   (Services, Police)      3 non-null      object
 3   (Services, Schools)     3 non-null      object
dtypes: object(4)
memory usage: 810.0+ bytes


In [27]:
neighborhoods.index

MultiIndex([( '8809 Flair Square', 'Toddside', 'IL', '37206'),
            ('9901 Austin Street', 'Toddside', 'IL', '37206'),
            ( '905 Hogan Quarter', 'Franklin', 'IL', '37206')],
           names=['Street', 'City', 'State', 'Zip'])

In [28]:
neighborhoods.index.names

FrozenList(['Street', 'City', 'State', 'Zip'])

In [29]:
neighborhoods.columns

MultiIndex([( 'Culture', 'Restaurants'),
            ( 'Culture',     'Museums'),
            ('Services',      'Police'),
            ('Services',     'Schools')],
           )

In [30]:
neighborhoods.columns.names = ["Category", "Subcategory"]
neighborhoods.columns.names

FrozenList(['Category', 'Subcategory'])
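
As an alternative sketch (the frame `toy` is hypothetical), level names can also be supplied when the column MultiIndex is built, and `rename_axis` returns a renamed copy rather than modifying the frame in place.

```python
import pandas as pd

# Name the column levels at construction time.
toy_columns = pd.MultiIndex.from_tuples(
    [("Culture", "Museums"), ("Services", "Schools")],
    names=["Category", "Subcategory"],
)
toy = pd.DataFrame([["B+", "A"], ["C", "C+"]], columns=toy_columns)

# rename_axis leaves `toy` untouched and returns a copy with new level names.
renamed = toy.rename_axis(columns=["Group", "Item"])

original_names = list(toy.columns.names)
renamed_names = list(renamed.columns.names)
```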

Below are a number of examples showing how to manipulate data using multi-index approaches

In [31]:
neighborhoods.sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Subcategory,Restaurants,Museums,Police,Schools
Street,City,State,Zip,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A
905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F
9901 Austin Street,Toddside,IL,37206,D+,C,A,C+


In [32]:
neighborhoods.sort_index(ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Subcategory,Restaurants,Museums,Police,Schools
Street,City,State,Zip,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
9901 Austin Street,Toddside,IL,37206,D+,C,A,C+
905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F
8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A


In [33]:
# one sort direction per index level: Street, City, State, Zip
neighborhoods.sort_index(ascending = [True, False, True, True])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Subcategory,Restaurants,Museums,Police,Schools
Street,City,State,Zip,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A
905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F
9901 Austin Street,Toddside,IL,37206,D+,C,A,C+
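
Because Street is the first level and is unique here, the per-level flags above have little visible effect. The following sketch, on made-up addresses (`toy` is hypothetical), names the levels to sort by so that each direction does show up in the result.

```python
import pandas as pd

# Hypothetical two-level index of streets and cities.
toy_index = pd.MultiIndex.from_tuples(
    [("1 A St", "Toddside"), ("2 B St", "Franklin"), ("3 C St", "Toddside")],
    names=["Street", "City"],
)
toy = pd.DataFrame({"Grade": ["A", "B", "C"]}, index=toy_index)

# Sort by City ascending, then by Street descending within each city.
by_city = toy.sort_index(level=["City", "Street"], ascending=[True, False])

street_order = by_city.index.get_level_values("Street").tolist()
```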


In [34]:
neighborhoods["Services"]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Subcategory,Police,Schools
Street,City,State,Zip,Unnamed: 4_level_1,Unnamed: 5_level_1
8809 Flair Square,Toddside,IL,37206,B-,A
9901 Austin Street,Toddside,IL,37206,A,C+
905 Hogan Quarter,Franklin,IL,37206,D+,F


In [35]:
neighborhoods[("Services", "Schools")]

Street              City      State  Zip  
8809 Flair Square   Toddside  IL     37206     A
9901 Austin Street  Toddside  IL     37206    C+
905 Hogan Quarter   Franklin  IL     37206     F
Name: (Services, Schools), dtype: object

In [36]:
neighborhoods[[("Services", "Schools"), ("Culture", "Museums")]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Services,Culture
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Subcategory,Schools,Museums
Street,City,State,Zip,Unnamed: 4_level_2,Unnamed: 5_level_2
8809 Flair Square,Toddside,IL,37206,A,B+
9901 Austin Street,Toddside,IL,37206,C+,C
905 Hogan Quarter,Franklin,IL,37206,F,A
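
The column tuples above combine naturally with row tuples. As a small sketch on a made-up frame (`toy` is hypothetical), passing a full row key and a full column key to `loc` picks out a single cell.

```python
import pandas as pd

# Hypothetical frame with a MultiIndex on both rows and columns.
toy = pd.DataFrame(
    [["B+", "A"], ["C", "C+"]],
    index=pd.MultiIndex.from_tuples(
        [("1 A St", "Toddside"), ("2 B St", "Franklin")],
        names=["Street", "City"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [("Culture", "Museums"), ("Services", "Schools")],
        names=["Category", "Subcategory"],
    ),
)

# Full row tuple plus full column tuple returns one scalar value.
grade = toy.loc[("2 B St", "Franklin"), ("Services", "Schools")]
```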


In [37]:
neighborhoods.xs(key="Franklin",level="City")

Unnamed: 0_level_0,Unnamed: 1_level_0,Category,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Subcategory,Restaurants,Museums,Police,Schools
Street,State,Zip,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
905 Hogan Quarter,IL,37206,A-,A,D+,F


In [38]:
neighborhoods.xs(axis = "columns", key = "Museums", level = "Subcategory").head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Culture
Street,City,State,Zip,Unnamed: 4_level_1
8809 Flair Square,Toddside,IL,37206,B+
9901 Austin Street,Toddside,IL,37206,C
905 Hogan Quarter,Franklin,IL,37206,A


In [39]:
# reordering indices
new_order=["City","State","Zip","Street"]
neighborhoods.reorder_levels(order=new_order)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Category,Culture,Culture,Services,Services
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Subcategory,Restaurants,Museums,Police,Schools
City,State,Zip,Street,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
Toddside,IL,37206,8809 Flair Square,C-,B+,B-,A
Toddside,IL,37206,9901 Austin Street,D+,C,A,C+
Franklin,IL,37206,905 Hogan Quarter,A-,A,D+,F


In [40]:
neighborhoods.reset_index()

Category,Street,City,State,Zip,Culture,Culture,Services,Services
Subcategory,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Restaurants,Museums,Police,Schools
0,8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A
1,9901 Austin Street,Toddside,IL,37206,D+,C,A,C+
2,905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F


In [41]:
neighborhoods.reset_index(col_level = 1)

Category,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Culture,Culture,Services,Services
Subcategory,Street,City,State,Zip,Restaurants,Museums,Police,Schools
0,8809 Flair Square,Toddside,IL,37206,C-,B+,B-,A
1,9901 Austin Street,Toddside,IL,37206,D+,C,A,C+
2,905 Hogan Quarter,Franklin,IL,37206,A-,A,D+,F
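
`reset_index` removes the MultiIndex on the rows only. One possible way to flatten the columns as well, sketched on a made-up frame (`toy` is hypothetical, and joining with "_" is just one naming choice), is to join each column tuple into a single string before resetting the index.

```python
import pandas as pd

# Hypothetical frame with a MultiIndex on both rows and columns.
toy = pd.DataFrame(
    [["B+", "A"], ["C", "C+"]],
    index=pd.MultiIndex.from_tuples(
        [("1 A St", "Toddside"), ("2 B St", "Franklin")],
        names=["Street", "City"],
    ),
    columns=pd.MultiIndex.from_tuples(
        [("Culture", "Museums"), ("Services", "Schools")],
        names=["Category", "Subcategory"],
    ),
)

flat = toy.copy()
# Collapse each (Category, Subcategory) tuple into one flat column label.
flat.columns = ["_".join(pair) for pair in flat.columns]
# Move the row index levels back into ordinary columns.
flat = flat.reset_index()

flat_columns = list(flat.columns)
```

Flattening the columns first keeps the later `reset_index` simple, since the restored row levels then land in an ordinary single-level header.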


# Question/Action

Which approach or tactic used with the street data seems most interesting, unusual or productive?

Apply this to the Lord of the Rings data, making up your own question.