# Indexing

As we've seen, both Series and DataFrames can have indices applied to them. The index is essentially a row-level label, and in pandas the rows correspond to axis zero. Indices can either be autogenerated, such as when we create a new Series without an index, in which case we get numeric values, or they can be set explicitly, as when we use a dictionary object to create the Series, or when we load data from a CSV file and set the appropriate parameters. Another option for setting an index is to use the set_index() function. This function takes a column name, or a list of column names, and promotes those columns to an index. In this lecture we'll explore more about how indexes work in pandas.
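As a quick reminder of the two cases described above, here is a minimal sketch. The names and values are made up for illustration and are not part of the lecture's datasets.

```python
import pandas as pd

# No index given: pandas generates integer labels 0, 1, 2, ...
auto_indexed = pd.Series(["Physics", "Chemistry", "English"])

# Built from a dictionary: the keys become the index labels
explicit_indexed = pd.Series({"Alice": "Physics", "Jack": "Chemistry", "Molly": "English"})

print(list(auto_indexed.index))
print(list(explicit_indexed.index))
```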


Let's talk about the set_index method. It takes a column name from the DataFrame as its input and makes that column the new index. In the process, the existing index is discarded from the DataFrame.

To avoid losing it, we first create a new column in the DataFrame that holds the values of the current index for later use.

The general syntax for the set_index method is as follows:

###### < Loaded Dataframe >=< Loaded Dataframe >.set_index(< Name of column you want to set as index >)

Let's see a few examples.

In [1]:
# Import pandas
import pandas as pd
# Load the CSV file into a DataFrame, using the first column as the index
csvDataframe=pd.read_csv("Admission_Predict.csv",index_col=0)
csvDataframe.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [2]:
# First, lowercase the column names and strip stray whitespace for easier access and editing.
csvDataframe.columns=[x.lower().strip() for x in csvDataframe.columns]

In [3]:
#Printing the edited dataframe.
csvDataframe

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67
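The column clean-up used in cell 2 can be tried on a small frame. This is only a sketch: the toy frame and its trailing-space headers are hypothetical stand-ins for the admissions file, and the `.str` form is shown as an equivalent alternative rather than the method the lecture uses.

```python
import pandas as pd

# Hypothetical toy frame whose headers carry trailing spaces
raw = pd.DataFrame({"GRE Score": [337, 324], "LOR ": [4.5, 4.5], "Chance of Admit ": [0.92, 0.76]})

# List-comprehension form, as in the lecture
cleaned_list = [name.lower().strip() for name in raw.columns]

# Equivalent vectorised form using the .str accessor on the column Index
cleaned_str = raw.columns.str.lower().str.strip()

raw.columns = cleaned_list
print(list(raw.columns))
```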


In [4]:
# Set a different column as the index using set_index. But first, save a copy of the current index.
csvDataframe["New Column"]=csvDataframe.index
# Now change the index of the DataFrame to "chance of admit".
csvDataframe=csvDataframe.set_index("chance of admit")
csvDataframe

chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,New Column
0.92,337,118,4,4.5,4.5,9.65,1,1
0.76,324,107,4,4.0,4.5,8.87,1,2
0.72,316,104,3,3.0,3.5,8.00,1,3
0.80,322,110,3,3.5,2.5,8.67,1,4
0.65,314,103,2,2.0,3.0,8.21,0,5
...,...,...,...,...,...,...,...,...
0.82,324,110,3,3.5,3.5,9.04,1,396
0.84,325,107,3,3.0,3.5,9.11,1,397
0.91,330,116,4,5.0,4.5,9.45,1,398
0.67,312,103,3,3.5,4.0,8.78,0,399


As we can see, there is a new column at the end of the DataFrame holding the old index values, and the index is now chance of admit.
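The same behaviour can be checked on a small frame. The sketch below uses made-up toy data and a hypothetical column name `serial no`. It also shows the `drop=False` parameter of `set_index`, which the lecture does not use, as one way to keep the promoted column available as a regular column too.

```python
import pandas as pd

# Hypothetical toy data standing in for the admissions file
toy = pd.DataFrame(
    {"gre score": [337, 324, 316], "chance of admit": [0.92, 0.76, 0.72]},
    index=pd.Index([1, 2, 3], name="Serial No."),
)

# Default behaviour: the old index ("Serial No.") is discarded
replaced = toy.set_index("chance of admit")

# Copying the old index into a column first keeps its values around
kept = toy.copy()
kept["serial no"] = kept.index
kept = kept.set_index("chance of admit")

# drop=False keeps "chance of admit" as a column as well as the index
both = toy.set_index("chance of admit", drop=False)

print(replaced.columns.tolist())
print(kept.columns.tolist())
print(both.columns.tolist())
```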


To undo the effect of set_index(), we can use the reset_index() method. It converts the current index into a regular column and replaces it with a default zero-based integer index.

The general syntax is as follows:

###### < Variable >=< Name Of DataFrame >.reset_index()

Let's see an example.

In [5]:
csvDataframe=csvDataframe.reset_index()

In [6]:
csvDataframe

,chance of admit,gre score,toefl score,university rating,sop,lor,cgpa,research,New Column
0,0.92,337,118,4,4.5,4.5,9.65,1,1
1,0.76,324,107,4,4.0,4.5,8.87,1,2
2,0.72,316,104,3,3.0,3.5,8.00,1,3
3,0.80,322,110,3,3.5,2.5,8.67,1,4
4,0.65,314,103,2,2.0,3.0,8.21,0,5
...,...,...,...,...,...,...,...,...,...
395,0.82,324,110,3,3.5,3.5,9.04,1,396
396,0.84,325,107,3,3.0,3.5,9.11,1,397
397,0.91,330,116,4,5.0,4.5,9.45,1,398
398,0.67,312,103,3,3.5,4.0,8.78,0,399


As we can see, reset_index removes "chance of admit" from the index position, turns it back into a regular column, and replaces the index with zero-based integer labels.

We have worked with this dataset long enough. Let's import a new CSV file as our DataFrame and learn new ways to manipulate it.
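A minimal sketch of reset_index on toy data (the values are illustrative only). The `drop=True` variant is not used in the lecture. It is included as an aside for the case where the old index should be thrown away instead of kept as a column.

```python
import pandas as pd

# Hypothetical toy frame indexed by "chance of admit"
toy = pd.DataFrame(
    {"gre score": [337, 324]},
    index=pd.Index([0.92, 0.76], name="chance of admit"),
)

# Default: the index becomes a regular column and rows are relabelled 0, 1, ...
as_column = toy.reset_index()

# drop=True discards the old index instead of keeping it as a column
discarded = toy.reset_index(drop=True)

print(as_column.columns.tolist())
print(discarded.columns.tolist())
```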

In [7]:
newCSVdataframe=pd.read_csv("census.csv")

In [8]:
# Print the first few rows of the DataFrame.
newCSVdataframe.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


In [9]:
# Uppercase all the column names and strip extra spaces for consistent access
newCSVdataframe.columns=[x.upper().strip() for x in newCSVdataframe.columns]

In [10]:
# Print the first few rows of the edited DataFrame.
newCSVdataframe.head()

,SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
0,40,3,6,1,0,Alabama,Alabama,4779736,4780127,4785161,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
1,50,3,6,1,1,Alabama,Autauga County,54571,54571,54660,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
2,50,3,6,1,3,Alabama,Baldwin County,182265,182265,183193,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
3,50,3,6,1,5,Alabama,Barbour County,27457,27457,27341,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
4,50,3,6,1,7,Alabama,Bibb County,22915,22919,22861,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


### The .unique method
The general syntax of using .unique is:

###### < Variable >=< Name Of DataFrame >[" Column Whose Unique Values Are To Be Accessed"].unique()

As the name suggests, this method scans the given column and returns its unique values, in order of first appearance.

Let's see an example to understand this.

Let's say we want to get the unique values stored in the column named "STNAME". Let's write the code for it.

In [11]:
initialVariable=newCSVdataframe["STNAME"].unique()

In [12]:
initialVariable

array(['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California',
       'Colorado', 'Connecticut', 'Delaware', 'District of Columbia',
       'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana',
       'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland',
       'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi',
       'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire',
       'New Jersey', 'New Mexico', 'New York', 'North Carolina',
       'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
       'West Virginia', 'Wisconsin', 'Wyoming'], dtype=object)

In [13]:
initialVariable2=newCSVdataframe["CTYNAME"].unique()
initialVariable2

array(['Alabama', 'Autauga County', 'Baldwin County', ..., 'Uinta County',
       'Washakie County', 'Weston County'], dtype=object)

The output of the unique method is an array with a single common dtype.
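For a numeric column the idea is the same. The sketch below uses a hypothetical toy frame with a few SUMLEV-style values. `nunique()` is a related pandas method, not used in the lecture, that returns only the count of distinct values.

```python
import pandas as pd

# Hypothetical toy column of summary-level codes
toy = pd.DataFrame({"SUMLEV": [40, 50, 50, 50, 40, 50]})

levels = toy["SUMLEV"].unique()        # 40 and 50, in order of first appearance
how_many = toy["SUMLEV"].nunique()     # number of distinct values

print(levels)
print(type(levels).__name__)
print(how_many)
```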

### Multi-Level Indexing
A special and elegant feature supported by pandas is multi-level indexing. Until now, our DataFrames have had a single index column around which most operations revolve. Now we are going to discuss DataFrames that have multiple index columns (two, in our case) and operate on them.

To create a multi-level indexed DataFrame, we simply call the set_index method and pass it a list of the column names we want to use as the index. Let's see what this means through an example.

In [14]:
newCSVdataframe=newCSVdataframe.set_index(["STNAME","CTYNAME"])

In [15]:
newCSVdataframe.head()

STNAME,CTYNAME,SUMLEV,REGION,DIVISION,STATE,COUNTY,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,...,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015
Alabama,Alabama,40,3,6,1,0,4779736,4780127,4785161,4801108,4816089,...,0.002295,-0.193196,0.381066,0.582002,-0.467369,1.030015,0.826644,1.383282,1.724718,0.712594
Alabama,Autauga County,50,3,6,1,1,54571,54571,54660,55253,55175,...,7.242091,-2.915927,-3.012349,2.265971,-2.530799,7.606016,-2.626146,-2.722002,2.59227,-2.187333
Alabama,Baldwin County,50,3,6,1,3,182265,182265,183193,186659,190396,...,14.83296,17.647293,21.845705,19.243287,17.197872,15.844176,18.559627,22.727626,20.317142,18.293499
Alabama,Barbour County,50,3,6,1,5,27457,27457,27341,27226,27159,...,-4.728132,-2.50069,-7.056824,-3.904217,-10.543299,-4.874741,-2.758113,-7.167664,-3.978583,-10.543299
Alabama,Bibb County,50,3,6,1,7,22915,22919,22861,22733,22642,...,-5.527043,-5.068871,-6.201001,-0.177537,0.177258,-5.088389,-4.363636,-5.403729,0.754533,1.107861


Note that set_index does not modify the original DataFrame in place. It returns a new DataFrame, which is why we assign the result back to the variable.
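Both points, the two-level index and the fact that the original frame is left untouched, can be seen on a small frame. This is a sketch on toy data: the three rows reuse values from the census output, and the variable names `flat` and `indexed` are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row excerpt in the shape of the census data
flat = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama", "Michigan"],
    "CTYNAME": ["Autauga County", "Bibb County", "Wayne County"],
    "POPESTIMATE2010": [54660, 22861, 1815199],
})

# Passing a list of column names builds a two-level (Multi)Index
indexed = flat.set_index(["STNAME", "CTYNAME"])

print(indexed.index.nlevels)
print(list(indexed.index.names))
print(list(flat.columns))   # the original frame still has all three columns
```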


While we are at it, let's say I don't want to work with all the columns but only a select few: 'BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015', 'POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015'. 'STNAME' and 'CTYNAME' are now index levels rather than columns, so they stay attached to the rows without being listed.

In [16]:
columnWeWant=['BIRTHS2010', 'BIRTHS2011', 'BIRTHS2012', 'BIRTHS2013', 'BIRTHS2014', 'BIRTHS2015', 'POPESTIMATE2010', 
              'POPESTIMATE2011', 'POPESTIMATE2012', 'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']

In [17]:
newCSVdataframe=newCSVdataframe[columnWeWant]
newCSVdataframe.head()

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Alabama,14226,59689,59062,57938,58334,58305,4785161,4801108,4816089,4830533,4846411,4858979
Alabama,Autauga County,151,636,615,574,623,600,54660,55253,55175,55038,55290,55347
Alabama,Baldwin County,517,2187,2092,2160,2186,2240,183193,186659,190396,195126,199713,203709
Alabama,Barbour County,70,335,300,283,260,269,27341,27226,27159,26973,26815,26489
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
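To illustrate why the index levels do not need to appear in the column list, here is a small sketch on toy data (two Alabama rows, with a hypothetical variable name `toy`). `get_level_values` is a standard MultiIndex method, though the lecture does not use it.

```python
import pandas as pd

# Hypothetical toy frame; the numbers mirror two Alabama rows from the output
toy = pd.DataFrame({
    "STNAME": ["Alabama", "Alabama"],
    "CTYNAME": ["Autauga County", "Bibb County"],
    "REGION": [3, 3],
    "BIRTHS2010": [151, 44],
    "POPESTIMATE2010": [54660, 22861],
}).set_index(["STNAME", "CTYNAME"])

# Selecting a list of columns keeps the row index intact
subset = toy[["BIRTHS2010", "POPESTIMATE2010"]]

# Index levels are no longer columns, but they can still be read from the index
print("STNAME" in toy.columns)
counties = subset.index.get_level_values("CTYNAME")
print(list(counties))
```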


We just saw a multi-level indexed DataFrame. Now let's talk about querying it. So far we know two ways to query a DataFrame by row: .loc and .iloc. We will use them again for row-wise querying here.

In a single-index DataFrame, when we passed two arguments to .loc, the first selected the row and the second the column. Here things will be a bit different. Let's understand this through an example with our current DataFrame.

Our DataFrame has two index levels, so to select a single row we pass two labels to .loc: the first for the level-one index (STNAME) and the second for the level-two index (CTYNAME). To select columns as well, the unambiguous form is to wrap the index labels in a tuple and pass the column selection as a separate second argument to .loc.

Let's see examples to get a better understanding.

In [18]:
# Get the population statistics for Bibb County, Alabama
newCSVdataframe.loc["Alabama","Bibb County"]


STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Alabama,Bibb County,44,266,245,259,247,253,22861,22733,22642,22512,22549,22583
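Here is a minimal sketch of row and row-plus-column lookups on a two-level index. It assumes a hypothetical toy frame with a unique, sorted index, so a full key returns a Series, whereas the lecture's output above shows a one-row DataFrame.

```python
import pandas as pd

# Hypothetical toy frame with a (state, county) MultiIndex
toy = pd.DataFrame(
    {"BIRTHS2010": [151, 44, 5918], "POPESTIMATE2010": [54660, 22861, 1815199]},
    index=pd.MultiIndex.from_tuples(
        [("Alabama", "Autauga County"), ("Alabama", "Bibb County"), ("Michigan", "Wayne County")],
        names=["STNAME", "CTYNAME"],
    ),
)

# One label per index level selects a single row
bibb = toy.loc[("Alabama", "Bibb County")]

# Row labels as a tuple first, then the column label
bibb_births = toy.loc[("Alabama", "Bibb County"), "BIRTHS2010"]

print(bibb["POPESTIMATE2010"])
print(bibb_births)
```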


If you are interested in comparing two counties, for example Washtenaw County and Wayne County, you can pass a list of tuples to .loc, with each tuple describing the index labels you wish to query. Since we have a MultiIndex of two levels, the state and the county, each tuple should have two elements: the first for the first index level and the second for the second.

Let's see the code to understand it better.

In [19]:
newCSVdataframe.loc[[("Michigan","Washtenaw County"),("Michigan","Wayne County")]]

STNAME,CTYNAME,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
Michigan,Washtenaw County,977,3826,3780,3662,3683,3709,345563,349048,351213,354289,357029,358880
Michigan,Wayne County,5918,23819,23270,23377,23607,23586,1815199,1801273,1792514,1775713,1766008,1759335
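The list-of-tuples query can be sketched on toy data as follows. The frame is a hypothetical three-row excerpt. The second lookup, passing only the state label, is an extra not covered in the lecture: it shows partial indexing, which returns every county under that state.

```python
import pandas as pd

# Hypothetical toy frame with a (state, county) MultiIndex
toy = pd.DataFrame(
    {"BIRTHS2010": [44, 977, 5918], "POPESTIMATE2010": [22861, 345563, 1815199]},
    index=pd.MultiIndex.from_tuples(
        [("Alabama", "Bibb County"), ("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")],
        names=["STNAME", "CTYNAME"],
    ),
)

# A list of (state, county) tuples selects exactly those rows
pair = toy.loc[[("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")]]

# Only the first level given: all Michigan rows, indexed by county
michigan = toy.loc["Michigan"]

print(pair["BIRTHS2010"].tolist())
print(list(michigan.index))
```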


Okay, so that's how hierarchical indices work in a nutshell. They're a special part of the pandas library which I think can make management and reasoning about data easier. Of course, hierarchical labeling isn't just for rows. For example, you can transpose this matrix and now have hierarchical column labels. And projecting a single column which has these labels works exactly the way you would expect it to. Now, in reality, I don't tend to use hierarchical indices very much, and instead just keep everything as columns and manipulate those. But it's a unique and sophisticated aspect of pandas that is useful to know, especially if viewing your data in a tabular form.
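The transpose remark can be made concrete with a short sketch. It uses a hypothetical toy frame, and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with a (state, county) MultiIndex on the rows
toy = pd.DataFrame(
    {"BIRTHS2010": [44, 977, 5918], "POPESTIMATE2010": [22861, 345563, 1815199]},
    index=pd.MultiIndex.from_tuples(
        [("Alabama", "Bibb County"), ("Michigan", "Washtenaw County"), ("Michigan", "Wayne County")],
        names=["STNAME", "CTYNAME"],
    ),
)

# Transposing moves the hierarchical labels from the rows to the columns
transposed = toy.T

# Projecting one fully specified column gives a Series keyed by the old column names
wayne = transposed[("Michigan", "Wayne County")]

# Projecting by the outer label alone gives all columns under that state
michigan_cols = transposed["Michigan"]

print(transposed.columns.nlevels)
print(wayne["BIRTHS2010"])
print(list(michigan_cols.columns))
```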