The index of a DataFrame is core to how pandas works.

In [3]:
import pandas as pd

In [6]:
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

In [7]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,Asia
1,Albania,89,132,54,4.9,Europe
2,Algeria,25,0,14,0.7,Africa
3,Andorra,245,138,312,12.4,Europe
4,Angola,217,57,45,5.9,Africa


In the results above, the leftmost column (0, 1, 2, 3, 4) is the index, and the labels across the top (country, beer_servings, spirit_servings, wine_servings, total_litres_of_pure_alcohol, continent) are the columns.

Today our focus is on the index:

In [8]:
drinks.index

RangeIndex(start=0, stop=193, step=1)

DataFrame.index holds the index (row labels) of the DataFrame, one label for each row. Every DataFrame has both an index attribute and a columns attribute, and neither is optional. The index is sometimes called the 'row labels'.
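
As a minimal sketch of the two attributes, the snippet below builds a small hypothetical sample (three rows typed in by hand, not loaded from the real drinks file) and reads its row labels and column labels.

```python
import pandas as pd

# Hypothetical three-row sample shaped like the drinks data
sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
})

# Every DataFrame carries both attributes, even if we never set them
row_labels = sample.index      # default integer labels: 0, 1, 2
col_labels = sample.columns    # labels taken from the dict keys

print(list(row_labels))   # [0, 1, 2]
print(list(col_labels))   # ['country', 'beer_servings']
```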

In [9]:
drinks.columns

Index(['country', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

Above we see that the columns attribute is also an Index object; that is its type.

Neither the index nor the columns are considered part of the DataFrame contents. For instance:

In [10]:
drinks.shape 

(193, 6)

In this result, 193 is the number of rows, which doesn't include the column headers, and 6 is the number of columns, which doesn't include the index. In that sense the index is not part of the DataFrame contents. It turns out that both the index and the columns default to integers if none are specified.
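
One way to check this relationship is to compare shape against the lengths of the two label objects. The sketch below uses a hypothetical three-row sample rather than the real drinks file.

```python
import pandas as pd

# Hypothetical three-row sample with two data columns
sample = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
})

n_rows, n_cols = sample.shape

# shape counts only the data: one entry per row label and per column label
print(sample.shape)                         # (3, 2)
print(n_rows == len(sample.index))          # True
print(n_cols == len(sample.columns))        # True
```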

In [11]:
pd.read_table('http://bit.ly/movieusers', header=None, sep='|').head()

Unnamed: 0,0,1,2,3,4
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In the results above we read the data without a header row (header=None), so the column labels are also the default integers 0, 1, 2, 3, 4.

Most of the time people leave the default integer index (0, 1, 2, 3, 4) as it is, but they rarely leave the default integer columns as they are, because we usually identify columns by what's in them.
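
A short sketch of how the default column labels can be replaced at read time, assuming the `names` parameter of `pd.read_csv`. The two data rows are copied from the output above, while the column names in `user_cols` are hypothetical and chosen only for illustration.

```python
import io
import pandas as pd

# Two pipe-separated rows in the same layout as the movie users output above
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n"

# Hypothetical column names, not taken from the original file
user_cols = ["user_id", "age", "gender", "occupation", "zip_code"]

# Without names, the columns fall back to integer labels
no_header = pd.read_csv(io.StringIO(raw), sep="|", header=None)

# With names, the columns are identified by what is in them
named = pd.read_csv(io.StringIO(raw), sep="|", header=None, names=user_cols)

print(list(no_header.columns))  # [0, 1, 2, 3, 4]
print(list(named.columns))
print(list(named.index))        # [0, 1]
```

The index is left at its integer default in both cases; only the column labels change.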

Why does the 'Index' exist?

There are 3 main reasons:

1. Identification.
2. Selection.
3. Alignment.

######  IDENTIFICATION:  

In [12]:
drinks[drinks.continent=='South America']

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
6,Argentina,193,25,221,8.3,South America
20,Bolivia,167,41,8,3.8,South America
23,Brazil,245,145,16,7.2,South America
35,Chile,130,124,172,7.6,South America
37,Colombia,159,76,3,4.2,South America
52,Ecuador,162,74,3,4.2,South America
72,Guyana,93,302,1,7.1,South America
132,Paraguay,213,117,74,7.3,South America
133,Peru,163,160,21,6.1,South America
163,Suriname,128,178,7,5.6,South America


Notice in the results above that the index (the row labels) stayed with the rows. When I filtered the drinks DataFrame, each row kept its original index label; the index was not renumbered from 0.

This is why we say the index is for identification: we can identify which rows we are working with even after filtering the original DataFrame.
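
The same behaviour can be sketched on a small hypothetical sample (four hand-typed rows, not the real drinks file): after filtering, the surviving rows keep their original labels, and those labels still point back to the same rows in the unfiltered DataFrame.

```python
import pandas as pd

# Hypothetical four-row sample
sample = pd.DataFrame({
    "country": ["Albania", "Argentina", "Algeria", "Bolivia"],
    "continent": ["Europe", "South America", "Africa", "South America"],
})

south_america = sample[sample.continent == "South America"]

# The filtered rows keep labels 1 and 3; they are not renumbered to 0 and 1
print(list(south_america.index))    # [1, 3]

# A label from the filtered result identifies the same row in the original
print(sample.loc[3, "country"])     # Bolivia
```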

###### SELECTION: 

What if I want to grab a piece of the drinks DataFrame? We will use something we have used many times: the loc accessor.

loc lets me select a single cell by saying which row it is in (referring to the index label) and which column. For instance:

In [13]:
drinks.loc[23,'beer_servings']

245

As a result, I pull out the single value 245, where the index label is 23 and the column is 'beer_servings'. This works, but it is not ideal, because I have to know that 23 is the label of the Brazil row.

Why use an index at all? Why not keep everything as columns?

The reason is shown below:

In [14]:
drinks.set_index('country', inplace=True)
drinks.head()

Unnamed: 0_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


From the results above, we can see that the DataFrame has changed: the 'country' Series (column) has become the index, and the prior default integer index has disappeared.

In [15]:
drinks.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola',
       'Antigua & Barbuda', 'Argentina', 'Armenia', 'Australia', 'Austria',
       ...
       'Tanzania', 'USA', 'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela',
       'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='country', length=193)

Now the index of the drinks DataFrame runs from Afghanistan through Zimbabwe, and its length is still 193.

In [16]:
drinks.columns

Index(['beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol', 'continent'],
      dtype='object')

Now when we check drinks.columns, 'country' is no longer one of the columns. In fact, if we check the shape of the DataFrame now, this is the result:

In [17]:
drinks.shape

(193, 5)

Now it says 193 by 5 (instead of 6), because 'country' is now the index and the index is not counted as part of the DataFrame contents.

Because we have set 'country' as the index, we can now use loc as below:

In [18]:
drinks.loc['Brazil','beer_servings']

245

So by setting the index to something meaningful to us, we can select data from the DataFrame more easily.
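
The two styles of selection can be compared side by side in a minimal sketch. It uses a hypothetical three-row sample, and it calls set_index without inplace=True so that the original sample stays unchanged and a new DataFrame is returned.

```python
import pandas as pd

# Hypothetical three-row sample
sample = pd.DataFrame({
    "country": ["Albania", "Brazil", "Chile"],
    "beer_servings": [89, 245, 130],
})

# With the default index we must know that Brazil sits at label 1
by_integer_label = sample.loc[1, "beer_servings"]

# With 'country' as the index we can ask for Brazil by name
by_country = sample.set_index("country")
by_name = by_country.loc["Brazil", "beer_servings"]

print(by_integer_label)   # 245
print(by_name)            # 245
print(by_country.shape)   # (3, 1)
```

Returning a new DataFrame instead of modifying in place is only a choice made for this sketch; it keeps both versions available for comparison.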

In [19]:
drinks.head()

Unnamed: 0_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


In the DataFrame above, the 'country' label sits on its own line above the index and may look out of place. It is actually the name of the index.

We don't have to have an index name. It is helpful as an identifier of what the index represents, but we can clear it if we don't need it, as below:

In [20]:
drinks.index.name=None

In [21]:
drinks.head()

Unnamed: 0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


Now the index name is gone. The countries are still the index, but the index no longer has a name.
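
As a small sketch on a hypothetical sample, clearing the name changes only the label attached to the index. The index values themselves are untouched, so selection with loc keeps working.

```python
import pandas as pd

# Hypothetical three-row sample with 'country' moved into the index
by_country = pd.DataFrame({
    "country": ["Albania", "Brazil", "Chile"],
    "beer_servings": [89, 245, 130],
}).set_index("country")

name_before = by_country.index.name    # 'country'

by_country.index.name = None           # clear the name only
name_after = by_country.index.name     # None

print(name_before, name_after)
print(by_country.loc["Chile", "beer_servings"])   # 130
```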

Now let's say we changed our mind: we would like to go back to the default integer index and move 'country' back to being a column. This is how we can do it:

In [22]:
drinks.index.name = 'country'  # 1: restore the index name 'country'

In [23]:
drinks.head()

Unnamed: 0_level_0,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Afghanistan,0,0,0,0.0,Asia
Albania,89,132,54,4.9,Europe
Algeria,25,0,14,0.7,Africa
Andorra,245,138,312,12.4,Europe
Angola,217,57,45,5.9,Africa


In [25]:
drinks.reset_index(inplace=True)  # 2: reset the index; inplace=True modifies drinks itself
drinks.head()

Unnamed: 0,index,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,0,Afghanistan,0,0,0,0.0,Asia
1,1,Albania,89,132,54,4.9,Europe
2,2,Algeria,25,0,14,0.7,Africa
3,3,Andorra,245,138,312,12.4,Europe
4,4,Angola,217,57,45,5.9,Africa


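The result above shows that we are back to the default integer index, and 'country', which was the index, has rejoined the DataFrame as one of the columns. (The extra 'index' column most likely comes from running the reset_index cell twice: a second run moves the unnamed integer index into a column that pandas names 'index'.)

Here it was important to set the name of the index back before doing the reset, because pandas decides what to call the new column based on the name of the index.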
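
The effect of the index name on reset_index can be sketched as follows, using a hypothetical three-row sample. With a named index the restored column takes that name; with an unnamed index pandas falls back to the column name 'index'.

```python
import pandas as pd

# Hypothetical three-row sample
sample = pd.DataFrame({
    "country": ["Albania", "Brazil", "Chile"],
    "beer_servings": [89, 245, 130],
})

# Case 1: the index is named 'country', so the restored column is 'country'
named = sample.set_index("country")
restored = named.reset_index()

# Case 2: the index name is cleared first, so the restored column is 'index'
unnamed = sample.set_index("country")
unnamed.index.name = None
fallback = unnamed.reset_index()

print(list(restored.columns))   # ['country', 'beer_servings']
print(list(fallback.columns))   # ['index', 'beer_servings']
```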

###### Useful hints: 

In [26]:
drinks.describe()  # describe() shows a numerical summary of the numeric columns

Unnamed: 0,index,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol
count,193.0,193.0,193.0,193.0,193.0
mean,96.0,106.160622,80.994819,49.450777,4.717098
std,55.858452,101.143103,88.284312,79.697598,3.773298
min,0.0,0.0,0.0,0.0,0.0
25%,48.0,20.0,4.0,1.0,1.3
50%,96.0,76.0,56.0,8.0,4.2
75%,144.0,188.0,128.0,59.0,7.2
max,192.0,376.0,438.0,370.0,14.4


The result above is itself a DataFrame, and as such it has an index, as shown below:

In [27]:
drinks.describe().index

Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

In [28]:
drinks.describe().columns

Index(['index', 'beer_servings', 'spirit_servings', 'wine_servings',
       'total_litres_of_pure_alcohol'],
      dtype='object')

The point here is not so much that we are going to do something with describe(), although we might.

What we need to be aware of is that many methods return a DataFrame, and if we understand the index and the columns, we know how to interact with that resulting DataFrame.

For instance:

In [29]:
drinks.describe().loc['25%','beer_servings']

20.0

Here drinks.describe() outputs a DataFrame, loc is available on any DataFrame, and we pulled out 20.0 using the index label ('25%') and the column label ('beer_servings').
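
The same pattern can be sketched on a tiny hypothetical column of five values, where the summary statistics are easy to verify by hand.

```python
import pandas as pd

# Hypothetical five-value column; the median is 76 and the maximum is 376
sample = pd.DataFrame({"beer_servings": [0, 20, 76, 188, 376]})

summary = sample.describe()          # the result is itself a DataFrame

# Its index holds the statistic names, its columns the numeric column names
print(list(summary.index))
print(list(summary.columns))         # ['beer_servings']

median = summary.loc["50%", "beer_servings"]
maximum = summary.loc["max", "beer_servings"]
print(median, maximum)               # 76.0 376.0
```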

So we should always be aware of the type of object we are interacting with, and take advantage of the index and the columns wherever we can.