<h1 class="title">Python Tips<br>#5 Indexing</h1>
<br>
<center>Michael Siebel</center>
<br>

In [1]:
# Remove warnings
import warnings
warnings.filterwarnings('ignore')

%run ../HTML_Functions.ipynb 

When it comes to data wrangling, perhaps the key distinction between different software packages is how they index data frames.  In Python (using Pandas), every data frame has to have at least one index.  An index is a set of labels, one per row, that helps users identify each row.  Just as columns have names, rows need to have names (although these "names" can be, and often are, integers).

This is loosely similar to SQL, where a table usually has a Primary Key, although a Pandas index is not required to be unique.  In contrast, Stata does not use an index, but demands users sort on a column that will be used for merging data.  R data frames have row names, but they are rarely used for merging, and R does not require its data to be sorted at all.

# Load Data

To start, let's import our data:

In [2]:
# Load Libraries
## Main Data Wrangling Library
import pandas as pd
## Practice data
from sklearn import datasets

# Load IRIS data as a DataFrame (X) and a Series (y)
X, y = datasets.load_iris(return_X_y=True, as_frame=True)
## Add column names
X.columns = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]

# Preview first 5 rows of data
display(X.head())

,Sepal Length,Sepal Width,Petal Length,Petal Width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


Because we have not defined an index, Pandas defaults to using integers.  In the output above, the index ranges from 0 to 4.  While R, Stata, SAS, and SPSS count from 1, Python counts from 0, which is also true of other general-purpose programming languages such as C and Java.
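
To check the default index directly, a minimal sketch might look like the following.  The `toy` data frame is made up for illustration and is not the iris data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({"Sepal Length": [5.1, 4.9, 4.7],
                    "Sepal Width": [3.5, 3.0, 3.2]})

# With no index specified, Pandas assigns integer labels starting at 0
default_labels = list(toy.index)
print(default_labels)            # [0, 1, 2]
print(type(toy.index).__name__)  # RangeIndex
```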

# Dropping a Row

If we drop the second row, the index will skip the value of the missing row.

In [3]:
X = X.drop(1, axis=0)
display(X.head())

,Sepal Length,Sepal Width,Petal Length,Petal Width
0,5.1,3.5,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4


The second row, labeled 1, is now missing from the output.  This shows that the index is "fixed": Pandas does not renumber the remaining rows to fill the gap.
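
The same behavior can be seen by listing the index labels after a drop.  This is a small sketch on a hypothetical `toy` frame rather than the iris data.

```python
import pandas as pd

# Hypothetical toy data with the default labels 0, 1, 2, 3
toy = pd.DataFrame({"Sepal Length": [5.1, 4.9, 4.7, 4.6]})

# Drop the row labeled 1; the other rows keep their original labels
dropped = toy.drop(1, axis=0)
labels_after_drop = list(dropped.index)
print(labels_after_drop)  # [0, 2, 3]
```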

# Selecting a Row

Pandas uses loc[] (location) and iloc[] (integer location) to select data.  loc[] uses row and column names; iloc[] uses row and column positions, counted from 0.  Right now, the second row is named 2.  If we use loc[] with a value of 2 for the row name and a colon to select all columns, we get:

In [4]:
X.loc[2, :]

Sepal Length    4.7
Sepal Width     3.2
Petal Length    1.3
Petal Width     0.2
Name: 2, dtype: float64

If we use iloc[] with the same value of 2, it will select a different row: the third row.  Why?  iloc[] looks at the row's position and not its name, so iloc[] sees row 1 as 0, row 2 as 1, and row 3 as 2:

In [5]:
X.iloc[2, :]

Sepal Length    4.6
Sepal Width     3.1
Petal Length    1.5
Petal Width     0.2
Name: 3, dtype: float64

In other words, before we dropped the second row (named 1 because we start counting at 0), loc[2, :] would have produced the same output, because the row's name did not change.  The output of iloc[2, :] changed after the drop because the row positions shifted.
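
A before-and-after comparison makes the difference concrete.  The sketch below uses a hypothetical one-column `toy` frame, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with the default labels 0, 1, 2, 3
toy = pd.DataFrame({"Sepal Length": [5.1, 4.9, 4.7, 4.6]})

# Before the drop, label 2 and position 2 point at the same row
before_loc = toy.loc[2, "Sepal Length"]   # 4.7
before_iloc = toy.iloc[2, 0]              # 4.7

# Drop the row labeled 1
toy = toy.drop(1, axis=0)

# After the drop, the label still finds the same row, but position 2 has moved on
after_loc = toy.loc[2, "Sepal Length"]    # 4.7
after_iloc = toy.iloc[2, 0]               # 4.6
print(before_loc, before_iloc, after_loc, after_iloc)
```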

# Named Index

Let's say we want to name each row starting at 1 instead of 0 and ignore the dropped row.  This loop builds each row's name as a string using the base Python function str().

In [6]:
# Create a column named 'Index' (empty strings, so it can hold text labels)
X.loc[:, 'Index'] = ""
# Loop through each row by position and add a row name
for i in range(len(X)):
    X.iloc[i, 4] = "Row: " + str(i + 1)
    
# Set our new column as the index
X = X.set_index('Index')
    
display(X.head())

Index,Sepal Length,Sepal Width,Petal Length,Petal Width
Row: 1,5.1,3.5,1.4,0.2
Row: 2,4.7,3.2,1.3,0.2
Row: 3,4.6,3.1,1.5,0.2
Row: 4,5.0,3.6,1.4,0.2
Row: 5,5.4,3.9,1.7,0.4


In [7]:
X.loc["Row: 2", :]

Sepal Length    4.7
Sepal Width     3.2
Petal Length    1.3
Petal Width     0.2
Name: Row: 2, dtype: float64
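
As an alternative to the loop, the labels could be built in one step and assigned straight to the index.  This is only a sketch on a hypothetical `toy` frame whose labels already have a gap; it is not the approach used above.

```python
import pandas as pd

# Hypothetical toy data whose labels skip 1, as after a drop
toy = pd.DataFrame({"Sepal Length": [5.1, 4.7, 4.6]}, index=[0, 2, 3])

# Build "Row: 1", "Row: 2", ... from row positions, ignoring the old labels
toy.index = ["Row: " + str(i + 1) for i in range(len(toy))]
toy.index.name = "Index"

# Select by the new row name
second_row_value = toy.loc["Row: 2", "Sepal Length"]
print(second_row_value)  # 4.7
```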

# Reset Index

Finally, if we want to reset the index, which renames each row with an integer value starting from 0, we can use the function reset_index().  By default, this converts the current index into the first column.  To prevent this, we use the option drop=True.

In [8]:
X = X.reset_index(drop=True)
display(X.head())

,Sepal Length,Sepal Width,Petal Length,Petal Width
0,5.1,3.5,1.4,0.2
1,4.7,3.2,1.3,0.2
2,4.6,3.1,1.5,0.2
3,5.0,3.6,1.4,0.2
4,5.4,3.9,1.7,0.4
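
To see what drop=True changes, the two calls can be compared side by side.  The sketch below uses a hypothetical two-row `toy` frame with a named index.

```python
import pandas as pd

# Hypothetical toy data with a named, string-valued index
toy = pd.DataFrame({"Sepal Length": [5.1, 4.7]},
                   index=pd.Index(["Row: 1", "Row: 2"], name="Index"))

# Default: the old index is kept as the first column
kept = toy.reset_index()
# drop=True: the old index is discarded
discarded = toy.reset_index(drop=True)

print(list(kept.columns))       # ['Index', 'Sepal Length']
print(list(discarded.columns))  # ['Sepal Length']
print(list(discarded.index))    # [0, 1]
```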


# Conclusion

Indexing can be annoying if you are coming from R, because Pandas matches rows on their index when combining data, which gets in the way of purely positional combinations such as the Python equivalents of rbind() and cbind().  It can be valuable when you get into machine learning, as it enables you to create K-Folds for cross validation and then reassemble these K-Folds into the original dataset more easily.
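
Both points can be illustrated with pd.concat().  The sketch below is one possible illustration, not a full cross-validation setup: the `toy` data and the every-third-row fold split are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data with the default labels 0-5
toy = pd.DataFrame({"Sepal Length": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4]})

# Split the rows into 3 illustrative folds by position (every third row)
folds = [toy.iloc[k::3] for k in range(3)]

# Stacking the folds puts rows out of order, but each row keeps its label
stacked = pd.concat(folds)
stacked_labels = list(stacked.index)
print(stacked_labels)  # [0, 3, 1, 4, 2, 5]

# Sorting on the index restores the original row order
restored = stacked.sort_index()
print(list(restored.index))  # [0, 1, 2, 3, 4, 5]

# rbind()-style stacking that discards the old labels and renumbers from 0
renumbered = pd.concat(folds, ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3, 4, 5]
```

Note that `renumbered` has clean labels but its rows stay in fold order, whereas `restored` recovers the original order because the labels were preserved.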

