# Boolean Indexing

Boolean indexing (also known as boolean selection) can be a confusing term, but for the purposes of
pandas, it refers to selecting rows by providing a boolean value (True or False) for each
row. These boolean values are usually stored in a Series or NumPy ndarray and are
usually created by applying a boolean condition to one or more columns in a DataFrame.
In arithmetic, True counts as 1 and False as 0, so the Series methods that work with
numerical values, such as sum and mean, also work with booleans.

In [1]:
import pandas as pd
movie = pd.read_csv('https://raw.githubusercontent.com/DatasRev/source-files/master/csv/movie.csv', index_col='movie_title')
movie.head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
Avatar,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
Pirates of the Caribbean: At World's End,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
Spectre,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
Star Wars: Episode VII - The Force Awakens,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,...,,,,,,,12.0,7.1,,0


In [2]:
movie['duration'] > 120

movie_title
Avatar                                          True
Pirates of the Caribbean: At World's End        True
Spectre                                         True
The Dark Knight Rises                           True
Star Wars: Episode VII - The Force Awakens     False
John Carter                                     True
Spider-Man 3                                    True
Tangled                                        False
Avengers: Age of Ultron                         True
Harry Potter and the Half-Blood Prince          True
Batman v Superman: Dawn of Justice              True
Superman Returns                                True
Quantum of Solace                              False
Pirates of the Caribbean: Dead Man's Chest      True
The Lone Ranger                                 True
Man of Steel                                    True
The Chronicles of Narnia: Prince Caspian        True
The Avengers                                    True
Pirates of the Caribbean: On Stran

In [3]:
movie_2_hours = movie['duration'] > 120   # This does not filter anything yet; we only get True and False values.
movie_2_hours.head(10)

movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
John Carter                                    True
Spider-Man 3                                   True
Tangled                                       False
Avengers: Age of Ultron                        True
Harry Potter and the Half-Blood Prince         True
Name: duration, dtype: bool

In [4]:
# We can now use this Series to determine the number of movies that are longer than two hours:
movie_2_hours.sum()   # counts the True values in the Series above

1039

In [5]:
# To find the percentage of movies in the dataset longer than two hours, use the mean method:
# That is, the mean of the 0 and 1 values gives the proportion of True values.
movie_2_hours.mean()

0.2113506916192026

In [6]:
movie_2_hours.value_counts()  # Note these numbers; they will be needed four steps later.

False    3877
True     1039
Name: duration, dtype: int64

In [7]:
movie_2_hours.value_counts(normalize=True)

False    0.788649
True     0.211351
Name: duration, dtype: float64

Unfortunately, the proportion computed in In [5] is misleading. The duration column has a
few missing values. If you look back at the DataFrame output from In [1], you
will see that the last row is missing a value for duration. The boolean condition
in In [2] returns False for this row, because a greater-than comparison with a missing
value (NaN) is always False. We need to drop the missing values first, then
evaluate the condition and take the mean:

In [8]:
movie['duration'].dropna().gt(120).mean()    # 120 means 120 minutes
# The .gt method is short for "greater than"; it is the method form of the > operator.

0.21199755152009794
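
The effect of missing values on the mean can be seen on a small example. This is a minimal sketch; the four durations below are made up for illustration and are not taken from the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical durations in minutes; one value is missing.
duration = pd.Series([150.0, 90.0, 130.0, np.nan])

# NaN > 120 evaluates to False, so the missing row still counts in the denominator.
with_missing = (duration > 120).mean()

# Dropping missing values first restricts the mean to rows with a known duration.
without_missing = duration.dropna().gt(120).mean()

print(with_missing)     # 2 of 4 rows
print(without_missing)  # 2 of 3 rows
```

The first figure understates the share of long movies because the row with the unknown duration is silently treated as "not longer than 120 minutes".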

In [9]:
movie['duration'].gt(120).head()    # same as movie['duration'] > 120

movie_title
Avatar                                         True
Pirates of the Caribbean: At World's End       True
Spectre                                        True
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
Name: duration, dtype: bool

In [10]:
movie_2_hours.describe()

count      4916
unique        2
top       False
freq       3877
Name: duration, dtype: object

For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

In [11]:
movie[movie['duration'].gt(120)].count()

color                        1038
director_name                1031
num_critic_for_reviews       1035
duration                     1039
director_facebook_likes      1031
actor_3_facebook_likes       1039
actor_2_name                 1039
actor_1_facebook_likes       1039
gross                         938
genres                       1039
actor_1_name                 1039
num_voted_users              1039
cast_total_facebook_likes    1039
actor_3_name                 1039
facenumber_in_poster         1038
plot_keywords                1028
movie_imdb_link              1039
num_user_for_reviews         1039
language                     1037
country                      1039
content_rating               1001
budget                        985
title_year                   1031
actor_2_facebook_likes       1039
imdb_score                   1039
aspect_ratio                 1016
movie_facebook_likes         1039
dtype: int64

It is possible to compare two columns from the same DataFrame to produce a boolean
Series. For instance, we could determine the percentage of movies that have actor 1 with
more Facebook likes than actor 2.

In [12]:
actors = movie[['actor_1_facebook_likes',
                'actor_2_facebook_likes']].dropna()
(actors['actor_1_facebook_likes'] >
 actors['actor_2_facebook_likes']).mean()


0.9777687130328371

## Constructing multiple boolean conditions

In Python, boolean expressions use the built-in logical operators and, or, and not. These
keywords do not work with boolean indexing in pandas and are respectively replaced with
&, |, and ~. Additionally, each expression must be wrapped in parentheses or an error will
be raised.
All values in a Series can be compared against a scalar value using the standard comparison
operators (<, >, ==, !=, <=, >=).
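
As a minimal sketch of the three operators, the following uses a small made-up Series (the values and labels are hypothetical) and combines two conditions element by element:

```python
import pandas as pd

# Hypothetical scores, for illustration only.
score = pd.Series([3, 6, 8, 9], index=['a', 'b', 'c', 'd'])

high = score > 7           # False, False, True, True
even = score % 2 == 0      # False, True, True, False

both = high & even         # element-wise "and"
either = high | even       # element-wise "or"
not_high = ~high           # element-wise "not"

print(both.tolist())
print(either.tolist())
print(not_high.tolist())
```

Each result is itself a boolean Series with the same index, so it can be combined further or passed to the indexing operator.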

In this recipe, we construct multiple boolean
expressions before combining them together to find all the movies that have an
imdb_score greater than 8, a content_rating of PG-13, and a title_year either before
2000 or after 2009.

In [13]:
criteria1 = movie.imdb_score > 8
criteria2 = movie.content_rating == 'PG-13'
criteria3 = ((movie.title_year < 2000) | (movie.title_year > 2009))

The criteria3 variable is created by two independent boolean expressions. Each
expression must be enclosed in parentheses to function properly. The pipe character, |, is
used to create a logical or condition between each of the values in both Series.
All three criteria need to be True to match the requirements of the recipe. They are each
combined together with the ampersand character, &, which creates a logical and condition
between each Series value.

In [14]:
criteria_final = criteria1 & criteria2 & criteria3
criteria_final.head()

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

A consequence of pandas using different syntax for the logical operators is that operator
precedence is no longer the same. The comparison operators have a higher precedence than
and, or, and not. However, the new operators for pandas (the bitwise operators &, |, and ~)
have a higher precedence than the comparison operators, thus the need for parentheses. An
example can help clear this up. Take the following expression:

In [15]:
5 < 10 and 3 > 4

False

Let's take a look at what would happen if the expression in criteria3 was written as
follows:
>>> movie.title_year < 2000 | movie.title_year > 2009
TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]
As the bitwise operators have higher precedence than the comparison operators, 2000 |
movie.title_year is evaluated first, which is nonsensical and raises an error.
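
The same precedence rule can be observed with plain integers. The sketch below is illustrative only: the `title_year` Series is a made-up stand-in for `movie.title_year`, and the exact exception raised by the unparenthesized expression may differ between pandas versions, so both plausible types are caught.

```python
import pandas as pd

# With plain integers, & binds tighter than the comparisons:
# 3 > 2 & 1 > 0 is parsed as 3 > (2 & 1) > 0, that is, 3 > 0 > 0.
unparenthesized = 3 > 2 & 1 > 0
parenthesized = (3 > 2) & (1 > 0)
print(unparenthesized, parenthesized)

# Hypothetical years standing in for movie.title_year.
title_year = pd.Series([1995.0, 2005.0, 2012.0])

try:
    result = title_year < 2000 | title_year > 2009
    raised = False
except (TypeError, ValueError) as err:
    # The exception type and message depend on the pandas version.
    raised = True
    print(type(err).__name__)

# With parentheses, each comparison is evaluated before the | operator.
correct = (title_year < 2000) | (title_year > 2009)
print(correct.tolist())
```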

Many objects in Python have boolean representation. For instance, all
integers except 0 are considered True. All strings except the empty string
are True. All non-empty sets, tuples, dictionaries, and lists are True. An
empty DataFrame or Series does not evaluate as True or False and instead
an error is raised. In general, to retrieve the truthiness of a Python object,
pass it to the bool function.
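
A short sketch of these truthiness rules, using only built-in objects and an empty DataFrame:

```python
import pandas as pd

# Non-zero numbers and non-empty strings or containers are truthy.
truthy = [bool(5), bool('a'), bool([0]), bool({'k': 1})]

# Zero, the empty string, and empty containers are falsy.
falsy = [bool(0), bool(''), bool([]), bool({}), bool(set())]

# pandas objects refuse to collapse into a single truth value.
try:
    bool(pd.DataFrame())
    ambiguous = False
except ValueError:
    ambiguous = True

print(truthy, falsy, ambiguous)
```

This is why the keywords and, or, and not fail on a Series: they ask for a single truth value, which pandas declines to provide.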

## Filtering with boolean indexing

In [16]:
crit_b1 = movie.imdb_score < 5
crit_b2 = movie.content_rating == 'R'
crit_b3 = ((movie.title_year >= 2000) &
           (movie.title_year <= 2010))
final_crit_b = crit_b1 & crit_b2 & crit_b3

In [17]:
# Combine the two sets of criteria using the pandas or operator. This yields a
# boolean Series of all movies that are members of either set:
final_crit_a = criteria_final
final_crit_all = final_crit_a | final_crit_b
final_crit_all.head()

movie_title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                          True
Star Wars: Episode VII - The Force Awakens    False
dtype: bool

Once you have your boolean Series, you simply pass it to the indexing operator
to filter the data:

In [18]:
movie[final_crit_all].head()

movie_title,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
The Dark Knight Rises,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
The Avengers,Color,Joss Whedon,703.0,173.0,0.0,19000.0,Robert Downey Jr.,26000.0,623279547.0,Action|Adventure|Sci-Fi,...,1722.0,English,USA,PG-13,220000000.0,2012.0,21000.0,8.1,1.85,123000
Captain America: Civil War,Color,Anthony Russo,516.0,147.0,94.0,11000.0,Scarlett Johansson,21000.0,407197282.0,Action|Adventure|Sci-Fi,...,1022.0,English,USA,PG-13,250000000.0,2016.0,19000.0,8.2,2.35,72000
Guardians of the Galaxy,Color,James Gunn,653.0,121.0,571.0,3000.0,Vin Diesel,14000.0,333130696.0,Action|Adventure|Sci-Fi,...,1097.0,English,USA,PG-13,170000000.0,2014.0,14000.0,8.1,2.35,96000
Interstellar,Color,Christopher Nolan,712.0,169.0,22000.0,6000.0,Anne Hathaway,11000.0,187991439.0,Adventure|Drama|Sci-Fi,...,2725.0,English,USA,PG-13,165000000.0,2014.0,11000.0,8.6,2.35,349000


The filter worked, but the result includes every column of the DataFrame, so it is
hard to check manually whether the right rows were kept. Let's filter both rows
and columns with the .loc indexer:

In [19]:
cols = ['imdb_score', 'content_rating', 'title_year']
movie_filtered = movie.loc[final_crit_all, cols]
movie_filtered.head(10)

movie_title,imdb_score,content_rating,title_year
The Dark Knight Rises,8.5,PG-13,2012.0
The Avengers,8.1,PG-13,2012.0
Captain America: Civil War,8.2,PG-13,2016.0
Guardians of the Galaxy,8.1,PG-13,2014.0
Interstellar,8.6,PG-13,2014.0
Inception,8.8,PG-13,2010.0
The Martian,8.1,PG-13,2015.0
Town & Country,4.4,R,2001.0
Sex and the City 2,4.3,R,2010.0
Rollerball,3.0,R,2002.0


In [20]:
# To replicate the final_crit_a variable with one long expression,
# we can do the following:
final_crit_a2 = ((movie.imdb_score > 8) &
                 (movie.content_rating == 'PG-13') &
                 ((movie.title_year < 2000) |
                  (movie.title_year > 2009)))
final_crit_a2.equals(final_crit_a)

True

## Replicating boolean indexing with index selection
It is possible to replicate specific cases of boolean selection by taking advantage of the
index. Selection through the index is more intuitive and makes for greater readability (book, p. 145).

In [21]:
college = pd.read_csv('https://raw.githubusercontent.com/DatasRev/source-files/master/csv/college.csv')
college[college['STABBR'] == 'TX'].head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
3610,Abilene Christian University,Abilene,TX,0.0,0.0,0.0,1,530.0,545.0,0.0,...,0.0454,0.0423,0.0045,0.0468,1,0.2595,0.5527,0.0381,40200,25985
3611,Alvin Community College,Alvin,TX,0.0,0.0,0.0,0,,,0.0,...,0.0002,0.0,0.0143,0.7123,1,0.1549,0.0625,0.2841,34500,6750
3612,Amarillo College,Amarillo,TX,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0001,0.0085,0.6922,1,0.3786,0.1573,0.3431,31700,10950
3613,Angelina College,Lufkin,TX,0.0,0.0,0.0,0,,,0.0,...,0.0264,0.0005,0.0,0.56,1,0.5308,0.0,0.2603,26900,PrivacySuppressed
3614,Angelo State University,San Angelo,TX,0.0,0.0,0.0,0,475.0,490.0,0.0,...,0.0285,0.0331,0.0011,0.1289,1,0.4068,0.5279,0.1407,37700,21319.5


To replicate this using index selection, we need to move the STABBR column into
the index. We can then use label-based selection with the .loc indexer:

In [22]:
college2 = college.set_index('STABBR')
college2.loc['TX'].head()
# This is rather odd: earlier the institution was the index, and it does not make much sense for the state to be the index.

STABBR,INSTNM,CITY,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
TX,Abilene Christian University,Abilene,0.0,0.0,0.0,1,530.0,545.0,0.0,3572.0,...,0.0454,0.0423,0.0045,0.0468,1,0.2595,0.5527,0.0381,40200,25985
TX,Alvin Community College,Alvin,0.0,0.0,0.0,0,,,0.0,4682.0,...,0.0002,0.0,0.0143,0.7123,1,0.1549,0.0625,0.2841,34500,6750
TX,Amarillo College,Amarillo,0.0,0.0,0.0,0,,,0.0,9346.0,...,0.0,0.0001,0.0085,0.6922,1,0.3786,0.1573,0.3431,31700,10950
TX,Angelina College,Lufkin,0.0,0.0,0.0,0,,,0.0,3825.0,...,0.0264,0.0005,0.0,0.56,1,0.5308,0.0,0.2603,26900,PrivacySuppressed
TX,Angelo State University,San Angelo,0.0,0.0,0.0,0,475.0,490.0,0.0,5290.0,...,0.0285,0.0331,0.0011,0.1289,1,0.4068,0.5279,0.1407,37700,21319.5


According to the book's timings (p. 147), boolean indexing takes about three times as long
as index selection. Setting the index does not come for free, so that operation should be
timed as well.
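
One way to run such a comparison outside a notebook is the standard-library timeit module. The following is a minimal sketch on a synthetic DataFrame (the 5,000 random rows and the `value` column are assumptions made for illustration, not the college data), so the absolute numbers and even the ratios will differ from the book's and from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the college data: unsorted, duplicated state labels.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'STABBR': rng.choice(['TX', 'CA', 'NY', 'FL', 'WA'], size=5000),
    'value': rng.random(5000),
})
df_indexed = df.set_index('STABBR')

# Time 200 repetitions of each operation.
t_bool = timeit.timeit(lambda: df[df['STABBR'] == 'TX'], number=200)
t_index = timeit.timeit(lambda: df_indexed.loc['TX'], number=200)
t_set = timeit.timeit(lambda: df.set_index('STABBR'), number=200)

print(f'boolean selection: {t_bool:.4f} s')
print(f'index selection:   {t_index:.4f} s')
print(f'set_index:         {t_set:.4f} s')
```

If the index is set only to perform a single selection, the cost of set_index has to be added to the index-selection time before the two approaches are compared.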

Let's select Texas (TX), California (CA), and New York (NY).
With boolean selection, you can use the isin method but with indexing,
just pass a list to .loc:

In [23]:
states = ['TX', 'CA', 'NY']
college[college['STABBR'].isin(states)]   # boolean selection (not displayed; only the last expression is shown)
college2.loc[states]                      # index selection

STABBR,INSTNM,CITY,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,UGDS,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
TX,Abilene Christian University,Abilene,0.0,0.0,0.0,1,530.0,545.0,0.0,3572.0,...,0.0454,0.0423,0.0045,0.0468,1,0.2595,0.5527,0.0381,40200,25985
TX,Alvin Community College,Alvin,0.0,0.0,0.0,0,,,0.0,4682.0,...,0.0002,0.0000,0.0143,0.7123,1,0.1549,0.0625,0.2841,34500,6750
TX,Amarillo College,Amarillo,0.0,0.0,0.0,0,,,0.0,9346.0,...,0.0000,0.0001,0.0085,0.6922,1,0.3786,0.1573,0.3431,31700,10950
TX,Angelina College,Lufkin,0.0,0.0,0.0,0,,,0.0,3825.0,...,0.0264,0.0005,0.0000,0.5600,1,0.5308,0.0000,0.2603,26900,PrivacySuppressed
TX,Angelo State University,San Angelo,0.0,0.0,0.0,0,475.0,490.0,0.0,5290.0,...,0.0285,0.0331,0.0011,0.1289,1,0.4068,0.5279,0.1407,37700,21319.5
TX,Arlington Baptist College,Arlington,0.0,0.0,0.0,1,,,0.0,214.0,...,0.0000,0.0047,0.0000,0.1682,1,0.4978,0.4892,0.2251,34200,22905
TX,Arlington Career Institute,Grand Prairie,0.0,0.0,0.0,0,,,0.0,204.0,...,0.0000,0.0000,0.0000,0.2843,1,0.6186,0.7119,0.7745,27600,9500
TX,The Art Institute of Houston,Houston,0.0,0.0,0.0,0,,,0.0,1887.0,...,0.0000,0.0000,0.0419,0.3466,1,0.6183,0.7604,0.3845,32600,30750
TX,Austin College,Sherman,0.0,0.0,0.0,1,600.0,595.0,0.0,1272.0,...,0.0031,0.0267,0.0031,0.0016,1,0.2867,0.7581,0.0124,47800,26000
TX,Austin Community College District,Austin,0.0,0.0,0.0,0,,,0.0,32581.0,...,0.0233,0.0335,0.0532,0.7340,1,0.2393,0.2447,0.3914,34400,8601.5


Pandas implements the index differently based on whether the index is unique or sorted.

## Selecting with unique and sorted indexes

Index selection performance drastically improves when the index is unique or sorted. The
prior recipe used an unsorted index that contained duplicates, which makes for relatively
slow selections.

In [26]:
college2.index.is_monotonic_increasing   # True if each index value is equal to or greater than the previous one

False

In [27]:
# Sort the index from college2 and store it as another object:
college3 = college2.sort_index()
college3.index.is_monotonic_increasing

True

Timing the selection of the state of Texas (TX) from all three DataFrames is skipped here.
According to the book, the sorted index performs nearly an order of magnitude faster than
boolean selection. Let's now turn towards unique indexes. For this, we use the institution
name as the index:

In [28]:
college_unique = college.set_index('INSTNM')
college_unique.index.is_unique

True

In [30]:
# Let's select Stanford University with boolean indexing:
college[college['INSTNM'] == 'Stanford University']

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
4217,Stanford University,Stanford,CA,0.0,0.0,0.0,0,730.0,745.0,0.0,...,0.1067,0.0819,0.0031,0.0,1,0.1556,0.1256,0.0401,86000,12782


In [31]:
# Let's select Stanford University with index selection:
college_unique.loc['Stanford University']

CITY                  Stanford
STABBR                      CA
HBCU                         0
MENONLY                      0
WOMENONLY                    0
RELAFFIL                     0
SATVRMID                   730
SATMTMID                   745
DISTANCEONLY                 0
UGDS                      7018
UGDS_WHITE              0.3752
UGDS_BLACK              0.0591
UGDS_HISP               0.1607
UGDS_ASIAN              0.1979
UGDS_AIAN               0.0114
UGDS_NHPI               0.0038
UGDS_2MOR               0.1067
UGDS_NRA                0.0819
UGDS_UNKN               0.0031
PPTUG_EF                     0
CURROPER                     1
PCTPELL                 0.1556
PCTFLOAN                0.1256
UG25ABV                 0.0401
MD_EARN_WNE_P10          86000
GRAD_DEBT_MDN_SUPP       12782
Name: Stanford University, dtype: object

In [32]:
%timeit college[college['INSTNM'] == 'Stanford University']

768 µs ± 21.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [34]:
%timeit college_unique.loc['Stanford University']

111 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


When the index is not sorted and contains duplicates, as with college2, pandas will need
to check every single value in the index in order to make the correct selection. When the
index is sorted, as with college3, pandas takes advantage of an algorithm called binary
search to greatly improve performance.
In the second half of the recipe, we use a unique column as the index. Pandas implements
unique indexes with a hash table, which makes for even faster selection. Each index label
can be looked up in nearly the same time regardless of the length of the index.
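
One way to glimpse these three cases is the Index.get_loc method, which, per the pandas documentation, returns an integer for a unique index, a slice for a monotonic index, and a boolean mask otherwise. The tiny indexes below are made up for illustration; the sketch shows the shape of the lookup result, not the internal algorithm itself.

```python
import pandas as pd

unique_index = pd.Index(['a', 'b', 'c'])
sorted_dupes = pd.Index(['a', 'b', 'b', 'c'])
unsorted_dupes = pd.Index(['a', 'b', 'c', 'b'])

# Unique index: a single integer position.
print(unique_index.get_loc('b'))

# Sorted index with duplicates: a slice covering the contiguous matches.
print(sorted_dupes.get_loc('b'))

# Unsorted index with duplicates: a boolean mask over every position.
print(unsorted_dupes.get_loc('b'))
```

The boolean mask in the last case reflects the fact that every value in the index had to be checked, which matches the slow path described for college2.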

Boolean selection gives much more flexibility than index selection as it is possible to
condition on any number of columns. In this recipe, we used a single column as the index. It
is possible to concatenate multiple columns together to form an index. For instance, in the
following code, we set the index equal to the concatenation of the city and state columns:

In [35]:
college.index = college['CITY'] + ', ' + college['STABBR']
college = college.sort_index()
college.head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
"ARTESIA, CA",Angeles Institute,ARTESIA,CA,0.0,0.0,0.0,0,,,0.0,...,0.0175,0.0088,0.0088,0.0,1,0.6275,0.8138,0.5429,,16850
"Aberdeen, SD",Presentation College,Aberdeen,SD,0.0,0.0,0.0,1,440.0,480.0,0.0,...,0.0284,0.0142,0.0823,0.2865,1,0.4829,0.756,0.3097,35900.0,25000
"Aberdeen, SD",Northern State University,Aberdeen,SD,0.0,0.0,0.0,0,480.0,475.0,0.0,...,0.0219,0.0425,0.0024,0.1872,1,0.2272,0.4303,0.1766,33600.0,24847
"Aberdeen, WA",Grays Harbor College,Aberdeen,WA,0.0,0.0,0.0,0,,,0.0,...,0.0937,0.0009,0.025,0.182,1,0.453,0.1502,0.5087,27000.0,11490
"Abilene, TX",Hardin-Simmons University,Abilene,TX,0.0,0.0,0.0,1,508.0,515.0,0.0,...,0.0298,0.0159,0.0102,0.0685,1,0.3256,0.5547,0.0982,38700.0,25864


In [36]:
# From here, we can select all colleges from a particular city and state combination without
# boolean indexing. Let's select all colleges from Miami, FL:
college.loc['Miami, FL'].head()

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SATVRMID,SATMTMID,DISTANCEONLY,...,UGDS_2MOR,UGDS_NRA,UGDS_UNKN,PPTUG_EF,CURROPER,PCTPELL,PCTFLOAN,UG25ABV,MD_EARN_WNE_P10,GRAD_DEBT_MDN_SUPP
"Miami, FL",New Professions Technical Institute,Miami,FL,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0,0.0,0.4464,1,0.8701,0.678,0.8358,18700,8682
"Miami, FL",Management Resources College,Miami,FL,0.0,0.0,0.0,0,,,0.0,...,0.0,0.0,0.0,0.0,1,0.4239,0.5458,0.8698,PrivacySuppressed,12182
"Miami, FL",Strayer University-Doral,Miami,FL,,,,1,,,,...,,,,,1,,,,49200,36173.5
"Miami, FL",Keiser University- Miami,Miami,FL,,,,1,,,,...,,,,,1,,,,29700,26063
"Miami, FL",George T Baker Aviation Technical College,Miami,FL,0.0,0.0,0.0,0,,,0.0,...,0.0046,0.0,0.0,0.5686,1,0.2567,0.0,0.4366,38600,PrivacySuppressed


We can compare the speed of this compound index selection with boolean indexing. There
is more than an order of magnitude difference.
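
A minimal sketch of such a comparison follows. It uses a four-row made-up table in place of the college data (the institution names are hypothetical), checks that both approaches select the same rows, and times them with the standard-library timeit module. On a table this small the timings are not representative of the difference quoted above.

```python
import timeit

import pandas as pd

# Small made-up table standing in for the college data.
df = pd.DataFrame({
    'INSTNM': ['A College', 'B Institute', 'C University', 'D College'],
    'CITY': ['Miami', 'Miami', 'Austin', 'Miami'],
    'STABBR': ['FL', 'FL', 'TX', 'OH'],
})
df.index = df['CITY'] + ', ' + df['STABBR']
df = df.sort_index()

# Compound index selection versus the equivalent two-condition boolean selection.
by_index = df.loc['Miami, FL']
by_bool = df[(df['CITY'] == 'Miami') & (df['STABBR'] == 'FL')]
print(by_index['INSTNM'].tolist() == by_bool['INSTNM'].tolist())

t_index = timeit.timeit(lambda: df.loc['Miami, FL'], number=200)
t_bool = timeit.timeit(
    lambda: df[(df['CITY'] == 'Miami') & (df['STABBR'] == 'FL')], number=200)
print(f'index selection:   {t_index:.4f} s')
print(f'boolean selection: {t_bool:.4f} s')
```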
