[![Binder](https://mybinder.org/badge.svg)](https://mybinder.org/v2/gh/treehouse-projects/python-introducing-pandas/master?filepath=s2n2-selecting-data.ipynb)

# Selecting Data

Oftentimes you'll want to grab a subset of records that meet certain criteria. You can do this by indexing the `DataFrame`, much like you've seen done with a NumPy `ndarray`.

In [1]:
import os
import pandas as pd

users = pd.read_csv(os.path.join('data', 'users.csv'), index_col=0)
# Pop out a quick sanity check
len(users)

467

CashBox uses a referral system: everyone you refer earns you a $5 credit. Let's see if we can find everyone who hasn't yet taken advantage of that deal. The number of referrals a user has made is stored in the **`referral_count`** column.

In [2]:
# This vectorized comparison returns a new `Series`, which we are naming so we can use it later
no_referrals_index = users['referral_count'] < 1
# See how the boolean `Series` returned includes all rows from the `DataFrame`.
#  The value is the result of each comparison
no_referrals_index.head()

aaron.wrightaaron6549    False
abanks                   False
abarnes                  False
achristensen             False
acole                    False
Name: referral_count, dtype: bool

Using the boolean `Series` we just created, **`no_referrals_index`**, we can retrieve all rows where that comparison was `True`.

In [3]:
users[no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
ahenry,Angela,Henry,angela@yahoo.com,True,2018-03-09,0,30.41
amber,Amber,,amber@gmail.com,True,2018-01-30,0,33.12
amber.carteramber2778,Amber,Carter,carter@yahoo.com,True,2018-05-31,0,58.68
anderson7178,Jonathan,Anderson,janderson@gmail.com,True,2018-06-02,0,8.15
beverly.taylorbeverly9627,Beverly,Taylor,beverly@gmail.com,False,2018-09-07,0,1.6
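
To see the mechanics in isolation, here is a minimal sketch on a tiny made-up `DataFrame`. The `toy` data and its index labels are hypothetical stand-ins, not rows from `users.csv`. It shows that the comparison produces one boolean per row, and that indexing with it keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical stand-in data, not taken from users.csv
toy = pd.DataFrame(
    {'referral_count': [6, 0, 2, 0]},
    index=['ann', 'bob', 'cat', 'dan'],
)

# The comparison yields one boolean per row, aligned on the index
mask = toy['referral_count'] < 1

# True counts as 1, so summing the mask counts the matching rows
matches = int(mask.sum())

# Indexing with the mask keeps only the rows where it is True
selected = toy[mask]

print(matches)
print(selected.index.tolist())
```

Summing the mask is a quick way to check how many rows a filter will return before you actually apply it.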


## Inverted index
A handy shortcut is to prefix the index with a `~` (tilde). This returns the inverse of the boolean `Series`. While I wish that the `~` was called the "opposite day" operator, it is in fact called the bitwise NOT operator.

In [4]:
~no_referrals_index.head()

aaron.wrightaaron6549    True
abanks                   True
abarnes                  True
achristensen             True
acole                    True
Name: referral_count, dtype: bool

In [5]:
# Use the inverse of the index to find where referral values DO NOT equal zero
users[~no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron.wrightaaron6549,Aaron,Wright,aaron.wrightaaron6226@gmail.com,True,2018-01-30,6,32.12
abanks,Anthony,Banks,banks@garrett-ramsey.com,True,2018-04-07,3,0.86
abarnes,Ashley,Barnes,ashley@mcbride.com,True,2018-01-05,2,89.01
achristensen,Amanda,Christensen,amanda@yahoo.com,False,2018-03-27,1,42.67
acole,Anthony,Cole,anthony.coleanthony2659@hotmail.com,True,2018-05-17,3,85.4
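
Two details about `~` are worth a quick sketch, again on a made-up `Series` (the labels are hypothetical). First, a method call binds more tightly than `~`, so `~mask.head(2)` inverts the result of `head` rather than taking the head of the inverse; for `head` the outcome is the same either way. Second, Python's `not` keyword is not a substitute, because it asks the whole `Series` for a single truth value.

```python
import pandas as pd

# Hypothetical boolean Series standing in for no_referrals_index
mask = pd.Series([True, False, True], index=['ann', 'bob', 'cat'])

# Element-wise inversion
inverted = ~mask

# `.head(2)` runs first, then `~` inverts its result;
# for head() this matches inverting first and slicing second
same = (~mask.head(2)).equals((~mask).head(2))

# `not` needs one truth value, which a multi-element Series refuses to give
try:
    not mask
    raised = False
except ValueError:
    raised = True

print(inverted.tolist())
print(same, raised)
```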


## In `loc`
A boolean `Series` may also be used as the row indexer in `DataFrame.loc`.

In [6]:
# Select rows where there are no referrals, and select only the following ordered columns
users.loc[no_referrals_index, ['balance', 'email']].head()

Unnamed: 0,balance,email
ahenry,30.41,angela@yahoo.com
amber,33.12,amber@gmail.com
amber.carteramber2778,58.68,carter@yahoo.com
anderson7178,8.15,janderson@gmail.com
beverly.taylorbeverly9627,1.6,beverly@gmail.com
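
Here is the same pattern as a self-contained sketch, assuming a small hypothetical `toy` frame in place of `users`. The first argument to `loc` picks the rows (the boolean `Series`) and the second picks the columns, in the order you list them.

```python
import pandas as pd

# Hypothetical stand-in data, not taken from users.csv
toy = pd.DataFrame(
    {
        'email': ['ann@example.com', 'bob@example.com', 'cat@example.com'],
        'referral_count': [6, 0, 0],
        'balance': [32.12, 30.41, 8.15],
    },
    index=['ann', 'bob', 'cat'],
)

mask = toy['referral_count'] == 0

# Rows come from the mask, columns from the list (order is preserved)
subset = toy.loc[mask, ['balance', 'email']]

print(subset)
```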


It is also possible to do the comparison inline, without storing the index in a variable.

In [7]:
users[users['referral_count'] == 0].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
ahenry,Angela,Henry,angela@yahoo.com,True,2018-03-09,0,30.41
amber,Amber,,amber@gmail.com,True,2018-01-30,0,33.12
amber.carteramber2778,Amber,Carter,carter@yahoo.com,True,2018-05-31,0,58.68
anderson7178,Jonathan,Anderson,janderson@gmail.com,True,2018-06-02,0,8.15
beverly.taylorbeverly9627,Beverly,Taylor,beverly@gmail.com,False,2018-09-07,0,1.6


Just like a NumPy `ndarray`, it's possible for a boolean `Series` to be combined with another boolean `Series` using bitwise operators such as `&` (and) and `|` (or).

**NOTE**: Remember to surround your expressions with parentheses to control the order of operations.

In [8]:
# Select all users where they haven't made a referral AND their email has been verified
users[(users['referral_count'] == 0) & (users['email_verified'] == True)].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
ahenry,Angela,Henry,angela@yahoo.com,True,2018-03-09,0,30.41
amber,Amber,,amber@gmail.com,True,2018-01-30,0,33.12
amber.carteramber2778,Amber,Carter,carter@yahoo.com,True,2018-05-31,0,58.68
anderson7178,Jonathan,Anderson,janderson@gmail.com,True,2018-06-02,0,8.15
cathy,Cathy,Beck,cathy@hotmail.com,True,2018-01-12,0,47.36
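
The parentheses matter because `&` binds more tightly than `==`. The sketch below, on a hypothetical `toy` frame, shows one way to sidestep the issue by naming each condition first, and what happens in this example when the parentheses are left off.

```python
import pandas as pd

# Hypothetical stand-in data, not taken from users.csv
toy = pd.DataFrame(
    {
        'referral_count': [0, 0, 3],
        'email_verified': [True, False, True],
    },
    index=['ann', 'bob', 'cat'],
)

# Naming each condition keeps the combined expression readable
no_refs = toy['referral_count'] == 0
verified = toy['email_verified']  # already boolean, so `== True` is optional

both = toy[no_refs & verified]    # AND: every condition must hold
either = toy[no_refs | verified]  # OR: at least one condition holds

# Without parentheses, Python evaluates `0 & toy['email_verified']` first
# and then chains the two `==` comparisons; in this sketch that raises
# an error instead of filtering
try:
    toy[toy['referral_count'] == 0 & toy['email_verified'] == True]
    failed = False
except (ValueError, TypeError):
    failed = True

print(both.index.tolist())
print(either.index.tolist())
print(failed)
```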


# Practice Challenge 
CashBox wants to know the top referrers with verified email addresses, so that they can send them some more motivational emails. Currently, anyone with **5 or more** referrals is considered a top referrer.

In [9]:
## CHALLENGE - Find the top referrers ##
# Select users that have a referral count greater than or equal to 5 and have verified emails 
users

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron.wrightaaron6549,Aaron,Wright,aaron.wrightaaron6226@gmail.com,True,2018-01-30,6,32.12
abanks,Anthony,Banks,banks@garrett-ramsey.com,True,2018-04-07,3,0.86
abarnes,Ashley,Barnes,ashley@mcbride.com,True,2018-01-05,2,89.01
achristensen,Amanda,Christensen,amanda@yahoo.com,False,2018-03-27,1,42.67
acole,Anthony,Cole,anthony.coleanthony2659@hotmail.com,True,2018-05-17,3,85.40
acortez,Alyssa,Cortez,cortez8509@yahoo.com,True,2018-06-21,2,77.34
acosta,Christian,Acosta,christian.acostachristian7501@gmail.com,True,2018-06-21,6,83.46
adam,Adam,,adam@hotmail.com,True,2018-02-20,1,44.98
adam.fergusonadam5149,Adam,Ferguson,ferguson@hotmail.com,True,2018-05-20,2,92.47
adrian,Adrian,Lamb,adrian@yahoo.com,True,2018-07-26,6,30.01


In [None]:
from tests.helpers import check

check(__name__, 'Find the top referrers')