# SLU2 - Subsetting data in Pandas: Learning notebook

In this notebook we will cover the following topics: 

* Indexing
* Selecting columns
* Selecting rows
* Chain indexing (not good) vs Multi-axis indexing (good)
* Masks
* Where
* Subsetting on conditions
* Adding columns and rows
* Removing columns and rows
    
You will make use of the following basic pandas functions and methods:

* [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* [`sort_index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html)
* [`reset_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reset_index.html)
* [`set_index`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html)
* [`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)
* [`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)
* [`mask`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.mask.html)
* [`where`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html)
* [`isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html)
* [`assign`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html)
* [`drop`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html)

<br>

Let's start by importing pandas, like we learned in the previous unit. It will be the only thing that we will need for this learning unit as well.

In [1]:
import pandas as pd

# This is an option to preview fewer rows in the notebook's cell outputs
pd.options.display.max_rows = 10

## Indexing in pandas

An **index** is a set of labels that lets us locate data points in a DataFrame or Series more easily. You can think of it as an address system for our data, similar to the catalogue of a library or archive: you pick the main characteristic used to organize the collection (e.g. book title or genre), and everything is organized and located using that characteristic.

![data indexing](images/looking_for_data.jpg)

In the same way, pandas uses labels to organize and track data. Both rows and columns have these labels; we usually refer loosely to the row labels as the **index** and to the column labels as **column names**. Proper indexing gives us efficient ways of finding data across our datasets.

**Indexing** is then the process of selecting particular rows and columns of data from a DataFrame using this address system. For example, you might want to point at the 3rd row or the 2nd column by position, or refer to a row or column by its label. You'll see how to do all of these in this notebook.

<br>

### Dataframe index (row labelling)

Let's start by understanding how rows are labelled. We'll read the data used in this unit from the file `airbnb_input.csv`, which is located in the `data/` directory. For this, we'll use the function `read_csv`, which was already shown in the previous unit.

In [2]:
# Read the data in file airbnb_input.csv into a pandas DataFrame
df = pd.read_csv('data/airbnb_input.csv')

# Preview the first rows of the DataFrame.
df.head()

Unnamed: 0,room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


Pandas automatically creates an index for the rows that ranges from 0 to the length of the DataFrame minus 1 - this is the leftmost column you see above. But what if we want to use another column as the row labels? For example, ids such as __room_id__ or __host_id__ might sound like good candidates for the index.

To do that, pass the name of the column you want to use as row labels to `read_csv` through the `index_col` argument. Let's say we want to use the column __room_id__ as the DataFrame index:

In [3]:
# Read the data in file airbnb_input.csv into a pandas DataFrame and use column room_id as the DataFrame index.
df = pd.read_csv('data/airbnb_input.csv', index_col='room_id')

# Preview the first rows of the DataFrame.
df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
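
The effect of `index_col` can also be checked without the real file. The sketch below is a minimal illustration that assumes a hypothetical three-row CSV held in memory (the rows are copied from the preview above, with only some of the columns); it is not the actual `airbnb_input.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for data/airbnb_input.csv
csv_text = """room_id,host_id,neighborhood,price
6499,14455,Belém,57.0
17031,66015,Alvalade,46.0
25659,107347,Santa Maria Maior,69.0
"""

# Default: pandas builds a 0..n-1 index and room_id stays a regular column
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col moves room_id out of the columns and into the index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="room_id")

print(list(default_df.index))    # [0, 1, 2]
print(indexed_df.index.name)     # room_id
print(list(indexed_df.columns))  # ['host_id', 'neighborhood', 'price']
```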


### Sorting the index

With the `sort_index` function, we can sort the DataFrame along the index.

In this particular case, our DataFrame df was already sorted along the index, according to the natural order of the type of data it holds - here the ids are numbers, so they are sorted from smallest to largest. However, we can sort from largest to smallest id by using the __ascending=False__ parameter.

In [4]:
# Original df - sorted from smaller to bigger
df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0


In [5]:
# df with the index sorted from bigger to smaller room_id
df.sort_index(ascending=False)

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
19400722,28219108,Entire home/apt,Areeiro,0,0.0,5,3.0,75.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
...,...,...,...,...,...,...,...,...
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
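
To see sorting actually reorder rows, here is a small sketch on a hypothetical toy DataFrame whose index is deliberately out of order (the ids are borrowed from the data above; the frame itself is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, deliberately out of order
toy = pd.DataFrame(
    {"price": [69.0, 57.0, 46.0]},
    index=pd.Index([25659, 6499, 17031], name="room_id"),
)

ascending = toy.sort_index()                  # smallest room_id first
descending = toy.sort_index(ascending=False)  # largest room_id first

print(list(ascending.index))   # [6499, 17031, 25659]
print(list(descending.index))  # [25659, 17031, 6499]
print(list(toy.index))         # [25659, 6499, 17031] (the original is unchanged)
```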


### Changing the index

We can also change the index of a DataFrame after it's loaded, with the functions `reset_index` and `set_index`. When we do this, we can control whether the labels involved are also kept as a regular column. This is done through the argument __drop__, available in both functions, although it means something slightly different in each, as described below.


#### Resetting the index

The function `reset_index` will reset your index to the default one from pandas - a range from 0 to the length of the DataFrame minus 1, as mentioned above. In this function, the default behavior is to not drop the index (`drop=False`). See below what happens when we run it on our dataframe, without dropping the previous index (__room_id__).

In [6]:
# Resetting the index and keeping the old as a column (room_id) - default behavior
df.reset_index()

Unnamed: 0,room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...,...
13227,19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
13228,19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
13229,19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
13230,19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


Now pass `drop=True` and notice that the __room_id__ column is missing from the resulting DataFrame.

In [7]:
# Resetting the index and dropping it 
df.reset_index(drop=True)

Unnamed: 0,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
0,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
1,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
2,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
3,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
4,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...
13227,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
13228,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
13229,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
13230,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


#### Setting the index

The function `set_index` allows us to pick a column as our new index. By default, the old index is discarded and replaced by this column. Also by default (`drop=True`), the column is moved to the index and is no longer kept as a regular column. We can override this by setting `drop=False`. See both examples below:

In [8]:
# Setting column neighborhood as the new index - default behavior moves the column to the index
df.set_index('neighborhood')

neighborhood,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
Belém,14455,Entire home/apt,8,5.0,2,1.0,57.0
Alvalade,66015,Entire home/apt,0,0.0,2,1.0,46.0
Santa Maria Maior,107347,Entire home/apt,63,5.0,3,1.0,69.0
Santa Maria Maior,125768,Entire home/apt,225,4.5,4,1.0,58.0
Santa Maria Maior,126415,Entire home/apt,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...
São Vicente,135915593,Entire home/apt,0,0.0,6,3.0,415.0
Santa Maria Maior,5376796,Entire home/apt,0,0.0,3,1.0,50.0
Santo António,6115933,Entire home/apt,0,0.0,6,4.0,138.0
São Vicente,97139334,Entire home/apt,0,0.0,4,1.0,56.0


In [9]:
# Setting column neighborhood as the new index - use `drop` argument to keep as a regular column besides the index
df.set_index('neighborhood', drop=False)

neighborhood,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
Belém,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
Alvalade,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
Santa Maria Maior,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
Santa Maria Maior,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
Santa Maria Maior,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...
São Vicente,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
Santa Maria Maior,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
Santo António,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
São Vicente,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


__Note__: Neither `set_index` nor `reset_index` mutates your current DataFrame; each returns a new DataFrame with the desired index. In all the examples above we only displayed that output without assigning it to a variable, so our original DataFrame still has the index it got from `df = pd.read_csv('data/airbnb_input.csv', index_col='room_id')` - __room_id__, as demonstrated below.

In [10]:
# Looking into index of a dataframe
df.index.name

'room_id'
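
The behaviour described in this section can be summarized in one short sketch. It assumes a hypothetical two-row DataFrame (names such as `toy` and `toy_by_hood` are made up for illustration) and shows what each call returns, and that the original is left untouched unless the result is assigned.

```python
import pandas as pd

# Hypothetical toy data with room_id as the index
toy = pd.DataFrame(
    {"neighborhood": ["Belém", "Alvalade"], "price": [57.0, 46.0]},
    index=pd.Index([6499, 17031], name="room_id"),
)

kept = toy.reset_index()                 # room_id becomes a regular column
dropped = toy.reset_index(drop=True)     # room_id is discarded
by_hood = toy.set_index("neighborhood")  # room_id index is replaced

print(list(kept.columns))     # ['room_id', 'neighborhood', 'price']
print(list(dropped.columns))  # ['neighborhood', 'price']
print(by_hood.index.name)     # neighborhood
print(toy.index.name)         # room_id (toy itself was never modified)

# To keep a change, assign the result to a name
toy_by_hood = toy.set_index("neighborhood")
```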

<br>
    
## Selecting columns

As mentioned before, **indexing** is what we call the process of selecting particular rows and columns of data from a DataFrame. We'll start by seeing how we can make use of the columns' labels - the __column names__ - to perform this task.

There are two ways of doing this:

* using dot notation (`dataframe.column_name`)
* using bracket notation (`dataframe[column_name]`)

<br>

### Selecting columns by name - dot notation

Using __dot notation__, you can select __one column__ from your DataFrame, obtaining a __Series__ with the column values. Try this out by selecting the __room_type__ column using dot notation:

In [11]:
df.room_type

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object

### Selecting columns by name - brackets notation

Using __brackets__, we can select __one or more columns__ from the DataFrame. Try selecting the same column as before:

In [12]:
df['room_type']

room_id
6499        Entire home/apt
17031       Entire home/apt
25659       Entire home/apt
29248       Entire home/apt
29396       Entire home/apt
                 ...       
19388006    Entire home/apt
19393935    Entire home/apt
19396300    Entire home/apt
19397373    Entire home/apt
19400722    Entire home/apt
Name: room_type, Length: 13232, dtype: object

Now try selecting multiple columns, such as __room_type__ and __neighborhood__. Notice that the output of the multiple selection is now a __DataFrame__ itself, containing the index and both selected columns.

In [13]:
df[['room_type', 'neighborhood']]

room_id,room_type,neighborhood
6499,Entire home/apt,Belém
17031,Entire home/apt,Alvalade
25659,Entire home/apt,Santa Maria Maior
29248,Entire home/apt,Santa Maria Maior
29396,Entire home/apt,Santa Maria Maior
...,...,...
19388006,Entire home/apt,São Vicente
19393935,Entire home/apt,Santa Maria Maior
19396300,Entire home/apt,Santo António
19397373,Entire home/apt,São Vicente
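
As a compact recap of the two notations, the sketch below uses a hypothetical two-row DataFrame to compare the type of object each selection returns. One caveat worth keeping in mind: dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, whereas brackets work for any column name.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Private room"],
        "neighborhood": ["Belém", "Arroios"],
        "price": [57.0, 23.0],
    },
    index=pd.Index([6499, 6555042], name="room_id"),
)

one_dot = toy.room_type            # Series
one_bracket = toy["room_type"]     # the same Series
one_as_frame = toy[["room_type"]]  # one-column DataFrame (note the list)
two_cols = toy[["room_type", "neighborhood"]]

print(type(one_dot).__name__)       # Series
print(one_dot.equals(one_bracket))  # True
print(type(one_as_frame).__name__)  # DataFrame
print(two_cols.shape)               # (2, 2)
```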


<br>

## Selecting rows

We'll now use the row labels - the DataFrame index - to select rows. We will show you two ways of doing this:

* Selecting rows by index position (`iloc`)
* Selecting rows by index labels (`loc`)

### Selecting rows by index position - iloc

With the `iloc` indexer you can select specific rows from a DataFrame by their position. To do this, you specify one integer, a list of integers or a slice inside square brackets. Positions run from 0 to the DataFrame length minus 1 (remember that Python starts counting at 0); negative positions count from the end.

See how you can select the first row with this method. Notice that the output produced is a __Series__.

In [14]:
df.iloc[0]

host_id                           14455
room_type               Entire home/apt
neighborhood                      Belém
reviews                               8
overall_satisfaction                5.0
accommodates                          2
bedrooms                            1.0
price                              57.0
Name: 6499, dtype: object

Now use a list of indices to fetch multiple rows. Notice that the output produced is now a __Dataframe__.

In [15]:
df.iloc[[0, 2, 4, 6]]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0


Now see how you can provide a slice - a range in the format `start_row:end_row` - to obtain multiple rows. As with Python lists, the row at `start_row` is included and the row at `end_row` is excluded. When your slice starts at the beginning of the DataFrame (your `start_row` is 0) or runs to the end of the DataFrame (your `end_row` is the length of the DataFrame), you can omit that bound. See some examples of slicing below.

In [16]:
# Get first 3 rows
df.iloc[:3]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0


In [17]:
# Get all rows from row 13225 to the last row
df.iloc[13225:]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
19386235,55922649,Entire home/apt,Santa Maria Maior,0,0.0,4,1.0,87.0
19386898,95929866,Entire home/apt,Arroios,0,0.0,3,1.0,91.0
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
19400722,28219108,Entire home/apt,Areeiro,0,0.0,5,3.0,75.0


In [18]:
# Get the rows from position 4211 up to, but not including, position 7442
df.iloc[4211:7442]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6553475,21712933,Private room,Misericórdia,69,4.5,2,1.0,40.0
6553789,12771657,Private room,Misericórdia,4,3.5,2,1.0,44.0
6554612,15702462,Entire home/apt,Arroios,71,4.5,3,1.0,40.0
6555042,32194684,Private room,Arroios,46,4.5,1,1.0,23.0
6555767,34195436,Private room,Arroios,11,5.0,2,1.0,31.0
...,...,...,...,...,...,...,...,...
12965399,13728607,Entire home/apt,Santa Maria Maior,24,4.0,8,3.0,173.0
12966379,44954448,Entire home/apt,Santo António,58,4.5,4,1.0,52.0
12966744,24519638,Private room,São Vicente,0,0.0,1,1.0,40.0
12967980,60982357,Entire home/apt,Lumiar,15,5.0,4,2.0,58.0


Another possibility when slicing is to pick only part of the elements in the provided range, by a given __step__. This will mean that we only select one row every `step` rows. This notation follows:

`start_row:end_row:step`

See how this works by selecting every second row among the first 10 rows (a step of 2).

In [19]:
df.iloc[0:10:2,:]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0
29915,128890,Entire home/apt,Avenidas Novas,28,4.5,3,1.0,58.0
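
The slicing rules for `iloc` are easier to verify on a frame small enough to read in full. The following sketch assumes a hypothetical six-row DataFrame (positions 0 to 5) and prints the index labels each slice returns.

```python
import pandas as pd

# Hypothetical toy frame with 6 rows, at positions 0..5
toy = pd.DataFrame(
    {"price": [57.0, 46.0, 69.0, 58.0, 67.0, 75.0]},
    index=pd.Index([6499, 17031, 25659, 29248, 29396, 29872], name="room_id"),
)

print(list(toy.iloc[:3].index))     # [6499, 17031, 25659]   positions 0, 1, 2
print(list(toy.iloc[4:].index))     # [29396, 29872]         position 4 to the end
print(list(toy.iloc[1:4].index))    # [17031, 25659, 29248]  position 4 is excluded
print(list(toy.iloc[0:6:2].index))  # [6499, 25659, 29396]   every second row
print(toy.iloc[-1]["price"])        # 75.0                   negative = from the end
```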


### Selecting rows by index name - loc

Selecting by position is useful, but we often want to find a row by its label instead. You can do this with the `loc` indexer. It follows the same notation as `iloc`, but what you provide is an actual label.

Let's see how you can do this. Start by selecting the row with __room_id__ 29396. Notice that once again, when you select only one row, a __Series__ object is returned.

In [20]:
df.loc[29396]

host_id                            126415
room_type                 Entire home/apt
neighborhood            Santa Maria Maior
reviews                               132
overall_satisfaction                  5.0
accommodates                            4
bedrooms                              1.0
price                                67.0
Name: 29396, dtype: object

Also note that if you search for an index that doesn't exist, you'll get a KeyError:

In [21]:
try:
    df.loc[100]
except KeyError as e:
    print('\033[91mError\033[0m: {} not found'.format(e))

Error: 100 not found


Similarly to `iloc`, you can also pass lists or slices of labels. You should get __DataFrame__ objects back.

In [22]:
# Retrieving rows from list of labels
df.loc[[29396,17031]]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0


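We can also slice with `loc`. Unlike `iloc`, a label slice includes both endpoints. It is important to sort the index before doing this operation: on a sorted index, the slice bounds don't even need to be existing labels, as in the example below.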

In [23]:
# Retrieving rows from slices
df.loc[50000:900000,:]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
55116,259744,Private room,Avenidas Novas,1,0.0,2,2.0,68.0
56906,270457,Entire home/apt,São Vicente,86,4.5,3,1.0,52.0
57850,276092,Entire home/apt,Estrela,52,4.5,3,2.0,70.0
59227,184400,Entire home/apt,Misericórdia,25,4.5,7,3.0,138.0
...,...,...,...,...,...,...,...,...
896034,990553,Entire home/apt,Santa Maria Maior,105,4.5,5,2.0,46.0
896173,4777554,Entire home/apt,Campo de Ourique,1,0.0,4,2.0,69.0
896591,4773942,Entire home/apt,Santa Maria Maior,159,4.5,4,1.0,64.0
898016,4791230,Entire home/apt,São Vicente,21,4.5,3,1.0,52.0
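
Two details of label slicing are easy to miss: the stop label is included, and the index should be sorted. The sketch below illustrates both on a hypothetical four-row DataFrame; `shuffled` and `raised_key_error` are names made up for this example.

```python
import pandas as pd

# Hypothetical toy data with a sorted integer index
toy = pd.DataFrame(
    {"price": [57.0, 46.0, 69.0, 58.0]},
    index=pd.Index([6499, 17031, 25659, 29248], name="room_id"),
)

# Position slice: the stop position (2) is excluded
print(list(toy.iloc[0:2].index))         # [6499, 17031]

# Label slice: the stop label (25659) is included
print(list(toy.loc[6499:25659].index))   # [6499, 17031, 25659]

# On a sorted index the bounds do not have to be existing labels
print(list(toy.loc[10000:26000].index))  # [17031, 25659]

# On an unsorted index, bounds that are not existing labels raise a KeyError
shuffled = toy.loc[[25659, 6499, 29248, 17031]]
raised_key_error = False
try:
    shuffled.loc[10000:26000]
except KeyError:
    raised_key_error = True
print(raised_key_error)                  # True
```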


You might have noticed that both the positions and our labels are integers. Label selection, however, is not limited to integers.

All of these methods also work with an index that is not made of integers. For example, if our index holds strings, we can select rows by one label or by a list of labels. Let's set the __neighborhood__ column as the index and select some rows using the labels "Alvalade", "Estrela" and "Belém".

In [24]:
# Select rows by one label
df.set_index('neighborhood').sort_index().loc['Alvalade']

neighborhood,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
Alvalade,31062136,Entire home/apt,8,5.0,8,3.0,83.0
Alvalade,2381723,Entire home/apt,116,5.0,4,2.0,52.0
Alvalade,8104036,Entire home/apt,0,0.0,2,1.0,64.0
Alvalade,11440809,Private room,0,0.0,2,1.0,40.0
Alvalade,62222594,Private room,0,0.0,2,1.0,52.0
...,...,...,...,...,...,...,...
Alvalade,106149355,Private room,0,0.0,1,1.0,40.0
Alvalade,49096387,Private room,112,5.0,1,1.0,23.0
Alvalade,15192960,Entire home/apt,1,0.0,6,5.0,242.0
Alvalade,6981742,Private room,5,5.0,2,1.0,52.0


In [25]:
# Select rows by a list of labels
df.set_index('neighborhood').sort_index().loc[['Alvalade', 'Belém', 'Estrela']]

neighborhood,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
Alvalade,31062136,Entire home/apt,8,5.0,8,3.0,83.0
Alvalade,2381723,Entire home/apt,116,5.0,4,2.0,52.0
Alvalade,8104036,Entire home/apt,0,0.0,2,1.0,64.0
Alvalade,11440809,Private room,0,0.0,2,1.0,40.0
Alvalade,62222594,Private room,0,0.0,2,1.0,52.0
...,...,...,...,...,...,...,...
Estrela,6022520,Entire home/apt,40,5.0,4,1.0,103.0
Estrela,33750074,Entire home/apt,30,4.5,3,1.0,75.0
Estrela,49922719,Entire home/apt,0,0.0,10,4.0,219.0
Estrela,4235131,Entire home/apt,103,5.0,10,5.0,346.0


This also works for slicing. The notion of what is "in between" the labels uses the natural ordering of the index. In the case of strings, for example, this would be the alphabetic order of the elements. See below the result of slicing for a string index:

In [26]:
# Select rows between 'Alvalade' and 'Belém'
df.set_index('neighborhood').sort_index().loc['Alvalade':'Belém',:]

neighborhood,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
Alvalade,31062136,Entire home/apt,8,5.0,8,3.0,83.0
Alvalade,2381723,Entire home/apt,116,5.0,4,2.0,52.0
Alvalade,8104036,Entire home/apt,0,0.0,2,1.0,64.0
Alvalade,11440809,Private room,0,0.0,2,1.0,40.0
Alvalade,62222594,Private room,0,0.0,2,1.0,52.0
...,...,...,...,...,...,...,...
Belém,8048828,Entire home/apt,122,5.0,4,1.0,58.0
Belém,21277737,Entire home/apt,0,0.0,4,2.0,230.0
Belém,3168004,Entire home/apt,13,4.5,3,2.0,46.0
Belém,4132746,Private room,49,4.5,3,1.0,22.0
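
Because the real output is truncated, it is hard to see which neighborhoods fall "in between". The sketch below repeats the slice on a hypothetical five-row DataFrame, where the full result fits on screen.

```python
import pandas as pd

# Hypothetical toy data; 'Alvalade' appears twice on purpose
toy = pd.DataFrame(
    {
        "neighborhood": ["Estrela", "Alvalade", "Belém", "Arroios", "Alvalade"],
        "price": [69.0, 46.0, 57.0, 23.0, 52.0],
    }
)

by_hood = toy.set_index("neighborhood").sort_index()

print(list(by_hood.index))
# ['Alvalade', 'Alvalade', 'Arroios', 'Belém', 'Estrela']

# Both 'Alvalade' rows, 'Arroios' (alphabetically in between) and 'Belém'
print(list(by_hood.loc["Alvalade":"Belém"].index))
# ['Alvalade', 'Alvalade', 'Arroios', 'Belém']
```

Note that repeated labels are all returned, and that the stop label ('Belém') is included in the result.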


<br>

## Multi-axis indexing

Selecting by rows or by columns corresponds to indexing along only one axis (we normally refer to rows as axis 0 and columns as axis 1). However, one nice thing about `loc` and `iloc` is that they allow multi-axis indexing, that is, we can select rows and columns at the same time.

Let's use `iloc` to select by position, picking the last five rows and the first 3 columns.

In [27]:
df.iloc[-5:,:3]

room_id,host_id,room_type,neighborhood
19388006,135915593,Entire home/apt,São Vicente
19393935,5376796,Entire home/apt,Santa Maria Maior
19396300,6115933,Entire home/apt,Santo António
19397373,97139334,Entire home/apt,São Vicente
19400722,28219108,Entire home/apt,Areeiro


Let's now use `loc` to select by label, picking the _neighborhood_ and _price_ of the rooms 29396 and 17031.

In [28]:
df.loc[[29396,17031],['neighborhood','price']]

room_id,neighborhood,price
29396,Santa Maria Maior,67.0
17031,Alvalade,46.0
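
The general pattern is `[rows, columns]` for both indexers. As a minimal sketch on a hypothetical three-row DataFrame, the snippet below also shows how the shape of the result depends on whether you pass scalars, lists or slices.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "host_id": [14455, 66015, 107347],
        "neighborhood": ["Belém", "Alvalade", "Santa Maria Maior"],
        "price": [57.0, 46.0, 69.0],
    },
    index=pd.Index([6499, 17031, 25659], name="room_id"),
)

# iloc: [row positions, column positions]
last_two_first_two = toy.iloc[-2:, :2]
print(list(last_two_first_two.index))    # [17031, 25659]
print(list(last_two_first_two.columns))  # ['host_id', 'neighborhood']

# loc: [row labels, column labels]
print(toy.loc[17031, "price"])                            # 46.0 (a single value)
print(toy.loc[[17031], ["neighborhood", "price"]].shape)  # (1, 2) (a DataFrame)
print(list(toy.loc[:, "price"]))                          # [57.0, 46.0, 69.0]
```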


<br>

## Performance remarks


### Chain indexing vs Multi-axis indexing

Imagine you are asked to select the room type of room 55116. When we want to select a specific value in a DataFrame, given the row and the column, we might be tempted to do __chain indexing__, that is, first selecting by one index and then by the other, as shown below.

In [29]:
## Chain indexing
df['room_type'][55116]

'Private room'

However, if you measure the performance of both chain indexing and multi-axis indexing, you'll see there is a reason to prefer the second, even if the gap is small for a single lookup. Run both cells below to measure the time each one takes (we do this with `%%time`, a notebook cell magic that reports how long the code in a cell took to run).

In [30]:
%%time

## Chain indexing
df['room_type'][55116]

CPU times: user 64 µs, sys: 1 µs, total: 65 µs
Wall time: 68.9 µs


'Private room'

In [31]:
%%time

## Multi indexing
df.loc[55116, 'room_type']

CPU times: user 63 µs, sys: 0 ns, total: 63 µs
Wall time: 66 µs


'Private room'

But why?

When we select a row or column in a DataFrame using brackets, Python calls the object's `__getitem__` method under the hood to return the requested data. When we chain two sets of brackets, as in the first example, `__getitem__` is called twice: once on the DataFrame to get the column, and once on the resulting Series to get the row. On the other hand, when we use `loc` to select a value given a row and a column at the same time, the lookup is done in a single `__getitem__` call.

In a small dataset like this the times are barely different, as the two measurements above show, but the bigger your DataFrame the bigger this problem can get, so keep it in mind when you need to select along multiple axes. Chained indexing is also unreliable when you assign values, because pandas may write to a temporary copy instead of the original DataFrame.
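
A single `%%time` run is noisy, so if you want to compare the two approaches yourself, averaging over many repetitions gives a steadier picture. The sketch below uses the standard-library `timeit` module on a hypothetical three-row DataFrame; the timings depend on your machine and pandas version, so no particular result is claimed here.

```python
import timeit

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"room_type": ["Entire home/apt", "Private room", "Entire home/apt"]},
    index=pd.Index([50108, 55116, 56906], name="room_id"),
)

chained = toy["room_type"][55116]     # two lookups: column first, then row
single = toy.loc[55116, "room_type"]  # one lookup on both axes
print(chained == single)              # True

# Average over many repetitions; numbers vary by machine and pandas version
n = 2_000
t_chained = timeit.timeit(lambda: toy["room_type"][55116], number=n)
t_single = timeit.timeit(lambda: toy.loc[55116, "room_type"], number=n)
print(f"chained: {t_chained / n * 1e6:.1f} µs per lookup")
print(f"loc:     {t_single / n * 1e6:.1f} µs per lookup")
```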

<br>


## Subsetting data using mask and where 

When we analyse data, we usually want to filter or select it according to certain conditions. Pandas DataFrames provide two built-in methods that are useful for filtering data according to a condition or a set of conditions:

* `df.mask` - Replace value when condition is true
* `df.where` - Replace value when condition is false



### Masks

The `mask` method replaces values where the condition passed is `True`, so it can be used to "hide" rows given a condition. These rows will have all values replaced by NaN:

In [32]:
df.mask(df.overall_satisfaction == 5.0)

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,,,,,,,,
17031,66015.0,Entire home/apt,Alvalade,0.0,0.0,2.0,1.0,46.0
25659,,,,,,,,
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0
29396,,,,,,,,
...,...,...,...,...,...,...,...,...
19388006,135915593.0,Entire home/apt,São Vicente,0.0,0.0,6.0,3.0,415.0
19393935,5376796.0,Entire home/apt,Santa Maria Maior,0.0,0.0,3.0,1.0,50.0
19396300,6115933.0,Entire home/apt,Santo António,0.0,0.0,6.0,4.0,138.0
19397373,97139334.0,Entire home/apt,São Vicente,0.0,0.0,4.0,1.0,56.0


You can keep only the non-hidden values by dropping the NaN rows. Pandas already provides a function for that, `dropna`.

__Hint__: notice the argument `how='all'` in the function. This means that only masked rows - rows where __all__ elements are set to `NaN` - are dropped.

In [33]:
df.mask(df.overall_satisfaction == 5.0).dropna(how='all')

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015.0,Entire home/apt,Alvalade,0.0,0.0,2.0,1.0,46.0
29248,125768.0,Entire home/apt,Santa Maria Maior,225.0,4.5,4.0,1.0,58.0
29915,128890.0,Entire home/apt,Avenidas Novas,28.0,4.5,3.0,1.0,58.0
33312,144398.0,Entire home/apt,Misericórdia,24.0,4.5,4.0,1.0,66.0
33348,144484.0,Private room,Lumiar,2.0,0.0,6.0,1.0,46.0
...,...,...,...,...,...,...,...,...
19388006,135915593.0,Entire home/apt,São Vicente,0.0,0.0,6.0,3.0,415.0
19393935,5376796.0,Entire home/apt,Santa Maria Maior,0.0,0.0,3.0,1.0,50.0
19396300,6115933.0,Entire home/apt,Santo António,0.0,0.0,6.0,4.0,138.0
19397373,97139334.0,Entire home/apt,São Vicente,0.0,0.0,4.0,1.0,56.0


### Where

The `where` method, on the other hand, can be used to hide the rows that __do not__ satisfy a certain condition, leaving only the rows that do satisfy it. The "hidden" rows will have all values replaced by NaN.

In [34]:
df.where(df.overall_satisfaction == 5.0)

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0
17031,,,,,,,,
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0
29248,,,,,,,,
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0
...,...,...,...,...,...,...,...,...
19388006,,,,,,,,
19393935,,,,,,,,
19396300,,,,,,,,
19397373,,,,,,,,


Once again, we can leave only the non-hidden values by using `dropna`:

In [35]:
df.where(df.overall_satisfaction == 5.0).dropna(how='all')

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455.0,Entire home/apt,Belém,8.0,5.0,2.0,1.0,57.0
25659,107347.0,Entire home/apt,Santa Maria Maior,63.0,5.0,3.0,1.0,69.0
29396,126415.0,Entire home/apt,Santa Maria Maior,132.0,5.0,4.0,1.0,67.0
29720,128075.0,Entire home/apt,Estrela,14.0,5.0,16.0,9.0,1154.0
29872,128698.0,Entire home/apt,Alcântara,25.0,5.0,2.0,1.0,75.0
...,...,...,...,...,...,...,...,...
18997896,19063709.0,Entire home/apt,Avenidas Novas,3.0,5.0,4.0,0.0,62.0
19034170,62521369.0,Entire home/apt,Belém,5.0,5.0,4.0,1.0,85.0
19051322,132979089.0,Private room,Penha de França,3.0,5.0,3.0,1.0,29.0
19079169,2009620.0,Private room,Estrela,3.0,5.0,2.0,1.0,35.0


Basically, __mask__ and __where__ do the opposite of each other! `where` keeps the rows that satisfy a condition and hides the rest, while `mask` hides the rows that satisfy the condition and keeps the rest. Applying the same condition to both and dropping the "hidden" rows, you'll get complementary sets of your data.
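
The complementarity can be checked directly. The following is a small sketch on a made-up frame (`toy` and its values are hypothetical stand-ins for the listings data):

```python
import pandas as pd

# Hypothetical toy data standing in for the listings DataFrame
toy = pd.DataFrame(
    {"overall_satisfaction": [5.0, 0.0, 5.0, 4.5], "price": [57.0, 46.0, 69.0, 58.0]},
    index=[1, 2, 3, 4],
)
condition = toy.overall_satisfaction == 5.0

kept_by_where = toy.where(condition).dropna(how="all")  # rows where condition is True
kept_by_mask = toy.mask(condition).dropna(how="all")    # rows where condition is False

# The two index sets do not overlap, and together they cover the original index
overlap = kept_by_where.index.intersection(kept_by_mask.index)
combined = kept_by_where.index.union(kept_by_mask.index)
print(len(overlap), combined.equals(toy.index))
```

This only holds as long as the original data has no rows that are entirely NaN to begin with, since `dropna(how='all')` would remove those from both results.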

<br>

## Subsetting data on conditions

Besides relying on `mask` and `where`, we can also use the __bracket notation__ with conditions to subset data from the DataFrame. By doing this, we get a DataFrame that most likely has a different shape from the initial one, since it only returns the subset of its rows that satisfy the condition.

Let's subset the DataFrame to get all the rooms in the Alvalade neighborhood. Note the DataFrame shape!

__Note__: this is different from what we saw with the `mask` and `where` functions: those don't change the DataFrame shape; instead, they just replace the values that we don't want with NaNs.

In [36]:
df[df.neighborhood == 'Alvalade']

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
72807,378525,Private room,Alvalade,1,0.0,1,1.0,29.0
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
143882,697596,Entire home/apt,Alvalade,0,0.0,3,2.0,577.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
...,...,...,...,...,...,...,...,...
19206346,131826194,Entire home/apt,Alvalade,0,0.0,5,2.0,87.0
19225159,84062304,Private room,Alvalade,0,0.0,2,1.0,54.0
19227195,134599148,Entire home/apt,Alvalade,0,0.0,4,2.0,56.0
19266319,15462808,Entire home/apt,Alvalade,0,0.0,2,1.0,52.0
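
The difference in shape between bracket subsetting and `where` can be sketched on a tiny made-up frame (the `toy` data below is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data with three listings
toy = pd.DataFrame(
    {"neighborhood": ["Alvalade", "Belém", "Alvalade"], "reviews": [0, 8, 24]},
    index=[17031, 6499, 108058],
)
condition = toy.neighborhood == "Alvalade"

subset = toy[condition]        # only the matching rows are returned
hidden = toy.where(condition)  # same shape as toy, non-matching rows become NaN

print(toy.shape, subset.shape, hidden.shape)
print(subset.reviews.dtype, hidden.reviews.dtype)
```

A side effect worth noticing: because NaN is a floating point value, `where` turns the integer `reviews` column into floats, while bracket subsetting keeps the original integer type. This is why `host_id` shows up as `66015.0` in the `mask` and `where` outputs and as `66015` here.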


As another example, let's select the rooms in Alvalade that have more than 10 reviews.

__Note the parentheses around each condition: they're required!__

In [37]:
df[(df.neighborhood == 'Alvalade') & (df.reviews > 10)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
172248,507901,Private room,Alvalade,38,4.5,2,1.0,29.0
216881,1119812,Entire home/apt,Alvalade,22,4.5,6,2.0,74.0
333919,507901,Private room,Alvalade,48,4.0,3,1.0,40.0
...,...,...,...,...,...,...,...,...
15044690,18671578,Entire home/apt,Alvalade,33,5.0,3,1.0,52.0
15786593,102115202,Entire home/apt,Alvalade,22,5.0,5,2.0,57.0
15839689,16844987,Entire home/apt,Alvalade,24,5.0,4,2.0,58.0
16690259,63598544,Entire home/apt,Alvalade,16,5.0,2,1.0,58.0
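
The parentheses are needed because `&` binds more tightly than comparison operators such as `==` and `>` in Python. A minimal sketch of what goes wrong without them, using a made-up numeric frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"bedrooms": [1, 2, 1], "reviews": [24, 0, 77]})

# With parentheses: each comparison is evaluated first, then combined with &
with_parens = toy[(toy.bedrooms == 1) & (toy.reviews > 10)]

# Without parentheses Python evaluates `1 & toy.reviews` first, which turns the
# expression into a chained comparison that cannot be reduced to one boolean
try:
    toy[toy.bedrooms == 1 & toy.reviews > 10]
    raised = False
except ValueError as error:
    raised = True
    print("ValueError:", error)

print(with_parens.index.tolist())
```

With string comparisons the unparenthesized version typically fails in a different way, but the cause is the same, so wrapping each condition in parentheses is the habit to build.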


### Special conditions

There are different types of conditions you may want to provide, and it is useful to know some of the basic operators you can use to subset:

- values of a column are equal to a specific value: `==` works for any type
- values of a column are __not__ equal to a specific value: `!=` works for any type
- other basic operators on numeric values:
   - Greater than, less than: `>` or `<`
   - Greater than or equal to: `>=`
   - Less than or equal to: `<=`
- values of a column are in a list of values: `isin` method 
- negate conditions with `~`  

The first three are either shown above or are small variations of what was shown. Play around with those to see different outcomes. The last two, however, might seem new to you. Let's look a bit more into those.
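
As a quick sketch of the first three kinds of operators, here is a made-up frame (`toy` and its values are assumptions) filtered with `!=`, `<=` and `<`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Private room", "Private room"],
        "price": [57.0, 29.0, 45.0],
    }
)

not_private = toy[toy.room_type != "Private room"]  # inequality works on strings too
at_most_45 = toy[toy.price <= 45]                   # boundary value 45 is included
below_45 = toy[toy.price < 45]                      # boundary value 45 is excluded

print(len(not_private), len(at_most_45), len(below_45))
```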


#### Matching values in a list

Sometimes you don't want to test equality with one value, but instead you want to check for several possible values. You could match against each of the values in your list, as shown below:

In [38]:
# match listings in Alvalade, Belem and Estrela

df[(df.neighborhood == 'Alvalade') | (df.neighborhood == 'Belém') | (df.neighborhood == 'Estrela')]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
34783,149980,Private room,Estrela,0,0.0,1,1.0,54.0
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
...,...,...,...,...,...,...,...,...
19356899,11470182,Private room,Estrela,0,0.0,2,1.0,29.0
19379554,104083974,Entire home/apt,Belém,0,0.0,4,1.0,69.0
19380177,104083974,Entire home/apt,Belém,0,0.0,6,2.0,69.0
19380457,104083974,Entire home/apt,Belém,0,0.0,6,3.0,93.0


It may seem a good idea for a list of 3 elements, but what if you had 100 or 1000 elements? You could still hack it this way, but it is not ideal. There must be a better way! 

And there is. The `isin` method allows you to test against all elements in a list, making the task easier for you. Here is the example above rewritten with this method:

In [39]:
# match listings in Alvalade, Belem and Estrela

df[df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela'])]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
34783,149980,Private room,Estrela,0,0.0,1,1.0,54.0
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
...,...,...,...,...,...,...,...,...
19356899,11470182,Private room,Estrela,0,0.0,2,1.0,29.0
19379554,104083974,Entire home/apt,Belém,0,0.0,4,1.0,69.0
19380177,104083974,Entire home/apt,Belém,0,0.0,6,2.0,69.0
19380457,104083974,Entire home/apt,Belém,0,0.0,6,3.0,93.0
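
If you want to convince yourself that both approaches select the same rows, you can compare the results programmatically. A small sketch on made-up data (`toy` and `wanted` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"neighborhood": ["Belém", "Alvalade", "Lumiar", "Estrela"]},
    index=[6499, 17031, 33348, 29720],
)
wanted = ["Alvalade", "Belém", "Estrela"]  # could just as well hold 1000 values

with_or = toy[
    (toy.neighborhood == "Alvalade")
    | (toy.neighborhood == "Belém")
    | (toy.neighborhood == "Estrela")
]
with_isin = toy[toy.neighborhood.isin(wanted)]

print(with_or.equals(with_isin))
```

Because `isin` takes the list as an argument, the values can come from a variable, a file or another column, without rewriting the condition.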


#### Negating a condition

What if we wanted to say that a given value is not in the desired list? Or what if we just wanted to negate the given conditions to sample the "excluded" data? There are many ways to do this, but a very easy one is to put the `~` operator in front of your conditions.

See the following examples to understand how this works. Let's start with a simple one using only one condition:

In [40]:
# Start by running for a given condition
df[df.neighborhood == 'Alvalade']

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
72807,378525,Private room,Alvalade,1,0.0,1,1.0,29.0
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
143882,697596,Entire home/apt,Alvalade,0,0.0,3,2.0,577.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
...,...,...,...,...,...,...,...,...
19206346,131826194,Entire home/apt,Alvalade,0,0.0,5,2.0,87.0
19225159,84062304,Private room,Alvalade,0,0.0,2,1.0,54.0
19227195,134599148,Entire home/apt,Alvalade,0,0.0,4,2.0,56.0
19266319,15462808,Entire home/apt,Alvalade,0,0.0,2,1.0,52.0


In [41]:
# Negate the previous condition
df[~(df.neighborhood == 'Alvalade')]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
...,...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
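
A condition and its negation split the data into two non-overlapping parts. The sketch below checks this on a made-up frame (`toy` is hypothetical) and also shows that, for a single equality test, `~(a == b)` selects the same rows as `a != b`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"neighborhood": ["Alvalade", "Belém", "Alvalade", "Estrela"]})
condition = toy.neighborhood == "Alvalade"

matching = toy[condition]
not_matching = toy[~condition]

# Every row lands in exactly one of the two subsets
print(len(matching) + len(not_matching) == len(toy))

# For a single equality test, ~(a == b) selects the same rows as a != b
print(not_matching.equals(toy[toy.neighborhood != "Alvalade"]))
```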


Now see what you can do when you have multiple conditions. See below how to negate only one condition, a subset of conditions or all of the conditions provided:

In [42]:
# Use a more complex set of conditions
df[df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela']) & (df.reviews > 100) & (df.price > 30)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
119120,387186,Entire home/apt,Estrela,136,5.0,4,2.0,73.0
608769,676330,Entire home/apt,Estrela,109,4.5,4,1.0,58.0
632865,387186,Entire home/apt,Estrela,198,5.0,2,1.0,48.0
706651,2381723,Entire home/apt,Alvalade,116,5.0,4,2.0,52.0
751725,4235131,Entire home/apt,Estrela,103,5.0,10,5.0,346.0
...,...,...,...,...,...,...,...,...
5829860,27694497,Entire home/apt,Estrela,106,5.0,3,1.0,58.0
5962736,30954075,Entire home/apt,Estrela,104,4.5,4,1.0,69.0
6812451,35420045,Entire home/apt,Estrela,113,5.0,3,1.0,64.0
6907675,15900664,Private room,Estrela,114,4.5,2,1.0,40.0


In [43]:
# Negate first condition
df[~(df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela'])) & (df.reviews > 100) & (df.price > 30)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
40817,176410,Entire home/apt,Misericórdia,229,4.5,2,1.0,52.0
44043,192830,Entire home/apt,Santa Maria Maior,316,5.0,7,3.0,80.0
65553,320407,Entire home/apt,Campo de Ourique,102,4.5,2,1.0,58.0
...,...,...,...,...,...,...,...,...
12289972,62521369,Entire home/apt,Misericórdia,113,4.5,4,0.0,50.0
12543061,21980829,Entire home/apt,Santa Maria Maior,107,4.5,4,0.0,67.0
12601298,62521369,Entire home/apt,Misericórdia,103,4.5,3,1.0,48.0
12931609,26284159,Private room,Santa Maria Maior,101,5.0,2,1.0,41.0


In [44]:
# Negate second condition
df[(df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela'])) & ~(df.reviews > 100) & (df.price > 30)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
34783,149980,Private room,Estrela,0,0.0,1,1.0,54.0
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
...,...,...,...,...,...,...,...,...
19336231,2068065,Entire home/apt,Estrela,0,0.0,2,1.0,104.0
19379554,104083974,Entire home/apt,Belém,0,0.0,4,1.0,69.0
19380177,104083974,Entire home/apt,Belém,0,0.0,6,2.0,69.0
19380457,104083974,Entire home/apt,Belém,0,0.0,6,3.0,93.0


In [45]:
# Negate second and third conditions - notice the extra parentheses surrounding the desired conditions to negate
df[(df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela'])) & ~((df.reviews > 100) & (df.price > 30))]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
34783,149980,Private room,Estrela,0,0.0,1,1.0,54.0
50108,229376,Entire home/apt,Estrela,33,4.5,4,1.0,69.0
...,...,...,...,...,...,...,...,...
19356899,11470182,Private room,Estrela,0,0.0,2,1.0,29.0
19379554,104083974,Entire home/apt,Belém,0,0.0,4,1.0,69.0
19380177,104083974,Entire home/apt,Belém,0,0.0,6,2.0,69.0
19380457,104083974,Entire home/apt,Belém,0,0.0,6,3.0,93.0


In [46]:
# Negate all conditions - notice the extra parentheses surrounding ALL conditions
df[~((df.neighborhood.isin(['Alvalade', 'Belém', 'Estrela'])) & (df.reviews > 100) & (df.price > 30))]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
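
Where you place `~` and the parentheses changes which rows you get. As a sketch on made-up data (`toy`, `in_list` and `popular` are hypothetical names), negating a group of conditions is the same as negating each one and combining them with OR, which is different from negating only the first condition:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "neighborhood": ["Estrela", "Estrela", "Lumiar", "Belém"],
        "reviews": [136, 14, 225, 8],
    }
)
in_list = toy.neighborhood.isin(["Alvalade", "Belém", "Estrela"])
popular = toy.reviews > 100

negated_together = toy[~(in_list & popular)]  # NOT (A AND B)
negated_each = toy[~in_list | ~popular]       # (NOT A) OR (NOT B)
only_first_negated = toy[~in_list & popular]  # (NOT A) AND B, a different subset

print(negated_together.equals(negated_each))
print(only_first_negated.index.tolist())
```

Storing each condition in a named variable, as done here, also makes long filters easier to read than one deeply nested expression.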


### Combining conditions

In many cases you will want to sample according to several conditions. As you saw above, you should wrap each condition in parentheses. There are two main ways of combining conditions:

* AND: if you want to make sure your data satisfies all conditions
* OR: if you want to make sure your data satisfies at least one of the conditions

Even though Python has `and` and `or` as regular keywords for this, for pandas subsetting you should use `&` and `|`. The logical operators `and` and `or` implicitly ask Python to convert each operand to a single boolean value, but a NumPy array or pandas column with several elements has no unambiguous truth value, so the conversion raises an error. The bitwise operators `&` and `|`, unlike their logical counterparts, can be overridden by a class to return any value, and NumPy and pandas override them to perform element-wise operations.
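
The mechanism can be illustrated without pandas at all. The class below is a hypothetical stand-in written only for this explanation (it is not how pandas is implemented): it overrides `&` through `__and__`, while `and` falls back to `bool()`, which the class refuses to answer:

```python
class Flags:
    """Hypothetical stand-in for a boolean column, not a pandas class."""

    def __init__(self, values):
        self.values = list(values)

    def __and__(self, other):
        # `&` can be overloaded, so it can return an element-wise result
        return Flags(a and b for a, b in zip(self.values, other.values))

    def __bool__(self):
        # `and` calls bool() on its left operand and cannot be overloaded
        raise ValueError("The truth value of Flags is ambiguous.")


left = Flags([True, False, True])
right = Flags([True, True, False])

elementwise = left & right
print(elementwise.values)

try:
    left and right
    failed = False
except ValueError as error:
    failed = True
    print("ValueError:", error)
```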

See the examples below:

In [47]:
# Proper usage of & for element wise AND
df[(df.neighborhood == 'Alvalade') & (df.reviews > 10)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
108058,557442,Private room,Alvalade,24,5.0,1,1.0,27.0
172014,820718,Private room,Alvalade,77,4.5,4,1.0,45.0
172248,507901,Private room,Alvalade,38,4.5,2,1.0,29.0
216881,1119812,Entire home/apt,Alvalade,22,4.5,6,2.0,74.0
333919,507901,Private room,Alvalade,48,4.0,3,1.0,40.0
...,...,...,...,...,...,...,...,...
15044690,18671578,Entire home/apt,Alvalade,33,5.0,3,1.0,52.0
15786593,102115202,Entire home/apt,Alvalade,22,5.0,5,2.0,57.0
15839689,16844987,Entire home/apt,Alvalade,24,5.0,4,2.0,58.0
16690259,63598544,Entire home/apt,Alvalade,16,5.0,2,1.0,58.0


In [48]:
# Proper usage of | for element wise OR
df[(df.neighborhood == 'Alvalade') | (df.reviews > 10)]

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
...,...,...,...,...,...,...,...,...
19206346,131826194,Entire home/apt,Alvalade,0,0.0,5,2.0,87.0
19225159,84062304,Private room,Alvalade,0,0.0,2,1.0,54.0
19227195,134599148,Entire home/apt,Alvalade,0,0.0,4,2.0,56.0
19266319,15462808,Entire home/apt,Alvalade,0,0.0,2,1.0,52.0


In [49]:
# Wrong usage of "and" for element wise AND
try:
    df[(df.neighborhood == 'Alvalade') and (df.reviews > 10)]
except ValueError as e:
    print("\033[91mError\033[0m: {}".format(e))

Error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


In [50]:
# Wrong usage of "or" for element wise OR
try:
    df[(df.neighborhood == 'Alvalade') or (df.reviews > 10)]
except ValueError as e:
    print("\033[91mError\033[0m: {}".format(e))

Error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


<br>

## Adding Rows & Columns

### Adding or replacing a row

We can use the `loc` indexing operation to add a new row to a dataframe with a specific index label. If the dataframe already has a row with that label, the assignment replaces the contents of that row instead. See below an example of adding a label that does not exist yet:

In [51]:
new_df = df.copy()
original_size = len(new_df.index)

# Add new room to our dataframe with roomid=100
new_df.loc[100] = [123456,'Private room','Entrecampos',82,4.5,2,1,29.0]
final_size = len(new_df.index)

# Show our new room
new_df.loc[100]

host_id                       123456
room_type               Private room
neighborhood             Entrecampos
reviews                           82
overall_satisfaction             4.5
accommodates                       2
bedrooms                         1.0
price                           29.0
Name: 100, dtype: object

In [52]:
print('Size of dataframe before: {} \nSize of dataframe after: {}'.format(original_size, final_size))

Size of dataframe before: 13232 
Size of dataframe after: 13233


Now see what happens if you do that for an existing room_id:

In [53]:
new_df = df.copy()
original_size = len(new_df.index)

# Assign to a room_id that already exists (72807): the row is replaced, not added
new_df.loc[72807] = [123456,'Private room','Entrecampos',82,4.5,2,1,29.0]
final_size = len(new_df.index)

# Show our new room
new_df.loc[72807]

host_id                       123456
room_type               Private room
neighborhood             Entrecampos
reviews                           82
overall_satisfaction             4.5
accommodates                       2
bedrooms                         1.0
price                           29.0
Name: 72807, dtype: object

In [54]:
print('Size of dataframe before: {} \nSize of dataframe after: {}'.format(original_size, final_size))

Size of dataframe before: 13232 
Size of dataframe after: 13232
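
Since `loc` assignment silently overwrites an existing row, it can be useful to check whether the label is already in the index first. A minimal sketch, where the `toy` frame and the `set_row` helper are hypothetical names introduced just for this example:

```python
import pandas as pd

# Hypothetical toy data indexed by room id
toy = pd.DataFrame(
    {"neighborhood": ["Belém", "Alvalade"], "price": [57.0, 46.0]},
    index=[6499, 17031],
)


def set_row(frame, label, values):
    """Assign a row with loc and report whether it was added or replaced."""
    action = "replaced" if label in frame.index else "added"
    frame.loc[label] = values
    return action


first = set_row(toy, 100, ["Entrecampos", 29.0])     # label 100 is not in the index
second = set_row(toy, 17031, ["Entrecampos", 29.0])  # label 17031 already exists
print(first, second, len(toy))
```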


We can also use `iloc` to __replace__ the row at a given index position:

In [55]:
new_df = df.copy()
original_size = len(new_df.index)

new_df.iloc[-1] = [56787,'Private room','Avenidas Novas',82,4.5,2,1,29.0]
final_size = len(new_df.index)

# Show our new room
new_df.tail()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
19400722,56787,Private room,Avenidas Novas,82,4.5,2,1.0,29.0


In [56]:
print('Size of dataframe before: {} \nSize of dataframe after: {}'.format(original_size, final_size))

Size of dataframe before: 13232 
Size of dataframe after: 13232
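
Unlike `loc`, `iloc` can only replace rows at positions that already exist; it cannot append. A small sketch on a made-up two-row frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with two rows (positions 0 and 1)
toy = pd.DataFrame({"price": [57.0, 46.0]}, index=[6499, 17031])

toy.iloc[-1] = [29.0]  # replaces the last existing row

try:
    toy.iloc[len(toy)] = [35.0]  # position 2 does not exist in a 2-row frame
    enlarged = True
except IndexError as error:
    enlarged = False
    print("IndexError:", error)

print(len(toy))
```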


### Adding or replacing a column

In the same way we did for rows, we can add or replace a column in a dataframe using the notations that we used to select columns: bracket notation and the `loc` operator work for both, while dot notation can only replace a column that already exists (assigning to a new name with dot notation sets a plain Python attribute and does not create a column). Let's first see how to replace a column in your dataframe. For example, let's convert the prices from dollars to thousands of dollars:

In [57]:
# Normalizes the price from dollars to thousands of dollars
new_df = df.copy()
new_df['price'] = new_df.price/1000
new_df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,0.057
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,0.046
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,0.069
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,0.058
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,0.067


Now see how you can use the same notation to add completely new columns:

In [58]:
# Creates a new column in the DataFrame (price_per_week), where each row is equal to the price * 7
new_df = df.copy()
new_df['price_per_week'] = new_df.price * 7
new_df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,price_per_week
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0,399.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0,322.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0,483.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0,406.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0,469.0
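
The bracket examples above have counterparts with `loc` and dot notation, with one pitfall: dot notation can overwrite an existing column, but assigning to a name that is not yet a column only sets a Python attribute on the object. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import warnings

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"price": [57.0, 46.0]}, index=[6499, 17031])

# loc with all rows (:) and a new column label adds the column
toy.loc[:, "price_per_week"] = toy.price * 7

# Dot notation can overwrite an existing column...
toy.price = toy.price / 1000

# ...but assigning to a new name only sets a Python attribute, not a column
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas emits a UserWarning here
    toy.price_per_month = [1767.0, 1426.0]

print(toy.columns.tolist())
```

For this reason, bracket notation or `loc` is the safer default whenever a column is being created.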


There is also a handy method called `assign`. It returns a new dataframe with all the original columns plus the ones that we are creating.

In [59]:
# Creates two new columns in the DataFrame:
# people_per_bedroom = accommodates / bedrooms, and price_per_month = price * 31
new_df = df.copy()
new_df = new_df.assign(
    people_per_bedroom = new_df['accommodates']/new_df['bedrooms'],
    price_per_month = new_df['price'] * 31
)

new_df.head()

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price,people_per_bedroom,price_per_month
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0,2.0,1767.0
17031,66015,Entire home/apt,Alvalade,0,0.0,2,1.0,46.0,2.0,1426.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0,3.0,2139.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0,4.0,1798.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0,4.0,2077.0
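
Two properties of `assign` are worth a quick sketch: it leaves the original dataframe untouched, and when the values are given as functions, a later keyword can use a column created earlier in the same call (this assumes a reasonably recent pandas version). The `toy` frame and the new column names are hypothetical:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"accommodates": [2, 4], "price": [57.0, 58.0]})

with_extras = toy.assign(
    price_per_week=lambda d: d["price"] * 7,
    # A later keyword can use a column created earlier in the same call
    weekly_price_per_person=lambda d: d["price_per_week"] / d["accommodates"],
)

print(toy.columns.tolist())          # the original frame is unchanged
print(with_extras.columns.tolist())  # the returned frame has the new columns
```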


<br>

## Removing Rows & Columns

Finally, you can also remove columns or rows from your dataset. To drop rows or columns from a DataFrame, we can use the `drop` method, which returns a new DataFrame and leaves the original one untouched. Here's how you drop a row:

In [60]:
# This drops the row with index 17031. This is the same as doing drop(17031, axis=0)
df.drop(labels=17031)

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,Belém,8,5.0,2,1.0,57.0
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
...,...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0


In order to drop a column, we do the following:

In [61]:
# This drops the neighborhood column. This is the same as doing drop('neighborhood', axis=1)
df.drop(columns='neighborhood')

room_id,host_id,room_type,reviews,overall_satisfaction,accommodates,bedrooms,price
6499,14455,Entire home/apt,8,5.0,2,1.0,57.0
17031,66015,Entire home/apt,0,0.0,2,1.0,46.0
25659,107347,Entire home/apt,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,132,5.0,4,1.0,67.0
...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,0,0.0,4,1.0,56.0


If we want to drop multiple rows (or columns), we can use lists:

In [62]:
df.drop(labels=[6499, 17031])

room_id,host_id,room_type,neighborhood,reviews,overall_satisfaction,accommodates,bedrooms,price
25659,107347,Entire home/apt,Santa Maria Maior,63,5.0,3,1.0,69.0
29248,125768,Entire home/apt,Santa Maria Maior,225,4.5,4,1.0,58.0
29396,126415,Entire home/apt,Santa Maria Maior,132,5.0,4,1.0,67.0
29720,128075,Entire home/apt,Estrela,14,5.0,16,9.0,1154.0
29872,128698,Entire home/apt,Alcântara,25,5.0,2,1.0,75.0
...,...,...,...,...,...,...,...,...
19388006,135915593,Entire home/apt,São Vicente,0,0.0,6,3.0,415.0
19393935,5376796,Entire home/apt,Santa Maria Maior,0,0.0,3,1.0,50.0
19396300,6115933,Entire home/apt,Santo António,0,0.0,6,4.0,138.0
19397373,97139334,Entire home/apt,São Vicente,0,0.0,4,1.0,56.0
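
As a compact recap, here is a sketch of a few `drop` variations on a made-up frame (`toy` is hypothetical). It uses the `index=` and `columns=` keywords, a list of columns, and `errors='ignore'` to skip labels that are not present:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "host_id": [14455, 66015, 107347],
        "neighborhood": ["Belém", "Alvalade", "Santa Maria Maior"],
        "price": [57.0, 46.0, 69.0],
    },
    index=[6499, 17031, 25659],
)

fewer_rows = toy.drop(index=[6499, 17031])  # same as drop(labels=[...], axis=0)
fewer_columns = toy.drop(columns=["host_id", "neighborhood"])
tolerant = toy.drop(index=[999], errors="ignore")  # missing label is skipped

print(toy.shape, fewer_rows.shape, fewer_columns.shape, tolerant.shape)
```

Note that `toy` itself keeps its original shape: to keep the result, assign it to a variable.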



That's a wrap! You now have all the tools to subset your dataframes and sample your datasets. Play around with these functions, and then continue to the example and exercise notebooks to practice what you've learned!