# <font color='#eb3483'> Filtering in Pandas </font>

Once we have read our dataframe in and had a look around, we may want to work with specific columns or rows, or only with data that meets a certain criterion. We do this with filtering.

Indices are incredibly useful, because they allow us to quickly and intuitively pick out specific data points. In this notebook, we're going to practice using indices to filter our dataframes.

The two most fundamental commands for indexing are `loc` and `iloc` (that is, integer-loc), each followed by an identifier for the desired location in square brackets. Mastering `loc` and `iloc` early will stand you in good stead when using the Pandas data API.

There are two important things you should know about `iloc`:

1. It is reserved for purely position-based indexing: integers, lists of integers, or integer slices. If you call ```iloc``` with an index label such as a string, it will raise an error.
2. `iloc` does not interact with your assigned index at all. It only considers the row positions, starting at zero. This is important to remember if your assigned index is integer-based.

In contrast, `loc` references only the assigned index of your dataframe without regard for row ordering.
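
To make the difference concrete, here is a minimal sketch on a small made-up dataframe (`toy` and its values are assumptions for illustration, not part of the Airbnb data). The index labels are deliberately out of order, so position 0 and label 0 point to different rows:

```python
import pandas as pd

# Toy dataframe (hypothetical data) whose integer index is NOT 0, 1, 2 in order
toy = pd.DataFrame({"price": [40, 55, 70]}, index=[2, 0, 1])

by_position = toy.iloc[0]  # first row of the frame, whatever its label is
by_label = toy.loc[0]      # the row whose index label equals 0

print(by_position["price"])  # 40
print(by_label["price"])     # 55
```

With an integer-based index like this one, mixing up the two accessors returns a valid but different row, so no error warns us about the mistake.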


In this notebook we will cover:

1. Selecting rows by their numerical position - ```iloc```
1. Selecting rows by their index - ```loc```
1. Selecting columns
1. Advanced filtering using:
   - ```mask``` and ```where```
   - multiple selections


In [None]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in Lisbon.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Class Materials/Week 1 DS Fundamentals/W1D3 Data Wrangling/Classwork/airbnb.csv', index_col='room_id')

In [None]:
df = pd.read_csv('data/airbnb.csv', index_col='room_id') # alternative path when running outside Colab; either way, room_id becomes the index

<font color='#eb3483'> Exercise: </font> How large is this dataset? Have a look at the top few rows to familiarize yourself with it.

In [None]:
df.info()  # number of rows, column names, and dtypes
df.head()  # the top few rows

## <font color='#eb3483'> 1. Selecting Rows by their Position </font>

We use the function `iloc` to select specific rows of a dataframe **regardless of the index**. With `iloc`, we select rows by row number, starting at 0.

In [None]:
df.iloc[0] # using one square bracket returns it as a series

We can select multiple rows at once by passing in a list of the row numbers that we want:

In [None]:
df.iloc[[0,3,5]]

Or use slices like with arrays.
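
As a small sketch of how slicing behaves with `iloc`, the example below uses a made-up dataframe (`toy` is an assumption for illustration). As with Python lists, the start of the slice is included and the end is excluded:

```python
import pandas as pd

# Toy dataframe (hypothetical data) with five rows
toy = pd.DataFrame({"price": [10, 20, 30, 40, 50]},
                   index=[101, 102, 103, 104, 105])

first_three = toy.iloc[0:3]   # positions 0, 1, 2 (position 3 is excluded)
last_two = toy.iloc[-2:]      # negative positions count from the end
every_other = toy.iloc[::2]   # a step of 2 picks positions 0, 2, 4

print(first_three["price"].tolist())  # [10, 20, 30]
print(last_two["price"].tolist())     # [40, 50]
print(every_other["price"].tolist())  # [10, 30, 50]
```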

<font color='#eb3483'> Exercise: </font> Select rows 2 to 20.

In [None]:
df.iloc[1:20] # rows 2 through 20 (counting from 1): positions 1 to 19, since the slice end is excluded

## <font color='#eb3483'> 2. Selecting Rows by their Index Value </font>

With `.loc` we can select rows based on their index value. Since we have set the dataframe index as the ```room_id```, we can select a specific room based on its id, for example, ```room_id == 6499```:

In [None]:
df.loc[6499]

Selecting an index value that doesn't exist raises a `KeyError`:

In [None]:
df.loc[5]
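
If we are not sure whether an id exists, we can check the index first or catch the error. The sketch below uses a made-up dataframe and a made-up id (`toy` and `missing_id` are assumptions for illustration):

```python
import pandas as pd

# Toy dataframe (hypothetical data) indexed by room id
toy = pd.DataFrame({"price": [40, 55]}, index=[6499, 29872])

missing_id = 5  # hypothetical id that is not in the index

# Option 1: test membership in the index before selecting
is_present = missing_id in toy.index

# Option 2: attempt the lookup and handle the KeyError
try:
    row = toy.loc[missing_id]
except KeyError:
    row = None  # no row with that label

print(is_present)  # False
print(row)         # None
```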

Same as with ```.iloc```, we can select multiple index values at once by passing these in as a list:

In [None]:
df.loc[[29872, 19188572, 4612503]]

The ```.loc``` method also allows you to pass in a boolean list or Series (i.e. one ```True``` or ```False``` per row) to extract specific rows (those where the value is ```True```). For example, to extract all Airbnb listings with 10 bedrooms, we can do the following:

In [None]:
bed10 = df['bedrooms'] == 10 # this creates a pandas Series of type bool (boolean)
bed10

In [None]:
df.loc[bed10]

Of course, we could combine these into a single line of code:

```python
df.loc[df['bedrooms']==10]
```

Strictly speaking, we do not need to use ```.loc``` here. This will also work:

```python
df[df['bedrooms']==10]
```

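However, it is important to remember that passing this boolean Series to ```.iloc``` **will not work**: ```.iloc``` is purely positional and rejects a boolean Series that carries an index. It accepts integers, lists of integers, slices, or a plain boolean array with one value per row. Check for yourself!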
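
More precisely, what ```.iloc``` rejects is a boolean *Series*, because the Series carries index labels. The sketch below, on a made-up dataframe (`toy` is an assumption for illustration), shows the error and one possible workaround: converting the Series to a plain NumPy array with `to_numpy()`. The exact exception type depends on the index type, so the sketch catches both kinds it is known to raise:

```python
import pandas as pd

# Toy dataframe (hypothetical data) indexed by room id
toy = pd.DataFrame({"bedrooms": [1, 10, 3]}, index=[6499, 29872, 4612503])

is_big = toy["bedrooms"] == 10  # boolean Series, labelled by room id

try:
    toy.iloc[is_big]  # a labelled boolean Series is rejected by iloc
    iloc_accepts_series = True
except (NotImplementedError, ValueError):
    iloc_accepts_series = False

# A plain boolean array has no labels, so iloc treats it positionally
big_rows = toy.iloc[is_big.to_numpy()]

print(iloc_accepts_series)      # False
print(big_rows.index.tolist())  # [29872]
```

In practice, ```.loc``` or plain square brackets are the simpler choice for boolean filtering.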

## <font color='#eb3483'> 3. Column Selection </font>

We can select columns using dot notation **as long as the column names do not have spaces or non-standard characters in them** and do not clash with an existing dataframe attribute or method (such as `count` or `shape`). It is good practice to name your columns without such characters. This will save you time later! :)

In [None]:
df.room_type

...is the same as doing...

In [None]:
df['room_type']

Note that when we select one column like in the above example, we get a ```pd.Series``` object. We can use a list of column names to select multiple columns. This returns a ```pd.DataFrame``` object as expected:

In [None]:
df[["room_type", "price"]].head()
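
The difference between single and double square brackets can be checked directly. This is a minimal sketch on a made-up dataframe (`toy` is an assumption for illustration):

```python
import pandas as pd

# Toy dataframe (hypothetical data)
toy = pd.DataFrame({"room_type": ["Private room", "Shared room"],
                    "price": [40, 15]})

one_col = toy["room_type"]          # single brackets: a Series
one_col_frame = toy[["room_type"]]  # a list with one name: a one-column DataFrame

print(type(one_col).__name__)        # Series
print(type(one_col_frame).__name__)  # DataFrame
print(one_col_frame.shape)           # (2, 1)
```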

We can also use ```.loc``` to simultaneously select specific rows and a subset of the columns.
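
As a sketch of the pattern, ```.loc``` takes the row selector first and the column selector second, separated by a comma. The dataframe below is made up for illustration (`toy` and its values are assumptions, not the Airbnb data):

```python
import pandas as pd

# Toy dataframe (hypothetical data) indexed by room id
toy = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Shared room"],
        "neighborhood": ["Belém", "Benfica", "Belém"],
        "price": [40, 90, 15],
    },
    index=[6499, 29872, 4612503],
)

# Rows where price is below 50, restricted to two columns
cheap = toy.loc[toy["price"] < 50, ["room_type", "neighborhood"]]

# A single label for the row and for the column returns one value
one_price = toy.loc[6499, "price"]

print(cheap.index.tolist())    # [6499, 4612503]
print(cheap.columns.tolist())  # ['room_type', 'neighborhood']
print(one_price)               # 40
```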

<font color='#eb3483'> Exercise: </font> What are the room types and neighborhoods of the Airbnb listings with more than 300 reviews?

In [None]:
df.loc[df["reviews"] > 300, ["room_type", "neighborhood"]] # rows first, then columns, in a single .loc call

## <font color='#eb3483'> 4. Advanced Filtering </font>

### <font color='#eb3483'> Mask & Where </font>

The function `mask` allows us to "hide" or "deselect" parts of a dataframe that match a certain condition. Note that this is similar to how we use masks in NumPy.

In [None]:
df.mask(df.overall_satisfaction == 5.0).head()

The rows that match the condition appear as `NaN`, which stands for **Not a Number**, a standard way of saying "*there is no relevant data here*". Most pandas aggregations (such as `mean` or `sum`) skip NaNs by default.

In contrast, [where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) hides those rows that don't match the condition. So ```where``` selects and ```mask``` deselects - they are opposites!

In [None]:
df.where(df.overall_satisfaction == 5.0).head()
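
To see the two side by side, here is a minimal sketch on a made-up dataframe (`toy` is an assumption for illustration). Note that both `mask` and `where` keep the shape of the dataframe and only blank out rows, whereas boolean indexing drops rows:

```python
import pandas as pd

# Toy dataframe (hypothetical data)
toy = pd.DataFrame({"overall_satisfaction": [5.0, 4.5, 5.0],
                    "price": [40, 55, 70]})

is_top = toy["overall_satisfaction"] == 5.0  # True, False, True

masked = toy.mask(is_top)  # rows where the condition is True become NaN
kept = toy.where(is_top)   # rows where the condition is False become NaN
filtered = toy[is_top]     # rows where the condition is False are dropped

print(masked.isna().all(axis=1).tolist())  # [True, False, True]
print(kept.isna().all(axis=1).tolist())    # [False, True, False]
print(masked.shape, filtered.shape)        # (3, 2) (2, 2)
```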

<font color='#eb3483'> Exercise: </font> How does ```df[df.overall_satisfaction == 5.0]``` differ from the above?

<font color='#eb3483'> Exercise: </font> What does ```df[~(df.overall_satisfaction == 5.0)]``` do?

**Hint:** Compare the result of ```df.overall_satisfaction == 5.0``` and ```~(df.overall_satisfaction == 5.0)```.

In [None]:
df[df.overall_satisfaction == 5.0]

In [None]:
df[~(df.overall_satisfaction == 5.0)]

### <font color='#eb3483'> Multiple Selection </font>

We can filter a dataframe based on multiple conditions. To select rows that match all of several conditions, we combine the conditions with the AND operator, `&`. Each condition must be wrapped in parentheses, because `&` binds more tightly than comparisons such as `==` and `>`. For example, if we want those listings in Belém with more than 3 bedrooms:

In [None]:
df[(df.neighborhood == 'Belém') & (df.bedrooms > 3)].head()

Similarly, we can select rows that match one condition OR the other with the `|` operator:

In [None]:
df[(df.neighborhood == "Belém") | (df.neighborhood == "Benfica")].head()
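
The sketch below repeats both patterns on a made-up dataframe (`toy` and its values are assumptions for illustration). It also shows `isin`, a pandas Series method not used elsewhere in this notebook, which can replace a chain of `|` comparisons on the same column:

```python
import pandas as pd

# Toy dataframe (hypothetical data)
toy = pd.DataFrame({
    "neighborhood": ["Belém", "Benfica", "Alvalade", "Belém"],
    "bedrooms": [4, 2, 5, 1],
})

# AND: both conditions must hold (note the parentheses around each one)
big_in_belem = toy[(toy["neighborhood"] == "Belém") & (toy["bedrooms"] > 3)]

# OR: either condition may hold
in_either = toy[(toy["neighborhood"] == "Belém") | (toy["neighborhood"] == "Benfica")]

# isin: a shorter way to test one column against several values
same_with_isin = toy[toy["neighborhood"].isin(["Belém", "Benfica"])]

print(big_in_belem.index.tolist())       # [0]
print(in_either.index.tolist())          # [0, 1, 3]
print(in_either.equals(same_with_isin))  # True
```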

# <font color='#eb3483'> Filtering Pandas Exercises </font>

Let's pretend we are an Airbnb employee assigned to the Lisbon market. Our job is to help clients find their desired listing. We have a file named `airbnb.csv` that has information on all the listings we have available right now in the city. Start by importing pandas and loading the data in.

In [None]:
# your code goes here

### <font color='#eb3483'> Exercise 1 </font>

Alice is going to Lisbon for a week with her husband and 2 kids. They are looking for a full apartment with separate rooms for parents and children. Money is not an issue for them, but they are looking for a good place. This means they are only looking for places with more than 10 reviews and a score above 4. When we show Alice our listing selection, we need to make sure we sort the listings from the best score to the worst one. If some listings have the same score, we will sort them by the number of reviews (the more the better). We need to give her 3 alternatives.

In [None]:
# your code goes here


### <font color='#eb3483'> Exercise 2 </font>

Diana is going to spend 3 nights in Lisbon and she wants to meet new people. She has a budget of 50€. We need to provide her with the 10 cheapest listings, with a preference for shared rooms. We need to sort the rooms by score (descending).

In [None]:
# your code goes here