# <font color='#eb3483'> Filtering in Pandas </font>

Once we have read our data frame in and had a look around, we may want to start working with specific columns or rows, or with data that only meets certain criteria.
We do this with filtering.

Indices are incredibly useful, because they allow us to quickly and intuitively (especially if we've used a meaningful index) pick out relevant data points. In this module we're going to learn how to use indices to filter our dataframes.

The two most fundamental commands for indexing are `loc` and `iloc` (integer-loc), followed by an identifier for the desired location in square brackets. Mastering `loc` and `iloc` early will stand you in good stead when using the Pandas data API.

1. There are two things you should know about `iloc`. Firstly, it is reserved for purely position-based indexing (integers only), so calling `iloc` with a non-integer label, such as a string, raises an error. Secondly, `iloc` **does not interact with your index at all**, which is important to remember if your index is integer-based.
2. `loc` is based purely on the assigned index for your dataframe.
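
To make the difference concrete, here is a minimal sketch on a small made-up dataframe (the name `toy` and its values are assumptions for illustration, not part of the Airbnb data). The index is integer-based but deliberately out of order, so `iloc[0]` and `loc[0]` pick different rows.

```python
import pandas as pd

# Hypothetical toy data: the integer index labels are deliberately not 0, 1, 2.
toy = pd.DataFrame(
    {"price": [50, 80, 120]},
    index=[2, 0, 1],  # made-up index labels
)

by_position = toy.iloc[0]  # first row by position, whatever its label is
by_label = toy.loc[0]      # the row whose index label is 0

print(by_position["price"], by_label["price"])
```

Here `iloc[0]` returns the row with price 50 (the first row), while `loc[0]` returns the row with price 80 (the row labelled 0).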


In this notebook we will cover:
<font color='#eb3483'>
1. Selecting rows by their numerical position - iloc
1. Selecting rows by their index - loc
1. Selecting columns
1. Advanced filtering
 - mask & where
 - filtering with []
 - multiple selections
    </font>

In [None]:
import pandas as pd

We are going to use a dataset that has Airbnb listing information in Lisbon.

In [None]:
df = pd.read_csv('data/airbnb.csv', index_col='room_id') #indexing the df using room_id

In [None]:
# look at the shape


In [None]:
#look at the head


## <font color='#eb3483'> 1. Selecting rows by their position - iloc </font>

We use the `iloc` indexer to select specific rows of a dataframe by position (regardless of the index).

With `iloc` we select rows by their row number, starting at 0.

In [None]:
df.iloc[0] # using one square bracket returns it as a series

In [None]:
type(df.iloc[0])

Generally we want to keep working with a dataframe, so we pass a list of positions using double brackets `[[]]`.

In [None]:
df.iloc[[0]]

We can select multiple rows at once:

In [None]:
df.iloc[[0,3,5]]

Or use slices like with arrays:

In [None]:
#select rows 2:10
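
If slices are new to you, the sketch below shows how they behave with `iloc` on a small made-up dataframe (`toy` and its values are assumptions for illustration). As with Python lists and NumPy arrays, the end position of the slice is excluded.

```python
import pandas as pd

# Hypothetical toy data with made-up room ids as the index.
toy = pd.DataFrame(
    {"price": [50, 80, 120, 65, 90]},
    index=[101, 102, 103, 104, 105],
)

first_three = toy.iloc[0:3]  # positions 0, 1 and 2 (position 3 is excluded)
last_two = toy.iloc[-2:]     # negative positions count from the end
every_other = toy.iloc[::2]  # a step of 2 keeps positions 0, 2 and 4

print(first_three)
```

A slice always returns a dataframe, so there is no need for double brackets here.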



## <font color='#eb3483'> 2. Selecting rows by their index value - loc </font>

With `.loc` we can select rows based on their index value.

Since we have set the dataframe index to the Airbnb listing id (`room_id`), we can select a specific room based on its id, for example, the listing 29396.

In [None]:
df.loc[29396]

Selecting an index value that doesn't exist will fail with a `KeyError`:

In [None]:
df.loc[[5]]
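
When we are not sure whether a label exists, there are a couple of ways to avoid the error. The sketch below uses a small made-up dataframe (`toy`, its ids and its prices are assumptions for illustration) and shows two common options: catching the `KeyError`, or checking the index before selecting.

```python
import pandas as pd

# Hypothetical toy data with made-up room ids as the index.
toy = pd.DataFrame({"price": [50, 80]}, index=[101, 102])

# Option 1: catch the KeyError raised for a missing label.
try:
    row = toy.loc[[5]]
except KeyError:
    row = None  # label 5 is not in the index

# Option 2: keep only the labels that are actually in the index.
wanted = [101, 5]
present = [label for label in wanted if label in toy.index]
found = toy.loc[present]

print(found)
```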

As with `.iloc`, we can select multiple values at once.

In [None]:
df.loc[[29872, 19188572, 4612503 ]]

## <font color='#eb3483'> 3. Column Selection </font>

We can select columns using dot notation **(as long as the column names don't contain spaces or other non-alphanumeric characters)**, which is why it is always good to name your columns without these. Saves time later :)

In [None]:
df.room_type

Which is the same as doing:

In [None]:
df['room_type']

When we select one column we receive a `pd.Series`. We can use double brackets to select multiple columns (if we select multiple columns we will always receive a dataframe).

In [None]:
df[["room_type", "price"]].head()

We can also use `loc` to select specific rows from a desired column:

In [None]:
df.loc[:, "room_type"][:10]
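
`loc` takes a row selection and a column selection separated by a comma, so we can pick rows and columns in one step. Here is a minimal sketch on a made-up dataframe (`toy`, its ids and its values are assumptions; only the column names come from our dataset).

```python
import pandas as pd

# Hypothetical toy data reusing a few column names from the Airbnb file.
toy = pd.DataFrame(
    {
        "room_type": ["Private room", "Entire home/apt", "Shared room"],
        "price": [50, 120, 20],
        "bedrooms": [1, 3, 1],
    },
    index=[101, 102, 103],  # made-up room ids
)

# Lists on both sides: two rows and two columns, returned as a dataframe.
subset = toy.loc[[101, 103], ["room_type", "price"]]

# A single label on both sides returns one value.
single = toy.loc[102, "price"]

print(subset)
```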

## <font color='#eb3483'> 4. Advanced Filtering </font>

### <font color='#eb3483'> Mask & Where </font>

The function `mask` allows us to "hide" parts of a dataframe that match a certain condition. Note that this is similar to how we used masks in NumPy!

In [None]:
df.mask(df.overall_satisfaction == 5.0).head()

We see that the rows that match the condition appear as `NaN`, which stands for **Not a Number**, a standard way of saying *"there is no relevant data here"*. Most pandas aggregations (such as `mean` or `sum`) skip NaNs by default.

On the other hand, [where](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html) hides those rows that don't match the condition (where is the opposite of mask).

In [None]:
df.where(df.overall_satisfaction == 5.0).head()
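
To see the two side by side, here is a minimal sketch on a made-up dataframe (`toy` and its values are assumptions for illustration). Both calls return a dataframe with the same shape as the original; they only differ in which rows get replaced by `NaN`.

```python
import pandas as pd

# Hypothetical toy data with made-up room ids as the index.
toy = pd.DataFrame(
    {"overall_satisfaction": [5.0, 4.5, 5.0], "price": [50, 80, 120]},
    index=[101, 102, 103],
)

is_five = toy.overall_satisfaction == 5.0

masked = toy.mask(is_five)  # rows where the condition is True become NaN
kept = toy.where(is_five)   # rows where the condition is False become NaN

print(masked)
print(kept)
```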

### <font color='#eb3483'> Filtering with [ ] </font>

We can also filter by using brackets.
The difference between filtering with brackets and using `mask`/`where` is that with brackets we only receive a segment of the dataframe (fewer rows), while with `mask`/`where` we receive a dataframe with the same rows and index as the original one.

We call this subsetting: creating a specific "sub" dataframe.

Why would it be useful to retain all the rows?



For example, we can filter the dataframe to see all the listings in `Belém`:

In [None]:
df = pd.read_csv('data/airbnb.csv', index_col='room_id')

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.where(df.neighborhood=="Belém").shape

If we use brackets, the dataframe we get is smaller:

In [None]:
df[df.neighborhood == 'Belém'].head()

In [None]:
df[df.neighborhood == 'Belém'].shape

We can select the inverse of a condition if we put `~` in front of it.

For example, to select all listings that are not in Belém, we can do this:

In [None]:
df[~(df.neighborhood ==  "Belém")].shape
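
The sketch below puts these pieces together on a made-up dataframe (`toy` and its values are assumptions for illustration). It stores the condition in a variable, which keeps the filters readable, and shows that `~` on an equality test gives the same rows as using `!=`.

```python
import pandas as pd

# Hypothetical toy data with made-up room ids and prices.
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Benfica", "Belém", "Benfica"],
        "price": [50, 80, 120, 65],
    },
    index=[101, 102, 103, 104],
)

in_belem = toy.neighborhood == "Belém"  # a boolean Series, one value per row

same_shape = toy.where(in_belem)  # 4 rows, non-matching rows are NaN
subset = toy[in_belem]            # only the 2 matching rows
not_belem = toy[~in_belem]        # only the 2 non-matching rows
also_not_belem = toy[toy.neighborhood != "Belém"]  # same rows as not_belem

print(same_shape.shape, subset.shape, not_belem.shape)
```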

### <font color='#eb3483'> Multiple Selection </font>

We can filter a dataframe based on multiple conditions.

To select rows that match all of the conditions, we combine them with `&`, wrapping each condition in parentheses.

For example, if we want those listings in Belém with more than 3 bedrooms:

In [None]:
df[(df.neighborhood == 'Belém') & (df.bedrooms > 3)].head()

In the same way, we can select rows that match one condition OR the other with the pipe (`|`):

In [None]:
df[(df.neighborhood == "Belém") | (df.neighborhood == "Benfica")].head()
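
Here is a minimal sketch of both operators on a made-up dataframe (`toy` and its values are assumptions for illustration). The parentheses matter because `&` and `|` bind more tightly than `==` and `>` in Python. The sketch also uses `isin`, which is one possible shorthand when several OR conditions test the same column.

```python
import pandas as pd

# Hypothetical toy data with made-up room ids and values.
toy = pd.DataFrame(
    {
        "neighborhood": ["Belém", "Belém", "Benfica", "Alvalade"],
        "bedrooms": [4, 2, 5, 1],
    },
    index=[101, 102, 103, 104],
)

# AND: both conditions must hold for a row to be kept.
big_belem = toy[(toy.neighborhood == "Belém") & (toy.bedrooms > 3)]

# OR: at least one condition must hold.
either = toy[(toy.neighborhood == "Belém") | (toy.neighborhood == "Benfica")]

# isin keeps the rows whose value appears in the list (same rows as `either`).
either_isin = toy[toy.neighborhood.isin(["Belém", "Benfica"])]

print(big_belem)
print(either)
```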

# <font color='#eb3483'> GET SOME PRACTICE </font>

## Take 10 minutes and work through 1 or 2 of these problems to get a feel for doing the coding yourself.

It is going to be rough at first. And that's okay. You can copy, paste and scroll up. You don't have to remember each command. It's all there, and if it isn't ... Google is your friend.


# <font color='#eb3483'> Filtering Pandas Exercises </font>

Let's pretend we are an Airbnb employee assigned to the Lisbon market. Our job is to help clients find their desired listing. We have a file named `airbnb.csv` that has information on all the listings we have available right now in the city. Start by importing pandas and loading our data in.

### <font color='#eb3483'> Exercise 1 </font>

Alice is going to Lisbon for a week with her husband and 2 kids. They are looking for a full apartment with separate rooms for parents and children. Money is not an issue for them, but they are looking for a good place. This means they are only looking for places with more than 10 reviews and a score above 4. When we show Alice our listing selection we need to make sure we are sorting the listings from the best score to the worst one. In case some listings have the same score, we will have to sort them by the number of reviews (the more the better). We need to give her 3 alternatives.


### <font color='#eb3483'> Exercise 2 </font>

Diana is going to spend 3 nights in Lisbon and she wants to meet new people. She has a budget of 50€. We need to provide her with the 10 cheapest listings, with a preference for shared rooms. We need to sort the rooms by score (descending).

### - Solutions
### - TA help
### - sharing