## Chapter III

### Filtering joins

So far, we have only worked with mutating joins, which combine data from two tables. Filtering joins, by contrast, filter observations from one table based on whether or not they match an observation in another table.

- **Mutating joins:**
    * Combine data from two tables based on matching observations in both tables.
- **Filtering joins:**
    * Filter observations from one table based on whether or not they match an observation in another table.
    
#### Semi-join

A semi-join filters the left table down to those observations that have a match in the right table. It is similar to an inner join in that only the intersection between the tables is returned, but unlike an inner join, only the columns from the left table are shown. Finally, no duplicate rows from the left table are returned, even if there is a one-to-many relationship.

#### Example Datasets

In this new dataset, we have a table of song genres and a table of top-rated song tracks, both shown below. The 'gid' column connects the two tables. Let's say we want to find which genres appear in our table of top songs. A semi-join would return only the columns from the genres table and not those from the tracks table.
<img src='pictures\Example_datasets.png' alt='example dataset'/>

#### Step 1 (semi-join)

First, let's merge the two tables with an inner join. We also print the first few rows of the genres_tracks variable. Since this is an inner join, the returned 'gid' column holds only values where both tables matched.


```python
genres_tracks = genres.merge(top_tracks, on='gid')
genres_tracks.head()
```
<img src='pictures\Step_1_semi_join.png' alt=Step_1 />

#### Step 2

For the next step in the technique, let's focus on this line of code. It uses a method called **isin()**, which compares every **'gid'** in the genres table to the **'gid'** values in the **genres_tracks** table and returns a Boolean Series. Each value tells us whether that genre appears in our merged genres_tracks table.

```python
genres['gid'].isin(genres_tracks['gid'])
```
<img src='pictures\Step_2.png' alt=step_2 />
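
To make the Boolean result concrete, here is a minimal sketch on made-up data. The rows and the `name` and `tid` columns are assumptions for illustration only, since the real genres and top_tracks tables are shown only as pictures.

```python
import pandas as pd

# Hypothetical toy tables standing in for the real genres and top_tracks data.
genres = pd.DataFrame({"gid": [1, 2, 3, 4],
                       "name": ["Rock", "Jazz", "Metal", "Blues"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12],
                           "gid": [1, 1, 3]})

# Step 1: inner join, so only gids present in both tables survive.
genres_tracks = genres.merge(top_tracks, on="gid")

# Step 2: one Boolean per row of genres, True where the gid is in the merged table.
mask = genres["gid"].isin(genres_tracks["gid"])
print(mask.tolist())
```

In this toy data only gids 1 and 3 have tracks, so the mask is True for the first and third genre.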

#### Step 3


To combine everything, we use that line of code to subset the genres table. The results are saved to top_genres and we print a few rows. We've completed a semi-join. These are the rows in the genres table that are also found in the top_tracks table. This is called a filtering join because we've filtered the genres table by what's in the top_tracks table.

```python
genres_tracks = genres.merge(top_tracks, on='gid')
top_genres = genres[genres['gid'].isin(genres_tracks['gid'])]
top_genres.head()
```

<img src='pictures\Step_3.png' alt=step_3 />
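
The three steps can also be wrapped in a small helper. This is only a sketch: `semi_join` is a hypothetical name chosen here, not a pandas function, and the toy tables are invented stand-ins for the real data.

```python
import pandas as pd

def semi_join(left, right, on):
    """Keep the rows of `left` whose key appears in `right` (left columns only)."""
    merged = left.merge(right, on=on)           # inner join is the default
    return left[left[on].isin(merged[on])]      # filter left by the matched keys

# Hypothetical toy tables; gid 1 has two tracks (a one-to-many relationship).
genres = pd.DataFrame({"gid": [1, 2, 3, 4],
                       "name": ["Rock", "Jazz", "Metal", "Blues"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12],
                           "gid": [1, 1, 3]})

top_genres = semi_join(genres, top_tracks, on="gid")
print(top_genres)
```

Because the filter is applied to the original left table rather than to the merged result, gid 1 appears once even though it matches two tracks.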

### Anti-join

An anti-join returns the observations in the left table that do not have a matching observation in the right table. Like a semi-join, it returns only the columns from the left table.

- **Anti join:**
    * Returns the left table, excluding the intersection
    * Returns only columns from the left table and **not** the right

Now, let's go back to our example. Instead of finding the genres that appear in the top tracks table, let's find the genres that do not.

#### Step 1 (anti-join)

The first step is to use a **left join**, which returns all of the rows from the left table. Here we'll use the **indicator argument** and set it to **True**. With indicator set to True, the merge method adds a column called **"_merge"** to the output. This column tells us the source of each row. For example, the first four rows found a match in both tables, whereas the last can only be found in the left table.

```python
genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
genres_tracks.head()
```

<img src='pictures\Anti_join_1.png' alt=anti_join />
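
As a rough illustration of what the "_merge" column holds, the sketch below runs the same left join on invented toy tables (the rows and the `name` and `tid` columns are assumptions, not the real data).

```python
import pandas as pd

# Hypothetical toy tables standing in for the real genres and top_tracks data.
genres = pd.DataFrame({"gid": [1, 2, 3, 4],
                       "name": ["Rock", "Jazz", "Metal", "Blues"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12],
                           "gid": [1, 1, 3]})

# Left join keeps every genre; indicator=True records where each row came from.
genres_tracks = genres.merge(top_tracks, on="gid", how="left", indicator=True)
print(genres_tracks[["gid", "tid", "_merge"]])
```

Rows for gids 2 and 4 are marked `left_only` and have a missing `tid`, while the matched rows are marked `both`.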

#### Step 2

Next, we use the **"loc"** accessor and **"_merge" column** to select the rows that only appeared in the left table and return only the **"gid"** column from the **genres_tracks** table. We now have a list of gids not in the tracks table.

```python
# Keep only the gids whose rows came from the left table alone
gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only', 'gid']
gid_list.head()
```

<img src='pictures\Anti_join_Step_2.png' alt=step_2 />

#### Step 3

In our final step we use the **isin() method** to filter for the rows with **gids** in our **gid_list**. Our output shows those genres not in the tracks table.

```python
genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)
gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only', 'gid']
non_top_genres = genres[genres['gid'].isin(gid_list)]
non_top_genres.head()
```

<img src='pictures\Anti_join_Step_3.png' alt=step_3 />
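
The anti-join steps can be collected into one helper in the same way. Again, this is a sketch under assumptions: `anti_join` is a hypothetical name, and the toy tables are made up for illustration.

```python
import pandas as pd

def anti_join(left, right, on):
    """Keep the rows of `left` whose key has no match in `right` (left columns only)."""
    merged = left.merge(right, on=on, how="left", indicator=True)
    unmatched = merged.loc[merged["_merge"] == "left_only", on]
    return left[left[on].isin(unmatched)]

# Hypothetical toy tables standing in for the real genres and top_tracks data.
genres = pd.DataFrame({"gid": [1, 2, 3, 4],
                       "name": ["Rock", "Jazz", "Metal", "Blues"]})
top_tracks = pd.DataFrame({"tid": [10, 11, 12],
                           "gid": [1, 1, 3]})

non_top_genres = anti_join(genres, top_tracks, on="gid")
print(non_top_genres)
```

With this toy data the result contains the two genres (gids 2 and 4) that have no track in top_tracks.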

### Concatenate DataFrames together vertically

So far in this course, we have only discussed how to merge two tables, which mainly grows them horizontally. But what if we wanted to grow them vertically? We can use the **concat method** to concatenate, or stick tables together, **vertically or horizontally**, but in this lesson, we'll focus on vertical concatenation.

* The pandas ```.concat()``` method can concatenate both vertically and horizontally.
    - ```axis=0``` (the default): vertical
    
#### Basic concatenation

* 3 different tables
                    <img style="float: right;" src='pictures\three_tables.png'  alt=three_tables />

* Same column names

* Table variable names:

    - ```inv_jan```(top)

    - ```inv_feb```(middle)

    - ```inv_mar```(bottom)

<br></br>
<br></br>
<br></br>
<br></br>
We can pass a list of tables to ```pd.concat()``` to combine them in the order they're passed in. To concatenate vertically, the axis argument should be set to 0, but 0 is the default, so we don't need to write it explicitly. The result is a vertically combined table. Notice that each table's original index values were retained.

<img style="float: right;" src='pictures\basic_concat.png' alt=basic />

```python                                               
pd.concat([inv_jan, inv_feb, inv_mar])
```                                             
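
A small sketch of the retained index, using invented two-row tables (the `iid` and `total` columns and their values are assumptions, not the real invoice data):

```python
import pandas as pd

# Hypothetical monthly invoice tables with identical column names.
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.98, 3.96]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [0.99, 5.94]})
inv_mar = pd.DataFrame({"iid": [5, 6], "total": [8.91, 1.98]})

# axis=0 is the default, so the tables are stacked vertically in list order.
combined = pd.concat([inv_jan, inv_feb, inv_mar])
print(combined.index.tolist())
```

Each input table contributes its own 0 and 1 labels, so the combined index repeats rather than counting up.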

<br></br>
<br></br>
<br></br>
<br></br>
<br></br>
<b>Ignoring the index</b>  

If the index contains no valuable information, then we can ignore it in the concat method by setting <b>ignore_index to True</b>. The result is that the index will go from 0 to n-1.


```python
pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True)
```
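
The effect of ignore_index can be checked on toy data. The tables below are hypothetical stand-ins with made-up values.

```python
import pandas as pd

# Hypothetical monthly invoice tables with identical column names.
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.98, 3.96]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [0.99, 5.94]})
inv_mar = pd.DataFrame({"iid": [5, 6], "total": [8.91, 1.98]})

# ignore_index=True discards the original labels and builds a fresh 0..n-1 index.
renumbered = pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True)
print(renumbered.index.tolist())
```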


<b>Setting labels to original tables</b>  
Now, suppose we want to associate a specific key with each of our three original tables. We can provide a list of labels to the **keys argument**. Make sure that the **ignore_index argument is False**, since you can't add a key and ignore the index at the same time. This results in a table with a multi-index, with the label on the first level.

```python
pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=False, keys=['jan', 'feb','mar'])
```
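
One practical use of the labels is pulling a single original table back out of the combined one. The sketch below assumes made-up two-row tables.

```python
import pandas as pd

# Hypothetical monthly invoice tables with identical column names.
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.98, 3.96]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [0.99, 5.94]})
inv_mar = pd.DataFrame({"iid": [5, 6], "total": [8.91, 1.98]})

labeled = pd.concat([inv_jan, inv_feb, inv_mar],
                    ignore_index=False,
                    keys=["jan", "feb", "mar"])

# The outer index level holds the key, the inner level the original row label.
print(labeled.index.nlevels)

# Select only the rows that came from inv_feb.
feb_rows = labeled.loc["feb"]
print(feb_rows)
```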

#### Concatenate tables with different column names

What if we need to combine tables that have different column names? By default, the concat method includes all of the columns from the tables it's combining, and rows get missing values in any column their source table lacks. The sort argument, if True, sorts the column names in the result alphabetically. If we only want the columns that the tables have in common, we set the join argument to "inner". Its default value is "outer", which is why concat includes all of the columns by default. Additionally, the sort argument has no effect when join equals "inner". The order of the columns will be the same as the input tables.
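
A minimal sketch of the two join modes follows. The tables `inv_jul` and `inv_aug` and all of their columns are hypothetical, invented so that the two tables share only some column names.

```python
import pandas as pd

# Hypothetical tables that share 'iid' and 'cid' but differ in one column each.
inv_jul = pd.DataFrame({"iid": [1, 2], "cid": [10, 11],
                        "bill_ctry": ["US", "DE"]})
inv_aug = pd.DataFrame({"iid": [3, 4], "cid": [12, 13],
                        "invoice_date": ["2009-08-01", "2009-08-02"]})

# Default join='outer': keep every column; sort=True orders the names alphabetically.
outer = pd.concat([inv_jul, inv_aug], sort=True)
print(list(outer.columns))

# join='inner': keep only the columns present in both tables.
inner = pd.concat([inv_jul, inv_aug], join="inner")
print(list(inner.columns))
```

In the outer result, the rows from `inv_aug` have missing values in `bill_ctry`, and the rows from `inv_jul` have missing values in `invoice_date`.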

### Using append Method

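The **.append()** method is a simplified version of the concat method. It supports the **ignore_index and sort arguments**. However, it does not support **keys or join**. Join is always set to outer. Note that ```DataFrame.append``` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same result is obtained with ```pd.concat()```.

```python
# Requires pandas < 2.0; DataFrame.append was removed in pandas 2.0
inv_jan.append([inv_feb, inv_mar], ignore_index=True, sort=True)
```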
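
Since ```DataFrame.append``` is no longer available from pandas 2.0 onward, the call can be expressed with ```pd.concat()``` instead. This is a sketch on made-up tables, not the course's own code.

```python
import pandas as pd

# Hypothetical monthly invoice tables with identical column names.
inv_jan = pd.DataFrame({"iid": [1, 2], "total": [1.98, 3.96]})
inv_feb = pd.DataFrame({"iid": [3, 4], "total": [0.99, 5.94]})
inv_mar = pd.DataFrame({"iid": [5, 6], "total": [8.91, 1.98]})

# Stack inv_jan first, then the other tables, with a fresh index and sorted columns.
appended = pd.concat([inv_jan, inv_feb, inv_mar], ignore_index=True, sort=True)
print(appended)
```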