A semi join filters the left table down to those observations that have a match in the right table. Like an inner join, it returns only the intersection, but unlike an inner join, it keeps only the columns from the left table. It also returns no duplicate rows, even in a one-to-many relationship.

In [3]:
import pandas as pd
genres = pd.read_pickle('genres.p')
genres.head()

Unnamed: 0,movie_id,genre
0,5,Crime
1,5,Comedy
2,11,Science Fiction
3,11,Action
4,11,Adventure


There doesn't appear to be a "top_tracks" table available, so the example code is shown as Markdown text rather than in a runnable code cell.

**genres_tracks = genres.merge(top_tracks, on='gid')<br>
print(genres_tracks.head())**

Since this is an inner join, only rows whose 'gid' matched in both tables are returned.

**genres['gid'].isin(genres_tracks['gid'])**

The isin() method checks each 'gid' in the genres table against the 'gid' values in the genres_tracks table. This tells us whether each genre appears in the merged genres_tracks table. The line returns a Boolean Series with one value per row of genres.
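As a minimal sketch of what that Boolean Series looks like, the snippet below uses small made-up tables. The column names other than 'gid' and all of the values are assumptions for illustration, since the real genres and top_tracks tables are not available here.

```python
import pandas as pd

# Hypothetical toy data; the real tables are not shown in this notebook.
genres = pd.DataFrame({
    "gid": [1, 2, 3, 4],
    "name": ["Rock", "Jazz", "Metal", "Pop"],
})
top_tracks = pd.DataFrame({
    "tid": [10, 11, 12],
    "gid": [1, 1, 3],  # genre 1 has two top tracks, genres 2 and 4 have none
})

# Inner join keeps only gids present in both tables.
genres_tracks = genres.merge(top_tracks, on="gid")

# One Boolean per row of genres: True where the gid survived the inner join.
mask = genres["gid"].isin(genres_tracks["gid"])
print(mask)
```

The mask has the same index as genres, which is what allows it to be used directly to subset that table.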

To combine everything, we use that line of code to subset the genres table. The results are saved to top_genres and we print a few rows. We've completed a semi join: these are the rows in the genres table that also have a match in the top_tracks table. This is called a filtering join because we've filtered the genres table by what's in the top_tracks table.

**genres_tracks = genres.merge(top_tracks, on='gid')<br>
top_genres = genres[genres['gid'].isin(genres_tracks['gid'])]<br>
print(top_genres.head())**
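The semi join steps above can be sketched as a small runnable example. The helper name `semi_join` and the toy data are hypothetical, chosen only to show that the result keeps the left table's columns and does not duplicate rows in a one-to-many relationship.

```python
import pandas as pd

def semi_join(left, right, key):
    """Hypothetical helper: rows of `left` whose `key` has a match in `right`."""
    merged = left.merge(right, on=key)        # inner join on the key
    return left[left[key].isin(merged[key])]  # filter left by matched keys

# Made-up data for illustration only.
genres = pd.DataFrame({
    "gid": [1, 2, 3, 4],
    "name": ["Rock", "Jazz", "Metal", "Pop"],
})
top_tracks = pd.DataFrame({
    "tid": [10, 11, 12],
    "gid": [1, 1, 3],  # gid 1 matches twice (one-to-many)
})

genres_tracks = genres.merge(top_tracks, on="gid")
top_genres = semi_join(genres, top_tracks, "gid")

print(len(genres_tracks))  # the inner join repeats gid 1 once per matching track
print(top_genres)          # the semi join lists each matching genre once
```

In this sketch, filtering with `genres['gid'].isin(top_tracks['gid'])` directly would select the same rows; the merge step is kept to mirror the walkthrough.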

An anti join returns the observations in the left table that do not have a matching observation in the right table. Like a semi join, it returns only the columns from the left table. Now, let's go back to our example. Instead of finding which genres are in the table of top tracks, let's use an anti join to find which genres are not.

The first step is to use a left join returning all of the rows from the left table. Here we'll use the indicator argument and set it to True. With indicator set to True, the merge method adds a column called "\_merge" to the output. This column tells the source of each row. For example, the first four rows found a match in both tables, whereas the last can only be found in the left table.

**genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)<br>
print(genres_tracks.head())**
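To see what the "_merge" column contains, here is a short sketch on made-up data (the values and the 'name' and 'tid' columns are assumptions, not the real tables). Rows with a match are labeled 'both', and rows found only in the left table are labeled 'left_only'.

```python
import pandas as pd

# Hypothetical toy data for illustration.
genres = pd.DataFrame({
    "gid": [1, 2, 3, 4],
    "name": ["Rock", "Jazz", "Metal", "Pop"],
})
top_tracks = pd.DataFrame({
    "tid": [10, 11, 12],
    "gid": [1, 1, 3],
})

# Left join keeps every row of genres; indicator=True adds the _merge column.
genres_tracks = genres.merge(top_tracks, on="gid", how="left", indicator=True)

# Unmatched genres get NaN for the right table's columns and 'left_only' in _merge.
print(genres_tracks[["gid", "name", "tid", "_merge"]])
```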

Next, we use the "loc" accessor and the "\_merge" column to select the rows that appeared only in the left table, returning only the "gid" column from the genres_tracks table. We now have a Series of gids that are not in the tracks table.

**gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only', 'gid']<br>
print(gid_list.head())**

In our final step we use the isin() method to filter for the rows with gids in our gid_list. Our output shows those genres not in the tracks table.

**genres_tracks = genres.merge(top_tracks, on='gid', how='left', indicator=True)<br>
gid_list = genres_tracks.loc[genres_tracks['_merge'] == 'left_only', 'gid']<br>
non_top_genres = genres[genres['gid'].isin(gid_list)]<br>
print(non_top_genres.head())**
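The three anti join steps can be put together in a runnable sketch. The helper name `anti_join` and the toy data are hypothetical and serve only to illustrate the technique on tables small enough to check by eye.

```python
import pandas as pd

def anti_join(left, right, key):
    """Hypothetical helper: rows of `left` whose `key` has no match in `right`."""
    # Step 1: left join with an indicator column describing each row's source.
    merged = left.merge(right, on=key, how="left", indicator=True)
    # Step 2: collect the keys that were found only in the left table.
    unmatched = merged.loc[merged["_merge"] == "left_only", key]
    # Step 3: filter the original left table down to those keys.
    return left[left[key].isin(unmatched)]

# Made-up data for illustration only.
genres = pd.DataFrame({
    "gid": [1, 2, 3, 4],
    "name": ["Rock", "Jazz", "Metal", "Pop"],
})
top_tracks = pd.DataFrame({
    "tid": [10, 11, 12],
    "gid": [1, 1, 3],
})

non_top_genres = anti_join(genres, top_tracks, "gid")
print(non_top_genres)  # genres with no track in top_tracks
```

Because the final filter is applied to the original left table, the result carries only the left table's columns, with no "\_merge" column or columns from the right table.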