# Joining Data with pandas

## Chapter II 

### Left Join

By default, the merge method performs an inner join, returning only the rows of data with matching values in the key columns of both tables.

In this lesson, we'll talk about the idea of a left join. A left join returns all rows of data from the left table and only those rows from the right table where key columns match.
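The difference between the default inner join and a left join can be sketched on two tiny tables. The frames `left` and `right` below are hypothetical toy data, not the chapter's datasets; only `merge`, `on`, and `how` come from pandas itself.

```python
import pandas as pd

# Hypothetical toy tables (assumed for illustration only)
left = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
right = pd.DataFrame({'id': [2, 3, 4], 'tagline': ['two', 'three', 'four']})

# Default merge: inner join, keeps only ids present in both tables
inner = left.merge(right, on='id')

# Left join: keeps every row of the left table
left_join = left.merge(right, on='id', how='left')

print(inner.shape)      # (2, 3)
print(left_join.shape)  # (3, 3)
print(left_join)
```

In this sketch, id 1 has no partner in `right`, so it disappears from the inner join but survives the left join with a missing tagline.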

In this chapter, we will use data from The Movie Database, a community-built movie database with info on thousands of movies, their casts, and popularity.

In [1]:
# Movies Table
import pandas as pd

movies = pd.read_pickle('datasets\\movies.p')
print(movies.head())
print(movies.shape) 

      id                 title  popularity release_date
0    257          Oliver Twist   20.415572   2005-09-23
1  14290  Better Luck Tomorrow    3.877036   2002-01-12
2  38365             Grown Ups   38.864027   2010-06-24
3   9672              Infamous    3.680896   2006-11-16
4  12819       Alpha and Omega   12.300789   2010-09-17
(4803, 4)


Our first table, named movies, holds information about individual movies, such as the title and popularity. Each movie is given an ID number. The table has 4803 rows of data.

In [2]:
# Tagline Table

taglines = pd.read_pickle('datasets\\taglines.p')
print(taglines.head())
print(taglines.shape)

       id                                         tagline
0   19995                     Enter the World of Pandora.
1     285  At the end of the world, the adventure begins.
2  206647                           A Plan No One Escapes
3   49026                                 The Legend Ends
4   49529            Lost in our world, found in another.
(3955, 2)


Our second table is named taglines. It contains a movie ID number and the tagline for the movie. Notice that this table has 3955 rows of data, compared to the movies table, which has 4803 rows.

In [3]:
# Merge with left join

movies_taglines = movies.merge(taglines, on='id', how='left')
movies_taglines.head()

Unnamed: 0,id,title,popularity,release_date,tagline
0,257,Oliver Twist,20.415572,2005-09-23,
1,14290,Better Luck Tomorrow,3.877036,2002-01-12,Never underestimate an overachiever.
2,38365,Grown Ups,38.864027,2010-06-24,Boys will be boys. . . some longer than others.
3,9672,Infamous,3.680896,2006-11-16,There's more to the story than you know
4,12819,Alpha and Omega,12.300789,2010-09-17,A Pawsome 3D Adventure


To merge these tables with a left join, we call the merge method with the **how** argument set to **'left'**.

By default, this value is set to 'inner', which is why we did not pass this argument in the first chapter.

The result of the merge is a table with all of the rows from the movies table and a tagline value wherever the ID column matches in both tables. Wherever there isn't a matching ID, a null value (NaN) is entered for the tagline.
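One way to inspect which left-table rows found no match is sketched below. It uses hypothetical toy frames (`movies_toy`, `taglines_toy`) with made-up titles, and the `indicator=True` option of `merge`, which adds a `_merge` column. This is an optional extra for checking a join, not something the chapter's own cells rely on.

```python
import pandas as pd

# Hypothetical toy tables; ids and titles are illustrative only
movies_toy = pd.DataFrame({'id': [10, 20, 30],
                           'title': ['Movie A', 'Movie B', 'Movie C']})
taglines_toy = pd.DataFrame({'id': [20, 30],
                             'tagline': ['Tagline B', 'Tagline C']})

# Left join; indicator=True records where each row came from
merged = movies_toy.merge(taglines_toy, on='id', how='left', indicator=True)

# Rows without a match have NaN in the right-hand column ...
missing = merged['tagline'].isna().sum()
# ... and are labelled 'left_only' in the _merge column
unmatched = (merged['_merge'] == 'left_only').sum()

print(merged)
print(missing, unmatched)
```

Both counts agree here because the toy tagline column has no nulls of its own. If the right table could already contain nulls, the `_merge` column would be the safer check.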

In [4]:
# Number of rows returned

print(movies_taglines.shape)

# The result has 4803 rows of data, the same as the left table.

(4803, 5)


In [12]:
# Exercises I 

# Counting missing rows with left join
# DataFrames ==> movies and financials
financials = pd.read_pickle('datasets\\financials.p')

# Merge movies and financials with a left join
movies_financials = movies.merge(financials, on='id', how='left')
# Count the number of rows in the budget column that are missing
number_of_missing_fin = movies_financials['budget'].isnull().sum()
# Print the number of movies missing financials
print(number_of_missing_fin)

1574


### Other Joins

The merge method supports two other join types.

* **Right Join**: It returns all of the rows from the right table and includes only those rows from the left table that have matching values. It is the mirror opposite of a left join.
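The "mirror opposite" relationship can be checked on toy data. This is a minimal sketch with hypothetical frames `a` and `b`: a right join of `a` with `b` should contain the same rows as a left join of `b` with `a`, apart from column order.

```python
import pandas as pd

# Hypothetical toy tables (assumed for illustration only)
a = pd.DataFrame({'id': [1, 2, 3], 'a_val': ['a1', 'a2', 'a3']})
b = pd.DataFrame({'id': [2, 3, 4], 'b_val': ['b2', 'b3', 'b4']})

# Right join: all rows of b, matching rows of a
right_join = a.merge(b, on='id', how='right')

# Mirror image: left join with the tables swapped
mirrored = b.merge(a, on='id', how='left')

# Align column order and row order before comparing
cols = ['id', 'a_val', 'b_val']
right_sorted = right_join[cols].sort_values('id').reset_index(drop=True)
mirror_sorted = mirrored[cols].sort_values('id').reset_index(drop=True)
same = right_sorted.equals(mirror_sorted)

print(right_sorted)
print(same)
```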

In [17]:
# Looking at data

# DataFrame
movies_to_genres = pd.read_pickle('datasets\\movie_to_genres.p')
# Subsetting the DataFrame to only include 'TV Movies'
tv_genre = movies_to_genres[movies_to_genres['genre'] == 'TV Movie']
# Show the tv_genre
tv_genre

Unnamed: 0,movie_id,genre
4998,10947,TV Movie
5994,13187,TV Movie
7443,22488,TV Movie
10061,78814,TV Movie
10790,153397,TV Movie
10835,158150,TV Movie
11096,205321,TV Movie
11282,231617,TV Movie


In [18]:
# Filtering the data

# Subsetting the data and assigning to the variable
m = movies_to_genres['genre'] == 'TV Movie'
# Filter the data
tv_genre = movies_to_genres[m]
# Show the DataFrame
tv_genre

Unnamed: 0,movie_id,genre
4998,10947,TV Movie
5994,13187,TV Movie
7443,22488,TV Movie
10061,78814,TV Movie
10790,153397,TV Movie
10835,158150,TV Movie
11096,205321,TV Movie
11282,231617,TV Movie


In [19]:
# Data to Merge
# Movies and TV Movies

movies.head()

Unnamed: 0,id,title,popularity,release_date
0,257,Oliver Twist,20.415572,2005-09-23
1,14290,Better Luck Tomorrow,3.877036,2002-01-12
2,38365,Grown Ups,38.864027,2010-06-24
3,9672,Infamous,3.680896,2006-11-16
4,12819,Alpha and Omega,12.300789,2010-09-17


Our goal is to merge tv_genre with the movies table. We will set movies as our left table and merge it with the tv_genre table. We use a right join to check that our movies table is not missing data.

In addition to showing a right join, this example also allows us to look at another feature. Notice that the column with the movie ID number is named **id** in the movies table, while in the **tv_genre** table it is named **movie_id**.
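When the key columns have different names, `merge` accepts `left_on` and `right_on` instead of `on`. The sketch below uses hypothetical toy frames (`movies_toy`, `genres_toy`) to show that both key columns are kept in the result, and that a right join exposes keys missing from the left table.

```python
import pandas as pd

# Hypothetical toy tables; the key is 'id' on the left, 'movie_id' on the right
movies_toy = pd.DataFrame({'id': [1, 2, 3], 'title': ['A', 'B', 'C']})
genres_toy = pd.DataFrame({'movie_id': [2, 3, 4],
                           'genre': ['TV Movie', 'TV Movie', 'TV Movie']})

# Right join on differently named key columns
merged = movies_toy.merge(genres_toy, how='right',
                          left_on='id', right_on='movie_id')

print(merged.columns.tolist())  # ['id', 'title', 'movie_id', 'genre']

# Rows whose title is null exist in genres_toy but not in movies_toy
not_in_movies = merged[merged['title'].isna()]
print(not_in_movies)
```

An empty `not_in_movies` would suggest that the left table covers every key in the right table. In this toy case, `movie_id` 4 is reported as missing.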

In [20]:
# Merge with right join

tv_movies = movies.merge(tv_genre, how='right', left_on='id', right_on='movie_id')
# Notice the two new arguments 'left_on=' and 'right_on='
tv_movies.head()

Unnamed: 0,id,title,popularity,release_date,movie_id,genre
0,10947,High School Musical,16.536374,2006-01-20,10947,TV Movie
1,13187,A Charlie Brown Christmas,8.701183,1965-12-09,13187,TV Movie
2,22488,Love's Abiding Joy,1.128559,2006-10-06,22488,TV Movie
3,78814,We Have Your Husband,0.102003,2011-11-12,78814,TV Movie
4,153397,Restless,0.812776,2012-12-07,153397,TV Movie


### Outer Join

Our last type of join is called an outer join. An outer join returns all of the rows from both tables, regardless of whether there is a match between the tables.
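A minimal sketch of an outer join on hypothetical toy data follows. Because both frames have a `genre` column, the `suffixes` argument is used to tell the two apart. Row order is not asserted here, since it can differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy tables (assumed for illustration only)
fam = pd.DataFrame({'movie_id': [1, 2, 3], 'genre': ['Family'] * 3})
com = pd.DataFrame({'movie_id': [3, 4], 'genre': ['Comedy'] * 2})

# Outer join: every movie_id from either table appears once
outer = fam.merge(com, on='movie_id', how='outer',
                  suffixes=('_fam', '_com'))

# Rows matched in both tables have non-null values on both sides
in_both = outer['genre_fam'].notna() & outer['genre_com'].notna()

print(outer.sort_values('movie_id'))
print(len(outer), in_both.sum())
```

The row count equals matched keys plus keys found only on the left plus keys found only on the right (1 + 2 + 1 in this toy case).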

In [23]:
# Datasets for outer join

# Filter the movies_to_genres table into two small tables, family and comedy
# Family table
m = movies_to_genres['genre'] == 'Family'
family = movies_to_genres[m].head(3)
# Comedy table
m = movies_to_genres['genre'] == 'Comedy'
comedy = movies_to_genres[m].head(3)

# Merge with outer join
family_comedy = family.merge(comedy, on='movie_id', how='outer', suffixes=('_fam', '_com'))
# Show the table
family_comedy

Unnamed: 0,movie_id,genre_fam,genre_com
0,12,Family,
1,35,Family,Comedy
2,105,Family,
3,5,,Comedy
4,13,,Comedy


In [32]:
# Exercises II 
# Right join to find unique movies

# Creating two tables called 'scifi_movies' and 'action_movies'
scifi_movies = movies_to_genres[movies_to_genres['genre'] == 'Science Fiction']
action_movies = movies_to_genres[movies_to_genres['genre'] == 'Action']

# Merge action_movies to scifi_movies with right join, and add the suffixes
action_scifi = action_movies.merge(scifi_movies, on='movie_id', how='right',
                                   suffixes=('_act', '_sci'))

# From action_scifi, select only the rows where the genre_act column is null
scifi_only = action_scifi[action_scifi['genre_act'].isnull()]

# Merge the movies and scifi_only tables with an inner join
movies_and_scifi_only = movies.merge(scifi_only, left_on='id', right_on='movie_id')

movies_and_scifi_only

Unnamed: 0,id,title,popularity,release_date,movie_id,genre_act,genre_sci
0,18841,The Lost Skeleton of Cadavra,1.680525,2001-09-12,18841,,Science Fiction
1,26672,The Thief and the Cobbler,2.439184,1993-09-23,26672,,Science Fiction
2,15301,Twilight Zone: The Movie,12.902975,1983-06-24,15301,,Science Fiction
3,8452,The 6th Day,18.447479,2000-11-17,8452,,Science Fiction
4,1649,Bill & Ted's Bogus Journey,11.349664,1991-07-19,1649,,Science Fiction
...,...,...,...,...,...,...,...
253,245703,Midnight Special,32.717853,2016-02-18,245703,,Science Fiction
254,3509,A Scanner Darkly,26.093043,2006-05-25,3509,,Science Fiction
255,42188,Never Let Me Go,30.983397,2010-09-15,42188,,Science Fiction
256,18045,The Dark Hours,1.428483,2005-03-11,18045,,Science Fiction


### Merging a table to itself

So when would you ever need to merge a table to itself? The table used here is called sequels and has three columns: id, title, and sequel. The sequel value is the id of the movie that is the sequel to the movie in that row.

If we would like to see a table with the movies and the corresponding sequel movie in one row of the table, we will need to merge the table to itself. In the left table, the sequel ID for Toy Story of 863 is matched with 863 in the ID column of the right table. Similarly, Toy Story 2 of the left table is matched with Toy Story 3 in the right table. We will talk more about this later, but the merge is an inner join. Therefore, we do not see Avatar and Titanic because they do not have sequels.
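The matching described above can be sketched with a small hand-built table. The frame `sequels_toy` is a hypothetical stand-in for the real sequels data, reusing a few of the ids and titles mentioned in this chapter. Movies without a sequel get a missing value in the `sequel` column.

```python
import pandas as pd

# Hypothetical stand-in for the sequels table (five rows only)
sequels_toy = pd.DataFrame({
    'id': [19995, 862, 863, 597, 10193],
    'title': ['Avatar', 'Toy Story', 'Toy Story 2', 'Titanic', 'Toy Story 3'],
    'sequel': [None, 863, 10193, None, None],  # None becomes NaN
})

# Self merge: the left table's 'sequel' is matched to the right table's 'id'
pairs = sequels_toy.merge(sequels_toy, left_on='sequel', right_on='id',
                          suffixes=('_org', '_seq'))

print(pairs[['title_org', 'title_seq']])
```

Since this is an inner join, rows with no sequel (Avatar, Titanic, and Toy Story 3 in this toy table) do not appear in the result.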

In [36]:
# Merging a table to itself
# DataFrame ==> sequels
sequels = pd.read_pickle('datasets\\sequels.p')

# Self merging
original_sequels = sequels.merge(sequels, left_on='sequel', right_on='id', suffixes=('_org', '_seq'))
# Show the table
original_sequels.head()


Unnamed: 0,id_org,title_org,sequel_org,id_seq,title_seq,sequel_seq
0,862,Toy Story,863,863,Toy Story 2,10193.0
1,863,Toy Story 2,10193,10193,Toy Story 3,
2,675,Harry Potter and the Order of the Phoenix,767,767,Harry Potter and the Half-Blood Prince,
3,121,The Lord of the Rings: The Two Towers,122,122,The Lord of the Rings: The Return of the King,
4,120,The Lord of the Rings: The Fellowship of the Ring,121,121,The Lord of the Rings: The Two Towers,122.0


In [38]:
# Continue format results
# We select only these two columns to confirm that we have successfully merged the table to itself
original_sequels[['title_org', 'title_seq']].head()

Unnamed: 0,title_org,title_seq
0,Toy Story,Toy Story 2
1,Toy Story 2,Toy Story 3
2,Harry Potter and the Order of the Phoenix,Harry Potter and the Half-Blood Prince
3,The Lord of the Rings: The Two Towers,The Lord of the Rings: The Return of the King
4,The Lord of the Rings: The Fellowship of the Ring,The Lord of the Rings: The Two Towers


Pausing here is a good time to highlight again that when merging a table to itself, we can use the **different types of joins** we have already reviewed. 

Let's take the same merge from earlier but make it a **left join**. In the merge method, the **'how'** argument is set to **'left'** instead of the default **'inner'**. Now the resulting table will show all of our original movie info. If the sequel movie exists in the table, it will fill out the rest of the row.


If you compare this to our earlier merge, you now see movies like Avatar and Titanic in the result set.

In [39]:
# Merging a table to itself with a left join
original_sequels = sequels.merge(sequels, left_on='sequel', right_on='id', how='left', suffixes=('_org', '_seq'))

original_sequels.head()

Unnamed: 0,id_org,title_org,sequel_org,id_seq,title_seq,sequel_seq
0,19995,Avatar,,,,
1,862,Toy Story,863.0,863.0,Toy Story 2,10193.0
2,863,Toy Story 2,10193.0,10193.0,Toy Story 3,
3,597,Titanic,,,,
4,24428,The Avengers,,,,


### When to merge a table to itself

Common situations:

* **Hierarchical Relationships**
* **Sequential Relationships**
* **Graph data**

You might need to merge a table to itself when working with tables that have a hierarchical relationship, like employee and manager. You might use this on sequential relationships such as logistic movements. Graph data, such as networks of friends, might also require this technique.
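As an illustration of the hierarchical case, here is a minimal sketch of an employee and manager lookup. The `employees` table, its column names, and the names in it are all hypothetical. Each row stores the `emp_id` of that person's manager, and a left self merge attaches the manager's name.

```python
import pandas as pd

# Hypothetical employee table; manager_id points at another row's emp_id
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4],
    'name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'manager_id': [None, 1, 1, 2],  # Ana has no manager
})

# Left self merge so that employees without a manager are kept
with_manager = employees.merge(employees, left_on='manager_id',
                               right_on='emp_id', how='left',
                               suffixes=('_emp', '_mgr'))

print(with_manager[['name_emp', 'name_mgr']])
```

A left join is chosen here so that the top of the hierarchy (the row without a manager) stays in the result. An inner join would drop it.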

In [44]:
# Exercise III
# Self Join

# DataFrame ==> crews.p
crews = pd.read_pickle('datasets\\crews.p')

# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on='id', suffixes=('_dir', '_crew'))
# Create a Boolean index to select rows pairing a director with a non-director crew member
boolean_filter = ((crews_self_merged['job_dir'] == 'Director') & (crews_self_merged['job_crew'] != 'Director'))
direct_crews = crews_self_merged[boolean_filter]
# Show the head
direct_crews.head()

Unnamed: 0,id,department_dir,job_dir,name_dir,department_crew,job_crew,name_crew
156,19995,Directing,Director,James Cameron,Editing,Editor,Stephen E. Rivkin
157,19995,Directing,Director,James Cameron,Sound,Sound Designer,Christopher Boyes
158,19995,Directing,Director,James Cameron,Production,Casting,Mali Finn
160,19995,Directing,Director,James Cameron,Writing,Writer,James Cameron
161,19995,Directing,Director,James Cameron,Art,Set Designer,Richard F. Mays


### Merging on indexes

So far, we've only looked at merging two tables together using their columns. In this lesson, we'll discuss how to merge tables using their indexes. Often, the DataFrame indexes are given a unique id that we can use when merging two tables together.

There are different methods to set the index of a table, but if our data starts off in a CSV file, we can use the **index_col** argument of the read_csv method. This lesson will not focus on how to set a table index, but how to use that index to merge two tables together.
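The sketch below shows one way this could look: two small CSV texts (hypothetical data, held in memory with `io.StringIO` rather than real files) are read with `index_col='id'` and then merged on that index. Two spellings are shown, `on='id'`, which accepts index level names as well as column names, and the explicit `left_index=True, right_index=True`.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for files on disk
movies_csv = io.StringIO("id,title\n1,A\n2,B\n3,C\n")
ratings_csv = io.StringIO("id,vote_average\n1,6.7\n3,5.3\n")

# index_col makes the 'id' column the index of each table
movies_idx = pd.read_csv(movies_csv, index_col='id')
ratings_idx = pd.read_csv(ratings_csv, index_col='id')

# 'on' can name an index level that exists in both tables
by_name = movies_idx.merge(ratings_idx, on='id', how='left')

# Equivalent explicit form: join index to index
by_index = movies_idx.merge(ratings_idx, left_index=True,
                            right_index=True, how='left')

print(by_index)
```

In both results, the movie with id 2 has no rating, so its `vote_average` is NaN.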

In [50]:
# Exercises IV
# Index merge for movie ratings

# DataFrames ==> movies, ratings
ratings = pd.read_pickle('datasets\\ratings.p')
# Left-join the ratings table onto the movies table using the shared 'id' key
movies_ratings = movies.merge(ratings, on='id', how='left')
# Show the head()
movies_ratings.head()

Unnamed: 0,id,title,popularity,release_date,vote_average,vote_count
0,257,Oliver Twist,20.415572,2005-09-23,6.7,274.0
1,14290,Better Luck Tomorrow,3.877036,2002-01-12,6.5,27.0
2,38365,Grown Ups,38.864027,2010-06-24,6.0,1705.0
3,9672,Infamous,3.680896,2006-11-16,6.4,60.0
4,12819,Alpha and Omega,12.300789,2010-09-17,5.3,124.0


In [59]:
# Do sequels earn more
# DataFrames ==> sequels, financials

# Left-join the financials table onto the sequels table using the shared 'id' key
sequels_fin = sequels.merge(financials, on='id', how='left')
# Self merge as an inner join with suffixes, using left_on='sequel' and right_on='id'
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', right_on='id', suffixes=('_org', '_seq'))
# Subtract revenue_org from revenue_seq to get the revenue difference
orig_seq['diff'] = orig_seq['revenue_seq'] - orig_seq['revenue_org']
# Select the title_org, title_seq and diff
titles_diff = orig_seq[['title_org', 'title_seq', 'diff']]
# Print the first rows of the sorted titles_diff
titles_diff.sort_values('diff', ascending=False).head()

Unnamed: 0,title_org,title_seq,diff
28,Jurassic Park III,Jurassic World,1144748000.0
26,Batman Begins,The Dark Knight,630339800.0
11,Iron Man 2,Iron Man 3,591506700.0
1,Toy Story 2,Toy Story 3,569602800.0
14,Quantum of Solace,Skyfall,522470300.0
