# Merging Tables With Different Join Types

## Counting missing rows with left join
The Movie Database is supported by volunteers going out into the world, collecting data, and entering it into the database. This includes financial data, such as movie budget and revenue. If you wanted to know which movies are still missing data, you could use a left join to identify them. Practice using a left join by merging the `movies` table and the `financials` table.

What column is likely the best column to merge the two tables on?

* on='budget'

* on='popularity'

* **on='id'**

* Merge the `movies` table, as the left table, with the `financials` table using a left join, and save the result to `movies_financials`.

In [1]:
import pandas as pd
financials = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/financials.p')
movies = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/movies.p')

In [2]:
# Merge the movies table with the financials table with a left join
movies_financials = movies.merge(financials, on='id', how='left')

* Count the number of rows in `movies_financials` with a null value in the budget column.

In [3]:
# Count the number of rows in the budget column that are missing
number_of_missing_fin = movies_financials['budget'].isnull().sum()

# Print the number of movies missing financials
print(number_of_missing_fin)

1574


Great job! You used a left join to find out which rows in the financials table were missing data. When performing a left join, the .merge() method returns a row full of null values for columns in the right table if the key column does not have a matching value in both tables. We see that there are at least 1,500 rows missing data. Wow! That sounds like a lot of work.

## Enriching a dataset
Setting `how='left'` with the `.merge()` method is a useful technique for enriching or enhancing a dataset with additional information from a different table. In this exercise, you will start off with a sample of movie data from the movie series Toy Story. Your goal is to enrich this data by adding the marketing tag line for each movie. You will compare the results of a left join versus an inner join.

* Merge `toy_story` and `taglines` on the id column with a left join, and save the result as `toystory_tag`.

In [12]:
toy_story = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/movies.p')
taglines = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/taglines.p')

In [13]:
# Merge the toy_story and taglines tables with a left join
toystory_tag = toy_story.merge(taglines, on='id', how='left')

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

         id                 title  popularity release_date  \
0       257          Oliver Twist   20.415572   2005-09-23   
1     14290  Better Luck Tomorrow    3.877036   2002-01-12   
2     38365             Grown Ups   38.864027   2010-06-24   
3      9672              Infamous    3.680896   2006-11-16   
4     12819       Alpha and Omega   12.300789   2010-09-17   
...     ...                   ...         ...          ...   
4798   3089             Red River    5.344815   1948-08-26   
4799  11934   The Hudsucker Proxy   14.188982   1994-03-11   
4800  13807                Exiled    8.486390   2006-09-06   
4801  73873          Albert Nobbs    7.802245   2011-12-21   
4802  11622   Blast from the Past    8.737058   1999-02-12   

                                                tagline  
0                                                   NaN  
1                  Never underestimate an overachiever.  
2       Boys will be boys. . . some longer than others.  
3               There's

* With `toy_story` as the left table, merge to it taglines on the id column with an inner join, and save as `toystory_tag`.

In [14]:
# Merge the toy_story and taglines tables with a inner join
toystory_tag = toy_story.merge(taglines, on='id', how='inner')

# Print the rows and shape of toystory_tag
print(toystory_tag)
print(toystory_tag.shape)

         id                 title  popularity release_date  \
0     14290  Better Luck Tomorrow    3.877036   2002-01-12   
1     38365             Grown Ups   38.864027   2010-06-24   
2      9672              Infamous    3.680896   2006-11-16   
3     12819       Alpha and Omega   12.300789   2010-09-17   
4     49529           John Carter   43.926995   2012-03-07   
...     ...                   ...         ...          ...   
3950  12281            Mean Creek    8.519202   2004-01-15   
3951   3089             Red River    5.344815   1948-08-26   
3952  11934   The Hudsucker Proxy   14.188982   1994-03-11   
3953  73873          Albert Nobbs    7.802245   2011-12-21   
3954  11622   Blast from the Past    8.737058   1999-02-12   

                                                tagline  
0                  Never underestimate an overachiever.  
1       Boys will be boys. . . some longer than others.  
2               There's more to the story than you know  
3                      

## How many rows with a left join?
Select the true statement about left joins.

Try running the following code statements in the IPython shell.
```python
left_table.merge(one_to_one, on='id', how='left').shape
left_table.merge(one_to_many, on='id', how='left').shape
```
Note that the `left_table` starts out with 4 rows.

* The output of a one-to-one merge with a left join will have more rows than the left table.

* The output of a one-to-one merge with a left join will have fewer rows than the left table.

* **The output of a one-to-many merge with a left join will have greater than or equal rows than the left table.**

That's correct! A left join will return all of the rows from the left table. If those rows in the left table match multiple rows in the right table, then all of those rows will be returned. Therefore, the returned rows must be equal to if not greater than the left table. Knowing what to expect is useful in troubleshooting any suspicious merges.

## Right join to find unique movies
Most of the recent big-budget science fiction movies can also be classified as action movies. You are given a table of science fiction movies called scifi_movies and another table of action movies called action_movies. Your goal is to find which movies are considered only science fiction movies. Once you have this table, you can merge the movies table in to see the movie names. Since this exercise is related to science fiction movies, use a right join as your superhero power to solve this problem.

The `movies`, `scifi_movies`, and `action_movies` tables have been loaded for you.

* Merge `action_movies` and `scifi_movies` tables with a right join on `movie_id`. Save the result as `action_scifi`.

In [17]:
genres = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/movie_to_genres.p')

In [21]:
scifi_movies = genres.query("genre == 'Science Fiction'")
action_movies = genres.query("genre == 'Action'")

In [22]:
# Merge action_movies to scifi_movies with right join
action_scifi = action_movies.merge(scifi_movies, on='movie_id', how='right')

* Update the merge to add suffixes, where `'_act'` and `'_sci'` are suffixes for the left and right tables, respectively.

In [23]:
# Merge action_movies to scifi_movies with right join
action_scifi = action_movies.merge(scifi_movies, on='movie_id', how='right',
                                   suffixes=('_act','_sci'))

# Print the first few rows of action_scifi to see the structure
print(action_scifi.head())

   movie_id genre_act        genre_sci
0        11    Action  Science Fiction
1        18    Action  Science Fiction
2        19       NaN  Science Fiction
3        38       NaN  Science Fiction
4        62       NaN  Science Fiction


* From `action_scifi`, subset only the rows where the `genre_act` column is null.

In [25]:
# From action_scifi, select only the rows where the genre_act column is null
scifi_only = action_scifi[action_scifi['genre_act'].isnull()]

* Merge movies and `scifi_only` using the id column in the left table and the `movie_id` column in the right table with an inner join.

In [26]:
# Merge the movies and scifi_only tables with an inner join
movies_and_scifi_only = movies.merge(scifi_only, how='inner',
                                     left_on='id', right_on='movie_id')

# Print the first few rows and shape of movies_and_scifi_only
print(movies_and_scifi_only.head())
print(movies_and_scifi_only.shape)

      id                         title  popularity release_date  movie_id  \
0  18841  The Lost Skeleton of Cadavra    1.680525   2001-09-12     18841   
1  26672     The Thief and the Cobbler    2.439184   1993-09-23     26672   
2  15301      Twilight Zone: The Movie   12.902975   1983-06-24     15301   
3   8452                   The 6th Day   18.447479   2000-11-17      8452   
4   1649    Bill & Ted's Bogus Journey   11.349664   1991-07-19      1649   

  genre_act        genre_sci  
0       NaN  Science Fiction  
1       NaN  Science Fiction  
2       NaN  Science Fiction  
3       NaN  Science Fiction  
4       NaN  Science Fiction  
(258, 7)


Well done, right join to the rescue! You found over 250 action only movies by merging action_movies and scifi_movies using a right join. With this, you were able to find the rows not found in the action_movies table. Additionally, you used the left_on and right_on arguments to merge in the movies table. Wow! You are a superhero.

## Popular genres with right join
What are the genres of the most popular movies? To answer this question, you need to merge data from the movies and `movie_to_genres` tables. In a table called pop_movies, the top 10 most popular movies in the movies table have been selected. To ensure that you are analyzing all of the popular movies, merge it with the `movie_to_genres` table using a right join. To complete your analysis, count the number of different genres. Also, the two tables can be merged by the movie ID. However, in `pop_movies` that column is called id, and in movies_to_genres it's called `movie_id`.

In [63]:
import matplotlib.pyplot as plt
pop_movies = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/ratings.p')
movie_to_genres = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/movies.p')

In [None]:
genres_movies = movie_to_genres.merge(pop_movies, how='right', 
                                      left_on='movie_id', 
                                      right_on='id')

# Count the number of genres
genre_count = genres_movies.groupby('genre').agg({'id':'count'})

# Plot a bar chart of the genre_count
genre_count.plot(kind='bar')
plt.show()



Nice job! The right join ensured that you were analyzing all of the pop_movies. You see from the results that adventure and action are the most popular genres.

## Using outer join to select actors
One cool aspect of using an outer join is that, because it returns all rows from both merged tables and null where they do not match, you can use it to find rows that do not have a match in the other table. To try for yourself, you have been given two tables with a list of actors from two popular movies: Iron Man 1 and Iron Man 2. Most of the actors played in both movies. Use an outer join to find actors who did not act in both movies.

The Iron Man 1 table is called `iron_1_actors`, and Iron Man 2 table is called `iron_2_actors`. Both tables have been loaded for you and a few rows printed so you can see the structure.
![](https://assets.datacamp.com/production/repositories/5486/datasets/c5d02ebba511e90ae132f89ff091e6729c040bd2/noJoin.png)

* Save to `iron_1_and_2` the merge of `iron_1_actors` (left) with `iron_2_actors` tables with an outer join on the id column, and set suffixes to ('_1','_2').
* Create an index that returns True if `name_1` or `name_2 `are null, and False otherwise.

In [None]:
# Merge iron_1_actors to iron_2_actors on id with outer join using suffixes
iron_1_and_2 = iron_1_actors.merge(iron_2_actors, 
                                     on='id', 
                                     how='outer', 
                                     suffixes=('_1','_2'))

# Create an index that returns true if name_1 or name_2 are null
m = ((iron_1_and_2['name_1'].isnull()) | 
     (iron_1_and_2['name_2'].isnull()))

# Print the first few rows of iron_1_and_2
print(iron_1_and_2[m].head())

Nice job! Using an outer join, you were able to pick only those rows where the actor played in only one of the two movies.

## Self join
Merging a table to itself can be useful when you want to compare values in a column to other values in the same column. In this exercise, you will practice this by creating a table that for each movie will list the movie director and a member of the crew on one row. You have been given a table called crews, which has columns id, job, and name. First, merge the table to itself using the movie ID. This merge will give you a larger table where for each movie, every job is matched against each other. Then select only those rows with a director in the left table, and avoid having a row where the director's job is listed in both the left and right tables. This filtering will remove job combinations that aren't with the director.

* To a variable called `crews_self_merged`, merge the crews table to itself on the id column using an inner join, setting the suffixes to `'_dir'` and `'_crew'` for the left and right tables respectively.

In [67]:
crews = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/crews.p')

In [68]:
# Merge the crews table to itself
crews_self_merged = crews.merge(crews, on='id', how='inner',
                                suffixes=('_dir','_crew'))

* Create a Boolean index, named `boolean_filter`, that selects rows from the left table with the job of 'Director' and avoids rows with the job of 'Director' in the right table.

In [69]:
# Create a Boolean index to select the appropriate rows
boolean_filter = ((crews_self_merged['job_dir'] == 'Director') & 
                  (crews_self_merged['job_crew'] != 'Director'))
direct_crews = crews_self_merged[boolean_filter]

* Use the `.head()` method to print the first few rows of `direct_crews`.

In [70]:
# Print the first few rows of direct_crews
print(direct_crews.head())

        id department_dir   job_dir       name_dir department_crew  \
156  19995      Directing  Director  James Cameron         Editing   
157  19995      Directing  Director  James Cameron           Sound   
158  19995      Directing  Director  James Cameron      Production   
160  19995      Directing  Director  James Cameron         Writing   
161  19995      Directing  Director  James Cameron             Art   

           job_crew          name_crew  
156          Editor  Stephen E. Rivkin  
157  Sound Designer  Christopher Boyes  
158         Casting          Mali Finn  
160          Writer      James Cameron  
161    Set Designer    Richard F. Mays  


Great job! By merging the table to itself, you compared the value of the director from the jobs column to other values from the jobs column. With the output, you can quickly see different movie directors and the people they worked with in the same movie.

## How does pandas handle self joins?
Select the false statement about merging a table to itself.


* You can merge a table to itself with a right join.

* Merging a table to itself can allow you to compare values in a column to other values in the same column.

* **The pandas module limits you to one merge where you merge a table to itself. You cannot repeat this process over and over.**

* Merging a table to itself is like working with two separate tables.

Perfect! This statement is false. pandas treats a merge of a table to itself the same as any other merge. Therefore, it does not limit you from chaining multiple `.merge()` methods together.

## Index merge for movie ratings
To practice merging on indexes, you will merge movies and a table called ratings that holds info about movie ratings. Make sure your merge returns all of the rows from the movies table and not all the rows of ratings table need to be included in the result.

* Merge `movies` and `ratings` on the index and save to a variable called `movies_ratings`, ensuring that all of the rows from the movies table are returned.

In [71]:
ratings = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/ratings.p')
movies = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/movies.p')

In [74]:
# Merge to the movies table the ratings table on the index
movies_ratings = movies.merge(ratings, on='id', how='left')

# Print the first few rows of movies_ratings
print(movies_ratings.head())

      id                 title  popularity release_date  vote_average  \
0    257          Oliver Twist   20.415572   2005-09-23           6.7   
1  14290  Better Luck Tomorrow    3.877036   2002-01-12           6.5   
2  38365             Grown Ups   38.864027   2010-06-24           6.0   
3   9672              Infamous    3.680896   2006-11-16           6.4   
4  12819       Alpha and Omega   12.300789   2010-09-17           5.3   

   vote_count  
0       274.0  
1        27.0  
2      1705.0  
3        60.0  
4       124.0  


Good work! Merging on indexes is just like merging on columns, so if you need to merge based on indexes, there's no need to turn the indexes into columns first.

## Do sequels earn more?
It is time to put together many of the aspects that you have learned in this chapter. In this exercise, you'll find out which movie sequels earned the most compared to the original movie. To answer this question, you will merge a modified version of the sequels and financials tables where their index is the movie ID. You will need to choose a merge type that will return all of the rows from the sequels table and not all the rows of financials table need to be included in the result. From there, you will join the resulting table to itself so that you can compare the revenue values of the original movie to the sequel. Next, you will calculate the difference between the two revenues and sort the resulting dataset.

* With the `sequels` table on the left, merge to it the `financials` table on index named id, ensuring that all the rows from the sequels are returned and some rows from the other table may not be returned, Save the results to `sequels_fin`.

In [75]:
sequels = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/sequels.p')
financials = pd.read_pickle('C:/Users/Александр/pj/DataCamp_projects/Data Scientist With Python/Joining Data with pandas/datasets/financials.p')

In [80]:
# Merge sequels and financials on index id
sequels_fin = sequels.merge(financials, on='id', how='left')

* Merge the `sequels_fin` table to itself with an inner join, where the left and right tables merge on sequel and id respectively with suffixes equal to (`'_org','_seq'`), saving to `orig_seq`.

In [89]:
sequels_fin = sequels_fin.fillna(0)

# Self merge with suffixes as inner join with left on sequel and right on id
orig_seq = sequels_fin.merge(sequels_fin, how='inner', left_on='sequel', 
                             right_on='id', right_index=True,
                             suffixes=('_org','_seq'))

# Add calculation to subtract revenue_org from revenue_seq 
orig_seq['diff'] = orig_seq['revenue_seq'] - orig_seq['revenue_org']

* Select the `title_org`, `title_seq`, and diff columns of `orig_seq` and save this as `titles_diff`.

In [91]:
# Select the title_org, title_seq, and diff 
titles_diff = orig_seq[['title_org','title_seq','diff']]

* Sort by `titles_diff` by diff in descending order and print the first few rows.

In [92]:
# Print the first rows of the sorted titles_diff
print(titles_diff.sort_values('diff', ascending=False).head())

                title_org title_seq          diff
3612        Class of 1984    Avatar  2.787965e+09
3754        Extreme Movie    Avatar  2.787965e+09
3756  Eye of the Beholder    Avatar  2.787965e+09
3757               Fabled    Avatar  2.787965e+09
3758         Factory Girl    Avatar  2.787965e+09


Amazing, that was great work! To complete this exercise, you needed to merge tables on their index and merge another table to itself. After the calculations were added and sub-select specific columns, the data was sorted. You found out that Jurassic World had one of the highest of all, improvement in revenue compared to the original movie.