<a href="https://colab.research.google.com/drive/12mjMZMB3i5Y3py7IlVSI-nzVczZcSTfI?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data. If you are working on Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount your data from Google Drive in Colab and access it from the notebook.

STEP 1 - After opening the tutorial in Colab, click the folder button and then click on mount Google Drive.

STEP 2 - The drive folder will be mounted under the current directory, /content. You can access it as shown below.

In [1]:
# print current directory
%pwd

'/content'

In [2]:
%ls

drive/  sample_data/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this case the Data folder is located at the following path in Google Drive (use your own data path):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command.

In [3]:
# symlink the original Data folder into /content (this creates /content/Data)
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data by giving file paths that start with Data/.

# Importing pandas library and data loading

In [4]:
import pandas as pd

In this lesson we will be using the movies_cleaned.csv file.

In the lesson instructions for Pandas - Advanced Real World Data Analysis, we mentioned that you need to rename the file

Movies_cleaned_lesson2.csv (created in lesson 2 of Pandas - Data Cleaning) -> movies_cleaned.csv

The file is saved in the same path as the rest of the IMDB dataset, i.e.

"Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv"

You can read this file as shown below.

In [5]:
# if you are running this tutorial on a local machine, use the path where the data is saved on your computer
movies_cleaned = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv")
# .head() shows the first 5 rows of the dataset for a quick look
movies_cleaned.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110


In addition to the above file, we will also read the file

movies_ratings_cleaned.csv (this is the file made during the Data Cleaning Assignment)

The file is saved where the rest of the IMDB dataset is located, i.e.

"Data/IMDB_rotten_tomato_dataset/IMDB/movies_ratings_cleaned.csv"

The file can be read as shown below.

In [6]:
ratings_cleaned = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/movies_ratings_cleaned.csv")
ratings_cleaned.head()

Unnamed: 0,imdb_title_id,imdb_score_overall,imdb_score_male,imdb_score_female,gender_perference
0,tt0000009,5.9,6.2,6.0,male
1,tt0000574,6.1,6.1,6.2,female
2,tt0001892,5.8,5.9,5.7,male
3,tt0002101,5.2,5.1,5.9,female
4,tt0002130,7.0,7.0,7.2,female


# **What we will learn?**

In this lesson we will learn
* Why joining or merging dataframes is an important operation
* Simple Join Example
* Join movies_cleaned and ratings_cleaned

# **Why Join or merge Dataframes?**

### Problem
The main reason for joining or merging two dataframes is that we want to combine information stored in two places and apply some operation to the combined dataset.

For example, suppose we want to know the top 10 IMDb movies according to the scores given by females.

To find this, we would first sort the ratings_cleaned dataframe on the 'imdb_score_female' column, as shown below.


In [9]:
ratings_cleaned.sort_values('imdb_score_female',ascending=False).head(10)

Unnamed: 0,imdb_title_id,imdb_score_overall,imdb_score_male,imdb_score_female,gender_perference
37203,tt0226720,6.2,6.0,10.0,female
79170,tt5949226,7.6,5.5,10.0,female
35847,tt0198465,6.4,6.3,10.0,female
80784,tt6544384,4.2,4.0,10.0,female
79133,tt5935196,7.4,3.8,10.0,female
83594,tt8011346,4.2,3.9,10.0,female
73693,tt4087822,5.3,5.1,10.0,female
82170,tt7253188,4.6,4.1,10.0,female
35806,tt0197573,5.1,5.1,10.0,female
82781,tt7542962,7.2,7.2,10.0,female


We now have the top 10 movies in the above dataframe, but we do not know their names.

To find the names, we would have to take each imdb_title_id and look up the corresponding original_title value in movies_cleaned.

For example, for imdb_title_id = tt0226720:

In [11]:
movies_cleaned.loc[movies_cleaned['imdb_title_id']=='tt0226720']

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
37203,tt0226720,Un banco en el parque,1999,1999-12-10,Comedy,77,Spain,Spanish,6.2,108,,,,,22


Now we can see the title of the movie.

### Solution
If the information from both dataframes were contained in a single dataframe, this effort would be reduced.

When we compute the top 10 movies according to females on the combined dataframe, the original_title column is already available in the result.

Combining these two datasets is called a join (or merge) in pandas.

# **Simple Join Example**

There are many ways to join two dataframes. In this lesson we will learn about only one particular type of join.

We will learn about **Inner Join**.

Let's suppose we have two dataframes 
* left
* right

defined below

In [12]:
# left has 2 columns - key and value_left
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value_left':[2,1,4,6]})
left

Unnamed: 0,key,value_left
0,A,2
1,B,1
2,C,4
3,D,6


In [13]:
# right has 2 columns - key and value_right
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value_right':[112,240,321,179]})
right

Unnamed: 0,key,value_right
0,B,112
1,D,240
2,E,321
3,F,179


For the sake of simplicity, the key column has the same name (for now).

An INNER JOIN is represented by



![inner_join.png](https://drive.google.com/uc?export=view&id=1W3sKCxytv39EJUhScJLVN2PaMVP817WV)

* blue indicates rows that are present in the merge result
* red indicates rows that are excluded from the result (i.e., removed)

To perform an INNER JOIN, call pd.merge with the left dataframe as the first argument, the right dataframe as the second argument, and the join key passed through the on argument.

In [14]:
comb_df = pd.merge(left,right,on='key')
comb_df

Unnamed: 0,key,value_left,value_right
0,B,1,112
1,D,6,240


This returns only the rows from left and right that share a common key (in this example, "B" and "D").
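The behaviour described above can be checked directly. The following is a minimal sketch that redefines left and right so it runs on its own; it writes out `how='inner'`, which is the default of pd.merge, and compares the keys in the result with the intersection of the two key columns. The name `common_keys` is introduced here only for illustration.

```python
import pandas as pd

# same toy dataframes as in the example above
left = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value_left': [2, 1, 4, 6]})
right = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value_right': [112, 240, 321, 179]})

# how='inner' is the default of pd.merge; it is spelled out here for clarity
comb_df = pd.merge(left, right, on='key', how='inner')

# keys that appear in both dataframes (illustrative helper variable)
common_keys = sorted(set(left['key']) & set(right['key']))

print(common_keys)              # ['B', 'D']
print(comb_df['key'].tolist())  # ['B', 'D']
```

Keys "A" and "C" exist only in left, and "E" and "F" exist only in right, so none of them appear in the result.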



# **Join movies_cleaned and ratings_cleaned dataframes**

From the above illustration, we can observe that to do an INNER JOIN of two dataframes we need a column that is common to both of them.

In [15]:
movies_cleaned.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110


In [16]:
ratings_cleaned.head()

Unnamed: 0,imdb_title_id,imdb_score_overall,imdb_score_male,imdb_score_female,gender_perference
0,tt0000009,5.9,6.2,6.0,male
1,tt0000574,6.1,6.1,6.2,female
2,tt0001892,5.8,5.9,5.7,male
3,tt0002101,5.2,5.1,5.9,female
4,tt0002130,7.0,7.0,7.2,female


You can see from the above that the 'imdb_title_id' column is common to both dataframes.

Put simply, both dataframes hold data about the same movies, and imdb_title_id links the rows of one to the rows of the other.

Now we can combine the two dataframes on 'imdb_title_id' as shown below.

In [17]:
comb_movie_data = pd.merge(movies_cleaned,ratings_cleaned,on = 'imdb_title_id')
comb_movie_data

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age,imdb_score_overall,imdb_score_male,imdb_score_female,gender_perference
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127,5.9,6.2,6.0,male
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115,6.1,6.1,6.2,female
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110,5.8,5.9,5.7,male
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109,5.2,5.1,5.9,female
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110,7.0,7.0,7.2,female
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85849,tt9908390,Le lion,2020,2020-01-29,Comedy,95,"France, Belgium",French,5.3,398,,,$ 3507171,,1,5.3,5.3,6.0,female
85850,tt9911196,De Beentjes van Sint-Hildegard,2020,2020-02-13,"Comedy, Drama",103,Netherlands,"German, Dutch",7.7,724,,,$ 7299062,,1,7.7,7.8,7.6,male
85851,tt9911774,Padmavyuhathile Abhimanyu,2019,2019-03-08,Drama,130,India,Malayalam,7.9,265,,,,,2,7.9,6.0,,female
85852,tt9914286,Sokagin Çocuklari,2019,2019-03-15,"Drama, Family",98,Turkey,Turkish,6.4,194,,,$ 2833,,2,6.4,3.1,4.0,female


In the comb_movie_data dataframe, you can observe that the columns from both 'movies_cleaned' and 'ratings_cleaned' are combined.

There is only a single imdb_title_id column, because it is the key column common to both of them.

We can compare the shapes of the three dataframes to check whether any rows were dropped in the resulting dataframe.

In [18]:
print(comb_movie_data.shape)
print(movies_cleaned.shape)
print(ratings_cleaned.shape)

(85854, 19)
(85854, 15)
(85855, 5)


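You can see that the combined dataframe has the same number of rows as movies_cleaned (85854), so no row of movies_cleaned was dropped. It has one row fewer than ratings_cleaned (85855), so one row of ratings_cleaned was left out of the result.

Assuming imdb_title_id is not duplicated in either dataframe, this means every movie in movies_cleaned was found in ratings_cleaned, while one entry in ratings_cleaned had no matching movie.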
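Comparing shapes tells us how many rows were lost, but not which ones. One way to find them is an outer merge with `indicator=True`, which adds a `_merge` column recording whether each row came from the left dataframe, the right dataframe, or both. The sketch below uses small made-up dataframes (`movies`, `ratings`) as stand-ins for the real files; the values are hypothetical.

```python
import pandas as pd

# toy stand-ins for movies_cleaned and ratings_cleaned (hypothetical values)
movies = pd.DataFrame({'imdb_title_id': ['tt1', 'tt2', 'tt3'],
                       'original_title': ['A', 'B', 'C']})
ratings = pd.DataFrame({'imdb_title_id': ['tt1', 'tt2', 'tt3', 'tt4'],
                        'imdb_score_female': [6.0, 7.5, 8.1, 9.0]})

# an outer merge keeps every row from both sides;
# indicator=True adds a _merge column with 'left_only', 'right_only' or 'both'
check = pd.merge(movies, ratings, on='imdb_title_id', how='outer', indicator=True)

# rows that are not in both dataframes are the ones an inner join would drop
unmatched = check[check['_merge'] != 'both']
print(unmatched[['imdb_title_id', '_merge']])
```

In this toy case the only unmatched row is tt4, which exists in ratings but not in movies. Running the same check on the real dataframes should reveal the one ratings row that the inner join left out.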

Now we can solve our initial problem of finding top 10 movies liked by females.

In [20]:
comb_movie_data.sort_values('imdb_score_female',ascending=False).head(10)

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age,imdb_score_overall,imdb_score_male,imdb_score_female,gender_perference
52405,tt10126462,Post,2019,2019-04-12,"Drama, Fantasy, Horror",45,Russia,,8.4,230,RUR 50,,,,2,8.4,7.8,10.0,female
81140,tt6750796,Kittu Unnadu Jagratha,2017,2017-03-03,Comedy,143,India,Telugu,5.8,115,,,$ 1330,,4,5.8,5.9,10.0,female
82654,tt7482302,Semma Botha Aagatha,2018,2018-06-29,"Action, Thriller",132,India,Tamil,5.0,222,,,,,3,5.0,5.0,10.0,female
34477,tt0173235,Sporlaust,1998,1998-08-27,"Comedy, Crime",87,Iceland,Icelandic,3.8,103,,,,,23,3.8,3.7,10.0,female
17021,tt0073041,Ghazal,1976,1976,Drama,106,Iran,Persian,6.4,125,,,,,45,6.4,6.3,10.0,female
14546,tt0065958,Huang jiang nu xia,1970,1970-02-27,"Action, Adventure",84,Hong Kong,Mandarin,6.2,119,,,,,51,6.2,6.2,10.0,female
29451,tt0115079,État des lieux,1995,1995-06-14,"Comedy, Drama",80,France,French,5.7,150,,,,,26,5.7,5.7,10.0,female
11830,tt0058090,The Fat Black Pussycat,1963,1963,"Crime, Horror, Drama",94,USA,English,4.5,108,,,,,58,4.5,4.4,10.0,female
32204,tt0132889,Blossi/810551,1997,1997-08-14,Drama,82,Iceland,Icelandic,4.1,167,$ 1000000,,,,24,4.1,4.1,10.0,female
38013,tt0245905,Four Jacks,2001,2001-08-07,"Mystery, Thriller",92,Australia,English,4.9,116,,,,,20,4.9,4.9,10.0,female


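Now we can see all the information, including the movie titles, in one single dataframe. Many movies share the maximum female score of 10.0, and the order among tied rows is not guaranteed, which is why these ten rows differ from the ten shown earlier.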
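When the combined dataframe has many columns, it can help to keep only the ones needed to answer the question. The sketch below shows the idea on a small made-up dataset; `movies`, `ratings` and their values are hypothetical stand-ins, not the real IMDB data.

```python
import pandas as pd

# toy stand-ins for movies_cleaned and ratings_cleaned (hypothetical values)
movies = pd.DataFrame({'imdb_title_id': ['tt1', 'tt2', 'tt3'],
                       'original_title': ['Alpha', 'Beta', 'Gamma'],
                       'year': [1999, 2005, 2010]})
ratings = pd.DataFrame({'imdb_title_id': ['tt1', 'tt2', 'tt3'],
                        'imdb_score_female': [6.0, 9.5, 7.2]})

comb = pd.merge(movies, ratings, on='imdb_title_id')

# sort by the female score, take the top 2, and keep only the relevant columns
top = (comb.sort_values('imdb_score_female', ascending=False)
           .head(2)[['original_title', 'imdb_score_female']])
print(top)
```

On the real data, the same pattern with `head(10)` would show just the title and female score of each top movie.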

**NOTES ON JOINING**<br>
In this merge, no movie from movies_cleaned was lost and only one row of ratings_cleaned was dropped. Usually, when you join two dataframes with an inner join, some rows are lost because their keys appear in only one of the dataframes.

It is also advisable for the key column to have the same name in both dataframes. If the names differ, rename one of them so that they match.
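The renaming advice can be sketched as follows. The `ratings` dataframe with a `title_id` column is a hypothetical example, not part of the IMDB data. pd.merge also accepts `left_on` and `right_on` when the key names differ; this is shown as a second option, with the caveat that both key columns are then kept in the result.

```python
import pandas as pd

movies = pd.DataFrame({'imdb_title_id': ['tt1', 'tt2'],
                       'original_title': ['A', 'B']})
# hypothetical ratings table whose key column has a different name
ratings = pd.DataFrame({'title_id': ['tt1', 'tt2'],
                        'imdb_score_female': [6.0, 7.5]})

# Option 1: rename the key column so both dataframes share the same name
renamed = ratings.rename(columns={'title_id': 'imdb_title_id'})
comb_renamed = pd.merge(movies, renamed, on='imdb_title_id')

# Option 2: keep the names and tell merge which column to use on each side
comb_left_right = pd.merge(movies, ratings,
                           left_on='imdb_title_id', right_on='title_id')

print(comb_renamed.columns.tolist())
print(comb_left_right.columns.tolist())
```

Renaming first gives a result with a single key column, which matches the behaviour seen earlier in this lesson.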