<a href="https://colab.research.google.com/drive/1j7kSeMUjVTruRyDcXCQFD1VcHFcIMHUu?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data, so if you are working in Colab, follow the data setup instructions below.

# **Data Setup Instructions**

These instructions explain how to mount your data from Google Drive in Colab and access it from the notebook.

STEP 1 - After opening the tutorial in Colab, click the folder button and then click Mount Google Drive.

STEP 2 - The drive folder will be mounted in the current directory, /content. You can access it as shown below.

In [1]:
# print current directory
%pwd

'/content'

In [2]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this case the Data folder is located at the following path in Google Drive (use your own data path instead):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command:

In [3]:
# symlink the original data folder into /content
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data through this link by giving file paths that start with Data/.
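To see what the symlink does, here is a minimal Python sketch that repeats the idea with throwaway folders. The folder names and the helper `link_data_folder` are hypothetical stand-ins for the Drive folder and /content, not part of the tutorial's setup.

```python
import os
import tempfile
from pathlib import Path

def link_data_folder(source: Path, link_parent: Path) -> Path:
    """Create link_parent/<source name> pointing at source (like `ln -s source link_parent`)."""
    link = link_parent / source.name
    if not link.exists():
        os.symlink(source, link, target_is_directory=True)
    return link

# Throwaway folders standing in for the Drive data folder and /content
with tempfile.TemporaryDirectory() as tmp:
    drive_data = Path(tmp) / "drive" / "Data"
    content = Path(tmp) / "content"
    drive_data.mkdir(parents=True)
    content.mkdir()
    (drive_data / "example.csv").write_text("a,b\n1,2\n")

    link = link_data_folder(drive_data, content)
    # Reading through the link returns the file stored in the original folder
    linked_file_text = (link / "example.csv").read_text()
    link_is_symlink = link.is_symlink()
    print(link_is_symlink, linked_file_text.splitlines()[0])
```

The file is stored only once. The link simply gives it a shorter path.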

# **Importing pandas library and data loading**

In [4]:
import pandas as pd

In this lesson we will be using the movies_cleaned.csv file.

The lesson instructions for Pandas - Advanced Real World Data Analysis mention that you need to rename the file:

Movies_cleaned_lesson2.csv (created in lesson 2 of Pandas - Data Cleaning) -> movies_cleaned.csv

The file is saved in the same path as the rest of the IMDB dataset, i.e.

"Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv"

You can read this file as shown below.

In [5]:
# if you are working on a local machine, use the path where the data is saved on your computer
movies_cleaned = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv")
# .head() lets us quickly inspect the first 5 rows of the dataset
movies_cleaned.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110


# **What we will study?**

In this lesson we will learn to apply the groupby operation in conjunction with another special function:

**.explode()**

This will help us apply groupby to more complex columns like language and genre.

# **Exploding and grouping**

### The Problem
Suppose we want to answer the following question:

**For each genre, find the mean rating of all movies of that genre**<br>

Let's first see what the genre column looks like in the movies dataset, alongside the imdb_score column.

In [6]:
movies_cleaned[['genre','imdb_score']]

Unnamed: 0,genre,imdb_score
0,Romance,5.9
1,"Biography, Crime, Drama",6.1
2,Drama,5.8
3,"Drama, History",5.2
4,"Adventure, Drama, Fantasy",7.0
...,...,...
85849,Comedy,5.3
85850,"Comedy, Drama",7.7
85851,Drama,7.9
85852,"Drama, Family",6.4


Each value in the genre column can contain multiple genres.

If we apply groupby directly to this column, we won't get the mean score for individual genres. Instead, we get one value per combination of genres.

For example, the operation below shows that grouping on the genre column does not give the mean rating of all "Action" movies in one place.

In [7]:
movies_cleaned.groupby('genre')['imdb_score'].mean()

genre
Action                          4.929510
Action, Adventure               5.398000
Action, Adventure, Biography    6.374194
Action, Adventure, Comedy       5.430693
Action, Adventure, Crime        5.613537
                                  ...   
Western, Comedy                 5.616667
Western, Comedy, Drama          6.000000
Western, Drama                  6.133333
Western, Family                 5.800000
Western, Horror                 3.900000
Name: imdb_score, Length: 1257, dtype: float64

In the output above, we get the mean rating for movies whose genre is exactly 'Action', and separately the mean rating for movies whose genre is 'Action, Adventure', and so on.
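The same effect can be reproduced on a tiny, made-up dataframe. The values below are hypothetical and are not taken from the IMDB dataset; the sketch only shows that each distinct string becomes its own group.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action, Comedy", "Comedy", "Action"],
    "imdb_score": [6.0, 8.0, 5.0, 4.0],
})

# Each distinct string is its own group, so "Action, Comedy"
# is kept apart from "Action" and from "Comedy"
toy_means = toy.groupby("genre")["imdb_score"].mean()
print(toy_means)
```

Here the "Action" group averages only the two rows whose genre is exactly "Action". The 8.0 score of the "Action, Comedy" movie is not included.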

If we want the mean rating for all movies that include the Action genre, we can compute it without groupby as follows:

In [9]:
movies_action = movies_cleaned.loc[movies_cleaned['genre'].str.contains('Action')]
movies_action

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
36,tt0004465,The Perils of Pauline,1914,1914-03-23,"Action, Adventure, Drama",199,USA,English,6.3,939,$ 25000,,,,107
37,tt0004635,The Squaw Man,1914,1914-02-15,"Action, Drama, Romance",74,USA,English,5.7,879,$ 20000,,,,107
61,tt0006206,Les vampires,1915,1915-11-13,"Action, Adventure, Crime",421,France,French,7.3,4166,,,,,106
63,tt0006333,"20,000 Leagues Under the Sea",1916,1916-12-24,"Action, Adventure, Sci-Fi",105,USA,English,6.2,1501,$ 200000,,,,105
80,tt0007257,Reggie Mixes In,1916,1916-06-11,"Action, Comedy, Drama",50,USA,English,4.7,564,,,,,105
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
85798,tt9844256,Code Geass: Lelouch of the Rebellion Episode III,2018,2018-05-26,"Animation, Action, Sci-Fi",120,Japan,Japanese,7.7,184,,,,,3
85829,tt9887580,Bulletproof 2,2020,2020-01-07,"Action, Comedy",97,USA,English,3.5,326,,,,,1
85836,tt9894470,VFW,2019,2020-02-14,"Action, Crime, Horror",92,USA,English,6.1,4178,,,$ 23101,72.0,2
85838,tt9898858,Coffee & Kareem,2020,2020-04-03,"Action, Comedy",88,USA,English,5.1,10627,,,,35.0,1


In [10]:
movies_action['imdb_score'].mean()

5.625478838430615

Notice how different this rating is from the 'Action' value produced by the groupby operation (4.929510).

Finding the average rating for every genre this way would be tedious, so we need a way to do it with groupby.
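To make the difficulty concrete, here is a rough sketch of the filter approach repeated for several genres on hypothetical toy data. The list of genres has to be written out by hand. Also note that `str.contains` matches substrings, so searching for "Music" also picks up "Musical" rows.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action, Comedy", "Comedy", "Musical", "Music, Drama"],
    "imdb_score": [6.0, 8.0, 5.0, 7.0, 3.0],
})

# One filter per genre; the genre names must be listed manually
genres = ["Action", "Comedy", "Music"]
filter_means = {
    g: toy.loc[toy["genre"].str.contains(g), "imdb_score"].mean()
    for g in genres
}
print(filter_means)
```

In this toy data the only true "Music" movie has a score of 3.0, but the filter reports 5.0 because the "Musical" row is matched as well.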

### The Solution
**Essentially, groupby becomes easy to apply if we have one row per genre per movie, each carrying that movie's rating.**

We can build such a dataframe using the explode operation.

Before applying it, let us select the genre and imdb_score columns.

In [11]:
movies_genre = movies_cleaned[['genre','imdb_score']].copy()

Before applying explode, we need to convert the string values in the genre column into lists.

We can do so using the apply function, as shown below.

In [12]:
def convert_genre_list(genre):
  split_genre = genre.split(',')
  remove_spaces_genre_list = [x.strip() for x in split_genre]
  return remove_spaces_genre_list

# Why do we need the line that strips spaces?
# Let's take an example genre string
genre_example = ' Action, Adventure'
# If we only split on ',' and print the result,
split_genre = genre_example.split(',')
# each genre name still carries a leading space
print(split_genre)

# With convert_genre_list, the extra spaces are removed
print(convert_genre_list(genre_example))

[' Action', ' Adventure']
['Action', 'Adventure']
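The function above assumes every value is a string. Columns such as language contain missing values, and calling `.split` on a missing value would raise an error. The sketch below shows one possible variant that tolerates missing values; the name `convert_to_list` and the choice to return an empty list are assumptions made here, not part of the lesson.

```python
import pandas as pd

def convert_to_list(value):
    """Split a comma-separated string into a list of stripped items.

    Missing values (NaN/None) become an empty list, which is an
    assumption made for this sketch.
    """
    if pd.isna(value):
        return []
    return [item.strip() for item in value.split(",")]

print(convert_to_list(" Action, Adventure"))
print(convert_to_list(float("nan")))
```

When exploded, an empty list produces a single row with a missing value, and groupby skips missing keys by default.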


Now let's create a new column called 'genre_list' in the movies_genre dataframe.

In [13]:
movies_genre['genre_list'] = movies_genre['genre'].apply(convert_genre_list)
movies_genre

Unnamed: 0,genre,imdb_score,genre_list
0,Romance,5.9,[Romance]
1,"Biography, Crime, Drama",6.1,"[Biography, Crime, Drama]"
2,Drama,5.8,[Drama]
3,"Drama, History",5.2,"[Drama, History]"
4,"Adventure, Drama, Fantasy",7.0,"[Adventure, Drama, Fantasy]"
...,...,...,...
85849,Comedy,5.3,[Comedy]
85850,"Comedy, Drama",7.7,"[Comedy, Drama]"
85851,Drama,7.9,[Drama]
85852,"Drama, Family",6.4,"[Drama, Family]"


Let's drop the genre column now.

In [14]:
movies_genre.drop(['genre'],axis=1,inplace=True)
movies_genre

Unnamed: 0,imdb_score,genre_list
0,5.9,[Romance]
1,6.1,"[Biography, Crime, Drama]"
2,5.8,[Drama]
3,5.2,"[Drama, History]"
4,7.0,"[Adventure, Drama, Fantasy]"
...,...,...
85849,5.3,[Comedy]
85850,7.7,"[Comedy, Drama]"
85851,7.9,[Drama]
85852,6.4,"[Drama, Family]"


Now we will apply the explode function to the movies_genre dataframe and look at the output.

In [None]:
movies_genre_explode = movies_genre.explode('genre_list') 
movies_genre_explode

Unnamed: 0,imdb_score,genre_list
0,5.9,Romance
1,6.1,Biography
1,6.1,Crime
1,6.1,Drama
2,5.8,Drama
...,...,...
85851,7.7,Drama
85852,7.9,Drama
85853,6.4,Drama
85853,6.4,Family


Look at the 2nd row of the original movies_genre dataframe (index 1).

Its exploded values are the 2nd, 3rd and 4th rows of the output, one for each genre in its genre_list.

Each of these rows carries the same imdb_score.
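The row mapping described above can be checked on a two-row sketch built from the first two rows shown earlier. The variable names are illustrative only.

```python
import pandas as pd

# First two rows of movies_genre, rebuilt by hand for illustration
toy = pd.DataFrame({
    "imdb_score": [5.9, 6.1],
    "genre_list": [["Romance"], ["Biography", "Crime", "Drama"]],
})

exploded = toy.explode("genre_list")
print(exploded)

# The original row labels are repeated, once per list element
print(exploded.index.tolist())

# reset_index(drop=True) gives every exploded row its own label, if needed
renumbered = exploded.reset_index(drop=True)
print(renumbered.index.tolist())
```

The repeated index label is what lets you trace each exploded row back to the movie it came from.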

This operation gives us a dataframe on which groupby can compute the mean rating for each genre.

We can apply groupby as shown below.

In [None]:
movies_genre_explode.groupby('genre_list')['imdb_score'].mean()

genre_list
Action         5.625479
Adult          4.550000
Adventure      5.845810
Animation      6.381317
Biography      6.623822
Comedy         5.865049
Crime          6.026559
Documentary    7.300000
Drama          6.235876
Family         5.926098
Fantasy        5.744648
Film-Noir      6.644042
History        6.543380
Horror         4.833347
Music          6.243635
Musical        6.247379
Mystery        5.823100
News           6.400000
Reality-TV     3.800000
Romance        6.139687
Sci-Fi         5.071369
Sport          6.048402
Thriller       5.473762
War            6.427520
Western        5.978395
Name: imdb_score, dtype: float64

For Action movies, we get the same mean rating that we found earlier using the filter operation.
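As a quick sanity check, the whole pipeline (split, explode, groupby) can be compared with the filter approach on a small, hypothetical dataframe. This is a sketch under made-up data, not a result from the IMDB dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the IMDB dataset
toy = pd.DataFrame({
    "genre": ["Action", "Action, Comedy", "Comedy", "Action, Drama"],
    "imdb_score": [6.0, 8.0, 5.0, 4.0],
})

# Split each genre string into a list of stripped genre names
toy["genre_list"] = toy["genre"].apply(lambda g: [x.strip() for x in g.split(",")])

# One row per (movie, genre), then the mean score per genre
per_genre = toy.explode("genre_list").groupby("genre_list")["imdb_score"].mean()

# Filter approach for a single genre, for comparison
action_filter_mean = toy.loc[toy["genre"].str.contains("Action"), "imdb_score"].mean()

print(per_genre["Action"], action_filter_mean)
```

Both approaches average the same three Action rows, so the two numbers agree.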

Let's apply more aggregation functions to it.

In [None]:
# pass a list (not a set) so the order of the output columns is predictable
movies_genre_explode.groupby('genre_list')['imdb_score'].agg(['count','mean','max'])

genre_list,count,mean,max
Action,12948,5.625479,9.9
Adult,2,4.55,4.8
Adventure,7590,5.84581,9.3
Animation,2141,6.381317,9.0
Biography,2376,6.623822,9.0
Comedy,29367,5.865049,9.7
Crime,11066,6.026559,9.7
Documentary,2,7.3,7.5
Drama,47110,6.235876,9.8
Family,3962,5.926098,9.4
