<a href="https://colab.research.google.com/drive/1Gy8uESDT1z4PkX0qJViNKP9j1wTuC-JR?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data, so if you are working in Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount the data from Google Drive in Colab and access it from the notebook.

STEP 1 - After opening the tutorial in your Colab, click the folder button and then click on mount Google Drive.

STEP 2 - The drive folder will be mounted in the current directory, /content. You can check this as shown below.

In [None]:
# print current directory
%pwd

'/content'

In [None]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this case the Data folder is located at the following path in Google Drive (use your own data path instead):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command:

In [2]:
# symlink the original data folder into /content (this creates /content/Data)
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data with a relative path that starts with Data/, followed by the path of the file inside the data folder.
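The same linking step can also be done from Python. The following is only a minimal sketch: the helper name `link_data_folder` is hypothetical, and the demonstration uses temporary folders instead of the real Drive path so that it can run anywhere. It mirrors `ln -s SOURCE PARENT` and can be run more than once without failing.

```python
import os
import tempfile


def link_data_folder(source_dir, link_parent):
    """Create a symlink to source_dir inside link_parent and return its path.

    Hypothetical helper: like `ln -s SOURCE PARENT`, the link takes the
    name of the source folder.
    """
    name = os.path.basename(os.path.normpath(source_dir))
    link_path = os.path.join(link_parent, name)
    if os.path.islink(link_path):
        # The link already exists (for example the cell was run twice).
        return link_path
    os.symlink(source_dir, link_path)
    return link_path


# Demonstration on temporary folders that stand in for Drive and /content.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "drive", "Data")
    content = os.path.join(tmp, "content")
    os.makedirs(source)
    os.makedirs(content)
    with open(os.path.join(source, "example.csv"), "w") as f:
        f.write("a,b\n1,2\n")

    link = link_data_folder(source, content)
    link_is_symlink = os.path.islink(link)
    # The file saved in the source folder is visible through the link.
    file_visible = os.path.exists(os.path.join(link, "example.csv"))
    print(link_is_symlink, file_visible)
```

In Colab you would pass your own Drive data path as `source_dir` and "/content" as `link_parent`.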

# **Importing libraries and loading the combined dataset**

In this project we will need both pandas and plotly, so we import them below.

In [3]:
import pandas as pd
import plotly.express as px

We will load the combined dataset file, combined_final_data.csv, which we created in the last lesson.

The file is located at this path: "Data/IMDB_rotten_tomato_dataset/combined_final_data.csv"

In [4]:
combined = pd.read_csv("Data/IMDB_rotten_tomato_dataset/combined_final_data.csv")

In [5]:
combined.head()

| | original_title | year | genre | duration | country | language | imdb_score | worldwide_gross_income | tomatometer_rating |
|---|---|---|---|---|---|---|---|---|---|
| 0 | The Kid | 1921 | Comedy, Drama, Family | 68 | USA | English, None | 83.0 | 0.026916 | 100.0 |
| 1 | A Woman of Paris: A Drama of Fate | 1923 | Drama, Romance | 82 | USA | None, English | 70.0 | 0.011233 | 92.0 |
| 2 | The Gold Rush | 1925 | Adventure, Comedy, Drama | 95 | USA | English, None | 82.0 | 0.026916 | 100.0 |
| 3 | Metropolis | 1927 | Drama, Sci-Fi | 153 | Germany | German | 83.0 | 1.349711 | 97.0 |
| 4 | Sunrise: A Song of Two Humans | 1927 | Drama, Romance | 94 | USA | English | 81.0 | 0.121107 | 98.0 |
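Before plotting, it can help to check the numerical columns for missing values and to look at their ranges. The sketch below is an optional check, not part of the original lesson. The helper `summarize_numeric` is a hypothetical name, and the small `sample` dataframe is built from three of the rows shown above so that the code runs on its own.

```python
import pandas as pd

# A tiny stand-in for `combined`, built from rows of the output above.
sample = pd.DataFrame({
    "original_title": ["The Kid", "Metropolis", "Sunrise: A Song of Two Humans"],
    "imdb_score": [83.0, 83.0, 81.0],
    "worldwide_gross_income": [0.026916, 1.349711, 0.121107],
    "tomatometer_rating": [100.0, 97.0, 98.0],
})


def summarize_numeric(df, columns):
    """Return the number of missing values, minimum and maximum per column."""
    return pd.DataFrame({
        "missing": df[columns].isna().sum(),
        "min": df[columns].min(),
        "max": df[columns].max(),
    })


numeric_columns = ["imdb_score", "worldwide_gross_income", "tomatometer_rating"]
summary = summarize_numeric(sample, numeric_columns)
print(summary)
```

In the notebook you would call `summarize_numeric(combined, numeric_columns)` instead of using the sample.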


# **Assessment of the movie earning potential**

Using the worldwide_gross_income column, we will analyse the following points:
1. **Earning potential and Rating Comparison** - do movies with higher ratings have higher earning potential?
2. **Earning potential and Genre Comparison** - does genre affect the earnings of a movie?
3. **Earning potential and Country Comparison** - do the earnings of a movie differ based on the country it is released in?
4. **Earning potential and Language Comparison** - how does language affect the earnings of a movie?
5. **Earning potential and Duration Comparison** - do the earnings of long and short movies differ?


We will do the first two analyses in this lesson. The remaining three are left for you to do in the coding assignment.

# **1. Earning potential and Rating Comparison**

We want to understand whether the rating of a movie is related to its earnings. Since we have two ratings, we will do this analysis for both IMDB and Rotten Tomatoes.

The comparison is between the worldwide_gross_income column and the rating columns. Since both are numerical columns, a scatter plot is a natural visual comparison tool.

We will first draw the scatter plot of imdb_score against worldwide_gross_income. Each point of this scatter plot represents one movie, positioned by its imdb_score and its gross income. We will also add the movie title as hover data so that individual movies are easy to identify. The plot is constructed as shown below.

In [8]:
px.scatter(combined,x='imdb_score',y='worldwide_gross_income',hover_data=['original_title'])

We can draw a similar scatter plot for the tomatometer_rating and worldwide_gross_income columns.

In [9]:
px.scatter(combined,x='tomatometer_rating',y='worldwide_gross_income',hover_data=['original_title'])

**Analysis Points**<br>
1. The first plot, of gross income against IMDB score, shows a weak positive relation between the two. As the score increases, the earnings of a movie tend to increase as well.

2. The second plot shows a much weaker relation between tomatometer_rating and worldwide_gross_income.

3. One possible explanation for the stronger relation with IMDB is that the IMDB rating is driven by the general audience, while the tomatometer is driven more by critics. It is the general audience that pays for movies, which could be why higher-earning movies usually have higher IMDB ratings.

4. Movies with very high earnings are usually rated well in both rating systems. For example, movies like 'Avengers', 'Avatar' and 'Titanic' have very high earnings and are rated well in both systems.
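One way to put a number on the relations described in these points is a correlation coefficient. The sketch below is an optional addition. The helper `rating_income_correlation` is a hypothetical name, and the `toy` values are made up purely so that the code runs on its own. It computes the Pearson correlation and also a rank-based (Spearman) correlation, which is less sensitive to a few extremely high earners.

```python
import pandas as pd


def rating_income_correlation(df, rating_col, income_col="worldwide_gross_income"):
    """Return (pearson, spearman) correlation between a rating and income."""
    # Keep only rows where both values are present.
    pair = df[[rating_col, income_col]].dropna()
    pearson = pair[rating_col].corr(pair[income_col])
    # Spearman correlation is the Pearson correlation of the ranks.
    spearman = pair[rating_col].rank().corr(pair[income_col].rank())
    return pearson, spearman


# Hypothetical toy values, NOT taken from the real dataset.
toy = pd.DataFrame({
    "imdb_score": [60.0, 65.0, 70.0, 80.0, 85.0],
    "tomatometer_rating": [90.0, 40.0, 75.0, 60.0, 95.0],
    "worldwide_gross_income": [5.0, 20.0, 50.0, 300.0, 900.0],
})

for col in ["imdb_score", "tomatometer_rating"]:
    pearson, spearman = rating_income_correlation(toy, col)
    print(f"{col}: pearson={pearson:.2f}, spearman={spearman:.2f}")
```

In the notebook you would pass `combined` instead of `toy`. A value near 0 indicates little relation, and a value near 1 indicates a strong positive relation.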

# **2. Earning Potential and Genre Comparison**

In this analysis we will look at how worldwide_gross_income varies with the genre of a movie.

Since genre is a categorical value and worldwide_gross_income is a numerical value, we could use either a box plot or a bar plot.

We will first do our analysis using a bar plot.

We will plot the mean worldwide_gross_income for each genre and see how it changes from genre to genre.

To find the mean worldwide_gross_income we have to group by genre. Before grouping, we need to split the genre column into a list and explode it, so that each row holds a single genre.

This is done below.

In [10]:
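def convert_genre_list(genre):
  # turn a string like "Comedy, Drama, Family" into ['Comedy', 'Drama', 'Family']
  split_genre = genre.split(',')
  remove_spaces_genre_list = [x.strip() for x in split_genre]
  return remove_spaces_genre_list

# work on a copy so that the combined dataframe stays unchanged
income_genre = combined[['genre','worldwide_gross_income']].copy()
# convert the genre string of every movie into a list of genres
income_genre['genre_list'] = income_genre['genre'].apply(convert_genre_list)
income_genre.drop(['genre'],axis=1,inplace=True)
# explode gives one row per (movie, genre) pair
income_genre_explode = income_genre.explode('genre_list')
# mean income of all movies tagged with each genre
income_genre_meanincome = income_genre_explode.groupby('genre_list')['worldwide_gross_income'].mean().reset_index()
income_genre_meanincome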

| | genre_list | worldwide_gross_income |
|---|---|---|
| 0 | Action | 140.4278 |
| 1 | Adventure | 208.668843 |
| 2 | Animation | 247.629655 |
| 3 | Biography | 45.727052 |
| 4 | Comedy | 64.080941 |
| 5 | Crime | 45.153267 |
| 6 | Drama | 43.587649 |
| 7 | Family | 114.075065 |
| 8 | Fantasy | 120.872804 |
| 9 | Film-Noir | 0.221779 |
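To see what the split, explode and groupby steps do, it may help to run them on a tiny example. The three-movie `toy` dataframe below is hypothetical. The sketch also reports the median and the number of movies per genre, which is an addition to the lesson: the mean alone can be pulled up by a few very high earners or rest on very few movies.

```python
import pandas as pd

# Hypothetical toy data: three movies, two of them with more than one genre.
toy = pd.DataFrame({
    "genre": ["Comedy, Drama", "Drama", "Action, Comedy"],
    "worldwide_gross_income": [10.0, 20.0, 30.0],
})

# Split the genre string into a list, then create one row per (movie, genre).
exploded = (
    toy.assign(genre_list=toy["genre"].str.split(","))
       .explode("genre_list")
       .drop(columns="genre")
       .reset_index(drop=True)
)
# Remove the space left after each comma.
exploded["genre_list"] = exploded["genre_list"].str.strip()
print(exploded)

# A movie with two genres now contributes its income to both genres.
genre_stats = (
    exploded.groupby("genre_list")["worldwide_gross_income"]
            .agg(["mean", "median", "count"])
            .reset_index()
)
print(genre_stats)
```

The "Comedy, Drama" movie appears twice in `exploded`, once under Comedy and once under Drama, so its income is counted in the mean of both genres.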


Now we can use the above dataframe to draw the bar plot.

In [11]:
px.bar(income_genre_meanincome,x='genre_list',y='worldwide_gross_income')

**Analysis Points**<br>
Here we can clearly see that a few genres have very high earning potential, while most of them have mediocre earning potential.

Genres with a mean earning of more than 100 million:<br>
* Action
* Adventure
* Animation
* Family
* Fantasy
* Sci-Fi

Most of the other genres have a mediocre earning potential of around 50 million.

Film-Noir is a genre with very little earning potential.
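The list of high-earning genres can also be obtained with a filter instead of reading it off the bar plot. The sketch below uses only the ten genre means visible in the output above, so genres that are not shown there, such as Sci-Fi, do not appear in its result. The name `threshold` is an assumption introduced for illustration.

```python
import pandas as pd

# Mean income per genre, copied from the ten rows shown in the output above.
mean_income = pd.Series({
    "Action": 140.4278,
    "Adventure": 208.668843,
    "Animation": 247.629655,
    "Biography": 45.727052,
    "Comedy": 64.080941,
    "Crime": 45.153267,
    "Drama": 43.587649,
    "Family": 114.075065,
    "Fantasy": 120.872804,
    "Film-Noir": 0.221779,
}, name="worldwide_gross_income")

# Same unit as the column, which the analysis above reads as millions.
threshold = 100

# Keep the genres above the threshold, highest mean first.
high_earning = mean_income[mean_income > threshold].sort_values(ascending=False)
print(high_earning)
```

In the notebook the same filter can be applied to `income_genre_meanincome`, for example after calling `set_index('genre_list')` on it.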

We can also draw a box plot of worldwide_gross_income for each genre. It will reinforce the analysis points above, and it can also show us which genres have the most outlier movies in terms of earnings.

We draw the box plot on the income_genre_explode dataframe, where each row holds a single genre.

In [12]:
px.box(income_genre_explode,x='genre_list',y='worldwide_gross_income')

**Analysis Points**<br>
We can observe that most of the genres have movies with outlier earnings.

But if we look closely, we will observe that a few genres do not have any outlier movies beyond 500 million. These are genres with little earning potential. Some of these genres are:
* Sport
* Western
* Film-Noir
* War

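Music and Biography each have only one movie beyond the 500 million mark.

Some genres, like Adventure and Animation, have very tall boxes. A tall box means that the middle 50% of the movies in the genre spans a wide range of incomes, so a sizeable share of the movies in these genres earn far more than the typical movie in other genres.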
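The quantities that a box plot draws can also be computed directly, which makes observations like the ones above easier to check. The sketch below is an optional addition under stated assumptions: the helper `box_summary` is a hypothetical name, the `toy` incomes are made up, and the upper fence uses the common convention of 1.5 times the interquartile range above the third quartile.

```python
import pandas as pd


def box_summary(df, group_col, value_col, threshold):
    """Per-group quartiles, IQR, upper fence and count of values above threshold."""
    grouped = df.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        "q1": grouped.quantile(0.25),
        "median": grouped.median(),
        "q3": grouped.quantile(0.75),
    })
    # Height of the box in the box plot.
    summary["iqr"] = summary["q3"] - summary["q1"]
    # Common convention: points above q3 + 1.5 * IQR are treated as outliers.
    summary["upper_fence"] = summary["q3"] + 1.5 * summary["iqr"]
    # Number of rows per group whose value exceeds the threshold.
    flagged = df.assign(_above=df[value_col] > threshold)
    summary["n_above_threshold"] = flagged.groupby(group_col)["_above"].sum()
    return summary


# Hypothetical toy incomes, NOT taken from the real dataset.
toy = pd.DataFrame({
    "genre_list": ["Adventure"] * 5 + ["Western"] * 5,
    "worldwide_gross_income": [10.0, 20.0, 30.0, 40.0, 600.0,
                               1.0, 2.0, 3.0, 4.0, 5.0],
})

summary = box_summary(toy, "genre_list", "worldwide_gross_income", threshold=500)
print(summary)
```

In the notebook you would call `box_summary(income_genre_explode, 'genre_list', 'worldwide_gross_income', threshold=500)` to count the movies beyond the 500 million mark in each genre.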