<a href="https://colab.research.google.com/drive/1l0oe-DnxthZUeLZotBChmaNeusF8zSio?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data, so if you are working in Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount your Google Drive in Colab and access the data from the notebook.

STEP 1 - After opening the tutorial in Colab, click the folder button and then click on mount Google Drive.

STEP 2 - The drive folder is mounted under the current directory, /content. You can check this as shown below.

In [2]:
# print current directory
%pwd

'/content'

In [3]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this case the Data folder is located at the following path in Google Drive (use your own data path instead):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command:

In [1]:
# symlink the original data folder into /content (this creates /content/Data)
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access any data file with a path that starts with Data/, relative to /content.
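Before loading any file, it can help to confirm that the link really points at a folder. The snippet below is only a sketch: `check_data_link` is a hypothetical helper name, and the default `/content/Data` path assumes the symlink command above was run unchanged.

```python
from pathlib import Path

def check_data_link(link="/content/Data"):
    """Return the folder a symlink points to, or raise if the link is missing or broken.

    Hypothetical helper; the default path assumes the ln -s command shown earlier.
    """
    link_path = Path(link)
    if not link_path.is_symlink():
        raise FileNotFoundError(f"{link_path} is not a symlink; re-run the ln -s command")
    target = link_path.resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"{link_path} points to {target}, which is not a folder")
    return target
```

Calling `check_data_link()` in a Colab cell should return the Drive folder that the link points to, or raise an error explaining what is wrong.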

# **Import libraries and loading the combined dataset**

In this project we will need both pandas and plotly, so we import them below.

In [2]:
import pandas as pd
import plotly.express as px

Next we load the combined dataset file, combined_final_data.csv, which we created in the last lesson.

The file is located at this path: "Data/IMDB_rotten_tomato_dataset/combined_final_data.csv"

In [3]:
combined = pd.read_csv("Data/IMDB_rotten_tomato_dataset/combined_final_data.csv")

In [4]:
combined.head()

Unnamed: 0,original_title,year,genre,duration,country,language,imdb_score,worldwide_gross_income,tomatometer_rating
0,The Kid,1921,"Comedy, Drama, Family",68,USA,"English, None",83.0,0.026916,100.0
1,A Woman of Paris: A Drama of Fate,1923,"Drama, Romance",82,USA,"None, English",70.0,0.011233,92.0
2,The Gold Rush,1925,"Adventure, Comedy, Drama",95,USA,"English, None",82.0,0.026916,100.0
3,Metropolis,1927,"Drama, Sci-Fi",153,Germany,German,83.0,1.349711,97.0
4,Sunrise: A Song of Two Humans,1927,"Drama, Romance",94,USA,English,81.0,0.121107,98.0


# **Assessing the audience's movie preferences**

We can analyse the IMDb and Rotten Tomatoes ratings across multiple dimensions to assess audience movie preferences.

We will try to cover the following analysis points:

1. **Comparison of Rotten Tomatoes and IMDb scores** - how different are the IMDb audience scores and the Rotten Tomatoes scores? This helps us understand how movies fare under the two rating systems: do audiences and critics rate the same movies similarly, or do they have wildly different opinions about them?

2. **Rating preference across genre** - which genres are preferred?

3. **Rating preference across year released** - are older movies liked more, or newer ones?

4. **Rating preference across language** - how does audience preference compare across languages?

5. **Rating preference across duration** - are long movies liked or disliked?

We will analyse the first two points in this lesson.

Analysis of the remaining three points is part of the assignment.

# **1. Comparison of IMDb and Rotten Tomatoes scores**

The first analysis is to understand the overall difference between the values of the two rating systems.

Since both ratings are numeric, the easiest visual comparison tool is a box plot.

We will first draw the box plot for imdb_score and then for tomatometer_rating.

In [5]:
px.box(combined, y = 'imdb_score')

In [None]:
px.box(combined,y = 'tomatometer_rating')

**Analysis Point**<br>
From the two plots above we can observe that most IMDb scores lie in a narrow band of 60 - 70, while most Rotten Tomatoes scores lie in a wider band of 30 - 80.

This might be because audiences tend to rate most movies moderately and with less strictness, i.e. movies are rarely rated at the extremes,

whereas critics apply stricter rating criteria, so although many movies are rated average, many others are rated either very high or very low.
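The width of each box can also be expressed as a number. The sketch below computes the quartiles and interquartile range (IQR) for both columns; `spread_summary` is a hypothetical helper name, and the `sample` dataframe is made up for illustration, not taken from the real dataset.

```python
import pandas as pd

def spread_summary(df, columns):
    """Return q1, median, q3 and IQR for each of the given columns (hypothetical helper)."""
    quartiles = df[columns].quantile([0.25, 0.5, 0.75]).T
    quartiles.columns = ['q1', 'median', 'q3']
    quartiles['iqr'] = quartiles['q3'] - quartiles['q1']
    return quartiles

# Made-up sample values, NOT the real dataset
sample = pd.DataFrame({
    'imdb_score': [55, 60, 62, 65, 68, 70, 72],
    'tomatometer_rating': [10, 25, 40, 55, 70, 85, 95],
})
summary = spread_summary(sample, ['imdb_score', 'tomatometer_rating'])
print(summary)
```

On the real data, the same call with `combined` in place of `sample` would show whether the tomatometer IQR is indeed larger than the IMDb one.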

To test this assumption further, we can compare the two ratings across years.

In [6]:
px.box(combined, x = 'year',y='imdb_score')

In [7]:
px.box(combined, x = 'year',y='tomatometer_rating')

Based on the plots above, we can still say that Rotten Tomatoes ratings span a broader range while IMDb scores stay in a narrow range, even across years.
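Per-year box plots are dense, so a coarser numeric check may be easier to read. The sketch below groups movies by decade and computes the standard deviation of each rating; `spread_by_decade` is a hypothetical helper name, grouping by decade is our own simplification, and the sample rows are made up.

```python
import pandas as pd

def spread_by_decade(df, score_columns, year_column='year'):
    """Standard deviation of each score column per decade (hypothetical helper)."""
    # 1927 -> 1920, 2019 -> 2010, and so on
    decade = (df[year_column] // 10 * 10).rename('decade')
    return df.groupby(decade)[score_columns].std()

# Made-up sample rows, NOT the real dataset
sample = pd.DataFrame({
    'year': [1921, 1925, 1927, 2011, 2015, 2019],
    'imdb_score': [80, 82, 84, 60, 62, 64],
    'tomatometer_rating': [90, 95, 100, 20, 50, 80],
})
decade_spread = spread_by_decade(sample, ['imdb_score', 'tomatometer_rating'])
print(decade_spread)
```

A larger value in the tomatometer_rating column than in the imdb_score column for most decades would agree with what the box plots suggest.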

Our assumption is that critics rate many movies high as well as low (with some average ones too), while audiences rate most movies as average.

It would therefore be useful to compare the number of movies rated in different bands for both ratings.

For example, if we observe that in the 0-10 band the number of IMDb ratings is low while the number of tomatometer ratings is comparatively high, that would support our point.

We can do so by first creating two categorical columns:
* imdb_rating_band
* tomatometer_rating_band

We will categorise the ratings into the following 10 bands:<br>
* 0-10
* 10-20
* 20-30
* 30-40
* 40-50
* 50-60
* 60-70
* 70-80
* 80-90
* 90-100

In [9]:
def categorise_ratings(rating):
  # map a 0-100 rating to its band label; each band includes its upper bound
  if rating<=10:
    return '0-10'
  elif rating>10 and rating<=20:
    return '10-20'
  elif rating>20 and rating<=30:
    return '20-30'
  elif rating>30 and rating<=40:
    return '30-40'
  elif rating>40 and rating<=50:
    return '40-50'
  elif rating>50 and rating<=60:
    return '50-60'
  elif rating>60 and rating<=70:
    return '60-70'
  elif rating>70 and rating<=80:
    return '70-80'
  elif rating>80 and rating<=90:
    return '80-90'
  else:
    return '90-100'
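As an alternative sketch, the same banding can be expressed with `pd.cut`, which bins a whole column at once. The names `BAND_EDGES`, `BAND_LABELS` and `band_ratings` are hypothetical. One difference to be aware of: `pd.cut` leaves missing or out-of-range values as NaN, whereas the function above would place a missing rating in '90-100' through its final `else` branch.

```python
import pandas as pd

BAND_EDGES = list(range(0, 101, 10))                          # 0, 10, ..., 100
BAND_LABELS = [f"{lo}-{lo + 10}" for lo in BAND_EDGES[:-1]]   # '0-10', ..., '90-100'

def band_ratings(scores):
    """Bin a Series of 0-100 ratings into ten bands (hypothetical helper).

    right=True makes each band include its upper bound, and
    include_lowest=True puts a rating of exactly 0 into '0-10'.
    """
    return pd.cut(scores, bins=BAND_EDGES, labels=BAND_LABELS,
                  right=True, include_lowest=True)

# Made-up ratings chosen to exercise the band boundaries
sample_scores = pd.Series([0, 10, 10.5, 83, 100])
sample_bands = band_ratings(sample_scores)
print(sample_bands)
```

The result is an ordered categorical column, so the bands keep their natural order in tables and plots.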


In [10]:
combined['imdb_rating_band'] = combined['imdb_score'].apply(categorise_ratings)

In [11]:
combined['tomatometer_rating_band'] = combined['tomatometer_rating'].apply(categorise_ratings)

In [12]:
combined

Unnamed: 0,original_title,year,genre,duration,country,language,imdb_score,worldwide_gross_income,tomatometer_rating,imdb_rating_band,tomatometer_rating_band
0,The Kid,1921,"Comedy, Drama, Family",68,USA,"English, None",83.0,0.026916,100.0,80-90,90-100
1,A Woman of Paris: A Drama of Fate,1923,"Drama, Romance",82,USA,"None, English",70.0,0.011233,92.0,60-70,90-100
2,The Gold Rush,1925,"Adventure, Comedy, Drama",95,USA,"English, None",82.0,0.026916,100.0,80-90,90-100
3,Metropolis,1927,"Drama, Sci-Fi",153,Germany,German,83.0,1.349711,97.0,80-90,90-100
4,Sunrise: A Song of Two Humans,1927,"Drama, Romance",94,USA,English,81.0,0.121107,98.0,80-90,90-100
...,...,...,...,...,...,...,...,...,...,...,...
7143,The Sound of Silence,2019,Drama,85,USA,English,55.0,0.021994,65.0,50-60,60-70
7144,Jexi,2019,"Comedy, Romance",84,"USA, Canada",English,61.0,9.341824,17.0,60-70,10-20
7145,The Death of Dick Long,2019,"Comedy, Crime, Drama",100,USA,English,63.0,0.036856,75.0,60-70,70-80
7146,The King of Staten Island,2020,"Comedy, Drama",136,USA,English,71.0,2.060358,74.0,70-80,70-80


In [13]:
combined_imdb_groupby = combined.groupby('imdb_rating_band').size().reset_index().rename(\
                                    {0:'imdb_movie_count','imdb_rating_band':'rating_band'},axis=1)
combined_imdb_groupby

Unnamed: 0,rating_band,imdb_movie_count
0,10-20,3
1,20-30,33
2,30-40,124
3,40-50,526
4,50-60,1887
5,60-70,2925
6,70-80,1482
7,80-90,166
8,90-100,2


In [14]:
px.bar(combined_imdb_groupby,x='rating_band',y='imdb_movie_count')

In [15]:
combined_tomatometer_groupby = combined.groupby('tomatometer_rating_band').size().reset_index().rename(\
                                    {0:'tomatometer_movie_count','tomatometer_rating_band':'rating_band'},axis=1)
combined_tomatometer_groupby

Unnamed: 0,rating_band,tomatometer_movie_count
0,0-10,472
1,10-20,721
2,20-30,692
3,30-40,705
4,40-50,713
5,50-60,699
6,60-70,722
7,70-80,792
8,80-90,915
9,90-100,717


In [16]:
px.bar(combined_tomatometer_groupby,x='rating_band',y='tomatometer_movie_count')
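To compare the two distributions band by band, the counts can be placed in a single table. The sketch below does this with `value_counts` and `reindex`, so that a band with no movies (such as 0-10 for IMDb) still appears with a count of 0. `ALL_BANDS` and `band_counts` are hypothetical names, and the sample rows are made up.

```python
import pandas as pd

ALL_BANDS = [f"{lo}-{lo + 10}" for lo in range(0, 100, 10)]

def band_counts(df, band_columns):
    """Count movies per rating band for each band column (hypothetical helper)."""
    counts = {
        column: df[column].value_counts().reindex(ALL_BANDS, fill_value=0)
        for column in band_columns
    }
    wide = pd.DataFrame(counts)
    wide.index.name = 'rating_band'
    return wide

# Made-up sample rows, NOT the real dataset
sample = pd.DataFrame({
    'imdb_rating_band': ['50-60', '60-70', '60-70', '70-80'],
    'tomatometer_rating_band': ['0-10', '60-70', '90-100', '90-100'],
})
counts = band_counts(sample, ['imdb_rating_band', 'tomatometer_rating_band'])

# Long format: one row per (band, rating system), convenient for grouped bars
long_counts = counts.reset_index().melt(
    id_vars='rating_band', var_name='rating_system', value_name='movie_count')
print(counts)
```

The long table can then be passed to `px.bar(long_counts, x='rating_band', y='movie_count', color='rating_system', barmode='group')` to draw both distributions side by side.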

**Final Analysis Points**<br>
Based on the two bar graphs above, we can clearly observe that IMDb ratings are mostly concentrated in the 50-70 range, and no movie falls in the 0-10 band at all.

Tomatometer ratings, in contrast, are spread fairly evenly across all bands.

This supports our initial assumption about the difference in rating behaviour between audiences and critics.

Audiences tend to rate most movies moderately and with less strictness, i.e. movies are rarely rated at the extremes,

whereas critics apply stricter rating criteria, so although many movies are rated average, many others are rated either very high or very low.

# **2. Rating preference across genre**

To compare ratings across genres for each of the two rating systems, we will create two dataframes.

* imdb_genre_meanrating - this dataframe contains the mean IMDb rating of movies for each genre.
* tomatometer_genre_meanrating - this dataframe contains the mean tomatometer rating of movies for each genre.


In the pandas section we learnt how to build a dataframe like imdb_genre_rating, which holds the mean IMDb rating for each genre.

We will follow the same steps to create such a dataframe below.

In [17]:
imdb_genre = combined[['genre','imdb_score']].copy()

In [18]:
def convert_genre_list(genre):
  split_genre = genre.split(',')
  remove_spaces_genre_list = [x.strip() for x in split_genre]
  return remove_spaces_genre_list

In [19]:
imdb_genre['genre_list'] = imdb_genre['genre'].apply(convert_genre_list)
imdb_genre

Unnamed: 0,genre,imdb_score,genre_list
0,"Comedy, Drama, Family",83.0,"[Comedy, Drama, Family]"
1,"Drama, Romance",70.0,"[Drama, Romance]"
2,"Adventure, Comedy, Drama",82.0,"[Adventure, Comedy, Drama]"
3,"Drama, Sci-Fi",83.0,"[Drama, Sci-Fi]"
4,"Drama, Romance",81.0,"[Drama, Romance]"
...,...,...,...
7143,Drama,55.0,[Drama]
7144,"Comedy, Romance",61.0,"[Comedy, Romance]"
7145,"Comedy, Crime, Drama",63.0,"[Comedy, Crime, Drama]"
7146,"Comedy, Drama",71.0,"[Comedy, Drama]"


In [20]:
imdb_genre.drop(['genre'],axis=1,inplace=True)
imdb_genre_explode = imdb_genre.explode('genre_list') 
imdb_genre_explode

Unnamed: 0,imdb_score,genre_list
0,83.0,Comedy
0,83.0,Drama
0,83.0,Family
1,70.0,Drama
1,70.0,Romance
...,...,...
7146,71.0,Comedy
7146,71.0,Drama
7147,65.0,Biography
7147,65.0,Drama


In [21]:
imdb_genre_meanrating = imdb_genre_explode.groupby('genre_list')['imdb_score'].mean().reset_index()
imdb_genre_meanrating

Unnamed: 0,genre_list,imdb_score
0,Action,61.186093
1,Adventure,63.449428
2,Animation,65.834507
3,Biography,69.700431
4,Comedy,61.271972
5,Crime,63.311464
6,Drama,65.500502
7,Family,60.920354
8,Fantasy,61.492337
9,Film-Noir,78.705882


In [22]:
# build the same kind of dataframe for the tomatometer rating by repeating the steps above
tomatometer_genre = combined[['genre','tomatometer_rating']].copy()
tomatometer_genre['genre_list'] = tomatometer_genre['genre'].apply(convert_genre_list)
tomatometer_genre.drop(['genre'],axis=1,inplace=True)
tomatometer_genre_explode = tomatometer_genre.explode('genre_list') 
tomatometer_genre_meanrating = tomatometer_genre_explode.groupby('genre_list')['tomatometer_rating'].mean().reset_index()
tomatometer_genre_meanrating

Unnamed: 0,genre_list,tomatometer_rating
0,Action,45.296419
1,Adventure,53.805458
2,Animation,63.260563
3,Biography,67.439655
4,Comedy,49.915888
5,Crime,51.515152
6,Drama,59.004268
7,Family,49.378319
8,Fantasy,47.846449
9,Film-Noir,97.176471
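Since the same split, explode and group steps are applied to both rating columns, they could be wrapped in one function. This is only a sketch: `genre_mean_rating` is a hypothetical name, it uses the pandas string methods `str.split` and `str.strip` in place of `convert_genre_list`, and the sample rows are made up.

```python
import pandas as pd

def genre_mean_rating(df, score_column, genre_column='genre'):
    """Mean of score_column per genre, one row per genre (hypothetical helper)."""
    tidy = df[[score_column]].copy()
    # 'Comedy, Drama' -> ['Comedy', ' Drama']
    tidy['genre_list'] = df[genre_column].str.split(',')
    # one row per (movie, genre) pair
    tidy = tidy.explode('genre_list')
    # remove the space left after each comma
    tidy['genre_list'] = tidy['genre_list'].str.strip()
    return tidy.groupby('genre_list')[score_column].mean().reset_index()

# Made-up sample rows, NOT the real dataset
sample = pd.DataFrame({
    'genre': ['Comedy, Drama', 'Drama', 'Comedy, Romance'],
    'imdb_score': [80.0, 60.0, 70.0],
})
sample_means = genre_mean_rating(sample, 'imdb_score')
print(sample_means)
```

With the real data, `genre_mean_rating(combined, 'imdb_score')` and `genre_mean_rating(combined, 'tomatometer_rating')` should produce the same two tables as the cells above.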


In [23]:
px.bar(imdb_genre_meanrating,x='genre_list',y='imdb_score')

In [24]:
px.bar(tomatometer_genre_meanrating,x='genre_list',y='tomatometer_rating')

**Analysis Points**<br>
Based on the two plots above, we can clearly make the following points about genre preference under the two rating systems:<br>
1. The Film-Noir genre is strongly preferred in both rating systems.
2. Some genres, such as Animation, War and Western, are about equally preferred in both rating systems.
3. Overall, genres tend to be rated lower on the tomatometer (most genres average around 50) than on IMDb (most genres average around 60).
4. A sizeable difference in rating can be observed for the following genres (for these genres, the IMDb mean rating is higher than the Rotten Tomatoes mean rating):
  * Action
  * Comedy
  * Crime
  * Family
  * Thriller
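These differences can also be read off a table instead of two separate charts. The sketch below merges the two mean-rating dataframes and sorts genres by the gap between them; `genre_rating_gap` is a hypothetical helper name, and the three sample rows are rounded values copied from the mean-rating tables shown earlier.

```python
import pandas as pd

def genre_rating_gap(imdb_means, tomatometer_means):
    """Merge both mean-rating tables and add gap = IMDb mean - tomatometer mean (hypothetical helper)."""
    merged = imdb_means.merge(tomatometer_means, on='genre_list')
    merged['gap'] = merged['imdb_score'] - merged['tomatometer_rating']
    return merged.sort_values('gap', ascending=False).reset_index(drop=True)

# Three genres with rounded values from the tables shown earlier
imdb_sample = pd.DataFrame({
    'genre_list': ['Action', 'Biography', 'Film-Noir'],
    'imdb_score': [61.19, 69.70, 78.71],
})
tomatometer_sample = pd.DataFrame({
    'genre_list': ['Action', 'Biography', 'Film-Noir'],
    'tomatometer_rating': [45.30, 67.44, 97.18],
})
gap_table = genre_rating_gap(imdb_sample, tomatometer_sample)
print(gap_table)
```

A positive gap means the genre is rated higher on IMDb, and a negative gap means it is rated higher on the tomatometer.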

**Extra Analysis Pointers**<br>
One more kind of plot can be used to see the difference in genre-based rating preference.

We can construct a box plot of the IMDb rating for each genre using the imdb_genre_explode dataframe.

The imdb_genre_explode dataframe looks like this:

In [26]:
imdb_genre_explode

Unnamed: 0,imdb_score,genre_list
0,83.0,Comedy
0,83.0,Drama
0,83.0,Family
1,70.0,Drama
1,70.0,Romance
...,...,...
7146,71.0,Comedy
7146,71.0,Drama
7147,65.0,Biography
7147,65.0,Drama


As you can see above, the dataframe has two columns: the genre_list column contains the genre name and the imdb_score column contains the corresponding IMDb score. One thing to understand about this dataframe is that it was produced by exploding the genre_list column, so a movie with more than one genre appears in more than one row, and each row pairs one of its genres with the movie's imdb_score.
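A two-movie example makes the effect of `explode` easier to see. The rows below reuse the scores and genres of The Kid and The Sound of Silence from the tables above; the variable names are only illustrative.

```python
import pandas as pd

# Two movies: one with three genres, one with a single genre
movies = pd.DataFrame({
    'imdb_score': [83.0, 55.0],
    'genre_list': [['Comedy', 'Drama', 'Family'], ['Drama']],
})

# explode creates one row per list element and repeats the original index
exploded = movies.explode('genre_list')
print(exploded)

# How many rows each original movie produced, keyed by its original index
rows_per_movie = exploded.index.value_counts().sort_index()
print(rows_per_movie)
```

Because the original index is repeated, a movie with several genres contributes its score once to each of those genres.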

These two columns can now be used to create the box plot as shown below.

In [25]:
px.box(imdb_genre_explode,x='genre_list',y='imdb_score')

We can create the box plots of the tomatometer rating for each genre using the tomatometer_genre_explode dataframe.

The tomatometer_genre_explode dataframe looks like this:

In [27]:
tomatometer_genre_explode

Unnamed: 0,tomatometer_rating,genre_list
0,100.0,Comedy
0,100.0,Drama
0,100.0,Family
1,92.0,Drama
1,92.0,Romance
...,...,...
7146,74.0,Comedy
7146,74.0,Drama
7147,51.0,Biography
7147,51.0,Drama


We can use its two columns to make box plots as shown below.

In [28]:
px.box(tomatometer_genre_explode,x='genre_list',y='tomatometer_rating')

Using these two box plots, we can arrive at the same analysis points as we did from the bar plots of mean rating.

Please compare the two box plots above and try to arrive at the analysis points described earlier.
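When comparing the box plots, it may help to have the box edges as numbers. The sketch below uses `groupby(...).describe()` to get the quartiles per genre; `genre_box_stats` is a hypothetical helper name and the sample rows are made up.

```python
import pandas as pd

def genre_box_stats(exploded, score_column, genre_column='genre_list'):
    """Quartiles and IQR of score_column for each genre (hypothetical helper)."""
    described = exploded.groupby(genre_column)[score_column].describe()
    stats = described[['25%', '50%', '75%']].copy()
    stats.columns = ['q1', 'median', 'q3']
    stats['iqr'] = stats['q3'] - stats['q1']
    return stats

# Made-up exploded rows, NOT the real dataset
sample = pd.DataFrame({
    'genre_list': ['Comedy', 'Comedy', 'Comedy', 'Drama', 'Drama', 'Drama'],
    'tomatometer_rating': [60.0, 70.0, 80.0, 20.0, 50.0, 80.0],
})
box_stats = genre_box_stats(sample, 'tomatometer_rating')
print(box_stats)
```

Running it on imdb_genre_explode and tomatometer_genre_explode gives two tables whose median and iqr columns can be compared genre by genre.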