<a href="https://colab.research.google.com/drive/1ZOvti_PuFHKfm49sbjyYu4fNF8nlZZMm?usp=sharing">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

This tutorial needs data. If you are working on Colab, follow the data setup instructions below.

# Data Setup Instructions

These instructions explain how to mount your data from Google Drive in Colab and access it from the notebook.

STEP 1 - After opening the tutorial in Colab, click the folder button and then click Mount Google Drive.

STEP 2 - The drive folder will be mounted in the current directory, /content. You can access it as shown below.

In [None]:
# print current directory
%pwd

'/content'

In [None]:
%ls

[0m[01;34mdrive[0m/  [01;34msample_data[0m/


STEP 3 - Find the folder where you saved the data and symlink it into the /content folder to simplify data access.

In this case the Data folder is located at the following path in Google Drive (use your own data path instead):

/content/drive/Othercomputers/My MacBook Pro/Data/

We can symlink it into the /content folder using the following command.

In [2]:
# symlink the original data folder into /content (this creates /content/Data)
!ln -s "/content/drive/Othercomputers/My MacBook Pro/Data" "/content"

Now we can access the data through this link by giving file paths that start with Data/.
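To make the effect of the `ln -s` command more concrete, here is a minimal sketch that reproduces it in Python inside a temporary directory. The directory and file names are hypothetical stand-ins for the Drive folder and /content. It assumes a Linux-like system such as Colab.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical stand-in directories; on Colab the real ones would be
# the Drive data folder and /content.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "drive" / "Data"
    source.mkdir(parents=True)
    (source / "movies_cleaned.csv").write_text("year,duration\n2010,95\n")

    content = Path(tmp) / "content"
    content.mkdir()

    # Equivalent of: ln -s "<source>" "<content>"
    link = content / "Data"
    os.symlink(source, link)

    # The link behaves like the original folder
    link_is_symlink = link.is_symlink()
    file_visible = (link / "movies_cleaned.csv").exists()
    print(link_is_symlink, file_visible)
```

A check like `Path("Data").exists()` in the notebook is a quick way to confirm that the link was created before trying to read any file.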

# Importing pandas library and data loading

In [3]:
import pandas as pd

In this lesson we will be using the movies_cleaned.csv file.

As mentioned in the lesson instructions for Pandas - Advanced Real World Data Analysis, you need to rename the file as follows:

Movies_cleaned_lesson2.csv (created in lesson 2 of Pandas - Data Cleaning) -> movies_cleaned.csv

The file is saved in the same path as the rest of the IMDB dataset, i.e.

"Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv"

You can read this file as shown below.

In [4]:
# if you are working through this tutorial on a local machine, use the file path where the data is saved on your computer
movies_cleaned = pd.read_csv("Data/IMDB_rotten_tomato_dataset/IMDB/movies_cleaned.csv")
# we can use the .head() method to quickly look at the first 5 rows of the dataset
movies_cleaned.head()

Unnamed: 0,imdb_title_id,original_title,year,date_published,genre,duration,country,language,imdb_score,votes,budget,usa_gross_income,worldwide_gross_income,metascore,movie_age
0,tt0000009,Miss Jerry,1894,1894-10-09,Romance,45,USA,,5.9,154,,,,,127
1,tt0000574,The Story of the Kelly Gang,1906,1906-12-26,"Biography, Crime, Drama",70,Australia,,6.1,589,$ 2250,,,,115
2,tt0001892,Den sorte drøm,1911,1911-08-19,Drama,53,"Germany, Denmark",,5.8,188,,,,,110
3,tt0002101,Cleopatra,1912,1912-11-13,"Drama, History",100,USA,English,5.2,446,$ 45000,,,,109
4,tt0002130,L'Inferno,1911,1911-03-06,"Adventure, Drama, Fantasy",68,Italy,Italian,7.0,2237,,,,,110


# Data Visualisation

In this section we will learn about different graphs that we can use to summarise our data and gain important insights from it. Visuals are important because they help us understand data in ways that rows and rows of tabular data cannot.

Different kinds of graphs are suitable for different kinds of data.

For example, let's consider two different columns: the duration column and the genre column.

In [5]:
movies_cleaned[['duration','genre']]

Unnamed: 0,duration,genre
0,45,Romance
1,70,"Biography, Crime, Drama"
2,53,Drama
3,100,"Drama, History"
4,68,"Adventure, Drama, Fantasy"
...,...,...
85849,95,Comedy
85850,103,"Comedy, Drama"
85851,130,Drama
85852,98,"Drama, Family"


You can observe that the duration column is numerical while the genre column is categorical.
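The distinction between numerical and categorical columns can also be checked in code. The sketch below uses a small hand-made dataframe (not the real dataset) and treats every non-numeric column as categorical, which is only a rough rule of thumb.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data with the same two columns as in the lesson
toy = pd.DataFrame({
    "duration": [45, 70, 53],
    "genre": ["Romance", "Biography, Crime, Drama", "Drama"],
})

# Rough heuristic: numeric dtype -> numerical, anything else -> categorical
kinds = {
    col: "numerical" if is_numeric_dtype(toy[col]) else "categorical"
    for col in toy.columns
}
print(kinds)
```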

Let's suppose we want to represent the information about duration graphically.

A suitable graph is a box plot. It shows how the duration values of all the movies are spread out, as shown below.


![box_plot_duration.png](https://drive.google.com/uc?export=view&id=1pmwf6uyxUg20E1oYSHdvyUP-ECbT7Op5)


If instead we want to show how many movies were made in each genre, a bar plot is a better choice, as shown below.

![bar_plot_genre.png](https://drive.google.com/uc?export=view&id=1lrkL4UoUE9Gs5wVmslZ_96O7fy28nnFC)

# **Plotly Introduction**

**What is Plotly?**<br>
* Plotly is a JavaScript-based graphing library
  * But you don't have to learn JavaScript in order to use Plotly
* Plotly has a Python wrapper.

**Why Plotly?**<br>
It has some advantages over other graphing libraries in Python.
* It is fast and easy to implement simple plots
* plotly.express provides low-code, low-effort options to build graphs
* Plots can be extensively customized if you want to do so.
* Many interactive features are available with Plotly, such as mouse hover.


**Creating plotly figures**<br>
Plotly graphs can be created in two ways:<br>
1. With plotly.express for simple, quick plots
2. With plotly.graph_objects (commonly imported as go) for more customised plots

In this course we will use plotly.express to plot graphs.

Graphs in plotly.express are typically built from dataframes. In the next section we will learn how to make graphs with Plotly using dataframes.

**Plotly Documentation**<br>
The Plotly library gives a lot of flexibility while making plots.

We will not be able to cover many of these options in this course.

So it is advisable to go through the Plotly Express documentation on this page - https://plotly.com/python/plotly-express/

# **Different types of plots**

In this lesson we will learn about three types of plots:
1. bar plot
2. box plot
3. scatter plot

First we will import plotly.express as shown below.

In [6]:
import plotly.express as px

To make any graph with plotly.express, this is the syntax you use:

<code> px.graph_name(dataframe, x=col_name, ...other graph attributes) </code>

# **Bar Plot**

A bar plot is used to plot a categorical column against a numerical value.

For example, let's suppose we want to solve the problem below:

Plot the number of movies for each genre.

We can plot it using a bar plot.

First we need to create a dataframe that holds the number of movies in each genre.

If you recall the section on groupby with explode, we can do so as shown below.

In [7]:
genre_col = movies_cleaned[['genre']].copy()

genre_col

Unnamed: 0,genre
0,Romance
1,"Biography, Crime, Drama"
2,Drama
3,"Drama, History"
4,"Adventure, Drama, Fantasy"
...,...
85849,Comedy
85850,"Comedy, Drama"
85851,Drama
85852,"Drama, Family"


In [8]:
def convert_genre_list(row):
  gen_list = [genre.strip() for genre in row['genre'].split(',')]
  return gen_list

genre_col['genre_list'] = genre_col.apply(lambda row:convert_genre_list(row),axis=1)

genre_col

Unnamed: 0,genre,genre_list
0,Romance,[Romance]
1,"Biography, Crime, Drama","[Biography, Crime, Drama]"
2,Drama,[Drama]
3,"Drama, History","[Drama, History]"
4,"Adventure, Drama, Fantasy","[Adventure, Drama, Fantasy]"
...,...,...
85849,Comedy,[Comedy]
85850,"Comedy, Drama","[Comedy, Drama]"
85851,Drama,[Drama]
85852,"Drama, Family","[Drama, Family]"


In [9]:
# drop genre column
genre_col.drop('genre',axis=1,inplace=True)

# explode genre_list column
exp_genre = genre_col.explode('genre_list')

exp_genre

Unnamed: 0,genre_list
0,Romance
1,Biography
1,Crime
1,Drama
2,Drama
...,...
85850,Drama
85851,Drama
85852,Drama
85852,Family


In [10]:
# doing groupby with size to get the count of movies for each genre
count_genre_movie = exp_genre.groupby('genre_list').size().reset_index().rename({0:'count'},axis=1)

count_genre_movie

Unnamed: 0,genre_list,count
0,Action,12948
1,Adult,2
2,Adventure,7590
3,Animation,2141
4,Biography,2376
5,Comedy,29367
6,Crime,11066
7,Documentary,2
8,Drama,47110
9,Family,3962
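The same counts can probably be reached in fewer steps. The sketch below is one possible alternative, not the method used in this lesson: it splits the genre strings, explodes them, and counts with `value_counts` on a small hand-made dataframe.

```python
import pandas as pd

# Hand-made sample in the same format as the genre column
toy = pd.DataFrame({
    "genre": ["Romance", "Biography, Crime, Drama", "Drama", "Drama, History"],
})

# Split on commas, put one genre per row, and remove stray spaces
genres = toy["genre"].str.split(",").explode().str.strip()

# Count each genre and return a two-column dataframe
counts = (
    genres.value_counts()
    .rename_axis("genre_list")
    .reset_index(name="count")
)
print(counts)
```

Unlike the groupby result above, `value_counts` sorts the genres by count (largest first) rather than alphabetically.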


Now we can use the columns of the count_genre_movie dataframe to build the bar graph.

In the bar graph, we will have one bar for each genre, and the height of each bar will be equal to the count of movies in that genre.

On the x-axis we want the genre and on the y-axis we want the count of movies in that genre.

In [11]:
px.bar(count_genre_movie,x='genre_list',y='count')

With the help of the bar plot above, we can clearly see that some genres, e.g. Documentary, have very few movies in the dataset, while others, like Drama, have the most movies.

# **Box Plot**

A box plot is useful for plotting numerical data columns. It shows the spread of values in a numerical column.

When we take all the values of a numerical column and plot them as a box plot, it looks like the figure below.

![box_plot_1.png](https://drive.google.com/uc?export=view&id=1v-k_8TEF5lYKukUZjX1pjFCDc1ULWj2t)

A box plot is also called a box-and-whisker plot, as the picture below illustrates.

![box_plot_2.png](https://drive.google.com/uc?export=view&id=17RKikwN7f2_MxmyLX-tzBlWfHbbG01Rd)

A box plot shows the median value.

![box_plot_3.png](https://drive.google.com/uc?export=view&id=1gZGo29-srO_HZ7Wlz4DruP51QiAwo70f)

![box_plot_4.png](https://drive.google.com/uc?export=view&id=1uFAWymkpzRmfwfDkKa67J8tcq2C1VYsC)

The boundaries of the box itself have the following meaning:

![box_plot_5.png](https://drive.google.com/uc?export=view&id=1m-_lg0TUHl30EeyjUmxD0xGDHGX15EZt)

We can also represent outliers in a box plot.

![box_plot_6.png](https://drive.google.com/uc?export=view&id=1qTrVT2Dm4PmACoO3gt8AL2SsP-ECK0Q9)

The different points on a box plot are named as shown below.

![box_plot_7.png](https://drive.google.com/uc?export=view&id=1TxkuIfhx4xm0v7evXzF0udQ7JSY-KT4f)

Now we can make a box plot of any numerical column, such as duration from the movies_cleaned dataframe, as shown below.

In a box plot, you pass the name of the column to plot as the y argument.

In [12]:
px.box(movies_cleaned,y='duration')

Hovering over the box plot shows the following important points of the box plot:
* median
* q1
* q3
* lower fence 
* upper fence
* min
* max

All of this information would be very difficult to extract from tabular data, but showing it visually makes it easy to read.
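To see where such numbers come from, the sketch below computes them by hand for a small made-up list of durations. It assumes the common 1.5 × IQR convention for the fences and uses pandas' default quantile interpolation, so the values plotly shows for the same data may differ slightly.

```python
import pandas as pd

# Made-up durations (minutes), including one extreme value
durations = pd.Series([45, 53, 68, 70, 90, 95, 98, 100, 103, 130, 300])

median = durations.median()
q1 = durations.quantile(0.25)
q3 = durations.quantile(0.75)
iqr = q3 - q1  # interquartile range, the height of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Fences are the most extreme data points that are still inside the limits
inside = durations[(durations >= lower_limit) & (durations <= upper_limit)]
lower_fence = inside.min()
upper_fence = inside.max()
outliers = durations[(durations < lower_limit) | (durations > upper_limit)].tolist()

print(median, q1, q3, lower_fence, upper_fence, outliers)
```

In this toy example the value 300 lies above the upper limit, so it would be drawn as a separate outlier point rather than as part of the whisker.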

This is the box plot of all movies in the dataset.

Let's suppose we want to draw box plots for movies released in different years. We can also do that using the px.box function.

Before that, consider plotting the duration column as a box plot only for movies released in 2010.

We can do so by first creating a dataframe of only those movies released in 2010.

In [13]:
movies_cleaned_2010 = movies_cleaned.loc[movies_cleaned['year']==2010]

In [14]:
px.box(movies_cleaned_2010,y='duration')

The box plot above shows duration only for movies released in 2010.

What if we want to do this for all the years?

In that case, creating a dataframe for each year would be tedious.

There is another way to create a box plot for each year in a single graph, using the movies_cleaned dataframe directly.

Here we pass the 'duration' column as y and the 'year' column as x.

In [15]:
px.box(movies_cleaned,x='year',y='duration')

In the graph above, we get a box plot of duration for each year.
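Conceptually, passing x='year' groups the rows by year before drawing each box. A rough tabular counterpart of that grouping can be sketched with groupby on made-up data (the years and durations below are invented):

```python
import pandas as pd

# Invented sample: three movies per year
toy = pd.DataFrame({
    "year": [2009, 2009, 2009, 2010, 2010, 2010],
    "duration": [90, 100, 110, 80, 95, 120],
})

# One row of summary statistics per year, similar in spirit to one box per year
summary = toy.groupby("year")["duration"].agg(["min", "median", "max"])
print(summary)
```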

# **Scatter Plot**

Scatter plots are useful for representing data about two numerical columns in one graph.

We plot one variable on the x-axis and the other on the y-axis; each row of data then becomes a point at the corresponding (x, y) position.

Let's plot a scatter chart for the columns duration and imdb_score.

In [16]:
px.scatter(movies_cleaned,x='duration',y='imdb_score')

Each dot in the scatter plot above is a movie.

Based on the scatter plot, we can see that most of the movies have a duration of less than 200 min.

Movies longer than 350 min usually have a rating above 6.
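Observations read off a scatter plot can be double-checked with a boolean mask. The sketch below shows the idea on a few invented rows. The resulting numbers describe only this toy data, not the real dataset.

```python
import pandas as pd

# Invented rows, only to demonstrate the check
toy = pd.DataFrame({
    "duration": [90, 120, 150, 210, 400],
    "imdb_score": [6.5, 7.0, 5.5, 7.5, 8.0],
})

# Share of movies shorter than 200 minutes
share_under_200 = (toy["duration"] < 200).mean()

# Among movies longer than 350 minutes, share rated above 6
long_movies = toy[toy["duration"] > 350]
share_long_above_6 = (long_movies["imdb_score"] > 6).mean()

print(share_under_200, share_long_above_6)
```

Running the same two expressions on movies_cleaned would give the actual shares for the dataset.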

# **More Chart Formatting Options**

## Adding hover_data argument

In the scatter plot that we created above, hovering over a point shows only the duration and imdb_score data.

But each dot in the plot is a movie, so it would be helpful to see more information about each dot when we hover over it.

In particular, we want to see the following information about each dot:<br>
* original_title
* genre
* year

We can change our scatter plot so that this information is also visible on hover.

In [17]:
px.scatter(movies_cleaned,x='duration',y='imdb_score',hover_data=['original_title','genre','year'])


Now if we hover over the scatter plot, we can see the additional information: title, genre and year.

We added one more argument to the scatter plot, called hover_data, and passed it a list of the column names that we want to see on hover.

## Adding hover_name argument

If you hover over the scatter plot, the first item you see is duration, then imdb_score and then original_title.

It would be better if the movie title appeared at the top of the hover box. This would help us identify the movie directly.

We can put the movie title at the top of each hover box by using a new argument, hover_name. It sets the title of each hover box; in this case we use the original_title column.

In [18]:
px.scatter(movies_cleaned,x='duration',y='imdb_score',hover_name = 'original_title', hover_data=['genre','year'])


Now the movie title appears at the top of each hover box, with all the other hover information below it.

## Adding range_x and range_y

Let's suppose we want the graph to show only movies with an imdb_score above 6 and a duration of more than 200 min.

We can set the ranges of the x-axis and y-axis to restrict the graph, as shown below.

In [19]:
px.scatter(movies_cleaned,x='duration',y='imdb_score',range_x=[200,850],range_y=[6,10], hover_name = 'original_title', hover_data=['genre','year'])


Using the range_x and range_y arguments, we can restrict the graph along the x-axis and y-axis.

## Adding title argument

We can also pass the title argument to a plotly graph to show a title at the top of the graph.

In [20]:
px.scatter(movies_cleaned, x='duration', y='imdb_score',
           range_x=[200,850], range_y=[6,10],
           hover_name='original_title', hover_data=['genre','year'],
           title='scatter plot of imdb_score and duration')
