- Recently I got a lot of feedback from friends who have just changed, or are about to change, their careers towards Data Analysis, Data Science and Machine Learning. They point to a lack of material between the first steps of the analysis journey and the advanced techniques.

- They are looking for detailed but beginner-friendly Exploratory Data Analysis examples that are not overly complicated (no assortment of regression or normalization techniques, etc.) and that show them how to start and, most importantly, how to read descriptive statistics and graphs.

- After this feedback, I decided to make a series of EDAs on different datasets, keeping them uncomplicated for people at the first steps of their DS/ML journey.

### This notebook is part of the 9 Beginner Friendly EDAs. If these EDAs are helpful to anyone, I will be more than happy.



#### **INTRO**



- In this study, we are going to do an Exploratory Data Analysis (EDA) of the Hollywood's Most Profitable Movies dataset.
- The study aims to be beginner friendly and to explain each step along the way as fully as possible.
- The dataset has 74 movies along with their ratings, profitability, worldwide gross and lead studio.
- The data covers movies from 2007 to 2011.

- First, let's import the required libraries.
- We will use Plotly's interactive environment for visualization.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt 
import seaborn as sns 
import matplotlib as mpl


import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

### Overview Stage

- Read the csv
- Look for basic information about the dataset

In [None]:
df= pd.read_csv('../input/hollywood-most-profitable-stories/HollywoodsMostProfitableStories.csv')

In [None]:
df.head()

In [None]:
df.shape

- We have 74 films and 8 variables to work on

In [None]:
df.isnull().sum()

- We have several missing values, which we need to look into.
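
A raw count of missing values is easier to judge next to the share of rows it represents. Here is a minimal sketch of that idea. It uses a small made-up frame (`toy`) that only borrows a few column names from this dataset; the values are invented for illustration and are not taken from the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names follow the dataset, values are invented.
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C', 'D'],
    'Lead Studio': ['Independent', None, 'Fox', 'Disney'],
    'Profitability': [1.5, np.nan, np.nan, 4.0],
})

# isnull() marks each missing cell as True.
# sum() counts the True values, mean() gives their share of all rows.
missing = pd.DataFrame({
    'missing': toy.isnull().sum(),
    'percent': toy.isnull().mean() * 100,
})
print(missing)
```

The same two lines can be applied to `df` to see how large each gap is relative to the 74 rows.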

In [None]:
df.info()

- The data types are all in order, so we can work with the columns as they are.

In [None]:
df.describe()

Before going further, let's summarize what we have got from the dataset.

- Our dataset has 74 films from different genres and lead studios.

- Variables of the object data type, like genre and lead studio, can be used for grouping to see the differences among the groups.

- There are several missing values to look into.

- The numerical variables deserve special attention in the further analysis.


- Let's make the necessary adjustments before moving to the analysis part.

#### Missing Values

- Let's recall which columns have missing values.

In [None]:
df.isnull().sum()

- Since we will use all the films, the Film column must not have missing values, and it has none. So we are OK for the Film column.
- The Genre, Worldwide Gross and Year columns do not have any missing values either. Hurray!!
- Let's look at the rows with missing values.

In [None]:
df[df['Lead Studio'].isnull()]

- Besides domain knowledge and expertise, there are many different ways to deal with missing values.

- Most of the time, people either drop the affected rows (for example with `dropna`) or fill the gaps with `fillna` so the row can still be used. As I said, the choice depends on the data, domain knowledge and the importance of the variable in our analysis.

- In this dataset, the main variable we want to work with is 'Film'. If we have the name of the film and values for at least some of the other variables, we can use that row.

- Based on the aforementioned points, we can keep this row; it has a lot of useful information.
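
To make the three options concrete (drop, fill, or keep as is), here is a small sketch on invented data. The frame `toy` and the placeholder label `'Unknown'` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: values are invented for illustration.
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C'],
    'Lead Studio': ['Fox', None, 'Disney'],
    'Profitability': [2.0, 3.0, np.nan],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gap with a placeholder label
# ('Unknown' is an arbitrary choice for this sketch).
filled = toy.fillna({'Lead Studio': 'Unknown'})

# Option 3 (the one chosen in this notebook): leave the rows as they are.
# pandas skips NaN in most aggregations, so this mean uses the 2 known values.
mean_profit = toy['Profitability'].mean()

print(len(dropped), len(filled), mean_profit)
```

Dropping is the most costly choice here: two of the three toy rows disappear even though each of them still carries useful values in the other columns.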


In [None]:
df[df['Audience  score %'].isnull()]

- Audience score and Rotten Tomatoes score are good variables to use for rating purposes.
- In this row we don't have either of them.
- Still, we have values for the other columns, so we can keep this row.

In [None]:
df[df['Profitability'].isnull()]

- Profitability has 3 missing values.
- Even though we don't have profitability values in these rows, we have values for other columns.
- So it is better to keep them.

#### Look at the 'Genre' and 'Lead Studio'

In [None]:
df['Genre'].value_counts()

- Seems quite OK to use in the groupby.
- Noted

In [None]:
df['Lead Studio'].value_counts()

- It can be used.
- It is not in perfect shape for a groupby, but it is good enough to work with.
- Noted.

- Everything seems OK.  Let's move on to the next step: **analysis part**.

### Analysis Part

#### Genre

In [None]:
df['Genre'].value_counts(normalize=True)

- 55% of the movies are in the Comedy genre.
- Romance and Drama follow it.

In [None]:
fig = px.histogram(df, x="Genre", title='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### Lead Studio

In [None]:
df['Lead Studio'].value_counts(normalize=True)

- Independent studios account for 26% of the movies in this dataset.
- Warner Bros follows with 16%.

In [None]:
fig = px.histogram(df, x="Lead Studio", title='Lead Studios')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

#### Audience Score %

In [None]:
df['Audience  score %'].describe()

- The mean and the median are both about 64.
- When the two are this close, the distribution is roughly symmetric, so the audience score looks close to normally distributed.
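
Comparing the mean with the median is a quick way to read the shape of a distribution, and pandas also offers `skew()` as a single number for the same question. The two short series below are invented to show the contrast; they are not taken from the dataset.

```python
import pandas as pd

# Invented values: one symmetric series and one with a single large value.
symmetric = pd.Series([40, 50, 60, 70, 80])
right_skewed = pd.Series([1, 2, 2, 3, 30])

for name, s in [('symmetric', symmetric), ('right_skewed', right_skewed)]:
    # mean close to median and skew near 0 -> roughly symmetric
    # mean well above median and skew > 0  -> right-skewed
    print(name, 'mean:', s.mean(), 'median:', s.median(), 'skew:', round(s.skew(), 2))
```

Note that equal mean and median only suggest symmetry. The histogram is still needed to judge whether the shape is actually bell-like.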

In [None]:

fig = px.histogram(df, x='Audience  score %', title='Percentage of the Audience  score', marginal="box", hover_data=['Film', 'Genre'])
fig.show()

#### Profitability

In [None]:
df['Profitability'].describe()

- We have a right-skewed distribution (the mean is significantly bigger than the median).
- This basically means that we may have outliers in our dataset and that they pull the mean upward.
- Let's see it.

In [None]:
fig = px.histogram(df, x='Profitability', title='Profitability of the Movies', marginal="box", hover_data=['Film', 'Genre'])
fig.show()

- Yes, we have quite a number of outliers and a right-skewed distribution of movie profitability.
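
The points drawn beyond the whiskers of a box plot can also be listed with code. A common convention is the 1.5 × IQR rule, sketched below on invented numbers. Plotly may compute quartiles slightly differently from pandas, so treat this as an approximation of what the plot shows rather than an exact match.

```python
import pandas as pd

# Invented values with one obviously large observation.
values = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 30.0])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Anything outside [q1 - 1.5*IQR, q3 + 1.5*IQR] is flagged as a possible outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Replacing `values` with `df['Profitability']` would list the candidate outliers of the real column; missing values are simply never flagged, because comparisons with NaN are False.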

#### Rotten Tomatoes %

In [None]:
df['Rotten Tomatoes %'].describe()

- The maximum in the data is 96, which pulls the mean up a little.
- We can expect a right-skewed distribution, but only mildly so,
- because the mean and the median are close to each other (47 and 45).

In [None]:
fig = px.histogram(df, x='Rotten Tomatoes %', title='Rating Score-Rotten Tomatoes %', marginal="box", hover_data=['Film', 'Genre'])
fig.show()

In [None]:
df['Worldwide Gross'].describe()

- Yes, as you can see, we have a highly right-skewed distribution (mean = 136.5, median = 73.19).
- We have possible outliers.
- Let's see them.

In [None]:
fig = px.histogram(df, x='Worldwide Gross', title='Worldwide Gross', marginal="box", hover_data=['Film', 'Genre'])
fig.show()

- Before moving on to the details, let's see the correlation matrix for our dataset.

In [None]:
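# Keep only the numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns.
df.drop('Year', axis=1).select_dtypes('number').corr()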

In [None]:
index_vals = df['Genre'].astype('category').cat.codes

fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='Audience  score %',
                                 values=df['Audience  score %']),
                            dict(label='Profitability',
                                 values=df['Profitability']),
                            dict(label='Rotten Tomatoes %',
                                 values=df['Rotten Tomatoes %']),
                            dict(label='Worldwide Gross',
                                 values=df['Worldwide Gross'])],
                showupperhalf=False, 
                text=df['Film'],
                marker=dict(color=index_vals,
                            showscale=False, # colors encode categorical variables
                            line_color='white', line_width=0.5)
                ))


fig.update_layout(
    title='Movies',
    width=1000,
    height=1000,
)

fig.show()

Based on the results:
- There is a positive but only moderate relationship (.60) between Audience Score and Rotten Tomatoes.
- There is also a positive but weak relationship (.395) between Audience Score and Worldwide Gross.
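
For readers who wonder where such a number comes from: the default of `corr()` is the Pearson correlation, which is the covariance of two columns divided by the product of their standard deviations. The sketch below checks this by hand on invented scores (the frame `toy` is hypothetical and only borrows the column names).

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration.
toy = pd.DataFrame({
    'Audience  score %': [50, 60, 70, 80, 90],
    'Rotten Tomatoes %': [30, 55, 45, 70, 85],
})

x = toy['Audience  score %']
y = toy['Rotten Tomatoes %']

# Pearson r = covariance(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())

# The built-in method should give the same value.
r_pandas = x.corr(y)

print(round(r_manual, 3), round(r_pandas, 3))
```

The result always lies between -1 and 1. Values near 0 mean little linear relationship, and values near 1 or -1 mean a strong one.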

- Now that we have an overall picture of the data, we can go into more detail.

In [None]:
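# Count movies per genre within each year. Naming the counts explicitly keeps
# the 'Genre count' column regardless of how the pandas version names it by default.
genre_by_year = df.groupby('Year')['Genre'].value_counts().rename('Genre count').reset_index(level=0)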
genre_by_year

In [None]:
fig = px.line(genre_by_year, x='Year', y='Genre count', color= genre_by_year.index, title='Movies By Genre in Each Year')
fig.show()

- From the line plot we can see that the number of Comedy and Drama movies is not consistent from year to year.
- The number of movies in the Romance genre increased significantly after 2008.
- The number of Animation movies is stable from year to year.
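
The same counts can also be read as a table, with one row per year and one column per genre. A minimal sketch with `pd.crosstab` on invented rows (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration.
toy = pd.DataFrame({
    'Year':  [2007, 2007, 2008, 2008, 2008],
    'Genre': ['Comedy', 'Drama', 'Comedy', 'Comedy', 'Romance'],
})

# Rows are years, columns are genres, each cell is the number of movies.
# Combinations that never occur are filled with 0.
counts = pd.crosstab(toy['Year'], toy['Genre'])
print(counts)
```

A table like this makes it easy to check a claim from the line plot by reading along one genre column.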

### Profitability by Year

In [None]:
fig = px.scatter(df, x='Year', y='Profitability', title='Movies By Profitability in Each Year')
fig.show()

### Worldwide Gross by Year

In [None]:
fig = px.scatter(df, x='Year', y='Worldwide Gross', title='Movies By Worldwide Gross in Each Year')
fig.show()

- Let's look at the top 15 WorldWide Gross Movies

#### Top 15 WorldWide Gross Movies

In [None]:
top_15 = df.sort_values('Worldwide Gross', ascending=False)[:15]
top_15

In [None]:
fig = px.bar(top_15, x='Film', y='Worldwide Gross', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see top 15 Rotten Tomatoes % Rating Movies

#### Top 15 Rotten Tomatoes % Rating Movies

In [None]:
rotten_tomatoes_top_15 = df.sort_values('Rotten Tomatoes %', ascending=False)[:15]
rotten_tomatoes_top_15

In [None]:
fig = px.bar(rotten_tomatoes_top_15, x='Film', y='Rotten Tomatoes %', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Let's see top 15 Audience score % Rating Movies

#### Top 15 Audience score % Rating Movies

In [None]:
audience_score_top_15 = df.sort_values('Audience  score %', ascending=False)[:15]
audience_score_top_15

In [None]:
fig = px.bar(audience_score_top_15, x='Film', y='Audience  score %', hover_data=['Year', 'Genre'], color='Genre')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Both ratings give Wall E the highest score.
- On the other hand, 'Twilight', 'PS I Love You' and 'The Twilight Saga' have lower scores on Rotten Tomatoes than in the Audience Score.
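
Differences like these can be found without comparing two charts by eye: subtract one rating from the other and sort by the gap. The sketch below uses invented films and scores; the column `'Score gap'` is a name introduced for this example.

```python
import pandas as pd

# Hypothetical toy data: films and scores are invented for illustration.
toy = pd.DataFrame({
    'Film': ['Film A', 'Film B', 'Film C'],
    'Audience  score %': [80, 60, 90],
    'Rotten Tomatoes %': [40, 65, 88],
})

# Positive gap: the audience liked the film more than Rotten Tomatoes did.
toy['Score gap'] = toy['Audience  score %'] - toy['Rotten Tomatoes %']

# Largest disagreement in favour of the audience comes first.
ranked = toy.sort_values('Score gap', ascending=False)
print(ranked[['Film', 'Score gap']])
```

Applied to `df`, the top of such a ranking would show the films where the two ratings disagree the most.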

- And finally, let's see the 15 most profitable movies.

#### Top 15 Most Profitable Movies

In [None]:
top_15_profit = df.sort_values('Profitability', ascending=False)[:15]
top_15_profit

In [None]:
fig = px.bar(top_15_profit, x='Film', y='Profitability', hover_data=['Year', 'Genre'], color='Lead Studio')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

- Interestingly, or perhaps not, 'Fireproof', the most profitable movie in our dataset, has a low score in both ratings.

- Almost all of the lead studios have at least one movie in the top 15 most profitable movies.
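
A claim about how many studios appear in a top list can be checked by counting studios inside the selection. Here is a minimal sketch on invented data that takes a top 3 instead of a top 15; `nlargest` is used as a compact alternative to sorting and slicing.

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration.
toy = pd.DataFrame({
    'Film': ['A', 'B', 'C', 'D', 'E'],
    'Lead Studio': ['Independent', 'Fox', 'Independent', 'Disney', 'Fox'],
    'Profitability': [9.0, 1.0, 5.0, 7.0, 2.0],
})

# Pick the 3 most profitable films.
top = toy.nlargest(3, 'Profitability')

# How many films each studio has in the selection, and how many studios appear.
studio_counts = top['Lead Studio'].value_counts()
n_studios = top['Lead Studio'].nunique()

print(studio_counts)
print(n_studios)
```

Comparing `top_15_profit['Lead Studio'].nunique()` with `df['Lead Studio'].nunique()` would give the two numbers behind the statement above.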

## This notebook is a part of the 9 Beginner Friendly EDAs
## If you like this one, you can also check out other notebooks in the Beginner Friendly EDAs series!

* [Data Analyst Jobs - EDA](https://www.kaggle.com/kaanboke/plotly-data-analyst-jobs)
* [Top Games on Google Play Store](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-top-games)
* [UDEMY Courses EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-udemy)
* [World Happiness Report - EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-eda)
* [Countries Life Expectancy](https://www.kaggle.com/kaanboke/plotly-beginner-friendly)
* [Netflix Movies- EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-netflix)
* [Amazon Top 50 Bestselling Books EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-amazon)
* [London Bike Sharing EDA](https://www.kaggle.com/kaanboke/plotly-beginner-friendly-london-bike)


- Thanks to the dataset contributor for this data. I really enjoyed working on it.

- It was quite a pleasure to share this detailed, beginner-friendly EDA with you. Thanks for your time.

- All the best