<center><img src = "https://sm.pcmag.com/pcmag_au/review/n/netflix/netflix_38rt.jpg"></center>

<h1><center>  <div style="background-color:lightpink;border-radius:10px; padding: 10px;">Introduction</div></center></h1>

🎞 Motivation

> I love watching movies, and who doesn't! I was wondering: does the genre affect the IMDB rating of a movie? For those who don't know, IMDB stands for Internet Movie Database, a website that provides information about millions of films and television programs as well as their cast and crew. I came across this dataset and decided to use my data analysis skills to answer the questions on my mind.

🎯 Goal

> To find insights about movies: their IMDB rating, genre, runtime, etc.

🛠 Tools

> I have used plotly and seaborn for simple plotting, but for quick and efficient EDA I have used Power BI. It is a Microsoft business analytics service. It provides interactive visualizations and business intelligence capabilities with an interface that Microsoft says is simple enough for end users to create reports and dashboards. It is part of the Microsoft Power Platform.

<center><img src ="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcStw-j0YVvmKPfks6f9tjBdku4MsLI6QNO78whXdmeGjISdIk6mSjbC-zKjL7gOz9vQjOk&usqp=CAU"></center>


<h1><center>  <div style="background-color:lightpink;border-radius:10px; padding: 10px;">Reading and Cleaning the Data 🧹</div></center></h1>

In [None]:
# For data handling
import numpy as np
import pandas as pd

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

In [None]:
df=pd.read_csv('../input/netflix-original-films-imdb-scores/NetflixOriginals.csv')
df.head()

In [None]:
# checking for correct data types
df.info()

> `Premiere` is a date, so I will change its format to datetime

In [None]:
# Changing the format of 'Premiere' to datetime
df['Premiere']=pd.to_datetime(df['Premiere'], dayfirst=True)

In [None]:
# Checking for missing values
df.isnull().sum()

> As our data has no missing values, we will proceed to check for duplicate entries.

In [None]:
# Checking for duplicate data
df.duplicated().sum()

> No duplicate entries found. Now, let's check for anomalous values. For example, the `IMDB Score` can't be more than 10 or less than 0.

In [None]:
# Checking for anomalous values
((df['IMDB Score']>10)|(df['IMDB Score']<0)).sum()

> There are no such values. The data is clean, so now let's do some feature engineering by adding some useful columns for EDA. I will be adding `Year`, `Month` and `Day` columns from `Premiere`.
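
The cleaning checks above (missing values, duplicates, out-of-range scores) can be bundled into one small helper so they are easy to rerun whenever the data changes. This is only a sketch: `quality_report` is a hypothetical helper name, and the tiny DataFrame is made-up data that mimics the `Title` and `IMDB Score` columns, not rows from the real dataset.

```python
import pandas as pd

def quality_report(frame, score_col="IMDB Score", low=0, high=10):
    """Summarise basic data-quality checks (hypothetical helper)."""
    scores = frame[score_col]
    return {
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "out_of_range_scores": int(((scores < low) | (scores > high)).sum()),
    }

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film B", "Film C"],
    "IMDB Score": [6.5, 7.2, 7.2, 11.0],
})

report = quality_report(toy)
print(report)
```

On the real data, `quality_report(df)` should return zeros everywhere if the checks above hold.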


<h4><center>  <div style="background-color:lightpink;border-radius:10px; padding: 10px;">Feature Engineering 📐📏</div></center></h4>

In [None]:
# Adding day, month and year columns with the vectorized .dt accessor
df['Day']=df['Premiere'].dt.day
df['Month']=df['Premiere'].dt.month_name()
df['Year']=df['Premiere'].dt.year

#### The data is clean and useful features are also added, so let's begin the EDA and explore the unexplored!

<h1><center>  <div style="background-color:lightpink;border-radius:10px; padding: 10px;">Exploratory Data Analysis 📊</div></center></h1>

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Distribution of the numeric features</div></center></h4>

In [None]:
## numeric features
num_col=['Runtime', 'IMDB Score']

# Plotting
fig=make_subplots(cols=1, rows=2)

for i,col in enumerate(num_col):
    fig.add_trace(go.Box(x=df[col], name=col, hovertext = df['Title'] + " " + df['Genre']),
                  row=i+1, col=1)

fig.update_layout(title=dict(text='Distribution of Runtime and IMDB score', xanchor='center', yanchor='top', x=0.5))
fig.show()

#### 🔍Observations:
> The longest runtime is ⏳ 208 minutes (about 3.5 hrs!) for the crime drama "The Irishman" and the shortest runtime is 4 minutes for "Sol Levante", an anime.

> The highest IMDB score is 9 for the documentary "A Life on our Planet" and the lowest IMDB score is 2.5 for "Enter the Anime", which is also a documentary!

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Runtime ⏱ and the IMDB score 💯</div></center></h4>

In [None]:
fig, ax = plt.subplots(1,1, figsize = (10, 6), constrained_layout = True)
ax = sns.regplot(x = 'IMDB Score', y = 'Runtime', data = df)

ax.set_ylabel('Runtime', fontsize = 14)
ax.set_xlabel('IMDB Score', fontsize = 14)
plt.title('IMDB Score Vs Runtime', fontsize = 16)

correlation = np.corrcoef(df['IMDB Score'], df['Runtime'])[0][1]

ax.text(x = 8, y = 180,
        s = f"correlation : {round(correlation,3)}", 
        ha = 'center', size = 12, rotation = 0, color = 'black',
        bbox=dict(boxstyle="round,pad=0.5", fc='skyblue', ec="skyblue", lw=2));

#### 🔍 Observations
> The correlation between `Runtime` and `IMDB Score` is -0.041, which is close to 0, so there is essentially no linear relationship between them. In this dataset, the runtime of a movie tells us almost nothing about its IMDB score!

📝 Note
> The correlation score:
> - +1 : Strongly and positively correlated (as one increases, the other also increases)
> - 0  : No linear correlation
> - -1 : Strongly and negatively correlated (as one increases, the other decreases)
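
The reference values in the note can be reproduced with `np.corrcoef` on synthetic data. The sketch below uses made-up arrays (not the movie data) and also shows a U-shaped relationship, which gives a correlation near 0 even though the two variables are clearly related. That is why a value close to 0 only rules out a *linear* relationship.

```python
import numpy as np

x = np.arange(100, dtype=float)

# Perfectly linear, increasing: correlation is +1
perfect_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# Perfectly linear, decreasing: correlation is -1
perfect_neg = np.corrcoef(x, -3 * x + 5)[0, 1]

# U-shaped dependence around the mean of x: correlation is ~0
u_shape = np.corrcoef(x, (x - x.mean()) ** 2)[0, 1]

print(round(perfect_pos, 3), round(perfect_neg, 3), abs(round(u_shape, 3)))
```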

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Genre 🎬 and the IMDB score 💯</div></center></h4>

In [None]:
# Let's first check how many Genre are there
len(df['Genre'].unique())

> As there are too many `Genre` values, plotting all of them in a single plot would look messy and be hard to interpret. So, I will plot the top 10 `Genre` values by average `IMDB Score`.

In [None]:
# Average IMDB Score per genre, keeping the 10 highest averages
df_temp=df.groupby('Genre', as_index=False)['IMDB Score'].mean().sort_values(by='IMDB Score', ascending=False).reset_index(drop=True).head(10)

fig, ax = plt.subplots(1,1, figsize = (10, 6), constrained_layout = True)
ax = sns.barplot(x = 'Genre', y = 'IMDB Score', data = df_temp, color = 'violet')

for i in ax.patches:    
    ax.text(x = i.get_x() + i.get_width()/2, y = i.get_height()/2,
            s = f"{round(i.get_height(),1)}", 
            ha = 'center', size = 14, weight = 'bold', rotation = 0, color = 'white',
            bbox=dict(boxstyle="round,pad=0.5", fc='pink', ec="pink", lw=2))


ax.set_xlabel('Genre', fontsize=14)
ax.set_ylabel('Average IMDB Score', fontsize=14)
ax.set_xticklabels([i[:15] for i in df_temp['Genre'].unique()], fontsize=12, rotation = -45 )
plt.title('Top 10 Genres by Average IMDB Score', fontsize=16);


#### 🔍Observations:
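> The genres with the highest average IMDB scores are mostly animation, comedy, adventure and musical combinations. Keep in mind that a genre with only a few movies can reach a high average easily.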
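
One way to make a "top genres by average score" ranking more robust is to count the movies per genre as well and keep only genres with enough films. The sketch below uses made-up scores, and `min_films = 3` is an assumed cut-off chosen just for illustration.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Genre": ["Documentary"] * 4 + ["Drama"] * 3 + ["Animation/Musical"],
    "IMDB Score": [8.0, 7.0, 6.0, 3.0, 6.5, 7.0, 6.0, 8.5],
})

min_films = 3  # assumed cut-off, adjust to taste

# Number of films and average score for every genre
genre_stats = (
    toy.groupby("Genre")
       .agg(films=("IMDB Score", "count"), avg_score=("IMDB Score", "mean"))
       .reset_index()
)

# Rank only the genres that have at least `min_films` movies
reliable = (
    genre_stats[genre_stats["films"] >= min_films]
    .sort_values("avg_score", ascending=False)
    .reset_index(drop=True)
)
print(genre_stats)
print(reliable)
```

Here the single "Animation/Musical" film has the best average, but it drops out of the ranking because one movie is not enough to judge a genre.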

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Top 10 rated movies 🎬</div></center></h4>

In [None]:
# Keep the 10 highest-rated movies to match the plot title
df_temp=df.sort_values(by='IMDB Score', ascending=False).reset_index().iloc[:10,:]

fig, ax = plt.subplots(1,1, figsize = (15, 6), constrained_layout = True)
ax = sns.barplot(x = 'Title', y = 'IMDB Score', data = df_temp, hue = 'Genre')

for i in ax.patches:    
    ax.text(x = i.get_x() + i.get_width()/2, y = i.get_height()+0.1,
            s = f"{i.get_height()}", 
            ha = 'center', size = 14, weight = 'bold', rotation = 0, color = 'white',
            bbox=dict(boxstyle="circle,pad=0.5", fc='lightblue', ec="lightblue", lw=2))


ax.set_xlabel('Title', fontsize=14)
ax.set_ylabel('IMDB Score', fontsize=14)
ax.set_xticklabels([i[:15] for i in df_temp['Title'].unique()], fontsize=12, rotation = -30)
plt.title('Top 10 movies by IMDB Score', fontsize=16)
plt.legend(title='Genre', bbox_to_anchor=(1.05, 1), loc='upper left');

#### 🔍Observations
> Most of the top 10 rated movies are documentaries. That's interesting, as documentary was not among the top 10 genres. It suggests that some documentaries received very low IMDB scores, which pulls the genre's average rating down.
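
The reasoning above (a genre can own the best movies and still have a modest average) can be checked by looking at the spread of scores per genre rather than only the mean. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Genre": ["Documentary"] * 4 + ["Drama"] * 3,
    "IMDB Score": [9.0, 8.6, 2.5, 4.0, 7.0, 7.1, 6.9],
})

# Lowest, average and highest score within each genre
spread = toy.groupby("Genre")["IMDB Score"].agg(["min", "mean", "max"])
print(spread)
```

In this toy table the documentaries hold the highest score, yet their mean is lower than the dramas' mean because of the two low-rated titles. Running the same `agg` on `df` would show whether the real documentaries behave this way.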

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Language 🔠 and IMDB Score</div></center></h4>

In [None]:
# Let's first check how many languages are there
len(df['Language'].unique())

> As there are too many languages (the same problem as with `Genre`), I will plot the top 10 languages by average IMDB score.

In [None]:
# Average IMDB Score per language, keeping the 10 highest averages
df_temp=df.groupby('Language', as_index=False)['IMDB Score'].mean().sort_values(by='IMDB Score', ascending=False).reset_index(drop=True).head(10)

# fig=px.pie(df_temp, names='Language', values='IMDB Score')
# fig.update_layout(title=dict(text='Top 5 rated Language movies', xanchor='center', yanchor='top', x=0.45))
# fig.show()


fig, ax = plt.subplots(1,1, figsize = (10, 6), constrained_layout = True)
ax = sns.barplot(x = 'Language', y = 'IMDB Score', data = df_temp, color = 'violet')

ax.set_xlabel('Language', fontsize=14)
ax.set_ylabel('Average IMDB Score', fontsize=14)
ax.set_xticklabels(df_temp['Language'].unique(), fontsize=12, rotation = -30 )
plt.title('Language and IMDB Score', fontsize=16);

<center><img src = "https://i.imgur.com/vqFuTVn.jpg"></center>

#### 🔍Observations
> Russian movies have the highest average IMDB score, **отличная работа** (great work)!

> The longest runtimes mostly belong to Indian-language movies: Marathi, Hindi, and Tamil.

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Premiere 📅 and IMDB rating</div></center></h4>

In [None]:
monthlist=['January', 'February', 'March', 'April', 'May', 'June', 'July', 
           'August', 'September', 'October', 'November', 'December']

yearlist=list(np.sort(df['Year'].unique()))

# If month list is not given in 'category_orders', then the month names will not be in order
fig=px.box(df, y='Month', x='IMDB Score', category_orders={'Month':monthlist}, hover_name='Title')
fig.update_layout(title=dict(text='Premiere month and IMDB Score', xanchor='center', yanchor='top', x=0.5))
fig.show()

<center><img src = "https://i.imgur.com/IkIlXyu.jpg"> </center>

#### 🔍Observations
> There doesn't seem to be any relation between `Premiere` and `IMDB Score`, as the distributions are almost the same across months.

> In June, the average `IMDB Score` is a little higher than in other months, so let's explore it further and see if we can get anything useful.
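
To put numbers behind the month comparison, the scores can be summarised per month. Turning `Month` into an ordered categorical keeps the result in calendar order instead of alphabetical order. The sketch below uses made-up scores, and `month_order` is simply the list of month names.

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

# Made-up data for illustration only
toy = pd.DataFrame({
    "Month": ["June", "January", "June", "March", "January", "March"],
    "IMDB Score": [7.0, 6.0, 6.6, 5.8, 6.4, 6.2],
})

# Ordered categorical: sorting and grouping now follow the calendar
toy["Month"] = pd.Categorical(toy["Month"], categories=month_order, ordered=True)

# Median, mean and number of movies per month (only months that occur)
monthly = (
    toy.groupby("Month", observed=True)["IMDB Score"]
       .agg(["median", "mean", "count"])
)
print(monthly)
```

The `count` column matters here too: a month with few premieres can look better or worse than the others purely by chance.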

<center><img src = "https://i.imgur.com/pLxP29I.jpg"></center>

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Good Movies 👏 and Bad Movies👎</div></center></h4>

>⚠ Good movies are those whose IMDB score is 7 or higher. I chose this threshold based on my own search, but you can pick any threshold you like!

In [None]:
# Flag movies whose IMDB Score is greater than or equal to the threshold
threshold=7
df['Best']=(df['IMDB Score']>=threshold).astype(int)

In [None]:
fig=px.histogram(df, x='Year', color='Best', barmode='group')
fig.update_layout(title=dict(text='Premiere Year and Number of Good vs. Other Movies', xanchor='center', yanchor='top', x=0.5), 
                 xaxis=dict(title='Year'), yaxis=dict(title='Number of Premieres'))
fig.show()

<center><img src = "https://i.imgur.com/eqq7DpI.jpg"></center>

#### 🔍Observation
> The proportion of good movies is decreasing over the years (except in 2019). Let's hope we get some good directors, writers and 🎭 actors soon!
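
The grouped histogram shows counts, so the proportion has to be read off by eye. Because `Best` is a 0/1 flag, its mean per year is exactly the share of good movies. A small sketch with made-up scores (the `threshold` of 7 follows the definition above):

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2018, 2019, 2019, 2020, 2020, 2020, 2020],
    "IMDB Score": [7.5, 7.1, 6.0, 5.5, 7.2, 6.8, 7.0, 6.1, 5.9, 5.0],
})

threshold = 7
toy["Best"] = (toy["IMDB Score"] >= threshold).astype(int)

# films = movies per year, good = movies at or above the threshold,
# good_share = fraction of good movies (mean of the 0/1 flag)
share = (
    toy.groupby("Year")["Best"]
       .agg(films="count", good="sum", good_share="mean")
       .reset_index()
)
print(share)
```

Applying the same `groupby` to `df` gives the yearly share directly, which makes the trend claim easy to verify.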

<h4><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Number of Premieres and Date 📅</div></center></h4>

In [None]:
df_temp=df.groupby(['Month'])['Title'].count().reset_index()

# If month list is not given in 'category_orders', then the month names will not be in order
fig=px.bar(df_temp, y='Month', x='Title', category_orders={'Month':monthlist})
fig.update_layout(title=dict(text='Month and Number of Premieres', xanchor='center', yanchor='top', x=0.5), 
                 xaxis=dict(title='Number of Premieres'))
fig.show()

<center><img src = "https://i.imgur.com/fth8MZ4.jpg"></center>

#### 🔍Observation
> The number of movies released each year is rising.

> Most movies are released in April and October. Holidays could be the reason, but I am not sure. What do you think? Please write in the ✍ comments.

In 2020 many people stayed at home due to lockdown, and Netflix, being an online platform, released many movies to take advantage of this opportunity. Let's explore when and how many movies premiered in 2020.

<center><img src = "https://i.imgur.com/qzVS1or.jpg"></center>

<center><img src = "https://miro.medium.com/max/1920/1*jfR0trcAPT3udktrFkOebA.jpeg"></center>

<h3><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Science Fiction 👽</div></center></h3>

> I love watching science fiction movies as they are very exciting and involve lots of creative ideas. Let's explore more about them!

In [None]:
# Replace 'Science fiction' with your favorite genre
favorite_genre='Science fiction'

# Filtering the favorite genre
df_genre=df[df['Genre'].str.contains(favorite_genre)].reset_index(drop=True)
df_genre
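
One caveat with substring matching: it is case-sensitive by default, and a short pattern can also match inside a longer genre name. The sketch below compares a case-insensitive literal match with an exact match on the parts of a combined genre. It uses made-up genre strings, `has_exact_genre` is a hypothetical helper, and "/" as the separator between combined genres is an assumption about how the column is written.

```python
import pandas as pd

# Made-up genre strings for illustration only
genres = pd.Series([
    "Science fiction", "Science fiction/Drama", "science fiction/Thriller",
    "Drama", "Romantic comedy",
])
favorite_genre = "Science fiction"

# Literal substring match that ignores upper/lower case
mask = genres.str.contains(favorite_genre, case=False, regex=False)

def has_exact_genre(cell, wanted, sep="/"):
    """True if `wanted` equals one of the parts of a combined genre."""
    parts = [part.strip().lower() for part in cell.split(sep)]
    return wanted.lower() in parts

# Exact match on the individual parts
exact_mask = genres.apply(has_exact_genre, wanted=favorite_genre)

# A short pattern shows the difference between the two approaches
comedy_substring = genres.str.contains("comedy", case=False, regex=False)
comedy_exact = genres.apply(has_exact_genre, wanted="comedy")

print(genres[mask].tolist())
print(genres[comedy_substring].tolist(), genres[comedy_exact].tolist())
```

Which behaviour is preferable depends on the question: substring matching is convenient for broad families of genres, while exact matching avoids accidental hits.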

In [None]:
# For counting the movies over the years
df_genre['Count']=1

fig=px.sunburst(df_genre, path=['Year','Month','Title'], values='Count')
fig.update_layout(title=dict(text=f'Number of {favorite_genre} movies over the years',
                             xanchor='center', yanchor='top', x=0.5), yaxis=dict(title='Movies count'));
# fig.show()

In [None]:
# Distribution of Science Fiction movies over the Languages
# Sum the per-movie counter so each (Year, Language) row holds the number of movies
df_hist=df_genre.groupby(['Year','Language'], as_index=False)['Count'].sum()
fig=px.histogram(df_hist, x='Language', y='Count', color='Year')
fig.update_layout(title=dict(text=f'Distribution of {favorite_genre} movies over languages', xanchor='center', yanchor='top', x=0.5),
                 xaxis=dict(title='Language'), yaxis=dict(title='Movies count'));
# fig.show()

<center><img src = "https://i.imgur.com/yvk56EH.png"></center>

<h3><center>  <div style="background-color:skyblue;border-radius:10px; padding: 10px;">Language and Genre [Good 👏 or Bad 👎 movie]</div></center></h3>

In [None]:
# If I consider all the Genre, the plot won't look good, so I am considering the top 200 movies for the plot
df_temp=df.sort_values(by='IMDB Score', ascending=False).reset_index(drop=True).iloc[:200,:]
fig=px.parallel_categories(df_temp,  dimensions=['Language', 'Genre', 'Best'],
                          color='IMDB Score',color_continuous_scale=[(0,'blue'),(0.5,'yellow'),(1,'red')])
fig.update_layout(title=dict(text='Parallel Categories Plot', xanchor='center', yanchor='top', x=0.5))
fig.show()

<h3><center>  <div style="background-color:lightpink;border-radius:10px; padding: 10px;">If you like it, don't forget to upvote! </div></center></h3>