In [None]:
import pandas as pd
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

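# np.warnings was removed in recent NumPy releases; use the standard library module instead
import warnings
warnings.filterwarnings('ignore')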

# Table of Contents
1.  [Introduction](#1.-Introduction)

    1.1.  [Netflix, IMDB and the Dataset](#1.1.-Netflix,-IMDB-and-the-Dataset)
    
    1.2.  [Objectives](#1.2.-Objectives)
    
    1.3.  [Dataset Features](#1.3.-Dataset-Features)
2.  [Data Loading, Data Cleaning and Descriptive Analysis](#2.-Data-Loading,-Data-Cleaning-and-Descriptive-Analysis)

    2.1. [Date Features](#Date-Features)
    
    2.2. [Nulls and duplicates](#Nulls-and-Duplicates)
    
    2.3. [Descriptive Analysis](#Descriptive-Analysis)
3. [Data Analysis](#3.-Data-Analysis)

    3.1. [What are the top 5 movies with the best rating?](#3.1.-What-are-the-top-5-movies-with-the-best-rating?)
    
    3.2. [Does movie runtime affect rating?](#3.2.-Does-movie-runtime-affect-rating?)
    
    3.3. [Does Language affect Rating?](#3.3.-Does-Language-affect-Rating?)
    
    3.4. [Does Genre affect Rating?](#3.4.-Does-Genre-affect-Rating?)
    
    3.5. [Has the average rating of the movies in the dataset been increasing over the years?](#3.5.-Has-the-average-rating-of-the-movies-in-the-dataset-been-increasing-over-the-years?)
4. [Conclusions](#4.-Conclusions)
5. [Recommendations](#5.-Recommendations)

# 1. Introduction

## 1.1. Netflix, IMDB and the Dataset

**Netflix**

Netflix, Inc. is an American over-the-top content platform and production company headquartered in Los Gatos, California. Netflix was founded in 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California. The company's primary business is a subscription-based streaming service offering online streaming from a library of films and television series, including those produced in-house. In April 2021, Netflix had 208 million subscribers, including 74 million in the United States and Canada. (https://en.wikipedia.org/wiki/Netflix)

**IMDB**

IMDb is an online database of information related to films, television programs, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews. (https://en.wikipedia.org/wiki/IMDb)

**About the dataset**

This dataset consists of all Netflix original films released as of June 1st, 2021. Additionally, it also includes all Netflix documentaries and specials. The data is then integrated with a dataset consisting of all of their corresponding IMDB scores. IMDB scores are voted on by community members, and the majority of the films have 1,000+ reviews.


## 1.2. Objectives
- Find out how the factors in the dataset affect the rating of a movie
- See whether these effects vary from one category to another
- Develop a recommendation, based on the analysis, to optimize the rating of a movie

## 1.3. Dataset Features
Each row of the dataset represents a movie that has the following features.

- Title of the film
- Genre of the film
- Original premiere date
- Runtime in minutes
- IMDB scores (as of 06/01/21)
- Languages currently available (as of 06/01/21)


# 2. Data Loading, Data Cleaning and Descriptive Analysis

In [None]:
netflix = pd.read_csv("../input/netflix-original-films-imdb-scores/NetflixOriginals.csv")
netflix

The dataset consists of 584 rows and 6 columns. From this first glimpse, we can start by converting the Premiere column into a more convenient date field and deriving three new features from it: year, month and day. Then we will check for duplicates and nulls and delete them if there are any. Finally, we will run a descriptive analysis to get to know the shape of the dataset better and to see whether it contains strange data that should be discarded before the analysis.

## Date Features

In [None]:
netflix["Premiere"] = pd.to_datetime(netflix['Premiere'])
netflix['Year'] = netflix['Premiere'].dt.year
netflix['Month'] = netflix['Premiere'].dt.month
netflix['Day'] = netflix['Premiere'].dt.day

netflix

## Nulls and Duplicates

In [None]:
# duplicates
netflix.drop_duplicates(inplace=True)

In [None]:
# Nulls
sns.heatmap(netflix.isnull())

There are no duplicates or nulls. Let's continue with the descriptive analysis.
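
A heatmap is a quick visual check, but with several hundred rows a single missing value is easy to overlook. As a complement, the same check can be expressed as plain counts. The sketch below is only an illustration: `toy` is a hypothetical stand-in for the `netflix` DataFrame, and its rows are made up so that one duplicate and one null actually show up.

```python
import pandas as pd

# Hypothetical toy rows standing in for the `netflix` DataFrame (not real data)
toy = pd.DataFrame({
    "Title": ["A", "B", "B", "C"],
    "Genre": ["Drama", "Comedy", "Comedy", None],
    "IMDB Score": [6.1, 5.4, 5.4, 7.0],
})

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column
nulls_per_column = toy.isnull().sum()

print("Duplicate rows:", n_duplicates)
print(nulls_per_column)
```

Running the same two calls on `netflix` before `drop_duplicates` would show how many rows, if any, are affected.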

## Descriptive Analysis

In this section we will look at how the data is distributed in each of its features. The purpose is to get a clearer picture of the dataset, come up with some first insights and delete suspicious data.

### Categorical variables
Genre, Language

In [None]:
categories = netflix.loc[:,['Genre', 'Language']]

# info() prints its report itself and returns None, so call it directly
categories.info()
print()
display(categories.describe())

Both categories have lots of unique values: Genre has 115 and Language has 38. To simplify the analysis, let's focus on the ten most frequent values of each.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))

order = list(categories["Genre"].value_counts()[:10].keys())
sns.countplot(x='Genre', data=categories, order=order, ax=ax[0], palette="vlag")
ax[0].tick_params(labelrotation=90)
ax[0].set_title('Genre')

order = list(categories["Language"].value_counts()[:10].keys())
sns.countplot(x='Language', data=categories, order=order, ax=ax[1], palette="vlag")
ax[1].tick_params(labelrotation=90)
ax[1].set_title('Language')


plt.show()

print('Value count top 10 Genre',"\n", "\n", categories["Genre"].value_counts()[:10])
print("\n", "\n",'Value count top 10 Language',"\n", "\n", categories["Language"].value_counts()[:10]) 

**Genre.**
The most frequent genre in the dataset is "Documentary", with a value count of 152. "Drama" comes second with a total of 77 movies, followed by "Comedy" with 49.

**Language.**
As expected, most of the movies have "English" as their language, since the company is American and operates all over the world.

### Numerical Variables
Runtime, IMDB Score

In [None]:
numbers = netflix.loc[:,['Runtime', 'IMDB Score']]

numbers.hist(figsize=(20,6), color='firebrick', edgecolor='white', alpha=0.7, grid=False)

plt.show()

# info() prints its report itself and returns None, so call it directly
numbers.info()
print()
display(numbers.describe())

**Runtime.** The runtime of the movies tends to fall between 86 and 108 minutes (the interquartile range). The distribution has a roughly normal shape, without inconsistent outliers.

**IMDB Score.** The IMDB score tends to fall between 5.7 and 7. The distribution doesn't contain any inconsistent data and has a more normal shape than 'Runtime'.
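
The interquartile range quoted above comes straight from `describe()`. To make the "no inconsistent outliers" check a bit more explicit, one common convention (not necessarily the one used here) is to flag values more than 1.5 × IQR away from the quartiles. A minimal sketch, assuming a hypothetical `runtime` Series with made-up values:

```python
import pandas as pd

# Hypothetical runtimes in minutes (made-up values, not the real column)
runtime = pd.Series([4, 80, 86, 90, 97, 100, 108, 120, 209], name="Runtime")

# First and third quartiles, and the interquartile range between them
q1, q3 = runtime.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences; values outside them are flagged for review
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = runtime[(runtime < lower) | (runtime > upper)]

print(f"IQR: {q1} to {q3}, fences: {lower} to {upper}")
print(outliers)
```

A flagged value is not automatically wrong: a very short runtime could simply be a short special, so flagged rows deserve a look before being dropped.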

### Date Variables
Premiere, Year, Month and Day

In [None]:
date = netflix.loc[:,['Premiere', 'Year', 'Month', 'Day']]

In [None]:
# Premiere Date

date['Premiere'].value_counts().sort_index().plot(figsize=(23,5))
plt.title("Premiere Date Value Counts",
            fontdict={'fontsize':15})

plt.show()

date['Premiere'].value_counts().sort_values(ascending=False)

As for the exact date, there doesn't seem to be a preferred release date: the most frequent value, 2020-10-02, appears just 6 times in a sample of 584. However, we can say that Netflix is releasing movies more and more frequently, since the peaks in the plot get closer together as time advances.

Let's take a broader, discrete approach and see how the data is distributed across years, months and days.

In [None]:
fig, ax = plt.subplots(1,3, figsize=(20,6))

# Year
sns.countplot(x='Year', data=date, palette='vlag', ax=ax[0])
ax[0].set_title('Year Value Counts',
            fontdict={'fontsize':15})

# Month
sns.countplot(x='Month', data=date, palette='vlag', ax=ax[1])
ax[1].set_title('Month Value Counts',
            fontdict={'fontsize':15})

# Day
sns.countplot(x='Day', data=date, palette='vlag', ax=ax[2])
ax[2].set_title('Day Value Counts',
            fontdict={'fontsize':15})

plt.show()

**Year.** We confirm that Netflix is releasing more and more movies as time goes on. 2021 shows a sudden decrease, but this is likely because the dataset ends on June 1st, 2021, so the year is incomplete.

**Month.** This chart shows that the favorite months to release a movie are April and October (spring and fall, with October coinciding with Halloween).

**Day.** The day chart doesn't seem to follow any trend.

# 3. Data Analysis
Now that we know how the dataset is distributed, we can proceed to analyze it. Following the objectives, we are going to focus our analysis on the rating of the movies.

## 3.1. What are the top 5 movies with the best rating?

In [None]:
top = netflix.sort_values(by='IMDB Score', ascending=False)
print('The top 5 movies are: ')
display(top.iloc[:5,:])

The movie with the highest rating is "David Attenborough: A Life on Our Planet", which is a documentary and has a rating of 9.0. We can also see that three of the movies in the top 5 are documentaries, and three have English as their language.

## 3.2. Does movie runtime affect rating?

This question involves two continuous variables: Runtime and IMDB Score. A scatter plot, along with the correlation between these two variables, will answer the question.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,5))

sns.scatterplot(x='Runtime', y='IMDB Score', data=netflix, ax=ax[0], alpha=0.7)
sns.heatmap(netflix[['Runtime', 'IMDB Score']].corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True, ax=ax[1])

plt.show()

The correlation between these two variables is -0.041, which means that there is practically no linear relationship between them. The scatter plot doesn't show any trend either.

IMDB Score, then, doesn't appear to depend on Runtime.
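
The value shown in the heatmap is the Pearson coefficient, which only captures linear association. As a cross-check, a rank-based (Spearman) coefficient can be obtained by applying the same Pearson formula to the ranks of the data. The sketch below illustrates both on a hypothetical `toy` DataFrame with made-up numbers; it is not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the two numeric columns
toy = pd.DataFrame({
    "Runtime": [60, 85, 90, 95, 100, 120],
    "IMDB Score": [6.5, 5.9, 7.1, 6.0, 6.8, 6.2],
})

# Pearson correlation: strength of the linear relationship
pearson = toy["Runtime"].corr(toy["IMDB Score"])

# Spearman correlation: Pearson applied to the ranks, so it captures
# monotonic (not only linear) association
spearman = toy.rank().corr().loc["Runtime", "IMDB Score"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If both coefficients stay close to zero on the real columns, that supports the reading that runtime and rating are essentially unrelated.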

## 3.3. Does Language affect Rating?

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,5))

lang = netflix.groupby('Language')['IMDB Score'].mean().sort_values(ascending=False)

order = list(lang[:5].keys())
sns.barplot(x='Language', y='IMDB Score', data=netflix, order=order, palette='vlag', ax=ax[0])

order = list(lang[-5:].keys())
sns.barplot(x='Language', y='IMDB Score', data=netflix, order=order, palette='vlag', ax=ax[1])

plt.show()

print('\n', 'The top 5 languages with the best ratings are', lang[:5], '\n')
print('The 5 languages with the worst ratings are', lang[-5:])

- It is important to note that all the entries in the top 5 have English as one of their languages. To achieve a high rating, then, having English as a language option appears to be necessary.

- At the bottom we have Polish, Norwegian, Filipino, English/Japanese and Malay. The movies in these languages haven't gained much popularity, or perhaps the audiovisual industry is not well developed in those countries. If a movie must be made in one of these languages, a significant effort is recommended to increase its rating; an average movie in these languages is unlikely to succeed.
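
A group average says nothing about how many movies are behind it, so it can help to read the ranking next to the group sizes. The sketch below shows one way to do that with a named aggregation. It uses a hypothetical `toy` DataFrame with made-up rows, and `min_films` is an arbitrary threshold chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the `netflix` DataFrame
toy = pd.DataFrame({
    "Language": ["English", "English", "English", "Spanish", "Spanish", "Malay"],
    "IMDB Score": [6.0, 7.0, 6.5, 6.0, 6.4, 4.2],
})

# Average score and number of movies per language, best average first
summary = (
    toy.groupby("Language")["IMDB Score"]
       .agg(mean_score="mean", n_films="count")
       .sort_values("mean_score", ascending=False)
)

# Keep only the languages backed by at least `min_films` movies
min_films = 2  # arbitrary threshold, an assumption for this sketch
reliable = summary[summary["n_films"] >= min_films]

print(summary)
print(reliable)
```

On the real data, a language that appears at the very top or bottom with only one or two movies would be worth treating with more caution than one backed by dozens.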

## 3.4. Does Genre affect Rating?

In [None]:
fig, ax = plt.subplots(1,2, figsize=(19,5))

gen = netflix.groupby('Genre')['IMDB Score'].mean().sort_values(ascending=False)

order = list(gen[:5].keys())
sns.barplot(x='Genre', y='IMDB Score', data=netflix, order=order, palette='vlag', ax=ax[0])
ax[0].tick_params(labelrotation=90)

order = list(gen[-5:].keys())
sns.barplot(x='Genre', y='IMDB Score', data=netflix, order=order, palette='vlag', ax=ax[1])
ax[1].tick_params(labelrotation=90)

plt.show()

print('\n', 'The top 5 Genres with the best ratings are', gen[:5], '\n')
print('The 5 Genres with the worst ratings are', gen[-5:])

- The list is quite varied, but we can see that animation- and comedy-related movies are on top. This might explain the huge success of Disney and its move away from Netflix to Disney+.
- The genres with the lowest ratings are Superhero-Comedy, Horror anthology, Political thriller, Musical/Western/Fantasy and Heist film/Thriller. These are the genres to avoid.

## 3.5. Has the average rating of the movies in the dataset been increasing over the years?

In [None]:
date = netflix[['Premiere', 'IMDB Score']]
date = date.set_index('Premiere')
date = date.sort_index()
date.plot(figsize=(23,5))

plt.title("Exact date Rating",
            fontdict={'fontsize':15})

plt.show()

The graph is too noisy and doesn't show any trend. Let's group the ratings by year and month to get a clearer view.

In [None]:
netflix['Year-Month'] = netflix['Premiere'].dt.to_period('M')

netflix.groupby('Year-Month')['IMDB Score'].mean().plot(figsize=(23,5))

plt.title("Year-Month Rating",
            fontdict={'fontsize':15})

plt.show()

The graph is cleaner and shows that the average rating of the movies has been decreasing. It is important to note that the Netflix catalog has been growing considerably and that, even though it releases high-quality productions with high ratings, the streaming service focuses more on having a huge catalog with diverse genres that aren't that popular. Therefore, it makes sense that the rating has been decreasing.

Let's double check and group ratings just by year.

In [None]:
netflix.groupby('Year')['IMDB Score'].mean().plot(figsize=(23,5))

plt.title("Year Rating",
            fontdict={'fontsize':15})

plt.show()

This clear descending trend confirms the statement.
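
To put a number on the trend rather than reading it off the plot, one option is to fit a straight line to the yearly averages and look at its slope. This is a minimal sketch under stated assumptions: `toy` is a hypothetical DataFrame with made-up scores, and a simple least-squares line is only one of several ways to summarize a trend.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the `netflix` DataFrame
toy = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019, 2020, 2020],
    "IMDB Score": [6.8, 6.4, 6.5, 6.1, 6.2, 5.8],
})

# Average score per year, as in the plot above
yearly = toy.groupby("Year")["IMDB Score"].mean()

# Least-squares line through the yearly means; the years are centered
# so the fit stays numerically well conditioned
years = yearly.index.to_numpy(dtype=float)
slope, intercept = np.polyfit(years - years.mean(), yearly.to_numpy(), deg=1)

print(f"Average change in rating per year: {slope:.2f}")
```

A negative slope means the yearly average is falling. With only a handful of yearly points, the exact value should be read as a rough indication rather than a precise estimate.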

# 4. Conclusions 
- Netflix releases productions more frequently as time advances, increasing the number of films released each month. However, the rating has been decreasing as the number of films increases, suggesting that Netflix is aiming for a diverse catalog that reaches all kinds of audiences.
- Language does affect rating, since English is always present among the top-rated languages. Not having English as a language might place a movie at the bottom.
- The genres are pretty diverse, but there's a tendency for animated and comedy movies to be on top.

# 5. Recommendations
- It is pretty hard to increase the average rating if the strategy is to increase the selection and variety; it is a trade-off. But if Netflix wants to stay competitive, it should keep investing in a few high-quality, original movies/series to attract new customers and compete against the increasing number of streaming services. A focus on diversity might target niches, but if the platform doesn't release high-grossing productions, customer churn will be gigantic. Long story short: keep the diversification strategy, along with some high-grossing productions.
- English is an international language and it should be present in every movie. Low-rated movies without the English option are missing the chance to reach the international community. English as a language option should be one of the priorities when launching a production on the platform.
- As for the genres, the strategy is pretty simple: more of the popular ones and fewer of the unpopular ones.