### This project analyses the Netflix movies and TV shows dataset.
#### Using Python's data analysis libraries, such as pandas, NumPy, seaborn and Plotly, I explore the content and highlight interesting trends through visualisations. *Note: This project is a work in progress*

Loading the dataset and reviewing the information available

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
%matplotlib inline
import cufflinks as cf
cf.go_offline()

In [None]:
df = pd.read_csv("../input/netflix-shows/netflix_titles.csv")
df.head()

### We can see from the data frame that data for both movies and TV shows is available. How is the content split between the two?

In [None]:
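# Name both columns explicitly: value_counts().reset_index() labels them differently across pandas versions
type_counts = df['type'].value_counts().rename_axis('type').reset_index(name='count')
fig = px.pie(type_counts, values='count', names='type', color_discrete_sequence=px.colors.sequential.Reds)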
fig.update_traces(textposition='inside', textinfo='percent')
fig.show()

Netflix's library consists of 69% movies and 31% TV shows out of 7,787 titles.
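
The shares quoted above can also be computed directly instead of being read off the chart. Below is a minimal sketch, assuming a small hand-made frame called `sample` that stands in for `df`; its rows are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df: these rows are made up for illustration only.
sample = pd.DataFrame({'type': ['Movie', 'Movie', 'TV Show', 'Movie']})

# normalize=True returns each category's share of the total instead of a raw count
type_share = sample['type'].value_counts(normalize=True).mul(100).round(1)
print(type_share)
```

Running the same two lines on `df['type']` should give the percentages shown in the pie chart.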

### The top 10 countries with the most content

In [None]:
# Keep the 10 most frequent values so the chart matches the heading
top_countries = df['country'].value_counts().head(10)
top_countries.iplot(kind='bar', color='red')

The US is the country with the most content on Netflix. This makes sense, since Netflix was founded there in 1997 and only began expanding seriously into other countries in 2011.  
India comes second, no doubt helped by Bollywood, a huge industry in the country. The UK, another major producer, is third.
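
One caveat with counting the `country` column as-is: a title with several production countries is treated as its own category. The sketch below shows one way to credit each country separately. It assumes that co-productions are stored as a single comma-separated string, and it uses an invented frame called `sample` rather than the real data.

```python
import pandas as pd

# Toy stand-in for df: these rows are made up for illustration only.
sample = pd.DataFrame({
    'country': [
        'United States',
        'United States, India',
        'India',
        None,
        'United Kingdom, United States',
    ]
})

country_counts = (
    sample['country']
    .dropna()           # titles with no country listed are skipped
    .str.split(', ')    # assumption: multiple countries are separated by ", "
    .explode()          # one row per (title, country) pair
    .str.strip()
    .value_counts()
)
print(country_counts.head(10))
```

With this approach a co-production counts once for every country involved, so the totals add up to more than the number of titles.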

### Which years saw the biggest content release?

In [None]:
df['release_year'].iplot(kind='hist',bins=100, color='red')

Netflix's library consists mostly of content released in the 2010s. The number of titles per release year grows rapidly from the late 2000s and peaks in 2018, with 1,121 movies and TV shows. Around this time, Netflix was investing significantly in original content which, together with releases from other production companies, contributed to the growth in media available on the streaming service.
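
An exact count per release year is easier to get from `value_counts` than from a histogram bar. Here is a minimal sketch on an invented frame called `sample`; the years and counts are placeholders, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for df: these rows are made up for illustration only.
sample = pd.DataFrame({'release_year': [2016, 2017, 2018, 2018, 2018, 2019, 2019]})

# Number of titles per release year, in chronological order
per_year = sample['release_year'].value_counts().sort_index()

peak_year = per_year.idxmax()   # year with the most titles
peak_count = per_year.max()     # how many titles that year has
print(peak_year, peak_count)
```

Applied to `df['release_year']`, the same lines should return the peak year and its title count.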

### When is content most commonly added?

In [None]:
releases = df[['date_added']].dropna()
releases['year'] = releases['date_added'].apply(lambda date: date.split(', ')[-1])
releases['month'] = releases['date_added'].apply(lambda date: date.strip().split(' ')[0])
# Calendar order for the heatmap columns (all twelve months, including July)
column_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

df_releases = releases.groupby('year')['month'].value_counts().unstack().fillna(0)[column_order]

plt.figure(figsize=(10,8))
ax = sns.heatmap(df_releases, cmap = 'Reds')

Looking at the data, it seems that a good share of content is added towards the end of the year. In particular, 2019 saw a lot of additions between October and December, with November 2019 the busiest month across all years.  

Netflix seems to add the most content around holiday periods (Christmas, summer). Production companies looking to boost the visibility of their media could make use of "quieter" months, as viewers could be more likely to search for new content then.
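
The year and month can also be extracted by parsing `date_added` as a date rather than splitting the string by hand. The sketch below assumes the dates follow a "Month day, year" pattern, as the splitting code above implies. The frame `sample` and its rows are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df: these rows are made up for illustration only.
sample = pd.DataFrame({
    'date_added': ['November 1, 2019', ' December 15, 2019', 'July 4, 2020', None]
})

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Assumption: every date looks like "Month day, year"; anything else becomes NaT and is dropped
added = pd.to_datetime(
    sample['date_added'].str.strip(), format='%B %d, %Y', errors='coerce'
).dropna()

parts = pd.DataFrame({'year': added.dt.year, 'month': added.dt.month_name()})

# Rows are years, columns are months in calendar order; missing months are filled with 0
table = (
    parts.groupby(['year', 'month'])
    .size()
    .unstack(fill_value=0)
    .reindex(columns=months, fill_value=0)
)
print(table)
```

Using `reindex` instead of plain column selection means a month with no additions shows up as a column of zeros rather than raising a `KeyError`.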

### What are the most popular genres available?

In [None]:
movies = df[df['type'] == 'Movie']
tv_shows = df[df['type'] == 'TV Show']

from collections import Counter
genre_split = ", ".join(movies['listed_in']).split(", ")
top_ten = Counter(genre_split).most_common(10)
# Reverse the order so the largest bar ends up at the top of the horizontal chart
genres = [genre for genre, _ in top_ten][::-1]
genre_count = [count for _, count in top_ten][::-1]

df_genres = pd.DataFrame(
    {'Genre': genres,
     'Count': genre_count
    })
fig2 = df_genres.iplot(kind='bar', x='Genre', y='Count',orientation='h')

Given how many countries contribute to Netflix's library, it is no surprise that International Movies is the top category, closely followed by Dramas and Comedies.

In [None]:
tv_genre_split = ", ".join(tv_shows['listed_in']).split(", ")
tv_top_ten = Counter(tv_genre_split).most_common(10)
# Reverse the order so the largest bar ends up at the top of the horizontal chart
tv_genres = [genre for genre, _ in tv_top_ten][::-1]
tv_genre_count = [count for _, count in tv_top_ten][::-1]

df_tvgenres = pd.DataFrame(
    {'Genre': tv_genres,
     'Count': tv_genre_count
    })
fig3 = df_tvgenres.iplot(kind='bar', x='Genre', y='Count',orientation='h')

We see a very similar picture for TV shows: once again the International category has the most content, followed by TV Dramas and TV Comedies.
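
Since the movie and TV show cells repeat the same counting steps, they could share one helper. The sketch below is one possible pandas-only version. `top_genres` is a hypothetical name introduced here, and the frame `sample` is invented for illustration.

```python
import pandas as pd


def top_genres(frame, n=10):
    """Return the n most frequent genres in frame['listed_in'] (hypothetical helper)."""
    return (
        frame['listed_in']
        .str.split(', ')   # same ", " separator as the Counter-based cells above
        .explode()         # one row per (title, genre) pair
        .value_counts()
        .head(n)
    )


# Toy stand-in for the movies frame: these rows are made up for illustration only.
sample = pd.DataFrame({
    'listed_in': ['Dramas, International Movies', 'Comedies, International Movies', 'Dramas']
})

result = top_genres(sample, n=2)
print(result)
```

The helper could then be called once with `movies` and once with `tv_shows`, and each result passed to the plotting code.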

### Movie duration over the years

Netflix data aside, when I compare movies released in the last two decades with those from the 70s-90s, they seem to be getting longer and longer. From action movies to dramas, they seem to average around 2h20m, with some running longer, which contrasts with several popular movies released in the late 20th century.

In [None]:
# Keep the leading number of each duration string (the part before the first space)
movies = movies.assign(length=movies['duration'].str.strip().str.split(' ').str[0].astype(int))
fig, ax = plt.subplots(figsize=(15,7))
# Select 'length' before averaging so non-numeric columns are never passed to mean()
movies.groupby('release_year')['length'].mean().plot(ax=ax, color='darkred')
ax.set_title('Movie duration over the years')
ax.set_xlabel('Release year')
ax.set_ylabel('Duration in minutes')

I was surprised by the graph above, which shows the average length of movies by release year, from 1942 to 2021.

The 60s seem to have longer movies than the 2000s! Having looked at other [sources](https://www.businessinsider.com/are-movies-getting-longer-2016-6?r=US&IR=T), I found that film length increased rapidly between the 50s and 60s and then decreased in the mid 60s. The same source also notes that film length started increasing again in the late 80s, reaching levels similar to those seen in the 60s.

So why doesn't the graph above show the same trend? In the popular movie genres chart, stand-up specials and documentaries both make the top 10. These tend to be shorter than action movies or dramas, and Netflix has invested more in this kind of content in the last few years, which could be a big reason why the average duration is unexpectedly lower in the last two decades.
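
One way to test this explanation would be to compare the average duration with and without the shorter formats. The sketch below only illustrates the mechanics. The frame `sample`, its durations and the genre labels `'Stand-Up Comedy'` and `'Documentaries'` are assumptions made for the example, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the movies frame: these rows are made up for illustration only.
sample = pd.DataFrame({
    'listed_in': ['Stand-Up Comedy', 'Documentaries', 'Dramas', 'Action & Adventure'],
    'duration': ['60 min', '80 min', '120 min', '140 min'],
})

# Pull the leading number out of each duration string
minutes = sample['duration'].str.extract(r'(\d+)', expand=False).astype(int)

# Assumption: these two labels identify the shorter formats
is_short_form = sample['listed_in'].str.contains('Stand-Up Comedy|Documentaries')

mean_all = minutes.mean()
mean_without = minutes[~is_short_form].mean()
print(mean_all, mean_without)
```

If the gap between the two averages on the real `movies` frame widens in recent release years, that would support the idea that shorter formats are pulling the overall average down.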