## **TV Shows and Movies listed on Netflix** <br>
This dataset consists of TV shows and movies available on Netflix as of 2019. It was collected from Flixable, a third-party Netflix search engine.

In 2018, Flixable released a report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles. This notebook explores what other insights can be obtained from the same dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px


%matplotlib inline
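# The 'seaborn-darkgrid' style name was removed from recent matplotlib releases;
# setting the style through seaborn gives a similar dark grid on old and new versions
sns.set_style('darkgrid')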
palette = plt.get_cmap('Set2')

#### Let's load the data

#### The dataset has 7787 rows and 12 columns:
* show_id: Unique ID of each show (not used in this notebook)
* type: The category of a show, can be either a Movie or a TV Show
* title: Name of the show
* director: Name of the director(s) of the show
* cast: Name of actors and other cast of the show
* country: Name of the country or countries where the show was produced
* date_added: Date when the show was added on Netflix
* release_year: Release year of the show
* rating: Maturity rating of the show (for example TV-MA or TV-14)
* duration: Duration of the show (minutes for a movie, number of seasons for a TV show)
* listed_in: Genre of the show
* description: Some text describing the show
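
Because `duration` mixes two units, it helps to split it into a number and a unit before doing any arithmetic. The following is a minimal sketch, assuming the values look like `"93 min"`, `"1 Season"` or `"2 Seasons"`; the helper name `parse_duration` is hypothetical and not part of the original notebook.

```python
import re

# Assumed formats: "<n> min" for movies, "<n> Season" or "<n> Seasons" for TV shows.
_DURATION_RE = re.compile(r"^\s*(\d+)\s+(min|Seasons?)\s*$")


def parse_duration(text):
    """Return (value, unit) with unit 'min' or 'season', or None if unparseable."""
    if not isinstance(text, str):
        return None  # missing values (None, NaN) are passed through as None
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    value, unit = match.groups()
    normalized_unit = "min" if unit == "min" else "season"
    return (int(value), normalized_unit)


print(parse_duration("93 min"))
print(parse_duration("2 Seasons"))
print(parse_duration(None))
```

On a DataFrame, a helper like this could be applied with `main_df['duration'].map(parse_duration)`.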

In [None]:
main_df = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
main_df.head()

#### Display a graph of missing values

In [None]:
msno.bar(main_df, figsize=(20,10))
plt.show()

##### As we can see, the missing values are concentrated in the columns "director", "cast" and "country"; smaller gaps in other columns may be hard to see at this scale
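
A bar chart can hide columns with only a handful of gaps, so an exact count per column is a useful cross-check. The sketch below counts empty cells with the standard library only; the three-row `SAMPLE` and the helper `count_missing` are hypothetical stand-ins for the real file.

```python
import csv
import io

# Hypothetical three-row sample in the same CSV layout (only a few columns).
SAMPLE = """show_id,type,title,director,country
s1,Movie,Example A,Jane Doe,United States
s2,TV Show,Example B,,
s3,Movie,Example C,,India
"""


def count_missing(rows, columns):
    """Count empty or absent cells per column."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1
    return missing


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
print(missing)
```

With pandas, the same per-column count is available as `main_df.isna().sum()`.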

#### Display summary statistics for the columns

In [None]:
main_df.describe(include=[np.number])

In [None]:
# np.object was removed in NumPy 1.24; the built-in object type selects the same columns
main_df.describe(include=[object])

In [None]:
main_df.info()

#### Display a graph of the amount of content by country

In [None]:
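# Keep only the countries that account for more than 0.5% of the titles
country_share = main_df['country'].value_counts(normalize=True)
countries = main_df['country'].value_counts()[country_share > 0.005]
list_countries = list(countries.index)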

In [None]:
plt.figure(figsize=(20,10))
plt.title('Amount of content by country', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=countries.index, x=countries.values, alpha=0.6)
plt.show()

In [None]:
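# Plotly draws its own figure, so no matplotlib figure is needed here.
# Slice once so that the values and the names always have the same length.
top_countries = countries.sort_values(ascending=False).iloc[:20]

figure = px.pie(values=top_countries.values, names=top_countries.index,
                title='Content quantity by country in percentage', color_discrete_sequence=px.colors.sequential.RdBu)
figure.show()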

##### The largest share of content comes from the US, plausibly because the US film industry is highly developed and Netflix is an American company
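
The counts above treat each distinct `country` string as one category. If a title lists several countries in one cell, those co-productions are counted separately from the single-country entries. The sketch below shows one way to credit every listed country, assuming that multiple countries are separated by commas; the helper `count_countries` and the sample values are hypothetical.

```python
from collections import Counter


def count_countries(country_values):
    """Count titles per country, crediting each country listed in a cell."""
    counts = Counter()
    for value in country_values:
        if not isinstance(value, str):
            continue  # skip missing values such as None or NaN
        # A set avoids counting the same country twice within one title
        names = {name.strip() for name in value.split(",")}
        names.discard("")  # ignore empty pieces left by trailing commas
        counts.update(names)
    return counts


# Hypothetical sample values, not taken from the dataset
sample = ["United States", "United States, India", None, "India, ", "Japan"]
counts = count_countries(sample)
print(counts.most_common())
```

Applied to the real column, this would be `count_countries(main_df['country'])`.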

In [None]:
content_type = main_df['type'].value_counts()


plt.figure(figsize=(14, 7))

plt.bar(content_type.index, content_type.values, alpha=0.8)
plt.title('Amount of movies and TV shows', fontsize=18)
plt.show()

##### The dataset contains 5377 movies and 2410 TV shows.
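
To put the two counts in proportion, they can be converted to percentages of the total. This small sketch uses only the figures quoted above.

```python
# Counts quoted in the text above
counts = {"Movie": 5377, "TV Show": 2410}

total = sum(counts.values())
# Share of each content type in percent, rounded to one decimal place
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

print(total)
print(shares)
```

Movies make up roughly 69% of the titles and TV shows roughly 31%.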

In [None]:
TVshows = main_df[main_df['type'] == 'TV Show']
Movie = main_df[main_df['type'] == 'Movie']

In [None]:
TVshows_progress = TVshows['release_year'].value_counts().sort_index()
Movie_progress = Movie['release_year'].value_counts().sort_index()

In [None]:
plt.figure(figsize=(14, 7))

plt.plot(TVshows_progress.index, TVshows_progress.values, label='TV shows')
plt.plot(Movie_progress.index, Movie_progress.values, label='Movie')

plt.axvline(2019, alpha=0.3, linestyle='--', color='r')
plt.axvline(2021, alpha=0.3, linestyle='--', color='r')
plt.axvspan(2019, 2021, alpha=0.2, color='r', label='Coronavirus')

plt.xticks(list(range(1925, 2026, 5)), fontsize=12)
plt.title('Content growth throughout history', fontsize=18)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Amount of content', fontsize=14)
plt.yticks(fontsize=12)
plt.legend()
plt.show()

##### As we can see, the amount of content grew until the coronavirus period and then dropped sharply. Part of this drop may reflect that the most recent release years are only partially covered by the dataset.
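
One way to quantify how sharp a rise or drop is would be the relative change from one year to the next. The following is a minimal sketch; the helper `year_over_year_change` and the sample counts are hypothetical and are not values from the dataset.

```python
def year_over_year_change(counts_by_year):
    """Return {year: relative change versus the previous calendar year}."""
    years = sorted(counts_by_year)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        # Only compare consecutive years and avoid dividing by zero
        if curr - prev != 1 or counts_by_year[prev] == 0:
            continue
        before = counts_by_year[prev]
        after = counts_by_year[curr]
        changes[curr] = (after - before) / before
    return changes


# Hypothetical counts for illustration only
sample = {2017: 100, 2018: 120, 2019: 90, 2021: 30}
changes = year_over_year_change(sample)
print(changes)
```

The value-count series from the notebook could be passed in as `year_over_year_change(Movie_progress.to_dict())`.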

In [None]:
# Drop every row that has a missing value in any column
main_df.dropna(inplace=True)

In [None]:
rating = main_df['rating'].value_counts()

plt.figure(figsize=(14,7))
plt.title('Content ratings of the shows', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=rating.index, x=rating.values, alpha=0.6)

plt.show()

##### The largest number of shows carry the 'TV-MA' rating (2863 shows). "TV-MA": for mature audiences only.

##### Second largest is the 'TV-14' rating (1931 shows). "TV-14": may be inappropriate for children younger than 14 years of age.

##### Third largest is the 'TV-PG' rating (806 shows). "TV-PG": parental guidance suggested.

##### Fourth largest is the 'R' rating (665 shows). "R": may be unsuitable for children under the age of 17 (under 17 requires an accompanying parent or adult guardian).

In [None]:
main_df.head()

In [None]:
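# Copy the movie rows so that adding a column does not modify a view of main_df
movie = main_df[main_df['type'] == 'Movie'].copy()
# Keep the leading number of each movie duration as an integer
movie['duration'] = movie['duration'].str.split(' ').str[0].astype(int)

plt.figure(figsize=(14, 7))

# histplot replaces the deprecated distplot; stat='density' keeps a density scale
sns.histplot(movie['duration'], bins=60, kde=True, stat='density').set(ylabel=None)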

plt.title('Length distribution of films', fontsize=18)
plt.xlabel('Duration', fontsize=14)
plt.show()

In [None]:
short = movie.sort_values('duration')[['title', 'duration']].iloc[:20]

plt.figure(figsize=(14,7))
plt.title('Top 20 shortest movies available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=short['title'], x=short['duration'], alpha=0.6)

plt.show()

In [None]:
long = movie.sort_values('duration')[['title', 'duration']].iloc[-20:]

plt.figure(figsize=(14,7))
plt.title('Top 20 longest movies available on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=long['title'], x=long['duration'], alpha=0.6)

plt.show()

In [None]:
directors = main_df['director'].value_counts()

plt.figure(figsize=(14,7))
plt.title('Top 10 directors by number of titles on Netflix', fontsize=18)
plt.tick_params(labelsize=14)
sns.barplot(y=directors.index[:10], x=directors.values[:10], alpha=0.6)

plt.show()