**1.1 Introduction**

* This dataset consists of TV shows and movies available on Netflix as of 2019.
* The dataset was collected from Flixable, a third-party Netflix search engine.
* In 2018, Flixable released an interesting report showing that the number of TV shows on Netflix has nearly tripled since 2010, while the number of movies has decreased by more than 2,000 titles over the same period.
* It will be interesting to explore what other insights can be obtained from the same dataset.
* Integrating this dataset with external sources such as IMDb ratings or Rotten Tomatoes scores could also provide many interesting findings.
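
The integration idea in the last bullet could look roughly like the sketch below. It uses two tiny in-memory tables instead of real files, and the external table with its `imdb_rating` column is hypothetical: the actual column names depend on whichever ratings source is used.

```python
import pandas as pd

# Toy stand-in for netflix_titles.csv (only the columns needed here).
netflix = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "release_year": [2018, 2016, 2019],
})

# Hypothetical external ratings table; its column names are assumptions.
external = pd.DataFrame({
    "title": ["show a", "Movie B"],
    "release_year": [2018, 2016],
    "imdb_rating": [7.9, 6.4],
})

# Normalise titles so that case and stray whitespace do not block a match.
for frame in (netflix, external):
    frame["title_key"] = frame["title"].str.strip().str.lower()

# A left join keeps every Netflix title; unmatched rows get a NaN rating.
merged = netflix.merge(
    external[["title_key", "release_year", "imdb_rating"]],
    on=["title_key", "release_year"],
    how="left",
)
print(merged[["title", "release_year", "imdb_rating"]])
```

Joining on title plus release year is only a heuristic, since remakes and alternative titles can still mismatch.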


**1.2 Problem Statement**

Some of the interesting questions (tasks) that can be explored with this dataset:

1. Understanding what content is available in different countries
2. Identifying similar content by matching text-based features
3. Network analysis of actors / directors to find interesting insights
4. Has Netflix been increasingly focusing on TV rather than movies in recent years?
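
Task 2 is not implemented in this notebook. As a minimal sketch of what "matching text-based features" could mean, the snippet below scores descriptions by Jaccard word overlap. The titles and descriptions are made up, and Jaccard is just one simple choice; a TF-IDF based similarity would be a common alternative.

```python
def tokenize(text):
    """Lower-case a description and return its set of words without punctuation."""
    words = {word.strip(".,!?;:'\"").lower() for word in text.split()}
    return words - {""}


def jaccard(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Made-up descriptions standing in for the 'description' column.
descriptions = {
    "Title A": "A detective hunts a serial killer in a small town.",
    "Title B": "A small town detective investigates a killer.",
    "Title C": "Two chefs compete in a baking contest.",
}

query = "Title A"
scores = {
    title: jaccard(descriptions[query], text)
    for title, text in descriptions.items()
    if title != query
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```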

**1.3 About the dataset**

netflix_titles.csv: The CSV file contains one row per title (movie or TV show) with the following columns:

* Show ID - unique ID of that particular show
* Type - type of the video: Movie or TV Show
* Title - title of the video
* Director - director name
* Cast - cast members
* Country - country where it was released
* Date Added - date when it became live on Netflix
* Release Year - year of release
* Rating - content (maturity) rating, e.g. TV-MA or PG-13
* Duration - duration of the movie, TV show etc.
* Listed in - genre information
* Description - concise plot summary of the title



In [None]:
import pandas as pd
import numpy as np

import missingno as msno

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

import warnings 
warnings.filterwarnings('ignore')

%matplotlib inline

**Loading the Dataset**

In [None]:
df=pd.read_csv("/kaggle/input/netflix-shows/netflix_titles.csv")

In [None]:
df.head()


In [None]:
df.info()


We can clearly see that a few columns have null values.

In [None]:
df.describe().T


In [None]:
# Visualize missing values as a matrix
msno.matrix(df)

Observation: This matrix lets us quickly spot the pattern of missingness in the dataset. In the visualisation above, "director" has a distinctive pattern that stands out. "cast" and "country" have a similar pattern of missing values, while "date_added" shows a different one.

In [None]:
df.isnull().sum()


Observation: The missing values are in:

* "director": many values are missing and the column is not needed for this analysis, so I will drop it
* "cast": there are too many different values, so I will drop it as well
* "country": an important variable, so we need to fix it
* "date_added": there are just a few cases, so let's drop those rows
* "rating": there are just a few cases, so let's look up their replacements on the internet

In [None]:
msno.bar(df)


**Observation:** This bar chart plots the number of non-null values in each column, so a shorter bar means more missing values. "director" has the most missing values, followed by "cast" and "country". There are only a few missing values in "date_added" and "rating".

In [None]:
for i in df.columns:
    null_rate = df[i].isnull().sum() / len(df)*100
    if null_rate > 0 :
        print( "{}'s null rate: {}%".format(i, round(null_rate, 2)))
    

**Observation:** Consistent with our earlier analysis, among the columns with missing values 'director' has the highest null rate and 'rating' the lowest.
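
The same null rates can also be computed without an explicit loop. A minimal sketch on a made-up frame (the values in `toy` are illustrative only): `isnull().mean()` gives the fraction of missing values per column, because the mean of a boolean column is its share of `True` entries.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "India", np.nan, "Japan"],
    "title": ["a", "b", "c", "d"],
})

# Fraction of nulls per column, expressed as a percentage.
null_rate = toy.isnull().mean().mul(100).round(2)

# Keep only columns that actually have gaps, worst first.
null_rate = null_rate[null_rate > 0].sort_values(ascending=False)
print(null_rate)
```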

In [None]:
df.nunique()


**Observation:** As I assumed, all the show IDs and titles are unique, whereas values in the other columns repeat one or more times.

In [None]:
df[df["rating"].isnull()]


In [None]:
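# Ratings looked up manually for the rows where 'rating' was null.
# Keys are row positions, which match the default integer index at this point.
rating_replacements = {
    67: 'TV-PG',
    2359: 'TV-14',
    3660: 'TV-MA',
    3736: 'TV-MA',
    3737: 'NR',
    3738: 'TV-MA',
    4323: 'TV-MA'
}

# Look the column position up by name instead of hard-coding it
rating_col = df.columns.get_loc('rating')
for row, rate in rating_replacements.items():
    df.iloc[row, rating_col] = rate

df['rating'].isnull().sum()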

**Observation:** All the missing values in the ratings column have been fixed!
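
A null count of zero only shows that every row has some label, not that every label is valid. As a sketch of an extra check, the snippet below compares ratings against a set of expected labels. The `valid_ratings` set is an assumption based on the ratings that appear later in this notebook, and `toy` is made-up data containing a label with a stray trailing space.

```python
import pandas as pd

# Expected labels (assumed from the ratings used later in this notebook).
valid_ratings = {
    "G", "PG", "PG-13", "R", "NC-17", "NR", "UR",
    "TV-Y", "TV-Y7", "TV-Y7-FV", "TV-G", "TV-PG", "TV-14", "TV-MA",
}

# Made-up column: one label carries a trailing space, one is missing.
toy = pd.DataFrame({"rating": ["TV-MA", "TV-MA ", "PG", None]})

# Labels that are present but not in the expected set.
raw_unexpected = sorted(set(toy["rating"].dropna()) - valid_ratings)

# Stripping whitespace resolves this particular kind of mismatch.
cleaned = toy["rating"].str.strip()
unexpected = sorted(set(cleaned.dropna()) - valid_ratings)

print(raw_unexpected, unexpected)
```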

In [None]:
df.drop(['director', 'cast'], inplace=True, axis=1)
df.columns

In [None]:
df=df[df["date_added"].notna()]


In [None]:
df.isnull().sum()


**Observation:** Null values now remain only in the 'country' column. Let's fill them with the value that occurs most often, i.e. the mode.
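
To make the effect of mode imputation concrete, here is a small sketch on a made-up series. It also shows an alternative that is not used in this notebook, filling with an explicit `"Unknown"` placeholder, purely for comparison: mode filling inflates the count of the most common country, while a placeholder keeps the imputed rows distinguishable.

```python
import pandas as pd

# Made-up country column with two gaps.
toy = pd.Series(
    ["United States", None, "India", "United States", None],
    name="country",
)

# mode() returns a Series (there can be ties); [0] takes the first mode.
most_common = toy.mode()[0]
filled_mode = toy.fillna(most_common)

# Alternative for comparison only: keep imputed rows identifiable.
filled_unknown = toy.fillna("Unknown")

print(filled_mode.value_counts())
print(filled_unknown.value_counts())
```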



In [None]:
df['country'] = df['country'].fillna(df['country'].mode()[0])


In [None]:
df['country'].nunique()


**Observation:** Looking closely at the 'country' column, some entries contain multiple countries. I will add a new column holding just the first one so we can check which regions have more productions.

In [None]:
# Lets retrieve just the first country
df['principal_country'] = df['country'].apply(lambda x: x.split(",")[0])
df['principal_country'].head()

In [None]:
df['principal_country'].nunique()


**Observation:** Wow! The 'country' column had 681 unique values, many of them combinations of several countries, and keeping only the first country reduces that to just 81.
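
Keeping only the first country discards co-producing countries. If every listed country should count, one option is to split and explode the column, as in this sketch on made-up rows:

```python
import pandas as pd

# Made-up rows; real entries are comma-separated in the same way.
toy = pd.DataFrame({
    "title": ["a", "b", "c"],
    "country": ["United States, India", "India", "France,Belgium"],
})

# One row per (title, country) pair, with surrounding spaces removed.
countries = toy["country"].str.split(",").explode().str.strip()

print(countries.value_counts())
print(countries.nunique())
```

Here the three combined entries resolve to four distinct countries, and a title with two countries contributes to both counts.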

In [None]:
df.isna().sum()


In [None]:
df.dtypes


**Observation:**

* show_id, type, title, country, date_added, rating, duration, listed_in, description, principal_country are of object data type
* Only release_year is of int data type
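
Since `date_added` is stored as an object (string) column, its parts have to be extracted by hand or parsed. As an alternative to string splitting, the sketch below parses the dates with `pd.to_datetime`. The sample strings are made up and assume the "Month day, year" layout; the explicit `format` is an assumption that would need adjusting if the real column contains other layouts.

```python
import pandas as pd

# Made-up samples; the second has a leading space on purpose.
toy = pd.Series(
    ["August 14, 2020", " September 1, 2019", "January 1, 2016"],
    name="date_added",
)

# Strip whitespace, then parse with an explicit format.
parsed = pd.to_datetime(toy.str.strip(), format="%B %d, %Y")

year_added = parsed.dt.year
month_added = parsed.dt.month_name()

print(year_added.tolist(), month_added.tolist())
```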

In [None]:
df['year_added'] = df['date_added'].apply(lambda x: x.split(" ")[-1])
df['year_added'].head()

In [None]:
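# strip() guards against any entries with leading whitespace, which would otherwise yield an empty month
df['month_added'] = df['date_added'].apply(lambda x: x.strip().split(" ")[0])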
df['month_added'].head()

In [None]:
ratings_ages = {
    'TV-PG': 'Older Kids',
    'TV-MA': 'Adults',
    'TV-Y7-FV': 'Older Kids',
    'TV-Y7': 'Older Kids',
    'TV-14': 'Teens',
    'R': 'Adults',
    'TV-Y': 'Kids',
    'NR': 'Adults',
    'PG-13': 'Teens',
    'TV-G': 'Kids',
    'PG': 'Older Kids',
    'G': 'Kids',
    'UR': 'Adults',
    'NC-17': 'Adults'
}

In [None]:
df['target_ages'] = df['rating'].replace(ratings_ages)
df['target_ages'].unique()

**Observation:** This is much clearer. Using the target_ages column, we can easily make out the intended audience for a particular title!
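
One thing to keep in mind with a dictionary-based mapping: `Series.replace` leaves any label that is not in the dictionary unchanged, so a mistyped rating would silently pass through as its own "age group". `Series.map` returns `NaN` for such labels instead, which makes them easy to find. A small sketch, using a subset of the mapping and a made-up label `TV-XX`:

```python
import pandas as pd

# Subset of the notebook's mapping, enough for the illustration.
ratings_ages = {
    "TV-MA": "Adults",
    "TV-14": "Teens",
    "PG": "Older Kids",
    "G": "Kids",
}

# 'TV-XX' is a made-up label that the mapping does not cover.
ratings = pd.Series(["TV-MA", "G", "TV-14", "TV-XX"])

with_replace = ratings.replace(ratings_ages)  # unknown label passes through
with_map = ratings.map(ratings_ages)          # unknown label becomes NaN

# Labels the mapping does not cover.
unmapped = ratings[with_map.isna()].unique().tolist()
print(unmapped)
```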



Now let us visualise our dataset.

In [None]:
val = df['type'].value_counts().index
cnt = df['type'].value_counts().values

fig = go.Figure([go.Bar(x=val, y=cnt, marker_color='darkturquoise')])
fig.update_layout(title_text='Netflix Sources Distribution', title_x=0.5)
fig.show()

**Observation:** There are roughly half as many TV Shows as Movies on Netflix.

In [None]:
# type should be a category
df['type'] = pd.Categorical(df['type'])

# target_ages is another category (4 classes)
df['target_ages'] = pd.Categorical(df['target_ages'], categories=['Kids', 'Older Kids', 'Teens', 'Adults'])

# year_added should be an integer so we can compare it with `release_year`
df['year_added'] = pd.to_numeric(df['year_added'])


In [None]:
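plt.figure(figsize=(15,8))
# Take the labels from the counts themselves so they cannot get out of sync with the wedges
type_counts = df['type'].value_counts().sort_values()
plt.pie(type_counts, labels=type_counts.index.tolist(), explode=[0.15,0.15],
        autopct='%1.2f%%', startangle=90)
plt.title('Different Types of Netflix Content')
plt.axis('equal')
plt.show()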

Observation: Nearly two-thirds of the content on Netflix is movies, while the rest is TV shows.
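
The shares shown in the pie chart can also be read off as numbers with `value_counts(normalize=True)`. A minimal sketch on made-up data (the 7:3 split below is illustrative, not the real proportion):

```python
import pandas as pd

# Made-up 'type' column with a 7:3 split.
toy = pd.Series(["Movie"] * 7 + ["TV Show"] * 3, name="type")

# normalize=True returns fractions instead of raw counts.
share = toy.value_counts(normalize=True).mul(100).round(2)
print(share)
```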

In [None]:
df_movie = df[df['type']=='Movie'].groupby('release_year').count()
df_tv = df[df['type']=='TV Show'].groupby('release_year').count()


df_movie.reset_index(level=0, inplace=True)
df_tv.reset_index(level=0, inplace=True)

# fig = px.line(data_movie, x="release_year", y="show_id")
# fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_movie['release_year'], y=df_movie['show_id'],
                    mode='lines',
                    name='Movies', marker_color='mediumpurple'))
fig.add_trace(go.Scatter(x=df_tv['release_year'], y=df_tv['show_id'],
                    mode='lines',
                    name='TV Shows', marker_color='lightcoral'))
fig.update_layout(title_text='Trend Movies vs TV Shows in recent years', title_x=0.5)
fig.show()

In [None]:
df_tv = df[df["type"] == "TV Show"]
df_movie = df[df["type"] == "Movie"]

movie_ratings = df_movie.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (18,8)
fig, ax = plt.subplots(figsize=fig_dims)  
sns.pointplot(x='rating',y='count',data=movie_ratings)
plt.title('Top Movie Ratings Based On Rating System',size='20')
plt.show()

In [None]:
tv_ratings = df_tv.groupby(['rating'])['show_id'].count().reset_index(name='count').sort_values(by='count',ascending=False)
fig_dims = (18,8)
fig, ax = plt.subplots(figsize=fig_dims)  
sns.pointplot(x='rating',y='count',data=tv_ratings)
plt.title('Top TV Show Ratings Based On Rating System',size='20')
plt.show()


In [None]:
def generate_rating_df(df):
    rating_df = df.groupby(['rating', 'target_ages']).agg({'show_id': 'count'}).reset_index()
    rating_df = rating_df[rating_df['show_id'] != 0]
    rating_df.columns = ['rating', 'target_ages', 'counts']
    rating_df = rating_df.sort_values('target_ages')
    return rating_df


rating_df = generate_rating_df(df)
fig = px.bar(rating_df, x='rating', y='counts', color='target_ages', title='Ratings of Movies And TV Shows Based On Target Age Groups',  labels={'counts':'COUNT', 'rating':'RATINGS', 'target_ages':'TARGET AGE GROUPS' })
fig.show()

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="whitegrid")
ax = sns.countplot(y="release_year", data=df_movie, palette="coolwarm", order=df_movie['release_year'].value_counts().index[0:15])

plt.title('ANALYSIS ON RELEASE YEAR OF MOVIES', fontsize=15, fontweight='bold')
plt.show()

**Observation:** Most of the movies were released in the year 2017 followed by 2018 and 2016.

In [None]:
plt.figure(figsize=(12,10))
sns.set(style="darkgrid")
ax = sns.countplot(y="release_year", data=df_tv, palette="coolwarm", order=df_tv['release_year'].value_counts().index[0:15])

plt.title('ANALYSIS ON RELEASE YEAR OF TV Show', fontsize=15, fontweight='bold')
plt.show()

**Observation:** Most of the TV Shows were released in 2020 followed by 2019 and 2018

In [None]:
from collections import Counter
country_data = df['country']
country_count = pd.Series(dict(Counter(','.join(country_data).replace(' ,',',').replace(
    ', ',',').split(',')))).sort_values(ascending=False)
top20country = country_count.head(20)
plt.figure(figsize=(15,5))
sns.barplot(x= top20country.index, y=top20country, palette="pastel")
plt.xticks(rotation=50)
plt.title('Top 20 countries with most contents', fontsize=15, fontweight='bold')
plt.show()

In [None]:
rating_order_movie = ['G', 'TV-Y', 'TV-G', 'PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'PG-13', 'TV-14', 'R', 'NC-17', 'TV-MA']
rating_order_tv = ['TV-Y', 'TV-G', 'TV-Y7', 'TV-Y7-FV', 'TV-PG', 'TV-14', 'R', 'TV-MA']
# reindex keeps the chosen order and fills ratings absent from a subset with 0
movie_rating = df_movie['rating'].value_counts().reindex(rating_order_movie, fill_value=0)
tv_rating = df_tv['rating'].value_counts().reindex(rating_order_tv, fill_value=0)

# Age bands: the ratings that belong to each band and the colour used to shade it
rating_bands = {
    'Little Kids': (['G', 'TV-Y', 'TV-G'], 'green'),
    'Older Kids': (['PG', 'TV-Y7', 'TV-Y7-FV', 'TV-PG'], 'blue'),
    'Teens': (['PG-13', 'TV-14'], 'orange'),
    'Mature': (['R', 'NC-17', 'TV-MA'], 'red'),
}

def rating_barplot(data, title, height, h_lim=None):
    fig, ax = plt.subplots(1, 1, figsize=(15, 7))
    if h_lim:
        ax.set_ylim(0, h_lim)
    ax.bar(data.index, data, color="#d0d0d0", width=0.6, edgecolor='black')

    labels = list(data.index)
    for sub_title, (ratings, color) in rating_bands.items():
        # Bar positions of this band's ratings, so the shading lines up
        # for both the movie ordering and the shorter TV ordering
        pos = [labels.index(r) for r in ratings if r in labels]
        if not pos:
            continue
        ax.annotate(sub_title,
                    xy=((min(pos) + max(pos)) / 2, height),
                    xytext=(0, 0), textcoords='offset points',
                    va="center", ha="center",
                    color="w", fontsize=16, fontweight='bold',
                    bbox=dict(boxstyle='round4', pad=0.4, color=color, alpha=0.6))
        ax.axvspan(min(pos) - 0.4, max(pos) + 0.4, color=color, alpha=0.1)

    ax.set_title(f'Distribution of {title} Rating', fontsize=20, fontweight='bold', position=(0.5, 1.0+0.03))
    plt.show()

In [None]:

rating_barplot(movie_rating,'Movie', 1500)


Observation: There are far more movies for Older Kids, Teens and Mature audiences than there are for Little Kids!

In [None]:
rating_barplot(tv_rating,'TV Show' , 600, 1500)


**Observation:** TV Shows are aimed more at kids and teens.

In [None]:
import plotly.offline as py
py.init_notebook_mode()
import pycountry
from collections import Counter


def count_countries(series):
    """Count titles per country, splitting comma-separated entries."""
    return Counter(name.strip() for entry in series
                   for name in entry.split(',') if name.strip())

# dict1 / dict2 are not defined earlier in this notebook. Assumption: they hold
# per-country title counts for movies (dict1) and TV shows (dict2).
dict1 = count_countries(df_movie['country'])
dict2 = count_countries(df_tv['country'])

df1 = pd.DataFrame(list(dict1.items()), columns=['Country', 'Count'])
df2 = pd.DataFrame(list(dict2.items()), columns=['Country', 'Count'])

# Series.append was removed in pandas 2.0; pd.concat works across versions
total = set(pd.concat([df1['Country'], df2['Country']]))

d_country_code = {}  # maps each country name to its ISO 3166-1 alpha-3 code
for country in total:
    try:
        # search_fuzzy returns a list of pycountry Country objects;
        # the first item (index 0) is the best fit and has an alpha_3 attribute
        country_code = pycountry.countries.search_fuzzy(country)[0].alpha_3
    except LookupError:
        # If the country could not be found, use a blank ISO code
        country_code = ' '
    d_country_code[country] = country_code

df1['iso_alpha'] = df1['Country'].map(d_country_code)
df2['iso_alpha'] = df2['Country'].map(d_country_code)

fig = px.scatter_geo(df1, locations="iso_alpha",
                     hover_name="Country",  # column added to hover information
                     size="Count",  # marker size scales with the title count
                     )
fig.update_layout(title_text='Top Countries where Movies are released', title_x=0.5)
fig.show()

fig = px.scatter_geo(df2, locations="iso_alpha",
                     hover_name="Country",  # column added to hover information
                     size="Count",  # marker size scales with the title count
                     )

fig.update_layout(title_text='Top Countries where TV Shows are released', title_x=0.5)
fig.show()

In [None]:
def count_by_year(frame):
    """Return the number of titles added per year and each year's share of the total."""
    # rename_axis + reset_index(name=...) yields the same column names
    # regardless of how the installed pandas version labels value_counts()
    counts = (frame['year_added'].value_counts()
              .rename_axis('year_added')
              .reset_index(name='count')
              .sort_values('year_added'))
    counts['percent'] = 100 * counts['count'] / counts['count'].sum()
    return counts


df_content = count_by_year(df)
df_tv1 = count_by_year(df_tv)
df_movie1 = count_by_year(df_movie)

t1 = go.Scatter(x=df_movie1['year_added'], y=df_movie1["count"], name="Movies", marker=dict(color="royalblue"))
t2 = go.Scatter(x=df_tv1['year_added'], y=df_tv1["count"], name="TV Shows", marker=dict(color="red"))
t3 = go.Scatter(x=df_content['year_added'], y=df_content["count"], name="Total Contents", marker=dict(color="gold"))

data = [t1, t2, t3]

layout = go.Layout(title="Content added over the years", legend=dict(x=0.1, y=1.1, orientation="h"))
fig = go.Figure(data, layout=layout)
fig.show()

**Observation:**

* The number of movies added to Netflix has grown much faster than the number of TV shows
* About 1,200 new movies were added in each of 2018 and 2019
* The growth in content started in 2013
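
Growth claims like these are easier to check with explicit year-over-year figures than by reading them off the line chart. A minimal sketch, using made-up counts in the same `year_added` / `count` layout as the frames plotted above:

```python
import pandas as pd

# Made-up yearly counts; the real values come from the notebook's data.
added = pd.DataFrame({
    "year_added": [2016, 2017, 2018, 2019],
    "count": [400, 1200, 1600, 2000],
}).sort_values("year_added")

# Absolute and relative change compared with the previous year.
added["yoy_change"] = added["count"].diff()
added["yoy_pct"] = added["count"].pct_change().mul(100).round(1)

# Running total of titles added so far.
added["cumulative"] = added["count"].cumsum()

print(added)
```

The first year has no predecessor, so its change columns are `NaN`.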

In [None]:
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(df_movie['listed_in'])

plt.rcParams['figure.figsize'] = (12,12)
wordcloud = WordCloud(background_color = 'black',colormap='vlag', width = 1200,  height = 1200, max_words = 121).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

Such an interesting output! This is my first time using a word cloud. Let's try the same for TV Shows as well!

In [None]:
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(df_tv['listed_in'])

plt.rcParams['figure.figsize'] = (12,12)
wordcloud = WordCloud(background_color = 'lightblue', width = 1200,  height = 1200, max_words = 121).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
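
A word cloud gives a visual impression but no exact numbers. As a complement, genre frequencies can be counted directly by splitting `listed_in` on commas, which also keeps multi-word genres intact. A sketch on made-up entries:

```python
from collections import Counter

# Made-up 'listed_in' values; real entries are comma-separated genres.
listed_in = [
    "Dramas, International Movies",
    "Comedies",
    "Dramas, Comedies",
]

# Split each entry into genres and count every occurrence.
genre_counts = Counter(
    genre.strip()
    for entry in listed_in
    for genre in entry.split(",")
)

print(genre_counts.most_common())
```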

Note: I referred to two notebooks for this analysis:

* https://www.kaggle.com/sauravjoshi23/netflix-eda-and-data-visualization-plotly
* https://www.kaggle.com/bhartiprasad17/netflix-movies-and-tv-shows-eda

Thanks to the authors for creating such wonderful notebooks! I really learnt a lot from them.