![netflix.jpg](https://occ-0-1722-1723.1.nflxso.net/dnm/api/v6/LmEnxtiAuzezXBjYXPuDgfZ4zZQ/AAAABc62Uu86O-KyG_QVooZls2_LQXEqPggHKhGNXfvFoTTdtZf0y9YrDXOrlFts44M5PgQY21fus6w4ij1QGGkwiDWn9uX-JpYo06BH.png?r=72a)

This notebook analyses the Netflix catalogue and the factors behind the trends in the movies and TV shows available on the platform. Data visualization is the primary aim, and all the interactive charts are built with Plotly, a Python graphing library for interactive, publication-quality graphs. Plotly supports line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple axes, polar charts, and bubble charts. About 90% of the dataset's features are explored in depth here, and the rest will be covered in later versions of the notebook.

## Import Libraries

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly as py
import plotly.express as px
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode, iplot, plot

import warnings 
warnings.filterwarnings('ignore')
%matplotlib inline

pd.set_option('display.max_columns', None)

## Load Data and Basic Analysis

In [None]:
data = pd.read_csv('../input/netflix-shows/netflix_titles.csv')
display(data.sample(3))
print('Data Shape: ', data.shape)

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
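# Rows without a date get a placeholder so the string operations below do not fail;
# their year becomes 'Data' and their month 'NaN', which never match a real value.
data['date_added'] = data['date_added'].fillna('NaN Data').str.strip()
data['year'] = data['date_added'].apply(lambda x: x[-4:])
data['month'] = data['date_added'].apply(lambda x: x.split(' ')[0])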

display(data.sample(3))
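
As an alternative to slicing the date string, the year and month could also be taken from a parsed datetime. The snippet below is a minimal sketch on a few made-up `date_added` values (the `sample` series is hypothetical, not part of the dataset), assuming the dates follow the `Month day, Year` pattern:

```python
import pandas as pd

# Hypothetical sample values that mimic the `date_added` column, including a
# leading space and a missing entry
sample = pd.Series([' September 1, 2019', 'August 14, 2020', None])

# Strip whitespace, then parse; values that do not match the format become NaT
parsed = pd.to_datetime(sample.str.strip(), format='%B %d, %Y', errors='coerce')

year = parsed.dt.year            # NaN where the date is missing
month = parsed.dt.month_name()   # full month name, NaN where the date is missing

print(pd.DataFrame({'date_added': sample, 'year': year, 'month': month}))
```

With real datetimes the missing dates stay as `NaT` instead of turning into placeholder strings.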

## Type

Netflix lists only two types of title: Movie and TV Show.

In [None]:
val = data['type'].value_counts().index
cnt = data['type'].value_counts().values

fig = go.Figure([go.Bar(x=val, y=cnt, marker_color='darkturquoise')])
fig.update_layout(title_text='Netflix Sources Distribution', title_x=0.5)
fig.show()

Now let's look at how many titles of each type were added per year and per month, using bar plots.

In [None]:
from collections import defaultdict

# Count titles per (type, year added); `groups` avoids shadowing the built-in `dict`
groups = data.groupby(['type', 'year']).groups
dict2 = defaultdict(int)
for key, values in groups.items():
    val = key[0]+','+key[1]
    dict2[val] = len(values)
    
x = list(np.arange(2008, 2022, 1))

y1, y2= [], []
for i in x:
    y1.append(dict2['Movie,'+str(i)])
    y2.append(dict2['TV Show,'+str(i)])
    
fig = go.Figure(data = [
    go.Bar(name='Movie', x=x, y=y1, marker_color='mediumpurple'),
    go.Bar(name='TV Show', x=x, y=y2, marker_color='lightcoral')
])
fig.update_layout(title_text='Trend Movies vs TV Shows in recent years', title_x=0.5)
fig.show()
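
The same per-type counts can also be obtained without building a dictionary by hand. This is a small sketch on a made-up frame (`toy` and its values are hypothetical), using `groupby` with `size` and `unstack`:

```python
import pandas as pd

# Hypothetical miniature of the notebook's `type` and `year` columns
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'Movie', 'TV Show'],
    'year': ['2019', '2019', '2019', '2020', '2021'],
})

# One row per year, one column per type; combinations that never occur become 0
counts = toy.groupby(['year', 'type']).size().unstack(fill_value=0)

# Reindex so every year on the x-axis is present, even if no title was added
years = ['2019', '2020', '2021']
counts = counts.reindex(years, fill_value=0)

y1 = counts['Movie'].tolist()
y2 = counts['TV Show'].tolist()
print(counts)
```

The lists `y1` and `y2` can then be passed to `go.Bar` in the same way as in the cell above.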

In [None]:
# Count titles per (type, month added); `groups` avoids shadowing the built-in `dict`
groups = data.groupby(['type', 'month']).groups
dict2 = defaultdict(int)
for key, values in groups.items():
    val = key[0]+','+key[1]
    dict2[val] = len(values)
    
x = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
     'August', 'September', 'October', 'November', 'December']

y1, y2= [], []
for i in x:
    y1.append(dict2['Movie,'+str(i)])
    y2.append(dict2['TV Show,'+str(i)])
    
fig = go.Figure(data = [
    go.Bar(name='Movie', x=x, y=y1, marker_color='mediumpurple'),
    go.Bar(name='TV Show', x=x, y=y2, marker_color='lightcoral')
])
fig.update_layout(title_text='Trend Movies vs TV Shows during Months', title_x=0.5)
fig.show()

Let's look at the trend by release year with a line plot.

In [None]:
data_movie = data[data['type']=='Movie'].groupby('release_year').count()
data_tv = data[data['type']=='TV Show'].groupby('release_year').count()
data_movie.reset_index(level=0, inplace=True)
data_tv.reset_index(level=0, inplace=True)

# fig = px.line(data_movie, x="release_year", y="show_id")
# fig.show()

fig = go.Figure()
fig.add_trace(go.Scatter(x=data_movie['release_year'], y=data_movie['show_id'],
                    mode='lines',
                    name='Movies', marker_color='mediumpurple'))
fig.add_trace(go.Scatter(x=data_tv['release_year'], y=data_tv['show_id'],
                    mode='lines',
                    name='TV Shows', marker_color='lightcoral'))
fig.update_layout(title_text='Trend Movies vs TV Shows by release year', title_x=0.5)
fig.show()

## Country

Here we look at how movies and TV shows are distributed across countries, and then plot the counts on a map.

In [None]:
import collections

dict1 = {}
dict1 = defaultdict(lambda: 0, dict1)
dict2 = {}
dict2 = defaultdict(lambda: 0, dict2)

data['country'] = data['country'].fillna(' ')

for i in range(len(data)):
    if data['type'][i] == 'Movie':
        val = data['country'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict1[x]+=1
    else:
        val = data['country'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict2[x]+=1
            
dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

x1 = list(dict1.keys())[:20]
x2 = list(dict2.keys())[:20]
y1 = list(dict1.values())[:20]
y2 = list(dict2.values())[:20]

fig = go.Figure([go.Bar(x=x1, y=y1, marker_color='mediumpurple')])
fig.update_layout(title_text='Top Countries where Movies are released', title_x=0.5)
fig.show()

fig = go.Figure([go.Bar(x=x2, y=y2, marker_color='lightcoral')])
fig.update_layout(title_text='Top Countries where TV Shows are released', title_x=0.5)
fig.show()
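
The Movie and TV Show branches above do the same work, and the same pattern comes back for the cast and the genres. A helper like the one below could remove the duplication. It is only a sketch: `count_split_values` and the sample list are hypothetical names introduced here, not part of the notebook.

```python
from collections import Counter

def count_split_values(rows, n=20):
    """Count comma-separated values across rows, ignoring case and blanks."""
    counts = Counter()
    for row in rows:
        # Skip missing values (None or NaN) that are not strings
        if not isinstance(row, str):
            continue
        for value in row.split(','):
            value = value.strip().lower()
            if value:
                counts[value] += 1
    # The n most frequent values, most frequent first
    return counts.most_common(n)

# Hypothetical sample that mimics the `country` column for a few movies
movie_countries = ['United States, India', 'India', None, ' ']
top = count_split_values(movie_countries)
print(top)
```

In the notebook it could be called as `count_split_values(data.loc[data['type'] == 'Movie', 'country'])`, and in the same way for `cast` and `listed_in`.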

In [None]:
from plotly.offline import init_notebook_mode
import pycountry

init_notebook_mode()


df1 = pd.DataFrame(dict1.items(), columns=['Country', 'Count'])
df2 = pd.DataFrame(dict2.items(), columns=['Country', 'Count'])

total = set(df1['Country']) | set(df2['Country'])  # every country name in either frame

d_country_code = {}  # To hold the country names and their ISO
for country in total:
    try:
        country_data = pycountry.countries.search_fuzzy(country)
        # country_data is a list of objects of class pycountry.db.Country
        # The first item  ie at index 0 of list is best fit
        # object of class Country have an alpha_3 attribute
        country_code = country_data[0].alpha_3
        d_country_code.update({country: country_code})
    except LookupError:
        # search_fuzzy found no match for this name, so leave its ISO code blank
        d_country_code.update({country: ' '})
        
for k, v in d_country_code.items():
    df1.loc[(df1.Country == k), 'iso_alpha'] = v
    df2.loc[(df2.Country == k), 'iso_alpha'] = v
        
fig = px.scatter_geo(df1, locations="iso_alpha",
                     hover_name="Country", # column shown as the hover title
                     size="Count", # marker size scales with the number of titles
                     )
fig.update_layout(title_text='Top Countries where Movies are released', title_x=0.5)
fig.show()

fig = px.scatter_geo(df2, locations="iso_alpha",
                     hover_name="Country", # column shown as the hover title
                     size="Count", # marker size scales with the number of titles
                     )
fig.update_layout(title_text='Top Countries where TV Shows are released', title_x=0.5)
fig.show()

## Cast

In [None]:
dict1 = {}
dict1 = defaultdict(lambda: 0, dict1)
dict2 = {}
dict2 = defaultdict(lambda: 0, dict2)

data['cast'] = data['cast'].fillna(' ')

for i in range(len(data)):
    if data['type'][i] == 'Movie':
        val = data['cast'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict1[x]+=1
    else:
        val = data['cast'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict2[x]+=1
            
dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

x1 = list(dict1.keys())[:20]
x2 = list(dict2.keys())[:20]
y1 = list(dict1.values())[:20]
y2 = list(dict2.values())[:20]

fig = go.Figure([go.Bar(x=x1, y=y1, marker_color='mediumpurple')])
fig.update_layout(title_text='Most appeared Cast Globally in Movies', title_x=0.5)
fig.show()

fig = go.Figure([go.Bar(x=x2, y=y2, marker_color='lightcoral')])
fig.update_layout(title_text='Most appeared Cast Globally in TV Shows', title_x=0.5)
fig.show()

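Now let's estimate the gender distribution of the cast in movies and TV shows on Netflix across all countries. The dataset has no gender column, so gender is predicted from each name with a simple Naive Bayes classifier trained on NLTK's names corpus.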

In [None]:
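import nltk
import random
from nltk.corpus import names

nltk.download('names', quiet=True)  # fetch the names corpus if it is not installed yet

def gender_features(word):
    # Use only the last letter of the name as the feature
    return {'last_letter': word[-1]}

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + 
                 [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)

featuresets = [(gender_features(n), gender) for (n, gender) in labeled_names]

# Hold out the first 500 names for testing and train on the rest
trainset, testset = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(trainset)
print('Accuracy on held-out names:', nltk.classify.accuracy(classifier, testset))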

In [None]:
from collections import Counter

def count_predicted_genders(cast_column):
    """Count the predicted gender of every cast member listed in a column."""
    counts = Counter()
    for cast in cast_column.fillna(''):
        for name in cast.split(','):
            name = name.strip().lower()
            if name:
                # The classifier was trained on first names, so classify the
                # first token rather than the last letter of the surname
                first_name = name.split()[0]
                counts[classifier.classify(gender_features(first_name))] += 1
    # One row per gender; much faster than appending a row per cast member
    return pd.DataFrame(list(counts.items()), columns=['Gender', 'Count'])

df1 = count_predicted_genders(data.loc[data['type'] == 'Movie', 'cast'])
df2 = count_predicted_genders(data.loc[data['type'] != 'Movie', 'cast'])

fig = px.pie(df1, values='Count', names='Gender', color='Gender',
             color_discrete_map={'female':'lightcyan',
                                 'male':'darkblue'})
fig.update_layout(title_text='Gender Ratio in Movies', title_x=0.5)
fig.show()

fig = px.pie(df2, values='Count', names='Gender', color='Gender',
             color_discrete_map={'female':'lightcyan',
                                 'male':'darkblue'})
fig.update_layout(title_text='Gender Ratio in TV Shows', title_x=0.5)
fig.show()

The predicted gender ratio leans towards male cast members in both movies and TV shows. Since the classifier looks at a single letter of each name, this is only a rough estimate.

## Genre

In [None]:
dict1 = {}
dict1 = defaultdict(lambda: 0, dict1)
dict2 = {}
dict2 = defaultdict(lambda: 0, dict2)

data['listed_in'] = data['listed_in'].fillna(' ')

for i in range(len(data)):
    if data['type'][i] == 'Movie':
        val = data['listed_in'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict1[x]+=1
    else:
        val = data['listed_in'][i].split(',')
        for j in val:
            x = j.lower()
            x = x.strip()
            if x!='':
                dict2[x]+=1
            
dict1 = collections.OrderedDict(sorted(dict1.items(), key=lambda x: x[1], reverse=True))
dict2 = collections.OrderedDict(sorted(dict2.items(), key=lambda x: x[1], reverse=True))

x1 = list(dict1.keys())[:20]
x2 = list(dict2.keys())[:20]
y1 = list(dict1.values())[:20]
y2 = list(dict2.values())[:20]

fig = go.Figure([go.Bar(x=x1, y=y1, marker_color='mediumpurple')])
fig.update_layout(title_text='Highest occurring genres Globally in Movies', title_x=0.5)
fig.show()

fig = go.Figure([go.Bar(x=x2, y=y2, marker_color='lightcoral')])
fig.update_layout(title_text='Highest occurring genres Globally in TV Shows', title_x=0.5)
fig.show()

In [None]:
dict2 = defaultdict(int)

# Work on a copy so the lowercasing below does not modify `data`
data2 = data.copy()
data2['country'] = data2['country'].str.lower()
data2['listed_in'] = data2['listed_in'].str.lower()

# Count every (country, genre) pair, stripping the spaces left by splitting on ','
for countries, genres in zip(data2['country'], data2['listed_in']):
    for j in countries.split(','):
        j = j.strip()
        for k in genres.split(','):
            k = k.strip()
            if j and k:
                dict2[(j, k)] += 1

# Keep the five most frequent genres for each of the three countries
rows = []
taken = defaultdict(int)
for (country, genre), v in sorted(dict2.items(), key=lambda x: x[1], reverse=True):
    if country in ('india', 'united states', 'united kingdom') and taken[country] < 5:
        rows.append([country, genre, v])
        taken[country] += 1

df1 = pd.DataFrame(rows, columns=['Country', 'Genre', 'Count'])
df1

In [None]:
fig = px.sunburst(df1, path = ['Country', 'Genre'], values = 'Count', color = 'Country',
                 color_discrete_map = {'united states': '#85e0e0', 'india': '#99bbff', 'united kingdom': '#bfff80'})
fig.update_layout(title_text='Distribution of Genres in India, US, UK', title_x=0.5)                  
fig.show()

## Rating

A few missing ratings are filled in manually, and every rating is then mapped to an age group for easier analysis.

In [None]:
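# Look up the position of the 'rating' column instead of hard-coding it.
# The row positions below are specific to the version of the dataset used here.
rating_col = data.columns.get_loc('rating')
data.iloc[67, rating_col] = 'R'
data.iloc[2359, rating_col] = 'TV-14'
data.iloc[3660, rating_col] = 'TV-PG'
data.iloc[3736, rating_col] = 'R'
data.iloc[3737, rating_col] = 'R'
data.iloc[3738, rating_col] = 'R'
data.iloc[4323, rating_col] = 'PG-13'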

data['age_group'] = data['rating']
MR_age = {'TV-MA': 'Adults',
          'R': 'Adults',
          'PG-13': 'Teens',
          'TV-14': 'Young Adults',
          'TV-PG': 'Older Kids',
          'NR': 'Adults',
          'TV-G': 'Kids',
          'TV-Y': 'Kids',
          'TV-Y7': 'Older Kids',
          'PG': 'Older Kids',
          'G': 'Kids',
          'NC-17': 'Adults',
          'TV-Y7-FV': 'Older Kids',
          'UR': 'Adults'}
data['age_group'] = data['age_group'].map(MR_age)

val = data['age_group'].value_counts().index
cnt = data['age_group'].value_counts().values

fig = go.Figure([go.Bar(x=val, y=cnt, marker_color='darkturquoise')])
fig.update_layout(title_text='Age Group Distribution', title_x=0.5)
fig.show()
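
`Series.map` returns `NaN` for any rating that is missing from the dictionary, so it can be worth checking for labels that were not mapped. A minimal sketch on hypothetical values (`ratings`, `age_map` and the label `'UNLISTED'` are made up for illustration):

```python
import pandas as pd

# Hypothetical ratings: two known labels, one label absent from the map, one missing
ratings = pd.Series(['TV-MA', 'PG', 'UNLISTED', None])
age_map = {'TV-MA': 'Adults', 'PG': 'Older Kids'}

age_group = ratings.map(age_map)

# Labels that exist in the data but have no entry in the map
unmapped = sorted(ratings[age_group.isna() & ratings.notna()].unique())
print('Ratings without an age group:', unmapped)
```

An empty list means that every rating present in the data has an age group.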

Most of the movies and TV shows are rated for adults, which is not surprising.

## Duration

Here we examine the duration of movies and TV shows by release year.

In [None]:
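data_movie = data[data['type'] == 'Movie']
data_tv = data[data['type'] == 'TV Show']

# Durations are stored as text (e.g. '90 min', '2 Seasons'), so extract the leading number
movie_minutes = data_movie['duration'].str.extract(r'(\d+)', expand=False).astype(float)
tv_seasons = data_tv['duration'].str.extract(r'(\d+)', expand=False).astype(float)

# Movies and TV shows are different rows with different units, so each gets its own
# panel instead of sharing the axes of a single 3D scatter
fig = make_subplots(rows=1, cols=2, subplot_titles=('Movies (minutes)', 'TV Shows (seasons)'))
fig.add_trace(go.Scatter(x=data_movie['release_year'], y=movie_minutes, mode='markers',
                         name='Movies', marker_color='mediumpurple'), row=1, col=1)
fig.add_trace(go.Scatter(x=data_tv['release_year'], y=tv_seasons, mode='markers',
                         name='TV Shows', marker_color='lightcoral'), row=1, col=2)
fig.update_layout(title_text='Distribution of Duration across Movies and TV Shows in the past years', title_x=0.5)
iplot(fig)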

In [None]:
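data_movie = data[data['type'] == 'Movie']
data_tv = data[data['type'] == 'TV Show']

# Convert the text durations (e.g. '90 min', '2 Seasons') to numbers before plotting
movie_minutes = data_movie['duration'].str.extract(r'(\d+)', expand=False).astype(float)
tv_seasons = data_tv['duration'].str.extract(r'(\d+)', expand=False).astype(float)

trace0 = go.Box(
    y = movie_minutes,
    name = "Duration of Movies (minutes)",
    marker_color='mediumpurple'
)

trace1 = go.Box(
    y = tv_seasons,
    name = "Duration of TV Shows (seasons)",
    marker_color='lightcoral'
)

# Minutes and seasons are different units, so give each box its own y-axis
fig = make_subplots(rows=1, cols=2)
fig.add_trace(trace0, row=1, col=1)
fig.add_trace(trace1, row=1, col=2)
iplot(fig)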
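
Plotting durations works best on numbers rather than text. The sketch below assumes the `duration` column holds strings such as `'90 min'` or `'2 Seasons'`, and splits them into a number and a unit. The `durations` series is a hypothetical sample, not the real column.

```python
import pandas as pd

# Hypothetical sample values in the format assumed for the `duration` column
durations = pd.Series(['90 min', '2 Seasons', '1 Season', None])

# Leading digits become the numeric value; missing entries stay NaN
number = durations.str.extract(r'(\d+)', expand=False).astype(float)

# The word after the digits is the unit, normalised to lowercase singular
unit = (durations.str.extract(r'\d+\s*([A-Za-z]+)', expand=False)
                 .str.lower()
                 .str.rstrip('s'))

print(pd.DataFrame({'duration': durations, 'number': number, 'unit': unit}))
```

Keeping the unit next to the number makes it easy to check that minutes and seasons are never mixed on one axis.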

If you learned something from this notebook or found it interesting, please upvote it 👍