# Comprehensive Analysis of Netflix Movies, Series, and Actors

This project provides an in-depth analysis of Netflix's extensive catalog of movies and series, including detailed insights into actors, genres, and content attributes. By examining a wide array of data, including release years, genres, descriptions, and other key metrics, the project aims to uncover patterns and trends within Netflix’s offerings.

###### Key Objectives

- Content Analysis: Explore relationships between movies, series, actors, genres, and countries of origin.
- Trend Visualization: Utilize various visualizations to identify trends in release years, genres, and content popularity over time.
- Word Cloud Generation: Create word clouds to visualize the most frequently used words in content descriptions, providing insights into common themes and topics.
- Time-to-Add Analysis: Analyze the time taken for Netflix to add content from its release date to the platform, using histograms and other visualizations to highlight trends and intervals.

# Imports

In [None]:
#imports
import pandas as pd
import numpy as np
import plotly.express as px
import re
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objects as go


In [None]:
from dash import Dash, html, dcc
from dash.dependencies import Input, Output

# Data Loading

In [None]:
df = pd.read_csv(r"netflix_titles.csv")
df.head()

In [None]:
df.info()

# Data cleaning

##### Documentation

* 2634 null values in the director column: filled with unknown
* 825 null values in the cast column: filled with unknown
* 831 null values in the country column: filled with International
* 4 null values in the rating column: replaced with the most frequent (mode) rating
* 3 null values in the duration column and 10 in the date added column: these rows are dropped
* The duration column is split into DurationMins (movies) and NoOfSeasons (TV shows)
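
The cells that follow fill the director, cast, country and rating columns and then drop any row that still contains a null. The same sequence can be sketched on a small, made-up frame; the rows below are hypothetical and only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names mirror netflix_titles.csv
toy = pd.DataFrame({
    "director": ["A. Director", np.nan, "B. Director", "C. Director"],
    "cast": [np.nan, "Actor One, Actor Two", "Actor Three", "Actor Four"],
    "country": ["India", np.nan, "Japan", "Japan"],
    "rating": ["TV-MA", "TV-MA", np.nan, "PG"],
    "duration": ["90 min", "2 Seasons", "1 Season", np.nan],
})

# Placeholder values for the free-text columns
toy = toy.fillna({"director": "unknown", "cast": "unknown", "country": "International"})
# The most frequent rating stands in for the missing one
toy["rating"] = toy["rating"].fillna(toy["rating"].mode()[0])
# Any row that still holds a null (here: the missing duration) is dropped
toy = toy.dropna().reset_index(drop=True)

print(toy)
```

Filling before dropping matters here: calling dropna first would discard every row with a missing director or cast, which is a large share of the catalog.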

In [None]:
df.info()

In [None]:
# Checking for null vals
df.isnull().sum()

In [None]:
# show_id should be a unique ID for every show
# Checking if duplicate IDs exist
df['show_id'].duplicated().sum()

Let's start by cleaning the null values in the director column.

First, let's check whether the director column has any relation to the cast column, using a copy of the data without any null values.

In [None]:

cleanedDf = df.dropna()

In [None]:
cleanedDf['director'].isin(cleanedDf['cast']).value_counts()

* The relation between director and cast is very weak, so a missing director can't be inferred from the cast
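
One caveat about the check above: `isin` compares each director name against whole cast strings, and a cast string usually lists several people. A rough sketch of a per-title check, which asks whether the director appears in that title's own cast list, could look like this. The names and the toy frame are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the dataset
toy = pd.DataFrame({
    "director": ["Person A", "Person B", "Person C"],
    "cast": ["Person A, Person X", "Person Y", "Person Z, Person C"],
})

# Exact match against full cast strings, as in the cell above
exact = toy["director"].isin(toy["cast"])

# Per-row check: is the director one of the names in this title's cast list?
in_own_cast = toy.apply(lambda row: row["director"] in row["cast"].split(", "), axis=1)

print(exact.tolist())
print(in_own_cast.tolist())
```

Either way, a director acting in their own title says nothing about who directed a title whose director is missing, so filling with unknown remains the safer choice.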

In [None]:
# Fill null values with 'unknown' in director and cast
# (assigning the result back avoids in-place replacement on a column selection)
df['cast'] = df['cast'].fillna('unknown')
df['director'] = df['director'].fillna('unknown')

In [None]:
#Fixing country col by replacing all nan values with International
df['country'] = df['country'].fillna('International')

In [None]:
df['rating'].value_counts()

In [None]:
# Replacing missing ratings with the most frequent rating
df['rating'] = df['rating'].fillna(df['rating'].mode()[0])

In [None]:
#Dropping NaN vals in the rest of the data
df.dropna(inplace=True)

In [None]:
df['duration'].unique()

In [None]:
# Parse the duration information from a string and separate it into minutes or seasons.
#Returns -> tuple: (duration in minutes, number of seasons)

def parse_duration(x):
    
    if isinstance(x, str):
        
        # Extract minutes
        mins_match = re.search(r'(\d+)\s*mins?', x, re.IGNORECASE)
        if mins_match:
            return (int(mins_match.group(1)), 0)
        
        # Extract seasons
        seasons_match = re.search(r'(\d+)\s*seasons?', x, re.IGNORECASE)
        if seasons_match:
            return (0, int(seasons_match.group(1)))
    
    return (None, None)

In [None]:
# Apply the function to create two new columns
df[['DurationMins', 'NoOfSeasons']] = df['duration'].apply(lambda x: pd.Series(parse_duration(x)))
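
The row-by-row `apply` above is fine for a dataset of this size. As an alternative, here is a minimal vectorized sketch using `Series.str.extract`. It assumes every duration looks like "90 min" or "2 Seasons"; the sample values are made up.

```python
import pandas as pd

# Hypothetical sample of duration strings
durations = pd.Series(["90 min", "1 Season", "3 Seasons", "125 min"])

# Capture the number and the unit in one pass (case-insensitive)
parts = durations.str.extract(r"(?i)^\s*(\d+)\s*(min|season)")
number = parts[0].astype(int)
unit = parts[1].str.lower()

# Same convention as parse_duration: the column that does not apply is set to 0
duration_mins = number.where(unit == "min", 0)
no_of_seasons = number.where(unit == "season", 0)

print(duration_mins.tolist(), no_of_seasons.tolist())
```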

In [None]:
# Final look at our data after cleaning
df.info()

# EDA and Data Visualization

### Ratio between Movies and TV Shows in data set
* We'll use a pie chart to visualize the percentage of TV shows and movies in the data

In [None]:
# Ratio between TV Shows and Movies on Netflix
# Count each type once and pass the counts directly to the pie chart
type_counts = df['type'].value_counts()
plot = px.pie(values=type_counts.values, names=type_counts.index,
              color_discrete_sequence=['#db0000', '#000000'], title='Number of TV Shows vs Movies on Netflix')
plot.update_layout(title_font_color='#000000')
plot.show()

### Top 10 contributing countries
* For this we'll use a bar chart to summarise the content by each country

In [None]:
# Filtering our data by getting the top 10 contributing countries and putting them in a separate df
top10 = df['country'].value_counts().nlargest(11).index
top10df = df[df['country'].isin(top10)]
plot = px.histogram(top10df, y='country', color='type', orientation='h', 
color_discrete_map= {"Movie" : 'red', "TV Show" : "black"},
title="Top 10 contributing Countries ranked", barmode='group')
plot.update_layout(title_font_color='#000000', xaxis_title='Number of content')
plot.update_yaxes(categoryorder="total ascending", title='Country')
plot.show()
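
The country column can hold several comma-separated countries for one title, so `value_counts()` on the raw column treats each combination as its own category. A possible refinement, sketched here on made-up rows, is to split the string and count each country separately with `explode`.

```python
import pandas as pd

# Hypothetical rows; co-productions list several countries in one string
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States", "United States, India", "India", "Japan"],
})

# One row per (title, country) pair
per_country = toy.assign(country=toy["country"].str.split(", ")).explode("country")
counts = per_country["country"].value_counts()

print(counts)
```

With this approach a co-production counts once for every country involved, so the totals across countries can exceed the number of titles.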

### When did Netflix add most of its content?
* We can use a line chart to summarise the distribution of content added through the years.
* We'll use a stacked bar chart to display the months in which most content gets added

In [None]:
# Derive the year and month each title was added from date_added
# (assumes date_added holds date strings that pd.to_datetime can parse)
added = pd.to_datetime(df['date_added'].str.strip())
df['year_added'] = added.dt.year
df['month_added'] = added.dt.month
# Creating a new dataframe that holds the count for each type for each year
newdf = df.groupby(['year_added', 'type']).size().reset_index(name='Count')
# Visualizing the data
plot = px.line(newdf, x='year_added', y='Count', color='type', markers=True, 
               color_discrete_map = {"Movie" : "#db0000", "TV Show" : "#000000"})
plot.update_layout(title='Line chart for content added over the years',
    xaxis_title='Year',
    yaxis_title='Number of content added', hovermode='x unified')
plot.show()

### How much content was added in each month of the year?

In [None]:
# Filtering our data by creating a new dataframe that holds the value count for each type for each month
# Creating the new data frame
s = df.groupby('month_added')['type'].value_counts()
newdf = pd.DataFrame(data=s.values, index=s.index)
newdf = newdf.reset_index()
newdf.rename(columns={0:'Count'}, inplace=True)
# Visualizing the data
plot = px.bar(newdf, x='month_added', y='Count', color='type', 
               color_discrete_map = {"Movie" : "#db0000", "TV Show" : "#000000"})
plot.update_layout(title='Stacked Bar chart for content added in each month',
    xaxis_title='Month',
    yaxis_title='Number of content added', hovermode='x unified')
plot.show()

### Top 25 actors with the most appearances on Netflix
* For this we can use a bar chart. It will summarise the distribution of actors based on number of appearances.

In [None]:
# Creating new dataframe without the unknown cast rows
castdf = df[df['cast'] != 'unknown']

In [None]:
# Function applied to each row: splits the cast string into a list of names and updates a dictionary with them
# Every time a name appears, its value is incremented by 1
# Returns the dictionary, with the actor's name as key and the number of appearances as value
def get_actor_appearance(x, cdic):
    templist = x.split(', ')
    for name in templist:
        if name in cdic:
            cdic[name] = cdic[name] + 1
        else:
            cdic[name] = 1
    
    return cdic
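
The same tally can be written with `collections.Counter` from the standard library, which handles the "increment or start at 1" logic itself. This is only an alternative sketch on made-up cast strings, not a change to the function above.

```python
from collections import Counter

# Hypothetical cast strings in the same comma-separated format
cast_rows = ["Actor One, Actor Two", "Actor Two, Actor Three", "Actor Two"]

appearances = Counter()
for row in cast_rows:
    # update() adds 1 for every name yielded
    appearances.update(name.strip() for name in row.split(", "))

# The two most frequent names with their counts
top = appearances.most_common(2)
print(top)
```

A `Counter` can be passed to `pd.Series` just like the plain dictionary, so the plotting code would not need to change.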


In [None]:
# Applying the function and switching the dictionary into a pandas series
castdict = {}
castdf['cast'].apply(lambda x: get_actor_appearance(x, castdict))
castdict = pd.Series(castdict)
castdict.describe()

In [None]:
# Taking the top 25 actor appearances and plotting the graph
castdictsample = castdict.nlargest(25)
plot = px.bar(castdictsample, x=castdictsample.index, y=castdictsample.values)
plot.update_layout(title = 'Top 25 actors by appearance', xaxis_title= 'Actor name', yaxis=dict(title='Number of appearances'))
plot.update_traces(marker_color='red')
plot.show()

### Pie chart for the Ratings on Netflix Content

* For this we can use a pie chart. It will summarise the distribution of content based on ratings.

In [None]:
# Drawing the pie chart from the rating counts
rating_counts = df['rating'].value_counts()
plot = px.pie(values=rating_counts.values, names=rating_counts.index)
plot.update_traces(text=rating_counts.index)
plot.update_layout(title='Pie chart showing ratings')
plot.show()

### Most common genres on Netflix
* We'll use a bar chart to visualize how many titles are listed in each available genre (the data holds no viewing figures)

In [None]:
# Function applied to each row: splits the listed_in string into a list of genres and updates a dictionary with them
# Every time a genre appears, its value is incremented by 1
# Returns the dictionary, with the genre as key and the number of occurrences as value
# Works just like the get_actor_appearance function

def get_genres(x, cdic):
    templist = x.split(', ')
    for name in templist:
        if name in cdic:
            cdic[name] = cdic[name] + 1
        else:
            cdic[name] = 1
    
    return cdic


In [None]:
# Applying the function
gdict = {}
df['listed_in'].apply(lambda x: get_genres(x, gdict))
# Switching the dictionary to a series
gdict = pd.Series(gdict)
# Plotting the graph
plot = px.bar(gdict, y=gdict.index, x=gdict.values, orientation='h', title = 'Most Common Genres')
plot.update_traces(marker_color='#db0000')
plot.update_layout(width=1000, height=1000, yaxis_title='Genre', xaxis_title='Count')
plot.update_yaxes(categoryorder="total ascending")
plot.show()

In [None]:
df.head()

### The Distribution of number of minutes for movies
* We'll use a violin plot to see the distribution of minutes 

In [None]:
movies = df[df['type'] == 'Movie']
plot = px.violin(movies, x='DurationMins', title='Distribution of Minutes for Movies')
# Customize layout
plot.update_layout(
    xaxis_title='Duration (Minutes)',
    yaxis_title='Density',
    bargap=0.2,  # Gap between bars
    title_font_size=20,
)
plot.update_traces(marker_color='#000000')
plot.show()

### The Distribution of number of seasons for TV shows
* We'll also use a violin plot to see the distribution of seasons 

In [None]:
seasons = df[df['type'] == 'TV Show']
plot = px.violin(seasons, x='NoOfSeasons', title='Distribution of Seasons for TV shows')
# Customize layout
plot.update_layout(
    xaxis_title='Number of Seasons',
    yaxis_title='Density',
    bargap=0.2,  # Gap between bars
    title_font_size=20,
)
plot.update_traces(marker_color='#db0000')
plot.show()

### How fast does Netflix add content to its platform?
* We'll use a histogram to visualize the time intervals between content release and its addition to Netflix.

In [None]:
df['YearsToBeAdded'] = df['year_added'] - df['release_year'] 
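
Because the subtraction works on whole years, the result is coarse: 0 means "added in the release year", and a negative value can appear if a title is listed as added before its recorded release year. A small sketch on made-up numbers shows how such cases could be inspected before plotting.

```python
import pandas as pd

# Hypothetical values; column names follow the notebook
toy = pd.DataFrame({
    "year_added": [2021, 2020, 2019, 2021],
    "release_year": [2020, 2020, 2021, 2011],
})
toy["YearsToBeAdded"] = toy["year_added"] - toy["release_year"]

# Rows where the title was added before its recorded release year
added_before_release = toy[toy["YearsToBeAdded"] < 0]
# Share of titles added in the same year they were released
same_year_share = (toy["YearsToBeAdded"] == 0).mean()

print(toy["YearsToBeAdded"].tolist(), len(added_before_release), same_year_share)
```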

In [None]:
# Create a histogram of the years between release and addition
plot = px.histogram(df, x='YearsToBeAdded', title='Time to add content to Netflix')
# Customize layout
plot.update_layout(
    xaxis_title='Years between release and addition',
    yaxis_title='Number of titles',
    title_font_size=20
)
plot.update_traces(marker_color='#000000')

plot.show()

### Most Frequent Words in Netflix Content Descriptions
* The word cloud displays the most frequently occurring words in the content descriptions. This visualization helps us understand the common themes and characteristics of content on Netflix.

In [None]:
# Combine all descriptions into a single string
text = ' '.join(df['description'].dropna())

In [None]:
# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=STOPWORDS).generate(text)

# Convert word cloud to image array
wordcloud_image = np.array(wordcloud.to_image())
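
A word cloud is a picture of word frequencies with stopwords left out. To make that concrete, here is a standard-library sketch of the counting step on two made-up descriptions. The stopword set is a tiny hypothetical one, not the `STOPWORDS` set shipped with the wordcloud package.

```python
import re
from collections import Counter

# Hypothetical descriptions
descriptions = [
    "A young detective hunts a killer in the city.",
    "The detective and the killer meet again.",
]
# Tiny illustrative stopword set (an assumption, far smaller than wordcloud's)
stopwords = {"a", "an", "and", "in", "the"}

# Lowercase, tokenize on letters and apostrophes, then drop stopwords
words = re.findall(r"[a-z']+", " ".join(descriptions).lower())
freq = Counter(w for w in words if w not in stopwords)

print(freq.most_common(3))
```

Looking at the raw counts alongside the image helps when two words appear similar in size.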

In [None]:
# Create a plotly figure with the word cloud image
fig = go.Figure()

fig.add_trace(go.Image(z=wordcloud_image))

# Customize layout
fig.update_layout(
    title='Word Cloud of Descriptions',
    height=500,
    width=800
)

fig.show()

# Creating Dash Application for more interactive charts!

In [None]:
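# Collect every distinct country that appears in the comma-separated country column
countries = []
def get_all_countries(x, countriesList):
    for country in x.split(', '):
        # Remove stray spaces or commas left over from the split
        country = country.strip(' ,')
        if country and country not in countriesList:
            countriesList.append(country)
    return countriesList
for value in df['country']:
    get_all_countries(value, countries)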


    

In [None]:
app = Dash(__name__)
app.layout = html.Div(children=[

    ############################################### First Graph #################################################
    html.Div(style={'backgroundColor': '#000000'},
        children=[
            html.H1(children='Hello Kimo!', style={'color': '#db0000'}),
            html.Div(children='This is your first Dash webapp!', style={'textAlign': 'center', 'color': '#db0000'}),
            # Dropdown for first graph; 'All' is added to the options so the default value is selectable
            dcc.Dropdown(
                id='slct_country',
                options=['All'] + countries,
                value='All', placeholder='All', maxHeight=100,
                style={'width': '65%'}
            ),
            # DCC 1st graph
            dcc.Graph(id='actor-graph', figure={}),
        ]),

    ############################################### Second Graph #################################################
    html.Div(style={'backgroundColor': '#000000'},
        children=[
            # Multi mode Dropdown for second graph (options must be a plain list)
            dcc.Dropdown(id='slct_genre', options=['All'] + list(gdict.index), multi=True, maxHeight=100,
                         style={'width': '50%'}, value=['All'], placeholder='All'),
            dcc.Graph(id='genre-graph', figure={}),
        ])
])

################################################# Top 25 actors #######################################################
@app.callback(
    Output(component_id='actor-graph', component_property='figure'),
    Input(component_id='slct_country', component_property='value')
)
def update_graph(slcted_value):
    # Top 25 actor appearances per country of origin, skipping rows with unknown cast
    # (boolean indexing returns a new frame, so the original df is not modified)
    castdff = df[df['cast'] != 'unknown']

    # If a country is selected, keep only its titles; otherwise (or when the dropdown is cleared) use all countries
    if slcted_value and slcted_value != 'All':
        castdff = castdff[castdff['country'].str.contains(slcted_value, regex=False)]
    # Getting the dictionary for actors and then casting it to a series
    # (an explicit dtype keeps nlargest working when no rows match)
    actorAppear = {}
    castdff['cast'].apply(lambda x: get_actor_appearance(x, actorAppear))
    actorAppear = pd.Series(actorAppear, dtype='int64').nlargest(25)
    # Drawing the graph
    actorAppearanceBar = px.bar(x=actorAppear.index, y=actorAppear.values,
                                title="Top 25 Actors by Appearances", template='plotly_dark', height=500)
    actorAppearanceBar.update_traces(marker_color="#db0000")
    actorAppearanceBar.update_layout(xaxis_title='Actor Name', yaxis_title='Number of Appearances')
    return actorAppearanceBar

################################################# Genre Graph #######################################################
@app.callback(
    Output(component_id='genre-graph', component_property='figure'),
    Input(component_id='slct_genre', component_property='value')
)
def update_graph2(slcted_genre):
    gdff = df
    # The dropdown may return a single string, a list, or None when cleared
    if isinstance(slcted_genre, str):
        slcted_genre = [slcted_genre]
    # Keep titles listed in at least one selected genre, unless 'All' (or nothing) is selected
    if slcted_genre and 'All' not in slcted_genre:
        wanted = set(slcted_genre)
        gdff = df[df['listed_in'].apply(lambda x: bool(wanted & set(x.split(', '))))]

    graphData = gdff.groupby('country')['listed_in'].value_counts().nlargest(20)
    graphData = pd.DataFrame(data=graphData.values, index=graphData.index).reset_index()
    graphData.rename(columns={0: 'Count'}, inplace=True)

    # Plotting the graph
    genreGraph = px.bar(graphData, y='country', x='Count', template='plotly_dark', height=500,
                        orientation='h')
    genreGraph.update_yaxes(categoryorder="total ascending")
    genreGraph.update_traces(marker_color='#db0000')
    genreGraph.update_layout(yaxis_title='Country', xaxis_title='Number of Content in the chosen Genres')
    return genreGraph

In [None]:
if __name__ == '__main__':
    # app.run replaces the older app.run_server in recent Dash releases
    app.run(debug=False)
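
The genre callback depends on how the dropdown value is matched against `listed_in`, which holds several comma-separated genres per title. Keeping that matching in a small helper makes it testable without starting the Dash server. The function below is a hypothetical sketch (`filter_by_genres` is not part of the notebook) and the rows are made up.

```python
import pandas as pd

def filter_by_genres(frame, selected):
    """Keep rows whose listed_in contains at least one selected genre.

    `selected` may be None, a single string, or a list of strings;
    None, an empty list, or a selection containing 'All' returns every row.
    """
    if isinstance(selected, str):
        selected = [selected]
    if not selected or "All" in selected:
        return frame
    wanted = set(selected)
    mask = frame["listed_in"].apply(lambda s: bool(wanted & set(s.split(", "))))
    return frame[mask]

# Hypothetical rows
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "listed_in": ["Dramas, Comedies", "Documentaries", "Comedies"],
})

comedies = filter_by_genres(toy, ["Comedies"])
print(comedies["title"].tolist())
```

Matching on the split genre names rather than on the full string means a title listed as "Dramas, Comedies" is found when either genre is selected.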