<a href="https://colab.research.google.com/github/DougFc-byte/Netflix-Content-Strategy-Case-Study/blob/main/NetflixContentStrategy2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The dataset used in this analysis was found by Aman Kharwal. For every show and movie released in 2023, it records the title, release date, language, content type (show or movie), availability status, and viewership hours.


Aman Kharwal describes Content Strategy Analysis as analyzing how content is created, released, distributed, and consumed to achieve specific goals, such as maximizing audience engagement, viewership, brand reach, or revenue.

In [2]:
# Before we start the analysis, we need to import the Python libraries and load the dataset.
# Make sure the path to the dataset is correct for your environment, whether you are using Colab like me or a local Jupyter instance.

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"

netflix_data = pd.read_csv("netflix_content_2023.csv")

# We then ask for a sample of the data using .head()
netflix_data.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,Yes,2023-03-23,812100000,English,Show
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000,Korean,Show
3,Wednesday: Season 1,Yes,2022-11-23,507700000,English,Show
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000,English,Movie


In [3]:
#I wanted more info on the dataframe:
netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24812 entries, 0 to 24811
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Title                24812 non-null  object
 1   Available Globally?  24812 non-null  object
 2   Release Date         8166 non-null   object
 3   Hours Viewed         24812 non-null  object
 4   Language Indicator   24812 non-null  object
 5   Content Type         24812 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB
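The `info()` output above also shows that `Release Date` has only 8,166 non-null values out of 24,812 rows, so 16,646 titles have no release date. A minimal sketch of how missing values could be counted per column directly; the small frame below is invented for illustration and is not taken from the dataset:

```python
import pandas as pd

# Toy frame that mimics some of the dataset's columns; all values are made up.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Date": ["2023-03-23", None, None, "2023-01-05"],
    "Hours Viewed": ["812,100,000", "1,000", "2,500", "665,100,000"],
})

# isna() flags missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The mean of a boolean mask is the share of True values.
missing_share = toy["Release Date"].isna().mean()
print(f"{missing_share:.0%} of rows have no release date")
```

The same two calls should work unchanged on `netflix_data`.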


In [7]:
#From this, I could see the important column "Hours Viewed" would need some cleaning to prepare it for data analysis.
netflix_data['Hours Viewed'] = netflix_data['Hours Viewed'].replace(',', '', regex=True).astype(float)

# In essence, this line takes the Hours Viewed column, removes any commas it finds,
# and converts the cleaned values to floating-point numbers so that we can perform
# numerical analysis on them. Next, we check whether the column's dtype changed:
netflix_data.info()
netflix_data[['Title', 'Hours Viewed']].head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24812 entries, 0 to 24811
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Title                24812 non-null  object 
 1   Available Globally?  24812 non-null  object 
 2   Release Date         8166 non-null   object 
 3   Hours Viewed         24812 non-null  float64
 4   Language Indicator   24812 non-null  object 
 5   Content Type         24812 non-null  object 
dtypes: float64(1), object(5)
memory usage: 1.1+ MB


Unnamed: 0,Title,Hours Viewed
0,The Night Agent: Season 1,812100000.0
1,Ginny & Georgia: Season 2,665100000.0
2,The Glory: Season 1 // 더 글로리: 시즌 1,622800000.0
3,Wednesday: Season 1,507700000.0
4,Queen Charlotte: A Bridgerton Story,503000000.0
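The `.astype(float)` conversion above raises an error if any value is still not numeric once the commas are removed. A more defensive variant, sketched here under the assumption that such values could exist (the toy values are invented), uses `pd.to_numeric` with `errors="coerce"` so that anything unparseable becomes `NaN` and can be inspected:

```python
import pandas as pd

# Invented sample values; "n/a" stands in for a hypothetical dirty entry.
raw = pd.Series(["812,100,000", "665,100,000", "n/a"], name="Hours Viewed")

# Strip thousands separators as plain text (no regex needed for a literal comma).
no_commas = raw.str.replace(",", "", regex=False)

# Convert to numbers; values that cannot be parsed become NaN instead of raising.
hours = pd.to_numeric(no_commas, errors="coerce")

print(hours)
print("Unparseable rows:", hours.isna().sum())
```

In this dataset the original conversion succeeded, so the coercion step is only a safeguard for messier inputs.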


In [9]:
# The "Hours Viewed" column has been successfully converted to a numeric format, so we can now analyze trends in content type to determine whether shows or movies dominate viewership.
# Let's start with the distribution of total viewership hours between shows and movies.
# First, aggregate viewership hours by content type. The pandas .groupby() method groups the rows of the DataFrame by the unique values of a column: all rows where 'Content Type' is 'Show' are grouped together, and all rows where it is 'Movie' are grouped separately.
# We then select the 'Hours Viewed' column within each group and add up its values with .sum(), which gives the total viewership hours for each content type. The result is stored in a new variable called content_type_viewership.
content_type_viewership = netflix_data.groupby('Content Type')['Hours Viewed'].sum()
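To make the group-then-sum mechanics concrete, here is a small sketch on invented rows (the numbers are not from the dataset):

```python
import pandas as pd

# Invented rows, only to show how groupby + sum behaves.
toy = pd.DataFrame({
    "Content Type": ["Show", "Movie", "Show", "Movie"],
    "Hours Viewed": [100.0, 40.0, 50.0, 10.0],
})

totals = toy.groupby("Content Type")["Hours Viewed"].sum()
print(totals)

# The result is a Series indexed by the group labels.
print(totals["Show"], totals["Movie"])
```

By default `groupby` sorts the group labels, so the resulting index is `Movie`, then `Show`. This matters when a list of colors is passed to a chart positionally: the first color goes to `Movie`.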

In [14]:
#Next, we use the plotly.graph_objects library to create and display a bar chart visualizing the total viewership hours (shows and movies):
fig = go.Figure(data=[
    go.Bar(
        x=content_type_viewership.index,
        y=content_type_viewership.values,
        marker_color=['skyblue', 'salmon']
    )
])
fig.show()

In [18]:
# While the chart looks okay, we can use the update_layout() method to customize its appearance and add attributes to the figure, such as a title, axis labels, and dimensions:
fig.update_layout(
    title="Total Viewership Hours by Content Type (2023)",
    xaxis_title="Content Type",
    yaxis_title="Total Viewership in billion Hours",
    width=800,
    height=500
)
fig.show()

We can see from the chart that "Show" content accounts for 107.7641 billion viewing hours and "Movie" content for 50.6378 billion. Shows therefore drew more than twice as many viewing hours as movies on Netflix in 2023. This suggests that Netflix's content strategy leans heavily towards shows, as they tend to attract more watched hours. It would be interesting to compare this with 2024 data to see whether preferences changed over the course of a year. At this point, I asked myself: how many movies and how many shows were released in 2023?
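The comparison between the two totals can be checked with a couple of lines of arithmetic. This sketch uses only the two figures quoted above; the variable names are mine:

```python
# Totals as quoted in the text, in billions of hours.
show_hours_b = 107.7641
movie_hours_b = 50.6378

# How many times larger the show total is than the movie total.
ratio = show_hours_b / movie_hours_b

# Shows' share of the combined total.
show_share = show_hours_b / (show_hours_b + movie_hours_b)

print(f"Shows drew {ratio:.2f}x the hours of movies")
print(f"Shows' share of total hours: {show_share:.1%}")
```

The ratio comes out at a little over 2.1, and shows account for roughly 68% of all hours viewed.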

In [19]:
# how many movies were released in 2023 and how many Shows?

content_type_counts = netflix_data['Content Type'].value_counts()

print("Number of Movies released in 2023:", content_type_counts.get('Movie', 0))
print("Number of Shows released in 2023:", content_type_counts.get('Show', 0))

Number of Movies released in 2023: 14104
Number of Shows released in 2023: 10708


So, 14,104 movies were released in 2023, compared with only 10,708 shows. Shows therefore accumulated more viewing hours from fewer titles. Next, we wanted to understand which languages contribute the most to Netflix's content consumption.
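Dividing each total by its title count gives a rough average of hours per title. The sketch below uses only the figures quoted in this notebook, and it assumes that each row counts as one title (for shows, a row is typically a single season):

```python
# Figures quoted in the text above.
show_hours, show_titles = 107.7641e9, 10708
movie_hours, movie_titles = 50.6378e9, 14104

# Average viewing hours per row of each content type.
avg_show = show_hours / show_titles
avg_movie = movie_hours / movie_titles

print(f"Average hours per show title:  {avg_show:,.0f}")
print(f"Average hours per movie title: {avg_movie:,.0f}")
print(f"Ratio: {avg_show / avg_movie:.1f}x")
```

On these numbers an average show entry draws roughly 10 million hours against about 3.6 million for a movie. Part of that gap is expected, since a season runs much longer than a single film.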


In [29]:
# The first step is to aggregate viewership hours by language. This line of code groups the Netflix data by the language of the content, calculates the total viewership hours for each language, and then sorts these totals from the highest viewership to the lowest.
language_viewership = netflix_data.groupby('Language Indicator')['Hours Viewed'].sum().sort_values(ascending=False)


In [32]:
# Next, we plot the figure:
fig = go.Figure(data=[
    go.Bar(
        x=language_viewership.index,
        y=language_viewership.values,
        marker_color='salmon'
    )
])
# I removed the fig.show() after my first look
#After this point, we once again update the layout to make it better:
fig.update_layout(
    title='Total Viewership Hours by Language (2023)',
    xaxis_title='Language',
    yaxis_title='Total Hours Viewed in billions',
    xaxis_tickangle=45,
    height=600,
    width=1000
)
fig.show()

The visualization shows that English is the most viewed content language (124B hours), followed by Korean (15.3B), Non-English (10.4B), Japanese (7.1B), and Hindi (926M). So Netflix's primary audience watches English-language content, but other languages also hold a considerable share, suggesting a diverse content strategy. Being Brazilian, I became interested in what share belongs to Brazil.
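To express these totals as shares, each figure can be divided by the sum. This is a rough sketch based only on the rounded values read off the chart above, so the percentages are approximate, and any language not quoted in the text is left out:

```python
# Totals as read off the chart, in billions of hours (926 million = 0.926 billion).
language_hours_b = {
    "English": 124.0,
    "Korean": 15.3,
    "Non-English": 10.4,
    "Japanese": 7.1,
    "Hindi": 0.926,
}

total = sum(language_hours_b.values())

# Each language's fraction of the combined total of the quoted languages.
shares = {lang: hours / total for lang, hours in language_hours_b.items()}

for lang, share in shares.items():
    print(f"{lang:<12} {share:.1%}")
```

With the full data, the equivalent one-liner would be `language_viewership / language_viewership.sum()`.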

In [35]:
# First, I wanted to know whether Portuguese (PT-BR) appears as a value in the 'Language Indicator' column.
netflix_data['Language Indicator'].unique()

array(['English', 'Korean', 'Non-English', 'Japanese', 'Hindi', 'Russian'],
      dtype=object)

Unfortunately, this dataset has no specific Portuguese category. Portuguese-language titles can only be part of the 'Non-English' content, so I was left without an answer. Next, I wanted to analyze how viewership varies with release date to identify trends over time, such as seasonality or patterns in certain months.
