
# Analyzing Netflix's Content Strategy
First, I loaded the data file from Google Drive and imported the necessary Python libraries.

In [None]:
# Load file from drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import libraries
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = "plotly_white"

In [None]:
# Load the dataset from Drive
file_path = '/content/drive/MyDrive/Data Analysis Projects/netflix_content_2023.csv'
netflix_data = pd.read_csv(file_path)

In [None]:
netflix_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24812 entries, 0 to 24811
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Title                24812 non-null  object
 1   Available Globally?  24812 non-null  object
 2   Release Date         8166 non-null   object
 3   Hours Viewed         24812 non-null  object
 4   Language Indicator   24812 non-null  object
 5   Content Type         24812 non-null  object
dtypes: object(6)
memory usage: 1.1+ MB


In [None]:
netflix_data.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,Yes,2023-03-23,812100000,English,Show
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000,Korean,Show
3,Wednesday: Season 1,Yes,2022-11-23,507700000,English,Show
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000,English,Movie


In [None]:
netflix_data.dtypes

Unnamed: 0,0
Title,object
Available Globally?,object
Release Date,object
Hours Viewed,object
Language Indicator,object
Content Type,object


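After examining the data types, I found that 'Release Date' and 'Hours Viewed' were stored as generic object (string) columns rather than as dates and numbers. Therefore, I converted 'Release Date' to datetime and 'Hours Viewed' to float, removing the commas from 'Hours Viewed' first. Note that only 8,166 of the 24,812 titles have a release date, so every analysis based on 'Release Date' covers about a third of the titles.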

In [None]:
# Convert data types
netflix_data['Hours Viewed'] = netflix_data['Hours Viewed'].replace(',', '', regex=True).astype(float)
netflix_data['Release Date'] = pd.to_datetime(netflix_data['Release Date'])
netflix_data[['Title', 'Hours Viewed', 'Release Date']].head()

Unnamed: 0,Title,Hours Viewed,Release Date
0,The Night Agent: Season 1,812100000.0,2023-03-23
1,Ginny & Georgia: Season 2,665100000.0,2023-01-05
2,The Glory: Season 1 // 더 글로리: 시즌 1,622800000.0,2022-12-30
3,Wednesday: Season 1,507700000.0,2022-11-23
4,Queen Charlotte: A Bridgerton Story,503000000.0,2023-05-04


In [None]:
netflix_data.dtypes

Unnamed: 0,0
Title,object
Available Globally?,object
Release Date,datetime64[ns]
Hours Viewed,float64
Language Indicator,object
Content Type,object
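
Because so many release dates are missing, it is worth checking how much of the data the date-based charts actually cover. The following is a minimal sketch on a toy frame, not the notebook's data: the sample rows are made up, and `errors="coerce"` is an assumption about how unparseable dates should be handled.

```python
import pandas as pd

# Toy stand-in for the Netflix data; the values are made up for illustration.
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Date": ["2023-03-23", None, "2023-01-05", None],
    "Hours Viewed": ["812,100,000", "1,000,000", "665,100,000", "900,000"],
})

# Strip the thousands separators, then convert to a numeric dtype.
sample["Hours Viewed"] = (
    sample["Hours Viewed"].str.replace(",", "", regex=False).astype(float)
)

# errors="coerce" turns missing or unparseable dates into NaT instead of raising.
sample["Release Date"] = pd.to_datetime(sample["Release Date"], errors="coerce")

# Share of rows, and of total hours, that have no release date.
missing = sample["Release Date"].isna()
missing_row_share = missing.mean()
missing_hours_share = (
    sample.loc[missing, "Hours Viewed"].sum() / sample["Hours Viewed"].sum()
)
print(f"{missing_row_share:.0%} of rows lack a release date")
print(f"{missing_hours_share:.2%} of hours belong to undated titles")
```

Running the same two ratios on `netflix_data` would show how representative the month, season, and weekday charts are.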


I will analyze content type trends to determine whether shows or movies dominate viewership. To do this, I will visualize the distribution of total viewership hours between Shows and Movies.

In [None]:
# Total viewership hours by content type
content_type_viewership = netflix_data.groupby('Content Type')['Hours Viewed'].sum()

fig = go.Figure(data=[
    go.Bar(
        x=content_type_viewership.index,
        y=content_type_viewership.values,
        marker_color=['#edada3', '#28405c']
    )
])

fig.update_layout(
    title='Total Viewership Hours by Content Type (2023)',
    xaxis_title='Content Type',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=0,
    height=500,
    width=800
)

fig.show()

The visualization reveals that shows dominated total viewership hours on Netflix in 2023 compared to movies. This suggests that Netflix’s content strategy is heavily focused on shows, as they tend to generate higher watch hours overall.  
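
A total alone cannot tell whether shows win because there are more of them or because each one is watched more. The sketch below, on made-up numbers, computes each content type's share of total hours alongside the average hours per title; the sample values are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample; the column names match the notebook's dataset.
sample = pd.DataFrame({
    "Content Type": ["Show", "Show", "Movie", "Movie"],
    "Hours Viewed": [600.0, 200.0, 150.0, 50.0],
})

by_type = sample.groupby("Content Type")["Hours Viewed"]

# Share of all viewing hours contributed by each content type.
totals = by_type.sum()
shares = totals / totals.sum()

# Average hours per title, which separates volume from per-title performance.
mean_per_title = by_type.mean()

print(shares)
print(mean_per_title)
```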

Next, I will analyze the distribution of viewership across different languages to identify which languages contribute the most to Netflix’s content consumption.

In [None]:
# Total viewership hours by language
language_viewership = netflix_data.groupby('Language Indicator')['Hours Viewed'].sum().sort_values(ascending=False)

fig = go.Figure(data=[
    go.Bar(
        x=language_viewership.index,
        y=language_viewership.values,
        marker_color='#edada3'
    )
])

fig.update_layout(
    title='Total Viewership Hours by Language (2023)',
    xaxis_title='Language',
    yaxis_title='Total Hours Viewed (in billions)',
    height=600,
    width=1000
)

fig.show()

The analysis indicates that English-language content leads in total viewership on Netflix, with Korean content following closely behind. This suggests that while English remains the dominant language among viewers, non-English content also attracts a significant audience, highlighting Netflix’s diverse content strategy.

Next, I will examine how viewership fluctuates based on release dates to identify trends over time, such as seasonal patterns or spikes in specific months.

In [None]:
# Extract the month
netflix_data['Release Month'] = netflix_data['Release Date'].dt.month

# Aggregate viewership hours by release month
monthly_viewership = netflix_data.groupby('Release Month')['Hours Viewed'].sum()

fig = go.Figure(data=[
    go.Scatter(
        x=monthly_viewership.index,
        y=monthly_viewership.values,
        mode='lines+markers',
        marker=dict(color='#28405c'),
        line=dict(color='#28405c')
    )
])

fig.update_layout(
    title='Total Viewership Hours by Release Month (2023)',
    xaxis_title='Month',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis=dict(
        tickmode='array',
        tickvals=list(range(1, 13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    height=600,
    width=1000
)

fig.show()

The graph illustrating total viewership hours by month highlights a significant increase in June, followed by a sharp rise in December. This pattern suggests that Netflix experiences peak audience engagement during these periods, potentially driven by strategic content releases, seasonal trends, or holiday-related viewership surges. Meanwhile, the middle months maintain a steady but comparatively lower viewership.  

To gain deeper insights, we can look at the top-performing titles, both shows and movies, and at the characteristics available in the data, such as language, content type, and release date, that may have contributed to their high viewership.

In [None]:
# Top 5 titles based on viewership hours
top_5_titles = netflix_data.nlargest(5, 'Hours Viewed')

top_5_titles[['Title', 'Hours Viewed', 'Language Indicator', 'Content Type', 'Release Date']]

Unnamed: 0,Title,Hours Viewed,Language Indicator,Content Type,Release Date
0,The Night Agent: Season 1,812100000.0,English,Show,2023-03-23
1,Ginny & Georgia: Season 2,665100000.0,English,Show,2023-01-05
18227,King the Land: Limited Series // 킹더랜드: 리미티드 시리즈,630200000.0,Korean,Movie,2023-06-17
2,The Glory: Season 1 // 더 글로리: 시즌 1,622800000.0,Korean,Show,2022-12-30
18214,ONE PIECE: Season 1,541900000.0,English,Show,2023-08-31


English-language shows hold the top viewership spots, but Korean content also stands out with a strong presence among the most-watched titles, highlighting its global appeal.  
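
The overall top 5 is dominated by one content type, so a ranking within each type can be more informative. This is a minimal sketch on made-up titles; the sample data and the choice of two titles per type are assumptions.

```python
import pandas as pd

# Made-up titles; the column names match the notebook's dataset.
sample = pd.DataFrame({
    "Title": ["S1", "S2", "S3", "M1", "M2", "M3"],
    "Content Type": ["Show", "Show", "Show", "Movie", "Movie", "Movie"],
    "Hours Viewed": [800.0, 650.0, 500.0, 250.0, 230.0, 100.0],
})

# Sort once, then keep the first N rows of each content type.
top_per_type = (
    sample.sort_values("Hours Viewed", ascending=False)
          .groupby("Content Type")
          .head(2)
)
print(top_per_type)
```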

Next, let's explore the viewership trends by content type to identify patterns in audience preferences.

In [None]:
# Total viewership hours by content type and release month
monthly_viewership_by_type = netflix_data.pivot_table(index='Release Month',
                                                      columns='Content Type',
                                                      values='Hours Viewed',
                                                      aggfunc='sum')
content_colors = {'Movie': '#edada3','Show': '#28405c'}

fig = go.Figure()

for content_type in monthly_viewership_by_type.columns:
    fig.add_trace(
        go.Scatter(
            x=monthly_viewership_by_type.index,
            y=monthly_viewership_by_type[content_type],
            mode='lines+markers',
            name=content_type,
            line=dict(color=content_colors.get(content_type)),
        )
    )

fig.update_layout(
    title='Viewership Trends by Content Type and Release Month (2023)',
    xaxis_title='Month',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis=dict(
        tickmode='array',
        tickvals=list(range(1, 13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    height=600,
    width=1000,
    legend_title='Content Type'
)

fig.show()

The graph compares viewership trends between movies and shows throughout 2023, revealing that shows consistently attract higher viewership, with a peak in December. In contrast, movie viewership fluctuates, showing notable increases in June and October. This pattern suggests that Netflix’s audience remains more engaged with shows year-round, while movies see periodic spikes, likely influenced by major releases or special events.

Next, let's examine how total viewership hours are distributed across different release seasons.

In [None]:
# define seasons based on release months
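def get_season(month):
    # Titles without a release date have a missing (NaN) month;
    # leave their season undefined instead of counting them as Fall
    if pd.isna(month):
        return None
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'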

# apply the season categorization to the dataset
netflix_data['Release Season'] = netflix_data['Release Month'].apply(get_season)

# aggregate viewership hours by release season
seasonal_viewership = netflix_data.groupby('Release Season')['Hours Viewed'].sum()

# order the seasons as 'Winter', 'Spring', 'Summer', 'Fall'
seasons_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_viewership = seasonal_viewership.reindex(seasons_order)

fig = go.Figure(data=[
    go.Bar(
        x=seasonal_viewership.index,
        y=seasonal_viewership.values,
        marker_color='#edada3'
    )
])

fig.update_layout(
    title='Total Viewership Hours by Release Season (2023)',
    xaxis_title='Season',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=0,
    height=500,
    width=800,
    xaxis=dict(
        categoryorder='array',
        categoryarray=seasons_order
    )
)

fig.show()

The graph shows viewership hours peaking in the Fall season at more than 80 billion hours, while Winter, Spring, and Summer stay around 20 billion hours each. This gap should be read with caution: only 8,166 of the 24,812 titles have a release date, and a season mapping that treats every month outside December through August as Fall also places the undated titles there. The Fall peak may therefore reflect titles with no release date rather than major content releases or seasonal viewing habits.
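
One way to keep undated titles out of the seasonal totals is an explicit month-to-season lookup that skips missing months. The sketch below uses made-up rows; `SEASON_BY_MONTH` is a hypothetical helper name, not something from the original notebook.

```python
import pandas as pd

# Hypothetical lookup table: every month is listed explicitly, so nothing
# falls through to a default season.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}

# Made-up rows: three dated titles and two large undated ones.
sample = pd.DataFrame({
    "Release Date": pd.to_datetime(
        ["2023-01-05", "2023-06-17", "2023-10-31", None, None]
    ),
    "Hours Viewed": [10.0, 20.0, 30.0, 400.0, 500.0],
})
sample["Release Month"] = sample["Release Date"].dt.month

# na_action="ignore" passes NaN through untouched instead of calling the lambda.
sample["Release Season"] = sample["Release Month"].map(
    lambda m: SEASON_BY_MONTH[int(m)], na_action="ignore"
)

# groupby drops rows whose key is missing, so undated titles are excluded.
seasonal = sample.groupby("Release Season")["Hours Viewed"].sum()
undated_hours = sample.loc[sample["Release Season"].isna(), "Hours Viewed"].sum()
print(seasonal)
print("Hours from undated titles:", undated_hours)
```

Reporting the undated hours next to the seasonal totals makes it clear how much viewing the chart leaves out.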

Next, let's analyze the number of content releases per month and their corresponding viewership hours to identify trends in content performance.

In [None]:
monthly_releases = netflix_data['Release Month'].value_counts().sort_index()

monthly_viewership = netflix_data.groupby('Release Month')['Hours Viewed'].sum()

fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=monthly_releases.index,
        y=monthly_releases.values,
        name='Number of Releases',
        marker_color='#edada3',
        yaxis='y1'
    )
)

fig.add_trace(
    go.Scatter(
        x=monthly_viewership.index,
        y=monthly_viewership.values,
        name='Viewership Hours',
        mode='lines+markers',
        marker=dict(color='#28405c'),
        line=dict(color='#28405c'),
        yaxis='y2'
    )
)

fig.update_layout(
    title='Monthly Release Patterns and Viewership Hours (2023)',
    xaxis=dict(
        title='Month',
        tickmode='array',
        tickvals=list(range(1, 13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    yaxis=dict(
        title='Number of Releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Hours Viewed (in billions)',
        overlaying='y',
        side='right',
        showgrid=False
    ),
    legend=dict(
        x=1.05,
        y=1,
        orientation='v',
        xanchor='left'
    ),
    height=600,
    width=1000
)

fig.show()

In [None]:
# calculate correlation between number of releases and total hours viewed
correlation = netflix_data.groupby('Release Month')['Hours Viewed'].agg(['count', 'sum']).corr()
correlation


Unnamed: 0,count,sum
count,1.0,0.639259
sum,0.639259,1.0


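Although the number of content releases remains relatively consistent throughout the year, viewership hours see a notable surge in June and a significant peak in December. The correlation of about 0.64 between the monthly release count and monthly hours viewed points to a moderate positive relationship, so release volume does matter, but with only twelve monthly data points it leaves much of the variation unexplained. This suggests that audience engagement also depends on the timing and appeal of specific content, not just on the sheer volume of releases.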
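
To put the correlation coefficient in context, it helps to square it: under a simple linear fit, r squared is the fraction of variance in one series explained by the other. The sketch below recomputes a Pearson correlation on made-up monthly figures and squares the notebook's reported value of 0.639259; the toy numbers are assumptions.

```python
import pandas as pd

# Made-up monthly aggregates: number of releases and total hours viewed.
monthly = pd.DataFrame({
    "count": [10, 12, 11, 15, 9, 14],
    "sum": [100.0, 130.0, 105.0, 160.0, 95.0, 120.0],
})

# Series.corr uses the Pearson coefficient by default,
# the same default as DataFrame.corr in the notebook.
r = monthly["count"].corr(monthly["sum"])
r_squared = r ** 2
print(f"toy data: r = {r:.3f}, r^2 = {r_squared:.3f}")

# The value reported in the notebook's output, squared.
notebook_r = 0.639259
notebook_r_squared = notebook_r ** 2
print(f"notebook: r^2 = {notebook_r_squared:.3f}")
```

For the notebook's value this gives roughly 0.41, meaning release count accounts for well under half of the month-to-month variation in hours.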

Next, let's examine whether Netflix favors certain weekdays for releasing content and how this impacts viewership patterns.

In [None]:
netflix_data['Release Day'] = netflix_data['Release Date'].dt.day_name()

weekday_releases = netflix_data['Release Day'].value_counts().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

# aggregate viewership hours by day of the week
weekday_viewership = netflix_data.groupby('Release Day')['Hours Viewed'].sum().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=weekday_releases.index,
        y=weekday_releases.values,
        name='Number of Releases',
        marker_color='#edada3',
        yaxis='y1'
    )
)

fig.add_trace(
    go.Scatter(
        x=weekday_viewership.index,
        y=weekday_viewership.values,
        name='Viewership Hours',
        mode='lines+markers',
        marker=dict(color='#28405c'),
        line=dict(color='#28405c'),
        yaxis='y2'
    )
)

fig.update_layout(
    title='Weekly Release Patterns and Viewership Hours (2023)',
    xaxis=dict(
        title='Day of the Week',
        categoryorder='array',
        categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    ),
    yaxis=dict(
        title='Number of Releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Hours Viewed (in billions)',
        overlaying='y',
        side='right',
        showgrid=False
    ),
    legend=dict(
        x=1.05,
        y=1,
        orientation='v',
        xanchor='left'
    ),
    height=600,
    width=1000
)

fig.show()

The graph reveals that the majority of content releases occur on Fridays, and titles released on Fridays also account for the most viewership hours. This suggests that Netflix strategically schedules releases ahead of the weekend to maximize audience engagement. Titles released on Saturdays and Sundays account for far fewer hours, but those days also see fewer releases. Because hours are grouped by release day rather than by the day they were watched, the chart shows which release days are tied to the most viewing, not when audiences actually watch.
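
Since Friday has both the most releases and the most hours, dividing one by the other shows whether a Friday release performs better per title. This is a minimal sketch on made-up rows; the sample values are assumptions.

```python
import pandas as pd

# Made-up releases: three on Fridays, one on a Saturday, one on a Wednesday.
sample = pd.DataFrame({
    "Release Date": pd.to_datetime(
        ["2023-01-06", "2023-01-13", "2023-01-20", "2023-01-07", "2023-01-04"]
    ),
    "Hours Viewed": [300.0, 200.0, 100.0, 250.0, 50.0],
})
sample["Release Day"] = sample["Release Date"].dt.day_name()

# Count, total, and per-title average for each release weekday.
per_day = sample.groupby("Release Day")["Hours Viewed"].agg(
    releases="count", total="sum", mean_per_title="mean"
)
print(per_day)
```

In this toy data Friday has the largest total but not the largest per-title average, which is the distinction the combined chart cannot show.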

To further explore Netflix’s content strategy, let’s analyze high-impact dates, such as holidays or major events, and their correlation with content releases.

In [None]:
# Define significant holidays and events in 2023
important_dates = [
    '2023-01-01',  # new year's day
    '2023-02-14',  # valentine's day
    '2023-07-04',  # independence day (US)
    '2023-10-31',  # halloween
    '2023-12-25'   # christmas day
]

# Convert to datetime
important_dates = pd.to_datetime(important_dates)

# Check for content releases close to these significant holidays (within a 3-day window)
holiday_releases = netflix_data[netflix_data['Release Date'].apply(
    lambda x: any((x - date).days in range(-3, 4) for date in important_dates)
)]

# Total viewership hours for releases near significant holidays
holiday_viewership = holiday_releases.groupby('Release Date')['Hours Viewed'].sum()

holiday_releases[['Title', 'Release Date', 'Hours Viewed']]

Unnamed: 0,Title,Release Date,Hours Viewed
2,The Glory: Season 1 // 더 글로리: 시즌 1,2022-12-30,622800000.0
6,La Reina del Sur: Season 3,2022-12-30,429600000.0
11,Kaleidoscope: Limited Series,2023-01-01,252500000.0
29,Perfect Match: Season 1,2023-02-14,176800000.0
124,Lady Voyeur: Limited Series // Olhar Indiscret...,2022-12-31,86000000.0
...,...,...,...
22324,The Romantics: Limited Series,2023-02-14,1000000.0
22327,Aggretsuko: Season 5 // アグレッシブ烈子: シーズン5,2023-02-16,900000.0
22966,The Lying Life of Adults: Limited Series // La...,2023-01-04,900000.0
22985,Community Squad: Season 1 // División Palermo:...,2023-02-17,800000.0


The data shows several high-performing titles released within three days of major holidays and events, which is consistent with Netflix timing content releases around these dates to maximize viewership. Some notable releases include:  

- **New Year’s Period**: Titles like *The Glory: Season 1*, *La Reina del Sur: Season 3*, and *Kaleidoscope: Limited Series* premiered around New Year’s Day, driving high engagement.  
- **Valentine’s Day**: *Perfect Match: Season 1* and *The Romantics: Limited Series* debuted on February 14th, aligning with the holiday’s romantic theme to capitalize on audience sentiment.  

This strategic scheduling suggests that Netflix leverages seasonal trends to enhance content performance and viewer engagement.
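
The holiday filter can also be written without a row-by-row `apply`. The sketch below is one possible vectorized version on made-up rows; the sample dates and the two-holiday list are assumptions, and the three-day window mirrors the one used above.

```python
import pandas as pd

holidays = pd.to_datetime(["2023-01-01", "2023-02-14"])

# Made-up releases, including one title with no release date.
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Release Date": pd.to_datetime(
        ["2022-12-30", "2023-02-18", "2023-02-14", None]
    ),
})

window = pd.Timedelta(days=3)
near_holiday = pd.Series(False, index=sample.index)
for day in holidays:
    # Missing dates give NaT differences, and NaT <= window evaluates to False,
    # so undated titles are never flagged.
    near_holiday |= (sample["Release Date"] - day).abs() <= window

holiday_sample = sample[near_holiday]
print(holiday_sample)
```

Comparing the average hours of flagged and unflagged titles would be a natural next step for testing whether holiday-adjacent releases perform better.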

# Conclusion
Netflix's content strategy focuses on maximizing viewership through strategic release timing and diverse content offerings. Shows consistently outperform movies in total watch hours, with notable spikes in December and June, indicating key release periods designed to drive engagement. The Fall season appears as the peak in the seasonal chart, although that total also includes titles without a release date, so it is weaker evidence of a seasonal effect than the monthly spikes.  

Additionally, Netflix favors Fridays for new content drops, ensuring maximum engagement heading into the weekend. Viewership trends closely follow this pattern, highlighting the effectiveness of this approach. While content releases remain steady throughout the year, viewership fluctuates significantly, suggesting that Netflix prioritizes high-impact titles and optimal release timing over sheer volume.