# Netflix Content Analysis
This notebook analyzes Netflix's 2023 content data, exploring viewership trends by content type, language, and release timing.
It is structured into the following sections:
1. **Setup**: Import libraries and configure settings.
2. **Data Loading**: Load and inspect data.
3. **Data Preprocessing**: Clean and prepare data for analysis.
4. **Exploratory Data Analysis (EDA)**: Analyze trends and patterns.
5. **Conclusion**: Summarize key findings.

---


## Setup
This section includes necessary imports and configurations.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots
pio.templates.default = "plotly_white"

## Data Loading
Here we load the dataset and perform an initial inspection.

In [2]:
df = pd.read_csv('netflix_content_2023.csv')
df.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,Yes,2023-03-23,812100000,English,Show
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000,Korean,Show
3,Wednesday: Season 1,Yes,2022-11-23,507700000,English,Show
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000,English,Movie


## Data Preprocessing
In this section, we clean and prepare the data for analysis.

In [3]:
# Remove thousands separators (if any) and store hours as floats
df['Hours Viewed'] = df['Hours Viewed'].replace(',', '', regex=True).astype(float)

In [4]:
# Encode the Yes/No flag as 1/0
df['Available Globally?'] = df['Available Globally?'].map({'Yes': 1, 'No': 0}).astype(int)

In [5]:
df.head(10)

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,1,2023-03-23,812100000.0,English,Show
1,Ginny & Georgia: Season 2,1,2023-01-05,665100000.0,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,1,2022-12-30,622800000.0,Korean,Show
3,Wednesday: Season 1,1,2022-11-23,507700000.0,English,Show
4,Queen Charlotte: A Bridgerton Story,1,2023-05-04,503000000.0,English,Movie
5,You: Season 4,1,2023-02-09,440600000.0,English,Show
6,La Reina del Sur: Season 3,0,2022-12-30,429600000.0,English,Show
7,Outer Banks: Season 3,1,2023-02-23,402500000.0,English,Show
8,Ginny & Georgia: Season 1,1,2021-02-24,302100000.0,English,Show
9,FUBAR: Season 1,1,2023-05-25,266200000.0,English,Show
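As a sanity check on the cleaning steps above, the following minimal sketch runs the same kind of conversions on a small made-up frame and counts missing values per column. The `sample` frame and its values are hypothetical and are not taken from the dataset; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical three-row sample; values are illustrative only.
sample = pd.DataFrame({
    "Title": ["Title A", "Title B", "Title C"],
    "Available Globally?": ["Yes", "No", "Yes"],
    "Release Date": ["2023-03-23", None, "2022-12-30"],
    "Hours Viewed": ["812,100,000", "1,000,000", "622800000"],
})

# Strip thousands separators, then convert to float.
sample["Hours Viewed"] = (
    sample["Hours Viewed"].str.replace(",", "", regex=False).astype(float)
)

# Map Yes/No to 1/0; an unexpected label would become NaN and make astype(int) raise.
sample["Available Globally?"] = (
    sample["Available Globally?"].map({"Yes": 1, "No": 0}).astype(int)
)

# Count missing values per column to see which later steps must handle them.
missing_counts = sample.isna().sum()
print(missing_counts)
```

Running `df.isna().sum()` on the real frame in the same way would show how many titles lack a release date before any date-based grouping is done.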


## Exploratory Data Analysis (EDA)
Here we analyze trends, patterns, and distributions in the data.

Let's start with content type and check whether shows or movies account for more viewership.

In [6]:
viewership_per_category = df['Hours Viewed'].groupby(df['Content Type']).sum()
viewership_per_category

Content Type
Movie    5.063780e+10
Show     1.077641e+11
Name: Hours Viewed, dtype: float64

In [7]:
fig = go.Figure(data=[
    go.Bar(
        x=viewership_per_category.index,
        y=viewership_per_category.values,
        marker_color=['skyblue', 'salmon']
    )
])

fig.update_layout(
    title='Total Viewership Hours by Content Type (2023)',
    xaxis_title='Content Type',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=0,
    height=500,
    width=800
)

fig.show()

Shows account for about 107.8 billion of the roughly 158.4 billion total hours viewed (about 68%), compared with about 50.6 billion hours for movies. This suggests that Netflix’s content strategy leans heavily toward shows, as they attract more watch hours overall.
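To put the comparison in relative terms, here is a small sketch that converts the two totals into shares. The figures are copied from the `groupby` output above; the variable name `category_share` is introduced here for illustration.

```python
import pandas as pd

# Totals copied from the groupby output above (hours viewed).
viewership_per_category = pd.Series(
    {"Movie": 5.063780e10, "Show": 1.077641e11}, name="Hours Viewed"
)

# Each category's share of all hours viewed.
category_share = viewership_per_category / viewership_per_category.sum()

for content_type, share in category_share.items():
    print(f"{content_type}: {share:.1%}")
```

With these totals, shows come out at roughly 68% of all hours and movies at roughly 32%.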

Next, we look at how viewership is distributed across languages.

In [8]:
language_viewership = df['Hours Viewed'].groupby(df['Language Indicator']).sum().sort_values(ascending=False)
language_viewership

Language Indicator
English        1.244417e+11
Korean         1.537840e+10
Non-English    1.043910e+10
Japanese       7.102000e+09
Hindi          9.261000e+08
Russian        1.146000e+08
Name: Hours Viewed, dtype: float64

In [9]:
fig = go.Figure(data=[
    go.Bar(
        x=language_viewership.index,
        y=language_viewership.values
    )
])

fig.update_layout(
    title='Total Viewership Hours by Language (2023)',
    xaxis_title='Language',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=45,
    height=600,
    width=1000
)

fig.show()

In [10]:
english_viewership = language_viewership['English']
total_viewership = language_viewership.sum()
other_viewership = total_viewership - english_viewership

english_share = english_viewership / total_viewership
other_share = other_viewership / total_viewership

print(f"English Viewership: {english_viewership/1e9:.2f} Billion | Share: {english_share:.2%}")
print(f"Other Language Viewership: {other_viewership/1e9:.2f} Billion | Share: {other_share:.2%}")


English Viewership: 124.44 Billion | Share: 78.56%
Other Language Viewership: 33.96 Billion | Share: 21.44%


English-language content dominates Netflix’s viewership, with 78.56% of total hours (about 124.4 billion). The remaining 21.44% (about 34.0 billion hours) comes from other languages, led by Korean at about 15.4 billion hours, which points to a meaningful, if secondary, non-English audience and a diverse content strategy.

In [11]:
# Parse release dates; rows without a date become NaT, so Month and Year are stored as floats
df['Release Date'] = pd.to_datetime(df['Release Date'], format='%Y-%m-%d')
df['Month']= df['Release Date'].dt.month
df['Month Name'] = df['Release Date'].dt.month_name()
df['Year'] = df['Release Date'].dt.year

df.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type,Month,Month Name,Year
0,The Night Agent: Season 1,1,2023-03-23,812100000.0,English,Show,3.0,March,2023.0
1,Ginny & Georgia: Season 2,1,2023-01-05,665100000.0,English,Show,1.0,January,2023.0
2,The Glory: Season 1 // 더 글로리: 시즌 1,1,2022-12-30,622800000.0,Korean,Show,12.0,December,2022.0
3,Wednesday: Season 1,1,2022-11-23,507700000.0,English,Show,11.0,November,2022.0
4,Queen Charlotte: A Bridgerton Story,1,2023-05-04,503000000.0,English,Movie,5.0,May,2023.0


In [12]:
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]

monthly_viewership = (
    df['Hours Viewed']
    .groupby(df['Month Name'])
    .sum()
    .reindex(month_order)
)

monthly_viewership

Month Name
January      7.271600e+09
February     7.103700e+09
March        7.437100e+09
April        6.865700e+09
May          7.094600e+09
June         8.522000e+09
July         6.524800e+09
August       6.817800e+09
September    7.262200e+09
October      8.123200e+09
November     7.749500e+09
December     1.005580e+10
Name: Hours Viewed, dtype: float64

In [13]:
fig = go.Figure(data=[
    go.Scatter(
        x=monthly_viewership.index,
        y=monthly_viewership.values,
        mode='lines+markers',
        marker=dict(color='blue'),
        line=dict(color='lightblue')
    )
])


fig.update_layout(
    title='Total Monthly Viewership Hours (2023)',
    xaxis_title='Months',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=45,
    height=600,
    width=1000
)

fig.show()

The graph shows total viewership hours grouped by release month. Titles released in December accumulate the most hours (about 10.1 billion), followed by June (about 8.5 billion) and October (about 8.1 billion), while July is the lowest at about 6.5 billion. This suggests that Netflix experiences spikes in audience engagement around these periods, possibly due to strategic content releases, seasonal trends, or holidays, while the remaining months stay in a fairly narrow band between about 6.5 and 7.8 billion hours.

In [14]:
# nlargest already returns rows sorted by 'Hours Viewed' in descending order
top_5_shows = df[df['Content Type'] == 'Show'].nlargest(5, 'Hours Viewed')
top_5_movies = df[df['Content Type'] == 'Movie'].nlargest(5, 'Hours Viewed')

top_5_shows[['Title', 'Hours Viewed', 'Release Date', 'Language Indicator']]

Unnamed: 0,Title,Hours Viewed,Release Date,Language Indicator
0,The Night Agent: Season 1,812100000.0,2023-03-23,English
1,Ginny & Georgia: Season 2,665100000.0,2023-01-05,English
2,The Glory: Season 1 // 더 글로리: 시즌 1,622800000.0,2022-12-30,Korean
18214,ONE PIECE: Season 1,541900000.0,2023-08-31,English
3,Wednesday: Season 1,507700000.0,2022-11-23,English


In [15]:
top_5_movies[['Title', 'Hours Viewed', 'Release Date', 'Language Indicator']]

Unnamed: 0,Title,Hours Viewed,Release Date,Language Indicator
18227,King the Land: Limited Series // 킹더랜드: 리미티드 시리즈,630200000.0,2023-06-17,Korean
4,Queen Charlotte: A Bridgerton Story,503000000.0,2023-05-04,English
18216,Who is Erin Carter?: Limited Series,286200000.0,2023-08-24,English
18217,Lupin: Part 3,274300000.0,2023-10-05,English
18215,Dear Child: Limited Series // Liebes Kind: Min...,252800000.0,2023-09-07,English


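English-language titles dominate the top viewership spots, but Korean content also has a notable presence, which indicates its global popularity. Note that several entries labeled `Movie` in the dataset, such as "King the Land: Limited Series" and "Lupin: Part 3", are series, so the `Content Type` labels should be treated with some caution.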

In [16]:
monthly_viewership_by_type = df.pivot_table(
    index='Month Name',
    columns='Content Type',
    values='Hours Viewed',
    aggfunc='sum'
).reindex(month_order)


fig = go.Figure()

for content_type in monthly_viewership_by_type.columns:
    fig.add_trace(
        go.Scatter(
            x=monthly_viewership_by_type.index,
            y=monthly_viewership_by_type[content_type],
            mode='lines+markers',
            name=content_type
        )
    )

fig.update_layout(
    title='Viewership Trends by Content Type and Release Month (2023)',
    xaxis_title='Month',
    yaxis_title='Total Hours Viewed (in billions)',
    height=600,
    width=1000,
    legend_title='Content Type'
)

fig.show()

The graph illustrates viewership for movies and shows by release month. Shows consistently attract higher viewership than movies and reach their peak in December. Movie viewership is more variable, with significant increases in June and October. This suggests that Netflix’s audience engages more with shows year-round, while movie viewership sees occasional surges, likely driven by specific releases or events.

Now, let’s examine the distribution of total viewership hours across various release seasons:

In [17]:
def get_season(month):
    # Rows with a missing release date have no month and get no season;
    # without this check they would fall through to the 'Fall' branch
    if pd.isna(month):
        return None
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['Release Season'] = df['Month'].apply(get_season)

seasonal_viewership = df.groupby('Release Season')['Hours Viewed'].sum()

# ordered as 'Winter', 'Spring', 'Summer', 'Fall'
seasons_order = ['Winter', 'Spring', 'Summer', 'Fall']
seasonal_viewership = seasonal_viewership.reindex(seasons_order)

fig = go.Figure(data=[
    go.Bar(
        x=seasonal_viewership.index,
        y=seasonal_viewership.values,
        marker_color='orange'
    )
])

fig.update_layout(
    title='Total Viewership Hours by Release Season (2023)',
    xaxis_title='Season',
    yaxis_title='Total Hours Viewed (in billions)',
    xaxis_tickangle=0,
    height=500,
    width=800,
    xaxis=dict(
        categoryorder='array',
        categoryarray=seasons_order
    )
)

fig.show()

Once titles without a release date are excluded, viewership is spread fairly evenly across release seasons. Summing the monthly totals above gives about 24.4 billion hours for Winter, 23.1 billion for Fall, 21.9 billion for Summer, and 21.4 billion for Spring. Titles with no release date account for roughly 67.6 billion of the 158.4 billion total hours; assigning them to Fall through a catch-all `else` branch would create an apparent Fall peak of more than 80 billion hours that does not reflect seasonality.
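Because `Month` is stored as a float, some rows have no release date, and a season mapper with a catch-all `else` assigns those rows to Fall. The sketch below shows one way to guard against that. The function name `get_season_strict` and the `demo` frame are hypothetical and use made-up values.

```python
import math
import pandas as pd

def get_season_strict(month):
    """Map a month number to a season; return None when the month is missing."""
    if month is None or (isinstance(month, float) and math.isnan(month)):
        return None
    month = int(month)
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    if month in (9, 10, 11):
        return "Fall"
    raise ValueError(f"unexpected month: {month}")

# Hypothetical mini-frame: two dated titles and one with no release date.
demo = pd.DataFrame({
    "Month": [12.0, 6.0, float("nan")],
    "Hours Viewed": [10.0, 8.0, 50.0],
})
demo["Release Season"] = demo["Month"].apply(get_season_strict)

# groupby drops rows whose key is missing, so undated hours are excluded.
seasonal = demo.groupby("Release Season")["Hours Viewed"].sum()
print(seasonal)
```

Making the Fall branch explicit and raising on anything else means an unexpected value surfaces as an error instead of silently inflating one season.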

Now, let’s analyze the number of content releases and their viewership hours across months:

In [18]:
monthly_releases = df['Month'].value_counts().sort_index()

monthly_viewership = df.groupby('Month')['Hours Viewed'].sum()


In [19]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=monthly_releases.index,
        y=monthly_releases.values,
        name='Number of Releases',
        marker_color='goldenrod', 
        opacity=0.7,
        yaxis='y1'
    )
)

fig.add_trace(
    go.Scatter(
        x=monthly_viewership.index,
        y=monthly_viewership.values,
        name='Viewership Hours',
        mode='lines+markers',
        marker=dict(color='red'),
        line=dict(color='red'),
        yaxis='y2'
    )
)

fig.update_layout(
    title='Monthly Release Patterns and Viewership Hours (2023)',
    xaxis=dict(
        title='Month',
        tickmode='array',
        tickvals=list(range(1, 13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    yaxis=dict(
        title='Number of Releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Hours Viewed (in billions)',
        overlaying='y',
        side='right',
        showgrid=False
    ),
    legend=dict(
        x=1.05,  
        y=1,
        orientation='v',
        xanchor='left'
    ),
    height=600,
    width=1000
)

fig.show()

Although the number of releases remains relatively consistent throughout the year, viewership hours show a sharp spike in June and a notable rise in December. This suggests that audience engagement is not solely driven by the quantity of releases but is influenced by the timing and appeal of specific content during these months.
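One way to separate release volume from per-title appeal is to divide each month's total hours by its number of releases. The sketch below illustrates the idea on a made-up frame; the `sample` data and the column names `releases`, `total_hours`, and `hours_per_release` are assumptions introduced for this example.

```python
import pandas as pd

# Hypothetical sample; months and hours are illustrative only.
sample = pd.DataFrame({
    "Month": [6, 6, 12, 12, 12, 7],
    "Hours Viewed": [400.0, 200.0, 300.0, 300.0, 300.0, 100.0],
})

# Count releases and sum hours per release month in one pass.
monthly = sample.groupby("Month")["Hours Viewed"].agg(
    releases="count", total_hours="sum"
)

# Average hours per title released in that month.
monthly["hours_per_release"] = monthly["total_hours"] / monthly["releases"]
print(monthly)
```

A month with a high total but an average in line with other months would point to volume, while a high average would point to a few standout titles.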

Next, let’s explore whether Netflix has a preference for releasing content on specific weekdays and how this influences viewership patterns:

In [20]:
df['Release Day'] = df['Release Date'].dt.day_name()

weekday_releases = df['Release Day'].value_counts().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

weekday_viewership = df.groupby('Release Day')['Hours Viewed'].sum().reindex(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
)

In [21]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=weekday_releases.index,
        y=weekday_releases.values,
        name='Number of Releases',
        marker_color='blue',
        opacity=0.6,
        yaxis='y1'
    )
)

fig.add_trace(
    go.Scatter(
        x=weekday_viewership.index,
        y=weekday_viewership.values,
        name='Viewership Hours',
        mode='lines+markers',
        marker=dict(color='red'),
        line=dict(color='red'),
        yaxis='y2'
    )
)

fig.update_layout(
    title='Weekly Release Patterns and Viewership Hours (2023)',
    xaxis=dict(
        title='Day of the Week',
        categoryorder='array',
        categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    ),
    yaxis=dict(
        title='Number of Releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Hours Viewed (in billions)',
        overlaying='y',
        side='right',
        showgrid=False
    ),
    legend=dict(
        x=1.05,  
        y=1,
        orientation='v',
        xanchor='left'
    ),
    height=600,
    width=1000
)

fig.show()

The graph shows that most releases happen on Fridays, and titles released on Fridays also account for the largest share of viewership hours. This suggests that Netflix schedules releases toward the weekend to maximize audience engagement. Titles released on Saturdays and Sundays are fewer and accumulate far fewer hours. Note that the chart groups hours by release weekday, not by the day on which viewing took place, so it shows which release days are most impactful rather than when audiences actually watch.

To further understand the strategy, let’s explore specific high-impact dates, such as holidays or major events, and their correlation with content releases:

In [22]:
important_dates = [
    '2023-01-01',  # new year's day
    '2023-02-14',  # valentine's day
    '2023-07-04',  # independence day
    '2023-10-31',  # halloween
    '2023-12-25'   # christmas day
]

important_dates = pd.to_datetime(important_dates)

holiday_releases = df[df['Release Date'].apply(
    lambda x: any((x - date).days in range(-3, 4) for date in important_dates)
)]

holiday_viewership = holiday_releases.groupby('Release Date')['Hours Viewed'].sum()

holiday_releases[['Title', 'Release Date', 'Hours Viewed']]

Unnamed: 0,Title,Release Date,Hours Viewed
2,The Glory: Season 1 // 더 글로리: 시즌 1,2022-12-30,622800000.0
6,La Reina del Sur: Season 3,2022-12-30,429600000.0
11,Kaleidoscope: Limited Series,2023-01-01,252500000.0
29,Perfect Match: Season 1,2023-02-14,176800000.0
124,Lady Voyeur: Limited Series // Olhar Indiscret...,2022-12-31,86000000.0
...,...,...,...
22324,The Romantics: Limited Series,2023-02-14,1000000.0
22327,Aggretsuko: Season 5 // アグレッシブ烈子: シーズン5,2023-02-16,900000.0
22966,The Lying Life of Adults: Limited Series // La...,2023-01-04,900000.0
22985,Community Squad: Season 1 // División Palermo:...,2023-02-17,800000.0
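The row-wise `apply` with `any(...)` above is easy to read but loops in Python for every title. As a hedged alternative, the sketch below computes the same "within three days of any holiday" test with NumPy broadcasting. The `release_dates` values are made up, and only two of the holidays are used to keep the example short.

```python
import numpy as np
import pandas as pd

important_dates = pd.to_datetime(["2023-01-01", "2023-02-14"])

# Hypothetical release dates, including a missing one.
release_dates = pd.Series(
    pd.to_datetime(["2022-12-30", "2023-01-05", "2023-02-14", None])
)

# Pairwise day differences: rows are releases, columns are holidays.
diff_days = (
    release_dates.to_numpy()[:, None] - important_dates.to_numpy()[None, :]
) / np.timedelta64(1, "D")

# A release qualifies if it is within 3 days of any holiday.
# Missing dates produce NaN, which compares False and is excluded.
near_holiday = (np.abs(diff_days) <= 3).any(axis=1)
print(near_holiday)
```

On the real frame, `df[near_holiday]` with `release_dates = df['Release Date']` would select the same rows as the `apply`-based filter, assuming all dates are whole days.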


The table shows a number of releases clustered around key holidays and events. Notable examples include:

### New Year's Period:
- **The Glory: Season 1**
- **La Reina del Sur: Season 3**
- **Kaleidoscope: Limited Series**

These releases, timed close to New Year's Day, each drew more than 250 million viewing hours.

### Valentine's Day:
- **Perfect Match: Season 1**
- **The Romantics: Limited Series**

Released on **February 14th**, these titles align with a romantic theme that fits the holiday.

## Conclusion
Netflix's content strategy appears to revolve around maximizing viewership through targeted release timing and content variety. Shows account for roughly twice the viewing hours of movies, and titles released in December and June accumulate the most hours, which may indicate strategic releases around these periods. Once titles with missing release dates are excluded, release seasons differ only modestly, with Winter slightly ahead; an apparent Fall peak appears only if undated titles are counted as Fall. Most content is released on Fridays, right before the weekend, and titles released on Fridays also account for the most viewing hours. While the number of releases is fairly steady throughout the year, viewership varies, which suggests that high-impact titles and release timing matter more than sheer volume.