<a href="https://colab.research.google.com/github/ManuKashyap01/Netflix_Data_Analysis/blob/main/Netflix_Content_Data_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To analyze Netflix content, we use a dataset that records the title, release date, language, content type (show or movie), global availability, and hours viewed for shows and movies watched on Netflix in 2023. Some of these titles were released before 2023.

In [1]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio

data_netflix = pd.read_csv('netflix_content_2023.csv')
data_netflix.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,Yes,2023-03-23,812100000,English,Show
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000,Korean,Show
3,Wednesday: Season 1,Yes,2022-11-23,507700000,English,Show
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000,English,Movie


# **Data cleaning**
We clean the 'Hours Viewed' column by removing any comma separators and converting it to a floating-point type so that it can be used in numeric analysis.

In [6]:
data_netflix['Hours Viewed'] = data_netflix['Hours Viewed'].replace(',','',regex=True).astype(float)
data_netflix.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type
0,The Night Agent: Season 1,Yes,2023-03-23,812100000.0,English,Show
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000.0,English,Show
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000.0,Korean,Show
3,Wednesday: Season 1,Yes,2022-11-23,507700000.0,English,Show
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000.0,English,Movie
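
The cleaning step above can be checked on a small sample. This is a minimal sketch, assuming the raw column holds strings with comma thousands separators; the sample values and the names `raw_hours` and `clean_hours` are illustrative and are not read from the dataset file.

```python
import pandas as pd

# Hypothetical raw values, assuming comma thousands separators in the CSV
raw_hours = pd.Series(['812,100,000', '665,100,000', '622,800,000'])

# Remove the commas as literal characters, then convert the strings to floats
clean_hours = raw_hours.str.replace(',', '', regex=False).astype(float)

print(clean_hours.dtype)
print(clean_hours.tolist())
```

Using `.str.replace` with `regex=False` treats the comma as a plain character, which is one way to make the intent explicit; the `replace(..., regex=True)` call in the cell above has the same effect on values like these.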


# **Data Analysis 1**
Let's find the 5 most successful titles (both shows and movies) by hours viewed.

In [56]:
top_5_titles = data_netflix.nlargest(5, 'Hours Viewed')

top_5_titles[['Title', 'Hours Viewed', 'Language Indicator', 'Content Type', 'Release Date']]

Unnamed: 0,Title,Hours Viewed,Language Indicator,Content Type,Release Date
0,The Night Agent: Season 1,812100000.0,English,Show,2023-03-23
1,Ginny & Georgia: Season 2,665100000.0,English,Show,2023-01-05
18227,King the Land: Limited Series // 킹더랜드: 리미티드 시리즈,630200000.0,Korean,Movie,2023-06-17
2,The Glory: Season 1 // 더 글로리: 시즌 1,622800000.0,Korean,Show,2022-12-30
18214,ONE PIECE: Season 1,541900000.0,English,Show,2023-08-31


The top 5 most-viewed Netflix titles in 2023 are:

1. **The Night Agent: Season 1 (English, Show)** – 812.1 million hours viewed
2. **Ginny & Georgia: Season 2 (English, Show)** – 665.1 million hours viewed
3. **King the Land: Limited Series (Korean, Movie)** – 630.2 million hours viewed
4. **The Glory: Season 1 (Korean, Show)** – 622.8 million hours viewed
5. **ONE PIECE: Season 1 (English, Show)** – 541.9 million hours viewed

While English-language shows dominate the top spots, the strong performance of Korean content highlights its growing global popularity.

# **Data Analysis 2**
Let's analyze whether shows or movies dominate viewership and visualize how total viewership hours are split between the two.

In [11]:
content_type_distribution = data_netflix.groupby('Content Type')['Hours Viewed'].sum()
content_type_distribution.head()

Unnamed: 0_level_0,Hours Viewed
Content Type,Unnamed: 1_level_1
Movie,50637800000.0
Show,107764100000.0


In [30]:
figure = go.Figure(
  data=[
    go.Bar(
      x=content_type_distribution.index,
      y=content_type_distribution.values,
      marker_color=['blue','red']
    )],
  layout=go.Layout(
      title='Distribution of Total Viewership Hours Between Shows and Movies',
      xaxis_title='Content Type',
      yaxis_title='Total Viewership Hours (in Billions)',
      width=900,
      height=700
  )
)
figure.show()

The visualization shows that in 2023, people spent far more time watching shows on Netflix than movies: about 107.8 billion hours for shows versus about 50.6 billion hours for movies.
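
To express the comparison as proportions, the totals can be converted to percentage shares. The sketch below copies the two totals from the `groupby` output above; the names `content_type_totals` and `content_type_share` are illustrative.

```python
import pandas as pd

# Totals copied from the groupby output above (hours viewed)
content_type_totals = pd.Series(
    {'Movie': 50_637_800_000.0, 'Show': 107_764_100_000.0},
    name='Hours Viewed',
)

# Share of total viewing hours per content type, as a percentage
content_type_share = (
    content_type_totals / content_type_totals.sum() * 100
).round(1)

print(content_type_share)
```

In the notebook itself, the same expression could be applied directly to `content_type_distribution` instead of the copied values.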

# **Data Analysis 3**
Let's analyze and visualize the distribution of viewership across different languages.

In [28]:
language_type_distribution = data_netflix.groupby('Language Indicator')['Hours Viewed'].sum().sort_values(ascending=False)
language_type_distribution.head()

Unnamed: 0_level_0,Hours Viewed
Language Indicator,Unnamed: 1_level_1
English,124441700000.0
Korean,15378400000.0
Non-English,10439100000.0
Japanese,7102000000.0
Hindi,926100000.0


In [29]:
figure = go.Figure(
    data=[
        go.Bar(
            x=language_type_distribution.index,
            y=language_type_distribution.values,
            marker_color='blue'
        )
    ],
    layout=go.Layout(
        title='Total viewership by Language (2023)',
        xaxis_title='Language',
        yaxis_title='Total Viewership Hours (in Billions)',
        width=900,
        height=700
    )
)
figure.show()

Netflix's viewership is heavily dominated by English-language content, at about 124.4 billion hours, reflecting its strong appeal to a global audience. However, the significant popularity of non-English content, led by Korean titles at about 15.4 billion hours, highlights Netflix's commitment to offering a diverse range of content.

# **Data Analysis 4**
Let's analyze and visualize the distribution of viewership across the months of the year.

In [31]:
data_netflix['Release Date'] = pd.to_datetime(data_netflix['Release Date'])
data_netflix['Release Month'] = data_netflix['Release Date'].dt.month
data_netflix.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type,Release Month
0,The Night Agent: Season 1,Yes,2023-03-23,812100000.0,English,Show,3.0
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000.0,English,Show,1.0
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000.0,Korean,Show,12.0
3,Wednesday: Season 1,Yes,2022-11-23,507700000.0,English,Show,11.0
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000.0,English,Movie,5.0
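
The `Release Month` values above appear as floats (`3.0`, `1.0`), which typically happens in pandas when a column of integers also contains missing values. The sketch below assumes that some rows have no release date; the `sample` frame is hypothetical. It shows one way to keep whole-number months by using the nullable `Int64` dtype.

```python
import pandas as pd

# Hypothetical sample: the second row has no release date
sample = pd.DataFrame({'Release Date': ['2023-03-23', None, '2022-12-30']})

# Missing or unparseable dates become NaT instead of raising an error
sample['Release Date'] = pd.to_datetime(sample['Release Date'], errors='coerce')

# .dt.month returns floats when NaT is present; Int64 keeps integers plus <NA>
sample['Release Month'] = sample['Release Date'].dt.month.astype('Int64')

print(sample)
```

Whether to keep the float months or convert them is a presentation choice; the `groupby` results are the same either way because rows with a missing month are excluded from the groups by default.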


In [35]:
monthly_distribution = data_netflix.groupby('Release Month')['Hours Viewed'].sum()
monthly_distribution.head()

Unnamed: 0_level_0,Hours Viewed
Release Month,Unnamed: 1_level_1
1.0,7271600000.0
2.0,7103700000.0
3.0,7437100000.0
4.0,6865700000.0
5.0,7094600000.0


In [36]:
figure = go.Figure(
    data=[
        go.Scatter(
            x=monthly_distribution.index,
            y=monthly_distribution.values,
            mode='lines+markers',
            marker_color='blue',
            line_color='blue',
        )
    ],
    layout=go.Layout(
        title='Total viewership by Release Month (2023)',
        yaxis_title='Total Viewership Hours (in Billions)',
        xaxis=dict(
            title='Release Month',
            tickmode='array',
            tickvals=list(range(1,13)),
            ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
        ),
        width=900,
        height=700
    )
)
figure.show()

The graph shows total viewership grouped by release month, with a noticeable increase for June and a sharp surge for December. These spikes suggest that titles released in those months draw higher audience engagement, likely driven by strategic content launches, seasonal trends, or holiday seasons. Meanwhile, the other months maintain a steady but relatively lower level of viewership.

# **Data Analysis 5**
Let's analyze the distribution of viewership by content type and release month.

In [37]:
monthly_distribution_by_type = data_netflix.pivot_table(index='Release Month', columns='Content Type', values='Hours Viewed', aggfunc='sum')
monthly_distribution_by_type.head()

Content Type,Movie,Show
Release Month,Unnamed: 1_level_1,Unnamed: 2_level_1
1.0,2275900000.0,4995700000.0
2.0,1654400000.0,5449300000.0
3.0,2109400000.0,5327700000.0
4.0,2757600000.0,4108100000.0
5.0,2520500000.0,4574100000.0


In [38]:
figure = go.Figure()

for content_type in monthly_distribution_by_type.columns:
  figure.add_trace(
      go.Scatter(
          x=monthly_distribution_by_type.index,
          y=monthly_distribution_by_type[content_type],
          mode='lines+markers',
          name=content_type
      )
  )

figure.update_layout(
    title='Total viewership by Content Type and Release Month (2023)',
    yaxis_title='Total Viewership Hours (in Billions)',
    xaxis=dict(
        title='Release Month',
        tickmode='array',
        tickvals=list(range(1,13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    legend_title='Content Type',
    width=900,
    height=700
)
figure.show()



The graph compares 2023 viewership trends, showing shows consistently outperforming movies, with a peak in December. Movie viewership fluctuates, spiking in June and October, suggesting audience engagement varies with specific releases or events.

# **Data Analysis 6**
Let's analyze the distribution of content releases and viewership across the months of the year.

In [50]:
monthly_release_distribution = data_netflix['Release Month'].value_counts().sort_index()
monthly_release_distribution.head()

Unnamed: 0_level_0,count
Release Month,Unnamed: 1_level_1
1.0,608
2.0,560
3.0,690
4.0,647
5.0,624


In [51]:
monthly_viewership_distribution = data_netflix.groupby('Release Month')['Hours Viewed'].sum()
monthly_viewership_distribution.head()

Unnamed: 0_level_0,Hours Viewed
Release Month,Unnamed: 1_level_1
1.0,7271600000.0
2.0,7103700000.0
3.0,7437100000.0
4.0,6865700000.0
5.0,7094600000.0


In [54]:
figure = go.Figure()

figure.add_trace(
    go.Bar(
        x=monthly_release_distribution.index,
        y=monthly_release_distribution.values,
        marker_color='blue',
        opacity=0.7,
        name='Number of releases',
        yaxis='y1'
    )
)

figure.add_trace(
    go.Scatter(
        x=monthly_viewership_distribution.index,
        y=monthly_viewership_distribution.values,
        mode='lines+markers',
        marker_color='red',
        name='Total viewership',
        yaxis='y2'
    )
)

figure.update_layout(
    title='Number of releases and total viewership by Release Month (2023)',
    xaxis=dict(
        title='Month of Year',
        tickmode='array',
        tickvals=list(range(1,13)),
        ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    ),
    yaxis=dict(
        title='Number of releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Viewership Hours (in Billions)',
        overlaying='y',
        showgrid=False,
        side='right',
    ),
    legend=dict(
        x=1.05,
        y=1,
    ),
    width=900,
    height=700
)

figure.show()

Although the number of releases remains fairly steady, viewership spikes in June and December, the latter likely due to the holiday season. This suggests that timing and the appeal of specific titles drive viewership more than the sheer number of releases.
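
One way to probe how closely viewership follows release volume is a Pearson correlation between the two monthly series. The sketch below is only illustrative: it uses the five months (January to May) shown in the outputs above, so it says little about June or December, and the names `monthly_releases`, `monthly_hours`, and `release_view_corr` are assumptions for this example.

```python
import pandas as pd

months = [1, 2, 3, 4, 5]

# Release counts copied from the value_counts output above (Jan to May only)
monthly_releases = pd.Series([608, 560, 690, 647, 624], index=months)

# Total hours viewed copied from the groupby output above (Jan to May only)
monthly_hours = pd.Series(
    [7_271_600_000.0, 7_103_700_000.0, 7_437_100_000.0,
     6_865_700_000.0, 7_094_600_000.0],
    index=months,
)

# Pearson correlation between releases per month and hours viewed per month
release_view_corr = monthly_releases.corr(monthly_hours)

print(f"Correlation over the months shown: {release_view_corr:.2f}")
```

In the notebook, the full-year check would be `monthly_release_distribution.corr(monthly_viewership_distribution)`; a value near zero would be consistent with the reading that release volume alone does not explain viewership.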

# **Data Analysis 7**
Let's analyze the distribution of content releases and viewership across the days of the week.

In [39]:
data_netflix['Release Day'] = data_netflix['Release Date'].dt.day_name()
data_netflix.head()

Unnamed: 0,Title,Available Globally?,Release Date,Hours Viewed,Language Indicator,Content Type,Release Month,Release Day
0,The Night Agent: Season 1,Yes,2023-03-23,812100000.0,English,Show,3.0,Thursday
1,Ginny & Georgia: Season 2,Yes,2023-01-05,665100000.0,English,Show,1.0,Thursday
2,The Glory: Season 1 // 더 글로리: 시즌 1,Yes,2022-12-30,622800000.0,Korean,Show,12.0,Friday
3,Wednesday: Season 1,Yes,2022-11-23,507700000.0,English,Show,11.0,Wednesday
4,Queen Charlotte: A Bridgerton Story,Yes,2023-05-04,503000000.0,English,Movie,5.0,Thursday


In [40]:
weekday_release_distribution = data_netflix['Release Day'].value_counts().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
weekday_release_distribution.head()

Unnamed: 0_level_0,count
Release Day,Unnamed: 1_level_1
Monday,436
Tuesday,995
Wednesday,1310
Thursday,1145
Friday,3863


In [41]:
weekday_viewership_distribution = data_netflix.groupby('Release Day')['Hours Viewed'].sum().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
weekday_viewership_distribution.head()

Unnamed: 0_level_0,Hours Viewed
Release Day,Unnamed: 1_level_1
Monday,3954500000.0
Tuesday,5562300000.0
Wednesday,15744100000.0
Thursday,20292800000.0
Friday,38217200000.0


In [48]:
figure = go.Figure()

figure.add_trace(
  go.Bar(
      x=weekday_release_distribution.index,
      y=weekday_release_distribution.values,
      marker_color='blue',
      opacity=0.7,
      name='Number of releases',
      yaxis='y1'
  )
)

figure.add_trace(
    go.Scatter(
        x=weekday_viewership_distribution.index,
        y=weekday_viewership_distribution.values,
        mode='lines+markers',
        marker_color='red',
        name='Total viewership',
        yaxis='y2'
    )
)

figure.update_layout(
    title='Number of releases and total viewership by Release Day (2023)',
    xaxis=dict(
        title='Day of week',
        categoryorder='array',
        categoryarray=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    ),
    yaxis=dict(
        title='Number of releases',
        showgrid=False,
        side='left'
    ),
    yaxis2=dict(
        title='Total Viewership Hours (in Billions)',
        overlaying='y',
        showgrid=False,
        side='right',
    ),
    legend=dict(
        x=1.05,
        y=1,
    ),
    width=900,
    height=700
)

figure.show()

The graph shows that most content is released on Fridays, and titles released on Fridays also account for the most viewing hours. This suggests Netflix targets the weekend for maximum engagement. Viewership drops sharply for titles released on Saturdays and Sundays, which may indicate that audiences pick up new content early in the weekend, making Friday the key day for both releases and viewership.
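
Because Friday also has by far the most releases, total hours and hours per title can tell different stories. The sketch below divides total hours by the number of releases for the five weekdays shown in the outputs above; Saturday and Sunday are omitted because their values are not displayed there. The names `weekday_releases`, `weekday_hours`, and `hours_per_release` are illustrative.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Release counts copied from the value_counts output above
weekday_releases = pd.Series([436, 995, 1310, 1145, 3863], index=days)

# Total hours viewed copied from the groupby output above
weekday_hours = pd.Series(
    [3_954_500_000.0, 5_562_300_000.0, 15_744_100_000.0,
     20_292_800_000.0, 38_217_200_000.0],
    index=days,
)

# Average hours viewed per released title, in millions of hours
hours_per_release = (weekday_hours / weekday_releases / 1e6).round(2)

print(hours_per_release)
```

On these five days, the per-title average is highest for Thursday releases, while Friday leads on total hours largely through volume. In the notebook, the same division can be applied to `weekday_viewership_distribution` and `weekday_release_distribution` to include the weekend.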

# **Conclusion**
Netflix’s strategy focuses on maximizing viewership through strategic release timing and content variety. Shows outperform movies, with viewership spiking in December and June, driven by targeted releases. Fridays see the most releases, capturing viewers just before the weekend, and viewership aligns with this pattern. While releases are steady, viewership fluctuates, indicating a focus on high-impact titles and optimal timing rather than volume.