
#EDA (Exploratory Data Analysis) Disney+ Movies and TV Shows
The dataset, sourced from Kaggle, describes the movies and TV shows available on the Disney+ streaming platform. It includes details such as titles, directors, cast, countries, release years, ratings, durations, and genres. The dataset offers insight into the type of content Disney produces and how it has evolved over time.
https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows

* Import necessary libraries for data analysis, preprocessing, and visualization.



* Use kagglehub to download the latest version of the Disney movies and TV shows dataset.

In [23]:
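import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
# Plotly is used by the visualization cells below (px, go, make_subplots)
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import os
import kagglehub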

# Download latest version
path = kagglehub.dataset_download("shivamb/disney-movies-and-tv-shows")

print("Path to dataset files:", path)


Path to dataset files: /kaggle/input/disney-movies-and-tv-shows


In [24]:
files = os.listdir(path)
print("Files in dataset:", files)

for file in files:
    if file.endswith(".csv"):
        df = pd.read_csv(os.path.join(path, file))
        print(f"\n--- First 5 rows of {file} ---")
        print(df.head())

Files in dataset: ['disney_plus_titles.csv']

--- First 5 rows of disney_plus_titles.csv ---
  show_id     type                                             title  \
0      s1    Movie  Duck the Halls: A Mickey Mouse Christmas Special   
1      s2    Movie                            Ernest Saves Christmas   
2      s3    Movie                      Ice Age: A Mammoth Christmas   
3      s4    Movie                        The Queen Family Singalong   
4      s5  TV Show                             The Beatles: Get Back   

                            director  \
0  Alonso Ramirez Ramos, Dave Wasson   
1                        John Cherry   
2                       Karen Disher   
3                    Hamish Hamilton   
4                                NaN   

                                                cast        country  \
0  Chris Diamantopoulos, Tony Anselmo, Tress MacN...            NaN   
1           Jim Varney, Noelle Parker, Douglas Seale            NaN   
2  Raymond Albert Ro

# Data Cleaning




Checking Missing Values in the Dataset




In [25]:
# Identify missing values
def missing_pct(df):
    # Calculate the missing-value count and percentage for each column
    missing_count_percent = df.isnull().sum() * 100 / df.shape[0]
    df_missing_count_percent = pd.DataFrame(missing_count_percent).round(2)
    df_missing_count_percent = df_missing_count_percent.reset_index().rename(
                    columns={
                            'index':'Column',
                            0:'Missing_Percentage (%)'
                    }
                )
    df_missing_value = df.isnull().sum()
    df_missing_value = df_missing_value.reset_index().rename(
                    columns={
                            'index':'Column',
                            0:'Missing_value_count'
                    }
                )
    Final = df_missing_value.merge(df_missing_count_percent, how = 'inner', left_on = 'Column', right_on = 'Column')
    Final = Final.sort_values(by = 'Missing_Percentage (%)',ascending = False)
    return Final

missing_pct(df)

Unnamed: 0,Column,Missing_value_count,Missing_Percentage (%)
3,director,473,32.62
5,country,219,15.1
4,cast,190,13.1
6,date_added,3,0.21
8,rating,3,0.21
2,title,0,0.0
0,show_id,0,0.0
1,type,0,0.0
7,release_year,0,0.0
9,duration,0,0.0
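The same summary can be built without the intermediate merge. The following is a minimal sketch on toy data, not the Disney+ file; the function name `missing_summary` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "Column": counts.index,
        "Missing_value_count": counts.to_numpy(),
        "Missing_Percentage (%)": (counts.to_numpy() * 100 / len(frame)).round(2),
    })
    return summary.sort_values(
        "Missing_value_count", ascending=False, ignore_index=True
    )

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "X", None, None],
    "country": ["US", None, "US", "UK"],
})

result = missing_summary(toy)
print(result)
```

Building both columns from one `isnull().sum()` call keeps the count and the percentage aligned by construction, so no join is needed.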


#Mapping Ratings:
Replace detailed rating categories with broader audience groups (Kids, Teens, and Adults) for easier analysis.

#Handling Missing Values:

Fill missing country values with the most common country.

Replace missing values in cast and director columns with "No Data".

Fill missing date_added entries with "Unknown".

#Removing Nulls and Duplicates:

Drop any remaining rows with missing values.

Remove duplicate rows to ensure unique records.

In [26]:

rating_map = {
    'PG-13': 'Teens',
    'PG': 'Kids',
    'TV-14': 'Teens',
    'TV-PG': 'Kids',
    'TV-Y': 'Kids',
    'TV-Y7': 'Kids',
    'TV-Y7-FV': 'Kids',
    'TV-G': 'Kids',
    'G': 'Kids',
    'TV-MA': 'Adults',
    'R': 'Adults',
    'NC-17': 'Adults'
}

df['rating'] = df['rating'].replace(rating_map)

df['country'] = df['country'].fillna(df['country'].mode()[0])
df['cast'] = df['cast'].fillna('No Data')
df['director'] = df['director'].fillna('No Data')
df['date_added'] = df['date_added'].fillna('Unknown')

df.dropna(inplace=True)

df.drop_duplicates(inplace=True)


df.head(14)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Duck the Halls: A Mickey Mouse Christmas Special,"Alonso Ramirez Ramos, Dave Wasson","Chris Diamantopoulos, Tony Anselmo, Tress MacN...",United States,"November 26, 2021",2016,Kids,23 min,"Animation, Family",Join Mickey and the gang as they duck the halls!
1,s2,Movie,Ernest Saves Christmas,John Cherry,"Jim Varney, Noelle Parker, Douglas Seale",United States,"November 26, 2021",1988,Kids,91 min,Comedy,Santa Claus passes his magic bag to a new St. ...
2,s3,Movie,Ice Age: A Mammoth Christmas,Karen Disher,"Raymond Albert Romano, John Leguizamo, Denis L...",United States,"November 26, 2021",2011,Kids,23 min,"Animation, Comedy, Family",Sid the Sloth is on Santa's naughty list.
3,s4,Movie,The Queen Family Singalong,Hamish Hamilton,"Darren Criss, Adam Lambert, Derek Hough, Alexa...",United States,"November 26, 2021",2021,Kids,41 min,Musical,"This is real life, not just fantasy!"
5,s6,Movie,Becoming Cousteau,Liz Garbus,"Jacques Yves Cousteau, Vincent Cassel",United States,"November 24, 2021",2021,Teens,94 min,"Biographical, Documentary",An inside look at the legendary life of advent...
6,s7,TV Show,Hawkeye,No Data,"Jeremy Renner, Hailee Steinfeld, Vera Farmiga,...",United States,"November 24, 2021",2021,Teens,1 Season,"Action-Adventure, Superhero",Clint Barton/Hawkeye must team up with skilled...
7,s8,TV Show,Port Protection Alaska,No Data,"Gary Muehlberger, Mary Miller, Curly Leach, Sa...",United States,"November 24, 2021",2015,Teens,2 Seasons,"Docuseries, Reality, Survival",Residents of Port Protection must combat volat...
8,s9,TV Show,Secrets of the Zoo: Tampa,No Data,"Dr. Ray Ball, Dr. Lauren Smith, Chris Massaro,...",United States,"November 24, 2021",2019,Kids,2 Seasons,"Animals & Nature, Docuseries, Family",A day in the life at ZooTampa is anything but ...
9,s10,Movie,A Muppets Christmas: Letters To Santa,Kirk R. Thatcher,"Steve Whitmire, Dave Goelz, Bill Barretta, Eri...",United States,"November 19, 2021",2008,Kids,45 min,"Comedy, Family, Musical",Celebrate the holiday season with all your fav...
10,s11,Movie,Adventure Thru the Walt Disney Archives,John Gleim,"Don Hahn, Kathryn Beaumont, Pete Docter, Kevin...",United States,"November 19, 2021",2020,Kids,59 min,Documentary,Explore the treasures and rich history of the ...
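One detail of the rating step is worth checking: `replace` leaves any label that is not in the dictionary unchanged, so an unexpected rating would pass through silently. The sketch below shows one way to list such labels, using `map` on toy data. The abbreviated dictionary and the `"NR"` label are assumptions for illustration only.

```python
import pandas as pd

# Abbreviated mapping for illustration; the notebook's rating_map is longer
rating_map = {"PG": "Kids", "TV-14": "Teens", "TV-MA": "Adults"}

# Hypothetical ratings, including one label missing from the mapping
ratings = pd.Series(["PG", "TV-14", "NR", None])

# map() returns NaN for labels that are not keys of the dictionary
mapped = ratings.map(rating_map)

# Labels that were present in the data but have no audience group
unmapped = ratings[mapped.isna() & ratings.notna()].unique().tolist()
print(unmapped)
```

Running a check like this before `dropna` makes it clear whether rows are lost because the rating was missing or because it was never mapped.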


#Splitting and Creating New Columns
For further analysis, I split the genres into separate rows, because a title can be listed under more than one genre. The same applies to the country column, where a title can list several countries.

In [57]:
df_expanded = df.copy()
df_expanded['country'] = df_expanded['country'].str.split(',')

df_expanded = df_expanded.explode('country')
df_expanded['country'] = df_expanded['country'].str.strip()

df_expanded['listed_in'] = df_expanded['listed_in'].str.split(',')
df_expanded = df_expanded.explode('listed_in')
df_expanded['listed_in'] = df_expanded['listed_in'].str.strip()

genre_by_country_type = df_expanded.groupby(['country', 'listed_in', 'type']).size().reset_index(name='count')

top_genre_by_country_type = genre_by_country_type.sort_values('count', ascending=False).drop_duplicates(['country', 'type'])

top_genre_by_country_type = top_genre_by_country_type.sort_values('count', ascending=False)

top_genre_by_country_type.head(10)



Unnamed: 0,show_id,title,type,listed_in
0,s1,Duck the Halls: A Mickey Mouse Christmas Special,Movie,Animation
0,s1,Duck the Halls: A Mickey Mouse Christmas Special,Movie,Family
1,s2,Ernest Saves Christmas,Movie,Comedy
2,s3,Ice Age: A Mammoth Christmas,Movie,Animation
2,s3,Ice Age: A Mammoth Christmas,Movie,Comedy
...,...,...,...,...
1448,s1449,Bend It Like Beckham,Movie,Comedy
1448,s1449,Bend It Like Beckham,Movie,Coming of Age
1449,s1450,Captain Sparky vs. The Flying Saucers,Movie,Action-Adventure
1449,s1450,Captain Sparky vs. The Flying Saucers,Movie,Animals & Nature
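To make the split-and-explode step concrete, here is a small sketch on two toy rows. The values are borrowed from the sample output above; the variable names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "listed_in": ["Animation, Family", "Comedy"],
})

# Split the comma-separated string into a list, then give each item its own row
exploded = toy.assign(listed_in=toy["listed_in"].str.split(",")).explode("listed_in")

# Remove the space left behind after each comma
exploded["listed_in"] = exploded["listed_in"].str.strip()
print(exploded)

# Number of rows each title now occupies
rows_per_title = exploded.groupby("show_id").size()
```

Note that `explode` keeps the original index, so one title appears under the same index several times. Because `df_expanded` explodes both `country` and `listed_in`, a title with two countries and three genres contributes six rows, and counts taken from it describe title-country-genre combinations rather than unique titles.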


In [28]:
# Total number of movies and TV shows
df_count = df['show_id'].count()
print(df_count)
# Split between movies and TV shows
df_type = df.groupby('type')['show_id'].count().reset_index()
df_type = df_type.rename(columns = {"show_id":"count_showids"})
df_type

1447


Unnamed: 0,type,count_showids
0,Movie,1051
1,TV Show,396


#Visualization


In [33]:

colors = ["#284675", "#0077be"]  # Navy and Blue

fig_indicator = go.Figure()

fig_indicator.add_trace(go.Indicator(
    mode="number",
    value=df_count,
    title={"text": "Total content on Disney+"},
    number={"font": {"size": 60}}
))

fig_indicator.update_layout(
    template="plotly_white",
    height=150,
    margin=dict(l=50, r=50, t=20, b=0)
)


fig_indicator.show()

fig_mix = make_subplots(
    rows=1, cols=2,
    specs=[[{"type": "bar"}, {"type": "pie"}]],
    column_widths=[0.5, 0.5]
)

fig_mix.add_trace(
    go.Bar(
        x=df_type["count_showids"],
        y=df_type["type"],
        orientation='h',
        marker=dict(color=colors),
        text=df_type["count_showids"],
        textposition='auto',
        showlegend=False
    ),
    row=1, col=1
)

fig_mix.add_trace(
    go.Pie(
        labels=df_type["type"],
        values=df_type["count_showids"],
        marker=dict(colors=colors)
    ),
    row=1, col=2
)

fig_mix.update_layout(
    title_text="Which type of content is more common on Disney+?",
    plot_bgcolor='white',
    height=450
)

fig_mix.show()




After cleaning, the dataset contains 1,447 titles across movies and TV shows. Movies make up most of the catalog: there are 1,051 movies and 396 TV shows.
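The split can also be expressed as percentages. This short sketch uses the two counts from the output table above (1,051 movies and 396 TV shows); the variable names are assumptions.

```python
# Counts taken from the df_type output above
counts = {"Movie": 1051, "TV Show": 396}

total = sum(counts.values())
shares = {kind: round(100 * n / total, 1) for kind, n in counts.items()}

print(total)   # 1447
print(shares)  # {'Movie': 72.6, 'TV Show': 27.4}
```

Roughly three out of every four titles in the cleaned dataset are movies.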

In [64]:
# Filter only movies
top_movie_genre = top_genre_by_country_type[top_genre_by_country_type['type'] == 'Movie']

fig = px.bar(
    top_movie_genre.head(5),
    x='count',
    y='country',
    color='listed_in',
    text='listed_in',
    title='🎬 Top Movie Genre per Country on Disney+',
    labels={'count': 'Count', 'country': 'Country', 'listed_in': 'Genre'},
    height=700,
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig.update_traces(textposition='outside')
fig.update_layout(height=600)
fig.show()


top_tv_genre = top_genre_by_country_type[top_genre_by_country_type['type'] == 'TV Show']

fig = px.bar(
    top_tv_genre.head(5),
    x='count',
    y='country',
    color='listed_in',
    text='listed_in',
    title='📺 Top TV Show Genre per Country on Disney+',
    labels={'count': 'Count', 'country': 'Country', 'listed_in': 'Genre'},
    height=700,
    color_discrete_sequence=px.colors.qualitative.Set3
)

fig.update_traces(textposition='outside')
fig.update_layout(height=600)
fig.show()



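Popular Genres by Country

* Family is the most common genre among titles from the United States and the United Kingdom.

* Animation is the most common genre among titles from Japan and South Korea.

* Action is prominent among titles from Canada.

These counts describe the catalog by country of production, not viewing behavior. Even so, the patterns suggest **regional differences that can guide content strategy and recommendations for streaming platforms** like Disney+.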

In [46]:
df_5 = df.query("release_year >= 2007")
df_5 = df_5.groupby("release_year")["show_id"].count().reset_index()

fig = px.area(df_5, x='release_year', y='show_id', color_discrete_sequence=['lightskyblue'],
      title='Overall content release Trend')
fig.show()

Disney+ launched in 2019. The catalog contains relatively few titles with release years before 2019 and noticeably more titles released in 2019, 2020, and 2021. This suggests that the platform has kept adding new content since launch.

In [47]:

df_4 = df.query("release_year >= 2007")
df_4 = df_4.groupby(["type","release_year"])["show_id"].count().reset_index()
df_4_movie = df_4.query("type == 'Movie'")
df_4_show = df_4.query("type == 'TV Show'")

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=  df_4_movie['release_year'],
    y= df_4_movie['show_id'],
    showlegend=True,
    text = df_4_movie['show_id'],

    name='Movie',
    marker_color='Navy'

))
fig.add_trace(go.Scatter(
    x=  df_4_show['release_year'],
    y= df_4_show['show_id'],
    showlegend=True,
    text = df_4_show['show_id'],

    name='TV Show',
    marker_color='Blue'
))

fig.update_traces( mode='lines+markers')
fig.update_layout(title_text = 'Movies/TV Show release yearly Trend' )
fig.show()

It seems like Disney+ is focused on movies, and the movie count increases significantly until 2020.

In [65]:
# TV shows: count titles per number of seasons
df_9 = df.query("type == 'TV Show'")
df_9 = df_9[["title", "duration"]]
df_9 = df_9.groupby(['duration'])["title"].count().reset_index().sort_values('title', ascending=False)
df_9 = df_9.rename(columns={"title": "TV Shows", "duration": "Seasons"})

# Movies: copy the filtered rows so the assignments below modify an independent
# DataFrame and do not raise pandas' SettingWithCopyWarning
df_10 = df.query("type == 'Movie'").copy()
df_10['duration'] = df_10['duration'].fillna("0")
# Keep the leading number from strings such as "91 min"
df_10['duration'] = df_10['duration'].str.split(" ").str[0].astype(int)

fig_show = px.bar(df_9, x='Seasons', y='TV Shows', color_discrete_sequence=['Blue'],
       title='TV Shows seasons ')
fig_Movie = px.histogram(df_10, x="duration", nbins=20, color_discrete_sequence=['Navy'],
                  title="Movie Duration")

fig_Movie.show()
fig_show.show()




Most movies on Disney+ run for about 100 minutes, and very few last longer than 150 minutes. Most TV shows on Disney+ have only one season.
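Statements like these can be backed with numbers once the duration text is converted to integers. The sketch below uses a regular expression on toy rows whose strings follow the formats seen in the sample output (`"23 min"`, `"2 Seasons"`). The column name `duration_value` is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "duration": ["23 min", "91 min", "1 Season", "2 Seasons"],
})

# Pull the leading digits out of both "N min" and "N Season(s)" strings
toy["duration_value"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)

# Typical movie length in minutes
movie_minutes = toy.loc[toy["type"] == "Movie", "duration_value"]
median_minutes = movie_minutes.median()

# Share of TV shows that have exactly one season
tv_seasons = toy.loc[toy["type"] == "TV Show", "duration_value"]
single_season_share = (tv_seasons == 1).mean()

print(median_minutes, single_season_share)
```

On the real data, the median and the single-season share would give a more precise reading than the histogram bins alone.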

Insight

In [53]:

fig = px.bar(
    top_genre_by_country_type.head(10),
    x='count',
    y='country',
    color='type',
    title='Distribution of Movies and TV Shows by Country on Disney+',
    labels={'count': 'Count', 'type': 'Content Type', 'country': 'Country'},
    barmode='group',
    height=700,
    color_discrete_sequence=px.colors.qualitative.Set2
)

fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig.show()
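The chart above is drawn from `top_genre_by_country_type`, whose `count` column holds the number of titles in each country's top genre rather than all titles from that country. If the goal is the overall number of movies and TV shows per country, one possible approach is sketched below on toy data. The name `titles_per_country_type` and the sample rows are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "type": ["Movie", "Movie", "TV Show", "Movie"],
    "country": ["United States, Canada", "United States", "United States", "Canada"],
})

# One row per (title, country); genres are not exploded here
by_country = toy.assign(country=toy["country"].str.split(",")).explode("country")
by_country["country"] = by_country["country"].str.strip()

titles_per_country_type = (
    by_country.drop_duplicates(["show_id", "country"])  # count each title once per country
    .groupby(["country", "type"])
    .size()
    .reset_index(name="count")
    .sort_values(["count", "country", "type"],
                 ascending=[False, True, True], ignore_index=True)
)
print(titles_per_country_type)
```

Passing a table like this to `px.bar` with `color='type'` and `barmode='group'` would compare total movies and TV shows per country directly.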

#📊  Insights
🇺🇸 United States dominates content production


* The U.S. has by far the highest number of both movies and TV shows on Disney+.

* TV shows significantly outnumber movies for the U.S., indicating a strong focus on serialized content.

* The United Kingdom, Canada, France, Australia, and South Korea contribute far fewer titles than the U.S.

* In countries like the UK and Canada, movies still lead, but the gap between movies and TV shows is narrower.

🎬 Genre/Content Type Trends by Country
* TV shows are more prevalent in the U.S. and South Korea

* Suggests strong demand or production capability in series-based content.

* Could indicate Disney+ should focus more on localized TV series in these countries.

* France, Canada, UK, and Australia lean more toward movies
