# **PROJECT: Disney+ Hotstar Data Analysis**

## **1. Overview**

This dataset contains metadata for the TV shows and movies available on Disney+ Hotstar. It includes details such as title, description, genre, release year, age rating, running time, and content type.

* Dataset name: Disney+ Hotstar TV and Movie Catalog
* File name: hotstar.csv
* Data source: Kaggle
* Dataset link: [link](https://www.kaggle.com/datasets/goelyash/disney-hotstar-tv-and-movie-catalog)
* Dataset size: 1.4 MB


## **2. Dataset Structure and Contents**
The dataset contains metadata only, not the streaming content itself.
Each row represents one movie or TV show.

**Dataset size:**
* Rows: 6874 entries
* Columns: 10 attributes
  
The meaning of each attribute is as follows:

1. **hotstar_id:** ID of the TV show or movie.
2. **title:** Title of the TV show or movie.
3. **description:** Short description of the content.
4. **genre:** Genre of the content.
5. **year:** Release year of the content.
6. **age_rating:** Age rating, as listed.
7. **running_time:** Running time of a movie, in minutes.
8. **seasons:** Number of seasons of a TV show.
9. **episodes:** Number of episodes of a TV show.
10. **type:** Whether the content is a TV show or a movie.

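This file is stated to contain 6245 unique TV shows and movies, but the copy loaded below has 6874 rows, so the figure should be checked against the data.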

In [1]:
import pandas as pd

In [2]:
datapath = r"hotstar.csv"
or_data = pd.read_csv(datapath) #original dataset 
data = or_data.copy() # copy dataset
data.head(5)

| | hotstar_id | title | description | genre | year | age_rating | running_time | seasons | episodes | type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000087439 | Sambha - Aajcha Chawa | A young man sets off on a mission to clean up ... | Action | 2012 | U/A 16+ | 141.0 | NaN | NaN | movie |
| 1 | 1260023113 | Cars Toon: Mater And The Ghostlight | Mater is haunted by a mysterious blue light th... | Animation | 2006 | U | 7.0 | NaN | NaN | movie |
| 2 | 1260103188 | Kanmani Rambo Khatija | Unlucky since birth, Rambo finds hope when he ... | Romance | 2022 | U/A 16+ | 157.0 | NaN | NaN | movie |
| 3 | 1260126754 | Butterfly | While trying to rescue her sister's kids from ... | Thriller | 2022 | U/A 16+ | 136.0 | NaN | NaN | movie |
| 4 | 1260018228 | Sister Act | Rene, a lounge singer, decides to stay at a Ch... | Comedy | 1992 | U/A 7+ | 100.0 | NaN | NaN | movie |


**REMARK:** The Hotstar dataset has 6874 rows and 10 columns.

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6874 entries, 0 to 6873
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   hotstar_id    6874 non-null   int64  
 1   title         6874 non-null   object 
 2   description   6874 non-null   object 
 3   genre         6874 non-null   object 
 4   year          6874 non-null   int64  
 5   age_rating    6874 non-null   object 
 6   running_time  4568 non-null   float64
 7   seasons       2306 non-null   float64
 8   episodes      2306 non-null   float64
 9   type          6874 non-null   object 
dtypes: float64(3), int64(2), object(5)
memory usage: 537.2+ KB


In [4]:
data.describe()

| | hotstar_id | year | running_time | seasons | episodes |
|---|---|---|---|---|---|
| count | 6874.0 | 6874.0 | 4568.0 | 2306.0 | 2306.0 |
| mean | 1059077000.0 | 2011.71865 | 98.746716 | 2.661752 | 127.366869 |
| std | 481266600.0 | 11.936894 | 49.411142 | 4.942716 | 258.138186 |
| min | 3.0 | 1928.0 | 1.0 | 1.0 | 1.0 |
| 25% | 1000088000.0 | 2009.0 | 70.0 | 1.0 | 6.0 |
| 50% | 1260008000.0 | 2016.0 | 116.0 | 1.0 | 22.0 |
| 75% | 1260099000.0 | 2019.0 | 135.0 | 2.0 | 130.75 |
| max | 1837059000.0 | 2023.0 | 229.0 | 73.0 | 3973.0 |


In [5]:
data.isnull().sum()

hotstar_id         0
title              0
description        0
genre              0
year               0
age_rating         0
running_time    2306
seasons         4568
episodes        4568
type               0
dtype: int64

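**REMARK:** Key columns such as title and genre have no null values. The nulls are confined to running_time (2306 rows) and to seasons and episodes (4568 rows each), which is consistent with running time applying only to movies and seasons/episodes applying only to TV shows. No rows need to be removed because of these nulls.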
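One way to check that the nulls really follow the content type is to count missing values per column within each `type`. The sketch below uses a small toy frame with made-up values (not rows from hotstar.csv) that mirrors the column layout; on the real data, `toy` would be replaced by `data`.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names match the dataset.
toy = pd.DataFrame({
    "type": ["movie", "movie", "tv"],
    "running_time": [141.0, 7.0, None],
    "seasons": [None, None, 2.0],
    "episodes": [None, None, 24.0],
})

# Mark missing cells, then count them separately for each content type.
cols = ["running_time", "seasons", "episodes"]
nulls_by_type = toy[cols].isnull().groupby(toy["type"]).sum()
print(nulls_by_type)
```

If the nulls are purely structural, the movie row should show zero missing running times and the TV row should show zero missing seasons and episodes.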

In [6]:
data.describe(include=object)

| | title | description | genre | age_rating | type |
|---|---|---|---|---|---|
| count | 6874 | 6874 | 6874 | 6874 | 6874 |
| unique | 6677 | 6815 | 37 | 6 | 2 |
| top | Rubaru Roshni | Unlucky since birth, Rambo finds hope when he ... | Drama | U/A 13+ | movie |
| freq | 7 | 4 | 2043 | 2980 | 4568 |


**REMARK:** Not all titles and descriptions are unique: across 6874 rows there are 6677 unique titles and 6815 unique descriptions.
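To see which titles repeat and how often, `value_counts` can be filtered for counts above one. This is a minimal sketch on a toy column with placeholder titles; on the real data it would be applied to `data['title']`.

```python
import pandas as pd

# Toy column with placeholder titles, not taken from hotstar.csv.
toy = pd.DataFrame({"title": ["A", "B", "A", "C", "A"]})

# Count occurrences of each title and keep only those seen more than once.
counts = toy["title"].value_counts()
repeated = counts[counts > 1]
print(repeated)
```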

In [7]:
data.duplicated(['title','description']).sum()

np.int64(16)

**REMARK:** 16 rows duplicate an earlier row, meaning their title and description are both identical to those of a row that appears before them.
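The count depends on the `keep` argument of `duplicated`. With the default `keep='first'`, only the later copies are flagged; with `keep=False`, every member of a duplicate group is flagged. The toy frame below uses placeholder values to illustrate the difference.

```python
import pandas as pd

# Toy frame with placeholder values: rows 0 and 1 share title and description.
toy = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "description": ["x", "x", "y", "z"],
})

subset = ["title", "description"]
later_copies = toy.duplicated(subset).sum()             # default keep='first'
all_members = toy.duplicated(subset, keep=False).sum()  # flags every copy
print(later_copies, all_members)
```

Row 2 shares its title with rows 0 and 1 but has a different description, so it is not counted as a duplicate by either call.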

In [8]:
# Show rows whose (title, description) pair occurs more than once
data[data.duplicated(['title','description'], keep=False)].head(10)

| | hotstar_id | title | description | genre | year | age_rating | running_time | seasons | episodes | type |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1260103188 | Kanmani Rambo Khatija | Unlucky since birth, Rambo finds hope when he ... | Romance | 2022 | U/A 16+ | 157.0 | NaN | NaN | movie |
| 13 | 1000228200 | Maari 2 | Maari, a local don, in his new naughty-turned-... | Action | 2018 | U/A 13+ | 140.0 | NaN | NaN | movie |
| 459 | 1260068261 | Annabelle Sethupathi | A scary journey begins when a small-time thief... | Horror | 2021 | U | 134.0 | NaN | NaN | movie |
| 1507 | 1260108123 | RRR | Under the British Raj, two revolutionaries wit... | Action | 2022 | U/A 16+ | 177.0 | NaN | NaN | movie |
| 1510 | 1000228512 | Rubaru Roshni | Based in India, traversing three decades, test... | Drama | 2019 | U/A 16+ | 110.0 | NaN | NaN | movie |
| 1925 | 1260047825 | Mookuthi Amman | Small-time reporter Engels' life changes drast... | Drama | 2020 | U/A 7+ | 134.0 | NaN | NaN | movie |
| 1937 | 1260108122 | RRR | Under the British Raj, two revolutionaries wit... | Action | 2022 | U/A 16+ | 177.0 | NaN | NaN | movie |
| 2150 | 1260113147 | Vikrant Rona | When a series of mysterious events stirs pande... | Action | 2022 | U/A 16+ | 145.0 | NaN | NaN | movie |
| 2558 | 1260108121 | RRR | Under the British Raj, two revolutionaries wit... | Action | 2022 | U/A 16+ | 178.0 | NaN | NaN | movie |
| 2875 | 1260103193 | Kanmani Rambo Khatija | Unlucky since birth, Rambo finds hope when he ... | Romance | 2022 | U/A 16+ | 157.0 | NaN | NaN | movie |


In [9]:
# Drop rows that duplicate an earlier row on both 'title' and 'description' (the first occurrence is kept)
data = data.drop_duplicates(['title','description'])

In [10]:
data.duplicated(['title','description']).sum()

np.int64(0)

In [11]:
data = data.drop(['seasons','episodes'],axis=1)

**REMARK:** No column has to be dropped, but the "seasons" and "episodes" columns are removed here to reduce the size of the dataset.

In [12]:
column_list = list(data.columns)
column_list

['hotstar_id',
 'title',
 'description',
 'genre',
 'year',
 'age_rating',
 'running_time',
 'type']

In [17]:
#Changing name of column
data.rename({'year':'released_year'},axis=1, inplace= True)

In [18]:
data.rename({'age_rating':'age_limit'}, axis = 1, inplace = True)
data.columns

Index(['hotstar_id', 'title', 'description', 'genre', 'released_year',
       'age_limit', 'running_time', 'type'],
      dtype='object')

In [19]:
# Sort by release year to find the oldest and newest titles
sorted_data = data.sort_values('released_year', ignore_index=True)
print(f"Oldest : {sorted_data['title'].iloc[0]}, Released year : {sorted_data['released_year'].iloc[0]} and type : {sorted_data['type'].iloc[0]}")
print(f"Newest : {sorted_data['title'].iloc[-1]}, Released year : {sorted_data['released_year'].iloc[-1]} and type : {sorted_data['type'].iloc[-1]}")



Oldest : Mickey Mouse Steamboat Willie, Released year : 1928 and type : movie
Newest : Mahanadhi, Released year : 2023 and type : movie
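Several titles can share the same release year, so the single row at the end of the sorted frame is only one of possibly many "newest" entries, and which one appears last depends on how ties are ordered. A sketch of an alternative, shown on a toy frame with placeholder titles, is to select every row in the latest year.

```python
import pandas as pd

# Toy frame with placeholder titles; column names follow the renamed dataset.
toy = pd.DataFrame({
    "title": ["Old Short", "New Film", "New Series"],
    "released_year": [1928, 2023, 2023],
    "type": ["movie", "movie", "tv"],
})

# Find the latest year, then keep every row released in that year.
newest_year = toy["released_year"].max()
newest = toy[toy["released_year"] == newest_year]
print(newest[["title", "type"]])
```

The same pattern with `.min()` lists every title from the earliest year.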


In [20]:
# Export the cleaned DataFrame to Excel for visualisation (writing .xlsx requires openpyxl)
# to_excel returns None, so its result is not assigned
data.to_excel("cleaned_data_hotstar.xlsx", index=False)