# Netflix Engagement (Jan - Jun 23) +
#### Netflix Engagement data Supplemented with IMDB data

**About Dataset**

Netflix published engagement data for the first half of 2023. It includes the hours viewed (rounded to 100,000 hours), the premiere date, and whether the title was available globally, for Netflix TV series and films.

The dataset aims to supplement this with information scraped from IMDB, including ratings and a brief description.
The dataset includes a genre column containing Python list syntax.

The dataset contains several columns describing each show or movie. The columns are as follows:

1. **Title:** Title of the movie or series.
2. **Available Globally?:** Is the show available globally or not. Yes or No.
3. **Release Date:** Release date of the show.
4. **Hours Viewed:** Total hours the show was viewed.
5. **Number of Ratings:** Total number of ***IMDB*** users who rated the show/movie.
6. **Rating:** Audience rating of the show/movie.
7. **Genre:** Genre(s) of the show/movie.
8. **Key Words:** Keywords that best fit the movie/show.
9. **Description:** A small description or summary of the show or movie.

Click [here](https://www.kaggle.com/datasets/vassyesboy/netflix-engagement-jan-jun-23) to visit the dataset on the Kaggle website and find out more.

## Table of Contents
1. [Chapter 1 - Define the Problem](#ch1)
1. [Chapter 2 - Importing Libraries](#ch2)
1. [Chapter 3 - Look at Dataset](#ch3)
1. [Chapter 4 - Look Deep Into the Dataset](#ch4)
1. [Chapter 5 - Change/modify the Dataset](#ch5)
1. [Chapter 6 - Fill/delete NaN/null values](#ch6)
1. [Chapter 7 - Analysis from the Dataset](#ch7)
1. [Chapter 8 - Visualization](#ch8)
1. [Chapter 9 - Conclusion](#ch9)

<a id="ch1"></a>
### 1. Define the Problem:

In this project we are going to use Netflix's data to find some interesting statistics about the shows, such as the most liked movie, the most liked genre, and so on.

<a id="ch2"></a>
### 2. Importing Libraries:
The following *libraries* are needed to explore the dataset and answer our questions.

In [None]:
import pandas as pd #collection of functions for data processing and analysis.
import numpy as np #foundational package for scientific computing.

import matplotlib.pyplot as plt #collection of functions for scientific and publication-ready visualization.
import seaborn as sns #visualization

#Configure Visualization Defaults
#%matplotlib inline = show plots in Jupyter Notebook browser
%matplotlib inline

#ignore warnings
import warnings
warnings.filterwarnings("ignore")

<a id="ch3"></a>
### 3. Look at Dataset
In this section we will look at the dataset using the *pandas* library.

In [1]:
df = pd.read_csv("netflix-engagement-jan-jun-23/Netflix Engagement (plus).csv")
df.sample(10)


<a id="ch4"></a>
### 4. Look Deep Into the Dataset
In this step we will take a deeper look at the dataset and examine things like its shape and the kind of data it holds.

In [None]:
# Shape of the dataset
df.shape

So, there are 18332 rows and 9 columns originally.

In [None]:
# Columns name in the dataset.
df.columns

In [None]:
# First 5 rows of the dataset using df.head()
df.head(5)

In [None]:
# Last 5 rows of the dataset.
df.tail(5)

In [None]:
# This will show the description of the columns with numbers.
df.describe()

In [None]:
# This will describe all columns, including the non-numeric ones.
df.describe(include = 'all')

In [None]:
# Total number of null values in each column.
df.isnull().sum()

In [None]:
# This loop shows the number of unique values in each column.
# ANSI escape codes used to print the column name in bold.
BOLD = "\033[1m"
RESET = "\033[0m"

for i in df.columns:
    print(BOLD + i + RESET)
    print(df[i].nunique(), '\n')

In [None]:
# Basic information about the dataset: datatype and non-null count of each column, plus memory usage.
df.info()

Here we can see that 'Release Date' should be of datatype datetime but is stored as object, and 'Number of Ratings' should be int64 but is stored as float64. So, we will change the datatype of both columns.

<a id="ch5"></a>
### 5. Change/modify the Dataset
In this section we will modify the dataset as per our requirements so that it is easier to work with.

In [None]:
# Change the datatype of the "Release Date" column.
df['Release Date'] = pd.to_datetime(df['Release Date'])

In [None]:
# Change the datatype of the "Number of Ratings" column.
df['Number of Ratings'] = np.int64(df['Number of Ratings'])

In [None]:
df.info()

Now we can see that the datatype has been changed.

In [None]:
# Just for fun, we rename the column "Available Globally?" to drop the '?'.
df = df.rename(columns={"Available Globally?": "Available Globally"})

In [None]:
df.sample(3)

#### 5.1 Create new columns
In this section we will create new columns in order to get more information.

In [None]:
# Add 'Release Year' column from the 'Release Date' column. Year in which show/movie releases.
df['Release Year'] = df['Release Date'].dt.year.fillna(0).astype('int64')

In [None]:
# Add 'Release Month' column from the 'Release Date' column. Month in which the show/movie was released.
df['Release Month'] = df['Release Date'].dt.month.fillna(0).astype('int64')

In [None]:
# Add 'Release Day' column from the 'Release Date' column. Day of the week on which the show/movie was released.
df['Release Day'] = df['Release Date'].dt.day_name()

In [None]:
df.sample(3)

In [None]:
df.isnull().sum()

We can see that there are many null values in the dataset. We will have to handle them before we can analyse the data.

<a id="ch6"></a>
### 6. Fill/delete NaN/null values.
In this section we will fill the NaN/null values or remove the rows that contain them.

In [None]:
# Fill 'Unknown' in place of null values in the 'Release Day' column.
# Assigning the result back avoids relying on in-place edits of a column selection.
df['Release Day'] = df['Release Day'].fillna('Unknown')

In [None]:
# Highest number of ratings received by any title.
df['Number of Ratings'].max()

In [None]:
# Lowest number of ratings received by any title.
df['Number of Ratings'].min()

This cannot be right: a title cannot have a negative number of ratings.

In [None]:
# Number of rows in the 'Number of Ratings' column with a value of 0 or less.
df[df['Number of Ratings'] <= 0]['Number of Ratings'].count()

In [None]:
# Number of unique values among those 4110 values that are 0 or less.
df[df['Number of Ratings'] <= 0]['Number of Ratings'].nunique()

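Here we can see that all 4110 values in the 'Number of Ratings' column that are 0 or less are the same value, -9223372036854775808. This is the smallest value an int64 can hold, so it is most likely what the missing (NaN) ratings turned into when the column was cast to int64 in section 5, rather than a typo in the data. As only 4110 rows are affected, we can drop those rows.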
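
A plain int64 column has no way to represent a missing value. As a minimal sketch of an alternative (using a small made-up Series, not the real data), pandas' nullable `Int64` dtype keeps missing entries as `<NA>` while still storing integers:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Number of Ratings' column (values are made up).
ratings = pd.Series([1200.0, np.nan, 35.0], name="Number of Ratings")

# The nullable integer dtype "Int64" (capital I) stores missing values as <NA>.
ratings_int = ratings.astype("Int64")

print(ratings_int.dtype)         # Int64
print(ratings_int.isna().sum())  # 1, the missing value is preserved
print(ratings_int.min())         # 35, missing values are skipped
```

With this dtype the rows without ratings could be found with `isna()` instead of a comparison against 0.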

In [None]:
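# Keep only the rows with a positive number of ratings and store them in a new dataframe 'df_with_ratings'.
# .copy() makes it an independent dataframe, so later modifications do not act on a slice of df.
df_with_ratings = df.loc[df['Number of Ratings'] > 0].copy()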
df_with_ratings.sample(3)

In [None]:
df_with_ratings.shape

After deleting those rows, the new dataframe has 14222 rows and 12 columns.

In [None]:
# Null values in 'Rating' column.
df_with_ratings.Rating.isnull().sum()

In [None]:
# All unique genre in the new dataset.
df_with_ratings.Genre.unique()

In [None]:
# No. of null values in 'Genre' column.
df_with_ratings.Genre.isnull().sum()

As only 33 rows have a null genre, it is safe to delete those rows as well.

In [None]:
df_with_ratings = df_with_ratings.dropna(subset=['Genre'])

In [None]:
# Null values in each column of the new dataset after deleting rows and filling NaN.
df_with_ratings.isnull().sum()

There are still 3 columns with null values: 'Release Date', 'Key Words' and 'Description'.

In [None]:
# Fill 0 in place of null values in the 'Release Date' column.
df_with_ratings['Release Date'] = df_with_ratings['Release Date'].fillna(0)

In [None]:
# Shape of the dataframe after removing the rows.
df_with_ratings.shape

Finally we have 14189 rows and 12 columns.

In [None]:
# Fill 'unknown' in place of null values in the 'Key Words' column.
df_with_ratings['Key Words'] = df_with_ratings['Key Words'].fillna('unknown')

In [None]:
# Fill 'unknown' in place of null values in the 'Description' column.
df_with_ratings['Description'] = df_with_ratings['Description'].fillna('unknown')

In [None]:
# Number of null values in all columns of the dataset.
df_with_ratings.isnull().sum()

Finally we can see that there are no null values in the dataset anymore.

In [None]:
# Sample of the new dataset after removing null values.
df_with_ratings.sample(3)

<a id="ch7"></a>
### 7. Analysis from the Dataset.
In this section we will analyse the dataset and try to find some answers.

In [None]:
# Q1. Which day has most number of releases?
df_with_ratings['Release Day'].value_counts()

If we skip the 'Unknown' entries, we can see that most shows or movies are released on a Friday.

In [None]:
# Q2. In which year were the most shows/movies released?
df_with_ratings['Release Year'].value_counts()

Here we can see that the number of movies/shows released has been increasing over the years.

In [None]:
# Q3. How many shows with 'Drama' genre?
## All the shows with drama genre.
df_with_ratings[df_with_ratings['Genre'].apply(lambda x: 'Drama' in x)]

We can see that there are 5847 shows/movies in the 'Drama' genre.

In [None]:
# Details of the movie/show in the 'Drama' genre with the highest rating.
# After sorting in descending order, .iloc[0] is the first (highest-rated) row.
df_with_ratings[df_with_ratings['Genre'].apply(lambda x: 'Drama' in x)].sort_values(by= 'Rating', ascending = False).iloc[0]

So, 'One Day at a Time: Season 1' is one of the top-rated shows in the 'Drama' genre.

In [None]:
# Q4. What are all the genres of the movies/shows released?
all_genres = set(genre.strip("[]").replace("'", "") for genres in df_with_ratings['Genre'].str.split(", ") for genre in genres)
all_genres

In [None]:
# Q5. How many distinct genres of movies/shows are there?
len(all_genres)

There are 27 distinct genres among the movies/shows in the dataset.
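
Because the 'Genre' column stores Python list syntax as text, checks like `'Drama' in x` are substring matches on the raw string. A minimal sketch, on made-up rows rather than the real data, of parsing the strings with `ast.literal_eval` and counting genres by exact match:

```python
import ast
import pandas as pd

# Toy rows mimicking the 'Genre' column, which stores Python list syntax as text.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Genre": ["['Drama', 'Music']", "['Musical']", "['Drama']"],
})

# Parse each string into a real list, then give every genre its own row.
toy["Genre List"] = toy["Genre"].apply(ast.literal_eval)
exploded = toy.explode("Genre List")

# Exact-match counts per genre.
genre_counts = exploded["Genre List"].value_counts()
print(genre_counts["Drama"])  # 2
print(genre_counts["Music"])  # 1

# Substring matching on the raw text also counts 'Musical' as 'Music'.
print(toy["Genre"].apply(lambda x: "Music" in x).sum())  # 2
```

The column name `Genre List` is only an illustration. Whether the difference matters for the real data depends on which genre names are substrings of others.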

In [None]:
# Q6. Which are the best rated shows/movies of each genre?
# .iloc[0] picks the first row after sorting by 'Rating' in descending order.
for genre in all_genres:
    print(genre,'--->')
    print(df_with_ratings[df_with_ratings['Genre'].apply(lambda x: genre in x)].sort_values(by='Rating', ascending=False).iloc[0][['Title', 'Rating', 'Genre']])
    print('---------------------------------------------')

In [None]:
# The best rated shows/movies of each genre in table format using the tabulate library.
from tabulate import tabulate

for genre in all_genres:
    # .iloc[0] picks the first row after sorting by 'Rating' in descending order.
    genre_df = df_with_ratings[df_with_ratings['Genre'].apply(lambda x: genre in x)].sort_values(by='Rating', ascending=False).iloc[0][['Genre', 'Title', 'Rating']]
    print(f"\nGenre: {genre}\n")
    print(tabulate(genre_df.to_frame().T, headers='keys', tablefmt='github'))

<a id="ch8"></a>
### 8. Visualization.
In this section we will visualize the data using matplotlib and seaborn to better understand it.

In [None]:
# Q7. Which are the top 10 most viewed shows/movies?
top_10_shows = df_with_ratings.sort_values('Hours Viewed', ascending= False).head(10)
top_10_shows

In [None]:
# Visualization of top 10 most viewed shows/movies.
plt.figure(figsize=(10, 6))

sns.barplot(x='Hours Viewed', y='Title', data=top_10_shows)
plt.title('Top 10 Most Viewed Netflix Shows (Jan-Jun 2023)')
plt.xlabel('Hours Viewed (in 100 Millions)')
plt.ylabel('Titles')
plt.show()

In [None]:
# Titles with a 'Number of Ratings' less than 100.
df_with_ratings[df_with_ratings['Number of Ratings'] < 100].shape

We can see here that around 1800 shows/movies were rated by fewer than 100 people, so their ratings may not be reliable because the sample size is too small.

In [None]:
# Top 10 highest-rated movies/series of all time (with more than 1000 ratings).
top_ratted_show = df_with_ratings[df_with_ratings['Number of Ratings'] > 1000].sort_values('Rating', ascending=False).head(10)
top_ratted_show

In [None]:
# Top 10 highest-rated movies/shows released in 2023 (with more than 1000 ratings).
top_ratted_show_23 = df_with_ratings[(df_with_ratings['Number of Ratings'] > 1000) & (df_with_ratings['Release Year'] == 2023)].sort_values('Rating', ascending = False).head(10)
top_ratted_show_23

In [None]:
# Viewership of Top 10 Highest Rated Netflix Shows/Movies w.r.t. Hours Viewed.
plt.figure(figsize=(8, 8))
plt.pie(top_ratted_show_23['Hours Viewed'], labels=top_ratted_show_23['Title'], autopct='%1.1f%%')
plt.title('Distribution of Viewership Among Top 10 Highest Rated Netflix Shows/Movies')
plt.show()

In [None]:
# Visualizing Viewership Across Release Years
plt.figure(figsize=(12, 6))
sns.boxplot(x='Release Year', y='Hours Viewed', data=df_with_ratings)
plt.xlabel('Release Year')
plt.ylabel('Hours Viewed')
plt.title('Viewership Across Release Years')
plt.show()

The box plot suggests that titles with more recent release years tend to have more hours viewed.

In [None]:
# How many movies/shows are available globally?
sns.countplot(x='Available Globally', data=df_with_ratings)
plt.title('Count of Movies/Shows Available Globally')
plt.show()

In [None]:
# Percentage of movies/series available globally.
# Labels come from the index, so each slice is named after its actual value ('Yes' or 'No').
global_counts = df_with_ratings['Available Globally'].value_counts()

plt.figure(figsize=(8, 8))
plt.pie(global_counts, labels=global_counts.index, autopct='%1.1f%%', colors=['#48dbfb', '#9980FA'])
plt.title('Share of Movies/Shows Available Globally')
plt.legend(loc='upper right')
plt.show()

That means about three quarters of all shows or movies are available globally.
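
The share can also be read off as numbers rather than from the chart. A minimal sketch, using a made-up Series in place of the real 'Available Globally' column:

```python
import pandas as pd

# Toy stand-in for the 'Available Globally' column (values are made up).
available = pd.Series(["Yes", "No", "Yes", "Yes"], name="Available Globally")

# normalize=True returns fractions instead of raw counts.
share = available.value_counts(normalize=True).mul(100).round(1)

print(share["Yes"])  # 75.0
print(share["No"])   # 25.0
```

Looking values up by their label ('Yes' or 'No') avoids depending on the order in which `value_counts` returns them.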

In [None]:
# Movies/Shows released timeline.
df_with_ratings['Release Date'] = pd.to_datetime(df_with_ratings['Release Date'], errors='coerce')

plt.figure(figsize=(12, 6))
sns.histplot(df_with_ratings['Release Date'], bins=20, kde=True, color='salmon')
plt.xlabel('Release Date')
plt.ylabel('Frequency')
plt.title('Distribution of Release Dates')
plt.show()

In [None]:
# How many shows/movies per genre.
genre_counts = {}
for genre in all_genres:
    genre_counts[genre] = df_with_ratings['Genre'].apply(lambda x: genre in x).sum()
    
genre_counts

In [None]:
# Most popular genre
most_popular_genre = max(genre_counts, key=genre_counts.get)
most_popular_genre

Drama is the most common genre, with 5847 shows/movies.

In [None]:
# Most watched genre according to Hours Viewed.
import ast

genre_watch_hours = {}

for _, row in df_with_ratings.iterrows():
    genres = row['Genre']
    watch_hours = row['Hours Viewed']

    # 'Genre' holds list syntax as text; ast.literal_eval parses it without running arbitrary code.
    if not isinstance(genres, list):
        try:
            genres = ast.literal_eval(genres)
        except (ValueError, SyntaxError):
            genres = []

    for genre in genres:
        if genre in genre_watch_hours:
            genre_watch_hours[genre] += watch_hours
        else:
            genre_watch_hours[genre] = watch_hours

genre_watch_hours_df = pd.DataFrame(list(genre_watch_hours.items()), columns=['Genre', 'Total Watch Hours'])

most_popular_genre = genre_watch_hours_df.loc[genre_watch_hours_df['Total Watch Hours'].idxmax(), 'Genre']

print(f"The most popular genre is: {most_popular_genre}")
genre_watch_hours_df.sort_values(by = 'Total Watch Hours', ascending=False)
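
The same totals can be computed without a row-by-row loop. This is a sketch on made-up rows, assuming every 'Genre' value is a valid list literal, that uses `explode` and `groupby`:

```python
import ast
import pandas as pd

# Toy rows mimicking the 'Genre' and 'Hours Viewed' columns (values are made up).
toy = pd.DataFrame({
    "Genre": ["['Drama', 'Crime']", "['Drama']", "['Comedy']"],
    "Hours Viewed": [300, 200, 100],
})

hours_by_genre = (
    toy.assign(Genre=toy["Genre"].apply(ast.literal_eval))  # text -> list
       .explode("Genre")                                    # one row per (title, genre)
       .groupby("Genre")["Hours Viewed"]
       .sum()
       .sort_values(ascending=False)
)

print(hours_by_genre.idxmax())   # Drama
print(hours_by_genre["Drama"])   # 500
```

As in the loop above, a title with several genres contributes its full hours to each of them, so the totals across genres add up to more than the overall hours viewed.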

In [None]:
# Visualization of the most watched genres according to Hours Viewed.
plt.figure(figsize=(12, 8))
plt.bar(genre_watch_hours_df['Genre'], genre_watch_hours_df['Total Watch Hours'], color='skyblue')
plt.xlabel('Genre')
plt.ylabel('Total Watch Hours')
plt.title('Total Watch Hours by Genre')
plt.xticks(rotation=90, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Relationship between Hours Viewed and Release Date
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Release Date', y='Hours Viewed', data=df_with_ratings, color='coral')
plt.xlabel('Release Date')
plt.ylabel('Hours Viewed')
plt.title('Relationship between Hours Viewed and Release Date')
plt.show()

In [None]:
# Some shows/movies don't have a release date, so they don't have a release month either.
df_with_month = df_with_ratings[df_with_ratings['Release Month'] != 0]
df_with_month.sample(3)

In [None]:
# Total number of show releases in each month.
sns.countplot(x='Release Month', data=df_with_month)
plt.xlabel('Release Month (number)')
plt.ylabel('Total Releases')
plt.title('Total Number of Show Releases in Each Month')
plt.show()

From this plot we can see that many movies/shows are released in the winter months.

<a id="ch9"></a>
### 9. Conclusion
I didn't use the 'Description' and 'Key Words' columns. If anyone has ideas about how to use those columns, feel free to copy, edit, or give feedback.

**Thank you, have a great day**