# Netflix - Data Analysis 

Netflix is today one of the major players in the entertainment industry. It is a streaming service that offers a wide variety of award-winning TV shows, movies, anime, documentaries and much more on thousands of internet-connected devices. It gives its users access to movies and shows on a single platform for a reasonable price. Netflix became particularly popular during the COVID-19 period, when many people turned to it as a source of entertainment.

For this project, we use a dataset of the movies and TV shows available on Netflix. The data was acquired in May 2022 and covers titles available in the United States. The dataset is a .csv file named titles.csv. By analyzing it, we will try to understand which types of movies and shows Netflix offers, which countries provide the most content, how many releases were recorded in each year since the 1940s, which movies had the best IMDb ratings and which genres are the most common on Netflix.

Dataset source - https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies

The tools that are used for analysis are:

* Pandas
* Numpy
* Matplotlib
* Seaborn


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):  # walk the input directory, not a single file
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Import the required libraries
First we import the Python libraries used in this analysis into our notebook.


In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

## Downloading the Dataset

The titles.csv and credits.csv files have been uploaded to our Jupyter notebook environment. We can load titles.csv and check that the files are in place with the following commands:

In [3]:
titles_raw=pd.read_csv("/kaggle/input/netflix-tv-shows-and-movies/titles.csv")


In [4]:
import os
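# List the dataset directory (os.listdir() without arguments lists the working directory instead)
os.listdir("/kaggle/input/netflix-tv-shows-and-movies")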

## Loading the datasets
To work with the data, we read titles.csv into a pandas DataFrame named 'titles_raw'.

Let's look at a sample of the data.

In [5]:
titles_raw.sample(10)


## Data Preparation and Cleaning

In this section, we will:


* Explore the number of rows & columns, ranges of values etc.
* Handle missing, incorrect and invalid data

This will allow us to understand the data more precisely and make it easier for us to further analyze it.




Let's get the details about the data like data type and missing values.


In [6]:
titles_raw.info()

The file contains a total of 5806 rows and 15 columns. Of the 15 columns, 2 are of type 'int', 5 are of type 'float' and the rest are of type 'object'.
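The same tally of rows, columns and column types can be read off programmatically. The following is a minimal sketch on a small made-up frame: the column names mirror the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for titles_raw; the values are invented
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "release_year": [1999, 2010, 2021],
    "imdb_score": [7.1, None, 6.4],
})

n_rows, n_cols = toy.shape                            # (rows, columns)
dtype_counts = toy.dtypes.astype(str).value_counts()  # number of columns per dtype
print(n_rows, n_cols)
print(dtype_counts)
```

Running the same two lines on `titles_raw` should reproduce the numbers reported by `info()`.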

For better analysis, we will drop the columns: ['tmdb_score','imdb_votes','imdb_id'] as they are of least use to us.



In [7]:
titles_raw=titles_raw.drop(['tmdb_score','imdb_votes','imdb_id'],axis=1)

By looking at the sample, we can see that there are multiple NaN values in the columns 'seasons', 'imdb_score' and 'tmdb_popularity'. To perform operations on this dataset, we must handle these NaN values.

In [8]:
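# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection
titles_raw['seasons']=titles_raw['seasons'].fillna(0)
titles_raw['imdb_score']=titles_raw['imdb_score'].fillna(0)
titles_raw['tmdb_popularity']=titles_raw['tmdb_popularity'].fillna(0)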
titles_raw
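An equivalent way to fill several columns in one step is to pass a dictionary of per-column defaults to `DataFrame.fillna`. This is a small sketch on invented values, not the notebook's own data:

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns; the values are invented
toy = pd.DataFrame({
    "seasons": [1.0, np.nan, 3.0],
    "imdb_score": [np.nan, 6.5, 8.0],
    "tmdb_popularity": [12.3, np.nan, 4.0],
})

# Fill each listed column with its own default and assign the result back
toy = toy.fillna({"seasons": 0, "imdb_score": 0, "tmdb_popularity": 0})
print(toy.isnull().sum())
```

Keep in mind that a filled-in 0 is counted as a real score in later averages.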

The columns 'imdb_score', 'seasons' and 'tmdb_popularity' have the data type 'float'. We convert 'seasons' and 'tmdb_popularity' to integer type and keep 'imdb_score' as a float, since ratings are fractional.

In [9]:
titles_raw['seasons']=titles_raw['seasons'].astype(int)
titles_raw['tmdb_popularity']=titles_raw['tmdb_popularity'].astype(int)


In [10]:
titles_raw.info()

All columns of type 'float' have now been converted to 'integer', except for the column 'imdb_score'.

Now we will see the number of missing values in our dataset

In [11]:
titles_raw.isnull().sum()


Let's look at how many production countries are involved in netflix's content creation.

In [12]:
titles_raw.production_countries.value_counts()
titles_raw.production_countries.nunique()

#### There are 449 different combinations of production countries, which contribute to the vast variety of content on Netflix.

Since the columns 'production_countries' and 'genres' are stored as string representations of lists that can hold several values per row, we unpack them and keep a single value per row for simplicity: an empty list becomes NaN, a single-element list yields its element, and a longer list is reduced to one randomly chosen element.
This makes it easier to categorize the movies by genre and production country.

In [13]:
import ast
import random

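def repair_array_bound_categories(arr):
    """Reduce a stringified list such as "['drama', 'crime']" to a single value."""
    # Parse the string into a real Python list
    arr = ast.literal_eval(arr)

    if len(arr) == 0:
        # No value recorded
        return np.nan
    elif len(arr) == 1:
        return arr[0]
    else:
        # Several values: keep one at random (results vary between runs unless random.seed is set)
        return random.choice(arr)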

In [14]:
titles_raw["genres"] = titles_raw["genres"].apply(repair_array_bound_categories)


In [15]:
titles_raw["production_countries"] = titles_raw["production_countries"].apply(repair_array_bound_categories)
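Keeping one random value per row discards information and gives slightly different counts on every run. As an alternative sketch (not the approach used in this notebook), `DataFrame.explode` keeps every value by giving each one its own row. The toy data below is invented:

```python
import ast
import pandas as pd

# Toy frame with genres stored as stringified lists, as in titles.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["['drama', 'crime']", "['comedy']", "[]"],
})

# Parse each string into a real list, then give each genre its own row
toy["genres"] = toy["genres"].apply(ast.literal_eval)
long_form = toy.explode("genres").reset_index(drop=True)
print(long_form)
```

With this layout a title is counted once per genre it belongs to, so genre totals add up to more than the number of titles.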

In [16]:
titles_raw

We have successfully unpacked and cleaned the data from both columns.

Netflix offers genre selection, based on which it recommends different shows and movies to its users. Let's take a look at the different genres that Netflix lets its users choose from.

In [17]:
titles_raw.genres.unique()


#### There are a total of 19 different genres available on Netflix to choose from.

## Exploratory Analysis and Visualization

In this section, we will :
 * Explore relationships between columns and rows present in the dataset by visualizing them using the matplotlib and seaborn libraries in Python.
 * Compute the mean, sum, range and other interesting statistics for numeric columns
 * Explore distributions of numeric columns using histograms etc.
 * Make a note of interesting insights from the exploratory analysis


Let's begin by importing `matplotlib.pyplot` and `seaborn`.

In [18]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 15
matplotlib.rcParams['figure.figsize'] = (18, 8)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Let's look at some general statistics for this dataset.

In [19]:
titles_raw.describe()

## Let's note a few important insights from the statistics above:

#### * The total number of movies/shows in this dataset is about 5,800.
#### * The release years in this dataset range from 1945 to 2022.
#### * The maximum runtime recorded for a movie or show is 251 min.
#### * The maximum number of seasons recorded for a TV show on Netflix is 42.
#### * The average IMDb score of the shows and movies on Netflix is less than 6. Missing scores were filled with 0 earlier, which pulls this average down.


## Production Countries

The countries that contributed content to Netflix are:

In [20]:
titles_raw.production_countries.unique()

Now let's have a look at the distribution of content based on countries.

In [21]:
shows_countries= titles_raw.production_countries.value_counts()
shows_countries = pd.DataFrame(shows_countries)
shows_countries.head(15)


#### These are the top 15 countries that have produced the most content for Netflix.

 Now let's illustrate this distribution in the form of a pie chart.

In [22]:
labels = ['US','IN','JP','GB','KR','ES','FR','CA','MX','BR','PH','TR','NG','DE','AU']
values = [2115,614,300,276,212,178,162,152,110,96,92,82,80,80,73]

In [23]:
my_explode=[0.05,0.05,0.05,0,0,0,0,0,0,0,0,0,0,0,0]
plt.pie(values,labels=labels,autopct='%1.1f%%',shadow=True,explode=my_explode)
plt.axis('equal')
plt.legend(title='production countries')
plt.show()
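Because one country per title is picked at random during cleaning, the counts can change slightly between runs, so hard-coded labels and values can drift out of date. A minimal sketch of deriving both lists from `value_counts()` instead, using invented country codes:

```python
import pandas as pd

# Invented country codes standing in for titles_raw.production_countries
countries = pd.Series(["US", "US", "US", "IN", "IN", "JP"])

top = countries.value_counts().head(2)   # counts sorted from largest to smallest
labels = top.index.tolist()
values = top.tolist()
print(labels, values)
```

The resulting `labels` and `values` can be passed to `plt.pie` exactly like the hand-written lists.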

#### From the above chart, it is clear that most of the shows/movies on Netflix are produced by the U.S.A.

#### However, the percentages in the pie chart are not accurate, because the chart covers only the top 15 countries and leaves out the rest.

#### To get a more accurate figure for the share of content produced by the USA, we will include all the countries in our calculation.

In [24]:
# maximum content produced by a single country(USA)
max_content = titles_raw.groupby('production_countries').title.count().max() 

# total content produced by all the countries collectively
total_content = titles_raw.groupby('production_countries').title.count().sum() 


In [25]:
max_percent = max_content / total_content * 100
print('USA produces ',max_percent, '% of the total content on netflix')

#### Almost 38% of the content available on Netflix has been produced by the USA.
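The share of every country can also be obtained in one step with `value_counts(normalize=True)`, which divides each count by the number of non-missing rows. A small sketch on invented data:

```python
import pandas as pd

# Invented country codes; None stands for a title without a recorded country
countries = pd.Series(["US", "US", "IN", "JP", None])

# normalize=True returns fractions of the non-missing rows instead of raw counts
shares = countries.value_counts(normalize=True) * 100
print(shares.round(1))
```

As in the groupby-based calculation, titles without a country are left out of the total.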

## Genres

Let's look at the content distribution on Netflix by genre.

In [26]:
sns.countplot(x='genres',data=titles_raw)
plt.xticks(rotation=90)
plt.title('Number of Movies/Shows based on Genre')



From the generated countplot, we can conclude that:

#### * Most of the content on Netflix falls under 'Drama' and 'Comedy'.
#### * The 'western' category has the fewest movies or shows.
#### * Netflix provides almost 600 documentaries, which is more than the combined content in 'horror', 'sci-fi', 'sport' and 'reality'.

## Age Certification

To work with the 'age_certification' column, we first look at the NaN values present in it.

In [27]:
# fillna() returns a new Series; the result is not assigned here, so titles_raw is unchanged
titles_raw.age_certification.fillna(0)
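In general, `fillna` only changes a column when its result is assigned back. The sketch below shows the difference on a toy frame; the `"UNRATED"` label is a hypothetical placeholder, not a value from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"age_certification": ["TV-MA", np.nan, "R"]})

toy["age_certification"].fillna("UNRATED")      # result discarded, toy is unchanged
missing_before = toy["age_certification"].isna().sum()

# Assigning the result back is what actually changes the column
toy["age_certification"] = toy["age_certification"].fillna("UNRATED")
missing_after = toy["age_certification"].isna().sum()
print(missing_before, missing_after)
```

Whether to persist such a placeholder is a design choice: it would appear as an extra category in any count plot, whereas rows left as NaN are skipped by `groupby` and by seaborn's count plots.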

In [28]:
certifications=titles_raw.groupby('age_certification').title.count()
certifications=pd.DataFrame(certifications)
certifications

In [29]:
sns.catplot(x='age_certification',kind="count",data=titles_raw)
plt.xticks(rotation=55)
plt.title('Content based on Age certification')

From the stats above, we can summarize the following:
#### * It appears that most of the content shown on Netflix is suitable only for adults.
#### * More than 800 shows/movies fall under the 'TV-MA' category, which is intended for mature audiences only.
#### * About 600 shows/movies are R-rated (Restricted), which means viewers under 17 may watch only when accompanied by a parent or guardian.
#### * Only a handful of shows/movies are rated 'NC-17', which means no one aged 17 or under is admitted.

## Release year

#### Netflix contains movies and shows released from the 1940s onward. Let's take a look at the number of movies/shows released each year since then.

In [30]:
x2=titles_raw.groupby('release_year').title.count()
x2=pd.DataFrame(x2)
count_movies=x2['title'].to_numpy()
years=x2.index.to_numpy()  # sorted years from the groupby, in the same order as count_movies
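When year labels are paired with grouped counts, both have to be in the same order. `groupby` sorts its keys, while `unique()` returns values in order of first appearance, so taking both arrays from the same grouped object avoids a mismatch. A minimal sketch on invented rows:

```python
import pandas as pd

# Invented rows; release years deliberately not in sorted order
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release_year": [2020, 1999, 2020, 2010],
})

per_year = toy.groupby("release_year")["title"].count()  # index sorted ascending
years = per_year.index.to_numpy()    # labels taken from the same object as the counts
counts = per_year.to_numpy()
print(list(zip(years, counts)))
print(toy["release_year"].unique())  # order of appearance, not sorted
```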


In [31]:
sns.pointplot(x=years,y=count_movies,color='r')
plt.xticks(rotation=90)
plt.title('Number of Movies/shows released each year')
plt.xlabel('years')
plt.ylabel('Number of movies/shows')




#### * Only a handful of the movies and shows on Netflix were released between the 1940s and 2010.
#### * It is only after 2011 that the number of titles per release year rises sharply.
#### * The maximum amount of content was released in 2019, just before the global COVID-19 pandemic.
#### * The industry saw a dip in content production in 2021; however, it quickly picked up pace the very next year.

## Asking and Answering Questions

Here, we will try to answer a few general questions that arise after analyzing the data. We will use libraries like matplotlib, numpy and seaborn to illustrate the data more precisely.



## Q1: What is the average imdb-rating for movies/shows of each genre?

In [32]:
genre_imdb=titles_raw.groupby('genres').imdb_score.describe()
genre_imdb=genre_imdb.reset_index()
# sort_values returns a new DataFrame, so assign it back to keep the ordering
genre_imdb=genre_imdb.sort_values('mean',ascending=True)
genre_imdb[['genres','mean','std']]




In [33]:
sns.barplot(x='mean',y='genres',data=genre_imdb)
plt.title('Average imdb rating for each genre')
plt.xlabel('Average imdb')
plt.ylabel('Genres')

#### The barplot above depicts the average IMDb rating for the movies and shows of each genre:
#### * The genre with the highest rating is 'history', with a mean IMDb score of 7.2.
#### * The lowest-rated genre has a mean IMDb score of about 5.2.
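A compact way to build a per-genre table like the one in Q1 is to aggregate only the statistics of interest and sort in the same chain. This is a sketch on invented scores, not the notebook's output:

```python
import pandas as pd

# Invented scores for three genres
toy = pd.DataFrame({
    "genres": ["drama", "drama", "comedy", "history"],
    "imdb_score": [6.0, 8.0, 5.0, 7.5],
})

genre_stats = (
    toy.groupby("genres")["imdb_score"]
       .agg(["mean", "std", "count"])
       .sort_values("mean", ascending=False)   # returns a new, sorted frame
       .reset_index()
)
print(genre_stats)
```

Including `count` next to `mean` helps when judging how much weight a genre's average deserves.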



## Q2: Illustrate the distribution of movies/shows produced in each country on the basis of genres.

In [34]:
diff_genres=titles_raw.genres.unique()
diff_genres


In [35]:
# Count titles for every (genre, country) pair
x=titles_raw.groupby(['genres','production_countries']).title.count().reset_index()
x=x.set_index('production_countries')
# drop_duplicates ignores the index, so this removes every row whose
# (genre, count) pair also occurs for another country
x=x.drop_duplicates(keep=False)
x=x.reset_index()

In [40]:
result=x.pivot(index='production_countries',columns='genres',values='title')


In [41]:
result=result.fillna(0)
result=result.astype(int)
result
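A country-by-genre table of counts can also be built directly with `pd.crosstab`, which fills absent combinations with 0 and returns integers. The sketch below uses invented rows and keeps every combination, so it is an alternative to the steps above rather than a reproduction of them:

```python
import pandas as pd

# Invented rows standing in for titles_raw
toy = pd.DataFrame({
    "production_countries": ["US", "US", "IN", "JP", "US"],
    "genres": ["drama", "comedy", "drama", "animation", "drama"],
})

# Rows are countries, columns are genres, cells are title counts (0 where absent)
table = pd.crosstab(toy["production_countries"], toy["genres"])
print(table)
```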

In [42]:
plt.figure(figsize=(12, 16))  # create the figure first so the size applies to the heatmap

sns.heatmap(result,vmin=0,vmax=300,fmt="d", annot=True, annot_kws={"fontsize":13},cmap='inferno')
plt.title('Number of Movies/Shows')


## Q3: Classify the data on the basis of its type(movie/show) and find out which one is the most popular.

In [43]:
imdb = titles_raw.groupby('type').imdb_score.describe()
imdb = imdb[['count','mean']]
imdb = imdb.reset_index()
imdb

In [44]:
tmdb = titles_raw.groupby('type').tmdb_popularity.describe()
tmdb = tmdb[['count','mean']]
tmdb = tmdb.reset_index()
tmdb

In [45]:
fig, axes = plt.subplots(1, 2,sharex=True, figsize=(12, 6))
fig.suptitle('Movies vs Shows',color='r')
# No legend is needed: the x-axis already labels each bar by type
axes[0].set_title('Average imdb rating')
sns.barplot(y='mean',x='type',data=imdb, ax=axes[0])
axes[1].set_title('Average tmdb popularity score')
sns.barplot(y='mean',x='type',data=tmdb,ax=axes[1])



## Q4: Depict the distribution of Movies/shows on the basis of their genre.

In [46]:
titles_raw.groupby(['genres','type']).title.count()
sns.countplot(x='genres',hue='type',data=titles_raw)
plt.xticks(rotation=90)
plt.title('Distribution of Movies/Shows based on type & genre')

#### From the given plot, we can conclude that:

#### * The largest numbers of movies and shows available on Netflix are in the 'drama' and 'comedy' categories.
#### * The least common genres appear to be 'western' and 'war'.
#### * Shows outnumber movies in only 3 genres: 'animation', 'sci-fi' and 'reality'.


## Q5: What is the average runtime of a movie or a show? Calculate the total runtime for all the movies and depict a distribution for the number of seasons in the shows.

In [47]:
avg_runtime = titles_raw.groupby(['type']).runtime.mean()
avg_runtime =pd.DataFrame(avg_runtime)
avg_runtime

#### The average runtime of a movie on netflix is about 100 minutes , whereas the average runtime of a show is nearly 40 minutes.

In [48]:
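# Count shows per number of seasons; naming both columns explicitly keeps
# the result independent of the pandas version
shows = titles_raw.seasons.value_counts().rename_axis('Seasons').reset_index(name='Number of shows')
# seasons == 0 marks movies (their NaN was filled with 0 earlier), so drop that row
shows = shows[shows['Seasons'] != 0]
shows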

#### Most of the shows on Netflix never went past the first season.
#### Only 374 shows have two seasons, whereas only 181 and 116 shows have three and four seasons respectively.

In [49]:
titles_raw.loc[titles_raw['seasons']==42].title

#### One particular show, 'Survivor', has a total of 42 seasons, which is the maximum for any show on Netflix.

In [50]:
sns.pointplot(x='Seasons',y='Number of shows',data=shows,color='r')

Now let's calculate the total runtime of all the movies available on Netflix.

In [51]:
movies_runtime = titles_raw.loc[titles_raw['type']=='MOVIE'].runtime.sum()
movies_runtime = movies_runtime/60  #converting in hours


In [52]:
print(movies_runtime)

#### Total runtime of all the movies on netflix since 1945 is 6128.5 hours
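The calculation behind this figure is a filtered sum followed by a unit conversion. A minimal sketch with invented runtimes shows the arithmetic:

```python
import pandas as pd

# Invented runtimes in minutes
toy = pd.DataFrame({
    "type": ["MOVIE", "SHOW", "MOVIE"],
    "runtime": [90, 45, 120],
})

# Sum the runtime of movies only, then convert minutes to hours
movie_minutes = toy.loc[toy["type"] == "MOVIE", "runtime"].sum()
movie_hours = movie_minutes / 60
print(movie_minutes, movie_hours)
```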

## Inferences and Conclusion

In this EDA, we took a closer look at Netflix titles released from the 1940s to 2022. The dataset contains the key information about the movies and TV shows available on the Netflix platform. We analyzed it by categories such as genre, release year, age certification, type and runtime, and found some useful and interesting insights, a few of which are noted below:

* Netflix is a very popular streaming service with a lot of content for its users to choose from. The dataset contains about 5,800 movies/shows, released between the 1940s and 2022.

* A total of 97 different countries contribute content to Netflix. The top contributors include the U.S.A, India, the U.K and Japan. In fact, the U.S.A alone contributes about 38% of the total content available on Netflix. No wonder Hollywood is such a big industry.

* Netflix offers a total of 19 different genres of movies and shows to choose from. The most common are 'Drama' and 'Comedy', which together account for more than 2300 titles on Netflix. The least common genres appear to be 'western' and 'war'. To our surprise, Netflix provides about 600 documentaries, which is more than the combined content in 'horror', 'sci-fi', 'sport' and 'reality'.

* There are a total of 11 different age certification categories that the movies and shows on Netflix fall under. More than 800 shows/movies fall under the 'TV-MA' category, which is intended for mature audiences only. About 600 shows/movies are R-rated (Restricted), which means viewers under 17 may watch only when accompanied by a parent or guardian. Only a handful of shows/movies are rated 'NC-17', which means no one aged 17 or under is admitted.

* Only a handful of the movies and shows on Netflix were released between the 1940s and 2010. It is only after 2011 that the number of titles per release year rises sharply. More than 800 movies/shows were released in the period 2019-2020, and about 700 in the years 2018 and 2022.

* We also compared the average IMDb ratings and TMDB popularity of movies and shows. It turns out that shows on Netflix are more popular than movies.

* The average runtime of a movie on Netflix is about 100 minutes, whereas the average runtime of a show is nearly 40 minutes.

* Most TV shows run for only a single season; a few hundred extend to 2, 3 or 4 seasons. One particular show, 'Survivor', has a total of 42 seasons, which is the most for any show on Netflix.



## Future Work

In addition to this data, we can merge another dataset containing information about the cast and directors who worked on the films and shows. We can then analyze which actors appear most often on screen and which directors have the highest success rate in the film and TV industry.
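Such a merge could look roughly like the sketch below. The column names `id`, `name` and `role` and all values are assumptions made for illustration; the actual layout of credits.csv should be checked before adapting it.

```python
import pandas as pd

# Hypothetical stand-ins for titles.csv and credits.csv (invented columns and values)
titles = pd.DataFrame({"id": ["t1", "t2"], "title": ["A", "B"]})
credits = pd.DataFrame({
    "id": ["t1", "t1", "t2"],
    "name": ["Actor X", "Director Y", "Actor X"],
    "role": ["ACTOR", "DIRECTOR", "ACTOR"],
})

# Attach every credit to its title, then count appearances per actor
merged = titles.merge(credits, on="id", how="inner")
actor_counts = merged.loc[merged["role"] == "ACTOR", "name"].value_counts()
print(actor_counts)
```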


## References

The following resources were used in this analysis for the dataset and tools:

* Dataset: https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies
* Pandas guide: https://pandas.pydata.org/docs/user_guide/index.html
* Matplotlib guide: https://matplotlib.org/3.3.1/users/index.html
* Seaborn guide & tutorial: https://seaborn.pydata.org/tutorial.html

