# An analysis of Netflix Movies and TV shows using Pandas and Seaborn


Welcome.
This is an analysis of Netflix movies and TV shows spanning many decades. In this analysis we will explore various charts and insights using data collected from Kaggle.
Kaggle is a widely used platform for sharing datasets. We will find out how various countries have contributed to the range of titles available on Netflix and which directors feature most frequently, among other things.
We will be using various functions from the pandas, NumPy, Matplotlib and Seaborn libraries that we learned about in the [Data Analysis with Python: Zero to Pandas](zerotopandas.com) course. Let's jump in.

### How to run the code

This is an executable [*Jupyter notebook*](https://jupyter.org) hosted on [Jovian.ml](https://www.jovian.ml), a platform for sharing data science projects. You can run and experiment with the code in a couple of ways: *using free online resources* (recommended) or *on your own computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of this page, and select "Run on Binder". This will run the notebook on [mybinder.org](https://mybinder.org), a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle".


#### Option 2: Running on your computer locally

1. Install Conda by [following these instructions](https://conda.io/projects/conda/en/latest/user-guide/install/index.html). Add Conda binaries to your system `PATH`, so you can use the `conda` command on your terminal.

2. Create a Conda environment and install the required libraries by running these commands on the terminal:

```
conda create -n zerotopandas -y python=3.8 
conda activate zerotopandas
pip install jovian jupyter numpy pandas matplotlib seaborn opendatasets --upgrade
```

3. Press the "Clone" button above to copy the command for downloading the notebook, and run it on the terminal. This will create a new directory and download the notebook. The command will look something like this:

```
jovian clone notebook-owner/notebook-id
```



4. Enter the newly created directory using `cd directory-name` and start the Jupyter notebook.

```
jupyter notebook
```

You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook file (it has a `.ipynb` extension) to open it.


## Downloading the Dataset

Our dataset can be found on Kaggle, a very popular platform for datasets. We will use the `opendatasets` library to download the dataset and then use the `os` library to list the downloaded files.

In [None]:
!pip install jovian opendatasets --upgrade --quiet

Let's begin by downloading the data, and listing the files within the dataset.

In [None]:
dataset_url = 'https://www.kaggle.com/datasets/shivamb/netflix-shows'

In [None]:
import opendatasets as od
od.download(dataset_url)

The dataset has been downloaded and extracted.

In [None]:
data_dir = './netflix-shows'

In [None]:
import os
os.listdir(data_dir)

Let us save and upload our work to Jovian before continuing.

In [None]:
project_name = "an-analysis-of-netflix-movies-and-tv-shows-using-pandas-and-seaborn"

In [None]:
!pip install jovian --upgrade -q

In [None]:
import jovian

In [None]:
jovian.commit(project=project_name, files=['netflix-shows'])

## Data Preparation and Cleaning

Now that we have downloaded the dataset we can get to work on cleaning it and looking through it. This is where we will address any issues regarding missing data, incorrect data and invalid data as well as perform any additional steps. This is a key step because in order to get accurate results you want to make sure you are working with accurate data from the beginning.



In [None]:
import pandas as pd
import seaborn as sns

Now that we have imported the relevant libraries, we will use pandas to read the CSV file from the dataset directory.

In [None]:
netflix_df = pd.read_csv('netflix-shows/netflix_titles.csv')

Now let's take a look at the data frame.

In [None]:
netflix_df

First off, let's take a look at the basic information about the data. This includes the number of rows and columns, the number of non-null elements in each column and the data type of each column. This basic information tells us which areas of the data to look at more closely while cleaning.

In [None]:
netflix_df.info()

The data frame has 8807 entries, so any column with a lower non-null count contains missing values. Director, cast, country and date added clearly stand out.

In [None]:
netflix_df.shape

The `shape` attribute simply returns the number of rows and columns in the data frame.

Let's start our data cleaning by looking at the dates. The `date_added` column is read as an object (string) data type by default, so let's convert it to a datetime format.

In [None]:
netflix_df.date_added

In [None]:
# Strip stray whitespace before parsing; some date strings may carry leading spaces
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'].str.strip())

In [None]:
netflix_df.date_added
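
Date strings in downloaded datasets often carry stray whitespace or missing values, which can trip up parsing. The sketch below uses a small made-up Series rather than the real file, and assumes the dates follow a `Month day, year` pattern. Passing `errors="coerce"` turns anything unparseable into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample values, assumed to mimic the format of `date_added`
raw_dates = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip surrounding whitespace, then parse with an explicit format;
# unparseable or missing entries become NaT rather than raising an error
parsed_dates = pd.to_datetime(raw_dates.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed_dates)
```

Counting the `NaT` values afterwards (`parsed_dates.isna().sum()`) is a quick way to see how many rows failed to parse.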

In [None]:
netflix_df['date_added']

Now let's drop any columns that we deem irrelevant or know will not be useful to our analysis. For us this is the show ID column.

In [None]:
netflix_df.drop('show_id',inplace=True,axis=1)

In [None]:
netflix_df

Now let us check for null data in each column.

In [None]:
netflix_df.isnull().sum()
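
Raw null counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the column values are invented for illustration) that puts both side by side.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Netflix data
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["US", "IN", None, "UK"],
})

# Null count and percentage of rows missing, per column
missing_summary = pd.DataFrame({
    "nulls": toy_df.isnull().sum(),
    "percent": (toy_df.isnull().mean() * 100).round(1),
}).sort_values("nulls", ascending=False)

print(missing_summary)
```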

Let us remove the nulls that will not affect the quality of the data. We want to be careful not to remove so much data that it alters the outcome of our analysis. In this case we drop the rows that are missing `date_added`, `rating` or `country`. We leave the other columns as they are, including `director`, which has over 2000 null values.

In [None]:
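# .copy() gives an independent frame, so adding columns later does not trigger SettingWithCopyWarning
netflix_clean_df = netflix_df.dropna(subset=['date_added', 'rating', 'country']).copy()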

In [None]:
netflix_clean_df.shape

In [None]:
netflix_df.shape
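
Comparing the two shapes by eye works, but the loss can also be computed directly. This sketch uses an invented toy frame and hypothetical variable names to show how many rows a `dropna` with a `subset` removes, and that nulls outside the subset are kept.

```python
import pandas as pd

# Hypothetical toy frame with nulls in several columns
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "rating": ["PG", None, "TV-MA", "R", "PG-13"],
    "country": ["US", "IN", None, "UK", "US"],
    "director": [None, "X", "Y", None, "Z"],
})

# Drop rows missing any of the chosen columns; nulls in `director` are kept
toy_clean = toy_df.dropna(subset=["rating", "country"]).copy()

rows_removed = len(toy_df) - len(toy_clean)
share_removed = rows_removed / len(toy_df)
print(f"Removed {rows_removed} of {len(toy_df)} rows ({share_removed:.0%})")
```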

After removing our null values we can see that the adjusted data frame did not shrink by much.

We can go ahead and create separate data frames for movies and TV shows. We are doing this because we will explore these categories separately to an extent.

In [None]:
movies = netflix_clean_df[netflix_clean_df.type == 'Movie']
shows = netflix_clean_df[netflix_clean_df.type == 'TV Show']

## Exploratory Analysis and Visualization
Now that we have cleaned the data, we can explore it further to find more insights. We will compute the mean, range and other interesting statistics for numeric columns. We will apply various visualisations to help explain the data.

Let us use the `describe` function to get the descriptive statistics for both the movie and TV show data frames. By default pandas applies this function to numeric columns, and the one we care about here is the release year column.

In [None]:
movies.describe()

The descriptive statistics tell us that there are 5,690 movies in the Netflix data. The mean release year is 2012, so the average movie in the catalogue is fairly recent. This could imply that demand for movies from the years around 2012 is high. However, the standard deviation is high, which indicates that release years are spread widely rather than tightly concentrated around the mean.

We can also tell that the earliest release year is 1942. The most recent release date is 2021.

In [None]:
shows.describe()

The descriptive statistics tell us that there are 2,274 TV shows in the Netflix data. The mean release year is 2016, so TV shows in the catalogue are on average more recent than movies. This could imply that demand for TV shows from the years around 2016 is high. The standard deviation of 5 is a little high, but small enough to indicate that the data is moderately concentrated around the mean.
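
The two `describe` calls can also be combined into a single grouped summary, which makes the movie and TV show figures easier to compare. The sketch below runs on made-up release years. It also prints the median (the `50%` row), which is less sensitive than the mean to a handful of very old titles.

```python
import pandas as pd

# Hypothetical toy data: a few release years per type
toy_df = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show"],
    "release_year": [1990, 2015, 2019, 2016, 2020],
})

# One row of summary statistics per type, computed in a single call
year_stats = toy_df.groupby("type")["release_year"].describe()
print(year_stats[["count", "mean", "50%", "std"]])
```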

Next we are going to check for duplicates in the dataframe.

In [None]:
netflix_clean_df['title'].duplicated().sum()

The data has no repeated titles, which is a very good thing considering all the analysis we have conducted already.
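
What counts as a duplicate depends on the columns compared. As a rough illustration on invented rows, the sketch below contrasts a title-only check with a stricter title-and-type check, and uses `keep=False` to list every row in a duplicated group for inspection.

```python
import pandas as pd

# Hypothetical rows: "Alpha" exists both as a movie and as a TV show
toy_df = pd.DataFrame({
    "title": ["Alpha", "Beta", "Alpha", "Gamma"],
    "type": ["Movie", "Movie", "TV Show", "Movie"],
})

# Same title regardless of type
title_dupes = toy_df["title"].duplicated().sum()

# Same title AND same type: a stricter definition of a duplicate
strict_dupes = toy_df.duplicated(subset=["title", "type"]).sum()

# keep=False marks every member of a duplicated group, handy for inspection
print(toy_df[toy_df["title"].duplicated(keep=False)])
print(title_dupes, strict_dupes)
```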

In [None]:
import jovian

In [None]:
jovian.commit()

## Import `matplotlib.pyplot` and `seaborn`.
Let us import the `matplotlib.pyplot` and `seaborn` libraries so that we can plot a few graphs.

In [None]:
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

Let's explore the release year column by plotting a graph.

In [None]:
netflix_clean_df.head(5)

In [None]:
netflix_clean_df.describe()

We use the `describe` function once again, but this time it applies to all the titles instead of the movies and TV shows separately.

We can tell that the count matches the number of rows we got from the `shape` attribute earlier.

In [None]:
titles_per_year = netflix_clean_df['release_year'].value_counts()

In [None]:
titles_per_year

We use the `value_counts` function to count the number of times each value in a column appears. In other words, it tells us how many titles were released each year.

Next let us create a data frame of the titles per year using `pd.DataFrame`.

In [None]:
titles_per_year = pd.DataFrame(titles_per_year)

In [None]:
titles_per_year

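Now let's rename the column to properly reflect the data in that column.

In [None]:
# Set the name of the single count column directly; this works whether pandas
# named it 'release_year' (older versions) or 'count' (pandas 2.x)
titles_per_year.columns = ['total_shows']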
titles_per_year
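
The count, conversion and rename steps can also be chained into one expression. This is a sketch on a hypothetical handful of years, not the notebook's own method. It names both resulting columns explicitly, so it does not depend on how a given pandas version labels the output of `value_counts`.

```python
import pandas as pd

# Hypothetical release years for six titles
release_years = pd.Series([2019, 2020, 2019, 2018, 2020, 2020], name="release_year")

titles_per_year_demo = (
    release_years.value_counts()
    .sort_index()                       # chronological order
    .rename_axis("release_year")        # name the index explicitly
    .reset_index(name="total_shows")    # index becomes a column, counts become 'total_shows'
)
print(titles_per_year_demo)
```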

Let's plot a simple line graph to show the trend of releases per year.

In [None]:
sns.lineplot(data = titles_per_year)
plt.title("Netflix titles by release year (1940-2021)")
plt.xlabel('Year released')
plt.ylabel('Total Shows on Netflix')
plt.show()


Now let's take a closer look at the years from 2010 onwards. From the plot above we can tell that most of the activity is concentrated in that range.

In [None]:
sns.lineplot(data = titles_per_year)
plt.xlim(2010, 2021)
plt.title("Netflix titles by release year (2010-2021)")
plt.xlabel('Year released')
plt.ylabel('Total Shows on Netflix')
plt.show()

Now that we have isolated the shows released from 2010 and beyond we get a clear picture of the activity. We see a clear spike in number of releases per year starting from 2014, peaking at 2018 and slowly declining.
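
Setting the x-axis limits only changes what is drawn. If the zoomed-in numbers are needed for further calculations, the data itself can be sliced instead. A minimal sketch, assuming yearly counts indexed by release year (the values here are invented):

```python
import pandas as pd

# Hypothetical yearly counts indexed by release year
yearly = pd.DataFrame(
    {"total_shows": [5, 8, 20, 35, 30]},
    index=pd.Index([2005, 2009, 2012, 2018, 2021], name="release_year"),
)

# Label-based slicing on a sorted index includes both endpoints
recent = yearly.sort_index().loc[2010:2021]
print(recent)
```

Sorting first matters because the counts come back from `value_counts` ordered by frequency, not by year.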

## Movies versus TV Shows

Now let's explore movies and TV shows again. This time let's take a count of them and plot a bar graph to show the difference in the total number of movies compared to TV shows.

Using the value_counts function we can easily get the totals.

In [None]:
type_count = netflix_clean_df['type'].value_counts()
type_count = pd.DataFrame(type_count)
# Name the single count column explicitly (works across pandas versions)
type_count.columns = ['total_type']
type_count

Now that we have the total types we can go ahead and plot our bar graph using seaborn.

In [None]:
sns.barplot(x= type_count.index, y = 'total_type', data = type_count)
plt.title("Netflix titles by type (1940-2021)")
plt.xlabel('Types')
plt.ylabel('Total Shows on Netflix')
plt.show()
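
Alongside the raw totals, the split can be expressed as percentages. The sketch below uses four invented titles to show how `normalize=True` turns the counts into shares.

```python
import pandas as pd

# Hypothetical type labels for four titles
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

# normalize=True returns proportions instead of raw counts
type_share = (types.value_counts(normalize=True) * 100).round(1)
print(type_share)
```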

## Titles by Rating

Next let's explore the ratings column. We can extract how many titles are in each category. First let's display our data frame again for easy reference. We will only show a couple of rows.

In [None]:
netflix_clean_df.head(2)

In [None]:
jovian.commit()

Let's use the `value_counts` function again to get the total number of titles under each rating category. We will call the result `clean_rating`.

In [None]:
clean_rating = netflix_clean_df.rating.value_counts()

In [None]:
clean_rating

Now we can see that the count column does not have a descriptive name, so let's use the `set_axis` function to add one.

In [None]:
clean_rating = pd.DataFrame(clean_rating)
# set_axis returns a new frame; the `inplace` argument was removed in pandas 2.0
clean_rating = clean_rating.set_axis(['total_number'], axis=1)
clean_rating

Great. Now we notice a problem at the bottom: there is runtime data in the rating column. Each of these stray values appears only once, so we keep only the ratings that occur more than once to stop them affecting our analysis.

In [None]:
clean_rating = clean_rating[clean_rating['total_number'] > 1]
clean_rating
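
Filtering on a count above 1 would also discard any genuine rating that happens to appear only once. One possible alternative, sketched here on made-up values and assuming the stray entries look like `"74 min"`, is to match the runtime pattern itself.

```python
import pandas as pd

# Hypothetical rating values, including two runtimes that ended up in the wrong column
ratings = pd.Series(["TV-MA", "PG-13", "74 min", "TV-MA", "NC-17", "84 min"], name="rating")

# Flag values that look like a duration, e.g. "74 min"
looks_like_duration = ratings.str.fullmatch(r"\d+ min", na=False)

# Count only the values that are not runtimes
rating_counts = ratings[~looks_like_duration].value_counts()
print(rating_counts)
```

With this approach a rating that occurs once, such as `NC-17` in the toy data, stays in the result.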

Let's go ahead and plot another bar graph to show the number of titles in each rating category.

In [None]:
sns.barplot(x= clean_rating.index, y = 'total_number', data =clean_rating)
plt.title("Netflix titles by rating (1940-2021)")
plt.xlabel('Rating')
plt.ylabel('Total Shows on Netflix')
plt.show()

## Top 10 Countries

Let's explore the countries column now. An interesting insight would be the top 10 countries that have contributed to the Netflix catalogue over the years.

In [None]:
countries = netflix_clean_df['country'].value_counts().head(10)

In [None]:
countries = pd.DataFrame(countries)
# Name the single count column explicitly (works across pandas versions)
countries.columns = ['total_number']
countries

We can see how powerful the `value_counts` function is in data analysis. Now let's plot a bar graph to give a visual representation of the data.

In [None]:
sns.barplot(x=countries.index, y='total_number', data=countries)
plt.title("Netflix titles by country (1940-2021)")
plt.xlabel('Countries')
plt.ylabel('Total Shows on Netflix')
plt.show()
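
A cell in the `country` column may list several countries for a co-production, in which case counting the raw strings treats each combination as its own "country". The sketch below assumes a comma-separated format and uses invented rows. It splits each cell so that every country is counted once per title.

```python
import pandas as pd

# Hypothetical country cells, some listing more than one country
countries_raw = pd.Series(
    ["United States", "United States, India", "India", "United Kingdom, United States"],
    name="country",
)

# Split each cell into a list, give every country its own row, then count
country_counts = (
    countries_raw.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(country_counts.head(10))
```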

Let us save and upload our work to Jovian before continuing.

In [None]:
import jovian

In [None]:
jovian.commit()

## Asking and Answering Questions

In this section we are going to find answers to some interesting questions about the data, the kind of answers that can inform business decisions.



#### Q1: Showcase user demand for top 7 genres of film

First let's print out the clean data frame once again so that we can see what we are working with.

In [None]:
netflix_clean_df

Since each cell in the `listed_in` column holds a string of several genres, we need to count the number of times a specific genre string occurs.

In [None]:
Drama = netflix_clean_df['listed_in'].str.count('Dramas').sum()
Documentaries = netflix_clean_df['listed_in'].str.count('Documentaries').sum()
Comedy = netflix_clean_df['listed_in'].str.count('Comedies').sum()
Horror = netflix_clean_df['listed_in'].str.count('Horror Movies').sum()
Reality = netflix_clean_df['listed_in'].str.count('Reality TV').sum()
Adventure = netflix_clean_df['listed_in'].str.count('Adventure').sum()
SciFi = netflix_clean_df['listed_in'].str.count('Sci-Fi').sum()

d = {'Genre': ['Drama', 'Documentaries', 'Comedy', 'Horror', 'Reality', 'Adventure', 'SciFi'], 
     'Data' :[Drama, Documentaries, Comedy, Horror, Reality, Adventure, SciFi]}
top_genre_df = pd.DataFrame(d)
top_genre_df

The data above shows the number of titles that have been released in each of the genres. Title counts are used here as a rough proxy for demand. The genres were manually selected through a simple Google search of the most popular genres around the world.

Our next step is to plot the graph. Since we are dealing with frequencies, we will use a bar graph to display our results.
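
Substring counting lumps related labels together; for example, a search for `Dramas` also matches `TV Dramas`. If exact genre labels are wanted, one option (shown on invented cells, assuming genres are separated by a comma and a space) is `str.get_dummies`, which builds one indicator column per label.

```python
import pandas as pd

# Hypothetical `listed_in` cells
listed_in = pd.Series(
    ["Dramas, International Movies", "Comedies, Dramas", "Documentaries", "TV Dramas, TV Comedies"],
    name="listed_in",
)

# One indicator column per exact genre label, then sum each column
genre_counts = listed_in.str.get_dummies(sep=", ").sum().sort_values(ascending=False)
print(genre_counts)
```

Whether `Dramas` and `TV Dramas` should be merged or kept apart depends on the question being asked, so neither approach is strictly better.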

In [None]:
sns.barplot(x= top_genre_df.Genre, y = 'Data', data =top_genre_df)
plt.title("Netflix titles by genre (1940-2021)")
plt.xlabel('Genre')
plt.ylabel('Total Shows on Netflix')
plt.show()

In [None]:
jovian.commit()

#### Q2: Who are the top ten most featured directors?

In [None]:
top10_dir_df = netflix_clean_df['director'].value_counts().head(10)
top10_dir_df

#### Q3: How many Movies and TV Shows were added during the summer?

First let's derive the year and month from the date each title was added, so that we can work with them individually. The `.dt` accessor gives us both directly.

In [None]:
# date_added was converted to datetime during cleaning, so to_datetime here is only a safeguard
netflix_clean_df['date'] = pd.to_datetime(netflix_clean_df['date_added'])
netflix_clean_df['year'] = netflix_clean_df['date'].dt.year
netflix_clean_df['month'] = netflix_clean_df['date'].dt.month

Now let's create a data frame for movies that were added in the summer. In this case we will use May to represent the summer.

In [None]:
dfm = netflix_clean_df[netflix_clean_df['type']=='Movie'][['title','month']]
dfm = dfm[dfm['month']==5]
dfm

Above we see the list of movies added in May. Now we only need to use the `value_counts` function to count the instances of each title and add them up using the `sum` function.

In [None]:
dfm.title.value_counts().sum()

There were 392 movies added in the summer.

Now let's do the same for the TV shows. We already derived the month column for this purpose, so we do not need to do that again. We only need to create a data frame for TV shows that were added in May.

In [None]:
dftv = netflix_clean_df[netflix_clean_df['type']=='TV Show'][['title','month']]
dftv = dftv[dftv['month']==5]
dftv

Now we apply the value counts function and sum the result.

In [None]:
dftv.title.value_counts().sum()

The final result shows there were 153 TV shows added in the summer.
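
Both counts can be produced in one step with a month-by-type table, which also makes it easy to widen the definition of summer. The sketch below works on made-up dates. Treating June to August as summer is an assumption for illustration, not the definition used above.

```python
import pandas as pd

# Hypothetical titles with the date each was added
toy_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
    "date_added": pd.to_datetime(
        ["2020-05-01", "2020-05-15", "2019-06-10", "2021-07-04", "2021-08-20", "2020-12-25"]
    ),
})

# Titles added per calendar month, split by type
monthly = pd.crosstab(toy_df["date_added"].dt.month, toy_df["type"])

# Assumption: "summer" means June to August (Northern Hemisphere)
summer_months = [6, 7, 8]
summer_totals = monthly.loc[monthly.index.isin(summer_months)].sum()
print(monthly)
print(summer_totals)
```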

#### Q4: How many cast members are typically needed to produce a title?

In [None]:
import numpy as np

# Cast names are comma-separated, so the number of members is the comma count plus one.
# Titles with no cast listed stay missing (<NA>).
temp_df = netflix_clean_df[['type', 'cast']].copy()
temp_df['cast'] = (temp_df['cast'].str.count(',') + 1).astype('Int64')

# Summary statistics of cast size per type, including every decile
temp_df.groupby('type').describe(percentiles=list(np.arange(0.1, 1, 0.1)))

Most movies and TV shows have 8 cast members; however, for titles with more than 10 cast members, TV shows tend to have more.

#### Q5: How long does it take to add a newly released title?

In [None]:
import numpy as np
temp_series = netflix_clean_df['date_added'].dt.year - netflix_clean_df.release_year

temp_series.hist(bins = 20, grid = False)
plt.xlabel('Difference between Added Year and Release Year')
plt.xticks(np.arange(0, temp_series.max(), 5))
plt.ylabel('Count of current titles')
plt.text(25, 3500, 
         'Median: ' + str(round(temp_series.median())) + ' year', 
         bbox={'facecolor': 'white', 'edgecolor': 'red'})
plt.gcf().set_size_inches(12, 5)
plt.show()

From the above data we find that it takes a median of 1 year to add a newly released title.
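
The histogram is driven by a simple subtraction, and the same series can be summarised numerically. A minimal sketch on hypothetical rows, reporting the median lag and the share of titles added within a year of release:

```python
import pandas as pd

# Hypothetical titles with release year and the date they were added
toy_df = pd.DataFrame({
    "release_year": [2019, 2010, 2020, 2021],
    "date_added": pd.to_datetime(["2020-03-01", "2018-07-15", "2020-11-30", "2021-01-05"]),
})

# Whole-year gap between release and arrival on the platform
lag_years = toy_df["date_added"].dt.year - toy_df["release_year"]

median_lag = lag_years.median()
share_within_year = (lag_years <= 1).mean()
print("Median lag:", median_lag)
print("Share added within a year:", share_within_year)
```

Because only the year is compared, a title released in December and added the following January counts as a one-year gap, so the figures are approximate.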

Let us save and upload our work to Jovian before continuing.

In [None]:
import jovian

In [None]:
jovian.commit()

## Inferences and Conclusion


From the data we can get a sense of audience preferences. Dramas and comedies have the highest number of titles, which may be because those two genres are the most likely to attract repeat viewing. They are light and engaging compared to the other genres, so they would plausibly see higher demand for everyday viewing.

The data seems to support the idea that audiences are shifting from more conventional avenues to streaming platforms. When we looked at the line graph of titles released overall, the line was relatively flat until around 2010, when it started to climb rapidly, peaking around 2018. Media consumption in general has migrated towards streaming platforms, driven in large part by social media, and this shift has carried over to larger formats such as television. Users have developed a taste for "content on demand" and the data seems to support that.

In [None]:
import jovian

In [None]:
jovian.commit()

## Future Work

The data runs only up to 2021, while this analysis was conducted in 2022. The most obvious future project would be to examine how the COVID-19 pandemic affected releases per year, given the lockdowns that were imposed around the world. It can be assumed that there was a spike in releases in 2021 and 2022; however, a data-driven analysis with visualisations would be a more reliable way to understand the impact of the pandemic on streaming.

## References:

* Netflix Movies and TV Shows Data set:
https://www.kaggle.com/shivamb/netflix-shows

* Pandas documentation:
https://pandas.pydata.org/docs/reference/index.html

* Seaborn documentation:
https://seaborn.pydata.org/api.html

In [None]:
import jovian

In [None]:
jovian.commit()