# Netflix Movie and TV Show 📊 
In this notebook, we apply different Exploratory Data Analysis (EDA) techniques to understand the dataset and draw observations from it.

To keep ourselves on track, let's note down the basic steps we are going to follow while performing the EDA.
* Load the dataset.
* Understand the dataset, its column types, and its missing values.
* Clean the dataset and handle the missing values.
* Perform the data visualization.
    * Perform data normalization wherever it seems to be required.
    * Perform feature engineering to make the data more understandable and to extract possible new information from the dataset.
* Create the final summary report from the information we gather while performing the EDA.

**We also create some questions based on the dataset, which we are going to answer while doing the EDA.**

In [None]:
# import library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
import missingno
import warnings
warnings.filterwarnings('ignore')

## Load the Dataset
In this section, we load the dataset and get a basic understanding of it.

In [None]:
df = pd.read_csv('/kaggle/input/netflix-shows/netflix_titles.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

So, most of the columns in the dataset are of string dtype, and we also see that `director`, `country`, `date_added`, `cast`, and `rating` have missing values.

Let's see how to handle this data, but before moving on, let's visualize the missing values to see how many there are in each column.

In [None]:
missingno.bar(df);

This gives us a better understanding of the missing values in the dataset. The `director` column has the most missing values, while the `date_added` column has only a few. We drop the rows with a missing `date_added` because we cannot handle those values: a date cannot be predicted from the other columns (although we could look up on Google when each title was added to Netflix).
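
The bar chart gives a visual impression; the exact counts can be read with `isna().sum()`. The sketch below uses a small made-up frame (`toy`, not the Netflix data) so that it runs on its own. On the notebook data, the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Netflix frame (values are made up).
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'director': [np.nan, 'X', np.nan],
    'date_added': ['January 1, 2020', np.nan, 'March 3, 2019'],
})

# Number of missing values per column, largest first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```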

## Handle Missing Value
In this section, we handle the missing values to make our dataset complete before performing the EDA. This will help us understand the dataset more clearly and draw observations more precisely.

In [None]:
df.dropna(subset=['date_added'], inplace=True)

In [None]:
df.head()

In [None]:
df.info()

So, we now have 7777 rows left in the dataset after dropping the rows with a missing value in the `date_added` column.

In the `director` column, we replace the missing values with the keyword `missing`.

In [None]:
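# Assign the result back to the column; in-place fillna on a column accessor may not update df in newer pandas.
df['director'] = df['director'].fillna('missing')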

For the `cast` column, we fill the missing values with the `missing` keyword. We also create a dictionary that stores each unique cast member as a key and the number of times it appears in the dataset as the value. This will help while performing the EDA.

In [None]:
# Assign the result back to the column instead of filling in place.
df['cast'] = df['cast'].fillna('missing')

In [None]:
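data = []
for i in range(len(df)):
    # Strip the space after each comma so the same name is not counted under two keys.
    data.extend(name.strip() for name in df.cast.iloc[i].split(','))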

In [None]:
from collections import Counter

# Counter tallies every name in a single pass instead of rescanning the list for each name.
element = dict(Counter(data))

In [None]:
element = sorted(element.items(), key = lambda item: item[1], reverse=True)

In [None]:
len(set(element))

In [None]:
df.info()

Let's look at the `country` and `rating` columns and fill their missing values. We handle them with the `missing` keyword as well.

In [None]:
# Assign the result back to the column instead of filling in place.
df['country'] = df['country'].fillna('missing')

In [None]:
# Assign the result back to the column instead of filling in place.
df['rating'] = df['rating'].fillna('missing')

In [None]:
df.info()

We are done handling the missing values in the dataset. Now we move forward and perform `feature engineering` to extract information that helps us understand the dataset better.

## Feature Engineering
In this section, we perform feature engineering to make our dataset more informative and to help us draw more observations from it.

In [None]:
df.head()

**What data can we extract from the dataset?**
* Split the `date_added` column into month, date and year.
* Split the `listed_in` column into different categories.

*We already performed one feature engineering step while handling the missing values of `cast` (`element`).*
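
An alternative to splitting the string by hand is to parse `date_added` as a date. This is a minimal sketch on made-up values (`sample`), assuming every date follows the `Month day, year` pattern. The leading space in the second value mimics stray whitespace that can occur in such a column.

```python
import pandas as pd

# Made-up sample values in the 'Month day, year' pattern.
sample = pd.Series(['September 9, 2019', ' August 14, 2020'])

# Strip whitespace, then parse with an explicit format.
parsed = pd.to_datetime(sample.str.strip(), format='%B %d, %Y')

added_month = parsed.dt.month_name()  # month as a name
added_day = parsed.dt.day             # day of the month as an integer
added_year = parsed.dt.year           # year as an integer
print(added_month.tolist(), added_day.tolist(), added_year.tolist())
```

Parsing to a datetime keeps the parts numeric where that makes sense, which helps when sorting months or years later.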


In [None]:
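# Strip stray whitespace, then take the first space-separated token (the month name).
# Column-wise string methods keep every value aligned with its own row.
df['added_month'] = df.date_added.str.strip().str.split(' ').str[0]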

In [None]:
# The second token is the day followed by a comma, e.g. '9,'; drop the comma.
df['added_date'] = df.date_added.str.strip().str.split(' ').str[1].str[:-1]

In [None]:
# The third token is the year.
df['added_year'] = df.date_added.str.strip().str.split(' ').str[2]

In [None]:
listed_in = []
for i in range(len(df)):
    # Strip the space after each comma so a tag is not counted under two keys.
    listed_in.extend(tag.strip() for tag in df.listed_in.iloc[i].split(','))

In [None]:
from collections import Counter

# Counter tallies every tag in a single pass instead of rescanning the list for each tag.
listed_dic = dict(Counter(listed_in))

In [None]:
listed_dic = sorted(listed_dic.items(), key=lambda item: item[1], reverse=True)

In [None]:
listed_dic = dict(listed_dic)

Since we have extracted the information from the `date_added` column, we drop this column from the dataset.

In [None]:
df.drop('date_added', axis=1, inplace=True)

In [None]:
df.isna().sum()

In [None]:
df.dropna(inplace=True)

## Exploratory Data Analysis
In this section, we perform the EDA on the prepared data to understand the dataset and draw different observations.

We start by listing some questions, which we are going to answer while performing the Exploratory Data Analysis.

**QUESTIONS**
1. What different types of titles (show or movie) are uploaded on Netflix?
2. Which director is the most common on Netflix?
3. Which directors direct the most Movies and the most TV Shows?
4. Which celebrities are cast most often?
5. From which countries do most of the shows added on Netflix come?
6. In which month or year were most of the shows added to Netflix?
7. In which year were most of the shows that were added to Netflix released?
8. What different ratings are given on Netflix?
9. What different categories are found on Netflix?
10. What is the most common duration of the titles added on Netflix?

In [None]:
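sns.set_theme('paper', style='darkgrid')
# Take labels and heights from the same value_counts() result so each bar matches its label;
# unique() returns values in order of appearance, which can differ from the count order.
type_counts = df.type.value_counts()
plt.bar(type_counts.index, type_counts.values, width=0.5, color='red')
plt.ylabel('Value Count');

So, from the above graph we can say that most of the titles added on Netflix are of type `Movie`.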

In [None]:
df.director.value_counts()[1:20].sort_values(ascending=False).plot(kind='bar', width=0.5, color='red');

`Raúl Campos, Jan Suter` is the most common director credit among the TV Shows and Movies found on Netflix. We can also check what type of content, i.e. TV Show or Movie, each director made.

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
# value_counts() sorts each group by count; position 0 is skipped as the 'missing' placeholder.
# Plotting index and values directly avoids depending on the name pandas gives the count column.
data = df.groupby('type')['director'].value_counts()['Movie'][1: 20]
ax1.bar(data.index, data.values, color='red')
ax1.tick_params(labelrotation=90)
ax1.set_title('Movie', fontsize=24, fontweight='bold')
data2 = df.groupby('type')['director'].value_counts()['TV Show'][1: 20]
ax2.bar(data2.index, data2.values, color='grey')
ax2.tick_params(labelrotation=90)
ax2.set_title('TV Show', fontsize=24, fontweight='bold');

`Raúl Campos, Jan Suter` directed only Movies among the titles uploaded on Netflix, while `Alastair Fothergill` is the director who appears most often among the TV Shows.

In [None]:
dic_element = dict(element[1:])

In [None]:
dic_element_key = list(dic_element.keys())
dic_element_value = list(dic_element.values())
plt.bar(dic_element_key[:20], dic_element_value[:20], color='red')
plt.xticks(rotation='vertical');

`Anupam Kher` is the most common cast member found on Netflix.

In [None]:
country_list = []
for data in df.country:
    split_data = data.split(',')
    for new_data in split_data:
        country_list.append(new_data.strip())

In [None]:
country_count = {}
for i in country_list:
    country_count[i] = country_list.count(i)

In [None]:
country_count_sort = sorted(country_count.items(), key=lambda item: item[1], reverse=True)

In [None]:
country_count_sort = dict(country_count_sort)

In [None]:
plt.bar(list(country_count_sort.keys())[:20], list(country_count_sort.values())[:20], color='red')
plt.xticks(rotation='vertical');

This graph shows that most of the TV Shows and Movies found on Netflix come from the `United States`.
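
The split, strip, count and sort steps used above for a multi-valued column can also be sketched with `collections.Counter` from the standard library. The `countries` list below is made-up sample data, not the Netflix column.

```python
from collections import Counter

# Made-up sample of a comma-separated column.
countries = ['United States, India', 'India', 'United States', 'missing']

# Split each entry, strip whitespace, and count every value in one pass.
counts = Counter(
    name.strip()
    for entry in countries
    for name in entry.split(',')
)

# most_common(n) returns the n most frequent (value, count) pairs, highest count first.
top = counts.most_common(2)
print(top)
```

On the notebook data, `dict(counts.most_common(20))` would play the role of the sorted dictionary that is sliced for the bar chart.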

In [None]:
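# Labels and heights come from the same value_counts() result so each bar matches its label.
month_counts = df.added_month.value_counts()
plt.bar(month_counts.index, month_counts.values, color='red');
plt.xticks(rotation=90);

The tallest bar in this graph marks the month in which most of the TV Shows and Movies were added on Netflix.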

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
# Keep the counts as a Series and plot its index (years) against its values (title counts).
movie_years = df.groupby('type')['release_year'].value_counts()['Movie'].sort_values(ascending=False)[:20]
ax1.barh(movie_years.index, movie_years.values, color='red')
ax1.set_title('Release Year of Movie', fontsize=24, fontweight='bold');
tv_years = df.groupby('type')['release_year'].value_counts()['TV Show'].sort_values(ascending=False)[:20]
ax2.barh(tv_years.index, tv_years.values, color='grey')
ax2.set_title('Release Year of TV Show', fontsize=24, fontweight='bold');

This shows that in the year 2020, the popularity of TV Shows increased relative to Movies on Netflix.
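
To compare the two types year by year in a single table, one option is `pd.crosstab`. The sketch uses a made-up frame (`toy`); on the notebook data, the same call would take `df['release_year']` and `df['type']`.

```python
import pandas as pd

# Made-up rows standing in for the Netflix frame.
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'TV Show', 'TV Show', 'TV Show'],
    'release_year': [2019, 2019, 2019, 2020, 2020],
})

# Rows are release years, columns are types, cells are title counts.
per_year = pd.crosstab(toy['release_year'], toy['type'])
print(per_year)
```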

In [None]:
plt.figure(figsize=(20, 8))
plt.bar(listed_dic.keys(), listed_dic.values(), color='red')
plt.xticks(rotation=90);

`International Movies` is the most common tag among the titles found on Netflix. So, a large share of the catalogue carries this tag, which suggests that international content is well represented on the platform.

In [None]:
plt.figure(figsize=(6, 6))
# Take the labels from the same value_counts() result so each wedge matches its label.
top_durations = df.duration.value_counts()[:3]
_, _, texts = plt.pie(top_durations, labels=top_durations.index, autopct='%1.2f%%', startangle=90,
                      explode=(0.0, 0.1, 0.0), colors=['red', 'grey', 'black'])
plt.axis('equal')
plt.title('Season Duration', fontsize=22, fontweight='bold');
for text in texts:
    text.set_color('white')

In [None]:
plt.plot(df.duration.value_counts().index.to_list()[3: 20], df.duration.value_counts()[3:20], color='red')
plt.xticks(rotation=90)
plt.title('Movie and TV Show Duration', fontsize=18, fontweight='bold');

Leaving aside the season counts shown in the pie chart, the most common duration on Netflix is 90 min, which makes it the most common Movie length.
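
Because `duration` mixes minutes (Movies) and seasons (TV Shows), the Movie lengths can be studied on their own by pulling the number out of the string. A minimal sketch on made-up rows (`toy`), assuming Movie durations look like `'90 min'`:

```python
import pandas as pd

# Made-up rows standing in for the Netflix frame.
toy = pd.DataFrame({
    'type': ['Movie', 'Movie', 'Movie', 'TV Show'],
    'duration': ['90 min', '90 min', '120 min', '1 Season'],
})

movies = toy[toy['type'] == 'Movie']

# Extract the leading number and convert it to an integer number of minutes.
minutes = movies['duration'].str.extract(r'(\d+)', expand=False).astype(int)

# mode() returns the most frequent value(s); take the first one.
most_common_minutes = minutes.mode().iloc[0]
print(most_common_minutes)
```

With the minutes as integers, a histogram or `describe()` becomes possible in addition to the most common value.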

In [None]:
plt.figure(figsize=(8, 10))
sns.countplot(y='rating', data=df, order=df.rating.value_counts().index.to_list(), palette='dark:salmon_r')
plt.title('Different Ratings', fontsize=24, fontweight='bold');

So, `TV-MA` is the most common rating on Netflix.

## Summary
So far, we have performed many operations on the dataset and dug out some useful information from it. In some cases, we created new data from the dataset to help understand it more clearly. If we had to sum up the dataset in a few lines, we could say that:

*Among the `TV Show` titles on Netflix, the most frequent director is `Alastair Fothergill`, the most frequent release year is `2020`, the most frequent tag is `International TV Shows`, the most common duration is `1 Season`, and the most common rating is `TV-MA`.*