
# NETFLIX Dataset Exploratory Data Analysis (EDA)

Dataset: https://www.kaggle.com/datasets/shivamb/netflix-shows

Lijesh T K

## 1. Import the libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Load the Dataset

In [None]:
df=pd.read_csv("netflix_titles.csv")
df

## 3. Data Exploration

In [None]:
#Shape of dataset
df.shape

In [None]:
df.head()

In [None]:
#data types
df.info()

In [None]:
#check for unique values
df.nunique()

In [None]:
#check for duplicates
df[df.duplicated()]

In [None]:
# check for null values
df.isnull().sum()

In [None]:
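# Fill missing text fields with a placeholder value.
# Plain assignment is used instead of fillna(..., inplace=True) on a column selection,
# which triggers chained-assignment warnings in recent pandas versions.
df['director'] = df['director'].fillna("Unknown_Value")
df['cast'] = df['cast'].fillna("Unknown_Value")
df['country'] = df['country'].fillna("Unknown_Value")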

In [None]:
df.dropna(subset=['date_added'],inplace=True)
df.dropna(subset=['rating'],inplace=True)
df.dropna(subset=['duration'],inplace=True)

In [None]:
df.isna().sum()
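
As a small illustration of the null handling above, the sketch below applies the same two steps (fill a text column with a placeholder, then drop rows that lack a key field) to a made-up three-row frame. The `toy` frame and its values are hypothetical and only stand in for `netflix_titles.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "director": ["Jane Doe", np.nan, np.nan],
    "rating": ["TV-14", np.nan, "PG"],
})

# Fill the missing text field with a placeholder via plain assignment
toy["director"] = toy["director"].fillna("Unknown_Value")

# Drop rows that still lack a rating
toy = toy.dropna(subset=["rating"])

# Total number of nulls left in the frame
remaining_nulls = int(toy.isna().sum().sum())
print(toy)
print("remaining nulls:", remaining_nulls)
```

After both steps, no nulls should remain, which is what the `df.isna().sum()` check confirms on the real data.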

In [None]:
df['type'].unique()

In [None]:
df['title'].unique()

In [None]:
df['director'].unique()

In [None]:
df['cast'].unique()

In [None]:
df['country'].unique()

In [None]:
df['date_added'].unique()

In [None]:
df['release_year'].unique()

In [None]:
df['rating'].unique()

In [None]:
df['duration'].unique()

In [None]:
df['listed_in'].unique()

In [None]:
df['description'].unique()

## 4. Data Analysis and Data Visualization

In [None]:
df.head(3)

## i) Plot the count of the different types of shows available on Netflix

In [None]:
color=['Red','Black'] #netflix color pattern
sns.countplot(x='type',data=df,palette=color)
plt.title('Count of Different shows on Netflix')
plt.show()

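Observations from the graph:
1. Netflix contains more movies than TV shows.
2. This reflects the size of the catalog only. The dataset has no viewership figures, so the plot alone does not show that movies have a larger audience than TV shows.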

In [None]:
df['type'].value_counts()
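
The raw counts can also be read as shares of the catalog. A minimal sketch, using a hypothetical four-row `toy` frame in place of the real data, relies on the `normalize=True` option of `value_counts`:

```python
import pandas as pd

# Hypothetical toy data: three movies and one TV show
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})

# normalize=True returns fractions instead of counts; convert to percentages
share = toy["type"].value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data, the same call would be `df['type'].value_counts(normalize=True)`.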

## ii) Plot the Distribution of Release years

In [None]:
sns.histplot(df['release_year'],bins=30,color='r')
plt.title("Distribution of Release Years")
plt.show()

Observations from the graph:
1. Most titles were released between 2000 and 2020.
2. The most common release year is 2018.

In [None]:
df['release_year'].value_counts()

## iii) Top 10 countries with highest content

In [None]:
df.country.value_counts().head(10)

In [None]:
data=df.country.value_counts().head(10)

In [None]:
sns.barplot(x=data.index,y=data.values)
plt.xticks(rotation=90)
plt.ylabel("Count of Contents")
plt.title("Top 10 Countries with Highest no: of Contents")
plt.show()

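Observations from the graph:
1. Netflix has more titles from the United States than from any other country.
2. India ranks second.
3. The counts are based on the exact `country` string, so co-productions that list several countries are counted as separate entries.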
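
Because a `country` cell can hold several comma-separated countries, counting per individual country needs a split first. The sketch below shows one way to do this with `str.split` and `explode`. The `toy` frame is hypothetical, and excluding the `"Unknown_Value"` placeholder is an assumption about what the ranking should contain.

```python
import pandas as pd

# Hypothetical toy data, including one co-production and one placeholder row
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States", "United States, India", "India", "Unknown_Value"],
})

countries = (
    toy.loc[toy["country"] != "Unknown_Value", "country"]  # skip the placeholder
    .str.split(",")   # one list of countries per title
    .explode()        # one row per (title, country) pair
    .str.strip()      # remove spaces left over from the split
)
countries = countries[countries != ""]  # guard against stray leading/trailing commas

per_country = countries.value_counts()
print(per_country)
```

With this approach a co-production contributes one count to each country it lists, so the totals can exceed the number of titles.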

## iv) List the TV shows from France

In [None]:
df[(df['type']=='TV Show')&(df['country']=='France')]['title']

## v) List the movies from India

In [None]:
df[(df['type']=='Movie')&(df['country']=='India')]['title']
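
The two filters above use `==`, which matches only titles whose `country` value is exactly one country. If co-productions should be included as well, a substring match is one option. A minimal sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical toy data: one French show, one French-Belgian show, one French movie
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["TV Show", "TV Show", "Movie"],
    "country": ["France", "France, Belgium", "France"],
})

is_show = toy["type"] == "TV Show"

# Exact match: only titles produced in France alone
exact = toy[is_show & (toy["country"] == "France")]["title"]

# Substring match: any title that lists France among its countries
any_france = toy[is_show & toy["country"].str.contains("France", na=False, regex=False)]["title"]

print(list(exact))
print(list(any_france))
```

Which of the two is the right answer depends on whether "from France" is meant to include co-productions.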

## vi) Top 10 directors with the highest number of titles

In [None]:
df['director'].value_counts().head(10)
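
Since missing directors were filled with `"Unknown_Value"` earlier, that placeholder can end up in the ranking. A small sketch of filtering it out first, on a hypothetical `toy` frame with made-up director names:

```python
import pandas as pd

# Hypothetical toy data: the placeholder appears more often than any real director
toy = pd.DataFrame({
    "director": ["Unknown_Value", "Unknown_Value", "Unknown_Value",
                 "Jane Doe", "Jane Doe", "John Roe"],
})

# Keep only rows with a known director, then rank
known = toy.loc[toy["director"] != "Unknown_Value", "director"]
top_directors = known.value_counts().head(10)
print(top_directors)
```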

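## vii) List all the movies of Mohanlal (actor)

In [None]:
# Restrict to movies; na=False treats a missing cast as "no match"
df[(df['type']=='Movie') & df['cast'].str.contains('Mohanlal', na=False)]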

## viii) How many movies from India have the 'TV-14' rating?

In [None]:
# Filter on type as well, then count the matching rows
df[(df['type']=='Movie') & (df['rating']=='TV-14') & (df['country']=='India')].shape[0]

## ix) Which individual country has the highest number of TV shows?

In [None]:
# A notebook cell displays only its last expression, so one call is enough.
# The top 4 rows are shown so that the "Unknown_Value" placeholder, if present, can be told apart from real countries.
df[df['type']=='TV Show']['country'].value_counts().head(4)

## x) Top 10 Genres with the Most Content on Netflix

In [None]:
from collections import Counter

# Split the comma-separated 'listed_in' strings into individual genres
all_genres = ', '.join(df['listed_in']).split(', ')
# Count occurrences of each genre in a single pass
genre_counts = Counter(all_genres)
# Get the top 10 genres with the most content
top_genres = dict(genre_counts.most_common(10))

# Create a horizontal bar plot
sns.barplot(x=list(top_genres.values()), y=list(top_genres.keys()))
plt.xlabel('Number of Contents')
plt.ylabel('Genre')
plt.title('Top 10 Genres with the Most Content on Netflix')
plt.show()