# Netflix Data Analysis Project
Netflix is one of the largest streaming platforms today. In this project, I analyze its dataset of movies and TV shows using Python with Pandas, Seaborn, and Matplotlib. The analysis gives us a better understanding of the platform and derives meaningful insights from the data through visualizations.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib
import seaborn as sns

In [None]:
import pandas as pd
data=pd.read_csv("/content/netflix_titles.csv (2).zip")
data

In [None]:
data.head(5)

In [None]:
data.size

In [None]:
data.dtypes

In [None]:
data.info()

In [None]:
data[data.duplicated()]

In [None]:
data.drop_duplicates(inplace=True)

Passing "inplace=True" modifies the dataset itself instead of returning a new DataFrame, so the duplicates stay removed in all later cells.
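As a minimal sketch of the difference, the toy DataFrame below (hypothetical data, not the Netflix file) is deduplicated once without and once with "inplace=True":

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data)
df = pd.DataFrame({"show_id": ["s1", "s2", "s2"],
                   "title": ["A", "B", "B"]})

# Without inplace: a new frame is returned and df itself is unchanged
deduped = df.drop_duplicates()
rows_before = len(df)

# With inplace=True: df is modified and the call returns None
result = df.drop_duplicates(inplace=True)
rows_after = len(df)

print(rows_before, rows_after, result)
```

Reassigning with `df = df.drop_duplicates()` has the same effect and is often preferred because it cannot silently return None into a variable.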

In [None]:
data[data.duplicated()]

In [None]:
data.shape

# Show null values with a heat map



After removing duplicated data, our next step involved examining the presence of null values within the dataset. This crucial analysis allows us to identify any missing or incomplete data points, enabling us to address them effectively before further processing.

In [None]:
data.head()

In [None]:
data.isnull()

In [None]:
data.isnull().sum()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(data.isnull())
plt.title('Null value with heat map')
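The heat map gives a visual impression of where values are missing. As a complementary sketch, the share of missing values per column can be computed with "isnull().mean()". The frame below is hypothetical toy data, not the Netflix file:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["US", "IN", np.nan, "UK"],
})

# mean() of a boolean mask is the fraction of True values per column
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the column with the most missing values first, which makes it easy to decide where cleaning effort is needed.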

# Q&A and Visualization: Let's dig deeper!

# Q1. For "House of Cards", what is the show ID and who is the director of this show?

In [None]:

data[data['title'].isin(['House of Cards'])]

Another way to find certain data is by using the "str.contains" function.

In [None]:
data[data['title'].str.contains('House of Cards')]
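The two approaches are not interchangeable: "isin" matches whole values exactly, while "str.contains" matches substrings. A small sketch with hypothetical titles illustrates the difference:

```python
import pandas as pd

# Hypothetical titles, chosen only to show the matching behaviour
titles = pd.Series(["House of Cards", "House of Cards: Extra", "Card House"])

# Exact match on the full string
exact = titles[titles.isin(["House of Cards"])]

# Substring match; regex=False treats the pattern as plain text
partial = titles[titles.str.contains("House of Cards", regex=False)]

# Case-insensitive substring match
ignore_case = titles[titles.str.contains("house of cards", case=False, regex=False)]

print(len(exact), len(partial), len(ignore_case))
```

When a question asks about one specific title, the exact match is usually the safer choice; the substring match is useful when the spelling or subtitle is uncertain.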

# Q2. In which year was the highest number of movies and TV shows released? (with bar graph)

In [None]:
data.dtypes

In [None]:
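# release_year holds integer years, so an explicit format is needed to parse
# the values as years rather than as nanosecond timestamps
data['Date_N'] = pd.to_datetime(data['release_year'].astype(str), format='%Y')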
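Integer years need some care with "pd.to_datetime": without a format, integers are read as nanoseconds since 1970-01-01, so every value lands in 1970. The sketch below uses a small hypothetical Series of years to show both behaviours:

```python
import pandas as pd

# Hypothetical integer years, standing in for the release_year column
years = pd.Series([2019, 2020, 2021])

# Without a format, integers are interpreted as nanoseconds since the epoch
naive = pd.to_datetime(years)

# Converting to strings and passing format="%Y" parses them as calendar years
parsed = pd.to_datetime(years.astype(str), format="%Y")

naive_years = naive.dt.year.tolist()
parsed_years = parsed.dt.year.tolist()
print(naive_years, parsed_years)
```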



In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have a DataFrame named 'data' with columns 'type' and 'release_year'

# Group by 'release_year' and 'type', count the number of entries in each group
counts_by_year = data.groupby(['release_year', 'type']).size().unstack(fill_value=0)

# Calculate the total count of movies and TV shows for each year
counts_by_year['Total'] = counts_by_year.sum(axis=1)

# Find the year with the highest total count
year_with_highest_count = counts_by_year['Total'].idxmax()

# Plotting
counts_by_year['Total'].plot(kind='bar', color='skyblue', figsize=(10, 6))
plt.title('Number of Movies and TV Shows Released Each Year')
plt.xlabel('Year')
plt.ylabel('Number of Releases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print(f"The year with the highest number of movies and TV shows released was {year_with_highest_count} with a total of {counts_by_year['Total'].max()} releases.")


In [None]:
data['Date_N'].dt.year.value_counts()


# Q3. How many movies & TV shows are in this data set?

In [None]:
data.groupby('type').type.count()

In [None]:
sns.countplot(x=data['type'])
plt.title("Number of Movies and TV Shows")
plt.xlabel("Movies/TV Shows")
plt.ylabel("Total Count")

The dataset consists of over 2,500 TV shows and 6,000 movies. It is evident that the number of movies is more than twice that of TV Shows.
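The "more than twice" statement can be checked with a quick calculation. This sketch assumes the counts reported in the conclusions of this notebook (6,131 movies and 2,676 TV shows) rather than reading the file again:

```python
import pandas as pd

# Counts taken from the conclusions section of this notebook
type_counts = pd.Series({"Movie": 6131, "TV Show": 2676})

# Share of each type in the whole dataset
share = type_counts / type_counts.sum()

# Ratio of movies to TV shows
ratio = type_counts["Movie"] / type_counts["TV Show"]

print(share.round(3))
print(round(ratio, 2))
```

On the real dataset, `data['type'].value_counts(normalize=True)` gives the shares directly.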

# Q4. Show all the movies that were released in 2021

In [None]:
data[(data['type']=='Movie')&(data['release_year']== 2021)]

# Number of Movies and TV Shows each year

In [None]:
plt.figure(figsize=(35,6))
sns.countplot(data=data,x="release_year",hue="type")
plt.title("Number of Movies and TV Shows each year")

In every year except 2021, the dataset contains more movies than TV shows.

# Q5. Show only the titles of all TV shows that were released in the United States only

In [None]:
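# Select only the title column for TV shows whose country is exactly 'United States'
data.loc[(data['type'] == 'TV Show') & (data['country'] == 'United States'), 'title']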
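An equality test on the country column keeps only rows whose value is exactly 'United States', which excludes co-productions listed with several countries. The sketch below, on hypothetical rows, contrasts this with a substring match:

```python
import pandas as pd

# Hypothetical rows; the second show is a co-production
df = pd.DataFrame({
    "title": ["Show A", "Show B", "Film C"],
    "type": ["TV Show", "TV Show", "Movie"],
    "country": ["United States", "United States, Canada", "United States"],
})

is_tv = df["type"] == "TV Show"

# Released in the United States only (exact match)
us_only = df.loc[is_tv & (df["country"] == "United States"), "title"]

# Released in the United States, possibly together with other countries
us_any = df.loc[is_tv & df["country"].str.contains("United States", na=False), "title"]

print(us_only.tolist(), us_any.tolist())
```

Because the question asks for the United States only, the exact match is the one that fits here.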

# Top 15 Countries on Netflix based on amount of content

In [None]:
import pandas as pd
# Splitting countries, stacking them, and stripping the space left after each comma
filtered_countries = data['country'].str.split(',', expand=True).stack().str.strip().reset_index(drop=True)

# Counting occurrences of each country
country_counts = filtered_countries.value_counts()

# Extracting the top 15 countries
top_15_countries = country_counts.head(15)

# Displaying the result
print("Top 15 Countries on Netflix based on amount of content:")
print(top_15_countries)


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have a DataFrame named 'data' with columns 'type' and 'country'

# Splitting countries and stacking them
filtered_countries = data['country'].str.split(',', expand=True).stack().str.strip().reset_index(drop=True)

# Counting occurrences of each country
country_counts = filtered_countries.value_counts()

# Extracting the top 15 countries
top_15_countries = country_counts.head(15)

# Plotting
plt.figure(figsize=(10, 6))
top_15_countries.plot(kind='bar')
plt.title("Top 15 Countries on Netflix based on amount of content")
plt.xlabel("Country")
plt.ylabel("Number of Content")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


The United States has the most content on Netflix, followed by India in second place and the United Kingdom in third place. There is a considerable gap between the U.S. and India: the U.S. has over 3,000 content items, while India has 900+. This highlights the dominance of U.S. content on Netflix in terms of quantity. Several Asian countries also rank prominently, which may reflect the growing popularity of anime and K-drama among viewers.
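Counts like these depend on how multi-country entries are split. If the country column separates names with a comma and a space, splitting on ',' alone leaves a leading space, and the same country is then counted under two labels. A minimal sketch on hypothetical data, using "explode" as an alternative to "stack":

```python
import pandas as pd

# Hypothetical country values, including a missing entry
country = pd.Series([
    "United States, India",
    "India",
    None,
    "United Kingdom, United States",
])

# Splitting on ',' alone keeps the space: ' India' and 'India' differ
raw = country.str.split(",").explode().value_counts()

# Stripping whitespace after the split merges the labels
clean = country.dropna().str.split(",").explode().str.strip().value_counts()

print(raw)
print(clean)
```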

# Q7. Show Top 10 Directors who gave the highest number of TV Shows & Movies to Netflix

In [None]:
import pandas as pd

# Grouping by director and counting the number of entries (movies and TV shows) for each director
movie_director = data.groupby('director').size().reset_index(name='Count')

# Sorting by count in descending order and selecting the top 10 directors
movie_director = movie_director.sort_values(by='Count', ascending=False).head(10)

# Displaying the result
print("Top 10 Directors who gave the highest number of TV Shows & Movies to Netflix:")
print(movie_director)
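A shorter route to the same ranking is "value_counts", which counts each distinct value and skips missing ones. If the director column holds several comma-separated names in one cell (an assumption about the data), each combination is counted as its own entry unless the names are split first. A sketch on hypothetical names:

```python
import pandas as pd

# Hypothetical director values; the last row is a co-directed title
director = pd.Series(["Jane Doe", "John Roe", "Jane Doe", None, "Jane Doe, John Roe"])

# One count per distinct cell value
per_credit = director.value_counts().head(10)

# One count per individual person, after splitting co-directed titles
per_person = (
    director.dropna().str.split(",").explode().str.strip().value_counts().head(10)
)

print(per_credit)
print(per_person)
```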


# Q8. Show all records where "type is TV Show and listed_in contains British TV Shows" or "country is United Kingdom"

In [None]:
# & binds tighter than |, so the extra parentheses only make the grouping explicit
filtered_records = data[((data['type'] == 'TV Show') & (data['listed_in'].str.contains('British TV Shows'))) | (data['country'] == 'United Kingdom')]
print(filtered_records)
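In a combined filter like this, "&" is evaluated before "|", so the grouping matters. The sketch below uses hypothetical rows to show that the unparenthesised form equals "(A and B) or C", and that the alternative grouping selects different rows:

```python
import pandas as pd

# Hypothetical rows covering the relevant combinations
df = pd.DataFrame({
    "type": ["TV Show", "Movie", "Movie", "TV Show"],
    "listed_in": ["British TV Shows", "British TV Shows", "Dramas", "Dramas"],
    "country": ["France", "France", "United Kingdom", "India"],
})

a = df["type"] == "TV Show"
b = df["listed_in"].str.contains("British TV Shows", na=False)
c = df["country"] == "United Kingdom"

implicit = a & b | c      # evaluated as (a & b) | c
grouped = (a & b) | c     # same result, grouping spelled out
other = a & (b | c)       # a different question: TV shows only

print(implicit.tolist(), other.tolist())
```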



# Q9. In how many movies/shows was Tom Cruise cast?

In [None]:

tom_cruise_count = data[data['cast'].str.contains('Tom Cruise', na=False)].shape[0]
print("Tom Cruise was cast in", tom_cruise_count, "movies/shows.")
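The "na=False" argument matters because the cast column can contain missing values. With it, a missing cast is treated as "no match", so the mask contains only True and False and can be used directly as a filter. A minimal sketch on hypothetical cast strings:

```python
import pandas as pd

# Hypothetical cast values, including one missing entry
cast = pd.Series(["Tom Cruise, Actor B", None, "Actor C"])

# na=False maps missing values to False instead of leaving them undefined
mask = cast.str.contains("Tom Cruise", na=False, regex=False)

# Summing a boolean mask counts the True entries
count = int(mask.sum())
print(mask.tolist(), count)
```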


# Q10. What are the different Ratings defined by Netflix?

In [None]:
data['rating'].unique()

In [None]:
data['rating'].nunique()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Order ratings by frequency; value_counts() also leaves out missing ratings
rating_order = data['rating'].value_counts().index
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='rating', order=rating_order)
plt.title("Distribution of Netflix Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()


# Q10-(1). How many movies got the 'TV-14' rating in Canada?

In [None]:
tv_14_canada_movies = data[(data['type'] == 'Movie') & (data['rating'] == 'TV-14') & (data['country'] == 'Canada')]
tv_14_canada_movie_count = tv_14_canada_movies.shape[0]
print("Number of movies with 'TV-14' rating in Canada:", tv_14_canada_movie_count)



# Q10-(2). How many TV shows got the 'R' rating after the year 2020?

In [None]:
filtered_data = data[(data['type'] == 'TV Show') & (data['rating'] == 'R') & (data['release_year'] > 2020)]
num_tv_shows_rated_r_after_2020 = len(filtered_data)
print("Number of TV shows with an 'R' rating released after 2020:", num_tv_shows_rated_r_after_2020)


# Q11. What is the maximum duration of a Movie/Show on Netflix?

In [None]:
data['duration'].unique()

In [None]:
data.duration.dtypes

The duration column has the object (string) dtype, so we have to convert its numeric part to a number before we can find the maximum duration.

In [None]:
data[['Minutes','Unit']] = data['duration'].str.split(' ',expand=True)
data.head(2)

In [None]:
data['Minutes']=pd.to_numeric(data['Minutes'])
data['Minutes'].max()

As an initial step, we converted the numeric part of the duration to a numeric data type, which allowed us to analyze content duration more effectively. By separating the duration into a number and a unit, we found that the longest content on Netflix has a duration of 312 minutes.
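One caveat with a single numeric column is that it can mix units. Assuming the duration column holds values such as "90 min" for movies and "2 Seasons" for TV shows, the maximum is more meaningful when taken per unit. A sketch on hypothetical values, using "str.extract" with named groups:

```python
import pandas as pd

# Hypothetical durations in both units, plus a missing entry
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "duration": ["90 min", "2 Seasons", "120 min", None],
})

# Capture the number and the unit word into separate columns
parts = df["duration"].str.extract(r"(?P<value>\d+)\s+(?P<unit>\w+)")
df["value"] = pd.to_numeric(parts["value"])
df["unit"] = parts["unit"]

# Take the maximum separately for minutes and for seasons
longest_movie = df.loc[df["unit"] == "min", "value"].max()
most_seasons = df.loc[df["unit"].str.startswith("Season", na=False), "value"].max()

print(longest_movie, most_seasons)
```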

# Q12. Which individual country has the Highest No. of TV Shows?

In [None]:
data_tvshow = data[data['type'] == 'TV Show']
data_tvshow.head()


In [None]:
data_tvshow.country.value_counts().head(1)

The value counts above show that the United States has the highest number of TV shows. This is consistent with the "Top 15 Countries on Netflix" graph, where the United States also occupies the first position.

# Q13. How can we sort the dataset by release_year?

In [None]:
data.sort_values(by='release_year',ascending=False).head(5)

# Q14. Find all the instances where: type is 'Movie' and listed_in is 'Dramas', or type is 'TV Show' and listed_in is 'Dramas'.

In [None]:
data[((data['type'] == 'Movie') & (data['listed_in'] == 'Dramas')) | ((data['type'] == 'TV Show') & (data['listed_in'] == 'Dramas'))]


# Conclusions and Insights

(1) The number of movies and TV shows on Netflix increased steadily until 2019, then decreased in 2020. This decline might be due to the COVID-19 pandemic, which likely affected the production and availability of new content during that year.

(2) The dataset consists of a total of 8,807 content items, with 6,131 movies and 2,676 TV shows. In every year except 2021, the dataset contains more movies than TV shows.

(3) The United States has the most content on Netflix, followed by India in second place and the United Kingdom in third place. In terms of quantity, the United States leads with over 3,000 content items, emphasizing its dominant position. In contrast, India, in second place, has 900+ content items, so there is a considerable gap between the two countries.

(4) Several Asian countries rank prominently, which may reflect the growing popularity of anime and K-drama among viewers.

(5) The most common rating on Netflix is "TV-MA". This indicates that a higher proportion of content is targeted toward mature audiences than toward younger viewers.

(6) We discovered that the longest content on Netflix has a duration of 312 minutes.

(7) The majority of null values in the dataset are found in the director column.

(8) The United States has the highest number of TV shows, consistent with its first position in the "Top 15 Countries on Netflix" graph.

