# About the database
## Introduction
This database contains information about Netflix movies and series. It has 12 columns, some of which contain null values. The analysis uses the Pandas, Matplotlib and Seaborn libraries to clean, process and visualize the data.

Data:

show_id - Unique content identifier

type - Whether it's a movie or TV show

title - Title

director - Director

cast - Cast

country - Country where it was produced

date_added - Date it was added to the Netflix catalog

release_year - Year of release

rating - Classification

duration - Total duration - in minutes or in number of seasons.
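
Since the introduction mentions null values, a quick per-column count shows where they are concentrated. The sketch below is only an illustration: the `sample` frame and its values are hypothetical stand-ins for the real file, which the notebook loads later.

```python
import pandas as pd

# Hypothetical sample rows (not taken from the real dataset).
sample = pd.DataFrame(
    {
        "show_id": ["s1", "s2", "s3"],
        "type": ["Movie", "TV Show", "Movie"],
        "director": ["Director A", None, None],
        "country": ["Country X", "Country Y", None],
    }
)

# Count the missing values in each column, largest first.
null_counts = sample.isna().sum().sort_values(ascending=False)
print(null_counts)
```

Running the same two calls on the full dataframe would show which columns need care before any analysis.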

# Exploratory Data Analysis - EDA

Possible analyses:

oldest and most recent movies in the catalog.

number of movies released per year.

proportion of movies and series in the catalog.

length of movies.

movie directors with the most productions on Netflix.

longest movies.

series with the most seasons on Netflix.

countries that produce the most content on Netflix.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set_theme(style="darkgrid", palette='Set2') # applying style and color palette

In [None]:
df = pd.read_csv("netflix_movies.csv")

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.nunique() # checking for unique values

In [None]:
# Release year range
# earliest release year in the column
starting_year = df["release_year"].min()

# latest release year in the column
final_year = df["release_year"].max()

print(f"This dataset covers titles released from {starting_year} to {final_year}.")

## Exploratory analysis and data visualization

### Number of movies released per year

In [None]:
# counting how many titles were released in each year
# with pandas 2.0 or later, reset_index() yields the columns "release_year" and "count"
release_year = df["release_year"].value_counts().reset_index()
release_year.head()

In [None]:
# creating a scatterplot of the number of titles released each year
fig, ax = plt.subplots(figsize=(12,4))

sns.scatterplot(x=release_year["release_year"], y=release_year["count"], ax=ax)
for spine in ax.spines.values(): # removing the spines, the chart frame
    spine.set_visible(False)

plt.xlabel("Year of release")
plt.ylabel("Number of titles")

plt.title("Number of movies released per year", fontsize=14, weight='bold', color="dimgray");

The Netflix catalog features a predominance of 21st century movies, especially productions released from 2010 onwards.
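
One way to put a number on this observation is to compute the share of titles released in or after 2010. This is a minimal sketch on made-up years; the `years` series is hypothetical and stands in for `df["release_year"]`.

```python
import pandas as pd

# Illustrative release years only (not real catalog data).
years = pd.Series([1995, 2008, 2010, 2015, 2018, 2019, 2020, 2021], name="release_year")

# The mean of a boolean mask is the fraction of True values.
share_since_2010 = (years >= 2010).mean()
print(f"Share of titles released from 2010 onwards: {share_since_2010:.1%}")
```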

### Proportion of movies and series in the catalog

In [None]:
movies_and_tv_prog = df["type"].value_counts().reset_index()
movies_and_tv_prog

In [None]:
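# looking the counts up by label instead of by position, so the result does not depend on row order
counts_by_type = movies_and_tv_prog.set_index("type")["count"]
movie_counts = counts_by_type["Movie"]
tv_show_counts = counts_by_type["TV Show"]

print(f"Number of movies: {movie_counts}\nNumber of TV shows: {tv_show_counts}")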

In [None]:
# creating a pie chart with the share of each category in the Netflix catalog
plt.pie([movie_counts, tv_show_counts], labels=["Movie", "TV Show"], shadow=True, explode=[0.25,-0.18], autopct="%.1f%%")

plt.title("Occupancy percentage in the catalog", fontsize=12, weight='bold', color='dimgrey')

plt.legend(loc="lower left", bbox_to_anchor=(-0.02,-0.01))

plt.show()

The Netflix catalog is dominated by movies, which account for 69.6% of the titles, while TV shows make up the remaining 30.4%.
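
The percentages can also be computed directly rather than read off the chart. The sketch below uses a small hypothetical `types` series in place of `df["type"]`, so its numbers are illustrative and do not reproduce the figures above.

```python
import pandas as pd

# Hypothetical catalog with 7 movies and 3 TV shows.
types = pd.Series(["Movie"] * 7 + ["TV Show"] * 3, name="type")

# normalize=True returns fractions instead of raw counts.
shares = types.value_counts(normalize=True).mul(100).round(1)
print(shares)
```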

### Duration of movies

In [None]:
# taking the duration of the movies, removing the word "min" and converting the values to a numeric type
movies_duration = df.loc[df["type"] == "Movie", "duration"].str.replace("min", "", regex=False).astype("float").reset_index(drop=True)
movies_duration.head()

In [None]:
import warnings

warnings.filterwarnings('ignore')

fig, ax = plt.subplots(figsize=(8,4))

h = sns.histplot(movies_duration, bins=15, ax=ax)
for container in h.containers:
    ax.bar_label(container) # adding the count above each bin
    
plt.ylabel('Count')
plt.xlabel('Duration (in minutes)')

plt.yticks([])

plt.grid(False)

plt.title('Duration of movies', fontsize=14, weight='bold', color='dimgrey');

Most movies on Netflix run for around 85-100 minutes.
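
A numeric summary can back up what the histogram suggests. The sketch below is an alternative to the string replacement used above: it extracts the digits with a regular expression and tolerates missing values. The `durations` series is hypothetical and its values are illustrative.

```python
import pandas as pd

# Hypothetical duration strings, including one missing value.
durations = pd.Series(["90 min", "104 min", None, "85 min"], name="duration")

# Pull out the leading number and convert it; missing entries stay missing.
minutes = pd.to_numeric(durations.str.extract(r"(\d+)", expand=False))
valid = minutes.dropna()

median_minutes = valid.median()
in_range = valid.between(85, 100).sum()  # both bounds are inclusive
print(f"Median: {median_minutes} min, titles between 85 and 100 min: {in_range}")
```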

### Movie directors with the most productions on Netflix

In [None]:
# finding the 10 directors with the most titles in the catalog
director_movies = df['director'].value_counts()[:10].reset_index()
director_movies.head()

In [None]:
# getting colors from the current color palette
color_palette = sns.color_palette()

fig, ax = plt.subplots(figsize=(14,6.5))

b = sns.barplot(data=director_movies, x='director', y='count', color=color_palette[0])
for container in b.containers:
    ax.bar_label(container, fontsize=11) # adding the value of each bar as a label
b.set_ylabel('')

plt.yticks([])
plt.xlabel('')

plt.tick_params(axis='x', rotation=60)

plt.title('Directors with the highest number of productions on Netflix', weight='bold', fontsize=14, color='dimgrey');

Rajiv Chilaka is the director with the highest number of productions on Netflix, which may indicate that the public appreciates the content he produces.

### Longest-running movies

In [None]:
# separating the title and duration of the movies (copying to avoid modifying a view of df)
longer_running_movies = df.loc[df['type'] == 'Movie', ['title', 'duration']].copy()

# removing the word 'min' from the string and converting the duration column to float
longer_running_movies['duration'] = longer_running_movies['duration'].str.replace('min', '', regex=False).astype('float')

# replacing the spaces in each title with line breaks so that long titles fit under the bars
longer_running_movies['title'] = [title.replace(' ', '\n') for title in longer_running_movies['title']]

# sorting the dataframe by duration, resetting the index and keeping the 10 longest movies
longer_running_movies = longer_running_movies.sort_values('duration', ascending=False).reset_index(drop=True)[:10]

longer_running_movies.head()

In [None]:
# viewing the 10 longest running movies on Netflix
fig, ax = plt.subplots(figsize=(14, 6.5))

b = sns.barplot(data=longer_running_movies, x='title', y='duration', color=color_palette[0])
for container in b.containers:
    ax.bar_label(container, fontsize=11) # adding the duration above each bar
b.set_ylabel('')
b.set_xlabel('')

plt.yticks([])

plt.title('Longest running movies (in minutes) on Netflix', fontsize=14, weight='bold', color='dimgrey');

The longest movie available on Netflix is Black Mirror: Bandersnatch, with a total running time of 312 minutes, which is equivalent to 5 hours and 12 minutes.
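
The conversion from minutes to hours and minutes is a single integer division with remainder. The helper name `minutes_to_hours_and_minutes` below is hypothetical and is not part of the original notebook.

```python
def minutes_to_hours_and_minutes(total_minutes):
    """Split a duration in minutes into whole hours and leftover minutes."""
    hours, minutes = divmod(total_minutes, 60)
    return hours, minutes


hours, minutes = minutes_to_hours_and_minutes(312)
print(f"{hours} hours and {minutes} minutes")  # prints "5 hours and 12 minutes"
```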

### TV shows with the most seasons on Netflix

In [None]:
# selecting the rows of type TV Show and keeping only the title and duration columns
# note that the duration of a series is given in seasons
tv_show_seasons = df.loc[df['type'] == 'TV Show', ['title', 'duration']].copy()
tv_show_seasons.head()

In [None]:
# removing the word Season(s) from the strings and converting the remaining value to an integer
tv_show_seasons['duration'] = tv_show_seasons['duration'].str.replace('Seasons', '', regex=False).str.replace('Season', '', regex=False).astype('int')

# turning title spaces into line breaks
tv_show_seasons['title'] = [title.replace(' ', '\n') for title in tv_show_seasons['title']]

# sorting the dataframe by the duration column and keeping the 10 shows with the most seasons
tv_show_seasons = tv_show_seasons.sort_values('duration', ascending=False).reset_index(drop=True)[:10]

tv_show_seasons.head()

In [None]:
fig, ax = plt.subplots(figsize=(12, 5.5))

b = sns.barplot(data=tv_show_seasons, x='title', y='duration', color=color_palette[0])
for container in b.containers:
    ax.bar_label(container, padding=-1, fontsize=12) # adding the number of seasons above each bar
b.set_ylabel('')
b.set_xlabel('')

plt.yticks([])

plt.title('TV shows with the most seasons on Netflix', fontsize=16, weight='bold', color='dimgrey');

Longer-running TV shows may indicate a trend in the genres preferred by the public.
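
The season count can also be parsed in one step with a regular expression, which handles the singular and plural forms alike. This is a sketch on hypothetical strings that follow the "N Season(s)" pattern described above. It is not the notebook's own method, which chains two replacements.

```python
import pandas as pd

# Hypothetical duration strings for TV shows.
season_strings = pd.Series(["1 Season", "2 Seasons", "17 Seasons"], name="duration")

# Capture the leading digits and convert them to numbers.
seasons = pd.to_numeric(season_strings.str.extract(r"(\d+)", expand=False))
print(seasons.tolist())
```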

### Countries that produce the most content on Netflix

In [None]:
# counting the titles for each value of the country column and keeping the 10 most frequent
launchy_country = df['country'].value_counts().reset_index()[:10]
launchy_country.head()

In [None]:
fig, ax = plt.subplots(figsize=(10, 4))

b = sns.barplot(data=launchy_country, x='country', y='count', color=color_palette[0])

for container in b.containers:
    ax.bar_label(container, padding=-1, fontsize=10) # adding the number of titles above each bar
b.set_xlabel('')
b.set_ylabel('')

plt.yticks([])
plt.tick_params(axis='x', rotation=60)

plt.title('Countries that produce the most content on Netflix', fontsize=16, weight='bold', color='dimgrey');

The United States is the country that produces the most content for Netflix.
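
A caveat about this count: `value_counts()` treats each distinct string in `country` as one category. If co-productions are stored as comma-separated lists (an assumption about the file, not something the analysis above confirms), a title made in two countries would be counted under the combined string rather than under each country. The sketch below shows one way to count per individual country. The `countries` series is hypothetical.

```python
import pandas as pd

# Hypothetical country values, including co-productions and a missing entry.
countries = pd.Series(
    ["United States", "United States, India", "India, United Kingdom", None],
    name="country",
)

# Split each entry on commas, put every country on its own row, trim spaces, then count.
per_country = (
    countries.dropna()
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(per_country)
```

With this approach a co-production adds one to the total of every country involved, so the counts sum to more than the number of titles.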