# <span style="color:red">Netflix Analysis</span>

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRCYsSjQNx8c9DC7x3Lqty9ng2hbd5M_5Ggnnu8a7Lml5FvEE8RVY24kuiRRz5BUy3JIGE&usqp=CA" align='center' style="width: 100px;"/>


This dataset contains information about the TV shows and movies available on Netflix as of 2019. The goal of this analysis is to identify trends in the dataset using the fields it provides.

The steps below walk through the data analysis for this project.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


df = pd.read_csv('/kaggle/input/exploratory-data-analysis-on-netflix-data/netflix_titles_2021.csv')


In [None]:
# View of the information in the dataset
print("=== Dataset Info ===")
df.info()

print("\n=== Dataset Shape ===")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n=== Null Value Summary ===")
print(df.isnull().sum())

## <span style="color:red">Top 10 Countries with the Most Content</span>

In [None]:
# Exclude null values for 'country' and calculate count and percentage of content by country
country_counts = df[df['country'].notna()]['country'].value_counts(normalize=False).reset_index()

# Rename columns and calculate the percentage
country_counts.columns = ['country', 'count']
country_counts['percentage'] = round((country_counts['count'] / len(df)) * 100, 2).astype(str) + '%'

# Display top 10 countries by content count
country_counts.sort_values(by='count', ascending=False).head(10)


In [None]:

# Plot the top 10 countries by content count
top_countries = country_counts.head(10)

plt.figure(figsize=(20, 3))

# Draw the horizontal bars once and label each bar with its percentage
barplot = plt.barh(top_countries['country'], top_countries['count'], color='red')
plt.bar_label(barplot, labels=top_countries['percentage'], fontsize=12)

# Add axis labels and title
plt.xlabel('Count', fontsize=16)
plt.ylabel('Country', fontsize=18)
plt.title('Top 10 Countries by Content Availability', fontsize=14)

# Put the largest bar at the top and show the plot
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
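
The counts above treat each distinct value of `country` as one category. If the column holds comma-separated lists for co-productions (an assumption about this dataset, not something the cells above verify), a title listed as "United States, India" is counted under neither country alone. A minimal sketch of one way to count each listed country separately, using a made-up `sample` frame rather than the real data:

```python
import pandas as pd

# Toy stand-in for the real data; titles and countries are invented for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States", "United States, India", "India", None],
})

# Split comma-separated lists and give each listed country its own row.
exploded = (
    sample.dropna(subset=["country"])
    .assign(country=lambda d: d["country"].str.split(","))
    .explode("country", ignore_index=True)
)

# Trim the spaces left over from the split and drop empty fragments.
exploded["country"] = exploded["country"].str.strip()
per_country = exploded.loc[exploded["country"] != "", "country"].value_counts()
print(per_country)
```

With this approach a co-production contributes one count to every country it lists, so the per-country totals can add up to more than the number of titles.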

## <span style="color:red">Top 10 Countries by Type of Content</span>

In [None]:
# Drop rows missing 'duration' or 'type', then group by 'country' and 'type' to get the content count
df_cleaned = df.dropna(subset=['duration', 'type']).copy()
country_type_counts = (
    df_cleaned
    .groupby(['country', 'type'])
    .size()
    .reset_index(name='count')  # name the count column
)
# Percentage of all titles in the dataset (not of each country's own titles)
country_type_counts['percentage'] = round((country_type_counts['count'] / len(df)) * 100, 2).astype(str) + '%'

# Sort the results in descending order by count
country_type_counts = country_type_counts.sort_values(by='count', ascending=False)
country_type_counts.head(10)


In [None]:
# Top ten country/type combinations by content count
top_countries_type = country_type_counts.head(10)

# Create a horizontal bar plot (count on the x-axis, country on the y-axis)
plt.figure(figsize=(10, 6))
barplot = sns.barplot(
    data=top_countries_type,
    x='count',
    y='country',
    hue='type',  # Differentiate by type (Movie or TV Show)
    palette='Reds'
)


# Add labels and title
plt.xlabel('Content Count', fontsize=14)
plt.ylabel('Country', fontsize=14)
plt.title('Top 10 Countries by Content Count and Type', fontsize=16)


# Display the plot
plt.tight_layout()
plt.show()
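
The `percentage` column above is each country/type count divided by the total number of titles. To see how a single country's catalogue splits between movies and TV shows, the counts can instead be normalized within each country. A minimal sketch with `pd.crosstab`, again on an invented `sample` frame:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
sample = pd.DataFrame({
    "country": ["United States"] * 3 + ["India"] * 2,
    "type": ["Movie", "Movie", "TV Show", "Movie", "Movie"],
})

# normalize="index" makes every row (country) sum to 1; scale to percentages.
share = (
    pd.crosstab(sample["country"], sample["type"], normalize="index")
    .mul(100)
    .round(2)
)
print(share)
```

Each row of `share` sums to 100, so the values read directly as "percent of this country's titles that are movies or TV shows".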

## <span style="color:red">Most Frequent Rating and Average Duration</span>

In [None]:
# Drop rows with null values in 'duration' or 'type' (copy so the new column is added to an independent frame)
df_cleaned = df_cleaned.dropna(subset=['duration', 'type']).copy()

# Extract the numeric part of 'duration': minutes for movies, seasons for TV shows
def extract_numeric_duration(row):
    duration = str(row['duration']).strip()
    if row['type'] == "Movie" and 'min' in duration:
        # Extract minutes for movies, e.g. "90 min" -> 90
        return pd.to_numeric(duration.replace('min', '').strip(), errors='coerce')
    elif row['type'] == "TV Show" and 'season' in duration.lower():
        # Extract seasons for TV shows; taking the leading token handles both "1 Season" and "2 Seasons"
        return pd.to_numeric(duration.split()[0], errors='coerce')
    return None  # Return None for non-matching cases

# Apply the extraction function and create the new 'duration_num' column
df_cleaned['duration_num'] = df_cleaned.apply(extract_numeric_duration, axis=1)

# Calculate and round the average duration by type
average_duration_by_type = df_cleaned.groupby('type')['duration_num'].mean().round(2).reset_index()
average_duration_by_type
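
The row-wise `apply` above can also be written as a single vectorized step. The sketch below assumes every `duration` value starts with its number (as in "90 min" or "3 Seasons") and uses an invented `sample` frame; it is an alternative illustration, not the notebook's own method:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "1 Season", "3 Seasons", None],
})

# Pull out the first run of digits; missing or non-matching values become NaN.
sample["duration_num"] = pd.to_numeric(
    sample["duration"].str.extract(r"(\d+)", expand=False),
    errors="coerce",
)

# Mean duration per type: minutes for movies, seasons for TV shows (NaN is skipped).
avg = sample.groupby("type")["duration_num"].mean()
print(avg)
```

Because the two types use different units, the averages are only meaningful when read per type, as in the table above.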


In [None]:
# Count the content by type and rating
most_frequent_rating_by_type = (
    df_cleaned.groupby(['type', 'rating'])
    .size()
    .reset_index(name='count')
)

# Find the most frequent rating for each type
most_frequent_rating_by_type = (
    most_frequent_rating_by_type.loc[
        most_frequent_rating_by_type.groupby('type')['count'].idxmax()
    ][['type', 'rating', 'count']]
)

# Merge both tables (average duration and most frequent rating) into one table
final_table = pd.merge(average_duration_by_type, most_frequent_rating_by_type, on='type')
final_table
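
The most frequent rating per type can also be obtained directly with `Series.mode`, without building the intermediate count table. A minimal sketch on an invented `sample` frame; when several ratings tie, `mode()` returns them sorted and this sketch simply keeps the first:

```python
import pandas as pd

# Toy data; the values are made up for illustration.
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "rating": ["TV-MA", "TV-MA", "R", "TV-14", "TV-MA", "TV-MA"],
})

# For each type, take the most common rating (first one if there is a tie).
top_rating = sample.groupby("type")["rating"].agg(lambda s: s.mode().iat[0])
print(top_rating)
```

Unlike the `idxmax` version above, this does not return the count alongside the rating, so it fits best when only the label is needed.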

## <span style="color:red">Conclusion of the Analysis</span>

* The countries with the most content on Netflix up to 2019 are the US and India.
* US-produced movies account for **23.33%** of all titles in the dataset, and the US also contributes TV shows, while India appears in the top 10 only with movies.
* For both content types (movies and TV shows), the most common rating is TV-MA, which is intended for mature audiences, so we can assume Netflix has focused mainly on adult viewers.
* Finally, the average duration is about 100 minutes for movies and 1 season for TV shows (one possible reason is that Netflix focuses on mini-series, or that most of the available series do not attract enough audience to be renewed for a second season).

After analysing the data, we can conclude that up to 2019 most movies on Netflix were produced in the US, India and the UK, which suggests that Netflix has focused on releasing English-language content. We can also see that the majority of the content on the platform is targeted at viewers aged 18 and over.
We can assume that Netflix has focused on this segment because it is the most attractive to its audience, helping the platform retain existing users while also attracting new ones, which can drive revenue growth.